6  Alternative Analysis

This chapter addresses some of the weaknesses in the data set, comparing each correction to the baseline computation used by Subcreation. Each subsection addresses a specific shortcoming.

6.1 Using the full sample

One thing I don’t particularly like about the data analysis is the restriction to the top 100 runs. It ignores the more difficult dungeons, which we collected via stratified sampling. We can ask what happens if we just compute a confidence interval on the full sample, in the traditional way. This avoids all the broken computations which blend data from the top 100 and the full sample.
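
The code below computes the textbook t-based interval for each specialization, \(\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}\), where \(\bar{x}\) and \(s\) are that specialization’s sample mean and standard deviation and \(n\) is its number of runs.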

healers_full_sample <- healers |>
  group_by(class_spec) |>
  dplyr::reframe(
    class = class,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "full-sample")

healers_full_sample |>
  dplyr::select(class_spec, n, score_mean, score_sd, conf.low, conf.high) |>
  kbl() |> kable_styling()
class_spec n score_mean score_sd conf.low conf.high
druid_restoration 211 390.8502 35.25540 386.0657 395.6348
monk_mistweaver 196 386.6638 36.67898 381.4967 391.8308
shaman_restoration 3756 380.4083 35.01109 379.2882 381.5283
priest_discipline 1631 392.6421 33.89883 390.9957 394.2884
evoker_preservation 178 388.6713 36.44791 383.2801 394.0626
paladin_holy 252 379.0413 36.15706 374.5555 383.5271
priest_holy 91 376.9330 28.47857 371.0020 382.8639

The way we’ve set up the data frames with an analysis tag lets us compare the two ways of computing. Here we use colors to differentiate between Subcreation’s method and the full-sample method. Note how the popular specs are being reeled in. This is to be expected: when we don’t pick the top 100 for those specializations, we are no longer selecting the best runs from a much larger pool.

Some players are likely to be better than the average of our selected runs, and some are likely to be worse. Subcreation just picks all the lucky dice throws and ignores all the unlucky ones. I’m arguing this is bad practice, because the popular specs get far more dice throws than the less popular specs.
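
To see why this matters, here is a minimal simulation sketch with made-up numbers (not our data): two specs with identical score distributions, one with 4000 runs and one with 150. Picking the top 100 of each rewards the spec that got more throws of the same dice.

set.seed(42)
popular   <- rnorm(4000, mean = 380, sd = 30)  # many runs
unpopular <- rnorm(150,  mean = 380, sd = 30)  # few runs
# Mean of the "top 100" runs: the popular spec looks stronger purely
# because it had more draws from the same distribution.
mean(sort(popular,   decreasing = TRUE)[1:100])
mean(sort(unpopular, decreasing = TRUE)[1:100])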

Code
healers_subc_base <- healers_subc_base |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_full_sample),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    geom_hline(yintercept=clusters$centers, lty="dashed", color="white"  ) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Top 100 vs full sample",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.1: Comparison of the original Subcreation model (top 100 runs) with the full sample, where all runs are included. The intent is to show that the choice of model matters: how you decide to handle your data can have a large effect on the outcome, and on the conclusions people draw from it.

6.2 Regional Effects

The choice of stratifying by region ensures we are getting a share of runs from every region. But because we are also purposively sampling the top players, it is important to ask if there are any regional differences. We can do that via a quick violin-plot over all the data, but split by region.

Violin plots are in the family of density plots. In statistics, a density (probability density function, PDF) tells us the relative likelihood of drawing a value from the distribution. Where the distribution has mass, there is a larger likelihood of getting a value. In the violin plot, we can see that scores cluster around certain threshold values. These correspond to key-level increases. The variance within a given key level comes from the clock and how much time was left on it. Between key levels there are no observations at all, so the density violin becomes very thin.
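
A quick way to see these gaps directly (a sketch, reusing the specs data frame from earlier) is to sort the distinct scores and look at the largest jumps between consecutive values:

specs |>
  distinct(score) |>
  arrange(score) |>
  mutate(gap = score - lag(score)) |>  # difference to the previous score
  slice_max(gap, n = 5)                # the five widest gaps
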
Code
ggplot(specs, aes(x = region.str, y = score, fill=region)) +
  geom_violin(color="lightgray") +
  labs(title = "Score distributions per Region",
       subtitle = "Points are count-grouped to prevent overplotting",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Score") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))
Figure 6.2: Score distributions split by region show regional differences.

We see there are considerable regional differences. Data from Asia are likely to have an effect on our results. In particular, we aren’t measuring the top players: it is very likely that if we just pulled more data from the Europe and Americas regions, we would get a truer view of the top players. The data set we are working with has the following regional counts:

Code
specs |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()
region.str count
Americas 7921
Asia (South Korea) 7858
Asia (Taiwan) 7889
Europe 7894
Figure 6.3: Table counting the number of runs per region in the base data set.

But we also have an alternative data set, where we find the world-wide top dungeon runs for the week. This data set is called specs_topk, and its counts differ:

It is worth stressing that the regional difference is largely due to population counts. Regions with a lot of players can push key levels quicker because more players are running those keys, which makes it far easier to form groups. Also, because we are sampling among the top, a larger population in a region means we’ll see more highly skilled players from that region, even if we assume no skill difference between regions.
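
As a rough back-of-the-envelope illustration of the population effect: for \(n\) independent draws from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the expected maximum is approximately \(\mu + \sigma\sqrt{2\ln n}\). A region with several times the players therefore shows a higher top end, even when the per-player skill distribution is exactly the same.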
Code
specs_topk |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()
region.str count
Americas 12548
Asia (South Korea) 6185
Asia (Taiwan) 6854
Europe 14298

Counting top-k players, ignoring the region. These are the true top players world-wide.

Code
ggplot(specs_topk, aes(x = fct_infreq(region.str), fill=region)) +
  geom_bar() +
  labs(title = "Region counts for the top-k dungeon runs",
       subtitle = "Regions differ in the highest scoring runs",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Count") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))

Top-k players grouped by region. Regions with large player bases have far more runs among the top-k players.

This confirms our hypothesis: the data we are using aren’t representative of the top players.

Like Subcreation, we will compute a K-means for the top-k dataset:

Code
clusters_topk <- Ckmeans.1d.dp(specs_topk$score, 6)
clusters_topk$centers
[1] 302.4704 369.1015 383.9502 397.9609 412.7586 428.8449

It might be good to compare the new cluster computation to the older one.

Code
clusters_centers <- tibble(
  centers = clusters$centers,
  source = "regional")
clusters_topk_centers <- tibble(
  centers = clusters_topk$centers,
  source = "topk")

centers <- union_all(clusters_centers, clusters_topk_centers)
Code
ggplot(centers, aes(x = source,
                    y = centers,
                    color = source)) +
  geom_point(size=5, alpha=.8) +
  coord_flip() +
  labs(title = "Ckmeans cluster centers",
       subtitle = "Regionally stratified vs TopK data",
       caption = "Source: https://raider.io",
       x = "Dataset",
       y = "Cluster Centers") +
  scale_color_material_d(palette = "light")
Figure 6.4: Plot comparing clusters computed on the original Subcreation sampling scheme and a top-k sampling scheme. This shows how the clusters move when we change the underlying data set.

We can repeat the same computation Subcreation uses, but on this new data set.

Code
healers_subc_topk <- specs_topk |> filter(role == "healer") |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_topk <- healers_subc_topk |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    # As in Subcreation's computation: the whole-spec SD paired with the top-100 n
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "topk")

Running the comparison shows a major change among the unpopular specs. Remember, in Subcreation’s computation you pick among the best 100 runs for each healer spec, but for the unpopular specs there aren’t even 100 runs, so you pick all of them. If you stratify over region in your data collection, you will get more players from the Asia regions, which run lower key levels in general, skewing the results.

Code
healers_subc_topk <- healers_subc_topk |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_topk),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Regionally Stratified vs TopK data",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.5: Comparison of the Subcreation base computation with a data set where the top players are picked world-wide rather than stratified per region. The plot shows how computing on the true top-k players alters the outcome of the analysis.

6.3 Controlling for repeated measures

Statistical work often rests on assumptions which have to be fulfilled for the math to work out as expected. Violating the assumptions voids the correctness of the statistic. One such assumption is independence: each observation is independent of the other observations. If you have 7 Protection Paladin runs in Shadowmoon Burial Grounds, the assumption is that those runs were done by 7 different Paladins. If all 7 runs were done by the same player, it would be wrong to count them as 7 independent samples.

We can analyze the data set for repetitions by grouping on the unique Id of players.

Code
healers |>
 group_by(id) |>
 summarize(count = n(), name = name, class_spec = class_spec) |>
 distinct() |>
 ungroup() |>
 slice_max(n = 12, count) |>
 kbl() |>
 kable_styling()
`summarise()` has grouped output by 'id'. You can override using the `.groups`
argument.
id count name class_spec
57283101 89 黑嗆 paladin_holy
196677259 71 Growlp priest_discipline
224076162 65 Magma shaman_restoration
107425432 59 Vickman druid_restoration
110391665 57 Gregxo priest_discipline
119657994 55 Cryve evoker_preservation
150827206 51 你又沒有 priest_discipline
19921201 47 樂町鶴 shaman_restoration
138690359 45 艾艾法 priest_discipline
227091143 44 Héléna shaman_restoration
226113716 42 Gregoryxo shaman_restoration
226033514 41 Ihatersham shaman_restoration

How do they distribute? We can pick all healers who have played more than a single dungeon run for the given week.

Code
ggplot(
  healers |>
   group_by(id) |>
   reframe(count = n(), name = name, class_spec = class_spec) |>
   filter(count > 1) |>
   arrange(desc(count)) |>
   distinct(),
  aes(x = count)) +
  geom_histogram(binwidth=1) +
  labs(title = "Repeated runs histogram",
       subtitle = "The dataset contains repeated measures",
       caption = "Source: https://raider.io",
       x = "Repeated Runs by same player",
       y = "Count")

So a lot of runs repeat, and it is very likely to affect our results. This is something you have to control for when handling the data. As an example, we can look at specialization popularity while controlling for repetitions in the data set:

Code
specs |>
  group_by(class, class_spec, id) |>
  summarize(id_count = n()) |>
  summarize(count = n(), .groups = "keep") |>
  inner_join(spec_names, by = join_by(class_spec == key)) |>
  ggdotchart(
    x = "class_spec.str",
    y = "count",
    color = "class",
    rotate = TRUE,
    sorting = "descending",
    add = "segments",
    dot.size = 7,
    ggtheme = theme_zugzug()) +
  scale_color_manual(values = wow_colors) +
  labs(title = "Specialization Popularity (no repetitions)",
       subtitle = "4 Regions, 8 Dungeons from each region, top-k runs",
       caption = "Source: https://raider.io",
       x = "Specialization",
       y = "Count")
`summarise()` has grouped output by 'class', 'class_spec'. You can override
using the `.groups` argument.
Figure 6.6: Counting specializations with player repetitions removed. This avoids the pseudoreplication present when you naively count specializations, as we did earlier.

Compared to the uncorrected plot, it’s fairly obvious that active players have an effect on the outcome. Active players do more runs, which inflates the apparent number of top-level players on a given specialization. The plot in Figure 6.6 is a more accurate way of looking at popularity among specializations.

We want to keep individual dungeon runs separated. That is, if a player has run different dungeons, we assume those runs are independent. Because the same player might have run the same dungeon at several different key levels and with different times remaining, we collapse those repeats into a single mean score per player per dungeon. A more advanced model would take the dungeon variability into account as well, but this will do for now.

Code
# Average each player's repeated runs of the same dungeon into one score
healers_unique <- healers |>
  dplyr::group_by(dungeon, id) |>
  reframe(
    class_spec = class_spec,
    dungeon = dungeon,
    class = class,
    name = name,
    mean_score = mean(score)) |>
  mutate(score = mean_score) |>
  distinct() |>
  ungroup()

We can now do the same analysis as we did for Subcreation, but on unique healers.

Code
healers_subc_unique <- healers_unique |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_unique <- healers_subc_unique |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct()

healers_subc_unique <- healers_subc_unique |> mutate(analysis = "unique")

We see that yet another correction influences the reported mean values. The general trend is an adjustment downwards, because we don’t let a single healer influence the numbers too much. Some of the more popular specs are heavily influenced by single players who play the spec a lot, which means the skill of a few individuals has a large effect on the perceived power of the specialization.

Code
healers_subc_unique <- healers_subc_unique |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_unique),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Subcreation vs Controlling for repeated measures",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

6.4 Bootstrapped Confidence Intervals

A core assumption of Subcreation is that the data follow a normal distribution. If the data aren’t normally distributed, that assumption is violated and the resulting statistics are on shaky ground.

Code
ggdensity(healers, x = "score", fill = "grey50", ggtheme = theme_zugzug()) +
  labs(title = "Healer score density plot",
       subtitle = "Scores spike around key levels",
       caption = "Source: https://raider.io",
       x = "Score",
       y = "Density")

Kernel density plot of the healer scores.

We know this data set isn’t normally distributed. Mythic+ scores are artificially generated by combining key level, affixes, and time remaining in the dungeon. Since the base scores for the key levels have gaps between them, our data set has gaps as well: certain scores between key levels are unobtainable, which violates the assumption that the data is normally distributed. We can check this with a Shapiro-Wilk test. The test sets up the hypothesis that the data is normally distributed and then asks: “Assuming the data are normally distributed, how likely is the data we have?”

If this test comes out with \(p < 0.05\), data like ours would occur at random less than 5% of the time under the assumption of normality.

Code
healers |>
  slice_max(score, n = 4000) |>
  summarize(shapiro.p = tidy(shapiro.test(score))$p.value) |>
  kbl(digits = 3) |> kable_styling()
shapiro.p
0

The value is practically 0, so the data fail the test.

A way around this conundrum is to use bootstrapping. We can let a computer grind at the problem by resampling our sample and deriving a 95% confidence interval from the resamples1. A bootstrapped approximation puts us on saner statistical ground than the current computation.

1 This is due to the resampling forming a sampling distribution, which means we can invoke the Central Limit Theorem.

Code
confidence_interval <- function(data) {
  mean <- mean(data$score)
  sd <- sd(data$score)
  n <- length(data$score)
  error <- qt(.975, n-1) * sd / sqrt(n)
  conf.low <- mean - error
  conf.high <- mean + error

  c(conf.low, conf.high)
}

confidence_interval(healers |> slice_max(n = 100, order_by = score))
[1] 431.7186 433.9210

Bootstrapping works by resampling the sampled scores with replacement2. We create a “new” sample of 100 healers via this resampling and take its mean. This process repeats. If we keep creating “new” samples this way 5000 times, we have 5000 means. These form a sampling distribution, on which we can compute the 95% confidence interval.

2 Replacement is a fancy way of saying we can pick the same sample more than once.
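
Written out by hand, the procedure is just a loop. A minimal sketch in base R (a percentile-style interval, rather than the boot package used below):

scores <- healers |> slice_max(n = 100, order_by = score) |> pull(score)
# Resample the 100 scores with replacement 5000 times, taking the mean each time
boot_means <- replicate(5000, mean(sample(scores, replace = TRUE)))
# 95% interval from the resampled means
quantile(boot_means, c(0.025, 0.975))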

Say we do that. We get the following result:

Code
boot_fn <- function(data, indices) {
  d <- data[indices, ]

  mean(d$score)
}

bootres <- boot(
  healers |> slice_max(n = 100, order_by = score),
  boot_fn, R=5000)
tidy(bootres, conf.int = TRUE) |> kbl(digits=3) |> kable_styling()
statistic bias std.error conf.low conf.high
432.82 -0.028 0.543 431.778 433.907

We can also compute it on a per-spec basis:

Code
run_boot <- function(d) {
  bootres <- boot(d, boot_fn, R=5000)

  bootres
}

healers_boot_ci <- healers |>
  group_by(class_spec) |>
  slice_max(n = 100, order_by = score) |>
  nest() |>
  mutate(bootres = map(data, run_boot)) |>
  mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
  unnest(tidy) |>
  mutate(data = NULL, bootres = NULL)

healers_boot_ci |> kbl(digits = 3) |> kable_styling()
class_spec statistic bias std.error conf.low conf.high
druid_restoration 414.843 0.007 0.957 412.992 416.738
monk_mistweaver 411.692 0.004 0.815 410.073 413.277
shaman_restoration 424.203 0.002 0.469 423.257 425.118
priest_discipline 432.188 -0.008 0.577 431.100 433.339
evoker_preservation 411.050 -0.019 0.753 409.570 412.497
paladin_holy 406.708 0.028 0.995 404.849 408.668
priest_holy 376.933 0.063 2.972 370.802 382.493

Let us compare the bootstrap to the original Subcreation computation. We plot it, since a plot is a better way of understanding the differences than a table.

Code
healers_boot_ci <- healers_boot_ci |>
  rename(
    conf.low = conf.low,
    conf.high = conf.high,
    score_mean = statistic) |>
  mutate(analysis = "bootstrapped")
Code
union_all(
  healers_subc_base |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis),
  healers_boot_ci |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Bootstrapped",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.7: Bootstrapped confidence intervals compared to the Subcreation base computation.

One advantage of this approach is that it is more robust against changes in the underlying data. We also see that the confidence intervals derived via bootstrapping tend to be smaller. If we want to analyze whether specs are different, smaller confidence intervals are better because there will be less overlap between the intervals. Usually this lets us tease out smaller effects from the data.
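
A quick way to quantify this (a sketch, reusing the two data frames built above) is to compare the mean interval widths:

union_all(
  healers_subc_base |> dplyr::select(class_spec, conf.low, conf.high, analysis),
  healers_boot_ci |> dplyr::select(class_spec, conf.low, conf.high, analysis)) |>
  mutate(width = conf.high - conf.low) |>
  group_by(analysis) |>
  summarize(mean_width = mean(width))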

6.5 Combining all the Alternatives

We have a number of alternative suggestions for Subcreation. These are:

  • Use the full sample
  • Control for regional effects by pulling a true topk dataset
  • Control for repeated measures from players
  • Bootstrap the confidence interval

Combining all these suggestions is an obvious idea, which should yield better results.

6.5.1 Healers

Because we want to run the computations for healers, tanks, and dps, we benefit from setting up a function for doing the computation. The function preprocess combines all our controls and runs them in order.

preprocess <- function(df, tag) {
  # Handle repeated measures by averaging players over their multiple runs of the same dungeon
  df <- df |>
    dplyr::group_by(dungeon, id) |>
    dplyr::reframe(
      class_spec = class_spec,
      dungeon = dungeon,
      class = class,
      name = name,
      mean_score = mean(score)) |>
    mutate(score = mean_score) |>
    distinct() |>
    ungroup()

  # Bootstrap a confidence interval
  df <- df |>
    group_by(class, class_spec) |>
    filter(n() > 3) |>
    nest() |>
    mutate(bootres = map(data, run_boot)) |>
    mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
    unnest(tidy) |>
    mutate(data = NULL, bootres = NULL) |>
    ungroup()

  # rename columns to fit the rest of the data here
  df <- df |>
    rename(
      conf.low = conf.low,
      conf.high = conf.high,
      score_mean = statistic) |>
    mutate(analysis = tag)

  df
}

With this function in hand, we can easily run the computation for different roles. We’ll focus on healers first.

Code
healers_combined <- preprocess(specs_topk |> filter(role == "healer"), "combined")

And make our usual plot. It is nice to have a stand-alone plot of healers in this case.

Code
ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    geom_hline(yintercept=clusters_topk$centers, lty="dashed", color="white") +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Once we control for undesired effects, it becomes clear the healers tend to be far more balanced. The difference in rating between the best and worst healer becomes far smaller. Because they are so close, we also provide a zoomed-in view with no clustering:

Code
ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Finally, we can compare our findings with the baseline: Subcreation’s data compared to ours:

Code
union_all(
  healers_subc_base |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis),
  healers_combined |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Corrections",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Code
distances <- tibble(analysis = factor(c("corrected", "subcreation")),
                    distance = c(max(healers_combined$conf.low) - min(healers_combined$conf.low),
                                 max(healers_subc_base$conf.low) - min(healers_subc_base$conf.low)))
distances
# A tibble: 2 × 2
  analysis    distance
  <fct>          <dbl>
1 corrected       23.0
2 subcreation     54.5

Note how much the distance between the best and worst specs has shrunk. The balance window closes. This means the difference in power is a small effect, and other factors, such as player skill, are going to dominate. This is good for the game, as you’d rather pick better players than rely on specialization power when constructing a roster for a dungeon run.

6.5.2 Tanks

With the preprocess function in hand, we can now focus on tanks:

Code
tanks_combined <- preprocess(
  specs_topk |> filter(role == "tank"),
  "combined")

tanks_combined |>
  dplyr::select(-analysis) |>
  kbl(digits = 3) |> kable_styling()
class_spec class score_mean bias std.error conf.low conf.high
paladin_protection paladin 381.403 0.008 0.716 379.984 382.785
warrior_protection warrior 378.313 -0.009 0.926 376.531 380.117
monk_brewmaster monk 383.171 0.052 3.678 375.667 390.040
demon-hunter_vengeance demon-hunter 374.264 -0.030 2.099 370.068 378.199
druid_guardian druid 375.203 -0.061 1.680 371.836 378.468
death-knight_blood death-knight 368.728 0.001 2.528 363.493 373.667

As in the previous work on healers, we don’t need the cluster centers anymore, so we make a plot without them. This lets us focus on where the differences are and how large they are.

Code
ggplot(tanks_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3 DPS

For convenience, we split the DPS role into ranged and melee. This keeps each plot at a readable size. It is also common practice in many data expositions about WoW specializations, so we’ll just follow suit.

By now the drill is known, so we’ll just present the plots.

6.5.3.1 Ranged

Code
ranged_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "Yes"),
  "combined")
Code
ggplot(ranged_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3.2 Melee

Code
melee_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "No"),
  "combined")
Code
ggplot(melee_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.6 Remarks

We’ve traced the computations done by Subcreation and then provided a series of corrections to the computations. Before these corrections, the balance window reported by Subcreation is fairly large.

However, once you control for undesirable data effects, the balance window almost closes.

The findings here are presented one change at a time, each with a comparison, because that allows readers to make their own informed decision about what to believe.

Should Subcreation want to implement some of this, they have the option. They can pick whichever of these suggestions they consider correct.

The key take-away is that something can be statistically significant but have a small effect. This is especially true when you have a lot of observations: the statistics can squeeze out significance even for minuscule differences among the specializations. I think this is what is happening here, across all roles. Except for a few outliers, most specs fall into a very small window.
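
As a small illustration with made-up numbers (not our data set): a difference of a single rating point is negligible in practice, yet with enough observations a t-test will happily report it as highly significant.

set.seed(7)
a <- rnorm(100000, mean = 400, sd = 30)
b <- rnorm(100000, mean = 401, sd = 30)
t.test(a, b)$p.value  # tiny p-value for a practically irrelevant difference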

Some of the suggestions here are outright hacks3. By this I mean they fix problems with the approach, but do so with a focus on being simple fixes to add to the Subcreation site. This is deliberate. I’d much rather see some of the suggestions here implemented, even if they are simple in nature. Other chapters address the specialization power question with a stronger toolbox from statistics: Bayesian inference. That approach is arguably the correct one, but it also involves a lot more machinery, is computationally heavy, and requires working with a probabilistic programming language such as Stan.

3 Here, a hack means overcoming an obstacle by non-traditional means.