6  Alternative Analysis

This chapter addresses some of the weaknesses in the data analysis, comparing each correction against the baseline computation used by Subcreation. Each section addresses a specific shortcoming.

6.1 Using the full sample

One thing I don’t particularly like about the data analysis is the restriction to the top 100 runs. It ignores the more difficult dungeons, which we deliberately collected via stratified sampling. We can ask what happens if we just compute a confidence interval on the full sample, in the traditional way. This avoids all the broken computations which blend data from the top-100 and the full sample.

healers_full_sample <- healers |>
  group_by(class_spec) |>
  dplyr::reframe(
    class = class,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "full-sample")

healers_full_sample |>
  dplyr::select(class_spec, n, score_mean, score_sd, conf.low, conf.high) |>
  kbl() |> kable_styling()
class_spec n score_mean score_sd conf.low conf.high
priest_discipline 5032 413.6214 47.65348 412.3044 414.9384
monk_mistweaver 257 418.8397 45.92685 413.1980 424.4813
shaman_restoration 724 381.4935 50.16778 377.8331 385.1539
evoker_preservation 33 394.0182 46.28364 377.6067 410.4296
druid_restoration 77 397.9468 42.41756 388.3191 407.5744
paladin_holy 83 386.9217 51.46663 375.6836 398.1597
priest_holy 21 406.1095 51.09179 382.8528 429.3662

The way we’ve set up data frames with an analysis tag lets us compare the two ways of computing the intervals. Here we use colors to differentiate between Subcreation’s method and the full-sample computation. Note how the popular specs are being reeled in. This is to be expected: when we don’t pick the top-100 for those specializations, we are no longer selecting the best runs out of a much larger population.

Some players are likely to be better than the average of our selected runs, and some are likely to be worse. Subcreation just picks all the lucky dice throws, and ignores all the unlucky ones. I’m arguing this is a bad practice, because the popular specs get far more dice throws than the less popular specs.
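
To see why this is a problem, here is a toy simulation of my own (not part of the Subcreation pipeline, and the numbers are made up): two hypothetical specs draw scores from the exact same distribution, but one has far more recorded runs than the other.

Code
# Identical skill, different popularity. Taking the best 100 runs out of
# 5000 yields a much higher mean than taking the best 100 out of 120,
# even though both specs draw from the same distribution.
set.seed(1)
popular   <- rnorm(5000, mean = 400, sd = 50)
unpopular <- rnorm(120,  mean = 400, sd = 50)
mean(sort(popular,   decreasing = TRUE)[1:100])
mean(sort(unpopular, decreasing = TRUE)[1:100])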

Code
healers_subc_base <- healers_subc_base |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_full_sample),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    geom_hline(yintercept=clusters$centers, lty="dashed", color="white"  ) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Top 100 vs full sample",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.1: Plot comparing the original Subcreation model (top-100 runs) to the full sample where all runs are included. The intent is to show that the choice of model matters: how you decide to handle your data can have a large effect on the outcome, and on the conclusions people draw from it.

6.2 Regional Effects

The choice of stratifying by region ensures we are getting a share of runs from every region. But because we are also purposively sampling the top players, it is important to ask if there are any regional differences. We can do that via a quick violin-plot over all the data, but split by region.

Violin-plots are in the family of density plots. In statistics, a density (Probability Density Function / PDF) tells us the relative likelihood of drawing a value from the distribution: where the distribution has mass, values are more likely. In the violin-plot, we can see that scores concentrate around certain threshold values. These correspond to key-level increases. The variance within a given key level comes from the clock and how much time was left on it. Between key levels there are no observations at all, so the density violin becomes very thin.
Code
ggplot(specs, aes(x = region.str, y = score, fill=region)) +
  geom_violin(color="lightgray") +
  labs(title = "Score distributions per Region",
       subtitle = "Points are count-grouped to prevent overplotting",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Score") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))
Figure 6.2: Score distributions split by region show clear regional differences.

We see there are considerable effects from regional differences. Data from Asia are likely to affect our results; in particular, we aren’t measuring the top players. It is very likely that if we pulled more data from the Europe and Americas regions, we would see a truer view of the top players. The data set we are working with has the following regional counts:

Code
specs |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()
region.str count
Americas 7858
Asia (South Korea) 7700
Asia (Taiwan) 7735
Europe 7840
Figure 6.3: Table counting the number of runs per region in the base data set.

But we also have an alternative data set, where we find the world-wide top dungeon runs for the week. This data set is called specs_topk, and its counts differ:

It is worth stressing that the regional difference is largely down to population counts. Regions with a lot of players can push key levels quicker because more players are running those keys, so it is far easier to form groups. Also, because we are sampling among the top, a larger population in a region means more players at the very top, even if the underlying skill distributions are identical across regions.
Code
specs_topk |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()
region.str count
Americas 10915
Asia (South Korea) 5895
Asia (Taiwan) 4995
Europe 18194

Counting the top-k players while ignoring region gives us the true top players world-wide.

Code
ggplot(specs_topk, aes(x = fct_infreq(region.str), fill=region)) +
  geom_bar() +
  labs(title = "Region counts for the top-k dungeon runs",
       subtitle = "Regions differ in the highest scoring runs",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Count") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))

Top-k players grouped by region. Regions with large player bases have far more runs among the top-k players.

This confirms our hypothesis: the data we are using aren’t representative of the top players.

Like Subcreation, we compute a K-means clustering for the top-k data set:

Code
clusters_topk <- Ckmeans.1d.dp(specs_topk$score, 6)
clusters_topk$centers
[1] 302.1084 398.5312 412.8772 427.3339 442.4347 458.4979

It might be good to compare the new cluster computation to the older one.

Code
clusters_centers <- tibble(
  centers = clusters$centers,
  source = "regional")
clusters_topk_centers <- tibble(
  centers = clusters_topk$centers,
  source = "topk")

centers <- union_all(clusters_centers, clusters_topk_centers)
Code
ggplot(centers, aes(x = source,
                    y = centers,
                    color = source)) +
  geom_point(size=5, alpha=.8) +
  coord_flip() +
  labs(title = "Ckmeans cluster centers",
       subtitle = "Regionally stratified vs TopK data",
       caption = "Source: https://raider.io",
       x = "Dataset",
       y = "Cluster Centers") +
  scale_color_material_d(palette = "light")
Figure 6.4: Plot comparing clusters computed on the original Subcreation sampling scheme and a top-k sampling scheme. This shows how the clusters move when we change the underlying data set.

We can repeat the same computation Subcreation uses, but on this new data set.

Code
healers_subc_topk <- specs_topk |> filter(role == "healer") |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_topk <- healers_subc_topk |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "topk")

Running the comparison shows a major change among the unpopular specs. Remember, in Subcreation’s computation, you pick the best 100 runs for each healer specialization, but for the unpopular specs there aren’t even 100 runs, so you pick all of them. If you stratify over region in your data collection, you will get more players from the Asia regions, which run lower key levels in general, skewing the results.

Code
healers_subc_topk <- healers_subc_topk |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_topk),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Regionally Stratified vs TopK data",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.5: Comparison of the Subcreation base computation with a data set where the top players are picked world-wide rather than stratified per region. The plot shows how computing on the true top-k players alters the outcome of the analysis.

6.3 Controlling for repeated measures

Statistical work often rests on assumptions which have to be fulfilled for the math to work out as expected. Violating the assumptions voids the correctness of the statistic. One such assumption is independence: each observation must be independent of the others. If you have 7 Protection Paladin runs in Shadowmoon Burial Grounds, the assumption is that those runs were done by 7 different Paladins. If all 7 runs were done by the same player, it would be wrong to count them as 7 independent samples.
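
As a toy illustration of why this matters (my own sketch, under the assumption that a single player’s repeated runs cluster tightly around their own skill level), compare the naive margin of error in the two situations:

Code
# 7 runs by the same player vs 7 runs by 7 different players.
# The same-player runs vary much less, so a naive confidence interval
# computed from them is misleadingly narrow about the population of players.
set.seed(1)
same_player <- rnorm(7, mean = 400, sd = 5)    # one player repeating a dungeon
independent <- rnorm(7, mean = 400, sd = 50)   # seven different players
qt(.975, df = 6) * sd(same_player)  / sqrt(7)  # small margin of error
qt(.975, df = 6) * sd(independent) / sqrt(7)   # much larger margin of error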

Back in the real data, we can check for repetitions by grouping on the players’ unique IDs.

Code
healers |>
 group_by(id) |>
 summarize(count = n(), name = name, class_spec = class_spec) |>
 distinct() |>
 ungroup() |>
 slice_max(n = 12, count) |>
 kbl() |>
 kable_styling()
`summarise()` has grouped output by 'id'. You can override using the `.groups`
argument.
id count name class_spec
110391665 217 Gregxo priest_discipline
19148583 145 哀豔 priest_discipline
57475032 114 Panier priest_discipline
228944557 113 綾野夏樹 priest_discipline
213113103 100 Spingødxx monk_mistweaver
229740582 97 Wippysl priest_discipline
230116618 94 Dast priest_discipline
173342916 90 Ayije priest_discipline
231776615 85 Lyskanon priest_discipline
138690359 85 艾艾法 priest_discipline
230836566 83 흰큼이 priest_discipline
227653400 83 Timbermawp priest_discipline

How are the repeated runs distributed? We can pick all healers who have more than a single dungeon run in the given week.

Code
ggplot(
  healers |>
   group_by(id) |>
   reframe(count = n(), name = name, class_spec = class_spec) |>
   filter(count > 1) |>
   arrange(desc(count)) |>
   distinct(),
  aes(x = count)) +
  geom_histogram(binwidth=1) +
  labs(title = "Repeated runs histogram",
       subtitle = "The dataset contains repeated measures",
       caption = "Source: https://raider.io",
       x = "Repeated Runs by same player",
       y = "Count")

So a lot of runs are repeats by the same player, and this is very likely to affect our results. It is something you have to control for when handling the data. As an example, we can look at specialization popularity while controlling for repetitions in the data set:

Code
specs |>
  group_by(class, class_spec, id) |>
  summarize(id_count = n()) |>
  summarize(count = n(), .groups = "keep") |>
  inner_join(spec_names, by = join_by(class_spec == key)) |>
  ggdotchart(
    x = "class_spec.str",
    y = "count",
    color = "class",
    rotate = TRUE,
    sorting = "descending",
    add = "segments",
    dot.size = 7,
    ggtheme = theme_zugzug()) +
  scale_color_manual(values = wow_colors) +
  labs(title = "Specialization Popularity (no repetitions)",
       subtitle = "4 Regions, 8 Dungeons from each region, top-k runs",
       caption = "Source: https://raider.io",
       x = "Specialization",
       y = "Count")
`summarise()` has grouped output by 'class', 'class_spec'. You can override
using the `.groups` argument.
Figure 6.6: Counting unique players per specialization, with no repetitions. This avoids the pseudoreplication which is present if you naively count runs per specialization, as we did earlier.

Compared to the uncorrected plot, it’s fairly obvious that active players have an effect on the outcome. Active players do more runs, which inflates the impression that more players at the top level are playing a given specialization. The plot in Figure 6.6 is a more accurate way of looking at popularity among specializations.

We want to keep individual dungeon runs separated. That is, if a player has run different dungeons, we assume those are independent runs. Because a player might have runs of the same dungeon at several key levels and with different times remaining, we average their score per dungeon and then remove the duplicates. A more advanced model would take the dungeon variability into account as well, but this will do for now.

Code
healers_unique <- healers |>
  dplyr::group_by(dungeon, id) |>
  reframe(
    class_spec = class_spec,
    dungeon = dungeon,
    class = class,
    name = name,
    mean_score = mean(score)) |>
  mutate(score = mean_score) |>
  distinct() |>
  ungroup()

We can now do the same analysis as we did for Subcreation, but on unique healers.

Code
healers_subc_unique <- healers_unique |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_unique <- healers_subc_unique |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct()

healers_subc_unique <- healers_subc_unique |> mutate(analysis = "unique")

We see that yet another correction influences the reported mean values. The general trend is an adjustment downwards, because we don’t let a single healer influence the numbers too much. Some of the more popular specs are heavily influenced by single players who play the spec a lot, which means the skill of a few individuals has a large effect on the perceived power of the specialization.

Code
healers_subc_unique <- healers_subc_unique |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_unique),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Subcreation vs Controlling for repeated measures",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

6.4 Bootstrapped Confidence Intervals

A core assumption of Subcreation’s computation is that the data follow a normal distribution. If the data aren’t normally distributed, the confidence intervals built on that assumption become unreliable.

Code
ggdensity(healers, x = "score", fill = "grey50", ggtheme = theme_zugzug()) +
  labs(title = "Healer score density plot",
       subtitle = "Scores spike around key levels",
       caption = "Source: https://raider.io",
       x = "Score",
       y = "Density")

Kernel density plot of the healer scores.

We know this data set isn’t normally distributed. Mythic+ scores are artificially generated by combining key level, affixes, and time remaining in the dungeon. Since the base scores for consecutive key levels have gaps between them, our data set has gaps as well: certain scores in between key levels are unobtainable, which violates the assumption that the data are normally distributed. We can check this with a Shapiro-Wilk test. The test sets up the hypothesis that the data are normally distributed and then asks: “Assuming the data are normally distributed, how likely is the data we have?”

If the test comes out with \(p < 0.05\), data deviating this much from normality would occur less than 5% of the time under the assumption of normality, and we reject the hypothesis that the scores are normally distributed.

Code
healers |>
  slice_max(score, n = 4000) |>
  summarize(shapiro.p = tidy(shapiro.test(score))$p.value) |>
  kbl(digits = 3) |> kable_styling()
shapiro.p
0

This value is practically 0, so the data fail the normality test.
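
Beside the formal test, a quantile-quantile plot makes the departure from normality visible. A minimal sketch, assuming the same healers data frame and the ggpubr helpers used elsewhere in this chapter:

Code
# QQ plot: if the scores were normally distributed, the points would
# follow the reference line closely.
ggqqplot(healers |> slice_max(score, n = 4000), x = "score",
         ggtheme = theme_zugzug()) +
  labs(title = "QQ plot of healer scores",
       subtitle = "Departures from the line indicate non-normality",
       caption = "Source: https://raider.io")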

A way around this conundrum is to use bootstrapping. We can let the computer grind at the problem by resampling our sample and deriving a 95% confidence interval from the resamples1. A bootstrapped approximation puts us on saner statistical ground than the current computation.

1 This is due to the resampling forming a sampling distribution, which means we can invoke the Central Limit Theorem.

Code
confidence_interval <- function(data) {
  mean <- mean(data$score)
  sd <- sd(data$score)
  n <- length(data$score)
  error <- qt(.975, n-1) * sd / sqrt(n)
  conf.low <- mean - error
  conf.high <- mean + error

  c(conf.low, conf.high)
}

confidence_interval(healers |> slice_max(n = 100, order_by = score))
[1] 463.7793 466.2324

Bootstrapping works by resampling the observed scores with replacement2. We create a “new” sample of 100 healers via this resampling and take its mean. This process repeats: if we keep creating “new” samples this way 5000 times, we have 5000 means. These form a sampling distribution, and we can compute the 95% confidence interval on that distribution.

2 Replacement is a fancy way of saying we can pick the same observation more than once.
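
Written out by hand, the resampling loop looks roughly like this (an illustrative sketch; the actual computation below uses the boot package):

Code
# Manual bootstrap of the mean for the top-100 healer runs.
top100 <- healers |> slice_max(n = 100, order_by = score)
# sample() with replace = TRUE draws a same-sized resample with replacement.
boot_means <- replicate(5000, mean(sample(top100$score, replace = TRUE)))
# Percentile-based 95% confidence interval from the 5000 resampled means.
quantile(boot_means, probs = c(0.025, 0.975))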

Say we do that. We get the following result:

Code
boot_fn <- function(data, indices) {
  d <- data[indices, ]

  mean(d$score)
}

bootres <- boot(
  healers |> slice_max(n = 100, order_by = score),
  boot_fn, R=5000)
tidy(bootres, conf.int = TRUE) |> kbl(digits=3) |> kable_styling()
statistic bias std.error conf.low conf.high
465.006 0.011 0.612 463.845 466.249

We can also compute it on a per-spec basis:

Code
run_boot <- function(d) {
  bootres <- boot(d, boot_fn, R=5000)

  bootres
}

healers_boot_ci <- healers |>
  group_by(class_spec) |>
  slice_max(n = 100, order_by = score) |>
  nest() |>
  mutate(bootres = map(data, run_boot)) |>
  mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
  unnest(tidy) |>
  mutate(data = NULL, bootres = NULL)

healers_boot_ci |> kbl(digits = 3) |> kable_styling()
class_spec statistic bias std.error conf.low conf.high
priest_discipline 465.041 -0.008 0.636 463.809 466.306
monk_mistweaver 448.701 0.005 0.671 447.416 450.067
shaman_restoration 428.987 0.006 0.501 428.056 430.051
evoker_preservation 394.018 0.131 7.994 377.595 408.868
druid_restoration 397.947 0.010 4.788 388.040 406.944
paladin_holy 386.922 -0.062 5.677 375.214 397.692
priest_holy 406.110 0.120 10.737 382.982 425.232

Let us compare the bootstrap to the original Subcreation computation. We can plot this, which is a better way of understanding the differences than looking at a table.

Code
healers_boot_ci <- healers_boot_ci |>
  rename(
    conf.low = conf.low,
    conf.high = conf.high,
    score_mean = statistic) |>
  mutate(analysis = "bootstrapped")
Code
union_all(
  healers_subc_base |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis),
  healers_boot_ci |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Bootstrapped",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")
Figure 6.7: Bootstrapped confidence intervals compared to the Subcreation base computation.

One advantage of this approach is that it is more robust against changes in the underlying data. We also see that the confidence intervals derived via bootstrapping tend to be narrower. If we want to analyze whether specs differ, narrower confidence intervals tend to be better because there is less overlap between them, which usually lets us tease out smaller effects from the data.
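
One quick way to check this is to compare the interval widths directly. A small sketch, reusing the healers_subc_base and healers_boot_ci data frames from above:

Code
# Average width of the 95% intervals under each analysis.
union_all(
  healers_subc_base |> dplyr::select(class_spec, conf.low, conf.high, analysis),
  healers_boot_ci   |> dplyr::select(class_spec, conf.low, conf.high, analysis)) |>
  mutate(width = conf.high - conf.low) |>
  group_by(analysis) |>
  summarize(mean_width = mean(width)) |>
  kbl(digits = 1) |> kable_styling()

If the bootstrapped intervals are indeed narrower on average, that backs up the claim above.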

6.5 Combining all the Alternatives

We have a number of alternative suggestions for Subcreation. These are:

  • Use the full sample
  • Control for regional effects by pulling a true top-k data set
  • Control for repeated measures from players
  • Bootstrap the confidence interval

Combining all these suggestions is an obvious idea, which should yield better results.

6.5.1 Healers

Because we want to run the computations for healers, tanks, and DPS, we benefit from setting up a function for the computation. The function preprocess combines all our controls and runs them in order.

preprocess <- function(df, tag) {
  # Handle repeated measures by averaging players over their multiple runs of the same dungeon
  df <- df |>
    dplyr::group_by(dungeon, id) |>
    dplyr::reframe(
      class_spec = class_spec,
      dungeon = dungeon,
      class = class,
      name = name,
      mean_score = mean(score)) |>
    mutate(score = mean_score) |>
    distinct() |>
    ungroup()

  # Bootstrap a confidence interval
  df <- df |>
    group_by(class, class_spec) |>
    filter(n() > 3) |>
    nest() |>
    mutate(bootres = map(data, run_boot)) |>
    mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
    unnest(tidy) |>
    mutate(data = NULL, bootres = NULL) |>
    ungroup()

  # rename columns to fit the rest of the data here
  df <- df |>
    rename(
      conf.low = conf.low,
      conf.high = conf.high,
      score_mean = statistic) |>
    mutate(analysis = tag)

  df
}

With this function in hand, we can easily run the computation for different roles. We’ll focus on healers first.

Code
healers_combined <- preprocess(specs_topk |> filter(role == "healer"), "combined")

And make our usual plot. It is nice to have a stand-alone plot of healers in this case.

Code
ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    geom_hline(yintercept=clusters_topk$centers, lty="dashed", color="white") +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Once we control for these undesired effects, it becomes clear that the healers are far more balanced: the difference in rating between the best and worst healer becomes much smaller. Because they are so close, we also provide a zoomed-in view without the cluster centers:

Code
ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Finally, we can compare our findings with the baseline, putting Subcreation’s computation next to ours:

Code
union_all(
  healers_subc_base |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis),
  healers_combined |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Corrections",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Code
distances <- tibble(analysis = factor(c("corrected", "subcreation")),
                    distance = c(max(healers_combined$conf.low) - min(healers_combined$conf.low),
                                 max(healers_subc_base$conf.low) - min(healers_subc_base$conf.low)))
distances
# A tibble: 2 × 2
  analysis    distance
  <fct>          <dbl>
1 corrected       49.0
2 subcreation     79.9

Note how much the distance between the best and worst specs has shrunk; the balance window closes. This means the difference in power is a small effect, and other factors, such as player skill, will dominate. This is good for the game: when constructing a roster for a dungeon run, you’d rather pick better players than rely on specialization power.

6.5.2 Tanks

With the preprocess function in hand, we can now focus on tanks:

Code
tanks_combined <- preprocess(
  specs_topk |> filter(role == "tank"),
  "combined")

tanks_combined |>
  dplyr::select(-analysis) |>
  kbl(digits = 3) |> kable_styling()
class_spec class score_mean bias std.error conf.low conf.high
paladin_protection paladin 405.704 0.019 0.780 404.196 407.260
druid_guardian druid 396.454 0.062 5.973 384.309 407.615
demon-hunter_vengeance demon-hunter 383.992 -0.097 6.518 370.532 396.175
warrior_protection warrior 386.139 -0.034 4.905 376.239 395.444
monk_brewmaster monk 405.464 0.171 8.234 388.165 420.587
death-knight_blood death-knight 383.257 -0.082 7.653 367.613 397.495

As in the previous work on healers, we no longer need the cluster centers, so we make a plot without them. This lets us focus on where the differences are and how large they are.

Code
ggplot(tanks_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3 DPS

For convenience, we split the DPS role into ranged and melee. This keeps each plot at a readily workable size, and it is also common practice in WoW data expositions about specializations, so we follow suit.

By now the drill is familiar, so we’ll just present the plots.

6.5.3.1 Ranged

Code
ranged_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "Yes"),
  "combined")
Code
ggplot(ranged_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3.2 Melee

Code
melee_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "No"),
  "combined")
Code
ggplot(melee_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.6 Remarks

We’ve traced the computations done by Subcreation and then provided a series of corrections. Before these corrections, the balance window reported by Subcreation was fairly large.

However, once you control for undesirable data effects, the balance window almost closes.

The findings here are presented one change at a time, each with a comparison, because this allows people to make their own informed decision about what to believe.

Should Subcreation want to implement some of this, they have the option: they can pick whichever of the corrections they think are correct.

The key take-away is that something can be statistically significant yet have a small effect. This is especially true when you have a lot of observations: the statistics can squeeze out significance even for minuscule differences among the specializations. I think this is what is happening here, and it is happening across all roles. Except for a few outliers, most specs fall into a very small window.
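
As a tiny illustration (simulated numbers, not the real data): with enough observations, even a one-point difference in mean score comes out statistically significant.

Code
# Two simulated specs differing by a single rating point.
set.seed(1)
a <- rnorm(200000, mean = 400, sd = 50)
b <- rnorm(200000, mean = 401, sd = 50)
t.test(a, b)$p.value   # tiny p-value despite a negligible difference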

Some of the suggestions here are outright hacks3. By this, I mean they fix problems with the approach, but do so with a focus on being simple fixes to add to the Subcreation site. This is deliberate: I’d much rather see some of these suggestions implemented, even if they are simple in nature. Other chapters address the specialization power question with a stronger toolbox from statistics: Bayesian inference. That approach is arguably the correct one, but it also involves a lot more machinery, is computationally heavy, and requires working with a probabilistic programming language such as Stan.

3 Here, a hack means overcoming an obstacle via non-traditional means.