6 Alternative Analysis

This chapter addresses some of the weaknesses in the data set, comparing corrections to the base level used by Subcreation. Each subpart here addresses a specific shortcoming.

6.1 Using the full sample

One thing I don’t particularly like about the data analysis is restriction to the top 100 runs. it will ignore the more difficult dungeons, which we collected via stratified sampling. We can ask what happens if we just compute a confidence interval on the full sample, in the traditional way. This avoids all the broken computations which blends data from the top-100 and the full sample.

healers_full_sample <- healers |>
  group_by(class_spec) |>
  dplyr::reframe(
    class = class,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "full-sample")

healers_full_sample |>
  dplyr::select(class_spec, n, score_mean, score_sd, conf.low, conf.high) |>
  kbl() |> kable_styling()

class_spec	n	score_mean	score_sd	conf.low	conf.high
priest_discipline	5032	413.6214	47.65348	412.3044	414.9384
monk_mistweaver	257	418.8397	45.92685	413.1980	424.4813
shaman_restoration	724	381.4935	50.16778	377.8331	385.1539
evoker_preservation	33	394.0182	46.28364	377.6067	410.4296
druid_restoration	77	397.9468	42.41756	388.3191	407.5744
paladin_holy	83	386.9217	51.46663	375.6836	398.1597
priest_holy	21	406.1095	51.09179	382.8528	429.3662

The way we’ve set up data frames with an analysis tag will let us compare the difference between the two ways of computation. Here we use colors to differentiate between the methods used by Subcreation, and using the full sample. Note how the popular specs are beeing reeled in. This is to be expected, because when we don’t pick the top-100 for those specializations, we are not selecting the best runs among the higher population count.

Some players are likely to be better than the average of our selected runs, and some are likely to be worse. Subcreation just picks all the lucky dice throws, and ignores all the unlucky ones. I’m arguing this is a bad practice, because the popular specs get far more dice throws than the less popular specs.

Code

healers_subc_base <- healers_subc_base |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_full_sample),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    geom_hline(yintercept=clusters$centers, lty="dashed", color="white"  ) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Top 100 vs full sample",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Figure 6.1: Plot where we compare the original Subcreation model (top-100 runs) to the full sample data where all runs are included. The intent is to show there’s a difference based on what model you decide to use. How you decide to handle your data can have a large effect on the outcome and also the conclusion people will end up making based on the data.

6.2 Regional Effects

The choice of stratifying by region ensures we are getting a share of runs from every region. But because we are also purposively sampling the top players, it is important to ask if there are any regional differences. We can do that via a quick violin-plot over all the data, but split by region.

Violin-plots are in the family of density plots. In statistics a density (Probability Density Function / PDF) tells us the relative likelihood of drawing a value from the distribution. Where the distribution has mass, there is a larger likelihood of getting a value. In the violin-plot, we can see that scores are more likely to be around certain threshold values. These correspond to key-level increases. The variance within a given key level pertains to the clock and how much time was left on it. In “between” key levels, there are no observations at all, so the density violin becomes very thin.

Code

ggplot(specs, aes(x = region.str, y = score, fill=region)) +
  geom_violin(color="lightgray") +
  labs(title = "Score distributions per Region",
       subtitle = "Points are count-grouped to prevent overplotting",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Score") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))

Figure 6.2: Score distributions split by region shows regional differences.

We see there are considerable effects from regional differences. Data from Asia are likely to have an effect on our results. In particular, we aren’t measuring the top players. It is very likely if we just pulled more data from the Europe and Americas regions, we would see a real view of the top players. The data set we are working with have the following regional counts

Code

specs |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()

region.str	count
Americas	7858
Asia (South Korea)	7700
Asia (Taiwan)	7735
Europe	7840

Figure 6.3: Table counting the number of runs per region in the base data set.

But we also have an alternative data set, where we find the world-wide top dungeon runs for the week. This dataset is called specs_topk. This data set have counts which differ:

It is worth stressing the regional difference is largely due to population counts. The regions with a lot of players will be able to push key levels quicker because there are more players running those keys. Hence, it is far easier to form groups. Also, because we are sampling among the top, a larger population in a region means we’ll have more skilled players in the region, even if we assume no skill difference.

Code

specs_topk |>
  group_by(region.str) |>
  summarize(count = n()) |>
  kbl() |> kable_styling()

region.str	count
Americas	10915
Asia (South Korea)	5895
Asia (Taiwan)	4995
Europe	18194

Counting top-k players, ignoring the region. These are the true top players world-wide.

Code

ggplot(specs_topk, aes(x = fct_infreq(region.str), fill=region)) +
  geom_bar() +
  labs(title = "Region counts for the top-k dungeon runs",
       subtitle = "Regions differ in the highest scoring runs",
       caption = "Source: https://raider.io",
       x = "Region",
       y = "Count") +
  scale_fill_material_d(palette = "light") +
  scale_x_discrete(label = scales::label_wrap(12))

Top-k players grouped by region. Some regions with large playerbases have far more runs among the Top-k players.

This confirms our hypothesis: the data we are using aren’t representative of the top players.

Like Subcreation, we will compute a K-means for the top-k dataset:

Code

clusters_topk <- Ckmeans.1d.dp(specs_topk$score, 6)
clusters_topk$centers

[1] 302.1084 398.5312 412.8772 427.3339 442.4347 458.4979

It might be good to compare the new cluster computation to the older one.

Code

clusters_centers <- tibble(
  centers = clusters$centers,
  source = "regional")
clusters_topk_centers <- tibble(
  centers = clusters_topk$centers,
  source = "topk")

centers <- union_all(clusters_centers, clusters_topk_centers)

Code

ggplot(centers, aes(x = source,
                    y = centers,
                    color = source)) +
  geom_point(size=5, alpha=.8) +
  coord_flip() +
  labs(title = "Ckmeans cluster centers",
       subtitle = "Regionally stratified vs TopK data",
       caption = "Source: https://raider.io",
       x = "Dataset",
       y = "Cluster Centers") +
  scale_color_material_d(palette = "light")

Figure 6.4: Plot comparing clusters computed on the original Subcreation sampling scheme and a top-k sampling scheme. This shows how the clusters move when we change the underlying data set.

We can repeat the same computation Subcreation uses, but on this new data set.

Code

healers_subc_topk <- specs_topk |> filter(role == "healer") |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_topk <- healers_subc_topk |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct() |>
  mutate(analysis = "topk")

Running the comparison shows us a major change among the unpopular specs. Remember, in subcreations computation, you pick among the best 100 runs for each healer, but for the unpopular specs, there aren’t even 100 runs. So you pick all of them. If you stratify over region in your data collection, you will get more players from the Asia regions, which runs lower key levels in general, skewing the results.

Code

healers_subc_topk <- healers_subc_topk |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_topk),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Regionally Stratified vs TopK data",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Figure 6.5: Compare the Subcreation base computation with a dataset where the top players are picked world-wide rather than being stratified per region. The plot shows how computing data among the *true* top-k players alters the outcome of the analysis.

6.3 Controlling for repeated measures

Statistical work often have assumptions which have to be fulfilled for the math to work out as expected. Violating the assumptions voids the correctness of the statistic. One such assumption is independence; each observation is independent of other observations. If you have 7 Protection Paladin runs in Shadowmoon Burial Grounds, the assumption is that those runs are done by 7 different Paladins. If all those 7 runs were done by the same player, it would be wrong to count this as 7 independent samples.

We can analyze the data set for repetitions by grouping on the unique Id of players.

Code

healers |>
 group_by(id) |>
 summarize(count = n(), name = name, class_spec = class_spec) |>
 distinct() |>
 ungroup() |>
 slice_max(n = 12, count) |>
 kbl() |>
 kable_styling()

`summarise()` has grouped output by 'id'. You can override using the `.groups`
argument.

id	count	name	class_spec
110391665	217	Gregxo	priest_discipline
19148583	145	哀豔	priest_discipline
57475032	114	Panier	priest_discipline
228944557	113	綾野夏樹	priest_discipline
213113103	100	Spingødxx	monk_mistweaver
229740582	97	Wippysl	priest_discipline
230116618	94	Dast	priest_discipline
173342916	90	Ayije	priest_discipline
231776615	85	Lyskanon	priest_discipline
138690359	85	艾艾法	priest_discipline
230836566	83	흰큼이	priest_discipline
227653400	83	Timbermawp	priest_discipline

How do they distribute? We can pick all healers who have played more than a single dungeon run for the given week.

Code

ggplot(
  healers |>
   group_by(id) |>
   reframe(count = n(), name = name, class_spec = class_spec) |>
   filter(count > 1) |>
   arrange(desc(count)) |>
   distinct(),
  aes(x = count)) +
  geom_histogram(binwidth=1) +
  labs(title = "Repeated runs histogram",
       subtitle = "The dataset contains repeated measures",
       caption = "Source: https://raider.io",
       x = "Repeated Runs by same player",
       y = "Count")

So a lot of runs repeats, and it is very likely to affect our results. This is something you have to control for when handling the data. As an example, we can look at specialization-popularity but control for repetitions in the data set:

Code

specs |>
  group_by(class, class_spec, id) |>
  summarize(id_count = n()) |>
  summarize(count = n(), .groups = "keep") |>
  inner_join(spec_names, by = join_by(class_spec == key)) |>
  ggdotchart(
    x = "class_spec.str",
    y = "count",
    color = "class",
    rotate = TRUE,
    sorting = "descending",
    add = "segments",
    dot.size = 7,
    ggtheme = theme_zugzug()) +
  scale_color_manual(values = wow_colors) +
  labs(title = "Specialization Popularity (no repetitions)",
       subtitle = "4 Regions, 8 Dungeons from each region, top-k runs",
       caption = "Source: https://raider.io",
       x = "Specialization",
       y = "Count")

`summarise()` has grouped output by 'class', 'class_spec'. You can override
using the `.groups` argument.

Figure 6.6: Counting the number of specializations with no repetitions of players. This avoids the pseudoreplication which is present if you just naively count specializations as we did earlier.

Compared to the uncorrected plot, it’s fairly obvious active players have an effect on the outcome. Active players will do more runs, and that leads to a conclusion there are more players at the top level playing a given specialization. The plot in Figure 6.6 is a more accurate way of looking at popularity among specializations.

We want to keep individual dungeon runs separated. That is, if a player has run different dungeons, we are assuming they are independent runs. Because we might have dungeon runs at several different key levels and with different times remaining, we remove duplicates. A more advanced model would take the dungeon variability into account as well, but this will do for now.

Code

healers_unique <- healers |>
  dplyr::group_by(dungeon, id) |>
  reframe(
    class_spec = class_spec,
    dungeon = dungeon,
    class = class,
    name = name,
    mean_score = mean(score)) |>
  mutate(score = mean_score) |>
  distinct() |>
  ungroup()

We can now do the same analysis as we did for Subcreation, but on unique healers.

Code

healers_subc_unique <- healers_unique |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

healers_subc_unique <- healers_subc_unique |>
  slice_max(n = 100, score) |>
  reframe(
    class = class,
    spec_score_mean = spec_score_mean,
    score_mean = mean(score),
    score_sd = sd(score),
    n = n(),
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error,
  ) |>
  distinct()

healers_subc_unique <- healers_subc_unique |> mutate(analysis = "unique")

We see yet another correction has influence in the reported mean values. The general trend is an adjustment downwards, because we don’t let a single healer influence the numbers too much. Some of the more popular specs are heavily influenced by single players who play the spec a lot, which means the skill of a few individual players have a large effect on the perceived power of the specialization.

Code

healers_subc_unique <- healers_subc_unique |> dplyr::select(-spec_score_mean)
ggplot(union_all(healers_subc_base, healers_subc_unique),
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Subcreation vs Controlling for repeated measures",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

6.4 Bootstrapped Confidence Intervals

A core assumption of Subcreation is that the data follows a normal distribution. If the data isn’t normally distributed, chances are this assumption is violated, and the statistics become weak.

Code

ggdensity(healers, x = "score", fill = "grey50", ggtheme = theme_zugzug()) +
  labs(title = "Healer score density plot",
       subtitle = "Scores spike around key levels",
       caption = "Source: https://raider.io",
       x = "Score",
       y = "Density")

Kernel density plot of the healer scores.

We know this dataset isn’t normally distributed. Mythic+ scores are artifically generated by combining key level, affixes, and time remaining in the dungeon. Since the base score for each key level has gaps in between them, our data set has gaps as well. There are certain scores which are unobtainable in between key levels, which violates the assumption the data is normally distributed. We can analyze the data via a Shapiro-Wilk test. This test sets up a hypothesis that the data is normally distributed; then asks “Assuming data are normally distributed, how likely is the data we have?”

If this test comes out with \(p < 0.05\) this data would occur at random less than 5% of the time.

Code

healers |>
  slice_max(score, n = 4000) |>
  summarize(shapiro.p = tidy(shapiro.test(score))$p.value) |>
  kbl(digits = 3) |> kable_styling()

shapiro.p
0

This value is practically 0, so it fails the test.

A way around this conundrum is to use bootstrapping. We can let a computer grind at the problem by resampling our population and derive a 95% Confidence interval from the resampling. This will then estimate the confidence interval¹. A bootstrapped approximation puts us on saner statistical ground than the current computation.

¹ This is due to the resampling forming a sampling distribution, which means we can invoke the Central Limit Theorem.

Code

confidence_interval <- function(data) {
  mean <- mean(data$score)
  sd <- sd(data$score)
  n <- length(data$score)
  error <- qt(.975, n-1) * sd / sqrt(n)
  conf.low <- mean - error
  conf.high <- mean + error

  c(conf.low, conf.high)
}

confidence_interval(healers |> slice_max(n = 100, order_by = score))

[1] 463.7793 466.2324

Bootstrapping works by sampling in the sampled scores with replacement². We create a “new” sample of 100 healers via this resampling, and then we take the mean. This process repeats. If we keep creating “new” samples this way 5000 times, we have 5000 means. These are a sampling distribution, and we can then compute the 95% confidence interval on that distribution.

² Replacement is a fancy way of saying we can pick the same sample more than once.

Say we do that. We get the following result:

Code

boot_fn <- function(data, indices) {
  d <- data[indices, ]

  mean(d$score)
}

bootres <- boot(
  healers |> slice_max(n = 100, order_by = score),
  boot_fn, R=5000)
tidy(bootres, conf.int = TRUE) |> kbl(digits=3) |> kable_styling()

statistic	bias	std.error	conf.low	conf.high
465.006	0.011	0.612	463.845	466.249

We can also compute it on a per-spec basis:

Code

run_boot <- function(d) {
  bootres <- boot(d, boot_fn, R=5000)

  bootres
}

healers_boot_ci <- healers |>
  group_by(class_spec) |>
  slice_max(n = 100, order_by = score) |>
  nest() |>
  mutate(bootres = map(data, run_boot)) |>
  mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
  unnest(tidy) |>
  mutate(data = NULL, bootres = NULL)

healers_boot_ci |> kbl(digits = 3) |> kable_styling()

class_spec	statistic	bias	std.error	conf.low	conf.high
priest_discipline	465.041	-0.008	0.636	463.809	466.306
monk_mistweaver	448.701	0.005	0.671	447.416	450.067
shaman_restoration	428.987	0.006	0.501	428.056	430.051
evoker_preservation	394.018	0.131	7.994	377.595	408.868
druid_restoration	397.947	0.010	4.788	388.040	406.944
paladin_holy	386.922	-0.062	5.677	375.214	397.692
priest_holy	406.110	0.120	10.737	382.982	425.232

Let us compare the bootstrap to the original subcreation computation. We can plot this, which is a better way of understanding the differences, than looking at a table.

Code

healers_boot_ci <- healers_boot_ci |>
  rename(
    conf.low = conf.low,
    conf.high = conf.high,
    score_mean = statistic) |>
  mutate(analysis = "bootstrapped")

Code

union_all(
  healers_subc_base |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis),
  healers_boot_ci |> dplyr::select(
    class_spec, score_mean, conf.low, conf.high, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Bootstrapped",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Figure 6.7: Bootstrapped confidence intervals compared to the Subcreation base computation.

One advantage of this approach is that it is more robust against changes in the underlying data. We see the derived confidence intervals via bootstrapping tends to be smaller. If we want to analyze if specs are different, having smaller confidence intervals tend to be better because there will be less overlap of the intervals. Usually this will let us tease out smaller effects from the data.

6.5 Combining all the Alternatives

We have a number of alternative suggestions for Subcreation. These are:

Use the full sample
Control for regional effects by pulling a true topk dataset
Control for repeated measures from players
Bootstrap the confidence interval

Combining all these suggestions is an obvious idea, which should yield better results.

6.5.1 Healers

Because we want to run the computations for healers, tanks, and dps, we benefit from setting up a function for doing the computation. The function preprocess combines all our controls and runs them in order.

preprocess <- function(df, tag) {
  # Handle repeated measures by averaging players over their multiple runs of the same dungeon
  df <- df |>
    dplyr::group_by(dungeon, id) |>
    dplyr::reframe(
      class_spec = class_spec,
      dungeon = dungeon,
      class = class,
      name = name,
      mean_score = mean(score)) |>
    mutate(score = mean_score) |>
    distinct() |>
    ungroup()

  # Bootstrap a confidence interval
  df <- df |>
    group_by(class, class_spec) |>
    filter(n() > 3) |>
    nest() |>
    mutate(bootres = map(data, run_boot)) |>
    mutate(tidy = map(bootres, broom::tidy, conf.int = TRUE)) |>
    unnest(tidy) |>
    mutate(data = NULL, bootres = NULL) |>
    ungroup()

  # rename columns to fit the rest of the data here
  df <- df |>
    rename(
      conf.low = conf.low,
      conf.high = conf.high,
      score_mean = statistic) |>
    mutate(analysis = tag)

  df
}

With this function in hand, we can easily run the computation for different roles. We’ll focus on healers first.

Code

healers_combined <- preprocess(specs_topk |> filter(role == "healer"), "combined")

And make our usual plot. It is nice to have a stand-alone plot of healers in this case.

Code

ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    geom_hline(yintercept=clusters_topk$centers, lty="dashed", color="white") +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Once we control for undesired effects, it becomes clear the healers tend to be far more balanced. The difference in rating between the best and worst healer becomes far smaller. Because they are so close, we also provide a zoomed-in view with no clustering:

Code

ggplot(healers_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

Finally, we can compare our findings with the base. Subcreations data compared to ours:

Code

union_all(
  healers_subc_base |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis),
  healers_combined |> dplyr::select(class_spec, conf.low, conf.high, score_mean, analysis)) |>
  ggplot(aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = analysis,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange(position = position_dodge(width = .5)) +
    coord_flip() +
    labs(title = "Confidence intervals",
         subtitle = "Subcreation vs Corrections",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI") +
    scale_color_material_d(palette = "light")

Code

distances <- tibble(analysis = factor(c("corrected", "subcreation")),
                    distance = c(max(healers_combined$conf.low) - min(healers_combined$conf.low),
                                 max(healers_subc_base$conf.low) - min(healers_subc_base$conf.low)))
distances

# A tibble: 2 × 2
  analysis    distance
  <fct>          <dbl>
1 corrected       49.0
2 subcreation     79.9

Note how much the distance between the best and worst specs have shrunk. The balance window closes. This means the difference in power is going to be a small effect, and other factors, such as player skill, are going to dominate. This is good for the game, as you’d like to pick better players than relying on specialization power to construct a roster for a dungeon run.

6.5.2 Tanks

With the preprocess function in hand, we can now focus on tanks:

Code

tanks_combined <- preprocess(
  specs_topk |> filter(role == "tank"),
  "combined")

tanks_combined |>
  dplyr::select(-analysis) |>
  kbl(digits = 3) |> kable_styling()

class_spec	class	score_mean	bias	std.error	conf.low	conf.high
paladin_protection	paladin	405.704	0.019	0.780	404.196	407.260
druid_guardian	druid	396.454	0.062	5.973	384.309	407.615
demon-hunter_vengeance	demon-hunter	383.992	-0.097	6.518	370.532	396.175
warrior_protection	warrior	386.139	-0.034	4.905	376.239	395.444
monk_brewmaster	monk	405.464	0.171	8.234	388.165	420.587
death-knight_blood	death-knight	383.257	-0.082	7.653	367.613	397.495

Like in the previous work on healers, we don’t need the cluster centers anymore, so we’ll make a plot without those. This lets us focus on where the differences are and how large of a difference there is.

Code

ggplot(tanks_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3 DPS

For convenience, we split the DPS role into ranged and melee. This keeps the plot in a size that is readily workable and easy. It is also common practice in a lot of data expositions about specializations in WoW to do so. We’ll just follow suit.

By now, the drill is known, so we’ll just bring the plots

6.5.3.1 Ranged

Code

ranged_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "Yes"),
  "combined")

Code

ggplot(ranged_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.5.3.2 Melee

Code

melee_combined <- preprocess(
  specs_topk |> filter(role == "dps" & is_ranged == "No"),
  "combined")

Code

ggplot(melee_combined,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    color = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="white", size=1.5) +
    scale_color_manual(values = wow_colors) +
    coord_flip() +
    labs(title = "Corrected Subcreation confidence intervals",
         subtitle = "4 corrections applied to the Subcreation analysis",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")

6.6 Remarks

We’ve traced the computations done by Subcreation and then provided a series of corrections to the computations. Before these corrections, the balance window reported by Subcreation is fairly large.

However, once you control for undesirable data effects, the balance window almost closes.

The findings here are presented one change at a time with a comparison, because it allows people to make their own informed decision on what they believe.

Should Subcreation want to implement some of this, they have the option. They can pick whatever they want among the options, including those they think are correct.

The key take-away is that something can be statistically significant, but have small effect. This is especially true if you have a lot of observations. The statistics can squeeze out significance, even for very minuscule differences among the specializations. I think this is what is happening here, and it is happening across all roles. Except for a few outliers, most specs fall into a very small window.

Some of the suggestions here are outright hacks³. By this, I mean they fix problems with the approach, but does so with a focus on being rather simple fixes to add to the Subcreation site. This is deliberate. I’d much rather see some of the suggestions here implemented, even if they are simple in nature. Other chapters address the specialization power question with a stronger toolbox from statistics: bayesian inference. This approach is arguably the correct one, but it also involves a lot more machinery, is computationally heavy, and requires one to work with a probalistic programming language such as stan.

³ Here a hack means overcoming an obstacle via non-traditional means