5  Retracing

We begin by establishing a baseline. In our case that is an implementation of Subcreation in R. This section provides the baseline, and does some initial analysis on the data.

The Subcreation M+ site contains statistics for different M+ metrics in World of Warcraft. Subcreation works by periodically gathering new data for the week, and then analyzing that new data. The tier lists are computed on a per-week basis, in order to avoid old data affecting the current status. Subcreation also contains a large backlog of older weeks of data.

Since we are mostly interested in the data analysis part, we’ll just pick one week’s worth of data. We don’t have to collect data far back in time, and we can still do all the analysis needed on a recent week.

5.1 Data Gathering

We use our zugzug tool to read the data in the same way as Subcreation does. Data is returned in JSON format by the raider.io API, and we flatten it into an ordinary data frame, so it is easy to read into R.

5.1.1 Study Design

A study design is a plan for how you are going to collect your data. It’s often overlooked, but it is crucial to get right in order to avoid drawing the wrong conclusions. If the data you collect introduces a bias you don’t correct for, you risk the quality of your study. Furthermore, your study design will shape your statistical model. Hence, one should design a data collection scheme rather than haphazardly making one.

The overall design of data collection looks like this:

  • For each of the 4 regions
    • For each of the 8 dungeons
      • Gather the top-100 runs
        • Collect information about the dungeon
        • Collect information about the 5 players in the run (tank, healer, 3 dps)

In other words, our sampling stratifies over the 4 regions (eu, us, kr, tw), and then further stratifies over the 8 dungeons (Algeth’ar Academy, Temple of the Jade Serpent, …). Stratified sampling makes sure we get representation from each region, and further by each dungeon in the region.
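As a rough sketch of the design above, the strata can be enumerated as a region-by-dungeon grid. The dungeon labels below are placeholders rather than real raider.io slugs, and the top-100 cut is taken from the list above; base R is used only to keep the sketch free of dependencies. The counts are upper bounds, since the actual data may be filtered afterwards.

```r
# Enumerate the sampling strata: every region crossed with every dungeon.
regions  <- c("eu", "us", "kr", "tw")
dungeons <- paste0("dungeon_", 1:8)   # placeholder names, not real slugs
strata   <- expand.grid(region = regions, dungeon = dungeons,
                        stringsAsFactors = FALSE)

k <- 100               # top-k runs gathered per stratum
players_per_run <- 5   # tank, healer, 3 dps

n_strata  <- nrow(strata)               # 4 regions * 8 dungeons
n_runs    <- n_strata * k               # upper bound on runs collected
n_players <- n_runs * players_per_run   # upper bound on player rows
c(strata = n_strata, runs = n_runs, player_rows = n_players)
```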

Next, we gather the top-k (k=100, k=500, or k=1000) runs for a given region/dungeon combination. This is an example of purposive sampling where the purpose is to focus on the best players. An alternative is to sample randomly from the full population of dungeon runs, but Subcreation specifically focuses on the top-k runs.

The score assigned to each of the 5 players is the score of the dungeon run. This also assigns a score to a given specialization.

One immediate consequence of this design is that we are using dungeon data in order to gauge the power of individual specializations. We are pin-pointing a specific dungeon run and then keeping track of the score. Thus, you see specialization scores in the 195–215 range, which is typical of a single high-key-level dungeon run [1].

[1] Typical in season 1 of Dragonflight when I started doing the data gathering. Depending on the point in a season, this number will vary.

Furthermore, stratifying over region defeats the purpose of sampling the top-k dungeon runs. If there are differences among regions, then you will be collecting runs from a weaker region and adding them to the data. But this means you no longer collect the top runs overall. One can go further and argue that stratification over dungeons is also bad in this design. If dungeons differ in difficulty, that will affect the top-k purposive sampling.
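The effect can be illustrated with a small simulation. The scores below are simulated from normal distributions, not taken from raider.io, and the two region labels and their means are assumptions chosen only to make the regions differ in strength. The sketch compares a global top-100 with a stratified top-50 per region and counts how many runs the two samples share.

```r
set.seed(1)

# Two hypothetical regions; "strong" scores about 20 points higher on average.
runs <- data.frame(
  region = rep(c("strong", "weak"), each = 1000),
  score  = c(rnorm(1000, mean = 420, sd = 20),
             rnorm(1000, mean = 400, sd = 20))
)
runs$run_id <- seq_len(nrow(runs))

# Global top-100: the best runs regardless of region.
global_top <- runs$run_id[order(runs$score, decreasing = TRUE)][1:100]

# Stratified top-k: the best 50 runs from each region separately.
stratified_top <- unlist(lapply(split(runs, runs$region), function(d) {
  d$run_id[order(d$score, decreasing = TRUE)][1:50]
}))

# Runs present in both samples; anything below 100 means the stratified
# sample replaced some true top runs with weaker-region runs.
overlap <- sum(stratified_top %in% global_top)
overlap
```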

In statistics, we use the word “pseudoreplication” when data points are wrongly treated as independent. This design is an example of pseudoreplication.

Data is assumed to be independent. But it clearly isn’t! The same player might be doing multiple dungeon runs in the same week. If the player is highly skilled, it is likely they have more than a single run at the very top. For example, we have a single player running 7 dungeons. We don’t have 7 players running one dungeon each. On top of that, a run is made by 5 players in a party, so the group composition might affect the results. And often, the same group has multiple runs at the very top. All in all, this amounts to repeated measurements, and it has to be controlled for in our data analysis. Otherwise, very active players can artificially change the perceived power of a given specialization in the data set.
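A quick way to gauge how much repetition a data set contains is to count rows per player. The sketch below uses a made-up toy table (the `run_id` and `id` values are illustrative, not real data) and base R only; on the real data the same idea would apply to the player Id column.

```r
# Toy player rows: one row per (run, player).
rows <- data.frame(
  run_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  id     = c("p1", "p2", "p1", "p2", "p1", "p3", "p4", "p5")
)

# Number of rows contributed by each distinct player.
runs_per_player <- table(rows$id)

n_rows    <- nrow(rows)                 # what a naive count treats as n
n_players <- length(runs_per_player)    # number of distinct players

# Share of rows that come from players appearing more than once.
repeat_share <- mean(runs_per_player[rows$id] > 1)
c(rows = n_rows, players = n_players, repeat_share = repeat_share)
```

If the number of rows is much larger than the number of distinct players, treating rows as independent observations overstates the amount of information in the sample.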

An alternative would be to sample from the leaderboards of each specialization [2], either randomly or purposively. There are many advantages to this. First, getting world-wide data is easier because raider.io provides this directly, which avoids regional differences. Second, all dungeon runs are aggregated and the score is the sum of best runs from each dungeon. The scoring also factors in Tyrannical and Fortified. Many independence problems are solved because each player only occurs close to once. It also evens out the dungeon variability by forcing each dungeon to count. The flip side, however, is that the data isn’t weekly, but accumulates over the season.

[2] Using the leaderboards for specialization power estimation is part of the site.

5.2 Data Import

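library(dplyr)  # filter() and the join verbs used below

# The sourced scripts are assumed to define dungeons, specs, specs_topk,
# and the lookup tables dungeon_names and region_names.
source('load-dungeons.R')
source('load-specs.R')
# Keep runs with score > 1, then attach readable dungeon and region names.
specs <- specs |>
  filter(score > 1) |>
  right_join(dungeon_names, by=c("dungeon" = "key")) |>
  inner_join(region_names, by=c("region" = "key"))
specs_topk <- specs_topk |>
  filter(score > 1) |>
  right_join(dungeon_names, by=c("dungeon" = "key")) |>
  inner_join(region_names, by=c("region" = "key"))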

To get an idea of what the data looks like, we glimpse at them:

glimpse(dungeons)
Rows: 6,400
Columns: 9
$ region            <fct> eu, eu, eu, eu, eu, eu, eu, eu, eu, eu, eu, eu, eu, …
$ dungeon           <fct> arakara-city-of-echoes, arakara-city-of-echoes, arak…
$ key_level         <dbl> 20, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, …
$ score             <dbl> 302.6, 475.4, 475.2, 474.1, 473.7, 472.7, 472.7, 472…
$ time_remaining_ms <dbl> -122146, 273021, 261908, 207809, 185420, 137578, 135…
$ group_id          <fct> e3de0cddf05856ed8264f4f832a54d84c1a84c68be838fe966c9…
$ tank_id           <fct> 197798483, 195435816, 226801834, 195435816, 19779848…
$ healer_id         <fct> 168952271, 231776615, 192781776, 231776615, 16895227…
$ dps_id            <fct> 82c7313607536e1c73b24d36ecbdfaadb99f5aae06604fe816d1…
glimpse(specs)
Rows: 31,133
Columns: 18
$ region            <chr> "eu", "eu", "eu", "eu", "eu", "eu", "eu", "eu", "eu"…
$ dungeon           <fct> arakara-city-of-echoes, arakara-city-of-echoes, arak…
$ key_level         <int> 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19, 19, 19, …
$ score             <dbl> 302.6, 302.6, 302.6, 302.6, 302.6, 475.4, 475.4, 475…
$ time_remaining_ms <int> -122146, -122146, -122146, -122146, -122146, 273021,…
$ id                <fct> 197798483, 168952271, 232497511, 232493270, 18606320…
$ name              <chr> "Yonteaux", "Ihatepriest", "Speedyo", "Elbroiblo", "…
$ role              <fct> tank, healer, dps, dps, dps, tank, dps, healer, dps,…
$ class             <fct> paladin, priest, rogue, shaman, evoker, paladin, dru…
$ spec              <fct> protection, discipline, assassination, enhancement, …
$ has_external      <fct> No, Yes, No, No, No, No, No, Yes, No, No, No, Yes, N…
$ has_bloodlust     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ has_resurrection  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
$ class_spec        <fct> paladin_protection, priest_discipline, rogue_assassi…
$ is_ranged         <fct> No, Yes, No, No, Yes, No, Yes, Yes, No, Yes, No, Yes…
$ dungeon.abbrev    <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"…
$ dungeon.str       <chr> "Ara-Kara, City of Echoes", "Ara-Kara, City of Echoe…
$ region.str        <chr> "Europe", "Europe", "Europe", "Europe", "Europe", "E…

These are both obtained from raider.io by means of their API. For the dungeons we have a list of individual dungeon runs. These are timed dungeons. The data includes

  • The region in which the dungeon was run
  • The name of the dungeon
  • The key level of the dungeon
  • The score obtained by running the dungeon
  • The time remaining in milliseconds on the timer
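  • An identifier for the group that ran the dungeon (group_id)
  • The Id of the tank (tank_id)
  • The Id of the healer (healer_id)
  • An identifier for the dps (dps_id)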

We also produce individual rows per player from the dungeon run. It adds the following columns:

  • The unique Id of the player. It is used to identify when the same player ran more than one dungeon.
  • The name of the player. Note duplicates might occur in this column. It’s more for us to get some idea of the players, not for the machine to process.
  • Role: tank, healer, or dps.
  • Class: obvious.
  • Spec: obvious.
  • Class-Spec: Since specialization names aren’t unique, we add the class to them which makes them unique.
  • Has external encodes if the healer has an external ability. This is No for Shaman and Monk, and Yes for everyone else.
  • Has bloodlust is Yes if the team has a bloodlust ability.
  • Has resurrection is Yes if the team has access to a combat resurrection ability.
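The derived columns can be sketched as follows. How the loading scripts actually compute them is not shown here, so this is an assumption based on the descriptions above and on the glimpse output, where non-healers carry "No" for the external flag. The first three toy rows mirror the first three players in that output; the fourth is added for illustration.

```r
# Toy player rows with the raw role/class/spec columns.
players <- data.frame(
  role  = c("tank", "healer", "dps", "healer"),
  class = c("paladin", "priest", "rogue", "shaman"),
  spec  = c("protection", "discipline", "assassination", "restoration")
)

# Spec names repeat across classes, so prefix the class to make them unique.
players$class_spec <- paste(players$class, players$spec, sep = "_")

# Assumed rule: only healers can be flagged, and Shaman and Monk healers
# are the ones without an external ability.
players$has_external <- ifelse(
  players$role == "healer" & !(players$class %in% c("shaman", "monk")),
  "Yes", "No"
)
players
```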

To get a simple overview of the data, we can look at a plot counting the different specs in question.

Caution

A plot like the following pseudoreplicates because it doesn’t take multiple runs by the same player into account. Thus, prolific players who play a lot will be counted many times over. We will see ways to address this later on, avoiding the multiple counting done here. For a quick overview of the data set as a whole, it’s fine, but don’t read too much into the counts.
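One simple way to see the size of the effect, not necessarily the approach taken later, is to compare row counts with counts of distinct players per specialization. The toy data below is made up for illustration; the column names follow the player table described above.

```r
# Toy rows: player p1 contributes three runs on the same spec.
specs_toy <- data.frame(
  id = c("p1", "p1", "p1", "p2", "p3", "p3"),
  class_spec = c("priest_discipline", "priest_discipline", "priest_discipline",
                 "monk_mistweaver", "shaman_restoration", "shaman_restoration")
)

# Naive popularity: every row counts, so prolific players count repeatedly.
row_counts <- table(specs_toy$class_spec)

# De-duplicated popularity: each (player, spec) pair counts once.
player_counts <- table(unique(specs_toy[c("id", "class_spec")])$class_spec)

rbind(rows = row_counts, players = player_counts)
```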

specs |>
  group_by(class, class_spec) |>
  summarize(count = n(), .groups = "keep") |>
  inner_join(spec_names, by = join_by(class_spec == key)) |>
  ggdotchart(
    x = "class_spec.str",
    y = "count",
    color = "class",
    rotate = TRUE,
    sorting = "descending",
    add = "segments",
    dot.size = 7,
    ggtheme = theme_zugzug()) +
  scale_color_manual(values = wow_colors) +
  labs(title = "Specialization Popularity",
       subtitle = "4 Regions, 8 Dungeons from each region, top-k runs",
       caption = "Source: https://raider.io",
       x = "Specialization",
       y = "Count")
Figure 5.1: Count of each specialization in the data set, showing the variance in popularity.

5.2.0.1 Role split

To follow Subcreation, we need to split the data into the different roles. We will address healers first, and then proceed to the other roles. We can create new data frames from the original by filtering on the roles. A summary of the healer data looks like this:

healers <- specs |> filter(role == "healer")
tanks <- specs |> filter(role == "tank")
dps <- specs |> filter(role == "dps")

healers |> group_by(class_spec) |> summarize(count = n()) |> kbl()
class_spec           count
priest_discipline     5032
monk_mistweaver        257
shaman_restoration     724
evoker_preservation     33
druid_restoration       77
paladin_holy            83
priest_holy             21

It turns out some healers are far more popular than others. To see how much more, we can plot the counts for healers only. The earlier plot covered all the specs; restricting it to healers shows that popularity also varies widely within this one role.

healers |>
  group_by(class, class_spec) |>
  summarize(count = n(), .groups = "keep") |>
  inner_join(spec_names, by = join_by(class_spec == key)) |>
  ggdotchart(
    x = "class_spec.str",
    y = "count",
    color = "class",
    rotate = TRUE,
    sorting = "descending",
    add = "segments",
    dot.size = 7,
    ggtheme = theme_zugzug()) +
  scale_color_manual(values = wow_colors) +
  labs(title = "Healer Population Counts",
       subtitle = "Popularity among top healers",
       caption = "Source: https://raider.io",
       x = "Specialization",
       y = "Count")
Figure 5.2: Count of healer popularity in the data set, showing variance for the role.

5.2.1 Data Analysis

We’ll now concentrate on the data analysis from the Subcreation site.

Subcreation assumes the data are normally distributed for each specialization. We can therefore compute a mean and SD for each specialization. However, we restrict ourselves to the top-k runs based on score only, ignoring the rest. We can compute this in R pretty easily.

Let us group the healers by spec, and then pick the top-100 scoring runs for each healer spec. Then plot the counts for each dungeon:

healers |>
  group_by(class_spec) |>
  slice_max(n = 100, score) |>
  ggplot(aes(x = fct_infreq(dungeon.str))) +
  geom_bar() +
  labs(title = "Dungeon counts among healers",
       subtitle = "Popularity differs among dungeons",
       caption = "Source: https://raider.io",
       x = "Dungeon",
       y = "Count") +
  scale_x_discrete(labels = scales::label_wrap(10))
Figure 5.3: Take the top-100 runs for each healer and count the dungeons. This gives an overview of what dungeons are prevalent at the very top and which dungeons are largely ignored.

If some dungeons are easier than others, they will have a larger effect on our mean and SD computations. This happened in Season 1 and 2 of Dragonflight. The easier dungeons are more popular. In short, we are going to be affected by some dungeons being much easier than others. The extreme example is when a spec isn’t able to deal with a boss in one dungeon, but can blast through another. Such a spec could be categorized as S-tier by our analysis, yet have an obvious problem in general by being unable to time certain dungeons, unless you drop the difficulty by a couple of key levels.

While this is a problem, let us ignore it for now and process data as it is being done by Subcreation.

Subcreation is trying to fit a normal distribution to the data. But it is being done in a way that is arguably wrong. It leads to situations where the lower bound of a 95% Confidence Interval on the mean can be greater than the (sample) mean. Statistically, this is impossible, since the interval is centered on the sample mean. Also, we are reporting the count \(n\) as being the full count of a population, but in the computations we are restricting ourselves to the top-k observations. This is highly misleading as a reader might believe the confidence interval is computed on a much larger sample than it is.

The first part of the Subcreation computation computes a mean and SD for the spec on the full sample. We do the same in the following piece of code, and we attach the mean and SD as new columns.

healers_subc_base <- healers |>
  group_by(class_spec) |>
  dplyr::mutate(
    spec_score_mean = mean(score),
    spec_score_sd   = sd(score)
  )

Next, we compute on the top-100 scoring runs. We can cook this down to a few lines thanks to functional programming and the ease of data manipulation in R.

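healers_subc_base <- healers_subc_base |>
  # Top-100 per spec (data is still grouped); ties at the cutoff are kept
  slice_max(n = 100, score) |>
  dplyr::reframe(
    class = class,
    spec_score_mean = spec_score_mean,  # mean over the full sample
    score_mean = mean(score),           # mean over the top-100 slice
    score_sd = sd(score),               # SD over the top-100 slice
    n = n(),
    # Mirrors Subcreation: full-sample SD combined with the sliced n
    error = qt(.975, n-1) * spec_score_sd / sqrt(n),
    conf.low = score_mean - error,
    conf.high = score_mean + error
  ) |>
  distinct()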

healers_subc_base <- healers_subc_base |> mutate(analysis = "subcreation")

Obviously, this calculation is broken from a statistical point of view. Due to slicing the top 100 runs, we have \(n = 100\) (slightly more when scores tie at the cutoff, and fewer for specs with under 100 runs). Thus, score_mean and score_sd are computed on about 100 observations. But when we are computing the error term, we use spec_score_sd, which is the standard deviation of the full sample. We then flip back and use the sliced \(n\) for the \(\sqrt{n}\) computation. We are mixing numbers from different samples as we like, and the end result is an odd blend which only confuses and brings no valid computation to the table.
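Written out, and assuming the code above mirrors Subcreation’s arithmetic, the interval being computed and an internally consistent alternative differ only in which standard deviation they use. The subscripts are notation introduced for this comparison: "top" marks statistics of the top-100 slice and "full" marks statistics of the whole sample for a spec.

```latex
% Interval as computed above: the SD comes from a different sample
% than the mean and the count.
\[
  \bar{x}_{\mathrm{top}} \;\pm\; t_{0.975,\,n_{\mathrm{top}}-1}\,
  \frac{s_{\mathrm{full}}}{\sqrt{n_{\mathrm{top}}}}
\]

% Internally consistent interval: mean, SD, and count from the same slice.
\[
  \bar{x}_{\mathrm{top}} \;\pm\; t_{0.975,\,n_{\mathrm{top}}-1}\,
  \frac{s_{\mathrm{top}}}{\sqrt{n_{\mathrm{top}}}}
\]
```

Even the second form only describes the mean of the top slice. Since the slice is selected by score rather than drawn at random, it should not be read as a confidence interval for the specialization as a whole.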

As a side-note, we use \(n-1\) for the degrees of freedom, whereas Subcreation wrongly uses \(n\). For most inputs, however, the error is fairly small because \(n\) is large enough, so it’s not really concerning for the end result.
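To get a feel for how small the difference is, the two critical values can be compared directly. The sample sizes 21 and 100 are picked as the smallest and a typical \(n\) among the healer specs; the variable names are illustrative.

```r
n <- c(21, 100)

t_correct <- qt(0.975, df = n - 1)  # degrees of freedom n - 1
t_subc    <- qt(0.975, df = n)      # degrees of freedom n, as Subcreation uses

# Relative shrinkage of the interval half-width caused by using n.
rel_diff <- (t_correct - t_subc) / t_correct
data.frame(n = n, t_correct = t_correct, t_subc = t_subc, rel_diff = rel_diff)
```

The relative difference stays well below one percent for both sample sizes, and it shrinks further as \(n\) grows.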

Subcreation will present the value \(ci\_lb\) which is our conf.low in the above, together with the mean (written \(\bar{x}\)). But the \(\bar{x}\) value is taken from spec_score_mean, and not score_mean, which it should have been. This means you can get lower bounds which are greater than the mean. This does not give much confidence in the data handling in general, and indeed it is pretty bad.
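The mismatch is easy to reproduce on simulated scores. The numbers below are drawn from a normal distribution and are not raider.io data; the mean, SD, and sample size are assumptions chosen to resemble a popular spec. The sketch reports the full-sample mean next to a lower bound built around the top-100 mean, as in the computation described above.

```r
set.seed(42)

# Simulated scores for one hypothetical, popular specialization.
scores <- rnorm(5000, mean = 400, sd = 45)

full_mean <- mean(scores)   # the mean that ends up being reported
full_sd   <- sd(scores)     # SD over the full sample

# Top-100 slice, on which the interval is centered.
top <- sort(scores, decreasing = TRUE)[1:100]
n   <- length(top)

# Error term blends the full-sample SD with the sliced n.
error    <- qt(0.975, n - 1) * full_sd / sqrt(n)
conf_low <- mean(top) - error

c(reported_mean = full_mean, conf.low = conf_low)
```

Because the interval is centered on the mean of the best 100 runs while the reported mean covers all runs, the "lower bound" sits far above the reported mean.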

5.2.1.1 Cluster handling

Subcreation sorts specializations into tiers. A clustering algorithm is used to compute a set of clusters from the specialization scores. Then, the clusters are mapped into tiers (S, A, B, …) and specializations are placed within tiers.

Following Subcreation, we use Ckmeans to compute the clusters over the full dataset. That includes all the specializations, and not only the healers. This makes sense, insofar as we want to compare across different roles. The centers are the middle of each cluster. So in order to score into a tier, you need to be closer to that tier’s center than to any other center.

clusters <- Ckmeans.1d.dp(specs$score, 6)
clusters$centers
[1] 301.8235 398.2522 412.9114 427.6789 442.8809 458.4979

Recommendation: choose an odd cluster count: 5 or 7

The current number of clusters is 6. That gives the tiers S, A, B, C, D, and F. I’d recommend either cutting the D tier or adding an E tier. An odd number will create an obvious center tier, either B or C. And if we have perfect balance, we should expect all specs to end up in this middle tier. The choice of an even number forces a split on specializations which are already very close.

A peculiarity is that the centers look fairly even. To see why, let us plot the clusters, but overlay a kernel density plot so we can see how the scores distribute.

ggdensity(specs,
  x = "score",
  fill = "grey50",
  ggtheme = theme_zugzug()) +
  geom_vline(xintercept=clusters$centers, lty="dashed", color="grey60") +
  labs(title = "Cluster center and Score Density",
       subtitle = "Tiers match key levels",
       caption = "Source: https://raider.io",
       x = "Score",
       y = "Density")
Figure 5.4: Density plot of scores with overlaid cluster centers. The centers are the dashed vertical lines in the plot. This shows how cluster centers align with scoring for different key levels. Each key level provides a spike in density according to the average score for that key level (ignoring dungeon differences).

Huh. So the tiers are roughly equivalent to the different key levels which are being run at the top level. That is an interesting piece of information [3].

[3] Given the way IO is computed, I don’t think this is a particularly good way to cluster specializations. In practice, the clusters will correspond closely to dungeon key levels, which poses a number of problems. If the span of key levels is smaller than the number of clusters, clusters won’t space evenly. Furthermore, there are relatively few specializations so there are few cluster data-points. Most clustering algorithms work better with a larger set of data points. In practice, what people are after is a ranking from best to worst rather than a placement into tiers.

5.2.2 Data Presentation

Here’s what the data looks like:

healers_subc_base |>
  dplyr::select(
    class_spec,
    score_mean,
    score_sd,
    n,
    error,
    conf.low,
    conf.high) |>
  kbl()
class_spec           score_mean   score_sd    n      error  conf.low  conf.high
priest_discipline      465.0410   6.386351  100   9.455484  455.5855   474.4965
monk_mistweaver        448.7010   6.759889  101   9.066536  439.6345   457.7675
shaman_restoration     428.9871   5.058412  101   9.903749  419.0834   438.8909
evoker_preservation    394.0182  46.283639   33  16.411465  377.6067   410.4296
druid_restoration      397.9468  42.417557   77   9.627604  388.3191   407.5744
paladin_holy           386.9217  51.466627   83  11.238051  375.6836   398.1597
priest_holy            406.1095  51.091789   21  23.256704  382.8528   429.3662

It might not be that interesting to look at the data in a table, but we can plot the data instead. The plot will give us a far better view of what the data looks like between the different specializations:

ggplot(healers_subc_base,
  aes(
    x = reorder(class_spec, conf.low),
    y = score_mean,
    colour = class,
    ymin = conf.low,
    ymax = conf.high)
  ) +
    geom_pointrange() +
    geom_point(aes(y=conf.low), color="grey90", size=1.5) +
    scale_color_manual(values = wow_colors) +
    geom_hline(yintercept=clusters$centers, lty="dashed", colour="grey60") +
    coord_flip() +
    labs(title = "Subcreation confidence intervals",
         subtitle = "Computations follow Subcreation exactly",
         caption = "Source: https://raider.io",
         x = "Specialization",
         y = "Mean score & 95% CI")
Figure 5.5: Healer strength computation in the style of Subcreation. The X-axis represents the typical score for a single dungeon run among the sampled top players. Scores fall into a 95% confidence interval of possible values, and overlaps of the intervals suggest two specializations have roughly the same power. A white dot marks the lower bound of the confidence interval, which is a conservative estimate used by Subcreation. The dashed vertical lines represent the centers of the tiers, with the rightmost line being S-tier.

Generally, if there’s no overlap between the intervals, there’s likely to be a difference between the specializations. If the intervals overlap, however, it isn’t as clear. It may be those two specializations are roughly the same in power, but errors in measurements have led to the mean being slightly different. There are statistical methods to tease out such a difference.

The white dot marks the lower bound of the 95% Confidence Interval. This is the score used by Subcreation in order to sort each specialization into tiers. The centers of the tiers are marked by the dashed lines in the output, with S-tier being the rightmost and F-tier being the leftmost. Each specialization is sorted into the tier whose computed center is closest to its lower bound. In turn, specializations with lower bounds that are relatively close end up in the same tier.
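The nearest-center rule can be sketched with the numbers already shown: the six cluster centers printed earlier and the conf.low column of the healer table. The mapping of centers to the labels F, D, C, B, A, S from lowest to highest is an assumption based on the tier description above, and the variable names are illustrative.

```r
# Cluster centers from the Ckmeans output, lowest to highest.
centers <- c(301.8235, 398.2522, 412.9114, 427.6789, 442.8809, 458.4979)
tiers   <- c("F", "D", "C", "B", "A", "S")   # assumed label order

# Lower bounds (conf.low) from the healer table.
conf_low <- c(
  priest_discipline   = 455.5855,
  monk_mistweaver     = 439.6345,
  shaman_restoration  = 419.0834,
  evoker_preservation = 377.6067,
  druid_restoration   = 388.3191,
  paladin_holy        = 375.6836,
  priest_holy         = 382.8528
)

# For each lower bound, find the index of the closest center.
nearest <- vapply(conf_low, function(x) which.min(abs(x - centers)), integer(1))
tier <- setNames(tiers[nearest], names(conf_low))
tier
```

Under this rule, Discipline Priest lands in S, Mistweaver Monk in A, Restoration Shaman in C, and the remaining four healers in D.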

5.2.3 Discussion

Subcreation uses a computation which is wrong and misleading.

  • It uses the top-100 runs from each spec, but blends it with the full dataset.
  • Because specialization popularity varies a lot, the outcome will vary with the popularity of specs.
  • From the perspective of sampling, we’ve just sampled way more of the popular specs.
  • You pick the best runs from the popular specs, but you pick all runs from the less popular specs.
  • We are also sampling in dungeon runs, not among specializations.
  • A hard or easy dungeon can skew the data.