A critique of Subcreation

This part contains an analysis of Subcreation's (SC) computations. The core idea is to first retrace the steps of SC, and then come up with suggestions on what to change in order to improve the statistics. By retracing the steps, we can see what happens with each change, and also how that change affects the results.

One particular strength of the programming language R is that statistical computations are easy. In particular, vectors are a base type, which suits a lot of the data we are interested in.

This means we can implement the core of SC in a few lines of code. We can then proceed by asking questions of the “what-if” style. Altering aspects of the computation will immediately tell us what the changes look like.
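
To make this concrete, here is a minimal sketch of such a core computation, assuming a data frame `runs` with a `spec` column and a numeric `score` column (the names are placeholders of mine, not SC's), and ranking by a plain normal-approximation lower bound:

```r
# Minimal sketch of an SC-style ranking, assuming a data frame `runs`
# with columns `spec` and `score` (hypothetical names).
lower_ci <- function(x, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)          # 1.96 for a 95% interval
  mean(x) - z * sd(x) / sqrt(length(x))   # normal-approximation lower bound
}

specs <- split(runs$score, runs$spec)
ranking <- data.frame(spec  = names(specs),
                      lb_ci = sapply(specs, lower_ci),
                      mean  = sapply(specs, mean),
                      n     = sapply(specs, length))

# Rank specs by the lower bound, highest first.
ranking[order(ranking$lb_ci, decreasing = TRUE), ]
```

With the core in place, every "what-if" question becomes a small change to this script.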

Summary

This section is a summary of the overall critique of Subcreation (henceforth SC), dated March 2023. If you are reading this later, you may find that much of this critique has since been fixed.

SC wants to create a tier-list via mathematical means. This is a grand idea, because such a list can update itself as data changes. However, in the pursuit of defining a tier-list, a lot of good statistical practice was thrown out of the window.

The presentation includes a table with the vital data used in the computation. The columns are:

  • \(lb\_ci\) - A lower bound of a 95% confidence interval.
  • Spec - The specialization we observed.
  • \(\bar{x}\) - The mean score of the given spec.
  • max - The maximal score of the spec, together with the key level of that run. This column is emphasized.
  • \(n\) - Number of observations for the spec.

However, this table is highly misleading in several ways. First, it is odd to report some of the vitals, like the maximal value, while not also providing the minimal value. If you want to summarize a vector of score observations, I would choose Min, Max, Mean, and Median at the least, and probably also the 25th and 75th percentiles (sometimes called the 1st and 3rd quartiles). Even better, use a Tukey boxplot on the data to give an overall view, as sketched below.
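
For reference, these summaries are essentially one-liners in R, again assuming the hypothetical `runs` data frame from before:

```r
# Summarizing a vector of score observations for one specialization,
# assuming `score` holds the observed scores.
summary(score)                   # Min, 1st Qu., Median, Mean, 3rd Qu., Max
quantile(score, c(0.25, 0.75))   # the quartiles on their own

# A Tukey boxplot per specialization gives the overall view at a glance.
boxplot(score ~ spec, data = runs, las = 2,
        ylab = "score", main = "Score distribution per specialization")
```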

The max value is emphasized, yet it is not used in ranking the specializations. This nudges readers toward the wrong intuition.

Even worse, the columns are computed from different data! The computation of \(lb\_ci\) is not linked to \(\bar{x}\) as you would expect. What happens under the hood is that the data is sliced, taking the top-k runs of each specialization. The \(lb\_ci\) value is computed on this k-observation slice, but \(\bar{x}\) is computed on the full \(n\) observations of the data set. Internally, an SD is also derived from the full \(n\) observations, but it is then used in the computation of the k-observation \(lb\_ci\). It's a mess, and it misleads.

As a result, you get situations where \(lb\_ci > \bar{x}\), which is impossible for a confidence interval computed around the mean of the same data. The astute reader with some knowledge of confidence interval computations will immediately flag the numbers as bogus, but the less inclined reader might overlook it.
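
A small simulation illustrates how this falls out. The numbers below are made up, but the structure follows the description above: the SD and \(\bar{x}\) come from the full sample, while the mean inside \(lb\_ci\) comes from the top-k slice only.

```r
# Sketch of the mix-up as I understand it, on simulated scores.
set.seed(1)
score <- rnorm(500, mean = 100, sd = 15)   # full n = 500 observations
k     <- 100
top_k <- sort(score, decreasing = TRUE)[1:k]

x_bar   <- mean(score)                              # full-sample mean
sd_full <- sd(score)                                # full-sample SD
lb_ci   <- mean(top_k) - 1.96 * sd_full / sqrt(k)   # bound on the top-k slice

c(x_bar = x_bar, lb_ci = lb_ci)
```

On this simulated data, \(lb\_ci\) lands well above \(\bar{x}\), exactly the pattern the table shows.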

Ultimately, the computation spreads out the difference between specializations more than the data really supports. This is great for providing content, but in reality it is a tragedy because the content is useless junk.

The home page links to Evan Miller's work on ranking items, in particular Ranking Items With Star Ratings and How Not To Sort By Average Rating. Yet none of those documents strictly applies to Subcreation's situation. We are not rating specs by a star score (1-5 stars), nor are we looking at a Reddit-style voting system with positive and negative votes. Linking to them might mislead the reader into thinking the computations are grounded in those posts; in reality, the internal computation is entirely different from Miller's work.
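
For contrast, the method from How Not To Sort By Average Rating ranks by the lower bound of a Wilson score interval on a proportion of positive votes. A sketch (the function name is mine) shows why it does not transfer: it needs counts of positive and negative votes, not continuous dungeon scores.

```r
# Lower bound of the Wilson score interval for a proportion of up-votes.
wilson_lower <- function(pos, n, conf = 0.95) {
  if (n == 0) return(0)
  z    <- qnorm(1 - (1 - conf) / 2)
  phat <- pos / n
  (phat + z^2 / (2 * n) -
     z * sqrt((phat * (1 - phat) + z^2 / (4 * n)) / n)) / (1 + z^2 / n)
}

wilson_lower(pos = 90, n = 100)   # requires binary up/down data
```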

In fact, Miller wrote a follow-up post, Bayesian Average Ratings, in which he drives a stake through his own vampire:

The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we’d like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data. — Evan Miller

As a side note, using a Bayesian approach for this data is arguably the right thing to do.
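
As one possible direction, and not what SC does, a simple normal-normal model would shrink each spec's mean toward the overall mean and rank by a posterior quantile. The priors below are placeholder choices of mine:

```r
# Sketch of a Bayesian treatment: shrink each spec toward the overall mean
# and rank by a posterior quantile instead of an ad-hoc bound.
posterior_lower <- function(x, mu0, tau0, sigma, prob = 0.05) {
  n        <- length(x)
  post_var <- 1 / (n / sigma^2 + 1 / tau0^2)
  post_mu  <- post_var * (sum(x) / sigma^2 + mu0 / tau0^2)
  qnorm(prob, mean = post_mu, sd = sqrt(post_var))   # 5% posterior quantile
}

mu0   <- mean(runs$score)   # prior centred on the overall mean
tau0  <- sd(runs$score)     # weak prior: as wide as the data itself
sigma <- sd(runs$score)     # shared observation noise, for simplicity

tapply(runs$score, runs$spec,
       posterior_lower, mu0 = mu0, tau0 = tau0, sigma = sigma)
```

Specs with few observations are pulled toward the middle, which is exactly the kind of honesty the current computation lacks.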

When presenting data, it is good practice to plot them. Visualization is a great way to describe the data and its underlying structure. Not only does it explain the data, but it makes people think. In general, the less played specs have less certainty in their confidence intervals. This is something people need to decide on for themselves: do you take the gamble on a less certain spec? But if the presentation hides this subtle point, people might not even consider it.
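
One way to keep the uncertainty visible is to plot each spec's mean together with its interval. A sketch, again assuming the hypothetical `runs` data frame:

```r
# One point per spec with its 95% interval; rarely played specs get
# visibly wider ranges.
library(ggplot2)

ci <- do.call(rbind, lapply(split(runs, runs$spec), function(d) {
  m  <- mean(d$score)
  se <- sd(d$score) / sqrt(nrow(d))
  data.frame(spec = d$spec[1], mean = m,
             lo = m - 1.96 * se, hi = m + 1.96 * se, n = nrow(d))
}))

ggplot(ci, aes(x = reorder(spec, mean), y = mean, ymin = lo, ymax = hi)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = NULL, y = "score",
       title = "Mean score per spec with 95% CI (wider = less data)")
```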

The tiers are computed on the full specialization data. Even when looking at healers, the tiers include valuations of DPS and tanks. The tiers are then clusters computed via the Ckmeans algorithm: specs are binned into clusters based on the value of \(lb\_ci\). This assignment is rather arbitrary, though; it is a hack, providing no statistical value.
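
For illustration, the tiering step as described can be reproduced in a few lines, assuming the Ckmeans.1d.dp package (the usual R implementation of Ckmeans) and a named vector `lb_ci` of per-spec values:

```r
# 1-dimensional clustering of the lb_ci values into tiers.
library(Ckmeans.1d.dp)

fit   <- Ckmeans.1d.dp(lb_ci, k = c(2, 6))   # let it pick 2-6 tiers
tiers <- split(names(lb_ci), fit$cluster)    # hard assignment: one tier each
tiers
```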

One weakness of a clustering algorithm such as Ckmeans is that it forces the placement of a specialization into a single tier. For some specializations, it is rather clear they sit in the middle of a tier, and that this tier is the most likely one for them. But others sit on the boundary between two tiers, which makes their placement in one or the other rather arbitrary. Rather than a hard placement, one could use a softer placement where specializations are put into tiers with a probability. This allows a specialization to be in two tiers with roughly equal probability, resolving the boundary problem. See, for example, Gaussian Mixture Models, which also allow the clusters to take a non-circular shape.
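
A sketch of such a softer placement, using the mclust package on the same hypothetical `lb_ci` vector (package choice and data are my assumptions, not something SC does):

```r
# A 1-dimensional Gaussian mixture over the lb_ci values: each spec gets a
# probability of belonging to each tier rather than a hard label.
library(mclust)

fit <- Mclust(lb_ci, G = 2:6)    # mixture with 2-6 components
round(fit$z, 2)                  # per-spec membership probabilities
fit$classification               # hard labels, if you still want them
```

A spec near a boundary then shows up with, say, a 55/45 split between two tiers instead of a seemingly confident hard label.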

In the following section, we control for the noise in the data. Doing so shrinks the differences between specializations such that they all fall within the same tier. That is, the effect of individual specialization choice is far less impactful than what SC leads you to believe.