3  Critique of this site

People have criticized some parts of this site. I think that’s a good thing. The way we find flaws in an approach is by pointing them out, leading to corrections down the road. Producing good work is a process, and we are bound to run into the guard rails now and then when we approach our goal.

It is the same approach I took with Subcreation: document your findings, and make your case. It is way better to underpin your argument with sound rationale than to go by feeling alone, even if your intuition might be right. You want your approach to be actionable in some way, so things can gradually improve.

You’re just angry because Mistweaver is F-tier

Maybe I shouldn’t have written that, as it invites critique of the content by means of critiquing its author. I’m trying very hard to be as impartial as possible in the data processing. If anything, this work doesn’t show that Mistweavers are suddenly the strongest specialization in the game. What it does show, however, is that there’s more uncertainty about Mistweaver’s spec strength, mostly due to low participation, and that the differences among specializations are probably smaller than people think.

I think that’s an important part of the story.

Also, I think one should take data such as this with plenty of salt. Your opinion should be informed by your own qualitative experience, visualization, data analysis, prior knowledge, and so on. Data analysis is just one piece of the puzzle, and you shouldn’t rely on it as the sole pillar. If the data analysis here points in a totally different direction than the other inputs you use, you should try to figure out why.

In the grander scheme of things, I think you want to stand on firm statistical ground, or at least ground as firm as you can possibly make it. Underperforming specs deserve a buff, and overperforming specs might require a nerf. But if you don’t have the analysis right, you might end up drawing flawed conclusions, conclusions the data doesn’t support. The work here allows your intuition to be better supported by data in many cases.

Your analysis just squeezes all the specs closer to each other.

Indeed.

You can still make the case that strength differs among specs. You just have to zoom in a bit on the plot. You can also still make a tier list, though you’d have to use a different clustering approach than the one Subcreation currently takes: you’d have to look at something other than the whole dataset for the cluster centers.
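
As an illustration only, here is a minimal sketch of one such alternative: cluster the per-spec score estimates themselves into tiers. Every spec score below is made up, and the choice of four clusters is arbitrary.

```python
# A minimal sketch (not Subcreation's method): cluster the per-spec score
# estimates themselves into tiers. All spec scores below are made up.
import numpy as np
from sklearn.cluster import KMeans

spec_scores = {                      # hypothetical per-spec score estimates
    "Mistweaver": 180.9, "Restoration Druid": 183.2, "Holy Paladin": 182.5,
    "Restoration Shaman": 183.0, "Holy Priest": 181.4,
    "Discipline Priest": 180.2, "Preservation Evoker": 182.7,
}
names = list(spec_scores)
X = np.array([spec_scores[n] for n in names]).reshape(-1, 1)

# Four clusters ~ four tiers; the choice of k is itself a judgment call.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Rank clusters by their center so tier 0 is the strongest.
order = np.argsort(-km.cluster_centers_.ravel())
tier_of = {int(c): t for t, c in enumerate(order)}
for name, label in zip(names, km.labels_):
    print(f"{name}: tier {tier_of[int(label)]}")
```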

When you average over a larger population, the variance naturally shrinks, and the means get closer to each other. This has to be kept in mind. But a larger population also reduces the bias a smaller one might induce.

The squeeze also shows you are getting closer to a “noise floor” of sorts. There’s a point where you can’t realistically get any more balance between specializations. If they are close, then what matters is skill expression.

Your analysis misses the point. Subcreation is all about what the best players are playing, not spec strength.

This would be a valid critique, were it not for the fact that Subcreation is specifically not doing that.

First, Subcreation includes regional data from the weaker Asian regions. This data then “buffs” the more popular specs to higher scores. Popular and strong often correlate, but not always.

Second, if you are interested in what the best players are doing, you shouldn’t model spec strength. You should model players instead. Subcreation pseudoreplicates players by assuming their runs are independent. This means that a really good player who is very active can skew the results. So in the end, you are not modeling what the best players are playing, but rather what the most active players are playing.

The extreme example is that you are picking the top 100 best runs for each spec, but a single player might have 20+ runs that week. So a small handful of players can dominate the representation by means of activity alone.
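
A minimal sketch of one way around that pseudoreplication, with hypothetical column names rather than Subcreation’s actual data layout: collapse the data to one best run per player before summarizing a spec, so each player counts once.

```python
# Minimal sketch: keep a single best run per player before summarizing a
# spec. The DataFrame and its columns are hypothetical illustration data.
import pandas as pd

runs = pd.DataFrame({
    "player": ["a", "a", "a", "b", "c", "c"],
    "spec":   ["mistweaver"] * 3 + ["restoration"] * 3,
    "score":  [410, 408, 405, 402, 415, 401],
})

# One row per player: their single best run that week.
best_per_player = (runs.sort_values("score", ascending=False)
                        .drop_duplicates("player"))

# Spec summaries now weight each player once, not once per run.
print(best_per_player.groupby("spec")["score"].mean())
```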

Using 2700 M+ rating score as a cutoff point is bad because a 2700 score is too low to grab the top players.

This pertains to the leaderboard approach to spec strength in the Specializations chapter. We just need a cutoff such that we can fit the model properly. The reason 2700 is chosen is that we need to get rid of the modalities around KSM and KSH, so they don’t mess with our analysis. Cutting a good deal above 2500 gets rid of that. It’s about the top 5% of the population.

The assumption is that above 2700 rating, people start to know what they are doing, and they are able to utilize the spec to its full potential. Of course, people playing above 3200 rating will be even better, but that’s expected.

We are then looking at the distribution of players in that range. We don’t care about any individual player as much as about how players are distributed above 2700 rating. What we care about is whether all players in a given specialization group are playing at a higher skill rating as a whole. Suppose a spec had fewer players around 2700 and more players in every skill bracket above that. That would lead to the spec being “shifted upwards” in scoring, across the board. And that would be an indicator of an effect coming not from the individual players, but from the spec. The same happens in the other direction.

The whole premise, then, is that these shifts in the groups of players playing a certain spec are a way to estimate the effect each spec has on M+ score. A stronger spec means our density has more mass in the higher skill brackets, and we exploit this to gauge how strong specs are.

Of course, this modeling rests on the underlying assumption that spec strength affects a wide skill bracket. That is, if a spec is strong, that strength shows up in more than the top 0.1% of players. Here, we assume the strength is present once you get above 2700 M+ rating. I think that’s a reasonable assumption, but as always, your mileage may vary.
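
As a rough, synthetic illustration of the shift idea, and not the actual model used here, you can compare each spec’s ratings above the cutoff to the pooled population. All numbers below are fake.

```python
# Synthetic illustration of the "shift" idea: above the 2700 cutoff, compare
# each spec's rating distribution to the pooled distribution.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "spec": ["popular_spec"] * 5000 + ["rare_spec"] * 500,
    # Fake data: rare_spec gets a small upward shift on top of the same tail.
    "rating": np.concatenate([
        2700 + rng.exponential(120, 5000),
        2730 + rng.exponential(120, 500),
    ]),
})

above = ratings[ratings["rating"] >= 2700]
pooled_mean = above["rating"].mean()

# A spec whose whole group sits higher than the pooled population shows up
# as a positive shift; a proper model still has to account for group size.
shift = above.groupby("spec")["rating"].mean() - pooled_mean
print(shift)
```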

This is way overkill.

Yes.

I could have stopped at the critique of Subcreation.

But I like building models, and much of the further work revolves around the question of “What would a model even look like, for data such as this?”

Most PvP games use some kind of internal ranking system (matchmaking rating, MMR), in which a win increases your rating and a loss decreases it. Here, the players play against an environment, and we only measure success. Yet there’s a limit to the number of attempts one can make at a given dungeon difficulty, mainly due to time and key availability. In a certain way, we are exploring what an MMR system would look like for games of this kind.

The key insight is that dungeons, player skill, and specializations have to be handled simultaneously. If you want to answer questions about one aspect, you need to control for the other aspects.
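
A minimal sketch of that simultaneous handling, and not the exact model used here: treat dungeon and spec as fixed effects and the player as a grouping (random) effect, so spec estimates are adjusted for which players ran which dungeons. All names and numbers below are synthetic.

```python
# Minimal sketch: model dungeon, spec, and player together on synthetic data,
# with generic names. Dungeon and spec are fixed effects, player is a
# random effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
runs = pd.DataFrame({
    "dungeon": rng.choice(["d1", "d2", "d3"], n),
    "spec": rng.choice(["s1", "s2", "s3"], n),
    "player": rng.choice([f"p{i}" for i in range(80)], n),
})
# Fake scores: a dungeon effect + a spec effect + a per-player effect + noise.
dungeon_fx = {"d1": 0.0, "d2": -5.0, "d3": 3.0}
spec_fx = {"s1": 4.0, "s2": 0.0, "s3": 1.0}
player_fx = {f"p{i}": rng.normal(0, 6) for i in range(80)}
runs["score"] = (2800
                 + runs["dungeon"].map(dungeon_fx)
                 + runs["spec"].map(spec_fx)
                 + runs["player"].map(player_fx)
                 + rng.normal(0, 4, n))

# The spec coefficients are estimated while controlling for dungeons and
# for which players happened to play which spec.
model = smf.mixedlm("score ~ C(dungeon) + C(spec)", runs, groups=runs["player"])
print(model.fit().summary())
```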

There’s way too few Mistweavers among the top-100. So this analysis is useless.

The following is not specific to Mistweavers, but they are a great example.

It mostly has to do with specialization popularity. If a spec isn’t popular, it’s less likely to occur at the very top. If you imagine randomly drawing samples from the top players, many of those draws would contain zero Mistweavers, because they aren’t that prevalent as a whole.

But this isn’t an indicator that the spec is weak.

If we take the data at face value, we are looking at the whole population of top players, not a sample of them. But (frequentist) statistics can still be used if we frame the question correctly. We can ask what would happen if we had even more players, or infinitely many players; then we would be interested in what happens in the limit. In this view, the full data is just a sample, and we entertain the idea of what would happen if we had even more players playing at the top.

TL;DR—It’s perhaps a bit counterintuitive, but if the process is random, then this absence of a spec might occur fairly often.
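
A back-of-the-envelope calculation, with a made-up population share, shows how often that absence happens purely by chance:

```python
# Back-of-the-envelope: with a made-up 1% population share, how often does
# a random top-100 draw contain zero players of that spec?
p_share = 0.01                      # hypothetical share, not the real number
n_top = 100
p_zero = (1 - p_share) ** n_top     # binomial P(0 successes in n_top draws)
print(f"P(zero such players in a top-{n_top} draw) = {p_zero:.2f}")  # about 0.37
```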

You run into three factors that end up having a massive effect on the outcome:

Hence, the absence of certain specs at the very top shouldn’t be that surprising.

The key idea of this work is to look across all players of a given specialization. If they are all playing better than average, across the board, you would say that’s a contribution not by individuals, but one attributable to the spec. The closer to the very top of the player base you cut, the more you run into the above factors, and you risk not measuring anything about the spec.

Finally, one should be cautious when using the very top of the curve. When you look at the data, it’s pretty obvious that everything gets very noisy at the top. Some of the classical statistical models get into trouble because the top players form a heavy tail, and the noise gets so large it ends up affecting your analysis. If we had insisted on going forward with those models, you would perhaps start removing the top players on the grounds that they are outliers and skew the data too much. It’s one of the reasons I eventually went with a Bayesian approach, because it’s better at dealing with noise in the data.
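
As a small, synthetic illustration of the heavy-tail problem, and not the Bayesian model itself, fitting a normal and a Student-t to the same contaminated sample shows how much the extreme top can pull a thin-tailed model around:

```python
# Synthetic illustration: fit a normal and a Student-t to the same sample
# containing a handful of extreme top players, then compare locations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
bulk = rng.normal(2750, 40, 2000)     # the bulk of the population
tail = rng.normal(3300, 80, 100)      # a handful of extreme top players
sample = np.concatenate([bulk, tail])

mu_norm, _ = stats.norm.fit(sample)       # the mean, pulled up by the tail
df_t, mu_t, _ = stats.t.fit(sample)       # typically sits closer to the bulk
print(f"normal location: {mu_norm:.1f}, student-t location: {mu_t:.1f}")
```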

Bootstrapping a spec with 3 players is a bad idea.

Bootstrapping tends to work well for small sample sizes, but it is not a magic wand. Bootstrapping doesn’t buy us additional power, and because of the small sample size, the power will be low. In turn, we’ll have a hard time detecting a difference in spec strength because our study is underpowered.

What we can show, however, is that there’s a large uncertainty for the given specialization. I think it’s worth reporting this, as it allows people to make better decisions. Playing a spec with uncertain strength is way more of a gamble than playing a spec with a more certain strength.
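
A minimal sketch, with made-up scores, of what bootstrapping a three-player spec actually gives you: not a sharper estimate, just an honestly wide interval.

```python
# Minimal sketch: bootstrapping the mean score of a spec with only 3 players.
import numpy as np

rng = np.random.default_rng(3)
scores = np.array([2712.0, 2841.0, 2930.0])   # made-up scores, n = 3

boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% interval for the mean: [{lo:.0f}, {hi:.0f}]")
```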

In some of the Bayesian models, we can get around the underrepresented specs in another way, via partial pooling. We can use data from the popular specs to make an educated guess at what the uncertainty is for the less popular ones.
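
A minimal sketch of partial pooling, in the spirit of those models but not identical to the exact ones used here: per-spec effects share a common prior, so a spec with only three players borrows strength from the others. The data and priors below are synthetic placeholders.

```python
# Minimal partial-pooling sketch on synthetic, standardized scores:
# specs 0 and 1 are popular, spec 2 has only 3 players.
import numpy as np
import pymc as pm

rng = np.random.default_rng(4)
spec_idx = np.array([0] * 200 + [1] * 150 + [2] * 3)
y = np.concatenate([
    rng.normal(0.10, 1.0, 200),
    rng.normal(-0.05, 1.0, 150),
    rng.normal(0.00, 1.0, 3),
])

with pm.Model():
    # Shared hyperpriors: this is what ties the specs together.
    mu = pm.Normal("mu", 0.0, 1.0)
    tau = pm.HalfNormal("tau", 1.0)
    # Per-spec effects, partially pooled toward the common mean.
    spec_effect = pm.Normal("spec_effect", mu, tau, shape=3)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y", spec_effect[spec_idx], sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

# The posterior for spec 2 is wide, but shrunk toward the shared mean rather
# than dictated by its 3 observations alone.
print(idata.posterior["spec_effect"].mean(("chain", "draw")).values)
```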