Over the past month, I’ve been working on a side project that investigated the Bayesian rating system of the popular website vndb.net. It is somewhat well known that the Bayesian rating tends to underrate most games, and quite severely. The goal was to see if popular machine learning algorithms could be used to create a better prior estimate when a visual novel has only a small sample of votes. Below, I have included my paper, which goes into the VNDB Bayesian system and the models I explored in greater detail.
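For readers who have not seen this kind of rating before, here is a minimal sketch of the generic idea: the raw average gets pulled toward a prior, and the pull weakens as votes accumulate. This is the textbook form, not VNDB's actual formula, and `prior_mean` and `prior_weight` are values I made up purely for illustration.

```python
def bayesian_average(votes, prior_mean=7.0, prior_weight=50):
    """Generic shrinkage estimate of a rating.

    Behaves as if `prior_weight` phantom votes, each equal to `prior_mean`,
    were added to the real votes. Both defaults are hypothetical and are
    NOT VNDB's actual constants.
    """
    n = len(votes)
    return (prior_weight * prior_mean + sum(votes)) / (prior_weight + n)


# A game with only a handful of high votes
few_votes = [9, 9, 10, 9, 8]

print(sum(few_votes) / len(few_votes))  # raw sample mean
print(bayesian_average(few_votes))      # pulled toward prior_mean
```

With so few votes, the phantom votes dominate, so a game whose raw average sits above the prior ends up rated well below it. How good the result is depends almost entirely on how good the prior is, which is why a better prior was the target of this project.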
In the end, I discovered a couple of patterns in the data that the machines could identify to get pretty good estimates for the mean score of a visual novel. It turns out that voters are pretty consistent at rating games that have certain combinations of attributes.
It also turns out that voter bias does contain useful information for the construction of a prior. One flaw/feature in the design was that the total average of votes was used as the “true rating.” It is clear that any attempt to remove the bias from the votes will cause the average of a subset of votes to be a biased estimator of the total average, making it perform worse as the size of the sample gets larger. If we were to use the full formulation of the Platt-Burges model, the true rating would also remove the voter bias, and the sample average with voter bias removed would perform “better” than the sample mean itself. It may be of philosophical interest to know what this “unbiased average rating” is for a visual novel, but in practice, I don’t see it being particularly more useful than the sample mean itself.
I also discovered that the problem that the Bayesian score was trying to fix wasn’t a particularly important one. From the Central Limit Theorem, we know that the sample mean tends to be normally distributed around the population mean, with smaller variance as the sample gets larger. This project essentially demonstrated the CLT in action, and showed that even when the number of votes was as small as 30, the sample mean does a pretty good job of estimating the rating of the game, much better than any prior estimate we can generate. A reasonable person can see that a game has a very small number of votes and not let that sample mean heavily influence their opinion of the true rating of the game, and if it were really necessary to create a ranking list, one could simply include only games that have more than a certain threshold of votes.
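To make the CLT point concrete, here is a small simulation sketch. It does not use the VNDB dump; the vote population is synthetic, and the center of 7.5, the spread of 1.8, and the helper name `mean_abs_error` are all assumptions chosen for illustration. It measures how far the mean of a random subset of votes typically lands from the full average, which plays the role of the “true rating.”

```python
import random
import statistics


def mean_abs_error(population, sample_size, trials=2000, seed=0):
    """Average |sample mean - full mean| over many random subsets of votes."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        # Draw votes without replacement, like seeing only the first few voters
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


# Synthetic votes on a 1-10 scale for one hypothetical game (not real data)
vote_rng = random.Random(42)
population = [
    min(10, max(1, round(vote_rng.gauss(7.5, 1.8)))) for _ in range(1000)
]

for n in (5, 10, 30, 100):
    print(n, round(mean_abs_error(population, n), 3))
```

The typical error should shrink roughly like one over the square root of the number of votes, so going from 5 votes to 30 already removes most of the uncertainty. Real vote distributions are lumpier than this synthetic one, so treat the exact numbers as illustrative only.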
One final observation, which is more trivia than anything, is that the VNDB data dump only includes votes from public users, while the website averages include hidden user votes. With only the public votes, the top 3 visual novels with more than 100 votes are, in order, 1) Rance X, 2) White Album 2, and 3) Muv Luv Alternative. If this sounds familiar, it looks a lot like the EGS ranking (at least the last time I checked). Conspiracy? Probably not. If Rance X ever gets translated to English, I expect it to return to its true spot at the top.
There is no doubt a lot more that can be discussed about the methodology. The fact that I performed the experiment on games with more than 100 votes already raises the question of whether it’s really reasonable to expect the same patterns to exist in games with truly small numbers of votes. But as I mentioned already, it kind of feels silly to care so much about what to do with the ratings for these games when the problem fixes itself so quickly as more people vote. And if these votes never come, maybe the true numerical rating of the game is a mystery better left unsolved.