Gryphon_

Reason for High Expected Values for New Tanks

Recommended Posts

I think I've figured out why the expected values we derive for new tanks always seem too high once we review the first set of data. Bjshnog was close when he posted about recency bias, but I think it's all down to the low battle count associated with any new tank for any user.

 

I think the rDAMAGE per user/tank has a wider distribution for newer tanks with lower battle counts than it does later, once those battle counts have risen and everyone 'tends toward the mean'. Each player's average converges as their battle count grows, so the spread narrows; I believe this is the central limit theorem at work.

 

This would explain why all new tanks seem to come in high, even when we regress out the skill factor: we are only looking at the top 50% of a (too) wide distribution of rDAMAGE. Later, once battle counts have risen, we have a tighter distribution and the values always come down. It's much easier to have an rDAMAGE of 3.0 in the Object 907 after 51 battles than it would be after 500, correct?
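This is easy to demonstrate with a toy simulation (a sketch with made-up distribution parameters, not real data): draw noisy per-battle rDAMAGE around a true mean of 1.0 and compare the top-half average at 50 battles vs 500.

```python
import random
import statistics

random.seed(42)

def top_half_mean(n_battles, n_players=4000, mu=1.0, sigma=0.5):
    """Mean rDAMAGE of the top 50% of players, where each player's
    rDAMAGE is the average of n_battles noisy per-battle results."""
    means = []
    for _ in range(n_players):
        battles = [random.gauss(mu, sigma) for _ in range(n_battles)]
        means.append(statistics.fmean(battles))
    means.sort()
    return statistics.fmean(means[len(means) // 2:])

# Per-player averages spread as sigma / sqrt(battles), so the top-half
# mean sits well above the true mean at 50 battles and moves much
# closer to it at 500.
bias_50 = top_half_mean(50) - 1.0
bias_500 = top_half_mean(500) - 1.0
print(bias_50, bias_500)
```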

 

If you agree that this is the cause, we need to strategize on how to counter it. We need to look at the 50-battle minimum and the 50%/median filters, as neither helps.

 

Thoughts?


Weight the player stats by Z-score. That will remove the effect of variance. You can set it up so that the expected value will always be a fixed percentage above or below the mean, regardless of what the mean or variance is.

 

You would decide how to normalize the variance based on the variance you get with comparable vehicles. Basically, the 907 would have a high mean and variance now; you would normalize its data points to have the same variance as the 430 (or an average of all three normal mediums). You can then directly compare the mean win rate and rSTATs for each tank.

 

You can then adjust the expected value by equating the win rate and seeing what damage you get on the other curve. If the expected WR of your selected comparable is ~53%, you find where 53% lies on your new normalized tank plot, and there is your expected value.
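A minimal sketch of the variance-normalization step (all numbers hypothetical; `ref_sd_430` stands in for a spread measured from a mature comparable tank):

```python
import statistics

def normalize_variance(values, target_sd):
    """Rescale a sample to a target standard deviation while keeping its
    mean: convert each point to a Z-score, then re-expand it with the
    reference spread, so small noisy samples stop looking 'wider'."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [mu + (v - mu) / sd * target_sd for v in values]

# Hypothetical per-player damage sample for a new tank, plus an assumed
# standard deviation taken from a mature comparable (the 430 here).
obj907_damage = [1500, 2600, 1800, 3100, 2200, 1400, 2900, 2000]
ref_sd_430 = 350.0

rescaled = normalize_variance(obj907_damage, ref_sd_430)
print(statistics.fmean(rescaled))   # mean unchanged
print(statistics.stdev(rescaled))   # spread now matches the reference
```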

 

The central limit theorem would still apply, but this approach would make any distortions random rather than systematically high/low, and they would tend to converge faster.


That doesn't quite explain why all new tanks (at least at high tiers) seem to have the same problem: overestimates. Higher battle counts in the tank might help, just to get the correlation coefficient higher so that the linear model has a higher slope and may be more accurate, but the recency bias is still going to exist for a significant amount of time. On top of that, the tank battle counts will probably still be insignificant compared to account total battles, and the recency bias will last until the player has about twice as many battles (probably even more, depending on how skill changes with respect to battle count).

 

Anyway, I'm going to suggest again that you weight each point by battle count when you do that rSTAT = 1 calculation, if you haven't done so already. That said, battle count is probably not good enough for this kind of calculation, so maybe something else should be used. I'm not sure how stats like damage actually vary; maybe the best way to weight the points is by log(tankBattles)*log(accountBattles-tankBattles)*log(tankBattles/accountBattles). I'm not sure. Also, I don't know if it's practical for you, but you shouldn't compare the tank stats to the complete account stats. Each tank's stats should be subtracted from the account total, because otherwise you are correlating the tank's stats with themselves.
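On the last point, the 'subtract the tank from the account' correction is straightforward; a sketch with assumed field names and made-up totals:

```python
def account_minus_tank(account, tank):
    """Account totals with one tank's contribution removed, so a tank's
    stats are never regressed against a total that already contains them."""
    return {k: account[k] - tank[k] for k in account}

# Hypothetical totals:
account = {"battles": 20000, "damage": 24_000_000}
obj907 = {"battles": 300, "damage": 750_000}

rest = account_minus_tank(account, obj907)
avg_damage_rest = rest["damage"] / rest["battles"]  # per battle, tank excluded
print(rest, avg_damage_rest)
```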


The mean of the top half of a Gaussian is farther from the overall mean the larger the variance is. The small sample tanks will nearly always have a larger variance, automatically producing a high bias. 
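For a normal distribution this bias has a closed form: the mean of the upper half of N(mu, sigma) is mu + sigma*sqrt(2/pi) (the half-normal mean), so the overshoot scales linearly with the spread. A quick check with hypothetical spreads:

```python
import math

def top_half_mean_gaussian(mu, sigma):
    """Mean of the upper half of a N(mu, sigma) distribution."""
    return mu + sigma * math.sqrt(2 / math.pi)

# Hypothetical rDAMAGE spreads: a fresh tank vs a mature one.
new_tank = top_half_mean_gaussian(1.0, 0.30)   # wide spread, larger bias
old_tank = top_half_mean_gaussian(1.0, 0.10)   # tight spread, smaller bias
print(new_tank, old_tank)
```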

 

How large of an effect does recency have on a normal tank? When the STB-I was released how high were the first expected values compared to today?



 

Yes, you get it. The expected damage values for the STB-1 have been as follows:

 

v15: 1859 (an estimate)

v16: 1926

v17: 1969

v18: 1969

v19: 1959

 

Not quite the pattern we were looking for, but that's what it is.


So for the STB there was basically no change.

 

What about the 140 and 430? I am curious about recency and how large of an effect it has. 

 

If recency seems to not be an issue it would be easier to adjust the special tank values by normalizing the variance and then shifting the mean. 



 

It's not easy to do any of that, especially as the process for WN8 is documented on the wiki and that isn't part of it. The fundamental problem I have is that after the 50% filter, despite starting with 25,000 accounts, I have a bunch of unpopular/new/rare tanks where I have fewer than a hundred players' worth of data. If only more people would upload to vbaddict...

 

If we start to mess with the >50 battles-per-tank filter or the 50% filter itself, everything changes. Worst of all, a mass of low-tier data floods in and the expected values for the low tiers would plummet; we can't go there.

 

The irony here is that the 907 update from the original estimate was based on over 300 accounts of data, so apart from the low-battle-count effect we've been discussing, the method is sound. Being subjective now: I don't have a 907, but I've fought a few and it seems to be a very good tank, better than the STB even, so it's not surprising (at least to me) that its damage value comes in just above the STB-1's.


Is it possible that the expected values of the new tanks are perfectly fine, and the disagreement in values originates from errors in appraising the expected performance in old tanks?

 

So in other words, instead of recency bias it's another form of legacy bias, which is a known issue with WN8 from the Praetor days.

 

 

The 907/STB etc. values represent the current skill level of the players who play them. The older tanks would tend to reflect the average skill level of the driver over his/her career.

 

Ignore this post if this is a rehash of something previously dealt with.

 

 

 

Anecdotally, I seem to see many more players with recent WN8 performances that are LOWER than their base WN8 rating than I did in the past.


Find a way to base the WN8 expected damage values on the recent overall performance of tanks, like an average over the last 6 months or so, and WN8 would be 10 times better than it is now.

It's a very different game now than it was 2 years ago, with much higher HP pools and more damage.



I did this months back. There wasn't much interest, and it wasn't possible to push the T-62's expected values above the M48, which is apparently a fundamental requirement. The most recent version is here:

 

https://docs.google.com/spreadsheets/d/1h2HtBoginNRF4Q4-5ZLlT15BQBnLOZZGRXb-ZCCfdqk/edit?usp=sharing

 

There's some possible corruption from strongholds & team battles. They're still broken in the public API and difficult to remove accurately. Trying to work around that at the moment, but the method will only work on new data, if at all.

 

The 907 does currently come out as overpowered, especially on winrate, but you can easily explain that by the armour layout. It should generate ricochets far more easily than the other mediums.



 

RN, I've got to the point where I am prepared to ditch the 'as-documented' WN8 method of filtering the dataset, as long as we can clearly explain to the community why we did it. I think the argument that the game has changed is a powerful one and hard to dispute, but this drives us to using 'recent' data only. Can you please confirm that the link you posted has an analysis based on recent data? Are you looking at the WG database at two points in time and working with the difference? This is something I'm going to try with the vbaddict data, as I now have a year's worth, sampled at 3-month intervals. I should be able to do some R scripting to read in two datasets 3 or 6 months apart and output a 'difference' dataset that will be 'recent' data. Does that make sense?
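For what it's worth, the snapshot-diff idea can be sketched like this (structure and field names are assumptions for illustration, not vbaddict's actual format):

```python
# Subtract two snapshots of per-player, per-tank cumulative totals to
# isolate the battles played in between: the 'recent' dataset.
def recent_stats(old, new):
    recent = {}
    for key, cur in new.items():
        prev = old.get(key, {"battles": 0, "damage": 0})
        battles = cur["battles"] - prev["battles"]
        if battles > 0:  # player touched this tank during the interval
            recent[key] = {"battles": battles,
                           "damage": cur["damage"] - prev["damage"]}
    return recent

# Hypothetical snapshots three months apart:
old = {("player1", "STB-1"): {"battles": 400, "damage": 760_000}}
new = {("player1", "STB-1"): {"battles": 450, "damage": 870_000},
       ("player1", "Obj 907"): {"battles": 30, "damage": 75_000}}

recent_diff = recent_stats(old, new)
print(recent_diff)
```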

 

Then, using that data, but without the WN8 filter, I intend to run some plots for each tank showing a regression of user rSTAT per tank vs user rSTAT overall, as I still think the 'tank vs overall' method is by far the best way of providing expected values that are balanced across all tanks and types and that account for varying skill levels. I might try running the data by each type in turn, so that user overall is per type, not across all types, which might make a breakthrough in the quality of output. What do you think?



An important part of the method is to make use of the "total XP" API value to discard battles that are likely to have been played during stock grinds (essentially a far improved version of the 50 battle filter). For this to work effectively, you need to sample at a relatively high frequency and then selectively merge the samples by player. That sheet uses eight one-week interval samples.

 

Currently I also use the total XP value to adjust performance according to likely crew skill, although this is of questionable value now that WG have broken tiers 1-3. It was mainly necessary to get quality data below tier 5, and may be counter-productive at high tiers.

 

 

The second problem is that you need a far wider sample for recent data than for overall data. Regressions for tanks with <10k battles are clearly unreliable, especially for high-variance parameters like winrate and defence. Unlike with the overall method, this error doesn't persist over time: Tanks with small battle counts will have results that jump around wildly between versions.

 

I suspect the VBaddict sample doesn't have sufficient data and you'd need to be hitting the API instead. The API unfortunately has some significant problems:

 

1. Tanks/stats will (apparently randomly) generate a small proportion (~0.1%) of junk values. I suspect this is caused by the server not locking accounts during updates. There's no easy method of detecting this problem: I sanity-check by comparing values like wins <= battles in the weekly diffs and throwing away the whole account's data if there's a glitch.

 

2. Strongholds and team battles are frequently missing from tanks/stats. This wouldn't be a problem for interval data except that they get drip-fed back in, so random battles (which are the difference between "all" and everything else) will seem to disappear. There's no easy way to correct for this. Currently I'm including strongholds and team battles in the results to work around the problem, although I'm trying another trick on new data.

 

3. The API is not stable. Features are frequently added or removed at a moment's notice. Any data collection requires maintenance.
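The sanity check from point 1 can be sketched as follows (a simplified illustration with only two counters; real tank records have many more fields):

```python
def account_is_sane(prev, cur):
    """prev/cur map tank_id -> (battles, wins). Reject the whole account
    if any counter went backwards or wins exceed battles in the diff."""
    for tank_id, (battles, wins) in cur.items():
        prev_battles, prev_wins = prev.get(tank_id, (0, 0))
        d_battles = battles - prev_battles
        d_wins = wins - prev_wins
        if d_battles < 0 or d_wins < 0 or d_wins > d_battles:
            return False
    return True

ok = account_is_sane({1: (100, 55)}, {1: (110, 60)})        # +10 battles, +5 wins
glitched = account_is_sane({1: (100, 55)}, {1: (105, 62)})  # +5 battles, +7 wins
print(ok, glitched)
```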

 

It's possible that the effect of strongholds and team battles is insignificant, and so point 2 isn't a problem. WG may well fix this eventually.

 

 

Overall rSTAT vs tank rSTAT regression is what I use. I tested various regression methods on random subsets of samples. Weighting each data point by tank battles was far better than any other weighting I tried, and there was no advantage in filtering out small samples. Least-squares was fractionally better than least-absolute-deviation for most tanks, while bisquare robust regression and three-median methods were much worse.
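A minimal closed-form version of that weighted regression for a single predictor (illustrative numbers only; the real fit runs over the full player set):

```python
def weighted_linear_fit(xs, ys, ws):
    """Weighted least-squares slope and intercept for y ~ a*x + b."""
    sw = sum(ws)
    mx = sum(w * x for x, w in zip(xs, ws)) / sw
    my = sum(w * y for y, w in zip(ys, ws)) / sw
    cov = sum(w * (x - mx) * (y - my) for x, y, w in zip(xs, ys, ws))
    var = sum(w * (x - mx) ** 2 for x, w in zip(xs, ws))
    slope = cov / var
    return slope, my - slope * mx

# x = overall rSTAT, y = tank rSTAT, w = battles in the tank (made up).
xs = [0.8, 1.0, 1.2, 1.5]
ys = [0.9, 1.1, 1.3, 1.7]
ws = [20, 150, 300, 60]
slope, intercept = weighted_linear_fit(xs, ys, ws)

# The expected value is where a player of overall rSTAT = 1 lands:
expected = slope * 1.0 + intercept
print(slope, expected)
```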

 

I currently start the regressions from a neutral point, with all tanks being equal. Native slope & intercept combinations are used the whole way through. This needs some nudging to generate the correct result: For the higher-variance parameters, I run the first few passes with a zero-intercept regression and only switch to the floating regression later. Otherwise it tends to pick the "everyone has the same skill" solution. Smaller samples have more trouble finding the correct solution.
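For reference, the zero-intercept pass collapses to a one-parameter fit, which is what keeps the early iterations away from the 'everyone has the same skill' solution (a sketch with toy data):

```python
def zero_intercept_slope(xs, ys, ws):
    """Weighted least-squares slope with the intercept pinned at zero."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    den = sum(w * x * x for x, w in zip(xs, ws))
    return num / den

# Toy data: tank rSTAT exactly 1.1x the overall rSTAT.
slope0 = zero_intercept_slope([0.5, 1.0, 1.5], [0.55, 1.10, 1.65], [10, 10, 10])
print(slope0)
```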

 

Regressing by class is interesting, but you'd need a lot more data for the same accuracy, which likely rules it out as a practical option.

 

 

One final issue that needs discussion is that the metagame can change drastically. The most obvious example is artillery damage, which has dropped by ~15% across the board since the missions started. I'm not sure what to do with this.


Very interesting. Let me spend some time catching up by creating a difference/'recent' dataset from the vbaddict data, with a 3-month delta, as my starting point, and run some regressions at account level, weighting each data point by tank battles. We can then compare the expected values I come up with against the ones you did. There are benefits in using two different datasets and sampling intervals but the same regression: if our results are close, we'll be able to move rapidly forward from there. If not, we need to figure out the reasons for the difference and see if we can eliminate them.
