bjshnog

⟪WN9⟫ Development


This thread is for developing WN9. The OP will be updated when it is closer to release.

 

Current concept:

  • Somewhat similar to WN8 in that expected values are used.
  • Instead of multiplicative terms (... + a*b + ...), which grow along a parabolic curve, square-rooted versions (sqrt(a*b)) will be used, which linearize the parabola (and consequently improve accuracy).
  • Scales differently depending on tanks played, so as to make similar ratings between tanks approximately equivalent to each other.
  • Separate formulas may be used for scout tanks and artillery, to fix perceptibly important problems.
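
A minimal sketch of the linearization idea above, assuming a player who performs k times the expected value in both stats a and b (the function names are illustrative, not part of any WN formula):

```python
import math


def product_term(a: float, b: float) -> float:
    """Multiplicative term as used in WN8-style formulas."""
    return a * b


def sqrt_term(a: float, b: float) -> float:
    """Square-rooted version of the same term."""
    return math.sqrt(a * b)


# A player at k times expected in both stats: the product grows as k^2,
# while the square-rooted term grows as k.
for k in (1.0, 2.0, 3.0):
    print(k, product_term(k, k), sqrt_term(k, k))
```

The product term rewards a player at 3x expected nine times as much as an average player, while the square-rooted term rewards them three times as much.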

 

Progress:

  • Still a while off.


Many people criticize WN8 for considering only damage. They forget that damage is taken into account much better than in other ratings, but there is some truth to the criticism.

 

When choosing indicators, we relied on the correlation of win rate with each indicator separately. I believe we can improve on this by looking at some indicators together. For example, we can't weight spots correctly on their own because we need to guard against suicidal spotting. But we could consider spots in conjunction with average survival time. I think that would clarify the influence of spotting and strengthen it.


You can only use information present in the API. Survival time is not part of that (requires a replay, not even available in dossier).


Frags is actually a proxy for useful survival time, which is why frags*spots works and is generally slightly better than dmg*spots. Similarly, defence only ever works as a combination with other parameters, indicating contribution throughout the game not just at the end of it.

 

While we're on the subject, pseudo-linear terms such as sqrt(frags*spots) fix a lot of problems at the top end. I suspect these weren't tried in the original Eureqa runs.

 

I did experiment with other parameters like survrate/winrate and dmgtaken-winrate*maxhp, but they're probably just platoon-padding proxies. Another possibility was using hitrate to distinguish between long and short range engagements, but there's too much junk in there. Frags may be the best proxy for close range engagements, due to the tendency of low-hp players to hide.

 

If anyone has any ideas for other combinations of the available parameters, I can test them pretty quickly.


You can only use information present in the API. Survival time is not part of that (requires a replay, not even available in dossier).

Oops, for some reason I was sure that avgtime was available in the API.


I did experiment with other parameters like survrate/winrate and dmgtaken-winrate*maxhp, but they're probably just platoon-padding proxies. Another possibility was using hitrate to distinguish between long and short range engagements, but there's too much junk in there. Frags may be the best proxy for close range engagements, due to the tendency of low-hp players to hide.

It's also really important to consider not just the current noise, but the effect if players try to game the system.

So even if you found that hit rate was a good adjustment for a stat, it would probably have a negative effect on gameplay and become unreliable.

Using one stat as a proxy for another carries a lot of risk.


Yeah, gaming the system rules out using hitrate as a negative. dmgtaken isn't massively gameable but the gaming advantage is probably more significant than the correlation benefit.

 

Similarly there's an argument for using dmg*spots rather than frags*spots, because even though frags*spots gives a slightly better correlation, frags are much more gameable than damage. On the other hand, most WN8 padders don't seem to have figured out how important the non-linear terms are at high skill levels. Security through obfuscation, maybe.


This right here has crossed my mind countless times. It should definitely be tried (in Eureqa, if we are to use it again).

 

I tried it in my own multiple linear regression solver. It works. sqrt(frags*def) is also better than frags*def. frags*dam comes out slightly better than sqrt(frags*dam) where it appears at all, but it's probably just trying to linearize data that's actually a cumulative normal distribution.

 

It should be noted that there are no huge improvements here. rDamage on its own has an R-squared of 0.930 on my data. A plain linear function is around 0.936, and the best pseudolinear functions only improve on that in the fourth significant figure.


Yeah, gaming the system rules out using hitrate as a negative. dmgtaken isn't massively gameable but the gaming advantage is probably more significant than the correlation benefit.

 

Similarly there's an argument for using dmg*spots rather than frags*spots, because even though frags*spots gives a slightly better correlation, frags are much more gameable than damage. On the other hand, most WN8 padders don't seem to have figured out how important the non-linear terms are at high skill levels. Security through obfuscation, maybe.

Despite WN7 being heavily frags dependent, I did not see any large body of evidence showing that individuals were able to manipulate their frags greatly...outside of playing against vastly less skilled opponents by dropping down into tier 1-5 or so. Perhaps I am mistaken and someone can show or even attest differently. Frags seem gameable, by holding shots, but it's unclear if anyone has been able to turn that into an artificially high rating.

 

Is it possible that in frags*spots we've found something that distills ability at the game beyond the ability to identify the worst tanks and simply exceed the rather low bars the public has set? Probably not. We looked pretty hard at spots during development...mainly because there were a lot of valid complaints about useless spotting being an option. There is a LOT of variability in spots amongst the very top players we looked at. But even amongst the ones who play more conservatively, they do spot eventually...even if it's SPGs during mop up. No doubt that Yato and Valachio are so high in their ratings because of tank selection and aggressive play, but the same goes for EJ and Barks across numerous other tanks. It's just not that easy to play that aggressively, kill that many people and keep up superior average damage game after game.

 

But can it be gamed? Absolutely, especially in aggregate, for the same reason average tier didn't work in WN6/7. There is no way for the aggregate statistics to tell that you went and got 9 spots in your T71, 6 frags in your MS-1 and then dropped huge damage in your tier 10 TD/HT/MT. Sure, each expected value for those tanks is high on spots, frags and damage respectively, but it's min-maxing each part of the inputs. Right now people are focusing on damage in low-expectation tanks, and it just so happens that the high RoF and good camo/vision of the RU meds make for well-maxed inputs on each of the main portions of the equation.

 

So it's unclear to me if it's security through obfuscation, or if it's because it's actually hard to do or simply results in "good play". During testing I was able to produce some extremely high values, but during testing I was also winning craptons of games, platooned and solo. So if trying to game the metric is also making me win lots...is it gaming the metric or am I just playing in the optimal manner? The only tank I was unable to produce silly high (4k+ for 30 or more games) WN8 without actually winning was the T71. I think that's because the autoloader is just enough to allow more frags than would be expected, but doesn't actually hit hard enough or quite have the HP to make crucial game-winning kills and trades like the 13 90. But my T71 also just carries more losses than my other LTs, by a good margin, being a solid 10% lower in wins...might just be an anomaly for me. I bet someone else can turn the T71 into stupid high WN8 and decent wins (Millard comes to mind) because of its ability to spot/dmg/frag beyond expected.

 

I tried it in my own multiple linear regression solver. It works. sqrt(frags*def) is also better than frags*def. frags*dam comes out slightly better than sqrt(frags*dam) where it appears at all, but it's probably just trying to linearize data that's actually a cumulative normal distribution.

 

It should be noted that there are no huge improvements here. rDamage on its own has an R-squared of 0.930 on my data. A plain linear function is around 0.936, and the best pseudolinear functions only improve on that in the fourth significant figure.

 

Good to have further validation months later. Where were you a year ago? Might have saved poor Praetor from burnout!


Despite WN7 being heavily frags dependent, I did not see any large body of evidence showing that individuals were able to manipulate their frags greatly...outside of playing against vastly less skilled opponents by dropping down into tier 1-5 or so. Perhaps I am mistaken and someone can show or even attest differently. Frags seem gameable, by holding shots, but it's unclear if anyone has been able to turn that into an artificially high rating.

 

 

Crab - it's not that important whether I can bring statistical evidence to the table that kill stealing resulted in a better rating for some players. By the end of WN7, nearly every yellow-green understood that kills are a huge factor in WN7. That was observable in the overall gameplay, at least on the EU server. The number of games where some fucking clueless idiot just shadowed me and waited behind me while I took the damage/shots, just to get the kill shot out ... .

It's not all about the "success" of the padding; the gameplay we promote by creating a rating is also important. It's the same with expected stats that are too low for mid/low tiers and too high in T8-10, which would result in a massive shift of above-average players going down 1-3 tiers and clubbing seals, which is a very bad thing, especially for the NA server. The gameplay improved with WN8; of course we got countered by WG and their retarded changes with 0.8.6 (sigma, TD meta). But that kill-stealing BS nearly vanished.


Double this, exactly my point. The effectiveness is secondary to the impact on the gameplay of players, because it WILL impact play. Observe the current scout behavior; you can tell it's trouble when you have a t71 shadowing an e100 to try to deal damage.


We shouldn't be deciding what style of gameplay to reward with a higher score. If you're a kill-stealing asshole but you win a ton of games because of it, does it matter? And kills are by far the strongest predictor of a win, unless you solo'd the enemy team.

 

You could do something akin to a ratio of damage to the expected total HP of the enemy team at that battle tier. For example, say your Waffle does 3k average damage a game and the expected total hit points of an enemy team is 30k (15*2000), assuming 2k is the average HP of an enemy tank in a battle tier 9-11, 15-man match (obviously wrong, but it simplifies the calculation). Then you did approximately 10% of the enemy team's health. You could further fold kills into damage dealt by valuing a kill at some HP/damage amount (such as 10%, which I think WG uses for XP calculation). The total expected enemy team health then becomes 27k and each kill is worth 200 damage. This should lessen the effect of kills (and strengthen damage more) in the equation, and it standardizes to battle tiers. I personally don't support making damage any more important than it already is in the equation.
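
One possible reading of the arithmetic above, as a rough sketch. The post does not say exactly how damage and kills would be combined, so the weighting in `kill_adjusted_share` (90% of HP credited via damage, 10% via the kill) is an assumption, as are all the names:

```python
def plain_damage_share(avg_damage: float,
                       avg_enemy_hp: float = 2000.0,
                       team_size: int = 15) -> float:
    """Average damage as a fraction of expected enemy team HP."""
    return avg_damage / (team_size * avg_enemy_hp)


def kill_adjusted_share(avg_damage: float,
                        avg_kills: float,
                        avg_enemy_hp: float = 2000.0,
                        team_size: int = 15,
                        kill_value: float = 0.10) -> float:
    """Assumed variant: each kill is worth kill_value of a tank's HP,
    and the damage pool shrinks by the same amount (30k -> 27k)."""
    total_hp = team_size * avg_enemy_hp          # 30000 in the example
    kill_bonus = kill_value * avg_enemy_hp       # 200 per kill
    damage_part = avg_damage * (1.0 - kill_value)
    return (damage_part + avg_kills * kill_bonus) / total_hp


print(plain_damage_share(3000.0))
print(kill_adjusted_share(3000.0, avg_kills=1.5))
```

Under this reading, a player who deals 3k damage and averages 1.5 kills per game ends up at the same share as in the plain version, while a player with the same damage and no kills scores slightly lower.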

 

I'm still of the opinion that the rating should be split by some other classification, such as tank type, tier or battle tier/spread, and not be an aggregate measure.


A bit of a late reply here, but anyway... It must be an aggregate measure because the stats are aggregated. If you want a more proprietary measure, you'll have to use other software.

 

About the expected total HP, I think someone should go ahead and do that analysis. It shouldn't be hard at all.


RichardNixon, on 22 Jun 2014 - 02:09 AM, said:

Here's a relatively simple way of fixing skill scaling:

 

1. Generate expected values as usual, centred on 1565.

2. Go back through your player database, calculating per-tank WN8 values with the new expected values.

3. Throw out tanks below 50 games and then the bottom 50% of tanks for each player, as usual.

4. Calculate a "recent WN8" based on the remaining tanks.

5. Throw out any players below 2500 recent WN8.

6. Average the tanks of the remainder, and the WN8 of the players of that tank.

7. Normalize the ratio between the average tank WN8 and average player WN8 to give you a scale factor per point of WN8 from 1565: scalefactor = ((tankWN8 - 1565) / (playerWN8 - 1565))

 

Final results should be a bit like this:

 

https://docs.google....dit?usp=sharing

 

Mine's a bit distorted because I'm mixing Gryphon's expected values with my own database, but you get the idea. Fast tanks have scale factors above 1.0, while slow tanks have scale factors below 1.0. Low tiers are mostly garbage results due to lack of data, but you can guess or substitute 1.0 as appropriate.

 

Once you've got that, you add a final step to the WN8 calculation:

 

1. Sum battles*scalefactor over the player's tanks. Divide by total battles to give scaleAvg.

2. Adjust with the following formula:

 

scaledWN8 = 1565 + ((wn8 - 1565) / scaleAvg)

 

So for example, a 3000 WN8 player who's only played the Maus will get 3057 scaledWN8, and a 3000 WN8 player who's only played the T62A will get 2884 scaledWN8.
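
The scaling steps quoted above can be sketched roughly as follows. The function names and the per-tank scale factors used in the example are assumptions: the factors 0.962 and 1.088 are simply back-calculated from the Maus and T62A results in the quoted post, not taken from the linked spreadsheet.

```python
from typing import Iterable, Tuple

BASELINE = 1565.0  # centre of the expected values, as in the quoted post


def scale_factor(tank_wn8: float, player_wn8: float,
                 baseline: float = BASELINE) -> float:
    """Step 7: ratio of average tank WN8 to average player WN8,
    both measured from the baseline."""
    return (tank_wn8 - baseline) / (player_wn8 - baseline)


def scale_avg(tanks: Iterable[Tuple[int, float]]) -> float:
    """Battle-weighted average scale factor over (battles, scalefactor) pairs."""
    tanks = list(tanks)
    total_battles = sum(battles for battles, _ in tanks)
    return sum(battles * factor for battles, factor in tanks) / total_battles


def scaled_wn8(wn8: float, scale_average: float,
               baseline: float = BASELINE) -> float:
    """Final adjustment: scaledWN8 = 1565 + ((wn8 - 1565) / scaleAvg)."""
    return baseline + (wn8 - baseline) / scale_average


# Hypothetical single-tank players (factors back-calculated, see above).
print(round(scaled_wn8(3000.0, scale_avg([(500, 0.962)]))))
print(round(scaled_wn8(3000.0, scale_avg([(500, 1.088)]))))
```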


I sent him a PM recommending that he come here and clear everything up. If we do something that doesn't work or we have to change something, etc, then it's best he knows what changed.


Here is the plot of the T-62a rDAMAGE vs user_rDAMAGE for all users that pass the filter:

 

[Plot: T-62A rDAMAGE vs. user_rDAMAGE]

 

The slope of the least squares fit line is 1.11, so would you consider that a useful parameter for a scaling adjustment? If so, what would the adjusted tank WN8 be for a user who had 2000 (raw) WN8 on the tank?

 

Here is the WN8 vs user_WN8 version:

 

[Plot: WN8 vs. user_WN8]


It looks to be good enough in theory, since the data is fairly linear. The slope between overall and tank WN8 is what should be used though (whatever amount of difference that will make), since that is what is being scaled.

 

If a player had 2000 WN8 in the T-62A, then their WN9 would be 1565 + ((2000 - 1565) / 1.11) = 1957. If they had 3500 WN8, it would be 3308.
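
As a rough sketch of how the slope and the adjustment fit together: the code below fits an ordinary least-squares slope of tank WN8 against overall WN8 and then applies the division around 1565. The data points are synthetic and lie exactly on a line of slope 1.11 purely for illustration; they are not the plotted data, and the function names are assumptions.

```python
from typing import Sequence

BASELINE = 1565.0


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def adjusted_tank_rating(tank_wn8: float, slope: float,
                         baseline: float = BASELINE) -> float:
    """Shrink (or stretch) the distance from the baseline by the slope."""
    return baseline + (tank_wn8 - baseline) / slope


# Synthetic example: overall WN8 on x, tank WN8 on y.
user_wn8 = [1000.0, 2000.0, 3000.0]
tank_wn8 = [BASELINE + 1.11 * (x - BASELINE) for x in user_wn8]

slope = ols_slope(user_wn8, tank_wn8)
print(slope)
print(adjusted_tank_rating(2000.0, slope))
print(adjusted_tank_rating(3500.0, slope))
```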

 

The base expected values might need to be changed a bit, or the method slightly altered to account for the presence of scaling, because I think 3308 is still a bit too high (I would estimate the proper score should be around 3000-3100, though this is just based on my experience).

 

Also, with this formula, 0 WN8 on tanks with a scale factor greater than 1 would end up with more than 0 as a rating (and with less than 0 on tanks with a factor below 1).


If you're concerned about accuracy for high-skill players then half of the problem is formula errors. WN8 is strongly non-linear with solo winrate for high rStats values, and becomes increasingly dependent on the noisier parameters. The WN8 formula is only non-linear due to lack of platoon filtering in the input data, and so the natural conclusion is that it should be fixed at the same time.

 

 

Cutting a very long story short, formulae like this work reasonably well for mostly-solo players:

 

rWinC = 0.7*rDmg + 0.1*sqrt(rFrag*rDmg) + 0.15*sqrt(rFrag*rSpot) + 0.05*sqrt(rFrag*rDef)

 

rSpot is the only parameter that really needs capping and scaling, with a minimum in the 0.33 to 0.38 range. rDmg and rFrag intercept close to 0 with rWinC, while rDef is a power function but modelling it as linear has almost no detectable impact. Capping by damage is unnecessary once the formula is linearised. Adding in rWinC is optional.
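
A minimal sketch of the general formula with an rSpot floor, under stated assumptions: the floor of 0.35 is just a value picked from the 0.33 to 0.38 range mentioned above, and the WN8-style rescaling `(rSpot - floor) / (1 - floor)` is an assumption about what "capping and scaling" means here. Inputs are assumed to be non-negative ratios of actual to expected values.

```python
import math


def scale_r_spot(r_spot: float, floor: float = 0.35) -> float:
    """Assumed capping: values at or below the floor count as 0,
    and 1.0 still maps to 1.0."""
    return max(0.0, (r_spot - floor) / (1.0 - floor))


def r_win_c(r_dmg: float, r_frag: float, r_spot: float, r_def: float,
            spot_floor: float = 0.35) -> float:
    """General (mostly-solo) formula from the post, with rSpot capped."""
    spot = scale_r_spot(r_spot, spot_floor)
    return (0.7 * r_dmg
            + 0.1 * math.sqrt(r_frag * r_dmg)
            + 0.15 * math.sqrt(r_frag * spot)
            + 0.05 * math.sqrt(r_frag * r_def))


# An exactly-average player should come out at about 1.0,
# since the coefficients sum to 1.
print(r_win_c(1.0, 1.0, 1.0, 1.0))
# Doubling every input roughly doubles the damage, frag and defence terms,
# which is the "pseudo-linear" behaviour.
print(r_win_c(2.0, 2.0, 2.0, 2.0))
```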

 

If you want a higher zero point (as in WN8) then the logical method is to cap rWinC harder, but I can't find any real justification for using a higher zero point. The average player with 0.2 rDmg still wins more games than the average player with 0.1 rDmg.

 

 

Note that high-WN8 players will take a large absolute hit from a contribution-linear formula. Switching to a completely different scale may alleviate some of the whining.

 

 

Now that the tanks/stats API mostly works, other improvements are also possible, although further out of scope:

 

1. Accurate normalization. Fixes a lot of +/-100 point drift depending on how you played your low tiers.

2. Ignoring certain tank tiers or types entirely in the overall result.

3. Penalizing players for playing the same 6-skill tank forever.

4. Using a different formula for different tanks/classes/tiers.

 

Formula variation isn't as useful as you might think. For scout tanks, R-squared only rises from about 0.65 to 0.7 compared to the general formula. The best formula for scouts looks something like this:

rWinC = 0.5*rDmg + 0.2*sqrt(rSpot*rDmg) + 0.2*sqrt(rSpot*rFrag) + 0.1*sqrt(rSpot*rDef)
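
For comparison, a rough sketch of the general and scout formulas side by side. Capping of rSpot is left out to keep it short, and the example player (below-average damage and frags, double the expected spots) is made up for illustration:

```python
import math


def r_win_c_general(r_dmg: float, r_frag: float,
                    r_spot: float, r_def: float) -> float:
    """General formula from the post (no rSpot capping in this sketch)."""
    return (0.7 * r_dmg
            + 0.1 * math.sqrt(r_frag * r_dmg)
            + 0.15 * math.sqrt(r_frag * r_spot)
            + 0.05 * math.sqrt(r_frag * r_def))


def r_win_c_scout(r_dmg: float, r_frag: float,
                  r_spot: float, r_def: float) -> float:
    """Scout formula from the post: rSpot replaces rFrag as the shared factor."""
    return (0.5 * r_dmg
            + 0.2 * math.sqrt(r_spot * r_dmg)
            + 0.2 * math.sqrt(r_spot * r_frag)
            + 0.1 * math.sqrt(r_spot * r_def))


# Hypothetical spot-heavy scout player.
player = dict(r_dmg=0.8, r_frag=0.8, r_spot=2.0, r_def=1.0)
print(r_win_c_general(**player))
print(r_win_c_scout(**player))
```

Both formulas give about 1.0 for an exactly-average player; they only diverge when the stat mix is uneven, which fits the modest R-squared gain reported above.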

 

There's also a straightforward artillery improvement that doesn't require multiple formulas. Assuming that rSpot is not correlated with success for artillery, you could simply raise the expected spots sky-high and then drop the other expected values by 15% to balance.

 

 

Finally there's always the option of switching to an expected values table based on genuine recent values rather than the current fudge. I'll have the data in a couple of weeks, but this does of course have pros and cons.


Yeah, I think it's worth having separate formulas for light tanks and artillery (treat mediums, heavies and tank destroyers as "normal tanks"). However, rather than doing a basic average based on proportion of games played in each class, I think it should be weighted or filtered by the tier also. For example, if a player has a lot of battles in the SU-26, but then also a lot of tier 8-10 battles, then the arty battles have a lot less weight, since they naturally have less bearing on rDamage (I'm sure you understand why). With light tanks, though, you'd have to classify each scout tank specifically, like all light tanks tier 5+ and then some specific lower tier ones after. Basically, you'd mark them like MM marks scout tanks. This is a small basic improvement, easy to implement, that should fix a few minor issues and make WN9 more valuable for scouts.

 

The ultimate extension of this concept is to have totally separate expected values, scales, coefficients and weights on every stat for every tank, but that's completely impractical and there are better ways to do it.

 

The scaling factors will probably still be necessary for the overall equation, even with the linearisation, since there are still skill floors and ceilings.

 

I think we should just abandon the wins part of the rating on the spot, since win rate and average tier are already intended to be used alongside it.

 

 

EDIT: Also, the baseline could be changed so that it's a simple subtraction from WN9, once we work out what the baseline should be. I don't see much reason why each stat individually should have a baseline (as in WN8) when it could just be applied to the formula as a whole.


#1 However, rather than doing a basic average based on proportion of games played in each class, I think it should be weighted or filtered by the tier also. For example, if a player has a lot of battles in the SU-26, but then also a lot of tier 8-10 battles, then the arty battles have a lot less weight, since they naturally have less bearing on rDamage (I'm sure you understand why).

 

#2 With light tanks, though, you'd have to classify each scout tank specifically, like all light tanks tier 5+ and then some specific lower tier ones after. Basically, you'd mark them like MM marks scout tanks. This is a small basic improvement, easy to implement, that should fix a few minor issues and make WN9 more valuable for scouts.

 

#3 The ultimate extension of this concept is to have totally separate expected values, scales, coefficients and weights on every stat for every tank, but that's completely impractical and there are better ways to do it.

 

#4 The scaling factors will probably still be necessary for the overall equation, even with the linearisation, since there are still skill floors and ceilings.

 

#5 EDIT: Also, the baseline could be changed so that it's a simple subtraction from WN9, once we work out what the baseline should be. I don't see much reason why each stat individually should have a baseline (as in WN8) when it could just be applied to the formula as a whole.

 

#1: This is autofixed by using tanks/stats data in the natural way, essentially performing the expected value divisions per-tank rather than on the final sum. I don't think there's a way to fix normalization without using tanks/stats.

 

#2: That's what I did for the scout formula. I suppose it does mean that tank class would be a parameter you'd add to the expected value table, rather than expecting websites to roll their own.

 

#3: Not too sure about "impractical", but the cost/benefit is certainly bad and it would be difficult to generate accurate per-tank formulas: Skill distribution is uncontrollable and the platoon filtering doesn't work well.

 

I'm not even convinced that multiple formulas are worth the cost/benefit, given that the scout formula is only a small improvement without assisted damage. The main benefit is probably that it's harder to pad, although I'd like to get experimental evidence on that. If spot-padding is straightforward then it may not be an improvement.

 

#4: Yes, the benefits of scaling still apply. Linearising the formula should also bring the regression slope method relatively close to the unicum-average method, which would make the scaling data a lot easier to collect for the low tiers. Similarly, linearisation also makes scaling work a lot better for the low side of the skill range.

 

#5: Yes, that's a much better method of handling the baseline.
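
To illustrate point #1 above, here is a rough sketch of the difference between dividing by expected values on the final sum and dividing per tank. The record layout and the numbers are hypothetical and do not reflect the actual tanks/stats response format:

```python
from typing import List, NamedTuple


class TankRecord(NamedTuple):
    battles: int
    avg_damage: float       # player's average damage in this tank
    expected_damage: float  # expected value for this tank


def r_damage_on_sum(tanks: List[TankRecord]) -> float:
    """Divide total damage by total expected damage (aggregate method)."""
    actual = sum(t.battles * t.avg_damage for t in tanks)
    expected = sum(t.battles * t.expected_damage for t in tanks)
    return actual / expected


def r_damage_per_tank(tanks: List[TankRecord]) -> float:
    """Divide per tank first, then take the battle-weighted average."""
    total_battles = sum(t.battles for t in tanks)
    return sum(t.battles * (t.avg_damage / t.expected_damage)
               for t in tanks) / total_battles


# Hypothetical player: strong in a low-damage tank, below par in a tier 10.
tanks = [
    TankRecord(battles=1000, avg_damage=300.0, expected_damage=200.0),
    TankRecord(battles=1000, avg_damage=1800.0, expected_damage=2000.0),
]
print(r_damage_on_sum(tanks))
print(r_damage_per_tank(tanks))
```

On the final sum, the low-damage tank barely moves the ratio because its absolute numbers are small; per tank, each battle counts equally regardless of tier.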


I concur that WN8 isn't great for lights and arty. I ran some plots using the current formula and dataset last night to see what sort of distribution we get for rWINc vs user_WN8 when the data is broken into 5 datasets, one for each tank type. Mediums, heavies and TDs produce plots with very good linear regression and high R-squared, but lights are so-so and arty is a mess.

 

What I get from this is that for WN9 we need a separate WN formula by type. We should be looking for more accurate results for light and arty players; any modest success in that area will outweigh by a few orders of magnitude the benefit that some unicums would get (or suffer) from a scaling modification to WN8.

 

EDIT: @ RN - try adding in Caps to the light formula. Lights often cap for the win. It will make a difference, I'm sure

