bjshnog 4,447 Posted July 1, 2014 This thread is for developing WN9. The OP will be updated when it is closer to release. Current concept: Somewhat similar to WN8 in that expected values are used. Instead of multiplicative terms (... + a*b + ...), which increase along a parabolic curve, square-rooted versions will be used, which linearize the parabola (and consequently improve accuracy). The rating scales differently depending on the tanks played, so that similar ratings in different tanks are approximately equivalent to each other. Separate formulas may be used for scout tanks and artillery, to fix perceptibly important problems. Progress: Still a while off.
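As a minimal sketch of the linearization idea described above (not the WN9 formula itself): assume `a` and `b` are stat ratios relative to expected values, as in WN8, and that a player is `k` times better than expected in both. The function names are illustrative only. The product term then grows as k squared, while the square-rooted term grows in proportion to k.

```python
import math


def product_term(a: float, b: float) -> float:
    """Multiplicative term of the kind used in WN8-style formulas."""
    return a * b


def sqrt_term(a: float, b: float) -> float:
    """Square-rooted ("pseudo-linear") version of the same term."""
    return math.sqrt(a * b)


# k = how many times better than expected the player is in both stats.
for k in (1.0, 2.0, 3.0):
    print(k, product_term(k, k), sqrt_term(k, k))
```

With equal inputs the square-rooted term simply returns k, which is why it behaves like a linear term at the top end.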
seriych 50 Posted July 15, 2014 Is there a public discussion of WN9 development anywhere?
bjshnog 4,447 Author Posted July 15, 2014 The last few posts in the WN8 thread are effectively about WN9.
seriych 50 Posted July 26, 2014 Many people criticize WN8 for considering only damage. They forget that damage is accounted for much better than in other ratings, but the criticism has some truth to it. When choosing which indicators to include, we relied on the correlation of win rate with each indicator separately. I believe we can improve on this by looking at some indicators together. For example, we can't count spots properly on their own, because the rating has to be protected against suicide scouting. But we could consider spots in conjunction with average survival time. I think that would clarify the influence of spotting and strengthen it.
stagnate 513 Posted July 26, 2014 You can only use information present in the API. Survival time is not part of that (it requires a replay and is not even available in the dossier).
RichardNixon 835 Posted July 26, 2014 Frags is actually a proxy for useful survival time, which is why frags*spots works and is generally slightly better than dmg*spots. Similarly, defence only ever works as a combination with other parameters, indicating contribution throughout the game not just at the end of it. While we're on the subject, pseudo-linear terms such as sqrt(frags*spots) fix a lot of problems at the top end. I suspect these weren't tried in the original Eureqa runs. I did experiment with other parameters like survrate/winrate and dmgtaken-winrate*maxhp, but they're probably just platoon-padding proxies. Another possibility was using hitrate to distinguish between long and short range engagements, but there's too much junk in there. Frags may be the best proxy for close range engagements, due to the tendency of low-hp players to hide. If anyone has any ideas for other combinations of the available parameters, I can test them pretty quickly.
seriych 50 Posted July 26, 2014 stagnate said: "You can only use information present in the API. Survival time is not part of that (requires a replay, not even available in dossier)." Oops, for some reason I was sure that average time was available in the API.
stagnate 513 Posted July 26, 2014 RichardNixon said: "I did experiment with other parameters like survrate/winrate and dmgtaken-winrate*maxhp, but they're probably just platoon-padding proxies. Another possibility was using hitrate to distinguish between long and short range engagements, but there's too much junk in there. Frags may be the best proxy for close range engagements, due to the tendency of low-hp players to hide." It's also really important to consider not just the current noise, but the effect if players try to game the system. So even if you found that hit rate was a good adjustment for a stat, it would probably have a negative effect on gameplay and become unreliable. Using one stat as a proxy for another carries a lot of risk.
RichardNixon 835 Posted July 26, 2014 Yeah, gaming the system rules out using hitrate as a negative. dmgtaken isn't massively gameable but the gaming advantage is probably more significant than the correlation benefit. Similarly there's an argument for using dmg*spots rather than frags*spots, because even though frags*spots gives a slightly better correlation, frags are much more gameable than damage. On the other hand, most WN8 padders don't seem to have figured out how important the non-linear terms are at high skill levels. Security through obfuscation, maybe.
bjshnog 4,447 Author Posted July 26, 2014 RichardNixon said: "sqrt(frags*spots)" This right here has crossed my mind countless times. It should definitely be tried (in Eureqa, if we are to use it again).
RichardNixon 835 Posted July 26, 2014 bjshnog said: "This right here has crossed my mind countless times. It should definitely be tried (in Eureqa, if we are to use it again)." I tried it in my own multiple linear regression solver. It works. sqrt(frags*def) is also better than frags*def. frags*dam comes out slightly better than sqrt(frags*dam) where it appears at all, but it's probably just trying to linearize data that's actually a cumulative normal distribution. It should be noted that there are no huge improvements here. rDamage on its own has an R-squared of 0.930 on my data. A plain linear function is around 0.936, and the best pseudolinear functions only improve on that in the fourth significant figure.
CraBeatOff 5,741 Posted July 26, 2014 RichardNixon said: "Yeah, gaming the system rules out using hitrate as a negative. dmgtaken isn't massively gameable but the gaming advantage is probably more significant than the correlation benefit. Similarly there's an argument for using dmg*spots rather than frags*spots, because even though frags*spots gives a slightly better correlation, frags are much more gameable than damage. On the other hand, most WN8 padders don't seem to have figured out how important the non-linear terms are at high skill levels. Security through obfuscation, maybe." Despite WN7 being heavily frags-dependent, I did not see any large body of evidence showing that individuals were able to manipulate their frags greatly... outside of playing against vastly less skilled opponents by dropping down into tiers 1-5 or so. Perhaps I am mistaken and someone can show or even attest differently. Frags seem gameable, by holding shots, but it's unclear if anyone has been able to turn that into an artificially high rating. Is it possible that in frags*spots we've found something that distills ability at the game beyond the ability to identify the worst tanks and simply exceed the rather low bars the public has set? Probably not. We looked pretty hard at spots during development... mainly because there were a lot of valid complaints about useless spotting being an option. There is a LOT of variability in spots amongst the very top players we looked at. But even the ones who play more conservatively do spot eventually... even if it's SPGs during mop-up. No doubt Yato and Valachio are so high in their ratings because of tank selection and aggressive play, but the same goes for EJ and Barks across numerous other tanks. It's just not that easy to play that aggressively, kill that many people and keep up superior average damage game after game. But can it be gamed? Absolutely, especially in aggregate, for the same reason average tier didn't work in WN6/7.
There is no way for the aggregate statistics to tell that you went and got 9 spots in your T71, 6 frags in your MS-1 and then dropped huge damage in your tier 10 TD/HT/MT. Sure, each expected value for those tanks is high on spots, frags and damage respectively, but it's min-maxing each part of the inputs. Right now people are focusing on damage in low-expectation tanks, and it just so happens that the high RoF and good camo/vision of the RU meds make for well-maxed inputs on each of the main portions of the equation. So it's unclear to me if it's security through obfuscation, or if it's because it's actually hard to do or simply results in "good play". During testing I was able to produce some extremely high values, but during testing I was also winning craptons of games, platooned and solo. So if trying to game the metric is also making me win lots... is it gaming the metric or am I just playing in the optimal manner? The only tank in which I was able to produce silly high WN8 (4k+ for 30 or more games) without actually winning was the T71. I think that's because the autoloader is just enough to allow more frags than would be expected, but doesn't actually hit hard enough or quite have the HP to make crucial game-winning kills and trades like the 13 90. But my T71 also just carries more losses than my other LTs, by a good margin, being a solid 10% lower in wins... might just be an anomaly for me. I bet someone else can turn the T71 into stupid high WN8 and decent wins (Millard comes to mind) because of its ability to spot/dmg/frag beyond expected. RichardNixon said: "I tried it in my own multiple linear regression solver. It works. sqrt(frags*def) is also better than frags*def. frags*dam comes out slightly better than sqrt(frags*dam) where it appears at all, but it's probably just trying to linearize data that's actually a cumulative normal distribution. It should be noted that there are no huge improvements here. rDamage on its own has an R-squared of 0.930 on my data. A plain linear function is around 0.936 and the best pseudolinear functions are down to a 4sf improvement." Good to have further validation months later. Where were you a year ago? Might have saved poor Praetor from burnout!
Folterknecht 2,257 Posted July 27, 2014 CraBeatOff said: "Despite WN7 being heavily frags-dependent, I did not see any large body of evidence showing that individuals were able to manipulate their frags greatly... outside of playing against vastly less skilled opponents by dropping down into tiers 1-5 or so. Perhaps I am mistaken and someone can show or even attest differently. Frags seem gameable, by holding shots, but it's unclear if anyone has been able to turn that into an artificially high rating." Crab, it's not that important whether I can bring statistical evidence to the table that kill stealing resulted in a better rating for some players. Toward the end of WN7, nearly every yellow-green player understood that kills were a huge factor in WN7. That was observable in the overall gameplay, at least on the EU server. The number of games where some clueless idiot just shadowed me and waited behind me while I took the damage/shots, just to get the kill shot... It's not all about the "success" of the padding; the gameplay we promote by creating a rating is also important. It's the same with expected stats that are too low for mid/low tiers and too high for tiers 8-10, which would result in a massive shift of above-average players going down 1-3 tiers and clubbing seals, which is a very bad thing, especially for the NA server. The gameplay improved with WN8; of course we got countered by WG and their misguided changes in 0.8.6 (sigma, TD meta). But that kill-stealing BS nearly vanished.
stagnate 513 Posted July 27, 2014 Double this, exactly my point. The effectiveness is secondary to the impact on the gameplay of players, because it WILL impact play. Observe the current scout behavior; you can tell it's trouble when you have a T71 shadowing an E-100 to try to deal damage.
Xelos 62 Posted August 5, 2014 We shouldn't be deciding what style of gameplay to reward with a higher score. If you're a kill-stealing asshole but you win a ton of games because of it, does it matter? And kills are by far the strongest predictor of a win unless you solo'd the enemy team. You could do something akin to a ratio of damage to the expected total HP of the enemy team for the battle tier. For example, say your waffle does 3k average damage a game, and assume 2k is the average HP of an enemy tank in a battle tier 9-11, 15-man match (obviously wrong, but it simplifies the calculation), so the expected total hitpoints of the enemy team is 30k (15*2000); then you did approximately 10% of the enemy health. You could further fold kills into damage dealt by valuing a kill at some fraction of HP/damage (such as the 10% I think WG uses for the XP calculation). The total expected enemy team health then becomes 27k and each kill is worth 200 damage; this should lessen the effect of kills in the equation (and strengthen damage more) and standardize across battle tiers. I personally don't support making damage any more important than it already is in the equation. I'm still of the opinion that the rating should be broken down by some other classification, such as tank type, tier or battle tier/spread, and not be an aggregate measure.
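The arithmetic in this proposal can be sketched as follows. This is one reading of the post, not an agreed formula: the function names are hypothetical, the 15-tank team, 2000 HP average and 10% kill value are the post's own simplifying figures, and the 1.5 kills per battle in the usage line is a made-up input.

```python
def damage_share(avg_damage, team_size=15, avg_enemy_hp=2000.0):
    """Average damage as a fraction of the expected enemy team HP pool."""
    return avg_damage / (team_size * avg_enemy_hp)


def kill_adjusted_share(avg_damage, avg_kills, team_size=15,
                        avg_enemy_hp=2000.0, kill_value=0.10):
    """Assumed reading of the kill adjustment: kill_value of each tank's HP
    is credited to whoever gets the kill, the rest to damage dealt."""
    pool = team_size * avg_enemy_hp                     # 30000 in the example
    damage_part = (1.0 - kill_value) * avg_damage       # damage pool shrinks to 27000
    kill_part = kill_value * avg_enemy_hp * avg_kills   # 200 per kill
    return (damage_part + kill_part) / pool


print(damage_share(3000))               # 3000 / 30000
print(kill_adjusted_share(3000, 1.5))   # (2700 + 300) / 30000
```

Under this reading a player who deals damage without finishing tanks scores slightly lower than the plain damage share, and one who takes more than the average share of kills scores slightly higher.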
bjshnog 4,447 Author Posted September 17, 2014 A bit of a late reply here, but anyway... It must be an aggregate measure because the stats are aggregated. If you want a more proprietary measure, you'll have to use other software. About the expected total HP, I think someone should go ahead and do that analysis. It shouldn't be hard at all.
Never 7,715 Posted September 30, 2014 https://www.youtube.com/watch?v=EOQcnliEjXM
Folterknecht 2,257 Posted September 30, 2014
RichardNixon, on 22 Jun 2014 - 02:09 AM, said:
Here's a relatively simple way of fixing skill scaling:
1. Generate expected values as usual, centred on 1565.
2. Go back through your player database, calculating per-tank WN8 values with the new expected values.
3. Throw out tanks below 50 games and then the bottom 50% of tanks for each player, as usual.
4. Calculate a "recent WN8" based on the remaining tanks.
5. Throw out any players below 2500 recent WN8.
6. Average the tanks of the remainder, and the WN8 of the players of that tank.
7. Normalize the ratio between the average tank WN8 and average player WN8 to give you a scale factor per point of WN8 from 1565:
scalefactor = ((tankWN8 - 1565) / (playerWN8 - 1565))
Final results should be a bit like this: https://docs.google....dit?usp=sharing
Mine's a bit distorted because I'm mixing Gryphon's expected values with my own database, but you get the idea. Fast tanks have scale factors above 1.0, while slow tanks have scale factors below 1.0. Low tiers are mostly garbage results due to lack of data, but you can guess or substitute 1.0 as appropriate.
Once you've got that, you add a final step to the WN8 calculation:
1. Sum battles*scalefactor over the player's tanks. Divide by total battles to give scaleAvg.
2. Adjust with the following formula:
scaledWN8 = 1565 + ((wn8 - 1565) / scaleAvg)
So for example, a 3000 WN8 player who's only played the Maus will get 3057 scaledWN8, and a 3000 WN8 player who's only played the T62A will get 2884 scaledWN8.
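The two formulas in the quoted post can be sketched as below. The function names and the (battles, scalefactor) input layout are illustrative assumptions. The Maus and T-62A factors in the usage lines are hypothetical values back-derived from the 3057 and 2884 results in the post; they are not taken from the linked spreadsheet.

```python
BASELINE = 1565.0  # centre of the WN8 scale used in the quoted post


def scale_factor(tank_wn8: float, player_wn8: float) -> float:
    """Step 7: ratio of average tank WN8 to average player WN8, per point above 1565."""
    return (tank_wn8 - BASELINE) / (player_wn8 - BASELINE)


def scaled_wn8(wn8: float, tanks: list[tuple[int, float]]) -> float:
    """Final step: tanks is a list of (battles, scalefactor) pairs for one player."""
    total_battles = sum(battles for battles, _ in tanks)
    scale_avg = sum(battles * factor for battles, factor in tanks) / total_battles
    return BASELINE + (wn8 - BASELINE) / scale_avg


# Hypothetical factors chosen to reproduce the post's two examples.
print(round(scaled_wn8(3000, [(1000, 0.962)])))  # Maus-only player
print(round(scaled_wn8(3000, [(1000, 1.088)])))  # T-62A-only player
```

A player with a mixed garage gets a battle-weighted average of the factors, so the adjustment shrinks toward zero as slow and fast tanks balance out.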
bjshnog 4,447 Author Posted September 30, 2014 I sent him a PM recommending that he come here and clear everything up. If we do something that doesn't work or we have to change something, etc., then it's best he knows what changed.
Gryphon_ 541 Posted September 30, 2014 Here is the plot of the T-62A rDAMAGE vs user_rDAMAGE for all users that pass the filter: [plot not preserved] The slope of the least squares fit line is 1.11, so would you consider that a useful parameter for a scaling adjustment? If so, what would the adjusted tank WN8 be for a user who had 2000 (raw) WN8 on the tank? Here is the WN8 vs user_WN8 version: [plot not preserved]
bjshnog 4,447 Author Posted September 30, 2014 It looks to be good enough in theory, since the data is fairly linear. The slope between overall and tank WN8 is what should be used, though (however much difference that makes), since that is what is being scaled. If a player had 2000 WN8 in the T-62A, then their WN9 would be 1565 + ((2000 - 1565) / 1.11) = 1956.9, or about 1957. If they had 3500 WN8, it would be about 3308. The base expected values might need to be changed a bit, or the method slightly altered to account for the presence of scaling, because I think 3308 is still a bit too high (I would estimate the proper score should be around 3000-3100, though this is just based on my experience). Also, 0 WN8 on tanks with a scaling factor greater than 1 would end up with a rating above 0.
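A sketch of how the slope and the adjustment fit together, assuming the slope is an ordinary least-squares fit of tank WN8 against the player's overall WN8. The data points below are synthetic, generated to lie on a line of slope 1.11 through (1565, 1565); they are not Gryphon_'s dataset, and the function names are illustrative.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def rescale(wn8, slope, baseline=1565.0):
    """Pull a per-tank WN8 back toward the baseline by the fitted slope."""
    return baseline + (wn8 - baseline) / slope


# Synthetic (made-up) players: overall WN8 and their WN8 in the tank.
overall = [1000.0, 1565.0, 2000.0, 2500.0, 3000.0]
on_tank = [1565.0 + 1.11 * (x - 1565.0) for x in overall]

slope = least_squares_slope(overall, on_tank)
for raw in (2000, 3500):
    print(raw, round(rescale(raw, slope), 1))
```

With real data the fitted line will not pass exactly through (1565, 1565), so a fuller version would probably need to handle the intercept as well.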
RichardNixon 835 Posted October 1, 2014 If you're concerned about accuracy for high-skill players then half of the problem is formula errors. WN8 is strongly non-linear with solo winrate for high rStats values, and becomes increasingly dependent on the noisier parameters. The WN8 formula is only non-linear due to lack of platoon filtering in the input data, and so the natural conclusion is that it should be fixed at the same time. Cutting a very long story short, formulae like this work reasonably well for mostly-solo players:
rWinC = 0.7*rDmg + 0.1*sqrt(rFrag*rDmg) + 0.15*sqrt(rFrag*rSpot) + 0.05*sqrt(rFrag*rDef)
rSpot is the only parameter that really needs capping and scaling, with a minimum in the 0.33 to 0.38 range. rDmg and rFrag intercept close to 0 with rWinC, while rDef is a power function but modelling it as linear has almost no detectable impact. Capping by damage is unnecessary once the formula is linearised. Adding in rWinC is optional. If you want a higher zero point (as in WN8) then the logical method is to cap rWinC harder, but I can't find any real justification for using a higher zero point. The average player with 0.2 rDmg still wins more games than the average player with 0.1 rDmg. Note that high-WN8 players will take a large absolute hit from a contribution-linear formula. Switching to a completely different scale may alleviate some of the whining.
Now that the tanks/stats API mostly works, other improvements are also possible, although further out of scope:
1. Accurate normalization. Fixes a lot of +/-100 point drift depending on how you played your low tiers.
2. Ignoring certain tank tiers or types entirely in the overall result.
3. Penalizing players for playing the same 6-skill tank forever.
4. Using a different formula for different tanks/classes/tiers.
Formula variation isn't as useful as you might think. For scout tanks, R-squared only rises from about 0.65 to 0.7 compared to the general formula.
The best formula for scouts looks something like this:
rWinC = 0.5*rDmg + 0.2*sqrt(rSpot*rDmg) + 0.2*sqrt(rSpot*rFrag) + 0.1*sqrt(rSpot*rDef)
There's also a straightforward artillery improvement that doesn't require multiple formulas. Assuming that rSpot is not correlated with success for artillery, you could simply raise the expected spots sky-high and then drop the other expected values by 15% to balance. Finally there's always the option of switching to an expected values table based on genuine recent values rather than the current fudge. I'll have the data in a couple of weeks, but this does of course have pros and cons.
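The general and scout formulas from this post can be written out as a short sketch. The function names are illustrative, the inputs are assumed to be already-normalized ratios (actual divided by expected), and the rSpot capping and scaling mentioned in the post is deliberately not modelled here because its exact form is not given.

```python
import math


def rwinc_general(r_dmg, r_frag, r_spot, r_def):
    """General pseudo-linear formula from the post (mostly-solo players)."""
    return (0.70 * r_dmg
            + 0.10 * math.sqrt(r_frag * r_dmg)
            + 0.15 * math.sqrt(r_frag * r_spot)
            + 0.05 * math.sqrt(r_frag * r_def))


def rwinc_scout(r_dmg, r_frag, r_spot, r_def):
    """Scout variant from the post: spots take over the role frags play above."""
    return (0.50 * r_dmg
            + 0.20 * math.sqrt(r_spot * r_dmg)
            + 0.20 * math.sqrt(r_spot * r_frag)
            + 0.10 * math.sqrt(r_spot * r_def))


# An exactly-average player scores about 1.0; doubling every ratio doubles the score.
print(rwinc_general(1.0, 1.0, 1.0, 1.0), rwinc_general(2.0, 2.0, 2.0, 2.0))
print(rwinc_scout(1.0, 1.0, 1.0, 1.0), rwinc_scout(2.0, 2.0, 2.0, 2.0))
```

Because the weights in each formula sum to 1 and every term is of degree one, the output scales in proportion to the inputs, which is the "contribution-linear" property the post refers to.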
bjshnog 4,447 Author Posted October 1, 2014 Yeah, I think it's worth having separate formulas for light tanks and artillery (treat mediums, heavies and tank destroyers as "normal tanks"). However, rather than doing a basic average based on the proportion of games played in each class, I think it should also be weighted or filtered by tier. For example, if a player has a lot of battles in the SU-26, but then also a lot of tier 8-10 battles, then the arty battles have a lot less weight, since they naturally have less bearing on rDamage (I'm sure you understand why). With light tanks, though, you'd have to classify each scout tank specifically, like all light tanks tier 5+ and then some specific lower tier ones. Basically, you'd mark them the way MM marks scout tanks. This is a small basic improvement, easy to implement, that should fix a few minor issues and make WN9 more valuable for scouts. The ultimate extension of this concept is to have totally separate expected values, scales, coefficients and weights on every stat for every tank, but that's completely impractical and there are better ways to do it. The scaling factors will probably still be necessary for the overall equation, even with the linearisation, since there are still skill floors and ceilings. I think we should just abandon the wins part of the rating on the spot, since win rate and average tier are already intended to be used alongside it. EDIT: Also, the baseline could be changed so that it's a simple subtraction from WN9, once we work out what the baseline should be. I don't see much reason why each stat individually should have a baseline (as in WN8) when it could just be applied to the formula as a whole.
RichardNixon 835 Posted October 1, 2014
bjshnog said:
#1 "However, rather than doing a basic average based on proportion of games played in each class, I think it should be weighted or filtered by the tier also. For example, if a player has a lot of battles in the SU-26, but then also a lot of tier 8-10 battles, then the arty battles have a lot less weight, since they naturally have less bearing on rDamage (I'm sure you understand why)."
#2 "With light tanks, though, you'd have to classify each scout tank specifically, like all light tanks tier 5+ and then some specific lower tier ones after. Basically, you'd mark them like MM marks scout tanks. This is a small basic improvement, easy to implement, that should fix a few minor issues and make WN9 more valuable for scouts."
#3 "The ultimate extension of this concept is to have totally separate expected values, scales, coefficients and weights on every stats for every tank, but that's completely impractical and there are better ways to do it."
#4 "The scaling factors will probably still be necessary for the overall equation, even with the linearisation, since there are still skill floors and ceilings."
#5 "EDIT: Also, the baseline could be changed so that it's a simple subtraction from WN9, once we work out what the baseline should be. I don't see much reason why each stat individually should have a baseline (as in WN8) when it could just be applied to the formula as a whole."
#1: This is autofixed by using tanks/stats data in the natural way, essentially performing the expected value divisions per-tank rather than on the final sum. I don't think there's a way to fix normalization without using tanks/stats.
#2: That's what I did for the scout formula. I suppose it does mean that tank class would be a parameter you'd add to the expected value table, rather than expecting websites to roll their own.
#3: Not too sure about "impractical", but the cost/benefit is certainly bad and it would be difficult to generate accurate per-tank formulas: skill distribution is uncontrollable and the platoon filtering doesn't work well. I'm not even convinced that multiple formulas are worth the cost/benefit, given that the scout formula is only a small improvement without assisted damage. The main benefit is probably that it's harder to pad, although I'd like to get experimental evidence on that. If spot-padding is straightforward then it may not be an improvement.
#4: Yes, the benefits of scaling still apply. Linearising the formula should also bring the regression slope method relatively close to the unicum-average method, which would make the scaling data a lot easier to collect for the low tiers. Similarly, linearisation also makes scaling work a lot better for the low side of the skill range.
#5: Yes, that's a much better method of handling the baseline.
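The difference described in reply #1 can be illustrated with a small sketch. It assumes per-tank records like those returned by a tanks/stats style endpoint, but the field names, function names and numbers here are all made up for illustration, and only damage is shown.

```python
def aggregate_ratio(tanks):
    """Divide on the final sum: total damage over total expected damage."""
    total = sum(t["battles"] * t["avg_dmg"] for t in tanks)
    expected = sum(t["battles"] * t["exp_dmg"] for t in tanks)
    return total / expected


def per_tank_ratio(tanks):
    """Divide per tank first, then take the battle-weighted average."""
    battles = sum(t["battles"] for t in tanks)
    weighted = sum(t["battles"] * t["avg_dmg"] / t["exp_dmg"] for t in tanks)
    return weighted / battles


# Hypothetical player: strong in a low-tier tank, weak in a tier 10.
garage = [
    {"battles": 100, "avg_dmg": 200.0, "exp_dmg": 100.0},
    {"battles": 100, "avg_dmg": 1000.0, "exp_dmg": 2000.0},
]
print(round(aggregate_ratio(garage), 3), per_tank_ratio(garage))
```

In the aggregate version the high-tier tank dominates because its damage numbers are larger in absolute terms; dividing per tank gives each battle the same weight regardless of tier.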
Gryphon_ 541 Posted October 1, 2014 I concur that WN8 isn't great for lights and arty. I ran some plots using the current formula and dataset last night to see what sort of distribution we get for rWINc vs user_WN8 when the data is broken into 5 datasets, one for each tank type. Mediums, heavies and TDs produce plots with very good linear regression and high R-squared, but lights are very so-so and arty is a mess. What I get from this is that for WN9 we need a separate WN formula by type. We should be looking for more accurate results for light and arty players; any modest success in that area will outweigh by a few orders of magnitude the benefit that some unicums would get (or suffer) from a scaling modification to WN8. EDIT: @RN - try adding Caps to the light formula. Lights often cap for the win. It will make a difference, I'm sure.