WAR: It Works

We use Wins Above Replacement a lot around here; one of the focuses of this site is to accurately quantify the value each player produces, and WAR is the best tool we have for doing that. It still draws a fair amount of skepticism, though, from people who distrust one component or another: they don't like the numbers UZR spits out for defense, they don't believe in replacement level, or they believe that pitchers do have control over their BABIP rates.

So, the question is, does WAR work? If it's designed well, there should be a pretty strong correlation between a team's total WAR and its actual record. Fans of WAR, rejoice: there is.

For 2009, the correlation between a team's projected record based on its WAR total and its actual record was .83. This is a robust number, especially considering that WAR is almost completely context independent and currently has some notable omissions: baserunning (besides SB/CS, which are included in wOBA) and catcher defense are both ignored in the calculations. We also don't have an adjustment for differences between leagues, so we're not accounting for the fact that the AL is better than the NL.

Despite these imperfections, WAR still performs extremely well. The standard deviation of the difference between WAR-projected and actual record is 6.4 wins, and every single team is within two standard deviations. Only four teams were more than 10 wins away from their projected total by WAR, with Tampa Bay ending up the furthest from our expectation (96.6 projected wins, 84 actual wins), and 18 of the 30 teams were within six wins of their projected WAR total.

For comparison, the correlation between Pythagorean expected record and actual record is .91, and pythag includes some aspects of context (performance with men on base, for instance) that impact runs scored and allowed, so we would expect it to predict actual record somewhat better than a context-independent metric like WAR. The fact that WAR is even close to somewhat-context-included pythag is impressive in its own right.
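For reference, the basic exponent-2 Pythagorean expectation mentioned here can be sketched as:

```python
def pythag_win_pct(runs_scored: float, runs_allowed: float, exponent: float = 2.0) -> float:
    """Basic Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)

# A team that scores exactly as many runs as it allows projects to .500
even = pythag_win_pct(750, 750)

# Illustrative season: 800 runs scored, 700 allowed
pct = pythag_win_pct(800, 700)
expected_wins = pct * 162
```

Refinements like the Pythagenpat exponent exist, but the exponent-2 version is enough to see how run differential maps to an expected record.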

WAR isn’t perfect. But given the known limitations and the variations in how contextual situations impact final record, it does an awfully impressive job of projecting wins and losses.







Dave is a co-founder of USSMariner.com and contributes to the Wall Street Journal.


132 Responses to “WAR: It Works”

  1. JoeR43 says:

    Try taking extra data (from 2002 on) and plot it out. Still, anything with > 80% correlation is usually good to go.


  2. RKO36 says:

    That’s pretty good. Like you said it isn’t perfect, but it’s pretty dang good. Keep up the good work.


  3. Logan says:

    I’m a bit lazy- can anyone tell me how many games a replacement level team is supposed to win? I can’t seem to find it.


  4. James says:

    I already trusted WAR, and now I have reason for continuing to do so. Sweet.

    Dave: are there any plans for integrating catcher defense into the stat? Obviously it’s an extremely complicated factor, but it would be really interesting to see the WAR jump for a guy like Y Molina


    • Matt Harms says:

      I don’t think anyone’s developed a reliable methodology for catcher defense. There’s just too much in play.


      • jinaz says:

        They can at least report the catcher fielding statistic that Rally and many others use, based on SB/CS/PB/WP/E rates.

        Yes, there are other factors that come into play in catcher defense. But something is better than nothing. And, generally speaking, the guys widely recognized as outstanding defenders (Molina, Laird, etc.) rate very well by those catcher fielding stats.
        -j


  5. An 83% correlation is very compelling evidence. Thanks for pointing this out Dave.

    It continues to amaze me how effectively baseball can be analyzed through data.


  6. Mark says:

    Interesting that two of the four teams that were off their expected WAR by more than 10 were the Jays and the Rays.


    • Wally says:

      Yeah, that’s probably got something to do with competition they are facing…


      • Torgen says:

        Why would that make a difference?


      • Wally says:

        Because, as Dave explained, game situations matter. So if you can beat up on the rest of the league but get edged out in your tough division, it will skew the results.


      • Torgen says:

        But why should the Yankees and Red Sox edge out the Rays and Jays particularly often?


      • Wally says:

        Because they are just a little better.

        WAR attempts to measure talent, and you're trying to correlate that to W%, but a given level of talent will produce different W% in different environments, depending on league talent and scheduling. So if a very talented team plays a slightly more talented team more often than the rest of the league does, it won't fit the average talent-to-wins correlation. I hope that makes sense.


      • Torgen says:

        But we know how often teams should win head-to-head given how good each of them is: (A-A*B)/(A+B-2*A*B). For teams in the .400 to .600 range that function is almost exactly .500 + A – B. (Bill James derived this in 1981.) Given that the Rays and Red Sox had almost identical WARs, they should have played .500 against each other and both been equally disadvantaged against the Yankees, but that’s not what happened. Even the difference between the Jays and Rays doesn’t explain the 14-4 season series between them.
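A quick sketch of the log5 formula Torgen cites, along with the linear shortcut he says is nearly exact for teams in the .400 to .600 band:

```python
def log5(a: float, b: float) -> float:
    """James's log5 estimate: P(team with true W% a beats team with true W% b)."""
    return (a - a * b) / (a + b - 2 * a * b)

# Evenly matched teams should split .500
even = log5(0.550, 0.550)

# Near .500, log5 tracks the linear shortcut .500 + A - B closely
a, b = 0.590, 0.560
exact, approx = log5(a, b), 0.5 + a - b
```

Within the .400 to .600 range the two values agree to better than a percentage point, which is the basis of Torgen's approximation.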


      • Wally says:

        Well, I wouldn’t pretend to support that this effect explains the entire difference for all teams equally (ie. the Red Sox aren’t as effected by this as the Rays or Jays), especially over such a small sample size. But if you’re going to regress WAR against W% across the whole league, then the unbalanced schedule is going to fuck you up (yes that’s the technical term), because each team and their talent isn’t competing against the same talent as the rest of the league. So, it isn’t surprising that we have two outliers with depressed records that happen to be in the best division in the game.

        This is just a case where your data is not collected ideally for these kinds of league-wide analyses. That doesn't mean we can't do this kind of stuff, but we have to understand how the imbalances in the data can affect the conclusions.


      • Torgen says:

        The only way divisional play would screw up WAR vs. W% analysis would be if games within the division were unnaturally predisposed to being close and/or low scoring. Is there any evidence that that's the case?


      • Wally says:

        No, that’s just one way and it may be an off shoot of this effect. However, you are failing to understand that each sample (or data point) isn’t receiving the same treatment, but is being analyzed if it where receiving the same treatment.

        Maybe a simple example would help. I'm trying to find genes that correlate with heart disease. To do this I have people take a genetic test and then ask them if they have any heart conditions. I do this study over 10 cities that all have different racial and social backgrounds. In one city they LOVE their chicken fried steak with a pack of cigarettes on the side. This city just happens to have a lot of green people. So, when I go do my correlation, I'm going to find that genes common in green people have a high correlation with heart disease just because those people have a different treatment (diet in this case), not because those genes actually have anything to do with heart disease.

        I hope this helps because I can’t explain this flaw any better.


      • Toffer Peak says:

        Wally – What I don't get is that you seem to be suggesting that because the Jays and Rays had a tougher schedule, this has thrown off their WAR-W% correlation. Correct? But as Dave said above, “WAR is almost completely context independent and currently includes some notable omissions…doesn’t have an adjustment for differences in leagues, so we’re not accounting for the fact that the AL is better than the NL.”

        This means that because the J/Rays faced a tougher schedule than normal, their WAR total is actually lower than it would have been had they faced a more normal schedule. So their expected win% (based on total team WAR) already accounts for the fact that they faced a tough schedule. Why you would want to then give them credit again for facing a tough schedule confuses me.


      • Rob says:

        The two best closers in the league play in the AL East. Bullpens that dominant are going to screw up strict innings-pitched measures.


      • B says:

        “The two best closers in the league play in the AL East. Bullpens that are so dominant are going to screw up strict innings pitched measures”

        That’s a pretty bold statement, and I’m not sure that it’s true. It certainly might be, Rivera and Papelbon are certainly great closers, but I’d throw Andrew Bailey, Brian Wilson, Joe Nathan and Jonathan Broxton into the mix, too. I don’t think you can definitively say Rivera and Papelbon are the best…


  7. JSE says:

    What’s the correlation between WAR one year and W/L in the subsequent year, and how does this compare with Pythag?


  8. Gerry says:

    Query:

    If I go to the team page, it shows the Jays at 19.6 WAR in 2009. When I look at the player stats, I see the hitters were approx. 19 WAR and the pitchers were approx. 19 WAR, for a total of about 38. Why the difference between the team page and the player page?

    Thank you


  9. Travis L says:

    Anybody have a handy chart of the correlation coefficients for other stats (both sabermetric and traditional)? Granted, there aren't any traditional stats that attempt to account for pitching, defense, and offense, but it would be really handy to know the correlation between, say, BA and team wins.


    • DavidCEisen says:

      Exactly my thoughts. That WAR and wins are correlated means nothing if we aren’t comparing it to something.


      • Wally says:

        Well, we did compare it to our friendly Pythagorean record. And correlations do mean something without a comparison: they tell you how much of the variation in the data is accounted for by the variable in question. In any case, .83 is a VERY good number, especially considering some of the things left out of WAR that would make it a better predictor.

        Plus, no matter how well we can judge talent, luck is still going to play a role in who wins what games. So our R^2 is never going to be 1. And if it ever is, then you aren't going to be measuring pure skill anymore.


  10. LorenzoStDuBois says:

    I’d say it’s good for absolutely nothing.


  11. Steve Sommer says:

    Jeff Zimmerman at BtB did something similar dating back to 1980. http://tinyurl.com/ybeh4cl


  12. Steve says:

    For comparison's sake, I performed the same exercise using BP's WARP1 figures. WARP1's predicted wins had a correlation coefficient of .902 when predicting team wins, which is even better than the WAR figures.

    The thing that was interesting was that BP’s replacement level was much lower; they handed out 1457.8 wins above replacement, leaving 972.2 wins as the rep level for the league. That’s 32.41 wins per team, or a .200 winning percentage.

    Does anyone know how each method establishes replacement level and which is more accurate? BP’s strikes me as more arbitrary (they seem to be intentionally setting it at the round number of .2), but their results (.902) are quite impressive.
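Steve's back-of-envelope numbers can be reconstructed from league totals (30 teams, 162 games each, one win per game played); a sketch, using his 1457.8 figure from the comment above:

```python
teams, games = 30, 162
total_wins = teams * games // 2     # every game produces exactly one win
war_handed_out = 1457.8             # BP's league-wide WARP1 total, per the comment
replacement_wins = total_wins - war_handed_out
per_team = replacement_wins / teams
replacement_win_pct = per_team / games
```

This reproduces the 972.2 replacement wins, 32.41 wins per team, and the .200 winning percentage he cites.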


    • Shawn Hoffman says:

      To add to Steve’s point, I’m not sure this is really the best method for testing the stat. WARP1 is probably higher because its pitching component seems to be closer to ERA than FIP (offensively I think it’s basically the same just with a lower replacement level). We know that’s not really optimal, just like we know using the individual players’ R+RBI-HR isn’t optimal. But using ERA and R+RBI-HR would almost definitely have a higher correlation than .83. I’m almost positive using OPS Diff would as well. But we intuitively know that’s not what we’re looking for.


      • Wally says:

        Right, correlations can’t go down if we’re just looking at R^2. Maybe we should break the whole thing up (split WAR and WARP into its parts) and use adjusted R^2?


      • Dave Cameron says:

        Shawn’s right. The more you account for context, the higher your correlation with actual record will be, but we can’t really compare correlations of systems with different contextual basis and determine that one is “better” than the other.


      • Shawn Hoffman says:

        Dave — I hate to say it, but the best judge of this might just be personal preferences of the individual components. I prefer xFIP, or a version of FIP with a heavily regressed HR/FB component. I also think single-year UZR numbers should be regressed to each player’s historical mean — or the league mean, if there is no history on that player.

        Am I right? Is that a better system? Maybe, maybe not. It's just what I prefer. And I know going in that it would have a lower correlation to actual wins, just like WAR has a lower correlation than WARP, even though WAR has “better” components.


      • Dave Cameron says:

        It depends on what you’re trying to figure out, I think. xFIP, or something with a regressed HR/FB rate, will give you a better indicator of a pitcher’s true talent level and help you project future performance better. Same deal with regressed UZR components.

        But is that always what we want to use WAR for – to find true talent level? Sometimes, sometimes not. If I’m trying to describe how valuable Zack Greinke was to the Royals in 2009, I don’t care that much about his true talent level. If I’m trying to figure out how good Zack Greinke is, I do.

        There are times when xFIP will be more appropriate, or when you should regress UZR. But that isn't the only application, and thus I'm not sure I'd say it's a “better” way of doing it.


      • Colin Wyers says:

        Shawn, there are two kinds of accuracy we're worried about: how well something reflects what's occurred, and how well something reflects a player's “true talent level.” xFIP predicts future ERA better than FIP, certainly. But the actual number of home runs allowed is certainly more reflective of how many home runs occurred than how xFIP handles it.

        (The reason we care about FIP at all instead of ERA is not because of “luck” but because we are trying to handle the split credit between a pitcher’s run prevention and that of his defense. I honestly prefer Rally’s method of using observed RA with an adjustment for defense quality based on aggregated TotalZone results.)


      • Shawn Hoffman says:

        Colin / Dave — I agree with you. The question is, where do we draw that line? HR/FB has just about as much year-to-year predictive power as BABIP, so technically FIP will be more “true talent” than ERA, but ERA will tell you more about what happened. The same thing goes for FIP vs xFIP. I think where we draw the line is simply up to your own personal preference, and mine is heavily skewed toward true talent.


      • Dave Cameron says:

        Shawn,

        The difference between HR/FB and BABIP is that we're already accounting for defense in WAR with the UZR input. Theoretically, FIP + UZR should be something close to actual runs prevented, so we're hopefully capturing the reality of what actually occurred, even though we're not doing it explicitly through the pitcher.

        If you regress HR/FB rate, you're taking it to a different level, throwing out something that actually happened and removing it from the formula altogether. And I'm not sure I want to go there.


      • Shawn Hoffman says:

        Dave, I can understand that, and I realize I’m probably in the minority. But even more than just random fluctuation, the HR-related park effects bug me.

        Also, as you said, factoring in competition is becoming more and more important. I've been pounding the pavement for Halladay the past couple weeks, b/c his opposition is so different from anyone else's, unless you pitch for the Jays or Orioles.


      • vivaelpujols says:

        Dave, FIP also eliminates timing, which may have as much of an impact on runs allowed as defense.

        I agree with Shawn here. Comparing WAR to wins is pretty much useless. Yes, you would want it to correlate somewhat so that you know you are in the right range; however, as Shawn pointed out WARP > WAR using correlation to wins as the judge.

        I don’t know the right way to test how well WAR “works”, however, comparing it to wins is certainly not the right way.

        Also, WAR is a tricky stat. It regresses some things all the way, like BABIP and timing, while not regressing other things at all. I agree with Shawn that WAR should be skewed more toward true talent rather than performance. It's already halfway there; why not push it all the way?


      • Dave Cameron says:

        WAR doesn’t regress BABIP – it credits the fielders instead of the pitcher. It doesn’t “regress” timing – it ignores it entirely.


      • vivaelpujols says:

        Dave – BABIP and UZR won’t correlate directly, because a lot of BABIP is luck and not defensive performance. As to regressing vs. ignoring timing, that’s just semantics.

        If I were to describe FanGraphs WAR, it would be “how much would a player contribute to his team if timing, and in the case of pitchers, BABIP luck, were taken out of play.”

        The problem with that is it gives full credit for other things that are mostly out of a player's control, like HR/FB for pitchers and BABIP for batters (I am aware that it is partially under a hitter's control, but not so much that it should be taken at face value).


    • Wally says:

      It would be interesting to see this comparison done over multiple seasons. I would guess a single season, which gives only 30 data points, has a standard error on the fitted slope that is larger than the difference between the two lines generated by WARP and WAR.

      It would be cool if someone would go all the way back to 2002 and give this a try. Those 240 data points may yield even better results.


  13. CH says:

    The only problem I have with WAR is when people take it as gospel in player-to-player comparisons.

    Example: Franklin Gutierrez had a higher WAR than Miguel Cabrera this year.

    Anyone who thinks Gutierrez is a more valuable player than Cabrera is just a moron.

    Nyjer Morgan more valuable than Jayson Werth?

    I really don’t think so.

    It’s a nice stat, but let’s not completely surrender our gift of reason to the numbers quite yet.


    • Dave Cameron says:

      This is a great example of why the critics of WAR aren’t taken seriously.


      • Not David says:

        Why not?

        I find that calling people morons makes for a fantastic debate technique, proven effective in middle schools throughout the nation.


      • JoeR43 says:

        You say this now, Dave.
        Take WAR to one of the more insufferable members of the NY media, boldly say that Zobrist had a better 2009 than Jeter, watch as he uses it as fodder for his next half-assed, ass kissing column, and watch scores of Yankee fans eat it up.

        (This is obviously a diss on those who don’t take the metric seriously vs. those who do).


    • Wally says:

      The whole reason we use numbers is because our gift of reasoning sucks.

      Here’s a physical example: You are stopped at light in your car with a helium balloon in the back seat. The light turns green, you start to accelerate. Which direction do you reason the balloon moves relative to you?


    • Torgen says:

      So if WAR works, doesn’t that mean JP Ricciardi was a great GM for accumulating it so cheaply and certainly didn’t deserve to be fired?


    • vivaelpujols says:

      You’re not wrong. WAR is the best stat for evaluating talent in a single season; however, that doesn’t mean that it will project future performance well.

      While Morgan and Gutierrez have been better players this year than the two you listed, in all likelihood they won't be going forward, mainly due to unsustainable defensive ratings this year.


      • Mooser says:

        Not sure why you would say this. Morgan and Gutierrez are good fielders reputation-wise, scouting-wise, and stat-wise. There is nothing to suggest they will not continue to be good fielders as measured by UZR / WAR. We don't believe they are more valuable than guys like Cabrera because we have not yet accepted the fact that fielding can be as big a part of total value as WAR suggests.


      • B says:

        It’s pretty simple why he’d say this. If you look at all players who have put up a UZR in the range of 28.5, I’m pretty confident that sample of players is much more likely to go down than up the next season. Since we don’t “know” Franklin’s true talent level, it’s not unreasonable to assume his numbers will follow the trend.

        I will say, looking at Gutierrez's numbers, I'm really confused. His UZR/150 is 19.2, but in 153 games (1353 innings) he put up a UZR of 28.5. Am I missing something? That doesn't make any sense to me. If he played 153 games and his UZR/150 is 19.2, shouldn't his UZR be something like…19.7?


      • Dave Cameron says:

        The denominator in UZR/150 isn’t “games played” or “innings played”, but “Defensive Games”, which is based on the amount of opportunities a fielder had to make a play. Because the Mariners had a flyball pitching staff, Gutierrez played 231 “defensive games”.
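A sketch of the scaling Dave describes. Note that 28.5 over 231 defensive games comes out in the high 18s rather than exactly the 19.2 listed, so the site's published figure presumably reflects rounding or details beyond this simple rescaling:

```python
def uzr_per_150(uzr: float, defensive_games: float) -> float:
    """Scale a raw UZR total to a rate per 150 defensive games."""
    return uzr / defensive_games * 150

# Gutierrez's 2009 figures from the thread: 28.5 UZR over 231
# "defensive games" (opportunity-based), not his 153 games played
rate = uzr_per_150(28.5, 231)
```

The key point stands either way: the denominator is opportunities, so a flyball staff inflates an outfielder's defensive games well past his games played.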


      • B says:

        Thanks for the answer, Dave. Makes sense now.


    • vivaelpujols says:

      To add on, you have to remember that no stat in a single season is going to be very accurate in terms of judging how good a player actually is. You have to look at multiple seasons and scouting reports to get a sense of that.


      • Norm says:

        to quote B:

        I will say, looking at Gutierrez's numbers, I'm really confused. His UZR/150 is 19.2, but in 153 games (1353 innings) he put up a UZR of 28.5. Am I missing something? That doesn't make any sense to me. If he played 153 games and his UZR/150 is 19.2, shouldn't his UZR be something like…19.7?

        I’m curious about this as well….


      • vivaelpujols says:

        Read Dave’s response above.


    • Really? says:

      So, on a real team, you would trade Miguel Cabrera and Jayson Werth in order to acquire Gutierrez and Morgan?

      Based on ONE year of WAR, you would say Gutierrez and Morgan are more valuable players than Cabrera and Werth?

      Take the word “moron” out of the original post, and respond to it.

      You think one year of WAR is enough data to argue the overall value of players?


  14. Steve says:

    So, WARP includes more contextual information than WAR does? Didn’t know that, but that would make sense in explaining the better correlation.

    Dave (or anyone)… why are the BP and FanGraphs replacement levels so different? Is one “more right” than the other?


    • Dave Cameron says:

      I’d rather not speak for BP. I think the generally accepted level of replacement player production is somewhere in the .280 to .320 winning percentage range. If someone uses a different replacement level, they’d have to explain why they feel that’s more accurate.


    • Colin Wyers says:

      BPro’s replacement level for WARP is wrong, as far as I can tell. – back calculating from published figures on their website they seem to be using a .200 replacement level, in other words a team that wins 32 games in a 162 game schedule.

      That’s better than the old WARP, which was based off a supposed .150 win percentage, but had a much lower observed win percentage (I did the math and one year WARP1 had a negative win percentage under the old system – I am not kidding, they had a negative win percentage baseline one year for the old WARP1.) But it still seems unrealistically low.

      Only one baseball team has ever played worse than .200, the 1899 Cleveland Spiders, and no other team has been really close. The worst win% for a team in the postwar era is the '62 Mets, at .250.

      Moreover, a lot of the linear assumptions used to calculate such stats (be they WAR or WARP) start breaking down below a .250 win percentage (which is how the old WARP and its .150 rep-level baseline ended up producing more observed WARP than actual wins in at least one season). I would guess that the explicit rep-level baseline they use is really higher than .200 (but lower than WAR's) and is simply breaking down at some point.


      • JoeR43 says:

        I remember some of the ridiculously high levels of WARP players had in the old system.

        It’s much, much better now. I believe A-Rod in 2007 was originally rated as a +14 WARP-3. It’s 11 now. VORP is a good offensive stat, but when I did a regression of it, the slope of team VORP to team runs was only .78, so that could be why BP may understate replacement.

        EqA is still a good stat, though. Nice and simple: .260 is average, .220 is a AAAA guy, .300 is an all-star, generally speaking.


  15. Bill says:

    I noticed an odd bit of WAR accounting earlier this season. For WAR positional adjustments, pinch hitters and pinch runners can be disproportionately affected.

    David Appelman explained that positional value is calculated based on total games at a position. Therefore, pinch-running for a DH in the 9th inning counts the same as DHing for 9 innings (-.11 runs).


  16. Dirty Water says:

    Assigning WAR value to a player is fine. However, assigning one to a team via the total of individual WAR is taking the stat too far. Teams make many, many players (how they’re utilized; putting those players in positions for success; instilling confidence), not the other way around, and bit players (read: replacement level) are often the difference between winning and losing.

    There is no “I” in team. Ever hear that one? That's what no stat accounts for. Except, of course, the stat of W/L record.


  17. Wrighteous says:

    WAR: What is it good for?

    Absolutely Nothin’

    Greinke leads pitchers in WAR and he doesn't even have the most wins! That makes ZERO SENSE!!!!!


    • Jeremy says:

      Please tell me this post is a joke.


      • Toffer Peak says:

        You would like to think so, but this is the same guy who blames steroids for all underperforming players. He is a classic internet troll; please ignore him.


      • Joe R says:

        He’s either a moron or a troll.

        Since he’s on here, I assume troll. And not a funny troll. Probably one of those guys who just post 4chan memes on other boards and things he’s awesome.


      • Dirty Water says:

        Well, to be honest, as long as it’s Wins Above Replacement Level, and not something like Talent Above Replacement Level, Wrighteous does have a point.


      • B says:

        No, he doesn’t have a point. Wins as a stat for a pitcher is a bad stat. You know why? Because the pitcher’s actual performance accounts for less than half of the input into wins. Winning games is dependant on two things – runs scored and runs allowed, both equally important. So offense is 50% of a pitchers W-L record. To make it even worse, pitchers don’t even control 100% of runs allowed, rather it’s a product of both defense and pitching…so the point is a pitchers W-L record is more dependant on offense than pitching, making it a stupid stat, and meaning he doesn’t have a point.


  18. JoeR43 says:

    Dirty Water: Uh, that's a pretty good correlation given just one season of data. And “teams make players, not the other way around” is not a coherent thought.

    Wrighteous: You’re still a terrible poster and even worse at trolling.


    • Dirty Water says:

      Loose language.

      What I mean is that ‘team’ can (and will) affect player performance for better or worse. My argument is that if you took any of those Mets players and put them on winning teams, chances are their performance on the field would have been far better. As it was, that entire Mets roster probably knew exiting ST that they were going to suck, and played to that.


    • B says:

      “What I mean is that ‘team’ can (and will) affect player performance for better or worse. My argument is that if you took any of those Mets players and put them on winning teams, chances are their performance on the field would have been far better. As it was, that entire Mets roster probably knew exiting ST that they were going to suck, and played to that.”

      I’d like to see a single shred of evidence this is true? One study that suggests this might be the case?


  19. JoeR43 says:

    And um, back to dirty water.

    The whole point of win values is based on run scoring and run prevention at the team level. The only way anyone can figure out how offensive stats relate to runs scored, and winning, is to compare it to the team level, because a team has a *real* runs scored as a result of their offensive output. So in that sense, not only are win value statistics not arbitrary, but they probably should be assigned to a team first, and then down to the individual level.


  20. JoeR43 says:

    One more note:

    Looking at the 1962 Mets, maybe we can call it the ultimate “replacement level” team.

    BP has their team VORP at around 88 runs (Pujols and Mauer were both more valuable offensively according to VORP in 2009 than the entire 1962 Mets combined)

    FRAR of 84, FRAA of -65 (oof).

    PRAR of 183

    So a “replacement level” team is probably along the lines of VORP + FRAA + PRAR, or 206 runs.

    The 1962 Mets were bad, but their pythag was 50-110. So for shits and giggles, let's distribute 103 fewer RS and 103 more RA (so a 514/1051 split).

    Their xW% becomes 21.27%. A 34-35ish win team.

    Most of that team was minor league / fringe majors level, but there were players who performed better than replacement level (Ashburn, Thomas, Craig, Jackson), so in this one-team sample, WARP does look like it holds up. At least now.

    b-r has their defense at -78.6 runs.
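
    The pythag arithmetic above can be replicated in a few lines. A minimal sketch, assuming a Pythagenpat-style exponent of about 1.83 (the common refinement of the original exponent of 2; that assumption is what reproduces the 21.27% figure):

```python
def pythag_win_pct(rs: float, ra: float, exponent: float = 1.83) -> float:
    """Pythagorean expected winning percentage from runs scored/allowed.

    The 1.83 exponent is the usual refinement of the original exponent
    of 2; with it, the 514/1051 split gives the 21.27% quoted above.
    """
    return rs ** exponent / (rs ** exponent + ra ** exponent)

# The hypothetical replacement-level split for the 1962 Mets
# (160-game season in 1962):
pct = pythag_win_pct(514, 1051)
print(round(pct, 4), round(pct * 160, 1))
```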

  21. Decoud says:

    Long time lurker, first post. I really enjoy this site.

    To put this in context, I did some quick analysis. If you split the data into the AL and NL and regress wins on ERA and BA, the R^2 values for 2009 were .84 and .61. For ERA and RBI, .86 and .82. For ERA and OPS, .87 and .81. I have the printouts from R on my crappy blog, along with the correlations between ERA, OPS, BA, and RBI vs. wins.
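
    A sketch of how an R^2 like those can be computed for a two-predictor regression. The team numbers below are invented stand-ins purely to show the call shape; the real inputs came from ESPN’s team stats:

```python
import numpy as np

def r_squared(predictors: np.ndarray, wins: np.ndarray) -> float:
    """R^2 of an OLS regression of wins on the predictor columns."""
    X = np.column_stack([np.ones(len(wins)), predictors])
    beta, *_ = np.linalg.lstsq(X, wins, rcond=None)
    residuals = wins - X @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((wins - wins.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

# Made-up (ERA, OPS) -> wins data, just to illustrate usage:
era = np.array([3.8, 4.1, 4.4, 4.7, 5.0])
ops = np.array([0.78, 0.76, 0.74, 0.73, 0.71])
wins = np.array([95, 88, 80, 74, 66])
print(r_squared(np.column_stack([era, ops]), wins))
```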

    • OvWARrated says:

      So, since Dave is citing the R^2 of WAR as .83, wouldn’t your results suggest that ERA and RBI might actually be slightly more correlated with wins than WAR (assuming the MLB-wide R^2 for those would be the average of .86 and .82, or .84)? This would at least appear to indicate that they are just AS correlated.

      The same goes for ERA and OPS.

      From your website:

      WAR has an R^2 of .83, but what does that really mean? Some context (all stats from ESPN.com)…..

      “I made three linear models for each league to find the R^2 of wins versus ERA and BA, ERA and RBI, and ERA and OPS. The R^2 values are .84, .86, and .87 respectively for the AL. For the NL: .61, .82, and .81.”

      In the end WAR does not appear to be a significant upgrade over these other measures.

      • Joe R says:

        And you totally miss the point of this exercise.

        Of course actual run production ties to winning. The point of WAR, and any win-rate stat, is to tie an individual’s actual, individual production to winning, while RBI/Runs/ERA/etc are undeniably tied to the performance of other players on their team.

      • OvWARrated says:

        Ok, but WAR is not an improvement over ERA and OPS either.

      • B says:

        ERA, on a team level, makes at least as much (and possibly more) sense to use to evaluate how a team performed at pitching + defense. After all, when it comes to winning, how many runs you allow is half the equation, and ERA is a pretty direct measure of that. The problem is that breaking that contribution down to the individual level creates problems that FIP + defensive metrics try to solve. ERA is not a great stat at an individual pitcher level. As for OPS, I don’t think anybody thinks it’s a bad stat, per se – it does perform pretty decently when describing a player’s offensive contributions; it just has small flaws and isn’t as good as wOBA + park adjustments (and OPS doesn’t make much sense from a theoretical standpoint).

      • Ken says:

        And as was pointed out elsewhere, even if you don’t want to include ERA due to its relation to actual run-scoring, using net on-base percentage (OBP – OBPA) gets you a correlation of 0.83.

        So WAR does not do better than its primary component. You can believe that WAR does better at partitioning OBPA between pitchers and fielders, but the evidence in the article does not relate to that.

        Aggregating team statistics to prove the value of an individual statistic is extremely difficult. If the aggregation of players into teams was actually random, then the aggregated measures would tell us essentially nothing about the quality of the individual statistic.

      • JoeR43 says:

        The only problem with rollup win value stats is that the stats themselves have generally been built from the results of 60-100 win teams. The best and worst MLBers have little to no historical context for their performances, and because of that, the data goes outside the bounds.

        Example: Mauer had a WAR of +8.2. But how many catcher seasons have there been like that? None, maybe one. How many overall? Very few, out of tens of thousands, maybe hundreds of thousands, of hitter seasons of data.

  22. Tim says:

    Is there somewhere on the site where an “adjusted wins” standings are published, similar to what BP has, only based on WAR instead of EQR/EQRA? Because that would be a great feature.

  23. Eric says:

    The correlation is nice, but I think the variation paints a much better story. Correlation tests for a linear relationship (i.e. greater WAR means more wins), but it does not test how well one variable parallels another: it can’t tell you whether a team accumulating X WAR will actually win about X games. The fact that the variance between WAR and wins is low does suggest that the two match each other.

    Intraclass correlation coefficient (ICC) is a good way to measure the ratio of variance between WAR and wins to total variance. In essence this tests whether WAR is a reliable measure of wins or not. Did you perform a reliability test on WAR and wins, and if so, what was the ICC?
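
    This is not the author’s calculation, but for the curious, here is a small sketch of the ICC(3,1) computation Eric describes (two-way ANOVA decomposition, consistency form), treating WAR-projected wins and actual wins as two “raters” of each team:

```python
def icc_3_1(rows):
    """ICC(3,1): consistency of k measurements across n targets.

    rows: list of k-tuples, e.g. (projected_wins, actual_wins) per team.
    Two-way mixed model, consistency (not absolute agreement) form.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# A constant offset between projection and actual still gives ICC = 1,
# because the consistency form ignores systematic rater bias:
print(icc_3_1([(90, 92), (75, 77), (60, 62)]))
```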

  24. Mikel says:

    Question about RF/UZR:

    Does a fielder necessarily have to have good range if they have a good RF? Couldn’t it just be that they position themselves properly?

    • Michael says:

      Don’t listen to RF. It’s a dinosaur. Range Runs works on much better principles, and it actually means something that can be related to runs without jumping through hoops.

    • Joe R says:

      RF led to good things, but in itself is archaic. It’s undeniably too tied to random chance.

      Example: Nyjer Morgan can play on a team with primarily strikeout and grounder pitchers, limiting his opportunities. Because of this, his RF stinks. Is Morgan suddenly bad? No, his opportunities are just down.

  25. Eric says:

    The intraclass correlation coefficient for WAR and wins, using the ICC(3,1) formula (i.e. asking the question how much of the variance is explained by variance between teams) is about 0.82, which is in good agreement with the correlation analysis. What ICC demonstrates, however, is that for any ONE team in a season, WAR is a reliable measure of how many wins a team has.

    You could add more data points, but more data points beyond 30 will not make a difference.

    Unfortunately these tests do not answer the question of whether WAR can PREDICT success; they only tell us that WAR can MEASURE success (R^2 and ICC for 2008 WAR and 2009 wins ~ 0.46). Has anyone taken individual player 2008 WAR, combined that data for every team (i.e. used the 2008 WAR of everyone on team X’s current roster), and examined the relationship between that roster-adjusted 2008 WAR and 2009 wins? If a high ICC and correlation are observed (> 0.8), such a finding would demonstrate the predictive power of WAR.

  26. neuter_your_dogma says:

    If you count the wins and losses for a given team at the end of the season, you get a pretty good correlation with how many wins and losses said team actually had. Just saying :)

  27. Colin Wyers says:

    Using Pythagenpat, RMSE with same-year wins:

    WAR: 5.61
    Pythag: 3.91

    RMSE with next year’s wins:

    WAR: 11.96
    Pythag: 10.85

    In other words, Pythag typically does a better job of predicting wins (either same-season or y+1) than WAR.
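
    The RMSE being compared is just the root-mean-square gap between projected and actual win totals. A minimal sketch; the win totals below are invented for illustration and are not Colin’s 1995-2008 data:

```python
import math

def rmse(projected, actual):
    """Root-mean-square error between projected and actual win totals."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(projected, actual)) / len(actual)
    )

# Invented example: three teams, two sets of projections.
war_proj = [96.6, 88.0, 70.5]
pythag_proj = [89.0, 86.5, 73.0]
actual = [84, 85, 75]
print(rmse(war_proj, actual), rmse(pythag_proj, actual))
```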

    • Colin Wyers says:

      Sorry, forgot to mention. I used 1995-2008. WAR data is from Rally’s website.

    • Dave Cameron says:

      I’m not sure there’s any reason to use pythag or WAR to try to predict the next year’s win total. They’re not projection systems and the roster turnover makes it a fairly pointless exercise.

      • Colin Wyers says:

        Sure, I s’pose. But one could be excused for thinking you believed otherwise:

        http://ussmariner.com/2009/10/05/war-and-the-2009-mariner/

        And honestly, prior season Pythag gets you within about a win of where the more complicated projection systems place in accuracy. It’s honestly not a bad tool, especially when you consider how much less work-intensive it is.

      • Dave Cameron says:

        That post had nothing to do with projecting the 2010 Mariners. The only thing I said about WAR in regards to pythag in that post is that it’s a better estimator of true talent level, which is true, because it strips out the contextual noise of base/out performance.

      • vivaelpujols says:

        WAR would theoretically measure actual skill better than Pythag, so if you used the two metrics going forward, even with roster turnover, you would expect WAR to be more predictive than Pythag.

      • Colin Wyers says:

        Well, Dave, how can you measure “true talent level?” I guess one way is to define what metric you think best reflects true talent, but you end up with a tautology – WAR is the best measure of true-talent level because it’s the best definition of true-talent level.

        Or you could try using it to make predictions. Looking at how well a metric predicts future events is the usual way in which sabermetricians determine how well a metric reflects “true talent,” and by that standard, WAR is not a better reflection of true talent than Pythag.

        We can also confront the problem logically, and see what makes intuitive sense. WAR is an attempt to take team wins and distribute them among different players. In order to do so, we have to break down team wins – and think of it literally as breaking team wins, like dropping a vase upon the floor so it shatters. We then take those pieces and distribute them out to different players.

        Our methods for breaking down and distributing team wins to individual players are sometimes coarse – some pieces get lost, others misassigned. Our metrics are not so perfect that we can say we CERTAINLY got our measure right, even when only dealing with past performance. And so when it comes time to put the vase back together, not everything fits right. In the aggregate it captures the shape of the original vase, but it’s not an exact match.

        And that process of breaking down and then summing up wins at the team level introduces inaccuracies. I don’t see why we would expect aggregate team WAR to better reflect a team’s underlying talent level than simply looking at observed W-L or Pythag.

      • Dave Cameron says:

        “Well, Dave, how can you measure ‘true talent level?’”

        Create a metric that strips out as much noise as possible.

        “Or you could try using it to make predictions. Looking at how well a metric predicts future events is the usual way in which sabermetricians determine how well a metric reflects ‘true talent,’ and by that standard, WAR is not a better reflection of true talent than Pythag.”

        You are taking one fact (pythag > WAR at predicting record in year n+1) and then using that to say that WAR is a worse estimator of true talent. But those statements don’t follow at all.

        If teams were required to have static rosters, we might be interested in prior year WAR/pythag as a predictor of future success. The reason we try to make predictions about the future to test the validity of a statistic is that we can reasonably assume that the main variables are held constant.

        John Lackey isn’t going to start pitching left-handed next year. We can use information about Lackey this year to project Lackey next year because John Lackey is the main component in both years.

        With teams, that’s just not true. There’s no reason to care what the 2008 Mariners’ team pythag/WAR was when it came to projecting the 2009 Mariners – there were only a handful of players in common. Any WAR/Pythag/Any Stat You Want that was generated by a group of players wearing similar laundry who are no longer in the picture is irrelevant to the future performance of the new guys in that same laundry.

        If you want to measure the predictability of WAR, you have to do it based on the roster that a team actually fields. Prior year roster does not work. Prior year performance does not matter beyond informing you about the expected results from the players that will be on both teams.

        That was the entire point of the USSM post that you apparently didn’t like. You’ll note that I did an entire post right after that using projected WAR of players *actually still on the roster* to talk about our current expectation level for the 2010 Mariners. Because that’s the best way we can estimate true talent – updated projections of current rosters.

      • Colin Wyers says:

        Dave, that’s an argument for using the correct inputs. Once you have those inputs, you are still going to get a more accurate projection of team wins if you use straight Pythag rather than going through the manipulations of WAR first. Positional adjustments, for instance, are a necessary evil when assessing an individual player’s value, but if all you want to know is team wins, they’re pointless at best.

      • Dave Cameron says:

        But how do you get a “projected” pythag once you use the correct inputs? You run a bunch of simulations or stick the aggregate team numbers into a run estimator, right? You definitely do not carry over prior year base/out performance, assuming that teams that did well in those situations will do well again. And that’s why prior year pythag is not as good an estimator of true talent level: it includes that non-skill context.

        So, in reality, a pythag projection with correct inputs isn’t really any different than WAR. Once you strip out the base/out context, they’re the same thing – win estimators based on linear weights.

      • Colin Wyers says:

        That would be true, Dave, if our positional adjustments were exactly correct, for instance. But a lot of the assumptions that underlie WAR are simply close approximations – to the extent that we don’t need them to determine team wins, they are a hindrance and not a help to an accurate team win projection.

        (And you also get a better fit using Pythag than a linear run-to-win estimator, for instance. And you get a better fit using BaseRuns than linear weights. WAR doesn’t do those things because you’re measuring a player’s performance in the context of an average team, and you don’t need to handle team performances substantially different from .500 – even an absurdly good 15-win player is only going to make a .592 win% team, well within the range where those assumptions hold.)

    • Joe R says:

      That should be self-evident, though; pythag is based on actuals, not projected numbers based on actuals.

      • JoeR43 says:

        What I interpreted him as saying was that pythag predicts wins better than WAR. Which is obvious, since pythag uses real run differentials to estimate wins, while WAR uses real statistics to estimate run differentials, and in turn estimates wins.

        But if he was talking about using pythag and WAR as a predictor for the next season, well, I disagree with that idea in the first place. That’s assuming everything stays the same for each team, which is obviously not true.

    • vivaelpujols says:

      Hehe, previous year’s Pythag is better than PECOTA is this year.

  28. BATTLETANK says:

    I emailed the site, but I’ll post it here, too.

    Can we get a post season table on the homepage showing WAR/WPA/FB? Would be nice to track who’s doing what in the post season!

  29. Ken says:

    This correlation is interesting – so I tried it with a couple of other statistics. Net batting average (i.e. BA – BAA) has a correlation with wins of 0.78. And net OBP (OBP – OBPA) has a correlation with wins of 0.83.
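
    For reference, the correlation being quoted throughout this thread is plain Pearson r between a team-level stat and team wins. A self-contained sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Usage: pearson_r(net_obp_by_team, wins_by_team) -- with the 2009
# team data, this is the 0.83 Ken reports (data not reproduced here).
```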

  30. JoeR43 says:

    Fun thing: using the fitted regression equation for these values (y = 46.122 + 1.1945x^0.951), it’s easy to find out who was the luckiest and unluckiest team.

    Most unlucky was the Indians, whose WAR-projected win total was 11 1/2 games better than the “real” one. Luckiest? Surprisingly, the Mariners were fairly close to their projected total; the winners there are the Reds and Padres. This must make fans of those teams feel real good.
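
    The luck calculation above is just the residual from the fitted curve. A sketch; the WAR and win totals in the example are hypothetical, not the actual Indians numbers:

```python
def projected_wins(team_war: float) -> float:
    """JoeR43's fitted curve mapping a team's total WAR to projected wins."""
    return 46.122 + 1.1945 * team_war ** 0.951

def luck(team_war: float, actual_wins: float) -> float:
    """Positive = won more than WAR projects ('lucky'); negative = unlucky."""
    return actual_wins - projected_wins(team_war)

# Hypothetical team with 30 WAR that won only 65 games -- an
# Indians-sized shortfall of roughly 11.5 wins:
print(round(luck(30.0, 65), 2))
```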

  31. Dirty Water says:

    The most vital addition needed for Team WAR to be predictive: Manager and GM WAR values.

    • B says:

      How does the GM affect winning and losing, other than by acquiring players? Since we already measure what those players are doing, what’s the point of wanting a GM number? Same with a manager – a good manager gets the right people in the right positions to succeed, and since we already measure those players’ success, what’s left?

  32. JoeR43 says:

    One more thing I notice about win value stats (though I haven’t attempted to prove it statistically).

    Teams that have one guy carrying the WAR load tend to underperform their projection. For the Indians, Choo and Lee (while he was in Cleveland) carried a lot of the load, and between them they accounted for nearly 1/3 of the Indians’ marginal wins. Royals/Greinke, same thing.

    The Phillies, on the other hand, have a lot of plus contributors, and while their team wins above replacement total isn’t all that great (actually below Atlanta’s), they have a very good team. The Mariners are a bit top heavy with Gutierrez, King Felix, and Ichiro, but it’s not like guys like Branyan, Lopez, and Aardsma are roster deadweight.

  33. Nate says:

    How can you be sure that adding base running and catchers’ defense to your formula will result in better correlation? How do you know it won’t hurt your results?

    Also, how does small sample size not apply in this case? How can you be sure that the correlation between WAR and wins for the 2010 season won’t be .16 or some other low number?

    Sorry if these questions seem basic, I’m not really up on my statistics.

  34. Mike says:

    Sweet.

    I did this on my site with a buddy and got the same results. We meant to get the standard deviation another time but never got around to it.

    Our R^2 was a little different for W% and pythag%, but it was pretty darn close to what you got.

  35. Eric says:

    I disagree about whether it is important, and whether it is testable, to examine if WAR can predict wins in a subsequent year. To answer this question, you could account for roster turnover by using the 2008 WAR of the individual players on each team’s 2009 roster. If WAR is a measure of “true talent,” then the collective “true talent” of a team’s roster (measured in 2008) should correlate closely with 2009 wins.

    I think it’s important to test this question because a measure of “true talent” that is context independent should be relatively constant from year to year. Therefore, one would predict that a previous year’s WAR (accounting for roster turnover) should correlate strongly with wins in the following year.

  36. Jud says:

    Hi,

    I have a few reservations regarding WAR as it relates to allotting value to pitchers. As I recall, WAR for pitchers is based entirely on FIP, so I have the following concerns:

    1) FIP does not take sequencing into account, so a HR, 1B, 1B, K, K, K inning is identical to a 1B, 1B, HR, K, K, K inning. If we’re speaking about future predictions this might be fine (I don’t know whether sequencing is a repeatable skill or not), but not if we’re talking about value provided.
    2) FIP does not take leverage into account at all, so all relief pitchers will be at a huge inherent disadvantage. I’m not advocating going all the way and using WPA as the only measure, but I do think some mix should be made.
    3) As Colin pointed out elsewhere, the range of FIP is much smaller than that of ERA, so there is an artificial shrinking of the differences between the best and worst pitchers.
    4) In cutting out the noise of fielding, FIP also removes value delivered. A hard-hit ball off the Monster should not be considered a neutral event as far as the pitcher is concerned. Disregarding FIP altogether would inject all of the fielding noise, though, and that would be worse. Hopefully Hit f/x will provide us with a better solution.
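
    Point 1 can be seen directly from the standard FIP formula: only counting totals enter it, so ordering cannot matter. A sketch, with two simplifying assumptions flagged here: the league constant (about 3.1, used as the default below) actually varies by season, and the usual formula folds HBP in with BB, which is omitted for brevity:

```python
def fip(hr: int, bb: int, k: int, ip: float, constant: float = 3.10) -> float:
    """Fielding Independent Pitching.  Only the HR/BB/K counting totals
    enter, never their order, so FIP is sequencing-blind by construction.
    The league constant (~3.1 here) varies by season."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

# Both of Jud's innings -- HR,1B,1B,K,K,K and 1B,1B,HR,K,K,K -- reduce
# to the same inputs (1 HR, 0 BB, 3 K over 1 IP), hence identical FIP:
print(fip(hr=1, bb=0, k=3, ip=1.0))
```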

    I’m also having problems completely buying into UZR, especially as UZR, +/-, and BP’s fielding stats give diverging results (we don’t see such a difference between wOBA and EqA). The BP guys claim the raw play-by-play data is not that accurate; here too, maybe Hit f/x will come to the rescue.

  37. Joe R says:

    In the “hmmm” category for everyone:

    WAR’s 2009 R^2 (linear): 68.18%
    WARP-1’s R^2 (linear): 74.44%

    WAR baselines out to around 47.5 wins, WARP-1 to 31. Both had slopes close to 1.

    That actually surprised me a bit.

  38. Tom Feeney says:

    I think WAR is a somewhat useful statistic, but it still has a very long way to go. Since the linear weighting in WAR is arbitrary, it’s definitely not an exact science, to say the least. One quick example: looking at Derek Jeter’s numbers for last year, his overall WAR is 2.1, and that translates to barely being a starting player (according to the stat, 2.0 is the minimum). I realize Jeter has lost a step or two and his defensive skills aren’t what they once were, but his offensive numbers were excellent, so I think the 2.1 is way below his overall contribution to the Yankees last year. And this is coming from a diehard Red Sox fan! In general, the WAR weightings far overvalue a player’s defensive skills, in my opinion.
