# What is WAR good for?

The quick answer to the title’s question is, of course, *not* absolutely nothing. What wins above replacement, more commonly known as WAR, is actually good for is a more complicated and involved question.

This past weekend, I had the honor and pleasure to present a paper at the 2012 Saber Seminar, a charity baseball conference in Boston that raises money to benefit the Jimmy Fund. I’ve cut down that paper a good deal and made some modifications to it based on the feedback I received at the conference. Essentially, I converted the paper into a form that would work well as an article at The Hardball Times.

Almost three years ago, the managing editor of FanGraphs, Dave Cameron, wrote a post entitled, “WAR: It Works,” that showed the correlation between 2009 projected wins based on WAR and actual wins. The correlation was high (r=.83), and Cameron used that finding to back this interesting statement:

WAR isn’t perfect. But given the known limitations and the variations in how contextual situations impact final record, it does an awfully impressive job of projecting wins and losses.

Keep Cameron’s quote in mind; I’ll get back to it.

I think it is really exciting that WAR has begun to get some mainstream attention. Baseball-Reference’s WAR is found directly on ESPN’s main statistical leaderboards. WAR has even been featured on SportsCenter.

It might just be an inherent quality of the sabermetric community’s “cult” mentality, but in my opinion there has been an inverse relationship between WAR’s mainstream acceptance and the weight that the statistic holds among sabermetricians. When Cameron wrote the post I referred to earlier, he noted that WAR faced detractors within the sabermetric community, even at that time:

(WAR) faces a decent amount of skepticism from people who don’t trust various components for a variety of reasons – they don’t like the numbers that UZR spits out for defense, they don’t believe in replacement level, or they believe that pitchers do have control over their BABIP rates.

WAR has been gaining acceptance, but some of the internet’s best sabermetric minds are distancing themselves from the statistic more than ever, especially as a single-season metric.

The goal of my paper for the Saber Seminar was to evaluate the ability of WAR to describe performance in a given season, as well as to predict future performances in a subsequent season.

The first analysis was the same as Cameron’s original study, with a few modifications. I expanded the sample and used Baseball-Reference’s WAR instead of FanGraphs’ version.

First, here’s a quick reference to the differences between B-R’s WAR and FG’s. Secondly, I modified Cameron’s sample to include five randomly selected teams per season (80 teams in total) during the full-season Wild Card era, 1996-2011 (1995 was a slightly shortened season). I then took the cumulative WAR for each of those teams and regressed it against their actual win total.

Baseball-Reference’s version of WAR uses a baseline winning percentage of .320; thus, a team with zero WAR, or an entirely replacement-level team, would expect to win roughly 52 games over the course of a 162-game season. Essentially, if WAR does a correct job at explaining where wins come from, the linear regression equation should be close to or exactly:

*WINS = 52 + 1.0 × WAR*

So, for each WAR a player contributed to his team, the team should win one more game above the 52-win baseline. Simple, but effective.
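The arithmetic behind that baseline can be sketched in a few lines of Python. The .320 winning percentage and the 162-game schedule come from the paragraph above; the function names are my own and purely illustrative.

```python
REPLACEMENT_WIN_PCT = 0.320  # Baseball-Reference baseline, as stated in the text
GAMES_PER_SEASON = 162


def baseline_wins(win_pct=REPLACEMENT_WIN_PCT, games=GAMES_PER_SEASON):
    """Wins expected from an entirely replacement-level (0.0 WAR) team."""
    return win_pct * games


def expected_wins(team_war, baseline=52.0, wins_per_war=1.0):
    """Idealized WINS = 52 + 1.0 * WAR relationship (hypothetical helper)."""
    return baseline + wins_per_war * team_war


print(round(baseline_wins(), 2))  # 51.84, which rounds to the 52-win baseline
print(expected_wins(40.0))        # 92.0
```

Under this idealized relationship, a roster that totals 40 WAR would be expected to win 92 games.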

Here are the results of the 80-team sample regression:

The first thing that jumps out from the regression is how well the samples fit to the projected linear equation. I expected the slope of the trendline to be around 1.0, and it came out to be 0.97, while I expected the intercept to be around 52 and it came out to 52.7—very close.

The correlation coefficient, r, from this sample (.91) was higher than the one from Cameron’s study (.83). Also, that correlation can be converted into an r^2 of .83, which simply means that 83 percent of the variance in wins is accounted for by WAR. That is amazing.
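For readers who want to run the same kind of check themselves, here is a minimal sketch of a simple linear regression of wins on team WAR. It is not the code used for the paper, and the five team totals below are made-up numbers for illustration only, not part of my 80-team sample.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# HYPOTHETICAL team totals, for illustration only (not the article's data).
team_war = [20.0, 28.5, 33.0, 41.5, 47.0]
team_wins = [71, 82, 84, 95, 98]

slope, intercept, r = fit_line(team_war, team_wins)
print(f"WINS = {intercept:.1f} + {slope:.2f} * WAR, r = {r:.2f}, r^2 = {r * r:.2f}")
```

The last value printed is just r squared, which is the same conversion that turns the .91 correlation into an r^2 of .83.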

Some of the detractors to Cameron’s original study argued that projected wins based on WAR weren’t useful, mainly because he showed that Pythagorean Record had a .91 correlation with wins, which was higher than Cameron’s WAR correlation. Interestingly, the correlation that WAR had in my study was identical to the one for Cameron’s Pythagorean record.

Also, Cameron’s study calculated one standard deviation of difference between WAR and actual wins to be over six wins (6.4), but my sample’s standard deviation is under three wins (2.91). In this sample, 42 of the 80 teams were within three wins of their projected WAR total. Cameron noted that 18 of his 30 teams (67 percent) were within six wins, while in this sample 67 of the 80 teams (84 percent) were within six wins.
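The "within N wins" comparison can be sketched as follows. This is an assumption-laden illustration, not the paper's code: I take the projected total to be 52 + WAR, treat "within three wins" as an absolute difference of three or fewer, and use the sample standard deviation of the differences. The function name and the inputs are hypothetical.

```python
from math import sqrt


def residual_summary(team_war, team_wins, baseline=52.0, thresholds=(3, 6)):
    """Compare actual wins with the WAR-implied total (baseline + WAR)."""
    errors = [wins - (baseline + war) for war, wins in zip(team_war, team_wins)]
    n = len(errors)
    mean_err = sum(errors) / n
    # Sample standard deviation of the differences (assumed definition).
    sd = sqrt(sum((e - mean_err) ** 2 for e in errors) / (n - 1))
    # Share of teams whose actual wins fall within each threshold.
    share_within = {t: sum(abs(e) <= t for e in errors) / n for t in thresholds}
    return sd, share_within


# HYPOTHETICAL inputs, for illustration only.
sd, share = residual_summary([30.0, 40.0, 50.0, 35.0], [82, 95, 98, 94])
print(round(sd, 2), share)
```

Applied to the real sample, the same share calculation is what turns 67 of 80 teams into 84 percent.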

The main reason sabermetricians have been distancing themselves from the importance of single-season WAR values is that single-season defensive metrics have a crazy amount of variability, so many people don’t trust them. The defensive statistic that has received the most criticism this season has to be Defensive Runs Saved (DRS), a statistic published by Baseball Info Solutions.

DRS data are used in calculating Baseball-Reference’s WAR, but that data only dates back to 2003; thus, my sample had WAR data which used two different defensive metrics. The 1996-2002 portion of the sample used Sean Smith’s Total Zone Rating (TZR).

I checked to see if there was a significant difference between the WARs based on the two different metrics, mainly because the critiques of DRS put enough doubt in my head that there was a good chance that the DRS portion of WAR had thrown off the sample:

TZR (1996-2002): WINS = .94*(WAR) + 53.37, r= .88; r^2 = .78; p < .001

DRS (2003-2011): WINS = .99*(WAR) + 52.1, r = .94; r^2 = .88; p < .001

The results for both samples are very good. They show that DRS, a heavily criticized defensive metric that is a major component of WAR, does not render the statistic useless. Instead, the DRS-based sample had a very high correlation and came almost exactly to the *WINS = 52 + 1.0 × WAR* regression equation I was looking for.

WAR does a very good job of describing what has happened in a given season; however, that isn’t always very useful. Predicting outcomes in future seasons is almost always more important (valuable) than describing what has happened before, so I decided to test the predictive value of single-season WAR.

I took a random sample of 30 teams (five per season) from 2006-2011 and summed their WARs in the previous season to project their win totals in the subsequent season. For example, I would calculate the cumulative 2010 WAR of the 2011 Toronto Blue Jays’ roster and then regress that total against their actual wins in 2011.

Some critical assumptions to the model were an assumed half-WAR (0.5) reduction for players who were declining (>age 30 in the outcome season) as well as a replacement-level or 0.0 WAR assumption for rookies. The replacement-level assumption may seem a little flawed, because rookies like Brett Lawrie come up and put up 3.0-plus win seasons in their first big league campaign. But for every Lawrie, there are a dozen rookies who play at or below replacement-level in their rookie campaign.
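Those two assumptions are simple enough to write down as code. The sketch below is one way to express them, not the spreadsheet I actually used. The `Player` class, the field names and the three-man roster are hypothetical, and I assume the half-win reduction is applied as a flat subtraction with no floor at zero.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    prior_war: float         # WAR in the predictor season
    outcome_age: int         # age in the outcome season
    is_rookie: bool = False  # rookie in the outcome season


def adjusted_war(player, decline_age=30, decline_penalty=0.5):
    """Apply the model's two assumptions to one player's prior-year WAR."""
    if player.is_rookie:
        return 0.0  # rookies are assumed to be replacement level
    if player.outcome_age > decline_age:
        return player.prior_war - decline_penalty  # flat half-win reduction
    return player.prior_war


def projected_team_war(roster):
    """Sum the adjusted prior-year WAR across the outcome-year roster."""
    return sum(adjusted_war(p) for p in roster)


# HYPOTHETICAL roster, for illustration only.
roster = [
    Player("Veteran", 4.0, 33),
    Player("Prime", 2.5, 27),
    Player("Rookie", 0.0, 23, is_rookie=True),
]
print(projected_team_war(roster))  # 6.0
```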

Here are the results of predictor-year WAR’s (e.g., 2010 Blue Jays) ability to project outcome-year wins (e.g., 2011 Blue Jays):

The results were statistically significant, with a decent correlation of .59. That correlation also means that only 35 percent of the variance in outcome year win totals is accounted for by the previous year’s WAR total. Also, the linear regression equation was nowhere near the expected WINS = 52 + 1.0 × WAR. Instead, the equation had an intercept of almost 64 wins, with a slope of just 0.68.
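To see what that flatter line means in practice, here is a small sketch comparing it with the idealized descriptive equation. The 0.68 slope comes from the regression above; the 64.0 intercept is a rounded stand-in for "almost 64", and the team WAR totals in the loop are hypothetical.

```python
def ideal_wins(team_war):
    """Descriptive ideal: WINS = 52 + 1.0 * WAR."""
    return 52.0 + 1.0 * team_war


def predictive_wins(prior_war, intercept=64.0, slope=0.68):
    """Fitted predictive line; 64.0 is a rounded stand-in for 'almost 64'."""
    return intercept + slope * prior_war


def variance_explained(r):
    """Share of variance accounted for, given a correlation coefficient r."""
    return r * r


for prior_war in (20.0, 37.5, 55.0):  # hypothetical prior-year team WAR totals
    print(prior_war, ideal_wins(prior_war), round(predictive_wins(prior_war), 1))

print(round(variance_explained(0.59), 2))  # 0.35
```

With these rounded coefficients the two lines cross at about 37.5 WAR, or roughly 89.5 wins. Below that point the fitted line projects more wins than the previous year's WAR alone would imply, and above it the fitted line projects fewer, so both ends are pulled toward the middle.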

To reiterate WAR’s descriptive strength, I ran a regression between the outcome-year WAR (e.g., 2011 Blue Jays) and outcome-year wins (e.g., 2011 Blue Jays) for this sample, in the same manner as the original sample of 80 teams:

The results of this regression were essentially identical to the results of the original study; the correlation, r, stayed at .91, and the linear regression was very close to expected, with a slope of 1.02 and an intercept of 51.93 wins.

Single-season WAR is quite obviously much better at describing what has happened than what will happen. I keep emphasizing the fact that some sabermetricians have begun to put very little weight or trust into single-season WAR results, but at the same time there are many sabermetricians who may overuse or overvalue the statistic. I’ve read time and time again in either trade or contract analyses that a certain player is going to provide a 3.0-5.0 win improvement for his new team based on his WAR from the year before. That conclusion is most likely incorrect. Take for instance, this extreme example:

The Red Sox signed Carl Crawford prior to the 2011 season. Crawford was worth over six wins (6.6 WAR) in 2010, and many wrote that Crawford would be a six-win improvement for Boston in 2011. As we all know, Crawford underperformed considerably for the Sox, accumulating zero wins over the course of the 2011 season, a perfectly replacement-level season. I think this example reiterates the point that baseball is so difficult to predict.

On a season-to-season basis, there is too much variability and uncertainty to say that what happened the season prior is definitely what will occur in the subsequent season. Weighted projection systems like Oliver, PECOTA, ZiPS and others are much more capable of looking at the full picture and projecting into the future than one season of data is.

### Conclusion

The fact of the matter is that the small samples behind certain metrics that go into WAR, especially the defensive ones, do not represent the true talent level of a player. This fact has caused many to claim that single-season WAR is useless, because it doesn’t reflect the true talent level of an individual.

But why should a single season of WAR reveal to us the true talent level of any player? How many times is one season of work enough to uncover the actual talent of any baseball player?

**Never**. Fluke seasons happen all of the time and are part of what makes baseball great.

While some are correct in saying that certain metrics aren’t a reflection of true talent, others make the claim that single-season defensive metrics are utterly useless because they are largely based on context and sequence of fringe defensive plays; thus, they tend to fluctuate greatly from season to season. But quite honestly, that is true of any baseball statistic. Traditional statistics like ERA, RBI, and even advanced statistics like wOBA and FIP have large sequence and context factors.

Consider this scenario:

Player A: An average defensive player over his career randomly has a great defensive season.

Player B: An average offensive player over his career puts up gaudy offensive numbers out of nowhere.

Player A’s WAR from that defensive season will be dismissed by the vast majority as useless and incorrect, ignored because of the “bad or incorrect data” being used to measure his defense. By contrast, Player B’s WAR will be accepted as hard fact, and his numbers will be considered either a fluke or a “breakout” campaign. This doesn’t make a whole lot of sense.

The misconception of true talent level versus where wins come from is where the analysis of WAR as a statistic falls apart.

The last sentence brings us full circle back to Cameron’s quote, which I cited earlier. Here it is another time:

WAR isn’t perfect. But given the known limitations and the variations in how contextual situations impact final record, it does an awfully impressive job of projecting wins and losses.

This quote jibes very well with my point, except with one little modification to the wording. Cameron says WAR does an impressive job of “projecting” wins and losses. I would reword this part to what I think he meant: WAR does an awfully impressive job of *describing* where individual wins come from for a team.

Single-season WAR does a phenomenal job at doing what it says it does. Single-season WAR should not be used to predict win totals or even WAR in a subsequent season. Single-season WAR also is not supposed to reflect the true talent level of a player, which I think is far and away the largest flaw in the way people interpret the statistic. If WAR did reflect true talent, every player would have the same WAR that perfectly encompassed how much value his talent should bring to his team every single year.

Even in the various definitions of WAR, the words “true talent level” never pop up:

-Our definition at THT is:

(WAR) is a metric that combines a player’s contributions on offense and defense and then compares him to the appropriate replacement level for his position.

-FanGraphs’ definition of WAR:

Wins Above Replacement (WAR) is an attempt by the sabermetric baseball community to summarize a player’s total contributions to their team in one statistic.

-Baseball Prospectus’ definition of WAR(P):

Wins Above Replacement Player is Prospectus’ attempt at capturing a player’s total value. This means considering playing time, position, batting, baserunning, and defense for batters, and role, innings pitched, and quality of performance for pitchers…Prospectus’ definition of replacement level contends that a team full of such players would win a little over 50 games.

-Baseball-Reference’s definition of WAR:

The idea behind the WAR framework is that we want to know how much better a player is than what a team would typically have to replace that player.

The consensus seems to be that WAR is how much value (WINS!!) a player contributes to his team over the baseline of a player who could replace him. WAR does not reflect the true talent level of a player, but instead it describes how many wins an individual player contributes on the actual field, and in that aspect it works spectacularly well.

**References & Resources**

All WAR data comes courtesy of Baseball-Reference

Dr. George J. DuPaul was a co-author on the original paper

“Some critical assumptions to the model were an assumed half-WAR (0.5) reduction for players who were declining (>age 30 in the outcome season) as well as a replacement-level or 0.0 WAR assumption for rookies.”

Was any consideration given to a more sophisticated approach to calculating the expected WAR for the next year? I would think that enough WAR data by age would be available to generate a table of WAR aging factors by age.

That was essentially my point, though. Projection systems do weight things like playing time, debut, park environment, injury history and a lot more to get a handy-dandy calculation. And I think, for the most part, they do a very good job of projecting WAR.

My point was that using simple WAR addition isn’t enough, which I think we see far too often, like I said with the Crawford example.

Also, it looks like my first comment got cut off; the full statement was “there’s probably a good chance that the results would’ve been better, but by only a negligible amount.”

In the last comment I meant to say predictor year WAR not wins, did not affect the r by a significant amount.

I appreciate the enlightening article. A bit off topic, but what do you think of the seemingly growing tendency of people in the media to try to use WAR as almost an “MVP-measure”? Do you feel WAR is best used for this type of analysis between given players in a certain year? Where do you think WAR fits into an MVP discussion versus other statistical analysis? Would appreciate any insight or opinion you have.

Oops. Sorry, Glenn. Just came across your other article “Andrew McCutchen: more valuable than we may have thought”. I think this probably answers most of the questions I just asked in my previous post, eh?