A Different Sort of Debate on WAR

Last month, the sabermetrics community descended into complete and utter anarchy over the latest and greatest debate on WAR. Industry heavyweights like Bill James, Tom Tango, and our own Dave Cameron all weighed in on the merits of baseball’s premier metric. After the dust settled, Sam Miller published an article on ESPN igniting a different sort of debate on WAR.

Miller’s piece noted that aside from the possible flaws behind WAR itself, each corner of the internet is calculating it a different way. For pitching specifically, FanGraphs (fWAR), Baseball Reference (rWAR), and Baseball Prospectus (WARP) all publish measures of WAR that oftentimes have significant disagreements. But that’s by design.

Miller brilliantly characterized these three metrics as follows:

  • rWAR – “What Happened WAR”
  • fWAR – “What Should Have Happened WAR”
  • WARP – “What Should Have Should Have Happened WAR”

The rest of the piece is outstanding, and comes highly recommended by this author. In the aftermath, though, Tom Tango of MLB Advanced Media responded with the following challenge:

Given that I humbly consider myself to be an aspiring saberist, I took that challenge. Well, I first took the challenge of college final exams, but then the pitching WAR challenge!

The dataset I worked from included 1,165 qualified individual pitching seasons spanning 2000-2016. For each season, I collected the player’s fWAR, rWAR, WARP, RA9-WAR, and RA9-WAR in the subsequent year. As Tango suggested, using RA9-WAR to look retrospectively at our three competing pitching metrics is the most effective way to measure the differences among the metrics themselves.

For those interested in the raw data, feel free to check it out here, and make a copy if you’d like to play around with it yourself.

Given the nature of the dataset, a logical place to start was a straightforward correlation table. That table is displayed below.
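For anyone who wants to recreate a table like this, each cell is just a Pearson correlation between one metric and the next season’s RA9-WAR. Here’s a minimal, self-contained sketch in Python; the handful of rows are invented for illustration and are not drawn from the actual dataset:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Hypothetical rows: (fWAR, rWAR, WARP, next-year RA9-WAR) for a few seasons
seasons = [
    (5.1, 4.8, 5.5, 4.9),
    (3.2, 2.9, 3.6, 3.0),
    (6.0, 6.4, 5.8, 5.1),
    (2.1, 1.5, 2.4, 2.2),
    (4.4, 4.9, 4.0, 3.8),
]
cols = ["fWAR", "rWAR", "WARP"]
for i, name in enumerate(cols):
    r = pearson([s[i] for s in seasons], [s[3] for s in seasons])
    print(f"{name} vs next-year RA9-WAR: r = {r:.3f}")
```

With the real file linked below, the same loop over all 1,165 rows reproduces the correlation column for each metric.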


As expected, small differences do exist between the various metrics in their abilities to predict future performance. In the sample, fWAR leads both WARP and rWAR by slight margins. For all you statheads out there, a linear regression on the data returns statistically significant p-values for fWAR and WARP, but not rWAR.

So that was fun, wasn’t it? With all of the nitty-gritty math out of the way, let’s dive into a few examples. Miller already highlighted Julio Teheran’s strange 2017 season, but as it turns out, there are far more extreme instances of metric disagreement.

Take Felix Hernandez’s 2006 season, for example. His first full season in the bigs culminated in an underwhelming 4.52 ERA, but a 3.91 FIP and a 3.37 xFIP were promising signs of future success. Similarly, the WAR metrics were unable to come to any sort of consensus.


By WARP, the 20-year-old Hernandez was the 14th best pitcher in 2006. He was surrounded on the leaderboard by names like Roy Halladay, Randy Johnson, and Greg Maddux. By rWAR, his 2006 season ranked 135th alongside Jose Mesa, Cory Lidle, and interestingly enough, Greg Maddux.

fWAR, on the other hand, seems to have found a happy medium between the other two metrics. Sure enough, it was also the most accurate predictor of Hernandez’s RA9-WAR in 2007.

Taking a step back, I now wanted to determine which of the three metrics was the most accurate predictor of a pitcher’s future RA9-WAR. Just as Tango does, we’ll call the current season “Year T” and the next “Year T+1.” The results of this exercise are displayed below.
Yet again, we see a slight victory for the FanGraphs WAR metric. However, with over 1,100 seasons in our sample, no single metric stands apart from the others. After all, they are designed with the same goal in mind: measuring pitcher value. As you’ll see below, each metric usually ends up with a similar result to the others.
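One simple way to score each metric’s Year T+1 accuracy is root-mean-square error against the next season’s RA9-WAR. The sketch below uses made-up rows (the real 1,165-season file is linked above), so the printed numbers are only illustrative:

```python
from math import sqrt

# Hypothetical (fWAR, rWAR, WARP, next-year RA9-WAR) rows -- not the real data
rows = [
    (5.1, 4.8, 5.5, 4.9),
    (3.2, 2.9, 3.6, 3.0),
    (2.1, 1.5, 2.4, 2.2),
]

def rmse(pred_idx):
    """RMSE of one metric (by column index) against Year T+1 RA9-WAR."""
    errs = [(row[pred_idx] - row[3]) ** 2 for row in rows]
    return sqrt(sum(errs) / len(errs))

for name, idx in [("fWAR", 0), ("rWAR", 1), ("WARP", 2)]:
    print(f"{name}: RMSE vs Year T+1 RA9-WAR = {rmse(idx):.2f}")
```

A lower RMSE means the metric’s Year T value sat closer, on average, to what the pitcher actually delivered the following season.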


What happens, though, in instances like Teheran’s? When the metrics have stark disagreements with each other, which metric remains most reliable? To answer this question, I dug up the 10 most significant head-to-head disagreements among each of the metrics, and again looked at which version of WAR best predicted the RA9-WAR in Year T+1. Those results are listed below.
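To make “most significant disagreements” concrete: rank every season by the absolute gap between two metrics, keep the largest gaps, then check which metric landed closer to the next season’s RA9-WAR. A sketch with hypothetical seasons (names and numbers invented for illustration):

```python
# Hypothetical records: (name, year, fWAR, rWAR, next-year RA9-WAR)
seasons = [
    ("A", 2006, 5.8, 2.1, 5.0),
    ("B", 2010, 3.0, 3.1, 2.8),
    ("C", 2003, 1.9, 4.6, 2.4),
    ("D", 2014, 4.2, 4.0, 3.7),
]

# Rank by the size of the fWAR/rWAR disagreement, largest first
by_gap = sorted(seasons, key=lambda s: abs(s[2] - s[3]), reverse=True)
top = by_gap[:2]  # with the real data this slice would be the top 10

# For each disputed season, see which metric came closer to Year T+1 RA9-WAR
f_wins = sum(1 for _, _, f, r, nxt in top if abs(f - nxt) < abs(r - nxt))
print(f"fWAR closer in {f_wins} of {len(top)} disputed seasons")
```

The same head-to-head tallying, repeated for each pair of metrics, produces the results in the table.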

What stands out to me here is not only that fWAR still appears to be the best forward-looking metric, but also that in nine of its ten most significant disagreements with rWAR, the DIPS approach to WAR won out.

Just as in “The Great WAR Debate of 2017,” this discussion too is entirely dependent on what one intends to use WAR for. Here, we’ve established fWAR as an excellent forward-looking metric. Depending on who you ask, rWAR likely serves its best purpose illustrating, as Miller put it, what did happen. WARP may either be many years ahead of its time, or could still use a fair amount of tweaking. Or both. Regardless, each version of pitching WAR comes with its own purpose, and each purpose has its own theoretical use.




E_Max23
Member

https://www.patreon.com/posts/solution-for-war-16376783

i wrote about this, already posted, waiting to see if Jeff approves it, or could you at least tell me what must i improve, thx, just trying to be heard

E_Max23
Member

i’ll delete the link if it’s inappropriate, just trying to get Jeff’s attention

35th and Not James Shields
Member
Adam,

Thanks for the additional research on the three W’s. I had read Sam Miller’s article, and your piece is a nice follow up.

Look forward to your next article, of course, when college permits.

phealy48
Member

Excellent article, perhaps the most interesting article I have read on WAR in a couple of months. Thanks.

John Edwards
Member

Interesting! I wonder if fWAR was the best at predicting Hernandez’s stuff because it functions based off of a more peripheral statistic than rWAR. Good read, Adam.

BigChief
Member

You may want to ask someone at BP but I believe because the most recent version of DRA includes data from PitchInfo, DRA and therefore WARP may be calculated differently pre-2008 than it is now.

If this is indeed the case I wouldn’t at all be surprised if WARP performs better now at year+1 than what is suggested from your data set.

Good work, btw.

Jeff Long
Member

This is interesting work Adam, thanks for sharing. I do think this warrants a discussion about a/the goal of WAR(P), because I’m not sure I agree with the premise of Tom’s tweet wherein he suggests that WAR(P) ought to be predictive of performance in year T+1.

Another researcher pointed out to me: since fWAR has the most regressed inputs, it makes sense that it would perform the best on these tests as players tend to move toward the mean over time.

This brings up an important consideration, which is that we should really look at all the inputs for a metric [WAR(P) in this case] and think through the pros/cons of each before testing them for validity. Each version of WAR(P) operates with certain assumptions, approaches, and more built in. As such, using a single metric, in this case RA9-WAR, doesn’t do justice to the nuances that have been baked into each version.

Balancing the various iterations of WAR(P) requires a thorough examination of the bits and pieces that make them unique.

You’ve explored an interesting question with a good deal of rigor. That’s far more than most could say and while there’s opportunity for iteration and potential improvement, there’s not much “aspiring” about this.