This morning, Jon Heyman noted an odd thing on Twitter:
— Jon Heyman (@JonHeymanCBS) April 29, 2013
He was quoting Baseball-Reference’s WAR calculation, and the two are indeed tied at +1.7 WAR on B-R. Here, we have Bryce Harper (+1.5 WAR) ahead of Starling Marte (+1.2 WAR), but the point still basically stands; WAR thinks Harper (1.200 OPS) and Marte (.835 OPS) have both been pretty great this year, with just a small (or no) difference between them. WAR believes that Marte has made up most of what Harper has done with the bat through his legs, in baserunning (+3 run advantage) and defense (+3 run advantage), as well as a slight bump from getting 12 extra plate appearances.
There’s no question that Harper has been a better offensive player, but there are questions about the defensive valuations, because defensive metrics aren’t as refined at this point as offensive metrics are. It is much easier to prove that Harper has been +10 runs better with the bat this year than it is to prove that Marte has been +3 runs better defensively by UZR, or +7 runs better defensively by DRS. There are more sources for error in the defensive metrics, and Heyman’s tweet led to a discussion on Twitter about the usefulness of including small sample defensive metrics in WAR.
I’ve written before about the strong correlation between team WAR and team winning percentage, and others have followed up with similar analysis more recently. However, all those articles have focused on full season or multi-season data samples, and since the question was raised and I hadn’t yet seen it answered, I became curious about whether WAR would actually correlate better at this point in the year if we just assumed every player in baseball was an average defender.
Essentially, if we just removed defensive metrics from the equation, and evaluated teams solely on their hitting and pitching, how would our WAR calculation compare to team winning percentage? And how does WAR correlate to team winning percentage based on just April 2013 data, when we’re dealing with much smaller sample sizes?
To answer that question, I turned to the numbers. To convert team WAR into an expected winning percentage, I just added position player and pitcher WAR to get total team WAR, divided by games played, multiplied WAR per game by 162 to get a full season total, and then added +47.7 wins — the replacement level assumption — to extrapolated team WAR. I then divided that total by 162 games to get a team’s expected winning percentage based solely on their WAR total, and compared that winning percentage to their actual current winning percentage. Here’s the table showing that comparison.
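The conversion described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as stated in the text; the function name, argument names, and the sample team are hypothetical, and the only constants taken from the article are the 162-game season and the +47.7 win replacement level.

```python
REPLACEMENT_WINS = 47.7  # replacement-level wins per full season, as stated in the text
SEASON_GAMES = 162


def expected_win_pct(position_war, pitcher_war, games_played):
    """Convert a team's WAR to date into an expected winning percentage."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    total_war = position_war + pitcher_war          # total team WAR
    war_per_game = total_war / games_played         # WAR per game played
    full_season_war = war_per_game * SEASON_GAMES   # extrapolate to 162 games
    expected_wins = full_season_war + REPLACEMENT_WINS
    return expected_wins / SEASON_GAMES


# Hypothetical team (not from the article): 3.0 position player WAR
# and 2.0 pitcher WAR through 24 games.
print(round(expected_win_pct(3.0, 2.0, 24), 3))
```

A team with exactly zero WAR comes out at 47.7 / 162, or roughly a .294 winning percentage, which is what the replacement level assumption implies.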
The correlation between actual team winning percentage and expected team winning percentage based on WAR is .88, which is almost exactly what Glenn DuPaul found when testing the correlation between full season WAR and team winning percentage last summer. It’s higher than what I got when I compared WAR to team winning percentage back in 2009, before we added things like baserunning to improve the formula. With about 15% of the season completed, current team WAR explains 78% of current team winning percentage.
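For readers who want to check the relationship between the correlation and the "explains X%" figure, here is a minimal sketch of a Pearson correlation and its square. The function name and the sample data are hypothetical and are not the article's team numbers; the only figure taken from the text is the .88 correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up actual vs. WAR-based expected winning percentages for five teams.
actual = [0.625, 0.542, 0.500, 0.458, 0.375]
expected = [0.590, 0.560, 0.480, 0.470, 0.410]

r = pearson_r(actual, expected)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")

# The share of variance explained is the square of the correlation.
print(round(0.88 ** 2, 2))
```

Squaring the rounded .88 gives about .77, so the 78% figure quoted above presumably comes from an unrounded correlation slightly higher than .88.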
Considering that WAR doesn’t include any kind of situational context, and we know the sequencing of hits and runs can have a major impact on a team’s win-loss record, that’s still a very robust correlation. That correlation suggests that WAR is doing a lot of things right in terms of measuring the results that lead to wins and losses.
It is almost certainly doing some things wrong as well, and it is theoretically possible that what WAR is getting right is hitting and pitching, and the defensive component is weakening what would be an even stronger correlation if the fielding metrics weren’t included. So, let’s check that out. Here’s a table of team winning percentage compared to a WAR-based winning percentage that assumes every player in baseball has played average defense this season. This is WAR with UZR removed, essentially.
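One way to picture "assuming average defense" is to back the fielding runs out of a WAR total. The sketch below is an assumption-laden illustration, not the site's actual calculation: the function name is hypothetical, the 10 runs per win conversion is a common rule of thumb rather than a number from this article, and the sample values are made up.

```python
RUNS_PER_WIN = 10.0  # assumption: rule-of-thumb conversion, not taken from the article


def war_with_fielding(war, current_fielding_runs, new_fielding_runs=0.0,
                      runs_per_win=RUNS_PER_WIN):
    """Replace the fielding runs baked into a WAR total with a different value.

    With new_fielding_runs left at 0.0, the player or team is treated as an
    exactly average defender. Passing another metric's run value instead
    swaps one fielding metric for another.
    """
    if runs_per_win <= 0:
        raise ValueError("runs_per_win must be positive")
    return war + (new_fielding_runs - current_fielding_runs) / runs_per_win


# Hypothetical team: 5.0 WAR that includes +8 fielding runs.
print(round(war_with_fielding(5.0, 8.0), 2))        # fielding removed
print(round(war_with_fielding(5.0, 8.0, 12.0), 2))  # a different fielding value swapped in
```

A team credited with positive fielding runs loses WAR when defense is zeroed out, and a team with negative fielding runs gains it, so the adjustment moves some teams up and others down.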
Get rid of those crappy small sample useless defensive metrics that are full of errors and bias and you end up with a lower correlation to team wins and losses. The r squared is now just .68, meaning this version of WAR explains only 68% of a team’s winning percentage. A month into the 2013 season, WAR explains less about team performance without UZR than it does with it.
The original source of Heyman’s tweet, though, wasn’t UZR. He was quoting B-R’s WAR calculation, which uses Defensive Runs Saved as its fielding metric. UZR isn’t nearly as bullish on Starling Marte’s defensive performance as DRS, so maybe it’s BIS’ fielding metric that’s the problem here? To check, I swapped out UZR for DRS in our WAR calculation and re-ran the numbers. One more table.
Well, that’s not it. WAR with DRS comes up with basically the same correlation to team winning percentage as WAR with UZR, and both do better than WAR without any defensive component.
Now, maximizing correlation to team winning percentage should not be the goal of WAR. If it were, we’d just make the inputs runs scored and runs allowed, and the correlation would be something like .99. That wouldn’t make it a better metric; it would simply be more highly correlated with winning percentage. This test is essentially a sanity check to make sure that WAR is actually measuring things that impact team wins and losses. The inputs of WAR were chosen to try to identify context-neutral individual player performance, and it’s a good sign that things chosen for those reasons end up correlating well to team wins and losses. It tells us that WAR is working pretty well, even in small samples. Even with imperfect inputs. Even with defensive inputs that are best used in the largest sample you can possibly get.
WAR is not perfect, nor is it precise. It is best used in whole numbers, with any fractional difference seen as a marginal gap at best, especially if that difference is based mostly on the defensive components. I wouldn’t say that Starling Marte has been Bryce Harper’s equal so far, because I doubt that’s true. You shouldn’t take two dozen games worth of WAR at face value.
But you shouldn’t take two dozen games worth of anything at face value. The major league leader in ERA is currently Jake Westbrook, at 0.98. If you took ERA at face value, you’d have to argue that Jake Westbrook has been the best pitcher in baseball, and is on pace to have the best pitching season in the history of the game. No one actually believes that, and no one is arguing that, because everyone knows that in a month’s worth of games, you’re going to see some funky results. Funky results in 24 games do not invalidate a metric.
Just like every statistic under the sun, WAR is better when used in large samples. But, despite the beatings it takes on a regular basis, WAR actually does its job pretty well — not perfectly, because it is not a perfect model — even just based on April data alone.
And, getting back to the original point, I’ll note that WAR is very good at spotlighting players like Starling Marte, who deserve recognition but probably aren’t getting it due to the continuing focus on the triple crown statistics in the mainstream media. Today, because of WAR, a lot of people learned that Starling Marte is having a pretty great April. I’ll call that a win, even if the calculation might be off by a few runs here or there.