Win Probability Added Above Replacement

Zach Britton was great in 2016, but was he really the American League’s best pitcher? (via Keith Allison)

Short Version

WPA, used as a value metric, is incomplete. You have to build in replacement level, including bullpen chaining, to get the full story. These adjustments, which are commonly accepted parts of WAR, shift value to starting pitchers relative to high-leverage relievers, while keeping the win expectancy framework of WPA at the heart of the value metric. With readily available data we can use some basic algebra to convert WPA to WPA-Above-Replacement, or WPAAR.

Much Longer Version

Thanks to Zach Britton’s near-perfect season, a reliever received real Cy Young support. In the saber community, his candidacy was supported by WPA (Win Probability Added), where he led the majors with +6.1 wins (Andrew Miller finished second with +4.8 wins). WPA-backed arguments mimic those that tout Mike Trout as the obvious MVP: “whoever helped their team win the most games is the most valuable player.” The problem with WPA-based MVP arguments, however, lies in the assumption that the WPA leader is the player who helped his team win the most games. Even after deciding that the win probability framework is the one you want to use, WPA is just a win probability metric, not the win probability metric, and, as I’ll lay out below, it’s an incomplete win probability metric.

Background: Win Probability and Leverage

Win probability (or win expectancy) is the baseball version of the little percentages next to the cards in televised poker. Given the state of the game, and an expectation of “all” the possibilities that could occur the rest of the game, how likely is each team to win? Win probability added is the change in win probability after an event occurs. When Jose Bautista hits a home run, the Blue Jays are more likely to win, and that change in expectancy is credited to Bautista. Add up all these little changes over a full season, and you have a player’s WPA.
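As a minimal sketch of that bookkeeping, the snippet below sums per-event changes in win probability into a season total for each player. The player labels and the before/after probabilities are invented for illustration; a real calculation would look each game state up in a win expectancy table.

```python
# Each event: (player, win probability before, win probability after).
# These values are made up for illustration, not taken from a real table.
events = [
    ("Hitter A", 0.50, 0.65),  # e.g. a go-ahead home run
    ("Hitter A", 0.40, 0.36),  # e.g. a strikeout with runners on
    ("Hitter B", 0.55, 0.60),  # e.g. a leadoff single
]


def season_wpa(events):
    """Sum each player's changes in win probability across all events."""
    totals = {}
    for player, wp_before, wp_after in events:
        totals[player] = totals.get(player, 0.0) + (wp_after - wp_before)
    return totals


for player, wpa in season_wpa(events).items():
    print(f"{player}: {wpa:+.2f}")
```

Because every event's change is measured against the win probability just before it, the totals are relative to average: a league's worth of hitters and pitchers nets out to zero.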

WPA is very similar to WAA, Wins Above Average, except for how the wins are tallied. WPA uses win probabilities, while WAA uses linear weights. In the middle is REW (Run Expectancy Wins). Run expectancy, like win expectancy, uses the game situation to calculate the change from each play. The difference is that run expectancy only takes into account runners on base and number of outs, while win expectancy also accounts for inning and score. Linear weights doesn’t care about the context of an event at all, using the average value across all possible contexts. REW, like linear weights, uses a runs-per-win converter to translate runs into wins. Win probability starts with wins as the unit.

To summarize:

Metric Summary

  • Linear weights
    Context/leverage: None
    Question answered: On average, across all situations a PA might occur in, how many runs does a single add?
    Common stats: wRAA, RAA, WAA (converted to wins)
  • Run expectancy
    Context/leverage: Runners on base, outs
    Question answered: How many more runs do we expect to score this inning because of this single?
    Common stats: RE24, REW (converted to wins)
  • Win probability
    Context/leverage: Inning, score, runners, outs
    Question answered: How much more likely are we to win this game because of this single?
    Common stats: WPA
  • Championship probability
    Context/leverage: Standings, inning, score, runners, outs
    Question answered: How much more likely are we to win the World Series because of this single?
    Common stats: cWPA

In all of these expectancy metrics, there is an inherent assumption that some situations are more important than others. For example, an at-bat in a tied game in the ninth inning matters more than in a six-run game in the fifth. It matters more because the outcome of the at-bat has a bigger influence on the outcome of the game. Mathematically, the average change in win expectancy is larger in the first example – there are wider swings. The difference between a strikeout and a home run is quite wide in a tied game in the ninth, while the difference is negligible in a six-run game in the fifth. And you know that intuitively, because your heart is racing. This “average change in win expectancy” is known as leverage. Every situation can be assigned a leverage value using similar math to expectancy metrics. Each expectancy metric has its own version of leverage, according to the context it cares about.
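One way to read that definition is as a probability-weighted average swing, scaled so that a typical situation comes out at 1. The sketch below follows that reading. The outcome probabilities, the win expectancy changes, and the `AVERAGE_SWING` constant are all invented for illustration, and published leverage index values are computed from full win expectancy tables rather than three outcomes.

```python
def expected_swing(outcomes):
    """Probability-weighted average absolute change in win expectancy."""
    return sum(prob * abs(delta) for prob, delta in outcomes)


def leverage_index(outcomes, average_swing):
    """Scale a situation's expected swing so an average situation equals 1."""
    return expected_swing(outcomes) / average_swing


# (probability of outcome, change in win expectancy): illustrative values only.
tied_ninth = [(0.70, -0.06), (0.27, 0.08), (0.03, 0.40)]
six_run_fifth = [(0.70, -0.005), (0.27, 0.006), (0.03, 0.02)]

AVERAGE_SWING = 0.035  # assumed swing of an average plate appearance

print(round(leverage_index(tied_ninth, AVERAGE_SWING), 2))
print(round(leverage_index(six_run_fifth, AVERAGE_SWING), 2))
```

With these made-up inputs the tied ninth-inning at-bat lands well above 1 and the six-run fifth-inning at-bat well below it, which is the ordering the paragraph describes.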

If you’ve heard of leverage, it’s most likely the one associated with win expectancy, but there’s also base-out leverage, championship leverage, etc. (Linear weights does not have an associated leverage, since outcomes have no context in linear weights.) FanGraphs reports a few aggregated stats measuring win expectancy leverage. pLI is a pitcher’s average leverage across all plate appearances. inLI averages the leverage at the first pitch of each inning a pitcher started. gmLI averages the leverage at the first pitch a pitcher throws in each game. exLI measures the leverage when a pitcher exits. When calculating reliever WAR, wins above average based on linear weights (or FIP or ERA) is multiplied by LI to give relievers who pitch more important innings more credit for their runs prevented.

Background: Bullpen Leverage Chaining

Finally, while closers pitch high-leverage innings and deserve a lot of credit for doing so, their replacements aren’t replacement-level relievers, but instead are setup guys. When a closer goes down, the guy added from Triple-A is given mop-up duty, not the closer role, while everyone else moves one step higher on the ladder. The closer is replaced by the setup guy, the setup guy is replaced by the 7th inning guy, and so on all the way down the line. All those little changes add up to yield the actual value of the closer. To account for this, we give half credit for the higher leverage innings of good relievers. Why half? Because that’s what makes the math work out – there’s a longer explanation and an example calculation here if you are interested in said math. Closers usually deserve to close because they’re excellent relievers, but replacing them with setup guys doesn’t hurt the team as much as their raw leverage and WPA numbers suggest.

Background: Replacement Level

Again, what all these probability/expectancy stats have in common is that they are relative to average. You can interpret that as the league summing to zero net wins, or that each player is compared to an average player. But we don’t use wins-above-average very often, because it’s incomplete. It doesn’t account for the value that an average player provides over a replacement level player. It says that a 0 WAA player over 10 plate appearances was just as valuable as a 0 WAA player over 600 plate appearances. But you’d rather have the second player, because the first requires you to find another 590 plate appearances at league-average rate. That’s not easy, and not cheap. That’s the reason why we usually use WAR (Wins Above Replacement), building in the value of an average player above and beyond that of a replacement level player. This can be more than a two-win difference for full-time players.

Relative to above-average stats, above-replacement stats reward additional playing time. This shifts value from relievers to starters, because starters pitch more innings. Additionally, the replacement level for relievers is better, because performance improves moving from starting to relieving (and declines moving the other way). This adjustment isn’t too dissimilar from park adjustments, accounting for the difficulty of the job each player does. Relieving is easier than starting. The advantages of pitching in relief include throwing harder, using only your best pitches, and facing hitters only once per game. Most relievers are failed starters. Justin Verlander has a career 3.47 ERA as a starter, but can you imagine what his ERA would be as a reliever, going just one inning at a time? Research has shown that the typical pitcher would have an ERA almost a full run lower in a relief role than as a starter. Strikeouts increase about 17 percent, home runs per batted ball decrease about 17 percent, and BABIP decreases by about 17 points. Replacement level for relievers is about the league average ERA, while replacement level for starters is about a full run higher. One run of ERA over 180 innings is a difference of 20 runs, or about two wins. That, not coincidentally, is the value of a league average starter: two wins.
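The closing arithmetic in that paragraph can be checked with a short sketch. The function name is hypothetical, and the default of 10 runs per win is an assumption implied by "20 runs, or about two wins" rather than a figure stated directly in the text.

```python
def replacement_gap_wins(era_gap, innings, runs_per_win=10.0):
    """Convert an ERA gap over a number of innings into wins.

    era_gap is in runs per nine innings; runs_per_win is an assumed
    converter (about 10 runs per win).
    """
    runs = era_gap * innings / 9.0
    return runs / runs_per_win


# One run of ERA over 180 innings: 20 runs, or about two wins.
print(replacement_gap_wins(era_gap=1.0, innings=180))
```

The same function scales the credit with workload: a starter with more than 180 innings earns proportionally more than two wins from this adjustment alone.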

As you can probably guess, these adjustments comparing an average player to a replacement level player significantly decrease the value of high-leverage relievers when judged solely by WPA. But these are all adjustments that we already make in WAR and are commonly accepted. By using win probability above replacement, we’re still giving bullpen aces lots of credit for their higher-leverage performances, just not as much as raw WPA claims.

The New Stuff: Converting WPA to WPAAR

So, what’s the solution? I’m going to call it Win Probability Added Above Replacement, and calculate it using the 2016 versions of Zach Britton and Jon Lester (the top starter by WPA) as examples. The two main adjustments are for bullpen chaining and differing replacement levels of starters and relievers.

  • Start with WPA. For Britton, this is +6.1. For Lester, this is +4.6. Because the former is a higher number than the latter, many people make claims like “WPA says Zach Britton was more valuable than Jon Lester.” The purpose of this article has been to highlight the context missing in that interpretation of the two numbers. I’d go so far as to say it’s plain wrong.
  • Adjust pLI (leverage index) halfway toward 1. This is the bullpen chaining adjustment. For Britton, a pLI of 1.8 becomes 1.4. For Lester, .94 stays at .94, since he’s a starter. WPA is giving Britton full credit for the situations he pitched in, when he really only deserves half.
  • Move WPA toward average (zero) by the ratio of LI_adj/LI. For Britton, that ratio is 1.4/1.8 = 78%, and 78% * 6.1 = 4.8. For Jon Lester, no change from +4.6. Because Britton only deserves half credit for the high-leverage situations he finds himself in, his WPA is adjusted down.
  • Credit the player for the value of league-average performance over replacement level. For starters, that’s about 2 wins per 180 innings. So Jon Lester gains 2*202/180 = 2.2, for a total of 6.8 WPAAR. But since reliever replacement level is approximately league average, there’s no extra credit for Britton. He stays at +4.8.
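The four steps above can be sketched as one function. The function and parameter names are hypothetical, the inputs are the rounded figures quoted in the steps, and treating every pitcher as either a pure starter or a pure reliever is a simplifying assumption.

```python
def wpaar(wpa, pli, innings, is_starter, starter_wins_per_180_ip=2.0):
    """Convert WPA to WPA Above Replacement, following the four steps."""
    if is_starter:
        # Starters get no chaining adjustment; their leverage is left alone.
        chained_wpa = wpa
        # Step 4: about 2 wins per 180 innings of league-average starting.
        replacement_credit = starter_wins_per_180_ip * innings / 180.0
    else:
        # Step 2: move pLI halfway toward 1 (bullpen chaining).
        pli_adj = (pli + 1.0) / 2.0
        # Step 3: move WPA toward zero by the ratio LI_adj / LI.
        chained_wpa = wpa * pli_adj / pli
        # Step 4: reliever replacement level is roughly league average.
        replacement_credit = 0.0
    return chained_wpa + replacement_credit


britton = wpaar(wpa=6.1, pli=1.8, innings=67, is_starter=False)
lester = wpaar(wpa=4.6, pli=0.94, innings=202, is_starter=True)
print(f"Britton: {britton:.2f}  Lester: {lester:.2f}")
```

With these rounded inputs Lester comes out at about 6.84 and Britton at about 4.74. The 4.8 quoted for Britton uses the ratio rounded to 78%, so small differences in the last digit are rounding effects rather than a different method.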

In total, Jon Lester gains 2.2 wins due to replacement level, while Britton loses 1.4 wins due to replacement level and chaining. Britton’s 1.5 win lead in WPA over Lester becomes a 2.0 win deficit in WPAAR. Here’s the top 25 leaderboard from 2016.

2016 WPAAR Leaders
Name Team IP WPA WPAAR delta
Jon Lester CHC 202 4.6 6.8  2.2
Johnny Cueto SF 219 3.8 6.2  2.4
Max Scherzer WAS 228 3.6 6.1  2.5
Kyle Hendricks CHC 190 3.9 6.0  2.1
Justin Verlander DET 227 3.5 6.0  2.5
Clayton Kershaw LAD 149 4.2 5.8  1.7
Jose Fernandez MIA 182 3.2 5.2  2.0
Tanner Roark WAS 210 2.9 5.2  2.3
Aaron Sanchez TOR 192 2.9 5.1  2.1
Masahiro Tanaka NYY 199 2.6 4.8  2.2
Chris Sale CHW 226 2.3 4.8  2.5
Zach Britton BAL  67 6.1 4.8 -1.4
Jose Quintana CHW 208 2.4 4.7  2.3
Madison Bumgarner SF 226 1.9 4.4  2.5
J.A. Happ TOR 195 2.2 4.4  2.2
Rick Porcello BOS 223 1.8 4.3  2.5
Noah Syndergaard NYM 183 2.1 4.1  2.0
Corey Kluber CLE 215 1.7 4.0  2.4
Cole Hamels TEX 200 1.8 4.0  2.2
Julio Teheran ATL 188 1.9 3.9  2.1
Andrew Miller – – –  74 4.8 3.8 -1.0
Marco Estrada TOR 176 1.8 3.8  2.0
Jake Arrieta CHC 197 1.6 3.8  2.2
Carlos Martinez STL 195 1.6 3.7  2.2
Rich Hill – – – 110 2.4 3.6  1.2

If you want to see the whole list, which displays more of the data, you can see it here.


Additional Notes

Now, I don’t actually suggest using WPA for starting pitchers, as their leverage is heavily dependent on run support and timing of the runs scored in the game, which are clearly not pitching skills (for more on not using WPA for starting pitchers, read these three pieces at The Book blog). A better approach is to use a different, more traditional WAR metric for starting pitchers, even if you want to compare them to the WPAAR numbers of relievers. If we remove starting pitchers from the WPAAR leaderboard above, here’s how relievers stack up:

2016 WPAAR Leaders, Full-Time Relief Pitchers
Name Team IP WPA WPAAR delta
Zach Britton BAL 67 6.1 4.8 -1.4
Andrew Miller – – – 74 4.8 3.8 -1.0
Sam Dyson TEX 70 3.6 2.6 -0.9
Dan Otero CLE 70 2.1 2.5  0.4
Mark Melancon – – – 71 3.1 2.4 -0.7
Jeremy Jeffress – – – 58 2.9 2.3 -0.6
Roberto Osuna TOR 74 2.8 2.2 -0.5
Aroldis Chapman – – – 58 2.7 2.2 -0.5
Robbie Ross Jr. BOS 55 1.8 2.0  0.2
Will Harris HOU 64 2.3 1.9 -0.4
Mychal Givens BAL 74 1.9 1.9 -0.1
Seung Hwan Oh STL 79 2.2 1.8 -0.4
Joe Blanton LAD 80 1.9 1.7 -0.1
Blake Treinen WAS 67 1.7 1.7  0.0
Cody Allen CLE 68 2.1 1.7 -0.4
A.J. Ramos MIA 64 2.1 1.6 -0.5
Mauricio Cabrera ATL 38 1.9 1.6 -0.3
Ryan Buchter SD 63 1.7 1.6  0.0
Addison Reed NYM 77 1.8 1.5 -0.2
Brad Hand SD 89 1.7 1.5 -0.2
Tyler Lyons STL 48 1.1 1.5  0.3
Kenley Jansen LAD 68 1.8 1.5 -0.3
Kelvin Herrera KC 72 1.7 1.5 -0.3
Nate Jones CHW 70 1.9 1.5 -0.4
Tyler Thornburg MIL 67 1.9 1.4 -0.4
Matt Bush TEX 61 1.6 1.4 -0.2
Peter Moylan KC 44 1.1 1.4  0.3
Jeurys Familia NYM 77 1.8 1.4 -0.5

Additionally, WPA does a poor job of parsing defensive credit between pitching and fielding (as in, it doesn’t do it). A great play by a fielder is credited to the pitcher under WPA, when really the pitcher should be held accountable for the quality of the batted balls he gave up, while the fielder is credited or debited value from that point depending on whether he makes the play. With the growing popularity and availability of Statcast data, this splitting of WPA credit between pitchers and fielders might be possible.


After adjusting WPA to account for replacement level and bullpen chaining, Zach Britton remains one of the top five most valuable pitchers in the American League in 2016, and only Justin Verlander is significantly ahead of him. But the lead he held in WPA has disappeared. WPA is a fine metric, but it’s incomplete. You can’t forget replacement level and all of its repercussions. With WPAAR, I think we have a metric that is more closely aligned with a pitcher’s true value.

I just want to say that the bullpen chaining argument, as presented here (I haven’t yet read the linked piece that describes it) sounds utterly bogus to me. The argument, as I follow it, is that because an AAA replacement is given the lowest leverage innings, and the setup guy is thus given the highest leverage innings, the closer should effectively be compared against a “setup guy-replacement” as opposed to an “AAA replacement” as is typically meant. It’s effectively shifting your defined replacement level up because of the other guys in the bullpen. But the reason that argument fails is…
The reason for the chaining has to do with a different effect, I think. This is something that was discussed on Fangraphs a lot during the playoffs in regards to when to use Andrew Miller. Assume for a moment a game where the score is 2-0 after the 1st inning. The starter goes 6 innings, then 7th, 8th, 9th are each individual relievers. Because each new inning closes a higher percentage of remaining opportunity to score, the WPA for each pitcher will go 9th, 8th, 7th regardless of who they face. Think about it like this: The perfect 7th inning…

Right. Number of outs left is another way of saying that leverage increases during the game, as measured by leverage index. I think the author is assuming people understand what leverage index is and how it works.

Sky Kalkman
I agree that the ability to give different quality relief pitchers different average LI’s is what requires the chaining. That’s not different from what I wrote in the article, but it could have been called out more directly. I’ll point out that what you say about leverage increasing as the game goes on is true *given the assumption that the game remains close*. If each reliever gives up one run instead of none, the leverage is likely to get lower over the rest of the game. And if the game doesn’t start close, leverage will go down over the rest…
Sky Kalkman
1. Glad you came around on the chaining 😉 You’re exactly right that rep level for a closer isn’t the setup man, but the combined shifting of all relievers up the ladder. 2. “Just because a situation was subject to 1.8 higher times WPA-shifts in either direction doesn’t mean that 1 run production there is worth 1.8 run production at a average-leverage time”: yes, that’s actually exactly what it means! And you, yourself say it two sentences later: “the *whole point* is that high leverage runs prevented are more valuable than 1/10th of a win.” You can either multiply…
Sky Kalkman

In point #2, you’re right that it’s the inverse; I misread what you wrote. But that just means we’re in agreement, I think. Do you think I implied the other way somewhere?

Chaining can be a complicated subject. As an example, it can be applied to position players too. For instance: Let’s say your catcher is a 3.0 talent and his backup is a 1.0 talent, both over 600 PAs. And to make it realistic, let’s say that your catcher has 400 PAs and his backup has 200 PAs in Season A, for 2.33 WAR in total from catchers. What if your primary catcher is injured all of Season B? Based on the catcher’s WAR, you might say that the team will lose 2.0 WAR. But the backup catcher becomes the primary…
Sky Kalkman

And there’s leverage considerations for hitters, too! Middle-of-the-order lineup spots tend to hit in slightly higher leverage situations, because there are likely to be more runners on base in front of them. WAR doesn’t give them this leverage credit. Should it? I’d argue yes, because just like a dominant reliever deserves to pitch in higher leverage situations, a top hitter deserves to hit in the middle of the lineup. Should it give them all of the leverage credit? Probably not. How much credit, then? Uhh, err, I’ll get back to you.


Just use RE24 for batters! That’s my preference, anyway.


Now we’re starting down a slippery slope. Players on teams with great offense also hit with runners on base more often, on average, than players on lesser offensive teams [think Josh Donaldson vs Mike Trout, 2015]. That is probably going to translate into higher average leverage, though it depends on the average runs allowed by the team, too. In this case, higher leverage production clearly correlates with Runs and RBI, context-dependent stats that I thought no saber would be caught dead praising.


Studes, for your catcher chaining example, I assume we should be using the value of the average backup catcher in the league for the replacement-level calculation, correct?

Peter Jensen
Sky – I don’t think you should be multiplying WPA by leverage index. WPA already is computed with the exact leverage for the situation. So relievers are already correctly valued using WPA. Leverage Index was devised to be able to correct for other metrics that don’t include all the factors that WPA uses. Studes – Using RE24 for batters is correct for descriptive metrics but you can actually calculate run value added by lineup position if you want to correct for the variation of opportunities by lineup position. Linear weights is better for predictive metrics unless you are a…

Peter, I’m multiplying WPA by [(LI+1)/2]/LI. If LI is 1.8, that’s [(1.8+1)/2]/1.8 = 1.4/1.8 = .78. This ratio will always be less than 1 when LI is above 1, shifting WPA towards zero. The reasoning is that while WPA correctly values the change in WP for any event, relievers don’t deserve full leverage when determining their value to the team, because their replacements would also be “given” those leverage opportunities. (It’s the same logic as the LI adjustment in reliever WAR.) Think of it as WPA/LI, but instead of removing all the leverage of each PA, just a bit is removed.

Peter Jensen

Thanks for the explanation Sky. I knew you were multiplying by (LI+1)/2, but I had missed that you were then dividing the result by LI.


Thanks Peter. To me, WAR is a descriptive stat, which is why I’m all in on RE24. On Twitter, Sky asked me if I like RE24/boLI. That’s one way of correcting for variations. Unfortunately, my memory doesn’t work so well. I remember that I wasn’t a fan of boLI, but I can’t remember why. Getting old sucks.

John Autin

I haven’t fully digested the theory, and I tend to doubt there can be one metric that lets RPs be gauged against both RPs and SPs. But I admire the effort, and the clear explanation. Thanks, Sky!

In the move from WPA to WPAAR, there is a major change that went unmentioned but deserves to be highlighted. Team WPA is the sum of individual WPA, but this is not the case with WPAAR. This same shift happened in the move from Win Shares to WAR. I am waiting, like you, for StatCast-based WPA, or SCWPA, or SCWPAAR. I imagine we are pretty much there; every launch angle/exit velo should by now have a pretty well established probability distribution of outcomes, allowing us to calculate the expected value of a batted ball. Actually, we probably could go…