## Relief Pitching in Context

If you recall, last week I talked about one approach we can take to evaluating starting pitcher performance. Today, I’d like to continue in that vein, this time taking a look at relief pitching.

With regard to evaluating both player performance and player talent, relief pitching is one of the least understood aspects of baseball. A few factors lead me to believe this, but the only one I’d like to talk about today is the problem of mid-inning pitching changes.

You’ve probably noticed the ridiculousness of the following situation: a starting pitcher cruises through the first six innings of the game, allowing just one run. However, he loses his control with two outs in the seventh and walks the bases loaded, forcing the manager to call on the bullpen to get the last out. The reliever proceeds to allow a three-run double before recording the third out.

Of course, in this situation, the starting pitcher is held responsible for, and subsequently penalized for, all three runners he left on base when the manager removed him from the game. Conversely, the reliever, despite being the active pitcher while all three runs scored, is rewarded for pitching a scoreless third of an inning. Nothing in the box score indicates that the three runs charged to the starting pitcher in fact scored while he was sitting on the bench.

Most will agree that this is not only unfair to the relief pitcher, whose salary and legacy are strongly tied to his ERA, but also a significant flaw in the way we allocate runs allowed. The flaw matters most for relief pitchers: given their small number of innings pitched each season, each run allowed carries far more weight.

On the surface, there is no simple solution to this issue. For obvious reasons, we cannot charge all inherited runners to the reliever, nor can we charge him none, and any fixed split in between would be arbitrary.

Luckily for us, there is a fantastically simple solution: RE24. Despite the technical-sounding name, the idea is straightforward: RE24 compares the team’s expected runs for the remainder of the inning before each play with the expected runs after it, counting any runs that actually scored on the play. In other words, how much did the hitter, or conversely the pitcher, change his team’s expected runs scored?

Let’s apply that to the case of inherited runners, specifically the one presented above. When the manager brings in the reliever with the bases loaded and two outs, the batting team is expected to score about 0.7 runs for the rest of the inning, given an average offense and defense. The reliever then gives up a three-run double followed by the third out, at which point the batting team’s run expectancy is obviously zero. So run expectancy dropped by 0.7 from when the reliever entered the game, but three runs actually scored: 0.7 minus 3 gives the reliever an RE24 of -2.3 for the inning. In contrast to the official runs allowed, which assigns the reliever zero runs, RE24 charges him with nearly all of the runs that scored, less the runs that were already expected to score when the starter left the game.
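The bookkeeping above can be sketched in a few lines. This is a minimal illustration: the 0.7 figure is the approximate run expectancy of a bases-loaded, two-out state, and real values come from a league- and era-specific run-expectancy matrix.

```python
def pitcher_re24(re_entering, re_exiting, runs_scored):
    """RE24 from the pitcher's perspective; positive is good.

    re_entering -- run expectancy when the pitcher took over
    re_exiting  -- run expectancy when he left (0 if the inning ended)
    runs_scored -- runs that crossed the plate while he was pitching
    """
    return (re_entering - re_exiting) - runs_scored

# Reliever enters with the bases loaded and two outs (RE ~0.7),
# allows a three-run double, then records the third out (RE = 0).
print(round(pitcher_re24(0.7, 0.0, 3), 2))  # -2.3
```

Had the reliever instead retired the batter, the same formula would credit him with +0.7: the runs he was expected to allow but didn’t.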

Make sense? Good. Let’s look at some actual numbers.

**Top ten reliever seasons by RE24, since 1974:**

| Num | Name | Season | Team | RE24 |
|---|---|---|---|---|
| 1 | Mark Eichhorn | 1986 | Blue Jays | 52.88 |
| 2 | Doug Corbett | 1980 | Twins | 43.81 |
| 3 | Bill Campbell | 1977 | Red Sox | 42.97 |
| 4 | Mariano Rivera | 1996 | Yankees | 40.48 |
| 5 | Jim Kern | 1979 | Rangers | 40.38 |
| 6 | Rich Gossage | 1977 | Pirates | 40.10 |
| 7 | Willie Hernandez | 1984 | Tigers | 37.91 |
| 8 | Rich Gossage | 1975 | White Sox | 35.80 |
| 9 | Keith Foulke | 1999 | White Sox | 35.14 |
| 10 | B.J. Ryan | 2006 | Blue Jays | 35.13 |

Mark Eichhorn’s epic 1986 season, in which he pitched 157 innings with a 1.72 ERA *all in relief*, leads the way, and it’s not close. There are some other big names on this list, as well as some not-so-big names, but regardless, these are ten of the best relief seasons of the past 40 years.

While the above list is somewhat interesting, it doesn’t tell us everything we want to know. First, RE24 is a cumulative metric, as evidenced by the fact that almost every one of those relievers pitched over 100 innings, a feat rarely seen in recent years. Second, the scale means little to most people: there is nothing to compare the number to among the usual pitching metrics, which makes it hard to use in everyday discussion.

These two concerns can be solved in one simple step: convert RE24 to a runs-per-nine-innings scale, similar to ERA and RA9. At this point, I must give credit where credit is due, as Tom Tango provided the method for converting RE24 to an RA9 scale yesterday:

> [W]e can recast RE24 into an RA9 scale (i.e., similar to ERA) as follows. Say the league average is .48 runs per inning. Say you have a pitcher that has an RE24 of +40 runs and has pitched 200 innings. That means the league average is .48 x 200 = 96 runs, and our pitcher here is 40 runs better than that, or 56 runs allowed. So, his (RE24-based) RA9 is simply 56/200*9 = 2.52.

Don’t get bogged down in the details of the calculation. The result is a metric which measures the pitcher’s runs allowed per nine innings using RE24 instead of actual runs allowed. We’ll call this Context RA9, or cRA9, going forward.
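Tango’s conversion is mechanical enough to sketch directly. This uses his example numbers; the 0.48 runs-per-inning league average is illustrative and would be replaced with the actual league rate for the season in question.

```python
LEAGUE_RUNS_PER_INNING = 0.48  # illustrative league average

def context_ra9(re24, innings, lg_rpi=LEAGUE_RUNS_PER_INNING):
    """Recast a pitcher's RE24 onto a runs-per-nine-innings scale (cRA9)."""
    league_runs = lg_rpi * innings     # what an average pitcher would allow
    runs_allowed = league_runs - re24  # RE24 runs better (or worse) than that
    return runs_allowed / innings * 9

# Tango's example: +40 RE24 over 200 innings.
print(round(context_ra9(40, 200), 2))  # 2.52
```

A pitcher with an RE24 of exactly zero comes out at the league rate (0.48 × 9 = 4.32), which is the sanity check that the scale behaves like RA9.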

The great thing about cRA9 is not just that it gives us a more accurate measure of a relief pitcher’s run prevention, but that we can compare it to RA9 in order to identify the pitchers who are best and worst at preventing inherited runners from scoring.

First of all, the leaders in reliever Context RA9 since 1974:

| Num | Name | Season | Team | RA9 | cRA9 | Diff |
|---|---|---|---|---|---|---|
| 1 | David Robertson | 2011 | Yankees | 1.21 | 0.38 | -0.84 |
| 2 | Al Alburquerque | 2011 | Tigers | 1.87 | 0.40 | -1.47 |
| 3 | B.J. Ryan | 2006 | Blue Jays | 1.49 | 0.54 | -0.95 |
| 4 | Joaquin Benoit | 2005 | Rangers | 1.30 | 0.55 | -0.74 |
| 5 | Mike Jackson | 1998 | Indians | 1.55 | 0.70 | -0.85 |
| 6 | Jonathan Papelbon | 2006 | Red Sox | 1.05 | 0.70 | -0.35 |
| 7 | Rob Murphy | 1986 | Reds | 0.72 | 0.81 | 0.09 |
| 8 | Cla Meredith | 2006 | Padres | 1.07 | 0.84 | -0.22 |
| 9 | Neal Cotts | 2013 | Rangers | 1.41 | 0.86 | -0.55 |
| 10 | Greg Holland | 2011 | Royals | 1.95 | 0.89 | -1.06 |

Wow. I remember David Robertson’s 2011 season as being very good, but by RE24 and cRA9, it was historically elite. He is known as “Houdini” for his ability to pitch out of seemingly impossible jams, and these numbers are further evidence of that. Amazingly enough, Al Alburquerque had an equally impressive season that same year, albeit in fewer innings.

As noted above, it may also be interesting to look at the relief pitchers with the largest positive differences between their Context RA9 and their normal RA9, i.e., the pitchers who received more credit from RA9 and ERA than they deserved:

| Num | Name | Season | Team | RA9 | cRA9 | Diff |
|---|---|---|---|---|---|---|
| 1 | Dan Quisenberry | 1987 | Royals | 2.76 | 5.66 | 2.90 |
| 2 | Joe Smith | 2007 | Mets | 3.65 | 6.23 | 2.57 |
| 3 | Mike Duvall | 1999 | Devil Rays | 4.73 | 7.24 | 2.51 |
| 4 | Tim Crews | 1989 | Dodgers | 3.94 | 6.19 | 2.25 |
| 5 | John Franco | 1996 | Mets | 2.50 | 4.73 | 2.23 |
| 6 | Luis Aquino | 1995 | - – - | 7.23 | 9.43 | 2.21 |
| 7 | Scott Radinsky | 1996 | Dodgers | 3.27 | 5.44 | 2.18 |
| 8 | Scott Radinsky | 1998 | Dodgers | 3.06 | 5.23 | 2.16 |
| 9 | Joe Beckwith | 1980 | Dodgers | 2.56 | 4.67 | 2.11 |
| 10 | Matt Kinney | 2004 | - – - | 4.53 | 6.63 | 2.10 |

This list, above everything else, is evidence that reliever ERA and RA9 can be, frankly, horrible indicators of a pitcher’s actual performance. By RA9, pitchers like Quisenberry and Franco look fantastic; by cRA9, we might think them deserving of a demotion to the minors.

On the other side of the coin, here are the ten largest negative differences between cRA9 and RA9 since 1974, i.e., the pitchers who were better than their RA9 and ERA indicated:

| Num | Name | Season | Team | RA9 | cRA9 | Diff |
|---|---|---|---|---|---|---|
| 1 | Javier Lopez | 2004 | Rockies | 7.52 | 4.13 | -3.40 |
| 2 | Mike Munoz | 1997 | Rockies | 4.93 | 2.38 | -2.54 |
| 3 | Mike Holtz | 2000 | Angels | 5.71 | 3.20 | -2.51 |
| 4 | Steve Frey | 1992 | Angels | 3.57 | 1.37 | -2.20 |
| 5 | Dennis Cook | 1996 | Rangers | 4.35 | 2.21 | -2.14 |
| 6 | Lee Smith | 1981 | Cubs | 4.30 | 2.42 | -1.88 |
| 7 | Dennis Powell | 1992 | Mariners | 4.74 | 2.88 | -1.86 |
| 8 | Steve Reed | 1993 | Rockies | 5.02 | 3.18 | -1.83 |
| 9 | Geoff Geary | 2007 | Phillies | 5.88 | 4.09 | -1.79 |
| 10 | Andrew Miller | 2012 | Red Sox | 3.35 | 1.56 | -1.79 |

Again, these numbers illustrate how unreliable conventional run-prevention metrics are for relievers, especially relievers who often enter games in the middle of an inning. By cRA9, we see a list of mostly very good, possibly elite, relief pitchers, but by RA9, we see a list of pitchers only fit to appear in blowouts.

The issue presented at the beginning of this piece should not be a controversial one. The idea that starting pitchers should be responsible for all runners left on base is ridiculous, and the idea that relievers should *not* be responsible for said runners is just as ridiculous, and more consequential. As the results above show, a relief pitcher’s performance when brought in mid-inning can be the difference between one of the best reliever seasons in baseball and one of the worst.

So next time you see a relief pitcher’s ERA, consider the context of their appearances and their RE24 compared to their peers before you come to any conclusions.


Nice article Matt!

“The idea that starting pitchers should be responsible for all runners left on base is ridiculous, and the idea that relievers should not be responsible for said runners is just as ridiculous, but more important.”

Couldn’t have said it better. It’s about time to kill the “inherited runners scored”.

I don’t think “inherited runners scored” should be “killed”. If a relief pitcher has an ERA of 1 and has let 10 of 25 inherited runners score, his performance and value are obviously different from what that ERA of 1 suggests. This stat, like almost all stats, has to be thought of in context; managerial decisions play a large part in whether a pitcher gets into situations with inherited runners at all. Additionally, runners at 2nd and 3rd with no outs facing the 3, 4 and 5 hitters in a lineup is very different from a runner on 1st with 2 outs facing the backup catcher.

Just like you shouldn’t look just at ERA alone, don’t look at Inherited runners scored alone.

I’ll upgrade that “nice article” to “great article.” Many years, or maybe even decades, ago, I had the idea that the runs scored in an inning with a within-inning pitching change should be proportionally charged according to the likelihood that the runners would score when the first pitcher left.

However, I could not figure out any way to do that, and it never occurred to me to use RE24 even after I recently learned of it.

I can think of many reasons why this is not a perfect solution (e.g. one pitcher may have faced tougher batters than the other), but it is plenty good enough.

Thank you for making a childhood dream come true.

Please Davids, carry this stat for all pitchers.

What is the YTY correlation in RE24 or cRA9? Is the ability to “mop-up” something certain pitchers can replicate? Also, I wonder how much of this is based on managerial decisions (misuse of LOOGYs or ROOGYs) or pulling a starting pitcher too late/too early?

I’d like to know this too. This smells of “clutch-ness” to me.

I love this stat, and I love the recent articles on using WPA in these new ways. Any way we could get this as a permanent stat on Fangraphs?

The biggest problem with any statistical calculation for one-inning and one-out RP’s is the volatility of randomness itself. Just to use an example – if the “average” for RA is 0.48 runs per inning; then NO pitching appearance of less than roughly 2 innings can be validly quantified using most statistical measures. You can only score runs in increments of 1 – not tenths of a run. So any pitcher whose entire season is composed of short appearances will have a set of completely bipolar outcomes. Every data point will be an outlier – and no measure of “central tendency” (eg mean) is valid. Any calculation that uses such a measure of central tendency (eg some calc that is attempting to compare SP’s with RP’s or to use an SP measure to measure RP’s) is also invalid.

Sometimes it’s better to ask whether quantification itself is valid.

Gosh, you are right. We should just throw up our hands and say “it’s impossible!” Relievers are entirely a black box.

Quantification of reliever value is happening whether or not you think it’s possible. Regardless of whether runs are scored in integer values, individual contributions to those runs must be fractional: every event in baseball involves multiple players who contribute to an outcome. That means each individual involved gets partial credit for that outcome, and by totaling those up we can approximate value!

Just because you can DO something doesn’t make it valid statistically.

It doesn’t matter if it’s valid, it only matters if it’s more valid than the incumbent method.

To cite Robertson’s 2011 season: his actual ERA results (and yes, I’m using ERA because it’s easier to find; his RA results are just as bipolar) that year were:

64 appearances with a 0.00 ERA

2 appearances with a 9.00 ERA

1 appearance with a 13.50 ERA

1 appearance with an 18.00 ERA

2 appearances with a 27.00 ERA

It may be valid to measure his (“good” appearance)/(total appearance) ratios and use that as a measure of something. It is not valid to try to find some “mean” of these individual numbers.

That’s not what’s being done here. Your measure is stripped of context entirely and does not tell you whether inherited runners scored in any of those 64 appearances with a 0.00 ERA. Context matters for relievers, fair or not. cRA9 takes into account how many runs are expected to score when a reliever comes into a game versus how many actually score. This is something we can total up, a counting stat. Counting stats happen at a rate, and that rate can give us an approximation of value. I really don’t see the same problem you do.

“Context” does NOT matter if the run increment (earned or not or attributed to someone else or not) ends up producing this sort of bipolar distribution. No measure of central tendency is valid in that instance.

Further, RP’s are not responsible at all for how long they pitch. That’s a decision of the manager. But that makes a HUGE impact on the outcome.

RP’s (at least the modern variety of 1 IP max) can validly be measured on all sorts of things — but not runs.

Okay, I think I’m getting a handle on what you want. RE24 values by game do have a bimodal distribution and are unlikely to have a value of zero. Instead of seeing this as sign that we can’t use it at all, we can acknowledge that means don’t give us predictive measures (although they do tell us something about past value, and are therefore interesting) and attempt to devise a measure that credits a reliever for the ratio and position of the two modes in his distribution. Is this something that would satisfy you that context does matter for reliever value?

It’s a big step forward to use cRA9 vs RA9, because in individual games, RE24 happen on a bimodal distribution, not random clusters. We can use this. Point that out, instead of “OMG, you suck because mean values aren’t predictive in bimodal distributions, CAPS YELLING!” It’s the difference between contributing and being unhelpfully negative.

Not EVERY statistic needs to have predictive value. Sometimes having a useful way to describe what actually happened is useful, too. I agree that reliever stats are almost always going to come with a SSS caveat. But…what’s wrong with saying “look at what this reliever actually did”, which is what we’re doing here. It doesn’t matter that Robertson’s 2011 is not one he’ll produce again. It happened and it’s interesting to look at what he did. Again…even if we can reasonably look at the numbers and know they aren’t repeatable.

I’m not talking about statistical significance or predictability. And the problem is that an invalid measure is NOT descriptive either. It is merely deceptive.

RP’s – the 1 IP types in today’s run-scoring environment – cannot be measured on runs. You can validly measure them on outs, K’s, BB’s, batted ball outcomes, etc. But the manager himself has to take responsibility for the “runs” because he is the one who has chopped the “appearance length” into sub-run pieces.

Likewise, if we ever get to true one-out-only pitchers; they cannot be measured on any of the things above – but only on pitches.

Now I think you might be misunderstanding the concept of RE altogether.

http://www.fangraphs.com/blogs/get-to-know-re24/

Basically, each and every play has an expected number of runs scored associated with it. If a player allows more runs than we would expect from the base/out state, we can dock their value. If they allow less than the base/out state, we can add to their value. Relievers influence the number of runs scored with their pitching. They can be measured on this.

I can get behind the concept that reliever outcomes exist on a bimodal distribution and the mean isn’t terribly predictive, but I can’t get behind the idea that they don’t influence the run prevention.
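The base/out accounting described above can be sketched with a toy run-expectancy table. All of the values here are illustrative placeholders, not an official matrix; real tables are estimated from league play-by-play data.

```python
# Toy run-expectancy values keyed by (base state, outs). Illustrative only.
RE = {
    ("empty", 0): 0.48,
    ("loaded", 2): 0.70,
    ("runner on 2nd", 2): 0.32,
}

def batter_re24(before, after, runs_scored):
    """RE24 credited to the batter for one play; the pitcher gets the negative."""
    return RE[after] - RE[before] + runs_scored

# A bases-loaded, two-out, three-run double that leaves a runner on second:
delta = batter_re24(("loaded", 2), ("runner on 2nd", 2), 3)
print(round(delta, 2))  # 2.62
```

The pitcher on the mound for that play is docked the same 2.62 runs, which is exactly the mechanism that lets RE24 split an inning’s runs across a mid-inning pitching change.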

I’m pretty sure that game-based endpoints are arbitrary and you can treat any sample of relief innings as a single session, but I can’t prove it.

Tim, you are right. You can actually use play-by-play data, since that’s what matters when you are considering context.

Game based endpoints aren’t arbitrary if what you measuring is game context info. K’s or other stuff – sure.

And re run prevention – pitchers influence that via outs. Outs can be validly measured using means and such. And on a valid base, one can also then do something like weight outs differently depending on context — iow some outs are bigger than others (and that is also recognized – if not quantified – by every drunk fan who says “that’s a big out”).

Further, focusing on outs also allows you to combine outs and thus to quantify an element of partial inning pitching strategy. Eg – walking the first batter in order to induce/force a GIDP and end the inning.

I think this article does the best job I’ve seen yet of correlating how MLB teams pay relievers vs. how WAR thinks they should be paid. The real question is how reliable or projectable this stat is. My guess is probably “not at all,” but judging by the way even smart front offices are paying FA relief pitchers, GMs must have some idea of how to predict which pitchers will contribute to run prevention.

Mark Eichhorn in 1986 came in a distant third place to Jose Canseco and Wally Joyner in the AL Rookie of the Year balloting. Both b-refWAR and fWAR indicate that Eichhorn was robbed.

Eichhorn was Finkeled.

Regarding Eichhorn’s 1986 season, an affiliated website to Fangraphs names it as the best relief season ever:

http://www.hardballtimes.com/main/article/10-greatest-reliever-seasons-ever/

Eichhorn, I have always felt, was entirely robbed by the mojo of Jose Canseco. Not that Jose wasn’t a treat, but that year Eichhorn was incredible; he made hitters look silly, very very silly.

He’s one of my favourite memories of the mid-80s Jays. He struck out batters at ace rates, pitched back-end-of-the-rotation innings, didn’t walk many (and even 1/3 of his walks were IBB, perhaps contributing to the beauty of this article), didn’t give up many hits, and was an unforeseen superstar set-up man in a ‘pen that needed it desperately.

I suspect that several things worked against rightful recognition, though. He was a failed starter with a decidedly funky delivery. That delivery brought immediate comparisons to one of the great RP of the day, Dan Quisenberry, and Eichhorn got the short end of those comparisons. Also, in a day when offense reigned and triple crown stats were indeed all that, Eichhorn was up against Canseco and Joyner.

I’ll remember Eichhorn as a player who made Jimy Williams look better than he was. Williams was a terrible manager who turned excellence into mediocrity. Without Eichhorn’s dominance in ’86, the Jays’ record is likely reversed.

I’m noticing that the top 4 of the last table (Javier Lopez through Steve Frey) were all used as strict LOOGYs in those seasons (very-sub-1.0 IP/G), so they probably entered and left in the middle of an inning quite often.

I think the analysis of the last table was incorrect. Taking the example from the article say a pitcher begins the inning and loads the bases with two outs without giving up a run. Then a second reliever is brought in and gives up a bases clearing double. While RA9 gives all of the fault to the original pitcher RE24/cRA9 gives most to the second reliever. Essentially RA9 is overly harsh against the original pitcher while RE24/cRA9 places too much blame on the mid-inning reliever. Ultimately, context plays such a big role in mid-inning relief appearances that it is hard to analyze. Perhaps if we take into account the leverage index(exLI for the original pitcher, inLI for the reliever) we can better reward/blame pitchers in this situation.

sorry, it is gmLI for the reliever.

RE24 is correct. It’s not overly- or underly-harsh.

And it matters if the bases were loaded with 0 or 2 outs.

When you say RE24, you really mean COBRA, right?

When looking at the best and worst bullpens this season, would it be more reliable to use ERA or xFIP?

Sorry, but this is more complicated than necessary; more importantly, it’s ad hoc and mathematically questionable. For example, if the reliever in your example gets the guy out instead, are we to take 0.7 earned runs *away* from his total? That would be the logical follow-through on the idea. Whatever the run expectation is in any given situation, it has no relation to who is responsible for any runs that do score; those are two completely different concepts. Much simpler solutions exist.

RE24 is the best solution, and yes, we do end up with negative runs.

Now, if you want to argue there are more PALATABLE solutions, sure, go ahead.

Sure I can do that.

But suppose you first explain to us, or at least to me, just how it is that RE24 is the “best” solution to a variable that inherently has no closed form analytical solution.

cRA9 sounds like it’s to xFIP what WPA is to wRC+