## Quantifying Bullpen Roles: The 2016 Season

Author’s Note: This is the second part of a two-part article, both parts of which are intended to stand on their own. The first introduces terminology and a mathematical framework used to derive statistics; the second uses these new ideas to draw conclusions which are hopefully intriguing to the reader. If you need it as a reference, you can refer back to the first article (here).

Below, I’ll use some metrics – average and weighted-average Euclidian distance between relievers – to look at the 2016 season. Ideally, we’d like to be able to associate a covariate with these metrics. That is, we’d like to be able to say “bullpens with lower weighted-average distances are (blank),” where we fill in the blank with some common-sense concept or truism about the way we know the game to work. Short of that, though, maybe we can just get an understanding of why the bullpens at either extreme have found themselves there.

So, without further ado, here are the bullpens of all 30 teams, sorted by weighted-average Euclidian distance in 2016.

How can we interpret this? There’s no real obvious trend here: there are “good” and “bad” bullpens on both ends of the table, along with “good” and “bad” teams. At the extremes are good case studies, though: A subpar Phillies bullpen on a subpar Phillies team, a solid Orioles bullpen on a solid Orioles team, and of course, the Cubs. What can we learn from looking at them in more detail?

### The 2016 Phillies Bullpen: An Ode to Brett Oberholtzer

Most people reading this know how the Phillies season went last year. They were supposed to be bad. Then, briefly, they appeared to be good. People did what they could to explain why the Phillies appeared to be good, including looking at their overachieving bullpen. As it turns out, the Phillies were bad after all. Baseball is fun.

The Phillies being bad explains part of what you see above. They tended to employ a lot of guys in the middle innings when they were already behind in the game. That’s a product of circumstance, and not an indictment of those guys. Elvis Araujo, Severino Gonzalez and Colton Murray weren’t great pitchers, and it’s sort of odd to have three of those guys rotating into your bullpen at various points in the season. Then again, the Phillies were bad, and those three guys were young, and they could afford to give young guys longer runs than a competing team could have.

There are those three guys, and then there’s Brett Oberholtzer, a slightly older, more experienced pitcher, whose MLB time before 2016 was mostly as a starter. He can be considered the quintessential mop-up guy in 2016. He’s way over there to the left – in fact, he had the lowest average score differential when entering the game of any relief pitcher in 2016. Here’s what his inning-score matrix looked like:

This doesn’t even do Brett Oberholtzer justice, though. Here’s a histogram of score differential by appearance that puts it into context.

Oberholtzer made 26 appearances for the Phillies in 2016, and most of them were in garbage time. Then, there was the one appearance where the Phillies actually led when he came into the game. It was the 10th inning, and most of the Phillies bullpen had already been spent. Pete Mackanin had little choice but to bring Oberholtzer in to protect a one-run lead in the 10th. Which he did, earning a save. Brett Oberholtzer has no “regular” mode, no “normal” days. Baseball is wonderful. Baseball is weird.

Getting back to the Phillies bullpen as a whole: It’s not so atypical outside of Oberholtzer and an abundance of negative-score pitchers. Jeanmar Gomez was used in a fairly typical “closer” role, with Hector Neris and Edubray Ramos in higher-leverage setup roles. This all seems to comport with how we think of modern bullpens.

### The 2016 Orioles: A Well-Oiled Machine

The Orioles had a very effective bullpen by most measures in 2016. Certainly, it helps to have Zach Britton churning out ground ball after ground ball, but overall the group was very effective, registering a league-leading 10.22 WPA for the season (with second place not being particularly close). Their 53 “meltdowns” were also fewest in the league. This was a playoff team, largely because of their bullpen. That is to say, this is a very different team than the 2016 Phillies.

That said, there are some similarities here.

The general shape is the same, although the Orioles were giving their bullpen a lead more often than the Phillies. One striking similarity is the presence of a “mop-up” guy, in this case, Vance Worley. Worley logged an impressive 64.2 innings in just 31 relief appearances. He was also never given the ball with a lead of less than six (!).

Worley soaked up a lot of innings for the O’s, and he did so in a rather effective way, ending with an ERA of 3.53 – a number which, while partially luck-driven, probably doesn’t suffer from quite as much inherited-runner variance as the average reliever. He created his own messes, and was allowed to clean them up, because Buck Showalter mostly thought the game was over anyway. The overall structure of a bullpen may be related, by necessity, to how deep into games the starting rotation can pitch on a regular basis.

One item of interest here: The unweighted average distance is actually higher in the O’s bullpen than in the Phillies bullpen. When weighting by inverse variance, the Phillies show an even larger average distance, while the average distance narrows for the Orioles. This speaks to more rigid roles in Baltimore, particularly for the setup guys. Darren O’Day was very seldom called upon when the team was behind (four out of 34 appearances, none when trailing by more than three runs), whereas Hector Neris was used a bit more fluidly (18 out of 79 appearances, five appearances when trailing by five or more runs). There may again be a team effect at work here: Maybe the Phillies found themselves needing to get Neris work more often during long losing streaks, and were set on throwing him on a certain day regardless of score.

### The 2016 Cubs: An Embarrassment of Riches

If you’ve been under a rock or are currently time traveling, this may shock you: The Cubs were really good last year. They even won the World Series! The Cubs!

OK, with that out of the way, this graph is going to look quite different than the previous two.

Did the Cubs ever not have a lead going into the seventh inning? Well, yes, I assure you that they did. Multiple times, in fact! However, they didn’t do it often enough to give anyone in their bullpen a “mop-up” role, or anything that resembles one. Look at that graph! The Cubs had Aroldis Chapman and Hector Rondon, and then they had seven other guys hanging out in the O’Day / Neris / Brad Brach neighborhood of the graph. What’s going on here?

There’s another thing that’s different about the Cubs which can help explain this. A lot of members of their bullpen have very high variances by score. Whereas O’Day, Neris and Brach have score variances in the single digits, many of the Cubs relievers have score variances north of 10. Take another look at the score variances in the Phillies and Orioles bullpens. Double-digit numbers are typically reserved for long men, mop-up guys, and lower-leverage relievers. Here’s Justin Grimm, who represents this pretty well:

Maybe this was a conscious decision by Joe Maddon, matching up in high-leverage situations with different arms. Maybe this was simply a necessary decision to keep everyone fresh in the face of repeated high-leverage situations: If you have late-game leads for five or six consecutive games, the same three arms can’t be used in all of them. It’s not as if Justin Grimm was used a lot in these situations, and no one would refer to him as a “high-leverage reliever.” He did have a dozen or so appearances in the high-leverage areas of the graph, though, and that’s not nothing.

You can chalk this up to the Cubs being really, really good in 2016, and likely, there’s some merit to that. But it also probably doesn’t tell the whole story. Out of 279 relievers with 20 or more appearances in 2016, only 18 of them had an average inning of 7 or later, an average score differential of 1 or more, and a score variance of 10 or more. Five of those 18 were on the Cubs. The Nationals, Rangers, Red Sox and Dodgers – all good teams in their own right, if not quite as dominant as the Cubs – had one such player each. The Indians had none.
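The filter described above is easy to express in code. Here is a minimal sketch, assuming each reliever has already been reduced to a handful of summary numbers; the `RelieverSummary` type, its field names, and the sample rows are all hypothetical and are not 2016 data.

```python
from dataclasses import dataclass


@dataclass
class RelieverSummary:
    # Hypothetical container; field names are illustrative only.
    name: str
    appearances: int
    mean_inning: float
    mean_score_diff: float
    score_variance: float


def late_lead_high_variance(relievers):
    """Keep relievers meeting the four thresholds quoted in the text."""
    return [
        r for r in relievers
        if r.appearances >= 20          # 20 or more appearances
        and r.mean_inning >= 7          # average inning of 7 or later
        and r.mean_score_diff >= 1      # average score differential of 1 or more
        and r.score_variance >= 10      # score variance of 10 or more
    ]


# Made-up rows, only to show the filter in action.
sample = [
    RelieverSummary("Reliever A", 60, 7.4, 1.3, 11.2),
    RelieverSummary("Reliever B", 65, 8.9, 1.8, 3.1),
    RelieverSummary("Reliever C", 25, 6.1, -2.5, 14.0),
]
matches = late_lead_high_variance(sample)
print([r.name for r in matches])
```

Only the first made-up reliever passes: the second has too rigid a role by score, and the third pitches too early and mostly from behind.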

It’s safe to say that Joe Maddon managed his bullpen differently than any of these teams in 2016. It’s also hard to argue with the results.

## Quantifying Bullpen Roles: The Math

Author’s Note: This is the first of a two-part article, both parts of which are intended to stand on their own. The first introduces terminology and a mathematical framework used to derive statistics; the second uses these new ideas to draw conclusions which are hopefully intriguing to the reader. If you’re not into math, you can skip to the second article (here) and refer back to this one as needed.

Recently, I wrote about the inning-score matrix, and how we could refine the concept to put a finer point on when and how certain relief pitchers are used. Statistical oddities and outliers are always fun topics of conversation, and certainly, appearance data can give us that.

But can it give us more than that? I don’t care so much that Will Smith was used differently after he was traded or that Brett Oberholtzer was the closest thing to a true mop-up man in the game last year – OK, actually, those things are really interesting too – so much as I care to define how managers are employing bullpens. This may not even explain why managers are doing what they’re doing; it’s difficult to attribute intent when looking at numbers abstracted away from the human elements of the game. However, the decision to bring a specific relief pitcher into the game is a conscious one by the manager, largely influenced by game situation. To that end, appearance data can also be aggregated by team – and, if what we care about is the managerial decisions that give rise to bullpen roles, we should really be focused on the team level.

To gain insight into, and ultimately quantify, how bullpens are constructed, we need to define a few concepts. As we go through, I’ll do my best to explain the concept that we’re trying to quantify in baseball terms, before diving into the nuts and bolts of how I’m quantifying them.

### Concept 1: Center of gravity

Your personal center of gravity is probably around your belly button – it’s the point at which half of your mass is above, half is below, half is left, half is right.

In addition to their physical centers of gravity (which they work so hard on, Bartolo Colon notwithstanding), relief pitchers have another “center of gravity”: the one at the center of their inning-score matrix. The inning-score matrix has two dimensions (score differential on the X-axis, inning on the Y-axis), and each appearance can be plotted in these two dimensions.

If we treat all appearances equally, a reliever’s center of gravity can be defined as the average inning and score when entering the game. On its own, this tells us a great deal about how the pitcher is being used. For example, without looking at the names, you can probably guess which of these guys was a high-leverage reliever in 2016 and which was a mop-up guy.

###### Player A: Vance Worley; Player B: Zach Britton

The center of gravity is a snapshot of a player’s role. It doesn’t tell you everything – you can’t pick out a lefty specialist, for example, or a guy whose game situations changed drastically over the course of a season. In fact, in the latter case, a player’s center of gravity for an entire season may actually be misleading. Still, it’s the most information you can get about the player’s usage in a couple of numbers. We’ll think of it as where the player “lives” in the inning-score matrix.
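To make the definition concrete, here is a minimal sketch of the calculation. It assumes each appearance is stored as a `(score_diff, inning)` pair, matching the X and Y axes described above; the function name and the two sample pitchers are made up for illustration.

```python
from statistics import mean


def center_of_gravity(appearances):
    """Return (mean score differential, mean inning) at time of entry.

    `appearances` is a list of (score_diff, inning) pairs, one per relief
    appearance, with every appearance weighted equally.
    """
    if not appearances:
        raise ValueError("need at least one appearance")
    scores = [score for score, _ in appearances]
    innings = [inning for _, inning in appearances]
    return mean(scores), mean(innings)


# Invented entry situations, not real game logs.
mop_up = [(-6, 5), (-8, 6), (7, 8), (-5, 4)]
closer = [(1, 9), (2, 9), (3, 9), (1, 9), (0, 10)]

mop_up_cog = center_of_gravity(mop_up)
closer_cog = center_of_gravity(closer)
print(mop_up_cog, closer_cog)
```

The invented mop-up man lives around the sixth inning with his team trailing, while the invented closer lives in the ninth with a small lead, which is the kind of separation the snapshot is meant to show.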

### Concept 2: Euclidian distance

If you’re not a math person, ignore the word “Euclidian.” This is just “distance” in the way you think about it in everyday life. If I have two points in space, the straight line between them has a length, and in layman’s terms, we’d say that this length is “how close” or “how far apart” the two points are. Mathematically, for two points with coordinates $(x_i, y_i)$ and $(x_j, y_j)$, the Euclidian distance between them can be calculated as:
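This is the standard two-dimensional distance formula. The label $d_{ij}$ for the distance between points $i$ and $j$ is a name introduced here for convenience:

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$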

A bullpen lives in the two-dimensional space that we used to define center of gravity: For every appearance a member of the bullpen makes, there is an inning (y), and there is a score (x). In this space, each member of the bullpen has a center of gravity. As such, we can say the two pitchers in our earlier example were far apart, but that these two are close together:

###### Player A: Shane Greene; Player B: Justin Wilson

In fact, you can start to look at entire bullpens graphically, in order to form an image of how the bullpen is constructed. Our “twins” from above are easy to pick out when we do this:

Nice to look at, and the trend makes intuitive sense: guys who pitch later in games are generally also trusted with leads. But how can we use it to compare bullpens? We need metrics to quantify what we’re seeing above, to describe how similar or dissimilar the roles are in a bullpen. Then we can compare that to other bullpens and give context to how a team is managing their pen relative to the rest of the league.

### Concept 3: Average Euclidian distance

The simplest thing one could do would be to sum the distances of the lines connecting each player’s center of gravity. This has the disadvantage of being biased: Bullpens which have more qualifying players will have more dots to connect and, therefore, more total distance.

Naturally, we can calculate an average of these distances instead. This requires us to know how many unique distances there are between distinct pairs of relievers. We can deduce this logically: From the first of n relievers, there are (n – 1) lines, connecting that reliever to all the others. From the second reliever, we’ve already drawn the line to the first reliever, so we can draw (n – 2) more lines, connecting him to the remaining relievers … and so forth. Thus, for n relievers in a bullpen, there are (n – 1) + (n – 2) + … + 2 + 1 distances between them, and we can calculate the average Euclidian distance as:
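Written out under the description above, with $d_{ij}$ standing for the Euclidian distance between the centers of gravity of relievers $i$ and $j$ (the symbols $d_{ij}$ and $\bar{d}$ are labels chosen here), the average takes this form:

$$\bar{d} = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}}{(n-1) + (n-2) + \cdots + 2 + 1} = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}}{n(n-1)/2}$$

The second form uses the familiar identity that the integers from 1 to $n-1$ sum to $n(n-1)/2$.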

This looks intimidating, but the numerator is really just the sum of the lengths of all the lines that we drew. The denominator is the number of lines that we drew. Voilà: an average!
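Here is a short sketch of that average, assuming each reliever has already been reduced to a center of gravity stored as a `(score_diff, inning)` pair. The function name and the three-man bullpen are invented for illustration.

```python
from itertools import combinations
from math import dist


def average_distance(centers):
    """Average Euclidean distance over all distinct pairs of centers."""
    pairs = list(combinations(centers, 2))  # n * (n - 1) / 2 pairs
    if not pairs:
        raise ValueError("need at least two relievers")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Three invented centers of gravity: (score differential, inning).
bullpen = [(0.0, 9.0), (3.0, 5.0), (0.0, 5.0)]
avg = average_distance(bullpen)
print(avg)  # pairwise distances are 5, 4 and 3, so the average is 4.0
```

Using `itertools.combinations` draws each line exactly once, which is the same counting argument as the (n – 1) + (n – 2) + … + 1 tally above.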

### Concept 4: Weighted-average Euclidian distance

You may be tiring of all this talk about Euclidian distance. It’s important, though, to take this one step further. To use the average distance between all members of the bullpen as a basis of comparison is to make the assumption that all relievers are created equal – that, if you’re a fan of the Indians, you care about the distance between Kyle Crockett and Dan Otero as much as you do about the distance between Bryan Shaw and Cody Allen. You probably don’t, and that makes sense – the former duo isn’t nearly as important to the makeup of the Indians’ bullpen as the latter. We should, therefore, be emphasizing certain relievers and the distances associated with them.

How do we characterize certain members of a bullpen as important, numerically? We could weight them by, say, the average Leverage Index at the time they entered the game; players who are trusted in critical situations are surely more important, right? The issue with this idea is that leverage is highly correlated with the inning and score – in fact, it’s derived from them. Weighting by Leverage Index would tell us that players in a certain area of the graph are more important to team success. This is intuitive and not very interesting.

What do we want to measure? It might be interesting to know how rigid or fluid a team’s bullpen is; that is, do they have a “seventh-inning guy” or a “mop-up guy” who is consistently called on in certain situations? In this case, we want to give more weight to relievers who have lower variance by game situation when entering the game. If the manager gives someone a highly specific role by inning and score, that reliever is important insofar as the structure of the bullpen is concerned. That may not translate to how important they are with respect to the outcome of games, but presumably, that reliever has a fixed role because they have a skillset that in some way lends itself to their residence in a certain part of the graph.

Fortunately, inverse-variance weighting is an established mathematical concept. The idea is that players with lower variance by inning and score should be weighted more heavily. In short, this works in three steps:

1. For each pair of players, divide the Euclidian distance between them by the sum of score and inning variances associated with their centers of gravity;
2. For each pair of players, divide 1 by that very same sum of score and inning variances;
3. Divide the sum of results of (1) by the sum of results of (2).
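The three steps above can be sketched in a few lines. One reading is assumed here and should be treated as an assumption: the "sum of score and inning variances" for a pair is taken to be the score variance plus the inning variance of both players. The tuple layout, function name, and sample numbers are hypothetical.

```python
from itertools import combinations
from math import dist


def weighted_average_distance(relievers):
    """Inverse-variance weighted average distance between relievers.

    `relievers` is a list of (center, score_var, inning_var) tuples, where
    `center` is a (score_diff, inning) pair. The pair weight is assumed to
    be 1 / (sum of both players' score and inning variances).
    """
    numerator = 0.0
    denominator = 0.0
    for (c_i, sv_i, iv_i), (c_j, sv_j, iv_j) in combinations(relievers, 2):
        pair_variance = sv_i + iv_i + sv_j + iv_j
        if pair_variance <= 0:
            raise ValueError("pair variance must be positive")
        numerator += dist(c_i, c_j) / pair_variance  # step 1
        denominator += 1.0 / pair_variance           # step 2
    if denominator == 0.0:
        raise ValueError("need at least two relievers")
    return numerator / denominator                   # step 3


# Invented bullpen: a rigid closer, a setup man, and a fluid long man.
bullpen = [
    ((0.0, 9.0), 0.5, 0.5),
    ((3.0, 5.0), 2.0, 1.0),
    ((0.0, 5.0), 6.0, 1.0),
]
weighted = weighted_average_distance(bullpen)
print(weighted)
```

In this made-up bullpen the unweighted average distance is 4.0, while the weighted version comes out a bit higher, because the low-variance closer sits far from the other two and his pairs carry the most weight.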

Mathematically, this looks like this:
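Following the three steps as stated, and assuming that the "sum of score and inning variances" for a pair covers both players, the weighted average can be written with $d_{ij}$ as the Euclidian distance between the centers of gravity of relievers $i$ and $j$ and $\sigma_i^2$ as the sum of reliever $i$'s score variance and inning variance (both symbols are labels chosen here):

$$\bar{d}_w = \frac{\displaystyle\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{d_{ij}}{\sigma_i^2 + \sigma_j^2}}{\displaystyle\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{1}{\sigma_i^2 + \sigma_j^2}}$$

If every reliever had the same variance, the weights would cancel and this would reduce to the plain average distance.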

### Portrait of a Modern Bullpen

If you’re still with me, you may be wondering what the use of all this is. Let’s summarize what we’ve done so far:

• The average Euclidian distance between members of the bullpen tells us how clustered or spread out that bullpen is as a whole.
• Using a weighted average refines that metric in order to emphasize members of the bullpen that have well-defined, rigid roles – usually a closer and a setup man or two, but sometimes a surprise as well.

We can summarize a bullpen with these metrics and a plot of all members of a bullpen (as represented by their centers of gravity). Here’s how the 2016 Marlins bullpen looks in a snapshot. The 2016 Marlins have been chosen because they were a very average bullpen in terms of performance as well as structure, on a very average team overall. I couldn’t find anything at all that stood out about them.

We can use this framework to compare bullpens going forward: Which teams have very large distances between relievers? Which are more clustered? Which are oriented differently? We can compare not only bullpens within a single season, but also how bullpen structures have changed over time across the league. We can explore whether the structure of a bullpen is consistent from year to year on a single team, or if certain managers have ways of managing their bullpens which consistently show up in the data associated with their teams. There are a lot of exciting possible applications.

And of course, we can point out statistical oddities along the way. Why wouldn’t we?

## Exploring Relief Pitcher Usage Via the Inning-Score Matrix

Relief pitching has gotten a lot of attention across baseball in the past few seasons, both in traditional and analytical circles. This has come into particular focus in the past two World Series, which saw the Royals’ three-headed monster effectively reducing games to six innings in 2015, and a near over-reliance on relief aces by each manager this past October. It came to a head this offseason, when Aroldis Chapman signed the largest contract in history for a relief pitcher. Teams are more willing than ever to invest in their bullpens.

At the same time, analytical fans have long argued for a change in the way top-tier relievers are used – why not use your best pitcher in the most critical moments of the game, regardless of inning? For the most part, however, managers have appeared reluctant to stray from traditional bullpen roles: The closer gets the 9th inning with the lead, the setup man gets the 8th, and so forth. This might be due in part to managerial philosophy, or in part to the fact that relievers are human beings who value continuity and routine in their roles.

That’s the general narrative, but we can also quantify relief-pitching roles by looking at the circumstances when a pitcher comes into the game. One basic tool for this is the inning/score matrix found at the bottom of a player’s “Game Log” page at Baseball-Reference. The vertical axis denotes the inning in which the pitcher entered the game, while the horizontal axis measures the score differential (+1 indicating a 1-run lead, -1 indicating a 1-run deficit).

From this, we can tell that Andrew Miller was largely used in the 7th through 9th innings to protect a lead. This leaves a lot to be desired, however, both visually and in terms of the data itself. Namely:

• Starts are included in this data. This doesn’t matter for Miller, but skews things quite a bit if we only care about bullpen usage for a player who switched from bullpen to rotation, such as Dylan Bundy.
• Data is aggregated for innings 1-4 and 10+, and for score differentials of 4+. In Miller’s case, those two games in the far left column of the above chart actually represent games where his team was down seven runs. This is important if we want to calculate summary statistics (more on this in a bit).
• Appearances are aggregated for an entire year, regardless of team. This is a big issue for Miller, who split his time between the Yankees and Indians last year, as there is no easy way to discern how his usage changed upon being traded from one to the other.
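The second issue is worth a quick illustration, since it is the one that quietly distorts summary statistics. Below is a small sketch with made-up score differentials, assuming the published matrix lumps everything beyond four runs into the ±4 columns; the `clip` helper and the choice of population variance are assumptions for the example.

```python
from statistics import mean, pvariance


def clip(value, low=-4, high=4):
    """Mimic a matrix that lumps all deficits/leads of 4+ into one column."""
    return max(low, min(high, value))


# Invented appearances: two blowout deficits plus a handful of small leads.
true_diffs = [-7, -7, 1, 2, 3, 1, 2]
bucketed = [clip(d) for d in true_diffs]

print(mean(true_diffs), mean(bucketed))            # the mean shifts toward zero
print(pvariance(true_diffs), pvariance(bucketed))  # the variance shrinks
```

With the blowouts clipped, the average score differential flips from negative to positive and the variance drops by more than half, so a reliever looks both more trusted and more rigidly used than he really was.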

To address these issues, I’ve collected appearance data for all pitchers making at least 20 relief appearances for a single team in 2016. We can then construct an inning/score matrix which is specific by team and includes only relief appearances. Additionally, we can calculate summary statistics (mean and variance) for the measures associated with their relief appearances, including: score and inning when they entered the game, days of rest prior to the appearance, batters faced, and average Leverage Index during the appearance. This gives insight into the way the manager decided to use that pitcher: Was there a typical inning or score situation where he was called upon? Was he usually asked to face one batter, or go multiple innings? Was his role highly specific or more fluid?
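As a rough sketch of what that processing might look like, assume each appearance is a record carrying the team, the inning and score differential at entry, and a flag for starts. The `Appearance` type, the placeholder team codes, and the use of sample variance are all assumptions for illustration, not a description of the actual pipeline.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, variance


@dataclass
class Appearance:
    # Hypothetical record layout for one pitcher's game log.
    team: str
    inning: int
    score_diff: int
    is_start: bool


def relief_only(appearances, team):
    """Keep one team's relief appearances, dropping starts."""
    return [a for a in appearances if a.team == team and not a.is_start]


def relief_matrix(appearances, team):
    """Unbucketed inning/score matrix as counts keyed by (inning, score_diff)."""
    return Counter((a.inning, a.score_diff) for a in relief_only(appearances, team))


def usage_summary(appearances, team):
    """Mean and sample variance of inning and score at entry."""
    relief = relief_only(appearances, team)
    if len(relief) < 2:
        raise ValueError("need at least two relief appearances")
    innings = [a.inning for a in relief]
    scores = [a.score_diff for a in relief]
    return {
        "appearances": len(relief),
        "mean_inning": mean(innings),
        "var_inning": variance(innings),
        "mean_score": mean(scores),
        "var_score": variance(scores),
    }


# Invented log: three relief outings and a start for team "AAA", one for "BBB".
log = [
    Appearance("AAA", 8, 1, False),
    Appearance("AAA", 8, 2, False),
    Appearance("AAA", 9, 1, False),
    Appearance("AAA", 1, 0, True),
    Appearance("BBB", 7, -7, False),
]
matrix = relief_matrix(log, "AAA")
summary = usage_summary(log, "AAA")
print(matrix, summary)
```

A real run would also apply the 20-appearance cutoff per pitcher and team, and would extend the summary to days of rest, batters faced, and Leverage Index in the same way.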

So let’s start there – and in particular, let’s see if we can identify some relievers who had very rigid roles, or roles that simply stood out from the crowd. To start, here are the relievers who had the lowest variance by inning in 2016.

No surprise here: Most teams reserve their closers for the 9th inning, and rarely deviate from that formula. What you have is a list of guys who were closers for the vast majority of their time with the listed team in 2016, with one very notable exception. Prior to being traded over to Toronto, Joaquin Benoit made 26 appearances for Seattle – 25 of which were in the 8th inning! The next-most rigid role by inning, excluding the 9th inning “closer” role, was Addison Reed, who racked up 63 appearances in the 8th inning for the Mets, but was also given 17 appearances in either the 7th or 9th. In short, Benoit’s role with the Mariners was shockingly inning-specific. I’ve also included the variance of the score differential, which shows that score seemed to have no bearing on whether Benoit was coming into the game. The 8th inning was his, whether the team really needed him there or not.

Speaking of variance in score differential, there’s a name at the top of that list which is quite interesting, too.

Here we mostly see a collection of accomplished setup men and closers who are coming in to protect 1-2 run leads in highly-defined roles (low variance by inning). We also see Matt Strahm, a young lefty who quietly made a fantastic two-month debut for a Royals team that was mostly out of the playoff picture, and a guy who Paul Sporer mentioned as someone who might be in line for a closer’s role soon. Strahm’s great numbers – 13 hits and 0 home runs surrendered in 22.0 innings, to go with 30 strikeouts – went under the radar, but Ned Yost certainly trusted Strahm with a fairly high-leverage role in the 6th and 7th innings rather quickly. With Wade Davis and Greg Holland both out of the picture, it’s not unreasonable to think Strahm will move into a later-game role, if the Royals opt not to try him in the rotation instead.

This next leaderboard, sorted by average batters faced per appearance, exemplifies either Bruce Bochy’s quick hook or the fact that the Giants bullpen was a dumpster fire, or perhaps both.

This is a list mostly reserved for lefty specialists: The top 13 names on the list are left-handed. Occupying the 14th spot is Sergio Romo, which is notable because he’s right-handed, and also because he’s the fourth Giants pitcher on the list. The Giants take up four of the top 14 spots!

While the Giants never did quite figure out the right configuration (or simply never had enough high-quality arms at their disposal), certainly one could question why Will Smith appears here; the Giants traded for Smith, who was, by all accounts, an effective and important part of the Brewers’ pen. The Giants not only used him (on average) in lower-leverage situations, but they also used him in shorter outings, and with less regard for the score of the game.

Dave Cameron used different data to come to the same conclusion several months ago. Very strange, considering that they had not just one, but two guys who already fit the lefty-specialist role in Javier Lopez and Josh Osich. Smith is back in San Francisco for the 2017 season, and it will be interesting to track whether his usage returns to the high-leverage setup role that he occupied in Milwaukee.

This is a taste of how this data can be used to pick out unique bullpens and bullpen roles. My hope is that a deeper, more mathematical review of the data can produce insights on how bullpens are structured: Perhaps certain teams are ahead of the curve (or just different) in this regard, or perhaps the data will show that there is a trend toward greater flexibility over the past few seasons. Certainly, if teams are spending more than ever on their bullpens, it stands to reason that they should be thinking more than ever about how to manage them, too.