## Simulating the Impact of Pitcher Inconsistency

I thought Matt Hunter’s FanGraphs debut article last week was really interesting. So interesting, in fact, that I’m going to rip it off right now. The difference is I’ll be using a Monte Carlo simulator I made for this sort of situation, which I’ll let you play with after you’re done reading (it’s at the bottom).

Matt posed the question of whether inconsistency could be a good thing for a pitcher. He brought up the example of Jered Weaver vs. Matt Cain in 2012: two pitchers with nearly identical overall stats, except that Weaver was a lot less consistent. However, as Matt points out, Weaver had a bit of an advantage in Win Probability Added (WPA). WPA factors in a bunch of things, e.g. how close the game is and how many outs are left when events occur. Because of that, it’s a pretty noisy stat, heavily influenced by factors the pitcher doesn’t control much, and it isn’t predictive. For that reason, I figured simulations might be fun and enlightening on the subject. They sort of accomplish the same thing that WPA does, except that they let you base conclusions on many more possible conditions and outcomes than you’d see in a handful of starts (i.e., they can help de-noise the situation).

### Weaver vs. Cain

[Images: Weaver, wearing hat and smiling; Cain, eating a mean watermelon]

The following chart breaks down how often each pitcher posted a given single-game ERA in 2012:

What you see here is that Weaver’s ERA was two or less in a third of his starts, a feat Cain accomplished in fewer than sixteen percent of his starts. Weaver’s problem, of course, is that he got shelled for an ERA over ten in ten percent of his games. Yet, overall, their ERAs were nearly identical: 2.81 for Weaver and 2.79 for Cain.

So, to look into which pitcher should theoretically be more useful to a team, I set up a few different types of simulations. One simply randomly matches up performances by Cain directly against Weaver’s results, assigns a run contribution to each of their bullpens for the games, and tells us who comes out on top more often. The next simulator matches up each of their performances against some normally-distributed level of offensive runs scored, to give each a win percentage. The last is a more generalizable way of comparing pitchers based on means and standard deviations.

### Assumptions (of the simulation)

Bad assumptions can ruin a model. Garbage in, garbage out, as the saying goes. Hopefully mine aren’t too bad, but you’ll be able to change them in the web app lower in the article if you disagree.

One of the main things I figured I should factor in was the effect of placing a larger workload on the bullpen. I guessed that if the ‘pen only had to pitch an inning or two in a game, it would disproportionately be the closer and set-up man doing the pitching, and therefore the pen’s runs allowed per 9 innings (RA/9) would be lower in those games. Well, here’s what I found out for 2011-2012 games that went nine innings or fewer:

IP (by Bullpen) | Bullpen RA/9 | Bullpen ERA | Starter RA/9 | Starter ERA |
---|---|---|---|---|
0-2 | 4.27 | 3.88 | 2.72 | 2.48 |
2.1-4 | 4.00 | 3.66 | 5.23 | 4.80 |
>4 | 4.38 | 4.10 | 13.06 | 12.15 |

Yeah, it turns out that in those games where the starter leaves after 7+ (typically after having given up fewer than three runs), the bullpen doesn’t do so well. I don’t know if that’s because many of those games are blowouts being pitched by mop-up guys, or because opposing teams and managers tend to pull out all the stops at the end of the game. But, there you have it.

If you don’t exclude extra-inning games, the bullpen’s RA/9 in games where it pitches over four innings drops to 3.95 (while the other numbers stay pretty much the same). But since my simulators limit games to nine innings, in the name of keeping things simple, I’ll stick with the numbers above.

What about the effect of tiring out relievers so that they can’t pitch the next day, or at least not pitch as well? OK, that’s a lot harder to figure out. I just had to make a guess when I decided what RA/9 to use for the various IP levels. Ultimately, I chose a RA/9 of 4 for when the bullpen pitched four innings max, and a RA/9 of 4.5 when they had to pitch more than that.

So, I had the bullpen average RA/9 out of the way, but now I had to work on standard deviations (how spread out, or inconsistent, they are). There’s a minimum expected level of standard deviation that’s predicted by the formula: SQRT(Chances * Rate * (1 – Rate)) / Chances. “SQRT” means square root. “Rate” is the run-allowing rate, and “Chances” are determined by how many innings are left for the bullpen to pitch (but also by how many batters they let on base… long story). Using that, I figured the minimum RA/9 standard deviation for a typical bullpen would be about 5.87 over 1 IP, 4.15 for 2 IP, 3.26 for 3 IP, and 2.85 for 4 IP. It looks like in actuality (in 2011-2012), the standard deviations were around 10%-25% higher than that (being more accurate for 1 or 2 IP). So I decided to adjust the standard deviations up 15% from the minimum expected level.
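As a rough illustration of that minimum-spread formula, here is a minimal Python sketch. It assumes “Chances” can be approximated as batters faced and “Rate” as runs allowed per batter. The 4.3 batters per inning, the function names, and the conversion back to an RA/9 scale are assumptions made for illustration, not the exact inputs behind the figures above, so the outputs need not match them.

```python
import math


def binomial_rate_sd(chances: float, rate: float) -> float:
    """Minimum expected SD of a rate: SQRT(Chances * Rate * (1 - Rate)) / Chances."""
    return math.sqrt(chances * rate * (1.0 - rate)) / chances


def bullpen_ra9_sd(innings: float, ra9: float = 4.0,
                   batters_per_inning: float = 4.3,  # assumption, not from the article
                   inflation: float = 1.15) -> float:
    """SD of bullpen RA/9 over `innings`, bumped up 15% from the minimum."""
    chances = innings * batters_per_inning        # assumed: one chance per batter faced
    rate = (ra9 / 9.0) / batters_per_inning       # assumed: runs allowed per batter
    sd_per_chance = binomial_rate_sd(chances, rate)
    # Convert the per-batter SD back to a per-9-innings scale, then inflate it.
    return sd_per_chance * batters_per_inning * 9.0 * inflation


for ip in (1, 2, 3, 4):
    print(ip, round(bullpen_ra9_sd(ip), 2))
```

The spread shrinks with the square root of the number of chances, which is why a bullpen’s RA/9 is so much noisier over one inning than over four.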

For the third simulation type, I had to look into the relationship between how many runs a starter allows vs. how soon until he gets the hook. I thought the results were surprisingly clean:

The trend here is extremely clear. It shows that a starter is probably going to get yanked by the manager if he gives up 5 runs around the first inning, or by the second if he allows around 6 runs total, etc. (the scale is runs per inning here). If he makes it deep into the 9th inning, he’s probably only allowed about 1 run total. Part of this is a function of pitch counts, and part of it is a function of performance, but it’s all pretty predictable, apparently. In the third sim type, after assigning a RA/9 rate to a starter (based on a randomly generated percentile of their individual bell-curve), I used the formula on this chart as the primary determinant of how many innings they’d likely pitch that game.

Still, there are undoubtedly some pitchers who have a tendency towards racking up higher pitch-counts over a given number of innings, or whose managers tend to let them rack up high pitch counts, for example. To address that, I also added the option of adjusting the starter’s IP towards some “base” IP level, weighted as strongly as you’d like.
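One way to picture that adjustment is as a weighted average. The sketch below is a guess at the idea rather than the spreadsheet’s actual formula: `blended_ip`, its linear blend, and the clamp to nine innings are assumptions, and the performance-based IP is passed in as an input because the chart’s formula isn’t reproduced in the text. `bullpen_ra9` simply encodes the RA/9 choices described earlier (4 when the bullpen covers four innings or fewer, 4.5 beyond that).

```python
def blended_ip(performance_ip: float, base_ip: float, weight: float) -> float:
    """Pull the performance-based IP toward a base IP level (assumed linear blend)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    ip = (1.0 - weight) * performance_ip + weight * base_ip
    return min(max(ip, 0.0), 9.0)  # the sims cap games at nine innings


def bullpen_ra9(bullpen_ip: float) -> float:
    """Bullpen RA/9 assumption from the article: 4 up to four innings, 4.5 beyond."""
    return 4.0 if bullpen_ip <= 4.0 else 4.5


# Hypothetical inputs: performance suggests 5 IP, the pitcher's base level is 7 IP.
starter_ip = blended_ip(performance_ip=5.0, base_ip=7.0, weight=0.2)
bullpen_ip = 9.0 - starter_ip
print(starter_ip, bullpen_ip, bullpen_ra9(bullpen_ip))
```

With a weight of 0 the starter’s innings depend only on how he is pitching that day; with a weight of 1 they ignore his performance entirely and equal the base level.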

### Results

#### Simulation Type 1: Weaver vs. Cain, head-to-head

With the above assumptions, over a million simulations, Weaver’s team beat Cain’s **52.7%** of the time (excluding extra-inning games, which happened 12.8% of the time overall). This assumes the two are on a level playing field, with the same defenses behind them, and that they’re both facing a completely average offense.

#### Simulation Type 2: Weaver and Cain, with various levels of run support

This piggybacks off of Sim#1. From that, I obtained the following breakdown of win percentage according to how many runs each pitcher received in support:

Runs Scored by Offense | Weaver Win% | Cain Win% |
---|---|---|
0 | 0.00% | 0.00% |
1 | 18.35% | 13.45% |
2 | 35.91% | 30.56% |
3 | 56.90% | 52.38% |
4 | 72.64% | 68.48% |
5 | 82.28% | 82.20% |
6 | 88.01% | 91.94% |
7 | 91.09% | 97.02% |
8 | 92.77% | 99.15% |
9 | 94.05% | 99.79% |
10 | 95.32% | 99.92% |
11 | 96.74% | 99.94% |
12 | 98.06% | 99.94% |
13 | 99.05% | 99.94% |
14 | 99.59% | 99.94% |
15 | 99.83% | 99.94% |

What you’re seeing has to do not only with Weaver’s inconsistency in runs allowed, but also in innings pitched (transferring more weight to the performance of the bullpen).

From there, it was a matter of coming up with the likelihood of the pitchers getting each of those levels of run support. Based on 2012 MLB averages, I assumed they’d each get an average of 4.3 runs of support, with a standard deviation of 3 runs. Results of a million simulations: Weaver wins **63.42%** of his non-extra innings games; Cain wins **63.17%**. After that many trials, that’s significant, but it’s of course a pretty minor advantage. Weaver’s greater success in low-scoring games gave him the edge in head-to-heads against fellow elites, but his propensity to lose high-scoring games counters that overall.
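To make that weighting step concrete, here is a minimal Python sketch that skips the random draws and instead weights each row of the Sim#2 table by the chance of getting that much run support. The way the normal curve is rounded to whole runs (everything below half a run counted as zero, everything above 14.5 lumped into 15) is an assumption, since the text doesn’t say how negative or fractional draws were handled.

```python
import math

# Win% by runs of support (0 through 15), copied from the Sim#2 table.
WEAVER = [0.0000, 0.1835, 0.3591, 0.5690, 0.7264, 0.8228, 0.8801, 0.9109,
          0.9277, 0.9405, 0.9532, 0.9674, 0.9806, 0.9905, 0.9959, 0.9983]
CAIN = [0.0000, 0.1345, 0.3056, 0.5238, 0.6848, 0.8220, 0.9194, 0.9702,
        0.9915, 0.9979, 0.9992, 0.9994, 0.9994, 0.9994, 0.9994, 0.9994]


def normal_cdf(x: float, mean: float, sd: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def run_support_probs(mean: float = 4.3, sd: float = 3.0, max_runs: int = 15):
    """Assumed discretization: round to the nearest run, lumping both tails."""
    probs = []
    for k in range(max_runs + 1):
        lower = 0.0 if k == 0 else normal_cdf(k - 0.5, mean, sd)
        upper = 1.0 if k == max_runs else normal_cdf(k + 0.5, mean, sd)
        probs.append(upper - lower)
    return probs


def overall_win_pct(win_by_runs, probs) -> float:
    """Average the per-run-level win% over the run-support distribution."""
    return sum(w * p for w, p in zip(win_by_runs, probs))


probs = run_support_probs()
print(round(overall_win_pct(WEAVER, probs), 4), round(overall_win_pct(CAIN, probs), 4))
```

Under those assumptions both pitchers come out at roughly 63%, with Weaver slightly ahead; the exact decimals depend on how the tails of the run-support curve are treated.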

#### Simulation Type 3: General Simulation

OK, here’s the part where you get to play along at home. The default I’ve entered here approximates the Weaver vs. Cain battle (“Team 1” representing Weaver’s team). I found in my testing that Team 1 wins around **50.5%** of the time. That’s not as dramatic as the 52.7% I found in Sim#1, but that may be because the actual distributions weren’t exactly normal, as the assumption is in this model. Anyway, you can change around the assumptions (white boxes with red borders) within the app here, or you can download the spreadsheet with the green icon at the bottom:

What you’re seeing here are the results of only 2,000 simulations at a time (trying to keep the file size down), but if you download it, you can copy the rows of the “Calculations” tab downwards to do a lot more at a time. At only 2,000 sims, there’s a bit of a margin of error, as you’ll probably see if you start changing blank cells (which will come up with new simulations each time).

One thing I think you might notice is that for an inconsistent pitcher who sometimes gets taken out of games very early, having a bad bullpen behind him (especially long relievers) hurts more.

Well, hopefully this hasn’t been too confusing. I’ll be around to answer questions, just in case. Have fun!


Favorite writer on fangraphs. <3.

Wow, thank you!

Why did bullpen ERA dip in the mid-range of 2-4 IP? One hypothesis: Starters run into pitch count limits in innings 5-7, therefore managers pull them more often before they’re able to close out the inning – often after a walk (or two) and thus the base-out situations tend to be messy. Baserunners benefit the bullpen ERA because they create more opportunities to record outs (double plays, etc) but any runners who score are debited to the Starting Pitcher, not the bullpen.

By comparison, a portion of the starters who go 7+ will either finish the game (leaving no base-out situation at all) or be pulled at the completion of an inning when they have reached their PC limit.

Ah, great point. I’m sure that’s at least part of it, now that you mention it.

Also in those high leverage situations, a team is most likely to use its best (if not its VERY BEST) relievers. That combined with the lack of responsibility for inherited runners per RA as payroll suggests above, would seem to explain most of the difference in bullpen performance.

Hm, but the very best relievers are typically used in the last two innings, so why would that factor explain the worse RA/9 of the 0-2 bullpen IP games?

I’m not entirely convinced it’s not either a fluke (underperforming/aging closers, maybe) or perhaps even a shift in bullpen management philosophy. I’d have to look at more years.

The best two relievers might be less likely to throw in 0-2 Bullpen IP games because the starter’s outing length would indicate good performance and a higher likelihood of being significantly ahead. It’s easy to see why the worst relievers would throw in 4+ Bullpen IP games, this would account for the inverse of the mop up work.

Bigger picture, the question for me seems to be whether it is viable that a higher percentage of close games occur in the 2-4 Bullpen IP games? I think so, and it seems reasonable that the 0-2 games would be closer to 2-4 than the 4+ games because it would also account for the scenarios where the starter goes 7+ and the game is close, prompting use of say, Robertson and Rivera, to use the Yankees as an example.

Just guessing here. Loved the piece Steve.

Yeah, I think you have to assume that games in which the SP goes 7+ are less close than games in which he goes less than 7 full, because the former situation almost always means fewer runs allowed, with an added element of more runs scored. In other words, if my team’s scored 7, then I’ll let my SP go as long as his pitch count and basic effectiveness will permit, while if my team’s scored only 3 or 4, I’m only sending out my SP for the 7th if he’s dealing (or my bullpen is atrocious).

Point being, while starter-to-setup-to-closer is the way we script it, I think it’s the exception rather than the rule.

Oh, and it’s probably worth remembering that, in this day of ultra-rigid BP roles, it’s quite common for the designated Closer *not* to be the best RP on the team. Now, when that’s the case I think the setup guy usually is, so you’d still expect to see the 2 best relievers throw the last 2 innings of a close game, but it probably has some effect.

OK, thanks guys. That’s more or less what I was thinking while writing the article.

If you’re bored, you could try to determine how much a run on defense or a run on offense influences a team’s winning % via Monte Carlo simulation.

OK, that can be done with this sim, if I understand you correctly. For both teams, I’ll assume a standard deviation of 3 runs per 9 innings, and set “Base IP” to 9, “Base IP SD” to something small like 0.0001 (because the formula won’t accept “0” as the standard deviation), and “Base IP Weight” to 1, in order to make it act like the starter is pitching the whole 9 innings (just so we can unify the whole team’s pitching performance under one mean and standard deviation). Now, we can just use “Team 1” to represent a team’s defense and “Team 2” to represent the same team’s offense. Now just think of the mean RA/9 for “Team 2” as the mean runs scored by the team.

Here are some of the rough win% (only like 30k simulations) I got from different mean RA/9 and RS/9 values:

RA | RS | W% |
---|---|---|
3 | 4 | 60% |
4 | 3 | 40% (just checking) |
3 | 5 | 70% |
5 | 7 | 70% |
5 | 8 | 78% |
5 | 9 | 84.5% |

Of course, the lower the standard deviations you’re using, the higher the win percentages would be when you tend to score more than you allow.
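For anyone who wants to try this setup outside the spreadsheet, here is a minimal Python sketch of the same idea: runs allowed and runs scored are drawn independently from normal curves with a standard deviation of 3, and the higher number wins. Flooring negative draws at zero and skipping exact ties are assumptions (the spreadsheet’s handling isn’t spelled out here), and `win_pct` is a hypothetical helper name, so expect the same pattern as the table above but not necessarily the same decimals.

```python
import random


def win_pct(mean_ra: float, mean_rs: float, sd: float = 3.0,
            sims: int = 100_000, seed: int = 0) -> float:
    """Share of non-tied games won when RA and RS are independent normal draws."""
    rng = random.Random(seed)
    wins = losses = 0
    for _ in range(sims):
        ra = max(0.0, rng.gauss(mean_ra, sd))  # assumed: negative draws floor at zero
        rs = max(0.0, rng.gauss(mean_rs, sd))
        if rs > ra:
            wins += 1
        elif ra > rs:
            losses += 1
        # Exact ties (both draws floored at zero) are skipped.
    return wins / (wins + losses)


for ra, rs in [(3, 4), (4, 3), (3, 5), (5, 7), (5, 8), (5, 9)]:
    print(ra, rs, round(win_pct(ra, rs), 3))
```

Lowering `sd` in the call makes every gap between runs scored and runs allowed count for more, which is the effect described in the paragraph above.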

Great Article! This is exactly the sort of writing that smart baseball fans yearn for on sleepless nights…and here it is!

I don’t think I understand “Base IP weight”. What is the “correct” weight supposed to be? 1?

Thanks!

OK, I never did any testing to see what the “correct” Base IP weight level is, but since I think the IP a pitcher gets in a game is mainly determined by how well he’s pitching, I’d guess the weight should be something low, like 0.2 or less. But you can set it to 1 if you want to take how many runs he’s allowing out of the equation of how many IP he gets.