# Should we be using ERA estimators during the season?

The 2012 season is quickly coming to a close. September is the time of year when baseball fans and writers are either looking forward to the playoffs or looking back on the season and wondering what could have been.

Before the season and during the season, the mindset of the baseball community is different. For example, a major talking point before any season is projections. The various systems (Oliver, PECOTA, Marcel, ZIPS, Steamer, etc.) release their projections and it leads to much excitement and discussion over who could collapse, break out, or hold serve in the coming season. In a recent post at the Book Blog, Tom Tango posed the question of whether this was a bad season for forecasters (projection systems).

It was an interesting question, something people begin thinking about this time of year, but it also piqued my interest in another question.

There’s a group of statistics in the sabermetric community known as “ERA estimators.” These statistics are based on outcomes that are more under a pitcher’s control (strikeouts, walks, ground balls, home runs), typically known as peripherals. They attempt to forecast where a pitcher’s ERA is going to move in the future.

The most common ERA estimators currently are fielding independent pitching (FIP), expected fielding independent pitching (xFIP), skill-interactive ERA (SIERA) and true ERA (tERA).

How well do these estimators typically work?

- Matt Swartz has shown that SIERA is the best predictor of next-season ERA
- Bill Petti showed that for pitchers who pitch in more hitter-friendly parks, xFIP and FIP perform better than SIERA
- Colin Wyers showed that once you get beyond season-to-season comparisons, which can have a great deal of random variation, ERA begins to perform best at predicting itself

My goal is to not re-hash these studies, but instead to delve into what happened just this season.

The three studies I mentioned above looked at large(ish) sample sizes; year-to-year data or bigger. Typically, when we study baseball statistics we look for a large sample size; because there is so much random variation and noise in baseball, it’s tough to get a full picture of what truly happened when dealing with a smaller sample. In many instances, one season of data isn’t a large enough sample for some statistics, which might sound crazy to some, but it’s actually true.

### The idea

Writers in the sabermetric community, myself included, talk about ERA estimators during the season fairly frequently. For example, this made-up quote would be a fairly common thing to read on sabermetric websites in the middle months of the season:

> Pitcher X’s ERA (2.50) is much lower than his xFIP (4.50). This result indicates that Pitcher X has probably been lucky, and his ERA will regress back closer to his xFIP as the season goes forward.

That idea is fairly commonly accepted. If a pitcher’s xFIP, FIP and SIERA are significantly above or below his current ERA, then the assumption for most is that his ERA will move back either up or down toward those numbers.

The original goal of this post was to split the season in half and look at how the ERA estimators have done at predicting ERA and runs allowed per nine innings for the second half of the season. Essentially, I agreed with the commonly accepted idea that a pitcher’s ERA estimators at the midpoint of the season were better indicators of where his ERA was trending than his actual ERA.

I thought that although half a season of baseball is a small (random) sample size, it is still valuable for teams to know what they should expect from their pitchers in the second half. This information would be useful in certain midseason decisions front offices have to make. Some examples would be:

- Sending players down to the minors
- Moving a pitcher out of the rotation
- Deciding if your team was a contender
- Deciding which players to trade away or which players to target in a trade

A quick example of this idea comes from the comparison between the Angels acquiring Zack Greinke and the Rangers acquiring Ryan Dempster at this season’s trading deadline.

At that time, Greinke had an ERA of 3.44, while Dempster’s ERA was 2.55. But Greinke’s xFIP was 2.82, while Dempster’s was 3.73. Thus, many predicted positive regression for Greinke’s ERA with the Angels, and negative regression for Dempster’s ERA with the Rangers.

Please note that I understand we’d expect their ERAs to fluctuate somewhat anyway after the trade. Both players were changing leagues and ballparks, and would have different defenses playing behind them. At the same time, had both those pitchers stayed with their original ball clubs, the assumption that Greinke would have positive regression and Dempster would have negative regression would still likely have been the consensus.

### The study

For this study I used July 1 as the cutoff point. Then I looked only at starting pitchers who had at least 50 innings pitched before July 1 and at least 45 innings pitched after that date.

I found the ERA, FIP, xFIP, SIERA and tERA for each qualifying pitcher from the beginning of the season to July 1, then regressed those numbers against their runs allowed per nine innings (RA9) and ERA for the second half of the season (July 1-Sept. 16). I also added in an extremely simple baseline of strikeouts minus walks divided by innings pitched (K-BB/IP) as another predictor. Interestingly, exactly 100 starters qualified for the sample.

Also, please note that although 50 and 45 respectively were the minimum number of innings, the average number of innings thrown before July 1 for the sample was 92 innings, and the average number thrown after July 1 was 81 innings. So a good portion of these numbers are based on close to 100 innings, which is still not a great sample, but at least feels a lot better than 45-50 innings.
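To make the mechanics concrete, here is a minimal sketch of the kind of single-predictor (simple) linear regression used throughout this study. The numbers are made up for illustration; the real study used first-half and second-half figures for the 100 qualifying starters.

```python
from statistics import mean

def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical first-half xFIP and second-half ERA for six pitchers.
xfip_first_half = [3.10, 3.85, 4.20, 2.90, 4.60, 3.50]
era_second_half = [3.40, 4.10, 3.90, 3.20, 4.80, 3.00]

intercept, slope = simple_regression(xfip_first_half, era_second_half)
predictions = [intercept + slope * x for x in xfip_first_half]
```

The same fit is run once per predictor (ERA, FIP, xFIP, SIERA, tERA, K-BB/IP), and the quality of each fit is then compared.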

### The results

First, I ran a simple linear regression for each predictor against the pitchers’ second-half runs allowed (RA9). In the table below, I list both the r-squared and root mean squared error (RMSE) for each predictor in the sample.

For those who aren’t statistically savvy, r-squared shows the percentage of variation in what we are trying to predict (RA9) that is explained by our predictor (ERA, xFIP, etc.). A higher r-squared shows a stronger relationship between the predictor and the outcome.

The root mean squared error (RMSE) shows us how far, on average, our prediction is from the actual outcome; thus, a lower number indicates a stronger relationship.
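Both measures can be written out directly. This is a generic sketch of the standard formulas, not the exact software output used to build the tables below:

```python
from math import sqrt
from statistics import mean

def r_squared(actual, predicted):
    """Share of the variance in the outcome explained by the predictions."""
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean(actual)) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean squared error: the typical distance between
    prediction and actual outcome, in the outcome's own units."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))
```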

Here are the RA9 single regression results:

Predictor | R-Squared | RMSE |
---|---|---|
(K-BB)/IP | 9.14% | 1.207 |
SIERA | 6.19% | 1.246 |
xFIP | 4.65% | 1.267 |
FIP | 2.92% | 1.290 |
ERA | 1.86% | 1.304 |
tERA | 0.43% | 1.343 |

RA9 is a better statistic than ERA, but, as I noted from the outset, these metrics are supposed to be ERA estimators, not RA9 estimators (for better or worse).

This is most likely why we see a near-zero r-squared for tERA: it is purposely scaled to predict ERA, rather than RA9.

So I ran simple linear regression for the predictors against ERA, as well:

Predictor | R-Squared | RMSE |
---|---|---|
(K-BB)/IP | 8.84% | 1.092 |
SIERA | 5.99% | 1.127 |
xFIP | 4.48% | 1.145 |
tERA | 3.04% | 1.162 |
FIP | 2.42% | 1.170 |
ERA | 1.45% | 1.185 |

These numbers jibe fairly well with the single-season results from the three studies I referred to at the outset of the article.

The most shocking result is that in both tests, the predictor with the highest r-squared and the lowest RMSE was the simple baseline of strikeouts minus walks divided by innings pitched.

In Swartz’s study, the second-best predictor of Year 2 ERA, behind SIERA, was a statistic known as kwERA (strikeout-to-walk ERA), which uses only strikeouts and walks. I actually considered kwERA for my baseline, as it does a better job of actually weighting the value of strikeouts and walks, and is already on an ERA scale. But I wanted to keep my baseline as simple as possible, so I just used simple subtraction, and even left intentional walks in the data.

Interestingly, strikeouts minus walks still ended up being the best predictor.
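For reference, the baseline is trivial to compute, and kwERA is only slightly more involved. The kwERA constants below (5.40 and 12) are quoted from memory, so treat them as approximate rather than the official definition:

```python
def k_minus_bb_per_ip(strikeouts, walks, innings_pitched):
    """The simple baseline used in this study: (K - BB) / IP."""
    return (strikeouts - walks) / innings_pitched

def kwera(strikeouts, walks, batters_faced):
    """kwERA puts strikeouts and walks on an ERA scale.
    Constants (5.40, 12) are approximate, quoted from memory."""
    return 5.40 - 12 * (strikeouts - walks) / batters_faced

# e.g. a starter with 150 K and 40 BB over 160 IP and 650 batters faced
baseline = k_minus_bb_per_ip(150, 40, 160)
era_scale = kwera(150, 40, 650)
```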

Simply comparing six separate predictors’ single linear regressions isn’t as effective an analysis as running a multiple regression that includes all six predictors at the same time. So I ran a multiple regression with all six predictors thrown in.

The first table is the SPSS readout of coefficients for the RA9 test:

The "B" and "Std. Error" columns are the unstandardized coefficients; "Beta" is the standardized coefficient.

Predictors | B | Std. Error | Beta | t-score | Sig. |
---|---|---|---|---|---|
(Constant) | 5.671 | 2.144 |  | 2.645 | 0.010 |
K-BB | -2.074 | 1.292 | -0.352 | -1.605 | 0.112 |
ERA | 0.140 | 0.173 | 0.130 | 0.808 | 0.421 |
FIP | -0.229 | 0.437 | -0.171 | -0.524 | 0.602 |
xFIP | -0.018 | 1.104 | -0.009 | -0.016 | 0.987 |
tERA | 0.074 | 0.341 | 0.057 | 0.215 | 0.829 |
SIERA | -0.023 | 1.223 | -0.012 | -0.018 | 0.985 |

The second table is the SPSS readout of coefficients for the ERA test:

ERA | Unstandardized | Coefficients | Stand. Coeff. | ||
---|---|---|---|---|---|

Predictors | B | Std. Error | Beta | t-score | Sig. |

(Constant) | 5.388 | 2.108 | 2.67 | 0.009 | |

K-BB | -2.015 | 1.216 | -0.363 | -1.657 | 0.101 |

ERA | 0.156 | 0.163 | 0.153 | 0.856 | 0.342 |

FIP | -0.332 | 0.412 | -0.263 | -0.807 | 0.422 |

xFIP | 0.002 | 1.039 | 0.001 | 0.002 | 0.998 |

tERA | 0.099 | 0.321 | 0.081 | 0.308 | 0.758 |

SIERA | 0.004 | 1.151 | 0.002 | 0.003 | 0.997 |

The column we want to look at here is titled “Sig.” This column gives the statistical significance (p-value) of each predictor. For most tests, a predictor is considered statistically significant once the value drops below 0.05. As you can see from both of these results, none of the predictors is statistically significant; strikeouts minus walks comes the closest.

I found that putting all of the predictors together did not really improve the r-squared we found from just using K-BB/IP:

Outcome | Multiple Regression r^2 | K-BB/IP r^2 |
---|---|---|
RA9 | 10.10% | 9.1% |
ERA | 10.40% | 8.8% |

I also found that K-BB/IP was a statistically significant predictor on its own, but when the other predictors were added, it was no longer statistically significant. This is most likely due to a degrees of freedom issue (a sample size of 100 with six predictors), but as I’ve already gotten into too much statistical jargon, I’ll just leave it at that.
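One thing worth checking when six predictors go into one model is how strongly they correlate with one another, since highly correlated predictors make individual coefficients and their significance unstable. A toy sketch with hypothetical FIP and xFIP values (both are built largely from strikeouts and walks, so they track closely):

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two predictor columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical first-half FIP and xFIP for six pitchers.
fip  = [3.20, 3.90, 4.30, 2.95, 4.55, 3.60]
xfip = [3.35, 3.80, 4.10, 3.05, 4.40, 3.70]

correlation = pearson_r(fip, xfip)  # close to 1, i.e. nearly redundant
```

When two predictors are this correlated, a multiple regression splits their shared signal between them more or less arbitrarily, which is why none of them ends up individually significant.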

Of the 100 pitchers in the sample, 13 changed teams at some point during this season. As I noted with the Greinke/Dempster comparison earlier, this could have an effect on the results. Future ERAs could fluctuate when a pitcher changes leagues, teams and home ballparks. So I checked to see how removing those pitchers would affect the results.

Below, I listed the r-squareds for the predictors for the 87 pitchers who have stayed with the same team all season:

Predictor | ERA r^2 | RA9 r^2 |
---|---|---|
(K-BB)/IP | 13.01% | 12.65% |
SIERA | 8.40% | 8.18% |
xFIP | 6.87% | 6.72% |
tERA | 7.28% | 0.75% |
FIP | 5.17% | 5.48% |
ERA | 1.70% | 2.30% |

Removing the 13 starters who changed teams improved the overall r-squareds slightly, but did not really change the orderings we saw with the original sample that included those starters.

### Putting it all together

The number of tables and tests I just went through was probably exhausting, but I think it was pretty meaningful.

Most of these statistics become more meaningful as the sample size grows larger. You could classify all this information as simply small sample size noise. I’m looking at less than one season worth of data, for just 100 starters (or only 87 if you prefer those numbers). There’s a lot to be said for that argument.

ERA and RA9 in general are subject to a good deal of random variation and noise. These predictors were regressed against a sample of ERAs and RA9s that came from a range of 50.1 to 102 innings pitched. I think there’s a possibility that this analysis could be run again with the numbers from 2011, and we’d see a different predictor come out on top, solely because of that noise.

At the same time, I think these results should be taken as both a lesson and a cautionary tale. The ERA estimators that were tested (xFIP, FIP, SIERA and tERA) all did a better job of predicting future ERA than actual ERA did, which was to be expected and is the normal assumption in the sabermetric community. But although they beat ERA, simply subtracting walks from strikeouts predicted second-half ERAs better than any of the advanced statistics.

I’m not trying to say that we should move away from FIP and other ERA estimators and simply use strikeouts and walks to attempt to predict how many runs a pitcher will give up in the future.

The highest r-squared (0.13055) I found came from K-BB/IP in the 87-pitcher sample. That number still tells us that more than 86 percent of the variation in second-half ERA was left unexplained by the predictor, which isn’t very good at all.

Instead, my point is that maybe we shouldn’t even be using the results of the first half to attempt to predict ERAs for the second half of the season.

For example, before July 1, Kyle Lohse’s ERA was 2.82, but his xFIP was 4.19. The normal assumption would be that Lohse had been lucky, that we should trust his xFIP, and that his ERA would regress negatively in the second half.

His post-July 1 ERA is 2.81, essentially the same as it was during the first half. This is an extreme example, but I think it is something to learn from.

Maybe too often those in the sabermetric community simply assume that pitchers will regress toward their peripherals as the season goes on. But most of the time that regression doesn’t have time to occur in just half of a season.

Those who have read about sabermetrics long enough are probably sick of the phrase small sample size (SSS!!!). But, I think people who write about sabermetrics still fall prey to small sample sizes. I did when I began the idea for this article. I simply assumed that the ERA estimators from the first half would have a pretty strong correlation to second half ERA and RA9 numbers, and I was ready to write about which had been doing the best job this season. Then I found the results and realized that none had been doing well. And not only that, but something as simple as subtracting walks from strikeouts did better.

Therein lies the rub. In small samples, baseball statistics are still very unpredictable, even when using the most “advanced metrics” that were created to predict them.

So, next June when a starting pitcher has an ERA over five, but a SIERA in the mid-threes, please be wary of assuming that his ERA will regress over the next three months of the season.

**References & Resources**

All statistics come courtesy of FanGraphs and are updated through Sunday, Sept. 16.

### Comments

Throwing all of the very strongly related variables into a single regression of course kills the significance—it’s not a degrees of freedom thing, it’s a multicollinearity thing. If you want to stick with simple regression, you simply can’t feed the variables in like that.

Also, the Adjusted R-Sq result is more meaningful in cases like this to test whether the model has improved (even a random noise variable will marginally increase R-squared most of the time.)

@aweb

I agree. I was going to get into how all of the main predictors have strikeouts and walks as main components, which killed the significance. Using six predictors with a sample of 100 isn’t good form either, though.

To note with the adjusted R^2: when all six are thrown in, the adjusted R^2 was .046 vs. the .104 for the ERA test. So the model really didn’t improve with all of the predictors in, if you consider the adjusted number to be more meaningful.

@Matt

My opinion is that Lohse has been luckier than Weaver, because Weaver is a guy who has shown that he can outpitch his FIP consistently, while Lohse is not. So we’d expect Lohse to regress more, although over half of a season that did not occur

@ogc

I agree that pitchers like Matt Cain, Greg Maddux, Barry Zito and others who can consistently outpitch their peripherals, make advanced saber-stats less useful for those certain pitchers.

Thanks a bunch for the compliments, both of you

> Maybe too often those in the sabermetric community simply assume that pitchers will regress toward their peripherals as the season goes on. But most of the time that regression doesn’t have time to occur in just half of a season.

Hey Glenn, great article. But I don’t understand this comment. You’ve just shown that pitchers do regress toward the mean of their peripherals in the second half of the season. Perhaps they don’t regress as far as you’d like, but the regression occurs and it’s real, isn’t it?

Almost all stats are designed to tell us what happened, not what will happen. A stat will be useful as a projection of what will happen largely to the extent that it measures skill rather than performance, and even then only if the skill level remains constant. That’s a big reason why the simple metric of strikeouts and walks performed so well in this study—K’s and BB’s are more directly tied to a pitcher’s actual skills than almost anything else. It’s also why those two stats are so prominently used in advanced pitching metrics.

I think an interesting question would be how much do pitchers regress towards either their career peripherals/ERA estimators or towards forecasting systems like Zips, etc. Presumably, those would have much higher relationships with the rest of season performance. What I got out of your piece is that even with faster stabilizing peripherals we can still fall victim to small sample size. But if career numbers/projections have similarly low r-squared values, that would imply we know a lot less about pitcher performance than we think.

Actually, that was kind of a confusing answer. I think the simpler version is that, within split half-seasons, the evidence is that pitchers will regress toward their peripherals—particularly strikeouts and walks. However, ERA/RA is still largely a random thing in the short run, and the regression isn’t particularly strong.

You summed up my point in a much clearer way than I did. And I think you’re probably right, that I should’ve stayed away from saying the “regression doesn’t have time to occur within the season”

I see your Kyle Lohse and raise you Max Scherzer. Sometimes it does work exactly the way simple sabermetrics says it should =)