Standard deviation and ERA estimators

Sabermetrics is my passion, but that does not mean it always has been.

One of the main reasons I became interested in studying baseball statistics was fielding independent pitching (FIP) and ERA estimators. Over the past few months, my interest in ERA estimators turned into an obsession. During that time, I developed predictive FIP (pFIP), an ERA estimator of my own.

In almost every test that I ran, pFIP came out ahead of other, more established ERA estimators (higher correlation with next-season ERA). This led me to believe that pFIP was the best ERA estimator currently available, but that in no way meant that the metric was without its own flaws.

Below are the standard deviations for pFIP, FIP, ERA and two other ERA estimators (SIERA and xFIP), among pitchers who threw at least 100 innings in a season from 2010 to 2012:

Metric STDEV
ERA 0.862
SIERA 0.510
xFIP 0.502
pFIP 0.387

Unsurprisingly, on a single season basis, ERA has the widest distribution, while pFIP has the tightest.
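
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the calculation. It assumes the sample standard deviation (the article does not say whether the sample or population form was used), and the pitcher-season values are made up for illustration; they are not the FanGraphs data behind the table above. The helper name `metric_spread` is hypothetical.

```python
import statistics

def metric_spread(values_by_metric):
    """Return the sample standard deviation of each metric's pitcher-season values."""
    return {name: statistics.stdev(vals) for name, vals in values_by_metric.items()}

# Made-up pitcher-season values, for illustration only (not the article's data).
sample = {
    "ERA":  [2.50, 3.20, 3.90, 4.60, 5.30],
    "pFIP": [3.50, 3.70, 3.90, 4.10, 4.30],
}

spreads = metric_spread(sample)
for name, sd in spreads.items():
    print(f"{name}: {sd:.3f}")
```

Both toy metrics share the same mean, so the only difference the function picks up is how far the individual values sit from it.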

Quite honestly, the fact that pFIP’s standard deviation is so much smaller than those of xFIP and SIERA (which are themselves known for having small standard deviations) was a cause for concern.

In Colin Wyers’ piece on SIERA and other ERA estimators this issue is discussed in great detail:

In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.

Simply producing a lower standard deviation doesn’t make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn’t provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.

My understanding of Colin’s argument is that metrics like xFIP and SIERA “crudely” regress each pitcher to the mean, which would lead to a higher correlation (lower RMSE), but at the same time may not be an accurate measure of a pitcher’s true talent level.
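
The mechanics of that argument can be sketched with a small simulation. Everything here is an assumption made for illustration: the talent and noise spreads, the sample of 500 simulated pitchers, and the 0.4 shrinkage weight are invented, and the `shrink` helper is a generic pull toward the group mean, not the method used by xFIP, SIERA or pFIP. The point it tries to show is narrow: uniformly shrinking an estimator toward the mean lowers its standard deviation and can lower its RMSE against next-season ERA, while its correlation with next-season ERA does not move at all.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(pred, actual):
    """Root mean squared error of predictions against actual values."""
    return math.sqrt(statistics.fmean([(p - a) ** 2 for p, a in zip(pred, actual)]))

def shrink(xs, weight):
    """Pull every value toward the group mean; weight=1 leaves values unchanged."""
    m = statistics.fmean(xs)
    return [m + weight * (x - m) for x in xs]

rng = random.Random(0)  # fixed seed so the run is repeatable
talent = [rng.gauss(4.0, 0.5) for _ in range(500)]       # assumed true-talent ERA
estimate = [t + rng.gauss(0, 0.6) for t in talent]       # noisy estimator, year 1
next_era = [t + rng.gauss(0, 0.8) for t in talent]       # noisy ERA, year 2

shrunk = shrink(estimate, 0.4)  # hypothetical shrinkage weight

for label, est in (("raw", estimate), ("shrunk", shrunk)):
    print(label,
          "sd:", round(statistics.stdev(est), 3),
          "rmse:", round(rmse(est, next_era), 3),
          "corr:", round(pearson(est, next_era), 3))
```

Because the shrunk version is just a linear rescaling of the raw one, it ranks pitchers identically. Any RMSE gain comes purely from the narrower spread, which is the sense in which a lower RMSE alone is not evidence that more skill is being measured.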

It is evident that of the four ERA estimators discussed in this piece, pFIP has the largest regression to the mean for each individual. This fact brings me to a question whose answer I’ve found myself switching sides on countless times.

What is the point of an ERA estimator?

There are two answers to that question that could hold serious weight in an argument:
1. To be the best at predicting (highest correlation with) future ERA
2. To be the best representation of a pitcher’s true talent level

It would be nice if an ERA estimator came along that could fulfill both of those requirements, but I would argue that none has so far.

For the first (high correlation with future or next season ERA), the estimate should be heavily regressed to the mean. But when one is attempting to estimate a pitcher’s true talent level, should that regression be as harsh? At lower innings pitched totals there should be some regression, but not nearly as much as when the goal is simply to predict future ERA.

My main issue with the true talent level idea for ERA estimators is how difficult it actually is to calculate that number. An ERA estimator that reflected a pitcher’s skill should be able to account for all of the possible factors within the pitcher’s control and weed out all of the other factors around the pitcher correctly. The problem is that it is nearly impossible to do.

In the extreme, relievers throw so few innings on a per-season basis that by the time they throw enough innings for us to get a fair idea of their ERA talent, years will have passed. And in all likelihood, their true talent will have changed. Even for starters who throw more innings, their true talent level is tough to decipher out of all the different factors that go into run prevention.

The consensus at this point is that the estimator with a standard deviation as wide as one’s true talent ERA and a high correlation with future ERA is the best at measuring true talent. However, there are issues with this approach too.

Pinning down an exact number for the standard deviation of a pitcher’s true talent ERA is difficult. This issue was raised in a FanGraphs Community Post by Steve Staude. He showed that from 200 to 1,000 innings pitched, the standard deviation of ERAs ranges from 0.8 down to 0.5 as the innings increase.

I think most would agree that true talent does not reveal itself at the 200-inning mark, but then where? 500? 750? 1,000?

Most pitchers never reach 1,000 career innings; many do not reach 500. It also takes most pitchers at least three seasons to reach 500 innings, and it seems reasonable that an individual’s talent level could change significantly over the course of those seasons.

For a moment though, let’s ignore that and look to Wyers’ original piece to see that ERA true talent seems to be revealed somewhere between 400 and 500 innings. According to Staude’s study, the standard deviation of ERA between 400 and 500 innings ranges from about 0.65 to 0.6; thus, it would make logical sense that an ERA estimator with a high correlation with future ERA and a standard deviation of around 0.6 or 0.65 would be the best true talent estimator.
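
One way to picture why the spread of observed ERA narrows as innings pile up is a simple variance decomposition: observed variance equals true-talent variance plus random variance, with the random part shrinking in proportion to innings. The sketch below is only an illustration under that assumption. The two constants are hypothetical, chosen so the curve lands in the neighborhood of the range quoted above, and are not estimated from any data. The model also ignores survivorship (only good pitchers reach 1,000 innings) and talent changing over time.

```python
import math

def observed_sd(true_sd, noise_sd_at_200ip, innings):
    """SD of observed ERA if random variance scales inversely with innings."""
    noise_var = noise_sd_at_200ip ** 2 * (200 / innings)
    return math.sqrt(true_sd ** 2 + noise_var)

# Hypothetical inputs, not estimates from real data.
TRUE_SD = 0.45            # assumed spread of true-talent ERA
NOISE_SD_AT_200IP = 0.66  # assumed random spread in a 200-inning sample

for innings in (200, 500, 1000):
    print(innings, round(observed_sd(TRUE_SD, NOISE_SD_AT_200IP, innings), 3))
```

Under this model the observed spread never falls below the true-talent spread; it only approaches it as innings grow, which is why picking the innings mark where talent is "revealed" matters so much for the benchmark.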

Interestingly, if we look at the standard deviation that I found for FIP in this article, it falls right in line with that logic. FIP has a higher correlation (in small to medium samples) with future ERA than ERA and it has a standard deviation that is similar to “true talent” ERA. This assumption also falls in line with a trend we often see: A pitcher’s career FIP lines up fairly closely to his career ERA.

The fact that most logic would lead one to conclude that FIP is the best true talent ERA estimator we have available fascinates me.

Why? Because the structure of FIP is in no way meant to predict future ERA.

FIP is commonly used in that fashion because it does a fairly good job of predicting future ERA, but that is not the statistic’s purpose. FIP is meant to describe a pitcher’s performance on a scale that looks like ERA. It’s best described as what a pitcher’s ERA should have been. That type of description may make FIP sound similar to a true talent evaluator, but it was in no way built to predict or describe future performance.

This idea brought about the birth of pFIP.

pFIP regresses the components of FIP (strikeouts, walks, home runs) to predict future performance rather than describe past performance. In plain English, that idea sounds great, and interestingly the math works out, too.

FIP’s more volatile components (walks and homers) receive a fair amount of regression, while strikeouts (the least volatile) receive little or no regression. These regressions result in a stronger correlation with next season ERA than simply using FIP.
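
To make the idea of regressing components concrete, here is a rough sketch. It is not the actual pFIP formula, which is not given in this article. The `fip` function uses the standard FIP weights (13 for home runs, 3 for walks, -2 for strikeouts); the standard formula also counts hit batters with walks, which is left out here because only strikeouts, walks and home runs are discussed above. The league rates, the "keep" weights, the 3.10 constant and the helper names are all hypothetical placeholders.

```python
def fip(hr, bb, k, ip, constant=3.10):
    """Standard FIP weights; the constant varies by season (3.10 is a placeholder)."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

def regress_rate(rate, league_rate, keep):
    """Keep a fraction `keep` of the pitcher's deviation from the league rate."""
    return league_rate + keep * (rate - league_rate)

# Hypothetical league rates per inning and hypothetical "keep" fractions:
# strikeouts are mostly kept, walks and homers are regressed more heavily.
LEAGUE_PER_IP = {"hr": 0.11, "bb": 0.34, "k": 0.83}
KEEP = {"hr": 0.3, "bb": 0.6, "k": 0.9}

def regressed_fip(hr, bb, k, ip, keep=KEEP, league=LEAGUE_PER_IP, constant=3.10):
    """FIP computed from component rates after each is regressed toward the league."""
    observed = {"hr": hr / ip, "bb": bb / ip, "k": k / ip}
    r = {name: regress_rate(observed[name], league[name], keep[name])
         for name in observed}
    return 13 * r["hr"] + 3 * r["bb"] - 2 * r["k"] + constant

# A made-up pitcher line, better than the assumed league rates in every component.
line = dict(hr=10, bb=40, k=220, ip=200)
print(round(fip(**line), 2), round(regressed_fip(**line), 2))
```

With every keep fraction set to 1 the function returns plain FIP, and with every fraction set to 0 it returns the league-average figure for everyone, so the chosen fractions decide how tightly the resulting numbers cluster around the mean.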

But is pFIP really saying anything about a pitcher’s true talent level?

I would argue that it may give one some indication of a pitcher’s talent, but it is not a true talent evaluator. If you look at the pFIP equation, the standard deviation of pFIP or an individual’s numerical pFIP, what the statistic actually does becomes very clear.

Essentially, pFIP starts each individual’s ERA projection at the same point (the mean ERA) and then moves each number slightly away from the mean based on the player’s individual peripherals. This strategy works great when your goal is to predict with the highest rate of success, but it does not give you a great idea of a pitcher’s actual true ERA or skill.

Thus, when one decides to evaluate pFIP as a statistic one must return to our original question: What is the purpose of an ERA estimator?

pFIP is essentially useless if you’d like to evaluate a pitcher’s talent level, but if your goal is to predict next season’s ERA, then pFIP will serve you well. However, if predicting future ERA is the only real purpose of pFIP, then is there any real reason for the statistic?

Projection systems are a very real thing, and their goal (at least from what I understand) is to do exactly what pFIP does: project future performance. I’ve shown before that pFIP is fairly comparable to projection systems when looking at overall correlation with next season runs (or ERA). I’m fairly certain that simple correlation with the next season is not the best way to test how well a projection system works, but let’s say for a moment that it is.

Is pFIP really better than or equivalent to a projection system?

The short answer is quite obviously no, but the evidence behind that assessment is fairly educational.

I’m not saying this is true, but consider a fantasy scenario where pFIP has exactly the same correlation with future ERA as an average projection system. How would we test which one was actually doing a better job? A good starting point would be to consider the standard deviations of the projected ERAs.

I looked at the standard deviations of ERA projections for three projection systems (Marcel, Bill James and ZIPS) for the years 2010-2012 for pitchers who were projected to have at least 100 innings in that season:

Projection System STDEV
Bill James 0.514
Marcel 0.520
ZIPS* 0.657

*ZIPS does not project playing time, so Marcel’s playing time projections were used for the pitchers in the sample.

Under the assumption that pFIP and the projection systems have similar or equivalent correlations, it would seem that projection systems do a better job at really projecting future performance/skill as their distributions are wider.

This should not be too surprising, as projection systems take a great deal more information into account than pFIP. Projection systems are also more useful because they project playing time and counting stats as well as rate stats (like ERA, FIP, etc.).

This all brings me back again to my original question: What is the point of an ERA estimator? Or to be more clear, if we have projection systems, then what is the point of ERA estimators?

I can think of only two answers to that question.

The first is that some people don’t trust or find utility in projection systems and thus find sanctuary in using much simpler ERA estimators, which are still fairly predictive and easier to understand. The second is that ERA estimators should be a reflection of a pitcher’s true talent level, which, of course, is almost impossible to define.

References & Resources
All statistics come courtesy of FanGraphs.

Steve "Staude"
Well done, Glenn. I wrestled with the question of the point of an ERA estimator too. My conclusion was that it should be to estimate true talent level… but I also think that good prediction of future ERAs is a major sign that you have a grasp on their true talent level. There’s a lot of noise there, though, so it can’t be the sole basis. I think over a career, an estimator should also line up well with career ERA, as with enough IP, career ERA becomes an even better indicator of true talent level (as you say). That…
Glenn DuPaul
Thanks for the response, Steve.  I’ve toyed with the creation of an ERA estimator that takes multiple years into account. The only issue with it though, is that it came out very similar (huge regression) to pFIP, because I was regressing the numbers upon one season.  I think in order to create a true talent ERA estimator, one would have to regress the components upon 400 innings (or so) of data.  This would result in far less regression and possibly an idea of how to project true talent ERA.  I imagine this is similar to what projection systems do though.

Robert Arne
Thanks, this article comes along at just the right moment. I’ve been thinking of some of the aspects of what you mention in the article as I shirk my own personal responsibilities. The origin of this has to do with moves made by the Angels this offseason. My original preference was for them to re-sign Greinke rather than sign Hamilton, and I viewed the contracts as fairly equal at $25 million per year. However, being open to different strategies, I’m wondering if the moves made hadn’t increased the possible number of wins vs. signing the new Dodger. In other words, I read an…
Glenn DuPaul
@MGL Thanks for dropping by. There’s a lot here to digest, but I’ll do my best to get to all of your main points. 1.) Anytime that I mention correlation or correlation with future ERA, I’m referring to next season ERA, I apologize for not being more clear. I also agree that the best predictor of future performance and true talent level are for all intents and purposes the same thing. However, like you refer to with defenses, park factors, etc. I’m not positive that a metric or projection system that is meant to project one season of data can…

i dont understand why these stats cant be easily figured and used by coaches?  i coach high school baseball and use these stats during the season as a predictor for future performance for that season and the next season.  this would be a great stat, but i majored in mass media management and education…

Steve "Staude"
MGL, you’re a master. I’d love to hear your thoughts on my FanGraphs Community Research article that Glenn talks about here. I came to the same conclusion you did regarding the ideal standard deviation in an ERA estimator: about 0.5 earned runs. I said that 0.5 implied that about 2.2% of pitchers have a true ERA under ~3.1, which sounds about right to me (maybe the SD could be a tiny bit higher, though). Glenn, yeah, you can’t completely rely on single season data for either the basis or the comparison for a formula, I think; there’s just as much noise in the…
Glenn, nice job bringing up and discussing some interesting questions! Here are some comments I have regarding these issues/questions: 1) Please, please do not use the term “correlation” without explaining or mentioning what you are correlating with what! For example, “In almost every test that I ran, pFIP came out ahead (higher correlation) of other more established ERA estimators.” I assume that you mean “higher correlation” with next year’s ERA, but it is by no means obvious. It could mean correlation with the same year’s ERA or next year’s pFIP (or FIP or whatever metric you are discussing). You do…
4) Finally, the whole idea of the spread of your stat or projection versus its accuracy in terms of correlation or RMS error, that is a very tricky issue. It is true that you want as wide a variance as possible in your projection or projection stat, while at the same time you want as high a correlation or as low an RMS error as possible, but by no means is it clear what kind of balance is best. Perhaps that was your point, I am not sure. Sure, if your stat has too small a variance it is not a…