New SIERA, Part Four (of Five): Testing
SIERA’s updated version was unveiled Monday at FanGraphs, and as part of the release, I’ve been taking readers through its ERA-estimation process. I’ve written about SIERA’s ability to predict BABIP and HR/FB, and then I broke down its formula and structure. But like any good analysis, it needed to be tested against other estimators.
So far, the results are pretty conclusive. In fact, SIERA might be the best tool yet to help us understand and better interpret pitching performance.
I did a round of SIERA testing at Baseball Prospectus in January before making my newest adjustments here. I showed in that article that SIERA and xFIP were by far the best readily available estimators, and that SIERA did slightly better at predicting ERA than xFIP did. This involved a number of testing procedures.
First, I used Root Mean Square Error and saw that from 2003 to 2010, SIERA was closer to next season’s ERA by an average of .021. FIP was .063 behind — though it was only .033 behind when predicting non-park-adjusted ERAs for pitchers on the same team. That figure suggests park effects are biasing and exaggerating the difference between FIP and xFIP.
I also found that the difference between xFIP and SIERA was persistent, with SIERA closer to next year’s ERA in six of the seven testable years. SIERA also correlated slightly more strongly (.398 vs. .352) with next season’s ERA and was closer to next season’s ERA 54.6% of the time. Similarly, SIERA was also closest to the previous season’s ERA, to the ERA two seasons later, to the ERA two seasons prior, to the ERA three seasons later, and to the ERA three seasons prior. Weighting didn’t affect the ranking of ERA estimators.
This time, I incorporated two statistics that Tom Tango suggested during the original SIERA discussion: kwERA (defined as kwERA = constant + 11*(UBB+HBP-SO)/PA) and bbFIP (defined as bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA).
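As a minimal sketch, the two definitions above translate directly into code. The class name, function names, and field names are hypothetical, and the `constant` values in the usage lines are placeholders only: the text does not give the constants, and each formula presumably has its own.

```python
from dataclasses import dataclass


@dataclass
class PitcherLine:
    """Season counting stats; field names follow the abbreviations in the text."""
    pa: int        # plate appearances (batters faced)
    so: int        # strikeouts
    ubb: int       # unintentional walks
    hbp: int       # hit batters
    ld: int = 0    # line drives allowed
    iffb: int = 0  # infield fly balls
    offb: int = 0  # outfield fly balls
    gb: int = 0    # ground balls


def kw_era(p: PitcherLine, constant: float) -> float:
    """kwERA = constant + 11*(UBB+HBP-SO)/PA, as defined in the text."""
    return constant + 11 * (p.ubb + p.hbp - p.so) / p.pa


def bb_fip(p: PitcherLine, constant: float) -> float:
    """bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA."""
    return (constant
            + 11 * (p.ubb + p.hbp + p.ld - p.so - p.iffb) / p.pa
            + 3 * (p.offb - p.gb) / p.pa)


# Made-up pitcher line; 5.0 and 4.0 are placeholder constants, not real values.
line = PitcherLine(pa=800, so=160, ubb=56, hbp=8, ld=110, iffb=20, offb=180, gb=250)
print(round(kw_era(line, constant=5.0), 3))
print(round(bb_fip(line, constant=4.0), 3))
```

Because every coefficient is fixed in advance, only the constant needs to be set to put either statistic on the ERA scale.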
I weighted the estimator’s square error by the number of innings pitched and excluded all pitchers with fewer than 40 innings pitched — either during the season in question, or during the following season. This table shows the estimators’ ranking:
| ERA Estimator (N =2096) | RMSE |
| SIERA (new) | 1.075 |
| SIERA (old) | 1.079 |
| kwERA | 1.081 |
| bbFIP | 1.083 |
| xFIP | 1.089 |
| FIP | 1.141 |
| FRA (scaled to ERA) | 1.171 |
| tRA (scaled to ERA) | 1.215 |
| ERA (park-adj) | 1.308 |
With a standard deviation of the square error terms of about 2 — and a literal .999 correlation between xFIP square error and SIERA square error — a standard error would be about .002. That means the differences are statistically significant.
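The arithmetic behind that standard error can be checked with a short sketch. It assumes the usual formula for the variance of a difference between two correlated quantities and treats the 2,096 pitcher-seasons as independent and equally weighted, which is a simplification of the IP-weighted test; the function name is hypothetical.

```python
import math


def paired_diff_standard_error(sd_a: float, sd_b: float, corr: float, n: int) -> float:
    """Standard error of the mean difference between two correlated error series.

    Var(A - B) = Var(A) + Var(B) - 2 * corr * SD(A) * SD(B)
    """
    var_diff = sd_a ** 2 + sd_b ** 2 - 2 * corr * sd_a * sd_b
    return math.sqrt(var_diff / n)


# Inputs quoted in the text: SD of about 2 for each square-error series,
# a .999 correlation between them, and N = 2096.
se = paired_diff_standard_error(2.0, 2.0, 0.999, 2096)
print(round(se, 4))
```

The very high correlation is what makes the standard error so small: almost all of the noise in the two square-error series cancels when they are differenced.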
I also adjusted the weighting using the geometric average of IP in both the season in question and in the following season (in which the ERA was estimated). The ranking was the same:
| ERA Estimator (N =2096) | RMSE |
| SIERA (new) | 1.040 |
| SIERA (old) | 1.045 |
| kwERA | 1.053 |
| xFIP | 1.057 |
| bbFIP | 1.057 |
| FIP | 1.118 |
| FRA (scaled to ERA) | 1.150 |
| tRA (scaled to ERA) | 1.202 |
| ERA (park-adj) | 1.289 |
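Both weighting schemes can be sketched in a few lines. This is an illustration rather than the code used for the tables: the function names are hypothetical, and which season's innings serve as the weight in the first table is an assumption (the estimator season is used in the usage example).

```python
import math


def weighted_rmse(estimates, actuals, weights):
    """Square root of the weighted mean of squared (estimate - actual) errors."""
    if not (len(estimates) == len(actuals) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    weighted_sq_error = sum(
        w * (e - a) ** 2 for e, a, w in zip(estimates, actuals, weights)
    )
    return math.sqrt(weighted_sq_error / total_weight)


def geometric_mean_weights(ip_this_season, ip_next_season):
    """Weight each pitcher by sqrt(IP this season * IP next season)."""
    return [math.sqrt(a * b) for a, b in zip(ip_this_season, ip_next_season)]


def is_eligible(ip_this_season, ip_next_season, min_ip=40.0):
    """Keep only pitchers with at least min_ip innings in both seasons."""
    return ip_this_season >= min_ip and ip_next_season >= min_ip


# Made-up numbers for three pitchers (not data from the article).
estimator = [3.50, 4.20, 4.80]
next_era = [3.90, 4.00, 5.40]
ip_now = [200.0, 150.0, 60.0]
ip_next = [180.0, 40.0, 75.0]

print(weighted_rmse(estimator, next_era, ip_now))
print(weighted_rmse(estimator, next_era, geometric_mean_weights(ip_now, ip_next)))
```

The geometric mean shrinks the weight of any pitcher who threw few innings in either season, so a noisy ERA from a short follow-up season counts for less.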
Note that SIERA is not overfit to ERA. All of these RMSE figures against the following season come from a SIERA formula that was fitted by regression to same-season ERA. Fitting it to next season’s ERA would be cheating, of course, but it would produce a lower RMSE (0.920, in fact). That version of SIERA, which is fitted to next season’s ERA, is called SIERA*:
| Variable | SIERA coefficient | SIERA* coefficient |
| (SO/PA) | -15.518 | -15.219 |
| (SO/PA)^2 | 9.146 | 12.746 |
| (BB/PA) | 8.648 | -0.385 |
| (BB/PA)^2 | 27.252 | 10.671 |
| (netGB/PA) | -2.298 | -2.844 |
| (netGB/PA)^2 | -4.920 | -2.232 |
| (SO/PA)*(BB/PA) | -4.036 | 15.421 |
| (SO/PA)*(netGB/PA) | 5.155 | 5.226 |
| (BB/PA)*(netGB/PA) | 4.546 | 10.150 |
| Constant | 5.534 | 5.952 |
| Year coefficients (versus 2010 for SIERA, versus 2009 for SIERA*) | From -0.020 to +0.289 | From 0.000 to 0.426 |
| % innings as SP | 0.367 | 0.246 |
Year coefficients, by season:
| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
| SIERA Coefficient | -.020 | +.093 | +.154 | +.037 | +.289 | +.226 | +.116 | +.103 | .000 |
| SIERA* Coefficient | +.140 | +.181 | +.106 | +.426 | +.204 | +.133 | +.116 | .000 | n/a |
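To make the coefficient table concrete, here is a minimal sketch that evaluates the SIERA column as a plain polynomial in the three rates. Several things are assumptions rather than statements from this article: netGB/PA is passed in precomputed (its definition is in Part Three), the squared netGB term is applied with the table's sign as written, the year coefficient is added as a flat offset, and the starter share is a fraction between 0 and 1. All names are hypothetical.

```python
# Year offsets for the SIERA column (relative to 2010), copied from the table.
YEAR_COEF = {
    2002: -0.020, 2003: 0.093, 2004: 0.154, 2005: 0.037, 2006: 0.289,
    2007: 0.226, 2008: 0.116, 2009: 0.103, 2010: 0.000,
}


def siera_sketch(so_rate, bb_rate, net_gb_rate, sp_share, year):
    """Evaluate the SIERA-column coefficients; rates are per plate appearance.

    sp_share is assumed to be the fraction (0 to 1) of innings thrown as a starter.
    """
    return (
        5.534
        - 15.518 * so_rate
        + 9.146 * so_rate ** 2
        + 8.648 * bb_rate
        + 27.252 * bb_rate ** 2
        - 2.298 * net_gb_rate
        - 4.920 * net_gb_rate ** 2
        - 4.036 * so_rate * bb_rate
        + 5.155 * so_rate * net_gb_rate
        + 4.546 * bb_rate * net_gb_rate
        + 0.367 * sp_share
        + YEAR_COEF[year]
    )


# Made-up starter: 20% strikeouts, 8% walks, +5% net ground balls, 2010 season.
print(round(siera_sketch(0.20, 0.08, 0.05, sp_share=1.0, year=2010), 2))
```

Swapping in the SIERA* column would only change the numbers, not the structure, which is why the two versions can be compared coefficient by coefficient.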
Take SIERA* with a grain of salt, though, since the metric is calibrated to model in-sample data. And because it’s directly minimizing the square error in estimating next season’s statistic by design, this is an ERA predictor (a projection metric), rather than an ERA estimator (which recreates ERA net of luck and defense). If it’s a projection metric, it should be compared to Oliver, ZIPS, PECOTA, Marcel and the rest. If the goal is to take non-peripheral luck and defense out of the equation, then SIERA is more useful than SIERA*.
I also tested the ERA estimators against same-year ERA. Naturally, the statistics that directly factored in a pitcher’s home run rate did the best (except for tRA) — and bbFIP did better than new SIERA. That’s probably because line drives raise a pitcher’s bbFIP, while scoring bias might arbitrarily inflate that pitcher’s line-drive total when he suffers bad luck. (Note that line-drive rate is not persistent year to year when measured per batted ball. The way to give up fewer line drives is to strike out more batters. In other words, a pitcher needs to avoid batted balls in the first place.) Although kwERA did pretty well with next-season ERA testing, it did poorly with same-year estimation.
| ERA Estimator (N =3328) | RMSE |
| FIP | .740 |
| FRA (scaled to ERA) | .779 |
| bbFIP | .852 |
| SIERA (new) | .870 |
| xFIP | .875 |
| tRA (scaled to ERA) | .886 |
| SIERA (old) | .890 |
| kwERA | .913 |
Because the 2002 SIERA is not calculable (Retrosheet batted-ball data are unavailable for that year), it seemed that old SIERA might be unfairly penalized. So I removed the 2002 estimators from the sample and recalculated, using the 2003 through 2009 estimators to predict 2004 through 2010 park-adjusted ERA.
| ERA Estimator (N =1838) | RMSE |
| SIERA (new) | 1.075 |
| SIERA (old) | 1.083 |
| kwERA | 1.086 |
| bbFIP | 1.088 |
| xFIP | 1.092 |
| FIP | 1.149 |
| FRA (scaled to ERA) | 1.181 |
| tRA (scaled to ERA) | 1.217 |
| ERA (park-adj) | 1.313 |
The ranking remained the same, anyway.
As I did last time, I checked whether FIP did better with pitchers who stayed on the same team. To determine that, I tested against un-park-adjusted ERA and compared how other statistics predicted park-adjusted ERA for pitchers who stayed on the same teams. The results show an improvement for FIP, but not enough to out-predict xFIP.
| ERA Estimator (N =1353) | RMSE |
| SIERA (new) | 1.066 |
| SIERA (old) | 1.068 |
| xFIP | 1.071 |
| kwERA | 1.073 |
| bbFIP | 1.078 |
| FIP (un-adjusted ERA) | 1.114 |
| FRA (scaled to ERA) | 1.157 |
| tRA (scaled to ERA) | 1.205 |
| ERA (park-adj) | 1.273 |
While Root Mean Square Errors are useful when estimating the average difference between an ERA estimator and an ERA, sometimes it’s easier to simply compare statistics head-to-head.
In the next table, I checked how often each statistic came closer than each of the others to the next season’s ERA. The new version of SIERA won its head-to-head matchups against every other estimator. Both bbFIP and kwERA beat xFIP, and kwERA even beat the old SIERA; xFIP was close behind and beat all the remaining estimators. All the differences are statistically significant, unless they’re italicized.
| % of times row closer (N=2096, so St. Dev. = 1.1%) | SIERA (new) | SIERA (old) | xFIP | FIP | bbFIP | kwERA | FRA (adj) | tRA (adj) | ERA (adj) |
| SIERA (new) | – | 50.4% | 52.7% | 55.0% | 52.9% | 53.0% | 56.2% | 56.4% | 60.6% |
| SIERA (old) | | – | 53.4% | 54.6% | 52.1% | 48.1% | 55.0% | 58.0% | 59.9% |
| xFIP | | | – | 53.2% | 49.8% | 48.6% | 54.5% | 57.3% | 58.7% |
| FIP | | | | – | 46.4% | 46.3% | 51.6% | 55.7% | 59.5% |
| bbFIP | | | | | – | 49.4% | 53.9% | 58.0% | 60.1% |
| kwERA | | | | | | – | 54.4% | 57.9% | 59.4% |
| FRA (adj) | | | | | | | – | 53.6% | 58.2% |
| tRA (adj) | | | | | | | | – | 53.6% |
| ERA (adj) | | | | | | | | | – |
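A head-to-head cell and the 1.1% standard deviation in the table header can be reproduced with a small sketch. The function names are hypothetical, counting ties as half a win is an assumption, and the standard deviation is the one implied by a coin-flip null (each estimator equally likely to be closer).

```python
import math


def share_closer(est_a, est_b, actual):
    """Fraction of pitchers for whom estimator A is closer to actual ERA than B.

    Ties count as half a win for each side (an assumption, not from the article).
    """
    wins = 0.0
    for a, b, y in zip(est_a, est_b, actual):
        err_a, err_b = abs(a - y), abs(b - y)
        if err_a < err_b:
            wins += 1.0
        elif err_a == err_b:
            wins += 0.5
    return wins / len(actual)


def coin_flip_sd(n):
    """Standard deviation of a win share when each comparison is a 50/50 flip."""
    return math.sqrt(0.25 / n)


# Made-up values for four pitchers (not data from the article).
print(share_closer([3.0, 4.0, 5.0, 3.5], [3.5, 4.5, 4.2, 3.5], [3.2, 4.4, 4.0, 3.5]))
print(round(coin_flip_sd(2096), 3))
```

With N = 2096 the null standard deviation is about 1.1 percentage points, so a share needs to sit roughly two points or more away from 50% to clear a conventional two-standard-deviation bar.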
What I found particularly useful about SIERA is that it doesn’t regress a pitcher’s performance as far back toward the mean as other estimators. The standard deviation for SIERA is about .75, while the standard deviation of xFIP is .68. Even though peripheral performance regresses to the mean — which makes an estimator that regresses ERA to the mean better, according to RMSE — SIERA still holds its own against xFIP.
Basically, SIERA does what ERA estimators are supposed to do. It estimates what a pitcher’s ERA would have been, net of defense and sequencing/BABIP/HR luck, while allowing the effects of his performance on peripherals to shine through.
From many angles, it seems that SIERA is the leading ERA estimator available. Still, its advantage over xFIP is small. One can see that SIERA and xFIP, along with bbFIP and kwERA, are in a tightly packed league of their own, and all are useful when predicting next season’s ERA and isolating pitcher performance. Looking at peripherals alone can be useful, but finding a meaningful way to incorporate all of these statistics is valuable. Because of that, each of these statistics has its strengths.
What is particularly interesting, given their very high correlation, is how often SIERA and xFIP still differ. The frequent differences between these two estimators should make it clear why I use both when evaluating a pitching matchup.
So what ideas weren’t incorporated into the new SIERA? Tomorrow, you’ll find out.
I just wanted to say I have really enjoyed reading this whole series.
This is probably the most well-researched, thoroughly-defended piece I’ve ever read on fangraphs.
This series has been excellent. We really appreciate the obvious time and energy devoted to a more accurate statistic.
I know this was a minor point, but I was a little surprised that there isn’t year-to-year correlation between a pitcher’s line drive rates. Is that because pitchers who aren’t decently good at preventing line drives are not in the majors for long?
It seems more like preventing linedrives isn’t a skill at all. From a pitcher’s perspective, LD% is a function of luck and scoring bias. A pitcher can control GB/FB, but linedrives are just batted balls that are halfway between the two.
Yeah, exactly what both of you said.
I would think that inducing weaker contact meant allowing fewer line drives.
Apparently not. But probably weaker line drives. I’m guessing hitters slow their bats down to make it easier to hit a line drive against tough pitchers, but the ball doesn’t go as far or as fast.
Bravo.
Outstanding series. One of the very best I’ve seen on this site. Very nice work.
Maths.
Seriously, though, I ended up skimming through most of this because it seemed mostly over my head, but the general gist I got was this: SIERA is the best ERA estimator out there, but the differences between it and xFIP, as well as other estimators, are very small. In other words, SIERA is cool because it more accurately portrays the actual skill level of the pitcher and gives us more insight into what a pitcher has and doesn’t have control over, but is not really necessary as a replacement for xFIP. Did I get that right?
Yeah, that’s basically right. Using SIERA or xFIP is basically a matter of preference, as 99% of the time, they’re going to give the same answer. When they differ, the margin is still very small. If you prefer one to the other, that’s great, and we’re happy to offer both here on FanGraphs.
Any plans for Fangraphs to use SIERA or xFIP instead of FIP when determining WAR at some point?
@Briks
Philosophically, that would be tricky, since FIP is *supposed* to be descriptive, while xFIP and SIERA are *supposed* to be predictive.
Any thought to developing a new stat describing a pitcher’s average overall contribution given luck- and defense/park/etc. neutrality? xWAR, if you will?
Yes, exactly. I almost always use both when I’m watching a game and want stats on the pitcher matchup– as in, if I’m being a regular stat-interested baseball fan rather than a stat-developer.
Love the series and I do believe that the new SIERA is better than the old SIERA, however if we’re using statistical rigour as our qualification, your tables above seem to suggest the difference between the old and new SIERA is not statistically significant (I’m guessing you used a 95% CI).
I agree with the changes in methodology and truly believe they are improvements on the old SIERA. And maybe it is that I’m not understanding the testing properly, but I do not see the statistical proof that it is significantly better. So it is possibly random chance over the specific sample in question that the new is better than the old. Just a thought…
Yeah, the difference isn’t statistically significant per se, but it’s slightly better using a few different measures. I guess the only reason to use the old SIERA is if you don’t like one of the changes I made, because I didn’t take anything out. A few of the new terms are statistically significant, if that helps.
You mean, you just wanted to make enough changes that BP wouldn’t sue you for taking their proprietary stat to Fangraphs?
Public formulas are not proprietary.
I wouldn’t make changes unless I thought it improved the formula.
@Yir
Imagine a world where you could copyright a regression equation. heh :p
To be clear, I was joking / giving Swartz a hard time.
What are kwERA and bbFIP trying to measure exactly? Are they closer to ERA and FIP, respectively, or are they closer to SIERA? Since they don’t have coefficients (just a constant term), it seems like they are simpler. I can calculate kwERA in my head at the ballpark (if I knew the constant), but what is that stat trying to tell me?
kwERA expresses a pitcher’s talent at BB (and HBP), SO along the ERA scale. That is, it ignores any HR talent, any batted ball talent, and any steals/pickoff talent.
As you can see, it does fantastically well.
bbFIP is kwERA plus the batted ball frequency (LD, Pops, GB, FB), and expresses it along the ERA scale. So, it ignores any HR talent, any talent at getting more outs on GB than another pitcher would get on GB, and any steals/pickoff talent.
As you can see, it is WORSE than kwERA (even though it contains more information).
What that means is that the frequency of batted balls and/or the weights it uses for each batted ball type, is not useful to determine a pitcher’s talent level.
It IS useful if you are trying to explain current-season data. But, if that’s the goal, then FIP wins that one much better. Though wOBA for pitchers would be even better than FIP.
It’s worth noting that part of the main reason kwERA did so well was that it had the lowest standard deviation of the group. Anything closer to the mean ERA is going to have an advantage.
Regress bbFIP against next year ERA directly and you do better than regressing kwERA against next year ERA.
(Note that I don’t actually use next-year ERA in any of the regressions to determine SIERA. Next-year ERA, like past-year ERA, two-year-later ERA, two-year-past ERA, and three-years later and past ERAs are all just proxies for how well estimators pick up on skill.)
When you tested SIERA using 2003-2010 data, did you test it as you would have calculated it if you had created SIERA in 2003? If not, then wouldn’t the results be biased because you are testing a regression against the data that was used to construct it?
I think I know what you’re asking, but I’m not sure.
I’m testing SIERA against next-year-ERA while the formula was devised as a regression against same-year-ERA. As a hypothetical, I showed what a biased regression like you describe would look like above when I developed the “fake SIERA” called SIERA*. That RMSE with next-year-ERA was 0.920, as opposed to regular SIERA’s 1.040.
In the original SIERA run, Eric and I developed it for 2003-2007 and then applied it 2008 peripheral data against 2009 ERA and it worked. The original formula released in 2/2010 did well using 2009 peripherals to predict 2010 ERA too.
Is that what you were asking about?
Well put-together. I will have to start using SIERA more.
One thought on your closing lines: is there any meaningful trend on the pitchers where xFIP and SIERA produce significant disparities in predictions? Are these disparities random or consistent?
Check out Part 3 from Wednesday. The short answer is high-K, low-GB pitchers do better with SIERA, but it’s fleshed out a little more in Wednesday’s piece.
I guess I don’t quite see the point of these “horse races.” They show that for predicting the next season’s ERA, SIERA has slightly lower RMSE than the other statistics. But wouldn’t you do better yet by combining several statistics using a multiple regression? For most practical problems, we shouldn’t be looking to limit ourselves to a single statistic – I think what we really should be looking for is how much weight to give to SIERA, how much to FIP, how much to simple ERA, etc. I don’t want to ignore the information that’s incorporated in FIP or ERA and that’s ignored by SIERA, I just want to give it an appropriate weight.
Furthermore, the weights should differ depending on the time frame used. That is, if I’m forecasting based on data for a relatively short period (say, half a season), I’d probably give almost no weight to ERA, but if a pitcher’s ERA is consistently better than the defense-independent metrics for several years, then I’d figure there are skills being captured by ERA (holding and picking off base runners, pitcher fielding, skill at inducing infield flies) that are missing from metrics like SIERA. So I’d guess the coefficients of FIP and ERA would tend to rise when forecasting using data for longer time spans.
Good idea. I did this once before but using the same data-set…
Using all pitchers with at least 40 IP in consecutive seasons and weighting by IP in the first season (using ERA as next season’s park-adjusted ERA)
ERA = 1.50 + .95*SIERA – .28*xFIP
(all p-stats < .002)
(This has multi-collinearity issues since the two stats correlate and we are trying to differentiate which effect is what, but splitting odd & even seasons, I get):
ERA = 1.47 + .86*SIERA – .19*xFIP
ERA = 1.50 + 1.04*SIERA – .37*xFIP
(p<.003, except for the first xFIP coefficient is .128)
Doing FIP in there, we get:
ERA = 1.30 + .61*SIERA + .14*FIP
(all p-stats < .01)
ERA = 1.32 + .56*SIERA + .15*FIP
ERA = 1.27 + .67*SIERA + .06*FIP
(p-stats <.01 except for the last FIP one of .265)
Doing FIP & ERA-same-year:
ERA = 1.31 + .61*SIERA + .04*FIP + .07*ERA
(p-stats .000, .398, .007)
Splitting halves…
ERA = 1.32 + .56*SIERA + .10*FIP + .06*ERA
(p-stats: .000, .146, .132)
ERA = 1.28 + .66*SIERA – .02*FIP + .08*ERA
(p-stats: .000, .764, .020)
Anything else to run? I’m happy to run tests as asked if my programming skills can handle it! Thanks for the great idea, Brent!
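For anyone who wants to try combinations like these, here is a minimal sketch of an innings-weighted least-squares fit in plain Python. It is not the code behind the equations above: the function names are hypothetical, the solver is a simple normal-equations approach, and the sample numbers are made up.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(rows, targets, weights):
    """Fit target = b0 + b1*x1 + ... by minimizing weighted squared error.

    rows holds one list of predictors per pitcher (no intercept column).
    Returns [b0, b1, ...].
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtwx = [[sum(w * d[i] * d[j] for d, w in zip(design, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * d[i] * t for d, t, w in zip(design, targets, weights))
            for i in range(k)]
    return solve_linear(xtwx, xtwy)


# Made-up rows of (SIERA, xFIP), next-season ERA, and first-season IP weights.
predictors = [[3.2, 3.4], [4.1, 3.9], [4.6, 4.9], [3.7, 3.3], [5.0, 4.7]]
next_era = [3.6, 4.3, 4.4, 4.1, 5.2]
innings = [200.0, 160.0, 90.0, 180.0, 60.0]
print(weighted_least_squares(predictors, next_era, innings))
```

When two predictors track each other as closely as SIERA and xFIP do, the individual coefficients from a fit like this are unstable, which is the multicollinearity caveat noted above.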
One test I would like to see is to take each pitcher’s variance (SIERA – ERA) and correlate between seasons. Should come out close to zero, I would hope.