New SIERA, Part Four (of Five): Testing

SIERA’s updated version was unveiled Monday at FanGraphs, and as part of the release, I’ve been taking readers through its ERA-estimation process. I’ve written about SIERA’s ability to predict BABIP and HR/FB, and then I broke down its formula and structure. But like any good analysis, it needed to be tested against other estimators.

So far, the results are pretty conclusive. In fact, SIERA might be the best tool yet to help us understand and better interpret pitching performance.

I did a round of SIERA testing at Baseball Prospectus in January before making my newest adjustments here. I showed in that article that SIERA and xFIP were by far the best readily available estimators, and that SIERA did slightly better at predicting ERA than xFIP did. This involved a number of testing procedures.

First, I used Root Mean Square Error (RMSE) and saw that from 2003 to 2010, SIERA was closer than xFIP to next season’s ERA by an average of .021. FIP was .063 behind, though it was only .033 behind when predicting non-park-adjusted ERAs for pitchers who stayed on the same team. That figure suggests park effects are biasing and exaggerating the difference between FIP and xFIP.

I also found that the difference between xFIP and SIERA was persistent, with SIERA closer to next year’s ERA in six of the seven testable years. SIERA also correlated slightly higher (.398 vs. .352) with next season’s ERA and was closer to it 54.6% of the time. Similarly, SIERA was closest to the previous season’s ERA, to the ERA two seasons later, to the ERA two seasons prior, to the ERA three seasons later, and to the ERA three seasons prior. Weighting didn’t affect the ranking of ERA estimators.

This time, I incorporated two statistics that Tom Tango suggested during the original SIERA discussion: kwERA (defined as kwERA = constant + 11*(UBB+HBP-SO)/PA) and bbFIP (defined as bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA).
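To make the two definitions concrete, here is a minimal Python sketch of both formulas as written above. The article does not give the value of the constant, so `constant` is left as a parameter, and the 5.0 used in the example is only a placeholder. The function names and the sample pitcher line are hypothetical.

```python
def kw_era(ubb, hbp, so, pa, constant):
    """kwERA = constant + 11 * (UBB + HBP - SO) / PA."""
    return constant + 11.0 * (ubb + hbp - so) / pa


def bb_fip(ubb, hbp, ld, so, iffb, offb, gb, pa, constant):
    """bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA."""
    return (constant
            + 11.0 * (ubb + hbp + ld - so - iffb) / pa
            + 3.0 * (offb - gb) / pa)


# Hypothetical season line; the constant is a placeholder, not the real value.
example_kw = kw_era(ubb=50, hbp=5, so=180, pa=800, constant=5.0)
example_bb = bb_fip(ubb=50, hbp=5, ld=100, so=180, iffb=20,
                    offb=180, gb=250, pa=800, constant=5.0)
print(example_kw, example_bb)
```

Note that kwERA needs only walks, hit batsmen, strikeouts and plate appearances, while bbFIP adds the batted-ball counts on top of the same core term.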

I weighted each estimator’s squared error by the number of innings pitched and excluded all pitchers with fewer than 40 innings pitched, either during the season in question or during the following season. This table shows the estimators’ ranking:

ERA Estimator (N =2096) RMSE
SIERA (new) 1.075
SIERA (old) 1.079
kwERA 1.081
bbFIP 1.083
xFIP 1.089
FIP 1.141
FRA (scaled to ERA) 1.171
tRA (scaled to ERA) 1.215
ERA (park-adj) 1.308
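The procedure behind this table can be sketched roughly as follows. This is an illustration, not the code used for the article: the row layout and function name are hypothetical, and it assumes the weight is the innings pitched in the estimator's season, which the text does not spell out.

```python
import math


def weighted_rmse(rows, min_ip=40.0):
    """IP-weighted RMSE of an estimator against next season's ERA.

    rows: iterable of (estimate, next_era, ip_this, ip_next) tuples.
    Pitchers with fewer than min_ip innings in either season are dropped.
    """
    num = 0.0
    den = 0.0
    for estimate, next_era, ip_this, ip_next in rows:
        if ip_this < min_ip or ip_next < min_ip:
            continue  # the 40-inning cutoff applies to both seasons
        weight = ip_this  # assumption: weight by IP in the estimator's season
        num += weight * (estimate - next_era) ** 2
        den += weight
    return math.sqrt(num / den)


# Made-up rows; the third is excluded because ip_this is below 40.
sample = [(3.5, 4.0, 100.0, 100.0), (4.0, 3.0, 50.0, 60.0), (2.0, 9.0, 10.0, 80.0)]
print(weighted_rmse(sample))
```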

With a standard deviation of the square error terms of about 2, and a literal .999 correlation between xFIP square error and SIERA square error, the standard error of the difference between the two would be about .002. That means the differences are statistically significant.
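The arithmetic behind that standard error can be reproduced with the usual formula for a paired difference. The sketch below assumes both estimators' squared errors have the same standard deviation (about 2) and ignores the innings weighting, so it is a rough check rather than the exact calculation.

```python
import math

sd_sq_err = 2.0   # standard deviation of the squared-error terms (from the text)
rho = 0.999       # correlation between xFIP and SIERA squared errors (from the text)
n = 2096          # sample size from the table

# Var(a - b) = sd^2 + sd^2 - 2*rho*sd^2 = 2 * sd^2 * (1 - rho) for equal SDs.
sd_diff = math.sqrt(2.0 * sd_sq_err ** 2 * (1.0 - rho))
se_diff = sd_diff / math.sqrt(n)

# Gap in mean squared error implied by the RMSE table (xFIP 1.089, new SIERA 1.075).
mse_gap = 1.089 ** 2 - 1.075 ** 2

print(round(se_diff, 4), round(mse_gap / se_diff, 1))
```

Because the two sets of squared errors are so highly correlated, most of the noise cancels in the difference, which is why such a small gap in RMSE can still be many standard errors wide.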

I also adjusted the weighting to use the geometric average of IP in the season in question and in the following season (in which the ERA was estimated). The ranking was essentially the same, with xFIP and bbFIP now tied:

ERA Estimator (N =2096) RMSE
SIERA (new) 1.040
SIERA (old) 1.045
kwERA 1.053
xFIP 1.057
bbFIP 1.057
FIP 1.118
FRA (scaled to ERA) 1.150
tRA (scaled to ERA) 1.202
ERA (park-adj) 1.289
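For reference, the geometric-average weighting amounts to weighting each pitcher by the square root of the product of his two innings totals. A minimal sketch, with hypothetical function names and made-up numbers:

```python
import math


def geometric_ip_weight(ip_this, ip_next):
    """Geometric average of innings pitched in the two seasons."""
    return math.sqrt(ip_this * ip_next)


def weighted_rmse(errors, weights):
    """RMSE of the errors, with each squared error scaled by its weight."""
    num = sum(w * e ** 2 for e, w in zip(errors, weights))
    return math.sqrt(num / sum(weights))


# Made-up (ip_this, ip_next) pairs and estimator-minus-actual errors.
ip_pairs = [(200.0, 50.0), (60.0, 60.0)]
errors = [0.5, -1.0]
weights = [geometric_ip_weight(a, b) for a, b in ip_pairs]
print(weights, weighted_rmse(errors, weights))
```

Compared with weighting by one season's innings, this gives less weight to a pitcher who threw a full season one year but barely cleared the cutoff in the other.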

Note that SIERA is not overfit to the ERA it is tested against. In all of these RMSE comparisons against the following season, SIERA’s coefficients come from regressing against same-season ERA. Regressing against next season’s ERA would be cheating, of course, but it would produce a lower RMSE (0.920, in fact). This SIERA version, which is fitted to next season’s ERA, is called SIERA*:

Variable SIERA coefficient SIERA* coefficient
(SO/PA) -15.518 -15.219
(SO/PA)^2 9.146 12.746
(BB/PA) 8.648 -0.385
(BB/PA)^2 27.252 10.671
(netGB/PA) -2.298 -2.844
(netGB/PA)^2 -4.920 -2.232
(SO/PA)*(BB/PA) -4.036 15.421
(SO/PA)*(netGB/PA) 5.155 5.226
(BB/PA)*(netGB/PA) 4.546 10.150
Constant 5.534 5.952
Year coefficients (versus 2010 for SIERA, versus 2009 for SIERA*) From -0.020 to +0.289 From 0.000 to 0.426
% innings as SP 0.367 0.246

 

Year 2002 2003 2004 2005 2006 2007 2008 2009 2010
SIERA Coefficient -.020 +.093 +.154 +.037 +.289 +.226 +.116 +.103 .000
SIERA* Coefficient +.140 +.181 +.106 +.426 +.204 +.133 +.116 .000 n/a
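As a rough illustration of how the coefficient table turns into a number, the sketch below plugs the SIERA column into a plain quadratic formula. It is a literal reading of the table only: the definition of netGB and any special sign handling of individual terms come from the earlier parts of this series and are not restated here, so treat those details as assumptions. The function name and the sample inputs are hypothetical.

```python
# Coefficients copied from the "SIERA coefficient" column of the table.
SIERA_COEF = {
    "so": -15.518, "so2": 9.146,
    "bb": 8.648, "bb2": 27.252,
    "gb": -2.298, "gb2": -4.920,
    "so_bb": -4.036, "so_gb": 5.155, "bb_gb": 4.546,
    "const": 5.534, "sp": 0.367,
}


def siera_literal(so, bb, net_gb, pa, sp_share, year_coef=0.0, c=SIERA_COEF):
    """Literal reading of the table; sign conventions are an assumption."""
    k, w, g = so / pa, bb / pa, net_gb / pa
    return (c["const"] + year_coef + c["sp"] * sp_share
            + c["so"] * k + c["so2"] * k ** 2
            + c["bb"] * w + c["bb2"] * w ** 2
            + c["gb"] * g + c["gb2"] * g ** 2
            + c["so_bb"] * k * w + c["so_gb"] * k * g + c["bb_gb"] * w * g)


# Made-up full-time starter in 2010 (year coefficient .000 in the table).
print(siera_literal(so=160, bb=64, net_gb=40, pa=800, sp_share=1.0))
```

Swapping in the SIERA* column would give the projection-style version, with the year coefficient then measured against 2009 instead of 2010.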

Take SIERA* with a grain of salt, though, since the metric is calibrated to model in-sample data. And because it’s directly minimizing the square error in estimating next season’s statistic by design, this is an ERA predictor (a projection metric), rather than an ERA estimator (which recreates ERA net of luck and defense). If it’s a projection metric, it should be compared to Oliver, ZIPS, PECOTA, Marcel and the rest. If the goal is to take non-peripheral luck and defense out of the equation, then SIERA is more useful than SIERA*.

I also tested the ERA estimators against same-year ERA. Naturally, the statistics that directly factored in a pitcher’s home run rate did the best (except for tRA) — and bbFIP did better than new SIERA. That’s probably because line drives raise a pitcher’s bbFIP, while scoring bias might arbitrarily inflate that pitcher’s line-drive total when he suffers bad luck. (Note that line-drive rate is not persistent year to year when measured per batted ball. The way to give up fewer line drives is to strike out more batters. In other words, a pitcher needs to avoid batted balls in the first place.) Although kwERA did pretty well with next-season ERA testing, it did poorly with same-year estimation.

ERA Estimator (N =3328) RMSE
FIP .740
FRA (scaled to ERA) .779
bbFIP .852
SIERA (new) .870
xFIP .875
tRA (scaled to ERA) .886
SIERA (old) .890
kwERA .913

Because the 2002 SIERA cannot be calculated (Retrosheet batted-ball data is unavailable for that year), it seemed that old SIERA might be unfairly penalized. So I removed the 2002 estimators from the sample and recalculated, using the 2003 through 2009 estimators to predict 2004 through 2010 park-adjusted ERA.

ERA Estimator (N =1838) RMSE
SIERA (new) 1.075
SIERA (old) 1.083
kwERA 1.086
bbFIP 1.088
xFIP 1.092
FIP 1.149
FRA (scaled to ERA) 1.181
tRA (scaled to ERA) 1.217
ERA (park-adj) 1.313

The ranking remained the same, anyway.

As I did last time, I checked whether FIP did better with pitchers who stayed on the same team. To determine that, I tested FIP against un-park-adjusted ERA and compared it with how the other statistics predicted park-adjusted ERA for pitchers who stayed on the same team. The results show an improvement for FIP, but not enough to out-predict xFIP.

ERA Estimator (N =1353) RMSE
SIERA (new) 1.066
SIERA (old) 1.068
xFIP 1.071
kwERA 1.073
bbFIP 1.078
FIP (un-adjusted ERA) 1.114
FRA (scaled to ERA) 1.157
tRA (scaled to ERA) 1.205
ERA (park-adj) 1.273

While Root Mean Square Errors are useful when estimating the average difference between an ERA estimator and an ERA, sometimes it’s easier to simply compare statistics head-to-head.

In the next table, I checked how often each statistic was closer than each of the others to next season’s ERA. The new version of SIERA won its head-to-head matchups against every other estimator. Both bbFIP and kwERA beat xFIP, and kwERA even beat the old SIERA; xFIP was close behind and beat all the remaining estimators. All the differences are statistically significant, unless they’re italicized.

% of times row closer (N = 2096, so St. Dev. = 1.1%)
              SIERA (new)  SIERA (old)  xFIP   FIP    bbFIP  kwERA  FRA (adj)  tRA (adj)  ERA (adj)
SIERA (new)   -            50.4%        52.7%  55.0%  52.9%  53.0%  56.2%      56.4%      60.6%
SIERA (old)   -            -            53.4%  54.6%  52.1%  48.1%  55.0%      58.0%      59.9%
xFIP          -            -            -      53.2%  49.8%  48.6%  54.5%      57.3%      58.7%
FIP           -            -            -      -      46.4%  46.3%  51.6%      55.7%      59.5%
bbFIP         -            -            -      -      -      49.4%  53.9%      58.0%      60.1%
kwERA         -            -            -      -      -      -      54.4%      57.9%      59.4%
FRA (adj)     -            -            -      -      -      -      -          53.6%      58.2%
tRA (adj)     -            -            -      -      -      -      -          -          53.6%
ERA (adj)     -            -            -      -      -      -      -          -          -
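Each cell in the head-to-head table is a simple share, and the 1.1% standard deviation in its header is what a coin flip would produce over 2,096 pitcher seasons. A minimal sketch of both calculations follows; the function names are hypothetical, the sample numbers are made up, and ties are counted as "not closer", which is an assumption.

```python
import math


def pct_row_closer(row_est, col_est, actual):
    """Share of pitchers for whom the row estimator is closer to actual ERA."""
    closer = sum(abs(r - a) < abs(c - a)
                 for r, c, a in zip(row_est, col_est, actual))
    return closer / len(actual)


def coin_flip_sd(n):
    """Standard deviation of a 50/50 proportion over n trials: sqrt(.25 / n)."""
    return 0.5 / math.sqrt(n)


# Made-up estimates and next-season ERAs for three pitchers.
print(pct_row_closer([3.0, 4.0, 5.0], [3.5, 3.9, 4.0], [3.1, 3.8, 4.2]))
print(round(100 * coin_flip_sd(2096), 1))  # percentage points
```

A cell that sits roughly two of those standard deviations away from 50% is outside what chance alone would usually produce.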

What I found particularly useful about SIERA is that it doesn’t regress a pitcher’s performance as far back toward the mean as other estimators. The standard deviation for SIERA is about .75, while the standard deviation of xFIP is .68. Even though peripheral performance regresses to the mean — which makes an estimator that regresses ERA to the mean better, according to RMSE — SIERA still holds its own against xFIP.

Basically, SIERA does what ERA estimators are supposed to do. It estimates what a pitcher’s ERA would have been, net of defense and sequencing/BABIP/HR luck, while allowing the effects of his performance on peripherals to shine through.

From many angles, it seems that SIERA is the leading ERA estimator available. Still, its advantage over xFIP is small. Along with bbFIP and kwERA, these statistics are in a tightly packed league of their own, and all of them are useful when predicting next season’s ERA and isolating pitcher performance. Looking at peripherals alone can be useful, but finding a meaningful way to incorporate all of these statistics is valuable. Because of that, each of these statistics has its strengths.

What is particularly interesting, given their very high correlation, is how often SIERA and xFIP still differ. The frequent differences between these two estimators should make it clear why I use both when evaluating a pitching matchup.

So what ideas weren’t incorporated into the new SIERA? Tomorrow, you’ll find out.






Matt writes for FanGraphs and The Hardball Times, and models arbitration salaries for MLB Trade Rumors. Follow him on Twitter @Matt_Swa.


Andrew
Guest
4 years 10 months ago

I just wanted to say I have really enjoyed reading this whole series.

Yirmiyahu
Member
4 years 10 months ago

This is probably the most well-researched, thoroughly-defended piece I’ve ever read on fangraphs.

Sitting Curveball
Member
4 years 10 months ago

This series has been excellent. We really appreciate the obvious time and energy devoted to a more accurate statistic.

I know this was a minor point, but I was a little surprised that there isn’t year-to-year correlation between a pitcher’s line drive rates. Is that because pitchers who aren’t decently good at preventing line drives are not in the majors for long?

Yirmiyahu
Member
4 years 10 months ago

It seems more like preventing linedrives isn’t a skill at all. From a pitcher’s perspective, LD% is a function of luck and scoring bias. A pitcher can control GB/FB, but linedrives are just batted balls that are halfway between the two.

williams .482
Member
4 years 10 months ago

I would think that inducing weaker contact meant allowing fewer line drives.

Yirmiyahu
Member
4 years 10 months ago

Bravo.

razor
Guest
4 years 10 months ago

Outstanding series. One of the very best I’ve seen on this site. Very nice work.

Matt Hunter
Member
4 years 10 months ago

Maths.

Seriously, though, I ended up skimming through most of this because it seemed mostly over my head, but the general gist I got was this: SIERA is the best ERA estimator out there, but the differences between it and xFIP, as well as other estimators, are very small. In other words, SIERA is cool because it more accurately portrays the actual skill level of the pitcher and gives us more insight into what a pitcher has and doesn’t have control over, but is not really necessary as a replacement for xFIP. Did I get that right?

Dave Cameron
Admin
Member
4 years 10 months ago

Yeah, that’s basically right. Using SIERA or xFIP is basically a matter of preference, as 99% of the time, they’re going to give the same answer. When they differ, the margin is still very small. If you prefer one to the other, that’s great, and we’re happy to offer both here on FanGraphs.

Briks42
Guest
4 years 10 months ago

Any plans for Fangraphs to use SIERA or xFIP instead of FIP when determining WAR at some point?

Temo
Member
4 years 10 months ago

@Briks

Philosophically, that would be tricky, since FIP is *supposed* to be descriptive, while xFIP and SIERA are *supposed* to be predictive.

saberbythebay
Guest
4 years 10 months ago

Any thought to developing a new stat describing a pitcher’s average overall contribution given luck- and defense/park/etc. neutrality? xWAR, if you will?

Michael
Guest
4 years 10 months ago

Love the series and I do believe that the new SIERA is better than the old SIERA, however if we’re using statistical rigour as our qualification, your tables above seem to suggest the difference between the old and new SIERA is not statistically significant (I’m guessing you used a 95% CI).

I agree with the changes in methodology and truly believe they are improvements on the old SIERA. And maybe it is that I’m not understanding the testing properly, but I do not see the statistical proof that it is significantly better. So it is possibly random chance over the specific sample in question that the new is better than the old. Just a thought…

Sitting Curveball
Member
4 years 10 months ago

What are kwERA and bbFIP trying to measure exactly? Are they closer to ERA and FIP, respectively, or are they closer to SIERA? Since they don’t have coefficients (just a constant term), it seems like they are simpler. I can calculate kwERA in my head at the ballpark (if I knew the constant), but what is that stat trying to tell me?

tangotiger
Guest
4 years 10 months ago

kwERA expresses a pitcher’s talent at BB (and HBP), SO along the ERA scale. That is, it ignores any HR talent, any batted ball talent, and any steals/pickoff talent.

As you can see, it does fantastically well.

bbFIP is kwERA plus the batted ball frequency (LD, Pops, GB, FB), and expresses it along the ERA scale. So, it ignores any HR talent, any talent at getting more outs on GB than another pitcher would get on GB, and any steals/pickoff talent.

As you can see, it is WORSE than kwERA (even though it contains more information).

What that means is that the frequency of batted balls and/or the weights it uses for each batted ball type, is not useful to determine a pitcher’s talent level.

It IS useful if you are trying to explain current-season data. But, if that’s the goal, then FIP wins that one much better. Though wOBA for pitchers would be even better than FIP.

Zach
Guest
4 years 10 months ago

When you tested SIERA using 2003-2010 data, did you test it as you would have calculated it if you had created SIERA in 2003? If not, then wouldn’t the results be biased because you are testing a regression against the data that was used to construct it?

tom s.
Guest
4 years 10 months ago

Well put-together. I will have to start using SIERA more.

One thought on your closing lines: is there any meaningful trend on the pitchers where xFIP and SIERA produce significant disparities in predictions? Are these disparities random or consistent?

Brent
Guest
4 years 10 months ago

I guess I don’t quite see the point of these “horse races.” They show that for predicting the next season’s ERA, SIERA has slightly lower RMSE than the other statistics. But wouldn’t you do better yet by combining several statistics using a multiple regression? For most practical problems, we shouldn’t be looking to limit ourselves to a single statistic – I think what we really should be looking for is how much weight to give to SIERA, how much to FIP, how much to simple ERA, etc. I don’t want to ignore the information that’s incorporated in FIP or ERA and that’s ignored by SIERA, I just want to give it an appropriate weight.

Furthermore, the weights should differ depending on the time frame used. That is, if I’m forecasting based on data for a relatively short period (say, half a season), I’d probably give almost no weight to ERA, but if a pitcher’s ERA is consistently better than the defense-independent metrics for several years, then I’d figure there are skills being captured by ERA (holding and picking off base runners, pitcher fielding, skill at inducing infield flies) that are missing from metrics like SIERA. So I’d guess the coefficients of FIP and ERA would tend to rise when forecasting using data for longer time spans.

Doug D.
Guest
4 years 8 months ago

One test I would like to see is to take each pitcher’s variance (SIERA – ERA) and correlate between seasons. Should come out close to zero, I would hope.
