## New SIERA, Part Five (of Five): What Didn’t Work

SIERA is now available at FanGraphs, and several important changes have been made to improve its performance, as discussed in parts one, two, three and four. Still, there were some components that I didn’t adopt. For the sake of completeness — and for those interested in what I learned about pitching along the way — I include these in the final installment of my five-part series.

**1. Park-adjusted batted ball data**

FanGraphs’ Dave Appelman developed park factors for batted balls for me, but these weren’t used in the new SIERA. I figured that since park-specific scorer bias might make batted-ball data less accurate, removing that noise might help. Park effects come with noise of their own, though, and park-adjusting the batted-ball data didn’t improve the predictions. The gain simply wasn’t large enough: The standard deviations of the difference between actual and park-adjusted fly-ball and ground-ball rates were only 1.4% and 1.5%, respectively, and the park factors themselves were always between 95.2 and 103.9 (the average park factor is set to 100).
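The adjustment described above is simple to sketch. This is a hypothetical illustration, not FanGraphs’ actual code; the convention assumed here is the usual one, where a park factor of 100 is neutral, so an observed rate is divided by PF/100:

```python
def park_adjust(rate, park_factor):
    """Scale an observed batted-ball rate by a park factor (100 = neutral)."""
    return rate / (park_factor / 100.0)

# e.g. a 45% ground-ball rate in a park that inflates grounders by 3%:
adjusted = park_adjust(0.45, 103.0)
print(round(adjusted, 4))  # 0.4369
```

With park factors confined to the narrow 95.2–103.9 band mentioned above, the adjusted rates barely move, which is why the gains were so small.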

The pitcher with the biggest SIERA difference under park-adjusted batted-ball totals was LaTroy Hawkins, who in 2007 went from 3.63 to 3.73. The second-biggest difference came when Aaron Cook went from 4.00 to 4.08 in 2008. Brian Lawrence went from 3.55 to 3.63 in 2002, and Cook made the list again in 2007, when he went from 4.63 to 4.71. Only 16 other pitchers had SIERA changes of more than .05 in either direction.

As a result, very few people saw major changes to their SIERAs, and the combined effect was a weaker prediction.

**2. Year-specific ground ball coefficients**

I wanted to know if yearly changes to batted-ball scoring required adjustments, so I allowed the coefficients on ground-ball rate and ground-ball-rate squared to vary year to year. The net effect was that a lack of data dominated the calculations, and no real improvement came from this adjustment. I abandoned my plan when I saw the standard errors on these coefficients.

**3. Adding hit-by-pitches and subtracting intentional walks**

I tried changing BB to UBB + HBP. Due to the volatility of HBP, though, this didn’t improve prediction either, and the changes were very small, so I left it as BB for simplicity’s sake. Since my research showed that pitchers with low walk rates tend to allow less-damaging walks, I think of intentional walks as corner solutions on a spectrum of intent. Put more simply, how many of Greg Maddux’s unintentional walks had an element of intent?

**4. Including pitchers with fewer than 40 innings**

If you use all pitchers with 162 IP or more and regress next season’s BABIP against strikeout rate, walk rate, ground-ball rate and ground-ball rate squared, you’ll get an R^2 of .0805; regressing it against the previous year’s BABIP yields an R^2 of just .0184. Home runs per fly ball can likewise be predicted better using the same peripherals (an R^2 of .0575) than with the previous year’s HR/FB rate (an R^2 of .0386). If you lower the innings restriction, though, the R^2 statistics shrink. Some of this is sample size — but some of it is because pitchers who don’t throw many innings often don’t deserve to be regressed to typical major-league levels of BABIP and HR/FB. Since predicting BABIP and HR/FB is one of SIERA’s major strengths, it’s important not to include pitchers who muddy the analysis.

Because of that, I was hesitant to include pitchers with fewer than 40 innings (the cut-off that Eric Seidman and I used when creating the original SIERA). I was curious whether keeping pitchers with fewer than 40 innings this time (but weighting them less) would help. But because pitchers with few innings can have gigantic ERAs (Sandy Rosario had a 54.00 ERA in one inning last year), the regression was pulled toward matching ERAs like Rosario’s at any cost to measuring everyone else’s skill. The formula made no sense.
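The R^2 comparison in section 4 can be sketched in code. The data below is synthetic (random numbers standing in for real pitcher seasons), so only the procedure mirrors the text, not the numbers:

```python
# Sketch of the comparison above: regress next season's BABIP on this
# season's peripherals and compute R^2. Synthetic data; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
k = rng.uniform(0.10, 0.30, n)    # strikeout rate (SO/PA)
bb = rng.uniform(0.05, 0.12, n)   # walk rate (BB/PA)
gb = rng.uniform(0.30, 0.60, n)   # ground-ball rate
babip_next = 0.30 - 0.1 * k + rng.normal(0, 0.015, n)  # "skill" plus noise

def r_squared(X, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_peripherals = r_squared(np.column_stack([k, bb, gb, gb**2]), babip_next)
print(round(r2_peripherals, 3))
```

In the article's actual test, this regression on peripherals beat the naive regression on the previous year's BABIP (.0805 vs. .0184).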

**5. SIRA, Skill-Interactive Run Average**

Finally, I attempted to develop an alternative to Skill-Interactive Earned Run Average (SIERA), which I called Skill-Interactive Run Average — or, SIRA. Since the goal of pitching is to prevent runs — not merely earned runs — I figured that scaling SIERA to “run average” might make for a more accurate estimator. As it turned out, there was basically no benefit to doing this: The coefficients were similar between the two equations, and the ordering of pitchers was similar as well.

The following table shows the coefficients for SIERA and SIRA. Note the similarities:

| Variable | SIERA coefficient | SIRA coefficient |
|---|---|---|
| (SO/PA) | -15.518 | -17.172 |
| (SO/PA)^2 | 9.146 | 10.332 |
| (BB/PA) | 8.648 | 8.413 |
| (BB/PA)^2 | 27.252 | 27.800 |
| (netGB/PA) | -2.298 | -2.555 |
| (netGB/PA)^2 | -4.920 | -5.049 |
| (SO/PA)\*(BB/PA) | -4.036 | 8.793 |
| (SO/PA)\*(netGB/PA) | 5.155 | 6.808 |
| (BB/PA)\*(netGB/PA) | 4.546 | -0.910 |
| Constant | 5.534 | 6.078 |
| Year coefficients (versus 2010) | -0.020 to +0.289 | -0.003 to +0.320 |
| % innings as SP | 0.367 | 0.363 |

| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
|---|---|---|---|---|---|---|---|---|---|
| SIERA coefficient | -.020 | +.093 | +.154 | +.037 | +.289 | +.226 | +.116 | +.103 | .000 |
| SIRA coefficient | -.003 | +.104 | +.172 | +.031 | +.320 | +.234 | +.118 | +.092 | .000 |

With such similar coefficients, it’s not surprising that the SIERA and SIRA pitcher rankings were almost identical. I looked at the top 25 in SIERA and SIRA from 2002 to 2010. The lists overlapped in at least 24 of 25 pitchers each year — and in some seasons, they were the same 25.

The closest thing to a major difference was in 2006, when Johan Santana led the league in SIRA over Brandon Webb, with a 3.42 SIRA to Webb’s 3.43. But because ground-ball pitchers are slightly more likely to give up unearned runs, Webb’s SIERA was 3.02, compared with Santana’s 3.16.

Overall, no starting pitcher had a ranking that changed more than 12 spots out of about 90 when flipping between SIERA and SIRA. In 2004, Brandon Webb was 40th in SIERA but only 52nd in SIRA; Ted Lilly was 28th in SIRA but 38th in SIERA. Otherwise, everyone changed fewer than 10 spots. With so few changes, I opted for familiarity and decided to publish SIERA over SIRA.

These five tweaks were abandoned, but other changes — such as the yearly run-environment adjustments and the starter/reliever ERA difference — stuck and were important. And that, essentially, is what matters.

In essence, SIERA was a strong statistic after its original development. But with a new run environment, it needed to be refreshed. Now, with its extensive changes here at FanGraphs, the metric should stand the test of time.


Awesome

Can anyone with a degree in statistics chime in and comment on the validity of a stat that is shaped by so much observational data? Is it a mathematically sound way to approach this type of problem? It seems like SIERA is a sand castle on the beach (not a great analogy, but bear with me) and the stats just wash over it and shape it into whatever gives the best prediction for our current data sets. I admit that I have no idea. On the one hand, it seems like applying all of these data sets and factors should iteratively improve the tool, but on the other hand I feel like there is something fundamentally wrong about this approach.

I had to take a lot of statistics and econometrics courses in my Econ PhD program, so I’ll answer the question, but anyone else can also chime in.

The main thing to take away here is that the regression does NOT regress against NEXT year’s ERA. It regresses against THIS year’s ERA and tests against next year’s ERA. That’s just because next year’s ERA is a good proxy for this year’s skill level. I also have tested it against past year ERA too, which is also a good proxy. The tests I did yesterday were on out-of-sample data to check this.

One thing to know is that the original version of SIERA was done in Feb. 2010, and the 2009 SIERAs did well on the 2010 ERA testing. Similarly, Eric and I originally tested it by running it on the 2003-08 data and then seeing how it did at the end of 2009. So it’s been tested out of sample even in that sense too.

The formula involves a lot of tweaks but the essentials are that it does correlate with another very good skill metric, xFIP, with a .94 correlation. The difference is primarily extra juice added to Ks and a little juice taken away from ground balls.

Along the lines of Telo’s question, is this a stat that would scale well to baseball in other eras? For example, over the next five years baseball teams make a huge emphasis on defense, particularly infield, and ground balls become more valuable. Will SIERA self-adjust or will it need more tweaking?

Well, it depends on how batted ball data is measured. If the batted ball data is measured the same, then it will probably only need yearly adjustments. What I’ll probably do is re-run the regression after each season and let the coefficients go up or down by a small bit, but they will barely move if batted ball data is measured the same. I guess hypothetically there could be changes, but the procedure should be the same, and with sufficient data, it should work fine to just re-run the same code.

I don’t think it will make a big difference. As you say, next year’s (or last year’s) ERA is a good (but not great) proxy for this year’s skill level. Skill levels do change, year-to-year. They can even change within a year, say 1st half vs. 2nd half or home vs. road. But they are unlikely to change start-to-start, at least not in any systematic way. So you could try this: select all pitchers that had at least 20 starts in a season and split their starts by count, odd vs. even. Use the odd starts to get the coefficients, then test the results with the even starts.
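The odd/even split test proposed above can be sketched like so. Everything here is synthetic stand-in data (the variable names are assumptions, not a real dataset); the point is the procedure: fit on the odd-numbered starts, evaluate on the held-out even-numbered ones.

```python
# Sketch of the proposed odd/even start split for one pitcher with 20 starts.
import numpy as np

rng = np.random.default_rng(1)
starts = np.arange(1, 21)                  # start numbers 1..20
era_per_start = rng.normal(4.0, 1.5, 20)   # stand-in per-start results
k_rate = rng.normal(0.20, 0.03, 20)        # stand-in peripheral (SO/PA)

odd = starts % 2 == 1
even = starts % 2 == 0

# Fit a one-variable model on the odd starts...
slope, intercept = np.polyfit(k_rate[odd], era_per_start[odd], 1)

# ...then test it on the even starts it never saw.
pred = slope * k_rate[even] + intercept
rmse = np.sqrt(np.mean((era_per_start[even] - pred) ** 2))
print(round(rmse, 2))
```

Because a pitcher's skill is unlikely to change systematically start-to-start, a model that holds up on the held-out starts is evidence the fit captured skill rather than noise.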

So you think the entire field of regression modeling is fundamentally wrong?

Yea, I don’t think I worded my post exactly the way I intended. I am more just dubious of the implementation here, and the massive amount of moving parts.

Mathematically, if you run a regression on a set of data, observe which variables have significant coefficients, change the model, and run a new regression on the same data, the test statistics and p-values cannot be trusted. However, it is perfectly fine to develop a model in this manner and then apply it to fresh data. A common approach is to develop a model using prior-year data, or using a subset of the data you’re interested in, then apply the finished model to the full data set.

I cannot follow Matt’s methodology or his answer to your question well enough to determine what exactly he has done here, i.e., did he tweak the model using the very data we are currently interested in? Most sabermetric regression analysis I’ve seen commits that error.

Huge thanks for the whole series. Very fun articles to read.

The reasons for not publishing SIRA sound an awful lot like reasons to publish SIRA framed negatively.

But ERA is more approachable (granted, the hypothetical approacher already has come most of the way if he’s reading SIERA at Fangraphs). More practically, many of the Rotographs fanatics hereabouts are playing in leagues where ERA is a valued stat and so predicting it is valuable.

However, RA is more important when predicting team results, which is what the front offices, Las Vegas, and (especially) us fans ultimately care about. So while it may not matter if the relative ordering of pitchers doesn’t change between SIRA and SIERA, a change in the absolute number of runs allowed can matter a great deal. I just don’t see why it’s an either/or situation. Why not publish both?

But “park-specific scorer bias” *is* a park effect, just like the walls or the grass (and just like the walls and the grass, it can be subject to change from season to season). However, if it doesn’t make the calculation better, it doesn’t matter. I do wonder if this just reflects too-coarse park effects, though. (Was it just a single blanket number for each park, or broken up by batted ball and K rates as THT did a couple of years ago?)

Dave Appelman was very thorough. He gave me park effects for runs, ground balls, fly balls, and many others. Gobs of them — really very appreciated on my part. I used the R, GB, and FB ones, though. The LD park effects are much larger (though not terribly worse than most data you see in fields that aren’t as well documented as baseball), which is one of the reasons it’s good to leave LD out of the calculations directly.

I, too, am no statistician, and I didn’t understand a lot of what was said, but that doesn’t stop me from expressing an ignorant opinion.

(1) All this tweaking makes me very suspicious. Aren’t you basically improving a prediction of the past by doing this?

(2) It seems to me that you need to leave the stat alone for a while to see if it really does predict without changing it all the time.

(3) Since your version of SIERA is very complex and has a .94 correlation with xFIP, why is it needed?

The key here is that the tests of SIERA are on out-of-sample data. SIERA still has a smaller sample size than I’d like, so I’ll always tweak it. When I first published with Eric Seidman at BP, we were told “see how it does in 2010 before I believe it!” It did great. And incorporating 2010 data only will help going forward. The numbers won’t change much; it might not even be worth re-programming them in, but doing so will just add information over time. The constant in FIP and xFIP changes every year, too. There’s no harm in more information if it doesn’t make you throw out old information.

For the 3rd point you make, I would say that the non-correlated parts are very interesting– almost all of Part Three was about that. There is value in a metric that likes high-strikeout pitchers much more because it implicitly predicts BABIP.

We know that batted-ball events (BABIP, HR/FB, even LOB%) all have certain points at which they “stabilize” (r=.5 @ 3700 for BIP, etc.). Matt, at what point do you think we should no longer worry about the typically reliable guidelines that SIERA makes about batted balls and run prevention, and accept the pitcher’s BABIP or HR/FB as “their own,” regardless of what “should have happened”? In other words, at what point would you feel more comfortable using a system like Smith’s WAR instead of a SIERA-based WAR? I love SIERA season-to-season, but am still skeptical of using SIERA or FIP, etc., for long-career pitchers.

Also, in the third part of this very informative series, I talked about the baffling complexities of Tom Glavine and how he beats the assumptions laid out by SIERA. Are there any other pitchers you noticed whose batted-ball results can’t be explained by SIERA, or any commonalities that those pitchers share?

Thanks a lot…very interesting stuff!

Thanks, Matthew. Ks, BBs, GBs and FBs stabilize between 100-300 PA for pitchers, so I’d ballpark the r=.5 point at about 200 PA, or maybe 40 IP or so.

In reality, the best time to use SIERA over ERA (or really RA) is about two to three years into a pitcher’s time as a starter. You cite Tom Glavine, one of the most admirable DIPS-beaters of all time. Clearly using career FIP, xFIP, or SIERA for him is silly. Use his RA, and do a Sean Smith-ian adjustment for defense and park. Others are guys like Matt Cain, whom SIERA likes, but not as much as it should; he’s one of the few who consistently beats it. I’m not sure I buy into Jurrjens yet, though I really need to see more data. The guys who beat their SIERAs three years in a row (and by enough that you KNOW it’s not just defense/park effects) are the ones to start suspecting.

Great question(s). Thanks.

Thank you for the quick response!

“We all know [concerning stabilizing stats]”? We have Pizza Cutter’s article, which appeared to commit the fallacy of regress, tweak, regress using the same data. I haven’t seen any independent confirmation of PC’s numbers. Absent that, I think we should use phrases like “PC’s work suggests…” rather than “we know.”

Great stuff, Matt.

SIERA should also be a more marketable acronym than xFIP. No lower-case letters and an acronym based off of the ERA stat so many people are comfortable with. xFIP sounds scary. SIERA? That can be pitched as just “a new and improved ERA” (even though xFIP and SIERA are pretty similar in performance and share the same basic goal).

I’d love to see SIRA get published, if just for team stats at the very least.

Also, is there any chance we see a SIERA- in the near future to go along with ERA-, FIP- and xFIP-?

Yeah, this thing kind of reminds me about what one of my Civil Engineering professors said about an equation governing concrete design, “So we put all of these factors together and we get an equation that has no meaning”.

That being said, I like the work.

Sadly, I’m STILL waiting on something that can accurately model Tom Glavine’s performance from 2002 to 2008. He had a 3.869 ERA in 1293.33 IP, but the metrics have a far different view (4.59 xFIP, 4.94 tERA, 4.90 SIERA).

Glavine’s great BABIP days were slowing down, but he was still average to above average in all of those years but 2005. What kills Glavine in SIERA is that with a high LD rate, an above-average GB rate, and a poor K rate, he “should” have had a horrific BABIP. But his BABIP on ground balls for his career is WAY better than average.

His Braves/Mets UZR over those seasons was about +20…only about 2-3 runs saved for Glavine, so that isn’t the issue.

He was a GB pitcher most of those seasons and racked up a lot of DPs. Despite declining some in this area (he was Matt Cain before Matt Cain for 15 years), his HR/FB was still better than league average until the last year or two. His much-discussed situational splits were still a big part of his game, and despite being mediocre at it over the first half of his career, he got really, really good at controlling the running game.

And of course Shea helped some, as it was a pitchers’ park. People forget that early in his career, Fulton added about .003 BABIP per season.

I think it was Bill James who said that a pitcher can have a great career without a good K rate… but he has to do everything else well. That man is Tom Glavine.

Uh oh:

http://www.baseballprospectus.com/article.php?articleid=14603

Nerd fight!!!

Hi Matt,

Forgive me if I missed it completely, but I am not sure where the formula for SIERA is. I know you said that it was in the table in part 2 of the series, but to me, the table is a bunch of numbers that I cannot understand. It may just be my fault, but could you please write out the full formula, or direct me to where it is? Thanks a lot.

Sure, you take the column labeled “fgSIERA Coefficient” and multiply each value by the variable in the left-most column. So it would start off:

SIERA = -15.518*(SO/PA) + 9.146*((SO/PA)^2) + … etc.
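Filling out that formula with the rest of the SIERA coefficients from the table in this post gives the sketch below. Note this is a partial illustration: the year and %-innings-as-SP terms from the table are omitted for brevity, so it is not the complete published formula.

```python
# Partial SIERA sketch using the coefficients from the article's table
# (year adjustment and starter/reliever term omitted).
def siera_core(so_pa, bb_pa, netgb_pa):
    return (5.534
            - 15.518 * so_pa + 9.146 * so_pa**2
            + 8.648 * bb_pa + 27.252 * bb_pa**2
            - 2.298 * netgb_pa - 4.920 * netgb_pa**2
            - 4.036 * so_pa * bb_pa
            + 5.155 * so_pa * netgb_pa
            + 4.546 * bb_pa * netgb_pa)

# e.g. a pitcher with a 25% K rate, 7% BB rate, and +10% net grounders:
print(round(siera_core(0.25, 0.07, 0.10), 2))  # 2.78
```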

Oh, I see. Thanks for the help and the quick response (two minutes after I asked the question!) By the way, I think SIERA is a great stat, and I’m going to use it a lot in the future when I am comparing pitchers in real life or in fantasy.

I loved this whole series (and an extra thanks for explaining the formula). My question is, Why not incorporate into the definition of SIERA the process by which you would make tweaks each year? Following on your response to GiantHusker, couldn’t you define the statistic to take all the accumulated data, or the data from the last 15 or so applicable years, as inputs to recalculate the coefficients (as you would), thereby taking out the last relatively subjective element? (That being your deciding or not deciding to make tweaks and deciding when to account for possible long-term, broad-scale changes in baseball.)

I don’t understand all the statistics here, and this whole question might be nonsense, but it seems to me that you expect to be doing a consistent and somewhat algorithmic process when you decide to make tweaks, so they could theoretically be incorporated into the SIERA stat itself.

-A whiny math major