New SIERA, Part Three (of Five): Differences Between xFIPs and SIERAs
Who’s up and who’s down? Which pitcher will improve upon a stellar season, and who is headed for the trash heap? SIERA and xFIP attempt to answer these questions from year to year, but they’re not totally interchangeable metrics. Why? The biggest difference is the way each uses strikeout rates.
That’s not to say that the two statistics don’t generally say the same thing. In fact, they’re much more reliable – and calculated much differently – than traditional ERA. For a quick example, take a look at the top 10 pitchers in all three metrics during the past four seasons.
2010
| Rk | Pitcher | SIERA | Pitcher | xFIP | Pitcher | ERA |
| 1 | Roy Halladay | 2.90 | Roy Halladay | 2.80 | Clay Buchholz | 2.25 |
| 2 | Francisco Liriano | 3.02 | Francisco Liriano | 2.95 | Josh Johnson | 2.28 |
| 3 | Cliff Lee | 3.10 | Adam Wainwright | 3.02 | Felix Hernandez | 2.37 |
| 4 | Josh Johnson | 3.10 | Josh Johnson | 3.02 | Roy Halladay | 2.41 |
| 5 | Adam Wainwright | 3.13 | Cliff Lee | 3.06 | Adam Wainwright | 2.49 |
| 6 | Jered Weaver | 3.15 | Tim Lincecum | 3.09 | Ubaldo Jimenez | 2.65 |
| 7 | Mat Latos | 3.16 | Felix Hernandez | 3.14 | Jaime Garcia | 2.78 |
| 8 | Felix Hernandez | 3.20 | Jon Lester | 3.18 | Roy Oswalt | 2.79 |
| 9 | Tim Lincecum | 3.21 | Mat Latos | 3.21 | David Price | 2.79 |
| 10 | Jon Lester | 3.25 | Cole Hamels | 3.28 | Tim Hudson | 2.88 |
2009
| Rk | Pitcher | SIERA | Pitcher | xFIP | Pitcher | ERA |
| 1 | Javier Vazquez | 2.86 | Javier Vazquez | 2.77 | Zack Greinke | 2.12 |
| 2 | Tim Lincecum | 2.92 | Tim Lincecum | 2.83 | Chris Carpenter | 2.31 |
| 3 | Zack Greinke | 3.04 | Roy Halladay | 3.00 | Tim Lincecum | 2.47 |
| 4 | Justin Verlander | 3.06 | Dan Haren | 3.02 | Felix Hernandez | 2.59 |
| 5 | Dan Haren | 3.07 | Jon Lester | 3.09 | Jair Jurrjens | 2.64 |
| 6 | Roy Halladay | 3.12 | Zack Greinke | 3.09 | Adam Wainwright | 2.70 |
| 7 | Jon Lester | 3.15 | Justin Verlander | 3.20 | Roy Halladay | 2.80 |
| 8 | Ricky Nolasco | 3.23 | Ricky Nolasco | 3.23 | Clayton Kershaw | 2.85 |
| 9 | Josh Beckett | 3.40 | Josh Beckett | 3.30 | Matt Cain | 2.89 |
| 10 | Chris Carpenter | 3.43 | Adam Wainwright | 3.32 | J.A. Happ | 2.89 |
2008
| Rk | Pitcher | SIERA | Pitcher | xFIP | Pitcher | ERA |
| 1 | Roy Halladay | 3.13 | CC Sabathia | 3.06 | Cliff Lee | 2.58 |
| 2 | CC Sabathia | 3.16 | Roy Halladay | 3.11 | Johan Santana | 2.60 |
| 3 | Tim Lincecum | 3.18 | Tim Lincecum | 3.13 | Tim Lincecum | 2.61 |
| 4 | Dan Haren | 3.21 | Dan Haren | 3.16 | CC Sabathia | 2.74 |
| 5 | Josh Beckett | 3.21 | Josh Beckett | 3.19 | Roy Halladay | 2.79 |
| 6 | Brandon Webb | 3.23 | Derek Lowe | 3.32 | Daisuke Matsuzaka | 2.80 |
| 7 | Ervin Santana | 3.32 | Brandon Webb | 3.33 | Ryan Dempster | 2.82 |
| 8 | Derek Lowe | 3.37 | Mike Mussina | 3.43 | Cole Hamels | 3.05 |
| 9 | Randy Johnson | 3.55 | Ervin Santana | 3.48 | Jon Lester | 3.10 |
| 10 | Mike Mussina | 3.55 | A.J. Burnett | 3.51 | Jake Peavy | 3.11 |
2007
| Rk | Pitcher | SIERA | Pitcher | xFIP | Pitcher | ERA |
| 1 | Erik Bedard | 2.95 | Erik Bedard | 2.90 | Jake Peavy | 2.54 |
| 2 | Johan Santana | 3.20 | Felix Hernandez | 3.27 | Brandon Webb | 3.01 |
| 3 | Josh Beckett | 3.32 | Johan Santana | 3.30 | John Lackey | 3.01 |
| 4 | Jake Peavy | 3.33 | Brandon Webb | 3.30 | Brad Penny | 3.03 |
| 5 | Felix Hernandez | 3.35 | Josh Beckett | 3.31 | Fausto Carmona | 3.06 |
| 6 | Brandon Webb | 3.43 | John Smoltz | 3.33 | Dan Haren | 3.07 |
| 7 | Derek Lowe | 3.45 | Jake Peavy | 3.34 | John Smoltz | 3.11 |
| 8 | Javier Vazquez | 3.45 | Cole Hamels | 3.40 | Chris Young | 3.12 |
| 9 | John Smoltz | 3.46 | CC Sabathia | 3.40 | Erik Bedard | 3.16 |
| 10 | Cole Hamels | 3.46 | Derek Lowe | 3.40 | Roy Oswalt | 3.18 |
Pitchers atop the leaderboards both for SIERA and for xFIP usually stick there. It’s not a coincidence. The best pitchers usually have the best stuff. On the other hand, the ERA league leaders are almost always a mix of luck and good performances, meaning it’s difficult for pitchers to replicate those successes from year to year.
Three pitchers appear three times on the SIERA and xFIP leaderboards from 2007 to 2010 (Roy Halladay, Josh Beckett and Tim Lincecum). Out of that trio, only Halladay appears three times among the ERA leaders. In the past four seasons, there have been 13 pitchers with multiple years in the top 10 in SIERA and 14 pitchers with multiple years in the top 10 in xFIP. But only six pitchers have managed to crack the top 10 in ERA more than once in the past four years.
From this, it’s obvious that xFIP and SIERA generally agree — with a few exceptions. Let’s have a look at those pitchers so it’s clearer when — and why — the two metrics differ.
Among the 88, 78 and 92 pitchers with enough innings to qualify for the ERA title in 2008, 2009 and 2010, respectively, several had SIERA and xFIP rankings at least 10 spots apart. Here’s the 2010 list.
2010:
| Pitcher | SIERA | xFIP | SIERA Rank (of 92) | xFIP Rank (of 92) |
| Clay Buchholz | 4.29 | 4.07 | 67 | 56 |
| Jaime Garcia | 3.81 | 3.62 | 37 | 22 |
| Tommy Hanson | 3.74 | 3.87 | 30 | 42 |
| Colby Lewis | 3.58 | 3.74 | 19 | 35 |
| Ian Kennedy | 3.99 | 4.10 | 47 | 59 |
| Shaun Marcum | 3.62 | 3.71 | 20 | 32 |
| Jeremy Guthrie | 4.51 | 4.60 | 75 | 86 |
| Phil Hughes | 4.05 | 4.13 | 48 | 60 |
| Brian Matusz | 4.21 | 4.31 | 59 | 73 |
Pitchers with good, but not great, groundball rates fare much better with xFIP than they do with SIERA. That’s why SIERA was more bearish on Clay Buchholz than xFIP. Because of that — and also because of Buchholz’s modest strikeout rate — SIERA assumed the Red Sox pitcher was hittable. SIERA also wasn’t as excited about Jaime Garcia as xFIP was after 2010, but Garcia improved his walk rate and maintained a low ERA.
SIERA saw Tommy Hanson’s high strikeout rate in 2010 and knew that his low BABIP wasn’t a fluke. And neither SIERA nor xFIP had much confidence in Jeremy Guthrie’s chance to reproduce his 3.83 ERA in 2010. SIERA figured he’d at least control damage by giving up less-costly walks.
Only a few pitchers in 2009 had large spreads between their xFIP and SIERA rankings.
2009:
| Pitcher | SIERA | xFIP | SIERA Rank (of 78) | xFIP Rank (of 78) |
| Ted Lilly | 3.78 | 3.90 | 21 | 33 |
| Mark Buehrle | 4.62 | 4.37 | 66 | 55 |
| Jered Weaver | 4.25 | 4.40 | 45 | 55 |
| Zach Duke | 4.58 | 4.25 | 60 | 49 |
| Johnny Cueto | 4.34 | 4.51 | 51 | 66 |
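The rank comparison behind these tables can be sketched in a few lines. This is only an illustration built from the 2009 rows above; the cutoff of 10 ranking spots is an assumed reading of the article's threshold, and the function name `rank_gaps` is hypothetical.

```python
# Rows copied from the 2009 table above: (pitcher, SIERA rank, xFIP rank).
ROWS_2009 = [
    ("Ted Lilly", 21, 33),
    ("Mark Buehrle", 66, 55),
    ("Jered Weaver", 45, 55),
    ("Zach Duke", 60, 49),
    ("Johnny Cueto", 51, 66),
]

MIN_GAP = 10  # assumed cutoff: ranks at least 10 spots apart


def rank_gaps(rows, min_gap=MIN_GAP):
    """Return (pitcher, gap, favored_by) for rows whose ranks differ by >= min_gap.

    gap = xFIP rank - SIERA rank, so a positive gap means SIERA gives the
    pitcher the better (lower-numbered) rank.
    """
    flagged = []
    for name, siera_rank, xfip_rank in rows:
        gap = xfip_rank - siera_rank
        if abs(gap) >= min_gap:
            favored_by = "SIERA" if gap > 0 else "xFIP"
            flagged.append((name, gap, favored_by))
    return flagged


gaps_2009 = rank_gaps(ROWS_2009)
for name, gap, favored_by in gaps_2009:
    print(f"{name:13s} gap={gap:+3d}  better rank from {favored_by}")
```

A signed gap keeps the direction of the disagreement visible, which is what the discussion of each pitcher relies on.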
The pitchers whom xFIP is particularly cautious about are guys who give up a lot of fly balls despite whiffing lots of batters. On the other hand, SIERA tends to favor exactly those types of pitchers.
SIERA likes Ted Lilly because it thinks he can limit his BABIP and HR/FB, just like Jered Weaver has done. Both these pitchers generate weak contact because of their high strikeout rates — and both of their high fly ball rates mean they generate shorter, more catchable balls. Their SIERAs are regularly more attractive than their BABIPs.
Zach Duke’s low 2009 strikeout rate looked like a recipe for disaster to SIERA, and that’s exactly what happened in 2010. Since strikeouts mean more to SIERA than other outs, SIERA puts more distance between guys like Lilly and Duke than other estimators.
Consider this: If you subtract 20 Ks from Lilly’s 2009, his SIERA would shoot up from 3.78 to 4.09, a 31-point jump. His xFIP, by comparison, would go up from 3.95 to 4.22, a 27-point difference. Adding 20 strikeouts to Zach Duke’s 2009 would improve his SIERA from 4.58 to 4.32, a 26-point difference, while his xFIP would fall 21 points to 4.04. These discrepancies are not massive, but they add up.
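The arithmetic in that comparison can be laid out explicitly. The sketch below only re-uses the before-and-after values quoted in the paragraph; it does not recompute SIERA or xFIP. Duke's starting xFIP of 4.25 is inferred from "fall 21 points to 4.04", and the helper names are hypothetical.

```python
# Before/after values quoted in the paragraph above (a 20-strikeout change
# applied to each pitcher's 2009 line). Duke's starting xFIP is inferred
# as 4.04 + 0.21 = 4.25.
SCENARIOS = {
    "Ted Lilly, minus 20 K": {"SIERA": (3.78, 4.09), "xFIP": (3.95, 4.22)},
    "Zach Duke, plus 20 K": {"SIERA": (4.58, 4.32), "xFIP": (4.25, 4.04)},
}
STRIKEOUT_CHANGE = 20


def points(before, after):
    """Change in hundredths of a run ("points"), rounded to drop float noise."""
    return round((after - before) * 100)


def extra_siera_points(scenario):
    """How many more points SIERA moves than xFIP for the same strikeout change."""
    siera_move = abs(points(*scenario["SIERA"]))
    xfip_move = abs(points(*scenario["xFIP"]))
    return siera_move - xfip_move


for label, scenario in SCENARIOS.items():
    siera_move = points(*scenario["SIERA"])
    xfip_move = points(*scenario["xFIP"])
    print(
        f"{label}: SIERA {siera_move:+d}, xFIP {xfip_move:+d}, "
        f"SIERA per K {abs(siera_move) / STRIKEOUT_CHANGE:.2f} points"
    )
```

In both scenarios SIERA moves a few points more than xFIP for the same 20 strikeouts, which is the sense in which strikeouts carry more weight in SIERA.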
Strikeouts matter more to SIERA than to xFIP. Fly balls matter less, so a pitcher like Lilly would suffer less from being slightly below average in that category. Duke would benefit less from being slightly above average.
Along those lines, there were more pitchers with big xFIP and SIERA differences in 2008, including some familiar names from the 2009 and 2010 lists.
2008:
| Pitcher | SIERA | xFIP | SIERA Rank (of 88) | xFIP Rank (of 88) |
| Ted Lilly | 3.93 | 4.08 | 29 | 41 |
| Nick Blackburn | 4.61 | 4.42 | 67 | 54 |
| Oliver Perez | 4.58 | 4.74 | 64 | 76 |
| Jered Weaver | 4.06 | 4.70 | 58 | 72 |
| Manny Parra | 4.06 | 3.81 | 35 | 23 |
| Vicente Padilla | 4.50 | 4.65 | 59 | 71 |
| Johnny Cueto | 4.11 | 4.31 | 38 | 52 |
| Justin Verlander | 4.50 | 4.70 | 58 | 72 |
Nick Blackburn’s low 2008 strikeout rate looked like trouble to SIERA, while Justin Verlander was the classic high-strikeout, low-groundball guy whom SIERA loves. Since SIERA doesn’t see large differences between pitchers with groundball rates below 50%, pitchers who really miss bats while allowing a few more fly balls, like Verlander, Weaver and Lilly, are the types who look better to SIERA than to xFIP.
Johnny Cueto appears on both the 2008 and 2009 lists. With strikeout rates slightly above average and fly ball rates slightly below average, Cueto fits the mold of a pitcher whom SIERA prefers, compared to xFIP. Whiffs indicate better performance in other areas. Keeping the ball on the ground often does not.
Pitchers with control problems fare differently depending on how many hitters they can strike out. SIERA thought Manny Parra’s wildness, along with his so-so 2008 strikeout and groundball rates, was a sign of trouble. And it was in 2009. Still, SIERA made a mistake in thinking that Oliver Perez’s strikeout rate would mitigate some of his control problems, which clearly didn’t happen.
Remove 20 walks from Parra’s 2008 line and his SIERA would go down 40 points to 3.66. His xFIP would go down 36 points — to 3.45. But SIERA thinks those extra walks were a slightly bigger sign of trouble. In a way, it smelled doom.
With Perez, SIERA thought that as long as he maintained his strikeout rate, he’d do well enough. But drop his strikeouts by 20 to make it look more like his 2009 rate, and his SIERA would spike to 4.87, clearly moving Perez from acceptable to unacceptable territory. These examples show why it’s important to treat SIERA as an estimator and not as a predictor. In reality, Perez’s talent level in 2009 was probably around a 4.58 ERA.
Fans these days know that pitchers have little control over BABIP and HR/FB. Taking this hypothesis to the extreme, xFIP shows fans the effects of the most important parts of pitching. Relaxing that hypothesis is more realistic, though, which is what SIERA does by allowing groups of pitchers to have different BABIPs and HR/FBs.
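To make the "extreme" version of that hypothesis concrete, here is a minimal sketch of xFIP in its commonly published form, which is not spelled out in this article. The league HR/FB rate, the constant and the pitcher line are placeholder values assumed purely for illustration.

```python
# Placeholder league values, assumed for illustration only (not from the article).
LG_HR_PER_FB = 0.105   # hypothetical league home runs per fly ball
XFIP_CONSTANT = 3.10   # hypothetical constant that aligns xFIP with league ERA


def xfip(fly_balls, walks, hbp, strikeouts, innings,
         lg_hr_per_fb=LG_HR_PER_FB, constant=XFIP_CONSTANT):
    """xFIP in its commonly published form.

    Actual home runs are replaced by fly balls times the league HR/FB rate,
    and hits on balls in play are not an input at all.
    """
    expected_hr = fly_balls * lg_hr_per_fb
    return (13 * expected_hr + 3 * (walks + hbp) - 2 * strikeouts) / innings + constant


# A hypothetical pitcher line, not a real season.
peripherals = dict(fly_balls=200, walks=55, hbp=5, strikeouts=160, innings=190.0)

base = xfip(**peripherals)
# Same line with 20 fewer strikeouts, holding innings fixed (a simplification).
fewer_k = xfip(**{**peripherals, "strikeouts": 140})

print(f"xFIP: {base:.2f}")
print(f"xFIP with 20 fewer strikeouts: {fewer_k:.2f}")
```

Because hits allowed and actual home runs are not parameters, two pitchers with the same peripherals get the same xFIP no matter what their BABIP or HR/FB was. Holding innings fixed when changing strikeouts is a shortcut, so these numbers are not comparable to the what-if figures quoted earlier in the article.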
Run prevention directly due to strikeouts, walks and home runs is better evaluated with xFIP; run prevention due to other factors is better evaluated with SIERA. This is why both are useful. Now that SIERA and xFIP are available at FanGraphs, readers have the tools to understand pitching like never before.
With these two statistics, it’s worth testing them against each other – and then against the rest of the field. We’ll do that tomorrow.
Matt,
I’m really liking this series, and I’m excited that SIERA is at Fangraphs. I do think you should be a bit more cautious in some of your statements.
“SIERA saw Tommy Hanson’s high strikeout rate in 2010 and knew that his low BABIP wasn’t a fluke.” Seems like “suspected” would be better than “knew.” While I’m trusting your research has shown that high strikeout pitchers have lower BABIP, there certainly have been high strikeout pitchers with average or high BABIP.
“Both these pitchers generate weak contact because of their high strikeout rates — and both of their high fly ball rates mean they generate shorter, more catchable balls.” Again, these ideas may generally be true, but I don’t think we should state them as fact.
That said, on second glance most of the time you did state these ideas as likelihoods rather than absolute truths, so I’m probably nitpicking. Looking forward to tomorrow’s entry.
The more I see, the more in love I become.
Great stuff. Is this SIERA equation something that is/will be made public?
Yes. It was in yesterday’s article in that table with coefficients and variables. I gave the formula’s initial BP version and the new FanGraphs version.
In reading this series it seems as though SIERA is the most effective tool in evaluating a pitcher’s true talent level. If this is the case would it make more sense to use SIERA to calculate a pitcher’s WAR? I’m not saying it is or isn’t just curious what the more enlightened readers, or the author thinks. Seems like if we’re looking for true talent level then the most accurate measure of a pitcher’s ability would make the most sense.
I think this is partly a philosophical question, but the argument for using FIP over xFIP, despite the latter being a better measure of skill, is that you need team WAR to go down when a HR is hit against them. Removing BABIP makes sense because it can be credited/debited from fielder’s WAR. There are some things that don’t get factored into team WAR but this at least lets a HR help the hitting team’s WAR while it hurts the pitching team’s WAR. You ask a fair question, though.
I actually like the idea of using SIERA for pitcher WAR. This is probably in the distant future at best, though, because it would complicate matters. To compensate, you would have to credit/debit fielders based on who was pitching.
For example, if I understand this right, SIERA credits pitchers like Verlander because his flyballs are “more catchable.” So, it would make sense that Austin Jackson would get less credit for catching a Verlander flyball than an average one. I understand this adds yet another coefficient to each runs saved calculation (and would be a ton of work), but I think the idea is sound.
Why are you looking at difference in rankings and not just difference between xFIP and SIERA themselves? You’re going to have some bias because of the inconsistent spacing in the rankings.
Also, given that these are the pitchers with the least agreement between xFIP and SIERA and you’re only showing a few with gaps wider than .15 runs, it implies that xFIP and SIERA are really damn similar. Maybe you could tease their y-t-y correlation for us?
The correlation between xFIP & SIERA is .94, as opposed to SIERA and FIP which is .80, and just .62 between SIERA & ERA. That’s all pitchers with at least 40 IP between 2002-10.
The logic behind using rankings was just that if the average xFIP and the average SIERA in a given season did not match, there could be too many pitchers on the SIERA>>xFIP or too many on the SIERA<
So you’re not setting league-average SIERA to league-average ERA? I thought that’s what the coefficients from yesterday were about?
Thanks for the correlations. Pretty damn close. Any chance you have FIP/xFIP handy?
The coefficients were set to minimize the mean square error for pitchers with 40 IP or more, weighting on IP, so they should be very close to league-average ERA but not quite. SIERA should probably be a little lower because it’s not trying to model the MLB BABIP of pitchers who throw 5 innings and get demoted right away, because they generally aren’t MLB talent level. In practice, this won’t make a huge difference, but rankings seemed an easier way to show a few on each side each year.
xFIP & FIP correlation is .82.
Great series so far. Just one question. How do you pronounce it? I’ve always said it like Sierra Mountains. Is it supposed to be S I ERA?
I pronounce it. I assume most people do, too, but I guess I wouldn’t know. Hmm…
forget something?
By “pronounce it”, I mean I sound it out. Like Sierra Mountains.
since SIERA is so high on Ks does it factor in how much easier pitchers are to strike out?
Good question!
I love that we are getting better and better with our measures, but this methodology still undervalues greatly pitchers who actually show a skill in preventing hits: the guys who are able to attain a significantly sub-.300 BABIP.
I realize that exceptions makes the process less mechanical, less easy to do, but we are basically penalizing pitchers like Matt Cain, Barry Zito, knuckleballers, for being good at what they do.
As long as they keep their walks down and their HRs down, I think it won’t make much difference. I, for one, will say there’s value in Tim Wakefield doing what he does, just very little value.
Good “true” BABIP reducers can prevent .5 WAR per season. Great ones can prevent 1 WAR per season. Not a huge deal over the course of a season…but over a career, it is a huge deal. Of course, you don’t come to FG WAR if you are worried about career pitcher WAR anyway. People use rWAR for that.
Matt, I’m curious… If having a variable for pitcher handedness lowered your MSE, would you include it in SIERA?
Probably not, but it depends why. If I really only wanted to minimize MSE using next-year ERA, I would regress on next-year ERA. I want to pick up same-year skill effects, so I regress on same-year ERA. The reason the %IP as SP was included was specifically because I noticed that SIERA/xFIP/FIP always are lower than ERA on average for SP and higher than ERA on average for RP. Since I’m focused now on the concept of SIERA as picking up BABIP and HR/FB skill levels, it only makes sense to allow RP to have lower BABIPs.
As to why handedness probably doesn’t belong, I can’t think of a good reason why any effect would actually improve the relationship on out-of-sample data. In other words, is there a reason why LHP would have different BABIP or HR/FB than RHP, conditional on their peripherals? I can’t think of any, but maybe I’m missing something. It sounds like the type of thing that would only work on finding a best fit line, and would fall apart when correlating with next-year ERA or previous-year ERA or anything like that. I’m curious if I’m not thinking of anything though?
Left-handers are known for being able to reduce stolen bases, get more pickoffs, and keep runners closer to first, limiting baserunning advances. That’s not usually considered a “pitching” skill, but it certainly influences how proficient a pitcher is at preventing runs.
That’s probably a whole ‘nother ball of wax, though, as that skill varies greatly among pitchers, and it can’t be considered an inherent skill, as there are right-handers that are very good at this as well. I’m not even sure a metric exists right now that addresses this.
Perhaps that belongs on the defensive side of a pitchers ledger, which presently isn’t factored into pitcher WAR.
None consider it specifically, but Sean Smith’s WAR does not weed it out. Kinda the same thing.
Matt — Perhaps you will cover this in one of the next two parts of the series, but with respect to SIERA being valuable as a “predictor” for future ERA: have you tested how many IP it takes for SIERA to become “useful”?
For example:
- it’s one thing to look at a full-season’s worth of data and conclude that a pitcher is unlikely to repeat that performance next year (e.g. Clay Buchholz in 2010).
- however, what I’m asking about is IN-SEASON predictive power. For example, let’s say a guy with a long history of 4.50 SIERA performance (Edwin Jackson type) comes out and puts up a 3.50 SIERA for the first two months of the season. If I’m projecting him going forward for the rest of the season, should I pay more attention to the smaller sample of data (3.50 ERA for the first two months) with the idea that some underlying skills / approach have changed, or is it more likely that he will perform like his career numbers (which have a much larger sample size)? Or would you do some regressed amalgam (weighted avg) of the two?
That’s a really good question. In general, the average baseball stat should be weighted about 50% for events that happened this year, 30% for events that happened last year, and 20% for events that happened the year before. But obviously if you have 3/5 as much data this year as last, you are effectively weighting this year’s and last year’s totals similarly. For statistics that stabilize more quickly, however, you can do better than 50/30/20. Since the year-to-year correlations for BB%, SO%, and GB% are all around 70%, I’d guess you could weight this year’s events about 2x as much as last year’s (as in, weight this year’s SIERA through 81 games about as much as last year’s through 162 games). But there’s probably a more thorough way to do this. I’ll think about it. Really good question. Thanks.
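A rough sketch of that weighting idea, under stated assumptions: the 50/30/20 recency weights come from the reply above, each season's weight is scaled by how much of a full season it covers, and the example pitcher (3.50 over roughly a third of a season after two full seasons at 4.50) is the hypothetical from the question. The function names are made up for illustration, and this is not a tested projection method.

```python
RECENCY_WEIGHTS = (0.5, 0.3, 0.2)  # this year, last year, two years ago


def effective_weights(sample_fractions, recency=RECENCY_WEIGHTS):
    """Scale each recency weight by that season's share of a full season
    (1.0 = full season), then normalize so the weights sum to 1."""
    raw = [w * f for w, f in zip(recency, sample_fractions)]
    total = sum(raw)
    return [r / total for r in raw]


def blended_estimate(values, sample_fractions, recency=RECENCY_WEIGHTS):
    """Weighted average of per-season values using the effective weights."""
    weights = effective_weights(sample_fractions, recency)
    return sum(w * v for w, v in zip(weights, values))


# With 3/5 of a season so far, this year and last year end up weighted equally.
three_fifths = effective_weights((3 / 5, 1.0, 1.0))

# Hypothetical pitcher from the question: 3.50 through two months
# (assumed to be about 1/3 of a season), 4.50 in each of the prior two years.
estimate = blended_estimate((3.50, 4.50, 4.50), (1 / 3, 1.0, 1.0))

print("weights at 3/5 of a season:", [round(w, 3) for w in three_fifths])
print(f"blended estimate: {estimate:.2f}")
```

For stats that stabilize faster, a steeper `recency` tuple can be passed in place of the 50/30/20 default.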
(pats self on back)
matt i gotta say, the best part of this SIERA series is that you come back with timely responses to reader questions, something fangraphs typically is very lacking in
Yes, big props to Matt for taking the time to read and respond to all of our questions and comments. A lot of time and effort on his part that’s very appreciated.
Well, it helps that this is technical enough that it seems to be keeping the rude trolls away. Considering the tone of some of the comments you see on Fangraphs, I’m not surprised when the authors choose not to respond.
Am I the only one that noticed that Jaime Garcia and Tommy Hanson have the same xFIP in 2010 but are ranked 20 spots apart? I’m not sure where the error is there…
Jaime Garcia is 3.62. It was a copy/paste problem. Sorry and thanks.
So, not to be a constant nag about this, but I do want to point out that FG uses all FB in its xFIP calculation, while the original intent was to use just outfield fly balls. Matt, do you have an opinion about which approach is better?
Good point. I would guess that xFIP with outfield fly balls only would probably perform better, because HR/FB is lower for fly ball pitchers overall, and IFFB/FB is higher for fly ball pitchers. Since that’s not really the direct reason it was created, I would guess the question is whether HR/FB (net of team) has more persistence than HR/OFFB (net of team). The one with the least persistence would be the way to go. I’d bet it’s OFFB and I’m guessing that improves xFIP too. I’ll see if I can check out this stuff. Thanks.
Thanks, Matt. Maybe it’s the wording, but isn’t the factor with the MOST persistence the way to go?
Well, I mean that if HR/OFFB has less persistence than HR/FB, then you’d want to regress xHR to OFFB*(league average HR/FB). Basically, figure out which number is less likely to be a skill and make it league average.
Gotcha
Yet another metric that makes Nolasco look really solid.
/frustrated fantasy owner
Yup, I’ve just started Nolasco and Vazquez back-to-back against the Padres. What could possibly go wrong, twice, I thought. :-P
Sorry if this was brought up in another post but any chance you move SIERA to the dashboard right next to ERA, xFIP and FIP for a quick comparison? Great work