I like the idea of using defense-independent pitching info in the projection systems (Marcel, PECOTA, ZiPS). But pitching performance seems so volatile, I wonder if using just the prior year’s data might be more accurate than using multiple years of data anyway.
It would be relatively easy to make a Marcel projection using SIERA instead of ERA to test this.
I’m sorry, I know you explained it like 50 times, but I’m still a little confused about the difference between correlation and RMSE. If I understand it correctly, correlation doesn’t care about how far apart the estimator/projection and next year’s ERA are, but how consistent that difference is among all pitchers. RMSE, on the other hand, simply measures how close the estimator/projection is to the next year’s ERA. So, the projection system could systematically project ERAs to be a full point lower than they end up being, and have a perfect correlation while having a bad RMSE. Correct?
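To make the distinction concrete, here is a minimal sketch using made-up ERAs (not real pitchers). The projection is exactly one run too low for every pitcher, so the correlation is perfect while the RMSE is a full run. The helper names `pearson` and `rmse` are just for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: how consistently two series move together."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(pred, actual):
    """Root mean squared error: how far the predictions are from the results."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Hypothetical next-year ERAs for five pitchers (illustrative only).
actual_era = [3.10, 3.85, 4.20, 4.75, 5.30]
# A projection that is systematically one full run too low.
biased_proj = [a - 1.00 for a in actual_era]

r = pearson(biased_proj, actual_era)
err = rmse(biased_proj, actual_era)
print(round(r, 3), round(err, 3))
```

So yes: a constant bias leaves the correlation untouched but shows up in full in the RMSE.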
The main reason would probably be to evaluate run scoring. Of course, then we’d really want to use RA, not ERA, but they’re pretty close with small biases. FIP, xFIP, SIERA are all only as useful as they apply to other statistics. I could lead the league in SIERA, but if it didn’t mean that I’m good at keeping runs off the board, it doesn’t mean anything.
This works the opposite way actually. A pitcher moving to Coors Field will have a high ERA and a high ERA projection, but his SIERA will stay low, reflecting league average. So that should make it harder for ERA Estimators to beat projections.
Chris Volstad’s 2011 SIERA: 3.84. Carlos Zambrano’s: 4.46.
I’d be more confident Volstad could post that sort of ERA if I didn’t think the Cubs’ infield defense might be just marginally better this year than it was last. Volstad and his career 50+ GB% better hope Starlin Castro gets his act together at SS pretty quick…
Oops. I apparently replied to the wrong comment. That was meant to go to the guy asking why we cared about ERA.
But, speaking of park factors… Someone in the FanGraphs forums was asking a good question about SIERA’s park adjustment. How does that work? Are the underlying variables (K%, BB%, GB%, FB%) adjusted by park? Or does the park adjustment come after the main calculations?
FIP and xFIP are estimators of ERA, so theoretically in the long run ERA = xFIP/SIERA/whatever you use to describe true-talent level. Projecting FIP instead of ERA would be like predicting what political candidates’ Intrade odds will be instead of who will win the election.
The Davids (Appleman and Cameron) are the guys to ask, not Matt Swartz. I think the answer would be “no,” with the usual reason that WAR is not supposed to be predictive; it’s supposed to reflect what actually happened.
Now, I’m personally not content with that answer. By using FIP, you’re including one luck-based variable (HR/FB%), excluding other luck-based variables (BABIP, LOB%), and excluding some skill-based variables (GB%, SB/CS).
FIP is nice insofar as it’s simple and clean. The Three True Outcomes (BB, K, HR) are the pitcher’s responsibility; everything else is attributed to team defense.
FIP is not an estimator, though it may be used as such, just as ERA can.
It is a measure of what actually happened independently of defense.
Thus, using FIP as the target stat for pitching estimators would be an improvement over using ERA.
My projection method involves me projecting all the underlying components and then spitting out various ERA projections using the ERA estimators. That way I can capture the guys who have shown the true skill of keeping a low BABIP, while also not using actual past ERAs to project future ERA.
i did a correlation comparison a few weeks ago using pitcher seasons (min 100IP for starters, 40IP for relievers; N=328) from 2009 and 2010, comparing them with their results in the following year. here’s what i came up with:
Matt, I need to read this a bit more, but I don’t think your assessment of projection systems is correct. For instance, Oliver does take the “interplay between different pitching skills and their effect on runs prevention” into account (I believe), though perhaps not the way you would. For instance, take a read of Brian Cartwright’s article in this year’s THT Annual. The results may not yet bear it out, but the work is happening.
As always, great to see your work, regardless of site. Consistent with everything you’ve ever posted (readers new to Matt’s work should peruse BP archives for his initial work on SIERA, home field advantage, Cole Hamels, and even his electrifying performance on BP Idol), it’s great to know you’ll follow up frequently in the comments section.
Given that, a few questions:
1) I’ve heard Nate Silver (a bit testily, I think) take pains to point out PECOTA is an algorithm. (Therefore, not a formula?). From his perspective, how is that different from xFIP, SIERA, etc and is that a possible defense against your findings?
2) SIERA is scaled to average ERA, correct? Will the weightings change this year as a result? Is it a moving average scale or simply last year’s ERA, etc? When will you publish new factors, and if so, do you ever go back and change old weightings if, say, 2011 was a lower run environment than expected? Or does, for example, Ryan Dempster’s 2009 SIERA stay fixed?
3) Why not scale SIERA to RA? Or, if a team really wanted to know how many runs a pitcher might allow in the future, should it just take each pitcher’s ERA and multiply it by the league-average multiple (I think it’s about 1.09)?
I’d like to see a further study of this taking a look at pitchers in different aging buckets. As MGL keeps arguing at The Book Blog, pitcher aging is considerably different than hitter aging. By looking at different age groups, we might find certain parts of the aging curve where the aging adjustment of projection systems is counterproductive.
As one possibility, there is some evidence to suggest that pitchers typically begin their decline in the early 20s, at least in performance measured by rate stats (workloads are usually increasing). Perhaps projection systems are applying a positive aging adjustment to young pitchers, expecting improvement, while the pitchers are on the aggregate declining. That could explain why an accurate estimator that applies no aging adjustment might be a stronger predictor. SIERA could be a strong enough measure of ability where the improper aging hurts the projections more than the larger sample of data helps them.
Excellent analysis. I hope someone can take a deeper look.
This seems particularly surprising. Is it possible that the rotochamp and fans are adjusting for the lower run environment in 2010 vs. 2009 but the others aren’t? I’ve never seen a gap in projections quite this big.
Yeah, I read Brian’s article in detail and discussed it with him a lot. It’s a fantastic article, probably one of the best sabermetric articles I’ve read in a long time. He definitely is starting to take into account some of the interplay, though I’d bet if he added strikeout rate to his excellent findings about the effects of flyball rate on BABIP (and extra-base hits), he’d probably take great strides forward. I also tend to disagree with the linear weights approach to pitching, since I think situational pitching is a skill, particularly one associated with BB% and somewhat with K%.
Nate’s PECOTA was an algorithm, right, not a formula. I think many systems are. I don’t know if Colin’s PECOTA is a formula or algorithm. I think Marcel is more or less a formula (weighting and regressing mostly). However, that isn’t really a defense against my findings. It should project the future better. I suspect it could if we incorporated more information about the interplay between peripherals and outcomes, outside of their linear weight effect.
Weightings will eventually change. For now, David A has fixed it so that the average SIERA moves by the same amount as the average ERA each year. Eventually, I’ll let the constants adjust for more data too, so Ryan Dempster’s 2009 SIERA will move by a couple points (not many).
If you look at Part 5 of the SIERA series in July, you’ll find SIRA. It’s also in my SIERA Calculator (I’ll find a link to this).
Jesse hinted at an issue — starters vs. relievers — that has not received enough attention. I’ve computed my own SIERA regressions using Retrosheet data from 2003-2010, and found consistently that you get substantially different coefficient estimates (and significance levels) if you estimate separate regressions for starters and relievers. Note that this is very different than simply tacking on a reliever adjustment value after the fact (which is how the current form of SIERA works).
I’m a big fan of Matt’s overall approach, but I honestly think he hasn’t done enough statistical testing of the actual regression results (which is not the same thing as testing how well SIERA predicts next year’s ERA); this has led him to overemphasize and over-interpret the importance and meaning of the squared and interaction terms in his equation.
The significance levels are bound to go way down if you only look at smaller sub-samples. I’m sure that there are some differences in the coefficients, but I’d be curious if their confidence intervals didn’t overlap. It may be that interaction terms change a lot for SP vs. RP, but I strongly doubt that the squared terms would. If you look at Brian Cartwright’s article in the THT Annual on launch angle as well as my work on GB-BABIP by GB%, you’ll see that there’s a very good reason why GB% has a negative coefficient on its squared term. Furthermore, every which way that I’ve estimated K%, the squared term has had a positive coefficient, too, so I doubt that would go away, since there’s no reason that the effect of a K + the indirect effect on BABIP and HR/FB of a K would be linear. BB% has a positive coefficient on its squared term because of my findings about situational pitching. These things are pretty much the basis for the squared terms. I’d be curious to see the regression output of starters vs. relievers, and I may play with that myself at some point, but I think by splitting the sample, you limit the effect of what is already a smaller sample size than you’d like if you want to get precise coefficients.
Also, Retrosheet data does get different coefficients than BIS data at FanGraphs. I’d try it with FanGraphs data and see if you find similar findings to mine. I’d also be curious if you found any major differences in the squared terms looking at RP vs. SP, because that’s very counterintuitive to me based on other findings. Thanks for the comment.
It would if the correlation measured across years. The correlation shows how the variables move together, so if one system knew the run environment, while the others assumed previous year run environment, that would affect things. All of the ones above pretty much use the previous year environment as far as I know.
But even without splitting the sample between starters/relievers, most of the time I found that the interaction terms were completely insignificant, and many times the squared terms were as well. I never could duplicate your results, although I came “relatively” close.
Part of this is undoubtedly due to the difference between Retrosheet and BIS; I guess I really should download the data from FG and re-do the whole thing, but I’m not really looking forward to that!
A few really basic questions:
Which years did you actually include in your estimation?
Did you weight your observations (by IP I assume?)
Exclude observations with <40 IP?
Does BB include or exclude intentional walks? HBP?
Do flyballs in your netGB number include both outfield FB and popups?
I assume you estimated with just a single constant term (as opposed to a year dummy for each year)?
Dependent variable was same-year ERA — ballpark-adjusted or no?
1) I used 2002-2010
2) Weighted by IP
3) Excluded observations <40 IP
4) BB=UBB+IBB and not +HBP
6) Dummy variable for each year except one + a constant (same as having dummy variables for each year and no constant, of course)
7) Dependent variable = park-adjusted ERA
All of our predictions use mathematical algorithms that look at key player performance indicators based on historical performance. We generally look at these indicators over a 3-year period with the most recent history weighted the most.
on pitchers specifically:
We discount the traditional metrics like ERA when predicting 2011 performance and use the more reliable metrics like FIP and xFIP to generate pitching projections.
not a ton of information there. my guess is that their projections were better geared to the lower run environment because they don’t take the previous year’s ERA into account in the first place.
i went back and looked at the data. turns out the rotochamp projections i have only coincide with 2011 since fangraphs only began carrying them last year. i split the ZIPS data out into 2010 and 2011 results and got the following correlations:
In the article where Colin Wyers criticized SIERA (which, by the way, I thought was a really lousily written article: people don’t use career SIERA, and SIERA was shown to be better over one season, which is what most people use ERA estimators for), it showed that plain ERA was better over a larger sample. So maybe, using the reliability or something, we could find the percent that we usually regress, and use SIERA for that instead. That would probably give the best estimate.
Also minorleaguecentral.com carries minor league SIERA for anyone who wants to know.
Dekker… no, it doesn’t. Yes, it takes into account park factor, but it doesn’t take into account the variation (luck) factor.
You are assuming the majority of the year-to-year change in a pitcher’s HR/FB ratio is based on the park, and that is clearly not the case. Just look at a player who has played in the same place for years and how much his HR/FB% fluctuates year to year.
As an example, Roy Halladay had a 5.1% HR/FB ratio last year; the previous year (in the same park) he had an 11.3% ratio.
Do you have a feel, or have you done a study to determine at what point mid-season SIERA is more predictive of rest-of-season ERA than say, last year’s SIERA? For instance at Memorial Day this year how much weight on current year SIERA would you assign vs. last year’s SIERA as a predictor of rest-of-year SIERA?
I remember last May, SIERA favorite Matt Garza did not maintain his newly found level of strikeout rates but on the other hand, Charlie Morton really had transformed himself into a ground ball pitcher.
Where did you get the idea that a year ago, Jaime Garcia was on the trading block? That absolutely is not true. I’m not sure it’s particularly relevant to the article but Garcia was never on the trading block. I’m not sure Latos was a year ago either — perhaps I’m wrong there — but I’m certain the Cards did not consider trading Garcia last year.
Did you use 2012 SIERA to project 2006-2011 ERA? And is SIERA based on a regression including data from 2006-2011? If yes to both, then we would naturally expect SIERA to project 2006-2011 better than other systems, since it was built from the data contained therein.
I would suggest that the study is only reliable if we use SIERA as it would have been calculated in 2005 to project 2006-2011.
This is a common mistake. SIERA is a regression of same-year results. Next-year results are out of sample. The dependent variable in SIERA for, say, 2008, is 2008 ERA, and the independent variables are 2008 SO, BB, GB, FB, SP/RP. The 2008 SIERA is also highly correlated with 2009 ERA, but that is out of sample.
However, the regression doesn’t include 2011 data anyway, and as you can see from above, it does quite well at predicting 2011.
This is a fascinating finding. I wonder what it is that the projection systems are doing wrong. (Except for Marcel; I know exactly how Marcel works.)
There are a couple of cheap and easy things to improve our projection from year-1 SIERA (such as velocity, as Matt pointed out at THT in “You Shall Know Our Velocity,” which was kind to one of my articles) and that should extend SIERA’s lead.
Matt, or any other qualified person: Is there a good reason for the correlation scores to drop so significantly when going from the one-year sample to the five-year sample?
Comment by John R. Mayne — January 6, 2012 @ 10:21 am
This is very interesting analysis. I’m a big fan of SIERA and think that, as you say, it’s really important to figure out ways of using SIERA-type thinking to modify the data that goes into projections. I still very much believe in the core of the Nate Silver PECOTA model (kind of hard to get a handle on what it’s become) for projections.
I think there could be a plausible explanation of the greater problems projections had with relievers. I’m guessing that a lot of the variance in this sample (and the efficiency of ERA estimators) is the tendency of regression to the mean. When you examine a “season” as a data point, relievers suffer from way more sample size issues, and as such their unweighted tendency of regression to the mean outweighs that of starters. The difference between the projections and the estimators is the introduction of comparables to the assessment of true talent level to find some sort of progression over a career. That should be more accurate for starters than relievers, particularly if your assessment of true talent is not as robust as the new ERA estimators.
oooh crap just scrolled down and saw the bztips discussion. Well… glad to see it.
Based on the quickly stabilizing variables, this means that SIERA stabilizes fairly quickly as well, right? How quickly do you think you can rely on a current season’s SIERA to be more predictive than the prior year’s SIERA?
In case Matt (or anyone else) is still interested, I’ve taken Matt’s specs that he listed above and tried to duplicate his SIERA equation as close as possible, using all the same variables and for the same time period 2002-2010. I couldn’t duplicate his results exactly, because:
a) I used Retrosheet data and he used BIS data
b) I did my own park-adjusted ERA calculations
But I came fairly close to his published results. Then I split the sample and ran separate regressions for:
–American League only
–National League only
Here’s what I found:
The coefficients on K’s, BB’s and netGB’s are pretty stable and (usually) statistically significant across the entire sample and all sub-samples; this is good.
The squared term on strikeouts is very unstable and highly insignificant in 3 of the 4 split samples (all except NL); this is contrary to what Matt suggested would happen.
The squared term on walks is unstable and highly insignificant, even in the initial unsplit sample.
The squared term on netGB’s is stable and significant – nice!
Among the interaction terms, only the K/netGB interaction is fairly stable, though not very significant statistically. The other two interactions are very unstable and very insignificant.
I got adj-R^2 of between .36 and .40 for all runs except the relievers sample, which came in at under .28; clearly there are some real differences in how well a SIERA-type regression tells us what’s going on with relievers as compared to starters. This was my concern from the beginning.
As a quick follow-on, I re-ran all 5 regressions excluding those terms that were consistently unstable/insignificant. This left me with the following set of 5 variables: Ks, BBs, netGBs, netGB^2, and K/netGB interaction. This yielded very similar results to the initial set of results that included all those other terms; in fact, I got slightly HIGHER adj-R^2 in most cases, meaning that the excluded variables really don’t belong in the equations.
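For anyone wondering how dropping terms can raise adjusted R^2, here is a rough sketch of the standard adjustment. The R^2 values, sample size, and predictor counts below are hypothetical placeholders, not the actual regression output from this thread.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: penalizes each extra predictor."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical numbers for illustration only.
n_obs = 500
full_r2, full_k = 0.3800, 9        # linear + squared + interaction terms
reduced_r2, reduced_k = 0.3795, 5  # after dropping four weak terms

adj_full = adjusted_r2(full_r2, n_obs, full_k)
adj_reduced = adjusted_r2(reduced_r2, n_obs, reduced_k)
print(round(adj_full, 4), round(adj_reduced, 4))
```

Plain R^2 can never go up when you remove a variable, but if the dropped terms explain almost nothing, the smaller penalty can leave the adjusted figure slightly higher.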
So I’m left to conclude that Matt really may be over-interpreting the meaning and importance of all the non-linear terms in his equation. The overall approach is still a great way to try to go beyond the oft-repeated claim that “strikeouts and walks are the only things that we can measure a pitcher by”; but a somewhat simplified version would probably do just as well in predicting out-of-sample ERAs.
bztips, I appreciate all the effort you’re putting into this, and I find your results interesting. I think you might be missing a few things.
Firstly, you don’t need to add and subtract terms to a regression as you go, since we’re not trying to actually model each effect to see if it has one. We know that all of these things have effects, and we’re just not sure how much, and we’re not sure the functional form. If you’re familiar with calculus (which it sounds like you must be, since you speak pretty knowledgeably about regression), think of it as checking:
dERA/dSO = a + b*SO + c*BB + d*GB
dERA/dBB = e + f*SO + g*BB + h*GB
dERA/dnetGB = i + j*SO + k*BB + m*GB
In that sense, I’m only assuming that the direct and indirect effects of each thing are not constant (but are at least linear) with respect to every other thing that we know affects them. For a term to be insignificant or flip signs, it’s not a huge deal.
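As a sketch of what those three lines describe: if ERA is modeled as a single quadratic in SO, BB and netGB with interaction terms, each marginal effect is linear in all three inputs. The coefficients below are made up for illustration and are not the published SIERA values. Note that in a single quadratic the cross terms are shared, so the f, j and k in the notation above would equal c, d and h respectively.

```python
def era_model(so, bb, gb, p):
    """Quadratic ERA model with squared and interaction terms (illustrative)."""
    return (p["const"]
            + p["a"] * so + 0.5 * p["b"] * so ** 2
            + p["e"] * bb + 0.5 * p["g"] * bb ** 2
            + p["i"] * gb + 0.5 * p["m"] * gb ** 2
            + p["c"] * so * bb + p["d"] * so * gb + p["h"] * bb * gb)

def d_era_d_so(so, bb, gb, p):
    """Analytic marginal effect of SO: a + b*SO + c*BB + d*GB."""
    return p["a"] + p["b"] * so + p["c"] * bb + p["d"] * gb

# Made-up coefficients; rates are per plate appearance.
params = {"const": 6.0, "a": -16.0, "b": 15.0, "e": 11.0, "g": 60.0,
          "i": -2.0, "m": -7.0, "c": 2.0, "d": 10.0, "h": -3.0}
so, bb, gb = 0.20, 0.08, 0.05

# Check the analytic derivative against a central finite difference.
step = 1e-5
numeric = (era_model(so + step, bb, gb, params)
           - era_model(so - step, bb, gb, params)) / (2 * step)
analytic = d_era_d_so(so, bb, gb, params)
print(abs(numeric - analytic) < 1e-6)
```

The point is that the value of one more strikeout depends on where the pitcher already sits in walks and ground balls, which is all the squared and interaction terms are meant to allow.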
I did use Retrosheet data with Eric Seidman in version one of SIERA, and we also found that BB^2 wasn’t significant with that data set. Given my findings on situational pitching and walks (namely, higher proportion of inopportune UBB/all UBB for pitchers with higher UBB), it is highly likely that BB^2 should have a positive term. Thus, I think the insignificance in Retrosheet data tests is batted ball measurement error causing attenuation bias.
Relievers, by the way, are going to have a lower R^2 of course! They have more variance in their ERA due to smaller sample sizes! If you took two months of pitcher data at a time, I’d bet you’d get a similar R^2 as you get for a full year of relievers. That’s just the magnitude of the error, I’m pretty sure. There are also just other factors, like when a reliever gets brought in & versus who, that add variance to his ERA in a way that is potentially not too correlated with peripherals.
I’m guessing that the insignificant SO^2 term is probably a sample size issue. Does it change much or turn negative? Or does the p-value just get bigger because the standard error is larger?
Matt, yes I understand the marginal impacts implied by the functional form of your equation. I suppose it’s a matter of perspective — while you may be comfortable “knowing” that there are complicated and indirect effects between Ks, BBs and GBs regardless of how well they show up in your equation, my perspective is that the data may not support what you think you know.
I doubt that the differences I see when I break the dataset into starters and relievers are due mostly to just sampling error — I have 1745 observations for starters and 1494 for relievers. Obviously there is some impact from the fact that the avg number of innings for starters in my sample is around 138, while the avg of relievers is 61. But still, the idea that all of the insignificant terms I see in the reliever equation are just due to sample size issues doesn’t make sense. Here are the t-statistics I got for the reliever equation:
And again, the R^2 was around 0.28. What this says to me is that there’s something else beyond linear or nonlinear Ks, BBs and nGBs, and their interactions with each other, that explain reliever performance; as opposed to your suggestion that we KNOW these things are tied to performance and it’s just small sample sizes or bad Retrosheet data that’s preventing us from seeing it.
As to your specific questions:
The BB^2 term in the starter equation was 41.48 with a t-stat of 1.68; for relievers it was 10.24 with a t-stat of 0.43.
The SO^2 term in the starter equation was 9.11 with a t-stat of 1.26; for relievers it was -1.00 with a t-stat of -0.17.
So for starters I’m not too concerned about the significance levels, because the point estimates are similar enough to what you found for the entire sample. But not so for the relievers — there’s got to be something else going on that is just not being captured well by the variables in your equation IMO.
Thanks again for continuing this discussion; the more we can learn from the data, the better off we’ll all be.
I didn’t mean sampling error. I meant just random error in smaller samples. There’s more noise in ERA relative to signal because there is less data. For instance, if you pitch one inning, the only ERAs you can have are 0.00, 9.00, 18.00, etc., so the R^2 will be lower if you regress ERA for pitchers with 1 inning on SO, BB, netGB.
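A quick sketch of that granularity point, using the usual ERA = 9 * ER / IP. The function name `possible_eras` is just for illustration.

```python
def possible_eras(innings, max_earned_runs):
    """All ERAs reachable in a given number of innings, up to max_earned_runs."""
    return [9.0 * er / innings for er in range(max_earned_runs + 1)]

one_inning = possible_eras(1, 2)
# Over a starter-sized workload, one earned run moves ERA by a much smaller step.
step_180 = possible_eras(180, 1)[1]
print(one_inning, round(step_180, 2))
```

With one inning, a single earned run swings ERA by nine full runs; over 180 innings it moves ERA by 0.05, so the same peripherals explain far more of the variation.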
The other thing going on with relievers, though, that even with comparable IP won’t make your R^2 the same is that reliever usage varies, so that explains some of the differences in ERA between them too.
I’d focus more on the coefficients than the significance levels when you’re comparing datasets, since the t-stats are just coef/std.err. and we know the std.err. will be different.
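As a rough illustration of that point, the starter and reliever coefficients and t-stats quoted earlier in the thread can be turned back into implied standard errors, and the gap between the two coefficients compared with its own standard error. This sketch assumes the two sub-samples are independent and ignores rounding in the quoted figures, so treat it as a back-of-the-envelope check, not a formal test.

```python
import math

def implied_se(coef, t_stat):
    """Standard error implied by t = coef / std.err."""
    return abs(coef / t_stat)

def diff_z(coef1, t1, coef2, t2):
    """z-score for the difference of two coefficients, assuming independence."""
    se1, se2 = implied_se(coef1, t1), implied_se(coef2, t2)
    return (coef1 - coef2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Starter vs. reliever estimates as quoted in the thread.
bb2_z = diff_z(41.48, 1.68, 10.24, 0.43)   # BB^2 term
so2_z = diff_z(9.11, 1.26, -1.00, -0.17)   # SO^2 term
print(round(bb2_z, 2), round(so2_z, 2))
```

Under those assumptions, both gaps are roughly one standard error wide, well short of the usual 1.96 cutoff, so these numbers alone cannot tell the starter and reliever coefficients apart.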
OK, I know we’re beating a dead horse, but here’s (hopefully) my last shot at it.
You can’t have it both ways:
1) claiming that coefficients matter but the large standard errors don’t when it comes to deciding which explanatory variables belong, and
2) using these same standard errors to argue that the coefficients from the reliever sub-sample aren’t really all that different from the coefficients from the full sample.
If in fact there’s all this noise due to their smaller number of innings pitched, then doesn’t it make more sense to just exclude relievers from the analysis altogether and apply your conclusions just to starters?? Instead, you’re making an assertion (and that’s all it is) that even though we can’t see it in the data, we should just ASSUME that relievers’ performance can be explained by the same model that explains starters’ performance.
In both 1) & 2), what I’m saying is that the standard errors are large when the sample gets smaller, so you can’t really assume that the coefficients are different. The relievers have fewer IP so I weighted them less. You’ll do better out of sample to include more data, but to weight it higher. I weighted by IP; it helped a lot out of sample.
Just multiply by them. So, like if a pitcher’s home park has a 1.20 park factor, then him pitching half his games on the road would give him a 1.10 park factor for his team. So say this pitcher’s SIERA is 4.00, then his expected ERA since he’s pitching half his games in a high run environment would be 4.40.
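The arithmetic above, as a small sketch. It assumes road parks average out to a neutral 1.00 factor and that half the games are at home; the function names are just for illustration.

```python
def team_park_factor(home_pf, road_pf=1.00, home_share=0.5):
    """Blend the home park factor with an assumed neutral road factor."""
    return home_share * home_pf + (1 - home_share) * road_pf

def expected_era(siera, park_factor):
    """Scale a park-neutral SIERA by the pitcher's blended park factor."""
    return siera * park_factor

pf = team_park_factor(1.20)
era = expected_era(4.00, pf)
print(round(pf, 2), round(era, 2))
```

That reproduces the 1.10 blended factor and the 4.40 expected ERA from the example.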
Comment by Matt Swartz — November 24, 2012 @ 11:51 pm
I’d like to have a more private conversation about some of this stuff, and I have a few general questions about fangraphs, too. Can I e-mail you?
Comment by lschechter — November 29, 2012 @ 10:29 pm