In doing my research, I often like to take a look at correlations to get an idea about whether factors might be connected. At the end of this season, I put together a spreadsheet to help me with that. Well, I haven’t finished the research yet (FG+ subscribers will probably soon find out what’s been keeping me from it), but in the meantime, I thought I’d share what I hope will be a pretty handy tool for whoever out there might be interested in what lies a little beneath the surface of all these stats on FanGraphs. And I do mean all of them. Any pitching-related stat on FanGraphs should be represented in this tool. You can compare one stat to another, or to itself in a different year. Or, what the heck, you can even compare a stat to a different stat in a different year. And, for you sticklers out there, it will even give you a confidence interval on these correlations (by default, a 95% interval, meaning the low and high estimates that the true correlation is very likely to fall between).
What can you do with this? Well, let’s say you want to see whether a stat is predictive of the next year’s ERA. You could, for example, set Stat 1 to K% (after selecting the correct white box, type it in, or select from the drop-down list via the arrow to the right of the box), with the year set to 0 (meaning the present year), then set Stat 2 to ERA, with the year set to 1 (meaning the next year). If you don’t change the IP or Season filters, you should see a correlation of -0.375. That shows there’s a pretty decent connection between the two stats, in that if a pitcher has a high strikeout percentage in one season, he’ll likely have a low ERA the next (relative to the rest of the pitchers in the comparison). If you change the year under ERA to 0, you’ll see the correlation gets stronger, whereas if you change it to 2 or 3, you’ll see it gets weaker. That has a lot to do with the unpredictability of K%, and especially of ERA. You’ll notice if you compare year 0 K% to year 1 K%, the correlation is a very strong 0.702, whereas if you do the same for ERA, it’s a moderate-to-weak 0.311. Hopefully the graph will give you an idea of how strong those connections really are.
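For anyone who would rather see the year-offset idea as code, here is a minimal Python sketch of how a "Stat 1 at year 0 vs. Stat 2 at year 1" comparison can be paired up. It is not how the spreadsheet is built; the function name, the record layout, and the sample numbers are all made up for illustration.

```python
def lagged_pairs(rows, stat1, year1, stat2, year2):
    """Pair stat1 from season (y + year1) with stat2 from season (y + year2).

    Every pitcher-season in `rows` is treated as a candidate "year 0" (y).
    A pair is kept only when the pitcher has a row for both offset seasons.
    """
    by_key = {(r["pitcher"], r["season"]): r for r in rows}
    pairs = []
    for pitcher, season in by_key:
        a = by_key.get((pitcher, season + year1))
        b = by_key.get((pitcher, season + year2))
        if a is not None and b is not None:
            pairs.append((a[stat1], b[stat2]))
    return pairs


# Hypothetical pitcher-seasons (invented numbers, not FanGraphs data).
rows = [
    {"pitcher": "A", "season": 2012, "K%": 0.25, "ERA": 3.10},
    {"pitcher": "A", "season": 2013, "K%": 0.24, "ERA": 3.40},
    {"pitcher": "B", "season": 2012, "K%": 0.15, "ERA": 4.50},
    {"pitcher": "B", "season": 2013, "K%": 0.16, "ERA": 4.20},
    {"pitcher": "C", "season": 2013, "K%": 0.20, "ERA": 3.90},
]

# Stat 1 = K% at year 0, Stat 2 = ERA at year 1.
pairs = lagged_pairs(rows, "K%", 0, "ERA", 1)
```

Only pitchers A and B contribute a pair here, because pitcher C has no following season in the sample. The resulting pairs are what a correlation would then be computed over.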
About those filters: 30 innings pitched (in a season) is the minimum for this data set. If you change either the minimum or maximum IP setting, it will apply to whichever seasons are in question. However, the Season filters will only affect the range of Year 0s. It’s assumed that you’re going to set either or both of Stats 1 and 2 to Year 0, and doing otherwise will limit your sample a bit (by cutting out those who didn’t pitch in Year 0).
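The filter behavior described above can be sketched in Python roughly like this. The function and field names are hypothetical, and applying the IP limits to exactly the seasons being compared is my reading of the description, not a confirmed detail of the spreadsheet.

```python
def passes_filters(year0_row, compared_rows, min_ip=30.0, max_ip=None,
                   first_season=None, last_season=None):
    """Return True if a pitcher's pairing survives the IP and Season filters.

    year0_row     -- the pitcher's year-0 record (the Season filter looks only here)
    compared_rows -- the records whose stats are being correlated
                     (assumption: the IP filter looks at each of these)
    """
    season = year0_row["season"]
    if first_season is not None and season < first_season:
        return False
    if last_season is not None and season > last_season:
        return False
    for row in compared_rows:
        if row["IP"] < min_ip:
            return False
        if max_ip is not None and row["IP"] > max_ip:
            return False
    return True


# Hypothetical records for one pitcher.
y0 = {"season": 2012, "IP": 180.0}
y1 = {"season": 2013, "IP": 45.0}

kept_default = passes_filters(y0, [y0, y1])                 # both seasons have 30+ IP
kept_min_50 = passes_filters(y0, [y0, y1], min_ip=50.0)     # year 1 falls short of 50 IP
kept_2012_only = passes_filters(y0, [y0, y1],
                                first_season=2012, last_season=2012)
```

Note that the last call keeps the pairing even though the year-1 season is 2013, since the Season range is only checked against year 0.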
If you’re wondering, the “PU%” you see as one of the default stats is what I’ve been calling “Popup Percentage,” which I define as IFFB/Batted Balls. [Edit: I made wOBA against the default instead of PU%, to show you all this new addition to the spreadsheet]. You might think IFFB% would cover that, but IFFB% is actually IFFB/FB. Batted Balls can be calculated by adding the three main batted ball types together (FB+GB+LD). PU% is sort of weakly anticorrelated with the next year’s BABIP, but it’s actually a stronger correlation than BABIP has with itself, year-to-year. It gets stronger when you raise the IP minimum, which helps weed out a bit of the randomness.
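To make the difference between PU% and IFFB% concrete, here is a small Python sketch of the two definitions given above. The function names and the sample counts are invented for illustration.

```python
def popup_pct(iffb, fb, gb, ld):
    """PU% as defined above: infield fly balls per batted ball (FB + GB + LD)."""
    batted_balls = fb + gb + ld
    return iffb / batted_balls


def iffb_pct(iffb, fb):
    """IFFB%: infield fly balls per fly ball."""
    return iffb / fb


# Hypothetical season: 20 popups among 200 fly balls, plus 250 grounders and 100 liners.
pu = popup_pct(iffb=20, fb=200, gb=250, ld=100)   # 20 / 550
iffb_rate = iffb_pct(iffb=20, fb=200)             # 20 / 200
```

With the same 20 popups, a fly-ball pitcher and a ground-ball pitcher could have the same PU% but very different IFFB% figures, since only the denominator changes.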
Some more about correlations, in case anybody is unclear: a -0.5 correlation is not weaker than a +0.5 correlation; they’re the same strength, only in opposite directions. In a +0.5 correlation, when one stat gets higher, the other also tends to; in a -0.5 correlation, when one goes up, the other tends to go down. Notice I said “tends to”; in a +1.00 correlation, when one goes up, the other does go up, by a very predictable amount. But, unless you’re correlating a stat with itself (same season), you probably won’t see anything like a +1.00 correlation.
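If it helps to see the arithmetic, here is a minimal Python sketch of a standard Pearson correlation, which shows the sign-versus-strength point directly. It is an assumption that the spreadsheet computes this same coefficient; the function name and the toy numbers are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r_same_direction = pearson_r(xs, ys)               # positive: ys tends to rise with xs
r_flipped = pearson_r(xs, [-y for y in ys])        # same strength, opposite sign
r_with_itself = pearson_r(xs, xs)                  # a stat against itself
```

Flipping the sign of one stat flips the sign of the correlation without changing its size, and a stat correlated with itself comes out at +1.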
Here are some ideas of things to try:
- Clutch, year 0 vs. Clutch, year 1. You should see pretty much no connection at all, just a circle of points on the scatterplot.
- WPA, year 0 vs. WPA, year 1. Now, at least there’s a little bit of a correlation, and it’s more of an oval.
- vFA (pfx), year 0 vs. vFA (pfx), year 1. This is four-seam fastball velocity. It’s almost a straight line. You’re not going to find many stats more consistent than this one, year-to-year.
- ERA, year 0 vs. BB%, year 0. The correlation is surprisingly low (~0.2). Now set ERA to year 1: pretty much no correlation at all, at least until you raise the IP minimum a bit.
- ERA, year 0 vs. K%, year 0. OK, this is a pretty good correlation (about -0.5). When you set ERA to year 1, the correlation weakens, of course, but it’s still pretty decent (about -0.37; that’s actually stronger than ERA’s correlation to itself between years).
- K-BB% (something I added at the end), year 0 vs. ERA, year 0. This is even better than K%, at around a -0.56 correlation. However, when comparing it to next-year ERAs, it’s pretty much as predictive as simple K%, if not a little less so. I also added kwERA, which correlates the same as K-BB%, but in the opposite direction (the formula used is 5.40 - 12*((K-BB)/PA)).
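Here is a short Python sketch of the kwERA formula quoted in the last item, next to K-BB%. The function names and the sample totals are invented for illustration. Because kwERA is just K-BB% multiplied by a negative number and shifted by a constant, its correlation with any other stat has the same size as K-BB%'s and the opposite sign.

```python
def k_minus_bb_pct(k, bb, pa):
    """K-BB%: (strikeouts - walks) per plate appearance."""
    return (k - bb) / pa


def kwera(k, bb, pa):
    """kwERA using the formula quoted in the post: 5.40 - 12 * ((K - BB) / PA)."""
    return 5.40 - 12 * k_minus_bb_pct(k, bb, pa)


# Hypothetical season: 200 strikeouts and 50 walks over 800 plate appearances.
example_k_bb = k_minus_bb_pct(k=200, bb=50, pa=800)   # 0.1875
example_kwera = kwera(k=200, bb=50, pa=800)           # about 3.15
```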
At the end of the drop-down stat lists, I added some bonus stats:
- Foul%, which Christopher Carruthers brought up in a very interesting Community article, as did Russell Carleton/Pizza Cutter before him. I think there’s some potential for this stat. It seems to have a better connection to HR/FB than just about anything, for one (though it’s still a weak one).
- TIPS, the ERA estimator from Christopher’s article linked above.
- BERA, my ERA estimator from last offseason. It was designed to match long-term ERA as well as to be predictive of next-season ERA.
- SBERA, another of my ERA estimators. It was purely aimed at being predictive of next-season ERA.
- pFIP, by Glenn DuPaul, which is a differently-weighted FIP meant to be predictive of the next season.
- (Late addition): MBRAT, by Dan Greenlee, which includes the pitcher-related fielding stats rSB and rPM, also late additions to the spreadsheet.
So if you’re interested, you can take a look at all these ERA estimators (including FIP, xFIP, SIERA, and tERA), and see how they compare to ERAs of surrounding or same years, with different IP or season ranges. Here’s a comparison (out-of-sample) of the ERA estimators, matching each pitcher’s 2012 stat against their 2013 ERA (30 IP minimum):
Here are the next-season ERA correlations over the entire sample (2007-2013, 30+ IP):
The low and high correlation estimates are all at 95% confidence. There’s quite a bit of overlap between the 95% ranges of all of these, so it’s not exactly conclusive which is best, but it’s pretty clear that a lot of them are better predictors of next-year ERA than is ERA itself.
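For the sticklers: one common way to get low and high estimates like these is the Fisher z-transformation, sketched below in Python. Whether the spreadsheet uses this exact method is an assumption on my part here, and the sample size of 1000 is a hypothetical number chosen only to make the example run.

```python
import math


def correlation_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). The default z_crit gives a 95% interval.
    Requires n > 3 and -1 < r < 1.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    low = math.tanh(z - z_crit * se)
    high = math.tanh(z + z_crit * se)
    return low, high


# The K% (year 0) vs. ERA (year 1) correlation from earlier, with a hypothetical n.
low, high = correlation_ci(r=-0.375, n=1000)

# The same correlation from a much smaller hypothetical sample gives a wider range.
low_small, high_small = correlation_ci(r=-0.375, n=100)
```

The smaller the sample, the wider the range, which is why two estimators with similar correlations can be hard to separate.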
Well, I’ll leave you guys to it. Let us all know if you discover anything interesting!
Edit: per your requests, I’ve added the following stats:
- HBP% (hit by pitches per batter faced)
- SV% (saves/(saves + blown saves))
- RA9 (runs allowed per 9 IP)
Just for the heck of it, I also added ShO% and CG%, or shutouts and complete games per game started. I also replaced wins and losses with W/(W+L). To make room for these new stats, I had to cut out a lot of superfluous counting stats and rarely used pitch types (eephus, knuckle curve, knuckler, etc.). I’ll try to accommodate other requests… well, if I think they’re interesting, anyway.
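As a rough sketch, the added rate stats can be written out in Python like this. The field names and the sample season are hypothetical, IP is assumed to be in true decimal innings (so a third of an inning is 0.333, not the box-score ".1"), and returning None for an empty denominator is my own choice, not necessarily what the spreadsheet does.

```python
def ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero (e.g. no save chances)."""
    return numerator / denominator if denominator else None


def added_rate_stats(p):
    """Compute the newly added rate stats from a dict of season totals."""
    return {
        "HBP%": ratio(p["HBP"], p["TBF"]),          # hit by pitches per batter faced
        "SV%": ratio(p["SV"], p["SV"] + p["BS"]),   # saves / (saves + blown saves)
        "RA9": ratio(9 * p["R"], p["IP"]),          # runs allowed per 9 IP
        "ShO%": ratio(p["ShO"], p["GS"]),           # shutouts per game started
        "CG%": ratio(p["CG"], p["GS"]),             # complete games per game started
        "W/(W+L)": ratio(p["W"], p["W"] + p["L"]),  # winning percentage
    }


# Hypothetical starter's season totals.
starter = {"HBP": 5, "TBF": 800, "SV": 0, "BS": 0, "R": 80, "IP": 200.0,
           "ShO": 1, "CG": 2, "GS": 32, "W": 15, "L": 10}
stats = added_rate_stats(starter)
```

A starter with no save chances gets None for SV%, and a pure reliever would get None for ShO% and CG%.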
Edit #2: I added wOBA against as a stat.