- FanGraphs Baseball - http://www.fangraphs.com/blogs -

Tool: Basically Every Pitching Stat Correlation

In doing my research, I often like to look at correlations to get an idea of whether factors might be connected.  At the end of this season, I put together a spreadsheet to help me with that.  Well, I haven’t finished the research yet (FG+ subscribers will probably soon find out what’s been keeping me from it), but in the meantime, I thought I’d share what I hope will be a pretty handy tool for whoever out there might be interested in what lies a little beneath the surface of all these stats on FanGraphs.  And I do mean all of them.  Any pitching-related stat on FanGraphs should be represented in this tool.  You can compare one stat to another, or to itself in a different year.  Or, what the heck, you can even compare a stat to a different stat in a different year.  And, for you sticklers out there, it will even give you a confidence interval on these correlations (by default, a 95% interval: the range that the true correlation has a 95% chance of falling within).

What can you do with this?  Well, let’s say you want to see whether a stat is predictive of the next year’s ERA.  You could, for example, set Stat 1 to K% (after selecting the correct white box, type it in, or select from the drop-down list via the arrow to the right of the box), with the year set to 0 (meaning the present year), then set Stat 2 to ERA, with the year set to 1 (meaning the next year).  If you don’t change the IP or Season filters, you should see a correlation of -0.375.  That shows there’s a pretty decent connection between the two stats, in that if a pitcher has a high strikeout percentage in one season, he’ll likely have a low ERA the next (relative to the rest of the pitchers in the comparison).  If you change the year under ERA to 0, you’ll see the correlation gets stronger, whereas if you change it to 2 or 3, you’ll see it gets weaker.  That has a lot to do with the unpredictability of K%, and especially of ERA.  You’ll notice if you compare year 0 K% to year 1 K%, the correlation is a very strong 0.702, whereas if you do the same for ERA, it’s a moderate-to-weak 0.311.  Hopefully the graph will give you an idea of how strong those connections really are.
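
For anyone who wants to see the year-offset idea outside the spreadsheet, here is a minimal sketch in Python. It is not the spreadsheet's actual logic; the function names (`pearson`, `lagged_pairs`), the row layout, and the toy numbers are all invented for illustration. It pairs each pitcher's Year 0 stat with another stat from a later year, then computes an ordinary Pearson correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_pairs(rows, stat1, year1, stat2, year2):
    """Pair stat1 at (Year 0 + year1) with stat2 at (Year 0 + year2).

    Every existing pitcher-season is treated as a candidate Year 0, so a
    pitcher with no Year 0 row never contributes a pair.
    """
    by_key = {(r["pitcher"], r["season"]): r for r in rows}
    xs, ys = [], []
    for pitcher, season in by_key:
        a = by_key.get((pitcher, season + year1))
        b = by_key.get((pitcher, season + year2))
        if a is not None and b is not None:
            xs.append(a[stat1])
            ys.append(b[stat2])
    return xs, ys

# Toy data, made up for this example (not real pitchers or real stats).
rows = [
    {"pitcher": "A", "season": 2012, "K%": 0.28, "ERA": 2.90},
    {"pitcher": "A", "season": 2013, "K%": 0.27, "ERA": 3.10},
    {"pitcher": "B", "season": 2012, "K%": 0.20, "ERA": 3.80},
    {"pitcher": "B", "season": 2013, "K%": 0.21, "ERA": 3.70},
    {"pitcher": "C", "season": 2012, "K%": 0.15, "ERA": 4.50},
    {"pitcher": "C", "season": 2013, "K%": 0.16, "ERA": 4.60},
    {"pitcher": "D", "season": 2012, "K%": 0.18, "ERA": 4.10},  # no 2013 row
]

# Stat 1 = K% in Year 0, Stat 2 = ERA in Year 1.
xs, ys = lagged_pairs(rows, "K%", 0, "ERA", 1)
r = pearson(xs, ys)
print(f"pairs used: {len(xs)}, correlation: {r:.3f}")
```

Pitcher D drops out because there is no following season to pair with, which is the same kind of sample shrinkage you see in the tool when you look further ahead.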


About those filters: 30 innings pitched (in a season) is the minimum for this data set.  If you change either the minimum or maximum IP setting, it will apply to whichever seasons are in question.  However, the Season filters will only affect the range of Year 0s.  It’s assumed that you’re going to set either or both of Stats 1 and 2 to Year 0, and doing otherwise will limit your sample a bit (by cutting out those who didn’t pitch in Year 0).
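
The filter behavior described above can be sketched roughly like this. The function name, the argument layout, and the assumption that the IP limits are checked only against the seasons being compared are illustrative guesses, not a description of the spreadsheet's internals. The defaults mirror the data set described in this post (30 IP minimum, 2007-2013).

```python
def passes_filters(year0_season, compared_rows, min_ip=30.0,
                   max_ip=float("inf"), first_season=2007, last_season=2013):
    """Return True if one pitcher's comparison survives the filters.

    year0_season  -- the season acting as Year 0 for this comparison
    compared_rows -- the season rows that Stat 1 and Stat 2 are read from
                     (hypothetical layout: dicts with an "ip" key)
    """
    # Season filter: only the Year 0 season is checked against the range.
    if not (first_season <= year0_season <= last_season):
        return False
    # IP filter: every season in the comparison must fall inside the limits.
    return all(min_ip <= row["ip"] <= max_ip for row in compared_rows)

# Year 0 = 2012 (Stat 1) and Year 1 = 2013 (Stat 2), with invented innings totals.
print(passes_filters(2012, [{"ip": 180.0}, {"ip": 45.0}]))   # both seasons clear 30 IP
print(passes_filters(2012, [{"ip": 180.0}, {"ip": 25.0}]))   # the second season is too short
print(passes_filters(2012, [{"ip": 180.0}, {"ip": 45.0}], last_season=2011))
```

Note that the season range is compared with Year 0 only, so a Year 1 that falls past the end of the chosen range does not by itself exclude a pitcher.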

If you’re wondering, the “PU%” you see as one of the default stats is what I’ve been calling “Popup Percentage,” which I define as IFFB/Batted Balls.  [Edit: I made wOBA against the default instead of PU%, to show you all this new addition to the spreadsheet.]  You might think IFFB% would cover that, but IFFB% is actually IFFB/FB.  Batted Balls can be calculated by adding the three main batted ball types together (FB+GB+LD).  PU% is only weakly anticorrelated with the next year’s BABIP, but that’s actually a stronger correlation than BABIP has with itself, year-to-year.  It gets stronger when you raise the IP minimum, which helps weed out a bit of the randomness.
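
To make the difference between the two denominators concrete, here is a small sketch. The function names and the batted-ball counts are made up for illustration; only the two formulas come from the paragraph above.

```python
def popup_pct(iffb, fb, gb, ld):
    """PU% as defined above: infield fly balls per batted ball (FB + GB + LD)."""
    batted_balls = fb + gb + ld
    return iffb / batted_balls

def iffb_pct(iffb, fb):
    """IFFB%: infield fly balls per fly ball."""
    return iffb / fb

# Invented counts for one pitcher-season.
iffb, fb, gb, ld = 20, 200, 250, 100

print(f"PU%   = {popup_pct(iffb, fb, gb, ld):.3f}")  # 20 / 550
print(f"IFFB% = {iffb_pct(iffb, fb):.3f}")           # 20 / 200
```

Since fly balls are only one part of all batted balls, PU% will always come out at or below IFFB% for the same pitcher.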

Some more about correlations, in case anybody is unclear: a -0.5 correlation is not weaker than a +0.5 correlation; they’re the same strength, only in opposite directions.  In a +0.5 correlation, when one stat gets higher, the other also tends to; in a -0.5 correlation, when one goes up, the other tends to go down.  Notice I said “tends to”; in a +1.00 correlation, when one goes up, the other always goes up, by a perfectly predictable amount.  But, unless you’re correlating a stat with itself (same season), you probably won’t see anything like a +1.00 correlation.

Here are some ideas of things to try:

At the end of the drop-down stat lists, I added some bonus stats:

So if you’re interested, you can take a look at all these ERA estimators (including FIP, xFIP, SIERA, and tERA), and see how they compare to ERAs of surrounding or same years, with different IP or season ranges.  Here’s a comparison (out-of-sample) of the ERA estimators, matching each pitcher’s 2012 stat against their 2013 ERA (30 IP minimum):

ERA Estimator   Correlation   Low (95%)   High (95%)
SBERA           0.394         0.295       0.484
BERA            0.361         0.260       0.454
SIERA           0.356         0.254       0.449
MBRAT           0.335         0.232       0.430
pFIP            0.327         0.224       0.423
-K%             0.313         0.208       0.410
xFIP            0.311         0.206       0.408
TIPS            0.297         0.191       0.395
tERA            0.292         0.186       0.390
FIP             0.290         0.184       0.389
kwERA           0.281         0.175       0.381
ERA             0.238         0.130       0.341

Here are the next-season ERA correlations over the entire sample (2007-2013, 30+ IP):

ERA Estimator   Correlation   Low (95%)   High (95%)
SBERA           0.433         0.395       0.470
SIERA           0.417         0.378       0.455
BERA            0.408         0.369       0.446
MBRAT           0.402         0.362       0.440
pFIP            0.397         0.357       0.435
-K%             0.375         0.335       0.414
xFIP            0.372         0.331       0.411
kwERA           0.367         0.326       0.406
FIP             0.358         0.317       0.398
tERA            0.355         0.314       0.395
TIPS            0.334         0.293       0.375
ERA             0.311         0.269       0.352

The low and high correlation estimates are all at 95% confidence.  There’s quite a bit of overlap between the 95% ranges of all of these, so it’s not exactly conclusive which is best, but it’s pretty clear that a lot of them are better predictors of next-year ERA than is ERA itself.
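
For the sticklers: one standard way to get a confidence interval on a correlation is the Fisher z-transformation, sketched below. Whether the spreadsheet uses exactly this method is an assumption; the post doesn't say. The function name `correlation_ci` is invented, and the sample size in the example is hypothetical, back-solved from the table rather than taken from the data.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3), where n is the number of paired
    observations (n must be greater than 3).
    """
    z = atanh(r)
    standard_error = 1.0 / sqrt(n - 3)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return tanh(z - z_crit * standard_error), tanh(z + z_crit * standard_error)

# r = 0.394 is the 2012-to-2013 SBERA correlation from the table above;
# n = 308 is a hypothetical pitcher count, not a figure from the post.
low, high = correlation_ci(0.394, 308)
print(f"{low:.3f} to {high:.3f}")
```

With that assumed sample size, the interval lands close to the Low and High values in the SBERA row. The interval narrows as n grows, which fits the full 2007-2013 table having much tighter ranges than the single-year comparison.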

Well, I’ll leave you guys to it.  Let us all know if you discover anything interesting!

Edit: per your requests, I’ve added the following stats:

Just for the heck of it, I also added ShO% and CG%, or shutouts and complete games per game started.  I also replaced wins and losses with W/(W+L).  To make room for these new stats, I had to cut out a lot of superfluous counting stats and rarely used pitch types (eephus, knuckle curve, knuckler, etc.).  I’ll try to accommodate other requests… well, if I think they’re interesting, anyway.
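
The three rate stats mentioned above are simple ratios. Here is a sketch with invented function names and invented season totals; returning `None` when the denominator is zero (a reliever with no starts or no decisions, say) is a choice made for this example, not something the spreadsheet is known to do.

```python
def per_start(count, games_started):
    """ShO% or CG%: shutouts or complete games per game started."""
    return count / games_started if games_started else None

def win_pct(wins, losses):
    """W/(W+L): share of decisions that were wins."""
    decisions = wins + losses
    return wins / decisions if decisions else None

# Invented season line: 32 starts, 2 shutouts, 4 complete games, 15-5 record.
print(per_start(2, 32))   # ShO%
print(per_start(4, 32))   # CG%
print(win_pct(15, 5))     # W/(W+L)
```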

Edit #2: I added wOBA against as a stat.