# Tool: Basically Every Hitting Stat Correlation

By popular demand (OK, one guy asked for it), my first offering in the re-envisioned THT is this batting version of my pitching statistic correlation tool (newer version here). This tool will allow you to see, both graphically and in terms of a correlation figure, how any two of FanGraphs’ batting statistics collectively relate to each other. With it, you’ll even be able to compare a group of players’ stats in one year to their stats in a different year. Comparing a stat in Year 0 to a stat in Year 1, for example, is a good way to gauge how predictive the first stat could possibly be of the other (just remember, correlation does not necessarily imply causation).

Without further ado, here’s the tool:

The white cells you see in the tool are the ones that you should be playing with. The Statistic and Year cells can be changed either by drop-down lists or by typing the name of the statistic directly (in the web app, it should help you narrow your choices when you start typing). Data should be entered directly into the other white cells.

As for the filters, the default setting considers a batter’s season only if he had 300 or more plate appearances in that season; you can set that as low as 100 PA, or higher if you’d like. The default year range is 2007-2013, which can also be changed. Keep in mind that these years define the range of Year 0s, so you should have Stat 1 set to Year 0, or else you’ll be excluding some data you probably didn’t mean to. “Year 0” means the present season, “Year 1” the next season, and “Year -1” the previous season. The three filter categories at the bottom each have drop-down lists, allowing you to filter by three extra statistics of your choosing simultaneously.
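For readers who want to see the Year 0 / Year 1 idea outside the spreadsheet, here is a minimal sketch of the pairing step. The row layout, the `pair_seasons` name, and the toy numbers are all invented for illustration, and it assumes the PA minimum applies to both seasons of a pair, which is one reading of the filter described above rather than a confirmed detail of the tool.

```python
def pair_seasons(rows, stat0, stat1, offset=1, min_pa=300,
                 first_year=2007, last_year=2013):
    """Pair each player's stat0 in Year 0 with stat1 in Year 0 + offset.

    rows: iterable of dicts with 'player', 'year', 'PA' and stat columns
    (a hypothetical layout, not the tool's actual Data sheet).
    """
    # Index qualifying seasons by (player, year); seasons under min_pa are dropped.
    by_key = {(r["player"], r["year"]): r for r in rows if r["PA"] >= min_pa}
    pairs = []
    for (player, year), season0 in by_key.items():
        if not first_year <= year <= last_year:
            continue  # the year range constrains Year 0 only
        season1 = by_key.get((player, year + offset))
        if season1 is not None:
            pairs.append((season0[stat0], season1[stat1]))
    return pairs


# Invented toy data, not real player seasons.
rows = [
    {"player": "A", "year": 2012, "PA": 600, "PU%": 0.05, "BABIP": 0.280},
    {"player": "A", "year": 2013, "PA": 550, "PU%": 0.06, "BABIP": 0.270},
    {"player": "B", "year": 2012, "PA": 120, "PU%": 0.02, "BABIP": 0.330},
    {"player": "B", "year": 2013, "PA": 500, "PU%": 0.03, "BABIP": 0.310},
]

# Only player A's 2012 season has a qualifying following season.
pairs = pair_seasons(rows, "PU%", "BABIP")
```

Setting `offset=0` gives the same-year comparison, and `offset=-1` looks back at the previous season.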

A quick refresher on correlations: they range between -1 and 1. A correlation of 1 means that when one stat goes up, so does the other, in a straight line on a graph like the type you see above. Correlate a stat to itself in the same year and you’ll see a correlation of 1; for something more useful, try to correlate same-year OPS and wOBA – it should be 0.993, and pretty dang close to a straight line.

A correlation of -1 should also appear as a straight line, except that it slopes downward instead of upward; the two stats move in opposite directions. You’d get this if you correlated a stat to the negative of itself, for some strange reason. For something more practical, try K% vs. Contact% in the same year, which should come in at a very strong -0.888.

A correlation of 0 suggests that there’s probably no relationship between the two stats, although it is possible for there to be an interesting relationship that escapes the correlation calculation. The graph will be harder to fool, however, so you may want to keep an eye out for strange patterns you see on it.
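The three cases in this refresher can be reproduced with a few lines of code. This is a sketch of the standard Pearson correlation, which is presumably what the tool reports (the article doesn't name the formula), run on made-up numbers rather than baseball data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-2, -1, 0, 1, 2]

same = pearson(xs, xs)                    # a stat against itself: about 1
opposite = pearson(xs, [-x for x in xs])  # against its negative: about -1
curved = pearson(xs, [x * x for x in xs]) # perfect U-shape: about 0
```

The last case is the "interesting relationship that escapes the correlation calculation": the second list is completely determined by the first, yet the correlation comes out at zero because the relationship isn't a straight line.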

The *Confidence Level* box can also be changed. By default, it’s set to provide the estimated boundaries between which the true correlation is 95% likely to lie. You’ll see these below it.
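The article doesn't say how the tool computes these boundaries. One common approach is the Fisher z-transformation, sketched below as an assumption rather than a description of the spreadsheet's formula. The sample size of 1,000 player-seasons is also made up for the example.

```python
import math
from statistics import NormalDist


def correlation_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation r from n pairs.

    Uses the Fisher z-transformation (an assumed method, not confirmed
    to be what the tool does).
    """
    z = math.atanh(r)                 # transform r to a roughly normal scale
    std_error = 1 / math.sqrt(n - 3)  # standard error on that scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.tanh(z - crit * std_error), math.tanh(z + crit * std_error)


# -0.386 is the default PU% vs. next-year BABIP correlation; n is hypothetical.
low, high = correlation_interval(-0.386, 1000)
```

Smaller samples give wider intervals, which is worth remembering when the filters cut the player pool down.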

**An Exercise in Batted Ball and BABIP Correlational Analysis**

By default, you’ll see a comparison on the tool between batters’ PU% in one year and their BABIP in the next. PU%, if you’re confused, is Pop-Up percentage, my unofficial name for infield fly balls per batted ball (*batted ball* being defined as FB+LD+GB), as opposed to the official stat IFFB%, which is infield fly balls per *fly ball*. What you’ll notice is that PU% does indeed appear to be fairly predictive of BABIP, in that batters who pop the ball up a lot in one year will tend to have a low BABIP in the next (the correlation is -0.386 in the default sample). Makes sense, right? Of course, it helps a lot that PU% is a fairly predictable stat, with a year-to-year correlation around 0.638, as you can see. For comparison, LD%—line drives per batted ball—has only a 0.366 YTY correlation, while BABIP’s is 0.370. To summarize:

Correlation with BABIP, by Year

Statistic | BABIP, Year 0 (Same Year) | BABIP, Year 1 (Next Year) | YTY Correlation (with itself) |
---|---|---|---|
PU% | -0.468 | -0.386 | 0.638 |
OFFB% | -0.262 | -0.213 | 0.754 |
LD% | 0.418 | 0.187 | 0.366 |
GB% | 0.192 | 0.226 | 0.788 |
FB% | -0.356 | -0.288 | 0.789 |
IFFB% | -0.416 | -0.350 | 0.555 |
BABIP | 1.000 | 0.370 | 0.370 |

So, although LD% is a significant factor in same-season BABIP, its relative unpredictability makes it a much less reliable indicator of true-talent BABIP skills than PU%. This is also the case with pitchers, whose BABIPs are of course even less predictable.
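Since PU% and IFFB% are easy to mix up, here is a small sketch of the two definitions given above. The function name and the batted ball counts are invented for illustration. `fb` counts all fly balls, infield flies included.

```python
def pop_up_rates(iffb, fb, ld, gb):
    """Return PU% and IFFB% as fractions from raw batted ball counts."""
    batted_balls = fb + ld + gb  # "batted ball" as defined in the article
    return {
        "PU%": iffb / batted_balls,  # infield flies per batted ball
        "IFFB%": iffb / fb,          # infield flies per fly ball
    }


# Hypothetical season: 150 fly balls (15 of them infield flies),
# 90 line drives, 210 ground balls.
rates = pop_up_rates(iffb=15, fb=150, ld=90, gb=210)
```

Because the two stats share a numerator, PU% is just IFFB% multiplied by FB%, so a batter can post a high IFFB% and still a low PU% if he rarely hits the ball in the air.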

If you’re curious, here are 2013’s relevant facts for each basic type of batted ball, straight from the league splits on FanGraphs:

Batted Ball Statistics, 2013

Type | BABIP | AVG | SLG | ISO | wOBA |
---|---|---|---|---|---|
Line Drives | 0.683 | 0.685 | 0.878 | 0.193 | 0.681 |
Ground Balls | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 |
Fly Balls | 0.124 | 0.213 | 0.616 | 0.403 | 0.346 |

The low BABIP of fly balls in general might lead you to believe they are less desirable for a hitter than ground balls. Don’t forget, though, that **home runs are excluded from consideration in BABIP**, meaning the batting average of a power-hitting fly ball hitter probably isn’t going to suffer as much as you might think. Clearly, line drives get the best results, being low-risk and very high-reward. Meanwhile, ground balls are medium-risk, low-reward, and fly balls are high-risk, high-reward; on average, though, FBs are preferable to GBs, as wOBA demonstrates. That’s not even taking into account the increased risk of double plays that comes with ground balls.

As a little bonus, here’s something I queried off of FanGraphs’ top-secret database: a more in-depth breakdown that uses more distinct batted ball types:

Batted Ball Statistics, 2013

Type | BABIP | AVG | SLG | ISO | wOBA | 1B% | 2B% | 3B% | HR% |
---|---|---|---|---|---|---|---|---|---|
IFFB | 0.004 | 0.004 | 0.005 | 0.001 | 0.004 | 0.3% | 0.1% | 0.0% | 0.0% |
OFFB | 0.049 | 0.155 | 0.531 | 0.376 | 0.288 | 0.8% | 2.8% | 0.7% | 11.1% |
FlinerF | 0.280 | 0.362 | 0.889 | 0.528 | 0.530 | 7.7% | 15.5% | 1.5% | 11.4% |
FlinerL | 0.627 | 0.631 | 0.870 | 0.240 | 0.652 | 42.9% | 17.5% | 1.6% | 1.1% |
LD | 0.746 | 0.746 | 0.883 | 0.138 | 0.715 | 61.4% | 12.6% | 0.6% | 0.0% |
GB | 0.232 | 0.232 | 0.250 | 0.018 | 0.213 | 21.5% | 1.6% | 0.1% | 0.0% |

In this classification system, the two types of “Fliners” are somewhere between fly balls and line drives, and there’s no overlap between the classifications. Relating these to what you see on FanGraphs: IFFB, OFFB, and FlinerF are all counted towards FB, while FlinerL and LD are counted towards LD.

Here, OFFBs are the really high outfield flies which—if they don’t clear the fences—are going to be caught 95.1% of the time. But home runs do occur on 11.1% of these high outfield flies, so you can’t discount them. Remember that these numbers are just averages; for a powerless batter, OFFBs are likely going to be a really bad thing; for a power hitter, they might actually be good. And try not to be confused—in this article’s correlation tool, FlinerFs are included as part of “OFFB.” I’m just not sure if it’s alright for me to let the details of this system out of the bag, unfortunately.
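To see how the hit-type percentages in the detailed breakdown relate to the rate stats beside them, here is a rough reconstruction. It assumes every batted ball counts as an at-bat (so sacrifice flies are ignored) and works from the rounded percentages, which means the results only approximately match the table. The function name is made up.

```python
def line_from_hit_rates(single, double, triple, homer):
    """Approximate AVG/SLG/ISO/BABIP for one batted ball type.

    Inputs are fractions of batted balls of that type (0.111 for 11.1%).
    Assumes each batted ball is an at-bat; sacrifice flies are ignored.
    """
    hits = single + double + triple + homer
    slg = single + 2 * double + 3 * triple + 4 * homer
    # BABIP drops home runs from both the hits and the chances.
    babip = (hits - homer) / (1 - homer)
    return {"AVG": hits, "SLG": slg, "ISO": slg - hits, "BABIP": babip}


# OFFB row: 0.8% singles, 2.8% doubles, 0.7% triples, 11.1% home runs.
offb = line_from_hit_rates(0.008, 0.028, 0.007, 0.111)

# Share of in-the-park OFFBs that do not go for hits.
offb_out_share = 1 - offb["BABIP"]
```

The reconstructed OFFB line lands within a couple of points of the table's values, and the out share comes in right around the 95.1% figure quoted above.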

OK, now forget I mentioned all that stuff about fliners, because I’m going to be referring to the standard FanGraphs batted ball classifications from now on.

Back to BABIP: the main point of it is not to directly value a player, but to be an indicator of how lucky the player was. Skill does come into play, however, especially in the case of batters. But let’s take a look at how batted ball types correlate with a bonus stat I added into the correlation tool: Hits/Batted Ball (let’s call it H/BatBall for short), which is hits divided by the sum of fly balls, line drives, and ground balls.

H/BatBall Correlations

Statistic | H/BatBall, Year 0 (Same Year) | H/BatBall, Year 1 (Next Year) | YTY Correlation (with itself) | BABIP, Year 0 (Same Year) | BABIP, Year 1 (Next Year) |
---|---|---|---|---|---|
PU% | -0.343 | -0.265 | 0.638 | -0.468 | -0.386 |
OFFB% | 0.006 | 0.030 | 0.754 | -0.262 | -0.213 |
LD% | 0.289 | 0.104 | 0.366 | 0.418 | 0.187 |
GB% | -0.034 | 0.004 | 0.788 | 0.192 | 0.226 |
FB% | -0.089 | -0.046 | 0.789 | -0.356 | -0.288 |
IFFB% | -0.370 | -0.296 | 0.555 | -0.416 | -0.350 |
BABIP | 0.894 | 0.315 | 0.370 | 1.000 | 0.370 |
H/BatBall | 1.000 | 0.420 | 0.420 | 0.894 | 0.315 |
HR/FB | 0.466 | 0.323 | 0.706 | 0.075 | 0.038 |

So, with home runs back in the equation, most of the batted ball types’ power to predict the chance of getting a hit on a batted ball completely disappears, except for pop-ups and maybe (a little bit) line drives. Also notice that HR/FB, while apparently useless for BABIP, is an important predictor of next-year H/BatBall. Not surprisingly, HR/FB is also a good predictor of wOBA (0.444 YTY correlation).
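The difference between the two stats comes down to how home runs are treated, which a short sketch makes concrete. The BABIP formula is the standard one; the H/BatBall function follows the definition given earlier in the article. The season line is invented and ignores bunts.

```python
def babip(h, hr, ab, k, sf):
    """Standard BABIP: home runs removed from both hits and chances."""
    return (h - hr) / (ab - k - hr + sf)


def hits_per_batted_ball(h, fb, ld, gb):
    """H/BatBall as defined in the article: all hits per FB + LD + GB."""
    return h / (fb + ld + gb)


# Hypothetical season: 550 AB, 150 H, 25 HR, 110 K, 5 SF,
# with 160 fly balls, 90 line drives and 195 ground balls.
season_babip = babip(h=150, hr=25, ab=550, k=110, sf=5)
season_h_batball = hits_per_batted_ball(h=150, fb=160, ld=90, gb=195)
```

For a hitter with real power, H/BatBall sits well above BABIP, which is why HR/FB matters so much more for one than for the other.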

There are some interesting interactions here that take a multiple regression to weed out, though. Remember how I just said HR/FB is apparently useless for BABIP? Regression begs to differ; it outputs a formula for expected next-year BABIP of:

xBABIP = 0.083*HR/FB + 0.1*LD% – 0.55*PU% – 0.013*OFFB% + 0.007*Spd*GB% + 0.283

This formula has a 0.437 correlation with next-season BABIP, and 0.573 with same-season BABIP. More details on the factors:

Predictive Factors of BABIP

Statistic | Coefficient | Std Error | t Stat | P-value | Lower 50% | Upper 50% | Lower 95% | Upper 95% |
---|---|---|---|---|---|---|---|---|
Intercept | 0.283 | 0.011 | 24.758 | 6.50E-110 | 0.275 | 0.290 | 0.260 | 0.305 |
LD% | 0.100 | 0.034 | 2.932 | 0.003432 | 0.077 | 0.123 | 0.033 | 0.167 |
PU% | -0.546 | 0.053 | -10.325 | 5.14E-24 | -0.582 | -0.510 | -0.650 | -0.442 |
Spd*GB% | 0.007 | 0.001 | 6.373 | 2.63E-10 | 0.007 | 0.008 | 0.005 | 0.010 |
OFFB% | -0.013 | 0.019 | -0.666 | 0.505428 | -0.025 | 0.000 | -0.050 | 0.025 |
HR/FB | 0.083 | 0.017 | 4.866 | 1.29E-06 | 0.072 | 0.095 | 0.050 | 0.117 |
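The t Stat and interval columns in the table above follow from the coefficient and its standard error. The sketch below uses a normal approximation, which is an assumption (regression output normally uses the t distribution, though the two are nearly identical for a sample this large), and it works from the rounded table values, so the results match only approximately.

```python
from statistics import NormalDist


def coefficient_summary(coef, std_error, confidence):
    """t statistic and approximate interval for a regression coefficient."""
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # normal approximation
    return {
        "t": coef / std_error,
        "lower": coef - crit * std_error,
        "upper": coef + crit * std_error,
    }


# PU% row: coefficient -0.546, standard error 0.053.
pu_95 = coefficient_summary(-0.546, 0.053, 0.95)
pu_50 = coefficient_summary(-0.546, 0.053, 0.50)
```

An interval that straddles zero, as the 95% range for OFFB% does, is the sign that a factor may not be contributing anything.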

Translation: OFFB% probably doesn’t matter, but the other factors pretty certainly do, especially PU%, followed by Spd*GB% (well, Spd itself works almost as well, leaving GB% out entirely), then HR/FB, then LD%. So, you can cut out OFFB% to make:

**xBABIP = 0.08*HR/FB + 0.1*LD% – 0.56*PU% + 0.008*Spd*GB% + 0.278**

…which is practically equally good, with a 0.436 correlation to next-year BABIP.
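As a quick way to try the trimmed-down formula, here it is as a function. The article doesn't spell out the units, so this sketch assumes the rate stats are entered as fractions (0.20 for a 20% line drive rate) and Spd on its usual FanGraphs scale; the example inputs are made up.

```python
def xbabip(hr_fb, ld_pct, pu_pct, spd, gb_pct):
    """Expected next-year BABIP from the simplified formula above.

    Rates are assumed to be fractions (e.g. 0.20), not percentages.
    """
    return (0.08 * hr_fb
            + 0.1 * ld_pct
            - 0.56 * pu_pct
            + 0.008 * spd * gb_pct
            + 0.278)


# Hypothetical batter: 10% HR/FB, 20% LD%, 3.5% PU%, 4.5 Spd, 44% GB%.
typical = xbabip(hr_fb=0.10, ld_pct=0.20, pu_pct=0.035, spd=4.5, gb_pct=0.44)

# Same batter with twice as many pop-ups.
pop_up_prone = xbabip(hr_fb=0.10, ld_pct=0.20, pu_pct=0.07, spd=4.5, gb_pct=0.44)
```

With inputs like these the result lands near .300, and doubling the pop-up rate knocks roughly 20 points off it.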

It might also be a good idea to add current BABIP itself to the equation, to possibly help capture that certain je ne sais quoi about a batter’s BABIP, if simply predicting the next year is the goal. Handedness is likely significant as well. But I’ll save that for another time.

Well, hopefully I’ve given you all enough to play with and to think about for today. Tell us in the comments if you find out something interesting from your experiments!

Good God.

Sorry and/or thank you?

“Good God” = what a non-rocket scientist says after hearing a rocket scientist speak. LOL

I’m really happy I stumbled upon this. The examples you provided are EXACTLY what I was gathering info on to figure out the past couple of days. I expected I’d have to do the work myself, but you did a lot of it for me and gave me a tool to do the rest. It made me say Good God too, but I’ll get all of what you said there figured out eventually. Thanks! 🙂

Ha, OK, thank you. Glad you enjoy.

Plotting OBP vs Age, same year, I don’t see any evidence that performance declines with age. The highest OBP on the chart is someone over 40. What stat should I be looking at to see the alleged decline with age that people seem to always talk about?

Better yet would be to download the spreadsheet, insert a column next to OBP in the ‘Data’ sheet, and divide the player’s OBP by the league OBP in that season (by creating a table that contains the league OBP in each year and doing a VLOOKUP on it). OBP relative to league average would make a better basis for the comparisons than OBP itself, as OBP has been on the decline since 2007 (probably due to increasing K rates, mainly).

This is gold. Absolute gold. After building financial models and regression tools for my firm for the last 2 years I’ve wanted so bad to have one at my disposal for baseball but didn’t have the dataset/willpower after 5pm to get it done. THANK YOU for doing this. I’m already overturning my previous-thought assumptions.

You’re very welcome! The data all comes from fangraphs’ custom leaderboards, by the way.

Steve,

Phenomenal work – a clarification please. Is HR/FB based on all FBs or just OFFB (so, excluding IFFB). Also, are total season speed scores published anywhere for all players ?

Thanks.

Thank you!

HR/FB is a standard batted ball stat on FanGraphs, and it is based on all fly balls (though HR/OFFB might make more sense): http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=2&season=2013&month=0&season1=2013&ind=0&team=0&rost=0&age=0&filter=&players=0

Spd is published under ‘Advanced’: http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=1&season=2013&month=0&season1=2013&ind=0&team=0&rost=0&age=0&filter=&players=0

Thanks for the clarification. I agree – it should be HR/OFFB. Using total FB probably double counts infield flies since their negative impact should already be accounted for in the PU category. OK, so a new homework for you ??

OK, I just now stuck both OFFB% and HR/OFFB stats into the Data sheet. Please let me know if you notice that I broke something in doing so.

Where’s the data sheet ?

See the tab at the bottom — between ‘Main’ and ‘Calcs’? That’s where you can add whatever stats you want, and they’ll then show up as options in the drop-down lists on the ‘Main’ tab.

Is there any way to include Jeff Zimmerman’s fly ball distance (from baseballheatmaps.com) into these correlation tools (both pitcher and hitter)?

Great idea! I stuck ‘FB Distance’ and ‘FB Angle’ in this one just now, with what I could gather from Jeff’s site. There’s some missing data there, but Jeff is going to send me his data when he gets the chance, and I’ll update it.

Interesting, but at age 88 I do not comprehend it. The most important stat to me is the Game-Winning RBI, the one that put the team ahead to stay. Or it could be broken down to “RBIs That Put His Team in a Tie or Ahead.”

–Stay tuned.

Hey Steve, I just want to throw in my two cents and say this is terrific. Thanks for making available to everyone (and for being so responsive to comments).

Thanks Dave!

Nice article! Stat question though: when you created these formulas at the end, did you do any kind of forward or backward selection? Meaning, were all these predictor variables reasonably independent of the other predictors? I was curious why HR/FB would be included in the model if before it showed very little correlation. Would it be correct to assume that while the correlation was low, it accounted for a unique part of the xBABIP variance? Thanks!

I still love this. A lot.

Bump to my previous comment…

Also, is there a way to correlate batters’ stats vs. PITCHERS’ stats? The correlation between the various batters’ stats to themselves is very fascinating, but one wonders which pitchers’ stats give the best correlations/predictors to various hitters’ stats.

Thank you for a fabulous resource.

J

This doesn’t seem to be working anymore. I love the tool, could someone find a way around the 5Mb limit?