## Predicting Home Runs Per Fly Ball, The Next Step

A year ago, I discovered how highly correlated a hitter’s average home run and fly ball distance is with his HR/FB rate. Chad Young and I then embarked on a quest to use an assortment of data, including this batted ball distance, to construct an expected HR/FB rate, or xHR/FB rate, metric. Unfortunately, we failed to find an equation much better than the one that used just distance, which had an R-squared of just 0.54. While this was an excellent start, it simply wasn’t good enough to use in place of plain old HR/FB rate.

Thanks to Jeff Zimmerman, whose Baseball Heat Maps site inspired this quest in the first place, I have been provided with a wealth of additional data. The hope was that it included another piece or set of pieces to the HR/FB rate puzzle.

I began with a player population of 4,985 hitter seasons from 2008-2013, which also included pitchers’ trips to the plate. To prevent the results from being skewed by the randomness in the smaller samples, I removed all player seasons with fewer than 20 total home runs and fly balls. This left me with a pool of 2,645 hitter seasons ready for analysis.
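For anyone who wants to replicate the sample cut, here is a minimal Python sketch of it. The `HitterSeason` record, its field names, and the three sample rows are all made up for illustration, and it assumes home runs and fly balls are stored as separate counts.

```python
from dataclasses import dataclass


@dataclass
class HitterSeason:
    # Hypothetical record layout; the real data set's columns are not shown in the article.
    player: str
    year: int
    home_runs: int
    fly_balls: int  # assumed here to exclude home runs


MIN_HR_PLUS_FB = 20  # threshold described in the article


def qualifies(season, minimum=MIN_HR_PLUS_FB):
    """Keep a season only if it has at least `minimum` home runs plus fly balls."""
    return season.home_runs + season.fly_balls >= minimum


# Invented rows, purely to exercise the filter.
seasons = [
    HitterSeason("Hitter A", 2013, 25, 110),
    HitterSeason("Pitcher B", 2013, 0, 6),
    HitterSeason("Hitter C", 2012, 3, 17),  # exactly 20, so it stays
]

qualified = [s for s in seasons if qualifies(s)]
print([s.player for s in qualified])
```

Note that "fewer than 20" means a season sitting exactly at 20 survives the cut.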

Let us begin with a correlation table:

Metric | Correlation with HR/FB | R-squared |
---|---|---|
avg_dist | 0.74 | 0.54 |
stddev_dist | 0.68 | 0.46 |
max_dist | 0.55 | 0.30 |
avg_abs_angle | 0.13 | 0.02 |
stddev_angle | 0.09 | 0.01 |
avg_angle | -0.02 | 0.00 |
abs_avg_angle | -0.14 | 0.02 |
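
For readers newer to these columns, the R-squared figure is just the square of the unrounded correlation. A minimal Python sketch of both calculations follows; the `pearson_r` helper and the five data points are invented for illustration and are not drawn from the actual data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up hitter seasons: average distance in feet and HR/FB as a fraction.
avg_dist = [265.0, 280.0, 290.0, 300.0, 310.0]
hr_fb = [0.05, 0.09, 0.10, 0.16, 0.20]

r = pearson_r(avg_dist, hr_fb)
r_squared = r ** 2
print(f"r = {r:.2f}, R-squared = {r_squared:.2f}")
```

Squaring also explains why a negative correlation such as the one for abs_avg_angle still shows a positive R-squared.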

**Avg Dist** – average home run and fly ball distance; a correlation nearly identical to the one found in the original study and therefore not a surprise

**Std Dev Dist** – standard deviation of the home run and fly ball distances; this is one of the data points that we figured would be essential to constructing an xHR/FB rate metric, and sure enough it has the second highest correlation

**Max Dist** – the furthest distance reached out of all home runs and fly balls; the third highest correlation and not surprisingly it has a 0.65 correlation with Avg Dist, so given the potential for multicollinearity with Avg Dist, I didn’t test it in a regression

**Avg Absolute Angle** – average of the absolute values of the angles of all home runs and fly balls; a slight positive correlation, but I expected it to be higher as it’s easier to hit a home run over the closest fence, which is typically down the lines

**Std Dev Angle** – standard deviation of the angles of all home runs and fly balls; a slightly positive correlation, but fairly insignificant

**Avg Angle** – average of the angles of all home runs and fly balls; oddly negative, but so small as to be completely insignificant

**Absolute Avg Angle** – absolute value of the average angle above; also strangely negative, nearly the exact opposite of the Avg Absolute Angle
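
To make the seven definitions concrete, here is a rough Python sketch that computes each one from a single hitter's batted balls. It assumes distances are in feet, angles are in degrees with 0 at straightaway center and opposite signs for left and right field, and that the sample standard deviation is the one intended; none of those details are confirmed by the data source. The function name and the three batted balls are hypothetical.

```python
import statistics


def batted_ball_profile(distances, angles):
    """Compute the seven metrics for one hitter season.

    distances: home run and fly ball distances (assumed feet)
    angles: horizontal angles (assumed degrees, 0 = center field)
    """
    avg_angle = statistics.mean(angles)
    return {
        "avg_dist": statistics.mean(distances),
        "stddev_dist": statistics.stdev(distances),  # sample std dev (assumption)
        "max_dist": max(distances),
        "avg_abs_angle": statistics.mean([abs(a) for a in angles]),
        "stddev_angle": statistics.stdev(angles),
        "avg_angle": avg_angle,
        "abs_avg_angle": abs(avg_angle),
    }


# Three invented batted balls: one pulled hard one way, two the other way.
profile = batted_ball_profile([300, 350, 400], [-30, 10, 20])
for name, value in profile.items():
    print(f"{name}: {value:.1f}")
```

In this toy case the angles cancel out, so Avg Angle and Absolute Avg Angle are both 0 while Avg Absolute Angle is 20, which shows why the last two metrics are not interchangeable.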

Basic logic and the correlations of HR/FB with the various metrics above suggest that any regression equation must include Avg Dist and probably Std Dev Dist as well. The Std Dev Dist is important here, because a hitter who hits two fly balls that travel 325 feet each is likely going to end up with no home runs, while a hitter who knocks one ball 450 feet and another 200 feet will certainly be rewarded with a dinger. Both pairs result in the same Avg Dist, but the latter hitter has a higher Std Dev Dist. That is why the metric is a significant piece.
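
The two-hitter example works out as follows in a quick Python check. The variable names are made up, and the sample standard deviation is used here as an assumption, since the article does not say which version was calculated.

```python
import statistics

steady = [325, 325]        # two routine fly balls
boom_or_bust = [450, 200]  # one no-doubt homer, one pop-up

for label, distances in (("steady", steady), ("boom_or_bust", boom_or_bust)):
    avg = statistics.mean(distances)
    spread = statistics.stdev(distances)  # sample std dev (assumption)
    print(f"{label}: Avg Dist = {avg}, Std Dev Dist = {spread:.1f}")
```

Both hitters average 325 feet, so only the spread separates them.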

If we include both Avg Dist and Std Dev Dist in a regression, we see a significant improvement over the 0.544 R-squared from just using Avg Dist. The adjusted R-squared jumps to 0.629.
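
Adjusted R-squared penalizes a regression for each extra variable, which is why it is the fairer number to quote as components get added. Here is a minimal sketch of the standard formula; the raw R-squared of 0.63 fed into it is a placeholder, since the unadjusted figure isn't reported here, and it assumes all 2,645 hitter seasons went into the regression.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# 0.63 is a hypothetical raw R-squared; 2,645 seasons and 2 predictors
# (Avg Dist, Std Dev Dist) follow the setup described in the article.
adj = adjusted_r_squared(0.63, n_obs=2645, n_predictors=2)
print(f"adjusted R-squared: {adj:.4f}")
```

With a sample this large the penalty is tiny, so the adjusted and raw figures will be nearly identical.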

But we are not done yet. Shouldn’t angle play some sort of role? Despite the correlations for all the angle-related metrics failing to be very significant, it is obvious that angle matters. The Avg Abs Angle proved to be the best metric to incorporate and it also passes the common sense test.

Adding the Avg Abs Angle into a regression with the pair of metrics already included improves the equation only slightly, but meaningfully. Unfortunately, at the lowest extreme, the equation is capable of producing a negative xHR/FB rate. So to fix this issue, I simply turned the regression equation into an IF function whereby if the result is less than 0%, then the output should be 0%.

Here is a plot of the results:

The adjusted R-squared for the equation with all three components that produces the above chart is 0.649. Clearly, there is still even more to the story, but this is a nice jump from where we started. The dimensions of the ballparks each hitter has batted in during the season certainly play a role, but this information isn’t easily accessible and is too difficult to factor into an equation.

Another issue I mentioned in my original article was that the old xHR/FB rates, calculated from just a hitter’s Avg Dist, were much too low on the top end. For example, Matt Kemp led baseball in Avg Dist in 2012, yet the old xHR/FB rate gave us a mark of just 19.7%. That means that just below 20% was the highest xHR/FB rate the equation could possibly spit out for the 2012 season, which is far too low.

Using my data set, I found the maximums for each metric to put together the ultimate home run hitter and test the ceiling for this new and (hopefully) improved equation. Since 2008, the highest xHR/FB rate this equation can produce based on the maximums of each metric actually posted is 36.8%. Beautiful. Using the old equation with this data set would produce just a 23.6% xHR/FB rate at its peak. So we no longer have to worry about the equation underestimating the best of the best.
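
The ceiling test can be sketched in a few lines of Python: take the largest value posted in each column, even if they came from different hitters, and run that composite through the equation. The coefficients below are the ones published further down in this article; the three seasons, the dictionary keys, and the function names are invented for illustration.

```python
def linear_xhr_fb(avg_dist, avg_abs_angle, stddev_dist):
    """The article's published equation, floored at zero."""
    raw = (-0.8895 + 0.0025 * avg_dist
           + 0.0048 * avg_abs_angle + 0.0038 * stddev_dist)
    return max(0.0, raw)


def ceiling_rate(seasons):
    """Build a composite hitter from each metric's maximum and score it."""
    keys = ("avg_dist", "avg_abs_angle", "stddev_dist")
    best = {key: max(season[key] for season in seasons) for key in keys}
    return linear_xhr_fb(**best), best


# Hypothetical seasons; the real maximums are not listed in the article.
seasons = [
    {"avg_dist": 310, "avg_abs_angle": 18, "stddev_dist": 60},
    {"avg_dist": 295, "avg_abs_angle": 26, "stddev_dist": 55},
    {"avg_dist": 280, "avg_abs_angle": 22, "stddev_dist": 70},
]

ceiling, best = ceiling_rate(seasons)
print(best, f"ceiling = {ceiling:.1%}")
```

Because every coefficient is positive, this composite is guaranteed to score at least as high as any real season in the set.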

Finally, the new and improved xHR/FB rate equation:

**xHR/FB = -0.8895 + (Avg Dist * 0.0025) + (Avg Absolute Angle * 0.0048) + (Std Dev Dist * 0.0038)**
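
For those who would rather not build the spreadsheet formula, here is the same equation as a short Python function, including the floor at 0% described earlier. It assumes distances are in feet, the angle is in degrees, and the output is a fraction (0.14 means 14%); the function name and the two sample hitters are made up.

```python
def xhr_fb(avg_dist, avg_abs_angle, stddev_dist):
    """Expected HR/FB rate as a fraction, never allowed below zero."""
    raw = (-0.8895
           + avg_dist * 0.0025
           + avg_abs_angle * 0.0048
           + stddev_dist * 0.0038)
    # Same idea as the IF function: negative results become 0%.
    return max(0.0, raw)


# Two invented profiles: a solid power hitter and a very weak one.
slugger = xhr_fb(avg_dist=290, avg_abs_angle=20, stddev_dist=55)
slap_hitter = xhr_fb(avg_dist=240, avg_abs_angle=15, stddev_dist=40)
print(f"slugger: {slugger:.1%}, slap hitter: {slap_hitter:.1%}")
```

The second profile would come out at roughly -6.6% without the floor, which is exactly the case the IF function exists to catch.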

Now let’s put this into action and compare 2013 HR/FB rates to the xHR/FB rates calculated from this equation. Below is a spreadsheet that compares each hitter’s HR/FB rate with his xHR/FB rate. The first tab is the list sorted by who overperformed the most, while the second tab is sorted in descending order of underperformance.
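
The two tabs boil down to one subtraction and two sorts. Here is a small Python sketch of that, using only the two pairs of rates quoted later in this article (Khris Davis and Manny Machado); the `RateLine` structure and its field names are hypothetical.

```python
from typing import NamedTuple


class RateLine(NamedTuple):
    hitter: str
    hr_fb: float   # actual 2013 HR/FB rate
    xhr_fb: float  # rate expected by the equation

    @property
    def gap(self):
        """Positive means the hitter overperformed his xHR/FB rate."""
        return self.hr_fb - self.xhr_fb


lines = [
    RateLine("Manny Machado", 0.079, 0.126),
    RateLine("Khris Davis", 0.289, 0.157),
]

# Tab one: biggest overperformers first. Tab two: biggest underperformers first.
overperformers = sorted(lines, key=lambda line: line.gap, reverse=True)
underperformers = sorted(lines, key=lambda line: line.gap)

for line in overperformers:
    print(f"{line.hitter}: {line.gap:+.1%}")
```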

Now let’s discuss some of the notable names.

### Overperformers

If Khris Davis is indeed thrust into full-time action, he is going to be a popular sleeper pick this draft season. Obviously, no one is expecting anywhere close to a repeat of that 28.9% HR/FB rate, but a 15.7% mark ain’t too shabby, suggesting he still does possess excellent power.

Surprise, surprise, the other Chris Davis isn’t really *that* good. What slightly hurt him was just an averageish Avg Absolute Angle. Oh, and it’s also really, really hard to possess true upper 20% HR/FB rate talent.

There’s Domonic Brown’s name on a list you don’t want to be on heading into 2014. Back in early July, I included him on my list of surprising batted ball distances, as his mark was well below what his then HR/FB rate would typically match up with. Then in the middle of that same month, I chose him as one of my four second half bust candidates for that very reason. His xHR/FB rate validates my concerns about his performance this upcoming season.

Sure, Wilson Ramos obviously had to enjoy some good fortune to post a HR/FB rate that would have ranked second in all of baseball had he qualified. But his power surge appears legit and was supported by a batted ball distance that amazingly ranked fourth in all of baseball.

Miguel Cabrera? Don’t act surprised. His 2013 HR/FB rate was a career high and his xHR/FB rate is just a smidgen above his career average. Nothing fishy here.

If Will Middlebrooks’ power declines, he has little to fall back on. Same goes for Brett Wallace, who struck out at an exorbitant rate.

### Underperformers

Corey Dickerson has a chance to be on the good side of a platoon in left field for the Rockies and could be a source of great profit with his power and speed combination.

Maybe the Nationals front office is reading this and can be convinced to give Danny Espinosa another chance?

Good news for the Mariners, as it doesn’t appear that Logan Morrison’s power truly evaporated. Safeco Field suppresses left-handed home runs much less than Marlins Park does.

If Rickie Weeks could wrangle his starting job back, he makes for a strong rebound candidate, though the health of his surgically-repaired hamstring will play a role in his performance.

A pair of Giants in Buster Posey and Pablo Sandoval suffered from a decline in power this year, but xHR/FB rate suggests 2014 should be better.

David Freese could also see his power rebound with his new team out in Anaheim, at which point the media will report it as a “change of scenery” effect.

Nolan Arenado! I’m a fan, as I love his contact ability and the power should grow, perhaps quickly.

Though much will depend on his recovery from knee surgery, Manny Machado owners in keeper leagues should be happy to see an xHR/FB rate of 12.6% versus an actual mark of just 7.9%. You figure some of his doubles would turn into home runs, but a full healthy season would make it more of a lock.

Jason Heyward, with an xHR/FB rate 4.5% above his actual mark, is looking like a prime rebound candidate, though a place atop the lineup will cut into his RBI total.


*Projecting X 2.0: How to Forecast Baseball Player Performance*, which teaches you how to project players yourself. His projections helped him win the inaugural 2013 Tout Wars mixed draft league. He also sells beautiful photos through his online gallery, Pod's Pics. Follow Mike on Twitter @MikePodhorzer and contact him via email.

Wouldn’t you also want to know how correlated the inputs to your xHR/FB measure are from year to year? If those inputs are themselves products of random within-season variation, then xHR/FB might provide a useful description of what happened that year, but would not tell us much about what we should expect going forward. Your look at under- and over-performers suggests you see this having predictive value. Just curious on how you see this measure being used.

I think it’s two different questions – given a hitter’s combination of distance, angle and std dev, what should his HR/FB rate have been? And given his history in those three metrics, what should his HR/FB rate be in year X + 1? The second question would require figuring out the YoY correlations, but not the first one.

If you have historical xHR/FB rates (I have a spreadsheet going back to 2007), then you could follow a hitter’s trend and it could be used the same as regular HR/FB rate, with no real need to know how well they correlate with each other.

Could you go into more detail about how you incorporated the “maximums” for each metric, so the formula would not underestimate the best hitters?

It wasn’t incorporated into the formula, it was just a check I did to make sure the equation worked well. I simply found the maximums for each category and plugged them into the equation just to see what the highest mark it could spit out was.

There is no park adjustment in this equation, correct? So, taking Chris Davis as an example, assuming he maintains the same batted ball distance/angle profile, in HR friendly Camden Yards he actually ought to do somewhat better than the 22.3% xHR/FB, right? Not that I expect him to pull off 29% again, of course. Just want to make sure I understand the equation properly.

You are correct that there’s no park adjustment but it wouldn’t work like that. Park factors are affected by environmental factors such as wind, the air (think Coors Field), etc, in addition to the distance of the fences. The environmental factors should already be accounted for when looking at a hitter’s distance. So the only thing remaining is fence distance. Does Camden Yards have closer fences than other parks? That’s what would matter, not just looking at a park’s HR park factor.

Ah, I see. That makes sense. Thanks.