## The Quest to Predict HR/FB Rate, Part 4

*For those of you who have listened to the Fantasy Baseball Roundtable radio show in the past, I wanted to share that we are officially back on the air! And if you have never listened, well, here’s your chance to hear my manly voice talking about nerdy stats. Listen live every Wednesday night at 9-10 PM EST.*

The quest continues! On Monday, Chad Young and I set out to use the fly ball and home run distances and angles found on Baseball Heat Maps to answer several questions about a hitter’s HR/FB rate. As expected, we found a strong correlation between HR/FB rate and batted ball distance. But distance alone wasn’t telling us the whole story. Chad decided to incorporate batted ball angle and the previous season’s HR/FB rate, and that certainly improved our equation. Then yesterday I took that a step further and found that including the HR/FB rate from two seasons ago was even better. But I wasn’t satisfied. A hitter’s HR/FB rate in 2012 should not be affected by how he performed in the metric in previous years. We pinpointed what we thought might be one of the major hindrances: the way the angle data was presented as an average.

Well, sure enough, like a superhero, Jeff Zimmerman came through. Not only did he provide me with batted ball angles that take the average of the absolute values of each angle (rather than differentiating between left and right field with one side a negative number and the other a positive), but the data also listed every single hitter instead of the truncated list of around 250 per season on the leaderboards page of his site. Better data and more of it?! I was giddy.
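To see why the switch to absolute values matters, here is a minimal sketch with made-up spray angles for a single hypothetical hitter. The numbers and the sign convention (negative for left field, positive for right field) are assumptions for illustration only, not Jeff's actual data.

```python
from statistics import mean

# Hypothetical spray angles in degrees for one hitter's fly balls.
# Assumed convention: negative = left field, positive = right field.
signed_angles = [-30.0, -22.0, 25.0, 18.0, -5.0, 12.0]

# Signed average: balls hit to opposite sides of the field cancel out.
signed_avg = mean(signed_angles)

# Average of absolute values: how far from dead center the hitter tends to hit.
abs_avg = mean(abs(a) for a in signed_angles)

print(f"signed average:   {signed_avg:.2f} degrees")
print(f"absolute average: {abs_avg:.2f} degrees")
```

The signed average lands near zero even though most of these balls were hit well away from center field, which is the information the old averaged data was hiding.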

I then spent hours in my mother’s basement playing around with the data and crunching the numbers. The good news? We are making progress, folks! I ran all the same regressions and tests that I did in Part 3, but this time included the more useful angle data. I also experimented with squaring both the distance and angle, at the suggestion of many readers. I required a minimum of 30 fly balls and home runs and ended up using 1,926 player seasons from 2008-2012. Here are the results:

Model | Adjusted R-Squared |
---|---|
Distance | 0.5613 |
Distance & Distance^2 | 0.5713 |
Distance & Angle | 0.5852 |
Distance, Distance^2, Angle, Angle^2 | 0.5980 |
Distance^2 & Angle^2 | 0.5892 |
Distance^2 & Angle | 0.5900 |
Distance & Angle^2 | 0.5839 |

The exciting part is that whereas we saw essentially no change in the R-Squared when adding the old angle data, we now do see a meaningful increase compared to the first two models that only include distance. Unfortunately, that increase was not as large as I had hoped. You might remember that in Part 3, the best model was Yr1 and Yr2 HR/FB ratio and Yr3 Distance, with an R-Squared of 0.5848. The best model here, at 0.5980, is barely ahead. This is perfectly fine. My mission was to find a model that only used current year data so I did not have to use HR/FB rates from previous seasons. The R-Squared is now just as good as that of the model that cheated.
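For anyone who wants to run this kind of comparison outside of a spreadsheet, here is a rough sketch of how adjusted R-Squared can be computed for several candidate models. The data is synthetic and the generating coefficients are invented, so the printed values will not match the table; the function name `adjusted_r2` is my own.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return adjusted R-squared."""
    n, p = X.shape                              # p = number of predictors
    A = np.column_stack([np.ones(n), X])        # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Penalize R-squared for the number of predictors used.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic player seasons (assumption: NOT the real Baseball Heat Maps data).
rng = np.random.default_rng(0)
n = 500
distance = rng.normal(280, 15, n)               # average fly ball distance, feet
angle = rng.normal(15, 3, n)                    # average absolute angle, degrees
hr_fb = (0.10 + 0.002 * (distance - 280) + 0.003 * (angle - 15)
         + rng.normal(0, 0.03, n))

models = {
    "Distance": np.column_stack([distance]),
    "Distance & Angle": np.column_stack([distance, angle]),
    "Distance, Distance^2, Angle, Angle^2":
        np.column_stack([distance, distance**2, angle, angle**2]),
}
scores = {name: adjusted_r2(X, hr_fb) for name, X in models.items()}
for name, score in scores.items():
    print(f"{name:40s} {score:.4f}")
```

Unlike plain R-Squared, the adjusted version can go down when an added term does not pull its weight, which is why it is the fairer number for comparing models with different numbers of terms.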

The model that included distance, distance squared, angle, and angle squared proved to be best and it was confirmed through another residual sum of squares test. The equation is thus:

**xHR/FB = (-0.00845 * distance) + (0.00002 * distance^2) + (0.02125 * angle) + (-0.00043 * angle^2) + 0.61064**

The max distance from those 1,900-plus player seasons was 324 feet, while the max angle was 26.2 degrees. I was curious what HR/FB rate the above equation would spit out given those maximums. If a player averaged both of those numbers, the result from the equation would act as a cap for the highest xHR/FB it would ever produce. That answer is 27.4%, which is usually very close to where the MLB leader sits.
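If you want to plug numbers in yourself, here is a small sketch of the equation as a function. It uses the coefficients exactly as printed above, which are rounded to five decimal places; the function name `x_hr_fb` is just my label.

```python
def x_hr_fb(distance, angle):
    """Expected HR/FB rate from average fly ball distance (feet) and
    average absolute angle (degrees), using the rounded published coefficients."""
    return (-0.00845 * distance
            + 0.00002 * distance**2
            + 0.02125 * angle
            - 0.00043 * angle**2
            + 0.61064)

# Evaluate at the sample maximums quoted above.
cap = x_hr_fb(324, 26.2)
print(f"xHR/FB at 324 ft and 26.2 degrees: {cap:.1%}")
```

With the coefficients rounded as printed, this comes out near 23.4% rather than 27.4%. The distance^2 coefficient is shown with only one significant digit and gets multiplied by a number over 100,000, so the unrounded spreadsheet values presumably account for the gap. Treat the printed equation as approximate.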

However, as you can see in the chart, the equation still seems to underestimate the top tier of home run hitters. The equation produces an xHR/FB of 20%-25% for very few players, only 39 to be exact. Yet, 114 hitters in the data set posted an actual HR/FB rate of at least 20%. The thought was that squaring the data would give an extra boost to the higher distances, but it doesn’t seem to have been enough. I have been playing around with raising the distance to various higher powers, such as the 10th and 15th, to get the boost we need.

So we are definitely inching closer, but we know there is at least one variable not accounted for yet: park factors. That would be impossible to incorporate, though, so the search continues for the holy grail. A player list with xHR/FB rates, including an analysis of the hitters with the largest discrepancies, is coming... as soon as I feel we have exhausted all avenues and are left with an equation we are unable to improve any further (which might very well be the one just above!).


*Projecting X: How to Forecast Baseball Player Performance*, which teaches you how to project players yourself. His projections helped him win the inaugural 2013 Tout Wars mixed draft league. He also sells beautiful photos through his online gallery, Pod's Pics. Follow Mike on Twitter @MikePodhorzer and contact him via email.

When looking for the best way to account for non-linear effects, are splines ever considered here?

I’m a little alarmed at your willingness to include lots of variables and higher order functions that really don’t improve the fit very much. I’d be curious to see how these models would be ranked using an Information Criterion approach, which includes a penalty for adding more variables. Looking at your table, what jumps out at me is that distance is what matters. There is this desire to fit everything perfectly, to maximize that R2 value, but really that doesn’t produce a model that is meaningfully better, if it takes four times as many parameters to create it.
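As a rough illustration of the information criterion idea, here is a sketch that ranks candidate models by AIC and BIC. It assumes normally distributed errors and uses the common least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with constants shared by all models dropped. The data is synthetic and the helper names are hypothetical.

```python
import numpy as np

def rss_and_k(X, y):
    """Least-squares fit with an intercept; return (RSS, number of coefficients)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid), A.shape[1]

def aic_bic(rss, n, k):
    """Gaussian-error AIC and BIC up to an additive constant; lower is better."""
    base = n * np.log(rss / n)
    return float(base + 2 * k), float(base + k * np.log(n))

# Synthetic player seasons (assumption: NOT the real data from the article).
rng = np.random.default_rng(1)
n = 500
distance = rng.normal(280, 15, n)
angle = rng.normal(15, 3, n)
hr_fb = (0.10 + 0.002 * (distance - 280) + 0.003 * (angle - 15)
         + rng.normal(0, 0.03, n))

candidates = {
    "Distance": np.column_stack([distance]),
    "Distance & Angle": np.column_stack([distance, angle]),
    "Distance, Distance^2, Angle, Angle^2":
        np.column_stack([distance, distance**2, angle, angle**2]),
}
results = {}
for name, X in candidates.items():
    rss, k = rss_and_k(X, hr_fb)
    results[name] = aic_bic(rss, n, k)
    aic, bic = results[name]
    print(f"{name:40s} AIC={aic:10.2f}  BIC={bic:10.2f}")
```

Both criteria reward a smaller residual sum of squares and charge a fee per coefficient. BIC's fee grows with the sample size, so it is the stricter of the two about extra terms such as the squared variables.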

One indication of “overfitting” is that the resulting formula seems to make no sense to me. Perhaps I misunderstand the formula. However, if I interpret it correctly, it says that a ball hit with 0 distance and 0 angle will have a 61% chance of being a home run. That is clearly nonsense, so either I misinterpret or you have too many parameters in your fit. Maybe you could clarify.

Thanks for the constructive criticism. This is what happens when you try to be a statistician when you remember nothing from your college stats class :-(

Given that this is part 4, I don’t recall if it had been mentioned. Has bat speed been considered as a variable? Are there reliable numbers on batspeed anyhow? Or is there any possible way to include pitch speed as a variable?

Bat speed has not been considered because we’re not trying to project distance, we already have that. We’re just trying to use distance to estimate what HR/FB should have been.

But is there a 1:1 correlation between batspeed and distance? Perhaps true bat speed is a better indicator than actual distance.

I’m not sure, bat speed is not data that is publicly available to my knowledge. ESPN Hit Tracker has the speed off bat stat on home runs, but it’s only available on each player’s page so would take forever to collect everyone’s. And, it only includes home runs, not all fly balls.

There is a strong correlation between batted ball speed and distance. You can see that in TrackMan data as well as the hittracker HR data. And, of course, it certainly makes a lot of sense intuitively: the harder you hit it, the further it will go, all other things equal. But all other things are not equal, since FB distance depends on more than just batted ball speed, particularly the vertical launch angle. A batter who gets a consistently high batted ball speed (which is indicative of a high bat speed) but who hits the ball with a low launch angle is not going to hit many home runs. So, batted ball speed (or bat speed) is going to be less indicative of HR/FB than FB distance, which is about as direct as there is.

Bat speed is not measured by anyone in a game situation. It has been measured in specialized experiments, using high-speed motion analysis and/or video. And there is quite clearly a very strong correlation between batted ball speed and bat speed.

Can I ask a simple question? I’m a bit ignorant, and I’m not sure exactly how you come up with these equations. I’m fully aware of how the variables fit into it, but what about all the constants? Is it mainly guess and check, or do you have some sort of system for doing this?

I don’t know what software he used, but programs that do statistical analysis let you put in the variables (say distance and angle) and then it will calculate the appropriate coefficients and constant to fit the data as best it can.

Not only will they do what Bill says, they will also calculate the statistical uncertainty on the coefficients, based on the scatter of the fit from the data. In particular, if there is “overfitting” (i.e., more variables in the fit than are warranted by the data), generally one indication is that the resulting uncertainty in the fitted parameters will be large. I suspect that might be the case in the 4-parameter fit done here.
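Here is a minimal sketch of what such software does under the hood: it solves for the least-squares coefficients and then estimates each coefficient's standard error from the scatter of the residuals. The data is synthetic, the function name `ols_with_se` is hypothetical, and centering the predictors before squaring is my own choice to keep the arithmetic numerically stable.

```python
import numpy as np

def ols_with_se(X, y):
    """Ordinary least squares with an intercept.
    Returns (coefficients, standard errors), intercept first."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                        # observations minus coefficients
    sigma2 = float(resid @ resid) / dof         # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))

# Synthetic player seasons (assumption: NOT the real data from the article).
rng = np.random.default_rng(2)
n = 500
distance = rng.normal(280, 15, n)
angle = rng.normal(15, 3, n)
hr_fb = (0.10 + 0.002 * (distance - 280) + 0.003 * (angle - 15)
         + rng.normal(0, 0.03, n))

# Center each predictor so the squared terms are not nearly collinear
# with the linear terms.
d = distance - distance.mean()
a = angle - angle.mean()
X = np.column_stack([d, d**2, a, a**2])

beta, se = ols_with_se(X, hr_fb)
labels = ["intercept", "distance", "distance^2", "angle", "angle^2"]
for label, b, s in zip(labels, beta, se):
    print(f"{label:12s} coef={b: .6f}  se={s:.6f}  t={b / s: .2f}")
```

A coefficient whose standard error is about as large as the coefficient itself (a t value near zero) is a hint that the term may not be warranted by the data.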

I have been using the regression tool in the Analysis ToolPak for Microsoft Excel.

caveat: i am emphatically not much of a math guy

but saying park factors are impossible to incorporate seems untrue to me. It may be that park factors are next to impossible to incorporate meaningfully, but i can think of at least 2 ways to account for park factors here:

1- use matt klaasen’s approach of only looking at players who stayed with the same home team. This limits the sample size given the years you’re working with, but it gives a baseline that could be tested against a larger sample size once you think you’ve got the equation worked out

2- rather than using specific park factors, look at averages for performance in pitcher parks vs hitter parks. since your equation is currently park neutral, it should presumably apply to neutral parks. I imagine only slight adjustments are needed to account for how pitcher parks suppress hr, and how hitter parks enhance fb. But again, I’m not a math guy so my imagination might not carry much weight. If these ideas wouldn’t work, could you explain to me in very small words why not?

I like the option of same-team players, not sure if you’re suggesting the model ONLY deal with their home splits. That’s what I was going to toss out as a way to hone the model.

Could also try to track players where the home park was DIFFERENT across years and see if/how that impacted their home stats.

The issue is that we’d have to determine every single park the specific hitter hit in and take into account those park factors. There’s no way to then incorporate those factors into a one size fits all formula.

we could always calculate only home stats, couldn’t we?

this makes sense to me. consider me satisfied, at least until pt 5!

First, this is a really cool article, but I have a few questions/suggestions.

1) Are you accounting for batter handedness when you are looking at batted ball angle? An angle that would represent a pulled ball by a right-handed hitter would be an opposite-field ball by a left-handed hitter, so if you are ignoring the handedness of the hitter, the batted ball angle becomes almost useless.

2) My understanding of the batted ball angle data is that it is based on where the ball lands, not the actual angle it was hit at. This is a fairly good approximation for batted ball angle, except that batted balls tend to hook toward the foul poles, making farther-hit pull or opposite-field balls look like they were hit at a more extreme angle than they actually were.

3) This is more of a suggestion to help avoid overfitting: throw out terms that don’t really mean anything or that you don’t have a hypothesis to back up. The first term I would eliminate is angle^2, because there really is no such thing as an angle squared, and it is also dependent on how you define your angle. Also, I would get rid of any term related to distance that has a negative coefficient, because hitting a ball a shorter distance should not increase HR/FB%.

4) You also might want to look at distance as a function of angle. Players in general tend to pull most of their home runs but hit the majority of their fly balls to the opposite field, so a 1 ft increase in opposite-field distance will increase average distance more than a 1 ft increase in pull distance. But the increase in pull distance increases home runs by more, because pulled balls are closer to the fence on average and so should benefit more from an extra foot.

5) also looking at fb% or batted ball frequency as a function of angle could also be informative.

Thanks for the questions.

1) All the angle data is the absolute value now. The higher the number, the more pull-happy, so handedness is irrelevant.

2) Interesting point, I’m not sure at which point of a ball’s flight the angle is taken from.

3) Yeah, I questioned why the distance coefficient was negative as well.