On Monday, I set the groundwork for a quest to try to predict a hitter’s HR/FB ratio utilizing the new data first made available last year that tells us the average distance of a hitter’s batted balls. Our goal was to answer several questions:
1) Given a hitter’s average distance and any other factors we identify, what should that hitter’s HR/FB rate have been in Year X? In other words, this would be a backward-looking xHR/FB rate.
2) Given a hitter’s average distance and any other factors we identify, what should we project that hitter’s HR/FB rate to be in Year X + 1? In other words, this would be a forward-looking projected HR/FB rate.
While trying to answer question 2, we are also attempting to determine:
3) If a hitter experiences a significant change in average distance in Year X, how much of it sticks in Year X + 1?
To begin, we are trying to answer the first question and come up with an equation to calculate an xHR/FB rate. Once we are happy with our findings, we will turn our attention to the second and third questions to project HR/FB rate in the next year.
In Part 1, I confirmed that average distance does indeed have a strong correlation with HR/FB rate and we should therefore use it as a variable to determine both what a hitter’s HR/FB rate should have been and what we should project for the next season. Yesterday, Chad Young took that finding one step further and concluded that an equation that includes the previous season’s HR/FB rate was even better than just distance. This bothers me because it suggests there is something else, perhaps multiple influences, at play besides average distance that affect a hitter’s HR/FB rate.
We originally thought that batted ball angle was the answer. Given the same distance, a ball hit down the line is significantly more likely to land in the seats than a ball hit to center field. However, I realized that the way batted ball angle is presented as an average is problematic. Consider a hitter who hits half his fly balls and home runs down the left field line and the other half down the right field line. The angle statistic would be 0.0 for this hitter (one side of the field is considered negative, the other positive), suggesting he hits balls to dead center on average. This, of course, is not the case in the example and might explain why our attempts to incorporate the data into our regressions have barely moved the R-squared needle. So for now, we are going to ignore batted ball angle.
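To make the averaging problem concrete, here is a minimal Python sketch. The spray angles are invented for illustration, and the `average_absolute_angle` helper is only one hypothetical alternative summary, not something we have tested in our regressions.

```python
from statistics import mean

# Hypothetical spray angles in degrees (invented for illustration).
# Negative = one side of the field, positive = the other, 0 = dead center.
down_both_lines = [-40.0, -40.0, 40.0, 40.0]  # half down each line
dead_center = [0.0, 0.0, 0.0, 0.0]            # everything to center field


def average_angle(angles):
    """Signed average, the form in which the angle statistic is presented."""
    return mean(angles)


def average_absolute_angle(angles):
    """Hypothetical alternative: average distance from center, ignoring side."""
    return mean(abs(a) for a in angles)


# The signed average cannot tell these two hitters apart.
print(average_angle(down_both_lines), average_angle(dead_center))

# The absolute version separates them.
print(average_absolute_angle(down_both_lines), average_absolute_angle(dead_center))
```

Both hitters get a signed average of 0.0, even though one of them never hits the ball to center at all.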
In Chad’s Part 2, he mentioned trimming the player list I used in Part 1 to only those players who had also played the previous season. That led him to a model that included average distance in Year X and HR/FB rate in Year X – 1. I went a step further and limited the data set to hitters who played in any three consecutive seasons, which left me with n = 697. I then tested five potential models. For ease of understanding, Year 3 is the season I am calculating an xHR/FB rate for. Here are the R-squared results:
|Model|R-squared|
|---|---|
|Yr1 & Yr2 HR/FB, Yr3 Distance|0.5848|
|Yr2 HR/FB, Yr3 Distance|0.5698|
|Yr1 & Yr2 HR/FB|0.4277|
I also tested the models using Yr3 Distance squared as some have suggested, but that screwed with the p-values, so I left them out. You will notice that the R-squared for “Yr2 HR/FB, Yr3 Distance” is a bit lower than what Chad found yesterday. This is due to the different data set I am using.
I then ran all five models through a Residual Sum of Squares (RSS) test and the results came in the same order as the above. Thus, the best model at this point appears to be the one that uses HR/FB rates from the previous two seasons along with the average distance for the season you are calculating the xHR/FB rate for.
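For readers who want to run this kind of comparison themselves, here is a rough Python sketch of the two measures. The numbers are made up and the two "models" are just hypothetical lists of predictions, not output from our actual regressions.

```python
def rss(actual, predicted):
    """Residual sum of squares: lower means the predictions fit better."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - rss(actual, predicted) / total_ss


# Made-up HR/FB rates (as decimals) and predictions from two hypothetical models.
actual = [0.10, 0.15, 0.20, 0.05]
model_a = [0.11, 0.14, 0.19, 0.06]
model_b = [0.13, 0.12, 0.16, 0.09]

for name, preds in (("model_a", model_a), ("model_b", model_b)):
    print(name, rss(actual, preds), r_squared(actual, preds))
```

Note that when every model is scored against the same set of actual values, the total sum of squares is identical for all of them, so ranking by RSS and ranking by R-squared will always agree. If the RSS is computed on the same sample used to fit the models, that alone would explain why the order did not change.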
The equation is:
xHR/FB = (0.165864 * Yr1 HR/FB) + (0.263489 * Yr2 HR/FB) + (0.002081 * Yr3 Distance) – 0.528386
Is it cheating to use HR/FB ratios from previous years? Yes. I am not happy about it. The xHR/FB rate for Year 3 really should have nothing to do with how a hitter has performed in previous seasons. But as you may recall from Part 1, distance alone was simply not telling the whole story. Whether it’s a combination of the elusive angle data and park factors or something else at work, we have yet to figure it out. So at the moment, HR/FB rates from previous seasons are essentially acting as a proxy for these other mystery variables.
In the distance-only model introduced in Part 1, I was also unhappy with the highest xHR/FB it spit out. The max for 2012 was 19.7% for Matt Kemp. This new formula arrives at a 20.6% estimate, which is closer to his actual 21.7% mark. In addition, the new equation does a better job of estimating other hitters on the high end. Ryan Howard used to consistently average 320 feet and post 30% HR/FB rates. In the old distance-only equation, Howard’s xHR/FB rate would be just 21.8%. No one averages more than 320 feet, so that essentially places a hard cap of nearly 22% on xHR/FB. Plugging that 320-foot average distance, along with 30% HR/FB rates in the previous two seasons, into the new equation above, we arrive at a more reasonable 26.6% xHR/FB rate.
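If you want to plug in your own numbers, the equation translates directly into a few lines of Python. The function and argument names are my own, and the sketch assumes HR/FB rates are entered as decimals (0.30 for 30%) and distance in feet, which is what the Howard figure above implies.

```python
def xhr_fb(yr1_hr_fb, yr2_hr_fb, yr3_distance):
    """Expected Year 3 HR/FB rate from the three-variable equation.

    yr1_hr_fb, yr2_hr_fb: HR/FB rates as decimals (0.30 means 30%).
    yr3_distance: average batted ball distance in feet for Year 3.
    """
    return (
        0.165864 * yr1_hr_fb
        + 0.263489 * yr2_hr_fb
        + 0.002081 * yr3_distance
        - 0.528386
    )


# The Ryan Howard example: 30% HR/FB in both prior seasons, 320 feet in Year 3.
howard = xhr_fb(0.30, 0.30, 320)
print(f"{howard:.1%}")  # prints 26.6%
```

The distance coefficient also means that, holding the prior HR/FB rates fixed, each additional foot of average distance adds roughly 0.2 percentage points of xHR/FB.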
I still believe that the batted ball angle is going to prove important. We are currently working on getting the data into a more useful format to incorporate into our regressions. Logic suggests that whether a fly ball goes for a home run depends on a) how far it travels, b) where in the park it is hit and c) what park it was hit in. Taking into account park dimensions would be way too difficult and time consuming, so I have to think that batted ball angle is the holy grail. I’m all ears, though, if you have thoughts about what other variables could be included.