xHR%: Questing for a Formula (Part 4)

Apologies for the significant delay between the third post and this one. A little Dostoevsky and the end of the quarter really cramp one’s time. Since it’s been a while, it would probably be helpful for mildly interested readers to refresh themselves on Part 1, Part 2, and Part 3.

As a reminder, I have conceptualized a new statistic, xHR%, from which xHR (expected home runs) can and should be derived. Importantly, xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season rather than what will happen or what actually happened. In searching for the best formula possible, I came up with three different variations, all pictured below with explanations.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s home run tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea.

PA – Plate appearances
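To orient readers without re-deriving everything, here’s a rough Python sketch of how the ingredients above fit together. The per-season plate-appearance inputs and every weight below are placeholders for illustration, not the exact numbers from the formulas in Parts 1–3:

```python
# Rough sketch only -- the real xHR% variants are spelled out in Parts 1-3.
# The per-season PA inputs (pa1, pa2, pa3) and every weight below are
# placeholders for illustration, not the actual coefficients.

def xhr_pct(hrd, ahrdh, ahrdl, y1hr, y2hr, y3hr, pa1, pa2, pa3, method=0.5):
    """Toy xHR%: a distance factor times a weighted three-year HR rate."""
    # Distance factor: the player's average HR distance against the average
    # of his home-park and league HR-distance environments (both Y1 data).
    distance_factor = hrd / ((ahrdh + ahrdl) / 2)

    # Weighted HR/PA across the three seasons; here the "method" value is
    # simply treated as the weight on the most recent season.
    hr_rate = (method * (y1hr / pa1)
               + (1 - method) * 0.6 * (y2hr / pa2)
               + (1 - method) * 0.4 * (y3hr / pa3))

    return distance_factor * hr_rate

def xhr(xhr_pct_value, pa):
    """Expected home runs: xHR% scaled over the season's plate appearances."""
    return xhr_pct_value * pa
```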

Now that most everything of importance has been reviewed, it’s time to draw some conclusions. But first, please consider the graphs below.

Expected home runs (in blue) graphed with actual home runs (orange) using the .5 method. I plotted expected home runs and actual home runs instead of xHR% and HR% because it’s easier to see the differences this way.

Expected home runs (in blue) graphed with actual home runs (orange) using the .6 method.

Expected home runs (in blue) graphed with actual home runs (orange) using the .7 method.

Conclusions

Honestly, those graphs look pretty much the same. Yes, as the method moves from .5 to .7 the numbers bunch up a little more tightly around the mean, but the differences between the methods aren’t significant. Nor are the results from those methods particularly different from the actual results. And therein lies the crux of the matter. The formulae suggest that what happened is what should have happened, but I don’t think that’s true.

I know a great deal of luck goes into baseball. I know as a player, as a fan, and as a budding analyst that luck plays a fairly large role in every pitch, every swing, and every flight the ball takes. I don’t know how to quantify it, but I know it’s there and that’s what sites like FanGraphs try to deal with day in and day out. Knowledge is power, and the key to winning sustainably is to know which players need the least amount of luck to play well and acquire them accordingly. Statistics like xFIP, WAR, and xLOB% aid analysts and baseball teams in their lifelong quests for knowledge, whether it be by hobby or trade.

For those reasons, xHR% in its current form is a mostly useless statistic. It fails to tell the tale I want it to tell — that players are occasionally lucky. An average difference of between .6 and 1 home runs per player simply doesn’t cut it because it essentially tells me what really happened. At this juncture it’s basically a glorified version of HR/PA where you have to spend a not insignificant amount of time searching for the right statistics from various sources. But hey, you could use it to impress girls by convincing them you’re smart and know a formula that looks sort of complicated (please don’t do that).

I don’t know how big of a difference there needs to be between what should have happened and what actually happened. Obviously there still has to be a strong relationship between them, but it needs to be weaker than an R² of .95, which is approximately what it was for the three methods.
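If you want to sanity-check that number yourself, the R² here is just the squared correlation between the expected and actual home run totals. A minimal sketch, with made-up arrays standing in for the real per-player data:

```python
import numpy as np

# Made-up stand-ins for the per-player expected and actual HR totals graphed above.
expected_hr = np.array([28.4, 11.2, 34.9, 19.7, 7.3, 22.6])
actual_hr = np.array([30, 12, 33, 21, 6, 24])

r = np.corrcoef(expected_hr, actual_hr)[0, 1]
print(f"R^2 = {r ** 2:.3f}")  # the real data landed around .95 for all three methods
```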

All statistics that try to project the future or describe the past are educated shots in the dark. They’re a bit like the American dollar: nearly all of their value comes from our belief in them, plus some supposedly logical mathematical assumptions about how they work. Even mathematicians need a god, and if that god happens to be WAR, then so be it.

Even though my formula doesn’t do what I want it to do quite yet, I won’t give up. Did King Arthur and Sir Lancelot give up when they searched for the Holy Grail? No, they searched tirelessly until they were arrested by some black-clad British constables with nightsticks and thrown in the back of a van. I will keep working until I find what I’m looking for, or until I get arrested (but there’s really no reason for me to be).

I know that wasn’t particularly mathematical or analytical in the purest sense, and that it was more of a pseudo-philosophical tract than anything else, but please bear with me. Any suggestions would be helpful. I have some ideas, but I’d appreciate yours as well.

Part 5 will arrive as soon as possible, hopefully with a new formula, new results, and better data.





A busy person, but one who spends his free time in front of a computer screen, fiddling with statistics. And yes, that describes everyone who regularly visits this website.

6 Comments
ZachTheQuack
8 years ago

Hey, Jackson, keep up the good work. Just a few thoughts—disclaimer: a day’s worth of neuroanatomy labs and lectures have passed since I read through your series of articles…

Firstly, I just wanted to make sure that you were using data from years 1, 2, and 3 to make descriptive statements about data in year 0 (for lack of a better term), e.g., information from 2012–2014 to evaluate HR in 2015. I ask only because I cannot find any explicit statement as to which year the HRD of the player being evaluated is supposed to come from. If it were year 1 data, then the formula has a predictive element to it, because all statistics attempting to elucidate the results of the given year of interest are statistics from past seasons; whereas if the HRD is from year 0 (2015 in the example above), then the metric you’ve designed is partly a reflection of past results and partly a reflection of current results from the perspective of the year in question. Since HRD is strongly correlated with HR, if you were using year 0 data instead of year 1 data for HRD, that could explain the abnormally high correlation between the data set you’re attempting to describe and your (pseudo?)-predictive metric.

ZachTheQuack
8 years ago

Well, it depends on what your goal is. If your goal is to determine the luck factor on home runs, then the data set that attempts to describe any regressed HR metric will indeed depend on the data for the season in question (and perhaps previous seasons); however, I’m not sure that delving into HR/PA is quite as pertinent as it seems at first blush (although it is, indeed, pertinent). It depends on what one considers to be luck, skill, a skill that is likely to be repeated, &c.

The following is a long response that offers a variety of perspectives on the question you’re attempting to ask. If you have a definite answer on the particular question you’re asking (and the version of that question, as well) then your results and any future foray can be assessed.

If one is attempting to elucidate lucky home runs in a vacuum, so to speak, then there are three primary arms of the problem so far as I can see (perhaps you see more!). The first is: with the ball hit at its current distance, what is the likelihood that such a ball becomes a home run in an average park (or, if you’re math heavy, the likelihood on a per-park basis as a Riemann sum)? This problem does not take direction into account.

This leads to the next “luck” problem which is how often does a ball hit X distance at Y angle leave an average park; here we have our first intersection of skill and luck (which is where HR/PA may carry predictive value) since pull% and the match between spray pattern and home park play a factor in the final results.
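As a back-of-the-envelope illustration of both of these problems, here’s a toy Python sketch. The park dimensions are invented and it ignores wall height, weather, and carry entirely, so treat it as the shape of the calculation rather than the calculation itself:

```python
import numpy as np

# Toy park dimensions (feet) at five spray angles from the LF line (0 deg)
# to the RF line (90 deg). These numbers are invented for illustration; real
# work would use actual park dimension data and wall heights.
PARKS = {
    "Park A": [330, 375, 405, 375, 330],
    "Park B": [315, 365, 400, 370, 325],
    "Park C": [347, 390, 415, 385, 350],
}
ANGLES = np.array([0, 22.5, 45, 67.5, 90])

def hr_probability(distance, spray_angle, parks=PARKS):
    """Fraction of parks a ball of this distance and direction clears.

    Treats the fence as the interpolated outfield distance at the spray
    angle and ignores wall height, weather, and carry -- the luck factors
    discussed above.
    """
    cleared = 0
    for dims in parks.values():
        fence = np.interp(spray_angle, ANGLES, dims)
        cleared += distance >= fence
    return cleared / len(parks)

# A 372-foot fly ball pulled down the line clears more parks than the
# same ball hit to straightaway center.
print(hr_probability(372, 5), hr_probability(372, 45))
```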

Finally, there’s another layer to the problem: distance itself is not a fundamental quantity. It’s a combination of trajectory, initial velocity, spin rate, and weather factors such as altitude, game conditions, temperature, humidity, and wind. Altitude and wind have the largest effects from a physics perspective over the range of conditions one is likely to encounter. To illustrate, if one hits the hell out of the ball into a 25 mph headwind, well, good luck. That strong effort may look like a lazy fly ball that had just enough when using HRD, rather than an upper decker. And while such effects should even out in the long run, different parks are routinely subject to different conditions, and a single season is a small enough sample that weather can significantly affect the results.

I like your inclusion of the average home run distance for a player’s home park and for the league as a whole, as this is a catch-all way of approximating park dimensions (smaller parks should have lower average HRD) and perhaps consistent altitude and weather patterns. Playing at Coors lengthens the average fly ball, and if the distribution of fly balls by depth is a skewed unimodal distribution (think uneven bell curve), which intuition says it likely is, then short home runs become overrepresented relative to average or long home runs: a slightly larger subsection of fly balls that were previously too short now qualify as HR and drag the weighted average downward. Including home park HRD and league average HRD may help to capture some of these factors, including weather (especially when the values come from the same season one is attempting to describe), since a park’s HRD in any given season also reflects the weather conditions experienced at that park that season.

Since data on exit velocity and batted ball distribution is not collected for every ball put in play, but only a sample of them, it’s hard to determine a hitter’s true profile with accuracy. Hard%, FB%, and Pull% may be just as predictive of a hitter’s tendencies, if not more so. And since tendencies can augment or detract from the number of HR a player hits (especially depending on handedness and park), this can play a big role.

I understand why you want to focus on HR/PA (especially since research shows it is a more stable and predictive statistic than HR/FB%); in a formula that starts from the basis of HRD (and not, say, average FB distance, or something along those lines), HR/PA is important since it catches other aspects of the hitter’s profile not related to HRD, like FB% and, of course, implicitly strikeouts as well. The question you need to ask yourself is whether HR/PA ties in a very direct way to skill, or whether it’s more a catch-all description of a player’s results. Including same-year HR/PA seems a vital way of factoring in K% and FB% in an oblique manner, but remember that HR/PA on a same-year basis doesn’t help you separate luck from skill, as lucky home runs and earned (so to speak) HR are equally accounted for under the umbrella of HR/PA. If this point isn’t clear, let me know and I’ll find other words to explain it, but it’s essentially why your descriptive model and the season’s results so closely align.

How you want to move away from HR/PA (if you do at all) is an issue that’s entirely up to you. Using past year data for HR/PA can help to set a baseline on expected talent, but does not account for changes in talent level between past seasons and the season in question. Your best bet is to figure out how you may want to estimate a player’s skill from a HR standpoint based on something apart from HR/PA, but that confers some of the same benefits in terms of including repeatable skills that impact performance (again, the aforementioned K%, FB%, &c.).

Where you may want to go from here is to see what work has been done previously on HR/PA (or HR/FB%) and avg. FB distance, avg. HRD, etc. See what other factors people may have included that helped to further explain the results (pull%, hard hit%, etc.) and then see what improvements can be made to such a model.

The most important takeaway from what you’ve done so far, I think, is that you had the intuition to include player avg. HRD/(avg. park avg. HRD + home park avg. HRD). This co-factor seems a better choice than many that have been used in the past, such as park factor for HR by handedness, etc. Two things that may improve this co-factor (although I make no guarantees) are as follows. Firstly, do not forget to subtract (1/30)*(home park avg. HRD) from the league avg. HRD, since the home park is not included in the road data set, and then multiply the result by (30/29).

Secondly, it might be interesting as an exercise to estimate the player’s HRD on the road and at home by some method, since if one has a favorable home park like Coors, one’s average HRD may be 390 feet, but that doesn’t mean one’s average HR distance at home and on the road are of equal value. The manner in which you have this home/road effect accounted for is a straight average of the home and road HRD environments. BUT, imagine a case in which one’s home park avg. HRD is 385 ft (and here I’m picking numbers out of the air since I don’t know the numbers at hand) while the league average is 399 ft. This means you’re likely to hit more HR at home than on the road, all other things being equal, and that one’s home park has a greater influence on one’s season-long avg. HRD, so the effects of home and road are not a 50/50 split. There are ways to estimate this using math (or empirical results), but since I’ve already written so much I figure I’ll leave that for another time.
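In code form, that first correction looks something like the following. Note it assumes every park contributes equally to the league average HRD, which is only approximately true since parks host different numbers of home runs:

```python
def road_avg_hrd(league_avg_hrd, home_park_avg_hrd, n_parks=30):
    """League average HR distance with the player's home park removed.

    Assumes every park contributes equally to the league average; in reality
    the weighting depends on how many home runs each park actually hosted.
    """
    return (league_avg_hrd - home_park_avg_hrd / n_parks) * n_parks / (n_parks - 1)

# With the made-up numbers above: a 385 ft home park against a 399 ft league
# average leaves a slightly longer road environment than the raw league figure.
print(road_avg_hrd(399, 385))  # ~399.5 ft
```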

Hope this post may have answered some questions you had about why your results were too spot on and helped you to figure out what question it is that you want to ask/answer and how you may go about achieving your results.

ZachTheQuack
8 years ago

Quick personal bio: I’m a former physical chemist, now in medical school, who sadly uses his extensive baseball knowledge and mathtastic skills to make money playing fantasy baseball and daily fantasy rather than for the greater good of the SABRmetric community. But, as I find myself too pressed for time with medical school to capitalize on my knowledge anymore, I’m coming around to the idea of sharing a lot of it. I don’t have any articles, but I am considering writing a few over the summer, so keep an eye out.

As for why HR/PA in year 1 is the stat poisoning your model: you’re attempting to elucidate what most likely should have been, but HR/PA is a description of what actually was. It’s like trying to figure out what an answer key for an exam should have been, but generating the probable answer key from the original answer key. HR/PA in year one, multiplied by the PA a player had in a given season, is an exact reflection of the season’s actual results. It may be a relatively stable indicator of a player’s power potential over a career, sure, but that’s because as a player’s career extends over time we believe the results on the field begin to closely match the player’s true talent level. What you have to ask yourself is: if a guy hits 5 or 6 lucky home runs over the course of the season, will this fact be contradicted by his HR/PA, or will the lucky HR be supported by the HR/PA statistic? The answer, of course, is that the luck is reflected in the results with no way of ascertaining whether or not it was luck. From the perspective of HR/PA, all home runs are equal, and in the case of same-season HR/PA all home runs are equal and fully accounted for.

If one were to use HR/PA for year 1 because you’d like to see a player’s current talent level reflected in the results, you’d have to have some way of trying to ascertain what part of HR/PA is a reflection of talent and what part is a reflection of results irrespective of whether such results were derived from skill or luck. That’s often why the best models attempt to use past data sets to explain current results in an effort to avoid an echo-chamber-esque bias.

As for the other stuff, I was just using co-factor colloquially. The denominator of your HRD term is a constant derived from table values and the numerator is a variable, but one that isn’t controllable, merely observable and descriptive; so often in my lab we’d refer to such things as co-factors even though a co-factor should be comprised of constants only, but we did so because it wasn’t something we could control for so it may as well have been a constant from the perspective of experimental design. Sorry for the confusing language!

I’d say take what you’ve done, gather new insights, and set on a steady course of re-working the results until you find what you’re looking for. Could take a while, I understand the being busy thing. But if this is a problem that really matters to you, it’s probably worth the wait. Any other questions you have include in the comment section and I’ll be happy to give them a go.

ZachTheQuack
8 years ago

Ah, I forgot to mention what possible uses remain for HR/PA as it is currently being used by your model. One possible use might be to take the current season’s HRD and its deviation from the previous two years (or the three-year average) as an indicator of how heavily year 1 HR/PA should be weighed. That is to say, in the event of no difference between year 1 HRD and the previous years 2 and 3, perhaps a factor of zero is given to the current year’s HR/PA and years 2 and 3 are given a 70/30 split (a split based purely on whimsy, for the sake of illustration). As the HRD deviation from the previous standard becomes greater, perhaps the weight on the year 1 HR/PA becomes larger. As for how to strike this balance and find the best model, there are a lot of approaches, and I’ll leave you to think about them. Again, feel free to ask what I might do, but I don’t want to take any of the fun away from you, hah.
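If it helps, here’s one literal-minded way to encode that idea in Python; the 70/30 prior split and the deviation at which year 1 gets full weight are picked arbitrarily, exactly as in the illustration above:

```python
def weighted_hr_rate(hr_pa_y1, hr_pa_y2, hr_pa_y3, hrd_y1, hrd_prior_avg,
                     full_weight_deviation=15.0):
    """Blend of HR/PA across seasons, leaning on year 1 only when HRD has moved.

    The 70/30 prior split and the 15 ft "full weight" deviation are
    placeholders picked for illustration, exactly as in the comment above.
    """
    # How far the current season's average HR distance sits from the
    # established baseline, scaled to [0, 1].
    deviation = abs(hrd_y1 - hrd_prior_avg)
    w1 = min(deviation / full_weight_deviation, 1.0)

    prior = 0.7 * hr_pa_y2 + 0.3 * hr_pa_y3
    return w1 * hr_pa_y1 + (1 - w1) * prior

# No change in HRD: year 1 HR/PA is ignored entirely.
print(weighted_hr_rate(0.050, 0.040, 0.035, hrd_y1=400, hrd_prior_avg=400))
# A 10-foot jump in HRD: year 1 gets two-thirds of the weight.
print(weighted_hr_rate(0.050, 0.040, 0.035, hrd_y1=410, hrd_prior_avg=400))
```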