## What Factors Affect HR/FB Rate?

In the past few weeks, home run per fly ball rate has been a hot topic here at FanGraphs and around the baseball blogosphere. Dave Cameron sparked a lively discussion with this post about Matt Cain, and then showed that Giants’ pitching coach Dave Righetti has been able to get better than average HR/FB rates out of a number of his starters. Another interesting article suggests that HR/FB rate may have something to do with pitch movement.

At this point, it seems clear that HR/FB rate may not be completely luck. It may have a lot of luck involved, but there are some factors that can increase or decrease this statistic. Using the vast FanGraphs database and some regression modeling, we can hunt for these factors and find out just how much they matter.

The data examined is from 2002 to 2009, leaving last season out of the data set in order to test the model’s forecasting ability later. This data is for starting pitchers only, and only those who threw 80 or more innings. Also, pitchers who changed teams during the season were excluded.
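The sample rules above can be sketched as a simple filter. This is only an illustration: the records and field names (`role`, `ip`, `teams`) are hypothetical and do not reflect actual FanGraphs column names.

```python
# Hypothetical pitcher-season records; names and fields are made up for illustration.
seasons = [
    {"name": "Pitcher A", "year": 2005, "role": "SP", "ip": 192.0, "teams": 1},
    {"name": "Pitcher B", "year": 2005, "role": "SP", "ip": 64.0, "teams": 1},   # too few innings
    {"name": "Pitcher C", "year": 2007, "role": "RP", "ip": 85.0, "teams": 1},   # not a starter
    {"name": "Pitcher D", "year": 2008, "role": "SP", "ip": 150.0, "teams": 2},  # changed teams
    {"name": "Pitcher E", "year": 2010, "role": "SP", "ip": 210.0, "teams": 1},  # held out for testing
]


def in_sample(row, first_year, last_year, min_ip=80.0):
    """Return True when a pitcher-season meets the sample rules described in the text."""
    return (
        first_year <= row["year"] <= last_year
        and row["role"] == "SP"
        and row["ip"] >= min_ip
        and row["teams"] == 1
    )


# 2002-2009 seasons fit the model; 2010 is kept aside to test it.
training = [row for row in seasons if in_sample(row, 2002, 2009)]
holdout = [row for row in seasons if in_sample(row, 2010, 2010)]

print([row["name"] for row in training])
print([row["name"] for row in holdout])
```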

There were numerous independent variables tested, and countless models examined. For brevity, I won’t take you through every iteration of the process, but will instead tackle categories of variables.

**Park Factor**

This is the variable which is probably the most obvious to include. Many studies have shown that a player’s HR/FB rate depends on his home park, and this analysis concurs. The variable’s coefficient is positive and significant at the 99-percent level, which means that if park factor had no real relationship with the dependent variable (HR/FB), a coefficient this large would turn up by chance less than 1 percent of the time.

The only surprise here is that the coefficient for park factor has a relatively low elasticity, which means it does not have a large effect on HR/FB rate. This is probably due to the fact that park factor really only controls for about half of the games a pitcher starts. This variable could be improved by making it a weighted average of all of the parks the pitcher played in over the course of the season, but that is simply too time consuming for this end.

**Pitcher Skill**

Using xFIP as an independent variable yields a significant coefficient, but that is too vague an answer. What exactly about xFIP matters for HR/FB rate?

It turns out that both K/9 and BB/9 yield significant coefficients at the 99-percent level. The former has a negative coefficient and the latter has a positive, meaning that more strikeouts and fewer walks led to a lower HR/FB.

For K/9, this shows that pitchers who miss bats regularly may be able to suppress their HR/FB rates. This is not very surprising – if hitters are unable to put the ball in play against a pitcher, they may be less likely to square up a pitch and drive it when they do make contact.

For BB/9, the interpretation is a little less obvious. The theory put forth here is that walk rate measures a pitcher’s control. A pitcher with good control is less likely to throw balls out of the strike zone, hence the low walk rate, and is also less likely to groove one down the middle.

**Plate Discipline**

For this section, FanGraphs’ variables for swing rates in and out of the zone, and contact rates in and out of the zone were tested, as well as first pitch strike percentage and swinging strike percentage.

Of all of these variables, only one showed evidence of a relationship to HR/FB rate: O-Contact%. This variable had a significant negative coefficient, meaning that the more often that hitters were making contact on pitches outside the strike zone, the lower the HR/FB rate was for the pitcher.

This tells us that balls hit outside the strike zone are simply less likely to be home runs. This does not mean they are not likely to be hits, or even extra base hits, but they are not leaving the park at the same rate as balls hit within the zone.

This variable has a 99-percent significance and a large elasticity, meaning that it strongly affects the HR/FB rate. If a pitcher could consistently cause hitters to make contact with pitches outside the strike zone, he may be able to hold down his home run rate. Can pitchers continually bait hitters into swinging at pitches that they can hit, but not hit well? That’s an area for more research.

**Batted Ball Percentages**

The four batted ball types were examined as independent variables: fly ball percentage, line drive percentage, ground ball percentage, and infield fly ball percentage. Of these four, the only variable which proved to have a significant effect on HR/FB is IFFB%.

Including IFFB% yields a negative coefficient which is significant at the 99-percent level. This makes a ton of sense. Infield flies have a zero probability of being home runs. Together with O-Contact%, these variables help remove batted balls which have little or no chance of leaving the ball park from a pitcher’s HR/FB percentage.

What is more interesting here is that none of the other three batted ball variables have a significant coefficient. While it has been shown that fly ball pitchers usually have better HR/FB rates than ground ball pitchers, the results of this model suggest that it may have more to do with the increase in popups, rather than an ability to get a higher percentage of outs on fly balls to the outfield.

**Pitch Velocity**

The variables tested had to do with the frequency and velocity of pitch types in a hurler’s arsenal. This category produced truly interesting results.

From the pitch type variables, one makes it to the final model: average fastball velocity. The statistic FBv was shown to have a negative effect on HR/FB rate, significant at the 90-percent level. The better fastball a pitcher has, the better his HR/FB rate.

At first this seems counterintuitive because fast pitches travel farther when hit. However, this result shows that HR/FB rate may have more to do with being able to square up a pitch. High velocity pitches may simply be tougher to make solid contact with, allowing pitchers who throw harder to get away with more mistakes.

**Putting It All Together**

After countless regression models and variable combinations, the final model is thus:

Elasticity is defined as the percent change in y for a one-percent change in x. Here, you can interpret the K/9 elasticity as a five percent increase in strikeout rate translating into a one percent decrease in HR/FB rate.
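As a rough sketch of that arithmetic: the K/9 reading above implies an elasticity of about -0.2. The article does not say how its elasticities were computed, so the "coefficient times mean of x over mean of y" conversion below is one common convention and an assumption here, and the inputs to it are made up.

```python
def elasticity_at_means(coefficient, x_mean, y_mean):
    """Point elasticity of y with respect to x for a linear model, evaluated at the sample means."""
    return coefficient * x_mean / y_mean


def implied_pct_change(elasticity, pct_change_in_x):
    """Approximate percent change in y for a given percent change in x."""
    return elasticity * pct_change_in_x


# A 5 percent rise in K/9 giving a 1 percent drop in HR/FB corresponds to
# an elasticity of roughly -1/5.
k9_elasticity = -1.0 / 5.0
print(implied_pct_change(k9_elasticity, 5.0))

# Hypothetical inputs (not from the article): a regression coefficient,
# the sample mean of K/9, and the sample mean of HR/FB in percent.
print(elasticity_at_means(coefficient=-0.3, x_mean=6.5, y_mean=10.6))
```

Because an elasticity is unit-free, it lets variables measured on different scales, such as miles per hour and strikeouts per nine innings, be compared directly.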

The r-squared for the model is 10 percent. Yes, this is a low r-squared. No, it does not invalidate the model. The significance of the independent variables is what is important here, and all of those values are more than acceptable. The fact that the r-squared is only 10 percent after fitting six significant independent variables shows that HR/FB is still very much impacted by randomness or other outside factors. This model gets us closer to the heart of HR/FB rate, but it does not explain all of its variability, and perhaps nothing can. This does suggest that, while pitchers have some ability to impact their HR/FB rates, there are significant variables that they do not control, which explains the year-to-year fluctuations for pitchers who undergo no other noticeable changes.
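For readers less familiar with r-squared, here is a minimal sketch of how it is computed from actual and fitted values. The numbers are invented purely to show the mechanics and are not the article's data.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that is accounted for by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation around the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up HR/FB rates (percent) and made-up fitted values.
actual = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
fitted = [10.2, 10.3, 10.4, 10.6, 10.7, 10.8]

print(round(r_squared(actual, fitted), 3))
```

Fitted values that barely move from pitcher to pitcher, as in this toy data, leave most of the spread in the actual rates unexplained even when they point in the right direction.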

The next logical question is “does the model work?” Using this model to project 2010 HR/FB rates yields a smaller sum of squared errors and mean squared error than a naive estimator of a 10.6 percent HR/FB rate. So, preliminarily, “yes, it does.”
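The comparison described above can be sketched as follows. The 2010 rates and model projections here are made up for illustration; only the 10.6 naive estimate comes from the text.

```python
def sum_squared_error(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum_squared_error(actual, predicted) / len(actual)


# Made-up 2010 HR/FB rates (percent) and made-up model projections for four pitchers.
actual_2010 = [7.4, 12.1, 9.8, 11.5]
model_projection = [8.9, 11.2, 10.1, 10.9]
# The naive estimator projects the same 10.6 for everyone.
naive_projection = [10.6] * len(actual_2010)

model_mse = mean_squared_error(actual_2010, model_projection)
naive_mse = mean_squared_error(actual_2010, naive_projection)
print("model beats naive:", model_mse < naive_mse)
```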

In the coming days, this model will be used to further the discussion about HR/FB rate and examine whether it is possible to continually outperform the mean.

“What is more interesting here is that none of the other three batted ball variables have a significant coefficient.”

What specifically do you mean by “significant”? I see you included BB/9 with an 85% significance — how low were the other variables?

I ask because the typical cut-offs for significance (99%, 95%, and 90%) are annoyingly arbitrary and disgustingly misleading. My curiosity compels me to discover what you decided for a cut-off point.

On a related note, did you include each of the four batted ball type percentages in the model simultaneously, or just one at a time?

Excellent questions. I went with an 80% cutoff. That’s lower than some use, but including a semi-significant variable creates fewer problems than excluding something that turns out to be relevant.

For testing, I included the variables together, one at a time, and nearly every combination in-between. Tons of iterations, trust me.

Instead of only looking at p-values, you might instead ask if you can reject reasonably large effects. The elasticities you present are easily interpretable, and at least for those you present in the table seem to be pretty precisely estimated. In other words 0 is not necessarily the most interesting null hypothesis here.

I’d also be interested to see results using standardized versions of all the regressors as an alternative to elasticities.
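A minimal sketch of the standardization suggested here, assuming plain z-scores (subtract the mean, divide by the standard deviation) are what is meant. The K/9 values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw values to z-scores with mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up K/9 values for five pitchers.
k9 = [5.0, 6.0, 7.0, 8.0, 9.0]
k9_z = standardize(k9)
print([round(z, 3) for z in k9_z])
```

With every regressor on this scale, each coefficient reads as the change in HR/FB for a one-standard-deviation change in that variable, which makes the coefficients comparable to one another.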

Finally, I am uncomfortable with cherry-picking regression specifications based on statistical significance (at least I think this is what you did). This can lead to severe bias and also invalidates all of your p-values. This is somewhat resolved using the out-of sample 2010 prediction, but it would be better to pre-specify just a few different specifications and test those.

Still, really interesting, and good work.

What about pitch location? I was noticing in a recent article on average pitch location (forget where I saw it) that both Barry Zito and Matt Cain (who both have low HR/FB ratios) also seemed to be outliers in average pitch location, living higher in the zone than most other pitchers. Made me curious if this was just a coincidence, or if there was some correlation there.

See the discussion at http://www.fangraphs.com/community/index.php/does-iffb-correlate-with-hrfb-rates/

We just talked about this stuff! The FB% matters too, in addition to IFFB%. Also, there are annual fixed effects. Did you try out FB% and annual dummy variables?

Last year I argued here, and on Lookout Landing, that Rafael Betancourt and Brandon League seemed on opposite ends of the spectrum with respect to hr/fb, and that some skill seemed involved. The numbers were too consistent and stark, despite overall low innings totals making it hard to be definitive. Betancourt did get bit a bit by Coors this year, yet his hr/fb rate again was lower than League’s.

The Lookout Landing folks that addressed it pretty much scoffed (ok, so I was being a bit of a troll anyway), insisting it was well-settled that no skill set is involved.

Well, this study suggests outside contact rates and IFFB% have a correlation with hr/fb rates.

Sure enough. Betancourt career O-Contact 57.7%, League 48.8%, Betancourt IFFB% 14.1%, League 2.9%. Betancourt career HR/FB rate 7.7%, League 17%.

Good work guys. Can we expect changes to xfip calcs?

Pasted from my comment in the above post:

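Dependent variable: HR/9

| Variable | Coefficient | (Std. error) |
|---|---|---|
| iffb | -0.0298*** | (0.00735) |
| k9 | -0.0536*** | (0.0164) |
| fb | 0.0266*** | (0.00353) |
| y2009 | -0.0505 | (0.0806) |
| y2010 | -0.145* | (0.0772) |
| Constant | 0.741*** | (0.164) |
| Observations | 480 | |
| R-squared | 0.140 | |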

I’m wondering how reasonable it is to assume (as the regression model requires) that the observations in these models (i.e. HR/FB rates for individual pitchers) are independent of one another. If we assume that park factors matter, then aren’t we essentially saying that pitchers who pitch in the same park will be “more like each other” (with respect to HR/FB rates) than pitchers who pitch in different parks, and wouldn’t that essentially mean that observations are not independent?

If so, then I am tempted to think that you should account for clusters (i.e. teams) of pitchers in some way that I don’t think is addressed by including the park factor score as an independent variable…

I agree completely. And there are repeated observations of players in the data, which certainly speaks to a need for clustering.

For my two cents, I think clustering would be somewhat useful but I am not sure if it’s worth the trouble, given that this is a fairly high-level super-aggregated analysis anyways.

I’ve been wondering about a similar issue on the same tack though: If we want to know the factors involved, why are we using aggregated pitcher stats to look for them? Shouldn’t we instead be using something like pitch outcomes?

I mean, it seems to me that at this point there is quite a bit of data pouring in that is on a per-pitch basis. Firstly, this gives a heck of a lot more observations. Secondly, this is way more instructive than using averages. I mean, this states: “Faster thrown balls -> Harder to hit HR.” But what if we actually found, at the disaggregated level, that the harder thrown balls for a pitcher were hit out of the park more- while they were actually lowering their overall HR rate on their breaking balls (such as by a higher speed differential- which is much easier with a strong fastball).

I would rather see a model that uses either individual pitches or plate appearances as events. That way, we can use the actual data available. In my opinion, pre-aggregating data can lead to some serious misleading conclusions sometimes. Doing individual pitches has the nice benefit of being able to really hone in on the particular pitch that got hit for a dinger. Maybe there could be a sort of cross-class comparison of the differences in characteristics between pitches that got hit over the fence versus those that didn’t (and subcategories therein).

Of course, the downside of individual pitch events is that you lose out on anything that comes from sequencing or game-theoretic responses (i.e. batters preparing differently for different types of pitchers). If one wanted to adjust for some of that, one could do lots of parallel cross-class comparisons to see the differences between each individual pitcher’s pitches or plate appearances based on HR or not. There’s definitely enough events for it, each pitcher is in for thousands of pitches and many hundreds of PA. Then the significant factors across pitchers could be analyzed.

So we’d end up with something like: “Jamie Moyer is hit for HR when his pitches are up in the zone” and “Neftali Perez is hit for HR when his pitches are up in the zone and such pitches tend to be 5 MPH slower than those that are not hit for HR” From those individual regularities, analyses could be run to determine what are common factors across all pitchers, or pitchers could be clustered based upon the key factors that typify their HR. Both of those would be fairly interesting.

Having figured out what characteristics tend to make a pitch go over the fence in the first place (which may differ for different kinds of pitchers), I think we’d be in a decent position to look into which of those things appears to be controllable and who has them.

So those are my thoughts on it. Either way, still interesting work- it’s a very nice cut at this.

Nice analysis!

I would also note a problem, probably you just used the wrong word, but the K-rate does not always correlate to velocity, as you gave in your example of “For this use, you can interpret the K/9 elasticity as a five percent increase in velocity translating into a one percent decrease in HR/FB rate. ”

I assume you mean 5% increase in K/9 = 1% decrease HR/FB rate.

Since I don’t see a “Reply” button under Jesse’s response above, I guess I’ll follow up here.

Assuming that the four batted ball type percentages (or proportions) sum to 1.0, you should include no more than 3 of them at a time in the model. Including all four simultaneously will result in severe multicollinearity.

That being said, I gather that you also tried all possible combos of 3 at a time, so it doesn’t really matter.
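The collinearity point can be checked numerically. The sketch below follows the comment's own assumption that four shares sum to 100 percent (on FanGraphs, IFFB% is reported as a share of fly balls, so that assumption may not hold for the actual data). The shares are made up, and the rank routine is a plain Gaussian elimination written for illustration.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Made-up shares in percent (four categories that sum to 100 in every row).
shares = [
    (45, 20, 25, 10),
    (50, 18, 27, 5),
    (38, 22, 32, 8),
    (42, 19, 30, 9),
    (55, 17, 22, 6),
    (47, 21, 24, 8),
]

# Design matrix with an intercept column plus all four shares.
with_all_four = [(1,) + row for row in shares]
# Same design matrix with one share dropped.
with_three = [(1,) + row[:3] for row in shares]

print(matrix_rank(with_all_four), "of", len(with_all_four[0]), "columns")
print(matrix_rank(with_three), "of", len(with_three[0]), "columns")
```

When the rank is smaller than the number of columns, the regression cannot assign a unique coefficient to each variable, which is why one share has to be left out.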

Good work,

I have one “but” . . .

If you include IFFB% as a predictor variable, you have to use HR/OFFB as your dependent variable. Otherwise your regression is just telling you that when a pop-up fails to be a home run, it fails to be a home run.

Yeah,

I was thinking about this too.

Using HR/OFFB probably makes more sense, whether or not IFFB is a predictor variable.

That being said, if he thinks he needs to use HR/FB, because that’s what people are familiar with, then he still should add IFFB%. It’s kind of a backwards way of accounting for IFFB, but the model will probably be better than if he leaves it out.
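For anyone who wants to try the HR/OFFB version, the conversion is simple arithmetic, assuming IFFB% is measured as infield flies divided by all fly balls (that definition is an assumption here, so check it against the data source). The inputs are the career figures for Betancourt and League quoted in an earlier comment.

```python
def hr_per_outfield_fly(hr_per_fb, iffb_share_of_fb):
    """Convert HR/FB to HR/OFFB, assuming IFFB% = infield flies / all fly balls."""
    if not 0.0 <= iffb_share_of_fb < 1.0:
        raise ValueError("IFFB share must be at least 0 and less than 1")
    # HR / (FB - IFFB) = (HR / FB) / (1 - IFFB / FB)
    return hr_per_fb / (1.0 - iffb_share_of_fb)


betancourt = hr_per_outfield_fly(0.077, 0.141)
league = hr_per_outfield_fly(0.17, 0.029)
print(round(betancourt, 3), round(league, 3))
```

Removing the pop-ups narrows the gap between the two pitchers only slightly, so in this pair most of the difference remains on outfield flies.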

Is there a way we can all pull this data to try and replicate your study? It would be fascinating to test other combinations and permutations of variables.

Great work!

One thing I’d like to see…

Include last year’s HR/FB, after accounting for the other variables, and see if it is still significant. Or a weighted average of previous years’ HR/FB.

That way we can see if there’s something else we’re missing, even if we don’t know what it is.

Can you clarify the difference between coefficient and elasticity? My econ 101 training conflated them in my mind (e.g., demand = elasticity * 1/price, where I would call elasticity the coef for 1/price in the model).

The coefficient typically specifies the unit change in the dependent variable resulting from a one-unit change in the independent variable.

Elasticity typically specifies the percent change in the dependent variable resulting from a one-percent change in the independent variable.

Those who use elasticity typically do so because they believe that it tells them which independent variables are “most important” (i.e. which have the largest effect on the dependent variable).

Perfect, thanks.

As an engineer, I’m waiting for an all-dimensionless-variable stats site, so I can see the appeal of elasticity vis a vis coef.

So can we quantify how much skill and how much randomness there is in HR/FB%? When you say that the analysis produces an r squared of 10 does that tell us anything precise about the ratio of skill to randomness?

The R^2 is more about how well this particular model explains reality. There is already some randomness involved in each of the independent variables (K/BB rate depends somewhat on umpires, park factor probably depends on weather, etc.), so I don’t think you can really use the R^2 of the model to say, for example, that 90% of HR/FB is explained by luck.

http://www.insidethebook.com/ee/index.php/site/comments/pre_introducing_batted_ball_fip_part_2/

Good analysis, I’m impressed that someone here bothered to develop a model and then test it on fresh data. One caveat: the “significance” levels are essentially bogus; once you run a regression and start re-arranging things (throwing out “insignificant” variables) you throw the statistical basis of the model out the window. You should report the significance of the very first model you ran with this data.
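To put a rough number on that concern: if several variables with no real effect are each tested independently at the 80% cutoff mentioned earlier in the thread (alpha = 0.20), the chance that at least one looks "significant" grows quickly. Independence and the specific test counts are simplifying assumptions for illustration.

```python
def chance_of_false_positive(alpha, n_tests):
    """Probability that at least one of n independent null tests passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 10, 20):
    print(n, round(chance_of_false_positive(0.20, n), 3))
```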

The 10% r-square means that the model explains about 10% of the variation in HR/fb rates. Better than nothing.

Yup. Though, I think I’d believe significance stats on the unreported out-of-sample 2010 regression.

Screw FIP and xFIP. tERA is where it’s at. Finally a stat that doesn’t treat all balls in play the same. Suck it, nerds.

I guess there’s no way to measure bat speed yet? I would imagine that this has a pretty big impact on the HR/FB rate as well…

In general, I would think that we wouldn’t expect much variation across pitchers with respect to the bat speeds they face. For the most part, I would think that most pitchers would face the same bat speeds.

Bat speed would be useful for predicting the rate at which fly balls leave the park, but probably (?) not so much for explaining differences in HR/FB rates across pitchers.

You might also want to think about average movement for each type of pitch. A flatter curveball is likely to elicit different effects than one with a lot of movement.

Also, I agree you have some problems with multiple comparisons after running lots of regressions.

Just a reminder that at THT we always used HR/OF, and xFIP was based on outfield flies, not total fly balls.

I see the virtues in this way of doing things. I understand why Fangraphs does what it does, but we are doomed to a lot of tedious re-discussion every time we discuss hr/fb.

OK, good to remember.

What defines infield? Just to the beginning of the outfield grass? Some portion beyond?

Seems strange to me that a high O-contact rate and a high K rate would both be predictors of a low HR/FB. Generally, these are an entirely different set of pitchers (and a different type of pitcher). For the most part, high O-contact rate pitchers are high O-contact rate pitchers because they can’t get swing and misses and therefore have low k/9s, and vice-versa.

The top 10 O-contact rate pitchers last year:

Doug Fister

Nick Blackburn

Livan Hernandez

Rodrigo Lopez

Jeremy Guthrie

Mark Buehrle

Kevin Slowey

Kyle Kendrick

Bronson Arroyo

Scott Feldman

Wow, very interesting.

If these are the factors that drive HR/FB, then I truly feel for the Bucs’ Charlie Morton: last year he had a league average FBv, a league average O-Contact %, a league average K/9, and a somewhat below average (1.0 stdev) IFFB%–yet despite playing in a pitcher’s park, he had the highest HR/FB (18.1%) of all SPs with over 50 IP. (FanGraphs data, filter for SPs and min 50 ip).

Forget Two-buck Chuck; what we’ve got here is “No-luck Buc Chuck”.

The one thing that stands out to me, but may well be impossible to quantify or calculate, is to remove the two types of balls that almost never leave the yard (IFFB, and balls hit outside of the strike zone [the latter being the more problematic one, especially if narrowed down to OFFB hit out of the zone]) and then see if there’s something more of a constant for home runs per “normal” fly ball. Then, it would seem the skill lies in getting batters to hit IFFB and to make contact out of the zone, both of which are potentially tied to things like K/9 (movement on pitches leading to poor contact or balls hit outside of the zone once the batter has committed to swinging), and BB/9 (ostensibly, control, so ability to expand the zone, paint the corners, etc.).

Of course, these are nothing but the idle musings of a seventeen year old with rather too much time on his hands; I’m very likely not onto anything — certainly not onto anything truly new. But, hey, who knows?

Strong work! I think this really starts to shed some light on why there may be certain pitchers who are able to consistently do some things that have been attributed to “luck” by some analysts.

Does it make sense to consider batted balls as discrete species, as opposed to a continuum? Presumably there’s a continuum of batted ball speeds, and a continuum of batted ball angle of elevation.

Take elevation angle. There’s a range of nearly 360deg of possible outcomes, and there’s no reason to think that there should be discrete zones. Instead, there may be regions of high probability. Maybe it looks like a normal(ish) distribution, and pitchers have control over the median and standard deviation of this distribution.

Maybe what Cain-like pitchers do is flatten this distribution (increase the stdev), reducing the number of “well-hit” balls. So, for example, you shift some of the OFFB to IFFB, which decreases HR/FB. And if you can prevent an associated increase in GB –> FB, then you decrease your HR/FB.

Whew. That was probably long winded and hard to understand–we need picture comments. Anyway, I’m curious what the analysis of HR rate looks like if we don’t normalize it by FB, but instead by something like BIP, OFFB, etc.

“The next logical question is ‘does the model work?’ Using this model to project 2010 HR/FB rates yields a smaller sum squared error and mean squared error than a naive estimator of 10.6 HR/FB rate. So, preliminarily, ‘yes, it does.’”

How did you come up with your naive estimator? Looking at the HR/FB data for 2010 presented on the leaderboards, I come up with an average HR/FB of .093, if I weight by presumed BIP (AB-SO+SH+SF) times FB%.
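The weighting described in this comment can be sketched as below: each pitcher's HR/FB is weighted by estimated fly balls, taken as presumed balls in play (AB - SO + SH + SF) times FB%. The two pitchers are made up, and the field names are hypothetical.

```python
def weighted_hr_per_fb(pitchers):
    """Average HR/FB weighted by estimated fly balls (presumed BIP times FB%)."""
    total_fb = 0.0
    total_hr = 0.0
    for p in pitchers:
        presumed_bip = p["ab"] - p["so"] + p["sh"] + p["sf"]
        estimated_fb = presumed_bip * p["fb_pct"]
        total_fb += estimated_fb
        total_hr += estimated_fb * p["hr_per_fb"]
    return total_hr / total_fb


# Made-up pitcher lines for illustration only.
pitchers = [
    {"ab": 700, "so": 150, "sh": 5, "sf": 5, "fb_pct": 0.40, "hr_per_fb": 0.08},
    {"ab": 500, "so": 100, "sh": 5, "sf": 5, "fb_pct": 0.30, "hr_per_fb": 0.12},
]

league_rate = weighted_hr_per_fb(pitchers)
simple_mean = sum(p["hr_per_fb"] for p in pitchers) / len(pitchers)
print(round(league_rate, 4), round(simple_mean, 4))
```

A weighted figure and a simple per-pitcher mean can differ noticeably, which may account for part of the gap between a .093 estimate and a 10.6 baseline.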

Since weather affects ball flight, wouldn’t the varying schedule affect the HR/FB rate of pitchers on a year-to-year basis? For example, the Phillies played at Colorado and SF in early cold-weather games. If you switched those series with ones in July or August in retractable-roof stadiums like Milwaukee or Houston, wouldn’t the Phillies’ pitchers be more likely to give up home runs on that schedule?

Amazing article. Top 5 for the last year on Fangraphs.