# Examining the components of batting average and BABIP

I know we’ve talked about batting average before, but I thought that since we’re getting into the off-season now, it would be a good idea to go over some of the concepts behind predicting it and even introduce a few new things concerning BABIP. There are three primary components to batting average, listed below.

- Contact rate – How often the ball is put in play.

- Home runs – How often the ball is hit out of the park. No chance of being caught. Essentially guaranteed hits.

- Batting Average on Balls in Play (BABIP) – How often the contacted balls (that aren’t home runs) fall for hits.

You could also consider smaller factors like batting average on bunts, but these are the three biggies.

The benchmark for what most people consider a good contact hitter is a .300 batting average (although the number needed to add value to a fantasy team is lower). If you have a .300 hitter on your fantasy team, he is going to provide you some excellent value in that category. The problem is that hitting .300, consistently, is no easy task. In order to truly be a .300 hitter, you either need a solid set of skills in each of the three above categories or amazing skills in at least one of them.

For example, a guy with a 95% contact rate and a .316 BABIP would post a .300 batting average without hitting a single home run. But if you drop that contact rate to even 85% (which is still above average), the batting average drops to .269. The BABIP would have to increase to .353 (which is very difficult to do, but again would demonstrate the need to be very strong in at least one category and above average in the second) in order to get the batting average back to .300.
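The arithmetic behind that example can be sketched in a few lines of Python. This is only a rough model under my own assumptions: contact rate is treated as (AB - K) / AB, home run rate as HR / AB, and sacrifice flies are ignored. The function names are made up for illustration.

```python
def batting_average(contact_rate, babip, hr_rate=0.0):
    """Approximate batting average from its three components.

    Assumptions (not spelled out in the article): contact_rate is
    (AB - K) / AB, hr_rate is HR / AB, and sacrifice flies are ignored,
    so balls in play per at-bat equal contact_rate - hr_rate.
    """
    balls_in_play = contact_rate - hr_rate
    # Home runs are always hits; balls in play become hits at the BABIP rate.
    return hr_rate + babip * balls_in_play


def required_babip(contact_rate, target_avg, hr_rate=0.0):
    """BABIP needed to reach target_avg at the given contact and HR rates."""
    return (target_avg - hr_rate) / (contact_rate - hr_rate)


print(f"{batting_average(0.95, 0.316):.3f}")  # 0.300
print(f"{batting_average(0.85, 0.316):.3f}")  # 0.269
print(f"{required_babip(0.85, 0.300):.3f}")   # 0.353
```

Passing a nonzero `hr_rate` shows how much a little power lowers the BABIP a hitter needs to reach .300.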

###### Contact rate and home runs

Contact rate and home runs are the most stable of the three batting average components.

When we do a year-to-year correlation test for contact rate, we get a very strong **.8305** correlation coefficient. This basically means that you can predict next year’s contact rate very well simply by using this year’s contact rate.

When we do this for home runs (AB/HR to normalize), our correlation coefficient is lower, but still decent, at **.6245**. Keep in mind that I’m still working on a system for projecting home runs (with infinite thanks to Greg Rybarczyk of HitTracker for his help) that I’m hoping I’ll be able to introduce within a week.

Please note the following criteria used for the two correlations above: 2004-2007 numbers were used. Players who changed teams mid-year in either Year 1 or Year 2 were excluded. Also, players needed to have at least 250 plate appearances in both Year 1 and Year 2.
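For anyone who wants to run this kind of test themselves, here is a minimal sketch of a year-to-year correlation with the criteria above applied. The record layout (`player`, `year`, `pa`, `changed_teams`) and the sample rows are hypothetical, and this is not the exact process I used to get the numbers above.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def year_to_year_pairs(seasons, stat, min_pa=250):
    """Pair each player's Year 1 stat with his Year 2 stat.

    `seasons` is a list of dicts with hypothetical keys: player, year,
    pa, changed_teams, plus the stat being tested.
    """
    by_key = {(s["player"], s["year"]): s for s in seasons}
    pairs = []
    for (player, year), first in by_key.items():
        second = by_key.get((player, year + 1))
        if second is None:
            continue  # no following season to compare against
        if first["changed_teams"] or second["changed_teams"]:
            continue  # mid-year team changes are excluded
        if first["pa"] < min_pa or second["pa"] < min_pa:
            continue  # both seasons need enough plate appearances
        pairs.append((first[stat], second[stat]))
    return pairs


# Made-up rows, purely to show the shape of the input.
sample = [
    {"player": "A", "year": 2004, "pa": 600, "changed_teams": False, "contact": 0.88},
    {"player": "A", "year": 2005, "pa": 580, "changed_teams": False, "contact": 0.87},
    {"player": "B", "year": 2004, "pa": 400, "changed_teams": False, "contact": 0.75},
    {"player": "B", "year": 2005, "pa": 420, "changed_teams": False, "contact": 0.77},
    {"player": "C", "year": 2004, "pa": 500, "changed_teams": False, "contact": 0.81},
    {"player": "C", "year": 2005, "pa": 510, "changed_teams": True, "contact": 0.80},
]
pairs = year_to_year_pairs(sample, "contact")
year1, year2 = zip(*pairs)
print(len(pairs), pearson(year1, year2))
```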

###### BABIP tests

That leaves us with BABIP, a critical number for every player in baseball but one that is highly variable and very difficult to predict. Still, because of how important it is, we need to try to do just that.

To do this, we’ll use some simple correlations to find which stats are best able to predict BABIP.

First, a quick note. For each of these correlations (with the exception of #1), I’m not using straight BABIP. I’m excluding bunts and only including the four outcomes (outfield fly, infield fly, grounder, liner) that can occur when a player is swinging to get a hit to show a clearer picture of a hitter’s ability. We’ll call this BABIP2 for the sake of easy reference. When we eventually compile our projected batting average, we’ll also include bunts separately from BABIP.
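To make the definition concrete, here is one way BABIP2 could be computed from batted ball counts. Treat it as a sketch of the idea and not the exact calculation behind the numbers in this article: the dictionary layout is hypothetical, and I am assuming that the hit counts include home runs, which then get removed from both hits and balls in play.

```python
# The four swinging outcomes that count toward BABIP2; bunts are left out.
SWING_TYPES = ("outfield_fly", "infield_fly", "grounder", "liner")


def babip2(batted_balls):
    """BABIP restricted to the four swinging batted-ball types.

    `batted_balls` maps a type name to a dict with hypothetical keys
    `balls`, `hits`, and (optionally) `home_runs`. Home runs are assumed
    to be included in `hits` and are stripped out here.
    """
    hits = 0
    in_play = 0
    for kind in SWING_TYPES:
        row = batted_balls.get(kind)
        if row is None:
            continue
        home_runs = row.get("home_runs", 0)
        hits += row["hits"] - home_runs
        in_play += row["balls"] - home_runs
    return hits / in_play if in_play else float("nan")


# Made-up season line; the bunt row is ignored by design.
season = {
    "outfield_fly": {"balls": 150, "hits": 40, "home_runs": 20},
    "infield_fly": {"balls": 20, "hits": 0},
    "grounder": {"balls": 200, "hits": 48},
    "liner": {"balls": 90, "hits": 64},
    "bunt": {"balls": 10, "hits": 4},
}
print(f"{babip2(season):.3f}")  # 0.300
```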

**1) BABIP correlation from year-to-year**

Let’s see exactly how well BABIP can predict itself.

__Correlation Coefficient:__ **.3066**

__Criteria__: 2004-2007 numbers were used. Players who changed teams mid-year in either Year 1 or Year 2 were excluded. Also, players needed to have at least 250 plate appearances in both Year 1 and Year 2.

Not terrible (considering some of the results we get later), but not very good either considering what we got for contact rate and home runs. This confirms what I said earlier about BABIP being very variable. There is a positive correlation between the two, but it isn’t especially strong. Let’s see if we can find something better.

**2) Walk rate correlation with BABIP2**

The logic behind this is that walk rate shows patience and selectivity. Those who wait for good pitches, theoretically, will be more likely to convert the ones they do swing at into hits. Of course, this doesn’t take actual hitting ability into consideration.

__Correlation Coefficient:__ **-.2926**

__Criteria__: 2004-2007 numbers were used. Players needed to have at least 250 plate appearances to be eligible.

Wow. Not at all what I was expecting. There’s actually a not-all-that-weak negative correlation, meaning the more walks, the lower the BABIP2. Very surprising. I’d have to think it is because, as I said before, actual hitting ability isn’t considered.

**3) (Called Strikes + Balls)/(Total Pitches) with BABIP2**

Maybe walk rate isn’t the best measure of selectivity, so we’ll dig a little deeper into the number and use the actual pitch data (a big thank you to Retrosheet for this data). Let’s see if the results are any different than they were for walk rate.

__Correlation Coefficient:__ **0.0266**

__Criteria__: 2004-2006 numbers were used (Retrosheet doesn’t have 2007 numbers up yet). Players needed to have at least 250 plate appearances to be eligible.

Well, at least we’re in positive territory. The correlation is — for all intents and purposes — non-existent, though.

**4) Walks/Strikeouts (BB/K) correlation with BABIP2**

I’ve often heard that walks divided by strikeouts is a good measure of a batter’s discipline, or his eye, or his command of the strike zone. Let’s see if this has any relationship with BABIP.

__Correlation Coefficient:__ **-.0196**

__Criteria__: 2004-2007 numbers were used. Players needed to have at least 250 plate appearances to be eligible.

Nope. It just doesn’t seem like these types of numbers tell us much about BABIP. I definitely think they are useful for different purposes, but for today, they haven’t been much help. Let’s check out our batted ball data and see if we can do better.

**5) Line drive rate correlation with BABIP2**

I would expect this one to be much better than walk rate proved to be. Line drives fall for hits, on average, around 71% of the time. Logically, those who hit a lot of them should have higher BABIPs. Let’s see if that’s the case.

__Correlation Coefficient:__ **.4169**

__Criteria__: 2004-2007 numbers were used. Players needed to have at least 250 plate appearances to be eligible.

Not fantastic, but considering that we’re working with BABIP, I definitely think that it is significant. Line drives seem like a good measure to use for projecting BABIP. I just found an interesting post from 2005 by Dave Studeman, in which he produces a general formula for predicting BABIP: LD% + .120.
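Studeman's rule of thumb is simple enough to write as a one-liner. The function name and the sample line drive rates below are just illustrations, and line drive rate is expressed as a fraction of batted balls (0.20 for 20%).

```python
def babip_from_line_drives(ld_rate, constant=0.120):
    """Rule-of-thumb BABIP estimate: line drive rate plus .120.

    `ld_rate` is a fraction (0.20 means 20% line drives).
    """
    return ld_rate + constant


for ld_rate in (0.18, 0.20, 0.22):
    print(f"LD% {ld_rate:.0%} -> expected BABIP {babip_from_line_drives(ld_rate):.3f}")
```

By this rule, every extra percentage point of line drives is worth about one point of BABIP.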

**6) Outfield fly ball BABIP correlation with BABIP2**

As David Gassko surmised in his article from a couple of weeks ago, since fly balls have one very stable event (home runs) and lots of easily fielded balls (lazy flies), the guys who have high hit rates on fly balls are probably hitting the ball harder than other players. Let’s test this theory on BABIP.

__Correlation Coefficient:__ **.3061**

__Criteria__: 2004-2007 numbers were used. Players needed to have at least 250 plate appearances to be eligible.

Not quite as good as line drive percentage, but it’s decent. Let’s see how these two can predict themselves.

**7) Outfield fly ball BABIP correlation from year-to-year**

How consistent is outfield fly ball BABIP?

__Correlation Coefficient:__ **.1635**

__Criteria__: 2004-2007 numbers were used. Players who changed teams mid-year in either Year 1 or Year 2 were excluded. Also, players needed to have at least 250 plate appearances in both Year 1 and Year 2.

As you see, while fly ball BABIP is a decent predictor of actual BABIP, it isn’t very consistent from year to year.

**8) Line drive rate correlation from year-to-year**

How consistent is a player’s line drive rate?

__Correlation Coefficient:__ **.2653**

__Criteria__: 2004-2007 numbers were used. Players who changed teams mid-year in either Year 1 or Year 2 were excluded. Also, players needed to have at least 250 plate appearances in both Year 1 and Year 2.

Not very consistent from year-to-year, but it correlates better than outfield fly BABIP does and is better at predicting BABIP too. Still, it seems like it would be difficult to predict BABIP before the season begins using either of these two.

**9) 3 year, unweighted BABIP2 correlation with Year 4 BABIP2**

Moving on from batted ball numbers, let’s see if several years of a player’s BABIP can predict the following year’s BABIP with any certainty.

__Correlation Coefficient:__ **.5843**

__Criteria__: 2004, 2005, and 2006 numbers were combined (but unweighted) to get a player’s 3-year BABIP2. This was then compared with that player’s 2007 BABIP2. Players needed to have at least 650 plate appearances between 2004 and 2006 and at least 250 plate appearances in 2007 to be eligible.

Our best result yet. It seems that, given enough at-bats, a player’s true ability to convert balls in play into hits will begin to reveal itself. Keep in mind that I used an unweighted three-year figure and that there were far fewer records than in any of our other correlations (just 232 records). I don’t have data from other years to work with, but right now this seems like our best bet for predicting BABIP.

**10) 2 year, unweighted BABIP2 correlation with Year 3 BABIP2**

Since some players haven’t yet played 3 years in baseball, I wanted to know if a two-year BABIP would be better than line drive rate and outfield fly BABIP.

__Correlation Coefficient:__ **.5851**

__Criteria__: 2004 and 2005, and 2005 and 2006 numbers were combined (but unweighted) to get a player’s 2-year BABIP2. This was then compared with that player’s 2006 and 2007 BABIP2, respectively. Players needed to have at least 450 plate appearances between the first two years and at least 250 plate appearances in the third year to be eligible.

Turns out, the correlation coefficient is actually a tiny bit better than the three-year figure. Keep in mind, though, that the three-year sample size was somewhat small (if you’re curious, there were 487 records in the two-year set). It looks like it should be okay to evaluate guys who have played for two years using this.

**11) 3 year, weighted BABIP2 correlation with Year 4 BABIP2**

Let’s see if the results get any better if we weight the numbers.

__Correlation Coefficient:__ **.5812**

__Criteria__: 2004, 2005, and 2006 numbers were combined (and weighted) to get a player’s 3-year BABIP2. This was then compared with that player’s 2007 BABIP2. Players needed to have at least 650 plate appearances between 2004 and 2006 and at least 250 plate appearances in 2007 to be eligible.

Nearly identical results to the unweighted correlation. Very interesting stuff.
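For reference, here is a rough sketch of how a multi-year figure can be built either way. It assumes "unweighted" means pooling hits and balls in play across seasons, and that weighting means scaling each season's totals before pooling. The 1/2/3 weights in the example are purely hypothetical, since the weights used for test 11 aren't listed here.

```python
def multi_year_babip2(seasons, weights=None):
    """Combine several seasons into one BABIP2 figure.

    `seasons` is a list of (hits, balls_in_play) tuples, oldest first.
    With no weights, totals are simply pooled. With weights, each
    season's hits and balls in play are scaled before pooling.
    """
    if weights is None:
        weights = [1.0] * len(seasons)
    if len(weights) != len(seasons):
        raise ValueError("need exactly one weight per season")
    hits = sum(w * h for w, (h, _) in zip(weights, seasons))
    in_play = sum(w * b for w, (_, b) in zip(weights, seasons))
    return hits / in_play


# Made-up seasons: .300, .320, and .280 on 400, 450, and 500 balls in play.
three_years = [(120, 400), (144, 450), (140, 500)]

unweighted = multi_year_babip2(three_years)
weighted = multi_year_babip2(three_years, weights=[1, 2, 3])  # hypothetical weights
print(f"unweighted {unweighted:.3f}, weighted {weighted:.3f}")
```

With weights that favor the most recent season, the combined figure drifts toward that season's BABIP2, which is the whole point of weighting.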

###### Closing thoughts

Reviewing, the three most important components of a player’s batting average are contact rate, home run rate, and BABIP. Contact rate is the most stable, home run rate is second, and BABIP is quite unstable.

I think we made some strides, though, in our attempt to find numbers that can predict it with some measure of accuracy. Our best results came from weighted and unweighted multi-year BABIPs, batted ball data gave moderate results, and the stats reflecting patience and selectivity showed almost no effect on BABIP whatsoever.

For now, it looks like the best route for predicting BABIP before the season begins will be multi-year BABIP and during the season perhaps a combination of multi-year BABIP and line drive percentage. I’m sure we’ll be talking more about and digging deeper into this type of stuff in the future, but I think this is a good start. Also, as I mentioned before, stay on the lookout for a new system for home runs (using HitTracker) in the near future.

EDIT: The following corrects for a mistake I made in this article. — D.C. 11/22/07

###### Errata

In this article, I had incorrectly calculated BABIP2. This had little effect on most of the correlation coefficients, but a few had significant changes. All of the new correlation coefficients are listed below.

2) Walk rate correlation with BABIP2 — **0.05**

3) (Called Strikes + Balls)/(Total Pitches) with BABIP2 — **0.03**

4) Walks/Strikeouts (BB/K) correlation with BABIP2 — **-0.02**

5) Line drive rate correlation with BABIP2 — **0.45**

6) Outfield fly ball BABIP correlation with BABIP2 — **0.52**

9) 3 year, unweighted BABIP2 correlation with Year 4 BABIP2 — **0.39**

10) 2 year, unweighted BABIP2 correlation with Year 3 BABIP2 — **0.37**

11) 3 year, weighted BABIP2 correlation with Year 4 BABIP2 — **0.38**

Outfield fly ball BABIP gets a big boost, enough to become the top predictor of BABIP2 that we looked at. Unfortunately, as explained in this article, it isn’t a very stable event. Line drive rate also got a tick higher, and — as we discussed in the same article — it is somewhat predictable using a three-year figure.

9, 10, and 11 — obviously — are significantly lower than where we had them before. They are still decent, but not great. More work certainly needs to be done in this field.