2009 BABIP-xBABIP Splits
Yesterday, we took a look at the starting pitchers with the biggest difference between their ERAs and their Expected Fielding Independent ERAs, attempting to find which hurlers performed above or below their peripheral stats in 2009.
Today, let’s turn our attention to the hitters. I compiled a list of the batters (minimum 350 plate appearances) with the biggest gap between their batting average on balls in play (BABIP) and their expected batting average on balls in play (xBABIP).
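For readers who want to reproduce the gap, here is a minimal sketch in Python. It assumes the standard BABIP definition, (H - HR) / (AB - K - HR + SF); the function names and the sample hitter are hypothetical and not taken from the lists below.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play, using the standard definition."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)


def babip_gap(actual, expected):
    """Positive means the hitter beat his xBABIP; negative means he fell short."""
    return actual - expected


# Hypothetical hitter: 150 H, 20 HR, 550 AB, 100 K, 5 SF, xBABIP of .320
actual = babip(150, 20, 550, 100, 5)
print(round(actual, 3))                    # 0.299
print(round(babip_gap(actual, 0.320), 3))  # -0.021
```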
What’s xBABIP? Last winter, Chris Dutton and Peter Bendix sought to find which variables were most strongly correlated with a batter’s BABIP. Using data from the 2002-2008 seasons, Dutton and Bendix found that a hitter’s eye (BB/K ratio), line drive percentage, speed score and pitches per plate appearance had a positive relationship with BABIP (the better a batter rated in those areas, the higher his BABIP). Pitches per extra-base hit, fly ball/ground ball rate, spray (distribution of hits to the entire field) and contact rate had a negative relationship with BABIP. From this research, they created a model for predicting a batter’s BABIP.
Prior to Dutton and Bendix’s work, a lot of people used to calculate a hitter’s expected batting average on balls in play by taking line drive rate and adding .120. It made some sense: line drives have the highest batting average of any batted ball type by far, falling for a hit well over 70 percent of the time.
However, line drive rates don’t show a high correlation from year to year. That makes the “LD% plus .120” method unreliable. Dutton and Bendix’s model showed a 59 percent correlation between actual and expected BABIP. The LD% + .120 method showed just an 18 percent correlation.
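The old rule of thumb is simple enough to write in one line. This is only a sketch of that shortcut, with a hypothetical function name and a made-up line drive rate; it is not the Dutton and Bendix model.

```python
def quick_xbabip(ld_rate):
    """Old shortcut: expected BABIP is line drive rate plus .120."""
    return ld_rate + 0.120


# Hypothetical hitter with a 19 percent line drive rate
print(round(quick_xbabip(0.19), 3))  # 0.31
```

Because the only input is line drive rate, any year-to-year noise in LD% passes straight through to the estimate.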
Some of the numbers used in Dutton and Bendix’s study are not readily available. However, Derek Carty of The Hardball Times and Slash12 of Beyond the Box Score have both come up with expected batting average on balls in play calculators based on the new findings.
For the purposes of this article, I used Slash12’s calculator. It uses the following variables:
- Line Drive Percentage (LD%)
- Ground Ball Percentage (GB%)
- Fly Ball Percentage (FB%)
- Infield/Fly Ball Percentage (IFFB%)
- Home Run/Fly Ball Percentage (HR/FB%)
- Infield Hit Percentage (IFH%)
While not identical to the variables used by Dutton and Bendix, these batted ball numbers do a good job of taking into account the aspects that lead to a higher or lower BABIP.
First, a disclaimer. Like the ERA-xFIP charts from yesterday, these lists of “lucky” and “unlucky” hitters are based on just one year of data. To get a better feel for how a hitter will perform in the future, it’s vital to take a good hard look at multiple seasons’ worth of performance. This is just a quick-and-dirty exercise.
To provide a little more context, I also included each batter’s actual BABIP since 2007, when possible. The three-year averages give us a better picture of each hitter and help us figure out which batters might be “tricking” the xBABIP calculator with one year of aberrant batted ball numbers.
Take Jason Kendall, for instance. Kendall had a 12 percent infield hit rate in 2009, compared to a 7.6 percent career average. The calculator doesn’t know that Kendall’s ankle exploded like a cheap Acme bomb a decade ago, and that he’s a 35-year-old catcher who has had a BABIP under .270 since 2007. It thinks he has speed because of the infield hit rate. That’s why you need to look at multi-year numbers.
Here are the hitters with actual batting average on balls in play figures exceeding the expected batting average on balls in play numbers. These are the guys who might see their batting averages fall in 2010:
Higher BABIP than xBABIP
And here are the batters with actual BABIPs falling well short of the xBABIP totals. These hitters could experience a bounce-back in 2010:
Lower BABIP than xBABIP


One important note about Slash12’s xBABIP calculator: it doesn’t use park factors. I prefer The Hardball Times’ calculator because it does consider park factors.
I’d really like to see a study comparing these two xBABIP calculators one day.
Slash12 here. Actually, I believe it does take park factors into account. What does a park affect? The biggest things that come to mind are home run rate, slowness of the infield, and foul territory. These are all included in the analysis through HR/FB%, IFH%, and IFFB%, respectively.
Still, there’s definitely something missing from this equation, because if you look over a few years, a few anomalies really stick out. It thinks Ryan Howard and Brandon Phillips should be hitting for much higher BABIPs, for instance. I have a feeling it’s something like spray that’s missing, but I am not sure where to get that data.
Spray aside, it does a pretty decent job of predicting future BABIP as it is (though, I suspect, a less accurate job with dead pull hitters or hitters who go the other way more regularly).
I got this from the link of your article introducing the calculator:
“It’s worth noting, that I’m not taking into account ballpark factors (which surely have some kind of effect on BABIP as well)”
I do see how they are kinda incorporated in the %’s from your last post. But you don’t use the ESPN or Firstinning or some other park factor?
And when I try opening the link to your calculator why does it say I do not have permission to access this spreadsheet??
Sorry about the Google Docs link. I guess it doesn’t work the way I had hoped (I can’t seem to just make a spreadsheet available to anybody who wants it). The data here is much more interesting than what I’ve got in that spreadsheet anyway (and the download should work, I believe).
I guess I’m contradicting my initial post a little but, in my defense, I’ve done a lot of thinking and research since I wrote the original article. I do currently believe that ballpark factors are worked into the calculator, at least somewhat, but probably not completely. A big outfield probably leads to more fly balls dropping for hits, for instance, and that’s not factored in anywhere. Park effects are built in only as far as the batted ball data I’m working with allows. If I had access to an outfield fly ball hit percentage, that would help incorporate park factors even more.
Actually, according to this study at Hardball Times park factors affect just about everything — including unexpected things like singles and walks and even groundball rates. Intuitively you wouldn’t think that to be the case, and nobody seems to have a good explanation, but the data apparently supports it. I’d love to see a park factors adjustment that incorporates this data (as well as their home run park factors) because our current crude park factors don’t seem to be adequately modelling their effect.
I’m not clear on the exact details of their study, but it seems to me that it would be really difficult to factor out the non-ballpark factors that seem much more likely to affect strikeouts in a given ballpark. Some pitching staffs are going to be more likely to record strikeouts, and some teams are going to be less likely to strike out as well. How do you truly factor these out? I don’t think that ESPN’s ballpark factor does a good job of doing so, based on how much so many of their ballpark factors change from year to year.
Home run ballpark factors are the same thing. If you go back and look at previous years’ history, a lot of ballparks will go from a home run ballpark to a pitcher-friendly ballpark (the Twins’ stadium, if I recall, is one).
I believe all they are doing is taking the number of strikeouts, home runs, or whatever for the visiting team and the home team (and giving more weight to visiting teams). Doing this is going to be highly prone to error, because the home team’s hitting skill will surely play into those numbers (and that changes every year, unlike the ballpark itself).
Anyway, in summary, I’m just skeptical as to how good these various ballpark factors are, and how much they are really indicative of the ballpark itself, and not just a factor of something else. For instance, if a team has a ton of flyball pitchers on their staff, that’s going to drive the ballpark HR factor way up, no matter what the dimensions are. Same with strikeouts, walks, etc.
Oh so I’m guessing David G. used the formula you had there to compile his own spreadsheet posted above since your link doesn’t work at your post.
Bobby, my understanding of ESPN park factors is that they are derived by comparing home rates vs road rates for the home team, and park rates vs average rates for road teams.
So to use your example, a home team with a lot of flyball pitchers wouldn’t affect the HR park factor, since it’s based on how many more (or fewer) HR’s that pitching staff gives up at home than on the road — not just how many HR’s above (or below) average are hit in that park for the year.
The Red Sox and Mariners serve as decent examples, as their staffs had the 2nd and 3rd highest FB%, respectively, in 2009, but Fenway and Safeco played as the 10th and 3rd worst home run parks. Since the park factor is similar to looking at home-road splits, at least a few seasons of data are required to determine the true park effect, which accounts for the annual swings you mention.
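To make the home-versus-road idea concrete, here is a rough Python sketch. It assumes the factor is the combined per-game rate (team plus opponents) in home games divided by the same rate in road games, which is my understanding of how such factors are usually published and not something confirmed in this thread. The function name and all of the totals are hypothetical.

```python
def home_road_park_factor(home_for, home_against, home_games,
                          road_for, road_against, road_games):
    """Ratio of the combined per-game rate at home to the same rate on the road.

    Assumed formula, not a confirmed ESPN specification.
    A value above 1.0 suggests the park inflates the event.
    """
    home_rate = (home_for + home_against) / home_games
    road_rate = (road_for + road_against) / road_games
    return home_rate / road_rate


# Hypothetical fly ball staff that allows plenty of homers everywhere:
# 80 HR hit and 100 allowed at home, 85 hit and 110 allowed on the road.
print(round(home_road_park_factor(80, 100, 81, 85, 110, 81), 3))  # 0.923
```

In this made-up case the staff gives up a lot of home runs, yet the park still grades out below 1.0, because the same staff gives up even more on the road.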
I agree that it is difficult to say whether the THT study is based on the same methodology.
Aren’t most projection systems comprehensive enough to account for most of this? Or is this information important on top of using CHONE, or PECOTA for example?
Actually, looking at the results, it seems that BABIP07-09 is a stronger predictor of BABIP09 than xBABIP. Now, it does have the 2009 data in it, which is kind of cheating, but if BABIP has a strong year-to-year correlation then we don’t need a new stat.
Assuming everything stays the same, you may very well be right (I haven’t done an analysis to prove or disprove this). As I alluded to above, there’s something missing from this (I think it’s something like spray) that throws it off for some batters.
However, things don’t always stay the same, especially with younger players. They may flip-flop their GB/FB rate over time. They may start driving the ball better as they mature (increasing LD% and decreasing IFFB%), they may have a leg injury or recover from one (affecting IFFB%), or they may change home parks over the course of three years. All of these things can affect the components that make up your BABIP.
So which is better? I’m not sure, this would be an interesting analysis, and one that I’ll try running when I have some time, if somebody doesn’t beat me to it.
Sorry, I left out one detail. This equation does a better job of predicting the next year’s BABIP than the previous year’s BABIP does. This has been statistically proven. Using a 3-year average BABIP has not been tested.
What I’d like to test is the following 3 variables:
-3 years of actual BABIP data
-1 year of xBABIP
-3 years of xBABIP data
One of the 3-year options will prove to be most accurate for sure, but it will be interesting to see how just 1 year of data compares. The problem is, I’m not sure how to do this analysis, because I’m not sure how to get a player’s 3-year LD% (simply averaging 3 years of LD% won’t be accurate). Comparing 1 year of xBABIP data to 3 years of BABIP data, I suspect BABIP would win out.
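One possible way around the 3-year LD% problem is to weight each season by its number of batted balls, which is the same as pooling the raw counts. The Python sketch below shows that, plus a plain Pearson correlation that could be used to score each candidate predictor against next-year BABIP. Both helpers and all of the numbers are hypothetical; nobody in this thread has run this analysis.

```python
from math import sqrt


def pooled_rate(rates, batted_balls):
    """Multi-year rate weighted by each season's batted ball total."""
    total = sum(batted_balls)
    return sum(r * n for r, n in zip(rates, batted_balls)) / total


def pearson(xs, ys):
    """Pearson correlation between a predictor and the outcome it predicts."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical hitter: a short first season, then two full ones
ld_rates = [0.22, 0.18, 0.20]
batted_balls = [100, 400, 500]
print(round(pooled_rate(ld_rates, batted_balls), 3))  # 0.194
print(round(sum(ld_rates) / len(ld_rates), 3))        # 0.2, the simple average
```

To compare the three options, compute `pearson(predictor, next_year_babip)` once per predictor over the same set of hitters and see which value is highest.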
The Iron_Throne,
I think xBABIP is definitely useful. Matt Swartz over at Baseball Prospectus did a study during the spring that looked at the year-to-year correlation of certain stats:
http://www.baseballprospectus.com/article.php?articleid=8932
Hitter BABIP from one year to the next showed a 36-37 percent correlation. Dutton and Bendix’s model showed a near 60 percent correlation between predicted and actual BABIP. So, knowing a guy’s xBABIP based on all of these inputs usually gives us a better idea of what his future BABIP will be, as opposed to just using past BABIP.
maybe i shouldn’t look a gift horse in the mouth (i certainly appreciate all the work you guys do), but do you have expected batting averages based off of xBABIP?
maybe even an adjusted slash line based either off assuming only singles are added/subtracted, or maybe assuming an equal ratio of each type of non-HR hit added/subtracted?
either way, thanks for the article, it’s very helpful
it’s pretty easy to calculate, but you’ll need some data:
xAvg = ((xAB - (xAB * K%) - xHR) * xBABIP + xHR) / xAB
xAB = expected at bats
K% = Strikeout percentage
xHR = expected Home runs
I’ve done this with my spreadsheet, but my projections aren’t really complete
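The formula above can be sketched in Python as follows. It treats K% as strikeouts per at-bat, which is what the formula implies, and it ignores sacrifice flies. The function name and the sample inputs are hypothetical.

```python
def expected_avg(x_ab, k_rate, x_hr, x_babip):
    """Expected batting average from expected at-bats, strikeout rate
    (per at-bat), expected home runs, and xBABIP."""
    balls_in_play = x_ab - x_ab * k_rate - x_hr   # at-bats ending with a ball in play
    expected_hits = balls_in_play * x_babip + x_hr
    return expected_hits / x_ab


# Hypothetical projection: 550 AB, 18 percent strikeouts, 20 HR, .310 xBABIP
print(round(expected_avg(550, 0.18, 20, 0.310), 3))  # 0.279
```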
Is there a place where we can look at xBABIP for all players instead of just this listing of players?
Hey guys, this spreadsheet lets you look at xBABIP (and adjusted lines) for all players in our database (if it’s locked, use the password “tuftsbat”):
http://www.hardballtimes.com/main/fantasy/article/simple-xbabip-calculator/
Derek Carty and I are also putting together a revised, more comprehensive dataset in order to refine the model, so expect to see a second iteration in the next few weeks.
-Chris
When you calculated this data, what did you use as a base team from which to project xBABIP? I did something similar with Excel and used the Kansas City Royals as the base team (Kauffman Stadium had the most neutral park effect in 2008, and the 2008 Royals ranked 18th of 30 in cumulative fielding ability).
Suggestions to make a better Excel chart? Also, can you make the GB/FB/IFFB data sortable? I had to estimate these specific figures using GB%/FB%/IFFB% and they are not quite right (i.e., Eckstein’s estimated GB figure is 211.6, while his actual number is 209).
This resulted, for example, in my HanRam xBABIP being .323 rather than .319.
A large percentage of the “over-achievers” have above-average speed and a large percentage of the “under-achievers” have below-average speed. This fact will account for a lot of the discrepancies in the numbers. Hanley Ramirez does not need to hit the ball as hard as Jason Giambi in order to get a base hit. I think these numbers will always be skewed due to this fact, and I do not think xBABIP is an accurate predictor of the next season’s batting average. I would discount all of the players on the “over-achievers” list who have above-average speed in terms of predicting next year’s average. I would do the same for all the players on the “under-achievers” list who have below-average speed. However, a guy like Pablo Sandoval, who is an over-achiever with below-average speed, is probably due for a letdown. A guy like Ryan Spilborghs, an under-achiever with above-average speed, is probably due for a spike in average.