At the MIT Sloan Analytics Conference, Dan Rosenheck gave a presentation on using infield fly rate and in-zone contact rate to predict BABIP. He has generously let us republish the full text of his talk below. You can find more of Dan’s work at The Economist and The New York Times, or check out his previous project on Wins Above Replacement.
Long before baseball’s statistical revolution entered the mainstream, Hollywood writers showed an impressive grasp of one of the game’s most confounding nuances. In the 1988 movie “Bull Durham,” Kevin Costner plays Crash Davis, a career minor league catcher tasked with tutoring the pitching prospect Nuke LaLoosh. Explaining the cruel randomness that determines the fates of so many aspiring major leaguers, Crash gives Nuke a brief math lesson at a bar.
Crash’s speech actually contradicted much of conventional wisdom about baseball at the time. Ever since the Hall of Famer Wee Willie Keeler explained the secret to his success as “hit ‘em where they ain’t,” fans and sportswriters have generally given hitters credit for expertly placing their seeing-eye grounders and Texas Leaguers into the gaps between opposing fielders. Similarly, they have praised pitchers who seem to induce opposing hitters into serving up a steady stream of routine plays for the defense. At one point, even Crash himself preaches the value of pitching to contact. “Don’t try to strike everybody out,” he advises Nuke. “Strikeouts are boring. Besides that, they’re fascist. Throw some ground balls. It’s more democratic.”
That school of thought went unquestioned until 1999, when Voros McCracken, then a 28-year-old paralegal in Chicago, unveiled what was arguably the most important discovery in baseball analytics of the last 20 years. Voros, if you’re in the audience, please tip your cap. Seeking an edge in his fantasy baseball league, he began investigating how well one could predict pitching statistics by looking at past performance. He broke up the events on a baseball field into two groups: those determined entirely by the pitcher and hitter, and those that required the defense to make a play. The former, what Crash would call “fascist” plays, included strikeouts, walks, and home runs—later dubbed the “Three True Outcomes” by McCracken’s devotees—as well as hit by pitch. The latter, Crash’s “democratic” results, incorporated singles, doubles, triples, and all fielded outs.
The results were stunning.
The defense-independent numbers were fairly steady: pitchers who struck out a lot of batters one year were extremely likely to do so in the following season, and those who walked too many guys once tended to do it again as well. Home runs seemed to be a bit messier, but there was no doubting that some pitchers had severe, incurable gopheritis year in and year out. But batting average on balls in play—the share of fielded balls that became singles, doubles, or triples—was a crapshoot. Using a standard measure of correlation known as r, its consistency from year to year came in at 0.153—barely half the level that social scientists have long mocked as meaningless with the quip, “The world is correlated at 0.3.” There seemed to be no rhyme or reason to the leaderboards. In 1999, Pedro Martínez and Greg Maddux had among the worst BABIP’s in baseball; the following year they posted two of the best.
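For readers who want to reproduce a year-to-year consistency check like this one, here is a minimal sketch of the correlation coefficient r in plain Python. The function name and the five BABIP values are made up for illustration and are not from McCracken's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical BABIP values for the same five pitchers in consecutive seasons.
babip_year1 = [0.310, 0.285, 0.270, 0.300, 0.295]
babip_year2 = [0.290, 0.300, 0.285, 0.280, 0.305]
r = pearson_r(babip_year1, babip_year2)
```

A value near 1 would mean last year's ranking carries over almost intact; a value near 0, like the 0.153 cited above, means it tells you very little.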
McCracken did not shy away from the intellectual consequences of his discovery. As hard to believe as it seemed, Wee Willie Keeler was wrong and Crash Davis was right: pitchers had little to no control over the outcomes of balls hit into play against them. The only difference between Pedro Martínez and Pedro Borbón was the former’s superior strikeout, walk, and home run rates. Everything else—everything from screeching liners to gorps and dying quails—was up to the fielders and Lady Luck.
To make use of the finding, McCracken developed a formula to estimate a pitcher’s ERA based exclusively on the factors he could influence. Called “Defense Independent Pitching Statistics,” or DIPS for short, it presumed that all pitchers would have a league-average BABIP. And it predicted next year’s ERA better than this year’s ERA did. Bill James, widely recognized as the father of quantitative baseball analysis or “sabermetrics,” later wrote of DIPS, “I feel stupid for not having realized it 30 years ago,” and commented that “Voros’s realization has become one of the pivotal points in the history of sabermetrics.”
McCracken’s counterintuitive research prompted an uproar in the nascent online baseball analysis community. The game’s finest minds soon began combing through the data. Most studies found that McCracken had somewhat overstated his case. One obvious exception was knuckleball pitchers like Tim Wakefield, who consistently posted BABIP’s far below the league average. Superstar pitchers like Johan Santana also tended to have somewhat better-than-average BABIP’s over the course of their careers. At the other extreme, as a group, minor-league pitchers with brief appearances in MLB compiled BABIP’s well above the league average.
That suggests that preventing hits on balls in play is indeed a real skill—but that pitchers who are not major league-caliber in this department give up so many hits that they simply don’t last long in The Show. Finally, pitchers do exert a strong influence on their ratio of ground balls to fly balls. Grounders go through for hits more often than non-home-run fly balls do, though they are much less likely to yield extra bases.
Nonetheless, the core of McCracken’s research held up. Yes, there were some statistically significant differences among major league pitchers in their ability to prevent hits on balls in play. However, in the non-knuckleballing division, the effect was so small, and the yearly fluctuations so big, that by the time you had enough data to determine that a pitcher possessed such a skill, he was probably on the verge of retirement. You’d get better projections just by assuming that everyone had major league average BABIP skill than you would by trying to decipher the BABIP mystery.
Thirteen years after DIPS took the baseball world by storm, today’s most popular ERA estimator is the Fielding Independent Pitching equation developed by Tom Tango. Used by the statistical website Fangraphs.com to calculate its all-in-one value stat for pitchers, it is simply a stripped-down version of McCracken’s venerable Three True Outcomes: homers times 13, plus walks and hit by pitch times 3, minus strikeouts times 2, divided by innings pitched and adjusted for the league average. It doesn’t get much simpler than that.
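The Fielding Independent Pitching arithmetic described above can be written out in a few lines. This is only a sketch: the function and parameter names are mine, and the league adjustment is passed in as a constant because its value changes from season to season. The 3.10 used in the example is a placeholder, not an official figure.

```python
def fip(hr, bb, hbp, k, ip, league_constant):
    """Fielding Independent Pitching from the Three True Outcomes.

    ip should be true innings (200 1/3 innings is 200.333..., not 200.1).
    league_constant is the season-specific adjustment that puts FIP on
    the same scale as league ERA; the caller must supply it.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + league_constant


# Hypothetical season: 20 HR, 50 BB, 5 HBP, 200 K in 200 innings.
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0, league_constant=3.10)
# (13*20 + 3*55 - 2*200) / 200 + 3.10 = 25/200 + 3.10 = 3.225
```

Note that nothing about balls in play enters the calculation, which is exactly the assumption the rest of this talk puts to the test.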
Yet one important thing has changed since McCracken published his groundbreaking work: the Internet has made a treasure trove of statistics available to anyone with a computer. Today, we can download reams of data on pitch type, velocity, swing rates, batted ball trajectories, and any number of other variables instantly. I never had any intention of wading into the DIPS debate. But like McCracken, I am an avid fantasy baseball player, and every year I compile my own projections in the hopes of getting a leg up on the competition. And last year, in the course of my annual fantasy draft preparation, I stumbled on some data that seemed to lie at the start of the path towards sabermetrics’ holy grail.
My calculations confirmed McCracken’s contention that a pitcher’s BABIP in one season provides little information about what it will be the next year. However, I noticed that two other statistics did seem to offer helpful clues about hit prevention. The first was popup rate — the share of batted balls that are flies to the infield. Popups have virtually the same effect as strikeouts, since they are almost always caught and runners cannot advance on them. All other things equal, a pitcher who induces more popups should have a lower BABIP. And unlike some other batted ball types, such as line drives, popups do show a strong year-to-year correlation — pitchers who get lots of them in one year are likely to induce an above-average total the next year as well. From 2010 to 2011, popups correlated just as well as walks, significantly better than homers and of course far better than BABIP. Even more encouragingly, a pitcher’s popup rate in 2010 had a negative correlation with his BABIP in 2011. Although analysts have long known that inducing popups is a repeatable skill, I had never seen a publicly available equation that used it for the purpose of BABIP prediction.
The second promising variable was a statistic provided by Fangraphs.com called Z-Contact. It measures how often opposing batters make contact when they swing at pitches thrown within the strike zone. Since almost all strikes are hittable, and most of them are easy to square up, if a pitcher gets batters to miss a high proportion of his easier-to-hit pitches, then the balls that do get put into play against him will tend to come on harder-to-hit pitches. All other things equal, such batted balls should be more likely to be hit weakly and become outs. Z-Contact also has a very strong year-to-year correlation, and again seemed to have some predictive power regarding BABIP. At the time, I had never come across any reference to the relationship between Z-Contact and BABIP, although about seven months later Steve Staude of Fangraphs.com independently reported it.
This piqued my interest enough to do a longer-term study. I started by looking at every pitcher who had thrown at least 100 innings in four consecutive years between 2002 and 2011, giving me a sample of 387 pitcher-seasons. In order to get the best possible forecasts for popup rate and Z-Contact for the years in question, I first examined how they correlate from year to year, to calculate the correct weights for predicting future performance from past performance. For example, I project Z-Contact for a given year by using 39% of the previous year’s Z-Contact, 21% of the number from the year before that, 14% from three seasons ago, and 26% from the league average.
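The Z-Contact weighting just described amounts to a weighted average that regresses toward the league mean. A minimal sketch follows; the weights are the ones quoted above, while the function name and the sample inputs are hypothetical.

```python
# Weights quoted in the talk: last year, two years ago, three years ago.
Z_CONTACT_WEIGHTS = (0.39, 0.21, 0.14)
# Remaining weight, placed on the league average.
LEAGUE_WEIGHT = 0.26


def project_z_contact(prev1, prev2, prev3, league_avg):
    """Project next season's Z-Contact from the three prior seasons."""
    w1, w2, w3 = Z_CONTACT_WEIGHTS
    return w1 * prev1 + w2 * prev2 + w3 * prev3 + LEAGUE_WEIGHT * league_avg


# Hypothetical pitcher: 85%, 86%, and 88% Z-Contact over the last three
# seasons (most recent first), in a league that averages 88%.
projection = project_z_contact(0.85, 0.86, 0.88, league_avg=0.88)
# 0.3315 + 0.1806 + 0.1232 + 0.2288 = 0.8641
```

Because the four weights sum to 100%, a pitcher who has always matched the league average is projected to match it again.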
Next, I had to factor out the confounding effects of ballpark and the quality of fielders, which together influence BABIP far more than pitchers themselves do. For example, in 2011, the Rays’ Wade Davis posted a .280 BABIP, just below the AL starters’ average that year of .283. However, the Rays’ rotation collectively compiled a .265 BABIP, thanks to the team’s expert positioning of its infielders and outstanding defense from Evan Longoria, Sam Fuld, and Ben Zobrist. Davis’s BABIP was actually 19 points worse than that of Tampa Bay’s other starters. So for each pitcher-season, I measured the gap between their BABIP and that of the rest of their team’s rotation.
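One way to compute that gap is sketched below. It assumes the teammates' balls in play are pooled into a single rate rather than averaged pitcher by pitcher, which the talk does not specify. The data layout, function name, and numbers are all invented for illustration.

```python
def babip_gap_vs_teammates(rotation, pitcher):
    """Pitcher's BABIP minus the pooled BABIP of the rest of the rotation.

    rotation maps a pitcher's name to (hits_in_play, balls_in_play),
    where hits_in_play excludes home runs.
    """
    hits, bip = rotation[pitcher]
    # Pool every other starter's hits and balls in play (an assumption).
    rest_hits = sum(h for name, (h, _) in rotation.items() if name != pitcher)
    rest_bip = sum(b for name, (_, b) in rotation.items() if name != pitcher)
    if bip == 0 or rest_bip == 0:
        raise ValueError("need balls in play for the pitcher and his teammates")
    return hits / bip - rest_hits / rest_bip


# Hypothetical three-man rotation.
rotation = {
    "Pitcher A": (150, 500),  # .300 BABIP
    "Pitcher B": (130, 500),
    "Pitcher C": (140, 500),
}
gap = babip_gap_vs_teammates(rotation, "Pitcher A")
# .300 minus 270/1000 = .270, so Pitcher A is 30 points worse than his teammates.
```

Excluding the pitcher from his own comparison group matters: it is why Davis trails his fellow starters by more than he trails the rotation's collective figure, which includes his own results.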
Finally, I tested the relationship between my projections for popup rate and Z-Contact for each pitcher-season—derived exclusively from prior years’ data—against the gap between that pitcher’s BABIP for that season and that of his teammates. Both statistics showed a very strong effect, and a similar pattern of impact. More popups are always good, and a higher Z-Contact is always bad. However, not all changes in popups or Z-Contact rates are created equal: the BABIP’s of pitchers who were worse than average in those categories suffered only modestly, whereas pitchers who were among the best in the league in them consistently had BABIP’s far below those of their teammates.
If we plot forecast popup rate against BABIP relative to team, we can see that the slope is flat at one end and steep at the other. By using a curved fit, we can give appropriate credit to the game’s finest popup machines, without projecting utterly ghastly BABIP’s for pitchers who induce very few popups. The same trend is apparent for Z-Contact, though to a lesser degree. There’s a chance this pattern is a mere product of selection bias: perhaps only the pitchers who lucked into tolerable BABIP’s despite having low popup rates and high Z-Contact scores got the chance to pitch enough innings to qualify for the study. But the existence of a zero lower bound for popups, and a 100% upper bound for Z-Contact, provides intuitive support for the curved fit.
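The talk does not say which curve was fitted, so the sketch below assumes a simple quadratic, fitted by ordinary least squares in plain Python. The function name and the six data points are hypothetical; popup rate is expressed in percent and the BABIP gap in points to keep the arithmetic well scaled.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    if len(xs) != len(ys) or len(set(xs)) < 3:
        raise ValueError("need matching lengths and at least 3 distinct x values")
    # Build the normal equations (X^T X) beta = X^T y for columns 1, x, x^2.
    rows = [[1.0, x, x * x] for x in xs]
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    # Back-substitution.
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (m[i][3] - tail) / m[i][i]
    return tuple(beta)


# Hypothetical data: forecast popup rate (%) vs. BABIP gap to teammates (points).
popup_pct = [6, 8, 10, 12, 14, 16]
gap_points = [3, 2, 0, -4, -10, -18]
a, b, c = fit_quadratic(popup_pct, gap_points)
```

With data shaped like this, the squared term comes out negative, which is what lets the curve stay nearly flat for low-popup pitchers while dropping steeply for the popup specialists.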
Combining these two equations, I was able to produce retroactive forecasts for all 387 pitcher-seasons. On the whole, projected popup rate and Z-Contact were able to explain 15% of the variance in pitchers’ BABIP relative to their teammates. That might not sound like much—indeed, it largely confirms McCracken’s original finding about the randomness of BABIP. But it leads to far, far more accurate forecasts than simply assuming that all pitchers will have a league-average BABIP, as Fielding Independent Pitching does.
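"Explaining 15% of the variance" can be made concrete with the usual R-squared calculation, sketched here under the assumption that it is defined as one minus the ratio of residual to total squared error. The talk does not state the exact definition used, and the function name and toy numbers are mine.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` accounted for by `predicted`."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1 - ss_res / ss_tot


# Toy BABIP gaps in points. Forecasting the same value for everyone,
# as a league-average assumption does, explains none of the variance.
actual_gaps = [1.0, 2.0, 3.0]
flat_forecast = [2.0, 2.0, 2.0]
baseline = r_squared(actual_gaps, flat_forecast)  # 0.0
perfect = r_squared(actual_gaps, actual_gaps)     # 1.0
```

On this scale, a forecast that treats every pitcher alike scores zero, so 15% is a real improvement even though most of the variation remains unexplained.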
The equation correctly nails every one of the major BABIP outliers of the past decade. Of the 43 pitcher-seasons it projected to have a BABIP at least 15 points below their teammates’ average, seven were by Tim Wakefield, the game’s premier knuckleballer. Seven more were by Ted Lilly, whose .268 career BABIP is the second-lowest among active starting pitchers. (The average is around .290.) Johan Santana, whose .272 lifetime BABIP ranks fifth among active pitchers, shows up third on this list, with six seasons forecast to be 15 points or more below the team average. Next come Jered Weaver and Barry Zito with three seasons each. Their career BABIP’s of .269 and .271 are currently the game’s third- and fourth-lowest. Matt Cain, baseball’s active career BABIP leader, also shows up twice on the list, as does Pedro Martínez, whose lifetime mark was an outstanding .279.
The formula’s single favorite pitcher-season was Chris Young going into 2008, whom it projected to best his teammates’ BABIP by a whopping 43 points. The 6’10” soft-tosser fell just short of that forecast that year, posting a BABIP merely 38 points lower than the rest of the Padres’ staff. His .254 lifetime BABIP would easily be baseball’s best if he had surpassed my cutoff of 1,000 career innings pitched.
The equation also does a good job of spotting BABIP laggards. Of the five pitchers with the most seasons projected for a BABIP at least 7 points above their teammates’ average, three—John Lackey, Liván Hernández, and Paul Maholm—have career BABIP’s among the league’s worst. A fourth, Jeff Suppan, is also a significant underperformer. Of the five, only the formula’s distaste for Jason Marquis is not justified. The equation also correctly singles out Zach Duke, who has the highest lifetime BABIP among active pitchers. He only qualified for the study in two seasons, but they were both among the nine worst projected single-season BABIP’s the equation spit out.
Despite these seemingly impressive results, retroactive projections of this kind must always be taken with a hefty grain of salt. Historical data sets are full of spurious relationships between variables as well as genuine ones: noise as well as signal, as Nate Silver, who may also be here, would put it. Science is rife with purported discoveries leading to predictions that fail miserably in the real world, because they reflect nothing more than random variation within the researcher’s specific sample of information. Indeed, Silver says that this phenomenon, known as overfitting, is “the most important scientific problem you’ve never heard of.” How can we determine whether the correlation between popup rate, Z-Contact, and BABIP is legitimate, or a mere quirk that happened to show up in recent years but will soon disappear?
The best way to avoid falling victim to the fallacy of overfitting is to test one’s equations not just on the data set from which they were derived but also on an entirely separate sample. If the formula retains its predictive power even when confronted with cases that it doesn’t already “know,” there’s a good chance you’re onto something real. Fortunately, baseball generates fresh new data sets with every season, giving researchers endless opportunities to distinguish signal from noise. We know that we can predict BABIP from 2005 to 2011 with some accuracy using other data from 2005 to 2011. But what happens when we use the 2005-11 data to predict 2012?
The answer is that not only did the formula remain useful, it actually improved. The equation was able to account for 17% of the variance in 2012 BABIP performance, slightly better than its 15% mark for the 2005-11 period. To be fair, the 2012 sample is small (just 51 pitchers), and the results are heavily influenced by a single outstanding prediction. The formula expected Jered Weaver to have a BABIP 41 points lower than that of his teammates. The Angels’ ace kindly obliged by posting a .241 mark, his lowest since 2006 and precisely 42 points lower than the rest of the Angels’ staff. But even if you remove Weaver from the sample, the equation still retains much of its predictive power when confronted with unfamiliar data. Its five favorite pitchers were all significantly better than average, and four of its five least-favorite were worse. The equation’s success not only helped me to win the annual forecasting competition published at Fangraphs.com this year, but far more importantly, gave me the extra edge I needed to eke out the championship in my fantasy league.
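The out-of-sample test described above follows a simple pattern: fit on one period, then score on a period the fit has never seen. The sketch below illustrates that pattern with a one-variable straight-line fit standing in for the actual two-variable curved equation. All names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def out_of_sample_r2(train_x, train_y, test_x, test_y):
    """Fit on the training sample only, then score on the held-out sample."""
    intercept, slope = fit_line(train_x, train_y)
    predictions = [intercept + slope * x for x in test_x]
    mean_test = sum(test_y) / len(test_y)
    ss_tot = sum((y - mean_test) ** 2 for y in test_y)
    ss_res = sum((y - p) ** 2 for y, p in zip(test_y, predictions))
    return 1 - ss_res / ss_tot


# Toy check: if the held-out season follows the same line as the
# training seasons, the held-out score is a perfect 1.0.
score = out_of_sample_r2([0, 1, 2, 3], [1, 3, 5, 7], [4, 5, 6], [9, 11, 13])
```

An overfitted equation would show the opposite behavior: a healthy score on the seasons used to build it and a score near zero, or even below zero, on the new one.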
The next step in this analysis would be to study what types of pitchers tend to have extremely high or low popup rates and Z-Contact scores, in order to understand better what hurlers can do to reduce their BABIP. I had a few guesses, beyond the previously well-known discrepancies between flyball and groundball pitchers. Two of the biggest positive outliers, Jered Weaver and Chris Young, are extremely tall at 6’7” and 6’10”, and I thought it might make sense that batters would be more likely to pop up balls thrown from a higher release point. And Johan Santana, Cole Hamels, and Pedro Martínez are all renowned changeup artists, which made me think that a Bugs Bunny offspeed offering might lead to fewer hits on balls in play.
Unfortunately, when I compared groups of pitchers with the highest and lowest projected BABIP’s in the study, neither of these hypotheses held up. Both sets averaged 6 feet, 3 inches, and their changeup percentages differed by only two points. So if anyone in the audience has any ideas on this front—what distinguishes, say, Ted Lilly and Barry Zito from Mark Buehrle and Tom Glavine—perhaps you can move forward our understanding of baseball’s most enduring statistical mystery much further than I have.