Statcast Data Limitations – Year-End Update

October 8, 2015

The book on the 2015 season is still being written, but when it is finished, at least a chapter or two will need to be devoted to Statcast. This year will likely go down as the one in which granular batted-ball data went mainstream. More data has been made public, and discussion of exit velocity, launch angle and even route efficiency has permeated the airwaves of even the most old-school broadcasts. All of the numbers, in and of themselves, mean very little. Only with the addition of context can they become meaningful.

A couple months ago in these pages, I detailed some of the limitations of the newly publicly available data. Today, let’s update those findings with an examination of the year-end data.

With the full Statcast data set unavailable publicly, I figured that the best way to test it would be to take a representative sample of the data available on Baseball Savant. At the All-Star break, I downloaded all of the granular batted-ball data available for each pitcher who qualified for the ERA title as of that date. I then did the same for all season-end qualifiers, as well as the second-half data for the All-Star break qualifiers who no longer qualified at season’s end. This encompassed 102 starting pitchers and 54,921 batted balls allowed. That’s a fairly representative sample quantity-wise, over 40% of the population of 2015 balls in play (BIP).

To determine whether this would be a representative sample quality-wise, I compared the batting average (AVG) and slugging percentage (SLG) of my sample to the MLB season-end marks; my sample’s .320 AVG and .505 SLG were a bit off from the actual .323 AVG and .513 SLG posted by MLB hitters on all BIP in 2015. This difference is fairly easily explainable: my sample included only BIP allowed by starting pitchers good enough to accumulate enough innings to qualify for the ERA title at either the All-Star break or at season’s end; the actual MLB averages were inflated by lesser, often younger starting pitchers brought up to cut their teeth in August and September. Those who watch macro trends might have noticed that run-scoring across baseball spiked late in the season. My sample isn’t a perfect representation of baseball in 2015, but it’s close enough so that some reasonable conclusions can be drawn from it.

The first limitation of the Statcast sample noted in my August article remains a primary concern: the large size of the “null” group, the batted balls for which no exit velocity was recorded. In 2014, only 4.9% of batted balls fit into the “null” group; in 2015, a whopping 23.7% of all batted balls had no exit velocity recorded. There was some improvement in this area down the stretch: at the All-Star break, the “null” group comprised 25.4% of all BIP, dropping to 21.5% for games after the break. That’s right, the average exit velocity information that has been disseminated through various print, television, radio and online media this season is based on a sample that is far from complete. Statcast still has a long way to go before it is capturing the 95.1% of batted balls that represented the 2014 industry norm.

Now, some growing pains are to be expected with any new technology, and these percentages should continue to improve over time. We need to get a better understanding, however, of the types of batted balls that are being missed, as this affects the overall analytical value of the data set in the short term. Let’s take a look at the AVG and SLG on all BIP, the “null” group, and the remaining sample for which velocity readings were received:

Production on 2014-15 BIP

Group	2015 AVG	2015 SLG	2014 AVG	2014 SLG
ALL	0.320	0.505	0.318	0.489
NULL	0.239	0.356	0.224	0.359
W/VELO	0.345	0.551	0.323	0.496

In both seasons, the production generated by the “null” BIP group was virtually identical. In 2015, however, the dramatic expansion in the size of that group has a much larger impact on the production generated by the BIP that did generate velocity readings. In 2014, the difference between .318 AVG-.489 SLG and .323 AVG-.496 SLG was minimal; this year, the difference between .320 AVG-.505 SLG and .345 AVG-.551 SLG is quite large. The data in the Statcast sample simply is not very representative of what has actually been going on on the field this season.
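The three rows of the table can be cross-checked against the “null” shares quoted earlier. The following is a minimal Python sketch, assuming that AVG and SLG are both computed per ball in play, so the overall line is simply a share-weighted blend of the two groups. The variable and function names are illustrative only.

```python
# Share of balls in play (BIP) with no exit velocity, as quoted in the article.
NULL_SHARE = {2015: 0.237, 2014: 0.049}

# (AVG, SLG) for the "null" group and for the BIP with velocity readings,
# copied from the production table.
NULL_LINE = {2015: (0.239, 0.356), 2014: (0.224, 0.359)}
VELO_LINE = {2015: (0.345, 0.551), 2014: (0.323, 0.496)}


def blended_line(year):
    """Weight both groups by their share of BIP to rebuild the overall line.

    Assumes AVG and SLG use BIP as the denominator, so a share-weighted
    average of the two groups reproduces the line for all BIP.
    """
    w = NULL_SHARE[year]
    null_avg, null_slg = NULL_LINE[year]
    velo_avg, velo_slg = VELO_LINE[year]
    avg = w * null_avg + (1 - w) * velo_avg
    slg = w * null_slg + (1 - w) * velo_slg
    return round(avg, 3), round(slg, 3)


for year in (2015, 2014):
    print(year, blended_line(year))
```

Under that assumption, the blend lands on the ALL row in both seasons (.320/.505 for 2015 and .318/.489 for 2014). This also shows why a 23.7% “null” share pulls the W/VELO line so much further from the overall line than a 4.9% share does.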

In 2014, most of the batted balls that were “missed” were popups and very weak ground balls. In 2015, that is again the case, but the population of the BIP being missed is much more extensive, and the velocity thresholds below which they are more likely to be missed have risen. Of the batted balls in the 2015 “null” group, 54.1% are ground balls, considerably higher than the 47.2% share that grounders make up of the entire population. Similarly, 16.2% of the batted balls in the “null” group were classified as popups, compared to 6.7% in the entire population.

The popups are a particularly big deal; 56.8% of popups generated no velocity reading in 2015, and therefore didn’t find their way into the granular data. This percentage has actually increased from 56.3% through the All-Star break. This does no favors to the extreme popup generators of the world, the Jered Weavers and Marco Estradas, in any detailed analysis of this data set.

Ditto the weak grounder generators. Velocity readings have been generated on only 72.8% of grounders this season, which is at least up a bit from 71.4% through the break. How do we know it’s the weaker ones that are being missed? Well, hitters have batted .257 AVG-.281 SLG on the ones for which we have readings, and just .201 AVG-.214 SLG on the “null” group. That’s pretty strong evidence; again, this suggests improvement in data capture late in the season, as production on “null” grounders at the break was higher at .212 AVG-.225 SLG. The velocity threshold at which a reading is being obtained for a grounder is dropping lower, but is still nowhere near the 2014 level. So, the pitchers who have shown an ability to generate weak ground-ball contact, the Dallas Keuchels, Garrett Richardses and Johnny Cuetos, don’t get enough credit for that talent.
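The composition figures quoted above can be checked against one another. Here is a rough Python sketch, assuming all of the percentages describe the same 2015 sample: a batted-ball type’s share of the “null” group equals its share of all BIP, times the rate at which that type goes unmeasured, divided by the overall “null” rate. The names are hypothetical.

```python
NULL_SHARE = 0.237  # share of all 2015 BIP with no exit velocity reading

# type: (share of all BIP, share of that type with no velocity reading)
BIP_TYPES = {
    "grounder": (0.472, 1 - 0.728),  # readings exist for 72.8% of grounders
    "popup": (0.067, 0.568),         # 56.8% of popups had no reading
}


def share_of_null_group(type_share, miss_rate, null_share=NULL_SHARE):
    """P(type | null) = P(type) * P(null | type) / P(null)."""
    return type_share * miss_rate / null_share


for name, (type_share, miss_rate) in BIP_TYPES.items():
    pct = 100 * share_of_null_group(type_share, miss_rate)
    print(f"{name}: about {pct:.0f}% of the null group")
```

The results come out at roughly 54% for grounders and 16% for popups, which agrees with the quoted 54.1% and 16.2% to within the rounding of the inputs.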

Then there’s the difference between the velocities generated by the new and old equipment measuring the data. The Statcast equipment runs quite a bit “hotter” than the previous HITf/x system. Below, you’ll see the average batted-ball velocity in the three major BIP categories for both 2014 and 2015, as well as the overall averages:

Average Velocity by BIP Type

BIP type	2015 (mph)	2014 (mph)
FLY	90.2	83.2
LD	92.8	87.9
GB	85.8	69.7
ALL	88.4	77.9

That’s a pretty massive difference. The largest change is in the ground ball category, and that is likely due in part to the scrubbing of bunts from the data. Still, the increases in average velocity are pretty staggering. Now this is not a problem, per se, as seasonal analysis is based on performance relative to one’s peers. It does change some of the previously developed notions as to what constitutes a well-hit ball, the nature of the fly ball and ground ball dead zones, etc.

For instance, only 205 fly balls were hit at 105 mph or harder in 2014, using the previous measuring equipment. In 2015, 394 fly balls were hit that hard in my sample alone, despite the growth of the “null” group. In 2014, 424 liners were hit at 105 mph or harder; in 2015, 1,687 were hit that hard in my sample alone, with 293 hit at 110 mph or harder. In 2014, 372 grounders were hit at 105 mph or harder; in 2015, 1,378 were hit that hard in my sample alone, with 247 hit at 110 mph or harder.

We’ve discussed the fly ball “donut hole” in the past. In 2014, MLB hitters batted .077 AVG-.148 SLG on fly balls hit between 75 and 90 mph. Well, the upper boundary of that donut hole has moved higher: in 2015, MLB hitters batted just .041 AVG-.084 SLG on fly balls hit between 75 and 94 mph.

There’s a similar grounder dead zone. In 2014, MLB hitters batted .116 AVG-.128 SLG on grounders hit at 70 mph or lower. This constituted 23.2% of ground balls. In 2015, MLB hitters are batting .125 AVG-.134 SLG on grounders hit at 75 mph or lower, constituting 22.8% of a grounder total that we have shown is understated, as many more weakly hit grounders are being missed.
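For readers who want to compute this kind of velocity-band line from their own download, here is one possible Python sketch. The `BattedBall` record, its field names and the `band_line` helper are all hypothetical, and the sample rows are made up purely for illustration; they are not real Statcast data.

```python
from typing import NamedTuple, Optional


class BattedBall(NamedTuple):
    bip_type: str               # e.g. "FLY", "LD", "GB" or "PU"
    exit_velo: Optional[float]  # mph; None for the "null" group
    total_bases: int            # 0 for an out, 1-4 for a hit


def band_line(balls, bip_type, lo, hi):
    """Return (AVG, SLG, count) for one BIP type with lo <= velocity <= hi.

    Balls with no velocity reading are skipped, which is exactly why a
    large "null" group distorts the result. Returns None for an empty band.
    """
    band = [
        b for b in balls
        if b.bip_type == bip_type
        and b.exit_velo is not None
        and lo <= b.exit_velo <= hi
    ]
    if not band:
        return None
    hits = sum(1 for b in band if b.total_bases > 0)
    bases = sum(b.total_bases for b in band)
    return hits / len(band), bases / len(band), len(band)


# Made-up rows for illustration only.
toy = [
    BattedBall("FLY", 82.0, 0),
    BattedBall("FLY", 91.5, 0),
    BattedBall("FLY", 93.0, 2),
    BattedBall("FLY", 104.0, 4),
    BattedBall("FLY", None, 0),
    BattedBall("GB", 68.0, 1),
]

print(band_line(toy, "FLY", 75, 94))
```

Whether the band boundaries are inclusive is an assumption here; the article does not say how balls hit at exactly 75 or 94 mph are treated.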

There are other issues as well. When I first started working with BIP data sets, I compared exit angle data to box score classifications, and came up with the following exit angle categories: 50+ degrees is a popup, 20-50 is a fly ball, 5-20 is a line drive, and < 5 is a ground ball. The Statcast category labels are way out of whack with these assumptions; many, many well-hit batted balls with exit angles of 20+ degrees are being categorized as liners. Using my BIP exit angle breakdowns, only 176 line drive homers were hit in 2014; Statcast says that 792 line drive homers were hit in my limited sample of 2015 data. Clubs have access to the detailed exit angle data, but without it, the public faces this significant blurring of the fly ball/line drive categories, making it more difficult to do detailed research.

Don’t get me wrong: Statcast is a great thing, and we are only scratching the surface of what it can eventually become. The sample generated for this article yielded some benchmarks which will serve as the foundation for some analysis you will see here in the offseason. Still, when one is faced with a data set, one must put it into some sort of context, while acknowledging its limitations. In many of my previous articles here, I have warned readers never to take pure average velocity data at face value; launch angles, BIP type frequencies, pull percentages, etc., significantly affect hitter and pitcher performance, and can easily be overlooked. For this year, at least, we should additionally be aware of the Statcast data set’s unique shortcomings, which adjust the context within which analysis takes place.
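As a footnote to the exit angle categories described above, the bucketing can be sketched in a few lines of Python. The function name is hypothetical, and the treatment of the exact boundaries (5, 20 and 50 degrees each fall into the higher bucket) is an assumption, since the ranges as stated overlap at their edges.

```python
def classify_by_exit_angle(angle_deg):
    """Bucket a batted ball by vertical exit angle in degrees.

    Thresholds follow the categories in the text: 50+ popup, 20-50 fly
    ball, 5-20 line drive, below 5 ground ball. Assigning each boundary
    value to the higher bucket is an assumption.
    """
    if angle_deg >= 50:
        return "popup"
    if angle_deg >= 20:
        return "fly ball"
    if angle_deg >= 5:
        return "line drive"
    return "ground ball"


for angle in (-3, 10, 25, 55):
    print(angle, classify_by_exit_angle(angle))
```

Applying a rule like this requires the raw exit angle for each batted ball, which is the detail the public data lacked at the time of writing.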
