Accounting for the “No Nulls” Solution

There are certain exit velocity buckets where a fair number of batted balls aren’t being tracked. (via Andrew Perpetua)

In 2015 and 2016, Statcast had a well-publicized problem with missing data, but now in 2017 every batted ball has exit velocity and launch angle information. So that means the problem was solved, right? Wrong. The problem persists. Around 11.7 percent of batted balls on Baseball Savant have their exit velocity and launch angle decided by an algorithm, not a TrackMan measurement. The consequences of this algorithm may color your understanding of Statcast and influence which techniques you will want to avoid when analyzing the league or its players.

How Much Of The Data Is Missing?

Statcast had a rocky first year and failed to report 20-30 percent of batted balls in 2015. There were many reasons for this, but long story short, it wasn’t necessarily a technical limitation because the data was retroactively filled in after the season ended. Today, having logged three full seasons, Statcast appears to be reporting data for about 88 to 89 percent of the batted balls. Later on in this piece I’ll explain how I am identifying missing data, but for now it is important to point out the types of batted balls that are missing from the dataset.

Stringer Totals
Type          Total    Missing  % Missing
Ground Ball   181,017  29,047   16.0%
Line Drive    101,000   1,860    1.8%
Fly Ball       82,973   2,541    3.1%
Pop Up         27,282  12,328   45.2%
Total         392,272  45,776   11.7%
SOURCE: Statcast
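The percentages in the table follow directly from the counts. As a quick check, here is a minimal Python sketch that recomputes them; the dictionary and function names are my own, and only the counts come from the table.

```python
# Counts copied from the Stringer Totals table: type -> (total, missing).
stringer_totals = {
    "Ground Ball": (181_017, 29_047),
    "Line Drive": (101_000, 1_860),
    "Fly Ball": (82_973, 2_541),
    "Pop Up": (27_282, 12_328),
}


def missing_share(total, missing):
    """Return the percentage of batted balls without a TrackMan reading."""
    return 100.0 * missing / total


# Per-type share of missing balls, rounded to one decimal like the table.
shares = {
    bb_type: round(missing_share(total, missing), 1)
    for bb_type, (total, missing) in stringer_totals.items()
}

# League-wide share, computed from the summed counts rather than the table row.
all_total = sum(total for total, _ in stringer_totals.values())
all_missing = sum(missing for _, missing in stringer_totals.values())
overall = round(missing_share(all_total, all_missing), 1)

for bb_type, pct in shares.items():
    print(f"{bb_type:<12}{pct:5.1f}%")
print(f"{'Total':<12}{overall:5.1f}%")
```

Ground balls and pop-ups together account for 41,375 of the 45,776 missing balls, which is why the rest of this piece focuses on them.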

As it turns out, TrackMan has trouble detecting balls that move perpendicular to the radar’s line of sight: straight up, down, left (third base side) or right (first base side). The more the movement is directed away from the radar (that is, into the field of play), the better TrackMan is at detecting and accurately measuring the velocity of the ball.

Luckily, baseball has foul territory, so many of the balls that move left or right can be ignored as foul balls. Unfortunately, balls hit down into the ground and straight up into the air are fairly common. This means that not only does TrackMan have a bias against certain types of batted balls, but these batted balls are generally very weakly hit ground balls and pop-ups.

In addition to missing individual batted balls, nine games appear to be missing all their TrackMan data: July 24 and 25, 2015, in San Francisco; July 3, 2016, in Fort Bragg, N.C.; Sept. 23 through 27, 2016, in Pittsburgh; and Aug. 20, 2017, in Williamsport, Pa.

The missing data in Fort Bragg and Williamsport shouldn’t be a surprise, considering neither of those stadiums has TrackMan installed. Well, I’m not sure Fort Bragg should even be considered a stadium, but that is neither here nor there. Missing five consecutive games in Pittsburgh is a bit odd, but I suppose hardware problems can arise from time to time. It is possible there are missing games that I have overlooked, but it is nice to see that the number appears to be minimized.

All in all, though, TrackMan has covered 7,385 out of 7,394 major league games in the past three years. These nine missing games had 726 out of the 561,666 plate appearances that occurred. There are so few of these missing games that we can effectively ignore them going forward, although it is important to acknowledge they exist, especially in 2016, which had six missing games.

However, you cannot explain all of the missing data using only weakly hit balls and missing games. There are a large number of missing line drives and fly balls as well. There may be some fraction of batted balls that are missed by random chance, or perhaps there is another mechanism for missing batted balls that I don’t quite understand.

Either way, you can generally assume that a great many of the missing data points are weakly hit. So much so that if you were to calculate average exit velocity and launch angle using only the balls measured by TrackMan, you would end up with inflated numbers. Both the exit velocity and launch angle would be too high, since you would be throwing out an enormous number of weakly hit ground balls. Pop-ups, too, but mostly ground balls.
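To make the size of that inflation concrete, here is a small sketch of the arithmetic. The exit velocities and counts below are hypothetical round numbers chosen only for illustration; they are not measurements from Statcast.

```python
def pooled_mean(mean_a, n_a, mean_b, n_b):
    """Combine two group averages into one overall average, weighted by count."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)


# Hypothetical inputs: 88 tracked balls averaging 89 mph and
# 12 untracked, weakly hit balls that would have averaged 75 mph.
measured_mean_ev, measured_n = 89.0, 88
missing_mean_ev, missing_n = 75.0, 12

true_mean_ev = pooled_mean(measured_mean_ev, measured_n, missing_mean_ev, missing_n)

# How much the tracked-only average overstates the full average.
inflation = measured_mean_ev - true_mean_ev

print(f"tracked only: {measured_mean_ev:.2f} mph")
print(f"all balls:    {true_mean_ev:.2f} mph")
print(f"inflation:    {inflation:.2f} mph")
```

Under these made-up inputs the tracked-only average overstates the full average by a bit less than 2 mph. The real gap depends on how weakly the untracked balls were actually hit, which is exactly what we cannot observe.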

The Toolbox

This is a problem. We want to accurately measure exit velocity and launch angle, but to do so we need to account for these missing batted balls. Before we can address a solution we need to address the tools we have to work with.

When TrackMan data is missing we lose out on:

  • Exit velocity
  • Launch angle
  • Batted ball distance
  • Batted ball spin

It is important to understand that batted ball distance is lost. If you had batted ball distance, you might be able to reverse engineer an exit velocity using the batted ball type and fielding location data. Alas, we are left without the distance data.

We are left with:

  • Batted ball type
  • Batted ball result
  • Which players fielded the ball
  • Rough approximation of where the ball was fielded (hc_x, hc_y)

The first three items on this list are called the stringer information. The hc_x and hc_y coordinates are very rough estimates, and are not especially reliable (although they are better than nothing when you’re left with no other choice).

Last year, Jeff Zimmerman developed a method in which he found the average launch angle and exit velocity for batted balls fielded by each position. So, for example, a pop-up to the second baseman versus a fly ball to the right fielder, etc. In this way he used all three aspects of the stringer information to estimate the batted ball quality.
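The general idea of averaging by stringer group can be sketched in a few lines of Python. This is my own illustration of the approach as described above, not Zimmerman's actual implementation, and the tracked balls listed are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tracked balls:
# (batted ball type, result, fielder, exit velocity in mph, launch angle in degrees)
tracked = [
    ("pop_up", "out", "2B", 78.0, 66.0),
    ("pop_up", "out", "2B", 82.0, 70.0),
    ("fly_ball", "out", "RF", 90.0, 35.0),
    ("fly_ball", "out", "RF", 94.0, 31.0),
]


def group_averages(rows):
    """Average exit velocity and launch angle for each stringer group."""
    groups = defaultdict(list)
    for bb_type, result, fielder, ev, la in rows:
        groups[(bb_type, result, fielder)].append((ev, la))
    return {
        key: (mean(ev for ev, _ in vals), mean(la for _, la in vals))
        for key, vals in groups.items()
    }


averages = group_averages(tracked)


def estimate(bb_type, result, fielder):
    """Look up the group average for an untracked ball; None if the group is unseen."""
    return averages.get((bb_type, result, fielder))


print(estimate("pop_up", "out", "2B"))
print(estimate("line_drive", "single", "SS"))
```

An untracked pop out to the second baseman would then inherit the average of the tracked pop outs to second base, and a group with no tracked examples gets no estimate at all.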

I was working on a system of grouping balls using the hc_x and hc_y coordinates along with batted ball type. Before I had a chance to finish this project, MLB announced it would be filling in the missing data on its own. Since then, MLB has retroactively filled in data for 2015 and 2016 and provided data for the 2017 season. Once MLB implemented its solution, accompanied by Tom Tango’s article, I shifted my focus to other aspects of the game. The missing data problem went out of sight, out of mind. But I believe the missing data is still an issue that needs to be addressed.

The “No Nulls” Solution

As I have already stated, there are two types of missing batted ball data: missing games and missing individual batted balls. It comes down to an issue of failure of measurement (missing games) versus measurement bias (missing individual batted balls).

In the cases where the game is recorded, but an individual batted ball does not register, you can assume that the majority of the time (70-85 percent) the ball was poorly hit. Therefore the batted ball likely has a weaker than average launch angle and exit velocity.

When the game is entirely missed, you don’t have any information about the batted ball, so you cannot make any assumptions.

The data you see on Baseball Savant goes through a multi-step process to correct for this missing data. First, it checks whether the game was missed entirely. If the game was missed, each batted ball is given an average launch angle and exit velocity for its stringer batted ball type. So, for example, a groundball single will have an exit velocity of 93 mph and a launch angle of -3 degrees, whereas a flyball double will have an exit velocity of 93 mph and a launch angle of 32 degrees.

If the game is recorded, but an individual batted ball is missing, it will assign a second set of values based upon the stringer data. These are, by far, the most common data corrections you see for batted balls with missing data. For example, a ground out has an exit velocity of 83 mph and a -21 degree launch angle, and a pop out has an exit velocity of 80 mph and a 69 degree launch angle.

This is a simplistic understanding. Under further analysis, there appear to be multiple launch angle and exit velocity combinations for the various stringer types, even in missing games. So, the exact rules for how these numbers are distributed are a bit complicated and remain unknown to me. Presumably, you could reverse engineer them if you were so inclined.
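The simplified version of the process can be written as two lookup tables. The sketch below is only my reading of the behavior described above, using the four example values quoted in the text; the table names and the function are hypothetical, and MLB's real rules are more complicated and not known to me.

```python
# Fill values (exit velocity mph, launch angle degrees) for games with no TrackMan data.
MISSING_GAME_FILL = {
    "Ground Ball Single": (93.0, -3.0),
    "Fly Ball Double": (93.0, 32.0),
}

# Fill values for individual balls missed in an otherwise tracked game.
MISSING_BALL_FILL = {
    "Ground Out": (83.0, -21.0),
    "Pop Out": (80.0, 69.0),
}


def fill_in(stringer_tag, game_tracked, measured=None):
    """Return (exit velocity, launch angle) for one batted ball.

    A real TrackMan reading always wins. Otherwise pick the lookup table
    based on whether the game as a whole was tracked. Returns None when
    this sketch has no rule for the stringer tag.
    """
    if measured is not None:
        return measured
    table = MISSING_BALL_FILL if game_tracked else MISSING_GAME_FILL
    return table.get(stringer_tag)


print(fill_in("Ground Out", game_tracked=True))
print(fill_in("Ground Ball Single", game_tracked=False))
print(fill_in("Ground Out", game_tracked=True, measured=(71.4, -27.3)))
```

Only four stringer tags are covered here because those are the only values quoted in this section. The appendix tables list many more candidate combinations.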

Mountains of Balls

Since MLB is using a short list of rules to distribute exit velocity and launch angle information to the batted balls with missing data, we can reverse engineer the rules using the stringer data and frequency information.

Generally speaking, combinations of exit velocity (which has one decimal place) and launch angle (three decimals) are pretty random. There are so many possible combinations of launch angles and exit velocities that you wouldn’t expect many results for each given pair of numbers (for example 84.5 and 23.457). When you include the stringer tags, the number drops even further. In fact, there are only 75 combinations of exit velocity, launch angle, and stringer tag with five or more matches, compared to 346,956 with fewer than five matches, 99.8 percent of which are unique.

I have taken these 75 triplets and named them the most likely candidates for MLB’s “No Nulls” rules. It is possible that a few of these are not actually part of the “No Nulls” set, and it is possible there are a few combinations of “No Nulls” that are so rare that they haven’t yet occurred five times. For example, a pop-up triple.
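The frequency test described above amounts to counting identical triplets and keeping the ones that repeat. Here is a minimal sketch of that idea; the function name and the sample rows are hypothetical, and the threshold of five matches comes from the text.

```python
from collections import Counter


def candidate_triplets(balls, min_count=5):
    """Return the triplets that occur at least min_count times, with their counts.

    Each ball is an (exit velocity, launch angle, stringer tag) tuple.
    Genuine TrackMan readings rarely repeat exactly, so heavily repeated
    triplets are likely to be filled-in values.
    """
    counts = Counter(balls)
    return {triplet: n for triplet, n in counts.items() if n >= min_count}


# Hypothetical sample: one triplet repeated six times, plus two one-off readings.
balls = [(80.0, 69.0, "Pop Out")] * 6 + [
    (84.5, 23.457, "Line Out"),
    (101.3, 27.912, "Fly Ball Home Run"),
]

candidates = candidate_triplets(balls)
print(candidates)
```

The threshold is a judgment call. A lower cutoff catches rarer fill-in values but also sweeps in genuine readings that happen to coincide.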

I am fairly confident that nearly all of the batted balls identified in this manner are No Nulls, and at the end of this piece I will include several tables that include all of the candidate groups of balls. For now, look at these two images, which show the distribution of No Nulls in the dataset.

The dark red bars show the batted balls measured by TrackMan, and the light red show the balls added using the No Nulls rules. Look at the enormous concentrations of balls around -24 to -20 degrees and 68 to 72 degrees. These are the majority of your 29,000 missing ground balls and 12,000 missing pop-ups. On the exit velocity chart you can see these balls between 80 and 84 mph.

There is a clear gap in frequencies of batted balls hit between 85 and 90 mph. It seems like there is a good chance the missing batted balls might fill in this gap. With the No Nulls solution, MLB has more than filled in this gap. Realistically, you’d probably expect these batted balls to be spread out more, following a more gradual curve. The exact shape of that curve is unknown, but the balls measured by TrackMan give us a good idea of what it might look like.

The MLB No Nulls solution has created very large spikes in the frequencies of very specific batted balls, but this gets even messier when you start comparing different seasons. In the GIF below you will see vertical launch angle along the Y axis and exit velocity along the X axis. Each cell represents 2 mph by 2 degrees. The colors represent frequency as a percentile of the largest cell. Green cells are high frequency, and blue cells are low frequency.

This GIF shows the No Nulls frequency problem better than anything else I have seen to date. Do you see those cells that remain dark green in each season? Those are the missing batted balls filled in by MLB. Some of these cells mysteriously disappear in 2017. Can you guess why? I’ll give you a hint: I told you the answer up above.
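The grid behind a chart like this can be built by snapping each ball to a 2 mph by 2 degree cell and scaling each cell's count by the busiest cell. The sketch below is one way to do that, under my assumption that the coloring is a simple fraction of the largest cell; the sample balls are hypothetical apart from the two ground out fill values taken from the appendix.

```python
import math
from collections import Counter


def cell(ev, la, width=2.0):
    """Return the lower-left corner of the grid cell containing (ev, la)."""
    return (math.floor(ev / width) * width, math.floor(la / width) * width)


def relative_frequencies(balls, width=2.0):
    """Map each occupied cell to its count divided by the largest cell's count."""
    counts = Counter(cell(ev, la, width) for ev, la in balls)
    if not counts:
        return {}
    largest = max(counts.values())
    return {c: n / largest for c, n in counts.items()}


# Three balls at 83 mph / -21 degrees, one at 82.9 / -20.699, and two unrelated line drives.
balls = [(83.0, -21.0)] * 3 + [(82.9, -20.699)] + [(95.2, 12.4)] * 2

freqs = relative_frequencies(balls)
print(freqs)
```

Note that both ground out fill values land in the same cell, the one covering 82 to 84 mph and -22 to -20 degrees, which is how a handful of fixed values can produce a single very dense cell.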

The frequencies of various exit velocity and launch angle combinations are clearly changing over time. The difference between 2015 and 2017 is particularly stark. In 2017 the high exit velocity balls appear to be shifting up in launch angle, and the low launch angle balls appear to be moving down in exit velocity. You can also see growing frequencies of pop-ups. The suppressed number of TrackMan recorded pop-ups in 2015 may have been a technical limitation, perhaps. Maybe. I have no evidence for that, but it could be the case, considering how 2016 and 2017 suddenly have balls measured above 80 degrees, although it is clear that the number of pop-ups is definitely increasing each season.

However, with all of these changes, those sticky No Nulls cells are constant.

If the No Nulls are remaining constant from year to year, and there is a clear league-wide migration of batted balls, wouldn’t that mean the No Nulls will get increasingly less accurate over time? If we assume these changing trends are constant, anyway. Perhaps in 2018 some of this will reverse course, batters will start hitting lower launch angle hard-hit balls, and ground balls will increase in velocity. Maybe. That seems unlikely. It is more likely that players will hit even more balls into the air, in an effort to maximize the value of each plate appearance.

Change Is In The Air

When MLB instituted this No Nulls solution, it did so to correct the average launch angle and exit velocity data, both for the major leagues and for individual players. However, if the average distributions keep changing and the No Nulls aren’t updated to match, these No Nulls may end up overcompensating and hurting the data. For example, in the table below I have put the average exit velocity for all balls hit below -20 degrees for each of the three seasons. Notice how it is dropping with each season.

Below -20 Degrees
Year Exit Velocity
2015 74.03
2016 73.82
2017 69.16
SOURCE: Baseball Savant
Discounting Nulls

The difference between 2017 and 2016 is dramatic. Many of the No Nulls ground balls fall into this category — about 25,000 of them. The vast majority of these No Nulls ground balls are listed with an exit velocity of 83 mph. That seems a bit high, in light of the sudden drop in groundball exit velocity in 2017. Perhaps I am wrong. Maybe TrackMan happened to record more of the softly hit ground balls and missed all of the hard hit ones. It’s certainly a possibility.

If the batters are, in fact, producing weaker contact on ground balls and the No Nulls solution hasn’t accounted for this, then the average exit velocity may actually be lower than what you see on Baseball Savant. Perhaps MLB is doing something with the No Nulls to keep up with these trends. I can’t see that it has, and I think the above GIF speaks for itself. These clusters of batted balls haven’t changed, even though the landscape around them has.

What To Do Going Forward

When you are looking at major league average launch angle and exit velocity, you should use the No Nulls solution put forward by MLB. You should understand that these numbers are estimates, and even as estimates they appear to have a minor flaw. The league-average launch angle probably will not be off by much, but exit velocity could be off by as much as 1 mph.

The league averages aren’t a big concern, though. Nor are the player averages. Rather, you must be careful when you examine the league-wide results when bucketing balls based upon their launch angle, exit velocity, or batted ball type, particularly balls hit between -30 and -20 degrees or above 60 degrees. Bucketing these batted balls will subject you to a double whammy: the No Nulls solution artificially inflates both their frequency and their exit velocity.

Whenever you are bucketing batted balls, you will want to first remove the No Nulls balls. I have created four tables consisting of the No Nulls categories I have identified. You can use these definitions to remove these balls from your own research, if you deem it necessary.
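As a sketch of how the removal might look in practice, the code below filters out balls whose stringer tag, exit velocity, and launch angle exactly match a No Nulls definition, then averages what is left. Only four of the definitions from the appendix tables are included to keep the example short, and the field names and sample balls are hypothetical.

```python
from statistics import mean

# A few (stringer tag, exit velocity, launch angle) definitions from the appendix tables.
NO_NULLS = {
    ("Ground Out", 83.0, -21.0),
    ("Ground Out", 82.9, -20.699),
    ("Ground Out", 41.0, -39.0),
    ("Pop Out", 80.0, 69.0),
}


def is_no_null(ball):
    """True when the ball exactly matches one of the No Nulls definitions."""
    return (ball["stringer"], ball["ev"], ball["la"]) in NO_NULLS


def drop_no_nulls(balls):
    """Keep only the balls that look like genuine TrackMan measurements."""
    return [ball for ball in balls if not is_no_null(ball)]


# Hypothetical sample: two filled-in balls and two measured ground outs.
balls = [
    {"stringer": "Ground Out", "ev": 83.0, "la": -21.0},
    {"stringer": "Ground Out", "ev": 71.4, "la": -27.318},
    {"stringer": "Ground Out", "ev": 66.8, "la": -33.052},
    {"stringer": "Pop Out", "ev": 80.0, "la": 69.0},
]

measured = drop_no_nulls(balls)

# Average exit velocity for measured balls hit below -20 degrees.
steep_ev = mean(ball["ev"] for ball in measured if ball["la"] < -20)
print(len(measured), round(steep_ev, 2))
```

Exact matching will occasionally discard a genuine reading that happens to coincide with a fill value, but given how rarely real readings repeat, that loss should be small.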

The No Nulls solution implemented by MLB could be better, but it is difficult to criticize without knowing the exact rules that are being used to classify each batted ball. Clearly the balls could be smoothed out more, perhaps using fielder location or some other metric. It appears that the rules are being applied across all three seasons evenly, but perhaps they should be tailored to each individual season. But, again, I don’t know how MLB is assigning the balls, so perhaps fielder location and seasonal variations are already included to some extent.

Ultimately, though, there is only one true solution to the problem, and that is TrackMan recording 100 percent of the data. It goes without saying that this is one of the highest priorities. Any attempt we as analysts make to manipulate the data after the fact will leave us wanting for more.

Appendix: No Nulls Definitions

Stringer Ground Balls
Stringer EV Vertical Angle Sample
Ground Ball Double    90     -13   124
Ground Ball Double  90.2     -13    92
Ground Ball Double    93      -1    12
Ground Ball Error    43     -62    57
Ground Ball Error    84     -20   462
Ground Ball Error    86     -11    15
Ground Ball Single    40     -36   988
Ground Ball Single    90     -17  2340
Ground Ball Single  90.3   -17.3  1128
Ground Ball Single    93      -3   149
Ground Ball Triple    94     -12     6
Ground Ball Triple  94.3   -12.1    11
Ground Out    41     -39  3777
Ground Out  82.9 -20.699  6083
Ground Out    83     -21 13296
Ground Out    84     -13   507
Total Ground Balls 76.97  -23.07 29047
SOURCE: Statcast

 

Stringer Line Drives
Stringer EV Vertical Angle Sample
Line Drive Double  98.8   17.1   96
Line Drive Double    99     17  244
Line Drive Home Run   104     24   76
Line Drive Home Run 104.4 23.699   21
Line Drive Single    41     16    5
Line Drive Single    90     15  564
Line Drive Single  90.4   14.6  135
Line Drive Triple  98.4     18   11
Line Drive Triple    99     18   31
Line Out    37     31   37
Line Out    91     18  512
Line Out  91.1 18.199  128
Total Line Drives 91.76  17.24 1860
SOURCE: Statcast

 

Stringer Fly Balls
Stringer EV Vertical Angle Sample
Fly Ball Home Run   103     30  160
Fly Ball Home Run 102.8 30.199   57
Fly Ball Triple    97     31   11
Fly Ball Double    95     29   18
Fly Ball Double  93.1     32   32
Fly Ball Double    93     32   48
Fly Out  89.2 39.299  594
Fly Out    89     38  213
Fly Out    89     39 1323
Fly Ball Single    73     34   11
Fly Ball Single  71.4     36   19
Fly Ball Single    71     36   55
Total Fly Balls 89.85  37.79 2541
SOURCE: Statcast

 

Stringer Pop Ups
Stringer EV Vertical Angle Sample
Pop Out    37    62   306
Pop Out    75    60    89
Pop Out    80    69 11833
Pop Up Double    89    63    12
Pop Up Error    81    65    40
Pop Up Single    86    67    48
Total Pop Ups 78.93 68.73 12328
SOURCE: Statcast
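The total row in each table appears to be a sample-weighted average of the rows above it. The sketch below checks this for the pop-up table, using only the values listed there; the variable names are my own.

```python
# Rows from the Stringer Pop Ups table:
# (stringer tag, exit velocity in mph, vertical angle in degrees, sample size)
pop_ups = [
    ("Pop Out", 37.0, 62.0, 306),
    ("Pop Out", 75.0, 60.0, 89),
    ("Pop Out", 80.0, 69.0, 11833),
    ("Pop Up Double", 89.0, 63.0, 12),
    ("Pop Up Error", 81.0, 65.0, 40),
    ("Pop Up Single", 86.0, 67.0, 48),
]

total_sample = sum(n for _, _, _, n in pop_ups)

# Weight each row's value by its sample size, then divide by the total sample.
weighted_ev = sum(ev * n for _, ev, _, n in pop_ups) / total_sample
weighted_la = sum(la * n for _, _, la, n in pop_ups) / total_sample

print(total_sample, round(weighted_ev, 2), round(weighted_la, 2))
```

The results match the Total Pop Ups row: 12,328 balls, 78.93 mph, and 68.73 degrees. The single 80 mph, 69 degree pop out definition supplies 11,833 of those balls, so it dominates the average.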


Andrew Perpetua is the creator of CitiFieldHR.com and xStats.org, and plays around with Statcast data for fun. Follow him on Twitter @AndrewPerpetua.
Tangotiger
Member

Excellent analysis and understanding of the issues and limitations in handling untracked data.

channelclemente
Member

It would be very interesting to learn if there was a pitch type bias reflected in the null data results. From the types of hits missing, one might presume they resulted from a pitch that induced poor contact. That would suggest curve, slider, or cutter, hypothetically, would be over represented in the null buckets.