Data: Impossible – The Minor League Strike Zone, Part 1

In his 64 MLB innings, Scott Rice threw his pitches closer to the ground than any other pitcher. (via slgckgc)

Our mission, which we have chosen to accept, is to delve into new areas of baseball research and explore data sets that have yet to be explored. Today and tomorrow we will explore an old but never used data set, the minor league strike zone, to assess its quality. We’ll also leverage it to learn more about Aaron Nola, predict ground ball pitchers and predict home run hitters.

We baseball fans have been spoiled for many years now by the excellent and public PITCHf/x and StatCast data sets. As a sports fan and as a data analyst who is always on the hunt for new, cool data sets to play with, I’ve become a bigger fan of baseball, at the expense of the other major sports, because of these great public data.

Recently, I was curious to see if it was possible to turn the minor league GameDay strike zone data into actual usable data. MLB’s GameDay XML files detail every pitch thrown in affiliated ball, going all the way back to 2008. These pitches often are hand-coded on a screen to demarcate location, either in the strike zone, or in the playing field for batted balls. Needless to say, there are a lot of errata in these data, especially in the lower leagues or the farther back you go.

However, in prior research, we demonstrated we can use minor league data to measure HR+Fly Ball distance as low as rookie ball and get meaningful, useful data. Those data will be updated and improved in an upcoming article here at The Hardball Times. Today, as well as tomorrow, we’ll look at the minor league strike zone.

If you’re interested in a high-level overview of the methodology behind these data, jump down to the Appendix: Methodology section. Essentially, the x and y coordinates as placed by MiLB stringers for each pitch are converted into PITCHf/x-StatCast style px (horizontal location of the ball as it crosses home plate) and pz (vertical location of the ball).

Data: Impossible – Actually, Possible

Given the hand-crafted source of the data, it is imperative we first establish whether the strike zone data collected by MiLB stringers contain any real, potentially useful signal. When researching these data, I was skeptical whether it was possible to extract anything real, let alone useful. This first part is as much to convince me of the data’s veracity as to convince you.

We’ll start our exploration at the Triple-A level, where the data are of higher quality than in the lower minors. For each pitcher, we take the average vertical location (in feet) of his pitches and compare it to what that same pitcher did at the major league level, limiting the sample to pitchers with at least 500 pitches thrown at both levels. The strength of the relationship will tell us how accurate the vertical mapping of pitches at the Triple-A level is, or whether it is usable at all.
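The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the charts: the record layout (`pitcher_id`, `level`, `pz`) and the function names are assumptions made for the example, and R2 is taken here to be the squared Pearson correlation of the per-player averages.

```python
from collections import defaultdict
from statistics import mean


def mean_by_player(pitches, level, key="pitcher_id", min_pitches=500):
    """Average pz per player at one level, keeping players with enough pitches."""
    groups = defaultdict(list)
    for p in pitches:
        if p["level"] == level:
            groups[p[key]].append(p["pz"])
    return {pid: mean(v) for pid, v in groups.items() if len(v) >= min_pitches}


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def level_to_level_r2(pitches, lower="AAA", upper="MLB",
                      key="pitcher_id", min_pitches=500):
    """R2 between per-player averages at two levels, for players who qualify at both."""
    low = mean_by_player(pitches, lower, key, min_pitches)
    high = mean_by_player(pitches, upper, key, min_pitches)
    shared = sorted(set(low) & set(high))
    return r_squared([low[p] for p in shared], [high[p] for p in shared])
```

Passing a different `key` (for example a hypothetical `batter_id` field) groups the same pitches from the batter’s side instead of the pitcher’s.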

Triple-A Vertical Location to MLB Vertical Location – Pitchers | R2 0.48

An R2 of 0.48! When aggregating the locations by pitcher and level, we get a very strong signal, indicating pitchers maintain a consistent approach to how high or low they typically throw the ball, and we are able to capture this approach in Triple-A based on strike zone data recorded by GameDay. I was stunned at the strength of the signal, which, quite frankly, made me wonder if this was an accident or if I had made a mistake somewhere. So the next logical step was to replicate the chart, this time splitting it by batters instead of pitchers.

Triple-A Vertical Location to MLB Vertical Location – Batters | R2 0.03

The exact same data, when spun to the batter’s perspective, produce a negligible correlation, which is extremely encouraging. We’d expect pitchers, who control where the ball will go, to have a strong correlation, whereas batters, who can merely influence the pitcher, should have a negligible one.

Let’s look at some of the pitchers in the first chart, starting with the very interesting Scott Rice. In 64 big-league innings, Rice managed an excellent 61.7 percent groundball percentage, as well as a very strong 0.28 HR/9 ratio. How did he accomplish this? Simply by pitching waaay closer to the ground (on average) than any other pitcher, which probably explains his 5.43 BB/9 ratio as well.

It’s always encouraging when your most extreme data point, in this case Rice, is not only the most extreme in both data sets (Triple-A and MLB) but also has results that make intuitive sense, all pointing to the presence of real signal. Jared Hughes and Cody Eppley tell a similar story: both are high-groundball, low-homer pitchers whose results are predicated on locating pitches really low in the zone. Let’s look at horizontal location.

Triple-A Horizontal Location to MLB Horizontal Location – Pitchers to RHH | R2 0.25

A 0.25 R2 value for horizontal location is particularly interesting, as it is less likely to be influenced by confirmation bias. A stringer is more likely to chart a groundball pitch as lower in the zone, or a home run as closer to the center, whereas horizontal location may only be impacted by pull/oppo, which would be more dependent on the batter than the pitcher.

It is important to note we’re measuring two things here. First is the relationship between a pitcher’s approach in the minors and his approach in the majors (with respect to horizontal and vertical location), which could be a lot higher than 0.25. Second is the statistical noise baked into the minor league data. The fact that a signal survives that noise tells us that, at the zoomed-out, aggregate-pitcher level, there are some very real, fairly accurate data here.

Triple-A Horizontal Location to MLB Horizontal Location – Pitchers to LHH | R2 0.17

We continue to see a strong signal, albeit not as strong as to right-handed hitters. This is very encouraging, as this potentially could give us information on which pitchers are comfortable pitching inside to hitters (negative px is inside to righties, positive px is inside to lefties).

Let’s move on down the MiLB ladder and take a look at Double-A. Don’t pay attention to the absolute pz Double-A number, as it doesn’t translate perfectly. Rather, look at it as a relative measure compared to other pitchers.

Double-A Vertical Location to MLB Vertical Location – Pitchers | R2 0.34

As we move down the professional ladder, we again meet Scott Rice, one of the more unusual pitchers, statistically speaking. While he was never a legit prospect, had we been armed with these data, we easily would have seen through his excellent groundball percentage as merely the product of an unsustainable approach. Matt Strahm gave up a 14.4 percent HR/FB rate in his Double-A season, perhaps as a by-product of keeping the ball up in the zone. How much lower on the MiLB ladder can we go and still get strong signals?

High-A Vertical Location to MLB Vertical Location – Pitchers | R2 0.03

Data quality appears to evaporate once we go below Double-A, which is not too surprising and suggests we really can use strike zone data only from Double-A and above.

Pitching: Carefully

There is another lens we can use to test the veracity of the data: the average distance from the center of the zone. Batters who routinely swing at everything or exhibit tremendous power will draw pitches with a greater average distance from center. Some pitchers try to nibble and throw as close to the edge of the zone as possible, while others are around the heart of the plate more. We should see a more balanced correlation between batters and pitchers through this lens. Note that the numbers on each axis here are in screen pixels, so we’ve left them out; what’s important is a player’s number relative to other players, not the pixel count itself.
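As a rough sketch of this measure, the snippet below averages each player’s pitch distance from the zone center. It assumes each pitch already carries pixel offsets from the center (named `dx` and `dy` here, which are illustrative field names) and that "distance" means straight-line (Euclidean) distance; the article does not say which distance was used.

```python
from collections import defaultdict
from math import hypot
from statistics import mean


def avg_distance_from_center(pitches, key="batter_id"):
    """Mean straight-line distance (in pixels) from the zone center, per player."""
    groups = defaultdict(list)
    for p in pitches:
        # hypot(dx, dy) is sqrt(dx**2 + dy**2), the Euclidean distance.
        groups[p[key]].append(hypot(p["dx"], p["dy"]))
    return {player: mean(dists) for player, dists in groups.items()}
```

Grouping by a pitcher identifier instead of `batter_id` gives the pitcher-side view of the same pitches.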

Triple-A Distance from Center to MLB Distance from Center – Batters | R2 0.15

Aaron Judge makes for a far more interesting outlier than Scott Rice. Judge joins Joey Gallo, Andrew Knapp, Rhys Hoskins, Matt Olson and Aaron Altherr as players whom pitchers were most careful pitching to while they were in Triple-A. All of those players, aside from perhaps Knapp, make sense as players minor league pitchers would avoid, as opposed to Joe Mather and Felipe Lopez.

The data get a bit wonky as we move to Double-A and High-A, where we see some correlation, but not enough to convince us it is showing a true signal, at least for batters. This could be due to flaws in either the methodology employed here or the data, or it could be that most batters in Double-A and below aren’t exposed to consistently different approaches, allowing for a few exceptions.

Triple-A Distance from Center to MLB Distance from Center – Pitchers | R2 0.13

It’s nice to see Kevin Slowey as a classic control pitcher outlier, both in Triple-A and in the majors. I wouldn’t bank on these data alone to tell us how a batter is being approached, or how much a Triple-A pitcher is nibbling. However, in conjunction with other data points, it may give us a slightly deeper insight.

Conclusion: Possible

We definitely have real, usable and (hopefully) actionable data at our disposal. Tomorrow, we’ll explore the data some more and see what they can tell us about Aaron Nola, calculate how much they can help us predict ground ball pitchers and leverage them to build a model predicting home run percentage using only Triple-A data.

Appendix: Methodology

Step 1: Calculate Center of Strike Zone

Stringers for each game will have their own unique bias, based on their vantage point and how well they can see the zone. To account for this, we found the median value for all horizontal locations (i.e. the median “x” value) as well as the average vertical value (i.e. the average of all “y” values) for each game. I don’t have any scientific reason why I used the median for the “x” and the average for the “y” other than my data gut felt it gave a cleaner result.

For those of you using Tableau, this can be done by using FIXED LOD calculations, such as {FIXED [game_id] : MEDIAN([x])}, which calculates the median x for each game_id. These values were then used as the de facto center of the zone. This method, while not perfect, was the simplest way to adjust for x and y values that shifted based on year or were affected by bias from the stringer inputting the data. It should roughly adjust for those shifts, even if it might not be capturing the true center. Effectively, what we’re measuring is relative horizontal and vertical location compared to the average pitcher in that game.
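For readers not using Tableau, the same per-game center can be sketched in plain Python. The field names (`game_id`, `x`, `y`) are assumed for illustration and may not match the actual GameDay attribute names.

```python
from collections import defaultdict
from statistics import mean, median


def game_centers(pitches):
    """Per-game zone center: median of x and mean of y, as described in Step 1."""
    xs, ys = defaultdict(list), defaultdict(list)
    for p in pitches:
        xs[p["game_id"]].append(p["x"])
        ys[p["game_id"]].append(p["y"])
    return {game: (median(xs[game]), mean(ys[game])) for game in xs}
```

The result maps each game to a `(center_x, center_y)` pair, which plays the role of the FIXED LOD values in the Tableau version.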

Step 2: Calculate Distance from Horizontal Zone and Vertical Zone

The next step was to determine the Delta X and Delta Y (distance from the center as calculated in step 1). This is what it looks like:

The organic bell-curves produced were very encouraging as the data began to emerge. For fun, these are what seasons 2008 to 2010 looked like for Triple-A:

Totally unusable data, at least based on the above methodology. For this reason, all data presented above are from 2011 to 2018.
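Step 2, together with the 2011 to 2018 restriction, might look like the sketch below. It assumes a mapping from each game to its `(center_x, center_y)` pair is already available, and the field names (`season`, `game_id`, `x`, `y`) are again illustrative rather than taken from the source files.

```python
def add_deltas(pitches, centers, first_season=2011, last_season=2018):
    """Attach Delta X / Delta Y (offset from the game's center) to each pitch.

    `centers` maps game_id -> (center_x, center_y). Pitches outside the
    season window are dropped, mirroring the 2011-2018 restriction.
    """
    out = []
    for p in pitches:
        if not first_season <= p["season"] <= last_season:
            continue
        cx, cy = centers[p["game_id"]]
        out.append({**p, "dx": p["x"] - cx, "dy": p["y"] - cy})
    return out
```

The offsets stay in screen pixels at this point; no scaling to feet has happened yet.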

Step 3: Convert to MLB type px and pz

Lastly, the pixels were converted to their equivalent “px” and “pz” (i.e. the horizontal and vertical location of the ball as it crosses the front edge of home plate) using a simple conversion formula that adjusted for differences in seasons and levels of play and then scaled to roughly the MLB equivalent.
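The exact conversion formula is not given here, so the following is only one plausible reading of it, not the formula actually used: standardize the pixel offsets within each season and level, then rescale them to an MLB-like mean and spread in feet. The reference values in `MLB_REF` are made-up placeholders, and the `flip_y` option reflects an assumption that screen y grows downward.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical MLB reference values in feet, as (mean, standard deviation).
# Placeholders for illustration only; they do not come from the article.
MLB_REF = {"px": (0.0, 0.85), "pz": (2.25, 0.95)}


def to_px_pz(pitches, mlb_ref=MLB_REF, flip_y=True):
    """Rescale pixel offsets (dx, dy) to rough px/pz within each (season, level)."""
    groups = defaultdict(list)
    for p in pitches:
        groups[(p["season"], p["level"])].append(p)

    sign = -1.0 if flip_y else 1.0  # assumption: pixel y increases downward
    converted = []
    for rows in groups.values():
        dxs = [r["dx"] for r in rows]
        dys = [r["dy"] for r in rows]
        mx, sx = mean(dxs), pstdev(dxs)
        my, sy = mean(dys), pstdev(dys)
        if sx == 0 or sy == 0:
            continue  # no spread in this group, so nothing to scale
        for r in rows:
            px = mlb_ref["px"][0] + mlb_ref["px"][1] * (r["dx"] - mx) / sx
            pz = mlb_ref["pz"][0] + mlb_ref["pz"][1] * sign * (r["dy"] - my) / sy
            converted.append({**r, "px": px, "pz": pz})
    return converted
```

Standardizing within each season and level is what absorbs differences between seasons and levels in this sketch; a different adjustment would fit the description above equally well.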

Eli Ben-Porat is a Senior Manager of Reporting & Analytics for Rogers Communications. The views and opinions expressed herein are his own. He builds data visualizations in Tableau, and preps data in Alteryx. Follow him on Twitter @EliBenPorat.
Jetsy Extrano

That AAA batters distance from center has an interesting nonlinear shape, like a boomerang. It seems to say that up to a certain level of batter, AAA pitchers don’t care, they don’t change their approach, whereas MLB pitchers do. Above that level, both take notice and move away from meatballs.