Let’s Build Our Own Catch Probability Metric

By now you’ve seen the Statcast Catch Probabilities. They’re great! Or, at the very least, they’re a shiny new toy to play with until the regular season rolls around. But, as you may have noticed, there are a few frustrating details about them: the actual math behind the statistic is completely opaque, and the details about when an individual catch happened are hard to find. So let’s fix those two problems! We’ll create a catch probability metric that anyone can compute in Excel, using data that anyone can download easily.

You may have noticed a problem with this plan, though — the data that is used for the official Statcast catch probability isn’t easily accessible. We’ll have to make do with what we can get from the Statcast search at Baseball Savant. Specifically, instead of using hang time and distance traveled, we’ll use exit velocity and launch angle. Note that this completely disregards defensive positioning and it even disregards the horizontal angle off the bat*! It’s going to make for a less perfect metric, of course, but (spoiler alert) it will turn out okay.

*This really makes more sense if you think about it in terms of probability of the hitter making an out. The old saying goes “hit ’em where they ain’t” but in recent years we’ve come to understand that it’s really “hit it hard and in the air.”

I’m not going to go into the details of how I computed this metric; it’s standard machine learning stuff. If you want to follow along with the computation, I’ve put my code up on GitHub. Instead of going through all that here, I’ll just jump to the finish line: the formula for catch probability ends up being

1/(1+exp(-(-10.152 + 0.057 * hit_speed + 0.218 * hit_angle)))
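If you’d rather poke at the formula outside of a spreadsheet, here is a minimal Python sketch of the same calculation. The coefficients are copied straight from the formula above; the function name, the sample batted balls, and the assumption that hit_speed is in mph and hit_angle is in degrees (the units Baseball Savant reports) are mine.

```python
import math

# Coefficients copied from the formula above.
INTERCEPT = -10.152
SPEED_COEF = 0.057  # per unit of hit_speed (assumed to be mph)
ANGLE_COEF = 0.218  # per unit of hit_angle (assumed to be degrees)


def catch_probability(hit_speed, hit_angle):
    """Logistic estimate of catch probability from exit velocity and launch angle."""
    z = INTERCEPT + SPEED_COEF * hit_speed + ANGLE_COEF * hit_angle
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical batted balls, for illustration only.
    for speed, angle in [(90, 30), (100, 15), (75, 45)]:
        prob = catch_probability(speed, angle)
        print(f"{speed} mph at {angle} degrees: {prob:.3f}")
```

Because both coefficients are positive, the estimate always rises as either exit velocity or launch angle goes up.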

Now you might be worried that such a simple formula, excluding tons of information, might be totally worthless. I was worried about that too! But applying it to a test set showed it to be surprisingly accurate:

Catch Probability Assessment
Statistic    Value
Accuracy     0.8385
Precision    0.8338
Recall       0.8671
F1           0.8501

(If you’ve never seen those numbers before: each one runs from 0 to 1, and closer to 1 is better. Trust me, it’s pretty good.)
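For anyone who wants to see where numbers like these come from, here is a rough Python sketch of the standard definitions, treating "caught" as the positive class. The counts in the example are hypothetical, not the ones from my test set, and the function names are just for illustration. The one check that does use the numbers above is the last line: F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Compute the four summary statistics from confusion-matrix counts.

    tp: predicted caught, was caught      fp: predicted caught, was not caught
    fn: predicted not caught, was caught  tn: predicted not caught, was not caught
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=80, fp=20, fn=10, tn=90))

    # Consistency check on the reported table: precision 0.8338, recall 0.8671.
    print(round(f1_score(0.8338, 0.8671), 4))
```

Turning a probability into a yes/no prediction requires a cutoff; 0.5 is the usual default, but treat that as an assumption here rather than something the table states.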

Well, that’s all well and good, but how can you get this for yourself and play around with it? Start by downloading the data you’re interested in from Baseball Savant. For instance, you can get all the data from, say, May 1 of last year by going here. Download the CSV with the link at the bottom and then you can simply add the above formula in a new column in Excel. If you need a concrete example of how this looks in Google Sheets, I’ve put one here.
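If spreadsheets aren’t your thing, the same new-column step can be sketched in Python with nothing but the standard library. This is only an illustration: the sample rows are made up, I’m assuming the download uses the column names hit_speed and hit_angle (the names in the formula above), and I’m assuming untracked batted balls show up with a non-numeric placeholder such as "null". In practice you would open the downloaded file instead of the in-memory sample.

```python
import csv
import io
import math

# Made-up sample standing in for a downloaded CSV (column names are assumed).
SAMPLE_CSV = """player_name,hit_speed,hit_angle
Hypothetical Batter A,95.2,28.0
Hypothetical Batter B,null,null
Hypothetical Batter C,88.1,41.5
"""


def catch_probability(hit_speed, hit_angle):
    """Same logistic formula as in the article."""
    z = -10.152 + 0.057 * hit_speed + 0.218 * hit_angle
    return 1.0 / (1.0 + math.exp(-z))


def add_catch_probability(rows):
    """Return copies of the rows with a new 'catch_prob' field.

    Rows whose speed or angle is missing or non-numeric get None.
    """
    result = []
    for row in rows:
        try:
            speed = float(row["hit_speed"])
            angle = float(row["hit_angle"])
        except (KeyError, TypeError, ValueError):
            prob = None
        else:
            prob = catch_probability(speed, angle)
        result.append({**row, "catch_prob": prob})
    return result


rows = add_catch_probability(csv.DictReader(io.StringIO(SAMPLE_CSV)))

if __name__ == "__main__":
    for row in rows:
        print(row["player_name"], row["catch_prob"])
```

From there, sorting the rows by catch_prob is a quick way to surface the least likely outs.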

Okay, now you’ve got this, but what are you going to do with it? One possibility is to use this to try to figure out which plays the official metric estimated as being difficult. For instance, let’s say you’ve noticed that Miguel Sano made two highlight-quality plays but you don’t know Mike Petriello well enough to ask him which ones those are. Just compute your own probabilities and you’re off! Although, as expected, the numbers differ. Our numbers do have Sano making two plays in the 0-25% range, but they’re not the same ones that Statcast flagged (sorry about the quality of the GIFs).

Catch #1: estimated catch probability 18.3%

Catch #2: estimated catch probability 21.3%

The Twins announcers praised his first step in the first video, while in the second they talked about how the ball “hung up” for Sano to be able to catch it. Not spectacular plays by any means, but neither were the other two, of course.

Finally, because I’m sure you’re curious, here’s the top catch of 2016 according to this metric (estimated catch probability: 8.6%).

Of course it’s a Kevin Kiermaier catch. Hey, at least we know we’re doing something right.


The Kudzu Kid does not believe anyone actually reads these author bios.


correct me if i’m off base here but your model says that catch probability increases monotonically with hit velocity as well as with launch angle. hang time also increases monotonically with those variables.

couldn’t you just use both of those components to calculate hang time using physics? would that model be better or worse than your model?


Were you able to exclude home runs from the model? My concern would be that some semi-fence scrapers would cause some unnecessary error.


Also, is it possible to add horizontal spray angle? It can be estimated using the X,Y coordinates of the batted ball and some trig.