Probabilistic Pitch Framing (part 2)

September 25, 2013

This is part two of a three-part series detailing a method of judging pitch framing based on the prior probability of the pitch being called a strike. In part 1, we motivated the method. Here in part 2, we will formalize it.

The formula we’ll use for judging catcher framing is pretty simple on its face. For each called pitch (a ball or strike the batter takes), we calculate the value

IsCalledStrike - prob(CalledStrike)

Here, IsCalledStrike is simply 1 if the pitch is called a strike, and 0 otherwise. The second term is the probability that the pitch would have been called a strike, absent any information about the catcher’s involvement. We add up these values for every called ball or strike that a catcher receives, and report the resulting number: a positive total means the catcher got more called strikes than the pitch locations alone would predict, and a negative total means fewer. Since this method is essentially identical to defensive plus/minus, I’ve taken to calling it Catcher Plus/Minus (CPM), although someone reading this can probably come up with something better. It has been brought to my attention that this method has been developed before, but I can’t find it written up anywhere on the web. So you are welcome to consider this the documentation of an existing method, if you’d like.
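
As a rough sketch of the bookkeeping, here is how the per-catcher totals could be computed in R. The per-pitch value is a difference: the call minus its expected probability. The data frame called_pitches and its columns catcher, CalledStrike and prob are made up for illustration; in practice prob would come from a strike-zone model.

```r
# Hypothetical called pitches: one row per called ball or strike.
# 'prob' stands in for the modeled strike probability of each pitch.
called_pitches <- data.frame(
  catcher      = c("A", "A", "A", "B", "B", "B"),
  CalledStrike = c(1, 1, 0, 0, 1, 0),
  prob         = c(0.90, 0.40, 0.20, 0.70, 0.95, 0.50)
)

# Per-pitch value: strike actually called minus strike expected.
called_pitches$value <- called_pitches$CalledStrike - called_pitches$prob

# Catcher Plus/Minus: sum the per-pitch values for each catcher.
cpm <- tapply(called_pitches$value, called_pitches$catcher, sum)
print(cpm)
```

With these invented numbers, catcher A finishes about half a strike above expectation (+0.5) and catcher B a little over one strike below it (-1.15).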

PITCHf/x hands us the first term above; we’ll have to work a bit to get the second. There are many ways to approximate these probabilities; we will follow the lead of Matthew Carruth and use a generalized additive model (GAM) to build them (more on this later). Now we could simply build this model for all pitches:

and that would probably be pretty good, but let’s also include two more pieces of information: the ball-strike count and the handedness of the batter. This will give us 24 different probability plots like the one above, one for each combination of count (12 possibilities) and handedness (2).

Now, you might rightly object that this is willfully ignoring tons of information that MLB is giving us. We know so much about these pitches — the pitch type, the horizontal and vertical break, the handedness of the pitcher, the top of the batter’s strike zone, the wind, the stadium, the home plate umpire … any and all of these bits of info could paint a better picture of the strike zone. We have to use all the information at our disposal, don’t we?

Well, we should … but it turns out it’s, um, really hard. The data gets really thin when you drill down too far, and we can’t do anything decent with that many variables without building a fully Bayesian model. Now, again, we should do this, but building such a model is much more difficult than writing the few lines of R it takes to fit a GAM*. And while these other variables probably do influence the strike zone, the assumption here is that they do not influence it as much as handedness and count. On the other hand, if someone does want to do this and plug the numbers into the CPM formula, well, that would be just great.

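* Set up a data frame called pitches with fields CalledStrike (1 if the pitch was called a strike, 0 otherwise), px, and pz (the horizontal and vertical location of the pitch as it crosses the plate). Then: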

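library(mgcv)  # provides gam()
# Model strike probability as smooth functions of horizontal (px) and vertical (pz) location
fit <- gam(CalledStrike ~ s(px) + s(pz), family = binomial, data = pitches)
# To get the probability at a point (my_x, my_z), define my_x and my_z first, then:
my_point <- data.frame(px = my_x, pz = my_z)
# predict() returns log-odds by default; type = "response" converts to a probability
final_prob <- predict(fit, my_point, type = "response")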
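
To get the 24 count-and-handedness zones described earlier, one option is to split the data and fit a separate GAM to each piece. The sketch below assumes that pitches also has columns balls (0 to 3), strikes (0 to 2) and stand ("L" or "R"); those column names are assumptions for illustration, and fit_zone_models and zone_prob are hypothetical helpers, not part of mgcv.

```r
library(mgcv)  # provides gam()

# Hypothetical helper: fit one strike-zone GAM per count/handedness group.
# Assumes 'pitches' has columns CalledStrike, px, pz, balls, strikes, stand.
fit_zone_models <- function(pitches) {
  # Group label such as "3-2 L" for a full count to a left-handed batter
  group <- paste0(pitches$balls, "-", pitches$strikes, " ", pitches$stand)
  lapply(split(pitches, group), function(d) {
    gam(CalledStrike ~ s(px) + s(pz), family = binomial, data = d)
  })
}

# Hypothetical helper: strike probability at (px, pz) for a given group.
zone_prob <- function(models, balls, strikes, stand, px, pz) {
  key <- paste0(balls, "-", strikes, " ", stand)
  new_point <- data.frame(px = px, pz = pz)
  as.numeric(predict(models[[key]], new_point, type = "response"))
}
```

Something like models <- fit_zone_models(pitches) followed by zone_prob(models, 3, 2, "L", 0, 2.5) would then give the modeled strike probability for a pitch down the middle on a full count to a lefty. Splitting the data this way keeps each fit simple, at the cost of leaving fewer pitches for the rarer counts.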

Anyhow! We can use this to judge pitch framing, but more importantly, we can make some animated gifs! Let’s take yet another look at how handedness affects the strike zone:

We can also look at how the count affects the strike zone. I for one was stunned at how much of an effect it had, but then again, I’m easily stunned.

Heck, we can even look at the different strike zones for sliders and four-seam fastballs, even though we’ve decided not to use pitch type in the model:

Aw, hell, there is a difference. Well, as we’ve already discussed, the way I’ve decided to go is definitely not the best way to compute these probabilities. But it’s pretty good, computationally tractable, and will hopefully give us decent results for catcher framing. I guess we’ll find out in part 3.

7 Comments

J. Cross

10 years ago

Kudzu,

I really like this method of nailing down pitch framing. Brian Mills did some research:
http://princeofslides.blogspot.com/2013/07/advanced-sab-r-metrics-parallelization.html

and found that using an interaction, s(px,pz), worked better than s(px)+s(pz). He actually uses s(px, pz, batter_hand) but I’m guessing that just doing separate regression for lefties and righties gets it done too. I like the idea of using the gamm package and coding pitchers as a random effect but IIRC when I tried this it took forever and ultimately crashed my computer. Anyway, I’m looking forward to seeing what you get.

btw, should it be: IsCalledStrike – prob(CalledStrike) since you’re interested in the difference or am I missing what you’re doing?

The Kudzu Kid

Reply to J. Cross

Gah! Yes, it is supposed to be a minus sign. I can’t seem to edit it; maybe I can get an admin to do it. In the meantime, you can mentally replace the plus sign with a minus sign.

I hadn’t seen Brian Mills’ work — thanks for the link. It looks good enough that I will weigh using his method when I run the numbers for part 3.

Peter Jensen

Here is the link to Max Marchi’s probabilistic model of catcher framing, published at the Hardball Times two years ago. You really should read it and all of Max’s follow-up articles at THT and BP.

http://www.hardballtimes.com/main/article/evaluating-catchers-quantifying-the-framing-pitches-skill/

You really can’t ignore pitcher handedness, as a ball on the corners from an opposite-handed pitcher will look completely different to an umpire than one from a same-handed pitcher.

Reply to Peter Jensen

Thanks for the link. I don’t have enough pitch f/x data to work with, but if I can get a few more years I will definitely include (at least) pitcher handedness.

Sandy Kazmir

I’m pretty lousy with R, but I wanted to try your code for the GAM fit in the middle of this pretty solid article. I can read in my dataset that includes whether the pitch was a called strike (1) or not (0). I can get it to load the mgcv package and I can enter the first two lines no problem, but when I get to the my_point… line I get this error:

Error in data.frame(px = my_x, pz = my_z) : object ‘my_x’ not found

I was hoping you could help with this because I’d like to learn how to do more with R. Thanks for your time.

Reply to Sandy Kazmir

You need to define my_x and my_z, like
my_x <- 0
my_z <- 3

Reply to The Kudzu Kid

Thanks a ton.
