Since I develop my own player forecasts, I am always looking for better ways to use the advanced metrics to project the more basic ones. You may remember the quest Chad Young and I engaged in earlier this year to predict HR/FB ratio using batted ball distance. That didn’t go as far as I had hoped, but it did highlight the value of the batted ball distance data. When I started doing the research for my Contact% articles last week, I figured there might be a combination of plate discipline metrics, including Contact%, that does a good job of estimating a hitter’s strikeout percentage. I was right. Of course, this is nothing earth-shattering: Jeff Zimmerman reminded me that he looked at that very same concept late last season, which had clearly already escaped my brain. Since I had the data and regressions done, I decided to do a version 2.0 of formulating a hitter’s expected K%.
Initially, I used a smaller data set than Jeff. But if I was going to duplicate his work, it would be much more useful to use the same data set so that I could compare my results to his. So I did, collecting all hitters with at least 200 plate appearances from 2002 to 2012, which gave me a total of 3,796 player seasons. My initial thought was that a hitter’s strikeout rate would depend largely on his Contact% and the rate of pitches he saw inside the strike zone, indicated by Zone%. I then tested a host of different combinations that made logical sense to me (if anyone knows how to tell Excel to automatically test every single regression combination from a series of variables, PLEASE share!) until I found the best one, as follows:
xK% = 1.095 – (Z-Swing% * 0.250) – (Contact% * 0.888) – (Zone% * 0.076)
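To make the arithmetic concrete, here is a minimal Python sketch of the equation above. The function name and the sample hitter are made up for illustration, and it assumes each rate is entered as a fraction (0.80 rather than 80), which is what the 1.095 intercept implies.

```python
def expected_k_rate(z_swing, contact, zone):
    """Estimate strikeout rate from three plate discipline rates.

    All inputs and the result are fractions (0.80 means 80%).
    Coefficients are copied from the xK% equation in the article.
    """
    return 1.095 - (z_swing * 0.250) - (contact * 0.888) - (zone * 0.076)


# Hypothetical hitter, not a real player line:
# Z-Swing% of 65%, Contact% of 80%, Zone% of 45%.
xk = expected_k_rate(z_swing=0.65, contact=0.80, zone=0.45)
print(f"xK% = {xk:.1%}")  # 1.095 - 0.1625 - 0.7104 - 0.0342 = 0.1879
```

For that invented hitter the estimate works out to roughly an 18.8% strikeout rate. Raising Contact% to 85% with the other two inputs unchanged lowers it by about 4.4 percentage points, which shows how much of the estimate Contact% carries.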
I wasn’t sure whether it was even worth posting a version 2.0, given that Jeff’s R-squared was 0.79 and mine is only a slight improvement. However, my equation has one fewer variable, and it includes Zone%. That said, my initial assumption was that Zone% would have a positive correlation with K%. That a hitter who sees more strikes will strike out more often seems like an obvious concept. Yet this equation, and the others that I generated, all had Zone% with a negative coefficient. I wonder if that is because a hitter who sees fewer pitches in the zone is liable to chase more pitches out of the zone, which are harder to make contact with. That’s the only explanation I can think of, but it does seem to make sense.
On the other hand, the negative coefficient for Z-Swing% makes complete sense. If a hitter is getting thrown strikes and not swinging at them, they will be called strikes, and he is more likely to get called out looking. The strong negative relationship with Contact% is self-explanatory. I tried breaking Contact% up into Z-Contact% and O-Contact% instead of the umbrella term, but it led to a worse R-squared (and more terms! gasp!).
There was also an interesting comment on Jeff’s article by slash12, known around here for his work on an xBABIP formula. He noted that he had looked at an equation for estimating a hitter’s strikeout rate in the past, but did not find it to be a good predictor, and players who beat the model seemed to do so consistently. Unfortunately, this is going to happen with any estimator metric we derive, with outliers always screwing up our dreams of the metric working for every player. There are always going to be other factors involved that either don’t show up in any stats, including those not available here, or do show up, but we just haven’t figured out exactly where to look.
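The fit-and-compare routine described above (try a combination of metrics, check the R-squared, repeat) can be automated outside of Excel. The following is a rough Python sketch of an exhaustive "best subsets" search, not the method used for this article. Every function name is hypothetical, the rows are toy values generated from the xK% equation rather than real player seasons, and ranking by adjusted R-squared is an assumption chosen so that extra terms are penalized.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y[t] for t, r in enumerate(rows)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def best_subsets(data, target):
    """Fit every combination of predictors; rank by adjusted R-squared."""
    names = [name for name in data if name != target]
    y = data[target]
    n = len(y)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            _, r2 = fit_ols([data[name] for name in subset], y)
            adj = 1.0 - (1.0 - r2) * (n - 1) / (n - size - 1)
            results.append((adj, subset))
    return sorted(results, reverse=True)


# Toy rows only: the K% column is computed from the xK% equation so the
# search has a known right answer. These are NOT real player seasons.
toy = {
    "z_swing": [0.60, 0.70, 0.65, 0.62, 0.68, 0.72, 0.58, 0.66],
    "contact": [0.70, 0.75, 0.80, 0.85, 0.90, 0.78, 0.82, 0.88],
    "zone": [0.45, 0.44, 0.47, 0.43, 0.46, 0.48, 0.42, 0.45],
}
toy["k"] = [
    1.095 - 0.250 * zs - 0.888 * ct - 0.076 * zn
    for zs, ct, zn in zip(toy["z_swing"], toy["contact"], toy["zone"])
]

ranked = best_subsets(toy, target="k")
for adj, subset in ranked:
    print(f"{adj:.4f}  {', '.join(subset)}")
```

With three candidate metrics there are only seven combinations to fit, but the count doubles with every metric added, so this brute-force approach is practical only for a modest list of variables. On a real data set, a statistics package would be a sturdier choice than the hand-rolled solver used here to keep the sketch self-contained.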
Given that plate discipline metrics stabilize sooner than results-based statistics like strikeout rate, it is still worth using an xK% formula this early in the season. As usual, I will follow up on this with the names of hitters striking out more and less often than the xK% formula estimates.