Using Markov Chains to Predict K% and BB%

There are 12 “states” of the count in baseball: 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2, 3-0, 3-1, 3-2. In addition there are 3 “states” in which a plate appearance can end: strikeout, walk, and ball in play. This means that MLB plate appearances lend themselves wonderfully to analysis with Markov chains.

Every pitch thrown in MLB can be classified as a swinging strike, called strike, ball, foul, or ball in play. Each of these classifications has a defined effect in each count. For example, a swinging strike in an 0-1 count leads to an 0-2 count, and a foul in a 2-2 count leads to another 2-2 count.

Using PITCHf/x plate discipline statistics and a little algebra, it is possible to calculate the chance of each of these occurrences on any given pitch. Called strikes, swinging strikes, and balls are easy enough to calculate, but it gets tricky with fouls and balls in play. Both have the same requirements: the batter must swing and must make contact. To separate fouls from balls in play, then, we need to find how many pitches against a pitcher were contacted and how many of those were put into play; the remainder are fouls. The number of balls in play is easily found, since every batter faced by a pitcher either strikes out, walks, is hit by a pitch, or puts the ball in play.
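As a rough sketch of that algebra, the per-pitch outcome probabilities can be derived from the five plate discipline rates plus the share of contact that stays fair. The function names, argument names, and sample numbers here are illustrative assumptions, not the actual calculation behind this post. In particular, the fair share needs a total pitch count in addition to TBF, K, BB, and HBP, so `pitches` is treated as an extra input.

```python
def pitch_outcome_probs(zone, z_swing, o_swing, z_contact, o_contact, fair_share):
    """Return per-pitch probabilities of the five pitch outcomes.

    All inputs are fractions between 0 and 1. fair_share is the fraction of
    contacted pitches that are put in play (the rest are fouls).
    """
    out_zone = 1.0 - zone
    called_strike = zone * (1.0 - z_swing)   # in the zone, no swing
    ball = out_zone * (1.0 - o_swing)        # out of the zone, no swing
    swinging_strike = (zone * z_swing * (1.0 - z_contact)
                       + out_zone * o_swing * (1.0 - o_contact))
    contact = zone * z_swing * z_contact + out_zone * o_swing * o_contact
    return {
        "called_strike": called_strike,
        "swinging_strike": swinging_strike,
        "ball": ball,
        "foul": contact * (1.0 - fair_share),
        "in_play": contact * fair_share,
    }


def fair_share_from_totals(tbf, k, bb, hbp, pitches, contact_rate):
    """Balls in play (TBF - K - BB - HBP) over estimated contacted pitches."""
    balls_in_play = tbf - k - bb - hbp
    return balls_in_play / (pitches * contact_rate)


# Hypothetical inputs, for illustration only (not any real pitcher's line).
probs = pitch_outcome_probs(zone=0.48, z_swing=0.65, o_swing=0.30,
                            z_contact=0.85, o_contact=0.60, fair_share=0.45)
print(probs)
print(sum(probs.values()))  # the five outcomes together cover every pitch
```

Splitting contact into fouls and balls in play with a single fair share keeps the five probabilities summing to one, which is what a row of a transition matrix needs.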

Unfortunately for the Markov process, major league players do not act randomly. In different counts, pitchers are more or less likely to throw the ball in the zone, and hitters are more or less likely to swing. This must be accounted for or the simulation will bear only a passing resemblance to the game actually played on the field. Using BaseballSavant, I found the rate at which pitchers throw in and out of the zone on every count, and then created an index stat like wRC+, where 100 is average and 110 is 10% more than average. For example, 3-0 counts have a Zone index of 129, and 0-2 counts have a Zone index of just 62. I did the same thing for Z-swing% and O-swing%. One caveat is that the Zone% numbers I got on BaseballSavant do not match those found in the PITCHf/x plate discipline stats. However, since these index stats are all RELATIVE to league average, it should not make a difference.

Count  ZONE+  ZSWING+  OSWING+
0-0    110    61       53
0-1    88     112      98
0-2    62     131      117
1-0    113    91       82
1-1    99     119      115
1-2    75     134      135
2-0    121    91       80
2-1    115    123      120
2-2    95     137      152
3-0    129    18       19
3-1    128    114      106
3-2    122    139      169
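The index idea can be sketched in a few lines. The first function is just the definition of an index stat (100 is average). The second shows one plausible way to apply an index to a player's overall rate; the post does not spell out the exact rule, so the scaling and the cap at 100% are assumptions, as is the sample Zone%.

```python
def index_stat(count_rate, league_rate):
    """Index where 100 is league average and 110 is 10% above average."""
    return 100.0 * count_rate / league_rate


def adjust_for_count(player_rate, index):
    """Scale a player's overall rate by a count index, kept within [0, 1].

    This scaling rule is an assumption; the article does not state it.
    """
    return min(1.0, max(0.0, player_rate * index / 100.0))


# ZONE+ values copied from the table above for two counts.
ZONE_PLUS = {"3-0": 129, "0-2": 62}

player_zone = 0.50  # hypothetical overall Zone% for some pitcher
for count, idx in ZONE_PLUS.items():
    print(count, adjust_for_count(player_zone, idx))
```

The cap matters for the largest indexes: a hitter with a high O-swing% multiplied by the 3-2 OSWING+ of 169 could otherwise exceed 100%.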

Once we have all this data for a pitcher, we can use a Markov chain to essentially simulate an infinite number of plate appearances for him. Every plate appearance starts at 0-0. By knowing the chances of all the per-pitch results, we can estimate how many 1-0 and 0-1 counts the pitcher would get into, and how many times the pitch would be put into play. From 1-0, we can estimate how many counts become 2-0 or 1-1 or balls in play, and from 0-1, we can estimate how many become 0-2 or 1-1 or balls in play. Simulating in this way, every plate appearance will eventually lead to a strikeout, walk, or ball in play.
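To make the bookkeeping concrete, here is a minimal sketch of how per-pitch outcome probabilities could be turned into a transition matrix over the 12 counts and 3 ending states. The function names and the flat sample probabilities are assumptions for illustration; a real matrix would use different probabilities in each count.

```python
COUNTS = [f"{b}-{s}" for b in range(4) for s in range(3)]  # "0-0" ... "3-2"
STATES = COUNTS + ["K", "BB", "IP"]  # IP = ball in play


def next_state(balls, strikes, outcome):
    """State reached from a count after one pitch with the given outcome."""
    if outcome == "in_play":
        return "IP"
    if outcome == "ball":
        return "BB" if balls == 3 else f"{balls + 1}-{strikes}"
    if outcome == "foul":
        # A foul with two strikes leaves the count unchanged.
        return f"{balls}-{strikes}" if strikes == 2 else f"{balls}-{strikes + 1}"
    # Called or swinging strike.
    return "K" if strikes == 2 else f"{balls}-{strikes + 1}"


def build_transition_matrix(probs_by_count):
    """probs_by_count maps each count to a dict of outcome -> probability."""
    index = {s: i for i, s in enumerate(STATES)}
    matrix = [[0.0] * len(STATES) for _ in STATES]
    for count in COUNTS:
        balls, strikes = int(count[0]), int(count[2])
        for outcome, p in probs_by_count[count].items():
            target = next_state(balls, strikes, outcome)
            matrix[index[count]][index[target]] += p
    for s in ("K", "BB", "IP"):
        matrix[index[s]][index[s]] = 1.0  # ending states never change
    return matrix


# Hypothetical probabilities, reused in every count only to keep this short.
flat = {"called_strike": 0.17, "swinging_strike": 0.10, "ball": 0.36,
        "foul": 0.17, "in_play": 0.20}
P = build_transition_matrix({c: flat for c in COUNTS})
print(dict(zip(STATES, P[0])))  # transitions out of 0-0
```

Called strikes, swinging strikes, and fouls all land in the same cell before two strikes, which is why each row has only three or four non-zero entries.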

For every pitcher who qualified for the ERA title in 2014, I imported his Zone%, Z-swing%, O-swing%, Z-contact%, O-contact%, TBF, K, BB, and HBP (the last 4 only to calculate fair/foul%). Using these, I created a transition matrix for each pitcher that shows the probabilities of moving to any state of the count from any other given count. For example, here is Clayton Kershaw’s 2014 transition matrix.

  0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2 K BB IP
0-0 0 0.546 0 0.344 0 0 0 0 0 0 0 0 0 0 0.110
0-1 0 0 0.471 0 0.350 0 0 0 0 0 0 0 0 0 0.180
0-2 0 0 0.207 0 0 0.395 0 0 0 0 0 0 0.221 0 0.177
1-0 0 0 0 0 0.542 0 0.290 0 0 0 0 0 0 0 0.168
1-1 0 0 0 0 0 0.509 0 0.283 0 0 0 0 0 0 0.208
1-2 0 0 0 0 0 0.240 0 0 0.317 0 0 0 0.238 0 0.204
2-0 0 0 0 0 0 0 0 0.564 0 0.260 0 0 0 0 0.175
2-1 0 0 0 0 0 0 0 0 0.541 0 0.225 0 0 0 0.234
2-2 0 0 0 0 0 0 0 0 0.283 0 0 0.231 0.246 0 0.241
3-0 0 0 0 0 0 0 0 0 0 0 0.664 0 0 0.298 0.038
3-1 0 0 0 0 0 0 0 0 0 0 0 0.567 0 0.203 0.229
3-2 0 0 0 0 0 0 0 0 0 0 0 0.332 0.242 0.144 0.282
K 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
BB 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
IP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

The left column represents the count before a given pitch is thrown, and the top row represents the count after that pitch has been thrown (IP here stands for a ball in play). The intersection of any row and column is the chance of that particular transition occurring. So, for 2014 Kershaw, there was a 54.6% chance that he would get ahead of a batter 0-1, a 34.4% chance he would fall behind 1-0, and an 11% chance the batter would put the first pitch into play. Since the transition matrix shows the probabilities associated with throwing one pitch, raising the matrix to the second power simulates throwing 2 pitches. Similarly, taking the limit of the matrix’s powers simulates throwing an infinite number of pitches, after which a plate appearance is certain to be over. This is why the limit of Kershaw’s matrix (shown below) has non-zero probabilities only in the last 3 columns; after an infinite number of pitches, a plate appearance will have finally ended in a strikeout, walk, or ball in play.

  0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2 K BB IP
0-0 0 0 0 0 0 0 0 0 0 0 0 0 0.285 0.041 0.674
0-1 0 0 0 0 0 0 0 0 0 0 0 0 0.369 0.023 0.608
0-2 0 0 0 0 0 0 0 0 0 0 0 0 0.530 0.014 0.455
1-0 0 0 0 0 0 0 0 0 0 0 0 0 0.243 0.082 0.675
1-1 0 0 0 0 0 0 0 0 0 0 0 0 0.341 0.046 0.613
1-2 0 0 0 0 0 0 0 0 0 0 0 0 0.505 0.029 0.466
2-0 0 0 0 0 0 0 0 0 0 0 0 0 0.202 0.197 0.602
2-1 0 0 0 0 0 0 0 0 0 0 0 0 0.295 0.111 0.594
2-2 0 0 0 0 0 0 0 0 0 0 0 0 0.459 0.069 0.471
3-0 0 0 0 0 0 0 0 0 0 0 0 0 0.136 0.515 0.349
3-1 0 0 0 0 0 0 0 0 0 0 0 0 0.205 0.326 0.469
3-2 0 0 0 0 0 0 0 0 0 0 0 0 0.362 0.216 0.422
K 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
BB 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
IP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Now, to predict Kershaw’s K% and BB%, we need only look at the top row, since all plate appearances begin with an 0-0 count. Starting from 0-0, we estimate Kershaw has a 28.5% chance to strike out any given batter and a 4.1% chance to walk him. Kershaw in 2014 actually had a 31.9% strikeout rate and a 4.1% walk rate.
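The limit can be approximated by repeatedly squaring the one-pitch matrix. The sketch below uses the non-zero entries of Kershaw’s 2014 transition matrix as printed above; the helper names and the choice of six squarings (64 pitches) are assumptions. Because the printed entries are rounded to three decimals, the result can differ from the limit table in the last digit.

```python
STATES = ["0-0", "0-1", "0-2", "1-0", "1-1", "1-2", "2-0", "2-1", "2-2",
          "3-0", "3-1", "3-2", "K", "BB", "IP"]

# Non-zero entries of Kershaw's 2014 transition matrix, copied from the article.
KERSHAW = {
    "0-0": {"0-1": 0.546, "1-0": 0.344, "IP": 0.110},
    "0-1": {"0-2": 0.471, "1-1": 0.350, "IP": 0.180},
    "0-2": {"0-2": 0.207, "1-2": 0.395, "K": 0.221, "IP": 0.177},
    "1-0": {"1-1": 0.542, "2-0": 0.290, "IP": 0.168},
    "1-1": {"1-2": 0.509, "2-1": 0.283, "IP": 0.208},
    "1-2": {"1-2": 0.240, "2-2": 0.317, "K": 0.238, "IP": 0.204},
    "2-0": {"2-1": 0.564, "3-0": 0.260, "IP": 0.175},
    "2-1": {"2-2": 0.541, "3-1": 0.225, "IP": 0.234},
    "2-2": {"2-2": 0.283, "3-2": 0.231, "K": 0.246, "IP": 0.241},
    "3-0": {"3-1": 0.664, "BB": 0.298, "IP": 0.038},
    "3-1": {"3-2": 0.567, "BB": 0.203, "IP": 0.229},
    "3-2": {"3-2": 0.332, "K": 0.242, "BB": 0.144, "IP": 0.282},
    "K": {"K": 1.0}, "BB": {"BB": 1.0}, "IP": {"IP": 1.0},
}


def to_matrix(sparse):
    """Expand the sparse dict into a full 15x15 list-of-lists matrix."""
    idx = {s: i for i, s in enumerate(STATES)}
    m = [[0.0] * len(STATES) for _ in STATES]
    for src, row in sparse.items():
        for dst, p in row.items():
            m[idx[src]][idx[dst]] = p
    return m


def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limit(p, squarings=6):
    """Approximate the limit of p**n by computing p**(2**squarings)."""
    for _ in range(squarings):
        p = matmul(p, p)
    return p


limit_matrix = limit(to_matrix(KERSHAW))
k_rate, bb_rate, ip_rate = limit_matrix[0][12:15]  # top row: K, BB, IP
print(k_rate, bb_rate, ip_rate)
```

The top-row values should land close to the 28.5% strikeout and 4.1% walk estimates quoted above. Sixty-four pitches is plenty here, since the only way to stay in a count is a two-strike foul.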

This method produces a strong r-squared of .86 when comparing xK% to actual K% for these pitchers. Unfortunately, r-squared drops to .54 when comparing xBB% to actual BB%.
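For anyone who wants to check figures like these, r-squared here can be read as the squared Pearson correlation between expected and actual rates (that reading is an assumption, as is the function name). The five sample pairs are the xK% and actual K% of the first five pitchers in the list below, so they will not reproduce the full-sample .86.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# (xK%, actual 2014 K%) for Hughes, Kershaw, Price, Sale, and Zimmermann.
sample = [(19.1, 21.8), (28.5, 31.9), (25.6, 26.9), (31.3, 30.4), (23.2, 22.8)]
expected = [x for x, _ in sample]
actual = [y for _, y in sample]
print(r_squared(expected, actual))
```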

I then imported the same statistics for batters, because there really is no reason why this method should not work equally well for both pitchers and hitters. It actually seems to work more evenly on batters: r-squared is .81 for batters’ strikeouts (slightly below the pitchers’ .86) and .77 for batters’ walks (well above the pitchers’ .54).

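If there are any players in particular you’re interested in, I have included the full list of all qualified pitchers and position players, with both their expected and actual strikeout and walk rates. Pitchers are listed first (Hughes through Gallardo), followed by position players (starting with McCutchen).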

Player  xK%  2014 K%  xBB%  2014 BB%
Hughes 19.1 21.8 1.8 1.9
Kershaw 28.5 31.9 4.1 4.1
Price 25.6 26.9 3.7 3.8
Sale 31.3 30.4 6.1 5.7
Zimmermann 23.2 22.8 3.3 3.6
Scherzer 28.5 27.9 6.4 7
Bumgarner 24.9 25.1 5.1 4.9
Lackey 22 19.7 3.7 5.6
Kluber 28.1 28.3 4.9 5.4
Strasburg 26.8 27.9 5.6 5
Samardzija 24.4 23 4.8 4.9
Hamels 24.8 23.9 5.6 7.1
McCarthy 20.1 20.9 4.4 3.9
Cueto 23.6 25.2 6.8 6.8
Wood 24.9 24.5 6.3 6.5
Kennedy 24.8 24.5 7.5 8.3
Greinke 25.4 25.2 6.4 5.2
Odorizzi 24.1 24.2 9 8.2
Hutchison 23.1 23.4 7.2 7.6
Teheran 21.4 21 5.2 5.8
Harang 20.8 18.4 6 8.1
Eovaldi 17.9 16.6 4.9 5
Felix 26 27.2 6.8 5
Dickey 21.9 18.9 6.3 8.1
Fat Bartolo 17.4 17.8 4 3.5
Kazmir 20.9 21.1 6.5 6.4
Wainwright 21.4 19.9 5.1 5.6
Wheeler 25.2 23.6 9.8 9.9
Ventura 21.2 20.3 6.9 8.8
Fister 17.8 14.8 4.4 3.6
Chen 17.8 17.6 5.9 4.5
Norris 20 20.2 8.4 7.6
Lester 22.6 24.9 8.1 5.4
Richards 24.9 24.2 8 7.5
Porcello 18.3 15.4 4.6 4.9
Shields 20.8 19.2 6.5 4.7
Lewis 18.8 17.5 5.7 6.3
Simon 18.3 15.5 5.5 6.8
Iwakuma 18.6 21.7 5.5 3
Lynn 20.3 20.9 8.6 8.3
Wood 18.3 18.7 8.4 9.7
Hammel 22.5 22.1 7.7 6.2
Noesi 18.9 16.8 6.3 7.6
Verlander 18.5 17.8 6.8 7.3
Miller 17.9 16.6 6.9 9.6
Young 18.2 15.7 7.5 8.7
Koehler 20.3 19.1 6.6 8.8
Archer 22.9 21 8.1 8.8
Roark 19 17.3 5.9 4.9
Haren 19.1 18.7 7.6 4.6
Peavy 18.4 18.5 7.4 7.4
Ross 25.2 24 9 8.9
Niese 17.9 17.6 5.3 5.7
Tillman 17.5 17.2 7.9 7.6
Cobb 22.6 21.9 8.4 6.9
Danks 19.4 15.1 7.1 8.7
Garza 18.3 18.5 7.4 7.4
Santana 22.1 21.9 7.3 7.7
Quintana 20.4 21.4 9.1 6.3
Alvarez 15.7 14.4 3.9 4.3
Liriano 27.3 25.3 10.8 11.7
Volquez 20.5 17.3 7.1 8.8
Guthrie 16.1 14.4 6.3 5.7
Buchholz 18.7 17.9 7.3 7.3
Gray 20.5 20.4 7.7 8.2
Burnett 21.5 20.3 8.8 10.3
Collmenter 16.4 16 6.9 5.4
Vargas 19.2 16.2 6.9 5.2
Lohse 17.4 17.3 6.9 5.5
de la Rosa 19.2 18.1 10 8.7
Leake 16.7 18.2 6.9 5.5
Vogelsong 17.9 19.4 9.6 7.4
Cosart 18 15 8.4 9.5
Weaver 19.2 19 8.2 7.3
Hudson 16.1 15.2 5.8 4.3
Feldman 15.6 14 8.3 6.5
Kuroda 17.3 17.8 8 4.3
Hernandez 17.4 14.5 9 10.1
Buehrle 14.7 13.9 6.2 5.4
Keuchel 18.5 18.1 8.3 5.9
Peralta 16.5 18.4 9.5 7.3
Elias 19.5 20.6 11.1 9.2
Miley 18.3 21.1 10.3 8.7
Kendrick 15.1 14 7.3 6.6
Wilson 21 19.8 13.1 11.2
Gibson 15.8 14.1 8.5 7.5
Stults 14.5 14.5 8.7 5.9
Gallardo 16.3 17.9 11.1 6.6
McCutchen 19.8 17.7 11.7 13
V-Mart 13.9 6.6 8.3 10.9
Abreu 23.7 21.1 6.5 8.2
Stanton 27.6 26.6 12.9 14.7
Trout 27.9 26.1 12.4 11.8
Bautista 19.8 14.3 12 15.5
Rizzo 23.2 18.8 9.5 11.9
E5 20.3 15.1 9.5 11.4
Brantley 10.9 8.3 8 7.7
Cabrera 17.6 17.1 7.1 8.8
Beltre 16.6 12.1 6.7 9.3
Puig 17.5 19.4 10.4 10.5
Werth 24 18 11.4 13.2
Freeman 18.7 20.5 12.3 12.7
Morneau 11.8 10.9 5.7 6.2
Posey 15.5 11.4 7.5 7.8
Cruz 22 20.6 7 8.1
Kemp 24.8 24.2 7.9 8.7
Ortiz 16.7 15.8 11.1 12.5
Lucroy 18.3 10.8 6.4 10.1
Gomez 19.4 21.9 7.1 7.3
Harrison 17.9 14.7 3.8 4
Upton 27 26.7 8.2 9.4
Altuve 9 7.5 3.3 5.1
Han-Ram 16.2 16.4 9.9 10.9
Duda 25.3 22.7 11.5 11.6
Rendon 17.9 15.2 8.7 8.5
Cano 12.2 10.2 6.6 9.2
Holliday 14.3 15 9.5 11.1
Marte 25.2 24 6.3 6.1
Smith 20 16.7 11.6 13.2
LaRoche 19.8 18.4 12.5 14
Walker 15.2 15.4 9 7.9
Cabrera 13.5 10.8 7.1 6.9
Santana 22.6 18.8 14.2 17.1
Gonzalez 19.3 17 6.1 8.5
Donaldson 19.9 18.7 10.5 10.9
Frazier 22 21.1 8.4 7.9
Fowler 20.8 21.4 13.2 13.1
Seager 18.7 18 9.7 8
Gordon 22.9 19.6 9.9 10.1
Carter 32.4 31.8 9.1 9.8
Peralta 19 17.8 8.7 9.2
Valbuena 24.7 20.7 8.8 11.9
Span 14.3 9.7 5.6 7.5
Calhoun 19.7 19.4 6.3 7.1
Castro 18 17.6 7.3 6.2
Yelich 22.9 20.8 10.9 10.6
Pence 20.8 18.4 8.6 7.3
Jones 20 19.5 5.2 2.8
Gomes 23 23.2 5.6 4.6
Eaton 20.7 15.4 5.3 8
Pujols 14.7 10.2 5.7 6.9
Braun 19.9 19.5 6 7.1
Chisenhall 20.2 18.6 5.2 7.3
Dozier 25.9 18.2 8.6 12.6
Moss 27.8 26.4 9.7 11.6
Blackmon 16.3 14.8 5.7 4.8
Carpenter 25.1 15.7 9.9 13.4
Ozuna 27.8 26.8 6.8 6.7
Adams 19 20.2 5.6 4.6
Hunter 16 15.2 4.6 3.9
Ramirez 13.9 14.1 4.7 4
Dunn 30.9 31.1 14.1 13.9
Zobrist 17.6 12.8 9.4 11.5
Gardner 25.6 21.1 9.6 8.8
Plouffe 19.7 18.7 9.3 9.1
Davis 21.6 22.2 7.6 5.8
Gillaspie 14.9 15.4 6.7 7.1
Byrd 29.4 29 4.3 5.5
Heyward 18 15.1 9.7 10.3
Desmond 27.4 28.2 6.9 7.1
Kendrick 19.9 16.3 5.7 7.1
Ellsbury 14 14.6 7.9 7.7
Cespedes 20.6 19.8 5.4 5.4
Markakis 16.1 11.8 8.1 8.7
Utley 15.8 12.8 8.5 8
Suzuki 15.9 9.1 6.8 6.8
Prado 18.2 14 6.9 4.5
Murphy 13.4 13.4 6.4 6.1
Sandoval 12.1 13.3 4.9 6.1
Mauer 23.5 18.5 9.2 11.6
Choo 26.7 24.8 9.9 11
Reyes 12.7 11.1 5.6 5.8
Granderson 25.3 21.6 10.1 12.1
Aoki 11 8.9 8 7.8
Rollins 21.4 16.4 8.2 10.5
McGehee 16 14.8 8.5 9.7
Kinsler 11.3 10.9 5.7 4
Loney 12.7 12.3 7.2 6.3
Pedroia 19.3 12.3 6 8.4
Solarte 14.6 10.8 7.9 9.9
Teixeira 24.2 21.5 10.3 11.4
Longoria 20.3 19 6.2 8.1
Jones 20.4 21.2 8.9 8.4
Headley 21.9 23 10.9 9.6
Navarro 18 14.6 5.9 6.2
Ramirez 13.2 12.3 4.8 3.7
Crisp 18.3 12.3 8.9 12.3
Freese 24.9 24.3 7.3 7.4
Hosmer 17.4 17 7.7 6.4
Jennings 22.2 19.9 8.3 8.7
Gordon 20.5 16.5 4.5 4.8
Butler 17.3 15.9 5.6 6.8
de Aza 24.6 22.5 6.4 7.4
Crawford 24.8 22.9 7.4 10.5
Rios 18.7 17.9 7.1 4.4
Wright 18.7 19.3 7 7.2
Davis 34.1 33 10 11.4
Aybar 11 9.7 4.7 5.6
Cabrera 16.7 17.5 7.2 8
Montero 19.5 17.3 7.5 10
Castellanos 23.9 24.2 6.7 6.2
Escobar 14.8 13.4 4.6 3.7
Martin 20.5 19.6 5.5 6.7
Howard 30.1 29.3 9.8 10.3
McCann 16.9 14.3 7 5.9
Ackley 19.9 16.6 5.8 5.9
Revere 15.1 7.8 3.9 2.1
Perez 14 14 3.4 3.6
Hardy 24.8 18.3 5 5.1
Viciedo 20.2 21.7 6.5 5.7
Lowrie 13.6 14 7.3 9
Mercer 19.5 16 5.5 6.3
Escobar 10.9 11.3 8.9 8.1
Parra 14.8 17.4 7.3 5.6
Bogaerts 26.3 23.2 6.8 6.6
Jackson 23.4 22 8 7.2
LeMahieu 16.5 18 6 6.1
Castro 27 29.5 8.1 6.6
Andrus 18.5 14 8.5 6.7
Hechavarria 13.2 15 4 4.5
Hill 17.7 17 7.4 5.2
Kipnis 22.3 18 7.6 9
Johnson 26 26 3.9 3.8
Bruce 26.1 27.3 8.5 8.1
Hamilton 20.4 19.1 6 5.6
Brown 15.2 17.8 8.4 6.6
Infante 14.6 11.8 6.2 5.7
Jeter 12.1 13.7 5.5 5.5
Upton 29.3 29.7 7.4 9.8
Simmons 11.1 10.4 5.1 5.6
Segura 15.6 12.6 4.3 5
Craig 21.6 22.4 7.2 6.9
Dominguez 21.9 20.6 5.2 4.8
Cozart 15.3 14.5 5.3 4.6

One advantage of this method over any of the many regression-based estimates using plate discipline stats is that it can be further tailored to each player. The reason for this is that ZONE+, ZSWING+, and OSWING+ are all league-average indexes, and some players’ talents are just not captured by league averages. For example, Dustin Pedroia’s expected strikeout rate (19.3%) is nowhere near his actual strikeout rate (12.3%). Presumably, Pedroia has swing tendencies in certain counts that are markedly different from the average hitter’s. By examining these swing tendencies, it is likely possible to predict Pedroia’s yearly strikeout rates with much greater accuracy, as those tendencies are probably part of his approach at the plate year after year. Still, as preliminary research into this area, I think these results as a whole are very promising.



Owen
Member

Great Job!!!! I love it.

Matt P
Guest

This is interesting. Good work.

Spencer Jones
Member

Is there such a thing as not fat Bartolo??

Peter Jensen
Guest

Captain – Did you miss the part where a Markov chain requires that movement from state to state be independent of the method of achieving that movement? In plain language, for a Markov chain to give acceptable results, what happens at a 0-1 count cannot depend on whether the first pitch was a called strike, a swinging strike, or a foul ball. This is definitely not the case.

Michael
Guest

This is sort of to Peter’s point. Empirically the fair/foul% depends on the count. More foul balls happen with two strikes (choking up). So it didn’t surprise me that the average xK% exceeds the true K% for both batters and pitchers in your data.

But otherwise the implementation of a structural model of walks and strikeouts is well done. It is a major refinement of what I’ve used, a linear approximation of walks and strikeouts using the same variables (O-Swing%, etc.).

Kyle
Guest

This is good stuff. Are you able to take this and simulate at-bats between a pitcher and hitter? If you can post an example, that would be awesome. Keep it up!