Predicting 2012’s Strikeout Improvements
If heteroscedasticity lasts longer than three hours, consult your physician immediately.
“Honey, I think I’ve got heteroscedasticity,” I said to my wife when she walked in the door. As a writer who works at home, I spend the majority of my time locked away in my windowless home office, concocting ways to frighten my dear wife who works all day.
“And it’s ruining my spreadsheets,” I finally added, after she had stood wide-eyed and wordless for a few moments.
On Tuesday, we examined the fantastic and bizarre case of rookie right-hander Jeremy Hellickson, whose high swinging-strike rate has not translated into an equally high strikeout rate (K%). Today, let’s expand the scope of that investigation.
As we see above, the relationship between K% and swinging-strike rates is an interesting one. The above data includes pitchers — both starters and relievers — over the last four years who have pitched at least 150 innings.
I imagine readers with only the thickest of glasses will immediately note the data’s heteroscedasticity: the triangular or conical shape of the data. Heteroscedasticity sounds like an infectious disease, but it is merely a stats word that means the variance of the error term is not constant. As we can see above, there appears to be greater volatility in the higher swinging-strike range than in the low SwStr% range.
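For the curious, here is a minimal sketch of how one might check that fanning-out numerically. It runs on synthetic data, not the pitcher data plotted above, and the helper names (`fit_line`, `residual_spread_by_half`) are made up for illustration. The idea: fit an ordinary least-squares line, then compare how widely the residuals scatter in the lower and upper halves of the x range.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_spread_by_half(xs, ys, slope, intercept):
    """Residual standard deviation at or below, and above, the median x."""
    med = statistics.median(xs)
    low = [y - (slope * x + intercept) for x, y in zip(xs, ys) if x <= med]
    high = [y - (slope * x + intercept) for x, y in zip(xs, ys) if x > med]
    return statistics.pstdev(low), statistics.pstdev(high)


# Synthetic, made-up data: the noise grows with x, which produces a cone shape.
rng = random.Random(0)
xs = [rng.uniform(0.04, 0.14) for _ in range(500)]
ys = [2.2 * x + 0.02 + rng.gauss(0, 0.3 * x) for x in xs]

slope, intercept = fit_line(xs, ys)
low_sd, high_sd = residual_spread_by_half(xs, ys, slope, intercept)
print(f"residual sd, low half:  {low_sd:.4f}")
print(f"residual sd, high half: {high_sd:.4f}")
```

If the second number is clearly larger than the first, the spread is growing with x, which is the cone shape described above. A formal test would be the more careful route; this is only a quick look.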
There are some statistical tricks to solve a problem like ~~Maria~~ heteroscedasticity, but let’s skip all the math and just cheat a little here (please don’t tattle to Tango).
Let’s just look at one of the worst possible regressions (and by “worst,” I mean “least spectacular”). Connecting the two orange dots (the dots of Paul Byrd and Michael Wuertz), we find an equation of y = 2.2619x + 0.0163.
This is essentially the worst-case scenario for pitchers with a given swinging strike rate. Only the big exceptions will perform worse than ol’ Byrdie and Wuertz… Wuertzy…?
So, if we apply this formula to the 2011 season, say 100 IP minimum, we should be able to effectively predict a true talent floor, so to speak, and assemble a cache of players poised for more strikeouts (and thereby a lower FIP).
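As a rough sketch of that step, the snippet below plugs a few SwStr% and K% values from the table further down into the line. The function name `expected_k_rate` is made up for illustration, and the inputs are the rounded values as printed, so results could differ slightly from anything computed off unrounded data.

```python
def expected_k_rate(swstr):
    """Worst-case line from the text: y = 2.2619x + 0.0163 (rates as fractions)."""
    return 2.2619 * swstr + 0.0163


# (name, SwStr%, K%) copied from the table, expressed as fractions.
rows = [
    ("Francisco Liriano", 0.114, 0.190),
    ("Tim Wakefield", 0.089, 0.137),
    ("Jeremy Hellickson", 0.097, 0.151),
]

for name, swstr, k_rate in rows:
    xk = expected_k_rate(swstr)
    # DIFF is the gap between the floor the line suggests and the actual K%.
    print(f"{name}: xK% = {xk:.1%}, DIFF = {xk - k_rate:+.1%}")
```

Both rates go in as fractions (11.4% is 0.114), which matters because the intercept 0.0163 is on that scale too.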
Obviously, some fellas, such as old man Tim Wakefield, will be exceptions. Wakefield ranks among the top of this improvement group, but we all well know that he and his knuckling twirl have long passed their 20% K-rate days.
So do realize, these numbers lack a degree of precision — each player involved may or may not have legit reasons to beat or under-perform their expected K% (and likewise their adjusted FIP). Moreover, the Should FIP or adjusted FIP below was made pretty much by multiplying their total batters faced by their new strikeout rate. Presumably, though, their TBF total would be different if they were indeed striking out more batters.
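To make that adjustment concrete, here is a hedged sketch. It assumes the standard FIP formula, (13×HR + 3×(BB+HBP) − 2×K) / IP plus a league constant, and it uses a made-up pitcher line rather than anyone’s real 2011 totals. The constant of 3.10 is only a placeholder, and the function names are hypothetical.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Standard FIP formula; the constant here is a placeholder assumption."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def should_fip(hr, bb, hbp, tbf, ip, xk_rate, constant=3.10):
    """Recompute FIP with strikeouts replaced by TBF times the expected K rate."""
    new_k = tbf * xk_rate
    return fip(hr, bb, hbp, new_k, ip, constant)


# Hypothetical pitcher: 800 batters faced, 190 IP, 120 K (a 15% K rate).
actual = fip(hr=20, bb=60, hbp=5, k=120, ip=190)
adjusted = should_fip(hr=20, bb=60, hbp=5, tbf=800, ip=190, xk_rate=0.20)
print(f"FIP {actual:.2f} -> Should FIP {adjusted:.2f} ({adjusted - actual:+.2f})")
```

Note that TBF and IP are held fixed, which is exactly the simplification flagged above: a pitcher who really struck out more batters would likely face fewer of them per inning. The league constant cancels out of the difference, so the FIP DIFF does not depend on the placeholder value.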
So caveats all around. The main lesson here: This is not precise, but this should be fun.
NOTE: Because I am using a worst-case formula, I have only included the players who we expect should improve next year. So even those pitchers with 0% expected change may see K% upticks in 2012.
| Name | SwStr% | K% | FIP | xK% | DIFF | Should FIP | FIP DIFF |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Francisco Liriano | 11.4% | 19.0% | 4.54 | 27.4% | 8.4% | 3.80 | -0.74 |
| Tim Wakefield | 8.9% | 13.7% | 4.99 | 21.8% | 8.1% | 4.29 | -0.70 |
| Jeremy Hellickson | 9.7% | 15.1% | 4.44 | 23.6% | 8.5% | 3.75 | -0.69 |
| Chris Narveson | 10.4% | 18.0% | 4.06 | 25.2% | 7.2% | 3.44 | -0.62 |
| Carl Pavano | 7.1% | 10.7% | 4.10 | 17.7% | 7.0% | 3.50 | -0.60 |
| Fausto Carmona | 7.9% | 13.1% | 4.56 | 19.5% | 6.4% | 3.99 | -0.57 |
| Jaime Garcia | 10.5% | 18.9% | 3.23 | 25.4% | 6.5% | 2.68 | -0.55 |
| Jeff Francis | 7.0% | 11.3% | 4.10 | 17.5% | 6.2% | 3.56 | -0.54 |
| Dillon Gee | 9.1% | 16.2% | 4.65 | 22.2% | 6.0% | 4.12 | -0.53 |
| Randy Wells | 8.2% | 14.1% | 5.11 | 20.2% | 6.1% | 4.58 | -0.53 |
| Phil Coke | 8.3% | 14.6% | 3.57 | 20.4% | 5.8% | 3.06 | -0.51 |
| Freddy Garcia | 8.7% | 15.3% | 4.12 | 21.3% | 6.0% | 3.61 | -0.51 |
| John Lannan | 7.6% | 13.1% | 4.28 | 18.8% | 5.7% | 3.78 | -0.50 |
| Hiroki Kuroda | 10.3% | 19.2% | 3.78 | 24.9% | 5.7% | 3.31 | -0.47 |
| Daniel Hudson | 9.9% | 18.4% | 3.28 | 24.0% | 5.6% | 2.81 | -0.47 |
| Shaun Marcum | 10.3% | 19.2% | 3.73 | 24.9% | 5.7% | 3.26 | -0.47 |
| Edwin Jackson | 9.2% | 17.2% | 3.55 | 22.4% | 5.2% | 3.10 | -0.45 |
| Edinson Volquez | 10.9% | 21.3% | 5.29 | 26.3% | 5.0% | 4.84 | -0.45 |
| Josh Tomlin | 7.7% | 13.4% | 4.27 | 19.0% | 5.6% | 3.82 | -0.45 |
| Ricky Nolasco | 8.9% | 16.6% | 3.54 | 21.8% | 5.2% | 3.09 | -0.45 |
| Carlos Carrasco | 8.5% | 15.9% | 4.28 | 20.9% | 5.0% | 3.85 | -0.43 |
| Philip Humber | 9.1% | 17.2% | 3.58 | 22.2% | 5.0% | 3.16 | -0.42 |
| Luke Hochevar | 8.2% | 15.3% | 4.29 | 20.2% | 4.9% | 3.88 | -0.41 |
| Jeff Karstens | 7.7% | 14.4% | 4.29 | 19.0% | 4.6% | 3.91 | -0.38 |
| Chris Capuano | 10.5% | 21.0% | 4.04 | 25.4% | 4.4% | 3.66 | -0.38 |
| Brett Cecil | 8.4% | 16.4% | 5.10 | 20.6% | 4.2% | 4.73 | -0.37 |
| Charlie Morton | 7.4% | 14.3% | 3.77 | 18.4% | 4.1% | 3.41 | -0.36 |
| Guillermo Moscoso | 7.4% | 14.1% | 4.23 | 18.4% | 4.3% | 3.88 | -0.35 |
| John Danks | 9.3% | 18.5% | 3.82 | 22.7% | 4.2% | 3.47 | -0.35 |
| Tom Gorzelanny | 10.5% | 21.3% | 4.19 | 25.4% | 4.1% | 3.84 | -0.35 |
| Roy Oswalt | 8.0% | 15.7% | 3.44 | 19.7% | 4.0% | 3.09 | -0.35 |
| Cole Hamels | 11.3% | 22.8% | 3.05 | 27.2% | 4.4% | 2.71 | -0.34 |
| Jason Vargas | 7.8% | 15.3% | 4.09 | 19.3% | 4.0% | 3.75 | -0.34 |
| R.A. Dickey | 7.8% | 15.3% | 3.77 | 19.3% | 4.0% | 3.44 | -0.33 |
| Jair Jurrjens | 7.4% | 14.4% | 3.99 | 18.4% | 4.0% | 3.66 | -0.33 |
| Ricky Romero | 9.6% | 19.4% | 4.20 | 23.3% | 3.9% | 3.88 | -0.32 |
| Homer Bailey | 9.3% | 18.9% | 4.06 | 22.7% | 3.8% | 3.74 | -0.32 |
| A.J. Burnett | 10.0% | 20.7% | 4.77 | 24.2% | 3.5% | 4.46 | -0.31 |
| Jason Hammel | 6.5% | 12.7% | 4.83 | 16.3% | 3.6% | 4.52 | -0.31 |
| Dan Haren | 9.9% | 20.2% | 2.98 | 24.0% | 3.8% | 2.67 | -0.31 |
| Joel Pineiro | 5.2% | 9.8% | 4.43 | 13.4% | 3.6% | 4.12 | -0.31 |
| Carlos Villanueva | 7.5% | 15.0% | 4.10 | 18.6% | 3.6% | 3.79 | -0.31 |
| Derek Lowe | 8.1% | 16.5% | 3.70 | 20.0% | 3.5% | 3.39 | -0.31 |
| Mark Buehrle | 6.5% | 12.7% | 3.98 | 16.3% | 3.6% | 3.68 | -0.30 |
| Jason Marquis | 6.5% | 13.0% | 4.05 | 16.3% | 3.3% | 3.75 | -0.30 |
| CC Sabathia | 11.2% | 23.4% | 2.88 | 27.0% | 3.6% | 2.58 | -0.30 |
| Matt Garza | 11.2% | 23.5% | 2.95 | 27.0% | 3.5% | 2.65 | -0.30 |
| Alexi Ogando | 8.9% | 18.2% | 3.65 | 21.8% | 3.6% | 3.36 | -0.29 |
| Alfredo Simon | 8.1% | 16.6% | 4.42 | 20.0% | 3.4% | 4.13 | -0.29 |
| Aaron Harang | 8.4% | 17.3% | 4.17 | 20.6% | 3.3% | 3.88 | -0.29 |
| Michael Pineda | 11.8% | 24.9% | 3.42 | 28.3% | 3.4% | 3.14 | -0.28 |
| Chris Volstad | 7.9% | 16.3% | 4.32 | 19.5% | 3.2% | 4.04 | -0.28 |
| Bud Norris | 10.5% | 22.1% | 4.02 | 25.4% | 3.3% | 3.74 | -0.28 |
| Chris Carpenter | 9.2% | 19.2% | 3.06 | 22.4% | 3.2% | 2.79 | -0.27 |
| Josh Collmenter | 7.9% | 16.1% | 3.80 | 19.5% | 3.4% | 3.53 | -0.27 |
| Brian Duensing | 7.8% | 16.2% | 4.27 | 19.3% | 3.1% | 4.00 | -0.27 |
| Jo-Jo Reyes | 6.6% | 13.6% | 4.90 | 16.6% | 3.0% | 4.63 | -0.27 |
| John Lackey | 7.0% | 14.5% | 4.71 | 17.5% | 3.0% | 4.44 | -0.27 |
| Joe Saunders | 6.2% | 12.4% | 4.78 | 15.7% | 3.3% | 4.51 | -0.27 |
| Jake Westbrook | 6.3% | 12.9% | 4.25 | 15.9% | 3.0% | 3.98 | -0.27 |
| Tim Hudson | 8.6% | 17.9% | 3.39 | 21.1% | 3.2% | 3.13 | -0.26 |
| Livan Hernandez | 6.4% | 13.2% | 3.96 | 16.1% | 2.9% | 3.71 | -0.25 |
| Zach Britton | 7.0% | 14.6% | 4.00 | 17.5% | 2.9% | 3.75 | -0.25 |
| Brad Penny | 4.6% | 9.2% | 5.02 | 12.0% | 2.8% | 4.77 | -0.25 |
| Max Scherzer | 9.8% | 20.9% | 4.14 | 23.8% | 2.9% | 3.89 | -0.25 |
| Kevin Correia | 5.7% | 11.7% | 4.85 | 14.5% | 2.8% | 4.61 | -0.24 |
| Johnny Cueto | 7.9% | 16.5% | 3.45 | 19.5% | 3.0% | 3.21 | -0.24 |
| Rick Porcello | 6.3% | 13.3% | 4.06 | 15.9% | 2.6% | 3.83 | -0.23 |
| Ivan Nova | 6.6% | 13.9% | 4.01 | 16.6% | 2.7% | 3.79 | -0.22 |
| Scott Baker | 10.4% | 22.5% | 3.45 | 25.2% | 2.7% | 3.23 | -0.22 |
| Trevor Cahill | 7.6% | 16.3% | 4.10 | 18.8% | 2.5% | 3.88 | -0.22 |
| James Shields | 10.7% | 23.1% | 3.42 | 25.8% | 2.7% | 3.20 | -0.22 |
| Matt Harrison | 7.6% | 16.3% | 3.52 | 18.8% | 2.5% | 3.31 | -0.21 |
| Josh Beckett | 10.5% | 22.8% | 3.57 | 25.4% | 2.6% | 3.37 | -0.20 |
| Matt Cain | 9.1% | 19.7% | 2.91 | 22.2% | 2.5% | 2.71 | -0.20 |
| Mat Latos | 10.6% | 23.2% | 3.16 | 25.6% | 2.4% | 2.96 | -0.20 |
| Roy Halladay | 10.8% | 23.6% | 2.20 | 26.1% | 2.5% | 2.00 | -0.20 |
| Jake Peavy | 9.2% | 20.2% | 3.21 | 22.4% | 2.2% | 3.02 | -0.19 |
| Randy Wolf | 6.8% | 14.8% | 4.29 | 17.0% | 2.2% | 4.11 | -0.18 |
| Bronson Arroyo | 5.8% | 12.6% | 5.71 | 14.7% | 2.1% | 5.53 | -0.18 |
| Jhoulys Chacin | 8.2% | 18.1% | 4.23 | 20.2% | 2.1% | 4.06 | -0.17 |
| Kyle McClellan | 5.7% | 12.5% | 4.92 | 14.5% | 2.0% | 4.75 | -0.17 |
| Mike Leake | 7.7% | 17.0% | 4.22 | 19.0% | 2.0% | 4.05 | -0.17 |
| Mike Pelfrey | 5.5% | 12.2% | 4.47 | 14.1% | 1.9% | 4.30 | -0.17 |
| Bruce Chen | 6.7% | 14.8% | 4.39 | 16.8% | 2.0% | 4.23 | -0.16 |
| Anibal Sanchez | 10.9% | 24.3% | 3.35 | 26.3% | 2.0% | 3.19 | -0.16 |
| Alfredo Aceves | 7.6% | 16.9% | 4.03 | 18.8% | 1.9% | 3.87 | -0.16 |
| Ervin Santana | 8.4% | 18.8% | 4.00 | 20.6% | 1.8% | 3.84 | -0.16 |
| Wade Davis | 5.9% | 13.2% | 4.67 | 15.0% | 1.8% | 4.52 | -0.15 |
| Brad Bergesen | 6.0% | 13.5% | 4.92 | 15.2% | 1.7% | 4.77 | -0.15 |
| Gavin Floyd | 8.4% | 18.9% | 3.81 | 20.6% | 1.7% | 3.67 | -0.14 |
| Anthony Swarzak | 5.5% | 12.5% | 4.04 | 14.1% | 1.6% | 3.90 | -0.14 |
| Brandon Morrow | 11.5% | 26.1% | 3.64 | 27.6% | 1.5% | 3.51 | -0.13 |
| Javier Vazquez | 8.9% | 20.3% | 3.57 | 21.8% | 1.5% | 3.45 | -0.12 |
| James McDonald | 8.2% | 18.8% | 4.68 | 20.2% | 1.4% | 4.56 | -0.12 |
| Tim Lincecum | 10.7% | 24.4% | 3.17 | 25.8% | 1.4% | 3.05 | -0.12 |
| Jeremy Guthrie | 6.3% | 14.6% | 4.48 | 15.9% | 1.3% | 4.37 | -0.11 |
| Nick Blackburn | 4.8% | 11.3% | 4.84 | 12.5% | 1.2% | 4.74 | -0.10 |
| Justin Masterson | 7.5% | 17.4% | 3.28 | 18.6% | 1.2% | 3.18 | -0.10 |
| Brandon McCarthy | 7.7% | 17.8% | 2.86 | 19.0% | 1.2% | 2.76 | -0.10 |
| Felipe Paulino | 9.6% | 22.2% | 3.69 | 23.3% | 1.1% | 3.59 | -0.10 |
| Ted Lilly | 8.5% | 19.8% | 4.21 | 20.9% | 1.1% | 4.12 | -0.09 |
| Ryan Dempster | 9.3% | 21.7% | 3.91 | 22.7% | 1.0% | 3.82 | -0.09 |
| Brett Myers | 7.4% | 17.5% | 4.26 | 18.4% | 0.9% | 4.18 | -0.08 |
| Dustin Moseley | 5.3% | 12.7% | 3.99 | 13.6% | 0.9% | 3.91 | -0.08 |
| Carlos Zambrano | 6.7% | 15.9% | 4.59 | 16.8% | 0.9% | 4.52 | -0.07 |
| Kyle Kendrick | 5.1% | 12.3% | 4.55 | 13.2% | 0.9% | 4.48 | -0.07 |
| Jered Weaver | 9.1% | 21.4% | 3.20 | 22.2% | 0.8% | 3.13 | -0.07 |
| Jordan Zimmermann | 7.9% | 18.7% | 3.16 | 19.5% | 0.8% | 3.10 | -0.06 |
| Danny Duffy | 7.7% | 18.4% | 4.82 | 19.0% | 0.6% | 4.76 | -0.06 |
| Kyle Lohse | 5.9% | 14.3% | 3.67 | 15.0% | 0.7% | 3.62 | -0.05 |
| Jonathan Sanchez | 9.7% | 23.0% | 4.30 | 23.6% | 0.6% | 4.25 | -0.05 |
| Jake Arrieta | 7.4% | 17.8% | 5.34 | 18.4% | 0.6% | 5.29 | -0.05 |
| Chad Billingsley | 7.6% | 18.3% | 3.83 | 18.8% | 0.5% | 3.79 | -0.04 |
| Paul Maholm | 5.7% | 14.1% | 3.78 | 14.5% | 0.4% | 3.75 | -0.03 |
| Travis Wood | 6.7% | 16.4% | 4.06 | 16.8% | 0.4% | 4.03 | -0.03 |
| Tyler Chatwood | 4.6% | 11.7% | 4.89 | 12.0% | 0.3% | 4.86 | -0.03 |
| Gio Gonzalez | 9.5% | 22.8% | 3.64 | 23.1% | 0.3% | 3.61 | -0.03 |
| Wandy Rodriguez | 8.5% | 20.5% | 4.15 | 20.9% | 0.4% | 4.12 | -0.03 |
| Jonathon Niese | 8.2% | 19.9% | 3.36 | 20.2% | 0.3% | 3.33 | -0.03 |
| Derek Holland | 7.9% | 19.2% | 3.94 | 19.5% | 0.3% | 3.92 | -0.02 |
| Doug Fister | 6.7% | 16.7% | 3.02 | 16.8% | 0.1% | 3.01 | -0.01 |
| Colby Lewis | 8.2% | 20.1% | 4.54 | 20.2% | 0.1% | 4.54 | 0.00 |
| Jeff Niemann | 7.4% | 18.4% | 4.13 | 18.4% | 0.0% | 4.13 | 0.00 |
I must credit again Mike Podhorzer who got me thinking about this.


really interesting; i’d love to see this on an xFIP, SIERA or tERA basis, too. not sure how involved all of that is, from a math standpoint.
If you break down 2010 stats, a lot of the same pitchers have high xK’s that didn’t materialize.
I looked at the 92 pitchers, presumably that were innings qualified, and here were the top 10:
Name 2010 diff
Randy Wells -6.92%
Brett Cecil -6.34%
Hiroki Kuroda -6.23%
R.A. Dickey -6.03%
Clay Buchholz -5.99%
Carl Pavano -5.69%
Shaun Marcum -5.68%
Jaime Garcia -5.25%
Edwin Jackson -5.05%
Johan Santana -4.84%
Francisco Liriano -4.78%
sorry. hit enter before I finished.
The “2010 diff” is the difference between 2010 K% and “expected K%” using the y = 2.2619x + 0.0163 formula from the author’s analysis.
Not sure what that means… I can’t think of similarities between Kuroda, Liriano, Pavano, Marcum, Wells, Jaime Garcia, etc.. Just thought I’d throw it out there.
A) I really liked this post. I thought it was well written and funny, and all around put together well.
B) Would a team philosophy of “pitch to contact” make a difference here? I’m only familiar with the Twins program, and they definitely have that strategy. Is it a coincidence that Liriano and Pavano (two pitchers that didn’t come up in the Twins system but have pitched there for 6 and 2 seasons respectively) are near the top? Could anyone else who knows say if the pitchers near the top of the list are from programs with similar philosophies?
I don’t think it’s “pitch to contact” as is so often said, but more “don’t walk people.” But yes, I think it could make a difference.
Also, the philosophy of featuring a change-up as a strikeout pitch could have something to do with it. Obviously Liriano’s best pitch is his slider, but I believe (I haven’t looked this up) he threw more changeups this year.
While the differential in expected and actual K% is interesting in and of itself, it seems to me the cause of the heteroskedasticity is actually the more interesting question. You talk about “fixing it” statistically (which of course you can do), but wouldn’t it be more interesting to predict it? You could run a heteroskedastic regression model, which doesn’t make any assumptions about the error structure and allows you to include predictors of both the outcome (K%) and the variance. I wonder what would predict it? Perhaps pitch selection, like “pitching backwards”? Perhaps some guys throw their highest SwStr% pitch early in the count and one of their lower SwStr% pitches once they get two strikes? Something that could probably be most easily checked with the proper methodology.
Anyhow, interesting stuff.
Yes, this sounds awesome. Someone with more time than me should do this.
I have a feeling that the type of pitch getting the swinging strikes will be a significant variable, with the change-up leading to the higher variance.
Seems to me the biggest factor you missed was BB%. Imagine a pitcher who took every count to 3-2, but had a league average SwStr% rate. He’d be an elite K% pitcher, but he’d also walk a boat load of dudes.
There should be a “nibble” factor for pitchers, which would complete the picture here.
I prefer this method as it differentiates between SwStr out of and in zone.
http://www.draysbay.com/2009/7/21/956509/updated-expected-strikeouts-based
The formula:
K%=(ClStr%*.9)+(Foul%*.5)+(InPly%*-.9)+(InZSwStr%*1.1)+(OZSwStr%*1.5)
Adjusted r^2 of 91.4 is extremely strong. I’ll be updating this for major starters at some point in the offseason as it’s a very nice look at which guys can be expected to take a leap forward next season.
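For anyone who wants to play with the formula above, here is a minimal sketch of it as a function. The coefficients are copied from the comment. The function name, the argument names, and the sample inputs are made up for illustration, and it assumes every input is a fraction measured on the same basis (the comment does not spell out the denominators, so check the linked post before trusting any output).

```python
def expected_k_from_components(cl_str, foul, in_play, in_zone_swstr, out_zone_swstr):
    """K% estimate using the coefficients quoted in the comment above."""
    return (
        cl_str * 0.9            # called strikes
        + foul * 0.5            # foul balls
        + in_play * -0.9        # balls in play count against strikeouts
        + in_zone_swstr * 1.1   # swinging strikes inside the zone
        + out_zone_swstr * 1.5  # swinging strikes outside the zone
    )


# Hypothetical inputs, not any real pitcher's numbers.
estimate = expected_k_from_components(
    cl_str=0.17, foul=0.17, in_play=0.19, in_zone_swstr=0.05, out_zone_swstr=0.04
)
print(f"expected K%: {estimate:.1%}")
```

The heavier weight on out-of-zone swinging strikes is what separates this approach from the single-variable line in the article.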
I assume foul balls count as swinging strikes. Is there data out there for total foul balls and fouls/PA? Just looking at the list, I know Hamels gets a ton of foul balls because he rarely puts anyone away with the heater, using it to set up his change piece, which induces swinging strikes of its own. I’d want to cross check this list to a fouls/PA leader board.
http://statcorner.com/leader.php?type=6&year=2011&leag=MLB&limit=300
Peep this
This is a very interesting graph / list of players.
I look at the upper section of the list as guys who I would bet on improvement for next year… once I factor out all of the guys that throw less than 91-92ish?
The bottom of the list I would expect to not be quite as good as they were this year.. all the way up to -.20 roughly…
Have you been drinking again, Jim?
No way Nolasco on a should improve next year list…
Do the majority of sinkerballers reside in the upper half (excluding ultimate sinkerballer Justin Masterson)?
I see a lot of Cardinals and Twins on here. Both teams which emphasize control. A logical philosophy in pitch selection might lead to pounding the zone with a sinker on a 2 strike count rather than a true ‘strikeout’ pitch, which is more likely to be a ball or fouled off, extending the at-bat.
Just a thought, anyways.
Having used SwStk% a lot in my articles, I could tell you that a lot of the pitchers who appear to be due for a K% spike are below average in getting called strikes. Since a strikeout counts the same whether it is the result of a swinging or called strike, I would think this is important. I know Kuroda specifically has a lower than average called strike rate each season, which likely is the primary cause of his lower than expected K%.
Statcorner.com has the ClStk% (called strike percentage) stat.
2 twins starters in the top 5 I now have reason for hope
Also your wife tolerates this? Does she have a sister…
WOW I just posted this on the forums, and they’re taking the credit
One problem is that this should be plotted on a log vs log scale.
Why? Because a strikeout rate of 10% is 2X better than 5% and 2X worse than 20%. Thus the unit between 5% to 10% should be the same as 10% to 20%. I suspect that may eliminate the heteroscedasticity.
The bigger problem, however, may be the variance in this correlation. A high R^2 indicates a linear relationship. But if the variance is high, the predictability is more problematic, which is why the precision is lacking. By eye, I’m estimating a 1/2 log unit (10X) spread on the data. In other words, with a 10% swinging strikeout rate, the K-rate appears to show a 95% Confidence Interval from 12% to 24%. a 12% K-rate is a whole lot different from a 24% K-rate.
Also, the equation provided for the orange line is clearly not correct. At x=10% (the swinging K-rate), the equation gives y≈24.2% (the K-rate), but the orange line is at around a 15% K-rate with a 10% swinging K-rate.