# Predicting 2012’s Strikeout Improvements

If heteroscedasticity lasts longer than three hours, consult your physician immediately.

“Honey, I think I’ve got heteroscedasticity,” I said to my wife when she walked in the door. As a writer who works at home, I spend the majority of my time locked away in my windowless home office, concocting ways to frighten my dear wife who works all day.

“And it’s ruining my spreadsheets,” I finally added, after she had stood wide-eyed and wordless for a few moments.

On Tuesday, we examined the fantastic and bizarre case of rookie right-hander Jeremy Hellickson, whose high swinging-strike rate has not translated into an equally high strikeout rate (K%). Today, let’s expand the scope of that investigation.

As we see above, the relationship between K% and swinging-strike rates is an interesting one. The above data includes pitchers — both starters and relievers — over the last four years who have pitched at least 150 innings.

I imagine readers with only the thickest of glasses will immediately note the data’s heteroscedasticity, the triangular or conical shape of the data. Heteroscedasticity sounds like an infectious disease, but it is merely a stats word that means the variance of the error term is not constant. As we can see above, there appears to be greater volatility in the higher swinging-strike range than in the low SwStr% range.
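For anyone who wants to see that cone without squinting at a scatter plot, here is a minimal sketch. It uses synthetic data only (not the pitchers above), and the slope, intercept, and noise level in it are made-up assumptions chosen to mimic the shape. It fits an ordinary least-squares line, then compares how widely the residuals spread in the low-SwStr% half versus the high-SwStr% half.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_spread_by_half(xs, ys):
    """Return (low_sd, high_sd): residual SD for the lower and upper halves of x."""
    slope, intercept = fit_line(xs, ys)
    # Pair each x with its residual, then sort by x so we can split in half.
    pairs = sorted((x, y - (slope * x + intercept)) for x, y in zip(xs, ys))
    mid = len(pairs) // 2
    low = [resid for _, resid in pairs[:mid]]
    high = [resid for _, resid in pairs[mid:]]
    return statistics.pstdev(low), statistics.pstdev(high)


# Synthetic, illustrative data: the noise is deliberately proportional to SwStr%.
rng = random.Random(0)
swstr = [rng.uniform(0.04, 0.14) for _ in range(500)]
k_rate = [2.0 * x + 0.02 + rng.gauss(0, 0.3 * x) for x in swstr]

low_sd, high_sd = residual_spread_by_half(swstr, k_rate)
print(f"residual SD, low SwStr% half:  {low_sd:.4f}")
print(f"residual SD, high SwStr% half: {high_sd:.4f}")
```

If the two spreads come out roughly equal, the errors are behaving; if the high half is clearly wider, that is the cone. A formal test would be the grown-up way to do this, but the split-in-half comparison is enough to see the idea.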

There are some statistical tricks to solve a problem like ~~Maria~~ heteroscedasticity, but let’s skip all the math and just cheat a little here (please don’t tattle to Tango).

Let’s just look at one of the worst possible regressions (and by “worst,” I mean “least spectacular”). Connecting the two orange dots (the dots of Paul Byrd and Michael Wuertz), we find an equation of y = 2.2619x + 0.0163.

This is essentially the worst-case scenario for pitchers with a given swinging strike rate. Only the big exceptions will perform worse than ol’ Byrdie and Wuertz… Wuertzy…?

So, if we apply this formula to the 2011 season, say 100 IP minimum, we should be able to effectively predict a true talent floor, so to speak, and assemble a cache of players poised for more strikeouts (and thereby a lower FIP).
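For the spreadsheet-averse, applying the formula looks roughly like the sketch below. The slope and intercept come straight from the equation above, and the three SwStr% and K% pairs are copied from the table further down; the function names are just mine for illustration.

```python
SLOPE = 2.2619      # from y = 2.2619x + 0.0163
INTERCEPT = 0.0163  # from y = 2.2619x + 0.0163


def expected_k_rate(swstr):
    """Expected K% (as a fraction) for a given swinging-strike rate (as a fraction)."""
    return SLOPE * swstr + INTERCEPT


def k_rate_gap(swstr, k_rate):
    """Expected K% minus actual K%; positive means room for more strikeouts."""
    return expected_k_rate(swstr) - k_rate


# (name, SwStr%, K%) taken from the table in this post
pitchers = [
    ("Francisco Liriano", 0.114, 0.190),
    ("Jeremy Hellickson", 0.097, 0.151),
    ("Jeff Niemann", 0.074, 0.184),
]

for name, swstr, k_rate in pitchers:
    xk = expected_k_rate(swstr)
    gap = k_rate_gap(swstr, k_rate)
    print(f"{name}: xK% = {xk:.1%}, DIFF = {gap:.1%}")
```

Because the inputs here are the rounded percentages from the table, the DIFF values can be off by a tenth of a point from the ones listed.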

Obviously, some fellas, such as old man Tim Wakefield, will be exceptions. Wakefield ranks among the top of this improvement group, but we all well know that he and his knuckling twirl have long passed their 20% K-rate days.

So do realize, these numbers lack a degree of precision — each player involved may or may not have legit reasons to beat or under-perform their expected K% (and likewise their adjusted FIP). Moreover, the Should FIP or adjusted FIP below was made pretty much by multiplying their total batters faced by their new strikeout rate. Presumably, though, their TBF total would be different if they were indeed striking out more batters.
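Here is a rough sketch of that Should FIP shortcut as I read it. It leans on the standard FIP formula, (13×HR + 3×(BB+HBP) − 2×K) / IP plus a league constant, in which only the strikeout term changes. The batters-faced and innings figures in the example are hypothetical round numbers for illustration, not anyone's actual totals.

```python
def adjusted_fip(fip, k_rate, expected_k_rate, tbf, innings):
    """Shift FIP by the extra strikeouts implied by the expected K%.

    Only the -2*K term of FIP changes; home runs, walks, batters faced,
    and innings are all held fixed, which is the simplification noted above.
    """
    extra_strikeouts = tbf * (expected_k_rate - k_rate)
    return fip - 2.0 * extra_strikeouts / innings


# Hypothetical workload (600 batters faced, 135 innings), NOT real season totals.
should_fip = adjusted_fip(
    fip=4.54, k_rate=0.190, expected_k_rate=0.274, tbf=600, innings=135.0
)
print(f"Should FIP: {should_fip:.2f}")
```

Holding batters faced and innings constant is exactly the weak spot: a pitcher who strikes out more hitters would record outs faster and face a different number of batters.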

So caveats all around. The main lesson here: This is not precise, but this should be fun.

NOTE: Because I am using a worst-case formula, I have only included the players who we expect should improve next year. So even those pitchers with 0% expected change may see K% upticks in 2012.

| Name | SwStr% | K% | FIP | xK% | DIFF | Should FIP | FIP DIFF |
|---|---|---|---|---|---|---|---|
| Francisco Liriano | 11.4% | 19.0% | 4.54 | 27.4% | 8.4% | 3.80 | -0.74 |
| Tim Wakefield | 8.9% | 13.7% | 4.99 | 21.8% | 8.1% | 4.29 | -0.70 |
| Jeremy Hellickson | 9.7% | 15.1% | 4.44 | 23.6% | 8.5% | 3.75 | -0.69 |
| Chris Narveson | 10.4% | 18.0% | 4.06 | 25.2% | 7.2% | 3.44 | -0.62 |
| Carl Pavano | 7.1% | 10.7% | 4.10 | 17.7% | 7.0% | 3.50 | -0.60 |
| Fausto Carmona | 7.9% | 13.1% | 4.56 | 19.5% | 6.4% | 3.99 | -0.57 |
| Jaime Garcia | 10.5% | 18.9% | 3.23 | 25.4% | 6.5% | 2.68 | -0.55 |
| Jeff Francis | 7.0% | 11.3% | 4.10 | 17.5% | 6.2% | 3.56 | -0.54 |
| Dillon Gee | 9.1% | 16.2% | 4.65 | 22.2% | 6.0% | 4.12 | -0.53 |
| Randy Wells | 8.2% | 14.1% | 5.11 | 20.2% | 6.1% | 4.58 | -0.53 |
| Phil Coke | 8.3% | 14.6% | 3.57 | 20.4% | 5.8% | 3.06 | -0.51 |
| Freddy Garcia | 8.7% | 15.3% | 4.12 | 21.3% | 6.0% | 3.61 | -0.51 |
| John Lannan | 7.6% | 13.1% | 4.28 | 18.8% | 5.7% | 3.78 | -0.50 |
| Hiroki Kuroda | 10.3% | 19.2% | 3.78 | 24.9% | 5.7% | 3.31 | -0.47 |
| Daniel Hudson | 9.9% | 18.4% | 3.28 | 24.0% | 5.6% | 2.81 | -0.47 |
| Shaun Marcum | 10.3% | 19.2% | 3.73 | 24.9% | 5.7% | 3.26 | -0.47 |
| Edwin Jackson | 9.2% | 17.2% | 3.55 | 22.4% | 5.2% | 3.10 | -0.45 |
| Edinson Volquez | 10.9% | 21.3% | 5.29 | 26.3% | 5.0% | 4.84 | -0.45 |
| Josh Tomlin | 7.7% | 13.4% | 4.27 | 19.0% | 5.6% | 3.82 | -0.45 |
| Ricky Nolasco | 8.9% | 16.6% | 3.54 | 21.8% | 5.2% | 3.09 | -0.45 |
| Carlos Carrasco | 8.5% | 15.9% | 4.28 | 20.9% | 5.0% | 3.85 | -0.43 |
| Philip Humber | 9.1% | 17.2% | 3.58 | 22.2% | 5.0% | 3.16 | -0.42 |
| Luke Hochevar | 8.2% | 15.3% | 4.29 | 20.2% | 4.9% | 3.88 | -0.41 |
| Jeff Karstens | 7.7% | 14.4% | 4.29 | 19.0% | 4.6% | 3.91 | -0.38 |
| Chris Capuano | 10.5% | 21.0% | 4.04 | 25.4% | 4.4% | 3.66 | -0.38 |
| Brett Cecil | 8.4% | 16.4% | 5.10 | 20.6% | 4.2% | 4.73 | -0.37 |
| Charlie Morton | 7.4% | 14.3% | 3.77 | 18.4% | 4.1% | 3.41 | -0.36 |
| Guillermo Moscoso | 7.4% | 14.1% | 4.23 | 18.4% | 4.3% | 3.88 | -0.35 |
| John Danks | 9.3% | 18.5% | 3.82 | 22.7% | 4.2% | 3.47 | -0.35 |
| Tom Gorzelanny | 10.5% | 21.3% | 4.19 | 25.4% | 4.1% | 3.84 | -0.35 |
| Roy Oswalt | 8.0% | 15.7% | 3.44 | 19.7% | 4.0% | 3.09 | -0.35 |
| Cole Hamels | 11.3% | 22.8% | 3.05 | 27.2% | 4.4% | 2.71 | -0.34 |
| Jason Vargas | 7.8% | 15.3% | 4.09 | 19.3% | 4.0% | 3.75 | -0.34 |
| R.A. Dickey | 7.8% | 15.3% | 3.77 | 19.3% | 4.0% | 3.44 | -0.33 |
| Jair Jurrjens | 7.4% | 14.4% | 3.99 | 18.4% | 4.0% | 3.66 | -0.33 |
| Ricky Romero | 9.6% | 19.4% | 4.20 | 23.3% | 3.9% | 3.88 | -0.32 |
| Homer Bailey | 9.3% | 18.9% | 4.06 | 22.7% | 3.8% | 3.74 | -0.32 |
| A.J. Burnett | 10.0% | 20.7% | 4.77 | 24.2% | 3.5% | 4.46 | -0.31 |
| Jason Hammel | 6.5% | 12.7% | 4.83 | 16.3% | 3.6% | 4.52 | -0.31 |
| Dan Haren | 9.9% | 20.2% | 2.98 | 24.0% | 3.8% | 2.67 | -0.31 |
| Joel Pineiro | 5.2% | 9.8% | 4.43 | 13.4% | 3.6% | 4.12 | -0.31 |
| Carlos Villanueva | 7.5% | 15.0% | 4.10 | 18.6% | 3.6% | 3.79 | -0.31 |
| Derek Lowe | 8.1% | 16.5% | 3.70 | 20.0% | 3.5% | 3.39 | -0.31 |
| Mark Buehrle | 6.5% | 12.7% | 3.98 | 16.3% | 3.6% | 3.68 | -0.30 |
| Jason Marquis | 6.5% | 13.0% | 4.05 | 16.3% | 3.3% | 3.75 | -0.30 |
| CC Sabathia | 11.2% | 23.4% | 2.88 | 27.0% | 3.6% | 2.58 | -0.30 |
| Matt Garza | 11.2% | 23.5% | 2.95 | 27.0% | 3.5% | 2.65 | -0.30 |
| Alexi Ogando | 8.9% | 18.2% | 3.65 | 21.8% | 3.6% | 3.36 | -0.29 |
| Alfredo Simon | 8.1% | 16.6% | 4.42 | 20.0% | 3.4% | 4.13 | -0.29 |
| Aaron Harang | 8.4% | 17.3% | 4.17 | 20.6% | 3.3% | 3.88 | -0.29 |
| Michael Pineda | 11.8% | 24.9% | 3.42 | 28.3% | 3.4% | 3.14 | -0.28 |
| Chris Volstad | 7.9% | 16.3% | 4.32 | 19.5% | 3.2% | 4.04 | -0.28 |
| Bud Norris | 10.5% | 22.1% | 4.02 | 25.4% | 3.3% | 3.74 | -0.28 |
| Chris Carpenter | 9.2% | 19.2% | 3.06 | 22.4% | 3.2% | 2.79 | -0.27 |
| Josh Collmenter | 7.9% | 16.1% | 3.80 | 19.5% | 3.4% | 3.53 | -0.27 |
| Brian Duensing | 7.8% | 16.2% | 4.27 | 19.3% | 3.1% | 4.00 | -0.27 |
| Jo-Jo Reyes | 6.6% | 13.6% | 4.90 | 16.6% | 3.0% | 4.63 | -0.27 |
| John Lackey | 7.0% | 14.5% | 4.71 | 17.5% | 3.0% | 4.44 | -0.27 |
| Joe Saunders | 6.2% | 12.4% | 4.78 | 15.7% | 3.3% | 4.51 | -0.27 |
| Jake Westbrook | 6.3% | 12.9% | 4.25 | 15.9% | 3.0% | 3.98 | -0.27 |
| Tim Hudson | 8.6% | 17.9% | 3.39 | 21.1% | 3.2% | 3.13 | -0.26 |
| Livan Hernandez | 6.4% | 13.2% | 3.96 | 16.1% | 2.9% | 3.71 | -0.25 |
| Zach Britton | 7.0% | 14.6% | 4.00 | 17.5% | 2.9% | 3.75 | -0.25 |
| Brad Penny | 4.6% | 9.2% | 5.02 | 12.0% | 2.8% | 4.77 | -0.25 |
| Max Scherzer | 9.8% | 20.9% | 4.14 | 23.8% | 2.9% | 3.89 | -0.25 |
| Kevin Correia | 5.7% | 11.7% | 4.85 | 14.5% | 2.8% | 4.61 | -0.24 |
| Johnny Cueto | 7.9% | 16.5% | 3.45 | 19.5% | 3.0% | 3.21 | -0.24 |
| Rick Porcello | 6.3% | 13.3% | 4.06 | 15.9% | 2.6% | 3.83 | -0.23 |
| Ivan Nova | 6.6% | 13.9% | 4.01 | 16.6% | 2.7% | 3.79 | -0.22 |
| Scott Baker | 10.4% | 22.5% | 3.45 | 25.2% | 2.7% | 3.23 | -0.22 |
| Trevor Cahill | 7.6% | 16.3% | 4.10 | 18.8% | 2.5% | 3.88 | -0.22 |
| James Shields | 10.7% | 23.1% | 3.42 | 25.8% | 2.7% | 3.20 | -0.22 |
| Matt Harrison | 7.6% | 16.3% | 3.52 | 18.8% | 2.5% | 3.31 | -0.21 |
| Josh Beckett | 10.5% | 22.8% | 3.57 | 25.4% | 2.6% | 3.37 | -0.20 |
| Matt Cain | 9.1% | 19.7% | 2.91 | 22.2% | 2.5% | 2.71 | -0.20 |
| Mat Latos | 10.6% | 23.2% | 3.16 | 25.6% | 2.4% | 2.96 | -0.20 |
| Roy Halladay | 10.8% | 23.6% | 2.20 | 26.1% | 2.5% | 2.00 | -0.20 |
| Jake Peavy | 9.2% | 20.2% | 3.21 | 22.4% | 2.2% | 3.02 | -0.19 |
| Randy Wolf | 6.8% | 14.8% | 4.29 | 17.0% | 2.2% | 4.11 | -0.18 |
| Bronson Arroyo | 5.8% | 12.6% | 5.71 | 14.7% | 2.1% | 5.53 | -0.18 |
| Jhoulys Chacin | 8.2% | 18.1% | 4.23 | 20.2% | 2.1% | 4.06 | -0.17 |
| Kyle McClellan | 5.7% | 12.5% | 4.92 | 14.5% | 2.0% | 4.75 | -0.17 |
| Mike Leake | 7.7% | 17.0% | 4.22 | 19.0% | 2.0% | 4.05 | -0.17 |
| Mike Pelfrey | 5.5% | 12.2% | 4.47 | 14.1% | 1.9% | 4.30 | -0.17 |
| Bruce Chen | 6.7% | 14.8% | 4.39 | 16.8% | 2.0% | 4.23 | -0.16 |
| Anibal Sanchez | 10.9% | 24.3% | 3.35 | 26.3% | 2.0% | 3.19 | -0.16 |
| Alfredo Aceves | 7.6% | 16.9% | 4.03 | 18.8% | 1.9% | 3.87 | -0.16 |
| Ervin Santana | 8.4% | 18.8% | 4.00 | 20.6% | 1.8% | 3.84 | -0.16 |
| Wade Davis | 5.9% | 13.2% | 4.67 | 15.0% | 1.8% | 4.52 | -0.15 |
| Brad Bergesen | 6.0% | 13.5% | 4.92 | 15.2% | 1.7% | 4.77 | -0.15 |
| Gavin Floyd | 8.4% | 18.9% | 3.81 | 20.6% | 1.7% | 3.67 | -0.14 |
| Anthony Swarzak | 5.5% | 12.5% | 4.04 | 14.1% | 1.6% | 3.90 | -0.14 |
| Brandon Morrow | 11.5% | 26.1% | 3.64 | 27.6% | 1.5% | 3.51 | -0.13 |
| Javier Vazquez | 8.9% | 20.3% | 3.57 | 21.8% | 1.5% | 3.45 | -0.12 |
| James McDonald | 8.2% | 18.8% | 4.68 | 20.2% | 1.4% | 4.56 | -0.12 |
| Tim Lincecum | 10.7% | 24.4% | 3.17 | 25.8% | 1.4% | 3.05 | -0.12 |
| Jeremy Guthrie | 6.3% | 14.6% | 4.48 | 15.9% | 1.3% | 4.37 | -0.11 |
| Nick Blackburn | 4.8% | 11.3% | 4.84 | 12.5% | 1.2% | 4.74 | -0.10 |
| Justin Masterson | 7.5% | 17.4% | 3.28 | 18.6% | 1.2% | 3.18 | -0.10 |
| Brandon McCarthy | 7.7% | 17.8% | 2.86 | 19.0% | 1.2% | 2.76 | -0.10 |
| Felipe Paulino | 9.6% | 22.2% | 3.69 | 23.3% | 1.1% | 3.59 | -0.10 |
| Ted Lilly | 8.5% | 19.8% | 4.21 | 20.9% | 1.1% | 4.12 | -0.09 |
| Ryan Dempster | 9.3% | 21.7% | 3.91 | 22.7% | 1.0% | 3.82 | -0.09 |
| Brett Myers | 7.4% | 17.5% | 4.26 | 18.4% | 0.9% | 4.18 | -0.08 |
| Dustin Moseley | 5.3% | 12.7% | 3.99 | 13.6% | 0.9% | 3.91 | -0.08 |
| Carlos Zambrano | 6.7% | 15.9% | 4.59 | 16.8% | 0.9% | 4.52 | -0.07 |
| Kyle Kendrick | 5.1% | 12.3% | 4.55 | 13.2% | 0.9% | 4.48 | -0.07 |
| Jered Weaver | 9.1% | 21.4% | 3.20 | 22.2% | 0.8% | 3.13 | -0.07 |
| Jordan Zimmermann | 7.9% | 18.7% | 3.16 | 19.5% | 0.8% | 3.10 | -0.06 |
| Danny Duffy | 7.7% | 18.4% | 4.82 | 19.0% | 0.6% | 4.76 | -0.06 |
| Kyle Lohse | 5.9% | 14.3% | 3.67 | 15.0% | 0.7% | 3.62 | -0.05 |
| Jonathan Sanchez | 9.7% | 23.0% | 4.30 | 23.6% | 0.6% | 4.25 | -0.05 |
| Jake Arrieta | 7.4% | 17.8% | 5.34 | 18.4% | 0.6% | 5.29 | -0.05 |
| Chad Billingsley | 7.6% | 18.3% | 3.83 | 18.8% | 0.5% | 3.79 | -0.04 |
| Paul Maholm | 5.7% | 14.1% | 3.78 | 14.5% | 0.4% | 3.75 | -0.03 |
| Travis Wood | 6.7% | 16.4% | 4.06 | 16.8% | 0.4% | 4.03 | -0.03 |
| Tyler Chatwood | 4.6% | 11.7% | 4.89 | 12.0% | 0.3% | 4.86 | -0.03 |
| Gio Gonzalez | 9.5% | 22.8% | 3.64 | 23.1% | 0.3% | 3.61 | -0.03 |
| Wandy Rodriguez | 8.5% | 20.5% | 4.15 | 20.9% | 0.4% | 4.12 | -0.03 |
| Jonathon Niese | 8.2% | 19.9% | 3.36 | 20.2% | 0.3% | 3.33 | -0.03 |
| Derek Holland | 7.9% | 19.2% | 3.94 | 19.5% | 0.3% | 3.92 | -0.02 |
| Doug Fister | 6.7% | 16.7% | 3.02 | 16.8% | 0.1% | 3.01 | -0.01 |
| Colby Lewis | 8.2% | 20.1% | 4.54 | 20.2% | 0.1% | 4.54 | 0.00 |
| Jeff Niemann | 7.4% | 18.4% | 4.13 | 18.4% | 0.0% | 4.13 | 0.00 |


Guest
mike
4 years 8 months ago

really interesting; i’d love to see this on an xFIP, SIERA or tERA basis, too. not sure how involved all of that is, from a math standpoint.

Guest
Brian
4 years 8 months ago

If you break down 2010 stats, a lot of the same pitchers have high xK’s that didn’t materialize.

I looked at the 92 pitchers, presumably that were innings qualified, and here were the top 10:

Name 2010 diff
Randy Wells -6.92%
Brett Cecil -6.34%
Hiroki Kuroda -6.23%
R.A. Dickey -6.03%
Clay Buchholz -5.99%
Carl Pavano -5.69%
Shaun Marcum -5.68%
Jaime Garcia -5.25%
Edwin Jackson -5.05%
Johan Santana -4.84%
Francisco Liriano -4.78%

Guest
Brian
4 years 8 months ago

sorry. hit enter before I finished.

The “2010 diff” is the difference between 2010 K% and “expected K%” using the y = 2.2619x + 0.0163 formula from the author’s analysis.

Not sure what that means… I can’t think of similarities between Kuroda, Liriano, Pavano, Marcum, Wells, Jaime Garcia, etc. Just thought I’d throw it out there.

Member
TFINY
4 years 8 months ago

A) I really liked this post. I thought it was well written and funny, and all around put together well.

B) Would a team philosophy of “pitch to contact” make a difference here? I’m only familiar with the Twins program, and they definitely have that strategy. Is it a coincidence that Liriano and Pavano (two pitchers that didn’t come up in the Twins system but have pitched there for 6 and 2 seasons respectively) are near the top? Could anyone else who knows say if the pitchers near the top of the list are from programs with similar philosophies?

Guest
Steven Ellingson
4 years 8 months ago

I don’t think it’s “pitch to contact” as is so often said, but more “don’t walk people.” But yes, I think it could make a difference.

Also, the philosophy of featuring a change-up as a strikeout pitch could have something to do with it. Obviously Liriano’s best pitch is his slider, but I believe (I haven’t looked this up) he threw more changeups this year.

Guest
Jeff
4 years 8 months ago

While the differential in expected and actual K% is interesting in and of itself, it seems to me the cause of the heteroskedasticity is actually the more interesting question. You talk about “fixing it” statistically (which of course you can do), but wouldn’t it be more interesting to predict it? You could run a heteroskedastic regression model, which doesn’t make any assumptions about the error structure and allows you to include predictors of both the outcome (K%) and the variance. I wonder what would predict it? Perhaps pitch selection, like “pitching backwards”? Perhaps some guys throw their highest SwStr% pitch early in the count and one of their lower SwStr% pitches once they get two strikes? Something that could probably be most easily checked with the proper methodology.

Anyhow, interesting stuff.

Guest
Steven Ellingson
4 years 8 months ago

Yes, this sounds awesome. Someone with more time than me should do this.

I have a feeling that the type of pitch getting the swinging strikes will be a significant variable, with the change-up leading to the higher variance.

Guest
Slartibartfast
4 years 8 months ago

Seems to me the biggest factor you missed was BB%. Imagine a pitcher who took every count to 3-2, but had a league average SwStr% rate. He’d be an elite K% pitcher, but he’d also walk a boat load of dudes.

There should be a “nibble” factor for pitchers, which would complete the picture here.

Member
Sandy Kazmir
4 years 8 months ago

I prefer this method as it differentiates between SwStr out of and in zone.

http://www.draysbay.com/2009/7/21/956509/updated-expected-strikeouts-based

The formula:
K%=(ClStr%*.9)+(Foul%*.5)+(InPly%*-.9)+(InZSwStr%*1.1)+(OZSwStr%*1.5)

Adjusted r^2 of 91.4 is extremely strong. I’ll be updating this for major starters at some point in the offseason as it’s a very nice look at which guys can be expected to take a leap forward next season.

Guest
DD
4 years 8 months ago

I assume foul balls count as swinging strikes. Is there data out there for total foul balls and fouls/PA? Just looking at the list, I know Hamels gets a ton of foul balls because he rarely puts anyone away with the heater, using it to set up his change piece, which induces swinging strikes of its own. I’d want to cross check this list to a fouls/PA leader board.

Guest
Jim Lahey
4 years 8 months ago

This is a very interesting graph / list of players.

I look at the upper section of the list as guys who I would bet on improvement for next year… once I factor out all of the guys that throw less than 91-92ish?

The bottom of the list I would expect to not be quite as good as they were this year.. all the way up to -.20 roughly…

Guest
eric
4 years 7 months ago

Have you been drinking again, Jim?

Guest
4 years 8 months ago

No way Nolasco on a should improve next year list…

Member
4 years 8 months ago

Do the majority of sinkerballers reside in the upper half (excluding ultimate sinkerballer Justin Masterson)?

Guest
Voxx
4 years 8 months ago

I see a lot of Cardinals and Twins on here. Both teams which emphasize control. A logical philosophy in pitch selection might lead to pounding the zone with a sinker on a 2 strike count rather than a true ‘strikeout’ pitch, which is more likely to be a ball or fouled off, extending the at-bat.

Just a thought, anyways.

Member
Member
4 years 8 months ago

Having used SwStk% a lot in my articles, I could tell you that a lot of the pitchers who appear to be due for a K% spike are below average in getting called strikes. Since a strikeout counts the same whether it is the result of a swinging or called strike, I would think this is important. I know Kuroda specifically has a lower than average called strike rate each season, which likely is the primary cause of his lower than expected K%.

Statcorner.com has the ClStk% (called strike percentage) stat.

Member
4 years 8 months ago

2 twins starters in the top 5 I now have reason for hope

Also your wife tolerates this? Does she have a sister…

Guest
Carlcrawfordisawesome
4 years 8 months ago

WOW I just posted this on the forums, and they’re taking the credit

Guest
Matt Lee
4 years 3 months ago

One problem is that this should be plotted on a log vs log scale.
Why? Because a strikeout rate of 10% is 2X worse than 5% and 2X better than 20%. Thus the unit between 5% to 10% should be the same as 10% to 20%. I suspect that may eliminate the heteroscedasticity.

The bigger problem, however, may be the variance in this correlation. A high R^2 indicates a linear relationship. But if the variance is high, the predictability is more problematic, which is why the precision is lacking. By eye, I’m estimating a 1/2 log unit (10X) spread on the data. In other words, with a 10% swinging-strike rate, the K-rate appears to show a 95% Confidence Interval from 12% to 24%. A 12% K-rate is a whole lot different from a 24% K-rate.

Guest
Matt Lee
4 years 3 months ago

Also, the equation provided for the orange line is clearly not correct. At x = 10% (the swinging-strike rate), the equation gives y ≈ 24.2% (the K-rate), but the orange line is at around a 15% K-rate with a 10% swinging-strike rate.