Predicting 2012’s Strikeout Improvements

If heteroscedasticity lasts longer than three hours, consult your physician immediately.

“Honey, I think I’ve got heteroscedasticity,” I said to my wife when she walked in the door. As a writer who works at home, I spend the majority of my time locked away in my windowless home office, concocting ways to frighten my dear wife who works all day.

“And it’s ruining my spreadsheets,” I finally added, after she had stood wide-eyed and wordless for a few moments.

On Tuesday, we examined the fantastic and bizarre case of rookie right-hander Jeremy Hellickson, whose high swinging-strike rate has not translated into an equally high strikeout rate (K%). Today, let’s expand the scope of that investigation.

As we see above, the relationship between K% and swinging-strike rates is an interesting one. The above data includes pitchers — both starters and relievers — over the last four years who have pitched at least 150 innings.

I imagine readers with only the thickest of glasses will immediately note the data’s heteroscedasticity: the triangular or conical shape of the data. Heteroscedasticity sounds like an infectious disease, but it is merely a stats word meaning that the variance of the error term is not constant. As we can see above, there appears to be greater volatility in the high swinging-strike range than in the low SwStr% range.
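For the curious, here is a rough sketch of one way to check for that fanning-out shape: fit an ordinary least-squares line, then compare how widely the residuals scatter in the high-SwStr% half versus the low-SwStr% half. The data below are synthetic and exist only to illustrate the idea; the slope of 2.0, the noise level, and the function names `fit_line` and `spread_ratio` are assumptions, not anything taken from the chart.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def spread_ratio(xs, ys):
    """Residual variance in the high-x half divided by the low-x half."""
    slope, intercept = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))  # order the points by x
    residuals = [y - (slope * x + intercept) for x, y in pairs]
    half = len(residuals) // 2
    low_var = statistics.pvariance(residuals[:half])
    high_var = statistics.pvariance(residuals[half:])
    return high_var / low_var


# Synthetic, made-up data: the noise grows with x, which is exactly
# what heteroscedasticity means. These are NOT real pitcher seasons.
rng = random.Random(2011)
swstr = [rng.uniform(0.04, 0.14) for _ in range(400)]
k_rate = [2.0 * x + rng.gauss(0.0, 0.3 * x) for x in swstr]

ratio = spread_ratio(swstr, k_rate)
print(f"high/low residual variance ratio: {ratio:.2f}")
```

A ratio well above 1 says the scatter widens as SwStr% climbs. Formal versions of this comparison exist (the Goldfeld-Quandt test is one), but the split-in-half version is enough to see the cone.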

There are some statistical tricks to solve a problem like Maria (er, heteroscedasticity), but let’s skip all the math and just cheat a little here (please don’t tattle to Tango).

Let’s just look at one of the worst possible regressions (and by “worst,” I mean “least spectacular”). Connecting the two orange dots (the dots of Paul Byrd and Michael Wuertz), we find an equation of y = 2.2619x + 0.0163, where x is the swinging-strike rate and y is K%, both expressed as decimals.
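Drawing a line through two dots is just slope-and-intercept arithmetic. A minimal sketch follows; since the exact coordinates of the Byrd and Wuertz dots are not listed here, the two points below are placeholders chosen to sit on the stated line, and `line_through` is a hypothetical helper name.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Placeholder points (NOT the real Byrd/Wuertz numbers): a low-SwStr%
# dot and a high-SwStr% dot that both lie on y = 2.2619x + 0.0163.
low_dot = (0.05, 2.2619 * 0.05 + 0.0163)
high_dot = (0.15, 2.2619 * 0.15 + 0.0163)

slope, intercept = line_through(low_dot, high_dot)
print(f"y = {slope:.4f}x + {intercept:.4f}")
```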

This is essentially the worst-case scenario for pitchers with a given swinging strike rate. Only the big exceptions will perform worse than ol’ Byrdie and Wuertz… Wuertzy…?

So, if we apply this formula to the 2011 season, say 100 IP minimum, we should be able to estimate a true-talent floor, so to speak, and assemble a cache of players poised for more strikeouts (and thereby a lower FIP).
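Applying the formula is a one-liner. The sketch below plugs in the SwStr% and K% figures for three pitchers from the table in this post; the function name `expected_k_rate` is just a label for illustration.

```python
SLOPE = 2.2619
INTERCEPT = 0.0163


def expected_k_rate(swstr):
    """Worst-case expected K% for a given swinging-strike rate (decimals)."""
    return SLOPE * swstr + INTERCEPT


# (SwStr%, actual K%) as decimals, taken from the table in this post.
pitchers = {
    "Francisco Liriano": (0.114, 0.190),
    "Jeremy Hellickson": (0.097, 0.151),
    "Jeff Niemann": (0.074, 0.184),
}

for name, (swstr, k_rate) in pitchers.items():
    xk = expected_k_rate(swstr)
    print(f"{name}: xK% {xk:.1%}, gap {xk - k_rate:+.1%}")
```

The gap between xK% and actual K% is what the table reports as DIFF.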

Obviously, some fellas, such as old man Tim Wakefield, will be exceptions. Wakefield ranks among the top of this improvement group, but we all well know that he and his knuckling twirl have long passed their 20% K-rate days.

So do realize, these numbers lack a degree of precision: each player involved may or may not have legit reasons to beat or under-perform his expected K% (and likewise his adjusted FIP). Moreover, the Should FIP (or adjusted FIP) below was made pretty much by multiplying each pitcher’s total batters faced by his new strikeout rate and recalculating FIP with the resulting strikeout total. Presumably, though, his TBF total would be different if he were indeed striking out more batters.
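Here is a minimal sketch of that adjustment, assuming the standard FIP weighting in which each strikeout counts for -2 in the numerator: (13*HR + 3*(BB+HBP) - 2*K) / IP, plus a league constant. The exact spreadsheet arithmetic is not spelled out above, so treat `should_fip` and the sample pitcher as hypothetical.

```python
def should_fip(fip, tbf, innings, k_rate, expected_k_rate):
    """Adjust FIP for the extra strikeouts implied by a higher K%.

    Assumes the standard FIP weight of -2 per strikeout. TBF, innings,
    home runs, and walks are all held fixed, which is the simplification
    noted in the text; the league constant cancels out of the difference.
    """
    extra_strikeouts = tbf * (expected_k_rate - k_rate)
    return fip - 2.0 * extra_strikeouts / innings


# Hypothetical pitcher (not a row from the table): 800 batters faced,
# 190 innings, a 4.00 FIP, a 15% K rate, and a 20% expected K rate.
adjusted = should_fip(4.00, tbf=800, innings=190.0,
                      k_rate=0.15, expected_k_rate=0.20)
print(f"Should FIP: {adjusted:.2f}")
```

Because the new strikeouts would in reality also remove some batters faced, balls in play, and possibly walks, this shortcut is an approximation.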

So caveats all around. The main lesson here: This is not precise, but this should be fun.

NOTE: Because I am using a worst-case formula, I have only included the players whom the formula expects to improve next year. So even those pitchers with 0% expected change may see K% upticks in 2012.

Name SwStr% K% FIP xK% K% DIFF Should FIP FIP DIFF
Francisco Liriano 11.4% 19.0% 4.54 27.4% 8.4% 3.80 -0.74
Tim Wakefield 8.9% 13.7% 4.99 21.8% 8.1% 4.29 -0.70
Jeremy Hellickson 9.7% 15.1% 4.44 23.6% 8.5% 3.75 -0.69
Chris Narveson 10.4% 18.0% 4.06 25.2% 7.2% 3.44 -0.62
Carl Pavano 7.1% 10.7% 4.10 17.7% 7.0% 3.50 -0.60
Fausto Carmona 7.9% 13.1% 4.56 19.5% 6.4% 3.99 -0.57
Jaime Garcia 10.5% 18.9% 3.23 25.4% 6.5% 2.68 -0.55
Jeff Francis 7.0% 11.3% 4.10 17.5% 6.2% 3.56 -0.54
Dillon Gee 9.1% 16.2% 4.65 22.2% 6.0% 4.12 -0.53
Randy Wells 8.2% 14.1% 5.11 20.2% 6.1% 4.58 -0.53
Phil Coke 8.3% 14.6% 3.57 20.4% 5.8% 3.06 -0.51
Freddy Garcia 8.7% 15.3% 4.12 21.3% 6.0% 3.61 -0.51
John Lannan 7.6% 13.1% 4.28 18.8% 5.7% 3.78 -0.50
Hiroki Kuroda 10.3% 19.2% 3.78 24.9% 5.7% 3.31 -0.47
Daniel Hudson 9.9% 18.4% 3.28 24.0% 5.6% 2.81 -0.47
Shaun Marcum 10.3% 19.2% 3.73 24.9% 5.7% 3.26 -0.47
Edwin Jackson 9.2% 17.2% 3.55 22.4% 5.2% 3.10 -0.45
Edinson Volquez 10.9% 21.3% 5.29 26.3% 5.0% 4.84 -0.45
Josh Tomlin 7.7% 13.4% 4.27 19.0% 5.6% 3.82 -0.45
Ricky Nolasco 8.9% 16.6% 3.54 21.8% 5.2% 3.09 -0.45
Carlos Carrasco 8.5% 15.9% 4.28 20.9% 5.0% 3.85 -0.43
Philip Humber 9.1% 17.2% 3.58 22.2% 5.0% 3.16 -0.42
Luke Hochevar 8.2% 15.3% 4.29 20.2% 4.9% 3.88 -0.41
Jeff Karstens 7.7% 14.4% 4.29 19.0% 4.6% 3.91 -0.38
Chris Capuano 10.5% 21.0% 4.04 25.4% 4.4% 3.66 -0.38
Brett Cecil 8.4% 16.4% 5.10 20.6% 4.2% 4.73 -0.37
Charlie Morton 7.4% 14.3% 3.77 18.4% 4.1% 3.41 -0.36
Guillermo Moscoso 7.4% 14.1% 4.23 18.4% 4.3% 3.88 -0.35
John Danks 9.3% 18.5% 3.82 22.7% 4.2% 3.47 -0.35
Tom Gorzelanny 10.5% 21.3% 4.19 25.4% 4.1% 3.84 -0.35
Roy Oswalt 8.0% 15.7% 3.44 19.7% 4.0% 3.09 -0.35
Cole Hamels 11.3% 22.8% 3.05 27.2% 4.4% 2.71 -0.34
Jason Vargas 7.8% 15.3% 4.09 19.3% 4.0% 3.75 -0.34
R.A. Dickey 7.8% 15.3% 3.77 19.3% 4.0% 3.44 -0.33
Jair Jurrjens 7.4% 14.4% 3.99 18.4% 4.0% 3.66 -0.33
Ricky Romero 9.6% 19.4% 4.20 23.3% 3.9% 3.88 -0.32
Homer Bailey 9.3% 18.9% 4.06 22.7% 3.8% 3.74 -0.32
A.J. Burnett 10.0% 20.7% 4.77 24.2% 3.5% 4.46 -0.31
Jason Hammel 6.5% 12.7% 4.83 16.3% 3.6% 4.52 -0.31
Dan Haren 9.9% 20.2% 2.98 24.0% 3.8% 2.67 -0.31
Joel Pineiro 5.2% 9.8% 4.43 13.4% 3.6% 4.12 -0.31
Carlos Villanueva 7.5% 15.0% 4.10 18.6% 3.6% 3.79 -0.31
Derek Lowe 8.1% 16.5% 3.70 20.0% 3.5% 3.39 -0.31
Mark Buehrle 6.5% 12.7% 3.98 16.3% 3.6% 3.68 -0.30
Jason Marquis 6.5% 13.0% 4.05 16.3% 3.3% 3.75 -0.30
CC Sabathia 11.2% 23.4% 2.88 27.0% 3.6% 2.58 -0.30
Matt Garza 11.2% 23.5% 2.95 27.0% 3.5% 2.65 -0.30
Alexi Ogando 8.9% 18.2% 3.65 21.8% 3.6% 3.36 -0.29
Alfredo Simon 8.1% 16.6% 4.42 20.0% 3.4% 4.13 -0.29
Aaron Harang 8.4% 17.3% 4.17 20.6% 3.3% 3.88 -0.29
Michael Pineda 11.8% 24.9% 3.42 28.3% 3.4% 3.14 -0.28
Chris Volstad 7.9% 16.3% 4.32 19.5% 3.2% 4.04 -0.28
Bud Norris 10.5% 22.1% 4.02 25.4% 3.3% 3.74 -0.28
Chris Carpenter 9.2% 19.2% 3.06 22.4% 3.2% 2.79 -0.27
Josh Collmenter 7.9% 16.1% 3.80 19.5% 3.4% 3.53 -0.27
Brian Duensing 7.8% 16.2% 4.27 19.3% 3.1% 4.00 -0.27
Jo-Jo Reyes 6.6% 13.6% 4.90 16.6% 3.0% 4.63 -0.27
John Lackey 7.0% 14.5% 4.71 17.5% 3.0% 4.44 -0.27
Joe Saunders 6.2% 12.4% 4.78 15.7% 3.3% 4.51 -0.27
Jake Westbrook 6.3% 12.9% 4.25 15.9% 3.0% 3.98 -0.27
Tim Hudson 8.6% 17.9% 3.39 21.1% 3.2% 3.13 -0.26
Livan Hernandez 6.4% 13.2% 3.96 16.1% 2.9% 3.71 -0.25
Zach Britton 7.0% 14.6% 4.00 17.5% 2.9% 3.75 -0.25
Brad Penny 4.6% 9.2% 5.02 12.0% 2.8% 4.77 -0.25
Max Scherzer 9.8% 20.9% 4.14 23.8% 2.9% 3.89 -0.25
Kevin Correia 5.7% 11.7% 4.85 14.5% 2.8% 4.61 -0.24
Johnny Cueto 7.9% 16.5% 3.45 19.5% 3.0% 3.21 -0.24
Rick Porcello 6.3% 13.3% 4.06 15.9% 2.6% 3.83 -0.23
Ivan Nova 6.6% 13.9% 4.01 16.6% 2.7% 3.79 -0.22
Scott Baker 10.4% 22.5% 3.45 25.2% 2.7% 3.23 -0.22
Trevor Cahill 7.6% 16.3% 4.10 18.8% 2.5% 3.88 -0.22
James Shields 10.7% 23.1% 3.42 25.8% 2.7% 3.20 -0.22
Matt Harrison 7.6% 16.3% 3.52 18.8% 2.5% 3.31 -0.21
Josh Beckett 10.5% 22.8% 3.57 25.4% 2.6% 3.37 -0.20
Matt Cain 9.1% 19.7% 2.91 22.2% 2.5% 2.71 -0.20
Mat Latos 10.6% 23.2% 3.16 25.6% 2.4% 2.96 -0.20
Roy Halladay 10.8% 23.6% 2.20 26.1% 2.5% 2.00 -0.20
Jake Peavy 9.2% 20.2% 3.21 22.4% 2.2% 3.02 -0.19
Randy Wolf 6.8% 14.8% 4.29 17.0% 2.2% 4.11 -0.18
Bronson Arroyo 5.8% 12.6% 5.71 14.7% 2.1% 5.53 -0.18
Jhoulys Chacin 8.2% 18.1% 4.23 20.2% 2.1% 4.06 -0.17
Kyle McClellan 5.7% 12.5% 4.92 14.5% 2.0% 4.75 -0.17
Mike Leake 7.7% 17.0% 4.22 19.0% 2.0% 4.05 -0.17
Mike Pelfrey 5.5% 12.2% 4.47 14.1% 1.9% 4.30 -0.17
Bruce Chen 6.7% 14.8% 4.39 16.8% 2.0% 4.23 -0.16
Anibal Sanchez 10.9% 24.3% 3.35 26.3% 2.0% 3.19 -0.16
Alfredo Aceves 7.6% 16.9% 4.03 18.8% 1.9% 3.87 -0.16
Ervin Santana 8.4% 18.8% 4.00 20.6% 1.8% 3.84 -0.16
Wade Davis 5.9% 13.2% 4.67 15.0% 1.8% 4.52 -0.15
Brad Bergesen 6.0% 13.5% 4.92 15.2% 1.7% 4.77 -0.15
Gavin Floyd 8.4% 18.9% 3.81 20.6% 1.7% 3.67 -0.14
Anthony Swarzak 5.5% 12.5% 4.04 14.1% 1.6% 3.90 -0.14
Brandon Morrow 11.5% 26.1% 3.64 27.6% 1.5% 3.51 -0.13
Javier Vazquez 8.9% 20.3% 3.57 21.8% 1.5% 3.45 -0.12
James McDonald 8.2% 18.8% 4.68 20.2% 1.4% 4.56 -0.12
Tim Lincecum 10.7% 24.4% 3.17 25.8% 1.4% 3.05 -0.12
Jeremy Guthrie 6.3% 14.6% 4.48 15.9% 1.3% 4.37 -0.11
Nick Blackburn 4.8% 11.3% 4.84 12.5% 1.2% 4.74 -0.10
Justin Masterson 7.5% 17.4% 3.28 18.6% 1.2% 3.18 -0.10
Brandon McCarthy 7.7% 17.8% 2.86 19.0% 1.2% 2.76 -0.10
Felipe Paulino 9.6% 22.2% 3.69 23.3% 1.1% 3.59 -0.10
Ted Lilly 8.5% 19.8% 4.21 20.9% 1.1% 4.12 -0.09
Ryan Dempster 9.3% 21.7% 3.91 22.7% 1.0% 3.82 -0.09
Brett Myers 7.4% 17.5% 4.26 18.4% 0.9% 4.18 -0.08
Dustin Moseley 5.3% 12.7% 3.99 13.6% 0.9% 3.91 -0.08
Carlos Zambrano 6.7% 15.9% 4.59 16.8% 0.9% 4.52 -0.07
Kyle Kendrick 5.1% 12.3% 4.55 13.2% 0.9% 4.48 -0.07
Jered Weaver 9.1% 21.4% 3.20 22.2% 0.8% 3.13 -0.07
Jordan Zimmermann 7.9% 18.7% 3.16 19.5% 0.8% 3.10 -0.06
Danny Duffy 7.7% 18.4% 4.82 19.0% 0.6% 4.76 -0.06
Kyle Lohse 5.9% 14.3% 3.67 15.0% 0.7% 3.62 -0.05
Jonathan Sanchez 9.7% 23.0% 4.30 23.6% 0.6% 4.25 -0.05
Jake Arrieta 7.4% 17.8% 5.34 18.4% 0.6% 5.29 -0.05
Chad Billingsley 7.6% 18.3% 3.83 18.8% 0.5% 3.79 -0.04
Paul Maholm 5.7% 14.1% 3.78 14.5% 0.4% 3.75 -0.03
Travis Wood 6.7% 16.4% 4.06 16.8% 0.4% 4.03 -0.03
Tyler Chatwood 4.6% 11.7% 4.89 12.0% 0.3% 4.86 -0.03
Gio Gonzalez 9.5% 22.8% 3.64 23.1% 0.3% 3.61 -0.03
Wandy Rodriguez 8.5% 20.5% 4.15 20.9% 0.4% 4.12 -0.03
Jonathon Niese 8.2% 19.9% 3.36 20.2% 0.3% 3.33 -0.03
Derek Holland 7.9% 19.2% 3.94 19.5% 0.3% 3.92 -0.02
Doug Fister 6.7% 16.7% 3.02 16.8% 0.1% 3.01 -0.01
Colby Lewis 8.2% 20.1% 4.54 20.2% 0.1% 4.54 0.00
Jeff Niemann 7.4% 18.4% 4.13 18.4% 0.0% 4.13 0.00

I must again credit Mike Podhorzer, who got me thinking about this.




Bradley writes for FanGraphs and The Hardball Times. Follow him on Twitter @BradleyWoodrum.


21 Responses to “Predicting 2012’s Strikeout Improvements”

  1. mike says:

    really interesting; i’d love to see this on an xFIP, SIERA or tERA basis, too. not sure how involved all of that is, from a math standpoint.

  2. Brian says:

    If you break down 2010 stats, a lot of the same pitchers have high xK’s that didn’t materialize.

    I looked at the 92 pitchers, presumably that were innings qualified, and here were the top 10:

    Name 2010 diff
    Randy Wells -6.92%
    Brett Cecil -6.34%
    Hiroki Kuroda -6.23%
    R.A. Dickey -6.03%
    Clay Buchholz -5.99%
    Carl Pavano -5.69%
    Shaun Marcum -5.68%
    Jaime Garcia -5.25%
    Edwin Jackson -5.05%
    Johan Santana -4.84%
    Francisco Liriano -4.78%

    • Brian says:

      sorry. hit enter before I finished.

      The “2010 diff” is the difference between 2010 K% and “expected K%” using the y = 2.2619x + 0.0163 formula from the author’s analysis.

      Not sure what that means… I can’t think of similarities between Kuroda, Liriano, Pavano, Marcum, Wells, Jaime Garcia, etc.. Just thought I’d throw it out there.

  3. TFINY says:

    A) I really liked this post. I thought it was well written and funny, and all around put together well.

    B) Would a team philosophy of “pitch to contact” make a difference here? I’m only familiar with the Twins program, and they definitely have that strategy. Is it a coincidence that Liriano and Pavano (two pitchers that didn’t come up in the Twins system but have pitched there for 6 and 2 seasons respectively) are near the top? Could anyone else who knows say if the pitchers near the top of the list are from programs with similar philosophies?

    • Steven Ellingson says:

      I don’t think it’s “pitch to contact” as is so often said, but more “don’t walk people.” But yes, I think it could make a difference.

      Also, the philosophy of featuring a change-up as a strikeout pitch could have something to do with it. Obviously Liriano’s best pitch is his slider, but I believe (I haven’t looked this up) he threw more changeups this year.

  4. Jeff says:

    While the differential in expected and actual K% is interesting in and of itself, it seems to me the cause of the heteroskedasticity is actually the more interesting question. You talk about “fixing it” statistically (which of course you can do), but wouldn’t it be more interesting to predict it? You could run a heteroskedastic regression model, which doesn’t make any assumptions about the error structure and allows you to include predictors of both the outcome (K%) and the variance. I wonder what would predict it? Perhaps pitch selection, like “pitching backwards”? Perhaps some guys throw their highest SwStr% pitch early in the count and one of their lower SwStr% pitches once they get to two strikes? Something that could probably be most easily checked with the proper methodology.

    Anyhow, interesting stuff.

    • Steven Ellingson says:

      Yes, this sounds awesome. Someone with more time than me should do this.

      I have a feeling that the type of pitch getting the swinging strikes will be a significant variable, with the change-up leading to the higher variance.

  5. Slartibartfast says:

    Seems to me the biggest factor you missed was BB%. Imagine a pitcher who took every count to 3-2, but had a league average SwStr% rate. He’d be an elite K% pitcher, but he’d also walk a boat load of dudes.

    There should be a “nibble” factor for pitchers, which would complete the picture here.

  6. Sandy Kazmir says:

    I prefer this method as it differentiates between SwStr out of and in zone.

    http://www.draysbay.com/2009/7/21/956509/updated-expected-strikeouts-based

    The formula:
    K%=(ClStr%*.9)+(Foul%*.5)+(InPly%*-.9)+(InZSwStr%*1.1)+(OZSwStr%*1.5)

    Adjusted r^2 of 91.4 is extremely strong. I’ll be updating this for major starters at some point in the offseason as it’s a very nice look at which guys can be expected to take a leap forward next season.

  7. DD says:

    I assume foul balls count as swinging strikes. Is there data out there for total foul balls and fouls/PA? Just looking at the list, I know Hamels gets a ton of foul balls because he rarely puts anyone away with the heater, using it to set up his change piece, which induces swinging strikes of its own. I’d want to cross check this list to a fouls/PA leader board.

  8. Jim Lahey says:

    This is a very interesting graph / list of players.

    I look at the upper section of the list as guys who I would bet on improvement for next year… once I factor out all of the guys that throw less than 91-92ish?

    The bottom of the list I would expect to not be quite as good as they were this year.. all the way up to -.20 roughly…

  9. Adam says:

    No way Nolasco on a should improve next year list…

  10. Choo says:

    Do the majority of sinkerballers reside in the upper half (excluding ultimate sinkerballer Justin Masterson)?

  11. Voxx says:

    I see a lot of Cardinals and Twins on here. Both teams which emphasize control. A logical philosophy in pitch selection might lead to pounding the zone with a sinker on a 2 strike count rather than a true ‘strikeout’ pitch, which is more likely to be a ball or fouled off, extending the at-bat.

    Just a thought, anyways.

  12. Mike Podhorzer says:

    Having used SwStk% a lot in my articles, I could tell you that a lot of the pitchers who appear to be due for a K% spike are below average in getting called strikes. Since a strikeout counts the same whether it is the result of a swinging or called strike, I would think this is important. I know Kuroda specifically has a lower than average called strike rate each season, which likely is the primary cause of his lower than expected K%.

    Statcorner.com has the ClStk% (called strike percentage) stat.

  13. adohaj says:

    2 twins starters in the top 5 I now have reason for hope

    Also your wife tolerates this? Does she have a sister…

  14. Carlcrawfordisawesome says:

    WOW I just posted this on the forums, and they’re taking the credit

  15. Matt Lee says:

    One problem is that this should be plotted on a log vs log scale.
    Why? Because a strikeout rate of 10% is 2X worse than 5% and 2X better than 20%. Thus the unit between 5% to 10% should be the same as 10% to 20%. I suspect that may eliminate the heteroscedasticity.

    The bigger problem, however, may be the variance in this correlation. A high R^2 indicates a linear relationship. But if the variance is high, the predictability is more problematic, which is why the precision is lacking. By eye, I’m estimating a 1/2 log unit (10X) spread on the data. In other words, with a 10% swinging strikeout rate, the K-rate appears to show a 95% Confidence Interval from 12% to 24%. a 12% K-rate is a whole lot different from a 24% K-rate.

  16. Matt Lee says:

    Also, the equation provided for the orange line is clearly not correct. At x=10% (the swinging K-rate), y=22.6353 (the K-rate), but the orange line is at around a 15% K-rate with a 10% swinging K-rate.
