Workload and Durability (Part 1)

Much has been written on the subject of pitch counts. In some quarters, the notion that high pitch counts are dangerous to a pitcher’s health is an article of faith; the idea makes intuitive sense. There is just one problem — a lack of evidence in its favor.

Earlier this year, Rob Neyer and Bill James published the exceptional Neyer/James Guide to Pitchers — an encyclopedia of the pitching repertoire of nearly every significant pitcher in Major League and Negro League history. In one essay, titled “Abuse and Durability” (pp. 449-463), James runs a series of matched-pair studies, identifying the most similar non-abused pitchers to pitchers listed as “abused” in various editions of Baseball Prospectus based on the Pitcher Abuse Points (PAP) system devised by Rany Jazayerli and Keith Woolner. The results skew in one direction: the “abused” pitchers keep more of their value (on average) than comparable “non-abused” pitchers. That’s right — keep their value.

James concludes his essay speculating about what is behind the phenomenon:

Most injuries to pitchers are not the result of chronic overuse; some are, particularly to young pitchers, but most are not. They’re catastrophic events, just like a heart attack or a torn muscle. They happen suddenly, and they happen when a pitcher goes outside the envelope of his previous conditioning.

Backing away from the pitcher’s limits too far doesn’t make a pitcher less vulnerable; it makes him more vulnerable. And pushing the envelope, while it may lead to a catastrophic event, is more likely to enhance the pitcher’s durability than to destroy it.

And yet, questions linger. James himself notes that since power pitchers last longer (and tend to throw more pitches per inning) than finesse types, controlling for quality of pitcher isn’t sufficient to isolate the effect of high pitch counts. In addressing the issue of pitch count we must be sensitive to differences of pitcher type.

The quality of a matched-pair study depends on how similar your comparison groups are in all respects save for the one under study. On the other hand, pegging the similarity standard too high may lead to too few matches to tell us anything useful. A balance must be struck between sample size and degree of similarity.

Matched-Pair Workload Study #1

Starting with a large pool of players from which to match leads to more good matches. To that end, I settled on a pool of starting pitchers born after 1945 and before 1970. This 24-year period encompasses the baby boom and immediate post-boom generations. All but a handful of pitchers born before 1970 are either retired or no longer starting in the majors, so we don’t need to worry very much about incomplete data.

To start, we need to define heavy and moderate workloads for starting pitchers. A heavy workload was defined as exceeding 3,800 estimated pitches(1) in a given year; 3,000 to 3,600 estimated pitches was defined as a moderate workload. Because of the influence of the pitch count and the pervasiveness of the five-man rotation, very few pitchers have exceeded 3,800 pitches in recent years (starting 34 times, a pitcher would need to average almost 112 pitches a start).
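The workload thresholds can be written as a small classifier. This is only a sketch: the names `classify_workload` and `pitches_per_start` are illustrative, the moderate band is assumed to include both endpoints, and the label "unclassified" is my own placeholder for seasons the article does not name (below 3,000 pitches, or between 3,600 and 3,800).

```python
HEAVY_MIN = 3800               # a heavy season exceeds this many estimated pitches
MODERATE_RANGE = (3000, 3600)  # assumed inclusive bounds for a moderate season


def classify_workload(estimated_pitches: float) -> str:
    """Label one season using the thresholds stated in the article."""
    if estimated_pitches > HEAVY_MIN:
        return "heavy"
    low, high = MODERATE_RANGE
    if low <= estimated_pitches <= high:
        return "moderate"
    # The article gives no name to seasons outside the two bands.
    return "unclassified"


def pitches_per_start(season_pitches: float, starts: int) -> float:
    """Average pitches per start needed to reach a season total."""
    return season_pitches / starts


print(classify_workload(4012))  # Hentgen's 1996 total from the footnotes
print(classify_workload(3150))  # Martinez's 1995 total from the footnotes
print(round(pitches_per_start(3800, 34), 1))  # just under 112
```

The last line reproduces the parenthetical arithmetic: 3,800 pitches spread over 34 starts comes to a little under 112 per start.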

Group A pitchers were those who had at least one heavy workload season before age 28. Group B pitchers were those who never exceeded 3,600 estimated pitches in a year before age 28. Matches were based on highest similarity score, using single season to single season comparisons, and taking into account the following characteristics:

(1) Strikeouts per Opportunity [K/(BF-IW)]
(2) Non-Intentional Walks per Opportunity [(W-IW)/(BF-HBP-IW)]
(3) Earned Run Average [ER/IP*9]
(4) Year of Birth
(5) Age on July 1st(2)
(6) The matched pitchers must throw with the same hand
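The three rate statistics in the list follow directly from the bracketed formulas. The sketch below assumes a simple record of season totals; the `SeasonLine` class, its field names, and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class SeasonLine:
    """Season totals for one pitcher (field names are illustrative)."""
    strikeouts: int
    walks: int
    intentional_walks: int
    hit_by_pitch: int
    batters_faced: int
    earned_runs: int
    innings_pitched: float  # true innings, so 200 1/3 innings is 200.333


def strikeout_rate(s: SeasonLine) -> float:
    """K / (BF - IW)"""
    return s.strikeouts / (s.batters_faced - s.intentional_walks)


def walk_rate(s: SeasonLine) -> float:
    """(W - IW) / (BF - HBP - IW)"""
    return (s.walks - s.intentional_walks) / (
        s.batters_faced - s.hit_by_pitch - s.intentional_walks
    )


def earned_run_average(s: SeasonLine) -> float:
    """ER / IP * 9"""
    return s.earned_runs / s.innings_pitched * 9


# Made-up season, not a real pitcher's line.
example = SeasonLine(strikeouts=150, walks=70, intentional_walks=5,
                     hit_by_pitch=5, batters_faced=1005,
                     earned_runs=80, innings_pitched=240.0)
print(strikeout_rate(example))      # 150 / 1000
print(walk_rate(example))           # 65 / 995
print(earned_run_average(example))  # 80 / 240 * 9
```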

Here’s a hypothetical example of how similarity scores work in this study. Imagine two pitchers with identical ERAs, strikeout rates and walk rates. These pitchers are the same age (to the day) and are born in the same year. The Group A pitcher, however, throws 750 estimated pitches more than the Group B pitcher. The method considers this a perfect match — earning 1,000 points. In actual cases, the differences in each category result in points deducted from 1,000; the higher the final similarity score, the greater the (statistical) similarity between the two pitchers.

The final requirement was that no Group B pitcher could be matched with more than one Group A pitcher; the match with the higher similarity score was given priority. Each matched season was designated Year Zero for that particular pitcher. A more detailed description of the comparison method(3) can be found in the footnotes.
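One way to enforce the one-to-one rule is a greedy pass over all candidate pairs, best score first. This is a sketch under an assumption the article does not spell out: a Group A pitcher who loses his preferred Group B partner to a better match falls back to his next-best available candidate. The function name `assign_matches` and the labels "A1", "B1", and so on are placeholders.

```python
def assign_matches(candidates):
    """Greedy one-to-one matching.

    candidates: iterable of (group_a_name, group_b_name, similarity_score).
    Returns a dict mapping each matched Group A name to (group_b_name, score).
    """
    matches = {}
    used_b = set()
    # Visit the highest similarity scores first so they get priority.
    for a, b, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if a in matches or b in used_b:
            continue  # this pitcher is already paired
        matches[a] = (b, score)
        used_b.add(b)
    return matches


# Placeholder names and scores, for illustration only.
candidates = [
    ("A1", "B1", 960),
    ("A2", "B1", 950),  # loses B1 to A1, whose score is higher
    ("A2", "B2", 940),
    ("A1", "B2", 930),
]
print(assign_matches(candidates))
```

In the example, both Group A pitchers prefer B1; A1 keeps him because 960 beats 950, and A2 is paired with B2 instead.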

Quality Control

Before we turn to the matched pairs, let’s consider what James calls “quality leakage.” James noted that in matched-pair studies, there is a tendency for very good pitchers to be matched with lesser pitchers because the former are usually unique. James’ solution was to select pitchers for his “Group B” that were of slightly higher quality (more Win Shares) than his “Group A” pitchers so as to offset the leakage. I took a different approach: I disposed of the worst third (according to similarity score) of the matched pairs.

Of the 69 matched pairs, the 23 least similar pairs were removed from consideration. I believe this is sufficient to alleviate the worst effects of the quality leakage problem, while maintaining a sufficiently large sample. To illustrate, the worst “match” among the original 69 pairs was Nolan Ryan/David Cone. Ryan is nearly a generation older than Cone and walked and struck out batters at a greater rate as a young pitcher. Because they are so dissimilar, there is no reason to think that the Ryan/Cone match tells us anything about durability.
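The trimming step amounts to ranking the pairs by similarity score and discarding the bottom third. The sketch below uses 69 synthetic pairs with made-up scores, since only the counts matter here; `keep_best_two_thirds` is an illustrative name, and ties at the cutoff are broken arbitrarily by sort order.

```python
def keep_best_two_thirds(pairs):
    """Split pairs into (kept, dropped), dropping the least similar third.

    pairs: list of (group_a_name, group_b_name, similarity_score).
    """
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    n_drop = len(ranked) // 3
    cutoff = len(ranked) - n_drop
    return ranked[:cutoff], ranked[cutoff:]


# Synthetic placeholder pairs; names and scores are not from the study.
pairs = [(f"A{i}", f"B{i}", 900 + i) for i in range(69)]
kept, dropped = keep_best_two_thirds(pairs)
print(len(kept), len(dropped))  # 46 pairs kept, 23 dropped
print(2 * len(kept))            # 92 pitchers remain in the study
```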


Unmatched Group A pitchers
Vida Blue (’71)         Ted Higuera (’86)       John Montefusco (’75)
Bert Blyleven (’73)     Catfish Hunter (’72)    Mike Mussina (’96)
Jim Clancy (’80)        Randy Jones (’76)       Gary Nolan (’70)
Joe Coleman (’74)       Clay Kirby (’71)        J.R. Richard (’76)
Ron Darling (’85)       Mark Langston (’87)     Nolan Ryan (’74)
Larry Dierker (’69)     Bill Lee (’73)          Frank Tanana (’76)
Dwight Gooden (’85)     Dennis Leonard (’77)    Fernando Valenzuela (’82)
Ron Guidry (’78)        Jon Matlack (’74)

The “cast-offs” were pooled to create a new group (Group C); I’ll consider them in Part 2 of this series. A few Hall of Fame-type pitchers from Group A made it into the study, most notably Roger Clemens and Greg Maddux. Should we exclude them as well? Arbitrarily removing “special arms” seems like a sensible approach, but it creates its own problems (which I will also consider in Part 2). Hand-picking which pairs stayed and which went was not the path I wanted to go down.

Without further ado, the 92 subjects of Study #1 are:

Group A Pitcher Sim. Group B Pitcher    Group A Pitcher Sim. Group B Pitcher
Len Barker(’80) 929 Jose Guzman(’88)    D.Lemanczyk(’77) 959 Bart Johnson(’76)
Bill Bonham(’74) 936 Ken Forsch(’73)    Greg Maddux(’91) 959 Andy Benes(’92)
Oil Can Boyd(’85) 954 John Burkett(’90)    Dennis Martinez(’79) 955 Bill Gullickson(’83)
Tom Bradley(’71) 927 Reggie Cleveland(’72)    Jack McDowell(’92) 957 S.Bankhead(’89)
Kevin Brown(’92) 955 Pedro Astacio(’96)    Doc Medich(’74) 967 Bob Moose(’73)
Tom Browning(’85) 972 Jamie Moyer(’88)    Mike Moore(’86) 984 Andy Hawkins(’86)
Ron Bryant(’73) 956 John Curtis(’73)    Jack Morris(’82) 966 Eric Show(’83)
Steve Busby(’74) 933 Gary Gentry(’69)    Mike Norris(’80) 925 Orel Hershiser(’85)
Roger Clemens(’87) 966 Erik Hanson(’90)    Melido Perez(’92) 962 Pete Harnisch(’93)
Jim Colborn(’73) 928 Dave Frost (’79)    Dan Petry(’83) 953 Jay Tibbs (’85)
Joe Decker(’74) 938 Buzz Capra(’74)    Rick Reuschel(’74) 949 Rick Langford(’77)
D.Eckersley(’78) 971 Scott Sanderson(’80)    Jerry Reuss(’73) 944 Bob Shirley(’77)
Cal Eldred(’93) 953 Ben McDonald(’92)    Steve Rogers(’77) 942 Burt Hooton(’77)
R.Erickson(’78) 950 Mark Lemongello(’78)    Bret Saberhagen(’88) 929 Frank Castillo(’92)
Alex Fernandez(’96) 956 Tommy Greene(’93)    Jim Slaton(’76) 953 Bob Forsch(’75)
Ed Figueroa(’76) 935 Alan Foster(’73)    John Smoltz(’93) 936 Kevin Appier(’95)
Mike Flanagan(’78) 942 Bob Ojeda(’84)    Mario Soto(’83) 953 Tim Belcher(’89)
W.Garland(’77) 962 Doyle Alexander(’77)    Paul Splittorff(’73) 941 John Candelaria(’80)
Ross Grimsley(’74) 935 Ken Brett (’73)    Dave Stieb(’83) 956 Charlie Lea (’83)
Mark Gubicza(’88) 967 Ken Hill(’92)    Rick Sutcliffe(’83) 953 Dave Stewart(’84)
Ed Halicki(’77) 953 Pete Vuckovich(’79)    Dick Tidrow(’73) 946 Glenn Abbott(’77)
Pat Hentgen(’96) 964 Ramon Martinez(’95)    Frank Viola(’86) 948 Britt Burns(’85)
Jim Hughes(’75) 938 Dave Freisleben(’74)    Mike Witt(’86) 936 Jose Rijo(’91)

The weighted average performance of the Group A pitchers was 17 wins, 13 losses, 3.52 ERA, 15.0% strikeout rate, 7.3% walk rate, 268.0 IP, and 4,038 estimated pitches.

The weighted average performance of the Group B pitchers was 13 wins, 11 losses, 3.54 ERA, 15.0% strikeout rate, 7.5% walk rate, 216.7 IP, and 3,268 estimated pitches.

The only significant statistical differences between the two groups in Year Zero are those related to workload. Aha, you might say — that’s only one season. Could the Group B pitchers be (in truth) inferior and their Year Zero performance merely a result of a preponderance of career years? Could there be differences in performance in the years leading up to the seasons in question? The numbers for the average Group A and Group B pitcher for the three years up to and including Year Zero …

Year -2 to Year Zero
                     IP  Pitches   ERA  K rate (%)  W rate (%)  Wins  Losses
Group A average   594.7     9008  3.54        15.3         7.6    36      30
Group B average   452.7     6833  3.56        15.2         7.4    27      24

… tell the same tale. Apart from workload indicators, the two groups appear to be a very good match.

Suppose you are the general manager of a baseball team and are considering acquiring one of two pitchers: a 25-year-old pitcher who threw 3,900 pitches in 2004 and a very similar pitcher who threw only 3,300. Your scouts don’t turn up any major differences between the two and their overall performance over the last three years has also been very similar. The one difference is that the first pitcher has been subjected to a significantly greater workload than the second pitcher. Who would you choose and why?

Is surviving the heavy workload a marker of greater durability, or instead does the greater “mileage” mean you’d be better off acquiring the “underused” pitcher? The answer … next week.

References & Resources
(1) Pitches thrown were estimated using the Extended Pitch Count Estimator developed by Tangotiger.

(2) Age was calculated using exact date of birth as of July 1st of the year in question.

(3) Similarity Scores were determined by dividing the assigned weight for each category by the standard error based on the population of 3000+ pitch seasons in the pool. The weights for each category were as follows: strikeout rate = 40 points; ERA = 40 points; age = 30 points; birth year = 30 points; walk rate = 20 points; estimated pitches = 20 points; total = 180. For all categories (except estimated pitches thrown) the absolute difference between the two pitchers was multiplied by the assigned weight and divided by the standard error. For estimated pitches, the deduction was based on how far the gap between the two pitchers’ totals fell from 750 pitches: that distance was multiplied by the assigned weight and divided by the standard error.

Sample Calculation (the divisor in each formula is the standard error for that category)

Pat Hentgen (1996), born 1968: 16.1% K rate, 8.3% W rate, 3.22 ERA, 27.63 age, 4,012 estimated pitches
Ramon Martinez (1995), born 1968: 16.2% K rate, 9.0% W rate, 3.66 ERA, 27.28 age, 3,150 estimated pitches

Strikeout Points: abs(.161-.162)*40/.0400 = 1.00
Walk Points: abs(.083-.090)*20/.0211 = 6.64
ERA Points: abs(3.22-3.66)*40/1.026 = 17.15
Age Points: abs(27.63-27.28)*30/1.814 = 5.79
Year of Birth Points: abs(1968-1968)*30/6.99 = 0.00
Estimated Pitches Points: (abs(750-abs(4012-3150)))*20/354.6 = 6.32

Sum of Deductions: 1.00 + 6.64 + 17.15 + 5.79 + 0.00 + 6.32 = 36.90

Similarity Score = 1000 – 36.90 = 963.10 (rounded off to 963**)

** The inputs shown above are rounded; computed from the unrounded figures, the similarity score is 964 rather than 963, as listed in the main text.
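The sample calculation can be reproduced in a few lines. The weights, standard errors, and the 750-pitch target come from footnote 3 and the worked example; the `Season` class, the `PARAMS` table, and the function name are my own scaffolding, and the inputs are the rounded figures quoted above, so the result lands near 963.1 rather than the 964 obtained from unrounded data.

```python
from dataclasses import dataclass


@dataclass
class Season:
    k_rate: float       # strikeouts per opportunity, as a fraction
    w_rate: float       # non-intentional walks per opportunity, as a fraction
    era: float
    age: float          # age on July 1st, in years
    birth_year: int
    est_pitches: float


# (weight, standard error) for each category, taken from footnote 3.
PARAMS = {
    "k_rate": (40, 0.0400),
    "w_rate": (20, 0.0211),
    "era": (40, 1.026),
    "age": (30, 1.814),
    "birth_year": (30, 6.99),
    "est_pitches": (20, 354.6),
}
TARGET_PITCH_GAP = 750  # a 750-pitch difference incurs no deduction


def similarity_score(a: Season, b: Season) -> float:
    """Start at 1,000 and subtract a weighted deduction per category."""
    deductions = 0.0
    for field, (weight, std_err) in PARAMS.items():
        diff = abs(getattr(a, field) - getattr(b, field))
        if field == "est_pitches":
            # Penalize distance from the 750-pitch target, not the raw gap.
            diff = abs(TARGET_PITCH_GAP - diff)
        deductions += diff * weight / std_err
    return 1000 - deductions


hentgen_1996 = Season(0.161, 0.083, 3.22, 27.63, 1968, 4012)
martinez_1995 = Season(0.162, 0.090, 3.66, 27.28, 1968, 3150)
print(round(similarity_score(hentgen_1996, martinez_1995), 1))  # about 963.1

# The "perfect match" from the main text: identical stats, 750 pitches apart.
twin = Season(0.161, 0.083, 3.22, 27.63, 1968, 4012 - 750)
print(similarity_score(hentgen_1996, twin))  # 1000.0
```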
