Starting Pitcher DL Projections (Part 2 of 2)
Yesterday, I went through the formula used for predicting which starting pitchers have the greatest chances of going on the DL in a given year. Now here are the projections for 2011. Besides revealing the list, a few other points and possible improvements to the process will be discussed.
First, here are the five most and least likely starting pitchers (>20 GS and >120 innings in 2010) to go onto the DL in 2011 (creating these projections is still a work in progress, so no one should take too much stock in them right now):

There are no real surprises on the list, with young, experienced pitchers ~25% less likely to go on the DL than older, injured pitchers. A complete list of players can be found here in this Google Doc.
The projections estimate that 45 of the pitchers will go on the DL sometime during the season, which is 39% of the pitchers being examined. For 2010, the projections would have predicted 43 players going on the DL; 37 actually did, or 34% of the total pitchers.
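As a rough illustration of how a count like this can be derived from individual projections, the sketch below sums per-pitcher probabilities to get an expected number of DL trips and divides by the pool size. It assumes the total comes from summing the individual chances, which the post does not state outright. The probabilities shown are hypothetical placeholders, not values from the list.

```python
def expected_dl_count(probabilities):
    """Expected number of pitchers hitting the DL: the sum of the individual chances."""
    return sum(probabilities)


def dl_rate(count, pool_size):
    """Share of the pitcher pool that the count represents."""
    return count / pool_size


# Hypothetical per-pitcher DL probabilities (illustrative only, not from the post).
sample_probs = [0.30, 0.35, 0.40, 0.50]

expected = expected_dl_count(sample_probs)
rate = dl_rate(expected, len(sample_probs))
print(f"Expected DL pitchers: {expected:.2f} of {len(sample_probs)} ({rate:.0%})")
```

The same comparison can be made against a past season by setting the summed projection next to the number of pitchers who actually landed on the DL.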
Knowing just the chances for going on the DL is not the entire picture. The number of days lost also needs to be known, but I have not figured out a good way to get the days lost yet. Instead, here is a chart to use with the number of days lost for each trip to the DL for this group of starting pitchers:

One encouraging sign from the preceding graph is that, once on the DL, the pitcher has less than a 20% chance of missing more than 90 days in the season.
Besides figuring out the possible days lost, I may look into a couple of other improvements in the future, as follows.
1. Use Tom Tango’s fan playing-time projections. Instead of looking at what a pitcher did the year before, it would look at how much fans think the pitcher will pitch in the coming year. He has only generated them for the past two years, so the data set used to create the projection would be limited.
2. Look at the pitcher’s fastball velocity. It seems that pitchers who throw harder are more likely to end up injured than pitchers who are soft tossers. I may add these pitch speeds in when I get around to using Tom’s playing-time projections.
With the starting pitchers done for now, I will be moving on to relief pitchers next. Hopefully, I will have some data in the next couple of weeks.
The lowest chance is 27 percent. Wow.
These numbers need confidence intervals.
I guess I’ll note that these are not entirely straightforward to build, but they are easier than you think. You just use the linear part of the model and build CIs like you usually do, and then map those CIs back to the 0/1 space.
I have someone helping me with this process, but logistic regression doesn’t lend itself to confidence intervals. I will tell you that I am 100% confident that the player will either be on the DL or won’t be.
I’m telling you that approximate CIs are easier than you think. The program that you use should be able to pop out prediction intervals, but if it can’t, you can use the variance-covariance (VC) matrix from the last iteration of the weighted least squares to construct them just like you would for a regular weighted regression. The only difference is that you then have to take those predictions and run them through the inverse logit (logistic) function at the end to get the results on the interval [0,1].
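The approach described in this comment can be sketched as follows: build a normal-theory interval on the linear predictor, then push both endpoints through the inverse logit. This is only a minimal sketch of an approximate (Wald-style) interval for the fitted probability. The coefficients, covariance matrix, and single age predictor are hypothetical and are not the fitted values from the post.

```python
import math


def inverse_logit(eta):
    """Map a value on the log-odds scale to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probability_interval(x, beta, cov, z=1.96):
    """Approximate CI for a fitted probability from a logistic regression.

    x    : predictor values, including a leading 1.0 for the intercept
    beta : estimated coefficients
    cov  : variance-covariance matrix of the coefficient estimates
    z    : normal quantile (1.96 gives roughly a 95% interval)
    """
    n = len(x)
    eta = sum(x[i] * beta[i] for i in range(n))  # linear predictor x'b
    var = sum(x[i] * cov[i][j] * x[j] for i in range(n) for j in range(n))  # x'Vx
    se = math.sqrt(var)
    return inverse_logit(eta - z * se), inverse_logit(eta), inverse_logit(eta + z * se)


# Hypothetical model: intercept plus age (made-up numbers for illustration).
beta = [-3.0, 0.08]
cov = [[0.64, -0.02],
       [-0.02, 0.0007]]
x = [1.0, 30.0]  # a 30-year-old pitcher

low, point, high = probability_interval(x, beta, cov)
print(f"Estimated DL chance: {point:.1%} (approx. 95% CI {low:.1%} to {high:.1%})")
```

Because the interval is built on the log-odds scale and then transformed, it is not symmetric around the point estimate, but it always stays inside [0,1].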
Yeah I feel like this list just sort of restates the obvious of the past…
I find it hard to believe that…
Ubaldo
Lincecum
Lester
Volstad
Cain
5 very different pitching builds, different pitch selections, different pitching strains, would all be lumped together within a 0.3% probability.
Also I would tack on an extra 15% to any pitcher that has to throw for Dusty Baker, haha.
Again with Dusty-Baker-destroys-arms propaganda . . . Please name a single pitcher with decent pitching mechanics (i.e. not Kerry Wood or Mark Prior) that Baker has destroyed with overuse.
Might also be worth adding another age variable. If you include both age and age-squared, you can get a parabolic shape on the variable.
This is purely conjecture, but I think that very young pitchers have an increased risk of injury just as very old pitchers do. Might be worth examining.
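To illustrate the age-squared idea, here is a small sketch showing how a positive coefficient on age-squared gives a U-shaped risk curve, with the lowest risk at the vertex of the parabola. The coefficients are hypothetical, chosen only so that the shape is visible. Note that the parabola is on the log-odds scale, not on the probability itself.

```python
import math


def inverse_logit(eta):
    """Map a value on the log-odds scale to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def dl_probability(age, b0, b_age, b_age2):
    """DL chance from a logistic model with age and age-squared terms."""
    eta = b0 + b_age * age + b_age2 * age ** 2
    return inverse_logit(eta)


def lowest_risk_age(b_age, b_age2):
    """Vertex of the parabola; it is a minimum only when b_age2 is positive."""
    return -b_age / (2.0 * b_age2)


# Hypothetical coefficients (not fitted to any real data).
B0, B_AGE, B_AGE2 = 5.65, -0.435, 0.0075

for age in (22, 29, 38):
    print(age, round(dl_probability(age, B0, B_AGE, B_AGE2), 3))
print("Lowest-risk age:", round(lowest_risk_age(B_AGE, B_AGE2), 1))
```

If the fitted coefficient on age-squared came out near zero or negative, that would argue against the conjecture that very young pitchers carry extra risk.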
Also, any thought to including innings pitched instead of games started? Or maybe innings per start? Seems like this is usually used for this type of equation instead of games started.
Good work though, I’m looking forward to seeing the results with pitch speed as well.
Stuff that I wonder about when it comes to the difference between myth and reality…
innings per season?
innings per start?
pitches per start?
pitches per inning?
average days rest between starts?
For younger pitchers I always wonder how their progressed use throughout their minor league career contributes…
Is there any difference whatsoever between a drafted high school pitcher’s history and a drafted college pitcher’s history?
Off-speed pitch selections… Breaking ball usage compared to change-up usage… And is there an increased risk when you transition between particular breaking ball thresholds? for instance the risk of increased velocity with a 12-6 slow curve… the slurve… a slider transitioning to a cut fastball…
(I guess that could be something like the velocity differentials between certain pitches… Is a pitcher putting increased strain on his arm when they have unusually high or low pitch velocity differentials with breaking balls and fastball types.)
to clarify… I wonder how Clayton Kershaw might have affected injury probability with his pitch selection shift…
Going from a reliance on that big curve with a 20MPH difference to his fastball to using a frequent slider. It could be twofold… pitch stress and a more probable increase in control/command development (efficiency)?
Jesse – Good idea on the age squared.
I tried IP, but it and GS just evened each other out, so I just went with GS.
baty
At BtB, I did a study on college vs high school pitchers and found that neither had an advantage on longevity.
I may look at velocity differentials later, but that would be a pain to set up.
So Hudson is extremely likely because of his TJ surgery a few years ago? Part of it would be age of course, but I think that probably overestimates his probability in your projections.
I like the process behind this, and I think when you look at the results (projections) and see stuff like this, it can make you question your formulas, which can lead you to tweak or confirm your methodology, neither of which of course are bad. I think there’s something interesting here though, even if what you have now is early in the process.
At some point I will look at TJ pitchers.
Hudson’s age is a huge factor. 35-year-old pitchers aren’t at their peaks for sure.
Any chance we could get the team names in a column next to the pitcher’s name? That way you could see which rotations are most at risk. Then maybe in a later post you can see which clubs have the best minor league help and compare that to DL risk.
Tim Hudson seems a bit off as one of the highest risk candidates.
Looking at Hudson, he’s been quite durable over his career, but missed a lot of games over 2008-2009 due to TJ. Chris Carpenter’s overall playing time statistics over the past three years match Hudson’s closely, and he’s about the same age, but his injury history is more extensive.
In Hudson’s case, he’s hit for 2 DL trips for one injury, and a lot of missed starts, but how does past TJ surgery actually affect a pitcher’s future risk once he’s made a recovery? With pitchers, TJ is such a common injury that causes enough missed time that it may need to be treated as a separate case in order to prevent confounding the results.
At some point I will look just at TJ pitchers.
It needs a better mechanics component than prior DL trips. For example, does anyone really think there’s a 37% chance Barry Zito goes on the DL? The guy was taught perfect mechanics by Randy Jones when he was 10 years old and practiced them every day in the mirror for years so they’re constantly repeatable. One of the reasons the Giants gave him that contract is that Boras pushed hard on the “never been injured” theme. In fact, it’s kind of a problem because the Giants can’t hide him on the DL during his periods of ineffectiveness. (Zito’s protege, Dan Haren, who copied his warmup techniques in Oakland, is also unlikely to be injured).
Whenever there is a reliable and tested source for mechanics, I see no reason not to use it.
Just want to point out that not all injuries are due to poor mechanics. Over the last 3 years, players have ended up on the DL for: anxiety, appendectomy/appendicitis, blisters, bone chips, blood clots, bruises, bursitis, concussions, fractures, infections, shin splints, and shingles, among many others.
Does it all add up to a 37% chance of ending on the DL? Probably not. But it’s not 0% either. How much it should be is up for debate, and I’m sure Jeff would love to get more accurate numbers if he could find a way to do so.
Wow, only a 42% chance Johan Santana ends up on the DL! And here I was, thinking he was recovering from surgery and would be on the DL until at least August. Phew!
One possible future suggestion. Maybe instead of using DL trips in the equation, that number could be waited by injury type?
This of course would lead to a big increase in subjectivity going into the formula, but does a sprained ankle deserve the same waiting as a shoulder injury in terms of number of DL trips? (I realize this gets factored in a bit as it will impact games started)
I’m not sure how you would come up with the weightings, but I just thought I’d throw that out there as a possibility.
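One way the suggestion could look in practice is sketched below: each DL trip counts for more or less than 1 depending on where the injury was. The weights and location names are entirely hypothetical placeholders. Coming up with defensible values is the open question raised in this comment.

```python
# Hypothetical weights by injury location (placeholders, not derived from data).
HYPOTHETICAL_WEIGHTS = {"shoulder": 1.5, "elbow": 1.5, "ankle": 0.5}
DEFAULT_WEIGHT = 1.0  # assumed weight for any location not listed above


def weighted_dl_trips(trip_locations, weights=HYPOTHETICAL_WEIGHTS, default=DEFAULT_WEIGHT):
    """Sum DL trips, counting each trip by the weight of its injury location."""
    return sum(weights.get(location, default) for location in trip_locations)


# A pitcher with three trips: shoulder, ankle, and an unlisted location.
example_total = weighted_dl_trips(["shoulder", "ankle", "back"])
print(example_total)
```

The weighted total would then take the place of the raw DL trip count as an input to the formula.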
I could divide it by injury location pretty easily. Good idea.
Pretty easy – but the trick is to come up with proper weightings
(or ‘waitings’ as I seem to use interchangeably…. twice!)
How hard is it to process the data? Would it be possible to run what you did on 2007-2009 to see what would have been the 2010 projections? This may lead to ideas on additional factors if there is a huge mismatch… (of course that type of process can be problematic)
I could be wrong, but it seems like a decline in fastball velocity should be taken into account as well.
I ran the equation for Erik Bedard, just for kicks. 75%!
It’s not fair to run the numbers on Bedard because the calculations specifically excluded people who started less than 20 times the year before. We have no idea if the numbers under the calculated subset can be extrapolated to a more general population.
I didn’t mean to imply that Erik Bedard was a fair use of this calculation. It simply was out of curiosity since he is comically lacking in the three variables used. Obviously someone who doesn’t pitch the entire year (at least in the majors) won’t generate statistically significant data.
“I will tell you that I am 100% confident that the player will either be on the DL or won’t be.”
JZ’s response here is simply incorrect. It’s true that it’s not easy in practice to get the confidence interval for an estimated percent from logistic regression, but it’s incorrect to suggest that it’s not possible or that the question doesn’t make sense.
The presentation as a whole leaves a lot to be desired; as it is, I’m highly skeptical that what JZ did was done appropriately.
“‘I will tell you that I am 100% confident that the player will either be on the DL or won’t be.’
JZ’s response here is simply incorrect”
Are you saying that the chance that the player will either be on the DL or won’t be is NOT 100%? I hope you don’t work on any statistics I’m reading, in that case. The sum of probabilities of the sample space is generally accepted to be 100%.
JZ: would you be willing to post the data somewhere so I can take a look at it (I don’t know why sabermetricians don’t do this more often)? I realize my last comment may have come across as overly antagonistic, and I think it’d be more productive for me to take a stab at what you’re working on than to sit around whining.
send me an email at wydiyd ~ hotmail ~ com
Where is there good DL data?
Josh Hermsmeyer created a set from 2002 to 2009 last year and put it up on his website. I am not posting a link because it seems that someone hacked the site and if you go there, your computer will get attacked by viruses and spyware. I have tried to contact him, but he has not responded. I will not distribute the data without his permission, but would if he allowed me.
I collected the 2010 data and it is linked in part 1.
Jeff, thanks for the detailed explanation.
Wouldn’t this be something that might be better off modeled as a Decision Tree using a binary variable of “will player X go on the DL” or “will player X go on the DL for more than Y total days”?
Love the dataset and the avenue you are pursuing.