# Using Survival Analysis to Predict Chase Utley’s 200th Hit By Pitch

*Editor’s Note: This piece was adapted from a presentation at SaberSeminar 2018.*

At the beginning of the season, the Los Angeles Dodgers were coming off a Game Seven loss in the World Series, and Chase Utley was coming off a year in which he suffered his 199th career hit-by-pitch.

The Dodgers broadcast booth was making a big deal of Utley’s looming 200th hit-by-pitch. Beat writers were making a big deal of it. Even MLB, via Cut4, was making a big deal out of it. To be fair, it was kind of a big deal: as of the start of the season, Chase Utley was eighth all-time on the career hit-by-pitch list, and the only active player currently in the top 10. So I started wondering, in the midst of a 1:30 a.m. insomniac episode: can I predict when number 200 will come?

Now, of course, we know that Utley’s 200th HBP actually occurred on April 17. But let’s forget that for now. Let’s rewind a few months, and pretend, for the next few paragraphs, that it’s still early April.

To answer my question, I first need to determine what factors are associated with the risk of getting hit. I hypothesize a few:

- Year: Maybe he took more risks when he was younger. Who knows?
- Plate appearances since last HBP: Maybe he’s a little more cautious right after getting hit?
- Team (Phillies or Dodgers): Perhaps different teams have different attitudes on proximity to the plate? Perhaps different opponents matter?
- Current game score (Ahead, tied or behind): Perhaps there is more pressure to get on base when tied? Maybe when the team is behind?
- Current inning: Maybe there is a greater inclination to lean into one later in the game when there’s less time left and more urgency to win?
- Number of outs: Maybe he’d want to get on base more with more outs?
- Opposing team: We all know that certain teams have reputations for throwing inside.

To determine when Utley’s 200th HBP will come, we’re going to turn to our trusty friend, logistic regression. Logistic regression is great if you have an event with a binary outcome, as we do here. I pulled all of Utley’s 7,690 plate appearances up to April 3 from Baseball Reference. We’re treating each plate appearance as an event, and the outcome we’re interested in is HBP. I don’t care what else happens in each plate appearance; I only care whether he gets hit.
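Before adding any covariates, it's worth noting that an intercept-only logistic model just recovers the empirical hit rate. A minimal sketch, using the PA and HBP counts quoted above (everything else is illustrative):

```python
import math

pa = 7690   # Utley's plate appearances through April 3 (per Baseball Reference)
hbp = 199   # career hit-by-pitches at that point

p = hbp / pa                       # empirical per-PA hit rate, about 2.6%
intercept = math.log(p / (1 - p))  # intercept of an intercept-only logistic model

# Exponentiating the intercept recovers the odds, which for a rare
# event like this sit very close to the probability itself.
odds = math.exp(intercept)
```

Every covariate we add to the model is then a tweak on top of this baseline log odds.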

Logistic regression (and its sibling, linear regression) is a useful tool, but it’s still just a tool. It’ll take whatever data you put in and churn out a model, but that doesn’t necessarily mean the model you’ve gotten is correct. So we’re going to have to filter out all the factors that don’t actually matter. This process is called model selection, and it’s more of an art than a science. There are a few different ways to do this automatically — forward selection, backward selection, stepwise selection — but there’s an increased risk of a Type I error when you do so, especially when you have a lot of variables. Or put more plainly, because we’re making so many comparisons between potential models to determine which one is better, we run a higher risk of accidentally identifying a variable as important when it really isn’t.

Let’s say we’re looking at an outcome, X, and we have the variables A, B, and C. Our base model would look like this:

log(odds of X) = β₀ + β₁A + β₂B + β₃C

Where each β represents our best estimate for how the log odds of X change with a one-unit increase in A, B, or C. When you do forward selection, you start out with an empty model and add variables one at a time to see if each new model does a better job at predicting X than the one before it. Is A alone a better model than an empty model? Is A + B better than just A? Is A + B + C better than A + B? Backward selection works the exact opposite way: we start with A + B + C and drop variables to see if the resulting model is any better.

Stepwise selection is a mix of both: you start with a full model (A + B + C) but after dropping variables, you’ll add in some prior ones just in case they matter now. We might find A + B is better than A + B + C, but A is better than A + B, but A + C is better than A.
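To make that wandering concrete, here's a toy stepwise loop over invented AIC-style scores (lower is better). With real data the scores would come from fitted models; the numbers below are hypothetical, rigged only so the greedy search has somewhere to wander:

```python
def stepwise_select(all_vars, score):
    """Greedy stepwise selection. Starting from the full model, try
    dropping or re-adding one variable at a time, and keep any move
    that improves the score (lower is better, like AIC)."""
    current = frozenset(all_vars)
    improved = True
    while improved:
        improved = False
        candidates = [current - {v} for v in current]
        candidates += [current | {v} for v in all_vars if v not in current]
        best = min(candidates, key=score)
        if score(best) < score(current):
            current, improved = best, True
    return current

# Hypothetical model scores (these numbers are invented):
scores = {
    frozenset(): 100, frozenset({"A"}): 82, frozenset({"B"}): 95,
    frozenset({"C"}): 97, frozenset({"A", "B"}): 85,
    frozenset({"A", "C"}): 78, frozenset({"B", "C"}): 92,
    frozenset({"A", "B", "C"}): 90,
}
chosen = stepwise_select({"A", "B", "C"}, scores.__getitem__)
```

Note that every iteration of that `while` loop scores several candidate models, which is exactly where the pile of comparisons comes from.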

That ends up being a lot of comparisons. Each time you run one, there’s a risk of obtaining a false positive — saying something’s important when it actually isn’t. The more times you run the comparison, the bigger that risk gets. And considering that we have a base model with a lot of variables — we already have 31 just to account for opposing teams (rest in peace, Montreal Expos) — that’s a lot of comparisons we’d have to run.
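The inflation is easy to quantify: with a per-test false-positive rate of alpha, the chance of at least one false positive across k independent comparisons is 1 - (1 - alpha)^k. A quick sketch (the independence assumption is a simplification):

```python
alpha = 0.05  # conventional per-comparison false-positive rate

def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

# One comparison vs. the dozens a selection procedure might run
rates = {k: familywise_error(k) for k in (1, 10, 31)}
```

With 31 comparisons, the chance of at least one spurious "significant" variable is already around 80 percent.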

So we’re going to do this a little more holistically, which requires a little more thought. For example, the contribution of facing a certain American League team may appear statistically significant if we’re going by p-values alone, but is it meaningful? Does it really mean anything when Utley faced that team only once every three years, for a total of 38 PAs? Or is that just noise from a small sample size?

After examining our base model (with all possible variables included) and the maximum likelihood estimates obtained from it, it turns out there are actually only two variables that contribute strongly to the outcome in which we are interested:

- The inning number, or how late it is in the game
- If the team he’s facing is the New York Mets

Yes, you read that right. The New York Mets.

That gives us the following model:

log(odds of HBP) = β₀ + β₁ · INNING NUMBER + β₂ · METS

Note that the variable INNING NUMBER is a numerical variable (1, 2, 3, etc.), but the variable METS is binary (1 or 0; yes or no).

What this equation gives us is the log odds of Utley getting hit in an individual plate appearance. But since getting hit is a relatively rare event, the odds closely approximate the probability, so we can also read them as the risk of him getting hit, which is often easier to understand — for me included.

We’ll have to exponentiate our coefficients to get anything meaningful out of those numbers, but doing so gives us the following information about the risk of Utley being hit:

- There is a 2 percent underlying risk in each plate appearance
- We see a 6.8 percent increase in risk with each inning that passes
- We see a shocking 385 percent increase in risk when Utley is facing the New York Mets

Yes, you read that right. An almost 400 percent increase when the Mets are involved.
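To see what those percentages mean for a single plate appearance, we can push the coefficients back through the logistic function. The coefficients below are back-derived from the quoted percentages, so treat them as illustrative rather than the model's exact fitted values:

```python
import math

# Illustrative coefficients reconstructed from the percentages above;
# the actual fitted values may differ slightly due to rounding.
b0 = math.log(0.02)         # ~2% baseline odds of an HBP per PA
b_inning = math.log(1.068)  # +6.8% to the odds with each inning
b_mets = math.log(4.85)     # +385% to the odds against the Mets

def hbp_probability(inning, facing_mets):
    """Turn the model's log odds back into a per-PA probability."""
    log_odds = b0 + b_inning * inning + b_mets * int(facing_mets)
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Ninth inning against the Mets vs. first inning against anyone else
risky = hbp_probability(9, True)
calm = hbp_probability(1, False)
```

Under these assumptions, a ninth-inning plate appearance against the Mets carries several times the hit-by-pitch risk of a first-inning plate appearance against anyone else.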

You might be saying, “This is cool, I guess, but that doesn’t tell me when Utley’s going to be hit.” (Yes, you already know when; it’s generous of you to keep pretending it’s still April.) And you’re right. What this model gives us are the situations where he’s most likely to be hit, but nothing concrete about the when.

So we’ll switch tools. For my next trick, I’m going to be using something called survival analysis, which is also referred to as time-to-event analysis. As the name suggests, time-to-event analysis is used to analyze the amount of time until an event happens. Instead of taking each plate appearance as it comes, we’ll string them together until we reach one where Utley gets hit. Since we know every single instance of Utley getting plunked, this is pretty straightforward.
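The standard tool here is the Kaplan-Meier estimator. A minimal pure-Python sketch, with support for censoring (which matters when not every subject experiences the event); the toy gap lengths at the bottom are invented for illustration:

```python
def kaplan_meier(durations, event_observed):
    """Kaplan-Meier estimate of the survival function.

    durations: time until the event or until observation ends
               (here, plate appearances until the next HBP)
    event_observed: 1 if the event actually happened, 0 if the
                    subject was censored before it could
    Returns (time, survival probability) pairs, one per event time.
    """
    pairs = sorted(zip(durations, event_observed))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = at_this_time = 0
        while i < len(pairs) and pairs[i][0] == t:
            events += pairs[i][1]
            at_this_time += 1
            i += 1
        if events:  # censored-only times don't change the curve
            survival *= 1 - events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_this_time
    return curve

# Toy data: gaps (in PAs) between consecutive HBPs, all events observed.
toy_curve = kaplan_meier([4, 9, 9, 23], [1, 1, 1, 1])
```

With no censoring, as in the HBP data, this reduces to the empirical distribution of gap lengths; the censoring machinery earns its keep in examples like the one later in this piece.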

I’ve included the quartile estimates in an attempt to save us from trying to stick rulers up against our screens.

Percent | Point Estimate
---|---
25% | 9 PAs
50% | 23 PAs
75% | 49 PAs

What this graph shows us is that, first and foremost, Utley’s probability of getting hit does not increase linearly as time goes on. In fact, 25 percent of the time, he gets hit within nine plate appearances. Half the time, it’s within 23 plate appearances. And 75 percent of the time, it’s within 49 plate appearances.

The main takeaways from these two models? First, Utley gets hit approximately once every 38 plate appearances, but it’s not evenly distributed. He’s hit within 23 plate appearances of his last HBP 50 percent of the time. He’s slightly more likely (about 7 percent per inning) to get hit later in the game. And lastly, he’s almost five times more likely to get hit while facing the New York Mets.

As for actual number 200 on April 17? Well, it happened exactly 39 plate appearances after his 199th. The Dodgers were playing in San Diego; Utley got hit in the top of the second. It was a fairly average situation, in terms of baseline risk: early in the game, and not against the New York Mets. And it was maybe a bit later than one would expect, given our survival estimates, but still almost exactly his career pace.

I’ll show another example of the survival model, now, just to prove that I’m not solely obsessed with Chase Utley. This time, we’re going to look at recovery from Tommy John surgery. The data for this graph come from Jon Roegele’s amazing Tommy John surgery list.

Percent | Point Estimate
---|---
25% | 15 months
50% | 43 months
75% | —

In this case, return to play is defined as return to play at the same level. For example, if a major league pitcher gets surgery and then does rehab games in the minors before going back to the big leagues, his return-to-play date is the day he’s reinstated to the majors, not his rehab stint. You’ll notice two big differences: this curve flattens out a lot earlier, and there is no 75 percent quartile estimate. Both happen for the same reason: unfortunately, not everyone returns to play from Tommy John surgery. Of those who do return, 25 percent do so within 15 months, and 50 percent have done so within 43 months. In this dataset, nobody has returned to play after that 43-month mark.
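That missing quartile falls naturally out of how quantiles are read off a survival curve: you look for the first time the curve drops to 1 - q, and if heavy censoring keeps it from ever getting there, the quantile simply doesn't exist. A sketch with invented numbers (not Roegele's actual data):

```python
def survival_quantile(curve, q):
    """Smallest time at which the survival curve reaches (1 - q) or below.

    curve: (time, survival probability) pairs with survival decreasing.
    Returns None when the curve never falls that far, which is exactly
    what happens when enough subjects are censored without the event.
    """
    for t, s in curve:
        if s <= 1 - q:
            return t
    return None

# A toy curve that flattens out at 40% because some pitchers never return:
curve = [(15, 0.75), (30, 0.55), (43, 0.40)]
```

On this toy curve the 25th and 50th percentiles exist, but the 75th is undefined, mirroring the table above.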

So, to sum things up: I’ve created two models that can be used for a variety of events, both rare and non-rare. They can both be used to determine risk factors for the events, which is pretty neat. The survival model, as I’ve shown, can be used for an individual or a group of people for any event that has a temporal component to it. Want to compare recovery times for TJS between surgeons? You can do that. Want to see if low-round high school draft picks make the majors faster than players who declined to sign and were drafted again out of college? Absolutely possible. As we demonstrated, the approach has a variety of baseball applications.

And finally, and perhaps least surprisingly, we confirmed the New York Mets have a serious grudge against Chase Utley.
