Jeff recently ran two articles about the season’s worst and best home runs, as measured by exit velocity. As a small addendum to that, I’d like to include both exit velocity and launch angle to try to determine the season’s least likely home run. So how do we do such a thing? Warning! I’m going to spend a bunch of time talking about R code and machine learning. If you want to skip all that, feel free to scroll down a bit. If, on the other hand, you’d like a more in-depth look at running machine learning on Statcast data, hit me up in the comments and I’ll write some more fleshed-out pieces.
As usual, we’re going to rely heavily on Baseball Savant. Thanks to their Statcast tool, we can download enough information to blindly feed into a machine-learning model to see how exit velocity and launch angle affect the probability of getting a home run. For instance, if we wanted to make a simple decision tree, we could do something like this.
# Read the data
my_csv <- 'hr_data.csv'
data_raw <- read.csv(my_csv)

# Make training and test sets
library(caret)
inTrain <- createDataPartition(data_raw$HR, p=0.7, list=FALSE)
training <- data_raw[inTrain,]
testing <- data_raw[-inTrain,]

# rpart == decision tree
method <- 'rpart'

# Train the model
modelFit <- train(HR ~ ., method=method, data=training)

# Show the decision tree
library(rattle)
fancyRpartPlot(modelFit$finalModel)
That looks like what we would expect. To hit a home run, you want to hit the ball really hard (over 100 MPH) and at the right angle (between 20 and 40 degrees). So far so good.
Now, decision trees are pretty and easy to interpret, but they're no good for what we want to do here, because (a) they're not as accurate as other, more sophisticated methods and (b) they don't give meaningful probability values. Let's instead use boosting and see how well it does on our test set.
# gbm == boosting
method <- 'gbm'
modelFit <- train(HR ~ ., method=method, data=training)

# How did this work on the test set?
predicted <- predict(modelFit, newdata=testing)

# Accuracy, precision, recall, F1 score
accuracy <- sum(predicted == testing$HR) / length(predicted)
precision <- posPredValue(predicted, testing$HR)
recall <- sensitivity(predicted, testing$HR)
F1 <- (2 * precision * recall) / (precision + recall)

print(accuracy)  # 0.973
print(precision) # 0.792
print(recall)    # 0.657
print(F1)        # 0.718
The accuracy number looks nice, but the precision and recall show that this is far from an amazingly predictive algorithm. Still, it's decent, and all we really want is a starting point for the conversation I started in the title, so let's apply this prediction to all home runs hit in 2016.
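That last step can be sketched roughly as follows. Instead of asking the boosted model for a hard yes/no label, we ask caret for class probabilities and sort the actual 2016 home runs by their predicted probability. This is only a sketch under assumptions: the filename `hr_2016.csv`, its column layout, and the class label `'yes'` are hypothetical stand-ins, not the actual data used here.

```r
library(caret)

# Hypothetical file: all 2016 home runs, with the same feature
# columns (exit velocity, launch angle) the model was trained on
hr_2016 <- read.csv('hr_2016.csv')

# type='prob' makes caret return class probabilities
# rather than predicted labels
probs <- predict(modelFit, newdata=hr_2016, type='prob')

# Attach the predicted HR probability and sort ascending;
# the top rows are the least likely home runs.
# The column name ('yes') depends on how HR is coded in the data.
hr_2016$hr_prob <- probs$yes
head(hr_2016[order(hr_2016$hr_prob), ])
```

Note that `type='prob'` requires the model to have been trained with class probabilities enabled (for a factor outcome, caret's gbm wrapper supports this).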
Once you throw out some fairly clear blips in the Statcast data, the "winner," with a 0.3% chance of turning into a home run, is this beauty from Darwin Barney.* This baby had an exit velocity of 91 MPH and a launch angle of 40.7 degrees. For fun, let's look at where similarly struck balls in the Rogers Centre ended up this year.
* I’m no bat-flip expert, but I believe you can see more of a flip of “I’m disgusted” than “yay” in that clip.
Congrats, Darwin Barney! There are no-doubters, then there are maybes, and then there are wall-scrapers. They all look the same in the box score, but you can't fool Statcast.