Author Archive

The Least Interesting Player of 2016

Baseball is great! We all love baseball. That’s why we’re here. We love everything about it, but we especially love the players who stick out. You know, the ones who’ve done something we’ve never seen before, or the ones that make us think, “Wow, I didn’t know that could happen.” It’s fun to look at players who are especially good — or, let’s face it, especially bad — at some aspect of this game. They’re the most interesting part of this game we love.

But not everyone can be interesting. Some players are just plain uninteresting! Like this guy.

OMG taking a pitch? That’s boring. You’re boring everybody. Quit boring everyone!


You caught a routine fly ball? YAWN! Wake me when something interesting happens.

But it’s hopeless; nothing interesting will ever happen with Stephen Piscotty. I’m sure the two GIFs above have convinced you that he was the least interesting player in baseball last year. But, on the off-chance that you have some lingering doubts, we can quantify it. I’ve made a custom leaderboard of various statistics for all qualified batters in 2016. For each of these statistics, I computed the z-score and the square of the z-score. In this way, we can boil down how interesting each player was to one number — the sum of the squared z-scores. The idea is that if a player was interesting in even one of these statistics, they’d have a high number there. Here are the results:

Click through for an interactive version

I don’t need to tell you who the guy on the far right is. On the flip side, though, there are two data points on the left that stick out. The slightly higher of the two is Marcell Ozuna, with an interest score of 1.627. The one on the very far left is Stephen Piscotty, with an interest score of 0.997. That’s right — if you sum the squares of his z-scores, you don’t even get to 1! This is as boring and average as baseball players get.
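If you want to compute an interest score like this from your own custom leaderboard export, the core computation fits in a few lines of R. Here’s a minimal sketch; the stat columns below are placeholders, so swap in whichever statistics your export contains:

# Minimal sketch of the interest score: z-score each stat across qualified
# batters, square it, and sum across stats. Column names are placeholders.
leaders <- read.csv("leaders.csv", fileEncoding = "UTF-8-BOM")
statCols <- c("AVG", "OBP", "SLG", "ISO", "WAR")   # adjust to your export
z <- scale(leaders[, statCols])      # convert each column to z-scores
leaders$interest <- rowSums(z^2)     # sum of squared z-scores per player
# The least interesting players float to the top
head(leaders[order(leaders$interest), c("Name", "interest")])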

Where the real fun begins, though, is when you start making scatter plots of these statistics against each other. I’ve made an interactive version where you can play around with making these yourself, but here are a few highlights:


AVG vs. SLG


IFFB% vs. OPS


ISO vs. wRC+

Pretty boring, right? But wait, there’s more! Let’s investigate a little further what went into his interest score. Remember how we summed his squared z-scores and got a value below 1? Well, let’s look at the individual components that went into that sum.

The Most Boring Table Ever
Statistic Squared z-score
LD% 0.108
GB% 0.002
PA 0.296
G 0.220
OPS 0.001
BB% 0.057
SLG 4.888e-05
WAR 0.007
BABIP 0.141
K% 0.103
IFFB% 0.0004
ISO 5.313e-05
FB% 0.007
wOBA 0.022
AVG 1.69e-29
wRC+ 0.025
OBP 0.006

Yes, you’re reading that right — where he stood out the most was in games played and plate appearances. Yay, we got to watch that much more of the boredom! Also, I think it is especially apt that his AVG was EXACTLY league average.

All right, time to step back and be serious for a second. As Brian Kenny is always reminding us, there is great value in being a league-average hitter. Piscotty was worth 2.8 WAR last year, just his second year in the league. He’s already a very valuable contributor to a very good team. Maybe it’s time we started noticing guys who do everything just as well as everyone else, and value their contributions too?

(Nah, I’m going to go back and pore over Barry Bonds’s early-2000s stats for the next few hours.)

All the code used to generate the data and visualizations for this post can be found on my GitHub.


Basic Machine Learning With R (Part 2)

(For part 1 of this series, click here)

Last time, we learned how to run a machine-learning algorithm in just a few lines of R code. But how can we apply that to actual baseball data? Well, first we have to get some baseball data. There are lots of great places to get some — Bill Petti’s post I linked to last time has some great resources — but heck, we’re on FanGraphs, so let’s get the data from here.

You probably know this, but it took forever for me to learn it — you can make custom leaderboards here at FanGraphs and export them to CSV. This is an amazing resource for machine learning, because the data is nice and clean, and in a very user-friendly format. So we’ll do that to run our model, which today will be to try to predict pitcher WAR from the other counting stats. I’m going to use this custom leaderboard (if you’ve never made a custom leaderboard before, play around there a bit to see how you can customize things). If you click on “Export Data” on that page you can download the CSV that we’ll be using for the rest of this post.



Let’s load this data into R. Just like last time, all the code presented here is on my GitHub. Reading CSVs is super easy — assuming you named your file “leaderboard.csv”, it’s just this:

pitcherData <- read.csv('leaderboard.csv',fileEncoding = "UTF-8-BOM")

Normally you wouldn’t need the “fileEncoding” bit, but for whatever reason FanGraphs CSVs use a particularly annoying character encoding. You may also need to use the full path to the file if your working directory is not where the file is.

Let’s take a look at our data. Remember the “head” function we used last time? Let’s change it up and use the “str” function this time.

> str(pitcherData)
'data.frame':	594 obs. of  16 variables:
 $ Season  : int  2015 2015 2014 2013 2015 2016 2014 2014 2013 2014 ...
 $ Name    : Factor w/ 231 levels "A.J. Burnett",..: 230 94 47 ...
 $ Team    : Factor w/ 31 levels "- - -","Angels",..: 11 9 11  ...
 $ W       : int  19 22 21 16 16 16 15 12 12 20 ...
 $ L       : int  3 6 3 9 7 8 6 4 6 9 ...
 $ G       : int  32 33 27 33 33 31 34 26 28 34 ...
 $ GS      : int  32 33 27 33 33 30 34 26 28 34 ...
 $ IP      : num  222 229 198 236 232 ...
 $ H       : int  148 150 139 164 163 142 170 129 111 169 ...
 $ R       : int  43 52 42 55 62 53 68 48 47 69 ...
 $ ER      : int  41 45 39 48 55 45 56 42 42 61 ...
 $ HR      : int  14 10 9 11 15 15 16 13 10 22 ...
 $ BB      : int  40 48 31 52 42 44 46 39 58 65 ...
 $ SO      : int  200 236 239 232 301 170 248 208 187 242 ...
 $ WAR     : num  5.8 7.3 7.6 7.1 8.6 4.5 6.1 5.2 4.1 4.6 ...
 $ playerid: int  1943 4153 2036 2036 2036 12049 4772 10603 ...

Sometimes the CSV needs cleaning up, but this one is not so bad. Other than “Name” and “Team”, everything shows as a numeric data type, which isn’t always the case. For completeness, I want to mention that if a column that was actually numeric showed up as a factor variable (this happens A LOT), you would convert it in the following way:

pitcherData$WAR <- as.numeric(as.character(pitcherData$WAR))

Now, which of these potential features should we use to build our model? One quick way to explore good possibilities is by running a correlation analysis:

cor(subset(pitcherData, select=-c(Season,Name,Team,playerid)))

Note that in this line, we’ve removed the columns that are either non-numeric or are totally uninteresting to us. The “WAR” column in the result is the one we’re after — it looks like this:

            WAR
W    0.50990268
L   -0.36354081
G    0.09764845
GS   0.20699173
IP   0.59004342
H   -0.06260448
R   -0.48937468
ER  -0.50046647
HR  -0.47068461
BB  -0.24500566
SO   0.74995296
WAR  1.00000000

Let’s take a first crack at this prediction with the columns that show the most correlation (both positive and negative): Wins, Losses, Innings Pitched, Earned Runs, Home Runs, Walks, and Strikeouts.

goodColumns <- c('W','L','IP','ER','HR','BB','SO','WAR')
library(caret)
inTrain <- createDataPartition(pitcherData$WAR,p=0.7,list=FALSE)
training <- pitcherData[inTrain,goodColumns]
testing <- pitcherData[-inTrain,goodColumns]

You should recognize this setup from what we did last time. The only difference here is that we’re choosing which columns to keep; with the iris data set we didn’t need to do that. Now we are ready to run our model, but which algorithm do we choose? Lots of ink has been spilled about which is the best model to use in any given scenario, but most of that discussion is wasted. As far as I’m concerned, there are only two things you need to weigh:

  1. how *interpretable* you want the model to be
  2. how *accurate* you want the model to be

If you want interpretability, you probably want linear regression (for regression problems) and decision trees or logistic regression (for classification problems). If you don’t care about other people being able to make heads or tails out of your results, but you want something that is likely to work well, my two favorite algorithms are boosting and random forests (these two can do both regression and classification). Rule of thumb: start with the interpretable ones. If they work okay, then there may be no need to go to something fancy. In our case, there already is a black-box algorithm for computing pitcher WAR, so we don’t really need another one. Let’s try for interpretability.

We’re also going to add one other wrinkle: cross-validation. I won’t say too much about it here except that in general you’ll get better results if you add the “trainControl” stuff. If you’re interested, please do read about it on Wikipedia.

method = 'lm' # linear regression
ctrl <- trainControl(method = 'repeatedcv',number = 10, repeats = 10)
modelFit <- train(WAR ~ ., method=method, data=training, trControl=ctrl)

Did it work? Was it any good? One nice quick way to tell is to look at the summary.

> summary(modelFit)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.38711 -0.30398  0.01603  0.31073  1.34957 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.6927921  0.2735966  -2.532  0.01171 *  
W            0.0166766  0.0101921   1.636  0.10256    
L           -0.0336223  0.0113979  -2.950  0.00336 ** 
IP           0.0211533  0.0017859  11.845  < 2e-16 ***
ER           0.0047654  0.0026371   1.807  0.07149 .  
HR          -0.1260508  0.0048609 -25.931  < 2e-16 ***
BB          -0.0363923  0.0017416 -20.896  < 2e-16 ***
SO           0.0239269  0.0008243  29.027  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4728 on 410 degrees of freedom
Multiple R-squared:  0.9113,	Adjusted R-squared:  0.9097 
F-statistic: 601.5 on 7 and 410 DF,  p-value: < 2.2e-16

Whoa, that’s actually really good. The adjusted R-squared is over 0.9, which is fantastic. We also get something else nice out of this, which is the significance of each variable, helpfully indicated by a zero-to-three-star system. Four variables earned three stars; what would happen if we built our model with just those features? It would certainly be simpler; let’s see if it’s anywhere near as good.

> model2 <- train(WAR ~ IP + HR + BB + SO, method=method, data=training, trControl=ctrl)
> summary(model2)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.32227 -0.27779 -0.00839  0.30686  1.35129 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.8074825  0.2696911  -2.994  0.00292 ** 
IP           0.0228243  0.0015400  14.821  < 2e-16 ***
HR          -0.1253022  0.0039635 -31.614  < 2e-16 ***
BB          -0.0366801  0.0015888 -23.086  < 2e-16 ***
SO           0.0241239  0.0007626  31.633  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4829 on 413 degrees of freedom
Multiple R-squared:  0.9067,	Adjusted R-squared:  0.9058 
F-statistic:  1004 on 4 and 413 DF,  p-value: < 2.2e-16

Awesome! The results still look really good. But of course, we need to be concerned about overfitting, so we can’t be 100% sure this is a decent model until we evaluate it on our test set. Let’s do that now:

# Apply to test set
predicted2 <- predict(model2,newdata=testing)
# R-squared
cor(testing$WAR,predicted2)^2 # 0.9108492
# Plot the predicted values vs. actuals
plot(testing$WAR,predicted2)



Fantastic! This is as good as we could have expected from this, and now we have an interpretable version of pitcher WAR, specifically,

WAR = -0.8 + 0.02 * IP - 0.13 * HR - 0.04 * BB + 0.02 * K
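As a quick sanity check of how readable that is, here’s a tiny sketch that plugs a made-up stat line into the rounded coefficients (the helper function and the numbers are purely illustrative):

# Hypothetical helper applying the rounded coefficients from the model above
# (SO is the strikeout term, written as K in the formula)
approxWAR <- function(IP, HR, BB, SO) {
  -0.8 + 0.02 * IP - 0.13 * HR - 0.04 * BB + 0.02 * SO
}
# A made-up 200-inning season with 20 HR, 50 BB, and 200 K
approxWAR(IP = 200, HR = 20, BB = 50, SO = 200)   # 2.6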

Most of the time, machine learning does not come out as nice as it has in this post and the last one, so don’t expect miracles every time out. But you can occasionally get some really cool results if you know what you’re doing, and at this point, you kind of do! I have a few ideas about what to write about for part 3 (likely the final part), but if there’s something you really would like to know how to do, hit me up in the comments.


Basic Machine Learning With R (Part 1)

You’ve heard of machine learning. How could you not have? It’s absolutely everywhere, and baseball is no exception. It’s how Gameday knows how to tell a fastball from a cutter and how the advanced pitch-framing metrics are computed. The math behind these algorithms can go from the fairly mundane (linear regression) to seriously complicated (neural networks), but good news! Someone else has wrapped up all the complex stuff for you. All you need is a basic understanding of how to approach these problems and some rudimentary programming knowledge. That’s where this article comes in. So if you like the idea of predicting whether a batted ball will become a home run or predicting time spent on the DL, this post is for you.

We’re going to use R and RStudio to do the heavy lifting for us, so you’ll have to download them (they’re free!). The download process is fairly painless and well-documented all over the internet. If I were you, I’d start with this article. I highly recommend reading at least the beginning of that article; it not only has an intro to getting started with R, but also information on getting baseball-related data and some other indispensable links. Once you’ve finished downloading RStudio and reading that article, head back here and we’ll get started! (If you don’t want to download anything for now, you can run the code from this first part on R-Fiddle — though you’ll want to download R in the long run if you get serious.)

Let’s start with some basic machine-learning concepts. We’ll stick to supervised learning, of which there are two main varieties: regression and classification. To know what type of learning you want, you need to know what problem you’re trying to solve. If you’re trying to predict a number — say, how many home runs a batter will hit or how many games a team will win — you’ll want to run a regression. If you’re trying to predict an outcome — maybe if a player will make the Hall of Fame or if a team will make the playoffs — you’d run a classification. These classification algorithms can also give you probabilities for each outcome, instead of just a binary yes/no answer (so you can give a probability that a player will make the Hall of Fame, say).

Okay, so the first thing to do is figure out what problem you want to solve. The second part is figuring out what goes into the prediction. The variables that go into the prediction are called “features,” and feature selection is one of the most important parts of creating a machine-learning algorithm. To predict how many home runs a batter will hit, do you want to look at how many triples he’s hit? Maybe you look at plate appearances, or K%, or handedness … you can go on and on, so choose wisely.

Enough theory for now — let’s look at a specific example using some real-life R code and the famous “iris” data set. This code and all subsequent code will be available on my GitHub.

data(iris)
library('caret')
inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)
training <- iris[inTrain,]
model <- train(Species~.,data=training,method='rf')

Believe it or not, in those five lines of code we have run a very sophisticated machine-learning model on a subset of the iris data set! Let’s take a more in-depth look at what happened here.

data(iris)

This first line loads the iris data set into a data frame — a variable type in R that looks a lot like an Excel spreadsheet or CSV file. The data is organized into columns and each column has a name. That first command loaded our data into a variable called “iris.” Let’s actually take a look at it; the “head” function in R shows the first six rows of the dataset by default — type

head(iris)

into the console.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

As you hopefully read in the Wikipedia page, this data set consists of various measurements of three related species of flowers. The problem we’re trying to solve here is to figure out, given the measurements of a flower, which species it belongs to. Loading the data is a good first step.

library(caret)

If you’ve been running this code while reading this post, you may have gotten the following error when you got here:

Error in library(caret) : there is no package called 'caret'

This is because, unlike the iris data set, the “caret” library doesn’t ship with R. That’s too bad, because the caret library is the reason we’re using R in the first place, but fear not! Installing missing packages is dead easy, with just the following command

install.packages('caret')

or, if you have a little time and want to ensure that you don’t run into any issues down the road:

install.packages("caret", dependencies = c("Depends", "Suggests"))

The latter command installs a bunch more stuff than just the bare minimum, and it takes a while, but it might be worth it if you’re planning on doing a lot with this package. Note: you should be planning to do a lot with it — this library is a catch-all for a bunch of machine-learning tools and makes complicated processes look really easy (again, see above: five lines of code!).

inTrain <- createDataPartition(iris$Species,p=0.7,list=FALSE)

We never want to train our model on the whole data set, a concept I’ll get into more a little later. For now, just know that this line of code randomly selects 70% of our data set to use to train the model. Note also R’s “<-” notation for assigning a value to a variable.

training <- iris[inTrain,]

Whereas the previous line chose which rows we’d use to train our model, this line actually creates the training data set. The “training” variable now has 105 randomly selected rows from the original iris data set (you can again use the “head” function to take a look at the first few).

model <- train(Species~.,data=training,method='rf')

This line of code runs the actual model! The “train” function is the model-building one. “Species~.” means we want to predict the “Species” column from all the others. “data=training” means the data set we want to use is the one we assigned to the “training” variable earlier. And “method='rf'” means we will use the very powerful and very popular random-forest method to do our classification. If, while running this command, R tells you it needs to install something, go ahead and do it. R will run its magic and create a model for you!

Now, of course, a model is no good unless we can apply it to data that the model hasn’t seen before, so let’s do that now. Remember earlier when we only took 70% of the data set to train our model? We’ll now run our model on the other 30% to see how good it was.

# Create the test set to evaluate the model
# Note that "-inTrain" with the minus sign pulls everything NOT in the training set
testing <- iris[-inTrain,]
# Run the model on the test set
predicted <- predict(model,newdata=testing)
# Determine the model accuracy
accuracy <- sum(predicted == testing$Species)/length(predicted)
# Print the model accuracy
print(accuracy)

Pretty good, right? You should get a very high accuracy doing this, likely over 95%*. And it was pretty easy to do! If you want some homework, type the following command and familiarize yourself with all its output by Googling any words you don’t know:

confusionMatrix(predicted, testing$Species)

*I can’t be sure because of the randomness that goes into both choosing the training set and building the model.
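If you’d like your runs to be repeatable despite that randomness, one common trick (not used in the code above) is to set R’s random seed before partitioning and training:

# Setting a seed makes the random 70/30 split and the random-forest fit
# reproducible from run to run (42 is an arbitrary choice)
set.seed(42)
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[inTrain, ]
model <- train(Species ~ ., data = training, method = 'rf')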

Congratulations! You now know how to do some machine learning, but there’s so much more to do. Next time we’ll actually play around with some baseball data and explore some deeper concepts. In the meantime, play around with the code above to get familiar with R and RStudio. Also, if there’s anything you’d specifically like to see, leave me a comment and I’ll try to get to it.


The Worst Pitch in Baseball

Quick thought experiment for you: what’s the worst pitch a pitcher can throw? You might say “one that results in a home run” but I disagree. Even in batting practice, hitters don’t hit home runs all the time, right? In fact, let’s quantify it — according to Baseball Savant there were 806 middle-middle fastballs between 82 and 88 MPH thrown in 2016. Here are the results of those pitches:

2016 Grooved Fastballs
Result Count Probability
Strike 296 36.7%
Ball 1 0.1%
Out 191 23.7%
Single 49 6.1%
Double 17 2.1%
Triple 4 0.5%
Home Run 36 4.5%
Foul 212 26.3%
SOURCE: Baseball Savant
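Those probabilities are just each outcome’s count divided by the 806 total; here’s a quick check of the aggregates in R:

# Outcome counts for the 806 grooved fastballs in the table above
counts <- c(Strike = 296, Ball = 1, Out = 191, Single = 49,
            Double = 17, Triple = 4, HomeRun = 36, Foul = 212)
round(100 * counts / sum(counts), 1)                           # the probability column
sum(counts[c("Strike", "Ball", "Out", "Foul")]) / 806          # ~0.87 neutral or better
sum(counts[c("Single", "Double", "Triple", "HomeRun")]) / 806  # ~0.13 hits allowed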

So roughly 87% of the time we have a neutral or positive result for the pitcher, and the remaining 13% of the time something bad happens. Not great, but when a pitcher *does* give up a homer on one of these pitches, there wasn’t really more than a 5% chance of that happening.

No, for my money, the worst thing a pitcher can do is to throw an 0-2 pitch that has a high probability of hitting a batter. The pitcher has a huge built-in advantage on 0-2, and by throwing this pitch he throws it all away and gives the batter a free base (or, at best, runs the count to 1-2). But everyone makes mistakes.


That’s Clayton Kershaw, hitting the very first batter he saw in 2015 with an 0-2 pitch. Here’s Vin Scully, apparently unwilling to believe Kershaw could make such a mistake, calling the pitch:

Strike two pitch on the way, in the dirt, check swing, and it might have hit him on the foot, and I believe it did. So Wil Myers, grazed by a pitch on an 0-2 count, hit on the foot and awarded first base. So Myers…and actually, he got it on his right knee when you look at the replay.

I was expecting more of a reaction from Kershaw — for reference, check out this reaction to throwing Freddie Freeman a sub-optimal pitch — but we didn’t get one. I wouldn’t worry about him, though — he’s since thrown 437 pitches on 0-2 counts without hitting a batter.

Kershaw is pretty good at avoiding this kind of mistake, but the true champion of 0-2 HBP avoidance is Yovani Gallardo*, who has thrown well over 1,200 0-2 pitches in his career without hitting a batter once. Looking at a heat map of his 0-2 pitches to right-handers (via Baseball Savant), you can see why — it’s hard to hit a batter when you’re (rightly) burying the pitch in the opposite batter’s box.

*Honorable mention: Mat Latos, who has thrown nearly as many 0-2 pitches as Gallardo without hitting a batter

Of course, 0-2 HBPs are fairly rare events, so it shouldn’t be too surprising to find that a few pitchers have managed to avoid them entirely. In fact, most pitchers hit batters on well under 1% of their 0-2 pitches. To get a global overview of how all pitchers did, let’s look at a scatter plot of average 0-2 velocity versus percent of HBPs in such counts over the past three years (click through for an interactive version):

I think one of these data points will stick out to you a bit.

I hate to pick on the guy, but that’s Nick Franklin, throwing the only 0-2 pitch of his life, and hitting Danny Espinosa when a strikeout would have (mercifully) ended the top of the ninth of this game against the Nationals. Interestingly, Franklin was much more demonstrative than Kershaw was, clapping his hands together and then swiping at the ball when it came back from the umpire. He probably knew that was his best opportunity to record a strikeout in the big leagues, and instead he gave his man a free base. Kevin Cash! Give this man another chance to redeem himself. He doesn’t want to be this kind of outlier forever.


Happy Trails, Josh Johnson

Josh Johnson could pitch. In this decade, seven players have put up a season in which they threw 180+ innings with a sub-60 ERA-: Clayton Kershaw (three times), Felix Hernandez (twice), Kyle Hendricks and Jon Lester in 2016, Zack Greinke and Jake Arrieta in 2015, and Josh Johnson in 2010. That was the second straight excellent year for Johnson, who made the All-Star team in both 2009 and 2010 and finished fifth in the Cy Young balloting in the latter year. Early in 2011 he just kept it going, with a 0.88 ERA through his first few starts. In four of his first five starts that year, he took a no-hitter into the fifth inning. Dusty Baker — a man who has seen quite a few games of baseball in his life and normally isn’t too effusive in his praise of other teams’ players — had this to say at that point:

“That guy has Bob Gibson stuff. He has power and finesse, instead of just power. That’s a nasty combination.”

It seemed like he was going to dominate the NL East for years to come.

Josh Johnson felt pain. His first Tommy John surgery was in 2007, when he was just 23. His elbow had been bothering him for nearly a year before he finally got the surgery. His manager was optimistic at the time:

“I think he’ll be fine once he gets that rehab stuff out of the way,” Gonzalez said. “You see guys who underwent Tommy John surgery, they come back and pitch better.”

But the hits kept coming. His excellent 2010 season was cut short by shoulder issues (though he didn’t go on the DL), and shoulder trouble ended his promising 2011 season as well. It had been bothering him all season, but he pitched through the pain for two months.

“It took everything I had to go and say something,” he said. “Once I did, it was something lifted off my shoulders. Let’s get it right and get it back to feeling like it did at the beginning of the season.”

“I’m hoping [to return by June 1st],” he said. “You never know with this kind of stuff. You’ve got to get all the inflammation out of there. From there it should be fine.”

That injury cost him the rest of the season.

Josh Johnson loved baseball. Think about something you loved doing, and your reaction if someone told you that you had to undergo painful surgery with a 12-month recovery time in order to continue doing it. Imagine you did that, but then later on, someone told you that you had to do it again if you wanted even an outside chance of performing that activity, but the odds were pretty low. Josh Johnson had three Tommy John surgeries, because they gave him a glimmer of hope of continuing to play baseball.

Josh Johnson had a great career. It’s only natural to look at a career cut short by injuries and ask “what if?” but he accomplished plenty. He struck out Derek Jeter and Ichiro in an All-Star Game, threw the first pitch in Marlins Park, and made over $40 million playing the game he loved. He even lucked his way into hitting three home runs. Now he’s a 33-year-old millionaire in retirement; I think he did all right.


Hierarchical Clustering For Fun and Profit

Player comps! We all love them, and why not. It’s fun to hear how Kevin Maitan swings like a young Miguel Cabrera or how Hunter Pence runs like a rotary telephone thrown into a running clothes dryer. They’re fun and helpful, because if there’s a player we’ve never seen before, it gives us some idea of what they’re like.

When it comes to creating comps, there’s more than just the eye test. Chris Mitchell provides Mahalanobis comps for prospects, and Dave recently did something interesting to make a hydra-comp for Tim Raines. We’re going to proceed with my favorite method of unsupervised learning: hierarchical clustering.

Why hierarchical clustering? Well, for one thing, it just looks really cool:

That right there is a dendrogram showing a clustering of all player-seasons since the year 2000. “Leaf” nodes on the left side of the diagram represent the seasons, and the closer together, the more similar they are. To create such a thing you first need to define “features” — essentially the points of comparison we use when comparing players. For this, I’ve just used basic statistics any casual baseball fan knows: AVG, HR, K, BB, and SB. We could use something more advanced, but I don’t see the point — at least this way the results will be somewhat interpretable to anyone. Plus, these stats — while imperfect — give us the gist of a player’s game: how well they get on base, how well they hit for power, how well they control the strike zone, etc.

Now hierarchical clustering sounds complicated — and it is — but once we’ve made a custom leaderboard here at FanGraphs, we can cluster the data and display it in about 10 lines of Python code.

import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Read csv
df = pd.read_csv(r'leaders.csv')
# Keep only relevant columns
data_numeric = df[['AVG','HR','SO','BB','SB']]
# Label each leaf with "Season Name" (the first two columns of the export)
labels = tuple(df.apply(lambda x: '{0} {1}'.format(x.iloc[0], x.iloc[1]), axis=1))
# Create the linkage array and dendrogram
w2 = linkage(data_numeric, method='ward')
d = dendrogram(w2, labels=labels, orientation='right', color_threshold=300)
plt.show()

Let’s use this to create some player comps, shall we? First let’s dive in and see which player-seasons are most similar to Mike Trout’s 2016:

2016 Mike Trout Comps
Season Name AVG HR SO BB SB
2001 Bobby Abreu .289 31 137 106 36
2003 Bobby Abreu .300 20 126 109 22
2004 Bobby Abreu .301 30 116 127 40
2005 Bobby Abreu .286 24 134 117 31
2006 Bobby Abreu .297 15 138 124 30
2013 Shin-Soo Choo .285 21 133 112 20
2013 Mike Trout .323 27 136 110 33
2016 Mike Trout .315 29 137 116 30

Remember Bobby Abreu? He’s on the Hall of Fame ballot next year, and I’m not even sure he’ll get 5% of the vote. But man, take defense out of the equation, and he was Mike Trout before Mike Trout. The numbers are stunningly similar and a sharp reminder of just how unappreciated a career he had. Also Shin-Soo Choo is here.

So Abreu is on the short list of most underrated players this century, but for my money there is someone even more underrated, and it certainly pops out from this clustering. Take a look at the dendrogram above — do you see that thin gold-colored cluster? In there are some of the greatest offensive performances of the past 20 years. Barry Bonds’s peak is in there, along with Albert Pujols’s best seasons, and some Todd Helton seasons. But let’s see if any of these names jump out at you:

First of all, holy hell, Barry Bonds. Look at how far separated his 2001, 2002 and 2004 seasons are from anyone else’s, including these other great performances. But I digress — if you’re like me, this is the name that caught your eye:

Brian Giles’s Gold Seasons
Season Name AVG HR SO BB SB
2000 Brian Giles .315 35 69 114 6
2001 Brian Giles .309 37 67 90 13
2002 Brian Giles .298 38 74 135 15
2003 Brian Giles .299 20 58 105 4
2005 Brian Giles .301 15 64 119 13
2006 Brian Giles .263 14 60 104 9
2008 Brian Giles .306 12 52 87 2

Brian Giles had seven seasons that, according to this method at least, are among the very best this century. He had an elite combination of power, batting eye, and a little bit of speed that is very rarely seen. Yet he didn’t receive a single Hall of Fame vote, for various reasons (short career, small markets, crowded ballot, PED whispers, etc.). He’s my vote for most underrated player of the 2000s.

This is just one application of hierarchical clustering. I’m sure you can think of many more, and you can easily do it with the code above. Give it a shot if you’re bored one offseason day and looking for something to write about.


The Season’s Least Likely Non-Homer

A little while back, I took a look at what might be considered the least likely home run of the 2016 season. I ended up creating a simple model which told us that a Darwin Barney pop-up which somehow squeaked over the wall was the least likely to end up being a homer. But what about the converse? What if we looked at the ball that was most likely to be a homer, but didn’t end up being one? That sounds like fun, let’s do it. (Warning: GIF-heavy content follows.)

The easy, obvious thing to do is just take our model from last time and use it to get a probability that each non-homer “should” be a home run. So let’s be easy and obvious! But first — what do you think this will look like? Maybe it was robbed of being a home run by a spectacular play from the center fielder? Or maybe this fly ball turned into a triple in the deepest part of Minute Maid Park? Perhaps it was scalded high off the Green Monster? Uh, well, it actually looks like this.

That’s Byung-ho Park, making the first out of the second inning against Yordano Ventura on April 8. Just based off exit velocity and launch angle, it seems like a worthy candidate for the title, clocking in at an essentially ideal 110 MPH with a launch angle of 28 degrees. For reference, here’s a scatter plot of similarly-struck balls and their result (click through for an interactive version):

(That triple was, of course, a triple on Tal’s hill)

But, if you’re anything like me, you’re just a tad underwhelmed at this result. Yes, it was a very well-struck ball, but it went to the deepest part of the park. What’s more, Kauffman Stadium is a notoriously hard place to hit a home run. It really feels like our model should take into consideration both the ballpark in which the fly ball was hit, and the horizontal angle of the batted ball, no? Let’s do that and re-run the model.

One tiny problem with this plan is that Statcast doesn’t actually provide us with the horizontal angle we’re after. Thankfully Bill Petti has a workaround based on where the fielder ended up fielding the ball, which should work well enough for our purposes. Putting it all together, our code now looks like this:

# Read the data
my_csv <- 'data.csv'
data_raw <- read.csv(my_csv)
# Convert some to numeric
data_raw$hit_speed <- as.numeric(as.character(data_raw$hit_speed))
data_raw$hit_angle <- as.numeric(as.character(data_raw$hit_angle))
# Add in horizontal angle (thanks to Bill Petti)
horiz_angle <- function(df) {
angle <- with(df, round(tan((hc_x-128)/(208-hc_y))*180/pi*.75,1))
angle
}
data_raw$hor_angle <- horiz_angle(data_raw)
# Remove NULLs
data_raw <- na.omit(data_raw)
# Re-index
rownames(data_raw) <- NULL

# Make training and test sets
cols <- c('HR','hit_speed','hit_angle','hor_angle','home_team')
library(caret)
inTrain <- createDataPartition(data_raw$HR,p=0.7,list=FALSE)
training <- data_raw[inTrain,cols]
testing <- data_raw[-inTrain,cols]
# gbm == boosting
method <- 'gbm'
# train the model
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
modelFit <- train(HR ~ ., method=method, data=training, trControl=ctrl)
# How did this work on the test set?
predicted <- predict(modelFit,newdata=testing)
# Accuracy, precision, recall, F1 score
accuracy <- sum(predicted == testing$HR)/length(predicted)
precision <- posPredValue(predicted,testing$HR)
recall <- sensitivity(predicted,testing$HR)
F1 <- (2 * precision * recall)/(precision + recall)

print(accuracy) # 0.973
print(precision) # 0.811
print(recall) # 0.726
print(F1) # 0.766

Great! Our performance on the test set is better than it was last time. With this new model, the Park fly ball “only” clocks in at a 90% chance of becoming a home run. The new leader, with a greater than 99% chance of leaving the yard under this model, is ARE YOU FREAKING KIDDING ME

I bet you recognize the venue. And the away team. And the pitcher. This is, in fact, the third out of the very same inning in which Byung-ho Park made his 400-foot out. Byron Buxton put all he had into this pitch, which also had a 28-degree launch angle, and a still-impressive 105 MPH exit velocity. Despite the lower exit velocity, you can see why the model thought this might be a more likely home run than the Park fly ball — it’s only 330 feet down the left-field line, so it takes a little less for the ball to get out that way.

Finally, because I know you’re wondering, here was the second out of that inning.

This ball was also hit at a 28-degree launch angle, but at a measly 102.3 MPH, so our model gives it a pitiful 81% chance of becoming a home run. Come on, Kurt Suzuki, step up your game.
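One mechanical footnote: the code above stops at the test-set metrics, so here’s a minimal sketch of the ranking step that surfaces these near-misses, i.e. the predicted home-run probability for every non-homer, sorted. It assumes HR is stored as a two-level factor with a “Y” level marking home runs; adjust to however your data encodes it.

# Predicted HR probability for every batted ball, then rank the non-homers
# (assumes HR is a factor with a "Y" level for home runs -- adjust as needed)
probs <- predict(modelFit, newdata = data_raw[, cols], type = "prob")
data_raw$hr_prob <- probs[["Y"]]
nonHomers <- data_raw[data_raw$HR != "Y", ]
head(nonHomers[order(-nonHomers$hr_prob),
               c("hit_speed", "hit_angle", "hor_angle", "home_team", "hr_prob")])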


Where Bryce Harper Was Still Elite

Bryce Harper just had a down season. That seems like a weird thing to write about someone who played to a 112 wRC+, but when you’re coming off a Bondsian .330/.460/.649 season, a line of .243/.373/.441 seems pedestrian. Would most major-league baseball players like to put up a batting line that’s 12% better than average? Yes (by definition). But based on his 2015 season, we didn’t expect “slightly above average” from Bryce Harper. We expected “world-beating.” We didn’t quite get it, but there’s one thing he is still amazing at — no one in the National League can work the count quite like him.


Maple Leaf Mystery

Canadians! They walk among us, only revealing themselves when they say something like “out” or “sorry” or “I killed and field-dressed my first moose when I was six.” But we don’t get to hear baseball players talk that often, so how can we tell if a baseball player is Canadian? Generally there are three warning signs:

  1. They have a vaguely French-sounding last name
  2. They have been pursued by the Toronto Blue Jays¹
  3. They bat left-handed and throw right-handed

¹ I honestly thought Travis d’Arnaud was Canadian until just now

Wait, hold on. What’s up with that third one? This merits a bit of investigation.


Michael Lorenzen Is the New Brian Wilson

Have you heard of Michael Lorenzen?  You might have heard of Michael Lorenzen.  You’re a baseball fan, and he plays baseball.  But chances are, you haven’t heard of him.  He’s a mostly unremarkable relief pitcher for a very unremarkable Cincinnati team.  He put up a 2.88 ERA with a 3.67 FIP for the Reds in 2016, hurt by a 22.7% HR/FB rate.  But he did do something special last year, and that something special is worth noting.  Before getting into that, though, let’s take a trip to the distant past of 2009.