Archive for Featured Homepage

The Importance of the 30-Minute Population Radius on MLB Attendance

In 1992, the San Francisco Giants almost moved to St. Petersburg, Florida. Before the i’s could be dotted and the t’s crossed, new ownership bought the team and the Giants stayed in the Bay Area. Less than 10 years later, the Tampa Bay area received the Devil Rays.

While their results on the field have been somewhat similar since 2008 (Rays winning %: .552, Giants winning %: .526), the two teams couldn’t be more different with regard to stadium experience. Since Oct 1, 2010, the Giants have sold out every game at AT&T Park, while the Rays have had 14 regular season sell-outs total since 2010. The Giants play in a beautiful new ballpark on the water, while the Rays play in a dilapidated 30-year-old dome.

There is one other major difference when we look at the Giants and the Rays (besides the fact the Giants did draft Buster Posey):

Last year, of the US-based teams, the Giants had the smallest difference in weekday/weekend attendance; the Rays had the largest. By selling out every game, the Giants maintained an average Monday through Thursday attendance of 41,588 and a Friday through Sunday average of 41,589. An average of one extra person squeezed into AT&T Park on the weekends.

Meanwhile, at Tropicana Field, the Rays averaged only 14,297 fans per game Monday through Thursday. This was the lowest average weekday attendance in Major League Baseball. On the weekends, however, the Rays averaged 21,692 fans per game. While still the lowest weekend average in Major League Baseball, the Rays saw a 51.7% average increase in attendance on the weekends.
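For anyone who wants to check the arithmetic, the weekend increase is simply the weekend average minus the weekday average, divided by the weekday average. Here is a minimal sketch using the attendance figures quoted above; the function name is my own and is only for illustration.

```python
def weekend_increase_pct(weekday_avg, weekend_avg):
    """Percentage by which the weekend average exceeds the weekday average."""
    return (weekend_avg - weekday_avg) / weekday_avg * 100.0


# Averages quoted in the text (Mon-Thu vs. Fri-Sun)
rays = weekend_increase_pct(14297, 21692)
giants = weekend_increase_pct(41588, 41589)

print(f"Rays: {rays:.1f}%  Giants: {giants:.3f}%")
# Rays: 51.7%  Giants: 0.002%
```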

There are many reasons why the Rays struggle with attendance. Many fans and residents point to the condition of the stadium, the demographics, and the lack of mass transit as reasons for not going. But one of the biggest and least-discussed reasons is that few people actually live near Tropicana Field. According to Maury Brown’s 2011 research on population, the Rays are dead last in population within a 30-mile radius of their ballpark.

A definite correlation exists between the population living within 30 minutes of a ballpark and the difference between weekend and weekday attendance. With only a few exceptions, teams with more than 2 million people living within 30 minutes of their ballpark have smaller weekend/weekday attendance differences. Teams with fewer than 2 million people in that radius, on the other hand, tend to have larger weekend/weekday differences.

Here is a breakdown of the 2014 MLB attendance:

[Table: 2014 MLB attendance]

Only the Chicago White Sox and Washington Nationals had more than 2 million people within 30 minutes of their ballpark and an average weekend difference greater than 20%. Teams with fewer than 2 million people within 30 minutes of their ballpark that saw a smaller than 20% difference in average weekday to weekend attendance included the Cardinals, Twins, Rangers, and Marlins. The circumstances behind these fanbases should be studied further.

Looking at the data graphically, it is best to omit the New York teams, as each can draw from a 30-minute population of over 8 million people, more than double any other team on the list. Removing the Mets and Yankees, we see the following:

[Chart: MLB 2014 weekend/weekday attendance difference vs. 30-minute population radius]

On the left side of the chart, we see teams with smaller average weekend-to-weekday attendance difference. Notice they are all above 1.5 million and a majority are over 2 million. As we move right on the chart, the percentage gets higher and the dots trend lower, with the exception of the White Sox, who are the top-right dot. The Rays are also evident, as they are the dot in the lower-right.

The local population is important because it is the pool of fans who can most easily get to the ballpark after a day at the office. These are the fans who can also get home from a 3-hour game at a reasonable time. Having a larger local pool to draw from makes it easier for teams to pack their ballpark during fans’ valuable weekday time. It is easier to fill the average major league ballpark on weekdays when 8 million potential fans live within 30 minutes than when a majority of the area’s 3 million people have to travel over an hour each way.

Weekends, on the other hand, usually allow for more time to travel to the ballpark. Fans also don’t have to rush home to get to sleep before the next work day. Fridays and the rare Sunday night game are the odd exceptions as they have a time crunch on one side of the trip, but not the other.

While they don’t have the largest local population, the San Francisco Giants are doing a great job getting local residents to the ballpark. Fans show up, and they show up every day. (Yes, there are articles disputing exactly how many tickets are actually sold.)

The Tampa Bay Rays, on the other hand, will continue to struggle with attendance as long as they have fewer than 1 million fans living within 30 minutes of Tropicana Field. This is one of the clearest reasons for a move to downtown Tampa, where the Tampa Bay Lightning see weekday/weekend attendance differences of approximately 5%. A move to the center of their market could vastly increase the pool of fans within 30 minutes of a Rays game. Or, barring a new stadium in a new location, the Rays could build homes, apartments, and condos in an attempt to surround Tropicana Field with at least one million new neighbors.

The Outcome Machine: Predicting At Bats Before They Happen

A player comes up to the plate. He’s a very good hitter; he’s hitting .300 on the year and has 40 home runs. On the mound stands a pitcher, also very good. The pitcher is a Cy Young candidate, and his ERA sits barely over 2.00. He leads the league in strikeouts and issues very few walks.

After a 10-pitch battle, the pitcher is the one to crack and the batter slaps a hanging curveball into the gap for a double. The batter has won. His batting average for the at bat is a very nice 1.000. Same for his OBP. His slugging percentage? 2.000. Fantastic. If he did this every time, he’d be MVP, no question, every year. The pitcher, meanwhile, has a WHIP for the at bat of #DIV/0!. Hasn’t even recorded a single out. His ERA is the same. He’s not doing too great. But let’s be fair. We’ll give him the benefit of the doubt, since we know he’s a good pitcher – we’ll pretend he recorded one out before this happened. Now his WHIP is 3.000. Yeesh – ugly. If he keeps pitching like this, his ERA will climb, too, since double after double after double is sure to drive every previous runner home.

Now, obviously, this is a bit ridiculous. Not every at bat is the same. The hitter won’t double every single at bat, and the pitcher won’t allow a double every time either. Baseball is a game of random variation, skill, luck, quality of opponents and teammates, and a whole bunch of other elements. In our scenario, all those elements came together to result in a two-bagger. But, like we said, you can’t expect that to happen every single time just because it happens once.

So… how do we predict what will happen in an at bat? Any person well-versed in baseball research knows that past performance against a specific batter or pitcher means little in terms of how the next at bat will turn out, at least not until you get a meaningful number of plate appearances – and even then it’s not the best tool.

Of course, if we knew the result of every at bat before it happened, it would take most of the fun out of watching. But we’re never going to be able to do that, and so we might as well try to predict as best we can. And so I have come up with a methodology for doing so that I think is very accurate and reliable, and this post is meant to present it to you.

To claim full credit for the inspiration behind this idea would be wrong; FanGraphs author and baseball-statistics aficionado Steve Staude wrote an article back in June 2013 aiming to predict the probability of a strikeout given both the batter’s and the pitcher’s strikeout rates, which led me to this topic. In that article he found a very consistent (and good) model that predicted strikeouts:

Expected Matchup K% = B x P / (0.84 x B x P + 0.16)
Where B = the batter’s historical K% against the handedness of the pitcher; and P = the pitcher’s historical K% against the handedness of the batter
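To make the formula concrete, here is a small sketch that evaluates it for a single matchup. The function and argument names are mine, not Steve's, and the rates are entered as decimals (0.25 rather than 25).

```python
def expected_matchup_k(batter_k, pitcher_k):
    """Expected matchup K% from the formula quoted above.

    batter_k:  batter's historical K% vs. the pitcher's handedness (as a decimal)
    pitcher_k: pitcher's historical K% vs. the batter's handedness (as a decimal)
    """
    product = batter_k * pitcher_k
    return product / (0.84 * product + 0.16)


# Hypothetical matchup: a 25% K batter against a 30% K pitcher
matchup = expected_matchup_k(0.25, 0.30)
print(f"{matchup:.3f}")
```

Note that the result can come out higher than either input, which is what you would expect when a strikeout-prone batter faces a strikeout-heavy pitcher.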

He then followed that up with another article that provided an interactive tool that you could play around with to get the expected K% for a matchup of your choosing and introduced a few new formulas (mostly suggested in the comments of his first article) to provide different perspectives. It’s all very interesting stuff.

But all that gets us is K%. Which, you know, is great, and strikeouts are probably one of the most important and indicative raw numbers to know for a matchup. But that doesn’t tell us about any other stats. So as a means of following up on what he’s done (something he mentioned in the article but I have not seen any evidence of) and also as a way to find the probability of each outcome for every type of matchup (a daunting task), I did my own research.

My methodology was very similar. I took all players and plate appearances from 2003-2013 (Steve’s dataset was 2002-2012; also, I got the data all from via – both truly indispensable resources) and for each player found their K%, BB%, 1B%, 2B%, 3B%, HR%, HBP%, and BABIP during that time. This means that a player like, say, Derek Jeter will only have his 2003-2013 stats included, not any from before 2003. I further refined that by separating each player’s numbers into vs. righty and vs. lefty numbers (Steve, in another article, proved that handedness matchups were important). I did this for both batters and pitchers. Then, for each statistic, I grouped the numbers for the batters and the numbers for the pitchers, and found the percentage of plate appearances involving a batter and a pitcher with the two grouped numbers that ended in the result in question. That’s kind of a mouthful, so let me provide an example:


These are my results for strikeout percentage (numbers here are expressed as decimals out of 1, not percentages out of 100). Total means the total proportion of plate appearances with those parameters that ended in a strikeout, while batter and pitcher mean the K% of the batter and pitcher, respectively. Count(*) measures exactly how many instances of that exact matchup there were in my data. Another important point to note – this is by no means all of the combinations that exist; in fact, for strikeouts, there were over 2,000, far more than the 20 shown here. I did have to remove many of those since there were too few observations to make meaningful assumptions…


…but I was still left with a good amount of data to work with (strikeout percentage gave me just over 400 groupings, which was plenty for my task). I went through this process for each of the rate stats that I laid out above.
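The grouping step can be sketched roughly as follows. This is not the code I actually ran; it assumes the rates are grouped by rounding to two decimals and that sparse groups are dropped below a hypothetical `min_count` cutoff, and the input format is invented for the example.

```python
from collections import defaultdict


def group_matchups(plate_appearances, min_count=100, digits=2):
    """Group plate appearances by (batter rate, pitcher rate).

    plate_appearances: iterable of (batter_rate, pitcher_rate, happened),
    where happened is True if the PA ended in the outcome being studied.
    Returns {(batter_bin, pitcher_bin): (outcome_rate, count)}.
    """
    tallies = defaultdict(lambda: [0, 0])  # key -> [outcomes, total PAs]
    for batter_rate, pitcher_rate, happened in plate_appearances:
        key = (round(batter_rate, digits), round(pitcher_rate, digits))
        tallies[key][0] += int(happened)
        tallies[key][1] += 1
    # Keep only groups with enough observations to be meaningful
    return {
        key: (outcomes / count, count)
        for key, (outcomes, count) in tallies.items()
        if count >= min_count
    }
```

Each surviving group corresponds to one row of the kind shown in the tables: a batter rate, a pitcher rate, the observed total rate, and the count.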

My next step was to come up with a model that fit these data – in other words, estimate the total K% from the batter and pitcher K%. I did this by running a multiple regression in R, but I encountered some problems with the linearity of the data. For example, here are the results of my regression for BB% plotted against the real values for BB%:


It looks pretty good – and the r^2 of the regression line was .9653, which is excellent – but it appears to be a little bit curved. To counter that I ran a regression with the dependent variable being the natural logarithm of the total BB%, and the independent variables being the natural logarithms of the batter’s and pitcher’s BB%. After running the regression, here is what I got:


The scatterplot is much more linear, and the r^2 increased to .988. This means that ln(total) = ln(bat)*coefficient + ln(pit)*coefficient + intercept, where the batter and pitcher terms each have their own coefficient. So if we raise e to the power of both sides, we get total = e^(ln(bat)*coefficient + ln(pit)*coefficient + intercept). This formula, obviously with different coefficients and intercepts, fits each of K%, BB%, 1B%, 3B%, HR%, and HBP% remarkably well; for some reason, both 2B% and BABIP did not need to be “linearized” like this and were fitted better by a simple regression without any logarithm doctoring.
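For readers who want to try the log-log fit themselves, here is a rough Python sketch of the idea (my own regressions were run in R, so this is an illustration rather than the original code). It fits ln(total) on ln(bat) and ln(pit) by ordinary least squares using the normal equations; all function names are made up for the example.

```python
import math


def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_log_log(rows):
    """rows: (bat, pit, total) triples, all > 0.

    Returns (bat_coefficient, pit_coefficient, intercept) for
    ln(total) = bat_coefficient*ln(bat) + pit_coefficient*ln(pit) + intercept.
    """
    design = [[math.log(bat), math.log(pit), 1.0] for bat, pit, _ in rows]
    target = [math.log(total) for _, _, total in rows]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(3)]
    return tuple(solve_linear(xtx, xty))


def predict(bat, pit, coefficients):
    """Back-transform the fitted log model into a rate."""
    bat_coef, pit_coef, intercept = coefficients
    return math.exp(bat_coef * math.log(bat) + pit_coef * math.log(pit) + intercept)
```

Exponentiating is the same as writing total = e^intercept * bat^coefficient * pit^coefficient, which is why the curve bends when plotted on the original scale.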

Here are the regression equations, along with the r^2, for each of the stats:

Stat     Regression equation                              r^2
K%       e^(.9427*ln(bat) + .9254*ln(pit) + 1.5268)       0.9887
BB%      e^(.906*ln(bat) + .8644*ln(pit) + 1.9975)        0.9880
1B%      e^(1.01*ln(bat) + 1.017*ln(pit) + 1.943)         0.9312
2B%      .9206*bat + .95779*pit - .03968                  0.7315
3B%      e^(.8435*ln(bat) + .8698*ln(pit) + 3.8809)       0.7739
HR%      e^(.9576*ln(bat) + .9268*ln(pit) + 3.2129)       0.8474
HBP%     e^(.8761*ln(bat) + .7623*ln(pit) + 2.995)        0.8963
BABIP    1.0403*bat + .9135*pit - .2573                   0.9655
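As a quick illustration of how these equations are used, the sketch below plugs a batter rate and a pitcher rate into the equation for a given stat. The coefficients are copied from the table; the dictionary layout and function name are only for this example, and the sample inputs are hypothetical.

```python
import math

# stat -> (batter coefficient, pitcher coefficient, intercept, log model?)
EQUATIONS = {
    "K%":    (0.9427, 0.9254, 1.5268, True),
    "BB%":   (0.906, 0.8644, 1.9975, True),
    "1B%":   (1.01, 1.017, 1.943, True),
    "2B%":   (0.9206, 0.95779, -0.03968, False),
    "3B%":   (0.8435, 0.8698, 3.8809, True),
    "HR%":   (0.9576, 0.9268, 3.2129, True),
    "HBP%":  (0.8761, 0.7623, 2.995, True),
    "BABIP": (1.0403, 0.9135, -0.2573, False),
}


def expected_rate(stat, bat, pit):
    """Expected matchup rate; bat and pit are decimals (e.g. 0.20 for 20%)."""
    bat_coef, pit_coef, intercept, logged = EQUATIONS[stat]
    if logged:
        return math.exp(bat_coef * math.log(bat) + pit_coef * math.log(pit) + intercept)
    return bat_coef * bat + pit_coef * pit + intercept


# Hypothetical inputs: a 20% K batter vs. a 20% K pitcher, and .300 BABIP on both sides
print(f"K%:    {expected_rate('K%', 0.20, 0.20):.3f}")
print(f"BABIP: {expected_rate('BABIP', 0.300, 0.300):.3f}")
```

The log-model equations need strictly positive inputs, so a player with a rate of exactly zero (say, no triples) would have to be handled separately.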

The first thing that should jump out to you (or at least one of the first) is the extremely high r^2 for BABIP. It totally blew my mind to think that the batter’s BABIP and the pitcher’s BABIP alone explain about 96% of the variation in how often a batted ball falls for a hit across these groupings.

Another immediate observation: K%, BB%, and HBP% generally have higher correlations than 1B%, 2B%, 3B%, and HR%. This is likely due to the increased luck and randomness that a batted ball is subjected to; for example, a triple needs to have two things happen to become a triple (being put in play and falling in an area where the batter will get exactly three bases), whereas a strikeout only needs one thing to happen – the batter needs to strike out. Overall, I was very satisfied with these results, since the correlations were overall higher than I expected.

Now comes the good part – putting it all together. We have all the inputs we need to calculate many commonly-used batting stats: AVG, OBP, SLG, OPS, and wOBA. So once we input the batter and pitcher numbers, we should be able to calculate those stats with high accuracy. I developed a tool to do just that:

For a full explanation of the tool and how to use it, head over to my (new and shiny!) blog. I encourage you to go play around with this to see the different results.
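The tool itself is a spreadsheet, but the basic bookkeeping behind AVG, OBP, SLG, and OPS can be sketched in a few lines. This is a simplified illustration, not the tool's actual logic: it assumes every plate appearance that is not a walk or hit-by-pitch counts as an at bat (so sacrifices and the like are ignored), and it leaves out wOBA because the weights vary by season.

```python
def slash_line(p):
    """p: per-plate-appearance probabilities keyed by
    '1B', '2B', '3B', 'HR', 'BB', 'HBP' (decimals)."""
    hits = p["1B"] + p["2B"] + p["3B"] + p["HR"]
    total_bases = p["1B"] + 2 * p["2B"] + 3 * p["3B"] + 4 * p["HR"]
    # Simplifying assumption: PA that are not BB or HBP are at bats
    at_bats = 1.0 - p["BB"] - p["HBP"]
    avg = hits / at_bats
    obp = hits + p["BB"] + p["HBP"]  # per PA, so the denominator is 1
    slg = total_bases / at_bats
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Hypothetical matchup probabilities
line = slash_line({"1B": 0.15, "2B": 0.05, "3B": 0.005,
                   "HR": 0.03, "BB": 0.08, "HBP": 0.01})
print({stat: round(value, 3) for stat, value in line.items()})
```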

One last thing: it is important to note that I made one big assumption in doing this research that isn’t exactly true and may throw the results off a little bit. The regressions I ran were based on results for players over their whole career (or at least the part between 2003 and 2013), which isn’t a great reflection of true talent level. In the long run, I think the results will still hold because there were so many data points, but in using the interactive spreadsheet, your inputs should be whatever you think is the correct reflection of a player’s true talent level (which is why I would suggest using projection systems; I think those are the best determinations of talent), and that will almost certainly not be career numbers.

The Unique Path to Success in Oakland

 Two roads diverged in a wood, and I–

I took the one less traveled by,

And that has made all the difference.

— Robert Frost

There are many things that stand out about this year’s Oakland A’s. Their incredible run differential has reached a near historic level, their breakout star from last year has proven that last season was no fluke, and the top three starters are pitching at incredible levels. They’ve been marauding through the American League like Heisenberg’s nemesis through Janjira. However, there’s one aspect of this team that flies under the radar: of their current 25-man roster, only two players were acquired through the amateur draft – Sonny Gray and Sean Doolittle. The rest were acquired through a mix of trades, free agency, waiver claims, purchases, and even one conditional deal.

Billy Beane made his name a while ago by not being afraid to stray from the pack, and in fact looking for those market inefficiencies that could save him a buck or two with the low payroll A’s. By trading for players who may have disappointed at other spots across Major League Baseball, or claiming players put on waivers, Beane is once again finding talent in the most frugal way possible. So is this a new phenomenon in Oakland? Let’s see what the numbers say. Here’s the acquisitional (who says you can’t invent words?!) breakdown of the Oakland A’s roster the last thirteen years.* This includes any hitters who made at least 100 plate appearances and any pitchers who pitched in at least ten games in addition to this year’s current 25-man roster.

* Why thirteen years? Because, Moneyball, of course!

A’s Roster Construction Since 2002
Year AD* FA** T*** AFA^ WC^^ P^^^ CD’ R5” MD”’
2014 2 4 13 1 2 2 1 0 0
2013 4 4 16 1 4 2 1 0 0
2012 7 9 16 2 2 1 0 0 0
2011 6 7 17 0 1 1 0 0 0
2010 9 7 12 1 2 1 0 0 0
2009 11 6 14 1 3 2 0 0 0
2008 8 5 16 1 2 3 0 0 0
2007 10 5 10 1 4 2 0 0 0
2006 8 5 15 0 1 0 0 0 0
2005 10 4 15 0 1 0 0 0 0
2004 8 7 11 0 1 0 0 0 0
2003 8 6 9 2 0 0 1 1 0
2002 6 8 16 2 0 0 0 0 1

AD*= Players acquired through amateur draft;  FA**= Players acquired through free agency;  T***= Players acquired through trades;  AFA^= Players acquired through amateur free agency;  WC^^= Players acquired through waiver claims;  P^^^= Players acquired through purchases;  CD’= Players acquired through conditional deals;  R5”= Players acquired through the Rule 5 draft;  MD”’= Players acquired through minor league draft


While the A’s have always built their roster through trades more than through the draft (the only years in which those numbers were even tied were 2007 and 2003; every other year there were more players acquired via trade than draft), the trend has become more and more evident of late. On the A’s current 25-man roster, there are a measly two players whom the A’s acquired through the amateur draft versus thirteen acquired through trades. Granted, the number acquired through the draft was bound to be a bit smaller so far this season than in previous years, since a 25-man roster was used this season instead of qualified players (again, players who had either 100 plate appearances or ten games pitched in that given season), who totaled between 27 and 37 in the previous twelve seasons. However, given that the season with the second lowest number of players acquired via the draft was last season, there definitely appears to be a trend here.

Now the question becomes, “how does this compare to the league as a whole?”

Usually Beane is at the forefront of certain trends, so if the A’s roster composition varies greatly from the rest of the league, could it be the start of a league wide trend, especially given the A’s incredible success so far? To answer that question, data on all 30 teams’ roster composition was collected for the 2013 season. Given the same requirements as the previous A’s seasons (100 plate appearances or ten games pitched), how did other rosters across Major League Baseball look last year?

League Wide Roster Construction in 2013
Team AD* FA** T*** AFA^ WC^^ P^^^ CD’ R5”
BOS 26.47 35.29 26.47 5.88 0.00 5.88 0.00 0.00
STL 65.63 12.50 18.75 0.00 0.00 3.13 0.00 0.00
OAK 12.50 12.50 50.00 3.13 12.50 6.25 3.13 0.00
ATL 33.33 10.00 33.33 6.67 16.67 0.00 0.00 0.00
PIT 28.57 21.43 42.86 3.57 0.00 3.57 0.00 0.00
DET 18.75 40.63 31.25 6.25 3.13 0.00 0.00 0.00
LAD 21.88 34.38 34.38 6.25 0.00 3.13 0.00 0.00
CLE 13.79 24.14 58.62 3.45 0.00 0.00 0.00 0.00
TBR 22.58 29.03 41.94 0.00 3.23 3.23 0.00 0.00
TEX 29.03 32.26 19.35 12.90 0.00 3.23 0.00 3.23
CIN 40.00 23.33 23.33 10.00 3.33 0.00 0.00 0.00
WSN 37.50 25.00 31.25 3.13 3.13 0.00 0.00 0.00
KCR 33.33 13.33 36.67 6.67 3.33 6.67 0.00 0.00
BAL 25.81 12.90 35.48 3.23 9.68 6.45 0.00 6.45
NYY 25.81 35.48 22.58 9.68 3.23 0.00 0.00 3.23
ARI 16.13 25.81 45.16 9.68 3.23 0.00 0.00 0.00
LAA 37.84 29.73 21.62 5.41 5.41 0.00 0.00 0.00
SFG 33.33 36.67 10.00 6.67 10.00 3.33 0.00 0.00
SDP 31.43 20.00 40.00 0.00 2.86 0.00 2.86 2.86
NYM 31.58 31.58 13.16 13.16 10.53 0.00 0.00 0.00
MIL 39.39 36.36 12.12 3.03 6.06 3.03 0.00 0.00
COL 36.36 27.27 21.21 12.12 3.03 0.00 0.00 0.00
TOR 24.32 21.62 45.95 2.70 2.70 2.70 0.00 0.00
PHI 35.00 37.50 20.00 7.50 0.00 0.00 0.00 0.00
SEA 27.27 30.30 30.30 9.09 0.00 0.00 0.00 3.03
MIN 33.33 33.33 12.12 6.06 9.09 0.00 0.00 6.06
CHC 11.43 42.86 22.86 11.43 8.57 0.00 0.00 2.86
CHW 30.00 33.33 20.00 10.00 6.67 0.00 0.00 0.00
MIA 30.30 24.24 39.39 6.06 0.00 0.00 0.00 0.00
HOU 15.00 22.50 40.00 5.00 10.00 0.00 0.00 7.50

That’s a lot of numbers, so let’s take a step back and look at some of the ones that stick out. First of all, instead of raw totals, percentages have been used to even out the variance in how many players each team had qualify for this roster construction study. It’s also important to note that the highest and lowest percentage in each column has been bolded (this was done only for the three primary ways of acquiring players – the amateur draft, free agency, and trades).

One may think of the old adage, “there’s more than one way to skin a cat” when looking at the top of the league. Apparently this adage holds true for baseball roster construction, as well as cat mutilation, as the St. Louis Cardinals – you know, that franchise that has won four of the last ten NL pennants with a pair of titles, and has the self-proclaimed best fanbase in baseball – have gone in the complete opposite direction from the A’s to build their squad, relying more on the amateur draft than any other team in baseball, and doing so with great success. Then there are last year’s World Series champions, the Boston Red Sox, who were among the league leaders in players brought in through free agency.
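The conversion from raw counts to percentages is just each acquisition count divided by the team's number of qualifying players. As a small sketch, here it is applied to the 2013 A's row from the first table (the function name and dictionary layout are only for illustration):

```python
def acquisition_shares(counts):
    """Convert raw acquisition counts into percentages of the qualifying roster."""
    total = sum(counts.values())
    return {method: 100.0 * n / total for method, n in counts.items()}


# 2013 A's counts from the "A's Roster Construction Since 2002" table
oak_2013 = {"AD": 4, "FA": 4, "T": 16, "AFA": 1, "WC": 4, "P": 2, "CD": 1, "R5": 0, "MD": 0}
shares = acquisition_shares(oak_2013)

print(f"Qualifying players: {sum(oak_2013.values())}")
print(f"AD: {shares['AD']:.2f}%  FA: {shares['FA']:.2f}%  T: {shares['T']:.2f}%")
```

This reproduces the 12.50 / 12.50 / 50.00 split shown for OAK in the league-wide table.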

One consistent, league-wide trend was that teams at the bottom of the league standings had far more players qualify for the 100 plate appearance/ten games pitched minimums. This is a bit of a “chicken or the egg” type observation, where the cause can sometimes be confused with the effect. There are several teams among the league’s cellar dwellers that went through numerous players throughout the season in an attempt to find effective players (the “throw the spaghetti at the wall and see what sticks” approach Jonah Keri has referenced on multiple occasions). This would be your Marlins, Astros, and Cubs. However, there are also teams among the lower tier of the standings that were forced into more personnel choices due to injuries; your Phillies, Blue Jays, and Angels. Whatever the reason, it is noticeable that nearly all the teams at the top of the standings at the end of the year have fewer players qualified for the 100 plate appearance/ten games pitched minimums thanks to good health and a clear vision – two staples of successful franchises (interestingly enough the one team that was an exception to this rule in 2013 was the Boston Red Sox; however, given their disaster of a 2012 season, it’s not as surprising to see that they tinkered a bit with their roster throughout the season).

The data supports what many baseball fans would already think, which is that the teams with higher payrolls usually are among the most reliant on free agents, and, in order to compete, the smaller market teams need to find other ways to build their rosters. For example, the top eight teams who built through free agency were: the Cubs, the Tigers, the Phillies, the Giants, the Brewers, the Yankees, the Red Sox, and the Dodgers. Of those eight, the Tigers, Phillies, Giants, Yankees, Red Sox, and Dodgers make up the top six teams by payroll in 2014. The Cubs are in the middle of a complete roster overhaul, and Theo Epstein seems to be constructing a team built for flipping at the deadline for future prospects, so cheap free agents are a prime commodity. The Brewers are the odd team out, and would make for an interesting case study.

On the flip side, the top nine teams built by trading for players were: the Indians, the A’s, the Blue Jays, the Diamondbacks, the Pirates, the Rays, the Astros, the Padres, and the Marlins. Of those nine, the A’s, Pirates, Rays, Astros, Padres, and Marlins made up the six lowest teams by payroll in 2013; the Indians were not far off, with only the 21st biggest payroll of 2013; and the Blue Jays and Diamondbacks both have super aggressive front offices that prefer to bring in players via (usually poor) trades.

There is, of course, the caveat that while this study looks at general roster construction, it does not have the nuance to differentiate between a team loaded with big money free agents (like the Yankees and Red Sox) and a team loaded with replacement level free agents (like the Cubs). If each player’s salary were totaled by how he was acquired, and then turned into percentages of roster construction again, this would show us how much each team is truly investing in each method of roster construction from a financial point of view. This could be used to complement Jonah Keri and Neil Payne’s recent study that looked at roster construction. In their piece, Keri and Payne look at roster construction through the lens of a stars and scrubs roster versus a balanced roster. Although there might be some discrepancy based on the arbitrary 100 plate appearance and ten games pitched cut-offs, the data likely wouldn’t be vastly skewed from the current results.

Todd Boss, of Nationals Arm Race, did an interesting study somewhat similar to this one, looking at the core players (the 5-man starting rotation, the setup man and closer, the 8 out-field players, and the DH for AL teams) for the playoff teams in 2013, and put the teams into four different categories of roster construction: draft/development, trade major leaguers, trade prospects, and free agency. The results were similar to what was found here, and help to support the idea that the arbitrary cut-offs of 100 plate appearances and 10 games pitched didn’t have a negative impact on the study. The only slightly different result was that Boss found the Rays to be relying more on the draft than on trades.

Having looked at the league-wide breakdown for roster construction last season, let’s take a look at roster construction from an historical perspective. To make a long story short, when Curt Flood took on Major League Baseball, and eventually the Supreme Court, in his fight to turn down a trade to Philadelphia (who can blame him?), he opened up the Floodgates (couldn’t help myself) for the eventual implementation of free agency in baseball. So, has successful (being judged by the extremely arbitrary “ringz” perspective) roster construction changed since then? Let’s take a look with yet another chart (Marshall Eriksen would be proud), this time looking at the past 40 World Series winners, and how each team was constructed.

Roster Construction of World Series Winners Since 1974
Year Team AD* FA** T*** AFA^ WC^^ P^^^ CD’ R5” MD”’ DC+ XD++
2013 BOS 26.47 35.29 26.47 5.88 0.00 5.88 0.00 0.00 0.00 0.00 0.00
2012 SFG 37.50 37.50 15.63 6.25 3.13 0.00 0.00 0.00 0.00 0.00 0.00
2011 STL 39.39 33.33 21.21 3.03 0.00 3.03 0.00 0.00 0.00 0.00 0.00
2010 SFG 31.25 50.00 15.63 3.13 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2009 NYY 21.88 43.75 12.50 15.63 0.00 6.25 0.00 0.00 0.00 0.00 0.00
2008 PHI 29.63 44.44 14.81 3.70 3.70 0.00 0.00 3.70 0.00 0.00 0.00
2007 BOS 20.00 46.67 23.33 0.00 3.33 6.67 0.00 0.00 0.00 0.00 0.00
2006 STL 16.13 41.94 32.26 0.00 0.00 3.23 0.00 6.45 0.00 0.00 0.00
2005 CHW 14.81 40.74 40.74 0.00 3.70 0.00 0.00 0.00 0.00 0.00 0.00
2004 BOS 9.09 39.39 30.30 0.00 12.12 6.06 3.03 0.00 0.00 0.00 0.00
2003 FLA 10.00 30.00 50.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2002 LAA 35.71 28.57 14.29 7.14 14.29 0.00 0.00 0.00 0.00 0.00 0.00
2001 ARI 10.00 50.00 20.00 6.67 0.00 3.33 0.00 0.00 0.00 0.00 10.00
2000 NYY 25.00 31.25 31.25 9.38 3.13 0.00 0.00 0.00 0.00 0.00 0.00
1999 NYY 20.00 36.00 32.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1998 NYY 16.00 44.00 28.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1997 FLA 6.45 29.03 38.71 16.13 0.00 0.00 0.00 0.00 3.23 0.00 6.45
1996 NYY 12.12 27.27 39.39 18.18 0.00 3.03 0.00 0.00 0.00 0.00 0.00
1995 ATL 40.00 40.00 16.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1993 TOR 29.63 37.04 33.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1992 TOR 44.00 20.00 24.00 0.00 0.00 0.00 0.00 8.00 0.00 4.00 0.00
1991 MIN 33.33 29.63 33.33 0.00 0.00 0.00 0.00 3.70 0.00 0.00 0.00
1990 CIN 32.00 12.00 52.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00
1989 OAK 32.14 32.14 32.14 3.57 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1988 LAD 32.14 35.71 28.57 0.00 0.00 3.57 0.00 0.00 0.00 0.00 0.00
1987 MIN 33.33 11.11 51.85 3.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1986 NYM 30.77 11.54 50.00 7.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1985 KCR 38.46 19.23 26.92 11.54 0.00 3.85 0.00 0.00 0.00 0.00 0.00
1984 DET 42.86 17.86 28.57 3.57 0.00 7.14 0.00 0.00 0.00 0.00 0.00
1983 BAL 32.14 21.43 32.14 10.71 0.00 3.57 0.00 0.00 0.00 0.00 0.00
1982 STL 19.23 3.85 65.38 7.69 0.00 3.85 0.00 0.00 0.00 0.00 0.00
1981 LAD 43.48 17.39 21.74 8.70 0.00 8.70 0.00 0.00 0.00 0.00 0.00
1980 PHI 39.29 14.29 39.29 3.57 0.00 3.57 0.00 0.00 0.00 0.00 0.00
1979 PIT 32.00 12.00 40.00 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1978 NYY 18.18 13.64 63.64 4.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1977 NYY 18.18 13.64 63.64 4.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1976 CIN 28.00 N/A 44.00 28.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1975 CIN 33.33 N/A 45.83 20.83 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1974 OAK 25.00 N/A 37.50 29.17 0.00 8.33 0.00 0.00 0.00 0.00 0.00

DC+= Players acquired through free agent draft compensation;  XD++= Players acquired through the expansion draft

The first note that needs to be made is regarding the 1997 Marlins and 2001 Diamondbacks. Both rosters had skewed roster construction due to how soon after the team’s inception they were able to win a championship. The Marlins have by far the lowest reliance on the amateur draft, and the Diamondbacks have tied for the highest reliance on free agents, but both of these numbers were driven up (or down) by the limited time for drafting and moving along prospects before their championships.

After accounting for the 2001 Diamondbacks season, the steady rise of reliance on free agents since the mid-seventies is notable – up until three years ago, that is. It’s hard to tell whether baseball is undergoing an actual “grass roots” movement, with teams relying less and less on big market free agents to succeed, or if this is simply a three-year blip on the radar, but the last three World Series winners have relied markedly less on free agents than the previous seven years’ winners. The 2011 Cardinals, 2012 Giants, and 2013 Red Sox have not, however, relied on trades; instead they leaned on their farm systems more than other winners of this millennium (not including the 2002 Angels).

In fact, excluding the fluky 2003 Marlins, there has not been a World Series winner as reliant on trades as the 2013 A’s (50 percent) since the mid-eighties Twins and Mets. What’s even more troubling for the A’s is that there hasn’t been a team to use the draft and free agency combined as little as the 2013 A’s since the 1982 Cardinals, a team built during the dawn of free agency.

When judging by championships, in fact, the picture of baseball as a sport in which you need to be in a big market, with the ability to sign big name free agents becomes unfortunately evident. The roster composition of nearly all of the World Series winners this century is quite similar to that first group of teams mentioned above as big market teams built through free agency. This is no surprise to any real baseball fans, however. Look at the cities that have hosted World Series parades since the Yankees’ dynasty of the nineties began. Sure, there are the success stories in Florida and Arizona, but other than that it’s a who’s who of big market teams. While the Cardinals play themselves off as plucky little underdogs, their payroll was the eleventh largest in baseball last year, almost exactly twice that of the A’s.

That’s why this year’s A’s team could be so special. If they are able to continue their regular season success, and finally make the breakthrough they have been struggling so much to make in recent years, they could continue the recent trend of teams moving away from a strictly free agent diet to fulfill their championship dreams. Of course, this has been the case for a couple of years in Oakland now, and it hasn’t happened yet. However, with the top three in the A’s rotation looking as good as any in baseball right now, baseball’s secret superstar at third, and the fact that it is the 25th anniversary of the last A’s World Series title, suddenly it doesn’t seem that unlikely that the A’s could make ole Bobby Frost proud this October.

Estimating Pitcher Release Point Distance from PITCHf/x Data

For PITCHf/x data, the starting point for pitches, in terms of the location, velocity, and acceleration, is set at 50 feet from the back of home plate. This is effectively the time-zero location of each pitch. However, 55 feet seems to be the consensus for setting an actual release point distance from home plate, and is used for all pitchers. While this is a reasonable estimate to handle the PITCHf/x data en masse, it would be interesting to see if we can calculate this on the level of individual pitchers, since their release point distances will probably vary based on a number of parameters (height, stride, throwing motion, etc.). The goal here is to try to use PITCHf/x data to estimate the average distance from home plate at which each pitcher releases his pitches, conceding that each pitch is going to be released from a slightly different distance. Since we are operating blind, we have to first define what it means to find a pitcher’s release point distance based solely on PITCHf/x data. This definition will set the course by which we will go about calculating the release point distance mathematically.

We will define the release point distance as the y-location (the direction from home plate to the pitching mound) at which the pitches from a specific pitcher are “closest together”. This definition makes sense as we would expect the point of origin to be the location where the pitches are closer together than at any later point in their trajectory. It also gives us a way to look for this point: treat the pitch locations at a specified distance as a cluster and find the distance at which they are closest. In order to do this, we will make a few assumptions. First, we will assume that the pitches near the release point are from a single bivariate normal (or two-dimensional Gaussian) distribution, from which we can compute a sample mean and covariance. This assumption seems reasonable for most pitchers, but for others we will have to do a little more work.

Next we need to define a metric for measuring this idea of closeness. The previous assumption gives us a possible way to do this: compute the ellipse, based on the data at a fixed distance from home plate, that accounts for two standard deviations in each direction along the principal axes for the cluster. This provides a two-dimensional figure which encloses most of the data and whose area we can calculate. The one-dimensional analogue is the length of the interval that extends two standard deviations to either side of the mean of a univariate normal distribution. In two dimensions, the calculation amounts to finding the sample covariance, which, for this problem, will be a 2×2 matrix, finding its eigenvalues and eigenvectors, and using these to find the area of the ellipse. Here, each eigenvector defines a principal axis and its corresponding eigenvalue the variance along that axis (taking the square root of each eigenvalue gives the standard deviation along that axis). The formula for the area of an ellipse is Area = pi*a*b, where a is half of the length of the major axis and b half of the length of the minor axis. With a and b each equal to two standard deviations, the area of the ellipse we are interested in is four times pi times the product of the square roots of the two eigenvalues. Note that since we want to find the distance corresponding to the minimum area, the choice of two standard deviations, in lieu of one or three, is irrelevant since this plays the role of a scale factor and will not affect the location of the minimum, only the value of the functional.
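As a rough illustration of the area calculation, here is a minimal Python sketch using only the standard library. The function names (sample_covariance, eigenvalues_2x2, ellipse_area) and the n_std parameter are our own labels for this example, not part of any PITCHf/x tool, and the closed-form eigenvalue formula applies only to a symmetric 2×2 matrix.

```python
import math

def sample_covariance(xs, zs):
    """Return (sxx, sxz, szz), the entries of the unbiased 2x2 sample covariance."""
    n = len(xs)
    mx = sum(xs) / n
    mz = sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    szz = sum((z - mz) ** 2 for z in zs) / (n - 1)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / (n - 1)
    return sxx, sxz, szz

def eigenvalues_2x2(sxx, sxz, szz):
    """Eigenvalues of the symmetric matrix [[sxx, sxz], [sxz, szz]], largest first."""
    mean = (sxx + szz) / 2.0
    radius = math.hypot((sxx - szz) / 2.0, sxz)
    return mean + radius, mean - radius

def ellipse_area(xs, zs, n_std=2.0):
    """Area of the ellipse whose semi-axes are n_std standard deviations long."""
    lam1, lam2 = eigenvalues_2x2(*sample_covariance(xs, zs))
    # Guard against tiny negative eigenvalues caused by floating-point round-off.
    a = n_std * math.sqrt(max(lam1, 0.0))
    b = n_std * math.sqrt(max(lam2, 0.0))
    return math.pi * a * b
```

Changing n_std from 2 to 1 or 3 only rescales every area by the same constant, which is why the location of the minimum is unaffected.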

With this definition of closeness in order, we can now set up the algorithm. To be safe, we will give y=55 a wide berth when calculating the ellipses. Based on trial and error, y=45 to y=65 seems more than sufficient. Starting at one end, say y=45, we use the PITCHf/x location, velocity, and acceleration data to calculate the x (horizontal) and z (vertical) position of each pitch at 45 feet. We can then compute the sample covariance and then the area of the ellipse. Working in increments, say one inch, we can work toward y=65. This will produce a discrete function with a minimum value. We can then find where the minimum occurs (choosing the smallest value in a finite set) and thus the estimate of the release point distance for the pitcher.
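The sweep can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each pitch is a dictionary with hypothetical keys x0, y0, z0, vx0, vy0, vz0, ax, ay, az (feet, feet per second, feet per second squared), the trajectory is treated as constant-acceleration motion, and the area uses the fact that the product of the two eigenvalues equals the determinant of the covariance matrix. The function names are our own.

```python
import math

def position_at_y(pitch, y):
    """Extrapolate a constant-acceleration pitch to distance y (feet) from home plate."""
    # Solve 0.5*ay*t^2 + vy0*t + (y0 - y) = 0 for the root nearest t = 0.
    # The time is negative when y lies behind the starting distance y0.
    a = 0.5 * pitch["ay"]
    b = pitch["vy0"]
    c = pitch["y0"] - y
    disc = b * b - 4.0 * a * c
    t = 2.0 * c / (-b - math.copysign(math.sqrt(disc), b))
    x = pitch["x0"] + pitch["vx0"] * t + 0.5 * pitch["ax"] * t * t
    z = pitch["z0"] + pitch["vz0"] * t + 0.5 * pitch["az"] * t * t
    return x, z

def two_sigma_area(points):
    """Area of the two-standard-deviation ellipse: 4*pi*sqrt(det(covariance))."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    mz = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    szz = sum((p[1] - mz) ** 2 for p in points) / (n - 1)
    sxz = sum((p[0] - mx) * (p[1] - mz) for p in points) / (n - 1)
    det = sxx * szz - sxz * sxz
    return 4.0 * math.pi * math.sqrt(max(det, 0.0))

def estimate_release_distance(pitches, y_min_ft=45, y_max_ft=65, step_in=1):
    """Return (distance in feet, area) at the smallest ellipse found in the sweep."""
    best_y, best_area = None, float("inf")
    for inches in range(y_min_ft * 12, y_max_ft * 12 + 1, step_in):
        y = inches / 12.0
        area = two_sigma_area([position_at_y(p, y) for p in pitches])
        if area < best_area:
            best_y, best_area = y, area
    return best_y, best_area
```

Working in whole inches keeps the grid exact, and a minimum that lands on either end of the range would be a sign that the search window is too narrow.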

Earlier we assumed that the data at a fixed y-location was from a bivariate normal distribution. While this is a reasonable assumption, one can still run into difficulties with noisy/inaccurate data or multiple clusters. This can be for myriad reasons: in-season change in pitching mechanics, change in location on the pitching rubber, etc. Since data sets with these factors present will still produce results via the outlined algorithm despite violating our assumptions, the results may be spurious. To handle this, we will fit the data to a Gaussian mixture model via an incremental k-means algorithm at 55 feet. This will approximate the distribution of the data with a probability density function (pdf) that is the sum of k bivariate normal distributions, referred to as components, weighted by their contribution to the pdf, where the weights sum to unity. The number of components, k, is determined by the algorithm based on the distribution of the data.

With the mixture model in hand, we then are faced with how to assign each data point to a cluster. This is not so much a problem as a choice and there are a few reasonable ways to do it. In the process of determining the pdf, each data point is assigned a conditional probability that it belongs to each component. Based on these probabilities, we can assign each data point to a component, thus forming clusters (from here on, we will use the term “cluster” generically to refer to the number of components in the pdf as well as the groupings of data to simplify the terminology). The easiest way to assign the data would be to associate each point with the cluster that it has the highest probability of belonging to. We could then take the largest cluster and perform the analysis on it. However, this becomes troublesome for cases like overlapping clusters.
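For reference, the mixture density and the conditional probabilities mentioned above can be written in the usual form for a Gaussian mixture. The symbols here (weights w_j, means mu_j, covariances Sigma_j, and gamma_ij for the probability that point i belongs to component j) are our notation, and we are assuming the standard Bayes-rule form rather than quoting any particular implementation.

```latex
p(\mathbf{x}) = \sum_{j=1}^{k} w_j \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right),
\qquad \sum_{j=1}^{k} w_j = 1, \quad w_j \ge 0

\gamma_{ij} = \frac{w_j \, \mathcal{N}\!\left(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right)}
                   {\sum_{l=1}^{k} w_l \, \mathcal{N}\!\left(\mathbf{x}_i \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l\right)}
```

For each data point, the gamma values sum to one across the k components.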

A better assumption would be that there is one dominant cluster and to treat the rest as “noise”. Then we would keep only the points that have at least a fixed probability of belonging to the dominant cluster, say five percent. This will throw away less data and fits better with the previous assumption of a single bivariate normal cluster. Both of these methods will also handle the problem of having disjoint clusters by choosing only the one with the most data. In demonstrating the algorithm, we will try these two methods for sorting the data as well as including all data, bivariate normal or not. We will also explore a temporal sorting of the data, as this may do a better job than spatial clustering and is much cheaper to perform.
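The two spatial sorting rules differ only in how they filter on the conditional probabilities. A minimal sketch, assuming the mixture fit has already produced a list of component weights and, for each point, a list of conditional probabilities (one per component); the function names and the min_prob parameter are hypothetical:

```python
def dominant_component(weights):
    """Index of the mixture component with the largest weight."""
    return max(range(len(weights)), key=lambda k: weights[k])

def most_likely_subset(points, resp, dominant):
    """Keep points whose single most probable component is the dominant one."""
    kept = []
    for point, probs in zip(points, resp):
        best = max(range(len(probs)), key=lambda k: probs[k])
        if best == dominant:
            kept.append(point)
    return kept

def threshold_subset(points, resp, dominant, min_prob=0.05):
    """Keep points with at least min_prob probability of belonging to the dominant component."""
    return [point for point, probs in zip(points, resp) if probs[dominant] >= min_prob]
```

With a five-percent threshold, a point is dropped only when it almost certainly belongs elsewhere, so the surviving cluster keeps more of its natural shape than under the most-likely rule.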

To demonstrate this algorithm, we will choose three pitchers with unique data sets from the 2012 season and see how it performs on them: Clayton Kershaw, Lance Lynn, and Cole Hamels.

Case 1: Clayton Kershaw

Figure: Kershaw Clusters

At 55 feet, the Gaussian mixture model identifies five clusters for Kershaw’s data. The green stars represent the center of each cluster and the red ellipses indicate two standard deviations from center along the principal axes. The largest cluster in this group has a weight of .64, meaning it accounts for 64% of the mixture model’s distribution. This is the cluster around the point (1.56,6.44). We will work off of this cluster and remove the data that has a low probability of coming from it. This will include dispensing with the sparse cluster to the upper-right and some data on the periphery of the main cluster. We can see how Kershaw’s clusters are generated by taking a rolling average of his pitch locations at 55 feet (the standard distance used for release points) over the course of 300 pitches (about three starts).

Figure: Kershaw Rolling Average

The green square indicates the average of the first 300 pitches and the red the last 300. From the plot, we can see that Kershaw’s data at 55 feet has very little variation in the vertical direction but, over the course of the season, drifts about 0.4 feet horizontally, with a large part of the rolling average living between 1.5 and 1.6 feet (measured from the center of home plate). For future reference, we will define a “move” of release point as a 9-inch change in consecutive, disjoint 300-pitch averages (this is the “0 Moves” that shows up in the title of the plot; a move would have been denoted by a blue square in the plot). The values of 300 pitches and 9 inches were chosen to provide a large enough sample and enough distance for the clusters to be noticeably disjoint, but one could choose, for example, 100 pitches and 6 inches or any other reasonable values. So, we can conclude that Kershaw never made a significant change in his release point during 2012 and therefore treating the data as a single cluster is justifiable.
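One way the move definition could be implemented is sketched below. Several details are our assumptions rather than something stated above: the change is measured as the straight-line distance between the two averages in the x-z plane, every pitch index is tested as a possible split point, and when many neighboring indices exceed the threshold only the one with the largest change is reported. The name find_moves and its parameters are hypothetical.

```python
import math

def find_moves(xs, zs, window=300, threshold_ft=0.75):
    """Return pitch indices where consecutive, disjoint window averages differ by threshold_ft or more."""
    n = len(xs)
    # Prefix sums let us compute any window average in constant time.
    px, pz = [0.0], [0.0]
    for x, z in zip(xs, zs):
        px.append(px[-1] + x)
        pz.append(pz[-1] + z)

    def mean(lo, hi):
        return (px[hi] - px[lo]) / (hi - lo), (pz[hi] - pz[lo]) / (hi - lo)

    moves = []
    run_best = None  # (distance, index) of the largest change in the current run
    for i in range(window, n - window + 1):
        bx, bz = mean(i - window, i)   # average of the window before pitch i
        nx, nz = mean(i, i + window)   # average of the window starting at pitch i
        d = math.hypot(nx - bx, nz - bz)
        if d >= threshold_ft:
            if run_best is None or d > run_best[0]:
                run_best = (d, i)
        elif run_best is not None:
            moves.append(run_best[1])
            run_best = None
    if run_best is not None:
        moves.append(run_best[1])
    return moves
```

A pitcher whose averages never shift by nine inches yields an empty list, which corresponds to the “0 Moves” case.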

From the spatial clustering results, the first way we will clean up the data set is to take only the data which is most likely from the dominant cluster (based on the conditional probabilities from the clustering algorithm). We can then take this data and approximate the release point distance via the previously discussed algorithm. The release point for this set is estimated at 54 feet, 5 inches. We can also estimate the arm release angle, the angle a pitcher’s arm would make with a horizontal line when viewed from the catcher’s perspective (0 degrees would be a sidearm delivery and would increase as the arm was raised, up to 90 degrees). This can be accomplished by taking the angle of the eigenvector, from horizontal, which corresponds to the smaller variance. This is working under the assumption that a pitcher’s release point will vary more perpendicular to the arm than parallel to the arm. In this case, the arm angle is estimated at 90 degrees. This is likely because we have blunted the edges of the cluster too much, making it closer to circular than the original data: the clusters to the left and right of the dominant cluster are not contributing data. Evidently, this way of sorting the data has the problem of creating sharp transitions at the edges of the cluster.
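The angle calculation can be sketched directly from the entries of the 2×2 sample covariance. The function below is an illustration with hypothetical names; it assumes the covariance entries sxx (horizontal variance), szz (vertical variance), and sxz (covariance) have already been computed, and it folds the result into the 0 to 90 degree range, which discards whether the arm tilts left or right.

```python
import math

def arm_angle_degrees(sxx, sxz, szz):
    """Angle from horizontal, in degrees, of the principal axis with the smaller variance."""
    # Orientation of the larger-variance axis of the symmetric matrix [[sxx, sxz], [sxz, szz]].
    major = 0.5 * math.atan2(2.0 * sxz, sxx - szz)
    # The smaller-variance axis is perpendicular to it.
    minor = (math.degrees(major) + 90.0) % 180.0
    # Fold into [0, 90] so that the result is an unsigned angle from horizontal.
    return 180.0 - minor if minor > 90.0 else minor
```

Data spread mostly horizontally gives an angle near 90 degrees and data spread mostly vertically gives an angle near 0, matching the convention described above.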

Figure: Kershaw Most Likely

As discussed above, we run the algorithm from 45 to 65 feet, in one-inch increments, and find the location corresponding to the smallest ellipse. We can look at the functional that tracks the area of the ellipses at different distances in the aforementioned case.

Figure: Kershaw Most Likely Functional

This area method produces a functional (in our case, it has been discretized to each inch) that can be minimized easily. It is clear from the plot that the minimum occurs at slightly less than 55 feet. Since all of the plots for the functional essentially look parabolic, we will forgo any future plots of this nature.

The next method is to assume that the data is all from one cluster and remove any data points that have a lower than five-percent probability of coming from the dominant cluster. This produces slightly better visual results.

Figure: Kershaw Five Percent

For this choice, we still get some trimming at the edges, but it is not as extreme as in the previous case. The release point is at 54 feet, 3 inches, which is very close to our previous estimate. The arm angle is more realistic, since we maintain the elliptical shape of the data, at 82 degrees.

Figure: Kershaw Original

Finally, we will run the algorithm with the data as-is. We get an ellipse that fits the original data well and indicates a release point of 54 feet, 9 inches. The arm angle, for the original data set, is 79 degrees.

Examining the results, the original data set may be the one of choice for running the algorithm. The shape of the data is already elliptic and, for all intents and purposes, one cluster. However, one may still want to manually remove the handful of outliers before performing the estimation.

Case 2: Lance Lynn

Clayton Kershaw’s data set is much cleaner than most, consisting of a single cluster and a few outliers. Lance Lynn’s data has a different structure.

Figure: Lynn Clusters

The algorithm produces three clusters, two of which share some overlap and the third disjoint from the others. Immediately, it is obvious that running the algorithm on the original data will not produce good results because we do not have a single cluster like with Kershaw. One of our other choices will likely do better. Looking at the rolling average of release points, we can get an idea of what is going on with the data set.

Figure: Lynn Rolling Average

From the rolling average, we see that Lynn’s release point started around -2.3 feet, jumped to -3.4 feet and moved back to -2.3 feet. The moves discussed in the Kershaw section of 9 inches over consecutive, disjoint 300-pitch sequences are indicated by the two blue squares. So around Pitch #1518, Lynn moved about a foot to the left (from the catcher’s perspective) and later moved back, around Pitch #2239. So it makes sense that Lynn might have three clusters since there were two moves. However, his first and third clusters could be considered the same since they are very similar in spatial location.

Lynn’s dominant cluster is the middle one, accounting for about 48% of the distribution. Running any sort of analysis on this will likely draw data from the right cluster as well. First up is the most-likely method:

Figure: Lynn Most Likely

Since we have two clusters that overlap, this method sharply cuts the data on the right hand side. The release point is at 54 feet, 4 inches and the release angle is 33 degrees. For the five-percent method, the cluster will be better shaped since the transition between clusters will not be so sharp.

Figure: Lynn Five Percent

This produces a well-shaped single cluster which is free of all of the data on the left and some of the data from the far right cluster. The release point is at 53 feet, 11 inches and at an angle of 49 degrees.

As opposed to Kershaw, who had a single cluster, Lynn has at least two clusters. Therefore, running this method on the original data set probably will not fare well.

Figure: Lynn Original

Having more than one cluster and analyzing it as only one causes problems with both the release point and the release angle. Since the data has disjoint clusters, it violates our bivariate normal assumption. Also, the angle will likely be incorrect since the ellipse will not properly fit the data (in this instance, it is 82 degrees). Note that the release point distance is not in line with the estimates from the other two methods, being 51 feet, 5 inches instead of around 54 feet.

In this case, as opposed to Kershaw, who only had one pitch cluster, we can temporally sort the data based on the rolling average at the blue squares (where the largest differences between consecutive rolling averages are located).

Figure: Lynn Time Clusters

Since there are two moves in release point, this generates three clusters, two of which overlap, as expected from the analysis of the rolling averages. As before, we can work with the dominant cluster, which is the red data. We will refer to this as the largest method, since it is the largest in terms of number of data points. Note that with spatial clustering, we would pick up some of the green and red data in the dominant cluster. Running the same algorithm for finding the release point distance and angle, we get:
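The temporal splitting itself is inexpensive, which a short sketch makes plain. Here we assume the pitches are stored in chronological order and that the move locations are already known as pitch indices; the function names are hypothetical.

```python
def split_by_moves(pitches, move_indices):
    """Split a chronological list of pitches into segments at each move index."""
    bounds = [0] + sorted(move_indices) + [len(pitches)]
    return [pitches[lo:hi] for lo, hi in zip(bounds, bounds[1:])]

def largest_segment(segments):
    """Return the segment containing the most pitches (the 'largest method')."""
    return max(segments, key=len)
```

Two moves produce three segments, and the release point distance and angle are then estimated from the largest segment alone.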

Figure: Lynn Largest

The distance from home plate of 53 feet, 9 inches matches our other estimates of about 54 feet. The angle in this case is 55 degrees, which is also in agreement. To finish our case study, we will look at another data set that has more than one cluster.

Case 3: Cole Hamels

Figure: Hamels Clusters

For Cole Hamels, we get two dense clusters and two sparse clusters. The two dense clusters appear to have a similar shape and one is shifted a little over a foot away from the other. The middle of the three consecutive clusters only accounts for 14% of the distribution and the long cluster running diagonally through the graph is mostly picking up the handful of outliers, and consists of less than 1% of the distribution. We will work with the cluster with the largest weight, about 0.48, which is the cluster on the far right. If we look at the rolling average for Hamels’ release point, we can see that he switched his release point somewhere around Pitch #1359 last season.

Figure: Hamels Rolling Average

As in the clustered data, Hamels’ release point moves horizontally by just over a foot to the right during the season. As before, we will start by taking only the data which most likely belongs to the cluster on the right.

Figure: Hamels Most Likely

The release point distance is estimated at 52 feet, 11 inches using this method. In this case, the release angle is approximately 71 degrees. Note that on the top and the left the data has been noticeably trimmed away due to assigning data to the most likely cluster. The five-percent method produces:

Figure: Hamels Five Percent

For this method of sorting through the data, we get 52 feet, 10 inches for the release point distance. The cluster has a better shape than the most-likely method and gives a release angle of 74 degrees. So far, both estimates are very close. Using just the original data set, we expect that the method will not perform well because there are two disjoint clusters.

Figure: Hamels Original

We run into the problem of treating two clusters as one and the angle of release goes to 89 degrees since both clusters are at about the same vertical level and therefore there is a large variation in the data horizontally.

Just like with Lance Lynn, we can do a temporal splitting of the data. In this case, we get two clusters since he changed his release point once.

Figure: Hamels Time Clusters

Working with the dominant cluster, the blue data, we obtain a release point at 53 feet, 2 inches and a release angle of 75 degrees.

Figure: Hamels Largest

All three methods that sort the data before performing the algorithm lead to similar results.


Examining the results of these three cases, we can draw a few conclusions. First, regardless of the accuracy of the method, it does produce results within the realm of possibility. We do not get release point distances that are at the boundary of our search space of 45 to 65 feet, or something that would definitely be incorrect, such as 60 feet. So while these release point distances have some error in them, this algorithm can likely be refined to be more accurate. Another interesting result is that, provided that the data is predominantly one cluster, the results do not change dramatically due to how we remove outliers or smaller additional clusters. In most cases, the change is only a few inches. For the release angles, the five-percent method or largest method probably produces the best results because it does not misshape the clusters like the most-likely method does and does not run into the problem of multiple clusters that may plague the original data. Overall, the five-percent method is probably the best bet for running the algorithm and getting decent results for cases of repeated clusters (Lance Lynn) and the largest method will work best for disjoint clusters (Cole Hamels). If just one cluster exists, then working with the original data would seem preferable (Clayton Kershaw).

Moving forward, the goal is to settle on a single method for sorting the data before running the algorithm. The largest method seems the best choice for a robust algorithm since it is inexpensive and, based on limited results, performs on par with the best spatial clustering methods. One problem that comes up in running the simulations that does not show up in the data is the cost of the clustering algorithm. Since the method for finding the clusters is incremental, it can be slow, depending on the number of clusters. One must also iterate to find the covariance matrices and weights for each cluster, which can also be expensive. In addition, the spatial clustering only has the advantages of removing outliers and maintaining repeated clusters, as in Lance Lynn’s case. Given the difference in run time, a few seconds for temporal splitting versus a few hours for spatial clustering, giving up those advantages seems a small price to pay. There are also other approaches that can be taken. The data could be broken down by start and sorted that way as well, with some criteria assigned to determine when data from two starts belong to the same cluster.

Another problem exists that we may not be able to account for. Since the data for the path of a pitch starts at 50 feet and is for tracking the pitch toward home plate, we are essentially extrapolating to get the position of the pitch at distances greater than 50 feet. While this may hold for a small distance, we do not know exactly how far this trajectory is correct. The location of the pitch prior to its individual release point, which we may not know, is essentially hypothetical data since the pitch never existed at that distance from home plate. This is why it might be important to get a good estimate of a pitcher’s release point distance.

There are certainly many other ways to go about estimating release point distance, such as other ways to judge “closeness” of the pitches or sort the data. By mathematizing the problem, and depending on the implementation choices, we have a means to find a distinct release point distance. This is a first attempt at solving this problem which shows some potential. The goal now is to refine it and make it more robust.

Once the algorithm is finalized, it would be interesting to go through video and see how well the results match reality, in terms of release point distance and angle. As it is, we are essentially operating blind since we are using nothing but the PITCHf/x data and some reasonable assumptions. While this worked to produce decent results, it would be best to create a single, robust algorithm that does not require visual inspection of the data for each case. When that is completed, we could then run the algorithm on a large sample of pitchers and compare the results.