﻿ The stats we target | The Hardball Times

# The stats we target

For someone who writes about fantasy baseball, ADP (Average Draft Position) is a fun statistic. For instance, doing something as simple as graphing ADP against itself can reveal some aspects of what occurs during a draft. These ADP data, by the way, come from Yahoo drafts for the 2008 season, meaning the drafts took place before that season began.

The interesting part of this graph is not where the dots are located, but their distance from each other. Noticing how they are relatively bunched at the edges and less dense in the middle reinforces my sentiment in this article—that drafting in the middle rounds is the most difficult.

Fantasy baseballers cannot agree on where to take players in these rounds, and therefore few players end up with an average draft position in the 100s. At the end of a draft the question is more “who” to take than “where,” which produces the clustering you see after the 200 ADP mark.
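For anyone who wants to check the bunching without a graph, here is a minimal sketch that counts how many players land in each 50-pick ADP range. The function name, the bucket width, and the sample ADP values are all invented for illustration; they are not the Yahoo data.

```python
from collections import Counter


def adp_bucket_counts(adps, width=50):
    """Count how many players fall in each ADP range of `width` picks.

    Keys are the lower bound of each range (0 means ADP 0-49.9, and so on).
    """
    counts = Counter(int(adp // width) * width for adp in adps)
    return dict(sorted(counts.items()))


# Hypothetical ADP values, made up to mimic the shape described above:
# crowded at the top and bottom of the draft, sparse in the middle.
sample_adps = [1.2, 3.5, 12.0, 48.9, 61.3, 140.2, 205.0, 211.4, 230.8, 244.1]

print(adp_bucket_counts(sample_adps))
```

Sparse counts in the middle buckets would be the numerical version of the gaps between the dots.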

Ostensibly, people drafted these players where they did because of the stats these players accumulated in the previous year. Comparing a player’s 2007 numbers with his 2008 ADP can provide us with some insight into which of the fantasy stats we target the most in drafts. Before we get buried in numbers, though, let’s first look at some graphs, starting with home runs, since I figure they will be an important determinant.

### Graphs

This graph shows us that it is not imperative to hit a ton of home runs to be taken early, as depicted by the dots toward the lower left of the graph. Also, hitting around 25 home runs seems to be the magic number to get a hitter out of the 200+ ADP cluster, and from there a nicely defined linear slope brings us to Alex Rodriguez’s 54 home runs in 2007 and his corresponding 1.2 ADP in 2008.

Next we will look at stolen bases, which might present a graph that looks radically different from the plateau-shaped home run graph.

This graph actually looks somewhat similar to the home run graph; it features the same basic shape, except with more players at the left extreme and fewer at the right. Simply looking at the graph, though, the dispersion appears more random, whereas the home run graph had a more visible downward slope.

Even more random than the stolen bases graph is the one comparing batting average to ADP.

Since batting average is a rate stat, I increased the at-bat threshold to 400 to eliminate possible fluky batting averages attained over a couple of hundred at-bats. Despite that, a player’s batting average appears to have a small effect on where he is drafted. Intuition tells me there must be some degree of correlation, but compared to home runs and stolen bases it appears to be small.
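The at-bat cutoff and the "degree of correlation" can be made concrete with a short sketch. It assumes each player is a dictionary with `ab`, `hits`, and `adp` keys; those field names and the five sample players are hypothetical, invented only to show the steps.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def avg_vs_adp_correlation(players, min_ab=400):
    """Drop players under the at-bat threshold, then correlate AVG with ADP."""
    qualified = [p for p in players if p["ab"] >= min_ab]
    averages = [p["hits"] / p["ab"] for p in qualified]
    adps = [p["adp"] for p in qualified]
    return len(qualified), pearson(averages, adps)


# Hypothetical players, invented for illustration only.
players = [
    {"ab": 600, "hits": 192, "adp": 15.0},
    {"ab": 550, "hits": 154, "adp": 80.0},
    {"ab": 480, "hits": 125, "adp": 190.0},
    {"ab": 520, "hits": 156, "adp": 120.0},
    {"ab": 150, "hits": 54, "adp": 240.0},  # fluky .360 in 150 AB, filtered out
]

count, r = avg_vs_adp_correlation(players)
print(count, round(r, 2))
```

A correlation near zero would match the "small effect" impression from the graph, while a value near -1 would mean higher averages go hand in hand with earlier picks.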

Last we will look at the graph of runs, which appear to correlate well with next year’s ADP, although later we will find out that may not be the case.

As you can see there is a well-defined, generally downward slope to the right, suggesting a correlation. Sometimes with graphs looks can be deceiving, as the next section will show.

### Regression

Looking at pretty graphs is nice, but let’s not get distracted from the purpose of the data. What the data can tell us is which of the five main fantasy stats have the largest impact on where a player gets drafted in the following year. For this I used a multivariate regression, or rather two of them: one using the stats as counting stats, with batting average converted to hits, and a second using them as rate stats, so that home runs, for example, became home runs per at-bat. The results of the regressions are summarized in the following tables.

```
+-----------------------------------+
|            ~ COUNTING ~           |
+------+--------------+-------------+
| Stat | Coefficients | P-value     |
+------+--------------+-------------+
| Int. | 370.6356     | 1.6768 E-31 |
| R    | -0.3829      | 0.34664     |
| HR   | -2.2503      | 0.0093      |
| RBI  | -1.1258      | 0.0030      |
| SB   | -2.1020      | 8.6056 E-07 |
| Hits | -0.3875      | 0.1504      |
+------+--------------+-------------+
```

In the coefficients column, a lower (more negative) coefficient means each additional unit of that stat moves a player further up the draft board. So in counting form home runs edge out stolen bases as the most influential, with runs and hits the least important. The “P-value” column shows the statistical significance of each coefficient, with anything under .05 counting as significant, meaning home runs, RBI, and especially stolen bases pass the significance test. As I hinted before, runs were extraordinarily insignificant compared to the other stats.
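For the curious, here is a rough sketch of how coefficients like these come out of a least-squares fit. It is not the actual analysis: the 60 stat lines are randomly generated, the ADP values are built directly from the published counting coefficients with no noise, and NumPy's `lstsq` stands in for whatever software was really used. It only shows that the fitting step recovers an intercept and one coefficient per stat.

```python
import numpy as np


def fit_adp_regression(stats, adp):
    """Ordinary least squares for ADP = intercept + stats @ coefficients.

    Returns an array whose first entry is the intercept, followed by one
    coefficient per column of `stats`.
    """
    design = np.column_stack([np.ones(len(stats)), stats])
    beta, *_ = np.linalg.lstsq(design, adp, rcond=None)
    return beta


rng = np.random.default_rng(0)
n_players = 60

# Synthetic stat lines (columns: R, HR, RBI, SB, Hits), invented for the demo.
stats = np.column_stack([
    rng.integers(40, 121, n_players),
    rng.integers(0, 51, n_players),
    rng.integers(30, 131, n_players),
    rng.integers(0, 61, n_players),
    rng.integers(90, 211, n_players),
]).astype(float)

# Intercept and coefficients copied from the counting table above.
published = np.array([370.6356, -0.3829, -2.2503, -1.1258, -2.1020, -0.3875])

# Noise-free synthetic ADP, so the fit should hand the same numbers back.
adp = published[0] + stats @ published[1:]

fitted = fit_adp_regression(stats, adp)
print(np.round(fitted, 4))
```

Real draft data are noisy, so a real fit would not reproduce any set of coefficients exactly, and the P-values in the table require extra calculations that this sketch leaves out.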

```
+-------------------------------------+
|              ~ RATE ~               |
+--------+--------------+-------------+
| Stat   | Coefficients | P-value     |
+--------+--------------+-------------+
| Int.   | 550.6223     | 4.2287 E-24 |
| R/AB   | -97.9406     | 0.6615      |
| HR/AB  | -1523.8494   | 0.0019      |
| RBI/AB | -608.0605    | 0.0045      |
| SB/AB  | -1578.2072   | 7.5421 E-10 |
| AVG    | -833.7461    | 2.9494 E-05 |
+--------+--------------+-------------+
```

Once again home runs and stolen bases jump out as the big players, with batting average, not surprisingly, rising in importance since this is its home court, so to speak. And once again runs display their general lack of relevance.
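The rate coefficients look enormous next to the counting ones only because their unit is "one more home run per at-bat" rather than "one more home run." A small sketch puts the two on the same footing by asking what ten extra home runs are worth under each model. The 550 at-bat season is an assumed round number, and the function names are made up for this example.

```python
# Home run coefficients copied from the two tables above.
COUNTING_HR = -2.2503    # ADP change per additional home run
RATE_HR = -1523.8494     # ADP change per additional home run per at-bat


def adp_shift_counting(extra_hr):
    """Estimated ADP change from extra home runs under the counting model."""
    return COUNTING_HR * extra_hr


def adp_shift_rate(extra_hr, at_bats):
    """Estimated ADP change from the same home runs under the rate model."""
    return RATE_HR * (extra_hr / at_bats)


at_bats = 550  # assumed full-season workload, not taken from the data

print(round(adp_shift_counting(10), 1))       # -22.5
print(round(adp_shift_rate(10, at_bats), 1))  # -27.7
```

Under that assumption the two models tell a similar story: ten more homers are worth somewhere around 22 to 28 draft spots.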

The one part of these charts I have failed to mention yet is the coefficient of the intercept. A fun thing you can do with it is create a rough estimate of where a player will be drafted given his stat line for a season. Multiplying each of a player’s stats by the size of its coefficient, adding those products up, and then subtracting the total from the intercept will generate a rough estimate of that player’s ADP. For example, if you took Todd Helton’s 2007 line of 86 runs, 17 homers, 91 RBI, no stolen bases, and 178 hits and plugged it in:

Estimated ADP = 370.6 – (86 * .3829) – (17 * 2.25) – (91 * 1.1258) – (0 * 2.1) – (178 * .3875) = 128.5

Helton’s estimated ADP of 128.5 is remarkably close to his actual ADP that year of 135.4 given the crudeness of the model (using only one year of data from one website) and the fact that it does not take into account any positional adjustment. This model worked well for this set of data with an R-Squared of .8, but that is not overly surprising considering the model was created off the 2007 season-2008 ADP data. At this point this ADP model probably will not work tremendously well for the 2009 season stats, but given a few more years of data added it could become an interesting tool for leagues that draft early in the offseason, or for some historical context on a player’s ADP.
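The Helton estimate is easy to reproduce in a few lines. This sketch uses the counting-table coefficients at full precision; the dictionary layout and the function name `estimate_adp` are my own choices for illustration. Because the coefficients are already negative, the products are added to the intercept rather than subtracted.

```python
# Intercept and coefficients copied from the counting table.
COUNTING_MODEL = {
    "intercept": 370.6356,
    "R": -0.3829,
    "HR": -2.2503,
    "RBI": -1.1258,
    "SB": -2.1020,
    "Hits": -0.3875,
}


def estimate_adp(stat_line, model=COUNTING_MODEL):
    """Rough ADP estimate: intercept plus each stat times its coefficient."""
    return model["intercept"] + sum(
        model[stat] * value for stat, value in stat_line.items()
    )


# Todd Helton's 2007 line as quoted in the article.
helton_2007 = {"R": 86, "HR": 17, "RBI": 91, "SB": 0, "Hits": 178}

print(round(estimate_adp(helton_2007), 1))  # 128.0
```

Evaluated this way the estimate comes out at roughly 128.0, about half a pick away from the 128.5 quoted above. I don't know the source of that small gap.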

### Concluding thoughts

I know this article does more to confirm what we might have already suspected (that home runs and steals matter most when it comes to determining ADP) than to provide new information, but there are still lessons to take away.

First, the insignificance of runs in the regressions points to a possible inefficiency in the fantasy marketplace. People most likely assume runs are a byproduct of other skills and ignore them when ranking players. A system that would take into account position in batting order, team runs per game, and of course the player’s skill level could more accurately predict expected run totals and make rankings more accurate.

The xADP model I debuted is something that could become a powerful fantasy tool given a few more years of ADP data, and hopefully you saw a glimpse of that.

I’ll end with a confession and display of gratitude to colleague Nick Steiner, who ran the multivariate regressions that spewed out the coefficient values that were instrumental to this article. I am more statistically illiterate than you might assume and do not have the savvy to run such regressions. I owe a big thanks to him for his time and effort.

Guest
Kyle

Great blog… Why do you say the 2009 stats will not work in this formula? How did you come up with the coefficients? I have 10 batting and 10 pitching categories and would like to determine what I should be going off of. I would assume it would look similar, but would like to know how some of the other categories affect our league. Thanks

Guest
Millsy
I think this is interesting, but do you have any qualms about the fact that ADP is EXTREMELY non-independent (independence being a crucial assumption in running a regression)? ADP is a rank-based measure, and I’m not convinced a simple multivariate regression is sufficient. For every one place someone moves up, another moves down. In addition, the multi-collinearity could be a problem for the ‘Runs’ measure, resulting in your strange p-value for that coefficient, despite the obvious relationship in the scatterplot. I’m sure Runs are a secondary component of ‘skill’, but if that’s the case, then I’m not sure it’s all that…
Guest
Millsy

Sorry, one more qualm. Since ADP is obviously truncated at Pick 1 (and arguably capped depending on the number of players needed for your league), did you run any sort of Tobit model, or just a simple regression? If it’s a simple multivariate regression, then the coefficients are going to have problems at the extremes of ADP (and likely give Albert Pujols a negative ADP).

Guest
Jeff Z

One point on the wide gaps in the middle game could be teams filling up positions later and reaching.  What might be nice is the ADP when each team has drafted a position.

Guest
Nick Steiner
Millsy, I was the one who helped Paul with the regression part of this, so I can probably answer your questions. 1) Yes, I understand that’s a pretty big problem with the fact that ADP is a rank based system instead of a value. However, I wasn’t really sure how to get around that. Do you have any suggestions? 2) I think runs are obviously going to have a ton of multicollinearity. Runs are basically the byproduct of home runs, stolen bases and OBP (which is generally going to mirror batting average). BTW, if I run the regression…
Guest
Millsy
Thanks for the response, Nick. I’ve been looking into some methods, as we had some plans to mess with this idea over at Fantasy Ball Junkie. I think it’s a really neat tool to use in general, and there’s lots of room for improving it. Right now, I’m not sure exactly how to get around the dependency issue in the ranks (though I’m looking into it this next week or so). However, I don’t think it completely damns the use of regression here. As for the runs, our editor at FBJ and I have been discussing the best way to…
Guest
John K
Good article – I like the idea of this piece a lot. It’s not correct to simply look at the magnitude of the coefficient and make a claim about a regressor’s “importance.” For instance, I could apply a monotonic transformation to your RBI number, get exactly the same significance and R^2, but a larger coefficient. The fact that the mean RBI/AB is higher than that of HR/AB will influence the magnitude of your coefficient. That is not sufficient for a conclusion that HRs are more “important” in determining ADP. Perhaps it would be more helpful if you multiplied the coefficient…
Guest
John K

I’m saying your interpretation is off, not just the terminology.

Separately, skimming over the other comments I think the issue of the negative ADP forecasts is a false concern.  All you need from a forecast of ADP is an ordinal ranking.  Why you would want to use a regression to forecast ADP is another question!

Guest
Millsy
John K, It’s not a false concern, IF you’re going to use a standard regression (assuming that’s a wise choice), given the obvious censoring of the data. If the idea is just to take the negative predictions and then ordinally rank them 1 to whatever, then fine, but you’re still going to get screwy coefficient estimates that likely aren’t correct. If our interest is in the coefficients as well as the ADP, it’s absolutely a legitimate concern. There’s the issue there as well that the coefficients should NOT be the same across the entire sample. Jumping from 20 to 5…
Guest
Paul Singman
Kyle: This model will not work well for 2009 stats because it was created using the 2007 season stats and 2008 ADP data. Ideally a model with more predictive value would be based off of multiple years of data. The coefficients were derived by applying a multivariate regression to season stats and ADP data. This is not something just anyone can do (myself included); I suppose it is something you would learn in a stats class. Millsy: Unfortunately I cannot address your concerns regarding some of the finer points of the regression and model, however your wording when you say “ADP is a…
Guest
Derek Ambrosino
I’d like to offer a simpler hypothesis regarding the importance of SBs and HRs, and lack of importance of runs. (Although, I recognize the multicollinearity issue as well.) I think scale/volume of the stats has something to do with it as well. Teams, as a whole, score a lot of runs and drive in a lot of runs; they certainly accrue way more of each than homers (it’d be impossible not to, I know) and, even more so, steals. So, when drafting/bidding, owners are less likely to pay a lot of mind to a small difference in runs or RBI. People…
Guest
B N
I think the predictability of runs is definitely a big factor in why people don’t consider them very highly. Your ability to score runs is going to be a function primarily of two things: your OBP and who is batting behind you. One of those factors is something that can be taken into account, as it has to do with player skills. The lineup position is a crapshoot though. With rookies, and even with veterans, you can always end up with a case where they get dropped into the bottom of the lineup. In most cases, batting 5th or 6th…