What’s past is prologue
Roll up your sleeves; there’s going to be some math in this one. Let’s talk about regression to the mean.
True score theory
The concept of regression to the mean is rooted in what’s called true score theory. The theory states that a measurement consists of two components: an individual’s “true score,” and measurement error, sometimes called random error. (I’ll take this opportunity to note that not all measurement error is random; some is biased.) In sabermetrics, we have taken to calling a player’s underlying talent (what we would see if it could be observed without measurement error) “true talent.”
We can express this using a mathematical equation:
Observed Performance = True Talent + Random Error + Bias
(From here on out, we’ll be ignoring bias, for the sake of clarity.)
Obviously a baseball player’s innate ability isn’t constant: He can be nursing a minor injury or learn better plate discipline. A lot of things can happen to change a player’s true talent level. Of course, the same can be said of taking a test, the typical use case of true score theory. A student can be well-rested one day and tired another, for instance. When we refer to something as “true” we simply mean that it is repeatable under the same conditions.
What we notice when we observe something repeatedly, whether it’s baseball players or students, is that those who did better or worse than the mean (or average) tend to perform closer to the mean as we add more observations.
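To make the idea concrete, here is a minimal simulation sketch of the true score model. It is an illustration, not the method used for the results in this article: the spread of true talent (0.025), the player count, and the treatment of each plate appearance as an independent on-base/out trial are all assumptions chosen for the example.

```python
import random


def simulate_two_samples(n_players=2000, n_pa=100, mean_talent=0.356,
                         talent_sd=0.025, seed=0):
    """Return (first_obp, second_obp) pairs for simulated players.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_players):
        # True talent stays fixed for the player across both samples.
        talent = min(max(rng.gauss(mean_talent, talent_sd), 0.0), 1.0)
        # Each PA is an independent on-base/out trial; the luck in each
        # sample plays the role of the random error term.
        first = sum(rng.random() < talent for _ in range(n_pa)) / n_pa
        second = sum(rng.random() < talent for _ in range(n_pa)) / n_pa
        pairs.append((first, second))
    return pairs


def average(values):
    return sum(values) / len(values)


pairs = simulate_two_samples()
hot = [p for p in pairs if p[0] >= 0.400]   # strong first sample
cold = [p for p in pairs if p[0] <= 0.310]  # weak first sample

for label, group in (("hot start", hot), ("cold start", cold)):
    print(label,
          round(average([p[0] for p in group]), 3),   # first-sample OBP
          round(average([p[1] for p in group]), 3))   # second-sample OBP
```

In this sketch, the hot starters as a group should post a lower OBP in their second sample and the cold starters a higher one, even though nobody’s true talent changed between samples.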
What it looks like in practice
Let’s take a look at all batters who had over 100 plate appearances in a season between 1993 and 2008. The league average OBP in this period was .356. What we are looking at here is how well their OBP in their first 100 plate appearances in a season predicts their OBP the rest of the season. So with that in mind, here’s a look at how players who started differently fared over the rest of the season:
OBP_Start    PA        OBP_Rest
0.300        101992    0.341
0.320        145662    0.350
0.330        150128    0.350
0.340        165794    0.353
0.350        189717    0.357
0.360        188231    0.357
0.370        189845    0.360
0.380        169848    0.368
0.390        173844    0.365
0.400        164177    0.367
0.410        127379    0.374
0.420        123294    0.379
The first column is OBP in the first 100 PAs, the second column is the number of PAs that group of players had after that, and the third column is the average OBP of that group of players after the first 100 PAs. (On average, each player had 393 PAs after their first 100, or 493 PAs in the season.)
What this table shows us is that the concept of regression to the mean seems to do very well in predicting how groups of players will perform in the future.
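One way to summarize the table is to fit a straight line through the group averages, weighting each row by its PAs. The sketch below does that with an ordinary weighted least-squares fit; the function name is made up for this example, and fitting grouped averages like this is a simplification rather than the analysis behind the article.

```python
# Rows from the table above: (OBP_Start, PA, OBP_Rest).
rows = [
    (0.300, 101992, 0.341), (0.320, 145662, 0.350), (0.330, 150128, 0.350),
    (0.340, 165794, 0.353), (0.350, 189717, 0.357), (0.360, 188231, 0.357),
    (0.370, 189845, 0.360), (0.380, 169848, 0.368), (0.390, 173844, 0.365),
    (0.400, 164177, 0.367), (0.410, 127379, 0.374), (0.420, 123294, 0.379),
]


def weighted_line_fit(points):
    """Weighted least squares for y = intercept + slope * x.

    `points` holds (x, weight, y) tuples.
    """
    total = sum(w for _, w, _ in points)
    x_bar = sum(w * x for x, w, _ in points) / total
    y_bar = sum(w * y for _, w, y in points) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for x, w, y in points)
    sxx = sum(w * (x - x_bar) ** 2 for x, w, _ in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = weighted_line_fit(rows)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A slope of 1 would mean a hot or cold start carries over in full; a slope of 0 would mean it tells us nothing. A slope between the two is what regression to the mean looks like in a fit like this.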
Breaking it down
But do all individuals in the group regress to the mean? Let’s take a look at all players in our sample who put up a .320 OBP in their first 100 PAs. As a group, they put up a .350 OBP in the rest of the season. Pretty well regressed, right? But here’s a look at how the individual players in that group did:
While the group mean does move toward the overall mean, some players end up doing far worse than the mean, and some end up doing far better. We can look at this again, this time using a group that was above-average to begin with, say a .390 OBP:
Again, the group regresses to the mean: They put up a .365 OBP for the rest of the season. But again, some improved, and some did even worse than the mean.
Frankly, 100 PAs isn’t enough to tell us much of anything about a baseball player. The R-squared (in other words, the square of the correlation between OBP in the first 100 PAs and OBP rest-of-season) is only .105. We can look at a scatterplot with a regression line on it:
100 PAs of OBP simply doesn’t do a very good job of predicting future OBP. The answer is simply to use more PAs – if Pujols puts up a .320 OBP in his first 100 PAs, the question of how much he will regress to the mean will be almost entirely overshadowed by the fact that he’s Albert Pujols and he is the best hitter on the face of God’s green earth.
Do we regress Pujols’ OBP?
And since he’s Albert Pujols, and we have over 5,800 PAs of .426 OBP from him, we don’t expect him to regress toward the mean anymore, right?
Wrong!
Let’s rework our formula from the beginning. The variance (or spread) of OBP among baseball players in a given time can be broken down into:
Observed Variance = True Variance + Random Error
As the number of observations (in this case, plate appearances) goes up, random error goes down. In 10 PAs, one extra walk (or any other on-base event) is worth .100 points of OBP. In 1,000 PAs, it takes 100 extra walks to be worth .100 points of OBP. Large errors become harder to come by as the number of observations goes up.
But so does everything else: both kinds of variance decrease as our number of observations goes up. When we say that Pujols regresses to the mean, we don’t mean that he’s getting worse. We simply mean that, due to the limited number of observations, we have a certain amount of random variance that we have to account for. Typically for a very large number of PAs it’s a small amount of regression (probably not noticeable if you’re rounding to three significant digits), and since his peers are typically expected to regress by the same amount it doesn’t change our relative opinion of him.
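To put rough numbers on how random error shrinks, here is a small sketch that treats every plate appearance as an independent trial with the same on-base probability. That binomial model is an assumption made for illustration (real PAs are not identical or independent), and `binomial_sd` is a name invented for this example.

```python
import math


def binomial_sd(p, n):
    """Standard deviation of observed OBP over n PAs, assuming each PA
    is an independent trial with on-base probability p."""
    return math.sqrt(p * (1 - p) / n)


LEAGUE_OBP = 0.356  # league average from the 1993-2008 sample above

for n in (10, 100, 1000, 5800):
    print(n, round(binomial_sd(LEAGUE_OBP, n), 3))
```

Under this model the random error falls with the square root of the number of PAs, so it takes four times as many PAs to cut it in half.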
A side note about selective sampling
A funny thing happened on the way to the 101st plate appearance: Players who did very poorly tended not to make it that far, while players who did very well tended to get more PAs. There is a selective sampling bias here, in that teams allocate playing time based upon 100 PAs (as unpredictive as we assert that they are). Those players who do poorly and still receive playing time are generally either those who have done well in the past or those who teams think are better players for reasons unrelated to performance to date, like scouting.
References & Resources
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.
Here’s a great article about regression to the mean that goes into the history of the phrase. Another great resource on the topic is here (even though it focuses on basketball, not baseball.) And of course I’ve written previously on the topic.
The Sage Encyclopedia entry on reliability was a great help in preparing this article.
Graphs were created using gretl. All results were weighted by the number of plate appearances in the second sample – in other words, a player with 300 PAs was counted twice as many times as a player with 150 PAs.
Just taking a wild guess, I would say you would weight his 100 PA at “1”, his career 5800 PA at “0.40” (total of 2320 PA), and add in 200 PA of .356.
That gives me a weighted average of .417.
Now, if that’s all we knew, then that’s our best estimate.
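For anyone who wants to check the arithmetic, the weighting described above can be written out as a short sketch. The weights are the "wild guess" values from this comment, not fitted numbers, and `weighted_average` is a helper written for this example.

```python
def weighted_average(samples):
    """`samples` holds (obp, weight_in_pa) pairs."""
    total_weight = sum(weight for _, weight in samples)
    return sum(obp * weight for obp, weight in samples) / total_weight


samples = [
    (0.320, 100),          # first 100 PAs of the season, weighted at 1
    (0.426, 0.40 * 5800),  # career PAs weighted at 0.40 (2320 PAs)
    (0.356, 200),          # 200 PAs of league-average OBP
]

estimate = weighted_average(samples)
print(f"{estimate:.3f}")  # 0.417
```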
Or, we can do this slightly differently, and regress the year-to-date stats and the career stats separately, as well as regressing to the population mean:
(.320/.044^2+.426/.006^2+.356/.025^2)/(1/.044^2+1/.006^2+1/.025^2)
The result? .420. I’m not sure if everything works out the same as the PAs go down, but at least in the case of someone like Pujols either approach should give the same results.
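The formula above is an inverse-variance weighted average: each estimate is weighted by one over the square of its standard deviation. Here is the same calculation as a sketch, using the three OBP and standard deviation pairs from the formula as given; `inverse_variance_mean` is a name chosen for this example.

```python
def inverse_variance_mean(estimates):
    """`estimates` holds (value, standard_deviation) pairs."""
    weights = [1 / sd ** 2 for _, sd in estimates]
    weighted_sum = sum(w * value for w, (value, _) in zip(weights, estimates))
    return weighted_sum / sum(weights)


estimates = [
    (0.320, 0.044),  # year-to-date OBP
    (0.426, 0.006),  # career OBP
    (0.356, 0.025),  # population (league) mean
]

combined = inverse_variance_mean(estimates)
print(f"{combined:.3f}")  # 0.420
```

Because the career figure has by far the smallest standard deviation, it dominates the result, which is why the answer sits so close to .426.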
Re: Peter’s question. Given Pujols’ 0.426 in 5800 PA combined with his 0.320 in most recent 100 PA:
Best estimate of OBP = (0.426*5800+0.320*100)/5900 = 0.424.
Therefore, the best estimate of his OBP for the next N PA is 0.424, regardless of the size of N. I thought this was Colin’s answer before he took off on a tangent that I didn’t understand.
Ok, guys, where am I going wrong?
Colin – So Pujols has a .320 OBP in his first 100 plate appearances of a season. And you know that the league average OBP has been .356 and that Pujols’ previous OBP in 5800 PAs has been .426. Show how you would predict what Albert’s OBP would be in his next 500 plate appearances (essentially the rest of the season).
Colin, thanks for the link to The Book. It looks like a must read. Of course you’re right that the .49 would also just be an estimate of the SD since we don’t know the true OBP.
Ah, I wasn’t sure what you were doing with the formula involving the .025 – now I see you’re just weighting the estimates with the inverses of the variances. Keep up the good work, Colin! I’ll definitely read again…
Colin – it’s my first time posting (and reading) and I don’t know what “The Book” is, but where does .47 come from? Assuming you’re finding the standard deviation (SD) there, it should be sqrt(.424(1-.424))=.49, right?
Also, why does the .356 estimate have a higher SD associated with it than the .424 based on only 5900 PAs?
I’m looking forward to reading a lot more!