# 525,600 Minutes: How Do You Measure a Player in a Year?

*We’re pleased to republish this often referenced article by Pizza Cutter that originally appeared in StatSpeak.net on November 14th 2007.*

What does a year really tell you about a player? Seriously. If I gave you the seasonal stats for any player last year (or the year before), how much could you really tell me about him? If I told you he hit .300 last year, are you confident that deep down, he’s really a .300 hitter? How do you measure a year in the life?

Like a lot of things that happen out here in the Sabersphere, I take my inspiration for this (series of?) article(s?) from a conversation that went on at the Inside the Book blog. A few folks were discussing an article that I wrote here at StatSpeak on productive outs and as these things are wont to do, the conversation wandered. *Inside the Book* co-author MGL asked me a fair question: when I talked about productive outs, what sample size I was dealing with. Not so much how many player-years were in my data set, but for each of those player years, how many PA’s did each player have. It’s a much more important question than you might think.

If you’ve been reading my work for a while, you know that I often say things like, “minimum of 100 PA.” (I’m hardly the only one to do this, by the way.) Why did I make sure that the batter had 100 PA? Well, first off, let’s say that I’m interested in rating batters by how often they strike out. And I happen to come across a player who got five at-bats in a season and never ever struck out. I hereby crown him the king of all contact hitters! He will never ever ever strike out ever. Right? Of course not. 5 PA isn’t a big enough sample size to measure anything. But what is? When I say minimum 100 PA, I must admit I’m usually using a very unscientific “yeah, that sounds about right” criterion for picking the number. What if 100 PA isn’t a big enough sample for what I’m trying to measure either? I’m a scientist by training (my cancer biologist wife laughs at me when I say that), and I should be a little more… scientific.

(Major and extensive numerical nerdiness alert. As if the reference to *Rent* wasn’t nerdy enough. This is a really long methodological article for the hardcore researchers out there. If you’re here for witty banter about statistical matters in baseball, may I suggest you pick another article.)

What we’re talking about here is a concept known in social science research as measure reliability. It’s the idea that if I took the same measure over and over again, I’d get (roughly) the same answer each time. This shouldn’t be confused with measure *validity*, which is whether or not the measure I’m using is actually measuring what I think it does. I might ask 25 people to tell me what color the sky is, and they might all say “green with orange polka dots.” The measure is very reliable, but not very valid. In statistics, the way to increase reliability of a measure is to have more observations in the data set. If I took a player’s on-base percentage for his first five at-bats in a season, and then his next five, and then his next five, and so on, those numbers are going to fluctuate all over the place. But if I do it in 200 at-bat sequences, the numbers will be more stable. I’ll hopefully get (roughly) the same number each time I take a sample of 200 at-bats. The question I ask is when does that number become stable enough that we say that it’s OK to make inferences about a group of players?
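To see the fluctuation in miniature, here is a small simulation sketch. It assumes a hypothetical hitter whose true on-base rate is fixed at .340 (an illustrative number, not taken from any data set) and compares how much his OBP bounces around in 5-PA chunks versus 200-PA chunks.

```python
import random
import statistics

def chunk_rates(true_rate, chunk_size, n_chunks, rng):
    """Simulate n_chunks samples of chunk_size PAs; return the on-base rate in each."""
    rates = []
    for _ in range(n_chunks):
        times_on_base = sum(rng.random() < true_rate for _ in range(chunk_size))
        rates.append(times_on_base / chunk_size)
    return rates

rng = random.Random(2007)   # fixed seed so the run is repeatable
TRUE_OBP = 0.340            # assumed "true talent", purely illustrative

small = chunk_rates(TRUE_OBP, 5, 200, rng)
large = chunk_rates(TRUE_OBP, 200, 200, rng)

# Spread of the observed OBP from chunk to chunk
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
print(f"SD of OBP across 5-PA chunks:   {spread_small:.3f}")
print(f"SD of OBP across 200-PA chunks: {spread_large:.3f}")
```

The hitter never changes in this toy world, so all of the chunk-to-chunk movement is sampling noise, and there is far less of it in the larger chunks.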

In social science, we look for a magic number, which is .70. For example, one way of estimating measure stability and reliability is to look at things from one time point to another, in the case of baseball, from one season to another. DIPS theory sprang from this type of question. Strikeout rate is stable from year to year, suggesting that a pitcher’s yearly strikeout rate is something that represents a coherent, stable measurement that tells you something about a player himself rather than his circumstances. A pitcher’s BABIP not so much, as it’s not stable from one year to the next. This type of question lends itself well to stats like year-to-year correlation and my favorite, intra-class correlation, and usually the gold standard for reliability is .70. Here’s the thing about year-to-year correlation. Let’s say that baseball seasons lasted one plate appearance per season. Nothing’s going to correlate year-to-year because one plate appearance doesn’t tell you much of anything. But, now let’s pretend that baseball seasons lasted for billions and billions of plate appearances. If I watched a player for that long, I’d have a really good idea of what his “true” ability is. And if I got another billion PA’s, I’d probably get the exact same number the next time around. Over a bunch of players, with each one measured 1 billion times in each year, my correlation would probably hit 1.0 (which is a perfect correlation), because I would have a perfect measurement of all players at two different time points.

But, in a season we get 600-700 PA for regulars, and fewer for bench/platoon/fringe/injured players. Is that enough to do real research? Turns out that the answer is “depends on what stat you want to measure.” How we get to that answer is another matter, but we will get there. If I say that 100 PA is enough of a sample to get adequate reliability on a measure (let’s say batting average), then I should be able to take 100 PA and calculate the batting average from those and then another 100 PA with which to calculate AVG. I could do this for some group of players and I could see how well their AVGs in the first group of 100 PA correlate with their second group. Why am I obsessed with .70? Because a correlation of .70 means an R-squared of 49%. Anything north of .70 (strictly, north of about .707) means that a majority of the variance (> 50%) is stable. Higher correlations mean more stability, which is always better, but .70 is usually “good enough for government work.”
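The two-samples-and-correlate idea can be tried on a made-up league. The sketch below is a simulation, not Retrosheet data: it assumes 300 players whose true rates are spread around .330 with a standard deviation of .030 (both numbers are invented for illustration), draws two independent samples of a given size for each player, and correlates the two.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def sample_rate(talent, n_pa, rng):
    """Observed rate over n_pa simulated plate appearances."""
    return sum(rng.random() < talent for _ in range(n_pa)) / n_pa

def two_sample_correlation(n_pa, n_players, rng):
    """Correlate each simulated player's first n_pa sample with his second."""
    # Assumed talent spread: mean .330, SD .030. Illustrative only.
    talents = [rng.gauss(0.330, 0.030) for _ in range(n_players)]
    first = [sample_rate(t, n_pa, rng) for t in talents]
    second = [sample_rate(t, n_pa, rng) for t in talents]
    return pearson(first, second)

rng = random.Random(42)
results = {n_pa: two_sample_correlation(n_pa, 300, rng) for n_pa in (50, 600, 2000)}
for n_pa, r in results.items():
    print(f"{n_pa:>5} PA per sample: r = {r:.2f}, r^2 = {r * r:.2f}")
```

Under these assumptions the correlation climbs as the samples get bigger, which is the one-PA-season versus billion-PA-season thought experiment on a smaller scale.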

Now, where can we get those samples of 100 or 300 or 500 PA? Well, first off, we’ll need two samples of 100 or 300 or 500 to compare against each other. So, if I want to see if a stat is stable at 300 PA, then I could take a player’s first 300 PA of the season (pre-All Star break?) and compare it against his next 300 (second half?) There’s a problem in there. In the second half, he might be more tired, or he might be better in the later summer, or perhaps he played the second half with an injury. In the aggregate that probably all shakes out, but perhaps all players tire out midway through the season in some systematic, predictable way. Perhaps I could look year to year, but I’d have some of the same sort of issues. In the second year, the player is a year older and wiser, and that will affect him in a number of ways, good (smarter) and bad (physical decline?).

There’s another method which sidesteps a lot of these issues. It’s called split-half reliability. Here’s what I did. I took each player’s plate appearances and numbered them sequentially, from his first to his last. Then, I split them up into even-numbered and odd-numbered appearances. In this way, I could split a season of 600 PA into two 300-PA samples, and there would be plate appearances from just about all games played in both samples. This seems a much fairer (if more cumbersome) way of splitting things up.
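A minimal sketch of the odd/even split, using a made-up season rather than real play-by-play data. The hypothetical hitter below reaches base in 2 of every 5 PAs before the break and in 6 of every 20 after it, so a first-half/second-half comparison is badly unbalanced while the odd/even split is not.

```python
def rate(outcomes):
    """Share of plate appearances in which the batter reached base (1) vs not (0)."""
    return sum(outcomes) / len(outcomes)

def split_half(outcomes):
    """Split a chronological list of PAs into odd- and even-numbered appearances."""
    odd_numbered = outcomes[0::2]    # 1st, 3rd, 5th, ... PA of the season
    even_numbered = outcomes[1::2]   # 2nd, 4th, 6th, ... PA of the season
    return rate(odd_numbered), rate(even_numbered)

# Made-up 600-PA season with a second-half slump (patterns are illustrative only).
first_half = [1 if j % 5 in (0, 2) else 0 for j in range(300)]
second_half = [1 if j % 20 in (0, 3, 6, 9, 12, 15) else 0 for j in range(300)]
season = first_half + second_half

odd_rate, even_rate = split_half(season)
print(f"first half {rate(first_half):.3f} vs second half {rate(second_half):.3f}")
print(f"odd PAs    {odd_rate:.3f} vs even PAs    {even_rate:.3f}")
```

Because each sample draws from the whole calendar, the slump is shared between the two halves instead of landing entirely in one of them.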

There’s one other problem, though. A year usually lasts 600-700 PA’s for regulars. Within a year, I don’t get a second sample of 600-700 more PA’s to use as a comparison. That means that the top number I can check for as the “is it consistent enough” number is going to be around 350 or so, and then, I’m only dealing with people who are good enough (and perhaps consistent enough) to play every day. I originally ran one-year-only samples, but found that some stats weren’t reliable at 350 PA. So, I took consecutive two-year windows (2001-2002, 2003-2004, 2005-2006), and used split-half reliability within those two-year windows. So, for each player, I took his even-numbered PA’s and compared them against the odd-numbered PA’s, pulling plate appearances into both groups from both years. It’s not perfect, but at least it balances things out a bit.

As always, data were kindly provided by Retrosheet. I love Retrosheet.

I calculated a lot of the usual stats we like to use in baseball, for the time being focusing on batters, and checked to see where they started meeting the “at this minimum of PA’s, the correlation coefficient between the even and odd PA’s is at least .70” criterion at varying levels of PA’s. You can interpret them this way: When I raised the minimum inclusion criterion to include all players with a minimum of ___ PA, the stat in question was reliable enough to actually say something about the sample of players, that is, it had a split-half reliability over .70. When future researchers conduct studies *on groups of players* (more on why that phrase is important in a minute) using these statistics, these are the minima I recommend for inclusion in any sort of data set. (Whether anyone cares or not what I recommend is another issue.)

In the previous paragraph, a very important distinction must be made. The minima listed below do not mean that the statistic in question stabilizes at ___ PA for an individual player, but that it stabilizes *in a sample which includes all players with* ___ PA *and above*. Since this is the way that we usually do research, it seems to be the best way to begin. Whether a statistic is reliable in a sample of players that had exactly two samples of ___ PA to compare against each other is another study. The problem is that if I say 100 PA or more, I’m taking a look both at those with two 100-PA samples and at those with two 600-PA samples. The 600 PA samples will be much more stable and make things look more stable than they are. Only by restricting the range a wee bit can we answer the other player evaluation question of “How many PA’s until I have a good idea of Jones’s real abilities?” But I’m not doing that yet. Right now, I’m looking for minimum inclusion criteria. Numbers are rounded a little bit to make them a little more appealing to the eye.

Some of the one-number stats for hitters stabilized at:

- AVG – never did. At 650 PA, it had only reached a split-half correlation of .668
- BABIP – never did. At 650 PA, it had only reached a split-half correlation of .631
- OBP – 350 PA
- SLG – 350 PA
- ISO – 350 PA
- OPS – 350 PA

Even full-season batting averages for regular players aren’t fully reliable stats, at least according to my definition. Add that to the list of reasons why it’s silly to give out a batting title to the highest batting average in the league (apologies to Magglio Ordonez). However, OBP and SLG (and their derivative combinations) are stable for a sample including part-time players and regulars.

A few “how often does he…” stats stabilized at:

- 1B rate – 375 PA
- 2B+3B rate – never did. At 650 PA, it had only reached a split-half correlation of .411
- HR rate – 100 PA
- K rate – under 40 PA
- BB rate – under 40 PA

Talk about three true outcomes! Even when I got down to 40 PA’s, walks and strikeouts remained very stable.

Batted Ball stats:

- GB rate – under 40 PA
- LD rate – under 40 PA
- FB rate – 175 PA
- IFFB rate – 350 PA
- GB/FB – 100 PA
- HR/FB – 100 PA

Hmmmm… ground balls and line drives remain pretty stable, even when the sample size is ridiculously low, but you need half a season or so before the pop ups stabilize.

A few advanced stats:

- WPA – never did. At 650 PA, it had a split-half reliability of .401
- Context Neutral wins (sum (WPA/LI)) – never did. At 650 PA, it was at .588
- Clutch (sum WPA – sum (WPA/LI)) – never did. It was at .021.
- RBI above league average expectation (based on average RBI in base/out state faced by batter minus actual RBI) – 650 PA

Oh really? WPA-based stats didn’t fare well in these analyses. Context neutral WPA was more reliable than straight-up WPA, although neither made it to the magic .70 cutoff. That clutch wasn’t reliable shouldn’t come as a surprise to anyone. Even RBI expectation just barely made it to “reliable” and it’s only reliable in a sample that takes into account full-timers. This calls into serious question whether things like WPA and WPA/LI can be used in analyses. The following is a very important distinction. If we want to say that A-Rod added 7.51 wins of WPA to the Yankees this past year, that’s a statement of fact and it is what it is. However, if we were to hit reset and replay the entire 2007 season from scratch, the first go-around’s league-wide numbers in WPA wouldn’t be a very good guide to what would happen in the replay. So, running analyses with WPA-related variables as predictors or as outcomes can get onto some shaky statistical ground. (For the statistically initiated, this is the difference between descriptive and inferential analyses.)

**UPDATE:** Upon further review, it looks like I had some problems with my calculations on this one. I’m looking into it. All of the WPA and LI based stats are currently under investigation. The rest of this article still stands.

Finally, some swing diagnostic metrics:

- Swing % (swings/pitches) – under 40 PA
- Contact % (ball in play + foul / swings) – under 40 PA
- Sensitivity and Response Bias (one of my homemade stats) – under 40 PA
- Pitches/PA – under 40 PA

Well now, players have a great deal of control over whether or not they swing, so it makes sense that these numbers are the same whenever and wherever they go.

I really could do this with any batting stat out there. Notable by their omission from this study are stats like runs and stolen bases that generally don’t happen on an at-bat basis because they are base running stats. (A home run scores the batter, but most often, players score runs by having someone else knock them home in a separate plate appearance.)

Now, when do the stats I’ve looked at become stable for individual players? That is, at what point in the season (measured by PA) do stats go from being garbage to being meaningful and actually describing something about the player? This is actually a little harder to do, but it’s just an engineering problem. I returned to the same database and used the same split-half paradigm on consecutive two-year windows. This time, however, when looking to see about a certain number of PA’s (say 200), I took the first two samples of that many PA’s that I could find (so, the first 400 PA in the two-year period, split into even and odd numbered PA’s) with the rest of a player’s PA’s tossed away. This artificially gives everyone the same number of PA’s within each analysis, so long as they actually logged twice as many as the target number in the two seasons in question (so guys who had only 10 PA were politely excluded from the analyses that required more than 200 PA). Then, I ran a correlation between the results from the even and odd PA’s to see if the correlation got to .70.
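Here is a rough sketch of that truncation step on simulated players, set next to the minimum-inclusion approach used earlier in the article. It is not the code behind the numbers here: the talent spread (mean .330, SD .030), the range of PA totals, and the helper names `truncate` and `split_half_r` are all assumptions made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_half_r(players):
    """Odd/even split-half correlation across a list of per-player PA lists."""
    odd = [sum(pa[0::2]) / len(pa[0::2]) for pa in players]
    even = [sum(pa[1::2]) / len(pa[1::2]) for pa in players]
    return pearson(odd, even)

def truncate(players, n_per_half):
    """Keep players with at least 2*n PAs, and only their first 2*n PAs."""
    needed = 2 * n_per_half
    return [pa[:needed] for pa in players if len(pa) >= needed]

# Simulated two-year windows: 400 hypothetical players with 200 to 1300 PAs each.
rng = random.Random(7)
players = []
for _ in range(400):
    talent = rng.gauss(0.330, 0.030)          # assumed talent spread
    n_pa = rng.randint(200, 1300)             # assumed playing-time spread
    players.append([int(rng.random() < talent) for _ in range(n_pa)])

min_inclusion_r = split_half_r(players)             # everyone with 200+ PA, all PAs used
fixed_n_r = split_half_r(truncate(players, 100))    # exactly two 100-PA samples each
print(f"minimum-inclusion sample: r = {min_inclusion_r:.2f}")
print(f"two 100-PA samples each:  r = {fixed_n_r:.2f}")
```

In this toy league the minimum-inclusion number comes out higher, because the players with lots of PAs prop up the correlation. That is the reason the fixed-size version is the one that speaks to individual players.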

I played around with the number of plate appearances until the correlation either hit .70 or I maxed out on the number of plate appearances available (650 was the upper limit.) I ran the analyses in 50 PA increments. The following are the stats that were stable enough (correlation > .70) at each plateau to be considered reliable. Again, these are the PA levels at which each stat can be considered to be saying something about an individual player.
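The search itself is a simple loop. The sketch below assumes a function that returns the split-half correlation at a given PA level; since the real correlations come from the play-by-play data, a toy curve (`toy_curve`, entirely made up) stands in for it here.

```python
def first_reliable_level(reliability_at, levels, threshold=0.70):
    """Return the first PA level whose correlation reaches the threshold, else None."""
    for n_pa in levels:
        if reliability_at(n_pa) >= threshold:
            return n_pa
    return None

# 50-PA increments, capped at 650 as in the text.
levels = range(50, 651, 50)

def toy_curve(n_pa, k=200):
    """Made-up stand-in for a measured correlation; rises toward 1 as PAs grow."""
    return n_pa / (n_pa + k)

quick = first_reliable_level(toy_curve, levels)
slow = first_reliable_level(lambda n_pa: toy_curve(n_pa, k=400), levels)
print(f"toy stat A first reaches .70 at: {quick}")
print(f"toy stat B first reaches .70 at: {slow}")
```

A result of `None` corresponds to a stat that never gets to .70 within the 650 PA cap.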

50 PA – swing percentage

100 PA – contact rate, response bias (both just missed at 50… the real number is probably around 70)

150 PA – K rate, line drive rate, pitches/PA

200 PA – BB rate, grounder rate, GB/FB ratio

250 PA – flyball rate

300 PA – HR rate, HR/FB

350 PA – sensitivity

400 PA – none

450 PA – none

500 PA – OBP, SLG, OPS, 1B rate, popup rate

550 PA – ISO

600 PA – none

650 PA – none

So after 100 PA (roughly a month, if a player is starting nearly every day), I can tell you about how much a batter likes to swing and how good he is at making contact. But, what happens when the ball leaves his bat? Well, at 150 PA I can tell you if he likes to hit line drives (and line drives are good…), which is the first indicator to stabilize that even says anything about what happens to the ball off the bat. At 150 PA, I can also start telling whether he likes to work the count and whether he’s a strikeout king. By 250 PA, I can tell a lot about his walking tendencies and whether he’s going to be a ground ball hitter or a flyball hitter. I still have no stats that have stabilized that tell me outcomes about where the ball landed after it left the bat. Are you the type of hitter that likes to hit balls into fielders’ gloves or onto that lovely green substance in the outfield? At 300 PA, I finally find out whether or not the player likes to hit the ball out of the park every once in a while. Finally, a lot of the usual one-number stats (OBP, SLG, OPS) don’t stabilize until 500 PA, and the same goes for knowing whether you’re a singles hitter.

A few very interesting stats didn’t stabilize, even after 650 PA. Those stats are listed below, with their split-half correlation at 650 PA in parentheses.

- Batting Average (.586)
- BABIP (.586) [sic]
- 2B + 3B rate (.401)
- WPA (.403)
- Context neutral WPA (.590)

Even after 650 PA, batting average isn’t an ideal descriptor of a player’s true talent level, at least insofar as his ability to put up a repeat performance of that same AVG. Why do we make such a big deal out of batting titles? I have no idea. (A note: AVG probably stabilizes around 1000 PA or so, and that’s just a guess on my part, so career AVG might be a decent enough statistic as far as reliability goes, assuming the player has been around for that many PA.)
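One way to put a rough number on that guess is the Spearman-Brown prophecy formula from psychometrics, which predicts how reliability changes when a measurement is made k times longer. This is a swapped-in technique, not something used in the analysis above, and it assumes the extra PAs measure the same unchanging talent. The function names are hypothetical.

```python
def predicted_reliability(r_observed, k):
    """Spearman-Brown: reliability after making the sample k times longer."""
    return k * r_observed / (1 + (k - 1) * r_observed)

def lengthening_factor(r_observed, r_target):
    """Spearman-Brown solved for k: how many times longer the sample must be."""
    return (r_target * (1 - r_observed)) / (r_observed * (1 - r_target))

# AVG had a split-half correlation of .586 at 650 PA (figure from the list above).
k = lengthening_factor(0.586, 0.70)
pa_needed = 650 * k
print(f"lengthening factor: {k:.2f}")
print(f"PA needed for .70:  roughly {pa_needed:.0f}")
```

Under those assumptions the answer lands a little above 1000 PA, in the same neighborhood as the guess in the text.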

The last question that I’ll take on is the question of exactly how stable these stats are for full season starters (650 PA) or for part-timers (300 PA). If you’re trying to predict next year’s performance, how much can you trust the previous year’s numbers? Again, the stats are listed with their split-half correlation coefficients in parentheses, and higher is better.

At 650 PA:

- AVG (.586), OBP (.779), SLG (.762), OPS (.773), ISO (.740), BABIP (.586)
- 1B rate (.831), 2B+3B rate (.401), HR rate (.855), BB rate (.878), K rate (.907)
- GB rate (.883), LD rate (.937), FB rate (.871), IFFB rate (.703), GB/FB (.918), HR/FB (.879)
- WPA (.403), context-neutral WPA (.590)… I didn’t even bother looking at clutch
- Swing % (.954), Contact % (.959), Sensitivity (.833), Response Bias (.961), Pitches/PA (.881)

At 300 PA:

- AVG (.328), OBP (.596), SLG (.634), OPS (.624), ISO (.636), BABIP (.240)
- 1B rate (.572), 2B+3B rate (.218), HR rate (.741), BB rate (.821), K rate (.844)
- GB rate (.805), LD rate (.883), FB rate (.764), IFFB rate (.610), GB/FB (.809), HR/FB (.752)
- WPA (.327), context-neutral WPA (.398)
- Swing % (.940), Contact % (.925), Sensitivity (.742), Response Bias (.937), Pitches/PA (.857)

A few limitations of this study. As with any study of this kind, any time you slap a minimum number of PA on a sample, it’s going to become a selective sample. Playing time is not handed out randomly (would that it were!), and those with 650 PA are a select group of players, namely, those who are good enough to justify starting every game all year. Because I’m comparing players to themselves (within subjects design), I can control for some of that. However, it does leave open the possibility that players who are full time starters are more consistent than those who aren’t. The split-half even-odd methodology might help to control for it, but I suppose there could be a confound in there somewhere. The other problem is that as I raise the bar for minimum number of PA, I get fewer and fewer players that meet the criteria. When putting together a correlation, that drives down my statistical power. In an ideal world, I’d have a million player-seasons (or in this case, 2 seasons) to use, but I don’t. Still, my smallest sample size was 127, which is pretty good for a correlation study. I don’t see a way around these limitations, but as always, I’m open to suggestions.

The lessons to be learned: Those who traffic in projection systems might do well to look closely at these types of analyses. An OBP based on 600 PA is going to be a much more reliable predictor than one based on 400 PA, and that should be taken into account when projecting next year’s OBP. Even just from a fan perspective, this is a good reminder that we shouldn’t be fooled by a small sample size. Now we can know exactly how small of a sample size we should be wary of.
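A hedged sketch of one common way to take reliability into account: pull the observed number toward the league average, and pull harder when reliability is lower. This is ordinary regression to the mean, not a description of how any particular projection system works. The league OBP of .330 and the .380 hitter are assumptions; the two reliabilities are the OBP figures listed above for 650 PA and 300 PA.

```python
def shrink_toward_mean(observed, league_mean, reliability):
    """Weight the observed rate by its reliability; the rest goes to the league mean."""
    return league_mean + reliability * (observed - league_mean)

LEAGUE_OBP = 0.330   # assumed league average, for illustration only
observed = 0.380     # hypothetical hitter's observed OBP

full_time = shrink_toward_mean(observed, LEAGUE_OBP, 0.779)   # OBP reliability at 650 PA
part_time = shrink_toward_mean(observed, LEAGUE_OBP, 0.596)   # OBP reliability at 300 PA
print(f"estimate from 650 PA: {full_time:.3f}")
print(f"estimate from 300 PA: {part_time:.3f}")
```

The same .380 gets less credit when it comes from 300 PA than when it comes from 650.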

If you’ve made it this far and have read the whole thing, you win a cookie. If you have a worthwhile suggestion for the comments, you win two cookies.

I’ve always thought that the PA threshold approach was too cut and dried. People will go from saying we have to ignore the data point to fully accepting it as predictive. What I think would be more useful is to provide confidence intervals given a full season of data. So it might be:

AVG: +/- .030

OBP: +/- .020

SLG: +/- .050

BABIP: +/- .040

UZR: +/- 10

Whatever the real values are, the interpretation would be easier and if cited frequently, the values would become a part of getting to learn the stat. Just like we know intuitively how much better a “.300 hitter” is than a “.270 hitter”, we might also get to intuit that a guy who hit .280 last year is likely to have his true talent be somewhere between .250 and .310.

I suppose the PA threshold is useful in understanding when we should start really paying attention as the year goes along, but in terms of pulling the concept of reliability into the daily conversation about stats, I think a CI approach may find more traction.

didn’t see this comment before my post… my thoughts exactly.

Intriguing addition. Correct me if I am mistaken, but isn’t the standard error also affected by sample size?

here’s how this would work – the 95% confidence interval on carl crawford’s AVG *right this second* is given by:

AVG +/- [1.96 * SQRT( AVG*(1-AVG)) / SQRT(AB) ] =

.143 +/- [1.96 * SQRT(.143 * .857) / SQRT(63) ] =

.143 +/- .086

so fear not, sox fans, we can’t reject that crawford’s true talent thus far this season describes a 0.229 hitter!
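For anyone who wants to check that arithmetic, here is a short sketch of the same normal-approximation interval. The 9 hits in 63 at-bats is an assumption that reproduces the .143 quoted above, and the approximation itself is rough at sample sizes this small.

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a rate: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, half_width

# 9-for-63 is assumed; it gives the .143 average used in the comment above.
avg, half_width = proportion_ci(9, 63)
print(f"{avg:.3f} +/- {half_width:.3f}")
print(f"upper end of the interval: {avg + half_width:.3f}")
```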

If the variance of a binomial distribution is np(1-p), then the SD is the square root of that (SQRT(np(1-p))) and the SE is the SD over the square root of n, (SQRT(np(1-p)))/SQRT(n). If that’s right, then the SQRT(n)’s cancel and we get the SE to be SQRT(p(1-p)), no? In other words, isn’t it a mistake to divide by SQRT(n)? My calculation is 0.143 +/- 0.686 or so, which is huge, maybe too huge. I’m not too sure. Anyway, yours is a pretty tight 95% confidence interval for this early in the season, don’t you think?

@Gumby

The variance of the binomial is indeed n*p*(1-p), but remember that’s a count, not a proportion.

A proportion is the binomial count divided by the sample size… X/n, if you will. The var(X/n) = (1/n)^2 * n*p*(1-p). n is treated as a constant, and the variance of any c*X = c^2 * var(X). So only one of the n’s cancels out! Another is left in the denominator, and then square rooted in the standard error calculation.
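Written out, with $X$ the binomial count of hits, $n$ the number of at-bats, and $p$ the true rate, the algebra in this reply is:

$$
\operatorname{Var}\!\left(\frac{X}{n}\right)
= \frac{1}{n^{2}}\operatorname{Var}(X)
= \frac{n\,p(1-p)}{n^{2}}
= \frac{p(1-p)}{n},
\qquad
\mathrm{SE} = \sqrt{\frac{p(1-p)}{n}}
$$

Plugging in $p = .143$ and $n = 63$ gives an SE of about .044, and $1.96 \times .044 \approx .086$, which matches the interval quoted earlier in the thread.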

That’s kinda what I was thinking, and was wondering why he did split sampling when a bootstrap would allow him to put CI’s on his correlations as well.
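A sketch of what that could look like: a percentile bootstrap that resamples players with replacement and recomputes the split-half correlation each time. The data here are simulated (150 hypothetical players, two 300-PA halves each, talent spread assumed), and the function names are made up for the example.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_of(pairs):
    """Correlation between the first and second element of each (odd, even) pair."""
    return pearson([a for a, _ in pairs], [b for _, b in pairs])

def bootstrap_ci(pairs, n_boot, rng, alpha=0.05):
    """Percentile bootstrap interval for the correlation, resampling whole players."""
    stats = sorted(
        correlation_of([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Simulated (odd-half rate, even-half rate) pairs; all parameters are assumptions.
rng = random.Random(11)
pairs = []
for _ in range(150):
    talent = rng.gauss(0.330, 0.030)
    odd = sum(rng.random() < talent for _ in range(300)) / 300
    even = sum(rng.random() < talent for _ in range(300)) / 300
    pairs.append((odd, even))

observed_r = correlation_of(pairs)
lo, hi = bootstrap_ci(pairs, 1000, rng)
print(f"split-half r = {observed_r:.2f}, 95% bootstrap interval: ({lo:.2f}, {hi:.2f})")
```

Resampling whole players, rather than individual plate appearances, keeps each player's two halves paired.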

I do worry about how this has come to be interpreted. I presented the levels at which various stats hit the magic .70 mark. However, even that has its problems. .80 is, of course, more reliable than .70, and given a large enough sample, a stat would get there. In the same way, .60 is worse than .70, but it’s not trivial either. Reliability is a matter of degrees, and should be treated as such.

I’ll give you one-up on that. Rather than doing it for a single season, you could do a sweep of intervals and calculate confidence intervals for each. In that way, you’d have a function of Confidence = f(SampleSize) for each type of observation you’re looking at.

With a bit of smoothing and interpolation, this sort of thing would give you an ability to determine an estimate of your confidence at any given sample size. Additionally, it might give you an idea of when you’re hitting an asymptote, the point at which adding more data isn’t really helping you have any better confidence.
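As a stand-in for that empirical sweep, the sketch below uses the binomial standard-error formula to trace interval width as a function of sample size. The .270 batting average and the list of sample sizes are assumptions, and a sweep over real data would not follow the formula exactly.

```python
import math

def half_width(p, n, z=1.96):
    """Half-width of a normal-approximation 95% interval for a rate p over n trials."""
    return z * math.sqrt(p * (1 - p) / n)

sample_sizes = [50, 100, 200, 400, 800, 1600]
curve = {n: half_width(0.270, n) for n in sample_sizes}   # .270 hitter, assumed

for n, width in curve.items():
    print(f"{n:>5} AB: +/- {width:.3f}")

# How much the interval tightens from one doubling of the sample to the next
gains = [curve[a] - curve[b] for a, b in zip(sample_sizes, sample_sizes[1:])]
```

Each doubling of the sample buys a smaller improvement than the one before it, which is the flattening-out the comment describes.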