# When Samples Become Reliable

One of the most difficult tasks a responsible baseball analyst faces is avoiding the use of small samples of data to make definitive claims about a player. If Victor Martinez goes 4-for-10, that does not automatically make him a .400 hitter. We have enough information about Martinez from previous seasons to know that his actual abilities fall well short of that mark. Not everything, however, should merit a house call from the small sample size police, because some stats stabilize more quickly than others. Additionally, a lot of small sample size criticisms stem from how the information is used, not from the information itself. If Pat Burrell struggled mightily after the All-Star break last season and started this season with similarly poor numbers, we can infer that his skills may be eroding. Isolating either of these two stretches can prove inaccurate, but taking them together offers some valuable information.

The question asked most often with regard to small sample sizes is essentially: when are the samples not small anymore? That is, at what juncture does the data become meaningful? Martinez at 4-for-10 is meaningless. Martinez at 66-for-165, like he is right now, tells us much, much more, but still is not enough playing time. What are the benchmarks in plate appearances at which certain statistics become reliable? Before giving the actual numbers, let me point out that the results come from this article by a friend of mine, Pizza Cutter, over at Statistically Speaking. Warning: that article is very research-heavy, so you must put on your 3D-Nerd Goggles before journeying into the land of reliability and validity. Also, Cutter mentioned that he would be able to answer any methodological questions here, so ask away. Half of my statistics background is from school or independent study and the other half is from Pizza Cutter, so do not be shy.

Cutter basically searched for the point at which split-half reliability tests produced a 0.70 correlation or higher. A split-half reliability test involves finding the correlation between partitions of one dataset: for instance, taking all of Burrell’s even-numbered plate appearances, separating them from the odd-numbered ones, and then correlating the two halves. When the two halves are very similar, the data is more reliable. Though a 1.0 correlation indicates a perfect relationship, 0.70 is the usual benchmark in statistical studies, especially in baseball, where DIPS theory was derived from correlations of lesser strength. Without further delay, here are the results of his article as far as when certain statistics stabilize for individual hitters:
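As a rough illustration of the procedure (this is not Cutter's actual code, and the simulated player pool below is invented purely for demonstration), the even-odd split-half test might be sketched in Python like this:

```python
import math

def half_rates(pa_outcomes):
    """Split one player's PA-by-PA outcomes (1 = event occurred,
    0 = it did not) into even- and odd-numbered baskets and
    return the event rate in each half."""
    evens = pa_outcomes[0::2]
    odds = pa_outcomes[1::2]
    return sum(evens) / len(evens), sum(odds) / len(odds)

def pearson(xs, ys):
    """Plain Pearson correlation between two lists of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(players):
    """players: one 0/1 outcome sequence per player, all truncated
    to the same number of PA. Correlates each player's even-half
    rate against his odd-half rate across the whole pool."""
    pairs = [half_rates(p) for p in players]
    return pearson([e for e, _ in pairs], [o for _, o in pairs])
```

The question Cutter asked is then: how many PA per player do you need before `split_half_reliability` clears 0.70 for a given stat?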

50 PA: Swing %
100 PA: Contact Rate
150 PA: Strikeout Rate, Line Drive Rate, Pitches/PA
200 PA: Walk Rate, Groundball Rate, GB/FB
250 PA: Flyball Rate
300 PA: Home Run Rate, HR/FB
500 PA: OBP, SLG, OPS, 1B Rate, Popup Rate
550 PA: ISO

Cutter went to 650 PA as his max, meaning that the exclusion of statistics like BA, BABIP, WPA, and context-neutral WPA indicates that they did not stabilize within that span. So, here you go. I hope this assuages certain small sample misconceptions and provides some insight into when we can discuss a certain metric from a skills standpoint. There are certain red flags with an analysis like this, primarily that playing time is not assigned randomly; by using 650 PA, a chance exists that selection bias may shine through, in that the players given this many plate appearances are the more consistent players. Cutter avoids the brunt of this by comparing players to themselves. Even so, these benchmarks are, at the very least, tremendous estimates.


Does he have any plan to do this for pitchers?

The corresponding pitching post: http://statspeak.net/2008/01/on-the-reliability-of-pitching-stats.html

Thanks!

I see an apparent disconnect.

@ 150 PA LD rate is stable.

@200 PA GB rate is stable.

but not until 250 PA is FB% stable.

Since LD% + GB% + FB% = 100, if LD% and GB% are stable by 200 PA, surely FB% must be as well. Either that or the test for stability is flawed. What is wrong with my reasoning?

Those numbers are per PA, not per ball in play. So, for one player who always puts the ball in play LD + GB + FB may account for 95% of his plate appearances. For another guy who strikes out and walks a lot (we’ll call him “Adam Dunn” just to give him a name), LD + GB + FB might only cover 70% of his PA’s.

Ahh, of course. I guess I am used to THT stats definition and assumed those three were % of batted balls, not % of PA.

One of my regrets about that study was that I didn’t do things per BIP… such is life…

Cutter,

Because you used PA instead of BIP, it has a quicker, more practical application than it would otherwise. BIP is accessible, but junkie baseball fanatics know how many PA’s batters have off the top of their head, or at least can estimate. I for one am glad you used PA.

Looking at the preseason ZiPS projections, the RoS ZiPS, and the current numbers for Raul Ibanez, it seems that the only real substantial difference between what was originally projected and the current results and projections moving forward is HR/FB rate.

His GB/FB rate is about the same, though he is hitting fewer line drives than at any time we have data for. His BABIP is higher than expected, but not that much higher. Walk rates and strikeout rates are about where they should be.

Currently Raul’s HR/FB rate is 27.3%, which is not something that I think anyone truly expects Ibanez to finish the year with; it’s about 2.5 times higher than last year. According to Cutter, HR/FB rates don’t stabilize until 300 PA, and Raul is sitting at about 150 PA right now. My question is what kind of information can be found in a 150 PA sample for this kind of data. At what rate does the correlation grow with sample size? I assume that the correlation at 150 PA is not .35, but probably something lower.

More so: what at this point can we say about Raul’s HR/FB rate? What is the range of expectations that we should have for HR/FB rate at this point in the season? Are his numbers statistically significant, or can they be chalked up to small sample size?

Split-half reliability is an estimate of how comfortable you should be with a measurement, in this case, HR/FB given X number of PA. In this case, we have 150 PA for Ibanez. I don’t have the exact numbers with me as to what the split-half is at this point.

Reliability coefficients generally follow an asymptotic pattern. They all start out at zero, quickly climb at first, then plateau out later on. There’s not a lot more that the 701st PA will tell you that you didn’t know by the 700th.
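One simple functional form consistent with that asymptotic pattern is r = PA / (PA + C), where C is a stat-specific constant. This is an illustrative model, not Cutter's fitted curve; the default C below is chosen only so that r = .70 lands at 300 PA, the HR/FB benchmark from the article:

```python
def reliability(pa, c=300 * 3 / 7):
    """Illustrative asymptotic reliability curve: r = PA / (PA + C).
    Starts near zero, climbs quickly, then plateaus. The default C
    (about 129) is picked so that reliability(300) = 0.70."""
    return pa / (pa + c)
```

Under this model the gain from the 700th PA to the 701st is a fraction of a thousandth of a point of correlation, which is the plateau Cutter describes.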

My guess is that at 150 PA, the reliability of HR/FB will be pretty low. Translation: Ibanez probably isn’t a 27.3% HR/FB hitter. Sorry Phillies fans.

Yet once Raul gets to 300 PA, his HR/FB ratio will likely still be high. Since his current FB rate is very close to his career rate, and assuming that he hits roughly the same number of fly balls in the next 150 PA, his HR/FB rate over that stretch would have to be god awful to bring the total down much.

Is there any way to take the data that you have on HR/FB rates and look at the range of rates that are expected at 150 PA? I imagine that there is a bell curve that starts out extraordinarily wide and narrows as more PA are incurred.

This is true. The best I can say to that is that my cutoff is .70 and .70 isn’t 1.00.

In terms of the range of expected outcomes, what you are discussing is a confidence interval. That sort of calculation can be done.

Does the study look at a random sample of a batter’s performance during a season and look for the size of random samples needed to reach r = .7 or does it specifically focus on his early season plate appearances to see at what point those are significant? Since an initial run of plate appearances is not a random sample, it makes a difference to how we understand these results. Thanks.

Also, isn’t there a difference between the strength of correlation and the significance of correlation? That is, isn’t it possible to have a highly significant but small correlation between X and Y? Can you explain how your study is finding significance rather than strength of correlation? I’m a big noob when it comes to stats; thanks in advance for the answers you have time to give.

The method that I used was an even-odd split-half method. So, Opening Day, you have your first PA (#1), then your second, third, and fourth. The odd-numbered ones go in one basket, the even in the other. This controls, as best as we’re gonna get, for pitcher quality, park effects, etc. I’ve toyed around with taking a truly random sample, and some have argued that this is more proper. I find that it runs the risk of being corrupted by the previously mentioned extraneous variables (although it does gain the advantage of not pairing plate appearances that all come from within the same stretch of time.)

When I have taught stats in the past, I encouraged my students to ignore significance testing when looking at correlations. Significance testing answers the question “is there a decent chance that it’s really zero?” With a big enough sample size, which we can usually get in a baseball database, I can usually find that a correlation in the .20s is significant (hey, it’s probably not zero!). But a correlation in the .20s isn’t much of anything in the grand scheme of things.

I prefer the .70 cutoff, because that gives you an R-squared of (roughly) .50. (OK, yes, it’s really .49) Since I’m comparing players against themselves, it means that the majority of the variance can be accounted for by factors contained within the player himself.

I have this ongoing fight with Pizza (well, not so much a fight, as I keep punching Pizza, and he keeps sidestepping me… eventually, I will make contact, or I’ll get tired) about showing these results.

Take the HR/FB which “stabilizes” at r=.70, when PA=300. The next question of course is “so what can I do with this?”

Well, I like to get things to r=.50. You’ll see the reason in a minute. If r=.70, when PA=300, then r=.50 when PA=130. It’s not important how I got that for now.

Ok, so what can you do with that?

This means that you can add 130 PA of league average HR/FB to any player, to get an estimate of his true talent.

Ibanez has 173 PA, which means you would take 57% of his current HR/FB rate, and 43% of the league average, to get a best estimate (presuming of course that you have NO OTHER data on him… which is not true, since we do have his career).

This is why I always implore Pizza to show his split-half numbers at r=.50. Because once he does that, we know EXACTLY how much regression toward the mean to apply. As it stands, his numbers sit there, a great amount of research, but unused.
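A minimal sketch of Tango's recipe, assuming a hypothetical league-average HR/FB of 10.5% (the real figure would come from the data; everything here is illustrative):

```python
LEAGUE_HR_FB = 0.105  # hypothetical league-average HR/FB, for illustration

def regression_pa(stabilize_pa, r=0.70):
    """Tango's conversion: the number of league-average PA to add
    is PA * (1 - r) / r, i.e. 3/7 of the r = .70 benchmark.
    For HR/FB (300 PA), that is about 130 PA."""
    return stabilize_pa * (1 - r) / r

def true_talent_estimate(observed_rate, pa, stabilize_pa,
                         league_rate=LEAGUE_HR_FB):
    """Blend the observed rate with league average, weighting the
    observed rate by PA / (PA + added league-average PA)."""
    add = regression_pa(stabilize_pa)
    return (observed_rate * pa + league_rate * add) / (pa + add)
```

For Ibanez at 173 PA, the weight on his observed rate works out to 173 / (173 + 130), roughly the 57%/43% split in the comment above.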

Tango has opened up a thread on his own site. For those of you who don’t know the history, Tango and I have previously disagreed on this issue, more based on some technical methodological issues. If I may direct those who are interested in the guts of the issue to go here:

http://www.insidethebook.com/ee/index.php/site/comments/fighting_with_pizza_cutter/

I’ll be monitoring both as closely as I can…

Does Swing% above mean just Swing%, or all the different Swing% stats?

Cool stuff, Pizza, and thanks for the summary, Eric.

That one is just swings / pitches faced.

Sorry, another question. This may have been addressed above, perhaps in Tango’s post, but I need it broken down very simply for my tiny little brain.

When we talk about these stats “stabilizing” at a certain # of PAs, what does that mean in relation to past performance in determining true talent? For example: Coco Crisp has 178 PA so far on the season, getting close to the 200 PA mark. His BB% as listed on his FG player page is 15%. But his previous three season totals were 7%, 8.7%, 8.8%. How does this relate?

Is Tango saying that these numbers are simply the number of PAs of league average that need to be added to the player’s current season performance in order to estimate true talent? That, I understand. But if it’s something else, I think I need it explained further. This isn’t a criticism, just a question.

Keep asking these questions, because you guys are proving my point.

Take the numbers that Pizza shows, multiply by 3/7, and that’s how many league average PA you need to add to whatever data you have. It’s the regression toward the mean component that turns sample data into an estimate of the true rates.

Stabilizing is perhaps a poor choice of word on my part. It has that sic semper erit (thus it shall be forever!) feeling about it. Everything in baseball is a sample, and thus subject to sampling error. Reliability is a measure of how comfortable we should be with our samples.

Let’s say that Coco gets to 200 PA and still has a walk rate of 15% or thereabouts. These reliability numbers say something like this: If he (and everyone else in MLB) were to re-run the first 200 plate appearances of the season, Coco might go up on his walk rate from there or he might go down or it might stay the same. (Given his previous career numbers, it’s likely that he’d go down.) Split-half reliability tells us how closely we would expect his second-200 PA bucket to mirror his first. If the reliability number is higher, we would expect the numbers in the second set of 200 PA’s to more closely track those of the first 200. If the split-half reliability were 1.00, we would expect that Coco would have exactly a 15% walk rate over his next 200 PA. If it were .00, we would say that we have no idea what’s going to happen next. Reliability is a game of “how sure can I be that this is real?”

Tango’s method is based around regression to the mean and it’s a sound method. If we have a sound idea of the reliability of the stat at X PA, we can project what the true talent level is. Tom uses league average as a baseline. I think we can do better methodologically, but league average is good enough for government work. The point is that if we have 1 plate appearance to judge from, we know nothing about the batter and have to assume that he’s just league average. If we had a billion PA’s, we’d know exactly how good/bad he is.

The thing with Coco is that these 200 PA may be a statistical fluke or maybe he really has developed the ability to walk more. The problem is that before 2009, the latest data that we had on him was from 6-10 months ago (last season). A lot can happen in that time. Is Coco the same man that he was back in 2008? There’s the research methodology problem.

Pizza – Wouldn’t time series analysis be one way to approach a problem like Coco’s walk rate? Split his career into 200 PA segments and measure his walk rate in each segment and the standard deviation between segments and then test for a relationship varying with time?

I’ve actually done some similar work looking at those sorts of trend lines. Worth a shot in this case.
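A hypothetical sketch of that segment approach (the PA-by-PA walk encoding, one 0/1 entry per plate appearance, is an assumption for illustration):

```python
def segment_rates(outcomes, chunk=200):
    """Split a career's PA-by-PA walk outcomes (1 = walk, 0 = not)
    into consecutive 200-PA segments and return each segment's
    walk rate. A trailing partial segment is dropped."""
    return [sum(outcomes[i:i + chunk]) / chunk
            for i in range(0, len(outcomes) - chunk + 1, chunk)]

def trend_slope(rates):
    """Least-squares slope of walk rate vs. segment index; a
    positive slope suggests the rate is drifting upward over the
    career rather than spiking as a fluke."""
    n = len(rates)
    xs = range(n)
    mx = (n - 1) / 2
    my = sum(rates) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, rates))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

One would then compare the segment-to-segment standard deviation against the slope to judge whether a trend stands out from ordinary fluctuation.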

Cool stuff. I understand why this might be particularly useful for “complicated” statistics (specifically, those with hard-to-compute variances), but for the simpler rate statistics, what does this tell us that just computing a confidence interval can’t?

For example, Dan Uggla hit 32 HR in 521 AB last season (for this discussion I ignore the fact that PA != AB). This is a rate of about 0.06 HR/AB. Using the simple formula for the variance of a binary random variable, p*(1-p)/N, I can compute a 95% confidence interval for his “true” rate as (0.040, 0.081).

If I were to use his rate for the current year (7 HR in 141 AB), the confidence interval would be a much less informative (0.014, 0.086) because the sample size is so much smaller. Doesn’t this provide a much more informative measure of how “reliable” the estimate is?
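For reference, the commenter's normal-approximation interval can be sketched as follows (this is the standard binomial formula, not anything from Cutter's study):

```python
import math

def binomial_ci(successes, trials, z=1.96):
    """Normal-approximation 95% confidence interval for a binary
    rate: p +/- z * sqrt(p * (1 - p) / N)."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p - half, p + half
```

`binomial_ci(32, 521)` gives roughly (0.041, 0.082) and `binomial_ci(7, 141)` roughly (0.014, 0.086), matching the comment's arithmetic.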

What additional benefit does a more formal reliability study provide?

Confidence intervals are nice, and they are important. The problem with confidence intervals is the problem of when they cross that line to being “small enough.” There’s not a really good way to tell. Admittedly, my selection of .70 as my cutoff is an arbitrary line in the sand (although I think it has more of a reason than other lines that I might draw.) But decisions are denominated in yes and no. A confidence interval doesn’t give you that.

I received a brief intro to Bayesian statistics in a class once, and it really clicked for me when I put it into the context of baseball statistics. Basically, I framed it in my mind the way you position what you’ve written here. I don’t remember the specifics of the modeling, but the gist that I do remember is that a guy hitting .350 in April is not a .350 hitter if he hits .275 lifetime. The correct answer is somewhere in between. Pure common sense, but Bayesian statisticians make the effort to build a theoretical framework around the idea, with both subjective and objective (formulaic) underpinnings. This likely isn’t news to you. But your point was interesting to me in that it reminded me of how baseball brought the abstract theory to life at the time.

Is this not in the glossary? I think this is super useful information, and maybe you guys could either put a link to this article, or just the data in the glossary?

I wonder if a similar method can be applied to evaluate the number of PAs necessary for a hitter’s “vs. a specific pitcher” stats to stabilize. The reason I bring this up is that we often hear announcers and commentators say “this hitter hits well against this pitcher through his career,” and if you look at the actual stats, they are often talking about sample sizes of 30 or so. That may be laughable to some of us, but maybe not. The way I see it, THE biggest variable in batting stats for a hitter is the opposing pitcher. If you eliminate that variable, maybe the sample size required for his stats to stabilize becomes dramatically smaller.

Let me take the example of Eric Chavez against Jamie Moyer. Despite being known to be completely hapless against most other lefties, Chavez somehow managed to hit Moyer to the tune of 323/397/646 in 72 PA. I then did an odd-even year split, and found the lines to be 366/387/766 (31 PA) in the odd years and 285/405/662 (42 PA) in the even years. Except for the BAs, which is not surprising, the other numbers look pretty good. So, are the OBP and SLG vs. a specific pitcher meaningful after 72 PA?

By increasing the number of individual cases (and maybe using Pizza Cutter’s method of splitting into halves), one may be able to determine the sample size needed for “vs a specific pitcher” stats to stabilize. Although some caveats exist (such as these career stats covering many years), this sort of general information could be quite useful.

oops

Those SLGs can’t be true. I will recalculate when I have time.

Since you are using terminology such as “stabilize,” wouldn’t it be prudent to display a scatter of the correlation vs. PAs for all the metrics? Some may not “stabilize” at that number, by which I mean the slope of the curve at the point where it hits .7 may be important. Of course, this is my attack on everything: “Yes, that may be the case, but what is the derivative doing at that point?”

I’m guessing the marginal benefit of extra plate appearances to the stability would be equal to its fraction of the number needed to hit 0.7.

I won’t give too much crap about the 0.7. I do understand that sometimes you have to pick an alpha and stick with it, if only for consistency’s sake. For lab standards we’ve gotta go with 0.99.

These articles don’t exist anymore. Is there a chance that they could get re-published/hosted elsewhere? They seem like pretty important pieces of saber-research and it would be sad to lose them.

http://web.archive.org/web/20080102094412/http://mvn.com/mlb-stats/2007/11/14/525600-minutes-how-do-you-measure-a-player-in-a-year/

and

http://web.archive.org/web/20080112135748/mvn.com/mlb-stats/2008/01/06/on-the-reliability-of-pitching-stats/

Do these numbers hold true for players of all ages and experience levels? I would think guys in their first few years of big league experience would fluctuate more and maybe take longer. Also, younger players who evolve over the course of a season might have different rates too.

I’m confused about split-half methodology and how the results can be interpreted.

When you say that measure X stabilizes at 200 PAs, for example, how is that reflected in your methodology?

If I understand correctly, you took 400 PAs and compared the 200 odd PAs to the 200 even PAs and looked for correlation.

But that doesn’t tell you that an individual’s first 200 PAs correlate to his next 200 PAs, yet it’s being advertised as such. Instead, it means that it takes 200 PAs consisting of every other PA over 400 PAs for measure X to stabilize. That’s not exactly useful information. Am I wrong in making this criticism?

So do these PA levels differ between rookie/newer players and established veterans? Curious if newer players are any more likely to have more variability in their play and if that changes over the course of their careers. Also, do you think OBP takes so many plate appearances partly as a result of variation in opponents/opposing pitching and defense? Seems hard to use it to manage anything.