## Small Sample Usefulness

Over the last decade or so, the main sabermetric truisms have been, in no particular order: we like hitters with plate discipline and power, bunting is bad, the modern day bullpen is inefficient, and don’t make decisions based on small sample sizes. The last of these is brought up often early in a season, when strange things happen like Mike Hampton striking out 10 batters in a game or Emilio Bonifacio hitting .600 for a week. We trot out the old “it’s early, don’t make any rash judgments” line, and work to convince people that what they’ve seen so far isn’t likely to continue.

However, like most truisms, this is often taken to a non-logical extreme. People have begun to lean on “small sample size” like a crutch that helps them defend their original position in the face of evidence that should convince them that they might not be correct. The evidence might not be overwhelming, but as it begins to pile up, remaining wedded to your preseason thoughts is just as ignorant as overreacting to the performance.

Let’s use Victor Martinez as an example. I talked about him whacking the ball in April the other day. He’s had a great month, hitting .388/.438/.624 and generally being one of the best hitters in baseball. We can be pretty sure that Martinez won’t keep hitting this well, of course, as it is a small sample of data so far.

However, regardless of what your preseason projection for Martinez was, you should now be quite a bit more optimistic about his performance over the rest of the year than you were on Opening Day. Dan Szymborski released an updated ZIPS projection that accounts for April data (thanks Dan, great stuff!), and ZIPS now thinks Martinez will hit .305/.380/.467 the rest of the season, up from a preseason projection of .293/.366/.447. That small sample size that we’re not supposed to get excited about has increased his projection for the remainder of the season by 34 points of OPS. That’s a significant change.
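The 34-point figure is simple arithmetic on the two slash lines quoted above. A minimal sketch, assuming OPS is taken as plain OBP plus SLG; the function name `ops` is made up for illustration:

```python
def ops(obp, slg):
    """On-base plus slugging, the simple sum used in the article."""
    return obp + slg

# ZIPS rest-of-season projections quoted in the article.
preseason = ops(0.366, 0.447)
updated = ops(0.380, 0.467)

# Express the change in "points" (thousandths) of OPS.
delta_points = round((updated - preseason) * 1000)
print(f"Preseason OPS {preseason:.3f}, updated OPS {updated:.3f}, change {delta_points} points")
```

Whether a change of that size counts as "significant" depends on the error around the projections, which this arithmetic says nothing about.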

April is only one month of the season. Things won’t end the way they are now. We do have to be careful about drawing conclusions on small sample sizes. However, let’s not fall into the opposite trap, either – there is useful information to be gleaned from the beginning of the season. Pretending like nothing has changed is just as uninformed as pretending like the current performances will be sustained.

Don’t hang onto your preseason projections like they’re gospel. You’ve got new information in front of you. Use it.


When does a small sample size cease to be small? I’m sure everyone can agree that a week of Bonifacio is small, but what if one week had turned into two? or three? At what point does it become meaningful?

There isn’t a certain point when things become meaningful. Every bit of information has a certain amount of meaning. One day of going 4 for 4 might add a half of a point to a projected OPS. One week might add 5 points. These numbers are completely made up, but the point is that every game, every at bat is meaningful, it just becomes more so as the sample size increases.

All performance is information, but it’s important to know how much information it is. The problem with small samples isn’t that they aren’t information, it’s that they’re very little information. What Dave is saying here is exactly right. (I even go a little further than I think Dave would: if we find two guys with the same UZR and +/- but different FRAA, I think that FRAA is a little bit of information and we should use it, appropriately, and take the greater FRAA guy to be just a tiny bit better.)

This is a process known as conditionalization or updating. There are precise mathematical ways to do it. Last year, Tom Tango released a spreadsheet that let you do it with Marcels. Note that if a player is performing in a very extreme way relative to his projections, that is more significant than if he’s just a little better. (The probability that a true talent 6k/9 pitcher strikes out 12 batters in a game is much smaller than the same pitcher striking out 9 in a game, hence, you should think it more likely that his true talent is greater than 6k/9 if he strikes out 12 than if he strikes out 9.)
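To put rough numbers on the strikeout example, here is a minimal sketch. It assumes, purely for illustration, that the pitcher goes a full nine innings and that his strikeouts follow a Poisson distribution whose mean is his true-talent K/9. Neither assumption comes from the comment, and the 9 K/9 alternative is a made-up comparison point.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def likelihood_ratio(k, talent_a, talent_b):
    """How many times more likely k strikeouts are under talent_b than talent_a."""
    return poisson_pmf(k, talent_b) / poisson_pmf(k, talent_a)

# Chance that a true 6 K/9 pitcher records exactly 9 or exactly 12 strikeouts.
p9 = poisson_pmf(9, 6.0)
p12 = poisson_pmf(12, 6.0)

# Evidence each game gives for "he is really a 9 K/9 pitcher" over "6 K/9".
lr_9 = likelihood_ratio(9, 6.0, 9.0)
lr_12 = likelihood_ratio(12, 6.0, 9.0)

print(f"P(9 K | 6 K/9) = {p9:.3f}, P(12 K | 6 K/9) = {p12:.3f}")
print(f"Likelihood ratio after 9 K: {lr_9:.1f}, after 12 K: {lr_12:.1f}")
```

Under these assumptions the 12-strikeout game shifts the odds toward the higher talent level several times more than the 9-strikeout game does, which is the point of the parenthetical above.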

One of the most important things to know is which stats become relatively stable over a short run of sampling and which take a lot of sampling before they begin to reveal true talent.

For hitters, in order, here’s what gets stable pretty fast: swing rates, contact rates, strike out rates, walk rates, line drive rates, home run rates. Batting average and BABIP do not get stable in under 500PA, so looking at a slash line this early in the season is worthless. If you want to see whether a hitter is improving, swing and contact rates are the first thing to look at*, then walk and strike out rates, then power numbers. Batting average changes are basically meaningless at this point. By the way, 150PA is a pretty good sample for swing, contact, K and BB. You need more like 300PA in the case of power numbers.

For pitchers, K rate, ground ball rate, line drive rate, flyball rate are relatively stable under small samples. Surprisingly, walk rate is not. K’s and GB get pretty good at 150 BF, 200 is the number for fly balls.

*There’s actually a very good explanation of this fact: a plate appearance involves on average just under four pitches, so one plate appearance actually represents a sample of about four events, not one, when we consider swing rates.

This is really useful information — thanks for posting it!

For those who are curious, “relatively stable” here means the number quoted (150PA or whatever) has an r-squared value of .5 or greater for a random sample of performance that size during the season with the rest of the season performance. Note that the early season is not a random sample, so we cannot assume r-squared .5 for that. So we won’t have a correlation as strong as that, but it shouldn’t be horribly far off.
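The r-squared definition can be illustrated with a small simulation. This is only a sketch under invented assumptions: a league of hypothetical hitters whose true strikeout rates are spread around 18% with a standard deviation of 5 points, 450 PA standing in for the "rest of the season", and every plate appearance treated as an independent coin flip. None of those numbers come from the comment.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def observed_rate(rng, talent, plate_appearances):
    """Simulate PAs as independent trials and return the observed rate."""
    events = sum(rng.random() < talent for _ in range(plate_appearances))
    return events / plate_appearances

def split_r_squared(sample_pa, rest_pa=450, players=400, seed=1):
    """r-squared between a sample of `sample_pa` PAs and the rest of the season."""
    rng = random.Random(seed)
    sample_rates, rest_rates = [], []
    for _ in range(players):
        # Hypothetical true-talent strikeout rate, clamped to a sane range.
        talent = min(max(rng.gauss(0.18, 0.05), 0.02), 0.50)
        sample_rates.append(observed_rate(rng, talent, sample_pa))
        rest_rates.append(observed_rate(rng, talent, rest_pa))
    return pearson_r(sample_rates, rest_rates) ** 2

results = {pa: split_r_squared(pa) for pa in (20, 150)}
for pa, r2 in results.items():
    print(f"{pa} PA sample vs. rest of season: r-squared = {r2:.2f}")
```

The larger sample should track the rest of the season much more closely than the 20 PA one. Where the .5 threshold actually falls for a real stat depends on how spread out true talent is, which is why different stats stabilize at different sample sizes.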

There are small samples and then there are small samples. Victor Martinez has had 2 1/2 times the PA Andruw Jones had had through yesterday, for example. I doubt anyone (well, many people, anyway) disagrees that there’s “useful information” to be gleaned from April. But if there’s a proposition that seemed entirely uncertain or even false a month ago, it’s very unlikely that it has been *proved* true now.

Yeah, Dave is dramatically misrepresenting those that may argue small sample size against his point. Dave has been posting a number of articles/blogs since the beginning of the season on “lessons learned” or on certain players being “back.” So, now he fights back with a guy whose sample is actually starting to get large enough to make some judgements and whose performance was so far from expected that it forces us to take note.

Basically, no duh Dave. If a guy OPS >1.000, when we expected about .800, for 1/6th of the season, then it will increase what we expect from him (a little more on this below). Ok, try to do that with Andruw Jones’ 39 PAs and 100 other players in similar situations and see how successful you are.

I’d like to point out Andruw Jones isn’t on the ZiPS update….Maybe he hasn’t had enough PAs to make a significant change to his projection?

Lastly, you are completely ignoring the error in these projections. Is .034 points of OPS actually significant? Isn’t the error term of these projections right around that number (I’d have to look but I thought it was in the .030-.040 range for wOBA, which would likely mean it’s larger for OPS, can anyone clarify)? So, if it is significant, it’s likely barely significant. Which leads us back to the whole small sample size problem. If you can’t get a significant change, or just barely so, even from a big over or under performance, you have a sample size (or power) issue.

You’re still here?

Nice answer!

By the way Dave, please tell us when we can expect the sample size to be large enough to tell that:

A) Utley and Howard are not regressing and the Phillies are still actually good. Oh and Ibanez is better than Burrell.

B) Baltimore’s pitching is not better than advertised.

D) The Indians rotation is not better than people think.

E) Boston actually has a really good offense!

Also when should Cain’s HR/FB start to regress?

Yeah Dave, still here, and still waiting for your response of the supposed “significant change” that .034 OPS would represent. For someone that is arguing against the misuse of a statistical concept, sample size, I’d expect you to correctly use another statistical concept, that being significance. After all, significance is heavily dependent on your sample size.

I’m also still waiting for your response that Andruw Jones’ then 35 PAs is a significant sample, and could significantly affect any projection….

….but you have shown an inability to handle criticism without turning to insults or any number of irrational behaviors, so I don’t really expect those responses from you, but from other readers.

I have no interest in letting you two ruin any more threads. Contribute to a useful conversation or go away.

Dave, it is painfully obvious the only one ruining and not contributing to this conversation is you. In a discussion of statistics someone calls out your unjustified use of the word “significant” and your response is “you’re still here?” That’s just pathetic. You’re either lazy or don’t know what you’re talking about, you pick.

It’s a good thing you work in such an underdeveloped and amateur field. How well do you think the reviewers of a research journal would take it if the author’s response to a comment was along the lines of “You’re still here?” If I gave these kinds of answers to criticism in my field, I would surely lose my job and be destined to make $17/hour washing dishes the rest of my life.

So please Dave, take your own advice, contribute to the discussion or go away.

Um, Dave’s been contributing to baseball discussions and knowledge for YEARS, and repeatedly had to deal with ignorant antagonists. You’ll forgive him if he doesn’t have the patience to bother with idiot #2360923.

In this conversation it is completely irrelevant what Dave has done for YEARS. His two comments above have done nothing to further this current discussion, making his last statement not only an irrational attack on a valid criticism, but laughably hypocritical.

I think you’re confusing a forecaster with a historian. A small sample size is always going to be a problem! What exactly are you going to do 400 AB from now when you’ve still got too many data points lying outside of your one std_dev?

To absolutely ignore the 80 most recent plate appearances a batter has had, is just silly. Events have occurred and you should process them however you see fit, but you’re not allowed to just ignore them.

Hell, add them to last year’s statistics straight up (which clearly undervalues them,) and see what the projection models spit out.

I don’t think anyone would argue that April’s statistics should be viewed as an independent set of events, that’s crazy-stupid.

If you think that April’s statistics are so outrageously high, pull a figure-skating analysis and eliminate the high and low months over the past year.

By the way, Andruw Jones? Seriously? That’s your example? What level of confidence do you think any projection model has in Andruw Jones projections? Whenever a player goes from awesome to awful, or the other way, the Models get angry. There’s obviously no way to account for shit that goes down in real life with a projection model.

Also, 40 PA isn’t enough to say anything — 100 will start a conversation, or at least start the wheels turning.

The problem with small sample sizes (and 100 PA is a small sample size) is that all events are statistically large. I don’t think anyone is going to argue that Martinez is a bad baseball player, or even an average one, but currently he has a .387 BABIP and his BB% is 3% above his career average.

I ran some various scenarios quickly in excel and if Martinez walked once less, had one less HR, and his triple was a double his OBP drops from .455 to .435 and his SLG goes from .636 to .573. That’s not a big shift in performance (in fact you could still write a column about him being ‘back’). If he had walked two times less, had 3 less singles, no triple, and two less HR, his OBP is .386 and SLG is .467 (.853 OPS). Again all of these numbers point to Martinez being a good player, but to say 101 PA is enough to make a judgment on the rest of his season isn’t really compelling.
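The first scenario can be redone outside of Excel. A minimal sketch, assuming a hypothetical 101 PA line chosen only because it roughly reproduces the .455/.636 quoted above. These are not Martinez’s actual box-score totals, hit-by-pitches are ignored, and the lost walk and lost homer are both assumed to become outs.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BattingLine:
    ab: int
    singles: int
    doubles: int
    triples: int
    homers: int
    walks: int
    sac_flies: int = 0

    def hits(self):
        return self.singles + self.doubles + self.triples + self.homers

    def obp(self):
        # Simplified OBP: hit-by-pitch is left out of this sketch.
        return (self.hits() + self.walks) / (self.ab + self.walks + self.sac_flies)

    def slg(self):
        total_bases = self.singles + 2 * self.doubles + 3 * self.triples + 4 * self.homers
        return total_bases / self.ab

# Hypothetical April line (88 AB + 12 BB + 1 SF = 101 PA).
april = BattingLine(ab=88, singles=23, doubles=5, triples=1, homers=5, walks=12, sac_flies=1)

# One fewer walk (now an at-bat that ends in an out), one fewer homer
# (now an out), and the triple scored as a double instead.
tweaked = replace(
    april,
    ab=april.ab + 1,
    walks=april.walks - 1,
    homers=april.homers - 1,
    triples=april.triples - 1,
    doubles=april.doubles + 1,
)

for label, line in (("April", april), ("Tweaked", tweaked)):
    print(f"{label}: OBP {line.obp():.3f}, SLG {line.slg():.3f}")
```

Three changed plate appearances out of 101 move slugging by more than 60 points in this sketch, which is the sense in which every event is "statistically large" at this sample size.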

To say that 35 PA for Jones is enough is a joke.

I was in a hurry to leave my apartment this morning and didn’t quite make the exact point that I wanted.

When I said there wasn’t a huge difference in performance, I meant on field performance, not statistical performance.

But the main point I was trying to make, and never really got around to stating, is that it would be really interesting if projections made at the beginning of the season came with information on what one and two standard deviations are for each projection. Each player should have a different deviation based on sample size of performance, and knowing the size of the bell curve would make these discussions on sample size much more scientifically sound. With the updated ZIPS it would be extremely informative to know whether or not the updated projection falls outside of one or two standard deviations, at which point you can make claims of statistical significance; if the update doesn’t, then there is no significance in the data. Error bars are part of science, so if you want to make claims of scientific analysis, do science.

Further it would be absolutely amazing to know the size of the standard deviation for 100 AB, 200 AB, 300 AB, 400 AB, 500 AB, and 600 AB. This way we could look at a 100 AB sample and know whether or not it is a statistically significant (from the actual mathematical definition) deviation from expectations.
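A rough version of that table is easy to produce for batting average. This is a sketch that assumes every at-bat is an independent trial with the same hit probability, so the spread is the plain binomial standard deviation. The .300 true-talent figure is a made-up example, and the result covers only sampling noise, not the error in a projection system.

```python
import math

def batting_average_sd(true_avg, at_bats):
    """Binomial standard deviation of an observed batting average."""
    return math.sqrt(true_avg * (1 - true_avg) / at_bats)

TRUE_AVG = 0.300  # hypothetical true-talent batting average

sd_by_ab = {ab: batting_average_sd(TRUE_AVG, ab) for ab in (100, 200, 300, 400, 500, 600)}
for ab, sd in sd_by_ab.items():
    low, high = TRUE_AVG - 2 * sd, TRUE_AVG + 2 * sd
    print(f"{ab} AB: one SD = {sd * 1000:.0f} points, two-SD range {low:.3f} to {high:.3f}")
```

Because the spread shrinks with the square root of the sample, it takes four times the at-bats to cut it in half, so even a full season leaves a fairly wide band around a hitter’s observed average.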

I think this is the same point being made by others and their issue with your use of the term significant. No one is arguing that Martinez is a bad player or that the first 100 AB of a season are meaningless, but without doing real statistical analysis your post isn’t telling anyone anything too groundbreaking.

Well said David.

Seriously Nacho…what are you arguing here? Are you the semantic police? First, Dave thinks that a 34 point increase in OPS is significant, and obviously you do not. I happen to agree with Dave that it is significant, but the larger point to the article is that you have to use all the information available, even if it is a smaller sample size, and not stick your head in the sand like you’re doing, and just demand an answer to the exact projection of every MLB player. Baseball is a game of neverending changes and adjustments. The players are always getting better, aging and getting worse, all at different rates. Any projection should take any additional information you can get.

Second, you’re asking Dave to take out the noise for every player in a response to your post. It can’t be done. But Dave gave Victor Martinez as an example where small sample performance can have an overall effect on projections. Could it be noise or error? Sure, but based upon his success so far, a projection system certainly has to take those numbers into account. Andruw’s success has come in significantly fewer AB’s and mostly against lefties. Should it have as much effect? No, it’s a smaller sample and in what should be favorable matchups for him. Modify your projections upward, but not nearly as significantly. It’s a relative calculation. Regardless, to ignore any information you can lay your hands on is ignorant.

Meanwhile, stop being such an ass.

The definition of significant is not subjective. Sabermetrics labels itself as a scientific analysis of baseball, so those who claim to be practicing advanced analysis of baseball should use the well-defined and accepted definitions of words. A significant change is something that falls outside either one or two standard deviations (it depends on the data sample and the claims put forth by the study). Without a complete statistical analysis it is impossible to make the claims that Dave is making. Basically he is making empirical claims, not statistical ones, which is, I believe, the opposite of what he thinks he is doing. I think that people that are reading this site are interested in seeing real statistical analysis, not some kind of pseudoscience that Dave is doing.

“Second, you’re asking Dave to take out the noise for every player in a response to your post. It can’t be done.”

I won’t speak for Nacho, I don’t know him at all and don’t know what he believes, but I disagree with your assessment. I’m certainly not asking for the ‘noise’ to be taken out. I’m asking that the standard deviations for these data sets be compared. I don’t have them and I don’t claim to be a baseball blogger (not even close!), so I’m not going to do this. My guess is that the standard deviations of the two projections cross, which means simply that there is no significant change. This has nothing to do with getting rid of noise, but has everything to do with doing serious, real analysis. This is not ignoring data, but instead considering all data available and conducting an actual comparison.

No one is being an ass but Dave and his blind supporters. He refuses to answer legitimate concerns with his writing, instead he prefers to insult people and look down.

In sum: Significance has a well-defined definition. Use it, or else you aren’t doing statistical analysis.

David, thanks again.

I’d like to further emphasize this sample size issue, but leave baseball for a second. It is often the exact right thing to do to just flat out ignore some data when doing statistical analysis if there are sufficient problems with that data. This problem may not appear in baseball very often, given that those problems usually come from design or sampling issues. But that is the basic reason we don’t trust pre-WWII statistics very much.

As for something more baseball related, one of the major problems is going to be a skewed sample or nesting. So, say you are researching Lake Trout, but only sample 3 lakes. Even if you achieve statistical significance in your study in those 3 lakes, you can only make conclusions about those 3 lakes (and may even only have achieved significance thanks to pseudoreplication, but I digress). Obviously, 3 lakes isn’t going to be representative of all Lake Trout. A similar problem is faced in baseball and is hinted at with Andruw Jones. He’s faced a higher portion of lefties than we would expect if he played everyday. Victor Martinez, and all other players have a slightly different issue. They have only faced a small handful of pitchers or teams. Currently he’s had 6 PAs against Sidney Ponson, 4 against Kevin Slowey and doesn’t have any more than 3 against any other pitcher, with 46 total pitchers faced (last year he faced 114 pitchers, against 18 he had 4 or more PAs). He’s had 27 PAs against the Royals in general, with a 1.222 OPS against them. And has only faced 6 teams, getting just 9 PAs against the Red Sox. So 90% of Victor’s sample is nested in 5 teams. 2 of them have the highest team ERA in the game outside of the team Martinez plays for. Minnesota is 24th, Toronto 11th, KC 3rd (who Martinez has hit the best). Does that sound like a representative sample of the whole league?

Now, I’m not saying ignore these April numbers all together (nice straw man), but you have to recognize they have problems. These are small samples, even 100 PAs is pretty small. And because of the scheduling and nature of baseball, those small samples are nested within just a few teams or pitchers.

As for being an ass, how’s this: You and Dave Cameron talk about statistics like you read about it on wikipedia, and didn’t get it. Someone who wants to have a serious conversation about statistics and the misuse of statistical principles would not casually throw around the word significant, nor think it’s up for interpretation. They then would certainly not resort to such asinine comments, as Dave has, after being faced with criticism.

Being able to react to and answer criticism rationally and logically, even if you don’t know the answer or made a mistake, is absolutely necessary to maintain credibility. Obviously, I’m not terribly concerned with this, I’m an anonymous commenter. But Dave has a great deal of interest in maintaining his credibility….

Finally! Thank you for this one Dave! I come at almost all baseball questions focusing almost solely on fantasy production, and this is driving me nuts. Nuts, I say!

The fantasy baseball community has decided that everything regresses to the mean, and 3/4 of the time they don’t even know what mean they’re regressing to.

When considered from a fantasy perspective, April stats are almost always the most valuable statistics of the year. We’re not dealing with guys with large MLB sample-sizes for the most part, we’re dealing with kids and the late rounders.

Furthermore, baseball isn’t just about statistics. Low-Average/Low-BABIP players get sent to the minors — Sure, their statistics might regress there, but what good is that? Early statistics are also a great indicator of injury — Injured players don’t make terrific buy-low candidates.

This is my biggest pet peeve, ever. You’ve got statistics in front of you, and while you can’t extrapolate these statistics and expect a useful result, YOU CAN USE THEM!

I actually wonder if Dave agrees with this post.

I enjoyed reading your posts earlier, and I’m glad you mathematized my point about std-devs far beyond the 4 words I wrote.

There is a disjoint between the two types of baseball stat geeks. I refer to what I practice as Sabr-magics for a damn good reason: I’m fairly well versed in most statistical analysis, but when it comes to realistically predicting the immediate future (that’s what fantasy baseball is,) you can often throw math out the window.

I hate to bring this back to cliches, but 1) steroids 2) injuries 3) off-season work, are all things you cannot predict. If you had information, you could factor them in but unfortunately you don’t have any *trustworthy* information.

e.g. Maybe Ankiel Worked Hard, Ludwick took HGH, and Pujols had a minor injury. There’s nothing you can do to predict the outcome of the season, all you have is incomplete information! Regardless of how poorly Pujols performs, you’re viewing each AB as an independent event.

I’m not sure what Dave claims to be, but there’s a huge difference between those that want accuracy, and those that want to be ahead of the curve in their fantasy baseball league.

I still maintain these early stats are the most important in pointing you in the right direction.

Someone brought up Andruw Jones. He was predicted by most models to hit 35-40 HR last year (I believe).

Did you keep him on your squad for 200 PA? I doubt it.

I posted this question in the forums, but nobody answered. My question is somewhat relevant to the topic.

I started using this site during the offseason so I have no prior knowledge on how UZR/150 changes during the season. It seems to be a stat that is unreliable in small sample sizes. Which leads me to my question. Daniel Murphy the LF of the NYM has looked absolutely lost in the field this season, yet his UZR/150 is very good, resting at 26.6 this year. Is his UZR/150 so high because of the small sample size? Or is Murphy a better player than people perceive?

UZR measures value, not appearance. Murphy could look absolutely lost chasing a ball, taking a bad route and falling over his shoes, but at most he’ll be penalized for one bad play. In your mind, watching him stumble around, that play will be worth far more than just one bad play in forming your opinion of his abilities.

Small sample UZRs can be unreliable, and no one is claiming that Murphy has actually performed at the exact level that is being suggested by the metric. However, human sight can also be unreliable, and when the two disagree so vehemently, it would be a good idea to add some doubt to the certainty that Murphy is a horrible fielder.

Are those ZIPS projections projecting true talent level or what we expect the final numbers to look like combining true talent with what they’ve done so far?

Both – the numbers I’m quoting are “true talent”, but Dan also has end-of-season numbers in the spreadsheet that account for April’s production in the final tally.

That makes a lot of sense, thank you very much Dave. You are a very smart man.

There is not enough precision in either the projections cited here or OPS as a measure of player value to assert baldly that .034 points of OPS is “a significant change.”

Well said, Colin.

Excellent point… looks to have shot down an argument. Also, the ability to forecast season totals higher based on a hot first month seems to be filed under “duh.”

The observations made about Andruw Jones (seems to be tying in), if true, are provable from an observational standpoint, and not likely a statistical one. The Rangers appear to have protected him some to this point, and if he now plays every day for a month, we will see if his numbers have seen a remarkable jump.

I can use small sample size to opine that the sweep of the formerly 11-7 Pittsburgh Pirates by the Milwaukee Brewers probably dooms them to another sub-.500 season. If they go 1-15 against Milwaukee like they did last year, they will have to be 80-66 against the rest of baseball (a robust .548 average) just to be at .500
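The Pirates arithmetic checks out and is easy to reproduce. A minimal sketch, assuming a 162-game season with 16 games against Milwaukee, as the 1-15 record implies; the function name and parameters are made up for illustration:

```python
def required_record(total_games, target_wins, h2h_wins, h2h_losses):
    """Record needed in all other games to finish with `target_wins`."""
    other_games = total_games - (h2h_wins + h2h_losses)
    wins_needed = target_wins - h2h_wins
    losses_allowed = other_games - wins_needed
    return wins_needed, losses_allowed, wins_needed / other_games

# A .500 season is 81 wins; a 1-15 mark against Milwaukee covers 16 games.
wins, losses, pct = required_record(162, 81, 1, 15)
print(f"Needed against everyone else: {wins}-{losses} ({pct:.3f})")
```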
