Shut the (Heck) Up About Sample Size

The analytics revolution in sports has profoundly changed how organizations think about their teams, how players play the game, and how fans consume the on-field product. Perhaps the best-known heuristic in sports analytics is sample size — the number of observations necessary to draw a reliable conclusion about some phenomenon. Everyone has a buddy who loves to make sweeping generalizations about stud prospects, always hedging his bets when the debate heats up: “Well, we don’t have enough sample size, so we just don’t know yet.”

Unfortunately for your buddy, sample size doesn’t tell the whole story. A large sample is a nice thing to have when you’re conducting research in a sterile lab, but in real-life settings like sports teams, willing research participants aren’t always in abundant supply. Regardless of the number of available data points, teams need to make decisions. Shrugging at a prospect’s performance, or at a newly cobbled-together pitching staff, isn’t going to help the bottom line, in wins or in dollar signs.

So the question becomes: How do organizations answer pressing questions when they either a) don’t have an adequate sample size, or b) haven’t collected any data? Fortunately, we can use research methods from social science to get a pretty damn good idea about something — even in the absence of the all-powerful sample size.

Qualitative Data
Let’s say you’re a baseball scout for the Yankees watching a young college prospect from the stands. You take copious notes about the player’s poise and physical stature, his hitting and fielding ability, his running speed, and his arm strength. For instance, you might write things like, “good approach to hitting” and “lacks pure run/throw tool.”

All of these rich descriptions of this player are qualitative data. This observational data from one game of this college player is a sample size of 1, but you’ve got a helluva lot of data. You could look for themes that consistently emerge in your notes, creating an in-depth profile of the prospect; you could even standardize your observations on a scale from 20-80. Your notes help build a full story about the player’s profile, and the Yanks like the level of depth you bring to scouting.
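To make that 20-80 standardization concrete, here’s a minimal sketch. It assumes the usual scouting convention (50 is league average, each 10 points is one standard deviation); the speed numbers themselves are made up for illustration:

```python
def to_scouting_scale(value, league_mean, league_sd):
    """Map a raw measurement to the 20-80 scouting scale (50 = average,
    10 points per standard deviation), clamped to the scale's endpoints."""
    z = (value - league_mean) / league_sd
    grade = 50 + 10 * z
    return max(20, min(80, round(grade)))

# Hypothetical: prospect's sprint speed of 29.0 ft/s against a league
# average of 27.0 with a standard deviation of 1.5 (invented numbers).
print(to_scouting_scale(29.0, 27.0, 1.5))  # → 63
```

The clamping matters: a truly extreme measurement still grades out as a 20 or an 80, which is exactly how scouts use the scale’s endpoints.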

You’ve worked as a scout for a few years, and the Yankees decide to bring you into their analytics department. It’s the end of the 2011 season, and one of your top prospects, Jesus Montero, just raked (.328/.406/.590, in 69 PAs) in the final month of the season. The GM of the Yankees, Brian Cashman, knocks on your door and says that they’re considering trading him. What do you say?

You compile all of Montero’s quantitative stats from the last month of the season and the minors, as well as any qualitative scouting reports on him. Good job. You’ve mixed quantitative and qualitative data to provide a richer story given a small sample of only 69 PAs. You’ve also reached the holy grail of social science research, triangulation, by which you examined the phenomenon from a different angle and, bingo, arrived at the same conclusion that your preliminary performance metrics gave you. Montero is a bum. Trade him, Brian.

Resampling Techniques
It’s four years later and Cashman knocks on your door again (he’s polite, so he waits for you to say, “come in”). It’s early October and you’ve just lost to the Houston Astros in a one-game playoff. Cashman asks you about one of the September call-ups, Rob Refsnyder, who Cashman thinks is “pretty impressive.” You combine Refsnyder’s September stats (.302/.348/.512, in 46 PAs), minor league stats, and scouting reports, but the data don’t point to a consistent conclusion. You’re not satisfied.

A statistical method that can help in this instance is called bootstrapping: you resample Refsnyder’s small 46-PA sample over and over again, with replacement, drawing a new 46-PA sample each time. The technique doesn’t create new information or inflate your sample size; it squeezes more out of the numbers you already have by showing how much a statistic like OBP could vary from one 46-PA stretch to the next. Rerun his 46 PAs 1,000, 10,000, even 100,000 times, and you get a distribution of plausible outcomes instead of a single point estimate. Based on your bootstrapped estimates, you suspect Refsnyder’s numbers from last year are a bit inflated, but that he’d fit nicely as a future utility guy.
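A minimal sketch of that resampling loop, assuming we reduce each plate appearance to a simple on-base/out outcome (16 times on base in 46 PAs roughly matches his .348 OBP; the per-PA breakdown is an assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-PA outcomes for the 46-PA September: 1 = reached base, 0 = out.
outcomes = np.array([1] * 16 + [0] * 30)

# Draw 10,000 bootstrap samples: each one resamples all 46 PAs with replacement,
# then records the OBP of that resampled "season".
boot_obp = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(10_000)
])

# A 95% bootstrap confidence interval for his OBP, given only these 46 PAs.
lo, hi = np.quantile(boot_obp, [0.025, 0.975])
print(f"bootstrap 95% CI for OBP: {lo:.3f} to {hi:.3f}")
```

The interval comes out wide, spanning something like bench-bat to borderline-starter production, which is exactly the point: 46 PAs are consistent with a lot of different true talent levels.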

Cashman, who’s still in your office, now wants to know about two pitching prospects who were also called up in the 2015 class: James Pazos (5 IP, 0 ER, 3 H, 3 BB, 5.4 K/9, 1.20 WHIP) and Caleb Cotham (9.2 IP, 7 ER, 14 H, 1 BB, 10.2 K/9, 1.56 WHIP). If the team can only keep one of these pitchers, who should it be? Who is better?

Normally you’d use a t-test to compare the two pitchers, but with so few innings for each guy, the t-test’s normality assumption is shaky and its conclusions wouldn’t be reliable. Instead, you decide to use a Mann-Whitney U test, a nonparametric cousin of the t-test that compares ranks rather than means and makes no assumption about the shape of the distribution, which makes it a better fit for tiny samples. In fact, there’s a whole family of tests adept at handling small samples: the Wilcoxon signed-rank test, Fisher’s exact test (the small-sample alternative to the chi-square test), Kendall’s tau, and McNemar’s test. You conclude that Pazos is slightly better, and that Cotham might be better suited for the bullpen. Cashman holds on to Pazos and deals Cotham to the Reds in the trade that brings Aroldis Chapman to the Yankees. You pat yourself on the back.
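Here’s what that comparison might look like in code. The per-outing earned-run lines below are invented for illustration (the article only gives season totals), so treat this as a sketch of the mechanics, not the actual analysis:

```python
from scipy.stats import mannwhitneyu

# Hypothetical earned runs allowed per appearance (illustrative numbers only;
# totals roughly match the 0 ER and 7 ER season lines quoted above).
pazos  = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
cotham = [0, 2, 0, 1, 0, 3, 0, 1, 0, 0]

# Mann-Whitney U compares the ranks of the two samples; no normality assumed.
stat, p = mannwhitneyu(pazos, cotham, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```

A small p-value here would suggest the difference in run prevention is unlikely to be pure rank-ordering noise, though with samples this size the test is a tiebreaker, not a verdict.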

Questions Need Answering
Having an adequate sample size brings confidence to many statistical conclusions, but it is not a binary prerequisite for analysis. It’s easy for your buddy to let hindsight bias vindicate his wait-and-see approach after the fact, but organizations need to answer questions accurately, and on a deadline. As amateur analysts and spectators, let’s change the lexicon by changing our methods.

PhD in Applied Research Methodology. Proponent of psychometrics in sports. More work at

Completely agree that sometimes it seems the term “small sample size” is overused. Sometimes it’s not about the sample size but the shape of the distribution; I’ve always been curious as to whether the assumption of normality is violated during some of these analyses.


Could you please explain resampling/bootstrapping in a little more detail? I’m sure what you wrote in that section of the article makes perfect sense to a mathematician, but as a layman I have no clue what the actual process you’re trying to describe is. As written it kinda sounds like you’re just guessing, but I am sure that’s not actually the case.


Obviously with a small sample size you have a large variance – in two ways: variance of true performance and variance of outcomes.

Variance of true performance: players have hot streaks and cold streaks that last for days or weeks and a week of playing time is not enough to know where a player is on that curve. In addition, I’d expect most rookies to struggle in their first few plate appearances because of nerves/adjusting to MLB. Those first few plate appearances are a priori outliers.

Variance of outcomes: Baseball is super random, and some guy might hit a cheap HR down the LF line or someone makes a diving catch to rob him of a double. It takes a lot of MLB time to account for this variance in luck; entire SEASONS can be outliers in terms of BABIP. For small sample sizes, your priors like a player’s Minor League performance, tools, etc. are going to be much more accurate. Also, I would place a ton of weight on the Statcast data like exit velocity and launch angle because that cuts through a lot of the luck factor.

I would be very surprised if the confidence intervals from methods like this told you more than a priori knowledge, Statcast data, and coaches’ opinions.

Also – your method says Pazos is slightly better. Is that an accurate statement? I would imagine that coaches know whether Pazos is MUCH better, whether Cotham has better stuff but made a few mistakes, etc. I don’t really think these tests are the best tool for these decisions — frankly, they’re very tentative, and they lack a lot of important data that we can get through other means.


So basically you are saying even with any size of sample data, the GM’s use it to justify the narrative they want. The same stats from a prospect can elicit either… “He has shown that he can be our permanent 3 hole hitter.” or “He is still raw and we want him to spend more time developing in the minors.” So the top pick or Latin American signee where they spent money gets elevated quickly and the 10th round late bloomer that excels stays down because they haven’t invested much in him.