My Small Sample Size Plea
This will not be a lengthy or detailed post discussing what sample sizes are or why they carry importance, but rather a personal plea for fans and readers, especially of this site, to avoid overestimating talent based on a few good games in April. While we can deny ever falling prey to this issue, it is human nature to try and glean information from any and all angles, for whatever reason, be it an edge in a fantasy league or an article claiming why Player A should get more playing time/get a contract extension/get a date with Alyssa Milano.
Last night, Jordan Schafer kickstarted his major league career with a home run in his first at-bat. He followed it up with a single to centerfield. On the night, the 22-year-old rookie went 2-for-3 with an intentional walk. Seeing as the Braves/Phillies matchup was the first of the season, on national television, I would not be surprised in the least if fantasy players flocked to free agent pools to put in a claim for Schafer’s services. Now, Schafer may very well be a fine major league player, but situations like this arise all too often, and they are particularly annoying. A player starts his season off on the right foot, fantasy players get all gooey-eyed, and then they call the player a fluke upon dropping him in June on the heels of a .230/.310/.360 slash line.
Schafer could defy his projections and post excellent numbers this season, but that is not the point. The point is that decisions should not be based on small sample sizes, and we need to admit this is a problem before we can move past it. It is one thing to discuss how a player has performed in a certain 10-game span, like Lance Berkman’s ridiculous stretch last season, but it is a completely different animal to use such discussions or small samples as the basis for definitive performance claims. On a teamwide level, going crazy over Schafer right now would be equivalent to trying to decipher what is wrong with the Phillies. One measly game has been played. Let’s not go crazy over players until we at least know a little bit about them.
In fact, as Dave mentioned this afternoon, the Nationals decided to forgo placing Elijah Dukes, their best player, in the starting lineup because he had a poor spring training. I honestly don’t even know how to respond to that decision. But anyway, there you have it, my plea to avoid overestimating value based on small sample sizes. Waiting until the 50-game mark might be too much to ask, but at least get past 20 games before you claim to have advanced knowledge about the causes of a player’s performance this season.

Dude, he was worth $1.5 million last night.
So what you’re saying is Emilio Bonifacio is not turning into Rickey Henderson at the plate? I’m confused now.
Nice.
This is one of the biggest issues I have with trying to discuss baseball statistically with a lot of fans. It seems to show up in any number of ways: batter/pitcher match-ups (e.g. this hitter is 5 for 11 off that pitcher with a home run), season platoon splits (particularly for players who don’t play regularly against same-handed pitchers), who hits well in a given road park, which hitters hit significantly worse in 40 ABs in the 4 spot compared to their regular 5 spot, playoff numbers, etc., etc. The use of statistics without a proper understanding of significance just seems to be a major cause of poor analysis among fans. It is certainly too much to expect most fans to statistically analyze sample sizes and standard errors, but I do wish more of them would take the same conceptual approach of caution and skepticism that you advocate here.
The Elijah Dukes thing reminds me of something of Paul DePodesta’s that I read recently, where he talked about his first year in a front office. He had a notebook full of detailed notes and observations from Spring Training, only to have his superior come up to him one day and tell him that the biggest mistakes they make in player evaluation come in Spring Training. He thought about it and realized that most of his hard-thought conclusions were, at that point, worthless.
It’s both rather sad and yet not surprising at all that you have to make this plea.
I suspect local media outlets will only amplify the need for this, as I don’t see too many newspapers giving any attention to small sample sizes in favor of “SUPERSTAR HITTER IN A 10-GAME SLUMP!”
WEEI (in Boston) just got through a rant about how, if the Yankees make the postseason, they are in trouble because their best playoff starter is their number 5 pitcher.
Got to love sports talk radio.
Aren’t we basing Dukes being the Nats’ best player on a small sample size as well? I’m not so sure I’m ready to say that Dukes is better than Zimmerman.
I’m not saying that Kearns is the better player, but I’m not convinced that Dukes is a lock for a 20/20 season.
There are two factors at work in making player decisions based upon the small sample sizes that are provided by spring training play. In an ideal world, a club would have its 25-man roster set before spring training based upon performance in previous seasons, adjusted for aging, etc. However, a club’s management is dealing with human beings and you want to create a culture where the players believe that they earn playing time. That’s going to affect your decisions.
So, say you’ve got a fifth outfielder spot open going into spring training. Say also that you’ve got a hot young outfield prospect that you’d like to keep on the roster, and that you’ve got a 35-year-old veteran who projects to decline and perform more poorly than the kid. If the veteran tears it up in spring training and the kid struggles, do you cut the veteran and keep the kid, or do you send the kid down and keep the veteran? I’d say that you keep the veteran. Sure, you’re basing your decision on a small sample, but you have to consider the effect of the decision on the other players.
Dave Appelman if you are reading this
Seidman is getting worse every article
Release him at once
I drafted Dukes late expecting him to start, not to be benched behind Austin Kearns. I guess that’s why the Nationals are a last place team.
So in other words Eric I shouldn’t believe in Joe Saunders after his start tonight?
And ‘09 is most likely it for Kearns’ time in Washington. They won’t be picking up his $10 million option. So what do the Nats do? Bench an $8 million player who could have some sort of value at the deadline (or much earlier) for a guy they control for four more years?
It also gives the team a chance to see how a guy who has shown to be a bit of a hothead handles situations that don’t go in his favor before they decide to invest anything in him.
The Nats aren’t competing for anything this year. They’ll be better but this year will be the time to figure out what they have not directly related to home runs and stolen bases while shedding some dead weight.
In the end, reread your own post. The ‘Opening Day Starter’ is the definition of small sample size. Let’s see where they are at the end of April. $20 says Dukes sees more ABs from that point to the end of the season than Kearns in a Nationals uniform.
I generally agree with you. There is more involved here than simply playing the best player. It makes sense to showcase a player you are looking to trade. That being said, showcase him at other times. Opening day is the one day the Nats will have a lot of fans in the stands, the one day that the local media is focused on them. They should play their best team. It will be a miserable season in DC, but I think they should have given their fans the best possible team on opening day.
Re: Dukes
Arizona is doing the same thing with Upton. Ok, he’s not as good as Dukes right now, and Arizona has better players, but they sat him last night because of spring training performance.
Sure, they called it a matchup thing, but Jackson was 0-11 against Cook.
I agree that many fans tend to overemphasize the impact of performance in a small sample size, but I think it’s important to look at the nature of that sample size as well. For instance, is it ok to bench player A, who’s hitting .410 in his last 10 games, for player B, who’s a lifetime .500 hitter off a given starting pitcher? What if player B’s 30 lifetime AB’s against the pitcher were all 4+ years ago, when they both played in the American League?
While both are equally small sample sizes, the more recent one has to take precedence, right? Is there a good way to distinguish between these two scenarios, besides referring to them as a “proven match-up” vs. a “timely hitting streak”? I’m not an expert on statistical analysis… sample sizes and standard deviations and whatnot, so I’m genuinely curious. Any help would be appreciated.
Generally, that is never going to be all you have to go on. You have more information on player A than what he is hitting in his last 10 games, and more information on player B than his lifetime average off the pitcher. In fact, most of what you know about them will likely be separate from those two bits of information, so you would want to consider everything else as well. The point of not using small samples is that you generally don’t have to. There is more information teams have at their disposal if they choose to use it.
With all else being equal, more recent information is stronger. It is similar to the concept of Marcel projections: you take an average of what a player has done, with the more recent performances weighted progressively more heavily. A player’s last 30 ABs are probably more meaningful in evaluating how he is likely to hit now than 30 ABs against the opposing pitcher 4+ years ago. Neither one is a very good indicator by itself (the SD of batting average for a .300 hitter over 30 ABs is about .084, or 84 points, and he’ll hit better than .410 about 8.5% of the time by pure chance), but the more recent ABs would have more weight. There is also an effect of familiarity with the pitcher, but over that length of time, it shouldn’t be strong enough to overcome the issues of the sample being so old.
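The two figures in parentheses above can be reproduced with a short binomial calculation. This is only a minimal sketch: it assumes every at-bat is an independent trial with the same hit probability, which real at-bats are not, and the function names are illustrative rather than taken from any library.

```python
from math import comb, floor, sqrt

def batting_avg_sd(true_avg, at_bats):
    """Standard deviation of observed batting average over `at_bats` trials."""
    return sqrt(true_avg * (1 - true_avg) / at_bats)

def prob_avg_above(true_avg, at_bats, threshold):
    """Chance that the observed average is strictly above `threshold`."""
    # Smallest whole number of hits that puts the average above the threshold.
    min_hits = floor(threshold * at_bats) + 1
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(min_hits, at_bats + 1)
    )

# The case from the comment: a true .300 hitter over 30 at-bats.
sd = batting_avg_sd(0.300, 30)
p_hot = prob_avg_above(0.300, 30, 0.410)
print(f"SD of average: {sd:.3f}")
print(f"P(average > .410): {p_hot:.3f}")
```

Over 30 at-bats, hitting better than .410 means 13 or more hits, so the second function simply adds up the binomial probabilities from 13 through 30.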
I appreciate the reply. I think in my own head I’d made the assumption “all other things being roughly equal,” but I realize I didn’t actually mention that in my post.
You mentioned that the point of not using small sample sizes is that you generally don’t have to. It seems to me that managers do this quite often, and call it playing the odds or something similar. I realize that in general lefties have a harder time facing lefties and vice versa, but doesn’t this become a self-fulfilling prophecy at some point? If you have a platoon situation, or just a backup who averages 30 games a year, how many opportunities do those guys get to hit off same-handed pitchers anyway? I mean, doesn’t the fact that most managers are so ready to take the bat out of their hands in those situations erode their confidence? Not to mention that they just get out of practice hitting in those situations? I realize (ha!) that the type of situation I’m describing here is not at all the norm, and is therefore subject to the issues of small sample size. Still, it seems like games hang in the balance on these decisions when they’re made, and the collective decision-making criteria might have a much greater impact long term.
Sometimes managers do make decisions like this based primarily on small samples. Most managers do not seem to have a strong statistical background, so when they attempt to explain statistical justification for their decisions, it often ends up making little statistical sense. Sometimes, that can lead to bad decisions, but it doesn’t necessarily. A lot of times when decisions do come down to these types of small samples (i.e. in the hypothetical case where all else is roughly equal), the difference is small enough that there isn’t a severely wrong choice. If the rest of the information available is close enough, then you’ll usually be fine either way, so whether the justification makes sense or not is not all that important, and a manager’s reasoning can sound much worse than his actual decision was.
Managers also rely on general trends that they use to make reasonable assumptions about hitters. For example, the platoon advantage is pretty pervasive, so even if you only have a small sample of data on a left-handed hitter against lefties, if you have a right-handed hitter who is within a reasonable talent level of the left-handed hitter, he can still be expected to do better against a lefty. In other words, the information being used to make the decision is not limited to the player’s sample: the manager also draws on the general trend of platoon splits to make assumptions about who would perform better. It can become self-fulfilling, and hitters who sit frequently with the platoon disadvantage can develop exaggerated splits as a result, but there also is a point that they are unlikely to surpass no matter what against same-handed pitchers, at a certain level below how well they hit with the platoon advantage. If you have an opposite handed hitter who can hit above that level, it makes sense to use him even if you don’t have a significant enough sample to confirm it experimentally.
In fact, I think your point is one major reason this problem is so common with fans. Many of them hear this type of reasoning from managers and assume it is sound, when in fact the reasoning is just not explained well statistically. They assume that because a manager may know the game and make good decisions, that his explanations of the statistics behind those decisions should carry more weight than they actually do.
Well, I like the way you laid it out there, and your argument makes a lot of sense. I’m fairly new at this whole saber-stat business, but I always (I felt, at least) had a pretty intuitive grasp on the game. Of course, growing up that way, watching the broadcasts and listening to the announcers’ second-hand reasoning behind managers’ decisions, makes for some problematic conclusions. One, there’s a lot of manager-fellatio going on in the booth. Especially on local broadcasts, which I used to get a lot of growing up in the ’80s in southern Illinois. It was pretty much always Braves, Cards, or Cubs, and pretty much always called by the hometown announcing team. So you get to a point where every decision is right, or at least justifiable, when that’s really not often the case. Sure, they’re usually justifiable in that there aren’t many decisions a manager makes in that realm that have any major impact on a game, or especially a season. But that doesn’t mean they’re statistically sound decisions. Two, the stats they used to use to justify and explain those decisions don’t have much power of prediction. Not like the stats we’re often using these days, with your wOBA’s and your OPS’s, your UZR’s and FIP’s and such.
Also, to highlight your point, I remember reading a pretty good article on … btb, maybe? about lineup optimization. And it basically said that while no manager goes by “the book” as far as constructing an optimal lineup, they might cost their team 10 or fewer runs per year by using the conventional setup. So while not technically correct, their decisions are still more or less “good.” Because with the personnel they carry, there aren’t that many “bad” decisions to be made, I guess, is my conclusion? Just ideal, or very slightly less than ideal.
Yes, it was only Opening Day. Still, all those millions to CC Sabathia and Mark Teixeira couldn’t buy the Yankees a good start to the new season.
-Jerry Crasnick on the frontpage of ESPN.
All winter I have been looking forward to the start of the season to see this new team play. Even though they lost last night, I had a great time watching the game. It doesn’t bother me or make me unhappy at all that they lost one game. But when I sign on to ESPN and read that, I get really, really pissed. I think I would be better off if I only watched the games at this point, and never read anything online except for the occasional statistical reference.
What about that statement is incorrect?
I have no problem with it, and I’m a Yankee fan. These guys are trying to squeeze a story out of each and every game of the year — this is the angle they chose for an Opening Day Yankee loss.
However banal and unimaginative it may be, it is valid.
I agree with your “don’t trust small sample sizes” mantra, from the standpoint of decision making for a MLB team.
But not in fantasy. To win fantasy baseball you need to take some chances that maybe some player really is better than the projections say. I’m not saying Schafer is that guy and that everyone should grab him now, or that Bonifacio is the next Chone Figgins. You’ll have to make that guess for yourself.
The guy who waits 50 games or so, until he can make a decision with more data, does not get Cliff Lee in 2008. And does not win his league. Now it’s true that you can ruin your team by picking the wrong small sample size breakout players, but the “safe” road in fantasy leads to 3rd place.
Are you saying Cliff Lee was available on your waiver wire? Yikes.
I completely agree with Rally from a fantasy standpoint. You need to be aggressive on the waiver wire, especially in the first couple of weeks of the season, when there will tend to be a lot of undrafted players who could get hot to start the season and really break out.
If you have junk on your bench, you should be taking educated guesses on some of these guys.
10 team league, 23 players per team, so only the top 3rd of players are worthy of roster spots in the league. At the end of 2007, Lee was not one of them.
I will add that the safe road, making prudent decisions only when enough data is in, is best in the long run even for fantasy. If you stuck to projections religiously for all player moves you’d probably win a 10,000 game season.
But 162 games is too short, so the winner will be the guy who takes some chances, and gets a little lucky.
Small sample size, schmample size. I think it is obvious to everyone that Ken Griffey Jr. is clearly going to hit 162 home runs this year.
Weird. That’s exactly what they said on the Seattle postgame show! (they might have been serious)
Personally, I’d prefer a cool 324 out of Felipe Lopez.
Small sample sizes do have meaning, though.
I like to apply the binomial distribution to players’ small sample sizes to see how significant they are. For instance, last May 11 Kevin Cash was batting .375. Obviously he wasn’t a true .375 hitter, but a quick check of probabilities indicated that it was 99% likely that he was a true .207 hitter or better. Entering 2008, Kevin Cash looked more like a true .180 hitter, so that .375 batting average was still good news, even when discounted appropriately for sample size.
Naysayers will comment that he batted only .167 the rest of the way. I’d counter that catchers always tire later in the season, and that .167 was higher than his previous career split for June-September. The outstanding hitting in April and May was an accurate signal that he’d improved as a hitter.
Other small sample sizes can have similar meaning, if they are dramatically far from the player’s previous norm and if they’re appropriately discounted for their size.
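A check like the one described for Cash can be sketched as a one-sided lower bound on true batting average: find the lowest true average at which a start this hot would still happen 1% of the time. The comment does not give Cash's hit and at-bat totals, so the 15-for-40 line below is a hypothetical stand-in for a .375 average, and the function names are made up for illustration. The sketch also assumes independent at-bats with a fixed hit probability.

```python
from math import comb

def tail_prob(hits, at_bats, true_avg):
    """P(at least `hits` hits in `at_bats` trials) for a given true average."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )

def lower_bound(hits, at_bats, alpha=0.01):
    """Lowest true average for which the observed start has probability alpha."""
    lo, hi = 0.0, hits / at_bats
    # Bisection works because tail_prob rises steadily with true_avg.
    for _ in range(60):
        mid = (lo + hi) / 2
        if tail_prob(hits, at_bats, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical stat line (assumption, not Cash's actual totals): 15-for-40.
bound = lower_bound(15, 40)
print(f"99% lower bound on true average: {bound:.3f}")
```

Strictly speaking this is a confidence bound rather than a statement that the player is "99% likely" to be that good, and it only holds if the player was chosen before looking at his numbers.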
I agree Jayhawk. I think the point is don’t do anything too hasty based on one hot streak from an out of nowhere free agent.
Your point about Cash is well taken, but by the same token anybody who went out and picked him up from the waiver wire didn’t exactly reap dividends post-May 11th.
Most of us are pretty savvy drafters and owners. Our presence on this very site indicates we do our research. If you’ve already researched your drafted players to death, I think it’s a mistake to scrap countless hours of analysis based on one sexy box score.
We’re somewhat talking about different things here, Jayhawk, but let me be clear that you hit the nail on the head. It’s just that the nail belongs to another topic. What you’re discussing is how a small sample can change our estimate of a player’s true talent level. This is definitely true, and I wrote an article last year discussing Cliff Lee and CC Sabathia, using The Hardball Times’ Marcel in-season estimator to show the effects. Basically, you couldn’t have had a better first month than Lee or a worse first month than Sabathia, but even with those opposite extremes, their in-season adjusted projections were not drastically different.
What I’m talking about here is that people should not go nuts over someone getting off to a 6-for-10 start. Adam Lind could very well be hitting .400/.500/.900 with 12 RBIs through 4 games… my point is – who cares!? It’s 4 games.
The problem with that is that there are enough players and seasons in MLB that aberrations will happen. Things that have 1% probabilities will happen, as we’d expect them to with hundreds or thousands of chances. So when you wait for them to happen and then pick out the aberrations, you can’t really assign much statistical meaning there because of the sample bias. If you randomly selected a sample, or if you pick a player beforehand and say you will look at his next 40 ABs, you could analyze the meaning with binomial theory, but by specifically selecting an anomaly out of the thousands of possible samples you could have picked after the fact, you are introducing a bias that makes the statistical implications meaningless. 99% doesn’t mean all that much when your selection process heavily favours the 1%.
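The effect of picking the anomaly after the fact can be seen in a small simulation. This is a rough sketch with made-up inputs: 300 hitters who all share the same true .260 average over 40 at-bats each, with independent at-bats. None of those numbers come from the discussion above; they are only there to make the selection effect visible.

```python
import random

def simulate_league(n_players, true_avg, at_bats, seed=0):
    """Observed averages for identical-talent hitters (independent at-bats)."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_players):
        hits = sum(rng.random() < true_avg for _ in range(at_bats))
        averages.append(hits / at_bats)
    return averages

# Hypothetical league: every hitter has the same true .260 talent.
averages = simulate_league(n_players=300, true_avg=0.260, at_bats=40)

preselected = averages[0]   # a hitter chosen before looking at the results
hottest = max(averages)     # a hitter chosen because of his results

print(f"pre-selected hitter: {preselected:.3f}")
print(f"hottest hitter:      {hottest:.3f}")
```

Every hitter here has identical talent, so whatever gap separates the hottest hitter from .260 is produced entirely by the selection step, which is why a binomial test applied to him afterward says very little.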
Jayhawk Bill -
If Cash lived in a vacuum, your point would be a good one, but if you have a league with 270 starters and a bunch of bench players, 5 or 10 of them will be hitting 150+ points over their heads in BA in April, so Cash’s start is probably just an anomaly caused by having hundreds of guys playing in April. If it were the end of May, it would certainly be a good sign for Cash at that point, because the odds go from like 1% to like .1%.
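The head count in that last comment can be roughed out with the same binomial model. The 270 players and the 150-point gap come from the comment; the .270 true average and the at-bat totals are assumptions added here for illustration, and at-bats are treated as independent trials. The function names are likewise hypothetical.

```python
from math import ceil, comb

def prob_at_least(min_hits, at_bats, true_avg):
    """P(at least `min_hits` hits in `at_bats` independent trials)."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(min_hits, at_bats + 1)
    )

def expected_outliers(n_players, at_bats, true_avg, jump=0.150):
    """Expected number of players hitting `jump` or more above true talent."""
    # Small epsilon guards against floating-point noise before rounding up.
    min_hits = ceil((true_avg + jump) * at_bats - 1e-9)
    return n_players * prob_at_least(min_hits, at_bats, true_avg)

# Assumed inputs: .270 true talent; 40 and 80 at-bats are illustrative totals.
for at_bats in (40, 80):
    count = expected_outliers(270, at_bats, 0.270)
    print(f"{at_bats} AB: about {count:.1f} of 270 hitters at 150+ points over")
```

The count is very sensitive to the number of at-bats, which fits the comment's closing point that the same hot start means more once the sample has grown.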