Last Friday, I focused my weekly ESPN Insider column (which can also be read here on the site if you are a FanGraphs Plus subscriber) on the predictive power of a team getting off to a strong start in April. We know that at the individual level one month doesn’t mean much, but I wondered whether a dominating start to the season for an entire team might be more predictive of future success.
To find out, we looked at every team since 1974 that won at least 70 percent of its games in April (minimum 15 games), which gave us a sample of 45 teams. We then looked at how these teams performed from May through September to see how predictive a strong team start actually was. I was pretty surprised at just how little it mattered.
To summarize the results, the 45 teams combined for a .743 winning percentage in April but just a .549 winning percentage from May through September. The correlation between April record and May-September record was just .24, and the r squared was just .06, meaning that you could only explain six percent of the variation in these teams’ records over the final five months by their records in April.
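For readers who want to see where the six percent figure comes from, here is a minimal sketch of the arithmetic. The helper name `variance_explained` is just an illustration, not anything from the original study; the only input is the .24 correlation quoted above.

```python
def variance_explained(r):
    """Share of variance in one variable accounted for by a linear
    relationship with another, given their correlation r (i.e. r squared)."""
    return r ** 2

# Correlation between April record and May-September record, from the text.
april_vs_rest = 0.24

# Squaring .24 gives roughly .06, the r squared quoted above.
print(round(variance_explained(april_vs_rest), 2))
```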
We even broke these 45 teams into quartiles based on ratio of runs scored to runs allowed to see if a pythag method would have done any better, but the correlation was an even weaker .19. In fact, the 12 teams with the worst run differential among the .700+ April clubs performed nearly as well over the remainder of the season as the 11 teams with the best run differential. Even teams that started the year winning games by mauling their opponents regressed heavily over the rest of the season, and knowing a team’s run differential didn’t help identify which teams would sustain more of their strong start than others.
That doesn’t mean April performance is worthless, of course. The fact that these teams won nearly 55 percent of their May-September games shows that the sample was primarily made up of playoff contenders, so we shouldn’t pretend that a strong start to the season is meaningless. As a quick-and-dirty estimate of necessary regression, last week Tom Tango suggested adding 35 wins and 35 losses to a team’s record on any given day.
To test his method against the results of these early-season barnstormers, we can add 1,575 wins and 1,575 losses (35 of each for all 45 teams) to the April total for these 45 teams, which would bring the adjusted record to 2,340-1,839, which works out to a .560 winning percentage. That’s just slightly higher than the .549 mark actually posted by these 45 teams over the rest of their seasons, so Tom’s shortcut seems to work pretty well on this sample of strong-starting teams.
Applying that 35-35 regression to the Rangers and Dodgers, who both currently stand at 16-6 to begin the year, would leave you with an expected future winning percentage of .554. This method suggests that we haven’t actually learned all that much about the Rangers, as we were already pretty sure that they were good at baseball. Their first month confirms our preseason expectations, but shouldn’t change them all that much.
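The 35-35 shortcut is simple enough to sketch in a few lines. The function name `regressed_win_pct` and its `pad` parameter are my own labels, not Tango’s, and the pooled 765-264 April record is not stated directly above; it is backed out of the adjusted 2,340-1,839 total by subtracting the 1,575 added wins and losses, so treat it as an inference.

```python
def regressed_win_pct(wins, losses, pad=35):
    """Add `pad` wins and `pad` losses to a record and return the
    resulting winning percentage (the 35-35 shortcut described above)."""
    return (wins + pad) / (wins + losses + 2 * pad)

# A single 16-6 team: (16 + 35) / (16 + 6 + 70) = 51 / 92
print(round(regressed_win_pct(16, 6), 3))

# The 45-team sample, pooled. The 765-264 April record is inferred from
# the adjusted totals in the text (2,340 - 1,575 and 1,839 - 1,575).
teams = 45
pooled = regressed_win_pct(765, 264, pad=35 * teams)
print(round(pooled, 3))  # 2,340 / 4,179
```

Changing `pad` is the one knob here: a larger value pulls every record harder toward .500, while a smaller one trusts the observed record more.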
For the Dodgers, it’s tempting to say that perhaps they entered the year a tad underrated. Rather than regressing to the mean, Matt Kemp has doubled down on his terrific 2011 season, and quality performances from Andre Ethier and their collection of high-walk, low-power role players (A.J. Ellis, Mark Ellis, and Jerry Hairston have all been particularly good) have pushed the Dodgers out to an early lead in the NL West. Kemp can’t keep this up all year, and the Dodgers’ pitchers are due for some significant BABIP regression, but the Dodgers may be a little better than they were given credit for.
We should be careful not to overreact to the results of April performances, but we should also understand that they do carry some meaning, especially when viewed in the right context. A great first month of the season is mostly useful for putting wins in the bank that count in the final standings, but April performance can also help us understand a small part of a team’s expected future performance. April performance isn’t gospel, nor is it worthless. It’s data, and properly regressed, it can have some predictive value.