When I told the writing staff here of my desire to write a post on regression to the mean, using Matt Cain as a proxy of sorts, Dave replied: “Admit it, you just want to write about Cain again…” He wasn’t really wrong, as many of you know that Cain happens to be my favorite non-Greg Maddux pitcher, but his 2009 season has been so interesting to date that he seems like the perfect subject for a discussion of regression. I have seen it happen countless times: fans interested in developing their statistical knowledge tend to go through a few stages with regard to the evolution of a particular metric.
First, they are very skeptical, wary of accepting something new as meaningful. Next, they grasp the underlying meaning and begin to incorporate the stat into analysis. Finally, basking in the fact that they understand the benefits of the stat, they toss it around whenever possible and treat it like gospel. Unfortunately, when this last part occurs, the true understanding is not fully developed and definitive claims are often a bit off course. This is in no way a criticism, as I myself have gone through the same stages at one time or another, but rather an observation.
With regard to Cain, I have seen way too many analyses discussing his ERA-FIP disconnect and claiming that an ugly regression, one that causes his ERA to balloon, is inevitable. I profiled this over at Baseball Prospectus, begging those making such claims to dig deeper and find out what pitchers are doing differently, if anything, before jumping to conclusions. After all, not everyone regresses, and not all regressions are bad. The problem is that the term regression carries such negative connotations these days that it seems odd for it to portend anything positive. Regression is in fact a two-way road, though, and deserves to be treated that way.
No, Matt Cain is not very likely to sustain an 88% strand rate, but he is also unlikely to post walk and strikeout rates that stray drastically from his true talent level. A pitcher with strikeout rates ranging from 7.4 to 8.4 per nine innings has a pretty low likelihood of suddenly whiffing hitters at a rate closer to 6.0; likewise, one with an established unintentional walk rate around 3.5 per nine probably will not finish the next season closer to 4.3, barring unforeseen circumstances. Despite this, after Cain’s eighth start, when his 2.65 ERA supremely bested an FIP built upon a 4.24 UBB/9 and 6.0 K/9, nobody really thought to suggest that those rates would regress (in this case, a positive regression). The moral here, I suppose, is that even though his strand rate will not stay that high, he is also likely to allow fewer baserunners who need to be stranded.
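To make the "positive regression" idea concrete, here is a minimal sketch that blends an observed rate with an established level, weighting each by innings. The K/9 and UBB/9 figures come from the paragraph above; the 7.9 K/9 is simply the midpoint of the 7.4 to 8.4 range, and `innings_so_far` and `prior_innings` are made-up weights chosen for illustration, not Cain's actual innings or any projection system's method.

```python
def regress_rate(observed, innings, established, prior_innings):
    """Blend an observed per-nine rate with an established level.

    A plain innings-weighted average: the more innings observed,
    the less the estimate is pulled toward the established level.
    prior_innings is a hypothetical weight, not a measured constant.
    """
    total = innings + prior_innings
    if innings < 0 or prior_innings < 0 or total == 0:
        raise ValueError("innings weights must be non-negative and not both zero")
    return (observed * innings + established * prior_innings) / total


# Rates quoted in the article after Cain's eighth start.
observed_k9 = 6.0
observed_ubb9 = 4.24

# Established levels: midpoint of the 7.4-8.4 K/9 range, and roughly 3.5 UBB/9.
established_k9 = 7.9
established_ubb9 = 3.5

# ASSUMPTIONS for illustration only.
innings_so_far = 50.0
prior_innings = 150.0

expected_k9 = regress_rate(observed_k9, innings_so_far, established_k9, prior_innings)
expected_ubb9 = regress_rate(observed_ubb9, innings_so_far, established_ubb9, prior_innings)

print(f"K/9 estimate:   {expected_k9:.2f}")
print(f"UBB/9 estimate: {expected_ubb9:.2f}")
```

Under these assumed weights both estimates land between the early-season numbers and the established levels, which is the point: the strikeout rate would be expected to rise and the walk rate to fall, both of which help the pitcher.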
Five starts later, Cain has reduced his UBB/9 to 3.44 and increased his strikeout rate to 7.10. The ERA is still quite low thanks to the extraordinary strand rate, but his FIP is regressing toward the levels of his recent past. If I had to bet money on it, I would agree that Cain’s ERA is more likely to increase than his FIP is to decrease, but regression does not occur in just one metric. If his strikeout rate continues to climb back toward its established level and his walk rate either improves or holds steady, combined with a regression in stranding runners, Cain could conceivably have an ERA around 3.25 with an FIP at the 3.70 mark. At that point, the disconnect between the two stats isn’t that vast.
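For a rough sense of how much those two rate changes alone move FIP, the sketch below uses the standard FIP weights (13 for home runs, 3 for walks, 2 for strikeouts, per inning). It assumes the home run rate and hit batters are unchanged and lets the unintentional walk rate stand in for the walk term, so it is an approximation rather than Cain's actual FIP movement. The league constant in FIP cancels out when taking a difference.

```python
def fip_change_from_rates(delta_bb9, delta_k9):
    """Approximate change in FIP from changes in per-nine walk and
    strikeout rates, holding home runs and hit batters constant.

    FIP weights walks at 3 and strikeouts at -2 per inning, so a
    change in a per-nine rate is divided by 9.
    """
    return (3.0 * delta_bb9 - 2.0 * delta_k9) / 9.0


# Rates quoted in the article: after start 8 versus five starts later.
delta_bb9 = 3.44 - 4.24   # walk rate fell by 0.80 per nine
delta_k9 = 7.10 - 6.0     # strikeout rate rose by 1.10 per nine

fip_shift = fip_change_from_rates(delta_bb9, delta_k9)
print(round(fip_shift, 2))  # -0.51
```

In other words, under these assumptions the walk and strikeout improvements over five starts are worth about half a run of FIP by themselves.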
In fact, ZiPS sees Cain finishing the season with numbers similar to those above: a 3.28 ERA and a 3.83 FIP. An FIP of 3.83 is certainly very solid, as is a 3.28 ERA, and the main reason the disconnect would shrink is regression toward established talent levels in walks and strikeouts that has not yet fully played out this season.
These are certainly big “ifs,” but I really just wanted to hammer home two points. First, the numbers beneath the numbers need to be analyzed in order to find out why certain rates are where they are. Second, regression works both ways, meaning we should not ignore the areas bound to experience a positive regression, which in turn could reduce the amount of negative regression elsewhere in a pitcher’s line.