
Regression, Where Art Thou?

One of the statistical terms we mention quite a bit here when evaluating players is regression. The standard definition of the word is to go back or return to a previous state. With regard to baseball players, we use the word to describe what is likely to happen to players who are either over- or underachieving at any given point in a season.

Chipper Jones was hitting .400 halfway through the season. Did we expect him to continue that torrid pace all year? No, his performance was expected to regress as more plate appearances were accrued. The term can be a bit confusing because it is so often used with overachieving players, but, just like Chipper Jones, it goes both ways… to clarify, that’s a switch-hitter joke. Regression can also refer to a player like Robinson Cano, who performed so poorly at the beginning of this year that you knew he just had to get better. His regression resulted in an improvement.

Three pitchers we have discussed multiple times on this site in posts involving this very term are Joe Saunders, Ervin Santana, and Gavin Floyd. All three are in the midst of career years, but either their ERA-FIP differentials or their true talent projections told us that they would not be very likely to sustain their performance levels all year. Since all three have made 27 starts, let’s compare the first 18 to the most recent 9 (keep in mind that the FIP here is crude, adding a flat 3.20 rather than the exact league constant):

Joe Saunders, LAA
18: 120.1 IP, 105 H, 14 HR, 31 BB, 63 K, 3.07 ERA, 4.44 FIP
9:  50.1 IP,   62 H,  5 HR, 17 BB, 18 K, 5.19 ERA, 4.79 FIP

Gavin Floyd, CHW
18: 111.2 IP, 87 H, 17 HR, 47 BB, 75 K, 3.63 ERA, 5.10 FIP
9:  55.1 IP,  56 H,  6 HR, 17 BB, 44 K, 3.58 ERA, 3.94 FIP

Ervin Santana, LAA
18: 121.1 IP, 106 H, 12 HR, 32 BB, 112 K, 3.56 ERA, 3.43 FIP
9:  62.1 IP,   57 H,  6 HR, 11 BB,  71 K, 2.70 ERA, 2.89 FIP
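For readers who want to check the arithmetic, here is a minimal sketch of the crude FIP calculation, assuming the common shorthand of (13×HR + 3×BB − 2×K) / IP plus the flat 3.20. The function names are our own illustration, not part of any library, and innings are passed as strings so that the baseball notation ".1" and ".2" can be read as one-third and two-thirds of an inning. The inputs are the Saunders and Floyd lines from the splits.

```python
def innings(ip: str) -> float:
    """Convert baseball innings notation ("120.1" means 120 1/3) to a float."""
    whole, _, outs = ip.partition(".")
    return int(whole) + int(outs or 0) / 3


def crude_fip(ip: str, hr: int, bb: int, k: int, constant: float = 3.20) -> float:
    """Crude FIP: (13*HR + 3*BB - 2*K) / IP plus a flat constant (assumed 3.20)."""
    return (13 * hr + 3 * bb - 2 * k) / innings(ip) + constant


# (IP, HR, BB, K) copied from the Saunders and Floyd split lines.
splits = {
    "Saunders, first 18": ("120.1", 14, 31, 63),
    "Saunders, last 9": ("50.1", 5, 17, 18),
    "Floyd, first 18": ("111.2", 17, 47, 75),
    "Floyd, last 9": ("55.1", 6, 17, 44),
}

fips = {label: crude_fip(*line) for label, line in splits.items()}
for label, fip in fips.items():
    print(f"{label}: {fip:.2f}")
```

Rounded to two decimals, these come out to 4.44, 4.79, 5.10, and 3.94, matching the FIP column for those two pitchers.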

Oddly enough, only Saunders has experienced any type of regression over his most recent nine starts. Floyd and Santana have improved. While Floyd’s ERA is essentially the same in the split, his FIP is much better, meaning the performance has been more skill-driven than before. Santana has been pitching lately like the guy on the Mets with the same last name, if not better. With only four or so starts remaining for each of these players, barring some Sabathia-in-April-type performances, not much damage to their overall stat-lines can be done.

Still, one year of data isn’t enough to evaluate a player, and the true talent level will give us a much better estimate. This is why, even though Floyd and Santana are pitching very well, I would have to imagine they do not yet inspire confidence in fans of the South Siders and Halos. The last important thing to remember is that regression does not always result in a bad season. Even though Saunders has performed poorly lately, he is not that bad, and he still has a very solid ERA. He isn’t as good as we were “led to believe” early on, but he is not as bad as his most recent starts suggest. The jury is still out on Floyd and Santana. Hopefully, at this time next year, we’ll know whether they are for real or flashes in the pan.


A lifelong Phillies fan, my work can also be found at Baseball Prospectus.

5 Responses to “Regression, Where Art Thou?”

  1. Bill Krevski says:

    Santana has always had #1 starter stuff; this year he commands it and in turn has become a #1 starter. I see no reason to expect any sort of regression. He’s as legit as they come, but you don’t have to take my word for it: just turn on the Angels every 5th day and see for yourself.


  2. willkoky says:

    This question was undoubtedly answered somewhere in baseball regression history but I have never seen the answer so I’d like to ask. I don’t get something about regression as applied to baseball and you are sort of answering it. If regression exists then why do the performance highs and lows so often come in bunches? Why do players get hot for two months and then cold for two months? Often people say he will regress to his true talent level. But if he were playing at his true talent level the whole time then the points at which he would differ from his true talent level would be randomly dispersed wouldn’t they? Lucky sometimes, unlucky others; not come in streaks. The bulk of the regression effect doesn’t seem to come from luck, it seems to come from what you are describing, a change in talent level. Even mid season. It seems to be hard to maintain superior or inferior talent to what you have displayed in the past. Not to have truly superior or inferior luck. I’m probably missing something but I wanted to ask.


  3. Eric Seidman says:

    willkoky,

    You’re not really missing anything, just circling around the issue it seems. Players have true talent levels, which are based on the previous three or so years of weighted data. If Albert Pujols is projected to hit .330/.420/.630 this year, it means that we know enough about his past performance, plus normal aging curves, to expect this type of performance.

    If he busts out of the gate for the month of April with a .380/.500/.800 line, it doesn’t mean he has become Barry Bonds in 2004, or Babe Ruth; it means that he had a great April. What we really want to know in that case is how it affects his true talent level.

    Since it is only a month of data, compared to three previous years, it may increase his TTL to something like .336/.427/.640, which is better than the pre-season projection but still nowhere near the hot streak in April. Then, say he has a somewhat cool month of May, hitting .278/.379/.500… again, it doesn’t mean he is THIS type of player, but rather that he is regressing. He wasn’t as good as April or as “bad” as May. A May like that might bring his true talent level to .332/.423/.634, so it is still better than projected prior to the year, but you can see how it can shift.

    All we know about a player is his true talent level, and when we look at players early in the year, or even later in a given year, we want to know how his performance changes his TTL, if it changes it at all.
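As a rough illustration of how one month nudges an estimate built on years of data, here is a minimal sketch using a simple plate-appearance-weighted average on on-base percentage. This is a deliberate simplification rather than how any actual projection system works: the weights (`PRIOR_PA` and `MONTH_PA`) and the function name are hypothetical, chosen only for illustration, so the results land near the figures in the comment without reproducing them exactly.

```python
def update_estimate(prior: float, prior_pa: float,
                    observed: float, observed_pa: float) -> float:
    """Blend a prior estimate with new data, weighting each by plate appearances."""
    return (prior * prior_pa + observed * observed_pa) / (prior_pa + observed_pa)


PRIOR_PA = 700  # hypothetical weight given to the pre-season projection
MONTH_PA = 100  # hypothetical plate appearances in one month

preseason_obp = 0.420
# A hot April (.500 OBP) pulls the estimate up only slightly.
after_april = update_estimate(preseason_obp, PRIOR_PA, 0.500, MONTH_PA)
# A cool May (.379 OBP) pulls it back down, but not below the projection.
after_may = update_estimate(after_april, PRIOR_PA + MONTH_PA, 0.379, MONTH_PA)

print(f"after April: {after_april:.3f}")
print(f"after May:   {after_may:.3f}")
```

Under these assumed weights the estimate moves from .420 to about .430 after April and settles near .424 after May: still above the pre-season projection, and nowhere near either month's raw line.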


  4. Sky says:

    Will — keep in mind that hot and cold streaks aren’t continuous, unbroken streaks. Hitting .400 for a month still means that 60% of at-bats are failures. We define streaks by choosing cut-offs that make them appear as impressive as possible. Let’s say you flip this sequence of H/T:

    HHTTTTHHHHTTTTHH

    You could break that up into first halves and second halves, which have the same number of heads and tails — nothing crazy there. Or you could take from the first T to the last T, which is 2/3 tails — more “streaky”.
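The two ways of slicing that sequence can be checked with a short sketch. The helper name `tails_share` is made up for this illustration.

```python
flips = "HHTTTTHHHHTTTTHH"


def tails_share(segment: str) -> float:
    """Fraction of flips in the segment that came up tails."""
    return segment.count("T") / len(segment)


# Cut-off 1: split the sequence into two equal halves.
half = len(flips) // 2
first_half, second_half = flips[:half], flips[half:]

# Cut-off 2: start at the first tail and stop at the last tail.
streak = flips[flips.index("T"): flips.rindex("T") + 1]

print(first_half, tails_share(first_half))
print(second_half, tails_share(second_half))
print(streak, tails_share(streak))
```

Each half is an even 50% tails, while the hand-picked window of 12 flips is two-thirds tails. The flips are identical; only the cut-offs differ.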


  5. willkoky says:

    Thanks.

    Sky, yes we do create streaks, but baseball sure seems more streaky than randomness would project; perhaps it’s been proven already that it isn’t. What’s more, I would expect it to be streaky; humans appear streaky in many walks of life, especially college seniors.

    I guess I’m saying it seems that the month to month changes Senior Seidman describes in Pujols are changes in TTL. If they weren’t, they would be explainable by luck stats, and not supported by TTL stats. TTL appears to be an average TL for the year. It’s not that Pujols is the exact same player in April and in May and that he regresses to his TTL. It’s that he actually is better in April and worse in May, and that it’s hard to do the right thing every time and stay that good for an extended period. It’s the only way to explain the streakiness of good and bad, instead of a random distribution of goodness and badness.

    The effect of TTL is the same whether you believe the above or not, but it seems like it’s an important distinction about why the regression happens.

    I think. :)
