On Cy Youngs and Theoretical Pitcher WAR Models

Here at FanGraphs, we have two different models of pitcher WAR: one based on FIP, and one based on runs allowed. These represent the extreme opposite ends of the viewpoints on how much credit or blame a pitcher should receive for events in which his teammates have some significant influence. If you go with strictly a FIP-based model, a pitcher is only judged on his walks, strikeouts, and home runs, and the events of hits on balls in play and the sequencing of when events happen are not considered as part of the evaluation.

If you go with the RA9-based model, then everything that happens while the pitcher is on the mound (and in some cases, what happens after he is removed for a relief pitcher) is considered the pitcher’s responsibility, and he’s given full credit or blame for what his teammates do while he’s pitching.

Both of these models are wrong. It is clear that pitchers have some influence over the rate at which their balls in play are turned into hits, and over the order in which the events they give up occur, but it is also evident that they are not solely responsible for those two things. The quality of defensive support behind a pitcher, and the timing of when the defense either bails out or screws over their teammate, has an impact on a pitcher’s runs allowed total. The truth of nearly every pitcher’s performance lies somewhere in between his FIP-based WAR and his RA9-based WAR.

The trick is that it’s not so easy to know exactly where on the spectrum that point lies, and it’s not the same point for every pitcher. Some pitchers hold down BABIP better than others. Some pitchers are better at pitching from the stretch than others. Some pitchers on good defensive teams didn’t actually get good defensive support behind them, and vice versa, and we can’t simply assume equal defensive support for every pitcher based on a team’s total aggregate UZR or DRS.

But, at the same time, we know that both FIP-WAR and RA9-WAR are giving either too much or too little credit to pitchers for their past results. Having both on the site gives us a nice ability to say that a player was likely worth between x and y wins, and you can essentially defend almost any number in between those two points. The question in evaluating pitchers essentially comes down to how much of the Fielding Dependent aspects of pitching you want to give the pitcher credit for.

So, with the Cy Young announcements coming this evening, I thought it would be useful to look at the entire spectrum of potential pitcher-WAR models, and how the candidates would rate based on a compromise between the two extremes. We already have the 100% FIP and 100% runs allowed models on the site, but what about all the WAR-models that would result from a compromise between the two, giving some but not full credit to pitchers for things measured by FDP? That’s what the tables below are going to show. First, the American League, because that’s the more interesting conversation.

The players in this table posted at least +6.0 WAR either by FIP-WAR or by RA9-WAR, meaning that they have some legitimate claim on Cy Young consideration depending on your perspective. The table is initially sorted by 100% FIP-WAR, but each column is sortable, so you can see how the leaders would stack up by the various hybrid models that contained some parts FIP and some parts RA9. The first number in the header represents what percentage weight the FIP-based WAR is being given in that calculation, so it moves in 10% increments, from left to right, from 100% FIP-WAR to 0% FIP-WAR. The results.

Name             FIP-WAR  90/10  80/20  70/30  60/40  50/50  40/60  30/70  20/80  10/90  RA9-WAR
Max Scherzer     6.4      6.4    6.4    6.3    6.3    6.3    6.3    6.3    6.2    6.2    6.2
Anibal Sanchez   6.2      6.2    6.2    6.1    6.1    6.1    6.1    6.1    6.0    6.0    6.0
Felix Hernandez  6.0      5.9    5.8    5.7    5.6    5.6    5.5    5.4    5.3    5.2    5.1
Yu Darvish       5.0      5.2    5.3    5.5    5.7    5.9    6.0    6.2    6.4    6.5    6.7
James Shields    4.5      4.7    4.8    5.0    5.1    5.3    5.4    5.6    5.7    5.9    6.0
Hisashi Iwakuma  4.2      4.5    4.7    5.0    5.2    5.5    5.8    6.0    6.3    6.5    6.8
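
For anyone who wants to reproduce the hybrid columns, here is a minimal sketch in Python. It assumes each hybrid is nothing more than a weighted average of the two published figures, which is how the column headers are described above. The function names are made up for illustration, and because the sketch starts from the rounded endpoints shown in the table rather than unrounded WAR, its output can differ from the table by a tenth of a win.

```python
def blended_war(fip_war, ra9_war, fip_weight):
    """Weighted average of FIP-WAR and RA9-WAR.

    fip_weight is the share given to FIP-WAR (1.0 = pure FIP, 0.0 = pure RA9).
    """
    if not 0.0 <= fip_weight <= 1.0:
        raise ValueError("fip_weight must be between 0 and 1")
    return fip_weight * fip_war + (1.0 - fip_weight) * ra9_war


def war_spectrum(fip_war, ra9_war, steps=10):
    """Hybrid WAR from 100% FIP down to 0% FIP in equal increments."""
    return [
        blended_war(fip_war, ra9_war, (steps - i) / steps)
        for i in range(steps + 1)
    ]


if __name__ == "__main__":
    # Rounded endpoints for Yu Darvish, copied from the table (assumed inputs).
    for i, war in enumerate(war_spectrum(5.0, 6.7)):
        print(f"{100 - 10 * i}/{10 * i}: {war:.2f}")
```

A straight-line blend is the simplest possible compromise; it treats every pitcher the same way, which is exactly the limitation discussed earlier.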

Because of their excellence in both FIP and runs allowed, the two Tigers grade out as at least +6 WAR pitchers by every combination model, and even the polar extremes. No matter what amount of credit you want to give to a pitcher for his FDP, Scherzer and Sanchez will essentially come out as legitimate Cy Young candidates.

Darvish and Iwakuma become legitimate candidates if you believe that a pitcher is responsible for a large majority of the fielding dependent outcomes, as Darvish climbs above +6 WAR in every model that weights runs allowed as at least 60% of the calculation; Iwakuma gets over +6 WAR in the models where runs allowed is at least 70% of the calculation. In the 40/60 and 30/70 models, however, Scherzer still comes out on top.

In the 20/80 model, Darvish actually takes the lead, but the differences are so tiny that any reasonable conclusion would be that it’s a three-way tie. Darvish and Iwakuma essentially tie in the 10/90 and 0/100 models, with Scherzer finally sliding out of the top position in just those two calculations.

Across the spectrum of potential pitcher WAR models, Scherzer is the outright leader in 8 of the 11 calculations and effectively tied for the lead in a ninth. The only way to give the 2013 AL Cy Young Award to someone besides Scherzer is to lean almost entirely on runs allowed, and believe that absolutely everything that happens when a pitcher is on the mound is the responsibility of the pitcher. I find that position pretty hard to defend, personally, and given that the pitchers with a very small advantage on the right side of the spectrum have a much larger deficit on the left side, I think the reasonable conclusion is that Scherzer was the American League’s best pitcher in 2013.
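
To see who comes out on top at each weighting, here is a rough Python sketch that picks the highest hybrid WAR in every column. The inputs are the rounded FIP-WAR and RA9-WAR endpoints from the AL table, so near-ties can tip either way relative to unrounded data, and `AL_CANDIDATES` and `leaders_by_model` are illustrative names rather than anything from the site.

```python
# Rounded (FIP-WAR, RA9-WAR) endpoints copied from the AL table (assumed inputs).
AL_CANDIDATES = {
    "Max Scherzer": (6.4, 6.2),
    "Anibal Sanchez": (6.2, 6.0),
    "Felix Hernandez": (6.0, 5.1),
    "Yu Darvish": (5.0, 6.7),
    "James Shields": (4.5, 6.0),
    "Hisashi Iwakuma": (4.2, 6.8),
}


def leaders_by_model(candidates, steps=10):
    """Map each FIP percentage (100 down to 0) to the pitcher with the top hybrid WAR."""
    leaders = {}
    for i in range(steps + 1):
        fip_weight = (steps - i) / steps

        def hybrid(name):
            fip_war, ra9_war = candidates[name]
            return fip_weight * fip_war + (1.0 - fip_weight) * ra9_war

        leaders[round(100 * fip_weight)] = max(candidates, key=hybrid)
    return leaders


if __name__ == "__main__":
    for fip_pct, name in leaders_by_model(AL_CANDIDATES).items():
        print(f"{fip_pct}/{100 - fip_pct}: {name}")
```

With these rounded inputs, Scherzer leads the eight models from 100/0 through 30/70, Darvish edges ahead at 20/80, and Iwakuma is narrowly on top at 10/90 and 0/100, each time by about a tenth of a win or less.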

Now, for the National League, which isn’t quite as compelling.

Name             FIP-WAR  90/10  80/20  70/30  60/40  50/50  40/60  30/70  20/80  10/90  RA9-WAR
Clayton Kershaw  6.5      6.7    7.0    7.2    7.4    7.7    7.9    8.1    8.3    8.6    8.8
Adam Wainwright  6.2      6.1    6.0    6.0    5.9    5.8    5.7    5.6    5.6    5.5    5.4
Matt Harvey      6.1      6.1    6.1    6.1    6.1    6.1    6.0    6.0    6.0    6.0    6.0

If you absolve the pitchers of all their non-BB/K/HR events and ignore the timing of when events occurred, Kershaw is in a pretty tight race with Wainwright and Harvey. If you give any credit to the pitcher for his FDP events, however, Kershaw starts to pull away easily, taking a lead of roughly +1 WAR as early as the 80/20 calculation. By the time you get to 100% runs allowed, Kershaw is nearly +3 WAR ahead of the next closest NL starter. As much as I love Adam Wainwright, there’s really no case for anyone besides Kershaw here. Any calculation that includes both FIP and runs allowed will have Kershaw running circles around the rest of the field.

I think eventually we’ll get to a point where we have enough data to support a model of pitcher WAR that includes both FIP and FDP events, and my guess, and right now it’s only a guess, is that the data will suggest something in the range of the 70/30 model, giving pitchers some credit for FDP but being closer to FIP than RA9. If I had to put my money down on one single calculation that captured all pitchers’ past value most accurately (and I couldn’t switch between models for different types of pitchers), I’d probably pick the 70/30 model from the table above.

And that’s why I don’t have any problem with Scherzer or Kershaw winning today. They’re both deserving candidates, no matter how you try to isolate a pitcher’s contributions to run prevention.




Dave is a co-founder of USSMariner.com and contributes to the Wall Street Journal.


38 Responses to “On Cy Youngs and Theoretical Pitcher WAR Models”

  1. ASURay says:

    Both of these models are wrong.

    All models are wrong, so you’re okay :)

  2. Bryce says:

    This is an interesting way to get a quick idea of how big of a spread there is among candidates. However, I have a couple of quibbles:

    1) Linear interpolation isn’t really a compromise between these measures. A more-interesting one might be WAR based on wOBA allowed, which would remove sequencing-dependence without removing defense-dependence.

    2) FIP isn’t really an extreme on the spectrum: we all know about xFIP, and you could get even more defense-independent with TIPS, explained in a recent community article: http://www.fangraphs.com/community/tips-a-new-era-estimator/

    • japem says:

      And there’s another community article about evaluating pitchers on wOBA allowed…

      http://www.fangraphs.com/community/wrc-for-pitchers-and-koji-ueharas-dominance/

      (I wrote that article)

      I’ll try to come up with a version of wOBA-WAR since I already have all my data. If I find anything useful I’ll put it up on the community research blog.

    • Ben Hall says:

      xFIP sets HR/FB rate to league average, and while that may be better for prediction, it is not saying what happened. Whether or not you expect pitchers to give up as few (or as many) home runs as they did per fly ball in the past, they gave up those home runs. When talking about value, xFIP shouldn’t be used.

      • Bip says:

        FIP sets BABIP and LOB% to league average by scaling itself to ERA. FIP could also be said to not measure what happened, because it ignores hits allowed by the pitcher. It does this because the player doesn’t have total control over whether a ball in play goes for a hit, but he also doesn’t have total control over whether a ball in play goes for a home run, or if a batter strikes out against him.

        FIP basically applies a threshold that says “a pitcher must have at least X% control over this event to be included”. xFIP does the same thing, but with a higher value for X. Where that line is drawn and why is not particularly well defined.

      • Bip says:

        So, for example, let’s say a pitcher holds the following responsibility percentages:

        10% for event sequencing
        20% responsibility for hits
        40% for homers
        50% for batted ball type
        60% for strikeouts and walks

        RA/9 includes all of it. A WAR based on wOBA allowed puts that threshold around 20%. FIP puts it around 40%. xFIP puts it at around 50%.

        That is the major difference between those measures to me. The idea is that the more pitcher skill is represented in the measure, the more predictive ability it has because your data is more meaningful and there is less noise, with the trade-off that you are neglecting the real (though small) influence a pitcher has over those events that fall below the threshold. But everything except RA/9 excludes things that happened. FIP definitely does.

  3. John Elway says:

    Mr Cameron, you truly are a Man O’WAR

    Just neighing…

  4. Luke says:

    Do you have any thoughts on the value of xFIP-WAR or SIERA-WAR?

    • Brandon says:

      The main reason why xFIP/SIERA WAR doesn’t really make sense is that when a pitcher gives up a home run, his WAR should decrease. Because xFIP regresses home runs pretty much completely, the home runs he actually allowed really don’t matter at all.

      As predictors of player WAR, xFIP and especially SIERA work very well, but they’re not good descriptors of past value.

      • skippyballer486 says:

        This is based solely on your own definition of value. Value could quite easily be defined as the expected production of a player over a given time frame, which would make a good prediction a good measure of value. Using xFIPWAR or sieraWAR would attempt to decouple results from process and award credit for process. Whether or not they do a good (or good enough) job is open for debate; whether or not that is a definition for valuable is only open to opinion.

  5. narvane92 says:

    gjghhhhjjjkioui9gti7tgbj ygvgphm gvbg gvbvbg gfgbb fcvvrd xffv fvcdc xww2wq

  6. GilaMonster says:

    I feel like you need to take park factors into account. Darvish posted the second highest HR/FB% among qualified starters. Iwakuma was 23rd. Say Darvish pitched in Seattle. I think his FIP would be lower.

    • GilaMonster says:

      Yeah, but it is blind to overall park factor. I think the overall difference between Arlington and Safeco is 6, but HR it is 10. HRs are worth more than anything else.

      • Hank says:

        I’ve always wondered about this. I believe the adjustments are based on run factors but some parks are pretty divergent between HR and run environment factors.

        Take for example comparing a pitcher in Boston with one in NY. They have fairly similar run environments but vastly different HR environments. It would seem a FIP-based WAR model may be overstating or understating value for parks that don’t have similar HR and run factors.

  7. Jose Altuve says:

    It would be interesting to see (and forgive me if it exists) a measurement of WAR based strictly on how difficult a pitcher’s pitches are to hit up to and including contact but excluding anything that happens afterward. I assume this would be some combo of swinging strike rate, looking strike rate, and some contact quality metric I can’t name.

    • tz says:

      I was thinking about something like that as well. A proxy for contact quality could be some kind of weighted average of infield fly ball %, OF FB%, line drive %, and ground ball %. I think tERA uses something like this to simulate the ERA equivalent of how difficult a pitcher’s pitches are to hit.

      When there is enough PitchFX data over the years, it should be possible to regress the at-bat result against the pitch speed, movement, and location. Figuring out the impact of pitch sequencing on this regression would be the biggest challenge.

    • The Stranger says:

      For contact quality, maybe you could use batted-ball mix, the same as xBABIP? I think we’re getting to where we have good speed-off-the-bat data as well, which seems (a) like something the pitcher has some control over, and (b) like it should correlate somewhat with whether those batted balls are ending up as hits.

  8. Bip says:

    If I had to put my money down on one single calculation that captured all pitcher’s past value most accurately (and I couldn’t switch between models for different types of pitchers), I’d probably pick the 70/30 model from the table above.

    Why would you want to switch between models for different pitchers? If you have to switch between models for different pitchers, doesn’t that suggest your model is wrong? And, in fact, no model that relies on scaling FIP and RA/9 is correct, because exclusion of RA/9 excludes a pitcher’s ability to influence balls in play, and any inclusion of RA/9 includes defense and luck.

    When I read this, I suspect he is referring to knuckleballers, and that for them he would choose a model more heavily weighted towards RA/9. But why would this be better for knuckleballers? Does being a knuckleballer correlate with better defense? Are knuckleballers naturally lucky people? The percentage of a player’s BABIP variation explained by the player’s skill should not vary per pitcher, it is only the degree of variation from the mean that player is able to affect. Knuckleballers may exert a greater suppressing force on their BABIP than other pitchers, but luck and defense will play just as big a role for them as for everyone else. Therefore their FIP to RA/9 balance should be the same as for everyone else.

  9. Bad Bill says:

    Thoughtful and thought-provoking analysis, and timely because of the way FIP and xFIP have attained near-oracular status in some circles. It may be time to take on some other sacred cows of sabermetrics as well, for example, the contention that there is nothing special about being a closer or pitching in a save situation.

    • Brandon says:

      The ideas behind FIP/xFIP are good, the problem is that they’re pretty blunt ideas.

      SIERA is definitely a step in the right direction as an attempt to use the same principles as FIP/XFIP but to be less blunt about it.

  10. Jordan says:

    This was a really interesting presentation – especially like the sortable graphs. Having said that, though, the idea of combining RA and FIP based models seems to me misguided. The issue with RA9 WAR, much like the issue with UZR, is that one season isn’t a large enough sample for the data to stabilize. But the way to deal with that isn’t to search for some other stat to balance out single season samples of UZR data, it’s to gather more data. Over a large enough sample, 100% RA9 WAR is the best gauge of true talent; over a single season, maybe we’d want to regress that a certain amount to account for luck.

    • Brandon says:

      The problem with this is that it’s not necessarily true. If a groundball pitcher pitches for seven years with a bad infield defense, RA-9 is not going to be the best measure of talent.

      But, yeah, if you don’t go to the extremes RA-9 does tend to become more accurate as the sample size increases. Then again, that’s true for pretty much everything though.

      • Jordan says:

        The point isn’t that RA9 WAR gets more accurate as the sample size increases, but that it becomes a better gauge of true talent than FIP WAR at larger sample sizes. And that the issue at smaller sample sizes isn’t that the metric is flawed, but rather that – like single season samples of UZR – the data hasn’t had enough time to stabilize. And combining RA9 WAR with FIP WAR doesn’t do anything to fix that.

    • Jordan says:

      I’ve given this a little more thought, and I feel like the way to do this would be to regress the individual components of RA9 WAR that include luck (BABIP, sequencing, LOB%) towards a combination of league average and player’s career average, with the latter gaining weight as his career IP increases. I don’t know enough about statistics to say how that would be done, but it seems like the best way to go.

    • Bip says:

      The issue with RA/9 WAR is that it depends on luck and defense. Large samples mitigate this because over large samples, large deviations from the mean in luck will become less likely, compared to the size of the sample, meaning rate stats will be perturbed less by variation in luck. With defense, we expect a player’s defense will probably not be consistently great or terrible over a large sample, but we don’t know that. Many Rays pitchers will probably have had consistently good defense over their careers.

      • Jordan says:

        The same is probably true for UZR, though, but we just don’t understand it as well. Things like which park you play in, which defenders you play alongside, and which pitchers you defend behind probably have some impact on UZR, and if a particular player spends most of his career playing in similar circumstances you could see similar distorting effects. But these sorts of outlier cases don’t mean we should abandon the metrics.

        • Bip says:

          Who said anything about abandoning it? I’m pointing out that the number, when used to measure a pitcher’s effectiveness, is flawed. When a measure is flawed, it’s not useless, but it does mean we should find something that isolates what we want to measure without including what we don’t want.

      • Hank says:

        But when looking backward at past value like say for awards (and not using it as a predictor), does it really matter if the pitcher got lucky or unlucky – it is what it is. We wouldn’t even think about adjusting a hitter’s value if he had a BABIP well off his career norm, but we would consider it if we were trying to value him moving forward.

        I do think you need to have some sort of adjustment for defense though, how is beyond me. Crudely you can look at team level UZR but then you have issues if a guy is an extreme FB or GB pitcher and the distribution of that defense could be a factor.

        But if a guy stranded 80% of his baserunners or had a .240 BABIP against, I think once you correct for defense his value is his value (and the sustainability of that going forward is a separate question).

        • TheWrightStache says:

          This discussion is right along the lines of my thoughts on the subject. Leaving defense aside for a second, has anyone done the math to see what kind of sample size is needed for RA9-WAR to become more predictive than FIP-WAR?

        • Jordan says:

          Tango has, but I don’t remember the results offhand.

        • Matthew Cornwell says:

          Pretty sure Tango said 7 or 8 years is the point in which RA is more predictive.

          Not surprising, as 6-9 years is also the point in which r=.5 for BABIP, HR/FB, and LOB%.
