On Context, or Evaluating Hitters and Pitchers Differently

Here at FanGraphs, our pitching WAR is built around Fielding Independent Pitching, which focuses solely on a pitcher’s walks, strikeouts, and home runs allowed. Because it ignores the results of balls in play and the order in which results occur, there are occasionally big differences between a pitcher’s FIP and his ERA. This divide often leads to some consternation when a pitcher with a high ERA posts a decent WAR, or in reverse, when our WAR doesn’t grade out a pitcher with a very low ERA that highly.

A significant number of people, including a good chunk of our own readers and noted sabermetric evangelists like Brian Kenny, prefer to evaluate pitchers by runs allowed because, as I’ve heard repeatedly over the last few years, that measures “what actually happened”. And that’s one of the reasons we have RA9-WAR here on the site, as we know that a sizable number of people prefer to evaluate pitchers in that way.

I believe there are valid points on both sides, and I see the argument for using both a FIP-based WAR and a RA9-based WAR when evaluating a pitcher’s past performance. However, I find it interesting that this debate has not carried over to position players, where there seems to be broad consensus* that context-neutral is the way to go.

*Allowing for the fact that there was definitely some positive response to my article on Context Batting Runs last week, my feeling is that there’s still not much of a push towards this kind of evaluation for hitters.

It’s not even just those of us who subscribe to a linear weights based WAR, like we use here on FanGraphs. Looking at a player’s standard batting line, using BA/OBP/SLG with some adjustments for playing time, or using OPS+ all amount to offensive evaluations that consider only the number of events that happened, not the situation in which they occurred. And there is essentially broad agreement that these are the best types of measures to use when evaluating how well a hitter contributed to his team’s performance.

The only context-specific statistic that has any real traction is RBIs, and the sabermetric community — myself included, so I’m not pointing fingers here — has spent years explaining why using RBIs to evaluate a player’s contributions is a misuse of statistics. RBIs are something of a pariah in the analytic community, shunned because they are a team statistic masquerading as an individual measure. For valid reasons, they’ve been marginalized by sites like this one, almost completely removed from the discussion of player value among the “new school” crowd.

These decisions on how to value player performance are incongruous. If you use something like wOBA to evaluate a hitter, you are making a conscious decision to ignore the order of events that “actually happened”. However, when evaluating a pitcher by runs allowed, you’re making the opposite decision: you include sequencing factors and hold the pitcher entirely responsible for the order in which events occurred.

For instance, here are two hypothetical innings, with only the order of events changed.

Scenario A: Single, single, homer, fly out, fly out, fly out.

Scenario B: Homer, single, single, fly out, fly out, fly out.

By runs allowed, the pitcher in Scenario A is charged with giving up three runs, while in Scenario B, he’s either giving up one or two, depending on whether or not the guy who hit the first single was fast enough to tag up and reach third on the first fly out, and then whether the second fly out was hit deep enough to score him. His FIP for the inning would be 16.04 in either case, but his ERA could be 9.00, 18.00, or 27.00, depending on the order of events and the speed of the inning’s second hitter.
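For readers who want to check the arithmetic, here is a minimal sketch of those numbers. It assumes the standard FIP formula and a league constant of 3.04, which is simply the value implied by the 16.04 figure above; the actual constant is recalculated each season, and the function names are made up for illustration.

```python
# Assumption: 3.04 is the constant implied by the 16.04 figure in the text.
# The real FIP constant is recalculated for every league-season.
FIP_CONSTANT = 3.04


def fip(hr, bb, hbp, k, innings, constant=FIP_CONSTANT):
    """Standard FIP formula: (13*HR + 3*(BB+HBP) - 2*K) / IP + constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / innings + constant


def era(earned_runs, innings):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / innings


# Either scenario: one homer, no walks, no hit batters, no strikeouts, one inning.
inning_fip = fip(hr=1, bb=0, hbp=0, k=0, innings=1)

# ERA depends on how many runs actually crossed the plate: one, two, or three.
possible_eras = [era(runs, 1) for runs in (1, 2, 3)]

print(f"FIP: {inning_fip:.2f}")
print("Possible ERAs:", possible_eras)
```

The two singles never enter the FIP calculation at all, which is why the order of events cannot move it.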

By wOBA, these two innings are exactly the same, as it simply sees six individual events, giving out credit for each one without regard for what came before or after. The third batter in Scenario A will get the same credit as the first batter in Scenario B, because they both hit home runs, and because the measure is context-neutral (by design), it will simply give them the average credit for a home run based on the expected distribution of when home runs occur. wOBA, like FIP, sees these two innings as equal, both from the perspective of the hitter and pitcher.

ERA and RA9 would see these innings quite differently, assigning three runs to the pitcher in Scenario A — and either one or two in Scenario B — because that is how many runs were scored while he was pitching. The sequence of events is a significant factor in how the pitcher is valued.
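The contrast can be sketched in a few lines of code. This is only an illustration under loud assumptions: the baserunning rules are deliberately crude (a single moves every runner up one base, nobody tags up on a fly out, so Scenario B comes out at the low end of its one-or-two range), and the linear weights are round placeholder numbers, not the values FanGraphs actually publishes.

```python
from collections import Counter

# Placeholder run values per event (assumed for this sketch only).
LINEAR_WEIGHTS = {"single": 0.45, "homer": 1.40, "fly_out": -0.25}


def runs_scored(events):
    """Play out one inning in order, using very simple baserunning rules."""
    bases = [False, False, False]  # first, second, third
    runs = 0
    outs = 0
    for event in events:
        if outs == 3:
            break
        if event == "single":
            runs += int(bases[2])               # runner on third scores
            bases = [True, bases[0], bases[1]]  # everyone else moves up one
        elif event == "homer":
            runs += 1 + sum(bases)              # batter plus all runners score
            bases = [False, False, False]
        elif event == "fly_out":
            outs += 1                           # no tag-ups in this sketch
    return runs


def linear_weight_runs(events):
    """Context-neutral value: depends only on how many of each event occurred."""
    counts = Counter(events)
    return sum(LINEAR_WEIGHTS[name] * n for name, n in sorted(counts.items()))


scenario_a = ["single", "single", "homer", "fly_out", "fly_out", "fly_out"]
scenario_b = ["homer", "single", "single", "fly_out", "fly_out", "fly_out"]

for label, inning in (("A", scenario_a), ("B", scenario_b)):
    print(label, runs_scored(inning), round(linear_weight_runs(inning), 2))
```

The runs-scored function has to walk through the events in order, while the linear-weights function only needs the counts. That difference in what each calculation requires as input is the whole disagreement in miniature.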

This isn’t to say that one or the other is definitively right or wrong. There is some evidence that pitcher sequencing is at least partially a skill. Guys like Jim Palmer and Tom Glavine accumulated an extra +13 wins from sequencing during their careers, while Nolan Ryan was an amazing 34 wins in the negative based on the order of events. If a pitcher struggles to pitch out of the stretch — or conversely pitches in such a way that he strands more runners than you might expect from his overall numbers — then there is an argument that the pitcher ought to be credited or blamed for those results.

Of course, the answers aren’t always that clear cut, especially when we’re not looking at a player’s entire career. In the two scenarios above, we really don’t yet know how much credit or blame the pitcher should receive for the two singles that occurred. They could have been scorching line drives that no fielder had a chance to make a play on, or they could have been routine ground balls that rolled into the outfield because the defenders behind the pitcher have the range of a potted plant.

Both RA9 and FIP respond to this uncertainty by taking polar extremes, with FIP giving the pitcher no responsibility for the hit and RA9 giving the pitcher all the responsibility. Both are clearly wrong. There are ways to attempt to adjust for defensive performance, as Baseball-Reference does with their version of pitcher WAR, but they require some huge assumptions about the consistency of team defensive performance that are also clearly wrong, and they get away from that “what actually happened” point of origin.

But I’m getting a little off course here. The goal of this article isn’t to argue that a FIP based WAR or a RA9 based WAR is superior. I simply think it’s worth pointing out that using a linear weights based WAR for position players (which pretty much every popular WAR implementation uses, including ours) is inconsistent with using an ERA/RA9 based WAR for pitchers. If you use that combination of metrics, you are giving hitters no credit or blame for the contexts in which their performances occurred, but you are giving the pitcher full credit or blame for those same situations.

And, based on what we know about how to distribute credit for how balls in play become hits, this is probably backwards. We are pretty sure that, when a hitter gets a hit, he is the only offensive player who deserves credit for that outcome. However, when a pitcher gives up a hit, we often do not know whether it was his fault or whether it was a failure of his defense. And yet, if our scenarios included a bases clearing double instead of a home run, we would assign the pitcher (through ERA/RA9) the full blame for the two runs that scored on that double, while only giving the hitter credit for the average run value of all doubles, ignoring the fact that he drove in two runs in the process.

Again, I see the argument for using both context neutral and context dependent statistics to evaluate player performance, especially when we are looking backwards and asking questions of past value. There is a difference between trying to isolate skills and trying to measure the value of events that have already occurred. I just think that maybe we, as a community, should consider evaluating position players and pitchers the same way.

In this way, wOBA and FIP are similar, which is one of the reasons why we use FIP as our basis for pitching WAR. With a linear weights model for both hitters and pitchers, we are attempting to evaluate pitchers based on the number of events we can credit or blame them for, and not measuring the sequence in which those events occurred. If one’s preference is to use RA9-WAR, then I’d suggest that perhaps it would be more fair to also evaluate hitters based on situational performance, which would lead to relying on something like RE24 for offensive performance.
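To make the RE24 idea concrete, here is a rough sketch of how it would credit the two home runs from the earlier scenarios. The run expectancy values are rounded, illustrative assumptions rather than figures from a published table, and the state labels and function name are invented for this example.

```python
# Expected runs for the rest of the inning, keyed by (runners, outs).
# Rounded illustrative values assumed for this sketch only.
RUN_EXPECTANCY = {
    ("empty", 0): 0.48,
    ("first", 0): 0.86,
    ("first_second", 0): 1.44,
}


def re24(before, after, runs_on_play):
    """RE24 credit for one play: change in run expectancy plus runs scored."""
    return RUN_EXPECTANCY[after] - RUN_EXPECTANCY[before] + runs_on_play


# Scenario A: the homer comes with two on and nobody out, and three runs score.
homer_a = re24(("first_second", 0), ("empty", 0), runs_on_play=3)

# Scenario B: the homer leads off the inning, and one run scores.
homer_b = re24(("empty", 0), ("empty", 0), runs_on_play=1)

print(f"Scenario A homer: {homer_a:+.2f} runs")
print(f"Scenario B homer: {homer_b:+.2f} runs")
```

Under these assumed values the same swing is worth roughly twice as much in Scenario A, which is the kind of situational credit that a context-neutral measure like wOBA deliberately averages away.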

It is worth noting that RE24 here on FanGraphs isn’t a perfect replacement for Batting Runs in the WAR calculation. Because RE24 also includes SB/CS, it’s more like RE24 replaces both Batting Runs and the wSB part of our Baserunning measure. However, depending on future interest in this kind of calculation, it is possible to build a version of RE24 that doesn’t include any baserunning, and that could simply be subbed in for context-neutral batting runs if there was a desire to build a version of WAR for position players that modeled the way RA9 treats situational events for pitchers.

But then again, there’s also a school of thought that there are already too many versions of WAR going around as it is. Most people I talk to want fewer WARs, not more. The problem is that we’re not always asking the same question, and at the end of the day, answering questions is the entire reason we have analytical data to begin with. While I’m not necessarily advocating for one position or the other, I do think it’s worth pointing out that the currently popular versions of WAR for hitters do not answer the same question that a runs allowed based WAR for pitchers seeks to answer.

If you prefer RA9-WAR for pitchers, you’re essentially asking a different question than you are when you use WAR for position players. It’s worth considering whether that’s a problem we’re okay with, or whether that is an argument for either using a FIP-based WAR for pitchers (the conclusion we came to when we built WAR here on FanGraphs) or creating a new version of WAR for position players that gives them credit or blame for their situational hitting.

Otherwise, combining a linear weights based position player WAR with a runs allowed based pitcher WAR creates a bit of a paradox. Maybe we’re okay with that, but we should probably at least be aware that it’s what many people who use that combination of WARs are doing.







Dave is a co-founder of USSMariner.com and contributes to the Wall Street Journal.


101 Responses to “On Context, or Evaluating Hitters and Pitchers Differently”

  1. Will says:

    On a very simple level, it makes sense to evaluate pitchers and hitters differently because they have different ways of impacting the game. Hitters get a small amount of non-consecutive at bats each game, while pitchers face a continuous lineup. Context neutral makes more sense for hitters because they don’t contribute to the context. Pitchers, however, often make the bed they sleep in, so there’s value in judging them on the bigger picture.

    +68

    • tz says:

      Best argument I’ve heard on why they should be evaluated differently.

      +9

    • channelclemente says:

      A nuance that seems very important.


      • tz says:

        This is kind of analogous to the reason that pitchers’ BABIP is assumed to have a “universal” mean, while batters’ BABIPs vary based upon each hitter’s skill level. Any pitcher’s sample of batters faced grows fairly quickly and is close to uniformly distributed across the nine spots in the batting order.


    • Dave Cameron says:

      Except it is not at all clear that the bed pitchers are lying in was of their own making. Oftentimes, they are lying in a bed made by their defenders, or their defense saves them from the bed they made themselves.

      It’d be one thing to make this argument if the scenario was BB/BB/HR. With 1B/1B/HR, it is not at all clear that the pitcher made that bed.

      +23

      • Will says:

        It’s also not clear that a hitter’s results are independent of defense, which we basically assume them to be. Maybe that 4-4 night was the result of hitting four groundballs past a fielder with limited range?

        Your argument makes sense from a predictive standpoint, but when it comes to measuring results (what actually happened), context has much more relevance for pitchers than hitters because of the way they interact with the game. Whether that means we evaluate each one in the extreme is another question, but I really don’t think the contradiction is a dilemma.

        +5

    • tangotiger says:

      The issue with your statement is that you are leaving fielders out of the sequencing issue. If a team gives up 4 straight singles, a run-based metric would assign all of that (and its results, like runs allowed) to the pitcher, while a FIP-based metric would assign none of that to the pitcher.

      And the truth is somewhere (unknown) in the middle. Which is Dave’s point.

      +15

      • matt w says:

        Then it’s not a point that’s well served by comparing Scenario A and Scenario B. The difference between Scenario A and B is not, or need not be, in the fielders’ contribution; they could consist of the exact same balls hit to the exact same spots on the field and fielded in the exact same way. The difference is in the sequencing of a series of events, to all of which the pitcher has at least some contribution. Whereas the context a hitter faces will almost always be determined by events to which he made no contribution.

        There are at least three factors I can see at work here, over and above the players’ skill.
        1. The contribution of a player’s teammates to an event.
        2. The contribution of the player’s opponent (and luck) to an event.
        3. The context in which the event took place.

        For hitters, number 1 is usually negligible, and number 3 is usually beyond the hitters’ control. So we want to use “context-neutral” metrics which discount 3, and the only question is how we account for 2 (whether we do something like normalize BABIP in order to take out the contribution of the opponents’ defense and batted-ball luck). I suppose park factors might be a fourth element that we’d usually want to account for.

        For pitchers, 1 is quite large because of the contribution of the defense on batted balls. And 3 is at least in part dependent on what the pitcher did. While we’d like to be able to correct for 1, it’s not clear how much we ought to correct for 3, given the pitcher’s responsibility for 3. FIP corrects for 1 and 3 by eliminating defense and sequencing. rWAR corrects for 1 by explicitly normalizing against team defense but doesn’t correct for 3. But if you’ve come up with a way of allocating responsibility for defensive events between the pitcher and defense, it’s not clear that you’re obliged to ignore sequencing once you’ve done that.

        In the post Dave seems to be arguing that it’s inconsistent to use linear weights for batters and RA9-based stats for pitchers, because it gives credit to the pitchers for the context in which they pitch and no credit to the batters. But the batters in fact deserve no credit — none — for the context in which they hit, and the pitchers deserve some credit (but not all the credit) for the contexts in which they pitch. It makes sense to draw a distinction there. We want some way to separate the pitchers’ responsibility from the fielders’ responsibility, but we might not need to go all the way to a linear-weight-based metric to do that.

        +6

      • db says:

        But why do we give batters the complete credit for a single, double or triple which are influenced as much by the fielders when he hits the ball as when the pitcher pitches it? It’s that reason that I think FIP for pitcher WAR misses the point.


        • Stathead says:

          If I worked for a team, I would look at contact quality, bat speed, batted ball speed, angle off the bat, chase percentage, etc. to evaluate a player and probably less off results. That being said, it probably balances out more in the long term in the batter’s case because they’re hitting to different defenses from game to game, so over the course of a season it’s about league average, whereas a pitcher is pitching to the same defense every game (just about). It’s the same reason that HR/FB ratio and BABIP balance out for pitchers.


    • Well-Beered Englishman says:

      The only example I can think of, of a hitter affecting sequencing in an inning, is baserunning. Whether it’s making a pitcher nervous, drawing 10 pickoff attempts, stealing, being tasked with a hit-and-run, or the old bogeyman of stealing signs, I can imagine many examples of batters affecting other batters’ contexts but all of them involve reaching base first.

      Maybe if you have a 15-plus-pitch at-bat and simply wear the pitcher out, some benefit will come to the following hitter. This could be a database enquiry for a bored soul.


      • B N says:

        There are actually some studies, if I recall, that show there is value in taking more pitches for that reason. I’m also relatively sure the Red Sox used that lineup design with some effectiveness a few years ago, which was useful even on better pitchers (e.g., if you can’t beat em, at least make them throw pitches and hope they get pulled for a reliever). I would assume that it increases WAR by some degree, though I’m not sure how you’d quantify it.


    • DJG says:

      Right. If the same batter hit in all nine spots in the lineup (ghost runners!), then runs scored would probably be the only metric people would use to evaluate a hitter.

      Whether you think this would be a good measure or a bad measure, it does, I believe, explain the asymmetry Dave discusses in the article.


    • B N says:

      This. Exactly the same thought that I had: a batter doesn’t get to hit, then go to bat again. As a batter, you don’t get to say, “Hey, I’ll just try to settle for a single, because I’m sure I can drive myself in on my next at bat.” As a pitcher, you can basically do exactly the converse of that by pitching around one guy to get to another guy that you think you can get a double-play off of.


  2. Crumpled Stiltskin says:

    But there are already other paradoxes with the methodology we seem to be okay with, like assigning first basemen negative run values as their positional adjustment. I’m pretty sure teams would give up many more runs not playing a first baseman at all. Or more to the point, how can a position one needs to field to even consider playing the game start out being worth negative runs?

    -20

    • Jared Stipelman says:

      That is simply a function of how the metrics are scaled. The essential idea behind the positional adjustments would not change if we said that 1B have no positional adjustment, and simply bumped up the position adjustments for other positions.


    • olethros says:

      Negative in comparison to the value of all other defensive positions, not negative as in “subtracting runs from the team by virtue of the position’s existence.”

      Basically, it’s the only position on the diamond at which a potted plant actually could provide replacement level defense. See Fielder, Prince.


    • Dave Cameron says:

      There are a ton of metrics that have negative results because they are compared to a baseline of average. Any metric that compares a player at anything to league average will, by definition, have a lot of players in the red.


      • Crumpled Stiltskin says:

        The negative valuation for a player who plays 1st base is not like a negative valuation for batting runs (which might result because of a comparison to a baseline). It’s much more akin to the fudging that happens in many models, perhaps especially predictive models, in order to arrive at results that seem accurate while at the same time using a methodology that belies (to whatever degree) what is actually happening. The most famous and perhaps best example is Ptolemy’s model of planetary movement, which insisted that all the celestial bodies moved on a very complex and intricately connected system of circles and which, despite being wholly wrong, could still be used to accurately predict the position of planets for thousands of years.

        To me, the negative valuation of certain defensive positions seems akin to that kind of error. Perhaps it’s the best we can do to create a model that gives an accurate valuation of a player’s worth on a baseball field, but still there seems something inherently off about it.


        • Paul says:

          But if your aim is to jam a bunch of variables into one grand, elegant statistic, you have to ignore inherent biases. If we put ranges on most of these linear weighted stats, it would make them virtually meaningless. Unfortunately, a great many users of these stats don’t realize how much error there is in these models. And many others don’t care, because the result is the grand, elegant statistic they wanted.


        • Noah Baron says:

          Luckily baseball is a lot easier to understand than planetary motion.


      • Stathead says:

        These distributions tend to be highly skewed right too, so they’ll probably have more than half of players in the red.


    • Drew says:

      This is one of the worst comments I’ve ever read.


      • Grohman says:

        I think he was using the old “if you’re confusing enough, the other guy will give up” method to win the argument.


    • B N says:

      These adjustments exist so that team WAR can be, to some vague degree, an estimation of the number of wins that a team will get above an all-scrubs team (e.g., all replacement level players). So a negative WAR for your 1B doesn’t mean he’s worth less than nothing. It means he is worth less than a typical 1B you could have called up from the minor leagues.

      Sometimes I think that the 1B positional adjustment is a bit harsh on a good defender (e.g., someone who might be able to play a scratch 2B, but is blocked at that position by an excellent defender). However, there’s not much way to adjust for that unless you want to break a player down into a bunch of constituent tools, then try to find the maximum of their potential performances across all the positions on the diamond (e.g., their maximum “potential” WAR if you put them at the optimal position). Given that fielding stats are already terribly noisy, tools might be even noisier, and integrating and maximizing over these tools is probably EVEN noisier… Well, I don’t expect to see it happen soon. Good luck explaining it to a layman, too.


  3. Fastpiece says:

    “Here at FanGarphs . . .”

    +7

  4. Wobatus says:

    I also find it interesting that FIP is used for pitcher WAR and not xFIP.

    Kershaw is at 5.9 WAR but his xFIP is 2.93 and his FIP 2.36, and his ERA even lower. Kershaw has consistently outperformed his xFIP both in ERA and FIP.

    All that said, he is getting a lot of support for MVP, even though McCutchen has a higher WAR.


    • triple_r says:

      FIP seems to be a better indicator of a pitcher’s single season results, whereas xFIP is better for a longer period of time (e.g. their career as a whole).


      • NS says:

        I would argue the exact opposite. The larger the sample, the likelier a pitcher’s HR/FB ratio is close to “true talent”.


      • Wobatus says:

        I could see that, although you could also flip it around: in a given year a pitcher might get lucky in hr/fb, in which case his single year FIP may not be as representative of his true talent as xFIP. However, if, over his career, a pitcher has shown apparent hr/fb suppression abilities (Kershaw actually being a decent example), then career FIP may be more representative.



        • triple_r says:

          Yeah, that was what I was trying to get at. In one given year, a pitcher could have an extremely lucky or unlucky HR/FB%, and this will usually correlate with their ERA for the year; however, over the course of a career, it’ll usually balance out, which is why xFIP would be more useful.


    • Ben Hall says:

      WAR is assigning value to a player. Whether or not we expect a pitcher to give up the same number of home runs in the future is not part of the value of this season. The value from this season (or game, or whatever) is based on how many home runs he actually gave up. That’s why FIP is used instead of xFIP.


      • Cool Lester Smooth says:

        WAR is assigning value to a player. Whether or not we expect a pitcher to give up the same number of runs in the future is not part of the value of this season. The value from this season (or game, or whatever) is based on how many runs he actually gave up. That’s why RA9 should be used instead of FIP.


  5. tz says:

    The Holy Grail, in my opinion, would be a fully context-specific Wins Above Replacement for both hitters and pitchers. Context would include not only game situation (like WPA) and park effects but also include adjustments for the quality of opposition faced. This would aim to answer the question “How much more has Player X added to his team’s chances of winning vs. a borderline Major Leaguer, all other factors being equal?”

    Unfortunately, it’s obvious that this would be an exhaustive calculation, and would inherently bake a lot of “noise” into the final value. Except for maybe a few outliers with SSS (such as situational relievers), most of the existing WAR calculations would be highly correlated to the “theoretically correct” value. So there might not be a lot of incremental value in sharpening the WAR tool much more than already has been done.

    I do agree with Dave that we should be aware of any inconsistencies within the calculation, whether we’re ok living with them or feel the need to force consistency.


    • Hurtlocker says:

      I agree. Everyone that enjoys baseball can anecdotally relate to players that seem to be “clutch”. If you rate each at bat as positive/neutral/negative, it would give you a true measure of a hitter’s or pitcher’s overall value.


  6. Julian says:

    I don’t think it’s fair to say batters are viewed as completely context-neutral, because in many ways they are evaluated based on what happened, not what they particularly did. FIP attempts to take context out of the equation by normalizing factors like BABIP and HR/FB ratio, something that is not done in wOBA or wRC+. If we really want to take context out of the equation, why not use xBABIP instead of BABIP to attempt to normalize wOBA to be completely context-neutral? While batter WAR currently does not take into consideration outs or runners on base, it certainly takes into consideration the opposing team’s defense, or high/low HR/FB ratios, and even reaching base on error, something which is not done by a truly context-neutral metric like FIP.


    • Dave Cameron says:

      FIP does not normalize HR/FB ratio. FIP measures only the actual results of BB/K/HR, not any expected metrics. That it does not include results on balls in play makes it similar to how WAR for catchers treats the non-SB/CS part of defense, ignoring the differences between players even though we know that some catchers are better at other parts of catching than others. The metric is incomplete, but that doesn’t equate it to something like xBABIP.


  7. Monty says:

    I get why you had to keep repeating your non-position on which type of evaluation is better, but it gives the whole article a “Tell my wife I said ‘hello'” vibe.


  8. Bryce says:

    Do we have a linear-weights-based metric for pitchers, and if not, why not? The discrepancy between hitter and pitcher WAR has long bothered me, as have the extremes of FIP and RA. Couldn’t we use RE24 to make a pitching metric?

    Sure, pitchers aren’t in control of whether the batter hits a line-out or a double, but neither are batters, yet we give them credit based on outcomes. Also, yes sequencing matters for pitchers, but wouldn’t it be informative to assume independence of at bats without completely punting on outcomes other than walks, strikeouts, and home runs (or fly balls for xFIP)?


  9. Xeifrank says:

    Both ways have limitations. The tools to measure a fair RA9 and the hitters equivalent have not been invented yet or are in their infancy.


  10. Bonertron5000 says:

    What would a wOBA-based pitching WAR look like? That way you could neutralize the sequencing of allowed runs, while also measuring “what really happened,” right?


  11. DJG says:

    But my biggest problem with FIP-based WAR isn’t that it doesn’t capture “what happened”, it’s that it doesn’t give enough credit to pitchers who can consistently induce “likelier out” type contact. As best I can tell this is a legit skill.


    • Dan Farnsworth says:

      I think this is less of an issue for teams, now that they presumably have access to Hit F/X data. In the absence of detailed qualities of batted balls (angle, velocity, spin, etc.), I think including contact data in pitching metrics is pretty misleading. The current classifications are too subjective.


  12. Mac says:

    It seems like the demon we’re really trying to chase down with all of this is luck.

    Here would be something fun to check: of all historical innings that featured back to back single – home run or home run – single, which happens more often?

    The intuitive first guess would be that these events occur equally often at large enough sample sizes. If that’s true at large sample sizes, then any year where a player (hitter or pitcher) is having more events of the better sequence (1B then HR) is just luck.

    Assuming the above is true (and for the above example, it might not be), the fundamental question then becomes: how much credit do you assign a player for his good or bad fortune?

    Dave mentioned that there are multiple stories to tell with statistics, and I view this as a debate between stats explaining what did happen and stats explaining what was most likely to have happened. The difference between the two is luck, and a major unsolved sabermetric problem is determining to what extent luck influences the game.


    • B N says:

      I would expect single-HR to be less likely than the converse. After you have runners on base, you should be trying to avoid HR by altering your pitching slightly (if you have that capability). I would expect that not all pitchers have this skill. I would also expect context like “bases loaded” to be worse for pitcher outcomes, because both hits and walks lead to runs.

      This type of bias would lead to noise for pitchers who had different sequences of events, but would be relatively “fair” (i.e., unbiased over long periods) if pitchers had no differences in their ability to change their pitching style to avoid HR or other specific outcomes. However, I’m relatively certain that some can and do. If I had to guess, these types of pitchers would have more control or pitches that are very hard to hit for certain outcomes.

      • Mac says:

        See, I’m just not sure about that. Especially single-HR. Over the course of baseball history? Does one base-runner influence a pitcher to try and avoid long balls more? There’s the whole pitching from the stretch difference to account for, but beyond that I’m not making a call one way or the other until I see the data.

        How does one research sequencing?
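
One way to approach the sequencing question raised in this thread is to count adjacent event pairs inside each half-inning of play-by-play data. The sketch below is only an illustration: the `sample_innings` list is made-up data, and the event codes ("1B", "HR", and so on) and the function name are hypothetical, not any particular data provider’s format.

```python
def count_adjacent_pairs(innings, first, second):
    """Count how often `first` is immediately followed by `second` within an inning."""
    total = 0
    for events in innings:
        # zip pairs each plate appearance with the one right after it
        for a, b in zip(events, events[1:]):
            if a == first and b == second:
                total += 1
    return total


# Hypothetical play-by-play: each inner list is one half-inning, in order.
sample_innings = [
    ["1B", "HR", "K", "GO", "FO"],
    ["HR", "1B", "K", "K", "GO"],
    ["K", "1B", "HR", "FO", "GO"],
    ["GO", "FO", "K"],
]

single_then_hr = count_adjacent_pairs(sample_innings, "1B", "HR")
hr_then_single = count_adjacent_pairs(sample_innings, "HR", "1B")
print(single_then_hr, hr_then_single)
```

Run over a full historical event file, the two counts would show whether one ordering is more common than the other. Attributing any gap to pitcher behaviour rather than chance would still need a significance test.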

  13. Brian L says:

    The hitter v pitcher valuation debate has two separate pieces: (i) assigning credit for events, and (ii) including context / sequencing.

    When deciding how they prefer to evaluate pitchers, I think most people stop at (i) above (I know I have), i.e. they’re choosing RA9 or FIP-based WAR based on their choice of which extreme they prefer as far as the defense / pitching relationship. Hitting does not require this choice, as there’s no one for hitters to split the credit with, which is why the debate doesn’t occur with hitters.

    You’re right to highlight an overlooked distinction in the treatment of context in these situations, and to alert viewers that their choice in stats inherently includes making choices about valuing context. Maybe this will come more to the forefront, but as of now I think people are more focused simply on the assignment-of-credit piece.

  14. MrKnowNothing says:

    I kind of tend to just look at everything and get a feel for who is and isn’t good. If a guy has an FIP and ERA that are in sync, I think that either RA or FIP based WAR is appropriate. If there’s a wild divergence between the two, I’d like to know if it’s a one-year thing or a Matt Cain-like habit.

  15. ralph says:

    How come GIDPs count against WAR but RBI/advancing runners out of GIDP situations doesn’t help WAR?

    For instance, I’m thinking hitting 10 HR with a runner on 1B with no outs could maybe offset 1 GIDP? Even hitting doubles in that situation to make it so men are on 2nd and 3rd with no outs seems like it could add to that kind of offset.

    • ralph says:

      I guess the answer would be that hits already have runner advancement figured into their linear weight values.

      But then it just seems weird to use GIDP in WAR since it penalizes people for having better OBP guys ahead of them in the lineup (or rewards those who have worse OBP guys ahead of them/lead off more innings), without allowing that effect to be offset.

      • ralph says:

        And yes, to further this conversation with myself, speed helps reduce GIDP rate given equal opportunities, but speed also increases the ability to reach on error. So it seems at the very least that if GIDP is in WAR, so should ROE.

        But the lineup effect on GIDP does remain a concern of mine regardless.

  16. DP says:

    What would be wrong with combining FIP-based WAR and RA9-based WAR with some constants into a unified formula? Is that not the best of both worlds?

    • ralph says:

      Yes, but that constant is basically the Holy Grail, to the point where it might not even be possible to discern because every pitcher’s context is different (i.e. it’s possible that one pitcher is 30% responsible for balls in play, but another one is 50% responsible).

      • Bip says:

        I don’t see why pitchers would have different levels of control over their BABIP. A pitcher may have a BABIP farther from the average than another, but that doesn’t imply that pitcher has a greater effect on his BABIP. It just means his effect on his BABIP is more dramatic.

  17. Blueyays says:

    I think the example of Jim Palmer as a pitcher who supposedly had an ability to pitch to an ERA lower than his FIP is worrisome, and reveals a mistake that is often made with this. When a pitcher outperforms his peripherals for several years in a row, it has become common to hear people declaring that he has some skill that those peripherals, whether FIP, SIERA or otherwise, cannot pick up, and that he should thereby be judged by his ERA. However, this ignores the largest reason that DIPS stats exist: defence. Rather than a consistent ability on the part of the pitcher, this ERA-FIP difference may often be showing the consistent excellence of the defence behind him.
    Take Palmer, for example. In his career, he posted a 3.50 FIP and a 2.86 ERA. His ERA was lower than his FIP in every year except his rookie season and his negligibly small comeback in 1984. This is often taken by supporters of R/9 WAR, as in this article, to be showing that he clearly had a consistent ability to outperform his FIP. Yet when one looks at the defence playing behind him in Baltimore, one finds: Brooks Robinson, Mark Belanger, Bobby Grich, Paul Blair and the like. Clearly, it was Baltimore’s defence, rather than a skill of Palmer, that, at least partially, contributed to his supposed “ability” to outperform his FIP.

    • Ruki Motomiya says:

      What about guys like Kershaw and Cain, who have not consistently had great defenses behind them?

      What about all the other Orioles pitchers? I have not looked, but I would think that if it is because of the defense, ALL (or at least more than Palmer) pitchers would have a notable ERA-under-FIP from this wondrous defense behind them.

      What about a guy like Greg Maddux, who played for 2-3 teams (depending on whether you count his late-career years with the Padres as indicative of his “true” skill level) over more than 20 years and had only 3 years with a .300+ BABIP, all with different teams?

      And why wouldn’t this Jim Palmer effect show up on other teams with great defense?

      • Blueyays says:

        Matt Cain had his first full season in 2006. Since then, the San Francisco Giants are 1st in all of Major League Baseball in UZR. Kershaw is an interesting case, as the Dodgers have been average at best defensively over his career, though they’re good now.

        It does appear that other Oriole pitchers also had a lower-than-FIP ERA during this time. Mike Cuellar and Steve Barber significantly outperformed their FIPs while with Baltimore in the 60s-70s, as did Dave McNally and Tom Phoebus, though not by as much.

        And as for Maddux, his career ERA is 3.16 to a 3.26 FIP. Well within the norm.

        • B N says:

          It’s not that hard to tease out the difference though. You can just look at the impact of considering the UZR or DRS of their defenders during the games that they played. If there is still a significant difference, then it’s the pitchers’ impact.

          I think this was done with Matt Cain at one point and it found that he did have a significantly better ERA than you would expect from FIP, even controlling for defense and home park. The skills obviously exist. They may not be huge (+/- 0.1 to 0.2 ERA), but they offer a significant edge.

        • Maddux’s FIP is close to his ERA because he was very poor at holding runners and was worse with RISP. His very good BABIP is what canceled out those two things.

        • Blueyays says:

          B N: If they found that Matt Cain has a +/- 0.1 to 0.2 ability to lower his ERA relative to his DIPS, then that might be a sign that he doesn’t really have that ability at all.
          Since 1973, when the Designated Hitter was adopted, the AL’s ERA has been higher than its FIP in only 9 out of 41 seasons. In the NL, however, that number is 28/41. In my opinion, this is because, for the BABIP (non-K, BB or HR) part of FIP, it treats all pitchers as allowing equal BABIPs. This obviously assumes that the pitcher has no control over the BABIP that he allows, but it also assumes that the hitters that he is facing have equally good BABIPs. Yet, since I think it is generally agreed that hitters do have some degree of control over their BABIP, I think it stands to reason that better hitters will have slightly higher BABIPs than worse hitters, on the whole, and therefore that pitchers who face better hitters will tend to have slightly higher BABIPs, and therefore higher ERAs relative to DIPS, than pitchers who face worse hitters. The National League, with pitchers hitting, obviously has a worse set of hitters than the American League, and therefore NL pitchers would be expected to have slightly lower ERAs than FIPs – and, as those statistics show, that is indeed the case.
          Therefore, (finally getting to the point), I think another thing causing Cain’s ERA to go down is the fact that he pitches in the NL. Looking at the league statistics, this difference is probably quite small – somewhere around 0.05-0.10 ERA – but that would lower the amount by which he can supposedly lower his ERA to somewhere around 0.10 – definitely small enough, I think, that it can be attributed to luck, especially considering that he has suddenly flipped around this year.

    • Matthew Cornwell says:

      Look at Tom Tango’s study. There are far more players who have good and bad BABIP compared to teammates over history than random luck would predict. It is no longer a question whether pitchers have an impact on BABIP. It just takes a large sample to find out how much an individual pitcher has. Most do not have a lot, but many have some; it just takes 6-7 years to know for sure.

      • Blueyays says:

        Oh yeah, there’s no doubt that pitchers have an impact on BABIP. That’s why, for 2002 – present, SIERA is a much better tool than FIP, because it recognizes that and gives pitchers credit for having control over their BABIP, through what it does with GB/FB rates. Cain, however, had significantly lower BABIPs than even his SIERA suggested he should’ve, and that was the result of the San Francisco defence.

        • Matthew Cornwell says:

          SIERA, etc. make a lot of batted ball assumptions. Extreme groundballers have much better BABIP on GBs than flyball pitchers and vice versa, but SIERA treats all GB pitchers the same. According to SIERA, since guys like Hudson and Maddux are extreme GBers, they should have far worse than average BABIP, which they clearly do not, with a huge sample size and a large variety of good and poor defenses behind them. I know we have to make metrics that reflect the majority, but if we can avoid one-size-fits-all, assumption-filled metrics, it would be better. That is just hard to do, as we all know.

        • Matthew Cornwell says:

          To BN – I am pretty sure we can find Cain’s BABIP compared to his mates, and then regress to find his true BABIP skill.
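
A rough sketch of the “compare to teammates, then regress” idea might look like the following. The shrinkage form `n / (n + k)` is a standard regression-toward-the-mean device, but the value of `regression_bip` used here (2000 balls in play) is purely a placeholder assumption, and the input numbers are invented, not Cain’s actual figures.

```python
def regressed_babip_skill(pitcher_babip, teammates_babip, balls_in_play,
                          regression_bip=2000.0):
    """Estimate BABIP skill relative to teammates, shrunk toward zero.

    regression_bip is a hypothetical constant: the number of balls in play
    at which half of the observed gap is treated as real.
    """
    observed_gap = pitcher_babip - teammates_babip
    shrink = balls_in_play / (balls_in_play + regression_bip)
    return observed_gap * shrink


# Invented example: a pitcher 20 points better than his teammates over 2000 BIP.
estimate = regressed_babip_skill(0.270, 0.290, 2000)
print(round(estimate, 4))
```

Comparing against teammates holds the defense and home park roughly constant, which is why the gap is measured against them and not against the league.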

        • Blueyays says:

          Matthew: http://www.fangraphs.com/blogs/new-siera-part-two-of-five-unlocking-underrated-pitching-skills/

          If you read this, Swartz shows quite clearly that, in fact, SIERA does NOT treat all GB pitchers the same – it realizes that extreme groundball pitchers (the example they give is Brandon Webb) will have lower BABIPs on grounders. It does exactly what you say it should.

        • Matthew Cornwell says:

          I stand corrected on SIERA – I did not think it corrected for GB and FB extremes.

    • Matthew Cornwell says:

      According to TZ, at least, the Orioles defense can only account for about half of Palmer’s hits prevented on BIP.

  18. salo says:

    context is not the only difference between RA9 and FIP. why not use linear weights to evaluate the outcome of each AB a pitcher faces? it wouldn’t be defense context neutral but it would be situation context neutral.

  19. cass says:

    I feel like you are conflating two different things, Dave. One is responsibility for the results of balls in play and the second is context and sequencing.

    FIP-WAR not only removes context and sequencing from the pitcher’s evaluation, but also any influence they might have had on the results of balls in play. These are, indeed, separated out in the Fielding-Dependent pitching metrics on this site.

    Linear weights WAR for batters, on the other hand, is similar to FIP-WAR in that it does not include context but dissimilar in that it gives batters all of the credit for the results of balls in play. If you wanted to make a FIP-like version of WAR for batters, it would count only strikeouts, walks, and home runs and treat all balls in play as equal. No one uses that because we think that batters have the lion’s share of responsibility for balls in play, but they most certainly do not have 100%. Fielders and pitchers also have a piece. And yet by using these two metrics, we assume they have zero.

    I see a matrix of different WARs:

               +Context+BIP   +Context-BIP   -Context+BIP   -Context-BIP
    Hitter     RE24           xxx            fWAR           xxx
    Pitcher    RA9-WAR        xxx            xxx            fWAR

    • salo says:

      yes, this is what I was referring to in my comment above. A metric that ignores sequencing but is defense-dependent. That’s neither FIP-WAR nor RA9-WAR.

      cheers

  20. DrFarmer says:

    Don’t we just need to know the horizontal angle, vertical angle, and velocity of the batted ball to be able to account for the vast majority of pitcher skill beyond BB and K? This information can’t be far off the horizon.

    • Xeifrank says:

      This is along the lines of my earlier comment that the tools needed for measuring the things in this post have either not been invented yet or are in their infancy. We are cavemen measuring things with sticks and rocks when we haven’t invented the tape measure yet.

    • cass says:

      These are being measured, but we don’t have access to the data. Teams do, but they have to pay for it. Fans are completely out of the loop these days, unfortunately.

  21. Brendan says:

    Pitcher WAR has always irritated me. Typically I’ve looked up multiple kinds, then looked up the specific stats and painted my own mental picture of where someone was. I honestly don’t ever really trust one for a single season.

    Neither way feels right to me for a single season’s worth of data. I don’t like not including the fact that pitchers have shown some control over the quality of contact. But I don’t like the idea of crediting them with complete control of sequencing either. It’s definitely somewhere in between the two. Ultimately I suppose I would rather start with what we KNOW pitchers are in control of (FIP). I think I’d rather have something incomplete and then discuss what should be filled in to complete it, than to have something with noise and have to figure out what noise to remove.

    • Blueyays says:

      I totally agree with that last point about preferring to start with what we’re sure of than to include things we aren’t sure about. And on that, I think SIERA is a natural improvement to FIP: continue with Ks and BBs, then add batted balls, since we know now that pitchers have control over that, and then still leave out the other stuff that we aren’t sure of.

  22. Bip says:

    I’ve been chastised for bringing this up in the wrong forum, so I’ll try again in this, the correct forum.

    Dave’s point in the article is that people often employ contradictory standards when evaluating pitchers and hitters. It’s a valid point, and while I tend to sympathize with the argument that such standards are appropriate – given that pitchers hold more responsibility for their own context than do hitters – it is still a point that needs to be made.

    Similarly, there is a point that needs to be made about the very basis for our context-independent pitching WAR. As many have pointed out, FIP isn’t just context-independent, it is also supposed to be luck-neutral. By excluding BIP, it is implicitly assuming the pitcher allowed hits on BIP at a league-average rate, excluding both the effect of a pitcher’s defense and the luck associated with batted ball distribution.* However, if variation in HR/FB is essentially a function of luck as well, why isn’t that rate also factored out of the equation?

    The question FIP is answering is “how well did the pitcher perform at the elements of pitching he is able to control?” If he is not able to control how many of his FB go for homers, why are we counting them against him? What is the difference between penalizing a pitcher when a fly ball goes over the fence and penalizing him for a blooper that drops in for a hit?**

    In lieu of absolute knowledge of the degree to which pitchers control BABIP and sequencing, it makes sense to maintain two measures – one which gives pitchers all credit for those and one which gives them none. However, FIP doesn’t give them none, it is a strange half-measure which excludes luck in one case and not another.

    *While BABIP is not part of the FIP equation, mathematically its calculation is the same as if BABIP were included and were the same for every pitcher. FIP is normalized using a constant, to make it analogous to ERA. Doing so is just like assuming a constant BABIP for all pitchers and including it in that constant. Furthermore, a relative measure like WAR necessarily factors out constant values, which means that league-average BABIP might as well be an assumption of FIP.

    **Obviously defense is one difference, but there is still a lot of luck involved in whether that blooper fell for a hit. Either way, whether the effect is caused by luck or by defense, we want to exclude it.

    • Blueyays says:

      “FIP doesn’t give them none, it is a strange half-measure which excludes luck in one case and not another.” What do you mean here? I’m not sure I understand. Where does it include and where does it exclude luck?

      • Bip says:

        -It excludes luck from BIP. If a pitcher allows more hits than expected because he was unlucky, and batted balls just happened to find holes, that will not affect his FIP
        -It includes luck from HR/FB rate. If a greater-than-expected number of fly balls allowed by the pitcher go over the fence for homers, then that will count against his FIP.

        I thought the point of xFIP was that players don’t have much control over what percentage of their flyballs will go for homers.
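
To make the FIP versus xFIP distinction concrete, here is a small sketch using the commonly published form of FIP (which also counts hit batters alongside walks). The pitcher line, the league HR/FB rate of 0.10, and the constant of 3.10 are all invented inputs for illustration; the real constant and league rate vary by season.

```python
def fip(hr, bb, hbp, k, ip, constant):
    """FIP in its commonly published form; `constant` scales it to league ERA."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fly_balls, lg_hr_per_fb, bb, hbp, k, ip, constant):
    """xFIP: FIP with actual home runs replaced by fly balls * league HR/FB."""
    expected_hr = fly_balls * lg_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Hypothetical pitcher season and league values (illustrative, not real data).
line = dict(bb=50, hbp=5, k=180, ip=200.0, constant=3.10)
pitcher_fip = fip(hr=30, **line)                                # 30 HR allowed
pitcher_xfip = xfip(fly_balls=200, lg_hr_per_fb=0.10, **line)   # 20 HR expected
print(round(pitcher_fip, 3), round(pitcher_xfip, 3))
```

In this made-up case the pitcher allowed 30 homers on 200 fly balls where a league-average rate would predict 20, so FIP charges him for the extra ten while xFIP does not. That gap is exactly the HR/FB “luck” being discussed.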

        • Blueyays says:

          Ahh, I see what you mean. For one thing, FIP includes luck from HR/FB rate because, before 2002, we don’t have batted ball data, so we just know HRs and not HR/FB rate, and we need some way to judge pitchers from before ’02.

          But yeah, for current players, the inclusion of HR/FB luck is a major weakness of FIP, for sure, and other stats, like xFIP (or SIERA, which is better) are probably better choices than FIP, which is probably mostly just useful for historical players now.

  23. DavidKB says:

    I have really enjoyed this series of articles. However, I think the arguments in this article went slightly off course. You are careful to say that one may “prefer” either context-specific or non-context measures and be equally correct. I don’t think that’s actually the case: one approach is likely a better measure of a player’s contribution than the other, depending upon whether the context contributes signal or just noise.

    To elaborate a little, if it is true (as many have argued) that batters are not able to “overperform” in important situations, but rather bring the best they have to all plate appearances, then using COBRA is just injecting noise into the numbers. A better WAR would be gotten using non-context numbers. However, if it can be shown that some batters *do* perform differently in different contexts, then COBRA may be more valuable. (As a side note, a more effective statistical tool may be a context-neutral value with a second order perturbation for context performance, given that context is likely to be a small effect.)

    As for pitchers, the use of FIP statistics isn’t meant to remove the RE24-based context. Rather, it is meant to isolate the pitcher’s contribution from that of the position players. I completely agree that there is an argument to be made that either BOTH pitchers and batters should be measured using fielding-independent statistics, or neither should. But that is a different matter than whether they should be evaluated by treating all PAs as equal or using RE24.

    I say all this still feeling that COBRA/RE24 is by far the most aesthetic measurement. I’m just worried it might not be the best one.

  24. John Sterling says:

    There’s a persistent bias in favor of sluggers when using RE24 (REW) instead of WPA, but it’s not that big. So I think the lack of context for hitters is slightly wrong for that reason, but it’s marginal, and other than that, there’s pretty good evidence that there’s no true-talent clutch hitting effect bigger than +/-1 or +/-2 at the absolute most points of wRC+. So it’s basically fine to use context-neutral because there’s minimal talent involved in the context-specific and context-neutral differences.

    It’s clear that pitchers, themselves, have a significant influence on their FIP-ERA differences and two obvious reasons are that they vary in how well they control the running game and in how well they field their position (AFAICT, a pitcher’s fielding doesn’t go into fWAR). And that’s before you even get to any possible ability to pitch to the gamestate. So you have, at the very least, sequencing skill, sequencing luck, running control, their defense, and other defense contributing to FIP-ERA differences, and those components range from 100% their responsibility to 0% their responsibility. Both extremes are clearly wrong for some of the components. The answer is somewhere in the middle, while for hitters it’s really, really close to context-neutral. So I don’t think there’s any inherent problem or logical inconsistency with using different approaches for pitchers and hitters.

  25. Tim says:

    Most people I talk to want fewer WARs, not more.

    If the US government can’t manage this I fail to see why you should be expected to.

  26. WAR says:

    Should I be worried?

  27. Look what I found

    http://tangotiger.com/index.php/site/article/at-what-point-does-ra9-become-a-better-indicator-of-pitching-talent-than-fi

    Looks like it takes about six 200 IP seasons for RA9 to be more reliable than FIP. Which makes sense, since the r=.5 mark for BABIP, HR/FB, and sequencing all take around that many years too.

    What does this mean in practical terms? Use a FIP based WAR for the first 5-6 seasons of a career, and a RAWAR after that point.
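
The hard switch suggested above could also be softened into a sliding blend. The sketch below is one possible reading, not something the linked study prescribes: it assumes the weight on RA9-based WAR grows as `IP / (IP + 1200)`, so the two versions are weighted equally at roughly six 200 IP seasons. The function names and the `half_weight_ip` parameter are hypothetical.

```python
def ra9_weight(innings_pitched, half_weight_ip=1200.0):
    """Weight on RA9-based WAR; reaches 0.5 at `half_weight_ip` career innings."""
    return innings_pitched / (innings_pitched + half_weight_ip)


def blended_war(fip_war, ra9_war, innings_pitched, half_weight_ip=1200.0):
    """Blend career FIP-based and RA9-based WAR, leaning on RA9 as innings grow."""
    w = ra9_weight(innings_pitched, half_weight_ip)
    return w * ra9_war + (1.0 - w) * fip_war


# Invented career totals: 20 FIP-WAR and 30 RA9-WAR over 1200 innings.
print(blended_war(20.0, 30.0, 1200))
```

A rookie with 200 innings would get a weight of about 0.14 on RA9 under these assumptions, while a 3000-inning veteran would get about 0.71.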

  28. Word says:

    This is one of the best pieces I’ve ever read here at FanGraphs.

  29. Charlie says:

    You are confusing stabilization with projecting future value. While it is possible to have a general idea of the outcome when given peripherals like BABIP, HR/FB, etc as one makes an educated guess about future performance, stabilization data is very much a hindsight way of evaluation. There are variables we cannot account for, where at any given moment players can adopt different skill-sets and thus, talent level changes.

    So, while RA9 may be a better indicator than FIP over that larger 5-year sample, it’s not meant to project future performance.

  30. ksclacktc says:

    Just to weigh in with my 2 cents or 36$ and change. I would like to see 2 different WAR ratings, 1 predictive(future)-WARf and 1 descriptive(actual)-WARa. The uses of both are obvious. Who was the MVP? descriptive. Who is the more talented and going forward who do I want? predictive.

    I also feel that being able to add up descriptive (WARa) to the team totals has value when analyzing the story of what happened. We then have a much clearer picture of luck versus talent at the team level.

    My vote would be to replace context neutral batting runs with RE24 or some version without baserunning. And, create a metric called WARa for actual WAR.

  31. Mike Green says:

    First off, you have to treat starting pitchers and relief pitchers differently. In the case of relief pitchers, using runs allowed is extremely problematic because there are very significant sequencing/run attribution issues between pitchers in mid-inning and any anomalies are magnified over small samples. FIP is better, but WPA factors are a significant issue. The reliever who comes on in the ninth with a two run lead (or more) ought to give up disproportionately more home runs (and fewer walks and singles) with nobody on. FIP punishes that.

    As for starting pitchers, some cocktail of RA and FIP is obviously preferable with the difference between the cocktail and actual RA attributable to team defence and hopefully consistent with team defensive numbers. The tricky point is the uncertainty about the mixture and what ought to be a leaning toward RA as years pass.

    It might be fun to have one and three year metrics (for players who have been in the league that long). At three years, the defensive metrics have more reliability and the (starting) pitcher WAR measures can safely place most of the weight on runs allowed.

  32. WhyOWhy says:

    I don’t understand why FIP uses Innings Pitched, which are a fielding-dependent input if ever there was one. A better one would be (SO + BIP) * factor, where BIP = “Balls In Play”, and “factor” is the fraction of non-HR balls in play that the league turns into outs.
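
The proposed replacement for Innings Pitched can be sketched as follows. The first function implements the formula exactly as written in the comment; the second is an alternative reading (an assumption here, not the commenter’s wording) that applies the league out rate only to balls in play, since a strikeout is already an out. Dividing by three to turn outs into innings is also an added assumption, and the inputs are invented.

```python
def expected_outs_as_written(so, bip, factor):
    """(SO + BIP) * factor, exactly as the comment states it."""
    return (so + bip) * factor


def expected_outs_bip_only(so, bip, factor):
    """Alternative reading: strikeouts count fully, only BIP are scaled."""
    return so + bip * factor


def outs_to_innings(outs):
    """Convert outs to innings (three outs per inning)."""
    return outs / 3.0


# Invented season: 180 strikeouts, 600 non-HR balls in play, 70% league out rate.
so, bip, factor = 180, 600, 0.70
innings_as_written = outs_to_innings(expected_outs_as_written(so, bip, factor))
innings_bip_only = outs_to_innings(expected_outs_bip_only(so, bip, factor))
print(round(innings_as_written, 1), round(innings_bip_only, 1))
```

Either version could then stand in for IP in the denominator of a FIP-style rate, removing the defense’s role in how quickly a pitcher records outs.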

  33. ecocd says:

    We have UZR and all the detail it entails. Why don’t we have a UZR-related pitching metric? % of outs made in each zone where a ball lands including park factors and then assigning a number of “expected hits” for a pitcher summarized as an “expected WHIP”. Is there an xH or xWHIP measure out there for starting pitchers? It would require a lot of data to stabilize, but over the course of an entire season a starting pitcher generates a lot of data.
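
As a toy illustration of the “expected hits” idea, the sketch below sums, over a pitcher’s batted balls, the league probability that a ball landing in that zone becomes a hit. The zone names and out rates are invented, park factors are left out for brevity, and the names `expected_hits` and `expected_whip` are hypothetical, not an existing FanGraphs stat.

```python
def expected_hits(batted_ball_zones, league_out_rate):
    """Sum of league hit probabilities (1 - out rate) for each batted ball's zone."""
    return sum(1.0 - league_out_rate[zone] for zone in batted_ball_zones)


def expected_whip(batted_ball_zones, league_out_rate, walks, innings):
    """Expected WHIP: (walks + expected hits) per inning."""
    return (walks + expected_hits(batted_ball_zones, league_out_rate)) / innings


# Invented league out rates by landing zone.
league_out_rate = {"IF-left": 0.75, "IF-right": 0.75, "OF-gap": 0.50, "OF-deep": 0.25}

# Invented outing: four balls in play, one walk, two innings.
zones = ["IF-left", "IF-left", "OF-gap", "OF-deep"]
x_hits = expected_hits(zones, league_out_rate)
x_whip = expected_whip(zones, league_out_rate, walks=1, innings=2.0)
print(x_hits, x_whip)
```

A fuller version would need the same zone-level data that feeds UZR, plus some handling of home runs and park effects.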
