Why Our Pitcher WAR Uses FIP, Part Two

This post builds on the one I wrote a few hours ago, so I’d encourage you to read that if you haven’t yet. If you really don’t want to follow the link, this is the paragraph where we’re picking up from:

In the end, we had to choose between two different methods – assuming that the pitcher had no responsibility for the outcome of a ball in play, or attempting to approximate the amount of time that the result was due to the pitcher or the fielder. Ideally, we’d be able to do the latter – which is how Sean approaches it – but I just don’t think we currently have the tools available to make an accurate enough judgment on how to apportion that responsibility.

Clearly, some hits on balls in play are the “fault” of the pitcher. He throws a fastball down the middle in a 3-1 count and the hitter whacks it for a double in the gap – that’s on him, certainly. However, most hits are not of that variety. Instead, they’re ground balls in between two defenders or fly balls that fall near a chasing outfielder before he can get to it. In those instances, we don’t really know how much responsibility for the hit should go to the pitcher and how much to the fielder. Would Elvis Andrus have gotten to that grounder up the middle that Yuniesky Betancourt didn’t get close to? Maybe, maybe not. Did Carl Crawford run down a shallow popup that Juan Rivera would have had to pick up on the third bounce? Perhaps. We don’t have the luxury of having a control group for each ball in play. All we know is whether the guy who happened to be the defender on duty at the time was able to make the play or not.

So, what do we do with a specific pitcher’s results on balls in play? This was the thing that I wrestled with the most while we were designing WAR for pitchers a few years ago. I can see an argument for doing it in one of two ways, though I think both have problems.

1. FIP-based WAR, which is what we ended up using, essentially admits that we don’t have enough information about dividing responsibility for the results of balls in play, and so it ignores them.

2. RA-based WAR, which is what Sean ended up using, attempts to adjust for defensive contribution by taking a team’s overall Total Zone rating and assigning an expected defensive debit or credit to each pitcher based on how the team performed on the season as a whole.

I get why Sean did it the way he did it, and I understand why there are people who prefer that path. It appeals to our inherent sense of runs allowed being a record of what actually happened, and presents the possibility of achieving the ultimate goal – a pitcher’s total contribution to run prevention with the effects of his teammates factored out. The problem, though, is a pretty big one, and the one that caused me to lean away from RA-based WAR for our purposes here. It assumes that the distribution of defensive performance was even for each pitcher on every team, which is quite obviously not going to be true. Simply put, it is not a record of what actually happened – it is an assumption of what might have happened if all defenders on a team were evenly skilled and were perfectly consistent on a day-to-day basis.
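
To make that even-distribution assumption concrete, here is a minimal sketch of how a team-level defensive rating could be spread across a staff. The proportional-to-balls-in-play rule, the function name `even_defense_credit`, and every number below are assumptions for illustration; this is not Sean's actual implementation.

```python
# Hypothetical illustration: the proportional rule and all inputs are assumptions.
def even_defense_credit(team_defense_runs, pitcher_bip, team_bip):
    """Runs of defensive support assigned to one pitcher, assuming the
    defense played equally well behind every pitcher on the staff."""
    if team_bip <= 0:
        raise ValueError("team_bip must be positive")
    return team_defense_runs * pitcher_bip / team_bip


team_defense_runs = 30.0  # hypothetical season rating: defense saved 30 runs
team_bip = 4400           # hypothetical team balls in play
staff_bip = {"Pitcher A": 600, "Pitcher B": 550}  # hypothetical pitchers

for name, bip in staff_bip.items():
    credit = even_defense_credit(team_defense_runs, bip, team_bip)
    print(f"{name}: {credit:+.1f} runs of assumed defensive support")
```

Under this rule, two pitchers with the same number of balls in play always get the same adjustment, no matter how the fielders actually played behind each of them.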

We can simply look at the distribution of run support for a pitcher on any given team to see that the assumption of even performance is not going to be true. If we use the Yankees rotation as an example, we see that the Yankees averaged 5.30 runs per game this year. Their distribution by starting pitcher is below:

CC Sabathia – 5.89 R/G
A.J. Burnett – 4.29 R/G
Phil Hughes – 6.75 R/G
Andy Pettitte – 6.00 R/G
Javier Vazquez – 4.12 R/G

No pitcher is actually within half a run of the team average. Burnett and Vazquez are over a run per game lower than the overall total, while Pettitte is nearly three quarters of a run per game higher and Hughes is a run and a half per game higher. If you built a metric that worked off the assumption that the Yankees offense scored the same number of runs per game when Vazquez was on the mound as when Hughes was on the mound, you’d probably draw some pretty inaccurate conclusions. There is no reason to think that defensive performance is any more consistent on a day-to-day basis. If anything, there are reasons to believe that it would vary even more than offense.
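
The gaps quoted above can be checked with a few lines of Python. This is only a sketch using the five run-support figures and the team average listed in this post; the variable names are illustrative.

```python
# Run support per start (R/G) as listed in the post, and the team average.
team_avg = 5.30
run_support = {
    "CC Sabathia": 5.89,
    "A.J. Burnett": 4.29,
    "Phil Hughes": 6.75,
    "Andy Pettitte": 6.00,
    "Javier Vazquez": 4.12,
}

# Difference between each starter's run support and the team average.
deviations = {name: round(rs - team_avg, 2) for name, rs in run_support.items()}

for name, diff in deviations.items():
    print(f"{name:15s} {diff:+.2f} R/G vs. team average")

# Gap between the best-supported and worst-supported starter.
spread = max(run_support.values()) - min(run_support.values())
print(f"Spread between extremes: {spread:.2f} R/G")
```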

In general, a team will run out a similar line-up on a day-to-day basis, and each guy will get about the same number of plate appearances per day, as required by having batters take turns in order. That boundary does not hold with defenders, however. There is no rule that says each player on the field gets an equal number of opportunities each day. In fact, given that pitchers have different tendencies in terms of groundball and flyball rates, it’s nearly guaranteed that the defensive opportunities will not be equal between pitchers.

Using aggregate stats from a team’s entire season simply won’t give you the kind of detail needed to accurately determine the quality of defense that was played behind a given pitcher in a season. Doing pitcher WAR that way provides an end result that is not an accounting of what actually happened.

Since neither method gets us to that goal of accurate accounting, we’re left with a choice of two paths, both with structural problems that can’t be avoided based on the data we currently have access to. Personally, I prefer FIP-based WAR because it is easier to adjust for what we know is not included – BABIP and sequencing, essentially – than it is to take a defense-adjusted RA-based WAR and make adjustments for places where the assumption of defensive distribution equality does not hold.

Let’s use Francisco Liriano and Cliff Lee as examples. Liriano’s RA results don’t match his FIP in large part because he has a .340 batting average on balls in play. Since our version of WAR doesn’t hold that against him, he comes out looking really good. An RA-based version not only holds his actual BABIP against him (by starting with runs allowed), but it then penalizes him further because the Twins have an above-average defense, and the assumption is that he got proportionate help from the guys behind him.

What is more likely – that Liriano gave up contacted balls that should have resulted in a .350 to .360 BABIP, and the good-glove Twins helped bring that down to .340, or that the guys behind him didn’t make as many plays for him as they did when Carl Pavano or Brian Duensing was pitching? Considering that he posted a basically league average 19.1% line drive rate, I’m more inclined to believe that the latter is closer to the truth. We don’t know exactly what kind of defensive support Liriano got this year, but based on what we know about a pitcher’s control over BABIP, I think we’re better off assuming that there were some issues behind him that hurt him than we are assuming that the Twins defense supported him as well as they supported his fellow pitchers.

Lee’s case shows the other side of the coin that FIP ignores – when those hits occur. While he has a normal .302 batting average on balls in play, it is not evenly distributed across the base-out states. His BABIP is just .257 with the bases empty, but jumps to .350 with men on base, and is .333 with runners in scoring position. Because of that split in when his balls are being turned into outs, he has a LOB% of just 67.9%, well below average and far below what pitchers of his quality have posted this year.

For Lee, it hasn’t been a problem of too many balls finding holes, but simply of those balls finding holes at the wrong times. It’s possible that those hits were a result of poor location, but given that he ran a 10.00 K/BB ratio with men in scoring position, it doesn’t seem like Lee suddenly lost his command when men got on base this year. Maybe he did – I don’t know. But should we assume that a pitcher’s BABIP with RISP is under his control? Since that’s really the driving force of Lee’s ERA this year, we have to acknowledge that it is certainly possible that his defenders let him down in those critical situations.

It is also possible that he let himself down. We just don’t really know who is at fault – pitcher or defenders. FIP blames BABIP entirely on defense. That’s definitely wrong. Defense-adjusted RA assumes that each pitcher got the same support from their teammates. That is also definitely wrong.

So, we’re left with two imperfect options. Which should you prefer? I can’t answer that for you. They both have strengths and weaknesses, and both are valid attempts to answer the question that we’re really trying to get at. I prefer the FIP-based implementation because it’s easier to make mental adjustments from that number, knowing that BABIP and sequencing are not included, than it is to try and back out of a metric that is already attempting to account for defensive support and find out where it might have missed the mark, but that’s a personal preference more than it’s a right or wrong thing.

WAR is not perfect, and it’s less perfect for pitchers than it is for hitters. Separating out defense from pitching is hard, and we don’t have it all figured out yet. We don’t encourage you to use any version of WAR as the be-all, end-all of analysis. We think it’s a pretty nifty tool, especially if you understand its limitations, and it does a good job in most instances. However, it’s not perfect. Our version isn’t perfect, and Sean’s version isn’t perfect. We’re both trying, and we’re trying from different angles. Rather than focusing on why the differences make both “wrong,” maybe we should admit that it’s nice to have both perspectives?




Dave is a co-founder of USSMariner.com and contributes to the Wall Street Journal.


132 Responses to “Why Our Pitcher WAR Uses FIP, Part Two”

  1. AndyS says:

    So I buy the argument for using DIPS, but specifically, you don’t address the biggest question mark – why FIP? Why not xFIP? or tERA? or tERA* or tERAr or SIERA? Or maybe a weighted combination of the bunch weighted on accuracy?

    • Mike K says:

      From the Glossary:

      “If and when a new metric like tRA is proven to be significantly more effective in valuing pitchers (and I’m hopeful that it will be, given more data exploration on the topic), we won’t be standing here as guardians of the infallibility of FIP.”

      I don’t even think SIERA was published yet.

      • AndyS says:

        SIERA was published, and xFIP has been shown to be more predictive of a pitcher’s ability than FIP.

      • suicide squeeze says:

        I think the key word there is “valuing.” You can expect someone with 15% HR/FB to be better than that in the future, but he still gave up those home runs this year, and that should be accounted for.

      • Hark says:

        @Andy S

        WAR isn’t a predictive value, it’s a descriptive one. xFIP may be more predictive, but FIP is closer to what he actually did. And that’s what WAR measures–how valuable an individual player was to his team, not how valuable he’s going to be. We shouldn’t use predictive stats for that.

      • Rich says:

        If you want it to be descriptive, it shouldn’t be using FIP, it should be using ERA or some such.

        FIP’s idea is to be predictive, and it does a poor job at it.

      • @ Rich

        You are right that FIP is at the low end of the descriptive-predictive scale. However, while ERA is descriptive, it still is measuring a combined pitching/defense measurement, and as such is not a good choice for pitching WAR. If you adjust for defense, you get something closer to an adjusted RA value WAR. It seems then that RA adjusted WAR, with additional batted ball adjustments is the best fit.

      • Rich says:

        “You are right that FIP is at the low end of the descriptive-predictive scale. However, while ERA is descriptive, it still is measuring a combined pitching/defense measurement, ”

        So is FIP, despite the insistence otherwise. FIP’s denominator is IP, which is heavily influenced by a team’s ability to turn batted balls into outs. Replace IP with PA, and we might be getting somewhere (which is something SIERA has done).

        FIP is no more defense-independent than ERA.

      • Nadingo says:

        Your point about PA over IP is valid, but it’s a huge (and incorrect) leap to get from there to claiming that FIP is no more defense-independent than ERA. ERA explicitly makes judgments about when runs are “earned” based on a hugely flawed metric — whether a defensive play was an “error” or not. And that’s just the tip of the iceberg.

      • Kevin S. says:

        That’s just silly – ERA equates defense-independent outcomes with defense-dependent outcomes. FIP doesn’t. If you want to argue that FIP isn’t completely defense independent because of the IP issue, that’s fine, but it’s absolutely more defense-independent than ERA. Try not to get carried away next time.

      • Rich says:

        A stat can not be “more independent”.

        It either is, or it isn’t. FIP isn’t. FIP is dependent on defense in a different way than ERA, but it’s not any less dependent.

      • CircleChange11 says:

        more independent, closer to perfect, exactly like except, … a whole bunch of weird phrases that we commonly use.

        Why don’t many point out how much FIP relies on hitter skill? Walks and strikeouts seem to be two things that hitters have a tremendous amount of influence/control over … which is why we see the same guys K and BB at about the same rate annually.

        If pitchers controlled strikeouts and walks, there’d be a bunch of one and none of the other.

        That’s not really an issue with Fielding Independent Pitching, which is what FIP is … but the interpretation of it as being “things pitchers control”.

        Like I said, if pitchers controlled HR, BB, and K … the stat line for every IP would read “0 HR, 0 BB, 3 K”. They don’t.

        It’s another case of the stat not being the problem, but the interpretation/usage of it.

        I don’t, however, think that one can just say “WAR” anymore; I think one has to designate fWAR or rWAR, at least for pitchers. They seemingly measure two different things, but like most stats … most often the same talented group of guys will be in the top 15. But, that’s a pretty low standard to set. With that standard WHIP and ERA would be high level stats.

  2. Rich says:

    ” However, most hits are not of that variety. Instead, they’re groundballs in between two defenders or fly balls that fall near a chasing outfielder before he can get to it”

    That’s a very strong assertion. Do you have data for it?

    • Jason B says:

      I thought the same…I would tend to think most hits are of the “no doubt” variety and few are of the sort where a different defender would have made a difference. And I feel pretty confident about that, albeit admittedly without any sort of numbers backing that assertion whatsoever.

      • MV says:

        I guess what he meant was that pitcher has no control whether grounder is hit directly at a fielder or through a gap.

      • tom says:

        Does a batter have control over whether a sharp grounder is hit right at the 3rd baseman or 5 feet to his right for a double? Or whether it’s a 6foot high line drive right at 1st base vs one 9 feet high that goes for a hit?

        Do we exclude this from a hitter’s WAR calculation?

        On a site like this that is all about the #’s, and being objective, for Dave to trot out a subjective opinion/assertion about how hits are often close calls while providing no actual data is basically asking the readers to rely on perception (something that is often ridiculed here).

      • CircleChange11 says:

        “On a site like this that is all about the #’s, and being objective, for Dave to trot out a subjective opinion/assertion about how hits are often close calls while providing no actual data is basically asking the readers to rely on perception (something that is often ridiculed here).”

        Agreed.

        I also think it’s goofy for the same site to use batter WAR that includes luck and BABIP, but not for pitcher’s WAR.

        We’re using data the way Creationists use Science … we look for what we want to find, and sometimes use the data, and sometimes not. I have issues with the inconsistency, especially when reasonable options are available. We’re basically intelligent, logical people … right?

    • isavage says:

      yeah, I’d disagree with that, it seems to me most hits are of the “no doubt” variety … elevated line drives, sharp ground balls. There’s definitely a large amount of luck involved in ground balls, but all grounders are not created equal. Bad luck alone isn’t the reason Justin Masterson’s BABIP is 50 points higher than Fausto Carmona’s. (it does play a part, I’ve witnessed a good majority of both pitchers’ starts, and would say that Masterson suffered from more bad fielding than Carmona, but he also tended to have an inning or two every start where he’d lose control of his sinker and start elevating pitches over the middle of the plate: many still resulted in ground balls, but hard hit ground balls that couldn’t be fielded unless they were hit very close to an infielder)

      I don’t actually mind Fangraphs using FIP for WAR though, it provides another perspective that isn’t out there. I take from it what I choose

  3. Conshy Matt says:

    excellent article. i feel smarter for having read it.

    having said that, perhaps WAR for pitchers shouldn’t even be used.

    also, shouldn’t we use WAR as a 3+ year tool as opposed to looking at just one season (because of the inherent problem w/ one year’s worth of UZR)?

    • Brad Johnson says:

      This is getting some talk around the sabrwebs, esp. in reference to projection systems and how to build long range forecasts. I’m not sure it’s been focused on with WAR yet…

      What we need to remember is that WAR is a very rough/brute force measure…not surprisingly either, it’s pretty difficult to describe a player’s season in one number. It’s certainly useful but shouldn’t be used to distinguish between two similarly skilled players or as an end all, be all of player value.

      • Mike K says:

        For me, WAR has two uses. For valuing players (which is why FanGraphs originally called it Value Wins), and for grouping players.

        For valuing them, once you put a dollar figure on a win, you want to know: how much was a player worth? How well did your team allocate its resources? Etc.

        For grouping, this is also something on InsideTheBook they are discussing. Don’t take the top guy in WAR and say, “he’s the MVP.” Instead, take the top 10 in WAR, or top guys within 1 WAR of the lead, or something. THEN, do a more in-depth analysis. Or if you’re not doing it for specific, “who’s the best”, you can just do it to group guys like A, B, C, etc. players.

    • Jason B says:

      “perhaps WAR for pitchers shouldn’t even be used.”

      I agree. When you’ve got two competing systems using the same name, it just breeds confusion and an aversion to using the metric at all. These aren’t minor differences in many instances, like the Cliff Lee example JoePozz referenced. Plus it provides an easy inroad for the old-schoolers to attack with – “you can’t even agree on how to calculate your own new-fangled stat!!1! I’m sticking with pitcher wins!”

      We’ve got bright enough minds studying the issue, let’s shelve the metric until we get one methodology that a substantial majority can agree upon. Looking forward to that – pitcher WAR v2.0.

  4. Thomas says:

    My biggest issue with FIP can be pointed out with Liriano. Basically, people with high BABIP/LOB and low HR/FB get rewarded while people with low BABIP/LOB and high HR/FB get punished. It’s the reason that I already average the two WAR calculations. I don’t really like either by itself (although, like Posnanski, I like SIERA versions better), but averaged they paint a more complete picture.

    Also, I’d like you to address why you don’t use xFIP vs FIP to eliminate this problem. Is it because HR/FB can be slightly influenced by skill?

    Maybe my biggest issue is that while batter WAR is not adjusted for luck (BABIP normalization, Career HR/FB, etc), I wonder why fangraphs considers it necessary to adjust pitchers by luck. Really, I think two different standards are being used, leading to an inaccuracy in comparing WAR from position players to WAR of pitchers on this site.

    • DavidJ says:

      “Basically, people with high BABIP/LOB and low HR/FB get rewarded while people with low BABIP/LOB and high HR/FB get punished.”

      FIP doesn’t reward pitchers for having a high BABIP/LOB; it just doesn’t punish them for it. Nor does it punish pitchers for having a low BABIP/LOB; it just doesn’t reward them for it. All FIP does is reward pitchers for having high Ks and low BBs and HRs, and punish them for having low Ks and high BBs and HRs. If Liriano’s BABIP were .240 instead of .340, his FIP would be exactly the same, because FIP is only looking at his Ks, BBs, and HRs.

      • Thomas says:

        Yeah, my bad on that. Still, I don’t like FIP’s handling of HRs or lack of handling of GB% enough to endorse it as a WAR stat.

      • Locke says:

        What you’re not grasping is that we have a lot of data that shows that most, if not all, pitchers CAN’T control those factors, which is why they aren’t used in WAR. They are not “skills” as much as you’d like to think they are. And while it is likely that some pitchers can control xBABIP and HR% to a very small degree, it’s not even close to as important as the stats that FIP encompasses. Which is the entire point. You aggregate the IMPORTANT stats that a pitcher can control, and that we can measure accurately, and leave the uncertainty out of the equation. B-R WAR is wildly inaccurate.

      • Rich says:

        ” If Liriano’s BABIP were .240 instead of .340, his FIP would be exactly the same, because FIP is only looking at his Ks, BBs, and HRs.”

        No, it absolutely would not be.

        Fip is

        (13*HR + 3*BB - 2*K)/IP + a league constant (set so that league FIP matches league ERA)

        If Liriano’s BABIP was to go up, his IP would go down, and most likely, he’d have more outs via K (because he wouldn’t be getting them on balls in play). His FIP might go up, it might go down, but it certainly wouldn’t stay the same.
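
A quick way to see the IP dependence described here is to hold HR, BB, and K fixed and change only the innings. The sketch below uses the formula from this comment; the stat line and the 3.20 additive term are assumed placeholders rather than real data, and the function name is hypothetical. It does not model the second effect mentioned above (extra strikeouts replacing outs on balls in play).

```python
def fip(hr, bb, k, ip, constant=3.20):
    """FIP as written in the thread: (13*HR + 3*BB - 2*K) / IP plus an additive term.

    The default constant is an assumed placeholder, not a real league value.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


# Hypothetical pitcher: identical HR, BB, and K, but fewer balls in play
# turned into outs in the second case, so fewer innings pitched.
same_events_more_outs = fip(hr=20, bb=60, k=150, ip=200)
same_events_fewer_outs = fip(hr=20, bb=60, k=150, ip=190)

print(f"FIP over 200 IP: {same_events_more_outs:.2f}")
print(f"FIP over 190 IP: {same_events_fewer_outs:.2f}")
```

With this made-up stat line the numerator is positive, so losing innings pushes FIP up; for a pitcher whose numerator is negative, the same change would push it down.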

      • Rich says:

        “What you’re not grasping is that we have a lot of data that shows that most, if not all, pitchers CAN’T control those factors, which is why they aren’t used in WAR. ”

        No, we do not. Until we start correcting for batters faced, we can’t make that statement.

        Right now, all we know is that the confounding factors (park, quality of opponent, etc.) are probably larger than the pitcher’s ability to control quality of contact. Considering that the difference in quality of batters faced is as high as .050 OPS, that doesn’t say a lot.

      • mowill says:

        Locke said, ” B-R WAR is wildly inaccurate.”

        I would say it is far more accurate because it uses many more data points. Judging pitcher WAR based entirely on 3 data points that are context dependent seems reckless to me.

    • grandbranyan says:

      I believe FIP is used because WAR is descriptive, it is trying to assign a value to what actually happened.

      xFIP is more predictive in nature because it attempts to eliminate some of the noise around HR/FB rate.

      • Rich says:

        FIP doesn’t describe what happened at all. FIP describes what happened in approximately 35% of at bats, and pretends the other 65% didn’t happen.

      • J-Doug says:

        WAR is specifically meant to be a performance (and therefore a prediction) stat, not a descriptive stat, although the scale (wins) is calibrated to better express the relative weight of the performance.

        Both FIP and xFIP are also supposed to be predictive. In the aggregate, xFIP does a better job of predicting than FIP because it punishes HR rates that typically regress.

        However, on the individual level, they don’t always regress (unlike BABIP for pitchers, which almost always does). For instance, xFIP assumes that 10.6% of fly balls will, over enough time, be home runs. But Mariano Rivera’s kept his HR rate at around 6% for his entire career. In other words, Mo’s xFIP assumes that he’s going to regress back to a 10.6% HR/FB even though he’s not.

        Mo’s not the only pitcher this applies to, but he’s the one that comes to mind.

      • Rich says:

        “unlike BABIP for pitchers, which almost always does”

        Except in the hundreds of examples where it doesn’t.

      • Thomas says:

        I disagree with this, based mostly on the blanket Predictive/Descriptive assignments given out freely on this site. xFIP is only more predictive and less descriptive than FIP. Neither of them are absolutes.

        My basic gripe is that I would prefer that WAR give more credit to the batted ball profile (a skill) that is not entirely captured in FIP. IFFB% should be rewarded. GB%/FB% should be rewarded. HR/FB% should be taken into account. I think this would make for a more DESCRIPTIVE statistic.

      • MV says:

        I guess FIP for e.g. Jim Palmer is a much better descriptive stat than the Robinson-Belanger-Blair-ERA

      • J-Doug says:

        Rich: Point being, regressing to historical BABIP doesn’t cause as much damage as regressing to historical HR/FB. The distribution for HR/FB is far flatter than BABIP’s. In 2010, the standard deviation for BABIP was equal to 0.08 of the mean, versus .27 for HR/FB. The min-max range for BABIP was 0.38 of the mean, vs. 1.35 for HR/FB. Essentially, BABIP’s regression is 4 times stronger than that of HR rate.

        All stats make some assumptions. Assuming that BABIP will regress is far less dangerous than assuming HR/FB will.

      • Rich says:

        “All stats make some assumptions. Assuming that BABIP will regress is far less dangerous than assuming HR/FB will.”

        Agree.

        The problem is that assuming BABIP will regress is still dangerous. There are plenty of guys out there who have never in their career come anywhere near league average BABIP. Tim Wakefield, for example, has a career BABIP .025 below league average, and the only season where he’s been above league average was at 26 (.305).

        We’re talking 13000 PA at this point for him. That’s not luck, that’s a skill.

        The reason why we continue to think that BABIP,etc are luck is that we haven’t found a way to remove one huge confounding factor: Quality of Opponent faced.

        A pitcher who throws 200 innings will face something like 850 batters. If you face a guy like Ichiro, who has a .360 career BABIP, for 17 plate appearances in a year, that should raise your expected BABIP .0015.

        That doesn’t sound like a lot, but to put it in perspective, a season against a good team (4 starts) vs. a season against a bad team can lead to somewhere in the range of 100 AB at a 40-point difference in BABIP (say .280 vs .320); that’s .006 or so.

        Then look at, say, a starter for the Blue Jays, who has to face the Red Sox, Yankees, and Rays: we’re looking at a .018 difference in expected BABIP (and corresponding OPS changes) vs. a guy who gets to face, say, the NL Central, just for the in-division games.

        What’s the ERA difference of an increase in BABIP of .018?

        The problem is, we continue to lie to ourselves and say these things even out over the course of a season, when they most certainly don’t.
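
The arithmetic behind these opponent-quality estimates can be sketched as a simple weighted shift: the share of a pitcher's plate appearances against an opponent, times how far that opponent's BABIP sits from a baseline. This is a rough sketch only. The .300 league baseline is an assumption, plate appearances stand in for balls in play, and the function name is hypothetical.

```python
def expected_babip_shift(pa_vs_opponent, total_pa, opponent_babip, baseline_babip):
    """Approximate change in a pitcher's expected BABIP from facing one
    opponent (or group of opponents) for a share of his plate appearances."""
    if total_pa <= 0:
        raise ValueError("total_pa must be positive")
    return (pa_vs_opponent / total_pa) * (opponent_babip - baseline_babip)


LEAGUE_BABIP = 0.300  # assumed baseline, not taken from the thread

# 17 PA against a .360 BABIP hitter out of roughly 850 batters faced.
one_hitter = expected_babip_shift(17, 850, 0.360, LEAGUE_BABIP)

# About 100 AB against a .320 BABIP lineup instead of a .280 BABIP lineup.
schedule = expected_babip_shift(100, 850, 0.320, 0.280)

print(f"Shift from one elite hitter: {one_hitter:.4f}")
print(f"Shift from one tough opponent: {schedule:.4f}")
```

Both results land in the same neighborhood as the figures quoted above; the exact values move with the baseline chosen and with how many of those plate appearances actually end in a ball in play.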

  5. Brad Johnson says:

    Dave,

    Ideally we’d be able to have a (mostly) correct way to do the adjustments Sean does, no? Part III should be what data we need to pull that off. As detailed as possible please? :) I think enough time has passed to open up a new brainstorming session (even if the bulk of it ends up over on the book blog).

    • tom says:

      We already look at batted ball outcomes to measure defensive ‘skill’ in UZR… why can’t this be flipped and simply look at each ball in play for a pitcher (both hits and outs) and give each batted ball a factor based on the UZR baseline data. A hit that would have been caught 75% of the time would be .25 of a single for the pitcher and an out that would have been a hit 90% of the time if you didn’t have Carl Crawford running it down would be given 90% of a hit.

      We seem to be OK with this for defense, why not apply it to pitchers?

      This doesn’t address ‘luck’ but if you are looking for a descriptive measure of something that already has happened, luck is a part of it (just like hitter’s WAR). If you want to use FIP/xFIP as a predictive stat (as I believe they are intended), then you can ignore balls in play and regress HR rates.
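
The fractional-credit idea above can be sketched in a few lines. Each ball in play carries a league-baseline probability of being turned into an out; the pitcher is charged the expected share of a hit regardless of what his fielder did, and the gap between actual and expected hits goes to the defense. The class, the function, and the two sample probabilities are hypothetical, and real UZR baseline data is not used here.

```python
from dataclasses import dataclass


@dataclass
class BallInPlay:
    out_probability: float  # assumed baseline chance an average defense makes the out
    was_hit: bool           # what actually happened on the play


def split_hits(balls):
    """Return (hits charged to the pitcher, actual hits, hits attributed to fielders)."""
    pitcher_hits = sum(1.0 - b.out_probability for b in balls)
    actual_hits = sum(1 for b in balls if b.was_hit)
    # Positive means the defense allowed more hits than the baseline expects.
    fielder_hits = actual_hits - pitcher_hits
    return pitcher_hits, actual_hits, fielder_hits


# The two cases from the comment: a hit on a ball caught 75% of the time,
# and an out on a ball that is a hit 90% of the time.
balls = [BallInPlay(0.75, True), BallInPlay(0.10, False)]
pitcher_hits, actual_hits, fielder_hits = split_hits(balls)

print(f"Charged to pitcher: {pitcher_hits:.2f} hits")
print(f"Actual hits allowed: {actual_hits}")
print(f"Attributed to fielders: {fielder_hits:+.2f} hits")
```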

      • brendan says:

        this was my take on the post as well. Instead of taking average defensive performance, let’s use actual performance data that is already coded for UZR calculations.

        dave, if the defensive adjustments were made based on the specific plays made behind the pitcher in question, would that address your concerns?

      • zenbitz says:

        I think you could do this iteratively with the PBP data.
        For each BIP assign a lambda difficulty factor to its being caught. These lambdas are unknown. For FIP, they are 0. For pure RA they are all 1. For ERA they are 1 for hits and 0 for errors.

        Actually you might want to flip these so that lambda represents the chance of converting a BIP into an out.

        The question is…what utility function do you maximize? You have an easy starting point based on FB vs. GB vs. LD rates of out conversion. I guess you could further do this by BIP zones – averaged over all fielders/parks.

        Hmm… I guess it’s a little more complicated because instead of 1 lambda you actually have 4 (Pout P1b P2b P3b)… but you could use a factor by zone representing the LWR value of non-outs. Call this Z.

        LWR(pitcher) = FIPr(pitcher) + BIP(pitcher) + BIP(fielder)
        = FIP + Sum(lambda*Z(bip)) – Sum(1-lambda*Z(bip))
        with the first two being the pitcher’s responsibility and the third the fielders’.

        But now assume the lambda = fielder+pitcher.. and the fielder is constant across all pitchers.

        I think that might work.

  6. J-Doug says:

    I like fWAR from a theoretical and practical standpoint. The use of FIP is certainly more appropriate for an individual, rather than team-level, stat.

    Of course, many will wonder, as many have of late, whether FIP is a good enough DIPS to rely on. I’m not sure it is, and I don’t think xFIP is the answer either. Developing a more convincing DIPS would go a long way to developing a better WAR.

    • J-Doug says:

      I’ve suggested this in the past, and will again. How about a FIP stat that, instead of assuming the pitcher’s HR/FB rate will regress to 10.6%, assumes it will regress to his career rate? Call it cFIP or aFIP. For any player with a decent sample of FBs it should perform well.

      FTR, I’m a SIERA fan myself.
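
The career-rate idea above can be sketched in a few lines. This is only an illustration, not an established stat: it assumes the FIP weights quoted elsewhere in this thread (13 for HR, 3 for BB+HBP-IBB, 2 for K, plus a league constant of roughly 3.2) and, in the spirit of xFIP, swaps actual home runs for fly balls times an assumed HR/FB rate. The function name `fip_with_hr_per_fb`, the 8% career rate, and the sample stat line are all hypothetical.

```python
def fip_with_hr_per_fb(fb, bb, hbp, ibb, k, ip, hr_per_fb, league_constant=3.2):
    """FIP-style estimate with home runs replaced by fb * hr_per_fb.

    ip is innings pitched as a true decimal (200.333..., not box-score 200.1).
    league_constant is an assumed value; the thread says "usually around 3.2".
    """
    expected_hr = fb * hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp - ibb) - 2 * k) / ip + league_constant


# Hypothetical stat line, for illustration only.
line = dict(fb=200, bb=50, hbp=5, ibb=3, k=200, ip=200.0)

league_version = fip_with_hr_per_fb(hr_per_fb=0.106, **line)  # league HR/FB rate
career_version = fip_with_hr_per_fb(hr_per_fb=0.080, **line)  # assumed career HR/FB rate
print(round(league_version, 3), round(career_version, 3))
```

The only difference between the two calls is the rate the fly balls are multiplied by, which is the whole of the proposal; how large a fly-ball sample makes a career rate trustworthy is left open here.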

      • Spoilt Victorian Child says:

        Yeah, it seems to me that SIERA is probably the evolutionary endpoint (at least until we can just start doing linear weights on batted-ball speeds). I’m guessing that’s BP’s IP, but still, I imagine some version of tERA would also be better than FIP.

      • Luke in MN says:

        I’m sure that it’s BP’s intellectual property, but I’m equally sure that someone could follow the same logic and end up in roughly the same place.

      • J-Doug says:

        And it’s not as if SIERA’s perfect either. Its coefficients are based on historical averages just as FIP and xFIP are. In any case where the coefficients regress what is actually skill related, you’re going to over/undervalue the player. No stat is perfect. What’s most important is getting as close as we can to perfect and knowing what stat to use, when, why, and knowing what it tells us.

    • J-Doug says:

      SIERA = 6.145 – 16.986*(SO/PA) + 11.434*(BB/PA) – 1.858*((GB-FB-PU)/PA) + 7.653*((SO/PA)^2) +/- 6.664*(((GB-FB-PU)/PA)^2) + 10.130*(SO/PA)*((GB-FB-PU)/PA) – 5.195*(BB/PA)*((GB-FB-PU)/PA)
      where +/- is as before such that it is a negative sign when (GB-FB-PU)/PA is positive and vice versa.

      from: http://www.baseballprospectus.com/article.php?articleid=10045
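
For anyone who wants to plug numbers in, here is a minimal sketch of the formula exactly as quoted above, with the +/- rule applied as described (the squared ground-ball term is subtracted when (GB-FB-PU)/PA is positive and added otherwise). The coefficients are copied from the comment; the function name `siera` and the sample inputs are hypothetical, and this has not been checked against BP's published values.

```python
def siera(so, bb, gb, fb, pu, pa):
    """SIERA as quoted in the comment above; all inputs are season counts."""
    k_rate = so / pa
    bb_rate = bb / pa
    gb_rate = (gb - fb - pu) / pa  # net ground-ball rate per plate appearance

    # The "+/-" term: negative sign when the net ground-ball rate is positive.
    sign = -1.0 if gb_rate > 0 else 1.0

    return (
        6.145
        - 16.986 * k_rate
        + 11.434 * bb_rate
        - 1.858 * gb_rate
        + 7.653 * k_rate ** 2
        + sign * 6.664 * gb_rate ** 2
        + 10.130 * k_rate * gb_rate
        - 5.195 * bb_rate * gb_rate
    )


# Hypothetical pitchers with identical K and BB rates but different batted-ball mixes.
neutral = siera(so=200, bb=50, gb=100, fb=80, pu=20, pa=800)    # net GB rate of 0
grounder = siera(so=200, bb=50, gb=300, fb=80, pu=20, pa=800)   # net GB rate of 0.25
print(round(neutral, 3), round(grounder, 3))
```

When the net ground-ball rate is exactly zero the squared term vanishes, so the choice of sign at zero does not matter.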

  7. badenjr says:

    My question isn’t really all that related, but the difference between Cliff Lee’s BABIP with no one on vs. his BABIP with men on base reminds me of a question I’ve often wondered about. There’s a major difference in those two states that, as far as I know, gets little attention. With no one on base Lee pitches from the windup, while he’ll throw from the stretch with runners on. I can’t imagine that pitchers are equally effective out of the stretch or windup, and I’m equally sure that some pitchers have greater disparity than others. Does anyone look at this?

    It would seem to me that there are all kinds of applications that could be made from understanding the differences in performance between the windup and the stretch. Starters use the windup with no one on base, so one would assume that they are more effective out of the windup since there’s nothing preventing them from using the stretch. At the same time, it seems to me that the windup involves more motion and is more likely to tire a pitcher out sooner. If that’s true, it suggests that starters should go from the stretch simply to maintain their stuff longer. Further, if a case could be made that some pitchers are more effective from the stretch than the windup, wouldn’t that seem to be a criterion to be considered when assessing a young pitcher’s future as a starter vs. a reliever? I seem to always hear people refer to so-and-so having a ceiling as a reliever because he lacks a third pitch. I never hear of so-and-so being destined for the bullpen because he’s not as sharp in the windup. Why is that?

    • Thomas says:

      Your example is a good reason why Fangraphs doesn’t look at BABIP for WAR. As Dave pointed out, his K/BB ratio in this situation (most of FIP) is 10/1, or really awesome. He likely is pitching just as effectively with people on base. Most pitchers get substantially less effective in terms of K/BB with people on base, so Cliff Lee is even farther ahead of the curve. It just so happens that when people do reach base, the balls find holes more often than when people aren’t on base.

      Two-pitch pitchers tend to be relievers because they struggle with platoon splits and with hitters getting more looks at them as the game goes along. Say that, as a RHP with an average FB and SL, you have a good out pitch for righties (slider) but struggle with lefties. As a starter, the opposing manager would stack the lineup with lefties and you would get pounded. If that RHP adds a good changeup (for lefties), he now has an out pitch for those lefties that were hitting him hard, and he can conceivably pitch more effectively and deeper into games. Windups do not meet this criterion, except that it’s fair to point out that closers tend to be terrible at holding runners on, and prefer to pitch out of the windup and go for the strikeout. This is really the only example of easily noticed windup issues.

      • badenjr says:

        I wasn’t suggesting anything about Lee’s performance. I was simply pointing out that the distinction between performance with runners on vs. performance with bases empty triggered a question I was curious about.

        I appreciate the explanation of why 2 pitch pitchers tend to be relievers. It all makes sense. However, that wasn’t really what I was getting at, and perhaps that’s my fault for not making it clearer.

        I’m simply wondering why the difference in effectiveness in the windup vs. in the stretch is not something we consider when evaluating pitchers. Is the difference just that small? If that’s the case and you don’t gain anything by pitching from the windup, why do pitchers do it? If it’s not the case, and it actually does matter, then why don’t we consider it?

    • CircleChange11 says:

      Evidently, there are guys like Ricky Nolasco that are much less effective with men on base … probably for a few factors involved … mechanics, windup v. stretch, pitch selection & sequence, etc.

      I know I’ve said it a few times, but averaging fWAR and rWAR seems like the best move for assigning a value of what happened during the season.

      I still, strongly, feel that there should be an “opponent’s batting quality” aspect. From my own experience as a pitcher, the greatest factor in my own performance wasn’t always what I did, but the batters I was facing. I could throw the exact same “game” against a top 10 team and the worst team in our conference, and get two dramatically different results.

      I can influence a lot of things with pitch location, speed, and type. But K’s, BB’s, and well struck balls have more to do with the batter than the pitcher IMO … with the exception of pitchers that have extraordinary stuff.

      • badenjr says:

        How do you know this about Nolasco? Do the Marlins leverage this in any way? Maybe this is a more relevant question about a reliever. Do teams have guys that they’ll bring in to start innings, and other guys that they’ll bring in with runners on? If so, do the guys who come in to start innings throw from the windup?

  8. Bay Slugga says:

    Excellent article, Dave.

  9. A passerby says:

    It’s surprising to me that anyone would subscribe to the WAR that B-R uses. Your idea of measuring what we know with certainty that the pitcher can control seems like the only logical thing to do. Then hang the disclaimer on it that it may not interpret all of the variables, but the ones that it does, it hits right on the button.

    As opposed to taking an inaccurate guess based on many unreliable and incalculable factors and saying: This is right.

    • Rich says:

      “Your idea of measuring what we know with certainty that the pitcher can control seems like the only logical thing to do.”

      They don’t do that at all. They make the assumption that 60+% of at bats are non-predictive, and ignore them. Any time you eliminate 60% of your sample size because “I don’t know how to deal with this”, it’s a bad idea.

      • J-Doug says:

        “Any time you eliminate 60% of your sample size because “I don’t know how to deal with this”, it’s a bad idea.”

        That’s a good combination of wrong and misleading. First of all, they’re not reducing their sample size, they’re culling variables from their model. And you certainly can drop 60% of your model if it isn’t rather predictive and the remaining 40% of it is, which is the case here.

        We get your point… sometimes those things that vary matter. This is especially true at the individual rather than team level of analysis. It’s still better to configure WAR based on a stat that compensates for defense at the individual level rather than team level, as fWAR does.

        But when you have a model where certain variables introduce more noise than signal–single year variations in BABIP being a very good example–you don’t just keep the noisy ones because you simply have the data on hand. Keep them, and you have a stat that explains less, not more.

      • Danmay says:

        Rich-

        I have been reading through the comments of this post and keep running into your comments.

        While I can’t say that I agree with all of your opinions I will say that quality of batters faced is dangerously omitted. In addition to the effect that batters can have on BABIP I think that it is worth considering the effect that batters have on Ks, BBs, and HRs. We think of DIPS as fielding independent, because they are, but I think that a lot of people tend to extend that logic to mean that Ks, BBs, and HRs are the three outcomes within a pitcher’s control. Those three outcomes are largely dependent on the batter and, as you stated above, these things don’t even out nearly as much as we like to assume. That being said, I think that we assume these things because of the (likely) exponentially increased man labor necessary to look at batters faced for an entire season.

        I do think that many of us (myself included) tend to gloss over the stats that seem to even out.

      • Rich says:

        “That being said, I think that we assume these things because of the (likely) exponentially increased man labor necessary to look at batters faced for an entire season.”

        And that’s exactly my point. We should be saying “the stat isn’t sophisticated enough,” not “these things even out”.

        Point out the weakness, don’t pretend it’s not there.

  10. Sinkerballer says:

    Wait wait wait

    FIP doesn’t consider GB%?

    WTF?

    • LeBlur says:

      FIP doesn’t need to consider GB rate because realized ground ball rate will tend to normalize over a full season

      • Rich says:

        Um, what?

      • Brad Johnson says:

        I’d like that explained too, a pitcher has some control over whether he generates grounders, no?

      • MV says:

        The GB% affects his HR rate and his HR/9 is supposed to normalize over a season in correlation with his GB rate.

      • tom says:

        Your assumption that we know a pitcher can control K’s, BB’s and HR’s with certainty is also flawed… a pitcher might have a majority of the control over this but he doesn’t control whether a ball hits the foul pole or goes foul by a foot or whether it clears an outfield wall by a foot or hits off the top of it. He doesn’t control whether an umpire decides to ring people up because it’s 8-1 and he wants to go home or whether he’s got a ridiculously tight strike zone. A pitcher doesn’t control when the opposing manager wants to rest a star and puts in a strikeout machine bench player.

        Let’s not assume K’s, BB’s, and HR’s are 100% pitcher-controlled and not luck/variation/defense based. Sadly this seems to become the assumption as FIP and WAR get more mainstream attention.

    • Brad Johnson says:

      From THT:

      “The formula is (HR*13+(BB+HBP-IBB)*3-K*2)/IP, plus a league-specific factor (usually around 3.2) to round out the number to an equivalent ERA number.”

      If you know how to develop a stat that holds the above basic tenets AND considers regressed GB% AND can be expressed as a single, easy to follow number, you should get started on making it.
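
The quoted THT formula translates directly into code. A minimal sketch, assuming innings are given as a true decimal and using 3.2 as the league factor only because the quote says it is "usually around 3.2"; the function name `fip` and the sample stat line are hypothetical.

```python
def fip(hr, bb, hbp, ibb, k, ip, league_constant=3.2):
    """FIP per the THT formula quoted above.

    ip is innings pitched as a true decimal (e.g. 200 1/3 innings is 200.333...,
    not the 200.1 box-score notation). league_constant is an assumed value.
    """
    return (hr * 13 + (bb + hbp - ibb) * 3 - k * 2) / ip + league_constant


# Hypothetical season: 20 HR, 50 BB, 5 HBP, 3 IBB, 200 K in 200 innings.
example = fip(hr=20, bb=50, hbp=5, ibb=3, k=200, ip=200.0)
print(round(example, 2))
```

Note that nothing about batted-ball type enters the calculation, which is the point Sinkerballer was asking about.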

      • Thomas says:

        And herein lies the problem. Adding batted ball profile information and any other ‘skill’ correlated statistics to a stat like FIP that encompasses the easiest skills to separate is tricky.

  11. Guy says:

    Unfortunately, the three key assertions here are all demonstrably wrong.

    “Clearly, some hits on balls on play are the “fault” of the pitcher. He throws a fastball down the middle in a 3-1 count and the hitter whacks it for a double in the gap – that’s on him, certainly. However, most hits are not of that variety.”

    This is not true. If you built a model to predict the outcomes on BABIP, and included UZR’s estimated out probability on the ball, plus the relevant fielders’ ratings, you would find the BIP qualities explain far more of the outcomes. In fact, the information on type/velocity/location alone probably explains more than 50%, none of which fielders can possibly impact. So this isn’t even close to being true. And I think even casual fans observing the game would say most BIP are either obvious hits or obvious outs.

    “There is no reason to think that defensive performance is any more consistent on a day-to-day basis. If anything, there are reasons to believe that it would vary even more than offense.”

    There is a fantastic reason to believe there is less variation in defensive performance: because we see less variation everywhere we look! Players vary more in their hitting output than in their fielding/position output — just look at your own WAR data. And at the team level, offense varies the same as does pitching and fielding combined, so obviously offense varies much more than fielding alone. So the differences in run support are not a relevant analogy at all.

    “What is more likely – that Liriano gave up contacted balls that should have resulted in a .350 to .360 BABIP, and the good glove Twins helped bring that down to .340, or that the guys behind him didn’t make as many plays for him as they did when Carl Pavano or Brian Duensing was pitching?”

    I think it is MUCH more likely that he gave up hard-to-field balls that explain most of his .350 BABIP. But you can find this out for us and report back, by looking at the quality of fielding (UZR) provided behind each pitcher, and comparing that to pitchers’ BABIP. What I am pretty sure you will find is that the quality of fielding explains only a small part of the variance in pitchers’ BABIP. (Although I don’t know about Liriano specifically, of course.)

    In fact, why don’t you use UZR to calculate pitchers’ WAR? Just adjust their ERA up or down based on the actual quality of defense provided on THEIR balls in play?

    • Brad Johnson says:

      A nice idea theoretically but you would run into SSS issues immediately.

      • WilsonC says:

        Why would SSS be a problem here?

        If we’re looking at a defensive skill level, then Sean’s WAR handles it reasonably well. However, it seems to me that the choice to use a FIP-based WAR is largely due to the fact that we’re not looking for an estimate of defensive skill, but rather an estimate of defensive performance, SSS spikes and all.

      • bill says:

        Yeah, it would certainly be useful… but if UZR data already takes 3 seasons to equal 1 season of offensive data, less than 1/5 of that is going to be even rougher.

      • Guy says:

        There’s no such thing as a “sample size” issue in a performance metric. All we care about is what happened. WAR is supposed to tell us what the pitcher did, not predict his future performance. Those are very different things.

      • Brad Johnson says:

        Wilson, the problem as I understand it (and I could be wrong) is that a single game of UZR data does not adequately represent the defensive play in that game. This is a question that’s better posed to MGL…

        Perhaps we could jiggle the data to give us larger samples? I’m thinking if we compared a pitcher’s year to year UZR behind him over many years while adjusting for the quality of defense behind him, we might get a number that is closer to significant. I worry that even then something would have gotten screwed up, and I’m not even talking about pitchers like Joel Piniero who substantially change their game.

      • Brad Johnson says:

        Guy, the issue is “Is a small sample of UZR data the same as performance?”

      • Guy says:

        Brad: It’s not that small a sample. With WAR, we usually are working with a full season, or at least a substantial portion of one. So the UZR is based on 600-700 BIP, typically. Even if it’s pretty inaccurate at the game level, it should be a reasonable estimate over the season.

        Or let’s put it this way: if Dave doesn’t have confidence in a season’s worth of UZR behind starting pitchers — which is a larger sample than you have for fielders — then he should explain why he publishes it!

      • WilsonC says:

        The question then would become whether a single season worth of UZR data behind a given pitcher measures performance well enough to be a useful tool? Even if it’s not as precise as we’d like, does 30 games worth of data measure a team’s performance behind a pitcher well enough to be worth incorporating into a tool to look at the problem from a different angle?

        I don’t know enough about how UZR is calculated to know how much data’s needed to be useful in assessing past performance, but we’re looking at a full team’s data here, so there are a lot more balls to work with than for any given player. If we’re dealing with a full team’s defense over 1/5 of a season, that’s still more plays than we use for a typical position player’s WAR. If we can’t trust the former, then shouldn’t we stop trusting the latter?

      • Brad Johnson says:

        Well I appealed to MGL for help, I’ll post what he says here if he gets back to me quickly.

      • vivaelpujols says:

        Brad, I see your point, but I think Guy is right. Let’s say it takes half a season for player UZR to start to semi-accurately reflect what happened. Half a season of UZR is about 500 innings of a single fielder’s performance. Half a season’s worth of pitching is 100 IP, but that needs to be multiplied by 7 because there are 9 fielders on the field (and I’m excluding pitchers and catchers).

        Pitcher expected UZR deals with a larger sample size than fielder UZR.
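
The arithmetic behind that comparison, spelled out as a rough sketch. The 500 and 100 inning figures are the comment's own round numbers, and counting "fielder-innings" is a simplification: it measures exposure, not the number of balls actually hit to each fielder.

```python
FIELDERS_BEHIND_PITCHER = 7  # nine on the field minus pitcher and catcher

fielder_half_season_innings = 500  # one fielder over half a season, per the comment
pitcher_half_season_ip = 100       # one starter over half a season, per the comment

# Every inning pitched is an inning of defense from each of the seven fielders.
fielder_innings_behind_pitcher = pitcher_half_season_ip * FIELDERS_BEHIND_PITCHER

print(fielder_innings_behind_pitcher)                                # 700
print(fielder_innings_behind_pitcher / fielder_half_season_innings)  # 1.4
```

On these assumptions the defense behind one starter covers about 1.4 times the fielder-innings of a single position player over the same stretch.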

    • Perspective says:

      I think we need to keep some perspective here. A difference of 20 points in batting average is about 2 hits per month, so 20 points of BABIP is about 2 hits per month and a quarter. Without having the statistics placed in front of you, this would be virtually imperceptible, from a practical standpoint. Even for a full season, there just doesn’t seem to be enough of a sample size to appropriately quantify the randomness (luck, if you will) factor. So any theories regarding the additional reasons behind an abnormally high or low BABIP for an individual player based on only one season are likely to be much more skewed by sensibilities than by objectivity. That said, being a Twins fan who has watched most of their games, my sensibilities don’t tell me that the Twins defense has been any worse for Liriano than for Pavano or Duensing. And, I agree with most generally objective Twins fans that Liriano is a perfect example of the flaws with FIP, and why statistics without observation are limited. He is very good, but until he learns to channel his excess energy better and to handle bad ‘luck’ better, he will not be an elite pitcher. (Though, I’m quite happy to have him on the Twins.)

  12. MV says:

    “There is a fantastic reason to believe there is less variation in defensive performance: because we see less variation everywhere we look!”

    If there is a small defensive variation on a day-to-day basis, how come there are such great variations year-to-year?

    • Guy says:

      MV: what do you mean by “great variations” year to year? Do you mean players’ UZR ratings changing from year to year? That is a function of two things: 1) small sample size (compared to PAs), and 2) measurement error in fielding metrics. There is no data I know of to think fielding performance is any less consistent than hitting performance.

      • Rich says:

        “There is no data I know of to think fielding performance is any less consistent than hitting performance.”

        Is there any data to suggest it is consistent?

      • Guy says:

        Rich: there is a huge amount of data to suggest that the difference in fielding ability, at both the player and team level, is much smaller than the differences in offensive ability. Thus, we expect it to vary less at the game level as well. It’s Dave who wants to claim that fielding support varies as much or more as pitcher run support. The burden of proof is on him, and as far as I know there is zero evidence to support the claim.

        But Dave could have someone look at how much team UZR varies at the game level, and compare that to offensive runs scored. It’s possible that UZR varies more, but I’d be shocked to learn that is true.

      • Rich says:

        “Rich: there is a huge amount of data to suggest that the difference in fielding ability, at both the player and team level, is much smaller than the differences in offensive ability.”

        You keep saying that. Post some.

        I think we all agree that the difference in batting ability has more RANGE (and has more of an effect on wins), i.e., you have +50 run hitters and -25 run hitters, but that’s not the same as saying one is more consistent than the other.

      • Guy says:

        “You keep saying that. Post some.”

        Seriously? OK, I looked at Fangraphs’ 2010 WAR data (players with at least 500 PA), and the standard deviation on offense is 17.5 runs while the standard deviation for fielding/position is just 10.5 runs. OK?

        Look, if Fangraphs wants to start posting team UZR behind each pitcher, then we can actually compare the variance in defensive support and in run support (and if they already do, let me know). If I’m wrong, so be it. But Dave is just speculating, and the speculation is not very credible given the data we do have.

      • Rich says:

        “and the standard deviation on offense is 17.5 runs while the standard deviation for fielding/position is just 10.5 runs. OK? ”

        Which tells you absolutely nothing about day to day consistency.

      • Guy says:

        Rich, you quoted me saying “the difference in fielding ability, at both the player and team level, is much smaller than the differences in offensive ability” and challenged me to “post some.” So I did. If you don’t care about that, why did you ask?

        Look, Dave C claimed that there is “no reason” to think the variance in fielding is smaller, and some reason to think it’s larger. I’ve said:
        1) this first claim is wrong — there are “some” good reasons to think fielding variance will be smaller.
        2) Dave provides no evidence whatsoever for his second claim;
        3) I’ve suggested exactly how someone can measure whether pitchers’ defensive support varies more than offensive support, and prove me wrong.
        Do you actually disagree with any of these points, or are you just being argumentative?

        I’ve also suggested a simple and I think rather obvious solution, which is for Fangraphs to use their own preferred fielding metric to establish how much of a pitcher’s performance was affected by his fielders. This fully addresses Dave’s objection to Rally’s method, that rWAR assumes equal defensive performance behind every pitcher on a team.

      • Danmay says:

        Guy-

        First off I don’t have any numbers to back up what I am about to say.

        Even if we assume that a player’s defensive performance has no variance from one day to the next I think that we can still say that the fielding effectiveness of a team as a whole from one day to the next will vary more than hitting.

        It is a matter of the distribution of balls in play, or more specifically who the ball gets hit to. If one day your defensive wizard centerfielder gets ten balls hit his way and the next day those ten balls go to your lead-footed right fielder instead, the day-to-day fielding performance will vary dramatically. On the other hand the same nine batters (roughly) will receive the same proportion of plate appearances every day.

      • WilsonC says:

        Danmay,

        Even though the distribution of chances is more evenly distributed among hitters, I don’t know if it follows that it’ll cause more variance in fielding than hitting.

        One thing that’s important to remember is that there’s really no such thing as a “routine” hit. In any given plate appearance, the hitter’s most likely outcome is an out. There’s never an instance where success can reasonably be expected of a hitter in a single plate appearance.

        Conversely, there are plenty of “routine” plays on defense. In every game, you see plenty of contact made where you can pretty much assume the play will be made successfully, regardless of the quality of the defender. The range of outcomes for each plate appearance is varied enough that even with an equal distribution of plate appearances from game to game, the end results are highly varied. With fielders, there’s a lot fewer balls in play over the course of the game in which the results aren’t almost automatic from the point of contact, so there’s fewer plays that have an opportunity to dramatically shift the score.

    • dnc says:

      Bingo. That was precisely my first thought.

  13. Kazinski says:

    I think pitcher WAR overrates strikeouts. Take the top two relievers in the NL in terms of WAR:
    Name            K/9    BB/9   WAR
    Carlos Marmol   15.97  6.10   3.0
    Brian Wilson    11.24  3.18   2.7

    They are very similar in just about every category, but Marmol has that ungodly 16 K/9 rate which clearly is being given more weight than that appalling 6.1 BB/9.

    • Rich says:

      The thing is, the high K rate kind of makes the high walk rate less important. A very high K rate should increase strand rate, so runs are less likely to score.

      • Kazinski says:

        Not really. Marmol seems to be pretty streaky with his walks, which is not too surprising. He’s given up 58 BB in 76 appearances so far. He had 11 multi-walk games accounting for 27 of those walks. And he has had 40 no-walk games.

        If he evenly distributed his walks and strikeouts I’d agree with you but when he gives up 5BB in .2IP with 1K as he did against Philly on 7/17, there just is no evening that out.

        And of course that buttresses my point. Wilson has a higher strand rate than Marmol, 86% to 78%, despite the lower K rate that gives Marmol the edge in WAR.

      • Danmay says:

        Kazinski-

        You make a great point. To me, the theme of this thread is that when you remove game-to-game context you lose a lot of important information. The trouble is that no stat that I’m aware of is currently accounting for individual games and sequencing. I would assume that it currently is a practicality problem in that it would take a lot of manual work to sort through so much data. That doesn’t, however, mean that we should be satisfied with the way things currently are.

  14. db says:

    WAR is supposed to measure “value” of a pitcher’s performance compared to a notional “replacement”. I think it makes more sense to give a pitcher the credit for inducing pop-ups, or fly-outs, or line drives hit at someone than to assume that he gets no credit for it at all. Hence, no FIP. The value of a fieldable (and fielded) ball is quantifiable. If I want to guess whether a pitcher will continue the performance, I can compare FIP and ERA. But I really think WAR’s descriptive function is more important than adding a predictive element. At some point, you have to look at the results obtained, not the normalized results you expect to obtain. Or, to put it another way, irrespective of how he did it, if a pitcher gives up no runs, he has outperformed “replacement” regardless of batted ball types or the defense around him.

    • noseeum says:

      Exactly. It’s WAR, not xWAR. Liriano has a much better FIP than Felix, but at the end of the day, he gave up a significantly higher amount of earned runs per nine innings than Felix.

      I don’t understand the need to eliminate all noise from WAR for pitchers when it’s not done for hitters.

      You adjust for park factors for hitters, but if a hitter gets a hit, he gets a hit. It doesn’t matter what pitcher he’s facing or what defense he’s facing.

      I understand the value of digging deeper and determining true talent, but when you’re talking about “value delivered in the past year” I actually think luck SHOULD be included. Why shouldn’t it? Luck actually happens. Whether it’s random or not, it impacts the results of the games. If a pitcher got lucky and stranded an unusually high amount of runners in a season, resulting in going from a nice 3.00 ERA to a really really nice 2.25 ERA, that actually happened, and it actually helped his team win.

      I think going with FIP is just too far down the rabbit hole for WAR. There’s plenty of luck tied into a hitter’s WAR, so why not have luck in pitcher’s WAR?

      • CircleChange11 says:

        True. Luck is not factored out of hitter’s WAR. We don’t calculate Hamilton’s 2010 WAR based off his career BABIP, we use his 2010 BABIP, even if it’s significantly higher than his career mark.

        To do the opposite with pitchers seems goofy, inconsistent at best.

        Would anyone support batting WAR as a measure of solely K, BB, and HR … under the guise that these are the only things batters have control over?

      • joe says:

        We should do it for hitters… we can appropriately call it FIB (fielding independent batting).

        Kind of amusing the 2 different approaches. Almost as amusing as the large portion of the SABR community backing Hernandez for Cy Young by ignoring the exact stats they claim are important (FIP, WAR, xFIP). The ironic thing is voting for Hernandez would effectively be trading one archaic stat (wins) for another (ERA)… yet it will be probably incorrectly hailed as a major shift in acknowledgment of advanced stats

  15. Giants 162-0 says:

    If we really want to have a great idea of how valuable a pitcher is, let’s start looking at inputs rather than outputs. Outputs have so many other factors that make it impossible to accurately calculate worth. Since the advent of PitchFX, we now have location, movement, and velocity numbers for each pitch. If we could align that data with the target (catcher’s mitt), then compare it to where certain hitters have holes in their swings, the progression of pitches (I’m thinking of the pitcher out-gaming the hitter), and the history for each in terms of how consistent the “hot” and “cold” spots are… and start accounting for player streaks (these would have long-, medium-, and short-term trends that must also be accounted for). All of this can be picked up off PitchFX data. Why is this not being used to calculate an INPUT-derived stat for valuing a pitcher’s ability/value? Then do the same thing for hitters… is that too big a can of worms for this comment section?

  16. OremLK says:

    The basic reason I’m suspicious of FIP’s usage in WAR is simply that it is not a results-based metric, it’s a process-based metric. Fundamentally, we are looking to value the results with WAR, so I think eliminating random variation is actually a bad thing; we don’t do it for hitters, so why should we with pitchers? It’s not just BABIP, either, that we are eliminating–it’s LOB percentage.

    ERA is an incredibly flawed metric for predictive purposes, and I’m sure there is a much better way to separate the contribution of the fielder from the contribution of the pitcher for a results-based analysis as well… and yet, I still wonder whether it wouldn’t be a better measure of results than FIP.

    • db says:

      I fully agree. When we think of a replacement pitcher, we assume a guy who would give up around 6 runs per 9 innings, not a guy who would give up two home runs, four walks, six Ks, and a whole bunch of batted balls. WAR is not a granular stat, but a macro stat showing performance.

  17. noseeum says:

    Sorry I didn’t even realize that there was a second article. I like what Dave is saying here about the two approaches, but shouldn’t there be three approaches?

    1. Use FIP, which blames BABIP entirely on the defense
    2. Use RA, which assumes that each pitcher got the same support from their teammates.
    3. Use ERA, which blames BABIP entirely on the pitcher

    I don’t see how 3 is any more wrong than the other two, and having all three would be much more illuminating than just having the first two.
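
For reference, here is a minimal sketch of how the three rate stats in that list are computed from a pitcher's season line. The FIP formula is the commonly published form; the league constant of 3.10 and the sample stat line are assumptions for illustration only, not real data, and none of this is the WAR calculation itself.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching: uses only HR, BB, HBP, and K.

    `constant` scales FIP to the league ERA; 3.10 is an assumed value.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def ra9(runs, ip):
    """Runs allowed per nine innings (earned and unearned)."""
    return 9 * runs / ip


def era(earned_runs, ip):
    """Earned runs allowed per nine innings."""
    return 9 * earned_runs / ip


if __name__ == "__main__":
    # Hypothetical season line, not a real pitcher.
    line = {"ip": 200.0, "hr": 20, "bb": 50, "hbp": 5, "k": 180,
            "runs": 90, "earned_runs": 80}
    print("FIP:", fip(line["hr"], line["bb"], line["hbp"], line["k"], line["ip"]))
    print("RA9:", ra9(line["runs"], line["ip"]))
    print("ERA:", era(line["earned_runs"], line["ip"]))
```

The gap between the three numbers for the same pitcher is what the list above is arguing about: which of them should be the input to WAR.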

  18. Kazinski says:

    I think ERA may be better to use also, because ERA measures results, as WAR does. As the post below from Inside the Book points out, ERA almost always matches FIP over a career. And it has the benefit of taking human input from the scorekeeper as to whether a BIP was a playable ball or a hit.

    There you go.
    http://www.insidethebook.com/ee/index.php/site/comments/mike_silva_chronicles_part_4_fip/

    • J-Doug says:

      WAR does not measure “results,” it measures talent. ERA describes what happened, and not in a very good way. The fact that ERA matches up with FIP over a career is because FIP predicts future ERA very well. The reverse is not the case. ERA in Season 1 does a poor job of predicting ERA in Season 2, compared to both xFIP and FIP (and SIERA and just about any other DIPS).

      The entire point of WAR is to get a measurement of pitcher performance that ignores the noise inherent in ERA, which is generated by defense, random chance, and scorekeepers who make arbitrary decisions about errors that are arbitrarily defined in the rulebook.

      • Guy says:

        No, the point of pitcher WAR is to measure how many wins a pitcher created (hence the name). So yes, it should ignore defense. But it should NOT ignore random chance, just as WAR for hitters incorporates quite a bit of chance.

      • noseeum says:

        Where did you get the idea that WAR is meant to measure talent? It’s not. It’s meant to measure value delivered. If you think it’s meant to measure talent, you are very sorely mistaken.

        Take Brett Gardner as an example. He has a top 10 WAR this year, but no one thinks he’s a top 10 talent. A lot of his WAR is tied up in some incredible defensive numbers. Sure, he’s a great fielder, but can he keep that up? Without power, can he sustain such a high OBP for long? We don’t know, but we DO know what he’s done this year, and WAR captures that fairly well. Even if Gardner stinks every year from here on out, he had a heck of a 2010. WAR is about results, not talent.

        If it was about talent, offensive players’ WARs would be reduced due to excessively high BABIPs (see Jackson, Austin).

        Again, that’s why I don’t understand going with FIP for WAR for pitchers. We’re not looking for talent. We’re looking for the best depiction of the results that happened. I know ERA doesn’t tell the whole story of WHY something happened, but it sure is a simple description of what happened that matters, i.e. the number of runs scored that definitely were not caused by errors while a certain pitcher was on the mound. It’s got its problems, but ERA correlates much more closely with the game I watched than FIP.

      • Why only for pitchers? says:

        To Guy (and Dave’s reasoning): Why do fielding issues have more impact on the evaluation of pitching than of hitting? Fielding has the same effect on the play from both perspectives. Why isn’t the WAR for a hitter based on FIH? (Of course your first reaction is that this is ludicrous, but it is just as ludicrous for pitchers.) In addition, the reasoning here attempts to remove the human variables (both talent and fallibility) in fielding from the equation, but still leaves them in for the rest of the pertinent events in the plate appearance: the missed strike/ball calls of the ump, hitters missing mistakes or hitting good pitches, pitches not taking the path that the pitcher intended. What is the basis for choosing to ignore only the factors from fielding? Part of my guess (since I seem to be on this theme today) is sensibilities: fielding issues seem to get applied to pitching much more than to hitting, and there was a relatively convenient way to attempt to address them from a pitching perspective. Until someone starts tracking all of the variables (including where defensive players were positioned every time a ball is hit, and understanding that we will never be able to know what a pitcher intended), it seems to me that there should be better substantiation before removing such a large portion of your sample simply based on an outcome.

  19. astrostl says:

    I get rWAR using RA. I would get it if fWAR used xFIP or tERA, but I think FIP is in too weird a zone between results and skill.

    • vivaelpujols says:

      I agree. FIP – by taking out sequencing, pitcher skill + other things not captured by FIP like pickoffs, holding runners, etc. – is clearly erring on the side of a skill-based metric. I’m sure that David and FanGraphs believed FIP to be a good value estimator at the time, but I think it’s clear at this point that RA adjusted for park, defense, and quality of opposition is a far better value metric.

  20. WY says:

    I appreciate the frankness with which Dave discusses the flaws in each model. I certainly don’t know the solution, but in any case, I think it’s better to admit that there are shortcomings and that things aren’t perfect than to stick one’s head in the sand and pretend otherwise.

  21. frug says:

    I think you failed to address one major problem with FanGraphs’ version of pitcher WAR: it does not include pitchers’ defense. I understand that you don’t want to give credit to pitchers for the work that is done by other defenders, but pitchers unquestionably should get credit for their own defensive contributions. This isn’t an issue for RA-based WAR, since a pitcher’s own defensive abilities are already incorporated, but it is a huge issue for any DIPS-based WAR system, and that is why I think Baseball-Reference’s version is more accurate for determining pitcher value.

  22. mowill says:

    The premise that defense is less consistent than offense is laughable. Defense is far more consistent: about 96% of the time routine defensive chances occur, they are converted into outs. How is that not consistent?

    Now the difference between the guy who converts 96% of his routine chances and the guy who converts 99% of his routine chances (and makes more spectacular plays) is where defensive value comes from and it is helpful and wise to try to find out who the best defenders and team defenses are. But to suggest that the variable is so large as to discount defensive contributions completely is akin to throwing the baby out with the bathwater. Defensive metrics are not perfect but they are definitely advanced enough to allow for pitcher performance to be adjusted accordingly. I mean it is not like pitching performance is adjusted for using team fielding %, although I could make a case that that would still be better than no defensive adjustment at all.

    Also, of all the new pitching metrics, FIP is by far the worst. Good pitchers pitch around great hitters in key situations (intentional unintentional walks, so to speak), and FIP penalizes them for it. Good pitchers also often pitch to contact with a big lead to try to pitch a complete game, which FIP penalizes as well. I like FIP, but I hate basing pitcher WAR on it. I would like to see FanGraphs’ pitcher WAR based on a basket of advanced metrics and adjusted for team UZR. In economics, calculations like this are used all the time, because the more accurate data points you have, the more valuable the information. FanGraphs is doing a disservice to itself by using only FIP when calculating pitcher WAR.

    • noseeum says:

      “I mean it is not like pitching performance is adjusted for using team fielding %, although I could make a case that that would still be better than no defensive adjustment at all.”

      It is with ERA. The problem with defensive fielding percentage, and thus ERA, is that it doesn’t take into account range at all. If Ozzie Smith gets to a ball and drops it, he gets an error, even though no other shortstop could have even made an attempt at the ball.

      That’s a problem, but I would rather accept that problem than use FIP because on the offensive end in that situation, the hitter gets a hit with the average shortstop and he gets no hit with Ozzie’s error. That hit/no hit is incorporated into his WAR, and I think it’s just more realistic to do the same for the pitcher.

      • mowill says:

        I wasn’t at all considering ERA when I wrote that. Having read some of your earlier comments, though, I think you have the best idea: let’s use FIP, RA, and ERA and adjust them to league average. I think a pitcher WAR based on FIP+, RA+, and ERA+ would be about perfect. I say adjust to league average because of years like this one: overall offense is obviously down and strikeouts are up, and that should also be reflected.

      • noseeum says:

        I completely agree, mowill. I have all the best ideas! ;)

        Thinking more about my 3 different WAR scenarios, though, I do think that would be great.

        We all should agree that FIP and ERA are two sides of the same coin. FIP blames everything on defense and ERA blames everything on the pitcher, except errors. No wonder they correlate so highly!

        For someone to say “using ERA for WAR is ridiculous!!!!”, I would argue that FIP is just as ridiculous in the opposite way. BRef tries to square the two. I can understand Fangraphs saying, “We’re not going to square the two until we know how.” That’s valid. But I don’t see how FIP is any better than ERA at all. It’s equally wrong on the opposite side of the spectrum. Fangraphs could have said, “We’re not going to square the two, so we’re going to use ERA.” What’s the difference?

        I think having all three WAR values would be the most honest presentation. None of them claim superiority, but the three presented together can give us a very clear idea of a pitcher’s performance for a season.

  23. intricatenick says:

    Thanks for the post Dave. I appreciate the candor. I’ll try and balance out the negative bias on comment feedback.

  24. pft says:

    I wonder if team UZR could be used for the IP a pitcher has pitched. Of course, that is SSS, but assuming UZR captures what happened, and the SSS UZR is not used to claim anything about the team’s ability, it should be OK.

  25. MGP says:

    Great write-up. The explanation was very interesting and easy to follow. I’ve been wondering about the difference between the 2 WARs for a while.

  26. Guy says:

    Dave:
    It seems like there are two important issues on the table that your posts don’t address:
    1) Do you have any principled objection to using PZR/UZR to determine how much responsibility a pitcher had (as opposed to his fielders) for his hits allowed? A lot of details would have to be worked out, but do you agree with the concept?

    2) Is there any reason in your view not to credit a pitcher with the impact of the sequencing of FIP events, plus their share of the sequencing of non-HR hits?

  27. bstar says:

    Is anyone else watching the Braves-Phillies game on Fox? I’ve been tracking how many hit/out outcomes have had ANYTHING at all to do with defense and how many outcomes have been influenced by defense. Counting Domonic Brown’s failure to lumber over into the gap and come up with a Derrek Lee line drive, which just happened, I’ve seen only 3 cases in four full innings of plays that wouldn’t have been made by every fielder on a major league roster. What I’m trying to do is get an idea in my mind of how many plays over the course of a typical game have anything to do with good/bad defense. My hypothesis is it’s not as many as we might think (and it’s certainly not EVERY batted ball). I think maybe only one play at most per full inning is affected by the skill of defenders.

  28. Rob says:

    One thing I’ve always wondered about pitcher WAR is whether pitchers *should* accumulate more WAR over a season than hitters do, given the amount of control they have over the outcome of a game. I’m not trying to make a backdoor argument that “wins” matter for pitchers, but isn’t it necessarily true that a guy who goes out there for 6+ innings plays a pretty significant role in the outcome of the game? More than a guy who gets 4 ABs? If the guy at the plate has a bad day, he probably won’t have a huge role in the outcome of the game. If the guy on the mound does, game over.

    So shouldn’t a league average starter accumulate more WAR than a league average hitter? Or does the overall accumulation of plate appearances over a season circumvent this boost pitchers *should* (I assume) receive?

  29. Pat says:

    I know pitchers have little control over BABIP, but what about xBABIP? I remember reading an article on this site that discussed the difference between xBABIP and BABIP for hitters. I think that may be a better determinant of someone’s “luck”.

    Just curious, why doesn’t this site offer xBABIP in its standard/batted ball stat sheets? I thought that might be helpful. Are there flaws in my logic? I’m not really sure and have not done a lot of research; it just seemed reasonable to me.

  30. baty says:

    “We don’t encourage you to use any version of WAR as the be-all, end-all of analysis.”

    It should never be used as the “be-all” analysis. At the very least, maybe with making drastic value comparisons.

    WAR is value assumed, and FanGraphs uses the can’t-win-but-can’t-lose method by removing defense. While Sean’s method may not be correct, it’s cool to see a step in a direction that tries to make more “baseball sense” with these evaluations. To me, the FanGraphs system is best at representing the TIERS of pitching value from a fairly objective perspective.

    While Cliff Lee belongs near the top, no one will ever convince me that he deserves the highest Wins Above Replacement value for the 2010 season.

  31. AthleticsBraves says:

    Great work, Dave. Probably my favorite article ever on FG; it really helped me understand pitcher WAR better. The Liriano and Lee examples make me prefer FIP WAR. Lee’s 10 K/BB w/RISP clearly shows his high BABIP is largely due to bad luck.

    I think I’m right in saying that two pitchers could have very different BABIPs only because pitcher A has more liners hit right at defenders than pitcher B. Meaning differences in BABIP aren’t only on balls that are between 2 defenders or just out of reach.

    Also, Cliff Lee finished with a 10.28 K/BB, shattering the MLB record. Does that help convince you?

    • Guy says:

      Lee’s K/BB ratio with RISP doesn’t prove a thing. I thought it was one of the weaker arguments Dave made. We have a huge amount of data on this, and we know that a pitcher’s K/BB ratio is only a very weak predictor of his BABIP. The two do not have much to do with one another. So it simply doesn’t follow that Lee’s hits allowed with men on base must reflect bad fielding.

      What likely happened is that Lee just happened to give up a bunch of hits at bad times. That doesn’t reflect a lack of “skill” in the sense that we would predict Lee to have this same problem in the future. But he did actually give up the hits, and it really did hurt his team. When men were on base Lee almost certainly gave up balls in play that were much harder to field than when he didn’t have men on base. (But again: Dave can use PZR/UZR data to find out if this is true. We shouldn’t have to guess.)

      Dave writes as though the only possible factors here are pitcher skill and defense. But that misses a factor that is much larger than either of those: random variation in a pitcher’s performance (and that of opposing hitters). The fact is, a pitcher will just give up more hits at certain times. In one season, he may have a lower BABIP at home than away; the next season it might be the opposite. He might give up few hits on weekends, more hits on weekdays. None of these are skill. But there is also no reason to think it mainly reflects differences in fielder performance. The most likely explanation is “sh!t happens.”

    • CircleChange11 says:

      I think I’m right in saying that two pitchers could have very different BABIPs only because pitcher A has more liners hit right at defenders than pitcher B.

      People need to think about that assumption and imagine what it would REALLY look like in a game/season.

      As a former pitcher, I cannot imagine it being a realistic scenario where Trevor Cahill gets 50 more BIP hit right at fielders or fielders making great catches than any of the other pitchers on the A’s staff. It would be a running joke at some point.

      As I have said before, only Hugh Hefner is that lucky and for that long.

      We’re basically saying the difference between an average season and a dominant season is, well, luck on BIP. No. Freaking. Way. It is a factor, but nowhere near that significant … like saying the difference between Hamilton as MVP and Hamilton as pretty good is just luck on BIP. That might be the statistical difference, but it does not inherently mean that it is luck. Hamilton is likely doing something differently and/or better.

      But, if we never look for another cause, we’ll never find one. Just keep deferring to luck. I do admit that it’s entirely possible that the technology or advanced data (or the means to properly analyze and calculate it) does not exist yet.

  32. CircleChange11 says:

    I know I have pointed this out before, but the straw that broke the camel’s back for me was Cliff Lee’s August.

    He gave up 34 ER on 60 H in 45.1 IP in a stretch where he pitched himself out of the CYA, which he was set up to win (dominant season, traded to playoff team, etc).

    He had a 6.75 ERA in August, which to me is pretty much pitching at replacement level.
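
The arithmetic behind that figure can be checked with a short sketch. It assumes the usual box-score convention that "45.1 IP" means 45 and one-third innings; the helper names are made up for illustration.

```python
from fractions import Fraction


def innings_from_notation(ip_text):
    """Convert box-score innings ("45.1" = 45 and 1/3) to an exact fraction."""
    whole, _, outs = ip_text.partition(".")
    return Fraction(int(whole)) + Fraction(int(outs or 0), 3)


def era(earned_runs, ip_text):
    """Earned runs per nine innings."""
    return float(9 * earned_runs / innings_from_notation(ip_text))


if __name__ == "__main__":
    # 34 ER in 45.1 IP, the August line quoted above.
    print(round(era(34, "45.1"), 2))  # prints 6.75
```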

    Now, that’s not the part that got me. What got me is during this stretch he zoomed from something like 8th or 10th in WAR to 2nd. That’s when I knew that taking just fWAR as a value metric was a huge mistake.

    Around the same time, a similar discussion was occurring at TT’s blog. I had been looking at rWAR and trying to figure out the differences. At the time, the big difference was Adam Wainwright’s fWAR v. rWAR, but now it’s Lee by a landslide. I started averaging the 2 variations of WAR, and then read a Tango quote on the issue, stating “I’d rather be half right than all wrong” (or something close to that) in regards to averaging the WARs instead of just taking one or the other.

    With batters, it’s the UZR/defense aspect that seems to be the issue, where using a single season’s worth of data can fluctuate much more than it should given what we know of offense v. defense fluctuations & consistency.

    The bigger issue is not the calculation of the metrics but the usage and the perceived reliability. Far too many people are using WAR as if it is THE “debate ender”. It may be the Ace of Spades trumping any other single metric, but luckily we’re not limited to a single metric.

  33. CircleChange11 says:

    His BABIP is just .257 with the bases empty, but jumps to .350 with men on base, and is .333 with runners in scoring position. Because of that split in when his balls are being turned into outs, he has a LOB% of just 67.9%, well below average and far below what pitchers of his quality have posted this year.

    Rather than just look at a stat and conclude the first thing that pops into our minds, why don’t we look at other metrics and see if anything changes with men on base?

    From my experience as a player and coach, when no one’s on, batters are much more willing to take a fastball in the good part of the zone for a strike. When guys are on base, guys are often looking dead red for that pitch.

    We’ve discussed Lee before, and he throws a lot of first-pitch strikes and a lot of fastballs that are center to center-away. Nowhere near a nibbler.

    So the difference could be a change in the behavior of the batter with men on base, and nothing to do with what Lee is doing. He could be pitching the very same way he always has, and batters have just figured out that he ain’t gonna walk ya, so you better launch the first good fastball you see.

    That would seem more logical to me than just “Aw heck, he’s unlucky on BABIP with men on base.”

    Luck is an aspect in baseball, but there’s a lot of other causes out there as well. Luck should not be the default cause, but the last resort. We have it ass-backwards at this site, and at this point, I am not thinking it’s due to lack of intellect, but lack of effort. Yeah, I’m saying as a group we’re lazy, and just willing to chalk everything up to luck, instead of looking further and finding a more impactful and relevant cause.

  34. AthleticsBraves says:

    If Cahill got 1 more liner hit right at someone per start than another pitcher, that would be 30 more on the season and would not seem that noticeable to me. I don’t think all of Lee’s high BABIP w/ RISP is due to bad luck, but I think a large portion is. It would be interesting to see his career BABIPs w/ RISP to see if there is a pattern. I would bet there is not.

  35. Greg Maddux says:

    Interesting, related blog post on pitch sequencing and Cliff Lee/Roy Halladay over at The Book.

    http://www.insidethebook.com/ee/index.php/site/article/halladay_v_lee_does_sequencing_count/

  36. Jeffrey Gross says:

    Dave,

    This question may be directed at Sean Smith, but why does Smith’s version not neutralize BABIP on some expected BABIP basis? That would seemingly eliminate some of the bias.

    I have a third WAR method which is more akin to Smith’s, in which I first adjust BABIP based on xBABIP. Using dHit (difference in hits), I distribute the hits according to the pitcher’s current 1B/2B/3B/HR distro. Then I apply the general linear weights to those totals and adjust RA based on that. Then I account for park factors (1/2 step) and team defense. Wouldn’t that make for a more logical model?
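
A rough sketch of the first three steps described in that comment might look like the following. It is only one reading of the description: the run values are illustrative placeholders rather than official linear weights, the function and parameter names are hypothetical, and the park-factor and team-defense steps are left out.

```python
# Illustrative run values per hit type. These are placeholder numbers for the
# sketch, not an official set of linear weights.
RUN_VALUES = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40}


def adjusted_runs(runs_allowed, balls_in_play, babip, xbabip, hits_by_type):
    """Adjust runs allowed for the gap between actual and expected BABIP.

    dHit is the number of hits the pitcher would have gained or lost at his
    expected BABIP. Those hits are spread across hit types in proportion to
    his actual distribution (including HR, as the comment describes), and
    each share is converted to runs with RUN_VALUES.
    """
    d_hit = (xbabip - babip) * balls_in_play
    total_hits = sum(hits_by_type.values())
    d_runs = sum(
        d_hit * (count / total_hits) * RUN_VALUES[kind]
        for kind, count in hits_by_type.items()
    )
    return runs_allowed + d_runs


if __name__ == "__main__":
    # Hypothetical pitcher: .320 BABIP against an expected .300 on 600 BIP.
    hits = {"1B": 130, "2B": 40, "3B": 5, "HR": 25}
    print(adjusted_runs(90, 600, 0.320, 0.300, hits))
```

A pitcher whose actual BABIP is above his expected BABIP has runs removed, and one below it has runs added, which is the bias correction the comment is asking about.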

  37. Jonathan C says:

    When trying to find the value of a pitcher by assigning blame among the pitcher, the defense, and the batter, why can’t we use a type of equation (1 - % attributed to defense, using defensive runs saved converted from Dewan - % attributed to the batter, using wOBA) to separate out true value for an accounting-type purpose? Or is it a problem of small sample size that does not give us a significant result?
