
MGL’s Recent Musings

Mitchel Lichtman (MGL), the creator of UZR, has been setting the record straight on UZR, among other things, over and over and over again. Here’s his most recent interview, and some other links worth reading if you’d like to gain more insight into UZR and baseball data in general:

Newsday: Bellmore’s Lichtman shows his baseball knowledge through UZR

Comment: The difference between offensive and defensive statistics

Answering: What kinds of factors skew statistical analyses of defense?

Comment: The differences between UZR and +/-

Sample Size and the Granularity of Data [John Smoltz]


David Appelman is the creator of FanGraphs.

23 Responses to “MGL’s Recent Musings”

  1. Victor says:

    Excellent stuff. Really liked it. Should be required reading.


  2. brendan says:

    short version:
    “Let’s say that we had a sophisticated device for
    measuring the result of a coin flip. Let’s call it, the “looking at the coin lying on the floor” device. OK, we flip a coin 50 times and it comes up 28 heads and 22 tails. No big deal, right? Now we flip it again 50 times and it comes up 23 heads and 27 tails. Oh my God, there must be something wrong with our measuring device! Get my point? I hope so. “


  3. Matt says:

    He wrote one thing I found interesting in a subsequent comment:

    “Using context neutral stats for MVP candidates is ridiculous whether it be offensive or defense. That is a bugaboo of mine when I see analysts and saber-friendly writers constantly quoting OPS+ or wOBA or WAR for MVP type of awards. That is absurd. You don’t create ANY value with a context neutral stat unless those context-neutral stats happen to turn into runs and wins which they often, but not always, do. One player can have an OPS+ of 110 and another 140, and the former can easily have produced more value in terms of runs and wins.”

    I’m not sure I understand this. Does this mean we should be looking at WPA rather than WAR to determine who was the most valuable? Another Yankee that hasn’t been in the MVP discussion at all, Johnny Damon, leads the AL in that category.


    • B says:

      I commented on this the other day in that thread and have yet to get a reply. Basically, my interpretation is that he’s talking about how context is necessary to get from a player’s accomplishments to his actual production. If a guy hits a double, his team doesn’t actually benefit unless he scores (and whether he scores depends on the context). My response was that a large sample size fixes this problem, so I don’t really know what he was getting at…


      • don says:

        A large enough sample size (may) fix this problem, but a season isn’t a large enough sample size.


      • Matt says:

        So how do we correct for the fact that a season isn’t a large enough sample?

        Making things context-specific seems to me to worsen the problem, not improve it.


      • PhD Brian says:

        For awards, the guy who wins is the guy who deserved to win, despite any numbers. This is because the numbers could be wrong, and we do not have any way to know whether they are wrong or right. The true very best SS defensively in the game will likely have a good UZR, but he may not, and he will most likely not get the best UZR after you crunch all the data, even though he is truly the best guy! So all you can say is that if player A has a +10 UZR and player B has a -10 UZR, it is likely but not certain that player A is better defensively than player B. Player B could still be better (Dayton Moore could be right). Therefore, if B wins the Gold Glove, it could in fact be just, even though UZR says A is better. Thus, the only option is to say the crowd’s collective wisdom is as reliable a measure as anything else we have at this point.

        The guy who wins deserved to win!


      • Kris says:

        Brian,

        Have you ever read “The Wisdom Of Crowds” by James Surowiecki? If writers actually watched all of the games, I guess it would be more applicable, but it’s definitely an interesting read.


  4. Brooksy Boy says:

    I’m confused by Lichtman’s statement that if a statistic is completely perfect, then it MUST be wrong for a certain % of players within a finite time period.

    Wrong in what way? In terms of measuring their true talent level or measuring the value of their performance within a particular time period?


    • B says:

      He’s just referring to the fact that observed results have a distribution – the coin cannot come up heads for exactly half the flips in 100% of the possible samples. The larger the sample size the more narrow the distribution becomes, but there is still a distribution because results are random chance based on the probability of the event happening. It’s sample error, and it’s always present, even if your measurement device is perfect (in other words, has no measurement error).
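      The narrowing described above can be checked with a short simulation. This is only a sketch: the function name `heads_fraction_spread`, the sample counts, and the seed are assumptions chosen for illustration, and the coin is assumed to be perfectly fair and perfectly recorded.

```python
import random
import statistics


def heads_fraction_spread(flips_per_sample, num_samples=500, p_heads=0.5, seed=0):
    """Standard deviation of the observed heads fraction across many samples.

    Every flip is recorded perfectly (no measurement error), so any spread
    that remains is pure sampling error.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(num_samples):
        heads = sum(1 for _ in range(flips_per_sample) if rng.random() < p_heads)
        fractions.append(heads / flips_per_sample)
    return statistics.pstdev(fractions)


# Larger samples give a narrower distribution, but never a zero-width one.
for flips in (50, 500, 5000):
    print(flips, round(heads_fraction_spread(flips), 4))
```

      The spread shrinks roughly with the square root of the sample size, which is why ten times as many flips narrows the distribution by only about a factor of three.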


  5. tom says:

    The simile with the coin toss does not ring true. The coin is a totally passive THING that is flipped.

    How high it is flipped; with what force of thumb flick it is flipped; whether it is caught and turned over or allowed to fall to the ground, spin, bounce, roll and come to a stop … all contribute to a HEADS/TAILS outcome.

    The ability of a fielder gifted enough to play defense at the MLB level, trained and optimally equipped, is a poor analogy for a coin toss, no matter what sample size is used.


    • PhD Brian says:

      I disagree. The point is that the true difference in ability between one professional ball player and another is far, far less than the amount of luck that goes into any one outcome. So it is very hard to measure with accuracy over small samples who is better, since that huge luck factor gets in the way. The hope is that over very long periods and thousands of outcomes your luck averages out, but that does not always happen, and you may never know or know why. This is the same for a coin. It is possible that an unfair coin lands 50/50 over a thousand fair tosses. It is possible that an absolutely fair coin lands 80/20 over thousands of absolutely fair tosses. In these cases, you can’t blame the test. Your best hope is that your measuring device or test is as accurate as you can get. You can then say it is likely but not certain that something is true. But absolute certainty never exists outside of an infinitesimally small point in time.
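      To put rough numbers on "possible but unlikely", the exact binomial tail for a fair coin can be computed directly. This is a sketch under the assumption of independent, perfectly fair tosses; `prob_at_least` is a name made up for this example.

```python
from math import comb


def prob_at_least(heads, tosses):
    """Exact probability that a fair coin shows at least `heads` heads in `tosses` tosses."""
    favorable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favorable / 2 ** tosses


# 28 or more heads in 50 flips is an ordinary result.
print(prob_at_least(28, 50))

# 80% or more heads in 1000 flips is possible but astronomically unlikely.
print(prob_at_least(800, 1000))
```

      Both outcomes are possible for a fair coin; the difference is only in how much weight each one deserves when judging whether the coin, or the measuring device, is the problem.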


    • Matt says:

      The coin analogy is a great one, because it vastly simplifies the idea of observed value vs. true, underlying value.

      You are trying to measure something’s underlying value. You have only observed events to go on. Take, say, 200 of these observed events and use them to get an estimated value. Do this 100 times. You now have 100 estimated values, each based on 200 observed events. These estimated values are going to fall in a distribution around the underlying value. Some will be right at the underlying value. Some will be far away from it. That’s the idea at work here.


      • Kris says:

        I like the coin analogy a lot. Earlier this year, someone posed a question on the forum about sample size, and I tried to sum it up without any statistical knowledge.

        It’s always interesting to watch people deal with sample sizes, because realistically, no one is right and no one is wrong. However, it’s just as interesting watching people dismiss data because of the small sample size. We have all of these mathematical tools that let us “guess” as best we can, and to refuse to use them is simply ignorant.

        Can you imagine the articles at fangraphs if all of the statistical data was used?


    • B says:

      “The simile with the coin toss does not ring true. The coin is a totally passive THING that is flipped.

      How high it is flipped; with what force of thumb flick it is flipped; whether it is caught and turned over or allowed to fall to the ground, spin, bounce, roll and come to a stop … all contribute to a HEADS/TAILS outcome.”

      None of this changes the probability of an event (heads or tails) occurring, though. The analogy works because we’re discussing the issue of sampling to find the true probability of an event. You’re talking about an MLB player’s fielding skills – essentially the probability he fields a ball. We’re trying to measure what that probability is, just as we would try to measure the probability of an event occurring with a coin (heads or tails).


  6. MGL says:

    Good questions and responses! A very “wise crowd” here at Fangraphs.

    Let me reiterate and clarify a couple of points:

    There are two sources of error with UZR or almost any other metric in baseball/sports:

    One is measurement error. With coin flips there is essentially zero measurement error (there actually is some because a person may record the outcome of a flip incorrectly, assuming that a person is doing the recording). With UZR, for example, there is lots of measurement error. A ground ball that is classified as medium and directly over the third base bag may be a groundskinner or may be a 10 hopper. It may be 38 mph or 45 mph. It may have been 3 feet to the right of the bag or 3 feet to the left. There are other sources of measurement error in the data (and maybe even in the engine that calculates UZR). What that essentially creates is a result which may not entirely match what actually happened, just like if we flipped a coin 100 times, it came up 48 heads, and the recorder wrote down 47 heads. In the long run, those measurement errors tend to cancel one another out of course. Included in measurement error is bias. Bias also causes the results of the measurement not to be commensurate with what actually happened. It is more insidious because it does not necessarily cancel itself out over the long run if it is persistent bias (if the bias is random it might). For example, if the scorer in Seattle tends to classify a lot of fly ball hits as line drives, the results coming out of Seattle will always tend to be somewhat inaccurate even in the long run.
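    The difference between random recording error and persistent bias can be sketched in a few lines. Everything here is an illustrative assumption rather than anything taken from real UZR data: the function name `mean_recording_error`, the Gaussian noise model, and the size of the bias.

```python
import random


def mean_recording_error(num_plays, noise_sd, bias, seed=0):
    """Average recording error over `num_plays` recorded batted balls.

    Each recorded value differs from the truth by zero-mean random noise
    plus a constant `bias` (for example, one scorer who consistently
    misclassifies a certain kind of batted ball).
    """
    rng = random.Random(seed)
    errors = [rng.gauss(bias, noise_sd) for _ in range(num_plays)]
    return sum(errors) / num_plays


for plays in (100, 10_000, 100_000):
    unbiased = mean_recording_error(plays, noise_sd=1.0, bias=0.0)
    biased = mean_recording_error(plays, noise_sd=1.0, bias=0.3)
    print(plays, round(unbiased, 3), round(biased, 3))
```

    With more plays the unbiased average drifts toward zero, while the biased average settles near the bias itself. More data does not remove it.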

    The other source of error is what we see with the coin flips. Random sampling error. That is usually normally distributed (maybe always – I am not a statistician). We call it random, but in reality it is not. It is usually a bunch of really nuanced things that create an essentially random outcome centered around a certain mean. Like the coin flip. Where it lands is a function of how high it is thrown, how it is held in the hand, the angle at which it is thrown, etc., but since we have so little control over these variables, the outcome is essentially random, centered around a mean of .5 heads or tails (if the flip is unbiased).

    Same thing for UZR or any similar fielding metric. For every batted ball that can be exactly described (assuming no measurement error), there is a mean fielding probability for the average fielder, or technically for all fielders combined. It might also be .5, like the coin, for a particular type of batted ball. But, like the coin landing on heads or tails sometimes, sometimes an average fielder makes the play and sometimes he doesn’t, but in our very large sample, say, 5 years, which we use to approximate the mean of an infinite size sample (the “population mean”), all those same batted balls get fielded 50% of the time. We call the variations around the .5 in any sample of batted balls (1, 5, 100, 1000, etc.) random, like we call the fluctuation in observed number of heads in 100 or 10 coin flips random, but like the coin, they are not really random. They are because of the range of the fielder, his jump on that particular play, his positioning on that play, whether he had something in his eye, whether the ball took a good hop or a bad hop (you can see, BTW, that the notions of measurement error and random sample error actually overlap in the case of defensive metrics, since a bad hop could be a certain type of batted ball that was recorded incorrectly or it could be one of those many nuanced things that combine for what we call “randomness”).
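    A toy version of this bookkeeping is sketched below. It is not the UZR engine, which is far more detailed; it only credits a fielder with the difference between what happened and what an average fielder would be expected to do. The names `plays_above_average` and `simulate_average_fielder`, the 400 chances, and the flat .5 probability are all assumptions for illustration.

```python
import random


def plays_above_average(chances):
    """Sum of (outcome - expected) over a list of (avg_prob, made) pairs.

    avg_prob is the assumed chance that an average fielder converts that
    batted ball; made is True if this fielder actually made the play.
    """
    return sum((1.0 if made else 0.0) - avg_prob for avg_prob, made in chances)


def simulate_average_fielder(num_chances, avg_prob=0.5, seed=0):
    """Chances for a fielder whose true skill is exactly average."""
    rng = random.Random(seed)
    return [(avg_prob, rng.random() < avg_prob) for _ in range(num_chances)]


# Five simulated "seasons" of the same exactly average fielder.
for seed in range(5):
    season = simulate_average_fielder(400, seed=seed)
    print(seed, plays_above_average(season))
```

    The totals scatter around zero from season to season even though the simulated fielder never changes, which is the fluctuation being called "random" above.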

    Now here is a tricky part to wrap your head around. What we call the “random sampling error” can be one of two things, although I am arbitrarily and artificially creating a dichotomy: It can be the fielder himself sometimes making a good or great play (or several) or not for whatever reasons, many of which we just don’t know and if we watched the video we still wouldn’t know. For example, some reader on one of the sites said that one year Coco Crisp just seemed to make all the spectacular plays and the next year he seemed to just miss those same leaping, running, and diving catches, for no particular reason that they could tell.

    Anyway, the second part of that “random sampling error” (which you could also call measurement error, since technically the source of that error is that we are not properly or precisely enough measuring or recording the data) is when a ball takes a bad hop or an easy hop. Or the fielder happens to be a little closer or further from the ball than he usually is or where we think he is. Again, in the long run, all of these “random sampling errors” will tend to even out, but in the short run, they might not and in fact probably won’t.

    That is why UZR or any metric which attempts to describe and/or translate an event or events will be “wrong” some percentage of time on two levels: One, that it didn’t do a good job at describing or translating what actually happened. Or two, what actually happened was not necessarily the same as what would happen if we allowed the player to have a very large number of opportunities (like the Crisp example above).

    It is likely that for any given sample, the metric will contain both sources of error. The smaller the sample of performance (how small depends on the metric and the event or events that are being measured and/or translated), the more error there will be on a larger percentage of players being measured. For example, if we have one-season UZRs for 100 players, maybe 10% of them will be “way off” and 20% of them will be pretty far off (and 70% of them will be not too far off). If we have 100 players of 5 seasons each, maybe only 5% of them will be way off, 10% will be pretty far off, etc. The key point is that no matter how long we measure our players, some percentage (it might be very small if we measure them for a long time) will always be way off, some percentage pretty far off, some percentage a little off, etc.
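    The arithmetic behind those percentages can be sketched under some loud assumptions: errors are normally distributed and independent across players, a one-season error has a standard deviation of 1.0 (in arbitrary units), and "way off" means missing by more than 1.5 of those units. None of these numbers come from UZR itself, and the function names are made up for this example.

```python
from statistics import NormalDist


def share_way_off(seasons, one_season_sd=1.0, threshold=1.5):
    """Share of players whose measured value misses the truth by more than `threshold`.

    Assumes normal errors whose standard deviation shrinks with the
    square root of the number of seasons measured.
    """
    sd = one_season_sd / seasons ** 0.5
    return 2 * (1 - NormalDist(0, sd).cdf(threshold))


def chance_at_least_one_way_off(num_players, share):
    """Chance that at least one of `num_players` independent players is way off."""
    return 1 - (1 - share) ** num_players


for seasons in (1, 3, 5):
    share = share_way_off(seasons)
    print(seasons, round(share, 4), round(chance_at_least_one_way_off(100, share), 4))
```

    The share shrinks as the measurement period grows but stays above zero, and with enough players measured the chance of finding at least one badly missed player stays high.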

    So we can ALWAYS (well, usually, depending on how many players we measured) point out at least ONE player who was far off, whether we measured all our players for one season, 10 games, or 10 seasons. If you forgot why that is, read the above paragraphs again or take a course in statistics. And that is regardless of how good our measurement device is. So pointing out one player (or two – maybe more, depending on how many players were measured) who MAY have been measured poorly tells us NOTHING about the quality of our measuring device (in this case, UZR). Nothing. Zilch. Nada. Zippo.

    If you want to know something about the quality of the measuring device, one way might be to look at ALL the players who were measured and, in the context of how long they were measured for (keeping in mind that for even the perfect measuring device, the shorter the time period we measure our players for, the higher the percentage of players we will find whose results will not be commensurate with their true talent or what we would expect their results to be over a long period of measurement time), see how many appeared to be correct and how many appeared to be incorrect. Of course those results should be a continuum and should look like a normal (bell) curve given enough players being measured. Of course the problem there is what are you comparing the results to? How do you know what results each player SHOULD achieve? In the case of UZR, I suppose you could compare them with what you think you know of their defense from scouting, reputation and observation. But then again, if you were assuming that that is the correct reference point, what is the point of UZR in the first place? UZR is supposed to add to what we think we know, not duplicate it, or it is to be discarded.

    So if you do what I said (compare UZR to what we think we know about a player’s defense from other sources) and you see, say, a player who you think is a very good defender – say a fictional player named Jerek Deter – and his UZR is -5 per year over the last 5 years, would it make sense for you to go, “O.K., there’s one that UZR got wrong. Next.” Again, what would the point of even looking at UZR be? You are supposed to go, “Hmmm. Maybe Mr. Deter is not as good as I thought he was using these other methods of evaluation.”

    Anyway I digressed a little.

    Here is a list of the first basemen with the highest and lowest UZR totals over the last 3 years:

    +27 Pujols
    +16 Kotchman
    +16 Youkilis
    +12 Helton

    -10 Giambi
    -11 Nomar
    -18 Jacobs
    -19 Sexson
    -21 Fielder

    Now which piece of evidence do you think is better at assessing UZR based on what we think we know about player defense? The above list or the fact that some genius found a player whose UZR is around zero (-.8) runs but who we think is a very good defender?


    • B says:

      “That is usually normally distributed (maybe always – I am not a statistician). ”

      “Of course those results should be a continuum and should look like a normal (bell) curve given enough players being measured.”

      Just want to quickly refresh us all on the central limit theorem. A population itself will not necessarily be normally distributed (or even distributed in a bell curve, though that tends to be the most common distribution). The means from samples of a population, though, even if that population is not normal, will approach a normal distribution according to the central limit theorem.
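      A quick way to see the theorem at work is to draw from a clearly non-normal population and look at the sample means. The sketch below uses an exponential population purely as an assumed example of a skewed distribution; the function names and sample counts are also just illustrative. For a normal distribution about 68% of values fall within one standard deviation of the mean, while for this exponential population the figure is about 86%.

```python
import random
import statistics


def sample_means(sample_size, num_samples=5000, seed=0):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(1 for v in values if abs(v - mu) <= sd) / len(values)


# sample_size=1 is just the raw population; larger samples look more normal.
for size in (1, 5, 50):
    print(size, round(share_within_one_sd(sample_means(size)), 3))
```

      As the sample size grows, the share drifts from the population's own value toward the normal benchmark, even though the underlying population never becomes normal.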


      • Kris says:

        Are we assuming the margin of error in these ballparks is normally distributed? Excuse my ignorance, I’ve never been a UZR guy, but the systematic error just seems to be a headache. Do we just assume it’s the same person taking the measurements? Or are we assuming that all measurement errors should fall into a Gaussian distribution?


  7. MGL says:

    Kris, there are many persons doing the measurement both at the park and on live and recorded video (I am not sure how BIS does it – STATS used to do it at the park and then do it again on video). I am not sure that makes a difference in terms of the distribution of the measurement errors, but there are also biases that will preclude those errors from looking like a normal distribution, I think. Biases by individual “stringers” (the people who record the data) and biases caused by factors at each park (for whatever reason, some line drives at one park may look like fly balls at another, or there may be difficulties or biases in how the distance or location of batted balls are perceived).

    I think what you are going to get is some of that measurement error contributing to the fluctuation in samples of performance, but tending to even out in the long run, and some of those biases contributing to persistent inaccurate results in the short and long run. For example, STATS UZR for years had Ichiro as a very poor defender in Seattle (his BIS UZR was better, I think). This may have been because of some bias in the park or by the stringer who tends to record data in that park (if there even is one person who tends to do that – I don’t know). Or it may have just been random, non-biased error. I don’t know. But it is likely that there is some persistent bias in the data. That is true to some small extent with offensive data of course, in terms of ROE (some players may happen to be the beneficiaries of an overly generous or stingy official scorer in a season or more).


  8. MGL says:

    B, thanks for the reminder. Yes, the means of unbiased, random samples of data from a population with a fixed mean will always be normally distributed, given a large enough number of samples of course. I think that’s right.


  9. kpedrok says:

    Hey David. I don’t know if it was you or someone else on the site, but a few weeks ago someone mentioned that if readers had any links they would like Fangraphs to share, they should send them in. I just put together this article about which players to like and not like heading into 2010 in the steals category. If you or whoever could include it in one of your posts I would really appreciate it. Thanks for all the hard work you do – your site along with Baseball Reference help make my blog go!

    Here’s the link: http://fantasybaseballhotstove.blogspot.com/2009/10/stealing-steals-category-in-2010.html

