## Defense And Inferential Statistics

A Definition of Inferential Statistics:

*With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone.*

This afternoon, I talked about why defensive statistics are not like offensive statistics, and closed with a statement about why I believe that defensive metrics should be viewed as inferential statistics, rather than the results of something that actually occurred. The definition above states it as well as anything I could write – what we want to do with metrics like the advanced defensive statistics we currently have is to make conclusions based on probability that go beyond the data that we have.

Let’s use a baseball example. The +/- system spit out a +47 rating for Chase Utley for 2008, calling him 47 plays better than an average defensive second baseman last year. It’s such an amazingly high number that, on its own, it’s basically unbelievable. Did Utley really display such amazing defense that he got to 47 more balls than an average fielder? And if so, how did such a remarkable performance go basically unnoticed by baseball observers?

Perhaps your initial reaction to such an unbelievable number would be to throw it out and discredit the system. After all, if I invented a metric that said that Chase Utley hit .434 last season, you’d just point to the facts and tell me I was wrong. But with defensive metrics, one of the basic tenets we have to accept is that we just can’t **know for certain** whether an average fielder would have actually fielded a particular ball, because this mythical average fielder didn’t have a chance to field that ball – only the fielder that we’re watching got a chance to field that ball. Whether anyone else could have fielded that ball has to be inferred, since it cannot be known.

This is the fundamental point to accepting defensive statistics – they know very little and infer an awful lot.

This doesn’t make them wrong or invalid. There are all kinds of statistics in life that are inferential and, when constructed correctly, give us meaningful information to make our life better. Political polling data is one of the best examples, and the match between polling data and baseball statistics got quite a bit of play with Nate Silver’s rise to fame this summer. When the data is handled correctly, inferential statistics help us answer questions we can’t figure out through descriptive statistics, and right now, defensive value is one of those things that must be inferred.

So, how do we view these numbers differently than if they were descriptive in nature? The key is to see them as data points in a larger sample and not take any one single data point too seriously. +/- thinks Utley was +47 last year. Okay. That’s nice. We’ll toss it into the stew, along with as many other valid data points as we can gather, and determine how confident we can be within certain boundaries based on the sample that we have.

If you’ve taken a college course on statistics, you’ve probably learned about t tests and how to calculate necessary sample sizes based on given data. We won’t go through the math here, but research from guys like Chris Dial, TangoTiger, and MGL suggests that we need at least two years’ worth of data before we can start drawing reasonable conclusions from the defensive data we have now. Two years is a minimum. Three is a lot better, and gets us close to the point where we can be comfortable with the results.

With several years’ worth of data, we can be confident that the sample is large enough that the noise in the data can be reduced to the point where our inferences can be at least generally accurate. Viewed by itself, Utley’s +47 is highly questionable. When viewed in concert with his +20 ratings in both 2006 and 2007, we can infer that Utley is probably something like a +25 defender compared to an average second baseman.

The human factor is still there, and we can’t pretend like a larger sample eliminates noise entirely, but we can begin to be confident that we can describe a player’s defensive value within a given range and be fairly accurate. Maybe we can’t prove that Utley’s a +25 fielder, but we could say that the probability of his real defensive value being between +20 and +30 is very high.
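The pooling idea above can be sketched in a few lines of Python. This is a minimal illustration, not the actual +/- methodology; the regression constant `k` is an assumed value chosen purely for demonstration.

```python
def pooled_estimate(ratings):
    """Simple average of several seasons of a defensive metric."""
    return sum(ratings) / len(ratings)

def shrink(estimate, n_seasons, k=0.5):
    """Regress a pooled estimate toward the positional average (0)
    by the weight n / (n + k). k is an assumed regression constant,
    not an empirically fitted one."""
    return estimate * n_seasons / (n_seasons + k)

utley = [20, 20, 47]          # 2006, 2007, 2008 +/- ratings
raw = pooled_estimate(utley)  # simple three-year average: 29.0
# Shrinking toward the positional mean pulls the estimate into the mid-20s.
print(round(shrink(raw, len(utley)), 1))
```

The exact shrinkage weight is the judgment call; the point is only that pooling seasons and regressing toward the mean lands you near the "+25-ish" inference rather than at the raw +47.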

When someone tells you that defensive statistics simply aren’t as reliable as their offensive brethren, they’re right – there’s no doubt that the tools we have to measure offense are more precise than the ones to measure defense. But as statistics like UZR and +/- have come along, our ability to infer reasonably accurate conclusions about defensive value has grown immensely. They aren’t perfect, but when viewed as a data point, and analyzed as an inferential statistic, we can gather all kinds of information that we’ve never had before. And that’s exciting.


Just to offer another example of inferential statistics: all those offensive statistics like VORP, which are supposed to tell us how many runs or wins a particular player’s performance is worth, are inferential. So is runs created. So is WPA. All these statistics are our best guess about how much a certain player’s performance influences offense. In the case of VORP, it’s a measure of how many additional runs a particular player created as opposed to a logical fiction (the player who would have replaced him). In WPA, it’s our best guess about the probability of a game. But of course, that probability is estimated based on a sample of actual games that featured different lineups, different pitching, different parks, and so on. Any offensive stat that’s supposed to have a direct interpretation in terms of runs produced or wins created essentially involves an *inference* about how that performance affected actual run production. So many of our favorite offensive stats are inferential too.

Are inferential stats worse than descriptive stats? Hell no: RBI is a descriptive stat that tells you a lot less about a player’s offensive contribution than (wOBA – .335)/1.15*PA – which is one way to attach a run value to player performance.
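For concreteness, here is that run-value formula as a small Python function. The .335 league wOBA and 1.15 scale come straight from the comment above; the .400 wOBA hitter over 650 PA is purely hypothetical.

```python
def woba_runs(woba, pa, lg_woba=0.335, woba_scale=1.15):
    """Runs above average via (wOBA - lg_woba) / scale * PA."""
    return (woba - lg_woba) / woba_scale * pa

# A hypothetical .400 wOBA hitter over 650 PA:
print(round(woba_runs(0.400, 650), 1))  # about 36.7 runs above average
```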

Q: Can pitch/fx, now or in the future, measure the precise vector on which a ball leaves the bat? That would take some of the guesswork out of defensive stats.

I was thinking about this while messing around at a PGA Superstore after Thanksgiving, where they have simulators that purport to tell you, when you hit a golf ball into a screen, the velocity, angle, rotational speed, and rotational angle of the ball.

You’d still have to measure a lot of other things to get a good read on precisely where a ball would land and how fast it gets there (e.g., precise point of contact, atmospheric conditions, effect of grass in different ballparks), but it seems to me that would be a good start.

Or do Dewan/BIS/others already think they can make accurate guesses on these things from video?

Given how quickly a player’s defensive abilities can decline, do you think defensive stats are caught in a bit of a catch-22? You need at least three years of data to have reliable numbers, but by the time three years have gone by, the player is certain to be worse than when he started out. Certainly there are guys out there who are very consistent fielders and decline only nominally, but what about the other half or two-thirds?

It certainly seems to be a catch-22 if you only used them in a vacuum. But these defensive metrics are the same as offensive metrics in the sense that they are part of a toolbox. That toolbox naturally includes real-world information like ages and injuries. If a player goes +18, +20, +5, then information like whether he turned 35 or underwent knee surgery in the offseason (or not) should play a large role in determining whether that +5 season is “noise” or is more likely to be predictive of future performance.
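One simple way to implement the "toolbox" weighting described above is a recency-weighted average. The 3/4/5 weights here are an assumed heuristic for illustration, not anything from the +/- system itself.

```python
def weighted_rating(ratings, weights):
    """Recency-weighted average of seasonal defensive ratings.
    Older seasons get smaller weights; injury or age information
    could further shrink the weight on a suspect season."""
    total = sum(w * r for w, r in zip(weights, ratings))
    return total / sum(weights)

# Seasons oldest to newest: +18, +20, +5, weighted 3/4/5.
# The recent +5 pulls the estimate down, but not all the way.
print(round(weighted_rating([18, 20, 5], [3, 4, 5]), 2))
```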

Once a true catalog of defensive data is gathered over the course of a decade or so, I am excited about what it may be able to tell us not just about the players, but about some of the defensive park effects that have been relatively anecdotal up until this point.

Correct me if I am wrong, but isn’t +/- data collected by going through and assigning a value to a play that occurs in a certain vector on the field? Doesn’t it also factor in what the ball in play is classified as (GB/FB/LD), and where the fielder started and where he ended up? Again, I am just now reading up on +/-, but I am pretty sure that is how they described it on the website. And if that is the case, I think that is pretty concrete, as everything in the play is factual – they assign the value of the play against how often a similar play with the same factors results in an out.

I guess basically what I am asking is: if they can chart every play against what the average defender does with the same play, how is the human element still noise? Is it now assigned a factual average?

Dave,

Your definition of inferential statistics and your use of the concept in this piece are puzzling.

Inferential statistics are simply a family of statistics used to draw conclusions about a population from a sample.

They were developed because most often we can’t measure across the entire population of anything so we are forced to draw samples.

Your use of the term with reference to +/- is particularly puzzling in this regard.

In calculating +/- there is no true sample. The analysis encompasses the entire population of all of the plays relevant to that particular player. Every play is examined over the course of the season. No sampling is done.

As a whole, inferential statistics have limited application to baseball, as samples are rarely drawn. The discrete nature of the data and its finite scope allow for entire populations to be analyzed. Baseball is unique in that way among statistical endeavors.

+/- is simply a kind of descriptive statistic. Now you may try to use it to draw conceptual conclusions about the player’s underlying defense but that does not make it an inferential statistic.

In this piece you seem to be confusing statistics with the underlying construct the metric is trying to operationalize.

The problem with the quantitative analysis of defense has little to do with the nature of inferential statistics. It has to do with psychometrics, such as construct validity. This is where the quantitative analysis of defense in baseball runs into difficulty.

Great discussion here everyone.

One thing that has always made me wonder about the Plus/Minus system is the human observers that are used, and whether they are assigned to watch the same teams over and over again, even by accident. It could be that Observer A analyzed significantly more of Utley’s games than any of the other observers, and because of Observer A’s possibly skewed interpretation of the analysis parameters, Utley comes out with an absurd +47.

> Inferential statistics are simply a family of statistics used to draw conclusions about a population from a sample.

Right – that’s basically the definition I wrote to start this piece.

> They were developed because most often we can’t measure across the entire population of anything so we are forced to draw samples.

Exactly.

> In calculating +/- there is no true sample. The analysis encompasses the entire population of all of the plays relevant to that particular player. Every play is examined over the course of the season. No sampling is done.

And this is where we disagree – I’m stating that a season worth of defensive data needs to be treated as a sample, not a population. If you view a single season as a total population, you’re going to be misled by defensive numbers or assume they’re simply not accurate enough to be taken seriously.

If you instead see them as a sample of a much larger population (say, a player’s career), then you can begin to make some inferences about a player’s ability from the data.

> If you view a single season as a total population, you’re going to be misled by defensive numbers or assume they’re simply not accurate enough to be taken seriously.

This isn’t a question of measurement you’re now bringing up. It’s a question of what the proper unit of analysis is. That’s a conceptual question – not a statistical one.

And that depends on what the question you are asking is. If you are asking “what is the innate defensive ability of this player,” then sure, one season can be considered a sample of the player’s career population.

But you could do that with any statistic in baseball – offensive or defensive. A single season of data can always be looked at as a sample for a career.

In general, the question of innate ability isn’t the one analysts are concerned with, however. The question of interest is the empirical question of what this player’s performance was during a prescribed period of time – and the period of time is most often one season.

When an analyst tries to do something such as determine net offense + defense for 2008, the unit of analysis is the season. Not the career. So empirically, the season for the defensive statistics can’t be considered a sample. It is the population. When Chris Dial does his offense + defense or Baseball Prospectus does their WARP calculations, the unit of analysis is the season. You can’t just change that for defensive statistics, as that is conceptually flawed.

You are reversing concepts in order to explain away inherent problems in measurement and defects in the metrics.

As an example, take a player who is hurt and loses range during a season. If you are interested in his innate defensive ability, you could consider that injured season a sample of his career population.

But that doesn’t change the empirical nature of his performance. If you don’t consider the season a population then you are shortchanging his actual performance on the field.

And in general, considering a season as a sample of a career population is very problematic, because we know that physical skill deteriorates over time. In this regard, there is no true “innate” ability that remains constant. To consider year 12 of a shortstop’s career as a sample of his overall career population is technically true, but his innate ability in year 12 may be completely different than in year 6 or even year 9 of his career. Just grouping data from multiple seasons together creates its own problems.

Now I would agree with you that in, say, doing projections – which are not inherently empirical – you should use multiple seasons of defensive numbers due to the wide variation in the numbers.

But that’s an issue of variation. There’s a major confounding issue in that we don’t know for certain with these metrics whether defensive performance varies widely year to year or whether it’s an issue of the methodology through which the measurements were done.

The problematic issue in the quantitative analysis of defense is psychometric in nature and not an issue of statistical inference.

Unless sabermetricians understand this, the quantitative analysis of defense will remain highly problematic.

These kinds of issues are not unique to baseball. It’s actually a pretty common, routine problem in other fields.

> And that depends on what the question you are asking is. If you are asking “what is the innate defensive ability of this player” then sure one season can be considered a sample of the player’s career population.
>
> But you could do that with any statistic in baseball – offensive or defensive. A single season of data can always be looked at as a sample for a career.

Right – I probably worded this poorly initially, but I’m suggesting that we should treat defensive statistics as inferential in nature and be more concerned with figuring out a player’s innate ability than with trying to figure out how he performed in any single season, since the tools we currently have just aren’t good enough to do the latter.

> The question of interest is the empirical question of what this player’s performance was during a prescribed period of time – and the period of time is most often one season.

Which is why I’m suggesting we need to change the way we view them.

> But that doesn’t change the empirical nature of his performance. If you don’t consider the season a population then you are shortchanging his actual performance on the field.

I’m saying that we don’t know what his actual performance on the field was. There is too much noise in the data to determine that.

> Just grouping data from multiple seasons together creates its own problems.

These problems are fairly easily resolved through aging curves. We understand that players aren’t the same at 25 as they are at 35, but we can account for that without a significant headache.
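As a sketch of the aging-curve adjustment mentioned here: translate older seasons to the present with an assumed linear decline. The 1.5 runs/year figure is invented for illustration, not an actual fitted aging curve.

```python
def age_adjust(rating, seasons_ago, decline_per_year=1.5):
    """Discount an old season's rating by an assumed linear
    aging decline (runs per season) before pooling seasons."""
    return rating - decline_per_year * seasons_ago

# A +20 season from two years ago reads as roughly +17 today
# under the assumed 1.5 runs/year decline.
print(age_adjust(20, 2))  # 17.0
```

Real aging curves are nonlinear and position-dependent, but even a crude adjustment like this lets multiple seasons be pooled without pretending the player never aged.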

> There’s a major confounding issue in that we don’t know for certain with these metrics whether defensive performance varies widely year to year or whether it’s an issue of the methodology through which the measurements were done.

I’d say it’s much more likely a flaw in the measurements.

I’m not sure that data for less than three years is valid even for more visible skills such as hitting and pitching (ERA and FIP for the latter converge after three years, but often diverge significantly over shorter periods).

In the case of Utley, I’m willing to accept that he’s 25 plays (2-3 wins) better than average. With his offensive skills in addition, he’s clearly worth more than someone who’s, say, five wins better on offense but replacement level on defense (meaning a contribution of -2 versus the average player).

On the other hand, Oakland’s Mark Ellis, basically an average hitter, was paid mostly for his hitting statistics, with very little regard for his 2-3 extra win defensive contribution. Oakland got a bargain, as you pointed out, Dave. But then they’re Oakland.

> I’m saying that we don’t know what his actual performance on the field was. There is too much noise in the data to determine that.

Then the psychometric properties of the metric are poor and don’t allow for drawing valid conclusions.

Grouping data together may make the numbers look smoother, but it doesn’t alter the underlying validity problem at all.

> Then the psychometric properties of the metric are poor and don’t allow for drawing valid conclusions.

Or the samples just aren’t large enough, and need to be expanded by including multiple years of data and doing the relative adjustments to make that work.

> Grouping data together may make the numbers look smoother, but it doesn’t alter the underlying validity problem at all.

The underlying problem is variance due to sample size issues. That is a fixable problem.

> The underlying problem is variance due to sample size issues. That is a fixable problem.

This is where we disagree. To me this is not primarily an issue of statistical variance and sample size.

It’s an issue of validity, particularly face validity and construct validity.

There’s no way to conclude that the problem is variance due to sample size issues if the psychometrics of the measure aren’t stringently considered.

As far as I know, almost no psychometric evaluation of defensive metrics has ever been done. So I don’t see how one can just arrive at the conclusion that the issue is sample size related variance.

Only after the psychometric behavior of the measure is improved can you assume that variance is due to sample size issues when you are able to measure the entire season.

It’s the concept not the numbers.

Again – if you want to know how the player empirically performed on the field – what actually happened – then the unit of analysis has to be the season. Turn the season into a sample and you are measuring an entirely different construct.

It seems strange to change the conceptual construct in order to make the numbers perform better than they do.

Sabermetrics pays little attention to psychometrics to its detriment.

But we’re probably going to have to agree to disagree on this point.

I agree that you’re a dick.

If we should assume that +/- was off by 20+ plays in Utley’s case, why should we have any faith in it whatsoever when most fielders are separated by far smaller margins? Doesn’t that introduce an error bar of a size that renders it meaningless as a tool for statistical analysis?

I haven’t tested it on a large sample but I don’t believe the defensive statistics have any more season to season variation than ERA. I’m almost certain they have less variation than batting average. So, we should probably use the same caution with defensive metrics as we have learned to use with other measures.

I think what is more important than the year-to-year variation is the measure-to-measure variation. We know that BIS-based measures give us different results than STATS-based measures, but there are also differences between measures that come from the same database. I am more concerned with these differences than the seasonal differences, which could actually be real in many cases.


Using just one measure as your guide would be a bad idea. What I like to do is take as many measures as I can find and average the results together (or you could take a weighted average if you trusted some more than others). You find the following values for a particular player:

- +/-: +30
- PMR: +12
- UZR: +10
- ZR: +6
- RZR: +5

If you take the average of the 5 values, you temper the effect of the +30 outlier and arrive at a number (12.6) which is likely more accurate. I did this for all players on my site last year and will do it again this year once I have gathered and merged all the data. I’m very happy that UZR can now be thrown into the mix.
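A sketch of the averaging approach described above, in Python. The metric values are the ones listed in the comment; the option to pass trust-based weights is the commenter’s own suggestion, with any particular weight values being hypothetical.

```python
def blend(metrics, weights=None):
    """Average several defensive metrics for one player.
    Equal weights by default; pass a dict of weights to trust
    some systems more than others."""
    if weights is None:
        return sum(metrics.values()) / len(metrics)
    total = sum(weights[name] * value for name, value in metrics.items())
    return total / sum(weights.values())

player = {"+/-": 30, "PMR": 12, "UZR": 10, "ZR": 6, "RZR": 5}
print(blend(player))  # 12.6, tempering the +30 outlier
```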

Dave, until someone starts to measure and record such things as fielder position on contact, the line/path of travel of the ball, and the elapsed time, the fielding metrics leave much to be desired. All of the things we want to know can be observed and measured. How long is his first step to his front, to his left, to his right, in the effort to run in the opposite direction? What’s his time to get 5 and then 10 feet to his front, to his left, etc.? What type of line/path does he take to the ball, both on balls on the ground and in the air? What is his reaction time on balls to his front, his left, his right, the fly over his head, and so on and so forth?

I don’t know whether I posted the matter here or not – I’ve only posted here about three times in my life – but this is a lot like that example in math or physics class: Train A, the ball, left the station traveling at …., and then Train B, the fielder, left its station, etc. So if you want to know whether any living human might have caught that ball, ask yourself whether Train B could have caught up to Train A before Train A departed for the OF. Or that’s how it works for the IF. When we start measuring such things as reaction time, first step, time for 5-10 feet, etc., then we’ll have some idea as to just who might be the superior Train B. In this sense, the worst thing we can do is compare baseball fielding to subjects that are not concerned with the laws of physics. Spending and voting habits are one thing, i.e., not subject to explanation by use of the laws of physics, whereas the matter of whether Train B, our newly signed free-agent 2B, is all that defensively has a lot to do with the physics of the living train.
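The Train A / Train B test can be written down directly. Every number in this sketch (reaction time, sprint speed, ball arrival time) is made up for illustration, and it assumes straight-line motion at constant speed.

```python
def can_reach(distance_ft, reaction_s, sprint_ftps, ball_time_s):
    """Can the fielder (Train B) cover distance_ft, after a fixed
    reaction time and at a constant sprint speed, before the ball
    (Train A) arrives at ball_time_s? Straight-line kinematics only."""
    time_needed = reaction_s + distance_ft / sprint_ftps
    return time_needed <= ball_time_s

# A ball 15 ft to the fielder's left arriving in 1.2 s, against a
# 0.3 s reaction and a 20 ft/s lateral burst:
print(can_reach(15, 0.3, 20, 1.2))  # True: 1.05 s needed < 1.2 s
```

Real plays involve curved routes, acceleration, and dive radius, but even this crude model shows how measured quantities could replace inferred ones.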

It would also help rather immensely if we acknowledged that fielding isn’t a one dimensional thing, and so he’s good to his left but not so good to his right, excels at coming in on balls on the ground and in the air, but appears to have some trouble with balls hit over his head, and while he has a good first step, and an above-average time when it comes to getting to 5 and then 10 feet, he seems to take that straight line across the diamond on the grounders to his sides when it seems that he might have some better results if he’d just angle it more towards the OF and so buy some extra time to intersect his path of travel with that of the ball.

Lastly, the other thing we might remember is that FPCT isn’t as bad a stat as some make it out to be. One defensive skill is to assist or record the out on the balls that you do get to, no matter how great or limited your “range”, and FPCT does provide some rather useful information in that regard.

Oh, almost forgot, but if I did post here prior on this, I probably mentioned that physics measures velocity in terms such as meters per second, and not soft, medium, and hard, as if this were some gal’s rear end that we are concerned with. I am otherwise rather unconvinced that some understand what a marginal difference in ball velocity might mean when it comes to whether he or any other living human might have caught that ball.