Bradley Woodrum says:
April 20, 2011 at 1:05 pm
Ever still brilliant.
Rick says:
April 20, 2011 at 1:10 pm
I’ve always thought that the PA threshold approach was too cut and dry. People will go from saying we have to ignore the data point to fully accepting it as predictive. What I think would be more useful is to provide confidence intervals given a full season of data. So it might be:
AVG: +/- .030
OBP: +/- .020
SLG: +/- .050
BABIP: +/- .040
UZR: +/- 10
Whatever the real values are, the interpretation would be easier and if cited frequently, the values would become a part of getting to learn the stat. Just like we know intuitively how much better a “.300 hitter” is than a “.270 hitter”, we might also get to intuit that a guy who hit .280 last year is likely to have his true talent be somewhere between .250 and .310.
I suppose the PA threshold is useful in understanding when we should start really paying attention as the year goes along, but in terms of pulling the concept of reliability into the daily conversation about stats, I think a CI approach may find more traction.
Elias says:
April 20, 2011 at 1:22 pm
i’ve asked this question before, but why do all of this (inefficient) split sample stuff, when you could just analytically estimate standard errors? this is especially true for binary outcome variables like AVG and BABIP, for which standard error formulas are ridiculously easy to compute.
(full disclosure — i haven’t read the entire piece yet, so if you already answered this somewhere, sorry!)
Elias says:
April 20, 2011 at 1:27 pm
didn’t see this comment before my post… my thoughts exactly.
mattr84 says:
April 20, 2011 at 1:29 pm
Have bootstrapping or randomization/permutation methods been considered to create multiple data sets from which to then obtain correlations?
mattr84 says:
April 20, 2011 at 1:34 pm
Intriguing addition. Correct me if I am mistaken, but aren’t standard errors also affected by sample size?
TwainsYankee says:
April 20, 2011 at 1:49 pm
That’s kinda what I was thinking, and I was wondering why he did split sampling when a bootstrap would allow him to put CIs on his correlations as well.
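For readers wondering what the bootstrap suggestion might look like in practice, here is a minimal sketch. It is not the method used in the article: the player data is simulated, and the talent spread, the number of players, the PA per half, and all function names are assumptions made up for illustration. It computes a split-half correlation and then resamples players with replacement to put a percentile interval around it.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def simulate_split_halves(n_players, pa_per_half, rng):
    """Hypothetical data: each player has a fixed on-base talent and two
    independent halves of pa_per_half plate appearances each."""
    pairs = []
    for _ in range(n_players):
        # Assumed talent distribution, clipped to stay a valid probability.
        talent = min(max(rng.gauss(0.330, 0.030), 0.05), 0.95)
        first = sum(rng.random() < talent for _ in range(pa_per_half)) / pa_per_half
        second = sum(rng.random() < talent for _ in range(pa_per_half)) / pa_per_half
        pairs.append((first, second))
    return pairs


def bootstrap_corr_ci(pairs, n_boot, rng, alpha=0.05):
    """Percentile bootstrap: resample players (not PAs) with replacement."""
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        xs, ys = zip(*sample)
        stats.append(pearson(xs, ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


rng = random.Random(2011)
pairs = simulate_split_halves(n_players=200, pa_per_half=300, rng=rng)
r = pearson(*zip(*pairs))
lo, hi = bootstrap_corr_ci(pairs, n_boot=1000, rng=rng)
print(f"split-half r = {r:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Resampling whole players keeps each player's two halves paired, which is what the correlation is measuring. With real data, `simulate_split_halves` would be replaced by the actual per-player half-sample rates.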
Elias says:
April 20, 2011 at 2:38 pm
here’s how this would work – the 95% confidence interval on carl crawford’s AVG *right this second* is given by:
AVG +/- [1.96 * SQRT( AVG*(1-AVG)) / SQRT(AB) ] =
.143 +/- [1.96 * SQRT(.143 * .857) / SQRT(63) ] =
.143 +/- .086
so fear not, sox fans, we can’t reject that crawford’s true talent thus far this season describes a 0.229 hitter!
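The arithmetic above can be reproduced in a few lines. This is only a sketch of the normal-approximation interval described in the comment. The function name is made up here, and the 9 hits are inferred from .143 over 63 AB rather than stated in the thread.

```python
import math


def binomial_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (estimate, lower, upper). Assumes independent, identically
    distributed trials, which is only roughly true for at-bats.
    """
    p_hat = successes / trials
    se = math.sqrt(p_hat * (1 - p_hat) / trials)
    half_width = z * se
    return p_hat, p_hat - half_width, p_hat + half_width


# 9 hits in 63 AB is an inference from the .143 AVG quoted above.
avg, lower, upper = binomial_ci(9, 63)
print(f"AVG = {avg:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The normal approximation gets rough when the number of hits is this small, so the interval should be read as a ballpark figure rather than an exact one.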
Al Dimond says:
April 20, 2011 at 2:59 pm
Cool read; there was a similar study on pitching stats quite a while ago, and it’s certainly been instructive in how I look at pitching performances. I was a little surprised how quickly home run rate stabilized compared to rates of other hit types… maybe I’m so used to hearing the conventional wisdom about “streaky sluggers” and “consistent contact hitters” that it still influences how I think about things.
I’m not that surprised that the WPA-based stats didn’t stabilize quickly. There are tons of factors that go into those besides batter skill. Even WPA/LI, which is still sensitive to context in its own way.
At some point I’ll have to look up your homemade plate discipline stats. They sound interesting. *goes for a cookie*
Al Dimond says:
April 20, 2011 at 3:01 pm
Oh, right, this article was from 2007. Duh.
Justin Merry says:
April 20, 2011 at 3:12 pm
Holy crap, for a second I thought Pizza Cutter had returned!!
Pizza Cutter says:
April 20, 2011 at 4:36 pm
No new episodes, but apparently I’m in syndication now!
Pizza Cutter says:
April 20, 2011 at 4:49 pm
I do worry about how this has come to be interpreted. I presented the levels at which various stats hit the magic .70 mark. However, even that has its problems. .80 is, of course, more reliable than .70, and given a large enough sample, a stat would get there. In the same way, .60 is worse than .70, but it’s not trivial either. Reliability is a matter of degrees, and should be treated as such.
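One way to make "a matter of degrees" concrete is the Spearman-Brown prediction formula from classical test theory, which estimates how reliability changes when the sample is lengthened by a factor k. This is a sketch under that formula's assumptions (the added plate appearances behave like the existing ones), not something taken from the article. The function names and the 100 PA starting point are hypothetical.

```python
def spearman_brown(r, k):
    """Predicted reliability after multiplying the sample size by k."""
    return k * r / (1 + (k - 1) * r)


def length_factor(r_now, r_target):
    """Factor by which the sample must grow to move from r_now to r_target."""
    return r_target * (1 - r_now) / (r_now * (1 - r_target))


# Hypothetical stat that reaches .70 at 100 PA.
pa_at_70 = 100
pa_at_80 = pa_at_70 * length_factor(0.70, 0.80)
pa_at_60 = pa_at_70 * length_factor(0.70, 0.60)
print(f".60 at ~{pa_at_60:.0f} PA, .70 at {pa_at_70} PA, .80 at ~{pa_at_80:.0f} PA")
```

Under these assumptions the thresholds sit on a smooth curve: there is nothing special about the PA count where a stat crosses .70, only more or less reliability on either side of it.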
Pizza Cutter says:
April 20, 2011 at 5:01 pm
It can certainly be done. At the time that I wrote this piece, it wasn’t the question that intrigued me, but certainly. it can be done.
evanbrunell says:
April 20, 2011 at 6:11 pm
Haha, good one.
Elias says:
April 20, 2011 at 6:27 pm
cool. thanks. it is neat work, and very useful as is.
B N says:
April 20, 2011 at 7:37 pm
I’ll give you one-up on that. Rather than doing it for a single season, you could do a sweep of intervals and calculate confidence intervals for each. In that way, you’d have a function of Confidence = f(SampleSize) for each type of observation you’re looking at.
With a bit of smoothing and interpolation, this sort of thing would let you estimate your confidence at any given sample size. Additionally, it might give you an idea of when you’re hitting an asymptote: the point at which adding more data isn’t really giving you any better confidence.
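A rough sketch of the Confidence = f(SampleSize) idea, using the binomial standard error quoted earlier in the thread as a stand-in. An empirical version would replace `ci_half_width` with interval widths measured from real data. The .270 rate and the list of sample sizes are arbitrary choices for illustration.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)


def sweep(p, sizes, z=1.96):
    """Tabulate interval half-width across a range of sample sizes."""
    return [(n, ci_half_width(p, n, z)) for n in sizes]


table = sweep(0.270, [50, 100, 200, 400, 800])
for n, half_width in table:
    print(f"n = {n:4d}  +/- {half_width:.3f}")
```

Under the binomial assumption the width shrinks with the square root of n, so quadrupling the sample only halves the interval. That is the flattening-out the comment describes, though the curve never becomes completely flat.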
J-Doug says:
April 20, 2011 at 7:57 pm
‘In social science, we look for a magic number, which is .70.’
Not that the number is “wrong,” exactly, but setting the “magic” intra-class correlation threshold at 0.7 is only slightly (and by slightly I mean infinitesimally) more scientific than picking a “magic” number for plate appearances.
I’m a social scientist myself, and this is the first I’ve ever heard of any sort of magic number in intra-class correlation. I’m not sure what your field is, but in political science the standard varies between subfields and research questions due to disparities in the amount of available data.
Either way, 0.7 is only slightly less arbitrary than 100 PA. That’s why it’s best to regress to the mean based on standard error and sample size, as those at Inside the Book usually argue.
mattr84 says:
April 20, 2011 at 8:25 pm
I agree, J-Doug. I’m currently in grad school in the social sciences and could not remember anyone ever declaring .7 as the magic number. I even asked my stats teacher today and he agreed. A major issue with these coefficients is that the critical values are mostly arbitrary.
Justin Merry says:
April 20, 2011 at 10:46 pm
For illustration purposes, though, I like this approach. While the setpoint is more or less arbitrary, it does make some sense among the various arbitrary options you could come up with (when r=.7, r^2 is about .5, so you have roughly 50% of variation explained). But what I like best about Pizza’s study, and it really is a classic at this point, is that it gives you this terrific comparison of how quickly different statistics stabilize relative to one another. The use of PAs as the unit of measure makes it immediately understandable and approachable. And as long as you don’t think the .7 cutoff is massively inappropriate, it’s a great note of caution for those who wish to use small samples to make judgements, be they early season, splits, or whatever.
Are regressed statistics better? Sure. Though you can make an argument that what that does–assigning a single number to something that really should be expressed as an estimate plus/minus error–is a logically futile thing to do. There is an arbitrariness to how that process works as well.
-j
Pizza Cutter says:
April 20, 2011 at 11:17 pm
.70 is the place where the R-squared (just about) hits 50% (technically, it’s at .707, but that’s not a nice round number), so anything north of .70 means that I’m picking up more than half the variance.
It’s a line in the sand, like anything else, although with a little better rationale behind it than, say, .48 or .32.
James says:
April 21, 2011 at 3:09 am
Great, great article but you should be ashamed of the Rent reference in the title.
Jeff Zimmerman says:
April 21, 2011 at 8:50 am
I was thinking the same thing.
Paul says:
April 21, 2011 at 9:39 am
What about stats like wOBA, wRAA and wRC(+)? I would think those could stabilize faster than AVG and BABIP, but still tell you more than OBP, OPS and SLG.
Gumby says:
April 21, 2011 at 11:16 am
If the variance of a binomial distribution is np(1-p), then the SD is the square root of that, SQRT(np(1-p)), and the SE is the SD over the square root of n, SQRT(np(1-p))/SQRT(n). If that’s right, then the SQRT(n)’s cancel and we get the SE to be SQRT(p(1-p)), no? In other words, isn’t it a mistake to divide by SQRT(n)? My calculation is 0.143 +/- 0.686 or so, which is huge, maybe too huge. I’m not too sure. Anyway, yours is a pretty tight 95% confidence interval for this early in the season, don’t you think?
Jim F. says:
June 27, 2011 at 1:50 am
I don’t comment often, but when I do, the piece is worth reading.
Blueyays says:
June 30, 2011 at 8:25 pm
First of all: the complete lack of correlation when it comes to all the situation-affected stats, like WPA and clutch, should finally leave them where they should always have been: in the dustbin. (I’m still a little amazed that they ever gained acceptance in the first place.)
Also: Do you think you could, using your results, make us a graph of correlation vs. PAs? I think we might notice some very interesting patterns there that might help us peg the level right where it should be.
Overall, fantastic article, very useful, and I hope to see a pitching equivalent up here soon!
Cooperstown2009 says:
February 20, 2012 at 12:03 pm
This just in – Everyone gets song stuck in their head.
Matthias says:
May 1, 2012 at 8:15 pm
@Gumby
The variance of the binomial is indeed n*p*(1-p), but remember that’s a count, not a proportion.
A proportion is the binomial count divided by the sample size… X/n, if you will. The var(X/n) = (1/n)^2 * n*p*(1-p). n is treated as a constant, and the variance of any c*X = c^2 * var(X). So only one of the n’s cancels out! Another is left in the denominator, and then square rooted in the standard error calculation.
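Written out step by step, with X the binomial count of hits in n at-bats, the derivation looks like this. The numeric check reuses the .143 AVG and 63 AB from Elias's comment above.

```latex
\begin{align*}
\operatorname{Var}(X) &= n\,p\,(1-p) \\
\operatorname{Var}\!\left(\frac{X}{n}\right)
  &= \frac{1}{n^{2}}\operatorname{Var}(X)
   = \frac{n\,p\,(1-p)}{n^{2}}
   = \frac{p\,(1-p)}{n} \\
\operatorname{SE}\!\left(\frac{X}{n}\right)
  &= \sqrt{\frac{p\,(1-p)}{n}} \\[4pt]
\text{With } p = .143,\ n = 63:\quad
\operatorname{SE} &\approx \sqrt{\frac{.143 \times .857}{63}} \approx .044,
\qquad 1.96 \times .044 \approx .086
\end{align*}
```

So the division by SQRT(n) belongs there, and the +/- .086 interval stands. The +/- .686 figure comes from leaving that division out.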
Matthias says:
May 1, 2012 at 8:22 pm
In this case, the standard errors are a little trickier than just plugging into a binomial formula. The binomial standard error for a proportion, SQRT(p*(1-p)/n), assumes independence between events as well as identical events. In baseball, that is obviously not true, as the pitcher and the ballpark have something to say about hitter success, and those things change night after night.
This is some form of an over-dispersion problem where the standard errors are a function of all the different situations a batter finds himself in (like ballpark, pitcher, base-out state, etc.). The binomial distribution would suggest that any stat for which a player has, say, a 30% success rate would have the exact same standard error associated with it (assuming equal sample sizes). But I think one major takeaway from this article is that the standard errors can be very different between the proportion statistics, even when probabilities of success are equal.
My master’s thesis is actually this exact issue in baseball. My goal is to determine standard errors for these statistics as functions of sample size (plate appearances or pitches seen, or whatever). Then the more intuitive confidence and prediction intervals could be derived.
I’ll keep you posted ;-)
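A toy simulation can show what over-dispersion does to the standard error. Everything here is an assumption made for illustration and is not drawn from the thesis mentioned above: the game-level success probability is shared by all of a batter's plate appearances in that game, and it is drawn uniformly around a 30% mean. The spread, number of games, and function names are likewise hypothetical.

```python
import math
import random
import statistics


def binomial_se(p, n):
    """Standard error of a proportion under the plain binomial model."""
    return math.sqrt(p * (1 - p) / n)


def simulate_season_rate(games, pa_per_game, p_mean, spread, rng):
    """One simulated season. Each game gets its own success probability
    (standing in for pitcher and ballpark), shared by that game's PAs."""
    successes = 0
    for _ in range(games):
        p_game = rng.uniform(p_mean - spread, p_mean + spread)
        successes += sum(rng.random() < p_game for _ in range(pa_per_game))
    return successes / (games * pa_per_game)


rng = random.Random(42)
GAMES, PA_PER_GAME, P_MEAN = 150, 4, 0.30

# Seasons with game-to-game variation versus seasons with a constant rate.
varied = [simulate_season_rate(GAMES, PA_PER_GAME, P_MEAN, 0.25, rng)
          for _ in range(1000)]
constant = [simulate_season_rate(GAMES, PA_PER_GAME, P_MEAN, 0.0, rng)
            for _ in range(1000)]

theoretical = binomial_se(P_MEAN, GAMES * PA_PER_GAME)
empirical_varied = statistics.stdev(varied)
empirical_constant = statistics.stdev(constant)
print(f"binomial SE:            {theoretical:.4f}")
print(f"simulated, constant p:  {empirical_constant:.4f}")
print(f"simulated, varying p:   {empirical_varied:.4f}")
```

In this setup the season-to-season spread with varying game conditions comes out wider than the binomial formula predicts, while the constant-rate seasons track the formula. The clustering matters: if the probability changed independently on every single plate appearance instead, the variance would not exceed the binomial value.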