## Randomness, Stabilization, & Regression

“Stabilization” plate appearance levels of different statistics have been popular around these parts in recent years, thanks to the great work of “Pizza Cutter,” a.k.a. Russell Carleton. Each stat is given a PA cutoff, which is supposed to be a guideline for the minimum number of PAs a player needs before you can start to take their results in that stat seriously. Today I’ll be looking at the issue of stabilization from a few different angles. At the heart of the issue are mathy concepts like separating out a player’s “true skill level” from variation due to randomness. I’ll do my best to keep the math as easily digestible as I can.

The first thing I want to discuss is randomness. I’m sure I’m mainly preaching to the choir here, but let me briefly illustrate something: if you have a fair 6-sided die, you know that a given side has a 1/6 chance of coming up; that’s the “true” probability (about 0.167). But the fewer times you roll it, the greater the chance that your desired number is going to come up a lot more or less than 16.7% of the time. Feel free to play around with these **interactive charts** to get a better feel for how that works:

**Instructions:** enter a whole number between 1 and 500 under *Number of Trials*, and a decimal between 0 and 1 (or a percentage between 0% and 100%) under *“True” Success Rate*.

These charts represent the binomial distribution. The top chart indicates the probability (on the vertical axis) of getting the particular result shown on the horizontal (x) axis. The lower chart shows the cumulative distribution function, meaning each point indicates the probability that the result will be equal to or less than the value on the x-axis. Look at the points, in particular — the lines are just there to show the overall trends.
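For readers without access to the charts, here is a minimal Python sketch of the same binomial calculations. The function names and the 5-point tolerance are my own illustrative choices, not anything from the original charts.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


def prob_within(n: int, p: float, tol: float) -> float:
    """Chance that the observed rate k/n lands within tol of the true rate p."""
    return sum(binomial_pmf(k, n, p) for k in range(n + 1) if abs(k / n - p) <= tol)


if __name__ == "__main__":
    p = 1 / 6  # fair six-sided die
    for n in (6, 60, 500):
        # More rolls means the observed rate hugs the true rate more tightly.
        print(n, round(prob_within(n, p, 0.05), 3))
```

The more trials you feed in, the larger the share of outcomes that land close to the “true” rate, which is the whole idea behind sample-size thresholds.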

The binomial distribution indicates how likely it is that the results you’re seeing are fluky; i.e., it helps to separate out the randomness in your observations. Now, baseball is a lot more complicated than rolling dice. First of all, you can only *guess* what a player’s “true” success rates are. Complicating things even more are the effects that opposition, parks, and weather can have on their observed success rates. That’s not even the whole of it — a player’s “true” rates can change over their careers, and probably even over the course of a season (depending on their physical — and arguably, psychological — health). Still, the principle behind the binomial distribution plays an important role in baseball stats.

### The Pizza Cutter Method, Version 1

To recap Russell’s original method: he splits each player’s PAs over several seasons into 2 groups — evens and odds (according to their chronological order, I believe). He then figures out each player’s stats in each group for every PA level in question, and figures out the correlations between the stats in the two groups for all the players. He then tries to find the PA level at which the correlations between the odd and even groups reach 0.7. Now, using that 0.7 correlation figure has been the topic of some debate, which I’ll address in a little bit. First, I want to show you the results of my research on 2012 Retrosheet data, using something close to Russell’s original methodology:
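To make the split-half idea concrete, here is a rough Python sketch run on simulated players rather than Retrosheet data. Everything about the simulation is an assumption for illustration: the talent distribution (a Beta with mean around .333), the player count, and the function names are all hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def split_half_correlation(pa_logs, n_pa):
    """Correlate even-PA and odd-PA on-base rates over each player's first n_pa PAs.

    pa_logs holds one chronological list of 0/1 outcomes per player;
    players with fewer than n_pa PAs are dropped.
    """
    evens, odds = [], []
    for log in pa_logs:
        if len(log) < n_pa:
            continue
        sample = log[:n_pa]
        even_half, odd_half = sample[0::2], sample[1::2]
        evens.append(sum(even_half) / len(even_half))
        odds.append(sum(odd_half) / len(odd_half))
    return pearson(evens, odds)


def simulate_pa_logs(n_players, n_pa, seed=0):
    """Hypothetical league: each player's true rate is drawn from Beta(100, 200)."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n_players):
        talent = rng.betavariate(100, 200)
        logs.append([1 if rng.random() < talent else 0 for _ in range(n_pa)])
    return logs


if __name__ == "__main__":
    logs = simulate_pa_logs(400, 600, seed=1)
    for n_pa in (50, 200, 400, 600):
        print(n_pa, round(split_half_correlation(logs, n_pa), 3))
```

With real data you would swap `simulate_pa_logs` for each batter's actual PA-by-PA results; the correlation should climb as the PA level rises.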

The asterisk next to OBP is there because I was calculating times on base divided by plate appearances, for the sake of keeping my Retrosheet spreadsheet a little simpler… the proper definition of OBP would actually subtract sacrifice hits (successful bunts) from PA in the denominator. So, expect a little lower average and probably a tiny bit higher variance for OBP* than for actual OBP.

Anyway, what we see here is that OBP doesn’t stabilize over the course of a season, according to Russell’s definition, but we might extrapolate out the regression line you see there and say that a 0.7 correlation might be reached somewhere in the area of 800-900 PA. However, as I mentioned, some — including Tom Tango — disagree with using 0.7 correlation as the standard for this purpose, thinking 0.5 makes more sense (the reason for that could use a little more setup, I think). According to the alternative view, OBP* stabilizes around 600 PA. Or at least it did in 2012… I don’t think it’s anywhere near a given that it will be that consistent between years.

**A Nerdy Debate**

Why was a 0.7 correlation between the split-halves Russell’s target? Well, if you square the correlation, you get a figure that’s supposed to represent the proportion of the variance in factor y that can be accounted for by knowing factor x. 0.7 squared equals 0.49, implying that the factor on the x-axis explains 49% of what’s happening with the factor on the y-axis. So, the idea was that the point at which the correlation between the halves passes 0.7 is the point at which we’re seeing half-signal, half-noise out of the sample. However, Tango disagrees, and says a 0.5 correlation is what achieves that goal in this type of situation. Kincaid, of 3-dbaseball.net, agrees with Tango, and explains in the comments,

“r=.7 between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample. But the other observed sample is not a pure measure of skill; it also has random variance. So you can’t extrapolate that as half of the variance is explained by skill.”

Correlation squared, or R squared, also known as the coefficient of determination, is supposed to compare the results of a mathematical model to the observed results this model is supposed to predict or explain. The model is supposed to be a representation of the “true” (e.g. skill-based) rates, having regressed out as much of the randomness as it could. So, you see why comparing a de-lucked sample (from the model) to an observed sample is a bit different from comparing two observed samples — the model has a big leg up in matching up with the observed sample.

Kincaid continues:

“You actually get half of the variance explained by skill when the correlation between samples is .5. That is the point when half of the variance in each sample is explained by skill, and half by random variation. Then, when you try to explain the variance of one sample (which is half skill/half random) with the other sample, you get half the variance of one sample explaining half the variance of the other sample, which means the overall variance of one will explain 1/4 of the other.”
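Here is one way to write Kincaid’s argument down, as a sketch under simple assumptions of my own: each observed sample is a shared skill term T plus its own independent error (E1 or E2), and both errors have the same variance. The symbols are mine, not Kincaid’s.

$$
X_1 = T + E_1, \qquad X_2 = T + E_2
$$

$$
r(X_1, X_2) = \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E)}
$$

When skill and randomness contribute equally, so that Var(T) = Var(E), this gives r = 0.5 and r² = 0.25, which matches the “1/4” in the quote.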

### Another Method — Regression Towards the Mean

The aforementioned Kincaid wrote a great (and very mathy) article on some methods of separating out the luck and skill of players. I’m going to apply some of that knowledge to my same 2012 batter OBP* sample.

First of all, remember that what we see in the stats is part skill, part luck. In other words:

**Total Variance = “True” Variance + Error Variance**

That “error” is from sources like insufficient sample size and poor measurements. We want to try to cut out the distraction the errors cause to try to get to the “true” variance.

The first thing we need to do is find the total variance of the population. Not surprisingly, that depends a lot on how many PAs we’re analyzing per player. You may have noticed that underneath my binomial distribution charts, the variance of the distribution was being calculated. The formula for that was:

**Variance = p*(1-p) / n**

…where p = the mean probability of the stat in question (“success rate”… e.g., the league average of the stat, over a given number of PA), and n = the number of events being considered (e.g., PAs per player).

That formula should give us a good idea of how much of the variance we’re seeing comes from the PA level we’re analyzing. It doesn’t address the variance due to the sample size of players we’re using, though (maybe somebody who knows more about stats than me can tell/remind me how that works).

So, for each PA level under investigation, we calculate the random variance as predicted by the second formula. Here, you’ll see both the total (observed) variance and the random variance:

The *p* in the random variance formula changes noticeably in this sample, by the way; up until about 60 PA, the mean OBP* is only around .300, but that number climbs steadily to around .340 by 500 PA. Part of that rise may have something to do with batters getting more comfortable with more PAs, but I’m sure the main cause is the exclusion of pitchers, bench players, and other fill-ins from the higher-PA samples.

Now we can subtract the random variance from the total variance, and it gives us:

It looks like once we’ve taken batters with fewer than 100 PA out of the equation, things get a lot more predictable.

Now, to find the PA level at which the true variance equals the random variance (in other words, where the signal equals the noise — the same as Pizza Cutter’s goal) we bring back the second equation, this time inserting the true variance and solving for *n* at each PA level:

Ignoring the flukiness of the first 100 PA, it looks like the “stable” PA level plateaus around the low 300s. Starting a little before 500 PA, there may be some small sample size and outlier issues, with fewer players able to meet that threshold.
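The arithmetic behind those steps can be sketched in a few lines of Python. The input numbers in the demo are made up for illustration and are not values from the 2012 sample; the function names are my own.

```python
def random_variance(p: float, n: float) -> float:
    """Binomial variance of an observed rate: p * (1 - p) / n."""
    return p * (1 - p) / n


def true_variance(total_variance: float, p: float, n: float) -> float:
    """Total (observed) variance minus the variance expected from randomness."""
    return total_variance - random_variance(p, n)


def stable_pa(p: float, true_var: float) -> float:
    """PA level at which the random variance would equal the true variance.

    Solves p * (1 - p) / n = true_var for n.
    """
    if true_var <= 0:
        raise ValueError("true variance must be positive to solve for n")
    return p * (1 - p) / true_var


if __name__ == "__main__":
    # Hypothetical inputs: league mean, PAs per player, observed variance.
    p, n, total_var = 0.330, 400, 0.0013
    tv = true_variance(total_var, p, n)
    print(round(stable_pa(p, tv)))
```

Note that `stable_pa` blows up when the observed variance is at or below the binomial-only variance, which is one way the low-PA samples can produce wild results.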

How does all this relate to regression to the mean? Well, let’s say that the mean OBP is 0.330. Let’s also say the “stable” PA cutoff for OBP is about 305. Going by the method Kincaid outlines, 305 is going to be our PA denominator, so we find the numerator that’s going to average out to a .330 OBP* by multiplying 0.330 * 305 = 100.65. Now, pick a player, any player, and add 100.65 to however many times he reached base, then divide that quantity by the quantity of his PA + 305. What you’ve just done is to regress that player’s numbers towards the mean by a fraction of 305 / (PA + 305), providing an estimate of his true talent. A player with exactly 305 PA gets regressed 50% of the way; one with fewer PAs gets regressed more, and one with more PAs gets regressed less. The method will therefore estimate a batter with fewer PAs to be a lot closer to the mean than a batter with the same stats but a lot more PAs (unless he’s already close to the mean, of course).
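As a quick sketch of that calculation in Python (the function names are mine, and the .330 mean and 305 PA defaults are just the example figures from the paragraph above):

```python
def regress_to_mean(times_on_base: float, pa: float,
                    league_mean: float = 0.330, stable_pa: float = 305) -> float:
    """Add stable_pa plate appearances of league-average results to a player's line."""
    return (times_on_base + league_mean * stable_pa) / (pa + stable_pa)


def regression_share(pa: float, stable_pa: float = 305) -> float:
    """Fraction of the way the estimate is pulled from the observed rate to the mean."""
    return stable_pa / (pa + stable_pa)


if __name__ == "__main__":
    # Two hypothetical batters with the same .400 observed rate.
    for on_base, pa in ((40, 100), (240, 600)):
        estimate = regress_to_mean(on_base, pa)
        print(pa, round(regression_share(pa), 3), round(estimate, 3))
```

The batter with 100 PA ends up much closer to .330 than the one with 600 PA, even though both have the same observed rate.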

### Pizza Cutter, Redux

Russell improved his methodology in a more recent analysis, employing a much bigger sample, and using a fancier technique than his original split-sample method — a formula known as KR-21 (Kuder-Richardson #21). I eventually managed to track down that formula in a statistics textbook:

KR-21 = **(K/(K−1)) × (1 − (M × (K − M)) / (K × s^2))**

K = # of items on the test

M = Mean Score on the test

s = standard deviation of the test scores based on the mean of the test scores

(page 147 of the linked textbook)
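Here is the formula as a small Python function. Mapping it onto baseball is my own reading and should be treated as an assumption: K as the number of PAs per player, M as the mean number of times on base, and s as the standard deviation of times on base across players.

```python
def kr21(k: float, mean_score: float, sd_score: float) -> float:
    """Kuder-Richardson formula 21 reliability estimate.

    k          -- number of items on the test (assumed here: PAs per player)
    mean_score -- mean total score (assumed here: mean times on base)
    sd_score   -- standard deviation of the total scores across test-takers
    """
    if k < 2 or sd_score <= 0:
        raise ValueError("need at least 2 items and a positive standard deviation")
    return (k / (k - 1)) * (1 - (mean_score * (k - mean_score)) / (k * sd_score**2))


if __name__ == "__main__":
    k, p = 300, 0.33
    binomial_sd = (k * p * (1 - p)) ** 0.5  # spread expected from luck alone
    # Hypothetical case: observed variance is twice the luck-only variance.
    print(round(kr21(k, k * p, binomial_sd * 2**0.5), 3))
```

If the spread in scores is exactly what the binomial distribution predicts from luck alone, the function returns 0; as the observed variance grows beyond that, the reliability estimate rises.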

I used that to come up with the following:

Notice that the numbers don’t match up that well with the original split-sample analysis (mouse over or tap to see the original). However, see the PA level where a 0.5 reliability coefficient is reached? It’s right around the low 300s, the same as with Kincaid’s method.

Now, the textbook I linked has, on the same page, an equation that you’re supposed to run a split-sample correlation through in order to come up with a proper reliability estimate: **2r/(1+r)**. That’s called the Spearman-Brown formula. So I converted all the points in the split sample using that, and here’s what I got:

What do you know — 0.5 is crossed right around 300 PA.
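The Spearman-Brown step is a one-liner; a small Python sketch follows. The inverse function is my own addition for convenience, not something from the textbook.

```python
def spearman_brown(r: float) -> float:
    """Step a split-half correlation r up to a full-length reliability: 2r / (1 + r)."""
    return 2 * r / (1 + r)


def split_half_r_for(reliability: float) -> float:
    """Inverse of spearman_brown: the split-half r that yields a given reliability."""
    return reliability / (2 - reliability)


if __name__ == "__main__":
    for r in (0.2, 1 / 3, 0.5, 0.7):
        print(round(r, 3), round(spearman_brown(r), 3))
```

One thing this makes clear: a full-sample reliability of 0.5 corresponds to a raw split-half correlation of only about 0.33, since each half contains just half of the PAs.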

### Discussion

So, I think the evidence is on the side of Tango, Kincaid, et al. in the 0.7 vs. 0.5 debate. I’m not a statistician or a Ph.D., though, so maybe I’m missing something. Still, there’s nothing really magical about the 50-50 boundary of chance vs. skill. You might prefer a level that has a higher skill requirement, in which case, 0.7 is fine. Let me hear your thoughts on all this, though.

Textbook cited:

**Interpreting Assessment Data: Statistical Techniques You Can Use**

By Edwin P. Christmann, John L. Badgett


“Part of that rise may have something to do with batters getting more comfortable with more PAs, but I’m sure the main cause is the exclusion of pitchers, bench players, and other fill-ins from the higher-PA samples.”

So you’re using different populations throughout the study?

Well, the starting population just gets whittled down as PAs considered rise, which causes players to drop out of consideration. I see your point, though.

I just ran Kincaid’s method on only those with at least 440 PAs, and I only analyzed them all up to the 440 PA level. The results look pretty much the same — it flatlines around 300 PA. There’s even more wildness at the low PAs considered levels, actually. 177 players in that sample, btw.

Great article! I’m bookmarking it for a deeper read later, otherwise I won’t get anything done this afternoon in my “real” job.

Thank you!

this is awesome, steve. please tell me you’ll be presenting something at the saber seminar.

Thanks! No can do on the seminar, though, unfortunately. I’m too poor to make my way to that side of the country. Maybe if they ever have one in CA, or if I start making decent money…

I’m glad you brought this topic up since a component of it has been bugging me for some time.

As you’ve probably noticed, frequently on this site, FG writers make claims like this:

“Well, you know, we’ve reached the point in the season where sample sizes start to become meaningful. Smith has now amassed 150 PA, so we can really start to believe in his performance…”

Unfortunately, that’s not what PC’s studies were about. You can’t apply the data in that way and yet FG writers spout it over and over again when projecting future player performance. Please stop. PC himself acknowledged this:

http://www.baseballprospectus.com/article.php?articleid=17742

Yeah, there’s still plenty of room for weird stuff to happen in a player’s results before or after the threshold is reached. This stuff probably does get taken too far. It’s much more of a “rough idea” sort of tool than a critical element of serious analysis, IMO.

Not sure what the level of statistical acumen is at the Saber conference, but it might be worth pointing out as a limitation that your research is founded on a particular subset of probability theory, namely the frequentist approach which relies on an infinite and identical sampling process. This is still the dominant paradigm in statistics, but some at the conference may prefer something like a Bayesian approach which relies on subjective probability. PECOTA has some Bayesian inference at its roots and the approach generally outperforms frequentist methods when drawing conclusions based on limited observations.

Yes, I’m definitely with you on that. Did you catch Kincaid’s line in his linked article, though: “Bayes and regression to the mean produce identical talent estimates under these conditions (a binomial process where true talent follows a Beta distribution).”

If all you’re interested in is a point-estimate of “talent,” then yes, you would get the same result regressing to the population mean as you would with a Beta-Binomial model where the hyperparameters for the Beta distribution are chosen such that it represents the population expectation (as in what Kincaid did). That would be like saying that your Ford Pinto performs as well as a Bugatti Veyron as long as you don’t use the gas pedal on the Veyron. Using the population distribution as a prior is like saying that you feel the same way about Brendan Ryan’s OBP as you do about Miguel Cabrera’s before watching them hit. It’s more likely that you have some a priori beliefs about those two that you could encode in the prior, such as last year’s OBP, or something. More importantly, the regression model (as framed above), like other frequentist-based processes, assumes that OBP skill is fixed, as opposed to random. This points to a criticism I have about the Kincaid article. He applies a Bayesian machinery, but not Bayesian inference. He goes through all of the work to derive the analytical form of the posterior distribution, only to rely on the posterior mean/variance. Why would I want a summary of a distribution when I can have the entire distribution?

Haha, nice analogy. Maybe Kincaid or somebody will show up to debate you on that, but I’m going to need to read up on Bayesian inference.

The entire posterior distribution (under the assumptions in the article) is the Beta distribution described by the posterior parameters.

Agreed, and I am being nit-picky. What you did is sufficient given the scenario, but fails in the absence of conjugacy, which isn’t an issue in a simple model like this (unless you really want to use a uniform distribution instead of a Beta). I’m actually not that concerned with your particular approach, but more so with the notion that the solution for a simple mean regression is the same under a Bayesian framework as it is using a frequentist approach. This sentence, in particular, concerns me: “So Bayes and regression to the mean produce identical talent estimates under these conditions (a binomial process where true talent follows a Beta distribution).” A particular predicted value (Y_pred) resulting from a regression equation with a certain error variance (e) is not the same as a posterior distribution with a mean of Y_pred and a variance e. Even if the expected distribution of predicted values via the regression models is analytically equivalent to the posterior distribution from the Bayesian approach, there still exist differences in the philosophical and inferential interpretations of the two solutions that are not captured by saying that the estimates are “identical.” I realize this is largely an issue of semantics, in this case, but it’s an issue nonetheless.

The main point of the article is to show the Bayesian justification for regression toward the mean as a point estimate, since that is generally how it is used in baseball talent estimates. The point estimate you get from regression toward the mean is identical to what you would get using a Bayesian process under those assumptions to estimate the same thing (a mean talent estimate).

While most baseball analysts don’t use the full posterior distribution, it can still be easily derived from the regression constant k and the population mean p:

α = kp

β = k(1−p)

Posterior parameters = α+s, β+n−s

So you can get a distribution that is numerically identical to Bayes by figuring out the regression constant. Regression toward the mean is just a shortcut for the Bayesian process, as long as you can make certain assumptions for convenience.

(Here α and β are the prior Beta parameters.)

I’m not challenging the mathematical equivalency, in this case. I’m challenging the assertion that a point estimate of .377 for the mean of an outcome variable resulting from a regression model is the same as a posterior distribution with a mean of .377. One is a single value intended to estimate a fixed parameter whereas the other is a summary of the distribution of a random variable. In the latter of the two cases, there are any number of ways that we could summarize that distribution (median, CCI, HDCI, etc.). I realize that there probably wouldn’t be any difference in the resulting conclusions in the case of your example, but it’s not accurate to say that the estimates are identical. Again, kind of nit-picky, I realize, but this is an issue very near and dear to me.

The full distributions are identical whichever method you use (again, under the given assumptions).

I disagree. Under the frequentist analysis the parameter being estimated is assumed to be fixed at the population level and, as such, has no distribution. Any distribution that you construct analytically therefore has no interpretation. We can use the mean/SE of such an estimate to construct a confidence interval or to approximate a distribution given some assumptions (i.e., a form of that distribution), but that distribution cannot be interpreted as a distribution of plausible parameter values. Rather it is interpreted as a distribution of potential parameter estimates given an identical and repeatable sampling process. That’s why a frequentist confidence interval is defined as a range of values that would capture the “true” parameter value in X% of identical samples, whereas the Bayesian CCI is defined as a range of values between which X% of parameter values are contained.

You can calculate the posterior Beta parameters from the regression constant. All you are doing by using the regression toward the mean calculations is simplifying the calculations of the Beta parameters. It’s just a shortcut for the same math (and, if you only want the mean of the distribution, you can skip calculating the Beta parameters altogether).

You can start by figuring out your prior parameters, and then updating the prior distribution with the new observations:

a+s

B+n-s

Or, you can start with a population mean p and a regression constant k, and calculate the prior parameters from that:

a = kp

B = k(1-p)

and then update the prior parameters with your new information the same way. You get the exact same posterior distribution whichever way you do it. Sometimes one way is just easier than the other to use.
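A short Python sketch of the shortcut described in this comment, using the article’s example figures (k = 305, p = .330) as illustrative inputs. The function names are hypothetical.

```python
def prior_from_regression(k: float, p: float) -> tuple[float, float]:
    """Prior Beta parameters from regression constant k and population mean p."""
    return k * p, k * (1 - p)


def posterior(alpha: float, beta: float, s: float, n: float) -> tuple[float, float]:
    """Update the Beta prior with s successes observed in n trials."""
    return alpha + s, beta + n - s


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    a, b = prior_from_regression(305, 0.330)
    a_post, b_post = posterior(a, b, s=122, n=305)  # hypothetical .400 batter
    print(round(a_post, 2), round(b_post, 2), round(beta_mean(a_post, b_post), 3))
```

The posterior mean works out to (s + kp) / (n + k), the same number the add-305-average-PAs calculation in the article produces.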

I’m confused at the ways in which you guys are planning on implementing these IRT models. I understand that you want to generate an estimate of ability that is independent from the level of competition (let’s call it “item difficulty”), but I’m not sure that IRT provides the necessary machinery to accomplish that in this case. I see people talking about the parameters needed to model “hitting success,” but let’s remember that IRT aims to model person parameters (latent traits). Defense, pitch type, etc. aren’t really person parameters. Item difficulty in IRT is framed in terms of the abilities of interest. What are we thinking of as the items here? Each plate appearance against a particular pitcher (i.e., item 1 is a PA against King Felix, item 2 is a PA against CC Sabathia, etc.)? If that’s the case, then I’m assuming we’re envisioning pitcher skill as the driver of item difficulty and that the total item pool is defined by the number of pitchers in the league. If so, how are we dealing with (a) not every examinee (hitter) sees every item, and (b) some examinees see certain items multiple times? Existing IRT models aren’t really adept at handling multiple attempts at a task. Even if we had an IRT model to handle that, there are bound to be a number of local dependencies caused by hitting environment (park), situation, defense, etc. I like the idea of wanting to account for mitigating factors that lead to differential difficulty across plate appearances. Additionally, the advantage that IRT provides of having reliability vary by ability (i.e., we’re less confident about estimating the skill of terrible and great hitters as there are fewer of them on which to base our estimates) seems desirable. I’m just not sure that what it seems like many of you want to do will lead to an estimable model. Personally, I think that a Bayesian regression model is sufficient. All of the things we seem to be concerned about (defense, pitcher skill, etc.) could be encoded in the prior. For example, instead of using a Beta distribution centered at the population mean OBP, as in what Kincaid did, I could instead use a Beta distribution centered at the expected OBP of a league average hitter that had faced all of the pitchers that the hitter of interest had faced. For example, if Batter X faced Clayton Kershaw 100 times, and Batter Y faced Hector Noesi 100 times, I would use a Beta distribution centered at whatever Kershaw’s OBP allowed is for X and one centered at Noesi’s OBP against for Y.

I meant to put this last comment further down, not as a reply to this topic.

That’s some great advice, thanks. IRT never came up in the brainstorming, as I wasn’t familiar with it (don’t know about Dave), but it sounded like something worth exploring. I think what I had in mind was along the lines of what you suggested, though I haven’t worked out the technical details. I’ll see what I can do.

I am not smart enough to read this.

I’m a grad student in statistics. What you said makes sense, and it’s good to see some actual math, although some of that stuff I’d never seen before. It’s good to see some regression to the mean since it’s something that’s pretty standard in most statistical inference (baseball batting averages were literally the example we used in class for regression to the mean). It would be nice to see some error bars in your inference, especially WAR. You might be overestimating the accuracy of your point estimate (well, strictly speaking you’re not estimating the accuracy of your point estimate).

Very cool on the grad school thing. I kind of wish I’d taken more stats, as my only exposure was through bio and social science-type undergrad stats classes. I took the MBA route in grad school, so my spreadsheet skill outranks my actual mathematical understanding by a bit…

Anyway, when I have Excel spit out standard error bars for the “Expected OBP* Stabilization Levels” chart, it’s telling me only about plus or minus two on the Stable PA level. I’m not sure if it’s doing that properly or what, though. It’s also telling me plus or minus 8 along the PAs considered axis, even though the accuracy of that one isn’t really in question… I’m not sure how to interpret that.

2 points: First, it may just be that the quotes you selected from the Kincaid article are less than optimal, but I’m not sure you make a very compelling case for using .5 instead of .7 as a benchmark for reliability. To be fair, either number is pretty arbitrary, but the Kincaid quote (the first one) seems out of place here. In the case of within sample reliability, we are using one sample to essentially predict itself, rather than two independent samples to predict each other. The KR-21 and Spearman-Brown prophecy formulas are built with that assumption in mind, and reliability acts as a de facto speed limit for validity because nothing can correlate better with anything else than it can correlate with itself, by definition. Thus, per classic test theory, since we only have 1 sample, we have true score and error, and Russell’s estimates apply.

My second point, however, is that you probably don’t really want to be estimating reliability using classic test theory here at all. It seems like Item Response Theory (IRT) would make more sense here, given that classic test theory omits a very important factor when applied in this case: the pitcher. With classic test theory, all items (in this case pitchers) are assumed to be equally difficult, and thus the hitter equally likely to get a hit off of any pitcher (hence the binomial distribution). The only things considered are the skill of the hitter (in this case p) and the number of chances (PA’s). But this omits a huge factor, as the probability of getting a hit varies greatly by pitcher. Furthermore, thanks to MLB’s unbalanced schedule, there is a not insignificant chance that, particularly early in a season, individual hitters may have faced very different tests, so to speak, and so the number of at bats necessary to stabilize an estimate may vary. Also, the estimates should take into account at least an estimate of the difficulty of pitchers faced. Some batters may have their true score estimates stabilize very quickly; others (particularly platoon players) may take much longer. The point is, classic test theory may not be the right way to approach this problem, because there are more parameters in play than just the hitter. Just my two cents.

Thanks for the great feedback, Kevin.

The full context of Kincaid’s argument is in comment #12 here: http://tangotiger.com/index.php/site/comments/point-at-which-pitching-metrics-are-half-signal-half-noise#comments

So, what you’re saying is that KR-21 is apparently in disagreement with the regression to the mean method? It seemed pretty handy to me that the 50-50 skill vs. luck boundary in Kincaid’s method coincided almost perfectly with the 0.5 correlation figure in KR-21 and the Spearman-Brown adjusted split-sample. Any idea why they seem to be at odds with each other, if 0.7 is the real target?

Thanks for bringing up IRT. I think you’re right. I’m not sure how much more time I want to spend on this particular subject, but IRT could come in real handy for another semi-related project that’s on the horizon.

Thank you for upping the stat-nerd level with this article! After reading that post, I think I just fundamentally disagree with the argument Kincaid and Tango are making there. And there is no disagreement; all measures of internal consistency should agree on how many items are necessary to establish reliability, or they are not really measuring reliability. What I disagree with is the premise that an internal consistency of .5 means half error half true score; agreement between measurement methods does not establish the validity of this point. Kincaid and Tango are arguing that the two halves of a split sample are fundamentally independent; I disagree in this case, as they were produced by the same player. That they produce different amounts of error is irrelevant, as error is randomly distributed and thus correlates at 0 with everything.

The article really did get me thinking about the number of parameters we would need in an IRT model to effectively model the probability of hitting success (pitcher, defense, pitch type?), but in theory, the number of PA’s needed to estimate true score could be drastically reduced using IRT. In practice, however, what it means for us functionally is that every batter’s stats will stabilize differently as a function of the difficulty of the pitchers they have faced.

Glad my nerdiness could be of some use, haha. OK, here’s something I just tried in Excel to address this debate:

1) “=RAND()” in cell A1

2) “=A1+RAND()” in cell B1

3) “=A1+RAND()” in cell C1

Then copy them a ton of rows down, and find the correlation between columns B and C.

The RAND function gives you a random decimal between 0 and 1. If I’m not mistaken, column A can fill in as a representation of “skill”, and columns B and C are each composed of 50% of a matching skill level, and 50% randomness. The correlation between columns B and C gravitates towards 0.5. Thoughts?

I think you’re really going to like what Dave Cameron and I have been brainstorming on, by the way, assuming I don’t screw it up…
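The spreadsheet experiment above can be reproduced outside Excel. Here is a minimal Python sketch of the same three columns; the function names and the seed are my own choices, and `random.random()` stands in for Excel’s RAND().

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def shared_skill_correlation(n_rows: int, seed: int = 0) -> float:
    """Column A is 'skill'; columns B and C each add their own independent noise."""
    rng = random.Random(seed)
    col_b, col_c = [], []
    for _ in range(n_rows):
        skill = rng.random()                 # column A: =RAND()
        col_b.append(skill + rng.random())   # column B: =A1+RAND()
        col_c.append(skill + rng.random())   # column C: =A1+RAND()
    return pearson(col_b, col_c)


if __name__ == "__main__":
    print(round(shared_skill_correlation(20000, seed=1), 3))
```

Because the skill term and each noise term have the same variance, the correlation between B and C should settle near 0.5 as the number of rows grows.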

Weird. In the case you describe, the only variance in the sample would be random. So the correlation should approximate 0, not .5, because the only variance is random, and correl(random,random) approaches 0 as n approaches infinity. In any case, even if it did approach .5, all that would show is that the maximal correlation between two things with 50% error is .5, not that .5 is an acceptable reliability.

Regarding ASUR’s comments above: I will fully admit that multidimensional IRT models are not my area of expertise (I’ve worked primarily with Rasch models, when I’ve used IRT, which is not that much). However, the idea is framed with pitchers (or perhaps more specifically pitching ability) as the primary determinant of difficulty, as you point out above. As such, it would not theoretically matter that all batters didn’t face the same pitchers, just as with adaptive testing two students might not take exactly the same test. Also, if we allow pitcher ability to vary slightly, we can also contend that no batter takes the same “test” twice. The part where it gets fuzzy for me is how many parameters we incorporate into the model. I agree with your premise on the superiority of a Bayesian model now, but I think that ultimately the question is best answered by allowing for variable parameters instead of more accurate baseline ones, IMO.

There are certainly plenty of examples of IRT models that don’t require a common item pool (testlet, booklet, and adaptive model, etc.). My question, however, is what defines the item pool? If it’s just pitcher faced, then that’s not so bad. You would have one item per pitcher in the league (let’s say ~500 items in the pool), and each batter would only have “responses” for the pitchers they faced. There’s still the issue of how to deal with more than one PA against a particular pitcher. There are some applications of IRT in intelligent tutoring and instructional gaming where a learner tries a task multiple times that might work here. If you want to incorporate park, defense, etc., then your item pool is going to balloon pretty quickly. Let’s say that there are 500 pitchers, 30 parks, and 5 categories of defensive quality (very bad, bad, average, good, very good). That gives you 75,000 combinations of park/pitcher/defense. That’s an enormous item pool and it’s unlikely that you will have a large enough sample for many of those items to reliably estimate any sort of difficulty parameter.

As for the correlation issue, errors should be random, or at least that’s an assumption of regression. In the author’s example, however, we have two variables with some commonality (shared variance due to the values in A1), and some uniqueness (unique variance stemming from the random numbers added in columns B and C). The fact that there is some shared variance means that we shouldn’t expect the correlation to approach zero asymptotically. In this case, since each vector is composed of equal parts shared and unique variance, we would expect r=0.5.
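To spell out where the 0.5 comes from, here is a short derivation. The notation is my own assumption about the spreadsheet: column B is $A + e_1$ and column C is $A + e_2$, where $A$, $e_1$, and $e_2$ are independent and share the same variance $\sigma^2$.

```latex
\begin{aligned}
\operatorname{Cov}(B, C) &= \operatorname{Cov}(A + e_1,\; A + e_2) = \operatorname{Var}(A) = \sigma^2 \\
\operatorname{Var}(B) &= \operatorname{Var}(C) = \operatorname{Var}(A) + \operatorname{Var}(e) = 2\sigma^2 \\
r_{BC} &= \frac{\operatorname{Cov}(B, C)}{\sqrt{\operatorname{Var}(B)\,\operatorname{Var}(C)}}
        = \frac{\sigma^2}{2\sigma^2} = 0.5
\end{aligned}
```

The cross terms drop out because the three components are independent, so only the shared part contributes to the covariance.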

Both good points to be sure. Re: the correlation, I agree completely. Hence my comment earlier about it showing a maximal .5 correlation. But when I tried running that specific code in Excel, it approached 0, not .5. I’m going to assume it’s some kind of Excel fail on my part, since we all seem to agree that the correlation should approach .5.

Re: the model; I’m not sure it’s as difficult as you are making it sound, because we don’t need absolute precision here in order to reduce the N needed for true score estimation. For example, at least some of those factors (park) can be estimated using historical data, so we can take those trials out of the equation. Now all we have to estimate are 2 parameters, pitcher skill and defense, which seems more feasible. Also, we do have to remember that every PA can be considered somewhat of an independent item, as the state of the pitcher is not identical from one PA to the next, and thus we don’t really have batters taking the *same* item more than once. I’m not claiming I would be able to run the analysis tomorrow, FWIW, but I do think it’s possible.

I think what you are proposing could be more easily accomplished through a SEM framework, which would relax a number of the limitations/assumptions of IRT, or even IFA, which is basically IRT conducted in a SEM parameterization.

Also, I would avoid trying to treat every PA for every hitter as a separate item as (a) you will have an unmanageably large item pool, and (b) each item will only have a single response, which means that you won’t be able to estimate any item parameters and, thus, you won’t be able to estimate any person parameters, which is what we’re interested in.

Are you two statistics professors, data scientists, or what?

So I was messing around with that RAND spreadsheet a bit, weighting the variable in column A (the “skill” column) differently, trying to figure out how the different weights altered the correlation. When “skill” is weighted half as much as the random components, the correlation is at 0.2. When skill makes up 2/3 of the total, the correlation is at 0.8. Is there a formula that describes this relationship?

Filling in more weighting combinations, the graph looks like a logistic function. I found, experimentally, a formula that does a pretty good job of approximating the curve:

Correlation = 1/(1 + e^(-8.5*p + 4.25))

where p = (Weight of “skill”) / (Weight of Skill + Weight of Randomness)

Any thoughts on that? I imagine there’s already a better formula for this purpose out there?

I wish. Just a naive doctoral candidate. The results in your latest example seem reasonable. Given the bounds of the correlation metric (-1, 1; but really 0, 1 given the data you’re using) your results should approximate a normal ogive function, or a logistic function (equivalent to one another given a linear transformation). As the proportion of total variability attributable to unique variability approaches 1.0, r will approach 0, and vice versa. This is akin to the true score formula you presented above. As the proportion of observed score attributable to error approaches 1.0, reliability approaches 0.

The formula you just presented, incidentally, is the same as the formula for a logistic function. “P,” as you defined it, is an approximation of reliability. The two weights (-8.5 and 4.25) are artifacts of the scale you’re using, most likely. Notice that the exponent is equal to zero when reliability is equal to 0.5 (i.e., equal parts shared and unique variance). Plugging that into the equation for the logistic function returns an expected correlation of 0.5. Makes sense.

“…Is there a formula that describes this relationship?”

The variance of a uniform distribution is 1/12 * (b-a)^2, where (b-a) is the width of the distribution. If you give the skill distribution a width of 1, it will have a variance of 1/12. Then, if you give the random distribution a width 2, it will have a variance of 1/12 * 2^2 = 4/12. The total variance will be 1/12 + 4/12 = 5/12, and the skill portion will be 1/5 of the overall variance. The amount of overall variance that comes from the skill portion should match the correlation between B and C.

Or, simplified, take the two weights for skill and random and square them. So if your weights are 1 and 2, you get 1 for skill and 4 for random. Add the two together for the total weight (5, in this case). Divide the squared weight for the skill portion by the total, and that should give you approximately the correlation between columns B and C (similar to the p in your function, except square the weights first).
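The squared-weights rule can be checked against a simulation of the spreadsheet. This is a minimal sketch: it assumes columns B and C are each `w_skill * RAND() + w_random * RAND()` with the skill draw shared between them, and the function names are mine rather than anything from the article.

```python
import random


def expected_correlation(w_skill, w_random):
    """Share of total variance that comes from the common 'skill' term."""
    return w_skill**2 / (w_skill**2 + w_random**2)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_correlation(w_skill, w_random, n_rows=100_000, seed=0):
    """Build columns B and C from one shared uniform draw plus separate noise."""
    rng = random.Random(seed)
    col_b, col_c = [], []
    for _ in range(n_rows):
        skill = w_skill * rng.random()                 # shared between B and C
        col_b.append(skill + w_random * rng.random())  # B's own noise
        col_c.append(skill + w_random * rng.random())  # C's own noise
    return pearson(col_b, col_c)


predicted = expected_correlation(1, 2)
simulated = simulate_correlation(1, 2)
print(predicted)   # 0.2
print(simulated)
```

With weights of 1 and 2 the rule predicts 0.2, and the simulated value should land close to that for a large number of rows.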

Thanks guys. Nice, ASURay — what’s your area of study?

Kincaid — awesome — that formula is perfect, besides being easier. How did you know about that?

So, if you want to translate the skill percentage into the correlation directly, I guess the formula would be:

p^2 / (2p^2 - 2p + 1)
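For anyone curious how close the earlier logistic fit comes to this expression, here is a small comparison sketch. Both formulas are copied from the comments above; the function names are my own.

```python
import math


def logistic_fit(p):
    """The experimentally fitted curve: 1 / (1 + e^(-8.5p + 4.25))."""
    return 1.0 / (1.0 + math.exp(-8.5 * p + 4.25))


def exact_from_share(p):
    """Squared-weights formula rewritten in terms of the skill share p."""
    return p**2 / (2 * p**2 - 2 * p + 1)


for p in (0.2, 1 / 3, 0.5, 2 / 3, 0.8):
    print(f"p={p:.3f}  logistic={logistic_fit(p):.3f}  exact={exact_from_share(p):.3f}")
```

The two agree exactly at p = 0.5 and stay close through the middle of the range, which is presumably why the logistic curve looked like a good fit by eye.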

Psychometrics, technically. Most of my work is in Item Response Theory, Structural Equation Modeling, and Bayesian Networks. Substantively I work on problems in the fields of educational assessment, child development, and cognition.

I do lots of work in psychometrics as well, including some with IRT. Actually, I’m trying to get an article published right now dealing with how reducing item response options affects reliability (hint: it doesn’t make it better), which is mostly where my expertise lies. As you may surmise from ASUR and me, this type of problem actually comes up fairly often when you are trying to measure stuff like “intelligence” or “extroversion”.

BTW, I think it’s worth noting here that another one of the underlying assumptions of this problem may be untenable. All the methods of establishing reliability we have been discussing (split half, KR-21, internal consistency, regression) assume that the outcome in question (in this case OBP) is unidimensional, or in other words, that there is a single latent “true score” that is responsible for all the true variance in the outcome measure. Is that really true with OBP? Isn’t OBP a combination of at least 2 skills: the ability to recognize pitches and swing only at those in the zone, and the ability to make solid contact with the baseball when you do swing? If that is the case, then all of the methods we are debating here are inappropriate, as they generally assume a single latent trait. For multiply determined measures, test-retest or alternate forms reliability would be more appropriate than internal consistency type approaches, since we wouldn’t expect the stat to correlate as well with itself over time if it was multiply determined.

Hmmm…. when I look back at the original article; doesn’t it seem like most of the things that “stabilize” quickly are also the things that are the most singularly determined? If OBP is determined by multiple factors (i.e. pitch recognition and hitting ability), why do I expect it to correlate well with itself? In other words, for multidimensional stats, their components should stabilize more quickly than the stat itself, and internal consistency will underestimate its reliability due to the imperfect correlation between the components.

The formula comes from the idea that the correlation between column B and C matches how much of the variance in each comes from the “skill” portion. If you know how much of the variance in B and C comes from the common factor, that will tell you how well they correlate to each other. It’s just a simplification of taking the ratio of “skill” variance to the total variance.

You’re missing the point. You are assuming that there is A common factor. One. Not many.

This method of determining stability has limitations with multiply determined measures. Specifically, if there is one factor underlying the scale, then the reliability is the ratio of skill to total variance as you describe. But it is that ratio precisely because there is one, and only one, skill being measured. In theory, the true score variance correlates perfectly with itself, and error falls out. But what if there is more than one thing being measured? Then the “true score” variance will never correlate perfectly with itself, artificially capping the “reliability” of the measure at whatever the correlation between the factors is. The more factors there are, the less “internally consistent” the measure is. This results in the ‘correlate it with itself’ method underestimating the true reliability of the measure in question.

If you still have doubts, just repeat Steve’s experiment from above, but replace the one source of “true” variance with two imperfectly correlated sources of true variance and two random values. You will still have 50% random, 50% true, but you shouldn’t approach .5 anymore. You should approach .5 minus (1 – the correlation between factors).

Ignore that specific formula; it’s off. I’ll try to figure out what it should be.
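In the meantime, here is a rough simulation of one possible reading of that experiment. It assumes, and this is my interpretation rather than anything stated above, that column B is built on one factor and column C on a second factor correlated with the first at `rho`, each with its own equal-variance noise. Normal draws are used instead of RAND so the factor correlation is easy to control; all names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def two_factor_correlation(rho, n_rows=100_000, seed=0):
    """Correlation of B = f1 + noise and C = f2 + noise, with corr(f1, f2) = rho."""
    rng = random.Random(seed)
    col_b, col_c = [], []
    for _ in range(n_rows):
        f1 = rng.gauss(0, 1)
        # f2 has unit variance and correlation rho with f1.
        f2 = rho * f1 + (1 - rho**2) ** 0.5 * rng.gauss(0, 1)
        col_b.append(f1 + rng.gauss(0, 1))  # 50% "true", 50% random
        col_c.append(f2 + rng.gauss(0, 1))  # 50% "true", 50% random
    return pearson(col_b, col_c)


for rho in (1.0, 0.8, 0.6):
    print(rho, round(two_factor_correlation(rho), 3))
```

Under this particular setup the result comes out near 0.5 times `rho`, so it only reaches 0.5 when the two factors are perfectly correlated. A different setup, for example one where both columns contain the sum of both factors, would behave differently, so treat this as one scenario rather than the general answer.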

Psychometrics — that’s great, guys. I imagine there’s a lot of possible uses for psychometric techniques on baseball stats, given the uncertainties and complexities of both.

Kevin, very interesting. Thinking about it, though, I’m hard-pressed to think of stats that *don’t* have a bunch of variables behind them. One of the less complicated ones is probably something like vFA (4-seamer velocity according to PitchF/X), but even that varies depending on arm health and maybe mechanical issues. So I imagine we’d have to address not only the number of factors, but the inconsistencies in those factors.

When you criticize the idea of “a single latent ‘true score’ that is responsible for all the true variance in the outcome measure,” what do you see as the alternative? A true range? How would you deal with it?

Kevin said –

“since we wouldn’t expect the stat to correlate as well with itself over time if it was multiply determined … doesn’t it seem like most of the things that “stabilize” quickly are also the things that are the most singularly determined?”

Thanks for that insight. I’ve mostly finished a research article which breaks down BABIP into several components. In your terminology, I have taken what I considered a multiply determined stat and redefined it as a set of singularly determined ones. I have not yet run split half or similar tests to determine reliability levels, but the article does show what parts of BABIP a pitcher does and does not have control over, and why.

Personally I think selection bias should also always be explained in articles regarding stabilization. By human nature people discuss the players that have unexpected stats, and then the whole idea that the stat has predictive value because it has stabilized becomes untrue. Because you selected the player by looking for a large statistical deviation, the chance that this is due to randomness becomes much larger than when you select a random player.

Yeah, I’m with you there. Outliers can easily fool these methods. You need a much broader context, which means you’re usually better off just looking at the rest-of-season projections of Steamer, ZiPS, etc. for an idea of “true talent,” instead of putting a lot of faith in a current, partial season.

Nerds !!!!!!!!!!!111111111111111

Oh sure, I go on vacation for my brother’s wedding and this happens…

1) In my original article (2007), I made the mistake of whittling down the number of players included in my analyses as the sampling frame got smaller. It appears that you are using a similar method to my earlier work.

In my more recent work, I used multiple years, having first selected out the qualifying players (e.g., min 2000 PA in the years under study, obviously over multiple years, which for my recent analyses has been 2003 through either 2011 or 2012, depending on when I wrote it), and then ran each sampling frame on the same subset of players.

This has real consequences for your calculations. You’re using only 2012 (I think… please correct me if I’m wrong, in general), which will drive down how many players are in your sample, and could potentially mess with your variance estimates (larger samples are more robust at estimating variance), and certainly biases them. The players who log 600 PA in a season are going to be a more homogeneous bunch than the group you had at 500 PA. Playing time isn’t assigned randomly, as you obliquely allude to.

2) If you’re only using 2012, and since your graph seems to go to 650ish PA, I’m assuming that when you say 600 PA (to pick a number), you’re taking two samples of 300 PA each within 2012. I would personally label that as the reliability level at 300 PA. My thought process there is that the split-half there represents a 300 PA sample for a player compared to a roughly similar (as we can get) 300 PA sample. In fact, my estimate for stability for OBP is 460 here (http://www.baseballprospectus.com/article.php?articleid=17659) and 500 here (http://www.fangraphs.com/blogs/525600-minutes-how-do-you-measure-a-player-in-a-year/#more-48077), with a more coarse methodology. Or roughly about half of where you suggest that my method would have it (800-900). For the record, I consider 460 to be the product of a stronger methodology.

3) Total variance equals more than just random variance plus true variance. The real formula is total = random + true + variance due to measurement error. We assume that measurement error -> 0 as the sampling frame increases to infinity, but I don’t think that we can safely assume that we reach zero in even 750 PA. Your graph implicitly assumes that it is zero.

So, your graph of “True OBP Variance, 2012” will be pulled down, because you need to pull out the variance caused by measurement error. (Measurement error sources here can include changes in the underlying talent level during the sampling frame and differences in the quality of pitchers/defense faced, and probably a few other contextual factors. Part of the reason that I like split-half is that there’s something of a natural control for these factors inherent in the method.) That’s going to lower the amount of “true” variance in the model, and raise the time it takes true and random to meet.
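A quick sketch of that decomposition, for illustration only. It assumes the random part is the usual binomial variance, rate × (1 − rate) / PA, and every number below is hypothetical rather than taken from the article’s 2012 data; the function names are mine.

```python
def binomial_variance(rate, pa):
    """Random (binomial) variance of an observed rate over `pa` trials."""
    return rate * (1.0 - rate) / pa


def true_variance(observed_var, rate, pa, measurement_var=0.0):
    """Rearranged from: total = random + true + measurement error."""
    return observed_var - binomial_variance(rate, pa) - measurement_var


# HYPOTHETICAL inputs, chosen only to show the direction of the effect.
observed_var = 0.0012   # variance of OBP across players
league_obp = 0.320
pa = 600

naive = true_variance(observed_var, league_obp, pa)
adjusted = true_variance(observed_var, league_obp, pa, measurement_var=0.0002)

print(naive)
print(adjusted)
```

Any nonzero measurement-error term comes straight out of the “true” estimate, which is the downward pull described above.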

4) The final 2 figures make the same mistake as above. You suggest that KR-21 would cross .5 at 300ish PA. You have to halve those numbers. I’d estimate .5 at 150 PA. Eyeballing your graph, I’d estimate .7 (an R-squared of .5) at something around 340 PA (looks like it would cross at 680 or so?), which might be a product of using only 2012 data. I had earlier said 460 based on 2003-2011.

Your verdict in favor of the .5 model appears to be based on the thought that the regression to the mean model (which uses .5) gives an estimate of the point of half-noise, half skill at 305 PA (or so). It also appears that the RTM method does not halve its sampling frame as I do (that’s a difference in terminology, not method.)

I’m not entirely sure what you believe my nominee to be for the correct answer. (Again, based on your graph, I’d say 340 or so — which I would point out isn’t all that different from the RTM method.)

And based on the points above, I would suggest that the independent arbiter that you used in this article (which suggests an answer of 300 PA) has some flaws in it that bias that estimate downward, particularly the issue in #3. I believe these are the same mistaken assumptions that are present in the RTM model, which also takes the player’s observed OBP at face value and assumes that the rest is just a matter of cutting out the random variance. Therefore, I’m not surprised that, since the arbiter model shares the same assumption as the RTM model, it reaches a similar conclusion. If we could actually assume that measurement error was zero (or negligible) then RTM would be fine, and I would guess that those estimates would start to line up with split-half methods.

The reason that I think split-half is a better method for these sorts of things is that when you use even-odd, you have a better (although not perfect) distribution of errors such as facing more difficult pitchers (if you face a pitcher twice in a game, one of those PA’s goes in the even bin, the other goes in the odd…) When you compare two roughly equivalent samples of PA’s, and can hold the player who generated them constant, you can be much more confident that you’re comparing apples to apples. There probably is measurement error in there, but it’s going to be biased in the same direction within a player.
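For readers who want to see the mechanics, here is a minimal sketch of an even-odd split. It assumes each player’s PAs are stored as a chronological list of 0/1 outcomes (1 = reached base). The simulated players, their talent spread, and the function names are all illustrative assumptions. The simulation holds talent fixed and has no pitcher or park effects, so it shows how the split is computed, not the error-balancing benefit described above.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def even_odd_split_half(pa_results):
    """Return (rate over even-indexed PAs, rate over odd-indexed PAs)."""
    evens = pa_results[0::2]
    odds = pa_results[1::2]
    return sum(evens) / len(evens), sum(odds) / len(odds)


def simulate_split_half(n_players=2000, n_pa=600, mean_obp=0.33,
                        sd_obp=0.03, seed=0):
    """Split-half correlation across simulated players with fixed true OBP."""
    rng = random.Random(seed)
    even_rates, odd_rates = [], []
    for _ in range(n_players):
        # HYPOTHETICAL talent distribution, clipped to a valid probability.
        true_obp = min(max(rng.gauss(mean_obp, sd_obp), 0.0), 1.0)
        results = [1 if rng.random() < true_obp else 0 for _ in range(n_pa)]
        even_rate, odd_rate = even_odd_split_half(results)
        even_rates.append(even_rate)
        odd_rates.append(odd_rate)
    return pearson(even_rates, odd_rates)


r_half = simulate_split_half()
print(r_half)
```

Each half here contains 300 PA, so the printed correlation compares one 300 PA sample with another.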

As to the issue of the actual number (.7 or .5), it’s something of a red herring. What we’re really talking about is a difference in methodology. RTM asks a different question than does split-half and makes different assumptions. RTM assumes that you have a good solid prior to which you can regress (and league average is a good place to start, although there’s always the confounding factor that better players get more PA’s… so what’s the proper prior for someone with 200 PA? All players who got c. 200 PA? All players?).
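A common way to write the regression-to-the-mean step is as a weighted average of the observed rate and the prior, with weight PA / (PA + k), where k is the PA count at which the two get equal weight. The sketch below uses that form; k = 305 is the half-noise, half-skill figure quoted above, while the prior and the observed OBP are hypothetical numbers and the function name is mine.

```python
def regress_to_mean(observed_rate, pa, prior_rate, k):
    """Shrink an observed rate toward a prior.

    k is the number of PAs at which the observed rate and the prior
    receive equal weight (the half-skill, half-noise point).
    """
    weight = pa / (pa + k)
    return weight * observed_rate + (1.0 - weight) * prior_rate


# HYPOTHETICAL player: .400 OBP over 200 PA, regressed toward a .320 prior.
estimate = regress_to_mean(0.400, 200, 0.320, 305)
print(estimate)
```

The choice of `prior_rate` is exactly the open question raised above: swapping in the average of players with about 200 PA instead of the league average changes the answer without changing the formula.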

Thanks, Pizza, for the very thorough and thought-provoking response. I’ll try to answer:

1) Yes, you’re right about that. In response to the first commenter, I did run the RTM method on a 440 PA minimum sample, and found it didn’t really change the conclusion (at least not this time). I agree that the way I did it was not ideal. However, what if pitchers and bench players stabilize at different rates than regulars? I’m not saying my method really gets to the bottom of that — just saying it’s something to consider.

Yup, you were correct that I only used 2012. All the data I presented here was more for illustration than anything — clearly you did a much more thorough analysis of the data than I did here.

2) I suppose that makes sense. But the Spearman-Brown conversion is supposed to boost the correlation between two sets, each representing half of the full data set, up to the expected correlation between two full-sized sets. Would you therefore say that the Spearman-Brown transformed version of the chart is labeled correctly? If so, maybe the first chart is incorrect whether you halve the x-axis numbers or not.

3) Yeah, I did acknowledge measurement error as part of the equation, but failed to account for it. Part of that was because I didn’t know how to, but at the time, I also considered it to be negligible in this type of situation. I thought about it a little bit more, and I suppose you might consider bad umpiring decisions a source of measurement error?

4) I understand halving the PAs on the first graph, but why on the KR-21 graph? Can you point to examples that prove that to be the official method? Seeing how similar the results are to the Spearman-Brown-transformed numbers, I’d have to hold steady, unless you can prove otherwise.

Just a quick comment on 2): that’s not what the Spearman-Brown prophecy formula is really saying. SBPF tells you what the reliability of a scale (or statistic) would be at a given N; it’s not “half the full data set”. You can use (and we frequently do use) SB to estimate how reliable a 5 item survey would be if we had 25 or 50 items on the test. Think about it this way: SB tells you, assuming your ratios of true, error and measurement variance remain constant, at what point error can be expected to cancel itself out. The whole ‘assuming the ratios remain constant’ thing is what Pizza Cutter is alluding to above too; if you are only using data from one year, it might not be appropriate to assume that.
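For reference, the prophecy formula itself is short: predicted reliability = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened. A minimal sketch follows; the function name and the example reliabilities are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Split-half use: project a half-test correlation to full length (factor 2).
full_length = spearman_brown(0.5, 2)

# General use: a 5-item survey projected to 25 items (factor 5).
# The starting reliability of 0.40 is a made-up value.
longer_survey = spearman_brown(0.40, 5)

print(full_length)
print(longer_survey)
```

The split-half correction is just the special case with a factor of 2; the same function handles any change in length, which is the point being made above.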

Regarding your comment above with the multiple vs single traits: the dirty answer is that there is no good answer. Realistically, when a test measures a bunch of different things instead of one thing, or measures one poorly defined thing, we lower our expectations of how reliable that test is going to be. Or we measure the components independently, which would be my suggestion in this case. Break OBP down into its smallest parts, and estimate its reliability from those.

OK, I’m confused now. Allow me to quote from the textbook I cited:

“The calculated correlation [from the split-half method] is based on a comparison of two half-assessments, each containing one-half the total number of test questions. Because of the small number of questions, the estimated reliability tends to be low. Therefore, the Spearman-Brown formula is used to correct the coefficient for the error caused by splitting the test in half. This formula will give us a theoretical reliability estimate that is closer to the actual reliability estimate for the full version of the test.”

(p. 146-147)

Maybe I misstated something or was unclear, but that was the point I was trying to get across.