I really enjoyed your work in THT Annual by the way.

Point is, it’s definitely on my to-do list, though I’ve been busy lately with other stuff.

If you say the league K% is 20%, and you’re talking about a batter who Ks 20% of the time and a pitcher who Ks 20% of batters, you’d probably expect that pitcher to strike out that batter about 20% of the time, right? That’s what all the formulas say, more or less. How about a batter and pitcher who both have a K% of 80? With a league K% of 20, all the formulas say something in the 90% range will be the outcome. But if you adjust the league K% up to 80%, then odds ratio instead says the batter will K only 80% of the time against that pitcher.
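For anyone who wants to check those numbers, here’s a quick sketch (my own, not code from the article) of the odds ratio method: convert each rate to odds, multiply the batter’s and pitcher’s odds, divide by the league’s, then convert back to a probability.

```python
# Sketch of the odds ratio method as I understand it (not the article's code).
# odds = p / (1 - p); matchup odds = batter odds * pitcher odds / league odds.
def odds_ratio_k(batter_k, pitcher_k, league_k):
    to_odds = lambda p: p / (1.0 - p)
    o = to_odds(batter_k) * to_odds(pitcher_k) / to_odds(league_k)
    return o / (1.0 + o)

print(round(odds_ratio_k(0.20, 0.20, 0.20), 3))  # 0.2
print(round(odds_ratio_k(0.80, 0.80, 0.20), 3))  # 0.985, the "90% range"
print(round(odds_ratio_k(0.80, 0.80, 0.80), 3))  # 0.8
```

With everything at the league rate the method hands back the league rate, and raising the league K% to 80% pulls the 80/80 matchup back down to exactly 80%, as described above.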

The trick is, the chance of not striking out is 1 minus the chance of striking out; that means, if the formula tells you that a strikeout will happen 20% of the time in a certain circumstance, then it ought to also tell you that it *won’t* happen 80% of the time in that same circumstance. It seems to me that taking the overall average into account is the only way you can do that.
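To put a number on that self-consistency point, here’s a quick check (my own illustration, with made-up rates): feed the odds ratio method the complementary non-strikeout rates, and its answer should be exactly 1 minus its strikeout prediction.

```python
def odds_ratio(b, p, lg):
    # Matchup probability via the odds ratio method.
    o = (b / (1 - b)) * (p / (1 - p)) / (lg / (1 - lg))
    return o / (1 + o)

k_prob = odds_ratio(0.25, 0.15, 0.20)     # chance of a K
no_k_prob = odds_ratio(0.75, 0.85, 0.80)  # same matchup, stated as non-K rates
print(round(k_prob + no_k_prob, 9))  # 1.0
```

Because taking complements just inverts every odds term, the two predictions always sum to 1 for this method, whatever rates you pick.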

I’m still not getting it regarding PPP/NPP: everything I’m seeing shows it dealing with True Positives, False Positives, etc. I don’t think, even if a K is the likeliest outcome of a particular PA, that there could be enough confidence in that assessment to label it a “positive.”

I noticed that the following form of the PPP formula is actually similar to the odds ratio method: (sensitivity x prevalence) / (sensitivity x prevalence + (1 − specificity) x (1 − prevalence)). Prevalence is analogous to league average, and both formulas factor in the (1 − …) complements of their terms. Both are based on Bayes’ Theorem, I bet.
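For what it’s worth, the standard PPV formula’s second denominator term is the false positive rate, (1 − specificity). Here’s a tiny numerical sketch of the Bayes connection, with hypothetical numbers, which also happens to show the false positive paradox at work:

```python
# PPV via Bayes' theorem. The denominator's second term is the false
# positive rate, (1 - specificity), weighted by (1 - prevalence).
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 90% sensitive, 95% specific, but only 1% prevalence.
print(round(ppv(0.90, 0.95, 0.01), 3))  # 0.154
```

Even a pretty accurate test gives mostly false positives when the condition is rare, which is exactly the low-prevalence effect that makes the league average (or prevalence) term matter.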

Hm, not a bad idea regarding managerial effectiveness. Not easy, though — there are a lot of variables to consider, some of which we are not always privy to (e.g. health status of bench players or relievers).

Regarding PPP/NPP, it might not be something you can use here explicitly, but as you have described it thus far, your goal is to create essentially a probability matrix of outcomes for each at bat. Which means you will have a probable outcome (along with other, less probable outcomes) for each PA. Although we may never have a K% above maybe 30% for a given at bat, there are likely to be situations in which a K is the most probable outcome. Also, since there are a finite number of outcomes to a PA, computing PPP/NPP will be possible (it’s not strictly binary).

Regardless, the basic idea here is that by comparing the amount of error variance above and below the regression lines to the total amount of variance above and below the lines (which is what PPP/NPP does, in a way), I think you can essentially accomplish what you are trying to do with the league K% without explicitly adding it into your formula. Or something like that, at least. I wanted to see it in table form because… well… I like tables better than graphs. In any case, the reason I think this matters is that in this experiment, since strikeouts are the low-probability event, I would give more weight to the formula that over-predicts strikeouts than to the formula that under-predicts them.

BTW, is this moving towards a measure of managerial effectiveness (manager WAR if you will)?

Anyway, Formulas 1 & 2 don’t directly include league K% the way Odds Ratio and the Logistic function do, but they indirectly include approximately 2002–2012’s league K% through their weighting. That’s important, I think, in light of the fact that the historical K% of each player (which I’m using as an approximation of their “true” rate) has a lot to do with who they faced in the past, and league K% is a proxy for that.

That reminds me of: http://en.wikipedia.org/wiki/False_positive_paradox by the way.

Positive and negative predictive power — how could I apply those here? In what I’m seeing, it looks like they’re used for yes or no predictions. I’m not making any yes or no predictions on strikeouts — only giving a probability.

Not sure what you mean about the direction of the errors — I thought the slopes of the regression lines on the scatterplots, along with the average errors (as opposed to the mean absolute errors) were indicators of that.

A couple of brief points: first, I’m not really sure why you are including the league-wide strikeout rate here. Well, I understand why, but I think you’re wrong to do it, as the league-wide strikeout rate really shouldn’t have anything to do with your prediction for an individual at bat, unless you are trying to bias yourself in the correct direction. Basically, including it is a version of the gambler’s fallacy, in a way. But that’s a minor point.

More importantly (and nice work again, by the way), I would like to persuade you, in the future (and present, if possible), to use positive predictive power, hit rate, and negative predictive power instead of mean absolute error in reporting your findings. I, and presumably other informed readers, would find it far from trivial to know in what direction each formula made its errors. To be honest, information like that is usually how I determine the relative utility of predictive models. I also have a hunch that if you break down the distributions of the errors a bit more, you may find what is “wrong” with the odds ratio method.
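In case it helps, here is roughly what I mean, sketched with made-up yes/no calls (in practice you’d get these by thresholding each PA’s predicted probability; the data and threshold here are purely illustrative):

```python
def report(actual, predicted):
    # PPV, NPV, and hit rate from binary K / no-K calls.
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "hit_rate": (tp + tn) / len(pairs),
    }

actual    = [True, False, False, True, False, False]   # did a K happen?
predicted = [True, True,  False, False, False, False]  # did the model call a K?
print(report(actual, predicted))  # PPV 0.5, NPV 0.75, hit_rate ~0.667
```

Unlike a single mean absolute error figure, the PPV/NPV split immediately shows whether a formula misses by calling Ks that don’t happen or by failing to call Ks that do.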
