Using the absolute value of horizontal movement would not be correct, since it would in effect treat a curveball that moves in on a right-handed hitter as the same pitch as one that moves away from him. I have heard this suggestion before, and seen it posted elsewhere too, and could not disagree with it more strongly. Movement (and location, for that matter) in the horizontal direction should be treated as a spectrum. Just because the frame of reference (or datum, or zero, or whatever you want to call it) is set at a “straight” pitch does not mean you can treat pitches with -2 inches of break the same as pitches with +2 inches of break.
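To make the point concrete, here is a minimal sketch with made-up pitches. The field names `pfx_x` and `throws`, and the sign-mirroring helper, are assumptions for illustration and not anything from the original analysis. Taking the absolute value collapses opposite breaks into one value, while flipping the sign for left-handers (one possible way to pool pitchers of both hands) keeps direction on a spectrum.

```python
def mirror_for_handedness(pfx_x, throws):
    """Flip the sign for left-handed pitchers so the sign keeps a
    consistent meaning across handedness (hypothetical helper)."""
    return -pfx_x if throws == "L" else pfx_x


# Toy pitches, not real data: horizontal break in inches.
pitches = [
    {"throws": "R", "pfx_x": -2.0},
    {"throws": "R", "pfx_x": 2.0},
    {"throws": "L", "pfx_x": 2.0},
]

# Absolute value: all three pitches become indistinguishable.
absolute = [abs(p["pfx_x"]) for p in pitches]

# Mirroring: the two right-hander pitches stay distinct, and the
# left-hander's pitch lines up with its right-handed mirror image.
mirrored = [mirror_for_handedness(p["pfx_x"], p["throws"]) for p in pitches]

print(absolute)
print(mirrored)
```

Whether mirroring is the right way to pool left- and right-handers is a separate question; the sketch only shows that it keeps information the absolute value throws away.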

]]>I’d think that either 1) using absolute value of horizontal movement or 2) separating pitchers out by handedness should go a long way to solving the multiple regression problem.

]]>Still and all, a pitcher may make his living with 60-90% fastballs, but the same could never be said of curveballs. Familiarity will eventually get you creamed.

]]>In the 84 mph group there were 184 pitches swung on and missed and only 3 hit for home runs. Of all the velocity groups I graphed, it did have the fewest pitches, so sample size might be an issue. But in terms of misclassification, I don’t think it matters. Regardless of how the pitches were classified, they were similar enough for MLB Gameday to classify them together, so it looks like there was an 84 mph pitch last season that was very effective.

I have considered doing a multivariate regression analysis (and am still considering it). Logistically, the problem is that the relationships between the predictors and the outcome would probably have to be modelled piecewise rather than linearly (especially for horizontal movement). It’s not a trivial problem, but I assure you I do see the value in a regression analysis and am working on it.
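As a rough illustration of what “piecewise rather than linear” could mean, here is a minimal sketch that fits a separate least-squares line on each side of a breakpoint. The breakpoint at zero, the function names, and the synthetic V-shaped data are all assumptions for the example, not results from the analysis.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_piecewise(xs, ys, breakpoint):
    """Fit one line to points left of the breakpoint and another to the rest.
    Each side needs at least two distinct x values."""
    left = [(x, y) for x, y in zip(xs, ys) if x < breakpoint]
    right = [(x, y) for x, y in zip(xs, ys) if x >= breakpoint]
    return fit_line(*zip(*left)), fit_line(*zip(*right))


# Synthetic data: the outcome rises with break in either direction,
# which a single straight line through all points cannot capture.
xs = [float(x) for x in range(-10, 11)]
ys = [1.0 + 0.5 * abs(x) for x in xs]

(left_slope, left_icpt), (right_slope, right_icpt) = fit_piecewise(xs, ys, 0.0)
overall_slope, overall_icpt = fit_line(xs, ys)

print(left_slope, right_slope, overall_slope)
```

On this toy data the two segments recover slopes of opposite sign, while the single line through everything comes out flat. A real model would also need to choose the breakpoints and handle several predictors at once, which is where it stops being trivial.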

mcuni,

I got all my data from the mlb gameday website:

http://gd2.mlb.com/components/game/mlb/

MLB Gameday classifies each pitch thrown using an artificial neural network that is trained during spring training (each year, I think/hope). It also gives a confidence value for each pitch to tell you how confident it is that it has classified the pitch correctly.

In terms of data acquisition, I wrote a quick script that mined the XML files for all the games from 2008-2011. I just used the 2011 season for this analysis. Then I wrote a second script that parsed out the data I was interested in from the game XML files. It took a few days to get everything working right, but all together, it wasn’t that bad.
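For anyone wanting to try something similar, the parsing step might look roughly like the sketch below. This is not my actual script: the sample XML is made up, and the element and attribute names (`pitch`, `pitch_type`, `type_confidence`, `start_speed`, `pfx_x`, `des`) are assumptions about the file layout that you should check against the real files.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout; real files would be read from disk.
SAMPLE_XML = """
<inning num="1">
  <top>
    <atbat batter="000001" pitcher="000002">
      <pitch des="Swinging Strike" pitch_type="CU" type_confidence="0.91"
             start_speed="84.1" pfx_x="-2.3"/>
      <pitch des="Ball" pitch_type="FF" type_confidence="0.88"
             start_speed="92.5" pfx_x="4.0"/>
    </atbat>
  </top>
</inning>
"""


def extract_pitches(xml_text, pitch_type):
    """Return one dict per <pitch> element whose classification matches."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for pitch in root.iter("pitch"):
        if pitch.get("pitch_type") != pitch_type:
            continue
        rows.append({
            "des": pitch.get("des"),
            "speed": float(pitch.get("start_speed")),
            "pfx_x": float(pitch.get("pfx_x")),
            "confidence": float(pitch.get("type_confidence")),
        })
    return rows


curveballs = extract_pitches(SAMPLE_XML, "CU")
print(curveballs)
```

Keeping the confidence value on each row makes it easy to drop low-confidence classifications later.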

Hope that helps!

]]>For instance, here at FanGraphs I take some random batter; here is his play log for 2011:

http://www.fangraphs.com/statsp.aspx?playerid=3057&position=C&season=2011

There is no pitch-by-pitch data. I can only get it aggregated for the whole game:

http://www.fangraphs.com/statsd.aspx?playerid=3057&position=C&type=6&gds=&gde=&season=2011

But that’s not enough to analyze! The fact that he saw 79% fastballs on 2011-09-28 @LAA doesn’t say anything about HRs and swinging strikes among those. I’m totally lost here. Maybe those stats are available at fangraphs+?

I would much appreciate it if you could clarify this.

]]>Are you sure you aren’t reading too much into the 84 mph velocity issue? Is there any chance some of those may actually be misclassified sliders?

Also, did you consider summarizing for all pitchers and using a multiple regression to determine which effects (and interactions) are significant? For horizontal location / movement, you could use the absolute values to account for differences between left- and right-handers.

If you used a multiple regression, you could also use relative importance metrics to tease out which terms are not just statistically significant but are actually meaningful, which could be interesting to see as well.

I like what you’ve done here and hope to see more soon.

]]>