Towards an award prediction system

Four weeks back, I gave readers here an introduction to my work in the 2014 Hardball Times Annual on voting patterns for the Most Valuable Player and Cy Young Awards. I used correspondences between success in various statistical categories and winning those awards in the past couple of decades to make predictions on how the votes would go this year. The choices this method produced weren’t exactly shocking, which pleased me. Something would have been amiss if it had produced offbeat selections.

In the Comments section, reader Dr. Doom asked me a pertinent question (I do prefer those to impertinent questions), which I will repeat here:

“[H]ow well does this method back-correlate? Of the 34 races you discussed above, how many times did your system pop out the ‘correct’ answer?”

This was something I hadn’t done. To partly repeat the reply I gave Dr. Doom, I was concerned mostly with finding the correlations, and making a predictive system out of those correlations was a tack-on notion. On reflection, I decided that this indeed was a hole in my work, that I should give a look to how well the numbers “back-predict” the winners of the major awards.

I’ll be doing this, at least at the start, the same way I worked out the 2013 predictions. I’ll take the “medalists,” the top-three finishers in the voting, measure which statistical categories they led between the three, and assign them correlation scores for each, which measure how often the stat leader won the award. Add them up, and whoever has the highest number is the predicted winner. I also run variations where closely related metrics, such as bWAR and fWAR, are consolidated into one, so that success in one area is not unfairly magnified.
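
For readers who want to see the mechanics, the tallying procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet behind the article: the category scores, the candidates "A" and "B", and the function name `predict_winner` are hypothetical placeholders. The compression rule used here (when one medalist sweeps a related group, only the largest score in that group counts) is an assumption about how the consolidation might work.

```python
from collections import defaultdict


def predict_winner(leaders, scores, groups=()):
    """Sum the correlation scores of the categories each medalist led.

    leaders: category -> medalist who led it among the top three
    scores:  category -> correlation score for that category
    groups:  tuples of related categories; if one medalist sweeps a group,
             only the largest score in it is counted (assumed rule)
    """
    skipped = set()
    for group in groups:
        present = [c for c in group if c in leaders]
        same_leader = len({leaders[c] for c in present}) == 1
        if len(present) > 1 and same_leader:
            keep = max(present, key=lambda c: scores[c])
            skipped.update(c for c in present if c != keep)

    totals = defaultdict(float)
    for category, player in leaders.items():
        if category not in skipped:
            totals[player] += scores[category]

    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Hypothetical scores and leaders, for illustration only.
scores = {"bWAR": 0.30, "fWAR": 0.25, "ERA": 0.20, "K": 0.50, "K/9": 0.40}
leaders = {"bWAR": "A", "fWAR": "A", "ERA": "A", "K": "B", "K/9": "B"}

print(predict_winner(leaders, scores))
print(predict_winner(leaders, scores, groups=[("K", "K/9")]))
```

In this made-up case, candidate B wins when strikeouts and K/9 count separately, and candidate A wins once they are compressed. The sketch does not handle ties for a category lead, which a fuller version would need to split in some way.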

You may recall that I was rather secretive with the numbers behind all this in my earlier article, as I was trying to tease you into buying the Annual where I did my original study without giving away all the insights I had there. I’m still trying to get you to buy the Annual (seriously, how much more often do I have to link it?), but I’m also going to be somewhat looser with those nuggets I unearthed. That’s right, I’m going to reveal the Magic Month that matters so much to MVP voters, and a little more besides.

2013 redux

Before looking at the historical predictive values, I need to re-visit the contemporary forecasts the correlation numbers were making. You may recall I made educated guesses as to the top three finishers in each category, using votes made by the staff of Baseball Prospectus. In only one of the four major award races, the AL MVP, did those triplets match up with the actual BBWAA tallies. Fortunately, the top two always matched, and it was just the bronze medalists who were different. I will run those three races through my correlation numbers again to see if there are any changes.

In doing so, I also will be correcting a serious error I made in my original article. Due to what I later found was a transcription error (I had a stolen base number in the home run column), I cheated Paul Goldschmidt out of his lead in home runs, giving it first to Joey Votto and then upon his elimination to Andrew McCutchen. Obviously, this had a big effect on what was already a moderately close NL MVP race. I went through a correction in the comments section of the first article, but I will update that for the wider audience here.

That will come after I handle the Cy Young races. First, the National League, where Matt Harvey’s season-ending injury took him further out of the real-life race than it did for B-P’s voters, and for my own guesstimates. Instead of Harvey getting third place, it was Jose Fernandez, the exciting rookie hurler for the Miami Marlins.

Harvey’s departure didn’t do Fernandez much good: he led the medalists in strikeouts per nine innings, but no other category. The five categories Harvey led got split fairly evenly between Clayton Kershaw and Adam Wainwright.

Wainwright got walks per nine, along with xFIP-, the metric that gauges pitchers on strikeouts, walks, and fly balls (as a more predictive placeholder for home runs). Kershaw picked up FIP and FIP-, the metrics that do use home runs, along with, fittingly enough, fewest homers per nine innings—which, you may recall, actually has a negative correspondence to winning the Cy Young. (Which is no longer the case. After this year’s results, the HR/9 correspondence has gone from marginally negative to marginally positive.)

Given Kershaw’s preexisting wide lead in the numbers, this split does not change much. Even when compressing related groups of stats into a single smaller number, because they are measuring largely the same things, Kershaw’s margin is still firm, as this small table will show.

 Proj'd Top Three     Actual Top Three
Kershaw:    1.462    Kershaw:    1.819
Harvey:     0.739    Wainwright: 1.004
Wainwright: 0.618    Fernandez:  0.382

With the NL Cy pleasantly stable, I move over to the American League. Here, B-P’s voters gave third place to Seattle’s Felix Hernandez, while my numbers pointed to winner Max Scherzer’s rotation-mate, Anibal Sanchez. The real third place did go to a teammate, but one of King Felix’s: Hisashi Iwakuma. Iwakuma finished out of the top five of B-P’s balloting, so I didn’t have him in the mix when trying to identify my top three.

In my original pass, this was a close race, with Yu Darvish edging ahead of Scherzer if one compressed the related categories that Scherzer led, but not the arguably less linked ones that Darvish led. I still called it for Scherzer, with a slight qualm. With Iwakuma in the mix instead of Sanchez, it’s going to be deja vu all over again.

Iwakuma makes a highly creditable showing in the categories I examined. Out of 15, he leads the medalists in four—bWAR, ERA, innings pitched, and walks per nine—and is tied with Darvish in ERA-. Iwakuma takes from Scherzer more than Darvish, but this is balanced by the departing Sanchez’s value breaking in Scherzer’s favor. Unadjusted correlation numbers for the real top three look eerily similar to those for the projected top three:

Proj'd (Unadj'd)   Actual (Unadj'd)
Scherzer: 1.668    Scherzer: 1.627
Darvish:  1.327    Darvish:  1.419
Sanchez:  1.107    Iwakuma:  1.225

Adjustment for overlap originally was split between Scherzer and Sanchez. With Iwakuma taking Sanchez’s place, the adjustments hit Scherzer almost exclusively, though roughly to the extent they did earlier. The lowering of Scherzer’s totals would put Darvish in the lead—if one credits him fully for his leads in strikeouts and strikeout rate. If we compress those, Scherzer just holds his lead. See the final results, without and then with the K – K/9 compression:

          K, K/9 separate   K, K/9 compressed
Scherzer: 1.149             1.149
Darvish:  1.419             1.037
Iwakuma:  1.133             1.133
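
As a quick check on the table above, the short sketch below ranks the three pitchers under each treatment and reports the leader's margin. The totals are copied from the table; the variable names are just illustrative.

```python
# Final AL Cy Young totals from the table above:
# (K and K/9 counted separately, K and K/9 compressed).
totals = {
    "Scherzer": (1.149, 1.149),
    "Darvish": (1.419, 1.037),
    "Iwakuma": (1.133, 1.133),
}

results = {}
for label, idx in (("separate", 0), ("compressed", 1)):
    ranked = sorted(totals, key=lambda p: totals[p][idx], reverse=True)
    margin = round(totals[ranked[0]][idx] - totals[ranked[1]][idx], 3)
    results[label] = (ranked, margin)
    print(f"{label}: {ranked[0]} leads {ranked[1]} by {margin:.3f}")

# What the compression costs Darvish.
darvish_cost = round(totals["Darvish"][0] - totals["Darvish"][1], 3)
print(f"Darvish loses {darvish_cost:.3f}")
```

One thing this makes visible is that the compressed version drops Darvish behind Iwakuma as well as Scherzer. The amount Darvish loses also equals Fernandez's total in the NL table, where he led only in strikeouts per nine. That suggests, though the article does not say so outright, that the compression amounts to setting aside the K/9 score.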

It turns out just the way it did before. One formulation favoring Darvish makes him the projected winner; the others land on Scherzer’s side. I could easily do here as I did before, go with the majority of projections, and bask in the system having nailed another one.

I’m going to hold off on that. I’d rather see whether counting strikeouts and strikeout rate separately makes for better predictions over the long haul, which is coming in the second half of this article. So let’s put the AL Cy Young to one side, and finish up with the NL Most Valuable Player vote, and that little goof I made with the home runs.

The voters threw another curve here, again putting into third place someone the B-P staff raised no higher than sixth. This time it was St. Louis catcher Yadier Molina, dislodging teammate Matt Carpenter from the awards podium. This could be taken as a sign of rising voter sophistication, giving Molina credit for things like pitch framing, which long have been hidden values in baseball. Or maybe he just got a break from a couple of hometown baseball writers, who slipped him ahead of the other hometown guy. Opinions vary.

The departing Carpenter led the medalists only in batting average and runs scored. In his place, Molina led the medalists in batting average alone, runs going instead to Goldschmidt. This is a gratifyingly slight change in the numbers, even if it does edge Goldschmidt’s way and thus adds to the threat that I will look like an idiot. Still, with unadjusted numbers, McCutchen retains a clear, if not all that comfortable, lead.

As with the Cy Young races, I telescoped some groups of stats that were close to measuring the same thing. McCutchen had this happen for bWAR and fWAR, along with two pairs of metrics taken from the month of August, the “Magic Month” when voters’ attention is highly focused. Goldschmidt has his numbers compressed for wOBA/wRC+, OPS/OPS+, and then it’s a close call whether slugging and isolated power merit the same treatment. In this case, I will leave SLG/ISO alone, to give Goldschmidt his best possible case.

  Proj'd (Adj'd)        Actual (Adj'd)
McCutchen:   1.293    McCutchen:   1.227
Goldschmidt: 1.069    Goldschmidt: 1.213
Carpenter:   0.369    Molina:      0.338

Tilt things as far from McCutchen as I can, and he still maintains a razor-thin lead. Compress Goldschmidt’s leads in slugging and isolated power into one number, and McCutchen’s lead expands from 0.014 to 0.198, which gives room to at least take a shallow breath. It may be a squeaker, but it’s McCutchen’s squeaker: the system predicts he takes the award after all. Bullet, dodged.

But yes, I’m going to take a long-view look at this one, as well.

The American League MVP race doesn’t need a second look: I got all three finalists right for once. But now that I’ve come clean about August as the Magic Month, I should say something about how Miguel Cabrera is the perfect exemplar of this unexpected principle in MVP voting.

Last year, Cabrera substantially outperformed Mike Trout over the final two months of the season, which helped nail down his first MVP Award. Though one would intuitively look to September as having influenced voters the most, the numbers even then pointed to August as the more critical time. The 2013 season vindicated that result.

Cabrera had another huge August, better than last year’s, outdistancing Trout again even though Trout that month was well above his own season-long performance. Cabrera’s September, though, was ruined by the lower-body injuries he would carry into the postseason. He cratered in September, outstripped even by down months from Trout and Chris Davis. Yet Cabrera cruised to his second MVP Award.

You can say the voters had a narrative reason to excuse Cabrera’s lousy finish: he was hurt and shouldn’t be penalized for such bad luck. (Harvey could only wish his voting pool had reacted the same way.) Funny, I don’t recall them crediting Trout last year for the month he missed while stuck in the minors. Cabrera’s down September probably did matter, but not as much as what he did before, particularly in the month when voters apparently pay the most attention.

That’s enough sermonizing from me. Let’s get to the backward-predictions. (Maybe that should be “back-casts”? I don’t know.)

Looking backward

One complication in figuring historical predictions is that, with another year of data available, the numbers change. The correlation factors have gone up or down, or occasionally stayed even, depending on whether the real award winners led the medalists in all of those categories. This is great, in that more data is better, and it is a pain, in that the numbers I’ve been using until now have to change.

Ironically, my cloaking the numbers to give you reasons to go buy the Annual now buffers you from this jolt. It still leaves you in the position of having to trust my numbers without seeing most of them, so I’m going to loosen my grip on those numbers somewhat, sprinkling examples throughout this section. For instance, with Goldschmidt and Trout both missing the brass ring this year, the correlation between leading the medalists in runs scored and winning the MVP Award has slumped from barely positive to exactly zero.

Now, I could have used the new figures to re-calculate this year’s award races, but I refrained because of a logical fallacy. I would effectively be using the results of a prediction to help decide what prediction to make in the first place. I’ve done enough science fiction to know the terrible perils of creating a temporal-causal loop like that. Step on a butterfly in 1996, and today Jeffrey Loria is the Commissioner of Baseball, or President of the United States. I’ll run screaming from that, thank you.

To efface that image from our minds, I’ll get right to the Cy Young races. I tallied up three iterations: one counting the correspondences in all of the categories, without compression; one that compresses the more obviously related metrics, like bWAR/fWAR and FIP/FIP-/xFIP-, when one candidate sweeps them; one that also compresses the borderline case of strikeouts and K/9. I had thought that these compressions could shuffle the finishing orders in many races. I thought wrong.

Out of 36 Cy Young votes from 1996 through 2013, the unadjusted correlation system chooses 25 of the winners correctly, picking the voters’ second-place man five times and the third-place finisher six. After the first level of adjustments, it stays 25 of 36, while one preference for a third-place finisher changes to second instead, a tiny improvement. (Actually it was a third to first, Mariano Rivera to Pat Hentgen in the 1996 AL, plus a first to second, Cliff Lee to Roy Halladay in the 2008 AL.)

Add the strikeout compression, and we lose a correct prediction: it turns away from 2005 NL Cy winner Chris Carpenter in favor of bronze medalist Roger Clemens.

I almost outsmarted myself with my attempt to diminish the influence of closely related metrics, at least with the final step. The effect was quite marginal, even if it did save the system from predicting a Cy Young win for a set-up reliever in 1996. (You readers know I love Mariano, but even if it weren’t a mistake, what voting body would have the guts?) By that marginal amount, though, the partial compression seems to be the best way for this group of correlations to predict Cy Young winners.

Applied back to the AL Cy Young race from this year, this would appear to support a mistake, forecasting Darvish to win the award where he actually took second. And it would do so … if we stuck only with the 1996-2012 numbers. But this year’s votes altered those numbers, in ways boosting the case for Scherzer. This is rather tautological: Scherzer won the Cy, so metrics where he led get a boost in correlation scores (which might be augmented or balanced out by results in the NL). And vice versa for Darvish: the metrics he led get debited because he ran second.

For a one-year prediction this wouldn’t be kosher, but when trying to create the best prediction for a range of years it’s another matter. The tweaks in the numbers that make Scherzer the “predicted” winner in 2013 may well have changed the calculations in other years, quite possibly taking a predicted triumph from a Cy Young winner and giving it to the third-place guy instead. I could investigate that possibility, but looking at the spreadsheets is already starting to give me vertigo, so I’ll just note the hypothetical case and press forward.

Getting 25 of a possible 36 Cy Young votes correct out of a three-player short list may look pretty good. Maybe it even is good. But there is a much quicker route to that level of accuracy. The strongest correlation among pitcher stats I studied is with strikeouts—and the strikeout leader among the medalists won the Cy Young Award 25 out of 36 times. All the other statistics I threw into the mix made zero improvement on that record.

Yes, that was a disappointment. Maybe the MVP numbers will make up for that … or maybe not.

I ran the same checks for the MVP numbers, one counting all the categories, one compressing the obviously related ones (bWAR/fWAR; OPS/OPS+; wOBA/wRC+; August OPS/August OPS+; August wOBA/August wRC+), and one also compressing the less obviously related slugging and isolated power into one number. This time, the best predictions came from the first, uncompressed version. It got 21 of the 36 races right, where the compressed versions lost the 1996 NL MVP race, turning away from winner Ken Caminiti to embrace third-place finisher Ellis Burks.

(The added figures for 2013 thankfully did not jumble the predictions for that year. McCutchen ended up getting a clearer nod in the NL, and Cabrera in the AL was in no danger whatever of being overhauled.)

Correspondences for the MVP races weren’t as strong as those for the Cy Young, but there are still single metrics that match or beat the whole system’s predictive power. Slugging for the year matched 21 out of the 36, and OPS and OPS+ for the month of August did one better, 22 of 36. August wOBA and wRC+ are available only back to 2002, but they both produce better rates than the whole suite of metrics. This isn’t just disappointing, it’s retrograde.

However, there is a factor that has a substantial correspondence with winning MVP Awards that I haven’t yet taken into account: playoff appearances. Whether they should or not, whether we agree or not, voters like to see the MVP come from a playoff team. (The Cy Young is much more also-ran friendly.) If we’re going to predict what choices the voters make, this should be part of the equation. But how large a part?

I’m trying to make a good predictive algorithm out of this, so preferably it’ll be enough of a part to maximize the correct predictions it makes. That said, I don’t want the playoff element to overwhelm everything else. If individual statistics are producing correspondence numbers like 0.4 and 0.5 at the high end, I hope to avoid needing a 2.5 bonus for division winners to make things come out well.

Fortunately, I don’t have to get that crazy to get movement. Using the second, partial-compression set of figures, I find I need just a 0.5 bonus for division winners, and a 0.3 bonus for wild-card teams, to change the 20 of 36 correct predictions into 24. These light touches take care of the ’96 NL race, and also get the 2005 and 2010 NL and 2007 AL race predictions correct. With a full one-point bonus for winning the division, I could bring the 1996 AL into line, but I remain chary about such large corrections.
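
The playoff adjustment can be pictured as a flat bonus added to each medalist's total before picking the highest number. The sketch below uses the 0.5 and 0.3 bonus sizes quoted above; the players "X", "Y", and "Z", their totals, and the function name `add_playoff_bonus` are hypothetical and chosen only to show a case where the bonus flips the pick.

```python
# Bonus sizes quoted in the article; everything else here is illustrative.
PLAYOFF_BONUS = {"division": 0.5, "wild_card": 0.3}


def add_playoff_bonus(totals, playoff_status):
    """Return new totals with a flat bonus for medalists on playoff teams.

    playoff_status maps a player to "division" or "wild_card";
    players who are absent from it get no bonus.
    """
    return {
        player: score + PLAYOFF_BONUS.get(playoff_status.get(player), 0.0)
        for player, score in totals.items()
    }


# Hypothetical partial-compression totals for three medalists.
totals = {"X": 1.10, "Y": 1.00, "Z": 0.40}
status = {"Y": "division", "Z": "wild_card"}

adjusted = add_playoff_bonus(totals, status)
before = max(totals, key=totals.get)
after = max(adjusted, key=adjusted.get)
print(before, after)
```

Here the statistical leader X, whose team missed the playoffs, is overtaken by division winner Y once the bonus is applied. The larger the bonus, the more often that kind of flip happens, which is the reason for keeping it small relative to the individual category scores.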

I could have started with the no-compression numbers, the ones that had the better predictive value to begin with. I didn’t, because to get to the same 24-of-36 endpoint, I would have needed a much larger division bonus, over 1.2. Telescoping the numbers did me a favor.

A result of 24 of 36 is almost as good as the Cy Young numbers, but as with those, it cannot outstrip a much simpler method. Using the players’ wRC+ in the month of August as a predictor produces the same two-thirds ratio of hits (16 of 24), though in a smaller, 24-award sample. Such methods as August wRC+ and pitcher strikeouts may not actually be better (with over a dozen metrics behind each award, chance alone could produce some high numbers), but if the overall system can’t rise above the noise of luck, it isn’t doing very much good.
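
The point about chance can be made concrete with a binomial calculation. The sketch below asks a narrow question: if a single metric had some true hit rate `p`, how likely is it to reach at least `k` hits in `n` races, and how likely is it that the best of `m` such metrics does? The hit rate of 0.55 and the count of 15 metrics are assumptions picked for illustration. So is treating the metrics as independent, which real baseball statistics certainly are not.

```python
from math import comb


def tail_prob(n, k, p):
    """Probability of at least k hits in n independent tries at hit rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_of(m, n, k, p):
    """Probability that at least one of m independent metrics reaches k hits."""
    return 1 - (1 - tail_prob(n, k, p)) ** m


# Assumed inputs: 36 races, a 24-hit threshold, and 15 metrics that each
# have a hypothetical true hit rate of 0.55.
single = tail_prob(36, 24, 0.55)
any_of_15 = best_of(15, 36, 24, 0.55)
print(f"one metric: {single:.3f}, best of 15: {any_of_15:.3f}")
```

Under these assumptions a lone metric only occasionally reaches 24 of 36, while the best of fifteen does so far more often. That is the sense in which a standout single stat may owe part of its record to luck.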

Interim report

There’s an aphorism attributed to everybody from Niels Bohr to Sam Goldwyn to Casey Stengel and Yogi Berra. The wording is variable, but the gist is that predictions are tough, especially about the future. It turns out that that holds for predictions about the past, as well.

Using all the metrics for which I measured award correspondences and melding them into predictive systems, so far I cannot beat the records of the best single metrics for those awards. The systems may actually be better predictors due to the broader base, and this superiority may emerge with larger sample sizes. That’s obviously going to be a slow process: they give out only two MVPs and two Cy Youngs a year.

Stretching my sample back before 1996 has two problems. First, I chose 1996 as the cutoff because the strike years change the calculus, especially with the MVP and the Magic Month. August performance obviously doesn’t mean the same thing as usual when August lasts 11 days and ends the season, as in 1994. And what about the 1995 season that began on April 26? Does the fourth month out of five hold the same import for award voters as the fifth month out of six?

The second problem also applies to extending the survey into the future: the evolution of voting patterns. Decades ago, MVP voters loved giving the award to the RBI leaders. From 1996 to 2013, RBIs have the absolute worst correspondence of any major stat in major award voting, nine out of 36, a -.25 correlation score by the method I’ve been using for the predictions. A sample stretching almost twenty years risks averaging fluid voting attitudes into a mush, and the longer it goes, the greater the risk becomes.

For the time being, I’m going to have to accept that a system that unearths correlations does not do as well at making forecasts, even if it did manage a four-for-four year in 2013. It could be that some metrics ought to be weighted more than others, beyond the magnitude of the correspondence numbers themselves. My “compression” scheme did this to some extent; I may have to widen that extent.

I may end up doing further work on this, but unless I hit upon some much more satisfactory results, don’t look for a third installment in this series any time soon. If you want more of a look at how the stats affect award voting, all I can suggest is that you consult my article in the 2014 Annual.

Or did I mention that already?

A writer for The Hardball Times, Shane has been writing about baseball and science fiction since 1997. His stories have been translated into French, Russian and Japanese, and he was nominated for the 2002 Hugo Award.
Dr. Doom
Wow! Thanks for the shout-out! I appreciate it. And I’m glad you did the legwork looking backwards. The line that really got me was this: “For the time being, I’m going to have to accept that a system that unearths correlations does not do as well at making forecasts.” It’s a little bit “correlation does not equal causation,” and a little bit obvious life experience. It’s an admirable effort you made, but ultimately doesn’t work because of one thing: the BBWAA has created a moving target, and so the requirements for winners, the weighting of categories, and even just what…

It’s a shame because you can’t predict it? Or because change in thought is a shame?