## Comparing 2010 Hitter Forecasts Part 2: Creating Better Forecasts

In Part 1 of this article, I looked at the ability of individual projection systems to forecast hitter performance. The six different projection systems considered are Zips, CHONE, Marcel, CBS Sportsline, ESPN, and Fangraphs Fans, and each is freely available online.  It turns out that when we control for bias in the forecasts, each of the forecasting systems is, on average, pretty much the same.  In what follows here, I show that the Fangraphs Fan projections and the Marcel projections contain the most unique, useful information. Also, I show that a weighted average of the six forecasts predicts hitter performance much better than any individual projection.

Forecast encompassing tests can be used to determine which of a set of individual projections contain the most valuable information. Based on the forecast encompassing test results, we can calculate a forecast that is a weighted average of the six forecasts that will outperform any individual forecast.

The term “forecast encompassing” sounds complicated, but it’s a simple concept. The idea is that if one projection doesn’t contain any unique information helpful for forecasting compared to another projection, then that forecast is said to be “forecast encompassed” and it can be discarded. When we are left with a group of forecasts that don’t encompass each other, each must contain some unique, relevant information.
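As an illustration (this is a sketch on simulated data, not the author's actual test), a two-forecast encompassing test can be run as a least-squares regression of outcomes on both forecasts: if the estimated weight on the second forecast is near zero, the first forecast encompasses it.

```python
import numpy as np

def encompassing_weights(actual, f1, f2):
    """Regress actuals on an intercept and two competing forecasts.

    If the estimated weight on f2 is near zero, f1 is said to
    "forecast encompass" f2, and f2 can be discarded.
    """
    X = np.column_stack([np.ones_like(f1), f1, f2])
    w, *_ = np.linalg.lstsq(X, actual, rcond=None)
    return w  # [intercept, weight on f1, weight on f2]

# Simulated data: f1 tracks the truth, f2 is uninformative noise
rng = np.random.default_rng(0)
truth = rng.normal(80, 15, 300)         # e.g., runs scored
f1 = truth + rng.normal(0, 5, 300)      # informative forecast
f2 = rng.normal(80, 15, 300)            # pure-noise forecast

w = encompassing_weights(truth, f1, f2)
print(w[1], w[2])  # weight on f1 near 0.9, weight on f2 near 0
```

In practice you would also compute a standard error for the second weight and formally test whether it differs from zero, rather than eyeballing it.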

Table 1 shows the optimal forecast weights after forecast encompassing tests have eliminated the forecasts with duplicate or irrelevant information. One thing that we see is that the Fangraphs Fan projections contain a large amount of unique information relevant for forecasting in each statistical category. Marcel projections are relevant in four categories. ESPN and CHONE projections are only useful in two categories, Zips in one, and the CBS projections have no unique, useful information in them according to these metrics.

Table 1. Optimal Forecast Weights

|                | Runs | HRs  | RBIs | SBs  | AVG  |
|----------------|------|------|------|------|------|
| Marcel         | 0.22 | 0.53 | 0.25 | 0.38 |      |
| Zips           | 0.30 |      |      |      |      |
| CHONE          |      |      | 0.44 |      | 0.44 |
| Fangraphs Fans | 0.19 | 0.47 | 0.31 | 0.29 | 0.55 |
| ESPN           | 0.29 |      |      | 0.33 |      |
| CBS            |      |      |      |      |      |

Using these weights, we can compute a forecast for each statistic that is a weighted average of these six publicly available forecasts. Table 2 shows the Root Mean Squared Forecasting Errors (RMSFE) of this composite forecast versus the other six forecasts. Here, we see that the weighted average performs substantially better than any individual forecast.
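As a sketch of the mechanics (the individual forecast numbers below are made up for illustration), the composite for each category is just a weighted sum of the surviving systems' forecasts, and the RMSFE is the square root of the mean squared error across hitters:

```python
import math

def rmsfe(forecasts, actuals):
    """Root mean squared forecasting error across a set of hitters."""
    n = len(forecasts)
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / n)

# HR weights from Table 1: only Marcel and the Fangraphs Fans survive
hr_weights = {"Marcel": 0.53, "Fangraphs Fans": 0.47}

# Hypothetical HR forecasts for one hitter (zero-weight systems drop out)
hr_forecasts = {"Marcel": 28, "Fangraphs Fans": 32, "CBS": 35}

composite_hr = sum(w * hr_forecasts[s] for s, w in hr_weights.items())
print(round(composite_hr, 2))  # 0.53*28 + 0.47*32 = 29.88

# Toy RMSFE check: forecasts of 90 and 70 runs vs. actuals of 80 and 80
print(rmsfe([90, 70], [80, 80]))  # 10.0
```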

Table 2. Root Mean Squared Forecasting Error

|                  | Runs  | HRs  | RBIs  | SBs  | AVG    |
|------------------|-------|------|-------|------|--------|
| Marcel           | 24.43 | 7.14 | 23.54 | 7.37 | 0.0381 |
| Zips             | 25.59 | 7.47 | 26.23 | 7.63 | 0.0368 |
| CHONE            | 25.35 | 7.35 | 24.12 | 7.26 | 0.0369 |
| Fangraphs Fans   | 29.24 | 7.98 | 32.91 | 7.61 | 0.0396 |
| ESPN             | 26.58 | 8.20 | 26.32 | 7.28 | 0.0397 |
| CBS              | 27.43 | 8.36 | 27.79 | 7.55 | 0.0388 |
| Weighted Average | 21.74 | 6.62 | 21.71 | 6.77 | 0.0338 |

Even when we correct for the over-optimism of the six base projections, the weighted average forecast still does better in every category, though by a smaller margin.
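The article doesn't spell out the exact adjustment, but a standard way to strip systematic over-optimism is to demean the forecast errors before squaring them. A minimal sketch of that bias correction:

```python
import numpy as np

def bias_corrected_rmsfe(forecast, actual):
    """RMSFE after removing the forecast's average (systematic) error."""
    errors = np.asarray(forecast, dtype=float) - np.asarray(actual, dtype=float)
    bias = errors.mean()  # over-optimism shows up as bias > 0
    return float(np.sqrt(((errors - bias) ** 2).mean()))

# Toy case: a forecast that is uniformly 5 runs too optimistic
actual = np.array([70.0, 80.0, 90.0])
forecast = actual + 5.0
print(bias_corrected_rmsfe(forecast, actual))  # 0.0 -- the constant bias is removed entirely
```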

Table 3. Bias-corrected Root Mean Squared Forecasting Error

|                  | Runs  | HRs  | RBIs  | SBs  | AVG    |
|------------------|-------|------|-------|------|--------|
| Marcel           | 23.36 | 6.83 | 22.81 | 7.28 | 0.0348 |
| Zips             | 22.98 | 7.02 | 23.52 | 7.59 | 0.0341 |
| CHONE            | 22.96 | 6.85 | 22.33 | 7.24 | 0.0341 |
| Fangraphs Fans   | 23.24 | 6.88 | 23.53 | 7.08 | 0.0340 |
| ESPN             | 23.03 | 7.27 | 23.62 | 7.14 | 0.0357 |
| CBS              | 22.91 | 7.29 | 23.90 | 7.27 | 0.0347 |
| Weighted Average | 21.74 | 6.62 | 21.71 | 6.77 | 0.0338 |

So what is the takeaway from this two-part series comparing six of the freely available sets of hitter forecasts?

1. Without correcting for the over-optimism (bias) of the forecasts, the mechanical forecasts (Marcel, CHONE, and Zips) outperform the others.

2. When correcting for the biases, no set of forecasts is any better than another.

3. A weighted average of the forecasts performs much better than any individual forecast.

4. Forecast encompassing tests indicate that the Fangraphs Fan projections and the Marcel projections contain the most unique and relevant information compared to the other forecasts considered.


Member
Marver

“3) A weighted average of the forecasts performs much better than any individual forecast.”

Well, duh. We shouldn’t expect a single model to outperform all other models in all five categories, which is necessary for this statement to have any chance of being untrue. Otherwise, you could just weight the models that are best in each category at 1.00 and make the same assertion.

What you really want to know, though, is not whether a blended model would have performed better after the season, but whether a blended model can consistently outperform the individual models, with the weights having been set prior to the season.

Guest
Everett

Wouldn’t it be a good idea to run this sort of analysis back several years to see if there’s any consistent weighting of the different systems, or if it’s just noise?

Member
Marver

@Will…then just include fangraphs’ projections with extremely smaller weights for 2011, or build two models: one with and one without.

Ultimately, the exercise you’re trying to do will prove very difficult due to year-to-year variation in ideal weights, plus the fact that many projection systems incur tweaks to the logic, coding, etc. that further distort their year-to-year weights.

I ran this analysis last year on a few sources and came to the conclusion that the weights were unstable year-to-year, producing an edge that was negligible and ultimately not worth the time/assets that went into it. That’s not to say it isn’t a good article to write!

Member
Marver

I absolutely agree; they are better. The problem is that this is certainly more true in some fields than others, and baseball projecting is relatively untested in comparison to other fields in which projection models are prevalent, like weather.

I’ve done basically the exact same thing you’re about to replicate and while I found that the result is a better projection system than any of its constituent parts, the difference was small in terms of added applied value to fantasy baseball teams. The difference was especially small when comparing the time put into developing/grading the system to other studies that could have been completed in the same amount of time.

Guest
Jeremiah

@Will I think your articles are very interesting. I don’t have a statistics background and I’ve wondered for years why people didn’t take the useful (unique?) data from the various projection systems to develop a weighted “super” system.

Do you plan on looking at pitchers as well? What about expanding the hitter categories (K’s, BB’s, XBH’s)?

Guest
Brett

For the 2010 season, my forecasts were computed using a simple average of four systems (Zips, Marcel, Chone, and ESPN), and that worked rather well. For the 2011 season I plan on adding a simple weighting to my forecasts.

1) What do you feel is a good way to weight the various projections? I had initially thought of ranking the 6 projections, giving 6 to Marcel, 5 to FG Fans, 4 to Chone, etc. The denominator would be the sum of the ranks, 21, so Marcel would be weighted 6/21 and CBS 1/21. Is this too simple?

2) Once I’ve created my projections, I want to do an ESPN-like player-rater calculation to give weights to marginal production in each roto category. I usually play in a points H2H league where such a calculation is easy. Do you have any experience performing such calculations? Any insight?

3) Does anyone know of any sites where rotoheads can contribute or co-develop such projection resources?
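A quick sketch of the rank weighting described in (1) above (the best-to-worst ordering here is just illustrative):

```python
# Rank-based weights: best system gets rank 6, worst gets 1,
# normalized by the sum of the ranks (6+5+4+3+2+1 = 21)
systems = ["Marcel", "FG Fans", "Chone", "Zips", "ESPN", "CBS"]
ranks = range(len(systems), 0, -1)  # 6, 5, 4, 3, 2, 1
total = sum(ranks)                  # 21
weights = {s: r / total for s, r in zip(systems, ranks)}

print(round(weights["Marcel"], 4))  # 6/21 = 0.2857
print(round(weights["CBS"], 4))     # 1/21 = 0.0476
```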

Guest
Brett

Will,

I saw your weighting by source in the article above. I was under the impression that these weights were for 2010. Is there any reason to believe that any system is better at projecting a given category from year-to-year?

Also, do you plan to do the same review for pitching forecasts?

Brett

Guest
Jeremiah

@Will – since CHONE is off the free market, how would you suggest a simple weighted average of these three systems: fangraph fans, ZIPS and Marcel?

Thanks!

Jeremiah

Member
jaywrong

@marver: i have forecasted that you need more fiber in your diet.

Guest
Brian

Will,

Pick any of the RMSFE numbers for runs as an example. Knowing that the projections are going to be off by plus or minus 20-some runs doesn’t help a whole lot, does it? That means if a player is predicted to score 90 runs, he could score anywhere from less than 70 to more than 110 runs.

Sure, now you’ve done the statistical analysis to know how accurate the projections are and you have basically shown that all of the projections have a big enough error that they really can’t be trusted. But we need something to base our draft picks on, do we not?

I guess my question is, how do you use any of this information as an advantage come draft day?

Guest
Brian

@Will: I’m totally convinced; compiling a bunch of forecasts to create your own and gain that advantage over everyone else sounds like a great idea, because in the end, isn’t that what we are all looking for? A way to dominate our friends so we can boast about being the best.

Are you going to have these projections somewhere to share so that the rest of us can see them, or are you just describing a way for us to do our own, more accurate projections?

Member
Matt Goldfarb

I found that using ~2/3 Bill James and ~1/3 Marcel produced the highest Pearson correlation, and the lowest RMSE to actual results, for HITTERS.

I did this a little over a year ago with the 2009 and 2008 stats. I’m not really good enough with SQL to go back any further.

The PECOTA was best for projecting pitchers if I do recall.

Keep up the good work Will,

Matt

Member
evo34

Will, I’m just not sure you understand sample size, and in-sample data vs. out-of-sample data. You cannot try to find optimal weights using one year of data. There is way too much variance in baseball performance to think that the stat-specific weights you mention (below) are anything but noise. The only way to prove otherwise is to generate your optimized weights on a set of training data, and check the performance on a (completely different) set of test data.

So to say you “should do” or “it’s best to do” various stat-specific system weightings based on this extremely limited study can only do more harm than good.

“My article here says that you should use different weights depending on the category. For example, when you want to forecast HRs, it’s best to do about 50% marcel and 50% fangraphs fans and ignore the other systems because they don’t add anything beyond those two. For SBs, it’s best to do 1/3 marcel, 1/3 fangraphs fans, and 1/3 ESPN.”
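The out-of-sample check described above can be sketched as follows: fit the blend weights on one season of data, then score the blend on a completely different season (the data here is simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_season(n=200):
    """Two systems forecasting n hitters' runs, with different noise levels."""
    truth = rng.normal(80, 15, n)
    forecasts = np.column_stack([truth + rng.normal(0, 8, n),    # system A
                                 truth + rng.normal(0, 12, n)])  # system B
    return forecasts, truth

def rmsfe(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Fit blend weights on the training season only...
f_train, y_train = simulate_season()
w, *_ = np.linalg.lstsq(f_train, y_train, rcond=None)

# ...then evaluate everything on a held-out test season
f_test, y_test = simulate_season()
blend_err = rmsfe(f_test @ w, y_test)
single_errs = [rmsfe(f_test[:, j], y_test) for j in range(2)]
print(blend_err, single_errs)
```

Only if the blend's test-season RMSFE beats the individual systems' test-season RMSFEs, across many seasons, is there real evidence the weights generalize.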

Guest
Joel

Well done, Will.

Obviously there will be year-to-year deviation, but since the factors underlying the mechanical predictions should remain consistent, a historical accumulation of projection data should be helpful. Even the fan projections should be consistent on some level…

Member
evo34

Will, if you’ve done any prediction work whatsoever (stocks, sports, weather, anything), you should know that you CANNOT optimize parameters of a system on the same data you are using to test said system, and expect it to be successful. This is Data Mining 101.

What you have done in this article is describe what has occurred in the past. By itself, that would be fine. Not very useful, but fine. But you take the reckless step of claiming that you have found the best system to use to predict the future: “My article here says that you should use different weights depending on the category.”

That’s a completely inappropriate conclusion based on the (lack of) analysis you have done.

Guest
evo34

This is exactly the same mentality that led to your original erroneous conclusions. You cannot test any model-creation hypothesis on one season of data.

It’s critical to understand this when you are in the business of prediction.