Linearization and Fantasy Baseball

Among the astounding phenomena abundant throughout calculus, linearization remains one of the least glamorous. It’s incredibly simple, taught in less than a day, and a more precise (and more complicated) method can often be substituted for it. On the other hand, it’s an incredibly powerful tool and one with weighty implications for fantasy baseball. Because of the concept’s relative simplicity, a reader with even the most basic inkling of what calculus actually is should be able to understand the idea of it, so don’t let a fear of mathematics deter you.

First, let’s think about graphs, functions, and derivatives. Put simply, continuous functions, whether they’re linear, quadratic, or exponential, will generally experience some rate of change, or slope. Think of it as the change in the y direction per unit change in the x direction between two points. The line through those two points is called a secant line, and its slope is the average rate of change between them. More interesting, however, is the concept of the tangent line, whose slope is the instantaneous rate of change at a given point. Note that the tangent line touches the function at just one point rather than two, meaning that we can evaluate the rate of change at that single point instead of averaging it between two points on the curve. Importantly, the slope of the tangent line tells us how quickly the function is increasing or decreasing. A large positive slope means the function is rising quickly (as on the steep part of an exponential curve), while a negative slope means it is falling (as on the right-hand side of a downward-opening parabola).

In calculus, the formula for linearization is:

L(x) = f(a) + f'(a)(x – a)

Here, given some value of a, we get a y-value, or f(a). From there, we add the product of the derivative of f at a, written f'(a), and the difference between the point we are estimating at, x, and the point we already have, a. This gives the linear approximation, which is a pretty good estimate as long as x stays close to a.

When rendered down to its most basic essence, linearization is a glorified form of estimation that gives credence to gut instinct through a formula. Using the tangent line at a certain point, one can make very incremental estimations, but it’s important to note that they must be very small. The farther from the initial point a that one travels to find an approximation of y, the less accurate the result will be.
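To make the "small steps only" point concrete, here is a minimal sketch of linearization in Python, using the standard tangent-line form L(x) = f(a) + f'(a)(x - a). The function name linearize and the choice of the square root as the test function are my own illustrations, not anything specific to baseball.

```python
import math


def linearize(f, f_prime, a):
    """Return the tangent-line approximation L(x) = f(a) + f'(a) * (x - a)."""
    fa = f(a)
    slope = f_prime(a)
    return lambda x: fa + slope * (x - a)


# Illustrative function (an assumption for the demo): f(x) = sqrt(x),
# whose derivative is f'(x) = 1 / (2 * sqrt(x)). We linearize at a = 4.
L = linearize(math.sqrt, lambda x: 1 / (2 * math.sqrt(x)), a=4.0)

near = L(4.1)  # a small step away from a
far = L(9.0)   # a large step away from a

near_error = abs(near - math.sqrt(4.1))
far_error = abs(far - math.sqrt(9.0))

print(f"near: estimate {near:.4f}, error {near_error:.4f}")
print(f"far:  estimate {far:.4f}, error {far_error:.4f}")
```

The estimate a tenth of a unit from a is off by a fraction of a thousandth, while the estimate five units away misses by a quarter, which is the loss of accuracy described above.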

It might seem that this has little application to baseball, but that’s incorrect. Recently, I started toying with a couple of formulas that could be useful in amateur fantasy baseball. Both rely on a regression line fitted to a player’s entire career in pretty much any statistic.

L(x) = f(k) + f'(a)(x – a)

Here, f(k) is the actual value at the known point (k), f'(a) is the derivative (slope) of the regression line at a, x is the point for which we are predicting the value, and a is the value we start from.

L(x) = f(a) + f'(a)(x – a)

Here, by contrast, f(a) is the predicted value on the regression line at a, f'(a) is the derivative (slope) of the regression line at a, x is the point for which we are predicting the value, and a is the value we start from.

I don’t know which would work best, but my guess is that the first formula would be more accurate due to its mix of actual and predicted values. Neither of them would be terribly precise, but either is a heck of a lot better than relying on what you feel might be best.

Regardless of which formula you might prefer, the implications of the linearization idea as applied to fantasy baseball are apparent. Probably best used for 10-day predictions, linearization mixes short-term performance with long-term talent to assess how well a player might perform for a short period of time — whether he’s likely to continue streaking, slumping, or somewhere in between. Rather than having to rely on gut instinct or dated and/or biased statistical analysis, a fantasy player could rely on some concrete math to make short-term decisions. This would be especially helpful in leagues that play for only a month, or can only alter their rosters once a week, or even at the end of a highly competitive season (perhaps making the risky move of dropping a slumping MVP for the streaking rookie).

It’s understandable if it’s unclear how to use one of the formulas at this point. To simplify matters, let’s use formula 1 to demonstrate how this might work in regard to something as simple as batting average. Suppose you have a regression line for a player’s rolling 10-game batting averages, plotted along with the actual values. In this case, the x-values count career games played in units of 100 games, so each game is 0.01 and each 10-game stretch is 0.1 (the scale is arbitrary). So 1.1 is the x-value at 110 games played, while 1.2 is the x-value at 120 games. Let’s just say for simplicity that the player has played 110 games in his career, had an actual average of .264 during the last 10-game stretch, and the derivative of the regression line at this point is 0.12. We want to guess his average for the next 10 games, up to career game number 120.

L(1.2) = .264 + (0.12)(1.2 – 1.1)

L(1.2) = .276

We’d expect him to hit .276 over the next 10 games, since the regression line is rising at this point. Hopefully that makes some sense. Obviously it’s still in development and I haven’t done a whole lot of research yet, but expect some to come out later along with some clarifying material if necessary. Confusion is to be expected, but with some explanation, applied linearization could potentially help a lot of people out next season in fantasy.
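For anyone who would rather check the arithmetic in code, here is a minimal sketch of the projection using the tangent-line form L(x) = start + f'(a)(x - a). The function name project and its parameter names are hypothetical. The inputs are the ones from the example: a .264 actual average, a regression slope of 0.12, and x-values of 1.1 and 1.2.

```python
def project(start_value, slope, x, a):
    """Tangent-line projection: start_value + slope * (x - a).

    For formula 1, start_value is the actual value f(k).
    For formula 2, pass the regression line's fitted value f(a) instead.
    """
    return start_value + slope * (x - a)


# Inputs taken from the batting-average example in the text.
next_ten_games = project(start_value=0.264, slope=0.12, x=1.2, a=1.1)

print(f"Projected average over the next 10 games: {next_ten_games:.3f}")
```

With a positive slope the projection rises above the starting value (about .276 here), and a negative slope would pull it below. Formula 2 would only need a fitted value from the regression line in place of the .264.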

xHR: A Speedy and Mandatory Revision

The Community Research section of FanGraphs serves as an excellent sounding board for aspiring amateurs (yes, those aspiring to rise to the level of amateur). After someone posts about a new statistical model or a detailed analysis of player performance, fellow Community Researchers are given a chance to chime in with helpful comments, sometimes leading to revision of previously drawn conclusions. The names that grace the upper sections of the website comment more rarely, but when they do, it always leads to revision.

Last week I published a new iteration of xHR, one that was drawn from xHR/BBE. It used four variables: FBLDEV, wFB/C, SLAVG, and FB%. In my naiveté, I neglected to properly analyze the variables I included in the regression model. As Mike Podhorzer helpfully pointed out, neither wFB/C nor SLAVG quite works as a variable in the proper sense. Because they are heavily results-based and are both dependent on home runs for their results, they skew the math quite a bit for calculating how many home runs a player ought to have hit. It’s helpful to think of it in terms of calculating an xSLG. As Mr. Podhorzer put it, “It’s like coming up with an xSLG that utilizes doubles, triples, and home-run rates! Obviously they are all correlated, because they are part of the equation of SLG.” They make for a sort of statistical circular logic.

For that reason, I came up with a different model, with the same basic objectives and two of the same variables, but getting rid of the improper variables. In this one, I used:

• AVG FBLDEV – Average fly ball/line-drive exit velocity. The idea is that the higher this value is, the harder the player is hitting the ball, and so he will hit more home runs.
• AVG FBDST – Average fly-ball distance. It’s rather intuitive because the farther a player hits fly balls, the more likely he is to hit home runs. If anything, like FBLDEV, it’s a clear demonstration of power. Obviously it has a decent correlation with FB%, but it isn’t necessarily tangled up with home-run results.
• K% – The classic profile of a home-run hitter is one who walks a lot, strikes out quite a bit, and hits balls that leave the yard. I suppose that a common conception is that the harder a player swings, the less control he has.
• FB% – Fly-ball percentage obviously figures pretty heavily into a power hitter’s profile. It’s awfully difficult to hit a lot of home runs without hitting a plethora of fly balls.

Without further ado, here’s the new xHR:

Note: To be clear, the end goal is not necessarily xHR/BBE, but rather xHR. xHR/BBE is just the best path to xHR because HR/BBE is a rate stat, meaning that it will have a better year-to-year correlation than home runs because that’s a counting stat. So if a player gets injured and only plays half a season, his HR/BBE would probably be similar to his career values, but his home-run numbers would not be. With that in mind, remember that the model was made for HR/BBE, not HR, so you will necessarily have “better” results if you’re looking for xHR/BBE.

Pretty good results, to be sure, even if they’re a bit worse than the prior version’s. A .7989 R-squared value is nothing to scoff at, especially if you think of it as the model explaining 80% of the variance. Clearly it still underestimates the better hitters, and that’s an issue, but there are really so few data points at the top that it’s hard to take it completely seriously up there. If there were a lot more data and it still did that, then I’d be inclined to either add a handicap or to think it ought to be a quadratic regression.

As always, the formula:

xHR = (.170102188*FB% - .014640853*K% + .0000269758*AVGDST + .005672306*FBLDEV - .541845681) * BBE
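If you’d rather not do the arithmetic by hand, the formula above can be sketched as a small Python function. The coefficients are copied from the formula. Everything else is my assumption: the function and parameter names are hypothetical, FB% and K% are entered as decimals (0.40 rather than 40), distance is in feet, exit velocity is in miles per hour, and the sample inputs are made up rather than taken from any real player.

```python
def xhr_v2(fb_pct, k_pct, avg_fb_dist, fbld_ev, bbe):
    """Expected home runs from the revised model: (xHR/BBE rate) * BBE.

    Assumed units: fb_pct and k_pct as decimals, avg_fb_dist in feet,
    fbld_ev in mph, bbe as a count of batted-ball events.
    """
    rate = (
        0.170102188 * fb_pct
        - 0.014640853 * k_pct
        + 0.0000269758 * avg_fb_dist
        + 0.005672306 * fbld_ev
        - 0.541845681
    )
    return rate * bbe


# Made-up inputs for illustration only (not a real player's numbers).
expected_hr = xhr_v2(fb_pct=0.40, k_pct=0.20, avg_fb_dist=300, fbld_ev=95, bbe=400)

print(f"xHR: {expected_hr:.1f}")  # roughly 28 with these inputs
```

Keeping the rate and the BBE multiplication in one place makes it easy to pull out xHR/BBE on its own by passing bbe=1.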

Even more than the previous version, this model is easily accessible to all fans because the variables are comprehensible. Moreover, it isn’t terribly difficult to head over to Statcast or Baseball Savant to obtain the relevant information and make the calculation. Anyway, I hope you enjoy and use this information to the fullest extent.

Fantasy Metrics and xHR

RotoGraphs, in addition to several Community writers, have been posting about an “x” category of metrics for quite some time. They include things like Andrew Dominijanni’s xISO, Andrew Perpetua’s xBABIP, and more. The clear purpose of developing those statistical indicators was to measure and predict fantasy-baseball success, something we all aspire to in our hopefully low-priced leagues (although you probably found that using x-stats is a lot like overstudying for a test because the amount of effort you put into preparing yields diminishing returns, and you “over-Xed” the players).

One of the most prominent of the x-stats trotted out at the beginning of every season is xHR/FB, developed by Mike Podhorzer, and always accompanied by an amusing “leaders and laggards” piece. His version of xHR/FB is quite good, with a .649 R-squared value. In his regression analysis, Mr. Podhorzer utilizes somewhat exclusive metrics (hopefully public at some point), such as average absolute angle. Overall, it’s a pretty good predictor, and it becomes doubly understandable to the layman when it gets multiplied by fly balls to produce an expected home-run value.

The only real issue I have with HR/FB (and its prediction) is that it is HR/FB. While it is more stable for hitters than for pitchers, it still isn’t quite as stable as a stat I’d like to use for fantasy baseball. For my 1000 player-season sample from 2009-2015, HR/FB had a year-to-year R-squared value of .49. It isn’t terribly difficult to figure out why. There are numerous reasons, including weather changes, team changes, opponent changes, player development, and more. Moreover, it doesn’t take a very good picture of a hitter’s overall profile because it only looks at how many home runs a player hits per fly ball. A player might have a high HR/FB, but he may not hit enough fly balls for the metric to accurately describe his power (i.e. whether he actually hit a lot of home runs). On the other hand, it’s important to note that a high HR/FB generally goes with a higher FB%.

Perhaps a better metric for evaluating a player in the greater context of his hitting profile is HR/BBE. Home runs per batted-ball event is just HR/(AB+SF+SH-SO). It has a slightly higher year-to-year R-squared of .56 (from my sample), in large part because it takes into account more variables than does HR/FB. Under the umbrella of BBE fall not only fly balls, but line drives (and there can be line-drive home runs), and ground balls. In case you’re wondering why I included sacrifice hits, it’s because they tell a little bit about what kind of hitter a player is. Most modern managers are far more likely to ask a Ben Revere to lay down a sacrifice bunt than they are a Kris Bryant.
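As a quick illustration of that definition, here is a minimal sketch of HR/BBE in Python. The function names are hypothetical and the stat line is invented for the example.

```python
def batted_ball_events(ab, sf, sh, so):
    """BBE as defined above: at-bats plus sacrifice flies plus sacrifice hits, minus strikeouts."""
    return ab + sf + sh - so


def hr_per_bbe(hr, ab, sf, sh, so):
    """Home runs per batted-ball event."""
    bbe = batted_ball_events(ab, sf, sh, so)
    if bbe <= 0:
        raise ValueError("no batted-ball events to divide by")
    return hr / bbe


# Invented season line for illustration only.
rate = hr_per_bbe(hr=30, ab=550, sf=5, sh=0, so=130)

print(f"HR/BBE: {rate:.4f}")
```

Here the hitter has 425 batted-ball events, so 30 home runs works out to roughly 7% of them leaving the yard.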

And so I thought it might be useful to run a linear regression analysis to develop an xHR/BBE (and from there, xHR). I’m a statistical autodidact, so I tried to keep things simple. Additionally, I thought it would be best if I utilized accessible variables like FB% so that a moderately literate sabermetrician could use it. After testing myriad variables, I came up with four that I’d use — average FBLDEV (Statcast), wFB/C, SLAVG, and FB%.

• AVG FBLDEV – Average fly ball/line-drive exit velocity. The idea is that the higher this value is, the harder the player is hitting the ball, and so he will hit more home runs.
• wFB/C – A rather obscure metric buried in the FanGraphs glossary, wFB/C is weighted fastball run values per 100 pitches. I use it because most home runs come off some form of a fastball, and home-run-hitter types are typically good fastball hitters.
• SLAVG – “Slap” average, a metric of my own invention (although someone else has probably thought of it – I just haven’t seen it before), is singles divided by at-bats. It’s a bit like ISO in that it tells you about a player’s power distribution (or lack thereof). I figure that this is inversely correlated with power because the more singles a player hits, the fewer home runs he’s likely to hit.
• FB% – Fly ball percentage obviously figures pretty heavily into a power hitter’s profile. It’s awfully difficult to hit a lot of home runs without hitting a plethora of fly balls.

It seems like a decent list of predictors in that they are understandable and accessible to the average fan, in addition to having a good relation to home-run hitters. I used all players that had at least 100 batted-ball events in 2015 and 2016 (Statcast only has data going back to 2015), which turns out to be close to 500 player-seasons. So let’s throw them into the Microsoft Excel Regression grinder and see what it spits out:

Note: To be clear, the end goal is not necessarily xHR/BBE, but rather xHR. xHR/BBE is just the best path to xHR because HR/BBE is a rate stat, meaning that it will have a better year-to-year correlation than home runs because that’s a counting stat. So if a player gets injured and only plays half a season, his HR/BBE would probably be similar to his career values, but his home-run numbers would not be.

The primary thing to recognize here is the R-squared value: a pretty good .78272. To the uninitiated, this simply means that the model explains 78% of the HR variance. If you’re interested (and you really ought to be), here are the coefficients for the variables and the overall formula:

xHR = (.114557524*FB% - .183885205*SLAVG + .006658976*wFB/C + .004075449*FBLDEV - .343193723) * BBE
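For those who prefer code to a calculator, here is a rough sketch of that formula as a coefficient table in Python. The coefficients and intercept come from the formula above. The dictionary keys, the function name, the assumption that FB% is a decimal (0.40 rather than 40) and FBLDEV is in miles per hour, and the sample inputs are all mine and purely illustrative.

```python
# Coefficients copied from the formula; key names are hypothetical labels.
V1_COEFFICIENTS = {
    "fb_pct": 0.114557524,
    "slavg": -0.183885205,
    "wfb_c": 0.006658976,
    "fbld_ev": 0.004075449,
}
V1_INTERCEPT = -0.343193723


def xhr_v1(inputs, bbe):
    """Expected home runs: the linear xHR/BBE rate multiplied by batted-ball events."""
    rate = V1_INTERCEPT + sum(
        coef * inputs[name] for name, coef in V1_COEFFICIENTS.items()
    )
    return rate * bbe


# Made-up inputs for illustration only (not a real player's numbers).
sample = {"fb_pct": 0.40, "slavg": 0.15, "wfb_c": 1.0, "fbld_ev": 95}
expected_hr = xhr_v1(sample, bbe=400)

print(f"xHR: {expected_hr:.1f}")
```

Storing the coefficients in a table means a refit of the regression only requires changing the numbers, not the function.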

With this information, it isn’t terribly difficult to look up a few pieces of data on FanGraphs and Statcast to see how many home runs a player “should” have hit. In case you’re wondering about its predictive value relative to that of HR/BBE, xHR/BBE has a year-to-year R-squared value that’s six points higher (.61). Nevertheless, it’s important to note that, based on the graph, the model struggles to predict home-run numbers for the players on the extremes – the Jose Bautistas of the world. Because the linear regression tends to underestimate rather than overestimate at the top, it’s likely that a quadratic regression would fit better. It’s something to look into, but this’ll do for now. Moreover, while there are some really crazy outliers, like Jose Bautista being predicted to hit 12 fewer home runs (Steamer does have him on pace for only 26 this year!), the model does work reasonably well for more average players.

Keep in mind that numerous improvements will be made. If anyone wants access to data or has a question, then just let me know. If not, then enjoy the tool and use it for fantasy, even though it’s getting a bit late for that. Maybe next year.

The WIS Corollary

Interestingly enough, one of the major postwar genres of Anglo-American literature was the academic comedy. In this genre, popularized in large part by Philip Larkin and the “Movement,” authors strove to poke fun at academic institutions and the conventions followed by the terrifically aloof professors. The most famous novel to fall into this genre is Lucky Jim by Kingsley Amis. The book features Jim Dixon, a poverty-stricken pseudo-pedant with a probationary position in the history department of a provincial university. A veritable alcoholic, Dixon attempts to solidify his position by penning a hopelessly yawn-inducing piece entitled “The Economic Influence of the Developments in Shipbuilding Techniques, 1450 to 1485.” Short novel made shorter, it doesn’t help him retain his position, but it does succeed in illustrating the banal formalities that academic writing necessitates.

In sabermetrics, there is a heavy reliance on sometimes inscrutable jargon, acronyms that sound like baby words (“FIP!”), and Mike Trout’s historical comps (Chappie Snodgrass is not a very good one in case anyone is wondering), all of which quite understandably leave the average fan mildly frustrated and the average fan over sixty wondering how we will ever make baseball great again. Typically, I enjoy those articles very much because they communicate news efficiently and analytically. Occasionally, however, articles stray into the Jim Dixon range of absolute obscurity, examining the baseball equivalent of “Shipbuilding Techniques,” whatever that may be. Such writings form the cornerstone of sabermetrics as they mesh history, theory, and sometimes economics.

Fortunately or unfortunately, my article today isn’t quite Dixon-esque, but it retains some of that style’s more tedious elements. It falls more closely into the category of two-minute ESPN quick sabermetric theory update. I don’t think that’s a thing. Seemingly pointless introduction aside, please consider what you know about DIPS theory. I won’t insult your intelligence, but it was developed by Voros McCracken at the turn of the millennium and has served as one of the principal tenets of the pitching side of sabermetrics ever since then. The theory, in its most atomic form, essentially posits that pitchers should be evaluated independently of defense because it’s something they cannot control. Hence “defense-independent pitching statistics.”

Certainly, it was a revolutionary concept and one that has even gained quite a bit of traction in the mainstream sports media. Announcers talk about how a certain pitcher would look a lot better pitching in front of, say, the Giants instead of the Twins. Metrics like xFIP only serve to quantify that idea.

But every grand theory or doctrine (DIPS is essentially sabermetric doctrine at this point) requires a corollary to frame it. And so I propose something I like to call the “WIS Corollary to DIPS,” where WIS stands for Weather Independent Statistics. The natural extension of evaluating pitcher performance independently of defense is to evaluate players independently of weather because it also exists outside of player control.

The basic idea of this is that weather plays enough of a role in enough games to superficially alter the statistics of players such that they cannot be accurately and precisely compared with the other players in the league because all of them face different environmental conditions. Taking that into consideration, all efforts must be made to strip out the effects of weather when making serious player comparisons. Coors Field is why Colorado performances are regarded with such skepticism, while the nature of San Francisco weather and AT&T Park is supposedly why that location serves as an apt environment for the development of pitchers.

Think about it — it’s something we already do. We look at home/road splits, we evaluate park factors, we try and put players on +/- scales. We talk about this constantly even at youth games. I have heard parents say many times, “If only the wind hadn’t been blowing in so hard he might have hit the fence.” It’s honestly a commonly held, yet generally unquantified, notion that the general public has.

Player X hits a blooper at Stadium C that falls in front of the left fielder for a hit. Player Y hits a blooper at Stadium D with the exact same exit velocity and launch angle as Player X’s ball, but it carries into the glove of an expectant left fielder. Should Player X really get credit for a hit and Player Y for an out? Basically all statistics, striving to communicate objective information, would say yes. If this kind of thing happens enough times over the course of a season, it can make a significant difference. A couple of fly balls that leave the park instead of being caught at the fence would put a dent in a pitcher’s ERA, while changing a player’s wRC+ by no small sum.

For that reason, players should be measured as if they play in a vacuum. One of the biggest goals of sabermetrics is to isolate player performance in order to evaluate a player independently of variables he cannot necessarily control. Certainly, this has some far-reaching consequences if the idea gets carried out to its natural conclusion. Someone would likely end up developing a model that standardized stadium size, defensive alignment for varied player types, and other things of that nature. I’m not necessarily advocating for that, just for stripping out the effects of weather.

WIS by itself isn’t radical, but the extent to which it’s applied could be considered as such. As of now, it’s something consciously applied a relatively small portion of the time, but I think that it’s something that should be considered as much as possible. Obviously, there are issues with this. You can’t very well modify “raw” statistics like batting average or ERA so that they reflect play in a vacuum. What you could conceivably do is create a rather complicated model that requires a complicated explanation in order to describe how the players should have performed. And that’s something which brings us to an important point; the metrics that would employ this information would not be for the average fan; rather, they would be aimed at the serious analyst.

This is something I’ve already tried to employ with a metric I created called xHR, which uses the launch angle and exit velocity of batted balls to retroactively predict the number of home runs a player should have hit. The metric is still in development, but I think it’s something that works relatively well and can be applied to other types of metrics. For instance, an incredibly complex and comprehensive expected batting average could utilize Statcast information to determine whether a given fly ball would have been a hit in a vacuum based on fielder routes and the physics of the hit. By no means am I trying to assert that I have all, if any, of the answers. The only thing I’m trying to do here is to bring debate to a small corner of the internet regarding the proper way to evaluate baseball players.

Probably the most crucial thing to understand here is that the point of sabermetrics is to accurately and precisely evaluate players in the best possible way. Sabermetricians already do an incredible job of doing just that, but perhaps it’s time to take things a step further in the evaluation process by developing metrics that put performances in a vacuum. I know that baseball doesn’t happen in a void, but the best possible way to compare players is to measure them* as if they do.

WIS Corollary — One must strip out the effects of weather on players in order to have the most accurate and precise comparison between them.

*Oftentimes it’s necessary to compare players while including uncontrollable factors, like sequencing, especially when doing historical comparisons. It’s important to note that the WIS Corollary is applicable only in very specialized situations, and would generally go unused.

One would do well to recall that the last feature article written about Erick Aybar appeared in NotGraphs (#KeepNotGraphs), where he was pictured as the inept rebel fleet commander Admiral Ackbar from the good section of Star Wars. Before that, there were articles with titles like “Erick Aybar: Not as Bad as You Might Think,” “Erick Aybar, Perennial Sleeper,” and “Erick Aybar: 2012 Sleeper.” Since then, Aybar hasn’t had an actual FanGraphs piece done on him. It looks as though people are still sleeping on him (but for good reason this time).

One of the most interesting parts of the novel 1984 is the concept of “Newspeak,” where the government twists and eliminates the meaning of certain words to serve its own purposes. In the novel, Winston, the protagonist, is educated by one of his colleagues at the Ministry of Truth, Syme. He tells Winston, “A word contains its opposite in itself. Take ‘good,’ for instance. If you have a word like ‘good,’ what need is there for a word like ‘bad’? ‘Ungood’ will do just as well – better, because it’s an exact opposite, which the other is not.” In today’s society, particularly in the world of baseball, there is a great need for descriptors such as “ungood” so that people don’t feel bad.

There are numerous expletive-laden phrases that would aptly describe Erick Aybar’s season up to this point, but perhaps it’s best to just say he’s doubleungood. That’s the clearest way of saying that Aybar has been incredibly awful this season. This isn’t just about offense or just about defense. He has been mind-numbingly, historically bad offensively and pretty subpar defensively.

It’s lucky for Aybar that the Braves aren’t exactly their c. 1998 selves because he can hide relatively easily on this roster. The Braves have three of the league’s ten worst players by wRC+ (min. 100 plate appearances), so it’s not like he’s exceptional. Moreover, it doesn’t look like the Braves are terribly interested in winning, anyway, so at least he isn’t holding back a team with championship aspirations (you’re being glared at, Russell Martin).

This season, through 43 games (many of them started) and 161 plate appearances, he has amassed an unimpressive -1.7 WAR, worst in the league. Also absolute worst in the league is his wRC+, which is 11! That’s insane. It’s 89% worse than average! Even 90-year-old A.J. Pierzynski has a 39 wRC+. Consider this: Erick Aybar is running a .184/.222/.211 line. How can a major-league baseball player be this bad?

Well, it’s not terribly helpful to have a .223 BABIP, a number 78 points off of his career average (and basically league average) .301 BABIP. Just for fun, let’s say he has a .301 BABIP this season. That would add approximately nine hits to his total of 27 thus far, giving him a much more respectable .245 batting average. Now let’s say he maintains his ratio of hits to extra-base hits and see what that does to his slugging percentage (he ends up with one more double). This gets him to a much better .245/.279/.279 line. But that’s still probably not good enough to be a major-league player.
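The BABIP thought experiment can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch under stated assumptions: at-bats and balls in play are backed out from the rounded rate stats quoted in this piece (27 hits, a .184 average, a .223 BABIP), he has no home runs (his 27 hits are 23 singles and four doubles), and the variable names are my own.

```python
# Figures quoted in the article.
hits = 27
doubles = 4            # his only extra-base hits
batting_avg = 0.184
babip = 0.223
career_babip = 0.301

# Assumption: back out the denominators from the rounded rate stats.
# With zero home runs, BABIP is simply hits divided by balls in play.
at_bats = round(hits / batting_avg)
balls_in_play = round(hits / babip)

# Re-run the season with his career BABIP on the same balls in play.
new_hits = round(career_babip * balls_in_play)
added_hits = new_hits - hits

# Keep the same ratio of doubles to hits, as the article does.
new_doubles = round(new_hits * doubles / hits)

new_avg = new_hits / at_bats
new_slg = (new_hits + new_doubles) / at_bats  # each double adds one extra base

print(f"added hits: {added_hits}, AVG: {new_avg:.3f}, SLG: {new_slg:.3f}")
```

The on-base percentage is left out because it would need his walk and hit-by-pitch totals, which aren’t quoted here.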

As you can probably guess, Aybar’s plate discipline and power numbers suck quite a bit. His four doubles and 23 singles have given him a .027 ISO, which is the worst in the league by 16 points. He has a K-BB% of 14.3%, a number that would be meritorious for a pitcher (hint: Aybar isn’t a pitcher). His O-Swing% increased by five percentage points this year and his contact rate on pitches outside the strike zone decreased by five percentage points, leading to more strikeouts and worse contact when he actually hits the ball. At least he’s only a slightly below-average baserunner.

Unfortunately, his defensive numbers have been subpar this year also, but at least he’s not the worst player in the league in this category. Instead, he’s eighth-worst, with a raw UZR of -4.9 and a UZR/150 of -22.9. He isn’t committing too many errors, but his range is a definite factor. Aybar hasn’t completed a single play in the remote to unlikely range per Inside Edge. He’s also seen a marked drop in even chance fielding opportunities (down 6.7%) and likely opportunities (down 3%).

There aren’t a whole lot of good reasons for this. He isn’t injured (although he did have to get a chicken bone removed from his throat) and he doesn’t look injured. I can’t find a way to press the videos onto the article, but his swing looks a lot different from last year, at least from the right-hand side of the plate. I’m not a swing expert, but it looks like he isn’t using his hips to turn on the ball like he has in years past, which would explain the lack of power. Additionally, Aybar looks off-balance this year as compared to last year, when he was much better. Another thing to consider is that it seems like his swing has less lift than before, resulting in more ground balls and less power. On the other hand, maybe Aybar is just getting old. He’s 33 and hasn’t missed a lot of time in his career.

On the other hand, he actually was a very good player for a long time, a sleeper even. From 2008 through 2014, he was worth 20.1 WAR, combining passable offense for a middle infielder with good baserunning and decent defense. In fact, he was 57th* in WAR during that time period, better than more highly esteemed names like Carlos Beltran, Nelson Cruz, and David Ortiz. He was a very good player for a very long time, making more money than most people ever dream of. And that’s cause for positivity.

It stands to reason that Aybar will regress to the mean. No one can sustain those numbers for a full season, if only because they would definitely get benched. There’s a reason why sample size and past performance matter, and Mr. Aybar embodies it. If we expected him to keep playing at this level with the same amount of playing time, then he’d end up with the worst season in baseball history with a little over -6 WAR (not that six fewer wins would make that big of a difference to the Braves). But that isn’t going to happen. He’s projected to finish around zero, which would make him an average player the rest of the way. Based on his past performance, I fully expect that to happen and I want it to happen. It’s terribly sad when one of the game’s great, unknown players spirals into oblivion. Nonetheless, what he’s doing right now is insane and not for the right reasons. Just as Admiral Ackbar managed to right the rebel fleet, Aybar can do the same with his performance.

*Fun fact: Mike Trout is 19th on that list. Remember, WAR totals from 2008 through 2014.

All statistics current through 5/26/2016

xHR%: The Finale

This is the final part of a six-article series on xHR%, a metric devised rather unoriginally by myself. If you feel so inclined, you can look at the other parts here: P.1, P.2, P.3, P.4, P.5.

It’s always nice when things mostly work out. More often than not, when someone devotes countless hours to some pet project, whether it’s a scrapbook of some variety or an amateur statistical endeavor, it doesn’t work out terribly well. From there, one often ends up spending nearly as many hours fixing the project as they did on putting it together in the first place. The experience is incredibly frustrating, and it’s something we’ve all gone through at one time or another.

Luckily, my “quest” went much better than that of Juan Ponce de León.  While I didn’t find the fountain of youth, I did find a formula that works moderately well, even though I can only back it up with one year of data at this point. The only thing Señor Ponce de León has to brag about is being arguably the second most important explorer in colonial history. Somehow those things don’t compare particularly well.

Nonetheless, things do look quite good for xHR% v2. I culled data from a variety of sources, but mainly from FanGraphs and ESPN’s selectively responsive HitTracker. I used FanGraphs for FB%, HR, AB, and strikeout numbers (in order to find BIP, I subtracted strikeouts from at-bats). On the other hand, HitTracker was used just for home run distance numbers and launch angle data. I studied all players with at least 1200 plate appearances between 2012 and 2014 in order to ensure some level of stability for the first sample taken.

And so, without further ado, take a few seconds to look at some relatively interesting graphs (I forgot to title the first one, but it’s xHR vs HR).

Here, it’s fairly easy to discern that there’s a strong relationship between expected home runs and home runs. It doesn’t take John Nash to figure that out. What is fairly interesting, however, is that the average residual is quite high (close to 2.5), indicating that the average player in the sample hit approximately 2.5 home runs more or fewer than he should have. That difference comes from a number of factors which the formula attempts to account for. They include home ballpark, prior performance vs. current performance, and weather. One of the issues, and this was bound to be a problem because of the sample size, is that there aren’t enough data points for players who hit 40+ home runs, so it’s hard to say how accurate the formula actually is as a player approaches that skill level.

This is a slightly zoomed-in version of expected home run percentage vs home run percentage. Clearly, there’s a much stronger relationship between HR% and xHR%, due in large part to the size of the digits and because the formula was written to come out with a percentage, not a solid number. But I won’t waste too much time on xHR% because, quite frankly, it’s far less interesting and understandable than actual home run numbers.

For the interested and worldly reader, here are the equations for each:

xHR: y=.0019x²+.9502x+.1437

xHR%: y=1.0911x²+.9249x+.0007
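For anyone who wants to plug numbers into the two trend lines, here is a minimal sketch that simply evaluates them. The function names are hypothetical, and since the equations don't say which quantity is x and which is y, the code makes no assumption about that either; it only does the arithmetic.

```python
def xhr_trend(x):
    """Evaluate the quadratic trend line given for xHR."""
    return 0.0019 * x ** 2 + 0.9502 * x + 0.1437


def xhr_pct_trend(x):
    """Evaluate the quadratic trend line given for xHR%."""
    return 1.0911 * x ** 2 + 0.9249 * x + 0.0007


if __name__ == "__main__":
    # Example input of 20 on the home-run scale (an arbitrary illustration).
    print(round(xhr_trend(20), 2))
```

Note how small the squared term is in the xHR equation: over the range of realistic home run totals the fit is very nearly a straight line with a slope just under one.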

If either of these equations gets used at all, I expect it will be xHR because home run numbers are far more accessible than home run percentage numbers. Frankly, I regret writing the formula for xHR% for that very reason. This is supposed to be a layman’s formula, so its end result should be something understandable to the average baseball fan. It should be self-evident and easy to comprehend.

Thank you for following along as the formula developed over time. Obviously, it isn’t done yet and it requires some changes, but it’s close enough to where it needs to be. It’s very similar to getting to the door of the room where the Holy Grail is, shrugging, and turning around with the intention of coming back in a few weeks (although in this case it must be noted that the Holy Grail isn’t the real one, but a plastic one covered in lead paint). Expect a return under a different name and a better data set.

You’ll notice that I didn’t include very much statistical analysis at all. I figured that was rather boring to write about, but you can feel free to contact me for the information if you would like a nice nap.

xHR%: Questing for a Formula (Part 5)

This is the long-delayed fifth part in the xHR series. If you really want to read the first four parts, they can be located here, here, here, and here.

More than a month late, the highly anticipated follow-up to the first iteration of xHR has arrived. Once more, that increasingly trivial metric will grace the page of FanGraphs, wallowing in the mostly prestigious Community Research section (on the other hand, this section is most definitely the best section on the World Wide Web for experimental metrics and amateur analyses).

Unless the reader has an impeccable memory for breezily scanned, frivolous articles, he or she likely needs a reminder as to what xHR% is and aims to be. xHR% is a metric that describes at what rate a player should have hit home runs over a given season. From this, expected home runs, a more understandable counting statistic, can be found by multiplying plate appearances by xHR%. It cannot be emphasized enough that the metric is not predictive; it only aims to describe. Without further ado, the formula is here:

I know that’s a lot to look at, and it isn’t exactly self-evident what all of the variables mean. As such, an explication of each part is necessary and provided below. (For logical rather than chronological purposes, the Kn variable will be analyzed last.)

AeHRD – One of the biggest differences between this formula and the last one is that this one does not use home run distance. This iteration uses expected distance, rendering it a combination of simple math, sabermetric theory, and physics. As such, expected home run distance strips out one of the biggest factors in luck — the weather.

Expected home run distance is found by utilizing a method taken from Newtonian Mechanics to calculate how far objects go. By using ESPN’s HitTracker website, I was able to obtain launch angles and velocities for nearly every home run hit in 2015. From this, I was able to resolve velocity into its respective parts, velocity in the x-direction (Vx) and velocity in the y-direction (Vy). After that, I calculated the amount of time the ball would be in the air with the formula vf=vi+gt, where vf is final velocity (0 m/s), vi is initial velocity (Vy), and g is simply the gravitational acceleration constant. Finally, I multiplied Vx by time in order to get the total expected distance.

I repeated that process for every home run hit by a given player in order to find his average expected home run distance. By doing this, I was able to strip out all weather-related components.
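The distance calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: SI units, no air resistance, and a ball that lands at the height it was struck. All function and parameter names are hypothetical. One thing to flag: setting vf to 0 in vf=vi+gt gives the time to the top of the arc, so the sketch doubles that time by default to cover the whole flight. That doubling is an assumption on my part, and the round_trip flag turns it off for the one-way version.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed SI units)


def velocity_components(speed, launch_angle_deg):
    """Resolve launch speed into horizontal (Vx) and vertical (Vy) parts."""
    angle = math.radians(launch_angle_deg)
    return speed * math.cos(angle), speed * math.sin(angle)


def time_to_apex(vy):
    """Solve vf = vi + g*t for t with vf = 0 and gravity acting downward."""
    return vy / G


def expected_distance(speed, launch_angle_deg, round_trip=True):
    """Horizontal distance covered: Vx multiplied by time in the air."""
    vx, vy = velocity_components(speed, launch_angle_deg)
    t = time_to_apex(vy)
    if round_trip:
        t *= 2  # assumption: the descent takes as long as the climb
    return vx * t


def average_expected_distance(home_runs):
    """Average expected distance over (speed, launch_angle_deg) pairs."""
    distances = [expected_distance(s, a) for s, a in home_runs]
    return sum(distances) / len(distances)
```

Because nothing in the calculation depends on wind, temperature, or altitude, two balls struck with the same speed and angle always get the same expected distance, which is the sense in which the weather is stripped out.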

AeHRDH – Utilizing the same process as above, I found the average expected home run distance for every stadium. This is the player’s home stadium’s average home run distance, regardless of team.

AeHRDL – The same as above, but done for every home run hit in the majors last season.

When put together in the numerator and the denominator, the above variables serve as a “distance constant” of sorts that will at most adjust the resulting expected home runs by plus or minus two. Occasionally, the impact is negligible because the average expected distance is very close to that of the player’s home stadium and the league. Averaging the mean expected home run distance of the league and of the home stadium allows the metric to paint a more accurate picture of where the player hit his home runs and whether or not they should have left the park. Nevertheless, it’s important to note that this formula still fails to account for fly balls that fell just short of the wall due to the wind and other factors, meaning that there are still expected home runs unaccounted for.

FB% – If you remember correctly, or took the time to briefly review the previous posts, then you will recall that in the prior iteration of the formula there was a section very similar to this one. The only differences are that the weights on each year of data have changed (those are still somewhat arbitrary, however, but I am working on getting them to more precisely reflect holdover talent from past years) and the primary statistic used.

Previously, HR/PA was used, but it had to be abandoned because the results were too closely correlated with reality. This time, I looked at how similarly descriptive formulas were quantified. Oftentimes, those metrics did not use the target expected metric in their formulas. Rather, they utilized other metrics that correlated moderately well or strongly with their expected metric. In this case, I decided to use FB% because it’s a relatively stable metric (especially in comparison with HR/FB), and it has a strong correlation with HR% (about .6).

As a clarification, the subscripts Y3, Y2, and Y1 indicate the years away from the season being examined, where Y1 is really Y0 because it’s zero years away. So just to be clear, Y1 is the in-season data from the year being examined. In the data to be examined, for example, Y1 is 2015, Y2 is 2014, and Y3 is 2013.

Kn – As you can well imagine, FB% numbers are always far greater than HR% numbers*, resulting in some truly ridiculous results if a constant isn’t applied that relates HR% to FB%. For instance, without a constant to modify the results, Jose Bautista would have been expected to hit 304 home runs last season. That’s a lot of home runs. Just two and a half seasons of playing at that level and he’d have the home run record in the bag. Luckily, I’m not stupid enough to think that that’s actually possible, and so I initially related FB% and xHR% with a constant, called KCon.

Unfortunately, KCon didn’t work as well as I’d hoped because it skewed expected home run results way up for terrible home run hitters and way down for the best home run hitters. By skewed, I mean off by more than six home runs. And so I, in my infinite (and infantile) amateur mathematical wisdom, made it into a seven-part piecewise** function. By this, I mean that there’s a different constant for each piece of the formula, with the pieces defined by HR% at somewhat arbitrary, though round, cutoff points. For clarity, here they are:

K1 = HR%<1

K2 = 1≤HR%<2

K3 = 2≤HR%<3

K4 = 3≤HR%<4

K5 = 4≤HR%<5

K6 = 5≤HR%<6

K7 = 6≤HR%
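As a rough sketch, picking the right piece for a given HR% is a simple lookup against the cutoffs. The values of the constants themselves aren't listed here, so this only returns the label of the constant that applies. The function name is hypothetical, and it assumes an HR% of exactly 6 belongs to K7.

```python
import bisect

# Upper cutoffs for K1 through K6, in HR% points; anything at or above 6
# is assumed to fall under K7.
CUTOFFS = [1, 2, 3, 4, 5, 6]


def k_label(hr_pct):
    """Return the label (K1..K7) of the constant that applies to hr_pct."""
    # bisect_right counts how many cutoffs are <= hr_pct, so a value sitting
    # exactly on a cutoff lands in the higher piece (e.g. 1 falls under K2).
    return "K" + str(bisect.bisect_right(CUTOFFS, hr_pct) + 1)


if __name__ == "__main__":
    for rate in (0.5, 1, 2.99, 5.9, 6, 7.2):
        print(rate, k_label(rate))
```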

It works quite well. I am very excited about the current iteration of xHR%, its implications, and all it has to offer. Of course, it is not finished, but I think I’m getting closer. Please comment if you have any questions, an error to point out, or anything of that nature. There will be a results piece published soon on the 2015 season, so keep an eye out.

*It wouldn’t be surprising if Ben Revere became the first player to have a HR% equal to FB% (both at 0%, naturally).

**It is neither continuous nor differentiable.

xHR%: Questing for a Formula (Part 4)

Apologies for the significant delay between the third post and this one. A little Dostoevsky and the end of the quarter really cramp one’s time. Since it’s been a while, it would probably be helpful for mildly interested readers to refresh themselves on Part 1, Part 2, and Part 3.

As a reminder, I have conceptualized a new statistic, xHR%, from which xHR (expected home runs) can and should be derived. Importantly, xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season rather than what will happen or what actually happened. In searching for the best formula possible, I came up with three different variations, all pictured below with explanations.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s home run tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea.

PA – Plate appearances

Now that most everything of importance has been reviewed, it’s time to draw some conclusions. But first, please consider the graphs below.

Expected home runs (in blue) graphed with actual home runs (orange) using the .5 method. I plotted expected home runs and actual home runs instead of xHR% and HR% because it’s easier to see the differences this way.

Expected home runs (in blue) graphed with actual home runs (orange) using the .6 method.

Expected home runs (in blue) graphed with actual home runs (orange) using the .7 method.

Conclusions

Honestly, those graphs look pretty much the same. Yes, as the method increases from .5 through .7, the numbers seem to get more bunched up around the mean, but the differences really aren’t significant between the methods. Nor are the results from those methods particularly different from the actual results. And therein lies the crux of the matter. The formulae suggest that what happened is what should have happened, but I don’t think that’s true.

I know a great deal of luck goes into baseball. I know as a player, as a fan, and as a budding analyst that luck plays a fairly large role in every pitch, every swing, and every flight the ball takes. I don’t know how to quantify it, but I know it’s there and that’s what sites like FanGraphs try to deal with day in and day out. Knowledge is power, and the key to winning sustainably is to know which players need the least amount of luck to play well and acquire them accordingly. Statistics like xFIP, WAR, and xLOB% aid analysts and baseball teams in their lifelong quests for knowledge, whether it be by hobby or trade.

For those reasons, xHR% in its current form is a mostly useless statistic. It fails to tell the tale I want it to tell — that players are occasionally lucky. An average difference of between .6 and 1 home runs per player simply doesn’t cut it because it essentially tells me what really happened. At this juncture it’s basically a glorified version of HR/PA where you have to spend a not insignificant amount of time searching for the right statistics from various sources. But hey, you could use it to impress girls by convincing them you’re smart and know a formula that looks sort of complicated (please don’t do that).

I don’t know how big of a difference there needs to be between what should have happened and what actually happened. Obviously there still has to be a strong relationship between them, but it needs to be weaker than an R² of .95, which is approximately what it was for the three methods.

All statistics that try to project the future and describe the past are educated shots in the dark. They are similar to the American dollar in that nearly all of their value is derived from our belief in them, in addition to some supposedly logical mathematical assumptions about how they work. Even mathematicians need a god, and if that god happens to be WAR, then so be it.

Even though my formula doesn’t do what I want it to do quite yet, I won’t give up. Did King Arthur and Sir Lancelot give up when they searched for the Holy Grail? No, they searched tirelessly until they were arrested by some black-clad British constables with nightsticks and thrown in the back of a van. I will keep working until I find what I’m looking for, or until I get arrested (but there’s really no reason for me to be).

I know that wasn’t particularly mathematical or analytical in the purest sense, and that it was more of a pseudo-philosophical tract than anything else, but please bear with me. Any suggestions would be helpful. I have some ideas, but I’d appreciate yours as well.

Part 5 will arrive as soon as possible, hopefully with a new formula, new results, and better data.

xHR%: Questing for a Formula (Part 3)

Part 3 of a series of posts regarding a new statistic, xHR%, and its obvious resultant, xHR. This article will examine formulas 2 and 3.

As a reminder, I have attempted to create a new statistic, xHR%, from which xHR (expected home runs) can be derived. xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season. In searching for the best formula possible, I came up with three different variations, pictured below.

Today, I’m going to examine formulas 2 and 3 to measure their viability as formulas for xHR%. Hopefully the analysis will shine some light on a murky matter. Likely, formula 2 will end up being the best one because it probably balances in-season performance with prior performance better than formula 3, which has a heavier reliance on in-season performance. Thus, formula 3 will probably end up correlating too well with what actually happened (though the same outcome is likely for formula 2).

Methodology

Luckily for myself and the readers, the process was a simple one. Pulling data from FanGraphs player pages, ESPN’s Home Run Tracker, and various Google searches, I compiled a data set from which to proceed. From FanGraphs, I collected all information for Part Two of the formula, including plate appearances and home runs. Unfortunately, because a few of the players from the sample were rookies or had fewer than three years of major league experience, I had to use regressed minor league numbers. In some cases, where that data wasn’t applicable, I dug through old scouting reports to find translatable game power numbers based off of scouting grades (and used a denominator of 600 plate appearances).

Then, from ESPN’s Home Run Tracker website, I obtained all relevant data for player home-run distance, average home-run distance for the player at home, and league average home-run distance. Due to my limited time, I only used players that qualified for the batting title during the 2015 season, yielding a potentially weak sample of only 130 players. Additionally, before anyone complains, please realize that the purpose of my research at this point is to obtain the most viable formula and refine it from there so that it can be applied across a wider population.

Results for Formula 2

Using Microsoft Excel, I calculated the resultant xHR% and xHR. Some key data points:

League Average HR% (actual):  3.03%

Average xHR%:  2.89%

Average Home Runs: 18.7

Expected Home Runs: 17.8

Please note that there is a significant amount of survivorship bias in this data. That is, because all of these players played enough to qualify for the batting title, they are likely significantly better than replacement level, which is why the percentages and home runs seem so high.

Correlation between xHR% and HR%: 0.974418884

R² for above: 0.949492162

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4265261

Correlation between xHR and HR: 0.977796283

R² for above: 0.956085571

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.474596069

Results for Formula 3

League Average HR% (actual):  3.03%

Average xHR%:  2.92%

Average Home Runs: 18.7

Expected Home Runs: 18.1

Again, note the survivorship bias that comes with having a slightly skewed sample.

Correlation between xHR% and HR%: 0.986440621

R² for above: 0.973065099

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.4615323

Correlation between xHR and HR: 0.988287804

R² for above: 0.976712783

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.698203408
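For readers who would rather not use a spreadsheet, the summary numbers above can be reproduced with a few small functions. This is a minimal sketch with hypothetical function names. It assumes the standard deviations are sample standard deviations (dividing by n - 1), which is a guess about the spreadsheet function used and not something stated in the results.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator; an assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def r_squared(r):
    """Share of variance explained in a simple linear relationship."""
    return r ** 2


if __name__ == "__main__":
    # Squaring formula 2's reported xHR% correlation gives its reported R².
    print(round(r_squared(0.974418884), 6))
```

The R² figures listed are just the correlations squared, so the two rows in each pair carry the same information.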

Mostly Boring Analysis

I have opted to condense the analysis into one section instead of two because it would have otherwise been repetitive and boring.

I understand that that’s a lot to process, but the data really isn’t all that dissimilar. The expected home-run percentage is slightly lower than the actual home-run percentage for both of them, but it isn’t a massive difference by any means. When prorated to a 600 plate appearance season, xHR% for formula 2 predicts that the average player in the sample would have hit 17.3 home runs, while formula 3’s xHR% expects that the average home-run total would have been 17.5. In reality the average player hit 18.2 home runs per 600 plate appearances, so both were fairly close (maybe too close).
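The proration is plain multiplication of a percentage by 600 plate appearances. A quick sketch, using the averages reported in the results (the function name is hypothetical):

```python
def per_600_pa(rate_pct, plate_appearances=600):
    """Convert a home run percentage into home runs per 600 PA."""
    return rate_pct / 100 * plate_appearances


formula_2 = per_600_pa(2.89)  # average xHR% under formula 2
formula_3 = per_600_pa(2.92)  # average xHR% under formula 3
actual = per_600_pa(3.03)     # actual average HR% in the sample

if __name__ == "__main__":
    print(round(formula_2, 1), round(formula_3, 1), round(actual, 1))
```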

Both formulas had incredibly high correlations, with formula 3 correlating slightly more strongly. More importantly, formula 2 explains about 95% of the variance, while formula 3 accounts for 97%. The difference between those is relatively unimportant because they both explain a very high amount of what occurred. Furthermore, p<.001, so the correlations are statistically significant (the actual p-values are many times lower than that).

Both formulas resulted in slightly lower standard deviations than what actually occurred, which is a recurring theme. In these formulas, the numbers have been clumped a little bit closer together and tend to underestimate rather than overestimate.

Players of Interest

Mr. Kole Calhoun – Last season he hit 26 home runs, but by both formulas he should have hit 3-4 fewer. Likely, this is because his only previous full season of home runs was in 2014, when he had only 17, in addition to the fact that I was forced to use scout grades for his third season. The scout grades were particularly off for Calhoun because he wasn’t even expected to be good enough for the majors, let alone be an above-average, high-value outfielder. Even though his overall offensive prowess declined slightly this past season (by 20 points of wRC+), he didn’t appear to be selling out for power, as his power profile numbers (FB%, Pull%, etc.) remained the same. Personally, I would expect him to regress next season, and I think the formula agrees with me.

Mr. Nolan Arenado – Arguably having the most unexpected offensive breakout of the season, he increased his home-run totals from 10 in 2013, to 18 in 2014, and finally to an astonishing 42 in 2015. While his totals were probably slightly Coors-inflated, they were real for the most part because his average home-run distance was excellent, in addition to the fact that 22 of his dingers came on the road. Arenado is young and likely to regress somewhat in the power department, but he is probably around to stay as a significant home-run threat. The formula was likely wrong on this one due to weighting of prior seasons, so go ahead and make the lazy Todd Helton comparison.

Mr. Carlos Gonzalez – Though Arenado’s teammate had the highest home-run total (40) of his career in 2015, it isn’t clear that he was anywhere near his peak statistically. His wRC+ was below his career average by six points, in addition to him being a net below-average player. All of this leads to the conclusion that he was selling out for power — which makes sense given that he lost over fifty points of batting average and on-base percentage from his 2010-13 peak years. While a viable argument could be made for his “subpar” performance being due to injuries, a better one could be made that his home runs were in part a result of playing half his games at Coors Field, where he hit 60% of his round-trippers. The formula says he should have hit about seven fewer home runs, which may be a best case scenario for next season given his penchant for injury. Additionally, while the Rockies are by no means full of talent, if Gonzalez continues his overall downward trend, he could get traded and lose the Coors advantage, or he could lose playing time.

Keep watch for a concluding piece in the next week. Criticism would be highly appreciated, but keep in mind that I’m still in high school and have yet to actually study statistics.

xHR%: Questing for a Formula (Part 2)

This is Part 2 of a series of posts regarding a new statistic, xHR%, and its obvious resultant, xHR. This article will examine formula 1. The primer, Part 1, was published March 4.

As a reminder, I have conceptualized a new statistic, xHR%, from which xHR (expected home runs) can be derived. Furthermore, xHR% is a descriptive statistic, meaning that it calculates what should have happened in a given season rather than what will happen or what actually happened. In searching for the best formula possible, I came up with three different variations, all pictured below with explanations.

HRD – Average Home Run Distance. The given player’s HRD is calculated with ESPN’s home run tracker.

AHRDH – Average Home Run Distance Home. Using only Y1 data, this is the average distance of all home runs hit at the player’s home stadium.

AHRDL – Average Home Run Distance League. Using only Y1 data, this is the average distance of all home runs hit in both the National League and the American League.

Y3HR – The number of home runs hit by the player in the oldest of the three years in the sample. Y2HR and Y1HR follow the same idea. In cases where there isn’t available major league data, regressed minor league numbers will be used. If that data doesn’t exist either, then I will be very irritated and proceed to use translated scouting grades.

PA – Plate appearances

(Apologies for my rather long-winded reminder, but if you really forgot everything from Part 1, then you should really invest in some Vitamin E supplements and/or reread the first post.)

The focus formula of this post is the first one, which also happens to be the one I think will work the least well because it relies too heavily on prior seasons to provide an accurate and precise estimate of what should have happened in a given season.

In the second piece of the formula, with only fifty percent of the results from the season being studied taken into account, it likely fails to take into account the fact that breakouts occur with regularity. As a result, it probably predicts stagnation rather than progress.

Methodology

Luckily for myself and the readers, the process was an incredibly simple one. Pulling data from FanGraphs player pages, ESPN’s Home Run Tracker, and various Google searches, I compiled a data set from which to proceed. From FanGraphs, I collected all information for Part Two of the formula, including plate appearances and home runs. Unfortunately, because a few of the players from the sample were rookies or had fewer than three years of major league experience, I had to use regressed minor league numbers. In some cases, where that data wasn’t applicable, I dug through old scouting reports to find translatable game power numbers based off of scouting grades (and used a denominator of 600 plate appearances).

Then, from ESPN’s amazingly in-depth Home Run Tracker website, I obtained all relevant data for player home run distance, average home run distance for the player at home, and league average home run distance. Due to my limited time, I only used players that qualified for the batting title during the 2015 season, yielding an iffy sample of only 130 players. Additionally, before anyone complains, please realize that the purpose of my research at this point is only to obtain the most viable formula and refine it from there.

Results

Using Microsoft Excel, I calculated the resultant xHR% and xHR. Some key data points:

League Average HR% (actual):  3.03%

Average xHR%:  2.85%

Average Home Runs: 18.7

Expected Home Runs: 17.7

Please note that there is a significant amount of survivorship bias in this data. That is, because all of these players played enough to qualify for the batting title, they are likely significantly better than replacement level, which is why the percentages and home runs seem so high.

Clearly, the numbers match up fairly well, with this version of the formula expecting that the league should have hit home runs at a .18% lower clip, and one fewer per player, which amounts to a significant difference. Over the course of a 600 plate appearance season, the difference between them is still only a little more than one home run, an acceptable distance.

Correlation between xHR% and HR%: 0.960506092

R² for above: 0.922571953

HR% Standard Deviation: 1.5769373

xHR% Standard Deviation: 1.3883746

Correlation between xHR and HR: 0.966224253

R² for above: 0.933589307

HR Standard Deviation:  10.43771886

xHR Standard Deviation: 9.201355342

While xHR% using this formula apparently explains about 92% of the variance, correlation may not be the best method of determining whether or not the formula works adequately. This holds at least for the relationship between xHR% and HR%, because there’s only a minuscule difference between their numbers (but one that matters), meaning it’s not a particularly explanatory method and that it may not have the descriptive power I’m looking for. Nevertheless, it is important to note that the correlation is not a product of random sampling, as p<.005. Unsurprisingly, the standard deviation for xHR% is smaller than that of HR% (nearly insignificantly so), indicating that the data is clumped together close to the mean as a result of using this formula, a potentially good thing (in terms of regression).

A better indicator of the success of the formula is the correlation between xHR and HR, a relatively high value of ≈.97. Here, presumably because the separation between home runs and expected home runs is greater, the formula ostensibly explains approximately 93% of the variance in outcomes and resultant data. However, in this case, the standard deviation for actual home runs is about 10.4, while for xHR it’s about 9.2, suggesting that, after being multiplied out by plate appearances, xHR is spaced nearly as evenly as HR. Ergo, it likely serves as a decent predictor of actual home runs.

Players of Interest

Mr. Bryce Harper – It’s likely there isn’t a better candidate for regression according to this formula than Bryce Harper, who the formula says should have hit only 32 home runs as opposed to his actual total of 42. While he did lead his league in “Just Enough” home runs with 15, he’s also always been known for having prodigious power (or at least a potential for it). Furthermore, Mr. Harper dramatically changed his peripherals last season to ones more conducive to power. Suggesting this are the facts that he increased his pull percentage from 38.9% to 45.4%, his hard hit percentage from 32% to 40%, and his fly ball percentage from 34.6% to 39.3%. On their own, each of the previous statistics lends credence to the idea that Harper changed his profile to a more home-run-driven one, and taken together they suggest it strongly. His season was no fluke, and the formula certainly failed him here because it weighted prior seasons far too heavily.

Mr. Brian Dozier – No surprises here. Mr. Dozier has certainly been trending upward for a long time, and in a model that heavily weights prior performance such as this one, upticks in performance are punished. Nevertheless, the data vaguely supports the idea that Dozier should have hit 24 home runs instead of 28. While he did significantly increase his pull percentage to an incredibly high 60% from 53%, he did play in a stadium where it’s of average difficulty to hit pull home runs as a right-handed hitter. Moreover, 10 of his 28 home runs were rated as “Just Enough” home runs, in addition to his average home-run distance being 12 feet below average (admittedly not a huge number, nor a perfect way of measuring power). If I were a betting man, I’d expect him to hit 4-6 fewer home runs this coming season.

Keep watch for Part 3 in the coming days, which will detail the results of the other formulas. Something to watch for in this series is the issue that the results of the formula correspond too closely to what actually happened, which would render it useless as a formula.

Note that because I have never formally taken a statistics course, I am prone to errors in my conclusions. Please point out any such errors and make suggestions as you see fit.