## Fun Notes From the Past Calendar Year

Every couple of months, I like to write a post highlighting some data from the Past Calendar Year split on our leaderboards. It’s one of my favorite tools on FanGraphs, giving us a look at how a player has done over a rolling full-season window. It’s a better way to look at recent performance than season-to-date numbers alone, and gives us a larger sample while still focusing mostly on what a player has done in his last ~162 games.

So, here are some random statistical tidbits from data accumulated from August 6th, 2012 to August 5th, 2013, with the minimum number of plate appearances set to 400 to include some interesting guys who have missed time due to injuries, as well as expand the number of starting catchers in the pool.

*Top 10 first basemen, by wRC+: Chris Davis (169), Joey Votto (160), Edwin Encarnacion (144), Paul Goldschmidt (137), Brandon Moss (136), Allen Craig (134), Prince Fielder (134), Justin Smoak (133), Brandon Belt (131), Freddie Freeman (128)*

Brandon Moss is kind of the perfect example of why I like the past calendar year feature. If you just look at his seasonal lines, it’s easy to miss how good he’s been. Yeah, he was awesome last year, but that was just a couple hundred plate appearances, and he hasn’t been nearly as good this year, posting just +1.0 WAR over fairly regular playing time. But if you see the entire stretch as one full season, Moss has been a monster, putting up basically the same wRC+ in the past “year” as Goldschmidt, who is considered a potential NL MVP candidate this season.

Granted, we’re only covering 503 plate appearances with Moss, and his stats are being anchored by the oldest data in the sample, but even in 2013, he’s been better than you might realize. Brandon Moss’ success is one of the undercover reasons the A’s are on track to win their second straight division title. This is a guy who was basically considered a AAAA player 12 months ago, and since Oakland gave him an opportunity, he’s hit like an All-Star.

The other interesting name on the list has to be Justin Smoak. On this date a year ago, Smoak was struggling in Triple-A, having been shipped back to the PCL after playing his way off one of the worst offensive teams in the sport. He wasn’t even good in Tacoma, and looked like a classic busted prospect. Last summer, I noted that of all first basemen to get 1,000+ plate appearances through their age-25 season in the last 30 years, Smoak was the very worst hitter of the entire bunch. It wasn’t a small sample. Smoak racked up several years’ worth of futility, and even his minor league track record didn’t offer that much hope.

But since he returned from the minors, Smoak has equaled the likes of Fielder and Craig in offensive output. Like with Moss, we’re dealing with a smaller sample of plate appearances, and he hasn’t been consistently great during that stretch — he was awful again in April — but he’s got nearly 500 plate appearances of high quality offensive performance. It doesn’t completely wash away the 1,300 lousy plate appearances that came before it, but Smoak is finally showing some of the production that made him a high first round pick in 2008.

*A pair of third basemen have been almost exactly the same at the plate.*

Name | PA | H | 1B | 2B | 3B | HR | BB | SO
---|---|---|---|---|---|---|---|---
David Wright | 676 | 178 | 119 | 32 | 6 | 21 | 68 | 114
Josh Donaldson | 657 | 169 | 109 | 35 | 1 | 24 | 64 | 118

Even their UZRs match up almost identically. Wright has the advantage in baserunning, but in terms of hitting and fielding, their stats over the last 365 days are virtually identical.

*Worst regular in baseball? Jason Kubel (-1.9 WAR)*

The Royals rightfully got a lot of grief for playing Jeff Francoeur as much as they did, but Kubel has actually been even worse than Frenchy over the last year. An immobile DH who is being forced into the OF because he signed with an NL team, Kubel has hit .207/.281/.358, good for a 69 wRC+. The oddest part about this is how it has basically come out of nowhere.

After the D’Backs signed him last year, he was an offensive monster for the first four months of the 2012 season. At the end of last July, he had a 141 wRC+, making him one of the best hitters in the game at that point. Since then, his wRC+, by month: 52, 73, 164, 28, 102, 31, -30

A year ago, he looked like a huge steal for the Diamondbacks, providing left-handed power at a bargain price. Now, the D’Backs should probably pay the buyout on his 2014 option. When people accuse defensive metrics of being unreliable because of the year to year variance, they should remember things like Jason Kubel’s offensive performance over the last couple of years.

*By RA9-WAR, Clayton Kershaw (+9.5) has been the second most valuable player in baseball.*

His last 34 starts: 246 innings, 56 runs allowed. To put that in context, Justin Verlander has allowed 37 more runs while pitching 17 fewer innings over the same time period. Among qualified starters, only Mat Harvey and Bartolo Colon have allowed fewer total runs to score, and they’ve thrown 52 and 77 fewer innings respectively. For Harvey to match Kershaw’s total runs allowed over the same number of innings, he’d have to post a 1.21 RA — not ERA, just straight RA — over his next ~seven starts.

And this is obviously nothing against Harvey, who has been excellent himself. Kershaw’s just on another level right now. 2.04 runs allowed per start over 34 starts. Yeah, this guy is pretty good.


Who are the leaders by WACK (Wins Above Corey Kluber)?

No one, but C Klubs is on another level

Got called out on a separate post. I guess we could either (a) change the metric to WBCK (but lose the ease of pronunciation) or (b) keep the metric as WACK but understand that values will inherently be negative (to varying degrees)

Corey Kluber is the current leader, at 0.

Highest BA of any OF with 400 PAs: Torii Hunter at .323.

I guess that is a solid argument against batting average. Hunter is a solid player but he is not the best hitting outfielder even over the past year.

While I don’t disagree, I think Hunter’s transformation from quasi-hacker to somewhat patient hitter to contact hitter who doesn’t really walk has been pretty fascinating. He claims it’s intentional now, but I’m not sure how you decide to run a .375 BABIP over your age-36 and age-37 seasons.

Because I’m doing my best to beat this horse into the ground.

“When people accuse defensive metrics of being unreliable because of the year to year variance…”

That’s not why I hate defensive metrics being quoted in small (year-sized) samples. It’s because people constantly want to treat defensive stats like they do offensive ones. Offensive stats are great because at least they tell us “what happened.” Regardless of whether it’s a good estimate of a player’s true talent, at least it’s not in question whether he got on base x% of the time last year.

That’s not true for defensive stats. There’s significant measurement error. It’s not just that a true-talent +5 defender can play like a +0 defender for months at a time due to bad luck or variation around a skill level. It’s that a true-talent +5 defender who is also playing like a +5 defender can be measured as a +0 defender. The very significant amount of measurement error inherent in the stats means that you can’t treat them as the great “story-telling” stats the way we constantly do with offensive stats. It means that you have to estimate even retroactive values.

This means that even when you’re looking at retroactive values, you still need to estimate them from noisy data, something that never needs to be done with offensive stats, as they measure retroactive performance perfectly. And that means that you need enough sample size, even for retroactive stats, to overcome the measurement error.

Basically my criticism boils down to this: baseball fans don’t know how to think about or how to account for measurement error because offensive stats have been spoiling them for so long. So when they deal with stats that have significant amounts of measurement error, e.g. defensive stats, they don’t use them properly.

I completely agree, I won’t trust WAR until it replaces its current defensive input with a combination of (a) number of Gold Gloves and (b) John Kruk’s estimation.

You show that straw man, Hiro! He’s quaking in his boots.

While I appreciate the joke.

In all seriousness, I do think that defense in WAR needs to be replaced with something. My best idea is to be a Bayesian and just regress UZR/150 towards some baseline mean. Probably something like the positional average or even position-age average.

What’s Bayesian about assuming that everyone’s average defensively? That’s clearly not true.

I never assumed that everyone is average. I said that you should regress their defensive values towards some prior belief. That’s a fundamentally Bayesian technique, as you have some prior belief (the mean you are regressing towards) and you update your beliefs with observed values.

@Bronnt

I’m just goofing off. Roger’s right.

Batted balls are unreliable and random. They can’t be trusted to measure true talent either offensively or defensively. BABIP is very fluky over small samples. Why do you single out defense? Both are measuring actual plays made on actual balls.

Correction: That was supposed to say “They can’t be trusted to measure true talent *in small samples* either offensively or defensively.”

(Corrections to corrections – one of those days.)

No they aren’t. That’s my fundamental problem. BABIP at least perfectly measures what happened on the field. No one is going to say that someone with a .400 BABIP actually had a .350 BABIP. Defensive stats don’t perfectly measure what happened. There is significant measurement error even in measuring what happened.

Imagine that you were trying to determine the true average temperature during the month of May. BABIP fluctuation is like daily fluctuations in temperature if you had a perfect thermometer. There is randomness there, but at least the temperature that you get on each day exactly records “what happened.” So if you want to retroactively determine the temperature, you can do it perfectly.

Now imagine that your thermometer randomly adds +5 or -5 degrees to each day’s reading. Now you have additional noise when trying to figure out the true monthly temperature, but more than that, you can’t even retroactively determine what temperature it was on a given day. If you want to determine a past day’s temperature retroactively, the best way to do it is to actually take some average temperature to even out the measurement error that you are getting.

For forecasting, these situations are more or less equivalent; one just creates more noise than the other. But these are two fundamentally different problems when dealing with retroactive stats (which WAR is). And the people that I see talking about them in baseball just don’t get that.

How does the +5 or -5 thermometer relate to UZR in baseball? I’m not asking because I disagree, I’m just trying to understand what exactly causes the measurement error you say is in UZR. Is it because of the randomness of batted ball location? Is it due to park factors or pitching staff or the complications of the interactions of 9 defenders (ok, well, usually just the 2 nearest you)?

I suppose I should read up on UZR more…

Excuse me for not getting your point at all. It seems that you are assuming that UZR variation is due to sampling error and not actual variation in performance. We don’t regress a player’s batting average to his career norms for WAR, so why would we do so for defense? UZR does measure what happened quite well; it simply records outs converted on balls hit into and out of a player’s positional zone. There isn’t much of a subjective issue with that approach.

If a career .275 hitter can have a .315 season, why can’t an above average defensive SS have a below average year in the field? We don’t want to estimate a player’s defensive contribution for a counting stat like WAR, we want to count it in a meaningful way.

WAR is inherently not a predictive stat, and seasonal UZR shouldn’t be used as such either. A good UZR tells us the player was good at converting outs last year. A good career UZR tells us that the player is likely a good defender. I don’t see the issue with using UZR, with all its variation, in WAR when every other component in WAR also varies.

As I understand the basics behind UZR:

For each ball hit in play, they generate a number that encapsulates how “difficult” the ball would be to field. They do this mostly with batted ball data, and mostly by partitioning the field into slices. Using that data, they might determine that a pretty hard-hit ball in the gap would be caught by 5% of center fielders, but that a lazy fly ball right at a fielder would be caught by 95%.

Once you do that, it’s not difficult to aggregate the numbers to create a measure of value. If a player catches a ball that 10% of players would catch, then he gets a lot of credit. If he drops a ball that 95% of players would have caught, he loses a lot of value. So (numbers are just for example), if he caught the 10% ball then maybe he gets +.4, but if he drops the 10% ball he only gets -.05, as those numbers take into account the difficulty.
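That aggregation step can be sketched in a few lines. This is an illustrative sketch only, not UZR’s actual formula: the run value per out and the example catch probabilities are assumptions.

```python
# Illustrative sketch of the aggregation step, NOT UZR's actual formula:
# a fielder gains credit for outs above expectation and loses credit for
# outs below expectation, scaled by an assumed run value per out.

RUNS_PER_OUT = 0.8  # assumed run value of turning a ball in play into an out

def fielding_runs(plays):
    """plays: list of (caught, catch_prob) pairs, where catch_prob is the
    league-wide probability that this ball gets turned into an out."""
    total = 0.0
    for caught, catch_prob in plays:
        made = 1.0 if caught else 0.0
        # credit = (actual outs - expected outs) * run value of an out
        total += (made - catch_prob) * RUNS_PER_OUT
    return total

# Catching a 10% ball earns a lot; dropping a 95% ball costs a lot;
# dropping a 10% ball costs almost nothing.
print(round(fielding_runs([(True, 0.10)]), 3))   # 0.72
print(round(fielding_runs([(False, 0.95)]), 3))  # -0.76
print(round(fielding_runs([(False, 0.10)]), 3))  # -0.08
```

The asymmetry in the comment above falls out naturally: credit is always measured against the expected out rate, so easy plays carry little upside and hard plays carry little downside.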

The error comes in the first step, determining how difficult a ball is to catch. And that’s because we can’t perfectly measure how difficult a ball is to catch, for fairly obvious reasons. Batted ball data (until field f/x is available) is not very granular, and we’re obviously missing a lot of other factors that would impact fielding as well. Add in the discrete slices and different ballparks and there are quite a few sources of measurement error here.

So when UZR says that a particular ball would have been caught by 70% of fielders, there’s a lot of error in that estimate. It’s not like BABIP or any other offensive stat. Like for OBP, there’s no question what percentage of the time he got on base. It’s perfectly measured. For UZR, when you give a guy credit for +.2 runs because he made a play that only 30% of players would make, there’s a lot of error around the +.2, mostly generated by error around that 30% number.

@Pirates Hurdles

My problem is not with variation. Variation is great, variation is expected. My problem is with the type of variation. There’s a fundamental difference between results that vary around a skill level, and measurement error that varies around the results.

For instance, say you’re throwing a ball, and I’m interested in how far your “true throwing” distance is. Every time you throw the ball it’s not going to be the same distance. That’s variation around your skill level.

But say that I have an unreliable measuring tape. That’s measurement error. It also adds variation, but it’s different.

For projection or estimation, the problems are the same. There’s variation, and I use an average to even out the variation in an attempt to estimate your skill level.

But if you want a retroactive stat, they aren’t the same. Say I want to know how far you threw it on average in the last 5 trials. If there’s no measurement error then I can take a naive average and that’s exactly what I want. Perfectly measured.

But if there’s measurement error, then that naive average DOES NOT tell you how far you threw it on average in the last 5 trials. And if the measurement error is big enough, taking the average over the last 15 trials might actually give you a better estimate of “the average distance of the last 5 trials,” purely because it helps cut out the measurement error noise.
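A quick simulation of this throwing example makes the point concrete. All of the numbers here are invented for illustration; the key assumption is that measurement error (the tape) is much larger than throw-to-throw skill variation.

```python
# Simulation of the throwing example (all numbers are assumptions):
# true distance varies a little throw to throw, but the tape measure
# adds a lot of noise. We try to recover "average true distance of the
# last 5 throws" two ways: naively averaging the 5 noisy readings, or
# averaging all 15 noisy readings to dilute the measurement error.
import random
import statistics

random.seed(1)
SKILL_SD = 2.0     # throw-to-throw variation around true talent
MEASURE_SD = 10.0  # measurement error of the unreliable tape

def one_trial():
    true_dists = [100 + random.gauss(0, SKILL_SD) for _ in range(15)]
    observed = [d + random.gauss(0, MEASURE_SD) for d in true_dists]
    target = statistics.mean(true_dists[:5])  # what we want to recover
    naive = statistics.mean(observed[:5])     # average of the 5 readings
    wide = statistics.mean(observed)          # average of all 15 readings
    return (naive - target) ** 2, (wide - target) ** 2

errs = [one_trial() for _ in range(20000)]
mse_naive = statistics.mean(e[0] for e in errs)
mse_wide = statistics.mean(e[1] for e in errs)
print(mse_naive, mse_wide)  # the 15-throw average has lower error
```

With these assumed variances the 15-throw average recovers the 5-throw truth better, even though it mixes in throws we don’t care about; that only flips if skill variation grows larger than the measurement error.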

Oh ok, thanks, I see what you mean. So I guess you would be more satisfied with the measurements if the field slices are shrunk, like using a finer mesh in a finite element simulation (just in case you know what that is), as well as dividing up batted ball type into more, smaller slices. But then if you did that, you would have much smaller historical samples for each combination of ball type and location, so you still have a noise problem, no?

From what I understand, different ballparks are already included, as far as elevation, weather, etc, but maybe, for instance, “deep, straightaway left” in Fenway is currently mistakenly treated the same as “deep, straightaway left” in PNC or Petco?

Again, I think people are spoiled by offensive stats. There’s no measurement error, so retroactive stats are perfect.

For instance, you never argue about whether someone hit .350 over the last month. There’s no question about that. The only question is whether he is going to be able to hit .350 over the next month, or whether hitting .350 is his true skill level.

With measurement error it’s different. If someone grades as a +2 defender last month, there’s a question of whether he actually was a +2 defender last month. It’s not just about projection, it’s actually a question of the ability of the stat to measure what happened. That’s something that people never have to deal with with offensive stats. It would be like a stat saying that a guy hit .350 last month, but there being a possibility that he actually hit .340 or .360 last month.

@Jay29

I have no problems with the methodology of UZR. I think it’s the best we can do with the data that we have. There’s measurement error there, but it’s still a useful stat and the best we have.

My problem is that no one else seems to acknowledge or account for the problems with UZR. Measurement error is real and should be treated differently than regular skill variation when doing retroactive stats. That’s why I advocate regressing defensive stats towards a mean when using them in WAR. As a simple example, over a full season, if someone posts a +10 UZR they should be credited with +5 defensive runs saved in the WAR calculation. Because regressing stats like that helps mitigate the measurement error problem.
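As a sketch of what that regression might look like: the shrinkage factor is the share of observed variance attributable to real skill rather than error, and the 50/50 variance split below is an assumption chosen only to reproduce the +10 → +5 example.

```python
# Minimal shrinkage sketch (assumed numbers). The regression factor is
# skill_var / (skill_var + error_var): the share of observed variance
# that reflects real fielding value rather than measurement error.

def regressed_uzr(observed, league_mean=0.0, skill_var=1.0, error_var=1.0):
    shrink = skill_var / (skill_var + error_var)
    return league_mean + shrink * (observed - league_mean)

# With equal skill and error variance, a +10 observed UZR is credited
# as +5 in the WAR calculation, matching the example above.
print(regressed_uzr(10.0))   # 5.0
print(regressed_uzr(-18.0))  # -9.0
```

In practice the right amount of regression depends on how the two variances actually compare, which is exactly the quantity that is hard to pin down for defensive metrics.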

If the problem is that the measuring tool (UZR) isn’t that accurate, wouldn’t you just introduce error bars instead of regressing everyone to the mean?

For instance, last year Michael Bourn was +23 and Curtis Granderson was -18 in CF. Your solution is to regress Bourn to +18 and Granderson to -13, but how do we know that the measurement error didn’t go the other way? Maybe Bourn was really +28 and Granderson was really -23. If the measurement problems that you attribute to UZR are as random as you say, either outcome would be just as likely.

That is also a way of dealing with it.

Regressing to the mean does have a theoretical justification though. If you assume that both skill (either true or realized) and the measurement error are distributed in some sort of center-weighted distribution (e.g. normal), then the farther from the mean you go the bigger the measurement error problem becomes.

For a quick example of how this plays out, imagine two random variables a and b. Here a might be realized skill and b might be measurement error. Imagine that they are both normally distributed, with mean 0 and variance 5.

Now say that you can only observe a+b, and say that you observe a value of 10. Because the distributions of a and b are center-weighted, it’s much more likely that a=5 and b=5 than a=10 and b=0, and it’s extremely unlikely that a=15 and b=-5.

Say that, instead, you observe an a+b of 0. Now observing a=-5 and b=5 is equally as likely as observing a=5 and b=-5, and observing a=0 and b=0 is more likely than both of those.

You can verify these calculations, but the point is that when dealing with measurement error, the farther away from the mean you go the more likely that you’re observing significant measurement error. Regressing towards the mean helps mitigate this.
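Those claims are easy to verify with a quick Monte Carlo check, using the same assumed distributions (mean 0, variance 5 for both variables):

```python
# Monte Carlo check of the a+b example: a is realized skill, b is
# measurement error, both ~ Normal(mean 0, variance 5). Conditioning on
# an observed sum near 10, the expected value of a is about half the
# sum: an extreme observation is most likely part skill, part error.
import math
import random

random.seed(7)
SD = math.sqrt(5.0)  # standard deviation for variance 5

a_given_sum10 = []
for _ in range(1_000_000):
    a = random.gauss(0, SD)
    b = random.gauss(0, SD)
    if abs((a + b) - 10.0) < 0.5:  # keep samples where the sum is ~10
        a_given_sum10.append(a)

mean_a = sum(a_given_sum10) / len(a_given_sum10)
print(round(mean_a, 2))  # close to 5, not 10
```

The conditional mean of a given a+b=10 lands near 5, not 10: half of the extreme observation gets attributed to error, which is exactly what regression toward the mean does.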

Actually, on second thought, adding error bars is the wrong way to deal with measurement error, because of how conditional probability works (as I tried to show in my example). You either regress towards a mean or use more data. If you’re comfortable with putting strong assumptions on the distributions of both talent and the measurement error, then you can create proper error bars conditional on observing each value (note that they won’t be symmetric), but no one should be comfortable with those assumptions.

Is Harvey going the Latos route and lopping one of the Ts off of his last name?

He used to be Mat Lattos? Score one for Melvil Dewey!

I think it was Mat Latost. Or Mat Tlatos. Or Mat LaTots. I forget.

You beat me by one minute! Great minds and such (some just a little slower than others…)

…or he could have been Latost. Or (my favorite) Tlatos.

And he’s just one more change from being Mat Laos!

Haha nice! Maybe he should forego the letter ‘t’ altogether and go by Ma Laos, the Asian madam of Cincy’s red light district.

Haha. I like that our first two options were even identical. I think that means we’re officially dating.

My immediate reaction to Jason Kubel as worst regular: there is no way he is worse than Yuni. While it is true that Yuni is worse, he avoids the distinct honor of “worst regular” thanks to the Royals releasing him mid-season last year.

The real problem with defensive stats (and therefore with WAR) is that they just aren’t comparable to offensive stats – but I’d emphasize a different issue from Roger’s. The problem I see is that teams simply don’t value offense and defense on the same scale, so the distributions are different. You can’t regress to the mean if you don’t know the mean.

How the teams value contributions is not the same as how valuable the contributions are.

Why are you touting arbitrary endpoints you biased prick!

The big surprise for me in your 1B list is Belt. I’ve read that Giants fans are dissatisfied with him. Buttheads!

There is a really bizarre media movement stoked by KNBR and certain Giants beat writers who are Andrew Baggarly to criticize Belt to a really ridiculous degree, as if he’s singlehandedly to blame for the Giants’ bad year. It makes no sense, but it’s contagious. I had a conversation on twitter the other day with someone who wants to replace Belt with actual literal Adam Dunn.

This is what sports talk radio hath wrought.