# Slowly Back Away from the Pythag Expectation

**Updated: Thanks to the commenters, especially Evan, for double-checking my work. I had an issue in Excel that messed up the results for May and June. Charts have been updated.**

For most of 2012, the Baltimore Orioles have been playing over their heads. Well, at least when it comes to their expected win-loss record.

Based on the run differential the team has generated, the O’s have amassed 10 more wins than we would expect based on their Pythagorean winning percentage. The team has outplayed its cumulative expected winning percentage throughout the year and, since April, picked up two additional wins at the end of each month. If they sustain this performance and finish August with at least 10 more wins than their Pythagorean winning percentage would predict, they would be just the third team to do so since 2001 (the 2004 Yankees and the 2007 Diamondbacks are the other two).

Some might point to this glaring discrepancy between Baltimore’s actual winning percentage and Pythagorean winning percentage as evidence that the Orioles cannot sustain their winning ways. Of course, this raises the question of whether we really gain anything from a predictive standpoint heading into September if we focus on a team’s expected winning percentage rather than their actual performance.

The answer based on a review of the past decade seems to be no.

Using Baseball-Reference’s historical standings tool, I looked at cumulative team winning percentage at the end of May, June, July and August — between 2001 and 2011 — and how well their winning percentage and Pythagorean winning percentage* correlated with the actual rest-of-season winning percentage (n=330 individual seasons). Basically: How well did a team’s winning percentage through the end of July correlate with the team’s winning percentage from Aug. 1, through the end of the season?
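To make the comparison concrete, here is a minimal sketch of the calculation described above: a Pearson correlation between a to-date percentage and the rest-of-season (ROS) percentage. The function name `pearson_r` and the four team-seasons are illustrative assumptions, not the article's data or spreadsheet.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical team-seasons (made-up numbers, for illustration only):
# (W% through July 31, Pythagorean W% through July 31, W% from Aug. 1 on)
sample = [
    (0.600, 0.560, 0.550),
    (0.450, 0.480, 0.500),
    (0.520, 0.530, 0.470),
    (0.400, 0.430, 0.440),
]
actual = [row[0] for row in sample]
pythag = [row[1] for row in sample]
ros = [row[2] for row in sample]

print("actual W% vs ROS W%:", round(pearson_r(actual, ros), 3))
print("Pythag W% vs ROS W%:", round(pearson_r(pythag, ros), 3))
```

Squaring a correlation coefficient gives the share of variance explained, which is presumably the sense in which the article speaks of "explanatory power."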

Here are the results:

For the first few months, Pythagorean winning percentage holds a slight edge over actual winning percentage. In July, the advantage flips to actual winning percentage, but here we are still only talking about a .01 advantage. That changes in August. That’s where winning percentage has about a 3% advantage in terms of explanatory power. It would appear that once a team manages to navigate to August, focusing on a team’s expected winning percentage — with an eye toward predicting September — is less helpful than simply looking at real performance.**

But what about significantly overachieving teams, like this year’s Orioles? Surely, a team that’s five or more wins above their expected winning percentage is expected to fade down the stretch, right?

Turns out, not so much.

First, let’s simply isolate the data to teams that outperformed their Pythagorean expectation by five or more wins through the end of each month:

As you see, the results are similar to what we found with the full sample: Actual winning percentage is just as good, or better, at predicting ROS winning percentage in July and August. And it gives about 4% better explanatory power at the end of August.

For underperforming teams, though, the story is much different:

The predictive power of both types of winning percentage increases each month, but in this case, Pythagorean winning percentage has a higher correlation to ROS winning percentage in each month.

So why the difference? Here’s one theory: It’s somewhat of a self-fulfilling prophecy. For example, teams that are over performing are more likely to see themselves as contenders and make moves to reinforce that status. By the time you get to August, they may have counteracted any kind of natural regression by altering the team. Conversely, underachieving teams might do the opposite, thereby securing a disappointing season.

There are probably a bunch of other theories for this finding (e.g. superior performance in one-run games based on bullpen strength, which isn’t captured by the calculation for expected winning percentage), but the bottom line is that simply relying on expected winning percentage to discount a team’s chances of sustaining its performance through September isn’t a great idea, especially since actual winning percentage generally predicts the September record better. The better way to approach it would be to look at whether the composition of the team has changed (for better or worse) and whether there are trends in its more recent performance that foretell a change in either direction (e.g. injuries, pitcher fatigue).

—————

**I also used Baseball-Reference’s Pythagorean Winning Percentage: For details on the equation used, see *here*.*
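For reference, the general form of the Pythagorean expectation can be sketched as below. The exponent is an assumption here: Bill James's original formula squares runs scored and runs allowed, and Baseball-Reference is generally described as using 1.83, but the linked page is the authority on what this article actually used. The function names and the sample team are hypothetical.

```python
def pythag_win_pct(runs_scored, runs_allowed, exponent=1.83):
    """Expected winning percentage from runs scored and runs allowed.

    exponent=1.83 is an assumed default; exponent=2 gives the classic form.
    """
    if runs_scored < 0 or runs_allowed < 0 or runs_scored + runs_allowed == 0:
        raise ValueError("run totals must be non-negative and not both zero")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

def wins_above_expectation(wins, losses, runs_scored, runs_allowed, exponent=1.83):
    """Actual wins minus Pythagorean expected wins over the games played."""
    games = wins + losses
    return wins - games * pythag_win_pct(runs_scored, runs_allowed, exponent)

# Hypothetical team (made-up totals): 70-55 record, 550 runs scored, 570 allowed.
print(round(pythag_win_pct(550, 570), 3))
print(round(wins_above_expectation(70, 55, 550, 570), 1))
```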

***These findings differ from Clay Davenport’s, who looked at the difference between actual and Pythagorean winning percentages through various games-played increments. Why that’s the case, I’m not sure. But his overall result, that Pythagorean winning percentage becomes less useful relative to actual winning percentage as the end of the season draws closer, is consistent with the findings here.*


How about using PWs as a predictor for the playoffs? (Team A won 87 games and made the playoffs, but only had 79 PWs, so they’re toast against Team B who had 90.)

Seems very likely. I would also add this could be due to the types of players you play for experience. For instance the competitive team would probably have a short leash for unproven players while a team at the bottom of the standings is more likely to give inexperienced players who perform badly initially a lot more rope.

When you say “correlated with actual rest of season winning percentage” you mean winning percentage of the remaining schedule, right? Not the total winning percentage at a later date?

I really hope so, otherwise this article isn’t worth anything. I think it is, though.

Yes, it has to be. You’d never get correlations around 0.50 if record to date was used each time–they’d be way higher.

Am I interpreting the graphs correctly, to be able to state that one can use Pythag to predict *first half* performance pretty reliably? I.e. take the pythag for a team as of February or March of a given season, and you’ll have a good sense for how that team will do from April through June, possibly a bit into July… but no further.

No. You’re not even close. In fact you’re so far off I’m not quite sure where to begin…

To start with a team has no pyth. win/loss in February or March. It’s a statistical measure of actual games played, used to compare a team’s actual record and their expected record based on the number of runs scored/given up.

The numbers in the chart are correlations. Take the first chart, ‘Predicting Rest of Season Winning Percentage’, and find the number for ‘end of May’: it is .42. That means that for all teams 42% of them ended up with a similar winning percentage at the end of the season as at the end of May.

That’s an oversimplification, but from where he was starting, good job putting it into words he might understand.

As I read the article the two most likely explanations are the two things you mentioned in your last two paragraphs.

The other thing, which you sort of noted in the last paragraph, is that while the Pythagorean formula works in general for teams, that doesn’t mean it works for every team, every year. You noted strong bullpen (and benches) might lead to sustained success in 1-run games. Also a team that’s struggling to find a fifth starter could consistently outperform its Pythag. Say on average they win 5-4 when their top 4 starters pitch but lose 4-8 when their 5th starter is on the mound. (That’s extreme but makes the math easy.) Overall they score 24 runs and allow 24 runs each time through the rotation, but since the runs allowed aren’t evenly or normally distributed, you’d actually expect this team to have a >.500 winning percentage despite Pythag’s .500 prediction.
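The arithmetic in that example can be checked directly. This is a sketch of the commenter's hypothetical rotation only; the exponent of 2 is an assumption (with equal runs scored and allowed, any exponent gives .500).

```python
# One hypothetical turn through a five-man rotation, as (runs scored, runs allowed):
# four 5-4 wins behind the top four starters, one 4-8 loss behind the fifth.
games = [(5, 4)] * 4 + [(4, 8)]

runs_scored = sum(rs for rs, _ in games)
runs_allowed = sum(ra for _, ra in games)
wins = sum(1 for rs, ra in games if rs > ra)

actual_pct = wins / len(games)
pythag_pct = runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)

print(runs_scored, runs_allowed)  # 24 24
print(actual_pct)                 # 0.8
print(pythag_pct)                 # 0.5
```

The team goes 4-1 on even run totals, so run distribution alone opens a gap of .300 between actual and expected winning percentage in this extreme case.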

Versus the Angels and the Rangers the Orioles are -54. Against the rest of baseball they are +9

Hey, I know you.

Great article.

Run differential, while often being a good indicator of future success, can sometimes be misleading about how good a team is.

For example:

The Blue Jays have had a better run differential all season, and yet Baltimore has always been ahead in the standings. I believe there are two reasons why this is, or why Baltimore may be better than teams with a better +/-.

I’ve watched the Jays all year, and while their offense has been good, it’s not great as runs scored would indicate. A quick perusal through game logs shows you that they have more than a few games where they score 10+ runs for one game, beating up on bad pitching. They’ll then struggle to get to 4 for the next few games. Run differential isn’t a good stat to use for teams with inconsistent offenses.

Secondly, Baltimore’s great bullpen has allowed them to have a great record in close games. They have had great production from their bullpen so far, ranking fourth in bullpen E.R.A. As well, of the 5 relievers who have pitched in 50+ games for them, the highest E.R.A. is Jim Johnson’s 3.08 (I use E.R.A. because whether they’re getting lucky pitching in those games up to this point is irrelevant).

Now, can the bullpen keep it up? All five of those relievers’ E.R.A.s are beating their FIP by at least about half a run. Interestingly, UZR doesn’t like Baltimore’s fielding very well either. Jones, Markakis have been both – this year, so one wonders whether the bullpen can sustain their success.

I’m very skeptical about strong bullpen being a major factor in exceeding Pythag. One case, of course, doesn’t prove anything.

However, among people who are at least intelligent and informed enough to discuss the subject, this is generally believed, maybe by 90%.

I would love to see a study on the subject.

Note: the fairly sophisticated fans also believe that exceeding Pythag is proof of a good manager. I am also skeptical about this, but I don’t see how a study could be done on this without using circular reasoning.

Quick look at the numbers: going back to 2001, correlation between Shutdown/Meltdown ratio and wins above/below expected for the season is roughly .42. That’s pretty healthy. Of course, this doesn’t control for any other factors, but at first blush it suggests it’s helpful.

I did a large study of year-to-year Pyth differential correlation and found that it was about 10% predictive. Teams that underperformed the most and who changed closers had no correlation, while those that did not change closers contributed robustly to the overall correlation (which IIRC was a bit less than 10%; that figure excludes the underperforming teams which did change closers). There were of course too few overperforming teams that changed closers to run any numbers on.

Thanks for the replies, Bill & Eric.

In a large sample with an undistorted distribution of runs, the actual win pct would be exactly the pythag win pct. so the way to outperform that is to get runs at the right time. on the one hand, this can be measured by “Clutch” or some other measure of clutchiness. on the other hand, this will look like scoring runs when they matter, i.e. winning a lot of 1-run games, and/or giving up runs when they dont count, ie losing more blowouts than you win. the former is probably sustainable with a good bullpen performance. the latter is, i would imagine, all from random variation.

I’m very confused about how the 2nd and 3rd graphs have totally opposite slopes. The slope of the 3rd graph makes the most sense to me- season W/L is most predictive towards the end of the season as the sample size increases.

This is by far the most interesting part of the data, and the author never addressed it.

As I understand it, the data shows that teams that overperform their expected PyWins early are rather likely to maintain their win %.

On the other hand, teams that underperform early are very unlikely to maintain their low win %.

If this data is accurate, it appears there is a *vast* gulf in the meaning and predictive value of Pythagorean wins depending on whether a team is winning more or less than expected.

This seems like the result we should have expected, though. Teams that are overperforming respond to their record, not their pythag, so they are more likely to go make trades to bolster their roster, remove struggling players, and take actions that help the team continue to play well.

However, underperforming teams don’t just give up and sell off the farm – the bias is towards trying to win, so they’ll look at their run differential and note that they’re better than they’ve played, and sustain hope based on that. Look at the Cardinals, for instance – they didn’t become sellers simply because they’re way underperforming their pythag.

Underperforming teams regress more heavily to the mean because they’re less likely to make changes that create a self-fulfilling prophecy. Win% causes teams that are overperforming to make significant positive changes to their rosters in a way that it doesn’t cause underperforming teams to make significant negative changes.

You very strongly assert that the explanation which Bill more properly throws out as a possibility is true.

This strong assertion is not justified without proof.

You don’t think it is self-evident?

No, Jason, it is most assuredly not self-evident. I think you need to look up self-evident in a dictionary.

I would add, in response to Cliff, that the reason for less correlation between August and ROS vs. July and ROS has to do with mismatched sample sizes. By the end of August, there are only 30-ish games left, meaning more variance in results and lower correlations. The highest correlations should theoretically occur when neither sample size is very low.

I have several questions but will limit it to two for now.

1) Could someone help me (a liberal arts guy who has slowly tried to teach himself some basic statistical concepts) understand what is meant by “both [types of winning percentage] explain more than *27% of the variance* in ROS winning percentage.”?

2) What teams were included in the “teams that outperformed their Pythagorean expectation by five or more wins through the end of each month”? Was it any team that at any point in the season ended a month 5 games up? Or was it only teams that ended every month 5 games up (through August)? If it was the latter, that would seem to significantly bias the sample in such a way as to overstate the predictive power of actual winning percentage.

It’s clear that a complete picture of a team’s ability should include run distribution and how the Runs/RA are distributed over 162 individual games. Having a great, shutdown bullpen appears to be one way to repeatedly beat out Pythag expectations, if the manager makes the right decisions on a game-by-game basis.

I’ve been wondering for a while now if we could use standard deviation to attach a sort of “confidence value” to Pythagorean %, or maybe do something like take out a certain number of outlier games to get an adjusted Pythagorean %, or even cap runs scored and runs allowed at a certain value to see if that correlates better. But I guess all that extra work would defeat the quick and dirty part of Pythagorean Win %.

I would have done it by now, but I honestly don’t know where to find good data. Which brings me to my second (albeit unrelated) point: Fangraphs should do a series on how to conduct sabermetric research (or if it has been done, disregard and/or mock me).

In theory there exists a better Pyth formula which incorporates the standard deviation of RS and RA per game, not just the totals (or per game average). It’s easy to show that a high standard deviation of RS causes you to underperform your Pyth and a high standard deviation of RA causes you to overperform, for the obvious reasons that every extra run you score or allow in a game has less influence on your chances of winning.

The problem is that the standard deviations are not widely available, so anyone who wants to try to come up with such an improved Pyth has to do a ton of preparatory work. I’d love to see the standard historical team records include RS/27 outs and RA/27 outs (rather than per game, which removes the sometimes significant noise of having played one or more marathon games) and the standard deviations of both (you normalize every game to 27 outs to get the SD).

Amazing article. My fav (non Wendy) in a while.

The Orioles certainly have an effective bullpen this year, but has anyone thought about the combination of their total HRs and bullpen ERA being a driver of their better record in one run games, and thus outperforming their run differential? They rank in the top 5 in bullpen ERA and total HRs, and in the top 2 in both categories after the 7th inning.

Just curious.

Where are the error bars? Having the difference between the curves be so small could be accounted for by error. Maybe there is no real difference between the two.

Would certainly like to see some more information in regards to the effects of the bullpen on run diff/actual winning % differences.

As an Indians fan, I remember an interview with Manny Acta earlier this year where he was asked if his poor run diff was concerning, and he said it was but that it was blown out of proportion because the games that the Indians have won tended to be close games that they handed over to their 3 best relievers.

This makes sense. It’s been pretty well-documented that having a well-managed (and/or top-heavy) bullpen will help you outperform your Pythag. What you could be seeing in your data is the point at which a team’s “bullpen management ability” stabilizes.

One other thing to keep in mind is that blowouts have a large impact on Pythag% whereas there is no differentiation when it comes to actual winning %. It’s also possible that you’ve identified the point where blowout wins/losses become less relevant, although this could be due to any number of factors.

I challenge your opening statement and especially question the addition of the purely valuative term “well-managed.” (Again, I smell a circular argument.)

Please provide a source of your “pretty well-documented” assertion.

Well managed means (I presume) that the best pitchers pitch in the highest leverage situations, while a poorly managed bullpen would have worse pitchers pitching in those high leverage situations. In a vacuum this may be just as important as actual talent in the bullpen, as (to give an extreme example of poor management) a world class closer is worthless if used primarily in blowouts while a replacement level long man pitches in key spots. Obviously no manager is quite this stupid (although the handling of Alfredo Aceves as the Red Sox closer when almost every pitcher in that pen is similar or better is along those lines) and thus the differences are less extreme, but bullpen management still looks like it has a (relatively) significant effect.

The idea that bullpen effectiveness (good talent used in the most important situations) will help you win close games seems extremely logical to me, and as Bill posted in response to a previous comment of yours, there is a fairly substantial correlation. A team with Mariano Rivera will hold more one run 9th inning leads than a team trotting out this year’s edition of John Axford or Heath Bell, but be no more effective in games mostly decided before the 8th and 9th innings.

I remember reading quite a bit around Pythag records about five years ago, when the Diamondbacks dramatically outperformed their run differential. http://www.hardballtimes.com/main/article/no-mirage-in-arizona/

If you need more evidence to peer-review my comment thread, there’s this thing called “Google” you might want to check out.

“Seems extremely logical to me” and “Google it” are extremely poor replies.

Things which seem logical to somebody are about as likely to be false as true, and somebody making an assertion on evidence that he claims exists but cannot or will not cite is responsible for supporting it, not for asking somebody else to find it for him.

See the replies to me by Bill and Eric above to find a more helpful form of discourse.

This math is very wrong.

Interesting article but the second chart is surprising.

It simply makes no sense that you could use a team’s winning percentage or pythag winning percentage at the end of May to predict end-of-season results better than those stats at the end of August. I suppose if they were roughly the same it’s possible, but there’s no way the May stats are twice as accurate as the August stats, especially when dealing with the outliers.

In contrast, the third chart shows the results you’d expect. As the season goes on, winning and pythag percentage have stronger predictive power.

Not quite. At the end of August, you’re trying to predict the results of one month. And a month is a lot more vulnerable to random variation than, say, two months.

You’re seeing the combination of two functions. One is that as the season goes on, win % and Pyth % gain more information and thus predictive power. The second, which is roughly the inverse of the first, is that you’re trying to predict a smaller and smaller sample. Even if you know the true talent of both teams EXACTLY, it doesn’t give you a 100% chance of predicting the outcome of a single game. Or two games. Or a week. At some point, you will reach a large enough sample size, but it looks like a month isn’t over that threshold.
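The second effect can be put in rough numbers. As a sketch, treat each game as an independent coin flip with a fixed win probability (a simplifying assumption, not something the article tests); the standard deviation of the observed winning percentage over n games is then sqrt(p(1-p)/n). The games-remaining figures are round numbers assumed for illustration.

```python
import math

def win_pct_sd(true_pct, games):
    """Standard deviation of observed W% over `games` independent games."""
    if games <= 0 or not 0 <= true_pct <= 1:
        raise ValueError("need games > 0 and 0 <= true_pct <= 1")
    return math.sqrt(true_pct * (1 - true_pct) / games)

# Approximate games remaining after each cutoff in a 162-game season
# (round figures assumed for illustration, not taken from the article).
cutoffs = [("end of May", 110), ("end of June", 85),
           ("end of July", 58), ("end of August", 30)]

for label, remaining in cutoffs:
    sd = win_pct_sd(0.500, remaining)
    print(f"{label}: about {remaining} games left, SD of ROS W% = {sd:.3f}")
```

Under these assumptions a true .500 team's rest-of-season percentage has a standard deviation of roughly .09 with 30 games left, nearly double what it is with 110 games left, so the target being predicted gets noisier even as the predictor gets better.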

Let’s say you’re correct. And your explanation certainly explains the first graph well.

Still, in that case, how do you explain the third chart? According to that chart, the model is four times as accurate in August as in May. If the small sample size is the reason why the August numbers are lower than the May numbers for overachieving teams, then that should continue for underachieving teams.

But I don’t think your explanation is totally correct. If it was, then we’d see that the May results have a higher predictive power than the August results for all of the graphs. We only see it in the second one. But certainly, your explanation is why the first graph looks the way it does.

As for me, I think the third graph is more likely to be how the data should look because in the second and third graph we’re dealing with outliers. With time, these outliers should regress to the mean. Hence, it’s less accurate for May and more accurate for July or August despite the small sample size.

I thought it was well-accepted that pythag w-l was a better indicator of future success than actual w-l. Doesn’t the first graph contradict this? What do we mean then by “better indicator”?

Considering that your result shows W-L to be as good as or better than Pythagorean pretty much all the time–which would be a pretty big overturning of the sabermetric consensus–I went back and pulled the “end of June” numbers for the same years from b-r.com and checked your work. I got a .499 correlation for Pythagorean and a .493 for W-L, which doesn’t agree with the result in your first table.

You want to share work and make sure you didn’t make a mistake somewhere?

I’d like to second this request. I read this, and had the same concerns that Matt P. outlined. It seems highly likely that these data were wrong. Even if they are correct, the analysis does not plausibly explain the results.

That’s odd–have re-run the numbers a few times, and no issues. Here’s the data I used for each month so certainly have at it: https://docs.google.com/spreadsheet/ccc?key=0AmiN6Mg98wY1dGZUSUJLeVh2QXZzVHBueFo5LWtsNkE

So we are on the same page, if we look at June I took the actual and pythag W% through June 30th for all teams in all seasons from 2001 through 2011 (n=330) and correlated them to the rest of season W% (so, July 1 through end of year).

Certainly let me know if you spot something.

We used the exact same data, but a slightly different method:

https://docs.google.com/spreadsheet/ccc?key=0AkZiNPqWtE9JdDMxeWJCc0NxZEdwM3BYQWc1dW56eXc

I correlated the end of June W-L W% and the end of June Pythagorean W% with the ROS W%. In other words, I had three fields for each team for each year, and I just ran the CORREL function on the first two against the latter.

Would it make any sense to add leverage to the run differential equation? For example, the reader who brought up the Blue Jays inconsistent offense. Maybe instead of using just runs scored vs. runs allowed, you use only runs scored or allowed while the game is within, say, a 5 run difference. This is crude I know, but I am just using it as an example.

The point I am trying to make is, shouldn’t we discount runs scored or allowed in a blowout? Someone explain to me why this wouldn’t work better.

I haven’t seen it, but surely someone has done this before, similar to removing “garbage time” when analyzing football games. It would make sense, and it could be done using a win probability threshold of 98% or something.
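A crude version of this idea can be sketched without play-by-play data by capping each game's final margin before summing runs. This is not the in-game filter described above (which would need inning-by-inning scores or win probability); the cap of 5, the exponent of 2, and the function names are all assumptions for illustration.

```python
def capped_run_totals(games, max_margin=5):
    """Sum runs scored/allowed after trimming each game's margin to max_margin.

    games: iterable of (runs_scored, runs_allowed) final scores.
    The winner's total is reduced so that no single game contributes
    more than max_margin runs of differential.
    """
    total_rs = total_ra = 0
    for rs, ra in games:
        if rs - ra > max_margin:
            rs = ra + max_margin
        elif ra - rs > max_margin:
            ra = rs + max_margin
        total_rs += rs
        total_ra += ra
    return total_rs, total_ra

def pythag(rs, ra, exponent=2):
    """Pythagorean expectation with an assumed exponent of 2."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)

# Hypothetical week: one 12-1 blowout win followed by four one-run losses.
week = [(12, 1), (2, 3), (3, 4), (1, 2), (3, 4)]
raw = (sum(g[0] for g in week), sum(g[1] for g in week))
capped = capped_run_totals(week)

print("raw totals:", raw, "expected W%:", round(pythag(*raw), 3))
print("capped totals:", capped, "expected W%:", round(pythag(*capped), 3))
```

In this made-up week the team went 1-4, and trimming the blowout pulls the expected winning percentage down toward that record. Whether capping improves prediction on real data is exactly the open question the commenters raise.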

Problem I have with the whole Pythag concept is that it makes an assumption of relatively even distribution of runs both scored and against. Now, I’m willing to accept the assumption about runs scored. Runs against is much more problematic, however, since the distribution of runs per game is going to be much more dependent on the performance of individual pitchers. That’s fine if you’re talking about a relatively similar level of talent in the starting rotation, but if you have two or three excellent pitchers and a couple of scrubs rounding out the five it begins to break down.

But you’d expect runs scored to vary based on pitcher faced…

Bill, can you do a set of multiple linear regressions with ROS% as the dependent variable and Pyth Win % and Pyth Diff % (i.e., Actual Win % – Pyth Win %) as the independent variables? The coefficients (if both prove to be significant) will tell you how much of the differential is predictive. It’ll be especially interesting for end-of-August for the overperforming teams.
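The regression proposed here can be sketched in plain Python. The data below are synthetic, built from known coefficients so the fit can be checked; nothing comes from the article's spreadsheet, and the sketch does not compute standard errors, so it says nothing about significance. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def fit_two_predictors(pyth_pct, pyth_diff, ros_pct):
    """Least-squares fit of ros = b0 + b1 * pyth_pct + b2 * pyth_diff."""
    rows = [(1.0, p, d) for p, d in zip(pyth_pct, pyth_diff)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ros_pct)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Synthetic team-seasons generated from known coefficients (0.25, 0.5, 0.2).
pyth = [0.45, 0.50, 0.55, 0.60, 0.48, 0.52]
diff = [0.02, -0.03, 0.01, -0.02, 0.04, 0.00]   # actual W% minus Pythag W%
ros = [0.25 + 0.5 * p + 0.2 * d for p, d in zip(pyth, diff)]

b0, b1, b2 = fit_two_predictors(pyth, diff, ros)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

On real data, a coefficient on the differential near zero would suggest the gap between actual and Pythagorean records carries little predictive information, while a coefficient close to the one on Pythagorean percentage would suggest the gap is about as informative as the run-based estimate itself.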

I think this is a really interesting study. I have been fascinated by the Pythagorean win formula ever since Bill James proposed it in the 1980s, and was amazed when Steven Miller proved its mathematical foundations in 2006. Like commenter Chris said, we have assumed that Pythag % is a better predictor of future performance than actual %, so someone should look at that.

I was perplexed by the numbers for May for teams underperforming or overperforming by at least 5 wins. how could r change by a factor of 3 between those two cases? so i ran the numbers. i calculated (actualW)-(pythW) at the end of May for all teams 2001-2011. i only found 4 teams over and 6 teams under. Bill, or someone, check my numbers, is that right? that would be too small a sample to use.

all this analysis got me thinking. i think we are trying to determine which has a bigger impact, current actual% or current pythag%. so then we should be looking at the slope of the regression line, not r. because, for example, if r is around 1, but the slope is not, then improvements in the current% lead to guaranteed small improvements in future%. however, if the slope is around 1, but r is not, then improvements in the current% lead to large improvements in future%, but with variation so they are not guaranteed. and thats the question i feel like i want to answer, how valuable is an extra point of win pct?

here are my results. using actual% through may to predict actual% after may, y=.3881x+.306 with r=.4226. using pythag% through may to predict actual% after may, y=.4376x+.281, with r=.4271

So the correlations are pretty similar, but current pythag% is more impactful than current actual% to predict future performance. (its true, if the correlations were not similar, that would require investigation, but fortunately thats not what happened.)
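To illustrate the difference between slope and r, the two fitted lines reported above can be applied to hypothetical inputs. The equations are the commenter's; the .500 and .600 inputs and the names below are assumptions.

```python
def predict_ros(current_pct, slope, intercept):
    """Predicted rest-of-season W% from a fitted line y = slope * x + intercept."""
    return slope * current_pct + intercept

# Fitted lines as reported in the comment above (through the end of May).
ACTUAL_LINE = {"slope": 0.3881, "intercept": 0.306}
PYTHAG_LINE = {"slope": 0.4376, "intercept": 0.281}

for pct in (0.500, 0.600):
    from_actual = predict_ros(pct, **ACTUAL_LINE)
    from_pythag = predict_ros(pct, **PYTHAG_LINE)
    print(f"input {pct:.3f}: actual-based {from_actual:.3f}, "
          f"Pythag-based {from_pythag:.3f}")
```

Moving the input from .500 to .600 raises the prediction by about .039 on the actual line and about .044 on the Pythagorean line, which is the sense in which the Pythagorean slope is steeper even though the two correlations are nearly identical.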

data (calculations and charts at the bottom):

https://docs.google.com/spreadsheet/ccc?key=0AnK8M8mCzkDpdG55OTJaWGpFME5ISm4xZ3hDRVF0SWc

Your numbers are right (although rounding would allow us more observations). Still, the correlation is minimal using this method so I don’t think the author used it.

I got similar results to your calculations in the second part. Indeed, Pythag seems to have a larger influence. But I’m not sure the difference is so large.

I was confused enough by these results that I decided I’d see if I can replicate them.

For the First Chart, our numbers are nearly exactly the same. That doesn’t surprise me as those numbers were what I’d expect.

For the other Two Charts, our numbers are different. For teams on pace to overachieve by five or more wins, I had the following results:

End of May: WPCT R=.404, PYTHAG = .44494

End of June WPCT R=.519, PYTHAG = .538

End of July WPCT R= .451, PYTHAG = .4389

End of Aug: WPCT R= .59, PYTHAG = .556

For underachieving teams, I had:

End of May: WPCT R = .438, PYTHAG = .442

End of June: WPCT R = .452, PYTHAG = .473

End of July: WPCT R = .458, PYTHAG = .484

End of Aug: WPCT R = .634, PYTHAG = .6545

I guess it’s reasonable to presume that I misunderstood what you meant by a team overachieving by 5 wins or more. I presumed that it meant a team on pace to do so (i.e. 5 out of a full season). If you meant that the team already was 5 wins above its projection (i.e. 5 out of the games already played), this would explain why we had different results.

However, if you did do it like this, you would have a very small sample size especially in May. It’s questionable whether the data would be statistically significant.

There are different ways to be better than one’s expected winning percentage, right? One is to win a disproportionate number of close games, and the other is to be blown out more often than other teams. Could these two types of “overperforming” teams be separated into groups? I have neither the time nor the know-how to do this, but it seems worthwhile. Just spit-balling here.

Anyway, the sample sizes above seem both flawed AND not large enough, a deadly combo. It seems to me that teams only slightly over or under-performing relative to their Pythagorean winning percentages are not instructive here, due to the relatively large amount of statistical noise. I don’t think they tell us much about the Orioles, a team that is greatly outperforming its Pythag winning percentage.

So the Orioles are more likely to play like a 67-57 team than a 56-68 team from here on out? I’d bet against that.

then, of course, is “The Buck Factor” weighing in…

I’m not sure those August correlations are even statistically different in the first graph. It would be tricky to test since the two paired samples are not independent, but a 95% confidence interval for the 0.49 correlation in the first graph is (.404 to .567), which easily includes 0.46 (the pythag correlation).
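An interval like the one above is commonly obtained with the Fisher z-transformation; whether the commenter used exactly this method is an assumption. A sketch with n = 330, the article's stated sample size:

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z.

    z_crit=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)            # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = correlation_ci(0.49, 330)
print(f"95% CI for r = 0.49, n = 330: ({low:.3f}, {high:.3f})")
print("contains 0.46:", low < 0.46 < high)
```

The result lands within about a thousandth of the quoted bounds. As the comment notes, this treats the correlation on its own and does not account for the two predictors being measured on the same teams.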

For the second two graphs, I don’t know how much the sample size is reduced, but in any case the variance is even greater, and correlation differences of 0.06 or even 0.1 are probably not significant.

Not that this detracts from the point. Basically it just says that it can’t be determined which is a better predictor in isolation–win% or pythag%–and that’s good to know. I feel like multiple regressions with some additional independent variables such as bullpen ERA (or FIP or whatever) and some indicator variable for “type of deadline moves” categorized as “seller”, “buyer” or “neutral,” could be enlightening.

Thanks!