Of runs and wins

The Orioles caused a bit of consternation on the Interwebs this season. It’s been an article of faith from the Beginning of Sabermetrics (approximately 1980, or 1 A.J. (After James)) that teams tend to win relative to how well they outscore their opponents. If they score a lot more runs than they allow, they win a lot of games. If they don’t, they don’t.

The Orioles, however, won 93 games and lost only 69 last year, despite scoring only seven more runs than they allowed. This wasn’t a record for outperforming a team’s run differential, but it was close. As the season progressed, saberists kept insisting it couldn’t continue. Yet it did, right up until the end of the season. Anti-saberists seemed to enjoy a certain amount of schadenfreude when the O’s made us sabermetric types look silly. O’s fans, of course, were delighted regardless.

So what are we to think? Is this Article of Sabermetric Faith wrong? Are we fooling ourselves when we pay too much attention to runs scored and allowed, and not enough to basic wins and losses?

To answer these questions, let’s go back to the basics. What’s more, let’s go back to the data.

I decided to compare two consecutive months of a team’s win/loss record to each other, within a specific season. By comparing in-season months, I mostly avoided the hassle of personnel turnover and that sort of thing. To make sure I had enough data, I collected all teams and months from 1970 through 2012, a total of 5,747 team/month/year combinations.

Next, I grouped the teams into 21 buckets, based on how they performed in the first, baseline month. (A winning percentage of .000 to .050 was bucket 0, .050 to .100 was bucket 1, and so on. Bucket 19 included winning percentages between .950 and 1.000; bucket 20 was a winning percentage of exactly 1.000.) Some of these months consisted of only one or two games, so I weighted each team/month comparison by the lesser number of games in the two months (if there were 25 games in the baseline month but only one in the next month, the stats were prorated as if there were only one game in both months). That way, I didn’t have to exclude any months with an arbitrary games cutoff, and I also didn’t have to worry about those pesky strike years.
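In code, the bucketing and weighting scheme might look something like this. This is a minimal sketch of my own; the function names are illustrative, not from the original study:

```python
def bucket(win_pct):
    """Map a winning percentage to one of 21 buckets, each .050 wide.
    A perfect 1.000 month gets its own bucket, 20."""
    return 20 if win_pct == 1.0 else int(win_pct / 0.05)

def pair_weight(games_month1, games_month2):
    """Weight each two-month comparison by the lesser game count,
    so short months count less instead of being excluded."""
    return min(games_month1, games_month2)
```

Note that with this numbering, the .200-.250 group is bucket 4 and the .750-.800 group is bucket 15, which matches the column labels in the bonus table at the end of the article.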

For the following analysis, I included only groups with a baseline winning percentage (the percentage in the first month) higher than .200 and lower than .800—12 groups representing 6,546 team/month/year combinations in all.

You may not have followed all that. It’s okay. You’ll get it when we dive in.

Regression toward the mean

Before we get to the runs scored and allowed thing, we have to do something else. We have to regress toward the mean. Everything regresses to the mean, as someone once said. The Mariners win 93 games after winning 116. Norm Cash hits .243 after hitting .361. I write a lousy column after writing a great one. Sometimes.

The basic rule is this: Before you compare two things to each other, make sure you regress them toward the mean first.

In the table below, you’ll see that all 12 groups regressed toward .500 in the second month of comparison. Teams in our lowest group, for instance, had a composite .224 winning percentage in the baseline month. In the next month, their winning percentage jumped up to .422. At the other end of the standings, teams that averaged .762 in the baseline month had a .545 winning percentage in the next month. There remained a difference between the groups, but it wasn’t nearly as extreme.

That is as stark an example of regression toward the mean as you’re likely to see today. Here are the data for all 12 groups.


From…  To…    Number  Avg. Win%  Next Win%
.200   .250       49       .224       .422
.250   .300      191       .277       .461
.300   .350      363       .329       .446
.350   .400      610       .376       .468
.400   .450     1041       .426       .481
.450   .500      835       .472       .496
.500   .550     1202       .519       .506
.550   .600      931       .571       .516
.600   .650      675       .621       .531
.650   .700      374       .670       .543
.700   .750      195       .719       .557
.750   .800       80       .762       .545

Teams in every category—above-average teams and below-average teams—moved closer to .500 in the second month. Much closer. In fact, these findings lead us to a useful rule of thumb:

If the only thing you know about a team is its winning percentage in a single month and you want to predict how it will perform in the next month, add these two things:

- 25 percent of its winning percentage in the first month, and
- .375 (which is 75 percent of a .500 record).

In made-up technical English, a team will regress 75 percent toward average in its second month.
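As a one-line sketch, the rule of thumb reduces to this (the function name is mine, purely for illustration):

```python
def project_next_month(win_pct):
    """Rule of thumb: next month's winning percentage is about
    25% of this month's record plus 75% of .500 (= .375)."""
    return 0.25 * win_pct + 0.375
```

For the lowest group in the table, this projects .25 × .224 + .375 = .431 for the second month, close to the observed .422.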

Now let’s talk about that runs scored and allowed thing.

Pythagorean variance

Here’s what I did next. I took the number of runs scored and allowed per game for each team in its baseline month and used those data to calculate its pythagorean record. The pythagorean record, which is an estimate of a team’s record based on its runs scored and allowed, was developed by Bill James around 1 A.J. The basic formula is RS^2/(RS^2+RA^2), where RS means runs scored, RA means runs allowed, and the caret (^) means “raised to the power of”; here, squaring each number. You may remember a similar formula from your geometry class.

I varied the “squared” part of the formula for each grouping, based on the run environment of each team. It adds a little more precision to the mess.
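Here’s a minimal sketch of both pieces. The article doesn’t spell out the exact run-environment adjustment used, so the Pythagenpat-style exponent below is an assumption on my part, not a quote of the study’s method:

```python
def pyth_win_pct(rs, ra, exponent=2.0):
    """Bill James' pythagorean estimate: RS^x / (RS^x + RA^x)."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)

def run_environment_exponent(rs, ra, games):
    """One common way to tie the exponent to run environment
    (the Pythagenpat form). Assumed here; the article only says
    the exponent was varied by run environment."""
    return ((rs + ra) / games) ** 0.287
```

With the plain squared formula, the 2012 Orioles (712 runs scored, 705 allowed) come out to about a .505 record, roughly 82 wins over 162 games, against their actual 93.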

So how do we factor the pythagorean record into our regression? Well, first I created a second group (a subgroup, if you will) of teams, based on their pythagorean record in the baseline month. For instance, if a team’s pythagorean record was better than its actual won/loss record by two games (this would be an “unlucky” team in standard sabermetric parlance), it was placed in group 2. If its won/loss record was two games better than its pythagorean record (a “lucky” team), it was placed in group -2 (negative two). The number of the group represents the pythagorean difference from reality in games won.
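A sketch of how a team might land in a subgroup, assuming the basic squared formula; the rounding rule is my guess at the bookkeeping, since the article only says groups are defined by the difference in games:

```python
def pyth_diff_in_games(wins, losses, rs, ra):
    """Pythagorean wins minus actual wins, rounded to the nearest
    whole game. Positive = "unlucky" (underperformed its runs);
    negative = "lucky" (outperformed its runs)."""
    games = wins + losses
    pyth_wins = games * rs ** 2 / (rs ** 2 + ra ** 2)
    return round(pyth_wins - wins)
```

So a 10-15 team that outscored its opponents 110 to 100 over those 25 games would sit in group 4: unlucky by about four games.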

Once again, looking at the data output may help you understand what I did. In the table below, I regressed each group’s winning percentage toward the mean (using the 75 percent rule) to predict how it would perform in the second month. Then I broke them into pythagorean subgroups to see how each type of “lucky” team performed relative to its regressed projection.

Bottom line: The higher a group’s pythagorean difference, the more they outperformed their projected record in the second month. The results are dramatic.

Pyth Diff Projection Diff
-7 -.125
-6 -.064
-5 -.057
-4 -.078
-3 -.024
-2 -.008
-1 -.003
0 .012
1 .003
2 .018
3 .015
4 .020
5 .046
6 .235

To pick one example, teams that were three games better in their pythagorean “runs record” than their actual record beat their regressed projection in the second month by .015 points of winning percentage. Their runs scored and allowed in the baseline month made an impact on the outcome of the second month.

What I’m saying is that baseball analysts are still right. Runs scored and allowed still matter. In fact, if you know nothing about a team except how it performed in one month, add these two things:
- 30 percent of its pythagorean record in the first month, and
- .350 (which is 70 percent of a .500 record).

If you know these two things, knowing the team’s actual won/loss record won’t help you one bit. You’ll do a better job of predicting a team’s future if you ignore its actual won/loss record and just use its “runs record.”
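The runs-record rule of thumb, as a one-line sketch to match the earlier one (again, the name is mine):

```python
def project_from_runs(pyth_wpct):
    """Prediction using only the runs record: 30% of the
    pythagorean winning percentage plus 70% of .500 (= .350)."""
    return 0.30 * pyth_wpct + 0.350
```

A team playing .600 ball by its runs record projects to about .530 the next month, whatever its actual won/loss record says.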

When James first wrote about these things, he identified a number of sabermetric forces. One was called the Plexiglass Principle, which today we call regression toward the mean. The other was called the Johnson Effect, his name for the tendency of teams with extreme pythagorean variances in one year to relapse in the next year.*

*This shows how far we’ve fallen as sabermetric writers. No one coins terms better than James does.

The problem these days is that people tend to throw the two forces together. When people say that teams tend to “fall back” toward their pythagorean record, they’re really combining the two ideas. Pythagorean variance has become sort of a lazy man’s regression term. Today, I’ve tried to separate the two more distinctly for you.

Still, there are more questions. There are always more questions, aren’t there? What if we have multiple months of a team’s record? Is there a point at which its actual won/loss record is more important than its runs record? Is it at two months? Three? Four? Are things different these days than they used to be? Has bullpen usage changed things at all? Do season-to-season effects still hold?

I’ll be back.

Bonus Table

Wow. I’m impressed you’re still here. As a bonus, I’m going to break out both types of groups in the following table. The top row lists the winning percentage groups (from the .200 level to the .800 level, by .050) and the left column lists the pythagorean difference subgroups. The data in the table are the differences between each group’s actual record in the second month and its projection based on simple regression to the mean.

Observe and enjoy.

Pyth Diff 4 5 6 7 8 9 10 11 12 13 14 15 Total
-7 -.125 -.125
-6 -.104 -.034 -.053 -.064
-5 -.046 -.003 -.062 -.036 -.081 -.015 .005 -.220 -.057
-4 -.316 -.162 -.008 -.024 -.068 -.054 .005 -.015 -.058 -.078
-3 .042 -.088 -.028 -.049 -.036 -.026 -.014 -.014 .007 -.030 -.024
-2 .056 -.063 .005 -.022 -.028 -.008 -.015 -.009 -.010 -.013 .019 -.008
-1 -.017 -.010 -.002 -.011 -.016 -.001 .000 -.004 .008 .010 .013 -.001 -.003
0 -.023 .006 -.013 -.002 .007 .008 .009 .023 .013 .013 .036 .072 .012
1 -.025 .017 -.003 .011 .007 .017 .017 .013 .023 -.028 -.016 .003
2 .030 .020 -.016 -.016 .031 .031 .002 .009 .006 .084 .018
3 -.009 .025 -.014 .051 .008 .069 .034 -.005 -.028 .015
4 -.051 .087 -.010 .068 .005 .020
5 .056 .036 .046
6 .235 .235
Total -.006 .055 -.010 -.026 -.029 .004 -.007 -.012 -.015 .006 -.016 -.039 -.009

Feel free to ask questions, point out faulty logic and generally make fun of my math in the comments below.

References & Resources
Here’s a very mathematical examination of why the Pythagorean Formula works.

All data courtesy of the spectacular folks at Retrosheet.

Dave Studeman was called a "national treasure" by Rob Neyer. Seriously. Follow his sporadic tweets @dastudes.

Very good article.

For those who didn’t read it, the TL;DR version is “outliers exist and their existence does not disprove math.”


“The other was called the Johnson Effect, which is what he called it when teams that had extreme pythagorean variances in one year tended to relapse in the next year.*”

Funny, the effect of Johnson was part of why the O’s outperformed their pythagorean record last year.

Bullpen construction can skew results from pythagorean expectations.


I don’t think there is evidence that bullpen construction can skew results from pythagorean expectations. Bullpen production certainly can.

For the details on how the Orioles’ bullpen did, read my article in this:


Howard Mahler

Another way to state your result is:
one month of won-loss records has about 25% credibility for predicting the next month (for the same team.)
One month of runs scored and allowed has about 30% credibility for predicting the next month won-loss record.
“Credibility” is being used as per credibility theory used by actuaries.

Bojan Koprivica
Great topic and very well done, Dave! When you regressed the first order wins to the mean, how did you go about it? Did you first convert runs to wins and then regress, or did you separately regress runs scored and runs allowed to the mean first and then converted them to wins? Also, I guess the amount of regression you used in each example was based on the variance, right? Trying to find where standard deviation of luck (standard binomial distribution) equals that of talent? I’ve been fooling around a little in the last few days with second order… Read more »

Thanks, Bojan. I converted runs scored and allowed to wins via the Pythagenpat formula, then regressed those against actual wins using a simple minimum standard error approach.

I’m sure I could approach the math in a more sophisticated way.  Do you have any specific suggestions?  (Happy to discuss over email if you’d like)