## Improving WPS

“All happy families are alike; each unhappy family is unhappy in its own way.” — L. Tolstoy

You can say something similar about baseball games. All boring games are alike; but exciting games are interesting in their own ways. Every boring game has one team building up a big early lead, which is never threatened. But there are many ways to have an exciting game: the pitcher’s duel, the slugfest, the late-inning comeback, extra innings, all in various combinations. And in between them are the bulk of games that are simply ordinary.

All of which makes ranking exciting games a tricky process, at least compared to ranking how boring they are. How does one compare Game 7 of the 1991 WS (1-0 in 10 innings) to Game 4 of the 1993 WS (15-14 in 9 innings) on the same scale? They’re great in different ways.

Back in 2005 I created a system to do just that, a rating system based simply on the runs scored in the line score. I may have been the Christopher Columbus of that new world. And ranking the games allows you to rate post-season series-es.

The line-score system did work in the sense that it could tell the difference between a great game and a good one, and between a good one and an ordinary one. But while the line score gives you the basic outline of the game, it is blind to the details of what happens DURING each inning. Zero runs scored in the top of the 1st rates exactly the same whether there were three pop-ups or three singles followed by a triple play.

Eventually I realized that Baseball-Reference.com (ALL HAIL BBREF) has the play-by-play data for all playoff games, which includes a probability of victory after each play (anything that changes the outs, baserunners or score). Plotted, you can easily see if a game was good: it looks like an earthquake. If it was bad, it looks like the EKG of a corpse. Using those probabilities, we can create a much more accurate game rating. I fiddled with many rating schemes over the last 10 years before settling on one that seems both conceptually simple and that yields reasonable results.

Of course, by then I had been beaten to the basic concept by Dave Studeman (WPA) and Shane Tourtellotte (WPS). Twelve years is too long for laurels resting.

WPA = Sum(change in probability between plays)

Modified WPS = Sum(change in probability between plays) + top three plays + Final play
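To make the two formulas concrete, here is a minimal Python sketch. It assumes that "change in probability" means the absolute size of each swing (signed changes would largely cancel out), and that the input is the win probability after each play. The function names and the sample numbers are made up for illustration; they are not BBRef fields or real game data.

```python
from typing import Sequence


def play_swings(win_prob: Sequence[float]) -> list[float]:
    """Absolute change in win probability from each play to the next."""
    return [abs(b - a) for a, b in zip(win_prob, win_prob[1:])]


def base_wpa(win_prob: Sequence[float]) -> float:
    """Base rating: the plain sum of all swings."""
    return sum(play_swings(win_prob))


def modified_wps(win_prob: Sequence[float]) -> float:
    """Base sum, plus the three biggest swings, plus the final play."""
    swings = play_swings(win_prob)
    top_three = sum(sorted(swings, reverse=True)[:3])
    last_play = swings[-1] if swings else 0.0
    return sum(swings) + top_three + last_play


# Hypothetical five-play game (made-up probabilities, not a real game).
probs = [0.50, 0.55, 0.40, 0.45, 0.20, 1.00]
print(round(base_wpa(probs), 2))      # about 1.30
print(round(modified_wps(probs), 2))  # about 3.30
```

Note how the closing 0.80 swing is counted three times in the modified version: once in the base sum, once as a top-three play, and once as the final play.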

What I have developed is similar to their work, but I think it has some small advantages. Generally, my ratings will be quite close to Shane’s (R-squared > 99.5%). He correctly realized that simply summing the probabilities doesn’t *quite* work, which is why he modified it. An example…

There are seven post-season games with a WPA of exactly 4.52. Among them are:

**1995 NLCS Game 2**

Braves beat the Reds 6-2 in ten innings.

95 plays, 13 plays changed the odds by at least 10%

Top play: a Mark Portugal bases-loaded wild pitch, +18%

70 plays with the odds in the 30% to 70% range

*compared to*

**1960 WS Game 7**

Pirates 10 Yankees 9 in nine innings

77 plays, 15 plays changed the odds by at least 10%

Of those 4 changed the odds by at least 20%

Of those 3 changed the odds by at least 30%

Of those 1 changed the odds by more than 50%

25 plays with the odds in the 30% to 70% range

There is simply no way those games are equal. The 1960 game has five different plays better than any play in the 1995 game. The 1995 game makes up the ground by (1) having 18 more plays and (2) having fewer plays where nothing happened, because the game was usually within one run.

1960 is still better because a +40% play is more exciting than two +20% plays put together. Bill Mazeroski’s game-ending homer rates as +37%. Bobby Richardson’s game-starting line-out rates at +2%. Straight summing makes a walk-off homer the equal of about 3 ½ innings with zero hits. NOPE. WRONG.

Shane accounted for this with his modified method. By counting the top three plays twice and Mazeroski’s walk-off homer three times, the ratings are now

1960: 6.49

1995: 5.19

And science prevails.

Of course, there is nothing magical about TOP THREE plays or LAST play. You could try using the top five plays and last five plays (believe me, I did). But I do think that using Top-3 + Last can sometimes lead you astray. I will now present exhibits A and B to demonstrate where it can swing and miss.

Exhibit A: 1988 WS Game 1

Exhibit B: 1985 NLCS Game 6

I expect you to know them. The two biggest home runs in terms of changing the odds in post-season history courtesy of Mr. Clark and Mr. Gibson.

1985: WPA 4.48 in 83 plays and 9 innings

1988: WPA 3.98 in 82 plays and 9 innings

The 1985 game had more action in nearly the same number of plays, which you can easily see in the line scores:

StL 0 0 1 0 0 0 3 0 3 (7)

LA 1 1 0 0 2 0 0 1 0 (5)

Compared to

Oak 0 4 0 0 0 0 0 0 0 (4)

LA 2 0 0 0 0 1 0 0 2 (5)

The ’85 game has a tie in the 7th, a broken tie in the 8th and a lead change in the 9th.

The ’88 game has a lead change in the 2nd and a lead change in the 9th.

Modified WPS says

1985: 4.48 + 1.34 + 0.01 = 5.83 (tied for 94th best game)

1988: 3.98 + 1.43 + 0.87 = 6.28 (tied for 58th best game)

I don’t think you can argue that the 1988 game is much better than the 1985 game; I don’t think it’s a better game at all. And it’s the last-play bonus that is to blame. Had the 1985 game been played in St. Louis then Clark’s homer would have been a walk-off and the game would have rated 6.56, well ahead of the 1988 game.
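The arithmetic behind those three ratings can be sketched in a few lines of Python. The helper name is made up for illustration; the inputs are the figures quoted in this article, using the 3.98 base value that the shorthand and tables further down give for the 1988 game, and 0.74 for Clark's homer.

```python
def modified_wps_from_parts(base: float, top_three: float, last_play: float) -> float:
    """Modified WPS = base sum + top three plays + final play."""
    return base + top_three + last_play


# Figures as quoted in the article.
game_1985 = modified_wps_from_parts(4.48, 1.34, 0.01)
game_1988 = modified_wps_from_parts(3.98, 1.43, 0.87)

# Hypothetical: same 1985 game, but Clark's homer (0.74) is a walk-off,
# so it becomes the final play instead of a 0.01 out.
walkoff_1985 = modified_wps_from_parts(4.48, 1.34, 0.74)

print(round(game_1985, 2))     # 5.83
print(round(game_1988, 2))     # 6.28
print(round(walkoff_1985, 2))  # 6.56
```

Nothing about the game itself changes in the third line; only the ballpark does, and that alone is worth 0.73 rating points.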

If you think about it, a last-play bonus is biased towards games won by the home team. If the home team loses, the last play will rarely amount to anything.

When the home team loses, the last play has been worth at least 20% only 23 times. When the home team wins, it has been worth at least 20% 122 times.

When the home team loses, it has been worth at least 30% only 11 times. When the home team wins, it has been worth at least 30% 96 times.

I also know this because I tried last play, last five plays, and last ten plays in trying to construct a rating system. I also tried top five plays, top ten plays, all plays over 10%, WPA – .03 per play (yielding the bizarre result of games with negative excitement).

Eventually I tried a simple power transformation on EVERY play. First, I tried summing the squares of the probability changes, like any good statistician would.

When I did that, the 1985 game rated 10th and the 1988 game rated 5th. That is the wrong order, and both games are just rated too high. Then I tried other powers… the Goldilocks approach, looking for the one that was just right.

| Power | 1985 NLCS G6 rank | 1988 WS G1 rank |
|---|---|---|
| 2.0 | 10th | 5th |
| 1.9 | 12th | 8th |
| 1.8 | 15th | 20th |
| 1.7 | 23rd | 25th |
| 1.6 | 32nd | 36th |
| 1.5 | 38th | 51st |
| 1.4 | 53rd | 76th |
| 1.3 | 61st | 104th |
| 1.2 | 79th | 133rd |
| 1.1 | 100th | 158th |
| 1.0 | 116th | 185th |

Everything above 1.7 was eliminated since it rated 1988 better than 1985
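The Goldilocks search can be sketched as a loop over candidate powers. This is only an illustration under stated assumptions: the two "games" below are made-up swing lists (one built around a single huge play, one around four good ones), and the helper names are hypothetical. The real search ran over every post-season game.

```python
def power_score(swings: list[float], power: float) -> float:
    """Sum of each play's swing raised to the given power."""
    return sum(s ** power for s in swings)


def rank_of(name: str, games: dict[str, list[float]], power: float) -> int:
    """1-based rank of a game when all games are sorted by power score."""
    ordered = sorted(games, key=lambda g: power_score(games[g], power), reverse=True)
    return ordered.index(name) + 1


# Made-up games: 40 quiet plays each, then one monster play vs. four good ones.
games = {
    "one_big_play": [0.02] * 40 + [0.80],
    "steady_drama": [0.02] * 40 + [0.25, 0.25, 0.25, 0.25],
}

for power in (2.0, 1.5, 1.0):
    print(power, rank_of("one_big_play", games, power), rank_of("steady_drama", games, power))
```

At a power of 1.0 the steady game wins on sheer volume; at 2.0 the single monster play swamps everything else. Picking the power is a matter of deciding where between those extremes the rankings look right.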

Here’s some shorthand I’m going to use:

Game 6 of the 1985 NLCS: STL 7, LA 5 in 9 innings — WPA 4.48 (9-4-2-1)

Game 1 of the 1988 WS: LA 5, OAK 4 in 9 innings — WPA 3.98 (5-2-2-1)

The 1985 game had 9 plays rated >= 0.1, 4 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5.

The 1988 game had 5 plays rated >= 0.1, 2 plays rated >= 0.2, 2 plays rated >= 0.3 and 1 play rated >= 0.5.
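For anyone who wants to build the four-number shorthand from a list of play swings, here is a minimal sketch. The function name is hypothetical, and the sample list is made up so that it happens to reproduce the 9-4-2-1 pattern; only the 0.74 top play comes from the real 1985 game.

```python
def shorthand(swings: list[float], thresholds=(0.1, 0.2, 0.3, 0.5)) -> str:
    """Count the plays at or above each threshold and join the counts with dashes."""
    counts = [sum(1 for s in swings if s >= t) for t in thresholds]
    return "-".join(str(c) for c in counts)


# Made-up swings (apart from the 0.74 top play), for illustration only.
sample = [0.74, 0.33, 0.25, 0.21, 0.15, 0.12, 0.11, 0.10, 0.10, 0.04, 0.02]
print(shorthand(sample))  # 9-4-2-1
```

The counts are cumulative, so a 0.74 play shows up in all four numbers.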

For a sense of scale, the average game is WPA 2.67 (4.89-0.88-0.33-0.03)

(You can check the examples listed below on BBRef to get more detail on each game)

Checking 1.7, both exhibits rated higher than

Game 2 of the 2017 WS: HOU 7, LA 6 in 11 innings — WPA 5.30 (10-5-3-0)

Game 1 of the 2015 WS: KC 5, NYM 4 in 14 innings — WPA 6.36 (16-3-1-0)

1.7 weights the big plays too much

Checking 1.6, both test games rated higher than

Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)

Game 6 of the 1986 NLCS: NYM 7, HOU 6 in 16 innings — WPA 5.80 (11-3-2-0)

1.6 weights the big plays too much

Checking 1.5,

the 1985 game rated higher than

Game 6 of the 1986 WS: NYM 6, BOS 5 in 10 innings — WPA 5.14 (16-3-3-0)

The 1988 game rated higher than

Game 4 of the 2001 WS: NYY 4, ARI 3 in 10 innings — WPA 4.58 (10-3-2-0)

1.5 weights the big plays too much, but it’s getting hard to find clear mistakes

Checking 1.4,

the 1985 game rated higher than

Game 3 of the 1976 NLCS: CIN 7, PHI 6 in 9 innings — WPA 4.72 (14-3-2-0)

Lead changes in the 7th, 8th and 9th innings.

The 1988 game rated higher than

Game 4 of the 1986 ALCS: CAL 4, BOS 3 in 11 innings — WPA 4.64 (7-4-2-0)

1.4 weights the big plays too much, but I’m now splitting hairs

Checking 1.3, I like this one. Let me check 1.2

Checking 1.2,

the 1985 game rated lower than

Game 2 of the 1996 ALDS: NYY 5, TEX 4 in 12 innings — WPA 5.02 (8-2-0-0)

Game 2 of the 1990 WS: CIN 5, OAK 4 in 10 innings — WPA 4.50 (10-2-0-0)

1.2 weights the big plays too little. Famous games are losing to games without any highlights.

**So, I think 1.3 is the sweet spot.**

My rating score = Sum((change in probability between plays)^1.3) * 2

The *2 at the end is purely cosmetic. It allows the very best game to score close to ten.
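In code, the whole rating is a one-liner. This is a minimal sketch that assumes each play's swing is taken as an absolute value; the function name and the two toy games are made up for illustration.

```python
def power_rating(swings: list[float], power: float = 1.3, scale: float = 2.0) -> float:
    """Sum of every swing raised to `power`, times a cosmetic scale factor."""
    return scale * sum(abs(s) ** power for s in swings)


# Two made-up games with the same base sum (1.0) but different shapes.
two_big_plays = [0.5, 0.5]
ten_small_plays = [0.1] * 10

print(round(power_rating(two_big_plays), 2))    # 1.62
print(round(power_rating(ten_small_plays), 2))  # 1.0
```

Same total swing, very different ratings: the game that concentrates its movement in a couple of big plays comes out well ahead.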

With base WPA, Gibson’s homer (.87) is worth about 25x a normal play (.035). With WPS it’s worth about 75x a normal play. Raising all the plays to the 1.3 power means that Gibson’s homer is now worth about 65x a typical play.

With base WPA, Clark’s homer (.74) is worth about 21x a normal play (.035). With WPS it’s worth about 42x a normal play. Raising all the plays to the 1.3 power means that Clark’s homer is now worth about 53x a typical play.

With a little algebra,

WPA: Gibson = 1.18 * Clark

WPS: Gibson = 1.76 * Clark

Power 1.3: Gibson = 1.23 * Clark
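The little algebra can be checked in a few lines. The sketch assumes, as described earlier, that modified WPS counts Gibson's homer three times (base, top-three, final play) and Clark's twice (base, top-three); the variable names are made up.

```python
gibson, clark, typical = 0.87, 0.74, 0.035

# Value of each homer relative to a typical play under the three schemes.
gibson_x = (gibson / typical, 3 * gibson / typical, (gibson / typical) ** 1.3)
clark_x = (clark / typical, 2 * clark / typical, (clark / typical) ** 1.3)

# Gibson's homer relative to Clark's under the three schemes.
wpa_ratio = gibson / clark
wps_ratio = (3 * gibson) / (2 * clark)
power_ratio = (gibson / clark) ** 1.3

print([round(x) for x in gibson_x])  # [25, 75, 65]
print([round(x) for x in clark_x])   # [21, 42, 53]
print(round(wpa_ratio, 2), round(wps_ratio, 2), round(power_ratio, 2))  # 1.18 1.76 1.23
```

The power version keeps the two homers in roughly the same proportion as their raw swings, while the walk-off bonus is what stretches the gap to 1.76.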

A nice property of the transformation is that when the change in odds doubles, the play is worth about two and a half times as much (2.46x).

EXCITEMENT IS NOT LINEAR

A 10% play is now worth 2.46 times as much as a 5% play

A 20% play is now worth 2.46 times as much as a 10% play

A 50% play is now worth 2.46 times as much as a 25% play
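The multiplier is the same at every scale because the size of the play cancels out. For any swing $x$ (the symbol is just shorthand for the change in odds):

$$\frac{(2x)^{1.3}}{x^{1.3}} = 2^{1.3} \approx 2.46$$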

The system has a single parameter applied to ALL plays, so a game isn’t screwed if it has four great plays or the best play comes in the 8th inning. Ranking games this way, here are the five games better than, and worse than, each of my two test cases.

| Series | Road Team | Home Team | Innings | (WPA^1.3)*2 | WPA | Top Play | # Plays | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 ALCS G1 | Royals 8 | Orioles 6 | 10 | 5.00 | 5.14 | 35% | 96 | 13 | 3 | 2 | – |
| 1935 WS G3 | Tigers 6 | Cubs 5 | 11 | 4.97 | 5.02 | 36% | 96 | 15 | 5 | 1 | – |
| 1976 NLCS G3 | Phillies 6 | Reds 7 | 9 | 4.95 | 4.72 | 46% | 82 | 14 | 3 | 2 | – |
| 2015 ALDS2 G2 | Rangers 6 | Blue Jays 4 | 14 | 4.93 | 5.46 | 37% | 115 | 7 | 2 | 1 | – |
| 1997 ALCS G4 | Orioles 7 | Indians 8 | 9 | 4.92 | 4.92 | 38% | 88 | 16 | 4 | 1 | – |
| 1985 NLCS G6 | Cardinals 7 | Dodgers 5 | 9 | 4.92 | 4.48 | 74% | 83 | 9 | 4 | 2 | 1 |
| 1975 NLCS G3 | Reds 5 | Pirates 3 | 10 | 4.88 | 4.52 | 55% | 81 | 14 | 3 | 3 | 1 |
| 1933 WS G4 | Giants 2 | Senators 1 | 11 | 4.87 | 4.94 | 55% | 92 | 9 | 3 | 1 | 1 |
| 2011 ALCS G2 | Tigers 3 | Rangers 7 | 11 | 4.86 | 5.10 | 34% | 92 | 13 | 3 | 1 | – |
| 2012 ALDS2 G2 | Athletics 4 | Tigers 5 | 9 | 4.86 | 4.86 | 41% | 85 | 11 | 4 | 1 | – |
| 1999 NLCS G6 | Mets 9 | Braves 10 | 11 | 4.85 | 5.12 | 26% | 108 | 14 | 3 | – | – |

| Series | Road Team | Home Team | Innings | (WPA^1.3)*2 | WPA | Top Play | # Plays | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1952 WS G5 | Dodgers 6 | Yankees 5 | 10 | 4.51 | 4.70 | 44% | 92 | 10 | 4 | 1 | – |
| 1923 WS G1 | Giants 5 | Yankees 4 | 9 | 4.51 | 4.54 | 40% | 78 | 12 | 2 | 2 | – |
| 1984 NLCS G4 | Cubs 5 | Padres 7 | 9 | 4.51 | 4.54 | 37% | 83 | 10 | 4 | 2 | – |
| 1992 WS G2 | Blue Jays 5 | Braves 4 | 9 | 4.50 | 4.40 | 65% | 85 | 11 | 1 | 1 | 1 |
| 1998 ALCS G2 | Indians 4 | Yankees 1 | 12 | 4.48 | 4.78 | 33% | 96 | 11 | 3 | 1 | – |
| 1988 WS G1 | Athletics 4 | Dodgers 5 | 9 | 4.47 | 3.98 | 87% | 82 | 5 | 2 | 2 | 1 |
| 2000 NLCS G2 | Mets 6 | Cardinals 5 | 9 | 4.46 | 4.66 | 32% | 91 | 13 | 3 | 2 | – |
| 2016 NLDS2 G5 | Dodgers 4 | Nationals 3 | 9 | 4.46 | 4.66 | 21% | 84 | 14 | 1 | – | – |
| 1977 WS G1 | Dodgers 3 | Yankees 4 | 12 | 4.45 | 4.80 | 30% | 97 | 11 | 2 | 1 | – |
| 1954 WS G1 | Indians 2 | Giants 5 | 10 | 4.43 | 4.74 | 29% | 89 | 11 | 1 | – | – |
| 1958 WS G1 | Yankees 3 | Braves 4 | 10 | 4.43 | 4.56 | 40% | 88 | 10 | 3 | 2 | – |

I hope you’ll look at these and see that while they have different shapes, they all contain a similar ‘volume’ of excitement.

Another way to evaluate the method is to look at games with the same WPA. Going back to where I began in this article, here are the seven games with a base WPA of 4.52 (no promises that BBRef has not revised the scores since I captured the data…). They are each tied for the 108th highest WPA. But after applying the 1.3 power, you get this:

| Game | Outcome | Rank | (WPA^1.3)*2 | WPA | # Plays | Top 5 Plays (sum) | # Plays 30-70% | P>=.1 | P>=.2 | P>=.3 | P>=.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 WS G7 | Pit 10 NYY 9 in 9 | 52 | 5.10 | 4.52 | 77 | 1.74 | 25 | 15 | 4 | 3 | 1 |
| 1975 NLCS G3 | Cin 5 Pit 3 in 10 | 63 | 4.88 | 4.52 | 81 | 1.60 | 49 | 14 | 3 | 3 | 1 |
| 1911 WS G3 | A’s 3 Giants 2 in 11 | 110 | 4.41 | 4.52 | 86 | 1.10 | 58 | 15 | 3 | 1 | – |
| 1998 NLCS G1 | SD 3 Atl 2 in 10 | 117 | 4.36 | 4.52 | 84 | 1.10 | 59 | 11 | 2 | 1 | – |
| 2011 NLDS2 G5 | Mil 3 Ari 2 in 10 | 119 | 4.35 | 4.52 | 85 | 1.05 | 71 | 13 | 2 | 1 | – |
| 1926 WS G5 | NYY 3 StL 2 in 10 | 130 | 4.20 | 4.52 | 86 | 0.84 | 66 | 16 | 1 | – | – |
| 1995 NLCS G2 | Atl 6 Cin 2 in 10 | 139 | 4.12 | 4.52 | 95 | 0.75 | 70 | 13 | – | – | – |

1960 gets the love it deserves, moving up 56 spots to the 52nd best game. That’s despite having the fewest plays in the 30%-70% victory range. Games with more plays do worse since that means they have smaller-impact plays on average. Think of the Top 5 plays as the highlight reel for the game. 1995 NLCS Game 2 has no play >0.2 and therefore drops 31 spots in the rankings.

Adjusted WPS? Weighted WPS? Power WPS? I really do need to give it a proper name.

A final example, from among the greatest playoff games ever.

2000 NLDS G3: Mets 3, Giants 2 in 13 innings — ModWPS Rank = 11, PowerWPS Rank = 22

1986 ALCS G5: Red Sox 7, Angels 6 in 11 innings — ModWPS Rank = 22, PowerWPS Rank = 12

1980 NLCS G5: Phillies 8, Astros 7 in 10 innings — ModWPS Rank = 25, PowerWPS Rank = 14

The 2000 game had the higher WPS, partly because it had more plays. ModWPS likes it more due to the additional action and walk-off homer, which the better top-three plays in 80/86 could not overcome.

| Year | WPS | Plays | Last Play | Top-3 | ModWPS |
|---|---|---|---|---|---|
| 2000 | 6.34 | 109 | 0.42 | 0.98 | 7.74 |
| 1986 | 5.86 | 97 | 0.05 | 1.42 | 7.33 |
| 1980 | 6.06 | 93 | 0.04 | 1.11 | 7.21 |

So why do I think 1986/1980 are better?

Because the deeper you go beyond the top three, the better the other two are revealed to be.

| Measure | 2000 | 1986 | 1980 |
|---|---|---|---|
| Sum of Top-5 Plays | 1.28 | 1.94 | 1.61 |
| Top-5 Plays (%) | 42-31-25-16-14 | 73-35-34-32-20 | 40-38-35-26-24 |
| Sum of Top-10 Plays | 1.88 | 2.77 | 2.43 |
| 10%-20%-30%-50% plays | 16-3-2-0 | 14-5-4-1 | 17-6-3-0 |
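The "go deeper than the top three" comparison is easy to reproduce. Here is a minimal sketch with a hypothetical helper, fed with the rounded top-five swings listed above for the 2000 and 1986 games.

```python
def top_n_sum(swings: list[float], n: int) -> float:
    """Sum of the n largest swings in a game."""
    return sum(sorted(swings, reverse=True)[:n])


# Rounded top-five plays as listed in this article.
top5_2000 = [0.42, 0.31, 0.25, 0.16, 0.14]
top5_1986 = [0.73, 0.35, 0.34, 0.32, 0.20]

for label, plays in (("2000", top5_2000), ("1986", top5_1986)):
    print(label, round(top_n_sum(plays, 3), 2), round(top_n_sum(plays, 5), 2))
```

Going from three plays to five adds 0.30 to the 2000 game and 0.52 to the 1986 game, which is the gap a top-three bonus never sees.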

Or simply check the line scores.

2000

0 0 0 2 0 0 0 0 0 0 0 0 0 (2) Giants

0 0 0 0 0 1 0 1 0 0 0 0 1 (3) Mets

1986

0 2 0 0 0 0 0 0 4 0 1 (7) Red Sox

0 0 1 0 0 2 2 0 1 0 0 (6) Angels

1980

0 2 0 0 0 0 0 5 0 1 (8) Phillies

1 0 0 0 0 1 3 2 0 0 (7) Astros

The 2000 game IS a fabulous game. But the 1986 and 1980 games are more epic, with all the late-inning heroics. The 2000 game has exactly the required three big plays and the walk-off. It checks all the boxes.

I do kinda feel bad writing this. It sounds like I’m just picking on modified WPS here. LOOK AT WHAT ELSE IT GOT WRONG…

But as I said before, Power WPS is **barely** better. And to show that it’s better at all, I need to show those rare cases where it makes a better call. And it was an excellent benchmark; comparing differences between it and my sixty-eleven schemes helped me identify the flaws in sixty-ten of them.

Of course, even this is not the perfect system. Any play-by-play method will still fail to capture the in-play action. A bases-empty foul pop-out rates exactly the same as a bases-empty thrown-out-at-home-trying-to-stretch-a-triple. But it is the best we can do for now.

Whereas I used to guess my line score method captured maybe 70% of the excitement of a game, PBP ratings must be capturing upwards of 90%. Which means greater confidence in game rankings and playoff series ratings.

Anyway, if anyone has any thoughts, feedback, or questions I’d love to hear them. If no one can shoot the idea full of holes, or even one hole, then comes ranking and lists of games and series.