## Do Catchers Influence Pitcher Performance? The Story of Spanky and Sluggo

From Opening Day to April 20th, Red Sox pitchers posted a 7.14 ERA when Jarrod Saltalamacchia was behind the plate versus a 2.40 ERA when Jason Varitek started. The resulting hubbub about this split made one fact extremely clear: when comparing the influence of different catchers, sample size is really *really* important.

By June 24th, Varitek and Salty’s split has already been greatly reduced, with pitchers now throwing a 3.44 ERA to the veteran captain and a 4.36 ERA to the new guy. I would bet that these numbers will continue to converge as the season drags on, but even after 162 games it’s unlikely that either catcher will have enough innings to *statistically* test whether one is calling a better game. This is the difficulty of assessing catcher performance: comparing catchers between teams is nearly impossible (because the pitching staffs are different), and comparing catchers within teams is difficult (because sample sizes are small and different pitchers use different catchers). Nevertheless, many still believe that catchers do influence pitcher performance. Where can we find the data to support this hypothesis?

Enter Jim Leyland. From 1990 to 1992, Leyland’s Pirates deployed one of the longest-running catcher platoons in baseball history on their way to three straight NLCS losses. The main catcher was Mike “Spanky” LaValliere, a strong defender with a cannon arm (career .992 fielding percentage, league-leading 45% CS rate and GG in 1987) and mediocre offensive production (.269/.355/.342 through 1992, good for a 97 OPS+). Backing him up was Don “Sluggo” Slaught, who had the reputation of a horrible defender (led the league in SB allowed in 1986, errors in 1988) but hit for more power (.280/.331/.417 through 1992) and crushed lefties (.301/.358/.446 career). Leyland split time between Spanky and Sluggo about 60/40, and during this three-year span five Pirates starters – Doug Drabek, Bob Walk, Randy Tomlin, Zane Smith, and John Smiley – logged over 100 IP with each catcher. In total, these five pitched 1,347 innings in games that LaValliere started, and 821.2 innings in games that Slaught started. If pitchers did indeed throw differently to Spanky and Sluggo, this should be a large enough sample to observe it.

Thanks to Baseball Reference, I pulled up 339 games between 1990 and 1992 where one of these pitchers and one of these catchers started. Basic box scores were recorded (IP, H, ER, BB, SO, HR, Pit, Str, GB, FB, and LD), and from this some more interesting metrics were calculated (ERA, SO/9, BB/9, Str%, H/9, WHIP, HR/9, GB%, FB%, LD%, HR/FB, and BABIP). Finally, one advanced sabermetric stat was considered (RE24). These stats were then organized to compare the performance of each pitcher with each of the catchers, and standard t-tests were run to assess whether pitcher performance in these areas significantly changed depending on whether Spanky or Sluggo was starting.
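As a rough illustration of how the rate stats above fall out of a box-score line, here is a minimal Python sketch. The `Line` fields, the function names, and the sample start are hypothetical, and it covers only the rates that need nothing beyond IP, H, ER, BB, SO, HR, Pit, and Str. One detail worth getting right is that baseball innings notation is not decimal: 821.2 means 821 innings plus two outs.

```python
from dataclasses import dataclass


def ip_to_outs(ip: str) -> int:
    """Convert innings notation ('821.2' = 821 innings + 2 outs) to total outs."""
    whole, _, frac = ip.partition(".")
    extra = int(frac) if frac else 0
    if extra not in (0, 1, 2):
        raise ValueError(f"invalid innings notation: {ip!r}")
    return int(whole) * 3 + extra


@dataclass
class Line:
    """One pitching line from a box score (hypothetical field names)."""
    ip: str
    h: int
    er: int
    bb: int
    so: int
    hr: int
    pitches: int
    strikes: int


def rates(line: Line) -> dict:
    """Derive per-nine-inning rates, WHIP, and strike percentage."""
    outs = ip_to_outs(line.ip)
    if outs == 0 or line.pitches == 0:
        raise ValueError("rates are undefined without outs and pitches")
    innings = outs / 3

    def per9(n: int) -> float:
        return 9 * n / innings

    return {
        "ERA": per9(line.er),
        "SO/9": per9(line.so),
        "BB/9": per9(line.bb),
        "H/9": per9(line.h),
        "HR/9": per9(line.hr),
        "WHIP": (line.bb + line.h) / innings,
        "Str%": 100 * line.strikes / line.pitches,
    }


# Made-up start: 7 1/3 innings, 2 earned runs.
example = Line(ip="7.1", h=6, er=2, bb=1, so=5, hr=1, pitches=98, strikes=65)
print(rates(example))
```

When aggregating many starts, summing the raw counts (outs, earned runs, and so on) before dividing avoids giving a two-inning disaster the same weight as a complete game.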

First off, it is interesting to see the parts of pitching that the two catchers apparently didn’t influence. Differences in SO/9, BB/9, H/9, HR/9, HR/FB, IP/game, Str%, GB%, FB%, and LD% were statistically insignificant between catchers, although for some reason Zane Smith seemed to get jacked up when Slaught was catching for him (0.89 HR/9 vs. 0.42 for LaValliere). However, there were some pitching performance indicators that showed very real differences depending on who was catching.

**ERA**

Doug Drabek started 52 games with LaValliere and 43 games with Slaught during this span, pitching a 3.25 ERA to Spanky and a 2.42 ERA to Sluggo. This 0.83-run difference is statistically fairly strong (p=0.09 in a t-test, meaning a gap this large would show up only about 9% of the time by chance if the catcher made no difference). Randy Tomlin and John Smiley also had a noticeably lower ERA with Slaught behind the plate (p=0.28 and 0.32), while Zane Smith pitched better with LaValliere (p=0.15). Bob Walk was essentially a push, with his slight edge towards LaValliere being statistically meaningless (p=0.73).
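For readers who want to see what a comparison like this involves, here is a minimal sketch using only the Python standard library. It makes two assumptions that are not from the analysis above: that the unit of observation is earned runs per start, and that Welch's unequal-variance form of the t-test is used. The sample numbers are invented. Because the standard library has no t-distribution, the p-value comes from a permutation test, which is a stand-in rather than the t-test itself; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` would give the Welch p-value directly.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def permutation_p(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Stand-in for the t-test p-value: shuffle the catcher labels and count
    how often the shuffled gap is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter


# Invented earned-run totals per start for one pitcher with each catcher.
with_catcher_a = [3, 2, 4, 1, 5, 3, 2, 4]
with_catcher_b = [2, 1, 3, 0, 2, 4, 1, 2]
print(welch_t(with_catcher_a, with_catcher_b))
print(permutation_p(with_catcher_a, with_catcher_b))
```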

**RE24**

ERA is a flawed statistic in many ways, and more advanced sabermetric calculations exist that more specifically pinpoint the contribution the pitcher makes to his team during the course of a game. One such stat is RE24, which measures the change in Run Expectancy (RE) before and after every play. A pitcher’s RE24 for a start effectively measures how many net runs he either prevented or allowed in that start.
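To make the bookkeeping concrete, here is a small Python sketch of the RE24 idea. The run-expectancy values are illustrative placeholders, not a table from 1990-1992, and the state encoding and function names are hypothetical. Each play is credited with the change in run expectancy plus any runs that scored; the pitcher's total is the negative of the batting side's.

```python
# Illustrative run-expectancy values keyed by (runners, outs).
# These are placeholder numbers, NOT an actual 1990-92 table.
RUN_EXP = {
    ("---", 0): 0.48,  # bases empty, nobody out
    ("1--", 0): 0.85,  # runner on first, nobody out
    ("---", 2): 0.10,  # bases empty, two out
}

END_OF_INNING = None  # three outs: no further runs are expected


def run_exp(state):
    """Look up run expectancy; the end of an inning is worth zero."""
    return 0.0 if state is END_OF_INNING else RUN_EXP[state]


def pitcher_re24(plays):
    """Sum the pitcher's RE24 over plays.

    plays: iterable of (state_before, state_after, runs_scored).
    Positive totals mean runs prevented relative to expectation.
    """
    batting = sum(
        run_exp(after) - run_exp(before) + runs
        for before, after, runs in plays
    )
    return -batting


# A scoreless inning: leadoff single, double play, strikeout.
inning = [
    (("---", 0), ("1--", 0), 0),
    (("1--", 0), ("---", 2), 0),
    (("---", 2), END_OF_INNING, 0),
]
print(pitcher_re24(inning))
```

Note how the intermediate states cancel: any scoreless inning earns the pitcher exactly the run expectancy he started the inning with.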

For each pitcher, RE24 tracks very closely with ERA. Drabek prevented an additional 0.73 runs when Slaught was catching for him (p=0.08), while Smiley prevented an additional 0.85 runs per start (p=0.14). Tomlin also showed some improvement with Slaught behind the plate (0.65 prevented runs, p=0.28), while Smith still preferred LaValliere but not with the same statistical strength as with ERA (0.50 prevented runs, but p=0.46). Bob Walk again showed little statistical significance in his result (p=0.58).

**BABIP**

Both ERA and RE24 sum up a pitcher’s performance, but neither does a great job of identifying *how* a pitcher is being successful. Because SO/9 and BB/9 rates were essentially the same between the two catchers for all pitchers, what’s most interesting to us is what happened to the balls that did make it into play: BABIP.

Drabek and Smiley had significantly lower batting average on balls in play with Slaught behind the plate (p=0.07 and 0.20), while Tomlin had a less significant difference (p=0.36), Smith very weakly favored LaValliere (p=0.49), and Walk continued to be statistically meaningless (p=0.65). BABIP is a frustrating statistic in that it is unclear how much fluctuation in the average is attributable to the pitcher, but some pitchers – by increasing their GB% and lowering their LD% – are able to lower their BABIP in a very real way. Unfortunately that is not the case with any of these starters, as GB/FB/LD% were almost identical between the two catchers (varying by less than 2% in almost all cases).

## ANOVA Analysis

None of the analysis above calculates Slaught’s or LaValliere’s total ERA or BABIP; the 339 starts are split into five different samples by pitcher instead of being thrown into one giant group, as analysts have done with Varitek and Saltalamacchia. The reason for this is that the Pirates’ best pitchers – Drabek and Smiley – pitched to Slaught in 45% of their starts while the other three only pitched to him 33% of the time. Averaging all the starts together would bias the results towards Slaught, rewarding him for catching more often for the 1990 Cy Young winner and the 1991 National League wins leader.
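The size of that bias is easy to sketch. The Python snippet below assumes, purely for illustration, that neither catcher has any effect, that the two aces run a 2.80 ERA and the other three starters a 3.80 ERA, and that both groups make the same number of equally long starts. Only the 45% and 33% shares come from the data above; everything else is hypothetical.

```python
def pooled_era(groups):
    """Workload-weighted ERA across pitcher groups.

    groups: list of (share_of_workload, era) pairs for one catcher.
    """
    total = sum(share for share, _ in groups)
    return sum(share * era for share, era in groups) / total


ACE_ERA, OTHER_ERA = 2.80, 3.80  # hypothetical, identical for both catchers

# Slaught catches 45% of the aces' starts and 33% of the others'.
slaught = pooled_era([(0.45, ACE_ERA), (0.33, OTHER_ERA)])
# LaValliere catches the remainder: 55% and 67%.
lavalliere = pooled_era([(0.55, ACE_ERA), (0.67, OTHER_ERA)])

print(f"Slaught pooled ERA:    {slaught:.2f}")
print(f"LaValliere pooled ERA: {lavalliere:.2f}")
print(f"Spurious gap:          {lavalliere - slaught:.2f}")
```

Under these made-up inputs the pooled numbers differ by roughly an eighth of a run even though the catcher contributes nothing, which is why the starts are kept separate by pitcher.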

Luckily, there is a more complex statistical test known as an ANalysis Of VAriance or ANOVA (more specifically a General Linear Model ANOVA in this case). An ANOVA test allows me to simultaneously examine how two different independent variables (pitcher and catcher) influence a particular dependent variable (such as SO/9). To oversimplify, if games where Smiley started and Slaught caught resulted in an average of 3.13 ER, and games where Smith started and LaValliere caught resulted in an average of 3.89 ER, an ANOVA test helps me compare how much of that difference was caused by switching pitchers and how much was caused by switching catchers. As in the t-test, a low p-value for a particular stat indicates that the pitcher or catcher had a significant amount of influence or control over that metric, while a high p-value indicates that there was not a significant amount of influence. I ran GLM ANOVA tests for all the non-box-score stats I collected, with results below:

First of all, it is interesting to note how consistent this data is with what we already know about pitcher performance. These five pitchers showed significant control over the lengths of their starts, their SO and BB rate, their Strike %, their HR rate, and their GB and FB rates (p < 0.2), while having less control over factors such as ERA, RE24, H/9, WHIP, HR/FB, BABIP, all stats that rely heavily on fielding and a healthy dose of random chance.
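For reference, a two-factor model of the kind described can be written out as follows. This is a sketch under assumptions: it treats each start as one observation and includes no pitcher-by-catcher interaction term, neither of which is confirmed by the write-up.

```latex
\begin{align*}
y_{ijk} &= \mu + \alpha_i + \beta_j + \varepsilon_{ijk},
  \qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2) \\
F_{\text{catcher}} &=
  \frac{\left(\mathrm{SSE}_{\text{no catcher}} - \mathrm{SSE}_{\text{full}}\right) / 1}
       {\mathrm{SSE}_{\text{full}} / (N - 6)}
\end{align*}
```

Here $y_{ijk}$ is the stat in question for start $k$ by pitcher $i$ (five levels) with catcher $j$ (two levels), $\alpha_i$ and $\beta_j$ are the pitcher and catcher effects, and $N = 339$ starts. The full model has $1 + 4 + 1 = 6$ free parameters, so the catcher p-value comes from comparing $F_{\text{catcher}}$ to an $F$ distribution with $1$ and $N - 6$ degrees of freedom; the pitcher test is analogous with $4$ numerator degrees of freedom.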

The catcher results are more intriguing. For starters, catchers had no influence over the strike % (p=0.920), which may surprise some who would expect that either LaValliere or Slaught was better at framing pitches. The only two categories where the catcher appeared to have at least some control were RE24 and BABIP.

So this is the odd conclusion that we come to with this data analysis. There was at least a somewhat significant difference in pitcher performance between the two catchers, but surprisingly it was the offensive Slaught that appeared to catch a slightly better game. More surprisingly, Slaught’s strength seemed to be that he somehow exerted control over the BABIP of those that pitched to him without significantly influencing the ratio of groundballs, flyballs, and line drives that went into play. This requires a great deal more research, but I would extend the hypothesis that catchers do influence pitcher performance, in that different catchers call different games that result in balls being put into play more weakly. It’s possible that Jim Leyland recognized this skill in Slaught (releasing LaValliere before the 1993 season), but it’s unlikely. Even with a large sample size it is difficult to actually parse out the influence of the catcher, so whenever you read any article that prattles on about catcher ERA, you better take it with a huge grain of salt.

And in case you were wondering, Saltalamacchia’s Catcher BABIP for the Red Sox this year is 0.273. Varitek’s? 0.279. Not a big difference there.

With such a large number of categories, isn’t it statistically likely to see p values in the range of 0.10-0.20 even if indeed there is no difference?

If you were to repeat the analysis a couple more times with different catchers (yet ones in a similar situation) I might be more inclined to believe that these results are any more than noise.

Would be pretty interested in seeing what the avg. BABIP skill level was of opposing batters in games Slaught caught vs. games LaValliere caught. The two catchers would never have a truly random sample of the Pirates overall schedule, as opposing teams with a lot of LHP starters would draw Slaught usually, for example. If these LHP-heavy teams happened to have a high-BABIP offense, the apparent correlation between catcher and BABIP could show up.

Also, the analysis assumes that both catchers had team-average defense around them. This is probably not the case if there were any other frequent platoons on the Pirates. I.e., it’s possible that Slaught had better teammate defense than LaValliere did if, on average, the Pirates had better defenders who were RHBs.

Yes, none of those p-values are meaningful. Even with a marginal one of 0.07 – 0.09 (Drabek) you would expect that in 10 tests.

I agree that the weak BABIP correlation (which is all Drabek) is probably a defense issue. The RE/24 is also just the BABIP.

If you’d normally require p < 0.05 with one comparison, with 14 comparisons you might require p < .0036. If you'd use p < .2 then with 14 comparisons you'd need p < .016 — although .2/14 will get you a rough estimate I think it's probably better to do 1 – (1-.2)^(1/14) .

So, if I'm interpreting your numbers correctly, there's really no evidence of a catcher effect.
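The thresholds quoted in the comment above can be checked with a few lines of Python. The function names are just illustrative labels, and both the `1 - (1 - alpha)^(1/m)` threshold (usually called the Šidák correction) and the family-wise calculation assume the 14 comparisons are independent, which stats like ERA, RE24, and BABIP clearly are not; treat the output as a rough guide.

```python
def bonferroni(alpha: float, m: int) -> float:
    """Per-test threshold from simply dividing alpha by the number of tests."""
    return alpha / m


def sidak(alpha: float, m: int) -> float:
    """Per-test threshold 1 - (1 - alpha)^(1/m), exact for independent tests."""
    return 1 - (1 - alpha) ** (1 / m)


def familywise(alpha: float, m: int) -> float:
    """Chance that at least one of m independent null tests falls below alpha."""
    return 1 - (1 - alpha) ** m


M = 14  # number of comparisons cited in the comment
print(round(bonferroni(0.05, M), 4))  # 0.0036
print(round(bonferroni(0.20, M), 4))  # 0.0143
print(round(sidak(0.20, M), 4))       # 0.0158
print(round(familywise(0.20, M), 3))  # 0.956
```

In other words, with 14 independent looks and a p < 0.2 bar, seeing at least one "significant" result is close to a sure thing even when nothing is going on.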

Anyone here read this latest study on

Strike Three: Do MLB Umpires Express Racial Bias in Calling Balls and Strikes?

Daniel Hamermesh

07/01/2011 | 12:33 pm

Our paper on discrimination in baseball has finally been published (the June issue of the American Economic Review). While it received a lot of media and scholarly comment in draft, the final version contained a whole new section. The general idea is that those discriminated against will alter their behavior to mitigate the impacts of discrimination on themselves. But while reducing the impacts, these changes are not costless. For example, if you’re an Hispanic pitcher and think that the white umpire is against you, you’ll change your pitches. Where will you throw? How will you throw?

The paper shows that the pitcher will avoid giving the umpire a chance to use his discretion in judging a pitch. More pitches go into the strike zone, more are clearly balls. More are fastballs, fewer curves and change-ups. A rational response, but by avoiding the umpire’s discrimination the pitcher makes it easier for the batter to hit the ball or to walk. Here’s the abstract:

Major League Baseball umpires express their racial/ethnic preferences when they evaluate pitchers. Strikes are called less often if the umpire and pitcher do not match race/ethnicity, but mainly where there is little scrutiny of umpires. Pitchers understand the incentives and throw pitches that allow umpires less subjective judgment (e.g., fastballs over home plate) when they anticipate bias. These direct and indirect effects bias performance measures of minorities downward. The results suggest how discrimination alters discriminated groups’ behavior generally. They imply that biases in measured productivity must be accounted for in generating measures of wage discrimination.

http://www.freakonomics.com/wp-content/uploads/2011/07/baseball-graph.jpg

That was a lot of work to prove nothing. How did this make it to the Big Blog?

“That was a lot of work to prove nothing. How did this make it to the Big Blog?”

Statistically/scientifically “disproof” is often just as important as “proof.” This really shows that catchers don’t have much effect on pitchers. Those p-values are very high given the number of comparisons made, as was pointed out before. The only thing significant in these tests are the pitcher differences (which is to be expected).

Only concern is this: You had a large sample of games with two very different catchers. But what you didn’t have was a large sample of pitchers. I still think it is possible that SOME pitchers are more comfortable with a specific catcher. If Varitek and Salty caught 100 pitchers, you’d see no statistical difference. But maybe Beckett would be one of the ones who does much better with Varitek because he’s very comfortable with him. Also I would suspect that pitchers may perform worse with a catcher they are not at all familiar with. I.e., maybe Boston’s struggles were an adjustment period to Salty, same as Lincecum struggling to deal with going from Molina to Posey last year and now Posey to a new guy yet again this year.

Jim Leyland is a lover of platoons. In 1990, for example, there were three L/R platoons in use by Pittsburgh. Slaught and LaValliere split time behind the plate, starting 61 and 87 games, respectively. Likewise, first basemen Sid Bream (L/L) and Gary Redus (R/R) started 100 and 58 games, respectively. And at the hot corner, Wally Backman (L/R) split time with Jeff King (R/R) 68 games to 86. As one would expect, the LaValliere/Bream/Backman and Slaught/Redus/King trios appear more often than not in the 1990 Pirates defensive lineups (http://www.baseball-reference.com/teams/PIT/1990-lineups.shtml). It should also be noted that Redus was pulled late in over 2/3 of his 1B starts, while Backman completed fewer than 50% of his starts.

The Pirates’ pennant push in 1991 was messier from a platoon standpoint. LaValliere started 100 games to Slaught’s 53, but this season saw LaValliere complete just 77 of his 100 starts. Orlando Merced replaced Sid Bream as the lefty half of the 1B platoon, starting 95 games at the position (completing 85 and playing in some capacity in an additional 10 for 105 total). Redus (43 starts, 33 CG) remained the primary righty platoon member, with Lloyd McClendon (15 starts, 11 CG) also spending a bit of time there. Third base and right field were messy that year, with Bobby Bonilla spending the first month as the everyday RF, shifting everyday to 3B in May, and platooning with R/R John Wehner at 3B and L/R Gary Varsho in RF until the late August acquisition of everyday 3B Steve Buechele, at which point Bobby Bonilla moved firmly back to RF. LaValliere/Merced and Slaught/Redus match up in defensive lineup frequency. King, Bonilla and Buechele all had stretches of being the primary 3B, while Bonilla/Wehner also was a short-term platoon. Switch-hitting Mitch Webster also picked up 15 RF starts from mid-May through the end of June (.565 OPS for his 2nd of 3 1991 teams).

More of the same in 1992, as LaValliere and Merced each started 87 games, while Slaught started 65 behind the plate and righties Redus and King combined for 61 starts. Buechele was the everyday 3B until a midseason trade freed up the spot for King, who gave some time to Wehner, but not in a significant L/R platoon. The post-Bonilla era in RF saw 9 starters, of whom 6 amassed between 100 and 400 innings played at the position.

Through these 3 seasons, Jose Lind and Jay Bell were constants up the middle, while Bonds and Van Slyke manned 2/3 of the OF. Right field and third base saw a lot of fluctuation. And first base was as much a pure platoon situation as the catcher spot. There’s definitely a chance that these other platoons had an effect on the BABIP numbers, but it’s also worth considering how any other teams’ L/R platoons might have affected Drabek/Walk as opposed to Tomlin/Smiley/Smith. Those splits were out of the starting catcher’s control, as well as the 1B’s, thus leaving to chance whether a better defensive C and 1B (LaValliere/Bream) would be in the lineup or not against a team stacked with lefties against Drabek/Walk or the better 3B (not Bonilla or Backman) would be in there against a righty-heavy lineup facing Tomlin/Smiley/Smith. That could certainly have an effect that is outside the study of simple SP/starting catcher relationships and would merit further study once we have better defensive metrics derived from early-90s game footage.

There have been a couple of great articles recently on THT that use PitchFX data (which is much more granular than what you’re doing here) and found that a catcher with good strike-framing abilities can have an impact of up to a full additional win above replacement over the course of a season (which is shorter for catchers). Before you ask, yes, they normalized for pitcher strikezone, batter strikezone, and umpire strikezone.

Jon: this article doesn’t prove or disprove anything. As referred to above, here is an example of how the article should have been done: http://www.hardballtimes.com/main/article/evaluating-catchers-framing-pitches-part-3.

I thought it was an interesting read. Not sure where the haterade is coming from.

As mentioned above, those catcher p-values are all insignificant. I don’t mean to “pour on the haterade,” because I really thought this was an interesting read, but the reason they’re insignificant is rather subtle, and many people with serious training in statistics often mess this up. Here’s a basic explanation of the mistake: http://xkcd.com/882/