# Journal of Quantitative Analysis in Sports

### An official journal of the American Statistical Association

Editor-in-Chief: Steve Rigdon, PhD

4 Issues per year

CiteScore 2017: 0.67

SCImago Journal Rank (SJR) 2017: 0.290
Source Normalized Impact per Paper (SNIP) 2017: 0.853

Online ISSN: 1559-0410
Volume 13, Issue 3

# A hierarchical Bayesian model of pitch framing

Sameer K. Deshpande
/ Abraham Wyner
Published Online: 2017-10-10 | DOI: https://doi.org/10.1515/jqas-2017-0027

## Abstract

Since the advent of high-resolution pitch tracking data (PITCHf/x), many in the sabermetrics community have attempted to quantify a Major League Baseball catcher’s ability to “frame” a pitch (i.e. increase the chance that a pitch is called a strike). Especially in the last 3 years, there has been an explosion of interest in the “art of pitch framing” in the popular press as well as signs that teams are considering framing when making roster decisions. We introduce a Bayesian hierarchical model to estimate each umpire’s probability of calling a strike, adjusting for the pitch participants, pitch location, and contextual information like the count. Using our model, we can estimate each catcher’s effect on an umpire’s chance of calling a strike. We are then able to translate these estimated effects into average runs saved across a season. We also introduce a new metric, analogous to Jensen, Shirley, and Wyner’s Spatially Aggregate Fielding Evaluation metric, which provides a more honest assessment of the impact of framing.

## 1 Introduction

The New York Yankees and Houston Astros played each other in the American League Wild Card game in October 2015, with the winner continuing to the next round of the Major League Baseball playoffs. During and immediately after the game, several Yankees fans took to social media expressing frustration that home plate umpire Eric Cooper was not calling balls and strikes consistently for both teams, thereby putting the Yankees at a marked disadvantage. Even players in the game took exception to Cooper’s decision making: after striking out, Yankees catcher Brian McCann argued with Cooper that he was calling strikes on similar pitches when the Astros were pitching but balls when the Yankees were pitching. Figure 1 shows two pitches thrown during the game, one by Astros pitcher Dallas Keuchel and the other by Yankees pitcher Masahiro Tanaka.

Both pitches were thrown in roughly similar locations, near the bottom-left corner of the strike zone, the rectangular region of home plate shown in the figure. According to the official rules, if any part of the pitched ball passes through the strike zone, the umpire ought to call it a strike. Keuchel’s pitch barely missed the strike zone while Tanaka’s missed by a few inches. As a result, Cooper ought to have called both pitches balls. That Cooper did not adhere strictly to the official rules is hardly surprising; previous research has shown umpires’ ball/strike decisions may be influenced by the race or ethnicity of the pitcher (see, e.g. Parsons et al. 2011; Tainsky, Mills, and Winfree 2015), player status as measured by age or ability (see, e.g. Kim and King 2014; Mills 2014), and their previous calls (Chen, Moskowitz, and Shue 2016). During the television broadcast of the game, the announcers speculated that the difference in Cooper’s strike zone enforcement was due to the ability of Astros catcher Jason Castro to “frame” pitches, catching them in such a way as to increase Cooper’s chance of calling a strike (Sullivan 2015).

## 5.1 Extensions

There are several extensions of and improvements to our model that we now discuss. While we have not done so here, one may derive analogous estimates of RS and CAFE for batters and pitchers in a straightforward manner. Our model only considered the count in which a pitch was thrown, but there is much more contextual information that we could have included. For instance, Rosales and Spratt (2015) have suggested the distance between where a catcher actually receives the pitch and where he sets up his glove before the pitch is thrown could influence an umpire’s ball-strike decision making. Such glove tracking data is proprietary but, should it become publicly available, one could include this distance, along with its interaction with the catcher indicator, in our model. In addition, one could extend our model to include additional game-state information such as the ball park, the number of outs in the half-inning, the configuration of the base-runners, whether or not the home team is batting, and the number of pitches thrown so far in the at-bat. One may argue that umpires tend to call more strikes late in games which are virtually decided (e.g. when the home team leads by 10 runs in the top of the ninth inning), and one could easily include measures related to the run differential and time remaining in our model. Expanding our model in these directions may improve the overall predictive performance slightly without dramatically increasing the computational overhead.

More substantively, we have treated the umpires’ calls as independent events throughout this paper. Chen et al. (2016) reported a negative correlation in consecutive calls, after adjusting for location. To account for this negative correlation, we could augment our model with binary predictors encoding the results of the umpires’ previous k calls in the same at-bat, inning, or game. Incorporating this Markov structure into our model would almost certainly improve the overall estimation of called strike probability and may produce slightly smaller estimates of RS and CAFE. At this point, however, it is not a priori obvious how large the differences would be or how best to pick k. It is also well-known that pitchers try to throw to different locations based on the count, but we make no attempt to model or exploit this phenomenon. Understanding the effect of pitch sequencing on umpires’ decision making (and vice-versa) would also be an interesting line of future research.
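As a rough illustration of the "previous k calls" predictors mentioned above, the following sketch builds lagged indicators from a chronological sequence of calls. The function name, the +1/−1/0 encoding, and the grouping into a single sequence are assumptions made for this example and are not part of our fitted model.

```python
from typing import List, Sequence

def previous_call_features(calls: Sequence[int], k: int) -> List[List[int]]:
    """Encode the umpire's previous k calls for each pitch in one sequence.

    `calls` holds 1 for a called strike and 0 for a ball, in chronological
    order within one group (e.g. one game). Row i contains the calls at lags
    1..k, encoded as +1 (strike), -1 (ball), or 0 (no earlier call exists).
    This encoding is a hypothetical choice, not the one used in the paper.
    """
    features = []
    for i in range(len(calls)):
        row = []
        for lag in range(1, k + 1):
            j = i - lag
            if j < 0:
                row.append(0)                  # no history that far back
            else:
                row.append(1 if calls[j] == 1 else -1)
        features.append(row)
    return features

# Toy sequence of five called pitches: strike, ball, ball, strike, strike.
lagged = previous_call_features([1, 0, 0, 1, 1], k=2)
print(lagged)
```

Each row could then be appended to the other predictors for the corresponding pitch; choosing k and the grouping (at-bat, inning, or game) remains the open question noted above.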

We incorporated pitch location in a two-step procedure: we started from an already quite good generalized additive model trained with historical data and used the forecasted log-odds of a called strike as a predictor in our logistic regression model. Much more elegant would have been to fit a single semi-parametric model by placing, say, a common Gaussian process prior on the umpire-specific functions of pitch location, f_u(x, z), in Equation 1. We have also not investigated any potential interactions between pitch location, player, and count effects. While we could certainly add interaction terms to the logistic models considered above, doing so vastly increases the number of parameters and may require more thoughtful prior regularization. A more elegant alternative would be to fit a Bayesian “sum-of-trees” model using the BART procedure of Chipman, George, and McCulloch (2010). Such a model would likely result in more accurate called strike probabilities as it naturally incorporates interaction structure. We suspect that this approach might reveal certain locations and counts in which framing is most manifest.
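The second step of the two-step procedure can be sketched as follows: the GAM forecast enters the logistic model through its log-odds, and the remaining effects shift the linear predictor. The function name, the `location_slope` parameter, and the generic list of additive effects are illustrative assumptions; the exact parameterization is the one given in Equation 1, which this sketch does not reproduce.

```python
import math
from typing import Iterable

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def called_strike_probability(gam_prob: float,
                              location_slope: float,
                              effects: Iterable[float]) -> float:
    """Combine a first-stage GAM forecast with additive log-odds effects.

    gam_prob       : first-stage forecast of a called strike (hypothetical input)
    location_slope : coefficient on the GAM log-odds (hypothetical name)
    effects        : additive terms on the log-odds scale, e.g. intercept,
                     catcher, batter, pitcher, and count effects
    """
    linear_predictor = location_slope * logit(gam_prob) + sum(effects)
    return inv_logit(linear_predictor)

# With unit slope and no other effects, the GAM forecast is returned unchanged.
unchanged = called_strike_probability(0.5, 1.0, [])
# A positive effect of 0.3 on the log-odds scale raises the forecast.
shifted = called_strike_probability(0.5, 1.0, [0.3])
print(unchanged, shifted)
```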

Finally, we return to the two pitches from the 2015 American League Wild Card game in Figure 1. Fitting our model to the 2015 data, we find that Eric Cooper was indeed much more likely to call the Keuchel pitch a strike than the Tanaka pitch (81.72% vs 62.59%). Interestingly, the forecasts from the hGAMs underpinning our model were 51.31% and 50.29%, respectively. Looking a bit further, had both catchers been replaced by the baseline catcher, our model estimates a called strike probability of 77.58% for the Keuchel pitch and 61.29% for the Tanaka pitch, indicating that Astros catcher Jason Castro’s apparent framing effect (4.14 percentage points) was slightly larger than Yankees catcher Brian McCann’s (1.30 percentage points). The rather large discrepancy between the apparent framing effects and the estimated called strike probabilities reveals that we cannot immediately attribute the difference in calls on these pitches solely to differences in the framing abilities of the catchers. Indeed, we note that the two pitches were thrown in different counts: Keuchel’s pitch was thrown in a 1–0 count and Tanaka’s was thrown in a 1–1 count. In 2015, umpires were much more likely to call strikes in a 1–0 count than they were in a 1–1 count, all else being equal. Interestingly, had the Keuchel and Tanaka pitches been thrown in the same count, our model still estimates that Cooper would be consistently more likely to call the Keuchel pitch a strike, lending some credence to disappointed Yankees fans’ claims that his strike zone enforcement favored the Astros. Ultimately, though, it is not so clear that the difference in calls on the two pitches shown in Figure 1 was driven by catcher framing as much as by random chance.
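The arithmetic behind this comparison can be checked directly from the probabilities quoted above. The short script below uses only those four numbers; the variable names are ours and purely illustrative.

```python
# Called strike probabilities (in percent) quoted in the text.
fitted = {"Keuchel": 81.72, "Tanaka": 62.59}     # with the actual catchers
baseline = {"Keuchel": 77.58, "Tanaka": 61.29}   # catcher swapped for the baseline catcher

# Apparent framing effect: fitted probability minus baseline-catcher probability.
framing_effect = {name: round(fitted[name] - baseline[name], 2) for name in fitted}

# Gap between the two pitches, with and without the actual catchers.
gap_with_catchers = round(fitted["Keuchel"] - fitted["Tanaka"], 2)
gap_at_baseline = round(baseline["Keuchel"] - baseline["Tanaka"], 2)

print(framing_effect)        # Castro (Keuchel pitch) vs McCann (Tanaka pitch)
print(gap_with_catchers)     # gap in percentage points with Castro and McCann
print(gap_at_baseline)       # gap remaining with the baseline catcher on both pitches
```

Of the 19.13 percentage point gap between the two pitches, 16.29 points remain when both catchers are replaced by the baseline catcher, which is why the gap cannot be attributed to framing alone.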

## B Model comparison with cross-validation

As mentioned in Section 3.1, Roegelle (2014), Mills (2017a), and Mills (2017b) have documented year-to-year changes in umpires’ strike zone enforcement ever since Major League Baseball began reviewing and grading umpires’ decisions in 2009. In other words, umpire tendencies are non-stationary across seasons and we cannot reasonably expect Models 4 and 5, which attempt to identify umpire-specific player effects, to forecast future umpire decisions particularly well. A potentially more appropriate way to diagnose overfitting issues would be to hold out a random subset of our 2014 data, say 10%, fit each model on the remaining 90% of the data, and assess the predictive performance on the held-out 10%. Table 7 shows the average misclassification rate and mean square error for Models 1–5 over 10 such holdout sets. The results in the table confirm our finding that Model 3 represents the best balance between model expressivity and predictive capability.

Table 7:

Hold-out misclassification rate (MISS) and mean square error (MSE) for several models.
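The repeated hold-out procedure described in this appendix can be sketched generically as follows. The `fit` callback, the 0.5 classification threshold, and the toy base-rate model are assumptions made for illustration; they stand in for Models 1–5, which are not reproduced here, and the toy data are synthetic.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, int]  # (predictor, outcome), outcome 1 = called strike, 0 = ball

def holdout_scores(rows: Sequence[Row],
                   fit: Callable[[List[Row]], Callable[[float], float]],
                   n_repeats: int = 10,
                   holdout_frac: float = 0.10,
                   seed: int = 0) -> Tuple[float, float]:
    """Average misclassification rate and mean square error over random hold-outs."""
    rng = random.Random(seed)
    n_test = max(1, round(holdout_frac * len(rows)))
    miss_rates, mses = [], []
    for _ in range(n_repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = fit(train)                       # fit on the remaining 90%
        probs = [predict(x) for x, _ in test]      # forecast the held-out 10%
        truth = [y for _, y in test]
        # A pitch is misclassified when the thresholded forecast disagrees with the call.
        miss_rates.append(mean(int((p >= 0.5) != (y == 1)) for p, y in zip(probs, truth)))
        mses.append(mean((p - y) ** 2 for p, y in zip(probs, truth)))
    return mean(miss_rates), mean(mses)

def fit_base_rate(train: List[Row]) -> Callable[[float], float]:
    """Toy stand-in model: always forecast the training-set strike rate."""
    rate = mean(y for _, y in train)
    return lambda x: rate

# Synthetic data: 30 called strikes and 70 balls with a dummy predictor.
toy_rows = [(0.0, 1)] * 30 + [(0.0, 0)] * 70
miss_rate, mse = holdout_scores(toy_rows, fit_base_rate)
print(miss_rate, mse)
```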

## C Catcher and count effects

In Section 3.2, we reported that the posterior distributions of catcher and count effects on the log-odds scale were largely supported in the interval [−1.5, 1.5]. This would indicate that a catcher’s framing effect is of roughly the same magnitude as the effect of count.

Figure 7 shows the approximate posterior densities of the count effects. Recall that these are the partial effects relative to the baseline count of 0–0. As we might expect, umpires are more likely to call strikes in 3–0 and 2–0 counts than in 0–0 counts and much less likely to call strikes in 0–2 and 1–2 counts, all else being equal. Somewhat interestingly, we find that umpires are slightly less likely to call strikes in a 3–1 count than they are in a 1–0 count.

We also see that the posterior distribution of the effect of a 3–0 count is considerably wider than that of a 0–1 count, indicating that we are much more uncertain about the former effect than the latter. This is due to the rather large disparity in the numbers of pitches taken in these counts: in our dataset, about six times as many called pitches were thrown in a 0–1 count as in a 3–0 count (37,513 versus 6,162).
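To give a sense of scale for this disparity, the sketch below computes the ratio of the two sample sizes quoted above and, under the rough heuristic that posterior spread shrinks like one over the square root of the number of observations, the implied ratio of spreads. The square-root scaling is an assumption for illustration only; the actual posterior widths also depend on the prior and the other covariates.

```python
import math

n_called_01 = 37513   # called pitches in a 0-1 count (from the text)
n_called_30 = 6162    # called pitches in a 3-0 count (from the text)

# How many times more data inform the 0-1 effect than the 3-0 effect.
sample_ratio = n_called_01 / n_called_30

# Heuristic only: if spread scales like 1/sqrt(n), the 3-0 posterior would be
# wider than the 0-1 posterior by roughly this factor.
heuristic_width_ratio = math.sqrt(sample_ratio)

print(round(sample_ratio, 2), round(heuristic_width_ratio, 2))
```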

Figure 7:

Posterior densities of the partial effect of count. Densities computed using a standard kernel estimator.

To compare the relative magnitudes of catcher and count effects on the probability scale, we return to the hypothetical matchup between batter Yasiel Puig, catcher Buster Posey, and pitcher Madison Bumgarner. Suppose that Bumgarner’s pitch is thrown in a location where the hGAM called strike probability forecast is exactly 50%. According to our model, if this pitch were hypothetically caught by the baseline catcher, Brayan Pena, the forecasted called strike probability averaged over all 93 umpires is 54%, with the difference of 4 percentage points attributable to intercept, batter, and pitcher effects. In contrast, if the same pitch had been caught by Buster Posey, our model estimates the called strike probability to be 64%, indicating that on this pitch, Posey added 10 percentage points to the forecasted called strike probability. If Posey had caught the same pitch but the count were 2–2 instead of 0–0, the forecast would be 55%. In this way, at least for this pitch, the effect of swapping the baseline catcher with Posey on a 0–0 pitch is about the same in magnitude as changing the count from 0–0 to 2–2 with Posey catching.
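To connect this example back to the log-odds scale used in Section 3.2, one can convert the three quoted forecasts to log-odds and difference them. Because the forecasts are averages over umpires on the probability scale, the resulting shifts are only approximate and should not be read as the model's estimated coefficients; the variable names are illustrative.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Umpire-averaged forecasts quoted in the text for the Bumgarner-to-Puig pitch.
pena_00 = 0.54    # baseline catcher, 0-0 count
posey_00 = 0.64   # Posey catching, 0-0 count
posey_22 = 0.55   # Posey catching, 2-2 count

# Approximate shifts on the log-odds scale implied by these forecasts.
catcher_shift = logit(posey_00) - logit(pena_00)   # swapping Pena for Posey at 0-0
count_shift = logit(posey_22) - logit(posey_00)    # moving from 0-0 to 2-2 with Posey

print(round(catcher_shift, 3), round(count_shift, 3))
```

Both implied shifts are roughly 0.4 in absolute value, comfortably inside the interval [−1.5, 1.5] and of similar magnitude to each other.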

Table 8:

Difference in forecasted called strike probabilities averaged over all umpires when Bumgarner pitches to Puig, relative to the baseline called strike probability of 54%, for various combinations of catcher and count.

Table 8 elaborates on this example and shows the estimated average called strike probability for the same pitch as a function of count and catcher. That is, we forecasted the called strike probability, averaged across umpires, on a pitch thrown by Bumgarner to Puig at a location where the hGAM forecast was 50% for many combinations of catcher and count. To highlight the relative size of the catcher and count effects, we have subtracted a baseline 54%, the called strike probability when the catcher is Pena and the count is 0–0, from all of these probabilities.

Recall that the baseline called strike probability is 54% on such a pitch. According to our model, the effect of changing the count from 0–0 to 0–2 when Posey is receiving the pitch is about the same as changing the count from 0–0 to 1–2 with the baseline catcher receiving the pitch. We note that the called strike probability forecasts for Tomas Tellis are much lower than for Hank Conger, Posey, and the baseline catcher. For instance, our model estimates that umpires on average would call this pitch a strike only 15% of the time if it were thrown in a 0–2 count and Tellis was receiving, in contrast to 40% for Conger, 36% for Posey, and 28% for Pena.

## References

Published Online: 2017-10-10

Published in Print: 2017-09-26

Citation Information: Journal of Quantitative Analysis in Sports, Volume 13, Issue 3, Pages 95–112, ISSN (Online) 1559-0410, ISSN (Print) 2194-6388.


©2017 Walter de Gruyter GmbH, Berlin/Boston.