Since 2009, Major League Baseball has been using modern pitch-tracking data to evaluate and train its umpires (Mills 2017). While the instant public availability of this data has prompted some calls for electronic automation of ball and strike calls, there is evidence that MLB’s Zone Evaluation system has led to an improvement in umpire accuracy (Davis and Lopez 2015).
The Zone Evaluation system focuses on fidelity to the rectangular front of the rule book zone, but actual strike calls in practice conform to the patterns shown in Figure 1. These plots suggest that pitches on the corners of the rule book zone are likely to be called balls, forming an accepted “consensus” strike zone that is rounded and non-rectangular. In addition, it appears that pitches off the plate away from the batter are more likely to be called strikes than pitches off the plate inside, suggesting that consensus zones differ for left- and right-handed batters. To more accurately assess umpires within the context of accepted practices, measures of strike zone accuracy can be adapted to account for these consensus zones (Roegele 2017).
However, measures of accuracy alone fail to assess umpire consistency within a game. In this paper, we propose several new ways to measure an umpire’s consistency, apart from accuracy. We relax the requirement of a rectangular zone, and we allow for variations based on the handedness of the batter. Since factors such as the style of the starting pitcher may influence the shape of an umpire’s zone from game to game, we measure consistency within a game and average over all games in the 2017 season, rather than aggregating the call data and taking a single measurement. We also investigate the relationships between consistency, accuracy, and other umpire tendencies.
Over the course of a game, an umpire establishes a region of pitches that are called strikes. Ideally, this established strike zone will have a predictable shape, and no pitches that fall inside of it will ever be called balls. In this section, we propose four metrics for assessing the consistency of calls relative to the strike zone that the umpire establishes.
Each of these metrics depends on the chosen geometry of the established strike zone. First, we consider the consequences and limitations of a simple rectangular established strike zone, and propose a refinement to address these limitations. Next, we relax our geometric assumptions to consider non-rectangular established zones, requiring only that these zones be convex.
Throughout this paper, we use publicly available data from MLB Advanced Media for the 2017 regular season (MLBAM 2018). Ball and strike data are posted as (
All of the strike zone plots in this paper show
2.1 Rectangular metrics
A natural definition for the established strike zone is the smallest rectangular region containing all the strikes. Figure 2 shows an example of these established strike zones for left- and right-handed batters for a particular MLB game. Any called ball inside these rectangles is inconsistent. For a given game, we define the one-rectangle inconsistency index IR1 as
In the game shown in Figure 2, out of 110 called balls, 2 were inconsistent to left-handed batters and 13 were inconsistent to right-handed batters, so IR1 = (2 + 13)/110 ≈ 0.136.
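The computation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own code: the data layout (a dictionary keyed by batter side), the function names, and the choice to count balls exactly on the border as inside are all assumptions, and the toy coordinates are invented rather than taken from Figure 2.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (horizontal, vertical) pitch location in feet


def bounding_rectangle(strikes: List[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned rectangle containing every called strike."""
    xs = [x for x, _ in strikes]
    zs = [z for _, z in strikes]
    return min(xs), max(xs), min(zs), max(zs)


def count_inconsistent(strikes: List[Point], balls: List[Point]) -> int:
    """Number of called balls inside the rectangle established by the strikes."""
    if not strikes:
        return 0
    x_lo, x_hi, z_lo, z_hi = bounding_rectangle(strikes)
    # Assumption: a ball exactly on the border counts as inside.
    return sum(1 for x, z in balls if x_lo <= x <= x_hi and z_lo <= z <= z_hi)


def one_rectangle_index(calls: Dict[str, Dict[str, List[Point]]]) -> float:
    """Inconsistent balls (both batter sides) divided by all called balls."""
    inconsistent = sum(
        count_inconsistent(side["strikes"], side["balls"]) for side in calls.values()
    )
    total_balls = sum(len(side["balls"]) for side in calls.values())
    return inconsistent / total_balls if total_balls else 0.0


# Toy data, invented for illustration only.
game = {
    "L": {"strikes": [(-0.5, 2.0), (0.5, 3.0)], "balls": [(0.0, 2.5), (1.5, 2.5)]},
    "R": {"strikes": [(-0.6, 1.8), (0.4, 3.2)], "balls": [(2.0, 4.0), (-1.2, 1.0)]},
}
index = one_rectangle_index(game)
```

Rectangles are built separately for each batter side and only the counts are pooled, which matches the way the example above adds the left-handed and right-handed counts before dividing by the total number of called balls.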
While the one-rectangle inconsistency index is natural to define and easy to compute, it is highly sensitive to a single outlying strike call. For example, in the right-handed plot in Figure 2, if we remove the lowest strike at (0.26, 1.27), the lower border of the established strike zone moves up to the next-lowest strike, eliminating three inconsistent balls. Had this single low strike been called a ball, the index IR1 would have been (2 + 10)/110 ≈ 0.109 instead of (2 + 13)/110 ≈ 0.136.
Another weakness of the one-rectangle index is that it can fail to account for multiple bad strike calls in the same location. Again using the right-handed plot in Figure 2, we see that eliminating the strike at (−0.93, 2.94) has no effect on IR1, because the resulting rectangle will still enclose the same number of called balls. While this strike call seems inconsistent given the five called balls around
We can mitigate these limitations in the one-rectangle inconsistency index by using more rectangles. As with the definition of IR1, the first rectangular region is the smallest one that contains all the strikes; that is, it is the rectangle determined by the smallest and largest values of both
The rectangles used to calculate IR10 are shown in Figure 3. Versus left-handed batters, rectangle R1 is the only rectangle containing called balls. However, versus right-handed batters, R1 contains 13 called balls (as above), R2 contains 11, R3 contains 7, R4 contains 1, and the remaining rectangles contain none. Therefore,
Notice that, in this example, IR10 = IR9 = ⋯ = IR4, illustrating that the value of this index eventually stabilizes, given enough rectangles. In practice, 10 rectangles suffice for most MLB game data sets.
Since Ri+1 ⊆ Ri, inconsistent balls are weighted according to how many rectangles they are contained in. Balls that are inconsistent due to a single outlying strike will only have weight 1, while egregiously bad ball calls will lie inside several rectangles, increasing their contribution to IRn. Therefore, compared to the one-rectangle index, the n-rectangle index is less sensitive to a single outlying strike call.
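The weighting described above can be illustrated with a short Python sketch. Two points are assumptions made for this illustration rather than details confirmed by the text: each successive rectangle is taken to be the bounding rectangle of the strikes left after discarding every strike on the border of the previous one, and the index is taken to be the sum of the per-rectangle ball counts divided by the total number of called balls. The function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_lo, x_hi, z_lo, z_hi)


def nested_rectangles(strikes: Sequence[Point], n: int) -> List[Rect]:
    """Up to n nested bounding rectangles, peeling border strikes each round."""
    rects: List[Rect] = []
    remaining = list(strikes)
    for _ in range(n):
        if not remaining:
            break
        xs = [x for x, _ in remaining]
        zs = [z for _, z in remaining]
        x_lo, x_hi, z_lo, z_hi = min(xs), max(xs), min(zs), max(zs)
        rects.append((x_lo, x_hi, z_lo, z_hi))
        # Assumed peeling rule: drop every strike touching the current border.
        remaining = [
            (x, z) for x, z in remaining if x_lo < x < x_hi and z_lo < z < z_hi
        ]
    return rects


def weighted_ball_count(strikes: Sequence[Point], balls: Sequence[Point], n: int) -> int:
    """Each ball contributes one unit for every rectangle that contains it."""
    return sum(
        1
        for x_lo, x_hi, z_lo, z_hi in nested_rectangles(strikes, n)
        for x, z in balls
        if x_lo <= x <= x_hi and z_lo <= z <= z_hi
    )


def n_rectangle_index(sides: Sequence[Tuple[Sequence[Point], Sequence[Point]]], n: int) -> float:
    """Assumed form: summed rectangle counts over all called balls."""
    total_balls = sum(len(balls) for _, balls in sides)
    weighted = sum(weighted_ball_count(strikes, balls, n) for strikes, balls in sides)
    return weighted / total_balls if total_balls else 0.0


# Toy data, invented for illustration only (a single batter side).
strikes = [(-1.0, 1.0), (1.0, 3.0), (-0.5, 1.5), (0.5, 2.5), (0.0, 2.0)]
balls = [(0.9, 2.0), (0.1, 2.1), (2.0, 2.0), (-2.0, 0.0)]
index_10 = n_rectangle_index([(strikes, balls)], 10)
index_1 = n_rectangle_index([(strikes, balls)], 1)
```

In the toy data, the ball near the edge lies in one rectangle while the ball near the center lies in two, so the second contributes twice as much to the index. With n = 1 the function reduces to the one-rectangle count.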
Furthermore, when several bad strike calls lie in the same location, the n-rectangle index will reflect this inconsistency, since the successive rectangles will not shrink until these strike calls are exhausted. In Figure 3, the called balls around
2.2 Convex hull metrics
Since rectangles are used to construct the inconsistency index IRn, it measures inconsistency under the assumption (stipulated in the rules of baseball) that the true strike zone is a rectangle. However, as we will see in Section 3, strike zones in practice tend to be rounded at the corners. In this section we will introduce inconsistency measures that relax the assumption of a rectangular zone. Instead, we assume that a consistent zone will have the property that any pitch landing between two called strikes will also be a called strike. In other words, we assume that the established strike zone is convex.
Given a discrete set
Using the convex hull as our established strike zone, we can define the convex hull inconsistency index ICH analogously to the one-rectangle inconsistency index. Now an inconsistent ball is one that lies within the convex hull of strikes, and ICH is given by
For example, see Figure 4. There were five inconsistent balls versus left-handed batters, and one versus right-handed batters, out of a total of 118 called balls. Therefore ICH = (5 + 1)/118 ≈ 0.051.
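A convex hull version of the count can be sketched without external libraries. The code below uses Andrew's monotone chain algorithm for the hull and a cross-product test for containment. These are standard techniques chosen for this illustration and are not necessarily what the author used. Counting balls on the hull boundary as inside is an assumption, and the toy coordinates are invented.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Hull vertices in counterclockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p is inside or on a counterclockwise convex polygon."""
    if len(hull) < 3:
        return False  # a degenerate hull encloses no area
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0 for i in range(len(hull))
    )


def convex_hull_index(sides: Sequence[Tuple[Sequence[Point], Sequence[Point]]]) -> float:
    """Balls inside the hull of strikes (per batter side) over all called balls."""
    total_balls = sum(len(balls) for _, balls in sides)
    inconsistent = 0
    for strikes, balls in sides:
        hull = convex_hull(strikes)
        inconsistent += sum(1 for b in balls if inside_hull(hull, b))
    return inconsistent / total_balls if total_balls else 0.0


# Toy data, invented for illustration only.
left = ([(-1.0, 1.0), (1.0, 1.0), (0.0, 3.0)], [(0.0, 2.0), (0.9, 2.9), (0.0, 0.0)])
right = ([(-1.0, 1.0), (1.0, 1.0), (1.0, 3.0), (-1.0, 3.0)], [(0.9, 2.9)])
index = convex_hull_index([left, right])
```

In the left-handed toy data the ball at (0.9, 2.9) sits in a corner of the bounding rectangle but outside the triangular hull, so it is not counted. This is the kind of corner pitch that the convexity assumption treats more leniently than the rectangular one.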
Like the one-rectangle index, the convex hull inconsistency index can fail to account for multiple bad strikes in the same location. It can also be unaffected by outlying strikes, depending on their location. For example, in Figure 4 versus right-handed batters, the strike at (−0.01, 1.31) has no effect on ICH; removing this point would shrink the convex hull without changing the number of called balls enclosed. However, this call seems inconsistent, given its proximity to several called balls. The problem is that a vertex of the convex hull can lie in a region populated by called balls, yet fail to enclose any. Creating smaller convex hulls inside the first (as we did to define IRn) will not address this issue.
To account for this phenomenon, we can use the locations of called balls to define a called-ball region. Instead of counting called balls within the established strike zone, we can measure the area of the overlap between the called-ball region and the convex hull of strikes.
Unlike the established strike zone, the called-ball region will typically not be convex, or even simply connected. Given a set
Let aL and aR be the areas of the intersection of the convex hull of called strikes and the α-convex hull of called balls, for left-handed and right-handed batters, respectively. We define the α-convex hull inconsistency index IACH to be a weighted average of these two areas. Let nL be the number of called pitches thrown to left-handed batters, and let nR be the number of called pitches to right-handed batters. Then
Figure 5 shows the ball and strike calls for the same game as Figure 4, along with the α-convex hull of called balls, using α = 0.7. In this case, IACH = 0.127. Notice that the called strike at (−0.01, 1.31) now has a significant effect on IACH, since it causes a large region of overlap between the convex hull and the α-convex hull, which contains the nearby called balls.
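Read as a weighted average of the two overlap areas with weights nL and nR, the index would take the form below. This expression is a reconstruction from the prose definition (areas aL and aR, pitch counts nL and nR) and should be treated as an assumption, since other weightings are conceivable.

```latex
I_{ACH} = \frac{n_L \, a_L + n_R \, a_R}{n_L + n_R}
```

Under this reading, the side of the plate that sees more called pitches contributes proportionally more to the index, and a game with no overlap on either side has IACH = 0.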
One limitation to this choice of inconsistency metric is that there is no canonical choice for the constant α. Figure 6 illustrates the issues involved. If the radius α is too small, the α-convex hull will contain isolated points and small disconnected regions. Large values of α (such as α = 0.9 in this example) will produce a single, simply-connected α-convex hull, making the called-ball region completely cover the established strike zone. Generally speaking, the larger the value of α, the tougher the metric IACH is on the umpires.
In the analysis that follows, we have chosen to use α = 0.7, based largely on qualitative inspections of various game examples, as in Figure 6. A correlation analysis can lend some empirical support to this choice. Table 1 gives the pairwise correlations for IACH computed over all 2017 regular season games using six different values of α between 0.4 and 0.9, in increments of 0.1. For these six values, the greatest correlation (0.92) is between α = 0.6 and α = 0.7, and we observe that once α exceeds 0.7, the correlations between adjacent values begin to decrease. These results support the observation that values of α in the range 0.6 ≤ α ≤ 0.7 tend to give similar values of IACH.
Table 1. Correlation matrix for IACH calculated with different values of α over all games in the 2017 season.

| | α = 0.4 | α = 0.5 | α = 0.6 | α = 0.7 | α = 0.8 | α = 0.9 |
|---|---|---|---|---|---|---|
| α = 0.4 | 1.00 | 0.85 | 0.73 | 0.63 | 0.45 | 0.35 |
| α = 0.5 | | 1.00 | 0.91 | 0.79 | 0.60 | 0.46 |
| α = 0.6 | | | 1.00 | 0.92 | 0.74 | 0.56 |
| α = 0.7 | | | | 1.00 | 0.82 | 0.62 |
| α = 0.8 | | | | | 1.00 | 0.67 |
| α = 0.9 | | | | | | 1.00 |
2.3 Statistical properties of the inconsistency metrics
All four inconsistency measures are sensitive to a single outlying called strike, and the n-rectangle and α-convex hull indices are sensitive to an egregiously bad called ball in the middle of the strike zone. This sensitivity is by design, to avoid penalizing umpires who make slightly inconsistent calls as much as those who make clearly bad calls. However, a consequence of this feature is that these metrics are also sensitive to the number of pitches called. As the number of called pitches increases, the chance that an umpire will make an egregious call increases, and once such a call is made, the inconsistency index will remain high.
Figure 7 investigates the association between number of pitches called and inconsistency index. The top row shows scatter plots of the four indices versus number of pitches called in the game, for all regular-season games with between 50 and 300 called pitches. (Only 2 of 2425 games in our sample fall outside this range.) For each index, there is a slight discernible upward trend. The correlation coefficient r is approximately 0.2 in all four cases.
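The association described above is an ordinary Pearson correlation computed after restricting to games with 50 to 300 called pitches. A minimal sketch follows. The function name, the tuple layout, and the toy values are hypothetical and are not the data behind Figure 7.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy (called pitches, inconsistency index) pairs, invented for illustration.
games: List[Tuple[int, float]] = [
    (30, 0.90), (120, 0.10), (150, 0.15), (180, 0.20), (210, 0.25),
]

# Keep only games in the range of called pitches used in the scatter plots.
kept = [(n, idx) for n, idx in games if 50 <= n <= 300]
r = pearson_r([n for n, _ in kept], [idx for _, idx in kept])
```

The same function applied to a subset of games, such as those above a chosen inconsistency threshold, gives the within-tail correlations.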
The smoothed density estimates in Figure 8 show that the distributions of IR1, IR10, ICH, and IACH are all skewed right. Such skewness may be a feature of the metrics, or it may indicate that major league umpires are, on the whole, very good at calling games consistently. To assess sensitivity in the tails of these distributions, the second row of scatter plots in Figure 7 considers only “high-inconsistency games,” where IR10 + IACH > 0.3. For these games, and for this range of pitches called, there does not appear to be a strong association between the inconsistency measures and the number of called pitches (|r| < 0.1 in all cases).
In this section we have considered two simple metrics, IR1 and ICH, along with extensions IR10 and IACH, respectively, which attempt to address deficiencies in the simple metrics. Figure 8 shows that the tails of the distributions of the extended metrics are substantially thicker, suggesting that these metrics are better at differentiating between higher levels of inconsistency.
The pairwise correlation coefficients for the four metrics are given in Table 2. The two extended metrics IR10 and IACH are correlated, but not very strongly, indicating that they measure different aspects of inconsistency. Some of the difference may be due to the shape of the strike zone that an umpire tends to call. For example, among the 79 umpires who called at least 20 games behind home plate in 2017, Chad Whitson was the 17th most consistent umpire when measured using IACH, but ranked 51st when measured using IR10. The methods that we will present in Section 3.1 reveal that Whitson’s zone tends to be quite rounded at the corners, rather than rectangular, so it is not surprising that he ranks higher when judged using the convex hull metrics. By comparison, Pat Hoberg, whose zone is somewhat less rounded, has the 6th best IR10, but ranks 34th according to IACH. See Figure 9.
Table 2. Correlation matrix for the four inconsistency indices.
As a compromise, in some of our analysis, we use the sum IR10 + IACH as a general measure of inconsistency. Notice that IR10 typically takes larger values than IACH, so the 10-rectangle metric is effectively weighted more than the α-convex hull metric in this sum. Over all 2423 games, the mean of the sum IR10 + IACH is 0.191 with standard deviation 0.142, and its median is 0.156.
3 Zone accuracy
While consistency of ball and strike calls is an important aspect of neutral officiating, it is also expected that home plate umpires conform to established definitions and practices. The official rules of baseball (MLB 2018) define the strike zone as “that area over home plate the upper limit of which is a horizontal line at the midpoint between the top of the shoulders and the top of the uniform pants, and the lower level is a line at the hollow beneath the kneecap.” It is often expedient to use the rectangular front of this pentagonal prism as a two-dimensional approximation of the rule book strike zone. The width of the front of home plate is 17 inches, establishing the horizontal limits of this rectangle. Our data has been normalized so that the vertical limits go from 1.5 to 3.5 feet above the ground. Since the rules also state that a pitch should be called a strike if “any part of the ball passes through any part of the strike zone,” we add one-half the width of a baseball to each of these limits to obtain the rectangle with opposite corners at (−0.8308, 1.3775) and (0.8308, 3.6225). This rectangle is pictured in Figure 1.
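The corner coordinates quoted above can be checked with a few lines of arithmetic. The ball diameter of 2.94 inches is an assumption here: it is the value implied by the quoted corners, since the text itself does not state a diameter.

```python
PLATE_WIDTH_IN = 17.0      # width of the front of home plate, from the text
BALL_DIAMETER_IN = 2.94    # assumption: implied by the quoted corner coordinates
ZONE_BOTTOM_FT = 1.5       # normalized vertical limits, from the text
ZONE_TOP_FT = 3.5

half_plate_ft = PLATE_WIDTH_IN / 2 / 12     # half the plate width, in feet
ball_radius_ft = BALL_DIAMETER_IN / 2 / 12  # half the ball width, in feet

# Expand every limit by one ball radius, so that any part of the ball
# touching the zone counts as a strike.
x_max = half_plate_ft + ball_radius_ft
z_min = ZONE_BOTTOM_FT - ball_radius_ft
z_max = ZONE_TOP_FT + ball_radius_ft

corners = (
    (round(-x_max, 4), round(z_min, 4)),
    (round(x_max, 4), round(z_max, 4)),
)
```

Rounded to four decimal places, these values agree with the corners (−0.8308, 1.3775) and (0.8308, 3.6225) given above.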
As Figure 1 illustrates, the rectangular rule-book strike zone differs from how the strike zone is officiated in actual games. However, spray charts of called strikes are not appropriate for estimating the borders of the called strike zone, since they show only where called strikes are likely to occur, which is biased according to where pitchers tend to throw. For example, Figure 1 indicates that called strikes occur less frequently on the inside corners than on the outside corners, but this effect could simply be a consequence of pitchers’ reluctance to throw inside.
Using a grid of one-inch squares in the plane at the front of the plate, Roegele (2018) describes a consensus strike zone determined by the squares on this grid in which pitches are more likely to be called strikes than balls. In this section, we give a more accurate method for obtaining the borders of the consensus strike zone using kernel density estimation (Venables and Ripley 2010). We can also apply this technique to calls made by individual umpires, giving ways to assess conformity and zone size.
3.1 Kernel density estimation
Let s(x, y) be the two-dimensional probability density function describing the distribution of called strikes in the plane. That is, s(x, y) gives the density of the probability that a called pitch will cross the plate at location (x, y), given that the pitch is called a strike. In order to describe the consensus zone, we would like to compute the reverse conditional probability, that is, the probability density f(x, y) that a called pitch will be called a strike, given that it crosses that plate at location (x, y). Let
Figure 10 shows the smooth contours produced using this method, along with the discrete approximation produced by the method of Roegele (2018). In addition to improving the resolution of the zone boundary, the kernel density estimation method works well for smaller samples of pitches. In particular, it can produce 50% contours for each MLB umpire’s calls over the course of a season, which can be used to describe season-long tendencies.
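As a rough illustration of how a conditional strike probability can be obtained from kernel density estimates, the sketch below applies Bayes' rule to Gaussian kernel sums over called strikes and called balls. The Gaussian kernel, the single shared bandwidth h = 0.25 feet, the function names, and the toy data are all assumptions made for this example. The estimator and bandwidth used for the figures may differ.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def kernel_sum(points: Sequence[Point], x: float, z: float, h: float) -> float:
    """Unnormalized Gaussian kernel mass that the points place at (x, z)."""
    return sum(
        math.exp(-((x - px) ** 2 + (z - pz) ** 2) / (2.0 * h * h))
        for px, pz in points
    )


def strike_probability(
    strikes: Sequence[Point],
    balls: Sequence[Point],
    x: float,
    z: float,
    h: float = 0.25,
) -> float:
    """Estimated probability that a called pitch at (x, z) is called a strike.

    With a shared bandwidth, the kernel normalizing constants and the sample
    sizes cancel in Bayes' rule, leaving a ratio of kernel sums.
    """
    k_strike = kernel_sum(strikes, x, z, h)
    k_ball = kernel_sum(balls, x, z, h)
    total = k_strike + k_ball
    return k_strike / total if total > 0 else float("nan")


def in_zone(strikes: Sequence[Point], balls: Sequence[Point], x: float, z: float) -> bool:
    """True where a strike call is at least as likely as a ball call."""
    return strike_probability(strikes, balls, x, z) >= 0.5


# Toy data, invented for illustration only.
strikes = [(0.0, 2.5), (0.2, 2.4), (-0.2, 2.6)]
balls = [(1.5, 2.5), (1.4, 2.6), (-1.5, 2.4)]
p_center = strike_probability(strikes, balls, 0.0, 2.5)
p_outside = strike_probability(strikes, balls, 1.5, 2.5)
```

Evaluating `in_zone` on a fine grid and tracing where it changes value gives an approximation to the 50% contour. Because the estimate is smooth, the boundary is not tied to the edges of a fixed grid of squares.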
For example, let X be the region in the plane bounded by the 50% contour for a particular umpire, and let C be the consensus zone described above. The symmetric difference (X∖C) ∪ (C∖X) is the set of all points lying in one zone and not in the other, and its area DS measures the extent to which the umpire’s zone deviates from the consensus zone. To illustrate the extent to which zones can conform to the consensus zone, Figure 11 shows the contour zones for the umpires with the greatest and least values of DS for the 2017 season, along with the consensus zone.
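The area of a symmetric difference can be approximated numerically by sampling cell midpoints on a grid and counting the cells that belong to exactly one of the two regions. The sketch below is one simple way to do this. The membership functions, grid extent, and step size are hypothetical, and rectangles stand in for contour regions so that the result can be checked by hand.

```python
from typing import Callable, Tuple

Region = Callable[[float, float], bool]  # membership test for a point (x, z)


def symmetric_difference_area(
    in_x: Region,
    in_c: Region,
    x_range: Tuple[float, float],
    z_range: Tuple[float, float],
    step: float = 0.02,
) -> float:
    """Approximate area of the set of points lying in exactly one region."""
    nx = int(round((x_range[1] - x_range[0]) / step))
    nz = int(round((z_range[1] - z_range[0]) / step))
    mismatched = 0
    for i in range(nx):
        x = x_range[0] + (i + 0.5) * step  # midpoint of the grid cell
        for j in range(nz):
            z = z_range[0] + (j + 0.5) * step
            if in_x(x, z) != in_c(x, z):
                mismatched += 1
    return mismatched * step * step


# Two overlapping rectangles, invented for illustration only.
def consensus(x: float, z: float) -> bool:
    return -0.8 <= x <= 0.8 and 1.5 <= z <= 3.5


def umpire(x: float, z: float) -> bool:
    return -0.6 <= x <= 1.0 and 1.5 <= z <= 3.5


area = symmetric_difference_area(umpire, consensus, (-2.0, 2.0), (0.0, 5.0))
```

Here the two rectangles differ by two strips, each 0.2 feet wide and 2 feet tall, so the exact symmetric difference area is 0.8 square feet. For curved contour regions the grid estimate converges to the true area as the step size shrinks.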
3.2 Contour-based zone accuracy and size
Given any pair of closed curves
For each umpire, the individual contour zones yield a convenient measure of an umpire’s zone size S. It has been suggested (Roegele 2017) that accuracy measurements can function as a proxy for zone size, since inaccurate umpires would tend to have larger strike zones. Our data and measurements do not provide evidence for this assertion. Figure 12 illustrates the associations between these two measures of accuracy, AC and AR, along with zone size S. The accuracy measures AC and AR are only moderately correlated, and neither is strongly associated with zone size S. The correlation between AC and S is even weaker if the influential observation with the largest zone (Doug Eddings) is removed.
4 Alternative umpire evaluations
Ball and strike calls have always been subject to the judgment of the home plate umpire. While the MLB rule book offers standards for the extent of the strike zone, Figures 1 and 10 illustrate that variations from the rule-book zone are common, and probably widely accepted. Certainly, it would represent a significant departure from current norms if umpires (perhaps aided by technology) started conforming their zones to the rule-book rectangle. Furthermore, the zone as it is called today could very well be the result of a consensus that has emerged over the years between players and umpires. Therefore, a fair evaluation system for umpires should take history and current accepted practice into account. In this section we consider how the measures of inconsistency and accuracy developed above can complement other measures of umpire performance.
4.1 Correlation and outliers
Ideally, any new metric should evaluate phenomena that previous metrics ignore, since strongly associated measures can yield redundant information. However, it is reasonable to expect that the best umpires are the best at several aspects of the job, so there are likely to be associations among different performance measurements. Figure 13 summarizes the pairwise correlation coefficients for the above measures on all MLB umpires who called at least 20 games behind the plate in the 2017 season. The four inconsistency measures IR1, IR10, ICH and IACH have been averaged over all the games called by each umpire. Notice that these inconsistency indices are correlated positively with each other and negatively with the two accuracy measures AC and AR. The other measures considered, symmetric-difference nonconformity DS, zone size S, walk rate rW, and strikeout rate rK, generally do not show strong associations with the inconsistency and accuracy measures.
The association between accuracy and inconsistency is not surprising, but it is not strong. In particular, for these umpires, average IR10 + IACH and AC have a correlation coefficient of r = −0.58. The scatterplot in Figure 14 illustrates the negative relationship between consensus zone accuracy AC and season-average inconsistency, measured as the sum IR10 + IACH of the 10-rectangle and α-convex hull indices. Notable in this graph are the outliers. For example, Carlos Torres, Tim Timmons, and Cory Blaser stand out as having above-average consistency but only average accuracy, suggesting that they would be underrated if evaluated on the basis of accuracy alone.
4.2 Principal component analysis
Any single rating system for umpires can produce a ranking of umpires. When combining several different metrics for umpire performance, we can organize the information that accounts for most of the variation between umpires by examining principal components.
Table 3 shows the coefficients for a principal component analysis (Mardia, Kent, and Bibby 1980) of the season-average inconsistency indices IR1, IR10, ICH, IACH, along with consensus accuracy AC, rule book accuracy AR, zone size S, walk rate rW, and strikeout rate rK, each normalized to have unit variance. The first two components, PC1 and PC2, account for 68% of the variation in the data. Component PC1 is dominated by the accuracy measures (positive) and inconsistency measures (negative), so it seems appropriate to designate it as “strike zone quality.” Meanwhile, component PC2 is dominated by walk rate (negative) and strikeout rate and zone size (positive), so this component measures “pitcher friendliness.”
Table 3. Principal component analysis for four measures of inconsistency, two measures of accuracy, zone size, and walk and strikeout rate.
The principal components provide a way to summarize the various umpire evaluations developed above. Using the coefficients in each column of Table 3 to form linear combinations of the normalized metrics, we obtain component scores for each umpire. Figure 15 is a scatter plot of the component scores for the first two principal components, PC1 and PC2, labeled by umpire. The most consistent and accurate umpires are those farthest to the right, while the most neutral arbiters between pitcher and batter are found near the horizontal axis, where the PC2 score is close to zero.
5 Conclusions and discussion
The above results illustrate how geometric inconsistency metrics can capture qualities of home plate umpiring that accuracy alone does not fully assess, even when accuracy is measured probabilistically by AC. We have also shown how the borders of the consensus zone and individual umpire zones can be computed using kernel density estimation, providing efficient methods for comparing umpire tendencies. Such metrics can inform the current debate on the efficacy of human umpires.
While the evidence suggests that MLB umpires are generally quite accurate and consistent, current technology is capable of providing real-time information that would take ball and strike calls out of the hands of umpires and standardize the called strike zone to the rule book rectangle. Doing so, however, would be a significant departure from current practice, and would eliminate facets of the game that arguably contribute to its appeal.
For example, teams value catchers who excel at framing pitches to make them appear as strikes. It is possible that good pitch framers cause umpires to be more inconsistent within a game (Fast 2011a). In addition, accurate pitchers who consistently throw to their catcher’s target, even beyond the margins of the zone, can receive favorable strike calls. In such ways, human variation in strike calling influences the way baseball is played, so strike zone analysis can yield insight into aspects of the game beyond umpire performance.
Much of the inconsistency and inaccuracy that we are able to measure could be due to game circumstances. For example, Walsh (2010) and Carruth (2012) use the discrete grid method to demonstrate that the called strike zone tends to be larger on 3-0 counts than on 0-2 counts. Other factors, such as the age and experience of the pitcher (Turkenkopf 2008), can influence umpire zones. Presumably, such tendencies vary from umpire to umpire, and could be studied using the tools presented here.
Questions for future investigation include the following.
- Are certain pitch types harder to call consistently?
- What factors contribute to an umpire’s strikeout rate? Do more consistent umpires show less variability before and after two strikes have been called on the hitter?
- How do umpire ratings correlate with an umpire’s public profile, as measured by press and social media mentions? Are the best umpires those you have never heard of?
- What has been the effect of the Zone Evaluation system? Have umpires improved in some aspects but not in others? Do age and years of major league service influence how the strike zone is called?
- Pitch-tracking data has been shown to be noisy (Fast 2011b; Schifman 2018). In particular, the top and bottom of the zone are estimated by the operator of the pitch-tracking system, based on each batter’s stance. Can inconsistency measures be adapted to assess these variations? Can left/right inconsistency be separated from up/down inconsistency?
- The α-convex hull and the n-rectangle indices both generalize to higher dimensions, and the three-dimensional path of a pitch can be approximated using pitch-tracking data. Can a three-dimensional inconsistency index be implemented?
- Are there analogous situations (e.g., in manufacturing) where a geometric measure of inconsistency could be applied?
For reproducibility, data and R code used for the analysis and figures in this paper are available at
The author thanks the anonymous referees for many helpful and constructive comments.
Carruth, M. 2012. The Size of the Strike Zone by Count. https://www.fangraphs.com/blogs/the-size-of-the-strike-zone-by-count/.
Davis, N. and M. Lopez. 2015. Umpires Are Less Blind Than They Used To Be. August. https://fivethirtyeight.com/features/umpires-are-less-blind-than-they-used-to-be/.
Fast, M. 2011a. Spinning Yarn: The Real Strike Zone. February. https://www.baseballprospectus.com/news/article/12965/spinning-yarn-the-real-strike-zone/.
Fast, M. 2011b. Spinning Yarn: The Real Strike Zone, Part 2. June. https://www.baseballprospectus.com/news/article/14098/spinning-yarn-the-real-strike-zone-part-2/.
Mardia, K. V., J. T. Kent, and J. M. Bibby. 1980. Multivariate Analysis. Cambridge, MA: Academic Press. ISBN: 0124712525.
Mills, B. 2017. “Technological Innovations in Monitoring and Evaluation: Evidence of Performance Impacts Among Major League Baseball Umpires.” Labour Economics 46(C):189–99.
MLB. 2018. Official Baseball Rules. http://mlb.mlb.com/documents/0/8/0/268272080/2018_Official_Baseball_Rules.pdf.
MLBAM. 2018. MLB Advanced Media Gameday Data. http://gd2.mlb.com/components/game/mlb.
Pateiro-López, B. and A. Rodríguez-Casal. 2010. “Generalizing the Convex Hull of a Sample: The R Package Alphahull.” Journal of Statistical Software 34(5):1–28.
Roegele, J. 2017. Midseason 2017 Strike Zone Review. July. https://www.fangraphs.com/tht/midseason-2017-strike-zone-review/.
Roegele, J. 2018. The 2017 Strike Zone. March. https://www.fangraphs.com/tht/the-2017-strike-zone/.
Schifman, G. 2018. The Lurking Error in Statcast Pitch Data. March. https://www.fangraphs.com/tht/the-lurking-error-in-statcast-pitch-data/.
Turkenkopf, D. 2008. A Strike Is a Strike, Right? https://www.beyondtheboxscore.com/2008/4/24/459913/a-strike-is-a-strike-right.
Venables, W. N. and B. D. Ripley. 2010. Modern Applied Statistics with S. New York, NY: Springer.
Walsh, J. 2010. The Compassionate Umpire. https://www.fangraphs.com/tht/the-compassionate-umpire/.