
Journal of Quantitative Analysis in Sports

An official journal of the American Statistical Association

Editor-in-Chief: Steve Rigdon, PhD

4 Issues per year


CiteScore 2017: 0.67

SCImago Journal Rank (SJR) 2017: 0.290
Source Normalized Impact per Paper (SNIP) 2017: 0.853

Online
ISSN
1559-0410
Volume 14, Issue 4


New metrics for evaluating home plate umpire consistency and accuracy

David J. Hunter
Published Online: 2018-09-25 | DOI: https://doi.org/10.1515/jqas-2018-0061

Abstract

The availability of pitch-tracking data has led to increased scrutiny of Major League Baseball umpires. While many studies have attempted to rate umpires based on their conformity to the rule book strike zone, players and managers tend to accept deviations from this zone, provided that umpires establish consistent zones within a game. Using tools from computational geometry, we propose new metrics for assessing the consistency and accuracy of an umpire’s ball and strike calls over the course of a game. We apply these metrics to pitch-tracking data on all ball and strike calls made during the 2017 MLB regular season, giving some characterizations of the variation in performance of MLB umpires. This analysis demonstrates that measures of consistency can complement current accuracy-based evaluations of umpires.

Keywords: α-convex hull; convex hull; kernel density estimation; principal component analysis

1 Introduction

Since 2009, Major League Baseball has been using modern pitch-tracking data to evaluate and train its umpires (Mills 2017). While the instant public availability of this data has prompted some calls for electronic automation of ball and strike calls, there is evidence that MLB’s Zone Evaluation system has led to an improvement in umpire accuracy (Davis and Lopez 2015).

The Zone Evaluation system focuses on fidelity to the rectangular front of the rule book zone, but actual strike calls in practice conform to the patterns shown in Figure 1. These plots suggest that pitches on corners of the rule book zone are likely to be called balls, forming an accepted “consensus” strike zone that is rounded and non-rectangular. In addition, it appears that pitches off the plate away from the batter are more likely to be called strikes than pitches off the plate inside, suggesting that consensus zones differ for left- and right-handed batters. To more accurately assess umpires within the context of accepted practices, measures of strike zone accuracy can be adapted to account for these consensus zones (Roegele 2017).

However, measures of accuracy alone fail to assess umpire consistency within a game. In this paper, we propose several new ways to measure an umpire’s consistency, apart from accuracy. We relax the requirement of a rectangular zone, and we allow for variations based on the handedness of the batter. Since factors such as the style of the starting pitcher may influence the shape of an umpire’s zone from game to game, we measure consistency within a game and average over all games in the 2017 season, rather than aggregating the call data and taking a single measurement. We also investigate the relationships between consistency, accuracy, and other umpire tendencies.

Figure 1:

All ball (blue) and strike (red) calls made in the 2017 MLB season, for left- and right-handed batters, from the umpire’s perspective. The rectangle indicates the rule book strike zone. Vertical positions have been scaled based on the height and stance of the batter.

2 Inconsistency

Over the course of a game, an umpire establishes a region of pitches that are called strikes. Ideally, this established strike zone will have a predictable shape, and no pitches that fall inside of it will ever be called balls. In this section, we propose four metrics for assessing the consistency of calls relative to the strike zone that the umpire establishes.

Each of these metrics depends on the chosen geometry of the established strike zone. First, we consider the consequences and limitations of a simple rectangular established strike zone, and propose a refinement to address these limitations. Next, we relax our geometric assumptions to consider non-rectangular established zones, requiring only that these zones be convex.

Throughout this paper, we use publicly-available data from MLB Advanced Media for the 2017 regular season (MLBAM 2018). Ball and strike data are posted as (px, pz) pairs, indicating the horizontal and vertical position (in feet) of the ball as it crosses the front of home plate, where the center of the plate at ground level corresponds to the point (0, 0). Since the vertical limits of the strike zone depend on the height and stance of the batter, MLBAM also provides parameters sz_top and sz_bot, which estimate the top and bottom of the strike zone for each batter. We use these parameters to normalize the vertical positions pz so that the top of the zone corresponds to pz = 3.5 and the bottom of the zone corresponds to pz = 1.5:

$$ \text{normalized } \mathtt{pz} = \frac{2(\mathtt{pz} - \mathtt{sz\_top})}{\mathtt{sz\_top} - \mathtt{sz\_bot}} + 3.5 $$

All of the strike zone plots in this paper show px on the horizontal axis and this normalized value of pz on the vertical axis.
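The normalization above can be sketched in a few lines of Python. The function name `normalize_pz` is a hypothetical label chosen for illustration; only the parameter names `pz`, `sz_top`, and `sz_bot` come from the data description.

```python
def normalize_pz(pz, sz_top, sz_bot):
    """Rescale a vertical pitch location so that sz_top maps to 3.5
    and sz_bot maps to 1.5, following the formula in the text."""
    if sz_top <= sz_bot:
        raise ValueError("sz_top must exceed sz_bot")
    return 2.0 * (pz - sz_top) / (sz_top - sz_bot) + 3.5
```

A pitch at the batter's `sz_top` maps to 3.5, a pitch at `sz_bot` maps to 1.5, and everything else is interpolated linearly, so zones for batters of different heights can be overlaid on one plot.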

2.1 Rectangular metrics

A natural definition for the established strike zone is the smallest rectangular region containing all the strikes. Figure 2 shows an example of these established strike zones for left- and right-handed batters for a particular MLB game. Any called ball inside these rectangles is inconsistent. For a given game, we define the one-rectangle inconsistency index IR1 as

$$ I_{R_1} = \frac{\text{number of inconsistent balls}}{\text{total number of called balls}}. $$

Figure 2:

The smallest rectangle containing all the strikes. Balls are drawn as blue circles, and strikes as red diamonds. Note that the center of the circle or diamond must lie on or inside the rectangle to be considered inside the rectangular region.

In the game shown in Figure 2, out of 110 called balls, 2 were inconsistent to left-handed batters and 13 were inconsistent to right-handed batters, so IR1 = (2 + 13)/110 ≈ 0.136.
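As a minimal sketch of the one-rectangle index, the following Python computes the bounding rectangle of called strikes for each batter side and counts the called balls on or inside it. The data layout (a dictionary mapping a batter-side label to a pair of strike and ball coordinate lists) and all function names are assumptions made for illustration, not part of the MLBAM data format.

```python
def bounding_rectangle(strikes):
    """Smallest axis-aligned rectangle containing every called strike."""
    xs = [x for x, _ in strikes]
    zs = [z for _, z in strikes]
    return min(xs), max(xs), min(zs), max(zs)

def count_inside(rect, balls):
    """Count called balls whose centre lies on or inside the rectangle."""
    x_lo, x_hi, z_lo, z_hi = rect
    return sum(1 for x, z in balls if x_lo <= x <= x_hi and z_lo <= z <= z_hi)

def one_rectangle_index(calls_by_side):
    """calls_by_side maps a batter side (e.g. 'L', 'R') to (strikes, balls),
    each a list of (px, normalized pz) tuples (hypothetical layout)."""
    inconsistent = 0
    total_balls = 0
    for strikes, balls in calls_by_side.values():
        total_balls += len(balls)
        if strikes:
            inconsistent += count_inside(bounding_rectangle(strikes), balls)
    return inconsistent / total_balls if total_balls else 0.0
```

Pooling the inconsistent balls from both sides before dividing mirrors the worked example, where the counts 2 and 13 are summed over a common denominator of 110.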

While the one-rectangle inconsistency index is natural to define and easy to compute, it is highly sensitive to a single outlying strike call. For example, in the right-handed plot in Figure 2, if we remove the lowest strike at (0.26, 1.27), the lower border of the established strike zone moves up to the next-lowest strike, eliminating three inconsistent balls. Had this single low strike been called a ball, the index IR1 would have been (2 + 10)/110 ≈ 0.109 instead of (2 + 13)/110 ≈ 0.136.

Another weakness of the one-rectangle index is that it can fail to account for multiple bad strike calls in the same location. Again using the right-handed plot in Figure 2, we see that eliminating the strike at (−0.93, 2.94) has no effect on IR1, because the resulting rectangle will still enclose the same number of called balls. While this strike call seems inconsistent given the five called balls around px = −0.6, the measure IR1 fails to reflect this inconsistency.

We can mitigate these limitations in the one-rectangle inconsistency index by using more rectangles. As with the definition of IR1, the first rectangular region is the smallest one that contains all the strikes; that is, it is the rectangle determined by the smallest and largest values of both px and pz for the set of called strikes. The second rectangle is then determined by the second-smallest and second-largest values of these coordinates. Continuing in this manner, taking the ith smallest and ith largest coordinates, we can form rectangles R1, R2, …, Rn for both left- and right-handed hitters, for some choice of n. (Once i becomes large enough to exhaust all of the called strikes, take Ri to be the empty set.) Let s(i) be the number of called balls inside the two Ri’s. Define the n-rectangle inconsistency index as

$$ I_{R_n} = \frac{s(1) + s(2) + \cdots + s(n)}{\text{total number of called balls}} $$

The rectangles used to calculate IR10 are shown in Figure 3. Versus left-handed batters, rectangle R1 is the only rectangle containing called balls. However, versus right-handed batters, R1 contains 13 called balls (as above), R2 contains 10, R3 contains 6, R4 contains 1, and the remaining rectangles contain none. Therefore,

$$ I_{R_{10}} = \frac{(2+13) + (0+10) + (0+6) + (0+1)}{110} \approx 0.29. $$

Figure 3:

Successive rectangles enclose inconsistent balls. The more inconsistent called balls lie within more rectangles.

Notice that, in this example, IR10 = IR9 = ⋯ = IR4, illustrating that the value of this index eventually stabilizes, given enough rectangles. In practice, 10 rectangles is plenty for most MLB game data sets.

Since Ri+1 ⊆ Ri, inconsistent balls are weighted according to how many rectangles they are contained in. Balls that are inconsistent due to a single outlying strike will only have weight 1, while egregiously bad ball calls will lie inside several rectangles, increasing their contribution to IRn. Therefore, compared to the one-rectangle index, the n-rectangle index is less sensitive to a single outlying strike call.

Furthermore, when several bad strike calls lie in the same location, the n-rectangle index will reflect this inconsistency, since the successive rectangles will not shrink until these strike calls are exhausted. In Figure 3, the called balls around px = −0.6 are weighted more heavily because of the strike at (−0.93, 2.94), in contrast to the situation in Figure 2, where we saw that this strike had no effect on the one-rectangle index.
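The nested-rectangle construction can be sketched as follows. This is one reading of the definition, under the assumption that Ri is taken to be empty as soon as the ith smallest coordinate would pass the ith largest; the data layout and function names are hypothetical.

```python
def nth_rectangle(strikes, i):
    """Rectangle R_i spanned by the i-th smallest and i-th largest px and pz
    among called strikes, or None once the strikes are exhausted."""
    m = len(strikes)
    if 2 * i - 1 > m:  # the i-th smallest would pass the i-th largest
        return None
    xs = sorted(x for x, _ in strikes)
    zs = sorted(z for _, z in strikes)
    return xs[i - 1], xs[m - i], zs[i - 1], zs[m - i]

def n_rectangle_index(calls_by_side, n=10):
    """calls_by_side maps a batter side to (strikes, balls), each a list of
    (px, normalized pz) tuples (hypothetical layout)."""
    total_balls = sum(len(balls) for _, balls in calls_by_side.values())
    if total_balls == 0:
        return 0.0
    weighted = 0
    for i in range(1, n + 1):
        for strikes, balls in calls_by_side.values():
            rect = nth_rectangle(strikes, i)
            if rect is None:
                continue
            x_lo, x_hi, z_lo, z_hi = rect
            # s(i): called balls on or inside R_i for this batter side
            weighted += sum(
                1 for x, z in balls if x_lo <= x <= x_hi and z_lo <= z <= z_hi
            )
    return weighted / total_balls
```

Because a ball inside several nested rectangles is counted once per rectangle, the index can exceed 1 for a sufficiently erratic game; with `n=1` the function reduces to the one-rectangle index.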

2.2 Convex hull metrics

Since rectangles are used to construct the inconsistency index IRn, it measures inconsistency under the assumption (stipulated in the rules of baseball) that the true strike zone is a rectangle. However, as we will see in Section 3, strike zones in practice tend to be rounded at the corners. In this section we will introduce inconsistency measures that relax the assumption of a rectangular zone. Instead, we assume that a consistent zone will have the property that any pitch landing between two called strikes will also be a called strike. In other words, we assume that the established strike zone is convex.

Given a discrete set P ⊂ ℝ² representing the locations of called strikes during a game, there is a natural geometric definition for the established strike zone, namely, the convex hull of P. We can define the convex hull S as the intersection of all closed half planes that contain P:

$$ S = \bigcap_{\{H_l \,:\, H_l \cap P = \emptyset\}} H_l^{c}, $$

where H_l is an open half-plane bounded by the line l, and H_l^c denotes its complement.

Using the convex hull as our established strike zone, we can define the convex hull inconsistency index ICH analogously to the one-rectangle inconsistency index. Now an inconsistent ball is one that lies within the convex hull of strikes, and ICH is given by

$$ I_{CH} = \frac{\text{number of inconsistent balls}}{\text{total number of called balls}}. $$

For example, see Figure 4. There were five inconsistent balls versus left-handed batters, and one versus right-handed batters, out of a total of 118 called balls. Therefore ICH = (5 + 1)/118 ≈ 0.051.
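A sketch of the convex hull index in plain Python follows. It builds the hull with Andrew's monotone chain algorithm (a standard technique swapped in here for illustration; the source does not say which hull routine was used) and assumes, as with the rectangles, that a ball on the hull boundary counts as inside. All names and the data layout are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    """True if p lies on or inside the counter-clockwise polygon `hull`."""
    if len(hull) < 3:
        return False  # degenerate hull: treated as enclosing nothing
    k = len(hull)
    return all(cross(hull[i], hull[(i + 1) % k], p) >= 0 for i in range(k))

def convex_hull_index(calls_by_side):
    """calls_by_side maps a batter side to (strikes, balls), each a list of
    (px, normalized pz) tuples (hypothetical layout)."""
    inconsistent = 0
    total_balls = 0
    for strikes, balls in calls_by_side.values():
        hull = convex_hull(strikes)
        total_balls += len(balls)
        inconsistent += sum(1 for b in balls if inside_hull(hull, b))
    return inconsistent / total_balls if total_balls else 0.0
```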

Like the one-rectangle index, the convex hull inconsistency index can fail to account for multiple bad strikes in the same location. It can also be unaffected by outlying strikes, depending on their location. For example, in Figure 4 versus right-handed batters, the strike at (−0.01, 1.31) has no effect on ICH; removing this point would shrink the convex hull without changing the number of called balls enclosed. However, this call seems inconsistent, given its proximity to several called balls. The problem is that a vertex of the convex hull can lie in a region populated by called balls, yet fail to enclose any. Creating smaller convex hulls inside the first (as we did to define IRn) will not address this issue.

To account for this phenomenon, we can use the locations of called balls to define a called-ball region. Instead of counting called balls within the established strike zone, we can measure the area of the overlap between the called-ball region and the convex hull of strikes.

Figure 4:

The established strike zone is the convex hull of called strikes.

Unlike the established strike zone, the called-ball region will typically not be convex, or even simply connected. Given a set Q ⊂ ℝ² representing the locations of called balls during a game, and given some radius α > 0, define

$$ X = \bigcap_{\{B_{x,\alpha} \,:\, B_{x,\alpha} \cap Q = \emptyset\}} B_{x,\alpha}^{c}, $$

where B_{x,α} is the open disk of radius α centered at the point x, and B_{x,α}^c denotes its complement in the plane. The region X, which will serve as our called-ball region, is called the α-convex hull of Q (Pateiro-López and Rodríguez-Casal 2010). Note that the α-convex hull is not convex, in general.
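The definition translates into a membership test: a point lies in the α-convex hull of the called balls unless some open disk of radius α covers the point while containing no called ball. The sketch below approximates that test by trying disk centres on a finite grid, so it can only over-estimate the region; it is an illustration of the definition, not the exact algorithm of the cited reference. The function name and the `step` parameter are assumptions.

```python
import math

def in_alpha_convex_hull(p, balls, alpha, step=0.05):
    """Approximate test of whether point p lies in the alpha-convex hull of
    `balls`, using a grid of candidate disk centres with spacing `step`."""
    k = int(math.ceil(alpha / step))
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            cx, cy = p[0] + i * step, p[1] + j * step
            if math.hypot(cx - p[0], cy - p[1]) >= alpha:
                continue  # the open disk centred here does not contain p
            if all(math.hypot(cx - qx, cy - qy) >= alpha for qx, qy in balls):
                return False  # a ball-free open disk covers p
    return True
```

Evaluating this predicate on a fine grid of points, together with a test for membership in the convex hull of strikes, gives one way to approximate the overlap areas used in the index defined next.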

Let aL and aR be the areas of the intersection of the convex hull of called strikes and the α-convex hull of called balls, for left-handed and right-handed batters, respectively. We define the α-convex hull inconsistency index IACH to be a weighted average of these two areas. Let nL be the number of called pitches thrown to left-handed batters, and let nR be the number of called pitches to right-handed batters. Then

$$ I_{ACH} = \frac{n_L a_L + n_R a_R}{n_L + n_R} $$

Figure 5 shows the ball and strike calls for the same game as Figure 4, along with the α-convex hull of called balls, using α = 0.7. In this case, IACH = 0.127. Notice that the called strike at (−0.01, 1.31) now has a significant effect on IACH, since it causes a large region of overlap between the convex hull and the α-convex hull, which contains the nearby called balls.

Figure 5:

The established strike zone is the convex hull of called strikes (in red), and the called-ball region is the α-convex hull of balls (in blue), where α = 0.7.

One limitation to this choice of inconsistency metric is that there is no canonical choice for the constant α. Figure 6 illustrates the issues involved. If the radius α is too small, the α-convex hull will contain isolated points and small disconnected regions. Large values of α (such as α = 0.9 in this example) will produce a single, simply-connected α-convex hull, making the called-ball region completely cover the established strike zone. Generally speaking, the larger the value of α, the tougher the metric IACH is on the umpires.

Figure 6:

Six different called-ball regions (α-convex hulls) for different choices of α.

In the analysis that follows, we have chosen to use α = 0.7, based largely on qualitative inspections of various game examples, as in Figure 6. A correlation analysis can lend some empirical support to this choice. Table 1 gives the pairwise correlations for IACH computed over all 2017 regular season games using six different values of α between 0.4 and 0.9, in increments of 0.1. For these six values, the greatest correlation is between α = 0.6 and α = 0.7, and we observe that once α exceeds 0.7, the correlations between adjacent values begin to decrease. These results confirm the observation that choosing α in the range 0.6 ≤ α ≤ 0.7 tends to give similar measures of IACH.

Table 1:

Correlation matrix for IACH calculated with different values of α over all games in the 2017 season.

2.3 Statistical properties of the inconsistency metrics

All four inconsistency measures are sensitive to a single outlying called strike, and the n-rectangle and α-convex hull indices are sensitive to an egregiously bad called ball in the middle of the strike zone. This sensitivity is by design, to avoid penalizing umpires who make slightly inconsistent calls as much as those who make clearly bad calls. However, a consequence of this feature is that these metrics are also sensitive to the number of pitches called. As the number of called pitches increases, the chance that an umpire will make an egregious call increases, and once such a call is made, the inconsistency index will remain high.

Figure 7 investigates the association between number of pitches called and inconsistency index. The top row shows scatter plots of the four indices versus number of pitches called in the game, for all regular-season games with between 50 and 300 called pitches. (Only 2 of 2425 games in our sample fall outside this range.) For each index, there is a slight discernible upward trend. The correlation coefficient r is approximately 0.2 in all four cases.

Figure 7:

Scatterplots of the four inconsistency indices, IR1, IR10, ICH, and IACH, versus the number of called pitches in the game. The top row shows the data for all games with between 50 and 300 called pitches, while the bottom row includes only games with high inconsistency indices. The blue curves show the smoothed conditional means.

The smoothed density estimates in Figure 8 show that the distributions of IR1, IR10, ICH, and IACH are all skewed right. Such skewness may be a feature of the metrics, or it may indicate that major league umpires are, on the whole, very good at calling games consistently. To assess sensitivity in the tails of these distributions, the second row of scatter plots in Figure 7 considers only “high-inconsistency games,” where IR10 + IACH > 0.3. For these games, and for this range of pitches called, there does not appear to be a strong association between the inconsistency measures and the number of called pitches (|r| < 0.1 in all cases).

Figure 8:

Smoothed density estimates of the four inconsistency indices computed on 2423 games.

In this section we have considered two simple metrics, IR1 and ICH, along with extensions IR10 and IACH, respectively, which attempt to address deficiencies in the simple metrics. Figure 8 shows that the tails of the distributions of the extended metrics are substantially thicker, suggesting that these metrics are better at differentiating between higher levels of inconsistency.

The pairwise correlation coefficients for the four metrics are given in Table 2. The two extended metrics IR10 and IACH are correlated, but not very strongly, indicating that they measure different aspects of inconsistency. Some of the difference may be due to the shape of the strike zone that an umpire tends to call. For example, among the 79 umpires who called at least 20 games behind home plate in 2017, Chad Whitson was the 17th most consistent umpire when measured using IACH, but ranked 51st when measured using IR10. The methods that we will present in Section 3.1 reveal that Whitson’s zone tends to be quite rounded at the corners, rather than rectangular, so it is not surprising that he ranks higher when judged using the convex hull metrics. By comparison, Pat Hoberg, whose zone is somewhat less rounded, has the 6th best IR10, but ranks 34th according to IACH. See Figure 9.

Table 2:

Correlation matrix for the four inconsistency indices.

Figure 9:

Zone tendencies for Chad Whitson and Pat Hoberg. Whitson is more consistent according to IACH, while Hoberg is more consistent according to IR10. The method for constructing these contours is discussed in Section 3.1.

As a compromise, in some of our analysis, we use the sum IR10 + IACH as a general measure of inconsistency. Notice that IR10 typically takes larger values than IACH, so the 10-rectangle metric is effectively weighted more than the α-convex hull metric in this sum. Over all 2423 games, the mean of the sum IR10 + IACH is 0.191 with standard deviation 0.142, and its median is 0.156.

3 Zone accuracy

While consistency of ball and strike calls is an important aspect of neutral officiating, it is also expected that home plate umpires conform to established definitions and practices. The official rules of baseball (MLB 2018) define the strike zone as “that area over home plate the upper limit of which is a horizontal line at the midpoint between the top of the shoulders and the top of the uniform pants, and the lower level is a line at the hollow beneath the kneecap.” It is often expedient to use the rectangular front of this pentagonal prism as a two-dimensional approximation of the rule book strike zone. The width of the front of home plate is 17 inches, establishing the horizontal limits of this rectangle. Our data has been normalized so that the vertical limits go from 1.5 to 3.5 feet above the ground. Since the rules also state that a pitch should be called a strike if “any part of the ball passes through any part of the strike zone,” we add one-half the width of a baseball to each of these limits to obtain the rectangle with opposite corners at (−0.8308, 1.3775) and (0.8308, 3.6225). This rectangle is pictured in Figure 1.
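The corner coordinates quoted above can be reproduced with a short calculation. The ball diameter of 2.94 inches used here is an assumption chosen because it reproduces the stated corners; the text itself only says "one-half the width of a baseball". The constant and function names are hypothetical.

```python
PLATE_WIDTH_IN = 17.0     # width of the front of home plate, from the text
BALL_DIAMETER_IN = 2.94   # assumed value; not stated in the text

def rule_book_rectangle(zone_bottom_ft=1.5, zone_top_ft=3.5):
    """Opposite corners (in feet) of the rule-book rectangle, widened by one
    ball radius on every side."""
    half_plate_ft = PLATE_WIDTH_IN / 2.0 / 12.0
    radius_ft = BALL_DIAMETER_IN / 2.0 / 12.0
    lower_left = (-(half_plate_ft + radius_ft), zone_bottom_ft - radius_ft)
    upper_right = (half_plate_ft + radius_ft, zone_top_ft + radius_ft)
    return lower_left, upper_right
```

Under this assumption the ball radius is 0.1225 feet, so the half-width is 8.5/12 + 0.1225 ≈ 0.8308 feet and the vertical limits are 1.5 − 0.1225 = 1.3775 and 3.5 + 0.1225 = 3.6225 feet.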

As Figure 1 illustrates, the rectangular rule-book strike zone differs from how the strike zone is officiated in actual games. However, spray charts of called strikes are not appropriate for estimating the borders of the called strike zone, since they show only where called strikes are likely to occur, which is biased according to where pitchers tend to throw. For example, Figure 1 indicates that called strikes occur less frequently on the inside corners than on the outside corners, but this effect could simply be a consequence of pitchers’ reluctance to throw inside.

Using a grid of one-inch squares in the plane at the front of the plate, Roegele (2018) describes a consensus strike zone determined by the squares on this grid in which pitches are more likely to be called strikes than balls. In this section, we give a more accurate method for obtaining the borders of the consensus strike zone using kernel density estimation (Venables and Ripley 2010). We can also apply this technique to calls made by individual umpires, giving ways to assess conformity and zone size.

3.1 Kernel density estimation

Let s(x, y) be the two-dimensional probability density function describing the distribution of called strikes in the plane. That is, s(x, y) gives the density of the probability that a called pitch will cross the plate at location (x, y), given that the pitch is called a strike. In order to describe the consensus zone, we would like to compute the reverse conditional probability, that is, the probability f(x, y) that a called pitch will be called a strike, given that it crosses the plate at location (x, y). Let ŝ(x, y) be a two-dimensional kernel density estimate computed on the (px, pz) coordinates of called strikes, and let ĉ(x, y) be a two-dimensional kernel density estimate computed on the coordinates of all called pitches. Then by Bayes’ theorem, an estimate for f(x, y) is given by

$$ \hat{f}(x, y) = \frac{\hat{p}\,\hat{s}(x, y)}{\hat{c}(x, y)}, $$

where p̂ is the proportion of called pitches that are strikes. The 50% contour of f̂(x, y) will then be the border of the consensus zone.
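The Bayes estimate can be sketched with a hand-rolled Gaussian kernel density estimate. The isotropic Gaussian kernel, the single shared bandwidth `h`, and its default value are all assumptions made for illustration; the text does not state which kernel or bandwidth was used. Function names are hypothetical.

```python
import math

def gaussian_kde_2d(points, h):
    """Return a function (x, y) -> density estimate built from `points`
    with an isotropic Gaussian kernel of bandwidth h."""
    n = len(points)
    norm = 1.0 / (2.0 * math.pi * h * h * n)
    def density(x, y):
        return norm * sum(
            math.exp(-((x - px) ** 2 + (y - pz) ** 2) / (2.0 * h * h))
            for px, pz in points
        )
    return density

def strike_probability(strikes, balls, h=0.2):
    """Return f_hat(x, y) = p_hat * s_hat(x, y) / c_hat(x, y)."""
    called = strikes + balls
    s_hat = gaussian_kde_2d(strikes, h)   # density of called strikes
    c_hat = gaussian_kde_2d(called, h)    # density of all called pitches
    p_hat = len(strikes) / len(called)    # overall called-strike proportion
    def f_hat(x, y):
        c = c_hat(x, y)
        return p_hat * s_hat(x, y) / c if c > 0 else float("nan")
    return f_hat
```

With the same bandwidth in both estimates, the ratio reduces to the kernel-weighted share of strikes near (x, y), so it stays between 0 and 1; the consensus zone is then the set of locations where it is at least 0.5.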

Figure 10 shows the smooth contours produced using this method, along with the discrete approximation produced by the method of Roegele (2018). In addition to improving the resolution of the zone boundary, the kernel density estimation method will work well for smaller samples of pitches. In particular, the kernel density estimation method can produce 50% contours for each MLB umpire’s calls over the course of a season, which can be used to describe season-long tendencies.

Figure 10:

The smooth curves are the boundaries of the consensus zones computed using kernel density estimation. The shaded squares show the result of computing the consensus zone using the discrete method of (Roegele 2018).

For example, let X be the region in the plane bounded by the 50% contour for a particular umpire, and let C be the consensus zone described above. The symmetric difference (X∖C) ∪ (C∖X) is the set of all points lying in one zone and not in the other, and its area DS measures the extent to which the umpire’s zone deviates from the consensus zone. To illustrate the extent to which zones can conform to the consensus zone, Figure 11 shows the contour zones for the umpires with the greatest and least values of DS for the 2017 season, along with the consensus zone.
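One simple way to approximate the area of a symmetric difference is to evaluate both zones on a fine grid and count the cells that fall in exactly one of them. The sketch below assumes each zone is supplied as a membership predicate; this representation, the function name, and the grid step are illustrative choices rather than the method used in the text.

```python
def symmetric_difference_area(in_x, in_c, bounds, step=0.01):
    """Approximate the area of (X \\ C) U (C \\ X) by midpoint sampling.

    in_x, in_c: functions (x, z) -> bool giving membership in each zone.
    bounds: (x_lo, x_hi, z_lo, z_hi), a box that contains both zones.
    """
    x_lo, x_hi, z_lo, z_hi = bounds
    nx = int(round((x_hi - x_lo) / step))
    nz = int(round((z_hi - z_lo) / step))
    cells = 0
    for i in range(nx):
        x = x_lo + (i + 0.5) * step
        for j in range(nz):
            z = z_lo + (j + 0.5) * step
            if in_x(x, z) != in_c(x, z):  # in exactly one of the two zones
                cells += 1
    return cells * step * step
```

For two unit squares offset horizontally by 0.5, each square contributes a 0.5-by-1 strip that the other lacks, so the estimate should be close to 1.0.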

Figure 11:

Measured by symmetric difference with the consensus zone (in gray), Stu Scheurwater had the most conforming zone, while Rob Drake’s zone was the least conforming.

3.2 Contour-based zone accuracy and size

Given any pair of closed curves 𝒵l,𝒵r representing the boundaries of strike zones versus left- and right-handed batters, and any set of called pitches, let A𝒵 denote the proportion of correctly-called pitches, based on the strike zones 𝒵l and 𝒵r. Let Cl and Cr be the 2017 consensus zones computed using the kernel density estimation method, and let R = Rl = Rr be the rule-book rectangle described above. For each MLB umpire who called at least 20 games behind home plate in 2017, we compute the umpire’s consensus accuracy AC and rule-book accuracy AR.
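The accuracy measure is the share of called pitches whose call agrees with the chosen zone. A minimal sketch, assuming each pitch is a tuple `(px, pz, stand, call)` with `stand` in {'L', 'R'} and `call` in {'S', 'B'} (a hypothetical layout), and using the rule-book rectangle from Section 3 as the example zone:

```python
def zone_accuracy(pitches, in_zone_left, in_zone_right):
    """Proportion of called pitches whose call matches the zone for the
    batter's side. Zones are given as (px, pz) -> bool predicates."""
    correct = 0
    total = 0
    for px, pz, stand, call in pitches:
        zone = in_zone_left if stand == "L" else in_zone_right
        if (call == "S") == zone(px, pz):
            correct += 1
        total += 1
    return correct / total if total else float("nan")

def in_rule_book_rectangle(px, pz):
    """Rule-book rectangle with the corners quoted in Section 3."""
    return -0.8308 <= px <= 0.8308 and 1.3775 <= pz <= 3.6225
```

Passing the same predicate for both sides gives the rule-book accuracy, while passing separate left- and right-handed consensus-zone predicates would give the consensus accuracy.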

For each umpire, the individual contour zones yield a convenient measure of an umpire’s zone size S. It has been suggested (Roegele 2017) that accuracy measurements can function as a proxy for zone size, since inaccurate umpires would tend to have larger strike zones. Our data and measurements do not provide evidence for this assertion. Figure 12 illustrates the associations between these two measures of accuracy, AC and AR, along with zone size S. The accuracy measures AC and AR are only moderately correlated, and neither is strongly associated with zone size S. The correlation between AC and S is even weaker if the influential observation with the largest zone (Doug Eddings) is removed.

Figure 12:

Pairwise correlations of AC, AR, and S, for each full-time umpire over the 2017 season. The curves down the diagonal show the distributions of each measure. The scatterplots in the bottom row indicate that measures of accuracy are not strongly associated with zone size.

4 Alternative umpire evaluations

Ball and strike calls have always been subject to the judgment of the home plate umpire. While the MLB rule book offers standards for the extent of the strike zone, Figures 1 and 10 illustrate that variations from the rule-book zone are common, and probably widely accepted. Certainly, it would represent a significant departure from current norms if umpires (perhaps aided by technology) started conforming their zones to the rule-book rectangle. Furthermore, the zone as it is called today could very well be the result of a consensus that has emerged over the years between players and umpires. Therefore, a fair evaluation system for umpires should take history and current accepted practice into account. In this section we consider how the measures of inconsistency and accuracy developed above can complement other measures of umpire performance.

4.1 Correlation and outliers

Ideally, any new metric should evaluate phenomena that previous metrics ignore, since strongly associated measures can yield redundant information. However, it is reasonable to expect that the best umpires are the best at several aspects of the job, so there are likely to be associations among different performance measurements. Figure 13 summarizes the pairwise correlation coefficients for the above measures on all MLB umpires who called at least 20 games behind the plate in the 2017 season. The four inconsistency measures IR1, IR10, ICH and IACH have been averaged over all the games called by each umpire. Notice that these inconsistency indices are correlated positively with each other and negatively with the two accuracy measures AC and AR. The other measures considered, symmetric-difference nonconformity DS, zone size S, walk rate rW, and strikeout rate rK, generally do not show strong associations with the inconsistency and accuracy measures.

Figure 13:

Pairwise correlations for season averages of IR1, IR10, ICH, IACH, along with AC, AR, DS, S, rW, rK, for MLB umpires with at least 20 games called in 2017.

The association between accuracy and inconsistency is not surprising, but it is not strong. In particular, for these umpires, average IR10 + IACH and AC have a correlation coefficient of r = −0.58. The scatterplot in Figure 14 illustrates the negative relationship between consensus zone accuracy AC and season-average inconsistency, measured as the sum IR10 + IACH of the 10-rectangle and α-convex hull indices. Notable in this graph are the outliers. For example, Carlos Torres, Tim Timmons, and Cory Blaser stand out as having above-average consistency but only average accuracy, suggesting that they would be underrated if evaluated on the basis of accuracy alone.

Figure 14:

Scatterplot of consensus accuracy AC versus average inconsistency IR10 + IACH.

4.2 Principal component analysis

Any single rating system for umpires can produce a ranking of umpires. When combining several different metrics for umpire performance, we can organize the information that accounts for most of the variation between umpires by examining principal components.

Table 3 shows the coefficients for a principal component analysis (Mardia, Kent, and Bibby 1980) of the season-average inconsistency indices IR1, IR10, ICH, IACH, along with consensus accuracy AC, rule book accuracy AR, zone size S, walk rate rW, and strikeout rate rK, each normalized to have unit variance. The first two components, PC1 and PC2, account for 68% of the variation in the data. Component PC1 is dominated by the accuracy measures (positive) and inconsistency measures (negative), so it seems appropriate to designate it as “strike zone quality.” Meanwhile, component PC2 is dominated by walk rate (negative) and strikeout rate and zone size (positive), so this component measures “pitcher friendliness.”

Table 3:

Principal component analysis for four measures of inconsistency, two measures of accuracy, zone size, and walk and strikeout rate.

The principal components provide a way to summarize the various umpire evaluations developed above. Using the coefficients in each column of Table 3 to form linear combinations of the normalized metrics, we obtain component scores for each umpire. Figure 15 is a scatter plot of the component scores for the first two principal components, PC1 and PC2, labeled by umpire. The most consistent and accurate umpires are those farthest to the right, while the most neutral arbiters between pitcher and batter are found near the horizontal axis, where the PC2 score is close to zero.

Figure 15:

Plot of the first two principal components. The component on the horizontal axis is dominated by accuracy and consistency, designated as “strike zone quality,” where the average quality score is zero. The vertical axis component is dominated by walk rate (negative) and strikeout rate and zone size (positive), designated as “pitcher friendliness,” with neutrality between pitcher and hitter at zero.
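Written out, the component scores plotted here are linear combinations of the normalized metrics. The notation below is introduced for illustration and does not appear in the paper: x_{uj} is umpire u's season average on metric j (one of the nine metrics in Table 3), \bar{x}_j and s_j are the mean and standard deviation of metric j across umpires, and c_{jk} is the Table 3 coefficient of metric j in component k. Centering by \bar{x}_j is an assumption, though it is consistent with the average score being zero.

```latex
z_{uj} = \frac{x_{uj} - \bar{x}_j}{s_j},
\qquad
\mathrm{score}_k(u) = \sum_{j=1}^{9} c_{jk}\, z_{uj},
\quad k = 1, 2
```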

5 Conclusions and discussion

The above results illustrate how geometric inconsistency metrics can capture qualities of home plate umpiring that accuracy alone does not fully capture, even when accuracy is measured probabilistically by AC. We have also shown how the borders of the consensus zone and individual umpire zones can be computed using kernel density estimation, providing efficient methods for comparing umpire tendencies. Such metrics can inform the current debate on the efficacy of human umpires.
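One common way to obtain a zone border from kernel density estimates is sketched below. It is offered under stated assumptions and is not necessarily the procedure used in this paper: the function names, the isotropic Gaussian kernel, the bandwidth of 0.25 feet, the 0.5 probability threshold, and the toy pitch locations are all hypothetical.

```python
import math

def kernel_sum(x, z, points, h):
    """Sum of isotropic Gaussian kernels of bandwidth h centered at `points`."""
    return sum(
        math.exp(-((x - px) ** 2 + (z - pz) ** 2) / (2 * h * h))
        for px, pz in points
    )

def strike_probability(x, z, strikes, balls, h=0.25):
    """Kernel estimate of P(called strike) at plate location (x, z), in feet."""
    ks = kernel_sum(x, z, strikes, h)
    kb = kernel_sum(x, z, balls, h)
    # Far from every observed pitch both sums underflow to zero.
    return ks / (ks + kb) if ks + kb > 0 else 0.0

# Toy called pitches (made-up): strikes near the middle, balls farther out.
strikes = [(0.0, 2.5), (0.3, 2.2), (-0.4, 2.8), (0.5, 3.0), (-0.6, 2.0)]
balls = [(1.5, 2.5), (-1.6, 2.4), (0.0, 4.5), (0.2, 0.8), (1.4, 4.0)]

inside = strike_probability(0.0, 2.5, strikes, balls)
outside = strike_probability(1.5, 2.5, strikes, balls)
print(inside > 0.5, outside > 0.5)
```

Under these assumptions, the estimated zone is the set of locations where the estimate exceeds 0.5, and its border is the 0.5 level curve, which can be traced by evaluating `strike_probability` on a grid.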

While the evidence suggests that MLB umpires are generally quite accurate and consistent, current technology is capable of providing real-time information that would take ball and strike calls out of the hands of umpires and standardize the called strike zone to the rule book rectangle. Doing so, however, would be a significant departure from current practice, and would eliminate facets of the game that arguably contribute to its appeal.

For example, teams value catchers who excel at framing pitches to make them appear as strikes. It is possible that good pitch framers cause umpires to be more inconsistent within a game (Fast 2011a). In addition, accurate pitchers who consistently throw to their catcher’s target, even beyond the margins of the zone, can receive favorable strike calls. In such ways, human variation in strike calling influences the way baseball is played, so strike zone analysis can yield insight into aspects of the game beyond umpire performance.

Much of the inconsistency and inaccuracy that we are able to measure could be due to game circumstances. For example, Walsh (2010) and Carruth (2012) use the discrete grid method to demonstrate that the called strike zone tends to be larger on 3-0 counts than on 0-2 counts. Other factors, such as the age and experience of the pitcher (Turkenkopf 2008), can influence umpire zones. Presumably, such tendencies vary from umpire to umpire, and could be studied using the tools presented here.

Questions for future investigation include the following.

  • Are certain pitch types harder to call consistently?

  • What factors contribute to an umpire’s strikeout rate? Do more consistent umpires show less variability before and after two strikes have been called on the hitter?

  • How do umpire ratings correlate with an umpire’s public profile, as measured by press and social media mentions? Are the best umpires those you have never heard of?

  • What has been the effect of the Zone Evaluation system? Have umpires improved in some aspects but not in others? Do the age and years of major league service influence how the strike zone is called?

  • Pitch-tracking data has been shown to be noisy (Schifman 2018; Fast 2011b). In particular, the top and bottom of the zone are estimated by the operator of the pitch-tracking system, based on each batter’s stance. Can inconsistency measures be adapted to assess these variations? Can left/right inconsistency be separated from up/down inconsistency?

  • The α-convex hull and the n-rectangle indices both generalize to higher dimensions, and the three-dimensional path of a pitch can be approximated using pitch-tracking data. Can a three-dimensional inconsistency index be implemented?

  • Are there analogous situations (e.g., in manufacturing) where a geometric measure of inconsistency could be applied?

For reproducibility, data and R code used for the analysis and figures in this paper are available at https://github.com/djhunter/inconsistency.

Acknowledgement

The author thanks the anonymous referees for many helpful and constructive comments.

References

About the article

Published Online: 2018-09-25

Published in Print: 2018-11-27


Citation Information: Journal of Quantitative Analysis in Sports, Volume 14, Issue 4, Pages 159–172, ISSN (Online) 1559-0410, ISSN (Print) 2194-6388, DOI: https://doi.org/10.1515/jqas-2018-0061.


©2018 Walter de Gruyter GmbH, Berlin/Boston.
