Introducing Grid WAR: Rethinking WAR for Starting Pitchers

The baseball statistic"Wins Above Replacement"(WAR) has emerged as one of the most popular evaluation metrics. But it is not readily observed and tabulated; WAR is an estimate of a parameter in a vaguely defined model with all its attendant assumptions. Industry-standard models of WAR for starting pitchers from FanGraphs and Baseball Reference all assume that season-long averages are sufficient statistics for a pitcher's performance. This provides an invalid mathematical foundation for many reasons, especially because WAR should not be linear with respect to any counting statistic. To repair this defect, as well as many others, we devise a new measure, Grid WAR, which accurately estimates a starting pitcher's WAR on a per-game basis. The convexity of Grid WAR diminishes the impact of"blow-up"games and upweights exceptional games, raising the valuation of pitchers like Sandy Koufax, Whitey Ford, and Catfish Hunter who exhibit fundamental game-by-game variance. Grid WAR is designed to accurately measure past performance, but also has predictive value insofar as a pitcher's Grid WAR is better than WAR at predicting future performance. Finally, at https://gridwar.xyz we host a Shiny app which displays the Grid WAR results of each MLB game since 1952, including career, season, and game level results, which updates automatically every morning.


Introduction
Suppose we observe a starting pitcher's performance in a given season. For concreteness, let's assume the pitcher is Max Scherzer. Broadly, the analyst is interested in two similar, but distinct, questions about Scherzer's performance:
1. Historical: how good was Scherzer this season?
2. Predictive: how good will Scherzer be next season?
The fundamental objective of a baseball game is to win, so we define quality in terms of the number of wins attributable to the pitcher. Since we also wish to compare a starting pitcher's value directly to position players, we scale the contribution by subtracting the value of a replacement-level player at his position. This widely used concept is called wins above replacement (WAR) and is broadly defined as follows:
1. Historical WAR: how many wins did Scherzer's observed performance this season produce above the average performance of a replacement-level pitcher?
2. Predictive WAR: how many wins above replacement do we expect Scherzer to produce next season?
Baseball fans care immensely about historical WAR, as they want to accurately compare pitchers across history. Baseball analysts care more deeply about predictive WAR, as they want to build the best team by acquiring, promoting, and retaining the players who are expected to contribute the most. Even pitchers themselves care about WAR, as it has recently been proposed to determine arbitration salaries (Perry, 2021). Hence it is of utmost importance to estimate both historical and predictive WAR as accurately as possible.
Wins above replacement, however, is unlike fundamental observable statistics such as batting average, strikeouts, or home runs in that it cannot be directly observed. Rather, WAR is an estimate produced from a model. There are many plausible models, each producing substantially different pitcher valuations. In this paper, we explore the consequences of a certain universally adopted modeling assumption. In particular, industry-standard estimates of WAR for starting pitchers are functions of averages of a pitcher's observable counting statistics tabulated over the course of a season. We shall show that this modeling choice leads to inaccurate pitcher valuations.

Measuring pitcher performance
The fundamental idea behind estimating WAR is to capture the contribution of a player's observed performance isolated from the factors outside of his control. The way in which we measure observed performance is a crucial component of estimating his contribution to winning games.
To estimate WAR, a baseball analyst first chooses a base metric of performance, and then maps this base metric to wins. Different choices of base metric yield substantially different estimates of WAR. For instance, FanGraphs and Baseball Reference build separate WAR values from fielding-independent pitching counting stats (FIP), average runs allowed per nine innings (RA9), and expected runs allowed (xRA). FIP is a weighted average of a pitcher's isolated pitching metrics (e.g., home runs, walks, and strikeouts), and expected runs allowed estimates runs allowed from observed outcomes by removing sequencing randomness.
The difference between runs allowed and FIP, and hence between the associated WAR values, can be substantial. For example, consider an inning where a pitcher strikes out 3 batters while allowing a home run, 2 walks, and a single. Depending on the sequence of the events, the pitcher could be charged with 1 to 4 runs. If the pitcher can't affect the sequence, then it makes sense to charge the pitcher with an "expected" number of runs given the counting stats. In other words, player valuations that are based only on individual counting stats, like FIP WAR, make sense if you believe that sequencing variability is caused by chance alone.
We believe, however, that variability can have causes other than chance. In other words, we believe the version of Scherzer that starts a game and gets tagged for 6 runs in 2 innings is fundamentally different from the Scherzer who strikes out the first 6 batters. We claim this game-by-game variance is not just "bad luck" but a measurable characteristic. Some pitchers are more likely to be consistent across games, while others are more likely to vary. Furthermore, we believe a pitcher bears responsibility for his sequence of outcomes and that his performance in situations of varying leverage should be taken into account. For batters, there is enough evidence of this to generate a substantial debate (see, for example, Bill James' comparison of Altuve and Judge), but for pitchers the argument against it is nonsense.
In estimating a pitcher's historical WAR, then, runs allowed is the natural base metric. To isolate the pitcher's contribution, we ignore the performance of Scherzer's team's batters, which is axiomatically independent of Scherzer's pitching. Therefore, his runs allowed directly determines his context-neutral win contribution. It may also make sense to use a version of expected runs allowed to further isolate Scherzer's contribution from that of his fielders. Nonetheless, expected runs allowed is a model which is estimated from data, so we should be careful about relying on a black-box model which introduces an extra layer of uncertainty.
In estimating a pitcher's predictive WAR, on the other hand, we should use the most predictive base metric. Because a pitcher's historical WAR at the end of a season defines how valuable he was during the season, we want our predictive WAR to predict his next season's historical WAR as well as possible. In other words, predictive WAR is simply predicted historical WAR.

Averaging pitcher performance over the course of a season is wrong
After choosing a base measure of pitcher performance, a baseball analyst must then decide how to aggregate pitcher performance over the course of a season. In particular, current implementations of WAR from FanGraphs and Baseball Reference average pitcher performance over the entire season (e.g., FIP per inning, RA per nine innings, and xRA per out). In estimating a pitcher's historical win contribution, averaging his performance (in particular, his runs allowed) over the entire season is wrong because not all runs have the same value. To see this, think of a starting pitcher's WAR in a single game as a function R → WAR(R), where R is the number of runs allowed in that game. We expect WAR to be a decreasing function of R, because allowing more runs in a game should correspond to fewer wins above replacement. Additionally, we expect WAR to be a convex function of R (i.e., its second derivative is positive). In other words, as R increases, we expect the relative impact of allowing an extra run, given by WAR(R + 1) − WAR(R), to shrink in magnitude. For instance, allowing 2 runs instead of 1 should have a much steeper drop-off in WAR than allowing 8 runs instead of 7.
For concreteness, consider Max Scherzer's six-game stretch from June 12, 2014 through the 2014 All Star game, shown in Table 1 (ESPN, 2014). We re-arrange the order of these games to aid our explanation. Games 1 through 5 were excellent: Scherzer allowed only 5 runs in 37 innings. Over these 5 games, Scherzer accumulated about 2 wins above replacement (the Detroit Tigers did in fact win all 5). In the sixth game of this stretch, Scherzer was rocked for 10 runs in 4 innings, exiting in the 5th inning with runners on second and third and no outs. Adding this one blow-up game balloons his ERA to 3.3 and reduces his total WAR over the 6-game stretch down to about 1/2. This is a complete absurdity, as accumulated "real" WAR cannot drop from 2 to 1/2 with the addition of one game, since a game can't be lost more than once. The correct assessment would charge Scherzer with the maximum possible damage, about −0.40 wins. So Scherzer's "real" WAR over the six games should be about 1.5, which is about 1 win higher than the standard calculation. By evaluating Scherzer's performances using only the average, standard WAR significantly devalues his contributions during this six-game stretch, because it allows a single game to be "lost" more than once. The correct approach is to calculate WAR per game and sum over games.

game             1  2  3  4  5   6  total
earned runs      0  1  2  1  1  10     15
innings pitched  9  6  7  8  7   4     41

Table 1: Max Scherzer's performance over six games prior to the 2014 All Star break.
Here is another revealing, albeit hypothetical, example. Suppose a pitcher tosses 2 nearly flawless 8-inning starts, allowing 1 run in each start. These two performances are followed by a terrible 2-inning blow-up in which he gives up 8 runs. His averaged performance over the three games is a thoroughly mediocre 5 runs per nine innings, which translates to a WAR of about 0.0 when calculated using standard metrics. In contrast, it is clear that over the three starts his team will win, with near certainty, 2 of the 3, which translates to a "real" WAR of about 1.2 in total. Our hypothetical pitcher, who is great in 2 out of 3 starts and terrible in every third, would accumulate more than 12 WAR over a full season, which would be better than every pitcher in the modern era. In contrast, standard WAR metrics would suggest he be designated for assignment. What drives the difference? A poor performance can greatly affect the average, allowing a single game to be "lost" more than once. Specifically, standard metrics allow the one blow-up game to count for 2 losses, resulting in 0 WAR. The example is somewhat extreme, but not that rare.
Because we expect WAR to be a convex function, Jensen's inequality tells us that averaging a pitcher's performance over the course of the season undervalues his win contribution. Specifically, thinking of a pitcher's number of runs allowed in a complete game as a random variable R, Jensen's inequality says

WAR(E[R]) ≤ E[WAR(R)]. (1.1)

Traditional methods for computing WAR are reminiscent of the left side of Equation (1.1): average a pitcher's performance, and then compute his WAR from the resulting average scaled by the number of innings pitched.
Because winning a baseball game is determined by the runs allowed during that game, a historical WAR metric should look like the right side of Equation (1.1): compute the WAR of each of a pitcher's individual games, and then aggregate. Equation (1.1) thus provides theoretical justification that estimating historical WAR from pitcher performance averaged over the entire season is wrong.
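To make the inequality concrete, here is a toy computation in R applied to the hypothetical three-start example above. The WAR curve below is a made-up convex, decreasing function, not the estimated grid of Section 2:

```r
# Toy illustration of Equation (1.1) with a hypothetical convex, decreasing WAR curve.
war_game <- function(R) pmax(0.6 - 0.2 * R + 0.012 * R^2, -0.4)

runs <- c(1, 1, 8)          # two great starts and one blow-up
war_game(mean(runs))        # WAR of the averaged performance: about 0.07 per game
mean(war_game(runs))        # average of the per-game WARs: about 0.20 per game
```

The per-game calculation is strictly larger than the WAR of the averaged performance, exactly as Jensen's inequality dictates.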
Hence in Section 2 we devise Grid WAR, which estimates a starting pitcher's WAR in each of his games. We show that Grid WAR estimates the completely context-neutral win probability added above replacement at the point when a pitcher exits the game. This is very different from the usual win-probability-added calculation, which is completely dependent on the starting pitcher's team's offense.
In Section 3 we discuss our results. To understand the effect of averaging pitcher performance over an entire season, we compute Grid WAR on our dataset of all baseball games from 2010 to 2019 and compare it to FanGraphs WAR. Examining starting pitchers in 2019 in Section 3.1, we find that GWAR especially values games in which a starting pitcher allows few runs (e.g., zero, one, or two runs). Then, examining starting pitchers across the entirety of 2010 to 2019 in Section 3.2, and again in Section 3.4, we find that averaging pitcher performance over the course of an entire season tends, in general, to undervalue worse pitchers and overvalue better pitchers. This is because the convexity of GWAR diminishes the impact of games in which a pitcher allows many runs, and worse pitchers have more of these games.
In crafting a predictive WAR, however, it is possible that averaging pitcher performance over all of a pitcher's games has value. In particular, the appeal of averaging pitcher performance, aside from its simplicity, is that it adjusts for sequencing randomness: the timing of the runs a pitcher allows may be the result of randomness rather than a repeatable or identifiable trait. Nonetheless, in describing how Scherzer contributed to his team's winning a specific game on a given day, the number of runs he allowed that day matters immensely. To show there is predictive value in computing WAR on a game-by-game level, in Section 3.3 we quantify the value lost by using only FanGraphs WAR to estimate pitcher quality rather than GWAR. In particular, we find that pitcher rankings built from past seasons' GWAR are better than pitcher rankings built from past seasons' FanGraphs WAR at predicting future pitcher rankings according to GWAR.
Finally, in Section 4 we conclude by discussing the best pitchers and pitcher-seasons across baseball history according to Grid WAR and discuss ideas for future work.

Defining Grid WAR for Starting Pitchers
Our task is to estimate a starting pitcher's WAR for an individual game, which we call Grid WAR (GWAR). The idea is to estimate a context-neutral version of win probability added derived only from a pitcher's performance, invariant to factors outside of his control such as his team's batting. In Section 2.1 we detail our mathematical formulation of Grid WAR. Subsequently, we discuss how we estimate the grid functions f and g, the constant w_rep, and the park effects α which allow us to compute a starting pitcher's Grid WAR for a baseball game. We begin with a brief overview of our data in Section 2.2; then in Section 2.3 we estimate f, in Section 2.4 we estimate g, in Section 2.5 we estimate w_rep, and in Section 2.6 we estimate α.

Grid WAR Formulation
First, we define a starting pitcher's Grid WAR for a game in which he exits at the end of an inning.
To do so, we define the function f = f(I, R) which, assuming both teams have league-average offenses, computes the probability a team wins a game after giving up R runs through I innings (the values of f(I, R) for integer values of I and R can be displayed in a simple grid). In short, f is a context-neutral version of win probability, as it depends only on the starter's performance.
Note that f also depends on the league (AL vs. NL), season, and ballpark. For example, games in which the home team is in the National League (NL) prior to 2022 did not feature designated hitters, whereas American League (AL) games did, leading to different run environments. Additionally, baseballs have had different compositions across seasons, leading to different proportions of home runs and base hits, and hence different run environments. Finally, it is easier to score runs at some ballparks than at others. For instance, Coors Field, at high altitude in Denver, features many more home runs than other parks. Consequently, f = f(I, R) is implicitly a function of league, season, and ballpark.
To compute a wins above replacement metric, we need to compare this context-neutral win contribution to that of a potential replacement-level pitcher. We use a constant w_rep which denotes the probability a team wins a game with a replacement-level starting pitcher, assuming both teams have league-average offenses. We expect w_rep < 0.5 since replacement-level pitchers are worse than league-average pitchers. Then, we define a starter's Grid WAR for a game in which he gives up R runs through I complete innings as

GWAR = f(I, R) − w_rep. (2.1)

If instead the starter exits midway through inning I + 1 with base-state S and O outs, having allowed R runs so far, we charge him, in expectation, with the runs r that score through the end of the inning, using the grid function g = g(r | S, O) defined in Section 2.4:

GWAR = Σ_{r ≥ 0} g(r | S, O) f(I + 1, R + r) − w_rep. (2.2)

Finally, we define a starting pitcher's Grid WAR for an entire season as the sum of the Grid WAR of his individual games.
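As a concrete illustration, here is a minimal R sketch of a single-game Grid WAR computation following Equations (2.1) and (2.2). It assumes the grids have already been estimated; the object names and column layout (f_grid, g_dist with columns S, O, r, prob) are our own conventions, not fixed by the paper:

```r
# Sketch of a single-game Grid WAR computation (Equations (2.1) and (2.2)).
# f_grid: matrix of f(I, R) with rows I = 1, ..., 9 and columns R = 0, 1, 2, ...
# g_dist: data frame with columns S, O, r, prob representing g(r | S, O).
gwar_game <- function(I, R, f_grid, g_dist = NULL, base_state = NULL,
                      outs = NULL, w_rep = 0.428) {
  if (is.null(base_state)) {
    # The pitcher exited at the end of inning I (Equation (2.1)).
    return(f_grid[I, R + 1] - w_rep)
  }
  # The pitcher exited midway through inning I + 1 (Equation (2.2)): average f
  # over the runs r charged through the end of the inning, weighted by g(r | S, O).
  g <- g_dist[g_dist$S == base_state & g_dist$O == outs, ]
  R_max <- ncol(f_grid) - 1                    # cap runs at the grid edge
  sum(g$prob * f_grid[I + 1, pmin(R + g$r, R_max) + 1]) - w_rep
}
```

For example, gwar_game(6, 2, f_grid) would score a pitcher who exits after 6 complete innings having allowed 2 runs.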

Our Data
In the remainder of Section 2, we discuss how we estimate the grid functions f and g, the constant w_rep, and the park effects α which allow us to compute a starting pitcher's Grid WAR for a baseball game (Equation (2.2)). In our analysis, we use play-by-play data from Retrosheet. We scraped every plate appearance from 1990 to 2020 from the Retrosheet database. For each plate appearance, we record the pitcher, batter, home team, away team, league, park, inning, runs allowed, base state, and outs count. Our final dataset is publicly available online. In our study, we restrict our analysis to every plate appearance from 2010 to 2019 featuring a starting pitcher. Additionally, we scraped FanGraphs RA/9 WAR (abbreviated henceforth as FWAR (RA/9)) and FanGraphs FIP WAR (abbreviated henceforth as FWAR (FIP)) using the baseballr package (Petti and Gilani, 2021). All computations in our analysis are performed in R, and our code is publicly available online.

Estimating the grid function f
Now, we estimate the grid function f = f(I, R) which, assuming both teams have league-average offenses, computes the probability a team wins a game after giving up R runs through I complete innings. We call f a grid because the values of f(I, R) for integer values of I and R can be displayed in a simple 2D grid. To account for different run environments across different seasons, leagues (NL vs. AL), and ballparks, we estimate a different grid for each league-season-ballpark. Due to a lack of data within each individual league-season, we estimate f using a parametric mathematical model rather than a statistical or machine learning model fit from historical data. In particular, we use an Empirical Bayes Poisson model, from which we explicitly compute context-neutral win probability, rather than a logistic regression, XGBoost, or empirical distribution fit from observational data. We detail why our mathematical model is superior to statistical models in Appendix A.
Because the runs allowed in a half-inning is a natural number, we begin our parametric modeling process by supposing that the runs allowed in a half-inning is a Poisson random variable. In particular, denoting the runs scored by the pitcher's team's batters in inning i by X_i and the runs scored by the opposing team in inning i by Y_i (for innings after the pitcher exits the game), we assume

X_i ~ Poisson(λ_X) and Y_i ~ Poisson(λ_Y), independently across innings. (2.3)

The two teams have their own runs-per-inning parameters λ_X and λ_Y because a baseball season involves teams of varying strength playing against each other. Given these team-strength parameters, the probability that a pitcher wins the game after allowing R runs through I innings, assuming the win probability in extra innings is 1/2, is

f(I, R | λ_X, λ_Y) = P(D > R) + (1/2) P(D = R), where D = Σ_{i=1}^{9} X_i − Σ_{i=I+1}^{9} Y_i, (2.4)

noting that the Skellam distribution arises as the difference of two independent Poisson random variables. Moreover, to capture the variability in team strength across each of the 30 MLB teams, we impose a positive normal prior,

λ_X, λ_Y ~ N⁺(λ, σ²_λ), i.i.d. (2.7)

We estimate the prior hyperparameters λ and σ²_λ separately for each league-season by computing each team's mean and variance of the runs allowed in each half-inning, respectively, and then averaging over all teams. The initial estimated values of σ²_λ are too large (i.e., the prior is overdispersed), so we include a tuning parameter k, replacing σ²_λ with k·σ²_λ, designed to tune the dispersion across team strengths to match observed data. In particular, we use k = 0.28, which minimizes the log-loss between the observed win/loss column and predictions from the induced grid. The induced grid is given by the posterior mean grid, which we estimate using Monte Carlo integration with B = 100 samples,

f(I, R) ≈ (1/B) Σ_{b=1}^{B} f(I, R | λ_X^(b), λ_Y^(b)), (2.8)

where λ_X^(b) and λ_Y^(b) are i.i.d. samples from the prior distribution (2.7).
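The following R sketch shows how the posterior-mean grid (2.8) can be computed under these assumptions. The function names are ours, and the positive-normal draw uses a simple rejection sampler:

```r
# Sketch of the posterior-mean f grid (Equation (2.8)).
# lambda, sigma2: league-season hyperparameters; k: the dispersion tuning parameter.
rpos_normal <- function(n, mean, sd) {
  # Rejection sampler for the positive normal prior (2.7).
  x <- rnorm(n, mean, sd)
  while (any(x <= 0)) x[x <= 0] <- rnorm(sum(x <= 0), mean, sd)
  x
}

f_given_lambdas <- function(I, R, lam_x, lam_y, max_y = 50) {
  # Team total A ~ Poisson(9 * lam_x); the opponent adds B ~ Poisson((9 - I) * lam_y)
  # after the pitcher exits. Win if A > R + B; ties go to extra innings (probability 1/2).
  b   <- 0:max_y
  p_b <- dpois(b, (9 - I) * lam_y)
  sum(p_b * (ppois(R + b, 9 * lam_x, lower.tail = FALSE) +
               0.5 * dpois(R + b, 9 * lam_x)))
}

f_grid <- function(lambda, sigma2, k = 0.28, B = 100) {
  lam_x <- rpos_normal(B, lambda, sqrt(k * sigma2))
  lam_y <- rpos_normal(B, lambda, sqrt(k * sigma2))
  grid <- matrix(NA, nrow = 9, ncol = 13, dimnames = list(I = 1:9, R = 0:12))
  for (I in 1:9) for (R in 0:12)
    grid[I, R + 1] <- mean(sapply(seq_len(B), function(b)
      f_given_lambdas(I, R, lam_x[b], lam_y[b])))   # Monte Carlo average (2.8)
  grid
}
```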
Additionally, recall that f = f (I, R) is implicitly a function of ballpark. To adjust for ballpark, we first define the park effect α of a ballpark as the expected runs allowed in one half-inning at that park above that of an average park, if an average offense faces an average defense. Therefore, as λ represents the mean runs allowed in a half-inning for a given league-season, λ + α represents the mean runs allowed in a half-inning at a given ballpark during that league-season. So, to adjust for ballpark, we use λ + α in place of λ in our Poisson model (2.8) and positive Normal prior (2.7). In Section 2.6 we estimate the park effects α.
In Figure 1 we visualize the estimated grid f according to our Poisson model (2.8), with prior (2.7), using the 2019 NL λ and σ²_λ, without a park adjustment. Note that the f grids for other league-seasons are similar, but differ slightly according to the differing run environments λ and σ²_λ. We see that f is monotonically decreasing in R because, as a pitcher allows more runs through a fixed number of innings, his team is less likely to win the game. Also, f is monotonically increasing in I because giving up R runs through I innings is worse than giving up R runs through I + i innings for i > 0, since giving up R runs through I + i innings implies a pitcher gave up no more than R runs through I innings. Further, f is convex in R for large values of R because the marginal impact of allowing an additional run diminishes to zero as R increases: after giving up a certain number of runs, the game is essentially already lost. Succinctly, "you can only lose a game once".
Finally, the grid f is smooth.

Estimating the grid function g
Now, we estimate the function g = g(R|S, O) which, assuming both teams have league-average offenses, computes the probability that, starting midway through an inning with O ∈ {0, 1, 2} outs and base-state S ∈ {000, 100, 010, 001, 110, 101, 011, 111}, a team scores exactly R runs through the end of the inning. We estimate g(R|S, O) using the empirical distribution. Specifically, we bin and average over the variables (R, S, O), using data from every game from 2010 to 2019. Because g isn't significantly different across innings, we use data from each of the first eight innings.
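A minimal sketch of this binning-and-averaging step is below; the data frame and its column names (inning, S, O, and r, the runs that score from the given base-out state through the end of the inning) are our own assumptions about the play-by-play data:

```r
# Sketch of the empirical estimate of g(r | S, O); column names are assumed.
library(dplyr)

estimate_g <- function(plays) {
  plays %>%
    filter(inning <= 8) %>%            # pool the first eight innings
    count(S, O, r, name = "n") %>%     # r: runs scored through the end of the inning
    group_by(S, O) %>%
    mutate(prob = n / sum(n)) %>%      # empirical pmf within each (S, O) bin
    ungroup()
}
```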
In Figure 2a we visualize g(R | S, O = 0), with O = 0 outs, for each base-state S. With no men on base (S = 000), 0 runs allowed for the rest of the inning is most likely. With the bases loaded (S = 111), 1 run allowed for the rest of the inning is most likely, and there is a fat tail indicating that 2 through 5 runs through the rest of the inning are also reasonable occurrences. With men on second and third (S = 011), 2 runs allowed for the rest of the inning is most likely, but the tail is skinnier than that of the bases loaded.

Estimating the constant w_rep
To estimate wins above replacement, we need to compare a starting pitcher's context-neutral win contribution to that of a potential replacement-level pitcher. Thus we estimate a constant w_rep which represents the context-neutral probability a team wins a game with a replacement-level starting pitcher, assuming both teams have a league-average offense and league-average fielding. FanGraphs (2010) defines replacement level as the "level of production you could get from a player that would cost you nothing but the league minimum salary to acquire." We estimate w_rep so as to match FanGraphs' definition of replacement level. In particular, we choose w_rep so that the sum of GWAR across all starting pitchers from 2010 to 2019 equals the corresponding sum of FWAR (RA/9), yielding

w_rep = 0.428. (2.9)


Estimating the park effects α

Finally, we estimate the park effect α of each ballpark, which represents the expected runs scored in one half-inning at that park above that of an average park, if an average offense faces an average defense. To compute the park effects for 2019, we take all half-innings from 2017 to 2019 and fit a ridge regression, using cross-validation to tune the ridge hyperparameter, where the outcome is runs scored during the half-inning and the covariates are fixed effects for each park, team-offensive-season, and team-defensive-season. We compute similar three-year park effects for other seasons. We visualize the 2019 park effects in Figure 2b. We use ridge regression, as opposed to ordinary least squares or existing park effects from ESPN, FanGraphs, or Baseball Reference, because, as detailed in Appendix C, it performs the best in two simulation studies and has the best out-of-sample predictive performance on observed data.
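A sketch of this fit using glmnet is below; the data frame columns (runs, park, off_team_season, def_team_season) are our own assumed names:

```r
# Sketch of the ridge park-effect regression over half-innings.
library(Matrix)
library(glmnet)

fit_park_effects <- function(half_innings) {
  X <- sparse.model.matrix(
    ~ 0 + park + off_team_season + def_team_season, data = half_innings)
  cv <- cv.glmnet(X, half_innings$runs, alpha = 0)   # alpha = 0 is the ridge penalty
  b <- coef(cv, s = "lambda.min")                    # coefficients at the CV-tuned penalty
  alpha_hat <- b[grepl("^park", rownames(b)), 1]
  alpha_hat - mean(alpha_hat)                        # center: an average park has effect 0
}
```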

Results
After estimating the grid functions f and g, the constant w_rep, and the park effects α, we compute GWAR for each game from 2010 to 2019. We begin with a quick exposition of Grid WAR. The remainder of this Section is organized as follows. First, in Section 3.1, we compare pitcher valuations according to GWAR and FanGraphs WAR (FWAR (RA/9)) in the 2019 season. We find that GWAR especially values games in which a starting pitcher allows few runs (e.g., zero, one, or two runs). Then, in Section 3.2, we examine pitcher valuations according to GWAR and FWAR over entire careers. We find that averaging pitcher performance over the course of an entire season tends, in general, to undervalue worse pitchers and overvalue better pitchers. This is because the convexity of GWAR diminishes the impact of games in which a pitcher allows many runs, and weaker pitchers tend to have more of these games. Next, in Section 3.3, we quantify the value lost by using FWAR to estimate pitcher quality rather than GWAR. In particular, we find that pitcher rankings built from past seasons' GWAR are better than pitcher rankings built from past seasons' FWAR at predicting future pitcher rankings according to GWAR. Finally, in Section 3.4 we corroborate the result of Section 3.2 using a pitcher quality estimator built from Grid WAR. We find that nearly every pitcher can have the occasional great game, but great pitchers have fewer terrible games, causing existing WAR metrics to undervalue the contributions of mediocre pitchers.

Comparing Grid WAR to FanGraphs WAR in 2019
To understand the effect of averaging pitcher performance over the entire season on valuing pitchers, we begin by comparing GWAR to FanGraphs WAR (FWAR) in the 2019 season. We use FWAR (RA/9), rather than FWAR (FIP), since the former is built from runs allowed. To compare the relative value of starting pitchers according to GWAR and FWAR (RA/9), we rescale GWAR in 2019. In particular, we enforce that the sum of GWAR across starting pitchers in 2019 equals the corresponding sum of FWAR (RA/9). In Figure 4 we visualize GWAR vs. FWAR (RA/9) for starting pitchers in 2019. We see that some pitchers lie above the line y = x, and so are undervalued according to GWAR relative to FWAR, while other pitchers lie below the line and are overvalued. To understand why a player is undervalued or overvalued according to GWAR relative to FWAR in 2019, we compare players who have similar FWAR but different GWAR values in 2019. In Figure 5 we compare Homer Bailey's 2019 season to Tanner Roark's. They have the same FWAR (RA/9), 2.7, but Bailey has a much higher GWAR (Bailey 3.26, Roark 2.36). Similarly, in Figure 6 we compare Mike Fiers' 2019 season to Aaron Nola's. They have the same FWAR (RA/9), 4.1, but Fiers has a higher GWAR (Fiers 4.99, Nola 3.99).
In each of these comparisons, we see a similar trend explaining the differences in GWAR. Specifically, the pitcher with higher GWAR allows fewer runs in more games, and allows more runs in fewer games. This is depicted graphically in the "Difference" histograms, which show the difference between the histogram on the left and the histogram on the right. The green bars denote positive differences (i.e., the pitcher on the left has more games with a given number of runs allowed than the pitcher on the right), and the red bars denote negative differences (i.e., the pitcher on the left has fewer games with a given number of runs allowed than the pitcher on the right).
In each of these examples, the green bars are shifted towards the left (pitchers with higher GWAR allow few runs in more games), and the red bars are shifted towards the right (pitchers with lower GWAR allow many runs in more games). For instance, consider Figure 5. Bailey pitches 4 more games than Roark in which he allows 2 runs or fewer, and Roark pitches 4 more games than Bailey in which he allows 3 runs or more. Similar logic applies to Figure 6: Fiers pitches 4 more games than Nola in which he allows 0 runs, and Nola pitches 5 more games than Fiers in which he allows 1 run or more.

Comparing Grid WAR to FanGraphs WAR across careers
To further understand the effect of averaging pitcher performance over the entire season on valuing pitchers, we compare GWAR to FWAR (RA/9) over the course of entire careers. Because FanGraphs rescales FWAR each season according to its definition of replacement level, we rescale GWAR in each year to enable fair comparison of relative value between GWAR and FWAR (RA/9). In particular, we enforce that the sum of GWAR across starting pitchers within each season equals the corresponding sum of FWAR (RA/9). In Figure 7 we visualize GWAR vs. FWAR (RA/9) for each starting pitcher-season across all years from 2010 to 2019. We see that worse pitchers generally lie above the line y = x (the black line), and so are undervalued according to GWAR relative to FWAR (RA/9), whereas better pitchers generally lie below the line and so are overvalued. The regression line (the blue line), which has slope less than one, summarizes this phenomenon. The cause is that FWAR averages pitcher performance across an entire season: for a bad pitcher who has many games in which he allows many runs, averaging dilutes the performances of his good games. Grid WAR, by contrast, is (mostly) convex in runs allowed: the marginal difference between allowing R + 1 runs instead of R runs in a game decreases as R increases (recall Figure 1). Therefore, GWAR downweights the contribution of the R-th run of a game for large R, whereas FWAR weighs all runs allowed in a season equally. Since worse pitchers have many more occurrences than better pitchers of the R-th run of a game where R is large, FWAR (RA/9) undervalues worse pitchers in general.
Because worse pitchers are generally undervalued according to GWAR relative to FWAR, better pitchers must generally be overvalued, as we have constrained GWAR and FWAR to have the same sum. Specifically, averaging a good pitcher's performance over the entire season doesn't tank his FWAR as much because he had fewer bad performances to begin with.
In Figure 8 we show the six most undervalued and six most overvalued pitchers according to GWAR relative to FWAR (RA/9), aggregated across all seasons from 2010 to 2019. As expected, the undervalued pitchers are generally considered worse pitchers, and the overvalued pitchers are generally considered better pitchers. In Figure 9 we visualize the runs allowed distribution of the most undervalued pitcher, Yovani Gallardo, in his three most undervalued seasons. We see that Gallardo has quite a few games in which he allows many runs, say six or more runs. GWAR diminishes the impact of these games, which increases his estimated value. Gallardo also has many good games in which he allows 0, 1, or 2 runs, which the convexity of GWAR values highly.

Grid WAR has predictive value
Recall that a baseball analyst is interested in estimating both historical WAR and predictive WAR, which a priori are two distinct valuations. Industry-standard models of WAR for starting pitchers from FanGraphs and Baseball Reference, which serve as both historical and predictive WAR metrics, all assume that season-long averages are sufficient statistics for a pitcher's performance. This provides an invalid mathematical foundation for many reasons, especially because WAR is not linear with respect to any counting statistic; in particular, historical WAR must be a convex function of the number of runs allowed in a game. To repair this defect (among many others), we devised a new measure, Grid WAR (GWAR), which estimates a starting pitcher's WAR on a per-game basis.
In particular, Grid WAR is the right way to estimate historical WAR for starting pitchers.
Because a pitcher's historical WAR at the end of a season defines how valuable he was during the season, we want our predictive WAR to predict his next season's historical WAR as well as possible. In other words, predictive WAR is simply predicted historical WAR. Hence our goal is to predict a starting pitcher's future Grid WAR. A priori, it is not immediately obvious whether a pitcher's past Grid WAR is predictive of his future Grid WAR. In particular, if a pitcher's game-by-game variance in runs allowed is due mostly to randomness rather than a fundamental identifiable trait, a WAR which averages pitcher performance over the season may be more predictive than Grid WAR of future Grid WAR. Thus, in this Section, we compare the predictive capabilities of Grid WAR and FanGraphs WAR. We find that, in predicting future pitcher rankings according to Grid WAR, our predicted pitcher ranking built from Grid WAR is more predictive than that built from FanGraphs WAR. This suggests that some pitchers' game-by-game variance in performance is a fundamental trait.
To value a starting pitcher using his previous seasons' WAR and number of games played, we could simply use his mean game WAR. The fewer games a pitcher has played, however, the less reliable his mean game WAR is in predicting his latent pitcher quality. Therefore, we use shrinkage estimation to construct a pitcher quality metric. In calculating pitcher p's quality estimate µ_p, the fewer games he has played, the more his mean game WAR is shrunk towards the overall mean pitcher quality. Specifically, we construct three shrinkage estimators of pitcher p's quality, denoted µ_p^GWAR, µ_p^FWAR(FIP), and µ_p^FWAR(RA/9), built from the three respective WAR metrics. We use a parametric Empirical Bayes approach in the spirit of Brown (2008) to formulate these shrinkage estimators, detailed in Appendix B.
Recall that our goal is to predict each starting pitcher's next season's cumulative Grid WAR, which at the end of next season will represent his historical value added. So, using the 2019 season as a hold-out validation set, our goal is to predict each starting pitcher's 2019 Grid WAR. We use our remaining data from 2010 to 2018 to estimate pitcher quality, built separately from GWAR and FWAR, in order to predict 2019 Grid WAR. Thus we restrict our analysis to the set of starting pitchers who have a FanGraphs WAR in at least one season from 2010 to 2018 (so, they must have at least 25 starts in that season). Our pitcher quality estimators, however, are on different scales since each WAR metric is on its own scale. Hence, to ensure fair comparison of Grid WAR and FanGraphs WAR, we map each estimator to a starting pitcher ranking, ranking each pitcher from one (best) to the number of pitchers (worst). We denote the three ranks of pitcher p according to these estimators by R̂_p^GWAR, R̂_p^FWAR(FIP), and R̂_p^FWAR(RA/9), respectively. In Figure 24 of Appendix B we visualize the starting pitcher rankings prior to the 2019 season according to the estimators µ_p (left) and their associated ranks R̂_p (right). Additionally, we rank pitchers in 2019 by their observed cumulative 2019 Grid WAR, denoted R_p^GWAR. Finally, we use root mean squared error (rmse) to measure how well the predicted pitcher rankings R̂ predict the observed rankings R, shown in Table 2. We see that pitcher rankings built from Grid WAR are more predictive than those built from FanGraphs WAR. Formally, A < B and A < C, where A, B, and C are defined in Table 2. In other words, baseball analysts lose value by not using Grid WAR to value pitchers. In Table 3 (resp., Table 4) we conduct a similar analysis, but restricting the test set to just the five most undervalued starting pitchers in 2019 according to R^GWAR relative to R^FWAR(RA/9) (resp., R^FWAR(FIP)). Conversely, in Table 5 (resp., Table 6) we conduct a similar analysis, but restricting the test set to just the five most overvalued starting pitchers in 2019 according to R^GWAR relative to R^FWAR(RA/9) (resp., R^FWAR(FIP)). We again find that baseball analysts lose value by not using Grid WAR to estimate pitcher quality. In particular, for these "extreme" pitchers who are highly undervalued or highly overvalued, analysts do worse predicting their quality when they use FWAR rather than GWAR.
observed ranking R    predicted ranking R̂    rmse(R, R̂)
R^GWAR                R̂^GWAR                 7.2
R^GWAR                R̂^FWAR(RA/9)           15.7

Table 3: The rmse between the observed pitcher ranking R^GWAR in 2019 and pitcher ranking estimates R̂ computed from different WAR metrics, using just the five most undervalued starting pitchers in 2019 according to R^GWAR relative to R^FWAR(RA/9).

observed ranking R    predicted ranking R̂    rmse(R, R̂)
R^GWAR                R̂^GWAR                 10.7
R^GWAR                R̂^FWAR(RA/9)           14.0

Table 5: The rmse between the observed pitcher ranking R^GWAR in 2019 and pitcher ranking estimates R̂ computed from different WAR metrics, using just the five most overvalued starting pitchers in 2019 according to R^GWAR relative to R^FWAR(RA/9).

Finally, for concreteness, in Figure 10 we visualize how our GWAR-based and FWAR-based pitcher ranking predictions fare against the observed 2019 GWAR pitcher rankings. Specifically, the blue dots (our GWAR-based predictions) are closer to the black dots (the observed 2019 pitcher rankings according to GWAR) than the red dots (our FWAR (RA/9)-based predictions).

Existing WAR metrics undervalue mediocrity
So, a game-by-game WAR like Grid WAR is not only the right way to measure historical WAR for starting pitchers, but is also predictive of future Grid WAR. In particular, an estimator of latent pitcher talent should be built using Grid WAR. In this Section, we explore the relationship between a pitcher's talent according to µ_p^GWAR and his game-by-game performance. In particular, we find that all pitchers have great games, but great pitchers don't have terrible games. Therefore, averaging pitcher performance over the entire season allows mediocre pitchers' bad games to dilute the value of their good games, causing existing WAR metrics to undervalue mediocrity. This agrees with our assessment from Section 3.2.
We begin with Figure 11, from which we get a sense of the distribution of pitcher talent µ_p^GWAR (left) and of the distribution of game-by-game Grid WAR (right). Then, in Figure 12a, we visualize the distribution of game-by-game Grid WAR conditional on being a bad pitcher (red), a typical pitcher (green), and a great pitcher (blue), according to µ_p^GWAR. We see that bad pitchers, typical pitchers, and great pitchers all have great games. Great pitchers, on the other hand, pitch many fewer bad games than bad and mediocre pitchers do.
In Figure 12b we view this phenomenon through another lens. Specifically, we visualize the distribution of pitcher quality µ_p^GWAR conditional on having a bad game (red), a typical game (green), and a great game (blue). We again see that bad pitchers, typical pitchers, and great pitchers all have great games. Bad games, however, feature a higher proportion of bad pitchers.
Averaging pitcher performance over the season allows a pitcher's bad performances to dilute the value of his good ones. Consequently, WAR metrics like FanGraphs WAR devalue the contributions of mediocre and bad pitchers, who have many more bad games than great pitchers. In short, the baseball community has been undervaluing the contributions of the mediocre.

Discussion
Traditional implementations of WAR for starting pitchers estimate WAR as a function of pitcher performance averaged over the entire season. Averaging pitcher performance, however, allows a pitcher's bad games to dilute the performances of his good games. In particular, after averaging, one bad "blow-up" game can reduce a pitcher's WAR by more than the minimum possible WAR in a game. Therefore, a starter's seasonal WAR should be the sum of the WAR of each of his individual games. Hence we devise Grid WAR, which estimates a starting pitcher's WAR in each of his games. In particular, Grid WAR estimates the context-neutral win probability added above replacement at the point when a pitcher exits the game. We find that Grid WAR is convex in runs allowed, capturing the fundamental baseball principle that you can only lose a game once.
Comparing starting pitchers' Grid WAR to their FanGraphs WAR from 2010 to 2019, we find that standard WAR calculations undervalue mediocrity relative to Grid WAR. Because all starters pitch great games, but great starters don't pitch many terrible games, averaging pitcher performance over a season discounts the contributions of mediocre and bad pitchers' great games. We also show that past performance according to Grid WAR is predictive of future Grid WAR, providing evidence that a pitcher's runs allowed profile is not entirely due to sequencing randomness, but is also the result of an identifiable game-by-game variance or sequencing trait.

The best starting pitchers in modern baseball history
We created an interactive Shiny app, hosted at https://gridwar.xyz, which displays the Grid WAR results of every starting pitcher game, season, and career since 1952. We also visualize the f grids and park factors for each season. The website is built using pre-2008 play-by-play data from Retrosheet (2021) and play-by-play data since 2008 from Statcast (2023); we use the baseballr package in R to scrape from each of these data sources (Petti and Gilani, 2021). We automatically scrape Statcast data each morning, so the website is up-to-date.
The pitcher with the highest total Grid WAR of all time in a single season is Sandy Koufax, and it is not even close. In 1966, he accumulated 11.54 GWAR over 41 games in his final season, which is half a win more GWAR than the second-best season (Bob Gibson had 11.05 Grid WAR over 34 games in 1968) and the third-best (Dwight Gooden had 11.04 Grid WAR over 35 games in 1985). Koufax's 1966 season is an illuminating example of the value of Grid WAR compared to standard formulations. While his 1966 is the standout season of all time in terms of Grid WAR, it is just the 6th highest seasonal FWAR (RA/9) and the 20th highest seasonal FWAR (FIP). The other methods incorrectly overweight his three outlying blow-up games (i.e., games worth less than −0.1 GWAR). This is an excellent example of why it is a philosophical mistake to ignore variance and convexity. In 1966, there were two versions of Koufax: the "left arm of God" and a pitcher worse than replacement. The "left arm of God" threw 8 complete-game shutouts and 9 one-run complete games. Grid WAR properly accounts for this variation while the standard metrics do not. Among the 15 all-time best seasons, no other pitcher appears more than once, while Koufax appears on the list three times (1963, 1965, 1966), with his 1964 season only falling short because Koufax lost a quarter of the season to injury. Koufax's "duality" is not just chance variation; it is a systematic attribute and a significant contributor to his early retirement after the 1966 season. We visualize the top 15 starting pitcher-seasons of all time by total Grid WAR in Figure 17, and the top 15 starting pitchers of all time by total career Grid WAR in Figure 15. The pitchers with the highest career Grid WAR come from the previous millennium, because top starters back then pitched more games per season and more innings per game. The highest-ranked active pitchers sit 15th and 16th on the career list, and they don't come anywhere particularly close to Clemens or even Maddux, although their careers aren't over.

Future work
Although Grid WAR improves substantially upon existing estimates of WAR for starting pitchers, our analysis is not without limitations. In particular, the current version of Grid WAR, like the WAR estimates from FanGraphs and Baseball Reference, doesn't adjust for opposing batter quality. Thus, for a pitcher who faces good offensive teams more often than other pitchers do, Grid WAR underestimates his WAR. Additionally, the current version of Grid WAR doesn't adjust for the pitcher's team's fielding. Thus, for a pitcher who plays with great fielders who reduce his runs allowed, Grid WAR overestimates his WAR. Within our Poisson model framework, we could adjust for offensive quality and fielding in the same way we adjusted for ballpark. Specifically, we could add to the Poisson parameter λ from Equation (2.7) a coefficient capturing the opposing team's offensive quality and a coefficient capturing the pitcher's team's fielding quality. But we expect these fielding adjustments to have a very small impact. In particular, we expect the effect of fielding to have a smaller total impact than ballpark, which itself has a small impact, except at extreme parks like those of the Rockies, Rangers, and Mets, for which park effects are moderate. This can be seen in Figure 29: Grid WAR computed with our ridge-adjusted park effects is extremely similar to Grid WAR without park effects. We leave the addition of batting and fielding adjustments to future work.
Moreover, the distribution of runs scored in a half-inning is not Poisson; more likely it is a zero-inflated Poisson or a similar distribution on the non-negative integers. Computationally, it is straightforward to modify the f grid formula (Equation (2.4)) to accommodate different distributions. One interesting modification would allow different parameters for each inning, depending on when a starting pitcher is pulled. In particular, middle relievers tend to be worse than starting pitchers, suggesting a higher value of λ for those innings, and closers are often very good pitchers, suggesting a lower value of λ. But there are several substantial benefits to sticking with a simpler Poisson model. First, it produces a closed-form formula which is quick to evaluate. Second, a simple parametric model makes it easier to adjust for ballpark (and other confounds like batting quality and fielding quality). Finally, the resulting f grid is eminently reasonable and quite accurate for our purposes. For example, while the Poisson model systematically underestimates the probability of a big-deficit late-inning comeback, these differences have an insignificant impact on Grid WAR. We leave any adjustment to the half-inning runs distribution as future work.
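For instance, a zero-inflated Poisson could be dropped into the grid computation by replacing the per-inning Poisson pmf and obtaining multi-inning totals by convolution. The sketch below is ours; pi0 is a hypothetical zero-inflation weight and the parameter values are purely illustrative:

```r
# Hypothetical zero-inflated Poisson pmf for the runs in one half-inning.
dzip <- function(x, lambda, pi0) pi0 * (x == 0) + (1 - pi0) * dpois(x, lambda)

# n-fold convolution of a pmf over runs 0, 1, 2, ...: the distribution of the
# total runs over n innings (truncated support, renormalized).
nfold <- function(pmf, n) {
  out <- c(1, rep(0, length(pmf) - 1))      # point mass at 0 runs
  for (i in seq_len(n)) {
    out <- convolve(out, rev(pmf), type = "open")[seq_along(pmf)]
    out <- pmax(out, 0)                     # guard against FFT round-off
  }
  out / sum(out)
}

pmf_inning <- dzip(0:30, lambda = 0.48, pi0 = 0.2)  # illustrative parameter values
pmf_game   <- nfold(pmf_inning, 9)                  # total runs over 9 innings
```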
Additionally, a flaw in our Empirical Bayes shrinkage estimator of latent pitcher quality µ_p from Formula (B.2) is that it assumes µ_p remains constant over the entire decade from 2010 to 2019. Player quality is, however, non-stationary over time, and a more elaborate estimator should account for this. Therefore, in future work we suggest using a similar Empirical Bayes approach to estimate pitcher quality, except downweighting data the further back in time it is observed (e.g., using exponential decay weighting as in Medvedovsky and Patton (2022)) in the posterior mean Formulas (B.2) and (B.14).
While we extensively explore a game-by-game implementation of WAR for starting pitchers in this study, we leave for future work an analysis and implementation of game-by-game WAR for other positions in baseball and in other sports. In particular, although Grid WAR works well for starting pitchers, it does not translate to valuing relief pitchers in an obvious way. Because relievers enter the game at different times, it is more difficult to value their context-neutral win contribution. Also, there is no obvious analog of w_rep for relievers. Grid WAR could, on the other hand, be used to value hockey goalies: just as a pitcher allows a certain number of runs throughout the game, a hockey goalie allows a certain number of goals throughout the game. Also, as with starting pitchers, hockey goalies enter the game at the start.

A Estimating f using a mathematical, not a statistical, model
In this Section, we detail our modeling process for estimating the grid function f = f (I, R) which, assuming both teams have randomly drawn offenses, computes the probability a team wins a game after giving up R runs through I complete innings. In particular, we compare statistical models fit from observational data to mathematical probability models, which are superior.
To account for different run environments across different seasons and leagues (NL vs. AL), we estimate a different grid for each league-season. We begin by estimating f from our observational dataset of half-innings from 2010 to 2019. The response variable is a binary indicator denoting whether the pitcher's team won the game, and the features are the inning number I, the runs allowed through that half-inning R, the league, and the season. Note that if a home team leads after the top of the 9th inning, then the bottom of the 9th is not played. Therefore, to avoid selection bias, we exclude all 9th-inning instances in which a pitcher pitches at home.
With enough data, the empirical grid (e.g., binning and averaging over all combinations of I and R within a league-season) is a great estimator of f. In Figure 19a we visualize the empirical grid fit on a dataset of all half-innings from 2019 in which the home team is in the National League. The function f should be monotonically decreasing in R: as a pitcher allows more runs through a fixed number of innings, his team is less likely to win the game. It should also be monotonically increasing in I because giving up R runs through I innings is worse than giving up R runs through I + i innings for i > 0, since giving up R runs through I + i innings implies a pitcher gave up no more than R runs through I innings. The empirical grid, however, is not monotonic in either R or I because each league-season dataset is not large enough. Moreover, even when we use our entire dataset of all half-innings from 2010 to 2019, the empirical grid is still not monotonic in R or I.
To force our fitted f to be monotonic, we use XGBoost with monotonicity constraints, tuned using cross-validation (Chen and Guestrin, 2016). We visualize our 2019 NL XGBoost fit in Figure 19b. We indeed see that the fitted f is decreasing in R and increasing in I. Additionally, R → f(I, R) is mostly convex: if a pitcher has already allowed a high number of runs, allowing an additional run has a lesser relative impact on winning the game. Nonetheless, XGBoost overfits, especially towards the tails (e.g., for large R). For instance, the 2019 NL XGBoost model indicates that the probability of winning a game after allowing 10 runs through 9 innings is about 0.11, which is too large.
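A sketch of such a constrained fit with the xgboost R package is below; the data frame df with columns R, I, and win (0/1), and the tuning values, are our own assumptions:

```r
# Sketch of the monotonically constrained XGBoost fit for f.
library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(df[, c("R", "I")]), label = df$win)
fit <- xgb.train(
  params = list(objective = "binary:logistic",
                monotone_constraints = "(-1,1)",  # f decreasing in R, increasing in I
                max_depth = 4, eta = 0.1),        # illustrative tuning values
  data = dtrain,
  nrounds = 200)
```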
As there is not enough data to use machine learning to fit a separate grid for each league-season without overfitting, we turn to a parametric mathematical model. Indeed, the power of parameterization is that it distills the information of a dataset into a concise form (e.g., into a few parameters), allowing us to create a strong model from limited data. Because the runs allowed in a half-inning is a natural number, we begin our parametric quest by supposing that the runs allowed in a half-inning is a Poisson(λ) random variable. In particular, denoting the runs scored by the pitcher's team's batters in inning i by X_i and the runs scored by the opposing team in inning i by Y_i (for innings i after the pitcher exits the game), we assume

X_i ~ Poisson(λ) and Y_i ~ Poisson(λ), independently across innings. (A.2)

Then the probability that a pitcher wins the game after allowing R runs through I innings, assuming the win probability in extra innings is 1/2, is

f(I, R | λ) = P(D > R) + (1/2) P(D = R), where D = Σ_{i=1}^{9} X_i − Σ_{i=I+1}^{9} Y_i, (A.6)

noting that the Skellam distribution arises as the difference of two independent Poisson random variables. If I = 9, this is simply

f(9, R | λ) = P(Poisson(9λ) > R) + (1/2) P(Poisson(9λ) = R).

Then, we estimate λ separately for each league-season by computing each team's mean runs allowed in each half-inning, and then averaging over all teams.
In Figure 20a we visualize the estimated f according to our Poisson model (A.2) using the 2019 NL λ. We see that f is decreasing in R, increasing in I, convex in the tails of R, and smooth. Nonetheless, some of the win probability values from this model are unrealistic. For instance, it implies the probability of winning the game after shutting out the opposing team through 9 innings is about 99%, which is too high, and the probability of winning the game after allowing 10 runs through 9 innings is about 1%, which is too low. The win probability values at both tails of R are too extreme in our original Poisson model (A.6) because we assume both teams have the same mean runs per inning λ. This is an unrealistic assumption: in real life, a baseball season involves teams of varying strength playing against each other. When teams of differing batting strength play each other, win probabilities differ. For instance, when a great hitting team allows 7 runs to a terrible hitting team, the great hitting team has a larger probability of coming back to win the game than a worse hitting team would. Thus, accounting for random differences in team strength across games should flatten the f(I, R) grid.
On this view, it is more realistic to assume the pitcher's team and the opposing team have their own runs-scored-per-inning parameters,

X_i ~ Poisson(λ_X) and Y_i ~ Poisson(λ_Y), independently across innings.

Moreover, to capture the variability in team strength across each of the 30 MLB teams, we impose a positive normal prior,

λ_X, λ_Y ~ N⁺(λ, σ²_λ), i.i.d. (A.7)

We estimate the prior hyperparameters λ and σ²_λ separately for each league-season by computing each team's mean and variance of the runs allowed in each half-inning, respectively, and then averaging over all teams.
Given λ_X and λ_Y, we compute Formula (A.6) similarly as before using the Poisson and Skellam distributions. We use Monte Carlo integration with B = 100 samples to estimate the posterior mean grid,

f(I, R) ≈ (1/B) Σ_{b=1}^{B} f(I, R | λ_X^(b), λ_Y^(b)), (A.8)

where λ_X^(b) and λ_Y^(b) are i.i.d. samples from the prior distribution (A.7).
In Figure 20b we visualize the estimated f according to this Poisson model (A.8), with prior (A.7), using the 2019 NL λ and σ²_λ. We see that f is mostly linear in R, rather than convex, and the values of f when R is large are highly unrealistic. For instance, this model indicates that the probability of winning the game after allowing 10 runs through 9 innings is about 38%, which is way too high. This is because our model is overdispersed, i.e., the estimated prior variance σ²_λ is too large. For example, too large a σ²_λ allows λ_X and λ_Y to be very far apart, so if a pitcher allows 10 runs through 9 innings and λ_X is much larger than λ_Y, then his team will have a significant chance of coming back to win.
To resolve the overdispersion issue, we introduce a tuning parameter k designed to tune the dispersion across team strengths to match observed data,

λ_X, λ_Y ~ N⁺(λ, k·σ²_λ), i.i.d. (A.9)

In particular, we use k = 0.28, which minimizes the log-loss between the observed win/loss column and predictions from the induced grid f(I, R | λ, σ²_λ, k). In Figure 21 we visualize the estimated f according to our Poisson model (A.8), with tuned dispersion prior (A.9), using the 2019 NL λ and σ²_λ. We see that f is decreasing in R, increasing in I, and convex when R is large. In particular, it looks like a smoothed version of the XGBoost grid from Figure 19b. Additionally, the values of the grid at both tails of R seem reasonable. For instance, the model indicates that allowing 0 runs through 9 innings yields about a 97% win probability, which is more reasonable than before. For all of these reasons, we use this model for the grid f to compute Grid WAR for starting pitchers.

B Estimating pitcher quality using Empirical Bayes
In this Section, we describe how we estimate pitcher quality. Given enough data, a pitcher's mean game WAR would suffice to capture his quality. In MLB, however, a pitcher starts only a limited number of games per season, so for many pitchers there is not enough data to use his mean game WAR alone to represent his quality. Therefore, in this Section we use a parametric Empirical Bayes approach in the spirit of Brown (2008) to devise shrinkage estimators µ_p^GWAR and µ_p^FWAR, built from Grid WAR and FanGraphs WAR respectively, to represent pitcher p's quality. In particular, the fewer games pitcher p has played, the more µ_p shrinks his mean game WAR towards the overall mean.

B.1 Empirical Bayes estimator of pitcher quality built from Grid WAR
To begin, index each starting pitcher by p ∈ {1, ..., P} and index pitcher p's games by g ∈ {1, ..., N_p}. Let X_pg denote pitcher p's observed Grid WAR in game g. After observing his N_p games, we model

    X_pg | µ_p ~ N(µ_p, σ_p²),  µ_p ~ N(µ, τ²).  (B.1)

In this model, µ_p represents pitcher p's unobservable "true" pitcher quality, i.e., his latent underlying mean game Grid WAR. Similarly, σ_p² represents pitcher p's latent game-by-game variance, i.e., the variance of his game Grid WAR about his mean. The prior parameters µ and τ² represent the mean and variance, respectively, of pitcher quality across all pitchers. In Figure 22 we visualize the game-level Grid WAR of four starting pitchers. While Grid WAR is not actually normally distributed, normality isn't too unreasonable an approximation (particularly for typical pitchers). In particular, we use a Gaussian model because it produces a good and interpretable estimator of pitcher p's latent quality, not because it is the most accurate distributional fit. We estimate pitcher p's quality µ_p using the posterior mean, which as a result of our normal-normal conjugate model (B.1) is

    E[µ_p | X_p1, ..., X_pN_p] = [ (N_p/σ_p²) X̄_p + (1/τ²) µ ] / [ N_p/σ_p² + 1/τ² ],  where X̄_p = (1/N_p) Σ_{g=1}^{N_p} X_pg.  (B.2)

The posterior mean is a weighted average of the observed mean game Grid WAR and the overall mean pitcher quality, weighted according to the variances σ_p² and τ² and the number of games played N_p. In particular, the more games a pitcher has played, the closer his estimated quality is to his observed mean game Grid WAR. Conversely, the fewer games he has played, the closer his estimated quality is to the overall mean quality.
Estimator (B.2), however, is defined in terms of unknown parameters µ, τ², and σ_p². Thus, to effectively use this estimator, we employ an Empirical Bayes approach in the spirit of Brown (2008). Specifically, in place of these parameters in Equation (B.2), we plug in their maximum likelihood estimates (MLEs), estimated from the data {X_pg}.
We begin finding the MLE by noting the marginal distribution of X_pg according to model (B.1),

    X_pg ~ N(µ, σ_p² + τ²), independently across p and g.  (B.4)

Therefore the log-likelihood of the full dataset {X_pg : 1 ≤ g ≤ N_p, 1 ≤ p ≤ P} is

    ℓ(µ, τ², {σ_p²}) = −(1/2) Σ_{p=1}^{P} Σ_{g=1}^{N_p} [ log(2π(σ_p² + τ²)) + (X_pg − µ)² / (σ_p² + τ²) ].

To find the MLE of µ, we set the derivative of the log-likelihood with respect to µ equal to 0 and solve for µ,

    µ = [ Σ_{p=1}^{P} (N_p / (σ_p² + τ²)) X̄_p ] / [ Σ_{p=1}^{P} N_p / (σ_p² + τ²) ].  (B.7)

We use a similar approach to find the MLE of τ² and σ_p². In particular, setting the derivative with respect to τ² equal to 0 yields

    Σ_{p=1}^{P} Σ_{g=1}^{N_p} [ (X_pg − µ)² / (σ_p² + τ²)² − 1 / (σ_p² + τ²) ] = 0.

Additionally, for each pitcher p, setting the derivative with respect to σ_p² equal to 0 yields

    (1/N_p) Σ_{g=1}^{N_p} (X_pg − µ)² = σ_p² + τ²,  (B.10)

or equivalently

    σ_p² = (1/N_p) Σ_{g=1}^{N_p} (X_pg − µ)² − τ².  (B.11)

These stationarity conditions have no joint closed-form solution, so Algorithm 1 iterates them until convergence.
Using our dataset of all starting pitchers from 2010 to 2018, we run Algorithm 1, yielding maximum likelihood estimators of µ, τ², and {σ_p²}. With ε = 10⁻⁵, the algorithm converges after just four iterations. Then, we plug these estimators into Formula (B.2), yielding parametric Empirical Bayes estimators of {µ_p}. In Figure 23 we compare these estimates {µ_p^GWAR} to each pitcher's observed mean game Grid WAR. For players with fewer games played (small gray dots), µ_p is shrunk towards the overall mean µ. For players with enough games played (large blue dots), µ_p is essentially pitcher p's mean game GWAR, lying on the line y = x.
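Since Algorithm 1 is referenced but not reproduced in this excerpt, the following Python sketch shows one way to iterate the stationarity conditions above. The initialization, the variance floor, and the root-finding bracket are our assumptions; the paper's Algorithm 1 may differ in its exact update order.

```python
import numpy as np
from scipy.optimize import brentq

def eb_pitcher_quality(X, eps=1e-5, max_iter=50):
    """Parametric Empirical Bayes estimates of latent pitcher quality.
    X: list of 1-d numpy arrays; X[p] holds pitcher p's game-level Grid WAR.
    Returns the plug-in posterior means of Formula (B.2)."""
    N = np.array([len(x) for x in X], dtype=float)
    xbar = np.array([x.mean() for x in X])

    # Crude initialization (an assumption; the paper's choice is not shown).
    mu = np.concatenate(X).mean()
    tau2 = max(np.var(xbar), 1e-8)
    sigma2 = np.array([max(np.var(x), 1e-8) for x in X])

    for _ in range(max_iter):
        v = sigma2 + tau2
        mu_new = np.sum(N * xbar / v) / np.sum(N / v)            # (B.7)

        # tau2 solves its score equation; assume a sign change in the bracket.
        def score(t2):
            w = sigma2 + t2
            return sum(np.sum((x - mu_new) ** 2) / wi**2 - n / wi
                       for x, n, wi in zip(X, N, w))
        tau2_new = brentq(score, 1e-10, 1e3)

        # (B.11), floored at a small positive value.
        mss = np.array([np.mean((x - mu_new) ** 2) for x in X])
        sigma2_new = np.maximum(mss - tau2_new, 1e-8)

        done = (abs(mu_new - mu) < eps and abs(tau2_new - tau2) < eps
                and np.max(np.abs(sigma2_new - sigma2)) < eps)
        mu, tau2, sigma2 = mu_new, tau2_new, sigma2_new
        if done:
            break

    # Plug the MLEs into the posterior mean, Formula (B.2).
    w = (N / sigma2) / (N / sigma2 + 1.0 / tau2)
    return w * xbar + (1.0 - w) * mu
```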
In Figure 24 we visualize starting pitcher rankings prior to the 2019 season according to µ_p (left) and the associated ranks R_p (right). Clayton Kershaw has the highest µ_p^GWAR and Ivan Nova has the lowest.

B.2 Empirical Bayes estimator of pitcher quality built from FanGraphs WAR
Our estimator µ_p^FWAR of latent pitcher quality, built from pitcher p's previous seasons' observed FanGraphs WAR, differs methodologically from our estimator built from Grid WAR in that FWAR is computed at the seasonal level while GWAR is computed at the game level. Accordingly, we slightly modify the procedure from the previous section. To begin, again index each starting pitcher by p ∈ {1, ..., P} and index pitcher p's games by g ∈ {1, ..., N_p}. Let X_pg denote pitcher p's unobserved FanGraphs WAR in game g. Note that we observe his total FWAR,

    X_p := Σ_{g=1}^{N_p} X_pg.  (B.12)

As before, we use model (B.1), which implies

    X_p | µ_p ~ N(N_p µ_p, N_p σ_p²).  (B.13)

Therefore the posterior mean of pitcher p's latent pitcher quality µ_p is

    E[µ_p | X_p] = [ (N_p/σ_p²)(X_p/N_p) + (1/τ²) µ ] / [ N_p/σ_p² + 1/τ² ].  (B.14)

This estimator is analogous to that from Equation (B.2), using FWAR instead of GWAR.
As before, we use a parametric Empirical Bayes approach to estimate each starting pitcher's latent quality from his FanGraphs WAR. In particular, we compute maximum likelihood estimates of µ, τ², and {σ_p²} using the FanGraphs data {X_p}, which we plug in to Formula (B.14). We again begin finding the MLE by noting the marginal distribution of X_p according to model (B.13),

    X_p ~ N(N_p µ, N_p σ_p² + N_p² τ²), independently across p.

Thus the log-likelihood of the full FanGraphs dataset {X_p} is proportional to

    −(1/2) Σ_{p=1}^{P} [ log(N_p σ_p² + N_p² τ²) + (X_p − N_p µ)² / (N_p σ_p² + N_p² τ²) ].
Setting the derivative of the log-likelihood with respect to σ_p² equal to 0 and solving, however, yields a trivial equation which doesn't identify σ_p². Intuitively, this is because we can't glean information about σ_p² without observing the game-level FanGraphs WAR X_pg. Therefore, in designing an iterative algorithm analogous to Algorithm 1 but for FanGraphs WAR, we eliminate Equation (B.11) (Step 3) and replace σ_p² in Steps 1 and 2 with a constant hyperparameter σ². We then choose the value of σ² which minimizes the RMSE between the resulting estimated pitcher quality µ_p and the observed mean game FWAR in a hold-out set. We detail the full procedure in Algorithm 2.
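As a rough illustration, here is a minimal sketch of the σ² tuning step. It assumes the MLEs of µ and τ² have already been computed and holds them fixed, whereas the full Algorithm 2 would re-run the MLE steps for each candidate σ²; the argument xbar_holdout and the search grid are hypothetical.

```python
import numpy as np

def fwar_quality(X_tot, N, mu, tau2, sigma2):
    # Plug-in posterior mean (B.14) with a shared game-level variance sigma2.
    # X_tot[p] is pitcher p's total FWAR over N[p] starts.
    xbar = X_tot / N
    w = (N / sigma2) / (N / sigma2 + 1.0 / tau2)
    return w * xbar + (1.0 - w) * mu

def tune_sigma2(X_tot, N, mu, tau2, xbar_holdout,
                grid=np.linspace(0.01, 2.0, 50)):
    # Grid-search sigma2 to minimize hold-out RMSE between estimated quality
    # and observed mean game FWAR (a simplification: mu and tau2 held fixed).
    rmse = [np.sqrt(np.mean((fwar_quality(X_tot, N, mu, tau2, s2)
                             - xbar_holdout) ** 2)) for s2 in grid]
    return grid[int(np.argmin(rmse))]
```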
Using the same dataset of starting pitchers from 2010 to 2018 as before, we run Algorithm 2, yielding maximum likelihood estimators of µ and τ². With ε = 10⁻⁵, the algorithm again converges after just four iterations. Then, we plug these estimators into Formula (B.14), yielding parametric Empirical Bayes estimators of {µ_p}. In Figure 25 we compare these estimates {µ_p^FWAR} to each pitcher's mean game FanGraphs WAR from 2010 to 2018. As before, for players with fewer games played (small gray dots), µ_p is shrunk towards the overall mean µ; for players with enough games played (large blue dots), µ_p is essentially pitcher p's mean game FWAR, lying on the line y = x.
The primary weakness of our Empirical Bayes approach is that it assumes latent pitcher quality µ_p is constant over the decade from 2010 to 2019. Player quality, however, is non-stationary over time. Therefore, in future work we suggest a similar Empirical Bayes approach to estimating pitcher quality, but downweighting data the further back in time it is observed (e.g., using exponential decay weighting as in Medvedovsky and Patton (2022)) in the posterior mean Formulas (B.2) and (B.14).
Figure 24 also shows the starting pitcher rankings prior to the 2019 season according to µ_p^FWAR; Ivan Nova again has the lowest such values. We see that there is a nontrivial difference between the pitcher quality estimates and rankings built from Grid WAR and those built from FanGraphs WAR.

C Estimating the Park Effects α
In this Section, we detail why we use ridge regression to estimate park effects. First, in Section C.1, we discuss existing park factors from ESPN, FanGraphs, and Baseball Reference. Then, in Section C.2, we discuss problems with these existing park effects. In Section C.3, we introduce our park effects model, designed to yield park effects which represent the expected runs scored in a half-inning at a ballpark above that of an average park, if an average offense faces an average defense. Then, in Sections C.4 and C.5, we conduct 2 simulation studies which show that ridge regression works better than other methods at estimating park effects. Then, in Section C.6, we show that our ridge park effects have better out-of-sample predictive performance than existing park effects from ESPN and FanGraphs. Finally, in Sections C.7 and 2.6, we discuss our final ridge park effects, fit on data from all half-innings from 2017 to 2019.

C.1 Existing Park Effects
ESPN's and FanGraphs' park factors for a given park are based on the ratio of runs created at home to runs created on the road,

    PF = [ (runs scored + runs allowed)_home / G_home ] / [ (runs scored + runs allowed)_road / G_road ],  (C.1)

with FanGraphs additionally regressing this ratio toward 1,

    PF_FG = w · PF + (1 − w),  (C.2)

where w is a regression weight determined by the number of years in the dataset (e.g., for a 3-year park factor, w = 0.8). Baseball Reference uses similar runs-based ratios for batters and for pitchers as base park factors. Then, they apply several adjustments on top of these base values. For instance, they adjust for the quality of the home team and for the fact that a batter doesn't face his own pitchers. These adjustments, however, are a long series of convoluted calculations, so we do not repeat them here.

C.2 Problems with Existing Park Effects
There are several problems with these existing runs-based park effects.
First, ESPN and FanGraphs do not adjust for offensive and defensive quality at all, and Baseball Reference adjusts for only a fraction of team quality. It is important to adjust for team quality in order to de-bias the park factors. For example, the Colorado Rockies play in the NL West, a division with good offensive teams such as the Dodgers, Giants, and Padres. So, by ignoring offensive quality in creating park factors, the Rockies' park factor may be an overestimate, since many of the runs scored at their park may be due to the offensive power of the NL West rather than the park itself. So, by ignoring team quality, the ESPN and FanGraphs park factors are biased. Baseball Reference's park factors adjust for the fact that a team doesn't face its own pitchers, albeit through a convoluted series of ad-hoc calculations. Although adjusting for not facing one's own pitchers slightly de-biases the park factors, it does not suffice as a full adjustment of the offensive and defensive quality of a team's schedule.
Second, these existing runs-based park effects do not come from a statistical model. This makes it difficult to quantitatively measure which park factors are "best", for instance via some loss function. In other words, without a model it is hard to establish that Baseball Reference's park factors are actually more accurate than FanGraphs' in any mathematical sense, beyond the claim that its derivation adjusts for certain biases (we discuss one way to make such a comparison in Section C.6). Another benefit of a statistical model is that it allows us to adjust for the offensive and defensive quality of a team and its opponents simultaneously. Finally, a statistical model gives the park factors a firm physical interpretation.
Hence, in this paper, we create our own park factors, which are the fitted coefficients of a statistical model that adjusts for team offensive and defensive quality.

C.3 Our Park Effects Model
In this Section, we introduce our park effects model, designed to yield park effects which represent the expected runs scored in a half-inning at a ballpark above that of an average park, if an average offense faces an average defense.
We index each half-inning in our dataset by i, each park by j, and each team-season by k. We define the park matrix P so that P_ij is 1 if the i-th half-inning is played in park j, and 0 otherwise. Similarly, we define the offense matrix O so that O_ik is 1 if the k-th team-season is on offense during the i-th half-inning, and 0 otherwise, and we define the defense matrix D so that D_ik is 1 if the k-th team-season is on defense during the i-th half-inning, and 0 otherwise. We denote the runs scored during the i-th half-inning by y_i. Then, we model y_i using a linear model,

    y_i = β_0 + Σ_j P_ij β_j^(park) + Σ_k O_ik β_k^(off) + Σ_k D_ik β_k^(def) + ε_i,  (C.5)

where ε_i is mean-zero noise. The coefficients are fit relative to the first park, ANA (the Anaheim Angels' park), and relative to the first team-season, ANA2017 (the Angels in 2017). By including distinct coefficients for each offensive team-season and each defensive team-season, we adjust for offensive and defensive quality simultaneously in fitting our park factors. Finally, in order to make our park effects represent the expected runs scored in a half-inning at a ballpark above that of an average park, we subtract the mean park effect from each park effect,

    α_j := β_j^(park) − (1/n_parks) Σ_{j'} β_{j'}^(park).  (C.10)

C.4 First Simulation Study
We have a park effects model, Formula (C.5), but it is not immediately obvious which algorithm we should use to fit the model. In particular, due to multicollinearity in the observed data matrix X, ordinary least squares is sub-optimal. Hence we run a simulation study to test various methods of fitting model (C.5); the method which best recovers the "true" simulated park effects, ridge regression, is the park factor algorithm we use in computing Grid WAR.
Simulation setup. In our first simulation study, we assume that the park, team offensive quality, and team defensive quality coefficients are independent. Specifically, we simulate 25 "true" coefficient vectors {β^[m]}_{m=1}^{25} according to Formula (C.11). Then, we assemble our data matrix X to consist of every half-inning from 2017 to 2019, and simulate 25 "true" outcome vectors {y^[m]}_{m=1}^{25}, drawing each y_i from a truncated normal distribution, denoted N⁺, in order to make y_i nonnegative. We round y_i so that it is a nonnegative integer, since y_i represents the runs scored in the i-th half-inning. Although we don't directly simulate ε_i, our simulated y_i still adheres to model (C.7), as it has mean X_{i*}β. We choose the values in Formula (C.11) so that the simulated outcome vectors {y^[m]}_{m=1}^{25} seem reasonable in representing the runs scored in a half-inning.
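A sketch of the outcome simulation follows; the noise scale sd is an assumption, since the value used in the paper is not shown in this excerpt.

```python
import numpy as np
from scipy.stats import truncnorm

def simulate_outcomes(X, beta, sd=1.0, rng=None):
    # Draw half-inning runs from a truncated normal N+(X beta, sd^2),
    # truncated at zero, then round to a nonnegative integer.
    mean = X @ beta
    a = (0.0 - mean) / sd  # truncation point at zero, in sd units
    y = truncnorm.rvs(a, np.inf, loc=mean, scale=sd, random_state=rng)
    return np.rint(y).astype(int)
```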
Our goal is to recover the park effects β^(park), so our evaluation metric for an estimated coefficient vector is the average, across the 25 simulations, of the error between the fitted and "true" park effects (Formula (C.13)). Note that it doesn't make sense to compare the existing ESPN and FanGraphs park factors to park effect methods based on model (C.5) as part of this simulation study, because the former are not based on model (C.5); in fact, ESPN and FanGraphs park effects are not based on any statistical model. Rather, in Section C.6, we separately compare these existing park factors to our model-based park factors.
Method 1: OLS without adjusting for team quality. The naive method of estimating the park factors β^(park) is ordinary least squares regression while ignoring team offensive quality and team defensive quality, as done in Baumer et al. (2015, Formula 11). In other words, fit the park coefficients using OLS on the following model,

    y_i = β_0 + Σ_j P_ij β_j^(park) + ε_i.

In failing to adjust for offensive and defensive quality, we expect this method to perform poorly.
Method 2: OLS. Next, we adjust for offensive quality, defensive quality, and park simultaneously using ordinary least squares (OLS) on model (C.5). This method is similar to that from Acharya et al. (2008), although they compute game-level park factors and we compute half-inning-level park factors. This yields an unbiased estimate of the park effects, and so we expect this method to perform better than the previous one.
Method 3: Three-Part OLS. Although OLS using model (C.5) is unbiased, the fitted coefficients have high variance due to the multicollinearity of the data matrix X. In particular, the park matrix P is correlated with the offensive team matrix O and the defensive team matrix D because in each half-inning, either the team on offense or the team on defense is the home team. We may visualize the collinearity in X by denoting the half-innings (rows) in which the road team is batting by road, denoting the half-innings in which the home team is batting by home, and writing X (e.g., for one season of data) as

    X = [ 1   P_road,*   O_road,*   P_road,*
          1   P_home,*   P_home,*   D_home,* ].  (C.15)

To address this collinearity issue, we propose a three-part OLS algorithm. First, we estimate the offensive quality coefficients during half-innings in which the road team is batting. We do so via OLS on the following model,

    y_i = β_0 + Σ_j P_ij γ_j + Σ_k O_ik β_k^(off) + ε_i,  for half-innings i in which the road team bats,

where the γ_j are nuisance coefficients absorbing park and home-team defense. This yields a decent estimate of β^(off), in particular because for one season of innings P_road,* = D_road,*, and for multiple seasons of innings P_road,* ≈ D_road,*.
Second, we estimate the defensive quality coefficients during half-innings in which the home team is batting. We do so via OLS on the following model,

    y_i = β_0 + Σ_j P_ij γ_j + Σ_k D_ik β_k^(def) + ε_i,  for half-innings i in which the home team bats.

This yields a decent estimate of β^(def), in particular because P_home,* ≈ O_home,*.
Third, we use the fitted team quality coefficients β^(off) and β^(def) on all half-innings to obtain the park effects. Specifically, we run OLS on the following model,

    y_i − Σ_k O_ik β_k^(off) − Σ_k D_ik β_k^(def) = β_0 + Σ_j P_ij β_j^(park) + ε_i,

yielding fitted park coefficients β^(park).
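A minimal sketch of the three steps, under the assumptions that P, O, and D are one-hot numpy matrices and road is a boolean mask of road-team-batting half-innings; the exact step details in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def three_part_ols(P, O, D, y, road):
    # Step 1: offense coefficients from road half-innings (defense ~ park).
    off_fit = LinearRegression().fit(np.hstack([P[road], O[road]]), y[road])
    beta_off = off_fit.coef_[P.shape[1]:]
    # Step 2: defense coefficients from home half-innings (offense ~ park).
    home = ~road
    def_fit = LinearRegression().fit(np.hstack([P[home], D[home]]), y[home])
    beta_def = def_fit.coef_[P.shape[1]:]
    # Step 3: regress team-quality-adjusted runs on park alone.
    resid = y - O @ beta_off - D @ beta_def
    park_fit = LinearRegression().fit(P, resid)
    return park_fit.coef_ - park_fit.coef_.mean()  # center as in (C.10)
```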
Method 4: Ridge. Finally, we use ridge regression to fit model (C.5). In the presence of multicollinearity, ridge regression coefficient estimates may improve upon OLS estimates by introducing a small amount of bias in order to reduce the variance of the estimates (Hoerl and Kennard, 1970). We tune the ridge parameter λ using cross validation.

Then, using the observed data matrix X, which consists of all half-innings from 2017 to 2019, we run each of these four methods, yielding parameter estimates, and evaluate them using the average simulation error from Formula (C.13). We report the results in Table 7.

    method                                      simulation error
    Ridge                                       0.0244
    OLS                                         0.0326
    Three-part OLS                              0.0330
    OLS without adjusting for team quality      0.0610

Table 7: Results of our first simulation study.
The OLS estimator without adjusting for team quality performs worst, as ignoring team quality leads to a biased estimate of the park effects. OLS and three-part OLS, which include proper adjustments for team quality, perform similarly and are second best. Three-part OLS turns out not to be an improvement over OLS: despite the multicollinearity, there is enough linear independence between the batting-road and batting-home half-innings to obtain reasonably accurate team quality estimates. Also, if X contains multiple years' worth of data, three-part OLS leads to slightly biased estimates of β^(off) and β^(def), since steps one and two adjust only for park. Additionally, three-part OLS uses half as much data to estimate team quality, which matters because the outcome variable, half-inning runs, is so noisy. Lastly, ridge regression performs best, significantly better than OLS. In Figure 26, we visualize one of the 25 simulations by plotting the "true" park effects against the ridge estimates and the OLS estimates. We see that OLS is biased, whereas ridge lies more evenly around the line y = x.

Figure 26: For one of the 25 simulations from our first simulation study from Section C.4, we plot the "true" park effects against the ridge estimates and the OLS estimates. The line y = x, shown in black, represents a perfect fit between the "true" and fitted park effects. The OLS estimates are biased, whereas the ridge estimates lie more evenly around the line y = x.

C.5 Second Simulation Study
A primary criticism of the first simulation study from Section C.4 is that in actual baseball, the offensive and defensive quality coefficients are not independent. Rather, oftentimes offensive and defensive qualities are correlated within MLB divisions. For instance, in 2021, the Rays, Red Sox, Yankees, and Blue Jays of the AL East each had at least 91 wins, and so were all good offensive teams. Correlated offensive and defensive qualities within divisions introduce additional collinearity into the data matrix X, since teams play other teams within their division at a disproportionately high rate.
Another criticism of the first simulation study is that it treats Colorado's park effect as a draw from the same distribution as the other park effects, whereas in real life we know Colorado's park effect is an outlier as a result of the high altitude.
So, in this Section, we conduct a second simulation study which incorporates intra-divisional collinearity and forces Colorado's park effect to be an outlier. Specifically, we simulate 25 "true" coefficient vectors under a design in which team qualities are correlated within division and Colorado's park effect is drawn to be an outlier. The remainder of the second simulation study proceeds identically to the first simulation study from the previous section.
We report the results of our second simulation study in Table 8. Again, ridge regression performs best. In particular, ridge performs significantly better than the other methods on the outlier Colorado, and better than the other methods on the other parks. In Figure 27, we visualize one of the 25 simulations by plotting the "true" park effects against the ridge estimates and the OLS estimates. We see that ridge regression successfully fits the Colorado park effect, whereas OLS significantly overestimates it. As an outlier, Colorado exerts high leverage over the rest of the OLS park effects, swaying their estimates upwards. One might suggest removing the outlier Colorado from the dataset and estimating it separately, but doing so weakens the estimates of the other teams in its division, as it removes too many games from the set schedule which determines the data matrix X. So, in both the first simulation study from Section C.4 and the second simulation study from this Section, ridge regression most successfully estimates the "true" simulated park effects.

Figure 27: For one of the 25 simulations from our second simulation study from Section C.5, we plot the "true" park effects against the ridge estimates and the OLS estimates. The line y = x, shown in black, represents a perfect fit between the "true" and fitted park effects. The OLS estimates are biased, whereas the ridge estimates lie more evenly around the line y = x. In particular, ridge regression much better captures the park effect of the outlier, Denver.

C.6 Comparing Existing Park Effects and Ridge Park Effects
Now, in this Section, we compare our ridge park factors, which perform best in our simulation studies from Sections C.4 and C.5, to existing park effects from ESPN and FanGraphs.
Transforming ESPN and FanGraphs park factors to an "additive" scale. Our ridge and OLS park effects of a ballpark, based on model (C.5), are "additive" in the sense that they represent the expected runs scored in a half-inning at that park above that of an average park, if an average offense faces an average defense. On the other hand, ESPN and FanGraphs park factors, defined in Formulas (C.1) and (C.2), are "multiplicative" in the sense that they represent the ratio of runs created at home to runs created on the road. Therefore, in order to compare these existing park factors to our park factors, we need to put them on the same scale. In particular, we transform the ESPN and FanGraphs park factors into "additive" park effects. To do so, we take the mean runs scored in a half-inning, ȳ = 0.5227 (C.22), and multiply it by a "multiplicative" park factor minus 1. For example, if the ESPN Colorado park factor α in 2019 is 1.34, representing that teams score 34% more runs in Colorado than at other parks, then the transformed "additive" ESPN park factor is

    (α − 1) · ȳ = (0.34) · (0.5227) = 0.178.  (C.23)

After this transformation, the ESPN and FanGraphs park factors also represent the expected runs scored in a half-inning at that park above that of an average park.
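For completeness, the transformation is a one-liner; ȳ is taken from (C.22), and the example reproduces the Colorado calculation in (C.23).

```python
Y_BAR = 0.5227  # mean runs scored in a half-inning, Formula (C.22)

def to_additive(pf, y_bar=Y_BAR):
    # Convert a "multiplicative" park factor to the "additive" scale:
    # expected runs per half-inning above an average park.
    return (pf - 1.0) * y_bar

print(round(to_additive(1.34), 3))  # 0.178, the Colorado example in (C.23)
```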
Visualizing these park effect methods. In Figure 28 we visualize the ridge, OLS, ESPN, and FanGraphs park effects (where the latter 2 are transformed to an "additive" scale), fit on all half-innings from 2017 to 2019. We use the park abbreviations from Retrosheet, our data source, as discussed in Section 2.2. As expected, we see that the ridge park factors are a shrunk version of the OLS park factors, and the FanGraphs park factors are a shrunk version of the ESPN park factors. The FanGraphs and ridge park factors are remarkably similar. Also, as expected, Coors Field (DEN02) has the largest park effect under all four methods. The Texas Rangers' ballpark in Arlington (ARL02) has the second highest park effect under all four methods. These 2 parks have significantly larger park effects than all the others. Additionally, the Mets' ballpark (NYC20) has the lowest park effect under all four methods.
Additionally, we visualize how these various park effects impact the Grid WAR of various starting pitchers. In Figure 29, we show the 2019 seasonal Grid WAR for a set of starters without park effects, with ridge park effects, and with (transformed) ESPN park effects. For most pitchers, the impact of including a park adjustment is small. For some pitchers, the impact of an ESPN park adjustment is massive. For instance, the GWAR of Mike Minor and Lance Lynn, who pitched for the Rangers in 2019, each increases by a staggering one whole WAR. Ridge park factors have a much more muted impact than ESPN park factors. This makes sense, as ridge regression shrinks the park coefficients closer to 0. For a few pitchers, however, even the ridge park effects make a nontrivial impact on their GWAR. This also makes sense, as some park effects, such as those of the Mets, Rockies, and Rangers, are far enough from zero. For instance, the GWAR of Noah Syndergaard and Jacob deGrom, who pitched for the Mets, each decreases by about one-quarter of a WAR.

Figure 28: Ridge, OLS, ESPN, and FanGraphs 2019 three-year park effects (where the latter 2 are transformed to an "additive" scale). NYC20 refers to Citi Field, NYC21 refers to Yankee Stadium II, CHI11 refers to Wrigley Field, and CHI12 refers to the White Sox's park.
Comparing these park effects quantitatively. In deciding which park effects to use in our final Grid WAR calculations, we quantitatively compare the ridge, OLS, ESPN, and FanGraphs park factors. In particular, we compare the out-of-sample predictive performance of these park effects.
We begin by fitting each park factor method using data from all half-innings from 2014 to 2016.

Figure 29: Grid WAR for a set of pitchers from 2019 without park effects (red triangles), with ridge park effects (blue squares), and with ESPN park effects (green circles).
Note that the OLS and ridge park factors adjust for team offensive and defensive quality, whereas the ESPN and FanGraphs park factors don't. So, in order to fairly compare which of these park factor methods is "best", we adjust for team quality in all of these methods. Specifically, using the fitted park factors α from 2014-2016 for a given method, we regress out team offense and team defense indicators via OLS on the following model,

    y_i = β_0 + Σ_j P_ij α_j + Σ_k O_ik β_k^(off) + Σ_k D_ik β_k^(def) + ε_i,  (C.24)

where P, O, and D are the data matrices consisting of all half-innings from 2017 to 2019 and the park effects α_j are held fixed at their fitted values. Then, using these adjusted models based on Formula (C.24), we predict the expected runs scored y_i in each half-inning i from 2017 to 2019; these are out-of-sample predictions relative to the park effects α, which were estimated on data from 2014-2016. Finally, we compute the out-of-sample RMSE.
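A sketch of this adjustment, treating the previously fitted park factors as a fixed offset; the matrix names follow the text, but reading Formula (C.24) as an offset regression is our interpretation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_predictions(P, O, D, y, alpha_hat):
    # Hold the 2014-16 park factors alpha_hat fixed as an offset and regress
    # the remainder on offense/defense indicators, then predict half-inning runs.
    offset = P @ alpha_hat
    reg = LinearRegression().fit(np.hstack([O, D]), y - offset)
    return offset + reg.predict(np.hstack([O, D]))
```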
Each of the four methods has the same out-of-sample RMSE, 1.504. For reference, the RMSE of the overall mean is 1.508, so park factors do improve prediction, albeit slightly. But, because the runs scored in a half-inning is so noisy, and because the differences across parks are so slight, out-of-sample RMSE isn't sensitive enough to quantitatively show which park factors have the best predictive performance.
To more clearly see the differences between methods, we instead compute an ecological RMSE to quantitatively compare the various park factor methods. We first fix a ballpark p. Then, for each park factor method, we take the mean of the vector of predicted runs scored in each half-inning at park p, yielding ŷ_p, and the mean of the vector of observed runs scored in each half-inning at park p, yielding y_p. Finally, we compute the RMSE of the regression of (y_p) on (ŷ_p) (each a vector of length 30, one entry per park), yielding the out-of-sample ecological RMSE. In Table 9 we show the out-of-sample ecological RMSE for several park factor methods. The ridge park effects perform best, outperforming the FanGraphs and ESPN park factors, mainly because ridge adjusts for offensive and defensive quality. The ridge park effects outperform the OLS park effects for the same reasons discussed in our simulation studies from Sections C.4 and C.5.
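A minimal sketch of the ecological RMSE computation, assuming y_obs, y_pred, and park_ids are aligned arrays over all 2017-19 half-innings; the variable names are illustrative.

```python
import numpy as np

def ecological_rmse(y_obs, y_pred, park_ids):
    # Collapse half-inning observations to per-park means (one point per park),
    # then compute the residual RMSE from regressing observed park means on
    # predicted park means.
    parks = np.unique(park_ids)
    y_bar = np.array([y_obs[park_ids == p].mean() for p in parks])
    yhat_bar = np.array([y_pred[park_ids == p].mean() for p in parks])
    slope, intercept = np.polyfit(yhat_bar, y_bar, 1)  # simple linear regression
    resid = y_bar - (slope * yhat_bar + intercept)
    return np.sqrt(np.mean(resid ** 2))
```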

    Park Effect       Ecological RMSE
    Ridge             0.01516
    FanGraphs         0.01578
    ESPN              0.01579
    OLS               0.01672
    Overall mean ȳ    0.04658

Table 9: Out-of-sample ecological RMSE of predicting the runs scored in a half-inning using various park effect methods.
Hence the ridge regression park effects lead to better out-of-sample predictions of the runs scored in a half-inning than existing park factor methods from ESPN and FanGraphs.

C.7 2019 Three-Year Park Effects
As discussed in Sections C.4 and C.5, the ridge park effects outperform other park effect methods in 2 simulation studies. Further, as discussed in Section C.6, ridge park effects outperform existing park effect methods from ESPN and FanGraphs. Therefore, we use ridge park effects in computing Grid WAR. In particular, we use ridge regression on our observed dataset {y, X} consisting of all half-innings from 2017 to 2019, tuning the ridge parameter λ using cross validation, to fit our park effects, shown in Figure 2b in Section 2.6.
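To make the final fit concrete, here is a minimal scikit-learn sketch. The column names are hypothetical, RidgeCV's built-in leave-one-out cross-validation stands in for the paper's unspecified cross-validation scheme, and one-hot encoding all levels and centering afterwards (rather than the paper's reference-level coding relative to ANA) is a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

def ridge_park_effects(df, alphas=np.logspace(-3, 3, 25)):
    """Fit model (C.5) by ridge regression and return centered park effects.
    df: one row per half-inning with columns 'runs', 'park', 'off_team',
    'def_team' (hypothetical names)."""
    # One-hot park, offense, and defense indicators; the intercept is
    # fit separately and left unpenalized by sklearn.
    X = pd.get_dummies(df[["park", "off_team", "def_team"]], dtype=float)
    model = RidgeCV(alphas=alphas).fit(X, df["runs"])  # CV over the ridge penalty
    coefs = pd.Series(model.coef_, index=X.columns)
    park = coefs.filter(like="park_")
    return park - park.mean()  # Formula (C.10): effects vs. the average park
```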