Abstract
As a baseball game progresses, batters appear to perform better the more times they face a particular pitcher. The apparent drop-off in pitcher performance from one time through the order to the next, known as the Time Through the Order Penalty (TTOP), is often attributed to within-game batter learning. Although the TTOP has largely been accepted within baseball and influences many managers’ in-game decision making, we argue that existing approaches to estimating the size of the TTOP cannot disentangle continuous evolution in pitcher performance over the course of the game from discontinuities between successive times through the order. Using a Bayesian multinomial regression model, we find that, after adjusting for confounders like batter and pitcher quality, handedness, and home field advantage, there is little evidence of strong discontinuity in pitcher performance between times through the order. Our analysis suggests that the start of the third time through the order should not be viewed as a special cutoff point in deciding whether to pull a starting pitcher.
Funding source: Wisconsin Alumni Research Foundation
Acknowledgments
The authors thank Tom Tango for his comments on an early draft of this paper. The authors acknowledge the High Performance Computing Center (HPCC) at The Wharton School, University of Pennsylvania for providing computational resources that have contributed to the research results reported within this paper.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.

Research funding: Support for S.K.D. was provided by the University of Wisconsin–Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Appendix A: Our code and data
Our code is available on GitHub.^{[3]} The data_wrangling folder of the GitHub repository contains our dataset processing, including the Retrosheet data scraper. The data folder further processes the full dataset into a smaller dataset relevant for this paper. Finally, the model_positive_slope_prior folder contains our data analysis, including our Stan model.
The final datasets used in this paper are available for download.^{[4]} The cleaned dataset of all MLB plate appearances from 1990 to 2020 is retro_final_PA_19902020d.csv. The datasets
Appendix B: Model simulation study
We conduct a simulation study to assess the capacity of our model (Equation (2)) to estimate time through the order penalties of various sizes. Specifically, we simulate data consistent with different TTOPs and verify that our posterior estimates are close to the data generating parameters.
B.1 Simulation setup
For our first simulation, we generate data consistent with continuous pitcher fatigue and no TTOP for any of the plate appearance outcomes by setting β_{2k} = β_{3k} = 0 for each k ≠ 1. In our second simulation, for each k ≠ 1, we set β_{2k} and β_{3k} so that the resulting xwOBA curves display TTOPs consistent with Tango, Lichtman, and Dolphin (2007)’s findings of about 10 expected wOBA points between successive times through the order. Finally, for our third simulation, we set β_{2k} and β_{3k} so that there is no 2TTOP (in terms of xwOBA) but a large 3TTOP of about 50 wOBA points. For each simulation, we set the values of the α_{0k}’s, α_{1k}’s, and η_{k}’s in a way that is consistent with observed data. Additional details about the simulation setup, including the data generating parameter values, are available in Appendix C.
For each simulation, we generate 25 full seasons’ worth of data. We fit our model to 80 % of the data from each simulated season and evaluate our fitted model’s predictive performance on the remaining 20 %. We further assess how well our fitted model recovers the function xwOBA(t, x) for a set of average confounder values.
B.2 Simulation results
In all three simulation studies, we reliably recover the data generating parameters: averaged across all parameters, the estimated frequentist coverage of the marginal 95 % posterior credible intervals exceeds 92 % in each study. Importantly, the coverage of the 95 % posterior credible intervals for the discontinuity parameters β_{2k} and β_{3k} exceeds 91 % in each study. That is, for each simulated dataset, the 95 % credible intervals for the β_{2k}’s and β_{3k}’s usually contain the true data generating parameters. Furthermore, our model demonstrates good predictive capabilities (see Appendix C for details).
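The coverage calculation described above can be illustrated with a toy example. The following Python sketch is not our Stan model: it assumes a single parameter with Gaussian data of known unit variance and a flat prior, so that the posterior is available in closed form, and it estimates how often an equal-tailed 95 % interval built from posterior draws contains the data generating value. The function names, the number of datasets, and the parameter value are hypothetical choices made only for illustration.

```python
import random

def credible_interval(draws, level=0.95):
    """Equal-tailed interval from posterior draws via empirical quantiles."""
    s = sorted(draws)
    lo_idx = int((1 - level) / 2 * (len(s) - 1))
    hi_idx = int((1 + level) / 2 * (len(s) - 1))
    return s[lo_idx], s[hi_idx]

def simulate_coverage(true_theta, n_obs, n_datasets, n_draws, rng):
    """Fraction of simulated datasets whose 95 % interval contains true_theta."""
    hits = 0
    for _ in range(n_datasets):
        # Toy data: Gaussian observations with known unit variance (an assumption).
        data = [rng.gauss(true_theta, 1.0) for _ in range(n_obs)]
        # Under a flat prior the posterior is Normal(sample mean, 1 / n_obs).
        post_mean = sum(data) / n_obs
        post_sd = (1.0 / n_obs) ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        lo, hi = credible_interval(draws)
        hits += lo <= true_theta <= hi
    return hits / n_datasets

rng = random.Random(0)
coverage = simulate_coverage(true_theta=0.03, n_obs=50,
                             n_datasets=200, n_draws=1000, rng=rng)
print(f"estimated coverage: {coverage:.2f}")
```

With only a few dozen simulated datasets, as in our study, an estimated coverage a few points below the nominal 95 % is within the range of Monte Carlo error.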
B.3 Simulation visualization
In each simulation, we visualize the trajectory of posterior expected wOBA over the course of the game for an average batter on the road facing an average pitcher with the same handedness. That is, we plot the sequence {xwOBA(t, x)}_{t=1}^{27}, with x set to these average confounder values.
Figure 6 shows the sequence of posterior means, 50 %, and 95 % credible intervals of xwOBA(t, x).
Appendix C: Simulation details
C.1 Data generating parameters
The exact data generating parameter values of β_{2k} and β_{3k} for our three simulation studies are shown in Table 4.

| Parameter | k = BB | k = HBP | k = 1B | k = 2B | k = 3B | k = HR |
| --- | --- | --- | --- | --- | --- | --- |
| β_{2k} for sim 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| β_{3k} for sim 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| β_{2k} for sim 2 | 2/65 | 0 | 4/65 | 2/65 | 0 | 2/65 |
| β_{3k} for sim 2 | 1/15 | 0 | 2/15 | 1/15 | 0 | 1/15 |
| β_{2k} for sim 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| β_{3k} for sim 3 | 1/10 | 1/10 | 3/10 | 1/10 | 1/10 | 3/20 |

Furthermore, in each of our simulation studies, we assume that pitchers fatigue linearly over the course of a game. The particular true parameter values of α_{0k} and α_{1k} used in each of our simulation studies are shown in Table 5.

| Parameter | k = BB | k = HBP | k = 1B | k = 2B | k = 3B | k = HR |
| --- | --- | --- | --- | --- | --- | --- |
| α_{0k} | −0.601 | −1.804 | −0.475 | −0.943 | −1.510 | −0.565 |
| α_{1k} | 0.00271 | 0.0122 | 0.00354 | 0.00635 | 0.0223 | 0.00926 |

Finally, in each of our simulation studies, we set the value of η to mimic fitted values from observed data. The particular true parameter values of η used in each of our simulation studies are shown in Table 6.

| Parameter | k = BB | k = HBP | k = 1B | k = 2B | k = 3B | k = HR |
| --- | --- | --- | --- | --- | --- | --- |
| η_{bat_quality} | 0.865 | 1.408 | 0.371 | 0.856 | 1.399 | 1.525 |
| η_{pit_quality} | 1.128 | 1.987 | 1.050 | 1.472 | 3.286 | 1.850 |
| η_{hand} | −0.201 | 0.166 | −0.0164 | −0.0420 | −0.462 | −0.0958 |
| η_{home} | 0.0792 | −0.0776 | 0.0245 | −0.00103 | 0.107 | 0.0230 |
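
To make the role of these parameters concrete, the following Python sketch generates plate appearance outcomes from a multinomial logit model using the values in Tables 4 to 6 (simulation 2). It is a sketch under stated assumptions rather than a reproduction of our simulation code: it assumes the log-odds of outcome k relative to an out take the form α_{0k} + α_{1k} t + β_{2k} 1{10 ≤ t ≤ 18} + β_{3k} 1{19 ≤ t ≤ 27} + x·η_{k}, and the confounder vector `x_example` is a hypothetical value chosen only for illustration.

```python
import math
import random

# "Out" is the reference category (k = 1); the others follow the table column order.
OUTCOMES = ["Out", "BB", "HBP", "1B", "2B", "3B", "HR"]

# Table 5: intercepts and linear fatigue slopes.
ALPHA0 = [-0.601, -1.804, -0.475, -0.943, -1.510, -0.565]
ALPHA1 = [0.00271, 0.0122, 0.00354, 0.00635, 0.0223, 0.00926]
# Table 4, simulation 2: discontinuity parameters.
BETA2 = [2 / 65, 0, 4 / 65, 2 / 65, 0, 2 / 65]
BETA3 = [1 / 15, 0, 2 / 15, 1 / 15, 0, 1 / 15]
# Table 6: rows are bat_quality, pit_quality, hand, home.
ETA = [
    [0.865, 1.408, 0.371, 0.856, 1.399, 1.525],
    [1.128, 1.987, 1.050, 1.472, 3.286, 1.850],
    [-0.201, 0.166, -0.0164, -0.0420, -0.462, -0.0958],
    [0.0792, -0.0776, 0.0245, -0.00103, 0.107, 0.0230],
]

def log_odds(t, x, beta2, beta3):
    """Log-odds of each non-out outcome versus an out at batter number t (1..27).

    The functional form is an assumption of this sketch (see the lead-in).
    """
    in_2tto = 1.0 if 10 <= t <= 18 else 0.0
    in_3tto = 1.0 if 19 <= t <= 27 else 0.0
    values = []
    for k in range(6):
        confounders = sum(ETA[j][k] * x[j] for j in range(4))
        values.append(ALPHA0[k] + ALPHA1[k] * t
                      + beta2[k] * in_2tto + beta3[k] * in_3tto + confounders)
    return values

def outcome_probs(t, x, beta2=BETA2, beta3=BETA3):
    """Softmax over the seven outcomes, with the out category fixed at log-odds 0."""
    z = [0.0] + log_odds(t, x, beta2, beta3)
    m = max(z)
    w = [math.exp(v - m) for v in z]
    total = sum(w)
    return [v / total for v in w]

def simulate_outcome(t, x, rng):
    """Draw one plate appearance outcome at batter number t."""
    return rng.choices(OUTCOMES, weights=outcome_probs(t, x))[0]

# Hypothetical confounder values (bat_quality, pit_quality, hand, home).
x_example = [-0.7, -0.7, 1.0, 0.0]
rng = random.Random(1)
print([simulate_outcome(t, x_example, rng) for t in (1, 9, 10, 18, 19, 27)])
```

Under this assumed form, the log-odds jump between the last batter of one TTO and the first batter of the next is α_{1k} + β_{2k} at the start of 2TTO and α_{1k} + β_{3k} − β_{2k} at the start of 3TTO, which separates the continuous fatigue slope from the discontinuity.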
C.2 Predictive performance on simulated data
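Our model demonstrates good predictive capabilities. To get a general sense of our model’s performance, we use out-of-sample cross entropy loss, given by
−(1/n) Σ_{i=1}^{n} Σ_{k} 1{y_{i} = k} log p̂_{ik}, (19)
where n is the number of out-of-sample plate appearances, y_{i} is the observed outcome of plate appearance i, and p̂_{ik} is the predicted probability of outcome k for plate appearance i.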
For each of our three simulations, the out-of-sample cross entropy loss averaged over our 25 datasets is 1.05, 1.06, and 1.07, respectively. Using the empirical outcome probabilities instead yields an average out-of-sample cross entropy loss of 1.06, 1.08, and 1.08, respectively. It is reassuring that our model (barely) outperforms the observed base rates.
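As a minimal illustration of this comparison, the Python sketch below computes the mean cross entropy loss (in nats) of a set of predicted probabilities and applies it to base rates estimated from a training set. The helper names and the tiny three-category dataset are hypothetical and are not drawn from our simulated data.

```python
import math

def cross_entropy(pred_probs, outcomes):
    """Mean negative log predicted probability of the observed outcome."""
    total = 0.0
    for probs, y in zip(pred_probs, outcomes):
        total -= math.log(probs[y])
    return total / len(outcomes)

def base_rates(outcomes, n_categories):
    """Empirical proportion of each outcome category."""
    counts = [0] * n_categories
    for y in outcomes:
        counts[y] += 1
    return [c / len(outcomes) for c in counts]

# Hypothetical toy data: outcomes coded 0, 1, 2.
train = [0, 0, 0, 1, 0, 2, 0, 1, 0, 0]
test = [0, 1, 0, 0, 2]

rates = base_rates(train, 3)
loss = cross_entropy([rates] * len(test), test)
print(f"base-rate cross entropy: {loss:.4f}")
```

A fitted model is scored the same way, with its per-plate-appearance predicted probabilities in place of the constant base rates; lower loss is better. Note that the loss is undefined if a held-out outcome has an estimated probability of exactly zero.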
Appendix D: Observed model fit details
D.1 The impact of pitcher decline on the outcome of a plate appearance
In this section, we quantify the effect size of pitcher decline over the course of a game, again using the 2017 season as our primary example.
In particular, we examine how the probability of each outcome of a plate appearance changes over the course of a game. Specifically, we use the posterior distribution of
and the similarly defined
In Figure 7 we plot the posterior distribution of
Similarly, in Figure 8 we plot the posterior distribution of
Additionally, we examine how the expected wOBA of each outcome of a plate appearance changes over the course of a game. In particular, we compute the posterior distribution of the change in the expected wOBA of outcome k ≠ 1 from 1TTO to 2TTO, on average,
where w_{k} is the wOBA weight for outcome k as discussed in Section 2.4. Similarly, we define
In Figure 9 we plot the posterior distribution of
Similarly, in Figure 10 we plot the posterior distribution of
Furthermore, we aggregate the increase in the probability of each non-out plate appearance outcome k from one TTO to the next via expected wOBA, defined in Equation (8). In particular, recall from Section 3.2 that a pitcher declines by about 13 wOBA points from one TTO to the next, on average, which is consistent with the effect sizes from Figures 7 and 8. Figure 11 illustrates this via a histogram of the posterior samples of
D.2 Predictive performance on observed data
To get a general sense of our model’s performance on observed data, we run a five-fold cross validation to predict the probability of each plate appearance outcome for each plate appearance in 2017. The out-of-sample cross entropy loss, given by Formula (19), is 1.035. We compare our model’s cross entropy loss to that of other prediction strategies to better understand its performance. Consider a five-fold cross validation using the base rates of each plate appearance outcome: for each fold, we find the proportion of plate appearances in which each outcome occurs and compute the cross entropy loss using these base rates on the remaining out-of-sample plate appearances. For reference, in 2017, an out occurs in 67.6 % of plate appearances, a uBB in 7.8 %, an HBP in 0.9 %, a 1B in 14.9 %, a 2B in 4.8 %, a 3B in 0.45 %, and an HR in 3.5 %. The out-of-sample cross entropy loss of the base rates is 1.042, so our model very slightly outperforms the base rates. Finally, note that our model using raw batter and pitcher quality covariates, rather than logit-transformed batter and pitcher quality covariates, has a cross-validated out-of-sample cross entropy loss of 1.040. That the logit-transformed player quality covariates have better out-of-sample predictive performance helps justify using the logit transform.
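The base-rate benchmark can be sketched as follows. This Python example is illustrative only: it assumes outcomes are coded as integers, uses a hypothetical three-category outcome distribution rather than the 2017 data, and assigns plate appearances to folds by a random shuffle, which may differ from the fold assignment used in our analysis.

```python
import math
import random

def base_rate_cv_loss(outcomes, n_categories, n_folds=5, seed=0):
    """Cross-validated cross entropy loss of the base-rate predictor."""
    idx = list(range(len(outcomes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    total, n_scored = 0.0, 0
    for f in range(n_folds):
        # Estimate base rates on the other folds.
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        counts = [0] * n_categories
        for i in train:
            counts[outcomes[i]] += 1
        rates = [c / len(train) for c in counts]
        # Score the held-out fold with those rates.
        for i in folds[f]:
            total -= math.log(rates[outcomes[i]])
            n_scored += 1
    return total / n_scored

# Hypothetical outcome distribution used only to generate toy data.
rng = random.Random(42)
toy_outcomes = rng.choices([0, 1, 2], weights=[0.7, 0.2, 0.1], k=5000)
cv_loss = base_rate_cv_loss(toy_outcomes, n_categories=3)
print(f"five-fold base-rate loss: {cv_loss:.3f}")
```

For a large sample, this loss is close to the entropy of the outcome distribution, which is the floor that any model ignoring covariates cannot beat; a covariate-based model improves on it only to the extent that the covariates carry predictive information.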
D.3 The trend is persistent across years
In Figure 12 we show boxplots of the posterior distributions of the discontinuity parameters β_{2k} and β_{3k} − β_{2k} from our model (Equation (2)) fit separately on data from each season from 2012 to 2019. For some outcomes (e.g., walks), the posterior distributions are tightly concentrated around 0, and for other outcomes (e.g., triples and hit-by-pitches, which are rare events), the posterior distributions are quite wide, which is compatible with a large effect in either direction. Overall, the posterior distributions of the discontinuity parameters cover both positive and negative values, and most of them are centered around 0. In particular, we don’t see what we would expect to see if there were strong evidence for a TTOP (i.e., we don’t see the posterior distributions tightly concentrated around a positive number). Ultimately, we do not find the posterior distributions in Figure 12 to be consistent with large, systematic time through the order penalties.
In Figure 13 we plot the posterior distribution of xwOBA over the course of a game according to our model fit separately on data from each year from 2012 to 2019. We see that expected wOBA increases steadily over the course of a game, without significant discontinuity (in particular, significant upward discontinuity) between times through the order. The 2018 season is the only season in which we see an upward discontinuity in the posterior means, which occurs between 2TTO and 3TTO. This discontinuity, however, lies inside of the credible intervals and so is not significant.
Appendix E: Alternative models
E.1 A more flexible model: the indicator model
In Equation (2) we model pitcher decline over the course of a game as the combination of discontinuous decline from each TTO to the next and continuous linear pitcher decline across all the batters. A more flexible model would not enforce a particular functional form on within-game pitcher decline. In particular, the most flexible model has a separate coefficient for each batter t ∈ {1, …, 27},
With this more flexible model, the qualitative results of our study don’t change. For instance, as in Figure 4, in Figure 14 we plot the posterior distribution of the trajectory of expected wOBA over the course of a game, according to the indicator model from Equation (22) fit on data from 2017. We do not see a significant discontinuity in pitcher performance from one TTO to the next. In other words, we don’t find evidence of a strong batter discontinuity between times through the order. This trend is persistent across each year from 2012 to 2019.
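The relationship between the two parameterizations can be sketched in Python. The example below assumes that the structured trajectory for outcome k is α_{0k} + α_{1k} t + β_{2k} 1{10 ≤ t ≤ 18} + β_{3k} 1{19 ≤ t ≤ 27} and that the indicator model assigns one coefficient per batter sequence number; the function names are hypothetical. It shows that any structured trajectory is reproduced exactly by a suitable choice of the 27 indicator coefficients, so the indicator model nests the structured one.

```python
def structured_features(t):
    """Intercept, batter number, and 2TTO / 3TTO indicators for t in 1..27."""
    return [1.0, float(t),
            1.0 if 10 <= t <= 18 else 0.0,
            1.0 if 19 <= t <= 27 else 0.0]

def indicator_features(t):
    """One-hot encoding of the batter sequence number t in 1..27."""
    return [1.0 if s == t else 0.0 for s in range(1, 28)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def implied_indicator_coefs(alpha0, alpha1, beta2, beta3):
    """Per-batter coefficients that reproduce a structured trajectory exactly."""
    params = [alpha0, alpha1, beta2, beta3]
    return [dot(params, structured_features(t)) for t in range(1, 28)]

# Singles (k = 1B) values from Tables 4 and 5, simulation 2.
params_1b = [-0.475, 0.00354, 4 / 65, 2 / 15]
coefs = implied_indicator_coefs(*params_1b)

for t in (9, 10, 18, 19):
    structured = dot(params_1b, structured_features(t))
    flexible = dot(coefs, indicator_features(t))
    print(t, round(structured, 5), round(flexible, 5))
```

In the indicator parameterization, a time through the order penalty would appear as an unusually large difference between the coefficients for batters 9 and 10, or 18 and 19, relative to the differences between other adjacent batters.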
E.2 A more elaborate model: pitcher-specific and batter-specific effects
In our model from Equation (2), we make the simplifying assumption that the trajectory of within-game pitcher deterioration is the same across all pitchers and batters. Nonetheless, it is likely that pitcher performance declines at different rates for different players. To account for such heterogeneity, we extend our model by introducing player-specific rates of decline. Specifically, we model
where p(i) is the index of the pitcher and b(i) is the index of the batter in at-bat i. The pitcher-specific continuous decline parameters and batter-specific discontinuity parameters have Gaussian priors,
which themselves have priors,
With this more flexible model, the qualitative results of our study don’t change. For instance, as in Figure 4, in Figure 15 we plot the posterior distribution of the trajectory of expected wOBA over the course of a game, according to the player-specific model from Equation (23) fit on data from 2017. In particular, we use the posterior distributions of the prior means α_{0k}, α_{1k}, β_{2k}, and β_{3k} to compute the xwOBA trajectory for an average pitcher facing an average batter. We do not see a significant upwards discontinuity in expected wOBA from one TTO to the next. In other words, we find little evidence for a strong batter discontinuity between times through the order. This trend is persistent across each year from 2012 to 2019.
References
Brown, L. D. 2008. “In-Season Prediction of Batting Averages: A Field Test of Empirical Bayes and Bayes Methodologies.” Annals of Applied Statistics 2 (1): 113–52. https://doi.org/10.1214/07aoas138.
Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. https://doi.org/10.18637/jss.v076.i01.
Fangraphs. 2021. wOBA and FIP Constants. https://www.fangraphs.com/guts.aspx?type=cn.
Gelman, A., and D. B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7: 457–72. https://doi.org/10.1214/ss/1177011136.
Greenhouse, J. 2011. Spitballing: Fourth Time’s the Harm. https://www.baseballprospectus.com/news/article/13117/spitballingfourthtimestheharm/.
Laurila, D. 2015. Managers on the Third Time through the Order. https://blogs.fangraphs.com/managersonthethirdtimethroughtheorder/.
Lichtman, M. 2013. Baseball ProGUESTus: Everything You Always Wanted to Know about the Times through the Order Penalty. https://www.baseballprospectus.com/news/article/22156/.
R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
Rivera, J. 2020. Rays’ Kevin Cash Explains Decision to Pull Blake Snell in World Series: ’I Regret it Because it Didn’t Work Out’. https://www.sportingnews.com/us/mlb/news/kevincashblakesnellworld seriesexplained/lfnyfc4nqwys1pcncc2lnyjho.
Slowinski, P. 2010. wOBA. https://library.fangraphs.com/offense/woba/.
Stan Development Team. 2022. RStan: The R Interface for Stan.
Tango, T., M. Lichtman, and A. Dolphin. 2007. The Book: Playing the Percentages in Baseball. Washington, D.C.: Potomac Books.
© 2023 Walter de Gruyter GmbH, Berlin/Boston