In order to fit such a regression model, we must begin with an estimate of the probability that the home team wins the game after leading by *L* points after *T* seconds, which we denote by *p*_{T,L}. Estimating win probability at specific intermediate times during a game is not a new problem; indeed, Lindsey (1963) estimated win probabilities in baseball in the 1960s and Stern (1994) introduced a probit regression model to estimate *p*_{T,L}. Maymin, Maymin, and Shen (2012) expanded on that probit model to study when to take starters in foul trouble out of a game, Bashuk (2012) considered empirical estimates of win probability to predict team performance in college basketball, and Pettigrew (2015) recently introduced a parametric model to estimate win probability in hockey. Intuitively, we believe that *p*_{T,L} is a smooth function of both *T* and *L*; for a fixed lead, the win probability should be relatively constant for a small duration of time. By construction, the probit model of Stern (1994) produces a smooth estimate of the win probability, and the estimates based on all games from the 2006–2007 to 2012–2013 regular seasons are shown in Figure 1(A), where the color of the unit cell [*T*, *T*+1]×[*L*, *L*+1] corresponds to the estimated value of *p*_{T,L}.

Figure 1: Various estimates of *p*_{T,L}. The probit estimates in (A), while smooth, do not agree with the empirical win probabilities shown in (B). Our estimates, shown in (C), are closer in value to the empirical estimates than are those in (A) but are much smoother than the empirical estimates.

To get a sense of how well the probit estimates fit the observed data, we can compare them to the empirical estimates of *p*_{T,L} given by the proportion of times that the home team has won after leading by *L* points after *T* seconds. The empirical estimates of *p*_{T,L} are shown in Figure 1(B).

We see immediately that the empirical estimates are rather different from the probit estimates: for positive *L*, the probit estimate of *p*_{T,L} tends to be much smaller than the empirical estimate of *p*_{T,L}, and for negative *L*, the probit estimates tend to overestimate *p*_{T,L}. This discrepancy arises primarily because the probit model is fit using only data from the ends of the first three quarters and does not incorporate any other intermediate times. Additionally, the probit model imposes several rather strong assumptions about the evolution of the win probability as the game progresses. As a result, we find the empirical estimates much more compelling than the probit estimates. Despite this, we observe in Figure 1(B) that the empirical estimates are much less smooth than the probit estimates. Also worrying are the extreme and incongruous estimates near the edges of the colored region in Figure 1(B). For instance, the empirical estimates suggest that the home team will always win the game if they trailed by 18 points after five minutes of play. Upon further inspection, we find that the home team trailed by 18 points after five minutes exactly once in the seven-season span from 2006 to 2013, and they happened to win that game. In other words, the empirical estimates are rather sensitive to small sample sizes, leading to extreme values that can heavily bias our response variables *y*_{i} in Equation 2.

To address these small sample issues in the empirical estimate, we propose a middle ground between the empirical and probit estimates. In particular, we let *N*_{T,L} be the number of games in which the home team has led by *ℓ* points after *t* seconds where *T*–*h*_{t}≤*t*≤*T*+*h*_{t} and *L*–*h*_{l}≤*ℓ*≤*L*+*h*_{l}, where *h*_{t} and *h*_{l} are positive integers. We then let *n*_{T,L} be the number of games which the home team won in this window and model *n*_{T,L} as a Binomial(*N*_{T,L}, *p*_{T,L}) random variable. This modeling approach is based on the assumption that the win probability is relatively constant over a small window in the (*T*, *L*)-plane. The choices of *h*_{t} and *h*_{l} dictate how many game states' worth of information is used to estimate *p*_{T,L}, and larger choices of both will yield, in general, smoother estimates of *p*_{T,L}. Since very few offensive possessions last six seconds or less and since no offensive possession can result in more than four points, we argue that the win probability should be relatively constant in the window [*T*–3, *T*+3]×[*L*–2, *L*+2] and we take *h*_{t}=3, *h*_{l}=2.
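As a rough illustration of the windowed counts, the following Python sketch tallies *N*_{T,L} and *n*_{T,L} from a list of game states. The function name `window_counts`, the `(t, lead, home_won)` tuple layout, and the toy data are assumptions made for this example and do not come from the original analysis. The sketch also assumes that every observed game state falling in the window is counted separately.

```python
def window_counts(states, T, L, h_t=3, h_l=2):
    """Return (N, n) for the window [T-h_t, T+h_t] x [L-h_l, L+h_l].

    states: iterable of (t, lead, home_won) tuples, where t is elapsed
    seconds, lead is the home team's lead at time t, and home_won is 1
    if the home team eventually won that game and 0 otherwise.
    (Hypothetical layout, chosen only for illustration.)
    """
    N = 0  # game states observed inside the window
    n = 0  # those states that ended in a home win
    for t, lead, home_won in states:
        if abs(t - T) <= h_t and abs(lead - L) <= h_l:
            N += 1
            n += int(home_won)
    return N, n


# Toy data: five made-up game states, not real NBA observations.
toy_states = [
    (420, 0, 1),   # inside the window, home win
    (423, 2, 0),   # inside the window, home loss
    (426, -2, 1),  # inside the window, home win
    (427, 0, 1),   # outside: 4 seconds from T
    (423, 3, 1),   # outside: lead is 3 points from L
]

N, n = window_counts(toy_states, T=423, L=0)
print(N, n)  # 3 2
```

Under the binomial model, `n` plays the role of the number of successes in `N` trials with success probability *p*_{T,L}.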

We place a conjugate Beta(*α*_{T,L}, *β*_{T,L}) prior on *p*_{T,L} and estimate *p*_{T,L} with the resulting posterior mean $${\widehat{p}}_{T\mathrm{,}L}\mathrm{,}$$ given by

$${\widehat{p}}_{T\mathrm{,}L}=\frac{{n}_{T\mathrm{,}L}+{\alpha}_{T\mathrm{,}L}}{{N}_{T\mathrm{,}L}+{\alpha}_{T\mathrm{,}L}+{\beta}_{T\mathrm{,}L}}\mathrm{.}$$

The value of *y*_{i} in Equation 2 is the difference between the estimated win probability at the end of the shift and at the start of the shift.

Based on the above expression, we can interpret *α*_{T,L} and *β*_{T,L} as “pseudo-wins” and “pseudo-losses” added to the observed counts of home team wins and losses in the window [*T*–3, *T*+3]×[*L*–2, *L*+2]. The addition of these “pseudo-games” tends to shrink the original empirical estimates of *p*_{T,L} towards $$\frac{{\alpha}_{T\mathrm{,}L}}{{\alpha}_{T\mathrm{,}L}+{\beta}_{T\mathrm{,}L}}\mathrm{.}$$ To specify *α*_{T,L} and *β*_{T,L}, it is enough to describe how many pseudo-wins and pseudo-losses we add to each of the 35 unit cells [*t*, *t*+1]×[*ℓ*, *ℓ*+1] in the window [*T*–3, *T*+3]×[*L*–2, *L*+2]. We add a total of 10 pseudo-games to each unit cell, but the specific number of pseudo-wins depends on the value of *ℓ*. For *ℓ*<–20, we add 10 pseudo-losses and no pseudo-wins, and for *ℓ*>20, we add 10 pseudo-wins and no pseudo-losses. For the remaining values of *ℓ*, we add five pseudo-wins and five pseudo-losses. Since we add 10 pseudo-games to each cell, we add a total of *α*_{T,L}+*β*_{T,L}=350 pseudo-games to the window [*T*–3, *T*+3]×[*L*–2, *L*+2]. We note that this procedure does not ensure that our estimated win probabilities are monotonic in lead and time. However, the empirical win probabilities are far from monotonic themselves, and our procedure does mitigate many of these departures by smoothing over the window [*T*–3, *T*+3]×[*L*–2, *L*+2].
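The pseudo-count rule and the posterior mean can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cell_pseudo_counts`, `prior_params`, and `posterior_mean` are hypothetical, and the sketch assumes the full window of 7×5=35 unit cells is always available (it does not handle windows truncated at the start or end of a game).

```python
def cell_pseudo_counts(lead):
    """(pseudo-wins, pseudo-losses) added to one unit cell with this lead."""
    if lead < -20:
        return 0, 10
    if lead > 20:
        return 10, 0
    return 5, 5


def prior_params(L, h_t=3, h_l=2):
    """Sum the per-cell pseudo-counts over the window to get (alpha, beta)."""
    n_times = 2 * h_t + 1  # 7 time values in the window
    alpha = beta = 0
    for lead in range(L - h_l, L + h_l + 1):  # 5 lead values
        wins, losses = cell_pseudo_counts(lead)
        alpha += n_times * wins
        beta += n_times * losses
    return alpha, beta


def posterior_mean(n, N, alpha, beta):
    """Posterior mean of p under a Beta(alpha, beta) prior and n wins in N."""
    return (n + alpha) / (N + alpha + beta)


print(prior_params(0))    # (175, 175)
print(prior_params(20))   # (245, 105)

# Hypothetical window containing a single observed state, which the home
# team won: the raw proportion is 1.0, the shrunken estimate is near 0.5.
alpha, beta = prior_params(-18)
print(round(posterior_mean(1, 1, alpha, beta), 3))  # 0.501
```

The last call shows the shrinkage at work: with only one observation in the window, the 350 pseudo-games dominate and the estimate stays close to the prior mean.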

We find that for most combinations of *T* and *L*, *N*_{T,L} is much greater than 350; for instance, at *T*=423, we observe *N*_{T,L}=4018, 11,375, 17,724, 14,588, and 5460 for *L*=–10, –5, 0, 5, and 10, respectively. In these cases, the value of $${\widehat{p}}_{T\mathrm{,}L}$$ is driven more by the observed data than by the values of *α*_{T,L} and *β*_{T,L}. Moreover, in such cases, the uncertainty of our estimate $${\widehat{p}}_{T\mathrm{,}L},$$ which can be measured by the posterior standard deviation of *p*_{T,L}, is exceedingly small: for *T*=423 and –10≤*L*≤10, the posterior standard deviation of *p*_{T,L} is between 0.003 and 0.007. When *N*_{T,L} is comparable to or much smaller than 350, the values of *α*_{T,L} and *β*_{T,L} exert more influence on the value of $${\widehat{p}}_{T\mathrm{,}L}\mathrm{.}$$ The increased influence of the prior on $${\widehat{p}}_{T\mathrm{,}L}$$ in such rare game states helps smooth over the extreme discontinuities that are present in the empirical win probability estimates above. In these situations, there is a larger degree of uncertainty in our estimate $${\widehat{p}}_{T\mathrm{,}L}\mathrm{,}$$ but we find that the posterior standard deviation of *p*_{T,L} never exceeds 0.035. The uncertainty in our estimation of *p*_{T,L} leads to additional uncertainty in the *y*_{i}’s, akin to measurement error. The error term in Equation 2 is meant to capture this additional uncertainty, as well as any inherent variation in the change in win probability unexplained by the players on the court.
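For reference, the posterior standard deviation mentioned above follows from the standard variance formula for a Beta distribution, since the posterior of *p*_{T,L} is Beta(*n*_{T,L}+*α*_{T,L}, *N*_{T,L}–*n*_{T,L}+*β*_{T,L}). The sketch below is a minimal illustration; the function name `posterior_sd` is hypothetical, and the win count used in the example (an even split of the 17,724 observations at *T*=423, *L*=0) is an assumption, since the observed number of wins is not reported here.

```python
import math


def posterior_sd(n, N, alpha, beta):
    """Standard deviation of the Beta(n + alpha, N - n + beta) posterior."""
    a = n + alpha
    b = (N - n) + beta
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


# Large window: N = 17,724 with an ASSUMED even split of wins and losses.
sd_large = posterior_sd(n=8862, N=17724, alpha=175, beta=175)
print(sd_large)

# Empty window: the posterior is just the Beta(175, 175) prior.
sd_prior = posterior_sd(n=0, N=0, alpha=175, beta=175)
print(sd_prior)
```

With these inputs the first value falls inside the 0.003 to 0.007 range quoted above, and the second is several times larger, which reflects how the uncertainty grows as the observed counts shrink.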
