Field goal percentage is a common measure of shooting skill and efficiency in the National Basketball Association (NBA), and general shooting prowess is often defined for players by their overall FG%. It can be used in its raw form, or as a component of more advanced metrics like true-shooting percentage (TS%) or effective field goal percentage (eFG%). Shooting percentages play a large role in influencing both fan and coaching evaluation of players, and are often used to predict future player performance when making decisions regarding free agency or draft selection.
Predicting a player’s FG% given past shooting is a difficult task. Shooting percentages are highly variable, especially on longer shots like 3-point attempts. For example, it takes roughly 750 3-point attempts before a player’s 3-point percentage (3P%) stabilizes, meaning that over half of the variation in it is explained by shooting skill rather than noise (Blackport 2014). Additionally, 3P% has been shown to be an unreliable metric in terms of its ability to discriminate between players and its stability from one season to the next (Franks et al. 2016). As the proportion of shot attempts taken as 3-pointers increases, with total attempts having risen nearly 50% over the last 8 years (Young 2016), overall FG% becomes more variable and less stable.
Part of the large variation in shooting percentages is likely due to the many contextual factors that contribute to the probability of a shot make. Improvements to FG% prediction have been made by including some of these covariates in shot-make prediction models (Piette, Sathyanarayan, and Zhang 2010; Cen et al. 2015). However, because of the small differences that separate the true shooting skill of players in the NBA, chance variation may also contribute significantly to the variation and instability of FG%. Optical tracking data of shot trajectories can potentially reduce noise in shooting metrics by allowing us to differentiate shots that rim out, air balls, and (unintentional) banks, giving us more information about players’ shooting skill with fewer shots. This idea has been demonstrated recently during practice shooting sessions, where FG% augmented by precise shot factor information gathered during these sessions improved the prediction of future shooting (Marty and Lucey 2017; Marty 2018). Accurate estimates of shot factors using optical tracking data from live games may allow for a similar improvement in the prediction of in-game shooting metrics.
In this paper, we seek to reduce the variation in predicting player FG% using NBA optical tracking data. We begin the paper by introducing a new estimator for FG%, RB-FG%, based on aggregating shot-make probabilities. Estimation of shot-make probabilities is then split into two main parts. First, using spatio-temporal information provided by the tracking data, we model shot trajectories in order to estimate the depth, left-right distance, and entry angle of balls entering the basket. Next, we use a regression model to estimate the probability of each shot going in. We define the average of these estimated probabilities, RB-FG%, as our new estimator of FG% for each player. Finally, we compare the predictive ability of the RB-FG% estimator to its raw counterpart that does not utilize trajectory information.
2 The Rao-Blackwellized estimator
In this section we introduce our new estimator for FG% based on shot-make probabilities. When trying to predict a player’s future FG% using their past FG%, each shot Xi is treated as a Bernoulli random variable with probability of success θ, where θ is a measure of the player’s true FG%. However, shot trajectories provided by optical tracking data give us more information for each shot than simply whether it is a make or a miss. Incorporating this information into a shot model may allow us to reduce the variance involved in estimating and predicting shooting skill. Therefore, we can define an alternative model where the probability of a shot-make varies depending on its trajectory, and shots are modeled as Beta-Bernoulli random variables
As shown below, inference under the model in which shots are treated as Bernoulli random variables and inference under the expected Beta-Binomial of our new model are the same (Skellam 1948). Let
Therefore, inference for θ is the same under the Bernoulli and expected Beta-Binomial distributions. Furthermore, suppose we obtain Xi (make or miss) and pi (the probability that shot i will go in). Let
Thus the RB-FG% is simply the conditional expectation of raw FG% given these shot-make probabilities pi. Because under the Beta-Binomial model pi is sufficient for θ, by the Rao-Blackwell Theorem we have:
Unfortunately, we are unable to know the true probability that a shot will go in. Therefore, as described below, we will use estimates of shot-make probabilities based on shot trajectory information to obtain an estimate of RB-FG%. Using an estimate of
3 Estimating shot-make probabilities
3.1 Measuring shot factors
In order to estimate shot-make probabilities, we first measure three shot factors based on how each shot entered the basket – left-right accuracy, depth, and entry angle – following the procedure of Marty and Lucey (2017). We define left-right accuracy as the deviation of the ball from the centre of the hoop as the ball crosses the plane of the basket (Figure 1A). Shot depth is defined as the distance of the ball from a tangent line through the front of the hoop as the ball crosses the plane of the basket (Figure 1A), with the front of the hoop adjusted to be from the perspective of the shooter. We specify the adjusted front of the rim as depth 0, so a shot crossing the basket plane at the center of the hoop has a depth of 9 inches. Finally, the entry angle is defined as the angle between the plane of the hoop and a tangent line through the ball as it is entering the basket (Figure 1B). See Marty and Lucey (2017) for further detail regarding these measurements.
To obtain these shot factor estimates, we use shot trajectory information provided by the SportVu optical tracking data from STATS LLC. The data provides measurements of the X and Y coordinates for all 10 players and X, Y, and Z coordinates of the ball 25 times per second. Our dataset consists of 1212 games from the 2014–2015 NBA regular season and 1206 games from the 2015–2016 regular season. We first restrict our analysis to 3-point shots as these shots have the most trajectory information and we can assume all shooters are attempting to hit the centre of the basket (no shot attempts purposely off the backboard). In total our dataset consists of trajectory information for 47,631 3-point shots from the 2014–2015 season and 49,876 3-point shots from the 2015–2016 season.
Although the optical tracking data gives X, Y, and Z coordinates of the ball at the basket, the location data is noisy, especially in measuring the height of the ball. To obtain a better estimate of the position of the ball near the basket we model a quadratic best fit line through the trajectory data given by the tracking database. If Zi is the height of shot i, and xi and yi are the X, Y coordinates of the shot in the tracking data, we use a quadratic polynomial to model the height, and estimate the coefficients by a least-squares regression:
We use the point where the model specifies the ball crosses 10 feet in height as the estimated X, Y location of the ball at the basket, and use this location to calculate the shot’s depth, left-right accuracy, and entry angle.
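The fit-and-intersect step above can be sketched in a few lines. This is an illustration only: it assumes height is modeled as a quadratic in the horizontal distance d travelled from the release point, which may differ from the exact parametrization of model (1), and the function names `fit_trajectory` and `rim_crossing` are hypothetical.

```python
import numpy as np

RIM_HEIGHT_FT = 10.0  # height of the basket plane


def fit_trajectory(d, z):
    """Least-squares fit of z = b0 + b1*d + b2*d**2 (assumed parametrization)."""
    d = np.asarray(d, dtype=float)
    z = np.asarray(z, dtype=float)
    X = np.column_stack([np.ones_like(d), d, d ** 2])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return beta


def rim_crossing(beta, height=RIM_HEIGHT_FT):
    """Return (distance, entry angle in degrees) where the fitted arc
    descends through `height`, or None if the arc never reaches it."""
    b0, b1, b2 = beta
    disc = b1 ** 2 - 4.0 * b2 * (b0 - height)
    if b2 >= 0 or disc < 0:
        return None  # not a downward-opening arc, or it stays below the rim
    # With b2 < 0 this is the larger root, i.e. the descending crossing.
    d_cross = (-b1 - np.sqrt(disc)) / (2.0 * b2)
    slope = b1 + 2.0 * b2 * d_cross  # negative on the way down
    entry_angle = np.degrees(np.arctan(-slope))
    return d_cross, entry_angle
```

Under this assumed parametrization, the X, Y location at the basket is recovered by moving `d_cross` along the horizontal direction of the shot, and depth and left-right accuracy are then measured from that location relative to the hoop.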
We compare the above model with a second model in which we try to leverage pre-existing knowledge of shot trajectories. We know each shot starts at the player’s location at the time of release (player location is less noisy than ball location in the tracking database) and ends around the basket. Therefore, we can improve estimation by biasing the start and end points of our modeled trajectories to incorporate this prior knowledge. To accomplish this we introduce a Bayesian regression model using pseudo-data to establish priors that reflect this knowledge. This is an informal empirical Bayes method where instead of using data to estimate the priors, we use prior knowledge of how the data should look. Given the quadratic model (1) for each shot, we can specify a Bayesian regression model with a conjugate Normal prior for β of the form
where un is the posterior mean of β, and Λn is the posterior precision matrix for β. We update the parameters twice, once using pseudo-data reflecting our prior knowledge of where shots start and finish, and a second time using the shot trajectory data from the optical tracking data. We specify 4 pseudo-data points, 2 at the start of the shot set at the X, Y coordinates of the player when the shot is released and at a height of 7 feet, and 2 set at the centre of the hoop and at 10 feet in height. After two Bayesian learning updates we take the posterior mean of β, u2, and use it as the estimate for the coefficients in the quadratic polynomial model (1).
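The two-stage update can be sketched with the standard conjugate Normal update for linear regression. Several choices here are assumptions made for illustration and are not taken from the paper: the noise variance is fixed at 1, the initial prior is a weak zero-mean prior, height is parametrized by horizontal distance d from the release point, and the names `design`, `conjugate_update`, and `bayes_trajectory` are hypothetical.

```python
import numpy as np


def design(d):
    """Design matrix for z = b0 + b1*d + b2*d**2 (assumed parametrization)."""
    d = np.asarray(d, dtype=float)
    return np.column_stack([np.ones_like(d), d, d ** 2])


def conjugate_update(mu, Lam, X, y):
    """One conjugate Normal update with unit noise variance (assumption).
    Returns the posterior mean and posterior precision matrix."""
    Lam_new = Lam + X.T @ X
    mu_new = np.linalg.solve(Lam_new, Lam @ mu + X.T @ y)
    return mu_new, Lam_new


def bayes_trajectory(d_obs, z_obs, d_hoop,
                     release_height=7.0, rim_height=10.0):
    """Update once with pseudo-data, then once with tracking data."""
    mu0 = np.zeros(3)          # weak initial prior (assumption)
    Lam0 = 1e-3 * np.eye(3)
    # Four pseudo-data points: two at release (d = 0, 7 ft) and
    # two at the centre of the hoop (d = d_hoop, 10 ft).
    d_pseudo = np.array([0.0, 0.0, d_hoop, d_hoop])
    z_pseudo = np.array([release_height] * 2 + [rim_height] * 2)
    mu1, Lam1 = conjugate_update(mu0, Lam0, design(d_pseudo), z_pseudo)
    # Second update with the observed trajectory.
    z_obs = np.asarray(z_obs, dtype=float)
    mu2, Lam2 = conjugate_update(mu1, Lam1, design(d_obs), z_obs)
    return mu2, Lam2
```

With a fixed noise variance, updating sequentially gives the same posterior as a single update on the pseudo-data and tracking data stacked together, so the pseudo-data act like four extra observations that pull the fitted arc toward the release point and the hoop.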
We then use (1) to compute the 3 shot factors for each shot using both the ordinary linear regression (OLR) and Bayesian regression approaches. Comparing the two models, we find both predict shots to have a mean depth value of 11′′, a mean left-right value of 0′′, and a mean entry angle around 45°. As in Marty and Lucey (2017), we find that shots entering the basket at 11′′ in depth, 2′′ deeper than the centre of the basket, and 0′′ in left-right accuracy are made with the highest percentage. However, we find shot depths are evenly distributed around 11′′, in contrast to the findings of Marty and Lucey (2017), who found that shooters have a mean shot depth value of 9′′, at the centre of the hoop. The variance in left-right distance and entry angle is similar between the two models; however, the variance in shot depth is much larger in the OLR model than in the Bayesian regression model (Figure 2). Overall, variances in shot factors under the Bayesian model match the variances of the precise shot factor measurements of Marty and Lucey (2017) more closely than those under the OLR model. Furthermore, we will see later that when we model shot probabilities the Bayesian model produces a lower misclassification rate and log loss than the OLR model. Moving forward, we use shot factors calculated via the Bayesian regression model.
We next compare the precision of our estimated shot factors to those measured by the Noah Shooting System, a dedicated hardware install found in practice facilities that provides shooting information not available in live games. Marty and Lucey (2017) were able to use the Noah system to define a Guaranteed Make Zone (GMZ), a region of shot-factor values in which shots are made over 90% of the time. Their GMZ is marked by shots with an entry angle of 45°, a left-right accuracy between -2′′ and 2′′, and a depth between 7′′ and 14′′. Using our estimated shot factors, we found shots in this GMZ are made only 85.2% of the time. This suggests that despite the Bayesian model, our shot factor estimates are still less precise than those gathered by the Noah system.
3.2 Modeling shot-make probabilities
In this section we train a shot-make probability model using 3-point shots from the 2014–2015 season. To obtain shot-make probabilities for each shot, we use the estimated shot factors described previously as covariates in a logistic regression:
where Si is an indicator function equal to 1 when a shot goes in and 0 when it misses, i indexes all 3-point shots from the 2014–2015 season (N = 47,631),
Although our Bayesian regression model biases shot trajectories toward the basket, some trajectories are still quite variable. Modeled trajectories that are too far from the raw data are removed and instead assigned a probability of 1 or 0 for a make or miss, respectively. We use factors from the remaining shots to estimate shot-make probabilities with model (2). To assess how accurate the model is we perform a 10-fold cross-validation to obtain the mean misclassification rate, as well as calculate the log loss and Brier score. We repeat this procedure with shot factors estimated from the OLR model, and the results are shown in Table 1.
Table 1: Mean misclassification rate, Brier score, and log loss of model (2).
|Misclassification rate||Brier score||Log loss|
Log loss and Brier scores are based on shot-make probability predictions from model (2) for 3-point shots from the 2015–2016 NBA season. The covariates are estimated via the Bayesian regression and OLR methods described in Section 3.1, while the Grand Mean is the league-wide 3P% for the 2014–2015 season. The mean misclassification rate is the result of 10-fold cross-validation.
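For reference, the three scores reported in Table 1 can be computed from observed outcomes and predicted probabilities as sketched below. The function names and the 0.5 classification threshold are assumptions made for illustration; the paper does not state the threshold used.

```python
import numpy as np


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))


def log_loss(y, p, eps=1e-12):
    """Mean negative Bernoulli log-likelihood; probabilities are clipped
    away from 0 and 1 so the logarithm stays finite."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def misclassification_rate(y, p, threshold=0.5):
    """Share of shots whose thresholded prediction disagrees with the outcome."""
    y = np.asarray(y, dtype=int)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p >= threshold).astype(int) != y))
```

A grand-mean baseline of the kind described in the table note corresponds to passing a constant vector for `p`, for example `np.full(len(y), league_rate)`, where `league_rate` is a placeholder for the previous season's league-wide 3P%.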
The covariates estimated via Bayesian regression resulted in a misclassification rate of 0.204. Therefore, our Bayesian model is able to predict makes/misses correctly about 80% of the time. This is a higher rate than many shot prediction models that use contextual covariates, like those presented in Cen et al. (2015), which utilize variables such as distance to basket and nearest defender to predict shot-makes with 65% accuracy. Similar to probabilities based on raw FG% (Marty and Lucey 2017), predicted shot-make probabilities are highest for shots at 11 inches of depth and 0 inches of left-right deviation, and are similar for shots with entry angles in the mid-40s. These can be seen in relation to the basket in Figure 3.
4 Applications of the RB-FG% estimator
4.1 Predicting three-point field goal percentage
In this section we aim to create a new estimate for player FG% by aggregating estimated shot-make probabilities given by (2). Without loss of generality, we focus first on 3-point shots for clarity of presentation. We gather shot trajectories for 3-point shots taken from the first half of the 2015–2016 NBA season in the SportVu tracking database (N = 24,855), and predict the probability of each shot going in using model (2) trained by shots taken in the 2014–2015 season. The mean of these estimated shot-make probabilities is the RB-FG% estimate,
Table 2: Mean absolute prediction errors of FG% estimators.
|Raw||Grand mean||RB||Shrunk raw||Shrunk RB|
As mentioned in Section 2, due to the uncertainty in our shot factor estimates resulting from the noise in the optical tracking data, we do not know each shot’s true make probability. We can analyze how sensitive our estimator
In addition to assessing prediction accuracy, we can also investigate whether the RB-FG% estimator produces more consistent player rankings than raw FG%. We calculate
Rao-Blackwellizing the estimator for FG% does reduce variance and improve the prediction accuracy, but these estimators are based on low sample sizes for most players. Players in our dataset take between 3 and 402 three-point attempts in the first half of the 2015–2016 season, far fewer than the number needed for 3P% to stabilize (see Section 1). We are able to further reduce the variance of
Table 2 shows that the shrunk-RB estimator is a better predictor than the shrunk-raw estimator, and this improvement is illustrated in Figure 5. Hence while Rao-Blackwellizing significantly improves prediction, leveraging knowledge about the distribution of 3P%’s can further improve the RB-FG% estimator (Efron and Morris 1977).
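To make the idea of shrinkage concrete, the sketch below pulls each player's estimate toward the grand mean by an amount that depends on how noisy that estimate is. It is a generic method-of-moments empirical Bayes shrinkage, offered as an assumption-laden illustration and not necessarily the shrinkage estimator used to produce Table 2; the function name and arguments are hypothetical.

```python
import numpy as np


def shrink_to_grand_mean(est, samp_var, grand):
    """Shrink per-player estimates toward `grand`.

    est      : per-player estimates (raw 3P% or RB-3P%)
    samp_var : per-player sampling variances of those estimates (must be > 0)
    grand    : value to shrink toward, e.g. the league-wide 3P%
    """
    est = np.asarray(est, dtype=float)
    samp_var = np.asarray(samp_var, dtype=float)
    # Method-of-moments estimate of the between-player variance in skill:
    # observed spread minus the average sampling noise, floored at zero.
    tau2 = max(0.0, float(np.var(est, ddof=1) - samp_var.mean()))
    # Weight on the player's own estimate; noisier estimates get less weight.
    weight = tau2 / (tau2 + samp_var)
    return grand + weight * (est - grand)
```

Under this sketch, a player with few attempts has a large sampling variance and is moved most of the way to the grand mean, while a high-volume shooter keeps most of their own estimate. Because the Rao-Blackwellized estimate has a smaller sampling variance than the raw one for the same number of attempts, it would be shrunk less.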
In addition to predicting future shooting, we can also use
4.2 Predicting true shooting percentage
Although we’ve focused on three-point shots, we are able to Rao-Blackwellize any shooting statistic provided we have enough trajectory information to accurately estimate shot factors. We now expand our selection of shots and try to improve predictions of TS% using our shot factor and shot probability models. We repeat the procedure described in Section 3 to estimate shot factors for all two-point shots and free throws in the 2014–2015 season, and use these to create separate Rao-Blackwellized two-point FG% and free throw percentage (FT%) estimates. As before, shots that do not have enough location data or result in trajectory predictions very far from the raw data are not included in training or prediction datasets. In total, shot-make probabilities are estimated for 21,153 out of 24,832 free throws and 21,890 out of 73,925 two-point shots, with remaining probabilities assigned as 1 or 0 for a shot make and miss, respectively. The new RB estimators are again used to predict two-point FG%, FT% and TS% in the second half of the 2015–2016 season. As with 3P%, we find the shrunk Rao-Blackwellized estimator for TS% results in the lowest mean absolute error (Table 2).
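As an illustration of how the separate estimates might be combined, the sketch below uses the standard definition TS% = PTS / (2 × (FGA + 0.44 × FTA)) and replaces made shots with shot-make probabilities when computing points. The way the components are combined here is an assumption, and the function names are hypothetical.

```python
def true_shooting(points, fga, fta):
    """Standard true shooting percentage: PTS / (2 * (FGA + 0.44 * FTA))."""
    return points / (2.0 * (fga + 0.44 * fta))


def rb_true_shooting(p2, p3, pft):
    """TS% with makes replaced by shot-make probabilities (assumed form).

    p2, p3, pft are lists of per-shot probabilities for two-point shots,
    three-point shots, and free throws. Shots without a usable trajectory
    enter as 1.0 (make) or 0.0 (miss), as described in the text.
    """
    expected_points = 2.0 * sum(p2) + 3.0 * sum(p3) + 1.0 * sum(pft)
    return true_shooting(expected_points, len(p2) + len(p3), len(pft))
```

When every probability is exactly 0 or 1, `rb_true_shooting` reduces to the raw TS%, which is why shot types with little usable trajectory information are only partially Rao-Blackwellized.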
Rao-Blackwellizing 2-point shots results in only a modest decrease in mean absolute error compared to the shrunk raw estimator. This may be because we are only able to estimate shot-make probabilities for a small fraction of two-point shots using the optical tracking database. Many 2-point shots are taken close to the basket or intended as bank-shots, resulting in insufficient or inaccurate trajectory information. These 2-point shots are not included in our prediction model and thus 2-point FG% is only partially Rao-Blackwellized. Interestingly, Rao-Blackwellizing FT% also resulted in only a minor improvement in prediction. This is not due to lack of trajectory information, as most free throws are included in our shot-make model, but may be because free throws more closely follow a Bernoulli distribution than either 2-point or 3-point shots. Free throws are certainly more homogeneous than other shot attempts, as they are not affected by contextual factors like changing shot distance or defender pressure. There has been some research showing serial correlation between free throws (Arkes 2010), though even where such correlation is found, the effect is considerably smaller than the effects contextual factors have on field-goal shot-make probabilities. The more closely a player’s free throw attempts follow a Bernoulli distribution, the less potential there is to decrease the mean-squared error of the raw estimator of FT% through Rao-Blackwellization. If a player’s free throw attempts perfectly follow a Bernoulli distribution, the number of makes and misses becomes a sufficient statistic for FT% and Rao-Blackwellizing would give no improvement in prediction accuracy.
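The variance reduction available under the Beta-Bernoulli model of Section 2 can be illustrated with a small simulation. The sketch assumes the true shot-make probabilities are known and drawn from a Beta distribution with mean θ; the `concentration` parameter, the default values, and the function name are hypothetical choices for illustration and are not fitted to any data.

```python
import numpy as np


def simulate_variances(theta=0.36, concentration=4.0,
                       n_shots=200, n_reps=5000, seed=0):
    """Compare the sampling variance of raw FG% and of the mean shot-make
    probability when p_i ~ Beta(a, b) and X_i | p_i ~ Bernoulli(p_i)."""
    rng = np.random.default_rng(seed)
    a = theta * concentration
    b = (1.0 - theta) * concentration
    p = rng.beta(a, b, size=(n_reps, n_shots))   # per-shot probabilities
    x = rng.random((n_reps, n_shots)) < p        # makes and misses
    raw = x.mean(axis=1)                         # raw FG% per replication
    rb = p.mean(axis=1)                          # mean probability per replication
    return raw.var(ddof=1), rb.var(ddof=1)
```

In this setup the raw estimator has variance θ(1 − θ)/n, while the mean of the probabilities has variance θ(1 − θ)/((a + b + 1)n), so the ratio is 1/(a + b + 1). As `concentration` shrinks toward zero the probabilities pile up near 0 and 1, carry little information beyond make or miss, and the two estimators nearly coincide.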
4.3 Example of an improvement in inferring player FG%
We now present an example of when evaluating a player using
5 Discussion and conclusion
In this paper we were able to construct an improved estimator for FG% based on shot-make probabilities calculated from shot trajectories. Via the Rao-Blackwell theorem, we demonstrated that if we model shots according to a Beta-Bernoulli distribution, rather than a Bernoulli, aggregating shot-make probabilities for individual players is a more accurate estimator for future shooting than raw FG%. Shot trajectory data has been shown to improve estimation of FG% in other contexts. Marty (2018) demonstrates, using precise shot data captured by Noahlytics during practice shooting sessions, that raw shooting percentage augmented with 9 spatial rim patterns is a better estimate of shooting skill than raw FG%. We are able to extend this idea to live games, and show that shot features measured using the less precise optical tracking data can still provide improvement in FG% prediction and estimation. Our method differs in that we create a new shooting statistic, one based on shot-make probabilities only, rather than use raw FG% augmented with spatial features. Comparing the estimation ability of
Another way to quantify the quality of our Rao-Blackwellized metrics is to measure how well they are able to discriminate between players. We can accomplish this by comparing the discrimination meta-metric for Rao-Blackwellized and raw shooting metrics (Franks et al. 2016). This meta-metric quantifies the fraction of variance between players that is due to differences in true shooting skill. Table 3 shows that RB-3P% and RB-TS% are both more discriminative metrics than their raw counterparts. Franks et al. (2016) also define the meta-metric stability: the fraction of total variance in a metric that is due to true changes in player skill over time, rather than chance variability. We did not calculate this meta-metric as we do not have enough seasons of trajectory data to obtain accurate estimates.
Table 3: Discrimination values for raw and Rao-Blackwellized shooting metrics.
|Raw 3P%||RB-3P%||Raw TS%||RB-TS%|
There have been many other models that use game-specific context variables like defender distance and shot location to try and estimate the probability that shots will go in (Chang et al. 2014; Cen et al. 2015). These models attempt to stabilize FG% estimation by controlling for external covariates that can affect shot-make probabilities. However,
Because RB-FG% allows us to more accurately estimate true FG% with smaller sample sizes, we should be able to more accurately predict how contextual shooting variables like defender distance impact a player’s shooting. Unfortunately, it is difficult to compare coefficients for contextual variables when fitting predicted probabilities compared to a binary shot response (make/miss) because we are estimating coefficients using different loss functions. Therefore, when we try to compare these coefficient estimates to a “true” value, for example how defender-distance affects FG% for a player over the entire season, we are comparing two estimated coefficients to a “true” coefficient value which is also estimated using a binary shot response. Even if the coefficient for defender distance estimated using
Although all NBA teams almost exclusively use raw FG% and its aggregate statistics to evaluate player shooting, many teams use shot trajectory characteristics to evaluate and coach player shooting in practice. The Noah Shooting System is used by a number of teams to analyze player shooting and to improve shot trajectories during practice shooting sessions. Analysis of trajectories in games, however, is not typically done due to the noisiness of the location data in the SportVu database. This paper provides a method to utilize in-game shot trajectories provided by the optical tracking data to better evaluate and predict player shooting.
The authors would like to thank Alex D’Amour, Alex Franks, Dan Cervone, Andy Miller, Nate Sandholtz, Jacob Mortensen, Matthew Van Bommel, Paul Gustafson, Joel Therrien, and Breanne Smart for their comments and contributions. We would also like to thank the associate editor and the referees for their helpful comments during the revision process.
Arkes, J. 2010. “Revisiting the Hot Hand Theory with Free Throw Data in a Multivariate Framework.” Journal of Quantitative Analysis in Sports 6(1): Retrieved 12 Jun. 2018, from doi:10.2202/1559-0410.1198.
Blackport, D. 2014. “How Long Does it Take for Three Point Shooting to Stabilize?” https://fansided.com/2014/08/29/long-take-three-point-shooting-stabilize/. November 11th, 2017.
Brown, L. D. 2008. “In-Season Prediction of Batting Averages: A Field Test of Empirical Bayes and Bayes Methodologies.” The Annals of Applied Statistics 2:113–152.
Casella, G. 1985. “An Introduction to Empirical Bayes Data Analysis.” The American Statistician 39:83–87.
Chang, Y. H., R. Maheswaran, J. Su, S. Kwok, T. Levy, A. Wexler, and K. Squire. 2014. “Quantifying Shot Quality in the NBA.” Proceedings of the 2014 MIT Sloan Sports Analytics Conference.
Cen, R., H. Chase, C. Pena-Lobel, and D. Silberwasser. 2015. “NBA Shot Prediction and Analysis.” https://hwchase17.github.io/sportvu/. November 11th, 2017.
Franks, A., A. D’Amour, D. Cervone, and L. Bornn. 2016. “Meta-Analytics: Tools for Understanding the Statistical Properties of Sports Metrics.” Journal of Quantitative Analysis in Sports 12:151–165.
Kubatko, J., D. Oliver, K. Pelton, and D. T. Rosenbaum. 2007. “A Starting Point for Analyzing Basketball Statistics.” Journal of Quantitative Analysis in Sports 3(3): Retrieved 12 Jun. 2018, from doi:10.2202/1559-0410.1070.
Marty, R. 2018. “High-resolution Shot Capture Reveals Systematic Biases and an Improved Method for Shooter Evaluation.” Proceedings of the 2018 MIT Sloan Sports Analytics Conference.
Marty, R. and S. Lucey. 2017. “A Data-Driven Method for Understanding and Increasing 3-Point Shooting Percentage.” Proceedings of the 2017 MIT Sloan Sports Analytics Conference.
Paine, N. 2016. “LeBron’s 3-Point Shot Has Abandoned Him.” https://fivethirtyeight.com/features/lebrons-3-point-shot-has-abandoned-him/. January 3rd, 2018.
Piette, J., A. Sathyanarayan, and K. Zhang. 2010. “Scoring and Shooting Abilities of NBA Players.” Journal of Quantitative Analysis in Sports 6(1): Retrieved 12 Jun. 2018, from doi:10.2202/1559-0410.1194.
Skellam, J. G. 1948. “A Probability Distribution Derived from the Binomial Distribution by Regarding the Probability of Success as Variable Between the Sets of Trials.” Journal of the Royal Statistical Society. Series B (Methodological) 10:257–261.
Young, S. 2016. “The NBA’s 3-point Revolution.” https://bballbreakdown.com/2016/12/16/the-nba-3-point-revolution/. December 14th, 2017.