Published by De Gruyter February 23, 2015

Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness

Andrew Hoegh, Marcos Carzolio, Ian Crandell, Xinran Hu, Lucas Roberts, Yuhyun Song and Scotland C. Leman

Abstract

The recent surge of predictive analytics competitions has improved sports predictions by fostering data-driven inference and steering clear of human bias. This article details methods developed for Kaggle’s March Machine Learning Mania competition for the 2014 NCAA tournament. A submission to the competition consists of outcome probabilities for each potential matchup. Most predictive models are based entirely on measures of overall team strength, resulting in the unintended “transitive property.” These models are therefore unable to capture specific matchup tendencies. We introduce our novel nearest-neighbor matchup effects framework, which presents a flexible way to account for team characteristics above and beyond team strength that may influence game outcomes. In particular, we develop a general framework that couples a model predicting a point spread with a clustering procedure that borrows strength from games similar to a current matchup. The result is a model whose predictions control for team strength and also capture specific matchup characteristics.
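The “transitive property” mentioned in the abstract can be illustrated with a minimal sketch. It assumes a strength-only model with a logistic link; the function name `win_prob`, the `scale` parameter, and the ratings are hypothetical and are not taken from the article. Because every probability depends only on a difference of ratings, favoring A over B and B over C forces the model to favor A over C, so no matchup-specific tendency can be expressed.

```python
import math


def win_prob(strength_a: float, strength_b: float, scale: float = 1.0) -> float:
    """P(A beats B) under a strength-only model (assumed logistic link)."""
    return 1.0 / (1.0 + math.exp(-(strength_a - strength_b) / scale))


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Hypothetical ratings, chosen only for illustration.
ratings = {"A": 2.0, "B": 1.0, "C": 0.5}

p_ab = win_prob(ratings["A"], ratings["B"])
p_bc = win_prob(ratings["B"], ratings["C"])
p_ac = win_prob(ratings["A"], ratings["C"])

# Each probability depends only on a rating difference, so the log-odds add:
# logit(p_ac) = logit(p_ab) + logit(p_bc). A over B and B over C implies A over C.
print(p_ab > 0.5, p_bc > 0.5, p_ac > 0.5)
print(abs(logit(p_ac) - (logit(p_ab) + logit(p_bc))) < 1e-9)
```

Under these assumptions the model can never predict that C is favored against A, however the two teams' styles interact.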


Corresponding author: Andrew Hoegh, Virginia Tech – Department of Statistics, Hutcheson Hall – RM 406A, 250 Drillfield Drive, Blacksburg, VA 24061, USA

Appendix

All of the variables used for identifying neighbors of teams are listed below with descriptions. For additional details, see Pomeroy (2012).

Effective Height: The average height of the center and power forwards – adjusted for minutes played.

Adjusted Tempo: An estimate of the possessions per game against a team playing at a standardized tempo.

Offensive Rebound Percentage (Offense): Percentage of offensive rebounds gained on offense.

Offensive Rebound Percentage (Defense): Percentage of offensive rebounds secured on defense.

Assist Rate: Percentage of field goals that result in an assist.

Effective Field Goal Percentage (Defense): Gives more credit for three pointers, specifically (0.5 FGM3 + FGM)/FGA, where FGM3 is three point field goals made, FGM is total field goals made, and FGA is field goals attempted.

Free Throw Rate (Offense): Ratio of free throws attempted to field goals attempted.

Free Throw Rate (Defense): Ratio of free throws attempted to field goals attempted for opponents.

Two Point Field Goal Percentage (Offense): Shooting percentage on two point baskets for offense.

Two Point Field Goal Percentage (Defense): Shooting percentage on two point baskets allowed on defense.

Three Point Field Goal Percentage (Offense): Shooting percentage of three point baskets on offense.

Three Point Field Goal Percentage (Defense): Shooting percentage of three point baskets allowed on defense.

Free Throw Percentage: Shooting percentage on free throws.

Block Percentage: Percentage of opponents’ two point field goal attempts that are blocked.

Steal Rate: Percentage of defensive possessions that result in a steal.

Free Throw Contribution (Offense): Percentage of points scored on free throws.

Free Throw Contribution (Defense): Percentage of points allowed on free throws.

Two Point Field Goal Contribution (Offense): Percentage of points scored on two point field goals.

Two Point Field Goal Contribution (Defense): Percentage of points allowed on two point field goals.

Three Point Field Goal Contribution (Offense): Percentage of points scored on three point field goals.

Three Point Field Goal Contribution (Defense): Percentage of points allowed on three point field goals.
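Several of the definitions above reduce to simple arithmetic on shooting totals. The following is a minimal sketch of that arithmetic, not the article's or Pomeroy's code: the `BoxTotals` container, the function names, and the example totals are hypothetical. It returns fractions rather than percentages; multiply by 100 for percentages. The defensive versions apply the same formulas to opponents' totals.

```python
from dataclasses import dataclass


@dataclass
class BoxTotals:
    """Season shooting totals for one team (hypothetical container)."""
    fgm: int   # total field goals made (two and three pointers)
    fgm3: int  # three point field goals made
    fga: int   # field goals attempted
    ftm: int   # free throws made
    fta: int   # free throws attempted


def effective_fg_pct(t: BoxTotals) -> float:
    # (0.5 * FGM3 + FGM) / FGA, as in the appendix definition.
    return (0.5 * t.fgm3 + t.fgm) / t.fga


def free_throw_rate(t: BoxTotals) -> float:
    # Free throws attempted relative to field goals attempted.
    return t.fta / t.fga


def point_contributions(t: BoxTotals) -> dict:
    # Share of total points coming from each shot type.
    ft_points = t.ftm
    two_points = 2 * (t.fgm - t.fgm3)
    three_points = 3 * t.fgm3
    total = ft_points + two_points + three_points
    return {
        "free_throw": ft_points / total,
        "two_point": two_points / total,
        "three_point": three_points / total,
    }


# Made-up totals, for illustration only.
totals = BoxTotals(fgm=900, fgm3=250, fga=2000, ftm=500, fta=700)
print(effective_fg_pct(totals))   # (125 + 900) / 2000
print(free_throw_rate(totals))    # 700 / 2000
print(point_contributions(totals))
```

The three contribution shares sum to one by construction, which is a convenient sanity check when assembling the variables used to identify neighbors.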

References

Boulier, B. L. and H. O. Stekler. 2003. “Predicting the Outcomes of National Football League Games.” International Journal of Forecasting 19:257–270.

Brown, M. and J. Sokol. 2010. “An Improved LRMC Method for NCAA Basketball Prediction.” Journal of Quantitative Analysis in Sports 6:1–23.

Carlin, B. P. 1996. “Improved NCAA Basketball Tournament Modeling via Point Spread and Team Strength Information.” The American Statistician 50:39–43.

Caudill, S. B. 2003. “Predicting Discrete Outcomes with the Maximum Score Estimator: The Case of the NCAA Men’s Basketball Tournament.” International Journal of Forecasting 19:313–317.

Goldbloom, A. 2014. “March Machine Learning Mania.” (http://www.kaggle.com/c/march-machine-learning-mania), accessed June 18, 2014.

Harville, D. A. and M. H. Smith. 1994. “The Home-Court Advantage: How Large is it, and does it vary from Team to Team?” The American Statistician 48:22–28.

House, L., S. Leman, and C. Han. 2010. “Bayesian Visual Analytics (bava).” FODAVA Technical Report.

Hu, X., L. Bradel, D. Maiti, L. House, C. North, and S. Leman. 2013. “Semantics of Directly Manipulating Spatializations.” Visualization and Computer Graphics, IEEE Transactions on 19:2052–2059.

James, B. 1983. “Baseball Abstract.” New York: Ballantine.

Kvam, P. and J. S. Sokol. 2006. “A Logistic Regression/Markov Chain Model for NCAA Basketball.” Naval Research Logistics 53:788–803.

Lewis, M. 2004. “Moneyball.” New York: W. W. Norton & Company.

Manski, C. F. and S. R. Lerman. 1977. “The Estimation of Choice Probabilities from Choice Based Samples.” Econometrica: Journal of the Econometric Society 1977–1988.

Massey, K. 2014. “College Basketball Rating Composite.” (masseyratings.com), accessed April 7, 2014.

Miller, S. J. 2007. “A Derivation of the Pythagorean Won-loss Formula in Baseball.” Chance 20:40–48.

Pomeroy, K. 2012. “Ratings Glossary.” (http://kenpom.com/blog/index.php/weblog/entry/ratings_glossary), accessed June 18, 2014.

Rosenthal, J. 2013. “The Rosenthal Fit: A Statistical Ranking of NCAA Men’s Basketball Teams.” (http://andrewgelman.com/2014/02/25/basketball-stats-dont-model-probability-win-model-expected-score-differential/), accessed June 18, 2014.

Sagarin, J. 2014. “Ratings Glossary.” (http://www.usatoday.com/sports/ncaab/sagarin/), accessed June 18, 2014.

Schwertman, N. C., K. L. Schenk, and B. C. Holbrook. 1996. “More Probability Models for the NCAA Regional Basketball Tournaments.” The American Statistician 50:34–38.

Silver, N. 2003. “Introducing Pecota.” Baseball Prospectus 2003:507–514.

Silver, N. 2014. “Building a Bracket is Hard this Year, But We’ll Help You Play the Odds.” (http://fivethirtyeight.com/features/nate-silvers-ncaa-basketball-predictions/), accessed June 18, 2014.

Smith, T. and N. C. Schwertman. 1999. “Can the NCAA Basketball Tournament Seeding be used to Predict Margin of Victory?” The American Statistician 53:94–98.

West, B. T. 2006. “A Simple and Flexible Rating Method for Predicting Success in the NCAA Basketball Tournament.” Journal of Quantitative Analysis in Sports 2:3.

Wright, C. 2012. “Statistical Predictors of March Madness: An Examination of the NCAA Men’s Basketball Championship.” (http://economics-files.pomona.edu/GarySmith/Econ190/Wright%20March%20Madness%20Final%20Paper.pdf), accessed June 18, 2014.

Published Online: 2015-2-23
Published in Print: 2015-3-1

©2015 by De Gruyter