Predicting the outcome of a single sporting event is difficult; predicting all of the outcomes for an entire tournament is a monumental challenge. Despite the difficulties, millions of people compete each year to forecast the outcome of the NCAA men’s basketball tournament, which spans 63 games over 3 weeks. Statistical prediction of game outcomes involves a multitude of possible covariates and information sources, large performance variations from game to game, and a scarcity of detailed historical data. In this paper, we present the results of a team of modelers working together to forecast the 2014 NCAA men’s basketball tournament. We present not only the methods and data used, but also several novel ideas for post-processing statistical forecasts and decontaminating data sources. In particular, we highlight the difficulties in using publicly available data and suggest techniques for improving their relevance.
JQAS, an official journal of the American Statistical Association, publishes research on the quantitative aspects of professional and collegiate sports. Articles deal with such subjects as measurements of player performance, tournament structure, and the frequency and occurrence of records. Additionally, the journal serves as an outlet for professionals in the sports world to raise issues and ask questions that relate to quantitative sports analysis.