Abstract
This paper analyzes a unique data set consisting of all penalty calls in the National Hockey League between the 1995–1996 and 2001–2002 seasons. The primary finding is the prevalence of “reverse calls:” if previous penalties have been on one team, then the next penalty is more likely to be on the other. This pattern is consistent with a simple behavioral rationale based on the fundamental difficulty of refereeing a National Hockey League game. Statistical modeling reveals that the identity of the next team to be penalized also depends on a variety of other factors, including the score, the time in the game, the time since last penalty, which team is at home, and whether one or two referees are calling the game. There is also evidence of differences among referees in their tendency to reverse calls.
1. Source: www.legendsofhockey.net.
2.
3. Source: Toronto Star, February 8, 2004. A “shift” is a player’s turn on the ice and generally lasts less than a minute.
4. Source: Hockey Night in Canada broadcast, October 25, 2003.
5. For example, looking at the sample boxscore given in Appendix A, matching penalties occurred at 17:07 of the first period and 0:58 of overtime. For this game, the four matching penalties would not appear in the analysis sample.
6. One possible explanation is that penalty calls near the end of a period are less costly, since the end of the period temporarily breaks up the power play.
Appendix A
Sample game boxscore in raw form
+++ National Hockey League – Lightning vs. Stars – 10/22/1995 0:58AM ET +++
Tampa Bay 1 1 1 0–3
Dallas 2 1 0 0–3
FIRST PERIOD – Scoring: 1, Dallas, Klatt 2 (Matvichuk), 5:04. 2, Tampa Bay, Gavey 1 (shorthanded) (Tucker, Hamrlik), 12:13. 3, Dallas, Adams 4 (Ledyard), 16:08. Penalties: Matvichuk, Dal (charging), 8:13; Wiemer, T.B. (high sticking), 10:47; Burr, T.B. (charging), 17:07; Zmolek, Dal (high sticking), 17:07; Hamrlik, T.B. (holding), 17:38.
SECOND PERIOD – Scoring: 4, Dallas, Modano 5 (power play) (Ledyard, D Hatcher), 8:58. 5, Tampa Bay, Gratton 5 (power play) (Klima, Cullen), 17:00. Penalties: Borschevsky, Dal (Obstr hooking), 0:47; Gratton, T.B. (slashing), 1:12; Bradley, T.B. (Obstr hooking), 7:06; Charron, T.B. (high sticking), 8:21; D Hatcher, Dal (closing hand on puck), 15:05.
THIRD PERIOD – Scoring: 6, Tampa Bay, Houlder 1 (Klima, Ysebaert), 6:39. Penalties: Tampa Bay bench, served by Selivanov (too many men on the ice), 2:43; Zmolek, Dal (Obstr hooking), 10:49; Klima, T.B. (Obstr tripping), 11:45; Ciccone, T.B. (slashing), 13:49.
OVERTIME – Scoring: None. Penalties: Burr, T.B. (roughing), 0:58; Churla, Dal (roughing), 0:58; Hamrlik, T.B. (hooking), 1:53.
Shots on goal:
Tampa Bay 13 10 13 2–38
Dallas 14 9 5 2–30
Power-play Conversions: Tam – 1 of 4, Dal – 1 of 9. Goalies: Tampa Bay, Puppa (30 shots, 27 saves; record: 2-1-2). Dallas, Wakaluk (38, 35; record: 2-1-1). A:16,789. Referee: Roberts. Linesmen: D Mccourt, Mcelman.
Appendix B
Rather than fitting one type of model to the data, we consider a variety of strategies for learning the relationship between revcall and the other variables.
To gauge how well a model worked, we ran a simple out-of-sample experiment. We randomly selected 11,000 observations to be our out-of-sample “test” data and then used the remaining 57,883 − 11,000 = 46,883 observations as “training” data with which to estimate the models.
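The random train/test split described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow the paper actually used; the index set and seed are hypothetical, but the counts match those reported in the text.

```python
import random

# Hypothetical index set standing in for the 57,883 penalty observations.
n_obs = 57_883
all_indices = list(range(n_obs))

random.seed(1)  # arbitrary seed, for reproducibility only

# Randomly hold out 11,000 observations as the "test" set;
# the remaining 46,883 form the "training" set.
test_indices = set(random.sample(all_indices, 11_000))
train_indices = [i for i in all_indices if i not in test_indices]

print(len(test_indices), len(train_indices))  # 11000 46883
```

Models are then estimated on the training indices only, and their fit is scored on the held-out test indices.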
The “models” we tried were:
decision trees with various numbers of bottom nodes,
random forests with various numbers of trees,
linear logistic regression,
boosting with various numbers of trees and interaction depths, and
BART: Bayesian Additive Regression Trees with the default prior.
Model fitting was performed in R (R Core Team 2013) using the tree, randomForest, glm, gbm, and BayesTree packages or functions, respectively.
Figure 10 displays the out-of-sample loss for the modeling strategies, where loss is measured by the deviance (−2 × the log-likelihood). Because the deviance has the opposite sign of the log-likelihood, and a bigger likelihood is better, a smaller deviance indicates a better fit. For a textbook discussion of the use of deviance in model selection, see Chapter 6 of James et al. (2013). Again, we fit models using the training data and then evaluate the likelihood of the fitted model on the test data. The top panel displays the loss for all models, while the bottom panel displays only the results for logistic regression, boosting, and BART, since these models performed the best. The best models in terms of out-of-sample loss are boosting with 250 trees and interaction depth 6, and BART.
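The deviance loss can be computed directly from a model's predicted probabilities on the test data. The following is a minimal sketch; the function name is ours, and the probabilities and revcall outcomes below are made up purely for illustration.

```python
import math

def out_of_sample_deviance(probs, outcomes):
    """Deviance = -2 * log-likelihood of the observed binary outcomes
    under the model's fitted probabilities P(revcall = 1)."""
    log_lik = sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(probs, outcomes)
    )
    return -2.0 * log_lik

# Made-up test-set predictions and binary revcall outcomes.
probs = [0.7, 0.4, 0.9, 0.2]
outcomes = [1, 0, 1, 0]
print(out_of_sample_deviance(probs, outcomes))
```

A model whose probabilities track the outcomes more closely yields a higher log-likelihood and hence a smaller deviance, which is why smaller values in Figure 10 indicate better fits.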

Figure 10: Out-of-sample deviance loss for various predictive modeling strategies. (i) Ti: tree with i bottom nodes; (ii) Fi: random forest with i trees; (iii) L: linear logistic regression; (iv) Gi: boosting with i trees (for i = 200, 300, 500, trees of depth 3 were used, while for i = 250, trees of depth 4, 6, and 8 were used); (v) BART: Bayesian Additive Regression Trees with the default prior. All other parameters were set to the defaults given in the corresponding R package. The top panel displays all methods; the bottom panel compares only the better ones. The predictive modeling method with the smallest loss is BART, with boosting with 250 trees and tree depth 6 very close behind.
While the deviance measure is not easy to interpret, the results in Figure 10 suggest that logistic regression is “not too bad.” Figure 11 plots the fitted values from BART against those from the logistic regression.

Figure 11: BART fit (x-axis) vs. logit fit (y-axis). Both approaches find substantial structure, but there are also some large differences between their fitted values.
References
Allen, W. D. 2002. “Crime, Punishment, and Recidivism: Lessons from the National Hockey League.” Journal of Sports Economics 3:39–60. doi:10.1177/1527002502003001004.
Beaudoin, D. and T. B. Swartz. 2010. “Strategies for Pulling the Goalie in Hockey.” The American Statistician 64:197–204. doi:10.1198/tast.2010.09147.
Becker, G. 1968. “Crime and Punishment: An Economic Approach.” Journal of Political Economy 76:169–217. doi:10.1086/259394.
Carvalho, C. and P. R. Hahn. 2014. “Decoupling Shrinkage and Selection in Bayesian Linear Models.” Working paper.
Chipman, H., E. George, and R. McCulloch. 2010. “BART: Bayesian Additive Regression Trees.” Annals of Applied Statistics 4(1):266–298. doi:10.1214/09-AOAS285.
Heckelman, J. and Y. Yates. 2003. “And a Hockey Game Broke Out: Crime and Punishment in the NHL.” Economic Inquiry 41:704–712. doi:10.1093/ei/cbg038.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer. doi:10.1007/978-1-4614-7138-7.
Levitt, S. 2002. “Testing the Economic Model of Crime: The National Hockey League’s Two-Referee Experiment.” Contributions to Economic Analysis and Policy 1. doi:10.2202/1538-0645.1014.
Murphy, K. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press.
R Core Team. 2013. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
Correction Statement
Correction added on June 27, 2014 after online publication [March 22, 2014]. The periods in the last sentence of paragraph 1, column 2, page 207 were changed from “320 min periods” to “3×20 min periods”. In Table 1, page 224 the frequency of AA was updated from 6.6% to 36.6%.
©2014 by Walter de Gruyter Berlin/Boston