This paper analyzes a unique data set consisting of all penalty calls in the National Hockey League between the 1995–1996 and 2001–2002 seasons. The primary finding is the prevalence of “reverse calls:” if previous penalties have been on one team, then the next penalty is more likely to be on the other. This pattern is consistent with a simple behavioral rationale based on the fundamental difficulty of refereeing a National Hockey League game. Statistical modeling reveals that the identity of the next team to be penalized also depends on a variety of other factors, including the score, the time in the game, the time since last penalty, which team is at home, and whether one or two referees are calling the game. There is also evidence of differences among referees in their tendency to reverse calls.
Source: Toronto Star, February 8, 2004. A “shift” is a player’s turn on the ice and generally lasts less than a minute.
Source: Hockey Night in Canada broadcast, October 25, 2003.
For example, in the sample boxscore given in Appendix A, matching penalties occurred at 17:07 of the first period and at 0:58 of overtime. For this game, these four matching penalties would not appear in the analysis sample.
One possible explanation is that penalty calls near the end of a period are less costly since the end of the period temporarily breaks up the power play.
Sample game boxscore in raw form
+++ National Hockey League – Lightning vs. Stars – 10/22/1995 0:58AM ET +++
Tampa Bay 1 1 1 0–3
Dallas 2 1 0 0–3
FIRST PERIOD – Scoring: 1, Dallas, Klatt 2 (Matvichuk), 5:04. 2, Tampa Bay, Gavey 1 (shorthanded) (Tucker, Hamrlik), 12:13. 3, Dallas, Adams 4 (Ledyard), 16:08. Penalties: Matvichuk, Dal (charging), 8:13; Wiemer, T.B. (high sticking), 10:47; Burr, T.B. (charging), 17:07; Zmolek, Dal (high sticking), 17:07; Hamrlik, T.B. (holding), 17:38.
SECOND PERIOD – Scoring: 4, Dallas, Modano 5 (power play) (Ledyard, D Hatcher), 8:58. 5, Tampa Bay, Gratton 5 (power play) (Klima, Cullen), 17:00. Penalties: Borschevsky, Dal (Obstr hooking), 0:47; Gratton, T.B. (slashing), 1:12; Bradley, T.B. (Obstr hooking), 7:06; Charron, T.B. (high sticking), 8:21; D Hatcher, Dal (closing hand on puck), 15:05.
THIRD PERIOD – Scoring: 6, Tampa Bay, Houlder 1 (Klima, Ysebaert), 6:39. Penalties: Tampa Bay bench, served by Selivanov (too many men on the ice), 2:43; Zmolek, Dal (Obstr hooking), 10:49; Klima, T.B. (Obstr tripping), 11:45; Ciccone, T.B. (slashing), 13:49.
OVERTIME – Scoring: None. Penalties: Burr, T.B. (roughing), 0:58; Churla, Dal (roughing), 0:58; Hamrlik, T.B. (hooking), 1:53.
Shots on goal:
Tampa Bay 13 10 13 2–38
Dallas 14 9 5 2–30
Power-play Conversions: Tam – 1 of 4, Dal – 1 of 9. Goalies: Tampa Bay, Puppa (30 shots, 27 saves; record: 2-1-2). Dallas, Wakaluk (38, 35; record: 2-1-1). A:16,789. Referee: Roberts. Linesmen: D Mccourt, Mcelman.
Rather than fitting one type of model to the data, we consider a variety of strategies for learning the relationship between revcall and the other variables.
In order to gauge how well a model worked, we did a simple out-of-sample experiment. We randomly selected 11,000 observations to be our out-of-sample “test” data and then used the remaining 57,883 − 11,000 = 46,883 observations as “training” data with which to estimate the models.
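The random split above can be sketched as follows. The paper's own code is in R and is not shown, so this is only a hypothetical Python analogue; the function name and the seed are ours.

```python
import numpy as np

def split_train_test(n_obs=57883, n_test=11000, seed=0):
    """Randomly choose n_test observation indices as the held-out test set
    and use the remaining indices for training (the seed is arbitrary)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_obs)
    return idx[n_test:], idx[:n_test]  # training indices, test indices

train_idx, test_idx = split_train_test()
print(len(train_idx), len(test_idx))  # 46883 11000
```

Because the split is random rather than stratified, repeating the experiment with different seeds gives a rough sense of how stable the model rankings are.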
The “models” we tried were:
decision trees with various numbers of bottom nodes,
random forests with various numbers of trees,
linear logistic regression,
boosting with various numbers of trees and interaction depths, and
BART: Bayesian Additive Regression Trees with the default prior.
Model fitting was performed in R (R Core Team 2013) using the tree, randomForest, glm, gbm, and BayesTree packages or functions, respectively.
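For readers working outside R, the model list maps roughly onto scikit-learn estimators. This is only a sketch of analogues, not the paper's code: the tuning values shown are illustrative single settings rather than the paper's full grids, and BART has no standard scikit-learn counterpart.

```python
# Rough scikit-learn analogues of the R tools used in the paper
# (tree, randomForest, glm, gbm); BART is omitted because it has no
# standard scikit-learn counterpart.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

models = {
    "tree": DecisionTreeClassifier(max_leaf_nodes=8),   # illustrative size
    "random forest": RandomForestClassifier(n_estimators=500),
    "logistic regression": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(n_estimators=250, max_depth=6),
}
```

Each estimator would be fit on the training rows and scored on the test rows with the same loss, so the comparison across model classes stays on equal footing.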
Figure 10 displays the out-of-sample loss for the modeling strategies, where loss is measured by the deviance (−2 × the log-likelihood). Because the deviance has the opposite sign of the log-likelihood and a larger likelihood is better, a smaller deviance indicates a better fit. For a textbook discussion of the use of deviance in model selection, see Chapter 6 of James et al. (2013). Again, we fit the models using the training data and then evaluate the likelihood of the fitted model on the test data. The top panel displays the loss for all models, while the bottom panel displays only the results for logistic regression, boosting, and BART, since these models performed the best. The best models in terms of out-of-sample loss are boosting with 250 trees and interaction depth 6, and BART.
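The loss criterion is simple to compute from a model's fitted test-set probabilities. A minimal sketch of the Bernoulli deviance (the function name is ours; the paper does not give code):

```python
import numpy as np

def deviance(y, p):
    """Out-of-sample deviance: -2 times the Bernoulli log-likelihood of the
    0/1 test labels y under fitted probabilities p. Smaller is better."""
    p = np.clip(np.asarray(p, float), 1e-12, 1 - 1e-12)  # avoid log(0)
    y = np.asarray(y, float)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

As a reference point, a model that assigns probability 0.5 to every one of n test observations has deviance 2n log 2, so any useful model should come in well below that.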
While the deviance measure is not easy to interpret on its own, the results in Figure 10 suggest that logistic regression is “not too bad.” Figure 11 plots the fitted probabilities from the BART fit against those obtained using logistic regression. The BART fit is a posterior mean. The line drawn through the plot has slope 1 and intercept 0. Broadly, the two models agree on the probabilities, but there is also substantial discrepancy. For both models, the fitted probabilities range from about 0.25 to about 0.90, which suggests a great deal of predictability in penalty calls. In Section 4.4 we see that BART finds interesting interactions that the linear logit clearly could not uncover.
Carvalho, C. and P. R. Hahn. 2014. “Decoupling Shrinkage and Selection in Bayesian Linear Models.” Working paper.
Levitt, S. 2002. “Testing the Economic Model of Crime: The National Hockey League’s Two Referee Experiment.” Contributions to Economic Analysis and Policy 1(1). doi:10.2202/1538-0645.1014.
Murphy, K. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, Massachusetts: The MIT Press.
Correction added on June 27, 2014 after online publication [March 22, 2014]. The periods in the last sentence of paragraph 1, column 2, page 207 were changed from “320 min periods” to “3×20 min periods”. In Table 1, page 224 the frequency of AA was updated from 6.6% to 36.6%.
©2014 by Walter de Gruyter Berlin/Boston