In this work, we propose a new hybrid modeling approach for the scores of international soccer matches which combines random forests with Poisson ranking methods. While the random forest is based on the competing teams’ covariate information, the latter method estimates ability parameters on historical match data that adequately reflect the current strength of the teams. We compare the new hybrid random forest model to its separate building blocks as well as to conventional Poisson regression models with regard to their predictive performance on all matches from the four FIFA World Cups 2002–2014. It turns out that, by combining the random forest with the team ability parameters from the ranking methods as an additional covariate, the predictive power can be improved substantially. Finally, the hybrid random forest is used (in advance of the tournament) to predict the FIFA World Cup 2018. Complementing our analysis of the previous World Cup data, the corresponding 64 matches serve as an independent validation data set, and we are able to confirm the compelling predictive potential of the hybrid random forest, which clearly outperforms all other methods, including the betting odds.
A Some notations and definitions
Kronecker’s delta, which is used in Section 4 in the formula of the multinomial likelihood and the RPS, is defined as
$$\delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases}$$
The Skellam distribution, which is also used in Section 4, is the discrete probability distribution of the integer random variable that is defined as the difference $K = Y_1 - Y_2$ of two independent Poisson-distributed random variables $Y_1$ and $Y_2$ with means $\lambda_1$ and $\lambda_2$. Its probability mass function is
$$P(K = k) = e^{-(\lambda_1 + \lambda_2)} \left(\frac{\lambda_1}{\lambda_2}\right)^{k/2} I_{|k|}\!\left(2\sqrt{\lambda_1 \lambda_2}\right), \qquad k \in \mathbb{Z},$$
where $I_k(\cdot)$ is the modified Bessel function of the first kind (for more details, see Skellam 1946). Now let $Y_1$ and $Y_2$ denote the (conditionally independent) Poisson-distributed numbers of goals of two soccer teams competing in a match. Then, the three probabilities $P(Y_1 > Y_2)$, $P(Y_1 = Y_2)$ and $P(Y_1 < Y_2)$ can easily be obtained by computing $P(K > 0)$, $P(K = 0)$ and $P(K < 0)$ via the Skellam distribution.
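As a concrete illustration, the three match-outcome probabilities can be computed directly from the Skellam distribution; the sketch below uses SciPy with hypothetical mean scores $\lambda_1 = 1.6$ and $\lambda_2 = 1.1$:

```python
from scipy.stats import skellam

# Hypothetical (conditionally independent) Poisson means for the two teams' goals
lam1, lam2 = 1.6, 1.1

p_win = skellam.sf(0, lam1, lam2)     # P(K > 0) = P(Y1 > Y2)
p_draw = skellam.pmf(0, lam1, lam2)   # P(K = 0) = P(Y1 = Y2)
p_loss = skellam.cdf(-1, lam1, lam2)  # P(K < 0) = P(Y1 < Y2)

print(p_win, p_draw, p_loss)  # the three probabilities sum to 1
```

Since the three events partition the sample space, the probabilities always add up to one, whatever the means are.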
B Lasso regression for soccer data
An alternative, more traditional approach which is often applied for modeling soccer results is based on regression. In the most popular case the scores of the competing teams are treated as (conditionally) independent random variables following a Poisson distribution (conditioned on certain covariates), as introduced in the seminal works of Maher (1982) and Dixon and Coles (1997). Similar to the random forests, the methods described here can also be directly applied to data in the format of Table 2 from Section 2.1. Hence, each score is treated as a single observation and one obtains two observations per match. Accordingly, for n teams the respective model has the form
$$\log(\lambda_{ijk}) = \eta_{ijk} = \beta_0 + \boldsymbol{x}_{ijk}^\top \boldsymbol{\beta}, \qquad (2)$$
where $Y_{ijk} \sim Po(\lambda_{ijk})$ denotes the score of team $i$ against team $j$ in tournament $k$, with $\boldsymbol{x}_{ijk}$ collecting the covariate information of the two competing teams, $i, j \in \{1, \ldots, n\}$, $i \neq j$.
Due to the rather large number of potential covariates in our data, we use regularization techniques when estimating the models to allow for variable selection and to avoid overfitting. In the following, we introduce such a basic regularization approach, namely the conventional Lasso (Tibshirani 1996). For estimation, instead of the regular log-likelihood $l(\beta_0, \boldsymbol{\beta})$, the penalized log-likelihood
$$l_p(\beta_0, \boldsymbol{\beta}) = l(\beta_0, \boldsymbol{\beta}) - \xi \sum_{m=1}^{p} |\beta_m| \qquad (3)$$
is maximized, where $\xi$ is a tuning parameter controlling the strength of the penalization: for $\xi = 0$ the regular maximum likelihood estimate is obtained, while for increasing $\xi$ more and more coefficients are shrunk toward (and eventually set exactly to) zero. The optimal value of $\xi$ is typically determined by cross-validation.
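A minimal, self-contained sketch of such an L1-penalized Poisson regression is given below, fitted by proximal gradient descent (ISTA) with soft thresholding on simulated data. The step size, the penalty value, and all variable names are illustrative choices for this sketch, not the actual implementation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3, 0.0, 0.0, 0.2, 0.0])  # sparse true effects
y = rng.poisson(np.exp(0.3 + X @ beta_true))            # simulated Poisson scores

def lasso_poisson(X, y, xi, n_iter=3000, step=1e-4):
    """Proximal-gradient (ISTA) fit of a Poisson regression with an L1
    penalty xi * sum(|beta_m|) on the slopes; the intercept is unpenalized."""
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(b0 + X @ beta)               # current Poisson means
        b0 -= step * np.sum(mu - y)              # plain gradient step (intercept)
        z = beta - step * (X.T @ (mu - y))       # gradient step on the slopes
        beta = np.sign(z) * np.maximum(np.abs(z) - step * xi, 0.0)  # soft threshold
    return b0, beta

b0_hat, beta_hat = lasso_poisson(X, y, xi=50.0)
print(np.round(beta_hat, 2))  # the true zero effects are shrunk to (near) zero
```

The soft-thresholding step is exactly what produces variable selection: coefficients whose gradient never exceeds the penalty level are set to zero.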
While the Lasso method described above was chosen as the reference method for comparing the predictive power of the hybrid model, several alternatives and extensions are also discussed in the literature. In the following, we briefly sketch some possible modifications. As a first possible extension of the model (2), the linear predictor can be augmented by team-specific attack and defense effects for all competing teams. This extension was used in Groll et al. (2015) to predict the FIFA World Cup 2014. There, each pair of attack and defense parameters corresponding to a team was treated as a group and, hence, the Group Lasso penalty proposed by Yuan and Lin (2006) was applied to those parameter groups.
Alternatively, the model (2) can be extended from linear to smooth covariate effects; in this case, the Lasso penalty has to be replaced by penalty terms suited to smooth functions.
Altogether, in Schauberger and Groll (2018) the simple Lasso from (3) with predictor structure (2) turned out to be the best-performing regression approach, though slightly outperformed by the random forests from Section 3.1.
C Comparison of FIFA ranking, Elo rating and estimated abilities
Table 8 compares the rankings of the 32 teams participating in the FIFA World Cup 2018 according to the estimated abilities (left column), the Elo rating (center column) and the FIFA ranking (right column). The rankings according to the estimated abilities and the Elo ratings are very similar (Spearman correlation of 0.94), while both show a smaller correlation with the FIFA ranking (Spearman correlations of 0.86 and 0.90, respectively).
All three methods rank Germany and Brazil as the two top teams. Notable differences between the rankings can be seen, for example, for Spain and Belgium. Both the estimated abilities and the Elo rating rank Spain third, while it is ranked ninth by FIFA. Belgium is ranked rather inhomogeneously in positions 6, 8 and 3 by the different methods. More details on the comparison of estimated team abilities and the FIFA ranking can be found in Ley et al. (2019).
D Probabilities for FIFA World Cup 2018 Winner
In this section, the hybrid random forest is applied to (new) data for the World Cup 2018 in Russia (in advance of the tournament) to predict winning probabilities for all teams and to predict the tournament course.
The abilities were estimated by the bivariate Poisson model with a half period of three years. All matches of the 228 national teams played between 2010-06-13 and 2018-06-06 are used for the estimation, which results in a total of more than 7000 matches. All further predictor variables are taken at their latest values shortly before the World Cup (using the finally announced squads of 23 players for all nations).
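The half period of three years implies an exponential down-weighting of past matches: a match played one half period before the reference date receives half the weight of a current match. A minimal sketch of this weighting idea (function name and reference date are illustrative, following the half-period concept of Ley et al. 2019):

```python
from datetime import date

HALF_PERIOD_DAYS = 3 * 365.25  # half period of three years

def match_weight(match_date, reference_date=date(2018, 6, 6)):
    """Weight halves for every half period going back in time."""
    days_back = (reference_date - match_date).days
    return 0.5 ** (days_back / HALF_PERIOD_DAYS)

print(round(match_weight(date(2018, 6, 6)), 3))  # most recent match: weight 1.0
print(round(match_weight(date(2015, 6, 6)), 3))  # three years back: weight ~0.5
```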
For each match in the World Cup 2018, the hybrid random forest can be used to predict an expected number of goals for both teams. Given the expected numbers of goals, an exact result is drawn by assuming two (conditionally) independent Poisson distributions for the two scores. Based on these results, all 48 matches from the group stage can be simulated and the final group standings can be calculated. Because exact results are simulated, we can precisely follow the official FIFA rules when determining the final group standings. This enables us to determine the matches in the round-of-sixteen, and we can continue by simulating the knockout stage. In the case of a draw in the knockout stage, we simulate extra time by a second simulated result. Here, however, we multiply the expected number of goals by the factor 0.33 to account for the shorter time in which to score (30 min instead of 90 min). In the case of a further draw in extra time, we simulate the penalty shootout by a (virtual) coin flip.
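The draw-handling logic for a single knockout match can be sketched as follows (the expected-goals values are hypothetical; the 0.33 factor and the coin-flip shootout are as described above):

```python
import numpy as np

def simulate_knockout_match(lam_a, lam_b, rng):
    """Draw a result from two (conditionally) independent Poisson scores;
    resolve draws by extra time (rates scaled by 0.33) and then a coin flip."""
    goals_a, goals_b = rng.poisson(lam_a), rng.poisson(lam_b)
    if goals_a != goals_b:
        return "A" if goals_a > goals_b else "B"
    # draw after 90 min: extra time with expected goals scaled by 0.33
    et_a, et_b = rng.poisson(0.33 * lam_a), rng.poisson(0.33 * lam_b)
    if et_a != et_b:
        return "A" if et_a > et_b else "B"
    # still level: penalty shootout as a (virtual) fair coin flip
    return "A" if rng.random() < 0.5 else "B"

rng = np.random.default_rng(2018)
wins_a = sum(simulate_knockout_match(1.6, 1.1, rng) == "A" for _ in range(100_000))
print(wins_a / 100_000)  # the team with higher expected goals advances more often
```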
Following this strategy, an entire tournament run can be simulated, which we repeat 100,000 times. Based on these simulations, we obtain, for each of the 32 participating teams, probabilities to reach the individual knockout stages and, finally, to win the tournament. These are summarized in Table 9, together with the winning probabilities based on the ODDSET odds for comparison.
We can see that, according to our hybrid random forest model, Spain was the favored team, with a predicted winning probability of 13.7%, followed by Germany, France, Brazil and Belgium. Overall, this result is in line with the probabilities from the bookmakers, as we can see in the last column. While ODDSET favors Germany and Brazil, the hybrid random forest model predicts a slight advantage for Spain. However, there is no clear favorite, as several teams seem to have good chances. In retrospect, the early drop-outs of Germany and Spain seem rather surprising. While Spain at least played a successful group stage, finishing in first place, Germany performed unexpectedly badly, with two defeats during the group stage. The probability of such an early drop-out of Germany was predicted to be only around 22% and could therefore be seen as the biggest surprise of the tournament. Spain failed in the round-of-sixteen against host Russia in a penalty shootout and, hence, did not reach the quarter finals (the probability of this event had been predicted to be about 39%). Beside the probabilities of becoming world champion, Table 9 provides further interesting insights for the single stages within the tournament. For example, it is interesting to see that the two favored teams, Spain and Germany, had almost equal chances of at least reaching the round-of-sixteen (80.5% and 78.0%, respectively), while their probabilities of at least reaching the quarter finals differ markedly: Spain had a probability of 61.2% to reach at least the quarter finals, Germany only 49.0%. Obviously, in contrast to Spain, Germany had a rather high chance of meeting a strong opponent in the round-of-sixteen. Had they reached the round-of-sixteen, Germany would have faced either Brazil, Switzerland, Serbia or Costa Rica, while Spain would have faced Uruguay, Russia, Saudi Arabia or Egypt.
In the following rounds, Germany catches up with Spain, finally ending up with almost equal winning probabilities.
Most probable tournament course
Finally, based on the 100,000 simulations, we also provide the most probable tournament course. For each of the eight groups, we selected the most probable final group standing, considering the order of the first two places but ignoring the irrelevant order of the teams in places three and four. The results, together with the corresponding probabilities, are presented in Table 10.
Obviously, there are large differences with respect to the groups’ balance. While in Group B and Group G the model forecasts Spain followed by Portugal as well as Belgium followed by England with rather high probabilities of 27.7% and 26.7%, respectively, other groups such as Group D, Group E, Group F and Group H seem to be more volatile. Now that we know the true tournament outcome, it is worth noting that in Groups B and G the first two places were indeed taken by exactly the two forecasted teams, while in Groups F and H there were some surprises.
Moreover, we provide the most probable course of the knockout stage in Figure 4. The most likely round-of-sixteen results directly from the teams qualifying for the knockout stage in Table 10. For all following matches, we compute the probabilities for the respective two teams (say team A and team B) to advance to the next stage. This is done by first applying the Skellam distribution to obtain the probabilities of a win of team A, a draw and a win of team B after 90 minutes. Second, the draw probability is distributed between teams A and B, again following the principles for extra time and penalty shootouts that we already applied for draws in the knockout stage in the previous section. This way, the probabilities for A and B to advance add up to 1, as is necessary for the knockout stage. In Figure 4, the probabilities accompanying the edges of the tournament tree represent the probability of the favored team to proceed to the next stage.
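This redistribution of the draw probability can be written down compactly with SciPy’s Skellam distribution (the expected-goals values in the example are hypothetical):

```python
from scipy.stats import skellam

def advance_prob(lam_a, lam_b, et_factor=0.33):
    """P(team A advances): win after 90 minutes, or draw followed by an
    extra-time win (rates scaled by et_factor), or a further draw that is
    decided by a fair penalty shootout."""
    win_90 = skellam.sf(0, lam_a, lam_b)      # P(A wins after 90 min)
    draw_90 = skellam.pmf(0, lam_a, lam_b)    # P(draw after 90 min)
    win_et = skellam.sf(0, et_factor * lam_a, et_factor * lam_b)
    draw_et = skellam.pmf(0, et_factor * lam_a, et_factor * lam_b)
    return win_90 + draw_90 * (win_et + 0.5 * draw_et)

p_a = advance_prob(1.6, 1.1)
print(round(p_a, 3))  # P(team B advances) is exactly 1 - p_a
```

By construction, the two advancement probabilities add up to one, as required for a knockout match.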
In the most probable tournament course, Germany wins the World Cup. However, it again becomes obvious that, with Switzerland in that case, the German team would have had to face a much stronger opponent than Spain in the round-of-sixteen. Even though they were still the favorite in this match, they would have succeeded in moving on to the quarter finals only with a probability of 58%. While in the most probable course of the knockout stage Germany, despite having tough times in all single stages, would have made its way into the final and defended the title, the previous section showed that Spain was still the most likely winner overall.
We wish to draw the reader’s attention to the fact that, despite being the most probable tournament course, this exact course was still extremely unlikely due to the myriad of possible constellations: taking the product of all single probabilities from Table 10 and Figure 4 yields an overall probability of $7.63 \cdot 10^{-9}\,\%$. Hence, deviations of the true tournament course from the model’s most probable one were not only possible, but very likely.
References
Bischl, B., M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17:1–5. http://jmlr.org/papers/v17/15-066.html.
Boshnakov, G., T. Kharrat, and I. G. McHale. 2017. “A Bivariate Weibull Count Model for Forecasting Association Football Scores.” International Journal of Forecasting 33:458–466. https://doi.org/10.1016/j.ijforecast.2016.11.006.
Breiman, L., J. H. Friedman, R. A. Olshen, and J. C. Stone. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth.
Dixon, M. J. and S. G. Coles. 1997. “Modelling Association Football Scores and Inefficiencies in the Football Betting Market.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 46:265–280. https://doi.org/10.1111/1467-9876.00065.
Dyte, D. and S. R. Clarke. 2000. “A Ratings Based Poisson Model for World Cup Soccer Simulation.” Journal of the Operational Research Society 51(8):993–998. https://doi.org/10.1057/palgrave.jors.2600997.
Friedman, J., T. Hastie, and R. Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33:1. https://doi.org/10.18637/jss.v033.i01.
Gneiting, T. and A. E. Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102:359–378. https://doi.org/10.1198/016214506000001437.
Groll, A. and J. Abedieh. 2013. “Spain Retains its Title and Sets a New Record – Generalized Linear Mixed Models on European Football Championships.” Journal of Quantitative Analysis in Sports 9:51–66. https://doi.org/10.1515/jqas-2012-0046.
Groll, A., T. Kneib, A. Mayr, and G. Schauberger. 2018. “On the Dependency of Soccer Scores – A Sparse Bivariate Poisson Model for the UEFA European Football Championship 2016.” Journal of Quantitative Analysis in Sports 14:65–79. https://doi.org/10.1515/jqas-2017-0067.
Groll, A., G. Schauberger, and G. Tutz. 2015. “Prediction of Major International Soccer Tournaments Based on Team-Specific Regularized Poisson Regression: An Application to the FIFA World Cup 2014.” Journal of Quantitative Analysis in Sports 11:97–115. https://doi.org/10.1515/jqas-2014-0051.
Kelly, J. L. 1956. “A New Interpretation of Information Rate.” Bell System Technical Journal 35:917–926. https://doi.org/10.1002/j.1538-7305.1956.tb03809.x.
Koopman, S. J. and R. Lit. 2015. “A Dynamic Bivariate Poisson Model for Analysing and Forecasting Match Results in the English Premier League.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 178:167–186. https://doi.org/10.1111/rssa.12042.
Leitner, C., A. Zeileis, and K. Hornik. 2010. “Forecasting Sports Tournaments by Ratings of (Prob)Abilities: A Comparison for the EURO 2008.” International Journal of Forecasting 26(3):471–481. https://doi.org/10.1016/j.ijforecast.2009.10.001.
Ley, C., T. Van de Wiele, and H. Van Eetvelde. 2019. “Ranking Soccer Teams on the Basis of their Current Strength: A Comparison of Maximum Likelihood Approaches.” Statistical Modelling 19:55–77. https://doi.org/10.1177/1471082X18817650.
McHale, I. and P. Scarf. 2007. “Modelling Soccer Matches Using Bivariate Discrete Distributions with General Dependence Structure.” Statistica Neerlandica 61:432–445. https://doi.org/10.1111/j.1467-9574.2007.00368.x.
McHale, I. G. and P. A. Scarf. 2011. “Modelling the Dependence of Goals Scored by Opposing Teams in International Soccer Matches.” Statistical Modelling 41:219–236. https://doi.org/10.1177/1471082X1001100303.
Probst, P. and A.-L. Boulesteix. 2017. “To Tune or not to Tune the Number of Trees in Random Forest?” Journal of Machine Learning Research 18(181):1–18.
Schauberger, G. and A. Groll. 2018. “Predicting Matches in International Football Tournaments with Random Forests.” Statistical Modelling 18:460–482. https://doi.org/10.1177/1471082X18799934.
Skellam, J. G. 1946. “The Frequency Distribution of the Difference between Two Poisson Variates Belonging to Different Populations.” Journal of the Royal Statistical Society, Series A (General) 109:296. https://doi.org/10.2307/2981372.
Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8:25. https://doi.org/10.1186/1471-2105-8-25.
Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. 2008. “Conditional Variable Importance for Random Forests.” BMC Bioinformatics 9:307. https://doi.org/10.1186/1471-2105-9-307.
Wright, M. N. and A. Ziegler. 2017. “Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77:1–17. https://doi.org/10.18637/jss.v077.i01.
Yuan, M. and Y. Lin. 2006. “Model Selection and Estimation in Regression with Grouped Variables.” Journal of the Royal Statistical Society, Series B 68:49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x.
© 2019 Walter de Gruyter GmbH, Berlin/Boston