Nowadays many approaches that analyze and predict the results of football matches are based on bookmakers’ ratings. It is commonly accepted that the models used by the bookmakers contain a lot of expertise, as the bookmakers’ profits and losses depend on the performance of their models. One objective of this article is to analyze the role of bookmakers’ odds together with many additional, potentially influential covariates with respect to a national team’s success at European football championships and, in particular, to detect covariates that are able to explain parts of the information covered by the odds. To this end, a pairwise Poisson model for the number of goals scored by national teams competing in European football championship matches is used. Moreover, the generalized linear mixed model (GLMM) approach, which is a widely used tool for modeling clustered data, allows team-specific random effects to be incorporated. Two different approaches to fitting GLMMs with variable selection are used: subset selection and a Lasso-type technique that includes an L1-penalty term enforcing variable selection and shrinkage simultaneously. Based on the two preceding European football championships, a sparse model is obtained and used to predict all matches of the current tournament, resulting in a possible course of the European football championship (EURO) 2012.
A Correlation structure of the EURO 2004 and 2008 data
B Alternative predictions of the EURO 2012
The German state betting agency ODDSET ranked Spain in third place among the favorites for the EURO 2008 with odds of 6.50, behind Germany (4.50) and Italy (5.50). In statistics, odds usually represent the ratio of the probability that an event will happen to the probability that it will not happen; European bookmakers, however, specify the gross ratio of the amount paid out to the stake. So putting €1 on Spain as the EURO 2008 champion would have given back €6.50. Thus, European odds can be directly transformed into probabilities by taking the inverse and adjusting for the bookmakers’ margins. Before the FIFA World Cup 2010 Spain was ranked in first place among the favorites, together with Brazil, with odds of 5.00.
The German state betting agency ODDSET ranked Greece in 12th place among the favorites for the EURO 2004 with odds of 45.00.
Although this represents a rather small data basis, we abstain from using earlier European championships, as one of our main objectives is to analyze the explanatory power of bookmakers’ odds together with many additional, potentially influential covariates. Unfortunately, the possibility of betting on the overall cup winner before the start of the tournament is quite novel. The German state betting agency ODDSET, for example, offered this bet for the first time at the EURO 2004.
There are countless examples of such events in history, throughout all competitions. We want to mention only some of the most famous ones: Germany’s first World Cup success in Switzerland 1954, known as the “Miracle of Bern”; Greece’s victory at the EURO 2004 (compare footnote 1); FC Porto’s triumph in the UEFA Champions League season 2003/2004.
The GDP per capita is the gross domestic product divided by midyear population. The GDP is the sum of gross values added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products.
We had to resort to different sources in order to collect data for all participating countries at the EURO 2004, 2008 and 2012. Amongst the most useful ones are http://www.wko.at, http://www.statista.com/ and http://epp.eurostat.ec.europa.eu. For some years the populations of Russia and Ukraine had to be searched individually.
Unfortunately, the archive of the webpage was not established until 4 October 2004, so the average market values of the national teams that we used for the EURO 2004 can only be seen as a rough approximation, as market values certainly changed after the EURO 2004.
Note that European national teams also gain UEFA team points for each game played in the most recently completed full cycle (a full cycle is defined as all qualifying games and final tournament games, whereas a half cycle is defined as all games played in the latest qualifying round only) of both the latest FIFA World Cup and European championship, with additional points for each game played in the latest completed half cycle. Similar to the FIFA points, a time-dependent weight adjustment is used, allocating to both the latest full and half cycle double the weight of the older full cycle. Thus, the UEFA team points would reflect a lot of information about the current strength of a national team in a Europe-wide comparison, but as UEFA changed the coefficient ranking system in 2008, we focused on the UEFA club ranking.
Note that this variable is not available from any soccer data provider and thus had to be counted “by hand.”
The two variables “maximum number of teammates” and “second maximum number of teammates” are highly (negatively) correlated with the number of different clubs at which the players are under contract, and hence also include information about the structure of the teams’ squads. Therefore, we did not consider the number of different clubs as a separate variable.
This variable is available from several soccer data providers, see for example http://www.kicker.de/.
As we are in a matched-pair design, we do not exclude single observations from the training data, but single matches.
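The match-wise exclusion described here can be illustrated with a minimal sketch. It assumes that each match contributes two rows to the training data (one per team) and that a match identifier is available for every row; the function name `leave_one_match_out` and the toy identifiers are hypothetical and not taken from the article.

```python
from collections import defaultdict


def leave_one_match_out(match_ids):
    """Yield (train_rows, test_rows), holding out one whole match at a time.

    match_ids[i] is the (hypothetical) identifier of the match that row i
    belongs to; both rows of a match are always held out together.
    """
    rows_by_match = defaultdict(list)
    for row, match in enumerate(match_ids):
        rows_by_match[match].append(row)

    for held_out, test_rows in rows_by_match.items():
        train_rows = [
            row for row, match in enumerate(match_ids) if match != held_out
        ]
        yield train_rows, test_rows


# Toy data (assumption): three matches with two rows each.
match_ids = ["m1", "m1", "m2", "m2", "m3", "m3"]
splits = list(leave_one_match_out(match_ids))
```

Each split keeps the two observations of a match on the same side, so the held-out score of one team is never predicted from a model that has seen the opponent's score in the same match.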
A closer look at the coefficient paths of this model shows that for slightly smaller values of the tuning parameter than the selected one, the variables ODDSET odds and fairness would have been included. Besides, in most of the training data sets both ODDSET odds and fairness have been included at the optimal tuning parameter.
In comparison to Model 2, several variables are no longer selected by glmmLasso based on LOOCV when the variable fairness (V2) is excluded. This may be due to the considerable correlations between fairness and these variables, e.g. $\mathrm{cor}_{V2,V10} = -0.29$, $\mathrm{cor}_{V2,V11} = -0.16$ and $\mathrm{cor}_{V2,V12} = -0.16$ (see Table 6 in Appendix A).
Three-way odds consider only the tendency of a match, with the possible results being a win for Team 1, a draw, or a defeat of Team 1, and are usually fixed some days before the corresponding match takes place.
The transformed probabilities only serve as an approximation, based on the assumption that the bookmakers’ margins follow a discrete uniform distribution on the three possible match tendencies.
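As a minimal sketch of such a transformation, the snippet below inverts European (decimal) three-way odds and rescales the inverses so that they sum to one. Rescaling proportionally is one common reading of "adjusting for the margin" and is an assumption here, as are the function name `odds_to_probabilities` and the example odds, which are invented for illustration.

```python
def odds_to_probabilities(decimal_odds):
    """Turn decimal odds into approximate outcome probabilities.

    The inverse odds sum to more than 1; the excess is the bookmaker's
    margin. Dividing by the sum removes it proportionally (assumption).
    """
    inverse = [1.0 / odd for odd in decimal_odds]
    total = sum(inverse)
    return [value / total for value in inverse]


# Hypothetical three-way odds: win of Team 1, draw, defeat of Team 1.
example_odds = [2.0, 3.2, 3.6]
probs = odds_to_probabilities(example_odds)

# Implied margin: how far the raw inverse odds overshoot 1.
margin = sum(1.0 / odd for odd in example_odds) - 1.0
```

The resulting values are approximations in the sense described above: a different assumption about how the margin is distributed over the three outcomes would give slightly different probabilities.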
For convenience we suppress the index t for both teams here, which indicates the number of the game for a team, as well as the indices j and […] corresponding to the match-specific random effects. As the match under consideration could have a different number in the individual match numbering of each team, one should correctly write […] and […] if Team k and Team l are facing each other in a certain match j, where the superscript indicates that the estimate depends on the opponent’s covariates.
Similar to footnote 3, in the following we suppress both the indices j and […] corresponding to the match-specific random effects and the index for the match numbering as well as the superscripts for both teams, in order to keep the notation simple. Note here that for Ireland and Ukraine, the two teams that qualified for neither EURO 2004 nor EURO 2008, no random effects estimates exist and thus their random effects are set to zero. Besides, it has to be mentioned that the match-specific random effects estimates cannot be used for the prediction of new matches.
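To illustrate how predicted goal rates translate into a match tendency, the following sketch computes win, draw and loss probabilities from two Poisson rates. It assumes that, given the rates, the two scores are independent, and it truncates the score grid at `max_goals`; the function names and the example rates are hypothetical and are not estimates from the fitted model.

```python
import math


def poisson_pmf(goals, rate):
    """Probability of exactly `goals` goals under a Poisson(rate) law."""
    return math.exp(-rate) * rate**goals / math.factorial(goals)


def outcome_probabilities(rate_1, rate_2, max_goals=15):
    """Return (win, draw, loss) probabilities from the view of Team 1.

    Assumes independent Poisson scores given the two rates and ignores
    scores above max_goals, whose probability is negligible for
    realistic football rates.
    """
    win = draw = loss = 0.0
    for goals_1 in range(max_goals + 1):
        for goals_2 in range(max_goals + 1):
            p = poisson_pmf(goals_1, rate_1) * poisson_pmf(goals_2, rate_2)
            if goals_1 > goals_2:
                win += p
            elif goals_1 == goals_2:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical expected goals for Team 1 and Team 2.
win, draw, loss = outcome_probabilities(1.6, 1.1)
```

For a team without a random effects estimate, the rate passed in would simply be computed with that effect set to zero, as described in the footnote.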
Akaike, H. 1973. “Information Theory and the Extension of the Maximum Likelihood Principle.” Second International Symposium on Information Theory 267–281.
Bates, D. and M. Maechler. 2010. lme4: Linear Mixed-Effects Models Using S4 Classes. R package version 0.999375-34.
Bernard, A. B. and M. R. Busse. 2004. “Who Wins the Olympic Games: Economic Development and Medal Totals.” The Review of Economics and Statistics 86(1):413–417. DOI: 10.1162/003465304774201824.
Breslow, N. E. and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88:9–25.
Breslow, N. E. and X. Lin. 1995. “Bias Correction in Generalized Linear Mixed Models With a Single Component of Dispersion.” Biometrika 82:81–91.
Broström, G. 2009. glmmML: Generalized Linear Models With Clustering. R package version 0.81-6.
Brown, T. D., J. L. Van Raalte, B. W. Brewer, C. R. Winter, A. E. Cornelius, and M. B. Andersen. 2002. “World Cup Soccer Home Advantage.” Journal of Sport Behavior 25:134–144.
Carlin, J. B., L. C. Gurrin, J. A. C. Sterne, R. Morley, and T. Dwyer. 2005. “Regression Models for Twin Studies: A Critical Review.” International Journal of Epidemiology 34:1089–1099. DOI: 10.1093/ije/dyi153.
Dawson, P. and S. Dobson. 2010. “The Influence of Social Pressure and Nationality on Individual Decisions. Evidence From the Behaviour of Referees.” Journal of Economic Psychology 31:181–191.
Dyte, D. and S. R. Clarke. 2000. “A Ratings Based Poisson Model for World Cup Soccer Simulation.” Journal of the Operational Research Society 51(8):993–998. DOI: 10.1057/palgrave.jors.2600997.
Eugster, M. J. A., J. Gertheiss, and S. Kaiser. 2011. “Having the Second Leg at Home – Advantage in the UEFA Champions League Knockout Phase?” Journal of Quantitative Analysis in Sports 7(1). DOI: 10.2202/1559-0410.1275.
Frohwein, T. 2010, June. Die falschen Pferde. In: e-politik.de (08.06.2010), available at: http://www.e-politik.de/lesen/artikel/2010/die-falschen-pferde/ (12.06.2012).
Gerhards, J., M. Mutz, and G. G. Wagner. 2012. “Keiner Kommt an Spanien Vorbei – außer dem Zufall.” DIW-Wochenbericht 24:14–20.
Gerhards, J. and G. G. Wagner. 2008. “Market Value Versus Accident – Who Becomes European Soccer Champion?” DIW-Wochenbericht 24:236–328.
Gerhards, J. and G. G. Wagner. 2010. “Money and a Little Bit of Chance: Spain Was Odds-On Favourite of the Football Worldcup.” DIW-Wochenbericht 29:12–15.
Goeman, J. J. 2010. “L1 Penalized Estimation in the Cox Proportional Hazards Model.” Biometrical Journal 52:70–84.
Groll, A. 2011a. glmmLasso: Variable Selection for Generalized Linear Mixed Models by L1-Penalized Estimation. R package version 1.1.0.
Groll, A. 2011b. Variable Selection by Regularization Methods for Generalized Mixed Models. Ph.D. thesis, University of Munich, Göttingen. Cuvillier Verlag.
Groll, A. and G. Tutz. 2012. “Variable Selection for Generalized Linear Mixed Models by L1-Penalized Estimation.” Statistics and Computing. DOI: 10.1007/s11222-012-9359-z.
Leitner, C., A. Zeileis, and K. Hornik. 2008. “Who is Going to Win the EURO 2008? (A statistical investigation of bookmakers odds).” Research Report Series, Department of Statistics and Mathematics, University of Vienna.
Leitner, C., A. Zeileis, and K. Hornik. 2010a. “Forecasting Sports Tournaments by Ratings of (Prob)abilities: A Comparison for the EURO 2008.” International Journal of Forecasting 26(3):471–481. DOI: 10.1016/j.ijforecast.2009.10.001.
Leitner, C., A. Zeileis, and K. Hornik. 2010b. “Forecasting the Winner of the FIFA World Cup 2010.” Research Report Series, Department of Statistics and Mathematics, University of Vienna.
Leitner, C., A. Zeileis, and K. Hornik. 2011. “Bookmaker Consensus and Agreement for the UEFA Champions League 2008/09.” IMA Journal of Management Mathematics 22(2):183–194. DOI: 10.1093/imaman/dpq016.
Lin, X. and N. E. Breslow. 1996. “Bias Correction in Generalized Linear Mixed Models with Multiple Components of Dispersion.” Journal of the American Statistical Association 91:1007–1016. DOI: 10.1080/01621459.1996.10476971.
Nevill, A., N. Balmer, and M. Williams. 1999. “Crowd Influence on Decisions in Association Football.” The Lancet 353(9162):1416.
Pollard, R. and G. Pollard. 2005. “Home Advantage in Soccer: A Review of its Existence and Causes.” International Journal of Soccer and Science Journal 3(1):25–33.
Schelldorfer, J. and P. Bühlmann. 2011. “GLMMLasso: An Algorithm for High-Dimensional Generalized Linear Mixed Models Using L1-Penalization.” Preprint, ETH Zurich. http://stat.ethz.ch/people/schell. DOI: 10.1111/j.1467-9469.2011.00740.x.
Stoy, V., R. Frankenberger, D. Buhr, L. Haug, B. Springer, and J. Schmid. 2010. “Das Ganze ist Mehr als die Summe seiner Lichtgestalten. Eine ganzheitliche Analyse der Erfolgschancen bei der Fußballweltmeisterschaft 2010.” Working Paper 46, Eberhard Karls University, Tübingen, Germany.
Yang, H. 2007. Variable Selection Procedures for Generalized Linear Mixed Models in Longitudinal Data Analysis. Ph.D. thesis, North Carolina State University.
Zeileis, A., C. Leitner, and K. Hornik. 2012. History repeating: Spain beats Germany in the EURO 2012 final. Working paper, Faculty of Economics and Statistics, University of Innsbruck.
©2013 by Walter de Gruyter Berlin Boston