
Journal of Quantitative Analysis in Sports

An official journal of the American Statistical Association

Editor-in-Chief: Mark Glickman PhD

4 Issues per year

SCImago Journal Rank (SJR) 2015: 0.288
Source Normalized Impact per Paper (SNIP) 2015: 0.358
Impact per Publication (IPP) 2015: 0.250


A generative model for predicting outcomes in college basketball

Francisco J. R. Ruiz
  • Corresponding author
  • University Carlos III in Madrid – Signal Theory and Communications Department. Avda. de la Universidad, 30. Lab 4.3.A03, Leganes, Madrid 28911, Spain
  • ORCID iD: http://orcid.org/0000-0002-2200-901X
  • Email:
/ Fernando Perez-Cruz
  • University Carlos III in Madrid – Signal Theory and Communications Department. Avda. de la Universidad, 30, Leganes, Madrid 28911, Spain; and Bell Labs, Alcatel-Lucent, New Providence, NJ 07974 USA, e-mail:
Published Online: 2015-02-07 | DOI: https://doi.org/10.1515/jqas-2014-0055


We show that a classical model for soccer can also provide competitive results in predicting basketball outcomes. We modify the classical model in two ways in order to capture both the specific behavior of each National Collegiate Athletic Association (NCAA) conference and the different strategies of teams and conferences. Through simulated bets on six online betting houses, we show that this extension leads to better predictive performance in terms of the profit we make. We compare our estimates with the probabilities predicted by the winner of the recent Kaggle competition on the 2014 NCAA tournament, and conclude that our model tends to provide results that differ more from the implicit probabilities of the betting houses and, therefore, has the potential to provide higher benefits.

Keywords: NCAA tournament; Poisson factorization; Probabilistic modeling; variational inference

1 Introduction

In this paper, we aim at estimating probabilities in sports. Specifically, we focus on the March Madness Tournament in college basketball,1 although the model is general enough to model nearly any team sport for regular season and play-off games (assuming that both teams are willing to win). Estimating probabilities in sport events is challenging, because it is unclear what variables affect the outcome and what information is publicly known before the games begin. In team sports, it is even more complicated, because the information about individual players becomes relevant. Although there have been some attempts to model individual players (Miller et al. 2014), there is no standard method to evaluate the importance of individual players and remove their contribution to the team when players do not play or get injured or suspended. It is also unclear whether considering individual player information can improve predictions without overfitting. For college basketball, even more variables come into play, because there are 351 teams divided into 32 conferences, they only play about 30 regular games and the match-ups are not random, so the results do not directly show the level of each team.

In the literature, we can find several variants of a simple model for soccer that identifies each team by its attack and defense coefficients (Baio and Blangiardo 2010; Crowder et al. 2002; Dixon and Coles 1997; Heuer, Muller, and Rubner 2010; Maher 1982). In all these works, the score for the home team is drawn from a Poisson distribution, whose mean is the multiplicative contribution of the home team attack coefficient and the away team defense coefficient. The score of the visitor team is an independent Poisson random variable, whose mean is the visitor attack coefficient multiplied by the home team defense coefficient. These coefficients are estimated by maximum likelihood using the past results and used to predict future outcomes.

A similar model can be found in the literature of Poisson factorization (Canny 2004; Cemgil 2009; Dunson and Herring 2005), where the elements of a matrix are assumed to be independent Poisson random variables given some latent attributes. For instance, in Poisson factorization for recommendation systems (Gopalan, Hofman, and Blei 2013), where the input is a user/item matrix of ratings, each user and each item is represented by a K-dimensional latent vector of positive weights. Each rating is modeled by a Poisson distribution parameterized by the inner product of the user’s and item’s weights.

We build a model that combines these two ideas (Poisson factorization and the model for soccer) and takes into account the structure of the Men’s Division I Basketball of the National Collegiate Athletic Association (NCAA). In order to estimate the mean of the Poisson distributions, we define an attack and defense vector for each team and for each NCAA conference. The conference-specific coefficients model the overall behavior of each conference, while the team-specific coefficients capture differences within each conference. To estimate the coefficients, we apply a variational inference algorithm. For comparisons, we adhere to the rules in the recent Kaggle competition,2 in which all the predictions have to be in place before the actual tournament starts, i.e., we do not use the results in the first rounds of the tournament to improve the predictions in the subsequent rounds.

We use two metrics to validate the model. First, we compute the average negative log-likelihood of the predicted probabilities for the winning teams. This metric is used to determine the winners of the Kaggle competition. Unfortunately, the test sample size (63 games) is too small and almost all reasonable participants are statistically indistinguishable from the winner. Compared to the winner’s probability estimates, we could not reject the null hypothesis of a Wilcoxon signed-rank test (Wilcoxon 1945) for 198 out of the remaining 247 participants. With so few test cases, it is unclear if the winners had a better model or were just lucky. This serves as an excuse for sore losers (we were ranked #39 in the competition), but more importantly as a word of advice for these competitions, in which metrics should be able to tell without doubt that some participants did significantly better (making use of statistical tests to tell them apart). Second, we compute the profit we would make after betting on six on-line betting houses using Kelly’s criterion (Kelly 1956). Kelly’s criterion assumes that our estimates are the true underlying probabilities, and the betting house odds are only an estimate. It provides the fraction of our bankroll that we should stake on each bet in order to maximize the long term growth rate of our fortune and make sure that we do not lose it all. This metric tells us how good our probability estimates are when compared to those of the betting houses. Our model outperforms the considered betting houses and the Kaggle competition winner.

2 Model description

We develop a statistical model for count data, corresponding to the outcomes of each basketball game. For each game m=1,…,M, we observe the pair $(y_m^H, y_m^A)$, which are the points scored by the home and away teams, respectively.

The soccer model by Maher (1982) or Dixon and Coles (1997) introduces an attack and defense coefficient for each team t=1,…,T, denoted, respectively, by αt and βt. Given these coefficients, the number of scores obtained by the home and away sides at game m are independently distributed as

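$$y_m^H \sim \mathrm{Poisson}\left(\gamma\,\alpha_{h(m)}\,\beta_{a(m)}\right), \qquad y_m^A \sim \mathrm{Poisson}\left(\alpha_{a(m)}\,\beta_{h(m)}\right), \tag{1}$$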

respectively. Here, the index h(m)∈{1,…, T} identifies the team that is playing at home in the m-th game and, similarly, a(m) identifies the team that is playing away. The parameter γ is the home coefficient and represents the advantage for the team hosting the game. This effect is assumed to be constant for all the teams and throughout the season. Note also that βt is actually an “inverse defense” coefficient, in the sense that smaller values represent better defense capabilities.

For the NCAA Tournament, we modify the model in Eq. 1 in two ways. First, we represent each team with K1 attack coefficients and K1 defense coefficients, which are grouped for each team in vectors αt and βt, respectively. Each coefficient may represent a particular tactic or strategy, so that teams can be good at defending some tactics but worse at defending others (the same applies for attacking). Second, we also take into account the conference to which each team belongs.3 For that purpose, we introduce conference-specific attack and defense coefficients, allowing us to capture the overall behavior of each conference. We denote by ηl and ρl the K2-dimensional attack and defense coefficient vectors of conference l, respectively, and we introduce index ℓ(t)∈{1 ,…, L} to represent the conference to which team t belongs. Hence, we model the outcome at game m as

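$$y_m^H \sim \mathrm{Poisson}\left(\gamma\,\alpha_{h(m)}^\top\beta_{a(m)} + \gamma\,\eta_{\ell(h(m))}^\top\rho_{\ell(a(m))}\right), \qquad y_m^A \sim \mathrm{Poisson}\left(\alpha_{a(m)}^\top\beta_{h(m)} + \eta_{\ell(a(m))}^\top\rho_{\ell(h(m))}\right). \tag{2}$$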
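To make the structure of Eq. 2 concrete, the following is a minimal Python sketch that computes the two Poisson means for a single game. The function name `poisson_means`, the argument names and all numeric values are illustrative assumptions, not part of the original model code; the only thing taken from the text is the form of the means (team inner product plus conference inner product, scaled by γ for the home side).

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poisson_means(alpha_h, beta_h, alpha_a, beta_a,
                  eta_h, rho_h, eta_a, rho_a, gamma=1.0):
    """Return (home_mean, away_mean) with the structure of Eq. 2.

    alpha_* / beta_*: team attack / inverse-defense vectors (length K1).
    eta_* / rho_*:    conference attack / inverse-defense vectors (length K2).
    gamma:            home coefficient; 1.0 mimics a neutral court.
    """
    home = gamma * (dot(alpha_h, beta_a) + dot(eta_h, rho_a))
    away = dot(alpha_a, beta_h) + dot(eta_a, rho_h)
    return home, away


# Hypothetical toy values with K1 = 2 and K2 = 1.
home_mean, away_mean = poisson_means(
    alpha_h=[6.0, 4.0], beta_h=[5.0, 3.0],
    alpha_a=[5.0, 5.0], beta_a=[4.0, 6.0],
    eta_h=[3.0], rho_h=[2.0],
    eta_a=[2.0], rho_a=[4.0],
    gamma=1.03,
)
print(home_mean, away_mean)
```

Because the coefficients enter only through inner products, a team can score heavily against an opponent whose inverse-defense vector is large in the same components where its own attack vector is large.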

To complete the specification of the model, we place independent gamma priors over the elements of the attack and defense vectors, as well as a gamma prior over the home coefficient. Throughout the paper, we parametrize the gamma distribution with its shape and rate. Therefore, the generative model is as follows:

  1. Draw the home coefficient γ∼gamma(sγ, rγ).

  2. For each team t=1, …, T:

    1. Draw the attack coefficients αt,k∼gamma(sα, rα) for k=1, …, K1

    2. Draw the defense coefficients βt,k∼gamma(sβ, rβ) for k=1, …, K1.

  3. For each conference l=1, …, L:

    1. Draw the attack coefficients ηl,k∼gamma(sη, rη) for k=1, …, K2.

    2. Draw the defense coefficients ρl,k∼gamma(sρ, rρ) for k=1, …, K2.

  4. For each game m=1, …, M:

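    1. Draw the score $y_m^H \sim \mathrm{Poisson}\left(\gamma\,\alpha_{h(m)}^\top\beta_{a(m)} + \gamma\,\eta_{\ell(h(m))}^\top\rho_{\ell(a(m))}\right)$.

    2. Draw the score $y_m^A \sim \mathrm{Poisson}\left(\alpha_{a(m)}^\top\beta_{h(m)} + \eta_{\ell(a(m))}^\top\rho_{\ell(h(m))}\right)$.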

Thus, the shape and rate parameters of the a priori gamma distributions are hyperparameters of our model. The corresponding graphical model is shown in Figure 1, in which circles correspond to random variables and gray-shaded circles represent observations.
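The generative process above can be sketched in a few lines of standard-library Python. This is only an illustrative simulation under stated assumptions: the function names, the tiny schedule, and the chunked Knuth sampler are ours, and the default prior values are placeholders chosen for the toy run. Note that `random.gammavariate` takes a scale parameter, so the rate is inverted.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_poisson(mean, rng):
    """Knuth's method applied in chunks of mean <= 30.

    Chunking is valid because a sum of independent Poissons is Poisson.
    """
    total = 0
    while mean > 0.0:
        chunk = min(mean, 30.0)
        threshold = math.exp(-chunk)
        product = rng.random()
        while product > threshold:
            total += 1
            product *= rng.random()
        mean -= chunk
    return total


def sample_season(games, conference_of, K1, K2, seed=0,
                  coef_prior=(1.0, 0.1), home_prior=(1.0, 1.0)):
    """Follow steps 1-4 of the generative process; priors are (shape, rate)."""
    rng = random.Random(seed)

    def gamma_vector(size, prior):
        shape, rate = prior
        return [rng.gammavariate(shape, 1.0 / rate) for _ in range(size)]

    n_teams = len(conference_of)
    n_confs = max(conference_of) + 1
    gamma = gamma_vector(1, home_prior)[0]                            # step 1
    alpha = [gamma_vector(K1, coef_prior) for _ in range(n_teams)]    # step 2
    beta = [gamma_vector(K1, coef_prior) for _ in range(n_teams)]
    eta = [gamma_vector(K2, coef_prior) for _ in range(n_confs)]      # step 3
    rho = [gamma_vector(K2, coef_prior) for _ in range(n_confs)]
    scores = []
    for h, a in games:                                                # step 4
        lh, la = conference_of[h], conference_of[a]
        mean_h = gamma * (dot(alpha[h], beta[a]) + dot(eta[lh], rho[la]))
        mean_a = dot(alpha[a], beta[h]) + dot(eta[la], rho[lh])
        scores.append((sample_poisson(mean_h, rng),
                       sample_poisson(mean_a, rng)))
    return scores


# Four hypothetical teams in two conferences; games are (home, away) pairs.
conference_of = [0, 0, 1, 1]
games = [(0, 1), (2, 3), (0, 2)]
scores = sample_season(games, conference_of, K1=2, K2=1, seed=42)
print(scores)
```

Sampling from the prior in this way is a cheap sanity check that a chosen set of hyperparameters produces scores on a plausible scale before any inference is run.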

Figure 1:

Graphical model representation for our generative model.

3 Inference

In this section, we describe a mean-field inference algorithm to approximate the posterior distribution of the attack and defense coefficients, as well as the home coefficient, which we need to predict the outcomes of the tournament games.

Variational inference provides an alternative to Markov chain Monte Carlo (MCMC) methods as a general source of approximation methods for inference in probabilistic models (Jordan et al. 1999). Variational algorithms turn inference into a non-convex optimization problem, but they are in general computationally less demanding than MCMC methods and do not suffer from limitations involving mixing of the Markov chains. In a general variational inference scenario, we have a set of hidden variables Φ whose posterior distribution given the observations y is intractable. In order to approximate the posterior $p(\Phi \mid y, \mathcal{H})$, where $\mathcal{H}$ denotes the set of hyperparameters of the model, we first define a parametrized family of distributions over the hidden variables, q(Φ), and then fit their parameters to find a distribution that is close to the true posterior. Closeness is measured in terms of the Kullback-Leibler (KL) divergence between both distributions, $D_{\mathrm{KL}}(q\|p)$. The computation of the KL divergence is intractable, but fortunately, minimizing $D_{\mathrm{KL}}(q\|p)$ is equivalent to maximizing the so-called evidence lower bound (ELBO) $\mathcal{L}$, since

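$$\log p(y \mid \mathcal{H}) = \mathbb{E}\left[\log p(y,\Phi \mid \mathcal{H})\right] + H[q] + D_{\mathrm{KL}}(q\|p) \;\geq\; \mathbb{E}\left[\log p(y,\Phi \mid \mathcal{H})\right] + H[q] \triangleq \mathcal{L}, \tag{3}$$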

where the expectations above are taken with respect to the variational distribution q(Φ), and H[q] denotes the entropy of the distribution q(Φ).
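The decomposition in Eq. 3 can be checked numerically on a toy problem. The sketch below uses a single latent variable with three states and made-up joint probabilities (all numbers are hypothetical); it verifies that the expected log joint plus the entropy plus the KL divergence recovers the log evidence, so the first two terms alone form a lower bound.

```python
import math

# Toy joint p(y, phi) for one fixed observation y and phi in {0, 1, 2}.
# The values are hypothetical; they only need to be positive.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                         # p(y)
posterior = [j / evidence for j in joint]     # p(phi | y)

q = [0.3, 0.5, 0.2]                           # any distribution over phi

expected_log_joint = sum(qi * math.log(ji) for qi, ji in zip(q, joint))
entropy = -sum(qi * math.log(qi) for qi in q)
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))

elbo = expected_log_joint + entropy
print(math.log(evidence), elbo + kl, elbo)
```

Since the KL term is non-negative and the log evidence does not depend on q, pushing the ELBO up is the same as pushing the KL divergence down.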

Typical variational inference methods maximize the ELBO by coordinate ascent, iteratively optimizing each variational parameter. A closed-form expression for the corresponding updates can be easily found for conditionally conjugate variables, i.e., variables whose complete conditional is in the exponential family. We refer to (Ghahramani and Beal 2001; Hoffman et al. 2013) for further details. In order to obtain a conditionally conjugate model, and following (Dunson and Herring 2005; Gopalan et al. 2013, 2014; Zhou et al. 2012), we augment the representation by defining for each game the auxiliary latent variables

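$$\begin{aligned} z_{m,k}^{H1} &\sim \mathrm{Poisson}\left(\gamma\,\alpha_{h(m),k}\,\beta_{a(m),k}\right), & z_{m,k}^{H2} &\sim \mathrm{Poisson}\left(\gamma\,\eta_{\ell(h(m)),k}\,\rho_{\ell(a(m)),k}\right),\\ z_{m,k}^{A1} &\sim \mathrm{Poisson}\left(\alpha_{a(m),k}\,\beta_{h(m),k}\right), & z_{m,k}^{A2} &\sim \mathrm{Poisson}\left(\eta_{\ell(a(m)),k}\,\rho_{\ell(h(m)),k}\right), \end{aligned} \tag{4}$$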

so that the observations for the home and away scores can be, respectively, expressed as

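$$y_m^H = \sum_{k=1}^{K_1} z_{m,k}^{H1} + \sum_{k=1}^{K_2} z_{m,k}^{H2}, \qquad \text{and} \qquad y_m^A = \sum_{k=1}^{K_1} z_{m,k}^{A1} + \sum_{k=1}^{K_2} z_{m,k}^{A2}, \tag{5}$$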

due to the additive property of Poisson random variables. Thus, the auxiliary variables preserve the marginal Poisson distribution of the observations. Furthermore, the complete conditional distribution over the auxiliary variables, given the observations and the rest of latent variables, is a Multinomial. Using the auxiliary variables, and denoting α={αt}, β={βt}, η={ηl}, ρ={ρl} and $z=\{z_{m,k}^{H1}, z_{m,k}^{H2}, z_{m,k}^{A1}, z_{m,k}^{A2}\}$, the joint distribution over the hidden variables can be written as

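$$\begin{aligned} p(\alpha,\beta,\eta,\rho,\gamma,z \mid \mathcal{H}) ={}& \prod_{t=1}^{T}\prod_{k=1}^{K_1} p(\alpha_{t,k} \mid s_\alpha, r_\alpha)\, p(\beta_{t,k} \mid s_\beta, r_\beta) \\ &\times p(\gamma \mid s_\gamma, r_\gamma) \prod_{l=1}^{L}\prod_{k=1}^{K_2} p(\eta_{l,k} \mid s_\eta, r_\eta)\, p(\rho_{l,k} \mid s_\rho, r_\rho) \\ &\times \prod_{m=1}^{M}\prod_{k=1}^{K_1} p(z_{m,k}^{H1} \mid \gamma, \alpha_{h(m),k}, \beta_{a(m),k})\, p(z_{m,k}^{A1} \mid \alpha_{a(m),k}, \beta_{h(m),k}) \\ &\times \prod_{m=1}^{M}\prod_{k=1}^{K_2} p(z_{m,k}^{H2} \mid \gamma, \eta_{\ell(h(m)),k}, \rho_{\ell(a(m)),k})\, p(z_{m,k}^{A2} \mid \eta_{\ell(a(m)),k}, \rho_{\ell(h(m)),k}), \end{aligned} \tag{6}$$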

and the observations are generated according to Eq. 5. In mean-field inference, the posterior distribution is approximated with a completely factorized variational distribution, i.e., q is chosen as

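$$q(\alpha,\beta,\eta,\rho,\gamma,z) = q(\gamma) \prod_{t=1}^{T}\prod_{k=1}^{K_1} q(\alpha_{t,k})\, q(\beta_{t,k}) \prod_{l=1}^{L}\prod_{k=1}^{K_2} q(\eta_{l,k})\, q(\rho_{l,k}) \prod_{m=1}^{M} q(z_m^H)\, q(z_m^A), \tag{7}$$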

where $z_m^H$ is the vector containing the variables $\{z_{m,k}^{H1}, z_{m,k}^{H2}\}$ for game m (and similarly for $z_m^A$ and $\{z_{m,k}^{A1}, z_{m,k}^{A2}\}$). For conciseness, we have omitted the dependency on the variational parameters in Eq. 7. We set the variational distribution for each variable in the same exponential family as the corresponding complete conditional, therefore yielding

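$$\begin{aligned} q(\gamma) &= \mathrm{gamma}(\gamma \mid \gamma^{\mathrm{shp}}, \gamma^{\mathrm{rte}}), \\ q(\alpha_{t,k}) &= \mathrm{gamma}(\alpha_{t,k} \mid \alpha_{t,k}^{\mathrm{shp}}, \alpha_{t,k}^{\mathrm{rte}}), & q(\beta_{t,k}) &= \mathrm{gamma}(\beta_{t,k} \mid \beta_{t,k}^{\mathrm{shp}}, \beta_{t,k}^{\mathrm{rte}}), \\ q(\eta_{l,k}) &= \mathrm{gamma}(\eta_{l,k} \mid \eta_{l,k}^{\mathrm{shp}}, \eta_{l,k}^{\mathrm{rte}}), & q(\rho_{l,k}) &= \mathrm{gamma}(\rho_{l,k} \mid \rho_{l,k}^{\mathrm{shp}}, \rho_{l,k}^{\mathrm{rte}}), \\ q(z_m^H) &= \mathrm{multinomial}(z_m^H \mid y_m^H, \phi_m^H), & q(z_m^A) &= \mathrm{multinomial}(z_m^A \mid y_m^A, \phi_m^A). \end{aligned} \tag{8}$$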

Then, the set of variational parameters is composed of the shape and rate for each gamma distribution, as well as the probability vectors $\phi_m^H$ and $\phi_m^A$ for the multinomial distributions. Note that $\phi_m^H$ and $\phi_m^A$ are both (K1+K2)-dimensional vectors. To minimize the KL divergence and obtain an approximation of the posterior, we apply a coordinate ascent algorithm (the update equations of the variational parameters are given in Appendix A).
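The auxiliary-variable construction of Eqs. 4 and 5 rests on two standard facts about Poisson variables, which the following sketch checks numerically with hypothetical component means: a sum of independent Poissons is Poisson with the summed mean, and, conditioned on the sum, the components are multinomial with probabilities proportional to their means. Note that this illustrates the exact complete conditional given fixed coefficients; it is not the variational update for $\phi_m^H$, which is given in Appendix A.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability mass, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def multinomial_probs(rates):
    """Probabilities of the multinomial complete conditional of z given y."""
    total = sum(rates)
    return [r / total for r in rates]


rates = [12.0, 30.0, 18.0]    # hypothetical per-component Poisson means
y = 55                        # hypothetical observed score

# Additivity: convolving the first two components gives Poisson(12 + 30).
convolved = sum(poisson_pmf(j, rates[0]) * poisson_pmf(y - j, rates[1])
                for j in range(y + 1))
direct = poisson_pmf(y, rates[0] + rates[1])

phi = multinomial_probs(rates)
print(convolved, direct, phi)
```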

4 Experiments

4.1 Experimental setup

We apply our variational algorithm to the last four years of the NCAA Men’s Division I Basketball Tournament. Here, we focus on the 2014 tournament, while results for previous years can be found in Appendix B. Following the recent Kaggle competition procedure, we fit the model using the regular season results of over 5000 games to predict the outcome of the 63 tournament games.4 As in the Kaggle competition, we do not predict the “first four” games (they are not considered in the learning stage either). We apply the algorithm described in Section 3 independently for each season, because teams exhibit different strengths even in consecutive seasons, probably due to the high turnover of players. Note that the data include a variable which indicates whether one of the teams was hosting the game or whether it was played on a neutral court. We include this variable in our formulation of the problem, and therefore we remove the home coefficient γ for games in which the site was considered neutral. We use the output of our algorithm, i.e., the parameters for the approximate posterior distribution over the hidden coefficients, to estimate the probability of teams winning in each Tournament game.5 To test the model, we simulate betting on the Tournament games using data from several betting houses6 (missing entries in the bookmaker betting odd matrices were not taken into account).

For hyperparameter selection, we carried out an exhaustive grid search, but did not find significant differences in our results as a consequence of the shape and rate values of the a priori gamma distributions. The experiments that we describe in this section were run with shape 1 and rate 0.1, except for the home coefficient, for which we use unit shape and rate.

For the training stage, we initialize our algorithm by randomly setting all the variational parameters. Every 10 iterations, we compute the ELBO as $\mathcal{L}=\mathbb{E}[\log p(\alpha,\beta,\eta,\rho,\gamma,z \mid \mathcal{H})]-\mathbb{E}[\log q(\alpha,\beta,\eta,\rho,\gamma,z)]$, where the expectations are taken with respect to the variational distribution q. The training process stops when the relative change in the ELBO is below $10^{-8}$, or when $10^{6}$ iterations are reached (whichever happens first).
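The stopping rule just described can be sketched as a small driver loop. The names `fit`, `update` and `elbo` are hypothetical, and the precise definition of "relative change" (absolute difference divided by the magnitude of the previous value) is our assumption, since the text does not spell it out. The toy objective below merely stands in for a real coordinate ascent sweep.

```python
def fit(update, elbo, tol=1e-8, max_iter=10**6, check_every=10):
    """Call `update()` until the relative ELBO change drops below `tol`.

    The ELBO is evaluated only every `check_every` iterations.
    Returns (iterations_run, last_elbo_value).
    """
    previous = None
    current = None
    for iteration in range(1, max_iter + 1):
        update()
        if iteration % check_every == 0:
            current = elbo()
            if previous is not None:
                # Assumed definition of relative change.
                denom = max(abs(previous), 1e-300)
                if abs(current - previous) / denom < tol:
                    return iteration, current
            previous = current
    return max_iter, current


# Toy stand-in for coordinate ascent: x converges to 2, objective to -1.
state = {"x": 0.0}


def toy_update():
    state["x"] = 0.5 * state["x"] + 1.0


def toy_elbo():
    return -(state["x"] - 2.0) ** 2 - 1.0


iterations, final_value = fit(toy_update, toy_elbo)
print(iterations, final_value)
```

Evaluating the bound only every few iterations keeps the monitoring cost small relative to the updates themselves.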

After convergence, we estimate the probabilities of each team winning for the 63 games in the tournament. We estimate them for each game m by computing the expected Poisson means as $\mathbb{E}[y_m^H]=\mathbb{E}[\alpha_{h(m)}^\top\beta_{a(m)}+\eta_{\ell(h(m))}^\top\rho_{\ell(a(m))}]$ and $\mathbb{E}[y_m^A]=\mathbb{E}[\alpha_{a(m)}^\top\beta_{h(m)}+\eta_{\ell(a(m))}^\top\rho_{\ell(h(m))}]$. Holding both means fixed, the difference $y_m^H-y_m^A$ follows a Skellam distribution (Skellam 1946) with parameters $\mathbb{E}[y_m^H]$ and $\mathbb{E}[y_m^A]$. We compute the probability of team h(m) winning the game as $\mathrm{Prob}(y_m^H-y_m^A>0 \mid y_m^H-y_m^A\neq 0)$. Alternatively, we can estimate probabilities by sampling from the approximate posterior distribution, with no significant difference in the predictions. We average the predicted probabilities for 100 independent runs of the variational algorithm, under different initializations, to alleviate the sensitivity of the variational algorithm to its starting point.
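As an illustration of the win probability described above, the following sketch evaluates the probability that one Poisson score exceeds another, conditioned on the game not ending in a tie, by summing Poisson probabilities directly instead of calling a Skellam routine. The function names, the truncation rule and the example means are assumptions made for this sketch.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability mass, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def win_probability(mean_h, mean_a):
    """P(y_h > y_a | y_h != y_a) for independent Poisson scores."""
    top = max(mean_h, mean_a)
    upper = int(top + 12.0 * math.sqrt(top) + 20)   # generous truncation
    pmf_h = [poisson_pmf(k, mean_h) for k in range(upper + 1)]
    pmf_a = [poisson_pmf(k, mean_a) for k in range(upper + 1)]
    win = 0.0
    tie = 0.0
    below = 0.0                     # running P(y_a < k)
    for k in range(upper + 1):
        win += pmf_h[k] * below     # home scores k, away scores fewer
        tie += pmf_h[k] * pmf_a[k]
        below += pmf_a[k]
    return win / (1.0 - tie)


# Hypothetical expected scores for one game.
p_home = win_probability(75.0, 68.0)
print(p_home)
```

Conditioning on a non-zero difference reflects the fact that a basketball game cannot end in a draw, so the tie mass is redistributed proportionally between the two teams.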

4.2 Results for 2014 tournament

Exploratory analysis. One of the benefits of a generative model is that, instead of a black-box approach, it provides an explanation of the results. Furthermore, generative models allow integrating the information from experts in sports as prior knowledge in the Bayesian generative model. This would constrain the statistical model and may provide more accurate predictions and usable information to help understand the teams’ performance.

We found that the expected value of the home coefficient is E[γ]=1.03 (we obtained this value after averaging the results for 100 independent runs for a model with K1=K2=10, with a standard deviation of around $5\times10^{-4}$). This indicates that playing at home provides some advantage, but this advantage is not as relevant as in soccer, where the home coefficient is typically around 1.4 (Dixon and Coles 1997).

We can also use our generative model to rank conferences and provide a qualitative measure of how well it follows general appreciation. Although there are several ways of ranking, we have taken a simple approach. For a model with K1=10 and K2=10 we have ranked the conferences according to $\sum_{k=1}^{K_2}\left(\mathbb{E}[\eta_{\ell,k}]-\mathbb{E}[\rho_{\ell,k}]\right)$, with expectations taken with respect to the variational distribution. In Table 1 we show the obtained ranking, together with the number of teams for each conference that entered the March Madness Tournament. The top-5 conferences (Pac-12, Big Ten, ACC, Big 12 and Atlantic 10) are the strongest ones, as they contribute six or seven teams to the Tournament. There are two conferences that contribute four teams (Big East and American) and they are ranked 7th and 8th. There are three conferences (Mountain West, West Coast and SEC) that contribute two or three teams and they are ranked 11th–13th. There are only three conferences that contribute only the conference winner and that are stronger than the second tier conferences (those with two to four teams in the Tournament). There are also three conferences (Big South, Mid-American and Ohio Valley) that we divide into two sub-conferences, but they only contribute one team to the tournament. The sub-conference that contributed a team to the tournament is always ranked higher with our score.

Table 1:

Ranking of conferences provided by our model.

We also provide some qualitative results about the team-level parameters. For the model above with K1=K2=10, we rank teams according to the value of $\sum_{k=1}^{K_1}\mathbb{E}[\alpha_{t,k}-\beta_{t,k}]+\sum_{k=1}^{K_2}\mathbb{E}[\eta_{\ell(t),k}-\rho_{\ell(t),k}]$. We show in Table 2 the top-64 teams of the obtained ranking. Out of the 36 teams that entered the tournament as “at large” bids, 34 of them are placed in the top-60 positions of the ranking. The two other teams are Tennessee, which is ranked #61, and North Carolina State, ranked #78. Out of the 32 teams that entered the Tournament as “automatic bids” (i.e., teams winning their conference tournaments), half of them are placed in the top-100 positions, while the rest are ranked up to position #280 (for Texas Southern). In addition, for nine out of the 10 conferences that contribute more than one team to the March Madness competition, the conference winner is also listed in Table 2 (top-64 positions), and 44 out of the 46 teams of these 10 conferences that entered the Tournament are also in that list. The two teams that do not appear in the top-64 positions are St Joseph’s (winner of the Atlantic 10 conference) and North Carolina State, which entered the competition in the pre-round. St Joseph’s was the play-off winner at the Atlantic 10 conference, but it had a poor record in that conference, which explains why its rating is not that high with our score. Regarding the teams in the March Madness competition belonging to the weaker conferences (i.e., those conferences that only contribute one team to the Tournament), only two out of 22 teams are in the top-64 positions. Qualitatively, our results coincide with the way teams are selected for the tournament.

Table 2:

Ranking of teams provided by our model (only shown top-64 teams).

If we focus on the Pac-12 conference, the six teams that entered the competition are placed in positions #1, #7, #17, #20, #28 and #30 of Table 2 (for Arizona, UCLA, Arizona State, Colorado, Stanford and Oregon, respectively), and the conference winner was UCLA, which is the second of the six teams. This is not a contradiction, because it is the number of games won in the conference tournament that determines the conference winner, while our model takes into account all the games and the score of each game as input. Under our ranking, a team that loses a few games by a small difference and wins many games by a large difference will be better placed than a team that wins all the games by a small margin.

Finally, our model has the ability to provide predictions for the expected results in each game, since we directly model the number of points. We include in Table 3 a subset of the games in the March Madness competition, together with their results, our predictions, and the 90% credible intervals (the rest of the games of the Tournament are shown in Appendix C). The predictions have been obtained after averaging the expected Poisson means $\mathbb{E}[y_m^H]$ and $\mathbb{E}[y_m^A]$ for 100 independent runs, using a model with K1=K2=10. Out of the 126 scores, 21 of them are outside the 90% credible interval, which is a bit high but not unheard of. What might be more surprising is that 17 out of these 21 scores are below the credible interval and only four of them are above it. There are several explanations for this effect. The most plausible is that we train the model with the regular season results but predict Tournament games instead. In regular season games, losing a game is not the end of the world, but losing a Tournament game has greater importance. Hence, players, who are young college students, might feel some additional pressure and it should be unsurprising that teams tend to score less than in the regular season. Nevertheless, we can still say that this is only a minor effect and that the loss of performance due to pressure is not significant enough to make us state that a model trained for the regular season cannot be used for predicting the March Madness Tournament.

Table 3:

List of a subset of the games in the 2014 tournament.

Quantitative analysis. To quantitatively evaluate our proposal, we report five solutions and compare them with the Kaggle competition winner and the implicit probabilities of six online betting houses.7 We use four models with a fixed value of K1 and K2, but we also report the probabilities obtained as the average of the predictions for 10 different models, with K1 ranging between 4 and 10 and K2 ranging between 10 and 15. In the Kaggle competition, our 10-model average predictions led us to position #39 out of 248. We first report the negative logarithmic loss, which is computed as in the Kaggle competition as

$$\mathrm{LogLoss} = -\frac{1}{63}\sum_{m=1}^{63}\left(\nu_m \log(\hat{\nu}_m) + (1-\nu_m)\log(1-\hat{\nu}_m)\right), \tag{9}$$

where νm∈{0,1} indicates whether team h(m) beats team a(m), and $\hat{\nu}_m\in[0,1]$ is the predicted probability of team h(m) beating team a(m). To be able to understand the variability in these results, we take 500 bootstrap samples (Efron 1979) and show the boxplot for these samples in Figure 2. We report the mean, the median, the 25/75% and the 10/90% percentiles, as well as the extreme values in the standard format. Note that K1=1, K2=0 corresponds to the classical model for soccer. We have included some markers for comparison: the best and 100th best results in the Kaggle competition, the median probability prediction for all Kaggle participants, the Kaggle seed benchmark (in which the winning probability predicted for the stronger team is 0.5 + 0.03 × seed difference) and the 0.5-benchmark for all games and teams. In this figure, the boxplot for the winner of the Kaggle competition is lower than the boxplot for our models and the online betting houses. However, we found that the predictions of the Kaggle winner are not statistically different from our predictions, as reported by a Wilcoxon signed-rank test (Wilcoxon 1945) with a significance level of 1%. Specifically, we found that the predictions by the Kaggle winner are not statistically different when compared to our 10-model average predictions. Furthermore, for 198 (out of 248) participants in the Kaggle competition, the Wilcoxon test failed to reject the null hypothesis (which corresponds to the median between the winner and the other participants being the same). This just indicates that the sample size is too small and we would need a larger test set to measure the goodness of fit of each proposal.
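A minimal sketch of the logarithmic loss of Eq. 9 together with a bootstrap over games is given below. The function names, the toy outcomes and predictions, and the clipping constant `eps` (a numerical safeguard we add so that the logarithm is always defined) are assumptions for illustration; the number of bootstrap samples defaults to the 500 used in the text.

```python
import math
import random
import statistics


def log_loss(outcomes, probs, eps=1e-15):
    """Average negative log-likelihood of binary outcomes (Eq. 9 form)."""
    total = 0.0
    for nu, p in zip(outcomes, probs):
        p = min(max(p, eps), 1.0 - eps)     # avoid log(0); our safeguard
        total += nu * math.log(p) + (1 - nu) * math.log(1.0 - p)
    return -total / len(outcomes)


def bootstrap_log_loss(outcomes, probs, n_samples=500, seed=0):
    """Resample games with replacement and recompute the loss each time."""
    rng = random.Random(seed)
    n = len(outcomes)
    losses = []
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        losses.append(log_loss([outcomes[i] for i in idx],
                               [probs[i] for i in idx]))
    return losses


# Hypothetical outcomes (1 = team h(m) won) and predicted probabilities.
outcomes = [1, 0, 1, 1, 0, 1]
probs = [0.8, 0.3, 0.6, 0.9, 0.4, 0.7]
samples = bootstrap_log_loss(outcomes, probs)
print(log_loss(outcomes, probs), statistics.median(samples))
```

With only a few dozen games, the spread of the bootstrap losses is typically wide compared with the gaps between competitors, which is the point made in the paragraph above.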

Figure 2:

Boxplot representation of logarithmic loss after bootstrap. From left to right, we depict results for the considered models, Kaggle winner’s estimates, and the six betting houses.

We now turn to a monetary metric that allows comparing our results with respect to the different betting houses. We assume that our probability estimates are the true ones and use Kelly’s criterion (Kelly 1956) to decide how much we should bet (and on which team). Roughly, Kelly’s criterion states that the amount that we should bet grows with the difference between our probabilities and the implicit probabilities of the betting houses, and that we should bet on the team for which this difference is positive.8 If the probabilities are very similar or Q is very large then Kelly’s criterion might recommend not to bet. We have applied Kelly’s criterion for the 63 games in the Tournament assuming that we have $1 per game. We could have aggregated the bankroll after each day or each weekend and bet more aggressively in the latter stages of the Tournament, but we believe that results with $1 per game are easier to follow. In Figure 3, we show the boxplot representation of our profit in the six considered betting houses, as well as the profit of the Kaggle competition winner (again, we use 500 bootstrap samples). For all the methods, the mean and the median are positive, but the profits are not significantly different from zero, nor are the methods significantly different from one another (according to a Wilcoxon signed-rank test). The mean of our 10-model average and the mean of the K1=K2=10 model are larger than the mean of the Kaggle competition winner for all the betting houses. Our variance is larger because our model points towards a high variance strategy, in which we tend to bet on the underdog (see next paragraph). Also, the probabilities given by our model are more dissimilar than the Kaggle winner’s when compared to the betting houses and, as a consequence, we bet on more games and in larger quantities, as detailed below. However, we would require a (much) larger number of test games to properly analyze the differences between both models, if they actually exist. Over the 63 tournament games we can state that the Kaggle competition winner follows a lower risk strategy, while our model points towards a higher risk strategy.
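The qualitative behavior of Kelly's criterion described above can be sketched with the textbook formula for a binary bet. We assume decimal odds here (the payout per unit staked, stake included), so the implicit bookmaker probability is the reciprocal of the odds; this parametrization, the function names and the example numbers are our own assumptions and do not reproduce the notation of the paper's footnote.

```python
def kelly_fraction(p, decimal_odds):
    """Kelly stake as a fraction of the bankroll.

    p:            our estimated probability that the bet wins.
    decimal_odds: payout per unit staked, stake included (assumed format).
    Returns 0.0 when our probability does not exceed the implicit one.
    """
    net_odds = decimal_odds - 1.0
    fraction = (p * decimal_odds - 1.0) / net_odds
    return max(fraction, 0.0)


def choose_bet(p_home, odds_home, odds_away):
    """Pick the side with the larger Kelly fraction, or no bet at all."""
    f_home = kelly_fraction(p_home, odds_home)
    f_away = kelly_fraction(1.0 - p_home, odds_away)
    if f_home == 0.0 and f_away == 0.0:
        return None, 0.0
    return ("home", f_home) if f_home >= f_away else ("away", f_away)


# Hypothetical game: we believe the home team wins with probability 0.6,
# while the bookmaker's odds of 2.0 imply a probability of 0.5.
side, stake = choose_bet(0.6, odds_home=2.0, odds_away=1.8)
print(side, stake)
```

With a bankroll of $1 per game, the returned fraction is directly the amount staked, and the stake shrinks to zero as our probability approaches the implicit probability of the betting house.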

Figure 3:

Boxplot representation of profit after bootstrap, broken down by betting house.

The contradiction between this monetary metric and the negative logarithmic loss can be easily explained, because in betting it typically pays off to bet in favor of the underdog (if it is undervalued), and our model tends to provide less extreme probabilities compared to the probabilities submitted by the winner of the Kaggle competition and the implicit probabilities of the betting houses. We end up betting in favor of the team with larger odds and we lose most of the bets, but when we win we recover from the losses. To illustrate this, we include Table 4, where we show the number of games in which we have won the bets that we have placed. For instance, for Pinnacle Sports we decide to bet on 60 games out of 63 (under our 10-model average predictions) and win 21 bets (about a third), while the winner of the Kaggle competition wins 29 bets out of 44 (about two thirds). The winner of the Kaggle competition tends to bet on the favorite, winning a small amount that compensates for the few losses. Additionally, we tend to bet more in each game: on average, we stake 14 cents per bet, while the average bet for the Kaggle winner is 9 cents (for Pinnacle Sports). This means that our probabilities lie farther from the betting houses’ probabilities than those of the winner of the Kaggle competition. This is not a bad thing for betting, since we need a model that is not only accurate, but also provides predictions that differ from the implicit probabilities of the betting houses. The betting houses do not necessarily need to predict the true probabilities of the event, but they need to predict what people think are the true probabilities (and are willing to bet on). A model that identifies weaker but undervalued teams has the potential to provide huge benefits.

Table 4:

Number of games in which we win, number of games in which we bet, and number of games for which we have available the bookmaker odds (#Wins/#Bets/#Total).

Finally, we show in Figure 4 the profit for each of the Kaggle participants after betting using Kelly’s criterion on the 63 games in the Tournament. The results are ordered according to the Kaggle leaderboard. The winner is represented by the first red dot and we are represented by the second red dot (the 39th dot overall). From this figure, we can see that the log-loss and the betting profits are related, but they are not a one-to-one mapping: 46 out of the first 50 participants have positive returns, and so do 23 out of the second 50 participants, 13 out of the third 50 participants, 13 out of the fourth 50 participants and only 7 of the last group of 54. This is easy to understand if we focus on participants with a positive return and a low negative log-loss score. These participants typically post over-confident predictions (close to 100% sure that a certain team will win). When wrong, these predictions only give limited betting losses (at most $1 in our comparison), but a nearly unbounded log-loss. We can see that some of these participants would have obtained big wins even though their predictions are over-confident. If this had been the error measured in Kaggle,9 we would have been ranked #17.

Figure 4:

Profit on Pinnacle Sports for all Kaggle participants (ordered according to their final score in the Kaggle competition). Red markers show the results for the Kaggle winner and for our 10-model average.
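
The asymmetry between the two metrics behind Figure 4 can be checked with a small numerical sketch. It assumes a single game in which the predicted team loses, a stake capped at $1 as in the comparison above, and hypothetical decimal odds of 1.2; the function names are illustrative.

```python
import math


def log_loss(p_win, outcome):
    """Negative log-likelihood of one game (outcome is 1 if the predicted team wins)."""
    p = p_win if outcome == 1 else 1.0 - p_win
    return -math.log(p)


def betting_profit(stake, outcome, decimal_odds):
    """Profit of a bet on the predicted team; the loss can never exceed the stake."""
    return stake * (decimal_odds - 1.0) if outcome == 1 else -stake


# The predicted team loses (outcome = 0) while the prediction gets more and more confident.
for p in (0.9, 0.99, 0.999999):
    print(f"p = {p}: log-loss = {log_loss(p, 0):.2f}, "
          f"profit = {betting_profit(1.0, 0, 1.2):.2f}")
```

As the wrong prediction approaches certainty, the log-loss keeps growing while the betting loss stays at the stake, so an over-confident participant can rank poorly in log-loss and still end with a positive return.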

5 Conclusions

In this paper, we have extended a simple soccer model to college basketball. The outcomes of each game are modeled as independent Poisson random variables whose means depend on the attack and defense coefficients of teams and conferences. Our conference-specific coefficients account for the overall behavior of each conference, while the per-team coefficients provide more specific information about each team. Our vector-valued coefficients can capture different strategies of both teams and conferences. We have derived a variational inference algorithm to learn the attack and defense coefficients, and have applied this algorithm to four March Madness Tournaments. We compare our predictions for the 2014 Tournament to the recent Kaggle competition results and six online betting houses. Simulations show that our model identifies weaker but undervalued teams, which results in a positive mean profit in all the considered betting houses. We also outperform the Kaggle competition winner in terms of mean profit.


Acknowledgments

We thank the Kaggle competition organizers for providing us with the individual submissions of all the participants. Francisco J. R. Ruiz is supported by an FPU fellowship from the Spanish Ministry of Education (AP2010-5333). This work is also partially supported by Ministerio de Economía of Spain (projects ‘COMONSENS’, id. CSD2008-00010, and ‘ALCIT’, id. TEC2012-38800-C03-01), by Comunidad de Madrid (project ‘CASI-CAM-CM’, id. S2013/ICE-2845), and by the European Union 7th Framework Programme through the Marie Curie Initial Training Network ‘Machine Learning for Personalized Medicine’ (MLPM2012, Grant No. 316861).

Appendix A

A Variational Update Equations

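In this section, we provide further details on the variational inference algorithm detailed in Section 3. Here, we denote by $\phi_m^{H1}$ ($\phi_m^{A1}$) the $K_1$-vector composed of the first $K_1$ elements of $\phi_m^{H}$ ($\phi_m^{A}$), and by $\phi_m^{H2}$ ($\phi_m^{A2}$) the $K_2$-vector composed of the remaining $K_2$ elements. We show below the update equations for all the variational parameters, which are needed for the coordinate ascent algorithm:

  1. For the home coefficient $\gamma$, the updates are given by

    $$\gamma^{\mathrm{shp}} = s_{\gamma} + \sum_{m=1}^{M} y_m^{H}, \tag{10}$$

    $$\gamma^{\mathrm{rte}} = r_{\gamma} + \sum_{m=1}^{M} \mathbb{E}\left[\alpha_{h(m)}^{\top}\beta_{a(m)} + \eta_{\ell(h(m))}^{\top}\rho_{\ell(a(m))}\right], \tag{11}$$

    where we denote by $\mathbb{E}[\cdot]$ the expectation with respect to the distribution $q$.

  2. For the team attack and defense parameters $\alpha_{t,k}$ and $\beta_{t,k}$, we obtain

    $$\alpha_{t,k}^{\mathrm{shp}} = s_{\alpha} + \sum_{m:h(m)=t} \phi_{m,k}^{H1}\, y_m^{H} + \sum_{m:a(m)=t} \phi_{m,k}^{A1}\, y_m^{A}, \tag{12}$$

    $$\alpha_{t,k}^{\mathrm{rte}} = r_{\alpha} + \sum_{m:h(m)=t} \mathbb{E}[\gamma]\,\mathbb{E}[\beta_{a(m),k}] + \sum_{m:a(m)=t} \mathbb{E}[\beta_{h(m),k}], \tag{13}$$

    $$\beta_{t,k}^{\mathrm{shp}} = s_{\beta} + \sum_{m:a(m)=t} \phi_{m,k}^{H1}\, y_m^{H} + \sum_{m:h(m)=t} \phi_{m,k}^{A1}\, y_m^{A}, \tag{14}$$

    $$\beta_{t,k}^{\mathrm{rte}} = r_{\beta} + \sum_{m:a(m)=t} \mathbb{E}[\gamma]\,\mathbb{E}[\alpha_{h(m),k}] + \sum_{m:h(m)=t} \mathbb{E}[\alpha_{a(m),k}]. \tag{15}$$

  3. For the conference attack and defense parameters $\eta_{l,k}$ and $\rho_{l,k}$, the updates are

    $$\eta_{l,k}^{\mathrm{shp}} = s_{\eta} + \sum_{m:\ell(h(m))=l} \phi_{m,k}^{H2}\, y_m^{H} + \sum_{m:\ell(a(m))=l} \phi_{m,k}^{A2}\, y_m^{A}, \tag{16}$$

    $$\eta_{l,k}^{\mathrm{rte}} = r_{\eta} + \sum_{m:\ell(h(m))=l} \mathbb{E}[\gamma]\,\mathbb{E}[\rho_{\ell(a(m)),k}] + \sum_{m:\ell(a(m))=l} \mathbb{E}[\rho_{\ell(h(m)),k}], \tag{17}$$

    $$\rho_{l,k}^{\mathrm{shp}} = s_{\rho} + \sum_{m:\ell(a(m))=l} \phi_{m,k}^{H2}\, y_m^{H} + \sum_{m:\ell(h(m))=l} \phi_{m,k}^{A2}\, y_m^{A}, \tag{18}$$

    $$\rho_{l,k}^{\mathrm{rte}} = r_{\rho} + \sum_{m:\ell(a(m))=l} \mathbb{E}[\gamma]\,\mathbb{E}[\eta_{\ell(h(m)),k}] + \sum_{m:\ell(h(m))=l} \mathbb{E}[\eta_{\ell(a(m)),k}]. \tag{19}$$

  4. For the multinomial probabilities of the auxiliary variables, we obtain

    $$\phi_{m,k}^{H1} \propto \exp\left\{\mathbb{E}[\log\gamma] + \mathbb{E}[\log\alpha_{h(m),k}] + \mathbb{E}[\log\beta_{a(m),k}]\right\}, \tag{20}$$

    $$\phi_{m,k}^{H2} \propto \exp\left\{\mathbb{E}[\log\gamma] + \mathbb{E}[\log\eta_{\ell(h(m)),k}] + \mathbb{E}[\log\rho_{\ell(a(m)),k}]\right\}, \tag{21}$$

    $$\phi_{m,k}^{A1} \propto \exp\left\{\mathbb{E}[\log\alpha_{a(m),k}] + \mathbb{E}[\log\beta_{h(m),k}]\right\}, \tag{22}$$

    $$\phi_{m,k}^{A2} \propto \exp\left\{\mathbb{E}[\log\eta_{\ell(a(m)),k}] + \mathbb{E}[\log\rho_{\ell(h(m)),k}]\right\}, \tag{23}$$

    where the proportionality constants ensure that $\phi_m^{H}$ and $\phi_m^{A}$ are probability vectors.

All expectations above can be written in closed form, since for a random variable $X \sim \mathrm{gamma}(s, r)$ we have $\mathbb{E}[X] = s/r$ and $\mathbb{E}[\log X] = \psi(s) - \log(r)$, where $\psi(\cdot)$ denotes the digamma function (Abramowitz and Stegun 1972).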
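
As an illustration of how these closed-form expectations enter the updates, the sketch below evaluates Eqs. (20) and (21) for a single game. It is a minimal sketch under stated assumptions: each variational gamma factor is stored as a hypothetical `(shape, rate)` pair, the digamma function is approximated with the standard recurrence and asymptotic series so that no external library is needed, and all numerical values are made up.

```python
import math


def digamma(x):
    """Approximate digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic expansion: log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def gamma_mean(shape, rate):
    """E[X] for X ~ gamma(shape, rate)."""
    return shape / rate


def gamma_mean_log(shape, rate):
    """E[log X] for X ~ gamma(shape, rate)."""
    return digamma(shape) - math.log(rate)


def phi_home(q_gamma, q_alpha_home, q_beta_away, q_eta_home, q_rho_away):
    """Probability vector for the home score of one game, following Eqs. (20)-(21).

    q_gamma is a (shape, rate) pair; the other arguments are lists of such pairs
    with K1 entries (team factors) or K2 entries (conference factors).
    """
    e_log_gamma = gamma_mean_log(*q_gamma)
    logits = [e_log_gamma + gamma_mean_log(*a) + gamma_mean_log(*b)
              for a, b in zip(q_alpha_home, q_beta_away)]
    logits += [e_log_gamma + gamma_mean_log(*e) + gamma_mean_log(*r)
               for e, r in zip(q_eta_home, q_rho_away)]
    top = max(logits)                               # subtract the max for stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]             # first K1 entries, then K2 entries


# Made-up variational parameters with K1 = 2 and K2 = 1.
phi = phi_home(q_gamma=(12.0, 10.0),
               q_alpha_home=[(3.0, 1.0), (5.0, 2.0)],
               q_beta_away=[(4.0, 1.5), (2.0, 1.0)],
               q_eta_home=[(6.0, 2.0)],
               q_rho_away=[(3.0, 1.0)])
print(phi, sum(phi))
```

Normalizing the two blocks jointly reflects the statement that $\phi_m^{H}$ as a whole is a probability vector; the away-team vector of Eqs. (22) and (23) would follow the same pattern without the $\gamma$ term.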

B Results for 2011–2014 Tournaments

We now provide some additional results including the 2011–2014 tournaments. In Figure 5, we plot the normalized histogram corresponding to the proportion of observed events for which the predicted probabilities fall between the values on the x-axis, across the four considered Tournaments, for several models. The legend indicates the corresponding values of K1 and K2. In the figure, we can see that, as the predicted probability increases, so does the proportion of observed events.

Figure 5:

Proportion of observed events for which the predicted probabilities fall between the values on the x-axis, across the four considered seasons.
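
The quantity plotted in Figure 5 can be sketched as follows. This is a minimal sketch, assuming equal-width probability bins (the number of bins, `n_bins`, is an assumption) and binary outcomes coded as 1 when the predicted event occurs; the function name and the toy data are hypothetical.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Proportion of observed events per bin of predicted probability.

    Returns a list of (bin_start, bin_end, proportion) tuples; the proportion
    is None for bins that contain no prediction.
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # p = 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += y
    return [(b / n_bins, (b + 1) / n_bins,
             hits[b] / counts[b] if counts[b] else None)
            for b in range(n_bins)]


# Toy predictions and outcomes (not data from the paper).
table = calibration_table([0.05, 0.15, 0.15, 0.95], [0, 0, 1, 1])
for start, end, proportion in table:
    if proportion is not None:
        print(f"[{start:.1f}, {end:.1f}): {proportion:.2f}")
```

For a well-calibrated model, the proportion in each bin should be close to the predicted probabilities in that bin, which is the increasing trend described above.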

Figure 6 shows a boxplot representation (after 500 bootstrap samples) of the negative logarithmic loss for each of the considered seasons. Here, we can see that the 2011 Tournament yielded more unexpected results than the 2014 Tournament.

Figure 6:

Boxplot representation of logarithmic loss after bootstrap, broken down by season.
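
A bootstrap of the logarithmic loss, as summarized in Figure 6, can be sketched as below. The sketch assumes that the games of one season are resampled with replacement and that the mean log-loss is recomputed on each resample; the function names, the clipping constant and the toy data are assumptions made for illustration.

```python
import math
import random


def mean_log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0) for extreme predictions
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


def bootstrap_log_loss(probs, outcomes, n_boot=500, seed=0):
    """Mean log-loss on n_boot resamples of the games, drawn with replacement."""
    rng = random.Random(seed)
    n = len(probs)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(mean_log_loss([probs[i] for i in idx],
                                     [outcomes[i] for i in idx]))
    return samples


# Toy season with four games (not data from the paper).
samples = bootstrap_log_loss([0.7, 0.6, 0.9, 0.2], [1, 0, 1, 0])
print(min(samples), sum(samples) / len(samples), max(samples))
```

The spread of the resulting samples is what a boxplot per season would display.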

Figure 7 shows the average profit we make in all the betting houses after adding together the profit (or loss) for each individual season. Figures 8–10 show the boxplot representation (after 500 bootstrap samples) for each individual season (the plot for the 2014 Tournament is included in the main text).

Figure 7:

Profit broken down by house, across the four considered seasons.

Figure 8:

Boxplot representation of profit after bootstrap for season 2010/2011, broken down by house.

Figure 9:

Boxplot representation of profit after bootstrap for season 2011/2012, broken down by house.

Figure 10:

Boxplot representation of profit after bootstrap for season 2012/2013, broken down by house.

C List of 2014 Tournament Games

We show in Table 5 the list corresponding to the 35 games in the 2014 March Madness Tournament not shown in Table 3. For each game, we show the actual outcome of the game, as well as the predicted mean values and the 90% credible intervals.

Table 5:

List of the first 35 games in the 2014 tournament.


References

  • Abramowitz, M. and I. A. Stegun. 1972. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover Publications.

  • Baio, G. and M. A. Blangiardo. 2010. “Bayesian Hierarchical Model for the Prediction of Football Results.” Journal of Applied Statistics 37:253–264.

  • Canny, J. 2004. “GaP: A Factor Model for Discrete Data.” In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, pp. 122–129.

  • Cemgil, A. T. 2009. “Bayesian Inference for Nonnegative Matrix Factorisation Models.” Computational Intelligence and Neuroscience 2009: 17.

  • Crowder, M., M. Dixon, A. Ledford, and M. Robinson. 2002. “Dynamic Modelling and Prediction of English Football League Matches for Betting.” Journal of the Royal Statistical Society: Series D (The Statistician) 51:157–168.

  • Dixon, M. J. and S. G. Coles. 1997. “Modelling Association Football Scores and Inefficiencies in the Football Betting Market.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 46:265–280.

  • Dunson, D. B. and A. H. Herring. 2005. “Bayesian Latent Variable Models for Mixed Discrete Outcomes.” Biostatistics 6:11–25.

  • Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7:1–26.

  • Ghahramani, Z. and M. J. Beal. 2000. “Propagation Algorithms for Variational Bayesian Learning.” In Advances in Neural Information Processing Systems 13, pp. 507–513.

  • Gopalan, P., J. M. Hofman, and D. M. Blei. 2013. “Scalable Recommendation with Poisson Factorization.” arXiv preprint arXiv:1311.1704.

  • Gopalan, P., F. J. R. Ruiz, R. Ranganath, and D. M. Blei. 2014. “Bayesian Nonparametric Poisson Factorization for Recommendation Systems.” Artificial Intelligence and Statistics (AISTATS) 33:275–283.

  • Heuer, A., C. Müller, and O. Rubner. 2010. “Soccer: Is Scoring Goals a Predictable Poissonian Process?” arXiv preprint arXiv:1002.0797.

  • Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley. 2013. “Stochastic Variational Inference.” Journal of Machine Learning Research 14:1303–1347.

  • Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning 37:183–233.

  • Kelly, J. L. 1956. “A New Interpretation of Information Rate.” IRE Transactions on Information Theory 2:185–189.

  • Maher, M. J. 1982. “Modelling Association Football Scores.” Statistica Neerlandica 36:109–118.

  • Miller, A., L. Bornn, R. Adams, and K. Goldsberry. 2014. “Factorized Point Process Intensities: A Spatial Analysis of Professional Basketball.” arXiv preprint arXiv:1401.0942.

  • Skellam, J. G. 1946. “The Frequency Distribution of the Difference between Two Poisson Variates Belonging to Different Populations.” Journal of the Royal Statistical Society 109(3):296.

  • Wilcoxon, F. 1945. “Individual Comparisons by Ranking Methods.” Biometrics Bulletin 1:80–83.

  • Zhou, M., L. Hannah, D. B. Dunson, and L. Carin. 2012. “Beta-Negative Binomial Process and Poisson Factor Analysis.” Journal of Machine Learning Research – Proceedings Track 22:1462–1471.

About the article

Corresponding author: Francisco J. R. Ruiz, University Carlos III in Madrid – Signal Theory and Communications Department. Avda. de la Universidad, 30. Lab 4.3.A03, Leganes, Madrid 28911, Spain.

Published Online: 2015-02-07

Published in Print: 2015-03-01

Citation Information: Journal of Quantitative Analysis in Sports, ISSN (Online) 1559-0410, ISSN (Print) 2194-6388, DOI: https://doi.org/10.1515/jqas-2014-0055.
