This is the first study to compare the performance of proposed models for forecasting tennis match wins. Using a large set of recent professional tennis matches to test the predictive performance of published approaches, it was found that a predictive method based on Elo ratings was the closest competitor to bookmaker predictions, correctly predicting 70% of match outcomes and outperforming published alternatives that included regression-based models, point-based models, and a Bradley-Terry model. Among the regression models, the inclusion of player ranking as a predictor resulted in the best performance and additional predictors considered by previous authors did not improve performance. Among the point-based models, the approach that adjusts the serve win probability for player and opponent strength, according to how their average performance differs from the average of the field, had the best predictive performance. While the regression-based and point-based models had comparable accuracy, point-based models generally had better discriminatory ability, putting them on par with the Elo and bookmaker predictions on this dimension of performance. All models were less accurate at predicting the outcomes of matches of lower-ranked players compared to matches of the top players in the sport.

Although no model was able to beat the bookmakers, the Elo model developed by FiveThirtyEight was a close competitor and performed better than all other approaches in terms of accuracy and discrimination. The standard implementation of Elo is based on career-to-date wins and losses, while regression- and point-based models have typically been fit with 1 or 2 years of prior performance data. When only 1 year of prior performance data was used, the performance of the FiveThirtyEight model was comparable to that of regression models with player ranking, but it remained the best-performing approach for predicting outcomes of matches of the highest-ranked players. Thus, the career-to-date information contributed to its edge for a subset of matches but not all.

These findings provide further rationale for the wide use of Elo-based predictions in the media and add tennis to a growing list of sports for which Elo ratings have proven useful (Stefani 2011; Lasek, Szlávik, and Bhulai 2013). Still, the performance of the FiveThirtyEight model might seem surprising given that the information it uses is fairly basic. The Elo ratings in this model only consider a player’s past wins and losses. However, the dynamic nature of these ratings goes further than the other approaches considered in this paper in how they reassess player ability and adjust for opponent strength at the time of the match. Unlike player rankings, which are updated weekly and rely on an arbitrary point system, Elo ratings are updated at the end of each match using a probability-based formula that weighs more recent performance more heavily and credits players for wins against more difficult opponents. This suggests that accounting for recency of play and the quality of opponents is critical for predicting the outcome of matches at the elite level with greater accuracy.
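The Elo mechanics described above can be sketched as follows. This is a minimal illustration of the textbook Elo update, not the FiveThirtyEight implementation: the constant update factor `k=32` and the 400-point logistic scale are assumptions chosen for illustration, and the function names are hypothetical.

```python
def elo_expected(rating_a, rating_b):
    """Win probability of player A against player B under the logistic Elo curve.

    The 400-point scale is the conventional choice and is an assumption here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner_rating, loser_rating, k=32.0):
    """Return the (winner, loser) ratings after one match.

    k is a hypothetical constant update factor; published tennis
    implementations may let it vary with the number of matches played.
    """
    expected_win = elo_expected(winner_rating, loser_rating)
    # The less expected the win, the larger the rating transfer.
    delta = k * (1.0 - expected_win)
    return winner_rating + delta, loser_rating - delta


if __name__ == "__main__":
    # An upset (1400 beats 1600) moves ratings more than an expected win.
    print(elo_update(1400.0, 1600.0))
    print(elo_update(1600.0, 1400.0))
```

Because the rating is adjusted after every match and the adjustment depends on the opponent's rating at that moment, both recency and opponent quality enter the prediction without any additional predictors.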

A key finding of the validation study was the lack of performance improvement for the most predictor-rich regression models. Indeed, a logistic model with the player and opponent ranking differential as its only predictor performed as well overall as the Prize Probit and Probit Plus models that incorporated additional tournament and player demographic variables. In analyses not shown, the predictor used in the logistic model was fit with a probit form, which did not change the prediction performance, showing that the logistic performance was due to the predictive strength of differential ranking and not the choice of distributional form.

There are several reasons why the difference in player-opponent ranking drives the performance of the regression-based models. Rankings represent a rolling weighted sum of a player’s win-loss record in the previous 12 months, where the weights are a scaled point system derived by the tour that attempts to reflect the prestige of a match (Irons, Buckley, and Paulden 2014). Although non-probabilistic in nature, the ranking points are intended to have a high correlation with a player’s recent ability on the biggest stages of the calendar. There is also a potential “rich get richer effect” with rankings due to tournament seeding, as tournaments are designed to help the highest ranked players advance, strengthening the correlation between player differences in rank and match wins.

Despite the strength of player ranking in published regression models, the superior performance of the FiveThirtyEight model suggests that alternative measures of player strength might improve the performance of regression methods. In particular, model-based measures of player ability that account for career matches and adjust for opponent difficulty could be promising alternatives to official rankings.

As a class, point-based models generally underestimated the win probability of the higher-ranked player in a match, the one exception being the Opponent Adjusted model. Two main modeling decisions determine the performance of a point-based model: first, the choice of model for winning a point on serve; second, the choice of approach for predicting a match win from point wins. All of the point-based methods rely on IID assumptions for predicting match wins. The IID model, which assumes a constant probability of winning on serve during a match, is known to be incorrect but has been argued to be a good approximation (Klaassen and Magnus 2001). However, the IID model has not been comprehensively tested on more recent professional matches, which raises the possibility that relaxing its assumptions might reduce the bias found for point-based models.
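To illustrate what the IID assumption implies, the following sketch computes the probability that the server wins a standard game when every point is won independently with the same probability `p`. The closed form follows from enumerating the score paths of a game with deuce. The function name is hypothetical, and the sketch covers only a single service game, not the full match calculation used by the published models.

```python
def game_win_prob(p):
    """Probability that the server wins a game under the IID point model.

    p is the constant probability of winning a point on serve, 0 < p < 1.
    """
    q = 1.0 - p
    # Win to love, to 15, or to 30: four points won with 0, 1, or 2 lost
    # before the final point (1, 4, and 10 orderings respectively).
    win_before_deuce = p ** 4 * (1.0 + 4.0 * q + 10.0 * q ** 2)
    # Reach deuce (3-3 in points, 20 orderings), then win two points in a
    # row before the opponent does; the geometric series sums to
    # p^2 / (1 - 2pq).
    reach_deuce = 20.0 * p ** 3 * q ** 3
    win_from_deuce = p ** 2 / (1.0 - 2.0 * p * q)
    return win_before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    for p in (0.5, 0.6, 0.7):
        print(p, round(game_win_prob(p), 4))
```

A small edge on each point is amplified at the game level, and again at the set and match levels, which is why errors in the estimated serve probability can translate into noticeable bias in predicted match wins.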

While the point-based models showed, as a group, more evidence of bias, they also exhibited greater discriminatory ability. This might suggest a trade-off between bias and discrimination when point-level information is used. However, the observation that the Opponent Adjusted model had good calibration and discriminatory ability shows that this is not an inherent trade-off of point-based methods, and it should be possible to improve the calibration of this class of methods without sacrificing their discriminatory strengths.
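The distinction between bias and discrimination can be made concrete with a small sketch. The two summaries below are assumptions chosen for illustration and may differ from the exact measures used in the validation study. Calibration is taken as the ratio of the summed predicted probabilities to the observed number of wins, and discrimination as the difference in mean predicted probability between matches that were won and matches that were lost. The function names and the example data are hypothetical.

```python
def calibration_ratio(probs, outcomes):
    """Sum of predicted win probabilities divided by observed wins.

    Values below 1 indicate that wins are underestimated on average,
    values above 1 that they are overestimated (assumed definition).
    """
    return sum(probs) / sum(outcomes)


def discrimination(probs, outcomes):
    """Mean prediction for won matches minus mean prediction for lost ones.

    Larger values mean the model separates wins from losses more clearly
    (assumed definition).
    """
    won = [p for p, y in zip(probs, outcomes) if y == 1]
    lost = [p for p, y in zip(probs, outcomes) if y == 0]
    return sum(won) / len(won) - sum(lost) / len(lost)


if __name__ == "__main__":
    # Hypothetical predictions for the higher-ranked player in four matches.
    probs = [0.8, 0.6, 0.7, 0.4]
    outcomes = [1, 1, 0, 0]  # 1 = higher-ranked player won
    print(calibration_ratio(probs, outcomes))
    print(discrimination(probs, outcomes))
```

Under these definitions a model can be biased overall while still ranking wins above losses, which is the pattern described for most of the point-based models.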

While the Bradley-Terry model had prediction accuracy that was similar to the point-based models, it showed the largest bias and the poorest discriminatory ability of all methods. Further, this bias was not remedied with the inclusion of an additional year of performance data. The Bradley-Terry model is the only approach considered here that utilizes game wins as the primary driver of player ability. Owing to the hierarchical nature of tennis, it is not necessary to win every point, game, or set to win a match. The superior performance of the Elo-based model over the Bradley-Terry model suggests that the focus on game-level measures of performance in place of overall match performance is a less reliable measure of player ability.

Strengths of the present study include the use of an independent validation dataset with many matches on all surfaces and stages of the ATP tour above the Challenger level. However, because the study focused on only 1 year of data, it is unclear whether these findings can be generalized to past or future generations of players. Another limitation in the generalizability of the findings is that the paper did not consider performance for the Women’s Tennis Association. Further, the present paper focused on pre-match predictive performance and did not investigate the advantages of within-match updating, which is a potentially unique strength of point-based models.

This work highlights a number of directions for further research. Proposed improvements of the Elo rating system have been developed but have yet to be applied in tennis (Glickman 1999; Herbrich, Minka, and Graepel 2006). All of the evaluated prediction methods were less accurate at predicting match outcomes for lower-ranked players compared to matches of the best players in the sport. This property could hinder the practical utility of current prediction methods and the extension of these methods to the junior game. Further work is needed to identify stronger predictors of the performance of lower-ranked players. In this regard, it was notable that no published prediction method included information about player mental skills or shot-level characteristics, though these are both thought to be important determinants of match outcomes (Féry and Crognier 2001; Jones 2002). It is an open question whether either of these areas of performance could improve the predictive performance or generalizability of current methods.

The recent media scandal on match-fixing in tennis calls attention to the reality that the performance of prediction methods in the sport is not simply an academic concern. For modelers to ensure that coaches and tennis officials are using the most appropriate available tools when evaluating tennis outcomes, rigorous validation should be a routine part of the development of tennis prediction methods. At present, it can be concluded that some published prediction models are more useful than others and all models have limited utility outside of the highest levels of the sport. The variation in model performance demonstrated in this study emphasizes the importance of comparative validation and the need for continued research to improve forecasting outcomes in tennis.
