Can we predict outcomes of matches with math? (Probably)

When it comes to statistically representing sports, tennis is remarkably well-suited. Just consider, for a moment, the complexities in other sports that render them statistically unwieldy: football has over twenty people on the field at a given time and is played with a prolate spheroid (which means bounces on loose balls are close to random); basketball has ten players on the court, all of whom affect the game simultaneously, even when playing off the ball; baseball can be modeled pretty well up to the point of contact, but becomes complicated once the ball is in play. Tennis, on the other hand, remains a mathematically and statistically straightforward game, whether played by Federer or a toddler.

In this first post of an ongoing series revolving around statistical modeling of ATP singles matches, I will lay out the mathematical understanding of tennis that makes it easy to model and introduce a basic version of a tennis model. For anyone interested in more granular discussion, a good place to start is O’Malley, “Probability Formulas and Statistical Analysis in Tennis” (2008).

First, a brief and simplified overview of the current statistical scholarship on modeling tennis matches. Current models describe a match with a hierarchical Markov model, because tennis has a hierarchical scoring system: points make up games, games make up sets, and sets make up the match. The picture below illustrates this point.

We can create a statistical model rather easily if we assume that points within a match are independent and identically distributed (IID). IID means that the outcome of one point does not affect another and that the probability distribution for each point is the same. Intelligent people can certainly question this assumption, but we’ll leave it for now, since it is critical to the statistical model that follows. Under it, we can derive a Markov chain for any match from the probabilities of a player winning a point on serve and on return. Much work has been published on inferring those probabilities from past data. The models presented in the literature have been reasonably successful, correctly predicting the winner in between 68% and 70% of matches.
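To make the Markov-chain idea concrete, here is a minimal Python sketch (not part of the original analysis) of the lowest level of the hierarchy: the probability that the server wins a single game, computed by walking the score states under the IID assumption. The function name `game_win_prob` and the example value 0.65 are illustrative assumptions.

```python
from functools import lru_cache


def game_win_prob(p_point: float) -> float:
    """Probability that the server wins a game, assuming IID points.

    p_point is the (assumed constant) probability that the server wins
    any single point. States are (server_points, returner_points).
    """
    lose = 1.0 - p_point
    # From deuce the server must win two points in a row before the
    # returner does: d = p^2 + 2p(1-p)d, so d = p^2 / (1 - 2p(1-p)).
    deuce = p_point ** 2 / (1.0 - 2.0 * p_point * lose)

    @lru_cache(maxsize=None)
    def win_from(a: int, b: int) -> float:
        if a == 3 and b == 3:
            return deuce          # 40-40: use the closed-form deuce value
        if a == 4:
            return 1.0            # server reached four points first
        if b == 4:
            return 0.0            # returner reached four points first
        # Otherwise play one more point and move to the next state.
        return p_point * win_from(a + 1, b) + lose * win_from(a, b + 1)

    return win_from(0, 0)


# Illustrative value only: a server who wins 65% of points on serve.
print(round(game_win_prob(0.65), 4))
```

The same pattern (states, transition probabilities, absorbing win/loss states) repeats at the set and match levels, which is what makes the model hierarchical.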

Using the IID assumption, we can build out our model so long as we know the probability that a given player wins a point on serve and on return. How do we get these probabilities (which we’ll call *p* for serve and *q* for return)? Historical data. Our estimates of *p* and *q* are the proportions of points won on serve and on return in the past (over whatever period of time the modeler sees fit).
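As a small illustration of that estimate, the sketch below turns point counts into *p* and *q*. The function name `point_win_rates` and the counts are made up for the example and are not drawn from any real player's record.

```python
def point_win_rates(serve_won: int, serve_played: int,
                    return_won: int, return_played: int) -> tuple[float, float]:
    """Return (p, q): the share of points won on serve and on return."""
    if serve_played <= 0 or return_played <= 0:
        raise ValueError("need at least one service point and one return point")
    return serve_won / serve_played, return_won / return_played


# Hypothetical counts over some chosen window (e.g. the last 12 months).
p, q = point_win_rates(serve_won=680, serve_played=1000,
                       return_won=420, return_played=1000)
print(p, q)
```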

Once we have *p* and *q*, we’ll need formulas to flesh out the probabilities that a player wins a series of points (i.e. a game), a series of games (i.e. a set), and a series of sets (i.e. a match). Notice that all we have here is *p* extending through a chain to yield another probability for a higher level (from point to game to set to match). This means we’ll have to consider the various combinations for how a player could win at each level. For example, a player can win a game by winning four points without losing any, winning four points and losing one, winning four points and losing two, or winning from deuce. The equations that capture this are below (O’Malley, 2008).
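For the game-level case, the four ways of winning listed above can be written out explicitly. The following is a reconstruction from those cases, assuming IID points and writing *p* for the server's probability of winning a point; the label G(p) is introduced here for convenience, and the result should agree with the game formula in O’Malley (2008).

$$
G(p) = \underbrace{p^4}_{\text{4 points to 0}} + \underbrace{4p^4(1-p)}_{\text{4 points to 1}} + \underbrace{10p^4(1-p)^2}_{\text{4 points to 2}} + \underbrace{\frac{20\,p^5(1-p)^3}{1-2p(1-p)}}_{\text{via deuce}}
$$

The coefficients 4, 10, and 20 count the orderings in which the points can fall (the last point of a game always goes to the winner, and there are 20 ways to reach 3 points all). The factor $p^2/(1-2p(1-p))$ is the probability of winning from deuce. As a sanity check, $G(0.5) = 0.5$.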

After carrying this logic through the various twists and turns of a tennis match, you are left with two rather simple formulas. One gives the probability that a player wins a 3-set match and the other a 5-set match, given *p* and *q*. S(p,q) denotes the probability of a player winning a set.
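Written out, with $S$ as shorthand for S(p,q), the two match formulas take the following form. This is a transcription of the expressions used in the Stata code in this post; the labels $M_3$ and $M_5$ are introduced here for convenience.

$$
M_3 = S^2\,\bigl[\,1 + 2(1-S)\,\bigr]
$$

$$
M_5 = S^3\,\bigl[\,1 + 3(1-S) + 6(1-S)^2\,\bigr]
$$

The bracketed terms count the ways to win: a 3-set match is won two sets to none or two sets to one (2 orderings), and a 5-set match is won in three sets, four sets (3 orderings), or five sets (6 orderings).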

Now that the mathematics has been established, we have to figure out how to operationalize these formulas. I’ll use historical data from tennisinsight.com and the statistical software Stata to construct the model from the formulas above. I’ve included my Stata code below for the O’Malley formulas.

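// serviceptsw and returnptsw hold p and q: the share of points won on serve and on return

gen OmalleySPQ = (serviceptsw * returnptsw) / (1 - (serviceptsw * (1 - returnptsw) + returnptsw * (1 - serviceptsw)))

// Probability of winning a 3-set match: two sets to none, or two sets to one

gen Omalley3set = OmalleySPQ^2 * (1 + 2 * (1 - OmalleySPQ))

// Probability of winning a 5-set match: in three, four, or five sets

gen Omalley5set = OmalleySPQ^3 * (1 + 3 * (1 - OmalleySPQ) + 6 * (1 - OmalleySPQ)^2)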
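For readers without Stata, the same three expressions can be sketched in Python. This is a direct port of the `gen` lines above rather than part of the original analysis; the function names and the example values of *p* and *q* are illustrative assumptions.

```python
def set_term(p: float, q: float) -> float:
    """Mirror of the OmalleySPQ line: p*q / (1 - (p*(1-q) + q*(1-p)))."""
    denom = 1.0 - (p * (1.0 - q) + q * (1.0 - p))
    if denom == 0.0:
        # Happens only in the degenerate case where one of p, q is 0 and the other is 1.
        raise ValueError("set term is undefined for these p and q")
    return (p * q) / denom


def match_3set(s: float) -> float:
    """Mirror of Omalley3set: win two sets before losing two."""
    return s ** 2 * (1 + 2 * (1 - s))


def match_5set(s: float) -> float:
    """Mirror of Omalley5set: win three sets before losing three."""
    return s ** 3 * (1 + 3 * (1 - s) + 6 * (1 - s) ** 2)


# Illustrative inputs only, not taken from the tennisinsight data.
p, q = 0.68, 0.42
s = set_term(p, q)
print(round(match_3set(s), 4), round(match_5set(s), 4))
```

Whenever the set-level value is above 0.5, the 5-set probability comes out higher than the 3-set one, because a longer format favors the stronger player. The same pattern appears in the table of results.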

Having run this model using tennisinsight data current as of 9/13/16 for the last 12 months, I got the following ranking of top players. Notice the discrepancies between the O’Malley probabilities and the ATP rankings.

| Player | Probability of Winning 3-Set Match | Probability of Winning 5-Set Match | ATP Rank |
| --- | --- | --- | --- |
| Novak Djokovic | .7066 | .7515 | 1 |
| Roger Federer | .6492 | .6840 | 7 |
| Andy Murray | .6260 | .6561 | 2 |
| Milos Raonic | .6085 | .6346 | 6 |
| Rafael Nadal | .6081 | .6342 | 4 |
| Gael Monfils | .5957 | .6191 | 8 |
| Kei Nishikori | .5891 | .6109 | 5 |
| Stan Wawrinka | .5880 | .6094 | 3 |
| Marin Cilic | .5850 | .6053 | 11 |

O’Malley models do not incorporate the skill of the specific opponent a player faces in a match. These models provide a probability of victory based *only* on the individual player’s ability to win on serve and return against the previous opponents included in the historical data sample. Undoubtedly, this is a major shortcoming of the model, and it may help explain the discrepancies with the ATP rankings. In the next installment of this series, we’ll explore a model that implements a head-to-head comparison and run through some example match-ups.

Edited by Joe Sparacio, Emily Greitzer, Vincent Choy.
