A better statistical approach to understanding who will win on tour

In the first installment of this series, I outlined a very basic understanding of tennis matches in mathematical terms and how this understanding could be used to build a predictive model. Using only data on points won on serve and return, I arrived at O’Malley probabilities that a given player wins a three-set or five-set match. The essential shortcoming of the O’Malley model is that it does not include any information about the player’s opponent in the match. In this piece, I’ll describe a model that yields a probability of victory in a match with two specified players.

Furthermore, this model will incorporate an adjustment for the type of court surface. This aspect of the match undeniably has a significant impact on the outcome. Some players simply have a style that performs best on one type of surface. Just look at Nadal’s record on clay, for example.

Recall that the intuition behind the O’Malley models was that tennis can be simplified into a comparison of players’ win percentages of points on serve and return. In fact, this simplification is quite robust (which might explain why tennis coverage seems to lack advanced statistical analysis, but that’s a topic for another article). Today’s model continues to use the same intuition. I happen to have the exact data for points won on serve and return from tennisinsight.com. However, this data can also be derived from the basic stats available on the ATP website. Barnett and Clarke, in their 2005 article from the *IMA Journal of Management Mathematics*, detail this process and more rigorously walk through the forthcoming model.

__The Model__

Here is the model expressed in words: Start from the overall percentage of points won on serve at that tournament (or on that surface). In doing so, you will have accounted for the type of court surface. Add the margin by which the server’s serving percentage exceeds the average across all players, which incorporates the server’s particular ability. Subtract the margin by which the opponent’s return percentage exceeds the average across all players, which accounts for the returner’s particular ability.

Here is the model expressed mathematically: The subscript t denotes the particular tournament or surface averages; fij = the combined percentage of points won on serve for player i against player j; gji = the combined percentage of points won on return for player j against player i:

fij = ft + (fi − fav) − (gj − gav)

gji = gt + (gj − gav) − (fi − fav)

Importantly, fij + gji = 1. Given this fact, the outcome of the match can reliably be predicted by looking only at fij and its counterpart, fji. For greater detail on this, have a look at Klaassen and Magnus’ 2003 article from the *European Journal of Operational Research.*
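The identity fij + gji = 1 follows from adding the two equations, since the player-specific terms cancel. The last step in the short derivation below assumes that ft + gt = 1, which holds whenever both averages are computed over the same set of points, because every point is won by either the server or the returner.

```latex
\begin{align*}
f_{ij} + g_{ji}
  &= \bigl[f_t + (f_i - f_{av}) - (g_j - g_{av})\bigr]
   + \bigl[g_t + (g_j - g_{av}) - (f_i - f_{av})\bigr] \\
  &= f_t + g_t \\
  &= 1
\end{align*}
```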

Let’s break fij down very clearly.

ft = average percentage of points won on serve for t (t is usually the tournament or the surface, and this is how the model adjusts for court type)

fi = percentage of points won on serve for player i

fav = average percentage of points won on serve for all players

gj = percentage of points won on return for player j

gav = average percentage of points won on return for all players
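As a rough illustration of how these pieces combine, here is a minimal Python sketch of the two formulas (Python rather than the Stata used later in this piece). The function names and every number in it are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from any dataset.

```python
def combined_serve_pct(f_t, f_i, f_av, g_j, g_av):
    """f_ij: expected share of service points player i wins against player j."""
    return f_t + (f_i - f_av) - (g_j - g_av)


def combined_return_pct(g_t, g_j, g_av, f_i, f_av):
    """g_ji: expected share of return points player j wins against player i."""
    return g_t + (g_j - g_av) - (f_i - f_av)


# Hypothetical inputs (assumptions for illustration only).
f_t, g_t = 0.64, 0.36    # tournament/surface averages on serve and return
f_av, g_av = 0.63, 0.37  # averages across all players
f_i = 0.68               # player i's points won on serve
g_j = 0.39               # player j's points won on return

f_ij = combined_serve_pct(f_t, f_i, f_av, g_j, g_av)
g_ji = combined_return_pct(g_t, g_j, g_av, f_i, f_av)

print(round(f_ij, 4))         # 0.67
print(round(g_ji, 4))         # 0.33
print(round(f_ij + g_ji, 4))  # 1.0
```

With these made-up inputs, player i serves five points better than average and player j returns two points better than average, so the surface baseline of 0.64 moves up by three points to 0.67.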

This model’s effectiveness comes from its simplicity. Essentially, all we see in fij is how well players in general are serving in a given tournament or on a given surface (ft), how much better than average player i is serving (fi − fav), and how much better than average player j is returning (gj − gav). When we look at the counterpart, fji, we see the same things, except i and j are flipped.

Klaassen and Magnus explain that a player’s chance of winning depends significantly on the difference fij − fji. The intuition for this is the same one articulated earlier. The graph below captures the relationship between fij − fji (on the x-axis) and the probability that player i wins the match (on the y-axis).

To make this concrete, let’s look at an example from a match played this year on tour. Having one of my colleagues select a random match from this year, I ended up with Cilic vs. Verdasco in the Round of 16 at Tokyo (played on October 5). Here’s how the model would have predicted that match along with the Stata code I used to run it (I’ve adjusted the data to fit what I would have had at that point):

### Cilic | Verdasco

gen FIJ = fthardcourtALLTIME + (.685 - OVAVGPercentWonOnServe) - (.379 - OVAVGPercentWonOnReturn) **[This line will get us the combined percentage of points won on serve for Cilic over Verdasco]**

display FIJ

.67416 **[Cilic is expected to win 67.416% of points on serve over Verdasco]**

### Verdasco | Cilic

gen FJI = fthardcourtALLTIME + (.638 - OVAVGPercentWonOnServe) - (.371 - OVAVGPercentWonOnReturn) **[This line will get us the combined percentage of points won on serve for Verdasco over Cilic]**

display FJI

.63516003 **[Verdasco is expected to win 63.516003% of points on serve over Cilic]**

gen probedge = FIJ - FJI

display probedge

.03899997 **[The expected difference in percentage of points won on serve for Cilic vs. Verdasco is about 4%, in favor of Cilic]**
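The arithmetic above can be cross-checked without the dataset. The three averages (fthardcourtALLTIME, OVAVGPercentWonOnServe, OVAVGPercentWonOnReturn) enter both FIJ and FJI as the same constant, so they cancel in the difference. The Python sketch below uses only the four player percentages and the printed FIJ from the Stata output; the variable names are my own, and the "offset" is backed out from that output rather than read from the data.

```python
# Serve and return percentages quoted in the Stata commands above.
cilic_serve, cilic_return = 0.685, 0.371
verdasco_serve, verdasco_return = 0.638, 0.379

# FIJ - FJI: the surface and tour averages cancel, leaving only player terms.
edge = (cilic_serve - verdasco_serve) - (verdasco_return - cilic_return)
print(round(edge, 5))  # 0.039

# Back out the shared constant (f_t - f_av + g_av) from the printed FIJ,
# then use it to reproduce the printed FJI.
offset = 0.67416 - (cilic_serve - verdasco_return)
fji = offset + (verdasco_serve - cilic_return)
print(round(offset, 5), round(fji, 5))  # 0.36816 0.63516
```

Both printed values agree with the Stata output to the precision shown, which suggests the small trailing digits in .63516003 and .03899997 are just floating-point storage noise.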

It may be tempting to interpret .03899997 as roughly a four percent edge in probability for Cilic over Verdasco, but this interpretation would be incorrect. Instead, this figure is plugged into the function that yields the curve in the graph above. This function yields a probability that Cilic will defeat Verdasco (or the other way around). In this case, we see about a 63% chance that Cilic beats Verdasco.
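For readers who want to experiment, here is one standard way to turn two serve percentages into a match probability. It is a sketch under a strong assumption, namely that every point on a player's serve is won independently with the same probability, and it is not necessarily the function behind the curve described above. All function names are hypothetical, and sets are assumed to end in a tiebreak at 6-6.

```python
from functools import lru_cache


def hold_prob(p):
    """P(server wins a game) when each point is won with probability p."""
    q = 1.0 - p
    deuce = p * p / (p * p + q * q)  # P(server wins from deuce)
    return p**4 * (1 + 4 * q + 10 * q * q) + 20 * p**3 * q**3 * deuce


def tiebreak_prob(pa, pb):
    """P(A wins a first-to-7 tiebreak); A serves the first point."""
    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            # From 6-6 each player serves once per pair of points.
            both_a = pa * (1 - pb)
            both_b = (1 - pa) * pb
            return both_a / (both_a + both_b)
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        n = a + b  # points played so far; serve order is A, B, B, A, A, ...
        a_serves = ((n + 1) // 2) % 2 == 0
        p_point = pa if a_serves else 1 - pb
        return p_point * win(a + 1, b) + (1 - p_point) * win(a, b + 1)
    return win(0, 0)


def set_prob(pa, pb):
    """P(A wins a tiebreak set); A serves the first game."""
    hold_a, hold_b = hold_prob(pa), hold_prob(pb)
    tb = tiebreak_prob(pa, pb)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == 6 and b == 6:
            return tb
        if a >= 6 and a - b >= 2:
            return 1.0
        if b >= 6 and b - a >= 2:
            return 0.0
        a_serves = (a + b) % 2 == 0
        p_game = hold_a if a_serves else 1 - hold_b
        return p_game * win(a + 1, b) + (1 - p_game) * win(a, b + 1)
    return win(0, 0)


def match_prob(pa, pb, best_of=3):
    """P(A wins the match), treating every set as an identical trial."""
    s = set_prob(pa, pb)
    if best_of == 3:
        return s**2 * (3 - 2 * s)
    if best_of == 5:
        return s**3 * (1 + 3 * (1 - s) + 6 * (1 - s) ** 2)
    raise ValueError("best_of must be 3 or 5")


# Cilic (FIJ) vs. Verdasco (FJI), using the values printed above.
print(round(match_prob(0.67416, 0.63516), 3))
```

I have deliberately left out an expected output: the number this prints comes from the independence assumption, so treat it as a point of comparison with the roughly 63% quoted above rather than a reproduction of it.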

The result of the three-set match in Tokyo? Cilic won, 4-6, 7-5, 7-5.

At this point, we have seen bare-bones predictive models (O’Malley) and more advanced ones (Klaassen and Magnus). In the next installment of this series, I will actually compare these different models in predicting the outcomes of several matches on tour.

Edited by Jazmyn Brown, David Kaptzan.

