In a typical season of a soccer league, each team participating in the league plays every other team twice, earning three points for each win, one point for each draw, and no points for a loss. At the end of the season, the accumulated point totals are used to determine the league table: a list of teams and their point totals ordered from most points to least.
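As a rough illustration of the scoring rule, here is a minimal sketch that tallies a table from a list of results. The `(home, away, home_goals, away_goals)` tuple format, the function name `league_table`, and the three-team results are all made up for the example; tie-breaking rules such as goal difference are ignored.

```python
from collections import defaultdict


def league_table(results):
    """Return (team, points) pairs, highest point total first.

    `results` is a hypothetical list of
    (home, away, home_goals, away_goals) tuples.
    """
    points = defaultdict(int)
    for home, away, home_goals, away_goals in results:
        # Touch both teams so that a team with no points still appears.
        points[home] += 0
        points[away] += 0
        if home_goals > away_goals:
            points[home] += 3
        elif home_goals < away_goals:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Made-up results, purely for illustration.
results = [
    ("A", "B", 2, 0),  # A wins
    ("B", "C", 1, 1),  # draw
    ("C", "A", 0, 3),  # A wins
]
table = league_table(results)
print(table)
```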
By normalizing the point totals for a season (dividing each team's total by the sum of all teams' totals), we can treat the final table as a probability distribution, the shape of which can be used to understand the final structure of the season. As an example, I’ve plotted the final distributions for the top five European soccer leagues after the 2016-2017 season (note that all the code and data involved in this post can be found on my GitHub account). Take a minute to see if you notice any qualitative differences between these seasons.
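In code, the normalization step is a single division. The sketch below assumes the table is held as a `{team: points}` dictionary; the function name and the totals are hypothetical.

```python
def normalize(points):
    """Turn a {team: points} mapping into a probability distribution."""
    total = sum(points.values())
    if total == 0:
        raise ValueError("no points have been awarded yet")
    return {team: p / total for team, p in points.items()}


# Made-up totals, purely for illustration.
totals = {"A": 6, "B": 3, "C": 1}
dist = normalize(totals)
print(dist)
```

Each value is that team's share of all points awarded, so the values sum to one.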
You can have all sorts of fun with just these distributions (stay tuned for a followup post), but what I’d like to talk about today is the manner in which these distributions are formed over the course of a season. To be precise, I want to think of the league season overall as an iterative computation of the final distribution, in which each game played updates a running point distribution that converges to the final distribution on the last day of the season. The animation below (made using matplotlib’s animation package) shows the end of this process for the 2000-2001 season of Germany’s top soccer league. The running distributions are represented by the blue dots approaching the final distribution given by the underlying bar plot. (All distributions are sorted by rank to make the visuals clearer, but the computations below compare distributions team by team, not by rank.)
As with any computational or convergent process, we might be interested in characterizing how quickly the solution is computed or how quickly the limit is well-approximated. To understand this, we need some way of measuring the distance between the running point distribution after a given number of games and the final point distribution at the end of the season. Enter the Jensen-Shannon divergence, a symmetric measure of distance between probability distributions built on relative entropy (the Kullback-Leibler divergence).
One reason for using Jensen-Shannon in this case is that early in a season there are often one or more teams with zero points. Other measures, such as the Kullback-Leibler divergence, return infinite values when one of the distributions assigns zero probability to an event and the other does not. In many cases this is a feature, not a bug, but since I expect more variation in the point distribution early in the season, when just a few new points can have a significant impact, I want to be able to measure the change in distance meaningfully even when some teams have no points.
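To make the zero-point issue concrete, here is a small sketch of both measures using the standard definitions. Base-2 logarithms are an assumption on my part (any base works, it only rescales the values), and the two three-team distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; infinite if q is zero somewhere p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL from p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up distributions: early on, the third team has no points yet.
early = [0.5, 0.5, 0.0]
final = [0.4, 0.35, 0.25]

kl_value = kl_divergence(final, early)
js_value = js_divergence(early, final)
print(kl_value)  # infinite, because early[2] == 0 while final[2] > 0
print(js_value)  # finite
```

The midpoint distribution is nonzero wherever either input is nonzero, which is why the Jensen-Shannon value stays finite (and, in base 2, never exceeds 1).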
The data I use consists of 110 seasons stretching back to 1995/96 from the top five European soccer leagues (England, Spain, Germany, Italy, and France), obtained from football-data.co.uk. For each season, I computed the Jensen-Shannon divergence between the league table updated after each game and the final league table. The resulting curves, for all 110 seasons, are plotted below.
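The per-season computation can be sketched roughly as follows. This is an illustration under assumptions rather than the code behind the plots: the result tuples, function names, and the tiny three-team season are hypothetical, and base-2 logarithms are assumed. Teams are kept in a fixed order so that distributions are compared team by team.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def season_curve(results):
    """JSD between the running table after each game and the final table."""
    teams = sorted({t for home, away, _, _ in results for t in (home, away)})
    points = dict.fromkeys(teams, 0)
    snapshots = []
    for home, away, home_goals, away_goals in results:
        if home_goals > away_goals:
            points[home] += 3
        elif home_goals < away_goals:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
        total = sum(points.values())  # always positive after the first game
        snapshots.append([points[t] / total for t in teams])
    final = snapshots[-1]
    return [jsd(snapshot, final) for snapshot in snapshots]


# A made-up double round-robin between three teams.
results = [
    ("A", "B", 2, 0), ("C", "A", 1, 1), ("B", "C", 0, 1),
    ("B", "A", 2, 2), ("A", "C", 3, 1), ("C", "B", 0, 0),
]
curve = season_curve(results)
print(curve)
```

The last entry is zero by construction, since the final snapshot is compared with itself.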
As you can see, the Jensen-Shannon divergence appears to decrease exponentially. Moreover, the only real variance in the curves seems to happen in the first hundred games (shown zoomed in below).
To get a better sense of this exponential decrease, I also plotted the average of these curves and, assuming an exponential model , an optimal least-squares fit of (standard deviation errors for these parameters being and respectively).
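For anyone who wants to try a fit like this, one simple approach is ordinary least squares on the logarithm of the curve. To be clear about the assumptions: the two-parameter form y = a·exp(-b·x) and the function name are my choices for this sketch, the data are synthetic, and a log-linear fit is a stand-in that weights small values more heavily than a direct nonlinear least-squares fit would. It also gives no parameter error estimates.

```python
import math


def fit_exponential(xs, ys):
    """Fit y ~ a * exp(-b * x) by least squares on log(y); returns (a, b)."""
    # Skip non-positive values, whose logarithm is undefined
    # (for example, a divergence of exactly zero after the final game).
    pairs = [(x, math.log(y)) for x, y in zip(xs, ys) if y > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_log_y = sum(ly for _, ly in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in pairs)
    slope = sxy / sxx
    intercept = mean_log_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free curve with known parameters (not real league data).
xs = list(range(1, 381))
ys = [0.5 * math.exp(-0.03 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(a, b)  # should recover roughly 0.5 and 0.03
```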
Given this, one might reasonably expect that the distance from the point distribution after 100 games to the final distribution would be approximately . (By comparison, the JSD between the — sorted! — final distributions for Spain and Italy plotted above is , three orders of magnitude higher.) League seasons are typically 380 games long, so what this means is that by about a quarter of the way through a season, the running distribution and the final distribution are not that far away from each other.
I want to emphasize that this result is not about team rankings. Very small changes to point totals can have significant effects on rankings without changing the distribution much at all. Indeed, a one-point change to a team’s point total, while potentially insignificant probabilistically, could very well affect who wins the league. So the takeaway here is not that the league ranking doesn’t change after a hundred games (thank goodness!), but rather that the distribution of points — in a sense, the overall shape or structure of the league — is not significantly changing after that point.
Let me know if you have any questions or comments!
- Can you predict the league from the final point distribution of a particular season?
- How quickly do running league rankings (not distributions) converge to the final ranking for a season?