Fitting Generalised Linear Models to 2020-21 EPL Football Data

Posted on Previous Blogging Site (10/05/2021)

The English Premier League (EPL) is the top division of the English football league system and is widely regarded as one of the best football leagues in the world. The EPL consists of twenty teams, and over the course of a season each team plays every other team twice, giving 380 matches in total. With the growing popularity of internet and mobile sports betting, anyone aged 18 or over in the UK can wager on the outcome of every one of those matches across a wide variety of betting markets.

The statistical software RStudio will be used to fit a series of generalised linear models to this season’s football data so far (16/08/2020-20/04/2021), and a couple of betting markets will be evaluated. A model comparison will be carried out to determine which model is best statistically and whether that model also gives odds estimates closer to the betting markets. Lastly, we will identify any profitable betting opportunities suggested by the model and the outcome of these bets.

The following matches will be used for prediction. The odds beside them were taken from www.oddschecker.com on Tuesday 13th April 2021.

Figure 1 – The seven EPL matches which the statistical models will predict for

The above odds are all in decimal format, which is my personal display preference because I find it more efficient for calculating outcome probabilities. Please note the cells coloured light blue or red indicate a movement in odds and are of no importance to this project. If you are wondering how odds can be interpreted, here is a quick example. My favourite team Arsenal are hosting Fulham on Sunday 18th April 2021 with decimal odds of 1.8 to win. This means that for a £100 wager, the net profit is £80 should Arsenal win. But what does this mean in probability terms? The implied probability is 1/1.8 = 55.6%. If you wanted to wager on Arsenal, the question you must ask yourself is whether Arsenal would defeat Fulham more than 55.6% of the time, say at least 6 times out of ten, on this given Sunday. If the answer is a slam dunk yes, then placing a wager on this match will be profitable long term. Now comes the big question: how can you possibly know a team’s true chance of winning?
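The odds arithmetic above can be sketched in a few lines of Python. This is only an illustration (the analysis in this post was done in R), and the function names are my own; it ignores the bookmaker's margin.

```python
def implied_probability(decimal_odds: float) -> float:
    """Implied win probability for decimal odds (bookmaker margin ignored)."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


def net_profit(stake: float, decimal_odds: float) -> float:
    """Profit on a winning bet, excluding the returned stake."""
    return stake * (decimal_odds - 1.0)


arsenal_odds = 1.8  # Arsenal to beat Fulham, from the example above
print(f"Implied probability: {implied_probability(arsenal_odds):.1%}")
print(f"Net profit on a £100 stake: £{net_profit(100, arsenal_odds):.2f}")
```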

Although bettors tend to overestimate their ability to predict results, sports betting professionals and bookmakers use sophisticated data models to derive a team’s winning percentage. Regardless, every model starts from a building block: a probability distribution that describes the data set well.

The Data Set

The website www.football-data.co.uk is the source that will be used to acquire football data. In this data set, there are 106 columns of data including metrics such as number of goals scored, shots on goal and even the referee of the match.

Before attempting to construct a model, we must decide what data can provide a reasonable estimate of a team’s success. Browsing the final English Premier League standings for the past six seasons, in four out of the six the team with the worst goal difference finished rock bottom of the league table. Likewise, all six Premier League champions ranked in the top two for goal difference. There is a general pattern here: teams with a better goal difference tend to finish higher up the league table.

If we want to model goals scored and conceded per match, we are using count data. Plotting histograms is one way to recognise a reasonable distribution, and I found the Poisson distribution to be an appropriate choice. It has been overlaid on the histograms below.

Figure 2 – Histogram of EPL Home Goals with the Poisson Distribution overlaid
Figure 3 – Histogram of EPL Away Goals with the Poisson Distribution overlaid

Both histograms are positively skewed with a Poisson-like shape. The Poisson distribution does not fit perfectly, especially for 0 and 1 goals scored, where the observed frequencies are higher than the Poisson estimate. This might well affect the results of the model, which we will analyse later in the project, as an excess of 0’s in count data can distort a Poisson fit.

The Poisson Distribution

Now that we have seen the goals data roughly following a Poisson distribution, it is worth explaining what exactly a Poisson distribution is. It is named after the French mathematician Siméon Denis Poisson and is a discrete probability distribution that gives the probability of a given number of events occurring within a fixed time period, when events occur at a constant mean rate. So with regard to football, the Poisson distribution gives the probability of each possible number of goals in a match lasting 90 minutes.

The probability mass function of a Poisson distribution with mean rate λ is defined as follows

$$P(K = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$

For the Poisson distribution, the mean is equal to the variance.

I will now take a look at the mean and variance of the goals data set to check whether this holds true.

This is a violation of the Poisson assumption, as the mean and variance are not equal for either home goals or away goals. In the model analysis later, this might be an indication that the Poisson distribution is not a spot-on fit for football goals data.
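A mean-versus-variance check of this kind can be sketched as follows. The goal counts in the list are made up purely for illustration and are not the EPL data; the function name is my own.

```python
from statistics import mean, variance


def variance_to_mean_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)


# Made-up goal counts for illustration only, not the real EPL data.
sample_goals = [0, 0, 1, 1, 1, 2, 2, 3, 0, 4, 1, 0, 2, 1, 5]

print(f"mean     = {mean(sample_goals):.3f}")
print(f"variance = {variance(sample_goals):.3f}")
print(f"ratio    = {variance_to_mean_ratio(sample_goals):.3f}")
```

A ratio well above 1 hints at overdispersion, and a ratio well below 1 at underdispersion.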

An example calculation for the percentage chance of the home team to score only one goal using the Poisson distribution by plugging in the average home goals value of 1.334416 is …
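As a minimal sketch, the same calculation can be done in Python using the Poisson probability mass function with the average home goals value quoted above (the function name is my own):

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson distribution with mean rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


mean_home_goals = 1.334416  # average home goals quoted in the text
p_one_goal = poisson_pmf(1, mean_home_goals)
print(f"P(home team scores exactly 1 goal) = {p_one_goal:.2%}")
```

With this mean the probability comes out at roughly 35%.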

Assumptions of the Poisson Distribution

It is important for any statistical model that the assumptions are stated and met; otherwise the validity of the results is affected and wrong conclusions may be drawn from the data. The assumptions of the Poisson distribution are as follows

  1. K is a discrete random variable which takes the values 0, 1, 2, … and so on
  2. Events occur independently, so one event occurring does not affect the probability of the next event.
  3. The average rate at which events occur is constant over time. So, a team is expected to have the same probability of scoring in the first half as in the second.
  4. Two events cannot occur at exactly the same instant

In relation to football, assumptions one and four are met. Goal data is discrete: teams cannot score half a goal. As for assumption four, a goal is an instantaneous event, and two goals cannot be scored at exactly the same moment.

For assumption 2, many football fans will know that their side tends to defend a one-goal lead, especially very late in the match, and when a team goes defensive this affects its own chances of scoring a second. Also, teams play different strategies: some opt for a defensive approach away from home and, after going a goal down, switch completely to all-out attack. This assumption is not met for football.

Assumption 3 fails for similar reasons. Due to different strategies and game flow, a team tends to have a higher probability of scoring in the second half than in the first, especially if it is 0-0 at the half-time break. We definitely cannot say a team has the same scoring rate in the 15th minute as in the 60th minute, for example.

With two assumptions unlikely to be met, the Poisson distribution might well not be the perfect model choice, but we shall explore it anyway.

Generalised Linear Models

Generalised linear models are an extension of the ordinary linear regression model to response distributions which are non-normal. In an ordinary linear regression model, the response variable follows a normal distribution and its mean is a linear function of the explanatory variables.

The mathematical form of an ordinary linear model is

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$$

In our case, the response variable (football goals) is Poisson distributed, so a Poisson generalised linear model is needed. A GLM accommodates the Poisson distribution by using a link function, here the log, to relate the mean of the response to the linear predictor.

The pdf/pmf of a GLM has the form

So the pdf/pmf of a GLM can be written in the form

For the Poisson GLM
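For reference, here is a sketch of the standard exponential-family form that GLMs are built on, followed by the Poisson case. The symbols θ, φ, a, b and c are the usual textbook ones and are an assumption on my part; they may differ from the notation used elsewhere in this post.

```latex
% General exponential-family density/mass function
f(y;\theta,\phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}

% Poisson with mean \lambda written in that form
f(y;\lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}
             = \exp\left\{ y\log\lambda - \lambda - \log y! \right\}

% so that
\theta = \log\lambda, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\log y!

% Canonical (log) link relating the mean to the linear predictor
\log\lambda_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}
```

The last line is why the coefficients of a Poisson GLM are read on the log scale and exponentiated to get expected goals.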

The GLM estimates the coefficients of the explanatory variables by maximum likelihood estimation, which finds the coefficient values that make the observed data most probable.

Model 1 – Poisson Regression

We are now going to fit our first generalised linear model using the Poisson distribution.

Interpreting the Poisson Regression Output

So Leeds’ expected goals at home to Liverpool are
Log(Goals) = 0.1293802 - 0.0001433 + (team coefficient for Leeds) + (opponent coefficient for Liverpool)
Log(Goals) = 0.3724746
Goals = 1.451322

Alternatively, the expected goals for Liverpool away to Leeds are
Log(Goals) = 0.1293802 + 0.2379885 + 0.3902404
Log(Goals) = 0.7576091
Goals = 2.13317
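The calculation above can be sketched as a small helper. This assumes the model has the form intercept + home + team + opponent on the log scale; the function and argument names are my own, and only the coefficient values quoted above are used.

```python
from math import exp


def expected_goals(intercept, team_coef, opponent_coef, home_coef=0.0, at_home=False):
    """Expected goals from a log-link Poisson model: exp(linear predictor)."""
    linear_predictor = intercept + team_coef + opponent_coef
    if at_home:
        linear_predictor += home_coef
    return exp(linear_predictor)


# Liverpool away to Leeds, using the three terms from the worked example.
liverpool_goals = expected_goals(0.1293802, 0.2379885, 0.3902404)
print(f"Liverpool expected goals: {liverpool_goals:.5f}")
```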

Observing the output, you may have noticed that Arsenal are not listed among the coefficients for team or opponent. Arsenal are indeed included in the Poisson regression model; they are the reference level, absorbed into the intercept, and every other team’s coefficient is measured relative to Arsenal. The intercept of 0.1293802 is the log of the expected goals when both team and opponent are at the reference level and the team is playing away, so e^(0.1293802) ≈ 1.14 goals is the baseline scoring rate against which the other coefficients act.

For all the other teams, the team coefficients represent attacking strength relative to Arsenal, so the higher the value, the more goals the team is expected to score. This is consistent with the two Manchester clubs having the highest coefficients and also being the EPL’s leading scorers. The opponent coefficients relate to a team’s defensive strength: Man City and Chelsea, the best defensive teams in the EPL, are the only teams with negative values. A negative value means they reduce the goals scored against them.

The last crucial interpretation is the estimate for home advantage. We saw earlier that there appears to be no home team advantage this season, and this is confirmed by the model: e^(-0.0001433) = 0.9998567103 ≈ 1. So, the home team is just as likely to score as the away team for this season’s data.

Model 1 Diagnostics

There are two types of model check to carry out. The first is to inspect residual plots using the standardised residuals. The second is to assess the model fit using the deviance statistic.

A standardised residual is a measure of strength of the difference between observed and expected values. A fitted value is the statistical model’s prediction of the mean response value. Residual plots are useful for discovering patterns, outliers or misspecifications of the model. Ideally, we really want to see the residuals exhibiting no patterns and a random scatter above and below the horizontal 0 line.  

Figure 4 – Residual Plot for the standardised residuals against the fitted values for the Poisson GLM

The residuals are not scattered symmetrically around 0: there are several points above +2 but none below -1. This is consistent with the violation of the Poisson assumption that the mean equals the variance. The spread of the residuals also decreases as the fitted values increase. Thinking back to a football match, is it reasonable to assume a team’s probability of scoring is equal throughout a 90 minute match? The answer is no for a variety of reasons; one common example is teams, especially very good counter-attacking teams, who like to defend one-goal leads.

Figure 5 – Histogram plot for the standardised residuals for the Poisson GLM

Although it is not mandatory for residuals to follow a normal distribution with GLMs, we would still prefer to see a symmetric, normal shape. Here we have a positively skewed histogram, so the residuals are not normally distributed, and this is confirmed in the normal QQ-plot below.

Figure 6 – Normal Quantile Quantile (QQ) plot for the Poisson GLM

For this QQ plot, we want to see all points lying on the straight blue line. Points that stray from the line indicate departures from normality, and we have a clear horizontal pattern from quantile -3 to -0.5. All three plots suggest that the Poisson GLM is not a good fit for the football data because the mean is not equal to the variance, which in effect has produced poor-looking diagnostic plots.

The dispersion parameter will help to confirm whether the model is overdispersed. Overdispersion is when there is greater variability in a data set than would be expected under the statistical model.

Dispersion Parameter = 728.17/576 = 1.264
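The arithmetic can be reproduced in a couple of lines. I am assuming here that 728.17 is the residual deviance and 576 the residual degrees of freedom from the model summary; the function name is my own.

```python
def dispersion_estimate(residual_deviance: float, residual_df: int) -> float:
    """Rough dispersion estimate: residual deviance over residual degrees of freedom."""
    return residual_deviance / residual_df


ratio = dispersion_estimate(728.17, 576)
print(f"Dispersion estimate: {ratio:.3f}")
print("Overdispersed" if ratio > 1 else "Not overdispersed")
```

A Pearson chi-square statistic divided by the degrees of freedom is another common estimate of the same quantity.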

1.264 > 1, so there is overdispersion. However, this value is not very large, which I had expected. Although there are other GLMs which help deal with overdispersion, such as the quasi-Poisson GLM or the negative binomial, I believe the excess of 0 goals in the data, compared to what the Poisson distribution predicts, is the main cause of the poor model fit. One model which I think will be an improvement is the zero-inflated Poisson regression model, which uses a second process to handle data sets with an excess of 0’s and may well result in better odds predictions.

Model 2 – Zero-Inflated Poisson GLM

The zero-inflated Poisson model works like the Poisson GLM but includes a second underlying process that determines whether an observation is an extra (structural) zero. If it is not, the regular Poisson process takes over and generates the count from its PMF, which can itself still be 0.
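To make the two-process idea concrete, here is a minimal sketch of the zero-inflated Poisson probability mass function. The parameter name `pi_zero` (the probability of an extra zero) and the example values are assumptions for illustration, not fitted values from the model.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson distribution with mean rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


def zip_pmf(k: int, lam: float, pi_zero: float) -> float:
    """Zero-inflated Poisson: an extra zero with probability pi_zero, else Poisson(lam)."""
    if not 0.0 <= pi_zero <= 1.0:
        raise ValueError("pi_zero must be a probability")
    poisson_part = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + poisson_part if k == 0 else poisson_part


# Illustrative values only: a mean of 1.3 goals and a 10% chance of an extra zero.
for goals in range(4):
    print(goals, round(poisson_pmf(goals, 1.3), 4), round(zip_pmf(goals, 1.3, 0.10), 4))
```

Compared with the plain Poisson, the zero-inflated version puts more probability on 0 goals and proportionally less on every other count.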

Model 2 Diagnostics

Figure 7 – Residual plot of the Standardised Residuals against Fitted Values for the Zero Inflated Poisson GLM

I would say this residual plot is an improvement on the first model because far more points lie between -2 and +2, scattered around 0. However, the residuals below 0 show a downward pattern rather than a completely random scatter, which still indicates that the mean is not equal to the variance.

Figure 8 – Histogram of the Standardised Residuals for the Zero Inflated Poisson GLM

This is the histogram shape we are looking for in residuals, as it is symmetric with a roughly normal shape. A perfect histogram would be centred at 0, which is not quite the case here but is close. Nonetheless, we can conclude this histogram is an improvement on the first model, and it might be another indication that the zero-inflated Poisson GLM will give better predictions.

Figure 9 – Normal Quantile-Quantile Plot for the Zero Inflated Poisson GLM

Lastly, in this QQ plot many more points meet the blue line, so the residuals are closer to normally distributed than in the first model. Although not all points meet the line, which again suggests outliers are present, we no longer have that horizontal line of residuals, which is a good indication the zero-inflated Poisson GLM is a better fit for our data set.

I expect that for low scoring matches such as 0-0 and 1-0, the zero-inflated Poisson regression model will provide better predictions, because the standard Poisson’s under-prediction of 0 goals has been accounted for.

Model Predictions and Comparison

In this last stage of the project, both GLM models will now be compared to the betting market odds for two betting market types. I will also highlight potential profitable bets depending on what the models estimate. 

Using the GLM, matches are simulated by estimating two Poisson distributions, one for the home team and one for the away team. For my own preference, I output the resulting simulation matrix for each match into Microsoft Excel. Below is the first matchup between Everton and Tottenham with all the scoreline predictions. The rows are the estimated goals for the home team and the columns are the estimated goals for the away team. So the probability of a 1-0 victory for Everton according to GLM model 1 is 6.874%.

Figure 10 – Simulation Matrix for the Everton v Tottenham match using Model 1 – Poisson GLM
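A scoreline matrix like this can be sketched by multiplying two independent Poisson distributions together. The two means below are placeholders, not the fitted Everton and Tottenham values, and treating the two teams' goals as independent is an assumption of this sketch.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson distribution with mean rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


def score_matrix(home_mean: float, away_mean: float, max_goals: int = 10):
    """matrix[i][j] = P(home scores i and away scores j), assuming independence."""
    home = [poisson_pmf(i, home_mean) for i in range(max_goals + 1)]
    away = [poisson_pmf(j, away_mean) for j in range(max_goals + 1)]
    return [[h * a for a in away] for h in home]


# Placeholder means for illustration only.
matrix = score_matrix(1.4, 1.2)
print(f"P(1-0 home win) = {matrix[1][0]:.3%}")
print(f"P(0-0)          = {matrix[0][0]:.3%}")
```

Rows index home goals and columns index away goals, matching the layout described above.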

To calculate percentages for various bet types, we sum the appropriate entries in the simulation matrix. For example, the draw percentage of 24.251% is the sum of all diagonal entries (0-0, 1-1, 2-2 and so on): 5.904 + 11.447 + … + 0.000 = 24.251%.

Figure 11 – Example Calculations from the simulation matrix
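Summing regions of the matrix can be sketched as below: entries below the diagonal are home wins, the diagonal is the draw, and entries above it are away wins. The matrix is rebuilt from placeholder means so the example runs on its own; none of the numbers are the fitted values.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    return lam ** k * exp(-lam) / factorial(k)


def score_matrix(home_mean: float, away_mean: float, max_goals: int = 10):
    """matrix[i][j] = P(home scores i, away scores j), assuming independence."""
    home = [poisson_pmf(i, home_mean) for i in range(max_goals + 1)]
    away = [poisson_pmf(j, away_mean) for j in range(max_goals + 1)]
    return [[h * a for a in away] for h in home]


def outcome_probabilities(matrix):
    """Return (home win, draw, away win) by summing regions of the matrix."""
    home_win = draw = away_win = 0.0
    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            if i > j:
                home_win += p
            elif i == j:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


home_win, draw, away_win = outcome_probabilities(score_matrix(1.4, 1.2))
print(f"Home {home_win:.1%}  Draw {draw:.1%}  Away {away_win:.1%}")
```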

When comparing against the betting market odds, it is important to point out one assumption. I am assuming the best available odds are representative of the true probability of an event happening. This is not strictly correct, because bookmakers in general set their odds to give themselves a long-term profit regardless of the outcome.
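The size of the bookmaker's built-in margin can be sketched from a set of odds covering every outcome of a market. The three prices below are hypothetical, and rescaling the implied probabilities proportionally is just one simple way of removing the margin.

```python
def overround(decimal_odds):
    """Bookmaker margin: sum of implied probabilities minus 1."""
    return sum(1.0 / o for o in decimal_odds) - 1.0


def margin_free_probabilities(decimal_odds):
    """Implied probabilities rescaled proportionally so they sum to 1."""
    implied = [1.0 / o for o in decimal_odds]
    total = sum(implied)
    return [p / total for p in implied]


# Hypothetical home/draw/away prices, for illustration only.
odds = [1.8, 3.8, 4.5]
print(f"Overround: {overround(odds):.1%}")
print([round(p, 3) for p in margin_free_probabilities(odds)])
```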

Betting Market 1 – Home/Draw/Away

The first betting market is the home win, draw or away win market. Before even observing the results, I anticipate that GLM model 2, the zero-inflated Poisson regression, should be more accurate in its estimates for draws, because 0-0 and 1-1 are low scoring results whose probability the standard Poisson is known to underestimate.

Table 1 – GLM Model Comparison with the Oddschecker.com 1X2 market odds

In five out of the seven matches, the zero-inflated model indeed gave draw estimates closer to the market. Low scoring results such as 0-0 and 1-1 account for most draws, so this is a good indication that the zero-inflated model estimates low scores better. In terms of the betting odds, the public tend not to bet big money on the draw compared to the two teams, so the bookmaker’s draw odds tend to be closer to the true probability. The second betting market should provide more conclusive evidence on whether the zero-inflated model predicts closer to the betting markets for low scoring matches.

In terms of looking for profitable bets, both models estimate that Chelsea are nearly ten percentage points less likely to defeat Brighton than the market implies, so I would look into laying Chelsea to win, as 10 points is a large enough margin to account for uncertainty. The other stand-out bet is West Ham, whom the models rate a huge 16 points more likely to win than the bookmakers’ odds imply. The last bet I would look into placing is on Leeds United to defeat Liverpool.

Betting Market 2 – The Under Goals Market

The goals market is a very popular betting market for punters who like to see action and a fun sweat for their money, especially in accumulators. However, I will be looking at the under goals market because I am interested to see how well these models fit the lower scoring matches. Under 0.5 is the same as 0-0. Under 1.5 covers 0-0, 1-0 and 0-1. So I am expecting the zero-inflated Poisson model to again estimate closer to the betting markets for these odds than the standard Poisson regression.
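An under-goals probability can be sketched as the sum of every scoreline whose total goals fall below the line. The matrix is again built from placeholder means under an independence assumption, so the printed numbers are illustrative only.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    return lam ** k * exp(-lam) / factorial(k)


def score_matrix(home_mean: float, away_mean: float, max_goals: int = 10):
    """matrix[i][j] = P(home scores i, away scores j), assuming independence."""
    home = [poisson_pmf(i, home_mean) for i in range(max_goals + 1)]
    away = [poisson_pmf(j, away_mean) for j in range(max_goals + 1)]
    return [[h * a for a in away] for h in home]


def under_goals_probability(matrix, line: float) -> float:
    """P(total goals < line), e.g. line=1.5 covers 0-0, 1-0 and 0-1."""
    return sum(
        p
        for i, row in enumerate(matrix)
        for j, p in enumerate(row)
        if i + j < line
    )


matrix = score_matrix(1.4, 1.2)  # placeholder means
for line in (0.5, 1.5, 2.5):
    print(f"Under {line}: {under_goals_probability(matrix, line):.1%}")
```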

Table 2 – GLM Model Comparison with the Bet365 Under Goals Market

This betting market has provided a clear indication of the differences in prediction between the two models. It is worth noting the bookmakers will not offer the true odds for these bets, as they include an overround, meaning the punter gets less value for their money than the bet is truly worth. I believe the zero-inflated model has provided the more believable predictions, because there are only two matches with a 10% discrepancy from the bookmaker odds, whereas the Poisson GLM has five.

Two bets which stand out are the under goals in Man United v Burnley and the over goals in Leeds v Liverpool. However, because Leeds and Liverpool are known to be high scoring sides, the odds for the over market are far too short for my liking (1.57 or 4/7 UK odds), so it is not really worth wagering on.

Possible Profitable Bets

I have identified the following bets where there is a significant difference between the zero-inflated Poisson model prediction and the betting market. I have opted for bets where the odds are at least evens (decimal 2.0, an implied probability of at most 50%) so they provide a good return on investment. As odds shorten, more certainty is needed to get the bet spot on. In terms of the difference between model and betting market, I am looking for a threshold of a good 7% and upwards. For these examples, assume each bet stake is limited to £100.

Bet Number  Bet Type                                     Odds
1           West Ham win vs Newcastle                    2.12
2           Leeds win vs Liverpool                       5.1
3           Chelsea to not defeat Brighton               2.1
4           Manchester United v Burnley Under 1.5 Goals  4.5
5           Manchester United v Burnley Under 2.5 Goals  2.2
6           Arsenal v Fulham Under 1.5 Goals             3.5
7           Wolves v Sheffield United Under 1.5 Goals    2.75
Table 3 – Profitable bets with stakes of £100 per bet
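The selection rule described above (odds of at least evens and a model edge of at least 7 percentage points) can be sketched as a simple filter. The candidate bets and model probabilities below are invented placeholders, since the model outputs themselves are not reproduced in the text.

```python
def find_value_bets(candidates, min_odds=2.0, min_edge=0.07):
    """Keep bets priced at evens or longer where the model beats the market by min_edge."""
    selected = []
    for name, odds, model_prob in candidates:
        edge = model_prob - 1.0 / odds  # model probability minus implied probability
        if odds >= min_odds and edge >= min_edge:
            selected.append((name, odds, round(edge, 3)))
    return selected


# Invented examples: (description, decimal odds, model probability).
candidates = [
    ("Example bet A", 2.12, 0.60),  # big edge at better than evens: kept
    ("Example bet B", 1.57, 0.75),  # edge is fine but odds are below evens: dropped
    ("Example bet C", 3.50, 0.30),  # odds are fine but edge is too small: dropped
]
print(find_value_bets(candidates))
```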

Figure 12 – The football results provided from bbcsport
Bet  Bet Type                               Odds  Result  Profit
1    West Ham win vs Newcastle              2.12  Lose    -£100
2    Leeds win vs Liverpool                 5.1   Lose    -£100
3    Chelsea to not defeat Brighton         2.1   Win     +£110
4    Man U v Burnley U1.5 goals             4.5   Lose    -£100
5    Man U v Burnley U2.5 goals             2.2   Lose    -£100
6    Arsenal v Fulham U1.5 goals            3.5   Lose    -£100
7    Wolves v Sheffield United U1.5 Goals   2.75  Win     +£175
Total                                                     -£215
Table 4 – The profit/loss for each bet

From a total outlay of £700, placing the above bets at £100 each would have resulted in a net loss of £215. This just goes to show that, whenever wagering money, it is not advisable to go all guns blazing behind a model output. This is also far too small a sample to draw a real conclusion on whether the zero-inflated model is profitable or not. A good sample, I would recommend, is 100 bets, and to keep tracking after that. I just know from my own personal betting experience that there are far more factors that need to be accounted for, which I will explain in the project conclusion.
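The profit and loss figures can be checked with a short script using the odds and results from Table 4 (the variable and function names are my own):

```python
STAKE = 100.0

# (bet, decimal odds, won?) taken from Table 4
bets = [
    ("West Ham win vs Newcastle", 2.12, False),
    ("Leeds win vs Liverpool", 5.1, False),
    ("Chelsea to not defeat Brighton", 2.1, True),
    ("Man U v Burnley U1.5 goals", 4.5, False),
    ("Man U v Burnley U2.5 goals", 2.2, False),
    ("Arsenal v Fulham U1.5 goals", 3.5, False),
    ("Wolves v Sheffield United U1.5 Goals", 2.75, True),
]


def bet_profit(stake: float, odds: float, won: bool) -> float:
    """Net profit for one bet: stake * (odds - 1) if it wins, otherwise minus the stake."""
    return stake * (odds - 1.0) if won else -stake


total = sum(bet_profit(STAKE, odds, won) for _, odds, won in bets)
outlay = STAKE * len(bets)
print(f"Outlay: £{outlay:.0f}")
print(f"Net result: {total:.2f} (a negative number is a loss)")
print(f"Return on outlay: {total / outlay:.1%}")
```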

Project Conclusion

The model assumptions for the Poisson GLM were not met due to the nature of football, which led to the model giving estimates that were suspiciously far from the betting market. The zero-inflated Poisson GLM was a clear improvement on the Poisson GLM, and I would highly recommend using it as the starting point for further model modifications.

There are limitations to betting blindly with just a zero-inflated Poisson GLM. Other factors that can help identify a team’s success include

  1. A team’s recent form
  2. The difference in days’ rest between match days
  3. Player injuries, especially to key players
  4. The weather on match day
  5. A managerial change during the season

One possible improvement on the zero-inflated Poisson regression model is the Dixon-Coles model, which includes a time decay function and might well be my second blog post! Funnily enough, Mark J. Dixon and Stuart G. Coles both worked on this model at the University of Nottingham, from which I happen to have recently graduated.

I hope you enjoyed reading this blog entry and I am always open to feedback, questions etc.
