Tag Archives: Statistics

Statistics based projects. Target audience – BSc level Statistics

Predicting House Prices Using Regression Techniques

Posted on previous blogging site (31/08/2021)

For this project, the objective is to predict the sale prices of houses using a data set of houses located in the US state of Iowa. The data set contains 80 explanatory variables and 1460 observations. With so many explanatory variables available, models will have a strong tendency to overfit, resulting in poor predictive performance on unseen observations. It will therefore be crucial to analyse how well these models fit and whether the statistical assumptions behind them are met.

Before attempting to fit models, we have a messy data set that needs to be cleaned. Careful consideration will be given to missing values, which can have an impact on model accuracy. Once the data is cleaned, three different models will be fitted and compared. With a finalised model, one last improvement attempt will involve feature selection based on correlation amongst variables. Finally, I conclude with a model selection and explanation.

The data sets were acquired from the data science community website Kaggle. Two data sets are provided for this competition: a training data set and a test data set. I will refer to the Kaggle test data set as the “final test data”. No sale prices are included in the final test data; this data set will only be used for competition purposes, to submit the best model from this project and see how it ranks against other competitors worldwide.

If you wish to follow the code alongside this report, you can find it on my GitHub page located here.

Splitting of the Training Data Set

Before cleaning the data, the very first step is to split the original data into two sets. One set containing 80% of the observations will be named the ‘train’ data set, whilst the remaining 20% will be the ‘test’ data used for model validation.

The main reason for splitting the data is to avoid overfitting, which is highly likely if a model is trained on the entire data set. When a model overfits, it essentially behaves like a lookup table, recognising observations from the values it has already learnt. Ask it to predict for a new observation it has never seen, however, and it will more than likely produce a guess that is nowhere near accurate. By training the model on 80% of the data, we can then observe how it predicts on the unseen 20%.

The reason for splitting the data before cleaning it is to avoid data leakage. Data leakage occurs when information about the test data set is made available to the model through the training data set. If data outside the training data set is used to develop a model, this can lead to exaggerated model results.

I made sure to randomise the data before the split to avoid unnecessary bias. This is essential in statistics, as many models and concepts rest on the assumption of randomly sampled data.
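As a minimal sketch of this step in Python (assuming the Kaggle training file is saved as train.csv), scikit-learn’s train_test_split shuffles the rows before splitting:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file name for the Kaggle training data
data = pd.read_csv("train.csv")

# shuffle=True randomises the rows before splitting; random_state keeps it reproducible
train, test = train_test_split(data, test_size=0.20, shuffle=True, random_state=42)
```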

Exploring the Data Set

To gain an initial understanding of the data, all three data sets will be explored. Sale Price will be used as the dependent variable in the model builds, so it is the most important variable to explore first. A histogram and a box plot are a great way to visualise the likely distribution of the data and to examine any possible outliers. These plots can be seen in figures 1a and 1b below.

The sale price in both data sets does not appear to be normally distributed, as both histograms are positively skewed and not symmetric around the mean. This may well mean that a standard linear regression model will not be a good fit. If so, one possibility for improved performance will be to apply a log transformation to the Sale Price variable or to consider other distributions.

Both box plots confirm there are some outliers in the data sets; however, these outliers correspond to the prices of very expensive houses. The outlying values range from 350,000 upwards, and it is of course very plausible for houses to sell at these prices. A quick Google search shows houses in Iowa selling in the mid 500,000s, so these outliers do not appear to be mistakes or errors.

Of more interest is the check for implausible values such as negative values, 0s or very low prices (£5, say). A summary of the Sale Price variable will state the minimum value as well as providing other summary statistics.

A minimum value of 34900 indicates there are no definite anomalies. Although 34900 is a very low value for a house price in 2017, it is not low enough to be eliminated from the data set. Again, Googling Iowa house prices shows houses selling for around 24k, so this minimum value is plausible.

Data Cleaning

Data cleaning is an extremely important step in building models. An inaccurate data set can lead to erroneous analysis, ultimately translating into a waste of time … and money! Careful consideration has to be taken when dealing with messy data, including missing values, anomalies, outliers and duplicates.

Table 2 below shows the variables with the highest percentage of missing values with a value coded “N/A”.

When a variable has over 80% of its values missing, it is reasonable to simply remove it from the data set. However, my preferred approach is to attempt to keep as much data as possible without deletion. After reading the description text supplied with the data set, “N/A” for many of the columns actually means none/no. So instead of deleting columns with many missing values, the “N/A” entries will be changed to “none”.

Summary of the Data Cleaning Changes Made

  • For 15 categorical variables, the data description file states N/A = none, so these were coded accordingly
  • “LotFrontage” – the median and mean were very similar across all three data sets, but I opted for the median because a few outliers pulled the mean slightly above the median
  • “GarageYearBlt” – again I opted for the median, for the same reason as “LotFrontage”
  • For “Electrical”, the mode SBKR is the most common value by a landslide, therefore I believe it is more than reasonable to replace the N/As with SBKR
  • For “MasVnrArea” – since the majority of values are 0, it is more than reasonable to replace N/A with 0, as a genuine value would otherwise have been recorded. Figures 3a and 3b below show this (see the code sketch after this list)
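As a rough sketch of these amendments in pandas (column names follow the Kaggle data description; the list of 15 “N/A means none” columns is abbreviated here):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Categorical columns where the data description says N/A means "none"
    # (abbreviated - the full list has 15 columns)
    none_cols = ["Alley", "PoolQC", "Fence", "MiscFeature", "FireplaceQu"]
    df[none_cols] = df[none_cols].fillna("none")

    # Numeric columns: impute with the median (robust to the outliers noted above)
    for col in ["LotFrontage", "GarageYearBlt"]:
        df[col] = df[col].fillna(df[col].median())

    # Electrical: impute with the mode; MasVnrArea: missing means no veneer, so 0
    df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
    df["MasVnrArea"] = df["MasVnrArea"].fillna(0)
    return df
```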

After applying the above data cleaning amendments, 13 rows containing “N/A” still remained in the final test data. At this stage it would be more than reasonable to simply remove them, since they equate to 13/1460 (0.8%) of observations in the data set. However, the Kaggle competition requires a prediction for every row, so these “N/A” values were imputed instead.

Now all three data sets are cleaned, it is time to get started with model fitting. 

Model 1 – Multiple Linear Regression

The multiple linear regression model will be the first statistical model fitted to the training data set. It is an extension of the simple linear regression model and predicts the outcome of a dependent variable from several explanatory variables. The form of the multiple linear regression model is written as follows …
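In standard notation, with $p$ explanatory variables $x_{i1}, \dots, x_{ip}$ for observation $i$ and error term $\varepsilon_i$:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i $$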

Assumptions of the Multiple Linear Regression Model

  1. A linear relationship between the dependent variable (SalePrice) and the independent variables
  2. Statistical independence of errors
  3. Constant variance of errors
  4. Error terms are normally distributed

If any of these assumptions is violated, the forecasts and insights from the regression output can be inefficient and, in the worst case, seriously biased or misleading. This is why it will be important to double-check the model fit by assessing the residual plots, which give a picture of how the error terms behave and whether the model assumptions are met. The root mean squared error will also be used as the metric to compare model fits.

Coding Categorical Variables

There is still one issue remaining with the data in its current state: the categorical variables. Regression analysis requires variables to be in numerical form. A workaround to get the regression to work with categorical variables is to use dummy variables.

In the following example, observations for the categorical variable “LandSlope” have three options which are as follows …

  1. Gentle slope
  2. Moderate slope
  3. Severe slope

Dummy variables can be coded in binary form, leading to a table referred to as the contrast matrix. For a variable with n categories we code n − 1 dummy columns. In our example there are three categories, therefore we code only two new columns, which is illustrated below in table 3.

The original column is dropped as it is no longer needed, and the table can be interpreted as follows: if “LandSlope” is gentle, we place 0s in both the moderate and severe columns; if “LandSlope” is either moderate or severe, we place a 1 in the corresponding column.

It is important to know that if we wish to use another data set to predict with a trained model, that data set must contain the same columns as the data the model was trained on. So before creating dummy variables, I decided to combine all three data sets once again, with an additional column labelling which set each row belongs to. This way all three data sets end up with the same columns, and once the encoding is complete the data can be split back into its respective sets.
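A sketch of this combine–encode–split approach with pandas (the frame names train, test and final_test are assumptions for the three cleaned data sets):

```python
import pandas as pd

# train, test and final_test are the three cleaned data frames described above
combined = pd.concat([train.assign(source="train"),
                      test.assign(source="test"),
                      final_test.assign(source="final")])

# One-hot encode the categorical columns; drop_first=True keeps n-1 dummy
# columns per variable, as in the LandSlope contrast matrix example
encoded = pd.get_dummies(combined.drop(columns="source"), drop_first=True)
encoded["source"] = combined["source"].values

# Split back into the three data sets, which now share identical columns
train_enc = encoded[encoded["source"] == "train"].drop(columns="source")
test_enc = encoded[encoded["source"] == "test"].drop(columns="source")
final_enc = encoded[encoded["source"] == "final"].drop(columns="source")
```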

Model 1 – Linear Model Results

The R-squared value indicates that 95.1% of the observed variation in the dependent variable is explained by the independent variables in the multiple regression. This is obviously very encouraging; however, it is well known that R-squared can only increase as more variables are added, so a high value may simply reflect an overfitted model. A measure that counteracts the inclusion of additional independent variables is the adjusted R-squared, which increases only when a new term improves the model more than would be expected by chance and is penalised otherwise. Here an adjusted R-squared of 93.5% is still a very good result. I will now predict sale prices on the test data using the linear regression model and compare the predicted sale prices with the actual test sale prices.

Root Mean Squared Error

To evaluate the accuracy of predictions made by the various statistical models, I will compare their performance using the root mean squared error (RMSE). The RMSE is the standard deviation of the residuals and measures how far the data points lie from the regression line; in essence, it measures how spread out the residuals are. The lower the RMSE, the better. The root mean squared error is calculated as follows …
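With $y_i$ the observed sale price, $\hat{y}_i$ the predicted sale price and $n$ the number of test observations:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$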

Or simply find the RMSE by
1. Squaring the residuals
2. Taking the average of the squared residuals (this gives the mean squared error)
3. Taking the square root of the mean squared error
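Those three steps translate directly into a few lines of Python (the arrays of actual and predicted test-set prices are assumed names):

```python
import numpy as np

def rmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    residuals = actual - predicted
    return np.sqrt(np.mean(residuals ** 2))   # square, average, square root

# e.g. rmse(y_test, model.predict(X_test))
```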

An RMSE value of 66075.93 is very high and indicates that the model has in fact overfitted the data. I will now examine the residual plots, which visualise this.

This residual plot shows heteroscedasticity, meaning an unequal scatter of residuals. There is a systematic change in the spread of the residuals as the predicted price increases along the x-axis. This violates the linear regression assumption that the residuals are drawn from a population with constant variance, as required by ordinary least squares. So, to confirm, there is room for improvement.

The histogram of the residuals is encouraging, because their shape appears roughly normal although not completely symmetric; there is a slight negative skewness.

For this QQ plot, the red line is fitted through the quantiles. For a normal distribution, all the points would lie on the straight red line. Here, although many of the points do lie on the line, the points towards the ends drift away from it, meaning that although the distribution is symmetric it has fat tails.

Model 2 – Log Linear Regression Model

One possible way to deal with non-normality in a regression model is to apply a log transform to the dependent variable. The second model I will fit is therefore a log linear regression model, in which the dependent variable (Sale Price) is log transformed. The form of the log linear regression model is as follows
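That is, the same linear predictor as before but with the natural log of the sale price as the response:

$$ \log(y_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i $$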

Model Assumptions

  • Statistical independence of errors
  • Constant variance of errors
  • Error terms are normally distributed

To produce the log linear regression model, the dependent variable is transformed by taking the natural log of all the sale price values. The figure below is a histogram of the dependent variable after the log transformation.

Compared to figures 1a and 1b, the histogram of the log-transformed sale price now has the shape of a normal distribution, with symmetry around the centre. The skewness has been removed, so the results for this model should improve on the previous one.
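As a sketch (the variable names X_train, X_test, y_train and y_test are assumptions), the transformation and back-transformation look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on the natural log of the sale price
log_model = LinearRegression().fit(X_train, np.log(y_train))

log_preds = log_model.predict(X_test)                          # predictions on the log scale
log_rmse = np.sqrt(np.mean((np.log(y_test) - log_preds) ** 2)) # RMSE on the log scale

price_preds = np.exp(log_preds)                                # back-transform to prices
```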

Model 2 Results

The residual plot for this model is definitely an improvement, as the scatter of residuals is more random. The plot appears to show constant variance, with a roughly even split of residuals above and below the horizontal zero line.

Just as in the previous model, the residuals appear approximately normally distributed, although there is a slight positive skewness.

This QQ plot confirms that although the majority of points lie on the red quantile line, there is a fat tail towards the left, where the points sit far from the line.

Model 3 – Random Forest Model

The random forest is a supervised machine learning algorithm that builds multiple decision trees and merges them by averaging the predictions of the component trees. A random forest generally provides much better predictive accuracy than a single decision tree. A major advantage of the random forest is that it can be used for both classification and regression; for this project, we will use it for regression.

Assumptions of the Random Forest Model

  1. At each step of building an individual tree, the random forest finds the best split of the data
  2. Each tree is built on a bootstrap sample rather than the whole data set
  3. Bootstrap aggregation is used, so the sampling is assumed to be representative

In terms of the statistical assumptions required by the linear regression models, such as normality of errors, this model does not require them to be met. This is because random forests are non-parametric and driven by the data, so they can handle skewed data as well as categorical data.

I opted for the default random forest size of 100 trees (estimators); however, should the random forest provide a competitive RMSE score, I will explore tuning the number of estimators.
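A sketch of this fit with scikit-learn, using the same assumed X_train/y_train naming as before:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)   # 100 trees, the default size used here
rf.fit(X_train, y_train)

rf_rmse = np.sqrt(np.mean((y_test - rf.predict(X_test)) ** 2))
```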

This plot shows that the random forest tends to provide reasonable predictions compared to the actual Sale Price values. Although it slightly over-predicts around the mean, the overall shape aligns.

A root mean squared error of 30621 is better than the original multiple linear regression model, but nowhere near as good as our best performing model, the log linear regression. Even though normality and constant variance of errors are not mandatory for random forest regression, the residual plots will still be explored to see how the residuals behave.

This residual plot shows a random scatter of points up to a sale price of around 300,000, where the variance starts to widen, and there is a very clear outlier around the 350,000 point. Although the constant variance assumption is not mandatory for the random forest model, we would still much rather observe a random scatter throughout the whole plot.

The histogram of residuals is similar to the previous model’s, with a slight negative skew, but the shape is no longer smoothly normal. I therefore expect the corresponding QQ plot not to be a straight line through the quantiles.

Indeed, the QQ plot for the random forest model is the worst of the three models, as the quantiles do not meet the red line throughout.

Model Comparison

It is very clear that the best fitting model was model 2, the log linear regression model. With an RMSE of 0.22 (on the log scale), it is far more accurate than the other two models. An important takeaway from this comparison concerns the R-squared and adjusted R-squared results: there is very little difference between models 1 and 2 on these measures, which shows why we cannot rely on them alone. Even though adjusted R-squared penalises the addition of more variables, it will still tend to be high when a model contains a lot of variables.

Feature Selection

It was concluded that the best model is the log linear multiple regression model, so for the rest of this project I will use only this model and seek further improvement. One way to improve a model’s predictions is feature selection: the process of reducing the number of variables in a model to improve its performance. There are various feature selection techniques; the one I will use is based on correlation.

Correlation is the statistical term referring to how close two variables are to having a linear relationship with each other.

-1 = perfect negative correlation
0 = no correlation
+1 = perfect positive correlation

If we have two variables x and y, a correlation of −1 means that as x increases, y decreases at a perfectly consistent rate. For removing variables from the model, I am only interested in variables that are highly positively correlated: if two variables are highly correlated with each other and with the rest of the model variables, we can remove the weaker performing of the two to improve model performance.

Correlation Matrix Example

I will illustrate a correlation matrix example with a subset of the data using just seven variables.

The very dark blue squares on the diagonal, with a value of 1, are each variable paired with itself, so these correlations can be ignored as they are of no use for interpretation. What we are looking for are the other dark squares, which correspond to the highly positively correlated pairs.

YearRemodAdd and YearBuilt are the most highly correlated pair, with a correlation coefficient of 0.6. If we were to remove one of these variables, we would remove the one that is more positively correlated with the rest of the variables. YearBuilt is more positively correlated with MasVnrArea and BsmtFinSF1, therefore I would remove it from the model rather than YearRemodAdd.

There are 285 degrees of freedom in our final model, and therefore 286 variables. It would be far too difficult to eliminate variables by inspecting a correlation matrix of that size by eye. To remove variables based on correlation, I created a correlation function with a threshold of 80%: variables that are positively correlated with other variables above this threshold are identified by Python and eliminated from the model. This resulted in 21 variables being removed from the model, which were as follows.

Some might argue that with so many variables in the model, a more conservative threshold of 90% is preferable, whilst a more aggressive approach might set the threshold at 60%. Personal preference led me to choose 80%, because I think it is high enough whilst still allowing a considerable number of variables to be removed.
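One common way to implement this kind of threshold filter in Python is sketched below (this is a reconstruction, not necessarily the exact function used in the project; it looks only at positive correlations, in line with the discussion above, and assumes a dummy-encoded frame named X_train):

```python
import numpy as np
import pandas as pd

def correlated_features(X: pd.DataFrame, threshold: float = 0.80) -> list:
    """Return columns whose positive correlation with an earlier column exceeds the threshold."""
    corr = X.corr()
    # Keep only the upper triangle so each pair of variables is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

to_drop = correlated_features(X_train, threshold=0.80)
X_train_reduced = X_train.drop(columns=to_drop)
```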

Model 4 – Log Linear Regression After Feature Selection

Model Comparison

With only a 0.46% decrease in RMSE, the feature selection based on highly correlated variables has not made a significant difference at all. I will side with the more conservative option and stick with the log linear regression model without feature selection as my proposed final model for this project.

Kaggle Final Test Data Submission

Kaggle allows multiple submission attempts, therefore both model 2 (log linear regression) and model 4 (log linear regression with feature selection) were submitted. The Kaggle scores were as follows.

No real surprise to see that model 4 performed only marginally better, as was the case in the earlier analysis.

Out of 4622 entrants, my best model (model 4) ranks 3646th, which is just shy of the bottom 20% of submissions. I was certainly hoping for a better rank for a debut submission, but I am more intrigued by the leading entries with scores of 0.003! There are other techniques out there which I am yet to learn, so who knows, maybe I will return in a year’s time with new-found knowledge and improve my score!

Project Conclusion

Three models were fitted, with the best performing being the log linear regression model. By transforming the dependent variable (Sale Price), the model was able to meet the assumptions behind a linear regression model, resulting in a drastically improved root mean squared error.

Although in this project feature selection based on correlation only improved the model marginally, it still shows how the choice of variables can affect model performance. Correlation-based feature selection is just one method of many, so there may be room for improvement in reducing the model’s variables using an alternative feature selection method.

Nonetheless, the log linear regression model was an excellent fit to the training data, predicted accurately on unseen data and met all of its statistical assumptions.

If you managed to read all of this project report,  I hope you enjoyed the read!  

Fitting Generalised Linear Models to 2020-21 EPL Football Data

Posted on Previous Blogging Site (10/05/2021)

The English Premier League is the highest division in the English football league system and is widely regarded as one of the very best leagues in the world. The EPL consists of twenty teams; over the course of a season, each team plays every other team twice, for a total of 380 matches. With the increased popularity of internet and mobile sports betting, anyone over the age of 18 in the UK has the opportunity to wager on the outcome of every single match across a wide variety of betting markets.

The statistical software package RStudio will be used to fit a series of generalised linear models to this season’s football data so far (16/08/2020 – 20/04/2021), and a couple of betting markets will be evaluated. A model comparison will be undertaken to determine the best model statistically and whether it corresponds to a closer estimate of the betting market odds. Lastly, we will identify any profitable betting opportunities suggested by the model and the outcome of those bets.

The following matches will be used for prediction and the odds beside them were taken from www.oddschecker.com on the date Tuesday 13th April 2021.

Figure 1 – The seven EPL matches which the statistical models will predict for

The above odds are all in decimal format, which is my personal preference because I find it more efficient for calculating outcome probabilities. Please note the cells coloured light blue or red indicate a movement in odds and are of no importance to this project. If you are wondering how odds can be interpreted, here is a quick example. My favourite team Arsenal are hosting Fulham on Sunday 18th April 2021 with decimal odds of 1.8 to win. This means that for a £100 wager, the resulting net profit is £80 should Arsenal win. But what does this actually mean in probability terms? Well, 100/1.8 ≈ 55.6%, the implied chance of winning. If you wanted to wager on Arsenal, the question to ask yourself is: do Arsenal defeat Fulham at least 6 times out of ten on this given Sunday? If the answer is a slam-dunk yes, then placing this wager will be profitable long term. Now comes the big question: how can you possibly know a team’s true chance of winning?
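That conversion is just the reciprocal of the decimal odds; as a one-line helper:

```python
def implied_probability(decimal_odds: float) -> float:
    """Implied win probability from decimal odds."""
    return 1 / decimal_odds

implied_probability(1.8)   # 0.5556 -> roughly a 55.6% implied chance
```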

Although bettors tend to overestimate their ability to predict, sports betting professionals and bookmakers use sophisticated data models to derive a team’s winning percentage. Regardless, every model is built on a starting block: a data distribution that best describes the data set.

The Data Set

The website www.football-data.co.uk is the source that will be used to acquire football data. In this data set, there are 106 columns of data including metrics such as number of goals scored, shots on goal and even the referee of the match.

Before attempting to construct a model, we must decide what data can provide a reasonable estimate of a team’s success. Browsing the final English Premier League standings for the past six seasons, in four of the six the team with the worst goal difference ended rock bottom of the table, while all six champions ranked in the top two for goal difference. There is a general pattern here: teams with a better goal difference tend to finish higher up the league table.

If we model goals scored and conceded per match, we are working with count data. Plotting histograms is one way to recognise a reasonable data distribution, and I found the Poisson distribution to be an appropriate choice, overlaid on the histograms below.

Figure 2 – Histogram of EPL Home Goals with the Poisson Distribution overlayed
Figure 3 – Histogram of EPL Away Goals with the Poisson Distribution overlayed

Both histograms are positively skewed with a Poisson-like shape. The Poisson distribution does not fit perfectly, especially for 0 and 1 goals scored, where the actual counts exceed the Poisson estimates. This may well have an impact on the model results, which we will analyse later in the project, as an excess of 0s can affect a Poisson model.

The Poisson Distribution

Now that we have observed the goals data roughly following a Poisson distribution, it is worth explaining exactly what a Poisson distribution is. Named after the French mathematician Siméon Denis Poisson, it is a discrete probability distribution giving the probability of a number of events occurring within a fixed time period at a constant mean rate. With regard to football, the Poisson distribution provides a probability for the number of goals expected in a 90-minute match.

The Probability mass function of a Poisson distribution is defined as follows
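For a count $k$ and mean rate $\lambda$ (here, the average number of goals per match):

$$ P(K = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots $$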

For the Poisson distribution, the mean is equal to the variance.

I will now take a look at the mean and variance of the goals data set to check whether this holds true.

This is a violation of the Poisson assumption: the mean and variance are not equal for either home goals or away goals. In the model analysis later, this may be an indication that the Poisson distribution is not a perfect fit for football goals data.

An example calculation of the percentage chance of the home team scoring exactly one goal, plugging the average home goals value of 1.334416 into the Poisson distribution, is …
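Using the PMF above with $\lambda = 1.334416$ and $k = 1$:

$$ P(K = 1) = \frac{1.334416^{1}\, e^{-1.334416}}{1!} \approx 0.351, $$

i.e. roughly a 35% chance that the home team scores exactly one goal.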

Assumptions of the Poisson Distribution

It is important for any statistical model that the assumptions are stated and met, otherwise the validity of the results is affected and wrongful conclusions may be drawn from the data. The assumptions of the Poisson distribution are as follows

  1. K is a discrete random variable taking values 0, 1, 2, … and so on
  2. Events occur independently: one event occurring does not affect the probability of another event occurring
  3. The average rate at which events occur is constant and independent of any occurrences, so a team is expected to have the same probability of scoring in the first half as in the second
  4. Two events cannot occur at exactly the same instant

In relation to football, assumptions one and four are met. Goal data is discrete: teams cannot score half a goal. As for assumption four, a goal is an instantaneous event where only one outcome can happen at a time.

For assumption two, many football fans will know their side tends to sit back and defend a one-goal lead, especially very late in the match; when a team goes defensive, this affects their own chance of scoring a second. Also, some teams play different strategies, for example opting for a defensive approach away from home and then switching to all-out attack after going a goal down. This assumption is not met for football.

Assumption three fails for similar reasons: due to different strategies and game flow, a team tends to have a higher probability of scoring in the second half than the first, especially if it is 0-0 at half time. We certainly cannot say a team has the same probability of scoring in the 15th minute as in the 60th minute, for example.

With two of the assumptions unlikely to be met, the Poisson distribution may well not be the perfect model choice, but we shall explore it anyway.

Generalised Linear Models

Generalised linear models extend the ordinary linear regression model to data distributions that are non-normal. In an ordinary linear regression model, the response variable follows a normal distribution and its mean is a linear function of the explanatory variables.

The mathematical form of an ordinary linear model is
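In standard notation, with response $y_i$, explanatory variables $x_{i1}, \dots, x_{ip}$ and normally distributed errors:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) $$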

In our case, the response variable (football goals) is Poisson distributed, so a Poisson generalised linear model is needed. A GLM accommodates the Poisson distribution by using a link function to transform the mean of the response variable.

The pdf/pmf of a GLM belongs to the exponential family and can be written in the form
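In standard notation, for a response $y$ with canonical parameter $\theta$ and dispersion parameter $\phi$:

$$ f(y; \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} $$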

For the Poisson GLM
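In standard notation, the Poisson pmf fits this exponential-family form with $\theta = \log\lambda$, $b(\theta) = e^{\theta}$ and $a(\phi) = 1$:

$$ f(y; \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!} = \exp\{\, y\log\lambda - \lambda - \log(y!) \,\}, $$

and the mean response $\mu_i$ is linked to the linear predictor through the log link:

$$ \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad Y_i \sim \text{Poisson}(\mu_i). $$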

The GLM calculates coefficients for the explanatory variables using the method Maximum Likelihood Estimation and provides the optimal value for each coefficient.

Model 1 – Poisson Regression

We are going to now fit our first generalised linear model using the Poisson distribution.
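The post fits this model in RStudio; as a rough sketch of the same specification in Python’s statsmodels (the long-format data frame and its column names goals, home, team and opponent are assumptions):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed long format: one row per team per match, with the goals they scored,
# a 0/1 home indicator, and the team/opponent names
matches = pd.read_csv("epl_2020_21_long.csv")   # hypothetical file name

poisson_glm = smf.glm("goals ~ home + team + opponent",
                      data=matches,
                      family=sm.families.Poisson()).fit()
print(poisson_glm.summary())

# Expected goals for Leeds at home to Liverpool
leeds_home = pd.DataFrame({"home": [1], "team": ["Leeds"], "opponent": ["Liverpool"]})
print(poisson_glm.predict(leeds_home))
```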

Interpreting the Poisson Regression Output

So Leeds expected goals at home to Liverpool is
Log(Goals) =  0.1293802 -0.0001433 + 0.2379885 + 0.3902404
Log(Goals) = 0.3724746
Goals = 1.451322

Alternatively, the expected goals for Liverpool away to Leeds are
Log(Goals) =  0.1293802 + 0.2379885 +0.3902404
Log(Goals) = 0.7576091
Goals = 2.13317

Observing the output, you may have noticed that Arsenal are not listed among the team or opponent coefficients. Arsenal are indeed in the Poisson regression model: the estimate corresponding to Arsenal is the intercept. Arsenal are fixed as the benchmark, and all other teams are estimated relative to them. The intercept of 0.1293802 is positive, indicating that when Arsenal are involved in a match they have a positive influence on goals being scored.

For all the other teams, the team coefficients represent attacking strength, so the higher the value, the greater the expected number of goals scored. Fittingly, the two Manchester clubs have the highest coefficients and happen to be the EPL’s leading scoring sides. The opponent coefficients relate to defensive strength: Man City and Chelsea are the only teams with negative values and are the best defensive teams in the EPL, so they have a negative impact on the goals scored against them.

The last crucial interpretation is the estimate for home advantage. We saw earlier that there appears to be no home advantage this season, and this is confirmed by the model: e^(-0.0001433) = 0.9998567103 ≈ 1, so the home team is just as likely to score as the away team in this season’s data.

Model 1 Diagnostics

There are two types of model check to carry out. The first is to plot the standardised residuals. The second is to assess the model fit using the residual deviance, from which a dispersion parameter can be calculated.

A standardised residual measures the strength of the difference between observed and expected values, while a fitted value is the model’s prediction of the mean response. Residual plots are useful for discovering patterns, outliers or misspecifications of the model. Ideally, we want the residuals to show no pattern and a random scatter above and below the horizontal zero line.

Figure 4 – Residual Plot for the standardised residuals against the fitted values for the Poisson GLM

The residuals do not appear to be spread evenly around zero: there are several points above +2, yet none below −1, and the variance decreases as the fitted values increase. This is consistent with a violation of the Poisson assumption that the mean equals the variance. Thinking back to a football match, is it reasonable to assume a team’s probability of scoring is constant throughout the 90 minutes? No, for a variety of reasons; one common example is a good counter-attacking team choosing to defend a one-goal lead.

Figure 5 – Histogram plot for the standardised residuals for the Poisson GLM

Although it is not mandatory for the residuals of a GLM to follow a normal distribution, we would still prefer to see a symmetric, normal shape. Here the histogram is positively skewed, so the residuals are not normally distributed, which is confirmed in the normal QQ plot below.

Figure 6 – Normal Quantile Quantile (QQ) plot for the Poisson GLM

For this QQ plot, we want to see all the points meeting the straight blue line. Points that do not touch the line are outliers, and there is a clear horizontal pattern from the −3 quantile to −0.5. From all three plots we can confirm that the Poisson GLM is not a good fit for the football data: the mean is not equal to the variance, which has in effect produced poor diagnostic plots.

The dispersion parameter will help confirm whether the model is overdispersed. Overdispersion occurs when there is greater variability in the data than the statistical model expects.

Dispersion parameter = residual deviance / degrees of freedom = 728.17/576 = 1.264

Since 1.264 > 1, there is some overdispersion, although the value is not very large, which I had expected. There are other GLMs that deal with overdispersion, such as the quasi-Poisson GLM or the negative binomial, but I believe the excess of zero goals in the data, compared with what the Poisson distribution predicts, is the main culprit behind the poor model fit. One model which I think will be an improvement is the zero-inflated Poisson regression model, which uses an extra process to handle data sets with an excess of 0s and may well result in better predicted odds.

Model 2  – Zero-Inflated Poisson GLM

The zero-inflated Poisson model works just like the Poisson GLM but includes a second underlying process that accounts for excess zeros. If a count is not assigned to this zero-inflation process, the regular Poisson process takes over to determine its value via the Poisson PMF.
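The post again fits this in R; as a hedged sketch of the same idea using statsmodels’ ZeroInflatedPoisson (same assumed long-format columns as before, with an intercept-only logit for the excess-zero process):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

matches = pd.read_csv("epl_2020_21_long.csv")   # hypothetical file name

# Build the count-model design matrix by hand (dummy-encode team and opponent)
X = pd.get_dummies(matches[["team", "opponent"]], drop_first=True).astype(float)
X["home"] = matches["home"]
X = sm.add_constant(X)

# With exog_infl left at its default, the zero-inflation part is intercept-only
zip_model = ZeroInflatedPoisson(matches["goals"], X, inflation="logit")
zip_fit = zip_model.fit(maxiter=500, disp=False)
print(zip_fit.summary())
```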

Model 2 Diagnostics

Figure 7 – Residual plot of the Standardised Residuals against Fitted Values for the Zero Inflated Poisson GLM

I would say this residual plot is an improvement on the first model, because far more of the points lie between −2 and +2, around a mean of zero. However, there is a negative pattern in the residuals below zero rather than a completely random scatter, which still suggests the mean is not equal to the variance.

Figure 8 – Histogram of the Standardised Residuals for the Zero Inflated Poisson GLM

This is the shape we are looking for in a histogram of residuals, as it is symmetric and roughly normal. A perfect histogram would be centred at 0, which is not quite the case here, but it is close. Nonetheless, we can conclude that this histogram is an improvement on the first model, and it may be another indication that the zero-inflated Poisson GLM will give better predictions.

Figure 9 – Normal Quantile-Quantile Plot for the Zero Inflated Poisson GLM

Lastly, in this QQ plot many more points meet the blue line, so the residuals are closer to normally distributed than in the first model. Not all points meet the line, which again indicates outliers are present, but we no longer have that horizontal line of residuals, a good indication that the zero-inflated Poisson GLM is a better fit for our data set.

I expect that for low-scoring matches such as 0-0 and 1-0, the zero-inflated Poisson regression model will provide better predictions, because the standard Poisson’s under-prediction of 0 goals has been accounted for.

Model Predictions and Comparison

In this last stage of the project, both GLM models will be compared to the betting market odds for two types of betting market. I will also highlight potentially profitable bets depending on what the models estimate.

Using the GLM, matches are simulated by estimating two Poisson distributions, one for the home team and one for the away team. For my own preference, I output the resulting simulation matrix for each match into Microsoft Excel. Below is the first match-up, Everton v Tottenham, with all the scoreline probabilities: the rows are the estimated goals for the home team and the columns the estimated goals for the away team. So, according to GLM model 1, a 1-0 victory for Everton has probability 6.874%.

Figure 10 – Simulation Matrix for the Everton v Tottenham match using Model 1 – Poisson GLM

To calculate percentages for various bet types, we sum the appropriate entries of the simulation matrix. The draw percentage of 24.251%, for instance, is the sum of all the diagonal entries (0-0, 1-1, 2-2, etc.): 5.904 + 11.447 + … + 0.000 = 24.251%.

Figure 11 – Example Calculations from the simulation matrix
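A sketch of how such a simulation matrix can be built and summed in Python with scipy (the expected-goals figures passed in below are placeholders, not the model’s actual estimates for Everton v Tottenham):

```python
import numpy as np
from scipy.stats import poisson

def score_matrix(home_exp: float, away_exp: float, max_goals: int = 10) -> np.ndarray:
    """Joint probability of each scoreline, assuming independent Poisson goal counts."""
    home = poisson.pmf(np.arange(max_goals + 1), home_exp)
    away = poisson.pmf(np.arange(max_goals + 1), away_exp)
    return np.outer(home, away)        # rows = home goals, columns = away goals

m = score_matrix(1.2, 1.4)             # placeholder expected goals
home_win = np.tril(m, -1).sum()        # home scores more: below the diagonal
draw = np.trace(m)                     # the diagonal: 0-0, 1-1, 2-2, ...
away_win = np.triu(m, 1).sum()         # away scores more: above the diagonal
under_1_5 = m[0, 0] + m[1, 0] + m[0, 1]   # 0-0, 1-0 and 0-1
```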

When comparing against the betting market odds, it is important to point out an assumption: I am treating the best available odds as representative of the true probability of an event happening. This is not strictly correct, because bookmakers set odds to guarantee themselves a long-term profit regardless of the outcome.

Betting Market 1 – Home/Draw/Away

The first betting market is the home win / draw / away win market. Before even observing the results, I anticipate that GLM model 2, the zero-inflated Poisson regression, should estimate draws more accurately, because 0-0 and 1-1 are low-scoring results for which the standard Poisson is known to underestimate the probability.

Table 1 – GLM Model Comparison with the Oddschecker.com 1X2 market odds

In five of the seven matches, the zero-inflated model indeed estimated the draw better. In general, 0-0 is the most common scoreline when a match ends in a draw, so this is a good indication that the zero-inflated model estimates better. In terms of the betting odds for draws, the public do not tend to bet big money on the draw compared to the teams themselves, so the bookmaker’s draw odds tend to be closer to the truth. The second betting market should provide more conclusive evidence of whether the zero-inflated model predicts closer to the betting markets for low-scoring matches.

In terms of looking for profitable bets, both models estimate Chelsea’s chance of defeating Brighton to be nearly ten percentage points lower than the market implies, so I would look into laying Chelsea to win, as ten percent is a large enough margin to account for uncertainty. The other stand-out bet is West Ham, with a win probability estimated a huge 16 percentage points higher than the bookmakers are offering. The last bet I would consider is Leeds United to defeat Liverpool.

Betting Market 2 – The Under Goals Market

The goals market is very popular with punters who like to see action and a fun sweat for their money, especially in accumulators. However, I will be observing the under goals market because I am interested in how well these models fit the lower-scoring matches. Under 0.5 goals is the same as 0-0; under 1.5 covers 0-0, 1-0 and 0-1. I am again expecting the zero-inflated Poisson model to estimate closer to the betting markets for these odds than the standard Poisson regression.

Table 2 – GLM Model Comparison with the Bet365 Under Goals Market

This betting market provides a clear indication of the differences in prediction between the two models. It is worth noting that the bookmakers will not offer the true odds for these bets, as they include an overround, meaning the punter gets less value than the bet is truly worth. I believe the zero-inflated model has provided the more believable predictions: there are only two matches with a discrepancy of 10% or more from the bookmaker odds, whereas the Poisson GLM has five.

Two matches stand out to wager on: the under goals market in Manchester United v Burnley and the over goals market in Leeds v Liverpool. However, because Leeds and Liverpool are known to be high-scoring sides, the odds for the over market are far too short for my liking (1.57, or 4/7 in UK odds), so it is not worthwhile wagering on.

Possible Profitable Bets

I have identified the following bets where there is a significant difference between the zero-inflated Poisson model’s prediction and the betting market. I have opted for bets where the odds are at least evens (a 50% implied chance), so they provide a good return on investment; as odds shorten, more certainty is needed to get the bet spot on. In terms of the model-to-market difference, I am looking for a threshold of a good 7% and upwards. For these examples, assume each bet’s stake is limited to £100.

Bet Number | Bet Type | Odds
1 | West Ham win vs Newcastle | 2.12
2 | Leeds win vs Liverpool | 5.1
3 | Chelsea to not defeat Brighton | 2.1
4 | Manchester United v Burnley Under 1.5 Goals | 4.5
5 | Manchester United v Burnley Under 2.5 Goals | 2.2
6 | Arsenal v Fulham Under 1.5 Goals | 3.5
7 | Wolves v Sheffield United Under 1.5 Goals | 2.75
Table 3 – Profitable bets with stakes of £100 per bet

Figure 12 – The football results provided from bbcsport
Bet | Bet Type | Odds | Result | Profit
1 | West Ham win vs Newcastle | 2.12 | Lose | -£100
2 | Leeds win vs Liverpool | 5.1 | Lose | -£100
3 | Chelsea to not defeat Brighton | 2.1 | Win | +£110
4 | Man U v Burnley U1.5 goals | 4.5 | Lose | -£100
5 | Man U v Burnley U2.5 goals | 2.2 | Lose | -£100
6 | Arsenal v Fulham U1.5 goals | 3.5 | Lose | -£100
7 | Wolves v Sheffield United U1.5 Goals | 2.75 | Win | +£175
Total | | | | -£215
Table 4 – The profit/loss for each bet

From a total outlay of £700, placing the above bets at £100 each would have resulted in a loss of £215. This just goes to show that it is not advisable to go all guns blazing behind a model output whenever wagering money. It is also far too small a sample to draw a real conclusion on whether the zero-inflated model is profitable; a good sample, I would suggest, is at least 100 bets, tracked over time. From my own betting experience, there are far more factors that need to be accounted for, which I will explain in the project conclusion.

Project Conclusion

The model assumptions for the Poisson GLM were not met due to the nature of football, which led to the model producing suspect estimates relative to the betting market. The zero-inflated Poisson GLM was a significant improvement on the Poisson GLM, and I would highly recommend building further modifications on top of it.

There are limitations to betting blindly on a zero-inflated Poisson GLM alone. Other factors that help identify a team’s chance of success include

  1. A team’s recent form
  2. The difference in days of rest between match days
  3. Player injuries, especially to key players
  4. The weather on match day
  5. A managerial change during the season

One possible improvement on the zero-inflated Poisson regression model is the Dixon-Coles model, which includes a time-decay function and might well be my second blog post! Funnily enough, Mark J. Dixon and Stuart G. Coles worked on this model at the University of Nottingham, which is where I recently graduated from.

I hope you enjoyed reading this blog entry and I am always open to feedback, questions etc.