Posted on previous blogging site (31/08/2021)
For this project, the objective is to predict the sale prices of houses using a data set of houses located in the state of Iowa, United States. The data set contains 80 explanatory variables and 1,460 observations. With so many explanatory variables available, models will have a strong tendency to overfit, resulting in poor predictive performance on unseen observations. It will therefore be crucial to analyse how well these models fit and whether the statistical assumptions behind them are met.
Before attempting to fit models, we have a messy data set that needs to be cleaned. Careful consideration will be given to missing values, which can have an impact on model accuracy. Once the data is cleaned, three different models will be fitted and compared. With a finalised model, one last attempt at improvement will involve feature selection based on correlation amongst variables. Finally, a conclusion will present the selected model and the reasoning behind it.
The data sets were acquired from the data community website Kaggle. In this competition, two data sets are provided: a training data set and a test data set. I will refer to the Kaggle test data set as the “final test data”. No sale prices are included in the final test data; it will only be used to submit the best model from this project to Kaggle and see how it ranks against other competitors worldwide.
If you wish to follow along with the code for this report, you can find it on my GitHub page here.
Splitting of the Training Data Set
Before cleaning the data, the very first step will be to split the original data into two data sets. One data set will contain 80% of the observations and will be named the ‘train’ data set, whilst the remaining 20% will form the ‘test’ data set used for model validation.
The main reason for splitting the data is to guard against overfitting, which is very likely if a model is trained on the entire data set. When a model overfits, it essentially behaves like a lookup table, recalling the values it has already learnt. Ask it to predict on a new observation it has not seen before, however, and it will more than likely produce a guess that is nowhere near accurate. By training the model on 80% of the data, we can then observe how it predicts on the unseen 20%.
The reasoning behind splitting the data before cleaning it is to avoid data leakage: the problem of information about the test data set being made available to the model through the training data set. If data outside the training data set is used to develop a model, this can lead to exaggerated model results.
I made sure to randomise the data before the split to avoid introducing unnecessary bias, as many statistical models and concepts rest on the assumption of randomly sampled data.
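As a rough sketch, the split can be done with scikit-learn’s train_test_split, which shuffles the rows before splitting; the file name and random seed below are assumptions for illustration rather than the exact code used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the original Kaggle training file (file name assumed for illustration)
full = pd.read_csv("train.csv")

# Shuffle and split: 80% to train on, 20% held back for model validation.
# shuffle=True randomises the rows; random_state makes the split reproducible.
train, test = train_test_split(full, test_size=0.2, shuffle=True, random_state=42)
```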
Exploring the Data Set
To gain an initial understanding of the data involved in the project, all three data sets will be explored. Sale Price will be used as the dependent variable in the model builds, so it is the most important variable to explore first. A histogram and a box plot are a great way to visualise the distribution of the data whilst examining any possible outliers. These plots can be seen in figures 1a and 1b below.
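A minimal sketch of how such plots can be produced with matplotlib, assuming the `train` data frame from the split above:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Figure 1a: histogram of the dependent variable
axes[0].hist(train["SalePrice"], bins=50)
axes[0].set_title("SalePrice histogram (train)")

# Figure 1b: box plot to highlight potential outliers
axes[1].boxplot(train["SalePrice"])
axes[1].set_title("SalePrice box plot (train)")

plt.tight_layout()
plt.show()
```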
The sale price in both data sets does not appear to be normally distributed, as both histograms are positively skewed rather than symmetric around the mean. This may well mean that a standard linear regression model will not be a good fit. If so, one option for improving performance is to apply a log transformation to the Sale Price variable or to consider other distributions.
Both box plots confirm there are some outliers in the data sets; however, these relate to very expensive houses. The outliers range from 350,000 upwards, and it is entirely plausible for houses to sell at these prices. A quick search on Google shows some houses in Iowa selling in the mid 500,000s, so these outliers are genuine values rather than mistakes or errors.
Of more interest is a check for implausible values such as negative prices, zeros, or very low prices like $5. A summary of the Sale Price variable will give the minimum value along with other summary statistics.
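In pandas this is one line, again assuming the `train` data frame from earlier:

```python
# Count, mean, standard deviation, min, quartiles and max for the dependent variable
print(train["SalePrice"].describe())
```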
A minimum value of 34,900 indicates there are no definite anomalies. Although 34,900 is a very low price for a house in 2017, it is not low enough to be removed from the data set. Again, a Google search for Iowa house prices shows houses selling for around 24,000, so this minimum value is plausible.
Data Cleaning
Data cleaning is an extremely important step in the process of building models. An inaccurate data set can lead to erroneous analysis, ultimately translating to a waste of time … and money! Careful consideration has to be given to messy data, including missing values, anomalies, outliers and duplicates.
Table 2 below shows the variables with the highest percentage of missing values, coded as “N/A”.
When a variable has over 80% of its values missing, it is reasonable to simply remove it from the data set. However, my preferred approach is to keep as much data as possible without deletion. The description text supplied with the data set explains that, for many of these columns, N/A actually means none/no. So instead of deleting columns with many missing values, the “N/A” entries will be changed to “none”.
Summary of the Data Cleaning Changes Made
- For 15 categorical variables, the data description file states that N/A = none, so this was coded accordingly
- “LotFrontage” – The median and mean were very similar across all three data sets, but I opted for the median because a few outliers pulled the mean slightly above the median
- “GarageYrBlt” – Again opted for the median, for the same reason as “LotFrontage”
- “Electrical” – The mode is SBrkr, which is by far the most common value, so it is more than reasonable to replace N/A’s with SBrkr
- “MasVnrArea” – Since the majority of values are 0, it is more than reasonable to replace N/A with 0, as a non-zero area would most likely have been recorded; figures 3a and 3b below show this
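A minimal sketch of these imputation rules, assuming a pandas DataFrame named `train` and showing only a handful of the 15 none-coded columns; the same steps would be applied to the test and final test data.

```python
# Categorical columns where the data description states that N/A means "none"
# (only a few of the 15 columns are listed here for illustration)
none_cols = ["Alley", "PoolQC", "Fence", "MiscFeature", "FireplaceQu"]
for col in none_cols:
    train[col] = train[col].fillna("none")

# Numeric columns: impute with the median, which is robust to the outliers noted above
for col in ["LotFrontage", "GarageYrBlt"]:
    train[col] = train[col].fillna(train[col].median())

# Electrical: impute with the mode (SBrkr, by far the most common value)
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])

# MasVnrArea: the majority of recorded values are 0, so missing is taken to mean 0
train["MasVnrArea"] = train["MasVnrArea"].fillna(0)
```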
After applying the above data cleaning amendments, 13 rows in the final test data still contained “N/A”. At this stage it would be more than reasonable to simply remove them, since they equate to 13/1460 (0.9%) of the observations in the data set. However, the Kaggle submission requires a prediction for every observation, so these “N/A” values were imputed instead.
Now that all three data sets are cleaned, it is time to get started with model fitting.
Model 1 – Multiple Linear Regression
The multiple linear regression model will be the first statistical model fitted to the training data set. It is an extension of the simple linear regression model, predicting the outcome of a dependent variable from several explanatory variables rather than just one. The form of the multiple linear regression model is written as follows …
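Using the standard notation for $p$ explanatory variables:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i $$

where $y_i$ is the sale price of house $i$, $x_{i1}, \dots, x_{ip}$ are the explanatory variables, $\beta_0, \dots, \beta_p$ are the coefficients to be estimated and $\varepsilon_i$ is the error term.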
Assumptions of the Multiple Linear Regression Model
- A linear relationship between the dependent variable (SalePrice) and the independent variables
- Statistical independence of errors
- Constant variance of errors
- Error terms are normally distributed
If any of these assumptions are violated, the forecasts and insights from the regression output may be inefficient and, in the worst case, seriously biased or misleading. This is why it is important to check the model fit by assessing the residual plots, which give a picture of how the error terms behave and whether the model assumptions are being met. The root mean squared error will also be used as the metric for comparing model fits.
Coding Categorical Variables
One issue remains with the data in its current state: the categorical variables. Regression analysis requires variables to be in numerical form, and the workaround for categorical variables is to use dummy variables.
In the following example, observations for the categorical variable “LandSlope” have three options which are as follows …
- Gentle slope
- Moderate slope
- Severe slope
Dummy variables are coded in binary form, leading to a table referred to as the contrast matrix. When using dummy variables we code n−1 of them: with three categories in our example, we only need two new columns, as illustrated in table 3 below.
The original column is dropped as it is no longer needed, and the table is interpreted as follows: if “LandSlope” is gentle, we place 0’s in both the moderate and severe columns; if it is moderate or severe, we place a 1 in the corresponding column.
It is important to know that if we wish to make predictions on other data sets, those data sets must have the same columns as the data the model was trained on. Before creating dummy variables, I therefore combined all three data sets once again, adding an extra column to label which set each row came from. This way all three data sets end up with the same columns, and once the encoding is complete the data can be split back into its respective sets.
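A rough sketch of this combine, encode and split-back step with pandas get_dummies (drop_first=True gives the n−1 coding described above); the data frame names are assumptions carried over from the earlier sketches, not necessarily the exact code used.

```python
import pandas as pd

# Label each data set so the rows can be separated again after encoding
train["dataset"] = "train"
test["dataset"] = "test"
final_test["dataset"] = "final"

combined = pd.concat([train, test, final_test], axis=0)

# One-hot encode the categorical columns; drop_first=True keeps n-1 dummies per variable
labels = combined["dataset"]
combined = pd.get_dummies(combined.drop(columns="dataset"), drop_first=True)

# Split back into the three data sets, which now share identical columns
train = combined[labels == "train"]
test = combined[labels == "test"]
final_test = combined[labels == "final"]
```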
Model 1 – Linear Model Results
The R-squared value indicates that 95.1% of the observed variation in the dependent variable is explained by the independent variables in the multiple regression. This is very encouraging; however, it is well known that R-squared increases as more variables are added to a model, which can encourage overfitting. A measure that counteracts the addition of independent variables is the adjusted R-squared, which increases when a new term improves the model more than would be expected by chance and decreases when it improves the model less than expected. Here an adjusted R-squared of 93.5% is still a very good result. I will now predict sale prices on the test data with the linear regression model and compare the predictions against the actual test sale prices.
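As a sketch of how such a fit and its R-squared values can be obtained with statsmodels, assuming the dummy-encoded `train` and `test` frames from the sketch above (the exact code may differ):

```python
import statsmodels.api as sm

# Separate the dependent variable from the predictors and add an intercept term
X_train = sm.add_constant(train.drop(columns="SalePrice").astype(float))
y_train = train["SalePrice"]

ols = sm.OLS(y_train, X_train).fit()
print(ols.rsquared, ols.rsquared_adj)  # R-squared and adjusted R-squared

# Predict sale prices for the held-out 20%
X_test = sm.add_constant(test.drop(columns="SalePrice").astype(float))
preds = ols.predict(X_test)
```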
Root Mean Squared Error
To evaluate the accuracy of the predictions made by the various statistical models, I will compare their performance using the metric known as the root mean squared error (RMSE). The RMSE is the standard deviation of the residuals and measures how far the data points fall from the regression line; in essence, it measures how spread out the residuals are. The lower the RMSE, the better. The root mean squared error is calculated as follows …
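In standard notation:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} $$

where $y_i$ is the observed sale price, $\hat{y}_i$ is the predicted sale price and $n$ is the number of observations.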
Or, simply, find the RMSE by
1. Squaring the residuals
2. Taking the average of the squared residuals (this gives the Mean Squared Error)
3. Taking the square root of the Mean Squared Error
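A short sketch using scikit-learn and NumPy, assuming the `preds` vector from the regression sketch above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the predictions on the held-out test set
rmse = np.sqrt(mean_squared_error(test["SalePrice"], preds))
print(rmse)
```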
An RMSE of 66,075.93 is very high and indicates that the model has in fact overfitted the data. I will now examine the residual plots to visualise this.
The residual plot shows heteroscedasticity, meaning an unequal scatter of residuals: the spread of the residuals changes systematically as the predicted price increases along the x-axis. This violates the assumption of ordinary least squares regression that residuals are drawn from a population with constant variance. So, to confirm, there is room for improvement.
The histogram of the residuals is encouraging because their shape appears approximately normal, although not completely symmetric; there is a slight negative skew.
In the QQ plot, the red line is fitted through the quantiles. For a normal distribution, the quantiles would lie along this straight red line. Here, although most of the points do lie on the line, the points towards the ends drift away from it, meaning that although the distribution is roughly symmetric, it has fat tails.
Model 2 – Log Linear Regression Model
One possible way to deal with non-normality in a regression model is to log transform the dependent variable. The second model I will fit is therefore a log linear regression model in which the dependent variable (Sale Price) is log transformed. The form of the log linear regression model is as follows …
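In standard notation, the only change from the previous model is the log on the left-hand side:

$$ \ln(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i $$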
Model Assumptions
- Statistical independence of errors
- Constant variance of errors
- Error terms are normally distributed
To produce the log linear regression model, the dependent variable is transformed by taking the natural log of all the sale price values. The figure below is a histogram of the dependent variable after the log transformation.
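A minimal sketch of the transformation and histogram, assuming the `train` frame from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Natural log of the dependent variable
log_price = np.log(train["SalePrice"])

plt.hist(log_price, bins=50)
plt.title("log(SalePrice) histogram (train)")
plt.show()
```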
Compared with figures 1a and 1b, the histogram of the log-transformed sale price now has a normal shape with symmetry around the centre. The skewness has been removed, so this model’s results should improve on the previous model.
Model 2 Results
The residual plot for this model is definitely an improvement, as there is a more random scatter of residuals. The plot appears to show constant variance, with residuals spread evenly above and below the horizontal zero line.
Just as in the previous model, the residuals appear to be normally distributed, although there is a slight positive skew just below 0.
The QQ plot confirms that although the majority of points lie on the red quantile line, there is a fat tail towards the left, where the points move away from the line.
Model 3 – Random Forest Model
The random forest model is a supervised machine learning algorithm which builds multiple decision trees and merges them together, making a prediction by averaging the predictions of each component tree. The random forest generally provides much better predictive accuracy than a single decision tree. A major advantage of the random forest is that it can be used for both classification and regression; for this project it will be used for regression.
Assumptions of the Random Forest Model
- At each step of building an individual tree, the random forest finds the best split of data
- Whilst building a tree, the whole dataset is not used but a bootstrap sample
- Bootstrap aggregation is used, so the sampling is assumed to be representative
The statistical assumptions required by the linear regression models, such as normality of errors, do not need to be met for this model. This is because random forests are non-parametric and determined by the data, so they are able to handle skewed data as well as categorical data.
I opted for the default random forest size of 100 estimators; however, should the random forest provide a competitive RMSE score, I will explore using a different number of estimators.
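A sketch of the fit with scikit-learn, assuming the dummy-encoded frames from the earlier sketches; 100 estimators is also scikit-learn’s default, and the random seed is an assumption added for reproducibility.

```python
from sklearn.ensemble import RandomForestRegressor

X_train = train.drop(columns="SalePrice")
y_train = train["SalePrice"]

# 100 trees, each grown on a bootstrap sample of the training data
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_preds = rf.predict(test.drop(columns="SalePrice"))
```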
This plot shows that the random forest tends to provide reasonable predictions compared to the actual sale prices. Although the random forest slightly over-predicts around the mean, the overall shape aligns.
A root mean squared error of 30,621 is better than the original multiple linear regression model, but nowhere near as good as our best performing model, the log linear regression model. Even though the assumptions of normality and constant variance of errors are not mandatory for the random forest regression model, the residual plots will still be explored to see how the residuals behave.
This residual plot shows a random scatter of points up to a sale price of around 300,000, where the variance starts to widen, and there is a very clear outlier around the 350,000 mark. Although the constant variance assumption is not mandatory for the random forest model, we would still much rather see a random scatter throughout the whole plot.
The histogram of residuals is similar to the previous model’s, with a slight negative skew, but the shape is no longer smoothly normal. I expect the corresponding QQ plot will not follow a straight line through the quantiles.
Indeed, the QQ plot for the random forest model is the worst amongst the three models, as the quantiles do not follow the red line throughout.
Model Comparison
It is very clear that the best-fitting model was model 2, the log linear regression model. With an RMSE of 0.22, it is far more accurate than the other two models. An important takeaway from this comparison is the R² and adjusted R² results: there is very little difference between models 1 and 2, which shows why we cannot rely on these measures alone. Even though adjusted R² penalises adding more variables to a model, it still tends to be high when a model contains many variables.
Feature Selection
The log linear multiple regression model was concluded to be the best, so for the rest of this project I will use only this model and seek further improvement. One way to improve a model’s predictions is feature selection: the process of reducing the number of variables in a model to improve its performance. There are various feature selection techniques; the one I will use is based on correlation.
Correlation is the statistical term referring to how close two variables are to having a linear relationship with each other.
-1 = Perfect negative correlation
0 = No correlation
+1 = Perfect positive correlation
If we have two variables x and y, a correlation of -1 means that as x increases in value, y decreases in value at the same rate. For removing variables from the model, I am only interested in variables which are highly positively correlated: if two variables are highly correlated both with each other and with the rest of the model variables, we can remove the weaker performer of the two to improve model performance.
Correlation Matrix Example
I will illustrate a correlation matrix example with a subset of the data using just seven variables.
The very dark blue squares on the diagonal, with a value of 1, are each variable correlated with itself, so these can be ignored as they are of no use for interpretation. Off the diagonal, we are looking for the darker squares, as these correspond to the more highly positively correlated pairs.
YearRemodAdd and YearBuilt are the most highly correlated pair, with a correlation coefficient of 0.6. If we were to remove one of these variables, we would remove the one that is more positively correlated with the rest of the variables. YearBuilt is more positively correlated with MasVnrArea and BsmtFinSF1, so I would remove it from the model rather than YearRemodAdd.
There are 285 degrees of freedom in our final model, and therefore 286 variables, so it would be far too difficult to eliminate variables by inspecting a correlation matrix by eye. To remove variables based on correlation, I created a correlation function with a threshold of 80%: variables whose positive correlation with the other variables exceeds this threshold are identified in Python and then eliminated from the model. This resulted in 21 variables being removed, which were as follows.
With so many variables in the model, some might argue for a more conservative threshold of 90%, whilst a more aggressive approach might set the threshold at 60%. Personal preference led me to choose 80%, because I think it is high enough whilst still allowing a considerable number of variables to be removed from the model.
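A simplified sketch of such a threshold function, assuming the dummy-encoded `train` frame from earlier; it drops the later-seen column of each offending pair rather than explicitly choosing the more correlated one, so it is an illustration rather than the exact function used.

```python
def high_corr_columns(df, threshold=0.8):
    """Return the columns involved in a pairwise correlation above the threshold,
    dropping the later-seen column of each highly correlated pair."""
    corr = df.corr()
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[i])
    return to_drop

drop_cols = high_corr_columns(train.drop(columns="SalePrice"), threshold=0.8)
train_reduced = train.drop(columns=list(drop_cols))
```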
Model 4 – Log Linear Regression After Feature Selection
Model Comparison
With only a 0.46% decrease in RMSE, feature selection based on highly correlated variables has not made a significant difference at all. I will side with the more conservative option and keep the log linear regression model without feature selection as my proposed final model for this project.
Kaggle Final Test Data Submission
Kaggle allows multiple submission attempts, so both model 2 (log linear regression) and model 4 (log linear regression with feature selection) were submitted. The Kaggle scores were as follows.
No real surprise to see that model 4 performed only marginally better, as was the case in the earlier analysis.
Out of 4622 entrants, my best model (model 4) ranks 3646th, which is just shy of the bottom 20% of submissions. I was certainly hoping for a better rank for a debut submission, but I am more intrigued by the leading scorers with accuracy scores of 0.003! There are other techniques out there which I am yet to learn, so who knows, maybe I will return in a year’s time with new-found knowledge and improve my score!
Project Conclusion
Three models were fitted, with the log linear regression model performing best. By transforming the dependent variable (Sale Price), the model was able to meet all of the assumptions behind linear regression, resulting in a drastically improved root mean squared error.
Although feature selection based on correlation only improved the model marginally in this project, it still shows how the choice of variables can affect model performance. Correlation-based feature selection is just one of many possible methods, so there may be room for improvement by reducing the model’s variables with an alternative feature selection technique.
Nonetheless, the log linear regression model was an excellent fit to the training data, predicted accurately on unseen data, and met all of its statistical assumptions.
If you managed to read all of this project report, I hope you enjoyed the read!