OLS Regression: Boston Housing Dataset
In this section I wanted to demonstrate how to fit a multiple regression model using the Scikit-learn library in Python. Scikit-learn allows us to fit many different supervised and unsupervised models, but here I focused on fitting a linear regression model, which is housed in the linear_model module of sklearn.
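As a minimal sketch, the estimator can be imported straight from that module:

```python
# The OLS estimator lives in sklearn's linear_model module
from sklearn.linear_model import LinearRegression

ols = LinearRegression()  # ordinary least squares with default settings
```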
When fitting linear models, we must be diligent with respect to discovering and fixing issues that frequently occur in real world data. Patterns in data frequently result in violations of regression assumptions:
1. Non-linear relationships between regressors and the response;
2. Correlation of error terms;
3. Non-constant variance of error terms or heteroscedasticity;
4. Presence of outliers;
5. Presence of influential observations or high-leverage points; and
6. Multicollinearity.
In this section, I wanted to show how to diagnose these issues; solutions to them are offered in subsequent posts. Several datasets are available from sklearn. I will work through the Boston Housing dataset in this example, which is available in sklearn as a toy dataset. Upon loading the dataset, we can see that it has 506 rows and 13 columns.
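A sketch of loading it might look like the following; note that load_boston has since been removed from scikit-learn (version 1.2 and later), so this assumes an older release:

```python
# Load the Boston Housing toy dataset.
# Note: load_boston was removed in scikit-learn 1.2; this assumes an older release.
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.data.shape)      # (506, 13) -> 506 rows, 13 feature columns
print(boston.feature_names)   # CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT
print(boston.DESCR)           # data dictionary / dataset description
```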
A description of the dataset, including a data dictionary, is available. It is a rather old dataset, but it is perfect for demonstrating regression analysis because it is loaded with violations of regression assumptions.
First, I created two data frames, one containing all regressors and one for the target variable.
I concatenated the two data frames and called the result boston_df. It now contains all regressors and the target variable. We are now ready for Exploratory Data Analysis!
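Assuming the Bunch object loaded above, one way this step could look is:

```python
import pandas as pd

# One data frame for the regressors, one for the target (median home value, MEDV)
features = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.DataFrame(boston.target, columns=['MEDV'])

# Concatenate into a single data frame for EDA
boston_df = pd.concat([features, target], axis=1)
print(boston_df.shape)   # (506, 14)
boston_df.head()
```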
Exploratory Data Analysis:
It is tempting to skip Exploratory Data Analysis. However, EDA allows us to understand the data, and it allows us to discover issues that may affect the results of statistical models.
Distribution of Variables:
We can start by looking at the distribution of our data. The target (MEDV) looks fairly normal, although there is a right skew, which could cause problems.
The distributions of the features are far from ideal. With the exception of RM, all other features exhibit non-normal distributions. For now, I will leave this alone and attempt to fix the problem later. I also noticed that one variable (CHAS) is a binary categorical variable.
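These histograms can be produced directly from boston_df with pandas; a minimal sketch:

```python
import matplotlib.pyplot as plt

# Histograms of the target and every feature to inspect their distributions
boston_df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```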
Linearity Assumption:
When fitting a linear regression model, it is assumed that there is a linear relationship between each regressor and the target. The first task is to plot each regressor against the target (MEDV). Residual analysis will also be completed after the model is fit to look at the same problem.
Multiple plots show non-linear relationships (ZN, INDUS, NOX, LSTAT, AGE, DIS, B). CRIM has multiple extreme observations, while observations at the 0 CRIM level show no relationship with home values. RAD and TAX have no relationship with home values at their highest levels, which is certainly a cause for concern.
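The post does not show the plotting code, but a grid of scatter plots against MEDV could be produced roughly like this:

```python
import matplotlib.pyplot as plt

# Scatter plot of each regressor against the target to eyeball linearity
feature_cols = boston_df.columns.drop('MEDV')
fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for ax, col in zip(axes.ravel(), feature_cols):
    ax.scatter(boston_df[col], boston_df['MEDV'], s=8, alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel('MEDV')
for ax in axes.ravel()[len(feature_cols):]:   # hide the unused panels
    ax.set_visible(False)
plt.tight_layout()
plt.show()
```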
Correlations:
One should always look at a correlation matrix. It shows which features are correlated with the target and the direction of that correlation. One would like to see high correlations between the regressors and the target variable (negative or positive, depending on whether a feature is expected to have a negative or positive impact on home values). For example, crime rates should be inversely correlated with home prices.
Correlation coefficients can be evaluated by a rudimentary rule of thumb: 0 means no correlation, 0.01-0.3 is weak, 0.31-0.7 moderate, 0.7-0.9 strong, and 0.9-1 very strong. In our example, only two variables reach the vicinity of 0.7: LSTAT and RM.
Also, cross-correlation between regressors is not desirable because it introduces collinearity into the model. Multicollinearity will be discussed in a few paragraphs. The correlation table below shows the cross-correlation of all variables. Unfortunately, the Boston Housing dataset contains many moderate to strong cross-correlations between regressors. I cut the picture on the right at B because it was too large to display.
The seaborn package may also be helpful to visualize correlations. Note the legend on the right side. Darker colors mean stronger correlations, although the direction of the correlations is not shown. Several regressors are considerably correlated with one another, which is a grave concern and will have to be addressed in regression modeling.
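A sketch of the correlation table and heatmap, assuming the boston_df frame from earlier (absolute values are used so that darker means stronger, dropping the direction as in the figure described above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Cross-correlation of all variables, plus the correlations with the target
corr = boston_df.corr()
print(corr['MEDV'].sort_values(ascending=False))

# Heatmap of the absolute correlations: darker cells = stronger correlation
plt.figure(figsize=(10, 8))
sns.heatmap(corr.abs(), cmap='Blues')
plt.show()
```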
We are now ready to fit a model!
Regression Model:
In order to use Scikit-learn, the package needs to be installed. This can be done from a terminal window with pip install -U scikit-learn. An important note is that importing sklearn does not automatically import its sub-modules, so they must be imported explicitly. In this example, I had to import train_test_split from sklearn.model_selection, which will be used for partitioning the dataset.
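The relevant imports, roughly:

```python
# From a terminal:  pip install -U scikit-learn
# Importing sklearn does not pull in its sub-modules, so import them explicitly:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
```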
For statistical learning, it is necessary to partition the dataset. When partitioning the data with sklearn, we need a data frame that has all regressors and another that has the target variable. Note that the original data was already provided that way; I concatenated the two data frames only because I thought it was useful to have a single data frame for EDA. Some may consider that not necessary…
Here, I held out 80% of the data for training and set aside 20% for testing. We can see that there are now 404 training observations in both the Y_train (target) and X_train (features) data frames. The test sets contain 102 observations in both corresponding data frames.
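A sketch of the split; the random_state value is arbitrary and only fixes the shuffle:

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split of the features (X) and the target (Y)
X = boston_df.drop('MEDV', axis=1)
Y = boston_df['MEDV']
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

print(X_train.shape, Y_train.shape)   # (404, 13) (404,)
print(X_test.shape, Y_test.shape)     # (102, 13) (102,)
```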
Let’s run our first regression model and see how we did. The intercept and coefficients can be printed manually. I also produced a plot that compares actual and predicted home values based on the test data. We can observe some significant problems; for example, a large number of actual values were over-predicted, indicating poor model fit and accuracy.
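A sketch of the fit, the printed parameters, and the actual vs. predicted plot, using the training and test frames from above:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training partition
lm = LinearRegression()
lm.fit(X_train, Y_train)

print('Intercept:', lm.intercept_)
print(pd.Series(lm.coef_, index=X_train.columns, name='Coefficient'))

# Compare actual and predicted home values on the test data
Y_pred = lm.predict(X_test)
plt.scatter(Y_test, Y_pred, alpha=0.7)
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.title('Actual vs. predicted home values (test set)')
plt.show()
```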
I also printed some goodness-of-fit statistics: Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). It is hard to judge whether these scores are good or bad until we start building newer, more appropriate models. All three were computed on the test data.
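These metrics can be computed with sklearn.metrics; a sketch:

```python
import numpy as np
from sklearn import metrics

# Goodness-of-fit statistics computed on the test data
mae = metrics.mean_absolute_error(Y_test, Y_pred)
mse = metrics.mean_squared_error(Y_test, Y_pred)
rmse = np.sqrt(mse)
print(f'MAE:  {mae:.2f}\nMSE:  {mse:.2f}\nRMSE: {rmse:.2f}')
```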
Finally, some very important model statistics can be obtained from the OLS Regression Results table. First, the F-statistic is large at 661, certainly larger than 1, which allows us to reject the null hypothesis and conclude that at least one of the variables must be related to the target MEDV. The R-squared is also large at 0.738, indicating that the model explains roughly 74% of the variance in the target variable. We also get two fit statistics, the Akaike Information Criterion and the Bayesian Information Criterion, which are large but difficult to evaluate on their own. They will be useful when comparing alternative models.
Unfortunately, some variables were not significant contributors to the explanatory power of the model (INDUS, CHAS, NOX, AGE). We also get a warning in this table about multicollinearity: "The condition number is large, 8.97e+03. This might indicate that there are strong multicollinearity or other numerical problems."
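scikit-learn itself does not print this table; the OLS Regression Results summary typically comes from statsmodels fitted on the same training data, roughly as follows:

```python
import statsmodels.api as sm

# Fit OLS with statsmodels to get the inference table discussed above
X_train_const = sm.add_constant(X_train)          # add the intercept column
ols_model = sm.OLS(Y_train, X_train_const).fit()
print(ols_model.summary())   # F-statistic, R-squared, AIC/BIC, p-values, condition number
```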
Non-linear relationships between regressors and the response:
We already saw during EDA that some of the features are not linearly related to the target. We can look at the non-linearity issue by plotting residuals vs. the fitted values. The plot should show no pattern, and the presence of a pattern indicates issues with the linearity assumption. Further, the points should be distributed in a horizontal band around zero. The residual plot shows a U-shaped pattern, which is an indication of non-linearity.
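Using the statsmodels fit sketched above, the residuals vs. fitted plot can be drawn like this:

```python
import matplotlib.pyplot as plt

# Residuals vs. fitted values; a pattern (here U-shaped) suggests non-linearity
plt.scatter(ols_model.fittedvalues, ols_model.resid, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted')
plt.show()
```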
Plotting observed vs. predicted values against each other may also be helpful. Linearity will cause points to be symmetrically distributed along the diagonal of the plot, while a non-symmetrical distribution is an indication that the linearity assumption is violated.
Right type of regression used:
The Versus Fits plot is also useful in determining whether the right type of regression has been used. The curved pattern shows that the chosen regression was of the wrong type.
Correlation of error terms:
As discussed earlier, an important assumption of linear regression modeling is that the error terms must NOT be correlated. Correlation among the error terms causes the standard errors to be underestimated, and confidence and prediction intervals will be narrower than they should be. Additionally, the p-values in the model will be underestimated, which may prompt us to consider a parameter statistically significant in error. When error terms are correlated, the fits vs. residuals plot frequently shows a systematic pattern rather than random scatter.
Independence (or no correlation) is typically expressed in terms of the error terms, i.e. the error terms must be independent. Since the error terms are not known, we can estimate them with the residuals, obtained by subtracting the fitted values from the actual values for each observation.
The best way to check the independence assumption is to plot residuals against regressor values. A random pattern indicates independence, while a plot with a discernible shape is usually an indication that the error terms are correlated. The plots below show clear patterns for all variables, indicating that the independence assumption of regression is violated.
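A sketch of the per-regressor residual plots, reusing the statsmodels fit from earlier:

```python
import matplotlib.pyplot as plt

# Residuals plotted against each regressor; a visible shape suggests correlated errors
fig, axes = plt.subplots(4, 4, figsize=(16, 12))
for ax, col in zip(axes.ravel(), X_train.columns):
    ax.scatter(X_train[col], ols_model.resid, s=8, alpha=0.6)
    ax.axhline(0, color='red', linewidth=0.8)
    ax.set_xlabel(col)
    ax.set_ylabel('Residual')
for ax in axes.ravel()[len(X_train.columns):]:   # hide the unused panels
    ax.set_visible(False)
plt.tight_layout()
plt.show()
```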
Non-constant variance of error terms or heteroscedasticity:
When the variance of the error terms increases with the value of the response (e.g. the residuals vs. fitted plot exhibits a funnel shape), we are dealing with heteroscedasticity, or non-constant variance. Transformation of the variables may help the problem, but for now we can at least spot it on a plot.
The residuals vs. fitted plot for the Boston Housing model shows an initially decreasing and then increasing variance of the error terms as the predicted values increase. There are hypothesis tests available for testing the homoscedasticity assumption, but these tests are notoriously unreliable because they themselves are based on models with their own sets of assumptions.
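For completeness, and although such a test is not run in this post, a Breusch-Pagan test as implemented in statsmodels might look like this:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: a small p-value rejects the hypothesis of constant variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_model.resid, X_train_const)
print(f'LM p-value: {lm_pvalue:.4g}, F p-value: {f_pvalue:.4g}')
```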
Outliers:
An outlier is an observation that is far from its predicted value. Outliers may have an impact on the regression line, although that impact may not be significant. However, outliers may affect the residual standard error (RSE), even if they do not have much impact on the OLS fit. Because the RSE is used for the computation of confidence intervals and p-values, outliers do have an impact on how we interpret model statistics. Outliers can also have a negative impact on R-squared values by deflating them in error.
There are alternative ways to find outliers. One effective way is to plot the studentized residuals (i.e. each residual divided by its estimated standard error). Observations with studentized residuals above 3 in absolute value are usually considered outliers.
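Rather than plotting, a sketch that flags them numerically via statsmodels' influence diagnostics:

```python
import numpy as np

# Studentized residuals from the statsmodels fit; |value| > 3 flags likely outliers
influence = ols_model.get_influence()
student_resid = influence.resid_studentized_internal
outliers = X_train.index[np.abs(student_resid) > 3]
print('Potential outliers (row indices):', list(outliers))
```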
The IQR method is yet another option. An observation may be considered an outlier if it falls outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. Other thresholds may also be used; the cutoff is simply based on the analyst's tolerance.
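A sketch of the IQR rule applied to the target, using the conventional 1.5 multiplier (other multipliers are possible, as noted):

```python
# IQR rule on the target: flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = boston_df['MEDV'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = boston_df[(boston_df['MEDV'] < lower) | (boston_df['MEDV'] > upper)]
print(f'{len(iqr_outliers)} observations fall outside the IQR fences')
```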
Presence of influential observations or high-leverage points:
High-leverage points have a significant impact on the regression line, which may be a significant problem if the fitted model is driven by a handful of observations. In fact, the entire fit may be compromised by high-leverage points. The leverage statistic is the best way to identify them. The leverage statistic is always between 1/n and 1, and its average value equals (p+1)/n, where p is the number of features and n the number of observations. Any observation whose leverage exceeds this average by a considerable amount can be deemed a high-leverage point. Plotting predicted and actual observations may also be very helpful.
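A sketch of the leverage calculation; the 3x-average cutoff below is one common convention rather than a rule from this post:

```python
# Leverage (hat) values from the statsmodels fit; the average leverage is (p + 1) / n
leverage = ols_model.get_influence().hat_matrix_diag
n, p = X_train.shape
avg_leverage = (p + 1) / n
# 3x the average is one common (but arbitrary) cutoff for "considerably exceeds"
high_leverage = X_train.index[leverage > 3 * avg_leverage]
print(f'Average leverage: {avg_leverage:.4f}')
print('High-leverage observations (row indices):', list(high_leverage))
```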
Normal Distribution:
I plotted the histogram of the residuals, but the choice of bin width influences the picture of the distribution. A far better way to assess the normality assumption is a Q-Q plot, a probability plot designed to show whether the data came from a normal distribution. If it did, the points will lie along the diagonal line.
Q-Q plots are also useful for visualizing outliers, which appear as observations deviating far from the diagonal at the upper or lower edges.
In the case of our Boston Housing data, the normality assumption is violated toward the upper edge of the diagonal line. These points could also be observed in the original histogram of the target variable, MEDV.
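A Q-Q plot of the residuals can be produced with statsmodels; a sketch:

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot of the residuals; points should hug the 45-degree line under normality
sm.qqplot(ols_model.resid, line='45', fit=True)
plt.title('Q-Q plot of residuals')
plt.show()
```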
Multicollinearity:
Multicollinearity refers to correlations among multiple features. As a result of cross-correlation among multiple variables, it is difficult to tease out the effect of individual regressors on the target. Collinearity negatively impacts the accuracy of the coefficient estimates and causes their standard errors to increase. Collinearity may also result in lower t-statistics, which may cause us to deem a feature insignificant in error.
Note that the correlation matrix is not sufficient to detect collinearity, because it is possible to have collinearity among variables even when no bivariate correlation coefficient is particularly high; this situation is called multicollinearity. It can be detected by computing the variance inflation factor (VIF). When the VIF is 1, there is no collinearity. When values exceed 5 or 10, we have detected collinearity. In smaller samples 5 is the preferred cutoff, while in larger samples 10 is acceptable.
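A sketch of the VIF computation with statsmodels, reusing the design matrix with the constant column from earlier:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per regressor: ~1 means no collinearity; values above 5-10 are a red flag
vif = pd.Series(
    [variance_inflation_factor(X_train_const.values, i)
     for i in range(1, X_train_const.shape[1])],   # column 0 is the constant
    index=X_train.columns)
print(vif.sort_values(ascending=False))
```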
Eigenvalues of the regressor correlation matrix that are close to zero may also indicate multicollinearity.
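A minimal sketch of that check on the regressor correlation matrix:

```python
import numpy as np

# Eigenvalues of the regressor correlation matrix; values near zero signal
# near-linear dependence among the features
eigenvalues = np.linalg.eigvalsh(X_train.corr())
print(eigenvalues)   # returned in ascending order
```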