Multicollinearity
Regressors are orthogonal when there is no linear relationship between them. Unfortunately, linear dependencies frequently exist in real-life data, a condition referred to as multicollinearity. Multicollinearity can cause significant problems during model fitting. For example, multicollinearity between regressors may inflate the variances and covariances of the OLS estimators, which leads to unstable, poor parameter estimates. In practice, multicollinearity often pushes the parameter estimates to larger absolute values than they should have, and coefficients have even been observed to switch signs in multicollinear data. In sum, multicollinearity should prompt us to question the validity and reliability of the specified model.
Multicollinearity can be detected by looking at eigenvalues as well. When multicollinearity exists, at least one eigenvalue of the correlation matrix of the regressors is close to zero, which suggests there is minimal variation in the data along the direction of the corresponding eigenvector. The eigenvalues of that correlation matrix sum to the number of regressors, and each eigenvalue measures the variance along the direction defined by its rotated axis. No variation along a direction means little to no ability to detect trends in that direction. The rotation defines new axes as weighted averages (linear combinations) of the standardized regressors, with the first new variable accounting for the largest share of variance along a single axis.
Each eigenvector explains variation in the data orthogonal to the other eigenvectors, and its eigenvalue shows how much variation lies in that direction. When an eigenvalue is close to zero, we look at the corresponding eigenvector: the indices of its large entries show which regressors are collinear.
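Here is a minimal sketch of that eigen-analysis, assuming the regressors sit in a hypothetical pandas DataFrame named `X`:

```python
import numpy as np
import pandas as pd

# Assumes X is a pandas DataFrame containing only the regressors (hypothetical name).
# Standardize so the eigenvalues of the correlation matrix sum to the number of regressors.
X_std = (X - X.mean()) / X.std()
corr = np.corrcoef(X_std, rowvar=False)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
print("Eigenvalues:", np.round(eigenvalues, 3))

# For an eigenvalue near zero, the large entries of its eigenvector point to the
# regressors involved in the near-linear dependency.
loadings = pd.Series(eigenvectors[:, 0], index=X.columns)
print(loadings.sort_values(key=np.abs, ascending=False))
```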
How to Detect Multicollinearity Easily
Printing and inspecting bivariate correlations of the predictors is not good enough for evaluating multicollinearity, because a near-linear dependency can involve three or more variables that are only weakly correlated pairwise. On the flip side, in certain cases a high correlation between two variables does not result in problematic collinearity (e.g., the VIF associated with a variable is not high).
Instead, one should use the variance inflation factor, or VIF, which can be computed for each regressor by fitting an OLS model that has the regressor in question as the target variable and all other regressors as features: VIF_j = 1 / (1 − R_j²), where R_j² is the R-squared of that auxiliary regression. If a strong relationship exists between the target (i.e., the regressor in question) and at least one other regressor, R_j² approaches 1 and the VIF is high. What is high? Textbooks usually suggest 5 or 10 as the cutoff above which the VIF suggests the presence of multicollinearity. So which one, 5 or 10? If the dataset is very large with a lot of features, a VIF cutoff of 10 is acceptable. Smaller datasets require a more conservative approach, where the cutoff may need to be dropped to 5. I have seen people use even lower thresholds, and the purpose of the analysis should dictate which threshold to use.
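To make the mechanics concrete, here is a small sketch that computes VIFs by hand via those auxiliary regressions; the function name `vif_by_hand` is my own, not a library API:

```python
import pandas as pd
import statsmodels.api as sm

def vif_by_hand(X: pd.DataFrame) -> pd.Series:
    """Compute each regressor's VIF by regressing it on all the other regressors.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from the auxiliary regression.
    """
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r_squared = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)
```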
What to do about Multicollinearity
Let me start with a fallacy. Some suggest that standardization helps with multicollinearity. It does not: standardizing (centering and scaling) the regressors leaves the correlations between them unchanged.
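A quick sanity check of that claim, on deliberately collinear synthetic data (the variable names below are just for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # deliberately collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2})

X_std = (X - X.mean()) / X.std()
print(X.corr().round(3))       # very high correlation
print(X_std.corr().round(3))   # identical correlation matrix after standardization
```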
One approach is to remove regressors that are correlated with others. Another is principal component analysis, or PCA. There are also regression methods that can cope with the problem, such as partial least squares regression or penalized methods like ridge and lasso regression. Finally, it may be acceptable to do nothing if precise parameter estimates are not important (e.g., the goal is prediction rather than inference). So let’s look at multicollinearity in the context of the Boston Housing dataset:
The model fitted below is identical to the ones in some earlier posts in that it is very rudimentary: it contains all the regressors of the Boston Housing dataset without any treatment. It is worth looking at because I would like to compare how treating multicollinearity changes some of the model’s fit statistics.
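A sketch of that baseline fit, assuming the Boston Housing data has already been loaded into a hypothetical DataFrame named `boston` with the usual columns and MEDV as the target:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Assumes `boston` is a DataFrame with the Boston Housing columns and MEDV as target.
X = boston.drop(columns="MEDV")
y = boston["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline model: all regressors, no treatment of multicollinearity.
baseline = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(baseline.summary())
```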
Let’s assess multicollinearity using variance inflation factors. Notice that a constant was added, since the statsmodels API does not automatically include an intercept. This has caused me plenty of headaches in the past, as I kept forgetting to add the constant. Anyway, the printed VIFs show that there is collinearity in the data: both RAD and TAX have VIFs well above 5.
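Continuing the sketch above, the VIFs can be computed with statsmodels' `variance_inflation_factor`:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add the constant explicitly; statsmodels does not include an intercept by itself.
X_const = sm.add_constant(X_train)

vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
# The VIF of the constant itself can be ignored; look at the regressor columns.
print(vifs.round(2))
```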
In the following code I dropped RAD. Normally, one would start by dropping the regressor with the highest VIF. I did start by removing TAX, but realized that doing so increased the model's error, and removing RAD turned out to be a much better choice. First, we can observe that removing RAD fixes all of the high VIFs.
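A sketch of that step, reusing the names from the blocks above:

```python
# Drop RAD from both splits and recompute the VIFs on the reduced design matrix.
X_train_reduced = X_train.drop(columns="RAD")
X_test_reduced = X_test.drop(columns="RAD")

X_const_reduced = sm.add_constant(X_train_reduced)
vifs_reduced = pd.Series(
    [
        variance_inflation_factor(X_const_reduced.values, i)
        for i in range(X_const_reduced.shape[1])
    ],
    index=X_const_reduced.columns,
)
print(vifs_reduced.round(2))
```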
Now, if we fit a new model on the dataset without RAD as a regressor, the model performs somewhat better on predictions for the test data. While the new model's AIC and BIC were both slightly higher, the calculated test error (MAE) decreased:
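A sketch of that comparison, again continuing from the earlier blocks; the exact numbers will depend on the train/test split:

```python
from sklearn.metrics import mean_absolute_error

# Fit the reduced model (RAD dropped) and compare it with the baseline.
reduced = sm.OLS(y_train, sm.add_constant(X_train_reduced)).fit()

print("AIC:", round(baseline.aic, 1), "->", round(reduced.aic, 1))
print("BIC:", round(baseline.bic, 1), "->", round(reduced.bic, 1))

pred_full = baseline.predict(sm.add_constant(X_test))
pred_reduced = reduced.predict(sm.add_constant(X_test_reduced))
print("Test MAE (all regressors):", mean_absolute_error(y_test, pred_full))
print("Test MAE (RAD dropped):   ", mean_absolute_error(y_test, pred_reduced))
```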