Diagnostics for Leverage and Influence
The location of observations in the data space plays a role in determining the regression coefficients. Some observations lie far from the values predicted by the model and are called outliers: the predicted y value is far from the actual y value of the observation.
In many cases, outliers do not have a large effect on the OLS line. Still, an outlier can cause significant issues because it affects the RSE. Recall that the RSE is used to compute p-values and confidence intervals, so outliers affect the interpretation of model fit. Including outliers can also affect the R squared value.
Residual plots are a great way to visualize outliers. To help decide whether an observation should be deemed an outlier, studentized residuals can be used: each residual is divided by its estimated standard error. When a studentized residual exceeds 3 in absolute value, we should be concerned that the observation is an outlier.
In contrast to an outlier, a high-leverage observation has an unusual x value: the observed value of a predictor is very unusual compared to the other values. As a result, removing a high-leverage observation from the dataset will shift the OLS line, so even a few observations with high leverage can make the model fit questionable. In multiple regression models we can’t simply scan the x values of each variable to spot high-leverage points, because an observation can fall within the normal range of every individual variable yet be unusual when all regressors are considered simultaneously. The leverage statistic h can be used to spot high-leverage observations: its average value is (p+1)/n, and values well above that average (a common cutoff is 2(p+1)/n) may be considered high leverage.
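For concreteness, here is a minimal sketch of how the leverage statistic can be computed by hand as the diagonal of the hat matrix; the helper name leverage_values is purely illustrative, and later on I simply let statsmodels produce the same values.

```python
import numpy as np

def leverage_values(X_design):
    # Leverage h_i is the i-th diagonal element of the hat matrix
    # H = X (X'X)^{-1} X'; X_design should already include the intercept column.
    X = np.asarray(X_design, dtype=float)
    hat = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(hat)
```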
Let’s look at the Boston Housing dataset and see if we can find outliers, leverage values and influential observations. First, I ingested the dataset as usual. For this example, I removed two variables, AGE and INDUS, because they were not significant during the initial fitting procedure.
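A rough sketch of that ingestion step is below; the boston.csv file name and path are assumptions, so point the read at wherever your copy of the data lives.

```python
import pandas as pd

# Load the Boston Housing data; assumes a local CSV with the usual
# column names, including the MEDV target.
boston = pd.read_csv("boston.csv")

# Drop the two regressors that were not significant in the initial fit.
boston = boston.drop(columns=["AGE", "INDUS"])

X = boston.drop(columns=["MEDV"])
y = boston["MEDV"]
```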
I used the statsmodels API for a lot of the heavy lifting. Calling get_influence() on a fitted model returns an OLSInfluence instance with influence and outlier measures. model1 is the OLS regression fitted earlier.
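Continuing from the ingestion sketch, the fitting and influence steps would look roughly like this; the names X_const and influence are illustrative.

```python
import statsmodels.api as sm

# Add an intercept column and fit the OLS model; this is the model1
# referenced throughout the rest of the post.
X_const = sm.add_constant(X)
model1 = sm.OLS(y, X_const).fit()

# OLSInfluence object holding the outlier and influence measures.
influence = model1.get_influence()
```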
In the next block, I wanted to show how to obtain studentized residuals, Cook’s Distances, DFFITS and leverage values one by one. The influence.summary_frame() method provides all of these values automatically, along with DFBETAS for each of the regressors. I also concatenated this table with the MEDV value of each observation and named the resulting data frame MEDVres.
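Roughly, that block would look like the following; the intermediate names (cooks_pvals, dffits_thresh and so on) are illustrative rather than taken from the original code.

```python
import pandas as pd

# The individual measures, one by one.
student_resid = influence.resid_studentized_external  # studentized residuals
cooks_d, cooks_pvals = influence.cooks_distance       # Cook's Distances
dffits_vals, dffits_thresh = influence.dffits         # DFFITS
leverage = influence.hat_matrix_diag                  # leverage (h) values

# The same measures, plus DFBETAS for each regressor, in one table.
summary = influence.summary_frame()

# Attach each observation's MEDV value; this is the MEDVres data frame.
MEDVres = pd.concat([summary, y], axis=1)
```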
Below is the print of influence.summary_frame() before it is concatenated with the MEDV values. The scatter plot of leverage values vs. studentized residuals is also shown.
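A sketch of that scatter plot, using the hat_diag and student_resid columns produced by summary_frame():

```python
import matplotlib.pyplot as plt

# Leverage (hat_diag) vs. externally studentized residuals.
plt.scatter(MEDVres["hat_diag"], MEDVres["student_resid"], alpha=0.5)
plt.xlabel("Leverage (h)")
plt.ylabel("Studentized residual")
plt.title("Leverage vs. studentized residuals")
plt.show()
```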
The next step is to identify outliers using studentized residuals. Studentized residuals can be concerning when their absolute values exceed 2. This is an aggressive stance, and one could relax this criterion and only treat studentized residuals exceeding 3 in absolute value as outliers. It is worth reading the following post about why NaN values can appear at times: https://stats.stackexchange.com/questions/123368/studentized-residuals-undefined. Below I printed the five most negative residuals.
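Something along these lines would print them; this is a sketch rather than the exact code from the post.

```python
# Five most negative studentized residuals, with each observation's MEDV value.
print(MEDVres.sort_values("student_resid")[["student_resid", "MEDV"]].head(5))
```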
We can identify the outliers by simply running the selection below. Again, I used a cutoff of 2 as opposed to the 3 found in some textbooks. The left column contains the index of the observation, while the right column is the MEDV value of the outlier observation.
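The selection amounts to something like this sketch:

```python
# Observations whose studentized residual exceeds 2 in absolute value.
outliers = MEDVres[MEDVres["student_resid"].abs() > 2]
print(outliers["MEDV"])
```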
Now that we have identified the outliers, we need to see which observations have high leverage. As discussed earlier, the leverage cutoff can be calculated as (2k+2)/n, where k is the number of predictors and n is the sample size.
We can now identify all observations with high leverage by simply using the cutoff formula. It appears that there are 61 such observations.
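A sketch of that calculation, reusing the X matrix from the ingestion sketch:

```python
# Leverage cutoff (2k + 2) / n with k predictors and n observations.
n, k = X.shape
leverage_cutoff = (2 * k + 2) / n

# Observations whose leverage exceeds the cutoff (61 in this fit, per the text above).
high_leverage = MEDVres[MEDVres["hat_diag"] > leverage_cutoff]
print(len(high_leverage))
```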
Now that we have identified some outliers and high-leverage observations, let’s bring them together to identify observations with significant influence. When an observation is both an outlier and has high leverage, it is very likely to pull the regression line toward itself by influencing the regression coefficients.
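One way to take that intersection, reusing the outliers and high_leverage frames from the sketches above:

```python
# Observations that are both outliers and high leverage.
influential_idx = outliers.index.intersection(high_leverage.index)
print(MEDVres.loc[influential_idx, ["student_resid", "hat_diag", "MEDV"]])
```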
It is also very useful to look at overall influence, which can be measured by Cook’s Distance and DFFITS. Cook’s Distance is always zero or greater, and the higher the value, the more influential the observation. Many people use three times the mean of Cook’s D as the cutoff for deeming an observation influential.
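A sketch of that rule of thumb, using the cooks_d column of summary_frame():

```python
# Cutoff: three times the mean Cook's Distance.
cooks_cutoff = 3 * MEDVres["cooks_d"].mean()
cooks_influential = MEDVres[MEDVres["cooks_d"] > cooks_cutoff]
print(len(cooks_influential))
```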
DFFITS is also designed to identify influential observations, with a conventional cutoff of 2*sqrt(k/n). Unlike Cook’s Distances, DFFITS can be both positive and negative; values close to zero correspond to points with little or no influence on the OLS regression line.
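A sketch of the DFFITS screen, mirroring the cutoff stated above and reusing k and n from the leverage sketch:

```python
import numpy as np

# DFFITS cutoff as used here: 2 * sqrt(k / n).
dffits_cutoff = 2 * np.sqrt(k / n)
dffits_influential = MEDVres[MEDVres["dffits"].abs() > dffits_cutoff]
print(len(dffits_influential))
```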
The resulting table is very wide and long. As a result, I am including a small excerpt to demonstrate what it looks like.
The yellowbrick package allows us to visualize Cook’s Distances.
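A sketch of that visualization; note that the CooksDistance visualizer fits its own regression internally, so it takes the raw X and y rather than the already-fitted model1.

```python
from yellowbrick.regressor import CooksDistance

# Plot Cook's Distance for every observation, with yellowbrick's
# default influence threshold drawn as a horizontal line.
viz = CooksDistance()
viz.fit(X, y)
viz.show()
```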
And this brings us to DFBETAS. Cook’s Distances and DFFITS are general measures of influence, while DFBETAS are variable specific: they show how influential each observation is on each corresponding coefficient. DFBETAS are provided as part of the influence.summary_frame() output, but it is worth visualizing them.
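One way to visualize them is a stem plot per coefficient; this is a sketch, and the dfb_ prefix is simply how summary_frame() names the DFBETAS columns.

```python
import matplotlib.pyplot as plt

# DFBETAS columns in summary_frame() are prefixed with "dfb_".
dfb_cols = [c for c in MEDVres.columns if c.startswith("dfb_")]

fig, axes = plt.subplots(len(dfb_cols), 1, figsize=(8, 2 * len(dfb_cols)), sharex=True)
for ax, col in zip(axes, dfb_cols):
    ax.stem(MEDVres.index, MEDVres[col])
    ax.set_ylabel(col)
axes[-1].set_xlabel("Observation index")
plt.tight_layout()
plt.show()
```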
So let’s use the findings. I fitted two different regression models. First, I removed the observations that were deemed influential and fitted an OLS model. Second, I removed all outliers, and then fitted another OLS model.
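A sketch of both refits, reusing objects from the earlier sketches; here I drop the observations flagged as both outlier and high leverage for the first model, and the Cook’s Distance-flagged observations for the second, which matches the description that follows. The AIC/BIC prints feed the comparison discussed next.

```python
import statsmodels.api as sm

# Model without the observations flagged as both outliers and high leverage.
no_infl = boston.drop(index=influential_idx)
model_no_infl = sm.OLS(no_infl["MEDV"],
                       sm.add_constant(no_infl.drop(columns=["MEDV"]))).fit()

# Model without the observations flagged by the Cook's Distance cutoff.
no_out = boston.drop(index=cooks_influential.index)
model_no_out = sm.OLS(no_out["MEDV"],
                      sm.add_constant(no_out.drop(columns=["MEDV"]))).fit()

# Compare fit statistics against the initial model.
print("AIC:", model1.aic, model_no_infl.aic, model_no_out.aic)
print("BIC:", model1.bic, model_no_infl.bic, model_no_out.bic)
```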
A comparison of the AIC and BIC values of this model to the initial model shows that our model fit improved as a result of removing the influential observations.
The model above was fitted after removing all outliers flagged by the Cook’s Distance analysis. The model fit improved even further: the R squared value increased compared both to the initial fit and to the fit without the influential observations.
The conclusion is that model fit can be improved by identifying and removing outliers, high-leverage observations and influential observations. Having said that, deleting observations may not be desirable, and other methods can be used to deal with influential values, such as capping them at an artificial value or replacing them with the mean. Machine learning can be used to try different options in search of a better fit.