Diagnostics for Leverage and Influence
The location of observations in the predictor and response space plays a role in determining regression coefficients. Observations whose actual y values lie far from the values predicted by the model are called outliers.
In many cases, outliers do not have a large effect on the OLS line itself. Even so, an outlier can cause significant issues because it inflates the RSE. Recall that the RSE is used to compute p-values and confidence intervals, so outliers affect the interpretation of model fit. Including outliers can also reduce R squared.
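To make this concrete, here is a small sketch on synthetic data (not the Boston dataset) showing how a single outlier inflates the RSE and lowers R squared while barely moving the slope; the variable names are illustrative.

#Illustrative sketch: effect of one outlier on RSE and R squared (synthetic data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 1, 50)

clean_fit = sm.OLS(y, sm.add_constant(x)).fit()

#Add one observation whose y value is far from its predicted value
x_out = np.append(x, 5.0)
y_out = np.append(y, 40.0)
outlier_fit = sm.OLS(y_out, sm.add_constant(x_out)).fit()

for name, fit in [("clean", clean_fit), ("with outlier", outlier_fit)]:
    rse = np.sqrt(fit.mse_resid)  #residual standard error
    print(name, "slope:", round(fit.params[1], 3), "RSE:", round(rse, 3), "R^2:", round(fit.rsquared, 3))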
Residual plots are a great way to visualize outliers. To help decide whether an observation should be deemed an outlier, studentized residuals may be used. A studentized residual is computed by dividing the residual by its estimated standard error. When the absolute value of a studentized residual exceeds 3, we should be concerned that the observation is an outlier.
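As a sketch of what the studentization does, the snippet below pulls both the internally and externally studentized residuals from a fitted statsmodels OLS results object (such as the model fitted later in this post) and recomputes the internal version by hand from the raw residuals and leverage values; the function name is my own.

#Sketch: studentized residuals from a fitted statsmodels OLS results object
import numpy as np

def studentized_residuals(results):
    infl = results.get_influence()
    e = np.asarray(results.resid)                 #raw residuals
    h = infl.hat_matrix_diag                      #leverage of each observation
    sigma = np.sqrt(results.mse_resid)            #residual standard error
    internal = e / (sigma * np.sqrt(1 - h))       #internally studentized
    external = infl.resid_studentized_external    #externally studentized (used for outlier screening)
    return internal, external

#Usage (once a model has been fitted):
#internal, external = studentized_residuals(model1)
#print(np.where(np.abs(external) > 3)[0])         #indices of potential outliers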
In contrast to an outlier, a high-leverage observation has an unusual x value: the observed value of a predictor is very unusual compared to the other values. As a result, removing a high-leverage observation from the dataset can noticeably change the OLS line, so just a few observations with high leverage may make the model fit questionable. In multiple regression we cannot simply scan the x values of each variable separately to spot high-leverage points, because an observation may fall within the normal range of every individual variable yet be unusual when all regressors are considered simultaneously. The leverage statistic h can be used to spot high-leverage points: the leverages average to (p+1)/n, and observations whose leverage is well above this average, for example greater than 2(p+1)/n, may be considered high-leverage points.
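Below is a minimal sketch of the leverage statistic itself: the diagonal of the hat matrix H = X(X'X)^-1 X'. The leverages always average to (p+1)/n, so a rule of thumb flags observations whose leverage exceeds a multiple of that average, for example 2(p+1)/n, which is the cutoff used later in this post. The function names and the multiplier are illustrative.

#Sketch: leverage values and a rule-of-thumb cutoff
import numpy as np

def leverage_values(X):
    #X is an n x p matrix of predictors; an intercept column is added here
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    H = X @ np.linalg.inv(X.T @ X) @ X.T          #hat matrix
    return np.diag(H)

def high_leverage_indices(X, multiplier=2):
    h = leverage_values(X)
    n = np.asarray(X).shape[0]
    p_plus_1 = np.asarray(X).shape[1] + 1         #predictors plus the intercept
    cutoff = multiplier * p_plus_1 / n
    return np.where(h > cutoff)[0], cutoff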
Let’s look at the Boston Housing dataset and see if we can find outliers, high-leverage points and influential observations. First, I ingested the dataset as usual. For this example, I removed two variables, AGE and INDUS, because they were not significant during the initial fitting procedure.
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf  #aliased separately so it does not shadow statsmodels.api
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from sklearn.model_selection import train_test_split
import math
import time
#IMPORT DATASET
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()

#Create analytical data set
#Create dataframe from feature_names
boston_features_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
#Create dataframe from target
boston_target_df = pd.DataFrame(data=boston.target, columns=['MEDV'])

#We know from prior work that two variables were not significant contributors and need to be removed (AGE and INDUS)
#Drop AGE and INDUS
boston_features2_df = boston_features_df.drop(columns=['AGE', 'INDUS'])

#Partition the data
#Create training and test datasets
import sklearn
from sklearn.model_selection import train_test_split  #sklearn import does not automatically install sub packages
X = boston_features2_df
Y = boston_target_df
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.20, random_state = 5)

#For demonstration
from statsmodels.formula.api import ols

#Model statistics
model1 = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print_model1 = model1.summary()
print(print_model1)
I used the statsmodels API for a lot of the heavy lifting. Calling get_influence() on a fitted model returns an OLSInfluence instance that holds the influence and outlier measures; model1 is the OLS regression fitted above.
influence = model1.get_influence()
pd.Series(influence.hat_matrix_diag).describe()

count    404.000000
mean       0.029703
std        0.028211
min        0.005690
25%        0.014799
50%        0.020744
75%        0.036086
max        0.351723
dtype: float64
In the next block, I show how to obtain studentized residuals, Cook’s Distances, DFFITS and leverage values one by one. The influence.summary_frame() call provides all of these values at once, along with DFBETAS for each regressor. I then concatenated this table with the MEDV value of each observation and named the resulting data frame MEDVres.
influence = model1.get_influence()
inf_sum = influence.summary_frame()
print(inf_sum.head())

student_resid = influence.resid_studentized_external
(cooks, p) = influence.cooks_distance
(dffits, p) = influence.dffits
leverage = influence.hat_matrix_diag

print('\n')
print('Leverage vs. Studentized Residuals')
sns.regplot(x=leverage, y=model1.resid_pearson, fit_reg=False)
plt.title('Leverage vs. Studentized Residuals')
plt.xlabel('Leverage')
plt.ylabel('Studentized Residuals')

#Concat MEDV and the resulting residual table
#Note that hat_diag is leverage, so change the column heading from hat_diag to leverage
MEDVres = pd.concat([boston_target_df.MEDV, inf_sum], axis = 1)
MEDVres = MEDVres.rename(columns={'hat_diag': 'leverage'})
MEDVres.head()
Below is the output of influence.summary_frame() before it is concatenated with the MEDV values. The scatter plot of leverage versus studentized residuals is also produced by the code above.
     dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM   dfb_DIS  \
33    0.004212  0.000886 -0.001868  0.000520 -0.001506 -0.000055  0.000991
283  -0.004325  0.013510  0.165739  0.217643 -0.048912  0.064451 -0.075798
418   0.064623  0.598700 -0.036948  0.026304  0.018920 -0.041332  0.046806
502   0.013932 -0.004569 -0.017240  0.007403 -0.017676  0.004782  0.021633
402   0.063417  0.034065 -0.013930  0.024539 -0.007781 -0.062845  0.011428
      dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
33   0.003373  0.003382    -0.010605  0.001511  -0.005790  0.000026
283  0.016462 -0.026867    -0.005289 -0.006179   0.054326  0.009438
418 -0.135088 -0.006557     0.015099 -0.211732  -0.133261  0.035852
502  0.017684  0.008317    -0.041426  0.000965   0.018930  0.000297
402 -0.043117 -0.001509    -0.012829 -0.078164  -0.057583  0.002266
     standard_resid  hat_diag  dffits_internal  student_resid    dffits
33        -0.148714  0.013939        -0.017681      -0.148528 -0.017659
283        1.179325  0.075298         0.336529       1.179915  0.336698
418        1.167851  0.239801         0.655917       1.168395  0.656222
502       -0.374113  0.024842        -0.059712      -0.373702 -0.059647
402       -1.285053  0.016199        -0.164899      -1.286125 -0.165037
The next step is to identify outliers using studentized residuals. Studentized residuals become concerning when their absolute values exceed 2. This is an aggressive stance, and one could relax the criterion and flag only studentized residuals exceeding 3 in absolute value. It is also worth reading the following post on why NaN values can appear at times: https://stats.stackexchange.com/questions/123368/studentized-residuals-undefined. Below, I printed the five most negative residuals.
#studentized residuals for identifying outliers
#requested studentized residuals call them r
#studentized residuals that exceed +2 or -2 are concerning
#studentized residuals that exceed +3 or -3 are extremely concerning
r = MEDVres.student_resid
print ('-'*30 + ' studentized residual ' + '-'*30)
print (r.describe())
print ('\n')

r_sort = MEDVres.sort_values(by = 'student_resid')
print ('-'*30 + ' top 5 most negative residuals ' + '-'*30)
print (r_sort.head())
print ('\n')
print ('-'*30 + ' top 5 most positive residuals ' + '-'*30)
print (r_sort.tail())
------------------------------ studentized residual ------------------------------
count    404.000000
mean       0.004729
std        1.017836
min       -3.506797
25%       -0.562912
50%       -0.138306
75%        0.420073
max        5.464612
Name: student_resid, dtype: float64


------------------------------ top 5 most negative residuals ------------------------------
     MEDV  dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM  \
364  21.9   0.667778  0.113804  0.107776 -0.616018 -0.209043 -0.708748
505  11.9   0.070241 -0.036201 -0.094961  0.044961 -0.123162  0.059625
401   7.2   0.108062 -0.019858 -0.022017  0.039677 -0.015976 -0.096769
375  15.0   0.174213 -0.131031  0.011348  0.042939 -0.017619 -0.205546
397   8.5   0.010791  0.066082 -0.037235  0.034932 -0.007160  0.025261
      dfb_DIS   dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
364 -0.180677  0.031865 -0.151854    -0.294940 -0.146532  -0.013238  0.105395
505  0.111267  0.103545  0.050053    -0.244647  0.012416   0.152422  0.010858
401  0.018899 -0.059734 -0.002898    -0.020852 -0.165908  -0.082444  0.007009
375  0.028951 -0.006090 -0.037613    -0.058662 -0.161656  -0.004060  0.011597
397  0.034995 -0.084152  0.011188     0.002027 -0.107611  -0.023573  0.004001
     standard_resid  leverage  dffits_internal  student_resid    dffits
364       -3.457330  0.095684        -1.124606      -3.506797 -1.140696
505       -2.221733  0.025718        -0.360965      -2.233001 -0.362796
401       -2.211719  0.016903        -0.290013      -2.222808 -0.291467
375       -2.146969  0.029307        -0.373050      -2.156948 -0.374784
397       -1.713136  0.016096        -0.219116      -1.717390 -0.219660
We can identify all of the outliers by running the simple filter below. Again, I used the cutoff of 2, as opposed to the cutoff of 3 found in some textbooks. In the output, the left column contains the index of the observation and the right column contains its MEDV value.
#Print all MEDV values where the studentized residuals exceed 2
print (MEDVres.MEDV[abs(r) > 2])
64     33.0
161    50.0
162    50.0
166    50.0
214    23.7
225    50.0
228    46.7
233    48.3
253    42.8
267    50.0
364    21.9
365    27.5
367    23.1
368    50.0
369    50.0
370    50.0
371    50.0
372    50.0
374    13.8
375    15.0
401     7.2
505    11.9
Name: MEDV, dtype: float64
Now that we have identified outliers, we need to see which observations have high leverage. As discussed earlier, the leverage cutoff can be calculated as (2k+2)/n, that is, twice the average leverage, where k is the number of predictors and n is the sample size.
#Identify high leverage points
#leverage cutoff = (2k+2)/n
#k = number of predictors (11)
#n = number of observations (506)
((2*11)+2)/506
#= 0.04743083003952569; any number higher than this indicates high leverage

l = MEDVres.leverage
print ('-'*30 + ' Leverage ' + '-'*30)
print (l.describe())
print ('\n')

l_sort = MEDVres.sort_values(by = 'leverage', ascending = False)
print ('-'*30 + ' top 5 highest leverage data points ' + '-'*30)
print (l_sort.head())
------------------------------ Leverage ------------------------------
count    404.000000
mean       0.029703
std        0.028211
min        0.005690
25%        0.014799
50%        0.020744
75%        0.036086
max        0.351723
Name: leverage, dtype: float64


------------------------------ top 5 highest leverage data points ------------------------------
     MEDV  dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM  \
380  10.4   0.052078 -0.393317  0.032894 -0.008546 -0.020914 -0.048826
418   8.8   0.064623  0.598700 -0.036948  0.026304  0.018920 -0.041332
405   5.0   0.005872 -0.204162  0.006323 -0.003434 -0.009344  0.007857
365  27.5   0.462435 -0.074987  0.151347 -0.084445  0.135214 -0.776065
155  15.6   0.006989 -0.022791  0.004100 -0.137614 -0.146278  0.023760
      dfb_DIS   dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
380 -0.031489  0.085597 -0.014302    -0.017654 -0.064719   0.050684  0.013771
418  0.046806 -0.135088 -0.006557     0.015099 -0.211732  -0.133261  0.035852
405 -0.010635  0.033011  0.000971    -0.001588 -0.038644   0.022505  0.003815
365 -0.133736  0.231775 -0.104776    -0.055747 -0.027342  -0.624312  0.064897
155 -0.068914  0.077127 -0.005154     0.015553  0.151105   0.051739  0.007041
     standard_resid  leverage  dffits_internal  student_resid    dffits
380       -0.551887  0.351723        -0.406509      -0.551397 -0.406148
418        1.167851  0.239801         0.655917       1.168395  0.656222
405       -0.454273  0.181548        -0.213952      -0.453813 -0.213735
365        2.455956  0.114348         0.882475       2.471913  0.888209
155       -0.838743  0.107225        -0.290674      -0.838425 -0.290564
We can now identify all observations with high leverage by simply using the cutoff formula. It appears that there are 61 such observations.
#leverage cutoff = (2k+2)/n = 0.04743083003952569
#Print all MEDV values where the leverage exceeds 0.04743083003952569
print (MEDVres.MEDV[abs(l) > ((2*11)+2)/506])
8      16.5
48     14.4
54     18.9
102    18.6
142    13.4
143    15.6
144    11.8
145    13.8
146    15.6
147    14.6
148    17.8
152    15.3
154    17.0
155    15.6
159    23.3
160    27.0
162    50.0
203    48.5
204    50.0
209    20.0
210    21.7
211    19.3
253    42.8
257    50.0
269    20.7
274    32.4
282    46.0
283    50.0
290    28.5
291    37.3
354    18.2
355    20.6
356    17.8
357    21.7
363    16.8
364    21.9
365    27.5
367    23.1
368    50.0
369    50.0
370    50.0
372    50.0
374    13.8
380    10.4
398     5.0
404     8.5
405     5.0
415     7.2
416     7.5
418     8.8
423    13.4
424    11.7
426    10.2
427    10.9
438     8.4
450    13.4
457    13.5
488    15.2
491    13.6
492    20.1
Name: MEDV, Length: 61, dtype: float64
Now that we have identified outliers and high-leverage observations, let’s bring the two together to find observations with significant influence. When an observation is both an outlier and has high leverage, it is likely to pull the regression line toward itself and thereby influence the estimated coefficients.
#large residual and large leverage = INFLUENTIAL
#Print values that are both outliers and high leverage points
outlier = pd.DataFrame(MEDVres.MEDV[abs(r) > 2])
lev = pd.DataFrame(MEDVres.MEDV[abs(l) > ((2*11)+2)/506])

#Influential1 = pd.merge(outlier, lev, left_index=True, right_index=True, how='outer')
#print(Influential1)

Influential2 = pd.merge(outlier, lev, left_index=True, right_index=True)
print(Influential2)
     MEDV_x  MEDV_y
162    50.0    50.0
253    42.8    42.8
364    21.9    21.9
365    27.5    27.5
367    23.1    23.1
368    50.0    50.0
369    50.0    50.0
370    50.0    50.0
372    50.0    50.0
374    13.8    13.8
#Plot influential observations
#Use residual squared to restrict the graph but preserve the relative position of observations
from statsmodels.graphics.regressionplots import *

plot_leverage_resid2(model1)
plt.show()

# plt.scatter(MEDVres.student_resid ** 2, MEDVres.leverage)
# for i, medv in enumerate(boston_target_df.MEDV):
#     plt.annotate(medv, [(MEDVres.student_resid ** 2)[i], MEDVres.leverage[i]])
# plt.xlabel("Normalized Residuals**2")
# plt.ylabel("Leverage")
# plt.show()

influence_plot(model1)
plt.show()
It is also very useful to look at overall influence, which can be measured with Cook’s Distance and DFFITS. Cook’s Distance is always non-negative, and the larger the value, the more influential the observation. A common rule of thumb flags observations whose Cook’s Distance exceeds three times the mean Cook’s Distance.
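For intuition, Cook’s Distance can be reconstructed from the internally studentized residual r_i and the leverage h_i as D_i = (r_i^2 / (p+1)) * (h_i / (1 - h_i)), which makes it explicit that influence requires both a large residual and high leverage. A small sketch, assuming model1 is the OLS fit from earlier in this post:

#Sketch: Cook's Distance from studentized residuals and leverage
import numpy as np

infl = model1.get_influence()
r = infl.resid_studentized_internal               #internally studentized residuals
h = infl.hat_matrix_diag                          #leverage values
p_plus_1 = int(model1.df_model) + 1               #number of coefficients including the intercept

d_manual = (r ** 2 / p_plus_1) * (h / (1 - h))
d_statsmodels, _ = infl.cooks_distance
print(np.allclose(d_manual, d_statsmodels))       #the two should agree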
DFFITS is also designed to identify influential observations, with a conventional cutoff of 2*sqrt(k/n). Unlike Cook’s Distance, DFFITS can be positive or negative; values close to zero correspond to points with little or no influence on the OLS regression line. The code below flags all observations whose absolute DFFITS exceeds the cutoff.
#GENERAL MEASURE OF INFLUENCE
#Identify influential observations with DFFITS
#conventional cut-off point for DFFITS is 2*sqrt(k/n)
MEDVres[abs(MEDVres.dffits) > 2 * math.sqrt(11 / 506)]
The resulting table is very wide and long, so I will not reproduce it in full here. Next, let’s flag influential observations using Cook’s Distance.
#Cook's D of more than 3 times the mean flags a possibly influential observation
#MEDVres.loc[:, "cooks_d"].mean()
cutoff = (MEDVres.loc[:, "cooks_d"].mean()) * 3
outlier2 = pd.DataFrame(MEDVres.MEDV[abs(MEDVres.cooks_d) > cutoff])
print(outlier2)
     MEDV
64   33.0
147  14.6
148  17.8
152  15.3
161  50.0
162  50.0
166  50.0
214  23.7
225  50.0
233  48.3
253  42.8
267  50.0
364  21.9
365  27.5
367  23.1
368  50.0
369  50.0
370  50.0
371  50.0
372  50.0
374  13.8
380  10.4
418   8.8
The yellowbrick package allows us to visualize Cook’s Distances.
from yellowbrick.regressor import CooksDistance

y = boston_target_df['MEDV']
X = boston_features_df

# Instantiate and fit the visualizer
visualizer = CooksDistance()
visualizer.fit(X, y)
visualizer.show()
And this brings us to DFBETAS. Cook’s Distance and DFFITS are general measures of influence, while DFBETAS are variable specific: they show how much each observation influences each individual coefficient. DFBETAS are provided as part of the influence.summary_frame() output, but it is worth visualizing them as well.
#Visualize influential observations
#DFBETAS above 2/sqrt(n) in absolute value are suspect
plt.scatter(MEDVres.MEDV, MEDVres.dfb_CRIM, color = "red", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_ZN, color = "green", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_CHAS, color = "blue", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_NOX, color = "black", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_RM, color = "red", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_DIS, color = "green", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_RAD, color = "blue", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_TAX, color = "black", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_PTRATIO, color = "green", marker = "*")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_B, color = "blue", marker = "*")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_LSTAT, color = "black", marker = "*")

#Reference lines at the DFBETAS cutoff of +/- 2/sqrt(n)
dfb_cutoff = 2 / math.sqrt(506)
plt.plot((-5, 55), (dfb_cutoff, dfb_cutoff), '-.r')
plt.plot((-5, 55), (-dfb_cutoff, -dfb_cutoff), '-.r')
plt.xlabel("MEDV")
plt.ylabel("DFBETAS")
plt.show()
Now let’s put the findings to use. I fitted two additional regression models: first, I removed the observations deemed influential and refitted the OLS model; second, I removed the observations flagged by the Cook’s Distance analysis and fitted another OLS model.
#Remove influential observations and rerun the regression
boston_alt_x = boston_features_df.drop([163, 253, 364, 365, 367, 368, 369, 370, 412, 414])
boston_alt_y = boston_target_df.drop([163, 253, 364, 365, 367, 368, 369, 370, 412, 414])

#Remove the Cook's Distance outliers and rerun the regression
boston_alt_x2 = boston_features_df.drop([64, 147, 148, 161, 162, 163, 166, 186, 195, 214, 225, 233, 267, 364, 365, 367, 368, 369, 370, 380, 412, 414])
boston_alt_y2 = boston_target_df.drop([64, 147, 148, 161, 162, 163, 166, 186, 195, 214, 225, 233, 267, 364, 365, 367, 368, 369, 370, 380, 412, 414])
#Partition dataset
import sklearn
from sklearn.model_selection import train_test_split  #sklearn import does not automatically install sub packages
X = boston_alt_x
Y = boston_alt_y
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.20, random_state = 5)

#Partition dataset
X2 = boston_alt_x2
Y2 = boston_alt_y2
X2_train, X2_test, Y2_train, Y2_test = sklearn.model_selection.train_test_split(X2, Y2, test_size = 0.20, random_state = 5)
#Model statistics
model_alt = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print_model_alt = model_alt.summary()
print(print_model_alt)
A comparison of the AIC and BIC values of this model to the initial model shows that our model fit improved as a result of removing the influential observations.
#Model statistics
model_alt2 = sm.OLS(Y2_train, sm.add_constant(X2_train)).fit()
print_model_alt2 = model_alt2.summary()
print(print_model_alt2)
The model above was fitted after removing all outliers identified by the Cook’s Distance analysis. The fit improved even further: the R squared value increased compared to both the initial fit and the fit on the data without the influential observations.
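For a quick side-by-side of the three fits, a small sketch (assuming model1, model_alt and model_alt2 are the results objects fitted above) collects their AIC, BIC and R squared values into one table:

#Sketch: compare fit statistics of the three models
import pandas as pd

comparison = pd.DataFrame(
    {
        "AIC": [model1.aic, model_alt.aic, model_alt2.aic],
        "BIC": [model1.bic, model_alt.bic, model_alt2.bic],
        "R-squared": [model1.rsquared, model_alt.rsquared, model_alt2.rsquared],
        "Adj. R-squared": [model1.rsquared_adj, model_alt.rsquared_adj, model_alt2.rsquared_adj],
    },
    index=["initial", "influential removed", "Cook's D outliers removed"],
)
print(comparison)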
The conclusion is that model fit can be improved by identifying and handling outliers, high-leverage points and influential observations. That said, deleting observations is not always desirable, and other methods can be used to deal with influential values, such as capping them at an artificial threshold (winsorizing) or replacing them with the mean. The different options can be compared empirically, for example with cross-validation, to pick the one that gives the better fit.
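As a sketch of the capping idea mentioned above, one could winsorize a column by clipping it at chosen percentiles before refitting; the function name and the 1st/99th percentile limits are illustrative, not a recommendation.

#Sketch: cap (winsorize) extreme values instead of deleting observations
import pandas as pd

def winsorize_column(df, col, lower=0.01, upper=0.99):
    lo, hi = df[col].quantile([lower, upper])
    return df[col].clip(lower=lo, upper=hi)

#Example: cap extreme CRIM values before refitting the model
#boston_features2_df["CRIM"] = winsorize_column(boston_features2_df, "CRIM")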