Diagnostics for Leverage and Influence
The location of observations in the predictor and response space plays a role in determining regression coefficients. Observations whose actual y values lie far from the values predicted by the model are called outliers.
In many cases, outliers do not have a large effect on the OLS line itself. Even so, an outlier can cause significant issues because it inflates the RSE. Recall that the RSE is used to compute p-values and confidence intervals, so outliers affect the interpretation of model fit. Including outliers can also reduce R squared.
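To make this concrete, here is a small sketch on synthetic data (not the Boston dataset) showing how a single outlier inflates the RSE and lowers R squared while barely moving the slope; the variable names are illustrative.

#Illustrative sketch: effect of one outlier on RSE and R squared (synthetic data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(0, 1, 50)

clean_fit = sm.OLS(y, sm.add_constant(x)).fit()

#Add one observation whose y value is far from its predicted value
x_out = np.append(x, 5.0)
y_out = np.append(y, 40.0)
outlier_fit = sm.OLS(y_out, sm.add_constant(x_out)).fit()

for name, fit in [("clean", clean_fit), ("with outlier", outlier_fit)]:
    rse = np.sqrt(fit.mse_resid)  #residual standard error
    print(name, "slope:", round(fit.params[1], 3), "RSE:", round(rse, 3), "R^2:", round(fit.rsquared, 3))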
Residual plots are a great way to visualize outliers. To help decide whether an observation should be deemed an outlier, studentized residuals may be used. A studentized residual is computed by dividing the residual by its estimated standard error. When the absolute value of a studentized residual exceeds 3, we should be concerned that the observation is an outlier.
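As a sketch of what the studentization does, the snippet below pulls both the internally and externally studentized residuals from a fitted statsmodels OLS results object (such as the model fitted later in this post) and recomputes the internal version by hand from the raw residuals and leverage values; the function name is my own.

#Sketch: studentized residuals from a fitted statsmodels OLS results object
import numpy as np

def studentized_residuals(results):
    infl = results.get_influence()
    e = np.asarray(results.resid)                 #raw residuals
    h = infl.hat_matrix_diag                      #leverage of each observation
    sigma = np.sqrt(results.mse_resid)            #residual standard error
    internal = e / (sigma * np.sqrt(1 - h))       #internally studentized
    external = infl.resid_studentized_external    #externally studentized (used for outlier screening)
    return internal, external

#Usage (once a model has been fitted):
#internal, external = studentized_residuals(model1)
#print(np.where(np.abs(external) > 3)[0])         #indices of potential outliers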
In contrast to an outlier, a high-leverage observation has an unusual x value: the observed value of a predictor is very unusual compared to the other values. As a result, removing a high-leverage observation from the dataset can noticeably change the OLS line, so just a few observations with high leverage may make the model fit questionable. In multiple regression we cannot simply scan the x values of each variable separately to spot high-leverage points, because an observation may fall within the normal range of every individual variable yet be unusual when all regressors are considered simultaneously. The leverage statistic h can be used to spot high-leverage points: the leverages average to (p+1)/n, and observations whose leverage is well above this average, for example greater than 2(p+1)/n, may be considered high-leverage points.
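Below is a minimal sketch of the leverage statistic itself: the diagonal of the hat matrix H = X(X'X)^-1 X'. The leverages always average to (p+1)/n, so a rule of thumb flags observations whose leverage exceeds a multiple of that average, for example 2(p+1)/n, which is the cutoff used later in this post. The function names and the multiplier are illustrative.

#Sketch: leverage values and a rule-of-thumb cutoff
import numpy as np

def leverage_values(X):
    #X is an n x p matrix of predictors; an intercept column is added here
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    H = X @ np.linalg.inv(X.T @ X) @ X.T          #hat matrix
    return np.diag(H)

def high_leverage_indices(X, multiplier=2):
    h = leverage_values(X)
    n = np.asarray(X).shape[0]
    p_plus_1 = np.asarray(X).shape[1] + 1         #predictors plus the intercept
    cutoff = multiplier * p_plus_1 / n
    return np.where(h > cutoff)[0], cutoff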
Let’s look at the Boston Housing dataset and see if we can find outliers, high-leverage points and influential observations. First, I ingested the dataset as usual. For this example, I removed two variables, AGE and INDUS, because they were not significant during the initial fitting procedure.
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf  #aliased separately so it does not shadow statsmodels.api
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from sklearn.model_selection import train_test_split
import math
import time
#IMPORT DATASET
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()

#Create analytical data set
#Create dataframe from feature_names
boston_features_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
#Create dataframe from target
boston_target_df = pd.DataFrame(data=boston.target, columns=['MEDV'])

#We know from prior work that two variables were not significant contributors and need to be removed (AGE and INDUS)
#Drop AGE and INDUS
boston_features2_df = boston_features_df.drop(columns=['AGE', 'INDUS'])

#Partition the data
#Create training and test datasets
import sklearn
from sklearn.model_selection import train_test_split  #sklearn import does not automatically install sub packages
X = boston_features2_df
Y = boston_target_df
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.20, random_state = 5)

#For demonstration
from statsmodels.formula.api import ols

#Model statistics
model1 = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print_model1 = model1.summary()
print(print_model1)
I used the statsmodels API for a lot of the heavy lifting. Calling get_influence() on a fitted model returns an OLSInfluence instance that holds the influence and outlier measures; model1 is the OLS regression fitted above.
influence = model1.get_influence()
pd.Series(influence.hat_matrix_diag).describe()

count    404.000000
mean       0.029703
std        0.028211
min        0.005690
25%        0.014799
50%        0.020744
75%        0.036086
max        0.351723
dtype: float64
In the next block, I show how to obtain studentized residuals, Cook’s Distances, DFFITS and leverage values one by one. The influence.summary_frame() call provides all of these values at once, along with DFBETAS for each regressor. I then concatenated this table with the MEDV value of each observation and named the resulting data frame MEDVres.
influence = model1.get_influence()
inf_sum = influence.summary_frame()
print(inf_sum.head())

student_resid = influence.resid_studentized_external
(cooks, p) = influence.cooks_distance
(dffits, p) = influence.dffits
leverage = influence.hat_matrix_diag

print('\n')
print('Leverage vs. Studentized Residuals')
sns.regplot(x=leverage, y=model1.resid_pearson, fit_reg=False)
plt.title('Leverage vs. Studentized Residuals')
plt.xlabel('Leverage')
plt.ylabel('Studentized Residuals')

#Concat MEDV and the resulting residual table
#Note that hat_diag is leverage, so change the column heading from hat_diag to leverage
MEDVres = pd.concat([boston_target_df.MEDV, inf_sum], axis = 1)
MEDVres = MEDVres.rename(columns={'hat_diag': 'leverage'})
MEDVres.head()
Below is the output of influence.summary_frame() before it is concatenated with the MEDV values. The scatter plot of leverage versus studentized residuals is also produced by the code above.
     dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM   dfb_DIS  \
33    0.004212  0.000886 -0.001868  0.000520 -0.001506 -0.000055  0.000991
283  -0.004325  0.013510  0.165739  0.217643 -0.048912  0.064451 -0.075798
418   0.064623  0.598700 -0.036948  0.026304  0.018920 -0.041332  0.046806
502   0.013932 -0.004569 -0.017240  0.007403 -0.017676  0.004782  0.021633
402   0.063417  0.034065 -0.013930  0.024539 -0.007781 -0.062845  0.011428
      dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
33   0.003373  0.003382    -0.010605  0.001511  -0.005790  0.000026
283  0.016462 -0.026867    -0.005289 -0.006179   0.054326  0.009438
418 -0.135088 -0.006557     0.015099 -0.211732  -0.133261  0.035852
502  0.017684  0.008317    -0.041426  0.000965   0.018930  0.000297
402 -0.043117 -0.001509    -0.012829 -0.078164  -0.057583  0.002266
     standard_resid  hat_diag  dffits_internal  student_resid    dffits
33        -0.148714  0.013939        -0.017681      -0.148528 -0.017659
283        1.179325  0.075298         0.336529       1.179915  0.336698
418        1.167851  0.239801         0.655917       1.168395  0.656222
502       -0.374113  0.024842        -0.059712      -0.373702 -0.059647
402       -1.285053  0.016199        -0.164899      -1.286125 -0.165037
The next step is to identify outliers using studentized residuals. Studentized residuals become concerning when their absolute values exceed 2. This is an aggressive stance, and one could relax the criterion and flag only studentized residuals exceeding 3 in absolute value. It is also worth reading the following post on why NaN values can appear at times: https://stats.stackexchange.com/questions/123368/studentized-residuals-undefined. Below, I printed the five most negative residuals.
#studentized residuals for identifying outliers
#requested studentized residuals call them r
#studentized residuals that exceed +2 or -2 are concerning
#studentized residuals that exceed +3 or -3 are extremely concerning
r = MEDVres.student_resid
print ('-'*30 + ' studentized residual ' + '-'*30)
print (r.describe())
print ('\n')

r_sort = MEDVres.sort_values(by = 'student_resid')
print ('-'*30 + ' top 5 most negative residuals ' + '-'*30)
print (r_sort.head())
print ('\n')
print ('-'*30 + ' top 5 most positive residuals ' + '-'*30)
print (r_sort.tail())
------------------------------ studentized residual ------------------------------
count    404.000000
mean       0.004729
std        1.017836
min       -3.506797
25%       -0.562912
50%       -0.138306
75%        0.420073
max        5.464612
Name: student_resid, dtype: float64


------------------------------ top 5 most negative residuals ------------------------------
     MEDV  dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM  \
364  21.9   0.667778  0.113804  0.107776 -0.616018 -0.209043 -0.708748
505  11.9   0.070241 -0.036201 -0.094961  0.044961 -0.123162  0.059625
401   7.2   0.108062 -0.019858 -0.022017  0.039677 -0.015976 -0.096769
375  15.0   0.174213 -0.131031  0.011348  0.042939 -0.017619 -0.205546
397   8.5   0.010791  0.066082 -0.037235  0.034932 -0.007160  0.025261
      dfb_DIS   dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
364 -0.180677  0.031865 -0.151854    -0.294940 -0.146532  -0.013238  0.105395
505  0.111267  0.103545  0.050053    -0.244647  0.012416   0.152422  0.010858
401  0.018899 -0.059734 -0.002898    -0.020852 -0.165908  -0.082444  0.007009
375  0.028951 -0.006090 -0.037613    -0.058662 -0.161656  -0.004060  0.011597
397  0.034995 -0.084152  0.011188     0.002027 -0.107611  -0.023573  0.004001
     standard_resid  leverage  dffits_internal  student_resid    dffits
364       -3.457330  0.095684        -1.124606      -3.506797 -1.140696
505       -2.221733  0.025718        -0.360965      -2.233001 -0.362796
401       -2.211719  0.016903        -0.290013      -2.222808 -0.291467
375       -2.146969  0.029307        -0.373050      -2.156948 -0.374784
397       -1.713136  0.016096        -0.219116      -1.717390 -0.219660
We can identify all of the outliers by running the simple filter below. Again, I used the cutoff of 2, as opposed to the cutoff of 3 found in some textbooks. In the output, the left column contains the index of the observation and the right column contains its MEDV value.
#Print all MEDV values where the studentized residuals exceed 2
print (MEDVres.MEDV[abs(r) > 2])
64     33.0
161    50.0
162    50.0
166    50.0
214    23.7
225    50.0
228    46.7
233    48.3
253    42.8
267    50.0
364    21.9
365    27.5
367    23.1
368    50.0
369    50.0
370    50.0
371    50.0
372    50.0
374    13.8
375    15.0
401     7.2
505    11.9
Name: MEDV, dtype: float64
Now that we have identified outliers, we need to see which observations have high leverage. As discussed earlier, the leverage cutoff can be calculated as (2k+2)/n, that is, twice the average leverage, where k is the number of predictors and n is the sample size.
#Identify high leverage points
#leverage cutoff = (2k+2)/n
#k = number of predictors (11)
#n = number of observations (506)
((2*11)+2)/506
#= 0.04743083003952569; any number higher than this indicates high leverage

l = MEDVres.leverage
print ('-'*30 + ' Leverage ' + '-'*30)
print (l.describe())
print ('\n')

l_sort = MEDVres.sort_values(by = 'leverage', ascending = False)
print ('-'*30 + ' top 5 highest leverage data points ' + '-'*30)
print (l_sort.head())
------------------------------ Leverage ------------------------------
count    404.000000
mean       0.029703
std        0.028211
min        0.005690
25%        0.014799
50%        0.020744
75%        0.036086
max        0.351723
Name: leverage, dtype: float64


------------------------------ top 5 highest leverage data points ------------------------------
     MEDV  dfb_const  dfb_CRIM    dfb_ZN  dfb_CHAS   dfb_NOX    dfb_RM  \
380  10.4   0.052078 -0.393317  0.032894 -0.008546 -0.020914 -0.048826
418   8.8   0.064623  0.598700 -0.036948  0.026304  0.018920 -0.041332
405   5.0   0.005872 -0.204162  0.006323 -0.003434 -0.009344  0.007857
365  27.5   0.462435 -0.074987  0.151347 -0.084445  0.135214 -0.776065
155  15.6   0.006989 -0.022791  0.004100 -0.137614 -0.146278  0.023760
      dfb_DIS   dfb_RAD   dfb_TAX  dfb_PTRATIO     dfb_B  dfb_LSTAT   cooks_d  \
380 -0.031489  0.085597 -0.014302    -0.017654 -0.064719   0.050684  0.013771
418  0.046806 -0.135088 -0.006557     0.015099 -0.211732  -0.133261  0.035852
405 -0.010635  0.033011  0.000971    -0.001588 -0.038644   0.022505  0.003815
365 -0.133736  0.231775 -0.104776    -0.055747 -0.027342  -0.624312  0.064897
155 -0.068914  0.077127 -0.005154     0.015553  0.151105   0.051739  0.007041
     standard_resid  leverage  dffits_internal  student_resid    dffits
380       -0.551887  0.351723        -0.406509      -0.551397 -0.406148
418        1.167851  0.239801         0.655917       1.168395  0.656222
405       -0.454273  0.181548        -0.213952      -0.453813 -0.213735
365        2.455956  0.114348         0.882475       2.471913  0.888209
155       -0.838743  0.107225        -0.290674      -0.838425 -0.290564
We can now identify all observations with high leverage by simply using the cutoff formula. It appears that there are 61 such observations.
#leverage cutoff = (2k+2)/n = 0.04743083003952569
#Print all MEDV values where the leverage exceeds 0.04743083003952569
print (MEDVres.MEDV[abs(l) > ((2*11)+2)/506])
8      16.5
48     14.4
54     18.9
102    18.6
142    13.4
143    15.6
144    11.8
145    13.8
146    15.6
147    14.6
148    17.8
152    15.3
154    17.0
155    15.6
159    23.3
160    27.0
162    50.0
203    48.5
204    50.0
209    20.0
210    21.7
211    19.3
253    42.8
257    50.0
269    20.7
274    32.4
282    46.0
283    50.0
290    28.5
291    37.3
354    18.2
355    20.6
356    17.8
357    21.7
363    16.8
364    21.9
365    27.5
367    23.1
368    50.0
369    50.0
370    50.0
372    50.0
374    13.8
380    10.4
398     5.0
404     8.5
405     5.0
415     7.2
416     7.5
418     8.8
423    13.4
424    11.7
426    10.2
427    10.9
438     8.4
450    13.4
457    13.5
488    15.2
491    13.6
492    20.1
Name: MEDV, Length: 61, dtype: float64
Now that we have identified outliers and high-leverage observations, let’s bring the two together to find observations with significant influence. When an observation is both an outlier and has high leverage, it is likely to pull the regression line toward itself and thereby influence the estimated coefficients.
#large residual and large leverage = INFLUENTIAL
#Print values that are both outliers and high leverage points
outlier = pd.DataFrame(MEDVres.MEDV[abs(r) > 2])
lev = pd.DataFrame(MEDVres.MEDV[abs(l) > ((2*11)+2)/506])

#Influential1 = pd.merge(outlier, lev, left_index=True, right_index=True, how='outer')
#print(Influential1)

Influential2 = pd.merge(outlier, lev, left_index=True, right_index=True)
print(Influential2)
     MEDV_x  MEDV_y
162    50.0    50.0
253    42.8    42.8
364    21.9    21.9
365    27.5    27.5
367    23.1    23.1
368    50.0    50.0
369    50.0    50.0
370    50.0    50.0
372    50.0    50.0
374    13.8    13.8
#Plot influential observations
#Use residual squared to restrict the graph but preserve the relative position of observations
from statsmodels.graphics.regressionplots import *

plot_leverage_resid2(model1)
plt.show()

# plt.scatter(MEDVres.student_resid ** 2, MEDVres.leverage)
# for i, medv in enumerate(boston_target_df.MEDV):
#     plt.annotate(medv, [(MEDVres.student_resid ** 2)[i], MEDVres.leverage[i]])
# plt.xlabel("Normalized Residuals**2")
# plt.ylabel("Leverage")
# plt.show()

influence_plot(model1)
plt.show()
It is also very useful to look at overall influence, which can be measured with Cook’s Distance and DFFITS. Cook’s Distance is always non-negative, and the larger the value, the more influential the observation. A common rule of thumb flags observations whose Cook’s Distance exceeds three times the mean Cook’s Distance.
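For intuition, Cook’s Distance can be reconstructed from the internally studentized residual r_i and the leverage h_i as D_i = (r_i^2 / (p+1)) * (h_i / (1 - h_i)), which makes it explicit that influence requires both a large residual and high leverage. A small sketch, assuming model1 is the OLS fit from earlier in this post:

#Sketch: Cook's Distance from studentized residuals and leverage
import numpy as np

infl = model1.get_influence()
r = infl.resid_studentized_internal               #internally studentized residuals
h = infl.hat_matrix_diag                          #leverage values
p_plus_1 = int(model1.df_model) + 1               #number of coefficients including the intercept

d_manual = (r ** 2 / p_plus_1) * (h / (1 - h))
d_statsmodels, _ = infl.cooks_distance
print(np.allclose(d_manual, d_statsmodels))       #the two should agree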
DFFITS is also designed to identify influential observations, with a conventional cutoff of 2*sqrt(k/n). Unlike Cook’s Distance, DFFITS can be positive or negative; values close to zero correspond to points with little or no influence on the OLS regression line. The code below flags all observations whose absolute DFFITS exceeds the cutoff.
#GENERAL MEASURE OF INFLUENCE
#Identify influential observations with DFFITS
#conventional cut-off point for DFFITS is 2*sqrt(k/n)
MEDVres[abs(MEDVres.dffits) > 2 * math.sqrt(11 / 506)]
The resulting table is very wide and long, so I will not reproduce it in full here. Next, let’s flag influential observations using Cook’s Distance.
#Cook's D of more than 3 times the mean flags a possibly influential observation
#MEDVres.loc[:, "cooks_d"].mean()
cutoff = (MEDVres.loc[:, "cooks_d"].mean()) * 3
outlier2 = pd.DataFrame(MEDVres.MEDV[abs(MEDVres.cooks_d) > cutoff])
print(outlier2)
     MEDV
64   33.0
147  14.6
148  17.8
152  15.3
161  50.0
162  50.0
166  50.0
214  23.7
225  50.0
233  48.3
253  42.8
267  50.0
364  21.9
365  27.5
367  23.1
368  50.0
369  50.0
370  50.0
371  50.0
372  50.0
374  13.8
380  10.4
418   8.8
The yellowbrick package allows us to visualize Cook’s Distances.
from yellowbrick.regressor import CooksDistance

y = boston_target_df['MEDV']
X = boston_features_df

# Instantiate and fit the visualizer
visualizer = CooksDistance()
visualizer.fit(X, y)
visualizer.show()
And this brings us to DFBETAS. Cook’s Distance and DFFITS are general measures of influence, while DFBETAS are variable specific: they show how much each observation influences each individual coefficient. DFBETAS are provided as part of the influence.summary_frame() output, but it is worth visualizing them as well.
#Visualize influential observations
#DFBETAS above 2/sqrt(n) in absolute value are suspect
plt.scatter(MEDVres.MEDV, MEDVres.dfb_CRIM, color = "red", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_ZN, color = "green", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_CHAS, color = "blue", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_NOX, color = "black", marker = "o")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_RM, color = "red", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_DIS, color = "green", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_RAD, color = "blue", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_TAX, color = "black", marker = "x")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_PTRATIO, color = "green", marker = "*")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_B, color = "blue", marker = "*")
plt.scatter(MEDVres.MEDV, MEDVres.dfb_LSTAT, color = "black", marker = "*")

#Reference lines at the DFBETAS cutoff of +/- 2/sqrt(n)
dfb_cutoff = 2 / math.sqrt(506)
plt.plot((-5, 55), (dfb_cutoff, dfb_cutoff), '-.r')
plt.plot((-5, 55), (-dfb_cutoff, -dfb_cutoff), '-.r')
plt.xlabel("MEDV")
plt.ylabel("DFBETAS")
plt.show()
Now let’s put the findings to use. I fitted two additional regression models: first, I removed the observations deemed influential and refitted the OLS model; second, I removed the observations flagged by the Cook’s Distance analysis and fitted another OLS model.
#Remove influential observations and rerun the regression
boston_alt_x = boston_features_df.drop([163, 253, 364, 365, 367, 368, 369, 370, 412, 414])
boston_alt_y = boston_target_df.drop([163, 253, 364, 365, 367, 368, 369, 370, 412, 414])

#Remove the Cook's Distance outliers and rerun the regression
boston_alt_x2 = boston_features_df.drop([64, 147, 148, 161, 162, 163, 166, 186, 195, 214, 225, 233, 267, 364, 365, 367, 368, 369, 370, 380, 412, 414])
boston_alt_y2 = boston_target_df.drop([64, 147, 148, 161, 162, 163, 166, 186, 195, 214, 225, 233, 267, 364, 365, 367, 368, 369, 370, 380, 412, 414])
#Partition dataset
import sklearn
from sklearn.model_selection import train_test_split  #sklearn import does not automatically install sub packages
X = boston_alt_x
Y = boston_alt_y
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.20, random_state = 5)

#Partition dataset
X2 = boston_alt_x2
Y2 = boston_alt_y2
X2_train, X2_test, Y2_train, Y2_test = sklearn.model_selection.train_test_split(X2, Y2, test_size = 0.20, random_state = 5)
#Model statistics
model_alt = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print_model_alt = model_alt.summary()
print(print_model_alt)
A comparison of the AIC and BIC values of this model to the initial model shows that our model fit improved as a result of removing the influential observations.
#Model statistics
model_alt2 = sm.OLS(Y2_train, sm.add_constant(X2_train)).fit()
print_model_alt2 = model_alt2.summary()
print(print_model_alt2)
The model above was fitted after removing all outliers identified by the Cook’s Distance analysis. The fit improved even further: the R squared value increased compared to both the initial fit and the fit on the data without the influential observations.
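For a quick side-by-side of the three fits, a small sketch (assuming model1, model_alt and model_alt2 are the results objects fitted above) collects their AIC, BIC and R squared values into one table:

#Sketch: compare fit statistics of the three models
import pandas as pd

comparison = pd.DataFrame(
    {
        "AIC": [model1.aic, model_alt.aic, model_alt2.aic],
        "BIC": [model1.bic, model_alt.bic, model_alt2.bic],
        "R-squared": [model1.rsquared, model_alt.rsquared, model_alt2.rsquared],
        "Adj. R-squared": [model1.rsquared_adj, model_alt.rsquared_adj, model_alt2.rsquared_adj],
    },
    index=["initial", "influential removed", "Cook's D outliers removed"],
)
print(comparison)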
The conclusion is that model fit can be improved by identifying and handling outliers, high-leverage points and influential observations. That said, deleting observations is not always desirable, and other methods can be used to deal with influential values, such as capping them at an artificial threshold (winsorizing) or replacing them with the mean. The different options can be compared empirically, for example with cross-validation, to pick the one that gives the better fit.
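As a sketch of the capping idea mentioned above, one could winsorize a column by clipping it at chosen percentiles before refitting; the function name and the 1st/99th percentile limits are illustrative, not a recommendation.

#Sketch: cap (winsorize) extreme values instead of deleting observations
import pandas as pd

def winsorize_column(df, col, lower=0.01, upper=0.99):
    lo, hi = df[col].quantile([lower, upper])
    return df[col].clip(lower=lo, upper=hi)

#Example: cap extreme CRIM values before refitting the model
#boston_features2_df["CRIM"] = winsorize_column(boston_features2_df, "CRIM")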