Violation of Regression Assumptions

Oct 8

Oct 8 OLS Regression: Boston Housing Dataset

In this section I wanted to demonstrate how to fit a multiple regression model using the Scikit-learn library in Python. Scikit-learn allows us to fit many different supervised and unsupervised models, but here I focused on fitting a linear regression model. The linear regression model is housed in the linear_model module of sklearn, which is Python’s Scikit-learn library.

When fitting linear models, we must be diligent with respect to discovering and fixing issues that frequently occur in real world data. Patterns in data frequently result in violations of regression assumptions:

1. Non-linear relationships between regressors and the response;

2. Correlation of error terms;

3. Non-constant variance of error terms or heteroscedasticity;

4. Presence of outliers;

5. Presence of influential observations or high-leverage points; and

6. Multicollinearity.

In this section, I wanted show how to diagnose these issues, and solutions to them are offered in subsequent posts.Several datasets are available from sklearn. I will work through the Boston Housing dataset in this example. It is available in sklearn as a toy dataset. Upon loading the dataset, we can see that it has 506 rows and 13 columns.

import pandas as pd
from sklearn.datasets import load_boston
boston= load_boston()
print(boston.data.shape)
print(boston.DESCR)

A description of the dataset including a data dictionary is available. It is a rather old dataset, but it is perfect for regression analysis demonstration because it is loaded with violations of regression assumptions.

. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**  
    :Number of Instances: 506 
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

First, I created two data frames, one containing all regressors and one for the target variable.

#Create analytical data set
#Create dataframe from feature_names
boston_features_df = pd.DataFrame(data=boston.data,columns=boston.feature_names)

#Create dataframe from target
boston_target_df = pd.DataFrame(data=boston.target,columns=['MEDV'])
print(boston_target_df.head())
print(boston_features_df.head()

 MEDV
24.0
21.6
34.7
33.4
36.2
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  
0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   
   PTRATIO       B  LSTAT  
   15.3  396.90   4.98  
   17.8  396.90   9.14  
   17.8  392.83   4.03  
   18.7  394.63   2.94  
   18.7  396.90   5.33  

I concatenated the two data frames and called the data frame boston_df. It now contains all regressors and target variable. We are now ready for Exploratory Data Analysis!

#create a single data frame with both features and target by concatonating
boston_df=pd.concat([boston_features_df, boston_target_df], axis=1)
boston_df.head()

Screen Shot 2019-10-06 at 6.39.40 PM.png

Exploratory Data Analysis:

It is tempting to skip Exploratory Data Analysis. However, EDA allows us to understand the data, and it allows us to discover issues that may effect the results of statistical models.

Distribution of Variables:

We can start by looking at the distribution of our data. The target (MEDV) looks fairly normal although there is a skew on the right, which could cause problems.

#Exploratory Data Analysis
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Histogram of target: MEDV     
boston_df['MEDV'].plot(kind='hist',edgecolor='black',figsize=(6,3))
plt.title('Median value of owner-occupied homes in $1000s', size=15)
plt.xlabel('Value', size=15)
plt.ylabel('Frequency', size=15) 
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

Screen+Shot+2019-10-07+at+8.17.46+PM.jpg

The distribution of features is far from ideal. With the exception of RM, all other features exhibit a non-normal distribution. For now, I will leave this alone and attempt to fix the problem later. I also noticed that one variable (CHAS) is binary categorical.

#Linear Relationship:  There should be a linear relationship between predictors and response variable
for index, feature_name in enumerate(boston.feature_names):
    plt.figure(figsize=(6, 3))
    plt.scatter(boston.data[:, index], boston.target)
    plt.ylabel('Home Value', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)   
#Assess distribution of features
for index, feature_name in enumerate(boston.feature_names):
    plt.figure(figsize=(4, 3))
    plt.hist(boston.data[:, index])
    plt.ylabel('Frequency', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout()

Screen Shot 2019-10-07 at 8.23.36 PM.png

Screen Shot 2019-10-07 at 8.23.45 PM.png

Linearity Assumption:

When fitting a linear regression model, it is assumed that there is a linear relationship between each regressor and the target. The first task is to plot each regressor against the target (MEDV). Residual analysis will also be completed after the model is fit to look at the same problem.

Multiple plots show non-linear relationships (ZN, INDUS, NOX, LSTAT, AGE, DIS, B). CRIM has multiple extreme observations, while observations at the 0 CRIM level have no relationship with home values. RAD and TAX has no relationship with Home values at the highest levels, certainly a cause of concern.

Screen Shot 2019-10-07 at 8.21.45 PM.png

Screen Shot 2019-10-07 at 8.21.58 PM.png

Correlations:

One should always look at a correlation matrix. It shows which feature is correlated with the target and the direction of that correlation. One would like to see significant/high correlations between the regressors and the target variable (negative or positive depending on whether a feature is expected to have a positive or negative impact on home values). For example, crime rates should be inversely correlated with home prices.

Correlation coefficients can be evaluated by a rudimentary rule of thumb: 0 means no correlation, 0.01-.3 weak, 0.31-0.7 moderate, 0.7-0.9 strong, 0.9-1 very strong. In our example, only two variables reach the vicinity of 0.7, LSTAT and RM.

#Take a look at correlations
#Correlation between features and target is preferred
#Correlation between features is suspicious

boston_df[['CRIM','ZN', 'NOX', 'INDUS','CHAS','NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']].corr()

Also, cross correlation between regressors is not desirable because of the collinearity principle of regression. Multicollinearity will be discussed in a few paragraphs. The output of the correlation table below shows the cross-correlation of all variables. Unfortunately, the Boston Housing dataset contains lots of moderate to strong cross-correlations between regressors. I cut the picture on the right at B because it was too large to see.

Screen Shot 2019-10-07 at 8.30.23 PM.png

The seaborn package may also be helpful to visualize correlations. Note the legend on the right side. Darker colors mean stronger correlations, although the directionality of correlations is not available. Several regressors are correlated considerably, which is a grave concern and will have to be addressed in regression modeling.

We are now ready to fit a model!

#Visualize corelations
import seaborn as sns, numpy as np
ax=sns.heatmap(boston_df[['CRIM','ZN', 'NOX', 'INDUS','CHAS','RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']].corr(),
               cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
ax.xaxis.tick_top

Screen Shot 2019-10-08 at 7.09.43 AM.png

Regression Model:

In order to use Scikit-learn, the module needs to be installed. This can be done using a terminal window and #pip install -U scikit-learn. An important note is that sklearn import does not automatically install sub packages, hence they must be installed one by one. In this example, I had to install model_selection import train_test_split, which will be used for partitioning the dataset.

import sklearn
from sklearn.model_selection import train_test_split #sklearn import does not automatically install sub packages
from sklearn import linear_model
import statsmodels.api as sm
import numpy as np

For statistical learning, it is necessary to partition the dataset. When partitioning the data with sklearn, we need to create a data frame that has all regressors and another data frame that has the target variable. Note that the original data was already provided that way, I just concatenated the two data frames because I thought it was useful to have a single data frame for EDA. Some may consider that not necessary…

Here, I held on to 80% of the data for training and set aside 20% for testing. We can see that 404 observations are now considered training in both Y_train (target) and X_train data frames (features) date frames. The test data sizes are 102 in both corresponding data frames.

#Partition the data
#Create training and test datasets
X = boston_df.drop('MEDV', axis = 1)
Y = boston_df['MEDV'] 

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.20, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)

Let’s run our first regression model and see how we did. The intercept and coefficients can be manually printed. I also printed a plot that compares actual and predicted home values based on the test data. We can observe some significant problems. As an example, a large number of actual values were predicted to be higher indicating a poor model fit and accuracy.

#Train regression model
from sklearn.linear_model import LinearRegression
lin_mod1 = LinearRegression()
lin_mod1.fit(X_train, Y_train)
#Create predictions using test features
Y_pred = lin_mod1.predict(X_test)
#Compare predicted and actual MEDV still based on train data
plt.scatter(Y_test, Y_pred)
plt.xlabel("Actual Home Value")
plt.ylabel("Predicted Home Value")
plt.title("Home Values: Actual vs Predicted")

Screen Shot 2019-10-07 at 8.37.01 PM.png

#Print model parameters
print('Intercept: \n', lin_mod1.intercept_)
print('Coefficidents: \n', lin_mod1.coef_)

Intercept: 
 37.91248700975083
Coefficidents: 
 [-1.30799852e-01  4.94030235e-02  1.09535045e-03  2.70536624e+00
 -1.59570504e+01  3.41397332e+00  1.11887670e-03 -1.49308124e+00
  3.64422378e-01 -1.31718155e-02 -9.52369666e-01  1.17492092e-02
 -5.94076089e-01]

I also printed some goodness of fit statistics such as Mean Absolute Error, Mean Squared Error and Root Mean Squared Error. It is hard to judge whether these scores are good or bad until we start building newer, more appropriate models. All three - MAE, MSE and RMSE - were computed based on test data.

#Goodness of fit statistics: values closer to 0 are better
import sklearn
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

Mean Absolute Error: 3.4550349322483482
Mean Squared Error: 28.530458765974583
Root Mean Squared Error: 5.341391089030514

Finally, some very important model stats can be obtained from the OLS Regression Results table. First, the F-statistic is large at 661, which is certainly larger than 1, which allows us to reject the null hypothesis, and conclude that at least one of the variables must be related to the target MEDV. The R-squared is also very large at 0.738 indicating that the model explains 74% of the variance in the target variable. We also get two fit statistics Akaike Information Criterion and Bayesian Information Criterion, which are large but by themselves they are difficult to evaluate. They will be useful when fitting alternative models.

Unfortunately, we do have some variables that were not a significant contributor to the explanatory power of the model (INDUS, CHAS, NOX, AGE). We do get a warning in this table about multicollinearity: The condition number is large, 8.97e+03. This might indicate that there are strong multicollinearity or other numerical problems.

#Model statistics
#Ensure that constant is added: represets y-intercept
model = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
print_model = model.summary()
print(print_model)

Screen Shot 2019-10-13 at 10.32.59 PM.png

Non-linear relationships between regressors and the response:

We already saw during EDA that some of the features are not linearly related to the target. We can look at the non-linearity issue by plotting residuals vs. the fitted values. The plot should show no pattern, and the presence of a pattern indicates issues with the linearity assumption. Further, the points should be horizontally distributed. The residual plot shows a U-shaped pattern, which is an indication of non-linearity.

Plotting observed vs. predicted values against each other may also be helpful. Linearity will cause points to be symmetrically distributed along the diagonal of the plot, while non-symmetrical distribution is an indication of violation of non-linearity assumption.

#RESIDUAL ANALYSIS: CHECK FIT OF TRAINED MODEL
predictions = lin_mod1.predict(X_train) #make predictions with train data/use sklearn
residual=Y_train-predictions
residual.head()

  -0.728770
  5.471472
  4.884009
 -1.777959
 -6.135923
Name: MEDV, dtype: float64

#RESIDUALS VS. FITS
plt.axhline(y=0, color='k', linestyle=':') #draw a black horizontal line when residuals equal 0

plt.scatter(predictions, residual, color='#003F72')
plt.title('Versus Fits')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

Screen Shot 2019-10-15 at 6.56.01 AM.png

Right type of regression used:

The Versus Fits plot is also useful in determining whether the right type of regression has been used. The curved pattern shows that the chosen regression was of the wrong type.

Correlation of error terms:

As discussed earlier, an important assumption of linear regression modeling is that the error terms must NOT be correlated. Correlations among the error terms causes the standard errors underestimated and the confidence in prediction intervals will be narrower. Additionally, the p-values in the model will be underestimated, which may prompt us to consider a parameter statistically significant in error. When error terms are correlated, the fits vs. residuals plot frequently appears in the shape of a funnel.

Independence (or no correlation) is typically expressed based on error terms, e.g. the error terms must be independent. Since the error terms are not known, we can estimate them by subtracting fitted values from actual values for each observation.

The best way to check the independence assumption is to plot residuals against regressor values. A random pattern shows independence, while a plot with a shape is usually an indication of the error terms being correlated. The plots below show clear patterns for all variables indicating that the independence assumption of regression is violated.

#Plot feature residuals vs. each feature
fig, ax=plt.subplots(1,4)
ax[0]=sns.residplot('CRIM','residual', res_feat, ax=ax[0], scatter_kws={'alpha':0.4})
ax[0].set_ylabel('MEDV Residuals')
ax[1]=sns.residplot('ZN','residual', res_feat, ax=ax[1], scatter_kws={'alpha':0.4})
ax[1].set_ylabel('')
ax[2]=sns.residplot('NOX','residual', res_feat, ax=ax[2], scatter_kws={'alpha':0.4})
ax[2].set_ylabel('')
ax[3]=sns.residplot('INDUS','residual', res_feat, ax=ax[3], scatter_kws={'alpha':0.4})
ax[3].set_ylabel('')
#Plot feature residuals vs. each feature
fig, ax=plt.subplots(1,4)
ax[0]=sns.residplot('CHAS','residual', res_feat, ax=ax[0], scatter_kws={'alpha':0.4})
ax[0].set_ylabel('MEDV Residuals')
ax[1]=sns.residplot('RM','residual', res_feat, ax=ax[1], scatter_kws={'alpha':0.4})
ax[1].set_ylabel('')
ax[2]=sns.residplot('AGE','residual', res_feat, ax=ax[2], scatter_kws={'alpha':0.4})
ax[2].set_ylabel('')
ax[3]=sns.residplot('RAD','residual', res_feat, ax=ax[3], scatter_kws={'alpha':0.4})
ax[3].set_ylabel('')

Screen Shot 2019-10-15 at 6.56.52 AM.png

Screen Shot 2019-10-15 at 6.57.29 AM.png

Non-constant variance of error terms or heteroscedasticity:

When variances of the error terms increase with the value of the response (e.g. the vs. fit plot exhibits a funnel shape), we are dealing with heteroscedasticity or non-constant variances. Transformation of the variables may help the problem but at least for now, we can spot it on a plot.

The Vs. Fits plot of the Boston Housing shows an initially decreasing and then increasing variance of error terms as the value of predicted observations increase. There are hypothesis tests that are available for testing the homoscedasticity assumption, but these tests are notoriously unreliable because they themselves are based on models with a sets of assumptions.

Outliers:

An outlier is an observation that is far from its predicted value. Outliers may have an impact on the regression line although that impact may not be significant. However, outliers may have an impact on the RSE score, even if they do not have an impact on the OLS fit. RSE is used for the computation of confidence intervals and p-values, therefore outliers do have an impact on how we interpret model statistics. Outliers can also have a negative impact on R squared values by making them smaller in error.

There are alternative ways to find outliers. One effective way of finding outliers is to plot the studentized residuals (e.g. take each residual and divide by its estimated standard error ). Observations above 3 are usually outliers.

The IQR method is yet another method one could consider using. An observation may be considered an outlier if it is outside of the IQR range. Other thresholds may also be considered and the threshold is simply based on tolerance by the analyst.

#Detect outliers
import numpy as np
import matplotlib.pyplot as plt
#Visualize outliers using boxplots
#I did not plot CHAS given it is binary
np.random.seed(10)
array1 = boston_df['CRIM']
array2 = boston_df['ZN']
array3 = boston_df[ 'INDUS']
array4 = boston_df['NOX']

data = [array1, array2, array3, array4]
fig = plt.figure(1, figsize=(15, 6))
ax = fig.add_subplot(111)
res = ax.boxplot(data, patch_artist=True)
plt.title('Boxplots of Regressors') 
ax.set_xticklabels(['CRIM','ZN', 'INDUS','NOX'])
plt.show()

Screen Shot 2019-10-07 at 8.55.50 PM.png

#Visualize outliers using boxplots
#I did not plot CHAS given it is binary
np.random.seed(10)
array9 = boston_df['TAX']
array10 = boston_df['PTRATIO']
array11 = boston_df['B']
array12 = boston_df['LSTAT']
array13 = boston_df['MEDV']

data = [array9, array10, array11, array12, array13]
fig = plt.figure(1, figsize=(15, 6))
ax = fig.add_subplot(111)
res = ax.boxplot(data, patch_artist=True)

plt.title('Boxplots of Regressors') 
ax.set_xticklabels(['TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'])

plt.show()

Screen Shot 2019-10-07 at 8.57.21 PM.png

np.random.seed(10)
array5 = boston_df['RM']
array6 = boston_df['AGE']
array7 = boston_df['DIS']
array8 = boston_df['RAD']

data = [array5, array6, array7, array8]
fig = plt.figure(1, figsize=(15, 6))
ax = fig.add_subplot(111)
res = ax.boxplot(data, patch_artist=True)

plt.title('Boxplots of Regressors') 
ax.set_xticklabels(['RM', 'AGE', 'DIS', 'RAD'])

plt.show()

Screen Shot 2019-10-07 at 8.58.25 PM.png

#Outlier detection using z score
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(boston_df))

#Define threshold for outliers
#1st array shows row number of outlier location
#2nd array shows column number of outlier location
threshold = 3
print(np.where(z > 3))

(array([ 55,  56,  57, 102, 141, 142, 152, 154, 155, 160, 162, 163, 199,
       200, 201, 202, 203, 204, 208, 209, 210, 211, 212, 216, 218, 219,
       220, 221, 222, 225, 234, 236, 256, 257, 262, 269, 273, 274, 276,
       277, 282, 283, 283, 284, 347, 351, 352, 353, 353, 354, 355, 356,
       357, 358, 363, 364, 364, 365, 367, 369, 370, 372, 373, 374, 374,
       380, 398, 404, 405, 406, 410, 410, 411, 412, 412, 414, 414, 415,
       416, 418, 418, 419, 423, 424, 425, 426, 427, 427, 429, 431, 436,
       437, 438, 445, 450, 454, 455, 456, 457, 466]), array([ 1,  1,  1, 11, 12,  3,  3,  3,  3,  3,  3,  3,  1,  1,  1,  1,  1,
        1,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  5,  3,  3,  1,  5,
        5,  3,  3,  3,  3,  3,  3,  1,  3,  1,  1,  7,  7,  1,  7,  7,  7,
        3,  3,  3,  3,  3,  5,  5,  5,  3,  3,  3, 12,  5, 12,  0,  0,  0,
        0,  5,  0, 11, 11, 11, 12,  0, 12, 11, 11,  0, 11, 11, 11, 11, 11,
       11,  0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11]))
print(z[142][3])  #As an example, print outlier value

3.668397859712446

#IQR method 
Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

#False is not outlier
#True is outlier
print(boston_df < (Q1 - 1.5 * IQR)) |(boston_df > (Q3 + 1.5 * IQR))

 CRIM     ZN  INDUS   CHAS    NOX     RM    AGE    DIS    RAD    TAX  \
0    False  False  False  False  False  False  False  False  False  False   
1    False  False  False  False  False  False  False  False  False  False   
2    False  False  False  False  False  False  False  False  False  False   
3    False  False  False  False  False  False  False  False  False  False   
4    False  False  False  False  False  False  False  False  False  False   
5    False  False  False  False  False  False  False  False  False  False   
6    False  False  False  False  False  False  False  False  False  False   
7    False  False  False  False  False  False  False  False  False  False   
8    False  False  False  False  False  False  False  False  False  False   
9    False  False  False  False  False  False  False  False  False  False   
10   False  False  False  False  False  False  False  False  False  False   
11   False  False  False  False  False  False  False  False  False  False   
12   False  False  False  False  False  False  False  False  False  False   
13   False  False  False  False  False  False  False  False  False  False   
14   False  False  False  False  False  False  False  False  False  False   
15   False  False  False  False  False  False  False  False  False  False   
16   False  False  False  False  False  False  False  False  False  False   
17   False  False  False  False  False  False  False  False  False  False   
18   False  False  False  False  False  False  False  False  False  False   
19   False  False  False  False  False  False  False  False  False  False   

Presence of influential observations or high-leverage points:

High leverage points have a significant impact on the regression line, which may be a significant problem if the fitted model is impacted by a handful of observations. In fact, the entire fit may be compromised by high leverage points. Leverage statistic is the best way to identify high leverage points. The leverage statistic is always between 1/n and 1 and it equals (p+1)/n where p is the number of features. Any observation that exceeds the leverage statistic by a considerable number can be deemed a high leverage value. Plotting predicted and actual observations may also be very helpful.

Normal Distribution

I plotted the histogram of residuals but the selection of bin width influences the picture of the distribution. A far better way to assess the normality assumption is to use a Q-Q plot. This is a probability plot designed to show whether the data came from a normal distribution. If it did, the plot will lie along the diagonal line.

Q-Q plots are also useful to visualize outliers, which are observations far deviating from the diagonal at the upper or lower edges.

The the case of our Boston housing data, the normal distribution is violated toward the upper edge of the diagonal line. These points could be observed in the original histogram of the target variable , MEDV.

#Assess Q-Q plot
import numpy as np
import numpy.random as random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data=residual.values.flatten()
data.sort()
norm=random.normal(0,2,len(data))
norm.sort()
plt.figure(figsize=(12,8),facecolor='1.0')
plt.plot(norm,data,"o")

#generate a trend line as in http://widu.tumblr.com/post/43624347354/matplotlib-trendline
z = np.polyfit(norm,data, 1)
p = np.poly1d(z)
plt.plot(norm,p(norm),"k--", linewidth=2)
plt.title("Normal Q-Q plot", size=16)
plt.xlabel("Theoretical quantiles", size=12)
plt.ylabel("Expreimental quantiles", size=12)
plt.tick_params(labelsize=12)
plt.show()

Screen Shot 2019-10-15 at 7.03.39 AM.png

#HISTOGRAM OF RESIDUALS
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = residual

plt.title('Histogram of Residuals')
plt.hist(x, bins=10)
plt.ylabel('Frequency')
plt.xlabel('Residuals')
plt.show()

Screen Shot 2019-10-15 at 7.09.02 AM.png

Multicollinearity:

Multicollinearity refers to correlations among multiple features. As a result of cross-correlation among multiple variables, it is difficult to tease out the effect of individual regressors on the target. Collinearity impacts the accuracy of estimates negatively and the standard error to increase. Collinearity may also result in lower t scores, which may cause us to deem a feature significant in error.

Note that the correlation matrix is not good enough to detect collinearity because it is possible to have collinearity between variables even if the bivariate coefficient of correlation is not significant, which is called multicollinearity. It can be detected by computing the variance inflation factor. When VIF is 1, there is no collinearity. When values exceed 5 or 10, we detected collinearity. In smaller samples 5 is the preferred cutoff, while in larger sample sizes 10 is acceptable.

Eigen vectors vectors may also indicate multicollinearity when their values are closer to zero.

#test for multicollinearity
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X_train)

#eigen vector of a correlation matrix
#Values close to 0 indicate multicollinearity

cor_mat1 = np.corrcoef(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cor_mat1)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Eigenvectors 
[[-2.47425034e-01 -3.08439366e-01  2.09248686e-01  1.01636242e-01
  -8.63284961e-02  3.98731240e-01 -7.13059605e-01  2.30038555e-01
   3.53108937e-02  2.00168407e-01 -4.34946053e-02 -8.90053883e-02
   1.02918090e-01]
 [ 2.59465494e-01 -3.03852310e-01  3.08184478e-01  2.57878387e-01
  -2.91677002e-01  2.26824721e-01  2.95040340e-01 -3.67243847e-01
  -8.09119998e-02  4.01319376e-01 -2.83337203e-01  3.73540332e-02
  -2.66742263e-01]
 [-3.49317701e-01  9.63503477e-02  1.62017988e-02  4.23414641e-02
   4.21966831e-02 -2.93107974e-04  3.32283831e-01  1.98623118e-01
  -2.36033886e-01  6.33634727e-01  3.08848133e-01  1.46983627e-01
   3.75230336e-01]
 [-1.58413963e-02  4.89778959e-01  2.90932911e-01  7.67269472e-01
   1.27290114e-01 -1.93209476e-01 -1.64595683e-01 -4.08417991e-02
   3.63737232e-02 -4.21500956e-02  1.97249150e-02  2.10903272e-02
  -1.71814075e-02]
 [-3.48158306e-01  2.06075299e-01  1.31391367e-01 -7.20965654e-02
  -1.36293859e-01  1.01002058e-01  2.25604851e-01  6.75586623e-02
   2.69677760e-02 -1.12684470e-02 -2.07529931e-02 -8.24986376e-01
  -2.18473247e-01]
 [ 1.79205418e-01  7.95939845e-02  6.08722261e-01 -3.84736366e-01
   3.50496132e-01  1.15419511e-02 -1.17762420e-01 -3.12081174e-01
   4.08492966e-02  6.80522448e-02  4.45804588e-01 -3.75648343e-02
  -4.78469613e-02]
 [-3.08995756e-01  3.14823557e-01 -1.91909625e-02 -2.01155809e-01
  -6.71041036e-02  1.03990180e-01 -1.19265027e-01 -6.04157758e-01
  -3.40117244e-02 -2.01723975e-02 -4.26468677e-01  6.19762813e-02
   4.27490142e-01]
 [ 3.23492791e-01 -3.33523253e-01 -6.79163814e-02  2.52856024e-01
  -5.59036323e-02 -6.16901636e-02  4.99813925e-02 -1.80913849e-01
  -1.95982054e-02 -1.53677467e-01  2.30893964e-01 -4.26917951e-01
   6.43300230e-01]
 [-3.25594165e-01 -2.84857499e-01  2.53882962e-01  1.08556728e-01
   2.01001427e-01  9.73672914e-02  1.74797089e-01  7.75906659e-03
  -6.35374498e-01 -4.85838525e-01 -4.06950400e-02  1.18312710e-01
  -2.67119214e-02]
 [-3.39352638e-01 -2.53798191e-01  2.14046890e-01  1.07442984e-01
   1.35993393e-01  1.12321675e-01  3.39968259e-01  2.95879479e-02
   7.25326633e-01 -1.95286675e-01 -2.43384726e-02  1.83743437e-01
   1.30047498e-01]
 [-2.07430872e-01 -3.12911598e-01 -3.45365060e-01  1.36922402e-01
   6.16666812e-01 -2.47626986e-01 -1.13996349e-01 -3.11834420e-01
   2.33334241e-02  2.75184941e-01 -8.20263476e-02 -1.87628181e-01
  -2.25825564e-01]
 [ 1.92624946e-01  2.40570161e-01 -3.02691516e-01  1.28421804e-01
   3.44296160e-01  8.01110351e-01  1.26067534e-01 -3.76695529e-02
  -5.98936097e-03 -5.37210657e-02  1.24719666e-01 -3.19367173e-03
  -2.81182396e-02]
 [-3.07288532e-01 -5.36028227e-02 -2.49747151e-01  1.28395066e-01
  -4.27461429e-01  4.56751887e-02 -1.02393658e-01 -4.14492973e-01
   2.92358545e-02 -1.18470805e-01  6.02627805e-01  1.32461368e-01
  -2.47278848e-01]]

Eigenvalues 
[6.12706048 1.39238017 1.28490134 0.85003081 0.80094028 0.70400183
 0.53834139 0.40473886 0.05872764 0.26533046 0.22706882 0.16363976
 0.18283815]

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif.round(1)