Scaling, Centering and Standardization
Centering or scaling variables can be advantageous in regression, although how, when, and what to standardize is largely a matter of convention that varies across scientific fields. See this article for further thoughts: https://statmodeling.stat.columbia.edu/2009/07/11/when_to_standar/
First, when regression is used to explain a phenomenon, interpreting the y-intercept is important. Centering gives each predictor a mean of zero, so the intercept can be interpreted as the expected value of Y when every predictor is at its mean. Without centering, the predictors would have to be set to zero to interpret the intercept, which may not be feasible or even meaningful.
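As a quick illustration of the intercept point, here is a minimal sketch on simulated data (the variable names and numbers are made up): with a centered predictor, the intercept is the expected Y at the average X, which for an OLS fit with an intercept is simply the mean of Y.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)               # a predictor whose values are nowhere near zero
y = 3 + 0.5 * x + rng.normal(0, 1, 200)

raw = sm.OLS(y, sm.add_constant(x)).fit()
centered = sm.OLS(y, sm.add_constant(x - x.mean())).fit()

print(raw.params[0])       # intercept = expected y at x = 0, an extrapolation far outside the data
print(centered.params[0])  # intercept = expected y at the average x; equals y.mean() for this fit
print(y.mean())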
Second, centering and scaling also make it easier to interpret coefficients when regressors are measured on very different magnitudes, e.g. when one regressor takes very small values and another very large ones. Standardization puts the regression coefficients on a common scale (standard-deviation units), so their magnitudes can be compared directly. Conveniently, centering or scaling has no impact on p-values, so the regression model statistics can be interpreted the same way as if no centering or scaling had taken place.
Third, when creating sums or averages of variables measured on different scales, it may be important to scale the variables so they share the same unit, as in the sketch below.
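For instance, here is a sketch of building a composite score from two hypothetical variables on very different scales; standardizing first keeps one variable from dominating the average:

import pandas as pd

df = pd.DataFrame({'income': [30000, 60000, 90000, 45000],
                   'age':    [25, 40, 60, 33]})
z = (df - df.mean()) / df.std()        # put both variables in standard-deviation units
df['composite'] = z.mean(axis=1)       # each variable now contributes equally to the score
print(df)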
Finally, other methods, such as Principal Components Analysis (PCA), may require centering or scaling. Factor loadings in PCA represent the weights by which each standardized original variable is multiplied to obtain a component score, and standardization plays a role in computing the eigenvectors that are then used in the orthogonal transformation of the variables into principal components. For now, we will focus on regression only.
Feature scaling is relatively easy with Python. Note that it is recommended to split the data into training and test sets BEFORE scaling. If scaling is done before partitioning, the data is scaled around the mean of the entire sample, which may differ from the means of the training and test sets individually, and information from the test set leaks into the training data.
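As a rough sketch of that order of operations (reusing the boston_features_df and boston_target_df frames that appear in the examples below), the scaler is fit on the training rows only and the learned mean and standard deviation are then reused on the test rows:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    boston_features_df, boston_target_df, test_size=0.2, random_state=5)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # apply the training mean/std to the test data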
Standardization:
One of the most frequently used scaling methods is standardization. During standardization, we subtract the mean from each value and then divide by the standard deviation. The result is data with a mean of zero and a standard deviation of one.
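The calculation itself is a one-liner; here is a minimal sketch on a toy pandas Series (note that scikit-learn's StandardScaler uses the population standard deviation, ddof=0, whereas pandas' .std() defaults to ddof=1, so the two can differ slightly on small samples):

import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std(ddof=0)   # subtract the mean, divide by the standard deviation
print(z.mean(), z.std(ddof=0))       # 0.0 and 1.0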
Standardization can also improve model performance. For example, an unstandardized feature with a very large variance may dominate the other features, which can result in a subpar model.
scikit-learn offers several scalers: a.) StandardScaler, b.) MinMaxScaler, c.) MaxAbsScaler and d.) RobustScaler.
Standard Scaler
StandardScaler, as its name suggests, is the standard, garden-variety standardization tool. It centers the data by subtracting the mean of a variable from each observation and then divides by the variable's standard deviation. It is also possible to skip the centering step (for example, when working with sparse data); in that case the with_mean parameter needs to be set to False.
#STANDARDIZATION
#Standard Scaler
#centers the data by removing the mean, then scales by dividing by the standard deviation
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
boston_scaled_df = boston_features_df.copy()
boston_scaled_df = pd.DataFrame(scaler.fit_transform(boston_scaled_df),
                                columns=boston_scaled_df.columns)
boston_scaled_df.head()
#Plot to demonstrate the effect of scaling on one of the variables: CRIM
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import matplotlib.gridspec as gridspec

fig1 = plt.figure(constrained_layout=True)
spec1 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig1)

f1_ax1 = fig1.add_subplot(spec1[0, 0])
plt.scatter(boston_features_df['CRIM'], boston_target_df['MEDV'])
plt.xlabel('CRIM')
plt.ylabel('MEDV')

f1_ax2 = fig1.add_subplot(spec1[0, 1])
plt.scatter(boston_scaled_df['CRIM'], boston_target_df['MEDV'])
plt.xlabel('CRIM')
plt.ylabel('MEDV')

f1_ax3 = fig1.add_subplot(spec1[1, 0])
boston_features_df['CRIM'].plot(kind='hist', edgecolor='black', figsize=(6, 3))
plt.title('CRIM', size=10)

f1_ax4 = fig1.add_subplot(spec1[1, 1])
boston_scaled_df['CRIM'].plot(kind='hist', edgecolor='black', figsize=(6, 3))
plt.title('CRIM', size=10)
MinMax Scaler
The MinMaxScaler scales each feature to a predetermined range. It subtracts the minimum value of the variable from each observation and divides by the variable's range (maximum minus minimum), then rescales the result to the requested feature_range, which defaults to 0-1. The scaler can be useful for non-normal distributions, but its drawback is its sensitivity to outliers.
#MinMax Scaler
#scales each feature to a given range
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
boston_scaled2_df = boston_features_df.copy()
boston_scaled2_df = pd.DataFrame(scaler.fit_transform(boston_scaled2_df),
                                 columns=boston_scaled2_df.columns)
boston_scaled2_df.head()
MaxAbs Scaler
The MaxAbsScaler divides each observation by the maximum absolute value of that variable, so the resulting values fall in the [-1, 1] range. Because it does not shift or center the data, it is most useful when the data is already centered around zero or is sparse, as in the sketch after the code below.
#MaxAbs Scaler
#scales the data to a [-1, 1] range based on the absolute maximum
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
boston_scaled3_df = boston_features_df.copy()
boston_scaled3_df = pd.DataFrame(scaler.fit_transform(boston_scaled3_df),
                                 columns=boston_scaled3_df.columns)
boston_scaled3_df.head()
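To illustrate the sparse-data point, here is a small sketch with a toy SciPy sparse matrix; because MaxAbsScaler only divides (it never shifts the data), the zeros stay zero and the matrix stays sparse:

from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X_sparse = sparse.csr_matrix([[0.0, 0.0, 4.0],
                              [0.0, -2.0, 0.0],
                              [1.0, 0.0, 0.0]])
scaled = MaxAbsScaler().fit_transform(X_sparse)  # each column divided by its max absolute value
print(scaled.toarray())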
Robust Scaler
When the data contains a large number of outliers, the mean and the standard deviation are both affected by them, and scaling with the scalers above may be problematic. In this case, the RobustScaler may work better: it removes the median and scales the data according to a quantile range, by default the interquartile range (25th to 75th percentile). The quantile range used for scaling can be specified, and a small example follows the code below.
#Robust Scaler
#removes the median and scales the data according to the quantile range
from sklearn.preprocessing import RobustScaler

robust = RobustScaler(quantile_range=(10.0, 90.0))  # given in percentiles: use the 10th-90th percentile range
boston_robust_df = boston_features_df.copy()
boston_robust_df = pd.DataFrame(robust.fit_transform(boston_robust_df),
                                columns=boston_robust_df.columns)
boston_robust_df.head()
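A quick sketch of why this matters, using a toy column with one extreme outlier: StandardScaler squeezes the ordinary values together because the outlier inflates the standard deviation, while RobustScaler keeps them on a sensible scale.

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # one extreme outlier
print(StandardScaler().fit_transform(x).ravel())       # ordinary values get crushed together
print(RobustScaler().fit_transform(x).ravel())          # median removed, scaled by the IQR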
Here, I want to quickly demonstrate that the coefficients differ depending on the type of scaler used, but the statistics pertaining to the model are the same. Take a look at the p-values, for example.
I simply printed the OLS regression table for three of the models as a demonstration. The first table contains statistics for the unscaled model, the second shows how the values change (or do not change, in the case of p-values and R-squared) when the Standard Scaler is used, and the third shows the results of using the MaxAbs scaler. A direct comparison of the p-values follows the model output below.
#Fit models to each standardized dataset
import sklearn
from sklearn.model_selection import train_test_split #sklearn import does not automatically install sub packages
from sklearn import linear_model
import statsmodels.api as sm
import numpy as np

#Partition the data
#Create training and test datasets
X1 = boston_features_df
X2 = boston_scaled_df
X3 = boston_scaled2_df
X4 = boston_scaled3_df
X5 = boston_robust_df
Y = boston_target_df

X1_train, X1_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X1, Y, test_size=0.20, random_state=5)
X2_train, X2_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X2, Y, test_size=0.20, random_state=5)
X3_train, X3_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X3, Y, test_size=0.20, random_state=5)
X4_train, X4_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X4, Y, test_size=0.20, random_state=5)
X5_train, X5_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X5, Y, test_size=0.20, random_state=5)

#Train regression model: Unscaled
from sklearn.linear_model import LinearRegression
lin_mod1 = LinearRegression()
lin_mod1.fit(X1_train, Y_train)
#Create predictions using test features
Y1_pred = lin_mod1.predict(X1_test)

###### Standard Scaler
lin_mod2 = LinearRegression()
lin_mod2.fit(X2_train, Y_train)
#Create predictions using test features: Standard Scaler
Y2_pred = lin_mod2.predict(X2_test)

###### MinMax Scaler
lin_mod3 = LinearRegression()
lin_mod3.fit(X3_train, Y_train)
#Create predictions using test features: MinMax Scaler
Y3_pred = lin_mod3.predict(X3_test)

###### MaxAbs Scaler
lin_mod4 = LinearRegression()
lin_mod4.fit(X4_train, Y_train)
#Create predictions using test features: MaxAbs Scaler
Y4_pred = lin_mod4.predict(X4_test)

###### Robust Scaler
lin_mod5 = LinearRegression()
lin_mod5.fit(X5_train, Y_train)
#Create predictions using test features: Robust Scaler
Y5_pred = lin_mod5.predict(X5_test)

#Compute and print fit statistics
import sklearn
from sklearn import metrics

print('Mean Absolute Error (Y1 - Not Scaled):', metrics.mean_absolute_error(Y_test, Y1_pred))
print('Mean Absolute Error (Y2 - Standard Scaler):', metrics.mean_absolute_error(Y_test, Y2_pred))
print('Mean Absolute Error (Y3 - MinMax Scaler):', metrics.mean_absolute_error(Y_test, Y3_pred))
print('Mean Absolute Error (Y4 - MaxAbs Scaler):', metrics.mean_absolute_error(Y_test, Y4_pred))
print('Mean Absolute Error (Y5 - Robust Scaler):', metrics.mean_absolute_error(Y_test, Y5_pred))
print('')
print('Mean Squared Error (Y1 - Not Scaled):', metrics.mean_squared_error(Y_test, Y1_pred))
print('Mean Squared Error (Y2 - Standard Scaler):', metrics.mean_squared_error(Y_test, Y2_pred))
print('Mean Squared Error (Y3 - MinMax Scaler):', metrics.mean_squared_error(Y_test, Y3_pred))
print('Mean Squared Error (Y4 - MaxAbs Scaler):', metrics.mean_squared_error(Y_test, Y4_pred))
print('Mean Squared Error (Y5 - Robust Scaler):', metrics.mean_squared_error(Y_test, Y5_pred))
print('')
print('Root Mean Squared Error (Y1 - Not Scaled):', np.sqrt(metrics.mean_squared_error(Y_test, Y1_pred)))
print('Root Mean Squared Error (Y2 - Standard Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y2_pred)))
print('Root Mean Squared Error (Y3 - MinMax Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y3_pred)))
print('Root Mean Squared Error (Y4 - MaxAbs Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y4_pred)))
print('Root Mean Squared Error (Y5 - Robust Scaler):', np.sqrt(metrics.mean_squared_error(Y_test, Y5_pred)))
Mean Absolute Error (Y1 - Not Scaled): 3.21327049584237
Mean Absolute Error (Y2 - Standard Scaler): 3.2132704958423823
Mean Absolute Error (Y3 - MinMax Scaler): 3.2132704958423863
Mean Absolute Error (Y4 - MaxAbs Scaler): 3.213270495842384
Mean Absolute Error (Y5 - Robust Scaler): 3.213270495842394

Mean Squared Error (Y1 - Not Scaled): 20.869292183770682
Mean Squared Error (Y2 - Standard Scaler): 20.86929218377084
Mean Squared Error (Y3 - MinMax Scaler): 20.86929218377086
Mean Squared Error (Y4 - MaxAbs Scaler): 20.869292183770842
Mean Squared Error (Y5 - Robust Scaler): 20.86929218377092

Root Mean Squared Error (Y1 - Not Scaled): 4.568292042303193
Root Mean Squared Error (Y2 - Standard Scaler): 4.56829204230321
Root Mean Squared Error (Y3 - MinMax Scaler): 4.568292042303213
Root Mean Squared Error (Y4 - MaxAbs Scaler): 4.568292042303211
Root Mean Squared Error (Y5 - Robust Scaler): 4.568292042303219
#Print Model Data
#Model statistics: Unscaled
model1 = sm.OLS(Y_train, sm.add_constant(X1_train)).fit()
print_model1 = model1.summary()
print(print_model1)

#Model statistics: Standard Scaler
model2 = sm.OLS(Y_train, sm.add_constant(X2_train)).fit()
print_model2 = model2.summary()
print(print_model2)

#Model statistics: MinMax Scaler
model3 = sm.OLS(Y_train, sm.add_constant(X3_train)).fit()
print_model3 = model3.summary()
print(print_model3)

#Model statistics: MaxAbs Scaler
model4 = sm.OLS(Y_train, sm.add_constant(X4_train)).fit()
print_model4 = model4.summary()
print(print_model4)

#Model statistics: Robust Scaler
model5 = sm.OLS(Y_train, sm.add_constant(X5_train)).fit()
print_model5 = model5.summary()
print(print_model5)
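As a final sketch of the "p-values do not change" claim, the slope p-values (everything except the intercept) from the fitted statsmodels results above can be compared directly; both comparisons are expected to print True, even though the coefficient values themselves differ.

#Compare slope p-values across the unscaled and scaled models
import numpy as np

print(np.allclose(model1.pvalues.values[1:], model2.pvalues.values[1:]))  # unscaled vs Standard Scaler
print(np.allclose(model1.pvalues.values[1:], model4.pvalues.values[1:]))  # unscaled vs MaxAbs Scaler
print(model1.params.values[1:4])   # the coefficients themselves differ...
print(model2.params.values[1:4])   # ...because each is expressed in the scaled units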