Scaling, Centering and Standardization
Centering or scaling variables can be advantageous in regression, although how, when, and what to standardize often seems to be a matter of convention that varies with a scientist's field and background. See this article for further thoughts: https://statmodeling.stat.columbia.edu/2009/07/11/when_to_standar/
First, when regression is used to explain a phenomenon, interpreting the y-intercept is important. Centering gives each predictor a mean of zero, so the intercept becomes the expected value of Y when every predictor is at its mean. Without centering, the intercept is the expected value of Y when all predictors are set to zero, which may not be feasible or sensible (a predictor such as body weight, for instance, can never be zero).
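As a quick sketch of why this matters (using statsmodels and a made-up body-weight example, so the numbers are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
weight = rng.normal(70, 10, 100)   # hypothetical predictor: body weight in kg
y = 2.0 + 0.5 * weight + rng.normal(0, 5, 100)

# Uncentered: the intercept is the expected Y at weight == 0 (not meaningful)
uncentered = sm.OLS(y, sm.add_constant(weight)).fit()

# Centered: the intercept is the expected Y at the mean weight
centered = sm.OLS(y, sm.add_constant(weight - weight.mean())).fit()

print(uncentered.params)  # intercept near 2.0, slope near 0.5
print(centered.params)    # intercept near y.mean(), identical slope
```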
Second, centering also allows easier interpretation of coefficients when regressors are measured on very different magnitudes, e.g. when one regressor is measured on a very small scale while another is on a very large one. Standardization puts the regression coefficients on a common scale (standard-deviation units), so their magnitudes can be compared directly. Luckily, centering or scaling has no impact on p-values, so the regression model statistics can be interpreted the same way as if no centering or scaling had taken place.
Third, when creating sums or averages of variables measured on different scales, it may be important to scale the variables so that they share the same unit.
Finally, other methods, such as Principal Component Analysis (PCA), may require centering or scaling. Loadings in PCA are the weights by which each standardized original variable is multiplied to obtain a component score. Standardization affects the eigenvectors that define the orthogonal rotation of the variables into principal components; without it, variables with larger variances would dominate the components. For now, we will focus on regression only.
Feature scaling is relatively easy with Python. Note that it is recommended to split the data into training and test sets BEFORE scaling. If scaling is done before partitioning, the data is scaled around the mean of the entire sample, which may differ from the means of the training and test sets, leaking information from the test set into the model.
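A minimal sketch of the recommended order of operations, assuming a generic feature matrix X and target y:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # placeholder features
y = rng.normal(size=100)        # placeholder target

# Split first, so the scaler never sees the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # apply the training-set mean/std to the test set
```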
Standardization:
One of the most frequently used methods of scaling is standardization. During standardization, we subtract the mean from each value and then divide by the standard deviation. The result is data with a mean of zero and a standard deviation of one.
Standardization can improve the performance of models. For example, an unstandardized feature with a very large variance may dominate others, which may result in subpar model performance.
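In code, standardization is just the familiar z-score; a minimal NumPy sketch:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

z = (x - x.mean()) / x.std()  # subtract the mean, divide by the standard deviation

print(z.mean())  # ~0.0
print(z.std())   # 1.0
```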
scikit-learn offers several scalers: a.) StandardScaler, b.) MinMaxScaler, c.) MaxAbsScaler and d.) RobustScaler.
Standard Scaler
StandardScaler, as its name suggests, is the most standard, garden-variety standardization tool. It centers the data by subtracting the mean of a variable from each observation, then scales it by dividing by the variable's standard deviation. It is also possible to skip the centering step and only divide by the standard deviation; in this case the with_mean parameter needs to be set to False.
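A quick sketch of StandardScaler on a toy two-column array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column is centered and scaled independently

print(X_scaled.mean(axis=0))  # [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```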
MinMax Scaler
The MinMaxScaler scales features to a predetermined range. It subtracts the smallest value of a variable from each observation and divides by the variable's range (its maximum minus its minimum); the result is then mapped onto the feature_range parameter, which defaults to 0-1. The scaler is best used for non-normal distributions, but its drawback is its sensitivity to outliers.
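A small sketch with the default range and a custom one:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])

# Default feature_range=(0, 1): (x - min) / (max - min)
print(MinMaxScaler().fit_transform(X).ravel())                      # [0.  0.5 1. ]

# A custom range simply rescales the result
print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(X).ravel())  # [-1.  0.  1.]
```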
MaxAbs Scaler
The MaxAbsScaler divides each observation within a variable by the variable's maximum absolute value, so the output falls in the range -1 to 1. The MaxAbs scaler is most useful when the data is already centered around zero and is sparse.
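A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-4.0], [2.0], [8.0]])

# Each value is divided by the column's maximum absolute value (here 8)
print(MaxAbsScaler().fit_transform(X).ravel())  # [-0.5   0.25  1.  ]
```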
Robust Scaler
When the data contains a large number of outliers, the mean and standard deviation are both distorted by them, and scaling with the above scalers may be problematic. In this case, the RobustScaler may work better because it removes the median and scales the data according to a quantile range (by default the interquartile range, i.e. the 25th to 75th percentile). The quantile range used for scaling can be specified with the quantile_range parameter.
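A sketch with one extreme outlier; the default quantile_range of (25.0, 75.0) is the interquartile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier

# Centers on the median and scales by the IQR, so the outlier
# has little influence on how the bulk of the data is scaled
scaler = RobustScaler(quantile_range=(25.0, 75.0))
print(scaler.fit_transform(X).ravel())  # [-1.  -0.5  0.   0.5 48.5]
```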
Here, I wanted to quickly demonstrate that the coefficients will differ depending on the type of scaler used, while the statistics pertaining to the model stay the same. Take a look at the p-values, for example.
I simply printed the OLS regression table for three models as a demonstration. The first table contains statistics for the unscaled model; the second shows how the values change (or, in the case of p-values and R-squared, do not change) when the StandardScaler is used; and the third shows the results of using the MaxAbsScaler.
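The tables themselves are not reproduced here, but a minimal sketch of that comparison (on made-up data, since the original dataset is not shown) might look like this:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

rng = np.random.default_rng(1)
X = rng.normal(50, 10, size=(200, 2))                      # made-up predictors
y = 3 + X @ np.array([0.5, -1.2]) + rng.normal(0, 5, 200)

for name, Xv in [("unscaled", X),
                 ("StandardScaler", StandardScaler().fit_transform(X)),
                 ("MaxAbsScaler", MaxAbsScaler().fit_transform(X))]:
    model = sm.OLS(y, sm.add_constant(Xv)).fit()
    # print(model.summary()) shows the full OLS regression table
    print(name, model.params.round(3), model.pvalues.round(4))
    # The coefficients differ across scalers; the slope p-values and R-squared do not
```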