Transforming Variables
Transforming variables is often a necessity in regression modeling: both the independent and the dependent variables may need to be transformed, for different reasons.
Transforming the dependent variable:
Homoscedasticity of the residuals is an important assumption of linear regression modeling. One way of moving toward it is to transform the target variable. A skewed or extremely non-normal target will cause problems, so transforming the target is an important part of model building.
Independent variables:
While the independent variables need not be normally distributed, it is extremely important that there is a linear relationship between each regressor and the target (or its logit, in the case of logistic regression). Transformation is one way to fix a non-linearity problem where it exists. Transformations can also help with high-leverage values or outliers.
Note that histograms are not the best way to diagnose non-linearity, though they are useful for visualizing distributions. Scatter plots are a far easier way to assess linearity in the data. The bottom line is that transformation may help with linearity, which in turn helps achieve a better model fit.
From a model-building perspective, transformations can also help reduce the complexity of a model. Eliminating polynomial terms or removing the need for interactions may also lead to a better fit.
Luckily for us, sklearn provides several power transformation methods. Two of the most frequently used are the Box-Cox and Yeo-Johnson procedures. Both belong to the family of power transformations and are used to make distributions more nearly normal. Note that the Box-Cox method requires strictly positive inputs, while Yeo-Johnson works with positive, zero, or negative values.
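As a quick illustration of that difference in input requirements, here is a minimal sketch with made-up numbers (not part of the Boston workflow), using sklearn's PowerTransformer for both methods:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Toy data: one column containing negative and zero values (illustrative only).
X = np.array([[-2.0], [0.0], [1.5], [3.0], [7.0]])

# Yeo-Johnson accepts zero and negative inputs.
yj = PowerTransformer(method="yeo-johnson")
print(yj.fit_transform(X).ravel())

# Box-Cox requires strictly positive inputs; on this data it raises a ValueError.
bc = PowerTransformer(method="box-cox")
try:
    bc.fit_transform(X)
except ValueError as err:
    print("Box-Cox failed:", err)
```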
In this section, I also wanted to demonstrate how to use sklearn's quantile transformer. It too is set up to map data to a normal distribution. Its advantages include reducing the impact of outliers and influential observations. A major drawback is its potential for non-linear distortion, and values that fall outside the fitted range are mapped to the bounds of the distribution.
Working with the Boston Housing data, I ingested the data the same way as before:
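The loading code is not reproduced here; a minimal sketch, assuming the features end up in a DataFrame named boston_features_df (the name boston_target_df for the target is my own choice), could look like the following. Note that load_boston was deprecated and removed in scikit-learn 1.2, so older versions or an alternative source are needed:

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2+

# Load the Boston Housing data into DataFrames, as in earlier sections.
boston = load_boston()
boston_features_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_target_df = pd.DataFrame(boston.target, columns=["MEDV"])
print(boston_features_df.head())
```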
I also ingested a pickled dataset that was prepared in a previous session on standardizing data. Since the Box-Cox transformation requires non-negative data, that dataset was standardized with sklearn's MinMaxScaler, which rescales the data to non-negative values.
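A sketch of reading that pickle; the file name boston_features_minmax.pkl and the DataFrame name boston_minmax_df are illustrative, use whatever path the earlier session saved to:

```python
import pandas as pd

# Read back the MinMax-scaled features pickled in the earlier standardization session.
boston_minmax_df = pd.read_pickle("boston_features_minmax.pkl")
print(boston_minmax_df.describe())  # all values should fall in [0, 1]
```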
Notice that the dataset still contains 0 values, which are also prohibited in Box-Cox transformation. In order to get around this problem, I added 1 to each value.
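Something along these lines, continuing from the sketch above:

```python
# Shift every value up by 1 so the data are strictly positive, as Box-Cox requires.
boston_minmax_plus1_df = boston_minmax_df + 1
print((boston_minmax_plus1_df <= 0).any().any())  # should print False
```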
Note that in this code I also added 1 to a binary variable (CHAS), which was not standardized. We are now ready to apply the Box-Cox transformation to selected variables.
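A sketch of the Box-Cox step using sklearn's PowerTransformer; the column selection (CRIM and NOX) and the name boxcox_df are only illustrative:

```python
from sklearn.preprocessing import PowerTransformer

# Apply Box-Cox to a subset of the shifted, strictly positive columns.
cols = ["CRIM", "NOX"]
bc = PowerTransformer(method="box-cox")
boxcox_df = boston_minmax_plus1_df.copy()
boxcox_df[cols] = bc.fit_transform(boston_minmax_plus1_df[cols])
print(bc.lambdas_)  # fitted lambda for each transformed column
```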
Again, when building a model, one would fit the transformer on the training data only and then apply it to the train and test datasets separately, to avoid leakage!
When executing the Yeo-Johnson transformation, the data need not be strictly positive. As a result, I used the original dataset (boston_features_df).
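A sketch of the Yeo-Johnson step on the original features; the name yeojohnson_df is an assumption of mine:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson tolerates zero and negative values, so the original features can be used directly.
yj = PowerTransformer(method="yeo-johnson")
yeojohnson_df = pd.DataFrame(
    yj.fit_transform(boston_features_df),
    columns=boston_features_df.columns,
)
```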
Finally, I wanted to demonstrate a transformation method that differs from the power transformations described above. The quantile transformation is non-parametric, and one should use it cautiously because of its impact on outliers and values in the tails of distributions. Nevertheless, I find this transformation very useful at times.
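A sketch using sklearn's QuantileTransformer; the n_quantiles value of 500 (just under the 506 rows in the Boston data) and the name quantile_df are my own choices:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Map each feature onto a standard normal distribution via its empirical quantiles.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=500, random_state=42)
quantile_df = pd.DataFrame(
    qt.fit_transform(boston_features_df),
    columns=boston_features_df.columns,
)
```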
So let us compare how the three transformation methods changed the distribution of our data. To demonstrate the impact of the transformations, I picked two variables, CRIM and NOX, and compared the histograms and scatter plots of each variable under the three transformation methods with the untransformed plots.
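A simplified sketch of such a comparison (one row of CRIM histograms and one row of NOX-versus-MEDV scatter plots), reusing the DataFrames assumed in the sketches above:

```python
import matplotlib.pyplot as plt

# Compare the untransformed and transformed distributions of CRIM and NOX.
versions = {
    "original": boston_features_df,
    "box-cox": boxcox_df,
    "yeo-johnson": yeojohnson_df,
    "quantile": quantile_df,
}
fig, axes = plt.subplots(2, 4, figsize=(16, 7))
for col_idx, (name, df) in enumerate(versions.items()):
    axes[0, col_idx].hist(df["CRIM"], bins=30)
    axes[0, col_idx].set_title(f"CRIM ({name})")
    axes[1, col_idx].scatter(df["NOX"], boston_target_df["MEDV"], s=8)
    axes[1, col_idx].set_title(f"NOX vs MEDV ({name})")
plt.tight_layout()
plt.show()
```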