Transforming Variables
Transforming variables is often a necessity in regression modeling: both the independent and the dependent variables may need to be transformed, for different reasons.
Transforming the dependent variable:
Homoscedasticity of the residuals is an important assumption of linear regression modeling. One way of moving toward it is to transform the target variable. A skewed or extremely non-normal target will cause problems, so transforming the target is an important part of model building.
Independent variables:
While the independent variables need not be normally distributed, it is extremely important that there is a linear relationship between each regressor and the target (or its logit, in the case of logistic regression). Transformation is one way to fix a non-linearity problem where it exists. Transformations can also help with high-leverage values or outliers.
Note that histograms are not the best way to diagnose non-linearity, though they are useful for visualizing distributions. Scatter plots are a far easier way to assess linearity in the data. The bottom line is that transformation may help with linearity, which in turn helps achieve a better model fit.
From a model-building perspective, transformations can also help reduce the complexity of a model. Eliminating polynomial terms or removing the need for interactions may also lead to a better fit.
Luckily for us, sklearn provides several power transformation methods. Two of the most frequently used are the Box-Cox and Yeo-Johnson procedures. Both belong to the family of power transformations and are used to make distributions more nearly normal. Note that the Box-Cox method requires strictly positive inputs, while Yeo-Johnson works with positive, zero, or negative values.
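As a quick illustration of that difference in input requirements, here is a minimal sketch with made-up numbers (not part of the Boston workflow), using sklearn's PowerTransformer for both methods:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Toy data: one column containing negative and zero values (illustrative only).
X = np.array([[-2.0], [0.0], [1.5], [3.0], [7.0]])

# Yeo-Johnson accepts zero and negative inputs.
yj = PowerTransformer(method="yeo-johnson")
print(yj.fit_transform(X).ravel())

# Box-Cox requires strictly positive inputs; on this data it raises a ValueError.
bc = PowerTransformer(method="box-cox")
try:
    bc.fit_transform(X)
except ValueError as err:
    print("Box-Cox failed:", err)
```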
In this section, I also wanted to demonstrate how to use sklearn's quantile transformer. It too is set up to map data to a normal distribution. Its advantages include reducing the impact of outliers and influential observations. A major drawback is its potential for non-linear distortion, and values that fall outside the fitted range are mapped to the bounds of the distribution.
Working with the Boston Housing data, I ingested the data the same way as before:
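The loading code is not reproduced here; a minimal sketch, assuming the features end up in a DataFrame named boston_features_df (the name boston_target_df for the target is my own choice), could look like the following. Note that load_boston was deprecated and removed in scikit-learn 1.2, so older versions or an alternative source are needed:

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2+

# Load the Boston Housing data into DataFrames, as in earlier sections.
boston = load_boston()
boston_features_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_target_df = pd.DataFrame(boston.target, columns=["MEDV"])
print(boston_features_df.head())
```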
I also ingested a pickled dataset that was prepared in a previous session on standardizing data. Since the Box-Cox transformation requires non-negative data, that dataset was standardized with sklearn's MinMaxScaler, which rescales the data to non-negative values.
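A sketch of reading that pickle; the file name boston_features_minmax.pkl and the DataFrame name boston_minmax_df are illustrative, use whatever path the earlier session saved to:

```python
import pandas as pd

# Read back the MinMax-scaled features pickled in the earlier standardization session.
boston_minmax_df = pd.read_pickle("boston_features_minmax.pkl")
print(boston_minmax_df.describe())  # all values should fall in [0, 1]
```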
Notice that the dataset still contains 0 values, which are also prohibited in Box-Cox transformation. In order to get around this problem, I added 1 to each value.
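Something along these lines, continuing from the sketch above:

```python
# Shift every value up by 1 so the data are strictly positive, as Box-Cox requires.
boston_minmax_plus1_df = boston_minmax_df + 1
print((boston_minmax_plus1_df <= 0).any().any())  # should print False
```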
Note that in this code I also added 1 to a binary variable (CHAS), which was not standardized. We are now ready to apply the Box-Cox transformation to selected variables.
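A sketch of the Box-Cox step using sklearn's PowerTransformer; the column selection (CRIM and NOX) and the name boxcox_df are only illustrative:

```python
from sklearn.preprocessing import PowerTransformer

# Apply Box-Cox to a subset of the shifted, strictly positive columns.
cols = ["CRIM", "NOX"]
bc = PowerTransformer(method="box-cox")
boxcox_df = boston_minmax_plus1_df.copy()
boxcox_df[cols] = bc.fit_transform(boston_minmax_plus1_df[cols])
print(bc.lambdas_)  # fitted lambda for each transformed column
```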
Again, when building a model, one would fit the transformer on the training data only and then apply it to the train and test datasets separately, to avoid leakage!
When executing the Yeo-Johnson transformation, the data need not be strictly positive. As a result, I used the original dataset (boston_features_df).
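A sketch of the Yeo-Johnson step on the original features; the name yeojohnson_df is an assumption of mine:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson tolerates zero and negative values, so the original features can be used directly.
yj = PowerTransformer(method="yeo-johnson")
yeojohnson_df = pd.DataFrame(
    yj.fit_transform(boston_features_df),
    columns=boston_features_df.columns,
)
```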
Finally, I wanted to demonstrate a transformation method that differs from the power transformations described above. The quantile transformation is non-parametric, and one should use it cautiously because of its impact on outliers and values in the tails of distributions. Nevertheless, I find this transformation very useful at times.
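A sketch using sklearn's QuantileTransformer; the n_quantiles value of 500 (just under the 506 rows in the Boston data) and the name quantile_df are my own choices:

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Map each feature onto a standard normal distribution via its empirical quantiles.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=500, random_state=42)
quantile_df = pd.DataFrame(
    qt.fit_transform(boston_features_df),
    columns=boston_features_df.columns,
)
```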
So let us compare how the three transformation methods changed the distribution of our data. To demonstrate the impact of the transformations, I picked two variables, CRIM and NOX, and compared the histograms and scatter plots of each variable under the three transformation methods with the untransformed plots.
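A simplified sketch of such a comparison (one row of CRIM histograms and one row of NOX-versus-MEDV scatter plots), reusing the DataFrames assumed in the sketches above:

```python
import matplotlib.pyplot as plt

# Compare the untransformed and transformed distributions of CRIM and NOX.
versions = {
    "original": boston_features_df,
    "box-cox": boxcox_df,
    "yeo-johnson": yeojohnson_df,
    "quantile": quantile_df,
}
fig, axes = plt.subplots(2, 4, figsize=(16, 7))
for col_idx, (name, df) in enumerate(versions.items()):
    axes[0, col_idx].hist(df["CRIM"], bins=30)
    axes[0, col_idx].set_title(f"CRIM ({name})")
    axes[1, col_idx].scatter(df["NOX"], boston_target_df["MEDV"], s=8)
    axes[1, col_idx].set_title(f"NOX vs MEDV ({name})")
plt.tight_layout()
plt.show()
```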