Dimension Reduction: Principal Components and Partial Least Squares Regression
Objectives:
Explain how principal components regression and partial least squares regression work.
Show Python code to perform Principal Components Regression and Partial Least Squares Regression.
Overview:
Principal Components Regression (PCR) and Partial Least Squares Regression (PLS) are two more alternatives to simple linear model fitting that often produce a model with better fit and higher accuracy. Both are dimension reduction methods, but PCR takes an unsupervised approach, while PLS is a supervised alternative.
Principal Components Regression:
The approach reduces the original predictors to a smaller number of dimensions using principal components analysis. These principal components are then used as regressors when fitting a new OLS model.
Since a relatively small number of principal components often explains a large share of the variability in the data, the approach may be sufficient to capture the relationship between the target variable and the principal components constructed from the larger set of regressor variables.
One drawback of PCR is that it relies on an unsupervised approach to feature reduction: Principal Components Analysis. PCA sets out to find the linear combinations that best describe the original regressors. Since these linear combinations are found without using the target variable, we cannot be certain that the principal components we created are the best ones for predicting the target variable. It is entirely possible that a different set of linear combinations would perform better. The solution to this problem is Partial Least Squares Regression (more about that later).
Still, the PCA approach is a good way to overcome multicollinearity problems in OLS models. Further, since PCA is a dimension reduction approach, PCR may be a good way of attacking problems with high-dimensional covariates.
PCR follows three steps:
1. Find principal components from the data matrix of original regressors.
2. Regress the outcome variable on the selected principal components, which are linear combinations of the original regressors, using OLS.
3. Transform the fitted coefficients back to the scale of the original covariates using the PCA loadings, which yields the PCR estimates of the regression coefficients.
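To make these three steps concrete, here is a minimal sketch using NumPy and scikit-learn; the data matrix X, the target y, and the number of retained components k are hypothetical placeholders, not the Boston data used later.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Hypothetical inputs: X is an (n, p) matrix of standardized regressors, y the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

k = 3  # number of principal components to keep (an illustrative choice)

# Step 1: find the principal components of the regressors.
pca = PCA(n_components=k)
Z = pca.fit_transform(X)          # scores: the new regressors

# Step 2: regress the outcome on the principal components with OLS.
ols = LinearRegression().fit(Z, y)

# Step 3: map the fitted coefficients back to the original covariates
# using the PCA loadings, giving the PCR estimates of the coefficients.
beta_pcr = pca.components_.T @ ols.coef_
print(beta_pcr)
```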
PCR vs. Shrinkage:
Sometimes PCR outperforms shrinkage models (in terms of model error), while other times shrinkage models are better. If relatively few principal components are needed to explain the variance in the data, PCR tends to outperform shrinkage methods such as ridge, lasso, or elastic net. If more principal components are required, shrinkage methods tend to perform better.
Principal Components Regression and the Boston Housing Data:
First, we need to load all required libraries, most of them from the sklearn package.
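As a sketch, the imports used throughout the examples below might look like this; the exact list depends on which steps you reproduce.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PowerTransformer, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error
```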
Process the data by transforming all variables using a Box-Cox transformation.
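One way to apply a Box-Cox transformation in scikit-learn is PowerTransformer(method='box-cox'). The sketch below assumes the Boston predictors sit in a numeric DataFrame called boston_X and the target in a Series called boston_y (hypothetical names; load the data from your preferred source). Box-Cox requires strictly positive values, so columns containing zeros or negatives are shifted first.

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# boston_X (predictors) and boston_y (target) are assumed to exist already.
# Box-Cox requires strictly positive values, so shift any column that is not.
shifted = boston_X.copy()
for col in shifted.columns:
    m = shifted[col].min()
    if m <= 0:
        shifted[col] = shifted[col] - m + 1

# Transform (and standardize) the predictors, then transform the target.
pt = PowerTransformer(method="box-cox", standardize=True)
X_bc = pd.DataFrame(pt.fit_transform(shifted), columns=shifted.columns)
y_bc = pd.Series(
    PowerTransformer(method="box-cox").fit_transform(boston_y.to_frame()).ravel(),
    name="target",
)
```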
Partition the data and create polynomial features and interaction terms, then select a subset of the resulting variables (p = 29).
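A possible sketch of this step, continuing with the X_bc and y_bc created above; PolynomialFeatures generates the squared and interaction terms, and the reduction to 29 columns is only a placeholder for whatever subset was actually chosen.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Squared terms and pairwise interactions of the transformed predictors.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X_bc),
                      columns=poly.get_feature_names_out(X_bc.columns))

# Keep p = 29 predictors (taking the first 29 here is purely illustrative).
X_sel = X_poly.iloc[:, :29]

# Hold out a test set for later model comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y_bc, test_size=0.3, random_state=42)
```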
Sklearn does not have a dedicated PCR module, so we need to run PCA first and then use the principal components in an OLS regression.
Note that I only included 29 variables, so the maximum number of principal components is 29, i.e., one component for every variable. Obviously we need to cut this number down, but for now I wanted to print the first five:
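A sketch of the PCA step, using the X_train created above; the transformed scores, one column per principal component, will serve as the regressors.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA on the training predictors; with p = 29 columns there can be
# at most 29 principal components.
pca = PCA()
Z_train = pca.fit_transform(X_train)

# Inspect the scores on the first five principal components.
print(pd.DataFrame(Z_train[:, :5],
                   columns=[f"PC{i + 1}" for i in range(5)]).head())
```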
Next, we run cross-validation to calculate the MSE of the model as we place more and more principal components in the linear regression.
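A sketch of the cross-validation loop, assuming the Z_train scores and y_train from above; for each k it records the 10-fold cross-validated MSE of OLS fitted on the first k principal components.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

cv_mse = []
for k in range(1, Z_train.shape[1] + 1):
    scores = cross_val_score(LinearRegression(), Z_train[:, :k], y_train,
                             scoring="neg_mean_squared_error", cv=10)
    cv_mse.append(-scores.mean())  # flip the sign back to a plain MSE

plt.plot(range(1, len(cv_mse) + 1), cv_mse, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cross-validated MSE")
plt.show()
```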
The resulting graph shows that the MSE decreases rapidly as the first few principal components are added, but the improvement tails off after about the 6th principal component, and the MSE remains essentially unchanged once 10 or more principal components are included in the model.
We can also calculate and plot the cumulative variance explained as more and more principal components are added. The first principal component explains 27% of the variance, while the first seven together capture 85%, so we do not need to go beyond that number to summarize the predictors well.
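The cumulative variance explained comes straight from the fitted PCA object; a short sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

# Cumulative share of variance explained as components are added.
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var[:7])  # variance captured by the first seven components

plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative variance explained")
plt.show()
```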
Now we can fit an OLS model with only the selected number of principal components and calculate the model's MSE on the test data. The test MSE was 0.32109279891577586. The test MSE of ridge regression on the same data was 0.20453122412502994 (see the post on regularized regression), so PCR did not give us a better model than ridge regression this time. This is not surprising, because a relatively large number of principal components was needed, which usually favors ridge regression.
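A sketch of the final PCR fit and its evaluation; the choice of seven components is an assumption based on the plots described above, and the test predictors are projected with the PCA fitted on the training data.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

n_comp = 7  # number of components kept (an illustrative choice)

# Project the test predictors onto the training-set principal components.
Z_test = pca.transform(X_test)

pcr = LinearRegression().fit(Z_train[:, :n_comp], y_train)
pred = pcr.predict(Z_test[:, :n_comp])
print("PCR test MSE:", mean_squared_error(y_test, pred))
```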
Partial Least Squares:
PLS is also a feature reduction method, but it offers a supervised alternative to PCR. The new features are again linear combinations of the original regressors, but the method uses the target variable when computing them. As a result, the new features not only approximate the original regressors well but also help explain the response variable.
The method first requires the standardization of all predictors. PLS then computes the first linear combination of features by setting the weight on each predictor (the constants used in computing the Z values, where the Z values represent the linear combinations of the original predictors) equal to the coefficient from a simple OLS regression of the target on that predictor. In this way, PLS puts more weight on variables that are more highly correlated with the target variable.
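One way to choose the number of PLS components is cross-validation with scikit-learn's PLSRegression, which standardizes the predictors internally; this sketch reuses the hypothetical X_train and y_train from the PCR section.

```python
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Cross-validated MSE of PLS for an increasing number of components.
pls_mse = []
for k in range(1, X_train.shape[1] + 1):
    scores = cross_val_score(PLSRegression(n_components=k), X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
    pls_mse.append(-scores.mean())

plt.plot(range(1, len(pls_mse) + 1), pls_mse, marker="o")
plt.xlabel("Number of PLS components")
plt.ylabel("Cross-validated MSE")
plt.show()
```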
Now we are ready to compute the test MSE achieved by the model. Here we include only the first six components because the test MSE is minimized at that level. The test MSE of this model, 0.20540507220365029, is considerably better than that of PCR. In fact, it is only slightly worse than that of ridge regression (0.20453122412502994).
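A sketch of the final PLS fit, again using the hypothetical train/test split from above:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

# Fit PLS with six components and evaluate on the held-out test set.
pls = PLSRegression(n_components=6)
pls.fit(X_train, y_train)
pred = pls.predict(X_test).ravel()
print("PLS test MSE:", mean_squared_error(y_test, pred))
```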