DataSklr


Feature Selection with Python

Goals:

  • Discuss feature selection methods available in Sci-Kit (sklearn.feature_selection), including cross-validated Recursive Feature Elimination (RFECV) and Univariate Feature Selection (SelectKBest);

  • Discuss methods that can inherently be used to select regressors, such as Lasso and decision trees, i.e. embedded models (SelectFromModel);

  • Demonstrate forward and backward feature selection methods using statsmodels.api; and

  • Discuss correlation coefficients as a feature selection tool.

Overview:

In real-world analytics, we often come across a large number of candidate regressors, most of which end up not being useful in regression modeling. Finding the most appropriate set of regressors is a variable selection problem. Unfortunately, variable selection has two conflicting goals: (a) on the one hand, we try to include as many regressors as possible so that we can maximize the explanatory power of our model; (b) on the other hand, we want as few predictors as possible because more regressors can increase the variance of the predictions.

When starting out with a very large feature set, deleting some of the features often results in a model with better precision. This is not surprising: when we retain variables with zero coefficients, or coefficients whose values are smaller than their standard errors, the variance of the parameter estimates and of the predicted response increases unreasonably. However, deleting variables can also introduce bias into the estimates of the coefficients and the response. Fortunately, we can often find a point where the bias introduced by deleting variables is small relative to the reduction in variance, so the error (MSE) associated with the parameter estimates and predictions actually decreases.

Metrics to use when evaluating what to keep or discard:

When evaluating which variables to keep or discard, we need some evaluation criteria. Many people decide on R squared, but other metrics may be better because R squared always increases with the addition of new regressors. Adjusted R squared is a metric that does not necessarily increase with the addition of variables: it only increases if the partial F statistic used to test the significance of the additional regressors is greater than 1. Other metrics may also be used, such as the residual mean square, Mallows' Cp statistic, AIC, and BIC, all of which evaluate model error on the training dataset.
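As a quick illustration (on synthetic data, since no dataset has been introduced at this point, and the variable names below are made up for the example), a fitted statsmodels OLS result exposes all of these quantities:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic example: only the first of three candidate regressors matters.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(100, 3))
    y_demo = 2 * X_demo[:, 0] + rng.normal(size=100)

    fit = sm.OLS(y_demo, sm.add_constant(X_demo)).fit()
    print(fit.rsquared, fit.rsquared_adj)    # R squared vs. adjusted R squared
    print(fit.mse_resid, fit.aic, fit.bic)   # residual mean square, AIC, BIC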

Methods to evaluate what to keep or discard:

Several strategies are available when selecting features for model fitting. Traditionally, programs such as R and SAS offer easy access to forward, backward, and stepwise regressor selection. With a little work, these steps are available in Python as well. I have seen people try filtering methods, where they assess each regressor's correlation with the dependent variable or run univariate tests that evaluate the relationship between each independent regressor and the dependent variable. Filtering is usually based on an arbitrary (or normative) threshold that allows the analyst to discard features. A third group of feature reduction approaches consists of embedded methods, which are designed to remove features without predictive value as part of model fitting. A shrinkage method, Lasso regression (L1 penalty), comes to mind immediately, but partial least squares (a supervised approach) and principal components regression (an unsupervised approach) may also be useful. Lastly, tree-based methods produce a variable importance output, which can be extremely useful when deciding what to keep and what to eliminate.

Stepwise Feature Elimination:

There are three ways to deploy stepwise feature elimination: (a) forward, (b) backward, and (c) stepwise methods.

Forward:

Forward selection starts with no features and inserts them into the regression model one by one. First, the regressor most highly correlated with the dependent variable is selected for inclusion, which is also the regressor that produces the largest F statistic when testing the significance of the model. Subsequent regressors are selected on the basis of partial correlations: technically, these are the correlation coefficients between the residuals of the current model and each candidate regressor after adjusting for the regressors already included. The procedure continues as long as the partial F statistic of the best remaining candidate exceeds a pre-selected F value (called F-to-enter), and terminates otherwise.

Backward:

Backward elimination starts with all regressors in the model. Partial F statistics are calculated as we consider removing regressors one at a time. At each step, the feature with the smallest partial F statistic is removed from the model, and the procedure continues until the smallest remaining partial F statistic is greater than a pre-selected cutoff value of F (called F-to-remove), at which point it terminates. This method sounds particularly appealing when we'd like to see how each variable affects the model.

Stepwise:

Stepwise selection is a hybrid of forward and backward elimination and starts similarly to the forward method, i.e. with no regressors. Features are then selected as described in forward feature selection, but after each step, regressors already in the model are checked for elimination as in backward elimination. The hope is that as we enter new variables that are better at explaining the dependent variable, variables already included may become redundant.

Interestingly, stepwise feature selection methods were not readily available in Python until 2019, and one had to write a custom program. Today, an implementation can be found on GitHub (https://github.com/AakkashVijayakumar/stepwise-regression). Below, I demonstrate how the approach works with the Boston housing data.
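Here is a minimal sketch of p-value based forward selection, written directly with statsmodels.api rather than the GitHub package. Loading the Boston data from the original CMU source is an assumption (load_boston has been removed from recent scikit-learn releases), and the 0.01 entry threshold mirrors the cutoff discussed below.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Boston housing data, read from the original source because load_boston
    # is no longer shipped with scikit-learn.
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                     "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    X = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
                     columns=feature_names)
    y = pd.Series(raw_df.values[1::2, 2], name="MEDV")

    def forward_selection(X, y, threshold_in=0.01):
        """Add the regressor with the smallest p-value at each step,
        as long as that p-value is below threshold_in."""
        included = []
        while True:
            remaining = [c for c in X.columns if c not in included]
            if not remaining:
                break
            pvalues = pd.Series(index=remaining, dtype=float)
            for candidate in remaining:
                model = sm.OLS(y, sm.add_constant(X[included + [candidate]])).fit()
                pvalues[candidate] = model.pvalues[candidate]
            best = pvalues.idxmin()
            if pvalues[best] < threshold_in:
                included.append(best)
            else:
                break
        return included

    print(forward_selection(X, y))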


Running forward feature selection with the entry threshold set at 0.01 means that only variables with p-values below that cutoff are selected; in this case, 11 of the 13 features. A more stringent criterion would eliminate more variables, although the 0.01 cutoff is already fairly stringent.


Now we can assess the backward elimination procedure. It appears that this method selects the same variables, again eliminating INDUS and AGE.
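A corresponding sketch of p-value based backward elimination, reusing X and y from the forward-selection sketch above; the 0.01 cutoff is again an assumption chosen to match the forward run.

    import statsmodels.api as sm

    def backward_elimination(X, y, threshold_out=0.01):
        """Drop the regressor with the largest p-value at each step,
        until every remaining p-value is below threshold_out."""
        included = list(X.columns)
        while included:
            model = sm.OLS(y, sm.add_constant(X[included])).fit()
            pvalues = model.pvalues.drop("const")
            worst = pvalues.idxmax()
            if pvalues[worst] > threshold_out:
                included.remove(worst)
            else:
                break
        return included

    print(backward_elimination(X, y))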


Feature Selection with Sci-Kit:

Several methodologies of feature selection are available in Sci-Kit in the sklearn.feature_selection module. They include Recursive Feature Elimination (RFE) and Univariate Feature Selection. Feature selection using SelectFromModel allows the analyst to make use of L1-based feature selection (e.g. Lasso) and tree-based feature selection.

Recursive Feature Elimination:

A popular feature selection method within sklearn is Recursive Feature Elimination. RFE selects features by considering smaller and smaller sets of regressors. The starting point is the original set of regressors, and less important regressors are recursively pruned from the current set. The procedure is repeated until the desired number of features remains. That number can either be specified a priori or found using cross-validation; in fact, RFE offers a variant, RFECV, designed to find the optimal subset of regressors.
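A sketch of RFE on the same data, reusing X and y from the earlier sketch; the LinearRegression estimator and the choice of five features are assumptions made for illustration.

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
    rfe.fit(X, y)

    print(rfe.support_)                           # boolean mask: which positions were kept
    print(rfe.ranking_)                           # rank 1 means selected
    print(sorted(zip(rfe.ranking_, X.columns)))   # rank paired with feature name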


Variables in positions 4-6, 8, and 11 (a total of five variables) were selected for inclusion in a model. This is because we specified five as the preferred number of features, a setting that can be increased or decreased as needed. At this point, only the positions of the features are printed, not their names.


The code can also print each feature's rank concatenated with its name for easier interpretation. The five-feature threshold was specified arbitrarily, which may or may not be the right choice. In order to identify the optimal set of features, we can use cross-validation, which is luckily available in Sci-Kit as RFECV. I deliberately changed the cv value to 300 folds to produce a different result. The default is 3, which results in all features being selected in the Boston housing dataset. As we increase the number of folds, the task becomes computationally more and more expensive, but the number of variables selected shrinks. In this example, the only feature selected is NOX.
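A sketch of RFECV, again reusing X and y; the LinearRegression estimator is an assumption, and cv=300 mirrors the 300-fold setting described above.

    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LinearRegression

    rfecv = RFECV(estimator=LinearRegression(), cv=300)
    rfecv.fit(X, y)

    # names of the features kept by the cross-validated search
    selected = [name for name, keep in zip(X.columns, rfecv.support_) if keep]
    print(selected)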


Univariate Feature Selection:

Univariate feature selection chooses features based on univariate statistical tests, which evaluate the relationship between each individual regressor and the dependent variable. In the case of a continuous dependent variable, two scoring functions are available: f_regression and mutual_info_regression. Several selectors are available, but two common ways of specifying the removal of features are (a) SelectKBest, which keeps only the k highest scoring features and removes the rest, and (b) SelectPercentile, which allows the analyst to specify a percentage of features to keep, removing all features that do not reach that threshold.
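A sketch of univariate selection with SelectKBest and f_regression, reusing X and y; keeping k=5 features is an arbitrary choice made for illustration.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_regression

    skb = SelectKBest(score_func=f_regression, k=5)
    skb.fit(X, y)

    # F scores for every feature, highest (most important) first
    scores = pd.Series(skb.scores_, index=X.columns).sort_values(ascending=False)
    print(scores)
    print(list(X.columns[skb.get_support()]))   # names of the k selected features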


We can now rank the importance of each feature based on its score: the higher the score, the more important the variable.

A very interesting discussion on StackExchange suggests that the ranks obtained by univariate feature selection using f_regression can also be achieved by computing the correlation coefficient of each individual feature with the dependent variable, one feature at a time. See: https://stats.stackexchange.com/questions/204141/difference-between-selecting-features-based-on-f-regression-and-based-on-r2

However, when the correlation coefficient of each independent variable with the dependent variable is examined at the same step, the ranks are NOT the same. Still, some analysts find the analysis below useful when deciding which features to use: after computing the correlation of each individual regressor with the dependent variable, a threshold helps decide which regressors to keep or discard.
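A sketch of this correlation-based screening, reusing X and y; the 0.6 cutoff mirrors the threshold discussed below.

    # Absolute Pearson correlation of each regressor with the target,
    # then a simple threshold to decide what to keep.
    correlations = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
    print(correlations)
    print(correlations[correlations > 0.6])   # features kept at a 0.6 cutoff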


When the threshold is set at 0.6, only two variables are selected: LSTAT and RM. Their correlation coefficients are listed as well.


Feature Selection Using Shrinkage or Decision Trees:

Lasso (L1) Based Feature Selection:

Several models are designed to reduce the number of features. Lasso, one of the shrinkage methods, for example, shrinks several coefficients to exactly zero, leaving only the features that are truly important. For a more detailed discussion of Lasso and the L1 penalty, see the dedicated post on this site.

Sci-Kit offers SelectFromModel as a tool for running embedded models for feature selection. The selector makes use of a threshold parameter, which can be either user specified or set heuristically based on the mean or median of the feature importances. Below, the code uses Lasso (L1 penalty) to find features for inclusion. I set the threshold to 0.25, which results in six features being selected. To get the names of the selected variables, a mask (or integer index) of the selected features can be obtained by calling get_support().
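A sketch of SelectFromModel with a Lasso estimator, reusing X and y; the alpha value is an assumption, while threshold=0.25 mirrors the setting described above.

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    # Coefficients whose absolute value falls below the threshold are discarded.
    sfm = SelectFromModel(Lasso(alpha=0.1), threshold=0.25)
    sfm.fit(X, y)

    mask = sfm.get_support()          # boolean mask of the selected features
    print(list(X.columns[mask]))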


Tree-based Feature Selection:

Decision trees and other tree-based models produce a variable importance output that can be used to decide which features to select for inclusion. Features that are closer to the root of the tree are more important than those at the end splits, which are less relevant. I am working on a discussion about decision trees, so check back soon.
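A sketch of tree-based selection, reusing X and y from the earlier sketches; the ExtraTreesRegressor and the 'median' threshold are assumptions chosen for illustration.

    import pandas as pd
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.feature_selection import SelectFromModel

    # Fit an ensemble, rank features by importance, and keep those above
    # the median importance.
    sfm_tree = SelectFromModel(ExtraTreesRegressor(n_estimators=100, random_state=0),
                               threshold="median")
    sfm_tree.fit(X, y)

    importances = pd.Series(sfm_tree.estimator_.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))
    print(list(X.columns[sfm_tree.get_support()]))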