Ensemble Learning
TABLE OF CONTENTS:
In order to arrive at the most accurate prediction, machine learning models are built, tuned and compared against each other. The reader can click on the links below to access the models or sections of the exercise. Each section has a short explanation of theory and a description of applied machine learning with Python:
Ensemble Learning (Current Blog)
OBJECTIVES:
This blog is part of a series showcasing applied machine learning models in a classification setting. By clicking on any of the tabs above, the reader can navigate to other methods of analysis applied to the same data. This was designed so that one could see, from soup to nuts, what a data scientist would do when faced with a problem like the one presented here. Note that the overall focus of this blog is Ensemble Learning.
Classification Trees:
Before getting too deep in the forest (Ha!), I wanted to spend a few minutes on classification trees. They are similar to regression trees but have a qualitative response. At the end of 2019, I wrote about regression trees and pointed out that the predicted response for an observation is the mean of all training observations in the same terminal node. (Click here to navigate to the blog.) In a classification problem, the predicted class of an observation is the most frequently occurring class among the training observations in its terminal node.
The similarities do not end here. Growing trees in a classification setting is also similar to growing trees for regression. Recursive binary splitting is used to grow the tree, but the natural criterion for binary splits is the classification error rate (instead of RSS for regression): the fraction of training observations in a region that do not belong to the most common class. In practice, two other measures are preferred, the Gini index and entropy, both of which can be tried when tuning models with Scikit-Learn.
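To make the two impurity measures concrete, here is a minimal sketch; it assumes the training and test splits (X_train, y_train, X_test, y_test) are already defined:

```python
from sklearn.tree import DecisionTreeClassifier

# The split criterion is a single constructor argument in Scikit-Learn,
# so both impurity measures are easy to try when tuning a tree.
tree_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
tree_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)

tree_gini.fit(X_train, y_train)
tree_entropy.fit(X_train, y_train)

# score() returns mean accuracy, so the two criteria can be compared directly
print(tree_gini.score(X_test, y_test))
print(tree_entropy.score(X_test, y_test))
```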
Ensemble Learning:
Similar to regression, individual classification trees can be grouped and tuned as ensemble models. Ensemble methods cultivate many trees and combine their predictions. Individual trees have high variance but low bias. When the individual trees' predictions are aggregated, whether by averaging predicted probabilities or by majority vote, the variance of the ensemble is usually greatly reduced. In this blog, the following ensemble models for classification will be discussed:
bagging
various boosting techniques
random forests
BAGGING:
The bootstrap approach can be used to dramatically improve the performance of decision trees. Decision trees often suffer from relatively high variance, which results in poor accuracy measured on the test data. Bootstrap aggregation, or bagging, means that repeated samples are drawn with replacement from a single training dataset. A classification tree is fitted separately on each sample, and each model is used for predictions. In classification, these predictions are then combined by majority vote (the regression analogue averages them).
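To make the mechanics concrete, here is a minimal sketch of bagging built by hand; it assumes X_train, y_train and X_test are NumPy arrays and that class labels are coded as non-negative integers. In practice, Scikit-Learn's BaggingClassifier (used later in this blog) does the same work:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees, n_obs = 100, len(X_train)
predictions = []

for _ in range(n_trees):
    # Draw a bootstrap sample: n_obs row indices, sampled with replacement
    idx = rng.integers(0, n_obs, size=n_obs)
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))

# Majority vote across the trees for each test observation
predictions = np.array(predictions)              # shape: (n_trees, n_test)
majority_vote = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions)
```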
BOOSTING:
Boosting is also an ensemble method that can be used in both regression and classification settings. During boosting, a sequence of trees is grown, and each successive tree is fit to the residuals of the preceding trees; note that it is the residuals, not the dependent variable, that drive the sequential tree fitting. Bootstrapping is not used: each tree is fit on a modified version of the original data set, and the residuals are updated after each tree is added. Boosting has three main tuning parameters: the number of trees, a shrinkage parameter, and the number of splits per tree. Cross-validation should be used to decide on the number of trees. The shrinkage parameter is usually a small number (0.01 or 0.001) and must be evaluated in the context of the number of trees used, because a very small shrinkage parameter typically requires a large number of trees. The number of splits in each tree controls the complexity of the ensemble model; it is often sufficient to create a tree stump by setting this parameter to one.
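These three knobs map directly onto Scikit-Learn's gradient boosting parameters; the values below are illustrative for the sketch, not tuned results:

```python
from sklearn.ensemble import GradientBoostingClassifier

boost = GradientBoostingClassifier(
    n_estimators=500,    # number of trees; decide via cross-validation
    learning_rate=0.01,  # shrinkage parameter; small values need more trees
    max_depth=1,         # one split per tree, i.e. a tree stump
    random_state=42)
```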
Several versions of boosting are available for ensemble learning: (a) Adaptive Boosting, (b) Gradient Boosting, (c) Extreme Gradient Boosting, and (d) LightGBM.
RANDOM FORESTS:
Similar to bagging, random forests build multiple classification trees on bootstrapped training samples. However, at each split in a tree, a random subset of predictors is chosen from the full set of predictors, and the split may use only predictors from that subset. The size of the random subset is typically the square root of the total number of features. Because a fresh sample of predictors is taken at each split, the majority of predictors are never even considered at a given split. As a result, the individual trees are less correlated with one another, which makes averaging them far more effective at reducing variance.
Charity Analysis:
First, load all required libraries and data. Note that I partitioned the data, then standardized it and removed all collinear features based on VIF scores. The Python code for this work can be found in one of the prior blogs; please click Linear and Quadratic Discriminant Analysis to see it.
I saved the treated datasets as pickles because some of the steps were very time consuming, so here we just need to read the pickles. The first pair is the original train and test datasets, the second pair holds the second order polynomial sets, and the third pair contains the third order polynomial sets.
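A sketch of the loading step; the pickle file names below are placeholders for however the treated datasets were saved:

```python
import pandas as pd

# Original (1st degree) train and test sets
X_train = pd.read_pickle('X_train.pkl')
X_test = pd.read_pickle('X_test.pkl')
# Second order polynomial and interaction sets
X_train2 = pd.read_pickle('X_train_poly2.pkl')
X_test2 = pd.read_pickle('X_test_poly2.pkl')
# Third order polynomial and interaction sets
X_train3 = pd.read_pickle('X_train_poly3.pkl')
X_test3 = pd.read_pickle('X_test_poly3.pkl')
# The response is the same for every feature set
y_train = pd.read_pickle('y_train.pkl')
y_test = pd.read_pickle('y_test.pkl')
```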
Random Forest:
The first model is untuned; I placed the code here simply to show what the code structure looks like.
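A minimal sketch of what such an untuned fit looks like:

```python
from sklearn.ensemble import RandomForestClassifier

# Default settings: 100 trees, and a random subset of sqrt(p)
# predictors considered at each split
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # baseline accuracy before any tuning
```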
Now, we are ready to tune the random forest classifier using GridSearchCV. Only a handful of parameters are tuned below, but a complete list of tunable parameters can be obtained by clicking here. I also listed them in the code chunk below.
There are three fitting procedures below, one for each training data set (i.e. the sets with 1st, 2nd and 3rd degree polynomials, respectively). The random forest module of Scikit-Learn offers several attributes; I printed two of them below to see the values of the tuned parameters and the best model's score.
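A sketch of the tuning loop; the grid values are illustrative, not the exact ones used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}

# One fitting procedure per training set (1st, 2nd and 3rd degree polynomials)
for X in (X_train, X_train2, X_train3):
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid, cv=5, n_jobs=-1)
    grid.fit(X, y_train)
    print(grid.best_params_)  # values of the tuned parameters
    print(grid.best_score_)   # best cross-validated accuracy
```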
Gradient Boosting:
First, we fit a generic untuned model to get a baseline accuracy score, and then we tune this model's parameters using GridSearchCV as well. Again, a list of tunable parameters can be obtained by clicking the Gradient Boosting link here. The best model's parameter values are printed along with the best model's score.
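A sketch of both steps, with illustrative grid values:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Baseline: generic untuned model for a reference accuracy score
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print(gb.score(X_test, y_test))

param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.001, 0.01, 0.1],
    'max_depth': [1, 2, 3],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```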
Adaptive Boosting:
We will follow the exact same recipe: First get a generic AdaBoost model, then tune the parameters, and print the best model’s parameter values and score. The complete list of tunable parameters can be found by clicking on the AdaBoost Classifier link.
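The same recipe as a sketch (grid values illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Generic AdaBoost model as a baseline
ada = AdaBoostClassifier(random_state=42).fit(X_train, y_train)
print(ada.score(X_test, y_test))

param_grid = {
    'n_estimators': [50, 200, 500],
    'learning_rate': [0.01, 0.1, 1.0],
}
grid = GridSearchCV(AdaBoostClassifier(random_state=42),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```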
Extreme Gradient Boosting Classifier:
The XGBoost library is not part of the Scikit-Learn system but integrates very well with it. In fact, the documentation is very similar, and a list of tunable parameters, attributes, and a host of other specifications can be found on the XGBoost API reference page.
In practice, XGBoost can be integrated with GridSearchCV in the exact same manner as any of the other ensemble methods presented in this blog.
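A sketch of that integration, assuming the xgboost package is installed (grid values illustrative):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.1],
    'max_depth': [1, 3, 5],
}

# XGBClassifier follows the Scikit-Learn estimator interface,
# so it drops straight into GridSearchCV
grid = GridSearchCV(XGBClassifier(eval_metric='logloss', random_state=42),
                    param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```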
Bagging:
The final competing ensemble model is a set of bagged classification trees. The specification and list of tunable parameters for bagged classification trees can be found by clicking on the Bagging link.
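A sketch of the bagging fit (note that the estimator argument was called base_estimator in Scikit-Learn versions before 1.2):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 200, 500],
    'max_samples': [0.5, 1.0],  # fraction of the training set per bootstrap
}

bag = BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)
grid = GridSearchCV(bag, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```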
Results:
For each model type, we have four models: (a) a default model without any tuning using the original training data, (b) a tuned model based on the original training data, (c) a tuned model based on second order polynomials and interaction terms, and (d) a tuned model based on third order polynomials and interaction terms. Remember that, in each dataset, collinear features were removed. The accuracy score (based on test observations) of each competing model was computed and the best model was identified: BEST MODEL - Tuned Adaptive Boosting based on the original training data.
The accuracy of this model is 91.3%.
Descriptions of Winning Models by Ensemble Type:
Confusion Matrix:
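The matrix itself takes only a couple of lines; a sketch, assuming the tuned winning model is stored as best_model:

```python
from sklearn.metrics import confusion_matrix

y_pred = best_model.predict(X_test)
# Rows are the true classes, columns the predicted classes
print(confusion_matrix(y_test, y_pred))
```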
Other Accuracy Statistics:
Finally, we get to one of the most important questions: how to choose a winning model. Many people simply look at the area under the ROC curve and go with the highest number. Here, I wanted to compute accuracy, precision, recall, F1 score and ROC score for all winning models by ensemble type, and then choose an overall winner among the ensemble methods. The accuracy scores of AdaBoost and Gradient Boosting were equal and highest at 91.3%. AdaBoost had the higher precision (91.4%), while Gradient Boosting had the better recall (92%). The highest ROC score was achieved by AdaBoost (91.34). Interestingly, the highest F1 score was generated by Gradient Boosting (91.4); therefore, I would probably choose Gradient Boosting as my champion model, given that our goal is to generate maximum return on investment with the direct mail campaign.
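As a closing sketch, here is how these statistics can be computed in one pass; the winners dictionary (mapping each ensemble type to its tuned model) is an assumed name:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

for name, model in winners.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probability
    print(name,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred),
          recall_score(y_test, y_pred),
          f1_score(y_test, y_pred),
          roc_auc_score(y_test, y_prob))
```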