Ensemble Methods: Bagging and Random Forests

Objectives:

  • Gain an understanding of bootstrapping;

  • Learn how to use scikit-learn to build ensemble learning models;

  • Demonstrate the use of ensemble learning models for regression:

    • Bagging;

    • Random forests;

  • Compare bagging and random forests.

Bagging:

The bootstrap approach can be used to dramatically improve the performance of decision trees. Decision trees often suffer from relatively high variance, which results in poor accuracy on test data. Bootstrap aggregation, or bagging, takes repeated samples with replacement from a single training data set, fits a separate regression tree to each sample, and averages the resulting predictions.
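
A minimal sketch of this idea, run on an illustrative synthetic data set (the post's own data and code are not reproduced here): scikit-learn's BaggingRegressor, whose default base estimator is a decision tree, is compared with a single regression tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data; the post's own data set is not reproduced here.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single unpruned regression tree: low bias, high variance.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Bagging: fit trees on bootstrap samples and average their predictions.
# The default base estimator of BaggingRegressor is a decision tree.
bagged = BaggingRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree test MSE:", mean_squared_error(y_test, single_tree.predict(X_test)))
print("Bagged trees test MSE:", mean_squared_error(y_test, bagged.predict(X_test)))
```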

Bootstrap:

Bootstrapping can be used to quantify the uncertainty associated with a machine learning estimator. Rather than drawing new samples from the population, it repeatedly resamples the observed data set with replacement. The statistic of interest is computed on each bootstrap sample, and the variability of these bootstrap estimates (their standard deviation) is used as an estimate of the standard error of the statistic. Both figures below are from Elements of Statistical Learning.
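
A minimal sketch of the bootstrap with made-up data for illustration: the standard error of the sample mean is estimated by resampling the observed data with replacement and measuring the spread of the resampled means.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)   # the single observed data set

# Resample the data with replacement many times and record the statistic.
boot_means = []
for _ in range(1000):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means.append(resample.mean())

# The spread of the bootstrap statistics estimates the standard error.
print("Bootstrap estimate of SE of the mean:", np.std(boot_means, ddof=1))
print("Analytical SE of the mean:           ", sample.std(ddof=1) / np.sqrt(sample.size))
```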

Histogram on the left: estimates of alpha obtained by generating 1,000 simulated data sets from the true population. Center: estimates of alpha obtained from 1,000 bootstrap samples from a single data set. In each panel, the pink line represents the true value of alpha. (Elements of Statistical Learning)

A visual representation of the bootstrap with three observations. Each bootstrap sample contains the same number of observations as the original data set and is drawn with replacement. (Elements of Statistical Learning)

Ensemble Methods:

Bagging belongs to a group of statistical learning techniques called ensemble methods because multiple trees are grown and combined. Unlike single regression trees, the trees grown during bagging are not pruned, so each individual tree has high variance but low bias. Fortunately, averaging the predicted values significantly reduces the variance of the ensemble.

The test error is easy to estimate without a separate validation set. Each bagged tree is fit on approximately two-thirds of the observations; the remaining third, the out-of-bag (OOB) observations, is not used to fit that tree. Predicting each observation with the trees that did not see it yields the OOB MSE for regression problems, which has been shown to be virtually equivalent to the leave-one-out cross-validation error.
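
A sketch of the OOB estimate with scikit-learn (again on illustrative data): setting oob_score=True makes BaggingRegressor score each observation using only the trees that did not see it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Illustrative synthetic data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# oob_score=True: evaluate each observation on the trees that never saw it.
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

print("OOB R^2:", bag.oob_score_)                 # R^2 of the out-of-bag predictions
oob_mse = ((y - bag.oob_prediction_) ** 2).mean() # OOB MSE computed by hand
print("OOB MSE:", oob_mse)
```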

Challenges in Interpretation:

Unfortunately, the resulting model is cumbersome to interpret. When many trees are averaged, the decision-making structure can no longer be displayed as a single tree. The importance of each predictor can still be computed, however, by recording how much the RSS decreases, averaged over all trees, at splits on that predictor.
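
As a sketch (illustrative data; BaggingRegressor itself does not expose a feature_importances_ attribute), the impurity-based importances of the individual trees can be averaged by hand:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
bag = BaggingRegressor(n_estimators=200, random_state=0).fit(X, y)

# Average the per-tree impurity (RSS) decrease attributed to each predictor.
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)
for i, imp in enumerate(importances):
    print(f"feature {i}: {imp:.3f}")
```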

At first, I created a parameter grid, which was then used for tuning during the bagging process. Special attention should be paid to the max_samples and max_features parameters, which were tuned iteratively. For example, I started with a range of 0.1-1.0, then narrowed it down to 0.5-1.0 and finally to 0.9-1.0. At each iteration, I checked the test MSE for guidance (lower is better, obviously!). A list of all tunable parameters can be found at scikit-learn.org.
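
The post's original code is embedded on the page rather than shown here, so the sketch below is a hedged reconstruction of that tuning loop with ParameterGrid, run on an illustrative synthetic data set; the grid values and the resulting MSE will differ from those reported below.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Narrowed ranges for max_samples and max_features, as described in the text.
param_grid = ParameterGrid({
    "max_samples": [0.9, 0.95, 1.0],
    "max_features": [0.6, 0.8, 1.0],
    "bootstrap": [True],
    "n_estimators": [10, 50, 100],
})

best_mse, best_params = float("inf"), None
for params in param_grid:
    model = BaggingRegressor(random_state=0, **params).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_mse, best_params = mse, params

print("Best test MSE:", best_mse)
print("Best parameters:", best_params)
```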

The lowest MSE I could get with the above four parameters tuned was 7.21. At that point, max_features was set to 0.8, max_samples to 0.95, and n_estimators remained at its default of 10. The model did use bootstrapping.

The purpose of tuning these parameters is to reduce the correlation between the individual trees. If every tree could use all predictors and all observations, a very strong predictor would dominate every tree, so the trees grown on the different bootstrap samples would be very similar and their predictions highly correlated. Restricting max_features and max_samples prevents the bagging regressor from always relying on a particular feature or set of features.

Random Forests:

Similar to bagging, random forests build multiple regression trees on bootstrapped training samples. However, at each split in a tree, a random subset of predictors is chosen from the full set, and the split may use only predictors from that subset. The size of the subset is typically the square root of the total number of predictors. Because a fresh subset is drawn at every split, the majority of predictors are not even considered at a given split, which makes the individual trees less correlated.

Note the same iterative parameter tuning with ParameterGrid, sketched below. One could also write a loop that iterates over two parameters in predetermined steps, but with a very large data set this could become computationally very expensive.
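
A hedged sketch of the analogous tuning loop for random forests (illustrative data; the grid values only mirror the ranges discussed in the text):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = ParameterGrid({
    "max_features": [0.35, "sqrt", 1.0],  # fraction, sqrt rule, or all predictors
    "n_estimators": [100, 250],
})

best_mse, best_params = float("inf"), None
for params in param_grid:
    model = RandomForestRegressor(random_state=0, **params).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_mse, best_params = mse, params

print("Best test MSE:", best_mse)
print("Best parameters:", best_params)
```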

The lowest test MSE achieved with the random forests model was 7.84, so it was outperformed by bagging on this data set. The best model used max_features of 0.35 and n_estimators of 250.

Bagged Trees vs. Random Forests:

If there is a strong predictor, most bagged trees will use it as the top split, so the trees will look similar and their predictions will be correlated; averaging correlated predictions does not reduce the overall variance by much. In contrast, random forests consider only a subset of predictors at each split, which allows other predictors (not just the strong one) to appear as the top split in some (or many) of the trees. If the size of the predictor subset equals the total number of predictors, a random forest operates exactly like bagging. If the predictors are highly correlated, a small subset size should be used for the random forest.
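
As a closing sketch of that equivalence (illustrative data only): setting max_features=None makes every predictor available at every split, so RandomForestRegressor behaves like bagged trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features=None: every split may consider all predictors, as in bagging.
rf_as_bagging = RandomForestRegressor(
    n_estimators=200, max_features=None, random_state=0
).fit(X_train, y_train)
bagged = BaggingRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("RF with all features, test MSE:",
      mean_squared_error(y_test, rf_as_bagging.predict(X_test)))
print("Bagged trees, test MSE:        ",
      mean_squared_error(y_test, bagged.predict(X_test)))
```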