Regression Trees

Objectives:

  • Gain an understanding of how regression trees are cultivated and pruned;

  • Programmatically create a regression tree using the DecisionTreeRegressor of sklearn;

  • Understand the mathematical function behind regression tree growing;

  • Learn about tree pruning in sklearn:

    • tune the max_depth parameter with cross-validation in a for loop;

    • tune the max_depth parameter with GridSearchCV;

  • Visualize a regression tree; and

  • Understand regression tree structures.

Overview:

Tree-based methods are predictive models that involve segmenting the feature space into several sub-regions.  Since the divided feature space can be visualized as a tree, the approaches are often referred to as tree-based methods.

Tree-based models are relatively easy to explain and can be used both for regression (when the dependent variable is continuous) and for classification (when the dependent variable is categorical).  Decision or regression trees are usually less accurate than other machine learning approaches such as bagging, random forests and boosting, which cultivate several trees that are then combined during prediction.  As a result of the increased complexity, all three – bagging, boosting and random forests – are a bit harder to interpret than a single regression or decision tree.  In this post, simple decision trees for regression will be explored.

Regression Trees:

As discussed above, decision trees divide all observations into several sub-spaces. Sticking with the Boston Housing dataset, I divided all observations into three sub-spaces: R1, R2 and R3.

The sub-regions represent terminal nodes of the regression tree, which are sometimes referred to as leaves.  These three regions can be written as R1 = {X | TAX > 400}, R2 = {X | TAX <= 400, RM > 7}, and R3 = {X | TAX <= 400, RM <= 7}.  The points along the tree where the splits occur are called internal nodes.  Internal nodes connect branches.  In the Boston Housing example, the two internal nodes correspond to the splits at TAX = 400 and RM = 7.

Sub-Dividing the Feature Space:

The goal was to divide the feature space into unique (non-overlapping) regions.  For every training observation that falls into a specific region, the same prediction is made: the mean response of all training observations in that region.  Mathematically, when dividing the feature space, we minimize the residual sum of squares (RSS):
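
In standard notation, with regions R_1 through R_J and \hat{y}_{R_j} denoting the mean response of the training observations that fall into region R_j, the quantity being minimized is

\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2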

Most algorithms use an approach called recursive binary splitting, which starts at the top of the tree and terminates at the bottom. Each split produces two new branches.  The process is referred to as a greedy algorithm because each split is chosen by considering only the best split at that juncture, without looking ahead to future steps.  At each step, the split yielding the greatest reduction in RSS is selected.  The process is then repeated at the two nodes created by the prior step.  At each new split, only the region of the predictor space belonging to that node is considered.  In other words, the RSS minimization has to be carried out separately for each branch.
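
To make the greedy step concrete, here is a minimal sketch (not the post's original code) of how the best split on a single feature could be searched for; best_split is a hypothetical helper that scans every observed cut point and keeps the one with the lowest combined RSS of the two branches.

import numpy as np

def best_split(x, y):
    # scan every observed cut point on one feature and keep the one that
    # minimizes the combined RSS of the two resulting branches
    best_t, best_rss = None, np.inf
    for t in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= t], y[x > t]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t, best_rss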

Tree Pruning:

It is important to find a tree that produces relatively low variance yet does not introduce a large amount of bias.  The best way to accomplish this is to grow a large tree and prune it back to the sub-tree that yields the lowest test error. Cross-validation is an excellent tool for pruning based on test error.

Cost complexity pruning (also known as weakest link pruning) selects a small set of sub-trees during cross-validation.  This method considers a sequence of trees indexed by a tuning parameter.  Unfortunately, the version of sklearn used here did not offer this feature (newer releases expose it through the ccp_alpha parameter of DecisionTreeRegressor). There are several other options, though.

A Simple Example Without Any Further Tuning:

The first step is to load all necessary packages for the work. Here, DecisionTreeRegressor from sklearn is being used. Once a simple model was trained, predictions were made on the test dataset and printed as an array.

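A minimal sketch of this step, assuming the data is fetched from OpenML (load_boston has been removed from recent sklearn releases), a 70/30 train/test split and random_state=1:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Boston Housing data via OpenML; cast everything to float for the tree estimator
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float)     # 13 regressors, including TAX and RM
y = boston.target.astype(float)   # median home value (MEDV)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model1 = DecisionTreeRegressor(random_state=1)   # no pruning or depth limit yet
model1.fit(X_train, y_train)

print(model1.predict(X_test))                    # predictions as an array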

Just to make a point here, I simply filtered the dataset to two variables (TAX and RM). The plot below shows the feature space of those two variables, which was then segmented by the regression tree model. In reality, there were 13 variables, which makes the feature space very difficult to show graphically, although I could have plotted all bivariate combinations of the regressors in the model.

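One way such a plot could be produced (a sketch continuing from the snippet above; the shallow max_depth=2 is an assumption so that the rectangular regions stay visible):

import numpy as np
import matplotlib.pyplot as plt

# fit a shallow tree on TAX and RM only, then colour the plane by its predictions
X_two = X_train[["TAX", "RM"]].to_numpy()
tree_two = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X_two, y_train)

tax = np.linspace(X_two[:, 0].min(), X_two[:, 0].max(), 200)
rm = np.linspace(X_two[:, 1].min(), X_two[:, 1].max(), 200)
tt, rr = np.meshgrid(tax, rm)
zz = tree_two.predict(np.c_[tt.ravel(), rr.ravel()]).reshape(tt.shape)

plt.pcolormesh(tt, rr, zz, shading="auto", cmap="viridis")     # segmented feature space
plt.scatter(X_two[:, 0], X_two[:, 1], c=y_train, edgecolor="k", s=20, cmap="viridis")
plt.xlabel("TAX")
plt.ylabel("RM")
plt.colorbar(label="predicted MEDV")
plt.show()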

We can also illustrate how fitting a regression tree works by plotting actual observations and predicted values on the same plot. For simplicity, I selected a single variable (TAX) to demonstrate:

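A sketch of this plot, continuing from the previous snippets; the depth-three tree fitted on TAX alone is a hypothetical choice:

# single-feature tree: observed values and the step-wise tree predictions
X_tax = X_train[["TAX"]].to_numpy()
tree_tax = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tax, y_train)

grid = np.linspace(X_tax.min(), X_tax.max(), 500).reshape(-1, 1)
plt.scatter(X_tax, y_train, s=15, label="observed MEDV")
plt.plot(grid, tree_tax.predict(grid), color="red", label="tree prediction")
plt.xlabel("TAX")
plt.ylabel("MEDV")
plt.legend()
plt.show()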

Tree Pruning:

The regression tree called model1 is probably not a great model because it was not pruned and is therefore severely overfitted. As discussed earlier, it would be a good idea to prune using cost complexity pruning. Unfortunately, the tuning parameter used for that (often referred to as alpha in other implementations) was not available in the version of sklearn used here, but we can take care of tree pruning by tuning the max_depth parameter instead. There are several ways of accomplishing such a task.

Some data wrangling is necessary. First, I wanted to restrict this demonstration to two variables for easier understanding, so I filtered the Boston Housing data to two regressors (TAX and RM). Second, the data frame needed to be converted into an array and processed so that each row of the array represents an observation and each column represents a variable.

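A sketch of that wrangling step, continuing from the first snippet; the names X2, y2 and the suffix 2 on the train/test splits are mine, used so the later snippets can refer to the two-feature data:

# keep only TAX and RM and convert the data frame to a NumPy array in which
# each row is an observation and each column is a variable
X2 = boston.data[["TAX", "RM"]].to_numpy(dtype=float)
y2 = boston.target.to_numpy(dtype=float)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=1)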

Cross Validation by Handwritten Code:

In the first example, I wrote a for loop that takes integers 1-20 and iterates by plugging them into the max_depth parameter. For each integer, a regression tree is fitted using the two features and the cross-validation score is computed. The cross_val_score function is available from sklearn. In this example, I specified r-squared as the evaluation criterion, but there are several other options. Each scorer is set up so that a higher number means a better model; even error metrics are reported in negated form (e.g. neg_mean_squared_error) so that a higher value can be considered better. The full list of scorers can be found in the scikit-learn documentation.

I also decided to calculate the test error for each of the models. Both the cross-validation score and the test MSE lead to the same choice, i.e. the model is best with a maximum depth of 5. Note that the cross-validation score output does not show the root, so there are only 19 scores. The cross-validation score is highest at about 0.61 and the test mean squared error is lowest at about 17.05 when the tree depth is five.

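A sketch of such a loop, continuing from the snippets above; the exact depth range, the 5-fold CV and random_state are assumptions:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# for each candidate depth, compute the mean 5-fold CV r-squared on the training
# data and the MSE of the fitted tree on the held-out test data
for depth in range(1, 21):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    cv_r2 = cross_val_score(tree, X_train2, y_train2, cv=5, scoring="r2").mean()
    tree.fit(X_train2, y_train2)
    test_mse = mean_squared_error(y_test2, tree.predict(X_test2))
    print(f"max_depth={depth:2d}  CV r2={cv_r2:.3f}  test MSE={test_mse:.3f}")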

Cross Validation with Grid Search:

It is possible to find max_depth by running GridSearchCV, which replaces the for loop in the prior example. In this case, the selected max_depth is four.

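A GridSearchCV version of the same search might look like this; the grid and scoring choice are assumptions:

from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": list(range(1, 21))}
grid = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, scoring="r2")
grid.fit(X_train2, y_train2)

print(grid.best_params_)          # e.g. {'max_depth': 4}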

Visualize the Regression Tree:

First, let us remember that at some point earlier we filtered out all regressors but TAX and RM, so the following model will be based on those two features. To start, there will be no restriction on the depth of the tree, just to see what an unpruned tree actually looks like. Several new modules are necessary for the visualization: StringIO (from Python's io module), Image from IPython.display, and pydotplus. The resulting regression tree is rather wide and hard to see.

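A sketch of the visualization step, continuing from the earlier snippets; export_graphviz writes the tree to DOT format and pydotplus renders it, while the styling flags are assumptions:

from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# grow an unrestricted tree on TAX and RM and render it with graphviz
big_tree = DecisionTreeRegressor(random_state=1).fit(X_train2, y_train2)

dot_data = StringIO()
export_graphviz(big_tree, out_file=dot_data, feature_names=["TAX", "RM"],
                filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())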

Let’s see what a pruned regression tree looks like now that we set the max_depth to four.

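The same sketch with the depth limited to four:

# prune by restricting the maximum depth, then render the smaller tree
pruned_tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X_train2, y_train2)

dot_data = StringIO()
export_graphviz(pruned_tree, out_file=dot_data, feature_names=["TAX", "RM"],
                filled=True, rounded=True)
Image(pydotplus.graph_from_dot_data(dot_data.getvalue()).create_png())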

Understanding Regression Trees:

It is possible to inquire about the regression tree structure in Python by examining an attribute of the tree estimator called tree_. The binary tree tree_ is made up of parallel arrays. According to the scikit-learn documentation, the following can be viewed:

  • the binary tree structure;

  • the depth of each node and whether or not it’s a leaf;

  • the nodes that were reached by a sample using the decision_path method;

  • the leaf that was reached by a sample using the apply method;

  • the rules that were used to predict a sample;

  • the decision path shared by a group of samples.

The i-th element of each array holds information about node `i`, and node 0 is the tree's root. Note that some of the arrays apply only to leaves or only to split nodes; in that case, the values stored for nodes of the other type are arbitrary.

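A sketch of how those parallel arrays can be read off the pruned tree from the previous snippet; the printed labels are mine, and children_left equals children_right only at leaves:

t = pruned_tree.tree_            # low-level tree structure with parallel arrays
print("number of nodes:", t.node_count)

for i in range(t.node_count):
    if t.children_left[i] == t.children_right[i]:        # both are -1 at a leaf
        print(f"node {i}: leaf, predicted value = {t.value[i][0][0]:.2f}")
    else:
        name = ["TAX", "RM"][t.feature[i]]
        print(f"node {i}: split on {name} <= {t.threshold[i]:.2f}, "
              f"children = ({t.children_left[i]}, {t.children_right[i]})")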

A lot of the information coming from these arrays can also be seen on the tree plot. For example, there are 29 nodes, which can be counted on the tree plot as well. I find this incredibly useful for interpretation, especially when the nodes on a tree plot are very small and hard to read.

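Finally, a short sketch of the decision_path and apply methods mentioned above, tracing a single test observation through the pruned tree:

# index of the leaf the sample ends up in, and every node visited on the way
sample = X_test2[:1]

leaf_id = pruned_tree.apply(sample)
path = pruned_tree.decision_path(sample)      # sparse indicator matrix

print("leaf reached:", leaf_id)
print("nodes on the decision path:", path.indices)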