





















































"The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret), achieves 94.1 percent of the maximum accuracy overcoming 90 percent in the 84.3 percent of the data sets."
– Fernández-Delgado et al (2014)
"You can't see the forest for the trees!"
– An old saying
In this article by Cory Lesmeister, the author of Mastering Machine Learning with R, the first item of discussion is the basic decision tree, which is both simple to build and to understand. However, a single decision tree does not perform as well as other methods, such as support vector machines or neural networks. Therefore, we will discuss the creation of multiple, sometimes hundreds of, different trees, with their individual results combined, leading to a single overall prediction.
The first quote above is from Fernández-Delgado et al. in the Journal of Machine Learning Research and is meant to set the stage: the techniques in this article are quite powerful, particularly when used for classification problems. Certainly, they are not always the best solution, but they do provide a good starting point.
For an understanding of tree-based methods, it is probably easier to start with a quantitative outcome and then move on to how the method works on a classification problem. The essence of a tree is that the features are partitioned, starting with the first split that improves the residual sum of squares the most. These binary splits continue until the termination of the tree. Each subsequent split/partition is not done on the entire dataset but only on the portion of the prior split that it falls under. This top-down process is referred to as recursive partitioning. It is also a greedy process, a term you may stumble on when reading about machine learning methods. Greedy means that at each split, the algorithm looks for the greatest reduction in the residual sum of squares without regard to how well it will perform in later partitions. The result is that you may end up with a full tree of unnecessary branches, leading to low bias but high variance. To control this effect, you need to appropriately prune the tree to an optimal size after building the full tree.
The following figure provides a visual of the technique in action. The data is hypothetical, with 30 observations, a response ranging from 1 to 10, and two predictor features named X1 and X2, each ranging in value from 0 to 10. The tree has three splits that lead to four terminal nodes. Each split is basically an if-then statement or, in R syntax, an ifelse() call. In the first split, if X1 < 3.5, the response is split into 4 observations with an average value of 2.4 and the remaining 26 observations. This left branch of 4 observations is a terminal node, as any further splits would not substantially improve the residual sum of squares. The predicted value for the 4 observations in that partition of the tree becomes the average. The next split is at X2 < 4 and, finally, X1 < 7.5.
An advantage of this method is that it can handle highly nonlinear relationships; however, can you see a couple of potential problems? The first issue is that an observation is given the average of the terminal node that it falls under. This can hurt the overall predictive performance (high bias). Conversely, if you keep partitioning the data further and further to achieve low bias, high variance can become an issue. As with other methods, you can use cross-validation to select the appropriate tree size.
Regression Tree with 3 splits and 4 terminal nodes and the corresponding node average and number of observations.
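To make the ifelse() analogy concrete, the partitions in the figure can be written out by hand as nested ifelse() calls. This is only a sketch of one plausible reading of the tree: the split points and the first node average of 2.4 come from the description above, while the other node averages (avg2, avg3, avg4) are hypothetical placeholders.
# A sketch of the figure's partitions as nested ifelse() statements
toy.tree.predict <- function(X1, X2, avg2 = 4.5, avg3 = 6.8, avg4 = 8.9) {
  ifelse(X1 < 3.5, 2.4,                        # first split: terminal node of 4 observations
         ifelse(X2 < 4, avg2,                  # second split on X2
                ifelse(X1 < 7.5, avg3, avg4))) # third split on X1
}
toy.tree.predict(X1 = 2, X2 = 8) # falls in the first terminal node, so returns 2.4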
Classification trees operate under the same principle as regression trees, except that the splits are not determined by the residual sum of squares but by an error rate. The error rate used is not what you would expect, where the calculation is simply the misclassified observations divided by the total observations. As it turns out, when it comes to tree splitting, a misclassification rate by itself may lead to a situation where you can gain information with a further split yet not improve the misclassification rate. Let's look at an example.
Suppose we have a node, let's call it N0, where we have 7 observations labeled No and 3 observations labeled Yes; we would say that the misclassification rate is 30 percent. With this in mind, let's calculate a common alternative error measure called the Gini index. The formula for the Gini index of a single node is as follows:
Gini = 1 – (probability of Class 1)² – (probability of Class 2)²
For N0, the Gini is 1 – (0.7)² – (0.3)², which is equal to 0.42, versus the misclassification rate of 30 percent.
Taking this example further, we will now create an N1 node with 3 of Class 1 and none of Class 2 along with N2, which has 4 observations from Class 1 and 3 from Class 2. Now, the overall misclassification rate for this branch of the tree is still 30 percent, but look at the following to see how the overall Gini index has improved:
Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.49
The new Gini index = (proportion of N1 × Gini(N1)) + (proportion of N2 × Gini(N2)), which is equal to (0.3 × 0) + (0.7 × 0.49), or 0.343.
By doing a split on this surrogate error rate, we actually improved our model impurity, reducing it from 0.42 to 0.343, whereas the misclassification rate did not change. This is the methodology used by the rpart package.
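If you would like to verify these numbers yourself, the arithmetic takes only a few lines of R. The gini() helper below is written here purely for illustration; it is not a function from any package.
# Reproduce the Gini calculations for N0, N1, and N2 from the example above
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}
n0 <- c(No = 7, Yes = 3)
n1 <- c(No = 3, Yes = 0)
n2 <- c(No = 4, Yes = 3)
gini(n0)                                 # 0.42 for the parent node N0
weights <- c(sum(n1), sum(n2)) / sum(n0) # proportions 0.3 and 0.7
sum(weights * c(gini(n1), gini(n2)))     # weighted Gini of the split, roughly 0.343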
To greatly improve our model's predictive ability, we can produce numerous trees and combine the results. The random forest technique does this by applying two different tricks in the model development. The first is the use of bootstrap aggregation or bagging as it is called.
In bagging, an individual tree is built on a random sample of the dataset, roughly two-thirds of the total observations. It is important to note that the remaining one-third is referred to as out-of-bag (OOB). This is repeated dozens or hundreds of times, and the results are averaged. Each of these trees is grown and not pruned based on any error measure, which means that the variance of each individual tree is high. However, by averaging the results, you can reduce the variance without increasing the bias.
The next thing that random forest brings to the table is that, concurrently with the random sample of the data, it also takes a random sample of the input features at each split. In the randomForest package, we will use the default number of randomly sampled predictors, which is the square root of the total predictors for classification problems and the total predictors divided by 3 for regression. The number of predictors the algorithm randomly chooses at each split can be changed via the model tuning process.
By doing this random sampling of the features at each split and incorporating it into the methodology, you mitigate the effect of a highly correlated predictor becoming the main driver in all of your bootstrapped trees, which would prevent you from reducing the variance that you hoped to achieve with bagging. The subsequent averaging of trees that are less correlated with each other than they would be with bagging alone makes the model more generalizable and more robust to outliers.
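As a quick illustration of these defaults using built-in R datasets rather than the chapter's data, the mtry argument of randomForest() is where the number of sampled predictors is set; the calls below simply make the default values explicit.
library(randomForest)
set.seed(123)
# Classification: iris has 4 predictors, so the default mtry is floor(sqrt(4)) = 2
rf.class <- randomForest(Species ~ ., data = iris, mtry = 2)
print(rf.class) # the printout includes the OOB estimate of the error rate
# Regression: mpg ~ . in mtcars has 10 predictors, so the default mtry is floor(10/3) = 3
rf.reg <- randomForest(mpg ~ ., data = mtcars, mtry = 3)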
Boosting methods can become extremely complicated to learn and understand, but you should keep in mind what is fundamentally happening behind the curtain. The main idea is to build an initial model of some kind (linear, spline, tree, and so on) called the base learner, examine the residuals, and fit a model based on these residuals around the so-called loss function. A loss function is merely the function that measures the discrepancy between the model and the desired prediction, for example, squared error for regression or the logistic function for classification. The process continues until it reaches some specified stopping criterion. This is like a student who takes a practice exam and gets 30 out of 100 questions wrong and, as a result, studies only those 30 questions that were missed. On the next practice exam, they get 10 of those 30 wrong, so they focus only on those 10 questions, and so on. If you would like to explore the theory behind this further, a great resource is Natekin A., Knoll A. (2013), Gradient boosting machines, a tutorial, Frontiers in Neurorobotics, available at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/.
As previously mentioned, boosting can be applied to many different base learners, but here we will only focus on the specifics of tree-based learning. Each tree iteration is small and we will determine how small it is with one of the tuning parameters referred to as interaction depth. In fact, it may be as small as one split, which is referred to as a stump.
Trees are sequentially fit to the residuals according to the loss function up to the number of trees that we specified (our stopping criterion).
There is another tuning parameter that we will need to identify and that is shrinkage. You can think of shrinkage as the rate at which your model is learning generally and specifically, as the contribution of each tree or stump to the model. This learning rate acts as a regularization parameter.
The other thing about our boosting algorithm is that it is stochastic, meaning that it adds randomness by taking a random sample of our data at each tree. Introducing some randomness to a boosted model usually improves the accuracy and speed and reduces overfitting (Friedman 2002).
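The following is a hedged sketch of how these ideas map onto the gbm package's arguments; the dataset (Boston from MASS) and the parameter values are purely illustrative and are not the settings we will tune later.
library(gbm)
library(MASS) # Boston housing data, used here only for illustration
set.seed(123)
boost.sketch <- gbm(medv ~ ., data = Boston,
                    distribution = "gaussian",  # squared-error loss for regression
                    n.trees = 500,              # stopping criterion: number of trees
                    interaction.depth = 2,      # size of each tree (1 would be a stump)
                    shrinkage = 0.01,           # learning rate, acting as regularization
                    bag.fraction = 0.5)         # random row subsample makes the boosting stochastic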
As you may have guessed, tuning these parameters can be quite a challenge. They can interact with each other, and if you just tinker with one without considering the others, your model may actually perform worse. The caret package will help us in this endeavor.
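For instance, a tuning grid passed to caret's train() might look like the following sketch; the candidate values are illustrative only, and the grid covers the parameters that caret tunes for the gbm method (number of trees, interaction depth, shrinkage, and minimum node size).
library(caret)
library(MASS) # Boston housing data, again only for illustration
set.seed(123)
grid <- expand.grid(n.trees = c(100, 300, 500),
                    interaction.depth = c(1, 2, 3),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)
gbm.tune <- train(medv ~ ., data = Boston, method = "gbm",
                  tuneGrid = grid,
                  trControl = trainControl(method = "cv", number = 5),
                  verbose = FALSE)
gbm.tune$bestTune # the combination that minimized the cross-validated RMSE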
The overall business objective in this situation is to see if we can improve the predictive ability for some of the cases. For regression, we will visit the prostate cancer data.
For classification purposes, we will utilize both the breast cancer biopsy data and Pima Indian Diabetes data.
Both random forests and boosting will be applied to all three datasets. The simple tree method will be used only on the breast and prostate cancer sets.
We will jump right into the prostate data set, but first let's load the necessary R packages, as follows:
> library(rpart) #classification and regression trees
> library(partykit) #tree plots
> library(MASS) #breast cancer and Pima Indian data
> library(ElemStatLearn) #prostate cancer data
> library(randomForest) #random forests
> library(gbm) #gradient boosting
> library(caret) #hyper-parameter tuning
First, we will do regression on the prostate data. This involves calling the dataset, coding the Gleason score as an indicator variable using the ifelse() function, and creating the test and training sets. The training set will be pros.train and the test set will be pros.test, as follows:
> data(prostate)
> prostate$gleason = ifelse(prostate$gleason == 6, 0, 1)
> pros.train = subset(prostate, train==TRUE)[,1:9]
> pros.test = subset(prostate, train==FALSE)[,1:9]
To build a regression tree on the training data, we will use the rpart() function from the rpart package. The syntax is quite similar to what we used in the other modeling techniques:
> tree.pros <- rpart(lpsa~., data=pros.train)
We can print the cptable stored in this object and then examine the error per split to determine the optimal number of splits in the tree, as follows:
> print(tree.pros$cptable)
CP nsplit rel error xerror xstd
1 0.35852251 0 1.0000000 1.0364016 0.1822698
2 0.12295687 1 0.6414775 0.8395071 0.1214181
3 0.11639953 2 0.5185206 0.7255295 0.1015424
4 0.05350873 3 0.4021211 0.7608289 0.1109777
5 0.01032838 4 0.3486124 0.6911426 0.1061507
6 0.01000000 5 0.3382840 0.7102030 0.1093327
This is a very important table to analyze. The first column, labeled CP, is the cost complexity parameter. The second column, nsplit, is the number of splits in the tree. The rel error column stands for relative error and is the residual sum of squares for that number of splits divided by the residual sum of squares for no splits (RSS(k)/RSS(0)). Both xerror and xstd are based on ten-fold cross-validation, with xerror being the average error and xstd the standard deviation of the cross-validation process. We can see that four splits produced a slightly lower error under cross-validation, while five splits produced the lowest error on the full dataset. You can examine this using the plotcp() function, as follows:
> plotcp(tree.pros)
The following is the output of the preceding command:
The plot shows the relative error by tree size, with the corresponding error bars. The horizontal line on the plot is drawn one standard error above the minimum cross-validated error. Selecting tree size 5, which is four splits, we can build a new tree object where xerror is minimized by pruning our tree accordingly: first we create an object for the cp associated with the pruned tree from the table, and then the prune() function handles the rest, as follows:
> cp = min(tree.pros$cptable[5,])
> prune.tree.pros <- prune(tree.pros, cp = cp)
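If you prefer not to index the cptable row by hand, the same cp value can be pulled out programmatically; this is simply an equivalent alternative to the two lines above.
> best.row <- which.min(tree.pros$cptable[, "xerror"]) #row with the lowest cross-validated error
> cp <- tree.pros$cptable[best.row, "CP"]
> prune.tree.pros <- prune(tree.pros, cp = cp)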
With this done, you can plot and compare the full and pruned trees. The tree plots produced by the partykit package are much better than those produced by the party package. You can simply use the as.party() function as a wrapper in the plot() function:
> plot(as.party(tree.pros))
The output of the preceding command is as follows:
> plot(as.party(prune.tree.pros))
The following is the output of the preceding command:
Note that the splits are exactly the same in the two trees with the exception of the last split, which includes the age variable for the full tree. Interestingly, both the first and second splits in the tree are related to the log of cancer volume lcavol. These plots are quite informative as they show the splits, nodes, observations per node, and box plots of the outcome that we are trying to predict.
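One optional way to back up what the plots suggest is to look at the variable importance that rpart stores on the fitted object, which should confirm that lcavol dominates the early splits:
> prune.tree.pros$variable.importance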
Let's see how well the pruned tree performs on the test data. We will create an object of predicted values using the predict() function on the test data. Then, we will calculate the residuals as the predicted values minus the actual values and, finally, the mean of the squared errors, as follows:
> party.pros.test <- predict(prune.tree.pros, newdata=pros.test)
> rpart.resid = party.pros.test - pros.test$lpsa #calculate residuals
> mean(rpart.resid^2) #calculate MSE
[1] 0.5267748
Looking at the tree plots that we produced, one can easily explain the primary drivers behind the response. As mentioned in the introduction, trees are easy to interpret and explain, which, in many cases, may be more important than accuracy.
For the classification problem, we will prepare the breast cancer data. After loading the data, you will delete the patient ID, rename the features, eliminate the few missing values, and then create the train/test datasets, as follows:
> data(biopsy)
> biopsy <- biopsy[,-1] #delete ID
> names(biopsy) = c("thick", "u.size", "u.shape", "adhsn", "s.size", "nucl", "chrom", "n.nuc", "mit", "class") #change the feature names
> biopsy.v2 = na.omit(biopsy) #delete the observations with missing values
> set.seed(123) #random number generator
> ind = sample(2, nrow(biopsy.v2), replace=TRUE, prob=c(0.7, 0.3))
> biop.train = biopsy.v2[ind==1,] #the training data set
> biop.test = biopsy.v2[ind==2,] #the test data set
With the data set up appropriately, we will use the same syntax style for a classification problem as we did previously for a regression problem. Before creating a classification tree, however, we need to ensure that the outcome is a factor, which can be checked with the str() function, as follows:
> str(biop.test[,10])
Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 …
First, we will create the tree:
> set.seed(123)
> tree.biop <- rpart(class~., data=biop.train)
Then, examine the table for the optimal number of splits:
> print(tree.biop$cptable)
CP nsplit rel error xerror xstd
1 0.79651163 0 1.0000000 1.0000000 0.06086254
2 0.07558140 1 0.2034884 0.2674419 0.03746996
3 0.01162791 2 0.1279070 0.1453488 0.02829278
4 0.01000000 3 0.1162791 0.1744186 0.03082013
The cross-validation error is at a minimum with only two splits (row 3). We can now prune the tree, plot the full and pruned tree, and see how it performs on the test set, as follows:
> cp = min(tree.biop$cptable[3,])
> prune.tree.biop <- prune(tree.biop, cp = cp)
> plot(as.party(tree.biop))
> plot(as.party(prune.tree.biop))
An examination of the tree plots shows that the uniformity of the cell size is the first split, then bare nuclei. The full tree had an additional split at the cell thickness. We can predict the test observations using type="class" in the predict() function:
> rparty.test <- predict(prune.tree.biop, newdata=biop.test, type="class")
> table(rparty.test, biop.test$class)
rparty.test benign malignant
  benign       136         3
  malignant      6        64
> (136+64)/209
[1] 0.9569378
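Since the caret package is already loaded, the same accuracy, along with sensitivity and specificity, can also be obtained with its confusionMatrix() function as an optional check:
> confusionMatrix(rparty.test, biop.test$class)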
The basic tree with just two splits gets us almost 96 percent accuracy. This still falls short but should encourage us to believe that we can improve on it with the upcoming methods, starting with random forests.
In this article, we learned about both the power and the limitations of tree-based learning methods for classification and regression problems. To improve predictive ability, we have the tools of random forests and gradient boosted trees at our disposal.