
Learning Random Forest Using Mahout


In this article by Ashish Gupta, author of the book Learning Apache Mahout Classification, we will learn about Random forest, one of the most popular techniques in classification. It builds on a machine learning technique called the decision tree. In this article, we will explore the following topics:

  • Decision tree
  • Random forest
  • Using Mahout for Random forest


Decision tree

A decision tree is used for classification and regression problems. In simple terms, it is a predictive model that uses binary rules to calculate the target variable. In a decision tree, we use an iterative process of splitting the data into partitions, and then we split it further on branches. As in other classification model creation processes, we start with a training dataset in which target variables or class labels are defined. The algorithm tries to break all the records in the training dataset into two parts based on one of the explanatory variables. The partitioning is then applied to each new partition, and this process continues until no more partitioning can be done. The core of the algorithm is to find the rule that determines the initial split. There are several algorithms to create decision trees, such as Iterative Dichotomiser 3 (ID3), Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), and so on. A good explanation of ID3 can be found at http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.

From the explanatory variables, to choose the best splitter at a node, the algorithm considers each variable in turn. Every possible split is considered and tried, and the best split is the one that produces the largest decrease in the diversity of the classification labels within each partition. This is repeated for all variables, and the winner is chosen as the best splitter for that node. The process continues at each subsequent node until we reach a node where a decision can be made.
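
To make the notion of a "decrease in diversity" concrete, here is a minimal Python sketch (an addition, not from the original article) that scores a candidate split by the decrease in Gini impurity; entropy is a common alternative measure:

    from collections import Counter

    def gini(labels):
        # Gini impurity: 1 minus the sum of squared class proportions
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_gain(parent, left, right):
        # Decrease in diversity: parent impurity minus the size-weighted
        # impurity of the two partitions produced by the split
        n = len(parent)
        child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        return gini(parent) - child

    # A perfect split of a 50/50 mixed node removes all impurity: gain 0.5
    print(split_gain(["a"] * 5 + ["b"] * 5, ["a"] * 5, ["b"] * 5))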

Because we create a decision tree from a training dataset, it can suffer from the overfitting problem. This behavior creates a problem with real datasets. To improve this situation, a process called pruning is used, in which we remove branches and leaves of the tree to improve its performance. The algorithms used to build the tree work best at the starting, or root, node, since all the information is available there. Later on, with each split, there is less data, and towards the end of the tree a particular node can show patterns that are tied to the particular set of records used to split it. These patterns create problems when we use them to predict on real data. Pruning methods let the tree grow and then remove the smaller branches that fail to generalize. Now let's take an example to understand decision trees.

Consider the iris flower dataset. This dataset is hugely popular in the machine learning field. It was introduced by Sir Ronald Fisher. It contains 50 samples from each of three species of iris flower (Iris setosa, Iris virginica, and Iris versicolor). The four explanatory variables are the length and width of the sepals and petals in centimeters, and the target variable is the class to which the flower belongs.

[Figure: a decision tree built on the iris dataset]

As you can see in the preceding diagram, all the samples were initially considered to be of the Setosa species, and then the explanatory variable petal length was used to divide the groups further. At each step, the number of misclassified items was also calculated, which shows how many items were wrongly classified. The petal width variable was then taken into account as well. Usually, items at leaf nodes are correctly classified.
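
As a quick aside (an addition, not from the original article), here is a minimal Python sketch that fits a shallow decision tree to the iris dataset with scikit-learn rather than Mahout and prints the split rules it learns; the Mahout workflow follows later in this article:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # A shallow tree, roughly comparable to the diagram above
    tree = DecisionTreeClassifier(max_depth=2)
    tree.fit(iris.data, iris.target)

    # Print the learned binary rules; petal measurements are typically chosen first
    print(export_text(tree, feature_names=iris.feature_names))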

Random forest

The Random forest algorithm was developed by Leo Breiman and Adele Cutler. Random forests grow many classification trees. They are an ensemble learning method for classification and regression that constructs a number of decision trees at training time and outputs the class that is the mode of the classes output by the individual trees.

Single decision trees are subject to the bias-variance tradeoff, so they usually have high variance or high bias. These two sources of error are defined as follows:

  • Bias: This is an error caused by an erroneous assumption in the learning algorithm
  • Variance: This is an error caused by sensitivity to small fluctuations in the training set

Random forests attempt to mitigate this problem by averaging, to find a natural balance between the two extremes. A Random forest works on the idea of bagging, which is to average noisy and unbiased models to create a model with low variance. A Random forest algorithm works as a large collection of decorrelated decision trees. To understand the idea of the Random forest algorithm, let's work through an example.

Consider we have a training dataset that has lots of features (explanatory variables) and target variables or classes:

[Figure: a training dataset with explanatory variables and target classes]

We create a sample set from the given dataset:

[Figure: sample sets drawn from the original dataset]

A different random set of features is taken into account to create each random sub-dataset. From these sub-datasets, different decision trees are created, so we have actually created a forest of different decision trees. Using these different trees, we create a ranking system for all the classifiers. To predict the class of a new, unknown item, we use all the decision trees and separately find out which class each tree predicts. See the following diagram for a better understanding of this concept:

[Figure: different decision trees used to predict the class of an unknown item]

In this particular case, we have four different decision trees. We predict the class of the unknown item with each of the trees. As per the preceding figure, the first decision tree predicts class 2, the second decision tree predicts class 5, the third decision tree predicts class 5, and the fourth decision tree predicts class 3. Now the Random forest takes a vote for each class: we have one vote each for class 2 and class 3, and two votes for class 5. Therefore, the predicted class for the new unknown item is class 5; the class that gets the most votes wins (a small sketch of this voting follows the list below). A Random forest has a lot of benefits in classification, and a few of them are mentioned in the following list:

  • Combination of learning models increases the accuracy of the classification
  • Runs effectively on large datasets as well
  • The generated forest can be saved and used for other datasets as well
  • Can handle a large number of explanatory variables
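
To make the voting step concrete, here is a minimal Python sketch (an addition, not from the original article) that mirrors the four-tree example above:

    from collections import Counter

    def forest_predict(trees, item):
        # Ask every tree for its class, then return the majority vote
        votes = [tree(item) for tree in trees]
        return Counter(votes).most_common(1)[0][0]

    # Four stand-in "trees" reproducing the votes from the figure: 2, 5, 5, 3
    trees = [lambda x: 2, lambda x: 5, lambda x: 5, lambda x: 3]
    print(forest_predict(trees, item=None))  # prints 5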

Now that we have understood the Random forest theoretically, let's move on and use the Random forest implementation that is available in Apache Mahout.

Using Mahout for Random forest

Mahout has an implementation of the Random forest algorithm. It is very easy to understand and use. So let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built from the data captured in the DARPA'98 IDS evaluation program (R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation," DISCEX, vol. 02, p. 1012, 2000).

DARPA'98 contains about 4 GB of compressed raw (binary) tcpdump data from 7 weeks of network traffic, which can be processed into about 5 million connection records, each of about 100 bytes. The two weeks of test data yield around 2 million connection records. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type.

NSL-KDD is a dataset suggested to solve some of the inherent problems of the KDD'99 dataset. You can download this dataset from http://nsl.cs.unb.ca/NSL-KDD/.

We will download the KDDTrain+_20Percent.ARFF and KDDTest+.ARFF datasets.


In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, the ARFF header lines, such as those starting with @attribute). If this is not done, we will not be able to generate a descriptor file.
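
If you prefer to script this step, the following Python sketch (an addition, not from the original article) strips the 44-line header from both files in place, assuming they sit in the current directory:

    # Remove the ARFF header so Mahout's Describe sees only the data rows
    for name in ("KDDTrain+_20Percent.arff", "KDDTest+.arff"):
        with open(name) as f:
            rows = f.readlines()[44:]   # drop the first 44 header lines
        with open(name, "w") as f:
            f.writelines(rows)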


Steps to use the Random forest algorithm in Mahout

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

  1. Transfer the test and training datasets to HDFS using the following commands:
    hadoop fs -mkdir /user/hue/KDDTrain
    hadoop fs -mkdir /user/hue/KDDTest
    hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
    hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest
  2. Generate the descriptor file. Before you build a Random forest model based on the training data in KDDTrain+_20Percent.arff, a descriptor file is required, because all the information in the training dataset needs to be labeled. From the labeled dataset, the algorithm can understand which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:
    hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz-job.jar \
    org.apache.mahout.classifier.df.tools.Describe \
    -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
    -f /user/hue/KDDTrain/KDDTrain+.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

    Jar: the Mahout core jar (xyz stands for the version). If you have installed Mahout directly, it can be found under the /usr/lib/mahout folder. The main class Describe is used here, and it takes three parameters:


    -p: the path of the data to be described.

    -f: the location for the generated descriptor file.

    -d: the information about the attributes of the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L defines that the dataset starts with a numeric attribute (N), followed by three categorical attributes (3 C), and so on. The final L defines the label.
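
    To see how such a descriptor string unrolls into one type per attribute, here is a small Python sketch (an illustration added here, not a Mahout API), assuming the convention above that a count applies to the type token that follows it:

    def expand_descriptor(desc):
        # Expand e.g. "N 3 C" into ["N", "C", "C", "C"]
        tokens, types, i = desc.split(), [], 0
        while i < len(tokens):
            if tokens[i].isdigit():                        # a count...
                types += [tokens[i + 1]] * int(tokens[i])  # ...repeats the next type
                i += 2
            else:
                types.append(tokens[i])
                i += 1
        return types

    attrs = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
    print(len(attrs) - 1, "attributes plus the label")  # 41 attributes plus L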

    The output of the previous command is shown in the following screenshot:

    [Screenshot: output of the Describe command]

  3. Build the Random forest using the following command:
    hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
    org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -Dmapred.max.split.size=1874231 \
    -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
    -ds /user/hue/KDDTrain/KDDTrain+.info \
    -sl 5 -p -t 100 -o /user/hue/nsl-forest

    Jar: the Mahout examples jar (xyz stands for the version). If you have installed Mahout directly, it can be found under the /usr/lib/mahout folder. The main class BuildForest is used to build the forest, with the following arguments:

    -Dmapred.max.split.size: indicates to Hadoop the maximum size of each partition.

    -d: the path of the training data.

    -ds: the location of the descriptor file.

    -sl: the number of variables randomly selected at each tree node. Here, each tree is built using five randomly selected attributes per node.

    -p: uses the partial data implementation.

    -t: the number of trees to grow. Here, the command builds 100 trees using the partial implementation.

    -o: the output path that will contain the decision forest.


    In the end, the process will show the following result:

    [Screenshot: result of the BuildForest job]

  4. Use this model to classify the new dataset:
    hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
    org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i /user/hue/KDDTest/KDDTest+.arff \
    -ds /user/hue/KDDTrain/KDDTrain+.info \
    -m /user/hue/nsl-forest -a -mr \
    -o /user/hue/predictions

    Jar: the Mahout examples jar (xyz stands for the version). If you have installed Mahout directly, it can be found under the /usr/lib/mahout folder. The class TestForest takes the following parameters:

    -i: indicates the path of the test data

    -ds: the location of the descriptor file

    -m: the location of the forest generated by the previous command

    -a: runs the analyzer to compute the confusion matrix

    -mr: tells Hadoop to distribute the classification

    -o: the location to store the predictions in


    The job provides the following confusion matrix:

    [Screenshot: confusion matrix produced by the job]

So, from the confusion matrix, it is clear that 9,396 instances were correctly classified, and 315 normal instances were incorrectly classified as anomalies. The accuracy percentage is 77.7635 (correctly classified instances divided by all classified instances). The output file in the predictions folder contains a list of 0s and 1s, where 0 denotes a normal record and 1 denotes an anomaly.
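
As a quick illustration of that accuracy formula, here is a Python sketch (an addition; the counts below are placeholders, not the job's real output) that computes accuracy from a 2x2 confusion matrix:

    def accuracy(confusion):
        # Accuracy = correctly classified instances / all classified instances
        correct = sum(confusion[i][i] for i in range(len(confusion)))
        total = sum(sum(row) for row in confusion)
        return 100.0 * correct / total

    # Hypothetical matrix (rows: actual normal/anomaly; columns: predicted)
    cm = [[9000, 300],
          [2400, 400]]
    print(f"{accuracy(cm):.4f}%")  # (9000 + 400) / 12100 -> 77.6860%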

Summary

In this article, we discussed the Random forest algorithm. We started our discussion by understanding the decision tree and continued with an understanding of the Random forest. We took the NSL-KDD dataset, which is used to build predictive systems for cyber security, used Mahout to build a Random forest model, ran it against the test dataset, and generated the confusion matrix and other statistics for the output.
