Supervised learning

In this article by Dan Toomey, author of the book R for Data Science, we will learn about supervised learning, which involves feeding a target variable and a number of predictor variables into a model so that the system can predict the target. This is also known as predictive modeling.

As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to combine the predictor variables in a way that accurately predicts the target variable. We use some portion of the observed data to train the model and then test that model on the remaining observations to measure its accuracy.
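
As a minimal sketch of that train/test workflow in base R (the built-in mtcars dataset and the 75 percent split are only illustrative choices):

> set.seed(42)                                       # make the split reproducible
> idx <- sample(nrow(mtcars), 0.75 * nrow(mtcars))   # pick 75 percent of the rows at random
> trainData <- mtcars[idx, ]                         # observations used to fit the model
> testData  <- mtcars[-idx, ]                        # held-out observations used to test accuracy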

We will go over the following supervised learning techniques:

  • Decision trees
  • Regression
  • Neural networks
  • Instance-based learning (k-NN)
  • Ensemble learning
  • Support vector machines
  • Bayesian learning
  • Bayesian inference
  • Random forests

Decision tree

For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value.

For this example, we will use the weather data in the rattle package. We will develop a decision tree that determines whether or not it will rain tomorrow based on several variables. Let's load the rattle package as follows:

> library(rattle)

We can see a summary of the weather data. This shows that we have some real data over a year from Australia:

> summary(weather)
     Date                     Location     MinTemp    
Min.   :2007-11-01   Canberra     :366   Min.   :-5.300
1st Qu.:2008-01-31   Adelaide     : 0   1st Qu.: 2.300
Median :2008-05-01   Albany       : 0   Median : 7.450
Mean   :2008-05-01   Albury       : 0   Mean   : 7.266
3rd Qu.:2008-07-31   AliceSprings : 0   3rd Qu.:12.500
Max.   :2008-10-31   BadgerysCreek: 0   Max.   :20.900
                     (Other)     : 0                  
   MaxTemp         Rainfall       Evaporation       Sunshine    
Min.   : 7.60   Min.   : 0.000   Min.  : 0.200   Min.   : 0.000
1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950
Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600
Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909
3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500
Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600
                                                   NA's   :3      
 WindGustDir   WindGustSpeed    WindDir9am     WindDir3pm 
 NW     : 73   Min.   :13.00   SE     : 47   WNW    : 61  
 NNW    : 44   1st Qu.:31.00   SSE    : 40   NW     : 61  
 E      : 37   Median :39.00   NNW    : 36   NNW    : 47  
 WNW    : 35   Mean   :39.84   N      : 31   N      : 30  
 ENE    : 30   3rd Qu.:46.00   NW     : 30   ESE    : 27  
 (Other):144   Max.   :98.00   (Other):151   (Other):139  
 NA's   :  3   NA's   :2       NA's   : 31   NA's   :  1  
WindSpeed9am     WindSpeed3pm   Humidity9am     Humidity3pm  
Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00
1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25
Median : 7.000   Median :17.00   Median :72.00   Median :43.00
Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52
3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00
Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   :96.00
NA's   :7                                                      
Pressure9am     Pressure3pm       Cloud9am       Cloud3pm  
Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000
1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000
Median :1020.1   Median :1017.4   Median :3.500   Median :4.000
Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025
3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000
Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000
Temp9am         Temp3pm         RainToday RISK_MM
Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000
1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000
Median :12.550   Median :18.55             Median : 0.000
Mean   :12.358   Mean   :19.23             Mean   : 1.428
3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200
Max.   :24.700   Max.   :34.50           Max.   :39.800
                                                          
RainTomorrow
No :300    
Yes: 66      

We will be using the rpart function to develop a decision tree. The rpart function looks like this:

rpart(formula, data, weights, subset, na.action = na.rpart, method,
model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)

The various parameters of the rpart function are described in the following list:

  • formula: This is the formula used for the prediction.
  • data: This is the data matrix.
  • weights: These are the optional weights to be applied.
  • subset: This is the optional subset of rows of data to be used.
  • na.action: This specifies the action to be taken when y, the target value, is missing.
  • method: This is the method to be used to interpret the data. It should be one of anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data.
  • …: These are the additional parameters used to control the behavior of the algorithm.

Let's create a subset that drops the RISK_MM column (it records the amount of rain for tomorrow, which would give the answer away), and then fit the tree:

> weather2 <- subset(weather, select=-c(RISK_MM))
> install.packages("rpart")
> library(rpart)
> model <- rpart(formula=RainTomorrow ~ ., data=weather2, method="class")
> summary(model)
Call:
rpart(formula = RainTomorrow ~ ., data = weather2, method = "class")
n= 366
 
          CP nsplit rel error    xerror      xstd
1 0.19696970     0 1.0000000 1.0000000 0.1114418
2 0.09090909      1 0.8030303 0.9696970 0.1101055
3 0.01515152     2 0.7121212 1.0151515 0.1120956
4 0.01000000     7 0.6363636 0.9090909 0.1073129
 
Variable importance
Humidity3pm WindGustSpeed     Sunshine WindSpeed3pm       Temp3pm
           24           14          12             8             6
Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am
           6             5             4             4             4
Evaporation         Date   Humidity9am     Cloud3pm     Cloud9am
            3             3             2             2             1
     Rainfall
           1
Node number 1: 366 observations,   complexity param=0.1969697
predicted class=No   expected loss=0.1803279 P(node) =1
   class counts:   300   66
   probabilities: 0.820 0.180
left son=2 (339 obs) right son=3 (27 obs)
Primary splits:
   Humidity3pm < 71.5   to the left, improve=18.31013, (0 missing)
   Pressure3pm < 1011.9 to the right, improve=17.35280, (0 missing)
   Cloud3pm   < 6.5     to the left, improve=16.14203, (0 missing)
   Sunshine   < 6.45   to the right, improve=15.36364, (3 missing)
   Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing)
Surrogate splits:
   Sunshine < 0.45   to the right, agree=0.945, adj=0.259, (0 split)
(many more)…

As you can tell, the model is complicated. The summary shows the progression of the model development, using more and more of the data to fine-tune the tree. We will use the fancyRpartPlot function from the rattle package (which builds on the rpart.plot package) to display the decision tree in a readable manner as follows:

> library(rpart.plot)
> fancyRpartPlot(model,main="Rain Tomorrow",sub="Chapter 12")

[Figure: output of the fancyRpartPlot function, showing the Rain Tomorrow decision tree]

Now, we can follow the logic of the decision tree easily. For example, if the 3 p.m. humidity is above 71.5 (the first split in the tree), we predict that it will rain.
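
To turn the tree into predictions for new observations, we can call predict on the fitted rpart model; as a quick sketch, here I reuse the first three rows of weather2 as stand-in new data:

> predict(model, newdata=weather2[1:3,], type="class")   # predicted RainTomorrow labels
> predict(model, newdata=weather2[1:3,], type="prob")    # class probabilities instead of hard labels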

Regression

We can use a regression to predict our target value by producing a regression model from our predictor variables.

We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary:

> forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
> summary(forestfires)
       X               Y           month      day         FFMC      
 Min.   :1.000   Min.   :2.0   aug    :184   fri:85   Min.   :18.70  
 1st Qu.:3.000   1st Qu.:4.0   sep    :172   mon:74   1st Qu.:90.20  
 Median :4.000   Median :4.0   mar    : 54   sat:84   Median :91.60  
 Mean   :4.669   Mean   :4.3   jul    : 32   sun:95   Mean   :90.64  
 3rd Qu.:7.000   3rd Qu.:5.0   feb    : 20   thu:61   3rd Qu.:92.90  
 Max.   :9.000   Max.   :9.0   jun    : 17   tue:64   Max.   :96.20  
                               (Other): 38   wed:54                  
      DMC              DC             ISI              temp      
 Min.   :  1.1   Min.   :  7.9   Min.   : 0.000   Min.   : 2.20  
 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50  
 Median :108.3   Median :664.2   Median : 8.400   Median :19.30  
 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89  
 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80  
 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   :33.30  
       RH              wind            rain              area        
 Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00  
 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00  
 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52  
 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   :  12.85  
 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57  
 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84  

I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

The various parameters of the lm function are described in the following list:

  • formula: This is the formula to be used for the model.
  • data: This is the dataset.
  • subset: This is the subset of the dataset to be used.
  • weights: These are the weights to apply to factors.
  • …: These are the additional parameters to be passed to the function.

Let's build the model as follows:

> model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires)

Looking at the generated model, we see the following output:

> summary(model)
Call:
lm(formula = area ~ month + temp + wind + rain, data = forestfires)
Residuals:
   Min     1Q Median     3Q     Max
-33.20 -14.93   -9.10   -1.66 1063.59
Coefficients:
           Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.390     24.532 -0.709   0.4787
monthaug     -10.342     22.761 -0.454   0.6498
monthdec     11.534     30.896   0.373   0.7091
monthfeb       2.607     25.796   0.101   0.9196
monthjan       5.988     50.493   0.119   0.9056
monthjul     -8.822    25.068 -0.352   0.7251
monthjun     -15.469     26.974 -0.573   0.5666
monthmar     -6.630     23.057 -0.288   0.7738
monthmay       6.603     50.053   0.132   0.8951
monthnov     -8.244     67.451 -0.122   0.9028
monthoct     -8.268    27.237 -0.304   0.7616
monthsep     -1.070     22.488 -0.048   0.9621
temp           1.569     0.673   2.332   0.0201 *
wind           1.581     1.711   0.924   0.3557
rain         -3.179     9.595 -0.331   0.7406
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 63.99 on 502 degrees of freedom
Multiple R-squared: 0.01692, Adjusted R-squared: -0.0105
F-statistic: 0.617 on 14 and 502 DF, p-value: 0.8518

Looking at the coefficients, temperature is the only predictor with a statistically significant effect on the size of the fires (p ≈ 0.02); none of the individual months reaches significance. Note also that the model treats the month data as categorical, producing a separate coefficient for each month.

If we refit the model without temperature, the fit actually gets worse (notice that the multiple R-squared value drops from about 0.017 to 0.006), as shown here:

> model <- lm(formula = area ~ month + wind + rain, data=forestfires)
> summary(model)
 
Call:
lm(formula = area ~ month + wind + rain, data = forestfires)
 
Residuals:
   Min     1Q Median     3Q     Max
-22.17 -14.39 -10.46   -3.87 1072.43
 
Coefficients:
          Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0126   22.8496   0.176   0.861
monthaug     4.3132   21.9724   0.196   0.844
monthdec     1.3259   30.7188   0.043   0.966
monthfeb     -1.6631   25.8441 -0.064   0.949
monthjan     -6.1034   50.4475 -0.121   0.904
monthjul     6.4648   24.3021   0.266   0.790
monthjun     -2.4944   26.5099 -0.094   0.925
monthmar     -4.8431   23.1458 -0.209   0.834
monthmay     10.5754   50.2441   0.210   0.833
monthnov     -8.7169   67.7479 -0.129   0.898
monthoct     -0.9917   27.1767 -0.036   0.971
monthsep     10.2110   22.0579   0.463   0.644
wind         1.0454     1.7026   0.614   0.540
rain         -1.8504     9.6207 -0.192   0.848
 
Residual standard error: 64.27 on 503 degrees of freedom
Multiple R-squared: 0.006269, Adjusted R-squared: -0.01941
F-statistic: 0.2441 on 13 and 503 DF, p-value: 0.9971

From the results, we can see an R-squared close to 0 and a p-value of almost 1; in other words, the model explains essentially none of the variation in fire area, which is a very poor fit.

If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph:

> plot(model)

You can see from the extreme outliers in the graph that the regression model is not accurate; a handful of very large fires dominate the residuals:

[Figure: plot of residuals versus fitted values for the regression model]

Neural network

In a neural network, it is assumed that there is a complex, possibly nonlinear, relationship between the predictor variables and the target variable. The network's layers of weighted nodes allow it to express each of these relationships.

For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various blood test measurements for each patient, as shown here:

> bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data")
> colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector")
> summary(bupa)
     mcv           alkphos         alamine    
Min.   : 65.00   Min.   : 23.00   Min.   : 4.00
1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00
Median : 90.00   Median : 67.00   Median : 26.00
Mean   : 90.17   Mean   : 69.81   Mean   : 30.36
3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00
Max.   :103.00   Max.   :138.00   Max.   :155.00
   aspartate       glutamyl         drinks    
Min.   : 5.00   Min.   : 5.00   Min.  : 0.000
1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500
Median :23.00   Median : 24.50   Median : 3.000
Mean   :24.64   Mean   : 38.31   Mean   : 3.465
3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000
Max.   :82.00   Max.   :297.00   Max. :20.000
   selector  
Min.   :1.000
1st Qu.:1.000
Median :2.000
Mean   :1.581
3rd Qu.:2.000
Max.   :2.000

We generate a neural network using the neuralnet function. The neuralnet function looks like this:

neuralnet(formula, data, hidden = 1, threshold = 0.01,      
         stepmax = 1e+05, rep = 1, startweights = NULL,
         learningrate.limit = NULL,
         learningrate.factor = list(minus = 0.5, plus = 1.2),
         learningrate=NULL, lifesign = "none",
         lifesign.step = 1000, algorithm = "rprop+",
         err.fct = "sse", act.fct = "logistic",
         linear.output = TRUE, exclude = NULL,
         constant.weights = NULL, likelihood = FALSE)

The various parameters of the neuralnet function are described in the following list:

  • formula: This is the formula to converge.
  • data: This is the data matrix of predictor values.
  • hidden: This is the number of hidden neurons in each layer.
  • stepmax: This is the maximum number of steps in each repetition. The default is 1e+05.
  • rep: This is the number of repetitions.

Let's load the neuralnet package and generate the network as follows:

> library(neuralnet)
> nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks, data=bupa, linear.output=FALSE, hidden=2)

We can see how the model was developed via the result.matrix variable in the following output:

> nn$result.matrix
                                     1
error                 100.005904355153
reached.threshold       0.005904330743
steps                 43.000000000000
Intercept.to.1layhid1   0.880621509705
mcv.to.1layhid1       -0.496298308044
alkphos.to.1layhid1     2.294158313786
alamine.to.1layhid1     1.593035613921
aspartate.to.1layhid1 -0.407602506759
glutamyl.to.1layhid1   -0.257862634340
drinks.to.1layhid1     -0.421390527261
Intercept.to.1layhid2   0.806928998059
mcv.to.1layhid2       -0.531926150470
alkphos.to.1layhid2     0.554627946150
alamine.to.1layhid2     1.589755874579
aspartate.to.1layhid2 -0.182482440722
glutamyl.to.1layhid2   1.806513419058
drinks.to.1layhid2     0.215346602241
Intercept.to.selector   4.485455617018
1layhid.1.to.selector   3.328527160621
1layhid.2.to.selector   2.616395644587

The process took 43 steps to settle on the network, stopping once the error gradient fell below the 0.01 threshold (about 0.0059 in this case). You can see the weights relating each predictor value to the hidden nodes.
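
If we want predictions from the fitted network, the neuralnet package provides the compute function; as a sketch, here I feed the first five patients back through the network:

> covariates <- bupa[1:5, c("mcv","alkphos","alamine","aspartate","glutamyl","drinks")]
> compute(nn, covariates)$net.result   # network outputs for the first five patients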

Looking at the network developed, we can see the hidden layers of relationships among the predictor variables. For example, mcv combines at one ratio in one hidden node and at another ratio in the other, depending on its value. Let's plot the neural network as follows:

> plot(nn)

[Figure: plot of the neural network, showing the two hidden nodes and their weights]

Instance-based learning

R provides a nearest neighbor algorithm (k-NN) in the class package. The k-NN algorithm organizes the training observations by their predictor values; when a new observation arrives, the algorithm finds the nearest training observations and predicts the result that is most common among those neighbors. The nearest neighbor function is knn. The knn function call looks like this:

knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)

The various parameters of the knn function are described in the following list:

  • train: This is the training data.
  • test: This is the test data.
  • cl: This is the factor of true classifications.
  • k: This is the number of neighbors to consider.
  • l: This is the minimum vote for a decision.
  • prob: This is a Boolean flag to return the proportion of winning votes.
  • use.all: This is a Boolean variable for tie handling; TRUE means use all votes at the maximum distance.

I am using the auto MPG dataset for this knn example. First, we load the dataset:

> data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?")
> colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name")
> summary(data)
      mpg          cylinders      displacement     horsepower 
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22  
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20  
 Median :23.00   Median :4.000   Median :148.5   88     : 19  
 Mean   :23.51   Mean   :5.455   Mean   :193.4   110    : 18  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14  
                                                 (Other):288  
     weight      acceleration     model.year        origin     
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
 Median :2804   Median :15.50   Median :76.00   Median :1.000  
 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
           car.name  
 ford pinto    :  6  
 amc matador   :  5  
 ford maverick :  5  
 toyota corolla:  5  
 amc gremlin   :  4  
 amc hornet    :  4  
 (Other)       :369  

There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows:

> library(caret)
> library(class)
> training <- createDataPartition(data$mpg, p=0.75, list=FALSE)
> trainingData <- data[training,]
> testData <- data[-training,]
> model <- knn(train=trainingData, test=testData, cl=trainingData$mpg)
NAs introduced by coercion

The error message means that some values in the dataset have a bad format: they were automatically converted to NA values, and those NA values then caused the function to fail, as they are not allowed in this function call.

First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows:

> completedata <- data[complete.cases(data),]

After looking over the data several times, I guessed that the car name fields were being parsed as numerical data when there was a number in the name, such as Buick Skylark 320. I removed the car name column from the dataset, and we end up with the following valid results:

> drops <- c("car.name")
> completeData2 <- completedata[,!(names(completedata) %in% drops)]
> training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE)
> trainingData <- completeData2[training,]
> testData <- completeData2[-training,]
> model <- knn(train=trainingData, test=testData, cl=trainingData$mpg)

We can see the results of the model by plotting it with the following command; however, the graph doesn't give us much information to work with.

> plot(model)

[Figure: plot of the knn model results]

We can use the kknn function from the kknn package to compare our model with the test data. I like this version a little better, as you can plainly specify the formula for the model. Let's use the kknn function as follows:

> library(kknn)
> model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1)
> fit <- fitted(model)
> plot(testData$mpg, fit)
> abline(a=0, b=1, col=3)

I added a reference line with unit slope to highlight how well the model fits the test data. It looks like the model's variance grows as we move to higher MPG values. I think that means we are missing predictor variables, especially for the later-model, high-MPG cars. That would make sense, as government mandates and consumer demand for high-efficiency vehicles changed vehicle MPG over time. Here is the graph generated by the previous code:

[Figure: fitted kknn values plotted against the test set mpg, with a unit-slope reference line]
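
One way to quantify that growing variance, rather than eyeballing the plot, is the root mean squared error of the fitted values against the held-out mpg values (a quick check, not part of the original walkthrough):

> sqrt(mean((testData$mpg - fit)^2))   # average prediction error, in mpg units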

Ensemble learning

Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both or weight one algorithm more heavily, whichever appears to be the better predictor, as in the sketch below.
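
Here is a minimal sketch of that idea, reusing the trainingData/testData split and the kknn package from the previous section (the 50/50 weighting is an arbitrary choice):

> lmPred  <- predict(lm(mpg ~ ., data=trainingData), newdata=testData)       # regression predictions
> knnPred <- fitted(kknn(mpg ~ ., train=trainingData, test=testData, k=3))   # k-NN predictions
> ensemblePred <- 0.5 * lmPred + 0.5 * knnPred                               # simple average of the two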

Support vector machines

We covered support vector machines (SVM) earlier, but I will run through an example here. As a reminder, SVM is concerned with binary classification. We will use the spam dataset from Hewlett-Packard (part of the kernlab package). First, let's load the data as follows:

> library(kernlab)
> data("spam")
> summary(spam)
     make           address           all             num3d        
Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000
1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000
Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000
Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542
3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000
Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000
…

There are 58 variables with close to 5000 observations, as shown here:

> table(spam$type)
nonspam   spam
   2788   1813

Now, we break up the data into a training set and a test set as follows:

> index <- 1:nrow(spam)
> testindex <- sample(index, trunc(length(index)/3))
> testset <- spam[testindex,]
> trainingset <- spam[-testindex,]

Now, we can produce our SVM model using the svm function. The svm function looks like this:

svm(formula, data = NULL, ..., subset, na.action = na.omit, scale = TRUE)

The various parameters of the svm function are described in the following list:

  • formula: This is the formula model.
  • data: This is the dataset.
  • subset: This is the subset of the dataset to be used.
  • na.action: This contains the action to take with NA values.
  • scale: This determines whether to scale the data.

Let's use the svm function to produce an SVM model as follows:

> library(e1071)
> model <- svm(type ~ ., data = trainingset, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1)
> summary(model)
Call:
svm(formula = type ~ ., data = trainingset, method = "C-classification",
    kernel = "radial", cost = 10, gamma = 0.1)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  10
      gamma:  0.1

Number of Support Vectors:  1555
 ( 645 910 )

Number of Classes:  2

Levels:
 nonspam spam

We can test the model against our test dataset and look at the results as follows:

> pred <- predict(model, testset)
> table(pred, testset$type)
pred      nonspam spam
  nonspam     891  104
  spam         38  500

Note that, at the time of writing, the e1071 package was not compatible with the then-current version of R. Given its usefulness, I would expect the package to be updated to support the user base.

So, using SVM, we have a prediction accuracy rate of just over 90 percent ((891+500) / (891+104+38+500)).
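
That figure can be computed directly from the confusion matrix, since the correct predictions sit on its diagonal:

> tab <- table(pred, testset$type)
> sum(diag(tab)) / sum(tab)   # fraction of messages classified correctly, roughly 0.91 here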

Bayesian learning

With Bayesian learning, we start from an initial premise in a model and adjust that model as new information arrives. We can use the MCMCregress function in the MCMCpack package to fit a Bayesian regression to training data and apply the model to test data. Let's install and load the MCMCpack package as follows:

> install.packages("MCMCpack")
> library(MCMCpack)

We are going to use the transplant data available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied it into a local CSV file.)

The data shows the expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period, so there is a good progression over time as to the success of the program. We can read the dataset as follows:

> transplants <- read.csv("transplant.csv")
> summary(transplants)
   expected         actual       transplants  
Min.   : 0.057   Min.   : 0.000   Min.   : 1.00
1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.: 9.00
Median : 1.654   Median : 2.000   Median : 18.00
Mean   : 2.379   Mean   : 2.382   Mean   : 27.83
3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00
Max.   :12.131   Max.   :18.000   Max.   :152.00

We use Bayesian regression on the data; note that we are modifying the model as we progress through new information using the MCMCregress function. The MCMCregress function looks like this:

MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000,
   thin = 1, verbose = 0, seed = NA, beta.start = NA,
   b0 = 0, B0 = 0, c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
   marginal.likelihood = c("none", "Laplace", "Chib95"), ...)

The various parameters of the MCMCregress function are described in the following list:

  • formula: This is the formula of the model.
  • data: This is the dataset to be used for the model.
  • …: These are the additional parameters for the function.

Let's use the Bayesian regression against the data as follows:

> model <- MCMCregress(expected ~ actual + transplants, data=transplants)
> summary(model)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:
               Mean     SD Naive SE Time-series SE
(Intercept) 0.00484 0.08394 0.0008394     0.0008388
actual     0.03413 0.03214 0.0003214     0.0003214
transplants 0.08238 0.00336 0.0000336     0.0000336
sigma2     0.44583 0.05698 0.0005698     0.0005857
2. Quantiles for each variable:
               2.5%     25%     50%     75%   97.5%
(Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939
actual     -0.02841 0.01257 0.034432 0.05541 0.09706
transplants 0.07574 0.08012 0.082393 0.08464 0.08890
sigma2       0.34777 0.40543 0.441132 0.48005 0.57228

The plot of the data shows the range of results, as shown in the following graph. Contrast this with a simple regression, which produces just one point estimate per coefficient.

> plot(model)

[Figure: trace and density plots of the posterior samples from MCMCregress]
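
To see that contrast concretely, we can put the classical point estimates next to the posterior quantiles (a quick aside, reusing the same transplants data frame):

> coef(lm(expected ~ actual + transplants, data=transplants))   # single point estimates
> summary(model)$quantiles                                      # the posterior quantiles shown above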

Random forests

Random forests is an algorithm that constructs a multitude of decision trees on the data and combines their predictions (for classification, by majority vote) into the final result. We can use the randomForest function in the randomForest package for this. The randomForest function looks like this:

randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

The various parameters of the randomForest function are described in the following list:

  • formula: This is the formula of the model.
  • data: This is the dataset to be used.
  • subset: This is the subset of the dataset to be used.
  • na.action: This is the action to take with NA values.

For an example of random forest, we will use the spam data, as in the Support vector machines section.

First, let's install and load the randomForest package as follows:

> install.packages("randomForest")
> library(randomForest)

Now, we will generate the model with the following command (this may take a while):

> fit <- randomForest(type ~ ., data=spam)

Let's look at the results to see how it went:

> fit
Call:
randomForest(formula = type ~ ., data = spam)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7
       OOB estimate of error rate: 4.48%
Confusion matrix:
        nonspam spam class.error
nonspam   2713   75 0.02690100
spam       131 1682 0.07225593

We can look at the relative importance of the data variables in the final model, as shown here:

> head(importance(fit))
       MeanDecreaseGini
make           7.967392
address       12.654775
all           25.116662
num3d           1.729008
our           67.365754
over           17.579765

Ordering the variables by importance shows that a couple of factors are critical to the determination. For example, the presence of the exclamation character in the e-mail is a dominant indicator of spam:

charExclamation   256.584207
charDollar       200.3655348
remove           168.7962949
free              142.8084662
capitalAve       137.1152451
capitalLong       120.1520829
your             116.6134519
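
A short sketch of how such an ordering can be produced from the fitted forest:

> imp <- importance(fit)                              # MeanDecreaseGini for every variable
> imp[order(imp[,1], decreasing=TRUE), , drop=FALSE]  # most influential indicators first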

Unsupervised learning

With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we examine to determine whether there is a pattern.

We will go over the following unsupervised learning techniques:

  • Cluster analysis
  • Density estimation
  • Expectation-maximization algorithm
  • Hidden Markov models
  • Blind signal separation

Cluster analysis

Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other.

For our example, we will use the wheat seed data available at http://archive.ics.uci.edu, as shown here:

> wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="\t")

Let's look at the raw data:

> head(wheat)
X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1
1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1
2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1
3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1
4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1
5 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1
6 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1

We need to apply column names so we can see the data better:

> colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined")
> head(wheat)
   area perimeter compactness length width asymmetry groove undefined
1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956         1
2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825         1
3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805         1
4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175         1
5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956         1
6 14.69     14.49     0.8799 5.563 3.259     3.586 5.219         1

The last column is not defined in the data description, so I am removing it:

> wheat <- subset(wheat, select = -c(undefined) )
> head(wheat)
   area perimeter compactness length width asymmetry groove
1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956
2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825
3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805
4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175
5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956
6 14.69    14.49     0.8799 5.563 3.259     3.586 5.219

Now, we can finally produce the cluster using the kmeans function. The kmeans function looks like this:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace=FALSE)

The various parameters of the kmeans function are described in the following list:

  • x: This is the dataset.
  • centers: This is the number of centers to coerce the data towards.
  • …: These are the additional parameters of the function.

Let's produce the cluster using the kmeans function:

> fit <- kmeans(wheat, 5)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

Unfortunately, there are some rows with missing data, so let's fix this using the following command:

> wheat <- wheat[complete.cases(wheat),]

Let's look at the data to get some idea of the factors using the following command:

> plot(wheat)

If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit, as shown here:

> fit <- kmeans(wheat, 5)
> fit
K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15
Cluster means:
     area perimeter compactness   length   width asymmetry   groove
1 16.45345 15.35310   0.8768000 5.882655 3.462517 3.913207 5.707655
2 18.95455 16.38879   0.8868000 6.247485 3.744697 2.723545 6.119455
3 14.10536 14.20143   0.8777750 5.480214 3.210554 2.368075 5.070000
4 11.94870 13.27000   0.8516652 5.229304 2.870101 4.910145 5.093333
5 19.58333 16.64600   0.8877267 6.315867 3.835067 5.081533 6.144400
Clustering vector:
...
Within cluster sum of squares by cluster:
[1] 48.36785 30.16164 121.63840 160.96148 25.81297
(between_SS / total_SS = 85.4 %)

If we push to 10 clusters, the performance increases to 92 percent.
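
A common way to choose the number of clusters (see also the Questions section) is to plot the total within-cluster sum of squares for a range of candidate k values and look for an elbow; this is a sketch, not part of the original text:

> wss <- sapply(1:10, function(k) kmeans(wheat, k, nstart=10)$tot.withinss)
> plot(1:10, wss, type="b", xlab="Number of clusters", ylab="Total within-cluster SS")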

Density estimation

Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use the sunspot data from Vincent Arel-Bundock's Rdatasets site. It is not clear whether sunspots are truly random.

Let's load our data as follows:

> sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv")
> summary(sunspots)
       X             time     sunspot.month  
Min.   :   1   Min.   :1749   Min.   : 0.00
1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70
Median :1589   Median :1881   Median : 42.00
Mean   :1589   Mean   :1881   Mean   : 51.96
3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40
Max.   :3177   Max.   :2014   Max.   :253.80
> head(sunspots)
X     time sunspot.month
1 1 1749.000         58.0
2 2 1749.083         62.6
3 3 1749.167         70.0
4 4 1749.250         55.7
5 5 1749.333         85.0
6 6 1749.417        83.5

We will now estimate the density using the following command:

> d <- density(sunspots$sunspot.month)
> d
Call:
density.default(x = sunspots$sunspot.month)
Data: sunspots$sunspot.month (3177 obs.); Bandwidth 'bw' = 7.916
       x               y          
Min.   :-23.75   Min.   :1.810e-07
1st Qu.: 51.58   1st Qu.:1.586e-04
Median :126.90   Median :1.635e-03
Mean   :126.90   Mean   :3.316e-03
3rd Qu.:202.22   3rd Qu.:5.714e-03
Max.   :277.55   Max.   :1.248e-02

A plot is very useful for this function, so let's generate one using the following command:

> plot(d)

[Figure: density plot of the monthly sunspot counts]

It is interesting to see such a wide variation; maybe the data is pretty random after all.

We can simulate additional observations by resampling the data with normal jitter and overlay the resulting density, as follows:

> N<-1000
> sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE))
> lines(density(sunspots.new), col="blue")

[Figure: the simulated density (blue) overlaid on the original density plot]

It looks like our density estimate is very accurate.

Expectation-maximization

Expectation-maximization (EM) is an unsupervised clustering approach that iteratively adjusts the model parameters toward values that best explain the data.

When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments. The example tries to model the iris species from the other data points. Let's load the data as shown here:

> library(mclust)
> iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
> colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species")
> modelName = "EEE"

Each observation has sepal length, width, petal length, width, and species,
as shown here:

> head(iris)
SepalLength SepalWidth PetalLength PetalWidth     Species
1         5.1       3.5         1.4       0.2 Iris-setosa
2         4.9       3.0         1.4       0.2 Iris-setosa
3         4.7       3.2         1.3       0.2 Iris-setosa
4         4.6       3.1         1.5       0.2 Iris-setosa
5         5.0       3.6         1.4       0.2 Iris-setosa
6         5.4       3.9         1.7       0.4 Iris-setosa

We are estimating the species from the other points, so let's separate the data
as follows:

> data = iris[,-5]
> z = unmap(iris[,5])

Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name:

> msEst <- mstep(modelName, data, z)

We use the parameters defined in the mstep to produce our model, as shown here:

> em(modelName, data, msEst$parameters)
$z
               [,1]         [,2]         [,3]
[1,] 1.000000e+00 4.304299e-22 1.699870e-42
…
[150,] 8.611281e-34 9.361398e-03 9.906386e-01
$parameters$pro
[1] 0.3333333 0.3294048 0.3372619
$parameters$mean
             [,1]     [,2]     [,3]
SepalLength 5.006 5.941844 6.574697
SepalWidth 3.418 2.761270 2.980150
PetalLength 1.464 4.257977 5.538926
PetalWidth 0.244 1.319109 2.024576
$parameters$variance$d
[1] 4
$parameters$variance$G
[1] 3
$parameters$variance$sigma
, , 1
           SepalLength SepalWidth PetalLength PetalWidth
SepalLength 0.26381739 0.09030470 0.16940062 0.03937152
SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280
PetalLength 0.16940062 0.05133876 0.18624355 0.04183377
PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165
, , 2
, , 3
… (there was little difference in the 3 sigma values)
Covariance
$parameters$variance$Sigma
           SepalLength SepalWidth PetalLength PetalWidth
SepalLength 0.26381739 0.09030470 0.16940062 0.03937152
SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280
PetalLength 0.16940062 0.05133876 0.18624355 0.04183377
PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165
$parameters$variance$cholSigma
            SepalLength SepalWidth PetalLength PetalWidth
SepalLength -0.5136316 -0.1758161 -0.32980960 -0.07665323
SepalWidth   0.0000000 0.2856706 -0.02326832 0.06072001
PetalLength   0.0000000 0.0000000 -0.27735855 -0.06477412
PetalWidth   0.0000000 0.0000000 0.00000000 0.16168899
attr(,"info")
iterations       error
4.000000e+00 1.525131e-06

There is quite a lot of output from the em function. The highlights for me were that the three sigma matrices were essentially the same and that the error from the function was very small. So, I think we have a very good estimation of species using just the four measurements.
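
To check that estimation, we can convert the posterior matrix z back into hard classifications with mclust's map function and cross-tabulate them against the true species (a sketch reusing the objects above):

> emResult <- em(modelName, data, msEst$parameters)
> table(iris$Species, map(emResult$z))   # rows: true species; columns: assigned cluster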

Hidden Markov models

A hidden Markov model (HMM) is based on the idea that the observed data was produced by an underlying Markov model whose states are hidden from the observer. The problem is to discover what that model is.

I am using the Python example on Wikipedia for HMM. For an HMM, we need states (assumed to be hidden from observer), symbols, transition matrix between states, emission (output) states, and probabilities for all.

The Python information presented is as follows:

states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
   'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
   'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
   }
emission_probability = {
   'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
   'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
   }

We convert these for use in R with the initHMM function (from the HMM package) by using the following command:

> library(HMM)
> hmm <- initHMM(c("Rainy","Sunny"), c("walk","shop","clean"),
                 c(.6,.4), matrix(c(.7,.3,.4,.6),2), matrix(c(.1,.4,.5,.6,.3,.1),3))
> hmm
$States
[1] "Rainy" "Sunny"

$Symbols
[1] "walk"  "shop"  "clean"

$startProbs
Rainy Sunny 
  0.6   0.4 

$transProbs
       to
from    Rainy Sunny
  Rainy   0.7   0.4
  Sunny   0.3   0.6

$emissionProbs
       symbols
states  walk shop clean
  Rainy  0.1  0.5   0.3
  Sunny  0.4  0.6   0.1

The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows:

> future <- forward(hmm, c("walk","shop","clean"))
> future
       index
states         1         2         3
Rainy -2.813411 -3.101093 -4.139551
Sunny -1.832581 -2.631089 -5.096193

The result is a matrix of log probabilities from the forward algorithm. For example, at the first observation (walk), the higher value for Sunny means that state is the more likely one.
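
If instead we want the single most likely sequence of hidden states for those observations, the HMM package also provides the viterbi function:

> viterbi(hmm, c("walk","shop","clean"))   # most probable sequence of weather states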

Blind signal separation

Blind signal separation is the process of identifying sources of signals from a mixed signal; principal component analysis (PCA) is one method of doing this. An example is a cocktail party where you are trying to listen to one speaker.

For this example, I am using the decathlon dataset in the FactoMineR package,
as shown here:

> library(FactoMineR)
> data(decathlon)

Let's look at the data to get some idea of what is available:

> summary(decathlon)
100m           Long.jump     Shot.put       High.jump
Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850
1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920
Median :10.98   Median :7.30   Median :14.57   Median :1.950
Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977
3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040
Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150
400m           110m.hurdle       Discus       Pole.vault  
Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200
1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500
Median :49.40   Median :14.48   Median :44.41   Median :4.800
Mean   :49.62   Mean   :14.61 Mean   :44.33   Mean   :4.762
3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920
Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400
Javeline       1500m           Rank           Points  
Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313
1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802
Median :58.36   Median :278.1   Median :11.00   Median :8021
Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005
3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122
Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   :8893
   Competition
Decastar:13
OlympicG:28

The output looks like performance data from a series of events at a track meet:

> head(decathlon)
       100m   Long.jump Shot.put High.jump 400m 110m.hurdle Discus
SEBRLE 11.04     7.58   14.83     2.07 49.81       14.69 43.75
CLAY   10.76     7.40   14.26     1.86 49.37       14.05 50.72
KARPOV 11.02     7.30   14.77     2.04 48.37       14.09 48.95
BERNARD 11.02     7.23   14.25     1.92 48.93       14.99 40.87
YURKOV 11.34     7.09   15.19     2.10 50.42       15.31 46.26
WARNERS 11.11     7.60   14.31     1.98 48.68       14.23 41.10
       Pole.vault Javeline 1500m Rank Points Competition
SEBRLE       5.02   63.19 291.7   1   8217   Decastar
CLAY         4.92   60.15 301.5   2   8122   Decastar
KARPOV       4.92   50.31 300.2   3   8099   Decastar
BERNARD       5.32   62.77 280.1   4   8067   Decastar
YURKOV       4.72   63.44 276.4   5   8036   Decastar
WARNERS       4.92   51.77 278.1   6   8030   Decastar

Further, this is performance of specific individuals in track meets.

We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs:

> res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T)

This produces two graphs:

  • Individual factors map
  • Variables factor map

The individual factors map lays out the performance of the individuals. For example, we see Karpov, who scores high on both dimensions, versus Bourguignon, who is performing badly (on the left in the following chart):

[Figure: individual factors map from the PCA]

The variables factor map shows the correlation of performance between events. For example, the 400 meters time is negatively correlated with the long jump distance; since a lower time is better, athletes who did well in one tended to do well in the other. Here is the variables factor map of our data:

[Figure: variables factor map from the PCA]
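
We can sanity-check that relationship numerically with a plain correlation (column names as in the decathlon data frame):

> cor(decathlon[,"400m"], decathlon[,"Long.jump"])   # negative: faster 400 m times go with longer jumps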

Questions

Factual

  • Which supervised learning technique(s) do you lean towards as your
    "go to" solution?
  • Why are the density plots for Bayesian results off-center?

When, how, and why?

  • How would you decide on the number of clusters to use?
  • Find a good rule of thumb to decide the number of hidden layers in a
    neural net.

Challenges

  • Investigate other blind signal separation techniques, such as ICA.
  • Use other methods, such as poisson, in the rpart function (especially if you have a natural occurring dataset).

Summary

In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised, we only have a possible set of predictor variables and are looking for patterns.

In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation.
