In this article by Dan Toomey, author of the book R for Data Science, we will learn about supervised learning, which uses a target variable and a number of predictor variables that are fed into a model so that the system can predict the target. This is also known as predictive modeling.
As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to combine the predictor variables in a way that accurately predicts the target variable. We use some portion of the observed data to train the model, and then test that model on the remaining observations for accuracy.
We will go over the following supervised learning techniques:
- Decision trees
- Regression
- Neural networks
- Instance-based learning (k-NN)
- Ensemble learning
- Support vector machines
- Bayesian learning
- Random forests
For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value.
For this example, we will use the weather data in the rattle package. We will develop a decision tree to determine whether or not it will rain tomorrow based on several variables. Let's load the rattle package as follows:
> library(rattle)
We can see a summary of the weather data. This shows that we have some real data over a year from Australia:
> summary(weather)
      Date               Location       MinTemp
 Min.   :2007-11-01   Canberra     :366   Min.   :-5.300
 1st Qu.:2008-01-31   Adelaide     :  0   1st Qu.: 2.300
 Median :2008-05-01   Albany       :  0   Median : 7.450
 Mean   :2008-05-01   Albury       :  0   Mean   : 7.266
 3rd Qu.:2008-07-31   AliceSprings :  0   3rd Qu.:12.500
 Max.   :2008-10-31   BadgerysCreek:  0   Max.   :20.900
                      (Other)      :  0
    MaxTemp         Rainfall        Evaporation        Sunshine
 Min.   : 7.60   Min.   : 0.000   Min.   : 0.200   Min.   : 0.000
 1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950
 Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600
 Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909
 3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500
 Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600
                                                   NA's   :3
  WindGustDir   WindGustSpeed     WindDir9am    WindDir3pm
 NW     : 73   Min.   :13.00   SE     : 47   WNW    : 61
 NNW    : 44   1st Qu.:31.00   SSE    : 40   NW     : 61
 E      : 37   Median :39.00   NNW    : 36   NNW    : 47
 WNW    : 35   Mean   :39.84   N      : 31   N      : 30
 ENE    : 30   3rd Qu.:46.00   NW     : 30   ESE    : 27
 (Other):144   Max.   :98.00   (Other):151   (Other):139
 NA's   :  3   NA's   :2       NA's   : 31   NA's   :  1
  WindSpeed9am     WindSpeed3pm    Humidity9am     Humidity3pm
 Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00
 1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25
 Median : 7.000   Median :17.00   Median :72.00   Median :43.00
 Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52
 3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00
 Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   :96.00
 NA's   :7
  Pressure9am      Pressure3pm       Cloud9am        Cloud3pm
 Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000
 1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000
 Median :1020.1   Median :1017.4   Median :3.500   Median :4.000
 Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025
 3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000
 Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000
    Temp9am          Temp3pm      RainToday    RISK_MM
 Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000
 1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000
 Median :12.550   Median :18.55             Median : 0.000
 Mean   :12.358   Mean   :19.23             Mean   : 1.428
 3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200
 Max.   :24.700   Max.   :34.50             Max.   :39.800
 RainTomorrow
 No :300
 Yes: 66
We will be using the rpart function to develop a decision tree. The rpart function looks like this:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
The various parameters of the rpart function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula used for the prediction. |
| data | This is the data matrix. |
| weights | These are the optional weights to be applied. |
| subset | This is the optional subset of rows of data to be used. |
| na.action | This specifies the action to be taken when y, the target value, is missing. |
| method | This is the method to be used to interpret the data. It should be one of these: anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data. |
| … | These are the additional parameters used to control the behavior of the algorithm. |
Let's create a subset as follows:
> weather2 <- subset(weather, select=-c(RISK_MM))
> install.packages("rpart")
> library(rpart)
> model <- rpart(formula=RainTomorrow ~ ., data=weather2, method="class")
> summary(model)
Call:
rpart(formula = RainTomorrow ~ ., data = weather2, method = "class")
  n= 366

          CP nsplit rel error    xerror      xstd
1 0.19696970      0 1.0000000 1.0000000 0.1114418
2 0.09090909      1 0.8030303 0.9696970 0.1101055
3 0.01515152      2 0.7121212 1.0151515 0.1120956
4 0.01000000      7 0.6363636 0.9090909 0.1073129

Variable importance
  Humidity3pm WindGustSpeed      Sunshine  WindSpeed3pm       Temp3pm
           24            14            12             8             6
  Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am
            6             5             4             4             4
  Evaporation          Date   Humidity9am      Cloud3pm      Cloud9am
            3             3             2             2             1
     Rainfall
            1

Node number 1: 366 observations,    complexity param=0.1969697
  predicted class=No  expected loss=0.1803279  P(node) =1
    class counts:   300    66
   probabilities: 0.820 0.180
  left son=2 (339 obs) right son=3 (27 obs)
  Primary splits:
      Humidity3pm < 71.5    to the left,  improve=18.31013, (0 missing)
      Pressure3pm < 1011.9  to the right, improve=17.35280, (0 missing)
      Cloud3pm    < 6.5     to the left,  improve=16.14203, (0 missing)
      Sunshine    < 6.45    to the right, improve=15.36364, (3 missing)
      Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing)
  Surrogate splits:
      Sunshine < 0.45 to the right, agree=0.945, adj=0.259, (0 split)
(many more)…
As you can tell, the model is complicated. The summary shows the progression of the model development, using more and more of the data to fine-tune the tree. We will use the fancyRpartPlot function from the rattle package (which builds on rpart.plot) to display the decision tree in a readable manner, as follows:
> library(rpart.plot)
> fancyRpartPlot(model, main="Rain Tomorrow", sub="Chapter 12")
This is the output of the fancyRpartPlot function:
Now, we can follow the logic of the decision tree easily. For example, if Humidity3pm is 72 or higher (the model splits at 71.5), we predict rain.
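Once the tree is built, we can also classify new observations with the predict function. Here is a minimal sketch; the newday row is just an existing observation standing in for tomorrow's readings:

> # Reuse a row of weather2 as a stand-in for a new day's readings
> newday <- weather2[364, ]
> predict(model, newdata=newday, type="class")

Using type="class" returns the predicted factor level (Yes or No) rather than the class probabilities.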
We can use regression to predict our target value by producing a regression model from our predictor variables.
We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary:
> forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv")
> summary(forestfires)
       X               Y           month      day          FFMC
 Min.   :1.000   Min.   :2.0   aug    :184   fri:85   Min.   :18.70
 1st Qu.:3.000   1st Qu.:4.0   sep    :172   mon:74   1st Qu.:90.20
 Median :4.000   Median :4.0   mar    : 54   sat:84   Median :91.60
 Mean   :4.669   Mean   :4.3   jul    : 32   sun:95   Mean   :90.64
 3rd Qu.:7.000   3rd Qu.:5.0   feb    : 20   thu:61   3rd Qu.:92.90
 Max.   :9.000   Max.   :9.0   jun    : 17   tue:64   Max.   :96.20
                               (Other): 38   wed:54
      DMC              DC             ISI              temp
 Min.   :  1.1   Min.   :  7.9   Min.   : 0.000   Min.   : 2.20
 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50
 Median :108.3   Median :664.2   Median : 8.400   Median :19.30
 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89
 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80
 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   :33.30
       RH              wind            rain              area
 Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00
 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00
 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52
 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   :  12.85
 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57
 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84
I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this:
lm(formula, data, subset, weights, na.action, method = "qr",
   model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
   contrasts = NULL, offset, ...)
The various parameters of the lm function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula to be used for the model |
| data | This is the dataset |
| subset | This is the subset of the dataset to be used |
| weights | These are the weights to apply to factors |
| … | These are the additional parameters to be added to the function |
Let's build the model as follows:
> model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires)
Looking at the generated model, we see the following output:
> summary(model)

Call:
lm(formula = area ~ month + temp + wind + rain, data = forestfires)

Residuals:
    Min      1Q  Median      3Q     Max
 -33.20  -14.93   -9.10   -1.66 1063.59

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -17.390     24.532  -0.709   0.4787
monthaug     -10.342     22.761  -0.454   0.6498
monthdec      11.534     30.896   0.373   0.7091
monthfeb       2.607     25.796   0.101   0.9196
monthjan       5.988     50.493   0.119   0.9056
monthjul      -8.822     25.068  -0.352   0.7251
monthjun     -15.469     26.974  -0.573   0.5666
monthmar      -6.630     23.057  -0.288   0.7738
monthmay       6.603     50.053   0.132   0.8951
monthnov      -8.244     67.451  -0.122   0.9028
monthoct      -8.268     27.237  -0.304   0.7616
monthsep      -1.070     22.488  -0.048   0.9621
temp           1.569      0.673   2.332   0.0201 *
wind           1.581      1.711   0.924   0.3557
rain          -3.179      9.595  -0.331   0.7406
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.99 on 502 degrees of freedom
Multiple R-squared:  0.01692,   Adjusted R-squared:  -0.0105
F-statistic: 0.617 on 14 and 502 DF,  p-value: 0.8518
Surprisingly, none of the month coefficients is statistically significant; temperature is the only predictor with a significant effect (p ≈ 0.02). I would have guessed that whether a fire occurred in August or a similar peak month would have made a discernible difference. Note also that the model treats the month data as categorical, generating a separate coefficient for each level.
If we redevelop the model without temperature, the fit gets even worse (notice the multiple R-squared drops to 0.006 from 0.017), as shown here:
> model <- lm(formula = area ~ month + wind + rain, data=forestfires)
> summary(model)

Call:
lm(formula = area ~ month + wind + rain, data = forestfires)

Residuals:
    Min      1Q  Median      3Q     Max
 -22.17  -14.39  -10.46   -3.87 1072.43

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0126    22.8496   0.176    0.861
monthaug      4.3132    21.9724   0.196    0.844
monthdec      1.3259    30.7188   0.043    0.966
monthfeb     -1.6631    25.8441  -0.064    0.949
monthjan     -6.1034    50.4475  -0.121    0.904
monthjul      6.4648    24.3021   0.266    0.790
monthjun     -2.4944    26.5099  -0.094    0.925
monthmar     -4.8431    23.1458  -0.209    0.834
monthmay     10.5754    50.2441   0.210    0.833
monthnov     -8.7169    67.7479  -0.129    0.898
monthoct     -0.9917    27.1767  -0.036    0.971
monthsep     10.2110    22.0579   0.463    0.644
wind          1.0454     1.7026   0.614    0.540
rain         -1.8504     9.6207  -0.192    0.848

Residual standard error: 64.27 on 503 degrees of freedom
Multiple R-squared:  0.006269,  Adjusted R-squared:  -0.01941
F-statistic: 0.2441 on 13 and 503 DF,  p-value: 0.9971
From the results, we can see an R-squared close to 0 and a p-value of almost 1; this is a very poor fit. The predictors explain almost none of the variation in burned area.
If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph:
> plot(model)
You can see from the graph that the residuals are far from evenly scattered; a handful of very large fires dominate, which helps explain why the linear model fits so poorly.
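One way to check formally whether dropping temperature hurts the model is a nested-model F-test; here is a quick sketch:

> # Compare the model with and without temperature
> small <- lm(area ~ month + wind + rain, data=forestfires)
> full  <- lm(area ~ month + temp + wind + rain, data=forestfires)
> anova(small, full)   # a small p-value indicates temp carries real signal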
In a neural network, it is assumed that there is a complex relationship between the predictor variables and the target variable. The network allows the expression of each of these relationships.
For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various measures of blood for each patient as shown here:
> bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data")
> colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector")
> summary(bupa)
      mcv            alkphos          alamine
 Min.   : 65.00   Min.   : 23.00   Min.   :  4.00
 1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00
 Median : 90.00   Median : 67.00   Median : 26.00
 Mean   : 90.17   Mean   : 69.81   Mean   : 30.36
 3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00
 Max.   :103.00   Max.   :138.00   Max.   :155.00
   aspartate         glutamyl          drinks
 Min.   : 5.00   Min.   :  5.00   Min.   : 0.000
 1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500
 Median :23.00   Median : 24.50   Median : 3.000
 Mean   :24.64   Mean   : 38.31   Mean   : 3.465
 3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000
 Max.   :82.00   Max.   :297.00   Max.   :20.000
    selector
 Min.   :1.000
 1st Qu.:1.000
 Median :2.000
 Mean   :1.581
 3rd Qu.:2.000
 Max.   :2.000
We generate a neural network using the neuralnet function. The neuralnet function looks like this:
neuralnet(formula, data, hidden = 1, threshold = 0.01, stepmax = 1e+05,
    rep = 1, startweights = NULL, learningrate.limit = NULL,
    learningrate.factor = list(minus = 0.5, plus = 1.2), learningrate = NULL,
    lifesign = "none", lifesign.step = 1000, algorithm = "rprop+",
    err.fct = "sse", act.fct = "logistic", linear.output = TRUE,
    exclude = NULL, constant.weights = NULL, likelihood = FALSE)
The various parameters of the neuralnet function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula to converge. |
| data | This is the data matrix of predictor values. |
| hidden | This is the number of hidden neurons in each layer. |
| stepmax | This is the maximum number of steps in each repetition. The default is 1e+05. |
| rep | This is the number of repetitions. |
Let's generate the neural network as follows:
> library(neuralnet)
> nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks,
    data=bupa, linear.output=FALSE, hidden=2)
We can see how the model was developed via the result.matrix variable in the following output:
> nn$result.matrix
                                     1
error                 100.005904355153
reached.threshold       0.005904330743
steps                  43.000000000000
Intercept.to.1layhid1   0.880621509705
mcv.to.1layhid1        -0.496298308044
alkphos.to.1layhid1     2.294158313786
alamine.to.1layhid1     1.593035613921
aspartate.to.1layhid1  -0.407602506759
glutamyl.to.1layhid1   -0.257862634340
drinks.to.1layhid1     -0.421390527261
Intercept.to.1layhid2   0.806928998059
mcv.to.1layhid2        -0.531926150470
alkphos.to.1layhid2     0.554627946150
alamine.to.1layhid2     1.589755874579
aspartate.to.1layhid2  -0.182482440722
glutamyl.to.1layhid2    1.806513419058
drinks.to.1layhid2      0.215346602241
Intercept.to.selector   4.485455617018
1layhid.1.to.selector   3.328527160621
1layhid.2.to.selector   2.616395644587
The process took 43 steps to converge, stopping once the error gradient fell below the threshold of 0.01 (0.0059 in this case). In the result matrix, you can see the weights relating each predictor to the hidden nodes.
Looking at the network developed, we can see the hidden layer of relationships among the predictor variables. For example, mcv combines into each of the two hidden nodes at a different ratio. Let's plot the neural network as follows:
> plot(nn)
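To actually score new patients with the fitted network, neuralnet provides a compute function. Here is a minimal sketch, using the first five rows of bupa as stand-in new cases:

> # Predictor columns only; selector is what we are predicting
> newcases <- bupa[1:5, c("mcv","alkphos","alamine","aspartate","glutamyl","drinks")]
> results <- compute(nn, newcases)
> results$net.result   # network output for each of the five rows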
R programming has a nearest neighbor algorithm (k-NN), available through the knn function in the class package. The k-NN algorithm organizes the training observations by their predictor values; a new observation is then matched against this organization, and the algorithm selects the result (prediction) that is most applicable based on the nearness of the new observation's predictor values. The knn function call looks like this:
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
The various parameters of the knn function are described in the following table:
| Parameter | Description |
| --- | --- |
| train | This is the training data. |
| test | This is the test data. |
| cl | This is the factor of true classifications. |
| k | This is the number of neighbors to consider. |
| l | This is the minimum vote for a decision. |
| prob | This is a Boolean flag to return the proportion of winning votes. |
| use.all | This is a Boolean variable for tie handling; TRUE means all neighbors tied at the maximum distance are used. |
I am using the auto MPG dataset for this example of knn. First, we load the dataset:
> data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?")
> colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name")
> summary(data)
      mpg          cylinders      displacement     horsepower
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20
 Median :23.00   Median :4.000   Median :148.5   88     : 19
 Mean   :23.51   Mean   :5.455   Mean   :193.4   110    : 18
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14
                                                 (Other):288
     weight      acceleration     model.year        origin
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000
 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000
 Median :2804   Median :15.50   Median :76.00   Median :1.000
 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573
 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000
          car.name
 ford pinto    :  6
 amc matador   :  5
 ford maverick :  5
 toyota corolla:  5
 amc gremlin   :  4
 amc hornet    :  4
 (Other)       :369
There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows:
> library(caret)
> library(class)
> training <- createDataPartition(data$mpg, p=0.75, list=FALSE)
> trainingData <- data[training,]
> testData <- data[-training,]
> model <- knn(train=trainingData, test=testData, cl=trainingData$mpg)
NAs introduced by coercion
The error message means that some numbers in the dataset have a bad format. The bad numbers were automatically converted to NA values. Then the inclusion of the NA values caused the function to fail, as NA values are not expected in this function call.
First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows:
> completedata <- data[complete.cases(data),]
After looking over the data several times, I guessed that the car name fields were being parsed as numerical data whenever the name contained a number, such as Buick Skylark 320. I removed the car name column from the dataset, and we end up with valid results:
> drops <- c("car.name")
> completeData2 <- completedata[,!(names(completedata) %in% drops)]
> training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE)
> trainingData <- completeData2[training,]
> testData <- completeData2[-training,]
> model <- knn(train=trainingData, test=testData, cl=trainingData$mpg)
We can see the results of the model by plotting it with the following command. However, the graph doesn't give us much information to work with.
> plot(model)
We can use a different function, kknn, to compare our model with the test data. I like this version a little better, as you can plainly specify the formula for the model. Let's use the kknn function as follows:
> library(kknn)
> model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1)
> fit <- fitted(model)
> plot(testData$mpg, fit)
> abline(a=0, b=1, col=3)
I added a simple identity line to highlight how well the model fits the test data. It looks like, as we progress to higher MPG values, our model has a higher degree of variance. I think that means we are missing predictor variables, especially for the later-model, high-MPG series of cars. That would make sense, as government mandates and consumer demand for high-efficiency vehicles changed the mpg of vehicles over time. Here is the graph generated by the previous code:
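To put a number on that impression, we can compute the correlation and root mean squared error between the fitted values and the held-out mpg values; a quick sketch:

> cor(testData$mpg, fit)               # closer to 1 is better
> sqrt(mean((testData$mpg - fit)^2))   # RMSE in mpg units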
Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both, or give heavier weight to one or the other algorithm, whichever appears to be a better predictor. A sketch of this idea follows.
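As a rough illustration on the auto MPG data from the previous section (the equal 50/50 weighting here is an assumption, not a tuned choice), we might average a linear model with the kknn fit:

> # Two individual learners on the same training split
> lmFit   <- lm(mpg ~ ., data=trainingData)
> lmPred  <- predict(lmFit, newdata=testData)
> knnPred <- fitted(kknn(mpg ~ ., train=trainingData, test=testData, k=3))
> # Simple unweighted ensemble of the two predictions
> ensemblePred <- (lmPred + knnPred) / 2
> sqrt(mean((testData$mpg - ensemblePred)^2))   # compare to each model's own RMSE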
We covered support vector machines (SVM) earlier, but I will run through an example here. As a reminder, an SVM is fundamentally a binary classifier. We will use the spam dataset from Hewlett-Packard (part of the kernlab package). First, let's load the data as follows:
> library(kernlab)
> data("spam")
> summary(spam)
      make            address            all             num3d
 Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000
 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000
 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000
 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542
 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000
 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000
…
There are 58 variables with close to 5000 observations, as shown here:
> table(spam$type)

nonspam    spam
   2788    1813
Now, we break up the data into a training set and a test set as follows:
> index <- 1:nrow(spam)
> testindex <- sample(index, trunc(length(index)/3))
> testset <- spam[testindex,]
> trainingset <- spam[-testindex,]
Now, we can produce our SVM model using the svm function. The svm function looks like this:
svm(formula, data = NULL, ..., subset, na.action = na.omit, scale = TRUE)
The various parameters of the svm function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula model |
| data | This is the dataset |
| subset | This is the subset of the dataset to be used |
| na.action | This contains what action to take with NA values |
| scale | This determines whether to scale the data |
Let's use the svm function to produce an SVM model as follows:
> library(e1071)
> model <- svm(type ~ ., data = trainingset, method = "C-classification",
    kernel = "radial", cost = 10, gamma = 0.1)
> summary(model)

Call:
svm(formula = type ~ ., data = trainingset, method = "C-classification",
    kernel = "radial", cost = 10, gamma = 0.1)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  10
      gamma:  0.1

Number of Support Vectors:  1555
 ( 645 910 )

Number of Classes:  2

Levels:
 nonspam spam
We can test the model against our test dataset and look at the results as follows:
> pred <- predict(model, testset)
> table(pred, testset$type)

pred      nonspam spam
  nonspam     891  104
  spam         38  500
Note that, at the time of writing, the e1071 package had not yet been updated for the current version of R. Given its usefulness, I would expect the package to be updated to support its user base.
So, using SVM, we have roughly a 91 percent ((891+500) / (891+104+38+500) ≈ 0.907) prediction accuracy rate.
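The same figure can be computed directly from the confusion matrix; for instance:

> confusion <- table(pred, testset$type)
> sum(diag(confusion)) / sum(confusion)   # overall accuracy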
With Bayesian learning, we have an initial premise in a model that is adjusted with new information. We can use the MCMCregress method in the MCMCpack package to apply Bayesian regression to learning data and then apply the model to test data. Let's load the MCMCpack package as follows:
> install.packages("MCMCpack") > library(MCMCpack)
We are going to use the transplant data available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied it into a local CSV file.)
The data shows the expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period. So, there is a good progression over time in the success of the program. We can read the dataset as follows:
> transplants <- read.csv("transplant.csv")
> summary(transplants)
    expected          actual         transplants
 Min.   : 0.057   Min.   : 0.000   Min.   :  1.00
 1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.:  9.00
 Median : 1.654   Median : 2.000   Median : 18.00
 Mean   : 2.379   Mean   : 2.382   Mean   : 27.83
 3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00
 Max.   :12.131   Max.   :18.000   Max.   :152.00
We use Bayesian regression against the data; note that we are modifying the model as we progress with new information using the MCMCregress function. The MCMCregress function looks like this:
MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000, thin = 1,
    verbose = 0, seed = NA, beta.start = NA, b0 = 0, B0 = 0,
    c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
    marginal.likelihood = c("none", "Laplace", "Chib95"), ...)
The various parameters of the MCMCregress function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula of the model |
| data | This is the dataset to be used for the model |
| … | These are the additional parameters for the function |
Let's use the Bayesian regression against the data as follows:
> model <- MCMCregress(expected ~ actual + transplants, data=transplants)
> summary(model)

Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean      SD  Naive SE Time-series SE
(Intercept) 0.00484 0.08394 0.0008394      0.0008388
actual      0.03413 0.03214 0.0003214      0.0003214
transplants 0.08238 0.00336 0.0000336      0.0000336
sigma2      0.44583 0.05698 0.0005698      0.0005857

2. Quantiles for each variable:

                2.5%      25%      50%     75%   97.5%
(Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939
actual      -0.02841  0.01257 0.034432 0.05541 0.09706
transplants  0.07574  0.08012 0.082393 0.08464 0.08890
sigma2       0.34777  0.40543 0.441132 0.48005 0.57228
The plot of the model shows the posterior range of results, as shown in the following graph. Contrast this with a simple regression, which produces a single point estimate for each coefficient.
> plot(model)
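If we want point estimates from the posterior, the Mean column of the summary statistics serves as a plug-in coefficient vector. Here is a sketch; the values 2 and 18 below are simply the median actual and transplants values from the data, used for illustration:

> coefs <- summary(model)$statistics[, "Mean"]
> # Plug-in prediction of 'expected' for actual = 2, transplants = 18
> coefs["(Intercept)"] + coefs["actual"] * 2 + coefs["transplants"] * 18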
Random forests is an algorithm that constructs a multitude of decision trees for the model of the data and aggregates their votes to produce the final result. We can use the randomForest function in the randomForest package for this. The randomForest function looks like this:
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
The various parameters of the randomForest function are described in the following table:
| Parameter | Description |
| --- | --- |
| formula | This is the formula of the model |
| data | This is the dataset to be used |
| subset | This is the subset of the dataset to be used |
| na.action | This is the action to take with NA values |
For an example of random forest, we will use the spam data, as in the Support vector machines section.
First, let's load the package and library as follows:
> install.packages("randomForest") > library(randomForest)
Now, we will generate the model with the following command (this may take a while):
> fit <- randomForest(type ~ ., data=spam)
Let's look at the results to see how it went:
> fit

Call:
 randomForest(formula = type ~ ., data = spam)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 4.48%
Confusion matrix:
        nonspam spam class.error
nonspam    2713   75  0.02690100
spam        131 1682  0.07225593
We can look at the relative importance of the data variables in the final model, as shown here:
> head(importance(fit))
        MeanDecreaseGini
make            7.967392
address        12.654775
all            25.116662
num3d           1.729008
our            67.365754
over           17.579765
Ordering the variables by importance shows a couple of factors to be critical to the determination. For example, the presence of the exclamation character in the e-mail is shown as a dominant indicator of spam:
charExclamation 256.5842070
charDollar      200.3655348
remove          168.7962949
free            142.8084662
capitalAve      137.1152451
capitalLong     120.1520829
your            116.6134519
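One way to produce that ordering (a sketch):

> imp <- importance(fit)
> head(imp[order(imp[, "MeanDecreaseGini"], decreasing=TRUE), , drop=FALSE], 7)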
With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we look into to determine if there is a pattern.
We will go over the following unsupervised learning techniques:
- Cluster analysis
- Density estimation
- Expectation-maximization
- Hidden Markov models
- Blind signal separation
Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other.
For our example, we will use the wheat seed data available from the UCI repository at http://archive.ics.uci.edu, as shown here:
> wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="\t")
Let's look at the raw data:
> head(wheat)
  X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1
1  14.88  14.57 0.8811  5.554  3.333  1.018 4.956  1
2  14.29  14.09 0.9050  5.291  3.337  2.699 4.825  1
3  13.84  13.94 0.8955  5.324  3.379  2.259 4.805  1
4  16.14  14.99 0.9034  5.658  3.562  1.355 5.175  1
5  14.38  14.21 0.8951  5.386  3.312  2.462 4.956  1
6  14.69  14.49 0.8799  5.563  3.259  3.586 5.219  1
We need to apply column names so we can see the data better:
> colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined")
> head(wheat)
   area perimeter compactness length width asymmetry groove undefined
1 14.88     14.57      0.8811  5.554 3.333     1.018  4.956         1
2 14.29     14.09      0.9050  5.291 3.337     2.699  4.825         1
3 13.84     13.94      0.8955  5.324 3.379     2.259  4.805         1
4 16.14     14.99      0.9034  5.658 3.562     1.355  5.175         1
5 14.38     14.21      0.8951  5.386 3.312     2.462  4.956         1
6 14.69     14.49      0.8799  5.563 3.259     3.586  5.219         1
The last column is not defined in the data description, so I am removing it:
> wheat <- subset(wheat, select = -c(undefined))
> head(wheat)
   area perimeter compactness length width asymmetry groove
1 14.88     14.57      0.8811  5.554 3.333     1.018  4.956
2 14.29     14.09      0.9050  5.291 3.337     2.699  4.825
3 13.84     13.94      0.8955  5.324 3.379     2.259  4.805
4 16.14     14.99      0.9034  5.658 3.562     1.355  5.175
5 14.38     14.21      0.8951  5.386 3.312     2.462  4.956
6 14.69     14.49      0.8799  5.563 3.259     3.586  5.219
Now, we can finally produce the cluster using the kmeans function. The kmeans function looks like this:
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)
The various parameters of the kmeans function are described in the following table:
| Parameter | Description |
| --- | --- |
| x | This is the dataset |
| centers | This is the number of centers to coerce the data towards |
| … | These are the additional parameters of the function |
Let's produce the cluster using the kmeans function:
> fit <- kmeans(wheat, 5)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
Unfortunately, there are some rows with missing data, so let's fix this using the following command:
> wheat <- wheat[complete.cases(wheat),]
Let's look at the data to get some idea of the factors using the following command:
> plot(wheat)
If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit (the between_SS / total_SS ratio), as shown here:
> fit <- kmeans(wheat, 5)
> fit
K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15

Cluster means:
      area perimeter compactness   length    width asymmetry   groove
1 16.45345  15.35310   0.8768000 5.882655 3.462517  3.913207 5.707655
2 18.95455  16.38879   0.8868000 6.247485 3.744697  2.723545 6.119455
3 14.10536  14.20143   0.8777750 5.480214 3.210554  2.368075 5.070000
4 11.94870  13.27000   0.8516652 5.229304 2.870101  4.910145 5.093333
5 19.58333  16.64600   0.8877267 6.315867 3.835067  5.081533 6.144400

Clustering vector:
...

Within cluster sum of squares by cluster:
[1]  48.36785  30.16164 121.63840 160.96148  25.81297
 (between_SS / total_SS =  85.4 %)
If we push to 10 clusters, the fit increases to 92 percent. A common way to choose the number of clusters is sketched below.
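Here is a minimal elbow-method sketch: plot the total within-cluster sum of squares against the number of clusters and look for the bend (nstart=10 is an assumption to stabilize the random starts):

> wss <- sapply(1:10, function(k) kmeans(wheat, k, nstart=10)$tot.withinss)
> plot(1:10, wss, type="b", xlab="Number of clusters", ylab="Total within-cluster SS")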
Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use the monthly sunspot data from Vincent Arel-Bundock's Rdatasets site. (It is not clear whether sunspots are truly random.)
Let's load our data as follows:
> sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv")
> summary(sunspots)
       X             time      sunspot.month
 Min.   :   1   Min.   :1749   Min.   :  0.00
 1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70
 Median :1589   Median :1881   Median : 42.00
 Mean   :1589   Mean   :1881   Mean   : 51.96
 3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40
 Max.   :3177   Max.   :2014   Max.   :253.80
> head(sunspots)
  X     time sunspot.month
1 1 1749.000          58.0
2 2 1749.083          62.6
3 3 1749.167          70.0
4 4 1749.250          55.7
5 5 1749.333          85.0
6 6 1749.417          83.5
We will now estimate the density using the following command:
> d <- density(sunspots$sunspot.month)
> d

Call:
        density.default(x = sunspots$sunspot.month)

Data: sunspots$sunspot.month (3177 obs.);  Bandwidth 'bw' = 7.916

       x                  y
 Min.   :-23.75   Min.   :1.810e-07
 1st Qu.: 51.58   1st Qu.:1.586e-04
 Median :126.90   Median :1.635e-03
 Mean   :126.90   Mean   :3.316e-03
 3rd Qu.:202.22   3rd Qu.:5.714e-03
 Max.   :277.55   Max.   :1.248e-02
A plot is very useful for this function, so let's generate one using the following command:
> plot(d)
It is interesting to see such a wide variation; maybe the data is pretty random after all.
We can simulate additional periods by resampling from the observed data (with a little noise) and comparing the resulting density to the original, as follows:
> N <- 1000
> sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE))
> lines(density(sunspots.new), col="blue")
The density of the simulated values tracks the original closely, so our density estimate appears to be quite accurate.
Expectation-maximization (EM) is an unsupervised clustering approach that iteratively alternates between an expectation step and a maximization step to converge on the optimal parameter values for a model.
When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments, using the mclust package. The example tries to model the iris species from the other data points. Let's load the data as shown here:
> library(mclust)
> iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
> colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species")
> modelName = "EEE"
Each observation has the sepal length, sepal width, petal length, petal width, and species, as shown here:
> head(iris)
  SepalLength SepalWidth PetalLength PetalWidth     Species
1         5.1        3.5         1.4        0.2 Iris-setosa
2         4.9        3.0         1.4        0.2 Iris-setosa
3         4.7        3.2         1.3        0.2 Iris-setosa
4         4.6        3.1         1.5        0.2 Iris-setosa
5         5.0        3.6         1.4        0.2 Iris-setosa
6         5.4        3.9         1.7        0.4 Iris-setosa
We are estimating the species from the other points, so let's separate the data as follows:
> data = iris[,-5]
> z = unmap(iris[,5])
Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name:
> msEst <- mstep(modelName, data, z)
We use the parameters defined in the mstep to produce our model, as shown here:
> em(modelName, data, msEst$parameters)
$z
              [,1]         [,2]         [,3]
  [1,] 1.000000e+00 4.304299e-22 1.699870e-42
…
[150,] 8.611281e-34 9.361398e-03 9.906386e-01

$parameters$pro
[1] 0.3333333 0.3294048 0.3372619

$parameters$mean
             [,1]     [,2]     [,3]
SepalLength 5.006 5.941844 6.574697
SepalWidth  3.418 2.761270 2.980150
PetalLength 1.464 4.257977 5.538926
PetalWidth  0.244 1.319109 2.024576

$parameters$variance$d
[1] 4

$parameters$variance$G
[1] 3

$parameters$variance$sigma
, , 1
            SepalLength SepalWidth PetalLength PetalWidth
SepalLength  0.26381739 0.09030470  0.16940062 0.03937152
SepalWidth   0.09030470 0.11251902  0.05133876 0.03082280
PetalLength  0.16940062 0.05133876  0.18624355 0.04183377
PetalWidth   0.03937152 0.03082280  0.04183377 0.03990165

, , 2
, , 3
… (there was little difference in the three sigma values)

$parameters$variance$Sigma
            SepalLength SepalWidth PetalLength PetalWidth
SepalLength  0.26381739 0.09030470  0.16940062 0.03937152
SepalWidth   0.09030470 0.11251902  0.05133876 0.03082280
PetalLength  0.16940062 0.05133876  0.18624355 0.04183377
PetalWidth   0.03937152 0.03082280  0.04183377 0.03990165

$parameters$variance$cholSigma
            SepalLength SepalWidth PetalLength PetalWidth
SepalLength  -0.5136316 -0.1758161 -0.32980960 -0.07665323
SepalWidth    0.0000000  0.2856706 -0.02326832  0.06072001
PetalLength   0.0000000  0.0000000 -0.27735855 -0.06477412
PetalWidth    0.0000000  0.0000000  0.00000000  0.16168899

attr(,"info")
  iterations        error
4.000000e+00 1.525131e-06
There is quite a lot of output from the em function. The highlights for me were that the three sigma matrices were essentially identical and that the error reported by the function was very small. So, I think we have a very good estimation of species using just the four measurement variables.
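For comparison, the mclust package can also pick the model shape and the number of clusters automatically via BIC; a one-line sketch:

> fit <- Mclust(data)   # data is iris minus the Species column
> summary(fit)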
Hidden Markov models (HMM) work from the idea that the observed data was produced by a Markov model whose states we cannot see. The problem is to discover what that model is.
I am using the Python example from the Wikipedia page on HMMs. For an HMM, we need states (assumed to be hidden from the observer), symbols, a transition matrix between states, emission (output) probabilities, and starting probabilities.
The Python information presented is as follows:
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
We convert these for use in R's initHMM function (from the HMM package); note the byrow=TRUE arguments so that the transition and emission matrices are filled row by row:
> library(HMM)
> hmm <- initHMM(c("Rainy","Sunny"), c("walk","shop","clean"),
    startProbs=c(.6,.4),
    transProbs=matrix(c(.7,.3,.4,.6), 2, byrow=TRUE),
    emissionProbs=matrix(c(.1,.4,.5,.6,.3,.1), 2, byrow=TRUE))
> hmm
$States
[1] "Rainy" "Sunny"

$Symbols
[1] "walk"  "shop"  "clean"

$startProbs
Rainy Sunny
  0.6   0.4

$transProbs
       to
from    Rainy Sunny
  Rainy   0.7   0.3
  Sunny   0.4   0.6

$emissionProbs
       symbols
states  walk shop clean
  Rainy  0.1  0.4   0.5
  Sunny  0.6  0.3   0.1
The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows:
> future <- forward(hmm, c("walk","shop","clean"))
> future
       index
states          1         2         3
  Rainy -2.813411 -2.896792 -3.539081
  Sunny -1.427116 -3.024131 -5.387805
The result is a matrix of log probabilities. For example, at the first step it is more likely to be Sunny than Rainy when we observe walk (-1.43 versus -2.81).
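The HMM package also provides a viterbi function, which returns the single most likely sequence of hidden states for the same observations:

> viterbi(hmm, c("walk","shop","clean"))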
Blind signal separation is the process of identifying sources of signals from a mixed signal; principal component analysis (PCA) is one method of doing this. A classic example is a cocktail party where you are trying to listen to one speaker.
For this example, I am using the decathlon dataset in the FactoMineR package, as shown here:
> library(FactoMineR)
> data(decathlon)
Let's look at the data to get some idea of what is available:
> summary(decathlon)
      100m         Long.jump        Shot.put       High.jump
 Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850
 1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920
 Median :10.98   Median :7.30   Median :14.57   Median :1.950
 Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977
 3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040
 Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150
      400m        110m.hurdle        Discus        Pole.vault
 Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200
 1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500
 Median :49.40   Median :14.48   Median :44.41   Median :4.800
 Mean   :49.62   Mean   :14.61   Mean   :44.33   Mean   :4.762
 3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920
 Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400
    Javeline         1500m            Rank           Points
 Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313
 1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802
 Median :58.36   Median :278.1   Median :11.00   Median :8021
 Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005
 3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122
 Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   :8893
   Competition
 Decastar:13
 OlympicG:28
The output looks like performance data from a series of events at a track meet:
> head(decathlon)
         100m Long.jump Shot.put High.jump  400m 110m.hurdle Discus
SEBRLE  11.04      7.58    14.83      2.07 49.81       14.69  43.75
CLAY    10.76      7.40    14.26      1.86 49.37       14.05  50.72
KARPOV  11.02      7.30    14.77      2.04 48.37       14.09  48.95
BERNARD 11.02      7.23    14.25      1.92 48.93       14.99  40.87
YURKOV  11.34      7.09    15.19      2.10 50.42       15.31  46.26
WARNERS 11.11      7.60    14.31      1.98 48.68       14.23  41.10
        Pole.vault Javeline 1500m Rank Points Competition
SEBRLE        5.02    63.19 291.7    1   8217    Decastar
CLAY          4.92    60.15 301.5    2   8122    Decastar
KARPOV        4.92    50.31 300.2    3   8099    Decastar
BERNARD       5.32    62.77 280.1    4   8067    Decastar
YURKOV        4.72    63.44 276.4    5   8036    Decastar
WARNERS       4.92    51.77 278.1    6   8030    Decastar
Further, this is the performance of specific individuals in track meets.
We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs:
> res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T)
This produces two graphs:
The individual factors map lays out the performance of the individuals. For example, we see Karpov, who scores high on both dimensions, versus Bourguignon, who is performing badly (on the left in the following chart):
The variables factor map shows the correlation of performance between events. For example, the 400 meters time is negatively correlated with the long jump distance; since a lower time is better, athletes who did well in one tended to do well in the other. Here is the variables factor map of our data:
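Beyond the two maps, the object returned by PCA carries the underlying numbers, which is useful when the graphs get crowded; for example:

> res.pca$eig          # variance explained by each principal component
> res.pca$var$coord    # coordinates (loadings) of each event on the components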
In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised learning, we only have a possible set of predictor variables and are looking for patterns.
In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation.