scikit-learn Cookbook , Second Edition

Chapter 1. High-Performance Machine Learning – NumPy

In this chapter, we will cover the following recipes:

NumPy basics
Loading the iris dataset
Viewing the iris dataset
Viewing the iris dataset with pandas
Plotting with NumPy and matplotlib
A minimal machine learning recipe – SVM classification
Introducing cross-validation
Putting it all together
Machine learning overview – classification versus regression

NumPy basics

Data science deals in part with structured tables of data. The scikit-learn library requires input tables of two-dimensional NumPy arrays. In this section, you will learn about the numpy library.

How to do it...

We will try a few operations on NumPy arrays. NumPy arrays have a single type for all of their elements and a predefined shape. Let us look first at their shape.

The shape and dimension of NumPy arrays

Start by importing NumPy:

import numpy as np

Produce a NumPy array of 10 digits, similar to Python's range(10) method:

np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The array looks like a Python list with only one pair of brackets. This means it is of one dimension. Store the array and find out the shape:

array_1 = np.arange(10)
array_1.shape
(10L,)

The array has a data attribute, shape. The type of array_1.shape is a tuple (10L,), which has length 1, in this case. The number of dimensions is the same as the length of the tuple—a dimension of 1, in this case:

array_1.ndim      #Find number of dimensions of array_1
1

The array has 10 elements. Reshape the array by calling the reshape method:

array_1.reshape((5,2))
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

This reshapes the array into 5 x 2 data object that resembles a list of lists (a three dimensional NumPy array looks like a list of lists of lists). You did not save the changes. Save the reshaped array as follows::

array_1 = array_1.reshape((5,2))

Note that array_1 is now two-dimensional. This is expected, as its shape has two numbers and it looks like a Python list of lists:

array_1.ndim
2

NumPy broadcasting

Add 1 to every element of the array by broadcasting. Note that changes to the array are not saved:

array_1 + 1
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])

The term broadcasting refers to the smaller array being stretched or broadcast across the larger array. In the first example, the scalar 1 was stretched to a 5 x 2 shape and then added to array_1.

Create a new array_2 array. Observe what occurs when you multiply the array by itself (this is not matrix multiplication; it is element-wise multiplication of arrays):

array_2 = np.arange(10)
array_2 * array_2
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

Every element has been squared. Here, element-wise multiplication has occurred. Here is a more complicated example:

array_2 = array_2 ** 2  #Note that this is equivalent to array_2 * array_2
array_2 = array_2.reshape((5,2))
array_2
array([[ 0,  1],
       [ 4,  9],
       [16, 25],
       [36, 49],
       [64, 81]])

Change array_1 as well:

array_1 = array_1 + 1
array_1
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])

Now add array_1 and array_2 element-wise by simply placing a plus sign between the arrays:

array_1 + array_2
array([[ 1,  3],
       [ 7, 13],
       [21, 31],
       [43, 57],
       [73, 91]])

The formal broadcasting rules require that whenever you are comparing the shapes of both arrays from right to left, all the numbers have to either match or be one. The shapes 5 X 2 and 5 X 2 match for both entries from right to left. However, the shape 5 X 2 X 1 does not match 5 X 2, as the second values from the right, 2 and 5 respectively, are mismatched:

Initializing NumPy arrays and dtypes

There are several ways to initialize NumPy arrays besides np.arange:

Initialize an array of zeros with np.zeros. The np.zeros((5,2)) command creates a 5 x 2 array of zeros:

np.zeros((5,2))
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

Initialize an array of ones using np.ones. Introduce a dtype argument, set to np.int, to ensure that the ones are of NumPy integer type. Note that scikit-learn expects np.float arguments in arrays. The dtype refers to the type of every element in a NumPy array. It remains the same throughout the array. Every single element of the array below has a np.int integer type.

np.ones((5,2), dtype = np.int)
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])

Use np.empty to allocate memory for an array of a specific size and dtype, but no particular initialized values:

np.empty((5,2), dtype = np.float)
array([[  3.14724935e-316,   3.14859499e-316],
       [  3.14858945e-316,   3.14861159e-316],
       [  3.14861435e-316,   3.14861712e-316],
       [  3.14861989e-316,   3.14862265e-316],
       [  3.14862542e-316,   3.14862819e-316]])

Use np.zeros, np.ones, and np.empty to allocate memory for NumPy arrays with different initial values.

Indexing

Look up the values of the two-dimensional arrays with indexing:

array_1[0,0]   #Finds value in first row and first column.
1

View the first row:

array_1[0,:]
array([1, 2])

Then view the first column:

array_1[:,0]
array([1, 3, 5, 7, 9])

View specific values along both axes. Also view the second to the fourth rows:

array_1[2:5, :]
array([[ 5,  6],
       [ 7,  8],
       [ 9, 10]])

View the second to the fourth rows only along the first column:

array_1[2:5,0]
array([5, 7, 9])

Boolean arrays

Additionally, NumPy handles indexing with Boolean logic:

First produce a Boolean array:

array_1>5
array([[False, False],
       [False, False],
       [False,  True],
       [ True,  True],
       [ True,  True]], dtype=bool)

Place brackets around the Boolean array to filter by the Boolean array:

array_1[array_1 > 5]
array([ 6,  7,  8,  9, 10])

Arithmetic operations

Add all the elements of the array with the sum method. Go back to array_1:

array_1
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
array_1.sum()
55

Find all the sums by row:

array_1.sum(axis = 1)
array([ 3,  7, 11, 15, 19])

Find all the sums by column:

array_1.sum(axis = 0)
array([25, 30])

Find the mean of each column in a similar way. Note that the dtype of the array of averages is np.float:

array_1.mean(axis = 0)
array([ 5.,  6.])

NaN values

Scikit-learn will not accept np.nan values. Take array_3 as follows:

array_3 = np.array([np.nan, 0, 1, 2, np.nan])

Find the NaN values with a special Boolean array created by the np.isnan function:

np.isnan(array_3)
array([ True, False, False, False,  True], dtype=bool)

Filter the NaN values by negating the Boolean array with the symbol ~ and placing brackets around the expression:

array_3[~np.isnan(array_3)]
>array([ 0.,  1.,  2.])

Alternatively, set the NaN values to zero:

array_3[np.isnan(array_3)] = 0
array_3
array([ 0.,  0.,  1.,  2.,  0.])

How it works...

Data, in the present and minimal sense, is about 2D tables of numbers, which NumPy handles very well. Keep this in mind in case you forget the NumPy syntax specifics. Scikit-learn accepts only 2D NumPy arrays of real numbers with no missing np.nan values.

From experience, it tends to be best to change np.nan to some value instead of throwing away data. Personally, I like to keep track of Boolean masks and keep the data shape roughly the same, as this leads to fewer coding errors and more coding flexibility.

Viewing the iris dataset

Now that we've loaded the dataset, let's examine what is in it. The iris dataset pertains to a supervised classification problem.

How to do it...

To access the observation variables, type:

iris.data

This outputs a NumPy array:

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2], 
#...rest of output suppressed because of length

Let's examine the NumPy array:

iris.data.shape

This returns:

(150L, 4L)

This means that the data is 150 rows by 4 columns. Let's look at the first row:

iris.data[0]

array([ 5.1,  3.5,  1.4,  0.2])

The NumPy array for the first row has four numbers.

To determine what they mean, type:

iris.feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The feature or column names name the data. They are strings, and in this case, they correspond to dimensions in different types of flowers. Putting it all together, we have 150 examples of flowers with four measurements per flower in centimeters. For example, the first flower has measurements of 5.1 cm for sepal length, 3.5 cm for sepal width, 1.4 cm for petal length, and 0.2 cm for petal width. Now, let's look at the output variable in a similar manner:

iris.target

This yields an array of outputs: 0, 1, and 2. There are only three outputs. Type this:

iris.target.shape

You get a shape of:

(150L,)

This refers to an array of length 150 (150 x 1). Let's look at what the numbers refer to:

iris.target_names

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

The output of the iris.target_names variable gives the English names for the numbers in the iris.target variable. The number zero corresponds to the setosa flower, number one corresponds to the versicolor flower, and number two corresponds to the virginica flower. Look at the first row of iris.target:

iris.target[0]

This produces zero, and thus the first row of observations we examined before correspond to the setosa flower.

How it works...

In machine learning, we often deal with data tables and two-dimensional arrays corresponding to examples. In the iris set, we have 150 observations of flowers of three types. With new observations, we would like to predict which type of flower those observations correspond to. The observations in this case are measurements in centimeters. It is important to look at the data pertaining to real objects. Quoting my high school physics teacher,

"Do not forget the units!"

The iris dataset is intended to be for a supervised machine learning task because it has a target array, which is the variable we desire to predict from the observation variables. Additionally, it is a classification problem, as there are three numbers we can predict from the observations, one for each type of flower. In a classification problem, we are trying to distinguish between categories. The simplest case is binary classification. The iris dataset, with three flower categories, is a multi-class classification problem.

There's more...

With the same data, we can rephrase the problem in many ways, or formulate new problems. What if we want to determine relationships between the observations? We can define the petal width as the target variable. We can rephrase the problem as a regression problem and try to predict the target variable as a real number, not just three categories. Fundamentally, it comes down to what we intend to predict. Here, we desire to predict a type of flower.

Plotting with NumPy and matplotlib

A simple way to make visualizations with NumPy is by using the library matplotlib. Let's make some visualizations quickly.

Getting ready

Start by importing numpy and matplotlib. You can view visualizations within an IPython Notebook using the %matplotlib inline command:

import numpy as np
 import matplotlib.pyplot as plt
 %matplotlib inline

How to do it...

The main command in matplotlib, in pseudo code, is as follows:

plt.plot(numpy array, numpy array of same length)

Plot a straight line by placing two NumPy arrays of the same length:

plt.plot(np.arange(10), np.arange(10))

Plot an exponential:

plt.plot(np.arange(10), np.exp(np.arange(10)))

Place the two graphs side by side:

plt.figure()
plt.subplot(121)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(122)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

Or top to bottom:

plt.figure()
plt.subplot(211)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(212)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

The first two numbers in the subplot command refer to the grid size in the figure instantiated by plt.figure(). The grid size referred to in plt.subplot(221) is 2 x 2, the first two digits. The last digit refers to traversing the grid in reading order: left to right and then up to down.

Plot in a 2 x 2 grid traversing in reading order from one to four:

plt.figure()
plt.subplot(221)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(222)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(223)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(224)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

Finally, with real data:

from sklearn.datasets import load_iris

iris = load_iris()
data = iris.data
target = iris.target

# Resize the figure for better viewing
plt.figure(figsize=(12,5))

# First subplot
plt.subplot(121)

# Visualize the first two columns of data:
plt.scatter(data[:,0], data[:,1], c=target)

# Second subplot
plt.subplot(122)

# Visualize the last two columns of data:
plt.scatter(data[:,2], data[:,3], c=target)

The c parameter takes an array of colors—in this case, the colors 0, 1, and 2 in the iris target:

A minimal machine learning recipe – SVM classification

Machine learning is all about making predictions. To make predictions, we will:

State the problem to be solved
Choose a model to solve the problem
Train the model
Make predictions
Measure how well the model performed

Getting ready

Back to the iris example, we now store the first two features (columns) of the observations as X and the target as y, a convention in the machine learning community:

X = iris.data[:, :2]  
y = iris.target

How to do it...

First, we state the problem. We are trying to determine the flower-type category from a set of new observations. This is a classification task. The data available includes a target variable, which we have named y. This is a supervised classification problem.

Note

The task of supervised learning involves predicting values of an output variable with a model that trains using input variables and an output variable.

Next, we choose a model to solve the supervised classification. For now, we will use a support vector classifier. Because of its simplicity and interpretability, it is a commonly used algorithm (interpretable means easy to read into and understand).
To measure the performance of prediction, we will split the dataset into training and test sets. The training set refers to data we will learn from. The test set is the data we hold out and pretend not to know as we would like to measure the performance of our learning procedure. So, import a function that will split the dataset:

from sklearn.model_selection import train_test_split

Apply the function to both the observation and target data:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

The test size is 0.25 or 25% of the whole dataset. A random state of one fixes the random seed of the function so that you get the same results every time you call the function, which is important for now to reproduce the same results consistently.

Now load a regularly used estimator, a support vector machine:

from sklearn.svm import SVC

You have imported a support vector classifier from the svm module. Now create an instance of a linear SVC:

clf = SVC(kernel='linear',random_state=1)

The random state is fixed to reproduce the same results with the same code later.

The supervised models in scikit-learn implement a fit(X, y) method, which trains the model and returns the trained model. X is a subset of the observations, and each element of y corresponds to the target of each observation in X. Here, we fit a model on the training data:

clf.fit(X_train, y_train)

Now, the clf variable is the fitted, or trained, model.

The estimator also has a predict(X) method that returns predictions for several unlabeled observations, X_test, and returns the predicted values, y_pred. Note that the function does not return the estimator. It returns a set of predictions:

y_pred = clf.predict(X_test)

So far, you have done all but the last step. To examine the model performance, load a scorer from the metrics module:

from sklearn.metrics import accuracy_score

With the scorer, compare the predictions with the held-out test targets:

accuracy_score(y_test,y_pred)

0.76315789473684215

How it works...

Without knowing very much about the details of support vector machines, we have implemented a predictive model. To perform machine learning, we held out one-fourth of the data and examined how the SVC performed on that data. In the end, we obtained a number that measures accuracy, or how the model performed.

There's more...

To summarize, we will do all the steps with a different algorithm, logistic regression:

First, import LogisticRegression:

from sklearn.linear_model import LogisticRegression

Then write a program with the modeling steps:
1. Split the data into training and testing sets.
2. Fit the logistic regression model.
3. Predict using the test observations.
4. Measure the accuracy of the predictions with y_test versus y_pred:

import matplotlib.pyplot as plt
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = iris.data[:, :2]   #load the iris data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

#train the model
clf = LogisticRegression(random_state = 1)
clf.fit(X_train, y_train)

#predict with Logistic Regression
y_pred = clf.predict(X_test)

#examine the model accuracy
accuracy_score(y_test,y_pred)

0.60526315789473684

This number is lower; yet we cannot make any conclusions comparing the two models, SVC and logistic regression classification. We cannot compare them, because we were not supposed to look at the test set for our model. If we made a choice between SVC and logistic regression, the choice would be part of our model as well, so the test set cannot be involved in the choice. Cross-validation, which we will look at next, is a way to choose between models.

Introducing cross-validation

We are thankful for the iris dataset, but as you might recall, it has only 150 observations. To make the most out of the set, we will employ cross-validation. Additionally, in the last section, we wanted to compare the performance of two different classifiers, support vector classifier and logistic regression. Cross-validation will help us with this comparison issue as well.

Getting ready

Suppose we wanted to choose between the support vector classifier and the logistic regression classifier. We cannot measure their performance on the unavailable test set.

What if, instead, we:

Forgot about the test set for now?
Split the training set into two parts, one to train on and one to test the training?

Split the training set into two parts using the train_test_split function used in previous sections:

from sklearn.model_selection import train_test_split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

X_train_2 consists of 75% of the X_train data, while X_test_2 is the remaining 25%. y_train_2 is 75% of the target data, and matches the observations of X_train_2. y_test_2 is 25% of the target data present in y_train.

As you might have expected, you have to use these new splits to choose between the two models: SVC and logistic regression. Do so by writing a predictive program.

How to do it...

Start with all the imports and load the iris dataset:

from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#load the classifying models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, :2]  #load the first two features of the iris data 
y = iris.target #load the target of the iris data

#split the whole set one time
#Note random state is 7 now
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

#split the training set into parts
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, test_size=0.25, random_state=7)

Create an instance of an SVC classifier and fit it:

svc_clf = SVC(kernel = 'linear',random_state = 7)
svc_clf.fit(X_train_2, y_train_2)

Do the same for logistic regression (both lines for logistic regression are compressed into one):

lr_clf = LogisticRegression(random_state = 7).fit(X_train_2, y_train_2)

Now predict and examine the SVC and logistic regression's performance on X_test_2:

svc_pred = svc_clf.predict(X_test_2)
lr_pred = lr_clf.predict(X_test_2)

print "Accuracy of SVC:",accuracy_score(y_test_2,svc_pred)
print "Accuracy of LR:",accuracy_score(y_test_2,lr_pred)

Accuracy of SVC: 0.857142857143
Accuracy of LR: 0.714285714286

The SVC performs better, but we have not yet seen the original test data. Choose SVC over logistic regression and try it on the original test set:

print "Accuracy of SVC on original Test Set: ",accuracy_score(y_test, svc_clf.predict(X_test))

Accuracy of SVC on original Test Set:  0.684210526316

How it works...

In comparing the SVC and logistic regression classifier, you might wonder (and be a little suspicious) about a lot of scores being very different. The final test on SVC scored lower than logistic regression. To help with this situation, we can do cross-validation in scikit-learn.

Cross-validation involves splitting the training set into parts, as we did before. To match the preceding example, we split the training set into four parts, or folds. We are going to design a cross-validation iteration by taking turns with one of the four folds for testing and the other three for training. It is the same split as done before four times over with the same set, thereby rotating, in a sense, the test set:

With scikit-learn, this is relatively easy to accomplish:

We start with an import:

from sklearn.model_selection import cross_val_score

Then we produce an accuracy score on four folds:

svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
svc_scores

array([ 0.82758621,  0.85714286,  0.92857143,  0.77777778])

We can find the mean for average performance and standard deviation for a measure of spread of all scores relative to the mean:

print "Average SVC scores: ", svc_scores.mean()
print "Standard Deviation of SVC scores: ", svc_scores.std()

Average SVC scores:  0.847769567597
Standard Deviation of SVC scores:  0.0545962864696

Similarly, with the logistic regression instance, we compute four scores:

lr_scores = cross_val_score(lr_clf, X_train, y_train, cv=4)
print "Average SVC scores: ", lr_scores.mean()
print "Standard Deviation of SVC scores: ", lr_scores.std()

Average SVC scores:  0.748893906221
Standard Deviation of SVC scores:  0.0485633168699

Now we have many scores, which confirms our selection of SVC over logistic regression. Thanks to cross-validation, we used the training multiple times and had four small test sets within it to score our model.

Note that our model is a bigger model that consists of:

Training an SVM through cross-validation
Training a logistic regression through cross-validation
Choosing between SVM and logistic regression

Note

The choice at the end is part of the model.

There's more...

Despite our hard work and the elegance of the scikit-learn syntax, the score on the test set at the very end remains suspicious. The reason for this is that the test and train split are not necessarily balanced; the train and test sets do not necessarily have similar proportions of all the classes.

This is easily remedied by using a stratified test-train split:

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

By selecting the target set as the stratified argument, the target classes are balanced. This brings the SVC scores closer together.

svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
print "Average SVC scores: " , svc_scores.mean()
print "Standard Deviation of SVC scores: ", svc_scores.std()
print "Score on Final Test Set:", accuracy_score(y_test, svc_clf.predict(X_test))

Average SVC scores:  0.831547619048
Standard Deviation of SVC scores:  0.0792488953372
Score on Final Test Set: 0.789473684211

Additionally, note that in the preceding example, the cross-validation procedure produces stratified folds by default:

from sklearn.model_selection import cross_val_score
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv = 4)

The preceding code is equivalent to:

from sklearn.model_selection import cross_val_score, StratifiedKFold
skf = StratifiedKFold(n_splits = 4)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv = skf)

Putting it all together

Now, we are going to perform the same procedure as before, except that we will reset, regroup, and try a new algorithm: K-Nearest Neighbors (KNN).

How to do it...

Start by importing the model from sklearn, followed by a balanced split:

from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 0)

Note

The random_state parameter fixes the random_seed in the function train_test_split. In the preceding example, the random_state is set to zero and can be set to any integer.

Construct two different KNN models by varying the n_neighbors parameter. Observe that the number of folds is now 10. Tenfold cross-validation is common in the machine learning community, particularly in data science competitions:

from sklearn.model_selection import cross_val_score
knn_3_clf = KNeighborsClassifier(n_neighbors = 3)
knn_5_clf = KNeighborsClassifier(n_neighbors = 5)

knn_3_scores = cross_val_score(knn_3_clf, X_train, y_train, cv=10)
knn_5_scores = cross_val_score(knn_5_clf, X_train, y_train, cv=10)

Score and print out the scores for selection:

print "knn_3 mean scores: ", knn_3_scores.mean(), "knn_3 std: ",knn_3_scores.std()
print "knn_5 mean scores: ", knn_5_scores.mean(), " knn_5 std: ",knn_5_scores.std()

knn_3 mean scores:  0.798333333333 knn_3 std:  0.0908142181722
knn_5 mean scores:  0.806666666667 knn_5 std:  0.0559320575496

Both nearest neighbor types score similarly, yet the KNN with parameter n_neighbors = 5 is a bit more stable. This is an example of hyperparameter optimization which we will examine closely throughout the book.

There's more...

You could have just as easily run a simple loop to score the function more quickly:

all_scores = []
for n_neighbors in range(3,9,1):
    knn_clf = KNeighborsClassifier(n_neighbors = n_neighbors)
    all_scores.append((n_neighbors, cross_val_score(knn_clf, X_train, y_train, cv=10).mean()))
sorted(all_scores, key = lambda x:x[1], reverse = True)

Its output suggests that n_neighbors = 4 is a good choice:

[(4, 0.85111111111111115),
 (7, 0.82611111111111113),
 (6, 0.82333333333333347),
 (5, 0.80666666666666664),
 (3, 0.79833333333333334),
 (8, 0.79833333333333334)]

Machine learning overview – classification versus regression

In this recipe we will examine how regression can be viewed as being very similar to classification. This is done by reconsidering the categorical labels of regression as real numbers. In this section we will also look at at several aspects of machine learning from a very broad perspective including the purpose of scikit-learn. scikit-learn allows us to find models that work well incredibly quickly. We do not have to work out all the details of the model, or optimize, until we found one that works well. Consequently, your company saves precious development time and computational resources thanks to scikit-learn giving us the ability to develop models relatively quickly.

The purpose of scikit-learn

As we have seen before, scikit-learn allowed us to find a model that works fairly quickly. We tried SVC, logistic regression, and a few KNN classifiers. Through cross-validation, we selected models that performed better than others. In industry, after trying SVMs and logistic regression, we might focus on SVMs and optimize them further. Thanks to scikit-learn, we saved a lot of time and resources, including mental energy. After optimizing the SVM at work on a realistic dataset, we might re-implement it for speed in Java or C and gather more data.

Supervised versus unsupervised

Classification and regression are supervised, as we know the target variables for the observations. Clustering—creating regions in space for each category without being given any labels is unsupervised learning.

Getting ready

In classification, the target variable is one of several categories, and there must be more than one instance of every category. In regression, there can be only one instance of every target variable, as the only requirement is that the target is a real number.

In the case of logistic regression, we saw previously that the algorithm first performs a regression and estimates a real number for the target. Then the target class is estimated by using thresholds. In scikit-learn, there are predict_proba methods that yield probabilistic estimates, which relate regression-like real number estimates with classification classes in the style of logistic regression.

Any regression can be turned into classification by using thresholds. A binary classification can be viewed as a regression problem by using a regressor. The target variables produced will be real numbers, not the original class variables.

How to do it...

Quick SVC – a classifier and regressor

Load iris from the datasets module:

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

For simplicity, consider only targets 0 and 1, corresponding to Setosa and Versicolor. Use the Boolean array iris.target < 2 to filter out target 2. Place it within brackets to use it as a filter in defining the observation set X and the target set y:

X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]

Now import train_test_split and apply it:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state= 7)

Prepare and run an SVC by importing it and scoring it with cross-validation:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc_clf = SVC(kernel = 'linear').fit(X_train, y_train)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)

As done in previous sections, view the average of the scores:

svc_scores.mean()

0.94795321637426899

Perform the same with support vector regression by importing SVR from sklearn.svm, the same module that contains SVC:

from sklearn.svm import SVR

Then write the necessary syntax to fit the model. It is almost identical to the syntax for SVC, just replacing some c keywords with r:

svr_clf = SVR(kernel = 'linear').fit(X_train, y_train)

Making a scorer

To make a scorer, you need:

A scoring function that compares y_test, the ground truth, with y_pred, the predictions
To determine whether a high score is good or bad

Before passing the SVR regressor to the cross-validation, make a scorer by supplying two elements:

In practice, begin by importing the make_scorer function:

from sklearn.metrics import make_scorer

Use this sample scoring function:

#Only works for this iris example with targets 0 and 1
def for_scorer(y_test, orig_y_pred):
    y_pred = np.rint(orig_y_pred).astype(np.int)   #rounds prediction to the nearest integer
    return accuracy_score(y_test, y_pred)

The np.rint function rounds off the prediction to the nearest integer, hopefully one of the targets, 0 or 1. The astype method changes the type of the prediction to integer type, as the original target is in integer type and consistency is preferred with regard to types. After the rounding occurs, the scoring function uses the old accuracy_score function, which you are familiar with.

Now, determine whether a higher score is better. Higher accuracy is better, so for this situation, a higher score is better. In scikit code:

svr_to_class_scorer = make_scorer(for_scorer, greater_is_better=True)

Finally, run the cross-validation with a new parameter, the scoring parameter:

svr_scores = cross_val_score(svr_clf, X_train, y_train, cv=4, scoring = svr_to_class_scorer)

Find the mean:

svr_scores.mean()

0.94663742690058483

The accuracy scores are similar for the SVR regressor-based classifier and the traditional SVC classifier.

How it works...

You might ask, why did we take out class 2 out of the target set?

The reason is that, to use a regressor, our intent has to be to predict a real number. The categories had to have real number properties: that they are ordered (informally, if we have three ordered categories x, y, z and x < y and y < z then x < z). By eliminating the third category, the remaining flowers (Setosa and Versicolor) became ordered by a property we invented: Setosaness or Versicolorness.

The next time you encounter categories, you can consider whether they can be ordered. For example, if the dataset consists of shoe sizes, they can be ordered and a regressor can be applied, even though no one has a shoe size of 12.125.

There's more...

Linear versus nonlinear

Linear algorithms involve lines or hyperplanes. Hyperplanes are flat surfaces in any n-dimensional space. They tend to be easy to understand and explain, as they involve ratios (with an offset). Some functions that consistently and monotonically increase or decrease can be mapped to a linear function with a transformation. For example, exponential growth can be mapped to a line with the log transformation.

Nonlinear algorithms tend to be tougher to explain to colleagues and investors, yet ensembles of decision trees that are nonlinear tend to perform very well. KNN, which we examined earlier, is nonlinear. In some cases, functions not increasing or decreasing in a familiar manner are acceptable for the sake of accuracy.

Try a simple SVC with a polynomial kernel, as follows:

from sklearn.svm import SVC   #Usual import of SVC
svc_poly_clf = SVC(kernel = 'poly', degree= 3).fit(X_train, y_train)  #Polynomial Kernel of Degree 3

The polynomial kernel of degree 3 looks like a cubic curve in two dimensions. It leads to a slightly better fit, but note that it can be harder to explain to others than a linear kernel with consistent behavior throughout all of the Euclidean space:

svc_poly_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
svc_poly_scores.mean()

0.95906432748538006

Black box versus not

For the sake of efficiency, we did not examine the classification algorithms used very closely. When we compared SVC and logistic regression, we chose SVMs. At that point, both algorithms were black boxes, as we did not know any internal details. Once we decided to focus on SVMs, we could proceed to compute coefficients of the separating hyperplanes involved, optimize the hyperparameters of the SVM, use the SVM for big data, and do other processes. The SVMs have earned our time investment because of their superior performance.

Interpretability

Some machine learning algorithms are easier to understand than others. These are usually easier to explain to others as well. For example, linear regression is well known and easy to understand and explain to potential investors of your company. SVMs are more difficult to entirely understand.

My general advice: if SVMs are highly effective for a particular dataset, try to increase your personal interpretability of SVMs in the particular problem context. Also, consider merging algorithms somehow, using linear regression as an input to SVMs, for example. This way, you have the best of both worlds.

Note

This is really context-specific, however. Linear SVMs are relatively simple to visualize and understand. Merging linear regression with SVM could complicate things. You can start by comparing them side by side.

However, if you cannot understand every detail of the math and practice of SVMs, be kind to yourself, as machine learning is focused more on prediction performance rather than traditional statistics.

A pipeline

In programming, a pipeline is a set of procedures connected in series, one after the other, where the output of one process is the input to the next:

You can replace any procedure in the process with a different one, perhaps better in some way, without compromising the whole system. For the model in the middle step, you can use an SVC or logistic regression:

One can also keep track of the classifier itself and build a flow diagram from the classifier. Here is a pipeline keeping track of the SVC classifier:

In the upcoming chapters, we will see how scikit-learn uses the intuitive notion of a pipeline. So far, we have used a simple one: train, predict, test.

Rex Jones May 03, 2018

Excellent reference for basic modeling in Python using scikit. Julian Avila has created a great reference that I use it in my class. It's easy to incorporate reading assignments from this book.

Amazon Verified review

Jose Luis Ramirez Nov 28, 2017

I've been working with python in my AI projects and I came up upon this great book of recipes that I can use quickly and practically in every stage of my developments. It is ease to use right away and it has reference to the enough amount of theory so you don't have to go to search around for extra info. The book introduces neural networks in a simple way and it has robust OOP for more complex AI projects.

Miss S Betts Apr 22, 2020

Purchase book.Open in cloud reader and start flicking through - and the reader starts displaying the word in a single column down the centre of the screen. Can't find settings to stop it.