In this article by Alberto Boschetti and Luca Massaron, authors of the book Python Data Science Essentials, you will learn how to deal with big data and get an overview of Stochastic Gradient Descent (SGD).
Big data challenges data science projects from four points of view: volume (data quantity), velocity, variety, and veracity. The Scikit-learn package offers a range of classes and functions that will help you effectively work with data so big that it cannot entirely fit in the memory of your computer, no matter what its specifications may be.
Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performance of different algorithms. This will require about 1.5 gigabytes of hard disk space, which will be freed after the experiment.
As a typical example of big data analysis, we will use some textual data from the Internet, taking advantage of the fetch_20newsgroups function, which retrieves 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups:
In: import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_dataset = fetch_20newsgroups(shuffle=True,
    remove=('headers', 'footers', 'quotes'), random_state=6)
print 'Posts inside the data: %s' % np.shape(newsgroups_dataset.data)
print 'Average number of words for post: %0.0f' % np.mean([len(text.split(' ')) for text in newsgroups_dataset.data])

Out: Posts inside the data: 11314
Average number of words for post: 206
To work out a generic classification example instead, we will create three synthetic datasets that contain from 100,000 up to 10 million cases. You can create and use any of them according to your computer's resources; we will always refer to the largest one for our experiments:
In: from sklearn.datasets import make_classification
X,y = make_classification(n_samples=10**5, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__5.csv', D, delimiter=",") # the saved file should be around 14.6 MB
del(D, X, y)
X,y = make_classification(n_samples=10**6, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__6.csv', D, delimiter=",") # the saved file should be around 146 MB
del(D, X, y)
X,y = make_classification(n_samples=10**7, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__7.csv', D, delimiter=",") # the saved file should be around 1.46 GB
del(D, X, y)
After creating and using any of the datasets, you can remove them with the following commands:
import os
os.remove('huge_dataset_10__5.csv')
os.remove('huge_dataset_10__6.csv')
os.remove('huge_dataset_10__7.csv')
The trick to managing high volumes of data without loading too many megabytes or gigabytes into memory is to incrementally update the parameters of your algorithm until all the observations have been processed at least once by the machine learner.
This is possible in Scikit-learn thanks to the .partial_fit() method, which has been made available to a certain number of supervised and unsupervised algorithms. Using the .partial_fit() method and providing some basic information (for example, for classification, you should know beforehand the number of classes to be predicted), you can immediately start fitting your model, even if you have a single or a few observations.
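As a minimal sketch of how .partial_fit() works (the tiny batches and values below are made up purely for illustration, and are not part of the book's examples), you can feed an SGDClassifier two batches in sequence, declaring the expected classes on the first call:

from sklearn.linear_model import SGDClassifier
import numpy as np

learner = SGDClassifier(random_state=101)
X_batch_1, y_batch_1 = np.array([[0.1, 0.2], [0.8, 0.9]]), np.array([0, 1])
X_batch_2, y_batch_2 = np.array([[0.2, 0.1], [0.9, 0.7]]), np.array([0, 1])
# On the first call you must declare all the classes you expect to meet,
# even if some of them are missing from the current batch
learner.partial_fit(X_batch_1, y_batch_1, classes=np.array([0, 1]))
learner.partial_fit(X_batch_2, y_batch_2)
print(learner.predict(np.array([[0.85, 0.80]])))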
This method is called incremental learning. The chunks of data that you incrementally feed into the learning algorithm are called batches. The critical points of incremental learning are as follows:
Batch size generally depends on your available memory. The principle is that the larger the data chunk, the better, since a larger data sample will be more representative of the underlying data distribution.
Also, data preprocessing is challenging. Incremental learning algorithms work well with data in the range of [-1,+1] or [0,+1] (for instance, multinomial Naive Bayes won't accept negative values). However, to scale into such a precise range, you need to know beforehand the range of each variable. Alternatively, you have to either pass over all the data once, recording the minimum and maximum values, or derive them from the first batch, trimming the following observations whenever they exceed those initial maximum and minimum values.
The number of passes can become a problem. In fact, as you pass the same examples multiple times, you help the predictive coefficients converge to an optimum solution. If you pass too many of the same observations, the algorithm will tend to overfit; that is, it will adapt too much to data repeated too many times. Some algorithms, like the SGD family, are also very sensitive to the order in which the examples are presented. Therefore, you have to either set their shuffle option (shuffle=True) or shuffle the file rows before the learning starts, keeping in mind that, for efficacy, the order of the rows proposed for learning should be random.
Validation is not easy either, especially if you have to validate against unseen chunks, validate in a progressive way, or hold out some observations from every chunk. The latter is also the best way to reserve a sample for grid search or some other optimization (see the sketch that follows this list).
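The following sketch shows that last approach, holding out some observations from every chunk; the synthetic data, chunk size, and holdout size are arbitrary choices made here for illustration:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, n_informative=3, random_state=101)
learner = SGDClassifier(random_state=101)
chunk_size, holdout = 1000, 200
holdout_scores = list()
for start in range(0, X.shape[0], chunk_size):
    X_chunk, y_chunk = X[start:start+chunk_size], y[start:start+chunk_size]
    # reserve the last observations of every chunk for validation
    learner.partial_fit(X_chunk[:-holdout], y_chunk[:-holdout], classes=np.unique(y))
    holdout_scores.append(learner.score(X_chunk[-holdout:], y_chunk[-holdout:]))
print('Mean held-out accuracy: %0.3f' % np.mean(holdout_scores))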
In our example, we entrust the SGDClassifier with a log loss (basically a logistic regression) to learn how to predict a binary outcome given 10**7 observations:
In: from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
streaming = pd.read_csv('huge_dataset_10__7.csv', header=None, chunksize=10000)
learner = SGDClassifier(loss='log')
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
cumulative_accuracy = list()
for n,chunk in enumerate(streaming):
    if n == 0:
        minmax_scaler.fit(chunk.ix[:,1:].values)
    X = minmax_scaler.transform(chunk.ix[:,1:].values)
    X[X>1] = 1
    X[X<0] = 0
    y = chunk.ix[:,0]
    if n > 8:
        cumulative_accuracy.append(learner.score(X,y))
    learner.partial_fit(X,y,classes=np.unique(y))
print 'Progressive validation mean accuracy %0.3f' % np.mean(cumulative_accuracy)

Out: Progressive validation mean accuracy 0.660
First, pandas read_csv allows us to iterate over the file by reading batches of 10,000 observations (the number can be increased or decreased according to your computing resources).
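As a quick illustration of the chunked reading on its own (assuming one of the CSV files created earlier is on disk), iterating over the read_csv result yields one DataFrame per chunk:

import pandas as pd

streaming = pd.read_csv('huge_dataset_10__5.csv', header=None, chunksize=10000)
total_rows = 0
for chunk in streaming:
    total_rows += chunk.shape[0]   # each chunk is a regular DataFrame of up to 10,000 rows
print('Rows read in chunks: %i' % total_rows)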
We use the MinMaxScaler in order to record the range of each variable on the first batch. For the following batches, we apply the rule that if a value exceeds one of the limits of [0,+1], it is trimmed to the nearest limit.
Finally, starting from the 10th batch, we record the accuracy of the learning algorithm on each newly received batch before using that batch to update the training. In the end, the accumulated accuracy scores are averaged, offering a global performance estimation.
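Here is a minimal sketch of the trimming rule in isolation, with made-up numbers: the scaler is fitted on the first batch only, and values in later batches that fall outside the recorded ranges are clipped back into [0, +1] (np.clip is an equivalent alternative to the two masked assignments used in the script above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler(feature_range=(0, 1))
first_batch = np.array([[1.0, 10.0], [5.0, 20.0]])
minmax_scaler.fit(first_batch)                      # ranges are recorded from the first batch only
later_batch = np.array([[0.0, 25.0], [6.0, 15.0]])  # contains values outside the recorded ranges
X = np.clip(minmax_scaler.transform(later_batch), 0, 1)
print(X)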
There are various algorithms that work with incremental learning.
For classification, we will recall the following:
sklearn.naive_bayes.MultinomialNB
sklearn.naive_bayes.BernoulliNB
sklearn.linear_model.Perceptron
sklearn.linear_model.SGDClassifier
sklearn.linear_model.PassiveAggressiveClassifier
For regression, we will recall the following:
sklearn.linear_model.SGDRegressor
sklearn.linear_model.PassiveAggressiveRegressor
As for velocity, they are all comparable in speed. You can test this for yourself by running the following script:
In: from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from datetime import datetime
classifiers = {
    'SGDClassifier hinge loss' : SGDClassifier(loss='hinge', random_state=101),
    'SGDClassifier log loss' : SGDClassifier(loss='log', random_state=101),
    'Perceptron' : Perceptron(random_state=101),
    'BernoulliNB' : BernoulliNB(),
    'PassiveAggressiveClassifier' : PassiveAggressiveClassifier(random_state=101)
}
huge_dataset = 'huge_dataset_10__6.csv'
for algorithm in classifiers:
    start = datetime.now()
    minmax_scaler = MinMaxScaler(feature_range=(0, 1))
    streaming = pd.read_csv(huge_dataset, header=None, chunksize=100)
    learner = classifiers[algorithm]
    cumulative_accuracy = list()
    for n,chunk in enumerate(streaming):
        y = chunk.ix[:,0]
        X = chunk.ix[:,1:]
        if n > 50:
            cumulative_accuracy.append(learner.score(X,y))
        learner.partial_fit(X,y,classes=np.unique(y))
    elapsed_time = datetime.now() - start
    print algorithm + ' : mean accuracy %0.3f in %s secs' % (np.mean(cumulative_accuracy), elapsed_time.total_seconds())

Out: BernoulliNB : mean accuracy 0.734 in 41.101 secs
Perceptron : mean accuracy 0.616 in 37.479 secs
SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs
SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs
PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs
As a general note, remember that smaller batches are slower, since they imply more disk accesses from a database or a file, which is always a bottleneck.
Variety is another big data characteristic. This is especially true when we are dealing with textual data or very large categorical variables (for example, variables storing website names in programmatic advertising). As you learn from batches of examples and unfold new categories or words, each one becomes a distinct and exclusive variable, which makes the variety and unpredictability of large data streams difficult to handle. The Scikit-learn package provides you with a simple and fast way to implement the hashing trick and completely forget the problem of defining a rigid variable structure in advance.
The hashing trick uses hash functions and sparse matrices in order to save you time, resources, and hassle. Hash functions map any input they receive in a deterministic way: it doesn't matter whether you feed them numbers or strings, they will always provide you with an integer number in a certain range. Sparse matrices, instead, are arrays that record only the values that are not zero, since their default value is zero for any combination of row and column. Therefore, the hashing trick bounds every possible input, no matter whether it has been seen before or not, to a certain range or position in a corresponding sparse input matrix, whose cell is then loaded with a nonzero value.
For instance, if your input is 'Python', a hashing command like abs(hash('Python')) can transform that string into the integer number 539294296 and then assign the value of 1 to the cell at column index 539294296. The hash function is a very fast and convenient way to always obtain the same column index given the same input. Using only absolute values assures that each index corresponds to a unique column in our array (negative indexes simply start from the last column, so in Python each column of an array can be addressed by both a positive and a negative number).
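A hand-made sketch of the same idea follows; the vocabulary, the number of columns, and the use of Python's built-in hash are arbitrary illustration choices, and keep in mind that in Python 3 string hashes change between runs unless PYTHONHASHSEED is fixed:

import scipy.sparse as sps

n_columns = 2**20                      # a large, fixed number of columns
row = sps.lil_matrix((1, n_columns))   # sparse: only nonzero cells are stored
for word in ['Python', 'data', 'science']:
    row[0, abs(hash(word)) % n_columns] = 1   # mark the column the word is hashed to
print(row.nnz)                         # number of cells actually occupied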
The example that follows uses the HashingVectorizer class, a convenient class that automatically takes the documents, separates the words, and transforms them, thanks to the hashing trick, into an input matrix. The script aims at learning to predict which of the 20 distinct newsgroups a post belongs to, on the basis of the words used in the posts already present in those newsgroups:
In: import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

def streaming():
    for response, item in zip(newsgroups_dataset.target, newsgroups_dataset.data):
        yield response, item

hashing_trick = HashingVectorizer(stop_words='english', norm='l2', non_negative=True)
learner = SGDClassifier(random_state=101)
texts = list()
targets = list()
for n,(target, text) in enumerate(streaming()):
    texts.append(text)
    targets.append(target)
    if n % 1000 == 0 and n > 0:
        learning_chunk = hashing_trick.transform(texts)
        if n > 1000:
            last_validation_score = learner.score(learning_chunk, targets)
        learner.partial_fit(learning_chunk, targets, classes=[k for k in range(20)])
        texts, targets = list(), list()
print 'Last validation score: %0.3f' % last_validation_score

Out: Last validation score: 0.710
At this point, no matter what text you may input, the predictive algorithm will always answer by pointing out a class. In our case, it points out a newsgroup suitable for the post to appear on. Let's try out this algorithm with a text taken from a classified ad:
In: New_text = ['A 2014 red Toyota Prius v Five with fewer than 14K miles. Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.']
text_vector = hashing_trick.transform(New_text)
print np.shape(text_vector), type(text_vector)
print 'Predicted newsgroup: %s' % newsgroups_dataset.target_names[learner.predict(text_vector)]

Out: (1, 1048576) <class 'scipy.sparse.csr.csr_matrix'>
Predicted newsgroup: rec.autos
Naturally, you may change the New_text variable and discover which newsgroup your text would most likely appear in. Note that the HashingVectorizer class has transformed the text into a csr_matrix (which is quite an efficient sparse matrix) having about 1 million columns.
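Building on the text_vector obtained above, you can verify how few of those columns are actually populated, which is why the sparse representation is so memory-friendly:

print('Columns available: %i' % text_vector.shape[1])
print('Nonzero cells actually stored: %i' % text_vector.nnz)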
We will close this article on big data with a quick overview of the SGD family, comprising SGDClassifier (for classification) and SGDRegressor (for regression).
Like other classifiers, they can be fit by using the .fit() method (passing the in-memory dataset to the learning algorithm row by row) or the previously seen batch-based .partial_fit() method. In the latter case, if you are classifying, you have to declare the predicted classes with the classes parameter, which accepts a list containing all the class codes that the learner should expect to meet during the training phase.
SGDClassifier behaves as a logistic regression when the loss parameter is set to log. It transforms into a linear SVC if the loss is set to hinge. It can also take other loss functions, or even the loss functions that work for regression.
SGDRegressor mimics a linear regression when using the squared_loss loss parameter. The huber loss instead transforms the squared loss into a linear loss beyond a certain distance epsilon (another parameter to be fixed). It can also act as a linear SVR by using the epsilon_insensitive loss function, or the slightly different squared_epsilon_insensitive (which penalizes outliers more).
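A minimal sketch of these regression variants, showing only the instantiation (note that recent scikit-learn releases spell the first loss 'squared_error' rather than 'squared_loss', and the epsilon values below are arbitrary examples):

from sklearn.linear_model import SGDRegressor

ols_like = SGDRegressor(loss='squared_loss', random_state=101)                    # plain linear regression
robust   = SGDRegressor(loss='huber', epsilon=1.35, random_state=101)             # linear loss beyond epsilon
svr_like = SGDRegressor(loss='epsilon_insensitive', epsilon=0.1, random_state=101)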
As in other situations in machine learning, the performance of the different loss functions on your data science problem cannot be estimated a priori. Anyway, please take into account that if you are doing classification and you need an estimation of class probabilities, your choice will be limited to log or modified_huber only.
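A short sketch of this limitation on made-up data follows (recent scikit-learn releases spell the logistic loss 'log_loss' instead of 'log'):

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=101)
prob_learner = SGDClassifier(loss='log', random_state=101).fit(X, y)
print(prob_learner.predict_proba(X[:3]))   # class probabilities are available with log loss
hinge_learner = SGDClassifier(loss='hinge', random_state=101).fit(X, y)
# hinge_learner.predict_proba(X[:3]) would fail: hinge loss provides no probability estimates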
Key parameters that require tuning for this algorithm to best work with your data are:
n_iter: the number of iterations over the data (called max_iter in recent scikit-learn releases)
penalty: the regularization term to use ('l2', 'l1', or 'elasticnet')
alpha: the strength of the regularization; the higher the value, the stronger the regularization
l1_ratio: the mix between L1 and L2 penalties, used only with the 'elasticnet' penalty
learning_rate: how the learning rate evolves during training ('constant', 'optimal', or 'invscaling')
epsilon: the tolerance used by the huber and epsilon-insensitive losses
shuffle: whether the training data should be reshuffled at each pass
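For instance, a fully specified learner might look like the following sketch; the actual values here are arbitrary and should be tuned on your own data, and the loss is spelled 'log_loss' in recent scikit-learn releases:

from sklearn.linear_model import SGDClassifier

learner = SGDClassifier(loss='log',             # 'log_loss' in recent scikit-learn releases
                        penalty='elasticnet',   # 'l2', 'l1', or 'elasticnet' regularization
                        alpha=0.0001,           # regularization strength
                        l1_ratio=0.15,          # L1/L2 mix, used only with 'elasticnet'
                        learning_rate='optimal',
                        shuffle=True,           # reshuffle the data at each pass
                        random_state=101)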
In this article, we introduced the essentials of machine learning for big data. We discussed how to create some big data examples, how to scale with volume, keep up with velocity, and deal with variety, and we closed with an overview of the SGD family of algorithms.