In this article by Alberto Boschetti and Luca Massaron, authors of the book Python Data Science Essentials, you will learn how to deal with big data and get an overview of Stochastic Gradient Descent (SGD).
Big data challenges data science projects from four points of view: volume (data quantity), velocity, variety, and veracity. The Scikit-learn package offers a range of classes and functions that will help you effectively work with data so big that it cannot entirely fit in the memory of your computer, no matter what its specifications may be.
Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performance of different algorithms. This will require about 1.5 gigabytes of hard disk space, which will be freed after the experiment.
As a typical example of big data analysis, we will use some textual data from the Internet, taking advantage of the fetch_20newsgroups function, which retrieves 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups:
In: import numpy as np
from sklearn.datasets import fetch_20newsgroups
newsgroups_dataset = fetch_20newsgroups(shuffle=True,
    remove=('headers', 'footers', 'quotes'), random_state=6)
print 'Posts inside the data: %s' % np.shape(newsgroups_dataset.data)
print 'Average number of words for post: %0.0f' % np.mean([len(text.split(' ')) for text in newsgroups_dataset.data])

Out: Posts inside the data: 11314
Average number of words for post: 206
To work out a generic classification example instead, we will create three synthetic datasets that contain from 100,000 up to 10 million cases. You can create and use any of them according to your computer's resources; we will always refer to the largest one for our experiments:
In: from sklearn.datasets import make_classification
X,y = make_classification(n_samples=10**5, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__5.csv', D, delimiter=",") # the saved file should be around 14.6 MB
del(D, X, y)
X,y = make_classification(n_samples=10**6, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__6.csv', D, delimiter=",") # the saved file should be around 146 MB
del(D, X, y)
X,y = make_classification(n_samples=10**7, n_features=5, n_informative=3, random_state=101)
D = np.c_[y,X]
np.savetxt('huge_dataset_10__7.csv', D, delimiter=",") # the saved file should be around 1.46 GB
del(D, X, y)
After creating and using any of the datasets, you can remove them with the following commands:
import os
os.remove('huge_dataset_10__5.csv')
os.remove('huge_dataset_10__6.csv')
os.remove('huge_dataset_10__7.csv')
The trick to managing high volumes of data without loading too many megabytes or gigabytes into memory is to incrementally update the parameters of your algorithm until all the observations have been processed at least once by the machine learner.
This is possible in Scikit-learn thanks to the .partial_fit() method, which has been made available to a certain number of supervised and unsupervised algorithms. Using the .partial_fit() method and providing some basic information (for example, for classification, you should know beforehand the number of classes to be predicted), you can immediately start fitting your model, even if you have a single or a few observations.
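As a minimal sketch of how .partial_fit() works (the tiny batches and values below are made up purely for illustration, and are not part of the book's examples), you can feed an SGDClassifier two batches in sequence, declaring the expected classes on the first call:

from sklearn.linear_model import SGDClassifier
import numpy as np

learner = SGDClassifier(random_state=101)
X_batch_1, y_batch_1 = np.array([[0.1, 0.2], [0.8, 0.9]]), np.array([0, 1])
X_batch_2, y_batch_2 = np.array([[0.2, 0.1], [0.9, 0.7]]), np.array([0, 1])
# On the first call you must declare all the classes you expect to meet,
# even if some of them are missing from the current batch
learner.partial_fit(X_batch_1, y_batch_1, classes=np.array([0, 1]))
learner.partial_fit(X_batch_2, y_batch_2)
print(learner.predict(np.array([[0.85, 0.80]])))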
This method is called incremental learning. The chunks of data that you incrementally feed into the learning algorithm are called batches. The critical points of incremental learning are as follows:
Batch size generally depends on your available memory. The principle is that the larger the data chunk, the better, since a larger data sample will be more representative of the underlying data distribution.
Also, data preprocessing is challenging. Incremental learning algorithms work well with data in the range of [-1,+1] or [0,+1] (for instance, multinomial Naive Bayes won't accept negative values). However, to scale into such a precise range, you need to know beforehand the range of each variable. Alternatively, you have to either pass over all the data once, recording the minimum and maximum values, or derive them from the first batch, trimming the following observations whenever they exceed those initial maximum and minimum values.
The number of passes can become a problem. In fact, as you pass the same examples multiple times, you help the predictive coefficients converge to an optimum solution. If you pass too many of the same observations, the algorithm will tend to overfit; that is, it will adapt too much to data repeated too many times. Some algorithms, like the SGD family, are also very sensitive to the order in which the examples are presented. Therefore, you have to either set their shuffle option (shuffle=True) or shuffle the file rows before the learning starts, keeping in mind that, for efficacy, the order of the rows proposed for learning should be random.
Validation is not easy either, especially if you have to validate against unseen chunks, validate in a progressive way, or hold out some observations from every chunk. The latter is also the best way to reserve a sample for grid search or some other optimization (see the sketch that follows this list).
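The following sketch shows that last approach, holding out some observations from every chunk; the synthetic data, chunk size, and holdout size are arbitrary choices made here for illustration:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, n_informative=3, random_state=101)
learner = SGDClassifier(random_state=101)
chunk_size, holdout = 1000, 200
holdout_scores = list()
for start in range(0, X.shape[0], chunk_size):
    X_chunk, y_chunk = X[start:start+chunk_size], y[start:start+chunk_size]
    # reserve the last observations of every chunk for validation
    learner.partial_fit(X_chunk[:-holdout], y_chunk[:-holdout], classes=np.unique(y))
    holdout_scores.append(learner.score(X_chunk[-holdout:], y_chunk[-holdout:]))
print('Mean held-out accuracy: %0.3f' % np.mean(holdout_scores))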
In our example, we entrust the SGDClassifier with a log loss (basically a logistic regression) to learn how to predict a binary outcome given 10**7 observations:
In: from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
streaming = pd.read_csv('huge_dataset_10__7.csv', header=None, chunksize=10000)
learner = SGDClassifier(loss='log')
minmax_scaler = MinMaxScaler(feature_range=(0, 1))
cumulative_accuracy = list()
for n,chunk in enumerate(streaming):
    if n == 0:
        minmax_scaler.fit(chunk.ix[:,1:].values)
    X = minmax_scaler.transform(chunk.ix[:,1:].values)
    X[X>1] = 1
    X[X<0] = 0
    y = chunk.ix[:,0]
    if n > 8:
        cumulative_accuracy.append(learner.score(X,y))
    learner.partial_fit(X,y,classes=np.unique(y))
print 'Progressive validation mean accuracy %0.3f' % np.mean(cumulative_accuracy)

Out: Progressive validation mean accuracy 0.660
First, pandas read_csv allows us to iterate over the file by reading batches of 10,000 observations (the number can be increased or decreased according to your computing resources).
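As a quick illustration of the chunked reading on its own (assuming one of the CSV files created earlier is on disk), iterating over the read_csv result yields one DataFrame per chunk:

import pandas as pd

streaming = pd.read_csv('huge_dataset_10__5.csv', header=None, chunksize=10000)
total_rows = 0
for chunk in streaming:
    total_rows += chunk.shape[0]   # each chunk is a regular DataFrame of up to 10,000 rows
print('Rows read in chunks: %i' % total_rows)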
We use the MinMaxScaler in order to record the range of each variable on the first batch. For the following batches, we apply the rule that if a value exceeds one of the limits of [0,+1], it is trimmed to the nearest limit.
Finally, starting from the 10th batch, we record the accuracy of the learning algorithm on each newly received batch before using that batch to update the training. In the end, the accumulated accuracy scores are averaged, offering a global performance estimation.
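Here is a minimal sketch of the trimming rule in isolation, with made-up numbers: the scaler is fitted on the first batch only, and values in later batches that fall outside the recorded ranges are clipped back into [0, +1] (np.clip is an equivalent alternative to the two masked assignments used in the script above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler(feature_range=(0, 1))
first_batch = np.array([[1.0, 10.0], [5.0, 20.0]])
minmax_scaler.fit(first_batch)                      # ranges are recorded from the first batch only
later_batch = np.array([[0.0, 25.0], [6.0, 15.0]])  # contains values outside the recorded ranges
X = np.clip(minmax_scaler.transform(later_batch), 0, 1)
print(X)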
There are various algorithms that work with incremental learning.
For classification, we will recall the following:
sklearn.naive_bayes.MultinomialNB
sklearn.naive_bayes.BernoulliNB
sklearn.linear_model.Perceptron
sklearn.linear_model.SGDClassifier
sklearn.linear_model.PassiveAggressiveClassifier
For regression, we will recall the following:
sklearn.linear_model.SGDRegressor
sklearn.linear_model.PassiveAggressiveRegressor
As for velocity, they are all comparable in speed. You can test this for yourself by running the following script:
In: from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from datetime import datetime
classifiers = {
    'SGDClassifier hinge loss' : SGDClassifier(loss='hinge', random_state=101),
    'SGDClassifier log loss' : SGDClassifier(loss='log', random_state=101),
    'Perceptron' : Perceptron(random_state=101),
    'BernoulliNB' : BernoulliNB(),
    'PassiveAggressiveClassifier' : PassiveAggressiveClassifier(random_state=101)
}
huge_dataset = 'huge_dataset_10__6.csv'
for algorithm in classifiers:
    start = datetime.now()
    minmax_scaler = MinMaxScaler(feature_range=(0, 1))
    streaming = pd.read_csv(huge_dataset, header=None, chunksize=100)
    learner = classifiers[algorithm]
    cumulative_accuracy = list()
    for n,chunk in enumerate(streaming):
        y = chunk.ix[:,0]
        X = chunk.ix[:,1:]
        if n > 50:
            cumulative_accuracy.append(learner.score(X,y))
        learner.partial_fit(X,y,classes=np.unique(y))
    elapsed_time = datetime.now() - start
    print algorithm + ' : mean accuracy %0.3f in %s secs' % (np.mean(cumulative_accuracy), elapsed_time.total_seconds())

Out: BernoulliNB : mean accuracy 0.734 in 41.101 secs
Perceptron : mean accuracy 0.616 in 37.479 secs
SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs
SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs
PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs
As a general note, remember that smaller batches are slower, since they imply more disk accesses from a database or a file, which is always a bottleneck.
Variety is another big data characteristic. This is especially true when we are dealing with textual data or very large categorical variables (for example, variables storing website names in programmatic advertising). As you learn from batches of examples and unfold new categories or words, each one becomes a distinct and exclusive variable, which makes the variety and unpredictability of large data streams difficult to handle. The Scikit-learn package provides you with a simple and fast way to implement the hashing trick and completely forget the problem of defining a rigid variable structure in advance.
The hashing trick uses hash functions and sparse matrices in order to save you time, resources, and hassle. Hash functions map any input they receive in a deterministic way: it doesn't matter whether you feed them numbers or strings, they will always provide you with an integer number in a certain range. Sparse matrices, instead, are arrays that record only the values that are not zero, since their default value is zero for any combination of row and column. Therefore, the hashing trick bounds every possible input, no matter whether it has been seen before or not, to a certain range or position in a corresponding sparse input matrix, whose cell is then loaded with a nonzero value.
For instance, if your input is 'Python', a hashing command like abs(hash('Python')) can transform that string into the integer number 539294296 and then assign the value of 1 to the cell at column index 539294296. The hash function is a very fast and convenient way to always obtain the same column index given the same input. Using only absolute values assures that each index corresponds to a unique column in our array (negative indexes simply start from the last column, so in Python each column of an array can be addressed by both a positive and a negative number).
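A hand-made sketch of the same idea follows; the vocabulary, the number of columns, and the use of Python's built-in hash are arbitrary illustration choices, and keep in mind that in Python 3 string hashes change between runs unless PYTHONHASHSEED is fixed:

import scipy.sparse as sps

n_columns = 2**20                      # a large, fixed number of columns
row = sps.lil_matrix((1, n_columns))   # sparse: only nonzero cells are stored
for word in ['Python', 'data', 'science']:
    row[0, abs(hash(word)) % n_columns] = 1   # mark the column the word is hashed to
print(row.nnz)                         # number of cells actually occupied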
The example that follows uses the HashingVectorizer class, a convenient class that automatically takes the documents, separates the words, and transforms them, thanks to the hashing trick, into an input matrix. The script aims at learning to predict which of the 20 distinct newsgroups a post belongs to, on the basis of the words used in the posts already present in those newsgroups:
In: import pandas as pd
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

def streaming():
    for response, item in zip(newsgroups_dataset.target, newsgroups_dataset.data):
        yield response, item

hashing_trick = HashingVectorizer(stop_words='english', norm='l2', non_negative=True)
learner = SGDClassifier(random_state=101)
texts = list()
targets = list()
for n,(target, text) in enumerate(streaming()):
    texts.append(text)
    targets.append(target)
    if n % 1000 == 0 and n > 0:
        learning_chunk = hashing_trick.transform(texts)
        if n > 1000:
            last_validation_score = learner.score(learning_chunk, targets)
        learner.partial_fit(learning_chunk, targets, classes=[k for k in range(20)])
        texts, targets = list(), list()
print 'Last validation score: %0.3f' % last_validation_score

Out: Last validation score: 0.710
At this point, no matter what text you may input, the predictive algorithm will always answer by pointing out a class. In our case, it points out a newsgroup suitable for the post to appear on. Let's try out this algorithm with a text taken from a classified ad:
In: New_text = ['A 2014 red Toyota Prius v Five with fewer than 14K miles. Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.']
text_vector = hashing_trick.transform(New_text)
print np.shape(text_vector), type(text_vector)
print 'Predicted newsgroup: %s' % newsgroups_dataset.target_names[learner.predict(text_vector)]

Out: (1, 1048576) <class 'scipy.sparse.csr.csr_matrix'>
Predicted newsgroup: rec.autos
Naturally, you may change the New_text variable and discover which newsgroup your text would most likely appear in. Note that the HashingVectorizer class has transformed the text into a csr_matrix (which is quite an efficient sparse matrix) having about 1 million columns.
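Building on the text_vector obtained above, you can verify how few of those columns are actually populated, which is why the sparse representation is so memory-friendly:

print('Columns available: %i' % text_vector.shape[1])
print('Nonzero cells actually stored: %i' % text_vector.nnz)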
We will close this article on big data with a quick overview of the SGD family, comprising SGDClassifier (for classification) and SGDRegressor (for regression).
Like other classifiers, they can be fit by using the .fit() method (passing the in-memory dataset to the learning algorithm row by row) or the previously seen batch-based .partial_fit() method. In the latter case, if you are classifying, you have to declare the predicted classes with the classes parameter, which accepts a list containing all the class codes that the learner should expect to meet during the training phase.
SGDClassifier behaves as a logistic regression when the loss parameter is set to log. It transforms into a linear SVC if the loss is set to hinge. It can also take other loss functions, or even the loss functions that work for regression.
SGDRegressor mimics a linear regression when using the squared_loss loss parameter. The huber loss instead transforms the squared loss into a linear loss beyond a certain distance epsilon (another parameter to be fixed). It can also act as a linear SVR by using the epsilon_insensitive loss function, or the slightly different squared_epsilon_insensitive (which penalizes outliers more).
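A minimal sketch of these regression variants, showing only the instantiation (note that recent scikit-learn releases spell the first loss 'squared_error' rather than 'squared_loss', and the epsilon values below are arbitrary examples):

from sklearn.linear_model import SGDRegressor

ols_like = SGDRegressor(loss='squared_loss', random_state=101)                    # plain linear regression
robust   = SGDRegressor(loss='huber', epsilon=1.35, random_state=101)             # linear loss beyond epsilon
svr_like = SGDRegressor(loss='epsilon_insensitive', epsilon=0.1, random_state=101)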
As in other situations in machine learning, the performance of the different loss functions on your data science problem cannot be estimated a priori. Anyway, please take into account that if you are doing classification and you need an estimation of class probabilities, your choice will be limited to log or modified_huber only.
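A short sketch of this limitation on made-up data follows (recent scikit-learn releases spell the logistic loss 'log_loss' instead of 'log'):

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=101)
prob_learner = SGDClassifier(loss='log', random_state=101).fit(X, y)
print(prob_learner.predict_proba(X[:3]))   # class probabilities are available with log loss
hinge_learner = SGDClassifier(loss='hinge', random_state=101).fit(X, y)
# hinge_learner.predict_proba(X[:3]) would fail: hinge loss provides no probability estimates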
Key parameters that require tuning for this algorithm to best work with your data are:
n_iter: the number of iterations over the data (called max_iter in recent scikit-learn releases)
penalty: the regularization term to use ('l2', 'l1', or 'elasticnet')
alpha: the strength of the regularization; the higher the value, the stronger the regularization
l1_ratio: the mix between L1 and L2 penalties, used only with the 'elasticnet' penalty
learning_rate: how the learning rate evolves during training ('constant', 'optimal', or 'invscaling')
epsilon: the tolerance used by the huber and epsilon-insensitive losses
shuffle: whether the training data should be reshuffled at each pass
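For instance, a fully specified learner might look like the following sketch; the actual values here are arbitrary and should be tuned on your own data, and the loss is spelled 'log_loss' in recent scikit-learn releases:

from sklearn.linear_model import SGDClassifier

learner = SGDClassifier(loss='log',             # 'log_loss' in recent scikit-learn releases
                        penalty='elasticnet',   # 'l2', 'l1', or 'elasticnet' regularization
                        alpha=0.0001,           # regularization strength
                        l1_ratio=0.15,          # L1/L2 mix, used only with 'elasticnet'
                        learning_rate='optimal',
                        shuffle=True,           # reshuffle the data at each pass
                        random_state=101)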
In this article, we introduced the essentials of machine learning for big data. We discussed how to create some big data examples, how to scale with volume, keep up with velocity, and deal with variety, and we closed with an overview of the SGD family of algorithms.