Online learning with Spark Streaming
As we have seen, Spark Streaming makes it easy to work with data streams in a way that should be familiar to us from working with RDDs. By combining Spark's stream processing primitives with the online learning capabilities of MLlib's SGD-based methods, we can create real-time machine learning models that are updated on new data in the stream as it arrives.
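To make the idea of updating a model as data arrives concrete, here is a minimal sketch in plain Scala, with no Spark involved. It assumes a least-squares linear model and applies one averaged gradient step per incoming mini-batch. The names OnlineSgdSketch, Example, dot, and update are hypothetical and chosen only for illustration; this is not how MLlib implements its optimizer.

```scala
// Illustrative only: one SGD step per mini-batch for least-squares regression.
// All names here are hypothetical; this is not MLlib's implementation.
object OnlineSgdSketch {
  type Example = (Double, Array[Double]) // (label, features)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Returns new weights after one gradient step averaged over the batch.
  def update(weights: Array[Double], batch: Seq[Example], stepSize: Double): Array[Double] =
    if (batch.isEmpty) weights
    else {
      val grad = Array.fill(weights.length)(0.0)
      for ((label, features) <- batch) {
        val error = dot(weights, features) - label
        for (j <- weights.indices) grad(j) += error * features(j)
      }
      weights.indices.map(j => weights(j) - stepSize * grad(j) / batch.size).toArray
    }

  def main(args: Array[String]): Unit = {
    // Simulated stream: every batch is generated from y = 2 * x.
    val batches = Seq.fill(200)(Seq((2.0, Array(1.0)), (4.0, Array(2.0))))
    // Fold over the batches, carrying the weights from one batch to the next.
    val learned = batches.foldLeft(Array(0.0))((w, b) => update(w, b, 0.1))
    println(learned.mkString(","))
  }
}
```

The fold over batches plays the role of the stream: the weights learned from one batch are the starting point for the next, so the model never needs to see the full dataset at once.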
Streaming regression
Spark provides a built-in streaming machine learning model in the StreamingLinearAlgorithm class. Currently, only a linear regression implementation, StreamingLinearRegressionWithSGD, is available, but future versions will include classification.
The streaming regression model provides two methods:
trainOn: This takes DStream[LabeledPoint] as its argument. This tells the model to train on every batch in the input DStream. It can be called multiple times to train on different streams.
predictOn: This takes DStream[Vector], that is, the feature vectors without their labels. This tells the model to make predictions...
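The two methods could be wired together roughly as follows. This is a sketch under several assumptions that do not come from the text: records arrive as text lines on a socket at localhost:9999, each line has the form label, a tab, then comma-separated features, and there are three features. The parsePoint helper and the StreamingRegressionSketch object are hypothetical names. Note that in the released Spark versions I am aware of, predictOn accepts DStream[Vector] rather than DStream[LabeledPoint], so the sketch maps each labeled point to its features before predicting.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingRegressionSketch {
  // Hypothetical parser for lines of the assumed form "label<TAB>f1,f2,...".
  def parsePoint(line: String): LabeledPoint = {
    val Array(label, features) = line.split("\t")
    LabeledPoint(label.toDouble, Vectors.dense(features.split(",").map(_.toDouble)))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingRegressionSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed source and dimensionality; adjust to the actual data.
    val numFeatures = 3
    val labeledStream = ssc.socketTextStream("localhost", 9999).map(parsePoint)

    // Initial weights must be set and must match the feature dimension.
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
      .setNumIterations(1)
      .setStepSize(0.01)

    // Update the model on every batch, and print predictions for each batch.
    model.trainOn(labeledStream)
    model.predictOn(labeledStream.map(_.features)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Nothing is computed until ssc.start() is called; trainOn and predictOn only register what should happen to each batch. The step size and iteration count shown are placeholder values, not tuned recommendations.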