Creating a simple pipeline
Spark provides pipeline APIs under Spark ML. A pipeline is a sequence of stages, each of which is one of two basic types: a transformer or an estimator.
- A transformer takes a dataset as input and produces an augmented dataset, so that the output can be fed to the next stage. For example, Tokenizer and HashingTF are two transformers. Tokenizer transforms a dataset with text into a dataset with tokenized words. HashingTF, on the other hand, produces term frequencies. The concepts of tokenization and term frequency are commonly used in text mining and text analytics.
- An estimator, by contrast, must first be fit on the input dataset to produce a model. The fitted model is itself a transformer, used to transform the input dataset into the output dataset. For example, logistic regression or linear regression can be used as an estimator after fitting the training dataset with its corresponding labels and features.
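The stages above can be chained into a single pipeline. The following is a minimal sketch in Scala; the toy dataset, column names, and application name are illustrative assumptions, not part of the original text:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("SimplePipeline")   // illustrative app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical labeled text data: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformer: splits each text field into words
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Transformer: maps word sequences to term-frequency vectors
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Estimator: fit() produces a LogisticRegressionModel,
// which is itself a transformer
val lr = new LogisticRegression().setMaxIter(10)

// Chain the stages and fit the whole pipeline in one call
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
```

Calling `model.transform(...)` on a new dataset then runs every stage in order: tokenization, hashing, and finally prediction by the fitted logistic regression model.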