Using a tf-idf model
Although we often speak of training a tf-idf model, tf-idf is really a feature extraction process, or transformation, rather than a machine learning model. Tf-idf weighting is often used as a preprocessing step for other models, such as dimensionality reduction, classification, or regression.
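To make the idea of tf-idf as a transformation concrete, here is a minimal plain-Python sketch (it is not Spark code) that turns tokenized documents into sparse tf-idf vectors. The function name `tf_idf` and the toy corpus are hypothetical. The sketch assumes raw term counts for tf and the smoothed idf log((N + 1) / (df + 1)), which is the form described in the MLlib documentation; other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map each tokenized document to a sparse {term: weight} tf-idf vector."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed variant, see the text above).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    # Term frequency is the raw count of the term within the document.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["hockey", "game", "ice"],
    ["hockey", "team", "game", "game"],
    ["space", "orbit", "launch"],
]
vectors = tf_idf(docs)
```

Nothing is predicted here: the only quantities computed from the corpus are the document frequencies, and the output is simply a reweighted representation of each document that a downstream model can consume.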
To illustrate the potential uses of tf-idf weighting, we will explore two examples. The first is using the tf-idf vectors to compute document similarity, while the second involves training a multilabel classification model with the tf-idf vectors as input features.
Document similarity with the 20 Newsgroups dataset and tf-idf features
You might recall from Chapter 5, Building a Recommendation Engine with Spark, that the similarity between two vectors can be computed using a distance metric. The closer two vectors are (that is, the lower the distance metric), the more similar they are. One measure that we used to compute similarity between movies is cosine similarity; unlike a distance, a higher cosine similarity indicates that two vectors are more alike.
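As a reminder of how cosine similarity behaves, the following is a minimal plain-Python sketch (again, not Spark code) that operates on sparse vectors stored as dictionaries. The function name `cosine_similarity`, the example weights, and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention assumed here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weights for three short documents.
hockey_1 = {"hockey": 0.8, "game": 0.6}
hockey_2 = {"hockey": 0.6, "game": 0.8}
space_1 = {"space": 1.0}

same_topic = cosine_similarity(hockey_1, hockey_2)
different_topic = cosine_similarity(hockey_1, space_1)
```

With non-negative weights such as these, the score lies between 0 and 1. The two hockey documents share terms and score close to 1, while documents with no terms in common score 0.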
Just like we did for...