Unsupervised machine learning
To keep the discussion concrete, this section covers only dimensionality reduction using principal component analysis (PCA) and topic modeling using latent Dirichlet allocation (LDA) for text clustering. Other algorithms for unsupervised learning will be discussed in Chapter 13, My Name is Bayes, Naive Bayes, with some practical examples.
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features or to compress data while maintaining its structure. Spark MLlib provides support for dimensionality reduction on the RowMatrix class. The most commonly used algorithms for reducing the dimensionality of data are PCA and singular value decomposition (SVD). However, in this section, we will discuss only PCA to keep the discussion concrete.
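As a rough illustration of the RowMatrix support mentioned above, the following minimal sketch computes the top two principal components of a tiny dataset and projects the rows onto them. The object name PcaSketch, the local master setting, and the toy values are assumptions made for illustration only; they do not come from the chapter.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the book's code.
object PcaSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("PcaSketch")
      .master("local[*]")
      .getOrCreate()

    // Toy data (made up): four observations with three features each.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(2.0, 0.0, 1.0),
      Vectors.dense(4.0, 1.0, 3.0),
      Vectors.dense(6.0, 1.0, 5.0),
      Vectors.dense(8.0, 3.0, 6.0)
    ))

    // Wrap the RDD of vectors in a distributed row-oriented matrix.
    val mat = new RowMatrix(rows)

    // Top-2 principal components, returned as a local 3 x 2 matrix
    // whose columns are the components.
    val pc: Matrix = mat.computePrincipalComponents(2)

    // Multiply each row by the component matrix to obtain a
    // 2-dimensional representation of every observation.
    val projected: RowMatrix = mat.multiply(pc)

    projected.rows.collect().foreach(println)

    spark.stop()
  }
}
```

Here the number of components (2) is chosen arbitrarily; in practice it would depend on how much of the variance in the data needs to be retained.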
PCA
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables...