Training a dimensionality reduction model
Dimensionality reduction models in MLlib require vectors as inputs. However, unlike clustering, which operated on an RDD[Vector], PCA and SVD computations are provided as methods on a distributed RowMatrix (this difference is largely down to syntax, as a RowMatrix is simply a wrapper around an RDD[Vector]).
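To make the syntactic difference concrete, here is a minimal sketch of our own (using a toy RDD[Vector] and the Spark shell's sc; the values are made up purely for illustration) showing the same RDD being passed directly to the clustering API and then wrapped in a RowMatrix for PCA and SVD:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// a toy RDD[Vector]; the values are made up purely for illustration
val vectors = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// clustering trains directly on the RDD[Vector] ...
val clusters = KMeans.train(vectors, 2, 10)

// ... while PCA and SVD are exposed as methods on a RowMatrix wrapping the same RDD
val mat = new RowMatrix(vectors)
val pcs = mat.computePrincipalComponents(2)
val svd = mat.computeSVD(2, computeU = true)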
Running PCA on the LFW dataset
Now that we have extracted our image pixel data into vectors, we can instantiate a new RowMatrix and use its computePrincipalComponents method, documented below, to compute the top principal components of our distributed matrix.
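As a sketch of this step, assume the scaled pixel vectors produced in the previous section are held in an RDD[Vector] that we will call pixelVectors (the variable name is only a placeholder for whatever holds your extracted vectors):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// pixelVectors: RDD[Vector], one flattened and scaled LFW image per row
val matrix = new RowMatrix(pixelVectors)
println(s"${matrix.numRows()} rows x ${matrix.numCols()} columns")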
Note
def computePrincipalComponents(k: Int): Matrix

Computes the top k principal components. Rows correspond to observations, and columns correspond to variables. The principal components are stored as a local matrix of size n-by-k. Each column corresponds to one principal component, and the columns are in descending order of component variance. The row data do not need to be "centered" first; it is not necessary for the mean of each column to be 0.

Note that this cannot be computed on matrices with more than 65535 columns.
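With the RowMatrix in place, a call along the following lines computes the principal components (k = 10 is simply an illustrative choice); the result is a local Matrix with one row per input variable, that is, per pixel, and one column per component:

val k = 10
val pc = matrix.computePrincipalComponents(k)

// the components form a local n-by-k matrix: n rows (one per pixel variable)
// and k columns, ordered by descending component variance
println(s"${pc.numRows} x ${pc.numCols}")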