Using a dimensionality reduction model
Visualizing the outcome of a model in this way is interesting; however, the overall purpose of dimensionality reduction is to create a more compact representation of the data that still captures the important features and variability in the raw dataset. To do this, we need to use a trained model to transform our raw data by projecting it into the new, lower-dimensional space represented by the principal components.
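As a reminder of where the trained model comes from, the following is a minimal sketch of computing the principal components with MLlib. The RDD name vectors is an assumption standing in for the image vectors built during feature extraction; the names matrix and pc match the code used below.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// vectors is assumed to be an RDD[Vector] of flattened image pixel data,
// one row per LFW image, produced by the earlier feature-extraction steps
val matrix = new RowMatrix(vectors)

// "Train" the model: compute the top 10 principal components as a
// local d x 10 matrix, where d is the length of each image vector
val pc = matrix.computePrincipalComponents(10)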
Projecting data using PCA on the LFW dataset
We will illustrate this concept by projecting each LFW image into a ten-dimensional vector. This is done through a matrix multiplication of the image matrix with the matrix of principal components: each row of the image matrix is an image vector of length d, and multiplying it by the d x 10 matrix of principal components yields a ten-dimensional projection. As the image matrix is a distributed MLlib RowMatrix, Spark takes care of distributing this computation for us through the multiply function.
val projected = matrix.multiply(pc)
println(projected.numRows, projected.numCols)
The preceding code will give you the following output...
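If you want to inspect the projected vectors themselves rather than just the dimensions, a sketch like the following should work; it simply collects a small sample back to the driver:

// each row of the projected RowMatrix is a ten-dimensional Vector;
// take a couple of rows back to the driver and print them
projected.rows.take(2).foreach(println)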