K-means - training a clustering model
Training for K-means in Spark ML takes an approach similar to the other models -- we pass a DataFrame that contains our training data to the fit method of the KMeans object.
Note
Here we use the libsvm data format.
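To make the format concrete: each libsvm line holds a label followed by index:value pairs for the non-zero features, with one-based indices. The following is a minimal plain-Python sketch of reading one such line. The function name and the sample line are hypothetical, chosen only for illustration, and this is not how Spark itself loads the data.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices are one-based in the file; only non-zero features appear.
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: label 0, then three non-zero features.
label, features = parse_libsvm_line("0 1:0.5 3:1.25 7:-2.0")
print(label, features)
```

Because only non-zero entries are stored, the format suits sparse feature vectors.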
Training a clustering model on the MovieLens dataset
We will train one model on the movie factors and another on the user factors that we generated by running our recommendation model.
We need to pass in the number of clusters K and the maximum number of iterations for the algorithm to run. Training may stop after fewer than the maximum number of iterations if the algorithm converges first, that is, if the cluster centers move by less than the tolerance level from one iteration to the next (the default for this tolerance is 0.0001).
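How an iteration cap and a tolerance interact can be illustrated with a small plain-Python k-means loop. This is a sketch under stated assumptions, not Spark's implementation: the names kmeans, max_iter and tol are hypothetical, the toy points are made up, and the run is treated as converged once no center moves farther than tol.

```python
import math


def kmeans(points, centers, max_iter=20, tol=1e-4):
    """Lloyd-style k-means; returns (centers, iterations_run)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Stop early once no center moved farther than the tolerance.
        moved = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:
            return centers, iteration
    return centers, max_iter


# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, iterations = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, iterations)
```

On this toy data the loop stops well before the cap of 20 iterations, because the centers stop moving after the first update.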
Spark ML's K-means provides random and K-means|| initialization, with the default being K-means||. As both of these initialization methods are based on random selection to some extent, each model training run can return a different result unless a fixed random seed is set.
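The effect of randomness on initialization can be sketched in plain Python. The helper random_init and the toy points are hypothetical, and this shows only simple random initialization, not K-means||. Spark's KMeans exposes a seed parameter for the same purpose.

```python
import random


def random_init(points, k, seed):
    """Pick k distinct points as initial centers using a seeded generator."""
    return random.Random(seed).sample(points, k)


# Toy data: 100 distinct two-dimensional points.
points = [(float(i), float(i % 7)) for i in range(100)]

same_a = random_init(points, k=3, seed=42)
same_b = random_init(points, k=3, seed=42)
print(same_a == same_b)  # same seed, same starting centers
```

Fixing the seed makes the starting centers, and therefore the whole training run, repeatable. Leaving it unset lets each run start from a different place.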
K-means does not generally...