
Clustering Model with Spark


In this article by Manpreet Singh Ghotra and Rajdeep Dua, coauthors of the book Machine Learning with Spark, Second Edition, we will analyze the case where we do not have labeled data available. Supervised learning methods are those where the training data is labeled with the true outcome that we would like to predict (for example, a rating for recommendation, a class assignment for classification, or a real-valued target variable for regression).

In unsupervised learning, the model is not supervised with the true target label. The unsupervised case is very common in practice, since obtaining labeled training data can be very difficult or expensive in many real-world scenarios (for example, having humans label training data with class labels for classification). However, we would still like to learn some underlying structure in the data and use this structure to make predictions.

This is where unsupervised learning approaches can be useful. Unsupervised learning models are also often combined with supervised models, for example, applying unsupervised techniques to create new input features for supervised models.

Clustering models are, in many ways, the unsupervised equivalent of classification models. With classification, we try to learn a model that predicts which class a given training example belongs to. The model is essentially a mapping from a set of features to the class.

In clustering, we would like to segment the data in such a way that each training example is assigned to a segment called a cluster. The clusters act much like classes, except that the true class assignments are unknown.

Clustering models have many use cases that are the same as classification; these include the following:

  • Segmenting users or customers into different groups based on behavior characteristics and metadata
  • Grouping content on a website or products in a retail business
  • Finding clusters of similar genes
  • Segmenting communities in ecology
  • Creating image segments for use in image analysis applications such as object detection

Types of clustering models

There are many different forms of clustering models available, ranging from simple to extremely complex ones. The Spark MLlib library currently provides K-means clustering, which is among the simplest approaches available. However, it is often very effective, its simplicity makes it relatively easy to understand, and it is scalable.

K-means clustering

K-means attempts to partition a set of data points into K distinct clusters (where K is an input parameter for the model).

More formally, K-means tries to find clusters so as to minimize the sum of squared errors (or distances) within each cluster. This objective function is known as the within-cluster sum of squared errors (WCSS).

\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

That is, it is the sum, over each of the K clusters C_k, of the squared distances between each point x_i in the cluster and the cluster center \mu_k.

Starting with a set of K initial cluster centers, the standard method for K-means iterates between two steps (a cluster center is the mean vector of all data points assigned to that cluster):

  1. Assign each data point to the cluster that minimizes the WCSS. The sum of squares is equivalent to the squared Euclidean distance; therefore, this equates to assigning each point to the closest cluster center as measured by the Euclidean distance metric.
  2. Compute the new cluster centers based on the cluster assignments from the first step.

The algorithm proceeds until either a maximum number of iterations has been reached or convergence has been achieved. Convergence means that the cluster assignments no longer change during the first step; therefore, the value of the WCSS objective function does not change either.
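
To make these two steps concrete, here is a minimal sketch of the iteration in plain Scala (not Spark). The point representation, the random seeding of the centers, and all names are our own illustrative choices rather than MLlib code:

    import scala.util.Random

    object KMeansSketch {
      type Point = Array[Double]

      def squaredDistance(a: Point, b: Point): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Step 1: assign each point to its closest center
      // (closest in squared Euclidean distance).
      def assign(points: Seq[Point], centers: Seq[Point]): Seq[Int] =
        points.map(p => centers.indices.minBy(i => squaredDistance(p, centers(i))))

      // Step 2: recompute each center as the mean vector of the points
      // currently assigned to its cluster.
      def updateCenters(points: Seq[Point], assignments: Seq[Int], k: Int): Seq[Point] =
        (0 until k).map { i =>
          val members = points.zip(assignments).collect { case (p, a) if a == i => p }
          if (members.isEmpty) points(Random.nextInt(points.length)) // re-seed an empty cluster
          else members.transpose.map(dim => dim.sum / members.length).toArray
        }

      def run(points: Seq[Point], k: Int, maxIterations: Int): (Seq[Point], Seq[Int]) = {
        var centers: Seq[Point] = Random.shuffle(points).take(k)
        var assignments = assign(points, centers)
        var converged = false
        var iteration = 0
        while (iteration < maxIterations && !converged) {
          centers = updateCenters(points, assignments, k)
          val next = assign(points, centers)
          converged = next == assignments // stable assignments imply a stable WCSS
          assignments = next
          iteration += 1
        }
        (centers, assignments)
      }
    }

The loop terminates exactly as described above: either the assignments (and hence the WCSS) stop changing, or the iteration budget is exhausted.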

For more details, refer to Spark's documentation on clustering at http://spark.apache.org/docs/latest/mllib-clustering.html or refer to http://en.wikipedia.org/wiki/K-means_clustering.
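
As a brief illustration of the MLlib API documented at the first link, the following sketch trains a K-means model on an RDD of feature vectors. It assumes a spark-shell session (so a SparkContext named sc already exists) and a hypothetical whitespace-delimited file of numeric features at data/points.txt:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line of the (hypothetical) input file into a dense feature vector.
    val data = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Train K-means with K = 5 clusters and at most 20 iterations.
    val model = KMeans.train(data, 5, 20)

    // computeCost returns the WCSS objective that K-means minimizes.
    println(s"WCSS = ${model.computeCost(data)}")

    // Predict the cluster index for a new two-dimensional point.
    val cluster = model.predict(Vectors.dense(0.5, 1.5))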

To illustrate the basics of K-means, we will use a simple dataset. We have five classes, which are shown in the following figure:

Figure: Multiclass dataset

However, assume that we don't actually know the true classes. If we use K-means with five clusters, then after the first iteration, the model's cluster assignments might look like this:

Figure: Cluster assignments after the first K-means iteration

We can see that K-means has already picked out the centers of each cluster fairly well. After the next iteration, the assignments might look like those shown in the following figure:

Figure: Cluster assignments after the second K-means iteration

Things are starting to stabilize, but the overall cluster assignments are broadly the same as they were after the first iteration. Once the model has converged, the final assignments could look like this:

Figure: Final cluster assignments for K-means

As we can see, the model has done a decent job of separating the five clusters. The leftmost three clusters are fairly accurate (with a few incorrectly assigned points). However, the two clusters in the bottom-right corner are less accurate.

This illustrates the following:

  • The iterative nature of K-means
  • The model's dependency on the method used to initially select the cluster centers (here, a random approach was used)
  • How the final cluster assignments can be very good for well-separated data but can be poor for data that is more difficult

Initialization methods

The standard initialization method for K-means, usually simply referred to as the random method, starts by randomly assigning each data point to a cluster before proceeding with the first update step.

Spark MLlib provides a parallel variant of the K-means++ initialization method, called K-means|| (K-means parallel), which is the default initialization method used.

Refer to http://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods and http://en.wikipedia.org/wiki/K-means%2B%2B for more information.
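
With MLlib's builder-style API, the initialization mode can be set explicitly, which makes it easy to compare the random method against the parallel variant. A minimal sketch, assuming the same hypothetical data RDD of vectors as before:

    import org.apache.spark.mllib.clustering.KMeans

    // Random initialization, as described above.
    val randomModel = new KMeans()
      .setK(5)
      .setMaxIterations(20)
      .setInitializationMode(KMeans.RANDOM)
      .run(data)

    // K-means|| (parallel K-means++), which is MLlib's default mode.
    val parallelModel = new KMeans()
      .setK(5)
      .setMaxIterations(20)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL)
      .run(data)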

The results of using K-means++ are shown here. Note that this time, the difficult bottom-right points have been mostly correctly clustered.

Figure: Final cluster assignments for K-means++

Variants

There are many other variants of K-means, focusing either on initialization methods or on the core model. One of the more common variants is fuzzy K-means. This model does not assign each point to exactly one cluster as K-means does (a so-called hard assignment). Instead, it is a soft version of K-means, where each point can belong to many clusters. So, for K clusters, each point is represented as a K-dimensional membership vector, with each entry indicating the proportion of the point's membership in each cluster.
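
MLlib does not ship a fuzzy K-means implementation, but the membership idea can be sketched directly. The following illustrative function (all names are our own) computes the K-dimensional membership vector for a single point using the standard fuzzy c-means weighting, controlled by a fuzziness parameter m > 1:

    type Point = Array[Double]

    def distance(a: Point, b: Point): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    // Membership of `point` in each of the K clusters, following the fuzzy
    // c-means rule: u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1)).
    def memberships(point: Point, centers: Seq[Point], m: Double = 2.0): Seq[Double] = {
      val dists = centers.map(c => distance(point, c))
      dists.indexOf(0.0) match {
        case -1 =>
          dists.map(dk => 1.0 / dists.map(dj => math.pow(dk / dj, 2.0 / (m - 1.0))).sum)
        case hit => // the point coincides with a center: full membership there
          dists.indices.map(i => if (i == hit) 1.0 else 0.0)
      }
    }

The resulting entries are non-negative and sum to one, so each vector can be read as the point's membership proportions across the K clusters.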

Mixture models

A mixture model is essentially an extension of the idea behind fuzzy K-means; however, it makes the assumption that there is an underlying probability distribution that generates the data. For example, we might assume that the data points are drawn from a set of K independent Gaussian (normal) probability distributions. The cluster assignments are also soft, so each point is represented by K membership weights, one for each of the K underlying probability distributions.

Refer to http://en.wikipedia.org/wiki/Mixture_model for further details and for a mathematical treatment of mixture models.
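
Spark MLlib includes a Gaussian mixture model, fit with expectation-maximization, that exposes exactly these soft assignments. A brief sketch, again assuming the hypothetical data RDD of vectors from the earlier examples:

    import org.apache.spark.mllib.clustering.GaussianMixture

    // Fit a mixture of K = 5 Gaussians to the data using EM.
    val gmm = new GaussianMixture().setK(5).run(data)

    // The mixing weights and the parameters of each component distribution.
    gmm.weights.zip(gmm.gaussians).foreach { case (w, g) =>
      println(s"weight=$w mean=${g.mu} covariance=${g.sigma}")
    }

    // Soft assignments: one membership weight per component for every point.
    val soft = gmm.predictSoft(data) // RDD[Array[Double]]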

Hierarchical clustering

Hierarchical clustering is a structured clustering approach that results in a multilevel hierarchy of clusters, where each cluster may contain many subclusters (or child clusters). Each child cluster is thus linked to its parent cluster. This form of clustering is also often called tree clustering.

Agglomerative clustering is a bottom-up approach that proceeds as follows:

  • Each data point begins in its own cluster
  • The similarity (or distance) between each pair of clusters is evaluated
  • The most similar pair of clusters is found; this pair is then merged to form a new cluster
  • The process is repeated until only one top-level cluster remains

Divisive clustering is a top-down approach that works in reverse, starting with one cluster, and at each stage, splitting a cluster into two, until all data points are allocated to their own bottom-level cluster.
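
Spark MLlib offers bisecting K-means, a divisive (top-down) approach of exactly this kind: it starts from a single all-encompassing cluster and repeatedly splits clusters with K-means until the requested number of leaf clusters is reached. A sketch, once more assuming the hypothetical data RDD:

    import org.apache.spark.mllib.clustering.BisectingKMeans

    // Top-down (divisive) hierarchical clustering with 5 leaf clusters.
    val model = new BisectingKMeans().setK(5).run(data)

    println(s"Compute cost: ${model.computeCost(data)}")
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"Cluster $i center: $center")
    }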

You can find more information at http://en.wikipedia.org/wiki/Hierarchical_clustering.

Summary

In this article, we explored a new class of models that learn structure from unlabeled data: unsupervised learning. You learned about various clustering models, such as K-means, mixture models, and hierarchical clustering. We also used a simple dataset to illustrate the basics of K-means.
