
Data Clustering


In this article by Rodolfo Bonnin, the author of the book Building Machine Learning Projects with TensorFlow, we will start applying data-transforming operations: finding interesting patterns in given information and discovering groups of data, or clusters, using clustering techniques.


In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to plot our data and model results graphically, this time via the matplotlib library.
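As a quick sketch of how these two tools fit together (a minimal example; the dataset parameters here are illustrative choices, not taken from the book):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a synthetic sample set: 200 two-dimensional points drawn
# around 3 centers (the counts and spread are arbitrary for illustration).
X, y = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=42)

# Plot the samples, coloring each point by the center it was drawn from.
plt.scatter(X[:, 0], X[:, 1], c=y, s=20)
plt.title('Synthetic sample set generated with make_blobs')
plt.show()
```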

The topics we will cover in this article are as follows:

  • Getting an idea of how clustering works, and comparing it to alternative existing classification techniques
  • Using scikit-learn and matplotlib to enrich the range of dataset choices, and to get professional-looking graphical representations of the data
  • Implementing the K-means clustering algorithm
  • Testing some variations of the K-means method to improve the fit and/or the convergence rate

Three types of learning from data

Based on how we approach the supervision of the samples, we can extract three types of learning:

  • Unsupervised learning: The fully unsupervised approach directly takes a number of undetermined elements and builds a classification of them, looking at the different properties that could determine their class
  • Semi-supervised learning: The semi-supervised approach has a number of known classified items and then applies techniques to discover the class of the remaining items
  • Supervised learning: In supervised learning, we start from a population of samples, which have a known type beforehand, and then build a model from it

Normally there are three sample populations: one from which the model is built, called the training set; one that is used to test the model, called the test set; and then the samples for which we will be doing classification.

Figure: Types of data learning based on supervision: unsupervised, semi-supervised, and supervised

Unsupervised data clustering

One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have.

To do so, we could try to find representative points that summarize the parameters of the group's members. This value could be, for example, the mean or the median of all the cluster members.

This also leads to the idea of defining a notion of distance between members: all the members of a group should obviously be at shorter distances from their own representative point than from the central points of the other groups.
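As a minimal sketch of this idea (the sample points below are invented for illustration), the representative point is the per-dimension mean, and each member ends up closer to its own group's centroid than to the other group's:

```python
import numpy as np

# Two small, hypothetical groups of 2-D samples
group_a = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9]])
group_b = np.array([[5.0, 5.1], [4.8, 5.3], [5.2, 4.9]])

# Representative points: the per-dimension mean of each group
centroid_a = group_a.mean(axis=0)
centroid_b = group_b.mean(axis=0)

# Every member of group A is closer to its own centroid than to the other
dist_own = np.linalg.norm(group_a - centroid_a, axis=1)
dist_other = np.linalg.norm(group_a - centroid_b, axis=1)
print(dist_own < dist_other)  # prints: [ True  True  True]
```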

In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:

Figure: Sample clustering algorithm output

K-means

K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and, depending on the data layout, can lead to a good initial understanding of the provided information.

Mechanics of K-means

K-means tries to divide a set of samples into K disjoint groups, or clusters, using as its main indicator the mean value (be it 1D, 2D, and so on) of the members. This point is normally called the centroid, after the geometric entity of the same name.

One important characteristic of K-means is that K should be provided beforehand, so some previous knowledge of the data is needed to avoid a non-representative result.
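To make the role of K concrete, here is a small sketch using scikit-learn's KMeans on synthetic blobs (choosing n_clusters=3 is an assumption that happens to match how the data was generated; with real data we would not know this in advance):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=42)

# K (n_clusters) must be chosen up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the final centroids
print(kmeans.inertia_)          # sum of squared distances to the centroids
```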


Algorithm iteration criterion

The criterion and goal of this method is to minimize the sum of squared distances from each sample to the centroid of the cluster it belongs to. This is also known as minimization of inertia.

Figure: Error minimization criterion for K-means
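Written out explicitly (a standard formulation of the K-means inertia, reconstructed here since the original figure is not reproduced), the quantity being minimized is:

$$\sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the set of samples assigned to cluster $i$ and $\mu_i$ is that cluster's centroid.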

K-means algorithm breakdown

The mechanism of the K-means algorithm can be summarized in the following graphic:

Figure: Simplified flow chart of the K-means process

And this is a simplified summary of the algorithm:

  1. We start with the unclassified samples and take K elements as the starting centroids. For the sake of brevity, there are also simplified variants of this algorithm that simply take the first K elements in the sample list.
  2. We then calculate the distances between the samples and the chosen centroids, assign each sample to the nearest one, and recompute the centroids (or other representative values) from those assignments. In the illustration, you can see the centroids moving toward more common-sense positions.
  3. After the centroids change, their displacement will cause the individual distances to change, and so the cluster memberships can change.
  4. This is when we recalculate the centroids and repeat the previous steps, as long as the stop condition isn't met.

The stopping conditions could be of various types:

  • After n iterations. Either we chose too large a number and will spend unnecessary rounds of computation, or the process may converge slowly and we will get very unconvincing results if the centroids do not yet have stable means. This stop condition can also be used as a last resort if the iterative process turns out to be very long.
  • Referring to the previous mean results, a possibly better criterion for the convergence of the iterations is to look at the changes of the centroids, be it their total displacement or the total number of cluster-membership switches. The latter is normally employed, so we will stop the process once no elements change clusters (a from-scratch sketch of this loop follows the figure below):

Figure: K-means simplified graphic
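The loop just described can be sketched from scratch in a few lines of NumPy (the function and variable names here are our own, not from the book; it uses the simplified first-K initialization and the no-membership-change stop condition, with a fixed iteration cap as a last resort):

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    # Step 1: take the first K samples as the starting centroids
    # (the simplified initialization mentioned above).
    centroids = X[:k].copy()
    labels = np.full(len(X), -1)

    for _ in range(max_iters):  # last-resort stop after max_iters rounds
        # Step 2: assign every sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)

        # Stop once no element changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Steps 3-4: recompute each centroid as the mean of its members.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)

    return centroids, labels
```

For example, on the synthetic blobs generated earlier, `centroids, labels = kmeans(X, k=3)` should recover roughly the same three centers that scikit-learn's KMeans finds.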

Pros and cons of K-means

The advantages of this method are:

  • It scales very well (most of the calculations can be run in parallel)
  • It has been used in a very large range of applications

But its simplicity also comes at a price (the no-silver-bullet rule applies):

  • It requires a priori knowledge (the number of possible clusters should be known beforehand)
  • Outlier values can pull the centroids away, as they carry the same weight as any other sample
  • Since we assume the clusters are convex and isotropic, it does not work very well with clusters that are not delimited by circle-like boundaries (as the sketch below shows)
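A brief illustration of that last point (a sketch under the assumption that two interleaved crescents stand in for "non-circle-like" clusters), using scikit-learn's make_moons:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved crescents: non-convex, non-isotropic clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# K-means separates the plane with a straight boundary between the two
# centroids, so it cuts across the crescents instead of following them.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=y_true, s=15)
axes[0].set_title('True (non-convex) clusters')
axes[1].scatter(X[:, 0], X[:, 1], c=labels, s=15)
axes[1].set_title('K-means assignment')
plt.show()
```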

Summary

In this article, we got a simple overview of some of the most basic models we can implement, while trying to be as detailed in the explanation as possible.

From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations, and so evaluate its advantages and shortcomings without having to load datasets with a great number of unknown characteristics.
