





















































In this article, based on the book Building Machine Learning Projects with TensorFlow, we will start applying data-transforming operations. We will begin by finding interesting patterns in some given information, discovering groups of data, or clusters, using clustering techniques.
In this process we'll also gain two new tools: the ability to generate synthetic sample sets from a collection of representative data structures via the scikit-learn library, and the ability to plot our data and model results graphically, this time via the matplotlib library.
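As a minimal sketch of these two tools (assuming scikit-learn and matplotlib are installed; the file name and parameter values here are illustrative, not taken from the book):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a synthetic sample set: 200 two-dimensional points
# grouped around 3 centers.
X, y = make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)

# Plot the samples, colored by the cluster they were generated from.
plt.scatter(X[:, 0], X[:, 1], c=y, s=20)
plt.title("Synthetic sample set with 3 clusters")
plt.savefig("clusters.png")  # illustrative output file name
print(X.shape)
```

`make_blobs` also returns the ground-truth labels `y`, which is convenient for checking a clustering model's output against the known grouping.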
The topics we will cover in this article are as follows:
Based on how we approach the supervision of the samples, we can distinguish three types of learning:
Normally there are three sample populations: one from which the model grows, called the training set; one that is used to test the model, called the test set; and then there are the samples on which we will be doing classification.
Types of data learning based on supervision: unsupervised, semi-supervised, and supervised
One of the simplest operations that can be initially applied to an unknown dataset is to try to understand the possible grouping or common features that the dataset members have.
To do so, we could try to find representative points that summarize the parameters of the group's members. This value could be, for example, the mean or the median of all the cluster members.
This also leads to the idea of defining a notion of distance between members: all the members of a group should obviously be at short distances from each other and from the group's representative point, and at greater distances from the central points of the other groups.
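This distance notion can be sketched with a few lines of NumPy, here using the Euclidean distance (the sample and centroid values below are made up for illustration):

```python
import numpy as np

# Three 2D samples and two candidate representative points (centroids).
samples = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]])
centroids = np.array([[1.0, 1.5], [8.0, 8.0]])

# Pairwise Euclidean distances, shape (n_samples, n_centroids).
dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)

# Each sample belongs to the group whose representative point is closest.
nearest = dists.argmin(axis=1)
print(nearest)  # [0 0 1]: the first two samples fall near the first centroid
```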
In the following image, we can see the results of a typical clustering algorithm and the representation of the cluster centers:
Sample clustering algorithm output
K-means is a very well-known clustering algorithm that can be easily implemented. It is very straightforward and can lead (depending on the data layout) to a good initial understanding of the provided information.
K-means tries to divide a set of samples into K disjoint groups, or clusters, using the mean value of the members (be it 1D, 2D, and so on) as the main indicator. This point is normally called the centroid, referring to the arithmetic entity of the same name.
One important characteristic of K-means is that K must be provided beforehand, so some prior knowledge of the data is needed to avoid a non-representative result.
The criterion and goal of this method is to minimize the sum of squared distances from each cluster member to its cluster's centroid, summed over all samples in all clusters. This is also known as minimization of inertia.
Error minimization criteria for K-means
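The inertia criterion can be sketched directly in code (the function and variable names here are illustrative, not from the book):

```python
import numpy as np

def inertia(samples, centroids, labels):
    """Within-cluster sum of squared distances: the quantity K-means minimizes."""
    # For each sample, subtract the centroid of the cluster it is assigned to.
    diffs = samples - centroids[labels]
    return float((diffs ** 2).sum())

# Tiny made-up example: two samples in cluster 0, one in cluster 1.
samples = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
centroids = np.array([[0.0, 1.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
print(inertia(samples, centroids, labels))  # 1.0 + 1.0 + 0.0 = 2.0
```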
The mechanism of the K-means algorithm can be summarized in the following graphic:
Simplified flow chart of the K-means process
And this is a simplified summary of the algorithm:
Choose the K initial centroids (for example, by picking K samples at random).
Assign each sample to its nearest centroid.
Recompute each centroid as the mean of the samples assigned to it.
Repeat the assignment and update steps until a stopping condition is met.
The stopping conditions could be of various types: a maximum number of iterations is reached, the centroids stop moving (or move less than a given tolerance), or the cluster assignments no longer change.
K-means simplified graphic
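The loop described above can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (random initialization from the samples, Euclidean distance, no handling of empty clusters), not the book's code:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k initial centroids from the samples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its members.
        new_centroids = np.array(
            [X[labels == i].mean(axis=0) for i in range(k)]
        )
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Four made-up samples forming two obvious groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0], [9.1, 8.9]])
centroids, labels = kmeans(X, k=2)
print(labels)  # the two low points share one label, the two high points the other
```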
The advantages of this method are that it is very simple to implement, it scales well to large sample sets, and its iterations tend to converge quickly in practice.
But its simplicity also comes at a price (there is no silver bullet): K must be chosen in advance, the result depends on the initial choice of centroids, and the method assumes roughly convex, similarly sized clusters, so it can perform poorly on irregularly shaped data.
In this article, we gave a simple overview of some of the most basic models we can implement, while trying to be as detailed in the explanations as possible.
From now on, we are able to generate synthetic datasets, allowing us to rapidly test the adequacy of a model for different data configurations and so evaluate their advantages and shortcomings without having to load models with a greater number of unknown characteristics.