Optimizing the number of centroids
When doing k-means clustering, we usually do not know the right number of clusters in advance, so estimating it is an important step. Once we know (or estimate) the number of centroids, the problem starts to look more like a classification problem, because we know substantially more about the structure of the data.
Getting ready
Evaluating the model performance for unsupervised techniques is a challenge. Consequently, sklearn has several methods for evaluating clustering when a ground truth is known, and very few for when it isn't.
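To make the distinction concrete, here is a minimal sketch that contrasts one metric that needs ground truth labels (`adjusted_rand_score`) with one that does not (`silhouette_score`). The dataset parameters, the `random_state` values, and the variable names are assumptions chosen for illustration, not part of the recipe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Assumed toy data: 500 points drawn around 3 centers (illustrative values).
X, y_true = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit k-means and keep the predicted cluster label of each point.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Requires the ground truth labels: compares two labelings of the same points.
ari = adjusted_rand_score(y_true, labels)

# Requires only the data and the predicted labels.
sil = silhouette_score(X, labels)

print(f"adjusted Rand index: {ari:.3f}")
print(f"mean silhouette:     {sil:.3f}")
```

In a real clustering problem `y_true` is not available, so only the second kind of metric can be computed.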
We'll start by fitting a model with a single cluster count and evaluating its similarity. This is mainly to show the mechanics, since measuring the similarity for one cluster count is clearly not useful in finding the ground truth number of clusters.
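The mechanics for a single cluster count might look like the sketch below. It assumes the similarity measure is the silhouette coefficient, computed per point with `silhouette_samples`; the data parameters and the choice of three clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Assumed toy data (illustrative parameters).
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit one model with one fixed cluster count.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One silhouette coefficient per point, each in the range [-1, 1].
per_point = silhouette_samples(X, kmeans.labels_)

# The mean over all points summarizes this cluster count in one number.
mean_score = per_point.mean()
print(per_point.shape, round(float(mean_score), 3))
```

The mean of the per-point values is the same quantity that `silhouette_score` reports, which makes it convenient for comparing different cluster counts later.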
How to do it...
- To get started, we'll create several blobs that can be used to simulate clusters of data:
from sklearn.datasets import make_blobs
import numpy as np
blobs, classes = make_blobs(500, centers=3...
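The call above is cut off in the source, so its remaining arguments are unknown. As a hedged, self-contained sketch, the version below fills in assumed arguments (`random_state=0` and default spread) and then scores a range of candidate cluster counts with the mean silhouette. The candidate range and the names `candidate_ks`, `scores`, and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed arguments: the original call is truncated after centers=3.
blobs, classes = make_blobs(500, centers=3, random_state=0)

# Hypothetical candidate cluster counts (silhouette needs at least 2 clusters).
candidate_ks = list(range(2, 8))

scores = []
for k in candidate_ks:
    # Fit k-means for this cluster count and score the resulting labeling.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(blobs)
    scores.append(silhouette_score(blobs, labels))

# Pick the cluster count with the highest mean silhouette.
best_k = candidate_ks[scores.index(max(scores))]

for k, s in zip(candidate_ks, scores):
    print(f"k={k}: mean silhouette = {s:.3f}")
print("highest-scoring k:", best_k)
```

The highest-scoring count is only an estimate. With overlapping blobs it need not match the number of centers used to generate the data.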