Assessing cluster correctness
We have talked a little about assessing clusters when the ground truth is not known. However, we have not yet talked about assessing k-means when the true cluster assignments are known. In many cases the ground truth isn't knowable; however, if there is outside annotation, we will sometimes know the ground truth, or at least a proxy for it.
Getting ready
So, let's assume a world where we have an outside agent supplying us with the ground truth.
We'll create a simple dataset, evaluate the measures of correctness against the ground truth in several ways, and then discuss them:
from sklearn import datasets
from sklearn import cluster
blobs, ground_truth = datasets.make_blobs(1000, centers=3, cluster_std=1.75)
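As a rough preview of what evaluating against the ground truth can look like, here is a minimal sketch. It fits k-means to the same kind of blobs and compares the predicted labels with the known ones. The `random_state` and `n_init` values and the particular choice of scores are assumptions made for illustration, not something the recipe prescribes.

```python
from sklearn import cluster, datasets, metrics

# Same kind of dataset as above; random_state is an assumption added
# here only so that the sketch is reproducible.
blobs, ground_truth = datasets.make_blobs(
    1000, centers=3, cluster_std=1.75, random_state=0
)

# Fit k-means with the known number of clusters and get a label per point.
# n_init and random_state are illustrative choices.
kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
predicted = kmeans.fit_predict(blobs)

# Compare predicted labels with the ground truth. These scores depend only
# on how points are grouped, not on which integer each cluster is given.
scores = {
    "homogeneity": metrics.homogeneity_score(ground_truth, predicted),
    "completeness": metrics.completeness_score(ground_truth, predicted),
    "v_measure": metrics.v_measure_score(ground_truth, predicted),
    "adjusted_rand": metrics.adjusted_rand_score(ground_truth, predicted),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because k-means numbers its clusters arbitrarily, a plain accuracy comparison between `predicted` and `ground_truth` would be misleading. The scores above avoid that problem by being invariant to how the labels are permuted.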
How to do it...
- Before we walk through the metrics, let's take a look at the dataset:
%matplotlib inline
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 5))
colors = ['r', 'g', 'b']
for i in range(3):
    p = blobs[ground_truth == i]
    ax.scatter(p[:,0], p[:,1], c=colors[i...