Using MiniBatch k-means to handle more data
K-means is a nice method to use; however, it does not scale well to large datasets because of its computational complexity. That said, MiniBatch k-means gives us approximate solutions with much better algorithmic complexity.
Getting ready
MiniBatch k-means is a faster variant of k-means. Exact k-means is computationally very expensive; the underlying optimization problem is NP-hard.
Using MiniBatch k-means, however, we can speed up k-means by orders of magnitude. This is achieved by processing many small random subsamples, called mini-batches, instead of the full dataset at each step. Given the convergence properties of subsampling, a close approximation to regular k-means is achieved, provided the initial conditions are good.
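To make the mechanics concrete, here is a minimal, hand-rolled sketch of the mini-batch update. This is not scikit-learn's implementation, and the function name and default values are invented for illustration: each iteration draws a small random batch and nudges the nearest centroid toward each batch point with a per-centroid step size that decays as more points are seen.

import numpy as np

def minibatch_kmeans(X, n_clusters, batch_size=100, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # Plain random initialization (real implementations typically use k-means++).
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    counts = np.zeros(n_clusters)
    for _ in range(n_iter):
        # Draw a small random mini-batch instead of scanning all of X.
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # Assign each batch point to its nearest centroid.
        nearest = np.linalg.norm(
            batch[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # Move each winning centroid toward its points with a
        # step size of 1 / (points that centroid has seen so far).
        for point, c in zip(batch, nearest):
            counts[c] += 1
            centroids[c] += (point - centroids[c]) / counts[c]
    return centroids

Because each iteration touches only batch_size points rather than all of X, the per-iteration cost is independent of the dataset size, which is where the speedup comes from.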
How to do it...
- Let's do some very high-level profiling of MiniBatch clustering. First, we'll look at the overall speed difference, and then we'll look at the errors in the estimates:
import numpy as np
from sklearn.datasets import make_blobs

# One million points in three dimensions
blobs, labels = make_blobs(int(1e6), 3)

from sklearn.cluster import KMeans, MiniBatchKMeans
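One plausible way the comparison might continue (this is a sketch building on the imports and the blobs array above, and the timings will vary by machine) is to fit both estimators, compare wall-clock time, and then check how closely the two sets of centroids agree:

import time
from sklearn.metrics import pairwise_distances_argmin

kmeans = KMeans(n_clusters=3)
minibatch = MiniBatchKMeans(n_clusters=3)

start = time.time()
kmeans.fit(blobs)
print("KMeans fit time:", time.time() - start)

start = time.time()
minibatch.fit(blobs)
print("MiniBatchKMeans fit time:", time.time() - start)

# Match each k-means centroid to its nearest mini-batch centroid and
# measure how far apart the paired centroids are; small distances mean
# the two solutions roughly agree.
order = pairwise_distances_argmin(kmeans.cluster_centers_,
                                  minibatch.cluster_centers_)
errors = np.linalg.norm(
    kmeans.cluster_centers_ - minibatch.cluster_centers_[order], axis=1)
print("Centroid distances:", errors)

Note that this greedy nearest-centroid matching can, in principle, pair two k-means centroids with the same mini-batch centroid; for a quick sanity check on well-separated blobs it is good enough.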