Online Clustering
Sometimes the dataset X is so large that standard algorithms become extremely slow, with memory requirements that grow proportionally with its size. In these cases, it's preferable to employ a mini-batch strategy that can learn while the data is streamed. As the number of parameters is generally very small, online clustering is quite fast and only slightly less accurate than standard algorithms working with the whole dataset.
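As a minimal sketch of this streaming strategy, the snippet below feeds a dataset to a model one chunk at a time. It assumes scikit-learn's MiniBatchKMeans class and its partial_fit method (the mini-batch K-means variant discussed in the next section); the synthetic dataset, the chunk size, and the number of clusters are illustrative choices, not values taken from the text.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a dataset too large to process at once (assumption)
X, _ = make_blobs(n_samples=2000, centers=3, cluster_std=0.5, random_state=1000)

# Number of clusters chosen to match the synthetic data (assumption)
model = MiniBatchKMeans(n_clusters=3, random_state=1000)

# Simulate a stream: the model sees one chunk of samples at a time
chunk_size = 200
for start in range(0, X.shape[0], chunk_size):
    chunk = X[start:start + chunk_size]
    model.partial_fit(chunk)  # incremental update of the centroids

# Only the centroids are stored, so the memory footprint stays small
centroids = model.cluster_centers_
labels = model.predict(X)
print(centroids.shape)
```

In a real setting, each chunk would be read from disk or from a network source instead of being sliced from an in-memory array.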
Mini-batch K-means
The first approach we are going to consider is a mini-batch version of the standard K-means algorithm. In this case, we cannot compute the centroids over all samples at once, so the main problem is to define a criterion to reassign the centroids after a partial fit. The standard process is based on a streaming average, and therefore some centroids will have a higher sample count and others a lower one. In scikit-learn, this process is fine-tuned using the reassignment_ratio parameter (whose default value is 0.01). Small values (for example, 0...