Using k-means for outlier detection
In this recipe, we'll look at both the debate around and the mechanics of using k-means for outlier detection. The technique can help isolate certain types of errors, but it should be applied with care.
Getting ready
We'll use k-means to do outlier detection on a cluster of points. It's important to note that there are many camps when it comes to outliers and outlier detection. On one hand, by removing outliers we may be discarding points that were legitimately produced by the data-generating process. On the other hand, outliers can result from measurement error or some other outside factor.
This is the most credence we'll give to the debate. The rest of this recipe is about finding outliers; we'll work under the assumption that our choice to remove outliers is justified. Outlier detection here amounts to finding the centroid of each cluster and then flagging the points that lie farthest from their centroid as potential outliers.
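As a rough illustration of the idea, the following sketch flags the points farthest from the centroid of a single cluster. It relies on the fact that, with one cluster, the k-means centroid is simply the mean of the points, so it uses only NumPy. The function name farthest_from_centroid, the n_outliers parameter, the synthetic data, and the planted point are all assumptions made for this example and are not part of the recipe itself.

```python
import numpy as np


def farthest_from_centroid(X, n_outliers=5):
    """Return the centroid, each point's distance to it, and the
    indices of the n_outliers points farthest from the centroid.

    Hypothetical helper for illustration only.
    """
    # With a single cluster, the k-means centroid is the mean of the points.
    centroid = X.mean(axis=0)
    # Euclidean distance from every point to the centroid.
    distances = np.linalg.norm(X - centroid, axis=1)
    # argsort is ascending, so reverse it to list the farthest points first.
    outlier_idx = np.argsort(distances)[::-1][:n_outliers]
    return centroid, distances, outlier_idx


# Synthetic data (an assumption): one Gaussian blob of 100 points in 2D.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
# Plant one obvious outlier so there is something to find.
X[0] = [25.0, 25.0]

centroid, distances, outlier_idx = farthest_from_centroid(X, n_outliers=5)
print("centroid:", centroid)
print("indices of the farthest points:", outlier_idx)
```

Note that the number of points to flag is a choice made by the analyst, not something the distances determine on their own. Choosing a fixed count rather than a distance threshold means some points are always flagged, even in clean data.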
How to do it...
- First, we'll generate...