Centroid-based clustering (CC)
In this section, we discuss the centroid-based clustering technique and its computational challenges. An example of using K-means with Spark MLlib is then shown to give a better understanding of centroid-based clustering.
Challenges in the CC algorithm
As discussed previously, in a centroid-based algorithm such as K-means, setting the optimal value of the number of clusters, K, is an optimization problem. This problem is NP-hard (that is, non-deterministic polynomial-time hard): no efficient exact algorithm is known for it, so the common approach is to settle for an approximate solution. Solving these optimization problems therefore adds computational burden and brings nontrivial drawbacks. Furthermore, the K-means algorithm expects the clusters to be of approximately similar size. In other words, the data points should be spread fairly evenly across the clusters to get better clustering performance.
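To make the idea of an approximate solution concrete, here is a minimal sketch in plain Python (not Spark MLlib) of Lloyd's heuristic, the usual iterative approximation behind K-means. It runs the heuristic for several candidate values of K and records the within-cluster sum of squares (WCSS) for each. The toy data, the function names, the number of restarts, and the iteration count are all assumptions chosen for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, iterations=20, seed=0):
    """One run of Lloyd's heuristic; returns (centroids, wcss).

    The result is a local optimum that depends on the random start,
    not a guaranteed global optimum.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


def make_toy_points(seed=1):
    """Hypothetical toy data: three equally sized, well-separated 2-D groups."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in centers for _ in range(30)]


points = make_toy_points()

# For each candidate K, keep the best WCSS over a few random restarts.
wcss_by_k = {
    k: min(lloyd_kmeans(points, k, seed=s)[1] for s in range(5))
    for k in range(1, 6)
}

for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

Because each run only reaches a local optimum, the sketch repeats the heuristic from several random starts and keeps the lowest WCSS. Comparing the WCSS values across K is one common, informal way to pick a reasonable number of clusters, but it does not guarantee the optimal one.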
Another major drawback of this algorithm is that it tries...