Distribution-based clustering (DC)
In this section, we will discuss distribution-based clustering and its computational challenges. An example of using Gaussian mixture models (GMMs) with Spark MLlib will be shown for a better understanding of distribution-based clustering.
Challenges in DC algorithm
A distribution-based clustering algorithm such as GMM is an expectation-maximization (EM) algorithm. To avoid overfitting, GMM usually models the dataset with a fixed number of Gaussian distributions. The distributions are initialized randomly, and their parameters are then iteratively optimized to fit the model better to the training dataset. This iterative refinement is a robust feature of GMM and helps the model converge toward a local optimum. However, because the initialization is random, multiple runs of this algorithm may produce different results.
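To make the E- and M-steps concrete, here is a minimal sketch of EM for a one-dimensional, two-component Gaussian mixture in plain Python. This is only an illustration of the idea, not the Spark MLlib implementation; the function name and the synthetic data are invented for this example.

```python
import math
import random

def em_gmm_1d(data, k=2, iters=50, seed=0):
    """Minimal EM for a one-dimensional Gaussian mixture.

    Illustrative sketch only -- not the Spark MLlib implementation.
    """
    rng = random.Random(seed)
    means = rng.sample(data, k)   # random initialization from the data
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6)  # floor the variance to avoid numerical collapse
    return means, variances, weights

# Two well-separated groups of synthetic points
gen = random.Random(42)
data = ([gen.gauss(0.0, 0.5) for _ in range(100)]
        + [gen.gauss(5.0, 0.5) for _ in range(100)])
means, variances, weights = em_gmm_1d(data, k=2, seed=0)
```

Running `em_gmm_1d` with different `seed` values starts the means at different data points, which is exactly why repeated runs can converge to different local optima.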
In other words, unlike the bisecting K-means algorithm, which performs hard clustering, GMM is optimized for soft clustering; to obtain a hard clustering from it, objects are typically assigned to the Gaussian distribution to which they most likely belong.
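The soft-to-hard conversion can be sketched as follows: for each object, pick the component with the highest posterior probability. This is a minimal pure-Python illustration, not a Spark MLlib API; the helper name and the parameter values below are hypothetical.

```python
import math

def hard_assign(x, means, variances, weights):
    """Return the index of the Gaussian with the highest posterior for x.

    Hypothetical helper for illustration; not a Spark MLlib API.
    """
    posteriors = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                  for w, m, v in zip(weights, means, variances)]
    return max(range(len(posteriors)), key=lambda j: posteriors[j])

# Two hypothetical components centered at 0 and 5
means, variances, weights = [0.0, 5.0], [0.25, 0.25], [0.5, 0.5]
```

A point near 0 is assigned to the first component and a point near 5 to the second, even though both components assign every point a nonzero posterior probability.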