Probabilistic clustering with Gaussian mixture models
In k-means, we assume that the variances of the clusters are equal. This leads to a subdivision of space that determines how points are assigned to clusters; but what about a situation where the variances are not equal and each point has a probabilistic association with each cluster?
Getting ready
There's a more probabilistic way of looking at k-means clustering. Hard k-means clustering is the same as applying a Gaussian mixture model with a covariance matrix, S, that can be factored as the variance times the identity matrix, S = σ²I. This covariance structure is the same for every cluster, and it leads to spherical clusters. However, if we allow S to vary, a GMM can be estimated and used for prediction. We'll look at how this works in a univariate sense, and then expand to more dimensions.
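To make that connection concrete, here is a minimal sketch (not the recipe's own code) contrasting hard k-means labels with a GMM's soft, per-point probabilities. It assumes scikit-learn's KMeans and GaussianMixture; the sample data and parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two 1-D clusters with deliberately unequal variances
X = np.concatenate([rng.normal(0, 1, 500),
                    rng.normal(5, 3, 500)]).reshape(-1, 1)

# Hard assignments: every point gets exactly one label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])

# Soft assignments: covariance_type='full' lets each cluster's
# variance differ; 'spherical' would mimic the k-means structure
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      random_state=0).fit(X)
print(gmm.predict_proba(X[:5]))  # one probability per cluster, per point
```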
How to do it...
- First, we need to create some data. For example, let's simulate heights of both women and men. We'll use this example throughout this recipe. It's a simple...
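As a rough sketch of this first step (the sample sizes, means, and standard deviations below are illustrative assumptions, not necessarily the recipe's values), the two height distributions could be simulated with NumPy:

```python
import numpy as np

rng = np.random.RandomState(42)
N = 1000
# Assumed parameters, in inches: men ~ N(69, 2.5**2), women ~ N(64, 2.0**2)
men = rng.normal(69, 2.5, N)
women = rng.normal(64, 2.0, N)
heights = np.concatenate([men, women])
```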