





















































In this article by Ashish Gupta, author of the book Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in the K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article:

Learning model-based clustering
Dirichlet clustering
Topic modeling and Latent Dirichlet Allocation (LDA)
Running LDA on the 20 Newsgroups dataset using Mahout
In model-based clustering, we assume that the data is generated by a model, and we try to recover that model from the data. The right model will fit the data better than other models.

In the K-means algorithm, we provide the initial set of clusters, and K-means returns the data points assigned to those clusters. But consider a case where the clusters are not normally distributed: K-means will do a poor job of improving such clusters. Another scenario arises when we divide the data hierarchically and need to find out how the clusters overlap. Both of these situations are covered by model-based clustering algorithms.
If the components are not well separated, a cluster can consist of multiple mixture components. In simple terms, in model-based clustering the data is a mixture of two or more components. Each component has an associated probability and is described by a density function, so the overall density of a point x is a weighted sum such as f(x) = p1 f1(x) + p2 f2(x) + ..., where the weights sum to 1. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, and the generated models are compared by the Bayesian Information Criterion (BIC). The model with the lowest BIC is preferred.
In the equation BIC = -2 log(L) + m log(n), L is the likelihood function of the fitted model, m is the number of free parameters to be estimated, and n is the number of data points.
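As a minimal sketch in plain Java (this is not Mahout code, and the log-likelihood numbers are hypothetical placeholders), comparing two candidate models by BIC looks like this:

// BIC = -2*log(L) + m*log(n); the model with the lower BIC is preferred.
static double bic(double logLikelihood, int freeParameters, long dataPoints) {
    return -2.0 * logLikelihood + freeParameters * Math.log(dataPoints);
}

// Hypothetical comparison of a 2-component and a 3-component model
// fitted to the same n = 1000 points:
double bic2 = bic(-4210.5, 5, 1000);
double bic3 = bic(-4198.2, 8, 1000);
String preferred = (bic2 < bic3) ? "2-component model" : "3-component model";

Note how the m log(n) term penalizes the model with more parameters, so a more complex model is preferred only if it improves the likelihood enough to pay for its extra parameters.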
Dirichlet clustering is a model-based clustering method used both to understand and to cluster the data. It is a nonparametric, Bayesian modeling process: nonparametric because the model can have an infinite number of parameters, and Bayesian because it is based on the Dirichlet distribution.

For this algorithm, we have a probabilistic mixture of a number of models that together explain the data. Each data point is assumed to come from one of the available models. The models are drawn as a sample from a prior distribution over models, and points are assigned to these models iteratively. In each iteration, the probability that a point was generated by a particular model is calculated. After the points are assigned to a model, new parameters for each model are sampled from the posterior distribution of the model parameters, taking into account all the observed data points assigned to that model. This sampling provides more information than normal clustering does; a simplified sketch of one iteration follows.
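Below is a highly simplified, hypothetical sketch of this assign-then-resample loop in plain Java. It fixes the number of models at two, uses one-dimensional Gaussian models with known unit variance and equal weights, and puts a standard normal prior on each mean; Mahout's actual Dirichlet implementation is far more general (arbitrary model distributions and a variable number of models):

import java.util.Random;

public class DirichletSketch {
    // Density of a 1-D Gaussian with the given mean and unit variance.
    static double likelihood(double x, double mean) {
        double d = x - mean;
        return Math.exp(-0.5 * d * d) / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double[] data = {0.1, -0.3, 0.2, 4.9, 5.2, 5.1};
        Random rng = new Random(42);
        // Sample the initial model parameters (means) from the N(0, 1) prior.
        double[] means = {rng.nextGaussian(), rng.nextGaussian()};
        int[] assignment = new int[data.length];

        for (int iter = 0; iter < 20; iter++) {
            // Step 1: probabilistically assign each point to a model,
            // according to how likely each model is to have generated it.
            for (int i = 0; i < data.length; i++) {
                double p0 = likelihood(data[i], means[0]);
                double p1 = likelihood(data[i], means[1]);
                assignment[i] = rng.nextDouble() < p0 / (p0 + p1) ? 0 : 1;
            }
            // Step 2: resample each model's mean from its posterior, given
            // the points currently assigned to it (normal-normal conjugacy).
            for (int m = 0; m < means.length; m++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == m) { sum += data[i]; count++; }
                }
                double postVar = 1.0 / (1.0 + count);
                double postMean = sum * postVar;
                means[m] = postMean + Math.sqrt(postVar) * rng.nextGaussian();
            }
        }
        System.out.printf("Sampled means: %.2f, %.2f%n", means[0], means[1]);
    }
}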
In machine learning, topic modeling means discovering the topics of a text document using a statistical model. A document on a particular topic tends to contain particular words. For example, if you are reading an article on sports, the chances are high that you will see words such as football, baseball, Formula One, and Olympics. A topic model thus uncovers the hidden sense of an article or document. Topic models are algorithms that can discover the main themes in a large set of unstructured documents and uncover the semantic structure of the text. Topic modeling enables us to organize large-scale electronic archives. Mahout implements one such topic modeling algorithm: Latent Dirichlet Allocation (LDA).
LDA is a statistical model of document collections that tries to capture the intuition behind the documents. Normal clustering algorithms will not associate two words that have the same meaning if they never occur together, but LDA can discover that two words are used in similar contexts; it is better than other algorithms at finding associations of this kind.
LDA is a generative, probabilistic model. It is generative because the model is fitted to the data and, using the parameters of the model, we can generate data like that which it was fitted on. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, an LDA model is usually drawn in plate notation using the following standard symbols: α is the Dirichlet prior on the per-document topic distributions, β represents the per-topic word distributions, θ is the topic distribution for a document, z is the topic assigned to a particular word, and w is the observed word itself. Only w is observed; the other variables are hidden and must be inferred from the data.
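To make the generative story concrete, here is a hypothetical, self-contained Java sketch of LDA's generative process for a single document. It uses a symmetric Dirichlet prior with α = 1 (for which a Dirichlet draw reduces to normalizing exponential random variables) and a tiny vocabulary and topic-word table invented purely for illustration:

import java.util.Random;

public class LdaGenerativeSketch {
    public static void main(String[] args) {
        Random rng = new Random(7);
        // Hypothetical topic-word distributions (beta): 2 topics, 4 words.
        String[] vocab = {"football", "olympics", "election", "senate"};
        double[][] beta = {
            {0.5, 0.4, 0.05, 0.05},   // a "sports" topic
            {0.05, 0.05, 0.5, 0.4}    // a "politics" topic
        };

        // Draw theta ~ Dirichlet(1, 1) by normalizing Exponential(1) samples.
        double[] theta = new double[beta.length];
        double sum = 0;
        for (int t = 0; t < theta.length; t++) {
            theta[t] = -Math.log(rng.nextDouble());
            sum += theta[t];
        }
        for (int t = 0; t < theta.length; t++) theta[t] /= sum;

        // For each word position: draw a topic z ~ Mult(theta),
        // then a word w ~ Mult(beta[z]).
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            int z = sample(theta, rng);
            int w = sample(beta[z], rng);
            doc.append(vocab[w]).append(' ');
        }
        System.out.println(doc);
    }

    // Draw an index from a discrete probability distribution.
    static int sample(double[] p, Random rng) {
        double u = rng.nextDouble(), acc = 0;
        for (int i = 0; i < p.length; i++) {
            acc += p[i];
            if (u < acc) return i;
        }
        return p.length - 1;
    }
}

Inference runs this story in reverse: given only the observed words w, LDA estimates the hidden θ, z, and β.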
How does LDA work in MapReduce mode? Broadly, each iteration is a MapReduce job: the mappers perform the expectation step over their share of the documents, and the reducers aggregate and normalize the resulting counts in the maximization step (both steps are described in more detail below).
This process is iterative: in each iteration the sum of the probabilities is calculated, and the process stops when that sum stops changing. A parameter, similar to the convergence threshold in K-means, is set to detect when the change is small enough. In the end, LDA estimates how well the model fits the data.
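A minimal sketch of such a stopping test in plain Java (the names here are invented for illustration):

// Analogous to the K-means convergence delta: stop iterating once the
// tracked quantity (here, the summed probabilities) barely changes.
static boolean converged(double previous, double current, double threshold) {
    return Math.abs(current - previous) < threshold;
}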
In Mahout, the Collapsed Variational Bayes (CVB) algorithm is implemented for LDA. LDA takes term frequency (tf) vectors as input, not tf-idf vectors. We need to take care of two parameters while running the LDA algorithm: the number of topics and the number of words in the documents. A higher number of topics produces very fine-grained, low-level topics, while a lower number produces generalized, high-level topics such as sports.
In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization for hierarchical Bayesian models. The expectation step reads each document and calculates the probability of each topic for each word in it. The maximization step takes these accumulated counts, sums the probabilities, and normalizes them, as the sketch below illustrates.
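Here is a plain-Java sketch of one such iteration on in-memory data. It is a deliberate simplification that omits the Dirichlet priors and smoothing that Mahout's CVB applies, and all names (docs, phi, theta, emIteration) are invented for illustration, but it follows the expectation/maximization flow just described:

// phi[t][w]  = p(word w | topic t); theta[d][t] = p(topic t | document d).
// docs[d] holds the word ids of document d.
static void emIteration(int[][] docs, double[][] phi, double[][] theta) {
    int numTopics = phi.length;
    int vocabSize = phi[0].length;
    double[][] wordTopicCounts = new double[numTopics][vocabSize];
    double[][] docTopicCounts = new double[docs.length][numTopics];

    // Expectation: for each word in each document, compute the topic
    // probabilities and accumulate them as fractional counts.
    for (int d = 0; d < docs.length; d++) {
        for (int w : docs[d]) {
            double[] p = new double[numTopics];
            double norm = 0;
            for (int t = 0; t < numTopics; t++) {
                p[t] = phi[t][w] * theta[d][t];
                norm += p[t];
            }
            for (int t = 0; t < numTopics; t++) {
                double frac = p[t] / norm;
                wordTopicCounts[t][w] += frac;
                docTopicCounts[d][t] += frac;
            }
        }
    }
    // Maximization: normalize the accumulated counts back into
    // probability distributions.
    for (int t = 0; t < numTopics; t++) {
        double sum = 0;
        for (int w = 0; w < vocabSize; w++) sum += wordTopicCounts[t][w];
        for (int w = 0; w < vocabSize; w++) phi[t][w] = wordTopicCounts[t][w] / sum;
    }
    for (int d = 0; d < docs.length; d++) {
        double sum = 0;
        for (int t = 0; t < numTopics; t++) sum += docTopicCounts[d][t];
        for (int t = 0; t < numTopics; t++) theta[d][t] = docTopicCounts[d][t] / sum;
    }
}

In the MapReduce setting, the expectation loop over documents is what the mappers parallelize, and the summing and normalizing is what the reducers perform.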
To run LDA using Mahout, we will use the 20 Newsgroups dataset. We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics.
Let's run this example to view how topic modeling works in Mahout.
We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.
Perform the following steps to execute the CVB algorithm:
Create a directory and extract the downloaded dataset into it:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

Copy all the extracted documents into a single directory:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

Upload the data to HDFS:

hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata

Convert the raw text to SequenceFile format:

bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

Convert the SequenceFile to term frequency vectors (remember that LDA needs tf, not tf-idf, vectors):

bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf

Convert the tf vectors into a matrix with integer row ids:

bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix

Run the CVB algorithm on the matrix:

bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel
The parameters used in the preceding command can be explained as follows:

-i: the input path of the matrix
-o: the output path for the learned topic-term distributions
-k: the number of topics to find
-x: the maximum number of iterations
-dict: the dictionary file that maps term ids back to words
-dt: the output path for the document-topic distributions
-mt: a temporary directory for the intermediate model state
Once the job finishes, you can inspect the learned topics. To view the top terms of each topic in readable form, dump the output using the vectordump utility:
bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt
The parameters used in the preceding command can be explained as follows:

-i: the input path of the topic-term vectors
-d: the dictionary file
-dt: the dictionary type (sequencefile in this case)
-vs: the number of top terms to print per topic
-sort: whether to sort the terms by weight
-o: the local output file
After running the algorithm, the dump in /tmp/lda-output.txt lists, for each topic, its top terms together with the probability of each term belonging to that topic.
In this article, we learned about model-based clustering, Dirichlet clustering, and topic modeling. In model-based clustering, we try to obtain the model from the data, while Dirichlet clustering is used to understand the data by means of a probabilistic mixture of models. Topic modeling helps us identify the topics in an article or in a set of documents. We discussed how Mahout implements topic modeling using Latent Dirichlet Allocation, how the algorithm runs in MapReduce, and how to use Mahout to find the topic distribution over a set of documents.