





















































In this article by Ashish Gupta, author of the book Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in the K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article:

Learning model-based clustering
Dirichlet clustering
Topic modeling and Latent Dirichlet Allocation (LDA)
Running LDA on the 20 Newsgroups dataset using Mahout
In model-based clustering, we assume that the data is generated by a model, and we try to recover that model from the data. The right model will fit the data better than other models.

In the K-means algorithm, we provide the initial set of clusters, and K-means returns the data points assigned to those clusters. But consider a case where the clusters are not normally distributed: K-means will do a poor job of improving such clusters. Another scenario arises when we divide the data hierarchically and need to find out how the clusters overlap. Both of these situations are covered by model-based clustering algorithms.
If the components are not well separated, a cluster can consist of multiple mixture components. In simple terms, in model-based clustering the data is a mixture of two or more components. Each component has an associated probability and is described by a density function, so the overall density of a point x is a weighted sum such as f(x) = p1 f1(x) + p2 f2(x) + ..., where the weights sum to 1. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by the expectation-maximization (EM) algorithm for maximum likelihood, and the generated models are compared by the Bayesian Information Criterion (BIC). The model with the lowest BIC is preferred.
In the equation BIC = -2 log(L) + m log(n), L is the likelihood function of the fitted model, m is the number of free parameters to be estimated, and n is the number of data points.
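As a minimal sketch in plain Java (this is not Mahout code, and the log-likelihood numbers are hypothetical placeholders), comparing two candidate models by BIC looks like this:

// BIC = -2*log(L) + m*log(n); the model with the lower BIC is preferred.
static double bic(double logLikelihood, int freeParameters, long dataPoints) {
    return -2.0 * logLikelihood + freeParameters * Math.log(dataPoints);
}

// Hypothetical comparison of a 2-component and a 3-component model
// fitted to the same n = 1000 points:
double bic2 = bic(-4210.5, 5, 1000);
double bic3 = bic(-4198.2, 8, 1000);
String preferred = (bic2 < bic3) ? "2-component model" : "3-component model";

Note how the m log(n) term penalizes the model with more parameters, so a more complex model is preferred only if it improves the likelihood enough to pay for its extra parameters.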
Dirichlet clustering is a model-based clustering method used both to understand and to cluster the data. It is a nonparametric, Bayesian modeling process: nonparametric because the model can have an infinite number of parameters, and Bayesian because it is based on the Dirichlet distribution.

For this algorithm, we have a probabilistic mixture of a number of models that together explain the data. Each data point is assumed to come from one of the available models. The models are drawn as a sample from a prior distribution over models, and points are assigned to these models iteratively. In each iteration, the probability that a point was generated by a particular model is calculated. After the points are assigned to a model, new parameters for each model are sampled from the posterior distribution of the model parameters, taking into account all the observed data points assigned to that model. This sampling provides more information than normal clustering does; a simplified sketch of one iteration follows.
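Below is a highly simplified, hypothetical sketch of this assign-then-resample loop in plain Java. It fixes the number of models at two, uses one-dimensional Gaussian models with known unit variance and equal weights, and puts a standard normal prior on each mean; Mahout's actual Dirichlet implementation is far more general (arbitrary model distributions and a variable number of models):

import java.util.Random;

public class DirichletSketch {
    // Density of a 1-D Gaussian with the given mean and unit variance.
    static double likelihood(double x, double mean) {
        double d = x - mean;
        return Math.exp(-0.5 * d * d) / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double[] data = {0.1, -0.3, 0.2, 4.9, 5.2, 5.1};
        Random rng = new Random(42);
        // Sample the initial model parameters (means) from the N(0, 1) prior.
        double[] means = {rng.nextGaussian(), rng.nextGaussian()};
        int[] assignment = new int[data.length];

        for (int iter = 0; iter < 20; iter++) {
            // Step 1: probabilistically assign each point to a model,
            // according to how likely each model is to have generated it.
            for (int i = 0; i < data.length; i++) {
                double p0 = likelihood(data[i], means[0]);
                double p1 = likelihood(data[i], means[1]);
                assignment[i] = rng.nextDouble() < p0 / (p0 + p1) ? 0 : 1;
            }
            // Step 2: resample each model's mean from its posterior, given
            // the points currently assigned to it (normal-normal conjugacy).
            for (int m = 0; m < means.length; m++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == m) { sum += data[i]; count++; }
                }
                double postVar = 1.0 / (1.0 + count);
                double postMean = sum * postVar;
                means[m] = postMean + Math.sqrt(postVar) * rng.nextGaussian();
            }
        }
        System.out.printf("Sampled means: %.2f, %.2f%n", means[0], means[1]);
    }
}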
In machine learning, topic modeling means discovering the topics of a text document using a statistical model. A document on a particular topic tends to contain particular words. For example, if you are reading an article on sports, the chances are high that you will see words such as football, baseball, Formula One, and Olympics. A topic model thus uncovers the hidden sense of an article or document. Topic models are algorithms that can discover the main themes in a large set of unstructured documents and uncover the semantic structure of the text. Topic modeling enables us to organize large-scale electronic archives. Mahout implements one such topic modeling algorithm: Latent Dirichlet Allocation (LDA).
LDA is a statistical model of document collections that tries to capture the intuition behind the documents. Normal clustering algorithms will not associate two words that have the same meaning if they never occur together, but LDA can discover that two words are used in similar contexts; it is better than other algorithms at finding associations of this kind.
LDA is a generative, probabilistic model. It is generative because the model is fitted to the data and, using the parameters of the model, we can generate data like that which it was fitted on. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, an LDA model is usually drawn in plate notation using the following standard symbols: α is the Dirichlet prior on the per-document topic distributions, β represents the per-topic word distributions, θ is the topic distribution for a document, z is the topic assigned to a particular word, and w is the observed word itself. Only w is observed; the other variables are hidden and must be inferred from the data.
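To make the generative story concrete, here is a hypothetical, self-contained Java sketch of LDA's generative process for a single document. It uses a symmetric Dirichlet prior with α = 1 (for which a Dirichlet draw reduces to normalizing exponential random variables) and a tiny vocabulary and topic-word table invented purely for illustration:

import java.util.Random;

public class LdaGenerativeSketch {
    public static void main(String[] args) {
        Random rng = new Random(7);
        // Hypothetical topic-word distributions (beta): 2 topics, 4 words.
        String[] vocab = {"football", "olympics", "election", "senate"};
        double[][] beta = {
            {0.5, 0.4, 0.05, 0.05},   // a "sports" topic
            {0.05, 0.05, 0.5, 0.4}    // a "politics" topic
        };

        // Draw theta ~ Dirichlet(1, 1) by normalizing Exponential(1) samples.
        double[] theta = new double[beta.length];
        double sum = 0;
        for (int t = 0; t < theta.length; t++) {
            theta[t] = -Math.log(rng.nextDouble());
            sum += theta[t];
        }
        for (int t = 0; t < theta.length; t++) theta[t] /= sum;

        // For each word position: draw a topic z ~ Mult(theta),
        // then a word w ~ Mult(beta[z]).
        StringBuilder doc = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            int z = sample(theta, rng);
            int w = sample(beta[z], rng);
            doc.append(vocab[w]).append(' ');
        }
        System.out.println(doc);
    }

    // Draw an index from a discrete probability distribution.
    static int sample(double[] p, Random rng) {
        double u = rng.nextDouble(), acc = 0;
        for (int i = 0; i < p.length; i++) {
            acc += p[i];
            if (u < acc) return i;
        }
        return p.length - 1;
    }
}

Inference runs this story in reverse: given only the observed words w, LDA estimates the hidden θ, z, and β.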
How does LDA work in MapReduce mode? Broadly, each iteration is a MapReduce job: the mappers perform the expectation step over their share of the documents, and the reducers aggregate and normalize the resulting counts in the maximization step (both steps are described in more detail below).
This process is iterative: in each iteration the sum of the probabilities is calculated, and the process stops when that sum stops changing. A parameter, similar to the convergence threshold in K-means, is set to detect when the change is small enough. In the end, LDA estimates how well the model fits the data.
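A minimal sketch of such a stopping test in plain Java (the names here are invented for illustration):

// Analogous to the K-means convergence delta: stop iterating once the
// tracked quantity (here, the summed probabilities) barely changes.
static boolean converged(double previous, double current, double threshold) {
    return Math.abs(current - previous) < threshold;
}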
In Mahout, the Collapsed Variational Bayes (CVB) algorithm is implemented for LDA. LDA takes term frequency (tf) vectors as input, not tf-idf vectors. We need to take care of two parameters while running the LDA algorithm: the number of topics and the number of words in the documents. A higher number of topics produces very fine-grained, low-level topics, while a lower number produces generalized, high-level topics such as sports.
In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization for hierarchical Bayesian models. The expectation step reads each document and calculates the probability of each topic for each word in it. The maximization step takes these accumulated counts, sums the probabilities, and normalizes them, as the sketch below illustrates.
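Here is a plain-Java sketch of one such iteration on in-memory data. It is a deliberate simplification that omits the Dirichlet priors and smoothing that Mahout's CVB applies, and all names (docs, phi, theta, emIteration) are invented for illustration, but it follows the expectation/maximization flow just described:

// phi[t][w]  = p(word w | topic t); theta[d][t] = p(topic t | document d).
// docs[d] holds the word ids of document d.
static void emIteration(int[][] docs, double[][] phi, double[][] theta) {
    int numTopics = phi.length;
    int vocabSize = phi[0].length;
    double[][] wordTopicCounts = new double[numTopics][vocabSize];
    double[][] docTopicCounts = new double[docs.length][numTopics];

    // Expectation: for each word in each document, compute the topic
    // probabilities and accumulate them as fractional counts.
    for (int d = 0; d < docs.length; d++) {
        for (int w : docs[d]) {
            double[] p = new double[numTopics];
            double norm = 0;
            for (int t = 0; t < numTopics; t++) {
                p[t] = phi[t][w] * theta[d][t];
                norm += p[t];
            }
            for (int t = 0; t < numTopics; t++) {
                double frac = p[t] / norm;
                wordTopicCounts[t][w] += frac;
                docTopicCounts[d][t] += frac;
            }
        }
    }
    // Maximization: normalize the accumulated counts back into
    // probability distributions.
    for (int t = 0; t < numTopics; t++) {
        double sum = 0;
        for (int w = 0; w < vocabSize; w++) sum += wordTopicCounts[t][w];
        for (int w = 0; w < vocabSize; w++) phi[t][w] = wordTopicCounts[t][w] / sum;
    }
    for (int d = 0; d < docs.length; d++) {
        double sum = 0;
        for (int t = 0; t < numTopics; t++) sum += docTopicCounts[d][t];
        for (int t = 0; t < numTopics; t++) theta[d][t] = docTopicCounts[d][t] / sum;
    }
}

In the MapReduce setting, the expectation loop over documents is what the mappers parallelize, and the summing and normalizing is what the reducers perform.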
To run LDA using Mahout, we will use the 20 Newsgroups dataset. We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics.
Let's run this example to view how topic modeling works in Mahout.
We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.
Perform the following steps to execute the CVB algorithm:
Create a directory and extract the downloaded dataset into it:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

Copy all the extracted documents into a single directory:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

Upload the data to HDFS:

hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata

Convert the raw text to SequenceFile format:

bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

Convert the SequenceFile to term frequency vectors (remember that LDA needs tf, not tf-idf, vectors):

bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf

Convert the tf vectors into a matrix with integer row ids:

bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix

Run the CVB algorithm on the matrix:

bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel
The parameters used in the preceding command can be explained as follows:

-i: the input path of the matrix
-o: the output path for the learned topic-term distributions
-k: the number of topics to find
-x: the maximum number of iterations
-dict: the dictionary file that maps term ids back to words
-dt: the output path for the document-topic distributions
-mt: a temporary directory for the intermediate model state
Once the job finishes, you can inspect the learned topics. To view the top terms of each topic in readable form, dump the output using the vectordump utility:
bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt
The parameters used in the preceding command can be explained as follows:

-i: the input path of the topic-term vectors
-d: the dictionary file
-dt: the dictionary type (sequencefile in this case)
-vs: the number of top terms to print per topic
-sort: whether to sort the terms by weight
-o: the local output file
After running the algorithm, the dump in /tmp/lda-output.txt lists, for each topic, its top terms together with the probability of each term belonging to that topic.
In this article, we learned about model-based clustering, Dirichlet clustering, and topic modeling. In model-based clustering, we try to obtain the model from the data, while Dirichlet clustering is used to understand the data by means of a probabilistic mixture of models. Topic modeling helps us identify the topics in an article or in a set of documents. We discussed how Mahout implements topic modeling using Latent Dirichlet Allocation, how the algorithm runs in MapReduce, and how to use Mahout to find the topic distribution over a set of documents.