An example of a document clustering application
This application reads a set of documents and organizes them using the k-means clustering algorithm. To achieve this, we will use four components:
The Reader system: This system will read all the documents and convert every document into a list of String objects.
The Indexer system: This system will process the documents and convert them into a list of words. At the same time, it will generate the global vocabulary of the set of documents with all the words that appear in them.
The Mapper system: This system will convert each list of words into a mathematical representation using the vector space model. Each component of the vector corresponds to a word, and its value will be the Tf-Idf (short for term frequency–inverse document frequency) metric of that word in the document.
The Clustering system: This system will use the k-means clustering algorithm to cluster the documents.
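To make the Mapper step more concrete, here is a minimal, sequential sketch of how a Tf-Idf vector could be computed for each document. It is not the implementation used by this application: the class name TfIdfSketch and the method tfIdf are hypothetical, and the sketch assumes one common variant of the metric, where tf is the word count divided by the document length and idf is the natural logarithm of the number of documents divided by the number of documents containing the word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the Mapper system of this application.
public class TfIdfSketch {

    // Converts each document (a list of words) into a sparse vector word -> Tf-Idf.
    // Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    static List<Map<String, Double>> tfIdf(List<List<String>> documents) {
        // Document frequency: in how many documents each word appears.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            for (String word : new HashSet<>(document)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }

        int totalDocuments = documents.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> document : documents) {
            // Raw count of each word in this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String word : document) {
                counts.merge(word, 1, Integer::sum);
            }

            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double tf = (double) entry.getValue() / document.size();
                double idf = Math.log(
                        (double) totalDocuments / documentFrequency.get(entry.getKey()));
                vector.put(entry.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("java", "threads", "java"),
                Arrays.asList("java", "clustering"));
        // Prints one word -> Tf-Idf map per document.
        System.out.println(tfIdf(documents));
    }
}
```

With this variant, a word that appears in every document gets a value of zero, so it contributes nothing when the clustering step compares documents. Words concentrated in a few documents get higher values.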
All these systems are concurrent and use their own tasks to implement their functionality. Let's see how you can implement this...