Topic modeling
When we have a collection of documents whose categories are not known in advance, topic models help us find a rough categorization. The model treats each document as a mixture of topics, often with one topic dominating.
For example, let's suppose we have the following sentences:
- Eating fruits as snacks is a healthy habit
- Exercising regularly is an important part of a healthy lifestyle
- Grapefruit and oranges are citrus fruits
A topic model of these sentences may output the following:
- Topic A: 40% healthy, 20% fruits, 10% snacks
- Topic B: 20% grapefruit, 20% oranges, 10% citrus
- Sentences 1 and 2: 80% Topic A, 20% Topic B
- Sentence 3: 100% Topic B
From the output of the model, we can guess that Topic A is about health and Topic B is about fruits. Although these topics are not known a priori, the model assigns probabilities to the words in the documents that are associated with health, exercising, and fruits.
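To make the "mixture of topics" idea concrete, here is a minimal Python sketch that stores the example output above in plain dictionaries and reads it back. It does not fit a topic model. It only shows how the two kinds of output (word probabilities per topic, topic proportions per sentence) combine. The names `topic_word`, `doc_topic`, `dominant_topic`, and `word_probability` are made up for this illustration, and the mixture formula assumed here is the usual one: the probability of a word in a document is the sum over topics of the topic's share times the word's probability within that topic.

```python
# Word probabilities per topic, copied from the example output above.
# Only the listed words are included, so each topic's values do not sum to 1.
topic_word = {
    "A": {"healthy": 0.40, "fruits": 0.20, "snacks": 0.10},
    "B": {"grapefruit": 0.20, "oranges": 0.20, "citrus": 0.10},
}

# Topic proportions per sentence, copied from the example output above.
doc_topic = {
    "sentence 1": {"A": 0.8, "B": 0.2},
    "sentence 2": {"A": 0.8, "B": 0.2},
    "sentence 3": {"A": 0.0, "B": 1.0},
}


def dominant_topic(mixture):
    """Return the topic with the largest share in a document's mixture."""
    return max(mixture, key=mixture.get)


def word_probability(word, mixture):
    """Sum over topics of (topic share in document) * (word probability in topic)."""
    return sum(
        share * topic_word[topic].get(word, 0.0)
        for topic, share in mixture.items()
    )


for name, mixture in doc_topic.items():
    print(name, "->", dominant_topic(mixture),
          round(word_probability("healthy", mixture), 2))
```

Under these numbers, sentences 1 and 2 are dominated by Topic A and sentence 3 by Topic B. The word "healthy" gets probability 0.8 × 0.40 = 0.32 in the first two sentences and 0 in the third, because Topic B gives it no listed weight.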
It is clear from these examples that topic modeling is an unsupervised...