Traditional NLP
Extracting useful information from text-based data is no easy task. For a basic application such as document classification, a common feature-extraction method is bag of words (BoW), in which the frequency of occurrence of each word is used as a feature for training the classifier. We will briefly talk about BoW in the following section, as well as the tf-idf approach, which is intended to reflect how important a word is to a document in a collection or corpus.
Bag of words
BoW is mainly used for categorizing documents, and it is also used in computer vision. The idea is to represent a document as a bag (a multiset) of its words, keeping how often each word occurs but disregarding grammar and word order.
After the text, often called the corpus, has been preprocessed, a vocabulary is generated from it, and the BoW representation of each document is built on top of that vocabulary.
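The two steps just described, generating a vocabulary and then building a count vector per document, can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the helper names (tokenize, build_vocabulary, bow_vector) and the two sample sentences are hypothetical, and preprocessing is reduced to lowercasing and keeping alphabetic tokens, which is far simpler than what a real pipeline would do.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed preprocessing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect the distinct tokens of all documents into a sorted list."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(document))
    # Words missing from the document get a count of 0.
    return [counts[word] for word in vocabulary]


# Hypothetical sample documents, used only for illustration.
docs = ["the cat sat on the mat", "the dog chased the cat"]

vocabulary = build_vocabulary(docs)
vectors = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [1, 1, 1, 0, 0, 0, 2]
```

Each vector has one entry per vocabulary word, so the word order of the original sentence is lost and only the counts remain. Sorting the vocabulary is a design choice made here so that the column order is reproducible; any fixed order would work.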
Take the following two text samples as an example:
“The quick brown fox jumps over the lazy dog” “never jump over the lazy dog...