Adding flexibility with N-grams
The bag-of-words model takes into account only isolated terms, called unigrams. This loses the order of the words, which can be important in some cases. A generalization of the technique is called n-grams: instead of single words alone, we also keep word pairs or word triplets, called bigrams and trigrams, respectively. The n refers to the general case, where you keep sequences of up to n words together in the data. Naturally, this representation has unfavorable combinatorial properties: the number of distinct features grows rapidly with n, so when dealing with a large corpus it can require significant computing power.
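To get a feel for that growth, here is a minimal sketch, assuming the quanteda package (an assumption on our part; it may not be the tokenizer used elsewhere in this chapter), that counts the distinct features our example sentence produces as n increases:

```r
library(quanteda)

text <- "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."
toks <- tokens(text, remove_punct = TRUE)

# Distinct features for increasing n; the vocabulary grows quickly
# even for a single short sentence
nfeat(dfm(toks))                          # unigrams only
nfeat(dfm(tokens_ngrams(toks, n = 1:2)))  # unigrams plus bigrams
nfeat(dfm(tokens_ngrams(toks, n = 1:3)))  # everything up to trigrams
```

The counts for a single sentence are small, but the same effect multiplies across the thousands of documents in a real corpus.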
With the sentence object we created earlier to exemplify how the tokenization process works (it contains the sentence: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.) and the build_dfm() function we created with the n_grams argument, you can compare the resulting DFM with n_grams = 2 to the one with n_grams = 1.
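Since build_dfm() is defined earlier in the chapter, the following is only a hypothetical re-creation of it on top of quanteda, so that the comparison is self-contained; the chapter's actual implementation may differ:

```r
library(quanteda)

# Hypothetical stand-in for the chapter's build_dfm(); it keeps all
# n-grams up to the requested size, matching the "up to n" definition
build_dfm <- function(data, n_grams = 1) {
    toks <- tokens(data, remove_punct = TRUE)
    dfm(tokens_ngrams(toks, n = 1:n_grams))
}

sentence <- "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."

featnames(build_dfm(sentence, n_grams = 1))  # single words only
featnames(build_dfm(sentence, n_grams = 2))  # also pairs such as "like_a"
```

Note that with n_grams = 2 the unigram features are still present; bigram features such as like_a and a_duck are simply added on top, which is what lets the model recover some word-order information without discarding the original bag-of-words signal.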