Identifying and removing rare words
We can remove words with low occurrences by finding words with low frequency counts, words that fall outside some deviation from the norm, or words on a list considered rare within the given domain. Whichever way the rare words are identified, the removal technique we will use works the same.
How to do it
Rare words can be removed by building a list of those rare words and then removing them from the set of tokens being processed. The list of rare words can be determined using the frequency distribution provided by NLTK; you then decide the threshold at which a word counts as rare:
- The script in the 07/07_rare_words.py file extends that of the frequency distribution recipe to identify words with two or fewer occurrences, and then removes those words from the tokens:
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords

with open('wotw.txt', 'r') as file:
    data = file.read()
tokens = [word.lower() for word in regexp_tokenize(data, r'\w+')]
stoplist = stopwords.words('english')
without_stops = [word for word in tokens if word not in stoplist]
...
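The rare-word filter itself can be sketched with the standard library's Counter, which provides the same token counts as NLTK's frequency distribution; the function name remove_rare_words and the sample tokens below are illustrative, not taken from the recipe file:

```python
from collections import Counter

def remove_rare_words(tokens, min_count=3):
    # Count every token, then keep only tokens whose total
    # occurrence count meets the threshold; words with two or
    # fewer occurrences are treated as rare and dropped.
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# Hypothetical sample: "cylinder" and "ulla" occur once, and
# "martian" twice, so all three are removed as rare words.
tokens = ["martian", "heat", "ray", "heat", "ray", "cylinder",
          "heat", "ray", "martian", "ulla"]
print(remove_rare_words(tokens))
# → ['heat', 'ray', 'heat', 'ray', 'heat', 'ray']
```

The same filtering step would be applied to the without_stops list in the script above, so that stopwords are already gone before the rare-word counts are taken.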