Calculating the frequency distributions of words
A frequency distribution counts the number of occurrences of distinct data values. Frequency distributions are useful because they let us determine which words or phrases within a document are most common, and from that infer which carry greater or lesser significance.
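As a minimal illustration of the idea (the word list here is made up for this example), a frequency distribution over a small list of words is simply a mapping from each distinct word to its count:

# A small, hypothetical word list
words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Build the distribution by counting each distinct word
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}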
Frequency distributions can be calculated using several different techniques. We will examine them using the facilities built into NLTK.
How to do it
NLTK provides a class, nltk.probability.FreqDist, that allows us to very easily calculate the frequency distribution of values in a list. Let's examine how to use this class (the code is in 07/freq_dist.py):
- To create a frequency distribution using NLTK, start by importing FreqDist from NLTK (along with the tokenizers and stop words):
from nltk.probability import FreqDist
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords
- Then we can use the FreqDist class to create a frequency distribution given a list of words. We will examine...
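As a rough sketch of what that step might look like (the sample text, tokenization pattern, and variable names below are assumptions for illustration, not necessarily the book's actual example), we can tokenize some text, remove English stop words, and pass the remaining words to FreqDist. The stopwords corpus must have been downloaded beforehand, for example via nltk.download('stopwords'):

from nltk.probability import FreqDist
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords

# Hypothetical sample text standing in for a real document
text = ("A frequency distribution counts the number of occurrences "
        "of distinct data values within a document")

# Split the text into lowercase word tokens
tokens = [t.lower() for t in regexp_tokenize(text, r'\w+')]

# Drop common English stop words so they do not dominate the counts
english_stops = set(stopwords.words('english'))
words = [t for t in tokens if t not in english_stops]

# Build the frequency distribution and show the most common words
fdist = FreqDist(words)
print(fdist.most_common(5))

FreqDist behaves like a dictionary of word counts, so individual words can also be looked up directly, for example fdist['distribution'].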