A sample text classifier based on the Reuters corpus
We are going to build a sample text classifier based on the NLTK Reuters corpus. This corpus consists of thousands of newswire documents divided into 90 categories:
from nltk.corpus import reuters

print(reuters.categories())

[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', ...
To simplify the process, we'll take only two categories, which have a similar number of documents:
import numpy as np

Xr = np.array(reuters.sents(categories=['rubber']))
Xc = np.array(reuters.sents(categories=['cotton']))
Xw = np.concatenate((Xr, Xc))
As each document is already split into tokens and we want to apply our custom tokenizer (with stopword removal and stemming), we need to rebuild the full sentences:
X = []

for document in Xw:
    X.append(' '.join(document).strip().lower())
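As a quick check of this step, here is a minimal sketch (using hypothetical toy documents rather than the Reuters data) of how each token list is joined back into a single normalized string:

```python
# Toy tokenized documents, stand-ins for the Reuters sentences
docs = [['Rubber', 'prices', 'rose', '.'],
        ['Cotton', 'output', 'fell', '.']]

# Rebuild each document as one lowercase string
X = [' '.join(document).strip().lower() for document in docs]
# → ['rubber prices rose .', 'cotton output fell .']
```

The resulting strings can then be fed to any vectorizer that expects raw text and applies its own tokenizer.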
Now we need to prepare the label vector, by assigning 0 to rubber and 1 to cotton:
Yr = np.zeros(shape=(Xr.shape[0],))
Yc = np.ones(shape=(Xc.shape[0],))
Y = np.concatenate((Yr, Yc))
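To see what such a label vector looks like, here is a sketch with hypothetical document counts (independent of the corpus, so the sizes are made up):

```python
import numpy as np

# Hypothetical counts standing in for the number of rubber/cotton documents
n_rubber, n_cotton = 3, 2

Yr = np.zeros(shape=(n_rubber,))  # class 0: rubber
Yc = np.ones(shape=(n_cotton,))   # class 1: cotton
Y = np.concatenate((Yr, Yc))
# → array([0., 0., 0., 1., 1.])
```

Because the documents were concatenated rubber-first, the zeros precede the ones; any later train/test split should shuffle the samples.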