Preprocessing – similarity measured as a similar number of common words
As we have seen earlier, the bag-of-words approach is both fast and robust. It is not, however, without challenges. Let's dive directly into them.
Converting raw text into a bag of words
We do not have to write custom code for counting words and representing those counts as a vector. Scikit-learn's CountVectorizer class does the job efficiently and also has a very convenient interface:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
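To see what the vectorizer actually produces, here is a minimal sketch with two made-up sentences (the toy corpus is our own; get_feature_names_out() requires scikit-learn 1.0 or later, with older versions offering get_feature_names() instead):

>>> content = ["How to format my hard disk", "Hard disk format problems"]
>>> X = vectorizer.fit_transform(content)
>>> vectorizer.get_feature_names_out()
array(['disk', 'format', 'hard', 'how', 'my', 'problems', 'to'],
      dtype=object)
>>> X.toarray()
array([[1, 1, 1, 1, 1, 0, 1],
       [1, 1, 1, 0, 0, 1, 0]])

Each row is the count vector of one sentence over the shared vocabulary, which is exactly the bag-of-words representation we are after.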
The min_df parameter determines how CountVectorizer treats rarely occurring words (minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that number will be dropped. If it is set to a fraction, all words that occur in less than that fraction of the overall dataset will be dropped.
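To make both flavors concrete, here is a small sketch on a made-up three-document corpus. With min_df=2, a word must occur in at least two documents to survive; with min_df=0.3, it must occur in at least 30 percent of the documents (0.9 documents here, so every word passes):

>>> corpus = ["disk format", "disk failure", "disk problems"]
>>> CountVectorizer(min_df=2).fit(corpus).get_feature_names_out()
array(['disk'], dtype=object)
>>> CountVectorizer(min_df=0.3).fit(corpus).get_feature_names_out()
array(['disk', 'failure', 'format', 'problems'], dtype=object)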
The max_df parameter works in a similar manner for overly frequent words. If we print the instance, we can see what other parameters scikit provides together with their default values:
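Note that what print shows depends on the scikit-learn version: older releases echoed every parameter with its default value, whereas recent releases print only the parameters that differ from their defaults. A version-independent way to inspect them all is get_params(); the sketch below shows the behavior of recent versions, with max_df and ngram_range picked as just two examples of the available knobs:

>>> print(vectorizer)                # min_df=1 is the default, so nothing to show
CountVectorizer()
>>> params = vectorizer.get_params() # all parameters as a dict
>>> params['max_df'], params['min_df'], params['ngram_range']
(1.0, 1, (1, 1))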