Creating our first classifier
Let's start with the simple and beautiful nearest-neighbor method from Chapter 2, Classifying with Real-world Examples. Although it is not as advanced as other methods, it is very powerful: as it is not model-based, it can LEARN nearly any data. But this beauty comes with a clear disadvantage, which we will find out very soon (and which is why we capitalized "learn" in the previous sentence).
Engineering the features
As mentioned earlier, we will use the Text and Score features to train our classifier. The problem with Text is that the classifier does not work well with strings. We will have to convert it into one or more numbers. So, what statistics could be useful to extract from a post? Let's start with the number of HTML links, assuming that good posts have a higher chance of having links in them.
We can do this with regular expressions. The following captures all HTML link tags that start with http:// (ignoring the other protocols for now):
import re link_match...
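As a minimal sketch of the link-counting idea, the following turns a post's HTML text into a single number. The pattern `LINK_PATTERN` and the helper `count_links` are hypothetical names chosen for illustration, and the pattern is a simplified assumption (an `<a>` tag with a double-quoted `href` beginning with http://); it is not necessarily the expression used in the listing above.

```python
import re

# Assumed, simplified pattern: an <a ...> tag whose double-quoted href
# starts with http:// (other protocols such as https:// are ignored).
LINK_PATTERN = re.compile(r'<a\s+[^>]*href="http://[^"]*"[^>]*>', re.IGNORECASE)


def count_links(html):
    """Return how many http:// link tags appear in the given HTML string."""
    return len(LINK_PATTERN.findall(html))


sample = ('See <a href="http://example.com">this</a> and '
          '<a href="https://example.org">that</a>.')
print(count_links(sample))  # only the http:// link is counted
```

The resulting count can then serve as one numeric feature per post. A regular expression is a rough tool for HTML; it is enough for a first feature, but an HTML parser would be more robust against unusual markup.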