Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining.

(For more resources related to this topic, see here.)

We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts to properly compute the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets.

We will cover the following topics in this article:

Downloading data from social network APIs
Transformers for text
Naive Bayes classifier
Using JSON for saving and loading datasets
The NLTK library for extracting features from text
The F-measure for evaluation

Disambiguation

Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it!

We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and index but the degree is significantly lower than that of a database.

One of the problems is the term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do.

In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available although hashtags are often used to denote the topic of the tweet.

When people talk about Python, they could be talking about the following things:

The programming language Python
Monty Python, the classic comedy group
The snake Python
A make of shoe called Python

There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to sort out spam from useful content. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage. It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting.

First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one.

Next, you'll need to ensure that you only make a certain number of requests per minute. This limit is currently 180 requests per hour. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API.

You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account.

When you are logged in, go to https://apps.twitter.com/ and click on Create New App.

Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application.

Keep the resulting website open—you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library.

You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter.

Create a new IPython Notebook to download the data. We will create several notebooks in this article for various different purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data.

First, we import the twitter library and set our authorization tokens. The consumer key, consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret)

We are going to get our tweets from Twitter's search function. We will create a reader that connects to twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored:

import os
output_filename = os.path.join(os.path.expanduser("~"),   "Data", "twitter", "python_tweets.json")

We also need the json library for saving our tweets:

import json

Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier:

t = twitter.Twitter(auth=authorization)

We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file. It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file:

with open(output_filename, 'a') as output_file:
   search_results = t.search.tweets(q="python", count=100)['statuses']
   for tweet in search_results:
       if 'text' in tweet:
           output_file.write(json.dumps(tweet))
           output_file.write("nn")

In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for.

Running this for a few minutes will result in 100 tweets being added to the output file.

You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels.

The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy.

A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on.

To parse it, we can use the json library but we will have to first split the file by newlines to get the actual tweet objects themselves.

Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

As stated, we will use the json library, so import that too:

import json

We create a list that will store the tweets we received from the file:

tweets = []

We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows:

with open(input_filename) as inf:
   for line in inf:
       if len(line.strip()) == 0:
          continue
       tweets.append(json.loads(line))

We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python). We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not.

The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled.

First, we create a list for storing the labels. These labels will be stored whether or not the given tweet refers to the programming language Python, and it will allow our classifier to learn how to differentiate between meanings.

We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows:

labels = []
if os.path.exists(labels_filename):
   with open(labels_filename) as inf:
       labels = json.load(inf)

Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows:

def get_next_tweet():
   return tweet_sample[len(labels)]['text']

The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user.

Next we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top:

%%javascript

The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function.

The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows:

function set_label(label){
   var kernel = IPython.notebook.kernel;
   kernel.execute("labels.append(" + label + ")");
   load_next_tweet();
}

At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier).

However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callback is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation.

The code is as follows:

function load_next_tweet(){
   var code_input = "get_next_tweet()";
   var kernel = IPython.notebook.kernel;
   var callbacks = { 'iopub' : {'output' : handle_output}};
   kernel.execute(code_input, callbacks, {silent:false});
}

The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows:

function handle_output(out){
   var res = out.content.data["text/plain"];
$("div#tweet_text").html(res);
}

Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets.

Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section.

We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line:

%%html

For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows:

<div name="tweetbox">
   Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br>
Tweet: <div id="tweet_text" value="text"></div><br>
<input type=text id="capture"></input><br>
</div>

Don't run the cell just yet!

We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags.

We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows:

<script>
$("input#capture").keypress(function(e) {
if(e.which == 48) {
   set_label(0);
   $("input#capture").val("");
}else if (e.which == 49){
   set_label(1);
   $("input#capture").val("");
}
});

All other key presses are ignored.

As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows:

load_next_tweet();
</script>

After you run this cell, you will get an HTML textbox, alongside the first tweet's text. Click in the textbox and enter 1 if it is relevant to our goal (in this case, it means is the tweet related to the programming language Python) and a 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out.

When you finish all of this, simply save the labels to the output filename we defined earlier for the class values:

with open(labels_filename, 'w') as outf:
   json.dump(labels, outf)

You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets.

This might take a while to do this! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications.

Creating a replicable dataset from Twitter

In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results.

Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare.

On running the preceding code, you will get a different dataset to the one I created and used. The main reasons are that Twitter will return different search results for you than me based on the time you performed the search. Even after that, your labeling of tweets might be different from what I do. While there are obvious examples where a given tweet relates to the python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect.

Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly.

One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset.

First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did labeling but there is a new filename where we can store the replicable dataset. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"),   "Data", "twitter", "replicable_dataset.json")

We load the tweets and labels as we did in the previous notebook:

import json
tweets = []
with open(input_filename) as inf:
   for line in inf:
       if len(line.strip()) == 0:
           continue
       tweets.append(json.loads(line))
if os.path.exists(labels_filename):
   with open(classes_filename) as inf:
       labels = json.load(inf)

Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list:

dataset = [(tweet['id'], label) for tweet, label in zip(tweets,   labels)]

Finally, we save the results in our file:

with open(replicable_dataset, 'w') as outf:
   json.dump(dataset, outf)

Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book.

Loading the preceding dataset is not difficult but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows:

import os
tweet_filename = os.path.join(os.path.expanduser("~"), "Data",   "twitter", "replicable_python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data",   "twitter", "replicable_python_classes.json")
replicable_dataset = os.path.join(os.path.expanduser("~"),   "Data", "twitter", "replicable_dataset.json")

Then load the tweet IDs from the file using JSON:

import json
with open(replicable_dataset) as inf:
   tweet_ids = json.load(inf)

Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open file and save tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels will be incorrectly indexed against the data.

As an example, I tried to recreate the dataset just one day after collecting them and already two of the tweets were missing (they might be deleted or made private by the user). For this reason, it is important to only print out the labels that we need. To do this, we first create an empty actual labels list to store the labels for tweets that we actually recover from twitter, and then create a dictionary mapping the tweet IDs to the labels.

The code is as follows:

actual_labels = []
label_mapping = dict(tweet_ids)

Next, we are going to create a twitter server to collect all of these tweets. This is going to take a little longer. Import the twitter library that we used before, creating an authorization token and using that to create the twitter object:

import twitter
consumer_key = "<Your Consumer Key Here>"
consumer_secret = "<Your Consumer Secret Here>"
access_token = "<Your Access Token Here>"
access_token_secret = "<Your Access Token Secret Here>"
authorization = twitter.OAuth(access_token, access_token_secret,   consumer_key, consumer_secret)
t = twitter.Twitter(auth=authorization)

Iterate over each of the twitter IDs by extracting the IDs into a list using the following command:

all_ids = [tweet_id for tweet_id, label in tweet_ids]

Then, we open our output file to save the tweets:

with open(tweets_filename, 'a') as output_file:

The Twitter API allows us get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets:

   for start_index in range(0, len(tweet_ids), 100):

To search by ID, we first create a string that joins all of the IDs (in this batch) together:

       id_string = ",".join(str(i) for i in           all_ids[start_index:start_index+100])

Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us:

        search_results = t.statuses.lookup(_id=id_string)

Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally:

       for tweet in search_results:
           if 'text' in tweet:
               output_file.write(json.dumps(tweet))
               output_file.write("nn")

As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows:

               actual_labels.append(label_mapping[tweet['id']])

Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. As a final step here, save the actual_labels to our classes file:

with open(labels_filename, 'w') as outf:
   json.dump(actual_labels, outf)

Text transformers

Now that we have our dataset, how are we going to perform data mining on it?

Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word and average sentence length are used to predict the readability of a document. However, there are lots of feature types such as word occurrence which we will now investigate.

Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document.

Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,

Seven for the Dwarf-lords in halls of stone,

Nine for Mortal Men, doomed to die,

One for the Dark Lord on his dark throne

In the Land of Mordor where the Shadows lie.

One Ring to rule them all, One Ring to find them,

One Ring to bring them all and in the darkness bind them.

In the Land of Mordor where the Shadows lie.

- J.R.R. Tolkien's epigraph to The Lord of The Rings

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of.

We can create a dataset from this, choosing a subset of words and counting the frequency:

Word	the	one	ring	to
Frequency	9 Unlock access to the largest independent learning library in Tech for FREE! Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of. Renews at €14.99/month. Cancel anytime	4	3	4

We can use the counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well as only five are given and a very large number of words all share a tie for fifth place.

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use binary representation in this article.

Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in which it appears in the corpus

There are a number of libraries for working with text data in Python. We will use a major one, called Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However the NLTK version has more options for word tokenization. If you are doing natural language processing in python, NLTK is a great library to use.

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row.

They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document.

The value of n is a parameter. For English, setting it to between 2 to 5 is a good start, although some applications call for higher values.

As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words.

Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally. A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!).

Specially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is that of a character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article.

Other features

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt publication.

Naive Bayes

Naive Bayes is a probabilistic model that is unsurprisingly built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem

For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model.

Bayesian statistics instead model how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach).

Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section.

The equation for Bayes' theorem is given as follows:

As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such a tweet may be a pharmaceutical spam).

A, in this context, is the probability that this tweet is spam. We can compute P(A), called the prior belief directly from a training dataset by computing the percentage of tweets in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3.

B, in this context, is this tweet contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of tweets in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1. Note that we don't care if the e-mail is spam or not when computing this value.

P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2.

From here, we use Bayes' theorem to compute P(A|B), which is the probability that a tweet containing the word drugs is spam. Using the previous equation, we see the result is 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam.

Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view of this would rely on us creating a distribution of the probability of words in tweets to compute similar equations.

Naive Bayes algorithm

Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm.

With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class.

P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class.

P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue.

P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of each other. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, … and so on. Then, we multiply them together:

P(D|C) = P(D1|C) x P(D2|C).... x P(Dn|C)

Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times it is equal in our sample dataset.

In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models.

From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works

As an example, suppose we have the following (binary) feature values from a sample in our dataset: [0, 0, 0, 1].

Our training dataset contains two classes with 75 percent of samples belonging to the class 0, and 25 percent belonging to the class 1. The likelihood of the feature values for each class are as follows:

For class 0: [0.3, 0.4, 0.4, 0.7]

For class 1: [0.7, 0.3, 0.4, 0.9]

These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0.

We can now compute the probability that this sample should belong to the class 0. P(C=0) = 0.75 which is the probability that the class is 0.

P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation:

P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0)
= 0.3 x 0.6 x 0.6 x 0.7
= 0.0756

The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 – P(1).

Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of the class 1. Let's take a look at the calculation:

P(C=0|D) = P(C=0) P(D|C=0)
= 0.75 * 0.0756
= 0.0567

Now, we compute the same values for the class 1:

P(C=1) = 0.25

P(D) isn't needed for naive Bayes. Let's take a look at the calculation:

P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1)
= 0.7 x 0.7 x 0.6 x 0.9
= 0.2646
P(C=1|D) = P(C=1)P(D|C=1)
= 0.25 * 0.2646
= 0.06615

Normally, P(C=0|D) + P(C=1|D) should equal to 1. After all, those are the only two possible options! However, the probabilities are not 1 due to the fact we haven't included the computation of P(D) in our equations here.

The data point should be classified as belonging to the class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for the class 1. This is because we introduced a prior belief that most samples generally belong to the class 0.

If the classes had been equal sizes, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again.

Application

We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet.

To perform the word extraction, we will be using the NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well.

To get NLTK on your computer, use pip to install the package: pip3 install nltk

If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html.

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function.
Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step.
Train the Naive Bayes classifier, as we have seen in previous articles.
We will need to create another Notebook (last one for the article!) called ch6_classify_twitter for performing the classification.

Extracting word counts

We are going to use NLTK to extract our word counts. We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer to do this to obtain both fit and transform methods, enabling us to use this in a pipeline.

First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self which is necessary for transformer objects.

Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using the binary features here—True if in the document, False otherwise. If we wanted to use the frequency we would set up counting dictionaries.

Let's take a look at the code:

from sklearn.base import TransformerMixin
class NLTKBOW(TransformerMixin):
   def fit(self, X, y=None):
       return self
   def transform(self, X):
     return [{word: True for word in word_tokenize(document)}
                 for document in X]

The result is a list of dictionaries, where the first dictionary is the list of words in the first tweet, and so on. Each dictionary has a word as key and the value true to indicate this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False will also work, but will take up needless space to store.

Converting dictionaries to a matrix

This step converts the dictionaries built as per the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer.

The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class.

In our dataset, each dictionary has words as keys and only occurs if the word actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet.

To use DictVectorizer, simply import it using the following command:

from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier

Finally, we need to set up a classifier and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier that is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline:

from sklearn.naive_bayes import BernoulliNB

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows:

import os
input_filename = os.path.join(os.path.expanduser("~"), "Data",   "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data",   "twitter", "python_classes.json")

Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that. The code is as follows:

tweets = []
with open(input_filename) as inf:
   for line in inf:
       if len(line.strip()) == 0:
           continue
       tweets.append(json.loads(line)['text'])

Load the labels for each of the tweets:

with open(classes_filename) as inf:
   labels = json.load(inf)

Now, create a pipeline putting together the components from before. Our pipeline has three parts:

The NLTKBOW transformer we created
A DictVectorizer transformer
A BernoulliNB classifier

The code is as follows:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())
                     ])

We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, the use of accuracy is not adequate for datasets when the number of samples in each class is different.

Evaluation using the F1-score

When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy by poor utility.

While our dataset of tweets (typically, your results may vary) contains about 50 percent programming-related and 50 percent nonprogramming, many datasets aren't as balanced as this.

As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent!

To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called an f1-score (also called f-score, f-measure, or one of many other variations on this term).

The f1-score is defined on a per-class basis and is based on two concepts: the precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and actually labeled as belonging to that class.

In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the spam. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant?

After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall:

To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following line of code:

scores = cross_val_score(pipeline, tweets, labels, scoring='f1')

We then print out the average of the scores:

import numpy as np
print("Score: {:.3f}".format(np.mean(scores)))

The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programing language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it. Go back and collect more data and you will find that the results increase!

More data usually means a better accuracy, but it is not guaranteed!

Getting useful features from models

One question you may ask is what are the best features for determining if a tweet is relevant or not? We can extract this information from of our Naive Bayes model and find out which features are the best individually, according to Naive Bayes.

First we fit a new model. While the cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows:

model = pipeline.fit(tweets, labels)

Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate on a separate test split. We skip over that here for the sake of clarity.

A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model:

nb = model.named_steps['naive-bayes']

From this model, we can extract the probabilities for each word. These are stored as log probabilities, which is simply log(P(A|f)), where f is a given feature.

The reason these are stored as log probabilities is because the actual values are very low. For instance, the first value is -3.486, which correlates to a probability under 0.03 percent. Logarithm probabilities are used in computation involving small probabilities like this as they stop underflow errors where very small values are just rounded to zeros. Given that all of the probabilities are multiplied together, a single value of 0 will result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is.

We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows:

top_features = np.argsort(-feature_probabilities[1])[:50]

The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the feature's indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily this also records the mapping, allowing us to find the feature names that correlate to different columns. We can extract the features from that part of the pipeline:

dv = model.named_steps['vectorizer']

From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features:

for i, feature_index in enumerate(top_features):
   print(i, dv.feature_names_[feature_index],       np.exp(feature_probabilities[1][feature_index]))

The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues. Looking through the list though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one is usually in the format: We're looking for a candidate for this job.

Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter phrase for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary

In this article, we looked at text mining—how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context—was this tweet mentioning python referring to the programming language? We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook.

We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweet's IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced.

We used a Naive Bayes classifier to perform our text classification. This is built upon the Bayes' theorem that uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update new data, and incorporate a prior belief. In addition, the naive part allows to easily compute the frequencies without dealing with complex correlations between features.

The features we extracted were word occurrences—did this word occur in this tweet? This model is called bag-of-words. While this discards information about where a word was used, it still achieves a high accuracy on many datasets.

This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline for you, before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering).

In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.