Social Media Insight Using Naive Bayes

Packt
22 Feb 2016
48 min read
Text-based datasets contain a lot of information, whether they are books, historical documents, social media, e-mail, or any of the other ways we communicate via writing. Extracting features from text-based datasets and using them for classification is a difficult problem. There are, however, some common patterns for text mining. We look at disambiguating terms in social media using the Naive Bayes algorithm, which is a powerful and surprisingly simple algorithm. Naive Bayes takes a few shortcuts when computing the probabilities for classification, hence the term naive in the name. It can also be extended to other types of datasets quite easily and doesn't rely on numerical features. The model in this article is a baseline for text mining studies, as the process can work reasonably well for a variety of datasets. We will cover the following topics in this article:
- Downloading data from social network APIs
- Transformers for text
- The Naive Bayes classifier
- Using JSON for saving and loading datasets
- The NLTK library for extracting features from text
- The F-measure for evaluation

Disambiguation

Text is often called an unstructured format. There is a lot of information there, but it is just there; no headings, no required format, loose syntax, and other problems prohibit the easy extraction of information from text. The data is also highly connected, with lots of mentions and cross-references—just not in a format that allows us to easily extract it! We can compare the information stored in a book with that stored in a large database to see the difference. In the book, there are characters, themes, places, and lots of information. However, the book needs to be read and, more importantly, interpreted to gain this information. The database sits on your server with column names and data types. All the information is there and the level of interpretation needed is quite low. Information about the data, such as its type or meaning, is called metadata, and text lacks it. A book also contains some metadata in the form of a table of contents and index, but the degree is significantly lower than that of a database. One of the problems is term disambiguation. When a person uses the word bank, is this a financial message or an environmental message (such as a river bank)? This type of disambiguation is quite easy in many circumstances for humans (although there are still troubles), but much harder for computers to do. In this article, we will look at disambiguating the use of the term Python on Twitter's stream. A message on Twitter is called a tweet and is limited to 140 characters. This means there is little room for context. There isn't much metadata available, although hashtags are often used to denote the topic of the tweet. When people talk about Python, they could be talking about the following things:
- The programming language Python
- Monty Python, the classic comedy group
- The snake Python
- A make of shoe called Python
There can be many other things called Python. The aim of our experiment is to take a tweet mentioning Python and determine whether it is talking about the programming language, based only on the content of the tweet.

Downloading data from a social network

We are going to download a corpus of data from Twitter and use it to separate the relevant content from the irrelevant. Twitter provides a robust API for collecting information from its servers and this API is free for small-scale usage.
It is, however, subject to some conditions that you'll need to be aware of if you start using Twitter's data in a commercial setting. First, you'll need to sign up for a Twitter account (which is free). Go to http://twitter.com and register an account if you do not already have one. Next, you'll need to ensure that you only make a certain number of requests in any given time window. At the time of writing, this limit is 180 search requests per 15-minute window. It can be tricky ensuring that you don't breach this limit, so it is highly recommended that you use a library to talk to Twitter's API. You will need a key to access Twitter's data. Go to http://twitter.com and sign in to your account. When you are logged in, go to https://apps.twitter.com/ and click on Create New App. Create a name and description for your app, along with a website address. If you don't have a website to use, insert a placeholder. Leave the Callback URL field blank for this app—we won't need it. Agree to the terms of use (if you do) and click on Create your Twitter application. Keep the resulting website open—you'll need the access keys that are on this page. Next, we need a library to talk to Twitter. There are many options; the one I like is simply called twitter, and is the official Twitter Python library. You can install twitter using pip3 install twitter if you are using pip to install your packages. If you are using another system, check the documentation at https://github.com/sixohsix/twitter. Create a new IPython Notebook to download the data. We will create several notebooks in this article for various purposes, so it might be a good idea to also create a folder to keep track of them. This first notebook, ch6_get_twitter, is specifically for downloading new Twitter data. First, we import the twitter library and set our authorization tokens. The consumer key and consumer secret will be available on the Keys and Access Tokens tab on your Twitter app's page. To get the access tokens, you'll need to click on the Create my access token button, which is on the same page. Enter the keys into the appropriate places in the following code: import twitter consumer_key = "<Your Consumer Key Here>" consumer_secret = "<Your Consumer Secret Here>" access_token = "<Your Access Token Here>" access_token_secret = "<Your Access Token Secret Here>" authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret) We are going to get our tweets from Twitter's search function. We will create a reader that connects to Twitter using our authorization, and then use that reader to perform searches. In the Notebook, we set the filename where the tweets will be stored: import os output_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") We also need the json library for saving our tweets: import json Next, create an object that can read from Twitter. We create this object with our authorization object that we set up earlier: t = twitter.Twitter(auth=authorization) We then open our output file for writing. We open it for appending—this allows us to rerun the script to obtain more tweets. We then use our Twitter connection to perform a search for the word Python. We only want the statuses that are returned for our dataset. This code takes the tweet, uses the json library to create a string representation using the dumps function, and then writes it to the file.
It then creates a blank line under the tweet so that we can easily distinguish where one tweet starts and ends in our file: with open(output_filename, 'a') as output_file: search_results = t.search.tweets(q="python", count=100)['statuses'] for tweet in search_results: if 'text' in tweet: output_file.write(json.dumps(tweet)) output_file.write("\n\n") In the preceding loop, we also perform a check to see whether there is text in the tweet or not. Not all of the objects returned by Twitter will be actual tweets (some will be actions to delete tweets and others). The key difference is the inclusion of text as a key, which we test for. Running this for a few minutes will result in 100 tweets being added to the output file. You can keep rerunning this script to add more tweets to your dataset, keeping in mind that you may get some duplicates in the output file if you rerun it too fast (that is, before Twitter gets new tweets to return!).

Loading and classifying the dataset

After we have collected a set of tweets (our dataset), we need labels to perform classification. We are going to label the dataset by setting up a form in an IPython Notebook to allow us to enter the labels. The dataset we have stored is nearly in a JSON format. JSON is a format for data that doesn't impose much structure and is directly readable in JavaScript (hence the name, JavaScript Object Notation). JSON defines basic objects such as numbers, strings, lists, and dictionaries, making it a good format for storing datasets if they contain data that isn't numerical. If your dataset is fully numerical, you would save space and time using a matrix-based format like in NumPy. A key difference between our dataset and real JSON is that we included newlines between tweets. The reason for this was to allow us to easily append new tweets (the actual JSON format doesn't allow this easily). Our format is a JSON representation of a tweet, followed by a newline, followed by the next tweet, and so on. To parse it, we can use the json library, but we will have to first split the file by newlines to get the actual tweet objects themselves. Set up a new IPython Notebook (I called mine ch6_label_twitter) and enter the dataset's filename. This is the same filename in which we saved the data in the previous section. We also define the filename that we will use to save the labels to. The code is as follows: import os input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") As stated, we will use the json library, so import that too: import json We create a list that will store the tweets we received from the file: tweets = [] We then iterate over each line in the file. We aren't interested in lines with no information (they separate the tweets for us), so check if the length of the line (minus any whitespace characters) is zero. If it is, ignore it and move to the next line. Otherwise, load the tweet using json.loads (which loads a JSON object from a string) and add it to our list of tweets. The code is as follows: with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)) We are now interested in classifying whether an item is relevant to us or not (in this case, relevant means refers to the programming language Python).
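As an aside, the loading loop above can be wrapped into a small reusable helper. This is a minimal sketch, not part of the original notebooks, and it assumes the same file format we just described—one JSON object per line, with blank lines in between:

```python
import json

def load_tweets(filename):
    """Load tweets stored as one JSON object per line, separated by blank lines."""
    tweets = []
    with open(filename) as inf:
        for line in inf:
            if len(line.strip()) == 0:
                continue  # skip the blank separator lines
            tweets.append(json.loads(line))
    return tweets

# Usage (with the input_filename defined above):
# tweets = load_tweets(input_filename)
```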
We will use the IPython Notebook's ability to embed HTML and talk between JavaScript and Python to create a viewer of tweets to allow us to easily and quickly classify the tweets as spam or not. The code will present a new tweet to the user (you) and ask for a label: is it relevant or not? It will then store the input and present the next tweet to be labeled. First, we create a list for storing the labels. These labels will be stored whether or not the given tweet refers to the programming language Python, and it will allow our classifier to learn how to differentiate between meanings. We also check if we have any labels already and load them. This helps if you need to close the notebook down midway through labeling. This code will load the labels from where you left off. It is generally a good idea to consider how to save at midpoints for tasks like this. Nothing hurts quite like losing an hour of work because your computer crashed before you saved the labels! The code is as follows: labels = [] if os.path.exists(labels_filename): with open(labels_filename) as inf: labels = json.load(inf) Next, we create a simple function that will return the next tweet that needs to be labeled. We can work out which is the next tweet by finding the first one that hasn't yet been labeled. The code is as follows: def get_next_tweet(): return tweet_sample[len(labels)]['text'] The next step in our experiment is to collect information from the user (you!) on which tweets are referring to Python (the programming language) and which are not. As of yet, there is not a good, straightforward way to get interactive feedback with pure Python in IPython Notebooks. For this reason, we will use some JavaScript and HTML to get this input from the user. Next we create some JavaScript in the IPython Notebook to run our input. Notebooks allow us to use magic functions to embed HTML and JavaScript (among other things) directly into the Notebook itself. Start a new cell with the following line at the top: %%javascript The code in here will be in JavaScript, hence the curly braces that are coming up. Don't worry, we will get back to Python soon. Keep in mind here that the following code must be in the same cell as the %%javascript magic function. The first function we will define in JavaScript shows how easy it is to talk to your Python code from JavaScript in IPython Notebooks. This function, if called, will add a label to the labels array (which is in python code). To do this, we load the IPython kernel as a JavaScript object and give it a Python command to execute. The code is as follows: function set_label(label){ var kernel = IPython.notebook.kernel; kernel.execute("labels.append(" + label + ")"); load_next_tweet(); } At the end of that function, we call the load_next_tweet function. This function loads the next tweet to be labeled. It runs on the same principle; we load the IPython kernel and give it a command to execute (calling the get_next_tweet function we defined earlier). However, in this case we want to get the result. This is a little more difficult. We need to define a callback, which is a function that is called when the data is returned. The format for defining callback is outside the scope of this book. If you are interested in more advanced JavaScript/Python integration, consult the IPython documentation. 
The code is as follows: function load_next_tweet(){ var code_input = "get_next_tweet()"; var kernel = IPython.notebook.kernel; var callbacks = { 'iopub' : {'output' : handle_output}}; kernel.execute(code_input, callbacks, {silent:false}); } The callback function is called handle_output, which we will define now. This function gets called when the Python function that kernel.execute calls returns a value. As before, the full format of this is outside the scope of this book. However, for our purposes the result is returned as data of the type text/plain, which we extract and show in the #tweet_text div of the form we are going to create in the next cell. The code is as follows: function handle_output(out){ var res = out.content.data["text/plain"]; $("div#tweet_text").html(res); } Our form will have a div that shows the next tweet to be labeled, which we will give the ID #tweet_text. We also create a textbox to enable us to capture key presses (otherwise, the Notebook will capture them and JavaScript won't do anything). This allows us to use the keyboard to set labels of 1 or 0, which is faster than using the mouse to click buttons—given that we will need to label at least 100 tweets. Run the previous cell to embed some JavaScript into the page, although nothing will be shown to you in the results section. We are going to use a different magic function now, %%html. Unsurprisingly, this magic function allows us to directly embed HTML into our Notebook. In a new cell, start with this line: %%html For this cell, we will be coding in HTML and a little JavaScript. First, define a div element to store our current tweet to be labeled. I've also added some instructions for using this form. Then, create the #tweet_text div that will store the text of the next tweet to be labeled. As stated before, we need to create a textbox to be able to capture key presses. The code is as follows: <div name="tweetbox"> Instructions: Click in textbox. Enter a 1 if the tweet is relevant, enter 0 otherwise.<br> Tweet: <div id="tweet_text" value="text"></div><br> <input type=text id="capture"></input><br> </div> Don't run the cell just yet! We create the JavaScript for capturing the key presses. This has to be defined after creating the form, as the #tweet_text div doesn't exist until the above code runs. We use the JQuery library (which IPython is already using, so we don't need to include the JavaScript file) to add a function that is called when key presses are made on the #capture textbox we defined. However, keep in mind that this is a %%html cell and not a JavaScript cell, so we need to enclose this JavaScript in the <script> tags. We are only interested in key presses if the user presses the 0 or the 1, in which case the relevant label is added. We can determine which key was pressed by the ASCII value stored in e.which. If the user presses 0 or 1, we append the label and clear out the textbox. The code is as follows: <script> $("input#capture").keypress(function(e) { if(e.which == 48) { set_label(0); $("input#capture").val(""); }else if (e.which == 49){ set_label(1); $("input#capture").val(""); } }); All other key presses are ignored. As a last bit of JavaScript for this article (I promise), we call the load_next_tweet() function. This will set the first tweet to be labeled and then close off the JavaScript. The code is as follows: load_next_tweet(); </script> After you run this cell, you will get an HTML textbox, alongside the first tweet's text. 
Click in the textbox and enter 1 if it is relevant to our goal (in this case, it means is the tweet related to the programming language Python) and a 0 if it is not. After you do this, the next tweet will load. Enter the label and the next one will load. This continues until the tweets run out. When you finish all of this, simply save the labels to the output filename we defined earlier for the class values: with open(labels_filename, 'w') as outf: json.dump(labels, outf) You can call the preceding code even if you haven't finished. Any labeling you have done to that point will be saved. Running this Notebook again will pick up where you left off and you can keep labeling your tweets. This might take a while to do this! If you have a lot of tweets in your dataset, you'll need to classify all of them. If you are pushed for time, you can download the same dataset I used, which contains classifications. Creating a replicable dataset from Twitter In data mining, there are lots of variables. These aren't just in the data mining algorithms—they also appear in the data collection, environment, and many other factors. Being able to replicate your results is important as it enables you to verify or improve upon your results. Getting 80 percent accuracy on one dataset with algorithm X, and 90 percent accuracy on another dataset with algorithm Y doesn't mean that Y is better. We need to be able to test on the same dataset in the same conditions to be able to properly compare. On running the preceding code, you will get a different dataset to the one I created and used. The main reasons are that Twitter will return different search results for you than me based on the time you performed the search. Even after that, your labeling of tweets might be different from what I do. While there are obvious examples where a given tweet relates to the python programming language, there will always be gray areas where the labeling isn't obvious. One tough gray area I ran into was tweets in non-English languages that I couldn't read. In this specific instance, there are options in Twitter's API for setting the language, but even these aren't going to be perfect. Due to these factors, it is difficult to replicate experiments on databases that are extracted from social media, and Twitter is no exception. Twitter explicitly disallows sharing datasets directly. One solution to this is to share tweet IDs only, which you can share freely. In this section, we will first create a tweet ID dataset that we can freely share. Then, we will see how to download the original tweets from this file to recreate the original dataset. First, we save the replicable dataset of tweet IDs. Creating another new IPython Notebook, first set up the filenames. This is done in the same way we did labeling but there is a new filename where we can store the replicable dataset. 
The code is as follows: import os input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json") We load the tweets and labels as we did in the previous notebook: import json tweets = [] with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)) if os.path.exists(labels_filename): with open(labels_filename) as inf: labels = json.load(inf) Now we create a dataset by looping over both the tweets and labels at the same time and saving those in a list: dataset = [(tweet['id'], label) for tweet, label in zip(tweets, labels)] Finally, we save the results in our file: with open(replicable_dataset, 'w') as outf: json.dump(dataset, outf) Now that we have the tweet IDs and labels saved, we can recreate the original dataset. If you are looking to recreate the dataset I used for this article, it can be found in the code bundle that comes with this book. Loading the preceding dataset is not difficult but it can take some time. Start a new IPython Notebook and set the dataset, label, and tweet ID filenames as before. I've adjusted the filenames here to ensure that you don't overwrite your previously collected dataset, but feel free to change these if you want. The code is as follows: import os tweets_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_python_classes.json") replicable_dataset = os.path.join(os.path.expanduser("~"), "Data", "twitter", "replicable_dataset.json") Then load the tweet IDs from the file using JSON: import json with open(replicable_dataset) as inf: tweet_ids = json.load(inf) Saving the labels is very easy. We just iterate through this dataset and extract the IDs. We could do this quite easily with just two lines of code (open the file and save the tweets). However, we can't guarantee that we will get all the tweets we are after (for example, some may have been changed to private since collecting the dataset) and therefore the labels will be incorrectly indexed against the data. As an example, I tried to recreate the dataset just one day after collecting them and already two of the tweets were missing (they might be deleted or made private by the user). For this reason, it is important to only print out the labels that we need. To do this, we first create an empty actual_labels list to store the labels for tweets that we actually recover from Twitter, and then create a dictionary mapping the tweet IDs to the labels. The code is as follows: actual_labels = [] label_mapping = dict(tweet_ids) Next, we are going to connect to Twitter again to collect all of these tweets. This is going to take a little longer.
Import the twitter library that we used before, creating an authorization token and using that to create the twitter object: import twitter consumer_key = "<Your Consumer Key Here>" consumer_secret = "<Your Consumer Secret Here>" access_token = "<Your Access Token Here>" access_token_secret = "<Your Access Token Secret Here>" authorization = twitter.OAuth(access_token, access_token_secret, consumer_key, consumer_secret) t = twitter.Twitter(auth=authorization) Iterate over each of the tweet IDs by extracting the IDs into a list using the following command: all_ids = [tweet_id for tweet_id, label in tweet_ids] Then, we open our output file to save the tweets: with open(tweets_filename, 'a') as output_file: The Twitter API allows us to get 100 tweets at a time. Therefore, we iterate over each batch of 100 tweets: for start_index in range(0, len(tweet_ids), 100): To search by ID, we first create a string that joins all of the IDs (in this batch) together: id_string = ",".join(str(i) for i in all_ids[start_index:start_index+100]) Next, we perform a statuses/lookup API call, which is defined by Twitter. We pass our list of IDs (which we turned into a string) into the API call in order to have those tweets returned to us: search_results = t.statuses.lookup(_id=id_string) Then for each tweet in the search results, we save it to our file in the same way we did when we were collecting the dataset originally: for tweet in search_results: if 'text' in tweet: output_file.write(json.dumps(tweet)) output_file.write("\n\n") As a final step here (and still under the preceding if block), we want to store the labeling of this tweet. We can do this using the label_mapping dictionary we created before, looking up the tweet ID. The code is as follows: actual_labels.append(label_mapping[tweet['id']]) Run the previous cell and the code will collect all of the tweets for you. If you created a really big dataset, this may take a while—Twitter does rate-limit requests. Finally, save the actual_labels to our classes file: with open(labels_filename, 'w') as outf: json.dump(actual_labels, outf)

Text transformers

Now that we have our dataset, how are we going to perform data mining on it? Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with? There are a number of measurements that could be taken. For instance, average word and average sentence length are used to predict the readability of a document. However, there are lots of feature types such as word occurrence, which we will now investigate.

Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document. Here's an excerpt from The Lord of the Rings, J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.
- J.R.R. Tolkien's epigraph to The Lord of the Rings
The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of. We can create a dataset from this, choosing a subset of words and counting the frequency:

Word:        the    one    ring    to
Frequency:     9      4       3     4

We can use the Counter class to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows: s = """Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in halls of stone, Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne In the Land of Mordor where the Shadows lie. One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them. In the Land of Mordor where the Shadows lie. """.lower() words = s.split() from collections import Counter c = Counter(words) Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place. The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from fewer words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this article. Another popular (arguably more popular) method for performing normalization is called term frequency - inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then scaled down according to the number of documents in which the term appears in the corpus. There are a number of libraries for working with text data in Python. We will use a major one, called Natural Language ToolKit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a set of n words that appear in a row. They are counted the same way, with the n-grams forming a word that is put in the bag. The value of a cell in this dataset is the frequency that a particular n-gram appears in the given document. The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values. As an example, for n=3, we extract the first few n-grams in the following quote: Always look on the bright side of life. The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, the n-grams overlap and cover three words. Word n-grams have advantages over using single words. This simple concept introduces some context to word use by considering its local environment, without a large overhead of understanding the language computationally.
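To make the idea concrete, here is a minimal sketch (not from the original article) that extracts overlapping word n-grams from the quote above. A real tokenizer such as NLTK's would also strip the punctuation; scikit-learn's CountVectorizer offers the same behaviour through its ngram_range parameter:

```python
def word_ngrams(text, n=3):
    """Return the overlapping word n-grams of a piece of text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

quote = "Always look on the bright side of life."
for ngram in word_ngrams(quote, n=3):
    print(ngram)
# ('always', 'look', 'on')
# ('look', 'on', 'the')
# ('on', 'the', 'bright')
# ...
```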
A disadvantage of using n-grams is that the matrix becomes even sparser—word n-grams are unlikely to appear twice (especially in tweets and other short documents!). Especially for social media and other short documents, word n-grams are unlikely to appear in too many different tweets, unless it is a retweet. However, in larger documents, word n-grams are quite effective for many applications. Another form of n-gram for text documents is that of a character n-gram. Rather than using sets of words, we simply use sets of characters (although character n-grams have lots of options for how they are computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this article.

Other features

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this book. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.

Naive Bayes

Naive Bayes is a probabilistic model that is unsurprisingly built upon a naive interpretation of Bayesian statistics. Despite the naive aspect, the method performs very well in a large number of contexts. It can be used for classification of many different feature types and formats, but we will focus on one in this article: binary features in the bag-of-words model.

Bayes' theorem

For most of us, when we were taught statistics, we started from a frequentist approach. In this approach, we assume the data comes from some distribution and we aim to determine what the parameters are for that distribution. However, those parameters are (perhaps incorrectly) assumed to be fixed. We use our model to describe the data, even testing to ensure the data fits our model. Bayesian statistics instead model how people (non-statisticians) actually reason. We have some data and we use that data to update our model about how likely something is to occur. In Bayesian statistics, we use the data to describe the model rather than using a model and confirming it with data (as per the frequentist approach). Bayes' theorem computes the value of P(A|B), that is, knowing that B has occurred, what is the probability of A. In most cases, B is an observed event such as it rained yesterday, and A is a prediction it will rain today. For data mining, B is usually we observed this sample and A is it belongs to this class. We will see how to use Bayes' theorem for data mining in the next section. The equation for Bayes' theorem is given as follows: P(A|B) = P(B|A) x P(A) / P(B) As an example, we want to determine the probability that an e-mail containing the word drugs is spam (as we believe that such an e-mail may be pharmaceutical spam). A, in this context, is the probability that this e-mail is spam. We can compute P(A), called the prior belief, directly from a training dataset by computing the percentage of e-mails in our dataset that are spam. If our dataset contains 30 spam messages for every 100 e-mails, P(A) is 30/100 or 0.3. B, in this context, is this e-mail contains the word 'drugs'. Likewise, we can compute P(B) by computing the percentage of e-mails in our dataset containing the word drugs. If 10 e-mails in every 100 of our training dataset contain the word drugs, P(B) is 10/100 or 0.1.
Note that we don't care if the e-mail is spam or not when computing this value. P(B|A) is the probability that an e-mail contains the word drugs if it is spam. It is also easy to compute from our training dataset. We look through our training set for spam e-mails and compute the percentage of them that contain the word drugs. Of our 30 spam e-mails, if 6 contain the word drugs, then P(B|A) is calculated as 6/30 or 0.2. From here, we use Bayes' theorem to compute P(A|B), which is the probability that an e-mail containing the word drugs is spam. Using the previous equation, we see the result is 0.6. This indicates that if an e-mail has the word drugs in it, there is a 60 percent chance that it is spam. Note the empirical nature of the preceding example—we use evidence directly from our training dataset, not from some preconceived distribution. In contrast, a frequentist view of this would rely on us creating a distribution of the probability of words in e-mails to compute similar equations.

Naive Bayes algorithm

Looking back at our Bayes' theorem equation, we can use it to compute the probability that a given sample belongs to a given class. This allows the equation to be used as a classification algorithm. With C as a given class and D as a sample in our dataset, we create the elements necessary for Bayes' theorem, and subsequently Naive Bayes. Naive Bayes is a classification algorithm that utilizes Bayes' theorem to compute the probability that a new data sample belongs to a particular class. P(C) is the probability of a class, which is computed from the training dataset itself (as we did with the spam example). We simply compute the percentage of samples in our training dataset that belong to the given class. P(D) is the probability of a given data sample. It can be difficult to compute this, as the sample is a complex interaction between different features, but luckily it is a constant across all classes. Therefore, we don't need to compute it at all. We will see later how to get around this issue. P(D|C) is the probability of the data point belonging to the class. This could also be difficult to compute due to the different features. However, this is where we introduce the naive part of the Naive Bayes algorithm. We naively assume that each feature is independent of the others. Rather than computing the full probability of P(D|C), we compute the probability of each feature D1, D2, D3, and so on. Then, we multiply them together: P(D|C) = P(D1|C) x P(D2|C) x ... x P(Dn|C) Each of these values is relatively easy to compute with binary features; we simply compute the percentage of times each feature is equal to 1 in the samples of the given class. In contrast, if we were to perform a non-naive Bayes version of this part, we would need to compute the correlations between different features for each class. Such computation is infeasible at best, and nearly impossible without vast amounts of data or adequate language analysis models. From here, the algorithm is straightforward. We compute P(C|D) for each possible class, ignoring the P(D) term. Then we choose the class with the highest probability. As the P(D) term is consistent across each of the classes, ignoring it has no impact on the final prediction.

How it works

As an example, suppose we have the following (binary) feature values from a sample in our dataset: [1, 0, 0, 1]. Our training dataset contains two classes, with 75 percent of samples belonging to the class 0 and 25 percent belonging to the class 1.
The likelihood of the feature values for each class are as follows: For class 0: [0.3, 0.4, 0.4, 0.7] For class 1: [0.7, 0.3, 0.4, 0.9] These values are to be interpreted as: for feature 1, it is a 1 in 30 percent of cases for class 0. We can now compute the probability that this sample should belong to the class 0. P(C=0) = 0.75 which is the probability that the class is 0. P(D) isn't needed for the Naive Bayes algorithm. Let's take a look at the calculation: P(D|C=0) = P(D1|C=0) x P(D2|C=0) x P(D3|C=0) x P(D4|C=0) = 0.3 x 0.6 x 0.6 x 0.7 = 0.0756 The second and third values are 0.6, because the value of that feature in the sample was 0. The listed probabilities are for values of 1 for each feature. Therefore, the probability of a 0 is its inverse: P(0) = 1 – P(1). Now, we can compute the probability of the data point belonging to this class. An important point to note is that we haven't computed P(D), so this isn't a real probability. However, it is good enough to compare against the same value for the probability of the class 1. Let's take a look at the calculation: P(C=0|D) = P(C=0) P(D|C=0) = 0.75 * 0.0756 = 0.0567 Now, we compute the same values for the class 1: P(C=1) = 0.25 P(D) isn't needed for naive Bayes. Let's take a look at the calculation: P(D|C=1) = P(D1|C=1) x P(D2|C=1) x P(D3|C=1) x P(D4|C=1) = 0.7 x 0.7 x 0.6 x 0.9 = 0.2646 P(C=1|D) = P(C=1)P(D|C=1) = 0.25 * 0.2646 = 0.06615 Normally, P(C=0|D) + P(C=1|D) should equal to 1. After all, those are the only two possible options! However, the probabilities are not 1 due to the fact we haven't included the computation of P(D) in our equations here. The data point should be classified as belonging to the class 1. You may have guessed this while going through the equations anyway; however, you may have been a bit surprised that the final decision was so close. After all, the probabilities in computing P(D|C) were much, much higher for the class 1. This is because we introduced a prior belief that most samples generally belong to the class 0. If the classes had been equal sizes, the resulting probabilities would be much different. Try it yourself by changing both P(C=0) and P(C=1) to 0.5 for equal class sizes and computing the result again. Application We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet. To perform the word extraction, we will be using the NLTK, a library that contains a large number of tools for performing analysis on natural language. We will use NLTK in future articles as well. To get NLTK on your computer, use pip to install the package: pip3 install nltk If that doesn't work, see the NLTK installation instructions at www.nltk.org/install.html. We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps: Transform the original text documents into a dictionary of counts using NLTK's word_tokenize function. Transform those dictionaries into a vector matrix using the DictVectorizer transformer in scikit-learn. This is necessary to enable the Naive Bayes classifier to read the feature values extracted in the first step. Train the Naive Bayes classifier, as we have seen in previous articles. We will need to create another Notebook (last one for the article!) called ch6_classify_twitter for performing the classification. Extracting word counts We are going to use NLTK to extract our word counts. 
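As a quick illustration of what the tokenizer produces, here is a minimal sketch (not part of the original notebook; it assumes NLTK's punkt tokenizer models have already been downloaded, for example via nltk.download('punkt')):

```python
from nltk import word_tokenize

# word_tokenize splits punctuation into separate tokens, which is why
# characters such as ':' and '#' show up as features later in this article.
print(word_tokenize("I love the Python programming language!"))
# ['I', 'love', 'the', 'Python', 'programming', 'language', '!']
```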
We still want to use it in a pipeline, but NLTK doesn't conform to our transformer interface. We will therefore need to create a basic transformer that provides both fit and transform methods, enabling us to use it in a pipeline. First, set up the transformer class. We don't need to fit anything in this class, as this transformer simply extracts the words in the document. Therefore, our fit is an empty function, except that it returns self, which is necessary for transformer objects. Our transform is a little more complicated. We want to extract each word from each document and record True if it was discovered. We are only using the binary features here—True if in the document, False otherwise. If we wanted to use the frequency we would set up counting dictionaries. Let's take a look at the code: from sklearn.base import TransformerMixin from nltk import word_tokenize class NLTKBOW(TransformerMixin): def fit(self, X, y=None): return self def transform(self, X): return [{word: True for word in word_tokenize(document)} for document in X] The result is a list of dictionaries, where the first dictionary is the list of words in the first tweet, and so on. Each dictionary has a word as key and the value True to indicate this word was discovered. Any word not in the dictionary will be assumed to have not occurred in the tweet. Explicitly stating that a word's occurrence is False will also work, but will take up needless space to store.

Converting dictionaries to a matrix

This step converts the dictionaries built as per the previous step into a matrix that can be used with a classifier. This step is made quite simple through the DictVectorizer transformer. The DictVectorizer class simply takes a list of dictionaries and converts them into a matrix. The features in this matrix are the keys in each of the dictionaries, and the values correspond to the occurrence of those features in each sample. Dictionaries are easy to create in code, but many data algorithm implementations prefer matrices. This makes DictVectorizer a very useful class. In our dataset, each dictionary has words as keys, and a word only appears as a key if it actually occurs in the tweet. Therefore, our matrix will have each word as a feature and a value of True in the cell if the word occurred in the tweet. To use DictVectorizer, simply import it using the following command: from sklearn.feature_extraction import DictVectorizer

Training the Naive Bayes classifier

Finally, we need to set up a classifier and we are using Naive Bayes for this article. As our dataset contains only binary features, we use the BernoulliNB classifier that is designed for binary features. As a classifier, it is very easy to use. As with DictVectorizer, we simply import it and add it to our pipeline: from sklearn.naive_bayes import BernoulliNB

Putting it all together

Now comes the moment to put all of these pieces together. In our IPython Notebook, set the filenames and load the dataset and classes as we have done before. Set the filenames for both the tweets themselves (not the IDs!) and the labels that we assigned to them. The code is as follows: import os import json input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json") labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json") Load the tweets themselves. We are only interested in the content of the tweets, so we extract the text value and store only that.
The code is as follows: tweets = [] with open(input_filename) as inf: for line in inf: if len(line.strip()) == 0: continue tweets.append(json.loads(line)['text']) Load the labels for each of the tweets: with open(labels_filename) as inf: labels = json.load(inf) Now, create a pipeline putting together the components from before. Our pipeline has three parts:
- The NLTKBOW transformer we created
- A DictVectorizer transformer
- A BernoulliNB classifier
The code is as follows: from sklearn.pipeline import Pipeline pipeline = Pipeline([('bag-of-words', NLTKBOW()), ('vectorizer', DictVectorizer()), ('naive-bayes', BernoulliNB()) ]) We can nearly run our pipeline now, which we will do with cross_val_score as we have done many times before. Before that though, we will introduce a better evaluation metric than the accuracy metric we used before. As we will see, the use of accuracy is not adequate for datasets when the number of samples in each class is different.

Evaluation using the F1-score

When choosing an evaluation metric, it is always important to consider cases where that evaluation metric is not useful. Accuracy is a good evaluation metric in many cases, as it is easy to understand and simple to compute. However, it can be easily faked. In other words, in many cases you can create algorithms that have a high accuracy but poor utility. While our dataset of tweets contains about 50 percent programming-related and 50 percent non-programming tweets (your results may vary), many datasets aren't as balanced as this. As an example, an e-mail spam filter may expect to see more than 80 percent of incoming e-mails be spam. A spam filter that simply labels everything as spam is quite useless; however, it will obtain an accuracy of 80 percent! To get around this problem, we can use other evaluation metrics. One of the most commonly employed is called an f1-score (also called f-score, f-measure, or one of many other variations on this term). The f1-score is defined on a per-class basis and is based on two concepts: the precision and recall. The precision is the percentage of all the samples that were predicted as belonging to a specific class that were actually from that class. The recall is the percentage of samples in the dataset that are in a class and actually labeled as belonging to that class. In the case of our application, we could compute the value for both classes (relevant and not relevant). However, we are really interested in the relevant tweets. Therefore, our precision computation becomes the question: of all the tweets that were predicted as being relevant, what percentage were actually relevant? Likewise, the recall becomes the question: of all the relevant tweets in the dataset, how many were predicted as being relevant? After you compute both the precision and recall, the f1-score is the harmonic mean of the precision and recall: F1 = 2 x (precision x recall) / (precision + recall) To use the f1-score in scikit-learn methods, simply set the scoring parameter to f1. By default, this will return the f1-score of the class with label 1. Running the code on our dataset, we simply use the following line of code: scores = cross_val_score(pipeline, tweets, labels, scoring='f1') We then print out the average of the scores: import numpy as np print("Score: {:.3f}".format(np.mean(scores))) The result is 0.798, which means we can accurately determine if a tweet using Python relates to the programming language nearly 80 percent of the time. This is using a dataset with only 200 tweets in it.
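For reference, the pieces of this notebook fit together roughly as shown in the following consolidated sketch. It is not an additional step, just the code above gathered into one runnable script; the file paths are the ones used earlier, and depending on your scikit-learn version, cross_val_score is imported from sklearn.model_selection (newer releases) or sklearn.cross_validation (older releases):

```python
import json
import os

import numpy as np
from nltk import word_tokenize
from sklearn.base import TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation on older versions
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline


class NLTKBOW(TransformerMixin):
    """Bag-of-words transformer: one {word: True} dictionary per document."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{word: True for word in word_tokenize(document)}
                for document in X]


input_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_tweets.json")
labels_filename = os.path.join(os.path.expanduser("~"), "Data", "twitter", "python_classes.json")

# Load the tweet texts (one JSON object per line, blank lines in between)
tweets = []
with open(input_filename) as inf:
    for line in inf:
        if len(line.strip()) == 0:
            continue
        tweets.append(json.loads(line)['text'])

# Load the matching labels
with open(labels_filename) as inf:
    labels = json.load(inf)

pipeline = Pipeline([('bag-of-words', NLTKBOW()),
                     ('vectorizer', DictVectorizer()),
                     ('naive-bayes', BernoulliNB())])

scores = cross_val_score(pipeline, tweets, labels, scoring='f1')
print("Score: {:.3f}".format(np.mean(scores)))
```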
Go back and collect more data and you will find that the results improve! More data usually means a better accuracy, but it is not guaranteed!

Getting useful features from models

One question you may ask is: what are the best features for determining if a tweet is relevant or not? We can extract this information from our Naive Bayes model and find out which features are the best individually, according to Naive Bayes. First we fit a new model. While cross_val_score gives us a score across different folds of cross-validated testing data, it doesn't easily give us the trained models themselves. To do this, we simply fit our pipeline with the tweets, creating a new model. The code is as follows: model = pipeline.fit(tweets, labels) Note that we aren't really evaluating the model here, so we don't need to be as careful with the training/testing split. However, before you put these features into practice, you should evaluate on a separate test split. We skip over that here for the sake of clarity. A pipeline gives you access to the individual steps through the named_steps attribute and the name of the step (we defined these names ourselves when we created the pipeline object itself). For instance, we can get the Naive Bayes model: nb = model.named_steps['naive-bayes'] From this model, we can extract the probabilities for each word: feature_probabilities = nb.feature_log_prob_ These are stored as log probabilities, which is simply log(P(f|C)), where f is a given feature and C a class. The reason these are stored as log probabilities is because the actual values are very low. For instance, the first value is -3.486, which corresponds to a probability of about 0.03. Logarithm probabilities are used in computations involving small probabilities like this as they stop underflow errors where very small values are just rounded to zero. Given that all of the probabilities are multiplied together, a single value of 0 will result in the whole answer always being 0! Regardless, the relationship between values is still the same; the higher the value, the more useful that feature is. We can get the most useful features by sorting the array of logarithm probabilities. We want descending order, so we simply negate the values first. The code is as follows: top_features = np.argsort(-feature_probabilities[1])[:50] The preceding code will just give us the indices and not the actual feature values. This isn't very useful, so we will map the features' indices to the actual values. The key is the DictVectorizer step of the pipeline, which created the matrices for us. Luckily this also records the mapping, allowing us to find the feature names that correlate to different columns. We can extract the features from that part of the pipeline: dv = model.named_steps['vectorizer'] From here, we can print out the names of the top features by looking them up in the feature_names_ attribute of DictVectorizer. Enter the following lines into a new cell and run it to print out a list of the top features: for i, feature_index in enumerate(top_features): print(i, dv.feature_names_[feature_index], np.exp(feature_probabilities[1][feature_index])) The first few features include :, http, # and @. These are likely to be noise (although the use of a colon is not very common outside programming), based on the data we collected. Collecting more data is critical to smoothing out these issues.
Looking through the list though, we get a number of more obvious programming features:

7 for 0.188679245283
11 with 0.141509433962
28 installing 0.0660377358491
29 Top 0.0660377358491
34 Developer 0.0566037735849
35 library 0.0566037735849
36 ] 0.0566037735849
37 [ 0.0566037735849
41 version 0.0471698113208
43 error 0.0471698113208

There are some others too that refer to Python in a work context, and therefore might be referring to the programming language (although freelance snake handlers may also use similar terms, they are less common on Twitter):

22 jobs 0.0660377358491
30 looking 0.0566037735849
31 Job 0.0566037735849
34 Developer 0.0566037735849
38 Freelancer 0.0471698113208
40 projects 0.0471698113208
47 We're 0.0471698113208

That last one is usually in the format: We're looking for a candidate for this job. Looking through these features gives us quite a few benefits. We could train people to recognize these tweets, look for commonalities (which give insight into a topic), or even get rid of features that make no sense. For example, the word RT appears quite high in this list; however, this is a common Twitter phrase for retweet (that is, forwarding on someone else's tweet). An expert could decide to remove this word from the list, making the classifier less prone to the noise we introduced by having a small dataset.

Summary

In this article, we looked at text mining—how to extract features from text, how to use those features, and ways of extending those features. In doing this, we looked at putting a tweet in context—was this tweet mentioning Python referring to the programming language? We downloaded data from a web-based API, getting tweets from the popular microblogging website Twitter. This gave us a dataset that we labeled using a form we built directly in the IPython Notebook. We also looked at reproducibility of experiments. While Twitter doesn't allow you to send copies of your data to others, it allows you to send the tweet IDs. Using this, we created code that saved the IDs and recreated most of the original dataset. Not all tweets were returned; some had been deleted in the time since the ID list was created and the dataset was reproduced. We used a Naive Bayes classifier to perform our text classification. This is built upon Bayes' theorem, which uses data to update the model, unlike the frequentist method that often starts with the model first. This allows the model to incorporate and update new data, and incorporate a prior belief. In addition, the naive part allows us to easily compute the frequencies without dealing with complex correlations between features. The features we extracted were word occurrences—did this word occur in this tweet? This model is called bag-of-words. While this discards information about where a word was used, it still achieves a high accuracy on many datasets. This entire pipeline of using the bag-of-words model with Naive Bayes is quite robust. You will find that it can achieve quite good scores on most text-based tasks. It is a great baseline for you, before trying more advanced models. As another advantage, the Naive Bayes classifier doesn't have any parameters that need to be set (although there are some if you wish to do some tinkering). In the next article, we will look at extracting features from another type of data, graphs, in order to make recommendations on who to follow on social media.
Training neural networks efficiently using Keras

Packt
22 Feb 2016
9 min read
In this article, we will take a look at Keras, one of the most recently developed libraries to facilitate neural network training. The development on Keras started in the early months of 2015; as of today, it has evolved into one of the most popular and widely used libraries built on top of Theano, and it allows us to utilize our GPU to accelerate neural network training. One of its prominent features is its very intuitive API, which allows us to implement neural networks in only a few lines of code. Once you have Theano installed, you can install Keras from PyPI by executing the following command from your terminal command line: pip install Keras For more information about Keras, please visit the official website at http://keras.io. To see what neural network training via Keras looks like, let's implement a multilayer perceptron to classify the handwritten digits from the MNIST dataset. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts as listed here:
- train-images-idx3-ubyte.gz: These are training set images (9912422 bytes)
- train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes)
- t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes)
- t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes)
After downloading and unzipping the archives, we place the files into a directory mnist in our current working directory, so that we can load the training as well as the test dataset using the following function: import os import struct import numpy as np def load_mnist(path, kind='train'): """Load MNIST data from `path`""" labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind) images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind) with open(labels_path, 'rb') as lbpath: magic, n = struct.unpack('>II', lbpath.read(8)) labels = np.fromfile(lbpath, dtype=np.uint8) with open(images_path, 'rb') as imgpath: magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16)) images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784) return images, labels X_train, y_train = load_mnist('mnist', kind='train') print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1])) Rows: 60000, columns: 784 X_test, y_test = load_mnist('mnist', kind='t10k') print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1])) Rows: 10000, columns: 784 On the following pages, we will walk through the code examples for using Keras step by step, which you can directly execute from your Python interpreter. However, if you are interested in training the neural network on your GPU, you can either put it into a Python script, or download the respective code from the Packt Publishing website. In order to run the Python script on your GPU, execute the following command from the directory where the mnist_keras_mlp.py file is located: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit format: >>> import theano >>> theano.config.floatX = 'float32' >>> X_train = X_train.astype(theano.config.floatX) >>> X_test = X_test.astype(theano.config.floatX) Next, we need to convert the class labels (integers 0-9) into the one-hot format.
Fortunately, Keras provides a convenient tool for this:

>>> from keras.utils import np_utils
>>> print('First 3 labels: ', y_train[:3])
First 3 labels:  [5 0 4]
>>> y_train_ohe = np_utils.to_categorical(y_train)
>>> print('\nFirst 3 labels (one-hot):\n', y_train_ohe[:3])
First 3 labels (one-hot):
[[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]

Now, we can get to the interesting part and implement a neural network. However, we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, replace the logistic function in the output layer with softmax, and add an additional hidden layer. Keras makes these tasks very simple, as you can see in the following code implementation:

>>> from keras.models import Sequential
>>> from keras.layers.core import Dense
>>> from keras.optimizers import SGD
>>> np.random.seed(1)
>>> model = Sequential()
>>> model.add(Dense(input_dim=X_train.shape[1],
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))
>>> model.add(Dense(input_dim=50,
...                 output_dim=y_train_ohe.shape[1],
...                 init='uniform',
...                 activation='softmax'))
>>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)
>>> model.compile(loss='categorical_crossentropy', optimizer=sgd)

First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (here, 784). Also, we have to make sure that the number of output units (output_dim) and input units (input_dim) of two consecutive layers match. In the preceding example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP implementation, where we initialized the bias units to 1, which is a more common (not necessarily better) convention.

Finally, the number of units in the output layer should be equal to the number of unique class labels, that is, the number of columns in the one-hot encoded class label array. Before we can compile our model, we also have to define an optimizer. In the preceding example, we chose stochastic gradient descent optimization. Furthermore, we can set values for the weight decay constant and momentum learning to adjust the learning rate at each epoch. Lastly, we set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multi-class predictions via softmax.

After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1. The validation_split parameter is especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for validation after each epoch, so that we can check if the model is overfitting during training.

>>> model.fit(X_train,
...           y_train_ohe,
...           nb_epoch=50,
...           batch_size=300,
...           verbose=1,
...           validation_split=0.1,
...           show_accuracy=True)
Train on 54000 samples, validate on 6000 samples
Epoch 0
54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342
Epoch 1
54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617
Epoch 2
54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707
Epoch 3
54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615
[…]
Epoch 49
54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482

Printing the value of the cost function is extremely useful during training, since we can quickly spot whether the cost is decreasing and, if it is not, stop the algorithm earlier to tune the hyperparameter values.

To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers:

>>> y_train_pred = model.predict_classes(X_train, verbose=0)
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions:  [5 0 4]

Finally, let's print the model accuracy on the training and test sets:

>>> train_acc = np.sum(
...     y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (train_acc * 100))
Training accuracy: 94.51%

>>> y_test_pred = model.predict_classes(X_test, verbose=0)
>>> test_acc = np.sum(y_test == y_test_pred,
...                   axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (test_acc * 100))
Test accuracy: 94.39%

Note that this is just a very simple neural network without optimized tuning parameters. If you are interested in playing more with Keras, please feel free to further tweak the learning rate, momentum, weight decay, and number of hidden units.

Although Keras is a great library for implementing and experimenting with neural networks, there are many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2 (http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal. Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more minimalistic but extensible library that offers more control over the underlying Theano code.

Summary

We caught a glimpse of one of the most beautiful and exciting families of algorithms in the whole machine learning field: artificial neural networks. I recommend following the work of the leading experts in this field, such as Geoff Hinton (http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun (http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), just to name a few.

To learn more about machine learning, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Building Machine Learning Systems with Python (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python)
Neural Network Programming with Java (https://www.packtpub.com/networking-and-servers/neural-network-programming-java)

Resources for Article:

Further resources on this subject:

Python Data Analysis Utilities [article]
Machine learning and Python – the Dream Team [article]
Adding a Spark to R [article]


Adding a Spark to R

Packt
22 Feb 2016
3 min read
Spark is written in a language called Scala. It has interfaces for use from Java and Python, and from the recent version 1.4.0 it also supports R. This is called SparkR, which we will describe in the next section. The four classes of libraries available in Spark are SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph algorithms). Currently, SparkR supports only SQL and DataFrames; the others are definitely on the roadmap. Spark can be downloaded from the Apache project page at http://spark.apache.org/downloads.html. Starting from the 1.4.0 version, SparkR is included in Spark and no separate download is required. (For more resources related to this topic, see here.)

SparkR

Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs through the RDD class. For example, using SparkR, users can run jobs on Spark from RStudio. SparkR can be invoked from RStudio. To enable this, include the following lines in your .Rprofile file that R uses at startup to initialize the environments:

Sys.setenv(SPARK_HOME = "/.../spark-1.5.0-bin-hadoop2.6")
# provide the correct path to the downloaded Spark folder for SPARK_HOME
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Once this is done, start RStudio and enter the following commands to start using SparkR:

> library(SparkR)
> sc <- sparkR.init(master="local")

As mentioned, as of version 1.5, the latest at the time of writing, SparkR supports limited functionalities of R. This mainly includes data slicing and dicing and summary stat functions. The current version does not support the use of contributed R packages; however, this is planned for a future release. On machine learning, SparkR currently supports the glm() function. We will do an example in the next section.

Linear regression using SparkR

In the following example, we will illustrate how to use SparkR for machine learning.

> library(SparkR)
> sc <- sparkR.init(master="local")
> sqlContext <- sparkRSQL.init(sc)
# Importing data
> df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv", header = T)
# Excluding variables Y2, X6, X8 and removing records beyond 768, which contain mainly null values
> df <- df[1:768, c(1,2,3,4,5,7,9)]
# Converting to a SparkR DataFrame
> dfsr <- createDataFrame(sqlContext, df)
> model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7, data = dfsr, family = "gaussian")
> summary(model)

Summary

In this article we have seen examples of SparkR and linear regression using SparkR. For more information on Spark you can refer to:

https://www.packtpub.com/big-data-and-business-intelligence/spark-python-developers
https://www.packtpub.com/big-data-and-business-intelligence/spark-beginners

Resources for Article:

Further resources on this subject:

Data Analysis Using R [article]
Introducing Bayesian Inference [article]
Bayesian Network Fundamentals [article]


Customizing heat maps (Intermediate)

Packt
22 Feb 2016
11 min read
This article will help you explore more advanced functions to customize the layout of heat maps. The main focus lies on the usage of different color palettes, but we will also cover other useful features, such as cell notes, that will be used in this recipe. (For more resources related to this topic, see here.)

To ensure that our heat maps look good in any situation, we will make use of different color palettes in this recipe, and we will even learn how to create our own. Further, we will add some more extras to our heat maps, including visual aids such as cell note labels, which will make them even more useful and accessible as a tool for visual data analysis. The following image shows a heat map with cell notes and an alternative color palette created from the arabidopsis_genes.csv data set:

Getting ready

Download the 5644OS_03_01.r script and the arabidopsis_genes.csv data set from your account at http://www.packtpub.com and save them to your hard drive. I recommend that you save the script and data file to the same folder on your hard drive. If you execute the script from a different location to the data file, you will have to change the current R working directory accordingly. The script will check automatically if any additional packages need to be installed in R.

How to do it...

Execute the following code in R via the 5644OS_03_01.r script and take a look at the PDF file custom_heatmaps.pdf that will be created in the current working directory:

### loading packages
if (!require("gplots")) {
    install.packages("gplots", dependencies = TRUE)
    library(gplots)
}
if (!require("RColorBrewer")) {
    install.packages("RColorBrewer", dependencies = TRUE)
    library(RColorBrewer)
}

### reading in data
gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

### setting heatmap.2() default parameters
heat2 <- function(...) heatmap.2(gene_data,
    tracecol = "black",
    dendrogram = "column",
    Rowv = NA,
    trace = "none",
    margins = c(8,10),
    density.info = "density", ...)

pdf("custom_heatmaps.pdf")

### 1) customizing colors
# 1.1) in-built color palettes
heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

# 1.2) RColorBrewer palettes
heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

# 1.3) creating own color palettes
my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

# 1.4) gray scale
heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

### 2) adding cell notes
fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = my_palette, main = "2) Cell Notes")

### 3) adding column side colors
heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "3) ColSideColors")

dev.off()

How it works...

Primarily, we will be using read.csv() and heatmap.2() to read data into R and construct our heat maps.
In this recipe, however, we will focus on advanced features to enhance our heat maps, such as customizing colors and other visual elements:

Inspecting the arabidopsis_genes.csv data set: The arabidopsis_genes.csv file contains a compilation of gene expression data from the model plant Arabidopsis thaliana. I obtained the freely available data of 16 different genes as log 2 ratios of target and reference gene from the Arabidopsis eFP Browser (http://bar.utoronto.ca/efp_arabidopsis/). For each gene, expression data of 47 different areas of the plant is available in this data file.

Reading the data and converting it into a numeric matrix: We have to convert the data table into a numeric matrix before we can construct our heat maps:

gene_data <- read.csv("arabidopsis_genes.csv")
row_names <- gene_data[,1]
gene_data <- data.matrix(gene_data[,2:ncol(gene_data)])
rownames(gene_data) <- row_names

Creating a customized heatmap.2() function: To reduce typing effort, we now define our own version of the heatmap.2() function, where we include some arguments that we plan to keep using throughout this recipe:

heat2 <- function(...) heatmap.2(gene_data,
    tracecol = "black",
    dendrogram = "column",
    Rowv = NA,
    trace = "none",
    margins = c(8,10),
    density.info = "density", ...)

So, each time we call our newly defined heat2() function, it will behave like the heatmap.2() function, except for the additional arguments that we pass along. We also set the tracecol parameter to "black" to better distinguish the density plot in the color key from the background.

The built-in color palettes: There are four more color palettes available in base R that we could use instead of the heat.colors palette: rainbow, terrain.colors, topo.colors, and cm.colors. So let us make use of the terrain.colors palette now, which will give us a nice color transition from green over yellow to rose:

heat2(col = terrain.colors(n = 1000), main = "1.1) Terrain Colors")

Every number for the parameter n that is larger than the default value 12 will add additional colors, which will make the transition smoother. A value of 1000 for the n parameter should be more than sufficient to make the transition between the individual colors indistinguishable to the human eye. The following image shows a side-by-side comparison of the heat.colors and terrain.colors color palettes using a different number of color shades:

Further, it is also possible to reverse the direction of the color transition. For example, if we want to have a heat.colors transition from yellow to red instead of red to yellow in our heat map, we could simply define a reverse function:

rev_heat.colors <- function(x) rev(heat.colors(x))
heat2(col = rev_heat.colors(500))

RColorBrewer palettes: A lot of color palettes are available from the RColorBrewer package. To see what they look like, you can type display.brewer.all() into the R command line after loading the RColorBrewer package. However, in contrast to the dynamic-range color palettes that we have seen previously, the RColorBrewer palettes have a distinct number of different colors.
So to select all nine colors from the YlOrRd palette, a gradient from yellow to red, we use the following command:

heat2(col = brewer.pal(n = 9, "YlOrRd"), main = "1.2) Brewer Palette")

The following image gives you a good overview of all the different color palettes that are available from the RColorBrewer package:

Creating our own color palettes: Next, we will see how we can create our own color palettes. A whole bunch of different colors are already defined in R. An overview of those colors can be seen by typing colors() into the command line of R. The most convenient way to assign new colors to a color palette is using hex colors (hexadecimal colors). Many different online tools are freely available that allow us to obtain the necessary hex codes. A great example is ColorPicker (http://www.colorpicker.com), which allows us to choose from a rich color table and provides us with the corresponding hex codes. Once we gather all the hexadecimal codes for the colors that we want to use for our color palette, we can assign them to a variable as we have done before with the explicit color names:

my_colors <- c(y1 = "#F7F7D0", y2 = "#FCFC3A", y3 = "#D4D40D",
               b1 = "#40EDEA", b2 = "#18B3F0", b3 = "#186BF0",
               r1 = "#FA8E8E", r2 = "#F26666", r3 = "#C70404")
heat2(col = my_colors, main = "1.3) Own Color Palette")

This is a very handy approach for creating a color key with very distinct colors. However, the downside of this method is that we have to provide a lot of different colors if we want to create a smooth color gradient; we used 1000 different colors for the terrain.colors() palette to get a smooth transition in the color key!

Using colorRampPalette for smoother color gradients: A convenient approach to create a smoother color gradient is to use the colorRampPalette() function, so we don't have to insert all the different colors manually. The function takes a vector of different colors as an argument. Here, we provide three colors: blue for the lower end of the color key, yellow for the middle range, and red for the higher end. As we did for the in-built color palettes, such as heat.colors, we assign the value 1000 to the n parameter:

my_palette <- colorRampPalette(c("blue", "yellow", "red"))(n = 1000)
heat2(col = my_palette, main = "1.3) ColorRampPalette")

In this case, it is more convenient to use discrete color names over hex colors, since we are using the colorRampPalette() function to create a gradient and do not need all the different shades of a particular color.

Grayscales: It might happen that the medium or device that we use to display our heat maps does not support colors. Under these circumstances, we can use the gray palette to create a heat map that is optimized for those conditions. The level parameter of the gray() function takes a vector with values between 0 and 1 as an argument, where 0 represents black and 1 represents white, respectively. For a smooth gradient, we use a vector of equally spaced shades of gray ranging from 0 to 1:

heat2(col = gray(level = (0:100)/100), main = "1.4) Gray Scale")

We can make use of the same color palettes for the levelplot() function too. It works in a similar way as it does for the heatmap.2() function that we are using in this recipe. However, inside the levelplot() function call, we must use col.regions instead of the simple col, so that we can include a color palette argument.
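Before committing to a palette inside a heat map, it can be handy to preview it on its own. The following minimal sketch uses only base R graphics; preview_palette() is a small helper written just for this illustration and is not part of the 5644OS_03_01.r script:

# optional: preview a color palette as a horizontal strip before using it in a heat map
preview_palette <- function(palette, title = "") {
    n <- length(palette)
    # draw n colored cells in a single row, one per palette entry
    image(matrix(1:n, ncol = 1), col = palette, axes = FALSE, main = title)
}
preview_palette(my_palette, "blue-yellow-red ramp")
preview_palette(gray((0:100)/100), "gray scale")

Each entry of the palette becomes one slice of the strip, so gaps or abrupt jumps in a hand-crafted palette become visible immediately.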
Adding cell notes to our heat map: Sometimes, we want to show a data set along with our heat map. A neat way is to use so-called cell notes to display data values inside the individual heat map cells. The underlying data matrix for the cell notes does not necessarily have to be the same numeric matrix we used to construct our heat map, as long as it has the same number of rows and columns. As we recall, the data we read from arabidopsis_genes.csv resembles log 2 ratios of sample and reference gene expression levels. Let us calculate the fold changes of the gene expression levels now and display them, rounded to two digits after the decimal point, as cell notes on our heat map:

fold_change <- 2^gene_data
rounded_fold_changes <- round(fold_change, 2)
heat2(cellnote = rounded_fold_changes, notecex = 0.5,
      notecol = "black", col = rev_heat.colors, main = "Cell Notes")

The notecex parameter controls the size of the cell notes. Its default size is 1, and every argument between 0 and 1 will make the font smaller, whereas values larger than 1 will make the font larger. Here, we decreased the font size of the cell notes by 50 percent to fit them into the cell boundaries. Also, we want to display the cell notes in black to have a nice contrast to the colored background; this is controlled by the notecol parameter.

Row and column side colors: Another approach to pronounce certain regions, that is, rows or columns on the heat map, is to make use of row and column side colors. The ColSideColors argument will place a colored box between the dendrogram and the heat map that can be used to annotate certain columns. We pass our vector of colors to ColSideColors, where its length must be equal to the number of columns of the heat map. Here, we want to color the first and third columns red, the second one gray, and all the remaining 13 columns green:

heat2(ColSideColors = c("red", "gray", "red", rep("green", 13)),
      main = "ColSideColors")

You can see in the following image what the column side colors look like when we include the ColSideColors argument as shown previously:

Attentive readers may have noticed that the order of colors in the column color box slightly differs from the order of colors we passed as a vector to ColSideColors. We see red two times next to each other, followed by a green and a gray box. This is due to the fact that the columns of our heat map have been reordered by the hierarchical clustering algorithm.

Summary

To learn more about similar technology, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Instant R Starter (https://www.packtpub.com/big-data-and-business-intelligence/instant-r-starter-instant)
Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition)
Mastering RStudio – Develop, Communicate, and Collaborate with R (https://www.packtpub.com/application-development/mastering-rstudio-%E2%80%93-develop-communicate-and-collaborate-r)

Resources for Article:

Further resources on this subject:

Data Analysis Using R [article]
Big Data Analysis [article]
Big Data Analysis (R and Hadoop) [article]


Building A Recommendation System with Azure

Packt
19 Feb 2016
7 min read
Recommender systems are common these days. You may not have noticed, but you might already be a user or receiver of such a system somewhere. Most of the well-performing e-commerce platforms use recommendation systems to recommend items to their users. When you see on the Amazon website that a book is recommended to you based on your earlier preferences, purchases, and browse history, Amazon is actually using such a recommendation system. Similarly, Netflix uses its recommendation system to suggest movies for you. (For more resources related to this topic, see here.)

A recommender or recommendation system is used to recommend a product or information, often based on user characteristics, preferences, history, and so on. So, a recommendation is always personalized. Until recently, it was not so easy or straightforward to build a recommender, but Azure ML makes it really easy to build one as long as you have your data ready. This article introduces you to the concept of recommendation systems and also the model available in ML Studio for you to build your own recommender system. It then walks you through the process of building a recommendation system with a simple example.

The Matchbox recommender

Microsoft has developed a large-scale recommender system based on a probabilistic (Bayesian) model called Matchbox. This model can learn about a user's preferences through observations made on how they rate items, such as movies, content, or other products. Based on those observations, it recommends new items to the users when requested. Matchbox uses the available data for each user in the most efficient way possible. The learning algorithm it uses is designed specifically for big data. However, its main feature is that Matchbox takes advantage of metadata available for both users and items. This means that the things it learns about one user or item can be transferred across to other users or items. You can find more information about the Matchbox model at the Microsoft Research project link.

Kinds of recommendations

The Matchbox recommender supports the building of four kinds of recommenders, which cover most scenarios. Let's take a look at the following list:

Rating Prediction: This predicts ratings for a given user and item; for example, if a new movie is released, the system will predict what your rating for that movie would be out of 1-5.
Item Recommendation: This recommends items to a given user; for example, Amazon suggests you books or YouTube suggests you videos to watch on its home page (especially when you are logged in).
Related Users: This finds users that are related to a given user; for example, LinkedIn suggests people that you can get connected to or Facebook suggests friends to you.
Related Items: This finds the items related to a given item; for example, a blog site suggests related posts when you are reading a blog post.

Understanding the recommender modules

The Matchbox recommender comes with three components; as you might have guessed, a module each to train, score, and evaluate the data. The modules are described as follows.

The train Matchbox recommender

This module contains the algorithm and generates the trained algorithm, as shown in the following screenshot:

This module takes the values for the following two parameters.

The number of traits

This value decides how many implicit features (traits) the algorithm will learn that are related to every user and item. The higher this value, the more precise the predictions will be.
Typically, it takes a value in the range of 2 to 20.

The number of recommendation algorithm iterations

This is the number of times the algorithm iterates over the data. The higher this value, the better the predictions will be. Typically, it takes a value in the range of 1 to 10.

The score Matchbox recommender

This module lets you specify the kind of recommendation and the corresponding parameters you want:

Rating Prediction
Item Prediction
Related Users
Related Items

Let's take a look at the following screenshot:

The ML Studio help page for the module provides details of all the corresponding parameters.

The evaluate recommender

This module takes a test and a scored dataset and generates evaluation metrics, as shown in the following screenshot:

It also lets you specify the kind of recommendation, such as the score module and corresponding parameters.

Building a recommendation system

Now, it would be worthwhile for you to learn to build one by yourself. We will build a simple recommender system to recommend restaurants to a given user. ML Studio includes three sample datasets, described as follows:

Restaurant customer data: This is a set of metadata about customers, including demographics and preferences, for example, latitude, longitude, interest, and personality.
Restaurant feature data: This is a set of metadata about restaurants and their features, such as food type, dining style, and location, for example, placeID, latitude, longitude, price.
Restaurant ratings: This contains the ratings given by users to restaurants on a scale of 0 to 2. It contains the columns: userID, placeID, and rating.

Now, we will build a recommender that will recommend a given number of restaurants to a user (userID). To build a recommender, perform the following steps:

1. Create a new experiment.
2. In the Search box in the modules palette, type Restaurant. The preceding three datasets get listed. Drag them all to the canvas one after another.
3. Drag a Split module and connect it to the output port of the Restaurant ratings module. In the properties section to the right, choose Splitting mode as Recommender Split. Leave the other parameters at their default values.
4. Drag a Project Columns module to the canvas and select the columns: userID, latitude, longitude, interest, and personality.
5. Similarly, drag another Project Columns module, connect it to the Restaurant feature data module, and select the columns: placeID, latitude, longitude, price, the_geom_meter, and address, zip.
6. Drag a Train Matchbox Recommender module to the canvas and make connections to the three input ports, as shown in the following screenshot:
7. Drag a Score Matchbox Recommender module to the canvas, make connections to the three input ports, and set the property values, as shown in the following screenshot:
8. Run the experiment and, when it completes, right-click on the output of the Score Matchbox Recommender module and click on Visualize to explore the scored data. You can note the different restaurants (IDs) recommended as items for a user from the test dataset.
9. The next step is to evaluate the scored prediction. Drag the Evaluate Recommender module to the canvas, connect the second output of the Split module to its first input port, and connect the output of the Score Matchbox Recommender module to its second input. Leave the module at its default properties.
10. Run the experiment again and, when finished, right-click on the output port of the Evaluate Recommender module and click on Visualize to find the evaluation metric.

The evaluation metric, Normalized Discounted Cumulative Gain (NDCG), is estimated from the ground truth ratings given in the test set. Its value ranges from 0.0 to 1.0, where 1.0 represents the most ideal ranking of the entities.
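To get a feeling for what this metric measures, the following small R sketch computes NDCG for a toy ranking of five items. The relevance scores are invented for illustration only; ML Studio computes the metric for you, so this is merely a sanity check of the common textbook formula:

# toy relevance scores, listed in the order the recommender ranked the items
ranked_relevance <- c(2, 0, 1, 2, 0)
# discounted cumulative gain: the score at position i is discounted by log2(i + 1)
dcg <- function(rel) sum(rel / log2(seq_along(rel) + 1))
# the ideal DCG uses the same scores sorted from most to least relevant
idcg <- dcg(sort(ranked_relevance, decreasing = TRUE))
ndcg <- dcg(ranked_relevance) / idcg
ndcg  # a value of 1.0 would mean the ranking is already ideal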
Summary

You started by gaining basic knowledge about recommender systems. You then explored the Matchbox recommender that comes with ML Studio, along with its components and the different kinds of recommendations that you can make with it. Finally, you built a simple recommendation system to recommend restaurants to a given user.

For more information on Azure, take a look at the following books, also by Packt Publishing:

Learning Microsoft Azure (https://www.packtpub.com/networking-and-servers/learning-microsoft-azure)
Microsoft Windows Azure Development Cookbook (https://www.packtpub.com/application-development/microsoft-windows-azure-development-cookbook)

Resources for Article:

Further resources on this subject:

Introduction to Microsoft Azure Cloud Services [article]
Microsoft Azure – Developing Web API for Mobile Apps [article]
Security in Microsoft Azure [article]


What is logistic regression?

Packt
19 Feb 2016
9 min read
In logistic regression, input features are linearly scaled just as with linear regression; however, the result is then fed as an input to the logistic function. This function provides a nonlinear transformation on its input and ensures that the range of the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0,1]. (For more resources related to this topic, see here.)

The form of the logistic function is as follows:

f(x) = 1 / (1 + e^(-x))

The plot of the logistic function is as follows:

When x = 0, the logistic function takes the value 0.5. As x tends to +∞, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -∞, the exponential, and hence the denominator, tends to move toward infinity and the function approaches the value 0. Thus, our output is guaranteed to be in the interval [0,1], which is necessary for it to be a probability.

Generalized linear models

Logistic regression belongs to a class of models known as generalized linear models (GLMs). Generalized linear models have three unifying characteristics. The first of these is that they all involve a linear combination of the input features, thus explaining part of their name. The second characteristic is that the output is considered to have an underlying probability distribution belonging to the family of exponential distributions. These include the normal distribution, the Poisson distribution, and the binomial distribution. Finally, the mean of the output distribution is related to the linear combination of input features by way of a function, known as the link function.

Let's see how this all ties in with logistic regression, which is just one of many examples of a GLM. We know that we begin with a linear combination of input features, so for example, in the case of one input feature, we can build up an x term as follows:

x = β0 + β1X1

Note that in the case of logistic regression, we are modeling a probability that the output belongs to class 1, rather than the output directly as we were in linear regression. As a result, we do not need to model the error term because our output, which is a probability, incorporates nondeterministic aspects of our model, such as measurement uncertainties, directly. Next, we apply the logistic function to this term in order to produce our model's output:

P(Y = 1 | X1) = 1 / (1 + e^-(β0 + β1X1))

Here, the left term tells us directly that we are computing the probability that our output belongs to class 1 based on our evidence of seeing the values of the input feature X1. For logistic regression, the underlying probability distribution of the output is the Bernoulli distribution. This is the same as the binomial distribution with a single trial and is the distribution we would obtain in an experiment with only two possible outcomes having constant probability, such as a coin flip. The mean of the Bernoulli distribution, μy, is the probability of the (arbitrarily chosen) outcome for success, in this case, class 1. Consequently, the left-hand side in the previous equation is also the mean of our underlying output distribution. For this reason, the function that transforms our linear combination of input features is sometimes known as the mean function, and we just saw that this function is the logistic function for logistic regression.

Now, to determine the link function for logistic regression, we can perform some simple algebraic manipulations in order to isolate our linear combination of input features:

ln( P(Y = 1 | X1) / (1 - P(Y = 1 | X1)) ) = β0 + β1X1
The term on the left-hand side is known as the log-odds or logit function and is the link function for logistic regression. The denominator of the fraction inside the logarithm is the probability of the output being class 0 given the data. Consequently, this fraction represents the ratio of probability between class 1 and class 0, which is also known as the odds ratio.

A good reference for logistic regression, along with examples of other GLMs such as Poisson regression, is Extending the Linear Model with R, Julian J. Faraway, CRC Press.

Interpreting coefficients in logistic regression

Looking at the right-hand side of the last equation, we can see that we have almost exactly the same form as we had for simple linear regression, barring the error term. The fact that we have the logit function on the left-hand side, however, means we cannot interpret our regression coefficients in the same way that we did with linear regression. In logistic regression, a unit increase in feature Xi results in multiplying the odds ratio by an amount e^βi. When a coefficient βi is positive, then we multiply the odds ratio by a number greater than 1, so we know that increasing the feature Xi will effectively increase the probability of the output being labeled as class 1. Similarly, increasing a feature with a negative coefficient shifts the balance toward predicting class 0. Finally, note that when we change the value of an input feature, the effect is a multiplication on the odds ratio and not on the model output itself, which we saw is the probability of predicting class 1. In absolute terms, the change in the output of our model as a result of a change in the input is not constant throughout but depends on the current value of our input features. This is, again, different from linear regression, where no matter what the values of the input features, the regression coefficients always represent a fixed increase in the output per unit increase of an input feature.
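To make the odds-ratio interpretation concrete, here is a brief R sketch on simulated data; the variable names and the true coefficients are invented purely for illustration:

# simulate a single feature and a binary outcome whose log-odds depend on it
set.seed(42)
x1 <- rnorm(500)
p  <- 1 / (1 + exp(-(-0.5 + 1.2 * x1)))      # true model: logit(p) = -0.5 + 1.2 * x1
y  <- rbinom(500, size = 1, prob = p)

# fit a logistic regression with the binomial family (the logit link is the default)
fit <- glm(y ~ x1, family = binomial)
coef(fit)        # estimates of beta0 and beta1 on the log-odds scale
exp(coef(fit))   # per-unit multiplicative effect of x1 on the odds of class 1

The exponentiated slope is the factor by which the odds of class 1 are multiplied for every unit increase in x1, which is exactly the e^βi effect described above.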
Assumptions of logistic regression

Logistic regression makes fewer assumptions about the input than linear regression. In particular, the nonlinear transformation of the logistic function means that we can model more complex input-output relationships. We still have a linearity assumption, but in this case, it is between the features and the log-odds. We no longer require a normality assumption for the residuals, nor do we need the homoscedasticity assumption. On the other hand, our error terms still need to be independent. Strictly speaking, the features themselves no longer need to be independent, but in practice, our model will still face issues if the features exhibit a high degree of multicollinearity.

Finally, we'll note that just like with unregularized linear regression, feature scaling does not affect the logistic regression model. This means that centering and scaling a particular input feature will simply result in an adjusted coefficient in the output model, without any repercussions on the model performance. It turns out that for logistic regression, this is the result of a property known as the invariance property of maximum likelihood, which is the method used to select the coefficients and will be the focus of the next section. It should be noted, however, that centering and scaling features might still be a good idea if they are on very different scales. This is done to assist the optimization procedure during training. In short, we should turn to feature scaling only if we run into model convergence issues.

Maximum likelihood estimation

When we studied linear regression, we found our coefficients by minimizing the sum of squared error terms. For logistic regression, we do this by maximizing the likelihood of the data. The likelihood of an observation is the probability of seeing that observation under a particular model. In our case, the likelihood of seeing an observation X for class 1 is simply given by the probability P(Y=1|X), the form of which was given earlier in this article. As we only have two classes, the likelihood of seeing an observation for class 0 is given by 1 - P(Y=1|X). The overall likelihood of seeing our entire data set of observations is the product of all the individual likelihoods for each data point, as we consider our observations to be independently obtained. As the likelihood of each observation is parameterized by the regression coefficients βi, the likelihood function for our entire data set is also, therefore, parameterized by these coefficients. We can express our likelihood function as an equation, as shown in the following equation:

L(β) = Π_{i: yi=1} P(Y=1|Xi) × Π_{i: yi=0} (1 - P(Y=1|Xi))

Now, this equation simply computes the probability that a logistic regression model with a particular set of regression coefficients could have generated our training data. The idea is to choose our regression coefficients so that this likelihood function is maximized. We can see that the form of the likelihood function is a product of two large products from the two big Π symbols. The first product contains the likelihood of all our observations for class 1, and the second product contains the likelihood of all our observations for class 0. We often refer to the log likelihood of the data, which is computed by taking the logarithm of the likelihood function and using the fact that the logarithm of a product of terms is the sum of the logarithm of each term:

log L(β) = Σ_{i: yi=1} log P(Y=1|Xi) + Σ_{i: yi=0} log(1 - P(Y=1|Xi))

We can simplify this even further using a classic trick to form just a single sum:

log L(β) = Σ_i [ yi log P(Y=1|Xi) + (1 - yi) log(1 - P(Y=1|Xi)) ]

To see why this is true, note that for the observations where the actual value of the output variable y is 1, the right term inside the summation is zero, so we are effectively left with the first sum from the previous equation. Similarly, when the actual value of y is 0, then we are left with the second summation from the previous equation. Note that maximizing the likelihood is equivalent to maximizing the log likelihood.

Maximum likelihood estimation is a fundamental technique of parameter fitting, and we will encounter it in other models in this book. Despite its popularity, it should be noted that maximum likelihood is not a panacea. Alternative training criteria on which to build a model exist, and there are some well-known scenarios under which this approach does not lead to a good model. Finally, note that the details of the actual optimization procedure that finds the values of the regression coefficients for maximum likelihood are beyond the scope of this book and in general, we can rely on R to implement this for us.
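As a rough illustration of this idea, and reusing the simulated x1, y, and fit objects from the earlier sketch (again an invented example rather than part of the original text), we can evaluate the single-sum log likelihood by hand and compare it with the value R reports for the fitted model:

# predicted probabilities P(Y = 1 | x) from the fitted model
p_hat <- predict(fit, type = "response")

# the single-sum form of the log likelihood
manual_loglik <- sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# R's own value for the maximized log likelihood; the two should agree closely
manual_loglik
logLik(fit)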
Summary

In this article, we demonstrated why logistic regression is a better way to approach classification problems than linear regression with a threshold, by showing that the least squares criterion is not the most appropriate criterion to use when trying to separate two classes. It turns out that logistic regression is not a great choice for multiclass settings in general.

To learn more about predictive analytics, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Learning Predictive Analytics with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-predictive-analytics-r)

Resources for Article:

Further resources on this subject:

Machine learning in practice [article]
Introduction to Machine Learning with R [article]
Training and Visualizing a neural network with R [article]

Twitter Sentiment Analysis

Packt
19 Feb 2016
30 min read
In this article, we will cover:

Twitter and its importance
Getting hands-on with Twitter's data and using various Twitter APIs
Using the data to solve business problems: a comparison of various businesses based on tweets

(For more resources related to this topic, see here.)

Twitter and its importance

Twitter can be considered an extension of the short message service (SMS), but on an Internet-based platform. In the words of Jack Dorsey, co-founder and co-creator of Twitter:

"...We came across the word 'twitter', and it was just perfect. The definition was 'a short burst of inconsequential information,' and 'chirps from birds'. And that's exactly what the product was"

Twitter acts as a utility where one can send their SMSs to the whole world. It enables people to instantaneously get heard and get a response. Since the audience of this SMS is so large, many a time the responses are very quick. So, Twitter facilitates the basic social instincts of humans. By sharing on Twitter, a user can easily express his/her opinion on just about everything and at any time. Friends who are connected or, in the case of Twitter, followers, immediately get the information about what's going on in someone's life. This in turn serves another human emotion: the innate need to know about what is going on in someone's life. Apart from being real time, Twitter's UI is really easy to work with. It's naturally and instinctively understood, that is, the UI is very intuitive in nature.

Each tweet on Twitter is a short message with a maximum of 140 characters. Twitter is an excellent example of a microblogging service. As of July 2014, the Twitter user base reached above 500 million, with more than 271 million active users. Around 23 percent are adult Internet users, which is also about 19 percent of the entire adult population.

If we can properly mine what users are tweeting about, Twitter can act as a great tool for advertisement and marketing. But this is not the only information Twitter provides. Because of its non-symmetric nature in terms of followers and followings, Twitter assists better in terms of understanding user interests rather than its impact on the social network. An interest graph can be thought of as a method to learn the links between individuals and their diverse interests. Computing the degree of association or correlation between an individual's interests and potential advertisements is one of the most important applications of interest graphs. Based on these correlations, a user can be targeted so as to attain a maximum response to an advertisement campaign, along with follower recommendations.

One interesting fact about Twitter (and Facebook) is that the user does not need to be a real person. A user on Twitter (or on Facebook) can be anything and anyone, for example, an organization, a campaign itself, or a famous but imaginary personality (a fictional character recognizable in the media), apart from a real/actual person. If a real person follows these users on Twitter, a lot can be inferred about their personality, and hence they can be recommended ads or other followers based on such information. For example, @fakingnews is an Indian blog that publishes news satires ranging from Indian politics to typical Indian mindsets. People who follow @fakingnews are the ones who, in general, like to read sarcasm. Hence, these people can be thought of as belonging to the same cluster or community.
If we have another sarcastic blog, we can always recommend it to this community and improve the return on investment of an advertisement. The chances of getting more hits via people belonging to this community will be higher than with a community who don't follow @fakingnews, or any such news, in general. Once you have comprehended that Twitter allows you to create, link, and investigate a community of interest for a random topic, the influence of Twitter and the knowledge one can find from mining it becomes clearer.

Understanding Twitter's API

Twitter APIs provide a means to access the Twitter data, that is, tweets sent by its millions of users. Let's get to know these APIs a bit better.

Twitter vocabulary

As described earlier, Twitter is a microblogging service with a social aspect associated. It allows its users to express their views/sentiments by means of Internet SMS, called tweets in the context of Twitter. These tweets are entities formed of a maximum of 140 characters. The content of these tweets can be anything, ranging from a person's mood to a person's location to a person's curiosity. The platform where these tweets are posted is called the timeline.

To use Twitter's APIs, one must understand the basic terminology. Tweets are the crux of Twitter. Theoretically, a tweet is just 140 characters of text content tweeted by a user, but there is more to it than just that. There is more metadata associated with the same tweet, which Twitter classifies as entities and places. The entities consist of hashtags, URLs, and other media data that users have included in their tweet. The places are nothing but locations from where the tweet originated. It is possible that the place is a real-world location from where the tweet was sent, or it is a location mentioned in the text of the tweet. Take the following tweet as an example:

Learn how to consume millions of tweets with @twitterapi at #TDC2014 in São Paulo #bigdata tomorrow at 2:10pm http://t.co/pTBlWzTvVd

The preceding tweet was tweeted by @TwitterDev and it's about 132 characters long. The following are the entities mentioned in this tweet:

Handle: @twitterapi
Hashtags: #TDC2014, #bigdata
URL: http://t.co/pTBlWzTvVd

São Paulo is the place mentioned in this tweet. This is one such example of a tweet with a fairly good amount of metadata. Although the actual tweet's length is well within the 140-character limit, it contains more information than one can think of. This actually enables us to figure out that this tweet belongs to a specific community based on cross-referencing the topics present in the hashtags, the URL to the website, the different users mentioned in it, and so on.

The interface (web or mobile) on which the tweets are displayed is called the timeline. The tweets are, in general, arranged in chronological order of posting time. On a specific user's account, only a certain number of tweets are displayed by Twitter. This is generally based on the users the given user is following and is being followed by. This is the interface a user will see when he/she logs into his/her Twitter account.

A Twitter stream is different from a Twitter timeline in the sense that it is not for a specific user. The tweets on a user's timeline come only from a certain number of users and are displayed/updated less frequently, while the Twitter stream is a chronological collection of all the tweets posted by all the users. The number of active users on Twitter is in the order of hundreds of millions.
All the users tweeting during some public event of widespread interest, such as a presidential debate, can achieve speeds of several hundreds of thousands of tweets per minute. The behavior is very similar to a stream; hence the name of such a collection is the Twitter stream.

You can try the following by creating a Twitter account (it would be more insightful if you have fewer followers to begin with). Before creating the account, it is advised that you read all its terms and conditions. You can also start reading its API documentation.

Creating a Twitter API connection

We need to have an app created at https://dev.twitter.com/apps before making any API requests to Twitter. It's a standard method for developers to gain API access and, more importantly, it helps Twitter observe and restrict developers from making high-load API requests. The ROAuth package is the one we are going to use in our experiments. Tokens allow users to authorize third-party apps to access the data from any user account without the need to share their passwords (or other sensitive information). ROAuth basically facilitates the same.

Creating a new app

The first step to getting any kind of token access from Twitter is to create an app on it. The user has to go to https://dev.twitter.com/ and log in with their Twitter credentials. With you logged in using your credentials, the steps for creating an app are as follows:

Go to https://apps.twitter.com/app/new.
Put the name of your application in the Name field. This name can be anything you like.
Similarly, enter the description in the Description field.
The Website field needs to be filled with a valid URL, but again that can be any random URL.
You can leave the Callback URL field blank.

After the creation of this app, we need to find the API Key and API Secret values from the Keys and Access Tokens tab. Consider the example shown in the following figure:

Under the Keys and Access Tokens tab, you will find a button to generate access tokens. Click on it and you will be provided with an Access Token and an Access Token Secret value. Before using the preceding keys, we need to install twitteR to access the data in R using the app we just created, using the following code:

install.packages(c("devtools", "rjson", "bit64", "httr"))
library(devtools)
install_github("geoffjentry/twitteR")
library(twitteR)

Here's sample code that helps us access the tweets posted since any given date and which contain a specific keyword. In this example, we are searching for tweets containing the word Earthquake posted since September 29, 2014. In order to get this information, we provide four special pieces of information to get the authorization token:

key
secret
access token
access token secret

We'll show you how to use the preceding information to get an app authorized by the user and access its resources on Twitter. The ROAuth-based setup_twitter_oauth() function in twitteR will make our next steps very smooth and clear:

api_key <- "your_api_key"
api_secret <- "your_api_secret"
access_token <- "your_access_token"
access_token_secret <- "your_access_token_secret"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
EarthQuakeTweets = searchTwitter("EarthQuake", since='2014-09-29')

The results of this example should simply display Using direct authentication, with 25 tweets loaded in the EarthQuakeTweets variable as shown here:

head(EarthQuakeTweets, 2)
[[1]]
[1] "TamamiJapan: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"
[[2]]
[1] "OldhamDs: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"

We have shown the first two of the 25 tweets containing the word Earthquake since September 29, 2014. If you closely observe the results, you'll find all the metadata using str(EarthQuakeTweets[1]).

Finding trending topics

Now that we understand how to create API connections to Twitter and fetch data using them, we will see how to get an answer to what is trending on Twitter, that is, to list which topics (worldwide or local) are being talked about the most right now. Using the same API, we can easily access the trending information:

# return a data frame with name, country & woeid
Locs <- availableTrendLocations()
# where woeid is a numerical identification code describing a location ID
# Filter the data frame for Delhi (India) and extract the woeid of the same
LocsIndia = subset(Locs, country == "India")
woeidDelhi = subset(LocsIndia, name == "Delhi")$woeid
# getTrends takes a specified woeid and returns the trending topics associated with that woeid
trends = getTrends(woeid=woeidDelhi)

The function availableTrendLocations() returns an R data frame containing the name, country, and woeid parameters. We then filter this data frame for a location of our choosing; in this example, it's Delhi, India. The function getTrends() fetches the top 10 trends in the location determined by the woeid. Here are the top four trending hashtags in the region defined by woeid = 20070458, that is, Delhi, India.

head(trends)
                   name                                                  url                    query    woeid
1 #AntiHinduNGOsExposed http://twitter.com/search?q=%23AntiHinduNGOsExposed %23AntiHinduNGOsExposed 20070458
2           #KhaasAadmi           http://twitter.com/search?q=%23KhaasAadmi           %23KhaasAadmi 20070458
3            #WinGOSF14            http://twitter.com/search?q=%23WinGOSF14            %23WinGOSF14 20070458
4     #ItsForRealONeBay     http://twitter.com/search?q=%23ItsForRealONeBay     %23ItsForRealONeBay 20070458

Searching tweets

Now, similar to the trends, there is one more important function that comes with the twitteR package: searchTwitter(). This function will return tweets containing the searched string, subject to other constraints. Some of the constraints that can be imposed are as follows:

lang: This constrains the tweets to the given language.
since/until: This constrains the tweets to those posted since the given date or until the given date.
geocode: This constrains the tweets to those from users located within a certain distance of the given latitude/longitude.

For example, extracting tweets about the cricketer Sachin Tendulkar in the month of November 2014:

head(searchTwitter('Sachin Tendulkar', since='2014-11-01', until='2014-11-30'))
[[1]]
[1] "TendulkarFC: RT @Moulinparikh: Sachin Tendulkar had a long session with the Mumbai Ranji Trophy team after today's loss."
[[2]]
[1] "tyagi_niharika: @WahidRuba @Anuj_dvn @Neel_D_ @alishatariq3 @VWellwishers @Meenal_Rathore oh... Yaadaaya....hmaraesachuuu sir..i mean sachin Tendulkar"
[[3]]
[1] "Meenal_Rathore: @WahidRuba @Anuj_dvn @tyagi_niharika @Neel_D_ @alishatariq3 @AliaaFcc @VWellwishers .. Sachin Tendulkar ☺️"
[[4]]
[1] "MishraVidyanand: Vidyanand Mishra is following the Interest "The Living Legend SachinTendu..." on http://t.co/tveHXMB4BM - http://t.co/CocNMcxFge"
[[5]]
[1] "CSKalwaysWin: I have never tried to compare myself to anyone else.\n - Sachin Tendulkar"
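To illustrate the geocode constraint from the list above, a hedged sketch might look like the following; the coordinates are roughly those of Mumbai, the 'latitude,longitude,radius' string is the format expected for the geocode argument, and the keyword, count, and radius are placeholders chosen only for demonstration:

# restrict the search to tweets from users located within 50 km of (roughly) Mumbai
MumbaiCricketTweets = searchTwitter("Sachin Tendulkar",
                                    n = 100,
                                    lang = "en",
                                    geocode = "19.0760,72.8777,50km")
length(MumbaiCricketTweets)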
Twitter sentiment analysis

Depending on the objective, and based on the functionality to search any type of tweets from the public timeline, one can always collect the required corpus. For example, you may want to learn about customer satisfaction levels with the various cab services that are coming up in the Indian market. These start-ups are offering various discounts and coupons to attract customers, but at the end of the day, service quality determines the business of any organization. These start-ups are constantly promoting themselves on various social media websites, and customers are expressing various levels of sentiment on the same platforms. Let's target the following:

Meru Cabs: A radio cabs service based in Mumbai, India. Launched in 2007.
Ola Cabs: A taxi aggregator company based in Bangalore, India. Launched in 2011.
TaxiForSure: A taxi aggregator company based in Bangalore, India. Launched in 2011.
Uber India: A taxi aggregator company headquartered in San Francisco, California. Launched in India in 2014.

Let's set our goal as getting the general sentiment about each of the preceding service providers based on the customer sentiments present in the tweets on Twitter.

Collecting tweets as a corpus

We'll start with the searchTwitter() function (discussed previously) from the twitteR package to gather the tweets for each of the preceding organizations. Now, in order to avoid writing the same code again and again, we push the following authorization code into a file called authenticate.R:

library(twitteR)
api_key <- "xx"
api_secret <- "xx"
access_token <- "xx"
access_token_secret <- "xx"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

We run the following script to get the required tweets:

# Load the necessary packages
source('authenticate.R')
Meru_tweets = searchTwitter("MeruCabs", n=2000, lang="en")
Ola_tweets = searchTwitter("OlaCabs", n=2000, lang="en")
TaxiForSure_tweets = searchTwitter("TaxiForSure", n=2000, lang="en")
Uber_tweets = searchTwitter("Uber_Delhi", n=2000, lang="en")

Now, as mentioned in Twitter's REST API documentation, we get the message "Due to capacity constraints, the index currently only covers about a week's worth of tweets", so we do not always get the desired number of tweets (for example, here it's 2000). Instead, the sizes of each of the above tweet lists are as follows:

> length(Meru_tweets)
[1] 393
> length(Ola_tweets)
[1] 984
> length(TaxiForSure_tweets)
[1] 720
> length(Uber_tweets)
[1] 2000

As you can see from the preceding code, the length of these tweet lists is not equal to the number of tweets we had asked for in our query scripts. There are many takeaways from this information. Since these tweets are only from the last one week's tweets on Twitter, they suggest there is more discussion about these taxi services in the following order:

Uber India
Ola Cabs
TaxiForSure
Meru Cabs

A ban was imposed on Uber India after an alleged rape incident by one Uber India driver. The decision to put a ban on the entire organization because one of its drivers committed a crime became a matter of public outcry. Hence, the number of tweets about Uber increased on social media. Meru Cabs have been in India for almost 7 years now; hence, they are quite a stable organization. The amount of promotion Ola Cabs and TaxiForSure are doing is way higher than that of Meru Cabs.
This can be one reason for Meru Cabs having theleast number (393) of tweets in last week. The number of tweets in last week is comparable for Ola Cabs (984) and TaxiForSure (720). There can be several numbers of reasons for the same. They were both started their business in same year and more importantly they follow the same business model. While Meru Cabs is a radio taxi service and they own and manage a fleet of cars while Ola Cabs, TaxiForSure, or Uber are a marketplace for users to compare the offerings of various operators and book easily. Let's dive deep into the data and get more insights. Cleaning the corpus Before applying any intelligent algorithms to gather more insights out of the tweets collected so far, let's first clean it. In order to clean up, we should understand how the list of tweets looks like: head(Meru_tweets) [[1]] [1] "MeruCares: @KapilTwitts 2&gt;...and other details at [email protected] We'll check back and reach out soon." [[2]] [1] "vikasraidhan: @MeruCabs really disappointed with @GenieCabs. Cab is never assigned on time. Driver calls after 30 minutes. Why would I ride with Meru?" [[3]] [1] "shiprachowdhary: fallback of #ubershame , #MERUCABS taking customers for a ride" [[4]] [1] "shiprachowdhary: They book Genie, but JIT inform of cancellation &amp; send full fare #MERUCABS . Very disappointed.Always used these guys 4 and recommend them." [[5]] [1] "shiprachowdhary: No choice bt to take the #merucabs premium service. Driver told me that this happens a lot with #merucabs." [[6]] [1] "shiprachowdhary: booked #Merucabsyestrdy. Asked for Meru Genie. 10 mins 4 pick up time, they call to say Genie not available, so sending the full fare cab" The first tweet here is a grievance solution, while the second, fourth and fifth are actually customer sentiments about the services provided by Meru Cabs. We see: Lots of meta information such as @people, URLs and #hashtags Punctuation marks, numbers, and unnecessary spaces Some of these tweets are retweets from other users; for the given application, we would not like to consider retweets (RTs) in sentiment analysis We clean all these data using the following code block: MeruTweets <- sapply(Meru_tweets, function(x) x$getText()) OlaTweets = sapply(Ola_tweets, function(x) x$getText()) TaxiForSureTweets = sapply(TaxiForSure_tweets, function(x) x$getText()) UberTweets = sapply(Uber_tweets, function(x) x$getText()) catch.error = function(x) { # let us create a missing value for test purpose y = NA # Try to catch that error (NA) we just created catch_error = tryCatch(tolower(x), error=function(e) e) # if not an error if (!inherits(catch_error, "error")) y = tolower(x) # check result if error exists, otherwise the function works fine. 
return(y)
}

cleanTweets <- function(tweet) {
  # Clean the tweet for sentiment analysis
  # Remove html links, which are not required for sentiment analysis
  tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet)
  # First we will remove retweet entities from the stored tweets (text)
  tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", tweet)
  # Then remove all "#Hashtag"
  tweet = gsub("#\\w+", " ", tweet)
  # Then remove all "@people"
  tweet = gsub("@\\w+", " ", tweet)
  # Then remove all the punctuation
  tweet = gsub("[[:punct:]]", " ", tweet)
  # Then remove numbers, since we need only text for the analysis
  tweet = gsub("[[:digit:]]", " ", tweet)
  # Finally, we remove unnecessary spaces (white spaces, tabs, etc.)
  tweet = gsub("[ \t]{2,}", " ", tweet)
  tweet = gsub("^\\s+|\\s+$", "", tweet)
  # If you feel anything else should be removed (for example, slang words),
  # you can extend the function using the same gsub() approach.
  # Next, we convert all the words to lower case to get a uniform pattern.
  tweet = catch.error(tweet)
  tweet
}

cleanTweetsAndRemoveNAs <- function(Tweets) {
  TweetsCleaned = sapply(Tweets, cleanTweets)
  # Remove the "NA" tweets from this tweet list
  TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)]
  names(TweetsCleaned) = NULL
  # Remove the repetitive tweets from this tweet list
  TweetsCleaned = unique(TweetsCleaned)
  TweetsCleaned
}

MeruTweetsCleaned = cleanTweetsAndRemoveNAs(MeruTweets)
OlaTweetsCleaned = cleanTweetsAndRemoveNAs(OlaTweets)
TaxiForSureTweetsCleaned <- cleanTweetsAndRemoveNAs(TaxiForSureTweets)
UberTweetsCleaned = cleanTweetsAndRemoveNAs(UberTweets)

Here's the size of each of the cleaned tweet lists:

> length(MeruTweetsCleaned)
[1] 309
> length(OlaTweetsCleaned)
[1] 811
> length(TaxiForSureTweetsCleaned)
[1] 574
> length(UberTweetsCleaned)
[1] 1355

Estimating sentiment (A)

There are many sophisticated resources available to estimate sentiment. Many research papers and software packages are available as open source, and they implement very complex algorithms for sentiment analysis. After getting the cleaned Twitter data, we are going to use a few of the R packages available to assess the sentiments in the tweets. It's worth mentioning here that not all tweets express a sentiment. Some tweets are just information or facts, while others are customer care responses; ideally, these should not be used to assess customer sentiment about a particular organization. As a first step, we'll use a naive algorithm, which gives a score based on the number of times a positive or a negative word occurs in a given sentence (in our case, in a tweet). Please download the lists of positive and negative opinion/sentiment words for English (nearly 6,800 words in total). This opinion lexicon will be used as the first example in our sentiment analysis experiment. The good thing about this approach is that we are relying on a well-researched and, at the same time, customizable set of input words.
Here are a few examples of existing positive and negative sentiment words:

Positive: Love, best, cool, great, good, and amazing
Negative: Hate, worst, sucks, awful, and nightmare

> opinion.lexicon.pos = scan('opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
> opinion.lexicon.neg = scan('opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')
> head(opinion.lexicon.neg)
[1] "2-faced"    "2-faces"    "abnormal"   "abolish"    "abominable" "abominably"
> head(opinion.lexicon.pos)
[1] "a+"        "abound"    "abounds"   "abundance" "abundant"  "accessable"

We'll add a few industry-specific and/or especially emphatic terms based on our requirements:

pos.words = c(opinion.lexicon.pos, 'upgrade')
neg.words = c(opinion.lexicon.neg, 'wait', 'waiting', 'wtf', 'cancellation')

Now, we create a function, getSentimentScore(), which computes the raw sentiment based on the simple matching algorithm:

getSentimentScore = function(sentences, words.positive, words.negative, .progress='none')
{
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, words.positive, words.negative) {
    # First, let us remove the digits, punctuation characters and control characters
    sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '', gsub('\\d+', '', sentence)))
    # Then let us convert the sentence to lower case
    sentence = tolower(sentence)
    # Now let us split each sentence by the space delimiter
    words = unlist(str_split(sentence, '\\s+'))
    # Get the boolean match of each word with the positive & negative opinion lexicons
    pos.matches = !is.na(match(words, words.positive))
    neg.matches = !is.na(match(words, words.negative))
    # Now get the score as the total positive matches minus the total negative matches
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, words.positive, words.negative, .progress=.progress)
  # Return a data frame with each sentence and its score
  return(data.frame(text=sentences, score=scores))
}

Now, we apply the preceding function to the corpus of tweets collected and cleaned so far, passing in the pos.words and neg.words lists built above:

MeruResult = getSentimentScore(MeruTweetsCleaned, pos.words, neg.words)
OlaResult = getSentimentScore(OlaTweetsCleaned, pos.words, neg.words)
TaxiForSureResult = getSentimentScore(TaxiForSureTweetsCleaned, pos.words, neg.words)
UberResult = getSentimentScore(UberTweetsCleaned, pos.words, neg.words)

Here are some sample results:

Tweets for Meru Cabs:
- "gt and other details at feedback com we ll check back and reach out soon" (score: 0)
- "really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru" (score: -1)
- "so after years of bashing today i m pleasantly surprised clean car courteous driver prompt pickup mins efficient route" (score: 4)
- "a min drive cost hrs used to cost less ur unreliable and expensive trying to lose ur customers" (score: -3)

Tweets for Ola Cabs:
- "the service is going from bad to worse the drivers deny to come after a confirmed booking" (score: -3)
- "love the olacabs app give it a swirl sign up with my referral code dxf n and earn rs download the app from" (score: 1)
- "crn kept me waiting for mins amp at last moment driver refused pickup so unreliable amp irresponsible" (score: -4)
- "this is not the first time has delighted me punctuality and free upgrade awesome that" (score: 4)

Tweets for TaxiForSure:
- "great service now i have become a regular customer of tfs thank you for the upgrade as well happy taxi ing saving" (score: 5)
- "really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru" (score: -1)
- "horrible taxi service had to wait for one hour with a new born in the chilly weather of new delhi waiting for them" (score: -4)
- "what do i get now if you resolve the issue after i lost a crucial business because of the taxi delay" (score: -3)

Tweets for Uber India:
- "that s good uber s fares will prob be competitive til they gain local monopoly then will go sky high as in new york amp delhi saving" (score: 3)
- "from a shabby backend app stack to daily pr fuck ups its increasingly obvious that is run by child minded blow hards" (score: -3)
- "you say that uber is illegally running were you stupid to not ban earlier and only ban it now after the rape" (score: -3)
- "perhaps uber biz model does need some looking into it s not just in delhi that this happens but in boston too" (score: 0)

From the preceding observations, it's clear that this basic sentiment analysis method works fine in normal circumstances, but in the case of Uber India the results deviate too much from a subjective score. It's safe to say that basic word matching gives a good indicator of overall customer sentiment, except when the data itself is not reliable. In our case, the tweets about Uber India are not really related to the services that Uber provides, but rather to the one incident of crime by its driver, and the whole score went haywire.

Let's now compute point statistics of the scores we have obtained so far. Since the numbers of tweets are not equal for the four organizations, we compute a mean and standard deviation for each:

Organization   Mean sentiment score   Standard deviation
Meru Cabs      -0.2218543             1.301846
Ola Cabs        0.197724              1.170334
TaxiForSure    -0.09841828            1.154056
Uber India     -0.6132666             1.071094

Estimating sentiment (B)

Let's now move one step further. Instead of using simple matching against the opinion lexicon, we'll use something called Naive Bayes to decide on the emotion present in any tweet. We require the packages Rstem and sentiment to assist in this. It's important to mention here that both these packages are no longer available on CRAN, and hence we have to provide either the repository location or the archive URL as a parameter to the installation function. Here's the R script to install the required packages:

install.packages("Rstem", repos = "http://www.omegahat.org/R", type="source")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)
ls("package:sentiment")

Now that we have the sentiment and Rstem packages installed in our R workspace, we can build the Bayes classifier for sentiment analysis:

library(sentiment)
# The classify_emotion() function returns an object of class data frame
# with seven columns (anger, disgust, fear, joy, sadness, surprise, best_fit)
# and one row for each document:
MeruTweetsClassEmo = classify_emotion(MeruTweetsCleaned, algorithm="bayes", prior=1.0)
OlaTweetsClassEmo = classify_emotion(OlaTweetsCleaned, algorithm="bayes", prior=1.0)
TaxiForSureTweetsClassEmo = classify_emotion(TaxiForSureTweetsCleaned, algorithm="bayes", prior=1.0)
UberTweetsClassEmo = classify_emotion(UberTweetsCleaned, algorithm="bayes", prior=1.0)

The following figure shows a few results of the Bayesian analysis using the sentiment package for the Meru Cabs tweets. Similarly, we generated results for the other cab services in our problem setup. The sentiment package was built using a trained dataset of emotion words (nearly 1,500 words). The classify_emotion() function assigns each tweet to one of the following six emotions: anger, disgust, fear, joy, sadness, and surprise.
Hence, when the system is not able to classify the overall emotion as any of the six, NA is returned.

Let's substitute these NA values with the word unknown to make the further analysis easier:

# We will fetch the emotion category best_fit for our analysis purposes.
MeruEmotion = MeruTweetsClassEmo[,7]
OlaEmotion = OlaTweetsClassEmo[,7]
TaxiForSureEmotion = TaxiForSureTweetsClassEmo[,7]
UberEmotion = UberTweetsClassEmo[,7]

MeruEmotion[is.na(MeruEmotion)] = "unknown"
OlaEmotion[is.na(OlaEmotion)] = "unknown"
TaxiForSureEmotion[is.na(TaxiForSureEmotion)] = "unknown"
UberEmotion[is.na(UberEmotion)] = "unknown"

The best-fit emotions present in these tweets are summarized in the emotion plots shown later in this section.

Further, we'll use another function, classify_polarity(), provided by the sentiment package to classify the tweets into two classes, pos (positive sentiment) or neg (negative sentiment). The idea is to compute the log likelihood of a tweet under each of the two classes. Once these likelihoods are calculated, the ratio of the pos-likelihood to the neg-likelihood is taken, and based on this ratio each tweet is assigned to a particular class. It's important to note that if this ratio turns out to be 1, the overall sentiment of the tweet is assumed to be "neutral". The code is as follows:

MeruTweetsClassPol = classify_polarity(MeruTweetsCleaned, algorithm="bayes")
OlaTweetsClassPol = classify_polarity(OlaTweetsCleaned, algorithm="bayes")
TaxiForSureTweetsClassPol = classify_polarity(TaxiForSureTweetsCleaned, algorithm="bayes")
UberTweetsClassPol = classify_polarity(UberTweetsCleaned, algorithm="bayes")

We get the following output:

(Figure: a few results obtained using the classify_polarity() function of the sentiment package for the Meru Cabs tweets.)
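Before building the plotting data frames in the next step, it can be useful to sanity-check the classifier output with a quick frequency count. The following lines are a small illustrative addition (they are not part of the original scripts); they only assume the MeruEmotion vector and the MeruTweetsClassPol matrix created above, and the same calls can be repeated for the other three services:

# Quick counts of the Bayesian classifier output for the Meru Cabs tweets
table(MeruEmotion)                        # tweets per best-fit emotion, including "unknown"
table(MeruTweetsClassPol[, 4])            # tweets per best-fit polarity class
round(prop.table(table(MeruEmotion)), 2)  # proportions are easier to compare across services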
We'll now generate consolidated results from the two functions in a data frame for each cab service for plotting purposes: # we will fetch polarity category best_fit for our analysis purposes, MeruPol = MeruTweetsClassPol[,4] OlaPol = OlaTweetsClassPol[,4] TaxiForSurePol = TaxiForSureTweetsClassPol[,4] UberPol = UberTweetsClassPol[,4] # Let us now create a data frame with the above results MeruSentimentDataFrame = data.frame(text=MeruTweetsCleaned, emotion=MeruEmotion, polarity=MeruPol, stringsAsFactors=FALSE) OlaSentimentDataFrame = data.frame(text=OlaTweetsCleaned, emotion=OlaEmotion, polarity=OlaPol, stringsAsFactors=FALSE) TaxiForSureSentimentDataFrame = data.frame(text=TaxiForSureTweetsCleaned, emotion=TaxiForSureEmotion, polarity=TaxiForSurePol, stringsAsFactors=FALSE) UberSentimentDataFrame = data.frame(text=UberTweetsCleaned, emotion=UberEmotion, polarity=UberPol, stringsAsFactors=FALSE) # rearrange data inside the frame by sorting it MeruSentimentDataFrame = within(MeruSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) OlaSentimentDataFrame = within(OlaSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) TaxiForSureSentimentDataFrame = within(TaxiForSureSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) UberSentimentDataFrame = within(UberSentimentDataFrame, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE)))) plotSentiments1<- function (sentiment_dataframe,title) { library(ggplot2) ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) + scale_fill_brewer(palette="Dark2") + ggtitle(title) + theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories') } plotSentiments1(MeruSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about MeruCabs') plotSentiments1(OlaSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about OlaCabs') plotSentiments1(TaxiForSureSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about TaxiForSure') plotSentiments1(UberSentimentDataFrame, 'Sentiment Analysis of Tweets on Twitter about UberIndia') The output is as follows: In the preceding figure, we showed sample results using generated results on Meru Cabs tweets using both the functions. Let's now plot them one by one. First, let's create a single function to be used by each business's tweets. We call it plotSentiments1() and then we plot it for each business: The following dashboard shows the analysis for Ola Cabs: The following dashboard shows the analysis for TaxiForSure: The following dashboard shows the analysis for Uber India: These sentiments basically reflect the more or less the same observations as we did with the basic word-matching algorithm. The number of tweets with joy constitute the largest part of tweets for all these organizations, indicating that these organizations are trying their best to provide good business in the country. The sadness tweets are less numerous than the joy tweets. However, if compared with each other, they indicate the overall market share versus level of customer satisfaction of each service provider in question. Similarly, these graphs can be used to assess the level of dissatisfaction in terms of anger and disgust in the tweets. 
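To put numbers behind these emotion charts, the same emotion vectors can be collapsed into a single comparison table. The snippet below is an illustrative addition rather than part of the original code; it assumes only the four emotion vectors created earlier and uses base R functions:

# Proportion of tweets per emotion for each cab service (one column per service)
emotion.levels = c("anger", "disgust", "fear", "joy", "sadness", "surprise", "unknown")
emotionSummary = sapply(
  list(Meru = MeruEmotion, Ola = OlaEmotion,
       TaxiForSure = TaxiForSureEmotion, Uber = UberEmotion),
  function(e) round(prop.table(table(factor(e, levels = emotion.levels))), 2))
emotionSummary  # rows are emotion categories, columns are services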
Let's now consider only the positive and negative sentiments present in the tweets: # Similarly we will plot distribution of polarity in the tweets plotSentiments2 <- function (sentiment_dataframe,title) { library(ggplot2) ggplot(sentiment_dataframe, aes(x=polarity)) + geom_bar(aes(y=..count.., fill=polarity)) + scale_fill_brewer(palette="RdGy") + ggtitle(title) + theme(legend.position='right') + ylab('Number of Tweets') + xlab('Polarity Categories') } plotSentiments2(MeruSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about MeruCabs') plotSentiments2(OlaSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about OlaCabs') plotSentiments2(TaxiForSureSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about TaxiForSure') plotSentiments2(UberSentimentDataFrame, 'Polarity Analysis of Tweets on Twitter about UberIndia') The output is as follows: The following dashboard shows the polarity analysis for Ola Cabs: The following dashboard shows the analysis for TaxiForSure: The following dashboard shows the analysis for Uber India: It's a basic human trait to inform about other's what's wrong rather than informing if there was something right. That is say that we tend to tweets/report if something bad had happened rather reporting/tweeting if the experience was rather good. Hence, the negative tweets are supposed to be larger than the positive tweets in general. Still over a period of time (a week in our case) the ratio of the two easily reflect the overall market share versus the level of customer satisfaction of each service provider in question. Next, we try to get the sense of the overall content of the tweets using the word clouds. removeCustomeWords <- function (TweetsCleaned) { for(i in 1:length(TweetsCleaned)){ TweetsCleaned[i] <- tryCatch({ TweetsCleaned[i] = removeWords(TweetsCleaned[i], c(stopwords("english"), "care", "guys", "can", "dis", "didn", "guy" ,"booked", "plz")) TweetsCleaned[i] }, error=function(cond) { TweetsCleaned[i] }, warning=function(cond) { TweetsCleaned[i] }) } return(TweetsCleaned) } getWordCloud <- function (sentiment_dataframe, TweetsCleaned, Emotion) { emos = levels(factor(sentiment_dataframe$emotion)) n_emos = length(emos) emo.docs = rep("", n_emos) TweetsCleaned = removeCustomeWords(TweetsCleaned) for (i in 1:n_emos){ emo.docs[i] = paste(TweetsCleaned[Emotion == emos[i]], collapse=" ") } corpus = Corpus(VectorSource(emo.docs)) tdm = TermDocumentMatrix(corpus) tdm = as.matrix(tdm) colnames(tdm) = emos require(wordcloud) suppressWarnings(comparison.cloud(tdm, colors = brewer.pal(n_emos, "Dark2"), scale = c(3,.5), random.order = FALSE, title.size = 1.5)) } getWordCloud(MeruSentimentDataFrame, MeruTweetsCleaned, MeruEmotion) getWordCloud(OlaSentimentDataFrame, OlaTweetsCleaned, OlaEmotion) getWordCloud(TaxiForSureSentimentDataFrame, TaxiForSureTweetsCleaned, TaxiForSureEmotion) getWordCloud(UberSentimentDataFrame, UberTweetsCleaned, UberEmotion) The preceding figure shows word cloud from tweets about Meru Cabs. The preceding figure shows word cloud from tweets about Ola Cabs. The preceding figure shows word cloud from tweets about TaxiForSure. The preceding figure shows word cloud from tweets about Uber India. Summary In this article, we gained knowledge of the various Twitter APIs, we discussed how to create a connection with Twitter, and we saw how to retrieve the tweets with various attributes. We saw the power of Twitter in helping us determine the customer attitude toward today's various businesses. 
The analysis can be repeated on a weekly basis, so one can easily track monthly, quarterly, or yearly changes in customer sentiment. This not only helps customers decide which businesses are trending, but it also gives each business a well-defined metric of its own performance, which it can use to improve. We also discussed various methods of sentiment analysis, ranging from basic word matching to more advanced Bayesian algorithms.

Resources for Article:

Further resources on this subject:

Find Friends on Facebook [article]
Supervised learning [article]
Warming Up [article]

Introduction to Machine Learning with R

Packt
18 Feb 2016
7 min read
If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. In the early stages, computers are taught to play simple games of tic-tac-toe and chess. Later, machines are given control of traffic lights and communications, followed by military drones and missiles. The machine's evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then deleted. (For more resources related to this topic, see here.) Thankfully, at the time of writing this, machines still require user input. Though your impressions of machine learning may be colored by these mass-media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us in making sense of the world's massive data stores. Putting popular misconceptions aside, in this article we will learn the following topics: Installing R packages Loading and unloading R packages Machine learning with R Many of the algorithms needed for machine learning with R are not included as part of the base installation. Instead, the algorithms needed for machine learning are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open source software, there is no additional charge for this functionality. A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages. If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN at http://cran.r-project.org/index.html. If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble. The Packages link on the left side of the page will take you to a page where you can browse packages in an alphabetical order or sorted by the publication date. At the time of writing this, a total 6,779 packages were available—a jump of over 60% in the time since the first edition was written, and this trend shows no sign of slowing! The Task Views link on the left side of the CRAN page provides a curated list of packages as per the subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at http://cran.r-project.org/web/views/MachineLearning.html. Installing R packages Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. To demonstrate the use of packages, we will install and load the RWeka package, which was developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka in Computational Statistics 24: 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. 
More information on Weka is available at http://www.cs.waikato.ac.nz/~ml/weka/ To use the RWeka package, you will need to have Java installed (many computers come with Java preinstalled). Java is a set of programming tools available for free, which allow for the use of cross-platform applications such as Weka. For more information, and to download Java on your system, you can visit http://java.com. The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt, simply type: > install.packages("RWeka") R will then connect to CRAN and download the package in the correct format for your OS. Some packages such as RWeka require additional packages to be installed before they can be used (these are called dependencies). By default, the installer will automatically download and install any dependencies. The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed. The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option, as follows: > install.packages("RWeka", lib="/path/to/library") The installation function also provides additional options for installation from a local file, installation from source, or using experimental versions. You can read about these options in the help file, by using the following command: > ?install.packages More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function. Loading and unloading R packages In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users as they are needed, using the library() function. The name of this function leads some people to incorrectly use the terms library and package interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself. To load the RWeka package we installed previously, you can type the following: > library(RWeka) Aside from RWeka, there are several other R packages. To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously use the following command: > detach("package:RWeka", unload = TRUE) This will free up any resources used by the package. Summary Machine learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insight in large quantities of data. Still, caution must be used in order to avoid common abuses of machine learning in the real world. Conceptually, learning involves the abstraction of data into a structured representation and the generalization of this structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned and summarizes this data in the form of a model, which is then used for predictive or descriptive purposes. These purposes can be grouped into tasks, including classification, numeric prediction, pattern detection, and clustering. 
Among the many options, machine learning algorithms are chosen on the basis of the input data and the learning task. R provides support for machine learning in the form of community-authored packages. These powerful tools are free to download; however, they need to be installed before they can be used. To learn more about R, you can refer the following books published by Packt Publishing (https://www.packtpub.com/): Machine Learning with R - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) R Data Science Essentials (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) R Graphs Cookbook Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/r-graph-cookbook-%E2%80%93-second-edition) Resources for Article: Further resources on this subject: Machine Learning[article] Introducing Test-driven Machine Learning[article] Machine Learning with R[article]

Machine learning in practice

Packt
18 Feb 2016
13 min read
In this article, we will learn how we can implement machine learning in practice. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task at hand, any machine learning algorithm can be deployed by following this plan: Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source like a text file, spreadsheet, or database. Data exploration and and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs. Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm and the algorithm will represent the data in the form of a model. Model evaluation: Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm learns from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset or you may need to develop measures of performance specific to the intended application. Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the performance of the model. Sometimes, it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work as in step two of this process. (For more resources related to this topic, see here.) After these steps are completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time), for projections of financial data, to generate useful insight for marketing or research, or to automate tasks such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next generation learner. Types of input data The practice of machine learning involves matching the characteristics of input data to the biases of the available approaches. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes among input datasets. The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units such as person-years, which denote cases where the same person is tracked over multiple years; each person-year comprises of a person's data for one year. The unit of observation is related but not identical to the unit of analysis, which is the smallest unit from which the inference is made. 
Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people might be used to analyze trends across countries. Datasets that store the units of observation and their properties can be imagined as collections of data consisting of: Examples: Instances of the unit of observation for which properties have been recorded Features: Recorded properties or attributes of examples that may be useful for learning It is the easiest to understand features and examples through real-world cases. To build a learning algorithm to identify spam e-mail, the unit of observation could be e-mail messages, examples would be specific messages, and its features might consist of the words used in the messages. For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and its features may be the genomic markers from biopsied cells as well as the characteristics of patient such as weight, height, or blood pressure. While examples and features do not have to be collected in any specific form, they are commonly gathered in the matrix format, which means that each example has exactly the same features. The following spreadsheet shows a dataset in the matrix format. In the matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles, while the columns record various each automobile's feature, such as price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning, though other forms are used occasionally in specialized cases: Features also come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature is an attribute that consists of a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable to the categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from "not at all happy" to "very happy." It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task. Types of machine learning algorithms Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step towards using data to drive the desired action. A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features. Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hours. 
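Before moving on to how these models are trained, here is a small R sketch of the matrix format and feature types described above. The data values and the extra condition column are invented purely for illustration (they do not come from the text); each row is an example and each column is a feature:

# A toy dataset in matrix (data frame) form: rows are examples, columns are features
used_cars <- data.frame(
  price        = c(13995, 8500, 11750),                 # numeric feature
  mileage      = c(39500, 81000, 52300),                # numeric feature
  color        = factor(c("red", "black", "white")),    # nominal (categorical) feature
  transmission = factor(c("AUTO", "MANUAL", "AUTO")),   # nominal feature
  condition    = factor(c("good", "fair", "excellent"),
                        levels = c("fair", "good", "excellent"),
                        ordered = TRUE)                 # ordinal feature
)
str(used_cars)  # one example per row; str() reports the type of each feature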
Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output. The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether: An e-mail message is spam A person has cancer A football team will win or lose An applicant will default on a loan In classification, the target feature to be predicted is a categorical feature known as the class and is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are, by far, the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between inputs and the target, including both, the magnitude and uncertainty of the relationship. Since it is easy to convert numbers into categories (for example, ages 13 to 19 are teenagers) and categories into numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm. A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than the other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining. For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is often used for market basket analysis on retailers' transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunscreens, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items. Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. 
For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hot spots for criminal activity. The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis, which identifies groups of individuals with similar behavior or demographic information so that advertising campaigns can be tailored for particular audiences. Although the machine is capable of identifying the clusters, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. This is almost certainly easier than trying to create a unique appeal for each customer. Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but is rather focused on learning how to learn more effectively. A meta-learning algorithm uses the result of some learnings to inform additional learning. This can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.

Matching input data to algorithms

The following table lists the general types of machine learning algorithms, each of which may be implemented in several ways. These are the basis on which nearly all the other more advanced methods are built. Although this covers only a fraction of the entire set of machine learning algorithms, learning these methods will provide a sufficient foundation to make sense of any other method you may encounter in the future.

Model                            Learning task
Supervised Learning Algorithms
  Nearest Neighbor               Classification
  Naive Bayes                    Classification
  Decision Trees                 Classification
  Classification Rule Learners   Classification
  Linear Regression              Numeric prediction
  Regression Trees               Numeric prediction
  Model Trees                    Numeric prediction
  Neural Networks                Dual use
  Support Vector Machines        Dual use
Unsupervised Learning Algorithms
  Association Rules              Pattern detection
  k-means clustering             Clustering
Meta-Learning Algorithms
  Bagging                        Dual use
  Boosting                       Dual use
  Random Forests                 Dual use

To begin applying machine learning to a real-world project, you will need to determine which of the four learning tasks your project represents: classification, numeric prediction, pattern detection, or clustering. The task will drive the choice of algorithm. For instance, if you are undertaking pattern detection, you are likely to employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm, and numeric prediction will utilize regression analysis or regression trees. For classification, more thought is needed to match a learning problem to an appropriate classifier. In these cases, it is helpful to consider the various distinctions among algorithms, distinctions that will only be apparent by studying each of the classifiers in depth. For instance, within classification problems, decision trees result in models that are readily understood, while the models of neural networks are notoriously difficult to interpret. If you were designing a credit-scoring model, this could be an important distinction, because the law often requires that the applicant must be notified about the reasons he or she was rejected for the loan.
Even if the neural network is better at predicting loan defaults, if its predictions cannot be explained, then it is useless for this application. Although you will sometimes find that these characteristics exclude certain models from consideration. In many cases, the choice of algorithm is arbitrary. When this is true, feel free to use whichever algorithm you are most comfortable with. Other times, when predictive accuracy is primary, you may need to test several algorithms and choose the one that fits the best or use a meta-learning algorithm that combines several different learners to utilize the strengths of each. Summary Machine learning originated at the intersection of statistics, database science, and computer science. It is a powerful tool, capable of finding actionable insight in large quantities of data. Still, caution must be used in order to avoid common abuses of machine learning in the real world. Conceptually, learning involves the abstraction of data into a structured representation and the generalization of this structure into action that can be evaluated for utility. In practical terms, a machine learner uses data containing examples and features of the concept to be learned and summarizes this data in the form of a model, which is then used for predictive or descriptive purposes. These purposes can be grouped into tasks, including classification, numeric prediction, pattern detection, and clustering. Among the many options, machine learning algorithms are chosen on the basis of the input data and the learning task. R provides support for machine learning in the form of community-authored packages. These powerful tools are free to download, but need to be installed before they can be used. Resources for Article:   Further resources on this subject: Introduction to Machine Learning with R [article] Machine Learning with R [article] Spark – Architecture and First Program [article]

Putting the Fun in Functional Python

Packt
17 Feb 2016
21 min read
Functional programming defines a computation using expressions and evaluation—often encapsulated in function definitions. It de-emphasizes or avoids the complexity of state change and mutable objects. This tends to create programs that are more succinct and expressive. In this article, we'll introduce some of the techniques that characterize functional programming. We'll identify some of the ways to map these features to Python. Finally, we'll also address some ways in which the benefits of functional programming accrue when we use these design patterns to build Python applications. Python has numerous functional programming features. It is not a purely functional programming language. It offers enough of the right kinds of features that it confers to the benefits of functional programming. It also retains all optimization power available from an imperative programming language. We'll also look at a problem domain that we'll use for many of the examples in this book. We'll try to stick closely to Exploratory Data Analysis (EDA) because its algorithms are often good examples of functional programming. Furthermore, the benefits of functional programming accrue rapidly in this problem domain. Our goal is to establish some essential principles of functional programming. We'll focus on Python 3 features in this book. However, some of the examples might also work in Python 2. (For more resources related to this topic, see here.) Identifying a paradigm It's difficult to be definitive on what fills the universe of programming paradigms. For our purposes, we will distinguish between just two of the many programming paradigms: Functional programming and Imperative programming. One important distinguishing feature between these two is the concept of state. In an imperative language, like Python, the state of the computation is reflected by the values of the variables in the various namespaces. The values of the variables establish the state of a computation; each kind of statement makes a well-defined change to the state by adding or changing (or even removing) a variable. A language is imperative because each statement is a command, which changes the state in some way. Our general focus is on the assignment statement and how it changes state. Python has other statements, such as global or nonlocal, which modify the rules for variables in a particular namespace. Statements like def, class, and import change the processing context. Other statements like try, except, if, elif, and else act as guards to modify how a collection of statements will change the computation's state. Statements like for and while, similarly, wrap a block of statements so that the statements can make repeated changes to the state of the computation. The focus of all these various statement types, however, is on changing the state of the variables. Ideally, each statement advances the state of the computation from an initial condition toward the desired final outcome. This "advances the computation" assertion can be challenging to prove. One approach is to define the final state, identify a statement that will establish this final state, and then deduce the precondition required for this final statement to work. This design process can be iterated until an acceptable initial state is derived. In a functional language, we replace state—the changing values of variables—with a simpler notion of evaluating functions. Each function evaluation creates a new object or objects from existing objects. 
Since a functional program is a composition of a function, we can design lower-level functions that are easy to understand, and we will design higher-level compositions that can also be easier to visualize than a complex sequence of statements. Function evaluation more closely parallels mathematical formalisms. Because of this, we can often use simple algebra to design an algorithm, which clearly handles the edge cases and boundary conditions. This makes us more confident that the functions work. It also makes it easy to locate test cases for formal unit testing. It's important to note that functional programs tend to be relatively succinct, expressive, and efficient when compared to imperative (object-oriented or procedural) programs. The benefit isn't automatic; it requires a careful design. This design effort is often easier than functionally similar procedural programming. Subdividing the procedural paradigm We can subdivide imperative languages into a number of discrete categories. In this section, we'll glance quickly at the procedural versus object-oriented distinction. What's important here is to see how object-oriented programming is a subset of imperative programming. The distinction between procedural and object-orientation doesn't reflect the kind of fundamental difference that functional programming represents. We'll use code examples to illustrate the concepts. For some, this will feel like reinventing a wheel. For others, it provides a concrete expression of abstract concepts. For some kinds of computations, we can ignore Python's object-oriented features and write simple numeric algorithms. For example, we might write something like the following to get the range of numbers: s = 0 for n in range(1, 10): if n % 3 == 0 or n % 5 == 0: s += n print(s) We've made this program strictly procedural, avoiding any explicit use of Python's object features. The program's state is defined by the values of the variables s and n. The variable, n, takes on values such that 1 ≤ n < 10. As the loop involves an ordered exploration of values of n, we can prove that it will terminate when n == 10. Similar code would work in C or Java using their primitive (non-object) data types. We can exploit Python's Object-Oriented Programming (OOP) features and create a similar program: m = list() for n in range(1, 10): if n % 3 == 0 or n % 5 == 0: m.append(n) print(sum(m)) This program produces the same result but it accumulates a stateful collection object, m, as it proceeds. The state of the computation is defined by the values of the variables m and n. The syntax of m.append(n) and sum(m) can be confusing. It causes some programmers to insist (wrongly) that Python is somehow not purely Object-oriented because it has a mixture of the function()and object.method() syntax. Rest assured, Python is purely Object-oriented. Some languages, like C++, allow the use of primitive data type such as int, float, and long, which are not objects. Python doesn't have these primitive types. The presence of prefix syntax doesn't change the nature of the language. To be pedantic, we could fully embrace the object model, the subclass, the list class, and add a sum method: class SummableList(list): def sum( self ): s= 0 for v in self.__iter__(): s += v return s If we initialize the variable, m, with the SummableList() class instead of the list() method, we can use the m.sum() method instead of the sum(m) method. This kind of change can help to clarify the idea that Python is truly and completely object-oriented. 
The use of prefix function notation is purely syntactic sugar. All three of these examples rely on variables to explicitly show the state of the program. They rely on the assignment statements to change the values of the variables and advance the computation toward completion. We can insert the assert statements throughout these examples to demonstrate that the expected state changes are implemented properly. The point is not that imperative programming is broken in some way. The point is that functional programming leads to a change in viewpoint, which can, in many cases, be very helpful. We'll show a function view of the same algorithm. Functional programming doesn't make this example dramatically shorter or faster. Using the functional paradigm In a functional sense, the sum of the multiples of 3 and 5 can be defined in two parts: The sum of a sequence of numbers A sequence of values that pass a simple test condition, for example, being multiples of three and five The sum of a sequence has a simple, recursive definition: def sum(seq): if len(seq) == 0: return 0 return seq[0] + sum(seq[1:]) We've defined the sum of a sequence in two cases: the base case states that the sum of a zero length sequence is 0, while the recursive case states that the sum of a sequence is the first value plus the sum of the rest of the sequence. Since the recursive definition depends on a shorter sequence, we can be sure that it will (eventually) devolve to the base case. The + operator on the last line of the preceeding example and the initial value of 0 in the base case characterize the equation as a sum. If we change the operator to * and the initial value to 1, it would just as easily compute a product. Similarly, a sequence of values can have a simple, recursive definition, as follows: def until(n, filter_func, v): if v == n: return [] if filter_func(v): return [v] + until( n, filter_func, v+1 ) else: return until(n, filter_func, v+1) In this function, we've compared a given value, v, against the upper bound, n. If v reaches the upper bound, the resulting list must be empty. This is the base case for the given recursion. There are two more cases defined by the given filter_func() function. If the value of v is passed by the filter_func() function, we'll create a very small list, containing one element, and append the remaining values of the until() function to this list. If the value of v is rejected by the filter_func() function, this value is ignored and the result is simply defined by the remaining values of the until() function. We can see that the value of v will increase from an initial value until it reaches n, assuring us that we'll reach the base case soon. Here's how we can use the until() function to generate the multiples of 3 or 5. First, we'll define a handy lambda object to filter values: mult_3_5= lambda x: x%3==0 or x%5==0 (We will use lambdas to emphasize succinct definitions of simple functions. Anything more complex than a one-line expression requires the def statement.) We can see how this lambda works from the command prompt in the following example: >>> mult_3_5(3) True >>> mult_3_5(4) False >>> mult_3_5(5) True This function can be used with the until() function to generate a sequence of values, which are multiples of 3 or 5. The until() function for generating a sequence of values works as follows: >>> until(10, lambda x: x%3==0 or x%5==0, 0) [0, 3, 5, 6, 9] We can use our recursive sum() function to compute the sum of this sequence of values. 
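Putting these pieces together, the recursive sum() and until() functions defined above reproduce the result of the earlier imperative versions. A quick check at the interactive prompt (assuming the definitions above have already been entered, so that sum here is our recursive version rather than the built-in) looks like this:

>>> seq = until(10, mult_3_5, 0)
>>> seq
[0, 3, 5, 6, 9]
>>> sum(seq)
23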
The various functions, such as sum(), until(), and mult_3_5(), are defined as simple recursive functions. The values are computed without resorting to intermediate variables to store state. We'll return to the ideas behind this purely functional recursive function definition in several places. It's important to note here that many functional programming language compilers can optimize these kinds of simple recursive functions. Python can't do the same optimizations.

Using a functional hybrid

We'll continue this example with a mostly functional version of the previous example to compute the sum of the multiples of 3 and 5. Our hybrid functional version might look like the following:

print(sum(n for n in range(1, 10) if n%3==0 or n%5==0))

We've used nested generator expressions to iterate through a collection of values and compute the sum of these values. The range(1, 10) object is an iterable and, consequently, a kind of generator expression; it generates the sequence of values n such that 1 ≤ n < 10. The more complex expression, n for n in range(1, 10) if n%3==0 or n%5==0, is also an iterable expression. It produces the set of values in that range that are multiples of 3 or 5, that is, {3, 5, 6, 9}. A variable, n, is bound to each value, more as a way of expressing the contents of the set than as an indicator of the state of the computation. The sum() function consumes the iterable expression, creating a final object, 23.

The bound variable doesn't change once a value is bound to it. The variable, n, in the loop is essentially a shorthand for the values available from the range() function. The if clause of the expression can be extracted into a separate function, allowing us to easily repurpose this with other rules. We could also use a higher-order function named filter() instead of the if clause of the generator expression.

As we work with generator expressions, we'll see that the bound variable is at the blurry edge of defining the state of the computation. The variable, n, in this example isn't directly comparable to the variable, n, in the first two imperative examples. The for statement creates a proper variable in the local namespace. The generator expression does not create a variable in the same way as a for statement does:

>>> sum(n for n in range(1, 10) if n%3==0 or n%5==0)
23
>>> n
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined

Because of the way Python uses namespaces, it might be possible to write a function that can observe the n variable in a generator expression. However, we won't. Our objective is to exploit the functional features of Python, not to detect how those features have an object-oriented implementation under the hood.

Looking at object creation

In some cases, it might help to look at intermediate objects as a history of the computation. What's important is that the history of a computation is not fixed. When functions are commutative or associative, then changes to the order of evaluation might lead to different objects being created. This might have performance improvements with no changes to the correctness of the results.

Consider this expression:

>>> 1+2+3+4
10

We are looking at a variety of potential computation histories with the same result. Because the + operator is commutative and associative, there are a large number of candidate histories that lead to the same result. Of the candidate sequences, there are two important alternatives, which are as follows:

>>> ((1+2)+3)+4
10
>>> 1+(2+(3+4))
10

In the first case, we fold in values working from left to right.
This is the way Python works implicitly. Intermediate objects 3 and 6 are created as part of this evaluation. In the second case, we fold from right-to-left. In this case, intermediate objects 7 and 9 are created. In the case of simple integer arithmetic, the two results have identical performance; there's no optimization benefit.

When we work with something like the list append, we might see some optimization improvements when we change the association rules. Here's a simple example:

>>> import timeit
>>> timeit.timeit("((([]+[1])+[2])+[3])+[4]")
0.8846941249794327
>>> timeit.timeit("[]+([1]+([2]+([3]+[4])))")
1.0207440659869462

In this case, there's some benefit in working from left to right.

What's important for functional design is the idea that the + operator (or add() function) can be used in any order to produce the same results. The + operator has no hidden side effects that restrict the way this operator can be used.

The stack of turtles

When we use Python for functional programming, we embark down a path that will involve a hybrid that's not strictly functional. Python is not Haskell, OCaml, or Erlang. For that matter, our underlying processor hardware is not functional; it's not even strictly object-oriented—CPUs are generally procedural.

All programming languages rest on abstractions, libraries, frameworks and virtual machines. These abstractions, in turn, may rely on other abstractions, libraries, frameworks and virtual machines. The most apt metaphor is this: the world is carried on the back of a giant turtle. The turtle stands on the back of another giant turtle. And that turtle, in turn, is standing on the back of yet another turtle. It's turtles all the way down.

– Anonymous
There's no practical end to the layers of abstractions. More importantly, the presence of abstractions and virtual machines doesn't materially change our approach to designing software to exploit the functional programming features of Python.

Even within the functional programming community, there are more pure and less pure functional programming languages. Some languages make extensive use of monads to handle stateful things like filesystem input and output. Other languages rely on a hybridized environment that's similar to the way we use Python. We write software that's generally functional with carefully chosen procedural exceptions.

Our functional Python programs will rely on the following three stacks of abstractions:

Our applications will be functions—all the way down—until we hit the objects
The underlying Python runtime environment that supports our functional programming is objects—all the way down—until we hit the turtles
The libraries that support Python are a turtle on which Python stands

The operating system and hardware form their own stack of turtles. These details aren't relevant to the problems we're going to solve.

A classic example of functional programming

As part of our introduction, we'll look at a classic example of functional programming. This is based on the classic paper Why Functional Programming Matters by John Hughes. The article appeared in a book called Research Topics in Functional Programming, edited by D. Turner, published by Addison-Wesley in 1990. Here's a link to the paper: http://www.cs.kent.ac.uk/people/staff/dat/miranda/whyfp90.pdf

This discussion of functional programming in general is profound. There are several examples given in the paper. We'll look at just one: the Newton-Raphson algorithm for locating the roots of a function. In this case, the function is the square root. It's important because many versions of this algorithm rely on the explicit state managed via loops. Indeed, the Hughes paper provides a snippet of the Fortran code that emphasizes stateful, imperative processing.

The backbone of this approximation is the calculation of the next approximation from the current approximation. The next_() function takes x, an approximation to sqrt(n), and calculates a next value that brackets the proper root. Take a look at the following example:

def next_(n, x):
    return (x+n/x)/2

This function computes a series of values a1 = (a0 + n/a0)/2, a2 = (a1 + n/a1)/2, and so on. The distance between the values is halved each time, so they'll quickly converge on the value a such that a = n/a, which means a = √n. We don't want to call the method next() because this name would collide with a built-in function. We call it the next_() method so that we can follow the original presentation as closely as possible.

Here's how the function looks when used in the command prompt:

>>> n = 2
>>> f = lambda x: next_(n, x)
>>> a0 = 1.0
>>> [round(x, 4) for x in (a0, f(a0), f(f(a0)), f(f(f(a0))),)]
[1.0, 1.5, 1.4167, 1.4142]

We've defined the f() method as a lambda that will converge on √2. We started with 1.0 as the initial value for a0. Then we evaluated a sequence of recursive evaluations: a1 = f(a0), a2 = f(f(a0)), and so on. We evaluated these functions using a list comprehension so that we could round off each value. This makes the output easier to read and easier to use with doctest. The sequence appears to converge rapidly on √2.
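As an aside not in the original text, functools.partial offers an equivalent way to bind the n argument without a lambda; the following minimal sketch only restates the binding shown above:

from functools import partial

f = partial(next_, 2)        # behaves like lambda x: next_(2, x)
print(round(f(1.0), 4))      # 1.5, the first approximation after a0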
We can write a function, which will (in principle) generate an infinite sequence of values converging on the proper square root:

def repeat(f, a):
    yield a
    for v in repeat(f, f(a)):
        yield v

This function will generate approximations using a function, f(), and an initial value, a. If we provide the next_() function defined earlier, we'll get a sequence of approximations to the square root of the n argument. The repeat() function expects the f() function to have a single argument; however, our next_() function has two arguments. We can use a lambda object, lambda x: next_(n, x), to create a partial version of the next_() function with one of the two variables bound.

The Python generator functions can't be trivially recursive; they must explicitly iterate over the recursive results, yielding them individually. Attempting to use a simple return repeat(f, f(a)) will end the iteration, returning a generator expression instead of yielding the sequence of values. We have two ways to return all the values instead of returning a generator expression, which are as follows:

We can write an explicit for loop as follows: for x in some_iter: yield x.
We can use the yield from statement as follows: yield from some_iter.

Both techniques of yielding the values of a recursive generator function are equivalent. We'll try to emphasize yield from. In some cases, however, the yield with a complex expression will be more clear than the equivalent mapping or generator expression.

Of course, we don't want the entire infinite sequence. We will stop generating values when two values are so close to each other that we can call either one the square root we're looking for. The common symbol for the value, which is close enough, is the Greek letter Epsilon, ε, which can be thought of as the largest error we will tolerate.

In Python, we'll have to be a little clever about taking items from an infinite sequence one at a time. It works out well to use a simple interface function that wraps a slightly more complex recursion. Take a look at the following code snippet:

def within(ε, iterable):
    def head_tail(ε, a, iterable):
        b = next(iterable)
        if abs(a-b) <= ε:
            return b
        return head_tail(ε, b, iterable)
    return head_tail(ε, next(iterable), iterable)

We've defined an internal function, head_tail(), which accepts the tolerance, ε, an item from the iterable sequence, a, and the rest of the iterable sequence, iterable. The next item from the iterable is bound to a name, b. If |a-b| ≤ ε, the two values are close enough together that we've found the square root. Otherwise, we use the b value in a recursive invocation of the head_tail() function to examine the next pair of values.

Our within() function merely seeks to properly initialize the internal head_tail() function with the first value from the iterable parameter. Some functional programming languages offer a technique that will put a value back into an iterable sequence. In Python, this might be a kind of unget() or previous() method that pushes a value back into the iterator. Python iterables don't offer this kind of rich functionality.

We can use the three functions next_(), repeat(), and within() to create a square root function, as follows:

def sqrt(a0, ε, n):
    return within(ε, repeat(lambda x: next_(n, x), a0))

We've used the repeat() function to generate a (potentially) infinite sequence of values based on the next_(n,x) function. Our within() function will stop generating values in the sequence when it locates two values with a difference less than ε.
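Here's a short command-prompt session (a sketch added for clarity, not part of the original text) showing the composed sqrt() function at work; the rounding simply keeps the output readable:

>>> round(sqrt(1.0, .0001, 3), 4)
1.7321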
When we use this version of the sqrt() method, we need to provide an initial seed value, a0, and an ε value. An expression like sqrt(1.0, .0001, 3) will start with an approximation of 1.0 and compute the value of √3 to within 0.0001. For most applications, the initial a0 value can be 1.0. However, the closer it is to the actual square root, the more rapidly this method converges.

The original example of this approximation algorithm was shown in the Miranda language. It's easy to see that there are few profound differences between Miranda and Python. The biggest difference is Miranda's ability to construct a cons, pushing a value back into an iterable and doing a kind of unget. This parallelism between Miranda and Python gives us confidence that many kinds of functional programming can be easily done in Python.

Summary

We've looked at programming paradigms with an eye toward distinguishing the functional paradigm from two common imperative paradigms in detail.

For more information, kindly take a look at the following books, also by Packt Publishing:

Learning Python (https://www.packtpub.com/application-development/learning-python)
Mastering Python (https://www.packtpub.com/application-development/mastering-python)
Mastering Object-oriented Python (https://www.packtpub.com/application-development/mastering-object-oriented-python)

Resources for Article:

Further resources on this subject:
Saying Hello to Unity and Android [article]
Using Specular in Unity [article]
Unity 3.x Scripting-Character Controller versus Rigidbody [article]
Probabilistic Graphical Models

Packt
17 Feb 2016
6 min read
Probabilistic graphical models, or simply graphical models as we will refer to them in this article, are models that use the representation of a graph to describe the conditional independence relationships between a series of random variables. This topic has received an increasing amount of attention in recent years and probabilistic graphical models have been successfully applied to tasks ranging from medical diagnosis to image segmentation. In this article, we'll present some of the necessary background that will pave the way to understanding the most basic graphical model, the Naïve Bayes classifier. We will then look at a slightly more complicated graphical model, known as the Hidden Markov Model, or HMM for short. To get started in this field, we must first learn about graphs.

(For more resources related to this topic, see here.)

A Little Graph Theory

Graph theory is a branch of mathematics that deals with mathematical objects known as graphs. Here, a graph does not have the everyday meaning that we are more used to talking about, in the sense of a diagram or plot with an x and y axis. In graph theory, a graph consists of two sets. The first is a set of vertices, which are also referred to as nodes. We typically use integers to label and enumerate the vertices. The second set consists of edges between these vertices.

Thus, a graph is nothing more than a description of some points and the connections between them. The connections can have a direction so that an edge goes from the source or tail vertex to the target or head vertex. In this case, we have a directed graph. Alternatively, the edges can have no direction, so that the graph is undirected.

A common way to describe a graph is via the adjacency matrix. If we have V vertices in the graph, an adjacency matrix is a V×V matrix whose entries are 0 if the vertex represented by the row number is not connected to the vertex represented by the column number. If there is a connection, the entry is 1. With undirected graphs, both nodes at each edge are connected to each other so the adjacency matrix is symmetric. For directed graphs, a vertex vi is connected to a vertex vj via an edge (vi,vj); that is, an edge where vi is the tail and vj is the head. Here is an example adjacency matrix for a graph with seven nodes:

> adjacency_m
  1 2 3 4 5 6 7
1 0 0 0 0 0 1 0
2 1 0 0 0 0 0 0
3 0 0 0 0 0 0 1
4 0 0 1 0 1 0 1
5 0 0 0 0 0 0 0
6 0 0 0 1 1 0 1
7 0 0 0 0 1 0 0

This matrix is not symmetric, so we know that we are dealing with a directed graph. The first 1 value in the first row of the matrix denotes the fact that there is an edge starting from vertex 1 and ending on vertex 6. When the number of nodes is small, it is easy to visualize a graph. We simply draw circles to represent the vertices and lines between them to represent the edges. For directed graphs, we use arrows on the lines to denote the directions of the edges. It is important to note that we can draw the same graph in an infinite number of different ways on the page. This is because the graph tells us nothing about the positioning of the nodes in space; we only care about how they are connected to each other. Here are two different but equally valid ways to draw the graph described by the adjacency matrix we just saw:

Two vertices are said to be connected with each other if there is an edge between them (taking note of the order when talking about directed graphs).
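To make the idea concrete, here is a small illustrative sketch (added here; the article's own output above comes from an R session) that stores the same seven-node adjacency matrix as a nested list in Python and checks for a directed edge:

# Rows and columns correspond to vertices 1 through 7.
adjacency_m = [
    [0, 0, 0, 0, 0, 1, 0],  # edges leaving vertex 1
    [1, 0, 0, 0, 0, 0, 0],  # edges leaving vertex 2
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 0],
]

def has_edge(matrix, tail, head):
    # Vertices are numbered from 1, so shift to 0-based indexes.
    return matrix[tail - 1][head - 1] == 1

print(has_edge(adjacency_m, 1, 6))  # True: there is an edge from vertex 1 to vertex 6
print(has_edge(adjacency_m, 6, 1))  # False: the graph is directed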
If we can move from vertex vi to vertex vj by starting at the first vertex and finishing at the second vertex, by moving on the graph along the edges and passing through an arbitrary number of graph vertices, then these intermediate edges form a path between these two vertices. Note that this definition requires that all the vertices and edges along the path are distinct from each other (with the possible exception of the first and last vertex). For example, in our graph, vertex 6 can be reached from vertex 2 by a path leading through vertex 1.

Sometimes, there can be many such possible paths through the graph, and we are often interested in the shortest path, which moves through the fewest number of intermediary vertices. We can define the distance between two nodes in the graph as the length of the shortest path between them. A path that begins and ends at the same vertex is known as a cycle. A graph that does not have any cycles in it is known as an acyclic graph. If an acyclic graph has directed edges, it is known as a directed acyclic graph, which is often abbreviated as a DAG.

There are many excellent references on graph theory available. One such reference, which is available online, is Graph Theory by Reinhard Diestel, Springer. This landmark reference is now in its 4th edition and can be found at http://diestel-graph-theory.com/.

It might not seem obvious at first, but it turns out that a large number of real-world situations can be conveniently described using graphs. For example, the network of friendships on social media sites, such as Facebook, or followers on Twitter, can be represented as graphs. On Facebook, the friendship relation is reciprocal, and so the graph is undirected. On Twitter, the follower relation is not, and so the graph is directed. Another graph is the network of websites on the Web, where links from one web page to the next form directed edges. Transport networks, communication networks, and electricity grids can be represented as graphs.

For the predictive modeler, it turns out that a special class of models known as probabilistic graphical models, or graphical models for short, are models that involve a graph structure. In a graphical model, the nodes represent random variables and the edges in between represent the dependencies between them. Before we can go into further detail, we'll need to take a short detour in order to visit Bayes' Theorem, a classic theorem in statistics that despite its simplicity has implications both profound and practical when it comes to statistical inference and prediction.

Summary

In this article, we learned that graphs consist of nodes and edges. We also learned that a common way of describing a graph is via the adjacency matrix.

For more information on graphical models, you can refer to the books published by Packt (https://www.packtpub.com/):

Mastering Predictive Analytics with Python (https://www.packtpub.com/big-data-and-business-intelligence/mastering-predictive-analytics-python)
R Graphs Cookbook Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/r-graph-cookbook-%E2%80%93-second-edition)

Resources for Article:

Further resources on this subject:
Data Analytics [article]
Big Data Analytics [article]
Learning Data Analytics with R and Hadoop [article]
Spark – Architecture and First Program

Packt
17 Feb 2016
29 min read
In this article by Sumit Gupta and Shilpi Saxena, the authors of Real-Time Big Data Analytics, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark, which are developed over the core Spark framework.

(For more resources related to this topic, see here.)

Spark is a general-purpose computing engine that initially focused on providing solutions for iterative and interactive computations and workloads, such as machine learning algorithms that reuse intermediate or working datasets across multiple parallel operations. The real challenge with iterative computations was the dependency of the intermediate data/steps on the overall job. This intermediate data needs to be cached in memory for faster computation, because flushing to and reading from a disk is an overhead that, in turn, makes the overall process unacceptably slow.

The creators of Apache Spark provided not only scalability, fault tolerance, performance, and distributed data processing, but also in-memory processing of distributed data over the cluster of nodes. To achieve this, a new layer of abstraction was introduced: distributed datasets that are partitioned over a set of machines (a cluster) and that can be cached in memory to reduce latency. This new layer of abstraction was known as resilient distributed datasets (RDD).

RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. It is important to note that Spark is capable of performing in-memory operations, but at the same time, it can also work on the data stored on the disk.

High-level architecture

Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled and integration with external components/libraries/extensions is performed using well-defined contracts. Here is the high-level architecture of Spark 1.5.1 and its various components/layers:

The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components:

Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to the CPU, memory, and data storage.

Data storage layer: This layer provides the APIs to store and retrieve the data from the persistent storage area to Spark jobs/applications. This layer is used by Spark workers to dump data on the persistent storage whenever the cluster memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold the data, are agnostic to the underlying storage layer and can persist the data in various persistent storage areas, such as local filesystems, HDFS, or any other NoSQL database such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.

Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN (http://tinyurl.com/pcymnnf) and Mesos (http://mesos.apache.org/) for the allocation and deallocation of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the APIs that are used to request the allocation and deallocation of available resources across the cluster.
Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of the Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages.

Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to perform ad hoc queries and interactive analysis over large datasets.

The preceding architecture should be sufficient to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled, and if required, can be replaced or extended as per the requirements. Spark extensions is one such layer that is widely used by architects and developers to develop custom libraries. Let's move forward and talk more about Spark extensions, which are available for developing custom applications/jobs.

Spark extensions/libraries

In this section, we will briefly discuss the usage of various Spark extensions/libraries that are available for different use cases. The following are the extensions/libraries available with Spark 1.5.1:

Spark Streaming: Spark Streaming, as an extension, is developed over the core Spark API. It enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Spark Streaming enables the ingestion of data from various sources, such as Kafka, Flume, Kinesis, or TCP sockets. Once the data is ingested, it can be further processed using complex algorithms that are expressed with high-level functions, such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards. In fact, Spark Streaming also facilitates the usage of Spark's machine learning and graph processing algorithms on data streams. For more information, refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html.

Spark MLlib: Spark MLlib is another extension that provides the distributed implementation of various machine learning algorithms. Its goal is to make practical machine learning scalable and easy to use. It provides implementations of various common machine learning algorithms used for classification, regression, clustering, and more. For more information, refer to http://spark.apache.org/docs/latest/mllib-guide.html.

Spark GraphX: GraphX provides the API to create a directed multigraph with properties attached to each vertex and edge. It also provides various common operators used for aggregation and distributed implementations of various graph algorithms, such as PageRank and triangle counting. For more information, refer to http://spark.apache.org/docs/latest/graphx-programming-guide.html.

Spark SQL: Spark SQL provides the distributed processing of structured data and facilitates the execution of relational queries, which are expressed in the structured query language, SQL (http://en.wikipedia.org/wiki/SQL). It provides a high-level abstraction known as DataFrames, which is a distributed collection of data organized into named columns. For more information, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html.

SparkR: R (https://en.wikipedia.org/wiki/R_(programming_language)) is a popular programming language used for statistical computing and performing machine learning tasks.
However, the execution of the R language is single threaded, which makes it difficult to leverage in order to process large data (TBs or PBs). R can only process the data that fits into the memory of a single machine. In order to overcome the limitations of R, Spark introduced a new extension: SparkR. SparkR provides an interface to invoke and leverage Spark distributed execution engine from R, which allows us to run large-scale data analysis from the R shell. For more information, refer to http://spark.apache.org/docs/latest/sparkr.html. All the previously listed Spark extension/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using APIs that are exposed by the extensions. Apart from the earlier extensions, Spark also provides various other external packages that are developed and provided by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched and downloaded from http://spark-packages.org/. Spark packages provide libraries/packages for integration with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Let's move on to the next section where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components. Spark packaging structure and core APIs In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss core packages and APIs, which will be frequently used by the architects and developers to develop custom applications with Spark. Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides the equivalent APIs in Java and Python as well. For brevity, we will only talk about the Scala and Java APIs, and for Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html. A high-level Spark code base is divided into the following two packages: Spark extensions: All APIs for a particular extension is packaged in its own package structure. For example, all APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apcahe.spark.sql.*, Spark GraphX—org.apache.spark.graphx.*. For more information, refer to http://tinyurl.com/q2wgar8 for Scala APIs and http://tinyurl.com/nc4qu5l for Java APIs. Spark Core: Spark Core is the heart of Spark and provides two basic components: SparkContext and SparkConfig. Both of these components are used by each and every standard or customized Spark job or Spark library and extension. The terms/concepts Context and Config are not new and more or less they have now become a standard architectural pattern. By definition, a Context is an entry point of the application that provides access to various resources/features exposed by the framework, whereas a Config contains the application configurations, which helps define the environment of the application. Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core: org.apache.spark: This is the base package for all Spark APIs that contains a functionality to create/distribute/submit Spark jobs on the cluster. org.apache.spark.SparkContext: This is the first statement in any Spark job/application. 
It defines the SparkContext and then the custom business logic that is provided in the job/application. The entry point for accessing any of the Spark features that we may want to use or leverage is SparkContext, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. The SparkContext is immutable, which means that it cannot be changed or modified once it is started.

org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load the data from HDFS or the local filesystem or Scala collections, and finally, create an RDD on which various operations such as map, filter, join, and persist can be invoked. RDD also defines some useful child classes within the org.apache.spark.rdd.* package, such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDD in the subsequent sections.

org.apache.spark.annotation: This package contains the annotations, which are used within the Spark API. This is an internal Spark package, and it is recommended that you do not use the annotations defined in this package while developing your custom Spark jobs. The three main annotations defined within this package are as follows:

DeveloperAPI: All those APIs/methods, which are marked with DeveloperAPI, are for advanced usage where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark.
Experimental: All functions/APIs marked as Experimental are officially not adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases.
AlphaComponent: The functions/APIs, which are still being tested by the Spark community, are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases.

org.apache.spark.broadcast: This is one of the most important packages, which is frequently used by developers in their custom Spark jobs. It provides the API for sharing read-only variables across the Spark jobs. Once the variables are defined and broadcast, they cannot be changed. Broadcasting the variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead.

Spark provides two different types of implementations of broadcasts: HttpBroadcast and TorrentBroadcast. The HttpBroadcast broadcast leverages the HTTP server to fetch/retrieve the data from the Spark driver. In this mechanism, the broadcast data is fetched through an HTTP server running at the driver itself and further stored in the executor block manager for faster access. The TorrentBroadcast broadcast, which is also the default implementation of the broadcast, maintains its own block manager. The first request to access the data makes the call to its own block manager, and if not found, the data is fetched in chunks from the executor or driver.
It works on the principle of BitTorrent and ensures that the driver is not the bottleneck in fetching the shared variables and data. Spark also provides accumulators, which work like broadcast, but provide updatable variables shared across the Spark jobs but with some limitations. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator. org.apache.spark.io: This provides implementation of various compression libraries, which can be used at block storage level. This whole package is marked as Developer API, so developers can extend and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy. org.apache.spark.scheduler: This provides various scheduler libraries, which help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler defines the stage-oriented scheduling where it keeps track of the completion of each RDD and the output of each stage and then computes DAG, which is further submitted to the underlying org.apache.spark.scheduler.TaskScheduler API that executes them on the cluster. org.apache.spark.storage: This provides APIs for structuring, managing, and finally, persisting the data stored in RDD within blocks. It also keeps tracks of data and ensures that it is either stored in memory, or if the memory is full, it is flushed to the underlying persistent storage area. org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2 with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations. Spark execution model – master worker view Let's move on to the next section where we will dive deep into the Spark execution model, and we will also talk about various other Spark components. Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's also discuss its major components, which are used to configure the Spark cluster, and at the same time, they will be used to submit and execute our Spark jobs. The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job: Spark driver: This is the client program, which defines SparkContext. The entry point for any job that defines the environment/configuration and the dependencies of the submitted job is SparkContext. It connects to the cluster manager and requests resources for further execution of the jobs. Cluster manager/resource manager/Spark master: The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of various jobs running by the worker nodes. Spark worker/executors: A worker actually executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. 
The following diagram shows the high-level components and the master worker view of Spark: The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of the Spark job. Although all the components are important, but let's briefly discuss the cluster/resource manager, as it defines the deployment model and allocation of resources to our submitted jobs. Spark enables and provides flexibility to choose our resource manager. As of Spark 1.5.1, the following are the resource managers or deployment models that are supported by Spark: Apache Mesos: Apache Mesos (http://mesos.apache.org/) is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes. Apache Mesos and Spark are closely related to each other (but they are not the same). The story started way back in 2009 when Mesos was ready and there were talks going on about the ideas/frameworks that can be developed on top of Mesos, and that's exactly how Spark was born. Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos. Hadoop YARN: Hadoop 2.0 (http://tinyurl.com/qypb4xm), also known as YARN, was a complete change in the architecture. It was introduced as a generic cluster computing framework that was entrusted with the responsibility of allocating and managing the resources required to execute the varied jobs or applications. It introduced new daemon services, such as the resource manager (RM), node manager (NM), and application master (AM), which are responsible for managing cluster resources, individual nodes, and respective applications. YARN also introduced specific interfaces/guidelines for application developers where they can implement/follow and submit or execute their custom applications on the YARN cluster. The Spark framework implements the interfaces exposed by YARN and provides the flexibility of executing the Spark applications on YARN. Spark applications can be executed in the following two different modes in YARN: YARN client mode: In this mode, the Spark driver executes the client machine (the machine used for submitting the job), and the YARN application master is just used for requesting the resources from YARN. All our logs and sysouts (println) are printed on the same console, which is used to submit the job. YARN cluster mode: In this mode, the Spark driver runs inside the YARN application master process, which is further managed by YARN on the cluster, and the client can go away just after submitting the application. Now as our Spark driver is executed on the YARN cluster, our application logs/sysouts (println) are also written in the log files maintained by YARN and not on the machine that is used to submit our Spark job. For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html. Standalone mode: The Core Spark distribution contains the required APIs to create an independent, distributed, and fault tolerant cluster without any external or third-party libraries or dependencies. Local mode: Local mode should not be confused with standalone mode. In local mode, Spark jobs can be executed on a local machine without any special cluster setup by just passing local[N] as the master URL, where N is the number of parallel threads. 
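To make the local mode concrete, here is a minimal sketch (added for illustration; the article's own examples use Scala and Java) that uses Spark's Python API to point an application at local[2], that is, two worker threads on the local machine:

# Minimal PySpark sketch of running in local mode with two threads.
# Assumes a Spark installation with the pyspark package on the Python path.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Local Mode Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

# A trivial job, just to show that the context works.
print(sc.parallelize(range(10)).count())  # prints 10

sc.stop()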
Writing and executing our first Spark program

In this section, we will install/configure and write our first Spark program in Java and Scala.

Hardware requirements

Spark supports a variety of hardware and software platforms. It can be deployed on commodity hardware and also supports deployments on high-end servers. Spark clusters can be provisioned either on cloud or on-premises. Though there is no single configuration or standard that can guide us through the requirements of Spark, to create and execute the Spark examples provided in this article, it would be good to have a laptop/desktop/server with the following configuration:

RAM: 8 GB.
CPU: Dual core or Quad core.
DISK: SATA drives with a capacity of 300 GB to 500 GB with 15 k RPM.
Operating system: Spark supports a variety of platforms that include various flavors of Linux (Ubuntu, HP-UX, RHEL, and many more) and Windows. For our examples, we recommend that you use Ubuntu for the deployment and execution of examples.

Spark core is coded in Scala, but it offers several development APIs in different languages, such as Scala, Java, and Python, so that developers can choose their preferred weapon for coding. The dependent software may vary based on the programming language, but there is still a common set of software for configuring the Spark cluster and then language-specific software for developing Spark jobs. In the next section, we will discuss the software installation steps required to write/execute Spark jobs in Scala and Java on Ubuntu as the operating system.

Installation of the basic software

In this section, we will discuss the various steps required to install the basic software, which will help us in the development and execution of our Spark jobs.

Spark

Perform the following steps to install Spark:

Download the Spark compressed tarball from http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.4.tgz.
Create a new directory spark-1.5.1 on your local filesystem and extract the Spark tarball into this directory.
Execute the following command on your Linux shell in order to set SPARK_HOME as an environment variable:

export SPARK_HOME=<Path of Spark install Dir>

Now, browse your SPARK_HOME directory and it should look similar to the following screenshot:

Java

Perform the following steps to install Java:

Download and install Oracle Java 7 from http://www.oracle.com/technetwork/java/javase/install-linux-self-extracting-138783.html.
Execute the following command on your Linux shell to set JAVA_HOME as an environment variable:

export JAVA_HOME=<Path of Java install Dir>

Scala

Perform the following steps to install Scala:

Download the Scala 2.10.5 compressed tarball from http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.7758962.1104547853.1428884173.
Create a new directory, Scala 2.10.5, on your local filesystem and extract the Scala tarball into this directory.
Execute the following commands on your Linux shell to set SCALA_HOME as an environment variable, and add the Scala compiler to the $PATH system:

export SCALA_HOME=<Path of Scala install Dir>

Next, execute the command in the following screenshot to ensure that the Scala runtime and Scala compiler are available and the version is 2.10.x:

Spark 1.5.1 supports the 2.10.5 version of Scala, so it is advisable to use the same version to avoid any runtime exceptions due to mismatch of libraries.
Eclipse

Perform the following steps to install Eclipse:

Based on your hardware configuration, download Eclipse Luna (4.4) from http://www.eclipse.org/downloads/packages/eclipse-ide-java-eedevelopers/lunasr2:
Next, install the IDE for Scala in Eclipse itself so that we can write and compile our Scala code inside Eclipse (http://scala-ide.org/download/current.html).

We are now done with the installation of all the required software. Let's move on and configure our Spark cluster.

Configuring the Spark cluster

The first step to configure the Spark cluster is to identify the appropriate resource manager. We discussed the various resource managers in the Spark execution model – master worker view section (YARN, Mesos, and standalone). Standalone is the most preferred resource manager for development because it is simple/quick and does not require installation of any other component or software. We will also configure the standalone resource manager for all our Spark examples, and for more information on YARN and Mesos, refer to the Spark execution model – master worker view section.

Perform the following steps to bring up an independent cluster using Spark binaries:

The first step to set up the Spark cluster is to bring up the master node, which will track and allocate the system's resources. Open your Linux shell and execute the following command:

$SPARK_HOME/sbin/start-master.sh

The preceding command will bring up your master node, and it will also enable a UI, the Spark UI, to monitor the nodes/jobs in the Spark cluster, http://<host>:8080/. The <host> is the domain name of the machine on which the master is running.

Next, let's bring up our worker node, which will execute our Spark jobs. Execute the following command on the same Linux shell:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker <Spark-Master> &

In the preceding command, replace the <Spark-Master> with the Spark URL, which is shown at the top of the Spark UI, just beside Spark master at. The preceding command will start the Spark worker process in the background and the same will also be reported in the Spark UI.

The Spark UI shown in the preceding screenshot shows the three different sections, providing the following information:

Workers: This reports the health of a worker node, which is alive or dead, and also provides drill-down to query the status and detailed logs of the various jobs executed by that specific worker node
Running applications: This shows the applications that are currently being executed in the cluster and also provides drill-down and enables viewing of application logs
Completed applications: This is the same functionality as running applications; the only difference being that it shows the jobs, which are finished

We are done!!! Our Spark cluster is up and running and ready to execute our Spark jobs with one worker node. Let's move on and write our first Spark application in Scala and Java and further execute it on our newly created cluster.

Coding Spark job in Scala

In this section, we will code our first Spark job in Scala, and we will also execute the same job on our newly created Spark cluster and will further analyze the results. This is our first Spark job, so we will keep it simple. We will use the Chicago crimes dataset for August 2015 and will count the number of crimes reported in August 2015.

Perform the following steps to code the Spark job in Scala for aggregating the number of crimes in August 2015:

Open Eclipse and create a Scala project called Spark-Examples.
Expand your newly created project and modify the version of the Scala library container to 2.10. This is done to ensure that the version of the Scala libraries used by Spark and the custom jobs developed/deployed are the same.
Next, open the properties of your project Spark-Examples and add the dependencies for all the libraries packaged with the Spark distribution, which can be found at $SPARK_HOME/lib.
Next, create a chapter.six Scala package, and in this package, define a new Scala object by the name of ScalaFirstSparkJob.
Define a main method in the Scala object and also import SparkConf and SparkContext.
Now, add the following code to the main method of ScalaFirstSparkJob:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ScalaFirstSparkJob {

  def main(args: Array[String]) {
    println("Creating Spark Configuration")
    // Create an Object of Spark Configuration
    val conf = new SparkConf()
    // Set the logical and user defined Name of this Application
    conf.setAppName("My First Spark Scala Application")
    println("Creating Spark Context")
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    val ctx = new SparkContext(conf)
    // Define the location of the file containing the Crime Data
    val file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"
    println("Loading the Dataset and will further process it")
    // Loading the Text file from the local file system or HDFS
    // and converting it into RDD.
    // SparkContext.textFile(..) - It uses Hadoop's TextInputFormat
    // and the file is broken by the newline character.
    // Refer to http://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapred/TextInputFormat.html
    // The second argument is the number of partitions, which specifies the parallelism.
    // It should be equal to or more than the number of cores in the cluster.
    val logData = ctx.textFile(file, 2)
    // Invoking the filter operation on the RDD and counting the number
    // of lines in the data loaded in the RDD.
    // Simply returning true as "TextInputFormat" has already divided the data by "\n",
    // so each record will have only 1 line.
    val numLines = logData.filter(line => true).count()
    // Finally, printing the number of lines.
    println("Number of Crimes reported in Aug-2015 = " + numLines)
  }
}

We are now done with the coding! Our first Spark job in Scala is ready for execution.
Now, from Eclipse itself, export your project as a .jar file, name it spark-examples.jar, and save this .jar file in the root of $SPARK_HOME.
Next, open your Linux console, go to $SPARK_HOME, and execute the following command:

$SPARK_HOME/bin/spark-submit --class chapter.six.ScalaFirstSparkJob --master spark://ip-10-166-191-242:7077 spark-examples.jar

In the preceding command, ensure that the value given to the --master parameter is the same as it is shown on your Spark UI. The spark-submit is a utility script, which is used to submit Spark jobs to the cluster.

As soon as you press Enter and execute the preceding command, you will see a lot of activity (log messages) on the console, and finally, you will see the output of your job at the end:

Isn't that simple! As we move forward and discuss Spark more, you will appreciate the ease of coding and simplicity provided by Spark for creating, deploying, and running jobs in a distributed framework.

Your completed job will also be available for viewing at the Spark UI:

The preceding image shows the status of our first Scala job on the UI. Now let's move forward and develop the same job using the Spark Java APIs.
Coding Spark job in Java

Perform the following steps to code the Spark job in Java for aggregating the number of crimes in August 2015:

Open your Spark-Examples Eclipse project (created in the previous section).
Add a new Java file, chapter.six.JavaFirstSparkJob, and add the following code snippet:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class JavaFirstSparkJob {

  public static void main(String[] args) {
    System.out.println("Creating Spark Configuration");
    // Create an Object of Spark Configuration
    SparkConf javaConf = new SparkConf();
    // Set the logical and user defined Name of this Application
    javaConf.setAppName("My First Spark Java Application");
    System.out.println("Creating Spark Context");
    // Create a Spark Context and provide the previously created
    // Object of SparkConf as a reference.
    JavaSparkContext javaCtx = new JavaSparkContext(javaConf);
    System.out.println("Loading the Crime Dataset and will further process it");
    String file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv";
    JavaRDD<String> logData = javaCtx.textFile(file);
    // Invoking the filter operation on the RDD and counting the number
    // of lines in the data loaded in the RDD.
    // Simply returning true as "TextInputFormat" has already divided the data by "\n",
    // so each record will have only 1 line.
    long numLines = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return true;
      }
    }).count();
    // Finally, printing the number of lines
    System.out.println("Number of Crimes reported in Aug-2015 = " + numLines);
    javaCtx.close();
  }
}

Next, compile the preceding JavaFirstSparkJob from Eclipse itself and perform steps 7, 8, and 9 of the previous section in which we executed the Spark Scala job.

We are done! Analyze the output on the console; it should be the same as the output of the Scala job, which we executed in the previous section.

Troubleshooting – tips and tricks

In this section, we will talk about troubleshooting tips and tricks, which are helpful in solving the most common errors encountered while working with Spark.

Port numbers used by Spark

Spark binds various network ports for communication within the cluster/nodes and also exposes the monitoring information of jobs to developers and administrators. There may be instances where the default ports used by Spark may not be available or may be blocked by the network firewall, which in turn will result in modifying the default Spark ports for the master/worker or driver. Here is a list of all the ports utilized by Spark and their associated parameters, which need to be configured for any changes (http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security).

Classpath issues – class not found exception

Classpath is the most common issue and it occurs frequently in distributed applications. Spark and its associated jobs run in a distributed mode on a cluster.
So, if your Spark job is dependent upon external libraries, then we need to ensure that we package them into a single JAR file and place it in a common location or the default classpath of all worker nodes, or define the path of the JAR file within SparkConf itself:

val sparkConf = new SparkConf().setAppName("myapp").setJars(<path of Jar file>)

Other common exceptions

In this section, we will talk about a few of the common errors/issues/exceptions encountered by architects/developers when they set up Spark or execute Spark jobs:

Too many open files: Increase the ulimit on your Linux OS by executing sudo ulimit -n 20000.
Version of Scala: Spark 1.5.1 supports Scala 2.10, so if you have multiple versions of Scala deployed on your box, then ensure that all versions are the same, that is, Scala 2.10.
Out of memory on workers in standalone mode: Configure SPARK_WORKER_MEMORY in $SPARK_HOME/conf/spark-env.sh. By default, it provides a total memory of 1 G to workers, but at the same time, you should analyze and ensure that you are not loading or caching too much data on worker nodes.
Out of memory in applications executed on worker nodes: Configure spark.executor.memory in your SparkConf, as follows:

val sparkConf = new SparkConf().setAppName("myapp")
  .set("spark.executor.memory", "1g")

The preceding tips will help you solve basic issues in setting up Spark clusters, but as you move ahead, there could be more complex issues, which are beyond the basic setup, and for all those issues, post your queries at http://stackoverflow.com/questions/tagged/apache-spark or mail at [email protected].

Summary

In this article, we discussed the architecture of Spark and its various components. We also configured our Spark cluster and executed our first Spark job in Scala and Java.

Resources for Article:

Further resources on this subject:
Data mining [article]
Python Data Science Up and Running [article]
The Design Patterns Out There and Setting Up Your Environment [article]

Python Data Analysis Utilities

Packt
17 Feb 2016
13 min read
After the success of the book Python Data Analysis, Packt's acquisition editor Prachi Bisht gauged the interest of the author, Ivan Idris, in publishing Python Data Analysis Cookbook. According to Ivan, Python Data Analysis is one of his best books. Python Data Analysis Cookbook is meant for slightly more experienced Pythonistas and is written in the cookbook format. In the year after the release of Python Data Analysis, Ivan received a lot of feedback, most of it positive as far as he is concerned.

Although Python Data Analysis covers a wide range of topics, Ivan still had to leave out a lot of subjects. He realized that he needed a library to serve as a toolbox. He named it dautil, for data analysis utilities, and distributed it via PyPI so that it is installable via pip/easy_install. As you know, Python 2 will no longer be supported after 2020, so dautil is based on Python 3. For the sake of reproducibility, Ivan also published a Docker repository named pydacbk (for Python Data Analysis Cookbook). The repository represents a virtual image with preinstalled software. For practical reasons, the image doesn't contain all the software, but it still contains a fair percentage.

This article has the following sections:

Data analysis, data science, big data – what is the big deal?
A brief history of data analysis with Python
A high-level overview of dautil
IPython notebook utilities
Downloading data
Plotting utilities
Demystifying Docker
Future directions

(For more resources related to this topic, see here.)

Data analysis, data science, big data – what is the big deal?

You've probably seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and computer science. You could perform data analysis with a pen and paper and, in more modern times, with a pocket calculator.

Data analysis has many aspects, with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. Data growth is caused by the growth of the world's population and the rise of new technologies, such as social media and mobile devices. Data growth is, in fact, probably the only trend that we can be sure will continue. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved. Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since more data will be created in time (and not destroyed), we can expect an increase in automated data analysis.

A brief history of data analysis with Python

The history of the various Python software libraries is quite interesting.
I am not a historian, so the following notes are written from my own perspective:

1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas hobby project.
1995: Jim Hugunin creates Numeric, the predecessor to NumPy.
1999: Pearu Peterson writes f2py as a bridge between Fortran and Python.
2000: Python 2.0 is released.
2001: The SciPy library is released. Also, Numarray, a competing library to Numeric, is created. Fernando Perez releases IPython, which starts out as an afternoon hack. NLTK is released as a research project.
2002: John Hunter creates the matplotlib library.
2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray.
2006: NumPy 1.0 is released. The first version of SQLAlchemy is released.
2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later used intensively in pandas and scikit-learn to improve performance.
2008: Wes McKinney starts working on pandas. Python 3.0 is released.
2011: The IPython 0.12 release introduces the IPython notebook. Packt releases NumPy 1.5 Beginner's Guide.
2012: Packt releases NumPy Cookbook.
2013: Packt releases NumPy Beginner's Guide - Second Edition.
2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt releases Learning NumPy Array and Python Data Analysis.
2015: Packt releases NumPy Beginner's Guide - Third Edition and NumPy Cookbook - Second Edition.

A high-level overview of dautil

The dautil API that Ivan made for this book is a humble toolbox, which he found useful. It is released under the MIT license. This license is very permissive, so you could in theory use the library in a production system. He doesn't recommend doing this currently (as of January 2016), but he believes that the unit tests and documentation are of acceptable quality. The library has 3000+ lines of code and 180+ unit tests with reasonable coverage. He has fixed as many issues reported by pep8 and flake8 as possible. Some of the functions in dautil are on the short side and are of very low complexity. This is on purpose. If there is a second edition (knock on wood), dautil will probably be completely transformed. The API evolved as Ivan wrote the book under high time pressure, so some of the decisions he made may not be optimal in retrospect. However, he hopes that people find dautil useful and, ideally, contribute to it.

The dautil modules are summarized in the following table:

Module           Description                                                                 LOC
dautil.collect   Contains utilities related to collections                                   331
dautil.conf      Contains configuration utilities                                             48
dautil.data      Contains utilities to download and load data                                468
dautil.db        Contains database-related utilities                                          98
dautil.log_api   Contains logging utilities                                                  204
dautil.nb        Contains IPython/Jupyter notebook widgets and utilities                     609
dautil.options   Configures dynamic options of several libraries related to data analysis     71
dautil.perf      Contains performance-related utilities                                      162
dautil.plotting  Contains plotting utilities                                                 382
dautil.report    Contains reporting utilities                                                232
dautil.stats     Contains statistical functions and utilities                                366
dautil.ts        Contains utilities for time series and dates                                217
dautil.web       Contains utilities for web mining and HTML processing                        47

IPython notebook utilities

The IPython notebook has become a standard tool for data analysis.
The dautil.nb module has several interactive IPython widgets to help with LaTeX rendering, the setting of matplotlib properties, and plotting. Ivan has defined a Context class, which represents the configuration settings of the widgets. The settings are stored in a pretty-printed JSON file in the current working directory, which is named dautil.json. This could be extended, maybe even with a database backend. The following is an edited excerpt (so that it doesn't take up a lot of space) of an example dautil.json:

{
    ...
    "calculating_moments": {
        "figure.figsize": [10.4, 7.7],
        "font.size": 11.2
    },
    "calculating_moments.latex": [1, 2, 3, 4, 5, 6, 7],
    "launching_futures": {
        "figure.figsize": [11.5, 8.5]
    },
    "launching_futures.labels": [
        [
            {},
            {"legend": "loc=best", "title": "Distribution of Means"}
        ],
        [
            {"legend": "loc=best", "title": "Distribution of Standard Deviation"},
            {"legend": "loc=best", "title": "Distribution of Skewness"}
        ]
    ],
    ...
}

The Context object can be constructed with a string; Ivan recommends using the name of the notebook, but any unique identifier will do. The dautil.nb.LatexRenderer class also uses the Context class. It is a utility class that helps you number and render LaTeX equations in an IPython/Jupyter notebook, for instance, as follows:

import dautil as dl

lr = dl.nb.LatexRenderer(chapter=12, context=context)
lr.render(r'\delta = x - m')
lr.render(r"m' = m + \frac{\delta}{n}")
lr.render(r"M_2' = M_2 + \delta^2 \frac{n - 1}{n}")
lr.render(r"M_3' = M_3 + \delta^3 \frac{(n - 1)(n - 2)}{n^2} - \frac{3 \delta M_2}{n}")
lr.render(r"M_4' = M_4 + \frac{\delta^4 (n - 1)(n^2 - 3n + 3)}{n^3} + \frac{6 \delta^2 M_2}{n^2} - \frac{4 \delta M_3}{n}")
lr.render(r'g_1 = \frac{\sqrt{n} M_3}{M_2^{3/2}}')
lr.render(r'g_2 = \frac{n M_4}{M_2^2} - 3.')

The following is the result:

Another widget you may find useful is RcWidget, which sets matplotlib settings, as shown in the following screenshot:
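Because dautil.json is ordinary pretty-printed JSON, you can inspect or edit it with nothing more than the standard library. The following sketch only mimics the mechanism described above; it is not dautil's actual implementation, and the key names are taken from the excerpt shown earlier.

import json
from pathlib import Path

settings_file = Path('dautil.json')

# Load the existing settings, or start with an empty dictionary.
settings = json.loads(settings_file.read_text()) if settings_file.exists() else {}

# Adjust the figure size stored for the 'calculating_moments' notebook.
context = settings.setdefault('calculating_moments', {})
context['figure.figsize'] = [10.4, 7.7]

# Write the file back in pretty-printed form.
settings_file.write_text(json.dumps(settings, indent=4, sort_keys=True))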
Downloading data

Sometimes, we require sample data to test an algorithm or prototype a visualization. In the dautil.data module, you will find many utilities for data retrieval. Throughout this book, Ivan has used weather data from the KNMI for the weather station in De Bilt. A couple of the utilities in the module add a caching layer on top of existing pandas functions, such as the ones that download data from the World Bank and Yahoo! Finance (the caching depends on the joblib library and is currently not very configurable). You can also get audio, demographics, Facebook, and marketing data. The data is stored under a special data directory, which depends on the operating system. On the machine used in the book, it is stored under ~/Library/Application Support/dautil. The following example code loads data from the SPAN Facebook dataset and computes the clique number:

import networkx as nx
import dautil as dl

fb_file = dl.data.SPANFB().load()
G = nx.read_edgelist(fb_file, create_using=nx.Graph(), nodetype=int)
print('Graph Clique Number',
      nx.graph_clique_number(G.subgraph(list(range(2048)))))

To understand what is going on in detail, you will need to read the book. In a nutshell, we load the data and use the NetworkX API to calculate a network metric.

Plotting utilities

Ivan visualizes data very often in the book. Plotting helps us get an idea of how the data is structured and helps you form hypotheses or research questions. Often, we want to chart multiple variables, but we want to easily see what is what. The standard solution in matplotlib is to cycle colors. However, Ivan prefers to cycle line widths and line styles as well. The following unit test demonstrates his solution to this issue:

def test_cycle_plotter_plot(self):
    m_ax = Mock()
    cp = plotting.CyclePlotter(m_ax)
    cp.plot([0], [0])
    m_ax.plot.assert_called_with([0], [0], '-', lw=1)
    cp.plot([0], [1])
    m_ax.plot.assert_called_with([0], [1], '--', lw=2)
    cp.plot([1], [0])
    m_ax.plot.assert_called_with([1], [0], '-.', lw=1)

The dautil.plotting module currently also has helper tools for subplots, histograms, regression plots, and dealing with color maps. The following example code (the code for the labels has been omitted) demonstrates a bar chart utility function and a utility function from dautil.data, which downloads stock price data:

import dautil as dl
import numpy as np
import matplotlib.pyplot as plt

ratios = []
STOCKS = ['AAPL', 'INTC', 'MSFT', 'KO', 'DIS', 'MCD', 'NKE', 'IBM']

for symbol in STOCKS:
    ohlc = dl.data.OHLC()
    P = ohlc.get(symbol)['Adj Close'].values
    N = len(P)
    mu = (np.log(P[-1]) - np.log(P[0])) / N

    var_a = 0
    var_b = 0

    # Accumulate the squared deviations for the two sampling intervals.
    for k in range(1, N):
        var_a = var_a + (np.log(P[k]) - np.log(P[k - 1]) - mu) ** 2

    var_a = var_a / N

    for k in range(1, N // 2):
        var_b = var_b + (np.log(P[2 * k]) - np.log(P[2 * k - 2]) - 2 * mu) ** 2

    var_b = var_b / N
    ratios.append(var_b / var_a - 1)

_, ax = plt.subplots()
dl.plotting.bar(ax, STOCKS, ratios)
plt.show()

Refer to the following screenshot for the end result:

The code performs a random walk test and calculates the corresponding ratio for a list of stock prices. The data is retrieved whenever you run the code, so you may get different results. Some of you have an aversion to finance, but rest assured that this book has very little finance-related content.

The following script demonstrates a linear regression utility and a caching downloader for World Bank data (the code for the watermark and plot labels has been omitted):

import dautil as dl
import matplotlib.pyplot as plt
import numpy as np

wb = dl.data.Worldbank()
countries = wb.get_countries()[['name', 'iso2c']]
inf_mort = wb.get_name('inf_mort')
gdp_pcap = wb.get_name('gdp_pcap')
df = wb.download(country=countries['iso2c'],
                 indicator=[inf_mort, gdp_pcap],
                 start=2010, end=2010).dropna()
loglog = df.applymap(np.log10)
x = loglog[gdp_pcap]
y = loglog[inf_mort]

dl.options.mimic_seaborn()
fig, [ax, ax2] = plt.subplots(2, 1)
ax.set_ylim([0, 200])
ax.scatter(df[gdp_pcap], df[inf_mort])
ax2.scatter(x, y)
dl.plotting.plot_polyfit(ax2, x, y)
plt.show()

The following image should be displayed by the code:

The program downloads World Bank data for 2010 and plots the infant mortality rate against the GDP per capita. Also shown is a linear fit of the log-transformed data.
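If you want the effect of CyclePlotter without dautil, you can cycle line styles and widths yourself with itertools and plain matplotlib. This is only a sketch of the idea, not dautil's implementation; the styles and widths mirror the ones asserted in the unit test above, and the sine curves are just placeholder data.

import itertools

import matplotlib.pyplot as plt
import numpy as np

styles = itertools.cycle(['-', '--', '-.'])
widths = itertools.cycle([1, 2, 1])

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

for shift in range(4):
    # Each series picks the next style/width combination from the cycles.
    ax.plot(x, np.sin(x + shift), linestyle=next(styles),
            linewidth=next(widths), label='shift = {}'.format(shift))

ax.legend(loc='best')
plt.show()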
Demystifying Docker

Docker uses Linux kernel features to provide an extra virtualization layer. It was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X as well. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. Ivan's Docker image, which is mentioned in the introduction, is based on the continuumio/miniconda3 Docker image. The Docker installation docs are at https://docs.docker.com/index.html.

Once you install Boot2Docker, you need to initialize it. This is only necessary once, and Linux users don't need this step:

$ boot2docker init

The next step for Mac OS X and Windows users is to start the VM:

$ boot2docker start

Check the Docker environment by starting a sample container:

$ docker run hello-world

Docker images are organized in a repository, which resembles GitHub. A producer pushes images and a consumer pulls images. You can pull Ivan's repository with the following command. The size is currently 387 MB.

$ docker pull ivanidris/pydacbk

Future directions

The dautil API consists of items Ivan thinks will be useful outside of the context of this book. Certain functions and classes that he felt were only suitable for a particular chapter are placed in separate per-chapter modules, such as ch12util.py. In retrospect, parts of those modules may need to be included in dautil as well. In no particular order, Ivan has the following ideas for future dautil development:

He is playing with the idea of creating a parallel library with "Cythonized" code, but this depends on how dautil is received.
Adding more data loaders as required.
There is a whole range of streaming (or online) algorithms that he thinks should be included in dautil as well.
The GUI of the notebook widgets should be improved and extended.
The API should have more configuration options and be easier to configure.

Summary

In this article, Ivan roughly sketched what data analysis, data science, and big data are about. This was followed by a brief history of data analysis with Python. Then, he started explaining dautil, the API he made to help him with this book. He gave a high-level overview and some examples of the IPython notebook utilities, the features for downloading data, and the plotting utilities. He used Docker for testing and for giving readers a reproducible data analysis environment, so he spent some time on that topic too. Finally, he mentioned the possible future directions that could be taken for the library in order to guide anyone who wants to contribute.

Resources for Article:

Further resources on this subject:
Recommending Movies at Scale (Python) [article]
Python Data Science Up and Running [article]
Making Your Data Everything It Can Be [article]

Hive Security

Packt
17 Feb 2016
13 min read
In this article by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra, the authors of the book Apache Hive Cookbook, we will cover the following recipes:

Securing Hadoop
Authorizing Hive

Security is a major concern in all big data frameworks. It is a little complex to implement security in distributed systems because the components on different machines need to communicate with each other. It is very important to enable security on the data.

(For more resources related to this topic, see here.)

Securing Hadoop

In today's era of big data, most organizations are concentrating on using Hadoop as a centralized data store. Data size is growing day by day, and organizations want to derive insights and make decisions using this information. While everyone is focusing on collecting the data, having all of it in a centralized place increases the risk to data security. Securing data access in the Hadoop Distributed File System (HDFS) is therefore very important. Hadoop security means restricting access to data to only authorized users and groups. When we talk about security, there are two major aspects: authentication and authorization.

HDFS supports a permission model for files and directories that is largely equivalent to the standard POSIX model. Similar to UNIX permissions, each file and directory in HDFS is associated with an owner, a group, and other users. There are three types of permissions in HDFS: read, write, and execute. In contrast to the UNIX permission model, there is no concept of executable files. For files, read (r) permission is required to read a file, and write (w) permission is required to write or append to a file. For directories, read (r) permission is required to list the contents of the directory, write (w) permission is required to create or delete files or subdirectories, and execute (x) permission is required to access the child objects (files/subdirectories) of that directory. The following screenshot shows the level of access for each individual entity, namely OWNER, GROUP, and OTHER:

The Default HDFS Permission Model

As illustrated in the previous screenshot, by default, the permission set for the owner of files or directories is rwx (7), which means the owner of the file or directory has full permission to read, write, and execute. For the members of the group, the permission set is r-x, which means group members can only read and execute the files or directories and cannot write or update anything in them. For other users, the permission set is the same as for the group: they can only read and execute the files or directories and cannot write or update anything in them.

Although this basic permission model is sufficient to handle a large number of security requirements at the block level, it does not let you define finer-grained security for specifically named users or groups. HDFS also has a feature to configure Access Control Lists (ACLs), which can be used to define fine-grained permissions at the file level as well as the directory level for specifically named users or groups. For example, if you want to give read access to the users John, Mike, and Kate, then HDFS ACLs can be used to define such permissions. HDFS ACLs are designed on the basic concepts of the POSIX ACLs of UNIX systems.

How to do it…

First of all, you will need to enable ACLs in Hadoop.
To enable ACL permissions, configure the following property in the Hadoop configuration file named hdfs-site.xml, located at <HADOOP_HOME>/etc/hadoop/hdfs-site.xml:

<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>

There are two main commands that are used to configure ACLs: setfacl and getfacl. The setfacl command is used to set Finer Access Control Lists (FACL) for files or directories, and getfacl is used to retrieve Finer Access Control Lists (FACL) for files or directories. Let's see how to use these commands:

hdfs dfs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

The same command can be run using hadoop fs as well, as follows:

hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]

This command contains the following elements:

-R is used to apply the operation recursively to all files and subdirectories under a directory.
-b is used to remove all ACLs except the base ACLs.
-k is used to remove the default ACLs.
-m is used to modify ACLs. Using this option, new entries are added to the existing set of ACLs.
-x is used to remove specific ACLs.
acl_specification is a comma-separated list of ACLs.
path is the path of the file or directory to which the ACL has to be applied.
--set is used to set new ACLs. It removes all existing ACLs and sets the new ACLs only.

Now, let's see another command that is used to retrieve the ACLs:

hdfs dfs -getfacl [-R] <path>

This command can also be run using hadoop fs, as follows:

hadoop fs -getfacl [-R] <path>

This command contains the following elements:

-R is used to retrieve ACLs recursively for all files and subdirectories under a directory.
path is the path of the file or directory whose ACLs are to be retrieved.

The getfacl command will list all default ACLs as well as any new ACLs defined for the specified files or directories.

How it works…

If ACLs are defined for a file or directory, then while accessing that file/directory, access is validated according to the following algorithm (a short code sketch of this order follows the list):

If the username is the same as the owner name of the file, then the owner permissions are enforced.
If the username matches one of the named user ACL entries, then those permissions are enforced.
If the user's group name matches one of the named group ACL entries, then those permissions are enforced.
In case multiple ACL entries are found for a user, the union of all those permissions is enforced.
If no ACL entry is found for a user, then the other permissions are enforced.
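The following is a small Python model of this evaluation order. It is only an illustration of the rules listed above (it ignores the mask entry discussed later), not HDFS code, and the users and groups in the example are made up.

def effective_permissions(user, groups, owner, named_users, named_groups, other):
    """Return the permission string enforced for `user` under the rules above.

    owner is a (name, perms) tuple, named_users and named_groups are dicts
    such as {'mike': 'rw-'}, and other is a plain permission string.
    """
    owner_name, owner_perms = owner
    if user == owner_name:              # rule 1: owner entry
        return owner_perms
    if user in named_users:             # rule 2: named user entry
        return named_users[user]
    matched = [named_groups[g] for g in groups if g in named_groups]
    if matched:
        # rules 3 and 4: union of matching group entries
        # (per position, any granted letter beats '-')
        return ''.join(max(chars) for chars in zip(*matched))
    return other                        # rule 5: other entry


# The user mike gets rw- from his named user entry, not from the group or other entry.
print(effective_permissions('mike', ['analysts'],
                            owner=('hdfs', 'rw-'),
                            named_users={'mike': 'rw-'},
                            named_groups={'analysts': 'r--'},
                            other='r--'))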
Let's assume that we have a file named stock-data containing stock market data. To retrieve all ACLs of this file, run the following command, the output of which is shown in the screenshot given later:

$ hadoop fs -getfacl /stock-data

Because we have not defined any custom ACL for this file, as shown in the previous screenshot, the command will return the default ACL for this file.

You can check the permissions of a file or directory using the ls command as well. As shown in the previous screenshot, the permission set for the stock-data file is -rw-r--r--, which means read and write access for the owner as well as read access for group members and others.

In the following command, we give read and write access to a user named mike, and the result is shown in the following screenshot:

$ hadoop fs -setfacl -m user:mike:rw- /stock-data

As shown in the previous screenshot, first, we defined the ACLs for the user mike using the setfacl command; then, we retrieved the ACLs using the getfacl command. The output of the getfacl command lists all default permissions as well as all ACLs. We defined ACLs for the user mike, so the output contains an extra row, user:mike:rw-.

There is another extra row in the output, mask::rw-, which defines the special mask ACL entry. The mask is a special type of ACL entry that filters the access for all named users, named groups, and the unnamed group. If you have not defined a mask ACL, then its value is calculated as the union of all those permissions.

In addition to this, the output of the ls command also changes after defining ACLs. There is an extra plus (+) sign in the permissions list, which indicates that there are additional ACLs defined for this file or directory.

To revoke the access of the user mike, the -x option is used with the setfacl command to remove a specific ACL:

$ hadoop fs -setfacl -x user:mike /stock-data

In the previous screenshot, after revoking the access of the user mike, the ACLs are updated, and there is no entry for the user mike now.

See also

You can read more about the permission model in Hadoop at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html.

Authorizing Hive

Hive authorization is about verifying that a user is authorized to perform a particular action. Authentication is about verifying the identity of a user, which is a different concept from authorization. Hive can be used in the following different ways:

Using the HCatalog API: Hive's HCatalog API is used to access Hive by many other frameworks, such as Apache Pig, MapReduce, Facebook Presto, Spark SQL, and Cloudera Impala. Using the HCatalog API, users have direct access to HDFS data and Hive metadata. Hive metadata is directly accessible using the metastore server API.

Using the Hive CLI: Using the Hive CLI, too, users have direct access to HDFS data and Hive metadata. The Hive CLI directly interacts with the Hive metastore server. Currently, the Hive CLI doesn't support rich authorization. In the next versions of Hive, the Hive CLI's implementation will be changed to provide better security, and the Hive CLI will interact with HiveServer2 rather than directly interacting with the metastore server.

Using ODBC/JDBC and other HiveServer2 clients such as Beeline: These clients don't have direct access to HDFS data and metadata; they go through HiveServer2. For security purposes, this is the best way to access Hive. A small sketch of connecting this way from Python follows this list.
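As an illustration of the HiveServer2 route, here is one way to connect from Python. It assumes the third-party PyHive package, which is not used elsewhere in this article; the host, user, and query are illustrative, and 10000 is HiveServer2's default port.

from pyhive import hive

# Connect to HiveServer2 rather than talking to the metastore directly.
connection = hive.Connection(host='localhost', port=10000, username='hive')
cursor = connection.cursor()

cursor.execute('SHOW DATABASES')
for row in cursor.fetchall():
    print(row)

cursor.close()
connection.close()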
How to do it…

The following are the various ways of authorization in Hive:

Default authorization – the legacy mode: The legacy authorization mode was available in earlier versions of Hive. This authorization scheme prevents users from performing some unwanted actions, but it doesn't stop malicious users from performing malicious activities. It manages access control using grant and revoke statements. This mode supports the Hive Command Line Interface (Hive CLI). In the case of the Hive CLI, users have direct access to HDFS files and directories, so they can easily break the security checks. Also, in this model, the permissions needed for a user to grant privileges are not defined, which means that any user can grant access to themselves, so it is not secure to use this model.

Storage-based authorization: From a storage perspective, both HDFS data and Hive metadata must be accessible only to authorized users. If users use the HCatalog API or the Hive CLI, then they have direct access to the data. To protect the data, HDFS ACLs can be enabled. In this mode, HDFS permissions work as a single source of truth for protecting data. Generally, in the Hive metastore, database credentials are configured in the Hive configuration file hive-site.xml. Malicious users can easily read the metastore credentials and then cause serious damage to the data as well as the metadata, so the Hive metastore server can also be secured. In this authorization mode, you can also enable security at the metastore level. After enabling metastore security, it will restrict access to metadata objects by verifying that users have the system permissions corresponding to the different files and directories of the metadata objects.

To configure storage-based authorization, set the following properties in the hive-site.xml file:

Property                                           Value
hive.metastore.pre.event.listeners                 org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
hive.security.metastore.authorization.manager      org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
hive.security.metastore.authenticator.manager      org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
hive.security.metastore.authorization.auth.reads   true

After setting all these configurations, the Hive configuration file hive-site.xml will look as follows:

<configuration>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
  </property>
</configuration>

hive.metastore.pre.event.listeners: This property is used to define the pre-event listener class, which is loaded on the metastore side. APIs of this class are executed before any event occurs, such as creating a database, table, or partition; altering a database, table, or partition; or dropping a database, table, or partition. Configuring this property turns on security at the metastore level. Set the value of this property to org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener.

hive.security.metastore.authorization.manager: This property is used to define the authorization provider class for metastore security. The default value of this property is DefaultHiveMetastoreAuthorizationProvider, which provides the default legacy authorization described in the previous bullet. To enable storage-based authorization based on Hadoop ACLs, set the value of this property to org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider. You can also write your own custom class to manage authorization and configure this property to enable the custom authorization manager. The custom authorization manager class must implement the interface org.apache.hadoop.hive.ql.security.authorization.HiveMetastoreAuthorizationProvider.

hive.security.metastore.authenticator.manager: This property is used to define an authentication manager class. Set the value of this property to org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator. You can also write your own custom class to manage authentication and configure it with this property.
The custom authentication manager class must implement the interface org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.

hive.security.metastore.authorization.auth.reads: This property is used to define whether metastore authorization should check for read access or not. The default value of this property is true.

SQL standard-based authorization: SQL standard-based authorization is the third way of authorizing Hive. Although the previous methodology, storage-based authorization, also provides access control at the level of partitions, tables, and databases, it does not provide access control at a more granular level, such as columns and rows. This is because storage-based authorization depends on the access control provided by HDFS using ACLs, which control access at the level of files and directories. SQL standard-based authorization can be used to enforce this kind of fine-grained security. It is recommended because it is fully SQL-compliant in its authorization model.

There's more

Many things can be done with SQL standard-based authorization. Refer to the documentation on SQL standard-based authorization for more details.

Summary

In this article, we covered two recipes: Securing Hadoop and Authorizing Hive. You also learned the terminology of access permissions and their types. You went through the steps to secure Hadoop and learned different ways to perform authorization in Hive.

Resources for Article:

Further resources on this subject:
Hive in Hadoop [article]
Processing Tweets with Apache Hive [article]
Using Hive non-interactively (Simple) [article]

Predicting handwritten digits

Packt
16 Feb 2016
10 min read
Our final application for neural networks will be the handwritten digit prediction task. In this task, the goal is to build a model that will be presented with an image of a numerical digit (0–9) and must predict which digit is being shown. We will use the MNIST database of handwritten digits from http://yann.lecun.com/exdb/mnist/.

(For more resources related to this topic, see here.)

From this page, we have downloaded and unzipped the two training files train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz. The former contains the data from the images and the latter contains the corresponding digit labels. The advantage of using this website is that the data has already been preprocessed by centering each digit in the image and scaling the digits to a uniform size. To load the data, we've used information from the website about the IDX format to write two functions:

read_idx_image_data <- function(image_file_path) {
  con <- file(image_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_images <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_rows <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_cols <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_pixels <- n_images * n_rows * n_cols
  pixels <- readBin(con, what = "integer", n = n_pixels, size = 1, signed = F)
  image_data <- matrix(pixels, nrow = n_images, ncol = n_rows * n_cols, byrow = T)
  close(con)
  return(image_data)
}

read_idx_label_data <- function(label_file_path) {
  con <- file(label_file_path, "rb")
  magic_number <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  n_labels <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  label_data <- readBin(con, what = "integer", n = n_labels, size = 1, signed = F)
  close(con)
  return(label_data)
}

We can then load our two data files by issuing the following two commands:

> mnist_train <- read_idx_image_data("train-images-idx3-ubyte")
> mnist_train_labels <- read_idx_label_data("train-labels-idx1-ubyte")
> str(mnist_train)
 int [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
> str(mnist_train_labels)
 int [1:60000] 5 0 4 1 9 2 1 3 1 4 ...

Each image is represented by a 28-pixel by 28-pixel matrix of grayscale values in the range 0 to 255, where 0 is white and 255 is black. Thus, our observations each have 28 x 28 = 784 feature values. Each image is stored as a vector by rasterizing the matrix from right to left and top to bottom. There are 60,000 images in the training data, and our mnist_train object stores these as a matrix of 60,000 rows by 784 columns, so that each row corresponds to a single image. To get an idea of what our data looks like, we can visualize the first seven images:

To analyze this data set, we will introduce our third and final R package for training neural network models, RSNNS. This package is actually an R wrapper around the Stuttgart Neural Network Simulator (SNNS), a popular software package containing standard implementations of neural networks in C, created at the University of Stuttgart. The package authors have added a convenient interface to the many functions in the original software. One of the benefits of using this package is that it provides several of its own functions for data processing, such as splitting the data into a training and test set. Another is that it implements many different types of neural networks, not just MLPs.
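As an optional sanity check before the R analysis, you can confirm the header fields of the unzipped image file with a few lines of Python; the struct call below reads the same four big-endian integers that the R function above reads with readBin. This is just an aside and is not needed for the rest of the analysis.

import struct

# The image file header holds four big-endian 32-bit integers:
# a magic number, the image count, and the row and column sizes.
with open('train-images-idx3-ubyte', 'rb') as f:
    magic, n_images, n_rows, n_cols = struct.unpack('>IIII', f.read(16))

print(magic, n_images, n_rows, n_cols)   # expected: 2051 60000 28 28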
We will begin by normalizing our data to the unit interval by dividing by 255, and then indicating that our output is a factor with each level corresponding to a digit:

> mnist_input <- mnist_train / 255
> mnist_output <- as.factor(mnist_train_labels)

Although the MNIST website already contains separate files with test data, we have chosen to split the training data file, as the models already take quite a while to run. The reader is encouraged to repeat the analysis that follows with the supplied test files as well. To prepare the data for splitting, we will randomly shuffle our images in the training data:

> set.seed(252)
> mnist_index <- sample(1:nrow(mnist_input), nrow(mnist_input))
> mnist_data <- mnist_input[mnist_index, 1:ncol(mnist_input)]
> mnist_out_shuffled <- mnist_output[mnist_index]

Next, we must dummy-encode our output factor, as this is not done automatically for us. The decodeClassLabels() function from the RSNNS package is a convenient way to do this. Additionally, we will split our shuffled data into an 80-20 training and test set split using splitForTrainingAndTest(). This will store the features and labels for the training and test sets separately, which will be useful for us shortly. Finally, we can also normalize our data using the normTrainingAndTestSet() function. To specify unit interval normalization, we must set the type parameter to 0_1:

> library("RSNNS")
> mnist_out <- decodeClassLabels(mnist_out_shuffled)
> mnist_split <- splitForTrainingAndTest(mnist_data, mnist_out, ratio = 0.2)
> mnist_norm <- normTrainingAndTestSet(mnist_split, type = "0_1")

For comparison, we will train two MLP networks using the mlp() function. By default, this is configured for classification and uses the logistic function as the activation function for hidden layer neurons. The first model will have a single hidden layer with 100 neurons; the second model will use 300. The first argument to the mlp() function is the matrix of input features and the second is the vector of labels. The size parameter plays the same role as the hidden parameter in the neuralnet package. That is to say, we can specify a single integer for a single hidden layer, or a vector of integers specifying the number of hidden neurons per layer when we want more than one hidden layer. Next, we can use the inputsTest and targetsTest parameters to specify the features and labels of our test set beforehand, so that we can be ready to observe the performance on our test set in one call.

The models we will train will take several hours to run. If we want to know how long each model took to run, we can save the current time using proc.time() before training a model and compare it against the time when the model completes. Putting all this together, here is how we trained our two MLP models:

> start_time <- proc.time()
> mnist_mlp <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 100,
                   inputsTest = mnist_norm$inputsTest,
                   targetsTest = mnist_norm$targetsTest)
> proc.time() - start_time
    user   system  elapsed
2923.936    5.470 2927.415

> start_time <- proc.time()
> mnist_mlp2 <- mlp(mnist_norm$inputsTrain, mnist_norm$targetsTrain, size = 300,
                    inputsTest = mnist_norm$inputsTest,
                    targetsTest = mnist_norm$targetsTest)
> proc.time() - start_time
    user   system  elapsed
7141.687    7.488 7144.433

As we can see, the models take quite a long time to run (the values are in seconds). For reference, these were trained on a 2.5 GHz Intel Core i7 Apple MacBook Pro with 16 GB of memory.
The model predictions on our test set are saved in the fittedTestValues attribute (for our training set, they are stored in the fitted.values attribute). We will focus on test set accuracy. First, we must decode the dummy-encoded network outputs by selecting the binary column with the maximum value. We must also do this for the target outputs. Note that the first column corresponds to the digit 0.

> mnist_class_test <- (0:9)[apply(mnist_norm$targetsTest, 1, which.max)]
> mlp_class_test <- (0:9)[apply(mnist_mlp$fittedTestValues, 1, which.max)]
> mlp2_class_test <- (0:9)[apply(mnist_mlp2$fittedTestValues, 1, which.max)]

Now we can check the accuracy of our two models, as follows:

> mean(mnist_class_test == mlp_class_test)
[1] 0.974
> mean(mnist_class_test == mlp2_class_test)
[1] 0.981

The accuracy is very high for both models, with the second model slightly outperforming the first. We can use the confusionMatrix() function to see the errors made in detail:

> confusionMatrix(mnist_class_test, mlp2_class_test)
       predictions
targets    0    1    2    3    4    5    6    7    8    9
      0 1226    0    0    1    1    0    1    1    3    1
      1    0 1330    5    3    0    0    0    3    0    1
      2    3    0 1135    3    2    1    1    5    3    0
      3    0    0    6 1173    0   11    1    5    6    1
      4    0    5    0    0 1143    1    5    5    0   10
      5    2    2    1   12    2 1077    7    3    5    4
      6    3    0    2    1    1    3 1187    0    1    0
      7    0    0    7    1    3    1    0 1227    1    4
      8    5    4    3    5    1    4    4    0 1110    5
      9    1    0    0    6    8    5    0   11    6 1164

As expected, we see quite a bit of symmetry in this matrix because certain pairs of digits are often harder to distinguish than others. For example, the most common pair of digits that the model confuses is the pair (3, 5). The test data available on the website contains some examples of digits that are harder to distinguish from others.

By default, the mlp() function allows for a maximum of 100 iterations, via its maxit parameter. Often, we don't know the number of iterations we should run for a particular model; a good way to determine this is to plot the training and testing error rates versus the iteration number. With the RSNNS package, we can do this with the plotIterativeError() function. The following graphs show that for our two models, both errors plateau after 30 iterations:

Receiver operating characteristic curves

In this article, we will present a commonly used graph to show binary classification performance, the receiver operating characteristic (ROC) curve. This curve is a plot of the true positive rate on the y axis and the false positive rate on the x axis. The true positive rate, as we know, is just the recall or, equivalently, the sensitivity of a binary classifier. The false positive rate is just 1 minus the specificity. A random binary classifier will have a true positive rate equal to the false positive rate, and thus, on the ROC curve, the line y = x shows the performance of a random classifier. Any curve lying above this line performs better than a random classifier. A perfect classifier will exhibit a curve from the origin to the point (0, 1), which corresponds to a 100 percent true positive rate and a 0 percent false positive rate. We often talk about the ROC Area Under the Curve (ROC AUC) as a performance metric. The area under the random classifier is just 0.5, as we are computing the area under the line y = x on a unit square. By convention, the area under a perfect classifier is 1, as the curve passes through the point (0, 1). In practice, we obtain values between these two. For our MNIST digit classifier, we have a multiclass problem, but we can use the plotROC() function of the RSNNS package to study the performance of our classifier on individual digits.
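Before looking at an individual digit, it may help to make the ROC quantities concrete. The short sketch below (in Python, with made-up scores, independent of RSNNS) sweeps a threshold over classifier scores, computes the true and false positive rates, and integrates the curve with the trapezoidal rule.

import numpy as np

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])            # 1 marks the positive class
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])

# Sweep the decision threshold from high to low.
thresholds = np.sort(np.unique(scores))[::-1]
tpr = [np.mean(scores[labels == 1] >= t) for t in thresholds]   # recall / sensitivity
fpr = [np.mean(scores[labels == 0] >= t) for t in thresholds]   # 1 - specificity

# Area under the curve via the trapezoidal rule, starting from the (0, 0) point.
auc = np.trapz([0.0] + tpr, [0.0] + fpr)
print(auc)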
The following plot shows the ROC curve for digit 1, which is almost perfect:

Summary

Predictive analytics, and data science more generally, currently enjoy a huge surge in interest, as predictive technologies such as spam filtering, word completion, and recommendation engines have pervaded everyday life. We are now not only increasingly familiar with these technologies, but these technologies have also earned our confidence. You can learn more about predictive analytics by referring to:

https://www.packtpub.com/big-data-and-business-intelligence/predictive-analytics-using-rattle-and-qlik-sense
https://www.packtpub.com/big-data-and-business-intelligence/haskell-financial-data-modeling-and-predictive-analytics

Resources for Article:

Further resources on this subject:
Learning Data Analytics with R and Hadoop [article]
Big Data Analytics [article]
Data Analytics [article]