How-To Tutorials - Data

1204 Articles

How to store and access social media data in MongoDB

Amey Varangaonkar
26 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Our article explains how to effectively perform different operations using MongoDB and Python to effectively access and modify the data. According to the official MongoDB page: MongoDB is free and open-source, distributed database which allows ad hoc queries, indexing, and real time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. Along with ease of use, MongoDB is recognized for the following advantages: Schema-less design: Unlike traditional relational databases, which require the data to fit its schema, MongoDB provides a flexible schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure and a collection is a group of documents. One links data within collections using specific identifiers. The document model is quite useful in this subject as most social media APIs provide their data in JSON format. High performance: Indexing and GRIDFS features of MongoDB provide fast access and storage. High availability: Duplication feature that allows us to make various copies of databases in different nodes confirms high availability in the case of node failures. Automatic scaling: The Sharding feature of MongoDB scales large data sets Automatically. You can access information on the implementation of Sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/ Installing MongoDB MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads?_ga=1.253005644.410512988.1432811016. Setting up the environment MongoDB requires a data directory to store all the data. The directory can be created in your working directory: md datadb Starting MongoDB We need to go to the folder where mongod.exe is stored and and run the following command: cmd binmongod.exe Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working. MongoDB using Python MongoDB can be used directly from the shell command or through programming languages. For the sake of our book we'll explain how it works using Python. MongoDB is accessed using Python through a driver module named PyMongo. We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see the most common functionalities required for analysis projects. We highly recommend reading the official MongoDB documentation. PyMongo can be installed using the following command: pip install pymongo Then the following command imports it in the Python script  from pymongo import MongoClient client = MongoClient('localhost:27017') The database structure of MongoDB is similar to SQL languages, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your tables do not need to have a predefined structure, you can add documents of any composition as long as they are a JSON object. But by convention is it best practice to have a common general structure for documents in the same collections. 
To access a database named scrapper we simply have to do the following:

db_scrapper = client.scrapper

To access a collection named articles in the scrapper database, we do this:

collection_articles = db_scrapper.articles

Once you have the client object initiated, you can access all the databases and collections very easily. Now we will see how to perform the different operations.

Insert: To insert documents into a collection, we first build a list of new documents:

docs = []
for _ in range(0, 10):
    # each document must be of the Python type dict
    docs.append({
        "author": "...",
        "content": "...",
        "comment": ["...", ...]
    })

Inserting all the docs at once:

db.collection.insert_many(docs)

Or you can insert them one by one:

for doc in docs:
    db.collection.insert_one(doc)

You can find more detailed documentation at https://docs.mongodb.com/v3.2/tutorial/insert-documents/.

Find: To fetch all documents within a collection:

# as find() returns a cursor, we iterate over the cursor to actually
# fetch the data from the database
docs = [d for d in db.collection.find()]

To fetch all documents in batches of 100 documents:

batch_size = 100
iteration = 0
# getting the total number of documents in the collection
count = db.collection.count()
while iteration * batch_size < count:
    docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    iteration += 1

To fetch documents using search queries, where the author is Jean Francois:

query = {'author': 'Jean Francois'}
docs = [d for d in db.collection.find(query)]

Where the author field exists and is not null:

query = {'author': {'$exists': True, '$ne': None}}
docs = [d for d in db.collection.find(query)]

There are many other filtering methods that provide a wide range of flexibility and precision; we highly recommend taking the time to go through the different search operators. You can find more detailed documentation at https://docs.mongodb.com/v3.2/reference/method/db.collection.find/

Update: To update documents where the author is Jean Francois and set the attribute published to True:

query_search = {'author': 'Jean Francois'}
query_update = {'$set': {'published': True}}
db.collection.update_many(query_search, query_update)

Or you can update just the first matching document:

db.collection.update_one(query_search, query_update)

Find more detailed documentation at https://docs.mongodb.com/v3.2/reference/method/db.collection.update/

Remove: To remove all documents where the author is Jean Francois:

query_search = {'author': 'Jean Francois'}
db.collection.delete_many(query_search)

Or remove the first matching document:

db.collection.delete_one(query_search)

Find more detailed documentation at https://docs.mongodb.com/v3.2/tutorial/remove-documents/

Drop: You can drop a collection with the following:

db.collection.drop()

Or you can drop a whole database:

client.drop_database('scrapper')

We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data. If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, Twitter etc.
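To tie these operations together, here is a short end-to-end sketch (not from the book) that uses the scrapper database and articles collection named above; the author and content values are purely illustrative:

from pymongo import MongoClient

client = MongoClient('localhost:27017')
articles = client.scrapper.articles

# Insert a couple of illustrative documents.
articles.insert_many([
    {"author": "Jean Francois", "content": "First post", "comment": []},
    {"author": "Jane Doe", "content": "Second post", "comment": ["nice"]},
])

# Query, update and delete, mirroring the operations described above.
print([d["content"] for d in articles.find({"author": "Jean Francois"})])
articles.update_many({"author": "Jean Francois"}, {"$set": {"published": True}})
articles.delete_many({"author": "Jane Doe"})

# Clean up the illustrative collection when done.
articles.drop()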


Mine Popular Trends on GitHub using Python - Part 1

Amey Varangaonkar
26 Dec 2017
11 min read
[box type="note" align="" class="" width=""]This interesting article is an excerpt from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. The book contains useful techniques to gain valuable insights from different social media channels using popular Python packages.[/box] In this article, we explore how to leverage the power of Python in order to gather and process data from GitHub and make it analysis-ready. Those who love to code, love GitHub. GitHub has taken the widely used version controlling approach to coding to the highest possible level by implementing social network features to the world of programming. No wonder GitHub is also thought of as Social Coding. We thought a book on Social Network analysis would not be complete without a use case on data from GitHub. GitHub allows you to create code repositories and provides multiple collaborative features, bug tracking, feature requests, task managements, and wikis. It has about 20 million users and 57 million code repositories (source: Wikipedia). These kind of statistics easily demonstrate that this is the most representative platform of programmers. It's also a platform for several open source projects that have contributed greatly to the world of software development. Programming technology is evolving at such a fast pace, especially due to the open source movement, and we have to be able to keep a track of emerging technologies. Assuming that the latest programming tools and technologies are being used with GitHub, analyzing GitHub could help us detect the most popular technologies. The popularity of repositories on GitHub is assessed through the number of commits it receives from its community. We will use the GitHub API in this chapter to gather data around repositories with the most number of commits and then discover the most popular technology within them. For all we know, the results that we get may reveal the next great innovations. Scope and process GitHub API allows us to get information about public code repositories submitted by users. It covers lots of open-source, educational and personal projects. Our focus is to find the trending technologies and programming languages of last few months, and compare with repositories from past years. We will collect all the meta information about the repositories such as: Name: The name of the repository Description: A description of the repository Watchers: People following the repository and getting notified about its activity Forks: Users cloning the repository to their own accounts Open Issues: Issues submitted about the repository We will use this data, a combination of qualitative and quantitative information, to identify the most recent trends and weak signals. The process can be represented by the steps shown in the following figure: Getting the data Before using the API, we need to set the authorization. The API gives you access to all publicly available data, but some endpoints need user permission. You can create a new token with some specific scope access using the application settings. The scope depends on your application's needs, such as accessing user email, updating user profile, and so on. Password authorization is only needed in some cases, like access by user authorized applications. In that case, you need to provide your username or email, and your password. All API access is over HTTPS, and accessed from the https://api.github.com/ domain. All data is sent and received as JSON. 
Rate limits

The GitHub Search API is designed to help you find specific items (repositories, users, and so on). The rate limit policy allows up to 1,000 results for each search. For requests using basic authentication, OAuth, or a client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute.

Connection to GitHub

GitHub offers a search endpoint which returns all the repositories matching a query. As we go along, in different steps of the analysis we will change the value of the variable q (query). In the first part, we will retrieve all the repositories created since January 1, 2017 and then compare the results with previous years.

Firstly, we initialize an empty list, results, which stores all the data about repositories. Secondly, we build GET requests with the parameters required by the API. We can only get 100 results per request, so we have to use a pagination technique to build a complete dataset:

import requests

results = []
q = "created:>2017-01-01"

def search_repo_paging(q):
    url = 'https://api.github.com/search/repositories'
    params = {'q': q, 'sort': 'forks', 'order': 'desc', 'per_page': 100}
    while True:
        res = requests.get(url, params=params)
        result = res.json()
        results.extend(result['items'])
        params = {}
        try:
            url = res.links['next']['url']
        except KeyError:
            break

In the first request we pass all the parameters to the GET method. Then, we make a new request for every next page, which can be found in res.links['next']['url']; res.links contains a full link to the resource, including all the other parameters, which is why we empty the params dictionary. The operation is repeated until there is no next page key in the res.links dictionary.

For other datasets we modify the search query in such a way that we retrieve repositories from previous years. For example, to get the data from 2015 we define the following query:

q = "created:2015-01-01..2015-12-31"

In order to find the proper repositories, the API provides a wide range of query parameters. It is possible to search for repositories with high precision using the system of qualifiers. Starting with the main search parameter q, we have the following options:

- sort: Set to forks, as we are interested in finding the repositories having the largest number of forks (you can also sort by number of stars or update time)
- order: Set to descending order
- per_page: Set to the maximum number of returned repositories

Naturally, the search parameter q can contain multiple combinations of qualifiers.

Data pull

The amount of data we collect through the GitHub API is such that it fits in memory, so we can deal with it directly in a pandas dataframe. If more data were required, we would recommend storing it in a database, such as MongoDB. We use JSON tools to convert the results into clean JSON and to create a dataframe:

from pandas.io.json import json_normalize
import json
import pandas as pd
import bson.json_util as json_util

sanitized = json.loads(json_util.dumps(results))
normalized = json_normalize(sanitized)
df = pd.DataFrame(normalized)

The dataframe df contains columns related to all the results returned by the GitHub API.
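Since the chapter compares several years, in practice the pull is repeated once per query and each row is tagged with its year; the 'year' column that shows up in the column listing below comes from exactly this kind of tagging. The following sketch is not from the book: it reuses the search_repo_paging helper and the global results list defined above, and resetting that list between queries is an assumption of this sketch:

frames = []
queries = {
    2015: "created:2015-01-01..2015-12-31",
    2016: "created:2016-01-01..2016-12-31",
    2017: "created:>2017-01-01",
}
for year, query in queries.items():
    results.clear()                 # start from an empty result list
    search_repo_paging(query)       # fills the global results list
    sanitized = json.loads(json_util.dumps(results))
    frame = pd.DataFrame(json_normalize(sanitized))
    frame['year'] = year            # tag every repository with its creation year
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)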
We can list them by typing the following:

df.columns

Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url', 'clone_url', 'collaborators_url', 'comments_url', 'commits_url', 'compare_url', 'contents_url', 'contributors_url', 'default_branch', 'deployments_url', 'description', 'downloads_url', 'events_url', 'fork', 'forks', 'forks_count', 'forks_url', 'full_name', 'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url', 'has_downloads', 'has_issues', 'has_pages', 'has_projects', 'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id', 'issue_comment_url', 'issue_events_url', 'issues_url', 'keys_url', 'labels_url', 'language', 'languages_url', 'merges_url', 'milestones_url', 'mirror_url', 'name', 'notifications_url', 'open_issues', 'open_issues_count', 'owner.avatar_url', 'owner.events_url', 'owner.followers_url', 'owner.following_url', 'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id', 'owner.login', 'owner.organizations_url', 'owner.received_events_url', 'owner.repos_url', 'owner.site_admin', 'owner.starred_url', 'owner.subscriptions_url', 'owner.type', 'owner.url', 'private', 'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url', 'stargazers_count', 'stargazers_url', 'statuses_url', 'subscribers_url', 'subscription_url', 'svn_url', 'tags_url', 'teams_url', 'trees_url', 'updated_at', 'url', 'watchers', 'watchers_count', 'year'], dtype='object')

Then, we select a subset of variables which will be used for further analysis. Our choice is based on the meaning of each of them: we skip all the technical variables related to URLs, owner information, or IDs. The remaining columns contain information which is very likely to help us identify new technology trends:

- description: A user description of the repository
- watchers_count: The number of watchers
- size: The size of the repository in kilobytes
- forks_count: The number of forks
- open_issues_count: The number of open issues
- language: The programming language the repository is written in

We have selected watchers_count as the criterion to measure the popularity of repositories. This number indicates how many people are interested in the project. However, we may also use forks_count, which gives us slightly different information about popularity: it represents the number of people who have actually worked with the code, so it relates to a different group.

Data processing

In the previous step we structured the raw data, which is now ready for further analysis. Our objective is to analyze two types of data:

- Textual data in description
- Numerical data in the other variables

Each of them requires a different pre-processing technique. Let's take a look at each type in detail.

Textual data

For the first kind, we have to create a new variable which contains a cleaned string. We will do it in three steps which have already been presented in previous chapters:

1. Selecting English descriptions
2. Tokenization
3. Stopwords removal

As we work only on English data, we should remove all the descriptions written in other languages. The main reason to do so is that each language requires a different processing and analysis flow. If we left descriptions in Russian or Chinese, we would have very noisy data which we would not be able to interpret. As a consequence, we can say that we are analyzing trends in the English-speaking world.
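Before turning to the description column itself, here is a small sketch (not from the book) of the column selection described above, pulling the analysis variables into a separate working frame while the chapter continues with the full df:

# Keep only the columns singled out above; a copy avoids pandas'
# chained-assignment warnings when cleaned columns are added later.
keep = ['description', 'watchers_count', 'size',
        'forks_count', 'open_issues_count', 'language']
selected = df[keep].copy()
print(selected.shape)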
Firstly, we remove all the empty strings in the description column:

df = df.dropna(subset=['description'])

In order to remove non-English descriptions we have to first detect which language is used in each text. For this purpose we use a library called langdetect, which is based on the Google language detection project (https://github.com/shuyo/language-detection):

from langdetect import detect
df['lang'] = df.apply(lambda x: detect(x['description']), axis=1)

We create a new column which contains all the predictions. We see different languages, such as en (English), zh-cn (Chinese), vi (Vietnamese), or ca (Catalan):

df['lang']
0 en
1 en
2 en
3 en
4 en
5 zh-cn

In our dataset, en represents 78.7% of all the repositories. We will now select only those repositories with a description in English:

df = df[df['lang'] == 'en']

In the next step, we will create a new clean column with pre-processed textual data. We execute the following code to perform tokenization and remove stopwords:

import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

def clean(text='', stopwords=[]):
    # tokenize
    tokens = word_tokenize(text.strip())
    # lowercase
    clean = [i.lower() for i in tokens]
    # remove stopwords
    clean = [i for i in clean if i not in stopwords]
    # remove punctuation
    punctuations = list(string.punctuation)
    clean = [i.strip(''.join(punctuations)) for i in clean if i not in punctuations]
    return " ".join(clean)

df['clean'] = df['description'].apply(str)  # make sure description is a string
df['clean'] = df['clean'].apply(lambda x: clean(text=x, stopwords=stopwords.words('english')))

Finally, we obtain a clean column which contains cleaned English descriptions, ready for analysis:

df['clean'].head(5)
0 roadmap becoming web developer 2017
1 base repository imad v2 course application ple...
2 decrypted content eqgrp-auction-file.tar.xz
3 shadow brokers lost translation leak
4 learn design large-scale systems prep system d...

Numerical data

For numerical data, we will check statistically what the distribution of values is and whether there are any missing values:

df[['watchers_count', 'size', 'forks_count', 'open_issues']].describe()

We see that there are no missing values in any of the four variables: watchers_count, size, forks_count, and open_issues. The watchers_count varies from 0 to 20,792, while the number of forks starts at a minimum of 33 and goes up to 2,589. The first quartile of repositories has no open issues, while the top 25% have more than 12. It is worth noticing that, in our dataset, there is a repository which has 458 open issues.

Once we are done with the pre-processing of the data, our next step would be to analyze it, in order to get actionable insights from it. If you found this article to be useful, stay tuned for Part 2, where we perform analysis on the processed GitHub data and determine the top trending technologies. Alternatively, you can check out the book Python Social Media Analytics, to learn how to get valuable insights from various social media sites such as Facebook, Twitter and more.
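As a quick preview of what the cleaned descriptions make possible (a simple illustration, not the analysis from Part 2), you can count the most frequent tokens in the clean column:

from collections import Counter

# Flatten the cleaned descriptions into tokens and count them; the most
# common words already hint at the technologies the trend analysis will surface.
token_counts = Counter()
for text in df['clean']:
    token_counts.update(text.split())

print(token_counts.most_common(20))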


How to effectively clean social media data for analysis

Amey Varangaonkar
26 Dec 2017
10 min read
[box type="note" align="" class="" width=""]This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Data cleaning and preprocessing is an essential - and often crucial - part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for effective analysis of your social media data. Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient for an analysis using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data is related to text, images, videos, and sound and we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis. Social media Data type and encoding Comments and conversation are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. Every string in Python is seen as a Unicode covering the numbers from 0 through 0x10FFFF (1,114,111 decimal). Then, the sequence has to be represented as a set of bytes (values from 0 to 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called encoding. Encoding plays a very important role in natural language processing because people use more and more characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, in many languages, there are accents that go beyond the regular English alphabet. In order to deal with all the processing problems that might be caused by these, we have to use the right encoding, because comparing two strings with different encodings is actually like comparing apples and oranges. The most common one is UTF-8, used by default in Python 3, which can handle any type of character. As a rule of thumb always normalize your data to Unicode UTF-8. Structure of social media data Another question we'll encounter is, What is the right structure for our data? The most natural choice is a list that can store a sequence of data points (verbatims, numbers, and so on). However, the use of lists will not be efficient on large datasets and we'll be constrained to use sequential processing of the data. That is why a much better solution is to store the data in a tabular format in pandas dataframe, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing and above all it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis. It is worth remembering that the dataset in pandas must fit into RAM memory. For bigger datasets, we suggest the use of SFrames. Pre-processing and text normalization Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages. 
The quality of the preprocessing has a big impact on the final result of the whole process. There are several stages to it: from simple text cleaning by removing white spaces, punctuation, HTML tags, and special characters, up to more sophisticated normalization techniques such as tokenization, stemming, or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all the others, and to maintain the text corpus in one uniform format.

We import all the necessary libraries:

import re, itertools
import nltk
from nltk.corpus import stopwords

When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters, so programming languages treat, for example, "go" and "Go" as two different words. In order to handle such distinctions, we can apply the following steps:

1. Perform basic text mining cleaning.

2. Remove all whitespaces:

verbatim = verbatim.strip()

Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one, or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used for cleaning punctuation, HTML tags, and URL paths.

3. Remove punctuation:

verbatim = re.sub(r'[^\w\s]', '', verbatim)

4. Remove HTML tags:

verbatim = re.sub('<[^<]+?>', '', verbatim)

5. Remove URLs:

verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim, flags=re.MULTILINE)

Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This applies to text sources such as Twitter or forums, where emotions play a role and the comments contain words with repeated letters, for example "happpppy" instead of "happy".

6. Standardize words (remove repeated letters):

verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))

After the removal of punctuation or white spaces, words can end up attached to each other. This happens especially when deleting the periods at the end of sentences; the corpus might look like "the brown dog is lostEverybody is looking for him", so there is a need to split "lostEverybody" into two separate words.

7. Split attached words:

verbatim = " ".join(re.findall('[A-Z][^A-Z]*', verbatim))

8. Convert text to lowercase, using lower():

verbatim = verbatim.lower()

Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus only on the important words instead, and improve the accuracy of the text processing.

9. Stop word removal:

verbatim = ' '.join([word for word in verbatim.split() if word not in stopwords.words('english')])

10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas. Some examples are cars -> car, men -> man, and went -> go.
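A short sketch of step 10 (not from the book) using NLTK's PorterStemmer and WordNetLemmatizer, assuming the WordNet corpus has been downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['cars', 'running', 'went', 'men']

print([stemmer.stem(w) for w in words])          # stems, e.g. 'running' -> 'run'
print([lemmatizer.lemmatize(w) for w in words])  # lemmas, treated as nouns by default
print(lemmatizer.lemmatize('went', pos='v'))     # verbs need pos='v', giving 'go'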
Such text processing can add value in some domains and may improve the accuracy of practical information extraction tasks.

Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing.

tokens = nltk.word_tokenize(verbatim)

Other techniques are spelling correction, domain knowledge, and grammar checking.

Duplicate removal

Depending on the data source, we might notice multiple duplicates in our dataset. The decision to remove duplicates should be based on an understanding of the domain. In most cases, duplicates come from errors in the data collection process, and it is recommended to remove them in order to reduce bias in our analysis, with the help of the following:

df = df.drop_duplicates(subset=['column_name'])

Knowing basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases: MongoDB.

Capture: Once you have made a connection to your API, you need to make a special request and receive the data at your end. This step requires you to go through the data to be able to understand it. Often the data is received in a special format called JavaScript Object Notation (JSON). JSON was created to enable lightweight data interchange between programs; it serves a similar role to the older XML format and consists of key-value pairs.

Normalization: The data received from platforms is not in an ideal format to perform analysis on. With textual data there are many different approaches to normalization: stripping the whitespace surrounding verbatims, converting all verbatims to lowercase, or changing the encoding to UTF-8, for example. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner that ensures uniform standardization. It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers at all your data input points, so as to ensure that all the data in your analysis goes through exactly the same normalization process (a consolidated sketch of such a wrapper appears at the end of this article).

In general, one should always perform the following cleaning steps:

1. Normalize the textual content. Normalization generally contains at least the following steps:
   - Stripping surrounding whitespace
   - Lowercasing the verbatim
   - Universal encoding (UTF-8)
2. Remove special characters (for example, punctuation).
3. Remove stop words: Irrespective of the language, stop words add no additional informative value to the analysis, except in the case of deep parsing, where stop words can be bridge connectors between targeted words.
4. Split attached words.
5. Remove URLs and hyperlinks: URLs and hyperlinks can be studied separately, but due to their lack of grammatical structure they are, by convention, removed from verbatims.
6. Slang lookups: This is a relatively difficult task, because it requires a predefined vocabulary of slang words and their proper reference words, for example, luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated. A small sketch of such a lookup follows below.
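A minimal sketch of a slang lookup (not from the book; the dictionary is a tiny illustrative stand-in for a real slang vocabulary):

# Tiny illustrative slang dictionary; in practice this would be a much larger,
# regularly refreshed vocabulary loaded from a file or the open web.
SLANG = {"luv": "love", "u": "you", "gr8": "great", "thx": "thanks"}

def expand_slang(verbatim, lookup=SLANG):
    # Replace each token with its standard form when a mapping exists.
    return " ".join(lookup.get(token, token) for token in verbatim.split())

print(expand_slang("luv u thx for the gr8 support"))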
In the case of studying words and not phrases (or n-grams), it is very important to do the following:

- Tokenize the verbatim.
- Stemming and lemmatization (optional): useful where different written forms of the same word do not hold additional meaning for your study.

Some advanced cleaning procedures are:

- Grammar checking: Grammar checking is mostly learning-based; a huge amount of proper text data is learned and models are created for the purpose of grammar correction. There are many online tools available for grammar correction. This is a very tricky cleaning technique, because language style and structure can change from source to source (for example, language on Twitter will not correspond with the language of published books). Wrongly correcting grammar can have negative effects on the analysis.
- Spelling correction: In natural language, misspellings are common. Companies such as Google and Microsoft have achieved a decent accuracy level in automated spelling correction. One can use algorithms such as Levenshtein distance, dictionary lookup, and so on, or other modules and packages, to fix these errors. Again, take spelling correction with a grain of salt, because false positives can affect the results.

Storing: Once the data is received, normalized, and/or cleaned, we need to store it in an efficient storage database. In this book we have chosen MongoDB, as it is a modern and scalable database that is also relatively easy to use and get started with. However, other databases such as Cassandra or HBase could also be used, depending on expertise and objectives.

Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With effective Python packages such as NumPy, SciPy, and pandas, these tasks become much easier and save a lot of your time. If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!
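As promised above, here is a consolidated sketch (not from the book) of a single normalization wrapper that applies the standard steps in one place, in a sensible order, so every data input point goes through exactly the same process; it assumes the NLTK stopwords corpus has been downloaded:

import re
from nltk.corpus import stopwords  # assumes nltk.download('stopwords') has been run

STOPWORDS = set(stopwords.words('english'))

def normalize_verbatim(verbatim):
    # One wrapper applied at every input point keeps the pipeline consistent.
    verbatim = verbatim.strip()                                    # surrounding whitespace
    verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim,
                      flags=re.MULTILINE)                          # URLs and hyperlinks
    verbatim = re.sub('<[^<]+?>', '', verbatim)                    # HTML tags
    split = " ".join(re.findall('[A-Z][^A-Z]*', verbatim))         # split attached words
    verbatim = split or verbatim                                   # fall back if no capitals found
    verbatim = re.sub(r'[^\w\s]', '', verbatim)                    # punctuation / special characters
    verbatim = verbatim.lower()                                    # lowercase
    tokens = [t for t in verbatim.split() if t not in STOPWORDS]   # stop words
    return " ".join(tokens)

print(normalize_verbatim("Check this out: <b>the brown dog is lostEverybody</b> is looking for him!"))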


Hitting the right notes in 2017: AI in a song for Data Scientists

Aarthi Kumaraswamy
26 Dec 2017
3 min read
A lot, and I mean lots and lots, of great articles have already been written about AI's epic journey in 2017. They all generally agree that 2017 set the stage for AI in very real terms. We saw immense progress in academia, research, and industry in terms of an explosion of new ideas (like CapsNets), questioning of established ideas (like backprop and AI black boxes), new methods (AlphaZero's self-learning), tools (PyTorch, Gluon, AWS SageMaker), and hardware (quantum computers, AI chips). New and existing players geared up to tap into this phenomenon, even as they struggled to tap into the limited talent pool at various conferences and other community hangouts.

While we have accelerated the pace of testing and deploying some of those ideas in the real world, with self-driving cars and in media and entertainment, among others, progress in building a supportive and sustainable ecosystem has been slow. We also saw conversations on AI ethics, transparency, interpretability, and fairness go mainstream, alongside broader contexts such as national policies and corporate cultural reformation setting the tone of those conversations. While anxiety over losing jobs to robots keeps reaching new heights proportional to the cryptocurrency hype, we saw humanoids gain citizenship, residency, and even talk of contesting an election!

It has been nothing short of the stuff legendary tales are made of: struggle, confusion, magic, awe, love, fear, disgust, inspiring heroes, powerful villains, misunderstood monsters, inner demons, and guardian angels. And stories worth telling must have songs written about them! Here's our ode to AI highlights in 2017, paying homage to an all-time favorite: 'A few of my favorite things' from The Sound of Music. Next year, our AI friends will probably join us behind the scenes in the making of another homage to the extraordinary advances in data science, machine learning, and AI.

Stripes on horses and horsetails on zebras
Bright funny faces in bowls full of rameN
Brown furry bears rolled into pandAs
These are a few of my favorite thinGs

TensorFlow projects and crisp algo models
Libratus' poker faces, AlphaGo Zero's gaming caboodles
Cars that drive and drones that fly with the moon on their wings
These are a few of my favorite things

Interpreting AI black boxes, using Python hashes
Kaggle frenemies and the ones from ML MOOC classes
R white spaces that melt into strings
These are a few of my favorite things

When models don't converge, and networks just forget
When I am sad I simply remember my favorite things
And then I don't feel so bad

PS: We had to leave out many other significant developments in the above cover, as we are limited in our creative repertoire. We invite you to join in and help us write an extended version together! The idea is to make learning about data science easy, accessible, fun and memorable!


2 ways to customize your deep learning models with Keras

Amey Varangaonkar
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, co-authored by Antonio Gulli and Sujit Pal. [/box] Keras has a lot of built-in functionality for you to build all your deep learning models without much need for customization. In this article, the authors explain how your Keras models can be customized for better and more efficient deep learning. As you will recall, Keras is a high level API that delegates to either a TensorFlow or Theano backend for the computational heavy lifting. Any code you build for your customization will call out to one of these backends. In order to keep your code portable across the two backends, your custom code should use the Keras backend API (https://keras.io/backend/), which provides a set of functions that act like a facade over your chosen backend. Depending on the backend selected, the call to the backend facade will translate to the appropriate TensorFlow or Theano call. The full list of functions available and their detailed descriptions can be found on the Keras backend page. In addition to portability, using the backend API also results in more maintainable code, since Keras code is generally more high-level and compact compared to equivalent TensorFlow or Theano code. In the unlikely case that you do need to switch to using the backend directly, your Keras components can be used directly inside TensorFlow (not Theano though) code as described in this Keras blog (https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html) Customizing Keras typically means writing your own custom layer or custom distance function. In this section, we will demonstrate how to build some simple Keras layers. You will see more examples of using the backend functions to build other custom Keras components, such as objectives (loss functions), in subsequent sections. Keras example — using the lambda layer Keras provides a lambda layer; it can wrap a function of your choosing. For example, if you wanted to build a layer that squares its input tensor element-wise, you can say simply: model.add(lambda(lambda x: x ** 2)) You can also wrap functions within a lambda layer. For example, if you want to build a custom layer that computes the element-wise euclidean distance between two input tensors, you would define the function to compute the value itself, as well as one that returns the output shape from this function, like so: def euclidean_distance(vecs): x, y = vecs return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True)) def euclidean_distance_output_shape(shapes): shape1, shape2 = shapes return (shape1[0], 1) You can then call these functions using the lambda layer shown as follows: lhs_input = Input(shape=(VECTOR_SIZE,)) lhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(lhs_input) rhs_input = Input(shape=(VECTOR_SIZE,)) rhs = dense(1024, kernel_initializer="glorot_uniform", activation="relu")(rhs_input) sim = lambda(euclidean_distance, output_shape=euclidean_distance_output_shape)([lhs, rhs]) Keras example - building a custom normalization layer While the lambda layer can be very useful, sometimes you need more control. As an example, we will look at the code for a normalization layer that implements a technique called local response normalization. 
This technique normalizes the input over local input regions, but has since fallen out of favor because it turned out not to be as effective as other regularization methods, such as dropout and batch normalization, and better initialization methods.

Building custom layers typically involves working with the backend functions, so it involves thinking about the code in terms of tensors. As you will recall, working with tensors is a two-step process: first, you define the tensors and arrange them in a computation graph, and then you run the graph with actual data. So working at this level is harder than working in the rest of Keras. The Keras documentation has some guidelines for building custom layers (https://keras.io/layers/writing-your-own-keras-layers/), which you should definitely read.

One of the ways to make it easier to develop code against the backend API is to have a small test harness that you can run to verify that your code is doing what you want it to do. Here is a small harness I adapted from the Keras source to run your layer against some input and return a result:

import numpy as np
from keras.models import Sequential

def test_layer(layer, x):
    layer_config = layer.get_config()
    layer_config["input_shape"] = x.shape
    layer = layer.__class__.from_config(layer_config)
    model = Sequential()
    model.add(layer)
    model.compile("rmsprop", "mse")
    x_ = np.expand_dims(x, axis=0)
    return model.predict(x_)[0]

And here are some tests with layer objects provided by Keras to make sure that the harness runs okay:

from keras.layers.core import Dropout, Reshape
from keras.layers.convolutional import ZeroPadding2D

x = np.random.randn(10, 10)
layer = Dropout(0.5)
y = test_layer(layer, x)
assert(x.shape == y.shape)

x = np.random.randn(10, 10, 3)
layer = ZeroPadding2D(padding=(1, 1))
y = test_layer(layer, x)
assert(x.shape[0] + 2 == y.shape[0])
assert(x.shape[1] + 2 == y.shape[1])

x = np.random.randn(10, 10)
layer = Reshape((5, 20))
y = test_layer(layer, x)
assert(y.shape == (5, 20))

Before we begin building our local response normalization layer, we need to take a moment to understand what it really does. This technique was originally used with Caffe, and the Caffe documentation (http://caffe.berkeleyvision.org/tutorial/layers/lrn.html) describes it as a kind of lateral inhibition that works by normalizing over local input regions. In ACROSS_CHANNEL mode, the local regions extend across nearby channels but have no spatial extent. In WITHIN_CHANNEL mode, the local regions extend spatially, but are in separate channels. We will implement the WITHIN_CHANNEL mode. In essence, the WITHIN_CHANNEL formula divides each activation by a term built from its spatial neighborhood: output = x / (k + alpha * sum(x^2))^beta, where the sum of squares runs over the local n x n region around each position within the same channel.

The code for the custom layer follows the standard structure. The __init__ method is used to set the application-specific parameters, that is, the hyperparameters associated with the layer. Since our layer only does a forward computation and doesn't have any learnable weights, all we do in the build method is set the input shape and delegate to the superclass's build method, which takes care of any necessary book-keeping. In layers where learnable weights are involved, this method is where you would set the initial values. The call method does the actual computation. Notice that we need to account for dimension ordering. Another thing to note is that the batch size is usually unknown at design time, so you need to write your operations so that the batch size is not explicitly invoked.
The computation itself is fairly straightforward and follows the formula closely. The sum in the denominator can also be thought of as average pooling over the row and column dimensions with a padding size of (n, n) and a stride of (1, 1). Because the pooled data is averaged already, we no longer need to divide the sum by n. The last part of the class is the get_output_shape_for method. Since the layer normalizes each element of the input tensor, the output size is identical to the input size:

from keras import backend as K
from keras.engine.topology import Layer, InputSpec

class LocalResponseNormalization(Layer):

    def __init__(self, n=5, alpha=0.0005, beta=0.75, k=2, **kwargs):
        self.n = n
        self.alpha = alpha
        self.beta = beta
        self.k = k
        super(LocalResponseNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.shape = input_shape
        super(LocalResponseNormalization, self).build(input_shape)

    def call(self, x, mask=None):
        if K.image_dim_ordering() == "th":
            _, f, r, c = self.shape
        else:
            _, r, c, f = self.shape
        squared = K.square(x)
        pooled = K.pool2d(squared, (self.n, self.n), strides=(1, 1),
                          padding="same", pool_mode="avg")
        if K.image_dim_ordering() == "th":
            summed = K.sum(pooled, axis=1, keepdims=True)
            averaged = self.alpha * K.repeat_elements(summed, f, axis=1)
        else:
            summed = K.sum(pooled, axis=3, keepdims=True)
            averaged = self.alpha * K.repeat_elements(summed, f, axis=3)
        denom = K.pow(self.k + averaged, self.beta)
        return x / denom

    def get_output_shape_for(self, input_shape):
        return input_shape

You can test this layer during development using the test harness we described earlier. It is easier to run this than to build a whole network to put it into or, worse, to wait until you have fully specified the layer before running it:

x = np.random.randn(225, 225, 3)
layer = LocalResponseNormalization()
y = test_layer(layer, x)
assert(x.shape == y.shape)

Now that you have a good idea of how to build a custom Keras layer, you might find it instructive to look at Keunwoo Choi's melspectrogram layer (https://keunwoochoi.wordpress.com/2016/11/18/for-beginners-writing-a-custom-keras-layer/).

Though building custom Keras layers is fairly commonplace for experienced Keras developers, such layers may not be widely useful in a general context. Custom layers are usually built to serve a specific, narrow purpose, depending on the use case in question, and Keras gives you enough flexibility to do so with ease. If you found our post useful, make sure to check out our best-selling title Deep Learning with Keras, for other intriguing deep learning concepts and their implementation using Keras.
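As a closing illustration (not from the book, and assuming the Keras 2.0-era API used above), here is where such a custom layer might sit in a small model, much the way a BatchNormalization layer would:

from keras.models import Sequential
from keras.layers import Conv2D, Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), padding="same", input_shape=(225, 225, 3)))
model.add(Activation("relu"))
# The custom layer defined above, normalizing activations within each channel.
model.add(LocalResponseNormalization(n=5, alpha=0.0005, beta=0.75, k=2))
model.compile(optimizer="rmsprop", loss="mse")
model.summary()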


How to stream and store tweets in Apache Kafka

Fatema Patrawala
22 Dec 2017
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Ankit Jain titled Mastering Apache Storm. This book explores various real-time processing functionalities offered by Apache Storm such as parallelism, data partitioning, and more.[/box] Today, we are going to cover how to stream tweets from Twitter using the twitter streaming API. We are also going to explore how we can store fetched tweets in Kafka for later processing through Storm. Setting up a single node Kafka cluster Following are the steps to set up a single node Kafka cluster:   Download the Kafka 0.9.x binary distribution named kafka_2.10-0.9.0.1.tar.gz from http://apache.claz.org/kafka/0.9.0. or 1/kafka_2.10-0.9.0.1.tgz. Extract the archive to wherever you want to install Kafka with the following command: tar -xvzf kafka_2.10-0.9.0.1.tgz cd kafka_2.10-0.9.0.1   Change the following properties in the $KAFKA_HOME/config/server.properties file: log.dirs=/var/kafka- logszookeeper.connect=zoo1:2181,zoo2:2181,zoo3:2181 Here, zoo1, zoo2, and zoo3 represent the hostnames of the ZooKeeper nodes. The following are the definitions of the important properties in the server.properties file: broker.id: This is a unique integer ID for each of the brokers in a Kafka cluster. port: This is the port number for a Kafka broker. Its default value is 9092. If you want to run multiple brokers on a single machine, give a unique port to each broker. host.name: The hostname to which the broker should bind and advertise itself. log.dirs: The name of this property is a bit unfortunate as it represents not the log directory for Kafka, but the directory where Kafka stores the actual data sent to it. This can take a single directory or a comma-separated list of directories to store data. Kafka throughput can be increased by attaching multiple physical disks to the broker node and specifying multiple data directories, each lying on a different disk. It is not much use specifying multiple directories on the same physical disk, as all the I/O will still be happening on the same disk. num.partitions: This represents the default number of partitions for newly created topics. This property can be overridden when creating new topics. A greater number of partitions results in greater parallelism at the cost of a larger number of files. log.retention.hours: Kafka does not delete messages immediately after consumers consume them. It retains them for the number of hours defined by this property so that in the event of any issues the consumers can replay the messages from Kafka. The default value is 168 hours, which is 1 week. zookeeper.connect: This is the comma-separated list of ZooKeeper nodes in hostname:port form.    Start the Kafka server by running the following command: > ./bin/kafka-server-start.sh config/server.properties [2017-04-23 17:44:36,667] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2017-04-23 17:44:36,668] INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,668] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser) [2017-04-23 17:44:36,670] INFO [Kafka Server 0], started (kafka.server.KafkaServer) If you get something similar to the preceding three lines on your console, then your Kafka broker is up-and-running and we can proceed to test it. Now we will verify that the Kafka broker is set up correctly by sending and receiving some test messages. 
First, let's create a verification topic for testing by executing the following command:

> bin/kafka-topics.sh --zookeeper zoo1:2181 --replication-factor 1 --partitions 1 --topic verification-topic --create
Created topic "verification-topic".

Now let's verify whether the topic creation was successful by listing all the topics:

> bin/kafka-topics.sh --zookeeper zoo1:2181 --list
verification-topic

The topic is created; let's produce some sample messages for the Kafka cluster. Kafka comes with a command-line producer that we can use to produce messages:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic verification-topic

Write the following messages on your console:

Message 1
Test Message 2
Message 3

Let's consume these messages by starting a new console consumer in a new console window:

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic verification-topic --from-beginning
Message 1
Test Message 2
Message 3

Now, if we enter any message on the producer console, it will automatically be consumed by this consumer and displayed on the command line.

Collecting tweets

We are assuming you already have a Twitter account, and that the consumer key and access token are generated for your application. You can refer to https://bdthemes.com/support/knowledge-base/generate-api-key-consumer-token-access-key-twitter-oauth/ to generate a consumer key and access token. Take the following steps:

1. Create a new Maven project with the groupId com.stormadvance and the artifactId kafka_producer_twitter.

2. Add the following dependencies to the pom.xml file. We are adding the Kafka and Twitter streaming Maven dependencies to pom.xml to support the Kafka producer and the streaming of tweets from Twitter:

<dependencies>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.9.0.1</version>
    <exclusions>
      <exclusion>
        <groupId>com.sun.jdmk</groupId>
        <artifactId>jmxtools</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jmx</groupId>
        <artifactId>jmxri</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.0-beta9</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>2.0-beta9</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.twitter4j/twitter4j-stream -->
  <dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>4.0.6</version>
  </dependency>
</dependencies>

3. Now, we need to create a class, TwitterData, that contains the code to consume/stream data from Twitter and publish it to the Kafka cluster. We are assuming you already have a running Kafka cluster and a topic, twitterData, created in the Kafka cluster; for information on installing a Kafka cluster and creating a Kafka topic, refer to the steps earlier in this article. The class contains an instance of the twitter4j.conf.ConfigurationBuilder class; we need to set the access token and consumer keys in the configuration, as mentioned in the source code.

4. The twitter4j.StatusListener class returns the continuous stream of tweets inside the onStatus() method. We are using the Kafka producer code inside the onStatus() method to publish the tweets to Kafka. The following is the source code for the TwitterData class:
public class TwitterData {

    /** The actual Twitter stream. It's set up to collect raw JSON data */
    private TwitterStream twitterStream;

    static String consumerKeyStr = "r1wFskT3q";
    static String consumerSecretStr = "fBbmp71HKbqalpizIwwwkBpKC";
    static String accessTokenStr = "298FPfE16frABXMcRIn7aUSSnNneMEPrUuZ";
    static String accessTokenSecretStr = "1LMNZZIfrAimpD004QilV1pH3PYTvM";

    public void start() {
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setOAuthConsumerKey(consumerKeyStr);
        cb.setOAuthConsumerSecret(consumerSecretStr);
        cb.setOAuthAccessToken(accessTokenStr);
        cb.setOAuthAccessTokenSecret(accessTokenSecretStr);
        cb.setJSONStoreEnabled(true);
        cb.setIncludeEntitiesEnabled(true);

        // instance of TwitterStreamFactory
        twitterStream = new TwitterStreamFactory(cb.build()).getInstance();

        final Producer<String, String> producer = new KafkaProducer<String, String>(getProducerConfig());

        // topicDetails
        CreateTopic("127.0.0.1:2181").createTopic("twitterData", 2, 1);

        /** Twitter listener **/
        StatusListener listener = new StatusListener() {
            public void onStatus(Status status) {
                ProducerRecord<String, String> data = new ProducerRecord<String, String>(
                        "twitterData", DataObjectFactory.getRawJSON(status));
                // send the data to kafka
                producer.send(data);
            }

            public void onException(Exception arg0) {
                System.out.println(arg0);
            }

            public void onDeletionNotice(StatusDeletionNotice arg0) {
            }

            public void onScrubGeo(long arg0, long arg1) {
            }

            public void onStallWarning(StallWarning arg0) {
            }

            public void onTrackLimitationNotice(int arg0) {
            }
        };

        /** Bind the listener **/
        twitterStream.addListener(listener);

        /** GOGOGO **/
        twitterStream.sample();
    }

    private Properties getProducerConfig() {
        Properties props = new Properties();
        // List of kafka brokers. Complete list of brokers is not required as
        // the producer will auto discover the rest of the brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("batch.size", 1);
        // Serializer used for sending data to kafka. Since we are sending strings,
        // we are using StringSerializer.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("producer.type", "sync");
        return props;
    }

    public static void main(String[] args) throws InterruptedException {
        new TwitterData().start();
    }
}

Use valid Kafka properties before executing the TwitterData class. After executing the preceding class, the user will have a real-time stream of Twitter tweets in Kafka. In the next section, we are going to cover how we can use Storm to calculate the sentiments of the collected tweets.

To summarize, we covered how to install a single node Apache Kafka cluster and how to collect tweets from Twitter and store them in a Kafka cluster. If you enjoyed this post, check out the book Mastering Apache Storm to learn more about the different types of real-time processing techniques used to create distributed applications.
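As a side note for Python readers, the same stream-to-Kafka idea can be sketched without twitter4j. The following is not from the book: it assumes the tweepy 3.x streaming API and the kafka-python client, and the credential values are placeholders for your own application keys:

import json

from kafka import KafkaProducer
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# Placeholder credentials; replace with your own application's keys.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

class KafkaForwarder(StreamListener):
    """Forwards every raw tweet it receives to the twitterData topic."""

    def on_data(self, data):
        producer.send("twitterData", json.loads(data))
        return True

    def on_error(self, status_code):
        print("Stream error:", status_code)
        return False  # stop streaming on error

auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
Stream(auth, KafkaForwarder()).sample()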
article-image-data-science-saved-christmas
Aaron Lazar
22 Dec 2017
9 min read
Save for later

How Data Science saved Christmas

Aaron Lazar
22 Dec 2017
9 min read
It’s the middle of December and it’s shivery cold in the North Pole at -20°C. A fat old man sits on a big brown chair, beside the fireplace, stroking his long white beard. His face has a frown on it, quite his unusual self. Mr. Claus quips, “Ruddy mailman should have been here by now! He’s never this late to bring in the li'l ones’ letters.” [caption id="attachment_3284" align="alignleft" width="300"] Nervous Santa Claus on Christmas Eve, he is sitting on the armchair and resting head on his hands[/caption] Santa gets up from his chair, his trouser buttons crying for help, thanks to his massive belly. He waddles over to the window and looks out. He’s sad that he might not be able to get the children their gifts in time, this year. Amidst the snow, he can see a glowing red light. “Oh Rudolph!” he chuckles. All across the living room are pictures of little children beaming with joy, holding their presents in their hands. A small smile starts building and then suddenly, Santa gets a new-found determination to get the presents over to the children, come what may! An idea strikes him as he waddles over to his computer room. Now Mr. Claus may be old on the outside, but on the inside, he’s nowhere close! He recently set up a new rig, all by himself. Six Nvidia GTX Titans, coupled with sixteen gigs of RAM, a 40-inch curved monitor that he uses to keep an eye on who’s being naughty or nice, and a 1000 watt home theater system, with surround sound, heavy on the bass. On the inside, he’s got a whole load of software on the likes of the Python language (not the Garden of Eden variety), OpenCV - his all-seeing eye that’s on the kids and well, Tensorflow et al. Now, you might wonder what an old man is doing with such heavy software and hardware. A few months ago, Santa caught wind that there’s a new and upcoming trend that involves working with tonnes of data, cleaning, processing and making sense of it. The idea of crunching data somehow tickled the old man and since then, the jolly good master tinkerer and his army of merry elves have been experimenting away with data. Santa’s pretty much self-taught at whatever he does, be it driving a sleigh or learning something new. A couple of interesting books he picked up from Packt were, Python Data Science Essentials - Second Edition, Hands-On Data Science and Python Machine Learning, and Python Machine Learning - Second Edition. After spending some time on the internet, he put together a list of things he needed to set up his rig and got them from Amazon. [caption id="attachment_3281" align="alignright" width="300"] Santa Claus is using a laptop on the top of a house[/caption] He quickly boots up the computer and starts up Tensorflow. He needs to come up with a list of probable things that each child would have wanted for Christmas this year. Now, there are over 2 billion children in the world and finding each one’s wish is going to be more than a task! But nothing is too difficult for Santa! He gets to work, his big head buried in his keyboard, his long locks falling over his shoulder. 
So, this was his plan: Considering that the kids might have shared their secret wish with someone, Santa plans to tackle the problem from different angles, to reach a higher probability of getting the right gifts: He plans to gather email and Social Media data from all the kids’ computers - all from the past month It’s a good thing kids have started owning phones at such an early age now - he plans to analyze all incoming and outgoing phone calls that have happened over the course of the past month He taps into every country's local police department’s records to stream all security footage all over the world [caption id="attachment_3288" align="alignleft" width="300"] A young boy wearing a red Christmas hat and red sweater is writing a letter to Santa Claus. The child is sitting at a wooden table in front of a Christmas tree.[/caption] If you’ve reached till here, you’re probably wondering whether this article is about Mr.Claus or Mr.Bond. Yes, the equipment and strategy would have fit an MI6 or a CIA agent’s role. You never know, Santa might just be a retired agent. Do they ever retire? Hmm! Anyway, it takes a while before he can get all the data he needs. He trusts Spark to sort this data in order, which is stored in a massive data center in his basement (he’s a bit cautious after all the news about data breaches). And he’s off to work! He sifts through the emails and messages, snorting from time to time at some of the hilarious ones. Tensorflow rips through the data, picking out keywords for Santa. It takes him a few hours to get done with the emails and social media data alone! By the time he has a list, it’s evening and time for supper. Santa calls it a day and prepares to continue the next day. The next day, Santa gets up early and boots up his equipment as he brushes and flosses. He plonks himself in the huge swivel chair in front of the monitor, munching on freshly baked gingerbread. He starts tapping into all the phone company databases across the world, fetching all the data into his data center. Now, Santa can’t afford to spend the whole time analyzing voices himself, so he lets Tensorflow analyze voices and segregate the keywords it picks up from the voice signals. Every kid’s name to a possible gift. Now there were a lot of unmentionable things that got linked to several kids names. Santa almost fell off his chair when he saw the list. “These kids grow up way too fast, these days!” It’s almost 7 PM in the evening when Santa realizes that there’s way too much data to process in a day. A few days later, Santa returns to his tech abode, to check up on the progress of the call data processing. There’s a huge list waiting in front of him. He thinks to himself, “This will need a lot of cleaning up!” He shakes his head thinking, I should have started with this! He now has to munge through that camera footage! Santa had never worked on so much data before so he started to get a bit worried that he might be unable to analyze it in time. He started pacing around the room trying to think up a workaround. Time was flying by and he still did not know how to speed up the video analyses. Just when he’s about to give up, the door opens and Beatrice walks in. Santa almost trips as he runs to hug his wife! Beatrice is startled for a bit but then breaks into a smile. “What is it dear? Did you miss me so much?” Santa replies, “You can’t imagine how much! I’ve been doing everything on my own and I really need your help!” Beatrice smiles and says, “Well, what are we waiting for? 
Let’s get down to it!” Santa explains the problem to Beatrice in detail and tells her how far he’s reached in the analysis. Beatrice thinks for a bit and asks Santa, “Did you try using Keras on top of TensorFlow?” Santa, blank for a minute, nods his head. Beatrice continues, “Well from my experience, Keras gives TensorFlow a boost of about 10%, which should help quicken the analysis. Santa just looks like he’s made the best decision marrying Beatrice and hugs her again! “Bea, you’re a genius!” he cries out. “Yeah, and don’t forget to use Matplotlib!” she yells back as Santa hurries back to his abode. He’s off to work again, this time saddling up Keras to work on top of TensorFlow. Hundreds and thousands of terabytes of video data flowing into the machines. He channels the output through OpenCV and ties it with TensorFlow to add a hint of Deep Learning. He quickly types out some Python scripts to integrate both the tools to create the optimal outcome. And then the wait begins. Santa keeps looking at his watch every half hour, hoping that the processing happens fast. The hardware has begun heating up quite a bit and he quickly races over to bring a cooler that’s across the room. While he waits for the videos to finish up, he starts working on sifting out the data from the text and audio. He remembers what Beatrice said and uses Matplotlib to visualize it. Soon he has a beautiful map of the world with all the children’s names and their possible gifts beside. Three days later, the video processing gets done Keras truly worked wonders for TensorFlow! Santa now has another set of data to help him narrow down the gift list. A few hours later he’s got his whole list visualized on Matplotlib. [caption id="attachment_3289" align="alignleft" width="300"] Santa Claus riding on sleigh with gift box against snow falling on fir tree forest[/caption] There’s one last thing left to do! He suits up in red and races out the door to Rudolph and the other reindeer, unties them from the fence and leads them over to the sleigh. Once they’re fastened, he loads up an empty bag onto the sleigh and it magically gets filled up. He quickly checks it to see if all is well and they’re off! It’s Christmas morning and all the kids are racing out of bed to rip their presents open! There are smiles all around and everyone’s got a gift, just as the saying goes! Even the ones who’ve been naughty have gotten gifts. Back in the North Pole, the old man is back in his abode, relaxing in an easy chair with his legs up on the table. The screen in front of him runs real-time video feed of kids all over the world opening up their presents. A big smile on his face, Santa turns to look out the window at the glowing red light amongst the snow, he takes a swig of brandy from a hip flask. Thanks to Data Science, this Christmas is the merriest yet!
How to perform Exploratory Data Analysis (EDA) with Spark SQL

Amarabha Banerjee
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]Below given post is a book excerpt taken from Learning Spark SQL written by Aurobindo Sarkar. This book will help you design, implement, and deliver successful streaming applications, machine learning pipelines and graph applications using Spark SQL API.[/box] Our article aims to give you an understanding of how exploratory data analysis is performed with Spark SQL. What is Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that using a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel", for the Dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, detect outliers, and so on. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. 
However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration.

In this section, we will go through some basic data exploration exercises to understand a sample Dataset. We will use a Dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We'll use the bank-additional-full.csv file that contains 41,188 records and 20 input fields, ordered by date (from May 2008 to November 2010). The Dataset has been contributed by S. Moro, P. Cortez, and P. Rita, and can be downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

1. As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use the :paste command to paste an initial set of statements into your Spark shell session (use Ctrl+D to exit the paste mode).
2. After the DataFrame has been created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset. (An illustrative PySpark version of these steps is sketched below, after the basic statistics discussion.)

Identifying missing data

Missing data can occur in Datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases, missing data is a common occurrence in real-world Datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies to deal with it. In this section, we analyze the number of records with missing data fields in our sample Dataset. In order to simulate missing data, we edit our sample Dataset by replacing fields containing "unknown" values with empty strings, and then create a DataFrame/Dataset from the edited file.

In the next section, we will compute some basic statistics for our sample Dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field. Next, we use describe() to compute the count, mean, stdev, min, and max values for the numeric columns in our Dataset. The describe() command gives a way to do a quick sanity check on your data. For example, the count of rows for each of the selected columns should match the total number of records in the DataFrame (no null or invalid rows), the average and range of values for the age column should match your expectations, and so on. Based on the values of the means and standard deviations, you can select certain data elements for deeper analysis. For example, assuming a normal distribution, the mean and standard deviation values for age suggest that most values of age are in the range of 30 to 50 years; for other columns, the standard deviation values may be indicative of a skew in the data (as the standard deviation is greater than the mean).
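The book shows each of these steps as Spark shell (Scala) screenshots, which are not reproduced in this excerpt. As a rough stand-in, the following PySpark sketch — assuming the semicolon-delimited bank-additional-full.csv file from the UCI repository is available locally — covers the same ground: reading the CSV, counting records, simulating missing values, and computing summary statistics:

# Illustrative PySpark sketch of the steps above (the book uses the Scala shell).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BankMarketingEDA").getOrCreate()

# Read the CSV file into a DataFrame; the UCI file uses ';' as the delimiter.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ";")
      .load("bank-additional-full.csv"))

# Verify the number of records.
print(df.count())          # expected: 41188

# Simulate missing data by replacing "unknown" values with empty strings,
# then count how many rows now have an empty job field.
df_edited = df.replace("unknown", "")
print(df_edited.filter(F.col("job") == "").count())

# Compute basic statistics for a subset of numeric columns plus the outcome field.
df.select("age", "duration", "campaign", "pdays", "previous", "y").describe().show()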
Identifying data outliers

An outlier or an anomaly is an observation of the data that deviates significantly from the other observations in the Dataset. These erroneous outliers can be due to errors in data collection or to variability in measurement. They can impact the results significantly, so it is imperative to identify them during the EDA process. Many outlier-detection techniques define outliers as points that do not lie in clusters. Alternatively, the user has to model the data points using statistical distributions, and the outliers are then identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution.

EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. For example, we can apply clustering algorithms and visualize the results to detect outliers in a combination of columns. In the following example, we use the last contact duration in seconds (duration), the number of contacts performed during this campaign for this client (campaign), the number of days that have passed since the client was last contacted in a previous campaign (pdays), and the number of contacts performed before this campaign for this client (previous) to compute two clusters in our data by applying the k-means clustering algorithm; a rough PySpark sketch of this step follows below.

If you liked this article, please be sure to check out Learning Spark SQL which will help you learn more useful techniques on data extraction and data analysis.
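Continuing the hypothetical PySpark sketch from the previous section (again, not the book's Scala code), the clustering step might look like this; tiny or distant clusters are the outlier candidates worth inspecting:

# Rough PySpark sketch of the k-means clustering step described above.
# Reuses df_edited from the earlier sketch.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["duration", "campaign", "pdays", "previous"],
    outputCol="features")
features_df = assembler.transform(df_edited)

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(features_df)

clustered = model.transform(features_df)
clustered.groupBy("prediction").count().show()   # compare the two cluster sizes
print(model.clusterCenters())                    # inspect where the clusters sit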
Implementing Row-level Security in PostgreSQL

Amey Varangaonkar
21 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering PostgreSQL 9.6, authored by Hans-Jürgen Schönig. The book gives a comprehensive primer on different features and capabilities of PostgreSQL 9.6, and how you can leverage them efficiently to administer and manage your PostgreSQL database.[/box] In this article, we discuss the concept of row-level security and how effectively it can be implemented in PostgreSQL using a interesting example. Having the row-level security feature enables allows you to store data for multiple users in a single database and table. At the same time it sets restrictions on the row-level access, based on a particular user’s role or identity. What is Row-level Security? In usual cases, a table is always shown as a whole. When the table contains 1 million rows, it is possible to retrieve 1 million rows from it. If somebody had the rights to read a table, it was all about the entire table. In many cases, this is not enough. Often it is desirable that a user is not allowed to see all the rows. Consider the following real-world example: an accountant is doing accounting work for many people. The table containing tax rates should really be visible to everybody as everybody has to pay the same rates. However, when it comes to the actual transactions, you might want to ensure that everybody is only allowed to see his or her own transactions. Person A should not be allowed to see person B's data. In addition to that, it might also make sense that the boss of a division is allowed to see all the data in his part of the company. Row-level security has been designed to do exactly this and enables you to build multi-tenant systems in a fast and simple way. The way to configure those permissions is to come up with policies. The CREATE POLICY command is here to provide you with a means to write those rules: test=# h CREATE POLICY Command: CREATE POLICY Description: define a new row level security policy for a table Syntax: CREATE POLICY name ON table_name [ FOR { ALL | SELECT | INSERT | UPDATE | DELETE } ] [ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ] [ USING ( using_expression ) ] [ WITH CHECK ( check_expression ) ] To show you how a policy can be written, I will first log in as superuser and create a table containing a couple of entries: test=# CREATE TABLE t_person (gender text, name text); CREATE TABLE test=# INSERT INTO t_person VALUES ('male', 'joe'), ('male', 'paul'), ('female', 'sarah'), (NULL, 'R2- D2'); INSERT 0 4 Then access is granted to the joe role: test=# GRANT ALL ON t_person TO joe; GRANT So far, everything is pretty normal and the joe role will be able to actually read the entire table as there is no RLS in place. But what happens if row-level security is enabled for the table? test=# ALTER TABLE t_person ENABLE ROW LEVEL SECURITY; ALTER TABLE There is a deny all default policy in place, so the joe role will actually get an empty table: test=> SELECT * FROM t_person; gender | name --------+------ (0 rows) Actually, the default policy makes a lot of sense as users are forced to explicitly set permissions. 
Now that the table is under row-level security control, policies can be written (as superuser):

test=# CREATE POLICY joe_pol_1 ON t_person FOR SELECT TO joe USING (gender = 'male');
CREATE POLICY

Logging in as the joe role and selecting all the data will return just two rows:

test=> SELECT * FROM t_person;
gender | name
--------+------
male | joe
male | paul
(2 rows)

Let us inspect the policy I have just created in a more detailed way. The first thing you see is that a policy actually has a name. It is also connected to a table and allows for certain operations (in this case, the SELECT clause). Then comes the USING clause. It basically defines what the joe role will be allowed to see. The USING clause is therefore a mandatory filter attached to every query to only select the rows our user is supposed to see.

Now suppose that, for some reason, it has been decided that the joe role is also allowed to see robots. There are two choices to achieve our goal. The first option is to simply use the ALTER POLICY clause to change the existing policy:

test=> \h ALTER POLICY
Command: ALTER POLICY
Description: change the definition of a row level security policy
Syntax:
ALTER POLICY name ON table_name RENAME TO new_name
ALTER POLICY name ON table_name
[ TO { role_name | PUBLIC | CURRENT_USER | SESSION_USER } [, ...] ]
[ USING ( using_expression ) ]
[ WITH CHECK ( check_expression ) ]

The second option is to create a second policy, as shown in the next example:

test=# CREATE POLICY joe_pol_2 ON t_person FOR SELECT TO joe USING (gender IS NULL);
CREATE POLICY

The beauty is that those policies are simply connected using an OR condition. Therefore, PostgreSQL will now return three rows instead of two:

test=> SELECT * FROM t_person;
gender | name
--------+-------
male | joe
male | paul
| R2-D2
(3 rows)

The R2-D2 row is now also included in the result as it matches the second policy. To show you how PostgreSQL runs the query, I have decided to include an execution plan of the query:

test=> explain SELECT * FROM t_person;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t_person (cost=0.00..21.00 rows=9 width=64)
Filter: ((gender IS NULL) OR (gender = 'male'::text))
(2 rows)

As you can see, both USING clauses have been added as mandatory filters to the query. You might have noticed in the syntax definition that there are two types of clauses:

USING: This clause filters rows that already exist. It is relevant to SELECT and UPDATE clauses, and so on.
CHECK: This clause filters new rows that are about to be created, so it is relevant to INSERT and UPDATE clauses, and so on.

Here is what happens if we try to insert a row:

test=> INSERT INTO t_person VALUES ('male', 'kaarel');
ERROR: new row violates row-level security policy for table "t_person"

As there is no policy for the INSERT clause, the statement will naturally error out. Here is the policy to allow insertions:

test=# CREATE POLICY joe_pol_3 ON t_person FOR INSERT TO joe WITH CHECK (gender IN ('male', 'female'));
CREATE POLICY

The joe role is allowed to add males and females to the table, which is shown in the next listing:

test=> INSERT INTO t_person VALUES ('female', 'maria');
INSERT 0 1

However, there is also a catch; consider the following example:

test=> INSERT INTO t_person VALUES ('female', 'maria') RETURNING *;
ERROR: new row violates row-level security policy for table "t_person"

Remember, there is only a policy to select males.
The trouble here is that the RETURNING * clause would return a female row, which the joe role is not allowed to see because its SELECT policies only cover males (and NULL genders). Only for men will the RETURNING * clause actually work:

test=> INSERT INTO t_person VALUES ('male', 'max') RETURNING *;
gender | name
--------+------
male | max
(1 row)
INSERT 0 1

If you don't want this behavior, you have to write a policy that actually contains a proper USING clause. If you liked our post, make sure to check out our book Mastering PostgreSQL 9.6 - a comprehensive PostgreSQL guide covering all database administration and maintenance aspects.
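As a small, hypothetical illustration of how these policies look from an application rather than from psql (this is not part of the book excerpt), the following Python sketch uses psycopg2 and assumes the test database, the t_person table and policies created above, and a joe role that can log in with the password secret:

# Hypothetical client-side check of the row-level security policies above.
import psycopg2

conn = psycopg2.connect(dbname="test", user="joe", password="secret", host="localhost")
cur = conn.cursor()

# Only rows allowed by joe's SELECT policies come back (males and NULL gender).
cur.execute("SELECT gender, name FROM t_person")
print(cur.fetchall())

# Allowed by the INSERT policy's CHECK expression.
cur.execute("INSERT INTO t_person VALUES (%s, %s)", ("female", "maria"))
conn.commit()

# Violates the CHECK expression, so PostgreSQL raises an error.
try:
    cur.execute("INSERT INTO t_person VALUES (%s, %s)", ("robot", "bender"))
    conn.commit()
except psycopg2.Error as err:
    conn.rollback()
    print("rejected:", err)

conn.close()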
How to Install Keras on Docker and Cloud ML

Amey Varangaonkar
20 Dec 2017
3 min read
[box type="note" align="" class="" width=""]The following extract is taken from the book Deep Learning with Keras, written by Antonio Gulli and Sujit Pal. It contains useful techniques to train effective deep learning models using the highly popular Keras library.[/box] Keras is a deep learning library which can be used on the enterprise platform, by deploying it on a container. In this article, we see how to install Keras on Docker and Google’s Cloud ML. Installing Keras on Docker One of the easiest ways to get started with TensorFlow and Keras is running in a Docker container. A convenient solution is to use a predefined Docker image for deep learning created by the community that contains all the popular DL frameworks (TensorFlow, Theano, Torch, Caffe, and so on). Refer to the GitHub repository at https://github.com/saiprashanths/dl-docker for the code files. Assuming that you already have Docker up and running (for more information, refer to https://www.docker.com/products/overview), installing it is pretty simple and is shown as follows: The following screenshot, says something like, after getting the image from Git, we build the Docker image: In this following screenshot, we see how to run it: From within the container, it is possible to activate support for Jupyter Notebooks (for more information, refer to http://jupyter.org/): Access it directly from the host machine on port: It is also possible to access TensorBoard (for more information, refer to https://www.tensorflow.org/how_tos/summaries_and_tensorboard/) with the help of the command in the screenshot that follows, which is discussed in the next section: After running the preceding command, you will be redirected to the following page: Installing Keras on Google Cloud ML Installing Keras on Google Cloud is very simple. First, we can install Google Cloud (for the downloadable file, refer to https://cloud.google.com/sdk/), a command-line interface for Google Cloud Platform; then we can use CloudML, a managed service that enables us to easily build machine, learning models with TensorFlow. Before using Keras, let's use Google Cloud with TensorFlow to train an MNIST example available on GitHub. The code is local and training happens in the cloud: In the following screenshot, you can see how to run a training session: We can use TensorBoard to show how cross-entropy decreases across iterations: In the next screenshot, we see the graph of cross-entropy: Now, if we want to use Keras on the top of TensorFlow, we simply download the Keras source from PyPI (for the downloadable file, refer to https://pypi.python.org/pypi/Keras/1.2.0 or later versions) and then directly use Keras as a CloudML package solution, as in the following example: Here, trainer.task2.py is an example script: from keras.applications.vgg16 import VGG16 from keras.models import Model from keras.preprocessing import image from keras.applications.vgg16 import preprocess_input import numpy as np # pre-built and pre-trained deep learning VGG16 model base_model = VGG16(weights='imagenet', include_top=True) for i, layer in enumerate(base_model.layers): print (i, layer.name, layer.output_shape) Thus we saw, how fairly easy it is to set up and run Keras on a Docker container and Cloud ML. If this article interested you, make sure to check out our book Deep Learning with Keras, where you can learn to install Keras on other popular platforms such as Amazon Web Services and Microsoft Azure.  
9 Useful R Packages for NLP & Text Mining

Amey Varangaonkar
18 Dec 2017
6 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering Text Mining with R, co-authored by Ashish Kumar and Avinash Paul. This book lists various techniques to extract useful and high-quality information from your textual data.[/box] There is a wide range of packages available in R for natural language processing and text mining. In the article below, we present some of the popular and widely used R packages for NLP: OpenNLP OpenNLP is an R package which provides an interface, Apache OpenNLP, which is a  machine-learning-based toolkit written in Java for natural language processing activities. Apache OpenNLP is widely used for most common tasks in NLP, such as tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and so on. It provides wrappers for Maxent entropy models using the Maxent Java package. It provides functions for sentence annotation, word annotation, POS tag annotation, and annotation parsing using the Apache OpenNLP chunking parser. The Maxent Chunk annotator function computes the chunk annotation using the Maxent chunker provided by OpenNLP. The Maxent entity annotator function in R package utilizes the Apache OpenNLP Maxent name finder for entity annotation. Model files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. These language models can be effectively used in R packages by installing the OpenNLPmodels.language package from the repository at http://datacube.wu.ac.at. Get the OpenNLP package here. Rweka The RWeka package in R provides an interface to Weka. Weka is an open source software developed by a machine learning group at the University of Wakaito, which provides a wide range of machine learning algorithms which can either be directly applied to a dataset or it can be called from a Java code. Different data-mining activities, such as data processing, supervised and unsupervised learning, association mining, and so on, can be performed using the RWeka package. For natural language processing, RWeka provides tokenization and stemming functions. RWeka packages provide an interface to Alphabetic, NGramTokenizers, and wordTokenizer functions, which can efficiently perform tokenization for contiguous alphabetic sequence, string-split to n-grams, or simple word tokenization, respectively. Get started with Rweka here. RcmdrPlugin.temis The RcmdrPlugin.temis package in R provides a graphical integrated text-mining solution. This package can be leveraged for many text-mining tasks, such as importing and cleaning a corpus, terms and documents count, term co-occurrences, correspondence analysis, and so on. Corpora can be imported from different sources and analysed using the importCorpusDlg function. The package provides flexible data source options to import corpora from different sources, such as text files, spreadsheet files, XML, HTML files, Alceste format and Twitter search. The Import function in this package processes the corpus and generates a term-document matrix. The package provides different functions to summarize and visualize the corpus statistics. Correspondence analysis and hierarchical clustering can be performed on the corpus. The corpusDissimilarity function helps analyse and create a crossdissimilarity table between term-documents present in the corpus. This package provides many functions to help the users explore the corpus. 
For example, frequentTerms to list the most frequent terms of a corpus, specificTerms to list terms most associated with each document, subsetCorpusByTermsDlg to create a subset of the corpus. Term frequency, term co-occurrence, term dictionary, temporal evolution of occurrences or term time series, term metadata variables, and corpus temporal evolution are among the other very useful functions available in this package for text mining. Download the package from CRAN page. tm The tm package is a text-mining framework which provides some powerful functions which will aid in text-processing steps. It has methods for importing data, handling corpus, metadata management, creation of term document matrices, and preprocessing methods. For managing documents using the tm package, we create a corpus which is a collection of text documents. There are two types of implementation, volatile corpus (VCorpus) and permanent corpus (PCropus). VCorpus is completely held in memory and when the R object is destroyed the corpus is gone. PCropus is stored in the filesystem and is present even after the R object is destroyed; this corpus can be created by using the VCorpus and PCorpus functions respectively. This package provides a few predefined sources which can be used to import text, such as DirSource, VectorSource, or DataframeSource. The getSources method lists available sources, and users can create their own sources. The tm package ships with several reader options: readPlain, readPDF, and readDOC. We can execute the getReaders method for an up-to-date list of available readers. To write a corpus to the filesystem, we can use writeCorpus. For inspecting a corpus, there are methods such as inspect and print. For transformation of text, such as stop-word removal, stemming, whitespace removal, and so on, we can use the tm_map, content_transformer, tolower, stopwords("english") functions. For metadata management, meta comes in handy. The tm package provides various quantitative function for text analysis, such as DocumentTermMatrix , findFreqTerms, findAssocs, and removeSparseTerms. Download the tm package here. languageR languageR provides data sets and functions for statistical analysis on text data. This package contains functions for vocabulary richness, vocabulary growth, frequency spectrum, also mixed-effects models and so on. There are simulation functions available: simple regression, quasi-F factor, and Latin-square designs. Apart from that, this package can also be used for correlation, collinearity diagnostic, diagnostic visualization of logistic models, and so on. koRpus The koRpus package is a versatile tool for text mining which implements many functions for text readability and lexical variation. Apart from that, it can also be used for basic level functions such as tokenization and POS tagging. You can find more information about its current version and dependencies here. RKEA The RKEA package provides an interface to KEA, which is a tool for keyword extraction from texts. RKEA requires a keyword extraction model, which can be created by manually indexing a small set of texts, using which it extracts keywords from the document. maxent The maxent package in R provides tools for low-memory implementation of multinomial logistic regression, which is also called the maximum entropy model. This package is quite helpful for classification processes involving sparse term-document matrices, and low memory consumption on huge datasets. Download and get started with maxent. 
lsa Truncated singular vector decomposition can help overcome the variability in a term-document matrix by deriving the latent features statistically. The lsa package in R provides an implementation of latent semantic analysis. The ease of use and efficiency of R packages can be very handy when carrying out even the trickiest of text mining task. As a result, they have grown to become very popular in the community. If you found this post useful, you should definitely refer to our book Mastering Text Mining with R. It will give you ample techniques for effective text mining and analytics using the above mentioned packages.
NIPS 2017 Special: Decoding the Human Brain for Artificial Intelligence to make smarter decisions

Amarabha Banerjee
18 Dec 2017
6 min read
Yael Niv is an Associate Professor of Psychology at the Princeton Neuroscience Institute since 2007. Her preferred areas of research include human and animal reinforcement learning and decision making. At her Niv lab, she studies day-to-day processes that animals and humans use to learn by trial and error, without explicit instructions given. In order to predict future events and to act upon the current environment so as to maximize reward and minimize the damage. Our article aims to deliver key points from Yael Niv’s keynote presentation at NIPS 2017. She talks about the ability of Artificial Intelligence systems to perform simple human-like tasks effectively using State representations in the human brain. The talk also deconstructs the complex human decision-making process. Further, we explore how a human brain breaks down complex procedures into simple states and how these states determine our decision-making capabilities.This, in turn, gives valuable insights into the design and architecture of smart AI systems with decision-making capabilities. Staying Simple is Complex What do you think happens when a human being crosses a road, especially when it’s a busy street and you constantly need to keep an eye on multiple checkpoints in order to be safe and sound? The answer is quite ironical. The human brain breaks down the complex process into multiple simple blocks. The blocks can be termed as states - and these states then determine decisions such as when to cross the road or at what speed to cross the road. In other words, the states can be anything - from determining the incoming traffic density to maintaining the calculation of your walking speed. These states help the brain to ignore other spurious or latent tasks in order to complete the priority task at hand. Hence, the computational power of the brain is optimized. The human brain possesses the capability to focus on the most important task at hand and then breaks it down into multiple simple tasks. The process of making smarter AI systems with complex decision-making capabilities can take inspiration from this process. The Practical Human Experiment To observe how the human brain behaves when urged to draw complex decisions, a few experiments were performed. The primary objective of these experiments was to verify the hypothesis that the decision making information in the human brain is stored in a part of the frontal brain called as Orbitofrontal cortex. The two experiments performed are described in brief below: Experiment 1 The participants were given sets of circles at random and they were asked to guess the number of circles in the cluster within 2 minutes. After they guessed the first time, the experimenter disclosed the correct number of circles. Then the subjects were further given a cluster of circles in two different colors (red and yellow) to repeat the guessing activity for each cluster. However, the experimenter never disclosed the fact that they will be given different colored clusters next. Observation: The most important observation derived from the experiment was that after the subject knew the correct count, their guesses revolved around that number irrespective of whether that count mattered for the next set of circle clusters given. That is, the count had actually changed for the two color specimens given to them. 
The important factor here is that the participants were not told that color would be a parameter to determine the number of circles in each set and still it played a huge part in guessing the number of circles in each set. This way it acted as a latent factor, which was present in the subconscious of the participants and was not a direct parameter. And, this being a latent factor was not in the list of parameters which played an important in determining the number of circles. But still, it played an important part in changing the overall count which was significantly higher for the red color than for the yellow color cluster. Hence, the experiment proved the hypothesis that latent factors are an integral part of intelligent decision-making capabilities in human beings. Experiment 2 The second experiment was performed to ascertain the hypothesis that the Orbitofrontal cortex contains all the data to help the human brain make complex decisions. For this, human brains were monitored using MRI to track the brain activity during the decision making process. In this experiment, the subjects were given a straight line and a dot. They were then asked to predict the next line from the dot - both in terms of line direction and its length. After completing this process for a given number of times, the participants were asked to remember the length and direction of the first line. There was a minor change among the sets of lines and dots. One group had a gradual change in line length and direction and another group had a drastic change in the middle. Observation: The results showed that the group with a gradual change of line length and direction were more helpful in preserving the first data and the one with drastic change was less accurate. The MRI reports showed signs that the classification information was primarily stored in the Orbitofrontal cortex. Hence it is considered as one of the most important parts of the human decision-making process. Shallow Learning with Deep Representations The decision-making capabilities and the effect of latent factors involved in it form the basis of dormant memory in humans. An experiment on rats was performed to explain this phenomenon. In the experiment, 4 rats were given electric shock accompanied by a particular type of sound for a day or two. On the third day, they reacted to the sound even without being given electric shocks. Ivan Pavlov has coined this term as Classical Conditioning theory wherein a relatively permanent change in behavior can be seen as a result of experience or continuous practice. Such instances of conditioning can be deeply damaging, for example in case of PTSD (Post Traumatic Stress Disorder) patients and other trauma victims. In order to understand the process of State representations being stored in memory, the reversal mechanism, i.e how to reverse the process also needs to be understood. For that, three techniques were tested on these rats: The rats were not given any shock but were subjected to the sound The rats were given shocks accompanied by sound at regular intervals and sounds without shock The shocks were slowly reduced in numbers but the sound continued The best results in reversing the memory were observed in case of the third technique, which is known as gradual extinction. In this way, a simple reinforcement learning mechanism is shown to be very effective because it helps in creating simple states which are manageable efficiently and trainable easily. 
Along with this, if we could extract information from brain imaging data derived from the Orbitofrontal cortex, these simple representational states can shed a lot of light into making complex computational processes simpler and enable us to make smarter AI systems for a better future.
Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons

Sunith Shetty
16 Dec 2017
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Big Data Analytics with Java by Rajat Mehta. Java is the de facto language for major big data environments like Hadoop, MapReduce etc. This book will teach you how to perform analytics on big data with production-friendly Java.[/box] From our below given post, we help you learn how to classify flower species from Iris dataset using multi-layer perceptrons. Code files are available for download towards the end of the post. Flower species classification using multi-layer perceptrons This is a simple hello world-style program for performing classification using multi-layer perceptrons. For this, we will be using the famous Iris dataset, which can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Iris. This dataset has four types of datapoints, shown as follows: Attribute name Attribute description Petal Length Petal length in cm Petal Width Petal width in cm Sepal Length Sepal length in cm Sepal Width Sepal width in cm Class The type of iris flower that is Iris Setosa, Iris Versicolour, Iris Virginica This is a simple dataset with three types of Iris classes, as mentioned in the table. From the perspective of our neural network of perceptrons, we will be using the multi-perceptron algorithm bundled inside the spark ml library and will demonstrate how you can club it with the Spark-provided pipeline API for the easy manipulation of the machine learning workflow. We will also split our dataset into training and testing bundles so as to separately train our model on the training set and finally test its accuracy on the test set. Let's now jump into the code of this simple example. First, create the Spark configuration object. In our case, we also mention that the master is local as we are running it on our local machine: SparkConf sc = new SparkConf().setMaster("local[*]"); Next, build the SparkSession with this configuration and provide the name of the application; in our case, it is JavaMultilayerPerceptronClassifierExample: SparkSession spark = SparkSession  .builder()  .config(sc)  .appName("JavaMultilayerPerceptronClassifierExample")  .getOrCreate(); Next, provide the location of the iris dataset file: String path = "data/iris.csv"; Now load this dataset file into a Spark dataset object. As the file is in an csv format, we also specify the format of the file while reading it using the SparkSession object: Now load this dataset file into a Spark dataset object. As the file is in an csv format, we also specify the format of the file while reading it using the SparkSession object: Dataset<Row> dataFrame1 = spark.read().format("csv").load(path); After loading the data from the file into the dataset object, let's now extract this data from the dataset and put it into a Java class, IrisVO. This IrisVO class is a plain POJOand has the attributes to store the data point types, as shown: public class IrisVO { private Double sepalLength; private Double petalLength; private Double petalWidth; private Double sepalWidth; private String labelString; On the dataset object dataFrame1, we invoke the to JavaRDD method to convert it into an RDD object and then invoke the map function on it. The map function is linked to a lambda function, as shown. In the lambda function, we go over each row of the dataset and pull the data items from it and fill it in the IrisVO POJO object before finally returning this object from the lambda function. 
This way, we get a dataMap rdd object filled with IrisVO objects: JavaRDD<IrisVO> dataMap = dataFrame1.toJavaRDD().map( r -> {  IrisVO irisVO = new IrisVO();  irisVO.setLabelString(r.getString(5));  irisVO.setPetalLength(Double.parseDouble(r.getString(3)));  irisVO.setSepalLength(Double.parseDouble(r.getString(1)));  irisVO.setPetalWidth(Double.parseDouble(r.getString(4)));  irisVO.setSepalWidth(Double.parseDouble(r.getString(2)));  return irisVO; }); As we are using the latest Spark ML library for applying our machine learning algorithms from Spark, we need to convert this RDD back to a dataset. In this case, however, this dataset would have the schema for the individual data points as we had mapped them to the IrisVO object attribute types earlier: Dataset<Row> dataFrame = spark.createDataFrame(dataMap.rdd(), IrisVO. class); We will now split the dataset into two portions: one for training our multi-layer perceptron model and one for testing its accuracy later. For this, we are using the prebuilt randomSplit method available on the dataset object and will provide the parameters. We keep 70 percent for training and 30 percent for testing. The last entry is the 'seed' value supplied to the randomSplit method. Dataset<Row>[] splits = dataFrame.randomSplit(new double[]{0.7, 0.3}, 1234L); Next, we extract the splits into individual datasets for training and testing: Dataset<Row> train = splits[0]; Dataset<Row> test = splits[1]; Until now we had seen the code that was pretty much generic across most of the Spark machine learning implementations. Now we will get into the code that is specific to our multi-layer perceptron model. We will create an int array that will contain the count for the various attributes needed by our model: int[] layers = new int[] {4, 5, 4, 3}; Let's now look at the attribute types of this int array, as shown in the following table: Attribute value at array index Description 0 This is the number of neurons or perceptrons at the input layer of the network. This is the count of the number of features that are passed to the model. 1 This is a hidden layer containing five perceptrons (sigmoid neurons only, ignore the terminology). 2 This is another hidden layer containing four sigmoid neurons. 3 This is the number of neurons representing the output label classes. In our case, we have three types of Iris flowers, hence three classes. After creating the layers for the neural network and specifying the number of neurons in each layer, next build a StringIndexer class. Since our models are mathematical and look for mathematical inputs for their computations, we have to convert our string labels for classification (that is, Iris Setosa, Iris Versicolour, and Iris Virginica) into mathematical numbers. To do this, we use the StringIndexer class that is provided by Apache Spark. In the instance of this class, we also provide the place from where we can read the data for the label and the column where it will output the numerical representation for that label: StringIndexer labelIndexer = new StringIndexer(). setInputCol("labelString").setOutputCol("label"); Now we build the features array. These would be the features that we use when training our model: String[] featuresArr = {"sepalLength","sepalWidth","petalLength","pet alWidth"}; Next, we build a features vector as this needs to be fed to our model. To put the feature in vector form, we use the VectorAssembler class from the Spark ML library. 
We also provide a features array as input and provide the output column where the vector array will be printed: VectorAssembler va = new VectorAssembler().setInputCols(featuresArr). setOutputCol("features"); Now we build the multi-layer perceptron model that is bundled within the Spark ML library. To this model we supply the array of layers we created earlier. This layer array has the number of neurons (sigmoid neurons) that are needed in each layer of the multi-perceptron network: MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()  .setLayers(layers)  .setBlockSize(128)  .setSeed(1234L)  .setMaxIter(25); The other parameters that are being passed to this multi-layer perceptron model are: Block Size Block size for putting input data in matrices for faster computation. The default value is 128. Seed Seed for weight initialization if weights are not set. Maximum iterations Maximum number of iterations to be performed on the dataset while learning. The default value is 100. Finally, we hook all the workflow pieces together using the pipeline API. To this pipeline API, we pass the different pieces of the workflow, that is, the labelindexer and vector assembler, and finally provide the model: Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {labelIndexer, va, trainer}); Once our pipeline object is ready, we fit the model on the training dataset to train our model on the underlying training data: PipelineModel model = pipeline.fit(train); Once the model is trained, it is not yet ready to be run on the test data to figure out its predictions. For this, we invoke the transform method on our model and store the result in a Dataset object: Dataset<Row> result = model.transform(test); Let's see the first few lines of this result by invoking a show method on it: result.show(); This would print the result of the first few lines of the result dataset as shown: As seen in the previous image, the last column depicts the predictions made by our model. After making the predictions, let's now check the accuracy of our model. For this, we will first select two columns in our model which represent the predicted label, as well as the actual label (recall that the actual label is the output of our StringIndexer): Dataset<Row> predictionAndLabels = result.select("prediction", "label"); Finally, we will use a standard class called MulticlassClassificationEvaluator, which is provided by Spark for checking the accuracy of the models. We will create an instance of this class. Next, we will set the metric name of the metric, that is, accuracy, for which we want to get the value from our predicted results: MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setMetricName("accuracy"); Next, using the instance of this evaluator, invoke the evaluate method and pass the parameter of the dataset that contains the column for the actual result and predicted result (in our case, it is the predictionAndLabels column): System.out.println("Test set accuracy = " + evaluator.evaluate(predictionAndLabels)); This would print the output as: If we get this value in a percentage, this means that our model is 95% accurate. This is the beauty of neural networks - they can give us very high accuracy when tweaked properly. With this, we come to an end for our small hello world-type program on multi-perceptrons. Unfortunately, Spark support on neural networks and deep learning is not extensive; at least not until now. 
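The book's example is written in Java; if you prefer Python, the same pipeline can be expressed with PySpark's MultilayerPerceptronClassifier. The following is a rough, hypothetical sketch (not the book's code), which assumes the first column of data/iris.csv is a row identifier, as the column indexing used above suggests:

# Rough PySpark equivalent of the Java pipeline above.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.master("local[*]").appName("IrisMLP").getOrCreate()

df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .load("data/iris.csv")
      .toDF("id", "sepalLength", "sepalWidth", "petalLength", "petalWidth", "labelString"))

train, test = df.randomSplit([0.7, 0.3], seed=1234)

indexer = StringIndexer(inputCol="labelString", outputCol="label")
assembler = VectorAssembler(
    inputCols=["sepalLength", "sepalWidth", "petalLength", "petalWidth"],
    outputCol="features")
# 4 input features, two hidden layers (5 and 4 neurons), 3 output classes.
mlp = MultilayerPerceptronClassifier(layers=[4, 5, 4, 3], blockSize=128,
                                     seed=1234, maxIter=25)

model = Pipeline(stages=[indexer, assembler, mlp]).fit(train)
predictions = model.transform(test)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy =", evaluator.evaluate(predictions.select("prediction", "label")))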
To summarize, we covered a sample case study for the classification of Iris flower species based on the features that were used to train our neural network. If you are keen to know more about real-time analytics using deep learning methodologies such as neural networks and multi-layer perceptrons, you can refer to the book Big Data Analytics with Java. [box type="download" align="" class="" width=""]Download Code files[/box]      
NIPS 2017 Special: A deep dive into Deep Bayesian and Bayesian Deep Learning with Yee Whye Teh

Savia Lobo
15 Dec 2017
8 min read
Yee Whye Teh is a professor at the Department of Statistics of the University of Oxford and also a research scientist at DeepMind. He works on statistical machine learning, focussing on Bayesian nonparametrics, probabilistic learning, and deep learning. This article aims to bring our readers the key points from Yee's keynote speech at NIPS 2017. Yee's keynote ponders the interface between two perspectives on machine learning, Bayesian learning and deep learning, by exploring questions like: How can probabilistic thinking help us understand deep learning methods or lead us to interesting new methods? Conversely, how can deep learning technologies help us develop advanced probabilistic methods? For a more comprehensive and in-depth understanding of this novel approach, be sure to watch the complete keynote address by Yee Whye Teh on the NIPS Facebook page. All images in this article come from Yee's presentation slides and do not belong to us.

The history of machine learning has shown a growth in both model complexity and model flexibility. Theory-led models have started to lose their shine, because machine learning is at the forefront of a revolution that could be called the data revolution, built on data-led models. As opposed to theory-led models, data-led models try not to impose too many assumptions on the processes that have to be modeled; they are instead superflexible non-parametric models that can capture the complexities, but they require large amounts of data to operate.

On the model flexibility side, various approaches have been explored over the years: kernel methods, Gaussian processes, Bayesian nonparametrics, and now deep learning as well. The community has also developed ever more complex frameworks, both graphical and programmatic, to compose large complex models from simpler building blocks. In the 90s we had graphical models, later we had probabilistic programming systems, followed by deep learning systems like TensorFlow, Theano, and Torch. A recent addition is Probabilistic Torch, which brings together ideas from both probabilistic Bayesian learning and deep learning. On the one hand we have Bayesian learning, which deals with learning as inference in some probabilistic model. On the other hand we have deep learning, which views learning as the optimization of functions parametrized by neural networks. In recent years there has been an explosion of exciting research at the interface of these two popular approaches, resulting in increasingly complex and exciting models.

What is the Bayesian theory of learning?

Bayesian learning describes an ideal learner as one who interacts with the world in order to learn its state, which is given by θ. He/she makes some observations x about the world and reasons about them through a model. This model is a joint distribution over both the unknown state of the world θ and the observation about the world x. The model consists of a prior distribution over θ and a likelihood, the conditional distribution of x given θ; combining them via Bayes' rule gives the reverse conditional distribution, also known as the posterior, p(θ | x) ∝ p(x | θ) p(θ), which describes the totality of the agent's knowledge about the world after he/she sees x. This posterior can also be used for predicting future observations and acting accordingly.

Issues associated with Bayesian learning

1. Rigidity:
Learning can be wrong if the model is wrong
Not all prior knowledge can be encoded as a joint distribution
Simple analytic forms are limiting for conditional distributions
Issues associated with Bayesian learning
1. Rigidity:
Learning can go wrong if the model is wrong
Not all prior knowledge can be encoded as a joint distribution
Simple analytic forms are limiting for conditional distributions
2. Scalability: The posterior is generally intractable to compute, so approximations have to be made, which introduces trade-offs between efficiency and accuracy. As a result, it is often assumed that Bayesian techniques are not scalable.

To address these issues, the speaker highlights some of his recent projects, which showcase scenarios where deep learning ideas are applied to Bayesian models (Deep Bayesian learning) or, in reverse, Bayesian ideas are applied to neural networks (Bayesian Deep learning).

Deep Bayesian learning: Deep learning assists Bayesian learning
Deep learning can improve Bayesian learning in the following ways:
Improve modeling flexibility by using neural networks in the construction of Bayesian models
Improve inference and scalability by parameterizing the posterior with neural networks
Amortize inference over multiple runs

These can be seen in the following projects showcased by Yee:
Concrete VAEs (Variational Autoencoders)
FIVO: Filtered Variational Objectives

Concrete VAEs
What are VAEs?
All the qualities mentioned above, i.e. improving modeling flexibility, improving inference and scalability, and amortizing inference over multiple runs with neural networks, can be seen in a class of deep generative models known as VAEs (Variational Autoencoders).

Fig: Variational Autoencoders

VAEs include latent variables that describe the contents of a scene, i.e. objects and pose. The relationship between these latent variables and the pixels has to be highly complex and nonlinear. In short, VAEs use neural networks to parameterize both the generative model and the variational posterior distribution, which allows for much more flexible modeling. The key that makes VAEs work is the reparameterization trick.

Fig: Adding reparameterization to VAEs

The reparameterization trick applies to the continuous latent variables in VAEs, but many models naturally include discrete latent variables. As a workaround, Yee suggests applying a reparameterization to the discrete latent variables as well. This brings us to the concept of Concrete VAEs: CONtinuous relaxations of disCRETE distributions. The density of the relaxed distribution can also be calculated. This Concrete distribution is the reparameterization trick for discrete variables, and it helps in calculating the KL divergence that is needed for variational inference.
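To make these two tricks concrete, here is a small NumPy sketch (my own illustration of the general idea, not code from the talk): a Gaussian latent variable sampled via the reparameterization trick, and a Concrete (Gumbel-Softmax) relaxation of a discrete sample. The temperature value and logits are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# Gaussian reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).
# The randomness lives in eps, so gradients can flow through mu and sigma.
def sample_gaussian(mu, log_sigma):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

# Concrete / Gumbel-Softmax relaxation: a differentiable "soft" one-hot sample
# that approaches a draw from softmax(logits) as the temperature goes to zero.
def sample_concrete(logits, temperature=0.5):
    gumbel_noise = rng.gumbel(size=logits.shape)
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                        # for numerical stability
    return np.exp(y) / np.exp(y).sum()     # softmax

mu, log_sigma = np.zeros(3), np.zeros(3)
print("continuous latent sample:", sample_gaussian(mu, log_sigma))
print("relaxed discrete sample :", sample_concrete(np.array([2.0, 0.5, 0.1])))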
FIVO: Filtered Variational Objectives
FIVO extends VAEs towards models for sequential and time-series data. It is built upon another extension of VAEs known as the Importance Weighted Autoencoder (IWAE), a generative model with a similar architecture to the VAE, but one that uses a strictly tighter log-likelihood lower bound. The derivation on the slides proceeds in three steps: the variational lower bound, its rederivation from importance sampling, and the observation that it is better to use multiple samples. Using Importance Weighted Autoencoders we can draw multiple samples, which gives a tighter lower bound, and optimizing this tighter bound should lead to better learning.

Let's have a look at the FIVO objectives:
We can use any unbiased estimator of the marginal probability p(X)
The tightness of the bound is related to the variance of the estimator
For sequential models we can use particle filters, which produce an unbiased estimator of the marginal probability and can also have much lower variance than importance samplers

Bayesian Deep learning
The Bayesian approach to deep learning gives us counterintuitive and surprising ways to make deep learning scalable. In order to explore the potential of Bayesian learning with deep neural networks, Yee introduced a project named The Posterior Server.

The Posterior Server
The Posterior Server is a distributed server for deep learning. It makes use of the Bayesian approach in order to make neural networks highly scalable. This project focuses on distributed learning, where both the data and the computations are spread across the network. The architecture, shown on Yee's slides, has a bunch of workers, each communicating with a parameter server, which effectively maintains the authoritative copy of the parameters of the network. At each iteration, each worker obtains the latest copy of the parameters from the server, computes a gradient update based on its data, and sends it back to the server, which then applies it to the authoritative copy.

Communication over the network tends to be slower than the computation that can be done locally, so one might consider taking multiple gradient steps per iteration before sending the accumulated update back to the parameter server. The problem is that the parameters on the worker quickly get out of sync with the authoritative copy on the parameter server. This leads to stale updates, which introduce noise into the system, and we then need frequent synchronizations across the network for the algorithm to learn in a stable fashion.

The main idea, in the Bayesian context, is that we don't just want a single parameter value, we want a whole distribution over parameters. This relaxes the need for frequent synchronizations across the network and hopefully leads to algorithms that are robust to less frequent communication. Each worker constructs its own tractable approximation to its local likelihood function and sends this information to the posterior server, which combines these approximations to form the full posterior, or an approximation of it. The approximations are built from the statistics of a sampling algorithm that runs locally on each worker. The actual algorithm combines variational algorithms, Stochastic Gradient EP, and Markov chain Monte Carlo on the workers themselves: the variational part handles communication over the network, whereas the MCMC part handles sampling from the local posterior to construct the statistics that the variational part needs. For scalability, the sampler is a stochastic gradient Langevin algorithm, a simple generalization of stochastic gradient descent that injects additional noise into the updates in order to sample from the posterior (a toy sketch of this update appears at the end of this article). To experiment with the server, densely connected neural networks with 500 ReLU units were trained on the MNIST dataset.

You can get a detailed understanding of these examples in the keynote video. This interface between Bayesian learning and deep learning is a very exciting frontier: researchers have brought the management of uncertainty into deep learning, and flexibility and scalability into Bayesian modeling. Yee concludes with two questions for the audience to think about:
Does being Bayesian in the space of functions make more sense than being Bayesian in the space of parameters?
How do we deal with uncertainty under model misspecification?
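As a toy illustration of the stochastic gradient Langevin update mentioned above (my own sketch, not the Posterior Server code), the following Python snippet draws samples from a simple one-dimensional posterior; the target distribution and step size are made up for the example.

import numpy as np

rng = np.random.default_rng(1)

def sgld_step(theta, grad_log_posterior, step_size):
    # One stochastic gradient Langevin step: a gradient move on the log posterior
    # plus injected Gaussian noise, so the iterates explore the posterior
    # instead of collapsing onto a single mode.
    noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_posterior(theta) + noise

# Toy target: posterior proportional to N(theta | 2, 1), so grad log p(theta) = 2 - theta.
grad_log_posterior = lambda theta: 2.0 - theta

theta = np.zeros(1)
samples = []
for _ in range(5000):
    theta = sgld_step(theta, grad_log_posterior, step_size=0.01)
    samples.append(theta[0])

print("posterior mean estimate:", np.mean(samples[1000:]))  # should be close to 2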


How Google's MapReduce works and why it matters for Big Data projects

Sugandha Lahoti
15 Dec 2017
7 min read
[box type="note" align="" class="" width=""]The article given below is a book extract from Java Data Analysis written by John R. Hubbard. The book will give you the most out of popular Java libraries and tools to perform efficient data analysis.[/box] In this article, we will explore Google’s MapReduce framework to analyze big data. How do you quickly sort a list of billion elements? Or multiply two matrices, each with a million rows and a million columns? In implementing their PageRank algorithm, Google quickly discovered the need for a systematic framework for processing massive datasets. That could be done only by distributing the data and the processing over many storage units and processors. Implementing a single algorithm, such as PageRank in that environment is difficult, and maintaining the implementation as the dataset grows is even more challenging. The solution: MapReduce framework The answer lay in separating the software into two levels: a framework that manages the big data access and parallel processing at a lower level, and a couple of user-written methods at an upper-level. The independent user who writes the two methods need not be concerned with the details of the big data management at the lower level. How does it function Specifically, the data flows through a sequence of stages: The input stage divides the input into chunks, usually 64MB or 128MB. The mapping stage applies a user-defined map() function that generates from one key-value pair a larger collection of key-value pairs of a different type. The partition/grouping stage applies hash sharding to those keys to group them. The reduction stage applies a user-defined reduce() function to apply some specific algorithm to the data in the value of each key-value pair. The output stage writes the output from the reduce() method. The user's choice of map() and reduce() methods determines the outcome of the entire process; hence the name MapReduce. This idea is a variation on the old algorithmic paradigm called divide and conquer. Think of the proto-typical mergesort, where an array is sorted by repeatedly dividing it into two halves until the pieces have only one element, and then they are systematically pairwise merged back together. MapReduce is actually a meta-algorithm—a framework, within which specific algorithms can be implemented through its map() and reduce() methods. Extremely powerful, it has been used to sort a petabyte of data in only a few hours. Recall that a petabyte is 10005 = 1015 bytes, which is a thousand terabytes or a million gigabytes. Some examples of MapReduce applications Here are a few examples of big data problems that can be solved with the MapReduce framework: Given a repository of text files, find the frequency of each word. This is called the WordCount problem. Given a repository of text files, find the number of words of each word length. Given two matrices in a sparse matrix format, compute their product. Factor a matrix given in sparse matrix format. Given a symmetric graph whose nodes represent people and edges represent friendship, compile a list of common friends. Given a symmetric graph whose nodes represent people and edges represent friendship, compute the average number of friends by age. Given a repository of weather records, find the annual global minima and maxima by year. Sort a large list. Note that in most implementations of the MapReduce framework, this problem is trivial, because the framework automatically sorts the output from the map() function. Reverse a graph. 
Some examples of MapReduce applications
Here are a few examples of big data problems that can be solved with the MapReduce framework:
Given a repository of text files, find the frequency of each word. This is called the WordCount problem.
Given a repository of text files, find the number of words of each word length.
Given two matrices in a sparse matrix format, compute their product.
Factor a matrix given in sparse matrix format.
Given a symmetric graph whose nodes represent people and edges represent friendship, compile a list of common friends.
Given a symmetric graph whose nodes represent people and edges represent friendship, compute the average number of friends by age.
Given a repository of weather records, find the annual global minima and maxima by year.
Sort a large list. Note that in most implementations of the MapReduce framework, this problem is trivial, because the framework automatically sorts the output from the map() function.
Reverse a graph.
Find a minimal spanning tree (MST) of a given weighted graph.
Join two large relational database tables.

The WordCount example
In this section, we present the MapReduce solution to the WordCount problem, sometimes called the Hello World example for MapReduce. The diagram in the figure below shows the data flow for the WordCount program. On the left are two of the 80 files that are read into the program. During the mapping stage, each word, followed by the number 1, is copied into a temporary file, one pair per line. Notice that many words are duplicated many times. For example, image appears five times among the 80 files (including both files shown), so the string image 1 will appear five times in the temporary file. Each of the input files has about 110 words, so over 8,000 word-number pairs will be written to the temporary file.

Note that this figure shows only a very small part of the data involved. The output from the mapping stage includes every word that is input, as many times as it appears, while the output from the grouping stage includes every one of those words, but without duplication. The grouping process reads all the words from the temporary file into a key-value hash table, where the key is the word and the value is a string of 1s, one for each occurrence of that word in the temporary file. Notice that these 1s written to the temporary file are not really used; they are included simply because the MapReduce framework in general expects the map() function to generate key-value pairs. The reducing stage transcribes the contents of the hash table to an output file, replacing each string of 1s with the number of them. For example, the key-value pair ("book", "1 1 1 1") is written as book 4 in the output file.

Keep in mind that this is a toy example of the MapReduce process. The input consists of 80 text files containing about 9073 words. So the temporary file has 9073 lines, with one word per line. Only 2149 of those words are distinct, so the hash table has 2149 entries and the output file has 2149 lines, with one word per line.

The main idea
So, this is the main idea of the MapReduce meta-algorithm: provide a framework for processing massive datasets, a framework that allows the independent programmer to plug in specialized map() and reduce() methods that actually implement the particular algorithm required. If that particular algorithm is to count words, then write the map() method to extract each individual word from a specified file and write the key-value pair (word, 1) to wherever the specified writer will put them, and write the reduce() method to take a key-value pair such as (word, 1 1 1 1) and return the corresponding key-value pair as (word, 4) to wherever its specified writer will put it. These two methods are completely localized: they simply operate on key-value pairs. And they are completely independent of the size of the dataset.
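Following the description above, where the grouping stage stores a string of 1s for each word and the reducer replaces it with a count, a rough Python sketch of the WordCount map() and reduce() pair could look like this (an illustration only, not the book's Java implementation; the documents and function names are made up).

from collections import defaultdict

def wordcount_map(text):
    # map(): emit a (word, 1) pair for every word in the input, duplicates and all
    return [(word, 1) for word in text.lower().split()]

def wordcount_reduce(word, ones):
    # reduce(): turn e.g. ("book", "1 1 1 1") into ("book", 4)
    return word, len(ones.split())

documents = ["the book on the table", "the other book"]

# Mapping stage: every occurrence becomes one key-value pair.
pairs = [pair for doc in documents for pair in wordcount_map(doc)]

# Grouping stage: hash table keyed by word, value = string of 1s, one per occurrence.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(str(one))
grouped = {word: " ".join(ones) for word, ones in groups.items()}

# Reducing stage: replace each string of 1s with its count.
for word in sorted(grouped):
    print(*wordcount_reduce(word, grouped[word]))   # e.g. "book 2", "the 3"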
The diagram below illustrates the general flow of data through an application of the MapReduce framework. The original dataset could be in various forms and locations: a few files in a local directory, a large collection of files distributed over several nodes on the same cluster, a database on a database system (relational or NoSQL), or data sources available on the World Wide Web. The MapReduce controller then carries out these five tasks:
1. Split the data into smaller datasets, each of which can be easily accessed on a single machine.
2. Simultaneously (that is, in parallel), run a copy of the user-supplied map() method, one on each dataset, producing a set of key-value pairs in a temporary file on that local machine.
3. Redistribute the datasets among the machines, so that all instances of each key are in the same dataset. This is typically done by hashing the keys.
4. Simultaneously (in parallel), run a copy of the user-supplied reduce() method, one on each of the temporary files, producing one output file on each machine.
5. Combine the output files into a single result. If the reduce() method also sorts its output, then this last step could also include merging those outputs.
(A small parallel sketch of these five tasks on a single machine appears at the end of this article.)

The genius of the MapReduce framework is that it separates the data management (moving, partitioning, grouping, sorting, and so on) from the data crunching (counting, averaging, maximizing, and so on). The former is done with no attention required from the user. The latter is done in parallel, separately on each node, by invoking the two user-supplied methods map() and reduce(). Essentially, the only obligation of the user is to devise correct implementations of these two methods that will solve the given problem.

As we mentioned earlier, these examples are presented mainly to elucidate how the MapReduce algorithm works. Real-world implementations would, however, use MongoDB or Hadoop frameworks. If you enjoyed this excerpt, check out the book Java Data Analysis to get an understanding of the various data analysis techniques, and how to implement them using Java.
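As promised after the list of five tasks, here is a small parallel sketch on a single machine (my own illustration using Python's multiprocessing module, not a real MapReduce controller), applied to the annual minima/maxima example from the list of applications. The chunk data and function names are invented for the sketch.

from collections import defaultdict
from multiprocessing import Pool

# One weather record per line: "year temperature" (toy stand-ins for large splits)
CHUNKS = [
    "1990 13.1\n1990 19.4\n1991 12.0",
    "1991 21.7\n1990 7.5\n1991 16.2",
]

def map_records(chunk):
    # map(): one chunk of records -> (year, temperature) pairs
    pairs = []
    for line in chunk.splitlines():
        year, temp = line.split()
        pairs.append((year, float(temp)))
    return pairs

def reduce_year(item):
    # reduce(): all temperatures for one year -> (year, minimum, maximum)
    year, temps = item
    return year, min(temps), max(temps)

if __name__ == "__main__":
    with Pool() as pool:
        mapped = pool.map(map_records, CHUNKS)          # task 2: map() runs in parallel
    groups = defaultdict(list)
    for pairs in mapped:                                # task 3: regroup by key (year)
        for year, temp in pairs:
            groups[year].append(temp)
    with Pool() as pool:
        results = pool.map(reduce_year, groups.items()) # task 4: reduce() runs in parallel
    print(sorted(results))                              # task 5: combine the outputs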