How-To Tutorials

article-image-format-publish-code-using-r-markdown

29 Dec 2017

6 min read

How to format and publish code using R Markdown

29 Dec 2017

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Ahmed Sherif titled Practical Business Intelligence. This book is a complete guide for implementing Business intelligence with the help of powerful tools like D3.js, R, Tableau, Qlikview and Python that are available on the market. It starts off by preparing you for data analytics and then moves on to teach you a range of techniques to fetch important information from various databases.[/box] Today you will explore how to use R Markdown, which is a format that allows reproducible reports with embedded R code that can be published into slideshows, Word documents, PDF files, and HTML web pages. Getting started with R Markdown R Markdown documents have the .RMD extension and are created by selecting R Markdown from the menu bar of RStudio, as seen here: If this is the first time you are creating an R Markdown report, you may be prompted to install some additional packages for R Markdown to work, as seen in the following screenshot: Once the packages are installed, we can define a title, author, and default output format, as follows: For our purposes we will use HTML output. The default output of the R Markdown document will appear like this: R Markdown features and components We can go ahead and delete everything below line 7 in the previous screenshot as we will create our own template with our embedded code and formatting. Header levels can be generated using # in front of a title. The largest font size will have a single # and each subsequent # added will decrease the header level font. Whenever we wish to embed actual R code into the report, we can include it inside of a code chunk by clicking on the icon shown here: Once that icon is selected, a shaded region is created between two ``` characters where R code can be generated identical to that used in RStudio. The first header generated will be for the results, and then the subsequent header will indicate the libraries used to generate the report. This can be generated using the following script: # Results ###### Libraries used are RODBC, plotly, and forecast Executing R code inside of R Markdown The next step is to run the actual R code inside of the chunk snippet that calls the required libraries needed to generate the report. This can be generated using the following script: ```{r} # We will not see the actual libraries loaded # as it is not necessary for the end user library('RODBC') library('plotly') library('forecast') ``` We can then click on the Knit HTML icon on the menu bar to generate a preview of our code results in R Markdown. Unfortunately, this output of library information is not useful to the end user. Exporting tips for R Markdown The report output includes all the messages and potential warnings that are the result of calling a package. This is not information that is useful to the report consumer. Fortunately for R developers, these types of messages can be concealed by tweaking the R chunk snippets to include the following logic in their script: ```{r echo = FALSE, results = 'hide', message = FALSE} ``` We can continue embedding R code into our report to run queries against the SQL Server database and produce summary data of the dataframe as well as the three main plots for the time series plot, observed versus fitted smoothing, and Holt-Winters forecasting: ###### Connectivity to Data Source is through ODBC ```{r echo = FALSE, results = 'hide', message = FALSE} connection_SQLBI<-odbcConnect("SQLBI") #Get Connection Details connection_SQLBI ##query fetching begin## SQL_Query_1<-sqlQuery(connection_SQLBI, 'SELECT [WeekInYear] ,[DiscountCode] FROM [AdventureWorks2014].[dbo].[DiscountCodebyWeek]' ) ##query fetching end## #begin table manipulation colnames(SQL_Query_1)<- c("Week", "Discount") SQL_Query_1$Weeks <- as.numeric(SQL_Query_1$Week) SQL_Query_1<-SQL_Query_1[,-1] #removes first column SQL_Query_1<-SQL_Query_1[c(2,1)] #reverses columns 1 and 2 #end table manipulation ``` ### Preview of First 6 rows of data ```{r echo = FALSE, message= FALSE} head(SQL_Query_1) ``` ### Summary of Table Observations ```{r echo = FALSE, message= FALSE} str(SQL_Query_1) ``` ### Time Series and Forecast Plots ```{r echo = FALSE, message= FALSE} Query1_TS<-ts(SQL_Query_1$Discount) par(mfrow=c(3,1)) plot.ts(Query1_TS, xlab = 'Week (1-52)', ylab = 'Discount', main = 'Time Series of Discount Code by Week') discountforecasts <- HoltWinters(Query1_TS, beta=FALSE, gamma=FALSE) plot(discountforecasts) discountforecasts_8periods <- forecast.HoltWinters(discountforecasts, h=8) plot.forecast(discountforecasts_8periods, ylab='Discount', xlab = 'Weeks (1-60)', main = 'Forecasting 8 periods') ``` The final output Before publishing the output with the results, R Markdown offers the developer opportunities to prettify the end product. One effect I like to add to a report is a logo of some kind. This can be done by applying the following code to any line in R Markdown: ![](http://website.com/logo.jpg) # image is on a website ![](images/logo.jpg) # image is locally on your machine The first option adds an image from a website, and the second option adds an image locally. For my purposes, I will add a PacktPub logo right above the Results section in the R Markdown, as seen in the following screenshot: To learn more about customizing an R Markdown document, visit the following website: http://rmarkdown.rstudio.com/authoring_basics.html. Once we are ready to preview the results of the R Markdown output, we can once again select the Knit to HTML button on the menu. The new report can be seen in this screenshot: As can be seen in the final output, even if the R code is embedded within the R Markdown document, we can suppress the unnecessary technical output and reveal the relevant tables, fields, and charts that will provide the most benefit to end users and report consumers. If you have enjoyed reading this article and want to develop the ability to think along the right lines and use more than one tool to perform analysis depending on the needs of your business, do check out Practical Business Intelligence.

0
0
3893

article-image-25-startups-machine-learning-differently-2018

Fatema Patrawala

29 Dec 2017

14 min read

25 Startups using machine learning differently in 2018: From farming to brewing beer to elder care

Fatema Patrawala

29 Dec 2017

14 min read

What really excites me about data science and by extension machine learning is the sheer number of possibilities! You can think of so many applications off the top of your head: robo-advisors, computerized lawyers, digital medicine, even automating VC decisions when they invest in startups. You can even venture into automation of art and music, algorithms writing papers which are indistinguishable from human-written papers. It's like solving a puzzle, but a puzzle that's meaningful and that has real world implications. The things that we can do today weren’t possible 5 years ago, and this is largely thanks to growth in computational power, data availability, and the adoption of the cloud that made accessing these resources economical for everyone, all key enabling factors for the advancement of Machine learning and AI. Having witnessed the growth of data science as discipline, industries like finance, health-care, education, media & entertainment, insurance, retail as well as energy has left no stone unturned to harness this opportunity. Data science has the capability to offer even more; and we will see the wide range of applications in the future in places haven’t even been explored. In the years to come, we will increasingly see data powered/AI enabled products and services take on roles traditionally handled by humans as they required innately human qualities to successfully perform. In this article we have covered some use cases of Data Science being used differently and start-ups who have practically implemented it: The Nurturer: For elder care The world is aging rather rapidly. According to the World Health Organization, nearly two billion people across the world are expected to be over 60 years old by 2050, a figure that’s more than triple what it was in 2000. In order to adapt to their increasingly aging population, many countries have raised the retirement age, reducing pension benefits, and have started spending more on elderly care. Research institutions in countries like Japan, home to a large elderly population, are focusing their R&D efforts on robots that can perform tasks like lifting and moving chronically ill patients, many startups are working on automating hospital logistics and bringing in virtual assistance. They also offer AI-based virtual assistants to serve as middlemen between nurses and patients, reducing the need for frequent in-hospital visits. Dr Ben Maruthappu, a practising doctor, has brought a change to the world of geriatric care with an AI based app Cera. It is an on-demand platform to aid the elderly in need. The Cera app firmly puts itself in the category of Uber & Amazon, whereby it connects elderly people in need of care with a caregiver in a matter of few hours. The team behind this innovation also plans to use AI to track patients’ health conditions and reduce the number of emergency patients admitted in hospitals. A social companion technology - Elliq created by Intuition Robotics helps older adults stay active and engaged with a proactive social robot that overcomes the digital divide. AliveCor, a leading FDA-cleared mobile heart solution helps save lives, money, and has brought modern healthcare alive into the 21st century. The Teacher: Personalized education platform for lifelong learning With children increasingly using smartphones and tablets and coding becoming a part of national curricula around the world, technology has become an integral part of classrooms. We have already witnessed the rise and impact of education technology especially through a multitude of adaptive learning platforms that allow learners to strengthen their skills and knowledge - CBTs, LMSes, MOOCs and more. And now virtual reality (VR) and artificial intelligence (AI) are gaining traction to provide us with lifelong learning companion that can accompany and support individuals throughout their studies - in and beyond school . An AI based educational platform learns the amount of potential held by each particular student. Based on this data, tailored guidance is provided to fix mistakes and improvise on the weaker areas. A detailed report can be generated by the teachers to help them customise lesson plans to best suit the needs of the student. Take Gruff Davies’ Kwiziq for example. Gruff with his team leverage AI to provide a personalised learning experience for students based on their individual needs. Students registered on the platform get an advantage of an AI based language coach which asks them to solve various micro quizzes. Quiz solutions provided by students are then turned into detailed “brain maps”. These brain maps are further used to provide tailored instructions and feedback for improvement. Other startup firms like Blippar specialize in Augmented reality for visual and experiential learning. Unelma Platforms, a software platform development company provides state-of-the-art software for higher-education, healthcare and business markets. The Provider: Farming to be more productive, sustainable and advanced Though farming is considered the backbone of many national economies especially in the developing world, there is often an outdated view of it involving a small, family-owned lands where crops are hand harvested. The reality of modern-day farms have had to overhaul operations to meet demand and remain competitively priced while adapting to the ever-changing ways technology is infiltrating all parts of life. Climate change is a serious environmental threat farmers must deal with every season: Strong storms and severe droughts have made farming even more challenging. Additionally lack of agricultural input, water scarcity, over-chemicalization in fertilizers, water & soil pollution or shortage of storage systems has made survival for farmers all the more difficult. To overcome these challenges, smart farming techniques are the need of an hour for farmers in order to manage resources and sustain in the market. For instance, in a paper published by arXiv, the team explains how they used a technique known as transfer learning to teach the AI how to recognize crop diseases and pest damage.They utilized TensorFlow, to build and train a neural network of their own, which involved showing the AI 2,756 images of cassava leaves from plants in Tanzania. Their efforts were a success, as the AI was able to correctly identify brown leaf spot disease with 98 percent accuracy. WeFarm, SaaS based agritech firm, headquartered in London, aims to bridge the connectivity gap amongst the farmer community. It allows them to send queries related to farming via text message which is then shared online into several languages. The farmer then receives a crowdsourced response from other farmers around the world. In this way, a particular farmer in Kenya can get a solution from someone sitting in Uganda, without having to leave his farm, spend additional money or without accessing the internet. Benson Hill Bio-systems, by Matthew B. Crisp, former President of Agricultural Biotechnology Division, has differentiated itself by bringing the power of Cloud Biology™ to agriculture. It combines cloud computing, big data analytics, and plant biology to inspire innovation in agriculture. At the heart of Benson Hill is CropOS™, a cognitive engine that integrates crop data and analytics with the biological expertise and experience of the Benson Hill scientists. CropOS™ continuously advances and improves with every new dataset, resulting in the strengthening of the system’s predictive power. Firms like Plenty Inc and Bowery Farming Inc are nowhere behind in offering smart farming solutions. Plenty Inc is an agriculture technology company that develops plant sciences for crops to flourish in a pesticide and GMO-free environment. While Bowery Farming uses high-tech approaches such as robotics, LED lighting and data analytics to grow leafy greens indoors. The Saviour: For sustainability and waste management The global energy landscape continues to evolve, sometimes by the nanosecond, sometimes by the day. The sector finds itself pulled to economize and pushed to innovate due to a surge in demand for new power and utilities offerings. Innovations in power-sector technology, such as new storage battery options and smartphone-based thermostat apps, AI enabled sensors etc; are advancing at a pace that has surprised developers and adopters alike. Consumer’s demands for such products have increased. To meet this, industry leaders are integrating those innovations into their operations and infrastructure as rapidly as they can. On the other hand, companies pursuing energy efficiency have two long-standing goals — gaining a competitive advantage and boosting the bottom line — and a relatively new one: environmental sustainability. Realising the importance of such impending situations in the industry, we have startups like SmartTrace offering an innovative cloud-based platform to quickly manage waste at multiple levels. This includes bridging rough data from waste contractors, extrapolating to volume, EWC, finance and Co2 statistics. Data extracted acts as a guide to improve methodology, educate, strengthen oversight and direct improvements to the bottom line, as well as environmental outcomes. One Concern provides damage estimates using artificial intelligence on natural phenomena sciences. Autogrid organizes energy data and employs big data analytics to generate real-time predictions to create actionable data. The Dreamer: For lifestyle and creative product development and design Consumers in our modern world continually make multiple decisions with regard to product choice due to many competing products in the market.Often those choices boil down to whether it provides better value than others either in terms of product quality, price or by aligning with their personal beliefs and values.Lifestyle products and brands operate off ideologies, hoping to attract a relatively high number of people and ultimately becoming a recognized social phenomenon. While ecommerce has leveraged data science to master the price dimension, here are some examples of startups trying to deconstruct the other two dimensions: product development and branding. I wonder if you have ever imagined your beer to be brewed by AI? Well, now you can with IntelligentX. The Intelligent X team claim to have invented the world's first beer brewed by Artificial intelligence. They also plan to craft a premium beer using complex machine learning algorithms which can improve itself from the feedback given by its customers. Customers are given to try one of their four bottled conditioned beers, after the trial they are asked by their AI what they think of the beer, via an online feedback messenger. The data then collected is used by an algorithm to brew the next batch. Because their AI is constantly reacting to user feedback, they can brew beer that matches what customers want, more quickly than anyone else can. What this actually means that the company gets more data and customers get a customized fresh beer! In the lifestyle domain, we have Stitch Fix which has brought a personal touch to the online shopping journey. They are no regular other apparel e-commerce company. They have created a perfect formula for blending human expertise with the right amount of Data Science to serve their customers. According to Katrina Lake, Founder, and CEO, "You can look at every product on the planet, but trying to figure out which one is best for you is really the challenge” and that’s where Stitch Fix has come into the picture. The company is disrupting traditional retail by bridging the gap of personalized shopping, that the former could not achieve. To know how StitchFix uses Full Stack Data Science read our detailed article. The Writer: From content creation to curation to promotion In the publishing industry, we have seen a digital revolution coming in too. Echobox are one of the pioneers in building AI for the publishing industry. Antoine Amann, founder of Echobox, wrote in a blog post that they have "developed an AI platform that takes large quantity of variables into account and analyses them in real time to determine optimum post performance". Echobox pride itself to currently work with Facebook and Twitter for optimizing social media content, perform advanced analytics with A/B testing and also curate content for desired CTRs. With global client base like The Le Monde, The Telegraph, The Guardian etc. they have conveniently ripped social media editors. New York-based startup Agolo uses AI to create real-time summaries of information. It initially use to curate Twitter feeds in order to focus on conversations, tweets and hashtags that were most relevant to its user's preferences. Using natural language processing, Agolo scans content, identifies relationships among the information sources, and picks out the most relevant information, all leading to a comprehensive summary of the original piece of information. Other websites like Grammarly, offers AI-powered solutions to help people write, edit and formulate mistake-free content. Textio came up with augmented writing which means every time you wrote something and you would come to know ahead of time exactly who is going to respond. It basically means writing which is supported by outcomes in real time. Automated Insights, Creator of Wordsmith, the natural language generation platform enables you to produce human-sounding narratives from data. The Matchmaker: Connecting people, skills and other entities AI will make networking at B2B events more fun and highly productive for business professionals. Grip, a London based startup, formerly known as Network, rebranded itself in the month of April, 2016. Grip is using AI as a platform to make networking at events more constructive and fruitful. It acts as a B2B matchmaking engine that accumulates data from social accounts (LinkedIn, Twitter) and smartly matches the event registration data. Synonymous to Tinder for networking, Grip uses advanced algorithms to recommend the right people and presents them with an easy to use swiping interface feature. It also delivers a detailed report to the event organizer on the success of the event for every user or a social Segment. We are well aware of the data scientist being the sexiest job of the 21st century. JamieAi harnessing this fact connects technical talent with data-oriented jobs organizations of all types and sizes. The start-up firm has combined AI insights and human oversight to reduce hiring costs and eliminate bias. Also, third party recruitment agencies are removed from the process to boost transparency and efficiency in the path to employment. Another example is Woo.io, a marketplace for matching tech professionals and companies. The Manager: Virtual assistants of a different kind Artificial Intelligence can also predict how much your household appliance will cost on your electricity bill. Verv, a producer of clever home energy assistance provides intelligent information on your household appliances. It helps its customers with a significant reduction on their electricity bills and carbon footprints. The technology uses machine learning algorithms to provide real-time information by learning how much power and money each device is using. Not only this, it can also suggest eco-friendly alternatives, alert homeowners of appliances in use for a longer duration and warn them of any dangerous activity when they aren’t present at home. Other examples include firms like Maana which manages machines and improves operational efficiencies in order to make fast data driven decisions. Gong.io, acts as a sales representative’s assistant to understand sales conversations resulting into actionable insights. ObEN, creates complete virtual identities for consumers and celebrities in the emerging digital world. The Motivator: For personal and business productivity and growth A super cross-functional company Perkbox, came up with an employee engagement platform. Saurav Chopra founder of Perkbox believes teams perform their best when they are happy and engaged! Hence, Perkbox helps companies boost employee motivation and create a more inspirational atmosphere to work. The platform offers gym services, dental discounts and rewards for top achievers in the team to firms in UK. Perkbox offers a wide range of perks, discounts and tools to help organizations retain and motivate their employees. Technologies like AWS and Kubernetes allow to closely knit themselves with their development team. In order to build, scale and support Perkbox application for the growing number of user base. So, these are some use cases where we found startups using data science and machine learning differently. Do you know of others? Please share them in the comments below.

0
0
4552

article-image-getting-to-know-and-manipulate-tensors-in-tensorflow

Sunith Shetty

29 Dec 2017

5 min read

Getting to know and manipulate Tensors in TensorFlow

Sunith Shetty

29 Dec 2017

5 min read

[box type="note" align="" class="" width=""]This article is a book excerpt written by Rodolfo Bonnin, titled Building Machine Learning Projects with TensorFlow. In this book, you will learn to build powerful machine learning projects to tackle complex data for gaining valuable insights. [/box] Today, you will learn everything about Tensors, their properties and how they are used to represent data. What are tensors? TensorFlow bases its data management on tensors. Tensors are concepts from the field of mathematics, and are developed as a generalization of the linear algebra terms of vectors and matrices. Talking specifically about TensorFlow, a tensor is just a typed, multidimensional array, with additional operations, modeled in the tensor object. Tensor properties - ranks, shapes, and types TensorFlow uses tensor data structure to represent all data. Any tensor has a static type and dynamic dimensions, so you can change a tensor's internal organization in real-time. Another property of tensors, is that only objects of the tensor type can be passed between nodes in the computation graph. Let's now see what the properties of tensors are (from now on, every time we use the word tensor, we'll be referring to TensorFlow's tensor objects). Tensor rank Tensor ranks represent the dimensional aspect of a tensor, but is not the same as a matrix rank. It represents the quantity of dimensions in which the tensor lives, and is not a precise measure of the extension of the tensor in rows/columns or spatial equivalents. A rank one tensor is the equivalent of a vector, and a rank one tensor is a matrix. For a rank two tensor you can access any element with the syntax t[i, j]. For a rank three tensor you would need to address an element with t[i, j, k], and so on. In the following example, we will create a tensor, and access one of its components: import tensorflow as tf sess = tf.Session() tens1 = tf.constant([[[1,2],[2,3]],[[3,4],[5,6]]]) print sess.run(tens1)[1,1,0] Output: 5 This is a tensor of rank three, because in each element of the containing matrix, there is a vector element: Rank Math entity Code definition example 0 Scalar scalar = 1000 1 Vector Vector vector = [2, 8, 3] 2 Matrix matrix = [[4, 2, 1], [5, 3, 2], [5, 5, 6]] 3 3-tensor tensor = [[[4], [3], [2]], [[6], [100], [4]], [[5], [1], [4]]] n n-tensor … Tensor shape The TensorFlow documentation uses three notational conventions to describe tensor dimensionality: rank, shape, and dimension number. The following table shows how these relate to one another: Rank Shape Dimension number Example 0 [] 0 4 1 [D0] 1 [2] 2 [D0, D1] 2 [6, 2] 3 [D0, D1, D2] 3 [7, 3, 2] n [D0, D1, … Dn-1] n-D A tensor with shape [D0, D1, … Dn-1] In the following example, we create a sample rank three tensor, and print the shape of it: Tensor data types In addition to dimensionality, tensors have a fixed data type. You can assign any one of the following data types to a tensor: Data type Python type Description DT_FLOAT tf.float32 32 bits floating point. DT_DOUBLE tf.float64 64 bits floating point. DT_INT8 tf.int8 8 bits signed integer. DT_INT16 tf.int16 16 bits signed integer. DT_INT32 tf.int32 32 bits signed integer. DT_INT64 tf.int64 64 bits signed integer. DT_UINT8 tf.uint8 8 bits unsigned integer. DT_STRING tf.string Variable length byte arrays. Each element of a tensor is a byte array. DT_BOOL tf.bool Boolean. Creating new tensors We can either create our own tensors, or derivate them from the well-known numpy library. In the following example, we create some numpy arrays, and do some basic math with them: import tensorflow as tf import numpy as np x = tf.constant(np.random.rand(32).astype(np.float32)) y= tf.constant ([1,2,3]) x Y Output: <tf.Tensor 'Const_2:0' shape=(3,) dtype=int32> From numpy to tensors and vice versa TensorFlow is interoperable with numpy, and normally the eval() function calls will return a numpy object, ready to be worked with the standard numerical tools. We must note that the tensor object is a symbolic handle for the result of an operation, so it doesn't hold the resulting values of the structures it contains. For this reason, we must run the eval() method to get the actual values, which is the equivalent to Session.run(tensor_to_eval). In this example, we build two numpy arrays, and convert them to tensors: import tensorflow as tf #we import tensorflow import numpy as np #we import numpy sess = tf.Session() #start a new Session Object x_data = np.array([[1.,2.,3.],[3.,2.,6.]]) # 2x3 matrix x = tf.convert_to_tensor(x_data, dtype=tf.float32) print (x) Output: Tensor("Const_3:0", shape=(2, 3), dtype=float32) Useful method: tf.convert_to_tensor: This function converts Python objects of various types to tensor objects. It accepts tensorobjects, numpy arrays, Python lists, and Python scalars. Getting things done - interacting with TensorFlow As with the majority of Python's modules, TensorFlow allows the use of Python's interactive console: In the previous figure, we call the Python interpreter (by simply calling Python) and create a tensor of constant type. Then we invoke it again, and the Python interpreter shows the shape and type of the tensor. We can also use the IPython interpreter, which will allow us to employ a format more compatible with notebook-style tools, such as Jupyter: When talking about running TensorFlow Sessions in an interactive manner, it's better to employ the InteractiveSession object. Unlike the normal tf.Session class, the tf.InteractiveSession class installs itself as the default session on construction. So when you try to eval a tensor, or run an operation, it will not be necessary to pass a Session object to indicate which session it refers to. To summarize, we have learned about tensors, the key data structure in TensorFlow and simple operations we can apply to the data. To know more about different machine learning techniques and algorithms that can be used to build efficient and powerful projects, you can refer to the book Building Machine Learning Projects with TensorFlow.

0
0
10553

article-image-18-striking-ai-trends-to-watch-in-2018-part-2

Sugandha Lahoti

28 Dec 2017

12 min read

18 striking AI Trends to watch in 2018 - Part 2

Sugandha Lahoti

28 Dec 2017

12 min read

0
0
2831

article-image-decrypting-bitcoin-2017-journey

Ashwin Nair

28 Dec 2017

7 min read

There and back again: Decrypting Bitcoin`s 2017 journey from $1000 to $20000

Ashwin Nair

28 Dec 2017

7 min read

Lately, Bitcoin has emerged as the most popular topic of discussion amongst colleagues, friends and family. The conversations more or less have a similar theme - filled with inane theories around what Bitcoin is and and the growing fear of missing out on Bitcoin mania. Well to be fair, who would want to miss out on an opportunity to grow their money by more than 1000%. That`s the return posted by Bitcoin since the start of the year. Bitcoin at the time of writing this article is at $15,000 with a marketcap over $250 billion. To put the hype in context, Bitcoin is now valued higher than 90% of companies in S&P 500 list. Supposedly, invented by an anonymous group or an individual under the alias Satoshi Nakamoto in 2009, Bitcoin has been seen as a digital currency that the internet might not need but one that it deserves. Satoshi`s vision was to create a robust electronic payment system that functions smoothly without a need for a trusted third party. This was achievable with the help of Blockchain, a digital ledger which records transactions in an indelible fashion and is distributed across multiple nodes. This ensured that no transaction is altered or deleted, thus completely eliminating the need for a trusted third party. For all the excitement around Bitcoin and the increase in interest level towards owning one, the thought of dissecting the roller coaster journey from sub $1000 level to $15,000 this year seemed pretty exciting. Started year with a bang / Global uncertainties leading to Bitcoin rally - Oct 2016 to Jan 2017 Global uncertainties played a big role in driving Bitcoin`s price and boy 2016 was full of it! From Brexit to Trump winning the US elections and major economic commotion in terms of Devaluation of China`s Yuan and India`s Demonetization drive all leading to investors seeking shelter in Bitcoin. The first blip - Jan 2017 to March 2017 China as a country has had a major impact in determining Bitcoin`s fate. Early 2017, China contributed to over 80% of Bitcoin transactions indicating the amount of power Chinese traders and investors had in controlling Bitcoin`s price. However, People's Bank of China`s closer inspection of the exchanges revealed several irregularities in the transactions and business practices. Which eventually led to officials halting withdrawals using Bitcoin transactions and taking stringent steps towards cryptocurrency exchanges. Source - Bloomberg On the path tobecoming mainstream and gaining support from Ethereum - Mar 2017 to June 2017 During this phase, we saw the rise of another cryptocurrency, Ethereum, a close digital cousin of Bitcoin. Similar to Bitcoin, Ethereum is also built on top of Blockchain technology allowing it to build a decentralized public network. However, Ethereum’s capabilities extend beyond being a cryptocurrency and help developers to build and deploy any kind of decentralized applications. Ethereum`s valuation in this period rose from $20 to $375 which was quite beneficial for Bitcoin. As every reference of Ethereum had Bitcoin mentioned as well, whether it was to explain what Ethereum is or how it can take the number 1 cryptocurrency spot in the future. This, coupled with the rise in Blockchain`s popularity, increased Bitcoin`s visibility within USA. The media started observing politicians, celebrities and other prominent personalities speaking on Bitcoin as well. Bitcoin also received a major boost from Chinese exchanges, wherein withdrawals of the cryptocurrency resumed after nearly a four-month freeze. All these factors led to Bitcoin crossing an all time high of $2500, up by more than 150% since the start of the year. The Curious case of Fork - June 2017 to September 2017 The month of July saw the Cryptocurrencies market cap witnessing a sharp decline, with questions being raised on the price volatility and whether Bitcoin`s rally for the year was over? We can of-course now confidently debunk that question. Though, there hasn’t been any proven rationale behind the decline, one of the reasons seems to be profit booking following months of steep rally in valuations witnessed by Bitcoin. Another major factor, which might have driven the price collapse may be an unresolved dispute among leading members of the bitcoin community over how to overcome the growing problem of Bitcoin being slow and expensive. With growing usage of Bitcoin, its performance in terms of transaction time has slowed down. Bitcoin`s network due to its limitation in terms of block size could only execute around 7 transactions per second compared to the VISA Network which could do over 1600 transactions. This also led to transaction fees being increased substantially to $5.00 per transaction and settlement time often taking hours and even days. This eventually put Bitcoin`s flaw in the spotlight when compared with services offered by competitors such as Paypal in terms of cost and transaction time. Source - Coinmarketcap The poor performance of Bitcoin led to investors opting for other cryptocurrencies. The above graph, shows how Bitcoin`s dominance fell substantially compared to other cryptocurrencies such as Ethereum and Litecoin during this time. With the core-community still unable to come to a consensus on how to improve the performance and update the software, the prospect of a “fork” was raised. Fork highlights change in underlying software protocol of Bitcoin to make previous rules valid or invalid. There are two types of blockchain forks - soft fork and hard fork. Around August, the community announced to go ahead with a hard fork in the form of Bitcoin Cash. This news was surprisingly taken in a positive manner leading to Bitcoin rebounding strongly and reaching new heights of around $4000 in price. Once bitten, Twice Shy (China) September 2017 to October 2017 Source - Zerohedge The month of September saw another setback for Bitcoin due to measures taken from People`s Bank of China. This time, PBoC banned initial coin offerings (ICO), thus prohibiting the practice of building and selling cryptocurrency to any investors or to finance any startup projects within China. Based on a report by National Committee of Experts on Internet Financial Security Technology, Chinese Investors were involved in 2.6 Billion Yuan worth of ICOs in January-June, 2017 reflecting China`s exposure towards Bitcoin. My Precious (October 2017 to December 2017) Source - Manmanroe During the last quarter Bitcoin`s surge has shocked even hardcore Bitcoin fanatics. Everything seems to be going right for Bitcoin at the moment. While at the start of the year China was the major contributor towards the hike in Bitcoin`s valuation, now the momentum seemed to have shifted to a much sensible and responsible market in terms of Japan who have embraced Bitcoin in quite a positive manner. As you can see from the below graph, Japan now holds more than 50% of transaction compared to USA which is much lesser in size. Besides Japan, we are also seeing positive developments in country such as Russia and India, who are looking to legalize cryptocurrency usage. Moreover, the level of interests towards Bitcoin from institutional investors is at its peak. All these factors have resulted in Bitcoin to cross the 5 digit mark for the first time in Nov, 2017 and touching an all time high figure of close to $20,000 in December, 2017. Post the record high, Bitcoin has been witnessing a crash and rebound phenomenon in the last two weeks of December. From a record high of $20,000 to $11,000 and now at $15,000, Bitcoin is still a volatile digital currency if one is looking for a quick price appreciation. Despite the valuation dilemma and the price volatility, one thing is sure: the world is warming up to the idea of cryptocurrencies and even owning one. There are already several predictions being made on how Bitcoin`s astronomical growth is going to continue in 2018. However, Bitcoin needs to overcome several challenges before it can replace the traditional currency and and be widely accepting in banking practices. Besides, the rise of other cryptocurriencies such as Ethereum, LiteCoin or bitcoin cash who are looking to dethrone Bitcoin from the #1 spot, there are broader issues at hand which the Bitcoin community should prioritize such as how to curb the effect of Bitcoin`s mining activities on the environment and on having smoother reforms as well as building regulatory roadmap from countries before people actually start using instead of just looking it as a tool for making a quick buck.

0
20
7232

article-image-implement-neural-network-single-layer-perceptron

Pravin Dhandre

28 Dec 2017

10 min read

How to Implement a Neural Network with Single-Layer Perceptron

Pravin Dhandre

28 Dec 2017

10 min read

0
0
8638

Sunith Shetty

28 Dec 2017

10 min read

Getting started with Spark 2.0

Sunith Shetty

28 Dec 2017

10 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, author explains how to perform big data analytics using Spark streaming, machine learning techniques and more.[/box] Today, we will learn about the basics of Spark architecture and its various components. We will also explore how to install Spark for running it in the local mode. Apache Spark architecture overview Apache Spark is an open source distributed data processing engine for clusters, which provides a unified programming model engine across different types data processing workloads and platforms. At the core of the project is a set of APIs for Streaming, SQL, Machine Learning (ML), and Graph. Spark community supports the Spark project by providing connectors to various open source and proprietary data storage engines. Spark also has the ability to run on a variety of cluster managers like YARN and Mesos, in addition to the Standalone cluster manager which comes bundled with Spark for standalone installation. This is thus a marked difference from Hadoop ecosystem where Hadoop provides a complete platform in terms of storage formats, compute engine, cluster manager, and so on. Spark has been designed with the single goal of being an optimized compute engine. This therefore allows you to run Spark on a variety of cluster managers including being run standalone, or being plugged into YARN and Mesos. Similarly, Spark does not have its own storage, but it can connect to a wide number of storage engines. Currently Spark APIs are available in some of the most common languages including Scala, Java, Python, and R. Let's start by going through various API's available in Spark. Spark-core At the heart of the Spark architecture is the core engine of Spark, commonly referred to as spark-core, which forms the foundation of this powerful architecture. Spark-core provides services such as managing the memory pool, scheduling of tasks on the cluster (Spark works as a Massively Parallel Processing (MPP) system when deployed in cluster mode), recovering failed jobs, and providing support to work with a wide variety of storage systems such as HDFS, S3, and so on. Note: Spark-Core provides a full scheduling component for Standalone Scheduling: Code is available at: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/scheduler Spark-Core abstracts the users of the APIs from lower-level technicalities of working on a cluster. Spark-Core also provides the RDD APIs which are the basis of other higher-level APIs, and are the core programming elements on Spark. Note: MPP systems generally use a large number of processors (on separate hardware or virtualized) to perform a set of operations in parallel. The objective of the MPP systems is to divide work into smaller task pieces and running them in parallel to increase in throughput time. Spark SQL Spark SQL is one of the most popular modules of Spark designed for structured and semistructured data processing. Spark SQL allows users to query structured data inside Spark programs using SQL or the DataFrame and the Dataset API, which is usable in Java, Scala, Python, and R. Because of the fact that the DataFrame API provides a uniform way to access a variety of data sources, including Hive datasets, Avro, Parquet, ORC, JSON, and JDBC, users should be able to connect to any data source the same way, and join across these multiple sources together. The usage of Hive meta store by Spark SQL gives the user full compatibility with existing Hive data, queries, and UDFs. Users can seamlessly run their current Hive workload without modification on Spark. Spark SQL can also be accessed through spark-sql shell, and existing business tools can connect via standard JDBC and ODBC interfaces. Spark streaming More than 50% of users consider Spark Streaming to be the most important component of Apache Spark. Spark Streaming is a module of Spark that enables processing of data arriving in passive or live streams of data. Passive streams can be from static files that you choose to stream to your Spark cluster. This can include all sorts of data ranging from web server logs, social-media activity (following a particular Twitter hashtag), sensor data from your car/phone/home, and so on. Spark-streaming provides a bunch of APIs that help you to create streaming applications in a way similar to how you would create a batch job, with minor tweaks. As of Spark 2.0, the philosophy behind Spark Streaming is not to reason about streaming and building data application as in the case of a traditional data source. This means the data from sources is continuously appended to the existing tables, and all the operations are run on the new window. A single API lets the users create batch or streaming applications, with the only difference being that a table in batch applications is finite, while the table for a streaming job is considered to be infinite. MLlib MLlib is Machine Learning Library for Spark, if you remember from the preface, iterative algorithms are one of the key drivers behind the creation of Spark, and most machine learning algorithms perform iterative processing in one way or another. Note: Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Spark MLlib allows developers to use Spark API and build machine learning algorithms by tapping into a number of data sources including HDFS, HBase, Cassandra, and so on. Spark is super fast with iterative computing and it performs 100 times better than MapReduce. Spark MLlib contains a number of algorithms and utilities including, but not limited to, logistic regression, Support Vector Machine (SVM), classification and regression trees, random forest and gradient-boosted trees, recommendation via ALS, clustering via K-Means, Principal Component Analysis (PCA), and many others. GraphX GraphX is an API designed to manipulate graphs. The graphs can range from a graph of web pages linked to each other via hyperlinks to a social network graph on Twitter connected by followers or retweets, or a Facebook friends list. Graph theory is a study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes/points), which are connected by edges (arcs/lines). – Wikipedia.org Spark provides a built-in library for graph manipulation, which therefore allows the developers to seamlessly work with both graphs and collections by combining ETL, discovery analysis, and iterative graph manipulation in a single workflow. The ability to combine transformations, machine learning, and graph computation in a single system at high speed makes Spark one of the most flexible and powerful frameworks out there. The ability of Spark to retain the speed of computation with the standard features of fault-tolerance makes it especially handy for big data problems. Spark GraphX has a number of built-in graph algorithms including PageRank, Connected components, Label propagation, SVD++, and Triangle counter. Spark deployment Apache Spark runs on both Windows and Unix-like systems (for example, Linux and Mac OS). If you are starting with Spark you can run it locally on a single machine. Spark requires Java 7+, Python 2.6+, and R 3.1+. If you would like to use Scala API (the language in which Spark was written), you need at least Scala version 2.10.x. Spark can also run in a clustered mode, using which Spark can run both by itself, and on several existing cluster managers. You can deploy Spark on any of the following cluster managers, and the list is growing everyday due to active community support: Hadoop YARN. Apache Mesos. Standalone scheduler. Yet Another Resource Negotiator (YARN) is one of the key features including a redesigned resource manager thus splitting out the scheduling and resource management capabilities from original MapReduce in Hadoop. Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It provides efficient resource isolation and sharing across distributed applications, or frameworks. Installing Apache Spark As mentioned in the earlier pages, while Spark can be deployed on a cluster, you can also run it in local mode on a single machine. In this chapter, we are going to download and install Apache Spark on a Linux machine and run it in local mode. Before we do anything we need to download Apache Spark from Apache's web page for the Spark project: Use your recommended browser to navigate to http://spark.apache.org/downloads.html. Choose a Spark release. You'll find all previous Spark releases listed here. We'll go with release 2.0.0 (at the time of writing, only the preview edition was available). You can download Spark source code, which can be built for several versions of Hadoop, or download it for a specific Hadoop version. In this case, we are going to download one that has been pre-built for Hadoop 2.7 or later. You can also choose to download directly or from among a number of different Mirrors. For the purpose of our exercise we'll use direct download and download it to our preferred location. Note: If you are using Windows, please remember to use a pathname without any spaces. 5. The file that you have downloaded is a compressed TAR archive. You need to extract the archive. Note: The TAR utility is generally used to unpack TAR files. If you don't have TAR, you might want to download that from the repository or use 7-ZIP, which is also one of my favorite utilities. 6. Once unpacked, you will see a number of directories/files. Here's what you would typically see when you list the contents of the unpacked directory: The bin folder contains a number of executable shell scripts such as pypark, sparkR, spark-shell, spark-sql, and spark-submit. All of these executables are used to interact with Spark, and we will be using most if not all of these. 7. If you see my particular download of spark you will find a folder called yarn. The example below is a Spark that was built for Hadoop version 2.7 which comes with YARN as a cluster manager. We'll start by running Spark shell, which is a very simple way to get started with Spark and learn the API. Spark shell is a Scala Read-Evaluate-Print-Loop (REPL), and one of the few REPLs available with Spark which also include Python and R. You should change to the Spark download directory and run the Spark shell as follows: /bin/spark-shell. We now have Spark running in standalone mode. We'll discuss the details of the deployment architecture a bit later in this chapter, but now let's kick start some basic Spark programming to appreciate the power and simplicity of the Spark framework. We have gone through the Spark architecture overview and the steps to install Spark on your own personal machine. To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer our book Learning Apache Spark 2.

0
0
2534

article-image-18-striking-ai-trends-2018-part-1

Sugandha Lahoti

27 Dec 2017

14 min read

18 striking AI Trends to watch in 2018 - Part 1

Sugandha Lahoti

27 Dec 2017

14 min read

Artificial Intelligence is the talk of the town. It has evolved past merely being a buzzword in 2016, to be used in a more practical manner in 2017. As 2018 rolls out, we will gradually notice AI transitioning into a necessity. We have prepared a detailed report, on what we can expect from AI in the upcoming year. So sit back, relax, and enjoy the ride through the future. (Don’t forget to wear your VR headgear! ) Here are 18 things that will happen in 2018 that are either AI driven or driving AI: Artificial General Intelligence may gain major traction in research. We will turn to AI enabled solution to solve mission-critical problems. Machine Learning adoption in business will see rapid growth. Safety, ethics, and transparency will become an integral part of AI application design conversations. Mainstream adoption of AI on mobile devices Major research on data efficient learning methods AI personal assistants will continue to get smarter Race to conquer the AI optimized hardware market will heat up further We will see closer AI integration into our everyday lives. The cryptocurrency hype will normalize and pave way for AI-powered Blockchain applications. Advancements in AI and Quantum Computing will share a symbiotic relationship Deep learning will continue to play a significant role in AI development progress. AI will be on both sides of the cybersecurity challenge. Augmented reality content will be brought to smartphones. Reinforcement learning will be applied to a large number of real-world situations. Robotics development will be powered by Deep Reinforcement learning and Meta-learning Rise in immersive media experiences enabled by AI. A large number of organizations will use Digital Twin. 1. General AI: AGI may start gaining traction in research. AlphaZero is only the beginning. 2017 saw Google’s AlphaGo Zero (and later AlphaZero) beat human players at Go, Chess, and other games. In addition to this, computers are now able to recognize images, understand speech, drive cars, and diagnose diseases better with time. AGI is an advancement of AI which deals with bringing machine intelligence as close to humans as possible. So, machines can possibly do any intellectual task that a human can! The success of AlphaGo covered one of the crucial aspects of AGI systems—the ability to learn continually, avoiding catastrophic forgetting. However, there is a lot more to achieving human-level general intelligence than the ability to learn continually. For instance, AI systems of today can draw on skills it learned on one game to play another. But they lack the ability to generalize the learned skill. Unlike humans, these systems do not seek solutions from previous experiences. An AI system cannot ponder and reflect on a new task, analyze its capabilities, and work out how best to apply them. In 2018, we expect to see advanced research in the areas of deep reinforcement learning, meta-learning, transfer learning, evolutionary algorithms and other areas that aid in developing AGI systems. Detailed aspects of these ideas are highlighted in later points. We can indeed say, Artificial General Intelligence is inching closer than ever before and 2018 is expected to cover major ground in that direction. 2. Enterprise AI: Machine Learning adoption in enterprises will see rapid growth. 2017 saw a rise in cloud offerings by major tech players, such as the Amazon Sagemaker, Microsoft Azure Cloud, Google Cloud Platform, allowing business professionals and innovators to transfer labor-intensive research and analysis to the cloud. Cloud is a $130 billion industry as of now, and it is projected to grow. Statista carried out a survey to present the level of AI adoption among businesses worldwide, as of 2017. Almost 80% of the participants had incorporated some or other form of AI into their organizations or planned to do so in the future. Source: https://www.statista.com/statistics/747790/worldwide-level-of-ai-adoption-business/ According to a report from Deloitte, medium and large enterprises are set to double their usage of machine learning by the end of 2018. Apart from these, 2018 will see better data visualization techniques, powered by machine learning, which is a critical aspect of every business. Artificial intelligence is going to automate the cycle of report generation and KPI analysis, and also, bring in deeper analysis of consumer behavior. Also with abundant Big data sources coming into the picture, BI tools powered by AI will emerge, which can harness the raw computing power of voluminous big data for data models to become streamlined and efficient. 3. Transformative AI: We will turn to AI enabled solutions to solve mission-critical problems. 2018 will see the involvement of AI in more and more mission-critical problems that can have world-changing consequences: read enabling genetic engineering, solving the energy crisis, space exploration, slowing climate change, smart cities, reducing starvation through precision farming, elder care etc. Recently NASA revealed the discovery of a new exoplanet, using data crunched from Machine learning and AI. With this recent reveal, more AI techniques would be used for space exploration and to find other exoplanets. We will also see the real-world deployment of AI applications. So it will not be only about academic research, but also about industry readiness. 2018 could very well be the year when AI becomes real for medicine. According to Mark Michalski, executive director, Massachusetts General Hospital and Brigham and Women’s Center for Clinical Data Science, “By the end of next year, a large number of leading healthcare systems are predicted to have adopted some form of AI within their diagnostic groups.” We would also see the rise of robot assistants, such as virtual nurses, diagnostic apps in smartphones, and real clinical robots that can monitor patients, take care of the elderly, alert doctors, and send notifications in case of emergency. More research will be done on how AI enabled technology can help in difficult to diagnose areas in health care like mental health, the onset of hereditary diseases among others. Facebook's attempt at detection of potential suicidal messages using AI is a sign of things to come in this direction. As we explore AI enabled solutions to solve problems that have a serious impact on individuals and societies at large, considering the ethical and moral implications of such solutions will become central to developing them, let alone hard to ignore. 4. Safe AI: Safety, Ethics, and Transparency in AI applications will become integral to conversations on AI adoption and app design. The rise of machine learning capabilities has also given rise to forms of bias, stereotyping and unfair determination in such systems. 2017 saw some high profile news stories about gender bias, object recognition datasets like MS COCO, to racial disparities in education AI systems. At NIPS 2017, Kate Crawford talked about bias in machine learning systems which resonated greatly with the community and became pivotal to starting conversations and thinking by other influencers on how to address the problems raised. DeepMind also launched a new unit, the DeepMind Ethics & Society, to help technologists put ethics into practice, and to help society anticipate and direct the impact of AI for the benefit of all. Independent bodies like IEEE also pushed for standards in it’s ethically aligned design paper. As news about the bro culture in Silicon Valley and the lack of diversity in the tech sector continued to stay in the news all of 2017, it hit closer home as the year came to an end, when Kristian Lum, Lead Statistician at HRDAG, described her experiences with harassment as a graduate student at prominent stat conferences. This has had a butterfly effect of sorts with many more women coming forward to raise the issue in the ML/AI community. They talked about the formation of a stronger code of conduct by boards of key conferences such as NIPS among others. Eric Horvitz, a Microsoft research director, called Lum’s post a "powerful and important report." Jeff Dean, head of Google’s Brain AI unit applauded Lum for having the courage to speak about this behavior. Other key influencers from the ML and statisticians community also spoke in support of Lum and added their views on how to tackle the problem. While the road to recovery is long and machines with moral intelligence may be decades away, 2018 is expected to start that journey in the right direction by including safety, ethics, and transparency in AI/ML systems. Instead of just thinking about ML contributing to decision making in say hiring or criminal justice, data scientists would begin to think of the potential role of ML in the harmful representation of human identity. These policies will not only be included in the development of larger AI ecosystems but also in national and international debates in politics, businesses, and education. 5. Ubiquitous AI: AI will start redefining life as we know it, and we may not even know it happened. Artificial Intelligence will gradually integrate into our everyday lives. We will see it in our everyday decisions like what kind of food we eat, the entertainment we consume, the clothes we wear, etc. Artificially intelligent systems will get better at complex tasks that humans still take for granted, like walking around a room and over objects. We’re going to see more and more products that contain some form of AI enter our lives. AI enabled stuff will become more common and available. We will also start seeing it in the background for life-altering decisions we make such as what to learn, where to work, whom to love, who our friends are, whom should we vote for, where should we invest, and where should we live among other things. 6. Embedded AI: Mobile AI means a radically different way of interacting with the world. There is no denying that AI is the power source behind the next generation of smartphones. A large number of organizations are enabling the use of AI in smartphones, whether in the form of deep learning chips, or inbuilt software with AI capabilities. The mobile AI will be a combination of on-device AI and cloud AI. Intelligent phones will have end-to-end capabilities that support coordinated development of chips, devices, and the cloud. The release of iPhone X’s FaceID—which uses a neural network chip to construct a mathematical model of the user’s face— and self-driving cars are only the beginning. As 2018 rolls out we will see vast applications on smartphones and other mobile devices which will run deep neural networks to enable AI. AI going mobile is not just limited to the embedding of neural chips in smartphones. The next generation of mobile networks 5G will soon greet the world. 2018 is going to be a year of closer collaborations and increasing partnerships between telecom service providers, handset makers, chip markers and AI tech enablers/researchers. The Baidu-Huawei partnership—to build an open AI mobile ecosystem, consisting of devices, technology, internet services, and content—is an example of many steps in this direction. We will also see edge computing rapidly becoming a key part of the Industrial Internet of Things (IIoT) to accelerate digital transformation. In combination with cloud computing, other forms of network architectures such as fog and mist would also gain major traction. All of the above will lead to a large-scale implementation of cognitive IoT, which combines traditional IoT implementations with cognitive computing. It will make sensors capable of diagnosing and adapting to their environment without the need for human intervention. Also bringing in the ability to combine multiple data streams that can identify patterns. This means we will be a lot closer to seeing smart cities in action. 7. Data-sparse AI: Research into data efficient learning methods will intensify 2017 saw highly scalable solutions for problems in object detection and recognition, machine translation, text-to-speech, recommender systems, and information retrieval. The second conference on Machine Translation happened in September 2017. The 11th ACM Conference on Recommender Systems in August 2017 witnessed a series of papers presentations, featured keynotes, invited talks, tutorials, and workshops in the field of recommendation system. Google launched the Tacotron 2 for generating human-like speech from text. However, most of these researches and systems attain state-of-the-art performance only when trained with large amounts of data. With GDPR and other data regulatory frameworks coming into play, 2018 is expected to witness machine learning systems which can learn efficiently maintaining performance, but in less time and with less data. A data-efficient learning system allows learning in complex domains without requiring large quantities of data. For this, there would be developments in the field of semi-supervised learning techniques, where we can use generative models to better guide the training of discriminative models. More research would happen in the area of transfer learning (reuse generalize knowledge across domains), active learning, one-shot learning, Bayesian optimization as well as other non-parametric methods. In addition, researchers and organizations will exploit bootstrapping and data augmentation techniques for efficient reuse of available data. Other key trends propelling data efficient learning research are growing in-device/edge computing, advancements in robotics, AGI research, and energy optimization of data centers, among others. 8. Conversational AI: AI personal assistants will continue to get smarter AI-powered virtual assistants are expected to skyrocket in 2018. 2017 was filled to the brim with new releases. Amazon brought out the Echo Look and the Echo Show. Google made its personal assistant more personal by allowing linking of six accounts to the Google Assistant built into the Home via the Home app. Bank of America unveiled Erica, it’s AI-enabled digital assistant. As 2018 rolls out, AI personal assistants will find its way into an increasing number of homes and consumer gadgets. These include increased availability of AI assistants in our smartphones and smart speakers with built-in support for platforms such as Amazon’s Alexa and Google Assistant. With the beginning of the new year, we can see personal assistants integrating into our daily routines. Developers will build voice support into a host of appliances and gadgets by using various voice assistant platforms. More importantly, developers in 2018 will try their hands on conversational technology which will include emotional sensitivity (affective computing) as well as machine translational technology (the ability to communicate seamlessly between languages). Personal assistants would be able to recognize speech patterns, for instance, of those indicative of wanting help. AI bots may also be utilized for psychiatric counseling or providing support for isolated people. And it’s all set to begin with the AI assistant summit in San Francisco scheduled on 25 - 26 January 2018. It will witness talks by world's leading innovators in advances in AI Assistants and artificial intelligence. 9. AI Hardware: Race to conquer the AI optimized hardware market will heat up further Top tech companies (read Google, IBM, Intel, Nvidia) are investing heavily in the development of AI/ML optimized hardware. Research and Markets have predicted the global AI chip market will have a growth rate of about 54% between 2017 and 2021. 2018 will see further hardware designs intended to greatly accelerate the next generation of applications and run AI computational jobs. With the beginning of 2018 chip makers will battle it out to determine who creates the hardware that artificial intelligence lives on. Not only that, there would be a rise in the development of new AI products, both for hardware and software platforms that run deep learning programs and algorithms. Also, chips which move away from the traditional one-size-fits-all approach to application-based AI hardware will grow in popularity. 2018 would see hardware which does not only store data, but also transform it into usable information. The trend for AI will head in the direction of task-optimized hardware. 2018 may also see hardware organizations move to software domains and vice-versa. Nvidia, most famous for their Volta GPUs have come up with NVIDIA DGX-1, a software for AI research, designed to streamline the deep learning workflow. More such transitions are expected at the highly anticipated CES 2018. [dropcap]P[/dropcap]hew, that was a lot of writing! But I hope you found it just as interesting to read as I found writing it. However, we are not done yet. And here is part 2 of our 18 AI trends in ‘18.

0
0
5949

article-image-mine-popular-trends-github-python-part-2

Amey Varangaonkar

27 Dec 2017

1 min read

How to Mine Popular Trends on GitHub using Python - Part 2

Amey Varangaonkar

27 Dec 2017

1 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. In this book, you will find widely used social media mining techniques for extracting useful insights to drive your business.[/box] In Part 1 of this series, we gathered the GitHub data for analysis. Here, we will analyze that data as per our requirements, to get interesting insights on the highest trending and popular tools and languages on GitHub. We have seen so far that the GitHub API provides interesting sets of information about the code repositories and metadata around the activity of its users around these repositories. In the following sections, we will analyze this data to find out which are the most popular repositories through the analysis of its descriptions and then drilling down to the watchers, forks, and issues submitted on the emerging technologies. Since, technology is evolving so rapidly, this approach could help us to stay on top of the latest trending technologies. In order to find out what are the trending technologies, we will perform the analysis in a few steps: Identifying top technologies First of all, we will use text analytics techniques to identify what are the most popular phrases related to technologies in repositories from 2017. Our analysis will be focused on the most frequent bigrams. We import a nltk.collocation module which implements n-gram search tools: import nltk from nltk.collocations import * Then, we convert the clean description column into a list of tokens: list_documents = df['clean'].apply(lambda x: x.split()).tolist() As we perform an analysis on documents, we will use the method from_documents instead of a default one from_words. The difference between these two methods lies in then input data format. The one used in our case takes as argument a list of tokens and searches for n-grams document-wise instead of corpus-wise. It protects against detecting bi-grams composed of the last word of one document and the first one of another one: bigram_measures = nltk.collocations.BigramAssocMeasures() bigram_finder = BigramCollocationFinder.from_documents(list_documents) We take into account only bi-grams which appear at least three times in our document set: bigram_finder.apply_freq_filter(3) We can use different association measures to find the best bi-grams, such as raw frequency, pmi, student t, or chi sq. We will mostly be interested in the raw frequency measure, which is the simplest and most convenient indicator in our case. We get top 20 bigrams according to raw_freq measure: bigrams = bigram_finder.nbest(bigram_measures.raw_freq,20) We can also obtain their scores by applying the score_ngrams method: scores = bigram_finder.score_ngrams(bigram_measures.raw_freq) All the other measures are implemented as methods of BigramCollocationFinder. To try them, you can replace raw_freq by, respectively, pmi, student_t, and chi_sq. However, to create a visualization we will need the actual number of occurrences instead of scores. We create a list by using the ngram_fd.items() method and we sort it in descending order. ngram = list(bigram_finder.ngram_fd.items()) ngram.sort(key=lambda item: item[-1], reverse=True) It returns a dictionary of tuples which contain an embedded tuple and its frequency. We transform it into a simple list of tuples where we join bigram tokens: frequency = [(" ".join(k), v) for k,v in ngram] For simplicity reasons we put the frequency list into a dataframe: df=pd.DataFrame(frequency) And then, we plot the top 20 technologies in a bar chart: import matplotlib.pyplot as plt plt.style.use('ggplot') df.set_index([0], inplace = True) df.sort_values(by = [1], ascending = False).head(20).plot(kind = 'barh') plt.title('Trending Technologies') plt.ylabel('Technology') plt.xlabel('Popularity') plt.legend().set_visible(False) plt.axvline(x=14, color='b', label='Average', linestyle='--', linewidth=3) for custom in [0, 10, 14]: plt.text(14.2, custom, "Neural Networks", fontsize = 12, va = 'center', bbox = dict(boxstyle='square', fc='white', ec='none')) plt.show() We've added an additional line which helps us to aggregate all technologies related to neural networks. It is done manually by selecting elements by indices, (0,10,14) in this case. This operation might be useful for interpretation. The preceding analysis provides us with an interesting set of the most popular technologies on GitHub. It includes topics for software engineering, programming languages, and artificial intelligence. An important thing to be noted is that technology around neural networks emerges more than once, notably, deep learning, TensorFlow, and other specific projects. This is not surprising, since neural networks, which are an important component in the field of artificial intelligence, have been spoken about and practiced heavily in the last few years. So, if you're an aspiring programmer interested in AI and machine learning, this is a field to dive into! Programming languages The next step in our analysis is the comparison of popularity between different programming languages. It will be based on samples of the top 1,000 most popular repositories by year. Firstly, we get the data for last three years: queries = ["created:>2017-01-01", "created:2015-01-01..2015-12-31", "created:2016-01-01..2016-12-31"] We reuse the search_repo_paging function to collect the data from the GitHub API and we concatenate the results to a new dataframe. df = pd.DataFrame() for query in queries: data = search_repo_paging(query) data = pd.io.json.json_normalize(data) df = pd.concat([df, data]) We convert the dataframe to a time series based on the create_at column df['created_at'] = df['created_at'].apply(pd.to_datetime) df = df.set_index(['created_at']) Then, we use aggregation method groupby which restructures the data by language and year, and we count the number of occurrences by language: dx = pd.DataFrame(df.groupby(['language', df.index.year])['language'].count()) We represent the results on a bar chart: fig, ax = plt.subplots() dx.unstack().plot(kind='bar', title = 'Programming Languages per Year', ax= ax) ax.legend(['2015', '2016', '2017'], title = 'Year') plt.show() The preceding graph shows a multitude of programming languages from assembly, C, C#, Java, web, and mobile languages, to modern ones like Python, Ruby, and Scala. Comparing over the three years, we see some interesting trends. We notice HTML, which is the bedrock of all web development, has remained very stable over the last three years. This is not something that will not be replaced in a hurry. Once very popular, Ruby now has a decrease in popularity. The popularity of Python, also our language of choice for this book, is going up. Finally, the cross-device programming language, Swift, initially created by Apple but now open source, is getting extremely popular over time. It could be interesting to see in the next few years, if these trends change or hold true for long. Programming languages used in top technologies Now we know what are the top programming languages and technologies quoted in repositories description. In this section we will try to combine this information and find out what are the main programming languages for each technology. We select four technologies from previous section and print corresponding programming languages. We look up the column containing cleaned repository description and create a set of the languages related to the technology. Using a set will assure that we have unique Values. technologies_list = ['software engineering', 'deep learning', 'open source', 'exercise practice'] for tech in technologies_list: print(tech) print(set(df[df['clean'].str.contains(tech)]['language'])) software engineering {'HTML', 'Java'} deep learning {'Jupyter Notebook', None, 'Python'} open source {None, 'PHP', 'Java', 'TypeScript', 'Go', 'JavaScript', 'Ruby', 'C++'} exercise practice {'CSS', 'JavaScript', 'HTML'} Following the text analysis of the descriptions of the top technologies and then extracting the programming languages for them we notice the following: You can do a lot more analysis with this GitHub data such as: Want to know how? You can check out our book Python Social Media Analytics to get a detailed walkthrough of these topics.

0
0
3106

article-image-storing-apache-storm-data-in-elasticsearch

Richa Tripathi

27 Dec 2017

6 min read

Storing Apache Storm data in Elasticsearch

Richa Tripathi

27 Dec 2017

6 min read

0
0
2906

article-image-mine-popular-trends-on-github-using-python-part-1

Amey Varangaonkar

26 Dec 2017

11 min read

Mine Popular Trends on GitHub using Python - Part 1

Amey Varangaonkar

26 Dec 2017

11 min read

[box type="note" align="" class="" width=""]This interesting article is an excerpt from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. The book contains useful techniques to gain valuable insights from different social media channels using popular Python packages.[/box] In this article, we explore how to leverage the power of Python in order to gather and process data from GitHub and make it analysis-ready. Those who love to code, love GitHub. GitHub has taken the widely used version controlling approach to coding to the highest possible level by implementing social network features to the world of programming. No wonder GitHub is also thought of as Social Coding. We thought a book on Social Network analysis would not be complete without a use case on data from GitHub. GitHub allows you to create code repositories and provides multiple collaborative features, bug tracking, feature requests, task managements, and wikis. It has about 20 million users and 57 million code repositories (source: Wikipedia). These kind of statistics easily demonstrate that this is the most representative platform of programmers. It's also a platform for several open source projects that have contributed greatly to the world of software development. Programming technology is evolving at such a fast pace, especially due to the open source movement, and we have to be able to keep a track of emerging technologies. Assuming that the latest programming tools and technologies are being used with GitHub, analyzing GitHub could help us detect the most popular technologies. The popularity of repositories on GitHub is assessed through the number of commits it receives from its community. We will use the GitHub API in this chapter to gather data around repositories with the most number of commits and then discover the most popular technology within them. For all we know, the results that we get may reveal the next great innovations. Scope and process GitHub API allows us to get information about public code repositories submitted by users. It covers lots of open-source, educational and personal projects. Our focus is to find the trending technologies and programming languages of last few months, and compare with repositories from past years. We will collect all the meta information about the repositories such as: Name: The name of the repository Description: A description of the repository Watchers: People following the repository and getting notified about its activity Forks: Users cloning the repository to their own accounts Open Issues: Issues submitted about the repository We will use this data, a combination of qualitative and quantitative information, to identify the most recent trends and weak signals. The process can be represented by the steps shown in the following figure: Getting the data Before using the API, we need to set the authorization. The API gives you access to all publicly available data, but some endpoints need user permission. You can create a new token with some specific scope access using the application settings. The scope depends on your application's needs, such as accessing user email, updating user profile, and so on. Password authorization is only needed in some cases, like access by user authorized applications. In that case, you need to provide your username or email, and your password. All API access is over HTTPS, and accessed from the https://api.github.com/ domain. All data is sent and received as JSON. Rate Limits The GitHub Search API is designed to help to find specific items (repository, users, and so on). The rate limit policy allows up to 1,000 results for each search. For requests using basic authentication, OAuth, or client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute. Connection to GitHub GitHub offers a search endpoint which returns all the repositories matching a query. As we go along, in different steps of the analysis we will change the value of the variable q (query). In the first part, we will retrieve all the repositories created since January 1, 2017 and then we will compare the results with previous years. Firstly, we initialize an empty list results which stores all data about repositories. Secondly, we build get requests with parameters required by the API. We can only get 100 results per request, so we have to use a pagination technique to build a complete dataset. results = [] q = "created:>2017-01-01" def search_repo_paging(q): url = 'https://api.github.com/search/repositories' params = {'q' : q, 'sort' : 'forks', 'order': 'desc', 'per_page' : 100} while True: res = requests.get(url,params = params) result = res.json() results.extend(result['items']) params = {} try: url = res.links['next']['url'] except: break In the first request we have to pass all the parameters to the GET method in our request. Then, we make a new request for every next page, which can be found in res.links['next']['url']. res. links contains a full link to the resources including all the other parameters. That is why we empty the params dictionary. The operation is repeated until there is no next page key in res.links dictionary. For other datasets we modify the search query in such a way that we retrieve repositories from previous years. For example to get the data from 2015 we define the following query: q = "created:2015-01-01..2015-12-31" In order to find proper repositories, the API provides a wide range of query parameters. It is possible to search for repositories with high precision using the system of qualifiers. Starting with main search parameters q, we have following options: sort: Set to forks as we are interested in finding the repositories having the largest number of forks (you can also sort by number of stars or update time) order: Set to descending order per_page: Set to the maximum amount of returned repositories Naturally, the search parameter q can contain multiple combinations of qualifiers. Data pull The amount of data we collect through GitHub API is such that it fits in memory. We can deal with it directly in a pandas dataframe. If more data is required, we would recommend storing it in a database, such as MongoDB. We use JSON tools to convert the results into a clean JSON and to create a dataframe. from pandas.io.json import json_normalize import json import pandas as pd import bson.json_util as json_util sanitized = json.loads(json_util.dumps(results)) normalized = json_normalize(sanitized) df = pd.DataFrame(normalized) The dataframe df contains columns related to all the results returned by GitHub API. We can list them by typing the following: Df.columns Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url', 'clone_url', 'collaborators_url', 'comments_url', 'commits_url', 'compare_url', 'contents_url', 'contributors_url', 'default_branch', 'deployments_url', 'description', 'downloads_url', 'events_url', 'Fork', 'forks', 'forks_count', 'forks_url', 'full_name', 'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url', 'has_downloads', 'has_issues', 'has_pages', 'has_projects', 'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id', 'issue_comment_url', 'Issue_events_url', 'issues_url', 'keys_url', 'labels_url', 'language', 'languages_url', 'merges_url', 'milestones_url', 'mirror_url', 'name', 'notifications_url', 'open_issues', 'open_issues_count', 'owner.avatar_url', 'owner.events_url', 'owner.followers_url', 'owner.following_url', 'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id', 'owner.login', 'Owner.organizations_url', 'owner.received_events_url', 'owner.repos_url', 'owner.site_admin', 'owner.starred_url', 'owner.subscriptions_url', 'owner.type', 'owner.url', 'private', 'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url', 'stargazers_count', 'stargazers_url', 'statuses_url', 'subscribers_url', 'subscription_url', 'svn_url', 'tags_url', 'teams_url', 'trees_url', 'updated_at', 'url', 'Watchers', 'watchers_count', 'year'], dtype='object') Then, we select a subset of variables which will be used for further analysis. Our choice is based on the meaning of each of them. We skip all the technical variables related to URLs, owner information, or ID. The remaining columns contain information which is very likely to help us identify new technology trends: description: A user description of a repository watchers_count: The number of watchers size: The size of repository in kilobytes forks_count: The number of forks open_issues_count: The number of open issues language: The programming language the repository is written in We have selected watchers_count as the criterion to measure the popularity of repositories. This number indicates how many people are interested in the project. However, we may also use forks_count which gives us slightly different information about the popularity. The latter represents the number of people who actually worked with the code, so it is related to a different group. Data processing In the previous step we structured the raw data which is now ready for further analysis. Our objective is to analyze two types of data: Textual data in description Numerical data in other variables Each of them requires a different pre-processing technique. Let's take a look at each type in Detail. Textual data For the first kind, we have to create a new variable which contains a cleaned string. We will do it in three steps which have already been presented in previous chapters: Selecting English descriptions Tokenization Stopwords removal As we work only on English data, we should remove all the descriptions which are written in other languages. The main reason to do so is that each language requires a different processing and analysis flow. If we left descriptions in Russian or Chinese, we would have very noisy data which we would not be able to interpret. As a consequence, we can say that we are analyzing trends in the English-speaking world. Firstly, we remove all the empty strings in the description column. df = df.dropna(subset=['description']) In order to remove non-English descriptions we have to first detect what language is used in each text. For this purpose we use a library called langdetect which is based on the Google language detection project (https://github.com/shuyo/language-detection). from langdetect import detect df['lang'] = df.apply(lambda x: detect(x['description']),axis=1) We create a new column which contains all the predictions. We see different languages, such as en (English), zh-cn (Chinese), vi (Vietnamese), or ca (Catalan). df['lang'] 0 en 1 en 2 en 3 en 4 en 5 zh-cn In our dataset en represents 78.7% of all the repositories. We will now select only those repositories with a description in English: df = df[df['lang'] == 'en'] In the next step, we will create a new clean column with pre-processed textual data. We execute the following code to perform tokenization and remove stopwords: import nltk from nltk import word_tokenize from nltk.corpus import stopwords def clean(text = '', stopwords = []): #tokenize tokens = word_tokenize(text.strip()) #lowercase clean = [i.lower() for i in tokens] #remove stopwords clean = [i for i in clean if i not in stopwords] #remove punctuation punctuations = list(string.punctuation) clean = [i.strip(''.join(punctuations)) for i in clean if i not in punctuations] return " ".join(clean) df['clean'] = df['description'].apply(str) #make sure description is a string df['clean'] = df['clean'].apply(lambda x: clean(text = x, stopwords = stopwords.words('english'))) Finally, we obtain a clean column which contains cleaned English descriptions, ready for analysis: df['clean'].head(5) 0 roadmap becoming web developer 2017 1 base repository imad v2 course application ple… 2 decrypted content eqgrp-auction-file.tar.xz 3 shadow brokers lost translation leak 4 learn design large-scale systems prep system d... Numerical data For numerical data, we will check statistically both what the distribution of values is and whether there are any missing values: df[['watchers_count','size','forks_count','open_issues']].describe() We see that there are no missing values in all four variables: watchers_count, size, forks_count, and open_issues. The watchers_count varies from 0 to 20,792 while the minimum number of forks is 33 and goes up to 2,589. The first quartile of repositories has no open issues while top 25% have more than 12. It is worth noticing that, in our dataset, there is a repository which has 458 open issues. Once we are done with the pre-processing of the data, our next step would be to analyze it, in order to get actionable insights from it. If you found this article to be useful, stay tuned for Part 2, where we perform analysis on the processed GitHub data and determine the top trending technologies. Alternatively, you can check out the book Python Social Media Analytics, to learn how to get valuable insights from various social media sites such as Facebook, Twitter and more.

0
0
3530

article-image-store-access-social-media-data-mongodb

Amey Varangaonkar

26 Dec 2017

6 min read

How to store and access social media data in MongoDB

Amey Varangaonkar

26 Dec 2017

6 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Our article explains how to effectively perform different operations using MongoDB and Python to effectively access and modify the data. According to the official MongoDB page: MongoDB is free and open-source, distributed database which allows ad hoc queries, indexing, and real time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. Along with ease of use, MongoDB is recognized for the following advantages: Schema-less design: Unlike traditional relational databases, which require the data to fit its schema, MongoDB provides a flexible schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure and a collection is a group of documents. One links data within collections using specific identifiers. The document model is quite useful in this subject as most social media APIs provide their data in JSON format. High performance: Indexing and GRIDFS features of MongoDB provide fast access and storage. High availability: Duplication feature that allows us to make various copies of databases in different nodes confirms high availability in the case of node failures. Automatic scaling: The Sharding feature of MongoDB scales large data sets Automatically. You can access information on the implementation of Sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/ Installing MongoDB MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads?_ga=1.253005644.410512988.1432811016. Setting up the environment MongoDB requires a data directory to store all the data. The directory can be created in your working directory: md datadb Starting MongoDB We need to go to the folder where mongod.exe is stored and and run the following command: cmd binmongod.exe Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working. MongoDB using Python MongoDB can be used directly from the shell command or through programming languages. For the sake of our book we'll explain how it works using Python. MongoDB is accessed using Python through a driver module named PyMongo. We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see the most common functionalities required for analysis projects. We highly recommend reading the official MongoDB documentation. PyMongo can be installed using the following command: pip install pymongo Then the following command imports it in the Python script from pymongo import MongoClient client = MongoClient('localhost:27017') The database structure of MongoDB is similar to SQL languages, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your tables do not need to have a predefined structure, you can add documents of any composition as long as they are a JSON object. But by convention is it best practice to have a common general structure for documents in the same collections. To access a database named scrapper we simply have to do the following: db_scrapper = db.scrapper To access a collection named articles in the database scrapper we do this: db_scrapper = db.scrapper collection_articles = db_scrapper.articles Once you have the client object initiated you can access all the databases and the collections very easily. Now, we will see how to perform different operations: Insert: To insert a document into a collection we build a list of new documents to insert into the database: docs = [] for _ in range(0, 10): # each document must be of the python type dict docs.append({ "author": "...", "content": "...", "comment": ["...", ... ] }) Inserting all the docs at once: db.collection.insert_many(docs) Or you can insert them one by one: for doc in docs: db.collection.insert_one(doc) You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/. Find: To fetch all documents within a collection: # as the find function returns a cursor we will iterate over the cursor to actually fetch # the data from the database docs = [d for d in db.collection.find()] To fetch all documents in batches of 100 documents: batch_size = 100 Iteration = 0 count = db.collection.count() # getting the total number of documents in the collection while iteration * batch_size < count: docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)] Iteration += 1 To fetch documents using search queries, where the author is Jean Francois: query = {'author': 'Jean Francois'} docs = [d for d in db.collection.find(query) Where the author field exists and is not null: query = {'author': {'$exists': True, '$ne': None}} docs = [d for d in db.collection.find(query)] There are many other different filtering methods that provide a wide variety of flexibility and precision; we highly recommend taking your time going through the different search operators. You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/ Update: To update a document where the author is Jean Francois and set the attribute published as True: query_search = {'author': 'Jean Francois'} query_update = {'$set': {'published': True}} db.collection.update_many(query_search, query_update) Or you can update just the first matching document: db.collection.update_one(query_search, query_update) Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/ Remove: Remove all documents where the author is Jean Francois: query_search = {'author': 'Jean Francois'} db.collection.delete_many(query_search, query_update) Or remove the first matching document: db.collection.delete_one(query_search, query_update) Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/ Drop: You can drop collections by the following: db.collection.drop() Or you can drop the whole database: db.dropDatabase() We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data. If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, Twitter etc.

0
0
4863

article-image-clean-social-media-data-analysis-python

Amey Varangaonkar

26 Dec 2017

10 min read

How to effectively clean social media data for analysis

Amey Varangaonkar

26 Dec 2017

10 min read

[box type="note" align="" class="" width=""]This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.[/box] Data cleaning and preprocessing is an essential - and often crucial - part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for effective analysis of your social media data. Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient for an analysis using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data is related to text, images, videos, and sound and we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis. Social media Data type and encoding Comments and conversation are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. Every string in Python is seen as a Unicode covering the numbers from 0 through 0x10FFFF (1,114,111 decimal). Then, the sequence has to be represented as a set of bytes (values from 0 to 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called encoding. Encoding plays a very important role in natural language processing because people use more and more characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, in many languages, there are accents that go beyond the regular English alphabet. In order to deal with all the processing problems that might be caused by these, we have to use the right encoding, because comparing two strings with different encodings is actually like comparing apples and oranges. The most common one is UTF-8, used by default in Python 3, which can handle any type of character. As a rule of thumb always normalize your data to Unicode UTF-8. Structure of social media data Another question we'll encounter is, What is the right structure for our data? The most natural choice is a list that can store a sequence of data points (verbatims, numbers, and so on). However, the use of lists will not be efficient on large datasets and we'll be constrained to use sequential processing of the data. That is why a much better solution is to store the data in a tabular format in pandas dataframe, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing and above all it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis. It is worth remembering that the dataset in pandas must fit into RAM memory. For bigger datasets, we suggest the use of SFrames. Pre-processing and text normalization Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages. The quality of the preprocessing has a big impact of the final result on the whole process. There are several stages of the process: from simple text cleaning by removing white spaces, punctuation, HTML tags and special characters up to more sophisticated normalization techniques such as tokenization, stemming or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all others, and the text corpus should be maintained in one uniform format. We import all necessary libraries. import re, itertools import nltk from nltk.corpus import stopwords When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters so programming languages misinterpret for example, "go" and "Go" as two different words. In order to handle such distinctions, we can convert all words to lowercase format with the following steps: Perform basic text mining cleaning. Remove all whitespaces: verbatim = verbatim.strip() Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used in cleaning punctuation, HTML tags, and URLs paths. 3. Remove punctuation: verbatim = re.sub(r'[^ws]','',verbatim) 4. Remove HTML tags: verbatim = re.sub('<[^<]+?>', '', verbatim) 5. Remove URLs: verbatim = re.sub(r'^https?://.*[rn]*', '', verbatim, flags=re.MULTILINE) Depending on the quality of the text corpus, sometimes there is a need to implement some corrections. This refers to the text sources such as Twitter or forums, where emotions can play a role and the comments contain multiple letters words for example, "happpppy" instead of "happy" 6. Standardize words (remove multiple letters): verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim)) After removal of punctuation or white spaces, words can be attached. This happens especially when deleting the periods at the end of the sentences. The corpus might look like: "the brown dog is lostEverybody is looking for him". So there is a need to split "lostEverybody" into two separate words. 7. Split attached words: verbatim = " ".join(re.findall('[A-Z][^A-Z]*', verbatim)) Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus only on the important words instead, and improve the accuracy of the text processing. 8. Convert text to lowercase, lower(): verbatim = verbatim.lower() 9. Stop word removal: verbatim = ' '.join([word for word in verbatim.split() if word not in (stopwords.words('english'))]) 10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas. Some examples of stemming are cars -> car, men -> man, and went -> Go Such text processing can give added value in some domains, and may improve the accuracy of practical information extraction tasks Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing. tokens = nltk.word_tokenize(verbatim) Other techniques are spelling correction, domain knowledge, and grammar checking. Duplicate removal Depending on data source we might notice multiple duplicates in our dataset. The decision to remove duplicates should be based on the understanding of the domain. In most cases, duplicates come from errors in data collection process and it is recommended to remove them in order to reduce bias in our analysis, with the help of the following: df = df.drop_duplicates(subset=['column_name']) Knowing basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases—MongoDB. Capture: Once you have made a connection to your API you need to make a special request and receive the data at your end. This step requires you go through the data to be able to understand it. Often the data is received in a special format called JavaScript Object Notation (JSON). JSON was created to enable a lightweight data interchange between programs. The JSON resembles the old XML format and consists of a key-value pair. Normalization: The data received from platforms are not in an ideal format to perform analysis. With textual data there are many different approaches to normalization. One can be stripping whitespaces surrounding verbatims, or converting all verbatims to lowercase, or changing the encoding to UTF-8. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner that ensures a uniform standardization of your data. It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers on all your data input points so as to ensure that all the data in your analysis go through exactly the same normalization process. In general, one should always perform the following cleaning steps: Normalize the textual content: Normalization generally contains at least the following steps: Stripping surrounding whitespaces. Lowercasing the verbatim. Universal encoding (UTF-8). 2. Remove special characters (example: punctuation). 3. Remove stop words: Irrespective of the language stop words add no additional informative value to the analysis, except in the case of deep parsing where stop words can be bridge connectors between targeted words. 4. Splitting attached words. 5. Removal of URLs and hyperlinks: URLs and hyperlinks can be studied separately, but due to the lack of grammatical structure they are by convention removed from verbatims. 6. Slang lookups: This is a relatively difficult task, because here we would require a predefined vocabulary of slang words and their proper reference words, for example: luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated. In the case of studying words and not phrases (or n-grams), it is very important to do the following: Tokenize verbatim Stemming and lemmatization (Optional): This is where different written forms of the same word do not hold additional meaning to your study Some advanced cleaning procedures are: Grammar checking: Grammar checking is mostly learning-based, a huge amount of proper text data is learned, and models are created for the purpose of grammar correction. There are many online tools that are available for grammar correction purposes. This is a very tricky cleaning technique because language style and structure can change from source to source (for example language on Twitter will not correspond with the language from published books). Wrongly correcting grammar can have negative effects on the analysis. Spelling correction: In natural language, misspelled errors are encountered. Companies, such as Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms such as the Levenshtein Distances, Dictionary Lookup, and so on, or other modules and packages to fix these errors. Again take spell correction with a grain of salt, because false positives can affect the results. Storing: Once the data is received, normalized, and/or cleaned, we need to store the data in an efficient storage database. In this book we have chosen MongoDB as the database as it's a modern and scalable database. It's also relatively easy to use and get started. However, other databases such as Cassandra or HBase could also be used depending on expertise and objectives. Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With the effective Python packages like Numpy, SciPy, Pandas etc these tasks become so much easy and save a lot of your time. If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!

0
1
8727

article-image-10-machine-learning-tools-to-look-out-for-in-2018

Amey Varangaonkar

26 Dec 2017

7 min read

10 Machine Learning Tools to watch in 2018

Amey Varangaonkar

26 Dec 2017

7 min read

0
0
3350

article-image-hitting-the-right-notes-in-2017-ai-song-for-data-scientists

Aarthi Kumaraswamy

26 Dec 2017

3 min read

Hitting the right notes in 2017: AI in a song for Data Scientists

Aarthi Kumaraswamy

26 Dec 2017

3 min read

A lot, I mean lots and lots of great articles have been written already about AI’s epic journey in 2017. They all generally agree that 2017 sets the stage for AI in very real terms. We saw immense progress in academia, research and industry in terms of an explosion of new ideas (like capsNets), questioning of established ideas (like backprop, AI black boxes), new methods (Alpha Zero’s self-learning), tools (PyTorch, Gluon, AWS SageMaker), and hardware (quantum computers, AI chips). New and existing players gearing up to tap into this phenomena even as they struggled to tap into the limited talent pool at various conferences and other community hangouts. While we have accelerated the pace of testing and deploying some of those ideas in the real world with self-driving cars, in media & entertainment, among others, progress in building a supportive and sustainable ecosystem has been slow. We also saw conversations on AI ethics, transparency, interpretability, fairness, go mainstream alongside broader contexts such as national policies, corporate cultural reformation setting the tone of those conversations. While anxiety over losing jobs to robots keeps reaching new heights proportional to the cryptocurrency hype, we saw humanoids gain citizenship, residency and even talk about contesting in an election! It has been nothing short of the stuff, legendary tales are made of: struggle, confusion, magic, awe, love, fear, disgust, inspiring heroes, powerful villains, misunderstood monsters, inner demons and guardian angels. And stories worth telling must have songs written about them! Here’s our ode to AI Highlights in 2017 while paying homage to an all-time favorite: ‘A few of my favorite things’ from Sound of Music. Next year, our AI friends will probably join us behind the scenes in the making of another homage to the extraordinary advances in data science, machine learning, and AI. [box type="shadow" align="" class="" width=""] Stripes on horses and horsetails on zebras Bright funny faces in bowls full of rameN Brown furry bears rolled into pandAs These are a few of my favorite thinGs TensorFlow projects and crisp algo models Libratus’ poker faces, AlphaGo Zero’s gaming caboodles Cars that drive and drones that fly with the moon on their wings These are a few of my favorite things Interpreting AI black boxes, using Python hashes Kaggle frenemies and the ones from ML MOOC classes R white spaces that melt into strings These are a few of my favorite things When models don’t converge, and networks just forget When I am sad I simply remember my favorite things And then I don’t feel so bad[/box] PS: We had to leave out many other significant developments in the above cover as we are limited in our creative repertoire. We invite you to join in and help us write an extended version together! The idea is to make learning about data science easy, accessible, fun and memorable!

0
0
1461

How to format and publish code using R Markdown

25 Startups using machine learning differently in 2018: From farming to brewing beer to elder care

Getting to know and manipulate Tensors in TensorFlow

18 striking AI Trends to watch in 2018 - Part 2

There and back again: Decrypting Bitcoin`s 2017 journey from $1000 to $20000

How to Implement a Neural Network with Single-Layer Perceptron

Getting started with Spark 2.0

18 striking AI Trends to watch in 2018 - Part 1

How to Mine Popular Trends on GitHub using Python - Part 2

Storing Apache Storm data in Elasticsearch

Trending Topics

Mine Popular Trends on GitHub using Python - Part 1

How to store and access social media data in MongoDB

How to effectively clean social media data for analysis

10 Machine Learning Tools to watch in 2018

Hitting the right notes in 2017: AI in a song for Data Scientists