
How-To Tutorials


Implementing an AI in Unreal Engine 4 with AI Perception components [Tutorial]

Natasha Mathur
11 Sep 2018
8 min read
AI Perception is a system within Unreal Engine 4 that allows sources to register their senses to create stimuli; listeners are then periodically updated as sense stimuli are created within the system. This works wonders for creating a reusable system that can react to an array of customizable sensors. In this tutorial, we will explore the different components available within Unreal Engine 4 to enable artificial intelligence sensing within our games. We will do this by taking advantage of a system within Unreal Engine called AI Perception components. These components can be customized and even scripted to introduce new behavior by extending the current sensing interface.

This tutorial is an excerpt taken from the book 'Unreal Engine 4 AI Programming Essentials', written by Peter L. Newton and Jie Feng. Let's now have a look at AI sensing.

Implementing AI sensing in Unreal Engine 4

Let's start by bringing up Unreal Engine 4 and opening the New Project window. Then, perform the following steps:

1. First, name the new project AI Sense and hit Create Project.
2. After it finishes loading, we want to start by creating a new AIController that will be responsible for sending our AI the appropriate instructions. Navigate to the Blueprint folder and create a new AIController class, naming it EnemyPatrol.
3. Now, to assign EnemyPatrol, we need to place a pawn into the world and then assign the controller to it. After placing the pawn, click on the Details tab within the editor. Next, search for AI Controller. By default, it is set to the parent class AIController, but we want this to be EnemyPatrol.
4. Next, we will create a new PlayerController named PlayerSense. Then, we need to introduce the AI Perception components to those who we want to be seen by or to see. Let's open the PlayerSense controller first and then add the necessary components.

Building AI Perception components

There are two components currently available within the Unreal Engine framework. The first is the AI Perception component, which listens for perceived stimuli (sight, hearing, and so on). The other is the AIPerceptionStimuliSource component. It is used to easily register the pawn as a source of stimuli, allowing it to be detected by other AI Perception components. This comes in handy, particularly in our case.
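The tutorial configures everything through Blueprints; if you would rather wire the stimuli source up from C++, the same component is exposed by the AIModule. The following is only a rough sketch under that assumption (ASensedPawn is a hypothetical class standing in for the role PlayerSense plays in the tutorial; it is not part of the book's project):

```cpp
// SensedPawn.h -- hypothetical pawn that registers itself as a sight stimulus source.
// Requires "AIModule" in the project's Build.cs dependencies.
#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"
#include "Perception/AIPerceptionStimuliSourceComponent.h"
#include "Perception/AISense_Sight.h"
#include "SensedPawn.generated.h"

UCLASS()
class ASensedPawn : public APawn
{
    GENERATED_BODY()

public:
    ASensedPawn()
    {
        // Create the stimuli source component; registration happens at BeginPlay.
        StimuliSource = CreateDefaultSubobject<UAIPerceptionStimuliSourceComponent>(TEXT("StimuliSource"));
    }

    virtual void BeginPlay() override
    {
        Super::BeginPlay();
        // Mirror the Blueprint settings: register this pawn as a source for the Sight sense.
        StimuliSource->RegisterForSense(UAISense_Sight::StaticClass());
        StimuliSource->RegisterWithPerceptionSystem();
    }

private:
    UPROPERTY(VisibleAnywhere, Category = "AI")
    UAIPerceptionStimuliSourceComponent* StimuliSource;
};
```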
Now, follow these steps:

1. With PlayerSense open, let's add a new component called AIPerceptionStimuliSource. Then, under the Details tab, let's select AutoRegister as Source.
2. Next, we want to add new senses to create a source for. Looking at Register as Source for Senses, there is an AISense array. Populate this array with the AISense_Sight blueprint in order to be detected by sight by other AI Perception components. You will note that there are also other senses to choose from, for example, AISense_Hearing, AISense_Touch, and so on. The complete settings are shown in the following screenshot.

This was pretty straightforward, and it allows our player pawn to be detected by the enemy AI whenever we get within the range configured for its senses.

3. Next, let's open our EnemyPatrol class and add the other AI Perception component to our AI. This component is called AIPerception and contains many other configurations, allowing you to customize and tailor the AI for different scenarios.
4. Clicking on the AI Perception component, you will notice that under the AI section everything is grayed out. This is because the configurations are specific to each sense. This also applies if you create your own AI Sense classes.

Let's focus on two sections within this component: the first is the AI Perception settings, and the other is the event provided with this component. The AI Perception section should look similar to the same section on AIPerceptionStimuliSource. The differences are that you have to register your senses, and you can also specify a dominant sense. The dominant sense takes precedence over other senses detected at the same location.

Let's look at the Senses configuration and add a new element. This will populate the array with a new sense configuration, which you can then modify. For now, let's select the AI Sight configuration and leave the default values as they are. In the game, we are able to visualize the configurations, allowing us to have more control over our senses. There is another configuration that allows you to specify affiliation, but at the time of writing, these options aren't available. When you click on Detection by Affiliation, you must select Detect Neutrals to detect any pawn with a Sight sense source.

Next, we need to be able to notify our AI of a new target. We will do this by utilizing the event we saw as part of the AI Perception component. Navigating there, we can see an event called OnPerceptionUpdated. This event fires when there are changes in the sensory state, which makes tracking senses easy and straightforward. Click on OnPerceptionUpdated and create it within the EventGraph. Now, within the EventGraph, whenever this event is called because the senses have changed, it will return the available sensed actors, as shown in the following screenshot.
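For readers working in C++ rather than Blueprints, the equivalent setup on the AIController is to create a UAIPerceptionComponent, configure a sight sense, and bind a handler to the OnPerceptionUpdated delegate. Again, this is only a sketch under that assumption (AEnemyPatrolController is a hypothetical C++ counterpart to the tutorial's Blueprint EnemyPatrol controller):

```cpp
// EnemyPatrolController.h -- hypothetical C++ version of the EnemyPatrol AIController.
#include "CoreMinimal.h"
#include "AIController.h"
#include "Perception/AIPerceptionComponent.h"
#include "Perception/AISenseConfig_Sight.h"
#include "EnemyPatrolController.generated.h"

UCLASS()
class AEnemyPatrolController : public AAIController
{
    GENERATED_BODY()

public:
    AEnemyPatrolController()
    {
        PerceptionComp = CreateDefaultSubobject<UAIPerceptionComponent>(TEXT("AIPerception"));
        SightConfig = CreateDefaultSubobject<UAISenseConfig_Sight>(TEXT("SightConfig"));

        // Detect neutrals, as the tutorial does under Detection by Affiliation.
        SightConfig->DetectionByAffiliation.bDetectNeutrals = true;

        // Register the sight sense and make it the dominant sense; other values stay at defaults.
        PerceptionComp->ConfigureSense(*SightConfig);
        PerceptionComp->SetDominantSense(SightConfig->GetSenseImplementation());
    }

    virtual void BeginPlay() override
    {
        Super::BeginPlay();
        // Bind the handler that fires whenever the sensory state changes.
        PerceptionComp->OnPerceptionUpdated.AddDynamic(this, &AEnemyPatrolController::HandlePerceptionUpdated);
    }

    UFUNCTION()
    void HandlePerceptionUpdated(const TArray<AActor*>& UpdatedActors)
    {
        // UpdatedActors contains the actors whose perception state just changed,
        // the same list the Blueprint event exposes.
        for (AActor* Actor : UpdatedActors)
        {
            UE_LOG(LogTemp, Log, TEXT("Perceived: %s"), *GetNameSafe(Actor));
        }
    }

private:
    UPROPERTY(VisibleAnywhere, Category = "AI")
    UAIPerceptionComponent* PerceptionComp;

    UPROPERTY(VisibleAnywhere, Category = "AI")
    UAISenseConfig_Sight* SightConfig;
};
```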
Now that we understand how we will obtain references to the sensed actors, we should create a way for our pawn to maintain different states, similar to what we would do in a Behavior Tree. Let's first establish a home location for our pawn to run to when the player is no longer detected by the AI:

1. In the same Blueprint folder, we will create a subclass of Target Point. Let's name this Waypoint and place it at an appropriate location within the world.
2. Now, we need to open this Waypoint subclass and create additional variables to maintain traversable routes. We can do this by defining the next waypoint within a waypoint, allowing us to create what programmers call a linked list. This results in the AI being able to continuously move to the next available route after reaching the destination of its current route.
3. With Waypoint open, add a new variable named NextWaypoint and make its type the same as that of the Waypoint class we created. Then navigate back to the Content Browser.
4. Now, within our EnemyPatrol AIController, let's focus on Event Begin Play in the EventGraph. We have to grab the reference to the waypoint we created earlier and store it within our AIController. So, let's create a new variable of the Waypoint type and name it CurrentPoint.
5. On Event Begin Play, the first thing we need is the AIController, which is the self-reference for this EventGraph because we are in the AIController class. So, let's grab our self-reference and check whether it is valid. Safety first! Next, we will get our AIController from our self-reference. Then, again for safety, let's check whether our AIController is valid.

How does our AI sense?

Next, we want to create a Get All Actors Of Class node and set the Actor class to Waypoint. Now, we need to convert a few instructions into a macro because we will use these instructions throughout the project. So, let's select the nodes shown as follows and hit Convert to Macro. Lastly, rename it getAIController. You can see the final nodes in the following screenshot.

Next, we want our AI to grab a random new route and set it as a new variable. So, let's first get the length of the array of actors returned. Then, we want to subtract 1 from this length, which gives us the index range of our array. From there, we want to pull from Subtract and get Random Integer. Then, from our array, we want to get the Get node and plug our Random Integer node into the index to retrieve. Next, pull the returned variable from the Get node and promote it to a local variable. This will automatically create the type dragged from the pin, and we want to rename this Current Point to make it clear why this variable exists. Then, from our getAIController macro, we want to assign the ReceiveMoveCompleted event. This is done so that when our AI successfully moves to the next route, we can update the information and tell our AI to move to the next route.

We learned AI sensing in Unreal Engine 4 with the help of a system within Unreal Engine called AI Perception components. We also explored the different components within that system. If you found this post useful, be sure to check out the book 'Unreal Engine 4 AI Programming Essentials' for more concepts on AI sensing in Unreal Engine.

Development Tricks with Unreal Engine 4
What's new in Unreal Engine 4.19?
Unreal Engine 4.20 released with focus on mobile and immersive (AR/VR/MR) devices


Build a custom news feed with Python [Tutorial]

Prasad Ramesh
10 Sep 2018
13 min read
To create a custom news feed model, we need data that the model can be trained on. This training data will be fed into the model in order to teach it to discriminate between the articles that we'd be interested in and the ones that we would not. This article is an excerpt from a book written by Alexander T. Combs titled Python Machine Learning Blueprints: Intuitive data projects you can relate to. In this article, we will learn to build a custom news corpus and annotate a large number of articles corresponding to our interests. You can download the code and other relevant files used in this article from this GitHub link.

Creating a supervised training dataset

Before we can create a model of our taste in news articles, we need training data. This training data will be fed into our model in order to teach it to discriminate between the articles that we'd be interested in and the ones that we would not. To build this corpus, we will need to annotate a large number of articles that correspond to these interests. For each article, we'll label it either "y" or "n". This will indicate whether the article is one that we would want to have sent to us in our daily digest or not.

To simplify this process, we will use the Pocket app. Pocket is an application that allows you to save stories to read later. You simply install the browser extension, and then click on the Pocket icon in your browser's toolbar when you wish to save a story. The article is saved to your personal repository. One of the great features of Pocket for our purposes is its ability to save the article with a tag of your choosing. We'll use this feature to mark interesting articles as "y" and non-interesting articles as "n".

Installing the Pocket Chrome extension

We use Google Chrome here, but other browsers should work similarly. For Chrome, go into the Google App Store and look for the Extensions section (image from https://chrome.google.com/webstore/search/pocket). Click on the blue Add to Chrome button. If you already have an account, log in; if you do not have an account, go ahead and sign up (it's free). Once this is complete, you should see the Pocket icon in the upper right-hand corner of your browser. It will be greyed out, but once there is an article you wish to save, you can click on it. It will turn red once the article has been saved, as seen in the following images. The greyed-out icon can be seen in the upper right-hand corner (image from https://news.ycombinator.com). When the icon is clicked, it turns red to indicate that the article has been saved (image from https://www.wsj.com).

Now comes the fun part! Begin saving all the articles that you come across. Tag the interesting ones with "y" and the non-interesting ones with "n". This is going to take some work. Your end results will only be as good as your training set, so you're going to need to do this for hundreds of articles. If you forget to tag an article when you save it, you can always go to the site, http://www.get.pocket.com, to tag it there.

Using the Pocket API to retrieve stories

Now that you've diligently saved your articles to Pocket, the next step is to retrieve them. To accomplish this, we'll use the Pocket API. You can sign up for an account at https://getpocket.com/developer/apps/new. Click on Create New App in the upper left-hand side and fill in the details to get your API key. Make sure to click all of the permissions so that you can add, change, and retrieve articles.
Image from https://getpocket.com/developer

Once you have filled this in and submitted it, you will receive your CONSUMER KEY. You can find this in the upper left-hand corner under My Apps. It will look like the following screen, but obviously with a real key (image from https://getpocket.com/developer).

Once this is set, you are ready to move on to the next step, which is to set up the authorizations. This requires that you input your consumer key and a redirect URL. The redirect URL can be anything. Here I have used my Twitter account:

```python
import requests

auth_params = {'consumer_key': 'MY_CONSUMER_KEY',
               'redirect_uri': 'https://www.twitter.com/acombs'}
tkn = requests.post('https://getpocket.com/v3/oauth/request', data=auth_params)
tkn.content
```

You will see the following output. The output will have the code that you'll need for the next step. Place the following in your browser bar:

https://getpocket.com/auth/authorize?request_token=some_long_code&redirect_uri=https%3A//www.twitter.com/acombs

If you change the redirect URL to one of your own, make sure to URL encode it. There are a number of resources for this. One option is to use the Python library urllib; another is to use a free online source. At this point, you should be presented with an authorization screen. Go ahead and approve it, and we can move on to the next step:

```python
usr_params = {'consumer_key': 'my_consumer_key', 'code': 'some_long_code'}
usr = requests.post('https://getpocket.com/v3/oauth/authorize', data=usr_params)
usr.content
```

We'll use the output code here to move on to retrieving the stories. First, we retrieve the stories tagged "n":

```python
no_params = {'consumer_key': 'my_consumer_key',
             'access_token': 'some_super_long_code',
             'tag': 'n'}
no_result = requests.post('https://getpocket.com/v3/get', data=no_params)
no_result.text
```

The preceding code generates the following output. Note that we have a long JSON string covering all the articles that we tagged "n". There are several keys in this, but we are really only interested in the URL at this point. We'll go ahead and create a list of all the URLs from this:

```python
import json

no_jf = json.loads(no_result.text)
no_jd = no_jf['list']

no_urls = []
for i in no_jd.values():
    no_urls.append(i.get('resolved_url'))
no_urls
```

The preceding code generates the following output. This list contains all the URLs of stories that we aren't interested in. Now, let's put this in a DataFrame object and tag it as such:

```python
import pandas as pd

no_uf = pd.DataFrame(no_urls, columns=['urls'])
no_uf = no_uf.assign(wanted = lambda x: 'n')
no_uf
```

The preceding code generates the following output. Now, we're all set with the unwanted stories. Let's do the same thing with the stories that we are interested in:

```python
ye_params = {'consumer_key': 'my_consumer_key',
             'access_token': 'some_super_long_token',
             'tag': 'y'}
yes_result = requests.post('https://getpocket.com/v3/get', data=ye_params)

yes_jf = json.loads(yes_result.text)
yes_jd = yes_jf['list']

yes_urls = []
for i in yes_jd.values():
    yes_urls.append(i.get('resolved_url'))

yes_uf = pd.DataFrame(yes_urls, columns=['urls'])
yes_uf = yes_uf.assign(wanted = lambda x: 'y')
yes_uf
```

The preceding code generates the following output. Now that we have both types of stories for our training data, let's join them together into a single DataFrame:

```python
df = pd.concat([yes_uf, no_uf])
df.dropna(inplace=1)
df
```

The preceding code generates the following output. Now that we're set with all our URLs and their corresponding tags in a single frame, we'll move on to downloading the HTML for each article.
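A quick aside on the URL-encoding step mentioned above: the author points to urllib for encoding a custom redirect URL but doesn't show it. A minimal sketch using Python 3's urllib.parse (the redirect URL below is just a placeholder, not from the book):

```python
from urllib.parse import quote

# URL-encode your own redirect URL before appending it to the authorize link.
redirect_url = 'https://www.example.com/my-redirect'  # placeholder
encoded = quote(redirect_url, safe='')

authorize_url = ('https://getpocket.com/auth/authorize'
                 '?request_token=some_long_code'
                 '&redirect_uri=' + encoded)
print(authorize_url)
```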
We'll use another free service for this, called embed.ly.

Using the embed.ly API to download story bodies

We have all the URLs for our stories, but unfortunately this isn't enough to train on. We'll need the full article body. By itself, this could become a huge challenge if we wanted to roll our own scraper, especially if we were going to be pulling stories from dozens of sites. We would need to write code to target the article body while carefully avoiding all the other site gunk that surrounds it. Fortunately, there are a number of free services that will do this for us. We're going to use embed.ly, but there are a number of other services that you could also use.

The first step is to sign up for embed.ly API access. You can do this at https://app.embed.ly/signup. This is a straightforward process. Once you confirm your registration, you will receive an API key. You just need to use this key in your HTTP request. Let's do this now:

```python
import urllib

def get_html(x):
    qurl = urllib.parse.quote(x)
    rhtml = requests.get('https://api.embedly.com/1/extract?url=' + qurl + '&key=some_api_key')
    ctnt = json.loads(rhtml.text).get('content')
    return ctnt

df.loc[:, 'html'] = df['urls'].map(get_html)
df.dropna(inplace=1)
df
```

The preceding code generates the following output. With that, we have the HTML of each story. As the content is embedded in HTML markup and we want to feed plain text into our model, we'll use a parser to strip out the markup tags:

```python
from bs4 import BeautifulSoup

def get_text(x):
    soup = BeautifulSoup(x, 'lxml')
    text = soup.get_text()
    return text

df.loc[:, 'text'] = df['html'].map(get_text)
df
```

The preceding code generates the following output. With this, we have our training set ready. We can now move on to a discussion of how to transform our text into something that a model can work with.

Setting up your daily personal newsletter

In order to set up a personal e-mail with news stories, we're going to utilize IFTTT again. As in Chapter 3, Build an App to Find Cheap Airfares, we'll use the Maker Channel to send a POST request. However, this time the payload will be our news stories. If you haven't set up the Maker Channel, do this now. Instructions can be found in Chapter 3, Build an App to Find Cheap Airfares. You should also set up the Gmail channel. Once that is complete, we'll add a recipe to combine the two.

First, click on Create a Recipe from the IFTTT home page. Then, search for the Maker Channel (images from https://www.iftt.com). Select this, then select Receive a web request. Then, give the request a name. I'm using news_event. Finish by clicking on Create Trigger. Next, click on that to set up the e-mail piece. Search for Gmail and click on the icon seen as follows. Once you have clicked on Gmail, click on Send an e-mail. From here, you can customize your e-mail message. Input your e-mail address, a subject line, and finally, include Value1 in the e-mail body. We will pass our story title and link into this with our POST request. Click on Create Recipe to finalize this. Now, we're ready to generate the script that will run on a schedule, automatically sending us articles of interest.
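Before that, note that the excerpt jumps from building the training set to serializing "our vectorizer and our model" in the next step, so the actual model training happens off-page in the book. Based on the imports in the final script (TfidfVectorizer and LinearSVC), the training step presumably looks something like the following sketch; treat the parameter choices as assumptions rather than the book's exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Vectorize the article text and fit a linear SVM on the y/n labels
# gathered from Pocket. Parameter values here are illustrative only.
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', min_df=3)
tv = vect.fit_transform(df['text'])

model = LinearSVC()
model.fit(tv, df['wanted'])
```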
We're going to create a separate script for this, but one last thing that we need to do in our existing code is serialize our vectorizer and our model:

```python
import pickle

pickle.dump(model, open(r'/Users/alexcombs/Downloads/news_model_pickle.p', 'wb'))
pickle.dump(vect, open(r'/Users/alexcombs/Downloads/news_vect_pickle.p', 'wb'))
```

With this, we have saved everything that we need from our model. In our new script, we will read these in to generate our new predictions. We're going to use the same scheduling library to run the code that we used in Chapter 3, Build an App to Find Cheap Airfares. Putting it all together, we have the following script:

```python
# get our imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import schedule
import time
import pickle
import json
import gspread
import requests
from bs4 import BeautifulSoup
from oauth2client.client import SignedJwtAssertionCredentials

# create our fetching function
def fetch_news():
    try:
        vect = pickle.load(open(r'/Users/alexcombs/Downloads/news_vect_pickle.p', 'rb'))
        model = pickle.load(open(r'/Users/alexcombs/Downloads/news_model_pickle.p', 'rb'))

        json_key = json.load(open(r'/Users/alexcombs/Downloads/APIKEY.json'))
        scope = ['https://spreadsheets.google.com/feeds']
        credentials = SignedJwtAssertionCredentials(json_key['client_email'],
                                                    json_key['private_key'].encode(),
                                                    scope)
        gc = gspread.authorize(credentials)

        ws = gc.open("NewStories")
        sh = ws.sheet1
        zd = list(zip(sh.col_values(2), sh.col_values(3), sh.col_values(4)))
        zf = pd.DataFrame(zd, columns=['title', 'urls', 'html'])
        zf.replace('', pd.np.nan, inplace=True)
        zf.dropna(inplace=True)

        def get_text(x):
            soup = BeautifulSoup(x, 'lxml')
            text = soup.get_text()
            return text

        zf.loc[:, 'text'] = zf['html'].map(get_text)

        tv = vect.transform(zf['text'])
        res = model.predict(tv)

        rf = pd.DataFrame(res, columns=['wanted'])
        rez = pd.merge(rf, zf, left_index=True, right_index=True)

        news_str = ''
        for t, u in zip(rez[rez['wanted'] == 'y']['title'],
                        rez[rez['wanted'] == 'y']['urls']):
            news_str = news_str + t + '\n' + u + '\n'

        payload = {"value1": news_str}
        r = requests.post('https://maker.ifttt.com/trigger/news_event/with/key/IFTTT_KEY',
                          data=payload)

        # cleanup worksheet
        lenv = len(sh.col_values(1))
        cell_list = sh.range('A1:F' + str(lenv))
        for cell in cell_list:
            cell.value = ""
        sh.update_cells(cell_list)
        print(r.text)
    except:
        print('Failed')

schedule.every(480).minutes.do(fetch_news)

while 1:
    schedule.run_pending()
    time.sleep(1)
```

What this script does is run every 480 minutes (8 hours), pull down the news stories from Google Sheets, run the stories through the model, generate an e-mail by sending a POST request to IFTTT for the stories that are predicted to be of interest, and then, finally, clear out the stories in the spreadsheet so that only new stories get sent in the next e-mail.

Congratulations! You now have your own personalized news feed! In this tutorial we learned how to create a custom news feed; to know more about setting it up and other intuitive Python projects, check out Python Machine Learning Blueprints: Intuitive data projects you can relate to.

Writing web services with functional Python programming [Tutorial]
Visualizing data in R and Python using Anaconda [Tutorial]
Python 3.7 beta is available as the second generation Google App Engine standard runtime


Implementing Dependency Injection in Google Guice [Tutorial]

Natasha Mathur
09 Sep 2018
10 min read
Choosing a framework wisely is important when implementing Dependency Injection, as each framework has its own advantages and disadvantages. There are various Java-based dependency injection frameworks available in the open source community, such as Dagger, Google Guice, Spring DI, Java EE 8 DI, and PicoContainer. In this article, we will learn about Google Guice (pronounced "juice"), a lightweight DI framework that helps developers to modularize applications. Guice builds on the annotation and generics features introduced by Java 5 to make code type-safe. It enables objects to be wired together and tested with less effort, and annotations help you to write reusable code that is less error-prone. This tutorial is an excerpt taken from the book 'Java 9 Dependency Injection', written by Krunal Patel and Nilang Patel.

In Guice, the new keyword is replaced with @Inject for injecting dependencies. It allows constructor, field, and method (any method with any number of arguments) level injection. Using Guice, we can define custom scopes and circular dependencies. It also has features to integrate with Spring and AOP interception. Moreover, Guice implements Java Specification Request (JSR) 330 and uses the standard annotations provided by JSR-330. The first version of Guice was introduced by Google in 2007, and the latest version is Guice 4.1. Before we see how dependency injection gets implemented in Guice, let's first set up Guice.

Guice setup

To keep our coding simple, throughout this tutorial we are going to use a Maven project to understand Guice DI. Let's create a simple Maven project using the following parameters: groupId: com.packt.guice.di, artifactId: chapter4, and version: 0.0.1-SNAPSHOT. After adding the Guice 4.1.0 dependency to the pom.xml file, our final pom.xml will look like this:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.packt.guice.di</groupId>
  <artifactId>chapter4</artifactId>
  <packaging>jar</packaging>
  <version>0.0.1-SNAPSHOT</version>
  <name>chapter4</name>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.google.inject</groupId>
      <artifactId>guice</artifactId>
      <version>4.1.0</version>
    </dependency>
  </dependencies>
  <build>
    <finalName>chapter2</finalName>
  </build>
</project>
```

For this tutorial, we have used JDK 9, but not as a module project, because the Guice library is not available as a Java 9 modular jar.

Basic injection in Guice

We have set up Guice; now it is time to understand how injection works in Guice. Let's rewrite the example of a notification system using Guice, and along the way we will see several indispensable interfaces and classes in Guice. We have a base interface called NotificationService, which expects a message and recipient details as arguments:

```java
public interface NotificationService {
    boolean sendNotification(String message, String recipient);
}
```

The SMSService concrete class is an implementation of the NotificationService interface. Here, we will apply the @Singleton annotation to the implementation class. Since service objects will be created through injector classes, this annotation tells Guice that the service class should be a singleton object.
Because of the JSR-330 support in Guice, the annotation can be used from either javax.inject or the com.google.inject package:

```java
import javax.inject.Singleton;
import com.packt.guice.di.service.NotificationService;

@Singleton
public class SMSService implements NotificationService {
    public boolean sendNotification(String message, String recipient) {
        // Write code for sending SMS
        System.out.println("SMS has been sent to " + recipient);
        return true;
    }
}
```

In the same way, we can also implement another service, such as sending notifications to a social media platform, by implementing the NotificationService interface.

It's time to define the consumer class, where we can initialize the service class for the application. In Guice, the @Inject annotation is used to define setter-based as well as constructor-based dependency injection. An instance of this class is used to send notifications by means of the available communication services. Our AppConsumer class defines setter-based injection as follows:

```java
import javax.inject.Inject;
import com.packt.guice.di.service.NotificationService;

public class AppConsumer {

    private NotificationService notificationService;

    //Setter based DI
    @Inject
    public void setService(NotificationService service) {
        this.notificationService = service;
    }

    public boolean sendNotification(String message, String recipient){
        //Business logic
        return notificationService.sendNotification(message, recipient);
    }
}
```

Guice needs to know which service implementation to apply, so we configure it by extending the AbstractModule class and providing an implementation for the configure() method. Here is an example of an injector configuration:

```java
import com.google.inject.AbstractModule;
import com.packt.guice.di.impl.SMSService;
import com.packt.guice.di.service.NotificationService;

public class ApplicationModule extends AbstractModule{

    @Override
    protected void configure() {
        //bind service to implementation class
        bind(NotificationService.class).to(SMSService.class);
    }
}
```

In the preceding class, the module implementation specifies that an instance of SMSService is to be injected wherever a NotificationService variable is required. In the same way, we just need to define a binding for any new service implementation, if required. Binding in Guice is similar to wiring in Spring:

```java
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.packt.guice.di.consumer.AppConsumer;
import com.packt.guice.di.injector.ApplicationModule;

public class NotificationClient {

    public static void main(String[] args) {
        Injector injector = Guice.createInjector(new ApplicationModule());
        AppConsumer app = injector.getInstance(AppConsumer.class);
        app.sendNotification("Hello", "9999999999");
    }
}
```

In the preceding program, the Injector object is created using the Guice class's createInjector() method, by passing the ApplicationModule class's implementation object. By using the injector's getInstance() method, we can initialize the AppConsumer class. While creating the AppConsumer object, Guice injects the required service class implementation (SMSService, in our case). The following is the output of running the previous code:

SMS has been sent to Recipient :: 9999999999 with Message :: Hello

So, this is how Guice dependency injection works. Guice has embraced a code-first technique for dependency injection, so managing numerous XML files is not required.
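If you later add a second implementation, say an e-mail channel as suggested above, both can coexist by qualifying one binding with a binding annotation such as @Named. The following sketch is illustrative only; EmailService and the "email" name are hypothetical and not part of the book's example, and the classes are assumed to live in the same packages as the book's code:

```java
import javax.inject.Inject;
import javax.inject.Named;
import com.google.inject.AbstractModule;
import com.google.inject.name.Names;

// Hypothetical second implementation of the book's NotificationService.
class EmailService implements NotificationService {
    public boolean sendNotification(String message, String recipient) {
        System.out.println("E-mail has been sent to " + recipient);
        return true;
    }
}

class MultiChannelModule extends AbstractModule {
    @Override
    protected void configure() {
        // Default binding stays on SMS; the e-mail channel gets a named binding.
        bind(NotificationService.class).to(SMSService.class);
        bind(NotificationService.class)
            .annotatedWith(Names.named("email"))
            .to(EmailService.class);
    }
}

// A consumer asks for the named binding explicitly.
class EmailConsumer {
    private final NotificationService emailNotificationService;

    @Inject
    EmailConsumer(@Named("email") NotificationService emailNotificationService) {
        this.emailNotificationService = emailNotificationService;
    }

    boolean notifyByEmail(String message, String recipient) {
        return emailNotificationService.sendNotification(message, recipient);
    }
}
```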
Let's test our client application by writing a JUnit test case. We can simply mock the service implementation of SMSService, so there is no need to implement the actual service. The MockSMSService class looks like this:

```java
import com.packt.guice.di.service.NotificationService;

public class MockSMSService implements NotificationService {

    public boolean sendNotification(String message, String recipient) {
        System.out.println("In Test Service :: " + message + " Recipient :: " + recipient);
        return true;
    }
}
```

The following is the JUnit 4 test case for the client application:

```java
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.packt.guice.di.consumer.AppConsumer;
import com.packt.guice.di.impl.MockSMSService;
import com.packt.guice.di.service.NotificationService;

public class NotificationClientTest {

    private Injector injector;

    @Before
    public void setUp() throws Exception {
        injector = Guice.createInjector(new AbstractModule() {
            @Override
            protected void configure() {
                bind(NotificationService.class).to(MockSMSService.class);
            }
        });
    }

    @After
    public void tearDown() throws Exception {
        injector = null;
    }

    @Test
    public void test() {
        AppConsumer appTest = injector.getInstance(AppConsumer.class);
        Assert.assertEquals(true, appTest.sendNotification("Hello There", "9898989898"));
    }
}
```

Take note that we are binding the MockSMSService class to NotificationService through an anonymous class implementation of AbstractModule. This is done in the setUp() method, which runs before the test methods run.

Guice dependency injection

Now that we know what dependency injection is, let us explore how Google Guice provides injection. We have seen that the injector helps to resolve dependencies by reading configurations from modules, which are called bindings. The injector prepares object graphs for the requested objects. Dependency injection is managed by injectors using several types of injection:

- Constructor injection
- Method injection
- Field injection
- Optional injection
- Static injection

Constructor injection

Constructor injection can be achieved by using the @Inject annotation at the constructor level. This constructor should accept class dependencies as arguments. The constructor will then assign the arguments to its final fields:

```java
public class AppConsumer {

    private NotificationService notificationService;

    //Constructor level Injection
    @Inject
    public AppConsumer(NotificationService service){
        this.notificationService=service;
    }

    public boolean sendNotification(String message, String recipient){
        //Business logic
        return notificationService.sendNotification(message, recipient);
    }
}
```

If our class does not have a constructor annotated with @Inject, the default no-argument constructor is used. Constructor injection works perfectly when we have a single constructor through which the class accepts its dependencies, and it is helpful for unit testing. It is also simple because Java itself manages constructor invocation, so you don't have to worry about objects arriving in an uninitialized state.

Method injection

Guice allows us to define injection at the method level by annotating methods with the @Inject annotation. This is similar to the setter injection available in Spring. In this approach, dependencies are passed as parameters and are resolved by the injector before invocation of the method.
The name of the method and the number of parameters do not affect method injection:

```java
private NotificationService notificationService;

//Setter Injection
@Inject
public void setService(NotificationService service) {
    this.notificationService = service;
}
```

This can be valuable when we don't want to control the instantiation of classes. We can also use it when a superclass needs some dependencies (which is difficult to achieve with constructor injection).

Field injection

Fields can be injected with the @Inject annotation in Guice. This is a simple and short injection, but it makes the field untestable if used with the private access modifier. It is advisable to avoid the following:

```java
@Inject
private NotificationService notificationService;
```

Optional injection

Guice provides a way to declare an injection as optional. A method or field marked optional causes Guice to quietly ignore it when the dependency isn't available. Optional injection is used by specifying the @Inject(optional=true) annotation:

```java
public class AppConsumer {

    private static final String DEFAULT_MSG = "Hello";
    private String message = DEFAULT_MSG;

    @Inject(optional=true)
    public void setDefaultMessage(@Named("SMS") String message) {
        this.message = message;
    }
}
```

Static injection

Static injection is helpful when we have to migrate a static factory implementation to Guice. It makes it possible for objects to partially take part in dependency injection by gaining access to injected types without being injected themselves. In a module, use requestStaticInjection() to indicate the classes to be injected on injector creation.

For example, NotificationUtil is a utility class that provides a static method, timeZoneFormat, which formats a string in a given format and returns the date and timezone. The time zone format string is hardcoded in NotificationUtil, and we will attempt to inject this utility class statically. Consider that we have one private static String variable, timeZoneFmt, with setter and getter methods. We will use @Inject for the setter injection, using the @Named parameter. NotificationUtil will look like this:

```java
@Inject
static String timeZoneFmt = "yyyy-MM-dd'T'HH:mm:ss";

@Inject
public static void setTimeZoneFmt(@Named("timeZoneFmt") String timeZoneFmt){
    NotificationUtil.timeZoneFmt = timeZoneFmt;
}
```

Now, SMSUtilModule should look like this:

```java
class SMSUtilModule extends AbstractModule{

    @Override
    protected void configure() {
        bindConstant().annotatedWith(Names.named("timeZoneFmt")).to("yyyy-MM-dd'T'HH:mm:ss");
        requestStaticInjection(NotificationUtil.class);
    }
}
```

This API is not recommended for general use since it suffers from many of the same problems as static factories: it is difficult to test, and it makes dependencies opaque.

To sum up what we learned in this tutorial: we began with basic dependency injection and then learned how dependency injection works in Guice, with examples. If you found this post useful, be sure to check out the book 'Java 9 Dependency Injection' to learn more about Google Guice and other concepts in dependency injection.

Learning Dependency Injection (DI)
Angular 2 Dependency Injection: A powerful design pattern


Building Recommendation System with Scala and Apache Spark [Tutorial]

Savia Lobo
08 Sep 2018
12 min read
Recommendation systems can be defined as software applications that draw on and learn from data such as user preferences, actions (clicks, for example), and browsing history to generate recommendations, which are products that the system determines will appeal to the user in the immediate future. In this tutorial, we will learn to build a recommendation system with Scala and Apache Spark. This article is an excerpt taken from Modern Scala Projects, written by Ilango Gurusamy.

What does a recommendation system look like?

The following diagram is representative of a typical recommendation system:

Recommendation system

The preceding diagram can be thought of as a recommendation ecosystem, with the recommendation system at the heart of it. This system needs three entities:

- Users
- Products
- Transactions between users and products, where transactions contain feedback from users about products

Implementation and deployment

Implementation is documented in the following subsections. All code is developed in the IntelliJ code editor. The very first step is to create an empty Scala project called Chapter7.

Step 1 – creating the Scala project

Let's create a Scala project called Chapter7 with the following artifacts:

- RecommendationSystem.scala
- RecommendationWrapper.scala

Let's break down the project's structure:

- .idea: Generated IntelliJ configuration files.
- project: Contains build.properties and plugins.sbt.
- project/assembly.sbt: This file specifies the sbt-assembly plugin needed to build a fat JAR for deployment.
- src/main/scala: This folder houses the Scala source files in the com.packt.modern.chapter7 package.
- target: This is where artifacts of the compile process are stored. The generated assembly JAR file goes here.
- build.sbt: This is the main SBT configuration file. Spark and its dependencies are specified here.

At this point, we will start developing code in the IntelliJ code editor. We will start with the RecommendationWrapper Scala file and end with the deployment of the final application JAR into Spark with spark-submit.

Step 2 – creating the RecWrapper trait definition

Let's create the trait definition. The trait will hold the SparkSession variable, schema definitions for the datasets, and methods to build a dataframe:

```scala
trait RecWrapper {
}
```

Next, let's create a schema for past weapon sales orders.

Step 3 – creating a weapon sales orders schema

Let's create a schema for the past sales order dataset:

```scala
val salesOrderSchema: StructType = StructType(Array(
  StructField("sCustomerId", IntegerType, false),
  StructField("sCustomerName", StringType, false),
  StructField("sItemId", IntegerType, true),
  StructField("sItemName", StringType, true),
  StructField("sItemUnitPrice", DoubleType, true),
  StructField("sOrderSize", DoubleType, true),
  StructField("sAmountPaid", DoubleType, true)
))
```

Next, let's create a schema for weapon sales leads.

Step 4 – creating a weapon sales leads schema

Here is a schema definition for the weapon sales lead dataset:

```scala
val salesLeadSchema: StructType = StructType(Array(
  StructField("sCustomerId", IntegerType, false),
  StructField("sCustomerName", StringType, false),
  StructField("sItemId", IntegerType, true),
  StructField("sItemName", StringType, true)
))
```

Next, let's build a weapon sales order dataframe.

Step 5 – building a weapon sales order dataframe

Let's invoke the read method on our SparkSession instance and cache it.
We will call this method later from the RecSystem object:

```scala
def buildSalesOrders(dataSet: String): DataFrame = {
  session.read
    .format("com.databricks.spark.csv")
    .option("header", true).schema(salesOrderSchema).option("nullValue", "")
    .option("treatEmptyValuesAsNulls", "true")
    .load(dataSet).cache()
}
```

Next up, let's build a sales leads dataframe:

```scala
def buildSalesLeads(dataSet: String): DataFrame = {
  session.read
    .format("com.databricks.spark.csv")
    .option("header", true).schema(salesLeadSchema).option("nullValue", "")
    .option("treatEmptyValuesAsNulls", "true")
    .load(dataSet).cache()
}
```

This completes the trait. Overall, it looks like this:

```scala
trait RecWrapper {
  // 1) Create a lazy SparkSession instance and call it session.
  // 2) Create a schema for the past sales orders dataset.
  // 3) Create a schema for the sales lead dataset.
  // 4) Write a method to create a dataframe that holds past sales order data.
  //    This method takes in the sales order dataset and returns a dataframe.
  // 5) Write a method to create a dataframe that holds sales lead data.
}
```

Bring in the following imports:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
```

Create a Scala object called RecSystem:

```scala
object RecSystem extends App with RecWrapper {
}
```

Before going any further, bring in the following imports:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
```

Inside this object, start by loading the past sales order data. This will be our training data. Load the sales order dataset as follows:

```scala
val salesOrdersDf = buildSalesOrders("sales\\PastWeaponSalesOrders.csv")
```

Verify the schema. This is what the schema looks like:

```scala
salesOrdersDf.printSchema()
```

```
root
 |-- sCustomerId: integer (nullable = true)
 |-- sCustomerName: string (nullable = true)
 |-- sItemId: integer (nullable = true)
 |-- sItemName: string (nullable = true)
 |-- sItemUnitPrice: double (nullable = true)
 |-- sOrderSize: double (nullable = true)
 |-- sAmountPaid: double (nullable = true)
```

Here is a partial view of a dataframe displaying past weapon sales order data:

Partial view of dataframe displaying past weapon sales order data

Now, we have what we need to create a dataframe of ratings:

```scala
val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder =>
  Rating(
    salesOrder.getInt(0),
    salesOrder.getInt(2),
    salesOrder.getDouble(6)
  )
).toDF("user", "item", "rating")
```

Save all your files and compile the project at the command line:

```
C:\Path\To\Your\Project\Chapter7>sbt compile
```

You are likely to run into the following error:

```
[error] C:\Path\To\Your\Project\Chapter7\src\main\scala\com\packt\modern\chapter7\RecSystem.scala:50:50: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder =>
[error]                                               ^
[error] two errors found
[error] (compile:compileIncremental) Compilation failed
```

To fix this, place the following import statement above the declaration of the ratings dataframe. It should look like this:

```scala
import session.implicits._

val ratingsDf: DataFrame = salesOrdersDf.map( salesOrder =>
  UserRating(
    salesOrder.getInt(0),
    salesOrder.getInt(2),
    salesOrder.getDouble(6)
  )
).toDF("user", "item", "rating")
```

Save and recompile the project. This time, it compiles just fine. Next, import the Rating class from the org.apache.spark.mllib.recommendation package.
This transforms the ratings dataframe that we obtained previously into its RDD equivalent:

```scala
val ratings: RDD[Rating] = ratingsDf.rdd.map( row =>
  Rating(
    row.getInt(0),
    row.getInt(1),
    row.getDouble(2)
  )
)

println("Ratings RDD is: " + ratings.take(10).mkString(" "))
```

The next few lines of code are very important. We will be using the ALS algorithm from Spark MLlib to create and train a MatrixFactorizationModel, which takes an RDD[Rating] object as input. The ALS train method may require a combination of the following training hyperparameters:

- numBlocks: Preset to -1 in an auto-configuration setting. This parameter is meant to parallelize computation.
- custRank: The number of features, otherwise known as latent factors.
- iterations: This parameter represents the number of iterations for ALS to execute. For a reasonable solution to converge on, this algorithm needs roughly 20 iterations or fewer.
- regParam: The regularization parameter.
- implicitPrefs: This hyperparameter is a specifier. It lets us use either explicit feedback or implicit feedback.
- alpha: This is a hyperparameter connected to the implicit feedback variant of the ALS algorithm. Its role is to govern the baseline confidence in preference observations.

We just explained the role played by each parameter needed by the ALS algorithm's train method. Let's get started by bringing in the following import:

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
```

Now, let's get down to training the matrix factorization model using the ALS algorithm. We will train a matrix factorization model given an RDD of ratings by customers (users) for certain items (products). Our call to the ALS algorithm's train method takes the following four parameters:

- Ratings
- A rank
- A number of iterations
- A lambda value, or regularization parameter

```scala
val ratingsModel: MatrixFactorizationModel = ALS.train(ratings,
  6,   /* THE RANK */
  10,  /* Number of iterations */
  15.0 /* Lambda, or regularization parameter */
)
```
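The excerpt lists implicitPrefs and alpha among the hyperparameters but only demonstrates explicit-feedback training. If the feedback were implicit (clicks or order counts rather than explicit ratings), MLlib's ALS object provides a trainImplicit method instead; the following is a minimal sketch with illustrative parameter values, not the book's code:

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Implicit-feedback variant: rank, iterations, lambda, and alpha are illustrative only.
def trainImplicitModel(ratings: RDD[Rating]): MatrixFactorizationModel =
  ALS.trainImplicit(ratings,
    6,    // rank (latent factors)
    10,   // iterations
    0.01, // lambda (regularization)
    1.0   // alpha (confidence in implicit preference observations)
  )
```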
Next, we load the sales lead file and convert it into a tuple format:

```scala
val weaponSalesLeadDf = buildSalesLeads("sales\\ItemSalesLeads.csv")
```

In the next section, we will display the new weapon sales lead dataframe.

Step 6 – displaying the weapon sales lead dataframe

First, we must invoke the show method:

```scala
println("Weapons Sales Lead dataframe is: ")
weaponSalesLeadDf.show
```

Here is a view of the weapon sales lead dataframe:

View of weapon sales lead dataframe

Next, create a version of the sales lead dataframe structured as (customer, item) tuples:

```scala
val customerWeaponsSystemPairDf: DataFrame = weaponSalesLeadDf.map(salesLead => (
  salesLead.getInt(0),
  salesLead.getInt(2)
)).toDF("user", "item")
```

In the next section, let's display the dataframe that we just created.

Step 7 – displaying the customer-weapons-system dataframe

Let's call the show method, as follows:

```scala
println("The Customer-Weapons System dataframe as tuple pairs looks like: ")
customerWeaponsSystemPairDf.show
```

Here is a screenshot of the new customer-weapons-system dataframe as tuple pairs:

New customer-weapons-system dataframe as tuple pairs

Next, we will convert the preceding dataframe into an RDD:

```scala
val customerWeaponsSystemPairRDD: RDD[(Int, Int)] = customerWeaponsSystemPairDf.rdd.map(row =>
  (row.getInt(0), row.getInt(1))
)
/* Note: as far as the algorithm is concerned, a customer corresponds to a "user",
   and a product or item corresponds to a "weapons system". */
```

We previously created a MatrixFactorizationModel that we trained with the weapon sales orders dataset. We are now in a position to predict how each customer country may rate a weapon system in the future. In the next section, we will generate predictions.

Step 8 – generating predictions

Here is how we will generate predictions. The predict method of our model is designed to do just that. It will generate a predictions RDD that we call weaponRecs. It represents the ratings of weapons systems that were not previously rated by the customer nations listed in the past sales order data:

```scala
val weaponRecs: RDD[Rating] = ratingsModel.predict(customerWeaponsSystemPairRDD).distinct()
```

Next up, we will display the final predictions.

Step 9 – displaying predictions

Here is how to display the predictions, lined up in tabular format:

```scala
println("Future ratings are: " + weaponRecs.foreach(rating => {
  println("Customer: " + rating.user + " Product: " + rating.product + " Rating: " + rating.rating)
}))
```

The following table displays how each nation is expected to rate a certain system in the future, that is, a weapon system that they did not rate earlier:

System rating by each nation

Our recommendation system proved itself capable of generating future predictions. Up until now, we did not say how all of the preceding code is compiled and deployed. We will look at this in the next section.

Compilation and deployment

Compiling the project

Invoke sbt compile at the root folder of your Chapter7 project. You should get the following output:

Output on compiling the project

Besides loading build.sbt, the compile task also loads settings from assembly.sbt, which we will create below.

What is an assembly.sbt file?

We have not yet talked about the assembly.sbt file. Our Scala-based Spark application is a Spark job that will be submitted to a (local) Spark cluster as a JAR file. This file, apart from the Spark libraries, also needs other dependencies in it for our recommendation system job to complete successfully. The name fat JAR comes from all dependencies being bundled into one JAR. To build such a fat JAR, we need the sbt-assembly plugin. This explains the need for creating a new assembly.sbt and the assembly plugin.

Creating assembly.sbt

Create a new assembly.sbt in your IntelliJ project view and save it under your project folder, as follows:

Creating assembly.sbt

Contents of assembly.sbt

Paste the contents into the newly created assembly.sbt (under the project folder). The result should look like this:

Output on placing contents of assembly.sbt

The sbt-assembly plugin, version 0.14.7, gives us the ability to run an sbt assembly task. With that, we are one step closer to building a fat (or uber) JAR. This action is documented in the next step.

Running the sbt assembly task

Issue the sbt assembly command, as follows:

Running the sbt assembly command

This time, the assembly task loads the assembly plugin in assembly.sbt. However, the assembly halts because of a common duplicate error. This error arises from several duplicates, that is, multiple copies of dependency files that need to be removed before the assembly task can complete successfully. To address this situation, build.sbt needs an upgrade.

Upgrading the build.sbt file

The following lines of code need to be added, as follows:

Code lines for upgrading the build.sbt file

To test the effect of your changes, save the file and go to the command line to reissue the sbt assembly task.

Rerunning the assembly command

Run the assembly task, as follows:

Rerunning the assembly task

This time, the settings in the assembly.sbt file are loaded, and the task completes successfully.
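The assembly.sbt contents and the build.sbt additions appear only as screenshots in the excerpt, so they are not reproduced above. As an illustration of what they typically contain (an assumption based on the sbt-assembly version named above, not necessarily the book's exact settings), the plugin declaration and a duplicate-resolving merge strategy usually look like this:

```scala
// project/assembly.sbt -- pulls in the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.7")
```

```scala
// build.sbt additions -- resolve duplicate files when building the fat JAR
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x                             => MergeStrategy.first
}
```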
To verify, drill down to the target folder. If everything went well, you should see a fat JAR, as follows:

Output as a JAR file

The JAR file under the target folder is the recommendation system application's JAR file that needs to be deployed into Spark. This is documented in the next step.

Deploying the recommendation application

The spark-submit command is how we will deploy the application into Spark. Here are two formats for the spark-submit command; the first is a long one that sets more parameters than the second:

```
spark-submit --class "com.packt.modern.chapter7.RecSystem" --master local[2] --deploy-mode client --driver-memory 16g --num-executors 2 --executor-memory 2g --executor-cores 2 <path-to-jar>
```

Leaning on the preceding format, let's submit our Spark job, supplying various parameters to it:

Parameters for Spark

The different parameters are explained as follows:

Tabular explanation of parameters for Spark Job

We used Spark's support for recommendations to build a prediction model that generated recommendations, and we leveraged Spark's alternating least squares algorithm to implement our collaborative filtering recommendation system. If you've enjoyed reading this post, do check out the book Modern Scala Projects to gain insights into data that will help organizations have a strategic and competitive advantage.

How to Build a music recommendation system with PageRank Algorithm
Recommendation Systems
Building A Recommendation System with Azure


Building a Twitter news bot using Twitter API [Tutorial]

Bhagyashree R
07 Sep 2018
11 min read
This article is an excerpt from a book written by Srini Janarthanam titled Hands-On Chatbots and Conversational UI Development. In this article, we will explore the Twitter API and build core modules for tweeting, searching, and retweeting. We will further explore a data source for news around the globe and build a simple bot that tweets top news on its timeline.

Getting started with the Twitter app

To get started, let us explore the Twitter developer platform. Let us begin by building a Twitter app and later explore how we can tweet news articles to followers based on their interests:

1. Log on to Twitter. If you don't have an account on Twitter, create one.
2. Go to Twitter Apps, which is Twitter's application management dashboard, and click the Create New App button.
3. Create an application by filling in the form, providing a name, a description, and a website (a fully-qualified URL).
4. Read and agree to the Developer Agreement and hit Create your Twitter application.
5. You will now see your application dashboard. Explore the tabs, then click Keys and Access Tokens.
6. Copy the consumer key and consumer secret and hang on to them.
7. Scroll down to Your Access Token and click Create my access token to create a new token for your app.
8. Copy the Access Token and Access Token Secret and hang on to them.

Now, we have all the keys and tokens we need to create a Twitter app.

Building your first Twitter bot

Let's build a simple Twitter bot. This bot will listen to tweets and pick out those that have a particular hashtag. All the tweets with a given hashtag will be printed on the console. This is a very simple bot to help us get started. In the following sections, we will explore more complex bots. To follow along, you can download the code from the book's GitHub repository.

1. Go to the root directory and create a new Node.js program using npm init.
2. Execute the npm install twitter --save command to install the Twitter Node.js library.
3. Run npm install request --save to install the Request library as well. We will use this later to make HTTP GET requests to a news data source.
4. Explore your package.json file in the root directory:

```json
{
  "name": "twitterbot",
  "version": "1.0.0",
  "description": "my news bot",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "request": "^2.81.0",
    "twitter": "^1.7.1"
  }
}
```

5. Create an index.js file with the following code:

```javascript
//index.js
var TwitterPackage = require('twitter');
var request = require('request');

console.log("Hello World! I am a twitter bot!");

var secret = {
  consumer_key: 'YOUR_CONSUMER_KEY',
  consumer_secret: 'YOUR_CONSUMER_SECRET',
  access_token_key: 'YOUR_ACCESS_TOKEN_KEY',
  access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
}
var Twitter = new TwitterPackage(secret);
```

In the preceding code, put the keys and tokens you saved into their appropriate variables. We don't need the request package just yet, but we will later.

6. Now let's create a hashtag listener to listen to the tweets on a specific hashtag:

```javascript
//Twitter stream
var hashtag = '#brexit'; //put any hashtag to listen to, e.g. #brexit
console.log('Listening to:' + hashtag);

Twitter.stream('statuses/filter', {track: hashtag}, function(stream) {
  stream.on('data', function(tweet) {
    console.log('Tweet:@' + tweet.user.screen_name + '\t' + tweet.text);
    console.log('------');
  });

  stream.on('error', function(error) {
    console.log(error);
  });
});
```

Replace #brexit with the hashtag you want to listen to. Use a popular one so that you can see the code in action.
Run the index.js file with the node index.js command. You will see a stream of tweets from Twitter users all over the globe who used the hashtag.

Congratulations! You have built your first Twitter bot.

Exploring the Twitter SDK

In the previous section, we explored how to listen to tweets based on hashtags. Let's now explore the Twitter SDK to understand the capabilities that we can bestow upon our Twitter bot.

Updating your status

You can update the status on your Twitter timeline by using the following status update module code:

```javascript
tweet('I am a Twitter Bot!', null, null);

function tweet(statusMsg, screen_name, status_id){
  console.log('Sending tweet to: ' + screen_name);
  console.log('In response to:' + status_id);
  var msg = statusMsg;
  if (screen_name != null){
    msg = '@' + screen_name + ' ' + statusMsg;
  }
  console.log('Tweet:' + msg);
  Twitter.post('statuses/update', {
    status: msg
  }, function(err, response) {
    // if there was an error while tweeting
    if (err) {
      console.log('Something went wrong while TWEETING...');
      console.log(err);
    } else if (response) {
      console.log('Tweeted!!!');
      console.log(response);
    }
  });
}
```

Comment out the hashtag listener code, add the preceding status update code instead, and run it. When run, your bot will post a tweet on your timeline.

In addition to tweeting on your timeline, you can also tweet in response to another tweet (or status update). The screen_name argument is used to create a response tweet; screen_name is the name of the user who posted the tweet. We will explore this a bit later.

Retweet to your followers

You can retweet a tweet to your followers using the following retweet status code:

```javascript
var retweetId = '899681279343570944';
retweet(retweetId);

function retweet(retweetId){
  Twitter.post('statuses/retweet/', {
    id: retweetId
  }, function(err, response) {
    if (err) {
      console.log('Something went wrong while RETWEETING...');
      console.log(err);
    } else if (response) {
      console.log('Retweeted!!!');
      console.log(response);
    }
  });
}
```

Searching for tweets

You can also search for recent or popular tweets with hashtags using the following search hashtags code:

```javascript
search('#brexit', 'popular');

function search(hashtag, resultType){
  var params = {
    q: hashtag, // REQUIRED
    result_type: resultType,
    lang: 'en'
  }

  Twitter.get('search/tweets', params, function(err, data) {
    if (!err) {
      console.log('Found tweets: ' + data.statuses.length);
      console.log('First one: ' + data.statuses[1].text);
    } else {
      console.log('Something went wrong while SEARCHING...');
    }
  });
}
```

Exploring a news data service

Let's now build a bot that will tweet news articles to its followers at regular intervals. We will then extend it to be personalized by users through a conversation that happens over direct messaging with the bot. In order to build a news bot, we need a source from which we can get news articles. We are going to explore a news service called NewsAPI.org in this section. News API is a service that aggregates news articles from roughly 70 newspapers around the globe.

Setting up News API

Let us set up an account with the News API data service and get the API key:

1. Go to NewsAPI.org.
2. Click Get API key.
3. Register using your email.
4. Get your API key.
5. Explore the sources: https://newsapi.org/v1/sources?apiKey=YOUR_API_KEY

There are about 70 sources from across the globe, including popular ones such as BBC News, Associated Press, Bloomberg, and CNN. You might notice that each source has a category tag attached.
The possible options are: business, entertainment, gaming, general, music, politics, science-and-nature, sport, and technology. You might also notice that each source also has language (en, de, fr) and country (au, de, gb, in, it, us) tags. The following is the information on the BBC-News source: { "id": "bbc-news", "name": "BBC News", "description": "Use BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.", "url": "http://www.bbc.co.uk/news", "category": "general", "language": "en", "country": "gb", "urlsToLogos": { "small": "", "medium": "", "large": "" }, "sortBysAvailable": [ "top" ] } Get sources for a specific category, language, or country using: https://newsapi.org/v1/sources?category=business&apiKey=YOUR_API_KEY The following is the part of the response to the preceding query asking for all sources under the business category: "sources": [ { "id": "bloomberg", "name": "Bloomberg", "description": "Bloomberg delivers business and markets news, data, analysis, and video to the world, featuring stories from Businessweek and Bloomberg News.", "url": "http://www.bloomberg.com", "category": "business", "language": "en", "country": "us", "urlsToLogos": { "small": "", "medium": "", "large": "" }, "sortBysAvailable": [ "top" ] }, { "id": "business-insider", "name": "Business Insider", "description": "Business Insider is a fast-growing business site with deep financial, media, tech, and other industry verticals. Launched in 2007, the site is now the largest business news site on the web.", "url": "http://www.businessinsider.com", "category": "business", "language": "en", "country": "us", "urlsToLogos": { "small": "", "medium": "", "large": "" }, "sortBysAvailable": [ "top", "latest" ] }, ... ] Explore the articles: https://newsapi.org/v1/articles?source=bbc-news&apiKey=YOUR_API_KEY The following is the sample response: "articles": [ { "author": "BBC News", "title": "US Navy collision: Remains found in hunt for missing sailors", "description": "Ten US sailors have been missing since Monday's collision with a tanker near Singapore.", "url": "http://www.bbc.co.uk/news/world-us-canada-41013686", "urlToImage": "https://ichef1.bbci.co.uk/news/1024/cpsprodpb/80D9/ production/_97458923_mediaitem97458918.jpg", "publishedAt": "2017-08-22T12:23:56Z" }, { "author": "BBC News", "title": "Afghanistan hails Trump support in 'joint struggle'", "description": "President Ghani thanks Donald Trump for supporting Afghanistan's battle against the Taliban.", "url": "http://www.bbc.co.uk/news/world-asia-41012617", "urlToImage": "https://ichef.bbci.co.uk/images/ic/1024x576/p05d08pf.jpg", "publishedAt": "2017-08-22T11:45:49Z" }, ... ] For each article, the author, title, description, url, urlToImage,, and publishedAt fields are provided. Now that we have explored a source of news data that provides up-to-date news stories under various categories, let us go on to build a news bot. Building a Twitter news bot Now that we have explored News API, a data source for the latest news updates, and a little bit of what the Twitter API can do, let us combine them both to build a bot tweeting interesting news stories, first on its own timeline and then specifically to each of its followers: Let's build a news tweeter module that tweets the top news article given the source. 
The following code uses the tweet() function we built earlier:

topNewsTweeter('cnn', null);

function topNewsTweeter(newsSource, screen_name, status_id){
  request({
    url: 'https://newsapi.org/v1/articles?source=' + newsSource + '&apiKey=YOUR_API_KEY',
    method: 'GET'
  }, function (error, response, body) {
    //response is from the News API
    if (!error && response.statusCode == 200) {
      var botResponse = JSON.parse(body);
      console.log(botResponse);
      tweetTopArticle(botResponse.articles, screen_name);
    } else {
      console.log('Sorry. No news.');
    }
  });
}

function tweetTopArticle(articles, screen_name, status_id){
  var article = articles[0];
  tweet(article.title + " " + article.url, screen_name);
}

Run the preceding program to fetch news from CNN and post the topmost article on Twitter. Here is the post on Twitter:

Now, let us build a module that tweets news stories from a randomly-chosen source in a list of sources (multiplying by the array length keeps the random index within bounds, since Math.random() returns a value strictly less than 1):

function tweetFromRandomSource(sources, screen_name, status_id){
  var max = sources.length;
  var randomSource = sources[Math.floor(Math.random() * max)];
  topNewsTweeter(randomSource, screen_name, status_id);
}

Let's call the tweeting module after we acquire the list of sources:

function getAllSourcesAndTweet(){
  var sources = [];
  console.log('getting sources...')
  request({
    url: 'https://newsapi.org/v1/sources?apiKey=YOUR_API_KEY',
    method: 'GET'
  }, function (error, response, body) {
    //response is from the News API
    if (!error && response.statusCode == 200) {
      // Print out the response body
      var botResponse = JSON.parse(body);
      for (var i = 0; i < botResponse.sources.length; i++){
        console.log('adding.. ' + botResponse.sources[i].id)
        sources.push(botResponse.sources[i].id)
      }
      tweetFromRandomSource(sources, null, null);
    } else {
      console.log('Sorry. No news sources!');
    }
  });
}

Let's create a new JS file called tweeter.js. In the tweeter.js file, call getAllSourcesAndTweet() to get the process started:

//tweeter.js
var TwitterPackage = require('twitter');
var request = require('request');
console.log("Hello World! I am a twitter bot!");

var secret = {
  consumer_key: 'YOUR_CONSUMER_KEY',
  consumer_secret: 'YOUR_CONSUMER_SECRET',
  access_token_key: 'YOUR_ACCESS_TOKEN_KEY',
  access_token_secret: 'YOUR_ACCESS_TOKEN_SECRET'
}
var Twitter = new TwitterPackage(secret);

getAllSourcesAndTweet();

Run the tweeter.js file on the console. This bot will tweet a news story every time it is called, choosing the top story from one of the roughly 70 news sources at random. Hurray! You have built your very own Twitter news bot.

In this tutorial, we have covered a lot. We started off with the Twitter API and got a taste of how we can automatically tweet, retweet, and search for tweets using hashtags. We then explored a news source API that provides news articles from about 70 different newspapers, and integrated it with our Twitter bot to create a news-tweeting bot. If you found this post useful, do check out the book, Hands-On Chatbots and Conversational UI Development, which will help you explore the world of conversational user interfaces.

Build and train an RNN chatbot using TensorFlow [Tutorial]
Building a two-way interactive chatbot with Twilio: A step-by-step guide
How to create a conversational assistant or chatbot using Python

Classifying flowers in Iris Dataset using Scala [Tutorial]

Savia Lobo
06 Sep 2018
15 min read
The Iris dataset is the simplest, yet the most famous data analysis task in the ML space. In this article, you will build a solution for data analysis & classification task from an Iris dataset using Scala. This article is an excerpt taken from Modern Scala Projects written by Ilango Gurusamy. The following diagrams together help in understanding the different components of this project. That said, this pipeline involves training (fitting), transformation, and validation operations. More than one model is trained and the best model (or mapping function) is selected to give us an accurate approximation predicting the species of an Iris flower (based on measurements of those flowers): Project block diagram A breakdown of the project block diagram is as follows: Spark, which represents the Spark cluster and its ecosystem Training dataset Model Dataset attributes or feature measurements An inference process, that produces a prediction column The following diagram represents a more detailed description of the different phases in terms of the functions performed in each phase. Later we will come to visualize pipeline in terms of its constituent stages. For now, the diagram depicts four stages, starting with a data pre-processing phase, which is considered separate from the numbered phases deliberately. Think of the pipeline as a two-step process:  A data cleansing phase, or pre-processing phase. An important phase that could include a subphase of Exploratory Data Analysis (EDA) (not explicitly depicted in the latter diagram). A data analysis phase that begins with Feature Extraction, followed by Model Fitting, and Model validation, all the way to deployment of an Uber pipeline JAR into Spark: Pipeline diagram Referring to the preceding diagram, the first implementation objective is to set up Spark inside an SBT project. An SBT project is a self-contained application, which we can run on the command line to predict Iris labels. In the SBT project,  dependencies are specified in a build.sbt file and our application code will create its  own  SparkSession and SparkContext. So that brings us to a listing of implementation objectives and these are as follows: Get the Iris dataset from the UCI Machine Learning Repository Conduct preliminary EDA in the Spark shell Create a new Scala project in IntelliJ, and carry out all implementation steps, until the evaluation of the Random Forest classifier Deploy the application to your local Spark cluster Step 1# Getting the Iris dataset from the UCI Machine Learning Repository Head over to the UCI Machine Learning Repository website at https://archive.ics.uci.edu/ml/datasets/iris and click on Download: Data Folder. Extract this folder someplace convenient and copy over iris.csv into the root of your project folder. You may refer back to the project overview for an in-depth description of the Iris dataset. We depict the contents of the iris.csv file here, as follows: A snapshot of the Iris dataset with 150 sets You may recall that the iris.csv file is a 150-row file, with comma-separated values. Now that we have the dataset, the first step will be performing EDA on it. The Iris dataset is multivariate, meaning there is more than one (independent) variable, so we will carry out a basic multivariate EDA on it. But we need DataFrame to let us do that. How we create a dataframe as a prelude to EDA is the goal of the next section. Step 2# Preliminary EDA Before we get down to building the SBT pipeline project, we will conduct a preliminary EDA in spark-shell. 
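As a rough preview of what that looks like (a sketch of our own, not the book's exact listing), the basic statistics take only a few lines once spark-shell is up, because the shell already provides a SparkSession as spark. The book builds its DataFrame by hand from an RDD in the steps that follow; the shortcut below simply assumes iris.csv has a header row and sits in the current directory:

// inside spark-shell
val irisDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("iris.csv")

irisDF.printSchema()     // confirm the four numeric feature columns plus the species label
irisDF.describe().show() // count, mean, stddev, min and max for each column

The next paragraphs walk through the same idea step by step and then port it into the SBT project.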
The plan is to derive a dataframe out of the dataset and then calculate basic statistics on it. We have three tasks at hand for spark-shell: Fire up spark-shell Load the iris.csv file and build DataFrame Calculate the statistics We will then port that code over to a Scala file inside our SBT project. That said, let's get down to loading the iris.csv file (inputting the data source) before eventually building DataFrame. Step 3# Creating an SBT project Lay out your SBT project in a folder of your choice and name it IrisPipeline or any name that makes sense to you. This will hold all of our files needed to implement and run the pipeline on the Iris dataset. The structure of our SBT project looks like the following: Project structure We will list dependencies in the build.sbt file. This is going to be an SBT project. Hence, we will bring in the following key libraries: Spark Core Spark MLlib Spark SQL The following screenshot illustrates the build.sbt file: The build.sbt file with Spark dependencies The build.sbt file referenced in the preceding snapshot is readily available for you in the book's download bundle. Drill down to the folder Chapter01 code under ModernScalaProjects_Code and copy the folder over to a convenient location on your computer. Drop the iris.csv file that you downloaded in Step 1 – getting the Iris dataset from the UCI Machine Learning Repository into the root folder of our new SBT project. Refer to the earlier screenshot that depicts the updated project structure with the iris.csv file inside of it. Step 4# Creating Scala files in SBT project Step 4 is broken down into the following steps: Create the Scala file iris.scala in the com.packt.modern.chapter1 package. Up until now, we relied on SparkSession and SparkContext, which spark-shell gave us. This time around, we need to create SparkSession, which will, in turn, give us SparkContext. What follows is how the code is laid out in the iris.scala file. In iris.scala, after the package statement, place the following import statements: import org.apache.spark.sql.SparkSession Create SparkSession inside a trait, which we shall call IrisWrapper: lazy val session: SparkSession = SparkSession.builder().getOrCreate() Just one SparkSession is made available to all classes extending from IrisWrapper. Create val to hold the iris.csv file path: val dataSetPath = "<<path to folder containing your iris.csv file>>\\iris.csv" Create a method to build DataFrame. This method takes in the complete path to the Iris dataset path as String and returns DataFrame: def buildDataFrame(dataSet: String): DataFrame = { /* The following is an example of a dataSet parameter string: "C:\\Your\\Path\\To\\iris.csv" */ Import the DataFrame class by updating the previous import statement for SparkSession: import org.apache.spark.sql.{DataFrame, SparkSession} Create a nested function inside buildDataFrame to process the raw dataset. Name this function getRows. getRows which takes no parameters but returns Array[(Vector, String)]. The textFile method on the SparkContext variable processes the iris.csv into RDD[String]: val result1: Array[String] = session.sparkContext.textFile(<<path to iris.csv represented by the dataSetPath variable>>) The resulting RDD contains two partitions. Each partition, in turn, contains rows of strings separated by a newline character, '\n'. Each row in the RDD represents its original counterpart in the raw data. In the next step, we will attempt several data transformation steps. 
We start by applying a flatMap operation over the RDD, culminating in the DataFrame creation. DataFrame is a view over Dataset, which happens to the fundamental data abstraction unit in the Spark 2.0 line. Step 5# Preprocessing, data transformation, and DataFrame creation We will get started by invoking flatMap, by passing a function block to it, and successive transformations listed as follows, eventually resulting in Array[(org.apache.spark.ml.linalg.Vector, String)]. A vector represents a row of feature measurements. The Scala code to give us Array[(org.apache.spark.ml.linalg.Vector, String)] is as follows: //Each line in the RDD is a row in the Dataset represented by a String, which we can 'split' along the new //line character val result2: RDD[String] = result1.flatMap { partition => partition.split("\n").toList } //the second transformation operation involves a split inside of each line in the dataset where there is a //comma separating each element of that line val result3: RDD[Array[String]] = result2.map(_.split(",")) Next, drop the header column, but not before doing a collection that returns an Array[Array[String]]: val result4: Array[Array[String]] = result3.collect.drop(1) The header column is gone; now import the Vectors class: import org.apache.spark.ml.linalg.Vectors Now, transform Array[Array[String]] into Array[(Vector, String)]: val result5 = result4.map(row => (Vectors.dense(row(1).toDouble, row(2).toDouble, row(3).toDouble, row(4).toDouble),row(5))) Step 6# Creating, training, and testing data Now, let's split our dataset in two by providing a random seed: val splitDataSet: Array[org.apache.spark.sql.Dataset [org.apache.spark.sql.Row]] = dataSet.randomSplit(Array(0.85, 0.15), 98765L) Now our new splitDataset contains two datasets: Train dataset: A dataset containing Array[(Vector, iris-species-label-column: String)] Test dataset: A dataset containing Array[(Vector, iris-species-label-column: String)] Confirm that the new dataset is of size 2: splitDataset.size res48: Int = 2 Assign the training dataset to a variable, trainSet: val trainDataSet = splitDataSet(0) trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string] Assign the testing dataset to a variable, testSet: val testDataSet = splitDataSet(1) testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [iris-features-column: vector, iris-species-label-column: string] Count the number of rows in the training dataset: trainSet.count res12: Long = 14 Count the number of rows in the testing dataset: testSet.count res9: Long = 136 There are 150 rows in all. Step 7# Creating a Random Forest classifier In reference to Step 5 - DataFrame Creation. This DataFrame 'dataFrame' contains column names that corresponds to the columns present in the DataFrame produced in that step The first step to create a classifier is to  pass into it (hyper) parameters. 
A fairly comprehensive list of parameters look like this: From 'dataFrame' we need the Features column name - iris-features-column From 'dataFrame' we also need the Indexed label column name - iris-species-label-column The sqrt setting for featureSubsetStrategy Number of features to be considered per split (we have 150 observations and four features that will make our max_features value 2) Impurity settings—values can be gini and entropy Number of trees to train (since the number of trees is greater than one, we set a tree maximum depth), which is a number equal to the number of nodes The required minimum number of feature measurements (sampled observations), also known as the minimum instances per node Look at the IrisPipeline.scala file for values of each of these parameters. But this time, we will employ an exhaustive grid search-based model selection process based on combinations of parameters, where parameter value ranges are specified. Create a randomForestClassifier instance. Set the features and featureSubsetStrategy: val randomForestClassifier = new RandomForestClassifier() .setFeaturesCol(irisFeatures_CategoryOrSpecies_IndexedLabel._1) .setFeatureSubsetStrategy("sqrt") Start building Pipeline, which has two stages, Indexer and Classifier: val irisPipeline = new Pipeline().setStages(Array[PipelineStage](indexer) ++ Array[PipelineStage](randomForestClassifier)) Next, set the hyperparameter num_trees (number of trees) on the classifier to 15, a Max_Depth parameter, and an impurity with two possible values of gini and entropy. Build a parameter grid with all three hyperparameters: val finalParamGrid: Array[ParamMap] = gridBuilder3.build() Step 8# Training the Random Forest classifier Next, we want to split our training set into a validation set and a training set: val validatedTestResults: DataFrame = new TrainValidationSplit() On this variable, set Seed, set EstimatorParamMaps, set Estimator with irisPipeline, and set a training ratio to 0.8: val validatedTestResults: DataFrame = new TrainValidationSplit().setSeed(1234567L).setEstimator(irisPipeline) Finally, do a fit and a transform with our training dataset and testing dataset. Great! Now the classifier is trained. In the next step, we will apply this classifier to testing the data. Step 9# Applying the Random Forest classifier to test data The purpose of our validation set is to be able to make a choice between models. We want an evaluation metric and hyperparameter tuning. We will now create an instance of a validation estimator called TrainValidationSplit, which will split the training set into a validation set and a training set: val validatedTestResults.setEvaluator(new MulticlassClassificationEvaluator()) Next, we fit this estimator over the training dataset to produce a model and a transformer that we will use to transform our testing dataset. Finally, we perform a validation for hyperparameter tuning by applying an evaluator for a metric. 
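Since the TrainValidationSplit configuration above appears in fragments, it may help to see the pieces pulled together. The following is our own reconstruction, not the book's exact listing; it reuses the seed, the irisPipeline and finalParamGrid built earlier, the 0.8 training ratio, and the trainDataSet and testDataSet from Step 6:

import org.apache.spark.ml.tuning.TrainValidationSplit
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val trainValidationSplit = new TrainValidationSplit()
  .setSeed(1234567L)
  .setEstimator(irisPipeline)                    // the two-stage Pipeline from Step 7
  .setEstimatorParamMaps(finalParamGrid)         // the hyperparameter grid from Step 7
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setTrainRatio(0.8)

// fit() searches the grid on the training split; transform() applies the best model found
val validationModel = trainValidationSplit.fit(trainDataSet)
val validatedTestResults: DataFrame = validationModel.transform(testDataSet)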
The new ValidatedTestResults DataFrame should look something like this: --------+ |iris-features-column|iris-species-column|label| rawPrediction| probability|prediction| +--------------------+-------------------+-----+--------------------+ | [4.4,3.2,1.3,0.2]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0| | [5.4,3.9,1.3,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0| | [5.4,3.9,1.7,0.4]| Iris-setosa| 0.0| [40.0,0.0,0.0]| [1.0,0.0,0.0]| 0.0| Let's return a new dataset by passing in column expressions for prediction and label: val validatedTestResultsDataset:DataFrame = validatedTestResults.select("prediction", "label") In the line of code, we produced a new DataFrame with two columns: An input label A predicted label, which is compared with its corresponding value in the input label column That brings us to the next step, an evaluation step. We want to know how well our model performed. That is the goal of the next step. Step 10# Evaluate Random Forest classifier In this section, we will test the accuracy of the model. We want to know how well our model performed. Any ML process is incomplete without an evaluation of the classifier. That said, we perform an evaluation as a two-step process: Evaluate the model output Pass in three hyperparameters: val modelOutputAccuracy: Double = new MulticlassClassificationEvaluator() Set the label column, a metric name, the prediction column label, and invoke evaluation with the validatedTestResults dataset. Note the accuracy of the model output results on the testing dataset from the modelOutputAccuracy variable. The other metrics to evaluate are how close the predicted label value in the 'predicted' column is to the actual label value in the (indexed) label column. Next, we want to extract the metrics: val multiClassMetrics = new MulticlassMetrics(validatedRDD2) Our pipeline produced predictions. As with any prediction, we need to have a healthy degree of skepticism. Naturally, we want a sense of how our engineered prediction process performed. The algorithm did all the heavy lifting for us in this regard. That said, everything we did in this step was done for the purpose of evaluation. Who is being evaluated here or what evaluation is worth reiterating? That said, we wanted to know how close the predicted values were compared to the actual label value. To obtain that knowledge, we decided to use the MulticlassMetrics class to evaluate metrics that will give us a measure of the performance of the model via two methods: Accuracy Weighted precision val accuracyMetrics = (multiClassMetrics.accuracy, multiClassMetrics.weightedPrecision) val accuracy = accuracyMetrics._1 val weightedPrecsion = accuracyMetrics._2 These metrics represent evaluation results for our classifier or classification model. In the next step, we will run the application as a packaged SBT application. 
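One detail worth pinning down before packaging the application is where validatedRDD2 comes from. MulticlassMetrics belongs to the older RDD-based MLlib API and expects an RDD of (prediction, label) pairs of doubles, so the two-column validatedTestResultsDataset created above has to be converted first. A minimal sketch of that conversion (our own reconstruction, assuming both columns are doubles) is:

import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD

val validatedRDD2: RDD[(Double, Double)] = validatedTestResultsDataset.rdd.map { row =>
  (row.getDouble(0), row.getDouble(1)) // (prediction, label)
}

val multiClassMetrics = new MulticlassMetrics(validatedRDD2)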
Step 11# Running the pipeline as an SBT application At the root of your project folder, issue the sbt console command, and in the Scala shell, import the IrisPipeline object and then invoke the main method of IrisPipeline with the argument iris: sbt console scala> import com.packt.modern.chapter1.IrisPipeline IrisPipeline.main(Array("iris") Accuracy (precision) is 0.9285714285714286 Weighted Precision is: 0.9428571428571428 Step 12# Packaging the application In the root folder of your SBT application, run: sbt package When SBT is done packaging, the Uber JAR can be deployed into our cluster, using spark-submit, but since we are in standalone deploy mode, it will be deployed into [local]: The application JAR file The package command created a JAR file that is available under the target folder. In the next section, we will deploy the application into Spark. Step 13# Submitting the pipeline application to Spark local At the root of the application folder, issue the spark-submit command with the class and JAR file path arguments, respectively. If everything went well, the application does the following: Loads up the data. Performs EDA. Creates training, testing, and validation datasets. Creates a Random Forest classifier model. Trains the model. Tests the accuracy of the model. This is the most important part—the ML classification task. To accomplish this, we apply our trained Random Forest classifier model to the test dataset. This dataset consists of Iris flower data of so far not seen by the model. Unseen data is nothing but Iris flowers picked in the wild. Applying the model to the test dataset results in a prediction about the species of an unseen (new) flower. The last part is where the pipeline runs an evaluation process, which essentially is about checking if the model reports the correct species. Lastly, pipeline reports back on how important a certain feature of the Iris flower turned out to be. As a matter of fact, the petal width turns out to be more important than the sepal width in carrying out the classification task. Thus we implemented an ML workflow or an ML pipeline. The pipeline combined several stages of data analysis into one workflow. We started by loading the data and from there on, we created training and test data, preprocessed the dataset, trained the RandomForestClassifier model, applied the Random Forest classifier to test data, evaluated the classifier, and computed a process that demonstrated the importance of each feature in the classification. If you've enjoyed reading this post visit the book, Modern Scala Projects to build efficient data science projects that fulfill your software requirements. Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons Introducing Android 9 Pie, filled with machine learning and baked-in UI features Paper in Two minutes: A novel method for resource efficient image classification

Intelligent mobile projects with TensorFlow: Build your first Reinforcement Learning model on Raspberry Pi [Tutorial]

Bhagyashree R
05 Sep 2018
13 min read
OpenAI Gym (https://gym.openai.com) is an open source Python toolkit that offers many simulated environments to help you develop, compare, and train reinforcement learning algorithms, so you don't have to buy all the sensors and train your robot in the real environment, which can be costly in both time and money. In this article, we'll show you how to develop and train a reinforcement learning model on Raspberry Pi using TensorFlow in an OpenAI Gym's simulated environment called CartPole (https://gym.openai.com/envs/CartPole-v0). This tutorial is an excerpt from a book written by Jeff Tang titled Intelligent Mobile Projects with TensorFlow. To install OpenAI Gym, run the following commands: git clone https://github.com/openai/gym.git cd gym sudo pip install -e . You can verify that you have TensorFlow 1.6 and gym installed by running pip list: pi@raspberrypi:~ $ pip list gym (0.10.4, /home/pi/gym) tensorflow (1.6.0) Or you can start IPython then import TensorFlow and gym: pi@raspberrypi:~ $ ipython Python 2.7.9 (default, Sep 17 2016, 20:26:04) IPython 5.5.0 -- An enhanced Interactive Python. In [1]: import tensorflow as tf In [2]: import gym In [3]: tf.__version__ Out[3]: '1.6.0' In [4]: gym.__version__ Out[4]: '0.10.4' We're now all set to use TensorFlow and gym to build some interesting reinforcement learning model running on Raspberry Pi. Understanding the CartPole simulated environment CartPole is an environment that can be used to train a robot to stay in balance. In the CartPole environment, a pole is attached to a cart, which moves horizontally along a track. You can take an action of 1 (accelerating right) or 0 (accelerating left) to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of 1 is provided for every time step that the pole remains upright. An episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. Let's play with the CartPole environment now. First, create a new environment and find out the possible actions an agent can take in the environment: env = gym.make("CartPole-v0") env.action_space # Discrete(2) env.action_space.sample() # 0 or 1 Every observation (state) consists of four values about the cart: its horizontal position, its velocity, its pole's angle, and its angular velocity: obs=env.reset() obs # array([ 0.04052535, 0.00829587, -0.03525301, -0.00400378]) Each step (action) in the environment will result in a new observation, a reward of the action, whether the episode is done (if it is then you can't take any further steps), and some additional information: obs, reward, done, info = env.step(1) obs # array([ 0.04069127, 0.2039052 , -0.03533309, -0.30759772]) Remember action (or step) 1 means moving right, and 0 left. To see how long an episode can last when you keep moving the cart right, run: while not done: obs, reward, done, info = env.step(1) print(obs) #[ 0.08048328 0.98696604 -0.09655727 -1.54009127] #[ 0.1002226 1.18310769 -0.12735909 -1.86127705] #[ 0.12388476 1.37937549 -0.16458463 -2.19063676] #[ 0.15147227 1.5756628 -0.20839737 -2.52925864] #[ 0.18298552 1.77178219 -0.25898254 -2.87789912] Let's now manually go through a series of actions from start to end and print out the observation's first value (the horizontal position) and third value (the pole's angle in degrees from vertical) as they're the two values that determine whether an episode is done. 
First, reset the environment and accelerate the cart right a few times: import numpy as np obs=env.reset() obs[0], obs[2]*360/np.pi # (0.008710582898326602, 1.4858315848689436) obs, reward, done, info = env.step(1) obs[0], obs[2]*360/np.pi # (0.009525842685697472, 1.5936049816642313) obs, reward, done, info = env.step(1) obs[0], obs[2]*360/np.pi # (0.014239775393474322, 1.040038643681757) obs, reward, done, info = env.step(1) obs[0], obs[2]*360/np.pi # (0.0228521194217381, -0.17418034908781568) You can see that the cart's position value gets bigger and bigger as it's moved right, the pole's vertical degree gets smaller and smaller, and the last step shows a negative degree, meaning the pole is going to the left side of the center. All this makes sense, with just a little vivid picture in your mind of your favorite dog pushing a cart with a pole. Now change the action to accelerate the cart left (0) a few times: obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.03536432554326476, -2.0525933052704954) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.04397450935915654, -3.261322987287562) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.04868738508385764, -3.812330822419413) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.04950617929263011, -3.7134404042580687) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.04643238384389254, -2.968245724428785) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.039465670006712444, -1.5760901885345346) You may be surprised at first to see the 0 action causes the positions (obs[0]) to continue to get bigger for several times, but remember that the cart is moving at a velocity and one or several actions of moving the cart to the other direction won't decrease the position value immediately. But if you keep moving the cart to the left, you'll see that the cart's position starts becoming smaller (toward the left). Now continue the 0 action and you'll see the position gets smaller and smaller, with a negative value meaning the cart enters the left side of the center, while the pole's angle gets bigger and bigger: obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.028603948219811447, 0.46789197320636305) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (0.013843572459953138, 3.1726728882727504) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (-0.00482029774222077, 6.551160678086707) obs, reward, done, info = env.step(0) obs[0], obs[2]*360/np.pi # (-0.02739315127299434, 10.619948631208114) For the CartPole environment, the reward value returned in each step call is always 1, and the info is always {}.  So that's all there's to know about the CartPole simulated environment. Now that we understand how CartPole works, let's see what kinds of policies we can come up with so at each state (observation), we can let the policy tell us which action (step) to take in order to keep the pole upright for as long as possible, in other words, to maximize our rewards. Using neural networks to build a better policy Let's first see how to build a random policy using a simple fully connected (dense) neural network, which takes 4 values in an observation as input, uses a hidden layer of 4 neurons, and outputs the probability of the 0 action, based on which, the agent can sample the next action between 0 and 1: To follow along you can download the code files from the book's GitHub repository. 
# nn_random_policy.py import tensorflow as tf import numpy as np import gym env = gym.make("CartPole-v0") num_inputs = env.observation_space.shape[0] inputs = tf.placeholder(tf.float32, shape=[None, num_inputs]) hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu) outputs = tf.layers.dense(hidden, 1, activation=tf.nn.sigmoid) action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) total_rewards = [] for _ in range(1000): rewards = 0 obs = env.reset() while True: a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)}) obs, reward, done, info = env.step(a[0][0]) rewards += reward if done: break total_rewards.append(rewards) print(np.mean(total_rewards)) Note that we use the tf.multinomial function to sample an action based on the probability distribution of action 0 and 1, defined as outputs and 1-outputs, respectively (the sum of the two probabilities is 1). The mean of the total rewards will be around 20-something. This is a neural network that is generating a random policy, with no training at all. To train the network, we use tf.nn.sigmoid_cross_entropy_with_logits to define the loss function between the network output and the desired y_target action, defined using the basic simple policy in the previous subsection, so we expect this neural network policy to achieve about the same rewards as the basic non-neural-network policy: # nn_simple_policy.py import tensorflow as tf import numpy as np import gym env = gym.make("CartPole-v0") num_inputs = env.observation_space.shape[0] inputs = tf.placeholder(tf.float32, shape=[None, num_inputs]) y = tf.placeholder(tf.float32, shape=[None, 1]) hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu) logits = tf.layers.dense(hidden, 1) outputs = tf.nn.sigmoid(logits) action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1) cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits) optimizer = tf.train.AdamOptimizer(0.01) training_op = optimizer.minimize(cross_entropy) with tf.Session() as sess: sess.run(tf.global_variables_initializer()) for _ in range(1000): obs = env.reset() while True: y_target = np.array([[1. if obs[2] < 0 else 0.]]) a, _ = sess.run([action, training_op], feed_dict={inputs: obs.reshape(1, num_inputs), y: y_target}) obs, reward, done, info = env.step(a[0][0]) if done: break print("training done") We define outputs as a sigmoid function of the logits net output, that is, the probability of action 0, and then use the tf.multinomial to sample an action. Note that we use the standard tf.train.AdamOptimizer and its minimize method to train the network. To test and see how good the policy is, run the following code: total_rewards = [] for _ in range(1000): rewards = 0 obs = env.reset() while True: y_target = np.array([1. if obs[2] < 0 else 0.]) a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)}) obs, reward, done, info = env.step(a[0][0]) rewards += reward if done: break total_rewards.append(rewards) print(np.mean(total_rewards)) We're now all set to explore how we can implement a policy gradient method on top of this to make our neural network perform much better, getting rewards several times larger. 
The basic idea of a policy gradient is that in order to train a neural network to generate a better policy, when all an agent knows from the environment is the rewards it can get when taking an action from any given state, we can adopt two new mechanisms: Discounted rewards: Each action's value needs to consider its future action rewards. For example, an action that gets an immediate reward, 1, but ends the episode two actions (steps) later should have fewer long-term rewards than an action that gets an immediate reward, 1, but ends the episode 10 steps later. Test run the current policy and see which actions lead to higher discounted rewards, then update the current policy's gradients (of the loss for weights) with the discounted rewards, in a way that an action with higher discounted rewards will, after the network update, have a higher probability of being chosen next time. Repeat such test runs and update the process many times to train a neural network for a better policy. Implementing a policy gradient in TensorFlow Let's now see how to implement a policy gradient for our CartPole problem in TensorFlow. First, import tensorflow, numpy, and gym, and define a helper method that calculates the normalized and discounted rewards: import tensorflow as tf import numpy as np import gym def normalized_discounted_rewards(rewards): dr = np.zeros(len(rewards)) dr[-1] = rewards[-1] for n in range(2, len(rewards)+1): dr[-n] = rewards[-n] + dr[-n+1] * discount_rate return (dr - dr.mean()) / dr.std() Next, create the CartPole gym environment, define the learning_rate and discount_rate hyper-parameters, and build the network with four input neurons, four hidden neurons, and one output neuron as before: env = gym.make("CartPole-v0") learning_rate = 0.05 discount_rate = 0.95 num_inputs = env.observation_space.shape[0] inputs = tf.placeholder(tf.float32, shape=[None, num_inputs]) hidden = tf.layers.dense(inputs, 4, activation=tf.nn.relu) logits = tf.layers.dense(hidden, 1) outputs = tf.nn.sigmoid(logits) action = tf.multinomial(tf.log(tf.concat([outputs, 1-outputs], 1)), 1) prob_action_0 = tf.to_float(1-action) cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=prob_action_0) optimizer = tf.train.AdamOptimizer(learning_rate) To manually fine-tune the gradients to take into consideration the discounted rewards for each action we first use the compute_gradients method, then update the gradients the way we want, and finally call the apply_gradients method. So let's now  compute the gradients of the cross-entropy loss for the network parameters (weights and biases), and set up gradient placeholders, which are to be fed later with the values that consider both the computed gradients and the discounted rewards of the actions taken using the current policy during test run: gvs = optimizer.compute_gradients(cross_entropy) gvs = [(g, v) for g, v in gvs if g != None] gs = [g for g, _ in gvs] gps = [] gvs_feed = [] for g, v in gvs: gp = tf.placeholder(tf.float32, shape=g.get_shape()) gps.append(gp) gvs_feed.append((gp, v)) training_op = optimizer.apply_gradients(gvs_feed) The  gvs returned from optimizer.compute_gradients(cross_entropy) is a list of tuples, and each tuple consists of the gradient (of the cross_entropy for a trainable variable) and the trainable variable. 
If you run the script multiple times from IPython, the default graph of the tf object will contain trainable variables from previous runs, so unless you call tf.reset_default_graph(), you need to use gvs = [(g, v) for g, v in gvs if g != None] to remove those obsolete training variables, which would return None gradients. Now, play some games and save the rewards and gradient values: with tf.Session() as sess: sess.run(tf.global_variables_initializer()) for _ in range(1000): rewards, grads = [], [] obs = env.reset() # using current policy to test play a game while True: a, gs_val = sess.run([action, gs], feed_dict={inputs: obs.reshape(1, num_inputs)}) obs, reward, done, info = env.step(a[0][0]) rewards.append(reward) grads.append(gs_val) if done: break After the test play of a game, update the gradients with discounted rewards and train the network (remember that training_op is defined as optimizer.apply_gradients(gvs_feed)): # update gradients and do the training nd_rewards = normalized_discounted_rewards(rewards) gp_val = {} for i, gp in enumerate(gps): gp_val[gp] = np.mean([grads[k][i] * reward for k, reward in enumerate(nd_rewards)], axis=0) sess.run(training_op, feed_dict=gp_val) Finally, after 1,000 iterations of test play and updates, we can test the trained model: total_rewards = [] for _ in range(100): rewards = 0 obs = env.reset() while True: a = sess.run(action, feed_dict={inputs: obs.reshape(1, num_inputs)}) obs, reward, done, info = env.step(a[0][0]) rewards += reward if done: break total_rewards.append(rewards) print(np.mean(total_rewards)) Note that we now use the trained policy network and sess.run to get the next action with the current observation as input. The output mean of the total rewards will be about 200. You can also save a trained model after the training using tf.train.Saver: saver = tf.train.Saver() saver.save(sess, "./nnpg.ckpt") Then you can reload it in a separate test program with: with tf.Session() as sess: saver.restore(sess, "./nnpg.ckpt") Now that you have a powerful neural-network-based policy model that can help your robot keep in balance, fully tested in a simulated environment, you can deploy it in a real physical environment, after replacing the simulated environment API returns with real environment data, of course—but the code to build and train the neural network reinforcement learning model can certainly be easily reused. If you liked this tutorial and would like to learn more such techniques, pick up this book, Intelligent Mobile Projects with TensorFlow, authored by Jeff Tang. AI on mobile: How AI is taking over the mobile devices marketspace Introducing Intelligent Apps AI and the Raspberry Pi: Machine Learning and IoT, What’s the Impact?

How to use artificial intelligence to create games with rich and interactive environments [Tutorial]

Sugandha Lahoti
04 Sep 2018
10 min read
Many of the most popular games on the planet have one thing in common: they all have rich, vivid worlds for the player to inhabit and interact with. This doesn't just mean a huge terrain or an extensive map (although it might do), it could simply be how things appear within the world. Similarly, it's not just about the environment - it's also about characters who are able to react in different ways according to the game. The only way to achieve an impressive level of 'realism' is through powerful artificial intelligence. This isn't easy, but it can be done. And learning how to do it will be well worth it, as it will create a much more engaging end product for players. This tutorial is taken from the book Practical Game AI Programming by Micael DaGraca. This book teaches you to create Game AI and implement cutting-edge AI algorithms from scratch. Let's take a look at how we can use AI to create rich environments. Breaking down the game environment by area When we create a map, often we have two or more different areas that could be used to change the gameplay, areas that could contain water, quicksand, flying zones, caves, and much more. If we wish to create an AI character that can be used in any level of our game, and anywhere, we need to take this into consideration and make the AI aware of the different zones of the map. Usually, that means that we need to input more information into the character's behavior, including how to react according to the position in which he is currently placed, or a situation where he can choose where to go. Should he avoid some areas? Should he prefer others? This type of information is relevant because it makes the character aware of the surroundings, choosing or adapting and taking into consideration his position. Not planning this correctly can lead to some unnatural decisions. For example, in Elder Scrolls V: Skyrim developed by Bethesda Softworks studio, we can watch some AI characters of the game simply turning back when they do not have information about how they should behave in some parts of the map, especially on mountains or rivers. Depending on the zones that our character finds, he might react differently or update his behavior tree to adapt to his environment. The environment that surrounds our characters can redefine their priorities or completely change their behaviors. This is a little similar to what Jean-Jacques Rousseau said about humanity: "We are good by nature, but corrupted by society." As humans, we are a representation of the environment that surrounds us, and for that reason, artificial intelligence should follow the same principle. Let's pick a  soldier and update his code to work on a different scenario. We want to change his behavior according to three different zones, beach, river, and forest. So, we'll create three public static Boolean functions with the names Beach, Forest and River; then we define the zones on the map that will turn them on or off. public static bool Beach; public static bool River; public static bool Forest; Because in this example, just one of them can be true at a time, we'll add a simple line of code that disables the other options once one of them gets activated. if(Beach == true) { Forest = false; River = false; } if(Forest == true){ Beach = false; River = false; } if(River == true){ Forest = false; Beach = false; } Once we have that done, we can start defining the different behaviors for each zone. 
For example, in the beach zone, the characters don't have a place to get cover, so that option needs to be taken away and updated with a new one. The river zone can be used to get across to the other side, so the character can hide from the player and attack from that position. To conclude, we can define the character to be more careful and use the trees to get cover. Depending on the zones, we can change the values to better adapt to the environment, or create new functions that would allow us to use some specific characteristics of that zone. if (Forest == true) {// The AI will remain passive until an interaction with the player occurs if (Health == 100 && triggerL == false && triggerR == false && triggerM == false) { statePassive = true; stateAggressive = false; stateDefensive = false; } // The AI will shift to the defensive mode if player comes from the right side or if the AI is below 20 HP if (Health <= 100 && triggerR == true || Health <= 20) { statePassive = false; stateAggressive = false; stateDefensive = true; } // The AI will shift to the aggressive mode if player comes from the left side or it's on the middle and AI is above 20HP if (Health > 20 && triggerL == true || Health > 20 && triggerM == true) { statePassive = false; stateAggressive = true; stateDefensive = false; } walk = speed * Time.deltaTime; walk = speedBack * Time.deltaTime; } Advanced environment interactions with AI As the video game industry and the technology associated with it kept evolving, new gameplay ideas appeared, and rapidly, the interaction between the characters of the game and the environment became even more interesting, especially when using physics. This means that the outcome of the environment could be completely random, where it was required for the AI characters to constantly adapt to different situations. One honorable mention on this subject is the video game Worms developed by Team17, where the map can be fully destroyed and the AI characters of the game are able to adapt and maintain smart decisions. The objective of this game is to destroy the opponent team by killing all their worms, the last man standing wins. From the start, the characters can find some extra health points or ammunition on the map and from time to time, it drops more points from the sky. So, there are two main objectives for the character, namely survive and kill. To survive, he needs to keep a decent amount of HP and away from the enemy, the other part is to choose the best character to shoot and take as much health as possible from him. Meanwhile, the map gets destroyed by the bombs and all of the fire power used by the characters, making it a challenge for artificial intelligence. Adapting to unstable terrain Let's decompose this example and create a character that could be used in this game. We'll start by looking at the map. At the bottom, there's water that automatically kills the worms. Then, we have the terrain where the worms can walk, or destroy if needed. Finally, there's the absence of terrain, specifically, the empty space that cannot be walked on. Then we have the characters (worms) they are placed in random positions at the beginning of the game and they can walk, jump, and shoot. The characters of the game should be able to constantly adapt to the instability of the terrain, so we need to use that and make it part of the behavior tree. 
As demonstrated in the diagram above, the character will need to understand the position where he is currently placed, as well as the opponent's position, health, and items. Because the terrain can be blocking them, the AI character has a chance of being in a situation where he cannot attack or obtain an item. So, we give him options on what to do in those situations and many others that he might find, but the most important is to define what happens if he cannot successfully accomplish any of them. Because the terrain can be shaped into different forms, during gameplay there will be times that it is near impossible to do anything, and that is why we need to provide options on what to do in those situations. For example, in this situation where the worm doesn't have enough free space to move, a close item to pick up, or an enemy that can be properly attacked, what should he do? It's necessary to make information about the surroundings available to our character so he can make a good judgment for that situation. In this scenario, we have defined our character to shoot anyway, against the closest enemy, or to stay close to a wall. Because he is too close to the explosion that would occur from attacking the closest enemy, he should decide to stay in a corner and wait there until the next turn. Using raycast to evaluate decisions Ideally, at the start of the turn, the character has two raycasts, one for his left side and another for the right side. This will check if there's a wall obstructing one of those directions. This can be used to determine what side the character should be moving toward if he wants to protect himself from being attacked. Then, we would use another raycast in the aim direction, to see if there's something blocking the way when the character is preparing to shoot. If there's something in the middle, the character should be calculating the distance between the two to determine if it's still safe to shoot. So, each character should have a shared list of all of the worms that are currently in the game; that way they can compare the distance between them all and choose which of them are closest and shoot them. Additionally, we add the two raycasts to check if there's something blocking the sides, and we have the basic information to make the character adapt to the constant modifications of the terrain. 
public int HP; public int Ammunition; public static List<GameObject> wormList = new List<GameObject>(); //creates a list with all the worms public static int wormCount; //Amount of worms in the game public int ID; //It's used to differentiate the worms private float proximityValueX; private float proximityValueY; private float nearValue; public float distanceValue; //how far the enemy should be private bool canAttack; void Awake () { wormList.Add(gameObject); //add this worm to the list wormCount++; //adds plus 1 to the amount of worms in the game } void Start () { HP = 100; distanceValue = 30f; } void Update () { proximityValueX = wormList[1].transform.position.x - this.transform.position.x; proximityValueY = wormList[1].transform.position.y - this.transform.position.y; nearValue = proximityValueX + proximityValueY; if(nearValue <= distanceValue) { canAttack = true; } else { canAttack = false; } Vector3 raycastRight = transform.TransformDirection(Vector3.forward); if (Physics.Raycast(transform.position, raycastRight, 10)) print("There is something blocking the Right side!"); Vector3 raycastLEft = transform.TransformDirection(Vector3.forward); if (Physics.Raycast(transform.position, raycastRight, -10)) print("There is something blocking the Left side!"); } In this post, we explored different ways to interact with the environment. First, we learned how to break down the game environment by area. Then we learned about the advanced environment interactions with AI. To learn about manipulating animation behavior with AI read our book  Practical Game AI Programming. Read Next Developing Games Using AI Techniques and Practices of Game AI Unite Berlin 2018 Keynote: Unity partners with Google, launches Ml-Agents ToolKit 0.4, Project MARS and more

Implementing cost-effective IoT analytics for predictive maintenance [Tutorial]

Prasad Ramesh
04 Sep 2018
10 min read
Predictive maintenance is a common value proposition cited for IoT analytics. In this tutorial, we will look at a value formula for net savings and then walk through an example that highlights how to think financially about when a proactive repair program makes sense and when it does not. The economics of predictive maintenance may not be entirely obvious. Believe it or not, it does not always make sense, even if you can predict early failures accurately. In many cases, you will actually lose money by doing it. Even when it can save you money, there is an optimal point for how widely it should be applied, and that optimal point depends on the costs involved and the accuracy of the predictive model. This article is an excerpt from a book written by Andrew Minteer titled Analytics for the Internet of Things (IoT).

The value formula

A formula to guide decision making compares the cost of allowing a failure to occur versus the cost of proactively repairing the component, while accounting for how well the model predicts failures:

Net Savings = (Cost of Failure * Expected True Positive Predictions) - (Proactive Repair Cost * (Expected True Positives + Expected False Positives))

The first term is the failure cost you avoid for every failure the model correctly flags in advance; the second term is everything you spend on proactive repairs, including the repairs that turn out to be unnecessary (the false positives). If the cost of failure is the same as the proactive repair cost, even with a perfect prediction model, there will be no savings. Make sure to include intangible costs in the cost of failure. Some examples of intangible costs include legal expenses, loss of brand equity, and even the customer's expenses.

A proactive repair program does make sense when there is a large spread between the cost of failure and the cost of proactive replacement, combined with a well-performing prediction model. For example, if the cost of a failure is a locomotive engine replacement at $1 million USD and the cost of a proactive repair is $200 USD, then the accuracy of the model does not have to be all that great before a proactive replacement program makes financial sense. On the other hand, if the failure is a $400 USD automotive turbocharger replacement and the proactive repair cost is $350 USD for a turbocharger actuator subcomponent replacement, the predictive model would need to be highly accurate for proactive repair to make financial sense.

An example of making a value decision

To illustrate, we will walk through a business situation and then some R code that simulates a cost-benefit curve for that decision. The code uses a fitted predictive model to calculate the net savings (or lack thereof) and generate a cost curve. The cost curve can then inform a business decision on what proportion of units with predicted failures should receive a proactive replacement.

Imagine you work for a company that builds diesel-powered generators. There is a coolant control valve that normally lasts for 4,000 hours of operation until there is a planned replacement. From the analysis, your company has realized that the generators built two years prior are experiencing earlier-than-expected failures of the valve. When the valve fails, the engine overheats and several other components are damaged. The cost of failure, including labor rates for repair personnel and the cost to the customer for downtime, is an average of $1,000 USD. The cost of a proactive replacement of the valve is $253 USD. Should you replace all coolant valves in the population? It depends on how high a failure rate is expected. In this case, about 10% of the current non-failed units are expected to fail before the scheduled replacement. Also, importantly, it matters how well you can predict the failures.
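To make the formula concrete, here is a small worked example. The fleet size and the model performance numbers are hypothetical assumptions for illustration only; the costs and the 10% failure rate come from the scenario above. Suppose there are 1,000 of these generators in the field, so roughly 100 valves are expected to fail early. Suppose further that the model's proactive-repair list correctly flags 70 of those failing valves (true positives) but also flags 50 valves that would have been fine (false positives):

Avoided failure cost = $1,000 * 70 = $70,000
Proactive repair spend = $253 * (70 + 50) = $30,360
Net Savings = $70,000 - $30,360 = $39,640

With a narrower spread between the failure cost and the repair cost, the same model could just as easily lose money, which is exactly the trade-off the cost curves below are meant to expose.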
Also, importantly, it matters how well you can predict the failures. The following R code simulates this situation and uses a simple predictive model (logistic regression) to estimate a cost curve. The model has an AUC of close to 0.75. This will vary as you run the code since the dataset is randomly simulated: #make sure all needed packages are installed if(!require(caret)){ install.packages("caret") } if(!require(pROC)){ install.packages("pROC") } if(!require(dplyr)){ install.packages("dplyr") } if(!require(data.table)){ install.packages("data.table") } #Load required libraries library(caret) library(pROC) library(dplyr) library(data.table) #Generate sample data simdata = function(N=1000) { #simulate 4 features X = data.frame(replicate(4,rnorm(N))) #create a hidden data structure to learn hidden = X[,1]^2+sin(X[,2]) + rnorm(N)*1 #10% TRUE, 90% FALSE rare.class.probability = 0.1 #simulate the true classification values y.class = factor(hidden<quantile(hidden,c(rare.class.probability))) return(data.frame(X,Class=y.class)) } #make some data structure model_data = simdata(N=50000) #train a logistic regression model on the simulated data training <- createDataPartition(model_data$Class, p = 0.6, list=FALSE) trainData <- model_data[training,] testData <- model_data[-training,] glmModel <- glm(Class~ . , data=trainData, family=binomial) testData$predicted <- predict(glmModel, newdata=testData, type="response") #calculate AUC roc.glmModel <- pROC::roc(testData$Class, testData$predicted) auc.glmModel <- pROC::auc(roc.glmModel) print(auc.glmModel) #Pull together test data and predictions simModel <- data.frame(trueClass = testData$Class, predictedClass = testData$predicted) # Reorder rows and columns simModel <- simModel[order(simModel$predictedClass, decreasing = TRUE), ] simModel <- select(simModel, trueClass, predictedClass) simModel$rank <- 1:nrow(simModel) #Assign costs for failures and proactive repairs proactive_repair_cost <- 253 # Cost of proactively repairing a part failure_repair_cost <- 1000 # Cost of a failure of the part (include all costs such as lost production, etc not just the repair cost) # Define each predicted/actual combination fp.cost <- proactive_repair_cost # The part was predicted to fail but did not (False Positive) fn.cost <- failure_repair_cost # The part was not predicted to fail and it did (False Negative) tp.cost <- (proactive_repair_cost - failure_repair_cost) # The part was predicted to fail and it did (True Positive). This will be negative for a savings. tn.cost <- 0.0 # The part was not predicted to fail and it did not (True Negative) #incorporate probability of future failure simModel$future_failure_prob <- prob_failure #Function to assign costs for each instance assignCost <- function(pred, outcome, tn.cost, fn.cost, fp.cost, tp.cost, prob){ cost <- ifelse(pred == 0 & outcome == FALSE, tn.cost, # No cost since no action was taken and no failure ifelse(pred == 0 & outcome == TRUE, fn.cost, # The cost of no action and a repair resulted ifelse(pred == 1 & outcome == FALSE, fp.cost, # The cost of proactive repair which was not needed ifelse(pred == 1 & outcome == TRUE, tp.cost, 999999999)))) # The cost of proactive repair which avoided a failure return(cost) } # Initialize list to hold final output master <- vector(mode = "list", length = 100) #use the simulated model. 
In practice, this code can be adapted to compare multiple models test_model <- simModel # Create a loop to increment through dynamic threshold (starting at 1.0 [no proactive repairs] to 0.0 [all proactive repairs]) threshold <- 1.00 for (i in 1:101) { #Add predicted class with percentile ranking test_model$prob_ntile <- ntile(test_model$predictedClass, 100) / 100 # Dynamically determine if proactive repair would apply based on incrementing threshold test_model$glm_failure <- ifelse(test_model$prob_ntile >= threshold, 1, 0) test_model$threshold <- threshold # Compare to actual outcome to assign costs test_model$glm_impact <- assignCost(test_model$glm_failure, test_model$trueClass, tn.cost, fn.cost, fp.cost, tp.cost, test_model$future_failure_prob) # Compute cost for not doing any proactive repairs test_model$nochange_impact <- ifelse(test_model$trueClass == TRUE, fn.cost, tn.cost) # *test_model$future_failure_prob) # Running sum to produce the overall impact test_model$glm_cumul_impact <- cumsum(test_model$glm_impact) / nrow(test_model) test_model$nochange_cumul_impact <- cumsum(test_model$nochange_impact) / nrow(test_model) # Count the # of classified failures test_model$glm_failure_ct <- cumsum(test_model$glm_failure) # Create new object to house the one row per iteration output for the final plot master[[i]] <- test_model[nrow(test_model),] # Reduce the threshold by 1% and repeat to calculate new value threshold <- threshold - 0.01 } finalOutput <- rbindlist(master) finalOutput <- subset(finalOutput, select = c(threshold, glm_cumul_impact, glm_failure_ct, nochange_cumul_impact) ) # Set baseline to costs of not doing any proactive repairs baseline <- finalOutput$nochange_cumul_impact # Plot the cost curve par(mfrow = c(2,1)) plot(row(finalOutput)[,1], finalOutput$glm_cumul_impact, type = "l", lwd = 3, main = paste("Net Costs: Proactive Repair Cost of $", proactive_repair_cost, ", Failure cost $", failure_repair_cost, sep = ""), ylim = c(min(finalOutput$glm_cumul_impact) - 100, max(finalOutput$glm_cumul_impact) + 100), xlab = "Percent of Population", ylab = "Net Cost ($) / Unit") # Plot the cost difference of proactive repair program and a 'do nothing' approach plot(row(finalOutput)[,1], baseline - finalOutput$glm_cumul_impact, type = "l", lwd = 3, col = "black", main = paste("Savings: Proactive Repair Cost of $", proactive_repair_cost, ", Failure cost $", failure_repair_cost,sep = ""), ylim = c(min(baseline - finalOutput$glm_cumul_impact) - 100, max(baseline - finalOutput$glm_cumul_impact) + 100), xlab = "% of Population", ylab = "Savings ($) / Unit") abline(h=0,col="gray")   As seen in the resulting net cost and savings curves, based on the model's predictions, the optimal savings would be from a proactive repair program of the top 30 percentile units. The savings decreases after this, although you would still save money when replacing up to 75% of the population. After this point, you should expect to spend more than you save. The following set of charts is the output from the preceding code: Cost and savings curves for the proactive repair $253 and failure cost at $1,000 scenario Note the changes in the following graph when the failure cost drops to $300 USD. At no point do you save money, as the proactive repair cost will always outweigh the reduced failure cost. This does not mean you should not do a proactive repair; you may still want to do so in order to satisfy your customers. 
Even in such a case, this cost curve method can help in decisions on how much you are willing to spend to address the problem. You can rerun the code with proactive_repair_cost set to 253 and failure_repair_cost set to 300 to generate the following charts:

Cost and savings curves for the proactive repair $253 and failure cost at $300 scenario

And finally, notice how the savings curve changes when the failure cost moves to $5,000. You will notice that the spread between the proactive repair cost and the failure cost determines much of when doing a proactive repair makes business sense. You can rerun the code with proactive_repair_cost set to 253 and failure_repair_cost set to 5000 to generate the following charts:

Cost and savings curves for the proactive repair $253 and failure cost at $5,000 scenario

Ultimately, the decision is a business case based on the expected costs and benefits. ML modeling can help optimize savings under the right conditions. Utilizing cost curves helps to determine the expected costs and savings of proactive replacements. In this tutorial, we looked at implementing economically cost-effective IoT analytics for predictive maintenance with an example. To further explore IoT analytics and the cloud, check out the book Analytics for the Internet of Things (IoT). AWS IoT Analytics: The easiest way to run analytics on IoT data, Amazon says Build an IoT application with Azure IoT [Tutorial] Intelligent Edge Analytics: 7 ways machine learning is driving edge computing adoption in 2

article-image-build-intelligent-interfaces-with-coreml-using-a-cnn-tutorial
Savia Lobo
03 Sep 2018
19 min read

Build intelligent interfaces with CoreML using a CNN [Tutorial]

Core ML gives the potential for devices to better serve us rather than us serving them. This adheres to a rule stated by developer Eric Raymond that a computer should never ask the user for any information that it can auto-detect, copy, or deduce. This article is an excerpt taken from Machine Learning with Core ML written by Joshua Newnham. In today's post, we will implement an application that will attempt to guess what the user is trying to draw and provide pre-drawn drawings that the user can substitute with (image search).  We will be exploring two techniques. The first is using a convolutional neural network (CNN), which we are becoming familiar with, to make the prediction, and then look at how we can apply a context-based similarity sorting strategy to better align the suggestions with what the user is trying to sketch. Reviewing the training data and model We will be using a slightly smaller set, with 205 out of the 250 categories; the exact categories can be found in the CSV file /Chapter7/Training/sketch_classes.csv, along with the Jupyter Notebooks used to prepare the data and train the model. The original sketches are available in SVG and PNG formats. Because we're using a CNN, rasterized images (PNG) were used but rescaled from 1111 x 1111 to 256 x 256; this is the expected input of our model. The data was then split into a training and a validation set, using 80% (64 samples from each category) for training and 20% (17 samples from each category) for validation. After 68 iterations (epochs), the model was able to achieve an accuracy of approximately 65% on the validation data. Not exceptional, but if we consider the top two or three predictions, then this accuracy increases to nearly 90%. The following diagram shows the plots comparing training and validation accuracy, and loss during training: With our model trained, our next step is to export it using the Core ML Tools made available by Apple (as discussed in previous chapters) and imported into our project. Classifying sketches Here we will walk through importing the Core ML model into our project and hooking it up, including using the model to perform inference on the user's sketch and also searching and suggesting substitute images for the user to swap their sketch with. Let's get started with importing the Core ML model into our project. Locate the model in the project repositories folder /CoreMLModels/Chapter7/cnnsketchclassifier.mlmodel; with the model selected, drag it into your Xcode project, leaving the defaults for the Import options. Once imported, select the model to inspect the details, which should look similar to the following screenshot: As with all our models, we verify that the model is included in the target by verifying that the appropriate Target Membership is checked, and then we turn our attention to the inputs and outputs, which should be familiar by now. We can see that our model is expecting a single-channel (grayscale) 256 x 256 image and it returns the dominate class via the classLabel property of the output, along with a dictionary of probabilities of all classes via the classLabelProbs property. With our model now imported, let's discuss the details of how we will be integrating it into our project. Recall that our SketchView emits the events UIControlEvents.editingDidStart, UIControlEvents.editingChanged, and UIControlEvents.editingDidEnd as the user draws. 
If you inspect the SketchViewController, you will see that we have already registered to listen for the UIControlEvents.editingDidEnd event, as shown in the following code snippet: override func viewDidLoad() { super.viewDidLoad() ... ... self.sketchView.addTarget(self, action: #selector(SketchViewController.onSketchViewEditingDidEnd), for: .editingDidEnd) queryFacade.delegate = self } Each time the user ends a stroke, we will start the process of trying to guess what the user is sketching and search for suitable substitutes. This functionality is triggered via the .editingDidEnd action method onSketchViewEditingDidEnd, but will be delegated to the class QueryFacade, which will be responsible for implementing this functionality. This is where we will spend the majority of our time in this section and the next section. It's also probably worth highlighting the statement queryFacade.delegate = self in the previous code snippet. QueryFacade will be performing most of its work off the main thread and will notify this delegate of the status and results once finished, which we will get to in a short while. Let's start by implementing the functionality of the onSketchViewEditingDidEnd method, before turning our attention to the QueryFacade class. Within the SketchViewController class, navigate to the onSketchViewEditingDidEnd method and append the following code: guard self.sketchView.currentSketch != nil, let sketch = self.sketchView.currentSketch as? StrokeSketch else{ return } queryFacade.asyncQuery(sketch: sketch) Here, we are getting the current sketch, and returning it if no sketch is available or if it's not a StrokeSketch; we hand it over to our queryFacade (an instance of the QueryFacade class). Let's now turn our attention to the QueryFacade class; select the QueryFacade.swift file from the left-hand panel within Xcode to bring it up in the editor area. A lot of plumbing has already been implemented to allow us to focus our attention on the core functionality of predicting, searching, and sorting. Let's quickly discuss some of the details, starting with the properties: let context = CIContext() let queryQueue = DispatchQueue(label: "query_queue") var targetSize = CGSize(width: 256, height: 256) weak var delegate : QueryDelegate? var currentSketch : Sketch?{ didSet{ self.newQueryWaiting = true self.queryCanceled = false } } fileprivate var queryCanceled : Bool = false fileprivate var newQueryWaiting : Bool = false fileprivate var processingQuery : Bool = false var isProcessingQuery : Bool{ get{ return self.processingQuery } } var isInterrupted : Bool{ get{ return self.queryCanceled || self.newQueryWaiting } } QueryFacade is only concerned with the most current sketch. Therefore, each time a new sketch is assigned using the currentSketch property, queryCanceled is set to true. During each task (such as performing prediction, search, and downloading), we check the isInterrupted property, and if true, we will exit early and proceed to process the latest sketch. 
When you pass the sketch to the asyncQuery method, the sketch is assigned to the currentSketch property and then proceeds to call queryCurrentSketch to do the bulk of the work, unless there is one currently being processed: func asyncQuery(sketch:Sketch){ self.currentSketch = sketch if !self.processingQuery{ self.queryCurrentSketch() } } fileprivate func processNextQuery(){ self.queryCanceled = false if self.newQueryWaiting && !self.processingQuery{ self.queryCurrentSketch() } } fileprivate func queryCurrentSketch(){ guard let sketch = self.currentSketch else{ self.processingQuery = false self.newQueryWaiting = false return } self.processingQuery = true self.newQueryWaiting = false queryQueue.async { DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:self.isInterrupted ? -1 : -1, result:nil) self.processNextQuery() } } } Let's work bottom-up by implementing all the supporting methods before we tie everything together within the queryCurrentSketch method. Let's start by declaring an instance of our model; add the following variable within the QueryFacade class near the top: let sketchClassifier = cnnsketchclassifier() Now, with our model instantiated and ready, we will navigate to the classifySketch method of the QueryFacade class; it is here that we will make use of our imported model to perform inference, but let's first review what already exists: func classifySketch(sketch:Sketch) -> [(key:String,value:Double)]?{ if let img = sketch.exportSketch(size: nil)? .resize(size: self.targetSize).rescalePixels(){ return self.classifySketch(image: img) } return nil } func classifySketch(image:CIImage) -> [(key:String,value:Double)]?{ return nil } Here, we see that the classifySketch is overloaded, with one method accepting a Sketch and the other a CIImage. The former, when called, will obtain the rasterize version of the sketch using the exportSketch method. If successful, it will resize the rasterized image using the targetSize property. Then, it will rescale the pixels before passing the prepared CIImage along to the alternative classifySketch method. Pixel values are in the range of 0-255 (per channel; in this case, it's just a single channel). Typically, you try to avoid having large numbers in your network. The reason is that they make it more difficult for your model to learn (converge)—somewhat analogous to trying to drive a car whose steering wheel can only be turned hard left or hard right. These extremes would cause a lot of over-steering and make navigating anywhere extremely difficult. The second classifySketch method will be responsible for performing the actual inference. Add the following code within the classifySketch(image:CIImage) method: if let pixelBuffer = image.toPixelBuffer(context: self.context, gray: true){ let prediction = try? self.sketchClassifier.prediction(image: pixelBuffer) if let classPredictions = prediction?.classLabelProbs{ let sortedClassPredictions = classPredictions.sorted(by: { (kvp1, kvp2) -> Bool in kvp1.value > kvp2.value }) return sortedClassPredictions } } return nil Here, we use the images, toPixelBuffer method, an extension we added to the CIImage class, to obtain a grayscale CVPixelBuffer representation of itself. Now, with reference to its buffer, we pass it onto the prediction method of our model instance, sketchClassifier, to obtain the probabilities for each label. We finally sort these probabilities from the most likely to the least likely before returning the sorted results to the caller. 
Now, with some inkling as to what the user is trying to sketch, we will proceed to search and download the ones we are most confident about. The task of searching and downloading will be the responsibility of the downloadImages method within the QueryFacade class. This method will make use of an existing BingService that exposes methods for searching and downloading images. Let's hook this up now; jump into the downloadImages method and append the following highlighted code to its body: func downloadImages(searchTerms:[String], searchTermsCount:Int=4, searchResultsCount:Int=2) -> [CIImage]?{ var bingResults = [BingServiceResult]() for i in 0..<min(searchTermsCount, searchTerms.count){ let results = BingService.sharedInstance.syncSearch( searchTerm: searchTerms[i], count:searchResultsCount) for bingResult in results{ bingResults.append(bingResult) } if self.isInterrupted{ return nil } } } The downloadImages method takes the arguments searchTerms, searchTermsCount, and searchResultsCount. The searchTerms is a sorted list of labels returned by our classifySketch method, from which the searchTermsCount determines how many of these search terms we use (defaulting to 4). Finally, searchResultsCount limits the results returned for each search term. The preceding code performs a sequential search using the search terms passed into the method. And as mentioned previously, here we are using Microsoft's Bing Image Search API, which requires registration, something we will return to shortly. After each search, we check the property isInterrupted to see whether we need to exit early; otherwise, we continue on to the next search. The result returned by the search includes a URL referencing an image; we will use this next to download the image with each of the results, before returning an array of CIImage to the caller. Let's add this now. Append the following code to the downloadImages method: var images = [CIImage]() for bingResult in bingResults{ if let image = BingService.sharedInstance.syncDownloadImage( bingResult: bingResult){ images.append(image) } if self.isInterrupted{ return nil } } return images As before, the process is synchronous and after each download, we check the isInterrupted property to see if we need to exit early, otherwise returning the list of downloaded images to the caller. So far, we have implemented the functionality to support prediction, searching, and downloading; our next task is to hook all of this up. Head back to the queryCurrentSketch method and add the following code within the queryQueue.async block. Ensure that you replace the DispatchQueue.main.async block: queryQueue.async { guard let predictions = self.classifySketch( sketch: sketch) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } let searchTerms = predictions.map({ (key, value) -> String in return key }) guard let images = self.downloadImages( searchTerms: searchTerms, searchTermsCount: 4) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } guard let sortedImage = self.sortByVisualSimilarity( images: images, sketch: sketch) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:self.isInterrupted ? 
-1 : 1, result:QueryResult( predictions: predictions, images: sortedImage)) self.processNextQuery() } } It's a large block of code but nothing complicated; let's quickly walk our way through it. We start by calling the classifySketch method we just implemented. As you may recall, this method returns a sorted list of label and probability peers unless interrupted, in which case nil will be returned. We should handle this by notifying the delegate before exiting the method early (a check we apply to all of our tasks). Once we've obtained the list of sorted labels, we pass them to the downloadImages method to receive the associated images, which we then pass to the sortByVisualSimilarity method. This method currently returns just the list of images, but it's something we will get back to in the next section. Finally, the method passes the status and sorted images wrapped in a QueryResult instance to the delegate via the main thread, before checking whether it needs to process a new sketch (by calling the processNextQuery method). At this stage, we have implemented all the functionality required to download our substitute images based on our guess as to what the user is currently sketching. Now, we just need to jump into the SketchViewController class to hook this up, but before doing so, we need to obtain a subscription key to use Bing's Image Search. Within your browser, head to https://azure.microsoft.com/en-gb/services/cognitive-services/bing-image-search-api/ and click on the Try Bing Image Search API, as shown in the following screenshot: After clicking on Try Bing Image Search API, you will be presented with a series of dialogs; read, and once (if) agreed, sign in or register. Continue following the screens until you reach a page informing you that the Bing Search API has been successfully added to your subscription, as shown in the following screenshot: On this page, scroll down until you come across the entry Bing Search APIs v7. If you inspect this block, you should see a list of Endpoints and Keys. Copy and paste one of these keys within the BingService.swift file, replacing the value of the constant subscriptionKey; the following screenshot shows the web page containing the service key: Return to the SketchViewController by selecting the SketchViewController.swift file from the left-hand panel, and locate the method onQueryCompleted: func onQueryCompleted(status: Int, result:QueryResult?){ } Recall that this is a method signature defined in the QueryDelegate protocol, which the QueryFacade uses to notify the delegate if the query fails or completes. It is here that we will present the matching images we have found through the process we just implemented. We do this by first checking the status. If deemed successful (greater than zero), we remove every item that is referenced in the queryImages array, which is the data source for our UICollectionView used to present the suggested images to the user. Once emptied, we iterate through all the images referenced within the QueryResult instance, adding them to the queryImages array before requesting the UICollectionView to reload the data. 
Add the following code to the body of the onQueryCompleted method: guard status > 0 else{ return } queryImages.removeAll() if let result = result{ for cimage in result.images{ if let cgImage = self.ciContext.createCGImage(cimage, from:cimage.extent){ queryImages.append(UIImage(cgImage:cgImage)) } } } toolBarLabel.isHidden = queryImages.count == 0 collectionView.reloadData() There we have it; everything is in place to handle guessing of what the user draws and present possible suggestions. Now is a good time to build and run the application on either the simulator or the device to check whether everything is working correctly. If so, then you should see something similar to the following: There is one more thing left to do before finishing off this section. Remembering that our goal is to assist the user to quickly sketch out a scene or something similar, our hypothesis is that guessing what the user is drawing and suggesting ready-drawn images will help them achieve their task. So far, we have performed prediction and provided suggestions to the user, but currently the user is unable to replace their sketch with any of the presented suggestions. Let's address this now. Our SketchView currently only renders StrokeSketch (which encapsulates the metadata of the user's drawing). Because our suggestions are rasterized images, our choice is to either extend this class (to render strokes and rasterized images) or create a new concrete implementation of the Sketch protocol. In this example, we will opt for the latter and implement a new type of Sketch capable of rendering a rasterized image. Select the Sketch.swift file to bring it to focus in the editor area of Xcode, scroll to the bottom, and add the following code: class ImageSketch : Sketch{ var image : UIImage! var size : CGSize! var origin : CGPoint! var label : String! init(image:UIImage, origin:CGPoint, size:CGSize, label: String) { self.image = image self.size = size self.label = label self.origin = origin } } We have defined a simple class that is referencing an image, origin, size, and label. The origin determines the top-left position where the image should be rendered, while the size determines its, well, size! To satisfy the Sketch protocol, we must implement the properties center and boundingBox along with the methods draw and exportSketch. Let's implement each of these in turn, starting with boundingBox. The boundingBox property is a computed property derived from the properties origin and size. Add the following code to your ImageSketch class: var boundingBox : CGRect{ get{ return CGRect(origin: self.origin, size: self.size) } } Similarly, center will be another computed property derived from the origin and size properties, simply translating the origin with respect to the size. Add the following code to your ImageSketch class: var center : CGPoint{ get{ let bbox = self.boundingBox return CGPoint(x:bbox.origin.x + bbox.size.width/2, y:bbox.origin.y + bbox.size.height/2) } set{ self.origin = CGPoint(x:newValue.x - self.size.width/2, y:newValue.y - self.size.height/2) } } The draw method will simply use the passed-in context to render the assigned image within the boundingBox; append the following code to your ImageSketch class: func draw(context:CGContext){ self.image.draw(in: self.boundingBox) }   Our last method, exportSketch, is also fairly straightforward. Here, we create an instance of CIImage, passing in the image (of type UIImage). 
Then, we resize it using the extension method we implemented back in Chapter 3, Recognizing Objects in the World. Add the following code to finish off the ImageSketch class: func exportSketch(size:CGSize?) -> CIImage?{ guard let ciImage = CIImage(image: self.image) else{ return nil } if self.image.size.width == self.size.width && self.image.size.height == self.size.height{ return ciImage } else{ return ciImage.resize(size: self.size) } } We now have an implementation of Sketch that can handle rendering of rasterized images (like those returned from our search). Our final task is to swap the user's sketch with an item the user selects from the UICollectionView. Return to SketchViewController class by selecting the SketchViewController.swift from the left-hand-side panel in Xcode to bring it up in the editor area. Once loaded, navigate to the method collectionView(_ collectionView:, didSelectItemAt:); this should look familiar to most of you. It is the delegate method for handling cells selected from a UICollectionView and it's where we will handle swapping of the user's current sketch with the selected item. Let's start by obtaining the current sketch and associated image that was selected. Add the following code to the body of the collectionView(_collectionView:,didSelectItemAt:) method: guard let sketch = self.sketchView.currentSketch else{ return } self.queryFacade.cancel() let image = self.queryImages[indexPath.row]   Now, with reference to the current sketch and image, we want to try and keep the size relatively the same as the user's sketch. We will do this by simply obtaining the sketch's bounding box and scaling the dimensions to respect the aspect ratio of the selected image. Add the following code, which handles this: var origin = CGPoint(x:0, y:0) var size = CGSize(width:0, height:0) if bbox.size.width > bbox.size.height{ let ratio = image.size.height / image.size.width size.width = bbox.size.width size.height = bbox.size.width * ratio } else{ let ratio = image.size.width / image.size.height size.width = bbox.size.height * ratio size.height = bbox.size.height } Next, we obtain the origin (top left of the image) by obtaining the center of the sketch and offsetting it relative to its width and height. Do this by appending the following code: origin.x = sketch.center.x - size.width / 2 origin.y = sketch.center.y - size.height / 2 We can now use the image, size, and origin to create an ImageSketch, and replace it with the current sketch simply by assigning it to the currentSketch property of the SketchView instance. Add the following code to do just that: self.sketchView.currentSketch = ImageSketch(image:image, origin:origin, size:size, label:"") Finally, some housekeeping; we'll clear the UICollectionView by removing all images from the queryImages array (its data source) and request it to reload itself. Add the following block to complete the collectionView(_ collectionView:,didSelectItemAt:) method: self.queryImages.removeAll() self.toolBarLabel.isHidden = queryImages.count == 0 self.collectionView.reloadData() Now is a good time to build and run to ensure that everything is working as planned. If so then, you should be able to swap out your sketch with one of the suggestions presented at the top, as shown in the following screenshot: We learned how to build Intelligent interfaces using Core ML. 
If you've enjoyed reading this post, do check out Machine Learning with Core ML to further implement Core ML for visual-based applications using the principles of transfer learning and neural networks. Introducing Intelligent Apps 5 examples of Artificial Intelligence in Web apps Voice, natural language, and conversations: Are they the next web UI?
article-image-getting-started-with-amazon-machine-learning-workflow-tutorial
Melisha Dsouza
02 Sep 2018
14 min read

Getting started with Amazon Machine Learning workflow [Tutorial]

Amazon Machine Learning is useful for building ML models and generating predictions. It also enables the development of robust and scalable smart applications. The process of building ML models with Amazon Machine Learning consists of three operations: data analysis, model training, and evaluation. The code files for this article are available on Github. This tutorial is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning. The Amazon Machine Learning service is available at https://console.aws.amazon.com/machinelearning/. The Amazon ML workflow closely follows a standard Data Science workflow with these steps:

Extract the data and clean it up. Make it available to the algorithm.
Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part.
Select the best model by training several models on the training dataset and comparing their performances on the validation dataset.
Use the best model for predictions on new data.

As shown in the following Amazon ML menu, the service is built around four objects: Datasource, ML model, Evaluation, and Prediction. The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. Let us take a closer look at each one of these steps.

Understanding the dataset used

We will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. We do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for prediction on previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor by:

Creating a new column with randomly generated numbers
Sorting the spreadsheet by that column
Selecting 190 rows for training and 47 rows for prediction (roughly an 80/20 split)

Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creators of this dataset in 1967. As with all datasets, scripts, and resources mentioned in this book, the training and holdout files are available in the GitHub repository at https://github.com/alexperrier/packt-aml. It is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.

Loading the data on S3

Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt.
Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu: Both files are small, only a few KB, and hosting costs should remain negligible for that exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the Open Web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on. Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file ST67_training.csv. Declaring a datasource Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default: As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user: Specifying a Datasource name is useful to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot: Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents. Creating the datasource An Amazon ML datasource is composed of the following: The location of the data file: The data file is not duplicated or cloned in Amazon ML but accessed from S3 The schema that contains information on the type of the variables contained in the CSV file: Categorical Text Numeric (real-valued) Binary It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has: Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot: Amazon ML needs to know at that point which is the variable you are trying to predict. Be sure to tell Amazon ML the following: The first line in the CSV file contains te column name The target is the weight We see here that Amazon ML has correctly inferred the following: sex is categorical age, height, and weight are numeric (continuous real values) Since we chose a numeric variable as the target Amazon ML, will use Linear Regression as the predictive model. For binary or categorical values, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data: predicted weight = a * age + b * height + c * sex Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. 
Row identifiers are useful when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model. Understanding the model We select the default parameters for the training and evaluation settings. Amazon ML will do the following: Create a recipe for data transformation based on the statistical properties it has inferred from the dataset Split the dataset (ST67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot: We see that Amazon ML will pass over the data 10 times, shuffle splitting the data each time. It will use an L2 regularization strategy based on the sum of the square of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-02) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.00001 (10^-6) implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. 
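The validation portion of the split is described by a complementary recipe. The exact JSON that Amazon ML generates may differ slightly, but a recipe along the following lines selects the remaining 30% of the rows:

{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100
  }
}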
Three datasources are now available where there was initially only one, as shown by the following screenshot: While the model is being trained, Amazon ML runs the Stochastic Gradient algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each path. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made. At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation. Evaluating the model Amazon ML uses the standard metric RMSE for linear regression. RMSE is defined as the sum of the squares of the difference between the real values and the predicted values: Here, ŷ is the predicted values, and y the real values we want to predict (the weight of the children in our case). The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction. Making batch predictions We now have a model that has been properly trained and selected among other models. We can use it to make predictions on new data. A batch prediction consists in applying a model to a datasource in order to make predictions on that datasource. We need to tell Amazon ML which model we want to apply on which data. Batch predictions are different from streaming predictions. With batch predictions, all the data is already made available as a datasource, while for streaming predictions, the data will be fed to the model as it becomes available. The dataset is not available beforehand in its entirety. In the Main Menu select Batch Predictions to access the dashboard predictions and click on Create a New Prediction: The first step is to select one of the models available in your model dashboard. You should choose the one that has the lowest RMSE: The next step is to associate a datasource to the model you just selected. We had uploaded the held-out dataset to S3 at the beginning of this chapter (under the Loading the data on S3 section) but had not used it to create a datasource. We will do so now.When asked for a datasource in the next screen, make sure to check My data is in S3, and I need to create a datasource, and then select the held-out dataset that should already be present in your S3 bucket: Don't forget to tell Amazon ML that the first line of the file contains columns. In our current project, our held-out dataset also contains the true values for the weight of the students. This would not be the case for "real" data in a real-world project where the real values are truly unknown. However, in our case, this will allow us to calculate the RMSE score of our predictions and assess the quality of these predictions. The final step is to click on the Verify button and wait for a few minutes: Amazon ML will run the model on the new datasource and will generate predictions in the form of a CSV file. Contrary to the evaluation and model-building phase, we now have real predictions. 
We are also no longer given a score associated with these predictions. After a few minutes, you will notice a new batch-prediction folder in your S3 bucket. This folder contains a manifest file and a results folder. The manifest file is a JSON file with the path to the initial datasource and the path to the results file. The results folder contains a gzipped CSV file: Uncompressed, the CSV file contains two columns, trueLabel, the initial target from the held-out set, and score, which corresponds to the predicted values. We can easily calculate the RMSE for those results directly in the spreadsheet through the following steps: Creating a new column that holds the square of the difference of the two columns. Summing all the rows. Taking the square root of the result. The following illustration shows how we create a third column C, as the squared difference between the trueLabel column A and the score (or predicted value) column B: As shown in the following screenshot, averaging column C and taking the square root gives an RMSE of 11.96, which is even significantly better than the RMSE we obtained during the evaluation phase (RMSE 14.4): The fact that the RMSE on the held-out set is better than the RMSE on the validation set means that our model did not overfit the training data, since it performed even better on new data than expected. Our model is robust. The left side of the following graph shows the True (Triangle) and Predicted (Circle) Weight values for all the samples in the held-out set. The right side shows the histogram of the residuals. Similar to the histogram of residuals we had observed on the validation set, we observe that the residuals are not centered on 0. Our model has a tendency to overestimate the weight of the students: In this tutorial, we have successfully performed the loading of the data on S3 and let Amazon ML infer the schema and transform the data. We also created a model and evaluated its performance. Finally, we made a prediction on the held -out dataset. To understand how to leverage Amazon's powerful platform for your predictive analytics needs,  check out this book Effective Amazon Machine Learning. Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics Amazon Sagemaker makes machine learning on the cloud easy Amazon ML Solutions Lab to help customers “work backwards” and leverage machine learning    
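If you prefer to check the batch prediction results programmatically rather than in a spreadsheet, the RMSE calculation described above is easy to reproduce in code. The sketch below assumes you have already extracted the gzipped results file and parsed the trueLabel and score columns into two numeric arrays; it simply takes the square root of the mean squared difference:

// Minimal RMSE check for the batch prediction results (trueLabel vs. score).
function rmse(trueValues: number[], predicted: number[]): number {
  if (trueValues.length === 0 || trueValues.length !== predicted.length) {
    throw new Error("Inputs must be non-empty arrays of equal length");
  }
  const meanSquaredError =
    trueValues.reduce((sum, t, i) => sum + (t - predicted[i]) ** 2, 0) /
    trueValues.length;
  return Math.sqrt(meanSquaredError);
}

// Example with made-up values; run against the real held-out results,
// this should land close to the 11.96 figure reported above.
console.log(rmse([102, 83, 50.5], [98.2, 84.1, 55.0]));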

article-image-should-you-use-javascript-for-machine-learning-and-how-do-you-get-started
Sugandha Lahoti
01 Sep 2018
4 min read

Why use JavaScript for machine learning?

Python has always been and remains the language of choice for machine learning, in part due to the maturity of the language, in part due to the maturity of the ecosystem, and in part due to the positive feedback loop of early ML efforts in Python. Recent developments in the JavaScript world, however, are making JavaScript more attractive to ML projects. I think we will see a major ML renaissance in JavaScript within a few years, especially as laptops and mobile devices become ever more powerful and JavaScript itself surges in popularity. This post is extracted from the book  Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a  definitive guide to creating intelligent web applications with the best of machine learning and JavaScript. Advantages and challenges of JavaScript JavaScript, like any other tool, has its advantages and disadvantages. Much of the historical criticism of JavaScript has focused on a few common themes: strange behavior in type coercion, the prototypical object-oriented model, difficulty organizing large codebases, and managing deeply nested asynchronous function calls with what many developers call callback hell. Fortunately, most of these historic gripes have been resolved by the introduction of ES6, that is, ECMAScript 2015, a recent update to the JavaScript syntax. Con: Immature ecosystem for machine learning development Despite the recent language improvements, most developers would still advise against using JavaScript for ML for one reason: the ecosystem. The Python ecosystem for ML is so mature and rich that it's difficult to justify choosing any other ecosystem. But this logic is self-fulfilling and self-defeating; we need brave individuals to take the leap and work on real ML problems if we want JavaScript's ecosystem to mature. Fortunately, JavaScript has been the most popular programming language on GitHub for a few years running, and is growing in popularity by almost every metric. Pro #1: JavaScript is the most popular web development language with a mature npm ecosystem There are some advantages to using JavaScript for ML. Its popularity is one; while ML in JavaScript is not very popular at the moment, the language itself is. As demand for ML applications rises, and as hardware becomes faster and cheaper, it's only natural for ML to become more prevalent in the JavaScript world. There are tons of resources available for learning JavaScript in general, maintaining Node.js servers, and deploying JavaScript applications. The Node Package Manager (npm) ecosystem is also large and still growing, and while there aren't many very mature ML packages available, there are a number of well built, useful tools out there that will come to maturity soon. Pro #2: JavaScript is now a general purpose, cross-platform programming language Another advantage to using JavaScript is the universality of the language. The modern web browser is essentially a portable application platform which allows you to run your code, basically without modification, on nearly any device. Tools like electron (while considered by many to be bloated) allow developers to quickly develop and deploy downloadable desktop applications to any operating system. Node.js lets you run your code in a server environment. React Native brings your JavaScript code to the native mobile application environment, and may eventually allow you to develop desktop applications as well. 
JavaScript is no longer confined to just dynamic web interactions, it's now a general-purpose, cross-platform programming language. Pro #3: JavaScript makes Machine Learning accessible to web and front-end developers Finally, using JavaScript makes ML accessible to web and frontend developers, a group that historically has been left out of the ML discussion. Server-side applications are typically preferred for ML tools, since the servers are where the computing power is. That fact has historically made it difficult for web developers to get into the ML game, but as hardware improves, even complex ML models can be run on the client, whether it's the desktop or the mobile browser. If web developers, frontend developers, and JavaScript developers all start learning about ML today, that same community will be in a position to improve the ML tools available to us all tomorrow. If we take these technologies and democratize them, expose as many people as possible to the concepts behind ML, we will ultimately elevate the community and seed the next generation of ML researchers. Summary In this article, we've discussed the important moments of JavaScript's history as applied to ML. We’ve discussed some advantages to using JavaScript for machine learning, and also some of the challenges we’re facing, particularly in terms of the machine learning ecosystem. To begin exploring and processing the data itself, read our book  Hands-on Machine Learning with JavaScript. 5 JavaScript machine learning libraries you need to know V8 JavaScript Engine releases version 6.9! HTML5 and the rise of modern JavaScript browser APIs [Tutorial]

article-image-how-to-build-a-real-time-data-pipeline-for-web-developers-part-2-tutorial
Sugandha Lahoti
30 Aug 2018
15 min read

How to build a real-time data pipeline for web developers - Part 2 [Tutorial]

Our previous post talked about two components to build a real-time data pipeline. To recap: Most data pipelines  contain these components: Data querying and event subscription Data joining or aggregation Transformation and normalization Storage and delivery In this post, we will talk about the last two components and introduce some tools and techniques that can achieve them. This post is extracted from the book Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a  definitive guide to creating an intelligent web application with the best of machine learning and JavaScript. Transformation and normalization of a data pipeline As your data makes its way through a pipeline, it may need to be converted into a structure compatible with your algorithm's input layer. There are many possible transformations that can be performed on the data in the pipeline. For example, in order to protect sensitive user data before it reaches a token-based classifier, you might apply a cryptographic hashing function to the tokens so that they are no longer human readable. Types of Data transformations More typically, the types of transformations will be related to sanitization, normalization, or transposition. A sanitization operation might involve removing unnecessary whitespace or HTML tags, removing email addresses from a token stream, and removing unnecessary fields from the data structure. If your pipeline has subscribed to an event stream as the source of the data and the event stream attaches source server IP addresses to event data, it would be a good idea to remove these values from the data structure, both in order to save space and to minimize the surface area for potential data leaks. Similarly, if email addresses are not necessary for your classification algorithm, the pipeline should remove that data so that it interacts with the fewest possible servers and systems. If you've designed a spam filter, you may want to look into using only the domain portion of the email address instead of the fully qualified address. Alternately, the email addresses or domains may be hashed by the pipeline so that the classifier can still recognize them but a human cannot. Make sure to audit your data for other potential security and privacy issues as well. If your application collects the end user's IP address as part of its event stream, but the classifier does not need that data, remove it from the pipeline as early as possible. These considerations are becoming ever more important with the implementation of new European privacy laws, and every developer should be aware of privacy and compliance concerns. Data Normalization A common category of data transformation is normalization. When working with a range of numerical values for a given field or feature, it's often desirable to normalize the range such that it has a known minimum and maximum bound. One approach is to normalize all values of the same field to the range [0,1], using the maximum encountered value as the divisor (for example, the sequence 1, 2, 4 can be normalized to 0.25, 0.5, 1). Whether data needs to be normalized in this manner will depend entirely on the algorithm that consumes the data. Another approach to normalization is to convert values into percentiles. In this scheme, very large outlying values will not skew the algorithm too drastically. If most values lie between 0 and 100 but a few points include values such as 50,000, an algorithm may give outsized precedence to the large values. 
Instagram example

The data pipeline is also a good place to calculate derived or second-order features. Imagine a random forest classifier that uses Instagram profile data to determine if the profile belongs to a human or a bot. The Instagram profile data will include fields such as the user's followers count, friends count, posts count, website, bio, and username. A random forest classifier will have difficulty using those fields in their original representations; by applying some simple data transformations, however, you can achieve accuracies of 90%.

In the Instagram case, one type of helpful data transformation is calculating ratios. Followers count and friends count, as separate features or signals, may not be useful to the classifier since they are treated somewhat independently. But the friends-to-followers ratio can turn out to be a very strong signal that may expose bot users. An Instagram user with 1,000 friends doesn't raise any flags, nor would an Instagram user with 50 followers; treated independently, these features are not strong signals. However, an Instagram user with a friends-to-followers ratio of 20 (or 1,000/50) is almost certainly a bot designed to follow other users. Similarly, a ratio such as posts-versus-followers or posts-versus-friends may end up being a stronger signal than any of those features independently.

Text content such as the Instagram user's profile bio, website, or username is made useful by deriving second-order features from it as well. A classifier may not be able to do anything with a website's URL, but perhaps a Boolean has_profile_website feature can be used as a signal instead. If, in your research, you notice that usernames of bots tend to have a lot of numbers in them, you can derive features from the username itself. One feature can calculate the ratio of letters to numbers in the username, another Boolean feature can represent whether the username has a number at the end or beginning, and a more advanced feature could determine whether dictionary words were used in the username or not (therefore distinguishing between @themachinelearningwriter and something gibberish like @panatoe234).

Derived features can be of any level of sophistication or simplicity. Another simple feature could be whether the Instagram profile contains a URL in the profile bio field (as opposed to the dedicated website field); this can be detected with a regex and the Boolean value used as the feature. A more advanced feature could automatically detect whether the language used in the user's content is the same as the language specified by the user's locale setting. If the user claims they're in France but always writes captions in Russian, it may indeed be a Russian living in France, but when combined with other signals, like a friends-to-followers ratio far from 1, this information may be indicative of a bot user.
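As a rough illustration of these second-order features, the following JavaScript sketch derives a few of the ratios and Boolean signals described above from a profile object. The field names (username, followersCount, friendsCount, and so on) are assumptions made for this example, not Instagram's actual API schema, and the exact feature set you keep should be driven by your own experimentation.

// Derive second-order features from a hypothetical profile record.
function deriveFeatures(profile) {
  const letters = (profile.username.match(/[a-z]/gi) || []).length;
  const numbers = (profile.username.match(/[0-9]/g) || []).length;

  return {
    // Ratios tend to be stronger signals than the raw counts.
    friendsToFollowers: profile.friendsCount / Math.max(profile.followersCount, 1),
    postsToFollowers: profile.postsCount / Math.max(profile.followersCount, 1),
    // Simple Boolean signals derived from the text fields.
    hasProfileWebsite: profile.website ? 1 : 0,
    bioContainsUrl: /https?:\/\//.test(profile.bio) ? 1 : 0,
    // Username-based signals.
    lettersToNumbers: letters / Math.max(numbers, 1),
    usernameEndsWithNumber: /[0-9]$/.test(profile.username) ? 1 : 0
  };
}

// A profile with 1,000 friends and only 50 followers yields a
// friends-to-followers ratio of 20, a likely bot signal.
console.log(deriveFeatures({
  username: 'panatoe234',
  followersCount: 50,
  friendsCount: 1000,
  postsCount: 3,
  website: '',
  bio: 'click here: http://example.com'
}));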
There are lower-level transformations that may need to be applied to the data in the pipeline as well. If the source data is in an XML format but the classifier requires JSON formatting, the pipeline should take responsibility for the parsing and conversion of formats. Other mathematical transformations may also be applied. If the native format of the data is row-oriented but the classifier needs column-oriented data, the pipeline can perform a vector transposition operation as part of the processing.

Similarly, the pipeline can use mathematical interpolation to fill in missing values. If your pipeline subscribes to events emitted by a suite of sensors in a laboratory setting and a single sensor goes offline for a couple of measurements, it may be reasonable to interpolate between the two known values in order to fill in the missing data. In other cases, missing values can be replaced with the population's mean or median value. Replacing missing values with a mean or median will often result in the classifier deprioritizing that feature for that data point, as opposed to breaking the classifier by giving it a null value.
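The interpolation and mean-imputation strategies just described can be sketched in a few lines of JavaScript. The sensor-reading format (an array with null for missing values) and the choice to interpolate only when both neighbors are known are assumptions made for this example.

// Fill missing readings (null) by linear interpolation between known
// neighbors, falling back to the mean of the known values otherwise.
function fillMissing(readings) {
  const known = readings.filter(r => r !== null);
  const mean = known.reduce((sum, r) => sum + r, 0) / known.length;

  return readings.map((r, i) => {
    if (r !== null) return r;
    const prev = readings[i - 1];
    const next = readings[i + 1];
    // Interpolate when both neighbors are present...
    if (typeof prev === 'number' && typeof next === 'number') {
      return (prev + next) / 2;
    }
    // ...otherwise substitute the population mean.
    return mean;
  });
}

console.log(fillMissing([21.0, 21.4, null, 22.0, null]));
// -> [21, 21.4, 21.7, 22, ~21.47] (the last gap falls back to the mean)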
What to consider when transforming and normalizing

In general, there are two things to consider in terms of transformation and normalization within a data pipeline. The first is the mechanical details of the source data and the target format: XML data must be transformed to JSON, rows must be converted to columns, images must be converted from JPEG to BMP formats, and so on. The mechanical details are not too tricky to work out, as you will already be aware of the source and target formats required by the system.

The other consideration is the semantic or mathematical transformation of your data. This is an exercise in feature selection and feature engineering, and is not as straightforward as the mechanical transformation. Determining which second-order features to derive is both art and science. The art is coming up with new ideas for derived features, and the science is to rigorously test and experiment with your work. In my experience with Instagram bot detection, for instance, I found that the letters-to-numbers ratio in Instagram usernames was a very weak signal. I abandoned that idea after some experimentation in order to avoid adding unnecessary dimensionality to the problem.

At this point, we have a hypothetical data pipeline that collects data, joins and aggregates it, processes it, and normalizes it. We're almost done, but the data still needs to be delivered to the algorithm itself. Once the algorithm is trained, we might also want to serialize the model and store it for later use. In the next section, we'll discuss a few considerations to make when transporting and storing training data or serialized models.

Storing and delivering data in a data pipeline

Once your data pipeline has applied all the necessary processing and transformations, it has one task left to do: deliver the data to your algorithm. Ideally, the algorithm will not need to know about the implementation details of the data pipeline. The algorithm should have a single location that it can interact with in order to get the fully processed data. This location could be a file on disk, a message queue, a service such as Amazon S3, a database, or an API endpoint. The approach you choose will depend on the resources available to you, the topology or architecture of your server system, and the format and size of the data.

Models that are trained only periodically are typically the simplest case to handle. If you're developing an image recognition RNN that learns labels for a number of images and only needs to be retrained every few months, a good approach would be to store all the images, as well as a manifest file (relating image names to labels), in a service such as Amazon S3 or a dedicated path on disk. The algorithm would first load and parse the manifest file and then load the images from the storage service as needed. Similarly, an Instagram bot detection algorithm may only need to be retrained every week or every month. The algorithm can read training data directly from a database table or a JSON or CSV file stored on S3 or a local disk. It is rare to have to do this, but in some exotic data pipeline implementations you could also provide the algorithm with a dedicated API endpoint built as a microservice; the algorithm would simply query the API endpoint first for a list of training point references, and then request each in turn from the API.

Models which require online updates or near-real-time updates, on the other hand, are best served by a message queue. If a Bayesian classifier requires live updates, the algorithm can subscribe to a message queue and apply updates as they come in. Even when using a sophisticated multistage pipeline, it is possible to process new data and update a model in fractions of a second if you've designed all the components well.

Storing and delivering data in a spam filter

Returning to the spam filter example, we can design a highly performant data pipeline like so: first, an API endpoint receives feedback from a user. In order to keep the user interface responsive, this API endpoint is responsible only for placing the user's feedback into a message queue and can finish its task in under a millisecond. The data pipeline in turn subscribes to the message queue, and in another few milliseconds is made aware of a new message. The pipeline then applies a few simple transformations to the message, like tokenizing, stemming, and potentially even hashing the tokens. The next stage of the pipeline transforms the token stream into a hashmap of tokens and their counts (for example, from hey hey there to {hey: 2, there: 1}); this avoids the need for the classifier to update the same token's count more than once. This stage of processing will only require another couple of milliseconds at worst. Finally, the fully processed data is placed in a separate message queue which the classifier subscribes to. Once the classifier is made aware of the data, it can immediately apply the updates to the model. If the classifier is backed by Redis, for instance, this final stage will also require only a few milliseconds.

The entire process we have described, from the time the user's feedback reaches the API server to the time the model is updated, may only require 20 ms. Considering that communication over the internet (or any other means) is limited by the speed of light, the best-case scenario for a TCP packet making a round trip between New York and San Francisco is 40 ms; in practice, the average cross-country latency for a good internet connection is about 80 ms. Our data pipeline and model are therefore capable of updating themselves based on user feedback a full 20 ms before the user will even receive their HTTP response.
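To illustrate the tokenize-and-count stage of that spam-filter pipeline, here is a small JavaScript sketch. The stemming and hashing steps are omitted, and the message-queue wiring is left out so the example stays self-contained; in the real pipeline, the resulting object would be pushed onto the queue that the classifier subscribes to.

// Turn a raw message into a hashmap of token counts, so the classifier
// only has to update each distinct token once.
function tokenize(message) {
  return message
    .toLowerCase()
    .split(/\s+/)
    .filter(token => token.length > 0);
}

function countTokens(tokens) {
  return tokens.reduce((counts, token) => {
    counts[token] = (counts[token] || 0) + 1;
    return counts;
  }, {});
}

console.log(countTokens(tokenize('hey hey there')));
// -> { hey: 2, there: 1 }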
Not every application requires real-time processing. Managing separate servers for an API, a data pipeline, message queues, a Redis store, and hosting the classifier might be overkill both in terms of effort and budget. You'll have to determine what's best for your use case.

Storage and delivery of the model

The last thing to consider is not related to the data pipeline but rather the storage and delivery of the model itself, in the case of a hybrid approach where a model is trained on the server but evaluated on the client. The first question to ask yourself is whether the model is considered public or private. Private models should not be stored on a public Amazon S3 bucket, for instance; instead, the S3 bucket should have access control rules in place, and your application will need to procure a signed download link with an expiration time (the S3 API assists with this).

The next consideration is how large the model is and how often it will be downloaded by clients. If a public model is downloaded frequently but updated infrequently, it might be best to use a CDN in order to take advantage of edge caching. If your model is stored on Amazon S3, for example, then the Amazon CloudFront CDN would be a good choice.

Of course, you can always build your own storage and delivery solution. In this post, I have assumed a cloud architecture; however, if you have a single dedicated or colocated server, then you may simply want to store the serialized model on disk and serve it either through your web server software or through your application's API. When dealing with large models, make sure to consider what will happen if many users attempt to download the model simultaneously. You may inadvertently saturate your server's network connection if too many people request the file at once, you might overrun any bandwidth limits set by your server's ISP, or you might end up with your server's CPU stuck in I/O wait while it moves data around.

No one-size-fits-all solution for data pipelining

As mentioned previously, there's no one-size-fits-all solution for data pipelining. If you're a hobbyist developing applications for fun or just a few users, you have lots of options for data storage and delivery. If you're working in a professional capacity on a large enterprise project, however, you will have to consider all aspects of the data pipeline and how they will impact your application's performance.

I will offer one final piece of advice to the hobbyists reading this section. While it's true that you don't need a sophisticated, real-time data pipeline for hobby projects, you should build one anyway. Being able to design and build real-time data pipelines is a highly marketable and valuable skill that not many people possess, and if you're willing to put in the practice to learn ML algorithms then you should also practice building performant data pipelines. I'm not saying that you should build a big, fancy data pipeline for every single hobby project, just that you should do it a few times, using several different approaches, until you're comfortable not just with the concepts but also the implementation. Practice makes perfect, and practice means getting your hands dirty.

In this two-part series, we learned about data pipelines and the various mechanisms that manage the collection, combination, transformation, and delivery of data from one system to the next. Next, to choose the right ML algorithm for a given problem, read our book Hands-on Machine Learning with JavaScript.

Create machine learning pipelines using unsupervised AutoML [Tutorial]
Top AutoML libraries for building your ML pipelines

How to build a real-time data pipeline for web developers - Part 1 [Tutorial]

Sugandha Lahoti
29 Aug 2018
12 min read
There are many differences between the idealized usage of ML algorithms and real-world usage. This post gives advice related to using ML in the real world, in real applications, and in production environments. Specifically, we will talk about how to build a real-time data pipeline. The article aims to answer the following questions:

How do you collect, store, and process gigabytes or terabytes of training data?
How and where do you store and distribute serialized models to clients?
How do you collect new training examples from millions of users?

This post is extracted from the book Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a definitive guide to creating an intelligent web application with the best of machine learning and JavaScript.

What are data pipelines?

When developing a production ML system, it's not likely that you will have the training data handed to you in a ready-to-process format. Production ML systems are typically part of larger application systems, and the data that you use will probably originate from several different sources. The training set for an ML algorithm may be a subset of your larger database, combined with images hosted on a Content Delivery Network (CDN) and event data from an Elasticsearch server.

The process of ushering data through various stages of a life cycle is called data pipelining. Data pipelining may include data selectors that run SQL or Elasticsearch queries for objects, event subscriptions which allow data to flow in from event- or log-based data, aggregations, joins, combining data with data from third-party APIs, sanitization, normalization, and storage.

In an ideal implementation, the data pipeline acts as an abstraction layer between the larger application environment and the ML process. The ML algorithm should be able to read the output of the data pipeline without any knowledge of the original source of the data, similar to our examples. As there are many possible data sources and infinite ways to architect an application, there is no one-size-fits-all data pipeline. However, most data pipelines will contain these components, which we will discuss in the following sections:

Data querying and event subscription
Data joining or aggregation
Transformation and normalization
Storage and delivery

This article is a two-part post. In this first part, we will talk about data querying and event subscription, and data joining.

Data querying

Imagine an application such as Disqus, which is an embeddable comment form that website owners can use to add comment functionality to blog posts or other pages. The primary functionality of Disqus is to allow users to like or leave comments on posts; however, as an additional feature and revenue stream, Disqus can make content recommendations and display them alongside sponsored content. The content recommendation system is an example of an ML system that is only one feature of a larger application.

A content recommendation system in an application such as Disqus does not necessarily need to interact with the comment data, but might use the user's likes history to generate recommendations similar to the current page. Such a system would also need to analyze the text content of the liked pages and compare that to the text content of all pages in the network in order to make recommendations. Disqus does not need the post's content in order to provide comment functionality, but does need to store metadata about the page (like its URL and title) in its database.
The post content may therefore not reside in the application's main database, though the likes and page metadata would likely be stored there. A data pipeline built around Disqus's recommendation system needs first to query the main database for pages the user has liked—or pages that were liked by users who liked the current page—and return their metadata. In order to find similar content, however, the system will need to use the text content of each liked post. This data might be stored in a separate system, perhaps a secondary database such as MongoDB or Elasticsearch, or in Amazon S3 or some other data warehouse. The pipeline will need to retrieve the text content based on the metadata returned by the main database, and associate the content with the metadata.

This is an example of multiple data selectors or data sources in the early stages of a data pipeline. One data source is the primary application data, which stores post and likes metadata. The other data source is a secondary server which stores the post's text content. The next step in this pipeline might involve finding a number of candidate posts similar to the ones the user has liked, perhaps through a request to Elasticsearch or some other service that can find similar content. Similar content is not necessarily the correct content to serve, however, so these candidate articles will ultimately be ranked by a hypothetical ANN in order to determine the best content to display. In this example, the input to the data pipeline is the current page and the output from the data pipeline is a list of, say, 200 similar pages that the ANN will then rank.

If all the necessary data resides in the primary database, the entire pipeline can be achieved with an SQL statement and some JOINs. Even in this case, care should be taken to develop a degree of abstraction between the ML algorithm and the data pipeline, as you may decide to update the application's architecture in the future. In other cases, however, the data will reside in separate locations and a more considered pipeline should be developed.

There are many ways to build this data pipeline. You could develop a JavaScript module that performs all the pipeline tasks, and in some cases, you could even write a bash script using standard Unix tools to accomplish the task. On the other end of the complexity spectrum, there are purpose-built tools for data pipelining, such as Apache Kafka and AWS Data Pipeline. These systems are designed modularly and allow you to define specific data source, query, transformation, and aggregation modules, as well as the workflows that connect them. In AWS Data Pipeline, for instance, you define data nodes that understand how to interact with the various data sources in your application.

The earliest stage of a pipeline is typically some sort of data query operation. Training examples must be extracted from a larger database, keeping in mind that not every record in a database is necessarily a training example. In the case of a spam filter, for instance, you should only select messages that have been marked as spam or not spam by a user. Messages that were automatically marked as spam by a spam filter should probably not be used for training, as that might cause a positive feedback loop that ultimately causes an unacceptable false positive rate. Similarly, you may want to prevent users that have been blocked or banned by your system from influencing your model training. A bad actor could intentionally mislead an ML model by taking inappropriate actions on their own data, so you should disqualify these data points as training examples. Alternatively, if your application is such that recent data points should take precedence over older training points, your data query operation might set a time-based limit on the data to use for training, or select a fixed limit ordered reverse chronologically. No matter the situation, make sure you carefully consider your data queries, as they are an essential first step in your data pipeline.
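As a concrete (and entirely hypothetical) example of such a query, the snippet below selects only human-labeled messages, excludes filter-labeled rows and banned users, and limits the result to the most recently labeled examples. The table and column names are invented for illustration; adapt them to your own schema and database client.

// A hypothetical training-data query for the spam-filter example.
const trainingQuery = `
  SELECT m.id, m.body, m.user_label
  FROM messages AS m
  JOIN users AS u ON u.id = m.user_id
  WHERE m.user_label IN ('spam', 'not_spam')   -- labeled by a person
    AND m.auto_flagged = FALSE                 -- skip rows labeled by the filter itself
    AND u.banned = FALSE                       -- ignore bad actors
  ORDER BY m.labeled_at DESC
  LIMIT 50000;
`;

// With a client such as node-postgres, the pipeline might run it roughly like this:
// const { rows } = await pool.query(trainingQuery);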
Not all data needs to come from database queries, however. Many applications use a pub/sub or event subscription architecture to capture streaming data. This data could be activity logs aggregated from a number of servers, or live transaction data from a number of sources. In these cases, an event subscriber will be an early part of your data pipeline. Note that event subscription and data querying are not mutually exclusive operations. Events that come in through a pub/sub system can still be filtered based on various criteria; this is still a form of data querying.

One potential issue with an event subscription model arises when it's combined with a batch-training scheme. If you require 5,000 data points but receive only 100 per second, your pipeline will need to maintain a buffer of data points until the target size is reached. There are various message-queuing systems that can assist with this, such as RabbitMQ or Redis. A pipeline requiring this type of functionality might hold messages in a queue until the target of 5,000 messages is achieved, and only then release the messages for batch processing through the rest of the pipeline.

In the case that data is collected from multiple sources, it most likely will need to be joined or aggregated in some manner. Let's now take a look at a situation where data needs to be joined to data from an external API.

Data joining and aggregation

Let's return to our example of the Disqus content recommendation system. Imagine that the data pipeline is able to query likes and post metadata directly from the primary database, but that no system in the application stores the post's text content. Instead, a microservice was developed in the form of an API that accepts a post ID or URL and returns the page's sanitized text content. In this case, the data pipeline will need to interact with the microservice API in order to get the text content for each post. This approach is perfectly valid, though if the frequency of post content requests is high, some caching or storage should probably be implemented.

The data pipeline will need to employ an approach similar to the buffering of messages in the event subscription model. The pipeline can use a message queue to queue posts that still require content, and make requests to the content microservice for each post in the queue until the queue is depleted. As each post's content is retrieved, it is added to the post metadata and stored in a separate queue for completed requests. Only when the source queue is depleted and the sink queue is full should the pipeline move on to the next step.
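The source-queue/sink-queue approach might look roughly like the following JavaScript sketch. The content microservice URL, the post fields, and the use of in-memory arrays in place of a real message queue (such as RabbitMQ or Redis) are all stand-ins for whatever your application actually uses, and the global fetch assumes a runtime such as Node 18+ or a browser.

// Join post metadata with text content fetched from a (hypothetical)
// content microservice, draining a source queue into a sink queue.
async function fetchPostContent(post) {
  const response = await fetch(`https://content.example.com/posts/${post.id}`);
  return response.text();
}

async function joinPostContent(sourceQueue) {
  const sinkQueue = [];

  // Work through the source queue, attaching content to each post's metadata.
  while (sourceQueue.length > 0) {
    const post = sourceQueue.shift();
    const content = await fetchPostContent(post);
    sinkQueue.push({ ...post, content });
  }

  // Only once the source queue is depleted does the pipeline move on.
  return sinkQueue;
}

// Example usage:
// joinPostContent([{ id: 1, url: 'https://example.com/a', title: 'A' }]).then(console.log);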
Data joining does not necessarily need to involve a microservice API. If the pipeline collects data from two separate sources that need to be combined, a similar approach can be employed. The pipeline is the only component that needs to understand the relationship between the two data sources and formats, leaving both the data sources and the ML algorithm to operate independently of those details.

The queue approach also works well when a data aggregation is required. An example of this situation is a pipeline in which the input is streaming data and the output is token counts or value aggregations. Using a message queue is desirable in these situations, as most message queues ensure that a message can be consumed only once, therefore preventing any duplication by the aggregator. This is especially valuable when the event stream is very high frequency, such that tokenizing each event as it comes in would lead to backups or server overload.

Because message queues ensure that each message is consumed only once, high-frequency event data can stream directly into a queue where messages are consumed by multiple workers in parallel. Each worker might be responsible for tokenizing the event data and then pushing the token stream to a different message queue. The message queue software ensures that no two workers process the same event, and each worker can operate as an independent unit that is only concerned with tokenization. As the tokenizers push their results onto a new message queue, another worker can consume those messages and aggregate token counts, delivering its own results to the next step in the pipeline every second or minute or 1,000 events, whatever is appropriate for the application. The output of this style of pipeline might be fed into a continually updating Bayesian model, for example.

One benefit of a data pipeline designed in this manner is performance. If you were to attempt to subscribe to high-frequency event data, tokenize each message, aggregate token counts, and update a model all in one system, you might be forced to use a very powerful (and expensive) single server. The server would simultaneously need a high-performance CPU, lots of RAM, and a high-throughput network connection. By breaking up the pipeline into stages, however, you can optimize each stage of the pipeline for its specific task and load condition. The message queue that receives the source event stream needs only to receive the event stream but does not need to process it. The tokenizer workers do not necessarily need to be high-performance servers, as they can be run in parallel. The aggregating queue and worker will process a large volume of data but will not need to retain data for longer than a few seconds, and therefore may not need much RAM. The final model, which is a compressed version of the source data, can be stored on a more modest machine. Many components of the data pipeline can be built from commodity hardware simply because a data pipeline encourages modular design.

In many cases, you will need to transform your data from format to format throughout the pipeline. That could mean converting from native data structures to JSON, transposing or interpolating values, or hashing values.

In this first part, we talked about two data pipeline components: data querying and event subscription, and data joining and aggregation. In the next post, we will discuss several types of data transformations that may occur in the data pipeline. We will also discuss a few considerations to make when transporting and storing training data or serialized models.

Create machine learning pipelines using unsupervised AutoML [Tutorial]
Top AutoML libraries for building your ML pipelines

Intelligent mobile projects with TensorFlow: Build a basic Raspberry Pi robot that listens, moves, sees, and speaks [Tutorial]

Bhagyashree R
27 Aug 2018
14 min read
According to Wikipedia, "The Raspberry Pi is a series of small single-board computers developed in the United Kingdom by the Raspberry Pi Foundation to promote the teaching of basic computer science in schools and in developing countries." The official site of Raspberry Pi describes it as "a small and affordable computer that you can use to learn programming." If you have never heard of or used Raspberry Pi before, just go to its website and chances are you'll quickly fall in love with the cool little thing. Little yet powerful—in fact, the developers of TensorFlow made TensorFlow available on Raspberry Pi from early versions around mid-2016, so we can run complicated TensorFlow models on a tiny computer that you can buy for about $35.

In this article, we will see how to set up TensorFlow on Raspberry Pi and use the TensorFlow image recognition and audio recognition models, along with text-to-speech and robot movement APIs, to build a Raspberry Pi robot that can move, see, listen, and speak. This tutorial is an excerpt from a book written by Jeff Tang titled Intelligent Mobile Projects with TensorFlow.

Setting up TensorFlow on Raspberry Pi

To use TensorFlow in Python, we can install the TensorFlow 1.6 nightly build for Pi from the TensorFlow Jenkins continuous integration site (http://ci.tensorflow.org/view/Nightly/job/nightly-pi/223/artifact/output-artifacts):

sudo pip install http://ci.tensorflow.org/view/Nightly/job/nightly-pi/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.6.0-cp27-none-any.whl

This method is quite common. A more complicated method is to use the makefile, required when you need to build and use the TensorFlow library. The Raspberry Pi section of the official TensorFlow makefile documentation (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile) has detailed steps to build the TensorFlow library, but it may not work with every release of TensorFlow. The steps there work perfectly with an earlier version of TensorFlow (0.10), but would cause many "undefined reference to google::protobuf" errors with TensorFlow 1.6.

The following steps have been tested with the TensorFlow 1.6 release, downloadable at https://github.com/tensorflow/tensorflow/releases/tag/v1.6.0; you can certainly try a newer version from the TensorFlow releases page, or clone the latest TensorFlow source with git clone https://github.com/tensorflow/tensorflow, and fix any possible hiccups. After cd to your TensorFlow source root, run the following commands:

tensorflow/contrib/makefile/download_dependencies.sh
sudo apt-get install -y autoconf automake libtool gcc-4.8 g++-4.8
cd tensorflow/contrib/makefile/downloads/protobuf/
./autogen.sh
./configure
make CXX=g++-4.8
sudo make install
sudo ldconfig # refresh shared library cache
cd ../../../../..
export HOST_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh`
export TARGET_NSYNC_LIB="$HOST_NSYNC_LIB"

Make sure you run make CXX=g++-4.8, instead of just make as documented in the official TensorFlow Makefile documentation, because Protobuf must be compiled with the same gcc version as that used for building the following TensorFlow library, in order to fix those "undefined reference to google::protobuf" errors.
Now try to build the TensorFlow library using the following command:

make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \
OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8

After a few hours of building, you'll likely get an error such as "virtual memory exhausted: Cannot allocate memory" or the Pi board will just freeze due to running out of memory. To fix this, we need to set up a swap, because without the swap, when an application runs out of memory, the application will get killed due to a kernel panic. There are two ways to set up a swap: a swap file and a swap partition. Raspbian uses a default swap file of 100 MB on the SD card, as shown here using the free command:

pi@raspberrypi:~/tensorflow-1.6.0 $ free -h
              total        used        free      shared  buff/cache   available
Mem:           927M         45M        843M        660K         38M        838M
Swap:           99M         74M         25M

To increase the swap file size to 1 GB, modify the /etc/dphys-swapfile file via sudo vi /etc/dphys-swapfile, changing CONF_SWAPSIZE=100 to CONF_SWAPSIZE=1024, then restart the swap file service:

sudo /etc/init.d/dphys-swapfile stop
sudo /etc/init.d/dphys-swapfile start

After this, free -h will show the Swap total to be 1.0 GB.

A swap partition is created on a separate USB disk and is preferred, because a swap partition can't get fragmented but a swap file on the SD card can get fragmented easily, causing slower access. To set up a swap partition, plug a USB stick with no data you need on it into the Pi board, then run sudo blkid, and you'll see something like this:

/dev/sda1: LABEL="EFI" UUID="67E3-17ED" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="622fddad-da3c-4a09-b6b3-11233a2ca1f6"
/dev/sda2: UUID="E67F-6EAB" TYPE="vfat" PARTLABEL="NO NAME" PARTUUID="a045107a-9e7f-47c7-9a4b-7400d8d40f8c"

/dev/sda2 is the partition we'll use as the swap partition. Now unmount and format it to be a swap partition:

sudo umount /dev/sda2
sudo mkswap /dev/sda2
mkswap: /dev/sda2: warning: wiping old swap signature.
Setting up swapspace version 1, size = 29.5 GiB (31671701504 bytes)
no label, UUID=23443cde-9483-4ed7-b151-0e6899eba9de

You'll see a UUID in the mkswap command's output; run sudo vi /etc/fstab and add a line as follows to the fstab file with that UUID value:

UUID=<UUID value> none swap sw,pri=5 0 0

Save and exit the fstab file and then run sudo swapon -a. Now if you run free -h again, you'll see the Swap total to be close to the USB storage size. We definitely don't need all that size for swap—in fact, the recommended maximum swap size for the Raspberry Pi 3 board with 1 GB memory is 2 GB, but we'll leave it as is because we just want to successfully build the TensorFlow library.

With either of the swap setting changes, we can rerun the make command:

make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \
OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8

After this completes, the TensorFlow library will be generated as tensorflow/contrib/makefile/gen/lib/libtensorflow-core.a. Now we can build the image classification example using the library.

Image recognition and text to speech

There are two TensorFlow Raspberry Pi example apps (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples) located in tensorflow/contrib/pi_examples: label_image and camera. We'll modify the camera example app to integrate text to speech so the app can speak out its recognized images when moving around.
Before we build and test the two apps, we need to install some libraries and download the pre-built TensorFlow Inception model file:

sudo apt-get install -y libjpeg-dev
sudo apt-get install libv4l-dev
curl https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015_stripped.zip -o /tmp/inception_dec_2015_stripped.zip
cd ~/tensorflow-1.6.0
unzip /tmp/inception_dec_2015_stripped.zip -d tensorflow/contrib/pi_examples/label_image/data/

To build the label_image and camera apps, run:

make -f tensorflow/contrib/pi_examples/label_image/Makefile
make -f tensorflow/contrib/pi_examples/camera/Makefile

You may encounter the following error when building the apps:

./tensorflow/core/platform/default/mutex.h:25:22: fatal error: nsync_cv.h: No such file or directory
#include "nsync_cv.h"
^
compilation terminated.

To fix this, run sudo cp tensorflow/contrib/makefile/downloads/nsync/public/nsync*.h /usr/include. Then edit the tensorflow/contrib/pi_examples/label_image/Makefile or tensorflow/contrib/pi_examples/camera/Makefile file and add the following library and include paths before running the make command again:

-L$(DOWNLOADSDIR)/nsync/builds/default.linux.c++11 \
-lnsync \

To test the two apps, run them directly:

tensorflow/contrib/pi_examples/label_image/gen/bin/label_image
tensorflow/contrib/pi_examples/camera/gen/bin/camera

Take a look at the C++ source code, tensorflow/contrib/pi_examples/label_image/label_image.cc and tensorflow/contrib/pi_examples/camera/camera.cc, and you'll see that they use similar C++ code to that in our iOS apps in the previous chapters to load the model graph file, prepare the input tensor, run the model, and get the output tensor.

By default, the camera example also uses the prebuilt Inception model unzipped into the label_image/data folder. But for your own specific image classification task, you can provide your own model retrained via transfer learning, using the --graph parameter when running the two example apps.

In general, voice is a Raspberry Pi robot's main UI for interacting with us. Ideally, we should run a TensorFlow-powered natural-sounding Text-to-Speech (TTS) model such as WaveNet (https://deepmind.com/blog/wavenet-generative-model-raw-audio) or Tacotron (https://github.com/keithito/tacotron), but it'd be beyond the scope of this article to run and deploy such a model. It turns out that we can use a much simpler TTS library called Flite by CMU (http://www.festvox.org/flite), which offers pretty decent TTS, and it takes just one simple command to install it: sudo apt-get install flite. If you want to install the latest version of Flite to hopefully get better TTS quality, just download the latest Flite source from the link and build it.

To test Flite with our USB speaker, run flite with the -t parameter followed by a double-quoted text string, such as flite -t "i recommend the ATM machine". If you don't like the default voice, you can find other supported voices by running flite -lv, which should return Voices available: kal awb_time kal16 awb rms slt. Then you can specify a voice to use for TTS: flite -voice rms -t "i recommend the ATM machine".

To let the camera app speak out the recognized objects, which should be the desired behavior when the Raspberry Pi robot moves around, you can use this simple pipe command:

tensorflow/contrib/pi_examples/camera/gen/bin/camera | xargs -n 1 flite -t

You'll likely hear too much voice.
To fine-tune the TTS result of image classification, you can also modify the camera.cc file and add the following code to the PrintTopLabels function before rebuilding the example using make -f tensorflow/contrib/pi_examples/camera/Makefile:

std::string cmd = "flite -voice rms -t \"";
cmd.append(labels[label_index]);
cmd.append("\"");
system(cmd.c_str());

Now that we have completed the image classification and speech synthesis tasks without using any Cloud APIs, let's see how we can do audio recognition on Raspberry Pi.

Audio recognition and robot movement

To use the pre-trained audio recognition model in the TensorFlow tutorial (https://www.tensorflow.org/tutorials/audio_recognition), we'll reuse a listen.py Python script from https://gist.github.com/aallan, and add the GoPiGo API calls to control the robot movement after it recognizes four basic audio commands: "left," "right," "go," and "stop." The other six commands supported by the pre-trained model—"yes," "no," "up," "down," "on," and "off"—don't apply well in our example.

To run the script, first download the pre-trained audio recognition model from http://download.tensorflow.org/models/speech_commands_v0.01.zip and unzip it, for example, to the Pi board's /tmp directory, then run:

python listen.py --graph /tmp/conv_actions_frozen.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0

Or you can run:

python listen.py --graph /tmp/speech_commands_graph.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0

Note that the plughw value 1,0 should match the card number and device number of your USB microphone, which can be found using the arecord -l command we showed before. The listen.py script also supports many other parameters. For example, we can use --detection_threshold 0.5 instead of the default detection threshold of 0.8.

Let's now take a quick look at how listen.py works before we add the GoPiGo API calls to make the robot move. listen.py uses Python's subprocess module and its Popen class to spawn a new process that runs the arecord command with appropriate parameters. The Popen class has an stdout attribute that specifies the executed arecord command's standard output file handle, which can be used to read the recorded audio bytes.

The Python code to load the trained model graph is as follows:

with tf.gfile.FastGFile(filename, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

A TensorFlow session is created using tf.Session(), and after the graph is loaded and the session created, the recorded audio buffer gets sent, along with the sample rate, as the input data to the TensorFlow session's run method, which returns the prediction of the recognition:

run(softmax_tensor, {
    self.input_samples_name_: input_data,
    self.input_rate_name_: self.sample_rate_
})

Here, softmax_tensor is defined as the TensorFlow graph's get_tensor_by_name(self.output_name_), and output_name_, input_samples_name_, and input_rate_name_ are defined as labels_softmax, decoded_sample_data:0, and decoded_sample_data:1, respectively.

On Raspberry Pi, you can choose to run the TensorFlow models using the TensorFlow Python API directly, or the C++ API (as in the label_image and camera examples), although normally you'd still train the models on a more powerful computer. For the complete TensorFlow Python API documentation, see https://www.tensorflow.org/api_docs/python.
To use the GoPiGo Python API to make the robot move based on your voice command, first add the following two lines to listen.py:

import easygopigo3 as gpg
gpg3_obj = gpg.EasyGoPiGo3()

Then add the following code to the end of the def add_data method:

if current_top_score > self.detection_threshold_ and time_since_last_top > self.suppression_ms_:
    self.previous_top_label_ = current_top_label
    self.previous_top_label_time_ = current_time_ms
    is_new_command = True
    logger.info(current_top_label)
    if current_top_label=="go":
        gpg3_obj.drive_cm(10, False)
    elif current_top_label=="left":
        gpg3_obj.turn_degrees(-30, False)
    elif current_top_label=="right":
        gpg3_obj.turn_degrees(30, False)
    elif current_top_label=="stop":
        gpg3_obj.stop()

Now put your Raspberry Pi robot on the ground, connect to it with ssh from your computer, and run the following script:

python listen.py --graph /tmp/conv_actions_frozen.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0 --detection_threshold 0.5

You'll see output like this:

INFO:audio:started recording
INFO:audio:_silence_
INFO:audio:_silence_

Then you can say left, right, stop, go, and stop to see the commands get recognized and the robot moves accordingly:

INFO:audio:left
INFO:audio:_silence_
INFO:audio:_silence_
INFO:audio:right
INFO:audio:_silence_
INFO:audio:stop
INFO:audio:_silence_
INFO:audio:go
INFO:audio:stop

You can run the camera app in a separate Terminal, so while the robot moves around based on your voice commands, it'll recognize new images it sees and speak out the results.

That's all it takes to build a basic Raspberry Pi robot that listens, moves, sees, and speaks—what the Google I/O 2016 demo does but without using any Cloud APIs. It's far from a fancy robot that can understand natural human speech, engage in interesting conversations, or perform useful and non-trivial tasks. But powered with pre-trained, retrained, or other powerful TensorFlow models, and using all kinds of sensors, you can certainly add more and more intelligence and physical power to the Pi robot we have built.

Google TensorFlow is used to train all the models deployed and running on mobile devices. This book covers 10 projects on the implementation of all major AI areas on iOS, Android, and Raspberry Pi: computer vision, speech and language processing, and machine learning, including traditional, reinforcement, and deep reinforcement. If you liked this tutorial and would like to implement projects for major AI areas on iOS, Android, and Raspberry Pi, check out the book Intelligent Mobile Projects with TensorFlow.

TensorFlow 2.0 is coming. Here’s what we can expect.
Build and train an RNN chatbot using TensorFlow [Tutorial]
Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]