
How-To Tutorials - Data


Convolutional Neural Networks with Reinforcement Learning

Packt
06 Apr 2017
9 min read
In this article by Antonio Gulli and Sujit Pal, the authors of the book Deep Learning with Keras, we will learn about reinforcement learning, or more specifically deep reinforcement learning, that is, the application of deep neural networks to reinforcement learning. We will also see how convolutional neural networks leverage spatial information, which makes them very well suited for classifying images.

Deep convolutional neural network

A Deep Convolutional Neural Network (DCNN) consists of many neural network layers. Two different types of layers, convolutional and pooling, are typically alternated. The depth of each filter increases from left to right in the network. The last stage is typically made of one or more fully connected layers, as shown here:

There are three key intuitions behind ConvNets:

Local receptive fields
Shared weights
Pooling

Let's review them together.

Local receptive fields

If we want to preserve the spatial information, then it is convenient to represent each image with a matrix of pixels. A simple way to encode the local structure is then to connect a submatrix of adjacent input neurons to one single hidden neuron belonging to the next layer. That single hidden neuron represents one local receptive field. Note that this operation is named convolution, and it gives this type of network its name. Of course, we can encode more information by having overlapping submatrices. For instance, let's suppose that the size of each single submatrix is 5 x 5 and that those submatrices are used with MNIST images of 28 x 28 pixels; then we will be able to generate 23 x 23 local receptive field neurons in the next hidden layer. In fact, it is possible to slide the submatrices by only 23 positions before touching the borders of the images. In Keras, the size of each single submatrix is called the stride length, and this is a hyper-parameter that can be fine-tuned during the construction of our nets. Let's define the feature map from one layer to another layer. Of course, we can have multiple feature maps that learn independently from each hidden layer. For instance, we can start with 28 x 28 input neurons for processing MNIST images, and then generate k feature maps of size 23 x 23 neurons each (again with a stride of 5 x 5) in the next hidden layer.

Shared weights and bias

Let's suppose that we want to move away from the raw pixel representation by gaining the ability to detect the same feature independently of the location where it is placed in the input image. A simple intuition is to use the same set of weights and bias for all the neurons in the hidden layers. In this way, each layer will learn a set of position-independent latent features derived from the image. Assuming that the input image has shape (256, 256) on 3 channels with tf (TensorFlow) ordering, it is represented as (256, 256, 3). Note that with th (Theano) ordering the channels dimension (the depth) is at index 1, while in tf mode it is at index 3. In Keras, if we want to add a convolutional layer with an output dimensionality of 32 and a filter extension of 3 x 3, we will write:

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)))

This means that we are applying a 3 x 3 convolution on a 256 x 256 image with 3 input channels (or input filters), resulting in 32 output channels (or output filters).
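As a quick sanity check of the layer just described, here is a minimal sketch using the same Keras 1.x-style API as the snippet above; the added Activation layer and the use of model.summary() are illustrative additions, not code from the book:

from keras.models import Sequential
from keras.layers import Convolution2D, Activation

model = Sequential()
# 32 output filters, each 3 x 3, applied to a 256 x 256 RGB input (tf ordering)
model.add(Convolution2D(32, 3, 3, input_shape=(256, 256, 3)))
model.add(Activation('relu'))

# Prints each layer with its output shape; with the default 'valid' border mode
# and tf dim ordering this reports (None, 254, 254, 32) for the convolution
model.summary()

With a TensorFlow backend, the channel dimension stays last, matching the (256, 256, 3) input shape discussed above.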
An example of convolution is provided in the following diagram:

Pooling layers

Let's suppose that we want to summarize the output of a feature map. Again, we can use the spatial contiguity of the output produced by a single feature map and aggregate the values of a submatrix into one single output value that synthetically describes the meaning associated with that physical region.

Max pooling

One easy and common choice is the so-called max pooling operator, which simply outputs the maximum activation observed in the region. In Keras, if we want to define a max pooling layer of size 2 x 2, we will write:

model.add(MaxPooling2D(pool_size = (2, 2)))

An example of the max pooling operation is given in the following diagram:

Average pooling

Another choice is average pooling, which simply aggregates a region into the average of the activations observed in that region. Keras implements a large number of pooling layers, and a complete list is available online. In short, all the pooling operations are nothing more than a summary operation on a given region.

Reinforcement learning

Our objective is to build a neural network to play the game of catch. Each game starts with a ball being dropped from a random position at the top of the screen. The objective is to move a paddle at the bottom of the screen, using the left and right arrow keys, to catch the ball by the time it reaches the bottom. As games go, this is quite simple. At any point in time, the state of this game is given by the (x, y) coordinates of the ball and the paddle. Most arcade games tend to have many more moving parts, so a general solution is to provide the entire current game screen image as the state. The following diagram shows four consecutive screenshots of our catch game:

Astute readers might note that our problem could be modeled as a classification problem, where the input to the network is the game screen images and the output is one of three actions: move left, stay, or move right. However, this would require us to provide the network with training examples, possibly from recordings of games played by experts. An alternative and simpler approach might be to build a network and have it play the game repeatedly, giving it feedback based on whether it succeeds in catching the ball or not. This approach is also more intuitive, and is closer to the way humans and animals learn.

The most common way to represent such a problem is through a Markov Decision Process (MDP). Our game is the environment within which the agent is trying to learn. The state of the environment at time step t is given by st (and contains the location of the ball and the paddle). The agent can perform certain actions at (such as moving the paddle left or right). These actions can sometimes result in a reward rt, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state st+1, where the agent can perform another action at+1, and so on. The set of states, actions, and rewards, together with the rules for transitioning from one state to another, make up a Markov decision process. A single game is one episode of this process, and is represented by a finite sequence of states, actions, and rewards:

s0, a0, r1, s1, a1, r2, ..., sn-1, an-1, rn, sn

Since this is a Markov decision process, the probability of state st+1 depends only on the current state st and action at.

Maximizing future rewards

As an agent, our objective is to maximize the total reward from each game.
The total reward can be represented as follows:

R = r1 + r2 + ... + rn

In order to maximize the total reward, the agent should try to maximize the total reward from any time point t in the game. The total reward at time step t is given by Rt and is represented as:

Rt = rt + rt+1 + ... + rn

However, it is harder to predict the value of the rewards the further we go into the future. In order to take this into consideration, our agent should try to maximize the total discounted future reward at time t instead. This is done by discounting the reward at each future time step by a factor γ over the previous time step. If γ is 0, then our network does not consider future rewards at all, and if γ is 1, then our network is completely deterministic. A good value for γ is around 0.9. Factoring the equation allows us to express the total discounted future reward at a given time step recursively, as the sum of the current reward and the total discounted future reward at the next time step:

Rt = rt + γ rt+1 + γ^2 rt+2 + ... = rt + γ Rt+1

Q-learning

Deep reinforcement learning utilizes a model-free reinforcement learning technique called Q-learning. Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function, which represents the maximum discounted future reward when we perform action a in state s:

Q(st, at) = max Rt+1

Once we know the Q-function, the optimal action a at a state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action at any state:

π(s) = argmaxa Q(s, a)

We can define the Q-function for a transition point (st, at, rt, st+1) in terms of the Q-function at the next point (st+1, at+1, rt+1, st+2), similar to how we did with the total discounted future reward:

Q(st, at) = rt + γ maxat+1 Q(st+1, at+1)

This equation is known as the Bellman equation. The Q-function can be approximated using the Bellman equation. You can think of the Q-function as a lookup table (called a Q-table) where the states (denoted by s) are rows and actions (denoted by a) are columns, and the elements (denoted by Q(s, a)) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward. We start by randomly initializing the Q-table, then carry out random actions and observe the rewards to update the Q-table iteratively, according to the following algorithm:

initialize Q-table Q
observe initial state s
repeat
    select and carry out action a
    observe reward r and move to new state s'
    Q(s, a) = Q(s, a) + α(r + γ maxa' Q(s', a') - Q(s, a))
    s = s'
until game over

You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here, α is the learning rate, which determines how much of the difference between the previous Q-value and the discounted new maximum Q-value should be incorporated.

Summary

We have seen the application of deep neural networks to reinforcement learning. We have also seen convolutional neural networks and how they are well suited for classifying images.
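To make the Q-table update above concrete, here is a minimal, framework-free Python sketch of tabular Q-learning. This is an illustrative toy rather than the book's catch-game code: the tiny environment in step(), the state and action counts, and the hyper-parameter values are all assumptions made for the example.

import random

N_STATES, N_ACTIONS = 5, 3            # hypothetical tiny problem
alpha, gamma, epsilon = 0.1, 0.9, 0.1

# Q-table: rows are states, columns are actions, initialized to zero
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    """Stand-in environment: returns (next_state, reward, done)."""
    next_state = random.randrange(N_STATES)       # assumed random transition
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(1000):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # The update rule from the algorithm above
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

After enough episodes, choosing the highest-valued action in each row of Q approximates the optimal policy π(s) described earlier.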


WebLogic Server

Packt
15 Mar 2017
24 min read
In this article by Adrian Ward, Christian Screen, and Haroun Khan, the authors of the book Oracle Business Intelligence Enterprise Edition 12c - Second Edition, we will talk in a little more detail about the enterprise application server that is at the core of Oracle Fusion Middleware: WebLogic. Oracle WebLogic Server is a scalable, enterprise-ready Java Platform Enterprise Edition (Java EE) application server. Its infrastructure supports the deployment of many types of distributed applications. It is also an ideal foundation for building service-oriented architecture (SOA) applications. You can already see why BEA was a perfect acquisition for Oracle years ago. Or, more to the point, a perfect core for Fusion Middleware.

The WebLogic Server is a robust application in itself. In Oracle BI 12c, the WebLogic Server is crucial to the overall implementation, not just during installation but throughout the Oracle BI 12c lifecycle, which now takes advantage of the WebLogic Management Framework. Learning the management components of WebLogic Server that ultimately control the Oracle BI components is critical to the success of an implementation. These management areas within the WebLogic Server are referred to as the WebLogic Administration Server, WebLogic Managed Server(s), and the WebLogic Node Manager.

A Few WebLogic Server Nuances

Before we move on to a description of each of those areas within WebLogic, it is also important to understand that the WebLogic Server software used for the installation of the Oracle BI product suite carries a limited license. Although the software itself is the full enterprise version and carries full functionality, the license that ships with Oracle BI 12c is not a full enterprise license for WebLogic Server, so your organization cannot use it to spin off other siloed JEE deployments on other non-OBIEE servers. Two other nuances are worth noting:

Clustered from the installation: The WebLogic Server license provided with out-of-the-box Oracle BI 12c does not allow for horizontal scale-out. An enterprise WebLogic Server license needs to be obtained for this advanced functionality.

Contains an embedded web/HTTP server, not Oracle HTTP Server (OHS): WebLogic Server does not ship with a separate HTTP server. The Oracle BI Enterprise Deployment Guide (available on oracle.com) discusses separating the application tier from the web/HTTP tier, suggesting Oracle HTTP Server.

These items are simply a few nuances of the product suite in relation to Oracle BI 12c. Most software products contain a short list such as this one. However, once you understand the nuances, it will be easier to ensure that you have a more successful implementation. It also allows your team to be as prepared in advance as possible. Be sure to consult your Oracle sales representative to assist with licensing concerns. Despite these nuances, we highly recommend that, in order to learn more about the installation features, configuration options, administration, and maintenance of WebLogic, you research it not only in relation to Oracle BI, but also in its standalone form. That is to say, there is much more information at large on the topic of WebLogic Server itself than on WebLogic Server as it relates to Oracle BI. Understanding this approach to self-education or web searching should give you more efficient results.

WebLogic Domain

The highest unit of management for controlling a WebLogic Server installation is called a domain.
A domain is a logically related group of WebLogic Server resources that you manage as a unit. A domain always includes, and is centrally managed by, one Administration Server. Additional WebLogic Server instances, which are controlled by the Administration Server for the domain, are called Managed Servers. The configuration for all the servers in the domain is stored in the configuration repository, the config.xml file, which resides on the machine hosting the Administration Server. Upon installing and configuring Oracle BI 12c, the domain bi is established within the WebLogic Server. This is the recommended domain name for each Oracle BI 12c implementation and should not be modified. The domain path for the bi domain may appear as ORACLE_HOME/user_projects/domains/bi. This directory for the bi domain is also referred to as the DOMAIN_HOME or BI_DOMAIN folder.

WebLogic Administration Server

The WebLogic Server is an enterprise software suite that manages a myriad of application server components, mainly focusing on Java technology. It also comprises many ancillary components, which enable the software to scale well and make it a good choice for distributed environments and high availability. Clearly, it is good enough to be at the core of Oracle Fusion Middleware. One of the most crucial components of WebLogic Server is the WebLogic Administration Server. When installing the WebLogic Server software, the Administration Server is automatically installed with it. The Administration Server not only controls all subsequent WebLogic Server instances, called Managed Servers, but also controls such aspects as authentication-provider security (for example, LDAP) and other application-server-related configurations. WebLogic Server installs on the operating system and ultimately runs as a service on that machine. The WebLogic Server can be managed in several ways. The two main methods are via the Graphical User Interface (GUI) web application called the WebLogic Administration Console, or via the command line using the WebLogic Scripting Tool (WLST). You access the Administration Console from any networked machine using a web-based client (that is, a web browser) that can communicate with the Administration Server through the network and/or firewall. The WebLogic Administration Server and the WebLogic Server are basically synonymous: if the WebLogic Server is not running, the WebLogic Administration Console will be unavailable as well.

WebLogic Managed Server

Web applications, Enterprise JavaBeans (EJB), and other resources are deployed onto one or more Managed Servers in a WebLogic domain. A Managed Server is an instance of a WebLogic Server in a WebLogic Server domain. Each WebLogic Server domain has at least one instance, which acts as the Administration Server just discussed. One Administration Server per domain must exist, but one or more Managed Servers may exist in the WebLogic Server domain. In a production deployment, Oracle BI is deployed into its own Managed Server. The Oracle BI installer installs two WebLogic Server instances, the Admin Server and a Managed Server, bi_server1. Oracle BI is deployed into the Managed Server bi_server1, which is configured by default to resolve to port 19502; the Admin Server resolves to port 19500. Historically, these have been port 9704 for the Oracle BI Managed Server and port 7001 for the Admin Server.
When administering the WebLogic Server via the Administration Console, the WebLogic Administration Server instance appears in the same list of servers, which also includes any Managed Servers. As a best practice, the WebLogic Administration Server should be used for configuration and management of the WebLogic Server only, and should not contain any additionally deployed applications, EJBs, and so on. One thing to note is that the Enterprise Manager Fusion Control is actually a JEE application deployed to the Administration Server instance, which is why its web client is accessible under the same port as the Admin Server. It is not necessarily a native application deployment to the core WebLogic Server, but gets deployed and configured automatically during the Oracle BI installation and configuration process. In the Deployments page within the Administration Console, you will find a deployment named em.

WebLogic Node Manager

The general idea behind Node Manager is that it takes on somewhat of a middle-man role. That is to say, the Node Manager provides a communication tunnel between the WebLogic Administration Server and any Managed Servers configured within the WebLogic domain. When the WebLogic Server environment is contained on a single physical server, it may be difficult to recognize the need for a Node Manager. It is very necessary, however, and Node Manager lifecycle management will have to be part of any start-up and shutdown scripts you ultimately create for Oracle BI. Node Manager's real power comes into play when Oracle BI is scaled out horizontally across one or more physical servers. Each scaled-out deployment of WebLogic Server will contain a Node Manager. If the Node Manager is not running on the server on which the Managed Server is deployed, then the core Administration Server will not be able to issue start or stop commands to that server. As such, if the Node Manager is down, communication with the overall cluster will be affected. The following figure shows how machines A, B, and C are physically separated, each containing a Node Manager. You can see that the Administration Server communicates with the Node Managers, and not with the Managed Servers directly:

System tools controlled by WebLogic

We briefly discussed the WebLogic Administration Console, which controls the administrative configuration of the WebLogic Server domain. This includes the components managed within it, such as security, deployed applications, and so on. The other management tool, which provides control of the deployed Oracle BI application's ancillary deployments, libraries, and several other configurations, is called the Enterprise Manager Fusion Middleware Control. This seems to be a long name for a single web-based tool. As such, the name is often shortened to "Fusion Control" or "Enterprise Manager". Referring to either abbreviated title in the context of Oracle BI should ensure that fellow Oracle BI teammates understand what you mean.

Security

It would be difficult to discuss the overall architecture of Oracle BI without at least giving some mention to how the basics of security, authentication, and authorization are applied. By default, installing Oracle WebLogic Server provides a default Lightweight Directory Access Protocol (LDAP) server, referred to as the WebLogic Server Embedded LDAP server. This is a standards-compliant LDAP system, which acts as the default authentication method for out-of-the-box Oracle BI.
Integration of secondary LDAP providers, such as Oracle Internet Directory (OID) or Microsoft Active Directory (MSAD), is crucial to leveraging most organizations' identity-management systems. The combination of multiple authentication providers is possible; in fact, it is commonplace. For example, a configuration may need users that exist in both the Embedded LDAP server and MSAD to authenticate and have access to Oracle BI. Potentially, another set of users may be stored in a relational database repository, or a set of relational database tables may control the authorization that users have in relation to the Oracle BI system. WebLogic Server provides configuration opportunities for each of these scenarios.

Oracle BI security incorporates the Fusion Middleware security model, Oracle Platform Security Services (OPSS). This has a positive influence on managing all aspects of Oracle BI, as it provides a very granular level of authorization and a large number of authentication and authorization integration mechanisms. OPSS also introduces to Oracle BI the concept of managing privileges by application role instead of directly by user or group. It abides by open standards to integrate with security mechanisms that are growing in popularity, such as the Security Assertion Markup Language (SAML) 2.0. Other well-known single sign-on mechanisms, such as SiteMinder and Oracle Access Manager, already have pre-configured integration points within Oracle BI Fusion Control.

Oracle BI 12c and Oracle BI 11g security is managed differently than in the legacy Oracle BI 10g versions. Oracle BI 12c no longer has backward compatibility with the legacy Oracle BI 10g security model, and the focus should be on following the new security configuration best practices of Oracle BI 12c:

An Oracle BI best practice is to manage security by Application Roles.

Understanding the differences between the Identity Store, Credential Store, and Policy Store is critical for advanced security configuration and maintenance.

As of Oracle BI 12c, the OPSS metadata is stored in a relational repository, which is installed as part of the RCU-schemas installation process that takes place prior to executing the Oracle BI 12c installation on the application server.

Managing by Application Roles

In Oracle BI 11g, the default security model is the Oracle Fusion Middleware security model, which has a very broad scope. A universal Information Technology security-administration best practice is to set permissions or privileges for a specific point of access on a group, and not on individual users. The same idea applies here, except that there is another, enterprise-level unit of user (and even group) aggregation, called an Application Role. Application Roles can contain other application roles, groups, or individual users. Access privileges to a certain object, such as a folder, web page, or column, should always be assigned to an application role. Application roles for Oracle BI can be managed in the Oracle Enterprise Manager Fusion Middleware Control interface. They can also be scripted using the WLST command-line interface.

Security Providers

Fusion Middleware security can seem complex at first, but knowing the correct terminology and understanding how the most important components communicate with each other and with the application at large is extremely important as it relates to security management.
Oracle BI uses three main repositories for accessing authentication and authorization information, all of which are explained in the following sections.

Identity Store

The Identity Store is the authentication provider, which may also provide authorization metadata. A simple mnemonic here is that this store tells Oracle BI how to "identify" any users attempting to access the system. An example of creating an Identity Store would be to configure an LDAP system, such as Oracle Internet Directory or Microsoft Active Directory, to reference users within an organization. These LDAP configurations are referred to as Authentication Providers.

Credential Store

The credential store is ultimately for advanced Oracle configurations. You may touch upon this when establishing an enterprise Oracle BI deployment, but not much thereafter, unless you are integrating the Oracle BI Action Framework or something equally complex. Ultimately, the credential store does exactly what its name implies: it stores credentials. Specifically, it is used to store the credentials of other applications, which the core application (that is, Oracle BI) may access at a later time without having to re-enter said credentials. An example of this would be integrating Oracle BI with the Oracle Enterprise Performance Management (EPM) suite. In this example, let's pretend there is an internal requirement at Company XYZ for users to access an Oracle BI dashboard. Upon viewing said dashboard, if a report with discrepancies is viewed, the user requires the ability to click on a link, which opens an Oracle EPM Financial Report containing more details about the concern. If not all users accessing the Oracle BI dashboard have credentials to access the Oracle EPM environment directly, how could they open and view the report without being prompted for credentials? The answer is that the credential store would be configured with the credentials of a central user having access to the Oracle EPM environment. This central user's credentials (encrypted, of course) are passed along with the dashboard viewer's request and, presto, access!

Policy Store

The policy store is quite unique to Fusion Middleware security and leverages a security standard referred to as XACML, which ultimately provides granular access and privilege control for an enterprise application. This is one of the reasons why managing by Application Roles becomes so important. It is the individual Application Roles that are assigned the policies defining access to information within Oracle BI. Stated another way, application privileges, such as the ability to administer the Oracle BI RPD, are assigned to a particular application role, and these associations are defined in the policy store. The following figure shows how each area of security management is controlled:

These three types of security providers within Oracle Fusion Middleware are integral to the Oracle BI architecture. Further recommended research on this topic would be to look at Oracle Fusion Middleware Security, OPSS, and the Application Development Framework (ADF).

System Requirements

The first thing to recognize about infrastructure requirements prior to deploying Oracle BI 12c is that its memory and processor requirements have increased since previous versions. The Java application server, WebLogic Server, installs with the full version of its software (though under a limited/restricted license, as already discussed). A multitude of additional Java libraries and applications are also deployed.
Be prepared for a recommended minimum of 8 to 16 GB of Random Access Memory (RAM) for an enterprise deployment, and a minimum of 6 to 8 GB of RAM for a workstation deployment.

Client tools

Oracle BI 12c has a separate client tools installation that requires Microsoft Windows XP or a more recent version of the Windows Operating System (OS). The Oracle BI 12c client tools provide the majority of the client-to-server management capabilities required for normal day-to-day maintenance of the Oracle BI repository and related artefacts. The client-tools installation is usually reserved for Oracle BI developers who architect and maintain the Oracle BI metadata repository, better known as the RPD, a name which stems from its binary file extension (.rpd). The Oracle BI 12c client-tools installation provides each workstation with the Administration Tool, the Job Manager, and all command-line Application Programming Interface (API) executables. In Oracle BI 12c, a 64-bit Windows OS is a requirement for installing the Oracle BI Development Client tools. It has been observed that, with some initial releases of the Oracle BI 12c client tools, ODBC DSN connectivity does not work on Windows Server 2012. Therefore, using Windows Server 2012 as a development environment will be ineffective if you attempt to open the Administration Tool and connect to the OBIEE server in online mode.

Multi-User Development Environment

One of the key features when developing with Oracle BI is the ability for multiple metadata developers to develop simultaneously. Although the use of the term "simultaneously" can vary among technical communities, concurrent development within the Oracle BI suite requires Oracle BI's Multi-User Development (MUD) environment configuration, or some other process developed by third-party Oracle partners. The MUD configuration itself is fairly straightforward and ultimately relies on the Oracle BI administrator's ability to divide metadata modeling responsibilities into projects. Projects, which are usually defined and delineated by logical fact table definitions, can be assigned to one or more metadata developers.

In previous versions of Oracle BI, a metadata developer could install the entire Oracle BI product suite on an up-to-date laptop or commodity desktop workstation and successfully develop, test, and deploy an Oracle BI metadata model. The system requirements of Oracle BI 12c demand a significant amount of processor and RAM capacity in order to perform development efforts on a standard-issue workstation or laptop. If an organization currently leverages the Oracle BI multi-user development environment, or plans to with the current release, this raises a couple of questions: How do we get our developers the best environment suitable for developing our metadata? Do we need to procure new hardware?

Microsoft Windows is a requirement for the Oracle BI client tools. However, the Oracle BI client tools do not include the server component of the Oracle BI environment; they only allow for connecting from the developer's workstation to the Oracle BI server instance. In a multi-user development environment, this poses a serious problem, as only one metadata repository (RPD) can exist on any one Oracle BI server instance at any given time.
If two developers are working from their respective workstations at the same time and wish to see their latest modifications published in a rapid application development (RAD) cycle, this type of iterative effort fails, as one developer's published changes will overwrite the other's in real time. To resolve the issue, there are two recommended solutions.

The first is an obvious localized solution. This solution merely upgrades the Oracle BI developers' workstations or laptops to comply with the minimum requirements for installing the full Oracle BI environment on those machines. This upgrade should be both memory- (RAM) and processor- (MHz) centric: 16 GB+ RAM and a dual-core processor are recommended, and a 64-bit operating system kernel is required. Without an upgraded workstation from which to work, Oracle BI metadata developers will sit at a disadvantage for general iterative metadata development, and will be especially disadvantaged when interfacing within a multi-user development environment.

The second solution is one that takes advantage of virtual machine (VM) capacity within the organization. Virtual machines have become a staple within most Information Technology departments, as they are versatile and allow for speedy provisioning of server environments. For this scenario, it is recommended to create a virtual-machine template of an Oracle BI environment from which to duplicate and "stand up" individual virtual machine images for each metadata developer on the Oracle BI development team. This effectively provides each metadata developer with their own Oracle BI development server, which contains the fully deployed Oracle BI server environment. Each developer then has the ability to develop and test iteratively by connecting to their assigned virtual server, without fear that their efforts will conflict with another developer's. The following figure illustrates how an Oracle BI MUD environment can leverage either upgraded developer-workstation hardware or VM images to facilitate development:

This article does not cover the installation, configuration, or best practices for developing in a MUD environment. However, the Oracle BI development team deserves a lot of credit for documenting these processes in unprecedented detail. The Oracle BI 11g MUD documentation provides a case study that conveys best practices for managing a complex Oracle BI development lifecycle. When you are ready to deploy a MUD environment, it is highly recommended that you peruse this documentation first. The information in this section merely seeks to convey best practice in setting up a developer's workstation when using MUD.

Certifications matrix

Oracle BI 12c largely complies with the overall Fusion Middleware infrastructure. This common foundation allows for a centralized model to communicate with operating systems, web servers, and other ancillary components that are compliant. Oracle does a good job of updating a certification matrix for each Fusion Middleware application suite per respective product release. The certification matrix for Oracle BI 12c can be found on the Oracle website at the following locations: http://www.oracle.com/technetwork/middleware/fusion-middleware/documentation/fmw-1221certmatrix-2739738.xlsx and http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html. The certification matrix document is usually provided in Microsoft Excel format and should be referenced before any project or deployment of Oracle BI begins.
This will ensure that infrastructure components such as the selected operating system, web server, web browsers, LDAP server, and so on will actually work when integrated with the product suite.

Scaling out Oracle BI 12c

There are several reasons why an organization may wish to expand its Oracle BI footprint. These can range anywhere from requiring a highly available environment to sustaining high levels of concurrent usage over time. The total number of end users, the number of concurrent end users, the volume of queries, the size of the underlying data warehouse, and cross-network latency are further factors to consider. Scaling out an environment has the potential to solve performance issues and stabilize the environment. When scoping out the infrastructure for an Oracle BI deployment, there are several crucial decisions to be made. These decisions can be greatly assisted by preparing properly, using Oracle's recommended guides for clustering and deploying Oracle BI on an enterprise scale.

Pre-Configuration Run-Down

Configuring the Oracle BI product suite, specifically when scaling out or setting up high availability (HA), takes preparation. Proactively taking steps to understand what it takes to correctly establish or pre-configure the infrastructure required to support any level of fault tolerance and high availability is critical. Even if the decision to scale out from the initial Oracle BI deployment hasn't been made, proper planning is recommended if the potential exists.

Shared Storage

We would be remiss not to highlight one of the most important concepts of scaling out Oracle BI, specifically for high availability: shared storage. The idea of shared storage is that, in a fault-tolerant environment, there are binary files and other configuration metadata that need to be shared across the nodes. If these common elements were not shared and one node were to fail, there would be a potential loss of data. Most important is that, in a highly available Oracle BI environment, there can be only one WebLogic Administration Server running for that environment at any one time. An HA configuration makes one Administration Server active while the other is passive. If the appropriate pre-configuration steps for shared storage (as well as the other items in the high-availability guide) are not properly completed, one should not expect accurate results from the environment.

OBIEE 12c requires you to modify the Singleton Data Directory (SDD) of your Oracle BI configuration, found at ORACLE_HOME/user_projects/domains/bi/data, so that the files within that path are moved to a shared storage location mounted on the scaled-out servers on which the HA configuration will be implemented. To change this, you need to modify the ORACLE_HOME/user_projects/domains/bi/config/fmwconfig/bienv/core/bi-environment.xml file to set the path of the bi:singleton-data-directory element to the full path of the shared mounted file location that contains a copy of the bidata folder, which will be referenced by one or more scaled-out HA Oracle BI 12c servers.
For example, change the XML file element:

<bi:singleton-data-directory>/oraclehome/user_projects/domains/bi/bidata/</bi:singleton-data-directory>

to reflect a shared NAS or SAN mount whose folder names and structure are in line with the IT team's standard naming conventions, where the /bidata folder is the folder from the main Oracle BI 12c instance that gets copied to the shared directory:

<bi:singleton-data-directory>/mount02/obiee_shared_settings/bidata/</bi:singleton-data-directory>

Clustering

A major benefit of Oracle BI's ability to leverage WebLogic Server as the Java application server tier is that, per the default installation, Oracle BI is established in a clustered architecture. There is no additional configuration necessary to set this architecture in motion. Clearly, installing Oracle BI on a single server only provides a single server with which to interface; however, upon doing so, Oracle BI is installed into a single-node clustered-application-server environment. Additional clustered nodes of Oracle BI can then be configured to establish and expand the cluster, either horizontally or vertically.

Vertical vs Horizontal

With respect to the enterprise architecture and infrastructure of the Oracle BI environment, a clustered environment can be expanded in one of two ways: horizontally (scale-out) and vertically (scale-up). A horizontal expansion is the typical expansion type when clustering. It is represented by installing and configuring the application on a separate physical server, with reference to the main server application. A vertical expansion is usually represented by expanding the application on the same physical server on which the main server application resides. A horizontally expanded system can then, additionally, be vertically expanded. There are benefits to both scaling options. The decision to scale the system one way or the other is usually predicated on the cost of additional physical servers, server limitations, peripherals such as memory or processors, or an increase in usage activity by end users. Some considerations that may be used to assess which approach is best for your specific implementation are as follows:

Load-balancing capabilities and the need for an active-active versus active-passive architecture
The need for failover or high availability
The cost of processor and memory enhancements versus the cost of new servers
The anticipated increase in concurrent user queries
The realized decrease in performance due to an increase in user activity

Oracle BI Server (System Component) Cluster Controller

When discussing scaling out the Oracle BI Server cluster, it is a common mistake to confuse WebLogic Server application clustering with the Oracle BI Server Cluster Controller. Currently, Oracle BI can only have a single metadata repository (RPD) reference associated with an Oracle BI Server deployment instance at any single point in time. Because of this, the Oracle BI Server engine leverages a failover concept to ensure that some level of high availability exists when the environment is scaled. In an Oracle BI scaled-out clustered environment, a secondary node, which has an instance of Oracle BI installed, will contain a secondary Oracle BI Server engine. From the main Oracle BI Managed Server containing the primary Oracle BI Server instance, the secondary Oracle BI Server instance gets established as the failover server engine using the Oracle BI Server Cluster Controller. This configuration takes place in the Enterprise Manager Fusion Control console.
Based on this configuration, the scaled-out Oracle BI Server engine acts in an active-passive mode. That is to say, when the main Oracle BI Server engine instance fails, the secondary, or passive, Oracle BI Server engine becomes active to route requests and field queries.

Summary

This article serves as a useful introduction for the beginner to what WebLogic Server is.
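As a small illustration of the Node Manager lifecycle management mentioned earlier, here is a hedged WLST (Jython) sketch of how a start-up script might ask Node Manager to start the Oracle BI managed server. The credentials, host, port, domain directory, and Node Manager type below are placeholders that will differ in your environment, so treat this as an outline rather than a tested script:

# Run with the WLST shell, for example:
#   $ORACLE_HOME/oracle_common/common/bin/wlst.sh start_bi.py   (path indicative)

nm_user    = 'weblogic'                                  # placeholder credentials
nm_pass    = 'welcome1'                                  # placeholder credentials
nm_host    = 'localhost'
nm_port    = '9506'                                      # placeholder Node Manager port
domain     = 'bi'
domain_dir = '/u01/app/oracle/user_projects/domains/bi'  # placeholder path

# Connect to the Node Manager process on this machine
nmConnect(nm_user, nm_pass, nm_host, nm_port, domain, domain_dir, 'SSL')

# Ask Node Manager to start the bi_server1 managed server, then disconnect
nmStart('bi_server1')
nmDisconnect()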


Learn from Data

Packt
09 Mar 2017
6 min read
In this article by Rushdi Shams, the author of the book Java Data Science Cookbook, we will cover recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class. In contrast to classification, regression models attempt to predict a value from a numeric class.

Generating linear regression models

Most linear regression modelling follows a general pattern: there will be many independent variables that collectively produce a result, which is the dependent variable. For instance, we can generate a regression model to predict the price of a house based on different attributes/features of the house (mostly numeric, real values), such as its size in square feet, the number of bedrooms, the number of washrooms, the importance of its location, and so on. In this recipe, we will use Weka's Linear Regression classifier to generate a regression model.

Getting ready

In order to perform the recipes in this section, we will require the following:

To download Weka, go to http://www.cs.waikato.ac.nz/ml/weka/downloading.html, where you will find download options for Windows, Mac, and other operating systems such as Linux. Read through the options carefully and download the appropriate version. At the time of writing, 3.9.0 was the latest developer version; as the author already had a version 1.8 JVM installed on his 64-bit Windows machine, he chose to download the self-extracting executable for 64-bit Windows without a Java Virtual Machine (JVM).

After the download is complete, double-click on the executable file and follow the on-screen instructions. You need to install the full version of Weka.

Once the installation is done, do not run the software. Instead, go to the directory where you installed it and find the Java archive file for Weka (weka.jar). Add this file to your Eclipse project as an external library.

If you need to download older versions of Weka for some reason, all of them can be found at https://sourceforge.net/projects/weka/files/. Please note that many of the methods from old versions may be deprecated and therefore no longer supported.

How to do it…

In this recipe, the linear regression model we will be creating is based on the cpu.arff dataset, which can be found in the data directory of the Weka installation directory.

Our code will have two instance variables: the first will contain the data instances of the cpu.arff file, and the second will be our linear regression classifier.

Instances cpu = null;
LinearRegression lReg;

Next, we will create a method to load the ARFF file and assign the last attribute of the ARFF file as its class attribute.

public void loadArff(String arffInput){
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        cpu = source.getDataSet();
        // The last attribute is the class (target) attribute
        cpu.setClassIndex(cpu.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

We will then create a method to build the linear regression model. To do so, we simply need to call the buildClassifier() method of our linear regression variable. The model can be sent directly as a parameter to System.out.println().
public void buildRegression(){
    lReg = new LinearRegression();
    try {
        lReg.buildClassifier(cpu);
    } catch (Exception e) {
    }
    System.out.println(lReg);
}

The complete code for the recipe is as follows:

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLinearRegressionTest {

    Instances cpu = null;
    LinearRegression lReg;

    public void loadArff(String arffInput){
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            cpu = source.getDataSet();
            cpu.setClassIndex(cpu.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression(){
        lReg = new LinearRegression();
        try {
            lReg.buildClassifier(cpu);
        } catch (Exception e) {
        }
        System.out.println(lReg);
    }

    public static void main(String[] args) throws Exception{
        WekaLinearRegressionTest test = new WekaLinearRegressionTest();
        test.loadArff("path to the cpu.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Linear Regression Model

class = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX + -56.075

Generating logistic regression models

Weka has a class named Logistic that can be used for building and using a multinomial logistic regression model with a ridge estimator. Although original logistic regression does not deal with instance weights, the algorithm in Weka has been modified to handle them. In this recipe, we will use Weka to generate a logistic regression model on the iris dataset.

How to do it…

We will be generating a logistic regression model from the iris dataset, which can be found in the data directory of the Weka installation folder.

Our code will have two instance variables: one will contain the data instances of the iris dataset, and the other will be the logistic regression classifier.

Instances iris = null;
Logistic logReg;

We will be using a method to load and read the dataset as well as assign its class attribute (the last attribute of the iris.arff file):

public void loadArff(String arffInput){
    DataSource source = null;
    try {
        source = new DataSource(arffInput);
        iris = source.getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);
    } catch (Exception e1) {
    }
}

Next, we will create the most important method of our recipe, which builds a logistic regression classifier from the iris dataset:

public void buildRegression(){
    logReg = new Logistic();
    try {
        logReg.buildClassifier(iris);
    } catch (Exception e) {
    }
    System.out.println(logReg);
}

The complete executable code for the recipe is as follows:

import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaLogisticRegressionTest {

    Instances iris = null;
    Logistic logReg;

    public void loadArff(String arffInput){
        DataSource source = null;
        try {
            source = new DataSource(arffInput);
            iris = source.getDataSet();
            iris.setClassIndex(iris.numAttributes() - 1);
        } catch (Exception e1) {
        }
    }

    public void buildRegression(){
        logReg = new Logistic();
        try {
            logReg.buildClassifier(iris);
        } catch (Exception e) {
        }
        System.out.println(logReg);
    }

    public static void main(String[] args) throws Exception{
        WekaLogisticRegressionTest test = new WekaLogisticRegressionTest();
        test.loadArff("path to the iris.arff file");
        test.buildRegression();
    }
}

The output of the code is as follows:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                          Class
Variable             Iris-setosa    Iris-versicolor
===================================================
sepallength              21.8065             2.4652
sepalwidth                4.5648             6.6809
petallength             -26.3083            -9.4293
petalwidth               -43.887           -18.2859
Intercept                 8.1743             42.637

Odds Ratios...
                          Class
Variable             Iris-setosa    Iris-versicolor
===================================================
sepallength      2954196659.8892            11.7653
sepalwidth               96.0426           797.0304
petallength                    0             0.0001
petalwidth                     0                  0

The interpretation of the results of this recipe is beyond the scope of this article. Interested readers are encouraged to see the Stack Overflow discussion at http://stackoverflow.com/questions/19136213/how-to-interpret-weka-logistic-regression-output.

Summary

In this article, we have covered recipes that use machine learning techniques to learn patterns from data. These patterns are at the centre of attention for at least three key machine-learning tasks: classification, regression, and clustering. Classification is the task of predicting a value from a nominal class.


Reading the Fine Manual

Packt
09 Mar 2017
34 min read
In this article by Simon Riggs, Gabriele Bartolini, Hannu Krosing, and Gianni Ciolli, the authors of the book PostgreSQL Administration Cookbook - Third Edition, you will learn the following recipes:

Reading The Fine Manual (RTFM)
Planning a new database
Changing parameters in your programs
Finding the current configuration settings
Which parameters are at nondefault settings?
Updating the parameter file
Setting parameters for particular groups of users
The basic server configuration checklist
Adding an external module to PostgreSQL
Using an installed module
Managing installed extensions

I get asked many questions about parameter settings in PostgreSQL. Everybody's busy, and most people want a 5-minute tour of how things work. That's exactly what a Cookbook does, so we'll do our best. Some people believe that there are magical parameter settings that will improve their performance, and spend hours combing the pages of books to glean insights. Others feel comfortable because they have found some website somewhere that "explains everything", and they "know" they have their database configured OK. For the most part, the settings are easy to understand. Finding the best setting can be difficult, and the optimal setting may change over time in some cases. This article is mostly about knowing how, when, and where to change parameter settings.

Reading The Fine Manual (RTFM)

RTFM is often used rudely to mean "don't bother me, I'm busy", or it is used as a stronger form of abuse. The strange thing is that asking you to read a manual is most often very good advice. Don't flame the advisors back; take the advice! The most important point to remember is that you should refer to a manual whose release version matches that of the server on which you are operating. The PostgreSQL manual is very well written and comprehensive in its coverage of specific topics. However, one of its main failings is that the "documents" aren't organized in a way that helps somebody who is trying to learn PostgreSQL. They are organized from the perspective of people checking specific technical points so that they can decide whether their difficulty is a user error or not. It sometimes answers "What?" but seldom "Why?" or "How?" I've helped write sections of the PostgreSQL documents, so I'm not embarrassed to steer you towards reading them. There are, nonetheless, many things to read here that are useful.

How to do it…

The main documents for each PostgreSQL release are available at http://www.postgresql.org/docs/manuals/. The most frequently accessed parts of the documents are as follows:

SQL command reference, as well as client and server tools reference: http://www.postgresql.org/docs/current/interactive/reference.html
Configuration: http://www.postgresql.org/docs/current/interactive/runtime-config.html
Functions: http://www.postgresql.org/docs/current/interactive/functions.html

You can also grab yourself a PDF version of the manual, which can allow easier searching in some cases. Don't print it! The documents are more than 2000 pages of A4-sized sheets.

How it works…

The PostgreSQL documents are written in SGML, which is similar to, but not the same as, XML. These files are then processed to generate HTML files, PDFs, and so on. This ensures that all the formats have exactly the same content. Then, you can choose the format you prefer, and you can even compile it in other formats such as EPUB, INFO, and so on.
Moreover, the PostgreSQL manual is actually a subset of the PostgreSQL source code, so it evolves together with the software. It is written by the same people who make PostgreSQL. Even more reasons to read it!

There's more…

More information is also available at http://wiki.postgresql.org. Many distributions offer packages that install static versions of the HTML documentation. For example, on Debian and Ubuntu, the docs for the most recent stable PostgreSQL version are named postgresql-9.6-docs (unsurprisingly).

Planning a new database

Planning a new database can be a daunting task. It's easy to get overwhelmed by it, so here we present some planning ideas. It's also easy to charge headlong at the task, thinking that whatever you know is all you'll ever need to consider.

Getting ready

You are ready. Don't wait to be told what to do. If you haven't been told what the requirements are, then write down what you think they are, clearly labeling them as "assumptions" rather than "requirements"; we mustn't confuse the two things. Iterate until you get some agreement, and then build a prototype.

How to do it…

Write a document that covers the following items:

Database design: plan your database design
Calculate the initial database sizing
Transaction analysis: how will we access the database?
Look at the most frequent access paths
What are the requirements for response times?
Hardware configuration
Initial performance thoughts: will all of the data fit into RAM?
Choose the operating system and filesystem type
How do we partition the disk?
Localization plan: decide server encoding, locale, and time zone
Access and security plan: identify client systems and specify required drivers
Create roles according to a plan for access control
Specify pg_hba.conf
Maintenance plan: who will keep it working? How?
Availability plan: consider the availability requirements, including checkpoint_timeout
Plan your backup mechanism and test it
High-availability plan: decide which form of replication you'll need, if any

How it works…

One of the most important reasons for planning your database ahead of time is that retrofitting some things is difficult. This is especially true of server encoding and locale, which can cause much downtime and exertion if we need to change them later. Security is also much more difficult to set up after the system is live.

There's more…

Planning always helps. You may know what you're doing, but others may not. Tell everybody what you're going to do before you do it to avoid wasting time. If you're not sure yet, then build a prototype to help you decide. Approach the administration framework as if it were a development task. Make a list of things you don't know yet, and work through them one by one. This is deliberately a very short recipe. Everybody has their own way of doing things, and it's very important not to be too prescriptive about how to do things. If you already have a plan, great! If you don't, think about what you need to do, make a checklist, and then do it.

Changing parameters in your programs

PostgreSQL allows you to set some parameter settings for each session or transaction.

How to do it…

You can change the value of a setting during your session, like this:

SET work_mem = '16MB';

This value will then be used for every future transaction.
You can also change it only for the duration of the "current transaction":

SET LOCAL work_mem = '16MB';

The setting will last until you issue this command:

RESET work_mem;

Alternatively, you can issue the following command:

RESET ALL;

SET and RESET commands are SQL commands that can be issued from any interface. They apply only to PostgreSQL server parameters, but this does not mean that they affect the entire server. In fact, the parameters you can change with SET and RESET apply only to the current session. Also, note that there may be other parameters, such as JDBC driver parameters, that cannot be set in this way.

How it works…

Suppose you change the value of a setting during your session, for example, by issuing this command:

SET work_mem = '16MB';

Then, the following will show up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session

This remains the case until you issue this command:

RESET work_mem;

After issuing it, the setting returns to reset_val and the source returns to default:

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | default

There's more…

You can change the value of a setting during your transaction as well, like this:

SET LOCAL work_mem = '16MB';

Then, this will show up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 1024    | 1024      | session

Huh? What happened to your parameter setting? The SET LOCAL command takes effect only for the transaction in which it was executed, which was just the SET LOCAL command in our case. We need to execute it inside a transaction block to be able to see the setting take hold, as follows:

BEGIN;
SET LOCAL work_mem = '16MB';

Here is what shows up in the pg_settings catalog view:

postgres=# SELECT name, setting, reset_val, source
FROM pg_settings WHERE source = 'session';

   name   | setting | reset_val | source
----------+---------+-----------+---------
 work_mem | 16384   | 1024      | session

You should also note that the value of source is session rather than transaction, as you might have been expecting.

Finding the current configuration settings

At some point, it will occur to you to ask, "What are the current configuration settings?" Most settings can be changed in more than one way, and some ways do not affect all users or all sessions, so it is quite possible to get confused.

How to do it…

Your first thought is probably to look in postgresql.conf, which is the configuration file, described in detail in the Updating the parameter file recipe. That works, but only as long as there is only one parameter file. If there are two, then maybe you're reading the wrong file! (How do you know?) So, the cautious and accurate way is not to trust a text file, but to trust the server itself. Moreover, you learned in the previous recipe, Changing parameters in your programs, that each parameter has a scope that determines when it can be set. Some parameters can be set through postgresql.conf, but others can be changed afterwards. So, the current value of configuration settings may have been subsequently changed.
We can use the SHOW command like this:

postgres=# SHOW work_mem;

Its output is as follows:

 work_mem
----------
 1MB
(1 row)

However, remember that it reports the current setting at the time it is run, and that can be changed in many places. Another way of finding the current settings is to access a PostgreSQL catalog view named pg_settings:

postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pg_settings WHERE name = 'work_mem';
-[ RECORD 1 ]--------------------------------------------------------
name       | work_mem
setting    | 1024
unit       | kB
category   | Resource Usage / Memory
short_desc | Sets the maximum memory to be used for query workspaces.
extra_desc | This much memory can be used by each internal sort operation and hash table before switching to temporary disk files.
context    | user
vartype    | integer
source     | default
min_val    | 64
max_val    | 2147483647
enumvals   |
boot_val   | 1024
reset_val  | 1024
sourcefile |
sourceline |

Thus, you can use the SHOW command to retrieve the value for a setting, or you can access the full details via the catalog table.

There's more…

The actual location of each configuration file can be asked of the PostgreSQL server directly, as shown in this example:

postgres=# SHOW config_file;

This returns the following output:

               config_file
------------------------------------------
 /etc/postgresql/9.4/main/postgresql.conf
(1 row)

The other configuration files can be located by querying similar variables, hba_file and ident_file.

How it works…

Each parameter setting is cached within each session, so we get fast and easy access to the parameter settings. Remember that the values displayed are not necessarily settings for the server as a whole. Many of those parameters will be specific to the current session. That's different from what you experience with many other database systems, and it is also very useful.

Which parameters are at nondefault settings?

Often, we need to check which parameters have been changed or whether our changes have correctly taken effect. In the previous two recipes, we have seen that parameters can be changed in several ways, and with different scope. You learned how to inspect the value of one parameter or get the full list of parameters. In this recipe, we will show you how to use SQL capabilities to list only those parameters whose value in the current session differs from the system-wide default value. This list is valuable for several reasons. First, it includes only a few of the 200-plus available parameters, so it is more immediate. Also, it is difficult to remember all our past actions, especially in the middle of a long or complicated session. Version 9.4 introduces the ALTER SYSTEM syntax, which we will describe in the next recipe, Updating the parameter file. From the viewpoint of this recipe, its behavior is quite different from all other setting-related commands; you run it from within your session and it changes the default value, but not the value in your session.
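As a quick illustration of that last point, here is a minimal sketch (the parameter and value are only examples, and ALTER SYSTEM is covered properly in the next recipe):

ALTER SYSTEM SET work_mem = '32MB';  -- written to postgresql.auto.conf, not applied yet
SELECT pg_reload_conf();             -- ask the server to reread its configuration files
SHOW work_mem;                       -- a value you have explicitly SET in this session still wins

After the reload, the new value becomes the server-wide default for sessions that have not set the parameter themselves, but it does not override a value you have already set explicitly in your current session.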
How to do it… We write a SQL query that lists all parameter values, excluding those whose current value is either the default or set from a configuration file: postgres=# SELECT name, source, setting FROM pg_settings WHERE source != 'default' AND source != 'override' ORDER by 2, 1; The output is as follows: name | source | setting ----------------------------+----------------------+----------------- application_name | client | psql client_encoding | client | UTF8 DateStyle | configuration file | ISO, DMY default_text_search_config | configuration file | pg_catalog.english dynamic_shared_memory_type | configuration file | posix lc_messages | configuration file | en_GB.UTF-8 lc_monetary | configuration file | en_GB.UTF-8 lc_numeric | configuration file | en_GB.UTF-8 lc_time | configuration file | en_GB.UTF-8 log_timezone | configuration file | Europe/Rome max_connections | configuration file | 100 port | configuration file | 5460 shared_buffers | configuration file | 16384 TimeZone | configuration file | Europe/Rome max_stack_depth | environment variable | 2048 How it works… You can see from pg_settings which parameters have nondefault values and what the source of the current value is. The SHOW command doesn't tell you whether a parameter is set at a nondefault value. It just tells you the value, which isn't of much help if you're trying to understand what is set and why. If the source is a configuration file, then the sourcefile and sourceline columns are also set. These can be useful in understanding where the configuration came from. There's more… The setting column of pg_settings shows the current value, but you can also look at boot_val and reset_val. The boot_val parameter shows the value assigned when the PostgreSQL database cluster was initialized (initdb), while reset_val shows the value that the parameter will return to if you issue the RESET command. The max_stack_depth parameter is an exception because pg_settings says it is set by the environment variable, though it is actually set by ulimit -s on Linux and Unix systems. The max_stack_depth parameter just needs to be set directly on Windows. The time zone settings are also picked up from the OS environment, so you shouldn't need to set those directly. In older releases, pg_settings showed them as command-line settings. From version 9.1 onwards, they are written to postgresql.conf when the data directory is initialized, so they show up as configuration files. Updating the parameter file The parameter file is the main location for defining parameter values for the PostgreSQL server. All the parameters can be set in the parameter file, which is known as postgresql.conf. There are also two other parameter files: pg_hba.conf and pg_ident.conf. Both of these relate to connections and security. Getting ready First, locate postgresql.conf, as described earlier. How to do it… Some of the parameters take effect only when the server is first started. A typical example might be shared_buffers, which defines the size of the shared memory cache. Many of the parameters can be changed while the server is still running. After changing the required parameters, we issue a reload operation to the server, forcing PostgreSQL to reread the postgresql.conf file (and all other configuration files). There is a number of ways to do that, depending on your distribution and OS. The most common is to issue the following command: pg_ctl reload with the same OS user that runs the PostgreSQL server process. 
This assumes the default data directory; otherwise, you have to specify the correct data directory with the -D option. As noted earlier, Debian and Ubuntu have a different multiversion architecture, so you should issue the following command instead:

pg_ctlcluster 9.6 main reload

On modern distributions you can also use systemd, as follows:

sudo systemctl reload postgresql@9.6-main

Some other parameters require a restart of the server for changes to take effect, for instance, max_connections, listen_addresses, and so on. The syntax is very similar to a reload operation, as shown here:

pg_ctl restart

For Debian and Ubuntu, use this command:

pg_ctlcluster 9.6 main restart

and with systemd:

sudo systemctl restart postgresql@9.6-main

The postgresql.conf file is a normal text file that can be simply edited. Most of the parameters are listed in the file, so you can just search for them and then insert the desired value in the right place.

How it works…

If you set the same parameter twice in different parts of the file, the last setting is what applies. This can cause lots of confusion if you add settings to the bottom of the file, so you are advised against doing that. The best practice is to either leave the file as it is and edit the values, or to start with a blank file and include only the values that you wish to change. I personally prefer a file with only the nondefault values. That makes it easier to see what's happening. Whichever method you use, you are strongly advised to keep all the previous versions of your .conf files. You can do this by copying, or you can use a version control system such as Git or SVN.

There's more…

The postgresql.conf file also supports an include directive. This allows the postgresql.conf file to reference other files, which can then reference other files, and so on. That may help you organize your parameter settings better, if you don't make it too complicated. If you are working with PostgreSQL version 9.4 or later, you can change the values stored in the parameter files directly from your session, with syntax such as the following:

ALTER SYSTEM SET shared_buffers = '1GB';

This command will not actually edit postgresql.conf. Instead, it writes the new setting to another file named postgresql.auto.conf. The effect is equivalent, albeit achieved in a safer way: the original configuration file is never written to, so it cannot be damaged in the event of a crash. If you mess up with too many ALTER SYSTEM commands, you can always delete postgresql.auto.conf manually and reload the configuration, or restart PostgreSQL, depending on which parameters you had changed.

Setting parameters for particular groups of users

PostgreSQL supports a variety of ways of defining parameter settings for various user groups. This is very convenient, especially to manage user groups that have different requirements.

How to do it…

For all users in the saas database, use the following command:

ALTER DATABASE saas SET configuration_parameter = value1;

For a user named simon connected to any database, use this:

ALTER ROLE simon SET configuration_parameter = value2;

Alternatively, you can set a parameter for a user only when connected to a specific database, as follows:

ALTER ROLE simon IN DATABASE saas SET configuration_parameter = value3;

The user won't know that these have been executed specifically for them. These are default settings, and in most cases, they can be overridden if the user requires nondefault values.
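If you later want to review which of these per-database and per-role defaults have been stored, one way to do so (a sketch, not part of the recipe itself) is to query the pg_db_role_setting catalog, where PostgreSQL records them:

SELECT d.datname, r.rolname, s.setconfig
FROM pg_db_role_setting s
LEFT JOIN pg_database d ON d.oid = s.setdatabase
LEFT JOIN pg_roles r ON r.oid = s.setrole;

A NULL datname or rolname means that the setting applies to all databases or to all roles, respectively. In psql, the \drds meta-command shows the same information.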
How it works… You can set parameters for each of the following: Database User (which is named role by PostgreSQL) Database/user combination Each of the parameter defaults is overridden by the one below it. In the preceding three SQL statements: If user hannu connects to the saas database, then value1 will apply If user simon connects to a database other than saas, then value2 will apply If user simon connects to the saas database, then value3 will apply PostgreSQL implements this in exactly the same way as if the user had manually issued the equivalent SET statements immediately after connecting. The basic server configuration checklist PostgreSQL arrives configured for use on a shared system, though many people want to run dedicated database systems. The PostgreSQL project wishes to ensure that PostgreSQL will play nicely with other server software, and will not assume that it has access to the full server resources. If you, as the system administrator, know that there is no other important server software running on this system, then you can crank up the values much higher. Getting ready Before we start, we need to know two sets of information: We need to know the size of the physical RAM that will be dedicated to PostgreSQL We need to know something about the types of applications for which we will use PostgreSQL How to do it… If your database is larger than 32 MB, then you'll probably benefit from increasing shared_buffers. You can increase this to much larger values, but remember that running out of memory induces many problems. For instance, PostgreSQL is able to store information to the disk when the available memory is too small, and it employs sophisticated algorithms to treat each case differently and to place each piece of data either in the disk or in the memory, depending on each use case. On the other hand, overstating the amount of available memory confuses such abilities and results in suboptimal behavior. For instance, if the memory is swapped to disk, then PostgreSQL will inefficiently treat all data as if it were the RAM. Another unfortunate circumstance is when the Linux Out-Of-Memory (OOM) killer terminates one of the various processes spawned by the PostgreSQL server. So, it's better to be conservative. It is good practice to set a low value in your postgresql.conf and increment slowly to ensure that you get the benefits from each change. If you increase shared_buffers and you're running on a non-Windows server, you will almost certainly need to increase the value of the SHMMAX OS parameter (and on some platforms, other parameters as well). On Linux, Mac OS, and FreeBSD, you will need to either edit the /etc/sysctl.conf file or use sysctl -w with the following values: For Linux, use kernel.shmmax=value For Mac OS, use kern.sysv.shmmax=value For FreeBSD, use kern.ipc.shmmax=value There's more… For more information, you can refer to http://www.postgresql.org/docs/9.6/static/kernel-resources.html#SYSVIPC. For example, on Linux, add the following line to /etc/sysctl.conf: kernel.shmmax=value Don't worry about setting effective_cache_size. It is much less important a parameter than you might think. There is no need for too much fuss selecting the value. If you're doing heavy write activity, then you may want to set wal_buffers to a much higher value than the default. 
In fact wal_buffers is set automatically from the value of shared_buffers, following a rule that fits most cases; however, it is always possible to specify an explicit value that overrides the computation for the very few cases where the rule is not good enough. If you're doing heavy write activity and/or large data loads, you may want to set checkpoint_segments higher than the default to avoid wasting I/O in excessively frequent checkpoints. If your database has many large queries, you may wish to set work_mem to a value higher than the default. However, remember that such a limit applies separately to each node in the query plan, so there is a real risk of overallocating memory, with all the problems discussed earlier. Ensure that autovacuum is turned on, unless you have a very good reason to turn it off—most people don't. Leave the settings as they are for now. Don't fuss too much about getting the settings right. You can change most of them later, so you can take an iterative approach to improving things.  And remember, don't touch the fsync parameter. It's keeping you safe. Adding an external module to PostgreSQL Another strength of PostgreSQL is its extensibility. Extensibility was one of the original design goals, going back to the late 1980s. Now, in PostgreSQL 9.6, there are many additional modules that plug into the core PostgreSQL server. There are many kinds of additional module offerings, such as the following: Additional functions Additional data types Additional operators Additional indexes Getting ready First, you'll need to select an appropriate module to install. The walk towards a complete, automated package management system for PostgreSQL is not over yet, so you need to look in more than one place for the available modules, such as the following: Contrib: The PostgreSQL "core" includes many functions. There is also an official section for add-in modules, known as "contrib" modules. They are always available for your database server, but are not automatically enabled in every database because not all users might need them. On PostgreSQL version 9.6, we will have more than 40 such modules. These are documented at http://www.postgresql.org/docs/9.6/static/contrib.html. PGXN: This is the PostgreSQL Extension Network, a central distribution system dedicated to sharing PostgreSQL extensions. The website started in 2010, as a repository dedicated to the sharing of extension files. Separate projects: These are large external projects, such as PostGIS, offering extensive and complex PostgreSQL modules. For more information, take a look at http://www.postgis.org/. How to do it… There are several ways to make additional modules available for your database server, as follows: Using a software installer Installing from PGXN Installing from a manually downloaded package Installing from source code Often, a particular module will be available in more than one way, and users are free to choose their favorite, exactly like PostgreSQL itself, which can be downloaded and installed through many different procedures. Installing modules using a software installer Certain modules are available exactly like any other software packages that you may want to install in your server. All main Linux distributions provide packages for the most popular modules, such as PostGIS, SkyTools, procedural languages other than those distributed with core, and so on. 
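For example, on Debian or Ubuntu the contrib modules for a given server version usually ship as a single package, so making them available is typically one command away (the exact package name depends on your distribution and PostgreSQL version; this one is only indicative):

sudo apt-get install postgresql-contrib-9.6

After that, the modules are present on the server, but they still need to be enabled in each database where you want to use them, as described later in the Using an installed module recipe.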
In some cases, modules can be added during installation if you're using a standalone installer application, for example, the OneClick installer, or tools such as rpm, apt-get, and YaST on Linux distributions. The same procedure can also be followed after the PostgreSQL installation, when the need for a certain module arrives. We will actually describe this case, which is way more common. For example, let's say that you need to manage a collection of Debian package files, and that one of your tasks is to be able to pick the latest version of one of them. You start by building a database that records all the package files. Clearly, you need to store the version number of each package. However, Debian version numbers are much more complex than what we usually call "numbers". For instance, on my Debian laptop, I currently have version 9.2.18-1.pgdg80+1 of the PostgreSQL client package. Despite being complicated, that string follows a clearly defined specification, which includes many bits of information, including how to compare two versions to establish which of them is older. Since this recipe discussed extending PostgreSQL with custom data types and operators, you might have already guessed that I will now consider a custom data type for Debian version numbers that is capable of tasks such as understanding the Debian version number format, sorting version numbers, choosing the latest version number in a given group, and so on. It turns out that somebody else already did all the work of creating the required PostgreSQL data type, endowed with all the useful accessories: comparison operators, input/output functions, support for indexes, and maximum/minimum aggregates. All of this has been packaged as a PostgreSQL extension, as well as a Debian package (not a big surprise), so it is just a matter of installing the postgresql-9.2-debversion package with a Debian tool such as apt-get, aptitude, or synaptic. On my laptop, that boils down to the command line: apt-get install postgresql-9.2-debversion This will download the required package and unpack all the files in the right locations, making them available to my PostgreSQL server. Installing modules from PGXN The PostgreSQL Extension Network, PGXN for short, is a website (http://pgxn.org) launched in late 2010 with the purpose of providing "a central distribution system for open source PostgreSQL extension libraries". Anybody can register and upload their own module, packaged as an extension archive. The website allows browsing available extensions and their versions, either via a search interface or from a directory of package and user names. The simple way is to use a command-line utility, called pgxnclient. It can be easily installed in most systems; see the PGXN website on how to do so. Its purpose is to interact with PGXN and take care of administrative tasks, such as browsing available extensions, downloading the package, compiling the source code, installing files in the proper place, and removing installed package files. Alternatively, you can download the extension files from the website and place them in the right place by following the installation instructions. PGXN is different from official repositories because it serves another purpose. Official repositories usually contain only seasoned extensions because they accept new software only after a certain amount of evaluation and testing. 
On the other hand, anybody can ask for a PGXN account and upload their own extensions, so there is no filter except requiring that the extension has an open source license and a few files that any extension must have. Installing modules from a manually downloaded package You might have to install a module that is correctly packaged for your system but is not available from the official package archives. For instance, it could be the case that the module has not been accepted in the official repository yet, or you could have repackaged a bespoke version of that module with some custom tweaks, which are so specific that they will never become official. Whatever the case, you will have to follow the installation procedure for standalone packages specific to your system. Here is an example with the Oracle compatibility module, described at http://postgres.cz/wiki/Oracle_functionality_(en): First, we get the package, say for PostgreSQL 8.4 on a 64-bit architecture, from http://pgfoundry.org/frs/download.php/2414/orafce-3.0.1-1.pg84.rhel5.x86_64.rpm. Then, we install the package in the standard way: rpm -ivh orafce-3.0.1-1.pg84.rhel5.x86_64.rpm If all the dependencies are met, we are done. I mentioned dependencies because that's one more potential problem in installing packages that are not officially part of the installed distribution—you can no longer assume that all software version numbers have been tested, all requirements are available, and there are no conflicts. If you get error messages that indicate problems in these areas, you may have to solve them yourself, by manually installing missing packages and/or uninstalling conflicting packages. Installing modules from source code In many cases, useful modules may not have full packaging. In these cases, you may need to install the module manually. This isn't very hard and it's a useful exercise that helps you understand what happens. Each module will have different installation requirements. There are generally two aspects of installing a module. They are as follows: Building the libraries (only for modules that have libraries) Installing the module files in the appropriate locations You need to follow the instructions for the specific module in order to build the libraries, if any are required. Installation will then be straightforward, and usually there will be a suitably prepared configuration file for the make utility so that you just need to type the following command: make install Each file will be copied to the right directory. Remember that you normally need to be a system superuser in order to install files on system directories. Once a library file is in the directory expected by the PostgreSQL server, it will be loaded automatically as soon as requested by a function. Modules such as auto_explain do not provide any additional user-defined function, so they won't be auto-loaded; that needs to be done manually by a superuser with a LOAD statement. How it works… PostgreSQL can dynamically load libraries in the following ways: Using the explicit LOAD command in a session Using the shared_preload_libraries parameter in postgresql.conf at server start At session start, using the local_preload_libraries parameter for a specific user, as set using ALTER ROLE PostgreSQL functions and objects can reference code in these libraries, allowing extensions to be bound tightly to the running server process. 
The tight binding makes this method suitable for use even in very high-performance applications, and there's no significant difference between additionally supplied features and native features.

Using an installed module

In this recipe, we will explain how to enable an installed module so that it can be used in a particular database. The additional types, functions, and so on will exist only in those databases where we have carried out this step. Although most modules require this procedure, there are actually a couple of notable exceptions. For instance, the auto_explain module mentioned earlier, which is shipped together with PostgreSQL, does not create any function, type, or operator. To use it, you must load its object file using the LOAD command. From that moment, all statements longer than a configurable threshold will be logged together with their execution plan. In the rest of this recipe, we will cover all the other modules. They do not require a LOAD statement because PostgreSQL can automatically load the relevant libraries when they are required. As mentioned in the previous recipe, Adding an external module to PostgreSQL, specially packaged modules are called extensions in PostgreSQL. They can be managed with dedicated SQL commands.

Getting ready

Suppose you have chosen to install a certain module among those available for your system (see the previous recipe, Adding an external module to PostgreSQL); all you need to know is the extension name.

How to do it…

Each extension has a unique name, so it is just a matter of issuing the following command:

CREATE EXTENSION myextname;

This will automatically create all the required objects inside the current database. For security reasons, you need to do so as a database superuser. For instance, if you want to install the dblink extension, type this:

CREATE EXTENSION dblink;

How it works…

When you issue a CREATE EXTENSION command, the database server looks for a file named EXTNAME.control in the SHAREDIR/extension directory. That file tells PostgreSQL some properties of the extension, including a description, some installation information, and the default version number of the extension (which is unrelated to the PostgreSQL version number). Then, a creation script is executed in a single transaction, so if it fails, the database is unchanged. The database server also notes in a catalog table the extension name and all the objects that belong to it.

Managing installed extensions

In the last two recipes, we showed you how to install external modules in PostgreSQL to augment its capabilities. In this recipe, we will show you some more capabilities offered by the extension infrastructure.

How to do it…

First, we list all available extensions:

postgres=# \x on
Expanded display is on.
postgres=# SELECT *
postgres-# FROM pg_available_extensions
postgres-# ORDER BY name;
-[ RECORD 1 ]-----+--------------------------------------------------
name              | adminpack
default_version   | 1.0
installed_version |
comment           | administrative functions for PostgreSQL
-[ RECORD 2 ]-----+--------------------------------------------------
name              | autoinc
default_version   | 1.0
installed_version |
comment           | functions for autoincrementing fields
(...)
In particular, if the dblink extension is installed, then we see a record like this:

-[ RECORD 10 ]----+--------------------------------------------------
name              | dblink
default_version   | 1.0
installed_version | 1.0
comment           | connect to other PostgreSQL databases from within a database

Now, we can list all the objects in the dblink extension, as follows:

postgres=# \x off
Expanded display is off.
postgres=# \dx+ dblink
Objects in extension "dblink"
                          Object Description
---------------------------------------------------------------------
 function dblink_build_sql_delete(text,int2vector,integer,text[])
 function dblink_build_sql_insert(text,int2vector,integer,text[],text[])
 function dblink_build_sql_update(text,int2vector,integer,text[],text[])
 function dblink_cancel_query(text)
 function dblink_close(text)
 function dblink_close(text,boolean)
 function dblink_close(text,text)
(...)

Objects created as part of extensions are not special in any way, except that you can't drop them individually. This is done to protect you from mistakes:

postgres=# DROP FUNCTION dblink_close(text);
ERROR: cannot drop function dblink_close(text) because extension dblink requires it
HINT: You can drop extension dblink instead.

Extensions might have dependencies too. The cube and earthdistance contrib extensions provide a good example, since the latter depends on the former:

postgres=# CREATE EXTENSION earthdistance;
ERROR: required extension "cube" is not installed
postgres=# CREATE EXTENSION cube;
CREATE EXTENSION
postgres=# CREATE EXTENSION earthdistance;
CREATE EXTENSION

As you can reasonably expect, dependencies are taken into account when dropping objects, just like for other objects:

postgres=# DROP EXTENSION cube;
ERROR: cannot drop extension cube because other objects depend on it
DETAIL: extension earthdistance depends on extension cube
HINT: Use DROP ... CASCADE to drop the dependent objects too.
postgres=# DROP EXTENSION cube CASCADE;
NOTICE: drop cascades to extension earthdistance
DROP EXTENSION

How it works…

The pg_available_extensions system view shows one row for each extension control file in the SHAREDIR/extension directory (see the Using an installed module recipe). The pg_extension catalog table records only the extensions that have actually been created. The psql command-line utility provides the \dx meta-command to examine extensions. It supports an optional plus sign (+) to control verbosity and an optional pattern for the extension name to restrict its range. Consider the following command:

\dx+ db*

This will list all extensions whose name starts with db, together with all their objects. The CREATE EXTENSION command creates all objects belonging to a given extension, and then records the dependency of each object on the extension in pg_depend. That's how PostgreSQL can ensure that you cannot drop one such object without dropping its extension. The extension control file admits an optional line, requires, that names one or more extensions on which the current one depends. The implementation of dependencies is still quite simple. For instance, there is no way to specify a dependency on a specific version number of other extensions, and there is no command that installs one extension and all its prerequisites. As a general PostgreSQL rule, the CASCADE keyword tells the DROP command to delete all the objects that depend on cube, which is the earthdistance extension in this example.

There's more…

Another system view, pg_available_extension_versions, shows all the versions available for each extension.
It can be valuable when there are multiple versions of the same extension available at the same time, for example, when making preparations for an extension upgrade. When a more recent version of an already installed extension becomes available to the database server, for instance because of a distribution upgrade that installs updated package files, the superuser can perform an upgrade by issuing the following command:

ALTER EXTENSION myext UPDATE TO '1.1';

This assumes that the author of the extension taught it how to perform the upgrade. Starting from version 9.6, the CASCADE option is also accepted by the CREATE EXTENSION syntax, with the meaning of "issue CREATE EXTENSION recursively to cover all dependencies". So, instead of creating extension cube before creating extension earthdistance, you could have just issued the following command:

postgres=# CREATE EXTENSION earthdistance CASCADE;
NOTICE: installing required extension "cube"
CREATE EXTENSION

Remember that CREATE EXTENSION … CASCADE will only work if all the extensions it tries to install have already been placed in the appropriate location.

Summary

In this article, we covered reading the manual (RTFM), planning a new database, changing parameters in your programs, finding the current configuration settings and the parameters at nondefault values, updating the parameter file, setting parameters for particular groups of users, the basic server configuration checklist, and adding, using, and managing installed modules and extensions.

Resources for Article:

Further resources on this subject:

PostgreSQL in Action [article]

PostgreSQL Cookbook - High Availability and Replication [article]

Running a PostgreSQL Database Server [article]

What is D3.js?

Packt
08 Mar 2017
13 min read
In this article by Ændrew H. Rininsland, the author of the book Learning D3.JS 4.x Data Visualization, we'll see what is new in D3 v4 and get started with Node and Git on the command line. (For more resources related to this topic, see here.) D3 (Data-Driven Documents), developed by Mike Bostock and the D3 community since 2011, is the successor to Bostock's earlier Protovis library. It allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL), and uses idioms that should be immediately familiar to anyone with experience of using the popular jQuery JavaScript library. Much like jQuery, in D3, you operate on elements by selecting them and then manipulating via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky to start with. After finishing this book, you should be able to understand D3 well enough to figure out the examples, tweaking them to fit your needs. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/d3. The fine-grained control and its elegance make D3 one of the most powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two-in that case you might want to use a library designed for charting. Many use D3 internally anyway. For a massive list, visit https://github.com/sorrycc/awesome-javascript#data-visualization. D3 is ultimately based around functional programming principles, which is currently experience a renaissance in the JavaScript community. This book really isn't about functional programming, but a lot of what we'll be doing will seem really familiar if you've ever used functional programming principles before. What happened to all the classes?! The second edition of this book contained quite a number of examples using the new class feature that is new in ES2015. The revised examples in this edition all use factory functions instead, and the class keyword never appears. Why is this, exactly? ES2015 classes are essentially just syntactic sugaring for factory functions. By this I mean that they ultimately compile down to that anyway. While classes can provide a certain level of organization to a complex piece of code, they ultimately hide what is going on underneath it all. Not only that, using OO paradigms like classes are effectively avoiding one of the most powerful and elegant aspects of JavaScript as a language, which is its focus on first-class functions and objects. Your code will be simpler and more elegant using functional paradigms than OO, and you'll find it less difficult to read examples in the D3 community, which almost never use classes. There are many, much more comprehensive arguments against using classes than I'm able to make here. For one of the best, please read Eric Elliott's excellent "The Two Pillars of JavaScript" pieces, at medium.com/javascript-scene/the-two-pillars-of-javascript-ee6f3281e7f3. What's new in D3 v4? One of the key changes to D3 since the last edition of this book is the release of version 4. Among its many changes, the most significant is a complete overhaul of the D3 namespace. 
This means that none of the examples in this book will work with D3 3.x, and the examples from the last book will not work with D3 4.x. This is quite possibly the cruelest thing Mr. Bostock could ever do to educational authors such as myself (I am totally joking here). Kidding aside, it also means many of the "block" examples in the D3 community are out-of-date and may appear rather odd if this book is your first encounter with the library. For this reason, it is very important to note the version of D3 an example uses - if it uses 3.x, it might be worth searching for a 4.x example just to prevent this cognitive dissonance. Related to this is how D3 has been broken up from a single library into many smaller libraries. There are two approaches you can take: you can use D3 as a single library in much the same way as version 3, or you can selectively use individual components of D3 in your project. This book takes the latter route, even if it does take a bit more effort - the benefit is primarily in that you'll have a better idea of how D3 is organized as a library and it reduces the size of the final bundle people who view your graphics will have to download. What's ES2017? One of the main changes to this book since the first edition is the emphasis on modern JavaScript; in this case, ES2017. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this book: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2017 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2017) now. The big release was ES2015, which more or less maps to ES6. ES2016 was ratified in June 2016, and builds on the previous year's standard, while adding a few fixes and two new features. ES2017 is currently in the draft stage, which means proposals for new features are being considered and developed until it is ratified sometime in 2017. As a result of this book being written while these features are in draft, they may not actually make it into ES2017 and thus need to wait until a later standard to be officially added to the language. You don't really need to worry about any of this, however, because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. 
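To give a flavour of the syntax, here is a tiny sketch (not taken from the book's repository) that uses a few ES2015 features, namely const, arrow functions, default and destructured parameters, and template literals, to build the kind of factory function discussed earlier:

// A chart "factory": state lives in the closure instead of on a class instance
const chartFactory = ({ width = 400, height = 200 } = {}) => {
  let data = [];

  const chart = {
    data(newData) {
      data = newData;
      return chart; // returning the object allows chaining, much like D3's own APIs
    },
    describe() {
      return `A ${width}x${height} chart with ${data.length} data points`;
    },
  };

  return chart;
};

const chart = chartFactory({ width: 600 });
console.log(chart.data([1, 2, 3]).describe()); // "A 600x200 chart with 3 data points"

None of this requires D3; it is only meant to show the style of JavaScript the examples are written in.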
I try to refer to the relevant spec where a feature is added when I introduce it for the sake of accuracy (for instance, modules are an ES2015 feature), but when I refer to JavaScript, I mean all modern JavaScript, regardless of which ECMAScript spec it originated in.

Getting started with Node and Git on the command line

I will try not to be too opinionated in this book about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. Later on in this book, I'll show you how to write a server application in Node, but for now, let's just concentrate on getting it and npm (the brilliant and amazing package manager that Node uses) installed. If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node:

$ brew install n
$ n latest

Regardless of how you do it, once you finish, verify by running the following lines:

$ node --version
$ npm --version

If it displays the versions of node and npm, it means you're good to go. I'm using 6.5.0 and 3.10.3, respectively, though yours might be slightly different; the key is making sure node is at least version 6.0.0. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable. In the last edition of this book, we did a bunch of annoying stuff with Webpack and Babel and it was a bit too configuration-heavy to adequately explain. This time around we're using the lovely jspm for everything, which handles all the finicky annoying stuff for us. Install it now, using npm:

npm install -g jspm@beta jspm-server

This installs the most up-to-date beta version of jspm and the jspm development server. We don't need Webpack this time around because Rollup (which is used to bundle D3 itself) is used to bundle our projects, and jspm handles our Babel config for us. How helpful! Next, you'll want to clone the book's repository from GitHub. Change to your project directory and type this:

$ git clone https://github.com/aendrew/learning-d3-v4
$ cd learning-d3-v4

This will clone the development environment and all the samples into the learning-d3-v4/ directory, as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub Pages, and even submit suggestions and amendments to the parent project. This will help us improve this book for future editions. To do this, fork aendrew/learning-d3-v4 by clicking the "fork" button on GitHub, and replace aendrew in the preceding code snippet with your GitHub username. To switch between branches, type the following command:

$ git checkout <branch name>

Replace <branch name> with the name of the branch you want to switch to. Stay at master for now, though. To get back to it, type this line:

$ git stash save && git checkout master

The master branch is where you'll do a lot of your coding as you work through this book.
It includes a prebuilt config.js file (used by jspm to manage dependencies), which we'll use to aid our development over the course of this book. We still need to install our dependencies, so let's do that now: $ npm install All of the source code that you'll be working on is in the lib/ folder. You'll notice it contains a just a main.js file; almost always, we'll be working in main.js, as index.html is just a minimal container to display our work in. This is it in its entirety, and it's the last time we'll look at any HTML in this book: <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Learning D3</title> </head> <body> <script src="jspm_packages/system.js"></script> <script src="config.js"></script> <script> System.import('lib/main.js'); </script> </body> </html> There's also an empty stylesheet in styles/index.css, which we'll add to in a bit. To get things rolling, start the development server by typing the following line: $ npm start This starts up the jspm development server, which will transform our new-fangled ES2017 JavaScript into backwards-compatible ES5, which can easily be loaded by most browsers. Instead of loading in a compiled bundle, we use SystemJS directly and load in main.js. When we're ready for production, we'll use jspm bundle to create an optimized JS payload. Now point Chrome (or whatever, I'm not fussy - so long as it's not Internet Explorer!) to localhost:8080 and fire up the developer console ( Ctrl +  Shift + J for Linux and Windows and option + command + J for Mac). You should see a blank website and a blank JavaScript console with a Command Prompt waiting for some code: A quick Chrome Developer Tools primer Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this book shorter, we'll stick to just Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice, and - yeah yeah, I hear you guys at the back; Opera is good too! We are mostly going to use the Elements and Console tabs, Elements to inspect the DOM and Console to play with JavaScript code and look for any problems. The other six tabs come in handy for large projects: The Network tab will let you know how long files are taking to load and help you inspect the Ajax requests. The Profiles tab will help you profile JavaScript for performance. The Resources tab is good for inspecting client-side data. Timeline and Audits are useful when you have a global variable that is leaking memory and you're trying to work out exactly why your library is suddenly causing Chrome to use 500 MB of RAM. While I've used these in D3 development, they're probably more useful when building large web applications with frameworks such as React and Angular. The main one you want to focus on, however, is Sources, which shows all the source code files that have been pulled in by the webpage. Not only is this useful in determining whether your code is actually loading, it contains a fully functional JavaScript debugger, which few mortals dare to use. While explaining how to debug code is kind of boring and not at the level of this article, learning to use breakpoints instead of perpetually using console.log to figure out what your code is doing is a skill that will take you far in the years to come. 
For a good overview, visit https://developers.google.com/web/tools/chrome-devtools/debug/breakpoints/step-code?hl=en Most of what you'll do with Developer Tools, however, is look at the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are impacting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. Resources for Article: Further resources on this subject: Learning D3.js Mapping [article] Integrating a D3.js visualization into a simple AngularJS application [article] Simple graphs with d3.js [article]

Data Pipelines

Packt
03 Mar 2017
17 min read
In this article by Andrew Morgan, Antoine Amend, Matthew Hallett, and David George, the authors of the book Mastering Spark for Data Science, readers will learn how to construct a content register and use it to track all input loaded to the system, and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process.

In this article we will cover the following topics:

Welcome the GDELT Dataset

Data Pipelines

Universal Ingestion Framework

Real-time monitoring for new data

Receiving Streaming Data via Kafka

Registering new content and vaulting for tracking purposes

Visualization of content metrics in Kibana - to monitor ingestion processes & data health

(For more resources related to this topic, see here.)

Data Pipelines

Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that's a whole topic for another book!). We have already seen that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision into two distinct areas: ad-hoc and scheduled.

Ad-hoc data acquisition is the most common method during prototyping and small-scale analytics, as it usually doesn't require any additional software to implement - the user requires some data and simply downloads it from source as and when required. This method is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secure.

Scheduled data acquisition is used in more controlled environments for large-scale and production analytics; there is also an excellent case for ingesting a dataset into a data lake for possible future use. With Internet of Things (IoT) on the increase, huge volumes of data are being produced, and in many cases, if the data is not ingested now, it is lost forever. Much of this data may not have an immediate or apparent use today, but could do in the future; so the mindset is to gather all of the data in case it is needed and delete it later when sure it is not.

It's clear we need a flexible approach to data acquisition that supports a variety of procurement options.

Universal Ingestion Framework

There are many ways to approach data acquisition, ranging from home-grown bash scripts through to high-end commercial tools. The aim of this section is to introduce a highly flexible framework that we can use for small-scale data ingest, and then grow as our requirements change - all the way through to a full, corporately managed workflow if needed - and that framework will be built using Apache NiFi. NiFi enables us to build large-scale integrated data pipelines that move data around the planet. In addition, it's also incredibly flexible and easy to build simple pipelines - usually quicker even than using Bash or any other traditional scripting method.

If an ad-hoc approach is taken to source the same dataset on a number of occasions, then some serious thought should be given as to whether it falls into the scheduled category, or at least whether a more robust storage and versioning setup should be introduced.
We have chosen to use Apache NiFi as it offers a solution that provides the ability to create many, varied complexity pipelines that can be scaled to truly Big Data and IoT levels, and it also provides a great drag & drop interface (using what’s known as flow-based programming[1]). With patterns, templates and modules for workflow production, it automatically takes care of many of the complex features that traditionally plague developers such as multi-threading, connection management and scalable processing. For our purposes it will enable us to quickly build simple pipelines for prototyping, and scale these to full production where required. It’s pretty well documented and easy to get running https://nifi.apache.org/download.html, it runs in a browser and looks like this: https://en.wikipedia.org/wiki/Flow-based_programming We leave the installation of NiFi as an exercise for the reader - which we would encourage you to do - as we will be using it in the following section. Introducing the GDELT News Stream Hopefully, we have NiFi up and running now and can start to ingest some data. So let’s start with some global news media data from GDELT. Here’s our brief, taken from the GDELT website http://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/: “Within 15 minutes of GDELT monitoring a news report breaking anywhere the world, it has translated it, processed it to identify all events, counts, quotes, people, organizations, locations, themes, emotions, relevant imagery, video, and embedded social media posts, placed it into global context, and made all of this available via a live open metadata firehose enabling open research on the planet itself. [As] the single largest deployment in the world of sentiment analysis, we hope that by bringing together so many emotional and thematic dimensions crossing so many languages and disciplines, and applying all of it in realtime to breaking news from across the planet, that this will spur an entirely new era in how we think about emotion and the ways in which it can help us better understand how we contextualize, interpret, respond to, and understand global events.” In order to start consuming this open data, we’ll need to hook into that metadata firehose and ingest the news streams onto our platform.  How do we do this?  Let’s start by finding out what data is available. Discover GDELT Real-time GDELT publish a list of the latest files on their website - this list is updated every 15 minutes. In NiFi, we can setup a dataflow that will poll the GDELT website, source a file from this list and save it to HDFS so we can use it later. Inside the NiFi dataflow designer, create a HTTP connector by dragging a processor onto the canvas and selecting GetHTTP. To configure this processor, you’ll need to enter the URL of the file list as: http://data.gdeltproject.org/gdeltv2/lastupdate.txt And also provide a temporary filename for the file list you will download. In the example below, we’ve used the NiFi’s expression language to generate a universally unique key so that files are not overwritten (UUID()). It’s worth noting that with this type of processor (GetHTTP), NiFi supports a number of scheduling and timing options for the polling and retrieval. For now, we’re just going to use the default options and let NiFi manage the polling intervals for us. An example of latest file list from GDELT is shown below. Next, we will parse the URL of the GKG news stream so that we can fetch it in a moment. 
Create a Regular Expression parser by dragging a processor onto the canvas and selecting ExtractText. Now position the new processor underneath the existing one and drag a line from the top processor to the bottom one. Finish by selecting the success relationship in the connection dialog that pops up. This is shown in the example below. Next, let's configure the ExtractText processor to use a regular expression that matches only the relevant text of the file list, for example:

([^ ]*gkg.csv.*)

From this regular expression, NiFi will create a new property (in this case, called url) associated with the flow design, which will take on a new value as each particular instance goes through the flow. It can even be configured to support multiple threads. Again, this example is shown below. It's worth noting here that while this is a fairly specific example, the technique is deliberately general purpose and can be used in many situations.

Our First GDELT Feed

Now that we have the URL of the GKG feed, we fetch it by configuring an InvokeHTTP processor to use the url property we previously created as its remote endpoint, and dragging the line as before. All that remains is to decompress the zipped content with an UnpackContent processor (using the basic zip format) and save it to HDFS using a PutHDFS processor, like so:

Improving with Publish and Subscribe

So far, this flow looks very "point-to-point", meaning that if we were to introduce a new consumer of data, for example, a Spark Streaming job, the flow must be changed. For example, the flow design might have to change to look like this: If we add yet another, the flow must change again. In fact, each time we add a new consumer, the flow gets a little more complicated, particularly when all the error handling is added. This is clearly not always desirable, as introducing or removing consumers (or producers) of data might be something we want to do often, even frequently. Plus, it's also a good idea to try to keep your flows as simple and reusable as possible. Therefore, for a more flexible pattern, instead of writing directly to HDFS, we can publish to Apache Kafka. This gives us the ability to add and remove consumers at any time without changing the data ingestion pipeline. We can also still write to HDFS from Kafka if needed, possibly even by designing a separate NiFi flow, or connect directly to Kafka using Spark Streaming. To do this, we create a Kafka writer by dragging a processor onto the canvas and selecting PutKafka. We now have a simple flow that continuously polls for an available file list, routinely retrieving the latest copy of a new stream over the web as it becomes available, decompressing the content and streaming it record-by-record into Kafka, a durable, fault-tolerant, distributed message queue, for processing by Spark Streaming or storage in HDFS. And what's more, without writing a single line of bash!

Content Registry

We have seen in this article that data ingestion is an area that is often overlooked, and that its importance cannot be overstated. At this point we have a pipeline that enables us to ingest data from a source, schedule that ingest, and direct the data to our repository of choice. But the story does not end there. Now we have the data, we need to fulfil our data management responsibilities. Enter the content registry. We're going to build an index of metadata related to that data we have ingested.
The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about the data, so that we can track what we've received and understand basic information about it, such as when we received it, where it came from, how big it is, what type it is, and so on.

Choices and More Choices

The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing, we will require at least the following attributes:

Easily searchable
Scalable
Parallel write ability
Redundancy

There are many ways to meet these requirements; for example, we could write the metadata to Parquet, store it in HDFS, and search it using Spark SQL. However, here we will use Elasticsearch as it meets the requirements a little better, most notably because it facilitates low latency queries of our metadata over a REST API - very useful for creating dashboards. In fact, Elasticsearch has the advantage of integrating directly with Kibana, meaning it can quickly produce rich visualizations of our content registry. For this reason, we will proceed with Elasticsearch in mind.

Going with the Flow

Using our current NiFi pipeline flow, let's fork the output from "Fetch GKG files from URL" to add an additional set of steps to allow us to capture and store this metadata in Elasticsearch. These are:

Replace the flow content with our metadata model
Capture the metadata
Store directly in Elasticsearch

Here's what this looks like in NiFi:

Metadata Model

So, the first step here is to define our metadata model. And there are many areas we could consider, but let's select a set that helps tackle a few key points from earlier discussions. This will provide a good basis upon which further data can be added in the future, if required. So, let's keep it simple and use the following three attributes:

File size
Date ingested
File name

These will provide basic registration of received files. Next, inside the NiFi flow, we'll need to replace the actual data content with this new metadata model. An easy way to do this is to create a JSON template file from our model. We'll save it to local disk and use it inside a FetchFile processor to replace the flow's content with this skeleton object. This template will look something like:

{
  "FileSize": SIZE,
  "FileName": "FILENAME",
  "IngestedDate": "DATE"
}

Note the use of placeholder names (SIZE, FILENAME, DATE) in place of the attribute values. These will be substituted, one by one, by a sequence of ReplaceText processors that swap the placeholder names for an appropriate flow attribute using regular expressions provided by the NiFi Expression Language, for example DATE becomes ${now()}. The last step is to output the new metadata payload to Elasticsearch. Once again, NiFi comes ready with a processor for this: the PutElasticsearch processor. An example metadata entry in Elasticsearch:

{
  "_index": "gkg",
  "_type": "files",
  "_id": "AVZHCvGIV6x-JwdgvCzW",
  "_score": 1,
  "_source": {
    "FileSize": 11279827,
    "FileName": "20150218233000.gkg.csv.zip",
    "IngestedDate": "2016-08-01T17:43:00+01:00"
  }
}

Now that we have added the ability to collect and interrogate metadata, we have access to more statistics that can be used for analysis. This includes:

Time-based analysis, for example, file sizes over time
Loss of data, for example, are there data "holes" in the timeline?

If there is a particular analytic that is required, the NiFi metadata component can be adjusted to provide the relevant data points. A short illustration of querying these statistics directly follows below.
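To make the "file sizes over time" idea concrete, here is a minimal sketch of asking the metadata index that question directly over Elasticsearch's REST API from Python, outside NiFi and Kibana. It assumes Elasticsearch is reachable on localhost:9200, that the gkg index, files type, and field names match those shown above, and that your Elasticsearch version still accepts the classic interval parameter for date histograms; treat it as an illustration rather than production code.

import requests

# Aggregate total ingested bytes per day across the registered files
query = {
    "size": 0,
    "aggs": {
        "sizes_over_time": {
            "date_histogram": {"field": "IngestedDate", "interval": "day"},
            "aggs": {"total_bytes": {"sum": {"field": "FileSize"}}}
        }
    }
}

response = requests.post("http://localhost:9200/gkg/files/_search", json=query)
response.raise_for_status()

for bucket in response.json()["aggregations"]["sizes_over_time"]["buckets"]:
    # key_as_string is the day; total_bytes.value is the summed FileSize
    print(bucket["key_as_string"], int(bucket["total_bytes"]["value"]))

A day with no buckets, or a bucket with an unexpectedly small total, is a quick hint that there may be a "hole" in the timeline worth investigating.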
Indeed, an analytic could be built to look at historical data and update the index accordingly if the metadata does not exist in current data.

Kibana Dashboard

We have mentioned Kibana a number of times in this article. Now that we have an index of metadata in Elasticsearch, we can use the tool to visualize some analytics. The purpose of this brief section is to demonstrate that we can immediately start to model and visualize our data. In this simple example we have completed the following steps:

Added the Elasticsearch index for our GDELT metadata to the "Settings" tab
Selected "file size" under the "Discover" tab
Selected Visualize for "file size"
Changed the Aggregation field to "Range"
Entered values for the ranges

The resultant graph displays the file size distribution. From here we are free to create new visualizations or even a fully featured dashboard that can be used to monitor the status of our file ingest. By increasing the variety of metadata written to Elasticsearch from NiFi, we can make more fields available in Kibana and even start our data science journey right here with some ingest-based actionable insights. Now that we have a fully functioning data pipeline delivering us real-time feeds of data, how do we ensure data quality of the payload we are receiving? Let's take a look at the options.

Quality Assurance

With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the front door. It's perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with; for example, basic checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, and so on. You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it's not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time. Here are some examples of the type of rough capacity planning calculation you can perform:

Example 1: Basic Quality Checking, No Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population) takes 4 minutes
There are no other users on the compute cluster
There are 10 minutes of resources available for other tasks

As there are no other users on the cluster, this is satisfactory - no action needs to be taken.

Example 2: Advanced Quality Checking, No Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes
There are no other users on the compute cluster
There is only 1 minute of resource available for other tasks
We probably need to consider one or more of the following:

Configuring a resource scheduling policy
Reducing the amount of data ingested
Reducing the amount of processing we undertake
Adding additional compute resources to the cluster

Example 3: Basic Quality Checking, 50% Utility Due to Contending Users

Data is ingested every 15 minutes and takes 1 minute to pull from the source
Quality checking (integrity, field count, field pre-population) takes 4 minutes (at 100% utility)
There are other users on the compute cluster
There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50)))

Since there are other users, there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur. When you run into timing issues, you have a number of options available to you in order to circumvent any backlog:

Negotiating sole use of the resources at certain times
Configuring a resource scheduling policy, including:
YARN Fair Scheduler: allows you to define queues with differing priorities and target your Spark jobs by setting the spark.yarn.queue property on start-up so your job always takes precedence
Dynamic Resource Allocation: allows concurrently running jobs to automatically scale to match their utilization
Spark Scheduler Pool: allows you to define queues when sharing a SparkContext using a multithreading model, and target your Spark jobs by setting the spark.scheduler.pool property per execution thread so your thread takes precedence
Running processing jobs overnight when the cluster is quiet

In any case, you will eventually get a good idea of how the various parts of your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There's always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper, and builds data expertise.

Summary

In this article we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resultant data. This section is particularly important as GDELT is used extensively throughout the book and the NiFi method is a highly effective way to source data in a scalable and modular way.

Resources for Article: Further resources on this subject: Integration with Continuous Delivery [article] Amazon Web Services [article] AWS Fundamentals [article]

The NumPy array object

Packt
03 Mar 2017
18 min read
In this article by Armando Fandango, author of the book Python Data Analysis - Second Edition, we discuss how NumPy provides a multidimensional array object called ndarray. NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous and thus elements of a list may contain any object type, while NumPy arrays are homogeneous and can contain objects of only one type. An ndarray consists of two parts, which are as follows:

The actual data that is stored in a contiguous block of memory
The metadata describing the actual data

Since the actual data is stored in a contiguous block of memory, loading a large dataset as an ndarray depends on the availability of a large enough contiguous block of memory. Most of the array methods and functions in NumPy leave the actual data unaffected and only modify the metadata. Actually, we made a one-dimensional array that held a set of numbers. The ndarray can have more than a single dimension.

(For more resources related to this topic, see here.)

Advantages of NumPy arrays

The NumPy array is, in general, homogeneous (there is a particular record array type that is heterogeneous)—the items in the array have to be of the same type. The advantage is that if we know that the items in an array are of the same type, it is easy to ascertain the storage size needed for the array. NumPy arrays can execute vectorized operations, processing a complete array, in contrast to Python lists, where you usually have to loop through the list and execute the operation on each element. NumPy arrays are indexed from 0, just like lists in Python. NumPy utilizes an optimized C API to make the array operations particularly quick. We will make an array with the arange() subroutine again. You will see snippets from Jupyter Notebook sessions where NumPy is already imported with the instruction import numpy as np. Here's how to get the data type of an array:

In: a = np.arange(5)
In: a.dtype
Out: dtype('int64')

The data type of the array a is int64 (at least on my computer), but you may get int32 as the output if you are using 32-bit Python. In both cases, we are dealing with integers (64 bit or 32 bit). Besides the data type of an array, it is crucial to know its shape. A vector is commonly used in mathematics, but most of the time we need higher-dimensional objects. Let's find out the shape of the vector we produced a few minutes ago:

In: a
Out: array([0, 1, 2, 3, 4])
In: a.shape
Out: (5,)

As you can see, the vector has five components with values ranging from 0 to 4. The shape property of the array is a tuple; in this instance, a tuple of 1 element, which holds the length in each dimension.

Creating a multidimensional array

Now that we know how to create a vector, we are set to create a multidimensional NumPy array. After we produce the matrix, we will again need to show its shape, as demonstrated in the following code snippets: Create a multidimensional array as follows:

In: m = np.array([np.arange(2), np.arange(2)])
In: m
Out: array([[0, 1], [0, 1]])

We can show the array shape as follows:

In: m.shape
Out: (2, 2)

We made a 2 x 2 array with the arange() subroutine. The array() function creates an array from an object that you pass to it. The object has to be array-like, for example, a Python list. In the previous example, we passed a list of arrays. The object is the only required parameter of the array() function. NumPy functions tend to have a heap of optional arguments with predefined default options.
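As a quick illustration of those optional arguments, the following short sketch builds the same 2 x 2 array while explicitly overriding the default data type; the chosen dtype here is arbitrary and only for demonstration.

import numpy as np

# Same nested construction as above, but forcing single-precision floats
m = np.array([np.arange(2), np.arange(2)], dtype=np.float32)
print(m)        # [[0. 1.]
                #  [0. 1.]]
print(m.shape)  # (2, 2)
print(m.dtype)  # float32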
Selecting NumPy array elements

From time to time, we will wish to select a specific constituent of an array. We will take a look at how to do this, but to kick off, let's make a 2 x 2 matrix again:

In: a = np.array([[1,2],[3,4]])
In: a
Out: array([[1, 2], [3, 4]])

The matrix was made this time by giving the array() function a list of lists. We will now choose each item of the matrix one at a time, as shown in the following code snippet. Recall that the index numbers begin from 0:

In: a[0,0]
Out: 1
In: a[0,1]
Out: 2
In: a[1,0]
Out: 3
In: a[1,1]
Out: 4

As you can see, choosing elements of an array is fairly simple. For the array a, we just employ the notation a[m,n], where m and n are the indices of the item in the array. Have a look at the following figure for your reference:

NumPy numerical types

Python has an integer type, a float type, and a complex type; nonetheless, this is not sufficient for scientific calculations. In practice, we still demand more data types with varying precisions and, consequently, different storage sizes of the type. For this reason, NumPy has many more data types. The bulk of the NumPy mathematical types ends with a number. This number designates the count of bits related to the type. The following table (adapted from the NumPy user guide) presents an overview of NumPy numerical types:

bool: Boolean (True or False) stored as a bit
inti: Platform integer (normally either int32 or int64)
int8: Byte (-128 to 127)
int16: Integer (-32768 to 32767)
int32: Integer (-2 ** 31 to 2 ** 31 -1)
int64: Integer (-2 ** 63 to 2 ** 63 -1)
uint8: Unsigned integer (0 to 255)
uint16: Unsigned integer (0 to 65535)
uint32: Unsigned integer (0 to 2 ** 32 - 1)
uint64: Unsigned integer (0 to 2 ** 64 - 1)
float16: Half precision float: sign bit, 5 bits exponent, and 10 bits mantissa
float32: Single precision float: sign bit, 8 bits exponent, and 23 bits mantissa
float64 or float: Double precision float: sign bit, 11 bits exponent, and 52 bits mantissa
complex64: Complex number, represented by two 32-bit floats (real and imaginary components)
complex128 or complex: Complex number, represented by two 64-bit floats (real and imaginary components)

For each data type, there exists a matching conversion function:

In: np.float64(42)
Out: 42.0
In: np.int8(42.0)
Out: 42
In: np.bool(42)
Out: True
In: np.bool(0)
Out: False
In: np.bool(42.0)
Out: True
In: np.float(True)
Out: 1.0
In: np.float(False)
Out: 0.0

Many functions have a data type argument, which is frequently optional:

In: np.arange(7, dtype=np.uint16)
Out: array([0, 1, 2, 3, 4, 5, 6], dtype=uint16)

It is important to be aware that you are not allowed to change a complex number into an integer. Attempting to do that sparks off a TypeError:

In: np.int(42.0 + 1.j)
Traceback (most recent call last):
<ipython-input-24-5c1cd108488d> in <module>()
----> 1 np.int(42.0 + 1.j)
TypeError: can't convert complex to int

The same goes for conversion of a complex number into a floating-point number. By the way, the j component is the imaginary coefficient of a complex number. Even so, you can convert a floating-point number to a complex number, for example, complex(1.0). The real and imaginary pieces of a complex number can be pulled out with the real() and imag() functions, respectively.

Data type objects

Data type objects are instances of the numpy.dtype class. Once again, arrays have a data type. To be exact, each element in a NumPy array has the same data type. The data type object can tell you the size of the data in bytes.
The size in bytes is given by the itemsize property of the dtype class:

In: a.dtype.itemsize
Out: 8

Character codes

Character codes are included for backward compatibility with Numeric. Numeric is the predecessor of NumPy. Its use is not recommended, but the code is supplied here because it pops up in various locations. You should use the dtype object instead. The following table lists several different data types and the character codes related to them:

integer: i
Unsigned integer: u
Single precision float: f
Double precision float: d
bool: b
complex: D
string: S
unicode: U
Void: V

Take a look at the following code to produce an array of single precision floats:

In: np.arange(7, dtype='f')
Out: array([ 0., 1., 2., 3., 4., 5., 6.], dtype=float32)

Likewise, the following code creates an array of complex numbers:

In: np.arange(7, dtype='D')
Out: array([ 0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j, 5.+0.j, 6.+0.j])

The dtype constructors

We have a variety of means to create data types. Take the case of floating-point data (have a look at dtypeconstructors.py in this book's code bundle): We can use the general Python float, as shown in the following lines of code:

In: np.dtype(float)
Out: dtype('float64')

We can specify a single precision float with a character code:

In: np.dtype('f')
Out: dtype('float32')

We can use a double precision float with a character code:

In: np.dtype('d')
Out: dtype('float64')

We can pass the dtype constructor a two-character code. The first character stands for the type; the second character is a number specifying the number of bytes in the type (the numbers 2, 4, and 8 correspond to floats of 16, 32, and 64 bits, respectively):

In: np.dtype('f8')
Out: dtype('float64')

A (truncated) list of all the full data type codes can be found by applying sctypeDict.keys():

In: np.sctypeDict.keys()
Out: dict_keys(['?', 0, 'byte', 'b', 1, 'ubyte', 'B', 2, 'short', 'h', 3, 'ushort', 'H', 4, 'i', 5, 'uint', 'I', 6, 'intp', 'p', 7, 'uintp', 'P', 8, 'long', 'l', 'L', 'longlong', 'q', 9, 'ulonglong', 'Q', 10, 'half', 'e', 23, 'f', 11, 'double', 'd', 12, 'longdouble', 'g', 13, 'cfloat', 'F', 14, 'cdouble', 'D', 15, 'clongdouble', 'G', 16, 'O', 17, 'S', 18, 'unicode', 'U', 19, 'void', 'V', 20, 'M', 21, 'm', 22, 'bool8', 'Bool', 'b1', 'float16', 'Float16', 'f2', 'float32', 'Float32', 'f4', 'float64', ' Float64', 'f8', 'float128', 'Float128', 'f16', 'complex64', 'Complex32', 'c8', 'complex128', 'Complex64', 'c16', 'complex256', 'Complex128', 'c32', 'object0', 'Object0', 'bytes0', 'Bytes0', 'str0', 'Str0', 'void0', 'Void0', 'datetime64', 'Datetime64', 'M8', 'timedelta64', 'Timedelta64', 'm8', 'int64', 'uint64', 'Int64', 'UInt64', 'i8', 'u8', 'int32', 'uint32', 'Int32', 'UInt32', 'i4', 'u4', 'int16', 'uint16', 'Int16', 'UInt16', 'i2', 'u2', 'int8', 'uint8', 'Int8', 'UInt8', 'i1', 'u1', 'complex_', 'int0', 'uint0', 'single', 'csingle', 'singlecomplex', 'float_', 'intc', 'uintc', 'int_', 'longfloat', 'clongfloat', 'longcomplex', 'bool_', 'unicode_', 'object_', 'bytes_', 'str_', 'string_', 'int', 'float', 'complex', 'bool', 'object', 'str', 'bytes', 'a'])

The dtype attributes

The dtype class has a number of useful properties.
For instance, we can get information about the character code of a data type through the properties of dtype:

In: t = np.dtype('Float64')
In: t.char
Out: 'd'

The type attribute corresponds to the type of object of the array elements:

In: t.type
Out: numpy.float64

The str attribute of dtype gives a string representation of a data type. It begins with a character representing endianness, if appropriate, then a character code, succeeded by a number corresponding to the number of bytes that each array item needs. Endianness, here, entails the way bytes are ordered inside a 32- or 64-bit word. In the big-endian order, the most significant byte is stored first, indicated by >. In the little-endian order, the least significant byte is stored first, indicated by <, as exemplified in the following lines of code:

In: t.str
Out: '<f8'

One-dimensional slicing and indexing

Slicing of one-dimensional NumPy arrays works just like the slicing of standard Python lists. Let's define an array containing the numbers 0, 1, 2, and so on up to and including 8. We can select a part of the array from indexes 3 to 7, which extracts the elements of the array 3 through 6:

In: a = np.arange(9)
In: a[3:7]
Out: array([3, 4, 5, 6])

We can choose elements from indexes 0 to 7 with an increment of 2:

In: a[:7:2]
Out: array([0, 2, 4, 6])

Just as in Python, we can use negative indices and reverse the array:

In: a[::-1]
Out: array([8, 7, 6, 5, 4, 3, 2, 1, 0])

Manipulating array shapes

We have already learned about the reshape() function. Another repeating chore is the flattening of arrays. Flattening in this setting entails transforming a multidimensional array into a one-dimensional array. Let us create an array b that we shall use for practicing further examples:

In: b = np.arange(24).reshape(2,3,4)
In: print(b)
Out: [[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]

We can manipulate array shapes using the following functions:

Ravel: We can accomplish this with the ravel() function as follows:

In: b
Out: array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
In: b.ravel()
Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])

Flatten: The appropriately named function, flatten(), does the same as ravel(). However, flatten() always allocates new memory, whereas ravel gives back a view of the array. This means that we can directly manipulate the array as follows:

In: b.flatten()
Out: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])

Setting the shape with a tuple: Besides the reshape() function, we can also define the shape straightaway with a tuple, which is exhibited as follows:

In: b.shape = (6,4)
In: b
Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]])

As you can understand, the preceding code alters the array immediately. Now, we have a 6 x 4 array.

Transpose: In linear algebra, it is common to transpose matrices. Transposing is a way to transform data. For a two-dimensional table, transposing means that rows become columns and columns become rows.
We can do this too by using the following code:

In: b.transpose()
Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]])

Resize: The resize() method works just like the reshape() method, but modifies the array it operates on:

In: b.resize((2,12))
In: b
Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])

Stacking arrays

Arrays can be stacked horizontally, depth-wise, or vertically. We can use, for this goal, the vstack(), dstack(), hstack(), column_stack(), row_stack(), and concatenate() functions. To start with, let's set up some arrays:

In: a = np.arange(9).reshape(3,3)
In: a
Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In: b = 2 * a
In: b
Out: array([[ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

As mentioned previously, we can stack arrays using the following techniques:

Horizontal stacking: Beginning with horizontal stacking, we will shape a tuple of ndarrays and hand it to the hstack() function to stack the arrays. This is shown as follows:

In: np.hstack((a, b))
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])

We can attain the same thing with the concatenate() function, which is shown as follows:

In: np.concatenate((a, b), axis=1)
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])

The following diagram depicts horizontal stacking:

Vertical stacking: With vertical stacking, a tuple is formed again. This time it is given to the vstack() function to stack the arrays. This can be seen as follows:

In: np.vstack((a, b))
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

The concatenate() function gives the same outcome with the axis parameter fixed to 0. This is the default value for the axis parameter, as portrayed in the following code:

In: np.concatenate((a, b), axis=0)
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])

Refer to the following figure for vertical stacking:

Depth stacking: To boot, there is the depth-wise stacking employing dstack() and a tuple, of course. This entails stacking a list of arrays along the third axis (depth). For example, we could stack 2D arrays of image data on top of each other as follows:

In: np.dstack((a, b))
Out: array([[[ 0, 0], [ 1, 2], [ 2, 4]], [[ 3, 6], [ 4, 8], [ 5, 10]], [[ 6, 12], [ 7, 14], [ 8, 16]]])

Column stacking: The column_stack() function stacks 1D arrays column-wise. This is shown as follows:

In: oned = np.arange(2)
In: oned
Out: array([0, 1])
In: twice_oned = 2 * oned
In: twice_oned
Out: array([0, 2])
In: np.column_stack((oned, twice_oned))
Out: array([[0, 0], [1, 2]])

2D arrays are stacked the way the hstack() function stacks them, as demonstrated in the following lines of code:

In: np.column_stack((a, b))
Out: array([[ 0, 1, 2, 0, 2, 4], [ 3, 4, 5, 6, 8, 10], [ 6, 7, 8, 12, 14, 16]])
In: np.column_stack((a, b)) == np.hstack((a, b))
Out: array([[ True, True, True, True, True, True], [ True, True, True, True, True, True], [ True, True, True, True, True, True]], dtype=bool)

Yes, you guessed it right! We compared two arrays with the == operator.

Row stacking: NumPy, naturally, also has a function that does row-wise stacking.
It is named row_stack() and, for 1D arrays, it just stacks the arrays in rows into a 2D array:

In: np.row_stack((oned, twice_oned))
Out: array([[0, 1], [0, 2]])

The row_stack() function results for 2D arrays are equal to the vstack() function results:

In: np.row_stack((a, b))
Out: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 0, 2, 4], [ 6, 8, 10], [12, 14, 16]])
In: np.row_stack((a,b)) == np.vstack((a, b))
Out: array([[ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True], [ True, True, True]], dtype=bool)

Splitting NumPy arrays

Arrays can be split vertically, horizontally, or depth-wise. The functions involved are hsplit(), vsplit(), dsplit(), and split(). We can split arrays either into arrays of the same shape or indicate the location after which the split should happen. Let's look at each of the functions in detail:

Horizontal splitting: The following code splits a 3 x 3 array on its horizontal axis into three parts of the same size and shape (see splitting.py in this book's code bundle):

In: a
Out: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In: np.hsplit(a, 3)
Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])]

Compare this with a call of the split() function, with an additional argument, axis=1:

In: np.split(a, 3, axis=1)
Out: [array([[0], [3], [6]]), array([[1], [4], [7]]), array([[2], [5], [8]])]

Vertical splitting: vsplit() splits along the vertical axis:

In: np.vsplit(a, 3)
Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]

The split() function, with axis=0, also splits along the vertical axis:

In: np.split(a, 3, axis=0)
Out: [array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]

Depth-wise splitting: The dsplit() function, unsurprisingly, splits depth-wise. We will require an array of rank 3 to begin with:

In: c = np.arange(27).reshape(3, 3, 3)
In: c
Out: array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]], [[ 9, 10, 11], [12, 13, 14], [15, 16, 17]], [[18, 19, 20], [21, 22, 23], [24, 25, 26]]])
In: np.dsplit(c, 3)
Out: [array([[[ 0], [ 3], [ 6]], [[ 9], [12], [15]], [[18], [21], [24]]]), array([[[ 1], [ 4], [ 7]], [[10], [13], [16]], [[19], [22], [25]]]), array([[[ 2], [ 5], [ 8]], [[11], [14], [17]], [[20], [23], [26]]])]

NumPy array attributes

Let's learn more about the NumPy array attributes with the help of an example. Let us create an array b that we shall use for practicing further examples:

In: b = np.arange(24).reshape(2, 12)
In: b
Out: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])

Besides the shape and dtype attributes, ndarray has a number of other properties, as shown in the following list:

ndim gives the number of dimensions, as shown in the following code snippet:

In: b.ndim
Out: 2

size holds the count of elements. This is shown as follows:

In: b.size
Out: 24

itemsize returns the count of bytes for each element in the array, as shown in the following code snippet:

In: b.itemsize
Out: 8

If you require the full count of bytes the array needs, you can have a look at nbytes.
This is just a product of the itemsize and size properties:

In: b.nbytes
Out: 192
In: b.size * b.itemsize
Out: 192

The T property has the same result as the transpose() function, which is shown as follows:

In: b.resize(6,4)
In: b
Out: array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]])
In: b.T
Out: array([[ 0, 4, 8, 12, 16, 20], [ 1, 5, 9, 13, 17, 21], [ 2, 6, 10, 14, 18, 22], [ 3, 7, 11, 15, 19, 23]])

If the array has a rank of less than 2, we will just get a view of the array:

In: b.ndim
Out: 1
In: b.T
Out: array([0, 1, 2, 3, 4])

Complex numbers in NumPy are represented by j. For instance, we can produce an array with complex numbers as follows:

In: b = np.array([1.j + 1, 2.j + 3])
In: b
Out: array([ 1.+1.j, 3.+2.j])

The real property returns to us the real part of the array, or the array itself if it only holds real numbers:

In: b.real
Out: array([ 1., 3.])

The imag property holds the imaginary part of the array:

In: b.imag
Out: array([ 1., 2.])

If the array holds complex numbers, then the data type will automatically be complex as well:

In: b.dtype
Out: dtype('complex128')
In: b.dtype.str
Out: '<c16'

The flat property gives back a numpy.flatiter object. This is the only means to get a flatiter object; we do not have access to a flatiter constructor. The flat iterator enables us to loop through an array as if it were a flat array, as shown in the following code snippet:

In: b = np.arange(4).reshape(2,2)
In: b
Out: array([[0, 1], [2, 3]])
In: f = b.flat
In: f
Out: <numpy.flatiter object at 0x103013e00>
In: for item in f: print(item)
Out: 0 1 2 3

It is possible to straightaway obtain an element with the flatiter object:

In: b.flat[2]
Out: 2

Also, you can obtain multiple elements as follows:

In: b.flat[[1,3]]
Out: array([1, 3])

The flat property can be set. Setting the value of the flat property leads to overwriting the values of the entire array:

In: b.flat = 7
In: b
Out: array([[7, 7], [7, 7]])

We can also overwrite the values of selected elements as follows:

In: b.flat[[1,3]] = 1
In: b
Out: array([[7, 1], [7, 1]])

The next diagram illustrates various properties of ndarray:

Converting arrays

We can convert a NumPy array to a Python list with the tolist() function. The following is a brief explanation:

Convert to a list:

In: b
Out: array([ 1.+1.j, 3.+2.j])
In: b.tolist()
Out: [(1+1j), (3+2j)]

The astype() function transforms the array to an array of the specified data type:

In: b
Out: array([ 1.+1.j, 3.+2.j])
In: b.astype(int)
/usr/local/lib/python3.5/site-packages/ipykernel/__main__.py:1: ComplexWarning: Casting complex values to real discards the imaginary part …
Out: array([1, 3])
In: b.astype('complex')
Out: array([ 1.+1.j, 3.+2.j])

We are dropping off the imaginary part when casting from the complex type to int. The astype() function takes the name of a data type as a string too. The preceding code won't display a warning this time because we used the right data type.

Summary

In this article, we found out a heap about the NumPy basics: data types and arrays. Arrays have various properties that describe them. You learned that one of these properties is the data type, which, in NumPy, is represented by a full-fledged object. NumPy arrays can be sliced and indexed in an effective way, compared to standard Python lists. NumPy arrays have the extra ability to work with multiple dimensions. The shape of an array can be modified in multiple ways, such as stacking, resizing, reshaping, and splitting.
Resources for Article: Further resources on this subject: Big Data Analytics [article] Python Data Science Up and Running [article] R and its Diverse Possibilities [article]

Understanding Spark RDD

Packt
01 Mar 2017
17 min read
In this article by Asif Abbasi, author of the book Learning Apache Spark 2.0, we will understand the Spark RDD. Along with that we will learn how to construct RDDs, operations on RDDs, passing functions to Spark in Scala, Java, and Python, and transformations such as map, filter, flatMap, and sample.

(For more resources related to this topic, see here.)

What is an RDD?

What's in a name might be true for a rose, but perhaps not for a Resilient Distributed Dataset (RDD), whose name in essence describes what it is. RDDs are basically datasets that are distributed across a cluster (remember that the Spark framework is inherently based on an MPP architecture), and they provide resilience (automatic failover) by nature. Before we go into any further detail, let's try to understand this a little bit, and again we are trying to be as abstract as possible. Let us assume that you have sensor data from aircraft and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6000 sensors across the entire plane and generates 2.5 TB of data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst's and a data scientist's point of view, your major concern is to analyze the data irrespective of the size and number of nodes across which it has been stored. This is where the neatness of the RDD concept comes into play, where the sensor data can be encapsulated as an RDD, and any transformation/action that you perform on the RDD applies across the entire dataset. Six months' worth of data for an A350 would be approximately 450 TB, and would need to sit across multiple machines. For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows:

Figure 2-1: RDD split across a cluster

The figure basically explains that an RDD is a distributed collection of the data, and the framework distributes the data across the cluster. Data distribution across a set of machines brings its own set of challenges, including recovering from node failures. RDDs are resilient as they can be recomputed from the RDD lineage graph, which is basically a graph of all the parent RDDs of the RDD. In addition to the resilience, distribution, and representing a dataset, an RDD has various other distinguishing qualities:

In memory: An RDD is a memory-resident collection of objects. We'll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation.

Partitioned: A partition is a division of a logical dataset or constituent elements into independent parts. Partitioning is a de facto performance optimization technique in distributed systems to achieve minimal network traffic, a killer for high performance workloads. The objective of partitioning in key-value oriented data is to collocate similar ranges of keys and, in effect, minimize shuffling. Data inside an RDD is split into partitions across various nodes of the cluster.

Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type.

Lazy evaluation: The transformations in Spark are lazy, which means data inside an RDD is not available until you perform an action.
You can, however, make the data available at any time using a count() action on the RDD. We'll discuss this later and the benefits associated with it.

Immutable: An RDD once created cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it.

Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel.

Cacheable: Since RDDs are lazily evaluated, any action on an RDD will cause the RDD to re-evaluate all the transformations that led to its creation. This is generally not a desirable behavior on large datasets, and hence Spark allows the option to persist the data in memory or on disk.

A typical Spark program flow with an RDD includes:

Creation of an RDD from a data source.
A set of transformations, for example, filter, map, join, and so on.
Persisting the RDD to avoid re-execution.
Calling actions on the RDD to start performing parallel operations across the cluster.

This is depicted in the following figure:

Figure 2-2: Typical Spark RDD flow

Operations on RDD

Two major operation types can be performed on an RDD. They are called:

Transformations
Actions

Transformations

Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one form to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and are hence lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function that will pass through each element of the RDD and return a totally new RDD representing the results of application of the function on the original dataset.

Actions

Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD, and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only a final dataset, which might be a few MBs, rather than the entire 1 TB dataset of mapped intermediate results. You should, however, remember to persist intermediate results; otherwise Spark will recompute the entire RDD graph each time an action is called. The persist() method on an RDD should help you avoid recomputation and saving intermediate results. We'll look at this in more detail later. Let's illustrate the work of transformations and actions with a simple example. In this specific example, we'll be using a flatMap() transformation and a count action. We'll use the README.md file from the local filesystem as an example. We'll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you must try this example with your own piece of text and investigate the results:

//Loading the README.md file
val dataFile = sc.textFile("README.md")

Now that the data has been loaded, we'll need to run a transformation.
Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function and use space as a delimiter:

//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))

Remember that until this point, while you seem to have applied a transformation function, nothing has been executed and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD, to perform the computation, which then results in fetching of data from the file to create an RDD, before applying the transformation function specified. You might note that we have actually passed a function to Spark:

//Count the total number of words
words.count()

Upon calling the count() action the RDD is evaluated, and the results are sent back to the driver program. This is very neat and especially useful during big data applications. If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework:

//Loading data file, applying transformations and action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()

Programming the same functionality in Java is also quite straightforward and will look pretty similar to the program in Scala:

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();

This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination to schedule it across the cluster, and get the results back.

Passing functions to Spark (Scala)

As you have seen in the previous example, passing functions is a critical functionality provided by Spark. From a user's point of view you would pass the function in your driver program, and Spark would figure out the location of the data partitions across the cluster memory, running it in parallel. The exact syntax of passing functions differs by the programming language. Since Spark has been written in Scala, we'll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows:

Anonymous functions
Static singleton methods

Anonymous functions

Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the programming language. The reason they are called anonymous functions is because you can give any name to the input argument and the result would be the same. For example, the following code examples would produce the same output:

val words = dataFile.map(line => line.split(" "))
val words = dataFile.map(anyline => anyline.split(" "))
val words = dataFile.map(_.split(" "))

Figure 2-11: Passing anonymous functions to Spark in Scala

Static singleton functions

While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to request the framework for a complex data manipulation. Static singleton functions come to the rescue with their own nuances, which we will discuss in this section.
In software engineering, the Singleton pattern is a design pattern that restricts instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class and not an instance of it. They usually take input from the parameters, perform actions on it, and return the result.

Figure 2-12: Passing static singleton functions to Spark in Scala

Static singleton is the preferred way to pass functions, as technically you can create a class and call a method on the class instance. For example:

class UtilFunctions {
  def split(inputParam: String): Array[String] = { inputParam.split(" ") }
  def operate(rdd: RDD[String]): RDD[String] = { rdd.flatMap(split) }
}

You can send a method in a class, but that has performance implications as the entire object would be sent along with the method.

Passing functions to Spark (Java)

In Java, to create a function you will have to implement the interfaces available in the org.apache.spark.api.java.function package. There are two popular ways to create such functions:

Implement the interface in your own class, and pass the instance to Spark.
Starting with Java 8, you can use lambda expressions to pass off the functions to the Spark framework.

Let's reimplement the preceding word count examples in Java:

Figure 2-13: Code example of Java implementation of word count (inline functions)

If you belong to a group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree with that assertion), you may want to create separate functions and call them as follows:

Figure 2-14: Code example of Java implementation of word count

Passing functions to Spark (Python)

Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this:

Lambda expressions: the ideal way for short functions that can be written inside a single expression
Local defs inside the function calling into Spark, for longer code
Top-level functions in a module

While we have already looked at the lambda functions in some of the previous examples, let's look at local definitions of the functions. Our example stays the same, which is we are trying to count the total number of words in a text file in Spark:

def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

Let's see the working code example:

Figure 2-15: Code example of Python word count (local definition of functions)

Here's another way to implement this by defining the functions as a part of a UtilFunctions class, and referencing them within your map and reduce functions:

Figure 2-16: Code example of Python word count (Utility class)

You may want to be a bit cheeky here and try to add a countWords() to the UtilFunctions, so that it takes an RDD as input, and returns the total number of words. This method has potential performance implications as the whole object will need to be sent to the cluster. Let's see how this can be implemented and the results in the following screenshot:

Figure 2-17: Code example of Python word count (Utility class - 2)

This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally. A minimal sketch of this pattern follows.
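The sketch below is illustrative only; the class and field names are hypothetical and not from the book's code bundle. Referencing self inside the passed lambda drags the whole object across the cluster, while copying the field into a local variable first means only that value is captured by the closure.

class WordUtils(object):
    def __init__(self, delimiter=" "):
        self.delimiter = delimiter

    def count_words_naive(self, rdd):
        # self.delimiter is referenced inside the lambda, so the entire
        # WordUtils instance is serialized and shipped with every task
        return rdd.map(lambda line: len(line.split(self.delimiter))).sum()

    def count_words_better(self, rdd):
        # copy the field into a local variable first; only this string is
        # captured by the closure, not the whole object
        delimiter = self.delimiter
        return rdd.map(lambda line: len(line.split(delimiter))).sum()

Used against the earlier example, something like WordUtils().count_words_better(sc.textFile("README.md")) would return the total word count without shipping the helper object to the executors.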
Now that we have had a look at how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let's look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark documentation in the programming guide. If you would like to get a comprehensive list of all the available functions, you might want to check the following API docs:

Scala: RDD http://bit.ly/2bfyoTo, PairRDD http://bit.ly/2bfzgah
Python: RDD http://bit.ly/2bfyURl, PairRDD N/A
Java: RDD http://bit.ly/2bfyRov, PairRDD http://bit.ly/2bfyOsH
R: RDD http://bit.ly/2bfyrOZ, PairRDD N/A

Table 2.1 – RDD and PairRDD API references

Transformations

The following table shows the most common transformations:

map(func)                                  coalesce(numPartitions)
filter(func)                               repartition(numPartitions)
flatMap(func)                              repartitionAndSortWithinPartitions(partitioner)
mapPartitions(func)                        join(otherDataset, [numTasks])
mapPartitionsWithIndex(func)               cogroup(otherDataset, [numTasks])
sample(withReplacement, fraction, seed)    cartesian(otherDataset)

Map(func)

The map transformation is the most commonly used and the simplest of transformations on an RDD. The map transformation applies the function passed in the arguments to each of the elements of the source RDD. In the previous examples, we have seen the usage of the map() transformation where we passed the split() function to the input RDD.

Figure 2-18: Operation of a map() function

We'll not give further examples of the map() function as we have already seen plenty of them previously.

Filter(func)

Filter, as the name implies, filters the input RDD and creates a new dataset that satisfies the predicate passed as an argument.

Example 2-1: Scala filtering example:

val dataFile = sc.textFile("README.md")
val linesWithApache = dataFile.filter(line => line.contains("Apache"))

Example 2-2: Python filtering example:

dataFile = sc.textFile("README.md")
linesWithApache = dataFile.filter(lambda line: "Apache" in line)

Example 2-3: Java filtering example:

JavaRDD<String> dataFile = sc.textFile("README.md");
JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains("Apache"));

flatMap(func)

The flatMap transformation is similar to map, but it offers a bit more flexibility. From the perspective of similarity to a map function, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we used flatMap to flatten the result of the split(" ") function, which returns a flattened structure rather than an RDD of string arrays.

Figure 2-19: Operational details of the flatMap() transformation

Let's look at the flatMap example in Scala.

Example 2-4: The flatMap() example in Scala:

val favMovies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange"))
favMovies.flatMap(movieTitle => movieTitle.split(" ")).collect()

A flatMap in the Python API would produce similar results.

Example 2-5: The flatMap() example in Python:

movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"])
movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect()

The flatMap example in Java is a bit long-winded, but it essentially produces the same results.
Example 2-6: The flatMap() example in Java:

JavaRDD<String> movies = sc.parallelize(
    Arrays.asList("Pulp Fiction","Requiem for a dream","A clockwork Orange"));
JavaRDD<String> movieName = movies.flatMap(
    new FlatMapFunction<String,String>() {
        public Iterator<String> call(String movie) {
            return Arrays.asList(movie.split(" ")).iterator();
        }
    });

Sample(withReplacement, fraction, seed)

Sampling is an important component of any data analysis and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDDs for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on the full dataset. Here is a quick overview of the parameters that are passed to the method:

withReplacement: A Boolean (True/False) that indicates if elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent. In practical terms this means that if we draw two samples with replacement, what we get on the first one doesn't affect what we get on the second draw, and hence the covariance between the two samples is zero. If we are sampling without replacement, the two samples aren't independent. Practically this means what we got on the first draw affects what we get on the second one, and hence the covariance between the two isn't zero.

fraction: Indicates the expected size of the sample as a fraction of the RDD's size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as the fraction.

seed: The seed used for the random number generator.

Let's look at the sampling example in Scala.

Example 2-7: The sample() example in Scala:

val data = sc.parallelize(
    List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
data.sample(true,0.1,12345).collect()

The sampling example in Python looks similar to the one in Scala.

Example 2-8: The sample() example in Python:

data = sc.parallelize(
    [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
data.sample(True,0.1,12345).collect()

In Java, our sampling example returns an RDD of integers.

Example 2-9: The sample() example in Java:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20));
nums.sample(true,0.1,12345).collect();

References

https://spark.apache.org/docs/latest/programming-guide.html
http://www.purplemath.com/modules/numbprop.htm

Summary

We have gone through everything from the concept of creating an RDD to manipulating data within the RDD. We've looked at the transformations and actions available to an RDD, and walked through various code examples to explain the differences between transformations and actions.

Resources for Article: Further resources on this subject: Getting Started with Apache Spark [article] Getting Started with Apache Spark DataFrames [article] Sabermetrics with Apache Spark [article]


Review of SQL Server Features for Developers

Packt
13 Feb 2017
43 min read
In this article by Dejan Sarka, Miloš Radivojević, and William Durkin, the authors of the book, SQL Server 2016 Developer's Guide explains that before dwelling into the new features in SQL Server 2016, let's make a quick recapitulation of the SQL Server features for developers available already in the previous versions of SQL Server. Recapitulating the most important features with help you remember what you already have in your development toolbox and also understanding the need and the benefits of the new or improved features in SQL Server 2016. The recapitulation starts with the mighty T-SQL SELECT statement. Besides the basic clauses, advanced techniques like window functions, common table expressions, and APPLY operator are explained. Then you will pass quickly through creating and altering database objects, including tables and programmable objects, like triggers, views, user-defined functions, and stored procedures. You will also review the data modification language statements. Of course, errors might appear, so you have to know how to handle them. In addition, data integrity rules might require that two or more statements are executed as an atomic, indivisible block. You can achieve this with help of transactions. The last section of this article deals with parts of SQL Server Database Engine marketed with a common name "Beyond Relational". This is nothing beyond the Relational Model, the "beyond relational" is really just a marketing term. Nevertheless, you will review the following: How SQL Server supports spatial data How you can enhance the T-SQL language with Common Language Runtime (CLR) elements written is some .NET language like Visual C# How SQL Server supports XML data The code in this article uses the WideWorldImportersDW demo database. In order to test the code, this database must be present in your SQL Server instance you are using for testing, and you must also have SQL Server Management Studio (SSMS) as the client tool. This article will cover the following points: Core Transact-SQL SELECT statement elements Advanced SELECT techniques Error handling Using transactions Spatial data XML support in SQL Server (For more resources related to this topic, see here.) The Mighty Transact-SQL SELECT You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage: SELECT to define the columns returned, or a projection of all table columns FROM to list the tables used in the query and how they are associated, or joined WHERE to filter the data to return only the rows that satisfy the condition in the predicate GROUP BY to define the groups over which the data is aggregated HAVING to filter the data after the grouping with conditions that refer to aggregations ORDER BY to sort the rows returned to the client application Besides these basic clauses, SELECT offers a variety of advanced possibilities as well. These advanced techniques are unfortunately less exploited by developers, although they are really powerful and efficient. Therefore, I urge you to review them and potentially use them in your applications. 
The advanced query techniques presented here include: Queries inside queries, or shortly subqueries Window functions TOP and OFFSET...FETCH expressions APPLY operator Common tables expressions, or CTEs Core Transact-SQL SELECT Statement Elements Let us start with the most simple concept of SQL which every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the WideWorldImportersDW database context and selects all data from the Dimension.Customer table. USE WideWorldImportersDW; SELECT * FROM Dimension.Customer; The code returns 403 rows, all customers with all columns. Using SELECT * is not recommended in production. Such queries can return an unexpected result when the table structure changes, and is also not suitable for good optimization. Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only four columns from the table. SELECT [Customer Key], [WWI Customer ID], [Customer], [Buying Group] FROM Dimension.Customer; Below is the shortened result, limited to the first three rows only. Customer Key WWI Customer ID Customer Buying Group ------------ --------------- ----------------------------- ------------- 0 0 Unknown N/A 1 1 Tailspin Toys (Head Office) Tailspin Toys 2 2 Tailspin Toys (Sylvanite, MT) Tailspin Toys You can see that the column names in the WideWorldImportersDW database include spaces. Names that include spaces are called delimited identifiers. In order to make SQL Server properly understand them as column names, you must enclose delimited identifiers in square parentheses. However, if you prefer to have names without spaces, or is you use computed expressions in the column list, you can add column aliases. The following query returns completely the same data as the previous one, just with columns renamed by aliases to avoid delimited names. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer; You might have noticed in the result set returned from the last two queries that there is also a row in the table for an unknown customer. You can filter this row with the WHERE clause. SELECT [Customer Key] AS CustomerKey, [WWI Customer ID] AS CustomerId, [Customer], [Buying Group] AS BuyingGroup FROM Dimension.Customer WHERE [Customer Key] <> 0; In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers in the examples you have seen so far. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important. 
In the following example, you can see the Fact.Sale table joined with an inner join to the Dimension.Customer table.
SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId,
 c.[Customer], c.[Buying Group] AS BuyingGroup,
 f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
 INNER JOIN Dimension.Customer AS c
  ON f.[Customer Key] = c.[Customer Key];
In the query, you can see that table aliases are used. If a column's name is unique across all tables in the query, then you can use it without the table name. If not, you need to use the table name in front of the column name, in the format table.column, to avoid ambiguous column names. In the previous query, the [Customer Key] column appears in both tables. Therefore, you need to precede this column name with the name of the table of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.
The previous query returns 228,265 rows. It is always advisable to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or, said differently, of whether the query is logically correct. The query returns the unknown customer and the orders associated with this customer, or, more precisely, associated with this placeholder for an unknown customer. Of course, you can use the WHERE clause to filter the rows in a query that joins multiple tables, just as you use it for a single-table query. The following query filters out the unknown customer rows.
SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId,
 c.[Customer], c.[Buying Group] AS BuyingGroup,
 f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit
FROM Fact.Sale AS f
 INNER JOIN Dimension.Customer AS c
  ON f.[Customer Key] = c.[Customer Key]
WHERE c.[Customer Key] <> 0;
The query returns 143,968 rows. You can see that a lot of sales are associated with the unknown customer. Of course, the Dimension.Customer table is not the only table you can join to the Fact.Sale table. The following query joins it to the Dimension.Date table. Again, the join performed is an inner join.
SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key]
FROM Fact.Sale AS f
 INNER JOIN Dimension.Date AS d
  ON f.[Delivery Date Key] = d.Date;
The query returns 227,981 rows. The query that joined the Fact.Sale table to the Dimension.Customer table returned 228,265 rows. It looks like not all Fact.Sale rows have a known delivery date, so not all rows can match the Dimension.Date table rows. You can use an outer join to check this. With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like you get from an inner join, plus the preserved rows.
Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, then the rows from the left table are preserved. If you use RIGHT OUTER JOIN, then the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With a FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query preserves the rows from the Fact.Sale table, which is on the left side of the join to the Dimension.Date table. In addition, the query sorts the result set by the invoice date descending using the ORDER BY clause. SELECT d.Date, f.[Total Excluding Tax], f.[Delivery Date Key], f.[Invoice Date Key] FROM Fact.Sale AS f LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date ORDER BY f.[Invoice Date Key] DESC; The query returns 228,265 rows. Here is the partial result of the query. Date Total Excluding Tax Delivery Date Key Invoice Date Key ---------- -------------------- ----------------- ---------------- NULL 180.00 NULL 2016-05-31 NULL 120.00 NULL 2016-05-31 NULL 160.00 NULL 2016-05-31 … … … … 2016-05-31 2565.00 2016-05-31 2016-05-30 2016-05-31 88.80 2016-05-31 2016-05-30 2016-05-31 50.00 2016-05-31 2016-05-30 For the last invoice date (2016-05-31), the delivery date is NULL. The NULL in the Date column form the Dimension.Date table is there because the data from this table is unknown for the rows with an unknown delivery date in the Fact.Sale table. Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows if an outer join. You can control the join order with parenthesis. The following query joins the Fact.Sale table with an inner join to the Dimension.Customer, Dimension.City, Dimension.[Stock Item], and Dimension.Employee tables, and with an left outer join to the Dimension.Date table. SELECT cu.[Customer Key] AS CustomerKey, cu.Customer, ci.[City Key] AS CityKey, ci.City, ci.[State Province] AS StateProvince, ci.[Sales Territory] AS SalesTeritory, d.Date, d.[Calendar Month Label] AS CalendarMonth, d.[Calendar Year] AS CalendarYear, s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product, s.Color, e.[Employee Key] AS EmployeeKey, e.Employee, f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit FROM (Fact.Sale AS f INNER JOIN Dimension.Customer AS cu ON f.[Customer Key] = cu.[Customer Key] INNER JOIN Dimension.City AS ci ON f.[City Key] = ci.[City Key] INNER JOIN Dimension.[Stock Item] AS s ON f.[Stock Item Key] = s.[Stock Item Key] INNER JOIN Dimension.Employee AS e ON f.[Salesperson Key] = e.[Employee Key]) LEFT OUTER JOIN Dimension.Date AS d ON f.[Delivery Date Key] = d.Date; The query returns 228,265 rows. Note that with the usage of the parenthesis the order of joins is defined in the following way: Perform all inner joins, with an arbitrary order among them Execute the left outer join after all of the inner joins So far, I have tacitly assumed that the Fact.Sale table has 228,265 rows, and that the previous query needed only one outer join of the Fact.Sale table with the Dimension.Date to return all of the rows. It would be good to check this number in advance. 
You can check the number of rows by aggregating them using the COUNT(*) aggregate function. The following query introduces that function. SELECT COUNT(*) AS SalesCount FROM Fact.Sale; Now you can be sure that the Fact.Sale table has exactly 228,265 rows. Many times you need to aggregate data in groups. This is the point where the GROUP BY clause becomes handy. The following query aggregates the sales data for each customer. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer; The query returns 402 rows, one for each known customer. In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY list. Sometimes you need to filter aggregated data. For example, you might need to find only frequent customers, defined as customers with more than 400 rows in the Fact.Sale table. You can filter the result set on the aggregated data by using the HAVING clause, like the following query shows. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400; The query returns 45 rows for 45 most frequent known customers. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clause from the query, and the aliases are not known yet. However, the ORDER BY clause executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows all of the basic SELECT statement clauses used together to aggregate the sales data over the known customers, filters the data to include the frequent customers only, and sorts the result set descending by the number of rows of each customer in the Fact.Sale table. SELECT c.Customer, SUM(f.Quantity) AS TotalQuantity, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer HAVING COUNT(*) > 400 ORDER BY InvoiceLinesCountDESC; The query returns 45 rows. Below is the shortened result set. Customer TotalQuantity TotalAmount SalesCount ------------------------------------- ------------- ------------ ----------- Tailspin Toys (Vidrine, LA) 18899 340163.80 455 Tailspin Toys (North Crows Nest, IN) 17684 313999.50 443 Tailspin Toys (Tolna, ND) 16240 294759.10 443 Advanced SELECT Techniques Aggregating data over the complete input rowset or aggregating in groups produces aggregated rows only – either one row for the whole input rowset or one row per group. Sometimes you need to return aggregates together with the detail data. One way to achieve this is by using subqueries, queries inside queries. The following query shows an example of using two subqueries in a single query. In the SELECT clause, a subquery that calculates the sum of quantity for each customer. It returns a scalar value. The subquery refers to the customer key from the outer query. 
The subquery can't execute without the outer query. This is a correlated subquery. There is another subquery in the FROM clause that calculates overall quantity for all customers. This query returns a table, although it is a table with a single row and single column. This query is a self-contained subquery, independent of the outer query. A subquery in the FROM clause is also called a derived table. Another type of join is used to add the overall total to each detail row. A cross join is a Cartesian product of two input rowsets—each row from one side is associated with every single row from the other side. No join condition is needed. A cross join can produce an unwanted huge result set. For example, if you cross join just a 1,000 rows from the left side of the join with 1,000 rows from the right side, you get 1,000,000 rows in the output. Therefore, typically you want to avoid a cross join in production. However, in the example in the following query, 143,968 from the left side rows is cross joined to a single row from the subquery, therefore producing 143,968 only. Effectively, this means that the overall total column is added to each detail row. SELECT c.Customer, f.Quantity, (SELECT SUM(f1.Quantity) FROM Fact.Sale AS f1 WHERE f1.[Customer Key] = c.[Customer Key]) AS TotalCustomerQuantity, f2.TotalQuantity FROM (Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key]) CROSS JOIN (SELECT SUM(f2.Quantity) FROM Fact.Sale AS f2 WHERE f2.[Customer Key] <> 0) AS f2(TotalQuantity) WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; Here is an abbreviated output of the query. Customer Quantity TotalCustomerQuantity TotalQuantity ---------------------------- ----------- --------------------- ------------- Tailspin Toys (Absecon, NJ) 360 12415 5667611 Tailspin Toys (Absecon, NJ) 324 12415 5667611 Tailspin Toys (Absecon, NJ) 288 12415 5667611 In the previous example, the correlated subquery in the SELECT clause has to logically execute once per row of the outer query. The query was partially optimized by moving the self-contained subquery for the overall total in the FROM clause, where logically executes only once. Although SQL Server can many times optimize correlated subqueries and convert them to joins, there exist also a much better and more efficient way to achieve the same result as the previous query returned. You can do this by using the window functions. The following query is using the window aggregate function SUM to calculate the total over each customer and the overall total. The OVER clause defines the partitions, or the windows of the calculation. The first calculation is partitioned over each customer, meaning that the total quantity per customer is reset to zero for each new customer. The second calculation uses an OVER clause without specifying partitions, thus meaning the calculation is done over all input rowset. This query produces exactly the same result as the previous one/ SELECT c.Customer, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer) AS TotalCustomerQuantity, SUM(f.Quantity) OVER() AS TotalQuantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; You can use many other functions for window calculations. For example, you can use the ranking functions, like ROW_NUMBER(), to calculate some rank in the window or in the overall rowset. However, rank can be defined only over some order of the calculation. 
You can specify the order of the calculation in the ORDER BY sub-clause inside the OVER clause. Please note that this ORDER BY clause defines only the logical order of the calculation, and not the order of the rows returned. A stand-alone, outer ORDER BY at the end of the query defines the order of the result. The following query calculates a sequential number, the row number of each row in the output, for each detail row of the input rowset. The row number is calculated once in partitions for each customer and once ever the whole input rowset. Logical order of calculation is over quantity descending, meaning that row number 1 gets the largest quantity, either the largest for each customer or the largest in the whole input rowset. SELECT c.Customer, f.Quantity, ROW_NUMBER() OVER(PARTITION BY c.Customer ORDER BY f.Quantity DESC) AS CustomerOrderPosition, ROW_NUMBER() OVER(ORDER BY f.Quantity DESC) AS TotalOrderPosition FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.Quantity DESC; The query produces the following result, abbreviated to couple of rows only again. Customer Quantity CustomerOrderPosition TotalOrderPosition ----------------------------- ----------- --------------------- -------------------- Tailspin Toys (Absecon, NJ) 360 1 129 Tailspin Toys (Absecon, NJ) 324 2 162 Tailspin Toys (Absecon, NJ) 288 3 374 … … … … Tailspin Toys (Aceitunas, PR) 288 1 392 Tailspin Toys (Aceitunas, PR) 250 4 1331 Tailspin Toys (Aceitunas, PR) 250 3 1315 Tailspin Toys (Aceitunas, PR) 250 2 1313 Tailspin Toys (Aceitunas, PR) 240 5 1478 Note the position, or the row number, for the second customer. The order does not look to be completely correct – it is 1, 4, 3, 2, 5, and not 1, 2, 3, 4, 5, like you might expect. This is due to repeating value for the second largest quantity, for the quantity 250. The quantity is not unique, and thus the order is not deterministic. The order of the result is defined over the quantity, and not over the row number. You can't know in advance which row will get which row number when the order of the calculation is not defined on unique values. Please also note that you might get a different order when you execute the same query on your SQL Server instance. Window functions are useful for some advanced calculations, like running totals and moving averages as well. However, the calculation of these values can't be performed over the complete partition. You can additionally frame the calculation to a subset of rows of each partition only. The following query calculates the running total of the quantity per customer (the column alias Q_RT in the query) ordered by the sale key and framed differently for each row. The frame is defined from the first row in the partition to the current row. Therefore, the running total is calculated over one row for the first row, over two rows for the second row, and so on. Additionally, the query calculates the moving average of the quantity (the column alias Q_MA in the query) for the last three rows. 
SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity, SUM(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Q_RT, AVG(f.Quantity) OVER(PARTITION BY c.Customer ORDER BY [Sale Key] ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS Q_MA FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, f.[Sale Key]; The query returns the following (abbreviated) result. Customer SaleKey Quantity Q_RT Q_MA ---------------------------- -------- ----------- ----------- ----------- Tailspin Toys (Absecon, NJ) 2869 216 216 216 Tailspin Toys (Absecon, NJ) 2870 2 218 109 Tailspin Toys (Absecon, NJ) 2871 2 220 73 Let's find the top three orders by quantity for the Tailspin Toys (Aceitunas, PR) customer! You can do this by using the OFFSET…FETCH clause after the ORDER BY clause, like the following query shows. SELECT c.Customer, f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' ORDER BY f.Quantity DESC OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY; This is the complete result of the query. Customer SaleKey Quantity ------------------------------ -------- ----------- Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 But wait… Didn't the second largest quantity, the value 250, repeat three times? Which two rows were selected in the output? Again, because the calculation is done over a non-unique column, the result is somehow nondeterministic. SQL Server offers another possibility, the TOP clause. You can specify TOP n WITH TIES, meaning you can get all of the rows with ties on the last value in the output. However, this way you don't know the number of the rows in the output in advance. The following query shows this approach. SELECT TOP 3 WITH TIES c.Customer, f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.Customer = N'Tailspin Toys (Aceitunas, PR)' ORDER BY f.Quantity DESC; This is the complete result of the previous query – this time it is four rows. Customer SaleKey Quantity ------------------------------ -------- ----------- Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 223106 250 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 The next task is to get the top three orders by quantity for each customer. You need to perform the calculation for each customer. The APPLY Transact-SQL operator comes handy here. You use it in the FROM clause. You apply, or execute, a table expression defined on the right side of the operator once for each row of the input rowset from the left side of the operator. There are two flavors of this operator. The CROSS APPLY version filters out the rows from the left rowset if the tabular expression on the right side does not return any row. The OUTER APPLY version preserves the row from the left side, even is the tabular expression on the right side does not return any row, similarly as the LEFT OUTER JOIN does. Of course, columns for the preserved rows do not have known values from the right-side tabular expression. The following query uses the CROSS APPLY operator to calculate top three orders by quantity for each customer that actually does have some orders. 
SELECT c.Customer, t3.SaleKey, t3.Quantity FROM Dimension.Customer AS c CROSS APPLY (SELECT TOP(3) f.[Sale Key] AS SaleKey, f.Quantity FROM Fact.Sale AS f WHERE f.[Customer Key] = c.[Customer Key] ORDER BY f.Quantity DESC) AS t3 WHERE c.[Customer Key] <> 0 ORDER BY c.Customer, t3.Quantity DESC; Below is the result of this query, shortened to first nine rows. Customer SaleKey Quantity ---------------------------------- -------- ----------- Tailspin Toys (Absecon, NJ) 5620 360 Tailspin Toys (Absecon, NJ) 114397 324 Tailspin Toys (Absecon, NJ) 82868 288 Tailspin Toys (Aceitunas, PR) 36964 288 Tailspin Toys (Aceitunas, PR) 126253 250 Tailspin Toys (Aceitunas, PR) 79272 250 Tailspin Toys (Airport Drive, MO) 43184 250 Tailspin Toys (Airport Drive, MO) 70842 240 Tailspin Toys (Airport Drive, MO) 630 225 For the final task in this section, assume that you need to calculate some statistics over totals of customers' orders. You need to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer. This means you need to calculate the totals over customers in advance, and then use aggregate functions AVG() and STDEV() on these aggregates. You could do aggregations over customers in advance in a derived table. However, there is another way to achieve this. You can define the derived table in advance, in the WITH clause of the SELECT statement. Such subquery is called a common table expression, or a CTE. CTEs are more readable than derived tables, and might be also more efficient. You could use the result of the same CTE multiple times in the outer query. If you use derived tables, then you need to define them multiple times if you want to use the multiple times in the outer query. The following query shows the usage of a CTE to calculate the average total amount for all customers, the standard deviation of this total amount, and the average count of total count of orders per customer. WITH CustomerSalesCTE AS ( SELECT c.Customer, SUM(f.[Total Excluding Tax]) AS TotalAmount, COUNT(*) AS InvoiceLinesCount FROM Fact.Sale AS f INNER JOIN Dimension.Customer AS c ON f.[Customer Key] = c.[Customer Key] WHERE c.[Customer Key] <> 0 GROUP BY c.Customer ) SELECT ROUND(AVG(TotalAmount), 6) AS AvgAmountPerCustomer, ROUND(STDEV(TotalAmount), 6) AS StDevAmountPerCustomer, AVG(InvoiceLinesCount) AS AvgCountPerCustomer FROM CustomerSalesCTE; It returns the following result. AvgAmountPerCustomer StDevAmountPerCustomer AvgCountPerCustomer --------------------- ---------------------- ------------------- 270479.217661 38586.082621 358 Transactions and Error Handling In a real world application, errors always appear. Syntax or even logical errors can be in the code, the database design might be incorrect, there might even be a bug in the database management system you are using. Even is everything works correctly, you might get an error because the users insert wrong data. With Transact-SQL error handling you can catch such user errors and decide what to do upon them. Typically, you want to log the errors, inform the users about the errors, and sometimes even correct them in the error handling code. Error handling for user errors works on the statement level. If you send SQL Server a batch of two or more statements and the error is in the last statement, the previous statements execute successfully. This might not be what you desire. 
Many times you need to execute a batch of statements as a unit, and fail all of the statements if one of the statements fails. You can achieve this by using transactions. You will learn in this section about: Error handling Transaction management Error Handling You can see there is a need for error handling by producing an error. The following code tries to insert an order and a detail row for this order. EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 0; In SQL Server Management Studio, you can see that an error happened. You should get a message that the error 547 occurred, that The INSERT statement conflicted with the CHECK constraint. If you remember, in order details only rows where the value for the quantity is not equal to zero are allowed. The error occurred in the second statement, in the call of the procedure that inserts an order detail. The procedure that inserted an order executed without an error. Therefore, an order with id equal to six must be in the dbo. SimpleOrders table. The following code tries to insert order six again. EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustE'; Of course, another error occurred. This time it should be error 2627, a violation of the PRIMARY KEY constraint. The values of the OrderId column must be unique. Let's check the state of the data after these successful and unsuccessful inserts. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; The previous query checks only orders and their associated details where the order id value is greater than five. The query returns the following result set. OrderId OrderDate Customer ProductId Quantity ----------- ---------- -------- ----------- ----------- 6 2016-07-06 CustE NULL NULL You can see that only the first insert of the order with the id 6 succeeded. The second insert of an order with the same id and the insert of the detail row for the order six did not succeed. You start handling errors by enclosing the statements in the batch you are executing in the BEGIN TRY … END TRY block. You can catch the errors in the BEGIN CATCH … END CATCH block. The BEGIN CATCH statement must be immediately after the END TRY statement. The control of the execution is passed from the try part to the catch part immediately after the first error occurs. In the catch part, you can decide how to handle the errors. If you want to log the data about the error or inform an end user about the details of the error, the following functions might be very handy: ERROR_NUMBER() – this function returns the number of the error. ERROR_SEVERITY() - it returns the severity level. The severity of the error indicates the type of problem encountered. Severity levels 11 to 16 can be corrected by the user. ERROR_STATE() – this function returns the error state number. Error state gives more details about a specific error. You might want to use this number together with the error number to search Microsoft knowledge base for the specific details of the error you encountered. ERROR_PROCEDURE() – it returns the name of the stored procedure or trigger where the error occurred, or NULL if the error did not occur within a stored procedure or trigger. ERROR_LINE() – it returns the line number at which the error occurred. 
This might be the line number in a routine if the error occurred within a stored procedure or trigger, or the line number in the batch. ERROR_MESSAGE() – this function returns the text of the error message. The following code uses the try…catch block to handle possible errors in the batch of the statements, and returns the information of the error using the above mentioned functions. Note that the error happens in the first statement of the batch. BEGIN TRY EXEC dbo.InsertSimpleOrder @OrderId = 6, @OrderDate = '20160706', @Customer = N'CustF'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 6, @ProductId = 2, @Quantity = 5; END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; END CATCH There was a violation of the PRIMARY KEY constraint again, because the code tried to insert an order with id six again. The second statement would succeed if you would execute in its own batch, without error handling. However, because of the error handling, the control was passed to the catch block immediately after the error in the first statement, and the second statement never executed. You can check the data with the following query. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; The result set should be the same as the results set of the last check of the orders with id greater than five – a single order without details. The following code produces an error in the second statement. BEGIN TRY EXEC dbo.InsertSimpleOrder @OrderId = 7, @OrderDate = '20160706', @Customer = N'CustF'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 7, @ProductId = 2, @Quantity = 0; END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; END CATCH You can see that the insert of the order detail violates the CHECK constraint for the quantity. If you check the data with the same query as last two times again, you would see that there are orders with id six and seven in the data, both without order details. Using Transactions Your business logic might request that the insert of the first statement fails when the second statement fails. You might need to repeal the changes of the first statement on the failure of the second statement. You can define that a batch of statements executes as a unit by using transactions. The following code shows how to use transactions. Again, the second statement in the batch in the try block is the one that produces an error. BEGIN TRY BEGIN TRANSACTION EXEC dbo.InsertSimpleOrder @OrderId = 8, @OrderDate = '20160706', @Customer = N'CustG'; EXEC dbo.InsertSimpleOrderDetail @OrderId = 8, @ProductId = 2, @Quantity = 0; COMMIT TRANSACTION END TRY BEGIN CATCH SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage, ERROR_LINE() as ErrorLine; IF XACT_STATE() <> 0 ROLLBACK TRANSACTION; END CATCH You can check the data again. SELECT o.OrderId, o.OrderDate, o.Customer, od.ProductId, od.Quantity FROM dbo.SimpleOrderDetails AS od RIGHT OUTER JOIN dbo.SimpleOrders AS o ON od.OrderId = o.OrderId WHERE o.OrderId > 5 ORDER BY o.OrderId, od.ProductId; Here is the result of the check: OrderId OrderDate Customer ProductId Quantity ----------- ---------- -------- ----------- ----------- 6 2016-07-06 CustE NULL NULL 7 2016-07-06 CustF NULL NULL You can see that the order with id 8 does not exist in your data. 
Because of the insert of the detail row for this order failed, the insert of the order was rolled back as well. Note that in the catch block, the XACT_STATE() function was used to check whether the transaction still exists. If the transaction was rolled back automatically by SQL Server, then the ROLLBACK TRANSACTION would produce a new error. The following code drops the objects (in correct order, due to object contraints) created for the explanation of the DDL and DML statements, programmatic objects, error handling, and transactions. DROP FUNCTION dbo.Top2OrderDetails; DROP VIEW dbo.OrdersWithoutDetails; DROP PROCEDURE dbo.InsertSimpleOrderDetail; DROP PROCEDURE dbo.InsertSimpleOrder; DROP TABLE dbo.SimpleOrderDetails; DROP TABLE dbo.SimpleOrders; Beyond Relational The "beyond relational" is actually only a marketing term. The relational model, used in the relational database management system, is nowhere limited to specific data types, or specific languages only. However, with the term beyond relational, we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages like Visual C#, or statistical languages like R. SQL Server in versions before 2016 already supports some of the features mentioned. Here is a quick review of this support that includes: Spatial data CLR support XML data Defining Locations and Shapes with Spatial Data In modern applications, many times you want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From the simple shapes you can create complex geometrical objects or geographical objects, for example cities and roads. Spatial data appear in many contemporary database. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you working with spatial data. SQL Server supports two spatial data types, both implemented as .NET common language runtime (CLR) data types, from version 2008: The geometry type represents data in a Euclidean (flat) coordinate system. The geography type represents data in a round-earth coordinate system. We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation. In the planar, or flat-earth, system, you define the units of measurements. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You as the database developer know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units, regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know what these 5 units mean, is this 5 kilometers, or 5 inches. When talking about locations on earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal system Lengths and areas are usually measured in the metric system, in meters and square meters. However, not everywhere in the world the metric system is used for the spatial data. 
The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring some distance or area in the ellipsoidal system, you should always quote also the SRID used, which defines the units. In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (5, 0), (0, 10), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the earth in the two hemispheres. Is your spatial data describing the northern or southern hemisphere? The Wide World Importers data warehouse includes the city location in the Dimension.City table. The following query retrieves it for cities in the main part of the USA> SELECT City, [Sales Territory] AS SalesTerritory, Location AS LocationBinary, Location.ToString() AS LocationLongLat FROM Dimension.City WHERE [City Key] <> 0 AND [Sales Territory] NOT IN (N'External', N'Far West'); Here is the partial result of the query. City SalesTerritory LocationBinary LocationLongLat ------------ --------------- -------------------- ------------------------------- Carrollton Mideast 0xE6100000010C70... POINT (-78.651695 42.1083969) Carrollton Southeast 0xE6100000010C88... POINT (-76.5605078 36.9468152) Carrollton Great Lakes 0xE6100000010CDB... POINT (-90.4070632 39.3022693) You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude. If SSMS, you send the results of the previous query to a grid, you get in the results pane also an additional representation for the spatial data. Click the Spatial results tab, and you can see the points represented in the longitude – latitude coordinate system, like you can see in the following figure. Figure 2-1: Spatial results showing customers' locations If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations. It can show only 5,000 objects. The result displays only first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the customers' locations for customers from USA. The following query gives you the details, like location and population, for Denver, Colorado. SELECT [City Key] AS CityKey, City, [State Province] AS State, [Latest Recorded Population] AS Population, Location.ToString() AS LocationLongLat FROM Dimension.City WHERE [City Key] = 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999'; Spatial data types have many useful methods. For example, the STDistance() method returns the shortest line between two geography types. This is a close approximate to the geodesic distance, defined as the shortest route between two points on the Earth's surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington. 
DECLARE @g AS GEOGRAPHY; DECLARE @h AS GEOGRAPHY; DECLARE @unit AS NVARCHAR(50); SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129); SET @h = (SELECT Location FROM Dimension.City WHERE [City Key] = 108657); SET @unit = (SELECT unit_of_measure FROM sys.spatial_reference_systems WHERE spatial_reference_id = @g.STSrid); SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance, @unit AS Unit; The result of the previous batch is below. Distance Unit ------------- ------ 1,643,936.69 metre Note that the code uses the sys.spatial_reference_system catalog view to get the unit of measure for the distance of the SRID used to store the geographical instances of data. The unit is meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers. The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as the cities with population larger than 200,000. DECLARE @g AS GEOGRAPHY; SET @g = (SELECT Location FROM Dimension.City WHERE [City Key] = 114129); SELECT DISTINCT City, [State Province] AS State, FORMAT([Latest Recorded Population], '000,000') AS Population, FORMAT(@g.STDistance(Location), '000,000.00') AS Distance FROM Dimension.City WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1 AND [Latest Recorded Population] > 200000 AND [City Key] <> 114129 AND [Valid To] = '9999-12-31 23:59:59.9999999' ORDER BY Distance; Here is the result abbreviated to the twelve closest cities to Denver, Colorado. City State Population Distance ----------------- ----------- ----------- ----------- Aurora Colorado 325,078 013,141.64 Colorado Springs Colorado 416,427 101,487.28 Albuquerque New Mexico 545,852 537,221.38 Wichita Kansas 382,368 702,553.01 Lincoln Nebraska 258,379 716,934.90 Lubbock Texas 229,573 738,625.38 Omaha Nebraska 408,958 784,842.10 Oklahoma City Oklahoma 579,999 809,747.65 Tulsa Oklahoma 391,906 882,203.51 El Paso Texas 649,121 895,789.96 Kansas City Missouri 459,787 898,397.45 Scottsdale Arizona 217,385 926,980.71 There are many more useful methods and properties implemented in the two spatial data types. In addition, you can improve the performance of spatial queries with help of specialized spatial indexes. Please refer to the MSDN article "Spatial Data (SQL Server)" at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on the spatial data types, their methods, and spatial indexes. XML Support in SQL Server SQL Server in version 2005 also started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts by generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task. The following query generates an XML document from the regular tabular result set by using the FOR XML clause with AUTO option, to generate element-centric XML instance, with namespace and inline schema included. SELECT c.[Customer Key] AS CustomerKey, c.[WWI Customer ID] AS CustomerId, c.[Customer], c.[Buying Group] AS BuyingGroup, f.Quantity, f.[Total Excluding Tax] AS Amount, f.Profit FROM Dimension.Customer AS c INNER JOIN Fact.Sale AS f ON c.[Customer Key] = f.[Customer Key] WHERE c.[Customer Key] IN (127, 128) FOR XML AUTO, ELEMENTS, ROOT('CustomersOrders'), XMLSCHEMA('CustomersOrdersSchema'); GO Here is the partial result of this query. 
First part of the result is the inline schema/ <CustomersOrders> <xsd:schema targetNamespace="CustomersOrdersSchema" … <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes" … <xsd:element name="c"> <xsd:complexType> <xsd:sequence> <xsd:element name="CustomerKey" type="sqltypes:int" /> <xsd:element name="CustomerId" type="sqltypes:int" /> <xsd:element name="Customer"> <xsd:simpleType> <xsd:restriction base="sqltypes:nvarchar" … <xsd:maxLength value="100" /> </xsd:restriction> </xsd:simpleType> </xsd:element> … </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema> <c > <CustomerKey>127</CustomerKey> <CustomerId>127</CustomerId> <Customer>Tailspin Toys (Point Roberts, WA)</Customer> <BuyingGroup>Tailspin Toys</BuyingGroup> <f> <Quantity>3</Quantity> <Amount>48.00</Amount> <Profit>31.50</Profit> </f> <f> <Quantity>9</Quantity> <Amount>2160.00</Amount> <Profit>1363.50</Profit> </f> </c> <c > <CustomerKey>128</CustomerKey> <CustomerId>128</CustomerId> <Customer>Tailspin Toys (East Portal, CO)</Customer> <BuyingGroup>Tailspin Toys</BuyingGroup> <f> <Quantity>84</Quantity> <Amount>420.00</Amount> <Profit>294.00</Profit> </f> </c> … </CustomersOrders> You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function. Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and is supported inside XML data type methods. You can store XML instances inside SQL Server database in a column of the XML data type. An XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method) as opposed to overriding the whole thing, and shredding XML data into multiple rows in a result set (the nodes() method). The following code creates a variable of the XML data type to store an XML instance in it. Then it uses the query() method to return XML fragments from the XML instance. This method accepts XQuery query as a parameter. The XQuery query uses the FLWOR expressions to define and shape the XML returned. DECLARE @x AS XML; SET @x = N' <CustomersOrders> <Customer custid="1"> <!-- Comment 111 --> <companyname>CustA</companyname> <Order orderid="1"> <orderdate>2016-07-01T00:00:00</orderdate> </Order> <Order orderid="9"> <orderdate>2016-07-03T00:00:00</orderdate> </Order> <Order orderid="12"> <orderdate>2016-07-12T00:00:00</orderdate> </Order> </Customer> <Customer custid="2"> <!-- Comment 222 --> <companyname>CustB</companyname> <Order orderid="3"> <orderdate>2016-07-01T00:00:00</orderdate> </Order> <Order orderid="10"> <orderdate>2016-07-05T00:00:00</orderdate> </Order> </Customer> </CustomersOrders>'; SELECT @x.query('for $i in CustomersOrders/Customer/Order let $j := $i/orderdate where $i/@orderid < 10900 order by ($j)[1] return <Order-orderid-element> <orderid>{data($i/@orderid)}</orderid> {$j} </Order-orderid-element>') AS [Filtered, sorted and reformatted orders with let clause]; Here is the result of the previous query. 
<Order-orderid-element> <orderid>1</orderid> <orderdate>2016-07-01T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>3</orderid> <orderdate>2016-07-01T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>9</orderid> <orderdate>2016-07-03T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>10</orderid> <orderdate>2016-07-05T00:00:00</orderdate> </Order-orderid-element> <Order-orderid-element> <orderid>12</orderid> <orderdate>2016-07-12T00:00:00</orderdate> </Order-orderid-element> Summary In this article, you got a review of the SQL Server features for developers that exists already in the previous versions. You can see that this support goes well beyond basic SQL statements, and also beyond pure Transact-SQL. Resources for Article: Further resources on this subject: Configuring a MySQL linked server on SQL Server 2008 [article] Basic Website using Node.js and MySQL database [article] Exception Handling in MySQL for Python [article]


CNN architecture

Packt
13 Feb 2017
14 min read
In this article by Giancarlo Zaccone, the author of the book Deep Learning with TensorFlow, we learn about multi-layer network the outputs of all neurons of the input layer would be connected to each neuron of the hidden layer (fully connected layer). (For more resources related to this topic, see here.) In CNN networks, instead, the connection scheme, that defines the convolutional layer that we are going to describe, is significantly different. As you can guess this is the main type of layer, the use of one or more of these layers in a convolutional neural network is indispensable. In a convolutional layer, each neuron is connected to a certain region of the input area called the receptive field. For example, using a 3×3 kernel filter, each neuron will have a bias and 9=3×3 weights connected to a single receptive field. Of course, to effectively recognize an image we need different kernel filters applied to the same receptive field, because each filter should recognize a different feature's image. The set of neurons that identify the same feature define a single feature map. The preceding figure shows a CNN architecture in action, the input image of 28×28 size will be analyzed by a convolutional layer composed of 32 feature map of 28×28 size. The figure also shows a receptive field and the kernel filter of 3×3 size. Figure: CNN in action A CNN may consist of several convolution layers connected in cascade. The output of each convolution layer is a set of feature maps (each generated by a single kernel filter), then all these matrices defines a new input that will be used by the next layer. CNNs also use pooling layers positioned immediately after the convolutional layers. A pooling layer divides a convolutional region in subregions and select a single representative value (max-pooling or average pooling) to reduce the computational time of subsequent layers and increase the robustness of the feature with respect its spatial position. The last hidden layer of a convolutional network is generally a fully connected network with softmax activation function for the output layer. A model for CNNs - LeNet Convolutional and max-pooling layers are at the heart of the LeNet family models. It is a family of multi-layered feedforward networks specialized on visual pattern recognition. While the exact details of the model will vary greatly, the following figure points out the graphical schema of a LeNet network: Figure: LeNet network In a LeNet model, the lower-layers are composed to alternating convolution and max-pooling, while the last layers are fully-connected and correspond to a traditional feed forward network (fully connected + softmax layer). The input to the first fully-connected layer is the set of all features maps at the layer below. From a TensorFlow implementation point of view, this means lower-layers operate on 4D tensors. These are then flattened to a 2D matrix to be compatible with a feed forward implementation. Build your first CNN In this section, we will learn how to build a CNN to classify images of the MNIST dataset. We will see that a simple softmax model provides about 92% classification accuracy for recognizing hand-written digits in the MNIST. Here we'll implement a CNN which has a classification accuracy of about 99%. The next figure shows how the data flow in the first two convolutional layer--the input image is processed in the first convolutional layer using the filter-weights. This results in 32 new images, one for each filter in the convolutional layer. 
The images are also dowsampled with the pooling operation so the image resolution is decreased from 28×28 to 14×14. These 32 smaller images are then processed in the second convolutional layer. We need filter-weights again for each of these 32 features, and we need filter-weights for each output channel of this layer. The images are again downsampled with a pooling operation so that the image resolution is decreased from 14×14 to 7×7. The total number of features for this convolutional layer is 64. Figure: Data flow of the first two convolutional layers The 64 resulting images are filtered again by a (3×3) third convolutional layer. We don't apply a pooling operation for this layer. The output of the second convolutional layer is 128 images of 7×7 pixels each. These are then flattened to a single vector of length 4×4×128, which is used as the input to a fully-connected layer with 128 neurons (or elements). This feeds into another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image, that is, which number is depicted in the following image: Figure: Data flow of the last three convolutional layers The convolutional filters are initially chosen at random. The error between the predicted and actual class of the input image is measured as the so-called cost function which generalize our network beyond training data. The optimizer then automatically propagates this error back through the convolutional network and updates the filter-weights to improve the classification error. This is done iteratively thousands of times until the classification error is sufficiently low. Now let's see in detail how to code our first CNN. Let's start by importing Tensorflow libraries for our implementation: import tensorflow as tf import numpy as np from tensorflow.examples.tutorials.mnist import input_data Set the following parameters, that indicate the number of samples to consider respectively for the training phase (128) and then in the test phase (256). batch_size = 128 test_size = 256 We define the following parameter, the value is 28 because a MNIST image, is 28 pixels in height and width: img_size = 28 And the number of classes; the value 10 means that we'll have one class for each of 10 digits: num_classes = 10 A placeholder variable, X, is defined for the input images. The data type for this tensor is set to float32 and the shape is set to [None, img_size, img_size, 1], where None means that the tensor may hold an arbitrary number of images: X = tf.placeholder("float", [None, img_size, img_size, 1]) Then we set another placeholder variable, Y, for the true labels associated with the images that were input data in the placeholder variable X. The shape of this placeholder variable is [None, num_classes] which means it may hold an arbitrary number of labels and each label is a vector of length num_classes which is 10 in this case. Y = tf.placeholder("float", [None, num_classes]) We collect the mnist data which will be copied into the data folder: mnist = mnist_data.read_data_sets("data/") We build the datasets for training (trX, trY) and testing the network (teX, teY). trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels The trX and teX image sets must be reshaped according the input shape: trX = trX.reshape(-1, img_size, img_size, 1) teX = teX.reshape(-1, img_size, img_size, 1) We shall now proceed to define the network's weights. 
The init_weights function builds new variables in the given shape and initializes the network's weights with random values:

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

Each neuron of the first convolutional layer is convolved with a small subset of the input tensor, of dimension 3×3×1, while the value 32 is just the number of feature maps we are considering for this first layer. The weight w is then defined:

w = init_weights([3, 3, 1, 32])

The number of inputs is then increased to 32, which means that each neuron of the second convolutional layer is convolved with 3×3×32 neurons of the first convolutional layer. The w2 weight is:

w2 = init_weights([3, 3, 32, 64])

The value 64 represents the number of output features obtained. The third convolutional layer is convolved with 3×3×64 neurons of the previous layer, while 128 are the resulting features:

w3 = init_weights([3, 3, 64, 128])

The fourth layer is fully connected; it receives 128×4×4 inputs, while the output is equal to 625:

w4 = init_weights([128 * 4 * 4, 625])

The output layer receives 625 inputs, while the output is the number of classes:

w_o = init_weights([625, num_classes])

Note that these initializations are not actually done at this point; they are merely being defined in the TensorFlow graph. Two placeholder variables hold the dropout keep-probabilities for the convolutional and the fully connected layers:

p_keep_conv = tf.placeholder("float")
p_keep_hidden = tf.placeholder("float")

It's time to define the network model; as we did for the network's weights definition, it will be a function. It receives as input the X tensor, the weights tensors, and the dropout parameters for the convolutional and fully connected layers:

def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden):

The tf.nn.conv2d() function executes the TensorFlow operation for convolution; note that the strides are set to 1 in all dimensions. Indeed, the first and last stride must always be 1, because the first is for the image-number and the last is for the input-channel. The padding parameter is set to 'SAME', which means the input image is padded with zeroes so that the size of the output is the same:

conv1 = tf.nn.conv2d(X, w, strides=[1, 1, 1, 1], padding='SAME')

Then we pass the conv1 layer to a relu layer. It calculates the max(x, 0) function for each input pixel x, adding some non-linearity to the formula and allowing us to learn more complicated functions:

conv1 = tf.nn.relu(conv1)

The resulting layer is then pooled by the tf.nn.max_pool operator:

conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

It is a 2×2 max-pooling, which means that we are considering 2×2 windows and selecting the largest value in each window. Then we move 2 pixels to the next window. We try to reduce overfitting via the tf.nn.dropout() function, passing the conv1 layer and the p_keep_conv probability value:

conv1 = tf.nn.dropout(conv1, p_keep_conv)

As you can note, the next two convolutional layers, conv2 and conv3, are defined in the same way as conv1:

conv2 = tf.nn.conv2d(conv1, w2, strides=[1, 1, 1, 1], padding='SAME')
conv2 = tf.nn.relu(conv2)
conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
conv2 = tf.nn.dropout(conv2, p_keep_conv)
conv3 = tf.nn.conv2d(conv2, w3, strides=[1, 1, 1, 1], padding='SAME')
conv3_a = tf.nn.relu(conv3)

Two fully-connected layers are added to the network.
The input of the first fully connected layer, FC_layer, is the output of the previous convolutional layer, pooled one last time and then flattened:

FC_layer = tf.nn.max_pool(conv3_a, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
FC_layer = tf.reshape(FC_layer, [-1, w4.get_shape().as_list()[0]])

A dropout function is again used to reduce overfitting:

FC_layer = tf.nn.dropout(FC_layer, p_keep_conv)

The output layer receives FC_layer and the w4 weight tensor as input. A relu and a dropout operator are applied, respectively:

output_layer = tf.nn.relu(tf.matmul(FC_layer, w4))
output_layer = tf.nn.dropout(output_layer, p_keep_hidden)

The result is a vector of length 10 for determining which of the 10 classes the input image belongs to:

result = tf.matmul(output_layer, w_o)
return result

Cross-entropy is the performance measure we use in this classifier. Cross-entropy is a continuous function that is always positive and is equal to zero if the predicted output exactly matches the desired output. The goal of this optimization is therefore to minimize the cross-entropy, so that it gets as close to zero as possible, by changing the variables of the network layers. TensorFlow has a built-in function for calculating the cross-entropy. Note that the function calculates the softmax internally, so we must use the output of the model, py_x, directly:

py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)
Y_ = tf.nn.softmax_cross_entropy_with_logits(py_x, Y)

Now that we have defined the cross-entropy for each classified image, we have a measure of how well the model performs on each image individually. But to use the cross-entropy to guide the optimization of the network's variables we need a single scalar value, so we simply take the average of the cross-entropy for all the classified images:

cost = tf.reduce_mean(Y_)

To minimize the evaluated cost, we must define an optimizer. In this case, we adopt the RMSPropOptimizer, which is an advanced form of gradient descent. RMSPropOptimizer implements the RMSProp algorithm, an unpublished adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class. You can find Geoff Hinton's course at https://www.coursera.org/learn/neural-networks

RMSPropOptimizer also divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting the decay parameter to 0.9, while a good default value for the learning rate is 0.001:

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost)

Basically, the common Stochastic Gradient Descent (SGD) algorithm has a problem in that learning rates must scale with 1/T to get convergence, where T is the iteration number. RMSProp tries to get around this by automatically adjusting the step size so that the step is on the same scale as the gradients: as the average gradient gets smaller, the coefficient in the SGD update gets bigger to compensate. An interesting reference about this algorithm can be found here: http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf

Finally, we define predict_op, which is the index with the largest value across dimensions from the output of the model:

predict_op = tf.argmax(py_x, 1)

Note that optimization is not performed at this point. Nothing is calculated at all; we'll just add the optimizer object to the TensorFlow graph for later execution.

We now come to defining the network's running session. There are 55,000 images in the training set, so it takes a long time to calculate the gradient of the model using all these images.
Therefore we'll use a small batch of images in each iteration of the optimizer. If your computer crashes or becomes very slow because you run out of RAM, then you may try to lower this number, but you may then need to perform more optimization iterations.

Now we can proceed to implement a TensorFlow session:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    for i in range(100):

We get a batch of training examples; the tensor training_batch now holds a subset of images and corresponding labels:

        training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size))

Put the batch into feed_dict with the proper names for the placeholder variables in the graph. We run the optimizer using this batch of training data; TensorFlow assigns the variables in the feed to the placeholder variables and then runs the optimizer:

        for start, end in training_batch:
            sess.run(optimizer, feed_dict={X: trX[start:end], Y: trY[start:end], p_keep_conv: 0.8, p_keep_hidden: 0.5})

At the same time we get a shuffled batch of test samples:

        test_indices = np.arange(len(teX))
        np.random.shuffle(test_indices)
        test_indices = test_indices[0:test_size]

For each iteration we display the accuracy evaluated on the batch set:

        print(i, np.mean(np.argmax(teY[test_indices], axis=1) == sess.run(predict_op, feed_dict={X: teX[test_indices], Y: teY[test_indices], p_keep_conv: 1.0, p_keep_hidden: 1.0})))

Training a network can take several hours, depending on the computational resources used. The results on my machine are as follows:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully extracted to train-images-idx3-ubyte.mnist 9912422 bytes.
Loading data/train-images-idx3-ubyte.mnist
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully extracted to train-labels-idx1-ubyte.mnist 28881 bytes.
Loading data/train-labels-idx1-ubyte.mnist
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully extracted to t10k-images-idx3-ubyte.mnist 1648877 bytes.
Loading data/t10k-images-idx3-ubyte.mnist
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Successfully extracted to t10k-labels-idx1-ubyte.mnist 4542 bytes.
Loading data/t10k-labels-idx1-ubyte.mnist
(0, 0.95703125)
(1, 0.98046875)
(2, 0.9921875)
(3, 0.99609375)
(4, 0.99609375)
(5, 0.98828125)
(6, 0.99609375)
(7, 0.99609375)
(8, 0.98828125)
(9, 0.98046875)
(10, 0.99609375)
. . . . . .
(90, 1.0)
(91, 0.9921875)
(92, 0.9921875)
(93, 0.99609375)
(94, 1.0)
(95, 0.98828125)
(96, 0.98828125)
(97, 0.99609375)
(98, 1.0)
(99, 0.99609375)

After 10,000 iterations, the model has an accuracy of about 99%... not bad!!

Summary

In this article, we introduced Convolutional Neural Networks (CNNs). We have seen how the architecture of these networks makes them particularly suitable for image classification problems, making the training phase faster and the test phase more accurate. We therefore implemented an image classifier, testing it on the MNIST dataset, where we achieved 99% accuracy. Finally, we built a CNN to classify emotions starting from a dataset of images; we tested the network on a single image and we evaluated the limits and the goodness of our model.

Resources for Article:

Further resources on this subject:
Getting Started with Deep Learning [article]
Practical Applications of Deep Learning [article]
Deep learning in R [article]

Bitcoin

Packt
10 Feb 2017
13 min read
In this article by Imran Bashir, the author of the book Mastering Blockchain, we will look at bitcoin and its importance as an electronic cash system. (For more resources related to this topic, see here.)

Bitcoin is the first application of blockchain technology. In this article readers will be introduced to the bitcoin technology in detail. Bitcoin has started a revolution with the introduction of the very first fully decentralized digital currency, one that has been proven to be extremely secure and stable. This has also sparked great interest in academic and industrial research and introduced many new research areas.

Since its introduction in 2008, bitcoin has gained much popularity and is currently the most successful digital currency in the world, with billions of dollars invested in it. It is built on decades of research in the fields of cryptography, digital cash, and distributed computing. In the following section, a brief history is presented in order to provide the background required to understand the foundations behind the invention of bitcoin.

Digital currencies have been an active area of research for many decades. Early proposals to create digital cash go as far back as the early 1980s. In 1982, David Chaum proposed a scheme that used blind signatures to build untraceable digital currency. In this scheme, a bank would issue digital money by signing a blinded and random serial number presented to it by the user. The user could then use the digital token signed by the bank as currency. The limitation in this scheme is that the bank has to keep track of all used serial numbers. This is a central system by design and requires the users to trust it.

Later on, in 1990, David Chaum proposed a refined version named ecash that used not only blinded signatures but also some private identification data to craft a message that was then sent to the bank. This scheme allowed detection of double spending but did not prevent it. If the same token was used at two different locations, then the identity of the double spender would be revealed. ecash could only represent a fixed amount of money.

Adam Back's hashcash, introduced in 1997, was originally proposed to thwart email spam. The idea behind hashcash is to solve a computational puzzle that is easy to verify but comparatively difficult to compute. The idea is that for a single user and a single email the extra computational effort is not noticeable, but someone sending a large number of spam emails would be discouraged, as the time and resources required to run the spam campaign would increase substantially.

B-money was proposed by Wei Dai in 1998 and introduced the idea of using proof of work to create money. A major weakness in the system was that an adversary with higher computational power could generate unsolicited money without giving the network the chance to adjust to an appropriate difficulty level. The system lacked details on the consensus mechanism between nodes, and some security issues, such as Sybil attacks, were also not addressed. At the same time, Nick Szabo introduced the concept of bit gold, which was also based on a proof of work mechanism but had the same problems as b-money, with one exception: the network difficulty level was adjustable.

Tomas Sander and Amnon Ta-Shma introduced an ecash scheme in 1999 that for the first time used merkle trees to represent coins and zero knowledge proofs to prove possession of coins. In this scheme, a central bank was required that kept a record of all used serial numbers.
This scheme allowed users to be fully anonymous, albeit at some computational cost. RPOW (Reusable Proof of Work), introduced by Hal Finney in 2004, used the hashcash scheme by Adam Back as proof of the computational resources spent to create the money. This was also a central system that kept a central database to keep track of all used PoW tokens. It was an online system that used remote attestation, made possible by a trusted computing platform (TPM hardware).

All the above-mentioned schemes were intelligently designed but were weak in one aspect or another. In particular, all these schemes rely on a central server which is required to be trusted by the users.

Bitcoin

In 2008, the bitcoin paper Bitcoin: A Peer-to-Peer Electronic Cash System was written by Satoshi Nakamoto. The first key idea introduced in the paper is that it is a purely peer-to-peer electronic cash that does not need an intermediary bank to transfer payments between peers. Bitcoin is built on decades of cryptographic research, such as the research on merkle trees, hash functions, public key cryptography, and digital signatures. Moreover, ideas like bit gold, b-money, hashcash, and cryptographic time stamping have provided the foundations for the invention of bitcoin. All these technologies are cleverly combined in bitcoin to create the world's first decentralized currency. The key issue that has been addressed in bitcoin is an elegant solution to the Byzantine Generals problem, along with a practical solution to the double spend problem.

The value of bitcoin has increased significantly since 2011, as shown in the graph below:

Bitcoin price and volume since 2012 (on a logarithmic scale)

Regulation of bitcoin is a controversial subject, and as much as it is a libertarian's dream, law enforcement agencies and governments are proposing various regulations to control it, such as the BitLicense issued by New York's State Department of Financial Services. This is a license issued to businesses which perform activities related to virtual currencies.

The growth of bitcoin is also due to the so-called network effect. Also called demand-side economies of scale, it is a concept which basically means that the more users use the network, the more valuable it becomes. Over time, an exponential increase has been seen in bitcoin network growth. Even though the price of bitcoin is quite volatile, it has increased significantly over the last few years. Currently (at the time of writing) the bitcoin price is 815 GBP.

Bitcoin definition

Bitcoin can be defined in various ways: it's a protocol, a digital currency, and a platform. It is a combination of a peer-to-peer network, protocols, and software that facilitate the creation and usage of the digital currency named bitcoin. Note that Bitcoin with a capital B is used to refer to the Bitcoin protocol, whereas bitcoin with a lowercase b is used to refer to bitcoin, the currency. Nodes in this peer-to-peer network talk to each other using the Bitcoin protocol.

Decentralization of currency was made possible for the first time with the invention of Bitcoin. Moreover, the double spending problem was solved in an elegant and ingenious way in bitcoin. The double spending problem arises when, for example, a user sends the same coins to two different users at the same time and both are verified independently as valid transactions.

Keys and addresses

Elliptic curve cryptography is used to generate public and private key pairs in the Bitcoin network. The Bitcoin address is created by taking the public key corresponding to a private key and hashing it twice, first with the SHA256 algorithm and then with RIPEMD160.
The resultant 160-bit hash is then prefixed with a version number and finally encoded with the Base58Check encoding scheme. Bitcoin addresses are 26 to 35 characters long and begin with the digit 1 or 3. A typical bitcoin address looks like the string shown as follows:

1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt

This is also commonly encoded in a QR code for easy sharing. The QR code of the preceding address is as follows:

QR code of a bitcoin address 1ANAguGG8bikEv2fYsTBnRUmx7QUcK58wt

There are currently two types of addresses, the commonly used P2PKH type and the P2SH type, starting with 1 and 3, respectively. In the early days, bitcoin used direct Pay-to-Pubkey, which is now superseded by P2PKH. However, direct Pay-to-Pubkey is still used in bitcoin for coinbase addresses. Addresses should not be used more than once; otherwise privacy and security issues can arise. Avoiding address reuse circumvents anonymity issues to some extent, but bitcoin has some other security issues as well, such as transaction malleability, which requires a different approach to resolve.

Private key and bitcoin address in a paper wallet, from bitaddress.org

Public keys in bitcoin

In public key cryptography, public keys are generated from private keys. Bitcoin uses ECC based on the SECP256K1 standard. A private key is randomly selected and is 256 bits in length. Public keys can be presented in uncompressed or compressed format. Public keys are basically x and y coordinates on an elliptic curve, and in uncompressed format they are presented with a prefix of 04 in hexadecimal format. The x and y coordinates are both 32 bytes in length. In total, the compressed public key is 33 bytes long, as compared to 65 bytes in the uncompressed format. The compressed version of a public key basically includes only the x part, since the y part can be derived from it. The bitcoin client initially used uncompressed keys, but starting from Bitcoin Core client 0.6, compressed keys are used as standard. Keys are identified by various prefixes, described as follows:

Uncompressed public keys use 0x04 as a prefix.
A compressed public key starts with 0x03 if the 32-byte y part of the public key is odd.
A compressed public key starts with 0x02 if the 32-byte y part of the public key is even.

A more mathematical description, and the reason why this works, is described later. If the ECC graph is visualized, it reveals that the y coordinate can be either below the x-axis or above the x-axis, and as the curve is symmetric, only its location in the prime field is required to be stored.

Private keys in bitcoin

Private keys are basically 256-bit numbers chosen in the range specified by the SECP256K1 ECDSA recommendation. Any randomly chosen 256-bit number from 0x1 to 0xFFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFE BAAE DCE6 AF48 A03B BFD2 5E8C D036 4140 is a valid private key. Private keys are usually encoded using the Wallet Import Format (WIF) in order to make them easier to copy and use. WIF can be converted to a private key and vice versa; the steps are described later. Also, the Mini Private Key Format is sometimes used to encode the key in under 30 characters, to allow storage where physical space is limited, for example, etching on physical coins or in damage-resistant QR codes. The Bitcoin Core client also allows encryption of the wallet that contains the private keys.

Bitcoin currency units

Bitcoin currency units are described as follows; the smallest bitcoin denomination is the satoshi.
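Before looking at Base58Check encoding more closely, the address derivation just described (SHA256, then RIPEMD160, a version prefix, and finally Base58Check) can be sketched in a few lines of Python 3. This is purely an illustrative sketch, not code from the book: it assumes the public key is supplied as a hex string and that your OpenSSL build exposes RIPEMD160 through hashlib:

import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(payload):
    # append a 4-byte checksum: the first 4 bytes of SHA256(SHA256(payload))
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + checksum
    # convert the byte string to a base-58 string
    n = int.from_bytes(data, "big")
    encoded = ""
    while n > 0:
        n, r = divmod(n, 58)
        encoded = BASE58_ALPHABET[r] + encoded
    # each leading zero byte is represented by a leading '1'
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + encoded

def public_key_to_address(pubkey_hex, version=b"\x00"):
    # hash160 = RIPEMD160(SHA256(public key)); version byte 0x00 yields a
    # P2PKH address starting with the digit 1
    sha = hashlib.sha256(bytes.fromhex(pubkey_hex)).digest()
    h160 = hashlib.new("ripemd160", sha).digest()
    return base58check(version + h160)

P2SH addresses use the version byte 0x05 instead, which is why they begin with the digit 3.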
Base58Check encoding

This encoding is used to limit the confusion between various characters, such as 0OIl, as they can look the same in different fonts. The encoding basically takes a binary byte array and converts it into a human-readable string. This string is composed by utilizing a set of 58 alphanumeric symbols. More explanation and logic can be found in the base58.h source file in the bitcoin source code:

Explanation from the bitcoin source code

Bitcoin addresses are encoded using Base58Check encoding.

Vanity addresses

As bitcoin addresses are based on base-58 encoding, it is possible to generate addresses that contain human-readable messages. An example is shown as follows:

Public address encoded in a QR code

Vanity addresses are generated by using a purely brute force method. An example is shown as follows:

Vanity address generated from https://bitcoinvanitygen.com/

Transactions

Transactions are at the core of the bitcoin ecosystem. Transactions can be as simple as just sending some bitcoins to a bitcoin address, or they can be quite complex, depending on the requirements. Each transaction is composed of at least one input and one output. Inputs can be thought of as coins being spent that were created in a previous transaction, and outputs as coins being created. If a transaction is minting new coins, then there is no input and therefore no signature is needed. If a transaction is to send coins to some other user (a bitcoin address), then it needs to be signed by the sender with their private key, and a reference is also required to the previous transaction to show the origin of the coins. Coins are, in fact, unspent transaction outputs represented in satoshis. Transactions are not encrypted and are publicly visible in the blockchain. Blocks are made up of transactions, and these can be viewed by using any online blockchain explorer.

Transaction life cycle

1. A user/sender sends a transaction using wallet software or some other interface.
2. The wallet software signs the transaction using the sender's private key.
3. The transaction is broadcast to the Bitcoin network using a flooding algorithm.
4. Mining nodes include this transaction in the next block to be mined.
5. Mining starts, and once a miner solves the Proof of Work problem, it broadcasts the newly mined block to the network. Proof of Work is explained in detail later.
6. The nodes verify the block and propagate it further, and confirmations start to generate.
7. Finally, the confirmations start to appear in the receiver's wallet, and after approximately six confirmations the transaction is considered finalized and confirmed. However, six is just a recommended number; the transaction can be considered final even after the first confirmation. The key idea behind waiting for six confirmations is that the probability of double spending is virtually eliminated after six confirmations.

Transaction structure

A transaction, at a high level, contains metadata, inputs, and outputs. Transactions are combined together to create a block. The transaction structure is shown in the following table:

Version number (4 bytes): Used to specify the rules to be used by the miners and nodes for transaction processing.
Input counter (1 to 9 bytes): The number of inputs included in the transaction.
List of inputs (variable): Specifies one or more transaction inputs. Each input is composed of several fields, including the previous transaction hash, the previous Txout-index, the Txin-script length, the Txin-script, and an optional sequence number. The first transaction in a block is also called the coinbase transaction.
Out-counter (1 to 9 bytes): A positive integer representing the number of outputs.
List of outputs (variable): The outputs included in the transaction.
lock_time (4 bytes): Defines the earliest time when a transaction becomes valid. It is either a Unix timestamp or a block number.

Metadata: This part of the transaction contains some values like the size of the transaction, the number of inputs and outputs, the hash of the transaction, and a lock_time field. Every transaction has a prefix specifying the version number.

Inputs: Generally, each input spends a previous output. Each output is considered a UTXO (unspent transaction output) until an input consumes it.

Outputs: Outputs have only two fields, and they contain instructions for sending bitcoins. The first field contains the amount in satoshis, whereas the second field is a locking script which contains the conditions that need to be met in order for the output to be spent. More information about transaction spending by using locking and unlocking scripts and producing outputs is discussed later.

Verification: Verification is performed by using Bitcoin's scripting language.

Summary

In this article, we learned about the importance of bitcoin as a digital currency, and how bitcoin keys and addresses are encoded using various encoding techniques.

Resources for Article:

Further resources on this subject:
Bitcoins – Pools and Mining [article]
Protecting Your Bitcoins [article]
FPGA Mining [article]

Context – Understanding your Data using R

Packt
07 Feb 2017
32 min read
In this article by James D Miller, the author of the book Big Data Visualization, we will explore the idea of adding context to the data you are working with. Specifically, we'll discuss the importance of establishing data context, as well as the practice of profiling your data for context discovery and how big data affects this effort.

The article is organized into the following main sections:

Adding Context
About R
R and Big Data
R Example 1 -- healthcare data
R Example 2 -- healthcare data

(For more resources related to this topic, see here.)

When writing a book, authors leave context clues for their readers. A context clue is a "source of information" about written content that may be difficult or unique, and that helps readers understand. This information offers insight into the content being read or consumed (an example might be: "It was an idyllic day; sunny, warm and perfect…"). With data, context clues should be developed through a process referred to as profiling (we'll discuss profiling in more detail later in this article), so that the data consumer can better understand the data when it is visualized. (Additionally, having context and perspective on the data you are working with is a vital step in determining what kind of data visualization should be created). Context or profiling examples might be calculating the average age of "patients" or subjects within the data, or "segmenting the data into time periods" (years or months, usually).

Another motive for adding context to data might be to gain a new perspective on the data. An example of this might be recognizing and examining a comparison present in the data. For example, body fat percentages of urban high school seniors could be compared to those of rural high school seniors.

Adding context to your data before creating visualizations can certainly make it (the data visualization) more relevant, but context still can't serve as a substitute for value. Before you consider any factors such as time of day, geographic location, or average age, first and foremost, your data visualization needs to benefit those who are going to consume it, so establishing appropriate context requirements will be critical. For data profiling (adding context), the rule is:

Before Context, Think → Value

Generally speaking, there are several visualization contextual categories which can be used to augment or increase the value and understanding of data for visualization. These include:

Definitions and explanations
Comparisons
Contrasts
Tendencies
Dispersion

Definitions and explanations

This is providing additional information or "attributes" about a data point. For example, if the data contains a field named "patient ID" and we come to know that records describe individual patients, we may choose to calculate and add each individual patient's BMI, or body mass index:

Comparisons

This is adding a comparable value to a particular data point. For example, you might compute and add a national ranking to each "total by state":

Contrasts

This is almost like adding an "opposite" to a data point to see if it perhaps determines a different perspective. An example might be reviewing average body weights for patients who consume alcoholic beverages versus those who do not consume alcoholic beverages:

Tendencies

These are the "typical" mathematical calculations (or summaries) on the data as a whole or by another category within the data, such as the mean, median, and mode.
For example, you might add a median heart rate for the age group each patient in the data is a member of:

Dispersion

Again, these are mathematical calculations (or summaries), such as range, variance, and standard deviation, but they describe the spread of a data set (or a group within the data) around its average. For example, you may want to add the "range" for a selected value, such as the minimum and maximum number of hospital stays found in the data for each patient age group:

The "art" of profiling data to add context and identify new and interesting perspectives for visualization is still ever evolving; no doubt there are additional contextual categories existing today that can be investigated as you continue your work with big data visualization projects.

Adding Context

So, how do we add context to data? Is it merely a matter of selecting Insert, then Data Context? No, it's not that easy (but it's not impossible either). Once you have identified (or "pulled together") your big data source (or at least a significant amount of data), how do you go from mountains of raw big data to summarizations that can be used as input to create valuable data visualizations, helping you to further analyze that data and support your conclusions? The answer is through data profiling.

Data profiling involves logically "getting to know" the data you think you may want to visualize – through query, experimentation, and review. Following the profiling process, you can then use the information you have collected to add context (and/or apply new "perspectives") to the data. Adding context to data requires the manipulation of that data, perhaps reformatting it, adding calculations, aggregations, or additional columns, re-ordering it, and so on. Finally, you will be ready to visualize (or "picture") your data. The complete profiling process is shown below:

Pull together (the data, or enough of the data)
Profile (the data, through query, experimentation, and review)
Add Perspective(s) (or context), and finally…
Picture (visualize) the data

About R

R is a language and environment that is easy to learn, very flexible in nature, and also very focused on statistical computing, making it great for manipulating, cleaning, summarizing, producing probability statistics, and so on (as well as actually creating visualizations with your data), so it's a great choice for the exercises required for profiling, establishing context, and identifying additional perspectives. In addition, here are a few more reasons to use R when profiling your big data:

R is used by a large number of academic statisticians – so it's a tool that is not "going away"
R is pretty much platform independent – what you develop will run almost anywhere
R has awesome help resources – just Google it; you'll see!

R and Big Data

Although R is free (open source), super flexible, and feature rich, you must keep in mind that R keeps everything in your machine's memory, and this can become problematic when you are working with big data (even with the low resource costs of today). Thankfully, though, there are various options and strategies to work with this limitation, such as employing a sort of "pseudo-sampling" technique, which we will expound on later in this article (as part of some of the examples provided). Additionally, R libraries have been developed that can leverage hard drive space (as a sort of virtual extension to your machine's memory), again touched on in this article's examples.
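To make the memory limitation concrete, here is a minimal base-R sketch (it is not one of this article's scripts) of a chunked read that keeps only a running summary in memory rather than the whole file. It reuses the sample survey file path from Example 1 below, assumes the fields contain no embedded commas, and tallies hospital visits by sex, the same count performed later in the article:

# --- open a connection to the survey file (path reused from Example 1 below)
con <- file("C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", open = "r")
chunkSize <- 100000                      # lines per read; tune to available memory
sexCounts <- c(Female = 0, Male = 0)     # running tally of hospital visits by sex
repeat {
  lines <- readLines(con, n = chunkSize)
  if (length(lines) == 0) break          # end of file reached
  # --- split each line on commas and keep just the 3rd field (sex)
  sex <- sapply(strsplit(lines, ","), `[`, 3)
  sexCounts["Female"] <- sexCounts["Female"] + sum(sex == "Female")
  sexCounts["Male"]   <- sexCounts["Male"]   + sum(sex == "Male")
}
close(con)
sexCounts

Only one chunk of raw text is ever held in memory at a time; the examples that follow assume a file small enough to load whole, but the same counting logic applies either way.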
Example 1

In this article's first example we'll use data collected from a theoretical hospital where, upon admission, patient medical history information is collected through an online survey. Information is also added to a "patients file" as treatment is provided. The file includes many fields, including:

Basic descriptive data for the patient, such as sex, date of birth, height, weight, blood type, and so on
Vital statistics, such as blood pressure, heart rate, and so on
Medical history, such as number of hospital visits, surgeries, major illnesses or conditions, whether currently under a doctor's care, and so on
Demographic statistics, such as occupation, home state, educational background, and so on

Some additional information is also collected in the file in an attempt to develop patient characteristics and habits, such as the number of times the patient included beef, pork, and fowl in their weekly diet, or whether they typically use a butter replacement product, and so on. Periodically, the data is "dumped" to text files, which are comma-delimited and contain the following fields (in this order):

Patientid, recorddate_month, recorddate_day, recorddate_year, sex, age, weight, height, no_hospital_visits, heartrate, state, relationship, Insured, Bloodtype, blood_pressure, Education, DOBMonth, DOBDay, DOBYear, current_smoker, current_drinker, currently_on_medications, known_allergies, currently_under_doctors_care, ever_operated_on, occupation, Heart_attack, Rheumatic_Fever, Heart_murmur, Diseases_of_the_arteries, Varicose_veins, Arthritis, abnormal_bloodsugar, Phlebitis, Dizziness_fainting, Epilepsy_seizures, Stroke, Diphtheria, Scarlet_Fever, Infectious_mononucleosis, Nervous_emotional_problems, Anemia, hyroid_problems, Pneumonia, Bronchitis, Asthma, Abnormal_chest_Xray, lung_disease, Injuries_back_arms_legs_joints_Broken_bones, Jaundice_gallbladder_problems, Father_alive, Father_current_age, Fathers_general_health, Fathers_reason_poor_health, Fathersdeceased_age_death, mother_alive, Mother_current_age, Mother_general_health, Mothers_reason_poor_health, Mothers_deceased_age_death, No_of_brothers, No_of_sisters, age_range, siblings_health_problems, Heart_attacks_under_50, Strokes_under_50, High_blood_pressure, Elevated_cholesterol, Diabetes, Asthma_hayfever, Congenital_heart_disease, Heart_operations, Glaucoma, ever_smoked_cigs, cigars_or_pipes, no_cigs_day, no_cigars_day, no_pipefuls_day, if_stopped_smoking_when_was_it, if_still_smoke_how_long_ago_start, target_weight, most_ever_weighed, 1_year_ago_weight, age_21_weight, No_of_meals_eatten_per_day, No_of_times_per_week_eat_beef, No_of_times_per_week_eat_pork, No_of_times_per_week_eat_fish, No_of_times_per_week_eat_fowl, No_of_times_per_week_eat_desserts, No_of_times_per_week_eat_fried_foods, No_servings_per_week_wholemilk, No_servings_per_week_2%_milk, No_servings_per_week_tea, No_servings_per_week_buttermilk, No_servings_per_week_1%_milk, No_servings_per_week_regular_or_diet_soda, No_servings_per_week_skim_milk, No_servings_per_week_coffee, No_servings_per_week_water, beer_intake, wine_intake, liquor_intake, use_butter, use_extra_sugar, use_extra_salt, different_diet_weekends, activity_level, sexually_active, vision_problems, wear_glasses

Following is the image showing a portion of the file (displayed in MS Windows Notepad):

Assuming we have been given no further information about the data, other than the provided field name list and the knowledge that the data is captured by hospital personnel upon patient admission, the next step would be to perform some sort of profiling of the data:
investigating to start understanding the data and then to start adding context and perspectives (so that ultimately we can create some visualizations). Initially, we start out by looking through the field or column names in our file, and some ideas start to come to mind. For example:

What is the data time frame we are dealing with? Using the field record date, can we establish a period of time (or time frame) for the data? (In other words, over what period of time was this data captured?)
Can we start "grouping the data" using fields such as sex, age, and state?

Eventually, what we should be asking is, "what can we learn from visualizing the data?" Perhaps:

What is the breakdown of those currently smoking by age group?
What is the ratio of those currently smoking to the number of hospital visits?
Do those patients currently under a doctor's care, on average, have better BMI ratios?
And so on.

Dig in with R

Using the power of R programming, we can run various queries on the data, noting that the results of those queries may spawn additional questions and queries and, eventually, yield data ready for visualizing. Let's start with a few simple profile queries.

I always start my data profiling by "time boxing" the data. The following R script (although, as mentioned earlier, there are many ways to accomplish the same objective) works well for this:

# --- read our file into a temporary R table
tmpRTable4TimeBox<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")
# --- convert to an R data frame and filter it to just include
# --- the 2nd column or field of data
data.df <- data.frame(tmpRTable4TimeBox)
data.df <- data.df[,2]
# --- provides a sorted list of the years in the file
YearsInData = substr(substr(data.df[],(regexpr('/',data.df[])+1),11),( regexpr('/',substr(data.df[],(regexpr('/',data.df[])+1),11))+1),11)
# --- write a new file named ListofYears
write.csv(sort(unique(YearsInData)),file="C:/Big Data Visualization/Chapter 3/ListofYears.txt",quote = FALSE, row.names = FALSE)

The above simple R script provides a sorted list file (ListofYears.txt), shown below, containing the years found in the data we are profiling:

Now we can see that our patient survey data was collected during the years 1999 through 2016, and with this information we start to add context to (or gain a perspective on) our data. We could further time-box the data by perhaps breaking the years into months (we will do this later on in this article), but let's move on now to some basic "grouping profiling". Assuming that each record in our data represents a unique hospital visit, how can we determine the number of hospital visits (the number of records) by sex, age, and state?

Here I will point out that it may be worthwhile establishing the size (the number of rows or records; we already know the number of columns or fields) of the file you are working with. This is important, since the size of the data file will dictate the programming or scripting approach you will need to use during your profiling. Simple R functions valuable to know are nrow and head. These simple commands can be used to count the total rows in a file:

nrow(mydata)

Or to view the first number of rows of data:

head(mydata, n=10)

So, using R, one could write a script to load the data into a table, convert it to a data frame, and then read through all the records in the file and "count up" or "tally" the number of hospital visits (the number of records) for males and females.
Such logic is a snap to write:

# --- assuming tmpRTable holds the data already
datas.df<-data.frame(tmpRTable)
# --- initialize 2 counter variables
NumberMaleVisits <-0;NumberFemaleVisits <-0
# --- read through the data
for(i in 1:nrow(datas.df))
{
if (datas.df[i,3] == 'Male') {NumberMaleVisits <- NumberMaleVisits + 1}
if (datas.df[i,3] == 'Female') {NumberFemaleVisits <- NumberFemaleVisits + 1}
}
# --- show me the totals
NumberMaleVisits
NumberFemaleVisits

The previous script works, but in a big data scenario there is a more efficient way, since reading or "looping through" and counting each record will take far too long. Thankfully, R provides the table function, which can be used similarly to the SQL "group by" command. The following script assumes that our data is already in an R data frame (named datas.df), so, using the sequence number of the field in the file, if we want to see the number of hospital visits for males and the number of hospital visits for females we can write:

# --- using R table function as "group by" field number
# --- patient sex is the 3rd field in the file
table(datas.df[,3])

Following is the output generated from running the above script. Notice that R shows "sex" with a count of 1, since the script included the file's "header record" as a unique value:

We can also establish the number of hospital visits by state (state is the 9th field in the file):

table(datas.df[,9])

Age (or the fourth field in the file) can also be studied using the R functions sort and table:

sort(table(datas.df[,4]))

Note that since there are quite a few more values for age within the file, I've sorted the output using the R sort function.

Moving on now, let's see if there is a difference between the number of hospital visits for patients who are current smokers (field name current_smoker, which is field number 16 in the file) and those indicating that they are not current smokers. We can use the same R scripting logic:

sort(table(datas.df[16]))

Surprisingly (one might think), it appears from our profiling that those patients who currently do not smoke have had more hospital visits (113,681) than those who currently are smokers (12,561):

Another interesting R script to continue profiling our data might be:

table(datas.df[,3],datas.df[,16])

The above script again uses the R table function to group data, but shows how we can "group within a group"; in other words, using this script we can get totals for "current" and "non-current" smokers, grouped by sex. In the below image we see that the difference between female smokers and male smokers might be considered marginal:

So we see that by using the above simple R script examples, we've been able to add some context to our healthcare survey data. By reviewing the list of fields provided in the file we can come up with the R profiling queries shown (and many others) without much effort. We will continue with some more complex profiling in the next section, but for now, let's use R to create a few data visualizations based upon what we've learned so far through our profiling.

Going back to the number of hospital visits by sex, we can use the R function barplot to create a visualization of visits by sex. But first, a couple of "helpful hints" for creating the script. First, rather than using the table function, you can use the ftable function, which creates a "flat" version of the original function's output. This makes it easier to exclude the header record count of 1 that comes back from the table function.
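As a side note (a small sketch that is not part of the original scripts), the header record can also be kept out of the counts at load time by telling read.table that the first line of the file holds column names; the grouping can then refer to those names directly, assuming the header matches the field list given earlier:

# --- read the file treating the first line as column names
datas.df <- read.table(file = "C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# --- the header line is no longer counted, and columns can be addressed by name
table(datas.df$sex)
table(datas.df$sex, datas.df$current_smoker)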
Next, we can leverage some additional arguments of the barplot function, like col, border, and names.arg, along with the title function, to make the visualization a little nicer to look at. Below is the script:

# --- use ftable function to drop out the header record
forChart<- ftable(datas.df[,3])
# --- create bar names
barnames<-c("Female","Male")
# --- use barplot to draw bar visual
barplot(forChart[2:3], col = "brown1", border = TRUE, names.arg = barnames)
# --- add a title
title(main = list("Hospital Visits by Sex", font = 4))

The script's output (our visualization) is shown below:

We could follow the same logic for creating a similar visualization of hospital visits by state:

st<-ftable(datas.df[,9])
barplot(st)
title(main = list("Hospital Visits by State", font = 2))

But the visualization generated isn't very clear:

One can always experiment a bit more with this data to make the visualization a little more interesting. Using the R functions substr and regexpr, we can create an R data frame that contains a record for each hospital visit by state within each year in the file. Then we can use the function plot (rather than barplot) to generate the visualization. Below is the R script:

# --- create a data frame from our original table file
datas.df <- data.frame(tmpRTable)
# --- create a filtered data frame of records from the file
# --- using the record year and state fields from the file
dats.df<-data.frame(substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11),datas.df[,9])
# --- plot to show a visualization
plot(sort(table(dats.df[2]),decreasing = TRUE),type="o", col="blue")
title(main = list("Hospital Visits by State (Highest to Lowest)", font = 2))

Here is the different (perhaps more interesting) version of the visualization generated by the previous script:

Another earlier perspective on the data was concerning age. We grouped the hospital visits by the age of the patients (using the R table function). Since there are many different patient ages, a common practice is to establish age ranges, such as the following:

21 and under
22 to 34
35 to 44
45 to 54
55 to 64
65 and over

To implement the previous age ranges, we need to organize the data, and could use the following R script:

# --- initialize age range counters
a1 <-0;a2 <-0;a3 <-0;a4 <-0;a5 <-0;a6 <-0
# --- read and count visits by age range
for(i in 2:nrow(datas.df))
{
if (as.numeric(datas.df[i,4]) < 22) {a1 <- a1 + 1}
if (as.numeric(datas.df[i,4]) > 21 & as.numeric(datas.df[i,4]) < 35) {a2 <- a2 + 1}
if (as.numeric(datas.df[i,4]) > 34 & as.numeric(datas.df[i,4]) < 45) {a3 <- a3 + 1}
if (as.numeric(datas.df[i,4]) > 44 & as.numeric(datas.df[i,4]) < 55) {a4 <- a4 + 1}
if (as.numeric(datas.df[i,4]) > 54 & as.numeric(datas.df[i,4]) < 65) {a5 <- a5 + 1}
if (as.numeric(datas.df[i,4]) > 64) {a6 <- a6 + 1}
}

Big Data Note: Looping or reading through each of the records in our file isn't very practical if there are a trillion records. Later in this article we'll use a much better approach, but for now we will assume a smaller file size for convenience.
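One loop-free alternative (shown here only as a sketch; it is not the approach used in the remainder of this article) is to let R bin the ages in a single vectorized step using the cut and table functions; a header record, if present, simply becomes NA and is dropped from the counts:

# --- bin the age field (column 4) into the ranges listed above, without a loop
ageRanges <- cut(as.numeric(datas.df[, 4]),
                 breaks = c(-Inf, 21, 34, 44, 54, 64, Inf),
                 labels = c("under 21", "22-34", "35-44", "45-54", "55-64", "65 & over"))
table(ageRanges)

The counters a1 through a6 produced by the loop above are what the pie chart in the next step uses.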
Once the counting script above has been run, we can use the R pie function and the following code to create our pie chart visualization:

# --- create Pie Chart
slices <- c(a1, a2, a3, a4, a5, a6)
lbls <- c("under 21", "22-34","35-44","45-54","55-64", "65 & over")
pie(slices, labels = lbls, main="Hospital Visits by Age Range")

Following is the generated visualization:

Finally, earlier in this section we looked at the values in field 16 of our file, which indicates whether the survey patient is a current smoker. We could build a simple visual showing the totals, but (again) the visualization isn't very interesting or all that informative. With some simple R scripts, we can proceed to create a visualization showing the number of hospital visits, year over year, by those patients that are current smokers.

First, we can "reformat" the data in our R data frame (named datas.df) to store only the year (of the record date) using the R function substr. This makes it a little easier to aggregate the data by year, as shown in the next steps. The R script using the substr function is shown below:

# --- redefine the record date field to hold just the record
# --- year value
datas.df[,2]<-substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11)

Next, we can create an R table named c to hold the record date year and the totals (of non-smokers and current smokers) for each year. Following is the R script used:

# --- create a table holding record year and total count for
# --- smokers and not smoking
c<-table(datas.df[,2],datas.df[,16])

Finally, we can use the R barplot function to create our visualization. Again, there is more than likely a cleverer way to set up the objects bars and lbls, but for now, I simply hand-coded the years' data I wanted to see in my visualization:

# --- set up the values to chart and the labels for each bar
# --- in the chart
bars<-c(c[2,3], c[3,3], c[4,3],c[5,3],c[6,3],c[7,3],c[8,3],c[9,3],c[10,3],c[11,3],c[12,3],c[13,3])
lbls<-c("99","00","01","02","03","04","05","06","07","08","09","10")

Now the R script to actually produce the bar chart visualization is shown below:

# --- create the bar chart
barplot(bars, names.arg=lbls, col="red")
title(main = list("Smoking Patients Year to Year", font = 2))

Below is the generated visualization:

Example 2

In the above examples, we've presented some pretty basic and straightforward data profiling exercises. Typically, once you've become somewhat familiar with your data, having added some context (through some basic profiling), one would extend the profiling process, trying to look at the data in additional ways using techniques such as those mentioned in the beginning of this article: defining new data points based upon the existing data, performing comparisons, looking at contrasts (between data points), identifying tendencies, and using dispersions to establish the variability of the data. Let's now review some of these options for extended profiling using simple examples, as well as the same source data as was used in the previous section's examples.

Definitions & Explanations

One method of extending your data profiling is to "add to" the existing data by creating additional definition or explanatory "attributes" (in other words, adding new fields to the file). This means that you use existing data points found in the data to create (hopefully new and interesting) perspectives on the data.
In the data used in this article, a thought-provoking example might be to use the existing patient information (such as the patient's weight and height) to calculate a new point of data: body mass index (BMI) information. A generally accepted formula for calculating a patient's body mass index is:

BMI = (weight (lbs.) / (height (in))^2) x 703

For example: (165 lbs.) / (70^2) x 703 = 23.67 BMI.

Using the above formula, we can use the following R script (assuming we've already loaded the R object named tmpRTable with our file data) to generate a new file of BMI percentages and state names:

j=1
for(i in 2:nrow(tmpRTable)) {
W<-as.numeric(as.character(tmpRTable[i,5]))
H<-as.numeric(as.character(tmpRTable[i,6]))
P<-(W/(H^2)*703)
datas2.df[j,1]<-format(P,digits=3)
datas2.df[j,2]<-tmpRTable[i,9]
j=j+1
}
write.csv(datas2.df[1:j-1,1:2],file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the generated file:

Now we have a new file of BMI percentages by state (one BMI record for each hospital visit in each state). Earlier in this article we touched on the concept of looping or reading through all of the records in a file or data source and creating counts based on various field or column values. Such logic works fine for medium or smaller files, but a much better approach (especially with big data files) would be to use the power of various R commands.

No Looping

Although the above-described R script does work, it requires looping through each record in our file, which is slow and inefficient to say the least. So, let's consider a better approach. Again, assuming we've already loaded the R object named tmpRTable with our data, the below R script can accomplish the same results (create the same file) in just 2 lines:

PDQ<-paste(format((as.numeric(as.character(tmpRTable[,5]))/(as.numeric(as.character(tmpRTable[,6]))^2)*703),digits=2),',',tmpRTable[,9],sep="")
write.csv(PDQ,file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE,row.names = FALSE)

We could now use this file (or one similar) as input to additional profiling exercises or to create a visualization, but let's move on.

Comparisons

Performing comparisons during data profiling can also add new and different perspectives to the data. Beyond simple record counts (like total smoking patients visiting a hospital versus total non-smoking patients visiting a hospital), one might want to compare the total number of hospital visits for each state to the average number of hospital visits for a state. This would require calculating the total number of hospital visits by state as well as the total number of hospital visits overall (then computing the average).
The following 2 lines of code use the R functions table and write.csv to create a list (a file) of the total number of hospital visits found for each state:

# --- calculates the number of hospital visits for each
# --- state (state ID is in field 9 of the file)
StateVisitCount<-table(datas.df[9])
# --- write out a csv file of counts by state
write.csv(StateVisitCount, file="C:/Big Data Visualization/Chapter 3/visitsByStateName.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the file that is generated:

The following R command can be used to calculate the average number of hospital visits per state, by using the nrow function to obtain a count of records in the data source and then dividing it by the number of states:

# --- calculate the average
averageVisits<-nrow(datas.df)/50

Going a bit further with this line of thinking, you might consider that the nine states the U.S. Census Bureau designates as the Northeast region are Connecticut, Maine, Massachusetts, New Hampshire, New York, New Jersey, Pennsylvania, Rhode Island, and Vermont. What is the total number of hospital visits recorded in our file for the Northeast region? R makes it simple with the subset function:

# --- use subset function and the "OR" operator to only have
# --- northeast region states in our list
NERVisits<-subset(tmpRTable, as.character(V9)=="Connecticut" | as.character(V9)=="Maine" | as.character(V9)=="Massachusetts" | as.character(V9)=="New Hampshire" | as.character(V9)=="New York" | as.character(V9)=="New Jersey" | as.character(V9)=="Pennsylvania" | as.character(V9)=="Rhode Island" | as.character(V9)=="Vermont")

Extending our scripting, we can add some additional queries to calculate the average number of hospital visits for the Northeast region and for the total country:

AvgNERVisits<-nrow(NERVisits)/9
averageVisits<-nrow(tmpRTable)/50

And let's add a visualization:

# --- the c object is the data for the barplot function to graph
c<-c(AvgNERVisits, averageVisits)
# --- use R barplot
barplot(c, ylim=c(0,3000), ylab="Average Visits", border="Black", names.arg = c("Northeast","all"))
title("Northeast Region vs Country")

The generated visualization is shown below:

Contrasts

The examination of contrasting data is another form of extending data profiling. For example, using this article's data, one could contrast the average body weight of patients that are under a doctor's care against the average body weight of patients that are not under a doctor's care (after calculating average body weights for each group).
To accomplish this, we can calculate the average weights for patients that fall into each category (those currently under a doctor's care and those not currently under a doctor's care), as well as for all patients, using the following R script:

# --- read in our entire file
tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt",sep=",")
# --- use the subset function to create the 2 groups we are
# --- interested in
UCare.sub<-subset(tmpRTable, V20=="Yes")
NUCare.sub<-subset(tmpRTable, V20=="No")
# --- use the mean function to get the average body weight of all patients
# --- in the file as well as for each of our separate groups
average_undercare<-mean(as.numeric(as.character(UCare.sub[,5])))
average_notundercare<-mean(as.numeric(as.character(NUCare.sub[,5])))
averageoverall<-mean(as.numeric(as.character(tmpRTable[2:nrow(tmpRTable),5])))
average_undercare;average_notundercare;averageoverall

In "short order", we can use R's ability to create subsets (using the subset function) of the data based upon values in a certain field (or column), then use the mean function to calculate the average patient weight for each group. The results from running the script (the calculated average weights) are shown below:

And if we use the calculated results to create a simple visualization:

# --- use R barplot to create the bar graph of
# --- average patient weight
c<-c(average_undercare, average_notundercare, averageoverall)
barplot(c, ylim=c(0,200), ylab="Patient Weight", border="Black", names.arg = c("under care","not under care", "all"), legend.text= c(format(c[1],digits=5),format(c[2],digits=5),format(c[3],digits=5)))
title("Average Patient Weight")

Tendencies

Identifying tendencies present within your data is also an interesting way of extending data profiling. For example, using this article's sample data, you might determine the number of servings of water consumed per week by each patient age group. Earlier in this section we created a simple R script to count visits by age group; it worked, but in a big data scenario this may not work. A better approach would be to categorize the data into the age groups (age is the fourth field or column in the file) using the following script:

# --- build subsets of each age group
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<65)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

After we have our grouped data, we can calculate water consumption. For example, to count the total weekly servings of water (which is in field or column 96) for age group 1 we can use:

# --- field 96 in the file is the number of servings of water
# --- below line counts the total number of water servings for
# --- age group 1
sum(as.numeric(agegroup1[,96]))

Or the average number of servings of water for the same age group:

mean(as.numeric(agegroup1[,96]))

Take note that R requires the explicit conversion of the value of field 96 (even though it comes in the file as a number) to a number using the R function as.numeric. Now, let's create the visualization of this perspective of our data.
Below is the R script used to generate the visualization:

# --- group the data into age groups
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
# --- calculate the averages by group
g1<-mean(as.numeric(agegroup1[,96]))
g2<-mean(as.numeric(agegroup2[,96]))
g3<-mean(as.numeric(agegroup3[,96]))
g4<-mean(as.numeric(agegroup4[,96]))
g5<-mean(as.numeric(agegroup5[,96]))
g6<-mean(as.numeric(agegroup6[,96]))
# --- create the visualization
barplot(c(g1,g2,g3,g4,g5,g6), axisnames=TRUE, names.arg = c("<21", "22-34", "35-44", "45-54", "55-64", ">65"))
title("Glasses of Water by Age Group")

The generated visualization is shown below:

Dispersion

Finally, dispersion is yet another method of extending data profiling. Dispersion measures how the selected elements behave with regard to some central tendency, usually the mean. For example, we might look at the total number of hospital visits for each age group, per calendar month, relative to the average number of hospital visits per month. For this example, we can use the R function subset in our scripts (to define the age groups and then group the hospital records by those age groups) as we did in the last example. Below is the script, showing the calculation for each group:

agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

Remember, the previous scripts create subsets of the entire file (which we loaded into the object tmpRTable) and they contain all of the fields of the entire file. The agegroup1 group is partially displayed as follows:

Once we have our data categorized by age group (agegroup1 through agegroup6), we can then calculate a count of hospital stays by month for each group (shown in the following R commands). Note that the substr function is used to look at the month code (the first three characters of the record date) in the file, since we (for now) don't care about the year. The table function can then be used to create an array of counts by month.

az1<-table(substr(agegroup1[,2],1,3))
az2<-table(substr(agegroup2[,2],1,3))
az3<-table(substr(agegroup3[,2],1,3))
az4<-table(substr(agegroup4[,2],1,3))
az5<-table(substr(agegroup5[,2],1,3))
az6<-table(substr(agegroup6[,2],1,3))

Using the above month totals, we can then calculate the average number of hospital visits for each month using the R function mean; a compact way to do this for all twelve months at once is sketched below, followed by the explicit month-by-month version.
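Before writing each month out by hand, note that the twelve averages can also be computed in one pass. The following is a hedged sketch (it is not from the original article); it assumes the az1 through az6 tables built above and that every month abbreviation is present in each table:

# --- month abbreviations as they appear in the record dates
months <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
# --- collect the six age-group count tables in a list
groupCounts <- list(az1, az2, az3, az4, az5, az6)
# --- for each month, average the six per-group counts
monthlyAvg <- sapply(months, function(m) mean(sapply(groupCounts, function(g) as.numeric(g[m]))))
monthlyAvg

The same result is obtained by the explicit month-by-month code that follows.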
Written out explicitly, this is the mean of the monthly totals across ALL age groups; for January:

# --- wrap the six counts in c() so that mean() averages all of them
JanAvg<-mean(c(az1["Jan"], az2["Jan"], az3["Jan"], az4["Jan"], az5["Jan"], az6["Jan"]))

The same pattern can be repeated to calculate an average for each of the other months. Next we can calculate the totals for each month, for each age group:

Janag1<-az1["Jan"];Febag1<-az1["Feb"];Marag1<-az1["Mar"];Aprag1<-az1["Apr"];Mayag1<-az1["May"];Junag1<-az1["Jun"]
Julag1<-az1["Jul"];Augag1<-az1["Aug"];Sepag1<-az1["Sep"];Octag1<-az1["Oct"];Novag1<-az1["Nov"];Decag1<-az1["Dec"]

The following code "stacks" the totals so we can more easily visualize them later; we would have one line for each age group (that is, Group1Visits, Group2Visits, and so on):

Monthly_Visits<-c(JanAvg, FebAvg, MarAvg, AprAvg, MayAvg, JunAvg, JulAvg, AugAvg, SepAvg, OctAvg, NovAvg, DecAvg)
Group1Visits<-c(Janag1,Febag1,Marag1,Aprag1,Mayag1,Junag1,Julag1,Augag1,Sepag1,Octag1,Novag1,Decag1)
Group2Visits<-c(Janag2,Febag2,Marag2,Aprag2,Mayag2,Junag2,Julag2,Augag2,Sepag2,Octag2,Novag2,Decag2)

Finally, we can now create the visualization:

plot(Monthly_Visits, ylim=c(1000,4000))
lines(Group1Visits, type="b", col="red")
lines(Group2Visits, type="b", col="purple")
lines(Group3Visits, type="b", col="green")
lines(Group4Visits, type="b", col="yellow")
lines(Group5Visits, type="b", col="pink")
lines(Group6Visits, type="b", col="blue")
title("Hospital Visits", sub = "Month to Month", cex.main = 2, font.main= 4, col.main= "blue", cex.sub = 0.75, font.sub = 3, col.sub = "red")

and enjoy the generated output:

Summary

In this article we went over the idea and importance of establishing context, and of identifying perspectives on big data, using data profiling with R. Additionally, we introduced and explored the R programming language as an effective means to profile big data and used R in numerous illustrative examples. Once again, R is an extremely flexible and powerful tool that works well for data profiling, and the reader would be well served by researching and experimenting with the language's vast libraries available today, as we have only scratched the surface of the features currently available.

Resources for Article:
Further resources on this subject:
Introduction to R Programming Language and Statistical Environment [article]
Fast Data Manipulation with R [article]
DevOps Tools and Technologies [article]

Hierarchical Clustering

Packt
07 Feb 2017
6 min read
In this article by Atul Tripathi, author of the book Machine Learning Cookbook, we will cover hierarchical clustering with a World Bank sample dataset. (For more resources related to this topic, see here.)

Introduction

Hierarchical clustering is one of the most important methods in unsupervised learning. In hierarchical clustering, for a given set of data points, the output is produced in the form of a binary tree (dendrogram). In the binary tree, the leaves represent the data points while the internal nodes represent nested clusters of various sizes. Initially, each object is assigned to its own cluster. All the clusters are then evaluated against one another using a pairwise distance matrix constructed from the distance values. The pair of clusters with the shortest distance is identified, removed from the matrix, and merged together. The distances between the newly merged cluster and the remaining clusters are then evaluated and the distance matrix is updated. The process is repeated until the distance matrix is reduced to a single element.

Hierarchical clustering - World Bank sample dataset

One of the main goals for establishing the World Bank has been to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution pursue the goal of poverty elimination. Success in eliminating poverty is measured in terms of improvement in health, education, sanitation, infrastructure, and the other services needed to improve the lives of the poor. These development gains must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall be using a dataset collected from the World Bank.

Step 1 - Collecting and describing data

The dataset titled WBClust2013 shall be used. It is available in CSV format as WBClust2013.csv, and the dataset is in standard format. There are 80 rows of data and 14 variables. The numeric variables are:
new.forest
Rural
log.CO2
log.GNI
log.Energy.2011
LifeExp
Fertility
InfMort
log.Exports
log.Imports
CellPhone
RuralWater
Pop
The non-numeric variable is: Country

How to do it

Step 2 - exploring data

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10).

Let's explore the data and understand the relationships among the variables. We'll begin by importing the CSV file named WBClust2013.csv, saving the data to the wbclust data frame:

> wbclust=read.csv("d:/WBClust2013.csv",header=T)

Next, we shall print the wbclust data frame. The head() function returns the first few rows of the wbclust data frame, which is passed as an input parameter:

> head(wbclust)

The results are as follows:

Step 3 - transforming data

Centering variables and creating z-scores are two common data analysis activities used to standardize data. The numeric variables mentioned above need to be converted to z-scores. The scale() function is a generic function whose default method centers and/or scales the columns of a numeric matrix. The wbclust data frame is passed to the scale function, with only the numeric fields considered. The result is then stored in wbnorm.

> wbnorm<-scale(wbclust[,2:13])
> wbnorm

The results are as follows:
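To see what scale() has done, it can help to compute the z-scores for a single column by hand and compare. This is a quick hedged sketch, not part of the original recipe; it assumes the wbclust and wbnorm objects created above and that the column was read in as numeric:

# --- z-scores by hand for the first numeric column (column 2 of wbclust)
x <- wbclust[, 2]
z <- (x - mean(x)) / sd(x)
# --- compare with the corresponding column produced by scale()
head(cbind(byHand = z, viaScale = wbnorm[, 1]))

The two columns should match, since scale() centers each column by its mean and divides it by its standard deviation.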
All data frames have a row names attribute. In order to retrieve or set the row or column names of a matrix-like object, the rownames() function is used. The first column of the wbclust data frame is passed to the rownames() function:

> rownames(wbnorm)=wbclust[,1]
> rownames(wbnorm)

The call to rownames(wbnorm) results in the display of the values from the first column. The results are as follows:

Step 4 - training and evaluating the model performance

The next step is about training the model. The first step is to calculate the distance matrix, for which the dist() function is used. Using the specified distance measure, distances between the rows of a data matrix are computed. The distance measure used can be Euclidean, maximum, Manhattan, Canberra, binary, or Minkowski; here the Euclidean distance is used. The Euclidean distance calculates the distance between two vectors as sqrt(sum((x_i - y_i)^2)). The result is then stored in a new object, dist1.

> dist1<-dist(wbnorm, method="euclidean")

The next step is to perform clustering using Ward's method. In order to perform cluster analysis on a set of dissimilarities of the n objects, the hclust() function is used. At the first stage, each of the objects is assigned to its own cluster. After that, at each stage the algorithm joins the two most similar clusters. This process continues until there is just a single cluster left. The hclust() function requires that we provide the data in the form of a distance matrix, so the dist1 object is passed. By default, the complete linkage method is used. There are multiple agglomeration methods that can be used, such as ward.D, ward.D2, single, complete, and average.

> clust1<-hclust(dist1,method="ward.D")
> clust1

Calling clust1 displays the agglomeration method used, the manner in which the distance was calculated, and the number of objects. The results are as follows:

Step 5 - plotting the model

The plot() function is a generic function for plotting R objects. Here, the plot() function is used to draw the dendrogram:

> plot(clust1,labels= wbclust$Country, cex=0.7, xlab="",ylab="Distance",main="Clustering for 80 Most Populous Countries")

The result is as follows:

The rect.hclust() function highlights the clusters and draws rectangles around the branches of the dendrogram. The dendrogram is first cut at a certain level, followed by drawing a rectangle around the selected branches. The clust1 object is passed to the function along with the number of clusters to be formed:

> rect.hclust(clust1,k=5)

The result is as follows:

The cutree() function cuts the tree into multiple groups on the basis of the desired number of groups or the cut height. Here, clust1 is passed to the function along with the desired number of groups:

> cuts=cutree(clust1,k=5)
> cuts

The result is as follows:

Finally, we get the list of countries in each group. The result is as follows:

Summary

In this article we covered hierarchical clustering by collecting the data, exploring its contents, and transforming it. We trained and evaluated the model using a distance matrix, and finally plotted the data as a dendrogram.

Resources for Article:
Further resources on this subject:
Supervised Machine Learning [article]
Specialized Machine Learning Topics [article]
Machine Learning Using Spark MLlib [article]

Classification using Convolutional Neural Networks

Mohammad Pezeshki
07 Feb 2017
5 min read
In this blog post, we begin with a simple classification task that the reader can readily relate to. The task is a binary classification of 25000 images of cats and dogs, divided into 20000 training, 2500 validation, and 2500 testing images. It seems reasonable to use the most promising model for object recognition, which is the convolutional neural network (CNN). As a result, we use a CNN as the baseline for the experiments, and throughout this post we will try to improve its performance using different techniques. So, in the next sections, we will first introduce the CNN and its architecture, and then we will explore techniques to boost performance and speed, namely Parametric ReLU and Batch Normalization. In this post, we will show the experimental results as we go through each technique. The complete code for the CNN is available online in the author's GitHub repository.

Convolutional Neural Networks

Convolutional neural networks can be seen as feedforward neural networks in which multiple copies of the same neuron are applied at different places. This means applying the same function to different patches of an image. Doing this means that we are explicitly imposing our knowledge about the data (images) onto the model structure. That's because we already know that natural image data is translation invariant, meaning that the probability distribution of pixels is the same across all images. This structure, which is followed by a non-linearity and a pooling and subsampling layer, makes CNNs powerful models, especially when dealing with images. Here's a graphical illustration of a CNN from Prof. Hugo Larochelle's course on Neural Networks, which is originally from Prof. Yann LeCun's paper on ConvNets. Implementing a CNN in a GPU-based framework such as Theano is also quite straightforward. So, we can create a layer like this: And then we can stack them on top of each other like this:

CNN Experiments

Armed with the CNN, we attacked the task using two baseline models: a relatively big and a relatively small model. In the figures below, you can see the number of layers, the filter sizes, pooling sizes, strides, and the number of fully connected layers. We trained both networks with a learning rate of 0.01 and a momentum of 0.9 on a GTX580 GPU. We also used early stopping. The small model can be trained in two hours and results in 81 percent accuracy on the validation set. The big model can be trained in 24 hours and results in 92 percent accuracy on the validation set.

Parametric ReLU

Parametric ReLU (aka Leaky ReLU) is an extension of the Rectified Linear Unit that allows the neuron to learn the slope of the activation function in the negative region. Unlike the actual Parametric ReLU paper by Microsoft Research, I used a different parameterization that forces the slope to be between 0 and 1 (one form consistent with this description is sketched at the end of this section). As shown in the figure below, when alpha is 0 the activation function is just linear; on the other hand, if alpha is 1 then the activation function is exactly the ReLU. Interestingly, although the number of trainable parameters is increased when using Parametric ReLU, it improves the model both in terms of accuracy and in terms of convergence speed. Using Parametric ReLU cuts the training time to about three quarters and increases the accuracy by around 1 percent. In Parametric ReLU, to make sure that alpha remains between 0 and 1, we set alpha = Sigmoid(beta) and optimize beta instead. In our experiments, we set the initial value of alpha to 0.5. After training, all alphas were between 0.5 and 0.8.
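The post does not spell the parameterization out in the text (it appears only in the figure), but one form consistent with the description above, offered here as a sketch rather than the author's exact formula, is:

f(x) = \max(0, x) + (1 - \alpha)\,\min(0, x), \qquad \alpha = \sigma(\beta) = \frac{1}{1 + e^{-\beta}}

With this form, \alpha = 0 gives the identity (linear) activation and \alpha = 1 gives the standard ReLU, and the trained values \alpha \in [0.5, 0.8] correspond to negative-region slopes 1 - \alpha between 0.2 and 0.5.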
That means that the model enjoys having a small gradient in the negative region. "Basically, even a small slope in the negative region of the activation function can help training a lot. Besides, it's important to let the model decide how much nonlinearity it needs."

Batch Normalization

Batch Normalization simply means normalizing the preactivations for each batch to have zero mean and unit variance. Based on a recent paper by Google, this normalization reduces a problem called Internal Covariate Shift and consequently makes learning much faster. The equations are as follows: Personally, I found this to be one of the most interesting and simplest techniques I have ever used. A very important point to keep in mind is to feed the whole validation set as a single batch at testing time, to have a more accurate (less biased) estimation of the mean and variance. "Batch Normalization, which means normalizing pre-activations for each batch to have zero mean and unit variance, can boost the results both in terms of accuracy and in terms of convergence speed."

Conclusion

All in all, we conclude this post with two finalized models. One of them can be trained in 10 epochs or, equivalently, 15 minutes, and can achieve 80 percent accuracy. The other model is a relatively large model. In this model, we did not use LDNN, but the other two techniques are used, and we achieved 94.5 percent accuracy.

About the Author

Mohammad Pezeshki is a PhD student in the MILA lab at the University of Montreal. He obtained his bachelor's in computer engineering from Amirkabir University of Technology (Tehran Polytechnic) in July 2014. He then obtained his Master's in June 2016. His research interests lie in the fields of Artificial Intelligence, Machine Learning, Probabilistic Models and, specifically, Deep Learning.


Building A Search Geo Locator with Elasticsearch and Spark

Packt
31 Jan 2017
12 min read
In this article, Alberto Paro, the author of the book Elasticsearch 5.x Cookbook - Third Edition, discusses how to use and manage Elasticsearch, covering topics such as installation/setup, mapping management, indices management, queries, aggregations/analytics, scripting, building custom plugins, and integration with Python, Java, Scala, and some big data tools such as Apache Spark and Apache Pig. (For more resources related to this topic, see here.)

Background

Elasticsearch is a common answer to every need for search on data, and with its aggregation framework it can provide analytics in real time. Elasticsearch was one of the first products able to bring search to the big data world. Its cloud-native design, JSON as the standard format for both data and search, and its HTTP-based approach are just the solid foundations of this product. Elasticsearch solves a growing list of search, log analysis, and analytics challenges across virtually every industry. It's used by big companies such as LinkedIn, Wikipedia, Cisco, eBay, Facebook, and many others (source: https://www.elastic.co/use-cases). In this article, we will show how to easily build a simple search geolocator with Elasticsearch, using Apache Spark for ingestion.

Objective

In this article, we will develop a search geolocator application using the world geonames database. To make this happen, the following steps will be covered:
Data collection
Optimized index creation
Ingestion via Apache Spark
Searching for a location name
Searching for a city given a location position
Executing some analytics on the dataset
All the article code is available on GitHub at https://github.com/aparo/elasticsearch-geonames-locator. All the commands below need to be executed in the code directory on Linux/MacOS X. The requirements are a local Elasticsearch server instance, a working local Spark installation, and SBT installed (http://www.scala-sbt.org/).

Data collection

To populate our application we need a database of geo locations. One of the most famous and widely used datasets is the GeoNames geographical database, which is available for download free of charge under a Creative Commons Attribution license. It contains over 10 million geographical names and consists of over 9 million unique features, of which 2.8 million are populated places, plus 5.5 million alternate names. It can be easily downloaded from http://download.geonames.org/export/dump. The dump directory provides CSV files divided by country, but in our case we'll take the dump with all the countries, the allCountries.zip file. To download the data we can use wget:

wget http://download.geonames.org/export/dump/allCountries.zip

Then we need to unzip it and put it in the downloads folder:

unzip allCountries.zip
mv allCountries.txt downloads

The Geoname dump has the following fields:
1. geonameid: Unique ID for this geoname
2. name: The name of the geoname
3. asciiname: ASCII representation of the name
4. alternatenames: Other forms of this name, generally in several languages
5. latitude: Latitude in decimal degrees of the geoname
6. longitude: Longitude in decimal degrees of the geoname
7. fclass: Feature class, see http://www.geonames.org/export/codes.html
8. fcode: Feature code, see http://www.geonames.org/export/codes.html
9. country: ISO-3166 2-letter country code
10. cc2: Alternate country codes, comma separated, ISO-3166 2-letter country codes
11. admin1: FIPS code (subject to change to ISO code)
12. admin2: Code for the second administrative division, a county in the US
13. admin3: Code for the third level administrative division
14. admin4: Code for the fourth level administrative division
15. population: The population of the geoname
16. elevation: The elevation in meters of the geoname
17. gtopo30: Digital elevation model
18. timezone: The timezone of the geoname
19. moddate: The date of last change of this geoname
Table 1: Dataset characteristics

Optimized Index creation

Elasticsearch provides automatic schema inference for your data, but the inferred schema is not the best possible. Often you need to tune it for:
Removing fields that are not required
Managing geo fields
Optimizing string fields that are indexed twice, in their tokenized and keyword versions
Given the Geoname dataset, we will add a new field, location, that is a GeoPoint we will use in geo searches. Another important optimization for indexing is defining the correct number of shards. In this case we have only 11M records, so using only 2 shards is enough. The settings for creating our optimized index with mapping and shards are the following:

{ "mappings": { "geoname": { "properties": { "admin1": { "type": "keyword", "ignore_above": 256 }, "admin2": { "type": "keyword", "ignore_above": 256 }, "admin3": { "type": "keyword", "ignore_above": 256 }, "admin4": { "type": "keyword", "ignore_above": 256 }, "alternatenames": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "asciiname": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "cc2": { "type": "keyword", "ignore_above": 256 }, "country": { "type": "keyword", "ignore_above": 256 }, "elevation": { "type": "long" }, "fclass": { "type": "keyword", "ignore_above": 256 }, "fcode": { "type": "keyword", "ignore_above": 256 }, "geonameid": { "type": "long" }, "gtopo30": { "type": "long" }, "latitude": { "type": "float" }, "location": { "type": "geo_point" }, "longitude": { "type": "float" }, "moddate": { "type": "date" }, "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "population": { "type": "long" }, "timezone": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }, "settings": { "index": { "number_of_shards": "2", "number_of_replicas": "1" } } }

We can store the above JSON in a file called settings.json and create the index via the curl command:

curl -XPUT http://localhost:9200/geonames -d @settings.json

Now our index is created and ready to receive our documents.

Ingestion via Apache Spark

Apache Spark is very handy for processing CSV files and manipulating the data before saving it to storage, whether on disk or in NoSQL. Elasticsearch provides easy integration with Apache Spark, allowing a Spark RDD to be written to Elasticsearch with a single command.
We will build a Spark job called GeonameIngester that will execute the following steps:
Initialize the Spark job
Parse the CSV
Define our required structures and conversions
Populate our classes
Write the RDD to Elasticsearch
Execute the Spark job

Initialize the Spark Job

We need to import the required classes:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark
import scala.util.Try

We define the GeonameIngester object and the SparkSession:

object GeonameIngester {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()

To easily serialize complex datatypes, we switch to the Kryo encoder:

import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) = org.apache.spark.sql.Encoders.kryo[A](ct)
import sparkSession.implicits._

Parse the CSV

For parsing the CSV, we need to define the Geoname schema to be used for reading:

val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true),
  StructField("longitude", FloatType, true),
  StructField("fclass", StringType, true),
  StructField("fcode", StringType, true),
  StructField("country", StringType, true),
  StructField("cc2", StringType, true),
  StructField("admin1", StringType, true),
  StructField("admin2", StringType, true),
  StructField("admin3", StringType, true),
  StructField("admin4", StringType, true),
  StructField("population", DoubleType, true), // Asia population overflows Integer
  StructField("elevation", IntegerType, true),
  StructField("gtopo30", IntegerType, true),
  StructField("timezone", StringType, true),
  StructField("moddate", DateType, true)))

Now we can read all the geonames from the CSV (note the tab delimiter, "\t") via:

val GEONAME_PATH = "downloads/allCountries.txt"
val geonames = sparkSession.sqlContext.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t")
  .option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()

Defining our required structures and conversions

The plain CSV data is not suitable for our advanced requirements, so we define new classes to store our Geoname data. We define a GeoPoint case class to store the geo point location of our geoname:

case class GeoPoint(lat: Double, lon: Double)

We also define our Geoname class with optional and list types:

case class Geoname(geonameid: Int, name: String, asciiname: String, alternatenames: List[String], latitude: Float, longitude: Float, location: GeoPoint, fclass: String, fcode: String, country: String, cc2: String, admin1: Option[String], admin2: Option[String], admin3: Option[String], admin4: Option[String], population: Double, elevation: Int, gtopo30: Int, timezone: String, moddate: String)

To reduce the boilerplate of the conversion, we define an implicit method that converts a String into an Option[String] depending on whether it is empty or null.
implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) return None
  val clean = value.trim
  if (clean.isEmpty) {
    None
  } else {
    Some(clean)
  }
}

During processing, in case the population value is null, we need to fix this value and set it to 0; to do this we define a fixNullInt function:

def fixNullInt(value: Any): Int = {
  if (value == null) 0
  else {
    Try(value.asInstanceOf[Int]).toOption.getOrElse(0)
  }
}

Populating our classes

We can populate the records that we need to store in Elasticsearch via a map on the geonames DataFrame:

val records = geonames.map { row =>
  val id = row.getInt(0)
  val lat = row.getFloat(4)
  val lon = row.getFloat(5)
  Geoname(id, row.getString(1), row.getString(2),
    Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
    lat, lon, GeoPoint(lat, lon),
    row.getString(6), row.getString(7), row.getString(8), row.getString(9),
    row.getString(10), row.getString(11), row.getString(12), row.getString(13),
    row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
    row.getDate(18).toString
  )
}

Writing the RDD in Elasticsearch

The final step is to store our newly built records in Elasticsearch via:

EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))

The value "geonames/geoname" is the index/type to be used to store the records in Elasticsearch. To maintain the same ID for the geonames in both the CSV and Elasticsearch, we pass an additional parameter, es.mapping.id, that refers to where to find the id to be used in Elasticsearch (geonameid in the above example).

Executing the Spark Job

To execute a Spark job, you need to build a JAR with all the required libraries and then execute it on Spark. The first step is done via the sbt assembly command, which will generate a fat JAR with only the required libraries. To submit the Spark job in the JAR, we can use the spark-submit command:

spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar

Now you need to wait (about 20 minutes on my machine) for Spark to send all the documents to Elasticsearch and for them to be indexed.

Searching for a location name

After having indexed all the geonames, you can search for them. In case we want to search for Moscow, we need a complex query because:
Cities in geonames are entities with fclass="P"
We want to skip unpopulated cities
We sort by population descending to have the most populated first
The city name can be in the name, alternatenames, or asciiname field
To achieve this kind of query in Elasticsearch, we can use a simple Boolean query with several should clauses to match the names and some filters to filter out unwanted results. We can execute it via curl:

curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{ "query": { "bool": { "minimum_should_match": 1, "should": [ { "term": { "name": "moscow"}}, { "term": { "alternatenames": "moscow"}}, { "term": { "asciiname": "moscow" }} ], "filter": [ { "term": { "fclass": "P" }}, { "range": { "population": {"gt": 0}}} ] } }, "sort": [ { "population": { "order": "desc"}}] }'
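For readers more comfortable in R (the language used in the earlier articles on this page), the same request can be issued with the httr package. This is a minimal sketch, not part of the original article; it assumes a local cluster on port 9200 and that the httr package is installed:

library(httr)

# the same bool query used in the curl example above, kept as a JSON string
queryBody <- '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow" }},
        { "term": { "alternatenames": "moscow" }},
        { "term": { "asciiname": "moscow" }}
      ],
      "filter": [
        { "term": { "fclass": "P" }},
        { "range": { "population": { "gt": 0 }}}
      ]
    }
  },
  "sort": [ { "population": { "order": "desc" }}]
}'

# POST the query and parse the JSON response into an R list
res <- POST("http://localhost:9200/geonames/geoname/_search",
            body = queryBody, content_type_json())
hits <- content(res, as = "parsed")$hits$hits
# name and population of the top hit (assumes at least one document matched)
hits[[1]]$`_source`$name
hits[[1]]$`_source`$population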
The result will be similar to this one: { "took": 14, "timed_out": false, "_shards": { "total": 2, "successful": 2, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "geonames", "_type": "geoname", "_id": "524901", "_score": null, "_source": { "name": "Moscow", "location": { "lat": 55.752220153808594, "lon": 37.61555862426758 }, "latitude": 55.75222, "population": 10381222, "moddate": "2016-04-13", "timezone": "Europe/Moscow", "alternatenames": [ "Gorad Maskva", "MOW", "Maeskuy", .... ], "country": "RU", "admin1": "48", "longitude": 37.61556, "admin3": null, "gtopo30": 144, "asciiname": "Moscow", "admin4": null, "elevation": 0, "admin2": null, "fcode": "PPLC", "fclass": "P", "geonameid": 524901, "cc2": null }, "sort": [ 10381222 ] }, Searching for cities given a location position We have processed the geoname so that in Elasticsearch, we were able to have a GeoPoint field. Elasticsearch GeoPoint field allows to enable search for a lot of geolocation queries. One of the most common search is to find cities near me via a Geo Distance Query. This can be achieved modifying the above search in curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{ "query": { "bool": { "filter": [ { "geo_distance" : { "distance" : "100km", "location" : { "lat" : 55.7522201, "lon" : 36.6155586 } } }, { "term": { "fclass": "P" }}, { "range": { "population": {"gt": 0}}} ] } }, "sort": [ { "population": { "order": "desc"}}] }' Executing an analytic on the dataset. Having indexed all the geonames, we can check the completes of our dataset and executing analytics on them. For example, it’s useful to check how many geonames there are for a single country and the feature class for every single top country to evaluate their distribution. This can be easily achieved using an Elasticsearch aggregation in a single query: curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d ' { "size": 0, "aggs": { "geoname_by_country": { "terms": { "field": "country", "size": 5 }, "aggs": { "feature_by_country": { "terms": { "field": "fclass", "size": 5 } } } } } }’ The result can be will be something similar: { "took": 477, "timed_out": false, "_shards": { "total": 2, "successful": 2, "failed": 0 }, "hits": { "total": 11301974, "max_score": 0, "hits": [ ] }, "aggregations": { "geoname_by_country": { "doc_count_error_upper_bound": 113415, "sum_other_doc_count": 6787106, "buckets": [ { "key": "US", "doc_count": 2229464, "feature_by_country": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 82076, "buckets": [ { "key": "S", "doc_count": 1140332 }, { "key": "H", "doc_count": 506875 }, { "key": "T", "doc_count": 225276 }, { "key": "P", "doc_count": 192697 }, { "key": "L", "doc_count": 79544 } ] } },…truncated… These are simple examples how to easy index and search data with Elasticsearch. Integrating Elasticsearch with Apache Spark it’s very trivial: the core of part is to design your index and your data model to efficiently use it. After having correct indexed your data to cover your use case, Elasticsearch is able to provides your result or analytics in few microseconds. Summary In this article, we learned how to easily build a simple search geolocator with Elasticsearch using Apache Spark for ingestion. Resources for Article: Further resources on this subject: Basic Operations of Elasticsearch [article] Extending ElasticSearch with Scripting [article] Integrating Elasticsearch with the Hadoop ecosystem [article]