
Understanding Streaming Applications in Spark SQL

Amarabha Banerjee
04 Dec 2017
7 min read
[Note: This article is a book excerpt from Learning Spark SQL, written by Aurobindo Sarkar. The book gives an insight into the engineering practices used to design and build real-world Spark-based applications, and its hands-on examples will give you the confidence to work on future Spark SQL projects.]

In this article, we talk about Spark SQL and its use in streaming applications.

What are streaming applications?

A streaming application is a program that continuously processes data as it arrives, rather than running once over a fixed, pre-loaded dataset. Streaming applications are getting increasingly complex, because such computations don't run in isolation. They need to interact with batch data, support interactive analysis, support sophisticated machine learning applications, and so on. Typically, such applications store incoming event stream(s) on long-term storage, continuously monitor events, and run machine learning models on the stored data, while simultaneously enabling continuous learning on the incoming stream. They also have the capability to interactively query the stored data while providing exactly-once write guarantees, handling late-arriving data, performing aggregations, and so on. These types of applications are a lot more than mere streaming applications and have, therefore, been termed continuous applications.

Spark SQL and Structured Streaming

Before Spark 2.0, streaming applications were built on the concept of DStreams. There were several pain points associated with using DStreams. In DStreams, the timestamp was the time when the event actually came into the Spark system; the time embedded in the event itself was not taken into consideration. In addition, although the same engine could process both batch and streaming computations, the APIs for RDDs (batch) and DStreams (streaming), while similar, still required the developer to make code changes. The DStream streaming model placed the burden on the developer to address various failure conditions, and it was hard to reason about data consistency issues. In Spark 2.0, Structured Streaming was introduced to deal with all of these pain points.

Structured Streaming is a fast, fault-tolerant, exactly-once stateful stream processing approach. It enables streaming analytics without having to reason about the underlying mechanics of streaming. In the new model, the input can be thought of as data from an append-only table (that grows continuously). A trigger specifies the time interval for checking the input for the arrival of new data. As shown in the following figure, the query represents the queries or the operations, such as map, filter, and reduce on the input, and result represents the final table that is updated in each trigger interval, as per the specified operation. The output defines the part of the result to be written to the data sink in each time interval. The output mode can be complete, delta, or append: the complete output mode writes the full result table every time, the delta output mode writes only the rows that changed since the previous batch, and the append output mode writes the new rows only.

In Spark 2.0, in addition to the static bounded DataFrames, we have the concept of a continuous unbounded DataFrame. Both static and continuous DataFrames use the same API, thereby unifying streaming, interactive, and batch queries.
For example, you can aggregate data in a stream and then serve it using JDBC. The high-level streaming API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame/Dataset APIs. The primary benefit is that you use the same high-level Spark DataFrame and Dataset APIs, and the Spark engine figures out the incremental and continuous execution required for the operations. Additionally, there are query management APIs that you can use to manage multiple, concurrently running streaming queries. For instance, you can list running queries, stop and restart queries, retrieve exceptions in case of failures, and so on.

In the example code below, we use two bid files from the iPinYou Dataset as the source for our streaming data. First, we define our input records schema and create a streaming input DataFrame:

scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions._
scala> import scala.concurrent.duration._
scala> import org.apache.spark.sql.streaming.ProcessingTime
scala> import org.apache.spark.sql.streaming.OutputMode.Complete
scala> val bidSchema = new StructType().add("bidid", StringType).add("timestamp", StringType).add("ipinyouid", StringType).add("useragent", StringType).add("IP", StringType).add("region", IntegerType).add("city", IntegerType).add("adexchange", StringType).add("domain", StringType).add("url", StringType).add("urlid", StringType).add("slotid", StringType).add("slotwidth", StringType).add("slotheight", StringType).add("slotvisibility", StringType).add("slotformat", StringType).add("slotprice", StringType).add("creative", StringType).add("bidprice", StringType)
scala> val streamingInputDF = spark.readStream.format("csv").schema(bidSchema).option("header", false).option("inferSchema", true).option("sep", "\t").option("maxFilesPerTrigger", 1).load("file:///Users/aurobindosarkar/Downloads/make-ipinyou-data-master/original-data/ipinyou.contest.dataset/bidfiles")

Next, we define our query with a time interval of 20 seconds and the output mode as Complete:

scala> val streamingCountsDF = streamingInputDF.groupBy($"city").count()
scala> val query = streamingCountsDF.writeStream.format("console").trigger(ProcessingTime(20.seconds)).queryName("counts").outputMode(Complete).start()

In the output, you can observe that the count of bids from each city gets updated in each time interval as new data arrives. To see the updated results, new bid files from the original Dataset need to be dropped into the bidfiles directory (or start with multiple bid files, as they will get picked up for processing one at a time based on the value of maxFilesPerTrigger).

Structured Streaming Internals

The Spark SQL engine reads the data from the input sources and incrementally executes the computation on it before writing it to the sink. In addition, any running aggregates required by your application are maintained as in-memory state backed by a Write-Ahead Log (WAL). The in-memory state data is generated and used across incremental executions. The fault tolerance requirements for such applications include the ability to recover and replay all data and metadata in the system. The planner writes offsets to a fault-tolerant WAL on persistent storage, such as HDFS, before execution, as illustrated in the figure. In case the planner fails during the current incremental execution, the restarted planner reads from the WAL and re-executes the exact range of offsets required.
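In practice, the offset and state tracking described above is enabled by giving a streaming query a checkpoint location on fault-tolerant storage. The following is a minimal PySpark sketch (not from the book) of the same city-count query with a checkpoint directory configured; the file paths are placeholders that you would replace with your own locations.

# A minimal PySpark sketch of the streaming aggregation above, with a checkpoint
# location so offsets and state are tracked in a write-ahead log on durable storage.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("bid-counts").getOrCreate()

bid_schema = (StructType()
              .add("bidid", StringType()).add("timestamp", StringType())
              .add("ipinyouid", StringType()).add("useragent", StringType())
              .add("IP", StringType()).add("region", IntegerType())
              .add("city", IntegerType()).add("adexchange", StringType())
              .add("domain", StringType()).add("url", StringType())
              .add("urlid", StringType()).add("slotid", StringType())
              .add("slotwidth", StringType()).add("slotheight", StringType())
              .add("slotvisibility", StringType()).add("slotformat", StringType())
              .add("slotprice", StringType()).add("creative", StringType())
              .add("bidprice", StringType()))

streaming_df = (spark.readStream.format("csv")
                .schema(bid_schema)
                .option("sep", "\t")
                .option("maxFilesPerTrigger", 1)
                .load("file:///path/to/bidfiles"))              # placeholder path

counts = streaming_df.groupBy("city").count()

query = (counts.writeStream
         .format("console")
         .outputMode("complete")
         .trigger(processingTime="20 seconds")
         .option("checkpointLocation", "file:///path/to/checkpoints")  # WAL + state
         .queryName("counts")
         .start())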
Typically, sources such as Kafka are also fault-tolerant and can regenerate the original transaction data, given the appropriate offsets recovered by the planner. The state data is usually maintained in a versioned, key-value map in the Spark workers and is backed by a WAL on HDFS. The planner ensures that the correct version of the state is used to re-execute the transactions after a failure. Additionally, the sinks are idempotent by design, and can handle re-executions without double commits of the output. Hence, the overall combination of offset tracking in the WAL, state management, and fault-tolerant sources and sinks provides the end-to-end exactly-once guarantees.

Summary

Spark SQL provides one of the best platforms for implementing streaming applications. Its internal architecture and fault-tolerant behavior make it a natural choice for developers who want to create data-intensive applications with data streaming capabilities. If you liked our post, please be sure to check out Learning Spark SQL, which covers more useful techniques on data extraction and data analysis using Spark SQL.

Basics of Spark SQL and its components

Amarabha Banerjee
04 Dec 2017
8 min read
[Note: The following is an excerpt from the book Learning Spark SQL by Aurobindo Sarkar. Spark SQL APIs provide an optimized interface that helps developers build distributed applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. This book provides you with an understanding of design and implementation best practices used to design and build real-world, Spark-based applications.]

In this article, we give you a perspective on Spark SQL and its components.

Introduction

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures. Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, receive the benefits of the Catalyst optimizer automatically. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.

SparkSession

SparkSession represents a unified entry point for manipulating data in Spark. It minimizes the number of different contexts a developer has to use while working with Spark. SparkSession replaces multiple context objects, such as the SparkContext, SQLContext, and HiveContext. These contexts are now encapsulated within the SparkSession object. In Spark programs, we use the builder design pattern to instantiate a SparkSession object. However, in the REPL environment (that is, in a Spark shell session), the SparkSession is automatically created and made available to you via an instance object called spark. At this time, start the Spark shell on your computer to interactively execute the code snippets in this section. As the shell starts up, you will notice a bunch of messages appearing on your screen, as shown in the following figure.

Understanding Resilient Distributed Datasets (RDDs)

An RDD is Spark's primary distributed Dataset abstraction. It is a collection of data that is immutable, distributed, lazily evaluated, type inferred, and cacheable. Prior to execution, the developer code (using higher-level constructs such as SQL, DataFrames, and the Dataset APIs) is converted to a DAG of RDDs (ready for execution). RDDs can be created by parallelizing an existing collection of data or accessing a Dataset residing in an external storage system, such as the file system or various Hadoop-based data sources. The parallelized collections form a distributed Dataset that enables parallel operations on them.

An RDD can be created from an input file with the number of partitions specified, as shown:

scala> val cancerRDD = sc.textFile("file:///Users/aurobindosarkar/Downloads/breast-cancer-wisconsin.data", 4)
scala> cancerRDD.partitions.size
res37: Int = 4

An RDD can be converted internally to a DataFrame by importing the spark.implicits package and using the toDF() method:

scala> import spark.implicits._
scala> val cancerDF = cancerRDD.toDF()

To create a DataFrame with a specific schema, we define a Row object for the rows contained in the DataFrame.
Additionally, we split the comma-separated data, convert it to a list of fields, and then map it to the Row object. Finally, we use createDataFrame() to create the DataFrame with the specified schema:

def row(line: List[String]): Row = { Row(line(0).toLong, line(1).toInt, line(2).toInt, line(3).toInt, line(4).toInt, line(5).toInt, line(6).toInt, line(7).toInt, line(8).toInt, line(9).toInt, line(10).toInt) }
val data = cancerRDD.map(_.split(",").to[List]).map(row)
val cancerDF = spark.createDataFrame(data, recordSchema)

Further, we can easily convert the preceding DataFrame to a Dataset using the case class defined earlier:

scala> val cancerDS = cancerDF.as[CancerClass]

RDD data is logically divided into a set of partitions; additionally, all input, intermediate, and output data is also represented as partitions. The number of RDD partitions defines the level of data fragmentation. These partitions are also the basic units of parallelism. Spark execution jobs are split into multiple stages, and as each stage operates on one partition at a time, it is very important to tune the number of partitions. Fewer partitions than active stages means your cluster could be under-utilized, while an excessive number of partitions could impact performance due to higher disk and network I/O.

Understanding DataFrames and Datasets

A DataFrame is similar to a table in a relational database, a pandas dataframe, or a dataframe in R. It is a distributed collection of rows that is organized into columns. It uses the immutable, in-memory, resilient, distributed, and parallel capabilities of RDDs, and applies a schema to the data. DataFrames are also evaluated lazily. Additionally, they provide a domain-specific language (DSL) for distributed data manipulation. Conceptually, the DataFrame is an alias for a collection of generic objects Dataset[Row], where a row is a generic untyped object. This means that syntax errors for DataFrames are caught during the compile stage; however, analysis errors are detected only during runtime.

DataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, databases, or RDDs. The source data can be read from local filesystems, HDFS, Amazon S3, and RDBMSs. In addition, other popular data formats, such as CSV, JSON, Avro, Parquet, and so on, are also supported. Additionally, you can create and use custom data sources. The DataFrame API supports Scala, Java, Python, and R programming APIs. The DataFrames API is declarative, and combined with procedural Spark code, it provides a much tighter integration between relational and procedural processing in your applications. DataFrames can be manipulated using Spark's procedural API, or using relational APIs (with richer optimizations).

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work. The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan. The internal representation of the program is a query plan. The query plan describes data operations such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset.
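A quick way to see these query plans for yourself is DataFrame.explain(). The following minimal PySpark sketch (not from the book) prints the parsed logical plan, the analyzed logical plan, the optimized logical plan, and the physical plan for a simple aggregation:

# A small PySpark sketch that makes the Catalyst pipeline visible: explain(True)
# prints the parsed, analyzed, and optimized logical plans plus the physical plan.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, "a", 30.0)],
    ["id", "key", "value"])

query = (df.filter(df.value > 5)
           .groupBy("key")
           .agg({"value": "sum"}))

# Shows all plan stages produced by Catalyst before execution.
query.explain(True)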
After we have an initial version of the query plan ready, the Catalyst optimizer will apply a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution. The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.

Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output under a set of constraints on the generated rows. The Physical Plan describes the computations on Datasets with specific definitions on how to execute them (it is executable).

Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan, that is, we don't know the source of the Datasets or the columns (contained in the Dataset) at this stage, and we also don't know the types of the columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan to a resolved Logical Plan. In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the next step, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-based Optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance. All three--DataFrame, Dataset, and SQL--share the same optimization pipeline, as illustrated in the following figure.

The primary goal of this article was to give an overview of Spark SQL and help you become comfortable with the Spark environment through hands-on sessions (using public Datasets). If you liked our article, please be sure to check out Learning Spark SQL, which covers more useful techniques on data extraction and data analysis using Spark SQL.

4 popular algorithms for Distance-based outlier detection

Sugandha Lahoti
01 Dec 2017
7 min read
[Note: The article is an excerpt from our book titled Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. This book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling, and a lot more.]

The article given below is extracted from Chapter 5 of the book, Real-time Stream Machine Learning, and explains 4 popular algorithms for distance-based outlier detection.

Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article.

Inputs and outputs

Most algorithms take the following parameters as inputs:
Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns.
Sliding size s, corresponding to the number of new instances that will be added to the window as old ones are removed.
The count threshold k of instances when using nearest neighbor computation.
The distance threshold R used to define the outlier threshold in distances.
Outliers as labels or scores (based on neighbors and distance) are the outputs.

How does it work?

We present different variants of distance-based stream outlier algorithms, giving insights into what they do differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.

Exact Storm

Exact Storm stores the data in the current window w in a well-known index structure, so that the range query search, or the query to find neighbors within the distance R of a given point, is done efficiently. It also stores the k preceding and succeeding neighbors of all data points:
Expired Slide: Instances in expired slides are removed from the index structure that supports range queries, but are preserved in the preceding lists of their neighbors.
New Slide: For each data point in the new slide, a range query with radius R is executed, the results are used to update the preceding and succeeding lists for the instance, and the instance is stored in the index structure.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance with fewer than k elements in its succeeding list and non-expired preceding list combined is reported as an outlier.

Abstract-C

Abstract-C keeps an index structure similar to Exact Storm, but instead of preceding and succeeding lists for every object, it just maintains a list of counts of neighbors for the windows the instance is participating in:
Expired Slide: Instances in expired slides are removed from the index structure that supports range queries, and the first element in the list of counts, corresponding to the last window, is removed.
New Slide: For each data point in the new slide, a range query with radius R is executed and the results are used to update the count list. For existing instances, the count gets updated with the new neighbors, and the new instances are added to the index structure.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbor count less than k in the current window are considered outliers.
Direct Update of Events (DUE)

DUE keeps the index structure for efficient range queries exactly like the other algorithms, but works from a different assumption: when a slide expires, not every instance is affected in the same way. It maintains two priority queues: the unsafe inlier queue and the outlier list. The unsafe inlier queue holds instances sorted in increasing order of the smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window:
Expired Slide: Instances in expired slides are removed from the index structure that supports range queries, and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers that become outliers are removed from the priority queue and moved to the outlier list.
New Slide: For each data point in the new slide, a range query with radius R is executed, the results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is added to the unsafe inlier priority queue, or removed from the queue and added to the outlier list.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.

Micro-Clustering based Algorithm (MCOD)

Micro-clustering based outlier detection overcomes the computational issues of performing range queries for every data point. The micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to the micro-clusters become inliers. The points that are outside can be outliers or inliers and are stored in a separate list. MCOD also has a data structure similar to DUE to keep a priority queue of unsafe inliers:
Expired Slide: Instances in expired slides are removed from both the micro-clusters and the data structure holding outliers and inliers. The unsafe inlier queue is updated for expired neighbors as in the DUE algorithm. Micro-clusters are also updated for non-expired data points.
New Slide: For each data point in the new slide, the instance either becomes the center of a micro-cluster, becomes part of an existing micro-cluster, or is added to the event queue and the data structure of possible outliers. If the point is within the distance R of an existing micro-cluster, it gets assigned to that micro-cluster; otherwise, if there are k points within R, it becomes the center of a new micro-cluster; if not, it goes into the two structures of the event queue and possible outliers.
Outlier Reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with fewer than k neighboring instances is reported as an outlier.

Advantages and limitations

The advantages and limitations are as follows:
Exact Storm is demanding in storage and CPU for storing lists and retrieving neighbors. It also introduces delays; even though they are implemented in efficient data structures, range queries can be slow.
Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent are still very much dependent on the window and slide chosen.
DUE has some advantage over Exact Storm and Abstract-C, as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers), but sorting the structure impacts both CPU and memory.
MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and the removal of the pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too.

Validation and evaluation of stream-based outlier detection is still an open research area. By varying parameters such as the window size, the number of neighbors within the radius, and so on, we determine the sensitivity of the performance metrics (time to evaluate in terms of CPU time per object, number of outliers detected in the stream, TP/precision/recall/area under the PRC curve) and determine the robustness.

If you liked the above article, check out our book Mastering Java Machine Learning to explore more on advanced machine learning techniques using the best Java-based tools available.
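As a closing illustration of how the common inputs fit together, here is a brute-force Python sketch. It is not an implementation of any of the four algorithms above: it simply recomputes pairwise distances in every window (which the algorithms above are designed to avoid) and flags any point with fewer than k neighbors within distance R.

# Brute-force sliding-window outlier detection on a 1-D stream: window size w,
# slide size s, neighbor count threshold k, and distance threshold R.
import numpy as np

def sliding_window_outliers(stream, w=100, s=10, k=5, R=1.0):
    """Yield (window_start_index, outlier_indices) for each full window."""
    stream = np.asarray(stream, dtype=float)
    for start in range(0, len(stream) - w + 1, s):
        window = stream[start:start + w]
        # Pairwise distances within the window (O(w^2); the real algorithms avoid this).
        dists = np.abs(window[:, None] - window[None, :])
        neighbor_counts = (dists <= R).sum(axis=1) - 1   # exclude the point itself
        outliers = np.where(neighbor_counts < k)[0] + start
        yield start, outliers.tolist()

# Example: a noisy stream with one injected spike that should be flagged.
data = np.concatenate([np.random.normal(0, 0.5, 300), [8.0], np.random.normal(0, 0.5, 99)])
for start, outliers in sliding_window_outliers(data, w=100, s=50, k=5, R=1.0):
    print(start, outliers)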

Getting started with Machine Learning in H2O

Sugandha Lahoti
01 Dec 2017
7 min read
[Note: We present to you an excerpt from our book by Dr. Uday Kamath and Krishna Choppella titled Mastering Java Machine Learning. This book aims to give you an array of advanced techniques in Machine Learning.]

This article talks about using H2O as a Machine Learning platform for Big Data applications. H2O is a leading open source platform for Machine Learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages.

H2O architecture

The following figure gives a high-level architecture of H2O with its important components. H2O can access data from various data stores such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployment of H2O is to use one of the deployment stacks with Spark or to run it in an H2O cluster itself. The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data can be handled efficiently and achieve good performance. Important Machine Learning algorithms in supervised and unsupervised learning are implemented specially to handle horizontal scalability across multiple nodes and JVMs. H2O provides not only its own user interface, known as Flow, to manage and run modeling tasks, but also different language bindings and connector APIs for Java, R, Python, and Scala.

Most Machine Learning algorithms, optimization algorithms, and utilities use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered as a Data Frame in H2O and comprises vectors, which are the features or columns in the dataset. The rows or instances are made up of one element from each vector arranged side by side. The rows are grouped together to form a processing unit known as a Chunk. Several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in the chunks to carry out the task, and finally the results flow back in the reduce operation.

Machine learning in H2O

The following figure shows all the Machine Learning algorithms supported in H2O v3 for supervised and unsupervised learning.

Tools and usage

H2O Flow is an interactive web application that helps data scientists to perform various tasks, from importing data to running complex models, using point-and-click and wizard-based concepts. H2O is run in local mode as:

java -Xmx6g -jar h2o.jar

The default way to start Flow is to point your browser to the following URL: http://192.168.1.7:54321/. The right side of Flow captures every user action performed, under the OUTLINE tab. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below.

The figure below shows the interface for importing files from the local filesystem or HDFS, and displays detailed summary statistics as well as the next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension .hex. The summary statistics are useful in understanding the characteristics of the data, such as missing values, mean, max, min, and so on.
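The same import-and-summarize workflow is also available from the Python client instead of Flow. The following is a minimal sketch (not from the book), assuming the h2o Python package is installed and a local instance can be started; the file path and column name are placeholders.

# Connect to (or start) a local H2O instance, import a file, and inspect summary stats.
import h2o

h2o.init()  # local H2O on the default port 54321

frame = h2o.import_file("/path/to/covtype.csv")   # placeholder dataset path
frame.describe()                                  # missing values, mean, max, min, ...

# Convert a numeric column with few unique values to a categorical (enum) type.
frame["Cover_Type"] = frame["Cover_Type"].asfactor()   # placeholder column name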
Flow also has an easy way to transform features from one type to another, for example, numeric features with a few unique values to categorical/nominal types, known as enum in H2O. The actions that can be performed on the datasets are:
Visualize the data.
Split the data into different sets, such as training, validation, and testing.
Build supervised and unsupervised models.
Use the models to predict.
Download and export the files in various formats.

Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter searches for tuning the model has a checkbox grid next to it, and more than one parameter value can be used. Some basic parameters, such as training_frame, validation_frame, and response_column, are common to every supervised algorithm; others are specific to model types, such as the choice of solver for GLM, the activation function for deep learning, and so on. All such common parameters are available in the basic section.

Advanced parameters are settings that afford greater flexibility and control to the modeler if the default behavior must be overridden. Several of these parameters are also common across some algorithms--two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section) and selecting the column containing weights (if each example is weighted separately). Expert parameters define more complex elements, such as how to handle missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables.

In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, the efficient L-BFGS optimization algorithm, and stratified sampling for the cross-validation split.

The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are possible actions that can be taken, such as running the model on unseen data for prediction, downloading the model in POJO format, exporting the results, and so on. Some of the charts are algorithm-specific, like the scoring history that shows how the training loss or the objective function changes over the iterations in GLM--this gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We see the ROC curve and the Area Under Curve metric on the validation data, in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample, respectively. The figure below shows the SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM on 10-fold cross-validation on the CoverType dataset.

The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, F1 measure, MCC (Matthews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation, as well as the mean and standard deviation computed across all folds. The prediction action runs the model using unseen held-out data to estimate the out-of-sample performance. Important measures such as errors, accuracy, area under curve, ROC plots, and so on, are given as the output of predictions and can be saved or exported.
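The GLM configuration described above can also be expressed through the h2o Python estimator API. The following is a hedged sketch (not from the book) of a binomial GLM with 10-fold cross-validation, the L_BFGS solver, and stratified fold assignment; the file path and the response column named label are placeholders.

# Binomial GLM with 10-fold cross-validation via the h2o Python client.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
frame = h2o.import_file("/path/to/binary_classification.csv")  # placeholder path
frame["label"] = frame["label"].asfactor()                     # two-class response

train, valid = frame.split_frame(ratios=[0.8], seed=42)

glm = H2OGeneralizedLinearEstimator(
    family="binomial",
    solver="L_BFGS",
    nfolds=10,
    fold_assignment="Stratified",
)
glm.train(x=[c for c in frame.columns if c != "label"],
          y="label",
          training_frame=train,
          validation_frame=valid)

print(glm.auc(xval=True))   # cross-validated AUC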
H2O is a rich visualization and analysis framework that can access data from multiple data stores (HDFS, SQL, NoSQL, S3, and others) and is accessible from multiple programming environments. It also supports a large number of Machine Learning algorithms that can be run on a cluster. All these factors make it one of the major Machine Learning frameworks for Big Data. If you found this post useful, be sure to check out our book Mastering Java Machine Learning to learn more about predictive models for batch- and stream-based big data learning using the latest tools and methodologies.

10 machine learning algorithms every engineer needs to know

Aaron Lazar
30 Nov 2017
7 min read
When it comes to machine learning, it's all about the algorithms. But although machine learning algorithms are the bread and butter of a data scientist's job role, it's not always as straightforward as simply picking up an algorithm and running with it. Algorithm selection is incredibly important and often very challenging. There are always a number of things you have to take into consideration, such as:
Accuracy: While accuracy is important, it's not always necessary. In many cases, an approximation is sufficient, in which case you shouldn't insist on accuracy at the expense of processing time.
Training time: This goes hand in hand with accuracy and is not the same for all algorithms. The training time might go up if there are more parameters as well. When time is a big constraint, you should choose an algorithm wisely.
Linearity: Algorithms that assume linearity expect the data trends to follow a linear path. While this is good for some problems, for others it can result in lowered accuracy.

Once you've taken those 3 considerations on board, you can start to dig a little deeper. Kaggle did a survey in 2017 asking their readers which algorithms - or 'data science methods' more broadly - respondents were most likely to use at work. Below is a screenshot of the results. Kaggle's research offers a useful insight into the algorithms actually being used by data scientists and data analysts today. But we've brought together the types of machine learning algorithms that are most important. Every algorithm is useful in different circumstances - the skill is knowing which one to use and when.

10 machine learning algorithms

Linear regression

This is clearly one of the most interpretable ML algorithms. It requires minimal tuning and is easy to explain, which is the key reason for its popularity. It shows the relationship between two or more variables and how a change in one of the independent variables impacts the dependent variable. It is used for forecasting sales based on trends, as well as for risk assessment. Although it has a relatively low level of accuracy, the few parameters needed and short training times make it quite popular among beginners.

Logistic regression

Logistic regression is typically viewed as a special form of linear regression, where the output variable is categorical. It's generally used to predict a binary outcome, i.e., True or False, 1 or 0, Yes or No, for a set of independent variables. As you would have already guessed, this algorithm is generally used when the dependent variable is binary. Like linear regression, logistic regression has a low level of accuracy, fewer parameters, and shorter training times. It goes without saying that it's quite popular among beginners too.

Decision trees

These algorithms are mainly decision support tools that use tree-like graphs or models of decisions and possible consequences, including outcomes based on chance events, utilities, and so on. To put it in simple words, you can say decision trees are the least number of yes/no questions to be asked in order to identify the probability of making the right decision, as often as possible. They let you tackle the problem at hand in a structured, systematic way to logically deduce the outcome. Decision trees are excellent when it comes to accuracy, but their training times are a bit longer as compared to other algorithms. They also require a moderate number of parameters, making it not too complicated to arrive at a good combination.
Naive Bayes

This is a type of classification ML algorithm that's based on Bayes' theorem of probability. It is one of the most popular learning algorithms. It groups similarities together and is usually used for document classification, facial recognition software, or for predicting diseases. It generally works well when you have a medium to large dataset to train your models. These have moderate training times and make use of linearity. While this is good, the linearity assumption might also bring down accuracy for certain problems. They also do not bank on too many parameters, making it easy to arrive at a good combination, although at the cost of accuracy.

Random forest

Without a doubt, this one is a popular go-to machine learning algorithm that creates a group of decision trees with random subsets of the data. It uses the ML methods of classification and regression. It is simple to use, as just a few lines of code are enough to implement the algorithm. It is used by banks in order to predict high-risk loan applicants, or even by hospitals to predict whether a particular patient is likely to develop a chronic disease or not. With a high accuracy level and moderate training time, it is quite efficient to implement. Moreover, it has an average number of parameters.

K-Means

K-Means is a popular unsupervised algorithm that is used for cluster analysis and is an iterative and non-deterministic method. It operates on a given dataset through a predefined number of clusters. The output of a K-Means algorithm will be k clusters, with the input data partitioned among these clusters. Big names like Google use K-Means to cluster pages by similarities and discover the relevance of search results. This algorithm has a moderate training time and good accuracy. It doesn't have many parameters, meaning that it's easy to arrive at the best possible combination.

K nearest neighbors

K nearest neighbors is a very popular machine learning algorithm which can be used for both regression and classification, although it's mostly used for the latter. Although it is simple, it is extremely effective. It takes little to no time to train, although its accuracy can be heavily degraded by high-dimensional data, since there is not much of a difference between the nearest neighbor and the farthest one.

Support vector machines

SVMs are one of several examples of supervised ML algorithms dealing with classification. They can be used for either regression or classification, in situations where the training dataset teaches the algorithm about specific classes, so that it can then classify newly included data. What sets them apart from other machine learning algorithms is that they are able to separate classes quicker and with less overfitting than several other classification algorithms. A few of the biggest pain points that have been resolved using SVMs are display advertising, image-based gender detection, and image classification with large feature sets. These are moderate in their accuracy, as well as their training times, mostly because they assume a linear approximation. On the other hand, they require an average number of parameters to get the work done.

Ensemble methods

Ensemble methods are techniques that build a set of classifiers and combine their predictions to classify new data points. Bayesian averaging was originally an ensemble method, but newer algorithms include error-correcting output coding, and so on.
Although ensemble methods allow you to devise sophisticated algorithms and produce results with a high level of accuracy, they are not preferred so much in industries where interpretability of the algorithm is more important. However, with their high level of accuracy, it makes sense to use them in fields like healthcare, where even the minutest improvement can add a lot of value.

Artificial neural networks

Artificial neural networks are so named because they mimic the functioning and structure of biological neural networks. In these algorithms, information flows through the network and, depending on the input and output, the neural network changes in response. One of the most common use cases for ANNs is speech recognition, like in voice-based services. As the information fed to them grows, these algorithms improve. However, artificial neural networks are imperfect. With great power come longer training times. They also have several more parameters as compared to other algorithms. That being said, they are very flexible and customizable.

If you want to skill up in implementing Machine Learning Algorithms, you can check out the following books from Packt:
Data Science Algorithms in a Week by Dávid Natingga
Machine Learning Algorithms by Giuseppe Bonaccorso
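To get a feel for the accuracy and training-time trade-offs discussed above, here is a small scikit-learn sketch (not from the article) that trains several of these algorithms on the same dataset and compares them; the dataset and hyperparameters are illustrative choices only.

# Compare a few of the algorithms discussed above on one dataset.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(kernel="rbf"),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    print(f"{name:25s} accuracy={model.score(X_test, y_test):.3f} train_time={elapsed:.3f}s")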

Getting to know TensorFlow

Kartikey Pandey
29 Nov 2017
7 min read
[Note: The following book excerpt is from the title Machine Learning Algorithms by Giuseppe Bonaccorso. The book describes important Machine Learning algorithms commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. A few famous topics covered in the book are Linear Regression, Logistic Regression, SVM, Naive Bayes, K-Means, Random Forest, TensorFlow, and feature engineering.]

In this article, we look at understanding TensorFlow, one of the most important deep learning libraries, with contextual examples.

Brief Introduction to TensorFlow

TensorFlow is a computational framework created by Google and has become one of the most diffused deep-learning toolkits. It can work with both CPUs and GPUs and already implements most of the operations and structures required to build and train a complex model. TensorFlow can be installed as a Python package on Linux, Mac, and Windows (with or without GPU support); however, we suggest you follow the instructions provided on the website to avoid common mistakes.

The main concept behind TensorFlow is the computational graph, or a set of subsequent operations that transform an input batch into the desired output. In the following figure, there's a schematic representation of a graph: starting from the bottom, we have two input nodes (a and b), a transpose operation (that works on b), a matrix multiplication, and a mean reduction. The init block is a separate operation, which is formally part of the graph, but it's not directly connected to any other node; therefore it's autonomous (indeed, it's a global initializer).

As this is only a brief introduction, it's useful to list all of the most important strategic elements needed to work with TensorFlow, so as to be able to build a few simple examples that can show the enormous potential of this framework:
Graph: This represents the computational structure that connects a generic input batch with the output tensors through a directed network made of operations. It's defined as a tf.Graph() instance and normally used with a Python context manager.
Placeholder: This is a reference to an external variable, which must be explicitly supplied when it's requested for the output of an operation that uses it directly or indirectly. For example, a placeholder can represent a variable x, which is first transformed into its squared value and then summed to a constant value. The output is then x^2 + c, which is materialized by passing a concrete value for x. It's defined as a tf.placeholder() instance.
Variable: An internal variable used to store values which are updated by the algorithm. For example, a variable can be a vector containing the weights of a logistic regression. It's normally initialized before a training process and automatically modified by the built-in optimizers. It's defined as a tf.Variable() instance. A variable can also be used to store elements which must not be considered during training processes; in this case, it must be declared with the parameter trainable=False.
Constant: A constant value defined as a tf.constant() instance.
Operation: A mathematical operation that can work with placeholders, variables, and constants. For example, the multiplication of two matrices is an operation defined as tf.matmul(A, B). Among all operations, gradient calculation is one of the most important.
TensorFlow allows determining the gradients starting from a determined point in the computational graph, back to the origin or to another point that must logically come before it. We're going to see an example of this operation.
Session: This is a sort of wrapper-interface between TensorFlow and our working environment (for example, Python or C++). When the evaluation of a graph is needed, this macro-operation will be managed by a session, which must be fed with all placeholder values and will produce the required outputs using the requested devices. For our purposes, it's not necessary to go deeper into this concept; however, I invite the reader to retrieve further information from the website or from one of the resources listed at the end of this chapter. It's declared as an instance of tf.Session() or, as we're going to do, an instance of tf.InteractiveSession(). This type of session is particularly useful when working with notebooks or shell commands, because it places itself automatically as the default one.
Device: A physical computational device, such as a CPU or a GPU. It's declared explicitly through an instance of the class tf.device() and used with a context manager. When the architecture contains more computational devices, it's possible to split the jobs so as to parallelize many operations. If no device is specified, TensorFlow will use the default one (which is the main CPU or a suitable GPU if all the necessary components are installed).

Let's now analyze this with a simple example about computing gradients.

Computing gradients

The option to compute the gradients of all output tensors with respect to any connected input or node is one of the most interesting features of TensorFlow, because it allows us to create learning algorithms without worrying about the complexity of all the transformations. In this example, we first define a linear dataset representing the function f(x) = x in the range (-100, 100):

import numpy as np
>>> nb_points = 100
>>> X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

The corresponding plot is shown in the following figure. Now we want to use TensorFlow to compute y = x^3 together with its first and second derivatives. The first step is defining a graph:

import tensorflow as tf
>>> graph = tf.Graph()

Within the context of this graph, we can define our input placeholder and other operations:

>>> with graph.as_default():
>>>    Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
>>>    Y = tf.pow(Xt, 3.0, name='x_3')
>>>    Yd = tf.gradients(Y, Xt, name='dx')
>>>    Yd2 = tf.gradients(Yd, Xt, name='d2x')

A placeholder is generally defined with a type (first parameter), a shape, and an optional name. We've decided to use the tf.float32 type because this is the only type also supported by GPUs. Selecting shape=(None, 1) means that it's possible to use any bidimensional vector with the second dimension equal to 1. The first operation computes the third power of Xt, working on all elements. The second operation computes all the gradients of Y with respect to the input placeholder Xt. The last operation will repeat the gradient computation, but in this case, it uses Yd, which is the output of the first gradient operation.

We can now pass some concrete data to see the results. The first thing to do is create a session connected to this graph:

>>> session = tf.InteractiveSession(graph=graph)

By using this session, we can request any computation using the method run().
All the input parameters must be supplied through a feed dictionary, where the key is the placeholder, while the value is the actual array:

>>> X2, dX, d2X = session.run([Y, Yd, Yd2], feed_dict={Xt: X.reshape((nb_points*2, 1))})

We needed to reshape our array to be compliant with the placeholder. The first argument of run() is a list of tensors that we want to be computed. In this case, we need all the operation outputs. The plot of each of them is shown in the following figure: as expected, they represent, respectively, x^3, 3x^2, and 6x.

Further on in the book, we look at a slightly more complex example that implements a logistic regression algorithm. Refer to Chapter 14, Brief Introduction to Deep Learning and TensorFlow, of Machine Learning Algorithms to read the complete chapter.
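The snippets above use the TensorFlow 1.x graph-and-session API. The following is a consolidated, runnable sketch of the same gradient example; it assumes that on TensorFlow 2.x you run it through the tf.compat.v1 compatibility module with eager execution disabled.

# End-to-end version of the gradient example: y = x^3, dy/dx = 3x^2, d2y/dx2 = 6x.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

nb_points = 100
X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

graph = tf.Graph()
with graph.as_default():
    Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
    Y = tf.pow(Xt, 3.0, name='x_3')        # y = x^3
    Yd = tf.gradients(Y, Xt, name='dx')    # dy/dx = 3x^2
    Yd2 = tf.gradients(Yd, Xt, name='d2x') # d2y/dx2 = 6x

with tf.Session(graph=graph) as session:
    X2, dX, d2X = session.run([Y, Yd, Yd2],
                              feed_dict={Xt: X.reshape((nb_points * 2, 1))})

# tf.gradients returns a list, so the derivative arrays are dX[0] and d2X[0].
print(X2[:3].ravel(), dX[0][:3].ravel(), d2X[0][:3].ravel())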

Building a classification system with logistic regression in OpenCV

Savia Lobo
28 Nov 2017
7 min read
[Note: This article is an excerpt from a book by Michael Beyeler titled Machine Learning for OpenCV. The code and related files are available on GitHub.]

A famous dataset in the world of machine learning is called the Iris dataset. The Iris dataset contains measurements of 150 iris flowers from three different species: setosa, versicolor, and virginica. These measurements include the length and width of the petals, and the length and width of the sepals, all measured in centimeters.

Understanding logistic regression

Despite its name, logistic regression can actually be used as a model for classification. It uses a logistic function (or sigmoid) to convert any real-valued input x into a predicted output value ŷ that takes values between 0 and 1, as shown in the following figure:

The logistic function

Rounding ŷ to the nearest integer effectively classifies the input as belonging either to class 0 or 1. Of course, most often, our problems have more than one input or feature value, x. For example, the Iris dataset provides a total of four features. For the sake of simplicity, let's focus here on the first two features, sepal length--which we will call feature f1--and sepal width--which we will call f2. Using the tricks we learned when talking about linear regression, we know we can express the input x as a linear combination of the two features, f1 and f2. However, in contrast to linear regression, we are not done yet. From the previous section, we know that the sum of products would result in a real-valued output--but we are interested in a categorical value, zero or one. This is where the logistic function comes in: it acts as a squashing function, σ, that compresses the range of possible output values to the range [0, 1].

[Note: Because the output is always between 0 and 1, it can be interpreted as a probability. If we only have a single input variable x, the output value ŷ can be interpreted as the probability of x belonging to class 1.]

Now let's apply this knowledge to the Iris dataset!

Loading the training data

The Iris dataset is included with scikit-learn. We first load all the necessary modules, as we did in our earlier examples:

In [1]: import numpy as np
...     import cv2
...     from sklearn import datasets
...     from sklearn import model_selection
...     from sklearn import metrics
...     import matplotlib.pyplot as plt
...     %matplotlib inline
In [2]: plt.style.use('ggplot')

Then, loading the dataset is a one-liner:

In [3]: iris = datasets.load_iris()

This function returns a dictionary we call iris, which contains a bunch of different fields:

In [4]: dir(iris)
Out[4]: ['DESCR', 'data', 'feature_names', 'target', 'target_names']

Here, all the data points are contained in 'data'. There are 150 data points, each of which has four feature values:

In [5]: iris.data.shape
Out[5]: (150, 4)

These four features correspond to the sepal and petal dimensions mentioned earlier:

In [6]: iris.feature_names
Out[6]: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

For every data point, we have a class label stored in target:

In [7]: iris.target.shape
Out[7]: (150,)

We can also inspect the class labels, and find that there is a total of three classes:

In [8]: np.unique(iris.target)
Out[8]: array([0, 1, 2])

Making it a binary classification problem

For the sake of simplicity, we want to focus on a binary classification problem for now, where we only have two classes.
The easiest way to do this is to discard all data points belonging to a certain class, such as class label 2, by selecting all the rows that do not belong to class 2:

In [9]: idx = iris.target != 2
...     data = iris.data[idx].astype(np.float32)
...     target = iris.target[idx].astype(np.float32)

Inspecting the data

Before you get started with setting up a model, it is always a good idea to have a look at the data. We did this earlier for the town map example, so let's continue our streak. Using Matplotlib, we create a scatter plot where the color of each data point corresponds to the class label:

In [10]: plt.scatter(data[:, 0], data[:, 1], c=target, cmap=plt.cm.Paired, s=100)
...      plt.xlabel(iris.feature_names[0])
...      plt.ylabel(iris.feature_names[1])
Out[10]: <matplotlib.text.Text at 0x23bb5e03eb8>

To make plotting easier, we limit ourselves to the first two features (iris.feature_names[0] being the sepal length and iris.feature_names[1] being the sepal width). We can see a nice separation of classes in the following figure:

Plotting the first two features of the Iris dataset

Splitting the data into training and test sets

We learned in the previous chapter that it is essential to keep training and test data separate. We can easily split the data using one of scikit-learn's many helper functions:

In [11]: X_train, X_test, y_train, y_test = model_selection.train_test_split(
...          data, target, test_size=0.1, random_state=42
...      )

Here we want to split the data into 90 percent training data and 10 percent test data, which we specify with test_size=0.1. By inspecting the return arguments, we note that we ended up with exactly 90 training data points and 10 test data points:

In [12]: X_train.shape, y_train.shape
Out[12]: ((90, 4), (90,))
In [13]: X_test.shape, y_test.shape
Out[13]: ((10, 4), (10,))

Training the classifier

Creating a logistic regression classifier involves pretty much the same steps as setting up k-NN:

In [14]: lr = cv2.ml.LogisticRegression_create()

We then have to specify the desired training method. Here, we can choose cv2.ml.LogisticRegression_BATCH or cv2.ml.LogisticRegression_MINI_BATCH. For now, all we need to know is that we want to update the model after every data point, which can be achieved with the following code:

In [15]: lr.setTrainMethod(cv2.ml.LogisticRegression_MINI_BATCH)
...      lr.setMiniBatchSize(1)

We also want to specify the number of iterations the algorithm should run before it terminates:

In [16]: lr.setIterations(100)

We can then call the training method of the object (in the exact same way as we did earlier), which will return True upon success:

In [17]: lr.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
Out[17]: True

As we just saw, the goal of the training phase is to find a set of weights that best transform the feature values into an output label. A single data point is given by its four feature values (f0, f1, f2, f3). Since we have four features, we should also get four weights, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3, and ŷ=σ(x). However, as discussed previously, the algorithm adds an extra weight that acts as an offset or bias, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3 + w4. We can retrieve these weights as follows:

In [18]: lr.get_learnt_thetas()
Out[18]: array([[-0.04109113, -0.01968078, -0.16216497, 0.28704911, 0.11945518]], dtype=float32)

This means that the input to the logistic function is x = -0.0411 f0 - 0.0197 f1 - 0.162 f2 + 0.287 f3 + 0.119.
Then, when we feed in a new data point (f0, f1, f2, f3) that belongs to class 1, the output ŷ=σ(x) should be close to 1. But how well does that actually work? Testing the classifier Let's see for ourselves by calculating the accuracy score on the training set: In [19]: ret, y_pred = lr.predict(X_train) In [20]: metrics.accuracy_score(y_train, y_pred) Out[20]: 1.0 Perfect score! However, this only means that the model was able to perfectly memorize the training dataset. This does not mean that the model would be able to classify a new, unseen data point. For this, we need to check the test dataset: In [21]: ret, y_pred = lr.predict(X_test) ... metrics.accuracy_score(y_test, y_pred) Out[21]: 1.0 Luckily, we get another perfect score! Now we can be sure that the model we built is truly awesome. If you enjoyed building a classifier using logistic regression and would like to learn more machine learning tasks using OpenCV, be sure to check out the book, Machine Learning for OpenCV, where this section originally appears.    
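As a quick sanity check that is not part of the original excerpt, we can reproduce the classifier's decision for a single test flower by hand. The snippet below assumes the lr model and the X_test split created above are still in memory, and it follows the excerpt's reading of get_learnt_thetas(), namely that the first four values are the feature weights and the last value is the bias:
import numpy as np

thetas = lr.get_learnt_thetas().ravel()           # four feature weights followed by the bias term
x = np.dot(X_test[0], thetas[:-1]) + thetas[-1]   # linear combination of the four features plus the offset
y_hat = 1.0 / (1.0 + np.exp(-x))                  # squash with the logistic (sigmoid) function
print(y_hat, int(round(y_hat)))                   # probability of class 1 and the rounded class label
If the resulting probability rounds to the same label that lr.predict(X_test) returned for the first test row, the weights are being interpreted correctly.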

How to build a Scatterplot in IBM SPSS

Kartikey Pandey
27 Nov 2017
4 min read
[box type="note" align="" class="" width=""] ----The following excerpt is from the title Data Analysis with IBM SPSS Statistics, Chapter 5, written by Kenneth Stehlik-Barry and Anthony J. Babinec. Analytical tools such as SPSS can readily provide even a novice user with an overwhelming amount of information and a broad range of options for analyzing patterns in the data. [/box] In this article we help you learn the techniques of SPSS to build Scatterplot using the Chart Builder feature. One of the most valuable methods for examining the relationship between two variables containing scale-level data is a scatterplot. In the previous chapter, scatterplots were used to detect points that deviated from the typical pattern--multivariate outliers. To produce a similar scatterplot using two fields from the 2016 General Social Survey data, navigate to Graphs | Chart Builder. An information box is displayed indicating that each field's measurement properties will be used to identify the types of graphs available so adjusting these properties is advisable. In this example, the properties will be modified as a part of the graph specification process but you may want to alter the properties of some variables permanently so that they don't need to be changed for each use. For now, just select OK to move ahead. In the main Chart Builder window, select Scatter/Dot from the menu at the lower left, double-click on the first graph to the right (Simple Scatter) to place it in the preview pane at the upper right, and then right-click on the first field labeled HIGHEST YEAR OF SCHOOL. Change this variable from Nominal to Scale, as shown in the following Screenshot: After changing the respondent's education to Scale, drag this field to the X-Axis location in the preview pane and drag spouse's education to the Y-Axis location. Once both elements are in place, the OK choice will become available. Select it to produce the scatterplot in the following screenshot: The scatterplot produced by default provides some sense of the trend in that the denser circles are concentrated in a band from the lower left to the upper right. This pattern, however, is rather subtle visually. With some editing, the relationship can be made more Evident. Double-click on the graph to open the Chart Editor and select the X icon at the top and change the major increment to 4 so that there are numbers corresponding to completing high school and college. Do the same for the y-axis values. Select a point on the graph to highlight all the "dots" and right-click to display the following dialog. Click on the Marker tab and change the symbol to the star shape, increase the size to 6, increase the border to 2, and change the border color to a dark blue. Use Apply to make the changes visible on the scatterplot: Use the Add Fit line at Total icon above the graph to show the regression line for this data. Drag the R2 box from the upper right to the bottom, below the graph and drag the box on the graphs with the equation displayed to the lower left away from the points: The modifications to the original scatterplot make it easier to see the pattern since the “stars” near the line are darker and denser than those farther from the line indicating fewer cases are associated with those points. The SPSS capabilities with respect to scatterplot in this article will give you a foundation to create a visual representation of data for both deeper pattern discovery and to communicate results to a broader audience. 
Several other graph types such as pie charts and multiple line charts  can be built and edited using the approach shown in Chapter 5, Visually Exploring the Data from our title Data Analysis with IBM SPSS Statistics. Go on, explore these alternative graph styles to see when they may be better suited to your needs.  

Highest Paying Data Science Jobs in 2017

Amey Varangaonkar
27 Nov 2017
10 min read
It is no secret that this is the age of data. More data has been created in the last 2 years than ever before. Within the dumps of data created every second, businesses are looking for useful, action worthy insights which they can use to enhance their processes and thereby increase their revenue and profitability. As a result, the demand for data professionals, who sift through terabytes of data for accurate analysis and extract valuable business insights from it, is now higher than ever before. Think of Data Science as a large tree from which all things related to data branch out - from plain old data management and analysis to Big Data, and more.  Even the recently booming trends in Artificial Intelligence such as machine learning and deep learning are applied in many ways within data science. Data science continues to be a lucrative and growing job market in the recent years, as evidenced by the graph below: Source: Indeed.com In this article, we look at some of the high paying, high trending job roles in the data science domain that you should definitely look out for if you’re considering data science as a serious career opportunity. Let’s get started with the obvious and the most popular role. Data Scientist Dubbed as the sexiest job of the 21st century, data scientists utilize their knowledge of statistics and programming to turn raw data into actionable insights. From identifying the right dataset to cleaning and readying the data for analysis, to gleaning insights from said analysis, data scientists communicate the results of their findings to the decision makers. They also act as advisors to executives and managers by explaining how the data affects a particular product or process within the business so that appropriate actions can be taken by them. Per Salary.com, the median annual salary for the role of a data scientist today is $122,258, with a range between $106,529 to $137,037. The salary is also accompanied by a whole host of benefits and perks which vary from one organization to the other, making this job one of the best and the most in-demand, in the job market today. This is a clear testament to the fact that an increasing number of businesses are now taking the value of data seriously, and want the best talent to help them extract that value. There are over 20,000 jobs listed for the role of data scientist, and the demand is only growing. Source: Indeed.com To become a data scientist, you require a bachelor’s or a master’s degree in mathematics or statistics and work experience of more than 5 years in a related field. You will need to possess a unique combination of technical and analytical skills to understand the problem statement and propose the best solution, good programming skills to develop effective data models, and visualization skills to communicate your findings with the decision makers. Interesting in becoming a data scientist? Here are some resources to help you get started: Principles of Data Science Data Science Algorithms in a Week Getting Started with R for Data Science [Video] For a more comprehensive learning experience, check out our skill plan for becoming a data scientist on Mapt, our premium skills development platform. Data Analyst Probably a term you are quite familiar with, Data Analysts are responsible for crunching large amounts of data and analyze it to come to appropriate logical conclusions. 
Whether it’s related to pure research or working with domain-specific data, a data analyst’s job is to help the decision-makers’ job easier by giving them useful insights. Effective data management, analyzing data, and reporting results are some of the common tasks associated with this role. How is this role different than a data scientist, you might ask. While data scientists specialize in maths, statistics and predictive analytics for better decision making, data analysts specialize in the tools and components of data architecture for better analysis. Per Salary.com, the median annual salary for an entry-level data analyst is $55,804, and the range usually falls between $50,063 to $63,364 excluding bonuses and benefits. For more experienced data analysts, this figure rises to around a mean annual salary of $88,532. With over 83,000 jobs listed on Indeed.com, this is one of the most popular job roles in the data science community today. This profile requires a pretty low starting point, and is justified by the low starting salary packages. As you gain more experience, you can move up the ladder and look at becoming a data scientist or a data engineer. Source: Indeed.com You may also come across terms such as business data analyst or simply business analyst which are sometimes interchangeably used with the role of a data analyst. While their primary responsibilities are centered around data crunching, business analysts model company infrastructure, while data analysts model business data structures. You can find more information related to the differences in this interesting article. If becoming a data analyst is something that interests you, here are some very good starting points: Data Analysis with R Python Data Analysis, Second Edition Learning Python Data Analysis [Video] Data Architect Data architects are responsible for creating a solid data management blueprint for an organization. They are primarily responsible for designing the data architecture and defining how data is stored, consumed and managed by different applications and departments within the organization. Because of these critical responsibilities, a data architect’s job is a very well-paid one. Per Salary.com, the median annual salary for an entry-level data architect is $74,809, with a range between $57,964 to $91,685. For senior-level data architects, the median annual salary rises up to $136,856, with a range usually between $121,969 to $159,212. These high figures are justified by the critical nature of the role of a data architect - planning and designing the right data infrastructure after understanding the business considerations to get the most value out of the data. At present, there are over 23,000 jobs for the role listed on Indeed.com, with a stable trend in job seeker interest, as shown: Source: Indeed.com To become a data architect, you need a bachelor’s degree in computer science, mathematics, statistics or a related field, and loads of real-world skills to qualify for even the entry-level positions. Technical skills such as statistical modeling, knowledge of languages such as Python and R, database architectures, Hadoop-based skills, knowledge of NoSQL databases, and some machine learning and data mining are required to become a data architect. You also need strong collaborative skills, problem-solving, creativity and the ability to think on your feet, to solve the trickiest of problems on the go. Suffice to say it’s not an easy job, but it is definitely a lucrative one! 
Get ahead of the curve, and start your journey to becoming a data architect now: Big Data Analytics Hadoop Blueprints PostgreSQL for Data Architects Data Engineer Data engineers or Big Data engineers are a crucial part of the organizational workforce and work in tandem with data architects and data scientists to ensure appropriate data management systems are deployed and the right kind of data is being used for analysis. They deal with messy, unstructured Big Data and strive to provide clean, usable data to the other teams within the organization. They build high-performance analytics pipelines and develop set of processes for efficient data mining. In many companies, the role of a data engineer is closely associated with that of a data architect. While an architect is responsible for the planning and designing stages of the data infrastructure project, a data engineer looks after the construction, testing and maintenance of the infrastructure. As such data engineers tend to have a more in-depth understanding of different data tools and languages than data architects. There are over 90,000 jobs listed on Indeed.com, suggesting there is a very high demand in the organizations for this kind of a role. An entry level data engineer has a median annual salary of $90,083 per Payscale.com, with a range of $60,857 to $131,851. For Senior Data Engineers, the average salary shoots up to $123,749 as per Glassdoor estimates. Source: Indeed.com With the unimaginable rise in the sheer volume of data, the onus is on the data engineers to build the right systems that empower the data analysts and data scientists to sift through the messy data and derive actionable insights from it. If becoming a data engineer is something that interests you, here are some of our products you might want to look at: Real-Time Big Data Analytics Apache Spark 2.x Cookbook Big Data Visualization You can also check out our detailed skill plan on becoming a Big Data Engineer on Mapt. Chief Data Officer There is a countless number of organizations that build their businesses on data, but don’t manage it that well. This is where a senior executive popularly known as the Chief Data Officer (CDO) comes into play - bearing the responsibility for implementing the organization’s data and information governance and assisting with data-driven business strategies. They are primarily responsible for ensuring that their organization gets the most value out of their data and put appropriate plans in place for effective data quality and its life-cycle management. The role of a CDO is one of the most lucrative and highest paying jobs in the data science frontier. An average median annual pay for a CDO per Payscale.com is around $192,812. Indeed.com lists just over 8000 job postings too - this is not a very large number, but understandable considering the recent emergence of the role and because it’s a high profile, C-suite job. Source: Indeed.com According to a Gartner research, almost 50% companies in a variety of regulated industries will have a CDO in place, by 2017. Considering the demand for the role and the fact that it is only going to rise in the future, the role of a CDO is one worth vying for.   To become a CDO, you will obviously need a solid understanding of statistical, mathematical and analytical concepts. Not just that, extensive and high-value experience in managing technical teams and information management solutions is also a prerequisite. 
Along with a thorough understanding of the various Big Data tools and technologies, you will need strong communication skills and deep understanding of the business. If you’re planning to know more about how you can become a Chief Data Officer, you can browse through our piece on the role of CDO. Why demand for data science professionals will rise It’s hard to imagine an organization which doesn’t have to deal with data, but it’s harder to imagine the state of an organization with petabytes of data and not knowing what to do with it. With the vast amounts of data, organizations deal with these days, the need for experts who know how to handle the data and derive relevant and timely insights from it is higher than ever. In fact, IBM predicts there’s going to be a severe shortage of data science professionals, and thereby, a tremendous growth in terms of job offers and advertised openings, by 2020. Not everyone is equipped with the technical skills and know-how associated with tasks such as data mining, machine learning and more. This is slowly creating a massive void in terms of talent that organizations are looking to fill quickly, by offering lucrative salaries and added benefits. Without the professional expertise to turn data into actionable insights, Big Data becomes all but useless.      

A mid-autumn Shopper’s dream - What an Amazon fulfilled Thanksgiving would look like

Aaron Lazar
24 Nov 2017
10 min read
I’d been preparing for Thanksgiving a good 3 weeks in advance. One reason is that I’d recently rented out a new apartment and the troops were heading over to my place this year. I obviously had to make sure everything went well and for that, trust me, there was no resting even for a minute! Thanksgiving is really about being thankful for the people and things in your life and spending quality time with family. This Thanksgiving I’m especially grateful to Amazon for making it the best experience ever! Read on to find out how Amazon made things awesome! Good times started two weeks ago when I was at the AmazonGo store with my friend, Sue. [embed]https://www.youtube.com/watch?v=NrmMk1Myrxc[/embed] In fact, this was the first time I had set foot in one of the stores. I wanted to see what was so cool about them and why everyone had been talking about them for so long! The store was pretty big and lived up to the A to Z concept, as far as I could see. The only odd thing was that I didn’t notice any queues or a billing counter. Sue glided around the floor with ease, as if she did this every day. I was more interested in seeing what was so special about this place. After she got her stuff, she headed straight for the door. I smiled to myself thinking how absent minded she was. So I called her back and reminded her “You haven’t gotten your products billed.” She smiled back at me and shrugged, “I don’t need to.” Before I could open my mouth to tell her off for stealing, she explained to me about the store. It’s something totally futuristic! Have you ever imagined not having to stand in a line to buy groceries? At the store, you just had to log in to your AmazonGo app on your phone, enter the store, grab your stuff and then leave. The sensors installed everywhere in the store automatically detected what you’d picked up and would bill you accordingly. They also used Computer Vision and Deep Learning to track people and their shopping carts. Now that’s something! And you even got a receipt! Well, it was my birthday last week and knowing what an avid reader I was, my colleagues from office gifted me a brand new Kindle. I loved every bit of it, but the best part was the X-ray feature. With X-ray, you could simply get information about a character, person or terms in a book. You could also scroll through the long lists of excerpts and click on one to go directly to that particular portion of the book! That’s really amazing, especially if you want to read a particular part of the book quickly. It came in use at the right time - I downloaded a load of recipe books for the turkey. Another feather in the cap for Amazon! Talking about feathers in one’s cap, you won’t believe it, but Amazon actually got me rekognised at work a few days ago. Nah, that wasn’t a typo. I worked as a software developer/ML engineer in a startup and I’d been doing this for as long as I can remember. I recently built this cool mobile application that recognized faces and unlocked your phone even when you didn’t have something like Face ID on your phone and the app had gotten us a million downloads in a month! It could also recognize and give you information about the ethnicity of a person if you captured their photograph with the phone’s camera. The trick was that I’d used the AmazonRekognition APIs for enhanced face detection in the application. Rekognition allows you to detect objects, scenes, text, and faces, using highly scalable, deep learning models. I also enhanced the application using the Polly API. 
Polly converts text to whichever language you want the speech in and gives you the synthesized speech in the form of audio files.The app I built now converted input text into 18 different languages, helping one converse with the person in front of them in that particular language, should they have a problem doing it in English. I got that long awaited promotion right after! Ever wondered how I got the new apartment? ;) Since the folks were coming over to my place in a few days, I thought I’d get a new dinner set. You’d probably think I would need to sit down at my PC or probably pick up my phone to search for a set online, but I had better things to do. Thanks to Alexa, I simply needed to ask her to find one for me and she did it brilliantly. Now Alexa isn’t my girlfriend, although I would have loved that to be. Alexa is actually Amazon’s cloud-based voice service that provides customers with an engaging way of interacting with technology. Alexa is blessed with finely tuned ASR or Automatic Speech Recognition and NLU or Natural Language Understanding engines, that instantly recognize and respond to voice requests. I selected a pretty looking set and instantly bought it through my Prime account. With technology like this at my fingertips, the developer in me had left no time in exploring possibilities with Alexa. That’s when I found out about Lex, built on the same deep learning platform that Alexa works on, which allows developers to build conversational interfaces into their apps. With the dinner set out of the way, I sat back with my feet up on the table. I was awesome, baby! Oh crap! I forgot to buy the turkey, the potatoes, the wine and a whole load of other stuff. It was 3 AM and I started panicking. I remembered that mum always put the turkey in the fridge at least 3 days in advance. I had only 2! I didn’t even have the time to make it to the AmazonGo store. I was panicking again and called up Suzy to ask her if she could pick up the stuff for me. She sounded so calm over the phone when I narrated my horror to her. She simply told me to get the stuff from AmazonFresh. So I hastily disconnected the call and almost screamed to Alexa, “Alexa, find me a big bird!”, and before I realized what I had said, I was presented with this. [caption id="attachment_2215" align="aligncenter" width="184"] Big Bird is one of the main protagonist in Sesame Street.[/caption] So I tried again, this time specifying what I actually needed! With AmazonDash integrating with AmazonFresh, I was able to get the turkey and other groceries delivered home in no time! What a blessing, indeed! A day before Thanksgiving, I was stuck in the office, working late on a new project. We usually tinkered around with a lot of ML and AI stuff. There was this project which needed the team to come up with a really innovative algorithm to perform a deep learning task. As the project lead, I was responsible for choosing the tech stack and I’m glad a little birdie had recently told me about AWS taking in MXNet officially as a Deep Learning Framework. MXNet made it a breeze to build ML applications that train quickly and could run anywhere. Moreover, with the recent collaboration between Amazon and Microsoft, a new ML library called Gluon was born. Available in MXNet, Gluon made building ML models, even more, easier and quicker, without compromising on performance. Need I say the project was successful? I got home that evening and sat down to pick a good flick or two to download from Amazon PrimeVideo. 
There’s always someone in the family who’d suggest we all watch a movie and I had to be prepared. With that done I quickly showered and got to bed. It was going to be a long day the next day! 4 AM my alarm rang and I was up! It was Thanksgiving, and what a wonderful day it was! I quickly got ready and prepared to start cooking. I got the bird out of the freezer and started to thaw it in cold water. It was a big bird so it was going to take some time. In the meantime, I cleaned up the house and then started working on the dressing. Apples, sausages, and cranberry. Yum! As I sliced up the sausages I realized that I had misjudged the quantity. I needed to get a couple more packets immediately! I had to run to the grocery store right away or there would be a disaster! But it took me a few minutes to remember it was Thanksgiving, one of the craziest days to get out on the road. I could call the store delivery guy or probably Amazon Dash, but then that would be illogical cos he’d have to take the same congested roads to get home.  I turned to Alexa for help, “Alexa, how do I get sausages delivered home in the next 30 minutes?”. And there I got my answer - Try Amazon PrimeAir. Now I don’t know about you, but having a drone deliver a couple packs of sausages to my house, is nothing less than ecstatic! I sat it out near the window for the next 20 minutes, praying that the package wouldn’t be intercepted by some hungry birds! I couldn’t miss the sight of the pork flying towards my apartment. With the dressing and turkey baked and ready, things were shaping up much better than I had expected. The folks started rolling in by lunchtime. Mum and dad were both quite impressed with the way I had organized things. I was beaming and in my mind hi-fived Amazon for helping me make everything possible with its amazing products and services designed to delight customers. It truly lives up to its slogan: Work hard. Have fun. Make history. If you are one of those folks who do this every day, behind the scenes, by building amazing products powered by machine learning and big data to make other's lives better, I want to thank you today for all your hard work. This Thanksgiving weekend, Packt's offering an unbelievable deal - Buy any book or video for just $10 or any three for $25! I know what I have my eyes on! Python Machine Learning - Second Edition by Sebastian Raschka and Vahid Mirjalili Effective Amazon Machine Learning by Alexis Perrier OpenCV 3 - Advanced Image Detection and Reconstruction [Video] by Prof. Robert Laganiere In the end, there’s nothing better than spending quality time with your family, enjoying a sumptuous meal, watching smiles all around and just being thankful for all you have. All I could say was, this Thanksgiving was truly Amazon fulfilled! :) Happy Thanksgiving folks!    

How to create 3D Graphics and Animation in R

Savia Lobo
23 Nov 2017
4 min read
[box type="note" align="alignright" class="" width=""] The article is originally extracted from our book R Data Analysis Cookbook - Second Edition by Kuntal Ganguly. Data analytics with R has emerged to be a very important focus for organizations of all kind. The book will show how you can put your data analysis skills in R to practical use. It contains recipes catering to basic as well as advanced data analysis tasks. [/box] In this article, we have captured one aspect of using R for the creation of 3D graphics and animation. When, a two-dimensional view is not sufficient to understand and analyze the data; besides the x, y variable, an additional data dimension can be represented by a color variable. Our article gives a step by step explanation on how to plot 3D package in R is used to visualize three-dimensional graphs. Codes and data files are readily available for download towards the end of the post. Get set ready Make sure you have downloaded the code and the mtcars.csv file is located in the working directory of R. Install and load the latest version of the plot3D package: > install.packages("plot3D") > library(plot3D) How to do it... Load the mtcars dataset and preprocess it to add row names and remove the model name column: > mtcars=read.csv("mtcars.csv") > rownames(mtcars) <- mtcars$X > mtcars$X=NULL > head(mtcars) 2. Next, create a three-dimensional scatter plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, clab = c("Miles/(US) gallon")) 3. Now, add title and axis labels to the scatter plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, pch = 18, theta = 20, phi = 20, main = "Motor Trend Car Road Tests", xlab = "Weight lbs",ylab ="Displacement (cu.in.)", zlab = "Miles gallon") Rotating a 3D plot can provide a complete view of the data. Now view the plot in different directions by altering the values of two attributes--theta and phi: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg,clab = c("Cars Mileage"),theta = 15, phi = 0, bty ="g") How it works... The scatter3D() function from the plot3D package has the following parameters: x, y, z: Vectors of point coordinates colvar: A variable used for the coloring col: A color palette to color the colvar variable labels: Refers to the text to be written add: Logical; if TRUE, then the points will be added to the current plot, and if FALSE, a new plot is started pch: Shape of the points cex: Size of the points theta: The azimuthal direction phi: Co-latitude; both theta and phi can be used to define the angles for the viewing direction Bty: Refers to the type of enclosing box and can take various values such as f-full box, b-default value (back panels only), g- grey background with white grid lines, and bl- black background There's more… The following concepts are very important for this recipe. Adding text to an existing 3D plot We can use the text3D() function to add text based on the car model name, alongside the data points: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, phi = 0, bty = "g", pch = 20, cex = 0.5) > text3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, labels = rownames(mtcars), add = TRUE, colkey = FALSE, cex = 0.5) Using a 3D histogram The three-dimensional histogram function, hist3D(), has the following attributes: z: Values contained within a matrix. x, y: Vectors, where the length of x should be equal to nrow(z) and length of y should be equal to ncol(z). colvar: Variable used for the coloring and has the same dimension as z. col: Color palette used for the colvar variable. 
By default, a red-yellow-blue color scheme. add: Logical variable. If TRUE, adds surfaces to the current plot. If FALSE, starts a new plot. Let's plot the Death rate of Virgina using a 3D histogram: data(VADeaths) hist3D(z = VADeaths, scale = FALSE, expand = 0.01, bty = "g", phi = 20, col = "#0085C2", border = "black", shade = 0.2, ltheta = 80, space = 0.3, ticktype = "detailed", d = 2) Using a line graph To visualize the plot with a line graph, add the type parameter to the scatter3D() function. The type parameter can take the values l(only line), b(both line and point), and h(horizontal line and points both). The 3D plot with a horizontal line and plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg,type="h", clab = c("Miles/(US) gallon")) [box type="download" align="" class="" width=""]Download code files here. [/box] If you think the recipe is delectable, you should definitely take a look at R data Analysis Cookbook- Second edition which will enlighten you more on data visualizations with R.  

Training RNNs for Time Series Forecasting

Aaron Lazar
22 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This tutorial has been taken from the book Practical Time Series Analysis by Dr. PKS Prakash and Dr. Avishek Pal.[/box] RNNs are notoriously difficult to be trained. Vanilla RNNs suffer from vanishing and exploding gradients that give erratic results during training. As a result, RNNs have difficulty in learning long-range dependencies. For time series forecasting, going too many timesteps back in the past would be problematic. To address this problem, Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are special types of RNNs, have been introduced. We will use LSTM and GRU to develop the time series forecasting models. Before that, let's review how RNNs are trained using Backpropagation Through Time (BPTT), a variant of the backpropagation algorithm. We will find out how vanishing and exploding gradients arise during BPTT. Let's consider the computational graph of the RNN that we have been using for the time series forecasting. The gradient computation is shown in the following figure: Figure: Back-Propagation Through Time for deep recurrent neural network with p timesteps For weights U, there is one path to compute the partial derivative, which is given as follows: However, due to the sequential structure of the RNN, there are multiple paths connecting the weights and the loss and the loss. Hence, the partial derivative is the sum of partial derivatives along the individual paths that start at the loss node and ends at every timestep node in the computational graph: The technique of computing the gradients for the weights by summing over paths connecting the loss node and every timestep node is backpropagation through time, which is a special case of the original backpropagation algorithm. The problem of vanishing gradients in long-range RNNs is due to the multiplicative terms in the BPTT gradient computations. Now let's examine a multiplicative term from one of the preceding equations. The gradients along the computation path connecting the loss node and ith timestep is This chain of gradient multiplication is ominously long to model long-range dependencies and this is where the problem of vanishing gradient arises. The activation function for the internal state [0,1) is either a tanh or sigmoid. The first derivative of tanh is which is bound in (0,¼]. In the sigmoid function, the first-order derivative is which is bound in si . Hence, the gradients W  are positive fractions. For long-range timesteps, multiplying these fractional gradients diminishes the final product to zero and there is no gradient flow from a long-range timestep. Due to the negligibly low values of the gradients, the weights do not update and hence the neurons are said to be saturated. It is noteworthy that U, ∂si/∂s(i-1), and ∂si/∂W are matrices and therefore the partial derivatives t and st are computed on matrices. The final output is computed through matrix multiplications and additions. The first derivative on matrices is called Jacobian. If any one element of the Jacobian matrix is a fraction, then for a long-range RNN, we would see a vanishing gradient. On the other hand, if an element of the Jacobian is greater than one, the training process suffers from exploding gradients. Solving the long-range dependency problem We have seen in the previous section that it is difficult for vanilla RNNs to effectively learn long-range dependencies due to vanishing and exploding gradients. 
To address this issue, the Long Short Term Memory network was developed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. The Gated Recurrent Unit was introduced in 2014 and is a simpler version of LSTM. Let's review how LSTM and GRU solve the problem of learning long-range dependencies. Long Short Term Memory (LSTM) LSTM introduces additional computations in each timestep. However, it can still be treated as a black box unit that, for timestep t, returns one state (ht), and this is forwarded to the next timestep. However, internally, these vectors are computed differently. LSTM introduces three new gates: the input (it), forget (ft), and output (ot) gates. Every timestep also has an internal hidden state (gt) and an internal memory (ct). These new units are computed as follows: it = σ(Wixt + Uiht-1) ft = σ(Wfxt + Ufht-1) ot = σ(Woxt + Uoht-1) gt = tanh(Wgxt + Ught-1) ct = ct-1 ⊙ ft + gt ⊙ it ht = tanh(ct) ⊙ ot, where ⊙ denotes elementwise multiplication. Now, let's understand the computations. The gates are generated through sigmoid activations that limit their values to the range [0, 1]. Hence, these act as gates by letting out only a fraction of a value when multiplied by another variable. The input gate it controls the fraction of the newly computed input to keep. The forget gate ft determines the effect of the previous timestep, and the output gate ot controls how much of the internal state to let out. The internal hidden state gt is calculated from the input to the current timestep and the output of the previous timestep. Note that this is the same as computing the internal state in a vanilla RNN. ct is the internal memory unit of the current timestep; it combines the memory from the previous step, downsized by the forget gate ft, with the internal hidden state gt, scaled by the input gate it. Finally, ht would be passed to the next timestep and is computed from the current internal memory and the output gate ot. The input, forget, and output gates are used to selectively include the previous memory and the current hidden state that is computed in the same manner as in vanilla RNNs. The gating mechanism of LSTM allows memory transfer over long-range timesteps. Gated Recurrent Units (GRU) GRU is simpler than LSTM and has only two internal gates, namely, the update gate (zt) and the reset gate (rt). The computations of the update and reset gates are as follows: zt = σ(Wzxt + Uzst-1) rt = σ(Wrxt + Urst-1) The state st of the timestep t is computed using the input xt, the state st-1 from the previous timestep, the update gate, and the reset gate: st = (1 - zt) ⊙ tanh(Whxt + Uh(rt ⊙ st-1)) + zt ⊙ st-1 The update gate, being computed by a sigmoid function, determines how much of the previous step's memory is to be retained in the current timestep. The reset gate controls how to combine the previous memory with the current step's input. Compared to LSTM, which has three gates, GRU has two. It does not have the output gate and the internal memory, both of which are present in LSTM. The update gate in GRU determines how to combine the previous memory with the current memory, and combines the functionality achieved by the input and forget gates of LSTM. The reset gate, which combines the effect of the previous memory and the current input, is directly applied to the previous memory. Despite a few differences in how memory is transmitted along the sequence, the gating mechanisms in both LSTM and GRU are meant to learn long-range dependencies in data. So, which one do we use - LSTM or GRU? Both LSTM and GRU are capable of handling memory over long RNNs. However, a common question is which one to use?
LSTM has long been preferred as the first choice for language models, which is evident from its extensive use in language translation, text generation, and sentiment classification. GRU has the distinct advantage of fewer trainable weights as compared to LSTM. It has been applied to tasks where LSTM has previously dominated. However, empirical studies show that neither approach outperforms the other in all tasks. Tuning model hyperparameters such as the dimensionality of the hidden units improves the predictions of both. A common rule of thumb is to use GRU in cases where less training data is available, as it requires fewer trainable weights. LSTM proves to be effective for large datasets such as the ones used to develop language translation models. This article is an excerpt from the book Practical Time Series Analysis. For more techniques, grab the book now!
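To make the comparison concrete, here is a minimal sketch of fitting an LSTM forecaster on a toy series with the Keras API. It is not taken from the book; it assumes the standalone Keras package is installed, and the sine-wave data, the window length of 10, and the 32-unit layer size are illustrative choices only. Swapping the LSTM layer for a GRU layer is a one-line change, which makes it easy to benchmark the two cells on your own series:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, GRU, Dense

# Illustrative data: sliding windows of 10 past values predict the next value.
series = np.sin(np.arange(0, 100, 0.1))
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape((X.shape[0], window, 1))  # (samples, timesteps, features)

# Build a single-layer recurrent forecaster; swap LSTM(32) for GRU(32) to compare the two cells.
model = Sequential()
model.add(LSTM(32, input_shape=(window, 1)))
model.add(Dense(1))  # one-step-ahead forecast
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)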

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar
17 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week. [/box] The nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbor algorithm is the class with the highest representation among the k-closest neighbors. In this short tutorial, we will cover the basics of the k-NN algorithm - understanding it and its implementation with a simple example: Mary and her temperature preferences. So let’s get right to it, shall we? Mary and her temperature preferences As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10. Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary was asked if she felt warm or cold: We could represent the data in a graph, as follows: Now, suppose we would like to find out how Mary feels at the temperature 16 degrees Celsius with a wind speed of 3km/h using the 1-NN algorithm: For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance dMan of a neighbor N1=(x1,y1) from the neighbor N2=(x2,y2) is defined to be dMan=|x1- x2|+|y1- y2|. Let us label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify: We can see that the closest neighbor with a known class is the one with the temperature 15 (blue) degrees Celsius and the wind speed 5km/h. Its distance from the questioned point is three units. Its class is blue (cold). The closest red (warm) neighbor is away four units from the questioned point. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the questioned point should be blue (cold). By applying this procedure to every data point, we can complete the graph as follows: Note that sometimes a data point can be distanced from two known classes with the same distance: for example, 20 degrees Celsius and 6km/h. In such situations, we would prefer one class over the other or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm. Implementation of k-nearest neighbors algorithm in Python We’ll implement the k-NN algorithm in Python to find Mary's temperature preference: # source_code/1/mary_and_temperature_preferences/knn_to_data.py # Applies the knn algorithm to the input data. # The input text file is assumed to be of the format with one line per # every data entry consisting of the temperature in degrees Celsius, # wind speed and then the classification cold/warm. import sys sys.path.append('..') sys.path.append('../../common') import knn # noqa import common # noqa # Program start # E.g. "mary_and_temperature_preferences.data" input_file = sys.argv[1] # E.g. 
"mary_and_temperature_preferences_completed.data" output_file = sys.argv[2] k = int(sys.argv[3]) x_from = int(sys.argv[4]) x_to = int(sys.argv[5]) y_from = int(sys.argv[6]) y_to = int(sys.argv[7]) data = common.load_3row_data_to_dic(input_file) new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k) common.save_3row_data_from_dic(output_file, new_data) # source_code/common/common.py # ***Library with common routines and functions*** def dic_inc(dic, key): if key is None: Pass if dic.get(key, None) is None: dic[key] = 1 Else: dic[key] = dic[key] + 1 # source_code/1/knn.py # ***Library implementing knn algorihtm*** def info_reset(info): info['nbhd_count'] = 0 info['class_count'] = {} # Find the class of a neighbor with the coordinates x,y. # If the class is known count that neighbor. def info_add(info, data, x, y): group = data.get((x, y), None) common.dic_inc(info['class_count'], group) info['nbhd_count'] += int(group is not None) # Apply knn algorithm to the 2d data using the k-nearest neighbors with # the Manhattan distance. # The dictionary data comes in the form with keys being 2d coordinates # and the values being the class. # x,y are integer coordinates for the 2d data with the range # [x_from,x_to] x [y_from,y_to]. def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k): new_data = {} info = {} # Go through every point in an integer coordinate system. for y in range(y_from, y_to + 1): for x in range(x_from, x_to + 1): info_reset(info) # Count the number of neighbors for each class group for # every distance dist starting at 0 until at least k # neighbors with known classes are found. for dist in range(0, x_to - x_from + y_to - y_from): # Count all neighbors that are distanced dist from # the point [x,y]. if dist == 0: info_add(info, data, x, y) Else: for i in range(0, dist + 1): info_add(info, data, x - i, y + dist - i) info_add(info, data, x + dist - i, y - i) for i in range(1, dist): info_add(info, data, x + i, y + dist - i) info_add(info, data, x - dist + i, y - i) # There could be more than k-closest neighbors if the # distance of more of them is the same from the point # [x,y]. But immediately when we have at least k of # them, we break from the loop. if info['nbhd_count'] >= k: Break class_max_count = None # Choose the class with the highest count of the neighbors # from among the k-closest neighbors. for group, count in info['class_count'].items(): if group is not None and (class_max_count is None or count > info['class_count'][class_max_count]): class_max_count = group new_data[x, y] = class_max_count return new_data Input: The program above will use the file below as the source of the input data. The file contains the table with the known data about Mary's temperature preferences: # source_code/1/mary_and_temperature_preferences/ marry_and_temperature_preferences.data 10 0 cold 25 0 warm 15 5 cold 20 3 warm 18 7 cold 20 10 cold 22 5 warm 24 6 warm Output: We run the implementation above on the input file mary_and_temperature_preferences.data using the k-NN algorithm for k=1 neighbors. The algorithm classifies all the points with the integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), so with the a of (25+1) * (10+1) = 286 integer points (adding one to count points on boundaries). Using the wc command, we find out that the output file contains exactly 286 lines - one data item per point. Using the head command, we display the first 10 lines from the output file. 
We visualize all the data from the output file in the next section: $ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10 $ wc -l mary_and_temperature_preferences_completed.data 286 mary_and_temperature_preferences_completed.data $ head -10 mary_and_temperature_preferences_completed.data 7 3 cold 6 9 cold 12 1 cold 16 6 cold 16 9 cold 14 4 cold 13 4 cold 19 4 warm 18 4 cold 15 1 cold So, there you have it! The K Nearest Neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which the tutorial has been extracted.
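For readers who want to cross-check the hand-rolled implementation, the same 1-NN classification can be reproduced with scikit-learn. The snippet below is a sketch that is not part of the original excerpt; it hard-codes Mary's eight labelled observations from the table above and uses the Manhattan metric to match the distance measure used in the article:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Mary's known preferences: (temperature in degrees Celsius, wind speed in km/h) -> cold/warm
X = np.array([[10, 0], [25, 0], [15, 5], [20, 3],
              [18, 7], [20, 10], [22, 5], [24, 6]])
y = np.array(['cold', 'warm', 'cold', 'warm',
              'cold', 'cold', 'warm', 'warm'])

clf = KNeighborsClassifier(n_neighbors=1, metric='manhattan')
clf.fit(X, y)
print(clf.predict([[16, 3]]))  # the queried point from the example; expected to agree with the article ('cold')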

Visualizing univariate distribution in Seaborn

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example. [/box] Seaborn by Michael Waskom is a statistical visualization library that is built on top of Matplotlib. It comes with handy functions for visualizing categorical variables, univariate distributions, and bivariate distributions. In this article, we will visualize univariate distribution in Seaborn. Visualizing univariate distribution Seaborn makes the task of visualizing the distribution of a dataset much easier. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset. Let's see how it distributes among different countries in 2017 by plotting a bar plot: import seaborn as sns import matplotlib.pyplot as plt # Extract USA population data in 2017 current_population = population_df[(population_df.Location == 'United States of America') & (population_df.Time == 2017) & (population_df.Sex != 'Both')] # Population Bar chart sns.barplot(x="AgeGrp",y="Value", hue="Sex", data = current_population) # Use Matplotlib functions to label axes rotate tick labels ax = plt.gca() ax.set(xlabel="Age Group", ylabel="Population (thousands)") ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45) plt.title("Population Barchart (USA)") # Show the figure plt.show() Bar chart in Seaborn The seaborn.barplot() function shows a series of data points as rectangular bars. If multiple points per group are available, confidence intervals will be shown on top of the bars to indicate the uncertainty of the point estimates. Like most other Seaborn functions, various input data formats are supported, such as Python lists, Numpy arrays, pandas Series, and pandas DataFrame. A more traditional way to show the population structure is through the use of a population pyramid. So what is a population pyramid? As its name suggests, it is a pyramid-shaped plot that shows the age distribution of a population. It can be roughly classified into three classes, namely constrictive, stationary, and expansive for populations that are undergoing negative, stable, and rapid growth respectively. For instance, constrictive populations have a lower proportion of young people, so the pyramid base appears to be constricted. Stable populations have a more or less similar number of young and middle-aged groups. Expansive populations, on the other hand, have a large proportion of youngsters, thus resulting in pyramids with enlarged bases. 
We can build a population pyramid by plotting two bar charts on two subplots with a shared y axis: import seaborn as sns import matplotlib.pyplot as plt # Extract USA population data in 2017 current_population = population_df[(population_df.Location == 'United States of America') & (population_df.Time == 2017) & (population_df.Sex != 'Both')] # Change the age group to descending order current_population = current_population.iloc[::-1] # Create two subplots with shared y-axis fig, axes = plt.subplots(ncols=2, sharey=True) # Bar chart for male sns.barplot(x="Value",y="AgeGrp", color="darkblue", ax=axes[0], data = current_population[(current_population.Sex == 'Male')]) # Bar chart for female sns.barplot(x="Value",y="AgeGrp", color="darkred", ax=axes[1], data = current_population[(current_population.Sex == 'Female')]) # Use Matplotlib function to invert the first chart axes[0].invert_xaxis() # Use Matplotlib function to show tick labels in the middle axes[0].yaxis.tick_right() # Use Matplotlib functions to label the axes and titles axes[0].set_title("Male") axes[1].set_title("Female") axes[0].set(xlabel="Population (thousands)", ylabel="Age Group") axes[1].set(xlabel="Population (thousands)", ylabel="") fig.suptitle("Population Pyramid (USA)") # Show the figure plt.show() Since Seaborn is built on top of the solid foundations of Matplotlib, we can customize the plot easily using built-in functions of Matplotlib. In the preceding example, we used matplotlib.axes.Axes.invert_xaxis() to flip the male population plot horizontally, followed by changing the location of the tick labels to the right-hand side using matplotlib.axis.YAxis.tick_right(). We further customized the titles and axis labels for the plot using a combination of matplotlib.axes.Axes.set_title(), matplotlib.axes.Axes.set(), and matplotlib.figure.Figure.suptitle(). Let's try to plot the population pyramids for Cambodia and Japan as well by changing the line population_df.Location == 'United States of America' to population_df.Location == 'Cambodia' or  population_df.Location == 'Japan'. Can you classify the pyramids into one of the three population pyramid classes? To see how Seaborn simplifies the code for relatively complex plots, let's see how a similar plot can be achieved using vanilla Matplotlib. 
First, like the previous Seaborn-based example, we create two subplots with shared y axis: fig, axes = plt.subplots(ncols=2, sharey=True) Next, we plot horizontal bar charts using matplotlib.pyplot.barh() and set the location and labels of ticks, followed by adjusting the subplot spacing: # Get a list of tick positions according to the data bins y_pos = range(len(current_population.AgeGrp.unique())) # Horizontal barchart for male axes[0].barh(y_pos, current_population[(current_population.Sex == 'Male')].Value, color="darkblue") # Horizontal barchart for female axes[1].barh(y_pos, current_population[(current_population.Sex == 'Female')].Value, color="darkred") # Show tick for each data point, and label with the age group axes[0].set_yticks(y_pos) axes[0].set_yticklabels(current_population.AgeGrp.unique()) # Increase spacing between subplots to avoid clipping of ytick labels plt.subplots_adjust(wspace=0.3) Finally, we use the same code as before to further customize the look and feel of the figure: # Invert the first chart axes[0].invert_xaxis() # Show tick labels in the middle axes[0].yaxis.tick_right() # Label the axes and titles axes[0].set_title("Male") axes[1].set_title("Female") axes[0].set(xlabel="Population (thousands)", ylabel="Age Group") axes[1].set(xlabel="Population (thousands)", ylabel="") fig.suptitle("Population Pyramid (USA)") # Show the figure plt.show() When compared to the Seaborn-based code, the pure Matplotlib implementation requires extra lines to define the tick positions, tick labels, and subplot spacing. For some other Seaborn plot types that include extra statistical calculations such as linear regression and Pearson correlation, the code reduction is even more dramatic. Therefore, Seaborn is a "batteries-included" statistical visualization package that allows users to write less verbose code. Histogram and distribution fitting in Seaborn In the population example, the raw data was already binned into different age groups. What if the data is not binned (for example, the BigMac Index data)? It turns out that seaborn.distplot can help us process the data into bins and show us a histogram as a result. Let's look at this example: import seaborn as sns import matplotlib.pyplot as plt # Get the BigMac index in 2017 current_bigmac = bigmac_df[(bigmac_df.Date == "2017-01-31")] # Plot the histogram ax = sns.distplot(current_bigmac.dollar_price) plt.show() The seaborn.distplot function expects either a pandas Series, a single-dimensional numpy array, or a Python list as input. Then, it determines the size of the bins according to the Freedman-Diaconis rule, and finally it fits a kernel density estimate (KDE) over the histogram. KDE is a non-parametric method used to estimate the distribution of a variable. We can also supply a parametric distribution, such as beta, gamma, or normal distribution, to the fit argument. In this example, we are going to fit the normal distribution from the scipy.stats package over the Big Mac Index dataset: from scipy import stats ax = sns.distplot(current_bigmac.dollar_price, kde=False, fit=stats.norm) plt.show() You have now equipped yourself with the knowledge to visualize univariate data in Seaborn using bar charts, histograms, and distribution fitting. To have more fun visualizing data with Seaborn and Matplotlib, check out the book Matplotlib 2.x By Example, from which this snippet has been taken.
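If you also want the numeric parameters of the fitted curve rather than just the overlay, the distribution can be fitted directly with SciPy. This short sketch is not part of the original excerpt and assumes the current_bigmac DataFrame, seaborn (sns), and matplotlib.pyplot (plt) from the code above are still available:
from scipy import stats

# Fit a normal distribution to the 2017 Big Mac prices and report its parameters.
mu, sigma = stats.norm.fit(current_bigmac.dollar_price)
print("Fitted normal: mean = {:.2f}, standard deviation = {:.2f}".format(mu, sigma))

# Overlay both the KDE and the fitted normal curve on the same histogram for comparison.
ax = sns.distplot(current_bigmac.dollar_price, kde=True, fit=stats.norm)
ax.set_xlabel("Price of a Big Mac (US$)")
plt.show()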

Visualizing 3D plots in Matplotlib 2.0

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example.[/box] By transitioning to the three-dimensional space, you may enjoy greater creative freedom when creating visualizations. The extra dimension can also accommodate more information in a single plot. However, some may argue that 3D is nothing more than a visual gimmick when projected to a 2D surface (such as paper) as it would obfuscate the interpretation of data points. In Matplotlib version 2, despite significant developments in the 3D API, annoying bugs or glitches still exist. We will discuss some workarounds toward the end of this article. More powerful Python 3D visualization packages do exist (such as MayaVi2, Plotly, and VisPy), but it's good to use Matplotlib's 3D plotting functions if you want to use the same package for both 2D and 3D plots, or you would like to maintain the aesthetics of its 2D plots. For the most part, 3D plots in Matplotlib have similar structures to 2D plots. As such, we will not go through every 3D plot type in this section. We will put our focus on 3D scatter plots and bar charts. 3D scatter plot Let's try to create a 3D scatter plot. Before doing that, we need some data points in three dimensions (x, y, z): import pandas as pd source = "https://raw.githubusercontent.com/PointCloudLibrary/data/master/tutorials/ ism_train_cat.pcd" cat_df = pd.read_csv(source, skiprows=11, delimiter=" ", names=["x","y","z"], encoding='latin_1') cat_df.head() To declare a 3D plot, we first need to import the Axes3D object from the mplot3d extension in mpl_toolkits, which is responsible for rendering 3D plots in a 2D plane. After that, we need to specify projection='3d' when we create subplots: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z) plt.show() Behold, the mighty sCATter plot in 3D. Cats are currently taking over the internet. According to the New York Times, cats are "the essential building block of the Internet" (https://www.nytimes.com/2014/07/23/upshot/what-the-internet-can-see-from-your-cat-pictures.html). Undoubtedly, they deserve a place in this chapter as well. Contrary to the 2D version of scatter(), we need to provide X, Y, and Z coordinates when we are creating a 3D scatter plot. Yet the parameters that are supported in 2D scatter() can be applied to 3D scatter() as well: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Change the size, shape and color of markers ax.scatter(cat_df.x, cat_df.y, cat_df.z, s=4, c="g", marker="o") plt.show() To change the viewing angle and elevation of the 3D plot, we can make use of view_init(). The azim parameter specifies the azimuth angle in the X-Y plane, while elev specifies the elevation angle. When the azimuth angle is 0, the X-Y plane would appear to the north from you. Meanwhile, an azimuth angle of 180 would show you the south side of the X-Y plane: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z,s=4, c="g", marker="o") # elev stores the elevation angle in the z plane azim stores the # azimuth angle in the x,y plane ax.view_init(azim=180, elev=10) plt.show() 3D bar chart We introduced candlestick plots for showing Open-High-Low-Close (OHLC) financial data. In addition, a 3D bar chart can be employed to show OHLC across time. 
The next figure shows a typical example of plotting a 5-day OHLC bar chart:

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

# Get row 1 and every fifth row thereafter for the 5-day AAPL OHLC data
ohlc_5d = stock_df[stock_df["Company"]=="AAPL"].iloc[1::5, :]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8, width=5)

plt.show()

The method for setting ticks and labels is similar to other Matplotlib plotting functions:

fig = plt.figure(figsize=(9,7))
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8)

# Manually assign the ticks and tick labels
ax.set_xticks(np.arange(ohlc_5d.shape[0]))
ax.set_xticklabels(ohlc_5d["Date"], rotation=20,
                   verticalalignment='baseline',
                   horizontalalignment='right',
                   fontsize='8')
ax.set_yticks([30, 20, 10, 0])
ax.set_yticklabels(["Open", "High", "Low", "Close"])

# Set the z-axis label
ax.set_zlabel('Price (US $)')

# Rotate the viewport
ax.view_init(azim=-42, elev=31)
plt.tight_layout()
plt.show()

Caveats to consider while visualizing 3D plots in Matplotlib

Due to the lack of a true 3D graphical rendering backend (such as OpenGL) and a proper algorithm for detecting 3D objects' intersections, the 3D plotting capabilities of Matplotlib are not great, but just adequate for typical applications. In the official Matplotlib FAQ (https://matplotlib.org/mpl_toolkits/mplot3d/faq.html), the author noted that 3D plots may not look right at certain angles. In addition, we reported that mplot3d fails to clip bar charts if zlim is set (https://github.com/matplotlib/matplotlib/issues/8902; see also https://github.com/matplotlib/matplotlib/issues/209). Without improvements in the 3D rendering backend, these issues are hard to fix.

To better illustrate the latter issue, let's try to add ax.set_zlim3d(bottom=110, top=150) right above plt.tight_layout() in the previous 3D bar chart:
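For clarity, the change amounts to a single extra line; the rest of the listing is unchanged (abbreviated here):

# ... same 3D bar chart code as in the previous listing ...
ax.view_init(azim=-42, elev=31)
ax.set_zlim3d(bottom=110, top=150)   # newly added line that triggers the clipping glitch
plt.tight_layout()
plt.show()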
Clearly, something is going wrong, as the bars overshoot the lower boundary of the axes. We will try to address this issue through the following workaround:

from matplotlib.ticker import FuncFormatter  # needed for the tick formatter below

# FuncFormatter to add 110 to the tick labels
def major_formatter(x, pos):
    return "{}".format(x+110)

fig = plt.figure(figsize=(9,7))
ax = fig.add_subplot(111, projection='3d')

# Create one color-coded bar chart for Open, High, Low and Close prices.
for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]):
    xs = np.arange(ohlc_5d.shape[0])
    ys = ohlc_5d[col]
    # Assign color to the bars
    colors = [color] * len(xs)
    # Truncate the y-values by 110
    ax.bar(xs, ys-110, zs=z, zdir='y', color=colors, alpha=0.8)

# Manually assign the ticks and tick labels
ax.set_xticks(np.arange(ohlc_5d.shape[0]))
ax.set_xticklabels(ohlc_5d["Date"], rotation=20,
                   verticalalignment='baseline',
                   horizontalalignment='right',
                   fontsize='8')
ax.set_yticks([30, 20, 10, 0])
ax.set_yticklabels(["Open", "High", "Low", "Close"])

# Shift the z-axis tick labels back to the original price range
ax.zaxis.set_major_formatter(FuncFormatter(major_formatter))

# Set the z-axis label
ax.set_zlabel('Price (US $)')

# Rotate the viewport
ax.view_init(azim=-42, elev=31)
plt.tight_layout()
plt.show()

Basically, we truncated the y values by 110, and then we used a tick formatter (major_formatter) to shift the tick values back to the original range. For 3D scatter plots, we can simply remove the data points that exceed the boundary of set_zlim3d() in order to generate a proper figure. However, these workarounds may not work for every 3D plot type.
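As a rough sketch of the scatter-plot workaround mentioned above, assuming we want to clip the cat point cloud to an arbitrary z range (the zmin and zmax values below are made up purely for illustration):

zmin, zmax = -40, 40   # hypothetical z-axis limits for the cat point cloud

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Keep only the points that fall inside the z-limits before plotting
mask = (cat_df.z >= zmin) & (cat_df.z <= zmax)
ax.scatter(cat_df.x[mask], cat_df.y[mask], cat_df.z[mask], s=4, c="g", marker="o")

ax.set_zlim3d(bottom=zmin, top=zmax)
plt.show()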
Conclusion

We didn't go into too much detail of Matplotlib's 3D plotting capability, as it is yet to be polished. For simple 3D plots, Matplotlib already suffices, and the learning curve is reduced if we use the same package for both 2D and 3D plots. You are advised to take a look at MayaVi2, Plotly, and VisPy if you require more powerful 3D plotting functions.

If you enjoyed this excerpt, be sure to check out the book it is from.