
How-To Tutorials - Data

1204 Articles

Top Research papers showcased at NIPS 2017 - Part 1

Sugandha Lahoti
07 Dec 2017
6 min read
The 31st annual Conference on Neural Information Processing Systems (NIPS 2017) is being held in Long Beach, California from December 4 to 9, 2017. The six-day conference hosts a number of invited talks, demonstrations, tutorials, and paper presentations covering the latest in machine learning, deep learning, and AI research. This year the conference has grown larger than ever, with a record 3,240 submitted papers, 678 accepted, and a completely sold-out event. Google, Microsoft, IBM, DeepMind, Facebook, and Amazon are among the prominent players who participated this year. Here is a quick roundup of some of the top research papers presented so far.

Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a hot topic of research at the ongoing NIPS conference. GANs offer a way to reduce the amount of labelled data required to train deep learning algorithms by making use of unlabelled data. Here are a few research papers on GANs.

Regularization can stabilize training of GANs

Microsoft researchers have proposed a new regularization approach that yields a stable GAN training procedure at low computational cost. Their new model overcomes a fundamental limitation of GANs that arises from a dimensional mismatch between the model distribution and the true distribution, which leaves their density ratio and the associated f-divergence undefined. Their paper "Stabilizing Training of Generative Adversarial Networks through Regularization" turns GAN models into reliable building blocks for deep learning. They have also demonstrated the approach on several datasets, including image generation tasks.

AdaGAN: Boosting GAN Performance

Training GANs can at times be a hard task. They can also suffer from the problem of missing modes, where the model is not able to produce examples in certain regions of the space. Google researchers have developed an iterative procedure called AdaGAN in their paper "AdaGAN: Boosting Generative Models", an approach inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. It adds a new component to a mixture model at every step by running a GAN algorithm on a re-weighted sample. The paper also addresses the problem of missing modes.

Houdini: Generating Adversarial Examples

The generation of adversarial examples is considered a critical milestone for evaluating and improving the robustness of learning in machines. However, current methods are confined to classification tasks and are unable to target the actual performance measure of the problem at hand. To tackle this issue, Facebook researchers have presented a paper titled "Houdini: Fooling Deep Structured Prediction Models", a novel and flexible approach for generating adversarial examples tailored to the final performance measure of the task at hand, including combinatorial and non-decomposable measures.

Stochastic hard-attention for Memory Addressing in GANs

DeepMind researchers showcased a new method that uses stochastic hard-attention to retrieve memory content in generative models. Their paper, "Variational Memory Addressing in Generative Models", was presented on the second day of the conference and is an advancement over the popular differentiable soft-attention mechanism. The new technique allows developers to apply variational inference to memory addressing. This leads to more precise memory lookups using target information, especially in models with large memory buffers and many confounding entries in the memory.

Image and Video Processing

There was also a lot of excitement around sophisticated models and techniques for image and video processing. Here is a quick glance at the top presentations.

Fader Networks: Image manipulation through disentanglement

Facebook researchers have introduced Fader Networks in their paper "Fader Networks: Manipulating Images by Sliding Attributes". Fader networks use an encoder-decoder architecture to reconstruct images by disentangling their salient information and the values of particular attributes directly in a latent space. Disentanglement makes it possible to manipulate these attributes to generate variations of pictures of faces while preserving their naturalness. This approach results in much simpler training schemes and scales to manipulating multiple attributes jointly.

Visual interaction networks for Video simulation

Another paper, titled "Visual Interaction Networks: Learning a Physics Simulator from Video", proposes a new neural-network model that learns the dynamics of physical objects without prior knowledge. DeepMind's Visual Interaction Network is used for video analysis and is able to infer the states of multiple physical objects from just a few frames of video. It then uses these states to predict object positions many steps into the future. It can also deduce the locations of invisible objects.

Transfer, Reinforcement, and Continual Learning

A lot of research is going on in the fields of transfer, reinforcement, and continual learning to make deep learning models more stable and powerful. Here are a few research papers presented in this domain.

Two new techniques for Transfer Learning

Currently, a large set of input/output (I/O) examples is required for learning any underlying input-output mapping. By leveraging information from related tasks, researchers at Microsoft have addressed the data and computation efficiency of program induction. Their paper "Neural Program Meta-Induction" uses two approaches for cross-task knowledge transfer. The first is portfolio adaptation, where a set of induction models is pretrained on a set of related tasks, and the best model is adapted towards the new task using transfer learning. The second is meta program induction, a k-shot learning approach that lets a model generalize to new tasks without requiring any additional training.

Hybrid Reward Architecture to solve the problem of generalization in Reinforcement Learning

A new paper from Microsoft, "Hybrid Reward Architecture for Reinforcement Learning", highlights a new method to address the generalization problem faced by typical deep RL methods. Hybrid Reward Architecture (HRA) takes a decomposed reward function as input and learns a separate value function for each component reward function. This is especially useful in domains where the optimal value function cannot easily be reduced to a low-dimensional representation. In the new approach, the overall value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning.

Gradient Episodic Memory to counter catastrophic forgetting in continual learning models

Continual learning is all about improving the ability of models to solve sequential tasks without forgetting previously acquired knowledge. In the paper "Gradient Episodic Memory for Continual Learning", Facebook researchers have proposed a set of metrics to evaluate models over a continuous series of data. These metrics characterize models by their test accuracy and their ability to transfer knowledge across tasks. They have also proposed a model for continual learning, called Gradient Episodic Memory (GEM), that reduces the problem of catastrophic forgetting. They have experimented with variants of the MNIST and CIFAR-100 datasets to demonstrate the performance of GEM compared to other methods.

In our next post, we will cover a selection of papers presented so far at NIPS 2017 in the areas of predictive modelling, machine translation, and more. For live content coverage, you can visit NIPS' Facebook page.


What are Slowly Changing Dimensions (SCD) and why do you need them in your Data Warehouse?

Savia Lobo
07 Dec 2017
8 min read
Note: The post below is an excerpt from the book Learning Informatica PowerCenter 10.x by Rahul Malewar. The book is a quick guide to exploring Informatica PowerCenter and its features, such as working on sources, targets, transformations, performance optimization, and managing your data at speed.

Our article explores what Slowly Changing Dimensions (SCD) are and how to implement them in Informatica PowerCenter. As the name suggests, SCD allows you to maintain changes in a Dimension table in the data warehouse. These are dimensions that change gradually over time, rather than on a regular basis. When you implement SCDs, you decide how you wish to maintain historical data alongside the current data. Dimensions in data warehousing and data management hold descriptive data about entities such as customers, geographical locations, products, and so on. Here we talk about the general SCDs: SCD1, SCD2, and SCD3. Apart from these, there are also Hybrid SCDs that you might come across. A Hybrid SCD is nothing but a combination of multiple SCDs to serve your complex business requirements.

Types of SCD

The various types of SCD are described as follows:

Type 1 dimension mapping (SCD1): This keeps only current data and does not maintain historical data. Note: Use SCD1 mapping when you do not want a history of previous data.

Type 2 dimension/version number mapping (SCD2): This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) that maintains a version number in the table to track the changes. We use a new column, PM_PRIMARYKEY, to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data and track the progression of changes using a version number.

Type 2 dimension/flag mapping: This keeps current as well as historical data in the table. It allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) that maintains a flag in the table to track the changes. We use a new column, PM_PRIMARYKEY, to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data and track the progression of changes using a flag.

Type 2 dimension/effective date range mapping: This keeps current as well as historical data in the table. It allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) that maintain a date range in the table to track the changes. We use a new column, PM_PRIMARYKEY, to maintain the history. Note: Use SCD2 mapping when you want to keep a full history of dimension data and track the progression of changes using a start date and an end date.

Type 3 dimension mapping (SCD3): This keeps current data as well as partial historical data in the table. We maintain only partial history by adding a new column, PM_PREV_COLUMN_NAME; that is, we do not maintain full history. Note: Use SCD3 mapping when you wish to maintain only partial history.

Let's take an example to understand the different SCDs. Consider a column LOCATION in the EMPLOYEE table, and suppose you wish to track the changes in the location of employees. Consider a record for Employee ID 1001 present in your EMPLOYEE dimension table: Steve was initially working in India and then shifted to the USA. We wish to maintain history on the LOCATION field.
The EMPLOYEE table initially looks like this:

EMPLOYEE_ID  NAME   LOCATION
1001         STEVE  INDIA

Your data warehouse table should reflect the current status of Steve. To implement this, we have the different types of SCDs.

SCD1

As you can see in the following table, INDIA is replaced with USA, so we end up having only current data, and we lose the historical data:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION
100            1001         STEVE  USA

Now if Steve is again shifted, to JAPAN, the LOCATION data is replaced again, this time from USA to JAPAN:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION
100            1001         STEVE  JAPAN

The advantage of SCD1 is that we do not consume a lot of space in maintaining the data. The disadvantage is that we don't have historical data.

SCD2 - Version number

As you can see in the following table, we maintain the full history by adding a new record to preserve the history of the previous records:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_VERSION_NUMBER
100            1001         STEVE  INDIA     0
101            1001         STEVE  USA       1
102            1001         STEVE  JAPAN     2
200            1002         MIKE   UK        0

We add two new columns to the table: PM_PRIMARYKEY, to handle the issue of duplicate records in the EMPLOYEE_ID column (which is supposed to be the primary key), and PM_VERSION_NUMBER, to distinguish current and historical records.

SCD2 - Flag

As you can see in the following table, we maintain the full history by adding new records to preserve the history of the previous records:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_CURRENT_FLAG
100            1001         STEVE  INDIA     0
101            1001         STEVE  USA       1

We add two new columns to the table: PM_PRIMARYKEY, to handle the issue of duplicate records in the EMPLOYEE_ID column, and PM_CURRENT_FLAG, to distinguish current and historical records. Again, if Steve is shifted, the data looks like this:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_CURRENT_FLAG
100            1001         STEVE  INDIA     0
101            1001         STEVE  USA       0
102            1001         STEVE  JAPAN     1

SCD2 - Date range

As you can see in the following table, we maintain the full history by adding new records to preserve the history of the previous records:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_BEGIN_DATE  PM_END_DATE
100            1001         STEVE  INDIA     01-01-14       31-05-14
101            1001         STEVE  USA       01-06-14       99-99-9999

We add three new columns to the table: PM_PRIMARYKEY, to handle the issue of duplicate records in the EMPLOYEE_ID column, and PM_BEGIN_DATE and PM_END_DATE, to track the versions in the data. The advantage of SCD2 is that you have the complete history of the data, which is a must for a data warehouse. The disadvantage of SCD2 is that it consumes a lot of space.

SCD3

As you can see in the following table, we maintain the history by adding a new column:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_PREV_LOCATION
100            1001         STEVE  USA       INDIA

An optional column PM_PRIMARYKEY can be added to maintain the primary key constraints. We add a new column, PM_PREV_LOCATION, to the table to store the changes in the data. As you can see, we added a new column to store the data, as against SCD2, where we added rows to maintain the history. If Steve is now shifted to JAPAN, the data changes to this:

PM_PRIMARYKEY  EMPLOYEE_ID  NAME   LOCATION  PM_PREV_LOCATION
100            1001         STEVE  JAPAN     USA

As you can notice, we lost INDIA from the data warehouse; that is why we say we are maintaining only partial history. Note: To implement SCD3, decide how many versions of a particular column you wish to maintain; based on this, the columns are added to the table. SCD3 is best when you are not interested in maintaining the complete history but only partial history.
The drawback of SCD3 is that it doesn't store the full history. At this point, you should be clear about the different types of SCDs. We now need to implement these concepts practically in Informatica PowerCenter. Informatica PowerCenter provides a wizard utility to implement SCDs. Using the wizard, you can easily implement any SCD. In the next topics, you will learn how to use the wizard to implement SCD1, SCD2, and SCD3. Before you proceed to the next section, please make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all of these transformations to implement the SCD functionality.

When we implement SCD, there will be some new records that need to be loaded into the target table, and there will be some existing records for which we need to maintain the history. Note: The record that comes into the table for the first time is referred to as the NEW record, and the record for which we need to maintain history is referred to as the CHANGED record. Based on the comparison of the source data with the target data, we decide which one is the NEW record and which is the CHANGED record. To start with, we will use a sample file as our source and an Oracle table as the target to implement SCDs.

Before we implement SCDs, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD (a minimal code sketch of this logic is given at the end of this article):

1. Extract all records from the source.
2. Look up the target table, and cache all the data.
3. Compare the source data with the target data to flag the NEW and CHANGED records.
4. Filter the data based on the NEW and CHANGED flags.
5. Generate the primary key for every new row inserted into the table.
6. Load the NEW records into the table, and update the existing records if needed.

In this article we concentrated on a very important data warehouse concept called slowly changing dimensions. We also discussed the different types of SCDs, that is, SCD1, SCD2, and SCD3. If you are looking to explore more in Informatica PowerCenter, go ahead and check out the book Learning Informatica PowerCenter 10.x.
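For readers who want to experiment with this NEW/CHANGED logic outside Informatica, here is a minimal sketch of the SCD2 (version number) flow in Python with pandas. The apply_scd2 helper, the DataFrames, and the hard-coded column names are illustrative only, following the example tables above; they are not part of the book's Informatica mappings.

import pandas as pd

def apply_scd2(target, source):
    # SCD2 (version number) logic: insert NEW records, and add a new version
    # row for CHANGED records so that the full history is kept.
    result = target.copy()
    next_key = int(result["PM_PRIMARYKEY"].max()) + 1 if len(result) else 100
    for _, row in source.iterrows():
        history = result[result["EMPLOYEE_ID"] == row["EMPLOYEE_ID"]]
        if history.empty:
            version = 0                                   # NEW record
        else:
            current = history.loc[history["PM_VERSION_NUMBER"].idxmax()]
            if current["LOCATION"] == row["LOCATION"]:
                continue                                  # unchanged, nothing to load
            version = current["PM_VERSION_NUMBER"] + 1    # CHANGED record
        new_row = pd.DataFrame([{
            "PM_PRIMARYKEY": next_key,
            "EMPLOYEE_ID": row["EMPLOYEE_ID"],
            "NAME": row["NAME"],
            "LOCATION": row["LOCATION"],
            "PM_VERSION_NUMBER": version,
        }])
        result = pd.concat([result, new_row], ignore_index=True)
        next_key += 1
    return result

# Existing dimension table and an incoming source record (Steve moves to USA)
target = pd.DataFrame([{"PM_PRIMARYKEY": 100, "EMPLOYEE_ID": 1001, "NAME": "STEVE",
                        "LOCATION": "INDIA", "PM_VERSION_NUMBER": 0}])
source = pd.DataFrame([{"EMPLOYEE_ID": 1001, "NAME": "STEVE", "LOCATION": "USA"}])
print(apply_scd2(target, source))

Running this on the example data produces the two-row history for Steve shown in the SCD2 version number table above.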


Implementing Linear Regression Analysis with R

Amarabha Banerjee
06 Dec 2017
7 min read
Note: This article is an excerpt from the book Advanced Analytics with R and Tableau, written by Jen Stirrup and Ruben Oliva Ramos. The book offers a wide range of machine learning algorithms to help you build descriptive, prescriptive, predictive, and visually appealing analytical solutions with R and Tableau.

One of the most popular analytical methods for statistical analysis is regression analysis. In this article we explore the basics of regression analysis and how R can be used to perform it effectively.

Getting started with regression

Regression means the unbiased prediction of the conditional expected value of a dependent variable, using independent variables. A dependent variable is the variable that we want to predict. Examples of a dependent variable could be a number such as price, sales, or weight. An independent variable is a characteristic, or feature, that helps to determine the dependent variable. So, for example, the independent variable of height could help to determine the dependent variable of weight. Regression analysis can be used in forecasting, time series modeling, and finding cause and effect relationships.

Simple linear regression

R can help us to build prediction stories with Tableau. Linear regression is a great starting place when you want to predict a number, such as profit, cost, or sales. In simple linear regression, there is only one independent variable, x, which predicts a dependent value, y. Simple linear regression is usually expressed with a line that identifies the slope that helps us to make predictions. If sales = x and profit = y, what is the slope that allows us to make the prediction? We will create this calculation in R, and we can also color-code the output so that we can see what is above and what is below the slope.

Using the lm() function

What is linear regression? Linear regression has the objective of finding a model that fits a regression line through the data well, whilst reducing the discrepancy, or error, between the data and the regression line. We are trying to predict the line of best fit between one or many variables from a scatter plot of points of data. To find the line of best fit, we need to calculate two things about the line, which we can obtain with the lm() function: the slope of the line, m, and the intercept with the y axis, c. So we begin with the equation of the line:

y = mx + c

To get the line, we use the concept of Ordinary Least Squares (OLS). This means that we sum the squares of the y-distances between the points and the line. Furthermore, we can rearrange the formula to give us beta (or m) in terms of the number of points n, x, and y. This assumes that we can minimize the mean error between the line and the points, so that the line is the best predictor for all of the points in the training set and for future feature vectors.
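For reference, the OLS slope and intercept described above have a standard closed form. These are the textbook formulas rather than an excerpt from the book, written here in LaTeX notation:

\hat{m} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad \hat{c} = \bar{y} - \hat{m}\,\bar{x}

This is exactly what lm() estimates for a single predictor, so you can use it to sanity-check the coefficients R reports.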
Example in R

Let's start with a simple example in R, where we predict women's weight from their height. If we were articulating this question per Microsoft's Team Data Science Process, we would be stating it as a business question during the business understanding phase: how can we come up with a model that helps us to predict what a woman's weight is going to be, dependent on her height? Using this business question as a basis for further investigation, how do we come up with a model from the data, which we could then use for further analysis?

Simple linear regression involves two variables: an independent variable, also known as the predictor variable, and a dependent variable. With only one variable and no other information, the best prediction is the mean of the sample itself. In other words, when all we have is one variable, the mean is the best predictor of any one amount.

The first step is to collect a random sample of data. In R, we are lucky to have sample data that we can use. To explore linear regression, we will use the women dataset, which is installed by default with R. The variability of the weight values can only be explained by the weights themselves, because that is all we have. To conduct the regression, we will use the lm function, which appears as follows:

model <- lm(y ~ x, data=mydata)

To see the women dataset, open up RStudio. When we type in a variable name, we get the contents of the variable, so typing the variable name women prints the women's heights and weights to the console:

> women

We can visualize the data quite simply in R, using the plot(women) command. The plot command provides a quick and easy way of visualizing the data; our objective here is simply to see the relationship in the data. Now that we can see that relationship, we can use the summary command to explore the data further:

summary(women)

Next, we can create a model that will use the lm function to create a linear regression model of the data. We will assign the results to a model called linearregressionmodel, as follows:

linearregressionmodel <- lm(weight ~ height, data=women)

What does the model produce? We can use the summary command again, and this will provide some descriptive statistics about the lm model that we have generated. One of the nice, understated features of R is its ability to use variables. Here we have our variable, linearregressionmodel; note that one word is storing a whole model!

summary(linearregressionmodel)

What do these numbers mean? Let's take a closer look at some of the key numbers.

Residual standard error

In the output, the residual standard error, which measures the model's error, is 1.525.

Comparing actual values with predicted results

Now we will look at the actual weights of the 15 women first, and then at the predicted values. The actual weights are obtained with the following command:

women$weight

The predicted values can also be read out in R. How can we put these pieces of data together?

women$pred <- linearregressionmodel$fitted.values

This is a very simple merge. When we look inside the women variable again, we can see both the actual and the predicted weights side by side.

If you liked this article, please be sure to check out Advanced Analytics with R and Tableau, which consists of more useful analytics techniques with R and Tableau. It will enable you to make quick, cogent, and data-driven decisions for your business using advanced analytical techniques such as forecasting, predictions, association rules, clustering, classification, and other advanced Tableau/R calculated field functions.


Implementing Deep Learning with Keras

Amey Varangaonkar
05 Dec 2017
4 min read
Note: The following excerpt is from Chapter 5 of the title Deep Learning with Theano, written by Christopher Bourez. The book offers a complete overview of deep learning with Theano, a Python-based library that makes optimizing numerical expressions and deep learning models easy on CPU or GPU.

In this article, we introduce you to the highly popular deep learning library Keras, which sits on top of both Theano and TensorFlow. It is a flexible platform for training deep learning models with ease. Keras is a high-level neural network API, written in Python and capable of running on top of either TensorFlow or Theano. It was developed to make implementing deep learning models as fast and easy as possible for research and development. You can install Keras easily using conda, as follows:

conda install keras

When writing your Python code, importing Keras will tell you which backend is used:

>>> import keras
Using Theano backend.
Using cuDNN version 5110 on context None
Preallocating 10867/11439 Mb (0.950000) on cuda0
Mapped name None to device cuda0: Tesla K80 (0000:83:00.0)
Mapped name dev0 to device cuda0: Tesla K80 (0000:83:00.0)
Using cuDNN version 5110 on context dev1
Preallocating 10867/11439 Mb (0.950000) on cuda1
Mapped name dev1 to device cuda1: Tesla K80 (0000:84:00.0)

If you have installed TensorFlow, Keras might not use Theano. To specify which backend to use, write a Keras configuration file, ~/.keras/keras.json:

{
  "epsilon": 1e-07,
  "floatx": "float32",
  "image_data_format": "channels_last",
  "backend": "theano"
}

It is also possible to specify the Theano backend directly with an environment variable:

KERAS_BACKEND=theano python

Note that the device used is the device we specified for Theano in the ~/.theanorc file. It is also possible to modify these variables with Theano environment variables:

KERAS_BACKEND=theano THEANO_FLAGS=device=cuda,floatX=float32,mode=FAST_RUN python

Programming with Keras

Keras provides a set of methods for data pre-processing and for building models. Layers and models are callable functions on tensors and return tensors. In Keras, there is no difference between a layer/module and a model: a model can be part of a bigger model and composed of multiple layers. Such a sub-model behaves as a module, with inputs and outputs. Let's create a network with two linear layers, a ReLU non-linearity in between, and a softmax output:

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
x = Dense(64, activation='relu')(inputs)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=predictions)

The model module contains methods to get the input and output shapes, for either one or multiple inputs/outputs, and to list the submodules of our module:

>>> model.input_shape
(None, 784)
>>> model.get_input_shape_at(0)
(None, 784)
>>> model.output_shape
(None, 10)
>>> model.get_output_shape_at(0)
(None, 10)
>>> model.name
'Sequential_1'
>>> model.input
/dense_3_input
>>> model.output
Softmax.0
>>> model.get_output_at(0)
Softmax.0
>>> model.layers
[<keras.layers.core.Dense object at 0x7f0abf7d6a90>, <keras.layers.core.Dense object at 0x7f0abf74af90>]

To avoid specifying inputs to every layer, Keras also offers the Sequential module as a way of composing a new model or module from a stack of layers.
The following definition builds exactly the same model as shown previously, with input_dim specifying the input dimension of the first block, which would otherwise be unknown and would generate an error:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(units=64, input_dim=784, activation='relu'))
model.add(Dense(units=10, activation='softmax'))

The model is considered a module or layer that can be part of a bigger model:

model2 = Sequential()
model2.add(model)
model2.add(Dense(units=10, activation='softmax'))

Each module/model/layer can then be compiled and trained with data:

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(data, labels)

Thus, we see it is fairly easy to train a model in Keras. The simplicity and ease of use that Keras offers makes it a very popular choice of tool for deep learning. If you think the article is useful, check out the book Deep Learning with Theano for interesting deep learning concepts and their implementation using Theano. For more information on the Keras library and how to train efficient deep learning models, make sure to check our highly popular title Deep Learning with Keras.
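To see the compile/fit workflow end to end, here is a minimal, self-contained sketch. It assumes the standalone keras package used in the excerpt; the random arrays, layer sizes, and epoch count are illustrative placeholders rather than an example from the book:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Illustrative random data standing in for a real dataset such as MNIST
x_train = np.random.random((1000, 784))
y_train = to_categorical(np.random.randint(10, size=(1000,)), num_classes=10)

model = Sequential()
model.add(Dense(units=64, input_dim=784, activation='relu'))
model.add(Dense(units=10, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, batch_size=32)
loss, accuracy = model.evaluate(x_train, y_train, batch_size=128)
print(loss, accuracy)

Swapping the random arrays for a real dataset is enough to turn this sketch into a working experiment.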


Basics of Spark SQL and its components

Amarabha Banerjee
04 Dec 2017
8 min read
Note: The following is an excerpt from the book Learning Spark SQL by Aurobindo Sarkar. Spark SQL APIs provide an optimized interface that helps developers build distributed applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. This book provides you with an understanding of the design and implementation best practices used to build real-world, Spark-based applications.

In this article, we give you a perspective on Spark SQL and its components.

Introduction

Spark SQL is one of the most advanced components of Apache Spark. It has been a part of the core distribution since Spark 1.0 and supports Python, Scala, Java, and R programming APIs. As illustrated in the figure below, Spark SQL components provide the foundation for Spark machine learning applications, streaming applications, graph applications, and many other types of application architectures. Such applications typically use Spark ML pipelines, Structured Streaming, and GraphFrames, which are all based on Spark SQL interfaces (the DataFrame/Dataset API). These applications, along with constructs such as SQL, DataFrames, and the Datasets API, receive the benefits of the Catalyst optimizer automatically. This optimizer is also responsible for generating executable query plans based on the lower-level RDD interfaces.

SparkSession

SparkSession represents a unified entry point for manipulating data in Spark. It minimizes the number of different contexts a developer has to use while working with Spark. SparkSession replaces multiple context objects, such as SparkContext, SQLContext, and HiveContext; these contexts are now encapsulated within the SparkSession object. In Spark programs, we use the builder design pattern to instantiate a SparkSession object. However, in the REPL environment (that is, in a Spark shell session), the SparkSession is automatically created and made available to you via an instance object called spark. At this time, start the Spark shell on your computer to interactively execute the code snippets in this section. As the shell starts up, you will notice a bunch of messages appearing on your screen, as shown in the following figure.

Understanding Resilient Distributed Datasets (RDD)

RDDs are Spark's primary distributed Dataset abstraction. An RDD is a collection of data that is immutable, distributed, lazily evaluated, type inferred, and cacheable. Prior to execution, the developer code (using higher-level constructs such as SQL, DataFrames, and the Dataset API) is converted to a DAG of RDDs ready for execution. RDDs can be created by parallelizing an existing collection of data or by accessing a Dataset residing in an external storage system, such as the file system or various Hadoop-based data sources. The parallelized collections form a distributed Dataset that enables parallel operations on them. An RDD can be created from an input file with the number of partitions specified, as shown:

scala> val cancerRDD = sc.textFile("file:///Users/aurobindosarkar/Downloads/breast-cancer-wisconsin.data", 4)
scala> cancerRDD.partitions.size
res37: Int = 4

An RDD can be converted internally to a DataFrame by importing the spark.implicits package and using the toDF() method:

scala> import spark.implicits._
scala> val cancerDF = cancerRDD.toDF()

To create a DataFrame with a specific schema, we define a Row object for the rows contained in the DataFrame.
Additionally, we split the comma-separated data, convert it to a list of fields, and then map it to the Row object. Finally, we use createDataFrame() to create the DataFrame with a specified schema:

def row(line: List[String]): Row = { Row(line(0).toLong, line(1).toInt, line(2).toInt, line(3).toInt, line(4).toInt, line(5).toInt, line(6).toInt, line(7).toInt, line(8).toInt, line(9).toInt, line(10).toInt) }
val data = cancerRDD.map(_.split(",").to[List]).map(row)
val cancerDF = spark.createDataFrame(data, recordSchema)

Further, we can easily convert the preceding DataFrame to a Dataset using the case class defined earlier:

scala> val cancerDS = cancerDF.as[CancerClass]

RDD data is logically divided into a set of partitions; additionally, all input, intermediate, and output data is also represented as partitions. The number of RDD partitions defines the level of data fragmentation, and these partitions are also the basic units of parallelism. Spark execution jobs are split into multiple stages, and as each stage operates on one partition at a time, it is very important to tune the number of partitions. Fewer partitions than active stages means your cluster could be under-utilized, while an excessive number of partitions could impact performance due to higher disk and network I/O.

Understanding DataFrames and Datasets

A DataFrame is similar to a table in a relational database, a pandas dataframe, or a dataframe in R. It is a distributed collection of rows that is organized into columns. It uses the immutable, in-memory, resilient, distributed, and parallel capabilities of RDDs, and applies a schema to the data. DataFrames are also evaluated lazily. Additionally, they provide a domain-specific language (DSL) for distributed data manipulation. Conceptually, the DataFrame is an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped object. This means that syntax errors for DataFrames are caught during the compile stage; however, analysis errors are detected only during runtime. DataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, databases, or RDDs. The source data can be read from local filesystems, HDFS, Amazon S3, and RDBMSs. In addition, other popular data formats, such as CSV, JSON, Avro, Parquet, and so on, are also supported, and you can create and use custom data sources as well. The DataFrame API supports Scala, Java, Python, and R programming APIs. It is declarative, and combined with procedural Spark code, it provides a much tighter integration between the relational and procedural processing in your applications. DataFrames can be manipulated using Spark's procedural API, or using relational APIs (with richer optimizations).

Understanding the Catalyst optimizer

The Catalyst optimizer is at the core of Spark SQL and is implemented in Scala. It enables several key features, such as schema inference (from JSON data), that are very useful in data analysis work. The following figure shows the high-level transformation process from a developer's program containing DataFrames/Datasets to the final execution plan. The internal representation of the program is a query plan. The query plan describes data operations, such as aggregate, join, and filter, which match what is defined in your query. These operations generate a new Dataset from the input Dataset.
After we have an initial version of the query plan ready, the Catalyst optimizer applies a series of transformations to convert it to an optimized query plan. Finally, the Spark SQL code generation mechanism translates the optimized query plan into a DAG of RDDs that is ready for execution. The query plans and the optimized query plans are internally represented as trees. So, at its core, the Catalyst optimizer contains a general library for representing trees and applying rules to manipulate them. On top of this library are several other libraries that are more specific to relational query processing.

Catalyst has two types of query plans: Logical and Physical Plans. The Logical Plan describes the computations on the Datasets without defining how to carry out the specific computations. Typically, the Logical Plan generates a list of attributes or columns as output, under a set of constraints on the generated rows. The Physical Plan describes the computations on Datasets with specific definitions on how to execute them (it is executable). Let's explore the transformation steps in more detail. The initial query plan is essentially an unresolved Logical Plan; that is, we don't know the source of the Datasets or the columns (contained in the Dataset) at this stage, and we also don't know the types of the columns. The first step in this pipeline is the analysis step. During analysis, the catalog information is used to convert the unresolved Logical Plan to a resolved Logical Plan. In the next step, a set of logical optimization rules is applied to the resolved Logical Plan, resulting in an optimized Logical Plan. In the next step, the optimizer may generate multiple Physical Plans and compare their costs to pick the best one. The first version of the Cost-based Optimizer (CBO), built on top of Spark SQL, was released in Spark 2.2. More details on cost-based optimization are presented in Chapter 11, Tuning Spark SQL Components for Performance. All three (DataFrame, Dataset, and SQL) share the same optimization pipeline, as illustrated in the following figure.

The primary goal of this article was to give an overview of Spark SQL and to help you become comfortable with the Spark environment through hands-on sessions (using public Datasets). If you liked our article, please be sure to check out Learning Spark SQL, which consists of more useful techniques on data extraction and data analysis using Spark SQL.
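If you prefer Python to the Scala shell, the same DataFrame flow can be sketched with PySpark, and explain(True) lets you inspect the parsed, analyzed, and optimized logical plans as well as the physical plan that Catalyst produces. The tiny in-line dataset below is purely illustrative and is not from the book:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CatalystDemo").getOrCreate()

# Illustrative data; in practice this would come from a file or table
df = spark.createDataFrame(
    [(1, "IN", 100), (2, "US", 250), (3, "IN", 75)],
    ["id", "region", "amount"],
)

result = df.filter(df.amount > 80).groupBy("region").sum("amount")

# Prints the parsed/analyzed/optimized logical plans and the physical plan
result.explain(True)
result.show()

Reading the extended explain output is a quick way to see the analysis and optimization steps described above on a query of your own.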


Understanding Streaming Applications in Spark SQL

Amarabha Banerjee
04 Dec 2017
7 min read
Note: This article is a book excerpt from Learning Spark SQL, written by Aurobindo Sarkar. The book gives an insight into the engineering practices used to design and build real-world Spark-based applications, and the hands-on examples will give you the confidence to work on future projects you encounter in Spark SQL.

In this article, we shall talk about Spark SQL and its use in streaming applications.

What are streaming applications?

A streaming application is a program that has its necessary components downloaded as needed instead of being installed ahead of time on a computer. Application streaming is a method of delivering virtualized applications. Streaming applications are getting increasingly complex, because such computations don't run in isolation. They need to interact with batch data, support interactive analysis, support sophisticated machine learning applications, and so on. Typically, such applications store incoming event streams on long-term storage, continuously monitor events, and run machine learning models on the stored data, while simultaneously enabling continuous learning on the incoming stream. They also have the capability to interactively query the stored data while providing exactly-once write guarantees, handling late-arriving data, performing aggregations, and so on. These types of applications are a lot more than mere streaming applications and have, therefore, been termed continuous applications.

Spark SQL and Structured Streaming

Before Spark 2.0, streaming applications were built on the concept of DStreams, and there were several pain points associated with using them. In DStreams, the timestamp was the time when the event actually came into the Spark system; the time embedded in the event was not taken into consideration. In addition, though the same engine can process both batch and streaming computations, the APIs involved, though similar between RDDs (batch) and DStreams (streaming), required the developer to make code changes. The DStream streaming model placed the burden on the developer to address various failure conditions, and it was hard to reason about data consistency issues. In Spark 2.0, Structured Streaming was introduced to deal with all of these pain points.

Structured Streaming is a fast, fault-tolerant, exactly-once stateful stream processing approach. It enables streaming analytics without having to reason about the underlying mechanics of streaming. In the new model, the input can be thought of as data from an append-only table (that grows continuously). A trigger specifies the time interval for checking the input for the arrival of new data. As shown in the following figure, the query represents the queries or operations, such as map, filter, and reduce, on the input, and the result represents the final table that is updated in each trigger interval, as per the specified operation. The output defines the part of the result to be written to the data sink in each time interval. The output mode can be complete, delta, or append: the complete output mode writes the full result table every time, the delta output mode writes the rows changed since the previous batch, and the append output mode writes only the new rows, respectively.

In Spark 2.0, in addition to static bounded DataFrames, we have the concept of a continuous unbounded DataFrame. Both static and continuous DataFrames use the same API, thereby unifying streaming, interactive, and batch queries.
For example, you can aggregate data in a stream and then serve it using JDBC. The high-level streaming API is built on the Spark SQL engine and is tightly integrated with SQL queries and the DataFrame/Dataset APIs. The primary benefit is that you use the same high-level Spark DataFrame and Dataset APIs, and the Spark engine figures out the incremental and continuous execution required for the operations. Additionally, there are query management APIs that you can use to manage multiple, concurrently running streaming queries. For instance, you can list running queries, stop and restart queries, retrieve exceptions in case of failures, and so on.

In the example code below, we use two bid files from the iPinYou Dataset as the source for our streaming data. First, we define our input records schema and create a streaming input DataFrame:

scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.functions._
scala> import scala.concurrent.duration._
scala> import org.apache.spark.sql.streaming.ProcessingTime
scala> import org.apache.spark.sql.streaming.OutputMode.Complete
scala> val bidSchema = new StructType().add("bidid", StringType).add("timestamp", StringType).add("ipinyouid", StringType).add("useragent", StringType).add("IP", StringType).add("region", IntegerType).add("city", IntegerType).add("adexchange", StringType).add("domain", StringType).add("url", StringType).add("urlid", StringType).add("slotid", StringType).add("slotwidth", StringType).add("slotheight", StringType).add("slotvisibility", StringType).add("slotformat", StringType).add("slotprice", StringType).add("creative", StringType).add("bidprice", StringType)
scala> val streamingInputDF = spark.readStream.format("csv").schema(bidSchema).option("header", false).option("inferSchema", true).option("sep", "\t").option("maxFilesPerTrigger", 1).load("file:///Users/aurobindosarkar/Downloads/make-ipinyou-data-master/original-data/ipinyou.contest.dataset/bidfiles")

Next, we define our query with a time interval of 20 seconds and the output mode as Complete:

scala> val streamingCountsDF = streamingInputDF.groupBy($"city").count()
scala> val query = streamingCountsDF.writeStream.format("console").trigger(ProcessingTime(20.seconds)).queryName("counts").outputMode(Complete).start()

In the output, it is observed that the count of bids from each region gets updated in each time interval as new data arrives. New bid files need to be dropped into the bidfiles directory (or you can start with multiple bid files, as they will get picked up for processing one at a time based on the value of maxFilesPerTrigger) to see the updated results.

Structured Streaming Internals

The engine reads streaming data from the sources and incrementally executes the computation on it before writing the results to the sink. In addition, any running aggregates required by your application are maintained as in-memory state backed by a Write-Ahead Log (WAL). The in-memory state data is generated and used across incremental executions. The fault-tolerance requirements for such applications include the ability to recover and replay all data and metadata in the system. The planner writes offsets to a fault-tolerant WAL on persistent storage, such as HDFS, before execution, as illustrated in the figure. In case the planner fails on the current incremental execution, the restarted planner reads from the WAL and re-executes the exact range of offsets required.
Typically, sources such as Kafka are also fault-tolerant and can regenerate the original transaction data, given the appropriate offsets recovered by the planner. The state data is usually maintained in a versioned, key-value map in the Spark workers and is backed by a WAL on HDFS. The planner ensures that the correct version of the state is used to re-execute the transactions subsequent to a failure. Additionally, the sinks are idempotent by design and can handle re-executions without double commits of the output. Hence, the overall combination of offset tracking in the WAL, state management, and fault-tolerant sources and sinks provides the end-to-end exactly-once guarantees.

Summary

Spark SQL provides one of the best platforms for implementing streaming applications. Its internal architecture and fault-tolerant behavior mean that modern-day developers who want to create data-intensive applications with data streaming capabilities will want to use the power of Spark SQL. If you liked our post, please be sure to check out Learning Spark SQL, which consists of more useful techniques on data extraction and data analysis using Spark SQL.
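As a complement to the Scala shell session above, here is a minimal PySpark sketch of the same kind of streaming count. The schema and file paths are placeholders, not values from the book; the checkpointLocation option is what ties the query to the offset/WAL recovery mechanism described in this section:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("StreamingCounts").getOrCreate()

# Placeholder schema and paths for illustration
schema = StructType().add("bidid", StringType()).add("city", IntegerType())

stream_df = (spark.readStream
             .format("csv")
             .schema(schema)
             .option("maxFilesPerTrigger", 1)
             .load("/tmp/bidfiles"))

counts = stream_df.groupBy("city").count()

query = (counts.writeStream
         .format("console")
         .outputMode("complete")
         .option("checkpointLocation", "/tmp/checkpoints/bid-counts")
         .start())

query.awaitTermination()

If the query is stopped and restarted with the same checkpoint location, the engine recovers its offsets and state from that directory and resumes the counts.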

4 popular algorithms for Distance-based outlier detection

Sugandha Lahoti
01 Dec 2017
7 min read
Note: This article is an excerpt from our book Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. The book introduces you to an array of expert machine learning techniques, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modelling, and a lot more. The material below is extracted from Chapter 5 of the book, Real-time Stream Machine Learning, and explains four popular algorithms for distance-based outlier detection.

Distance-based outlier detection is the most studied, researched, and implemented method in the area of stream learning. There are many variants of the distance-based methods, based on sliding windows, the number of nearest neighbors, radius and thresholds, and other measures for considering outliers in the data. We will try to give a sampling of the most important algorithms in this article.

Inputs and outputs

Most algorithms take the following parameters as inputs:

Window size w, corresponding to the fixed size on which the algorithm looks for outlier patterns.
Sliding size s, corresponding to the number of new instances that will be added to the window as old ones are removed.
The count threshold k of instances when using nearest neighbor computation.
The distance threshold R used to define the outlier threshold in distances.

The outputs are outliers, as labels or scores (based on neighbors and distance).

How does it work?

We present different variants of distance-based stream outlier algorithms, giving insights into what each does differently or uniquely. The unique elements in each algorithm define what happens when the slide expires, how a new slide is processed, and how outliers are reported.

Exact Storm

Exact Storm stores the data in the current window w in a well-known index structure, so that the range query to find neighbors within the distance R of a given point can be answered efficiently. It also stores the k preceding and succeeding neighbors of all data points:

Expired slide: Instances in expired slides are removed from the index structure that serves range queries, but are preserved in the preceding lists of neighbors.
New slide: For each data point in the new slide, a range query R is executed, the results are used to update the preceding and succeeding lists for the instance, and the instance is stored in the index structure.
Outlier reporting: In any window, after the processing of expired and new slide elements is complete, any instance with fewer than k elements between its succeeding list and its non-expired preceding list is reported as an outlier.

Abstract-C

Abstract-C keeps an index structure similar to Exact Storm, but instead of preceding and succeeding lists for every object, it just maintains a list of counts of neighbors for the windows the instance participates in:

Expired slide: Instances in expired slides are removed from the index structure that serves range queries, and the first element of the list of counts, corresponding to the last window, is removed.
New slide: For each data point in the new slide, a range query R is executed and the results are used to update the list of counts. For existing instances, the count gets updated with the new neighbors, and the new instances are added to the index structure.
Outlier reporting: In any window, after the processing of expired and new slide elements is complete, all instances with a neighbor count of less than k in the current window are considered outliers.
Direct Update of Events (DUE)

DUE keeps an index structure for efficient range queries exactly like the other algorithms, but works on a different assumption: when a slide expires, not every instance is affected in the same way. It maintains two structures: the unsafe inlier priority queue and the outlier list. The unsafe inlier queue holds instances sorted in increasing order of the smallest expiration time of their preceding neighbors. The outlier list has all the outliers in the current window.

Expired slide: Instances in expired slides are removed from the index structure that serves range queries, and the unsafe inlier queue is updated for expired neighbors. Those unsafe inliers that become outliers are removed from the priority queue and moved to the outlier list.
New slide: For each data point in the new slide, a range query R is executed, the results are used to update the succeeding neighbors of the point, and only the most recent preceding points are updated for the instance. Based on the updates, the point is either added to the unsafe inlier priority queue or removed from the queue and added to the outlier list.
Outlier reporting: In any window, after the processing of expired and new slide elements is complete, all instances in the outlier list are reported as outliers.

Micro-Clustering based Algorithm (MCOD)

Micro-clustering based outlier detection overcomes the computational cost of performing range queries for every data point. A micro-cluster data structure is used instead of range queries in these algorithms. A micro-cluster is centered around an instance and has a radius of R. All the points belonging to micro-clusters become inliers. The points that fall outside can be outliers or inliers and are stored in a separate list. MCOD also has a data structure similar to DUE, keeping a priority queue of unsafe inliers.

Expired slide: Instances in expired slides are removed from both the micro-clusters and the data structure holding outliers and inliers. The unsafe inlier queue is updated for expired neighbors, as in the DUE algorithm. Micro-clusters are also updated for non-expired data points.
New slide: For each data point in the new slide, the instance either becomes the center of a micro-cluster, becomes part of an existing micro-cluster, or is added to the event queue and the data structure of possible outliers. If the point is within distance R of an existing micro-cluster, it gets assigned to it; otherwise, if there are k points within R, it becomes the center of a new micro-cluster; if not, it goes into the two structures of the event queue and possible outliers.
Outlier reporting: In any window, after the processing of expired and new slide elements is complete, any instance in the outlier structure with fewer than k neighboring instances is reported as an outlier.

Advantages and limitations

The advantages and limitations are as follows:

Exact Storm is demanding in storage and CPU for storing the lists and retrieving neighbors. It also introduces delays; even when implemented with efficient data structures, range queries can be slow.
Abstract-C has a small advantage over Exact Storm, as no time is spent on finding active neighbors for each instance in the window. The storage and time spent are still very much dependent on the window and slide chosen.
DUE has some advantage over Exact Storm and Abstract-C, as it can efficiently re-evaluate the "inlierness" of points (that is, whether unsafe inliers remain inliers or become outliers), but sorting the structure impacts both CPU and memory.
MCOD has distinct advantages in memory and CPU owing to the use of the micro-cluster structure and the removal of pairwise distance computation. Storing the neighborhood information in micro-clusters helps memory too.

Validation and evaluation of stream-based outlier detection is still an open research area. By varying parameters such as the window size, the neighbor radius, and so on, we determine the sensitivity of the performance metrics (time to evaluate in terms of CPU time per object, the number of outliers detected in the stream, and TP/precision/recall/area under the PRC curve) and thereby the robustness of the methods.

If you liked the above article, check out our book Mastering Java Machine Learning to explore more advanced machine learning techniques using the best Java-based tools available.
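To make the shared window/neighbor-count idea concrete, here is a naive Python sketch of the basic distance-based outlier definition that these four algorithms optimize. It recomputes all pairwise distances for every window (no index structures, no incremental updates), so it is purely illustrative; the window_outliers and sliding_stream_outliers helpers and the synthetic data are not from the book:

import numpy as np

def window_outliers(window, R, k):
    # Return indices of points in the window with fewer than k neighbors
    # within distance R (the basic distance-based outlier definition).
    window = np.asarray(window, dtype=float)
    outliers = []
    for i, point in enumerate(window):
        dists = np.linalg.norm(window - point, axis=1)
        neighbors = np.sum(dists <= R) - 1  # exclude the point itself
        if neighbors < k:
            outliers.append(i)
    return outliers

def sliding_stream_outliers(stream, w, s, R, k):
    # Naive streaming loop: slide a window of size w forward by s points
    # at a time and report the outliers found in each window.
    for start in range(0, max(len(stream) - w + 1, 1), s):
        window = stream[start:start + w]
        yield start, window_outliers(window, R, k)

# Tiny illustrative stream: 2-D points with one obvious outlier injected
rng = np.random.default_rng(0)
stream = rng.normal(size=(200, 2)).tolist()
stream[50] = [8.0, 8.0]
for start, idx in sliding_stream_outliers(stream, w=100, s=50, R=1.0, k=5):
    print(f"window starting at {start}: outliers at {idx}")

The algorithms described above exist precisely to avoid this quadratic recomputation as the window slides, by maintaining neighbor lists, counts, or micro-clusters incrementally.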


Getting started with Machine Learning in H2O

Sugandha Lahoti
01 Dec 2017
7 min read
Note: This is an excerpt from our book Mastering Java Machine Learning by Dr. Uday Kamath and Krishna Choppella. The book gives you an array of advanced techniques in machine learning. The article below talks about using H2O as a machine learning platform for Big Data applications.

H2O is a leading open source platform for machine learning at Big Data scale, with a focus on bringing AI to the enterprise. The company counts several leading lights in statistical learning theory and optimization among its scientific advisors. It supports programming environments in multiple languages.

H2O architecture

The following figure gives a high-level architecture of H2O with its important components. H2O can access data from various data stores, such as HDFS, SQL, NoSQL, and Amazon S3, to name a few. The most popular deployments of H2O are to use one of the deployment stacks with Spark, or to run it in an H2O cluster itself. The core of H2O is an optimized way of handling Big Data in memory, so that iterative algorithms that go through the same data repeatedly can be handled efficiently and achieve good performance. Important machine learning algorithms for supervised and unsupervised learning are implemented specifically to handle horizontal scalability across multiple nodes and JVMs. H2O provides not only its own user interface, known as Flow, to manage and run modeling tasks, but also different language bindings and connector APIs to Java, R, Python, and Scala.

Most machine learning algorithms, optimization algorithms, and utilities in H2O use the concept of fork-join or MapReduce. As shown in the figure below, the entire dataset is considered a Data Frame in H2O and comprises vectors, which are the features or columns of the dataset. The rows, or instances, are made up of one element from each vector arranged side by side. The rows are grouped together to form a processing unit known as a Chunk. Several chunks are combined in one JVM. Any algorithmic or optimization work begins by sending the information from the topmost JVM to fork on to the next JVM, then on to the next, and so on, similar to the map operation in MapReduce. Each JVM works on the rows in its chunks to complete the task, and finally the results flow back in the reduce operation.

Machine learning in H2O

The following figure shows all the machine learning algorithms supported in H2O v3 for supervised and unsupervised learning.

Tools and usage

H2O Flow is an interactive web application that helps data scientists perform various tasks, from importing data to running complex models, using point-and-click, wizard-based concepts. H2O is run in local mode as:

java -Xmx6g -jar h2o.jar

The default way to start Flow is to point your browser at the following URL: http://192.168.1.7:54321/. The right side of Flow captures every user action performed, under the OUTLINE tab. The actions taken can be edited and saved as named flows for reuse and collaboration, as shown in the figure below.

The figure below shows the interface for importing files from the local filesystem or HDFS, and displays detailed summary statistics as well as the next actions that can be performed on the dataset. Once the data is imported, it gets a data frame reference in the H2O framework with the extension .hex. The summary statistics are useful in understanding the characteristics of the data, such as missing values, mean, max, min, and so on.
It also provides an easy way to transform features from one type to another, for example, numeric features with a few unique values to categorical/nominal types, known as enum in H2O. The actions that can be performed on the datasets are:

- Visualize the data.
- Split the data into different sets such as training, validation, and testing.
- Build supervised and unsupervised models.
- Use the models to predict.
- Download and export the files in various formats.

Building supervised or unsupervised models in H2O is done through an interactive screen. Every modeling algorithm has its parameters classified into three sections: basic, advanced, and expert. Any parameter that supports hyper-parameter searches for tuning the model has a checkbox grid next to it, and more than one parameter value can be used.

Some basic parameters, such as training_frame, validation_frame, and response_column, are common to every supervised algorithm; others are specific to model types, such as the choice of solver for GLM, the activation function for deep learning, and so on. All such common parameters are available in the basic section. Advanced parameters are settings that afford greater flexibility and control to the modeler if the default behavior must be overridden. Several of these parameters are also common across some algorithms; two examples are the choice of method for assigning the fold index (if cross-validation was selected in the basic section) and selecting the column containing weights (if each example is weighted separately). Expert parameters define more complex elements such as how to handle missing values, model-specific parameters that need more than a basic understanding of the algorithms, and other esoteric variables.

In the figure below, GLM, a supervised learning algorithm, is being configured with 10-fold cross-validation, binomial (two-class) classification, the efficient L-BFGS optimization algorithm, and stratified sampling for the cross-validation split.

The model results screen contains a detailed analysis of the results using important evaluation charts, depending on the validation method that was used. At the top of the screen are possible actions that can be taken, such as running the model on unseen data for prediction, downloading the model in POJO format, exporting the results, and so on. Some of the charts are algorithm-specific, like the scoring history that shows how the training loss or the objective function changes over the iterations in GLM; this gives the user insight into the speed of convergence as well as into the tuning of the iterations parameter. We see the ROC curves and the Area Under Curve metric on the validation data, in addition to the gains and lift charts, which give the cumulative capture rate and cumulative lift over the validation sample respectively. The figure below shows the SCORING HISTORY, ROC CURVE, and GAINS/LIFT charts for GLM with 10-fold cross-validation on the CoverType dataset.

The output of validation gives detailed evaluation measures such as accuracy, AUC, err, errors, F1 measure, MCC (Matthews Correlation Coefficient), precision, and recall for each validation fold in the case of cross-validation, as well as the mean and standard deviation computed across all folds. The prediction action runs the model on unseen held-out data to estimate the out-of-sample performance. Important measures such as errors, accuracy, area under curve, ROC plots, and so on, are given as the output of predictions and can be saved or exported.
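The same GLM workflow can also be driven from the language bindings instead of Flow. The snippet below is a minimal sketch (not from the excerpt) using H2O's Python API, assuming a local cluster and a binarized CoverType-style dataset; the file name covtype_binary.csv, the response column name Cover_Type, and the memory setting are illustrative placeholders.

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Start (or connect to) a local H2O cluster, roughly equivalent to
# launching java -Xmx6g -jar h2o.jar and pointing Flow at it
h2o.init(max_mem_size="6g")

# Import data into an H2OFrame (.hex reference); path and column names are placeholders
frame = h2o.import_file("covtype_binary.csv")
frame["Cover_Type"] = frame["Cover_Type"].asfactor()  # assumed binary response, stored as an enum

train, valid = frame.split_frame(ratios=[0.8], seed=42)

# Binomial GLM with the L-BFGS solver, 10-fold cross-validation,
# and stratified fold assignment, mirroring the Flow configuration described above
glm = H2OGeneralizedLinearEstimator(family="binomial",
                                    solver="L_BFGS",
                                    nfolds=10,
                                    fold_assignment="Stratified")
features = [c for c in frame.columns if c != "Cover_Type"]
glm.train(x=features, y="Cover_Type",
          training_frame=train, validation_frame=valid)

print(glm.auc(xval=True))   # cross-validated AUC
preds = glm.predict(valid)  # score held-out data
```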
H2O is a rich analysis and visualization framework that is accessible from multiple programming environments (Java, R, Python, and Scala) and can read data from a variety of stores (HDFS, SQL, NoSQL, Amazon S3, and others). It also supports a number of Machine Learning algorithms that can be run in a cluster. All these factors make it one of the major Machine Learning frameworks for Big Data. If you found this post useful, be sure to check out our book Mastering Java Machine Learning to learn more about building predictive models for batch- and stream-based Big Data learning using the latest tools and methodologies.


10 machine learning algorithms every engineer needs to know

Aaron Lazar
30 Nov 2017
7 min read
When it comes to machine learning, it's all about the algorithms. But although machine learning algorithms are the bread and butter of a data scientist's job, it's not always as straightforward as simply picking up an algorithm and running with it. Algorithm selection is incredibly important and often very challenging. There are always a number of things you have to take into consideration, such as:

- Accuracy: While accuracy is important, it's not always necessary. In many cases an approximation is sufficient, in which case you shouldn't chase accuracy at the expense of processing time.
- Training time: This goes hand in hand with accuracy and is not the same for all algorithms. Training time may also go up if there are more parameters. When time is a big constraint, you should choose an algorithm wisely.
- Linearity: Algorithms that assume linearity expect the data trends to follow a linear path. While this is good for some problems, for others it can result in lowered accuracy.

Once you've taken those three considerations on board, you can start to dig a little deeper. Kaggle ran a survey in 2017 asking their readers which algorithms, or "data science methods" more broadly, respondents were most likely to use at work. Below is a screenshot of the results. Kaggle's research offers a useful insight into the algorithms actually being used by data scientists and data analysts today. But we've brought together the types of machine learning algorithms that are most important. Every algorithm is useful in different circumstances; the skill is knowing which one to use and when.

10 machine learning algorithms

Linear regression

This is clearly one of the most interpretable ML algorithms. It requires minimal tuning and is easy to explain, which is the key reason for its popularity. It shows the relationship between two or more variables and how a change in one of the independent variables impacts the dependent variable. It is used for forecasting sales based on trends, as well as for risk assessment. Although its accuracy is relatively low, the small number of parameters needed and short training times make it quite popular among beginners.

Logistic regression

Logistic regression is typically viewed as a special form of linear regression, where the output variable is categorical. It's generally used to predict a binary outcome, i.e. True or False, 1 or 0, Yes or No, for a set of independent variables. As you would have already guessed, this algorithm is generally used when the dependent variable is binary. Like linear regression, logistic regression has a low level of accuracy, few parameters, and short training times. It goes without saying that it's quite popular among beginners too.

Decision trees

These algorithms are mainly decision support tools that use tree-like graphs or models of decisions and possible consequences, including chance-event outcomes, utilities, and so on. To put it in simple words, a decision tree is the smallest number of yes/no questions that need to be asked in order to reach the right decision as often as possible. It lets you tackle the problem at hand in a structured, systematic way to logically deduce the outcome. Decision trees are excellent when it comes to accuracy, but their training times are a bit longer compared to other algorithms. They also require a moderate number of parameters, so arriving at a good combination is not too complicated.
Naive Bayes

This is a type of classification ML algorithm that's based on Bayes' probability theorem. It is one of the most popular learning algorithms. It groups similarities together and is usually used for document classification, facial recognition software, or for predicting diseases. It generally works well when you have a medium to large dataset to train your models. Naive Bayes classifiers have moderate training times and make use of linearity. While this is good, the linearity assumption might also bring down accuracy for certain problems. They also do not bank on too many parameters, making it easy to arrive at a good combination, although sometimes at the cost of accuracy.

Random forest

Without a doubt, this is a popular go-to machine learning algorithm. It creates a group of decision trees, each built on a random subset of the data, and can be used for both classification and regression. It is simple to use, as just a few lines of code are enough to implement the algorithm. It is used by banks to predict high-risk loan applicants, or by hospitals to predict whether a particular patient is likely to develop a chronic disease. With a high accuracy level and moderate training time, it is quite efficient to implement. Moreover, it has an average number of parameters.

K-Means

K-Means is a popular unsupervised algorithm used for cluster analysis; it is an iterative and non-deterministic method. It operates on a given dataset with a predefined number of clusters. The output of a K-Means algorithm will be k clusters, with the input data partitioned among these clusters. Big players like Google use K-Means to cluster pages by similarity and discover the relevance of search results. This algorithm has a moderate training time and good accuracy. It doesn't have many parameters, meaning that it's easy to arrive at the best possible combination.

K nearest neighbors

K nearest neighbors is a very popular machine learning algorithm which can be used for both regression and classification, although it's mostly used for the latter. Although it is simple, it is extremely effective. It takes little to no time to train, although its accuracy can be heavily degraded by high-dimensional data, since there is not much of a difference between the nearest neighbor and the farthest one.

Support vector machines

SVMs are among the supervised ML algorithms dealing with classification. They can be used for either regression or classification, in situations where the training dataset teaches the algorithm about specific classes so that it can then classify newly supplied data. What sets them apart from other machine learning algorithms is that they are able to separate classes faster and with less overfitting than several other classification algorithms. A few of the biggest pain points that have been addressed using SVMs are display advertising, image-based gender detection, and image classification with large feature sets. SVMs are moderate in their accuracy as well as their training times, mostly because they assume a linear approximation. On the other hand, they require an average number of parameters to get the work done.

Ensemble methods

Ensemble methods are techniques that build a set of classifiers and combine their predictions to classify new data points. Bayesian averaging was the original ensemble method, but newer algorithms include error-correcting output coding, etc.
Although ensemble methods allow you to devise sophisticated algorithms and produce results with a high level of accuracy, they are not preferred as much in industries where interpretability of the algorithm is more important. However, with their high level of accuracy, it makes sense to use them in fields like healthcare, where even the smallest improvement can add a lot of value.

Artificial neural networks

Artificial neural networks are so named because they mimic the functioning and structure of biological neural networks. In these algorithms, information flows through the network and, depending on the input and output, the neural network changes in response. One of the most common use cases for ANNs is speech recognition, as in voice-based services. As the information fed to them grows, these algorithms improve. However, artificial neural networks are imperfect. With great power come longer training times. They also have several more parameters than other algorithms. That being said, they are very flexible and customizable. A minimal code sketch that exercises a few of the algorithms covered above follows below.

If you want to skill up in implementing machine learning algorithms, you can check out the following books from Packt:

- Data Science Algorithms in a Week by Dávid Natingga
- Machine Learning Algorithms by Giuseppe Bonaccorso
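The following is a short, hedged sketch (not part of the original article) that trains several of the algorithms discussed above with scikit-learn on its built-in Iris dataset and prints their test accuracy. It is meant only to show how little code a first comparison takes; the dataset, the hyperparameters, and the resulting scores are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# A small, well-behaved dataset used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "K nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Support vector machine": SVC(kernel="rbf"),
}

# Fit each model and report its accuracy on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```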


Getting to know TensorFlow

Kartikey Pandey
29 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following book excerpt is from the title Machine Learning Algorithms by Giuseppe Bonaccorso. The book describes important Machine Learning algorithms commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. A few famous topics covered in the book are linear regression, logistic regression, SVM, Naive Bayes, K-Means, random forest, TensorFlow, and feature engineering.[/box]

Here, in the article, we look at understanding one of the most important deep learning libraries, TensorFlow, with contextual examples.

Brief introduction to TensorFlow

TensorFlow is a computational framework created by Google and has become one of the most widely used deep-learning toolkits. It can work with both CPUs and GPUs and already implements most of the operations and structures required to build and train a complex model. TensorFlow can be installed as a Python package on Linux, Mac, and Windows (with or without GPU support); however, we suggest you follow the instructions provided on the website to avoid common mistakes.

The main concept behind TensorFlow is the computational graph, a set of subsequent operations that transform an input batch into the desired output. In the following figure, there's a schematic representation of a graph. Starting from the bottom, we have two input nodes (a and b), a transpose operation (that works on b), a matrix multiplication, and a mean reduction. The init block is a separate operation, which is formally part of the graph but not directly connected to any other node; therefore it's autonomous (indeed, it's a global initializer).

As this is only a brief introduction, it's useful to list all of the most important strategic elements needed to work with TensorFlow, so as to be able to build a few simple examples that can show the enormous potential of this framework:

- Graph: This represents the computational structure that connects a generic input batch with the output tensors through a directed network made of operations. It's defined as a tf.Graph() instance and normally used with a Python context manager.
- Placeholder: This is a reference to an external variable, which must be explicitly supplied when it's requested for the output of an operation that uses it directly or indirectly. For example, a placeholder can represent a variable x, which is first transformed into its squared value and then summed to a constant value. The output is then x^2 + c, which is materialized by passing a concrete value for x. It's defined as a tf.placeholder() instance.
- Variable: An internal variable used to store values which are updated by the algorithm. For example, a variable can be a vector containing the weights of a logistic regression. It's normally initialized before a training process and automatically modified by the built-in optimizers. It's defined as a tf.Variable() instance. A variable can also be used to store elements which must not be considered during training processes; in this case, it must be declared with the parameter trainable=False.
- Constant: A constant value defined as a tf.constant() instance.
- Operation: A mathematical operation that can work with placeholders, variables, and constants. For example, the multiplication of two matrices is an operation defined by tf.matmul(). Among all operations, gradient calculation is one of the most important.
TensorFlow allows us to determine the gradients starting from a given point in the computational graph, back to the origin or to another point that must logically precede it. We're going to see an example of this operation shortly.

- Session: This is a sort of wrapper interface between TensorFlow and our working environment (for example, Python or C++). When the evaluation of a graph is needed, this macro-operation will be managed by a session, which must be fed with all placeholder values and will produce the required outputs using the requested devices. For our purposes, it's not necessary to go deeper into this concept; however, I invite the reader to retrieve further information from the website or from one of the resources listed at the end of this chapter. It's declared as an instance of tf.Session() or, as we're going to do, an instance of tf.InteractiveSession(). This type of session is particularly useful when working with notebooks or shell commands, because it places itself automatically as the default one.
- Device: A physical computational device, such as a CPU or a GPU. It's declared explicitly through an instance of the class tf.device() and used with a context manager. When the architecture contains more computational devices, it's possible to split the jobs so as to parallelize many operations. If no device is specified, TensorFlow will use the default one (which is the main CPU or a suitable GPU if all the necessary components are installed).

Let's now analyze this with a simple example about computing gradients.

Computing gradients

The option to compute the gradients of all output tensors with respect to any connected input or node is one of the most interesting features of TensorFlow, because it allows us to create learning algorithms without worrying about the complexity of all the transformations. In this example, we first define a linear dataset representing the function f(x) = x in the range (-100, 100):

import numpy as np
>>> nb_points = 100
>>> X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

The corresponding plot is shown in the following figure. Now we want to use TensorFlow to compute y = x^3 together with its first and second derivatives. The first step is defining a graph:

import tensorflow as tf
>>> graph = tf.Graph()

Within the context of this graph, we can define our input placeholder and other operations:

>>> with graph.as_default():
>>>     Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
>>>     Y = tf.pow(Xt, 3.0, name='x_3')
>>>     Yd = tf.gradients(Y, Xt, name='dx')
>>>     Yd2 = tf.gradients(Yd, Xt, name='d2x')

A placeholder is generally defined with a type (first parameter), a shape, and an optional name. We've decided to use the tf.float32 type because this is the only type also supported by GPUs. Selecting shape=(None, 1) means that it's possible to use any bidimensional vector with the second dimension equal to 1. The first operation computes the third power of Xt, working on all elements. The second operation computes all the gradients of Y with respect to the input placeholder Xt. The last operation will repeat the gradient computation, but in this case, it uses Yd, which is the output of the first gradient operation.

We can now pass some concrete data to see the results. The first thing to do is create a session connected to this graph:

>>> session = tf.InteractiveSession(graph=graph)

By using this session, we can request any computation using the method run().
All the input parameters must be supplied through a feed dictionary, where the key is the placeholder, while the value is the actual array:

>>> X2, dX, d2X = session.run([Y, Yd, Yd2], feed_dict={Xt: X.reshape((nb_points*2, 1))})

We needed to reshape our array to be compliant with the placeholder. The first argument of run() is a list of tensors that we want to be computed. In this case, we need all operation outputs. The plot of each of them is shown in the following figure. As expected, they represent, respectively, x^3, 3x^2, and 6x.

Further on in the book, we look at a slightly more complex example that implements a logistic regression algorithm. Refer to Chapter 14, Brief Introduction to Deep Learning and TensorFlow, of Machine Learning Algorithms to read the complete chapter.
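For reference, the snippets above can be gathered into one self-contained script. This is a sketch assuming TensorFlow 1.x, where tf.placeholder and tf.InteractiveSession are available; in TensorFlow 2.x the same computation would be expressed eagerly with tf.GradientTape. The sanity-check values at the end follow directly from y = x^3.

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

nb_points = 100
X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

graph = tf.Graph()
with graph.as_default():
    # Column vector of inputs
    Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
    # y = x^3 and its first and second derivatives
    Y = tf.pow(Xt, 3.0, name='x_3')
    Yd = tf.gradients(Y, Xt, name='dx')
    Yd2 = tf.gradients(Yd, Xt, name='d2x')

session = tf.InteractiveSession(graph=graph)
X2, dX, d2X = session.run([Y, Yd, Yd2],
                          feed_dict={Xt: X.reshape((nb_points * 2, 1))})
session.close()

# Near x = 2 the three outputs should be close to 8 (x^3), 12 (3x^2), and 12 (6x);
# the grid does not contain 2.0 exactly, so the printed values are approximate
i = int(np.argmin(np.abs(X - 2.0)))
print(X2[i], dX[0][i], d2X[0][i])
```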

Building a classification system with logistic regression in OpenCV

Savia Lobo
28 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Michael Beyeler titled Machine Learning for OpenCV. The code and related files are available on Github here.[/box]

A famous dataset in the world of machine learning is called the Iris dataset. The Iris dataset contains measurements of 150 iris flowers from three different species: setosa, versicolor, and virginica. These measurements include the length and width of the petals, and the length and width of the sepals, all measured in centimeters.

Understanding logistic regression

Despite its name, logistic regression can actually be used as a model for classification. It uses a logistic function (or sigmoid), σ(x) = 1 / (1 + e^(-x)), to convert any real-valued input x into a predicted output value ŷ that takes values between 0 and 1, as shown in the following figure:

The logistic function

Rounding ŷ to the nearest integer effectively classifies the input as belonging either to class 0 or class 1. Of course, most often, our problems have more than one input or feature value, x. For example, the Iris dataset provides a total of four features. For the sake of simplicity, let's focus here on the first two features: sepal length, which we will call feature f1, and sepal width, which we will call f2. Using the tricks we learned when talking about linear regression, we know we can express the input x as a linear combination of the two features, f1 and f2: x = w1 f1 + w2 f2. However, in contrast to linear regression, we are not done yet. From the previous section, we know that the sum of products would result in a real-valued output, but we are interested in a categorical value, zero or one. This is where the logistic function comes in: it acts as a squashing function, σ, that compresses the range of possible output values to the range [0, 1], so that ŷ = σ(x).

[box type="shadow" align="" class="" width=""]Because the output is always between 0 and 1, it can be interpreted as a probability. If we only have a single input variable x, the output value ŷ can be interpreted as the probability of x belonging to class 1.[/box]

Now let's apply this knowledge to the Iris dataset!

Loading the training data

The Iris dataset is included with scikit-learn. We first load all the necessary modules, as we did in our earlier examples:

In [1]: import numpy as np
...     import cv2
...     from sklearn import datasets
...     from sklearn import model_selection
...     from sklearn import metrics
...     import matplotlib.pyplot as plt
...     %matplotlib inline
In [2]: plt.style.use('ggplot')

Then, loading the dataset is a one-liner:

In [3]: iris = datasets.load_iris()

This function returns a dictionary we call iris, which contains a bunch of different fields:

In [4]: dir(iris)
Out[4]: ['DESCR', 'data', 'feature_names', 'target', 'target_names']

Here, all the data points are contained in 'data'. There are 150 data points, each of which has four feature values:

In [5]: iris.data.shape
Out[5]: (150, 4)

These four features correspond to the sepal and petal dimensions mentioned earlier:

In [6]: iris.feature_names
Out[6]: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

For every data point, we have a class label stored in target:

In [7]: iris.target.shape
Out[7]: (150,)

We can also inspect the class labels, and find that there is a total of three classes:

In [8]: np.unique(iris.target)
Out[8]: array([0, 1, 2])

Making it a binary classification problem

For the sake of simplicity, we want to focus on a binary classification problem for now, where we only have two classes.
The easiest way to do this is to discard all data points belonging to a certain class, such as class label 2, by selecting all the rows that do not belong to class 2:

In [9]: idx = iris.target != 2
...     data = iris.data[idx].astype(np.float32)
...     target = iris.target[idx].astype(np.float32)

Inspecting the data

Before you get started with setting up a model, it is always a good idea to have a look at the data. We did this earlier for the town map example, so let's continue our streak. Using Matplotlib, we create a scatter plot where the color of each data point corresponds to the class label:

In [10]: plt.scatter(data[:, 0], data[:, 1], c=target, cmap=plt.cm.Paired, s=100)
...      plt.xlabel(iris.feature_names[0])
...      plt.ylabel(iris.feature_names[1])
Out[10]: <matplotlib.text.Text at 0x23bb5e03eb8>

To make plotting easier, we limit ourselves to the first two features (iris.feature_names[0] being the sepal length and iris.feature_names[1] being the sepal width). We can see a nice separation of classes in the following figure:

Plotting the first two features of the Iris dataset

Splitting the data into training and test sets

We learned in the previous chapter that it is essential to keep training and test data separate. We can easily split the data using one of scikit-learn's many helper functions:

In [11]: X_train, X_test, y_train, y_test = model_selection.train_test_split(
...          data, target, test_size=0.1, random_state=42
...      )

Here we want to split the data into 90 percent training data and 10 percent test data, which we specify with test_size=0.1. By inspecting the return arguments, we note that we ended up with exactly 90 training data points and 10 test data points:

In [12]: X_train.shape, y_train.shape
Out[12]: ((90, 4), (90,))
In [13]: X_test.shape, y_test.shape
Out[13]: ((10, 4), (10,))

Training the classifier

Creating a logistic regression classifier involves pretty much the same steps as setting up k-NN:

In [14]: lr = cv2.ml.LogisticRegression_create()

We then have to specify the desired training method. Here, we can choose cv2.ml.LogisticRegression_BATCH or cv2.ml.LogisticRegression_MINI_BATCH. For now, all we need to know is that we want to update the model after every data point, which can be achieved with the following code:

In [15]: lr.setTrainMethod(cv2.ml.LogisticRegression_MINI_BATCH)
...      lr.setMiniBatchSize(1)

We also want to specify the number of iterations the algorithm should run before it terminates:

In [16]: lr.setIterations(100)

We can then call the training method of the object (in the exact same way as we did earlier), which will return True upon success:

In [17]: lr.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
Out[17]: True

As we just saw, the goal of the training phase is to find a set of weights that best transform the feature values into an output label. A single data point is given by its four feature values (f0, f1, f2, f3). Since we have four features, we should also get four weights, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3, and ŷ = σ(x). However, as discussed previously, the algorithm adds an extra weight that acts as an offset or bias, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3 + w4. We can retrieve these weights as follows:

In [18]: lr.get_learnt_thetas()
Out[18]: array([[-0.04109113, -0.01968078, -0.16216497, 0.28704911, 0.11945518]], dtype=float32)

This means that the input to the logistic function is x = -0.0411 f0 - 0.0197 f1 - 0.162 f2 + 0.287 f3 + 0.119.
Then, when we feed in a new data point (f0, f1, f2, f3) that belongs to class 1, the output ŷ = σ(x) should be close to 1. But how well does that actually work?

Testing the classifier

Let's see for ourselves by calculating the accuracy score on the training set:

In [19]: ret, y_pred = lr.predict(X_train)
In [20]: metrics.accuracy_score(y_train, y_pred)
Out[20]: 1.0

Perfect score! However, this only means that the model was able to perfectly memorize the training dataset. This does not mean that the model would be able to classify a new, unseen data point. For this, we need to check the test dataset:

In [21]: ret, y_pred = lr.predict(X_test)
...      metrics.accuracy_score(y_test, y_pred)
Out[21]: 1.0

Luckily, we get another perfect score! Now we can be sure that the model we built is truly awesome. If you enjoyed building a classifier using logistic regression and would like to learn about more machine learning tasks using OpenCV, be sure to check out the book, Machine Learning for OpenCV, where this section originally appears.
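As a complement to the excerpt (not part of the original), the short sketch below applies the sigmoid to the weighted feature sum by hand and checks that thresholding at 0.5 reproduces lr.predict on the test set. It assumes the notebook state from the excerpt (a trained lr and the X_test array) and, like the text above, treats the last value returned by get_learnt_thetas() as the bias term; the exact ordering of the thetas may differ between OpenCV versions, so verify it against your installation.

```python
import numpy as np

def sigmoid(x):
    # squashing function used by logistic regression
    return 1.0 / (1.0 + np.exp(-x))

# Weights learnt by the OpenCV classifier, flattened to a 1-D array of length 5
thetas = lr.get_learnt_thetas().ravel()
w, bias = thetas[:4], thetas[4]   # assumption: the last theta is the bias term

# Manual prediction: y_hat = sigmoid(f . w + bias), thresholded at 0.5
scores = sigmoid(X_test @ w + bias)
manual_labels = (scores > 0.5).astype(np.float32)

# Compare against OpenCV's own predictions
_, y_pred = lr.predict(X_test)
print(np.array_equal(manual_labels, y_pred.ravel()))
```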


How to build a Scatterplot in IBM SPSS

Kartikey Pandey
27 Nov 2017
4 min read
[box type="note" align="" class="" width=""]The following excerpt is from the title Data Analysis with IBM SPSS Statistics, Chapter 5, written by Kenneth Stehlik-Barry and Anthony J. Babinec. Analytical tools such as SPSS can readily provide even a novice user with an overwhelming amount of information and a broad range of options for analyzing patterns in the data.[/box]

In this article, we help you learn the SPSS techniques for building a scatterplot using the Chart Builder feature.

One of the most valuable methods for examining the relationship between two variables containing scale-level data is a scatterplot. In the previous chapter, scatterplots were used to detect points that deviated from the typical pattern, that is, multivariate outliers. To produce a similar scatterplot using two fields from the 2016 General Social Survey data, navigate to Graphs | Chart Builder. An information box is displayed indicating that each field's measurement properties will be used to identify the types of graphs available, so adjusting these properties is advisable. In this example, the properties will be modified as a part of the graph specification process, but you may want to alter the properties of some variables permanently so that they don't need to be changed for each use. For now, just select OK to move ahead.

In the main Chart Builder window, select Scatter/Dot from the menu at the lower left, double-click on the first graph to the right (Simple Scatter) to place it in the preview pane at the upper right, and then right-click on the first field labeled HIGHEST YEAR OF SCHOOL. Change this variable from Nominal to Scale, as shown in the following screenshot. After changing the respondent's education to Scale, drag this field to the X-Axis location in the preview pane and drag spouse's education to the Y-Axis location. Once both elements are in place, the OK choice will become available. Select it to produce the scatterplot shown in the following screenshot.

The scatterplot produced by default provides some sense of the trend in that the denser circles are concentrated in a band from the lower left to the upper right. This pattern, however, is rather subtle visually. With some editing, the relationship can be made more evident. Double-click on the graph to open the Chart Editor, select the X icon at the top, and change the major increment to 4 so that there are numbers corresponding to completing high school and college. Do the same for the y-axis values. Select a point on the graph to highlight all the "dots" and right-click to display the following dialog. Click on the Marker tab and change the symbol to the star shape, increase the size to 6, increase the border to 2, and change the border color to a dark blue. Use Apply to make the changes visible on the scatterplot.

Use the Add Fit Line at Total icon above the graph to show the regression line for this data. Drag the R2 box from the upper right to the bottom, below the graph, and drag the box with the equation displayed to the lower left, away from the points.

The modifications to the original scatterplot make it easier to see the pattern, since the "stars" near the line are darker and denser than those farther from the line, indicating that fewer cases are associated with the points away from the line. The SPSS scatterplot capabilities covered in this article will give you a foundation for creating visual representations of data, both for deeper pattern discovery and for communicating results to a broader audience.
Several other graph types, such as pie charts and multiple line charts, can be built and edited using the approach shown in Chapter 5, Visually Exploring the Data, from our title Data Analysis with IBM SPSS Statistics. Go on, explore these alternative graph styles to see when they may be better suited to your needs.
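If you want to reproduce a comparable chart outside of SPSS, the following is a small, hedged sketch of the same idea in Python with matplotlib: a scatterplot of respondent's versus spouse's years of schooling with a fitted regression line and its R². The file name gss2016.csv and the column names educ and speduc are assumptions standing in for however your own GSS extract names these fields.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names for respondent's and spouse's years of schooling
df = pd.read_csv("gss2016.csv")[["educ", "speduc"]].dropna()
x, y = df["educ"].to_numpy(), df["speduc"].to_numpy()

# Scatterplot with star markers, echoing the edited SPSS chart
plt.scatter(x, y, marker="*", s=60, color="darkblue")

# Ordinary least-squares fit line and its R^2
slope, intercept = np.polyfit(x, y, 1)
r2 = np.corrcoef(x, y)[0, 1] ** 2
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, slope * xs + intercept, color="black",
         label=f"y = {slope:.2f}x + {intercept:.2f} (R^2 = {r2:.2f})")

# Major increments of 4, matching the Chart Editor step above
plt.xticks(np.arange(0, x.max() + 1, 4))
plt.yticks(np.arange(0, y.max() + 1, 4))
plt.xlabel("Highest year of school (respondent)")
plt.ylabel("Highest year of school (spouse)")
plt.legend()
plt.show()
```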


Highest Paying Data Science Jobs in 2017

Amey Varangaonkar
27 Nov 2017
10 min read
It is no secret that this is the age of data. More data has been created in the last 2 years than ever before. Within the dumps of data created every second, businesses are looking for useful, action worthy insights which they can use to enhance their processes and thereby increase their revenue and profitability. As a result, the demand for data professionals, who sift through terabytes of data for accurate analysis and extract valuable business insights from it, is now higher than ever before. Think of Data Science as a large tree from which all things related to data branch out - from plain old data management and analysis to Big Data, and more.  Even the recently booming trends in Artificial Intelligence such as machine learning and deep learning are applied in many ways within data science. Data science continues to be a lucrative and growing job market in the recent years, as evidenced by the graph below: Source: Indeed.com In this article, we look at some of the high paying, high trending job roles in the data science domain that you should definitely look out for if you’re considering data science as a serious career opportunity. Let’s get started with the obvious and the most popular role. Data Scientist Dubbed as the sexiest job of the 21st century, data scientists utilize their knowledge of statistics and programming to turn raw data into actionable insights. From identifying the right dataset to cleaning and readying the data for analysis, to gleaning insights from said analysis, data scientists communicate the results of their findings to the decision makers. They also act as advisors to executives and managers by explaining how the data affects a particular product or process within the business so that appropriate actions can be taken by them. Per Salary.com, the median annual salary for the role of a data scientist today is $122,258, with a range between $106,529 to $137,037. The salary is also accompanied by a whole host of benefits and perks which vary from one organization to the other, making this job one of the best and the most in-demand, in the job market today. This is a clear testament to the fact that an increasing number of businesses are now taking the value of data seriously, and want the best talent to help them extract that value. There are over 20,000 jobs listed for the role of data scientist, and the demand is only growing. Source: Indeed.com To become a data scientist, you require a bachelor’s or a master’s degree in mathematics or statistics and work experience of more than 5 years in a related field. You will need to possess a unique combination of technical and analytical skills to understand the problem statement and propose the best solution, good programming skills to develop effective data models, and visualization skills to communicate your findings with the decision makers. Interesting in becoming a data scientist? Here are some resources to help you get started: Principles of Data Science Data Science Algorithms in a Week Getting Started with R for Data Science [Video] For a more comprehensive learning experience, check out our skill plan for becoming a data scientist on Mapt, our premium skills development platform. Data Analyst Probably a term you are quite familiar with, Data Analysts are responsible for crunching large amounts of data and analyze it to come to appropriate logical conclusions. 
Whether it’s related to pure research or working with domain-specific data, a data analyst’s job is to help the decision-makers’ job easier by giving them useful insights. Effective data management, analyzing data, and reporting results are some of the common tasks associated with this role. How is this role different than a data scientist, you might ask. While data scientists specialize in maths, statistics and predictive analytics for better decision making, data analysts specialize in the tools and components of data architecture for better analysis. Per Salary.com, the median annual salary for an entry-level data analyst is $55,804, and the range usually falls between $50,063 to $63,364 excluding bonuses and benefits. For more experienced data analysts, this figure rises to around a mean annual salary of $88,532. With over 83,000 jobs listed on Indeed.com, this is one of the most popular job roles in the data science community today. This profile requires a pretty low starting point, and is justified by the low starting salary packages. As you gain more experience, you can move up the ladder and look at becoming a data scientist or a data engineer. Source: Indeed.com You may also come across terms such as business data analyst or simply business analyst which are sometimes interchangeably used with the role of a data analyst. While their primary responsibilities are centered around data crunching, business analysts model company infrastructure, while data analysts model business data structures. You can find more information related to the differences in this interesting article. If becoming a data analyst is something that interests you, here are some very good starting points: Data Analysis with R Python Data Analysis, Second Edition Learning Python Data Analysis [Video] Data Architect Data architects are responsible for creating a solid data management blueprint for an organization. They are primarily responsible for designing the data architecture and defining how data is stored, consumed and managed by different applications and departments within the organization. Because of these critical responsibilities, a data architect’s job is a very well-paid one. Per Salary.com, the median annual salary for an entry-level data architect is $74,809, with a range between $57,964 to $91,685. For senior-level data architects, the median annual salary rises up to $136,856, with a range usually between $121,969 to $159,212. These high figures are justified by the critical nature of the role of a data architect - planning and designing the right data infrastructure after understanding the business considerations to get the most value out of the data. At present, there are over 23,000 jobs for the role listed on Indeed.com, with a stable trend in job seeker interest, as shown: Source: Indeed.com To become a data architect, you need a bachelor’s degree in computer science, mathematics, statistics or a related field, and loads of real-world skills to qualify for even the entry-level positions. Technical skills such as statistical modeling, knowledge of languages such as Python and R, database architectures, Hadoop-based skills, knowledge of NoSQL databases, and some machine learning and data mining are required to become a data architect. You also need strong collaborative skills, problem-solving, creativity and the ability to think on your feet, to solve the trickiest of problems on the go. Suffice to say it’s not an easy job, but it is definitely a lucrative one! 
Get ahead of the curve, and start your journey to becoming a data architect now: Big Data Analytics Hadoop Blueprints PostgreSQL for Data Architects Data Engineer Data engineers or Big Data engineers are a crucial part of the organizational workforce and work in tandem with data architects and data scientists to ensure appropriate data management systems are deployed and the right kind of data is being used for analysis. They deal with messy, unstructured Big Data and strive to provide clean, usable data to the other teams within the organization. They build high-performance analytics pipelines and develop set of processes for efficient data mining. In many companies, the role of a data engineer is closely associated with that of a data architect. While an architect is responsible for the planning and designing stages of the data infrastructure project, a data engineer looks after the construction, testing and maintenance of the infrastructure. As such data engineers tend to have a more in-depth understanding of different data tools and languages than data architects. There are over 90,000 jobs listed on Indeed.com, suggesting there is a very high demand in the organizations for this kind of a role. An entry level data engineer has a median annual salary of $90,083 per Payscale.com, with a range of $60,857 to $131,851. For Senior Data Engineers, the average salary shoots up to $123,749 as per Glassdoor estimates. Source: Indeed.com With the unimaginable rise in the sheer volume of data, the onus is on the data engineers to build the right systems that empower the data analysts and data scientists to sift through the messy data and derive actionable insights from it. If becoming a data engineer is something that interests you, here are some of our products you might want to look at: Real-Time Big Data Analytics Apache Spark 2.x Cookbook Big Data Visualization You can also check out our detailed skill plan on becoming a Big Data Engineer on Mapt. Chief Data Officer There is a countless number of organizations that build their businesses on data, but don’t manage it that well. This is where a senior executive popularly known as the Chief Data Officer (CDO) comes into play - bearing the responsibility for implementing the organization’s data and information governance and assisting with data-driven business strategies. They are primarily responsible for ensuring that their organization gets the most value out of their data and put appropriate plans in place for effective data quality and its life-cycle management. The role of a CDO is one of the most lucrative and highest paying jobs in the data science frontier. An average median annual pay for a CDO per Payscale.com is around $192,812. Indeed.com lists just over 8000 job postings too - this is not a very large number, but understandable considering the recent emergence of the role and because it’s a high profile, C-suite job. Source: Indeed.com According to a Gartner research, almost 50% companies in a variety of regulated industries will have a CDO in place, by 2017. Considering the demand for the role and the fact that it is only going to rise in the future, the role of a CDO is one worth vying for.   To become a CDO, you will obviously need a solid understanding of statistical, mathematical and analytical concepts. Not just that, extensive and high-value experience in managing technical teams and information management solutions is also a prerequisite. 
Along with a thorough understanding of the various Big Data tools and technologies, you will need strong communication skills and deep understanding of the business. If you’re planning to know more about how you can become a Chief Data Officer, you can browse through our piece on the role of CDO. Why demand for data science professionals will rise It’s hard to imagine an organization which doesn’t have to deal with data, but it’s harder to imagine the state of an organization with petabytes of data and not knowing what to do with it. With the vast amounts of data, organizations deal with these days, the need for experts who know how to handle the data and derive relevant and timely insights from it is higher than ever. In fact, IBM predicts there’s going to be a severe shortage of data science professionals, and thereby, a tremendous growth in terms of job offers and advertised openings, by 2020. Not everyone is equipped with the technical skills and know-how associated with tasks such as data mining, machine learning and more. This is slowly creating a massive void in terms of talent that organizations are looking to fill quickly, by offering lucrative salaries and added benefits. Without the professional expertise to turn data into actionable insights, Big Data becomes all but useless.      

A mid-autumn Shopper’s dream - What an Amazon fulfilled Thanksgiving would look like

Aaron Lazar
24 Nov 2017
10 min read
I’d been preparing for Thanksgiving a good 3 weeks in advance. One reason is that I’d recently rented out a new apartment and the troops were heading over to my place this year. I obviously had to make sure everything went well and for that, trust me, there was no resting even for a minute! Thanksgiving is really about being thankful for the people and things in your life and spending quality time with family. This Thanksgiving I’m especially grateful to Amazon for making it the best experience ever! Read on to find out how Amazon made things awesome! Good times started two weeks ago when I was at the AmazonGo store with my friend, Sue. [embed]https://www.youtube.com/watch?v=NrmMk1Myrxc[/embed] In fact, this was the first time I had set foot in one of the stores. I wanted to see what was so cool about them and why everyone had been talking about them for so long! The store was pretty big and lived up to the A to Z concept, as far as I could see. The only odd thing was that I didn’t notice any queues or a billing counter. Sue glided around the floor with ease, as if she did this every day. I was more interested in seeing what was so special about this place. After she got her stuff, she headed straight for the door. I smiled to myself thinking how absent minded she was. So I called her back and reminded her “You haven’t gotten your products billed.” She smiled back at me and shrugged, “I don’t need to.” Before I could open my mouth to tell her off for stealing, she explained to me about the store. It’s something totally futuristic! Have you ever imagined not having to stand in a line to buy groceries? At the store, you just had to log in to your AmazonGo app on your phone, enter the store, grab your stuff and then leave. The sensors installed everywhere in the store automatically detected what you’d picked up and would bill you accordingly. They also used Computer Vision and Deep Learning to track people and their shopping carts. Now that’s something! And you even got a receipt! Well, it was my birthday last week and knowing what an avid reader I was, my colleagues from office gifted me a brand new Kindle. I loved every bit of it, but the best part was the X-ray feature. With X-ray, you could simply get information about a character, person or terms in a book. You could also scroll through the long lists of excerpts and click on one to go directly to that particular portion of the book! That’s really amazing, especially if you want to read a particular part of the book quickly. It came in use at the right time - I downloaded a load of recipe books for the turkey. Another feather in the cap for Amazon! Talking about feathers in one’s cap, you won’t believe it, but Amazon actually got me rekognised at work a few days ago. Nah, that wasn’t a typo. I worked as a software developer/ML engineer in a startup and I’d been doing this for as long as I can remember. I recently built this cool mobile application that recognized faces and unlocked your phone even when you didn’t have something like Face ID on your phone and the app had gotten us a million downloads in a month! It could also recognize and give you information about the ethnicity of a person if you captured their photograph with the phone’s camera. The trick was that I’d used the AmazonRekognition APIs for enhanced face detection in the application. Rekognition allows you to detect objects, scenes, text, and faces, using highly scalable, deep learning models. I also enhanced the application using the Polly API. 
Polly converts text to whichever language you want the speech in and gives you the synthesized speech in the form of audio files.The app I built now converted input text into 18 different languages, helping one converse with the person in front of them in that particular language, should they have a problem doing it in English. I got that long awaited promotion right after! Ever wondered how I got the new apartment? ;) Since the folks were coming over to my place in a few days, I thought I’d get a new dinner set. You’d probably think I would need to sit down at my PC or probably pick up my phone to search for a set online, but I had better things to do. Thanks to Alexa, I simply needed to ask her to find one for me and she did it brilliantly. Now Alexa isn’t my girlfriend, although I would have loved that to be. Alexa is actually Amazon’s cloud-based voice service that provides customers with an engaging way of interacting with technology. Alexa is blessed with finely tuned ASR or Automatic Speech Recognition and NLU or Natural Language Understanding engines, that instantly recognize and respond to voice requests. I selected a pretty looking set and instantly bought it through my Prime account. With technology like this at my fingertips, the developer in me had left no time in exploring possibilities with Alexa. That’s when I found out about Lex, built on the same deep learning platform that Alexa works on, which allows developers to build conversational interfaces into their apps. With the dinner set out of the way, I sat back with my feet up on the table. I was awesome, baby! Oh crap! I forgot to buy the turkey, the potatoes, the wine and a whole load of other stuff. It was 3 AM and I started panicking. I remembered that mum always put the turkey in the fridge at least 3 days in advance. I had only 2! I didn’t even have the time to make it to the AmazonGo store. I was panicking again and called up Suzy to ask her if she could pick up the stuff for me. She sounded so calm over the phone when I narrated my horror to her. She simply told me to get the stuff from AmazonFresh. So I hastily disconnected the call and almost screamed to Alexa, “Alexa, find me a big bird!”, and before I realized what I had said, I was presented with this. [caption id="attachment_2215" align="aligncenter" width="184"] Big Bird is one of the main protagonist in Sesame Street.[/caption] So I tried again, this time specifying what I actually needed! With AmazonDash integrating with AmazonFresh, I was able to get the turkey and other groceries delivered home in no time! What a blessing, indeed! A day before Thanksgiving, I was stuck in the office, working late on a new project. We usually tinkered around with a lot of ML and AI stuff. There was this project which needed the team to come up with a really innovative algorithm to perform a deep learning task. As the project lead, I was responsible for choosing the tech stack and I’m glad a little birdie had recently told me about AWS taking in MXNet officially as a Deep Learning Framework. MXNet made it a breeze to build ML applications that train quickly and could run anywhere. Moreover, with the recent collaboration between Amazon and Microsoft, a new ML library called Gluon was born. Available in MXNet, Gluon made building ML models, even more, easier and quicker, without compromising on performance. Need I say the project was successful? I got home that evening and sat down to pick a good flick or two to download from Amazon PrimeVideo. 
There’s always someone in the family who’d suggest we all watch a movie and I had to be prepared. With that done I quickly showered and got to bed. It was going to be a long day the next day! 4 AM my alarm rang and I was up! It was Thanksgiving, and what a wonderful day it was! I quickly got ready and prepared to start cooking. I got the bird out of the freezer and started to thaw it in cold water. It was a big bird so it was going to take some time. In the meantime, I cleaned up the house and then started working on the dressing. Apples, sausages, and cranberry. Yum! As I sliced up the sausages I realized that I had misjudged the quantity. I needed to get a couple more packets immediately! I had to run to the grocery store right away or there would be a disaster! But it took me a few minutes to remember it was Thanksgiving, one of the craziest days to get out on the road. I could call the store delivery guy or probably Amazon Dash, but then that would be illogical cos he’d have to take the same congested roads to get home.  I turned to Alexa for help, “Alexa, how do I get sausages delivered home in the next 30 minutes?”. And there I got my answer - Try Amazon PrimeAir. Now I don’t know about you, but having a drone deliver a couple packs of sausages to my house, is nothing less than ecstatic! I sat it out near the window for the next 20 minutes, praying that the package wouldn’t be intercepted by some hungry birds! I couldn’t miss the sight of the pork flying towards my apartment. With the dressing and turkey baked and ready, things were shaping up much better than I had expected. The folks started rolling in by lunchtime. Mum and dad were both quite impressed with the way I had organized things. I was beaming and in my mind hi-fived Amazon for helping me make everything possible with its amazing products and services designed to delight customers. It truly lives up to its slogan: Work hard. Have fun. Make history. If you are one of those folks who do this every day, behind the scenes, by building amazing products powered by machine learning and big data to make other's lives better, I want to thank you today for all your hard work. This Thanksgiving weekend, Packt's offering an unbelievable deal - Buy any book or video for just $10 or any three for $25! I know what I have my eyes on! Python Machine Learning - Second Edition by Sebastian Raschka and Vahid Mirjalili Effective Amazon Machine Learning by Alexis Perrier OpenCV 3 - Advanced Image Detection and Reconstruction [Video] by Prof. Robert Laganiere In the end, there’s nothing better than spending quality time with your family, enjoying a sumptuous meal, watching smiles all around and just being thankful for all you have. All I could say was, this Thanksgiving was truly Amazon fulfilled! :) Happy Thanksgiving folks!    


How to create 3D Graphics and Animation in R

Savia Lobo
23 Nov 2017
4 min read
[box type="note" align="alignright" class="" width=""]This article is extracted from our book R Data Analysis Cookbook - Second Edition by Kuntal Ganguly. Data analytics with R has emerged as a very important focus for organizations of all kinds. The book will show how you can put your data analysis skills in R to practical use. It contains recipes catering to basic as well as advanced data analysis tasks.[/box]

In this article, we have captured one aspect of using R for the creation of 3D graphics and animation. When a two-dimensional view is not sufficient to understand and analyze the data, an additional data dimension, beyond the x and y variables, can be represented by a color variable. Our article gives a step-by-step explanation of how the plot3D package in R is used to visualize three-dimensional graphs. Code and data files are readily available for download towards the end of the post.

Get set ready

Make sure you have downloaded the code and that the mtcars.csv file is located in the working directory of R. Install and load the latest version of the plot3D package:

> install.packages("plot3D")
> library(plot3D)

How to do it...

1. Load the mtcars dataset and preprocess it to add row names and remove the model name column:

> mtcars=read.csv("mtcars.csv")
> rownames(mtcars) <- mtcars$X
> mtcars$X=NULL
> head(mtcars)

2. Next, create a three-dimensional scatter plot:

> scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, clab = c("Miles/(US) gallon"))

3. Now, add a title and axis labels to the scatter plot:

> scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, pch = 18, theta = 20, phi = 20, main = "Motor Trend Car Road Tests", xlab = "Weight lbs", ylab = "Displacement (cu.in.)", zlab = "Miles gallon")

4. Rotating a 3D plot can provide a complete view of the data. Now view the plot in different directions by altering the values of two attributes, theta and phi:

> scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, clab = c("Cars Mileage"), theta = 15, phi = 0, bty ="g")

How it works...

The scatter3D() function from the plot3D package has the following parameters:

- x, y, z: Vectors of point coordinates
- colvar: A variable used for the coloring
- col: A color palette to color the colvar variable
- labels: Refers to the text to be written
- add: Logical; if TRUE, the points will be added to the current plot, and if FALSE, a new plot is started
- pch: Shape of the points
- cex: Size of the points
- theta: The azimuthal direction
- phi: Co-latitude; both theta and phi can be used to define the angles for the viewing direction
- bty: Refers to the type of enclosing box and can take various values such as f (full box), b (the default, back panels only), g (grey background with white grid lines), and bl (black background)

There's more...

The following concepts are very important for this recipe.

Adding text to an existing 3D plot

We can use the text3D() function to add text based on the car model name, alongside the data points:

> scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, phi = 0, bty = "g", pch = 20, cex = 0.5)
> text3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, labels = rownames(mtcars), add = TRUE, colkey = FALSE, cex = 0.5)

Using a 3D histogram

The three-dimensional histogram function, hist3D(), has the following attributes:

- z: Values contained within a matrix
- x, y: Vectors, where the length of x should be equal to nrow(z) and the length of y should be equal to ncol(z)
- colvar: Variable used for the coloring; has the same dimension as z
- col: Color palette used for the colvar variable; by default, a red-yellow-blue color scheme
- add: Logical variable. If TRUE, adds surfaces to the current plot. If FALSE, starts a new plot.

Let's plot the death rates in Virginia using a 3D histogram:

data(VADeaths)
hist3D(z = VADeaths, scale = FALSE, expand = 0.01, bty = "g", phi = 20, col = "#0085C2", border = "black", shade = 0.2, ltheta = 80, space = 0.3, ticktype = "detailed", d = 2)

Using a line graph

To visualize the plot with a line graph, add the type parameter to the scatter3D() function. The type parameter can take the values l (line only), b (both line and points), and h (both horizontal lines and points). The 3D plot with horizontal lines and points:

> scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, type="h", clab = c("Miles/(US) gallon"))

[box type="download" align="" class="" width=""]Download the code files here.[/box]

If you think the recipe is delectable, you should definitely take a look at R Data Analysis Cookbook - Second Edition, which will enlighten you further on data visualization with R.