
How-To Tutorials - Data

1204 Articles

Running Parallel Data Operations using Java Streams

Pravin Dhandre
15 Jan 2018
8 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from a book co-authored by Richard M. Reese and Jennifer L. Reese, titled Java for Data Science. This book provides in-depth understanding of important tools and techniques used across data science projects in a Java environment.[/box] This article will give you an advantage of using Java 8 for solving complex and math-intensive problems on larger datasets using Java streams and lambda expressions. You will explore short demonstrations for performing matrix multiplication and map-reduce using Java 8. The release of Java 8 came with a number of important enhancements to the language. The two enhancements of interest to us include lambda expressions and streams. A lambda expression is essentially an anonymous function that adds a functional programming dimension to Java. The concept of streams, as introduced in Java 8, does not refer to IO streams. Instead, you can think of it as a sequence of objects that can be generated and manipulated using a fluent style of programming. This style will be demonstrated shortly. As with most APIs, programmers must be careful to consider the actual execution performance of their code using realistic test cases and environments. If not used properly, streams may not actually provide performance improvements. In particular, parallel streams, if not crafted carefully, can produce incorrect results. We will start with a quick introduction to lambda expressions and streams. If you are familiar with these concepts you may want to skip over the next section. Understanding Java 8 lambda expressions and streams A lambda expression can be expressed in several different forms. The following illustrates a simple lambda expression where the symbol, ->, is the lambda operator. This will take some value, e, and return the value multiplied by two. There is nothing special about the name e. Any valid Java variable name can be used: e -> 2 * e It can also be expressed in other forms, such as the following: (int e) -> 2 * e (double e) -> 2 * e (int e) -> {return 2 * e; The form used depends on the intended value of e. Lambda expressions are frequently used as arguments to a method, as we will see shortly. A stream can be created using a number of techniques. In the following example, a stream is created from an array. The IntStream interface is a type of stream that uses integers. The Arrays class' stream method converts an array into a stream: IntStream stream = Arrays.stream(numbers); We can then apply various stream methods to perform an operation. In the following statement, the forEach method will simply display each integer in the stream: stream.forEach(e -> out.printf("%d ", e)); There are a variety of stream methods that can be applied to a stream. In the following example, the mapToDouble method will take an integer, multiply it by 2, and then return it as a double. The forEach method will then display these values: stream .mapToDouble(e-> 2 * e) .forEach(e -> out.printf("%.4f ", e)); The cascading of method invocations is referred to as fluent programing. Using Java 8 to perform matrix multiplication Here, we will illustrate how streams can be used to perform matrix multiplication. The definitions of the A, B, and C matrices are the same as declared in the Implementing basic matrix operations section. 
They are duplicated here for your convenience:

```java
double A[][] = {
    {0.1950, 0.0311},
    {0.3588, 0.2203},
    {0.1716, 0.5931},
    {0.2105, 0.3242}};
double B[][] = {
    {0.0502, 0.9823, 0.9472},
    {0.5732, 0.2694, 0.916}};
double C[][] = new double[n][p];
```

The following sequence is a stream implementation of matrix multiplication. A detailed explanation of the code follows:

```java
C = Arrays.stream(A)
    .parallel()
    .map(AMatrixRow -> IntStream.range(0, B[0].length)
        .mapToDouble(i -> IntStream.range(0, B.length)
            .mapToDouble(j -> AMatrixRow[j] * B[j][i])
            .sum()
        ).toArray()).toArray(double[][]::new);
```

The first map method, shown as follows, creates a stream of double vectors representing the 4 rows of the A matrix. The range method will return a stream of elements ranging from its first argument up to its second argument:

```java
.map(AMatrixRow -> IntStream.range(0, B[0].length)
```

The variable i corresponds to the numbers generated by the range method using B[0].length, which is the number of columns in the B matrix (3). The variable j corresponds to the numbers generated by the range method using B.length, which is the number of rows of the B matrix (2). At the heart of the statement is the matrix multiplication, where the sum method calculates the sum of the products:

```java
.mapToDouble(j -> AMatrixRow[j] * B[j][i])
.sum()
```

The last part of the expression creates the two-dimensional array for the C matrix. The operator ::new, called a method reference, is a shorter way of invoking the new operator to create a new object:

```java
).toArray()).toArray(double[][]::new);
```

The displayResult method is as follows:

```java
public void displayResult() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", C[i][j]);
        }
        out.println();
    }
}
```

The output of this sequence follows:

```
Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964
```

Using Java 8 to perform map-reduce

In this section, we will use Java 8 streams to perform a map-reduce operation. In this example, we will use a Stream of Book objects. We will then demonstrate how to use the Java 8 reduce and average methods to get our total page count and average page count.

Rather than beginning with a text file, as we did in the Hadoop example, we have created a Book class with title, author, and page-count fields. In the main method of the driver class, we have created new instances of Book and added them to an ArrayList called books. We have also created a double value average to hold our average, and initialized our variable totalPg to zero:

```java
ArrayList<Book> books = new ArrayList<>();
double average;
int totalPg = 0;
books.add(new Book("Moby Dick", "Herman Melville", 822));
books.add(new Book("Charlotte's Web", "E.B. White", 189));
books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));
```

Next, we perform a map and reduce operation to calculate the total number of pages in our set of books. To accomplish this in a parallel manner, we use the stream and parallel methods. We then use the map method with a lambda expression to accumulate all of the page counts from each Book object.
Finally, we use the reduce method to merge our page counts into one final value, which is assigned to totalPg:

```java
totalPg = books
    .stream()
    .parallel()
    .map((b) -> b.pgCnt)
    .reduce(totalPg, (accumulator, _item) -> {
        out.println(accumulator + " " + _item);
        return accumulator + _item;
    });
```

Notice that in the preceding reduce method we have chosen to print out information about the reduction operation's cumulative value and individual items. The accumulator represents the aggregation of our page counts. The _item represents the individual task within the map-reduce process undergoing reduction at any given moment.

In the output that follows, we will first see the accumulator value stay at zero as each individual book item is processed. Gradually, the accumulator value increases. The final operation is the reduction of the values 1223 and 2279. The sum of these two numbers is 3502, which is the total page count for all of our books:

```
0 822
0 189
0 299
0 673
0 212
299 673
0 1032
0 275
1032 275
972 1307
189 212
822 401
1223 2279
```

Next, we will add code to calculate the average page count of our set of books. We multiply our totalPg value, determined using map-reduce, by 1.0 to prevent truncation when we divide by the integer returned by the size method. We then print out average:

```java
average = 1.0 * totalPg / books.size();
out.printf("Average Page Count: %.4f\n", average);
```

Our output is as follows:

```
Average Page Count: 500.2857
```

We could have used Java 8 streams to calculate the average directly using the map method. Add the following code to the main method. We use parallelStream with our map method to simultaneously get the page count for each of our books. We then use mapToDouble to ensure our data is of the correct type to calculate our average. Finally, we use the average and getAsDouble methods to calculate our average page count:

```java
average = books
    .parallelStream()
    .map(b -> b.pgCnt)
    .mapToDouble(s -> s)
    .average()
    .getAsDouble();
out.printf("Average Page Count: %.4f\n", average);
```

Then we print out our average. Our output, identical to our previous example, is as follows:

```
Average Page Count: 500.2857
```

The above techniques leveraged Java 8 capabilities in a map-reduce style to solve numeric problems. This type of process can also be applied to other types of data, including text-based data. The true benefit is seen when these processes handle extremely large datasets with a significant reduction in processing time. To learn about various other mathematical and parallel techniques in Java for building a complete data analysis application, you may read through the book Java for Data Science.
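To make the page-count example easier to try end to end, here is a minimal, self-contained sketch that assembles the snippets above into one runnable class. It is our consolidation rather than the book's exact listing: the nested Book class is a stripped-down stand-in with only the fields used here, and the class name PageCountDemo is ours.

```java
import java.util.ArrayList;
import java.util.List;

public class PageCountDemo {

    // Minimal stand-in for the book's Book class: only the fields this example needs.
    static class Book {
        final String title;
        final String author;
        final int pgCnt;

        Book(String title, String author, int pgCnt) {
            this.title = title;
            this.author = author;
            this.pgCnt = pgCnt;
        }
    }

    public static void main(String[] args) {
        List<Book> books = new ArrayList<>();
        books.add(new Book("Moby Dick", "Herman Melville", 822));
        books.add(new Book("Charlotte's Web", "E.B. White", 189));
        books.add(new Book("The Grapes of Wrath", "John Steinbeck", 212));
        books.add(new Book("Jane Eyre", "Charlotte Bronte", 299));
        books.add(new Book("A Tale of Two Cities", "Charles Dickens", 673));
        books.add(new Book("War and Peace", "Leo Tolstoy", 1032));
        books.add(new Book("The Great Gatsby", "F. Scott Fitzgerald", 275));

        // Map each book to its page count, then reduce the counts to a total.
        int totalPg = books.stream()
                .parallel()
                .map(b -> b.pgCnt)
                .reduce(0, (accumulator, item) -> accumulator + item);

        // Average page count computed directly with a parallel stream pipeline.
        double average = books.parallelStream()
                .mapToDouble(b -> b.pgCnt)
                .average()
                .getAsDouble();

        System.out.println("Total page count: " + totalPg);
        System.out.printf("Average page count: %.4f%n", average);
    }
}
```

Running this prints the same total (3502) and average (500.2857) as above; the diagnostic println inside the book's reduce lambda is omitted here, which is the only reason the intermediate output differs.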

How to create a standard Java HTTP Client in ElasticSearch

Sugandha Lahoti
12 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This is an excerpt from a book written by Alberto Paro, titled Elasticsearch 5.x Cookbook. This book is your one-stop guide to mastering the complete ElasticSearch ecosystem with comprehensive recipes on what’s new in Elasticsearch 5.x.[/box] In this article we see how to create a standard Java HTTP Client in ElasticSearch. All the codes used in this article are available on GitHub. There are scripts to initialize all the required data. An HTTP client is one of the easiest clients to create. It's very handy because it allows for the calling, not only of the internal methods as the native protocol does, but also of third- party calls implemented in plugins that can be only called via HTTP. Getting Ready You need an up-and-running Elasticsearch installation. You will also need a Maven tool, or an IDE that natively supports it for Java programming such as Eclipse or IntelliJ IDEA, must be installed. The code for this recipe is in the chapter_14/http_java_client directory. How to do it For creating a HTTP client, we will perform the following steps: For these examples, we have chosen the Apache HttpComponents that is one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository search.maven.org. To enable the compilation in your Maven pom.xml project just add the following code: <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.2</version> </dependency> If we want to instantiate a client and fetch a document with a get method the code will look like the following: import org.apache.http.*; Import org.apache.http.client.methods.CloseableHttpResponse; import org.apache.http.client.methods.HttpGet; import org.apache.http.impl.client.CloseableHttpClient; import org.apache.http.impl.client.HttpClients; import org.apache.http.util.EntityUtils; import java.io.*; public class App { private static String wsUrl = "http://127.0.0.1:9200"; public static void main(String[] args) { CloseableHttpClient client = HttpClients.custom() .setRetryHandler(new MyRequestRetryHandler()).build(); HttpGet method = new HttpGet(wsUrl+"/test-index/test- type/1"); // Execute the method. try { CloseableHttpResponse response = client.execute(method); if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) { System.err.println("Method failed: " + response.getStatusLine()); }else{ HttpEntity entity = response.getEntity(); String responseBody = EntityUtils.toString(entity); System.out.println(responseBody); } } catch (IOException e) { System.err.println("Fatal transport error: " + e.getMessage()); e.printStackTrace(); } finally { // Release the connection. method.releaseConnection(); } } }    The result, if the document will be: {"_index":"test-index","_type":"test- type","_id":"1","_version":1,"exists":true, "_source" : {...}} How it works We perform the previous steps to create and use an HTTP client: The first step is to initialize the HTTP client object. In the previous code this is done via the following code: CloseableHttpClient client = HttpClients.custom().setRetryHandler(new MyRequestRetryHandler()).build(); Before using the client, it is a good practice to customize it; in general the client can be modified to provide extra functionalities such as retry support. 
Retry support is very important for designing robust applications; the IP network protocol is never 100% reliable, so the client automatically retries an action if something goes bad (HTTP connection closed, server overload, and so on). In the previous code, we defined an HttpRequestRetryHandler, which monitors the execution and repeats it three times before raising an error.

After having set up the client, we can define the method call. In the previous example we want to execute the GET REST call. The method used is HttpGet, and the URL takes the form index/type/id. To initialize the method, the code is:

```java
HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");
```

To improve the quality of our REST call, it's good practice to add extra controls to the method, such as authentication and custom headers.

The Elasticsearch server by default doesn't require authentication, so we need to provide some security layer at the top of our architecture. A typical scenario is using your HTTP client with the Search Guard plugin or the Shield plugin (part of X-Pack), which allows the Elasticsearch REST API to be extended with authentication and SSL. After one of these plugins is installed and configured on the server, the following code adds a host entry so that the credentials are provided only when calls target that host. The authentication is simple basic auth, but it works very well for non-complex deployments:

```java
HttpHost targetHost = new HttpHost("localhost", 9200, "http");
CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
    new AuthScope(targetHost.getHostName(), targetHost.getPort()),
    new UsernamePasswordCredentials("username", "password"));
// Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
// Generate BASIC scheme object and add it to the local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put(targetHost, basicAuth);
// Add AuthCache to the execution context
HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(credsProvider);
```

The created context must be used when executing the call:

```java
response = client.execute(method, context);
```

Custom headers allow for passing extra information to the server for executing a call. Some examples could be API keys, or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add a custom header to the call informing the server that our client accepts gzip encoding:

```java
request.addHeader("Accept-Encoding", "gzip");
```

After configuring the call with all the parameters, we can fire up the request:

```java
response = client.execute(method, context);
```

Every response object must be validated on its return status: if the call is OK, the return status should be 200. In the previous code the check is done in the if statement:

```java
if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK)
```

If the call was OK and the status code of the response is 200, we can read the answer:

```java
HttpEntity entity = response.getEntity();
String responseBody = EntityUtils.toString(entity);
```

The response is wrapped in an HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of the HttpEntity as a string; otherwise we'd need to write code to read from the stream and build the string ourselves. Obviously, all the reading parts of the call are wrapped in a try-catch block to collect all possible errors due to networking problems.
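To tie the pieces of this recipe together, here is a minimal, self-contained sketch that executes the GET call with preemptive basic authentication and the gzip Accept-Encoding header. It is a sketch under assumptions rather than the book's exact listing: it omits the custom MyRequestRetryHandler (whose implementation is not shown here), hard-codes the host, port, and credentials from the snippets above, and the class name AuthenticatedGetDemo is ours.

```java
import org.apache.http.HttpHost;
import org.apache.http.HttpStatus;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class AuthenticatedGetDemo {
    public static void main(String[] args) {
        // Demo values taken from the recipe; replace with your own host and credentials.
        HttpHost targetHost = new HttpHost("localhost", 9200, "http");

        // Register basic-auth credentials for the target host only.
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(targetHost.getHostName(), targetHost.getPort()),
                new UsernamePasswordCredentials("username", "password"));

        // Cache the BASIC scheme for the host and wire it into the context
        // (this extra step makes the authentication preemptive).
        AuthCache authCache = new BasicAuthCache();
        authCache.put(targetHost, new BasicScheme());
        HttpClientContext context = HttpClientContext.create();
        context.setCredentialsProvider(credsProvider);
        context.setAuthCache(authCache);

        HttpGet method = new HttpGet("http://127.0.0.1:9200/test-index/test-type/1");
        // Ask the server for gzip-compressed responses to save bandwidth.
        method.addHeader("Accept-Encoding", "gzip");

        // The retry handler from the recipe is omitted because its source is not shown here.
        try (CloseableHttpClient client = HttpClients.custom().build();
             CloseableHttpResponse response = client.execute(method, context)) {
            if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + response.getStatusLine());
            } else {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        } catch (Exception e) {
            System.err.println("Fatal transport error: " + e.getMessage());
        }
    }
}
```

Wiring the AuthCache into the context is what makes the authentication preemptive, so the credentials are sent with the first request instead of only after a 401 challenge.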
See also

  • The Apache HttpComponents library at http://hc.apache.org/ for a complete reference and more examples
  • The Search Guard plugin for authenticated Elasticsearch access at https://github.com/floragunncom/search-guard, or the official Elasticsearch Shield plugin at https://www.elastic.co/products/x-pack

We saw a simple recipe to create a standard Java HTTP client for Elasticsearch. If you enjoyed this excerpt, check out the book Elasticsearch 5.x Cookbook to learn how to create an HTTP Elasticsearch client and a native client, and how to perform other operations in Elasticsearch.

Why R is perfect for Statistical Analysis

Amarabha Banerjee
11 Jan 2018
7 min read
[box type="note" align="" class="" width=""]This article is taken from Machine Learning with R  written by Brett Lantz. This book will help you learn specialized machine learning techniques for text mining, social network data, and big data.[/box] In this post we will explore different statistical analysis techniques and how they can be implemented using R language easily and efficiently. Introduction The R language, as the descendent of the statistics language, S, has become the preferred computing language in the field of statistics. Moreover, due to its status as an active contributor in the field, if a new statistical method is discovered, it is very likely that this method will first be implemented in the R language. As such, a large quantity of statistical methods can be fulfilled by applying the R language. To apply statistical methods in R, the user can categorize the method of implementation into descriptive statistics and inferential statistics: Descriptive statistics: These are used to summarize the characteristics of the data. The user can use mean and standard deviation to describe numerical data, and use frequency and percentages to describe categorical data Inferential statistics: Based on the pattern within a sample data, the user can infer the characteristics of the population. The methods related to inferential statistics are for hypothesis testing, data estimation, data correlation, and relationship modeling. Inference can be further extended to forecasting, prediction, and estimation of unobserved values either in or associated with the population being studied. In the following recipes, we will discuss examples of data sampling, probability distribution, univariate descriptive statistics, correlations and multivariate analysis, linear regression and multivariate analysis, Exact Binomial Test, student's t-test, Kolmogorov-Smirnov test, Wilcoxon Rank Sum and Signed Rank test, Pearson's Chi-squared Test, One-way ANOVA, and Two-way ANOVA. Data sampling with R Sampling is a method to select a subset of data from a statistical population, which can use the characteristics of the population to estimate the whole population. The following recipe will demonstrate how to generate samples in R. Perform the following steps to understand data sampling in R: To generate random samples of a given population, the user can simply use the sample function: > sample(1:10) R and Statistics [ 111 ] To specify the number of items returned, the user can set the assigned value to the size argument: > sample(1:10, size = 5) Moreover, the sample can also generate Bernoulli trials by specifying replace = TRUE (default is FALSE): > sample(c(0,1), 10, replace = TRUE) If we want to do a coin flipping trail, where the outcome is Head or Tail, we can use: > outcome <- c("Head","Tail") > sample(outcome, size=1) To generate result for 100 times, we can use: > sample(outcome, size=100, replace=TRUE) The sample can be useful when we want to select random data from datasets, selecting 10 observations from AirPassengers: > sample(AirPassengers, size=10) How it works As we saw in the preceding demonstration, the sample function can generate random samples from a specified population. The returned number from records can be designated by the user simply by specifying the argument of size. By assigning the replace argument as TRUE, you can generate Bernoulli trials (a population with 0 and 1 only). 
Operating a probability distribution in R

Probability distributions and statistical analysis are closely related to each other. For statistical analysis, analysts make predictions based on a certain population, which is mostly under a probability distribution. Therefore, if you find that the data selected for a prediction does not follow the exact assumed probability distribution in the experiment design, the upcoming results can be refuted. In other words, probability provides the justification for statistics. The following examples will demonstrate how to work with probability distributions in R.

Perform the following steps:

For a normal distribution, the user can use dnorm, which will return the height of a normal curve at 0:

```r
> dnorm(0)
[1] 0.3989423
```

Then, the user can change the mean and the standard deviation in the arguments:

```r
> dnorm(0, mean=3, sd=5)
[1] 0.06664492
```

Next, plot the graph of a normal distribution with the curve function:

```r
> curve(dnorm, -3, 3)
```

In contrast to dnorm, which returns the height of a normal curve, the pnorm function returns the area under the curve below a given value:

```r
> pnorm(1.5)
[1] 0.9331928
```

Alternatively, to get the area above a certain value, you can set the option lower.tail to FALSE:

```r
> pnorm(1.5, lower.tail=FALSE)
[1] 0.0668072
```

To plot the graph of pnorm, the user can employ the curve function:

```r
> curve(pnorm(x), -3, 3)
```

To calculate the quantiles for a specific distribution, you can use qnorm. The function qnorm can be treated as the inverse of pnorm; it returns the Z-score of a given probability:

```r
> qnorm(0.5)
[1] 0
> qnorm(pnorm(0))
[1] 0
```

To generate random numbers from a normal distribution, one can use the rnorm function and specify the number of generated numbers. One can also define optional arguments, such as the mean and standard deviation:

```r
> set.seed(50)
> x = rnorm(100, mean=3, sd=5)
> hist(x)
```

To work with the uniform distribution, the runif function generates random numbers from a uniform distribution. The user can specify the range of the generated numbers by supplying the minimum and the maximum. In the following example, the user generates 100 random variables from 0 to 5:

```r
> set.seed(50)
> y = runif(100, 0, 5)
> hist(y)
```

Lastly, if you would like to test the normality of the data, the most widely used test for this is the Shapiro-Wilk test. Here, we demonstrate how to perform a test of normality on samples from the normal and uniform distributions, respectively:

```r
> shapiro.test(x)

        Shapiro-Wilk normality test
data:  x
W = 0.9938, p-value = 0.9319

> shapiro.test(y)

        Shapiro-Wilk normality test
data:  y
W = 0.9563, p-value = 0.002221
```

How it works

In this recipe, we first introduce dnorm, a probability density function, which returns the height of a normal curve. With a single input specified, the input value is called a standard score or z-score. Without any other arguments specified, it is assumed that the normal distribution in use has a mean of zero and a standard deviation of 1. We then introduce three ways to draw standard and normal distributions.

After this, we introduce pnorm, a cumulative density function. The function pnorm gives the area under the curve below a given value. In addition to this, pnorm can also be used to calculate the p-value from a normal distribution. One can get the upper-tail p-value by subtracting the pnorm result from 1, or by setting the lower.tail option to FALSE. Similarly, one can use the plot function to plot the cumulative density.
In contrast to pnorm, qnorm returns the z-score of a given probability. Therefore, the example shows that applying the qnorm function to the result of a pnorm function produces the exact input value.

Next, we show you how to use the rnorm function to generate random samples from a normal distribution, and the runif function to generate random samples from the uniform distribution. In the function rnorm, one has to specify the number of generated numbers and may also add optional arguments, such as the mean and standard deviation. Then, by using the hist function, one should see a bell curve in the resulting histogram. On the other hand, for the runif function, with the minimum and maximum specified, one gets a list of sample numbers between the two, and we can again use the hist function to plot the samples. The resulting histogram is not bell-shaped, which indicates that the sample does not come from a normal distribution.

Finally, we demonstrate how to test data normality with the Shapiro-Wilk test. Here, we conduct the normality test on both the normal and uniform distribution samples, respectively. In both outputs, one can find the p-value of each test result. The p-value shows the chance that the sample comes from a normal distribution. If the p-value is higher than 0.05, we can conclude that the sample comes from a normal distribution. On the other hand, if the value is lower than 0.05, we conclude that the sample does not come from a normal distribution.

We have shown you how you can use the R language to perform statistical analysis easily and efficiently, and what its simplest forms look like. If you liked this article, please be sure to check out Machine Learning with R, which covers many useful machine learning techniques with R.

Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learners guide for python and scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. Additionally explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with  park 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: bigint, name: string] scala> //Created a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string] scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string]) scala> // Vertices in the graph scala> userGraph.vertices.show() +---+------+ | id| name| +---+------+ | 1|Thomas| | 2| Krish| | 3|Mathew| +---+------+ scala> // Edges in the graph scala> userGraph.edges.show() +---+---+------------+ |src|dst|relationship| +---+---+------------+ | 1| 2| Follows| | 1| 2| Son| | 2| 3| Follows| +---+---+------------+ scala> //Number of edges in the graph scala> val edgeCount = userGraph.edges.count() edgeCount: Long = 3 scala> //Number of vertices in the graph scala> val vertexCount = userGraph.vertices.count() vertexCount: Long = 3 scala> //Number of edges coming to each of the vertex. scala> userGraph.inDegrees.show() +---+--------+ | id|inDegree| +---+--------+ | 2| 2| | 3| 1| +---+--------+ scala> //Number of edges going out of each of the vertex. scala> userGraph.outDegrees.show() +---+---------+ | id|outDegree| +---+---------+ | 1| 2| | 2| 1| +---+---------+ scala> //Total number of edges coming in and going out of each vertex. 
scala> userGraph.degrees.show() +---+------+ | id|degree| +---+------+ | 1| 2| | 2| 3| | 3| 1| +---+------+ scala> //Get the triplets of the graph scala> userGraph.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> //Using the DataFrame API, apply filter and select only the needed edges scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count() numFollows: Long = 2 scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew"))) usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35 scala> //Created an RDD of Edge type with String type as the property of the edge scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows"))) userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35 scala> //Create a graph containing the vertex and edge RDDs as created before scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD) userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614 scala> //Create the GraphFrame based graph from Spark GraphX based graph scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD) userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string]) scala> userGraphFrameFromGraphX.triplets.show() +-------------+----------+----------+ | edge| src| dst| +-------------+----------+----------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]| | [1,2,Son]|[1,Thomas]| [2,Krish]| |[2,3,Follows]| [2,Krish]|[3,Mathew]| +-------------+----------+----------+ scala> // Convert the GraphFrame based graph to a Spark GraphX based graph scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql .Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2 When creating DataFrames for the GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but the GraphFrame doesn't have any such limitations and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well. The Python version of GraphFrames has fewer features. 
Since Python is not a supported programming language for the Spark GraphX library, GraphFrame to GraphX and GraphX to GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, there are some pending defects in the GraphFrames API for Python and not all the features demonstrated previously using Scala function properly in Python at the time of writing   Understanding GraphFrames queries The Spark GraphX library is the RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms, but also graph queries. The major difference between graph processing algorithms and graph queries is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in the data hidden in a graph data structure. In GraphFrame parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs. From a use case perspective, take the use case of users following each other in a social media application. Users have relationships between them. In the previous sections, these relationships were modeled as graphs. In real-world use cases, such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, it can be expressed as a pattern in graph query, and such relationships can be found using easy programmatic constructs. The following demonstration models the relationship between the users in a GraphFrame, and a pattern search is done using that. At the Scala REPL prompt of Spark 1.6, try the following statements: $ cd $SPARK_1.6_HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 145ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. 
Spark context available as sc. SQL context available as sqlContext. scala> import org.graphframes._ import org.graphframes._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer. scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name") users: org.apache.spark.sql.DataFrame = [id: string, name: string] scala> //Create a DataFrame for Edge with String type as the property of the edge scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship") userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string] scala> //Create the GraphFrame scala> val userGraph = GraphFrame(users, userRelationships) userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string]) scala> // Search for pairs of users who are following each other scala> // In other words the query can be read like this. Find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to the user u1 using the edge e2. When a query is formed like this, the result will list with columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variables can be used suitable for the use case. scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)") graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>] scala> graphQuery.show() +-------------+----------+----------+-------------+ | e1| u1| u2| e2| +-------------+----------+----------+-------------+ |[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]| |[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]| +-------------+----------+----------+-------------+ Note that the columns in the graph query result are formed with the elements given in the search pattern. There is no limit to the way the patterns can be formed. Note the data type of the graph query result. It is a DataFrame object. That brings a great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported with popular programming languages such as Python and R. Since GraphFrames is a DataFrame based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. Spark external package is definitely a potential candidate to be included as part of the Spark. To know more on the design and development of a data processing application using Spark and the family of libraries built on top of it, do check out this book Apache Spark 2 for Beginners.  

Getting to know different Big data Characteristics

Gebin George
05 Jan 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Osvaldo Martin titled Mastering Predictive Analytics with R, Second Edition. This book will help you leverage the flexibility and modularity of R to experiment with a range of different techniques and data types.[/box] Our article will quickly walk you through all the fundamental characteristics of Big Data. For you to determine if your data source qualifies as big data or as needing special handling, you can start by examining your data source in the following areas: The volume (amount) of data. The variety of data. The number of different sources and spans of the data. Let's examine each of these areas. Volume If you are talking about the number of rows or records, then most likely your data source is not a big data source since big data is typically measured in gigabytes, terabytes, and petabytes. However, space doesn't always mean big, as these size measurements can vary greatly in terms of both volume and functionality. Additionally, data sources of several million records may qualify as big data, given their structure (or lack of structure). Varieties Data used in predictive models may be structured or unstructured (or both) and include transactions from databases, survey results, website logs, application messages, and so on (by using a data source consisting of a higher variety of data, you are usually able to cover a broader context for the analytics you derive from it). Variety, much like volume, is considered a normal qualifier for big data. Sources and spans If the data source for your predictive analytics project is the result of integrating several sources, you most likely hit on both criteria of volume and variety and your data qualifies as big data. If your project uses data that is affected by governmental mandates, consumer requests is a historical analysis, you are almost certainty using big data. Government regulations usually require that certain types of data need to be stored for several years. Products can be consumer driven over the lifetime of the product and with today's trends, historical analysis data is usually available for more than five years. Again, all examples of big data sources. Structure You will often find that data sources typically fall into one of the following three categories: 1. Sources with little or no structure in the data (such as simple text files). 2. Sources containing both structured and unstructured data (like data that is sourced from document management systems or various websites, and so on). 3. Sources containing highly structured data (like transactional data stored in a relational database example). How your data source is categorized will determine how you prepare and work with your data in each phase of your predictive analytics project. Although data sources with structure can obviously still fall into the category of big data, it's data containing both structured and unstructured data (and of course totally unstructured data) that fit as big data and will require special handling and or pre-processing. Statistical noise Finally, we should take a note here that other factors (other than those discussed already in the chapter) can qualify your project data source as being unwieldy, overly complex, or a big data source. 
These include (but are not limited to):

Statistical noise (a term for recognized amounts of unexplained variation within the data)
Data suffering from mismatched understandings (differences in how the data is interpreted across communities, cultures, practices, and so on)

Once you have determined that the data source you will be using in your predictive analytics project seems to qualify as big (again, as we are using the term here), you can proceed to decide how to manage and manipulate that data source, based upon the known challenges this type of data presents, so as to be most effective. In the next section, we will review some of these common problems before we go on to offer usable solutions.

We have learned the fundamental characteristics that define big data, to further use them for analytics. If you enjoyed our post, check out the book Mastering Predictive Analytics with R, Second Edition to learn to build complex machine learning models using R.

2018 new year resolutions to thrive in the Algorithmic World - Part 3 of 3

Sugandha Lahoti
05 Jan 2018
5 min read
We have already talked about a simple learning roadmap for you to develop your data science skills in the first resolution. We also talked about the importance of staying relevant in an increasingly automated job market, in our second resolution. Now it’s time to think about the kind of person you want to be and the legacy you will leave behind. 3rd Resolution: Choose projects wisely and be mindful of their impact. Your work has real consequences. And your projects will often be larger than what you know or can do. As such, the first step toward creating impact with intention is to define the project scope, purpose, outcomes and assets clearly. The next most important factor is choosing the project team. 1. Seek out, learn from and work with a diverse group of people To become a successful data scientist you must learn how to collaborate. Not only does it make projects fun and efficient, but it also brings in diverse points of view and expertise from other disciplines. This is a great advantage for machine learning projects that attempt to solve complex real-world problems. You could benefit from working with other technical professionals like web developers, software programmers, data analysts, data administrators, game developers etc. Collaborating with such people will enhance your own domain knowledge and skills and also let you see your work from a broader technical perspective. Apart from the people involved in the core data and software domain, there are others who also have a primary stake in your project’s success. These include UX designers, people with humanities background if you are building a product intended to participate in society (which most products often are), business development folks, who actually sell your product and bring revenue, marketing people, who are responsible for bringing your product to a much wider audience to name a few. Working with people of diverse skill sets will help market your product right and make it useful and interpretable to the target audience. In addition to working with a melange of people with diverse skill sets and educational background it is also important to work with people who think differently from you, and who have experiences that are different from yours to get a more holistic idea of the problems your project is trying to tackle and to arrive at a richer and unique set of solutions to solve those problems. 2. Educate yourself on ethics for data science As an aspiring data scientist, you should always keep in mind the ethical aspects surrounding privacy, data sharing, and algorithmic decision-making.  Here are some ways to develop a mind inclined to designing ethically-sound data science projects and models. Listen to seminars and talks by experts and researchers in fairness, accountability, and transparency in machine learning systems. Our favorites include Kate Crawford’s talk on The trouble with bias, Tricia Wang on The human insights missing from big data and Ethics & Data Science by Jeff Hammerbacher. Follow top influencers on social media and catch up with their blogs and about their work regularly. Some of these researchers include Kate Crawford, Margaret Mitchell, Rich Caruana, Jake Metcalf, Michael Veale, and Kristian Lum among others. Take up courses which will guide you on how to eliminate unintended bias while designing data-driven algorithms. We recommend Data Science Ethics by the University of Michigan, available on edX. You can also take up a course on basic Philosophy from your choice of University.   
Start at the beginning. Read books on ethics and philosophy when you get long weekends this year. You can begin with Aristotle's Nicomachean Ethics to understand the real meaning of ethics, a term Aristotle helped develop. We recommend browsing through The Stanford Encyclopedia of Philosophy, which is an online archive of peer-reviewed publication of original papers in philosophy, freely accessible to Internet users. You can also try Practical Ethics, a book by Peter Singer and The Elements of Moral Philosophy by James Rachels. Attend or follow upcoming conferences in the field of bringing transparency in socio-technical systems. For starters, FAT* (Conference on Fairness, Accountability, and Transparency) is scheduled on February 23 and 24th, 2018 at New York University, NYC. We also have the 5th annual conference of FAT/ML, later in the year.  3. Question/Reassess your hypotheses before, during and after actual implementation Finally, for any data science project, always reassess your hypotheses before, during, and after the actual implementation. Always ask yourself these questions after each of the above steps and compare them with the previous answers. What question are you asking? What is your project about? Whose needs is it addressing? Who could it adversely impact? What data are you using? Is the data-type suitable for your type of model? Is the data relevant and fresh? What are its inherent biases and limitations? How robust are your workarounds for them? What techniques are you going to try? What algorithms are you going to implement? What would be its complexity? Is it interpretable and transparent? How will you evaluate your methods and results? What do you expect the results to be? Are the results biased? Are they reproducible? These pointers will help you evaluate your project goals from a customer and business point of view. Additionally, it will also help you in building efficient models which can benefit the society and your organization at large. With this, we come to the end of our new year resolutions for an aspiring data scientist. However, the beauty of the ideas behind these resolutions is that they are easily transferable to anyone in any job. All you gotta do is get your foundations right, stay relevant, and be mindful of your impact. We hope this gives a great kick start to your career in 2018. “Motivation is what gets you started. Habit is what keeps you going.” ― Jim Ryun Happy New Year! May the odds and the God(s) be in your favor this year to help you build your resolutions into your daily routines and habits!

2018 new year resolutions to thrive in the Algorithmic World - Part 2 of 3

Savia Lobo
04 Jan 2018
7 min read
In our first resolution, we talked about learning the building blocks of data science i.e developing your technical skills. In this second resolution, we walk you through steps to stay relevant in your field and how to dodge jobs that have a high possibility of getting automated in the near future. 2nd Resolution: Stay relevant in your field even as job automation is on the rise (Time investment: half an hour every day, 2 hours on weekends) Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. In addition to honing your technical skills, you must also deepen your domain expertise and keep adding to your portfolio of soft skills to stay ahead of not the just human competition but also to thrive in an automated job market. We list below some simple ways to do all these in a systematic manner. All it requires is a commitment of half an hour to one hour of your time daily for your professional development. 1. Commit to and execute a daily learning-practice-participation ritual Here are some ways to stay relevant. Follow data science blogs and podcasts relevant to your area of interest. Here are some of our favorites: Data Science 101, the journey of a data scientist The Data Skeptic for a healthy dose of scientific skepticism Data Stories for data visualization This Week in Machine Learning & AI for informative discussions with prominent people in the data science/machine learning community Linear Digressions, a podcast co-hosted by a data scientist and a software engineer attempting to make data science accessible You could also follow individual bloggers/vloggers in this space like Siraj Raval, Sebastian Raschka, Denny Britz, Rodney Brookes, Corinna Cortes, Erin LeDell Newsletters are a great way to stay up-to-date and to get a macro-level perspective. You don’t have to spend an awful lot of time doing the research yourself on many different subtopics. So, subscribe to useful newsletters on data science. You can subscribe to our newsletter here. It is a good idea to subscribe to multiple newsletters on your topic of interest to get a balanced and comprehensive view of the topic. Try to choose newsletters that have distinct perspectives, are regular and are published by people passionate about the topic. Twitter gives a whole new meaning to ‘breaking news’. Also, it is a great place to follow contemporary discussions on topics of interest where participation is open to all. When done right, it can be a gold mine for insights and learning. But often it is too overwhelming as it is viewed as a broadcasting marketing tool. Follow your role models in data science on Twitter. Or you could follow us on Twitter @PacktDataHub for curated content from key data science influencers and our own updates about the world of data science. You could also click here to keep a track of 737 twitter accounts most followed by the members of the NIPS2017 community. Quora, Reddit, Medium, and StackOverflow are great places to learn about topics in depth when you have a specific question in mind or a narrow focus area. They help you get multiple informed opinions on topics. In other words, when you choose a topic worth learning, these are great places to start. Follow them up by reading books on the topic and also by reading the seminal papers to gain a robust technical appreciation. Create a Github account and participate in Kaggle competitions. Nothing sticks as well as learning by doing. 
You can also browse Data Helpers, a site voluntarily set up by Angela Bassa, where interested data science practitioners offer to help newcomers with their queries on entering the field and anything else.

2. Identify your strengths and interests to realign your career trajectory

OK, now that you have got your daily learning routine in place, it is time to think a little more strategically about your career trajectory, your goals, and eventually the kind of work you want to be doing. This means:
- Getting out of jobs that can be automated
- Developing skills that augment or complement AI-driven tasks
- Finding your niche and developing deep domain expertise that AI will find hard to automate in the near future

Here are some ideas to start acting on the points above.

The first step is to assess your current job role and understand how likely it is to get automated. If you are in a job that has well-defined routines and rules to follow, it is quite likely to go the AI job apocalypse route, e.g., data entry, customer support that follows scripts, invoice processing, template-based software testing or development, and so on. Even "creative" jobs such as content summarization, news aggregation, and template-based photo or video editing fall in this category. In the world of data professionals, jobs like data cleaning, database optimization, feature generation, even model building (gasp!), among others, could head the same way given the right incentives. Choose today to transition out of jobs that may not exist in the next 10 years.

Then, instead of hitting the panic button, invest in redefining your skills in a way that would be helpful in the long run. If you are a data professional, skills such as data interpretation, data-driven storytelling, data pipeline architecture and engineering, feature engineering, and others that require a high level of human judgment are least likely to be replicated by machines anytime soon. By mastering skills that complement AI-driven tasks and jobs, you should be able to present yourself as a compelling option to potential employers in a highly competitive job market.

In addition to reskilling, try to find your niche and dive deep. By niche we mean, if you are a data scientist, a specific technical aspect of your field, something that interests you. It could be anything from computer vision to NLP, a class of algorithms like neural nets, or a type of problem that machine learning solves, such as recommender systems or classification systems. It could even be a specific phase of a data science project, such as data visualization or data pipeline engineering. Master your niche while keeping up with what's happening in other related areas.

Next, understand where your strengths lie. In other words, what your expertise is, and what industry or domain you understand well or have amassed experience in. For instance, NLP, a subset of machine learning abilities, can be applied to customer reviews to mine useful insights, perform sentiment analysis, or build recommendation systems in conjunction with predictive modeling, among other things. In order to build an NLP model that mines insights from customer feedback, we must have some idea of what we are looking for. Your domain expertise can be of great value here.
If you are in the publishing business, you would know what keywords matter most in reviews and, more importantly, why they matter and how to convert the findings into actionable insights - aspects that your model, or even a machine learning engineer outside your industry, may not understand or appreciate.

Take the case of Brendan Frey and the team of researchers at Deep Genomics as a real-world example. They applied AI and machine learning (their niche expertise) to build a neural network that identifies pathological mutations in genes (their domain expertise). Their knowledge of how genes get created, how they work, what a mutation looks like, and so on helped them choose the features and hyperparameters for their model.

Similarly, you can pick any of your niche skills and apply them in whichever field you find interesting and worthwhile. Based on your domain knowledge and area of expertise, that could range from sorting a person into a Hogwarts house because you are a Harry Potter fan, to flagging them as a patient with a high likelihood of developing diabetes because you have a background in biotechnology.

This brings us to the next resolution, where we cover how your work will come to define you and why it matters that you choose your projects well.

Creating reports using SQL Server 2016 Reporting Services

Kunal Chaudhari
04 Jan 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Dinesh Priyankara and Robert C. Cain, titled SQL Server 2016 Reporting Services Cookbook. This book will help you create cross-browser and cross-platform reports using SQL Server 2016 Reporting Services.[/box]

In today's tutorial, we explore the steps to create multiple-axis charts with SQL Server 2016 Reporting Services. Often you will want to have multiple items plotted on a chart. In this article, we will plot two values over time, in this case the Total Sales Amount (Excluding Tax) and the Total Tax Amount. As you might expect, though, the tax amounts are going to be a small percentage of the sales amounts. By default, this would create a chart with a huge gap in the middle and a Y axis that is quite large and difficult to pinpoint values on. To prevent this, Reporting Services allows us to place a second Y axis on our charts. In this article, we'll explore both adding a second line to our chart and having it plotted on a second Y axis.

Getting ready

First, we'll create a new Reporting Services project to contain the report. Name this new project Chapter03. Within the new project, create a Shared Data Source that will connect to the WideWorldImportersDW database. Name the new data source after the database, WideWorldImportersDW.

Next, we'll need data. Our data will come from the sales table, and we will want to sum our totals by year, so we can plot the years across the X axis. For the Y axis, we'll use the totals of two fields: TotalExcludingTax and TaxAmount. Here is the query by which we will accomplish this:

SELECT YEAR([Invoice Date Key]) AS InvoiceYear
      ,SUM([Total Excluding Tax]) AS TotalExcludingTax
      ,SUM([Tax Amount]) AS TotalTaxAmount
FROM [Fact].[Sale]
GROUP BY YEAR([Invoice Date Key])

How to do it…

1. Right-click on the Reports branch in the Solution Explorer. Go to Add | New Item… from the pop-up menu.
2. On the Add New Item dialog, select Report from the choice of templates in the middle (do not select Report Wizard). At the bottom, name the report Report 03-01 Multi Axis Charts.rdl and click on Add.
3. Go to the Report Data tool pane. Right-click on Data Sources and then click Add Data Source… from the menu.
4. In the Name: area, enter WideWorldImportersDW. Change the data source option to Use shared data source reference. In the dropdown, select WideWorldImportersDW. Click on OK to close the Data Source Properties window.
5. Right-click on the Datasets branch and select Add Dataset…. Name the dataset SalesTotalsOverTime.
6. Select the Use a dataset embedded in my report option. Select WideWorldImportersDW in the Data source dropdown. Paste in the query from the Getting ready area of this article. Once the dataset is configured, click on OK.
7. Next, go to the Toolbox pane. Drag and drop a Chart tool onto the report. Select the leftmost Line chart from the Select Chart Type window, and click on OK.
8. Resize the chart to a larger size. (For this demo, the exact size is not important. For your production reports, you can resize as needed using the Properties window.)
9. Click inside the main chart area to make the Chart Data dialog appear to the right of the chart.
10. Click on the + (plus) button to the right of Values. Select TotalExcludingTax. Click on the plus button again, and now pick TotalTaxAmount.
11. Click on the + (plus) button beside Category Groups, and pick InvoiceYear.
12. Click on Preview. You will note the large gap between the two graphed lines.
In addition, the values for the Total Tax Amount are almost impossible to guess, as shown in the preview.

13. Return to the designer, and again click in the chart area to make the Chart Data dialog appear.
14. In the Chart Data dialog, click on the dropdown beside TotalTaxAmount and select Series Properties….
15. Click on the Axes and Chart Area page, and for Vertical axis, select Secondary. Click on OK to close the Series Properties window.
16. Right-click on the numbers now appearing on the right in the vertical axis area, and select Secondary Vertical Axis Properties in the menu.
17. In the Axis Options, uncheck Always include zero.
18. Click on the Number page. Under Category, select Currency. Change the Decimal places to 0, and place a check in Use 1000 separator. Click on OK to close this window.
19. Now move to the vertical axis on the left-hand side of the chart, right-click, and pick Vertical Axis Properties.
20. Uncheck Always include zero. On the Number page, pick Currency, set Decimal places to 0, and check Use 1000 separator. Click on OK to close.
21. Click on the Preview tab to see the results.

You can now see a chart with a second axis. The monetary amounts are much easier to read. Further, the plotted lines have a similar rise and fall, indicating that the taxes collected matched the sales totals in terms of trending.

SSRS is capable of plotting multiple lines on a chart. Here we've placed just two fields, but you can add as many as you need; do realize, though, that the more lines you include, the harder the chart can become to read. All that is needed is to put the additional fields into the Values area of the Chart Data window. When these values are of a similar scale, for example sales broken up by state, this works fine. There are times, though, when the scale between plotted values is so great that it distorts the entire chart, leaving one value as a slender line at the top and another at the bottom, with a huge gap in the middle. To fix this, SSRS allows a second Y axis to be included. This creates a separate scale for the field (or fields) assigned to that axis in the Series Properties window.

To summarize, creating reports with multiple axes is much simpler with SQL Server 2016 Reporting Services. If you liked our post, check out the book SQL Server 2016 Reporting Services Cookbook to know more about different types of reports and Power BI integrations.

How to recognize Patterns with Neural Networks in Java

Kunal Chaudhari
04 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Fabio M. Soares and Alan M. F. Souza, titled Neural Network Programming with Java Second Edition. This book covers the current state of the art in the field of neural networks and helps you understand and design basic to advanced neural networks with Java.[/box]

Our article explores the power of neural networks in pattern recognition by showcasing how to recognize digits from 0 to 9 in an image. For pattern recognition, the neural network architectures that can be applied are MLPs (supervised) and the Kohonen network (unsupervised). In the first case, the problem should be set up as a classification problem, that is, the data should be transformed into the X-Y dataset, where for every data record in X there should be a corresponding class in Y. The output of the neural network for classification problems should cover all of the possible classes, and this may require preprocessing of the output records. In the other case, unsupervised learning, there is no need to apply labels to the output, but the input data should be properly structured. As a reminder, the schemas of both neural networks are shown in the book's accompanying figure.

Data pre-processing

We have to deal with all possible types of data, i.e., numerical (continuous and discrete) and categorical (ordinal or unscaled). However, here we have the possibility of performing pattern recognition on multimedia content, such as images and videos. So, can multimedia content be handled? The answer lies in the way these contents are stored in files. Images, for example, are written as a collection of small colored points called pixels. Each color can be coded in RGB notation, where the intensities of red, green, and blue define every color the human eye is able to see. Therefore, an image of dimension 100x100 would have 10,000 pixels, each one having three values for red, green, and blue, yielding a total of 30,000 points. That is the challenge for image processing in neural networks. Some methods may reduce this huge number of dimensions. Afterwards, an image can be treated as a big matrix of continuous numerical values. For simplicity, we are applying only gray-scale images with small dimensions in this article.

Text recognition (optical character recognition)

Many documents are now being scanned and stored as images, making it necessary to convert these documents back into text for a computer to apply editing and text processing. However, this feature involves a number of challenges:
- Variety of text fonts
- Text size
- Image noise
- Manuscripts

In spite of that, humans can easily interpret and read even text in a poor-quality image. This can be explained by the fact that humans are already familiar with text characters and the words in their language. Somehow the algorithm must become acquainted with these elements (characters, digits, signalization, and so on) in order to successfully recognize text in images.

Digit recognition

Although there are a variety of OCR tools available on the market, it still remains a big challenge for an algorithm to properly recognize text in images. So, we will restrict our application to a smaller domain, so that we face simpler problems. Therefore, in this article, we are going to implement a neural network to recognize digits from 0 to 9 represented in images. Also, the images will have standardized and small dimensions, for the sake of simplicity.
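Before moving on to the digit representation used in the rest of this article, here is a minimal, self-contained Java sketch of the pre-processing idea just described. It is not taken from the book's code, and the class and method names are our own: it reduces each pixel to a single gray value, flattens the image into the numeric vector a network consumes, and builds a 10-element one-hot target for a digit label.

import java.awt.image.BufferedImage;

public class DigitPreprocessor {

    // Flattens an image into a vector of values in [0, 1], one entry per pixel, row by row.
    public static double[] toInputVector(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        double[] input = new double[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                // Average the channels to obtain a gray value between 0 and 255,
                // then scale it to the [0, 1] range expected by the network
                double gray = (red + green + blue) / 3.0;
                input[y * width + x] = gray / 255.0;
            }
        }
        return input;
    }

    // Builds the 10-element one-hot target vector for a digit between 0 and 9:
    // the digit's position holds 1.0 and every other position holds 0.0.
    public static double[] toTargetVector(int digit) {
        double[] target = new double[10];
        target[digit] = 1.0;
        return target;
    }
}

For the 10x10 images used below, toInputVector returns 100 values; how those values are wired into the book's DigitExample class is framework-specific and not shown in this excerpt, so treat this purely as an illustration of the data layout.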
Digit representation

We applied the standard dimension of 10x10 (100 pixels) for the gray-scale images, resulting in 100 gray-scale values per image. In the corresponding figure, a sketch representing the digit 3 is shown on the left, with the matrix of gray values for the same digit on the right. We apply this pre-processing in order to represent all ten digits in this application.

Implementation in Java

To recognize optical characters, the data used to train and test the neural network was produced by us. In this example, gray-scale values from 0 (super black) to 255 (super white) were considered. According to pixel disposal, two versions of each digit's data were created: one to train and another to test. Classification techniques will be used here.

Generating data

Numbers from zero to nine were drawn in Microsoft Paint®. The images have been converted into matrices, of which some examples are shown in the following image; all the digit images from zero to nine are in grayscale. For each digit we generated five variations, where one is the perfect digit and the others contain noise, either in the drawing or in the image quality.

Each matrix row was merged into vectors (Dtrain and Dtest) to form a pattern that will be used to train and test the neural network. Therefore, the input layer of the neural network will be composed of 101 neurons. The output dataset was represented by ten patterns, each with one expressive value (one) while the rest of the values are zero. Therefore, the output layer of the neural network will have ten neurons.

Neural architecture

So, in this application our neural network will have 100 inputs (for images that have a 10x10 pixel size) and ten outputs, with the number of hidden neurons unrestricted. We created a class called DigitExample to handle this application. The neural network architecture was chosen with these parameters:
- Neural network type: MLP
- Training algorithm: Backpropagation
- Number of hidden layers: 1
- Number of neurons in the hidden layer: 18
- Number of epochs: 1000
- Minimum overall error: 0.001

Experiments

Now, as has been done in other cases previously presented, let's find the best neural network topology by training several nets.
The strategy to do that is summarized in the following list of experiments, which vary the learning rate and the activation functions of the hidden and output layers:
- Experiment #1: learning rate 0.3, hidden layer SIGLOG, output layer LINEAR
- Experiment #2: learning rate 0.5, hidden layer SIGLOG, output layer LINEAR
- Experiment #3: learning rate 0.8, hidden layer SIGLOG, output layer LINEAR
- Experiment #4: learning rate 0.3, hidden layer HYPERTAN, output layer LINEAR
- Experiment #5: learning rate 0.5, hidden layer SIGLOG, output layer LINEAR
- Experiment #6: learning rate 0.8, hidden layer SIGLOG, output layer LINEAR
- Experiment #7: learning rate 0.3, hidden layer HYPERTAN, output layer SIGLOG
- Experiment #8: learning rate 0.5, hidden layer HYPERTAN, output layer SIGLOG
- Experiment #9: learning rate 0.8, hidden layer HYPERTAN, output layer SIGLOG

The following DigitExample class code defines how to create a neural network to read the digit data:

// enter neural net parameter via keyboard (omitted)
// load dataset from external file (omitted)
// data normalization (omitted)
// create ANN and define parameters to TRAIN:
Backpropagation backprop = new Backpropagation(nn, neuralDataSetToTrain, LearningAlgorithm.LearningMode.BATCH);
backprop.setLearningRate( typedLearningRate );
backprop.setMaxEpochs( typedEpochs );
backprop.setGeneralErrorMeasurement(Backpropagation.ErrorMeasurement.SimpleError);
backprop.setOverallErrorMeasurement(Backpropagation.ErrorMeasurement.MSE);
backprop.setMinOverallError(0.001);
backprop.setMomentumRate(0.7);
backprop.setTestingDataSet(neuralDataSetToTest);
backprop.printTraining = true;
backprop.showPlotError = true;
// train ANN:
try {
    backprop.forward();
    //neuralDataSetToTrain.printNeuralOutput();
    backprop.train();
    System.out.println("End of training");
    if (backprop.getMinOverallError() >= backprop.getOverallGeneralError()) {
        System.out.println("Training successful!");
    } else {
        System.out.println("Training was unsuccessful");
    }
    System.out.println("Overall Error:" + String.valueOf(backprop.getOverallGeneralError()));
    System.out.println("Min Overall Error:" + String.valueOf(backprop.getMinOverallError()));
    System.out.println("Epochs of training:" + String.valueOf(backprop.getEpoch()));
} catch (NeuralException ne) {
    ne.printStackTrace();
}
// test ANN (omitted)

Results

After running each experiment using the DigitExample class and collecting the training and testing overall errors plus the number of correct classifications on the test data (listed below), it is possible to observe that experiments #2 and #4 have the lowest MSE values. The differences between these two experiments are the learning rate and the activation function used in the hidden layer.
- Experiment #1: training overall error 9.99918E-4, testing overall error 0.01221, 2 of 10 digits classified correctly
- Experiment #2: training overall error 9.99384E-4, testing overall error 0.00140, 5 of 10 digits classified correctly
- Experiment #3: training overall error 9.85974E-4, testing overall error 0.00621, 4 of 10 digits classified correctly
- Experiment #4: training overall error 9.83387E-4, testing overall error 0.02491, 3 of 10 digits classified correctly
- Experiment #5: training overall error 9.99349E-4, testing overall error 0.00382, 3 of 10 digits classified correctly
- Experiment #6: training overall error 273.70, testing overall error 319.74, 2 of 10 digits classified correctly
- Experiment #7: training overall error 1.32070, testing overall error 6.35136, 5 of 10 digits classified correctly
- Experiment #8: training overall error 1.24012, testing overall error 4.87290, 7 of 10 digits classified correctly
- Experiment #9: training overall error 1.51045, testing overall error 4.35602, 3 of 10 digits classified correctly

For experiment #2, the MSE evolution (train and test) per epoch shows that the curve stabilizes near the 30th epoch. The same graphic analysis was performed for experiment #8, where the MSE curve stabilizes near the 200th epoch. As already explained, MSE values alone should not be taken to attest to a neural net's quality; accordingly, the test dataset was used to verify the neural network's generalization capacity. The next comparison shows the real output (with noise) against the neural net's estimated output for experiments #2 and #8.
It is possible to conclude that the neural network weights from experiment #8 can recognize seven digit patterns, better than experiment #2's.

Real output (test dataset), one row per digit:
Digit 0: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Digit 1: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Digit 2: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
Digit 3: 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
Digit 4: 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Digit 5: 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
Digit 6: 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 7: 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 8: 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Digit 9: 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Estimated output (test dataset), experiment #2:
Digit 0 (OK):   0.20  0.26  0.09 -0.09  0.39  0.24  0.35  0.30  0.24  1.02
Digit 1 (ERR):  0.42 -0.23  0.39  0.06  0.11  0.16  0.43  0.25  0.17 -0.26
Digit 2 (ERR):  0.51  0.84 -0.17  0.02  0.16  0.27 -0.15  0.14 -0.34 -0.12
Digit 3 (OK):  -0.20 -0.05 -0.58  0.20 -0.16  0.27  0.83 -0.56  0.42  0.35
Digit 4 (ERR):  0.24  0.05  0.72 -0.05 -0.25 -0.38 -0.33  0.66  0.05 -0.63
Digit 5 (OK):   0.08  0.41 -0.21  0.41  0.59 -0.12 -0.54  0.27  0.38  0.00
Digit 6 (OK):  -0.76 -0.35 -0.09  1.25 -0.78  0.55 -0.22  0.61  0.51  0.27
Digit 7 (ERR): -0.15  0.11  0.54 -0.53  0.55  0.17  0.09 -0.72  0.03  0.12
Digit 8 (ERR):  0.03  0.41  0.49 -0.44 -0.01  0.05 -0.05 -0.03 -0.32 -0.30
Digit 9 (OK):   0.63 -0.47 -0.15  0.17  0.38 -0.24  0.58  0.07 -0.16  0.54

Estimated output (test dataset), experiment #8:
Digit 0 (OK):   0.10  0.10  0.12  0.10  0.12  0.13  0.13  0.26  0.17  0.39
Digit 1 (OK):   0.13  0.10  0.11  0.10  0.11  0.10  0.29  0.23  0.32  0.10
Digit 2 (OK):   0.26  0.38  0.10  0.10  0.12  0.10  0.10  0.17  0.10  0.10
Digit 3 (ERR):  0.10  0.10  0.10  0.10  0.10  0.17  0.39  0.10  0.38  0.10
Digit 4 (OK):   0.15  0.10  0.24  0.10  0.10  0.10  0.10  0.39  0.37  0.10
Digit 5 (ERR):  0.20  0.12  0.10  0.10  0.37  0.10  0.10  0.10  0.17  0.12
Digit 6 (OK):   0.10  0.10  0.10  0.39  0.10  0.16  0.11  0.30  0.14  0.10
Digit 7 (OK):   0.10  0.11  0.39  0.10  0.10  0.15  0.10  0.10  0.17  0.10
Digit 8 (ERR):  0.10  0.25  0.34  0.10  0.10  0.10  0.10  0.10  0.10  0.10
Digit 9 (OK):   0.39  0.10  0.10  0.10  0.28  0.10  0.27  0.11  0.10  0.21

The experiments shown in this article have taken into consideration 10x10 pixel images. We recommend that you try to use 20x20 pixel datasets to build a neural net able to classify digit images of this size. You should also change the training parameters of the neural net to achieve better classifications.

To summarize, we applied neural network techniques to perform pattern recognition on a series of numbers from 0 to 9 in an image. The application here can be extended to any type of character instead of digits, under the condition that the neural network is presented with all the predefined characters. If you enjoyed this excerpt, check out the book Neural Network Programming with Java Second Edition to know more about leveraging the multi-platform feature of Java to build and run your personal neural networks everywhere.

Write your first Blockchain: Learning Solidity Programming in 15 minutes

Aaron Lazar
03 Jan 2018
15 min read
[box type="note" align="" class="" width=""]This post is a book extract from the title Mastering Blockchain, authored by Imran Bashir. The book begins with the technical foundations of blockchain, teaching you the fundamentals of cryptography and how it keeps data secure.[/box]

Our article aims to quickly get you up to speed with blockchain development using the Solidity programming language.

Introducing solidity

Solidity is the domain-specific language of choice for programming contracts in Ethereum. There are, however, other languages, such as serpent, Mutan, and LLL, but solidity is the most popular at the time of writing. Its syntax is close to JavaScript and C. Solidity has evolved into a mature language over the last few years and is quite easy to use, but it still has a long way to go before it can become as advanced and feature-rich as other well-established languages. Nevertheless, it is the most widely used language available for programming contracts currently.

It is a statically typed language, which means that variable type checking in solidity is carried out at compile time. Each variable, either state or local, must be specified with a type at compile time. This is beneficial in the sense that any validation and checking is completed at compile time, and certain types of bugs, such as misinterpretation of data types, can be caught earlier in the development cycle instead of at run time, which could be costly, especially in the blockchain/smart contracts paradigm. Other features of the language include inheritance, libraries, and the ability to define composite data types. Solidity is also called a contract-oriented language. In solidity, contracts are equivalent to the concept of classes in other object-oriented programming languages.

Types

Solidity has two categories of data types: value types and reference types.

Value types

These are explained in detail here.

Boolean: This data type has two possible values, true or false, for example:

bool v = true;

This statement assigns the value true to v.

Integers: This data type represents integers. Various keywords are used to declare signed and unsigned integer types of different sizes; for example, in this code, note that uint is an alias for uint256:

uint256 x;
uint y;
int256 z;

These types can also be declared with the constant keyword, which means that no storage slot will be reserved by the compiler for these variables. In this case, each occurrence is replaced with the actual value:

uint constant z=10+10;

State variables are declared outside the body of a function, and they remain available throughout the contract, depending on the accessibility assigned to them, for as long as the contract persists.

Address: This data type holds a 160-bit long (20-byte) value. The type has several members that can be used to interact with and query contracts. These members are described here:

Balance: The balance member returns the balance of the address in Wei.

Send: This member is used to send an amount of ether to an address (Ethereum's 160-bit address) and returns true or false depending on the result of the transaction, for example:

address to = 0x6414cc08d148dce9ebf5a2d0b7c220ed2d3203da;
address from = this;
if (to.balance < 10 && from.balance > 50)
    to.send(20);

Call functions: call, callcode, and delegatecall are provided in order to interact with functions that do not have an Application Binary Interface (ABI).
These functions should be used with caution as they are not safe, due to their impact on the type safety and security of contracts.

Array value types (fixed size and dynamically sized byte arrays): Solidity has fixed size and dynamically sized byte arrays. Fixed size keywords range from bytes1 to bytes32, whereas dynamically sized keywords include bytes and string. bytes is used for raw byte data and string is used for strings encoded in UTF-8. As these arrays are returned by value, calling them will incur a gas cost. length is a member of array value types and returns the length of the byte array.

An example of a static (fixed size) array is as follows:

bytes32[10] bankAccounts;

An example of a dynamically sized array is as follows:

bytes32[] trades;

Get the length of trades:

trades.length;

Literals: These are used to represent a fixed value.

Integer literals: Integer literals are a sequence of decimal digits in the range of 0-9. An example is shown as follows:

uint8 x = 2;

String literals: String literals specify a set of characters written with double or single quotes. An example is shown as follows:

'packt'
"packt"

Hexadecimal literals: Hexadecimal literals are prefixed with the keyword hex and specified within double or single quotation marks. An example is shown as follows:

(hex'AABBCC');

Enums: These allow the creation of user-defined types. An example is shown as follows:

enum Order{ Filled, Placed, Expired };
Order private ord;
ord=Order.Filled;

Explicit conversion to and from all integer types is allowed with enums.

Function types: There are two function types: internal and external functions. Internal functions can be used only within the context of the current contract. External functions can be called via external function calls.

A function in solidity can be marked as constant. Constant functions cannot change anything in the contract; they only return values when they are invoked and do not cost any gas. This is the practical implementation of the concept of a call, as discussed in the previous chapter.

The syntax to declare a function is shown as follows:

function <nameofthefunction> (<parameter types> <name of the variable>) {internal|external} [constant] [payable] [returns (<return types> <name of the variable>)]

Reference types

As the name suggests, these types are passed by reference and are discussed in the following section.

Arrays: Arrays represent a contiguous set of elements of the same size and type laid out at a memory location. The concept is the same as in any other programming language. Arrays have two members, named length and push:

uint[] OrderIds;

Structs: These constructs can be used to group a set of dissimilar data types under a logical group. They can be used to define new types, as shown in the following example:

struct Trade
{
  uint tradeid;
  uint quantity;
  uint price;
  string trader;
}

Data location: Data location specifies where a particular complex data type will be stored. Depending on the default or the annotation specified, the location can be storage or memory. This is applicable to arrays and structs, and can be specified using the storage or memory keywords. As copying between memory and storage can be quite expensive, specifying a location can be helpful to control gas expenditure at times. Calldata is another memory location that is used to store function arguments. Parameters of external functions use calldata memory. By default, parameters of functions are stored in memory, whereas all other local variables make use of storage.
State variables, on the other hand, are required to use storage.

Mappings: Mappings are used for key-to-value mappings. This is a way to associate a value with a key. All values in a mapping are already initialized with zeroes, for example:

mapping (address => uint) offers;

This example shows that offers is declared as a mapping. Another example makes this clearer:

mapping (string => uint) bids;
bids["packt"] = 10;

This is basically a dictionary or a hash table where string values are mapped to integer values. The mapping named bids has the string value "packt" mapped to the value 10.

Global variables: Solidity provides a number of global variables that are always available in the global namespace. These variables provide information about blocks and transactions. Additionally, cryptographic functions and address-related variables are available as well. A subset of the available functions and variables is shown as follows:

keccak256(...) returns (bytes32)

This function is used to compute the keccak256 hash of the argument provided to the function.

ecrecover(bytes32 hash, uint8 v, bytes32 r, bytes32 s) returns (address)

This function returns the address associated with the public key recovered from the elliptic curve signature.

block.number

This returns the current block number.

Control structures: The control structures available in solidity are if-else, do, while, for, break, continue, and return. They work in a manner similar to C or JavaScript.

Events: Events in solidity can be used to log certain events in EVM logs. They are quite useful when external interfaces need to be notified of any change or event in the contract. These logs are stored on the blockchain in transaction logs. Logs cannot be accessed from within contracts but are used as a mechanism to notify a change of state or the occurrence of an event (meeting a condition) in the contract. In the simple example here, the valueEvent event is emitted and the function returns true if the x parameter passed to the function Matcher is equal to or greater than 10:

contract valueChecker
{
  uint8 price=10;
  event valueEvent(bool returnValue);
  function Matcher(uint8 x) returns (bool)
  {
    if (x>=price)
    {
      valueEvent(true);
      return true;
    }
  }
}

Inheritance: Inheritance is supported in solidity. The is keyword is used to derive a contract from another contract. In the following example, valueChecker2 is derived from the valueChecker contract. The derived contract has access to all non-private members of the parent contract:

contract valueChecker
{
  uint8 price=10;
  event valueEvent(bool returnValue);
  function Matcher(uint8 x) returns (bool)
  {
    if (x>=price)
    {
      valueEvent(true);
      return true;
    }
  }
}

contract valueChecker2 is valueChecker
{
  function Matcher2() returns (uint)
  {
    return price + 10;
  }
}

In the preceding example, if uint8 price = 10 is changed to uint8 private price = 10, then it will not be accessible by the valueChecker2 contract. This is because the member is now declared as private and is not allowed to be accessed by any other contract.

Libraries: Libraries are deployed only once at a specific address, and their code is called via the CALLCODE/DELEGATECALL opcodes of the EVM. The key idea behind libraries is code reusability. They are similar to contracts and act as base contracts to the calling contracts. A library can be declared as shown in the following example:

library Addition
{
  function Add(uint x,uint y) returns (uint z)
  {
    return x + y;
  }
}

This library can then be called in the contract, as shown here.
First, it needs to be imported; it can then be used anywhere in the code. A simple example is shown as follows:

import "Addition.sol"

function Addtwovalues() returns(uint)
{
  return Addition.Add(100,100);
}

There are a few limitations with libraries; for example, they cannot have state variables and cannot inherit or be inherited. Moreover, they cannot receive Ether either; this is in contrast to contracts, which can receive Ether.

Functions

Functions in solidity are modules of code that are associated with a contract. Functions are declared with a name, optional parameters, an access modifier, an optional constant keyword, and an optional return type. This is shown in the following example:

function orderMatcher(uint x) private constant returns(bool returnvalue)

In the preceding example, function is the keyword used to declare the function, orderMatcher is the function name, uint x is an optional parameter, private is the access modifier/specifier that controls access to the function from external contracts, constant is an optional keyword used to specify that this function does not change anything in the contract but is used only to retrieve values from it, and returns (bool returnvalue) is the optional return type of the function.

How to define a function: The syntax for defining a function is shown as follows:

function <name of the function>(<parameters>) <visibility specifier> returns (<return data type> <name of the variable>)
{
  <function body>
}

Function signature: Functions in solidity are identified by their signature, which is the first four bytes of the keccak-256 hash of the full signature string. This is also visible in browser solidity; for example, d99c89cb is the first four bytes of the 32-byte keccak-256 hash of the function named Matcher, so Matcher has the signature hash d99c89cb. This information is useful in order to build interfaces.

Input parameters of a function: Input parameters of a function are declared in the form of <data type> <parameter name>. This example clarifies the concept, where uint x and uint y are input parameters of the checkValues function:

contract myContract
{
  function checkValues(uint x, uint y)
  {
  }
}

Output parameters of a function: Output parameters of a function are declared in the form of <data type> <parameter name>. This example shows a simple function returning a uint value:

contract myContract
{
  function getValue() returns (uint z)
  {
    z=x+y;
  }
}

A function can return multiple values. In the preceding example, getValue only returns one value, but a function can return up to 14 values of different data types. The names of unused return parameters can optionally be omitted.

Internal function calls: Functions within the context of the current contract can be called internally in a direct manner. These calls are made to call functions that exist within the same contract. They result in simple JUMP calls at the EVM bytecode level.

External function calls: External function calls are made via message calls from one contract to another contract. In this case, all function parameters are copied to memory. If a call to an internal function is made using the this keyword, it is also considered an external call. The this variable is a pointer that refers to the current contract. It is explicitly convertible to an address, and all members of a contract are inherited from the address type.
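To make the difference between the two call types concrete, here is a small illustrative contract; it is not from the book, the contract and function names are invented, and it uses the same pre-0.5 solidity style as the rest of this excerpt:

contract CallDemo
{
  uint total;

  function add(uint x) returns (uint)
  {
    total = total + x;
    return total;
  }

  // Internal call: resolved as a simple JUMP, arguments are not copied to memory
  function runInternal() returns (uint)
  {
    return add(5);
  }

  // External call: using the this keyword turns it into a message call,
  // so the arguments are copied to memory
  function runExternal() returns (uint)
  {
    return this.add(5);
  }
}

Calling runExternal goes through a message call because of the this keyword, so its arguments are copied and it costs more gas than the simple JUMP performed by runInternal.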
Fall back functions: This is an unnamed function in a contract, with no arguments and no return data. This function executes every time ether is received. It is required to be implemented within a contract if the contract is intended to receive ether; otherwise, an exception will be thrown and the ether will be returned. This function also executes if no other function signature matches in the contract. If the contract is expected to receive ether, then the fall back function should be declared with the payable modifier; otherwise, the function will not be able to receive any ether. It can be called using the address.call() method, as, for example, in the following:

function ()
{
  throw;
}

In this case, if the fallback function is called according to the conditions described earlier, it will call throw, which rolls back the state to what it was before the call was made. It can also be some construct other than throw; for example, it can log an event that can be used as an alert to feed the outcome of the call back to the calling application.

Modifier functions: These functions are used to change the behavior of a function and can be called before other functions. Usually, they are used to check some conditions or perform verification before executing the function. The _ (underscore) used in a modifier is replaced with the actual body of the function when the modifier is called. Basically, it symbolizes the function that needs to be guarded. This concept is similar to guard functions in other languages.

Constructor function: This is an optional function that has the same name as the contract and is executed once a contract is created. Constructor functions cannot be called later on by users, and only one constructor is allowed in a contract. This implies that no overloading functionality is available.

Function visibility specifiers (access modifiers): Functions can be defined with four access specifiers, as follows:
- External: These functions are accessible from other contracts and transactions. They cannot be called internally unless the this keyword is used.
- Public: By default, functions are public. They can be called either internally or using messages.
- Internal: Internal functions are visible to other contracts derived from the parent contract.
- Private: Private functions are only visible to the contract they are declared in.

Other important keywords/functions

throw: throw is used to stop execution. As a result, all state changes are reverted. In this case, no gas is returned to the transaction originator because all the remaining gas is consumed.

Layout of a solidity source code file

Version pragma: In order to address compatibility issues that may arise from future versions of the solidity compiler, a pragma can be used to specify the version of the compatible compiler, as, for example, in the following:

pragma solidity ^0.5.0;

This ensures that the source file does not compile with compiler versions smaller than 0.5.0 or starting from 0.6.0.

Import: import in solidity allows the importing of symbols from existing solidity files into the current global scope. This is similar to the import statements available in JavaScript, as, for example, in the following:

import "module-name";

Comments: Comments can be added to a solidity source code file in a manner similar to C. Multi-line comments are enclosed in /* and */, whereas single-line comments start with //.
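Tying together several of the constructs described above (the version pragma, comments, a constructor, a modifier with the underscore placeholder, visibility, and throw), here is a small illustrative sketch; it is not from the book, all names are invented, and the ^0.4.0 pragma is assumed so that the old-style constructor and throw shown in this excerpt compile:

// A single-file example combining a version pragma, comments,
// a constructor, a modifier, and a visibility specifier.
pragma solidity ^0.4.0;

/* Multi-line comment:
   this contract restricts a privileged action to its creator. */
contract Restricted
{
  address private owner;   // private: visible only inside this contract

  // Constructor: has the same name as the contract and runs once at creation
  function Restricted()
  {
    owner = msg.sender;
  }

  // Modifier: the underscore is replaced by the body of the guarded function
  modifier onlyOwner()
  {
    if (msg.sender != owner) throw;
    _;
  }

  // Public (default visibility) function guarded by the modifier
  function shutdown() onlyOwner
  {
    // privileged logic would go here
  }
}

If an account other than the creator calls shutdown, the modifier's throw reverts the state before the function body ever runs.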
A complete solidity source file combines these elements, with a version pragma and any imports at the top, followed by commented contract code. To summarize, we went through a brief introduction to the solidity language. Detailed documentation and coding guidelines are available online. If you found this article useful, and would like to learn more about building blockchains, go ahead and grab the book Mastering Blockchain, authored by Imran Bashir.

2018 new year resolutions to thrive in an Algorithmic World - Part 1 of 3

Sugandha Lahoti
03 Jan 2018
6 min read
We often think of data science and machine learning as skills essential to a niche group of researchers, data scientists, and developers. But the world as we know it today revolves around data and algorithms, just as it used to revolve around programming a decade back. As data science and algorithms get integrated into all aspects of businesses across industries, data science, like Microsoft Excel, will become ubiquitous and will serve as a handy tool that makes you better at your job, no matter what your job is. Knowing data science is key to having a bright career in this algoconomy (algorithm-driven economy).

If you are big on new year resolutions, make yourself a promise to carve your place in the algorithm-powered world by becoming data science savvy. Follow these three resolutions to set yourself up for a bright data-driven career:
- Get the foundations right: Start with the building blocks of data science, i.e., developing your technical skills.
- Stay relevant: Keep yourself updated on the latest developments in your field and periodically invest in reskilling and upskilling.
- Be mindful of your impact: Finally, always remember that your work has real-world implications. Choose your projects wisely, and your project goals, hypotheses, and contributors with even more care.

In this three-part series, we expand on how data professionals could go about achieving these three resolutions, but the principles behind the ideas are easily transferable to anyone in any job. Think of them as algorithms that can help you achieve your desired professional outcome! You simply need to engineer the features and fine-tune the hyperparameters specific to your industry and job role.

1st Resolution: Learn the building blocks of data science

If you are interested in starting a career in data science, or in one that involves data, here is a simple learning roadmap for you to develop your technical skills:
- Start off by learning a data-friendly programming language, one that you find easy and interesting.
- Next, brush up your statistics skills. Nothing fancy; your high school math and stats will do nicely.
- Next, learn about algorithms - what they do, what questions they answer, how many types there are, and how to write one.
- Finally, put all that learning into practice by building models on top of your choice of machine learning framework.

Now let's see how you can accomplish each of these tasks.

1. Learn Python or another popular data-friendly programming language you find interesting (Learning period: 1 week - 2 months)

If you see yourself as a data scientist in the near future, knowing a programming language is one of the first things to check off your list. We suggest you learn a data-friendly programming language like Python or R. Python is a popular choice because of its strong, fast, and easy computational capabilities for the data science workflow. Moreover, because of its large and active community, the likelihood of finding someone in your team or your organization who knows Python is quite high, which is an added advantage.

"Python has become the most popular programming language for data science because it allows us to forget about the tedious parts of programming and offers us an environment where we can quickly jot down our ideas and put concepts directly into action." - Sebastian Raschka

We suggest learning the basics from the book Learn Python in 7 days by Mohit and Bhaskar N. Das.
Then you can move on to learning Python specifically for data science with Python Data Science Essentials by Alberto Boschetti. Additionally, you can learn R, which is a highly useful language when it comes to statistics and data. For learning R, we recommend R Data Science Essentials by Raja B. Koushik. You can learn more about how Python and R stand against each other in the data science domain here. Although R and Python are the most popular choices for new developers and aspiring data scientists, you can also use Java for data science, if that is your cup of tea. Scala is another alternative.

2. Brush up on statistics (Learning period: 1 week - 3 weeks)

While you are training your programming muscle, we recommend that you brush up on basic mathematics (probability and statistics). Remember, you already know everything you need to get started with data science from your high school days; you just need to refresh your memory with a little practice. A good place to start is to understand concepts like standard deviation, probability, mean, mode, variance, and kurtosis, among others. Your normal high-school books should be enough to get started; however, an in-depth understanding is required to leverage the power of data science. We recommend the book Statistics for Data Science by James D. Miller for this.

3. Learn what machine learning algorithms do and which ones to learn (Learning period: 1 month - 3 months)

Machine learning is a powerful tool for making predictions based on huge amounts of data. According to a recent study, in the next ten years ML algorithms are expected to replace a quarter of the jobs across the world, in fields like transport, manufacturing, architecture, healthcare, and many others. So the next step in your data science journey is learning about machine learning algorithms. There are new algorithms popping up almost every day. We've collated a list of the top ten algorithms that you should learn to effectively design reliable and robust ML systems. But fear not, you don't need to know all of them to get started. Start with some basic algorithms that are widely used in real-world applications, like linear regression, naive Bayes, and decision trees.

4. Learn TensorFlow, Keras, or another popular machine learning framework (Learning period: 1 month - 3 months)

After you have familiarized yourself with some of the machine learning algorithms, it is time to put that learning into practice by building models based on those algorithms. While there are many cloud-based machine learning options with click-based model-building features available, the best way to learn a skill is to get your hands dirty. There is a growing range of frameworks that make it easy to build complex models while allowing for high degrees of customization. Here is a list of the top 10 deep learning frameworks at your disposal to choose from. Our favorite pick is TensorFlow. It's Python-based, backed by Google, has very good documentation, and there are tons of tutorials and videos available on the internet to guide you. You can find a comprehensive list of books for learning TensorFlow here. We also recommend learning Keras, which is a good option if you have some knowledge of Python programming and want to get started with deep learning. Try the book Deep Learning with Keras, by Antonio Gulli and Sujit Pal, to get you started. If you find learning from multiple sources daunting, just learn from Sebastian Raschka's Python machine learning book.
Once you have got your fundamentals right, it is important to stay relevant through continuous learning and reskilling. Check out part 2, where we explore how you could go about doing this in a systematic and time-efficient manner. In part 3, we look at ways you can own your work and become aware of its outcomes.

Popular Data sources and models in SAP Analytics Cloud

Kunal Chaudhari
03 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud. This book deals with the basics of SAP Analytics Cloud (formerly known as SAP BusinessObjects Cloud) and unveils significant features for a beginner.[/box]

Our article provides a brief overview of the different data sources and models available in SAP Analytics Cloud.

A model is the foundation of every analysis you create to evaluate the performance of your organization. It is a high-level design that exposes the analytic requirements of end users. Planning and analytics are the two types of models you can create in SAP Analytics Cloud. Analytics models are simpler and more flexible, while planning models are full-featured models in which you work with planning features. Preconfigured with dimensions for time and categories, planning models support multi-currency and security features at both the model and dimension levels.

To determine what content to include in your model, you must first identify the source-data columns that users need to query. The columns you need in your model reside in some sort of data source. SAP Analytics Cloud supports three types of data sources: files (such as CSV or Excel files) that usually reside on your computer, live data connections from a connected remote system, and cloud apps. In addition to the files on your computer, you can use on-premise data sources, such as SAP Business Warehouse, SAP ERP, SAP Universe, SQL databases, and more, to acquire data for your models. In the cloud, you can get data from apps such as Concur, Google Drive, SAP Business ByDesign, SAP Hybris Cloud, OData Services, and SuccessFactors. The cloud app data sources sit beyond your firewall, while the on-premise sources reside in your local network. In all, there are over twenty data sources currently supported by SAP Analytics Cloud, and the methods of connecting to them vary. The instances provided in this article will give you an idea of how connections are established to acquire data; the connection methods covered here relate to both on-premise and cloud app data sources.

Create a direct live connection to SAP HANA

Execute the following steps to connect to an on-premise SAP HANA system to use live data in SAP Analytics Cloud. Live data means that you get up-to-the-minute data when you open a story in SAP Analytics Cloud; any changes made to the data in the source system are reflected immediately. Usually, there are two ways to establish a connection to a data source: use the Connection option from the main menu, or specify the data source during the process of creating a model. However, live data connections must be established via the Connection menu option prior to creating the corresponding model. Here are the steps:

1. From the main menu, select Connection.
2. On the Connections page, click on the Add Connection icon (+), and select Live Data Connection | SAP HANA.
3. In the New Live Connection dialog, enter a name for the connection (for example, HANA).
4. From the Connection Type drop-down list, select Direct. The Direct option is used when you connect to a data source that resides inside your corporate network. The Path option requires a reverse proxy to the HANA XS server.
The SAP Cloud Platform and Cloud options in this list are used when you are connecting to SAP cloud environments. When you select the Direct option, the System Type is set to HANA and the protocol is set to HTTPS.
5. Enter the hostname and port number in the respective text boxes.
6. The Authentication Method list contains two options: User Name and Password, and SAML Single Sign On. The SAML Single Sign On option requires that the SAP HANA system is already configured to use SAML authentication. If not, choose the User Name and Password option and enter these credentials in the relevant boxes.
7. Click on OK to finish the process.

A new connection will appear on the Connections page, which can now be used as a data source for models. To complete this exercise, we will go through a short demo of this process here:

1. From the main menu, go to Create | Model.
2. On the New Model page, select Use a datasource.
3. From the list that appears on your right side, select Live Data connection.
4. In the dialog that is displayed, select the HANA connection you created in the previous steps from the System list.
5. From the Data Source list, select the HANA view you want to work with. The list of views may be very long, and a search feature is available to help you locate the source you are looking for.
6. Finally, enter a name and an optional description for the new model, and click on OK. The model will be created, and its definitions will appear on another page.

Connecting remote systems to import data

In addition to creating live connections, you can also create connections that allow you to import data into SAP Analytics Cloud. In these types of connections to remote systems, data is imported (copied) to SAP Analytics Cloud, and any changes users make in the source data do not affect the imported data. To establish connections with these remote systems, you need to install some additional components. For example, you must install the SAP HANA Cloud connector to access SAP Business Planning and Consolidation (BPC) for NetWeaver. Similarly, the SAP Analytics Cloud agent should be installed for SAP Business Warehouse (BW), SQL Server, SAP ERP, and others.

The following set of steps provides instructions to connect to SAP ERP. You can either connect to this system from the Connection menu or establish the connection while creating a model. In these steps, we will adopt the latter approach:

1. From the main menu, go to Create | Model.
2. Click on the Use a datasource option on the choose how you'd like to start your model page.
3. From the list of available datasources to your right, select SAP ERP.
4. From the Connection Name list, select Create New Connection.
5. Enter a name for the connection (for example, ERP) in the Connection Name box. You can also provide a description to further elaborate on the new connection.
6. For Server Type, select Application Server and enter values for System, System Number, Client ID, System ID, Language, User Name, and Password. Click the Create button after providing this information.
7. Next, you need to create a query based on the SAP ERP system data. Enter a name for the query, for example, sales.
8. In the same dialog, expand the ERP object where the data exists. Locate and select the object, and then choose the data columns you want to include in your model. You are provided with a preview of the data before importing. On the preview window, click on Done to start the import process.
The imported data will appear on the Data Integration page, which is the initial screen in the model creation segment.

Connect Google Drive to import data

You went through two scenarios in which you saw how data can be fetched. In the first scenario, you created a live connection to create a model on live data, while in the second one, you learned how to import data from remote systems. In this section, you will be guided through creating a model using a cloud app called Google Drive. Google Drive is a file storage and synchronization service developed by Google. It allows users to store files in the cloud, synchronize files across devices, and share files. Here are the steps to use the data stored on Google Drive:

1. From the main menu, go to Create | Model.
2. On the choose how you'd like to start your model page, select Get data from an app.
3. From the available apps to your right, select Google Drive.
4. In the Import Model From Google Drive dialog, click on the Select Data button.
5. If you are not already logged into Google Drive, you will be prompted to log in.
6. Another dialog appears, displaying a list of compatible files. Choose a file, and click on the Select button.
7. You are brought back to the Import Model From Google Drive dialog, where you have to enter a model name and an optional description. After providing this information, click on the Import button.
8. The import process will start, and after a while you will see the Data Integration screen populated with the data from the selected Google Drive file.

Refreshing imported data

SAP Analytics Cloud allows you to refresh your imported data. With this option, you can re-import the data on demand to get the latest values. You can perform this refresh operation manually, or create an import schedule to refresh the data at a specific date and time or on a recurring basis. The following data sources support scheduling:
- SAP Business Planning and Consolidation (BPC)
- SAP Business Warehouse (BW)
- Concur
- OData services
- An SAP Analytics BI platform universe (UNX) query
- SAP ERP Central Component (SAP ECC)
- SuccessFactors HCM suite
- Excel and comma-separated values (CSV) files imported from a file server (not imported from your local machine)
- SQL databases

You can adopt the following method to access the schedule settings for a model:

1. Select Connection from the main menu. The Connection page appears. The Schedule Status tab on this page lists all update and import jobs associated with any data source. Alternatively, go to main menu | Browse | Models. The Models page appears. An updatable model on the list will have a number of data sources shown in the Datasources column.
2. In the Datasources column, click on the View More link. The update and import jobs associated with this data source will appear. Update Model and Import Data jobs are the two types of jobs that are run either immediately or on a schedule.
3. To run an Import Data job immediately, choose Import Data in the Action column. If you want to run an Update Model job, select a job to open it.

The following refreshing methods specify how you want existing data to be handled. The Import Data job options are listed here:
- Update: Selecting this option updates the existing data and adds new entries to the target model.
- Clean and Replace: Any existing data is wiped out and new entries are added to the target model.
- Append: Nothing is done with the existing data. Only new entries are added to the target model.
The Update Model jobs are listed here:

- Clean and Replace: This deletes the existing data and adds new entries to the target model.
- Append: This keeps the existing data as is and adds new entries to the target model.

The Schedule Settings option allows you to select one of the following schedule options:

- None: The import is performed immediately.
- Once: The import is performed only once at a scheduled time.
- Repeating: The import is executed according to a repeating pattern; you can select a start and end date and time as well as a recurrence pattern.

After setting your preferences, click on the Save icon to save your scheduling settings. If you chose the None option for scheduling, select Update Model or Import Data to run the update or import job now. Once a scheduled job completes, its result appears on the Schedule Status tab, displaying any errors or warnings. If you see such messages, select the job to see the details. Expand an entry in the Refresh Manager panel to get more information about the problem. If the import process rejected any rows in the dataset, you are provided with an option to download the rejected rows as a CSV file for offline examination. Fix the data in the source system, or fix the error in the downloaded CSV file and upload data from it.

After creating your models, you access them via the main menu | Browse | Models path. The Models page, as illustrated in the following figure, is the main interface where you manage your models. All existing models are listed under the Models tab. You can open a model by clicking on its name. Public dimensions are saved separately from models and appear on the Public Dimensions tab. When you create a new model or modify an existing model, you can add these public dimensions. If you are using multiple currencies in your data, the exchange rates are maintained in separate tables. These are saved independently of any model and are listed on the Currency Conversion tab. Data for geographic locations, which are displayed and used in your data analysis, is maintained on the Points of Interest tab.

The toolbar provided under the four tabs carries icons to perform common operations for managing models. Click on the New Model icon to create a new model. Select a model by placing a check mark (A) in front of it. Then click on the Copy Selected Model icon to make an exact copy of the selected model. Use the delete icon to remove the selected models. The Clear Selected Model option removes all the data from the selected model. The list of supported data import options is available from a menu beneath the Import Data icon on the toolbar. You can export a model to a .csv file once or on a recurring schedule using Export Model As File.

SAP Analytics Cloud can help transform how you discover, plan, predict, collaborate, visualize, and extend, all in one solution. In addition to on-premise data sources, you can fetch data from a variety of other cloud apps and even from Excel and text files to build your data models and then create stories based on these models. If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud to learn more about professional data analysis using different types of charts, tables, geo maps, and more with SAP Analytics Cloud.
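As a hedged aside (not taken from the book or the product documentation), a rejected-rows CSV downloaded from the Schedule Status tab can be examined offline with a few lines of pandas before fixing and re-uploading it. The file name and the clean-up steps below are purely illustrative assumptions:

import pandas as pd

# Hypothetical file name; use whatever name SAP Analytics Cloud gave the download
rejected = pd.read_csv("rejected_rows.csv")

# Inspect what came back and where values are missing
print(rejected.head())
print(rejected.isnull().sum())

# Illustrative fixes: drop completely empty rows and strip stray whitespace from text columns
rejected = rejected.dropna(how="all")
text_cols = rejected.select_dtypes(include="object").columns
rejected[text_cols] = rejected[text_cols].apply(lambda col: col.str.strip())

# Save the corrected rows to a new file for re-upload
rejected.to_csv("rejected_rows_fixed.csv", index=False)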


Getting started with Linear and logistic regression

Richa Tripathi
03 Jan 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials - Second Edition. This book provides the fundamentals of data science with Python by leveraging the latest tools and libraries such as Jupyter notebooks, NumPy, pandas and scikit-learn.[/box] In this article, we will learn about two easy and effective classifiers known as linear and logistic regressors. Linear and logistic regressions are the two methods that can be used to linearly predict a target value or a target class, respectively. Let's start with an example of linear regression predicting a target value. In this article, we will again use the Boston dataset, which contains 506 samples, 13 features (all real numbers), and a (real) numerical target (which renders it ideal for regression problems). We will divide our dataset into two sections by using a train/test split cross- validation to test our methodology (in the example, 80 percent of our dataset goes in training and 20 percent in test): In: from sklearn.datasets import load_boston boston = load_boston() from sklearn.cross_validation import train_test_split X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=0) The dataset is now loaded and the train/test pairs have been created. In the next few steps, we're going to train and fit the regressor in the training set and predict the target variable in the test dataset. We are then going to measure the accuracy of the regression task by using the MAE score. As for the scoring function, we decided on the mean absolute error in order to penalize errors just proportionally to the size of the error itself (using the more common mean squared error would have emphasized larger errors more, since errors are squared): In: from sklearn.linear_model import LinearRegression regr = LinearRegression() regr.fit(X_train, Y_train) Y_pred = regr.predict(X_test) from sklearn.metrics import mean_absolute_error print ("MAE", mean_absolute_error(Y_test, Y_pred)) Out: MAE 3.84281058945 Great! We achieved our goal in the simplest possible way. Now, let's take a look at the time needed to train the system: In: %timeit regr.fit(X_train, y_train) Out: 1000 loops, best of 3: 381 µs per loop That was really quick! The results, of course, are not all that great. However, linear regression offers a very good trade-off between performance and speed of training and simplicity. Now, let's take a look under the hood of the algorithm. Why is it so fast but not that accurate? The answer is somewhat expected-this is so because it's a very simple linear method. Let's briefly dig into a mathematical explanation of this technique. Let's name X(i) the ith sample (it is actually a row vector of numerical features) and Y(i) its target. The goal of linear regression is to find a good weight (column) vector W, which is best suited for approximating the target value when multiplied by the observation vector, that is, X(i) * W ≈ Y(i) (note that this is a dot product). W should be the same, and the best for every observation. Thus, solving the following equation becomes easy: W can be found easily with the help of a matrix inversion (or, more likely, a pseudo- inversion, which is a computationally efficient way) and a dot product. Here's the reason linear regression is so fast. Note that this is a simplistic explanation—the real method adds another virtual feature to compensate for the bias of the process. 
However, this does not change the complexity of the regression algorithm much. We progress now to logistic regression. In spite of what the name suggests, it is a classifier and not a regressor. It must be used in classification problems where you are dealing with only two classes (binary classification). Typically, target labels are Boolean; that is, they have values of either True/False or 0/1 (indicating the presence or absence of the expected outcome). In our example, we keep on using the same dataset. The target is to guess whether a house value is over or under the average of a threshold value we are interested in. In essence, we moved from a regression problem to a binary classification one because now our target is to guess how likely an example is to be a part of a group. We start preparing the dataset by using the following commands:

In: import numpy as np
    avg_price_house = np.average(boston.target)
    high_priced_idx = (Y_train >= avg_price_house)
    Y_train[high_priced_idx] = 1
    Y_train[np.logical_not(high_priced_idx)] = 0
    Y_train = Y_train.astype(np.int8)
    high_priced_idx = (Y_test >= avg_price_house)
    Y_test[high_priced_idx] = 1
    Y_test[np.logical_not(high_priced_idx)] = 0
    Y_test = Y_test.astype(np.int8)

Now, we will train and apply the classifier. To measure its performance, we will simply print the classification report:

In: from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression()
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    from sklearn.metrics import classification_report
    print (classification_report(Y_test, Y_pred))

Out:              precision    recall  f1-score   support
              0       0.81      0.90      0.85        61
              1       0.82      0.68      0.75        41
    avg / total       0.83      0.81      0.81       102

The output of this command can change on your machine depending on the optimization process of the LogisticRegression classifier (no seed has been set for replicability of the results). The precision and recall values are over 80 percent. This is already a good result for a very simple method. The training speed is impressive, too. Thanks to Jupyter Notebook, we can have a comparison of the algorithm with a more advanced classifier in terms of performance and speed:

In: %timeit clf.fit(X_train, Y_train)

Out: 100 loops, best of 3: 2.54 ms per loop

What's under the hood of a logistic regression? The simplest classifier a person could imagine (apart from a mean) is a linear regressor followed by a hard threshold:

y_pred(i) = sign(X(i) * W)

Here, sign(a) = +1 if a is greater than or equal to zero, and 0 otherwise. To smooth down the hardness of the threshold and predict the probability of belonging to a class, logistic regression resorts to the logistic (sigmoid) function. Its output is a real number in the open interval (0, 1) (0.0 and 1.0 are attainable only via rounding, otherwise the function just tends toward them), which indicates the probability that the observation belongs to class 1. Using a formula, that becomes:

P(Y(i) = 1 | X(i)) = 1 / (1 + exp(-z))

Here, z = X(i) * W.

Why the logistic function instead of some other function? Well, because it just works pretty well in most real cases. In the remaining cases, if you're not completely satisfied with its results, you may want to try some other nonlinear functions instead (there is a limited variety of suitable ones, though).

To summarize, we learned about two classic algorithms used in machine learning, namely linear and logistic regression. With the help of an example, we put the theory into practice by predicting a target value, which helped us understand the trade-offs and benefits.
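As a short, hedged illustration (again, not part of the book's code), the class-1 probabilities produced by the logistic function described above can be recomputed by hand from the fitted classifier; clf and X_test are the objects created in the snippets above:

import numpy as np

# Linear score z = X * W + b from the fitted classifier
z = X_test.dot(clf.coef_.ravel()) + clf.intercept_[0]

# The logistic (sigmoid) function maps the score into the open interval (0, 1)
prob_class1 = 1.0 / (1.0 + np.exp(-z))

# Typically prints True: matches the probabilities reported by scikit-learn
print(np.allclose(prob_class1, clf.predict_proba(X_test)[:, 1]))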
If you enjoyed this excerpt, check out the book Python Data Science Essentials - Second Edition to learn more about other popular machine learning algorithms such as Naive Bayes, k-Nearest Neighbors (kNN), and Support Vector Machines (SVM).

6 Popular Regression Techniques You Need To Know

Amey Varangaonkar
02 Jan 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Statistics for Data Science, authored by IBM certified expert James D. Miller. This book gives you a statistical view of building smart data models that help you get unique insights from your data.[/box] In this article, the author introduces you to the concept of regression analysis, one of the most popular machine learning algorithms - what it is, the different types of regression, and how to choose the right regression technique to build your data model. What is Regression Analysis? For starters, regression analysis or statistical regression is a process for estimating the relationships among variables. This process encompasses numerous techniques for modeling and analyzing variables, focusing on the relationship between a dependent variable and one (or more) independent variables (or predictors). Regression analysis is the work done to identify and understand how the (best representative) value of a dependent variable (a variable that depends on other factors) changes when any one of the independent variables (a variable that stands alone and isn't changed by the other variables) is changed while the other independent variables stay the same. A simple example might be how the total dollars spent on marketing (an independent variable example) impacts the total sales dollars (a dependent variable example) over a period of time (is it really as simple as more marketing equates to higher sales?), or perhaps there is a correlation between the total marketing dollars spent (independent variable), discounting a products price (another independent variable), and the amount of sales (a dependent variable)? [box type="info" align="" class="" width=""]Keep in mind this key point that regression analysis is used to understand which among the independent variables are related to the dependent variable(s), not just the relationship of these variables. Also, the inference of causal relationships (between the independent and dependent variables) is an important objective. However, this can lead to illusions or false relationships, so caution is recommended![/box] Overall, regression analysis can be thought of as estimating the conditional expectations of the value of the dependent variable, given the independent variables being observed, that is, endeavoring to predict the average value of the dependent variable when the independent variables are set to certain values. I call this the lever affect—meaning when one increases or decreases a value of one component, it directly affects the value at least one other (variable). An alternate objective of the process of regression analysis is the establishment of location parameters or the quantile of a distribution. In other words, this idea is to determine values that may be a cutoff, dividing a range of a probability distribution values. You'll find that regression analysis can be a great tool for prediction and forecasting (not just complex machine learning applications). We'll explore some real-world examples later, but for now, let's us look at some techniques for the process. Popular techniques and approaches for regression You'll find that various techniques for carrying out regression analysis have been developed and accepted.These are: Linear Logistic Polynomial Stepwise Ridge Lasso Linear regression Linear regression is the most basic type of regression and is commonly used for predictive analysis projects. 
In fact, when you are working with a single predictor (variable), we call it simple linear regression, and if there are multiple predictor variables, we call it multiple linear regression. Simply put, linear regression uses linear predictor functions whose values are estimated from the data in the model.

Logistic regression

Logistic regression is a regression model where the dependent variable is a categorical variable. This means that the variable only has two possible values, for example, pass/fail, win/lose, alive/dead, or healthy/sick. If the dependent variable has more than two possible values, one can use various modified logistic regression techniques, such as multinomial logistic regression, ordinal logistic regression, and so on.

Polynomial regression

When we speak of polynomial regression, the focus of this technique is on modeling the relationship between the independent variable and the dependent variable as an nth degree polynomial. Polynomial regression is considered to be a special case of multiple linear regression. The predictors resulting from the polynomial expansion of the baseline predictors are known as interactive features.

Stepwise regression

Stepwise regression is a technique that uses some kind of automated procedure to continually execute a step of logic, that is, during each step, a variable is considered for addition to or subtraction from the set of independent variables based on some prespecified criterion.

Ridge regression

Often, predictor variables are identified as being interrelated. When this occurs, the regression coefficient of any one variable depends on which other predictor variables are included in the model and which ones are left out. Ridge regression is a technique where a small bias factor is added to the selected variables in order to improve this situation. Therefore, ridge regression is actually considered a remedial measure to alleviate multicollinearity amongst predictor variables.

Lasso regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a technique where both predictor variable selection and regularization are performed in order to improve the prediction accuracy and interpretability of the result it produces.

Which technique should I choose?

In addition to the aforementioned regression techniques, there are numerous others to consider, with, most likely, more to come. With so many options, it's important to choose the technique that is right for your data and your project. Rather than selecting the right regression approach, it is more about selecting the most effective regression approach. Typically, you use the data to identify the regression approach you'll use. You start by establishing statistics or a profile for your data. With this effort, you need to identify and understand the importance of the different variables, their relationships, coefficient signs, and their effect. Overall, here's some generally good advice for choosing the right regression approach for your project:

- Copy what others have done and had success with. Do the research. Incorporate the results of other projects into yours. Don't reinvent the wheel. Also, even if an observed approach doesn't quite fit as it was used, perhaps some simple adjustments would make it a good choice.
- Keep your approach as simple as possible. Many studies show that simpler models generally produce better predictions. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
- Check your work. As you evaluate methods, check the residual plots (more on this in the next section of this chapter) because they can help you avoid inadequate models and adjust your model for better results.
- Use your subject matter expertise. No statistical method can understand the underlying process or subject area the way you do. Your knowledge is a crucial part and, most likely, the most reliable way of determining the best regression approach for your project.

Does it fit?

After selecting a model that you feel is appropriate for use with your data (also known as determining that the approach is the best fit), you need to validate your selection, that is, determine its fit. A well-fitting regression model results in predicted values close to the observed data values. The mean model (which uses the mean for every predicted value) would generally be used if there were no informative predictor variables. The fit of a proposed regression model should, therefore, be better than the fit of the mean model. As a data scientist, you will need to scrutinize the coefficients of determination, measure the standard error of estimate, and analyze the significance of regression parameters and confidence intervals (a short scikit-learn sketch of such a check follows at the end of this excerpt).

[box type="info" align="" class="" width=""]Remember that the better the fit of a regression model, the better the precision of, and generally the better, the results.[/box]

Finally, it has been shown time and again that simpler models produce more accurate results! Keep this always in mind when selecting an approach or technique, and even when the problem might be complex, it is not always obligatory to adopt a complex regression approach. Choosing the right technique, though, goes a long way in developing an accurate model.

If you found this excerpt to be useful, make sure you check out our book Statistics for Data Science for more such tips on building effective data models by leveraging the power of the statistical tools and techniques.
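To make the advice above concrete, here is a minimal, hedged scikit-learn sketch that is not taken from the book: it fits plain linear, ridge, and lasso regressions on a synthetic dataset and compares a simple goodness-of-fit measure (R²) on held-out data. All names, alpha values, and dataset parameters are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data with several noisy predictors
X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # small bias factor to tame interrelated predictors
    "lasso": Lasso(alpha=0.1),   # shrinks some coefficients all the way to zero
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # R^2 on held-out data: closer to 1 means a better fit than the mean model
    print(name, round(model.score(X_test, y_test), 3))

The residuals (y_test - model.predict(X_test)) can then be plotted to spot patterns that indicate an inadequate model, as the excerpt suggests.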


How to create a Treemap and Packed Bubble Chart in Tableau

Sugandha Lahoti
02 Jan 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shweta Sankhe-Savale titled Tableau Cookbook – Recipes for Data Visualization. This cookbook has simple recipes for creating visualizations in Tableau. It covers the fundamentals of data visualization such as getting familiarized with Tableau Desktop and also goes to more complex problems like creating dynamic analytics with parameters, and advanced calculations.[/box] In today’s tutorial, we will learn how to create a Treemap and a packed Bubble chart in Tableau. Treemaps Treemaps are useful for representing hierarchical (tree-structured) data as a part-to-whole relationship. It shows data as a set of nested rectangles, and each branch of the tree is given a rectangle, which represents the amount of data it comprises. These can then be further divided into smaller rectangles that represent sub branches, based on its proportion to the whole. We can show information via the color and size of the rectangles and find out patterns that would be difficult to spot in other ways. They make efficient use of the space and hence can display a lot of items in a single visualization simultaneously. Getting Ready We will create a Treemap to show the sales and profit across various product subcategories. Let's see how to create a Treemap. How to Do it We will first create a new sheet and rename it as Treemap. Next, we will drag Sales from the Measures pane and drop it into the Size shelf. We will then drag Profit from Measures pane and drop it into the Color shelf. Our Mark type will automatically change to show squares. Refer to the following image: 5. Next, we will drop Sub-Category into the Label shelf in the Marks card, and we will get the output as shown in the following image: How it Works In the preceding image, since we have placed Sales in the Size shelf, we are inferring this: the greater the size, the higher the sales value; the smaller the size, the smaller the sales value. Since the Treemap is sorted in descending order of Size, we will see the biggest block in the top left-hand side corner and the smaller block in the bottom right-hand side corner. Further, we placed Profit in the Color shelf. There are some subcategories where the profit is negative and hence Tableau selects the orange/blue diverging color. Thus, when the color blue is the darkest, it indicates the Most profit. However, the orange color indicates that a particular subcategory is in a loss scenario. So, in the preceding chart, Phones has the maximum number of sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable. Packed Bubble Charts A Packed bubble chart is a cluster of circles where we use dimensions to define individual bubbles, and the size and/or color of the individual circles represent measures. Bubble charts have many benefits and one of them is to let us spot categories easily and compare them to the rest of the data by looking at the size of the bubble. This simple data visualization technique can provide insight in a visually attractive format. The Packed Bubble chart in Tableau uses the Circle mark type. Getting Ready To create a packed bubble chart, we will continue with the same example that we saw in the Treemap recipe. In the following section, we will see how we can convert the Treemap we created earlier into a Packed Bubble chart. How to Do it Let us duplicate the Tree Map sheet name and rename it to Packed Bubble chart. 
2. Next, change the marks from Square to Circle from the Marks dropdown in the Marks card. The output will be as shown in the following image:

How it works

In the Packed Bubble chart, there is no specific sort order for the bubbles. The size and/or color are what define the chart; the bigger or darker the circle, the greater the value. So, in the preceding example, we have Sales in the Size shelf, Profit in the Color shelf, and Sub-Category in the Label shelf. Thus, when we look at it, we understand that Phones has the most sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable even though the size indicates that the sales are fairly good.

We saw two ways to visualize data by using the Treemap and Packed Bubble chart types in Tableau. If you found this post useful, do check out the book Tableau Cookbook – Recipes for Data Visualization to create more such charts, interactive dashboards and other beautiful data visualizations with Tableau.