
How-To Tutorials - Data


Classifier Construction

Packt
01 Jun 2016
8 min read
In this article by Pratik Joshi, author of the book Python Machine Learning Cookbook, we will build a simple classifier using supervised learning and then go on to build a logistic regression classifier.

Building a simple classifier

In the field of machine learning, classification refers to the process of using the characteristics of data to separate it into a certain number of classes. A supervised learning classifier builds a model using labeled training data and then uses this model to classify unknown data. Let's take a look at how to build a simple classifier.

(For more resources related to this topic, see here.)

How to do it…

Before we begin, make sure that you have imported the numpy and matplotlib.pyplot packages. After this, let's create some sample data:

X = np.array([[3,1], [2,5], [1,8], [6,4], [5,2], [3,5], [4,7], [4,-1]])

Let's assign some labels to these points:

y = [0, 1, 1, 0, 0, 1, 1, 0]

As we have only two classes, the list y contains 0s and 1s. In general, if you have N classes, then the values in y will range from 0 to N-1. Let's separate the data into classes based on the labels:

class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])

In the preceding two lines, we just use the mapping between X and y to create two lists. To get an idea about our data, let's plot it, as follows:

plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')

This is a scatterplot where we use squares and crosses to plot the points. In this context, the marker parameter specifies the shape that you want to use. We use squares to denote points in class_0 and crosses to denote points in class_1. If you run this code, you will see the following figure:

If you were asked to inspect the datapoints visually and draw a separating line, what would you do? You would simply draw a line between them. Let's go ahead and do this:

line_x = range(10)
line_y = line_x

We just created a line with the mathematical equation y = x. Let's plot it, as follows:

plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], color='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], color='black', marker='x')
plt.plot(line_x, line_y, color='black', linewidth=3)
plt.show()

If you run this code, you should see the following figure:

There's more…

We built a really simple classifier using the following rule: the input point (a, b) belongs to class_0 if a is greater than or equal to b; otherwise, it belongs to class_1. If you inspect the points one by one, you will see that this is true. That's it! You just built a linear classifier that can classify unknown data. It's a linear classifier because the separating line is a straight line; if it were a curve, it would be a nonlinear classifier.

This approach worked fine because there were a limited number of points and we could visually inspect them. What if there are thousands of points? How do we generalize this process? Let's discuss this in the next section.

Building a logistic regression classifier

Despite the word regression being present in the name, logistic regression is actually used for classification purposes. Given a set of datapoints, our goal is to build a model that can draw linear boundaries between our classes. It extracts these boundaries by solving a set of equations derived from the training data.
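Before moving on to the logistic regression recipe, here is a minimal, self-contained sketch that pulls the simple classifier steps above into one runnable script. The threshold rule at the end (a >= b maps to class_0) is the same one described in the There's more… section; the two "unknown" points passed to it are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data and labels from the recipe
X = np.array([[3, 1], [2, 5], [1, 8], [6, 4], [5, 2], [3, 5], [4, 7], [4, -1]])
y = [0, 1, 1, 0, 0, 1, 1, 0]

# Separate the points into the two classes using the labels
class_0 = np.array([X[i] for i in range(len(X)) if y[i] == 0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i] == 1])

# The "model": point (a, b) belongs to class_0 if a >= b, otherwise class_1
def simple_classifier(point):
    a, b = point
    return 0 if a >= b else 1

# Classify a couple of unknown points (illustrative values)
for p in [(4, 2), (2, 6)]:
    print(p, '->', 'class_0' if simple_classifier(p) == 0 else 'class_1')

# Plot the data and the separating line y = x
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], color='black', marker='s')
plt.scatter(class_1[:, 0], class_1[:, 1], color='black', marker='x')
plt.plot(range(10), range(10), color='black', linewidth=3)
plt.show()
```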
Let's see how to build the logistic regression classifier in Python. We will use the logistic_regression.py file that is already provided to you as a reference. Assuming that you have imported the necessary packages, let's create some sample data along with training labels:

X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2], [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

Here, we assume that we have three classes. Let's initialize the logistic regression classifier:

classifier = linear_model.LogisticRegression(solver='liblinear', C=100)

There are a number of input parameters that can be specified for the preceding function, but a couple of important ones are solver and C. The solver parameter specifies the type of solver that the algorithm will use to solve the system of equations. The C parameter controls the regularization strength; a lower value indicates higher regularization strength. Let's train the classifier:

classifier.fit(X, y)

Let's draw the datapoints and boundaries:

plot_classifier(classifier, X, y)

We need to define this function:

def plot_classifier(classifier, X, y):
    # define ranges to plot the figure
    x_min, x_max = min(X[:, 0]) - 1.0, max(X[:, 0]) + 1.0
    y_min, y_max = min(X[:, 1]) - 1.0, max(X[:, 1]) + 1.0

The preceding values indicate the range of values that we want to use in our figure. These values usually range from the minimum value to the maximum value present in our data. We add some buffers, such as the 1.0 in the preceding lines, for clarity.

In order to plot the boundaries, we need to evaluate the function across a grid of points and plot it. Let's go ahead and define the grid:

    # denotes the step size that will be used in the mesh grid
    step_size = 0.01

    # define the mesh grid
    x_values, y_values = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))

The x_values and y_values variables contain the grid of points where the function will be evaluated. Let's compute the output of the classifier for all these points:

    # compute the classifier output
    mesh_output = classifier.predict(np.c_[x_values.ravel(), y_values.ravel()])

    # reshape the array
    mesh_output = mesh_output.reshape(x_values.shape)

Let's plot the boundaries using colored regions:

    # plot the output using a colored plot
    plt.figure()

    # choose a color scheme
    plt.pcolormesh(x_values, y_values, mesh_output, cmap=plt.cm.Set1)

This is basically a 3D plotter that takes the 2D points and the associated values to draw different regions using a color scheme. You can find all the color scheme options at http://matplotlib.org/examples/color/colormaps_reference.html. Let's overlay the training points on the plot:

    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', linewidth=2, cmap=plt.cm.Paired)

    # specify the boundaries of the figure
    plt.xlim(x_values.min(), x_values.max())
    plt.ylim(y_values.min(), y_values.max())

    # specify the ticks on the X and Y axes
    plt.xticks((np.arange(int(min(X[:, 0])-1), int(max(X[:, 0])+1), 1.0)))
    plt.yticks((np.arange(int(min(X[:, 1])-1), int(max(X[:, 1])+1), 1.0)))

    plt.show()

Here, plt.scatter plots the points on the 2D graph. X[:, 0] specifies that we should take all the values along axis 0 (the X axis in our case), and X[:, 1] specifies axis 1 (the Y axis). The c=y parameter indicates the color sequence; we use the target labels to map to colors via cmap. We basically want different colors based on the target labels, hence we use y as the mapping.
The limits of the display figure are set using plt.xlim and plt.ylim. In order to mark the axes with values, we need to use plt.xticks and plt.yticks. These functions mark the axes with values so that it's easier for us to see where the points are located. In the preceding code, we want the ticks to lie between the minimum and maximum values with a buffer of one unit, and we want these ticks to be integers, so we use the int() function to round off the values. If you run this code, you should see the following output:

Let's see how the C parameter affects our model. The C parameter indicates the penalty for misclassification. If we set it to 1.0, we will get the following figure:

If we set C to 10000, we get the following figure:

As we increase C, there is a higher penalty for misclassification, so the boundaries are adjusted more tightly around the training data.

Summary

We successfully employed supervised learning to build a simple classifier. We then went on to construct a logistic regression classifier and saw the different results of tweaking C, the regularization strength parameter.

Resources for Article:

Further resources on this subject:
Python Scripting Essentials [article]
Web scraping with Python (Part 2) [article]
Web Server Development [article]
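To tie the recipe together, the following is a hedged, self-contained sketch that trains the same logistic regression classifier with two different values of C (the 1.0 and 10000 settings discussed above) and plots the resulting boundaries using the same meshgrid-and-pcolormesh approach as plot_classifier. The helper name plot_boundaries is ours, not the book's.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Sample data with three classes, as in the recipe
X = np.array([[4, 7], [3.5, 8], [3.1, 6.2], [0.5, 1], [1, 2],
              [1.2, 1.9], [6, 2], [5.7, 1.5], [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

def plot_boundaries(classifier, X, y, title):
    # Evaluate the classifier on a dense grid and color the regions
    x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    mesh_output = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    mesh_output = mesh_output.reshape(xx.shape)

    plt.figure()
    plt.title(title)
    plt.pcolormesh(xx, yy, mesh_output, cmap=plt.cm.Set1)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black',
                linewidth=2, cmap=plt.cm.Paired)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

# Train with two different regularization settings and compare the boundaries
for C in (1.0, 10000):
    clf = linear_model.LogisticRegression(solver='liblinear', C=C)
    clf.fit(X, y)
    plot_boundaries(clf, X, y, 'C = %s' % C)

plt.show()
```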


Linking Data to Shapes

Packt
01 Jun 2016
7 min read
In this article, David J Parker, the author of the book Mastering Data Visualization with Microsoft Visio Professional 2016, discusses the current data-linking feature that Microsoft introduced in the Professional edition of Visio 2007. This feature is better than the database add-on that has been around since Visio 4 because it has greater importing capabilities and is part of the core product with its own API. This provides the Visio user with a simple method of surfacing data from a variety of data sources, and it gives the power user (or developer) the ability to create productivity enhancements in code.

(For more resources related to this topic, see here.)

Once data is imported into Visio, the rows of data can be linked to shapes and then displayed visually, or be used to automatically create hyperlinks. Moreover, if the data is edited outside of Visio, the data in the Visio shapes can be refreshed so that the shapes reflect the updated data. This can be done in the Visio client, but some data sources can also refresh the data in Visio documents that are displayed in SharePoint web pages. In this way, Visio documents truly become operational intelligence dashboards. Some VBA knowledge will be useful, and the sample data sources are introduced in each section.

In this chapter, we shall cover the following topics:

The new Quick Import feature
Importing data from a variety of sources
How to link shapes to rows of data
Using code for more linking possibilities

A very quick introduction to importing and linking data

Visio Professional 2016 added more buttons to the Data ribbon tab, and some new Data Graphics, but the functionality has basically been the same since Visio 2007 Professional. The new additions, as seen in the following screenshot, can make this particular ribbon tab quite wide on the screen. Thank goodness that wide screens have become the norm.

The process to create data-refreshable shapes in Visio is simply as follows:

Import data as recordsets.
Link rows of data to shapes.
Make the shapes display the data.
Use any hyperlinks that have been created automatically.

The Quick Import tool introduced in Visio 2016 Professional attempts to merge the first three steps into one, but it rarely gets it perfect, and it is only for simple Excel data sources. Therefore, it is necessary to learn how to use the Custom Import feature properly.

Knowing when to use the Quick Import tool

The Data | External Data | Quick Import button is new in Visio 2016 Professional. It is not part of the Visio API, so it cannot be called from code. This is not a great problem because it is only a wrapper for some of the actions that can be done in code anyway. This feature can only use an Excel workbook, but fortunately Visio installs a sample OrgData.xls file in the Visio Content\<LCID> folder. The LCID (Location Code Identifier) for US English is 1033, as shown in the following screenshot:

The screenshot shows a Visio Professional 2016 32-bit installation on a Windows 10 64-bit laptop. Therefore, the Office16 applications are installed in the Program Files (x86)\root folder. It would just be Program Files\root if the 64-bit version of Office were installed. It is not possible to install a different bit version of Visio than the rest of the Office applications. There is no root folder in previous versions of Office, but the rest of the path is the same.
The full path on this laptop is C:\Program Files (x86)\Microsoft Office\root\Office16\Visio Content\1033\ORGDATA.XLS, but it is best to copy this file to a folder where it can be edited. It is surprising that the Excel workbook is in the old binary format, but it is a simple process to open it and save it in the new Open Packaging Convention file format with an xlsx extension.

Importing to shapes without existing Shape Data rows

The following example contains three Person shapes from the Work Flow Objects stencil, and each one contains a person's name, spelt exactly the same as in the key column on the Excel worksheet. The match is not case sensitive, and it does not matter whether there are leading or trailing spaces in the text. When the Quick Import button is pressed, a dialog opens to show the progress of the stages that the wizard is going through, as shown in the following screenshot:

If the workbook contains more than one table of data, the user is prompted to select the range of cells within the workbook. When the process is complete, each of the Person shapes contains all of the data from the row in the External Data recordset where the text matches the Name column, as shown in the following screenshot:

The linked rows in the External Data window also display a chain icon, and the right-click menu has many actions, such as selecting the Linked Shapes for a row. Conversely, each shape now contains a right-mouse menu action to select the linked row in an External Data recordset. The Quick Import feature also adds some default data graphics to each shape, which will be ignored in this chapter because they are explored in detail in Chapter 4, Using the Built-in Data Graphics.

Note that the recordset in the External Data window is named Sheet1$A1:H52. This is not perfect, but the user can rename it through the right-mouse menu actions of the tab, using the Properties dialog, as seen in the following screenshot:

The user can also choose what to do if a data link is added to a shape that already has one. A shape can be linked to a single row in multiple recordsets, and a single row can be linked to multiple shapes in a document, or even on the same page. However, a shape cannot be linked to more than one row in the same recordset.

Importing to shapes with existing Shape Data rows

The Person shape from the Resources stencil has been used in the following example, and as earlier, each shape has the name text. However, in this case, there are some existing Shape Data rows. When the Quick Import feature is run, the data is linked to each shape where the text matches the Name column value. This feature has unfortunately created a problem this time because the Phone Number, E-mail Alias, and Manager Shape Data rows have remained empty, while the superfluous Telephone, E-mail, and Reports_To rows have been added. The solution is to edit the column headers in the worksheet to match the existing Shape Data row labels, as shown in the following screenshot:

Then, when Quick Import is used again, the column headers will match the Shape Data row names, and the data will be automatically cached into the correct places, as shown in the following screenshot:

Using the Custom Import feature

The user has more control using the Custom Import button on the Data | External Data ribbon tab. This button was called Link Data to Shapes in previous versions of Visio.
In either case, the action opens the Data Selector dialog, as shown in the following screenshot:

Each of these data sources will be explained in this chapter, along with the two data sources that are not available in the UI (namely XML files and SQL Server Stored Procedures).

Summary

This article has gone through the many different sources for importing data in Visio and has shown how each can be done.

Resources for Article:

Further resources on this subject:
Overview of Process Management in Microsoft Visio 2013 [article]
Data Visualization [article]
Data visualization [article]
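For readers who prefer the code route mentioned in the introduction, the sketch below shows one way the Visio data-linking API can be driven from Python via COM automation. It is a rough illustration only, assuming a Windows machine with Visio Professional installed, the pywin32 package, and a copy of the workbook at a path you choose; the OLE DB connection-string details in particular are an assumption and may need adjusting for your setup.

```python
# Rough sketch of programmatic data linking (hypothetical paths and names;
# assumes Visio Professional plus pywin32).
import win32com.client

visio = win32com.client.Dispatch("Visio.Application")
visio.Visible = True
doc = visio.Documents.Add("")  # blank drawing

# Import a worksheet as an External Data recordset.
# The connection string is an assumption; adjust it to your copy of the
# workbook and the data providers installed on your machine.
conn = ("OLEDB;Provider=Microsoft.ACE.OLEDB.12.0;"
        "Data Source=C:\\Temp\\OrgData.xlsx;"
        "Extended Properties='Excel 12.0 Xml;HDR=YES'")
cmd = "SELECT * FROM [Sheet1$]"
recordset = doc.DataRecordsets.Add(conn, cmd, 0, "OrgData")

# Drop a shape and link it to the first data row of the recordset.
page = doc.Pages.Item(1)
shape = page.DrawRectangle(1, 1, 3, 2)
row_ids = recordset.GetDataRowIDs("")      # all row IDs in the recordset
shape.LinkToData(recordset.ID, row_ids[0])
```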


Holistic View on Spark

Packt
31 May 2016
19 min read
In this article by Alex Liu, author of the book Apache Spark Machine Learning Blueprints, the author talks about a new stage of utilizing Apache Spark-based systems to turn data into insights.

(For more resources related to this topic, see here.)

According to research done by Gartner and others, many organizations have lost a huge amount of value simply due to the lack of a holistic view of their business. In this article, we will review the machine learning methods and the process of obtaining a holistic view of business. Then, we will discuss how Apache Spark fits in to make the related computing easy and fast and, with one real-life example, illustrate this process of developing holistic views from data using Apache Spark, step by step, as follows:

Spark for a holistic view
Methods for a holistic view
Feature preparation
Model estimation
Model evaluation
Results explanation
Deployment

Spark for a holistic view

Spark is very suitable for machine learning projects such as ours, which aim to obtain a holistic view of a business, as it enables us to process huge amounts of data fast and to code complicated computation easily. In this section, we will first describe a real business case and then describe how to prepare the Spark computing environment for our project.

The use case

The company IFS sells and distributes thousands of IT products and has a lot of data on marketing, training, team management, promotion, and products. The company wants to understand how various kinds of actions, such as those in marketing and training, affect sales teams' success. In other words, IFS is interested in finding out how much impact marketing, training, and promotions have generated separately.

In the past, IFS has done a lot of analytical work, but all of it was completed by individual departments on siloed data sets. That is, they have analytical results about how marketing affects sales from using marketing data alone, and how training affects sales from analyzing training data alone. When the decision makers collected all the results together and prepared to make use of them, they found that some of the results contradicted each other. For example, when they added all the effects together, the total impact was beyond what they had intuitively imagined.

This is a typical problem that every organization faces. A siloed approach with siloed data will produce an incomplete view, often a biased view, or even conflicting views. To solve this problem, analytical teams need to take a holistic view of all the company data, gather all the data into one place, and then utilize new machine learning approaches to gain a holistic view of the business. To do so, companies also need to care for the following:

The completeness of causes
Advanced analytics to account for the complexity of relationships
Computing the complexity related to subgroups and a big number of products or services

For this example, we have eight datasets that include one dataset for marketing with 48 features, one dataset for training with 56 features, and one dataset for team administration with 73 features, with the following table as a complete summary:

Category     Number of Features
Team         73
Marketing    48
Training     56
Staffing     103
Product      77
Promotion    43
Total        400

In this company, researchers understood that pooling all the data sets together and building a complete model was the solution, but they were not able to achieve it for several reasons.
Besides organizational issues inside the corporation, the technical capability to store all the data, to process it quickly with the right methods, and to present all the results in the right ways at reasonable speed was another challenge. At the same time, the company has more than 100 products to offer, for which data was pooled together to study the impacts of company interventions. That is, the calculated impacts are average impacts, but the variations among products are too large to ignore. If we need to assess impacts for each product, parallel computing is preferred and needs to be implemented at good speed. Without a good computing platform such as Apache Spark, meeting the requirements just described is a big challenge for this company.

In the sections that follow, we will use modern machine learning on top of Apache Spark to attack this business use case and help the company gain a holistic view of its business. In order to help readers learn machine learning on Spark effectively, the discussions in the following sections are all based on work on the real business use case just described, but we have left some details out to protect the company's privacy and also to keep everything brief.

Distributed computing

For our project, parallel computing is needed, for which we should set up clusters and worker nodes. Then, we can use the driver program and cluster manager to manage the computing that has to be done in each worker node. As an example, let's assume that we choose to work within the Databricks environment:

Users can go to the main menu, as shown in the preceding screenshot, and click Clusters. A window will open for users to name the cluster, select a version of Spark, and then specify the number of workers.

Once the clusters are created, we can go to the main menu, click the down arrow on the right-hand side of Tables, and then choose Create Tables to import our datasets, which were cleaned and prepared. For the data source, the options include S3, DBFS, JDBC, and File (for local files).

Our data has been separated into two subsets, one to train and one to test each product, as we need to train a few models per product. In Apache Spark, we need to direct workers to complete the computation on each node. We will use the scheduler to get the notebook computation completed on Databricks and collect the results back, which will be discussed in the Model estimation section.

Fast and easy computing

One of the most important advantages of utilizing Apache Spark is that it makes coding easy, and several approaches are available for this. For this project, we will focus our effort on the notebook approach; specifically, we will use R notebooks to develop and organize the code. At the same time, in an effort to illustrate the Spark technology more thoroughly, we will also use MLlib directly to code some of the algorithms we need, as MLlib has been seamlessly integrated with Spark.

In the Databricks environment, setting up notebooks takes the following steps:

As shown in the preceding screenshot, users can go to the Databricks main menu, click the down arrow on the right-hand side of Workspace, and choose Create -> Notebook to create a new notebook. A table will pop up for users to name the notebook and also select a language (R, Python, Scala, or SQL).

In order to make our work repeatable and easy to understand, we will adopt a workflow approach that is consistent with the RM4Es framework.
We will also adopt Spark's ML Pipeline tools to represent our workflows whenever possible. Specifically, for the training data set, we need to estimate models, evaluate models, and then maybe re-estimate the models again before we can finalize them. So, we need to use Spark's Transformer, Estimator, and Evaluator to organize an ML pipeline for this project. In practice, we can also organize these workflows within the R notebook environment. For more information about pipeline programming, please go to http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline and http://spark.apache.org/docs/latest/ml-guide.html.

Once our computing platform is set up and our framework is in place, everything becomes clear. In the following sections, we will move forward step by step: we will use our RM4Es framework and related processes to identify equations or methods and then prepare features first. The second step is to complete model estimation, the third is to evaluate the models, and the fourth is to explain our results. Finally, we will deploy the models.

Methods for a holistic view

In this section, we need to select our analytical methods or models (equations), which amounts to mapping our business use case to machine learning methods. For our use case of assessing the impacts of various factors on sales team success, there are many suitable models for us to use. As an exercise, we will select regression models, structural equation models, and decision trees, mainly because they are easy to interpret as well as to implement on Spark. Once we finalize our decision on analytical methods or models, we will need to prepare the dependent variable and also prepare to code, which we will discuss one by one.

Regression modeling

To get ready for regression modeling on Spark, there are three issues that you have to take care of, as follows:

Linear regression or logistic regression: Regression is the most mature and most widely used model to represent the impacts of various factors on one dependent variable. Whether to use linear regression or logistic regression depends on whether the relationship is linear or not. We are not sure about this, so we will adopt both and then compare their results to decide which one to deploy.

Preparing the dependent variable: In order to use logistic regression, we need to recode the target variable or dependent variable (the sales team success variable, currently a rating from 0 to 100) to be 0 versus 1 by splitting it at the median value.

Preparing coding: In MLlib, we can use the following code for regression modeling, as we will use Spark MLlib's Linear Regression with Stochastic Gradient Descent (LinearRegressionWithSGD):

val numIterations = 90
val model = LinearRegressionWithSGD.train(TrainingData, numIterations)

For logistic regression, we use the following code:

val model = new LogisticRegressionWithSGD()
  .setNumClasses(2)
  .run(training)

For more about using MLlib for regression modeling, please go to http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression.

In R, we can use the lm function for linear regression and the glm function with family=binomial() for logistic regression.
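The MLlib snippets above are in Scala; for readers working from PySpark instead, the following is a hedged sketch of the same regression-modeling step against the RDD-based MLlib API of that era. The toy rows, the 50-point success threshold, and the use of LogisticRegressionWithLBFGS (rather than the SGD variant shown above) are illustrative assumptions, not code from the book.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="holistic-view-sketch")

# Toy stand-ins for the prepared training rows: (rating label, [features...])
rows = [(35.0, [1.0, 0.2, 3.1]), (72.0, [0.5, 1.7, 2.2]), (58.0, [0.9, 0.8, 2.9])]
training = sc.parallelize([LabeledPoint(label, feats) for label, feats in rows])

# Linear regression with stochastic gradient descent
lin_model = LinearRegressionWithSGD.train(training, iterations=90)

# Logistic regression needs a 0/1 label (success vs. not), created by thresholding
binary = training.map(lambda p: LabeledPoint(1.0 if p.label >= 50 else 0.0, p.features))
log_model = LogisticRegressionWithLBFGS.train(binary, numClasses=2)

print(lin_model.predict([1.0, 0.5, 3.0]))
print(log_model.predict([1.0, 0.5, 3.0]))
```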
SEM approach

To get ready for Structural Equation Modeling (SEM) on Spark, there are also three issues that we need to take care of, as follows:

SEM introduction and specification: SEM may be considered an extension of regression modeling, as it consists of several linear equations that are similar to regression equations. However, this method estimates all the equations at the same time, taking their internal relations into account, so it is less biased than regression modeling. SEM consists of both structural modeling and latent variable modeling, but we will only use structural modeling.

Preparing the dependent variable: We can just use the sales team success scale (a rating of 0 to 100) as our target variable here.

Preparing coding: We will adopt the R notebook within the Databricks environment, for which we should use the R package sem. There are other SEM packages, such as lavaan, that are also available, but for this project we will use the sem package because it is easy to learn. To load the sem package into the R notebook, we will use install.packages("sem", repos="http://R-Forge.R-project.org") and then run library(sem). After that, we need to use the specifyModel() function to specify models in our R notebook, for which the following code is needed:

mod.no1 <- specifyModel()
s1 <- x1, gam31
s1 <- x2, gam32

Decision trees

To get ready for decision tree modeling on Spark, there are also three issues that we need to take care of, as follows:

Decision tree selection: A decision tree aims to model the classification of cases, which for our use case means classifying elements into success or not success. It is also one of the most mature and widely used methods. For this exercise, we will only use a simple decision tree, and we will not venture into more complicated trees, such as random forests.

Preparing the dependent variable: To use the decision tree model here, we will separate the sales team rating into the two categories of SUCCESS or NOT, as we did for logistic regression.

Preparing coding: For MLlib, we can use the following code:

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 6
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

For more information about using MLlib for decision trees, please go to http://spark.apache.org/docs/latest/mllib-decision-tree.html.

As for the R notebook on Spark, we need to use the R package rpart and then use the rpart function for all the calculations. For rpart, we need to specify the classifier and also all the features that have to be used.

Model estimation

Once the feature sets are finalized, what follows is to estimate the parameters of the selected models. We can use either MLlib or R here to do this, and we need to arrange distributed computing. To simplify this, we can utilize the Databricks Job feature. Specifically, within the Databricks environment, we can go to Jobs and then create jobs, as shown in the following screenshot:

Then, users can select which notebooks to run, specify clusters, and schedule the jobs. Once scheduled, users can also monitor the running notebooks and then collect the results back.

In Section II, we prepared some code for each of the three models that were selected. Now, we need to modify it with the final set of features selected in the last section to create our final notebooks.
In other words, we have one dependent variable prepared and 17 features selected from our PCA and feature selection work. So, we need to insert all of them into the code that was developed in Section II to finalize our notebooks. Then, we will use the Spark Job feature to get these notebooks implemented in a distributed way.

MLlib implementation

First, we need to prepare our data with the s1 dependent variable for linear regression and the s2 dependent variable for logistic regression or the decision tree. Then, we need to add the selected 17 features to them to form datasets that are ready for our use.

For linear regression, we will use the following code:

val numIterations = 90
val model = LinearRegressionWithSGD.train(TrainingData, numIterations)

For logistic regression, we will use the following code:

val model = new LogisticRegressionWithSGD()
  .setNumClasses(2)

For the decision tree, we will use the following code:

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

R notebooks implementation

For better comparison, it is a good idea to write linear regression and SEM into the same R notebook, and also to write logistic regression and the decision tree into the same R notebook. Then, the main task left is to schedule the estimation for each worker and collect the results, using the previously mentioned Job feature in the Databricks environment, as follows:

The code for linear regression and SEM is as follows:

lm.est1 <- lm(s1 ~ T1+T2+M1+M2+M3+Tr1+Tr2+Tr3+S1+S2+P1+P2+P3+P4+Pr1+Pr2+Pr3)
mod.no1 <- specifyModel()
s1 <- x1, gam31
s1 <- x2, gam32

The code for logistic regression and the decision tree is as follows:

logit.est1 <- glm(s2 ~ T1+T2+M1+M2+M3+Tr1+Tr2+Tr3+S1+S2+P1+P2+P3+P4+Pr1+Pr2+Pr3, family=binomial())
dt.est1 <- rpart(s2 ~ T1+T2+M1+M2+M3+Tr1+Tr2+Tr3+S1+S2+P1+P2+P3+P4+Pr1+Pr2+Pr3, method="class")

After we get all the models estimated for each product, for simplicity, we will focus on one product to complete our discussion of model evaluation and model deployment.

Model evaluation

In the previous section, we completed our model estimation task. Now, it is time to evaluate the estimated models to see whether they meet our model quality criteria, so that we can either move to the next stage of results explanation or go back to some previous stages to refine our models.

To perform our model evaluation, in this section we will focus our effort on utilizing RMSE (root-mean-square error) and ROC curves (receiver operating characteristic) to assess whether our models are a good fit. To calculate RMSEs and ROC curves, we need to use our test data rather than the training data that was used to estimate our models.

Quick evaluations

Many packages already include algorithms for users to assess models quickly. For example, both MLlib and R have algorithms to return a confusion matrix for logistic regression models, and they even get false positive numbers calculated. Specifically, MLlib has the confusionMatrix and numFalseNegatives() functions for us to use, and even some algorithms to calculate MSE quickly, as follows:

MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).mean()
print("Mean Squared Error = " + str(MSE))

Also, R has the confusion.matrix function for us to use. In R, there are even many tools to produce quick graphical plots that can be used to gain a quick evaluation of models.
For example, we can plot predicted versus actual values, and also residuals against predicted values. Intuitively, comparing predicted versus actual values is the easiest method to understand and gives us a quick model evaluation. The following table is a calculated confusion matrix for one of the company's products, which shows a reasonable fit of our model:

                  Predicted as Success   Predicted as NOT
Actual Success    83%                    17%
Actual Not        9%                     91%

RMSE

In MLlib, we can use the following code to calculate RMSE:

val valuesAndPreds = test.map { point =>
  val prediction = new_model.predict(point.features)
  val r = (point.label, prediction)
  r
}
val residuals = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }
val MSE = residuals.mean()
val RMSE = math.pow(MSE, 0.5)

Besides the above, MLlib also has some functions in the RegressionMetrics and RankingMetrics classes that we can use for RMSE calculation.

In R, we can compute RMSE as follows:

RMSE <- sqrt(mean((y - y_pred)^2))

Before this, we need to obtain the predicted values with the following commands:

> # build a model
> RMSElinreg <- lm(s1 ~ ., data = data1)
> # score the model
> score <- predict(RMSElinreg, data2)

After we have obtained RMSE values for all the estimated models, we compare them to evaluate the linear regression model versus the logistic regression model versus the decision tree model. For our case, the linear regression models turned out to be close to the best. We also compare RMSE values across products and send some product models back for refinement. For another example of obtaining RMSE, please go to http://www.cakesolutions.net/teamblogs/spark-mllib-linear-regression-example-and-vocabulary.

ROC curves

As an example, we calculate ROC curves to assess our logistic models. In MLlib, we can use the metrics.areaUnderROC() function to calculate ROC once we apply our estimated model to our test data and get labels for the testing cases. For more on using MLlib to obtain ROC, please go to http://web.cs.ucla.edu/~mtgarip/linear.html.

In R, using the pROC package, we can perform the following to calculate and plot ROC curves:

mylogit <- glm(s2 ~ ., family = "binomial")
summary(mylogit)
prob = predict(mylogit, type = c("response"))
testdata1$prob = prob
library(pROC)
g <- roc(s2 ~ prob, data = testdata1)
plot(g)

As discussed, once the ROC curves are calculated, we can use them to compare our logistic models against the decision tree models, or to compare models across products. In our case, the logistic models perform better than the decision tree models.

Results explanation

Once we pass our model evaluation and decide to select the estimated model as our final model, we need to interpret the results for the company executives and also their technicians. Next, we discuss some commonly used ways of interpreting our results, one using tables and another using graphs, with our focus on impact assessment. Some users may prefer to interpret our results in terms of ROI, for which cost and benefit data is needed. Once we have the cost and benefit data, our results here can easily be expanded to cover ROI issues. Also, some optimization may need to be applied for real decision making.

Impacts assessments

As discussed in Section 1, the main purpose of this project is to gain a holistic view of the sales team success. For example, the company wishes to understand the impact of marketing on sales success in comparison to training and other factors.
As we have our linear regression model estimated, one easy way of comparing impacts is to summarize the variance explained by each feature group, as shown in the following table.

Tables for impact assessment:

Feature Group    Variance Explained (%)
Team             8.5
Marketing        7.6
Training         5.7
Staffing         12.9
Product          8.9
Promotion        14.6
Total            58.2

The following figure is another example of using graphs to display the results that were discussed.

Summary

In this article, we went through a step-by-step process from data to a holistic view of a business. We processed a large amount of data on Spark and then built a model to produce a holistic view of the sales team success for the company IFS. Specifically, we first selected models per business needs after we prepared the Spark computing environment and loaded in the preprocessed data. Second, we estimated the model coefficients. Third, we evaluated the estimated models. Finally, we interpreted the analytical results.

This process is similar to the process of working with small data, but in dealing with big data, we need parallel computing, for which Apache Spark is utilized. During this process, Apache Spark makes things easy and fast.

After this article, readers will have gained a full understanding of how Apache Spark can be utilized to make our work easier and faster in obtaining a holistic view of businesses. At the same time, readers should become familiar with the RM4Es modeling process of handling large amounts of data and developing predictive models, and they should especially become capable of producing their own holistic view of a business.

Resources for Article:

Further resources on this subject:
Getting Started with Apache Hadoop and Apache Spark [article]
Getting Started with Apache Spark DataFrames [article]
Sabermetrics with Apache Spark [article]
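To round off the Model evaluation section in PySpark terms, here is a hedged sketch that continues the earlier PySpark snippet (reusing its sc, lin_model, and log_model) and computes RMSE plus the area under the ROC curve with BinaryClassificationMetrics. The tiny held-out rows are made up for illustration; the book's own evaluation code is the Scala and R shown above.

```python
import math
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Tiny held-out sets, continuing the toy example from the earlier sketch
test = sc.parallelize([LabeledPoint(60.0, [0.8, 1.0, 2.5]),
                       LabeledPoint(40.0, [1.1, 0.3, 3.0])])
binary_test = test.map(lambda p: LabeledPoint(1.0 if p.label >= 50 else 0.0, p.features))

# RMSE for the linear regression model on the held-out data
values_and_preds = test.map(lambda p: (p.label, float(lin_model.predict(p.features))))
mse = values_and_preds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("RMSE = %.3f" % math.sqrt(mse))

# Area under the ROC curve for the logistic regression model;
# clearThreshold() makes predict() return raw scores instead of 0/1 labels
log_model.clearThreshold()
score_and_labels = binary_test.map(lambda p: (float(log_model.predict(p.features)), p.label))
metrics = BinaryClassificationMetrics(score_and_labels)
print("Area under ROC = %.3f" % metrics.areaUnderROC)
```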


Splunk's Input Methods and Data Feeds

Packt
30 May 2016
13 min read
This article, crafted by Ashish Kumar Yadav, has been picked from the Advanced Splunk book. The book helps you get acquainted with a great data science tool named Splunk. The big data world is an ever-expanding field, and it is easy to get lost in the enormous amount of machine data at your disposal; the Advanced Splunk book will provide you with the necessary resources and the trail to get you to the other end of that machine data. While the book emphasizes Splunk, it also discusses Splunk's close association with the Python language and with tools like R and Tableau, which are needed for better analytics and visualization.

(For more resources related to this topic, see here.)

Splunk supports numerous ways to ingest data on its server. Any human-readable data generated by a machine, from various sources, can be uploaded using data input methods such as files, directories, TCP/UDP, and scripts; it can then be indexed on the Splunk Enterprise server, and analytics and insights can be derived from it.

Data sources

Uploading data on Splunk is one of the most important parts of analytics and visualization of data. If data is not properly parsed, timestamped, or broken into events, it can be difficult to analyze it and get proper insight from it. Splunk can be used to analyze and visualize data from various domains, such as IT security, networking, mobile devices, telecom infrastructure, media and entertainment devices, storage devices, and many more. The machine-generated data from different sources can be of different formats and types, and hence it is very important to parse the data in the best format to get the required insight from it.

Splunk supports machine-generated data of various types and structures, and the following screenshot shows the common types of data that come with inbuilt support in Splunk Enterprise. The most important point about these sources is that if the data source is from the following list, then the preconfigured settings and configurations already stored in Splunk Enterprise are applied. This helps in getting the data parsed into the most suitable format of events and timestamps to enable faster searching, better analytics, and better visualization. The following screenshot lists the common data sources supported by Splunk Enterprise:

Structured data

Machine-generated data is generally structured, and in some cases it can be semistructured. Some of the types of structured data are Extensible Markup Language (XML), JavaScript Object Notation (JSON), comma-separated values (CSV), tab-separated values (TSV), and pipe-separated values (PSV). Any format of structured data can be uploaded on Splunk. However, if the data is in any of the preceding formats, then predefined settings and configuration can be applied directly by choosing the respective source type while uploading the data or by configuring it in the inputs.conf file.

The preconfigured settings for any of the preceding structured formats are very generic. Many times, the machine logs are customized structured logs; in that case, additional settings will be required to parse the data. For example, there are various types of XML. We have listed two types here. In the first type, there is the <note> tag at the start and </note> at the end, and in between, there are parameters and their values. In the second type, there are two levels of hierarchy: the XML has the <library> tag along with the <book> tag, and between the <book> and </book> tags, we have parameters and their values.
The first type is as follows:

<note>
  <to>Jack</to>
  <from>Micheal</from>
  <heading>Test XML Format</heading>
  <body>This is one of the formats of XML!</body>
</note>

The second type is shown in the following code snippet:

<Library>
  <book category="Technical">
    <title lang="en">Splunk Basic</title>
    <author>Jack Thomas</author>
    <year>2007</year>
    <price>520.00</price>
  </book>
  <book category="Story">
    <title lang="en">Jungle Book</title>
    <author>Rudyard Kiplin</author>
    <year>1984</year>
    <price>50.50</price>
  </book>
</Library>

Similarly, there can be many types of customized XML generated by machines. To parse different types of structured data, Splunk Enterprise comes with inbuilt settings and configuration defined for the source the data comes from. Let's say, for example, that the logs received from a web server are also structured and can be in either JSON, CSV, or simple text format. So, depending on the specific source, Splunk tries to make the user's job easier by providing the best settings and configuration for many common sources of data. Some of the most common sources of data are web servers, databases, operating systems, network security devices, and various other applications and services.

Web and cloud services

The most commonly used web servers are Apache and Microsoft IIS. All Linux-based web services are hosted on Apache servers, and all Windows-based web services on IIS. The logs generated by Linux web servers are simple plain text files, whereas the log files of Microsoft IIS can be in the W3C-extended log file format or can be stored in a database in the ODBC log file format as well. Cloud services such as Amazon AWS, S3, and Microsoft Azure can be directly connected and configured to forward data to Splunk Enterprise. The Splunk app store has many technology add-ons that can be used to create data inputs to send data from cloud services to Splunk Enterprise.

So, when uploading log files from web services such as Apache, Splunk provides a preconfigured source type that parses the data in the best format for it to be available for visualization. Suppose that the user wants to upload Apache error logs to the Splunk server; the user then chooses apache_error from the Web category of Source type, as shown in the following screenshot:

On choosing this option, the following set of configuration is applied to the data to be uploaded:

The event break is configured on the regular expression pattern ^[, that is, the log file is broken into separate events at every occurrence of [ at the start of a line (^).
The timestamp is identified in the [%A %B %d %T %Y] format, where:
%A is the day of the week; for example, Monday
%B is the month; for example, January
%d is the day of the month; for example, 1
%T is the time, which has to be in the %H:%M:%S format
%Y is the year; for example, 2016

Various other settings are applied as well, such as maxDist, which specifies how much the logs are allowed to vary from the pattern defined in the source type, along with settings such as category, description, and others. Any new settings required as per our needs can be added using the New Settings option available in the section below Settings. After making the changes, the settings can either be saved as a new source type or the existing source type can be updated with the new settings.

IT operations and network security

Splunk Enterprise has many applications on the Splunk app store that specifically target IT operations and network security.
Splunk is a widely accepted tool for intrusion detection, network and information security, fraud and theft detection, user behaviour analytics, and compliance. A Splunk Enterprise application provides inbuilt support for the Cisco Adaptive Security Appliance (ASA) firewall, Cisco SYSLOG, Call Detail Records (CDR) logs, and one of the most popular intrusion detection applications, Snort. The Splunk app store has many technology add-ons to get data from various security devices such as firewalls, routers, DMZs, and others. The app store also has Splunk applications that show graphical insights and analytics over the data uploaded from various IT and security devices.

Databases

The Splunk Enterprise application has inbuilt support for databases such as MySQL, Oracle Syslog, and IBM DB2. Apart from this, there are technology add-ons on the Splunk app store to fetch data from the Oracle and MySQL databases. These technology add-ons can be used to fetch, parse, and upload data from the respective database to the Splunk Enterprise server.

There can be various types of data available from one source; let's take MySQL as an example. There can be error log data, query logging data, MySQL server health and status log data, or MySQL data stored in the form of databases and tables. This shows that there can be a huge variety of data generated from the same source; hence, Splunk provides support for all types of data generated from a source. There is inbuilt configuration for MySQL error logs, MySQL slow queries, and MySQL database logs, already defined for easier input configuration of data generated from the respective sources.

Application and operating system data

The Splunk input source types have inbuilt configuration available for Linux dmesg, syslog, security logs, and various other logs available from the Linux operating system. Apart from the Linux OS, Splunk also provides configuration settings for data input of logs from Windows and iOS systems. It also provides default settings for Log4j-based logging for Java, PHP, and .NET enterprise applications. Splunk also supports data from lots of other applications, such as Ruby on Rails, Catalina, WebSphere, and others.

Splunk Enterprise provides predefined configuration for various applications, databases, OSes, and cloud and virtual environments to enrich the respective data with better parsing and event breaking, thus deriving better insight from the available data. Application sources whose settings are not available in Splunk Enterprise may alternatively have apps or add-ons on the app store.

Data input methods

Splunk Enterprise supports data input through numerous methods. Data can be sent to Splunk via files and directories, TCP, UDP, scripts, or universal forwarders.

Files and directories

Splunk Enterprise provides an easy interface to upload data via files and directories. Files can be uploaded directly from the Splunk web interface manually, or Splunk can be configured to monitor a file for changes in content, so that new data is uploaded to Splunk whenever it is written to the file. Splunk can also be configured to upload multiple files, by either uploading all the files in one shot or monitoring a directory for any new files, with the data indexed on Splunk whenever it arrives in the directory. Any data from any source that is in a human-readable format, that is, where no proprietary tools are needed to read the data, can be uploaded to Splunk.
Splunk Enterprise even supports uploading compressed file formats such as .zip and .tar.gz, which contain multiple log files in a compressed form.

Network sources

Splunk supports both TCP and UDP to get data from network sources. It can monitor any network port for incoming data and then index it on Splunk. Generally, in the case of data from network sources, it is recommended that you use a universal forwarder to send data to Splunk, as the universal forwarder buffers the data in case of any issues on the Splunk server, to avoid data loss.

Windows data

Splunk Enterprise provides direct configuration to access data from a Windows system. It supports both local and remote collection of various types and sources of data from a Windows system. Splunk has predefined input methods and settings to parse event logs, performance monitoring reports, registry information, and hosts, networks, and print monitoring of a local as well as a remote Windows system.

So, data from different sources and in different formats can be sent to Splunk using various input methods, as per the requirements and suitability of the data and source. New data inputs can also be created using Splunk apps or technology add-ons available on the Splunk app store.

Adding data to Splunk: new interfaces

Splunk Enterprise introduced new interfaces to accept data in a way that is compatible with resource-constrained and lightweight devices for the Internet of Things. Splunk Enterprise version 6.3 supports HTTP Event Collector and REST and JSON APIs for data collection on Splunk.

HTTP Event Collector is a very useful interface that can be used to send data from your existing application to the Splunk Enterprise server without using any forwarder. HTTP APIs are available in .NET, Java, Python, and almost all programming languages, so forwarding data from an existing application based on a specific programming language becomes a cakewalk.

Let's take an example. Say you are the developer of an Android application, and you want to know which features users use, which are the pain areas, and which screens cause problems. You also want to know the usage pattern of your application. So, in the code of your Android application, you can use the REST APIs to forward the logging data to the Splunk Enterprise server. The only important point to note here is that the data needs to be sent in a JSON payload envelope. The advantage of using HTTP Event Collector is that, without using any third-party tools or any configuration, the data can be sent to Splunk, and we can easily derive insights, analytics, and visualizations from it.

HTTP Event Collector and configuration

HTTP Event Collector can be used once you configure it from the Splunk Web console, and the event data sent over HTTP can be indexed in Splunk using the REST API.

HTTP Event Collector

HTTP Event Collector (EC) provides an API with an endpoint that can be used to send log data from applications into Splunk Enterprise. Splunk HTTP Event Collector supports both HTTP and HTTPS for secure connections. The following are the features of HTTP Event Collector that make adding data to Splunk Enterprise easier:

It is very lightweight in terms of memory and resource usage, and thus can be used on resource-constrained and lightweight devices as well.
Events can be sent directly from anywhere, such as web servers, mobile devices, and IoT devices, without any need for configuration or installation of forwarders.
It is a token-based JSON API that doesn't require you to save user credentials in the code or in the application settings; authentication is handled by the tokens used in the API.
It is easy to configure EC from the Splunk Web console: enable HTTP EC and define the token. After this, you are ready to accept data on Splunk Enterprise.
It supports both HTTP and HTTPS, and hence it is very secure.
It supports GZIP compression and batch processing.
HTTP EC is highly scalable, as it can be used in a distributed environment as well as with a load balancer, to crunch and index millions of events per second.

Summary

In this article, we walked through various data input methods along with the various data sources supported by Splunk. We also looked at HTTP Event Collector, which is a new feature added in Splunk 6.3 for data collection via REST, to encourage the usage of Splunk for IoT. The data sources and input methods for Splunk go beyond those of a generic tool, and HTTP Event Collector is an added advantage compared to other data analytics tools.

Resources for Article:

Further resources on this subject:
The Splunk Interface [article]
The Splunk Web Framework [article]
Introducing Splunk [article]
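As a concrete illustration of the HTTP Event Collector workflow described in this article, here is a hedged Python sketch that posts a JSON event to a Splunk server over HEC using the requests library. The host name, port, token, index-free payload, and sourcetype are placeholders you would replace with the values from your own HEC configuration.

```python
import json
import requests

# Placeholder values; replace with your Splunk host and the token generated
# when HTTP Event Collector was enabled in the Splunk Web console.
HEC_URL = "https://splunk.example.com:8088/services/collector"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

# The payload is a JSON envelope; the actual log data goes under the "event" key.
payload = {
    "sourcetype": "mobile_app",
    "event": {"action": "screen_view", "screen": "checkout", "duration_ms": 137},
}

response = requests.post(
    HEC_URL,
    headers={"Authorization": "Splunk " + HEC_TOKEN},
    data=json.dumps(payload),
    verify=False,  # use True (or a CA bundle path) with a properly signed certificate
)
print(response.status_code, response.text)
```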


Security Considerations in Multitenant Environment

Packt
24 May 2016
8 min read
In this article by Zoran Pavlović and Maja Veselica, authors of the book Oracle Database 12c Security Cookbook, we will be introduced to common privileges and learn how to grant privileges and roles commonly. We'll also study the effects of plugging and unplugging operations on users, roles, and privileges.

(For more resources related to this topic, see here.)

Granting privileges and roles commonly

A common privilege is a privilege that can be exercised across all containers in a container database. Depending only on the way it is granted, a privilege becomes common or local. When you grant a privilege commonly (across all containers), it becomes a common privilege. Only common users or roles can have common privileges, and only a common role can be granted commonly.

Getting ready

For this recipe, you will need to connect to the root container as an existing common user who is able to grant a specific privilege or an existing role (in our case: create session, select any table, c##role1, c##role2) to another existing common user (c##john). If you want to try out the examples in the How it works section given ahead, you should open pdb1 and pdb2. You will use:

Common users c##maja and c##zoran with the dba role granted commonly
Common user c##john
Common roles c##role1 and c##role2

How to do it...

Connect to the root container as a common user who can grant these privileges and roles (for example, c##maja or the system user):

SQL> connect c##maja@cdb1

Grant a privilege (for example, create session) to a common user (for example, c##john) commonly:

c##maja@CDB1> grant create session to c##john container=all;

Grant a privilege (for example, select any table) to a common role (for example, c##role1) commonly:

c##maja@CDB1> grant select any table to c##role1 container=all;

Grant a common role (for example, c##role1) to a common role (for example, c##role2) commonly:

c##maja@CDB1> grant c##role1 to c##role2 container=all;

Grant a common role (for example, c##role2) to a common user (for example, c##john) commonly:

c##maja@CDB1> grant c##role2 to c##john container=all;

How it works...

Figure 16

You can grant privileges or common roles commonly only to a common user. You need to connect to the root container as a common user who is able to grant the specific privilege or role.

In step 2, the create session system privilege is granted to the common user c##john commonly, by adding the container=all clause to the grant statement. This means that user c##john can connect (create a session) to the root or any pluggable database in this container database, including all pluggable databases that will be plugged in in the future.

N.B. The container=all clause is NOT optional, even though you are connected to the root. This is unlike the creation of common users and roles (where, if you omit container=all, the user or role will still be created in all containers, that is, commonly). If you omit this clause during a privilege or role grant, the privilege or role will be granted locally and can be exercised only in the root container.

SQL> connect c##john/oracle@cdb1
c##john@CDB1> connect c##john/oracle@pdb1
c##john@PDB1> connect c##john/oracle@pdb2
c##john@PDB2>

In step 3, the select any table system privilege is granted to the common role c##role1 commonly. This means that the role c##role1 contains the select any table privilege in all containers (the root and the pluggable databases).
c##zoran@CDB1> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM ------------- ----------------- --- --- C##ROLE1 SELECT ANY TABLE NO YES c##zoran@CDB1> connect c##zoran/oracle@pdb1 c##zoran@PDB1> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM -------------- ------------------ --- --- C##ROLE1 SELECT ANY TABLE NO YES c##zoran@PDB1> connect c##zoran/oracle@pdb2 c##zoran@PDB2> select * from role_sys_privs where role='C##ROLE1'; ROLE PRIVILEGE ADM COM -------------- ---------------- --- --- C##ROLE1 SELECT ANY TABLE NO YES In step 4, common role c##role1, is granted to another common role c##role2 commonly. This means that role c##role2 has granted role c##role1 in all containers. c##zoran@CDB1> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED ROLE ADM COM --------------- --------------- --- --- C##ROLE2 C##ROLE1 NO YES c##zoran@CDB1> connect c##zoran/oracle@pdb1 c##zoran@PDB1> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED_ROLE ADM COM ------------- ----------------- --- --- C##ROLE2 C##ROLE1 NO YES c##zoran@PDB1> connect c##zoran/oracle@pdb2 c##zoran@PDB2> select * from role_role_privs where role='C##ROLE2'; ROLE GRANTED_ROLE ADM COM ------------- ------------- --- --- C##ROLE2 C##ROLE1 NO YES In step 5, common role c##role2, is granted to common user c##john commonly. This means that user c##john has c##role2 in all containers. Consequently, user c##john can use select any table privilege in all containers in this container database. c##john@CDB1> select count(*) from c##zoran.t1; COUNT(*) ---------- 4 c##john@CDB1> connect c##john/oracle@pdb1 c##john@PDB1> select count(*) from hr.employees; COUNT(*) ---------- 107 c##john@PDB1> connect c##john/oracle@pdb2 c##john@PDB2> select count(*) from sh.sales; COUNT(*) ---------- 918843 Effects of plugging/unplugging operations on users, roles, and privileges Purpose of this recipe is to show what is going to happen to users, roles, and privileges when you unplug a pluggable database from one container database (cdb1) and plug it into some other container database (cdb2). Getting ready To complete this recipe, you will need: Two container databases (cdb1 and cdb2) One pluggable database (pdb1) in container database cdb1 Local user mike in pluggable database pdb1 with local create session privilege Common user c##john with create session common privilege and create synonym local privilege on pluggable database pdb1 How to do it... Connect to the root container of cdb1 as user sys: SQL> connect sys@cdb1 as sysdba Unplug pdb1 by creating XML metadata file: SQL> alter pluggable database pdb1 unplug into '/u02/oradata/pdb1.xml'; Drop pdb1 and keep datafiles: SQL> drop pluggable database pdb1 keep datafiles; Connect to the root container of cdb2 as user sys: SQL> connect sys@cdb2 as sysdba Create (plug) pdb1 to cdb2 by using previously created metadata file: SQL> create pluggable database pdb1 using '/u02/oradata/pdb1.xml' nocopy; How it works... By completing previous steps, you unplugged pdb1 from cdb1 and plugged it into cdb2. After this operation, all local users and roles (in pdb1) are migrated with pdb1 database. If you try to connect to pdb1 as a local user: SQL> connect mike@pdb1 It will succeed. All local privileges are migrated, even if they are granted to common users/roles. 
However, if you try to connect to pdb1 as the previously created common user c##john, you'll get an error:

SQL> connect c##john@pdb1
ERROR:
ORA-28000: the account is locked
Warning: You are no longer connected to ORACLE.

This happens because, after migration, common users are migrated into the pluggable database as locked accounts. You can continue to use the objects in these users' schemas, or you can create these users in the root container of the new CDB. To do this, we first need to close pdb1:

sys@CDB2> alter pluggable database pdb1 close;
Pluggable database altered.
sys@CDB2> create user c##john identified by oracle container=all;
User created.
sys@CDB2> alter pluggable database pdb1 open;
Pluggable database altered.

If we try to connect to pdb1 as user c##john, we will get an error:

SQL> conn c##john/oracle@pdb1
ERROR:
ORA-01045: user C##JOHN lacks CREATE SESSION privilege; logon denied
Warning: You are no longer connected to ORACLE.

Even though c##john had the create session common privilege in cdb1, he cannot connect to the migrated PDB. This is because common privileges are not migrated! So we need to give the create session privilege (either common or local) to user c##john:

sys@CDB2> grant create session to c##john container=all;
Grant succeeded.

Let's try out the create synonym local privilege in the migrated PDB:

c##john@PDB1> create synonym emp for hr.employees;
Synonym created.

This proves that local privileges are always migrated.

Summary

In this article, we learned about common privileges and the methods to grant common privileges and roles to users. We also studied what happens to users, roles, and privileges when you unplug a pluggable database from one container database and plug it into some other container database.

Resources for Article:

Further resources on this subject:
Oracle 12c SQL and PL/SQL New Features[article]
Oracle GoldenGate 12c — An Overview[article]
Backup and Recovery for Oracle SOA Suite 12C[article]

Visualizations Using CCC

Packt
20 May 2016
28 min read
In this article by Miguel Gaspar the author of the book Learning Pentaho CTools you will learn about the Charts Component Library in detail. The Charts Components Library is not really a Pentaho plugin, but instead is a Chart library that Webdetails created some years ago and that Pentaho started to use on the Analyzer visualizations. It allows a great level of customization by changing the properties that are applied to the charts and perfectly integrates with CDF, CDE, and CDA. (For more resources related to this topic, see here.) The dashboards that Webdetails creates make use of the CCC charts, usually with a great level of customization. Customizing them is a way to make them fancy and really good-looking, and even more importantly, it is a way to create a visualization that best fits the customer/end user's needs. We really should be focused on having the best visualizations for the end user, and CCC is one of the best ways to achieve this, but do this this you need to have a very deep knowledge of the library, and know how to get amazing results. I think I could write an entire book just about CCC, and in this article I will only be able to cover a small part of what I like, but I will try to focus on the basics and give you some tips and tricks that could make a difference.I'll be happy if I can give you some directions that you follow, and then you can keep searching and learning about CCC. An important part of CCC is understanding some properties such as the series in rows or the crosstab mode, because that is where people usually struggle at the start. When you can't find a property to change some styling/functionality/behavior of the charts, you might find a way to extend the options by using something called extension points, so we will also cover them. I also find the interaction within the dashboard to be an important feature.So we will look at how to use it, and you will see that it's very simple. In this article,you will learn how to: Understand the properties needed to adapt the chart to your data source results Use the properties of a CCC chart Create a CCC chat by using the JavaScript library Make use of internationalization of CCC charts See how to handle clicks on charts Scale the base axis Customize the tooltips Some background on CCC CCC is built on top of Protovis, a JavaScript library that allows you to produce visualizations just based on simple marks such as bars, dots, and lines, among others, which are created through dynamic properties based on the data to be represented. You can get more information on this at: http://mbostock.github.io/protovis/. If you want to extend the charts with some elements that are not available you can, but it would be useful to have an idea about how Protovis works.CCC has a great website, which is available at http://www.webdetails.pt/ctools/ccc/, where you can see some samples including the source code. On the page, you can edit the code, change some properties, and click the apply button. If the code is valid, you will see your chart update.As well as that, it provides documentation for almost all of the properties and options that CCC makes available. Making use of the CCC library in a CDF dashboard As CCC is a chart library, you can use it as you would use it on any other webpage, by using it like the samples on CCC webpages. But CDF also provides components that you can implement to use a CCC chart on a dashboard and fully integrate with the life cycle of the dashboard. 
To use a CCC chart on a CDF dashboard, the HTML that is invoked from the XCDF file would look like the following (as we already covered how to build a CDF dashboard, I will not focus on that, and will mainly focus on the JavaScript code):

<div class="row">
  <div class="col-xs-12">
    <div id="chart"/>
  </div>
</div>
<script language="javascript" type="text/javascript">
  require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccBarChartComponent'],
    function(Dashboard, CccBarChartComponent) {
    var dashboard = new Dashboard();
    var chart = new CccBarChartComponent({
        type: "cccBarChart",
        name: "cccChart",
        executeAtStart: true,
        htmlObject: "chart",
        chartDefinition: {
            height: 200,
            path: "/public/…/queries.cda",
            dataAccessId: "totalSalesQuery",
            crosstabMode: true,
            seriesInRows: false,
            timeSeries: false,
            plotFrameVisible: false,
            compatVersion: 2
        }
    });
    dashboard.addComponent(chart);
    dashboard.init();
  });
</script>

The most important thing here is the use of the CCC chart component; the example covered here is a bar chart. We can see this both from the object we are instantiating, CccBarChartComponent, and from the type, which is cccBarChart. The previous dashboard will execute the query specified as the dataAccessId of the CDA file set on the path property, and render the chart on the dashboard. We are also saying that its data comes from the query in crosstab mode, but that the base axis should not be a timeSeries. There are series in the columns, but don't worry about this as we'll be covering it later.

The existing CCC components that you are able to use out of the box inside CDF dashboards are as follows. Don't forget that CCC has plenty of charts, so the sample images that you will see in the following table are just one example of the type of charts you can achieve.

CCC Component | Chart Type | Sample Chart
CccAreaChartComponent | cccAreaChart |
CccBarChartComponent | cccBarChart | http://www.webdetails.pt/ctools/ccc/#type=bar
CccBoxplotChartComponent | cccBoxplotChart | http://www.webdetails.pt/ctools/ccc/#type=boxplot
CccBulletChartComponent | cccBulletChart | http://www.webdetails.pt/ctools/ccc/#type=bullet
CccDotChartComponent | cccDotChart | http://www.webdetails.pt/ctools/ccc/#type=dot
CccHeatGridChartComponent | cccHeatGridChart | http://www.webdetails.pt/ctools/ccc/#type=heatgrid
CccLineChartComponent | cccLineChart | http://www.webdetails.pt/ctools/ccc/#type=line
CccMetricDotChartComponent | cccMetricDotChart | http://www.webdetails.pt/ctools/ccc/#type=metricdot
CccMetricLineChartComponent | cccMetricLineChart |
CccNormalizedBarChartComponent | cccNormalizedBarChart |
CccParCoordChartComponent | cccParCoordChart |
CccPieChartComponent | cccPieChart | http://www.webdetails.pt/ctools/ccc/#type=pie
CccStackedAreaChartComponent | cccStackedAreaChart | http://www.webdetails.pt/ctools/ccc/#type=stackedarea
CccStackedDotChartComponent | cccStackedDotChart |
CccStackedLineChartComponent | cccStackedLineChart | http://www.webdetails.pt/ctools/ccc/#type=stackedline
CccSunburstChartComponent | cccSunburstChart | http://www.webdetails.pt/ctools/ccc/#type=sunburst
CccTreemapAreaChartComponent | cccTreemapAreaChart | http://www.webdetails.pt/ctools/ccc/#type=treemap
CccWaterfallAreaChartComponent | cccWaterfallAreaChart | http://www.webdetails.pt/ctools/ccc/#type=waterfall

In the sample code, you will find a property called compatMode that has a value of 2 set.
This will make CCC work as a revamped version that delivers more options, a lot of improvements, and makes it easier to use.

Mandatory and desirable properties

Among others, such as name, datasource, and htmlObject, there are other chart properties that are mandatory. The height is really important, because if you don't set the height of the chart, you will not fit the chart in the dashboard. The height should be specified in pixels. If you don't set the width of the component, the chart will grab the width of the element where it's being rendered, that is, the width of the HTML element with the name specified in the htmlObject property.

The seriesInRows, crosstabMode, and timeSeries properties are optional, but depending on the kind of chart you are generating, you might want to specify them. The use of these properties becomes clear if we can also see the output of the queries we are executing. We need to get deeper into the properties that are related to mapping the data to visual elements.

Mapping data

We need to be aware of the way that data mapping is done in the chart. You can understand how it works if you imagine the data input as a table. CCC can receive the data in two different structures: relational and crosstab. If CCC receives data as a crosstab, it will translate it to a relational structure. You can see this in the following examples.

Crosstab

The following table is an example of the crosstab data structure:

           | Column Data 1    | Column Data 2
Row Data 1 | Measure Data 1.1 | Measure Data 1.2
Row Data 2 | Measure Data 2.1 | Measure Data 2.2

Creating crosstab queries
To create a crosstab query, you can usually do this with grouping when using SQL, or just use MDX, which allows us to easily specify a set for the columns and a set for the rows.

Just by looking at the previous and following examples, you should be able to understand that in the crosstab structure (the previous), columns and rows are part of the result set, while in the relational format (the following), column headers are not part of the result set, but are part of the metadata that is returned from the query. The relational format is as follows:

Column        | Row        | Value
Column Data 1 | Row Data 1 | Measure Data 1.1
Column Data 2 | Row Data 1 | Measure Data 1.2
Column Data 1 | Row Data 2 | Measure Data 2.1
Column Data 2 | Row Data 2 | Measure Data 2.2

The preceding two data structures represent the options when setting the properties crosstabMode and seriesInRows.

The crosstabMode property

To better understand these concepts, we will make use of a real example. The crosstabMode property is easy to understand when comparing the two tables that represent the results of two queries.

Non-crosstab (relational):

Markets | Sales
APAC    | 1281705
EMEA    | 50028224
Japan   | 503957
NA      | 3852061

Crosstab:

Markets | 2003  | 2004  | 2005
APAC    | 3529  | 5938  | 3411
EMEA    | 16711 | 23630 | 9237
Japan   | 2851  | 1692  | 380
NA      | 13348 | 18157 | 6447

In the previous tables, you can see that on the left-hand side you can find the values of sales for each of the territories. The values presented depend on only one variable: the territory. We can say that we are able to get all the information just by looking at the rows, where we can see a direct connection between markets and the sales value.
In the table presented on the right, you will find a value for each territory/year, meaning that the values presented in the sample matrix are dependent on two variables: the territory in the rows and the year in the columns. Here we need both the rows and the columns to know what each one of the values represents. Relevant information can be found in the rows and the columns, so this is a crosstab. Crosstabs display the joint distribution of two or more variables, and are usually represented in the form of a contingency table in a matrix.

When the result of a query is dependent on only one variable, you should set the crosstabMode property to false. When it is dependent on two or more variables, you should set the crosstabMode property to true, otherwise CCC will just use the first two columns, like in the non-crosstab example.

The seriesInRows property

Now let's use the same example where we have a crosstab:

The previous image shows two charts: the one on the left is a crosstab with the series in the rows, and the one on the right is also a crosstab but the series are not in the rows (the series are in the columns). When the crosstab is set to true, it means that the measure column title can be translated as a series or a category, and that's determined by the property seriesInRows. If this property is set to true, then it will read the series from the rows, otherwise it will read the series from the columns.

If the crosstab is set to false, the community chart component expects a row to correspond to exactly one data point, and two or three columns can be returned. When three columns are returned, they can be category, series, and data or series, category, and data, and that's determined by the seriesInRows property. When set to true, CCC will expect the structure to have the three columns category, series, and data. When it is set to false, it will expect them to be series, category, and data.

A simple table should give you a quicker reference, so here goes:

crosstabMode | seriesInRows | Description
true  | true  | The column titles will act as category values while the series values are represented as data points of the first column.
true  | false | The column titles will act as series values while the category values are represented as data points of the first column.
false | true  | The column titles will act as category values while the series values are represented as data points of the first column.
false | false | The column titles will act as category values while the series values are represented as data points of the first column.

The timeSeries and timeSeriesFormat properties

The timeSeries property defines whether the data to be represented by the chart is discrete or continuous. If we want to present some values over time, then the timeSeries property should be set to true. When we set the chart to be a timeSeries, we also need to set another property to tell CCC how it should interpret the dates that come from the query. Check out the following image for timeSeries and timeSeriesFormat:

The result of one of the queries has the year and the abbreviated month name separated by -, like 2015-Nov. For the chart to understand it as a date, we need to specify the format by setting the property timeSeriesFormat, which in our example would be %Y-%b, where %Y is the year represented by four digits, and %b is the abbreviated month name.
The format should be specified using the Protovis format that follows the same format as strftime in the C programming language, aside from some unsupported options. To find out what options are available, you should take a look at the documentation, which you will find at: https://mbostock.github.io/protovis/jsdoc/symbols/pv.Format.date.html. Making use of CCC inCDE There are a lot of properties that will use a default value, and you can find out aboutthem by looking at the documentation or inspecting the code that is generated by CDE when you use the charts components. By looking at the console log of your browser, you should also able to understand and get some information about the properties being used by default and/or see whether you are using a property that does not fit your needs. The use of CCC charts in CDE is simpler, just because you may not need to code. I am only saying may because to achieve quicker results, you may apply some code and make it easier to share properties among different charts or type of charts. To use a CCC chart, you just need to select the property that you need to change and set its value by using the dropdown or by just setting the value: The previous image shows a group of properties with the respective values on the right side. One of the best ways to start to get used to the CCC properties is to use the CCC page available as part of the Webdetails page: http://www.webdetails.pt/ctools/ccc. There you will find samples and the properties that are being used for each of the charts. You can use the dropdown to select different kinds of charts from all those that are available inside CCC. You also have the ability to change the properties and update the chart to check the result immediately. What I usually do, as it's easier and faster, is to change the properties here and check the results and then apply the necessary values for each of the properties in the CCC charts inside the dashboards. In the following samples, you will also find documentation about the properties, see where the properties are separated by sections of the chart, and after that you will find the extension points. On the site, when you click on a property/option, you will be redirected to another page where you will find the documentation and how to use it. Changing properties in the preExecution or postFetch We are able to change the properties for the charts, as with any other component. Inside the preExecution, this, refers to the component itself, so we will have access to the chart's main object, which we can also manipulate and add, remove, and change options. For instance, you can apply the following code: function() {    var cdProps = {         dotsVisible: true,         plotFrame_strokeStyle: '#bbbbbb',         colors: ['#005CA7', '#FFC20F', '#333333', '#68AC2D']     };     $.extend(true, this.chartDefinition, cdProps); } What we are doing is creating an object with all the properties that we want to add or change for the chart, and then extending the chartDefinitions (where the properties or options are). This is what we are doing with the JQuery function, extending. Use the CCC website and make your life easier This way to apply options makes it easier to set the properties. Just change or add the properties that you need, test it, and when you're happy with the result, you just need to copy them into the object that will extend/overwrite the chart options. 
Just keep in mind that the properties you change directly in the editor will be overwritten by the ones defined in the preExecution, if they match each other of course. Why is this important? It's because not all the properties that you can apply to CCC are exposed in CDE, so you can use the preExecution to use or set those properties. Handling the click event One important thing about the charts is that they allow interaction. CCC provides a way to handle some events in the chart and click is one of those events. To have it working, we need to change two properties: clickable, which needs to be set to true, and clickAction where we need to write a function with the code to be executed when a click happens. The function receives one argument that usually is referred to as a scene. The scene is an object that has a lot of information about the context where the event happened. From the object you will have access to vars, another object where we can find the series and the categories where the clicked happened. We can use the function to get the series/categories being clicked and perform a fireChange that can trigger updates on other components: function(scene) {     var series =  "Series:"+scene.atoms.series.label;     var category =  "Category:"+scene.vars.category.label;     var value = "Value:"+scene.vars.value.label;     Logger.log(category+"&"+value);     Logger.log(series); } In the previous code example, you can find the function to handle the click action for a CCC chart. When the click happens, the code is executed, and a variable with the click series is taken from scene.atoms.series.label. As well as this, the categories clickedscene.vars.category.label and the value that crosses the same series/category in scene.vars.value.value. This is valid for a crosstab, but you will not find the series when it's non-crosstab. You can think of a scene as describing one instance of visual representation. It is generally local to each panel or section of the chart and it's represented by a group of variables that are organized hierarchically. Depending on the scene, it may contain one or many datums. And you must be asking what a hell is a datum? A datum represents a row, so it contains values for multiple columns. We also can see from the example that we are referring to atoms, which hold at least a value, a label, and a key of a column. To get a better understanding of what I am talking about, you should perform a breakpoint anywhere in the code of the previous function and explore the object scene. In the previous example, you would be able to access to the category, series labels, and value, as you can see in the following table:   Corosstab Non-crosstab Value scene.vars.value.label or scene.getValue(); scene.vars.value.label or scene.getValue(); Category scene.vars.category.label or scene.getCategoryLabel(); scene.vars.category.label or scene.getCategoryLabel(); Series scene.atoms.series.label or scene.getSeriesLabel()   For instance, if you add the previous function code to a chart that is a crosstab where the categories are the years and the series are the territories, if you click on the chart, the output would be something like: [info] WD: Category:2004 & Value:23630 [info] WD: Series:EMEA This means that you clicked on the year 2004 for the EMEA. EMEA sales for the year 2004 were 23,630. 
If you replace the Logger functions withfireChangeas follows, you will be able to make use of the label/value of the clicked category to render other components and some details about them: this.dashboard.fireChange("parameter", scene.vars.category.label); Internationalization of CCCCharts We already saw that all the values coming from the database should not need to be translated. There are some ways in Pentaho to do this, but we may still need to set the title of a chart, where the title should be also internationalized. Another case is when you have dates where the month is represented by numbers in the base axis, but you want to display the month's abbreviated name. This name could be also translated to different languages, which is not hard. For the title, sub-title, and legend, the way to do it is using the instructions on how to set properties on preExecution.First, you will need to define the properties files for the internationalization and set the properties/translations: var cd = this.chartDefinition; cd.title =  this.dashboard.i18nSupport.prop('BOTTOMCHART.TITLE'); To change the title of the chart based on the language defined, we will need to define a function, but we can't use the property on the chart because that will only allow you to define a string, so you will not be able to use a JavaScript instruction to get the text. If you set the previous example code on the preExecution of the chart then, you will be able to. It may also make sense to change not only the titles, but for instance also internationalize the month names. If you are getting data like 2004-02, this may correspond to a time series format as %Y-%m. If that's the case and you want to display the abbreviated month name, then you may use the baseAxisTickFormatter and the dateFormat function from the dashboard utilities, also known as Utils. The code to write inside the preExecution would be like: var cd = this.chartDefinition; cd.baseAxisTickFormatter = function(label) {   return Utils.dateFormat(moment(label, 'YYYY-mmm'), 'MMM/YYYY'); }; The preceding code uses the baseAxisTickFormatter, which allows you to write a function that receives an argument, identified on the code as a label, because it will store the label for each one of the base axis ticks. We are using the dateFormatmethod and moment to format and return the year followed by the abbreviated month name. You can get information about the language defined and being used by running the following instruction moment.locale(); If you need to, you can change the language. Format a basis axis label based on the scale When you are working with a time series chart, you may want to set a different format for the base axis labels. Let's suppose you want to have a chart that is listening to a time selector. If you select one year old data to be displayed on the chart, certainly you are not interested in seeing the minutes on the date label. However, if you want to display the last hour, the ticks of the base axis need to be presented in minutes. There is an extension point we can use to get a conditional format based on the scale of the base axis. 
The extension point is baseAxisScale_tickFormatter, and it can be used as in the following code:

baseAxisScale_tickFormatter: function(value, dateTickPrecision) {
  switch(dateTickPrecision) {
    case pvc.time.intervals.y:
      return format_date_year_tick(value);
      break;
    case pvc.time.intervals.m:
      return format_date_month_tick(value);
      break;
    case pvc.time.intervals.H:
      return format_date_hour_tick(value);
      break;
    default:
      return format_date_default_tick(value);
  }
}

It accepts a function with two arguments, the value to be formatted and the tick precision, and should return the formatted label to be presented on each label of the base axis. The previous code shows how the function is used. You can see a switch that, based on the base axis scale, applies a different format by calling a function. The functions in the code are not pre-defined; we need to write the functions or code that create the formatting. As one example of a function to format the date, we could use the Utils dateFormat function to return the formatted value to the chart.

The following table shows the intervals that can be used when verifying which time intervals are being displayed on the chart:

Interval | Description  | Number representing the interval
y        | Year         | 31536e6
m        | Month        | 2592e6
d30      | 30 days      | 2592e6
d7       | 7 days       | 6048e5
d        | Day          | 864e5
H        | Hour         | 36e5
m        | Minute       | 6e4
s        | Second       | 1e3
ms       | Milliseconds | 1

Customizing tooltips

CCC provides the ability to change the tooltip format that comes by default, and it can be changed using the tooltipFormat property. We can change it, making it look like the following image, on the right side. You can also compare it to the one on the left, which is the default one:

The default tooltip format might change depending on the chart type, but also on some options that you apply to the chart, mainly crosstabMode and seriesInRows. The property accepts a function that receives one argument, the scene, which has a structure similar to the one already covered for the click event. You should return the HTML to be shown on the dashboard when we hover over the chart. In the previous image, you will see the default tooltip on the chart on the left side, and a different tooltip on the right. That's because the following code was applied:

tooltipFormat: function(scene){
  var year = scene.atoms.series.label;
  var territory = scene.atoms.category.value;
  var sales = Utils.numberFormat(scene.vars.value.value, "#.00A");
  var html = '<html>' +
    '<div>Sales for '+year+' at '+territory+':'+sales+'</div>' +
    '</html>';
  return html;
}

The code is pretty self-explanatory. First we are setting some variables such as the year, territory, and the sales value, which we need to present inside the tooltip. Like in the click event, we are getting the labels/values from the scene, which might depend on the properties we set for the chart. For the sales, we are also abbreviating the value, using two decimal places. And last, we build the HTML to be displayed when we hover over the chart.

You can also change the base axis tooltip
Like we are doing for the tooltip when hovering over the values represented in the chart, we can also use baseAxisTooltip; just don't forget that baseAxisTooltipVisible must be set to true (the value by default). Getting the values to show will be pretty similar.

It can get more complex, though not much more, when we also want, for instance, to display the total value of sales for one year or for the territory. Based on that, we could also present the percentage relative to the total.
We should use the property as explained earlier. The previous image is one example of how we can customize a tooltip. In this case, we are showing the value but also the percentage that the hovered-over value represents for the territory (as a percentage across all the years) and for the year (as a percentage across all the territories):

tooltipFormat: function(scene){
  var year = scene.getSeriesLabel();
  var territory = scene.getCategoryLabel();
  var value = scene.getValue();
  var sales = Utils.numberFormat(value, "#.00A");
  var totals = {};
  _.each(scene.chart().data._datums, function(element) {
    var value = element.atoms.value.value;
    totals[element.atoms.category.label] =
      (totals[element.atoms.category.label]||0)+value;
    totals[element.atoms.series.label] =
      (totals[element.atoms.series.label]||0)+value;
  });
  var categoryPerc = Utils.numberFormat(value/totals[territory], "0.0%");
  var seriesPerc = Utils.numberFormat(value/totals[year], "0.0%");
  var html = '<html>' +
    '<div class="value">'+sales+'</div>' +
    '<div class="dValue">Sales for '+territory+' in '+year+'</div>' +
    '<div class="bar">'+
      '<div class="pPerc">'+categoryPerc+' of '+territory+'</div>'+
      '<div class="partialBar" style="width:'+categoryPerc+'"></div>'+
    '</div>' +
    '<div class="bar">'+
      '<div class="pPerc">'+seriesPerc+' of '+year+'</div>'+
      '<div class="partialBar" style="width:'+seriesPerc+'"></div>'+
    '</div>' +
    '</html>';
  return html;
}

The first lines of the code are pretty similar, except that we are using scene.getSeriesLabel() in place of scene.atoms.series.label. They do the same thing, so these are only different ways to get the values/labels. Then the totals are calculated by iterating over all the elements of scene.chart().data._datums, which returns the logical/relational table, a combination of territory, year, and value. The last part just builds the HTML with all the values and labels that we already got from the scene. There are multiple ways to get the values you need, for instance to customize the tooltip; you just need to explore the hierarchical structure of the scene and get used to it.

The image that you are seeing also presents a different style, and that should be done using CSS. You can add CSS to your dashboard and change the style of the tooltip, not just the format.

Styling tooltips

When we want to style a tooltip, we may want to use the developer tools to check the classes or names and CSS properties already applied, but it's hard because the popup does not stay still. We can change the tooltipDelayOut property and increase its default value from 80 to 1000 or more, depending on the time you need. When you want to apply some styles to the tooltips for a particular chart, you can do so by setting a CSS class on the tooltip. For that, you should use the property tooltipClassName and set the class name to be added and later used in the CSS.

Summary

In this article, we provided a quick overview of how to use CCC in CDF and CDE dashboards and showed you what kinds of charts are available. We covered some of the base options as well as some advanced options that you might use to get a more customized visualization.

Resources for Article:

Further resources on this subject:
Diving into OOP Principles [article]
Python Scripting Essentials [article]
Building a Puppet Module Skeleton [article]

Advanced Shell Topics

Packt
26 Apr 2016
10 min read
In this article by Thomas Bitterman, the author of the book Mastering IPython 4.0, we will look at the tools the IPython interactive shell provides. With the split of the Jupyter and IPython projects, the command line provided by IPython will gain importance. This article covers the following topics: What is IPython? Installing IPython Starting out with the terminal IPython beyond Python Magic commands (For more resources related to this topic, see here.) What is IPython? IPython is an open source platform for interactive and parallel computing. It started with the realization that the standard Python interpreter was too limited for sustained interactive use, especially in the areas of scientific and parallel computing. Overcoming these limitations resulted in a three-part architecture: An enhanced, interactive shell Separation of the shell from the computational kernel A new architecture for parallel computing This article will provide a brief overview of the architecture before introducing some basic shell commands. Before proceeding further, however, IPython needs to be installed. Those readers with experience in parallel and high-performance computing but new to IPython will find the following sections useful in quickly getting up to speed. Those experienced with IPython may skim the next few sections, noting where things have changed now that the notebook is no longer an integral part of development. Installing IPython The first step in installing IPython is to install Python. Instructions for the various platforms differ, but the instructions for each can be found on the Python home page at http://www.python.org. IPython requires Python 2.7 or ≥ 3.3. This article will use 3.5. Both Python and IPython are open source software, so downloading and installation are free. A standard Python installation includes the pip package manager. pip is a handy command-line tool that can be used to download and install various Python libraries. Once Python is installed, IPython can be installed with this command: pip install ipython IPython comes with a test suite called iptest. To run it, simply issue the following command: iptest A series of tests will be run. It is possible (and likely on Windows) that some libraries will be missing, causing the associated tests to fail. Simply use pip to install those libraries and rerun the test until everything passes. It is also possible that all tests pass without an important library being installed. This is the readline library (also known as PyReadline). IPython will work without it but will be missing some features that are useful for the IPython terminal, such as command completion and history navigation. To install readline, use pip: pip install readline pip install gnureadline At this point, issuing the ipython command will start up an IPython interpreter: ipython IPython beyond Python No one would use IPython if it were not more powerful than the standard terminal. Much of IPython's power comes from two features: Shell integration Magic commands Shell integration Any command starting with ! is passed directly to the operating system to be executed, and the result is returned. By default, the output is then printed out to the terminal. If desired, the result of the system command can be assigned to a variable. The result is treated as a multiline string, and the variable is a list containing one string element per line of output. 
For example: In [22]: myDir = !dir In [23]: myDir Out[23]: [' Volume in drive C has no label.', ' Volume Serial Number is 1E95-5694', '', ' Directory of C:\Program Files\Python 3.5', '', '10/04/2015 08:43 AM <DIR> .', '10/04/2015 08:43 AM <DIR> ..',] While this functionality is not entirely absent in straight Python (the OS and subprocess libraries provide similar abilities), the IPython syntax is much cleaner. Additional functionalities such as input and output caching, directory history, and automatic parentheses are also included. History The previous examples have had lines that were prefixed by elements such as In[23] and Out[15]. In and Out are arrays of strings, where each element is either an input command or the resulting output. They can be referred to using the arrays notation, or "magic" commands can accept the subscript alone. Magic commands IPython also accepts commands that control IPython itself. These are called "magic" commands, and they start with % or %%. A complete list of magic commands can be found by typing %lsmagic in the terminal. Magics that start with a single % sign are called "line" magics. They accept the rest of the current line for arguments. Magics that start with %% are called "cell" magics. They accept not only the rest of the current line but also the following lines. There are too many magic commands to go over in detail, but there are some related families to be aware of: OS equivalents: %cd, %env, %pwd Working with code: %run, %edit, %save, %load, %load_ext, %%capture Logging: %logstart, %logstop, %logon, %logoff, %logstate Debugging: %debug, %pdb, %run, %tb Documentation: %pdef, %pdoc, %pfile, %pprint, %psource, %pycat, %%writefile Profiling: %prun, %time, %run, %time, %timeit Working with other languages: %%script, %%html, %%javascript, %%latex, %%perl, %%ruby With magic commands, IPython becomes a more full-featured development environment. A development session might include the following steps: Set up the OS-level environment with the %cd, %env, and ! commands. Set up the Python environment with %load and %load_ext. Create a program using %edit. Run the program using %run. Log the input/output with %logstart, %logstop, %logon, and %logoff. Debug with %pdb. Create documentation with %pdoc and %pdef. This is not a tenable workflow for a large project, but for exploratory coding of smaller modules, magic commands provide a lightweight support structure. Creating custom magic commands IPython supports the creation of custom magic commands through function decorators. Luckily, one does not have to know how decorators work in order to use them. An example will explain. First, grab the required decorator from the appropriate library: In [1]: from IPython.core.magic import(register_line_magic) Then, prepend the decorator to a standard IPython function definition: In [2]: @register_line_magic ...: def getBootDevice(line): ...: sysinfo = !systeminfo ...: for ln in sysinfo: ...: if ln.startswith("Boot Device"): ...: return(ln.split()[2]) ...: Your new magic is ready to go: In [3]: %getBootDevice Out[3]: '\Device\HarddiskVolume1' Some observations are in order: Note that the function is, for the most part, standard Python. Also note the use of the !systeminfo shell command. One can freely mix both standard Python and IPython in IPython. The name of the function will be the name of the line magic. The parameter, "line," contains the rest of the line (in case any parameters are passed). A parameter is required, although it need not be used. 
The Out associated with calling this line magic is the return value of the magic. Any print statements executed as part of the magic are displayed on the terminal but are not part of Out (or _). Cython We are not limited to writing custom magic commands in Python. Several languages are supported, including R and Octave. We will look at one in particular, Cython. Cython is a language that can be used to write C extensions for Python. The goal for Cython is to be a superset of Python, with support for optional static type declarations. The driving force behind Cython is efficiency. As a compiled language, there are performance gains to be had from running C code. The downside is that Python is much more productive in terms of programmer hours. Cython can translate Python code into compiled C code, achieving more efficient execution at runtime while retaining the programmer-friendliness of Python. The idea of turning Python into C is not new to Cython. The default and most widely used interpreter (CPython) for Python is written in C. In some sense then, running Python code means running C code, just through an interpreter. There are other Python interpreter implementations as well, including those in Java (Jython) and C# (IronPython). CPython has a foreign function interface to C. That is, it is possible to write C language functions that interface with CPython in such a way that data can be exchanged and functions invoked from one to the other. The primary use is to call C code from Python. There are, however, two primary drawbacks: writing code that works with the CPython foreign function interface is difficult in its own right; and doing so requires knowledge of Python, C, and CPython. Cython aims to remedy this problem by doing all the work of turning Python into C and interfacing with CPython internally to Cython. The programmer writes Cython code and leaves the rest to the Cython compiler. Cython is very close to Python. The primary difference is the ability to specify C types for variables using the cdef keyword. Cython then handles type checking and conversion between Python values and C values, scoping issues, marshalling and unmarshalling of Python objects into C structures, and other cross-language issues. Cython is enabled in IPython by loading an extension. In order to use the Cython extension, do this: In [1]: %load_ext Cython At this point, the cython cell magic can be invoked: In [2]: %%cython ...: def sum(int a, int b): ...: cdef int s = a+b ...: return s And the Cython function can now be called just as if it were a standard Python function: In [3]: sum(1, 1) Out[3]: 2 While this may seem like a lot of work for something that could have been written more easily in Python in the first place, that is the price to be paid for efficiency. If, instead of simply summing two numbers, a function is expensive to execute and is called multiple times (perhaps in a tight loop), it can be worth it to use Cython for a reduction in runtime. There are other languages that have merited the same treatment, GNU Octave and R among them. Summary In this article, we covered many of the basics of using IPython for development. We started out by just getting an instance of IPython running. The intrepid developer can perform all the steps by hand, but there are also various all-in-one distributions available that will include popular modules upon installation. By default, IPython will use the pip package managers. 
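The same decorator-based approach also works for cell magics. The following is a minimal sketch (the cellstats magic, its name, and its behavior are our own illustration, not something shipped with IPython): a custom cell magic is registered with register_cell_magic and receives two arguments, the line (anything after the magic name) and the cell body:

In [4]: from IPython.core.magic import register_cell_magic

In [5]: @register_cell_magic
   ...: def cellstats(line, cell):
   ...:     # line: text after the magic name; cell: the body of the cell
   ...:     return len(cell.splitlines())
   ...:

In [6]: %%cellstats
   ...: first line
   ...: second line
   ...:
Out[6]: 2

As with line magics, the return value of the function becomes the Out entry, while anything printed inside the magic goes straight to the terminal.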
Again, the all-in-one distributions provide added value, this time in the form of advanced package management capability. At that point, all that is obviously available is a terminal, much like the standard Python terminal. IPython offers two additional sources of functionality, however: configuration and magic commands. Magic commands fall into several categories: OS equivalents, working with code, logging, debugging, documentation, profiling, and working with other languages among others. Add to this the ability to create custom magic commands (in IPython or another language) and the IPython terminal becomes a much more powerful alternative to the standard Python terminal. Also included in IPython is the debugger—ipdb. It is very similar to the Python pdb debugger, so it should be familiar to Python developers. All this is supported by the IPython architecture. The basic idea is that of a Read-Eval-Print loop in which the Eval section has been separated out into its own process. This decoupling allows different user interface components and kernels to communicate with each other, making for a flexible system. This flexibility extends to the development environment. There are IDEs devoted to IPython (for example, Spyder and Canopy) and others that originally targeted Python but also work with IPython (for example, Eclipse). There are too many Python IDEs to list, and many should work with an IPython kernel "dropped in" as a superior replacement to a Python interpreter. Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Scientific Computing APIs for Python [article] Overview of Process Management in Microsoft Visio 2013 [article]

Getting Started with Apache Hadoop and Apache Spark

Packt
22 Apr 2016
12 min read
In this article by Venkat Ankam, author of the book, Big Data Analytics with Spark and Hadoop, we will understand the features of Hadoop and Spark and how we can combine them. (For more resources related to this topic, see here.) This article is divided into the following subtopics: Introducing Apache Spark Why Hadoop + Spark? Introducing Apache Spark Hadoop and MapReduce have been around for 10 years and have proven to be the best solution to process massive data with high performance. However, MapReduce lacked performance in iterative computing where the output between multiple MapReduce jobs had to be written to Hadoop Distributed File System (HDFS). In a single MapReduce job as well, it lacked performance because of the drawbacks of the MapReduce framework. Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades. The trend was to reference the URI when the network was cheaper (in 1990), Replicate when storage became cheaper (in 2000), and Recompute when memory became cheaper (in 2010), as shown in Figure 1: Figure 1: Trends of computing So, what really changed over a period of time? Over a period of time, tape is dead, disk has become tape, and SSD has almost become disk. Now, caching data in RAM is the current trend. Let's understand why memory-based computing is important and how it provides significant performance benefits. Figure 2 indicates that data transfer rates from various mediums to the CPU. Disk to CPU is 100 MB/s, SSD to CPU is 600 MB/s, and over a network to CPU is 1 MB to 1 GB/s. However, the RAM to CPU transfer speed is astonishingly fast, which is 10 GB/s. So, the idea is to cache all or partial data in memory so that higher performance can be achieved. Figure 2: Why memory? Spark history Spark started in 2009 as a research project in the UC Berkeley RAD Lab, that later became AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery. In 2011, AMPLab started to develop high-level components in Spark, such as Shark and Spark Streaming. These components are sometimes referred to as Berkeley Data Analytics Stack (BDAS). Spark was first open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013, where it is now a top-level project. In February 2014, it became a top-level project at the Apache Software Foundation. Spark has since become one of the largest open source communities in big data. Now, over 250+ contributors in 50+ organizations are contributing to Spark development. User base has increased tremendously from small companies to Fortune 500 companies.Figure 3 shows the history of Apache Spark: Figure 3: The history of Apache Spark What is Apache Spark? Let's understand what Apache Spark is and what makes it a force to reckon with in big data analytics: Apache Spark is a fast enterprise-grade large-scale data processing, which is interoperable with Apache Hadoop. It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM. Spark enables applications to distribute data in-memory reliably during processing. 
This is the key to Spark's performance as it allows applications to avoid expensive disk access and performs computations at memory speeds. It is suitable for iterative algorithms by having every iteration access data through memory. Spark programs perform 100 times faster than MapReduce in-memory or 10 times faster on the disk (http://spark.apache.org/). It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily and often 2 to 10 times less code is needed. Spark powers a stack of libraries including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application. Spark runs on Hadoop, Mesos, standalone resource managers, on-premise hardware, or in the cloud. What Apache Spark is not Hadoop provides us with HDFS for storage and MapReduce for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it. Spark has the ability to create distributed datasets from any file stored in the HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and so on). It's important to note that Spark is not Hadoop and does not require Hadoop to run. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat. Can Spark replace Hadoop? Spark is designed to interoperate with Hadoop. It's not a replacement for Hadoop but for the MapReduce framework on Hadoop. All Hadoop processing frameworks (Sqoop, Hive, Pig, Mahout, Cascading, Crunch, and so on) using MapReduce as the engine now use Spark as an additional processing engine. MapReduce issues MapReduce developers faced challenges with respect to performance and converting every business problem to a MapReduce problem. Let's understand the issues related to MapReduce and how they are addressed in Apache Spark: MapReduce (MR) creates separate JVMs for every Mapper and Reducer. Launching JVMs takes time. MR code requires a significant amount of boilerplate coding. The programmer needs to think and design every business problem in terms of Map and Reduce, which makes it a very difficult program. One MR job can rarely do a full computation. You need multiple MR jobs to finish the complete task and the programmer needs to design and keep track of optimizations at all levels. An MR job writes the data to the disk between each job and hence is not suitable for iterative processing. A higher level of abstraction, such as Cascading and Scalding, provides better programming of MR jobs, but it does not provide any additional performance benefits. MR does not provide great APIs either. MapReduce is slow because every job in a MapReduce job flow stores data on the disk. Multiple queries on the same dataset will read the data separately and create a high disk I/O, as shown in Figure 4: Figure 4: MapReduce versus Apache Spark Spark takes the concept of MapReduce to the next level to store the intermediate data in-memory and reuse it, as needed, multiple times. This provides high performance at memory speeds, as shown in Figure 4. If I have only one MapReduce job, does it perform the same as Spark? 
No, the performance of the Spark job is superior to the MapReduce job because of in-memory computations and shuffle improvements. The performance of Spark is superior to MapReduce even when the memory cache is disabled. A new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using block manager to send shuffle data), and a new external shuffle service make Spark perform the fastest petabyte sort (on 190 nodes with 46TB RAM) and terabyte sort. Spark sorted 100 TB of data using 206 EC2 i2.8x large machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MapReduce cluster of 2,100 nodes. This means that Spark sorted the same data 3x faster using 10x less machines. All the sorting took place on the disk (HDFS) without using Spark's in-memory cache (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html). To summarize, here are the differences between MapReduce and Spark: MapReduce Spark Ease of use Not easy to code and use Spark provides a good API and is easy to code and use Performance Performance is relatively poor when compared with Spark In-memory performance Iterative processing Every MR job writes the data to the disk and the next iteration reads from the disk Spark caches data in-memory Fault Tolerance Its achieved by replicating the data in HDFS Spark achieves fault tolerance by resilient distributed dataset (RDD) lineage Runtime Architecture Every Mapper and Reducer runs in a separate JVM Tasks are run in a preallocated executor JVM Shuffle Stores data on the disk Stores data in-memory and on the disk Operations Map and Reduce Map, Reduce, Join, Cogroup, and many more Execution Model Batch Batch, Interactive, and Streaming Natively supported Programming Languages Java Java, Scala, Python, and R Spark's stack Spark's stack components are Spark Core, Spark SQL and DataFrames, Spark Streaming, MLlib, and Graphx, as shown in Figure 5: Figure 5: The Apache Spark ecosystem Here is a comparison of Spark components versus Hadoop components: Spark Hadoop Spark Core MapReduce Apache Tez Spark SQL and DataFrames Apache Hive Impala Apache Tez Apache Drill Spark Streaming Apache Storm Spark MLlib Apache Mahout Spark GraphX Apache Giraph To understand the framework at a higher level, let's take a look at these core components of Spark and their integrations: Feature Details Programming languages Java, Scala, Python, and R. Scala, Python, and R shell for quick development. Core execution engine Spark Core: Spark Core is the underlying general execution engine for the Spark platform and all the other functionality is built on top of it. It provides Java, Scala, Python, and R APIs for the ease of development. Tungsten: This provides memory management and binary processing, cache-aware computation and code generation. Frameworks Spark SQL and DataFrames: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. Spark Streaming: Spark Streaming enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including filesystems, HDFS, Flume, Kafka, and Twitter. MLlib: MLlib is a machine learning library to create data products or extract deep meaning from the data. MLlib provides a high performance because of in-memory caching of data. Graphx: GraphX is a graph computation engine with graph algorithms to build graph applications. 
Off-heap storage | Tachyon: This provides reliable data sharing at memory speed within and across cluster frameworks/jobs. Spark's default OFF_HEAP (experimental) storage is Tachyon.
Cluster resource managers | Standalone: By default, applications are submitted to the standalone mode cluster, and each application will try to use all the available nodes and resources. YARN: YARN controls the resource allocation and provides dynamic resource allocation capabilities. Mesos: Mesos has two modes, coarse-grained and fine-grained. The coarse-grained approach has a static number of resources, just like the standalone resource manager. The fine-grained approach has dynamic resource allocation, just like YARN.
Storage | HDFS, S3, and other filesystems with the support of Hadoop InputFormat.
Database integrations | HBase, Cassandra, MongoDB, Neo4j, and RDBMS databases.
Integrations with streaming sources | Flume, Kafka and Kinesis, Twitter, ZeroMQ, and file streams.
Packages | http://spark-packages.org/ provides a list of third-party data source APIs and packages.
Distributions | Distributions from Cloudera, Hortonworks, MapR, and DataStax.

The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows:

- No need for copying or ETL of data between systems
- Combines processing types in one program
- Code reuse
- One system to learn
- One system to maintain

An example of unification is shown in Figure 6:

Figure 6: Unification of the Apache Spark ecosystem

Why Hadoop + Spark?

Apache Spark shines better when it is combined with Hadoop. To understand this, let's take a look at Hadoop and Spark features.

Hadoop features

The Hadoop features are described as follows:

Feature | Details
Unlimited scalability | Stores unlimited data by scaling out HDFS. Effectively manages the cluster resources with YARN. Runs multiple applications along with Spark. Thousands of simultaneous users.
Enterprise grade | Provides security with Kerberos authentication and ACL authorization. Data encryption. High reliability and integrity. Multitenancy.
Wide range of applications | Files: structured, semi-structured, or unstructured. Streaming sources: Flume and Kafka. Databases: any RDBMS and NoSQL database.

Spark features

The Spark features are described as follows:

Feature | Details
Easy development | No boilerplate coding. Multiple native APIs: Java, Scala, Python, and R. REPL for Scala, Python, and R.
In-memory performance | RDDs. Directed Acyclic Graph (DAG) to unify processing.
Unification | Batch, SQL, machine learning, streaming, and graph processing.

When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 7:

Figure 7: Spark applications on the Hadoop platform

Frequently asked questions about Spark

The following are frequent questions that practitioners raise about Spark:

My dataset does not fit in-memory. How can I use Spark?

Spark's operators spill data to the disk if it does not fit in-memory, allowing it to run well on data of any size. Likewise, cached datasets that do not fit in-memory are either spilled to the disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark will recompute the partitions that don't fit in-memory. The storage level can be changed to MEMORY_AND_DISK to spill partitions to the disk.
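To make the storage-level behaviour concrete, here is a minimal PySpark sketch (the input path and application name are made up for illustration) that caches an RDD with MEMORY_AND_DISK, so that partitions which do not fit in memory spill to disk rather than being recomputed:

from pyspark import SparkContext, StorageLevel

# Assumes a working Spark installation; the HDFS path is hypothetical.
sc = SparkContext(appName="StorageLevelExample")

lines = sc.textFile("hdfs:///data/large_dataset.txt")
words = lines.flatMap(lambda line: line.split())

# MEMORY_AND_DISK spills partitions that do not fit in memory to disk,
# instead of recomputing them as the default MEMORY_ONLY level would.
words.persist(StorageLevel.MEMORY_AND_DISK)

print(words.count())             # first action materializes and caches the RDD
print(words.distinct().count())  # reuses the cached partitions

sc.stop()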
Figure 8 shows the performance difference between fully cached and on-disk execution:

Figure 8: Spark performance: fully cached versus on the disk

How does fault recovery work in Spark?

Spark's built-in fault tolerance, based on RDD lineage, will automatically recover from failures. Figure 9 shows the performance over a failure in the 6th iteration of a k-means algorithm:

Figure 9: Fault recovery performance

Summary

In this article, we saw an introduction to Apache Spark and the features of Hadoop and Spark, and discussed how we can combine them.

Resources for Article:

Further resources on this subject:

- Adding a Spark to R [article]
- Big Data Analytics [article]
- Big Data Analysis (R and Hadoop) [article]

Finding Patterns in the Noise – Clustering and Unsupervised Learning

Packt
15 Apr 2016
17 min read
In this article by Joseph J, author of Mastering Predictive Analytics with Python, we will cover one of the natural questions to ask about a dataset: does it contain groups? For example, if we examine financial markets as a time series of prices over time, are there groups of stocks that behave similarly over time? Likewise, in a set of customer financial transactions from an e-commerce business, are there user accounts distinguished by patterns of similar purchasing activity? By identifying groups using the methods described in this article, we can understand the data as a set of larger patterns rather than just individual points. These patterns can help in making high-level summaries at the outset of a predictive modeling project, or serve as an ongoing way to report on the shape of the data we are modeling. Likewise, the groupings produced can serve as insights themselves, or they can provide starting points for the models. For example, the group to which a datapoint is assigned can become a feature of this observation, adding additional information beyond its individual values. Additionally, we can potentially calculate statistics (such as mean and standard deviation) for other features within these groups, which may be more robust as model features than individual entries.

(For more resources related to this topic, see here.)

In contrast to supervised methods, grouping or clustering algorithms are known as unsupervised learning, meaning we have no response, such as a sale price or click-through rate, that is used to determine the optimal parameters of the algorithm. Rather, we identify similar datapoints, and as a secondary analysis we might ask whether the clusters we identify share a common pattern in their responses (and thus suggest that the cluster is useful in finding groups associated with the outcome we are interested in).

The task of finding these groups, or clusters, has a few common ingredients that vary between algorithms. One is a notion of distance or similarity between items in the dataset, which will allow us to compare them. A second is the number of groups we wish to identify; this can be specified initially using domain knowledge, or determined by running an algorithm with different choices of initial groups to identify the best number of groups that describes a dataset, as judged by the numerical variance within the groups. Finally, we need a way to measure the quality of the groups we've identified; this can be done either visually or through the statistics that we will cover.

In this article we will dive into:

- How to normalize data for use in a clustering algorithm and how to compute similarity measurements for both categorical and numerical data
- How to use k-means to identify an optimal number of clusters by examining the loss function
- How to use agglomerative clustering to identify clusters at different scales
- Using affinity propagation to automatically identify the number of clusters in a dataset
- How to use spectral methods to cluster data with nonlinear boundaries

Similarity and distance

The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by what kinds of similarity we are trying to measure; in other cases, it is restricted by the properties of the dataset.
In the following, we illustrate several kinds of distance for numerical, categorical, time series, and set-based data. While this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms.

Numerical distances

Let's begin by looking at an example contained in the wine.data file. It contains a set of chemical measurements that describe the properties of different kinds of wines, and the class of quality (I-III) to which each wine is assigned (Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy). Open the file in an iPython notebook and look at the first few rows.

Notice that in this dataset we have no column descriptions. We need to parse these from the dataset description file, wine.names. With the following code, we generate a regular expression that will match a header name (we match a pattern where a number followed by a parenthesis has a column name after it, as you can see in the list of column names listed in the file), and add these to an array of column names along with the first column, which is the class label of the wine (whether it belongs to category I-III). We then assign this list to the dataframe column names. Now that we have appended the column names, we can look at a summary of the dataset.

How can we calculate a similarity between wines based on this data? One option would be to consider each of the wines as a point in a thirteen-dimensional space specified by its dimensions (for example, each of the properties other than the class). Since the resulting space has thirteen dimensions, we can't directly visualize the datapoints using a scatterplot to see if they are nearby, but we can calculate distances just the same as in a more familiar 2- or 3-dimensional space using the Euclidean distance formula, which is simply the length of the straight line between two points. This formula can be used whether the points are in a 2-dimensional plot or a more complex space such as this example, and is given by:

d(a, b) = sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2 )

Here, a and b are rows of the dataset and n is the number of columns.

One feature of the Euclidean distance is that columns whose scale is much different from the others can distort it. In our example, the values describing the magnesium content of each wine are ~100 times greater in magnitude than the features describing the alcohol content or ash percentage. If we were to calculate the distance between these datapoints, it would largely be determined by the magnesium concentration (as even small differences on this scale overwhelmingly determine the value of the distance calculation), rather than any of its other properties. While this might sometimes be desirable, in most applications we do not favour one feature over another and want to give equal weight to all columns. To get a fair distance comparison between these points, we need to normalize the columns so that they fall into the same numerical range (have similar maxima and minima). We can do so using the scale() function in scikit-learn. This function will subtract the mean value of a column from each element and then divide each point by the standard deviation of the column.
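A minimal sketch of the loading, naming, and scaling steps just described, with the column names written out explicitly rather than parsed with a regular expression (the file name and the exact column labels are assumptions), could look like this:

import pandas as pd
from sklearn.preprocessing import scale

# Column names from the wine dataset description file.
columns = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
           'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
           'Proanthocyanins', 'Color intensity', 'Hue',
           'OD280/OD315 of diluted wines', 'Proline']

wine = pd.read_csv('wine.data', header=None, names=columns)
print(wine.head())

# Scale every column except the class label: subtract the column mean,
# then divide by the column standard deviation.
wine_scaled = scale(wine.iloc[:, 1:])

# scale() returns a numpy array, so wrap it in a DataFrame to use describe().
print(pd.DataFrame(wine_scaled, columns=columns[1:]).describe())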
This normalization centers each column at 0 with variance 1, and in the case of normally distributed data this would give a standard normal distribution. Also note that the scale() function returns a numpy array, which is why we must wrap the output in a dataframe to use the pandas function describe().

Now that we've scaled the data, we can calculate the Euclidean distances between the points. We've now converted our dataset of 178 rows and 13 columns into a square matrix, giving the distance between each of these rows. In other words, row i, column j in this matrix represents the Euclidean distance between rows i and j in our dataset. This 'distance matrix' is the input we will use for the clustering algorithms in the following section.

If we just want to get a visual sense of how the points compare to each other, we could use multidimensional scaling (MDS) to create a visualization (see Modern Multidimensional Scaling: Theory and Applications, Borg, I., Groenen, P., Springer Series in Statistics (1997); Nonmetric multidimensional scaling: a numerical method, Kruskal, J., Psychometrika, 29 (1964); and Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Kruskal, J., Psychometrika, 29 (1964)). Multidimensional scaling attempts to find the set of lower dimensional coordinates (here, two dimensions) that best represents the distances in the higher dimensions of a dataset (here, the pairwise Euclidean distances we calculated from the 13 dimensions). It does this by minimizing the coordinates (x, y) according to the strain function:

Strain(x_1, ..., x_n) = [ 1 - (Σ_ij d_ij <x_i, x_j>)^2 / ( Σ_ij d_ij^2 * Σ_ij <x_i, x_j>^2 ) ]^(1/2)

Here, the d_ij are the distances we've calculated between points. In other words, we find coordinates that best capture the variation in the distances through the variation in the dot products of the coordinates. We can then plot the resulting coordinates, using the wine class to label points in the diagram. Note that the coordinates themselves have no interpretation (in fact, they could change each time we run the algorithm). Rather, it is the relative position of the points that we are interested in.

Given that there are many ways we could have calculated the distance between datapoints, is the Euclidean distance a good choice here? Visually, based on the multidimensional scaling plot, we can see there is separation between the classes based on the features we've used to calculate distance, so conceptually it appears that this is a reasonable choice in this case.
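A rough sketch of the distance matrix and MDS visualization just discussed, continuing from the hypothetical wine and wine_scaled objects of the previous sketch, might be:

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# 178 x 178 matrix of pairwise Euclidean distances between the scaled rows.
distances = squareform(pdist(wine_scaled, metric='euclidean'))

# Find two-dimensional coordinates that best preserve the precomputed distances.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(distances)

# Colour the points by wine class; only the relative positions are meaningful.
plt.scatter(coords[:, 0], coords[:, 1], c=wine['Class'])
plt.show()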
However, the decision also depends on what we are trying to compare; if we are interested in detecting wines with similar attributes in absolute values, then it is a good metric. However, what if we're not interested so much in the absolute composition of the wine, but in whether its variables follow similar trends among wines with different alcohol contents? In this case, we wouldn't be interested in the absolute difference in values, but rather in the correlation between the columns. This sort of comparison is common for time series, which we turn to next.

Correlations and time series

For time series data, we are often concerned with whether the patterns between series exhibit the same variation over time, rather than their absolute differences in value. For example, if we were to compare stocks, we might want to identify groups of stocks whose prices move up and down in similar patterns over time. The absolute price is of less interest than this pattern of increase and decrease.

Let's look at an example of the Dow Jones Industrial Average over time (Brown, M. S., Pelosi, M., and Dirska, H. (2013). Dynamic-radius Species-conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks. Machine Learning and Data Mining in Pattern Recognition, 7988, 27-41). This data contains the daily stock price (for 6 months) for a set of 30 stocks. Because all of the numerical values (the prices) are on the same scale, we won't normalize this data as we did with the wine dimensions.

We notice two things about this data. First, the closing price per week (the variable we will use to calculate correlation) is presented as a string. Second, the date is not in the right format for plotting. We will process both columns to fix this, converting them to a float and a datetime object, respectively. With this transformation, we can now make a pivot table to place the closing prices per week as columns and the individual stocks as rows. As we can see, we only need columns 2 and onwards to calculate correlations between rows. Let's calculate the correlation between these time series of stock prices by selecting the second column to the end columns of the data frame, calculating the pairwise correlation distance metric, and visualizing it using MDS, as before.

It is important to note that the Pearson coefficient, which we've calculated here, is a measure of linear correlation between these time series. In other words, it captures the linear increase (or decrease) of the trend in one price relative to another, but won't necessarily capture nonlinear trends. We can see this by looking at the formula for the Pearson correlation, which is given by:

P(a, b) = cov(a, b) / (sd(a) * sd(b)) = Σ_i (a_i - mean(a)) (b_i - mean(b)) / ( sqrt(Σ_i (a_i - mean(a))^2) * sqrt(Σ_i (b_i - mean(b))^2) )

This value varies from 1 (highly correlated) to -1 (inversely correlated), with 0 representing no correlation (such as a cloud of points). You might recognize the numerator of this equation as the covariance, which is a measure of how much the two datasets, a and b, vary with one another. You can understand this by considering that the numerator is maximized when corresponding points in both datasets are above or below their mean value. However, whether this accurately captures the similarity in the data depends upon the scale. In data that is distributed in regular intervals between a maximum and minimum, with roughly the same difference between consecutive values (which is essentially how a trend line appears), it captures this pattern well. However, consider a case in which the data is exponentially distributed, with orders of magnitude differences between the minimum and maximum, and the difference between consecutive datapoints also varies widely. Here, the Pearson correlation would be numerically dominated by only the largest terms, which might or might not represent the overall similarity in the data. This numerical sensitivity also occurs in the denominator, which represents the product of the standard deviations of both datasets. Thus, the value of the correlation is maximized when the variation in the two datasets is roughly explained by the product of their individual variations; there is no 'left over' variation between the datasets that is not explained by their respective standard deviations.
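A rough equivalent of the cleaning, pivoting, and correlation steps described above, assuming the file is called dow_jones_index.data and contains 'stock', 'date', and 'close' columns with prices formatted like '$15.82', could be:

import numpy as np
import pandas as pd

dow = pd.read_csv('dow_jones_index.data')

# Strip the dollar sign from the closing price and parse the date column.
dow['close'] = dow['close'].str.lstrip('$').astype(float)
dow['date'] = pd.to_datetime(dow['date'])

# One row per stock, one column per week; the values are closing prices.
prices = dow.pivot_table(index='stock', columns='date', values='close')

# Pearson correlation between every pair of stocks, turned into a distance
# (identical behaviour -> distance 0, perfectly opposite -> distance 2).
corr_dist = 1 - np.corrcoef(prices.values)

The corr_dist matrix can then be passed to the same kind of MDS call used for the wine example.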
Looking at the first two stocks in this dataset, this assumption of linearity appears to be a valid one for comparing datapoints. In addition to verifying that these stocks have a roughly linear correlation, this command introduces some new functions in pandas you may find useful. The first is iloc, which allows you to select indexed rows from a dataframe. The second is transpose, which inverts the rows and columns. Here, we select the first two rows, transpose, and then select all rows (prices) after the first (since the first is the ticker symbol).

Despite the trend we see in this example, we could imagine a nonlinear trend between prices. In these cases, it might be better to measure not the linear correlation of the prices themselves, but whether the high prices for one stock coincide with those of another. In other words, the rank of market days by price should be the same, even if the prices are nonlinearly related. We can also calculate this rank correlation, also known as Spearman's rho, using scipy, with the following formula:

Rho(a, b) = 1 - 6 Σ_i d_i^2 / ( n (n^2 - 1) )

Here, n is the number of datapoints in each of the two sets a and b, and d_i is the difference in ranks between each pair of datapoints a_i and b_i. Because we only compare the ranks of the data, not their actual values, this measure can capture variations up and down between two datasets, even if they vary over wide numerical ranges. Let's see if plotting the results using the Spearman correlation metric generates any differences in the pairwise distances of the stocks.

The Spearman correlation distances, based on the x and y axes, appear closer to each other, suggesting from the perspective of rank correlation that the time series appear more similar.

Though they differ in their assumptions about how the two compared datasets are distributed numerically, Pearson and Spearman correlations share the requirement that the two sets are of the same length. This is usually a reasonable assumption, and will be true of most of the examples we consider in this book. However, for cases where we wish to compare time series of unequal lengths, we can use Dynamic Time Warping (DTW). Conceptually, the idea of DTW is to warp one time series to align it with a second, by allowing us to open gaps in either dataset so that it becomes the same size as the other. What the algorithm needs to resolve is where the most similar areas of the two series are, so that gaps can be placed in the appropriate locations. In the simplest implementation, DTW consists of the following steps:

1. For a dataset a of length n and a dataset b of length m, construct a matrix of size n by m.
2. Set the top row and the leftmost column of this matrix to infinity.
3. For each point i in set a and point j in set b, compare their similarity using a cost function. To this cost function, add the minimum of the elements (i-1, j-1), (i-1, j), and (i, j-1) (that is, moving diagonally, up, or left). These conceptually represent the costs of opening a gap in one of the series versus aligning the same elements in both.

At the end of step 3, we will have traced the minimum-cost path to align the two series, and the DTW distance will be represented by the bottom-right corner of the matrix, (n, m).

A negative aspect of this algorithm is that step 3 involves computing a value for every pair of elements of series a and b. For large time series or large datasets, this can be computationally prohibitive.
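As a concrete illustration of these three steps (a naive quadratic implementation, not the FastDTW routine used later), a sketch for two one-dimensional series might look like this:

import numpy as np

def dtw_distance(a, b):
    """Naive O(n*m) dynamic time warping distance between two sequences."""
    n, m = len(a), len(b)
    # Accumulated-cost matrix padded with an extra row and column of infinity,
    # which is one way of realizing step 2 above.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])            # local cost function
            acc[i, j] = cost + min(acc[i - 1, j - 1],  # diagonal move
                                   acc[i - 1, j],      # move along a
                                   acc[i, j - 1])      # move along b
    return acc[n, m]                                   # bottom-right corner

print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))     # 0.0: the series align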
While a full discussion of algorithmic improvements is beyond the scope of our present examples, we refer interested readers to FastDTW (which we will use in our example) and SparseDTW as examples of improvements that can be evaluated using many fewer calculations (Al-Naymat, G., Chawla, S., & Taheri, J. (2012), SparseDTW: A Novel Approach to Speed up Dynamic Time Warping; and Stan Salvador and Philip Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, KDD Workshop on Mining Temporal and Sequential Data, pages 70-80, 2004). We can use the FastDTW algorithm to compare the stocks data as well, and to plot the resulting coordinates. First, we compare each pair of stocks and record their DTW distance in a matrix. For computational efficiency (because the distance between stocks i and j equals the distance between stocks j and i), we calculate only the upper triangle of this matrix. We then add the transpose (that is, the lower triangle) to this result to get the full distance matrix. Finally, we can use MDS again to plot the results.

Compared to the distribution of coordinates along the x and y axes for the Pearson correlation and the rank correlation, the DTW distances appear to span a wider range, picking up more nuanced differences between the time series of stock prices.

Now that we've looked at numerical and time series data, as a last example let's examine calculating similarity in categorical datasets.

Summary

In this section, we learned how to identify groups of similar items in a dataset, an exploratory analysis that we might frequently use as a first step in deciphering new datasets. We explored different ways of calculating the similarity between datapoints and described what kinds of data these metrics might best apply to. We examined both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, where every datapoint starts as its own cluster. Using a number of datasets, we showed examples where these algorithms will perform better or worse, and some ways to optimize them. We also saw our first (small) data pipeline, a clustering application in PySpark using streaming data.

Resources for Article:

Further resources on this subject:

- Python Data Structures [article]
- Big Data Analytics [article]
- Data Analytics [article]

Probabilistic Graphical Models in R

Packt
14 Apr 2016
18 min read
In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions that were made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, everyday, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is nothing if we don't exploit it and try to extract knowledge out of it. At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and making statistics. Back then, the only reliable tools were pencils and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy despite the prodigious development of the 19th century.

(For more resources related to this topic, see here.)

More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can, nowadays, store months if not years of data in a very small volume that can fit in the palm of our hand.

Among all the tools and theories that have been developed to analyze, understand, and manipulate data, probability and statistics became among the most used. In this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGM, for short).

A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques nowadays and has many industrial success stories. Probabilistic graphical models can deal with our imperfect knowledge about the world because our knowledge is always limited. We can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve these models and refine them as much as we can, and we can also infer new information or make predictions about unseen situations and events.

Probabilistic graphical models, seen from the point of view of mathematics, are a way to represent a probability distribution over several variables, which is called a joint probability distribution. In a PGM, such knowledge between variables can be represented with a graph, that is, nodes connected by edges with a specific meaning associated with them. Let's consider an example from the medical world: how to diagnose a cold. This is just an example and by no means medical advice; it is oversimplified for the sake of clarity. We define several random variables, such as the following:

- Se: This means the season of the year
- N: This means that the nose is blocked
- H: This means the patient has a headache
- S: This means that the patient regularly sneezes
- C: This means that the patient coughs
- Cold: This means the patient has a cold

Because each of the symptoms can exist at different degrees, it is natural to represent the variables as random variables.
For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4. In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 * 2^5 = 128 values in total (4 values for the season and 2 values for each of the other five random variables). It's quite a lot, and honestly, it's quite difficult to determine things such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on.

However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but less or no direct effect on a headache. In a probabilistic graphical model, we will represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph, and each relationship is an arrow between two nodes:

In the graph that follows, there is a direct relation between each node and each variable of the probabilistic graphical model, and also a direct relation between the arrows and the way we can simplify the joint probability distribution in order to make it tractable. Using a graph as a model to simplify a complex (and sometimes complicated) distribution presents numerous benefits:

- As we observed in the previous example, and in general when we model a problem, the random variables interact directly with only a small subset of other random variables. Therefore, this promotes more compact and tractable models.
- The knowledge and dependencies represented in a graph are easy to understand and communicate.
- The graph induces a compact representation of the joint probability distribution, and it is easy to make computations with.
- Algorithms for drawing inferences and learning can use graph theory and the associated algorithms to improve and facilitate all the inference and learning algorithms: compared to the raw joint probability distribution, using a PGM will speed up computations by several orders of magnitude.

The junction tree algorithm

The junction tree algorithm is one of the main algorithms for doing inference on PGMs. Its name arises from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow for the efficient computation of posterior probabilities. One of the main aspects is that this algorithm will not only compute the posterior distribution of the variables in the query, but also the posterior distribution of all other variables that are not observed. Therefore, for the same computational price, one can have any posterior distribution. Implementing a junction tree algorithm is a complex task, but fortunately, several R packages contain a full implementation, for example, gRain.

Let's say we have several variables A, B, C, D, E, and F. We will consider, for the sake of simplicity, that each variable is binary so that we won't have too many values to deal with.
We will assume the following factorization:

P(A,B,C,D,E,F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D)

This is represented by the following graph:

We first start by loading the gRain package into R:

library(gRain)

Then, we create our set of random variables from A to F:

val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)

The cptable function creates a conditional probability table, which is a factor for discrete variables. The probabilities associated with each variable are purely subjective and only serve the purpose of the example.

The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function because the algorithm just does everything at once:

plist = compileCPT(list(F,E,C,A,D,B))
plist

We also check whether the list of variables is correctly compiled into a probabilistic graphical model, and we obtain the following from the previous code:

CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )

This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability tables of a few variables:

print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true
       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false
       A
B       true false
  true   0.2   0.1
  false  0.8   0.9

The second output is a bit more complex, but if you look carefully, you will see that you have two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).

We finally create the graph and run the junction tree algorithm by calling this:

jtree = grain(plist)

Again, when we check the result, we obtain:

jtree
Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"

We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, then you need to recompute the junction tree.

Let's perform a few queries:

querygrain(jtree, nodes=c("F"), type="marginal")
$F
F
 true false
  0.1   0.9

Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table, because F has no parents.

querygrain(jtree, nodes=c("C"), type="marginal")
$C
C
 true false
 0.19  0.81

This is more interesting because it computes the marginal of C, while we only stated the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree algorithm to compute such a small marginal. We saw the variable elimination algorithm earlier, and that would be enough too. But if you ask for the marginal of B, then variable elimination will not work because of the loop in the graph.
However, the junction tree will give the following:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

We can also ask for more complex distributions, such as the joint distribution of B and A:

querygrain(jtree, nodes=c("A","B"), type="joint")
       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given, like A, B, C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")
, , B = true
       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false
       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338

Now, we want to observe a variable and compute the posterior distribution. Let's say F=true, and we want to propagate this information down to the rest of the network:

jtree2 = setEvidence(jtree, evidence=list(F="true"))

We can ask for any joint or marginal now:

querygrain(jtree, nodes=c("A"), type="marginal")
$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")
$A
A
 true false
 0.68  0.32

Here, we see that knowing that F=true changed the marginal distribution on A from its previous marginal (the second query is the one with jtree2, the tree with evidence). And we can query any other variable:

querygrain(jtree, nodes=c("B"), type="marginal")
$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")
$B
B
  true  false
0.4696 0.5304

Learning

Building a probabilistic graphical model generally requires three steps: defining the random variables, which are also the nodes of the graph; defining the structure of the graph; and finally defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data, and we can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.

Parameter learning can be done with several approaches, and there is no ultimate solution to the problem, because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the Maximum Likelihood of a model and also the Maximum A Posteriori. As you are now used to the notions of the prior and posterior of a distribution, you can already guess what a maximum a posteriori estimate does. Many algorithms are used, among which we can cite the Expectation Maximization (EM) algorithm, which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.

A graphical model of a linear model

A PGM can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationships between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. But remember the distribution of y for each data point indexed by i. Here, Xi is a row vector whose first element is always one, to capture the intercept of the linear model.
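The distribution just referred to is the standard normal linear model; written out in the usual notation (a reconstruction based on the surrounding description, not a quotation of the book's own equation), it and the corresponding likelihood are:

y_i \sim \mathcal{N}(X_i \beta,\ \sigma^2), \qquad
p(y \mid X, \theta) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid X_i \beta,\ \sigma^2\right)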
The parameter θ in the following graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² of the distribution of yi. The PGM for an observation of a linear model can be represented as follows:

So, this decomposition leads us to a second version of the graphical model, in which we explicitly separate the components of θ. In a PGM, when a rectangle is drawn around a set of nodes with a number of variables in a corner (N, for example), it means that the same graph is repeated many times. The likelihood function of a linear model, given above, can therefore be represented as a PGM. And the vector β can also be decomposed into its univariate components too. In this last iteration of the graphical model, we see that the parameter β could have a prior probability on it instead of being fixed. In fact, this parameter can also be considered as a random variable. For the time being, we will keep it fixed.

Latent Dirichlet Allocation

The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It's based on the same idea as the mixture model, with one notable exception: in this model, we assume that the data points might be generated by a combination of clusters, and not just one cluster at a time, as was the case before.

The LDA model is primarily used in text analysis and classification. Let's consider that a text document is composed of words making up sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic. The word stadium is likely to come from the European soccer topic. However, while the word ball should come with a higher probability from the European soccer topic, it is not unlikely to come from the cat topic, because cats like to play with balls too. So, it seems the word ball might belong to two topics at the same time, with a different degree of certainty. Other words, such as table, will certainly belong equally to both topics and presumably to others. They are very generic; except, of course, if we introduce another topic such as furniture.

A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic or the same topics within a paragraph and, to some extent, within the document. In general, we model a document with a bag-of-words model, that is, we consider a document to be a randomly generated set of words, using a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.

Let's say we have a categorical variable (in other words, a histogram) representing the probability of appearance of all the words from a dictionary.
Usually, in this kind of model, we restrict ourselves to long words only and remove the small words like and, to, but, the, a, and so on. These words are usually called stop words. Let wj be the jth word in a document. The following three graphs show the progression from representing a document (the left-most graph) to representing a collection of documents (the third graph).

Let θ be a distribution over topics; then, in the second graph from the left, we extend this model by choosing the kind of topic that will be selected at any time and then generating a word out of it. Therefore, the variable zi now becomes the variable zij, that is, topic i is selected for word j. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big data set. Assuming that the documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents. And, because the distribution on θ is categorical, we want to be Bayesian about it, mainly because it will help the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid non-observed words having zero probability. It smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words. And therefore, the final graph on the right represents the complete model. This is quite a complex graphical model, but techniques have been developed to fit the parameters and use this model.

If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:

1. α chooses the set of topics for a document.
2. From θ, we generate a topic zij.
3. From this topic, we generate a word wj.

In this model, only the words are observable. All the other variables will have to be determined without observation, exactly like in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words. The distribution of a topic mixture can be written down directly from this graphical model; in that formula, for each word we select a topic, hence the product from 1 to N. Integrating over θ and summing over z, we obtain the marginal distribution of a document. The final distribution can be obtained by taking the product of the marginal distributions of the single documents, so as to get the distribution over a collection of documents D (assuming that documents are independently and identically distributed).

The main problem to be solved now is how to compute the posterior distribution over θ and z given a document, which we obtain by applying the Bayes formula. Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.
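For reference, the standard forms of these quantities from the original LDA paper (an assumption about the exact equations intended here, using the usual notation in which β parameterizes the per-topic word distributions) are:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}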
Summary

The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism. It has many applications, for example in biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many packages exist in R to deal with all sorts of models and data, among which gRain and Rstan are very popular.

Resources for Article:

Further resources on this subject:

- Extending ElasticSearch with Scripting [article]
- Exception Handling in MySQL for Python [article]
- Breaking the Bank [article]

Detecting fraud on e-commerce orders with Benford's law

Packt
14 Apr 2016
7 min read
In this article, Andrea Cirillo, author of the book RStudio for R Statistical Computing Cookbook, explains how to detect fraud on e-commerce orders. Benford's law is a popular empirical law that states that the first digits of a population of data will follow a specific logarithmic distribution. This law was observed by Frank Benford around 1938 and since then has gained increasing popularity as a way to detect anomalous alterations of populations of data. Basically, testing a population against Benford's law means verifying that the given population respects this law. If deviations are discovered, further analysis is performed on the items related to those deviations. In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution.

(For more resources related to this topic, see here.)

Getting ready

This recipe will use functions from the well-documented benford.analysis package by Carlos Cinelli. We therefore need to install and load this package:

install.packages("benford.analysis")
library(benford.analysis)

In our example, we will use a data frame that stores e-commerce orders, provided within the book as an .Rdata file. In order to make it available within your environment, we need to load this file by running the following command (assuming the file is within your current working directory):

load("ecommerce_orders_list.Rdata")

How to do it...

1. Perform the Benford test on the order amounts:

benford_test <- benford(ecommerce_orders_list$order_amount,1)

2. Plot the test analysis:

plot(benford_test)

This will result in the following plot:

3. Highlight suspect digits:

suspectsTable(benford_test)

This will produce a table showing, for each digit, the absolute difference between expected and observed frequencies. The first digits will therefore be the more anomalous ones:

> suspectsTable(benford_test)
   digits absolute.diff
1:      5     4860.8974
2:      9     3764.0664
3:      1     2876.4653
4:      2     2870.4985
5:      3     2856.0362
6:      4     2706.3959
7:      7     1567.3235
8:      6     1300.7127
9:      8      200.4623

4. Define a function to extrapolate the first digit from each amount:

left = function (string,char){
  substr(string,1,char)}

5. Extrapolate the first digit from each amount:

ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount,1)

6. Filter amounts starting with the suspected digit:

suspects_orders <- subset(ecommerce_orders_list,first_digit == 5)

How it works

Step 1 performs the Benford test on the order amounts. In this step, we applied the benford() function to the amounts. Applying this function means evaluating the distribution of the first digits of the amounts against the expected Benford distribution.
The function will result in the production of the following objects:

Object | Description
Info | This object covers the following general information: data.name: the name of the data used; n: the number of observations used; n.second.order: the number of observations used for second-order analysis; number.of.digits: the number of first digits analyzed.
Data | A data frame with the following subobjects: lines.used: the original lines of the dataset; data.used: the data used; data.mantissa: the log data's mantissa; data.digits: the first digits of the data.
s.o.data | A data frame with the following subobjects: data.second.order: the differences of the ordered data; data.second.order.digits: the first digits of the second-order analysis.
Bfd | A data frame with the following subobjects: digits: the groups of digits analyzed; data.dist: the distribution of the first digits of the data; data.second.order.dist: the distribution of the first digits of the second-order analysis; benford.dist: the theoretical Benford distribution; data.second.order.dist.freq: the frequency distribution of the first digits of the second-order analysis; data.dist.freq: the frequency distribution of the first digits of the data; benford.dist.freq: the theoretical Benford frequency distribution; benford.so.dist.freq: the theoretical Benford frequency distribution of the second-order analysis; data.summation: the summation of the data values grouped by first digits; abs.excess.summation: the absolute excess summation of the data values grouped by first digits; difference: the difference between the data and Benford frequencies; squared.diff: the chi-squared difference between the data and Benford frequencies; absolute.diff: the absolute difference between the data and Benford frequencies.
Mantissa | A data frame with the following subobjects: mean.mantissa: the mean of the mantissa; var.mantissa: the variance of the mantissa; ek.mantissa: the excess kurtosis of the mantissa; sk.mantissa: the skewness of the mantissa.
MAD | The mean absolute deviation.
distortion.factor | The distortion factor.
Stats | A list of htest class statistics, as follows: chisq: Pearson's chi-squared test; mantissa.arc.test: the Mantissa Arc test.

Step 2 plots the test results. Running plot on the object resulting from the benford() function will result in a plot showing the following (from the upper-left corner to the bottom-right corner):

- First digit distribution
- Results of the second-order test
- Summation distribution for each digit
- Results of the chi-squared test
- Summation differences

If you look carefully at these plots, you will understand which digits show a distribution significantly different from the one expected under Benford's law. Nevertheless, in order to have a sounder basis for our considerations, we need to look at the suspects table, which shows the absolute differences between expected and observed frequencies. This is what we will do in the next step.

Step 3 highlights suspect digits. Using suspectsTable(), we can easily discover which digits present the greatest deviation from the expected distribution.
Looking at the so-called suspects table, we can see that the digit 5 shows up first within our table. In the next step, we will focus our attention on the orders whose amounts have this digit as the first digit.

Step 4 defines a function to extrapolate the first digit from each amount. This function leverages the substr() function and extracts the first digit from the number passed to it as an argument.

Step 5 adds a new column to the investigated dataset where the first digit is extrapolated.

Step 6 filters amounts starting with the suspected digit. After applying the left function to our sequence of amounts, we can now filter the dataset, retaining only the rows whose amounts have 5 as the first digit. We will now be able to perform analytical testing procedures on those items.

Summary

In this article, you learned how to apply the R language to an e-commerce fraud detection system.

Resources for Article:

Further resources on this subject:

- Recommending Movies at Scale (Python) [article]
- Visualization of Big Data [article]
- Big Data Analysis (R and Hadoop) [article]

Cluster Computing Using Scala

Packt
13 Apr 2016
18 min read
In this article, Vytautas Jančauskas, the author of the book Scientific Computing with Scala, explains how to write software to be run on distributed computing clusters. We will learn about the MPJ Express library here.

(For more resources related to this topic, see here.)

Very often, when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when no matter how many CPU cores and how much memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high-bandwidth and low-latency connections (or, if you are on a budget, standard Ethernet hardware is often enough). These computers usually utilize a network filesystem, allowing each node to see the same files. They communicate using message passing libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network.

Using MPJ Express for distributed computing

MPJ Express is a message passing library for distributed computing. It works in programming languages using the Java Virtual Machine (JVM), so we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express in pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website given here: http://mpj-express.org/

Setting up and running MPJ Express

The steps to set up and run MPJ Express are as follows:

First, download MPJ Express from the following link. The version at the time of this writing is 0.44. http://mpj-express.org/download.php

Unpack the archive and refer to the included README file for installation instructions. Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc):

export MPJ_HOME=/home/yourusername/mpj
export PATH=$MPJ_HOME/bin:$PATH

Here, mpj is the folder you extracted the archive you downloaded from the MPJ Express website to. If you are using a different system, you will have to do the equivalent of the above for your system to use MPJ Express.

We will want to use MPJ Express with the Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure:

scalacluster/
  lib/
  project/
    plugins.sbt
  build.sbt

I have chosen to name the project folder scalacluster here, but you can call it whatever you want. The .jar files in the lib folder will be accessible to your program now. Copy the contents of the lib folder from the mpj directory to this folder. Finally, create empty build.sbt and plugins.sbt files. Let's now write and run a simple "Hello, World!" program to test our setup:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    println("Hello, World, I'm <" + me + ">")
    MPI.Finalize()
  }
}

This should be familiar to everyone who has ever used MPI. First, we import everything from the mpj package.
Then, we initialize MPJ Express by calling MPI.Init; the arguments to MPJ Express will be passed from the command-line arguments you enter when running the program. The MPI.COMM_WORLD.Rank() function returns the MPJ process's rank. A rank is a unique identifier used to distinguish processes from one another. Ranks are used when you want different processes to do different things. A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. You can then use the process's rank to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size. Our program will simply print the process's rank for now.

We will now want to run it. If you don't have a distributed computing cluster readily available, don't worry: you can test your programs locally on your desktop or laptop, and the same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH, as described in the section on installing MPJ Express. The mpjrun.sh script will set up the environment for your MPJ Express processes and start said processes.

The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. That worked previously because we used the Scala runtime to execute our programs; MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include Scala's standard library. We need what is called a fat .jar, one that contains all the dependencies within itself. One way of generating it is to use a plugin for SBT called sbt-assembly. The website for this plugin is given here: https://github.com/sbt/sbt-assembly

There is a simple way of adding the plugin for use in our project. Remember that project/plugins.sbt file we created? All you need to do is add the following line to it (the line may be different for different versions of the plugin; consult the website):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Now, add the following to the build.sbt file you created:

lazy val root = (project in file(".")).
  settings(
    name := "mpjtest",
    version := "1.0",
    scalaVersion := "2.11.7"
  )

Then, execute the sbt assembly command from the shell to build the .jar file. The file will be put under the following directory if you are using the preceding build.sbt file. That is, if the folder you put the program and build.sbt in is /home/you/cluster:

/home/you/cluster/target/scala-2.11/mpjtest-assembly-1.0.jar

Now, you can run the mpjtest-assembly-1.0.jar file as follows:

$ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar
MPJ Express (0.44) is started in the multicore configuration
Hello, World, I'm <0>
Hello, World, I'm <2>
Hello, World, I'm <3>
Hello, World, I'm <1>

The argument -np specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system, since the precise order of execution of the different processes is undetermined. If you got output similar to the one shown here, then congratulations, you have done the majority of the work needed to write and deploy applications using MPJ Express.
Using Send and Recv

MPJ Express processes can communicate using Send and Recv. These methods constitute arguably the simplest and easiest to understand mode of operation, which is also probably the most error prone. We will look at these two first. The following are the signatures for the Send and Recv methods:

public void Send(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException

public Status Recv(java.lang.Object buf, int offset, int count, Datatype datatype, int source, int tag) throws MPIException

Both of these calls are blocking. This means that after calling Send, your process will block (will not execute the instructions following it) until a corresponding Recv is called by another process. Likewise, Recv will block the process until a corresponding Send happens. By corresponding, we mean that the dest and source arguments of the calls hold the ranks of the receiver and the sender, respectively. These two calls are enough to implement many complicated communication patterns. However, they are prone to various problems, such as deadlocks. They are also quite difficult to debug, since you have to make sure that each Send has the correct corresponding Recv and vice versa.

The parameters for Send and Recv are basically the same. The meanings of those parameters are summarized in the following table:

Argument       Type               Description
buf            java.lang.Object   Has to be a one-dimensional Java array. When using it from Scala, use a Scala array, which maps one-to-one to a Java array.
offset         int                The start of the data you want to pass, measured from the start of the array.
count          int                The number of items of the array you want to pass.
datatype       Datatype           The type of data in the array. Can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED.
dest/source    int                Either the destination to send the message to or the source to get the message from. You use the rank of the process to identify sources and destinations.
tag            int                Used to tag the message. Can be used to introduce different message types. Can be ignored for most common applications.

Let's look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern:

import mpi._
import scala.util.Random

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {

Here, we use an if statement to identify who we are based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken. In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to the processes with other ranks:

      for (i <- 1 until size) {
        val buf = Array(Random.nextInt(100))
        MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> please do work on " + buf(0))
      }

We iterate over the workers, whose ranks run from 1 up to (but not including) the number of processes you passed to the mpjrun.sh script. Let's say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array even though it is a single number, because both the Send and Recv methods expect an array as their first argument. We then use the Send method to send the data.
We specified the array as the buf argument, an offset of 0, a count of 1, the type MPI.INT, the destination as the for loop index, and the tag as 0. This means that each of our three worker processes will receive a (most probably) different number:

      for (i <- 1 until size) {
        val buf = Array(0)
        MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0)
        println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0))
      }

Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from the worker, and this concludes the master's part. We now move on to the workers:

    } else {
      val buf = Array(0)
      MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Understood, doing work on " + buf(0))
      buf(0) = buf(0) * buf(0)
      MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0)
      println("<" + me + ">: " + "Reporting back")
    }

The workers' code is identical for all of them. They receive a message from the master, calculate the square of the number, and send it back:

    MPI.Finalize()
  }
}

After you run the program, the results should be akin to the following, which I got when running this program on my system:

MASTER: Dear <1> please do work on 71
MASTER: Dear <2> please do work on 12
MASTER: Dear <3> please do work on 55
<1>: Understood, doing work on 71
<1>: Reporting back
MASTER: Dear <1> thanks for the reply, which was 5041
<3>: Understood, doing work on 55
<2>: Understood, doing work on 12
<2>: Reporting back
MASTER: Dear <2> thanks for the reply, which was 144
MASTER: Dear <3> thanks for the reply, which was 3025
<3>: Reporting back

Sending Scala objects in MPJ Express messages

Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be sending an instance of a Scala case class. Case classes can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class such as the one shown here. It has two attributes of type String and one attribute of type Int. So what do we do with a data type like this? The simplest answer to that problem is to serialize it. Serializing converts an object to a stream of characters or a string that can be sent over the network (or stored in a file, among other things) and later deserialized to get the original object back:

scala> case class Person(name: String, surname: String, age: Int)
defined class Person

scala> val a = Person("Name", "Surname", 25)
a: Person = Person(Name,Surname,25)

A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library. Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be de-converted to get the original object back. The reconstructed object will behave the same way it did before conversion. This allows one to store arbitrary objects in files, for example. There is a pickling library available for Scala as well. You can, of course, do serialization in several different ways (for example, using the powerful support for XML available in Scala).
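As an aside, here is a minimal sketch of one such alternative that is not taken from the original text: standard Java serialization, with the resulting bytes shipped using the MPI.BYTE datatype from the table above. The Point case class and the fixed 4096-byte receive buffer are assumptions made purely for this illustration.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import mpi._

case class Point(x: Double, y: Double)  // hypothetical payload type for this sketch

object JavaSerializationSketch {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me = MPI.COMM_WORLD.Rank()
    if (me == 0) {
      // Serialize the object into a byte array (case classes are Serializable)...
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(Point(1.5, -2.0))
      oos.close()
      val bytes = bos.toByteArray
      // ...and send the raw bytes to rank 1.
      MPI.COMM_WORLD.Send(bytes, 0, bytes.length, MPI.BYTE, 1, 0)
    } else if (me == 1) {
      // Assumed upper bound on the message size, just for this sketch.
      val buf = new Array[Byte](4096)
      MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, 0, 0)
      val ois = new ObjectInputStream(new ByteArrayInputStream(buf))
      val p = ois.readObject().asInstanceOf[Point]
      println("Received " + p)
    }
    MPI.Finalize()
  }
}

For the rest of this section, though, we stick with pickling, which avoids this kind of manual stream handling.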
We will use the pickling library that is available from the following website for this example: https://github.com/scala/pickling

You can install it by adding the following line to your build.sbt file:

libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"

After doing that, use the following import statements to enable easy pickling in your projects:

scala> import scala.pickling.Defaults._
import scala.pickling.Defaults._

scala> import scala.pickling.json._
import scala.pickling.json._

Here, you can see how you can then easily use this library to pickle and unpickle arbitrary objects without the use of annoying boilerplate code:

scala> val pklA = a.pickle
pklA: pickling.json.pickleFormat.PickleType = JSONPickle({
  "$type": "Person",
  "name": "Name",
  "surname": "Surname",
  "age": 25
})

scala> val unpklA = pklA.unpickle[Person]
unpklA: Person = Person(Name,Surname,25)

Let's see how this would work in an application using MPJ Express for message passing. A program using pickling to send a case class instance in a message is given here:

import mpi._
import scala.pickling.Defaults._
import scala.pickling.json._

case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string:

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
      val pkl = obj.pickle.value.toCharArray
      MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0)

In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of said JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array. This is done using the toCharArray method:

    } else if (me == 1) {
      val buf = new Array[Char](1000)
      MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0)
      val msg = buf.mkString
      val obj = msg.unpickle[ArbitraryObject]

On the receiving end, we get the raw char array, convert it back to a string using the mkString method, and then unpickle it using unpickle[T]. This returns an instance of the case class that we can use like any other instance of a case class. In terms of functionality, it is the same object that was sent to us:

      println(msg)
      println(obj.c)
    }
    MPI.Finalize()
  }
}

The following is the result of running the preceding program. It prints out the JSON representation of our object, and also shows that we can access the attributes of said object by printing the c attribute:

MPJ Express (0.44) is started in the multicore configuration
{
  "$type": "ArbitraryObject",
  "a": [
    1.0,
    2.0,
    3.0
  ],
  "b": [
    1,
    2,
    3
  ],
  "c": "Hello"
}
Hello

You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing this. As mentioned previously, an example of another way is to use an XML representation. XML support is strong in Scala, and you can use it to serialize objects as well. This will usually require you to add some boilerplate code to your program to serialize to XML. The method discussed earlier has the advantage of requiring no boilerplate code.
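One practical wrinkle: the fixed 1000-character receive buffer above is fine for a demo, but a longer pickled string would be truncated. A simple workaround, sketched below as an assumption-laden variation on the preceding program (same ArbitraryObject case class, same pickling imports), is to send the length of the pickled string as a separate MPI.INT message first, so the receiver can allocate an exactly sized buffer. Using two different tags here also illustrates the tag parameter from the earlier table.

import mpi._
import scala.pickling.Defaults._
import scala.pickling.json._

case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String)

object LengthPrefixedPicklingSketch {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me = MPI.COMM_WORLD.Rank()
    if (me == 0) {
      val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello")
      val pkl = obj.pickle.value.toCharArray
      // Message 1 (tag 0): how many characters the receiver should expect.
      MPI.COMM_WORLD.Send(Array(pkl.length), 0, 1, MPI.INT, 1, 0)
      // Message 2 (tag 1): the pickled payload itself.
      MPI.COMM_WORLD.Send(pkl, 0, pkl.length, MPI.CHAR, 1, 1)
    } else if (me == 1) {
      val len = Array(0)
      MPI.COMM_WORLD.Recv(len, 0, 1, MPI.INT, 0, 0)
      // Allocate a buffer of exactly the announced length, then receive into it.
      val buf = new Array[Char](len(0))
      MPI.COMM_WORLD.Recv(buf, 0, len(0), MPI.CHAR, 0, 1)
      println(buf.mkString.unpickle[ArbitraryObject].c)
    }
    MPI.Finalize()
  }
}

The receives are posted in the same order as the sends, so the blocking semantics described earlier still hold and no deadlock can occur between the two messages.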
Non-blocking communication

So far, we have examined only blocking (or synchronous) communication between two processes. This means that a process is blocked (its execution is halted) until the Send or Recv method has completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful, otherwise deadlocks may occur. Deadlocks are situations where processes wait for each other to release a resource first; the Mexican standoff and the dining philosophers problem are famous illustrations of deadlock. The point is that, if you are unlucky, you may end up with a program that is seemingly stuck and you don't know why. Using non-blocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures for the primary methods used in asynchronous communication are given here:

Request Isend(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag)

Isend works similarly to its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding receive) and that it returns a Request object. This object is used to check the status of your send request, block until it is complete if required, and so on:

Request Irecv(java.lang.Object buf, int offset, int count, Datatype datatype, int src, int tag)

Irecv is again the same as Recv, only non-blocking, and it returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example:

import mpi._

object MPJTest {
  def main(args: Array[String]) {
    MPI.Init(args)
    val me: Int = MPI.COMM_WORLD.Rank()
    val size: Int = MPI.COMM_WORLD.Size()
    if (me == 0) {
      val requests = for (i <- 0 until 10) yield {
        val buf = Array(i * i)
        MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0)
      }
    } else if (me == 1) {
      for (i <- 0 until 10) {
        Thread.sleep(1000)
        val buf = Array[Int](0)
        val request = MPI.COMM_WORLD.Irecv(buf, 0, 1, MPI.INT, 0, 0)
        request.Wait()
        println("RECEIVED: " + buf(0))
      }
    }
    MPI.Finalize()
  }
}

This is a very simplistic example used simply to demonstrate the basics of using the asynchronous message passing methods. First, the process with rank 0 will send 10 messages to the process with rank 1 using Isend. Since Isend does not block, the loop will finish quickly and the messages it sent will be buffered until they are retrieved using Irecv. The second process (the one with rank 1) will wait for one second before retrieving each message. This demonstrates the asynchronous nature of these methods: the messages sit in the buffer waiting to be retrieved, so Irecv can be used at your leisure when convenient. The Wait() method of the Request object it returns has to be used to retrieve the results; Wait() blocks until the message is successfully received from the buffer.

Summary

Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries such as MPI, which allow you to pass data between processes running on different machines in an efficient manner. In this article, you learned how to use MPJ Express—an MPI-like library for the JVM. We saw how to carry out process-to-process communication as well as collective communication. The most important MPJ Express primitives were covered, and example programs using them were given.
Market Basket Analysis

Packt
12 Apr 2016
17 min read
In this article by Boštjan Kaluža, author of the book Machine Learning in Java, we will discuss affinity analysis, which is the heart of Market Basket Analysis (MBA). It can discover co-occurrence relationships among activities performed by specific users or groups. In retail, affinity analysis can help you understand the purchasing behavior of customers. These insights can drive revenue through smart cross-selling and upselling strategies and can assist you in developing loyalty programs, sales promotions, and discount plans. In this article, we will look into the following topics:

Market basket analysis
Association rule learning
Other applications in various domains

First, we will revise the core association rule learning concepts and algorithms, such as support, lift, the Apriori algorithm, and the FP-growth algorithm. Next, we will use Weka to perform our first affinity analysis on a supermarket dataset and study how to interpret the resulting rules. We will conclude the article by analyzing how association rule learning can be applied in other domains, such as IT Operations Analytics, medicine, and others. (For more resources related to this topic, see here.)

Market basket analysis

Since the introduction of electronic point-of-sale systems, retailers have been collecting an incredible amount of data. To leverage this data in order to produce business value, they first developed a way to consolidate and aggregate the data to understand the basics of the business. What are they selling? How many units are moving? What is the sales amount? Recently, the focus has shifted to the lowest level of granularity—the market basket transaction. At this level of detail, retailers have direct visibility into the market basket of each customer who shopped at their store, understanding not only the quantity of the purchased items in that particular basket, but also how these items were bought in conjunction with each other. This can be used to drive decisions about how to differentiate store assortment and merchandise, as well as to effectively combine offers of multiple products, within and across categories, to drive higher sales and profits. These decisions can be implemented across an entire retail chain, by channel, at the local store level, and even for a specific customer with so-called personalized marketing, where a unique product offering is made for each customer.
MBA covers a wide variety of analyses:

Item affinity: This defines the likelihood of two (or more) items being purchased together.
Identification of driver items: This enables the identification of the items that drive people to the store and always need to be in stock.
Trip classification: This analyzes the content of the basket and classifies the shopping trip into a category: weekly grocery trip, special occasion, and so on.
Store-to-store comparison: Understanding the number of baskets allows any metric to be divided by the total number of baskets, effectively creating a convenient and easy way to compare stores with different characteristics (units sold per customer, revenue per transaction, number of items per basket, and so on).
Revenue optimization: This helps in determining the magic price points for a store, increasing the size and value of the market basket.
Marketing: This helps in identifying more profitable advertising and promotions, targeting offers more precisely in order to improve ROI, generating better loyalty card promotions with longitudinal analysis, and attracting more traffic to the store.
Operations optimization: This helps in matching the inventory to the requirement by customizing the store and assortment to trade area demographics, and in optimizing the store layout.

Predictive models help retailers to direct the right offer to the right customer segments/profiles, as well as to gain an understanding of what is valid for which customer, predict the probability score of customers responding to an offer, and understand the customer value gained from offer acceptance.

Affinity analysis

Affinity analysis is used to determine the likelihood that a set of items will be bought together. In retail, there are natural product affinities; for example, it is very typical for people who buy hamburger patties to buy hamburger rolls, along with ketchup, mustard, tomatoes, and other items that make up the burger experience. While some product affinities might seem trivial, others are not very obvious. A classic example is toothpaste and tuna. It seems that people who eat tuna are more prone to brush their teeth right after finishing their meal. So, why is it important for retailers to get a good grasp of product affinities? This information is critical to appropriately plan promotions, as reducing the price of some items may cause a spike in related high-affinity items without the need to further promote these related items. In the following section, we'll look into the algorithms for association rule learning: Apriori and FP-growth.

Association rule learning

Association rule learning has been a popular approach for discovering interesting relations between items in large databases. It is most commonly applied in retail for discovering regularities between products. Association rule learning approaches find patterns as interesting strong rules in the database using different measures of interestingness. For example, the following rule would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat:

{onions, potatoes} -> {burger}

Another classic story, probably told in every machine learning class, is the beer and diapers story. An analysis of supermarket shoppers' behavior showed that customers, presumably young men, who buy diapers tend also to buy beer.
It immediately became a popular example of how an unexpected association rule might be found in everyday data; however, there are varying opinions as to how much of the story is true. Daniel Powers says (DSS News, 2002):

In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves.

In addition to the preceding example from MBA, association rules are today employed in many application areas, including web usage mining, intrusion detection, continuous production, and bioinformatics. We'll take a closer look at these areas later in this article.

Basic concepts

Before we dive into the algorithms, let's first review the basic concepts.

Database of transactions

First, there is no class value, as this is not required for learning association rules. Next, the dataset is presented as a transactional table, where each supermarket item corresponds to a binary attribute. Hence, the feature vector could be extremely large. Consider the following example. Suppose we have four receipts, each corresponding to a purchasing transaction. To write these receipts in the form of a transactional database, we first identify all the possible items that appear in the receipts. These items are onions, potatoes, burger, beer, and diapers. Each purchase, that is, each transaction, is presented in a row, with a 1 if an item was purchased within the transaction and a 0 otherwise, as shown in the following table:

Transaction ID   Onions   Potatoes   Burger   Beer   Diapers
1                0        1          1        0      0
2                1        1          1        1      0
3                0        0          0        1      1
4                1        0          1        1      0

This example is really small. In practical applications, the dataset often contains thousands or millions of transactions, which allows the learning algorithm to discover statistically significant patterns.

Itemset and rule

An itemset is simply a set of items, for example, {onions, potatoes, burger}. A rule consists of two itemsets, X and Y, in the format X -> Y. It indicates a pattern that when the X itemset is observed, Y is also observed. To select interesting rules, various measures of significance can be used.

Support

Support, for an itemset, is defined as the proportion of transactions that contain the itemset. The {potatoes, burger} itemset in the previous table has support of 0.5, as it occurs in 50% of the transactions (2 out of 4): supp({potatoes, burger}) = 2/4 = 0.5. Intuitively, it indicates the share of transactions that support the pattern.

Confidence

Confidence of a rule indicates its accuracy. It is defined as conf(X -> Y) = supp(X U Y) / supp(X). For example, the {onions, burger} -> {beer} rule has confidence 0.5/0.5 = 1.0 in the previous table, which means that 100% of the time when onions and burger are bought together, beer is bought as well.
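To make these two measures concrete, here is a small Scala sketch (not from the original text, which uses Java and Weka later on) that computes support and confidence directly over the four-transaction table above, with transactions represented as sets of item names.

object SupportConfidenceSketch {
  // The four transactions from the table, written as item sets.
  val transactions: Seq[Set[String]] = Seq(
    Set("potatoes", "burger"),
    Set("onions", "potatoes", "burger", "beer"),
    Set("beer", "diapers"),
    Set("onions", "burger", "beer")
  )

  // Proportion of transactions that contain every item of the itemset.
  def support(itemset: Set[String]): Double =
    transactions.count(t => itemset.subsetOf(t)).toDouble / transactions.size

  // conf(X -> Y) = supp(X union Y) / supp(X)
  def confidence(x: Set[String], y: Set[String]): Double =
    support(x union y) / support(x)

  def main(args: Array[String]) {
    println(support(Set("potatoes", "burger")))               // 0.5
    println(confidence(Set("onions", "burger"), Set("beer"))) // 1.0
  }
}

The algorithms described next, Apriori and FP-growth, are essentially efficient ways of finding the itemsets and rules whose support and confidence exceed chosen thresholds, without brute-force scanning every candidate itemset the way this toy helper implicitly would.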
Apriori algorithm

Apriori is a classic algorithm used for frequent pattern mining and association rule learning over transactional databases. By identifying the frequent individual items in a database and extending them to larger itemsets, Apriori can determine the association rules that highlight general trends in a database.

The Apriori algorithm constructs a set of itemsets, for example, itemset1 = {Item A, Item B}, and calculates support, which counts the number of occurrences in the database. Apriori then uses a bottom-up approach, where frequent itemsets are extended one item at a time, and it works by eliminating the largest sets as candidates by first looking at the smaller sets and recognizing that a large set cannot be frequent unless all its subsets are. The algorithm terminates when no further successful extensions are found.

Although the Apriori algorithm is an important milestone in machine learning, it suffers from a number of inefficiencies and tradeoffs. In the following section, we'll look into the more recent FP-growth technique.

FP-growth algorithm

FP-growth, where FP stands for frequent pattern, represents the transaction database as a prefix tree. First, the algorithm counts the occurrences of items in the dataset. In the second pass, it builds a prefix tree, an ordered tree data structure commonly used to store a string. An example of a prefix tree based on the previous example is shown in the following diagram:

If many transactions share the most frequent items, the prefix tree provides high compression close to the tree root. Large itemsets are grown directly, instead of generating candidate items and testing them against the entire database. Growth starts at the bottom of the tree, by finding all the itemsets matching minimal support and confidence. Once the recursive process has completed, all large itemsets with minimum coverage have been found, and association rule creation begins.

FP-growth has several advantages. First, it constructs an FP-tree, which encodes the original dataset in a substantially more compact representation. Second, it efficiently builds frequent itemsets, leveraging the FP-tree structure and a divide-and-conquer strategy.

The supermarket dataset

The supermarket dataset, located in datasets/chap5/supermarket.arff, describes the shopping habits of supermarket customers. Most of the attributes stand for a particular item group, for example, dairy foods, beef, potatoes; or a department, for example, department 79, department 81, and so on. The value is t if the customer bought an item and is missing otherwise. There is one instance per customer. The dataset contains no class attribute, as this is not required to learn association rules. A sample of the data is shown in the following table:

Discover patterns

To discover shopping patterns, we will use the two algorithms that we looked into before: Apriori and FP-growth.

Apriori

We will use the Apriori algorithm as implemented in Weka.
It iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.associations.Apriori;

First, we will load the supermarket dataset:

Instances data = new Instances(
  new BufferedReader(
    new FileReader("datasets/chap5/supermarket.arff")));

Next, we will initialize an Apriori instance and call the buildAssociations(Instances) function to start frequent pattern mining, as follows:

Apriori model = new Apriori();
model.buildAssociations(data);

Finally, we can output the discovered itemsets and rules, as shown in the following code:

System.out.println(model);

The output is as follows:

Apriori
=======
Minimum support: 0.15 (694 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 44
Size of set of large itemsets L(2): 380
Size of set of large itemsets L(3): 910
Size of set of large itemsets L(4): 633
Size of set of large itemsets L(5): 105
Size of set of large itemsets L(6): 1
Best rules found:
1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 <conf:(0.92)> lift:(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 <conf:(0.92)> lift:(1.27) lev:(0.03) [150] conv:(3.27)
...

The algorithm outputs the ten best rules according to confidence. Let's look at the first rule and interpret the output, as follows:

biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:(1.27) lev:(0.03) [155] conv:(3.35)

It says that when biscuits, frozen foods, and fruit are bought together and the total purchase price is high, it is also very likely that bread and cake are purchased as well. The {biscuits, frozen foods, fruit, total high} itemset appears in 788 transactions, while bread and cake appear in 723 of those transactions. The confidence of this rule is 0.92, meaning that the rule holds true in 92% of the transactions where the {biscuits, frozen foods, fruit, total high} itemset is present.

The output also reports additional measures such as lift, leverage, and conviction, which estimate the accuracy against our initial assumptions; for example, the 3.35 conviction value indicates that the rule would be incorrect 3.35 times as often if the association was purely random chance. Lift measures how many times more often X and Y occur together than expected if they were statistically independent (lift = 1). For example, a lift of 2.16 in an X -> Y rule would mean that the probability of Y given X is 2.16 times greater than the baseline probability of Y.

FP-growth

Now, let's try to get the same results with the more efficient FP-growth algorithm. FP-growth is also implemented in the weka.associations package:

import weka.associations.FPGrowth;

FP-growth is initialized similarly to what we did earlier:

FPGrowth fpgModel = new FPGrowth();
fpgModel.buildAssociations(data);
System.out.println(fpgModel);

The output reveals that FP-growth discovered 16 rules:

FPGrowth found 16 rules (displaying top 10)
1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723 <conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696 <conf:(0.92)> lift:(1.27) lev:(0.03) conv:(3.28)
...
We can observe that FP-growth found the same set of rules as Apriori; however, the time required to process larger datasets can be significantly shorter.

Other applications in various areas

We looked into affinity analysis to demystify shopping behavior patterns in supermarkets. Although the roots of association rule learning are in analyzing point-of-sale transactions, the technique can be applied outside the retail industry to find relationships among other types of baskets. The notion of a basket can easily be extended to services and products, for example, to analyze items purchased using a credit card, such as rental cars and hotel rooms, and to analyze information on value-added services purchased by telecom customers (call waiting, call forwarding, DSL, speed call, and so on), which can help the operators determine ways to improve their bundling of service packages. Additionally, we will look into the following examples of potential cross-industry applications:

Medical diagnosis
Protein sequences
Census data
Customer relationship management
IT Operations Analytics

Medical diagnosis

Applying association rules in medical diagnosis can assist physicians while treating patients. The general problem of the induction of reliable diagnostic rules is hard as, theoretically, no induction process can guarantee the correctness of induced hypotheses by itself. Practically, diagnosis is not an easy process, as it involves unreliable diagnostic tests and the presence of noise in training examples. Nevertheless, association rules can be used to identify likely symptoms appearing together. A transaction, in this case, corresponds to a medical case, while symptoms correspond to items. When a patient is treated, a list of symptoms is recorded as one transaction.

Protein sequences

A lot of research has gone into understanding the composition and nature of proteins, yet many things remain to be understood satisfactorily. It is now generally believed that the amino-acid sequences of proteins are not random. With association rules, it is possible to identify associations between different amino acids that are present in a protein. A protein is a sequence made up of 20 types of amino acids. Each protein has a unique three-dimensional structure, which depends on the amino-acid sequence; a slight change in the sequence may change the functioning of the protein. To apply association rules, a protein corresponds to a transaction, while amino acids, their two-grams, and structure correspond to the items. Such association rules are desirable for enhancing our understanding of protein composition and hold the potential to give clues regarding the global interactions amongst some particular sets of amino acids occurring in proteins. Knowledge of these association rules or constraints is highly desirable for the synthesis of artificial proteins.

Census data

Censuses make a huge variety of general statistical information about society available to both researchers and the general public. The information related to population and economic censuses can be used in planning public services (education, health, transport, and funds) as well as in business (for setting up new factories, shopping malls, or banks, and even for marketing particular products). To discover frequent patterns, each statistical area (for example, municipality, city, or neighborhood) corresponds to a transaction, and the collected indicators correspond to the items.
Customer relationship management

Association rules can reinforce the knowledge management process and allow marketing personnel to know their customers well in order to provide better quality services. For example, association rules can be applied to detect changes in customer behavior at different time snapshots from customer profiles and sales data. The basic idea is to discover changes between two datasets and generate rules from each dataset to carry out rule matching.

IT Operations Analytics

Because it is based on records of a large number of transactions, association rule learning is well-suited to the data that is routinely collected in day-to-day IT operations, enabling IT Operations Analytics tools to detect frequent patterns and identify critical changes. IT specialists need to see the big picture and understand, for example, how a problem on a database could impact an application server. For a specific day, IT operations may take in a variety of alerts, presenting them in a transactional database. Using an association rule learning algorithm, IT Operations Analytics tools can correlate and detect the frequent patterns of alerts appearing together. This can lead to a better understanding of how one component impacts another.

With identified alert patterns, it is possible to apply predictive analytics. For example, suppose a particular database server hosts a web application and suddenly an alert about the database is triggered. By looking into the frequent patterns identified by an association rule learning algorithm, the IT staff know that they need to take action before the web application is impacted. Association rule learning can also discover alert events originating from the same IT event. For example, every time a new user is added, six changes in the Windows operating system are detected. Next, in Application Portfolio Management (APM), IT may face multiple alerts showing that the transaction time in a database is high. If all these issues originate from the same source (such as getting hundreds of alerts about changes that are all due to a Windows update), this frequent pattern mining can help to quickly cut through the number of alerts, allowing the IT operators to focus on truly critical changes.

Summary

In this article, you learned how to leverage association rule learning on transactional datasets to gain insight into frequent patterns. We performed an affinity analysis in Weka and learned that the hard work lies in the analysis of results—careful attention is required when interpreting rules, as association (that is, correlation) is not the same as causation.
Getting Started with D3, ES2016, and Node.js

Packt
08 Apr 2016
25 min read
In this article by Ændrew Rininsland, author of the book Learning d3.js Data Visualization, Second Edition, we'll lay the foundations of what you'll need to run all the examples in the article. I'll explain how you can start writing ECMAScript 2016 (ES2016) today—which is the latest and most advanced version of JavaScript—and show you how to use Babel to transpile it to ES5, allowing your modern JavaScript to be run on any browser. We'll then cover the basics of using D3 to render a basic chart. (For more resources related to this topic, see here.) What is D3.js? D3 stands for Data-Driven Documents, and it is being developed by Mike Bostock and the D3 community since 2011. The successor to Bostock's earlier Protovis library, it allows pixel-perfect rendering of data by abstracting the calculation of things such as scales and axes into an easy-to-use domain-specific language (DSL). D3's idioms should be immediately familiar to anyone with experience of using the massively popular jQuery JavaScript library. Much like jQuery, in D3, you operate on elements by selecting them and then manipulating via a chain of modifier functions. Especially within the context of data visualization, this declarative approach makes using it easier and more enjoyable than a lot of other tools out there. The official website, https://d3js.org/, features many great examples that show off the power of D3, but understanding them is tricky at best. After finishing this article, you should be able to understand D3 well enough to figure out the examples. If you want to follow the development of D3 more closely, check out the source code hosted on GitHub at https://github.com/mbostock/d3. The fine-grained control and its elegance make D3 one of the most—if not the most—powerful open source visualization libraries out there. This also means that it's not very suitable for simple jobs such as drawing a line chart or two—in that case you might want to use a library designed for charting. Many use D3 internally anyway. One such interface is Axis, an open source app that I've written. It allows users to easily build basic line, pie, area, and bar charts without writing any code. Try it out at use.axisjs.org. As a data manipulation library, D3 is based on the principles of functional programming, which is probably where a lot of confusion stems from. Unfortunately, functional programming goes beyond the scope of this article, but I'll explain all the relevant bits to make sure that everyone's on the same page. What’s ES2016? One of the main changes in this edition is the emphasis on ES2016, the most modern version of JavaScript currently available. Formerly known as ES6 (Harmony), it pushes the JavaScript language's features forward significantly, allowing for new usage patterns that simplify code readability and increase expressiveness. If you've written JavaScript before and the examples in this article look pretty confusing, it means you're probably familiar with the older, more common ES5 syntax. But don't sweat! It really doesn't take too long to get the hang of the new syntax, and I will try to explain the new language features as we encounter them. Although it might seem a somewhat steep learning curve at the start, by the end, you'll have improved your ability to write code quite substantially and will be on the cutting edge of contemporary JavaScript development. 
For a really good rundown of all the new toys you have with ES2016, check out this nice guide by the folks at Babel.js, which we will use extensively throughout this article: https://babeljs.io/docs/learn-es2015/. Before I go any further, let me clear some confusion about what ES2016 actually is. Initially, the ECMAScript (or ES for short) standards were incremented by cardinal numbers, for instance, ES4, ES5, ES6, and ES7. However, with ES6, they changed this so that a new standard is released every year in order to keep pace with modern development trends, and thus we refer to the year (2016) now. The big release was ES2015, which more or less maps to ES6. ES2016 is scheduled for ratification in June 2016 and builds on the previous year's standard, while adding a few fixes and two new features. You don't really need to worry about compatibility, because we use Babel.js to transpile everything down to ES5 anyway, so it runs the same in Node.js and in the browser. For the sake of simplicity, I will use the word "ES2016" throughout in a general sense to refer to all modern JavaScript, but I'm not referring to the ECMAScript 2016 specification itself.

Getting started with Node and Git on the command line

I will try not to be too opinionated in this article about which editor or operating system you should use to work through it (though I am using Atom on Mac OS X), but you are going to need a few prerequisites to start. The first is Node.js. Node is widely used for web development nowadays, and it's actually just JavaScript that can be run on the command line. If you're on Windows or Mac OS X without Homebrew, use the installer at https://nodejs.org/en/. If you're on Mac OS X and are using Homebrew, I would recommend installing "n" instead, which allows you to easily switch between versions of Node:

$ brew install n
$ n latest

Regardless of how you do it, once you finish, verify by running the following lines:

$ node --version
$ npm --version

If it displays the versions of node and npm (I'm using 5.6.0 and 3.6.0, respectively), it means you're good to go. If it says something similar to Command not found, double-check whether you've installed everything correctly, and verify that Node.js is in your $PATH environment variable.

Next, you'll want to clone the article's repository from GitHub. Change to your project directory and type this:

$ git clone https://github.com/aendrew/learning-d3
$ cd learning-d3

This will clone the development environment and all the samples into the learning-d3/ directory as well as switch you into it. Another option is to fork the repository on GitHub and then clone your fork instead of mine as was just shown. This will allow you to easily publish your work on the cloud, enabling you to more easily seek support, display finished projects on GitHub Pages, and even submit suggestions and amendments to the parent project. This will help us improve this article for future editions. To do this, fork aendrew/learning-d3 and replace aendrew in the preceding code snippet with your GitHub username.

Each chapter of this book is in a separate branch. To switch between them, type the following command:

$ git checkout chapter1

Replace 1 with whichever chapter you want the examples for. Stay at master for now, though. To get back to it, type this line:

$ git stash save && git checkout master

The master branch is where you'll do a lot of your coding as you work through this article.
It includes a prebuilt package.json file (used by npm to manage dependencies), which we'll use to aid our development over the course of this article. There's also a webpack.config.js file, which tells the build system where to put things, and there are a few other sundry config files. We still need to install our dependencies, so let's do that now: $ npm install All of the source code that you'll be working on is in the src/ folder. You'll notice it contains an index.html and an index.js file; almost always, we'll be working in index.js, as index.html is just a minimal container to display our work in: <!DOCTYPE html> <div id="chart"></div> <script src="/assets/bundle.js"></script> To get things rolling, start the development server by typing the following line: $ npm start This starts up the Webpack development server, which will transform our ES2016 JavaScript into backwards-compatible ES5, which can easily be loaded by most browsers. In the preceding HTML, bundle.js is the compiled code produced by Webpack. Now point Chrome to localhost:8080 and fire up the developer console (Ctrl +Shift + J for Linux and Windows and Option + Command + J for Mac). You should see a blank website and a blank JavaScript console with a Command Prompt waiting for some code: A quick Chrome Developer Tools primer Chrome Developer Tools are indispensable to web development. Most modern browsers have something similar, but to keep this article shorter, we'll stick to Chrome here for the sake of simplicity. Feel free to use a different browser. Firefox's Developer Edition is particularly nice. We are mostly going to use the Elements and Console tabs, Elements to inspect the DOM and Console to play with JavaScript code and look for any problems. The other six tabs come in handy for large projects: The Network tab will let you know how long files are taking to load and help you inspect the Ajax requests. The Profiles tab will help you profile JavaScript for performance. The Resources tab is good for inspecting client-side data. Timeline and Audits are useful when you have a global variable that is leaking memory and you're trying to work out exactly why your library is suddenly causing Chrome to use 500 MB of RAM. While I've used these in D3 development, they're probably more useful when building large web applications with frameworks such as React and Angular. One of the favorites from Developer Tools is the CSS inspector at the right-hand side of the Elements tab. It can tell you what CSS rules are affecting the styling of an element, which is very good for hunting rogue rules that are messing things up. You can also edit the CSS and immediately see the results, as follows: The obligatory bar chart example No introductory chapter on D3 would be complete without a basic bar chart example. They are to D3 as "Hello World" is to everything else, and 90 percent of all data storytelling can be done in its simplest form with an intelligent bar or line chart. For a good example of this, look at the kinds of graphics The Economist includes with their articles—they frequently summarize the entire piece with a simple line chart. Coming from a newsroom development background, many of my examples will be related to some degree to current events or possible topics worth visualizing with data. The news development community has been really instrumental in creating the environment for D3 to flourish, and it's increasingly important for aspiring journalists to have proficiency in tools such as D3. 
The first dataset that we'll use is UNHCR's regional population data. The documentation for this endpoint is at data.unhcr.org/wiki/index.php/Get-population-regional.html. We'll create a bar for each population of displaced people. The first step is to get a basic container set up, which we can then populate with all of our delicious new ES2016 code. At the top of index.js, put the following code: export class BasicChart {   constructor(data) {     var d3 = require('d3'); // Require D3 via Webpack     this.data = data;     this.svg = d3.select('div#chart').append('svg');   } } var chart = new BasicChart(); If you open this in your browser, you'll get the following error on your console: Uncaught Error: Cannot find module "d3" This is because we haven't installed it yet. You’ll notice on line 3 of the preceding code that we import D3 by requiring it. If you've used D3 before, you might be more familiar with it attached to the window global object. This is essentially the same as including a script tag that references D3 in your HTML document, the only difference being that Webpack uses the Node version and compiles it into your bundle.js. To install D3, you use npm. In your project directory, type the following line: $ npm install d3 --save This will pull the latest version of D3 from npmjs.org to the node_modules directory and save it in your package.json file. The package.json file is really useful; instead of keeping all your dependencies inside of your Git repository, you can easily redownload them all just by typing this line: $ npm install If you go back to your browser and switch quickly to the Elements tab, you'll notice a new SVG element as a child of #chart. Go back to index.js. Let's add a bit more to the constructor before I explain what's going on here: export class BasicChart {   constructor(data) {     var d3 = require('d3'); // Require D3 via Webpack     this.data = data;     this.svg = d3.select('div#chart').append('svg');     this.margin = {       left: 30,       top: 30,       right: 0,       bottom: 0     };     this.svg.attr('width', window.innerWidth);     this.svg.attr('height', window.innerHeight);     this.width = window.innerWidth - this.margin.left - this.margin.right;     this.height = window.innerHeight - this.margin.top - this.margin.bottom;     this.chart = this.svg.append('g')       .attr('width', this.width)       .attr('height', this.height) .attr('transform', `translate(${this.margin.left}, ${this.margin.top})`);   } } Okay, here we have the most basic container you'll ever make. All it does is attach data to the class: this.data = data; This selects the #chart element on the page, appending an SVG element and assigning it to another class property: this.svg = d3.select('div#chart').append('svg'); Then it creates a third class property, chart, as a group that's offset by the margins: this.width = window.innerWidth - this.margin.left - this.margin.right;   this.height = window.innerHeight - this.margin.top - this.margin.bottom;   this.chart = svg.append('g')     .attr('width', this.width)     .attr('height', this.height)     .attr('transform', `translate(${this.margin.left}, ${this.margin.top})`); Notice the snazzy new ES2016 string interpolation syntax—using `backticks`, you can then echo out a variable by enclosing it in ${ and }. No more concatenating! The preceding code is not really all that interesting, but wouldn't it be awesome if you never had to type that out again? Well! 
Because you're the total boss and are learning ES2016 like all the cool kids, you won't ever have to. Let's create our first child class! We're done with BasicChart for the moment. Now, we want to create our actual bar chart class: export class BasicBarChart extends BasicChart {   constructor(data) {     super(data);   } } This is probably very confusing if you're new to ES6. First off, we're extending BasicChart, which means all the class properties that we just defined a minute ago are now available for our BasicBarChart child class. However, if we instantiate a new instance of this, we get the constructor function in our child class. How do we attach the data object so that it's available for both BasicChart and BasicBarChart? The answer is super(), which merely runs the constructor function of the parent class. In other words, even though we don't assign data to this.data as we did previously, it will still be available there when we need it. This is because it was assigned via the parent constructor through the use of super(). We're almost at the point of getting some bars onto that graph; hold tight! But first, we need to define our scales, which decide how D3 maps data to pixel values. Add this code to the constructor of BasicBarChart: let x = d3.scale.ordinal()   .rangeRoundBands([this.margin.left, this.width - this.margin.right], 0.1); The x scale is now a function that maps inputs from an as-yet-unknown domain (we don't have the data yet) to a range of values between this.margin.left and this.width - this.margin.right, that is, between 30 and the width of your viewport minus the right margin, with some spacing defined by the 0.1 value. Because it's an ordinal scale, the domain will have to be discrete rather than continuous. The rangeRoundBands means the range will be split into bands that are guaranteed to be round numbers. Hoorah! We have fit our first new fancy ES2016 feature! The let is the new var—you can still use var to define variables, but you should use let instead because it's limited in scope to the block, statement, or expression on which it is used. Meanwhile, var is used for more global variables, or variables that you want available regardless of the block scope. For more on this, visit http://mdn.io/let. If you have no idea what I'm talking about here, don't worry. It just means that you should define variables with let because they're more likely to act as you think they should and are less likely to leak into other parts of your code. It will also throw an error if you use it before it's defined, which can help with troubleshooting and preventing sneaky bugs. Still inside the constructor, we define another scale named y: let y = d3.scale.linear().range([this.height, this.margin.bottom]); Similarly, the y scale is going to map a currently unknown linear domain to a range between this.height and this.margin.bottom, that is, your viewport height and 30. Inverting the range is important because D3.js considers the top of a graph to be y=0. If ever you find yourself trying to troubleshoot why a D3 chart is upside down, try switching the range values. Now, we define our axes. Add this just after the preceding line, inside the constructor: let xAxis = d3.svg.axis().scale(x).orient('bottom'); let yAxis = d3.svg.axis().scale(y).orient('left'); We've told each axis what scale to use when placing ticks and which side of the axis to put the labels on. D3 will automatically decide how many ticks to display, where they should go, and how to label them. Now the fun begins! 
We're going to load in our data using Node-style require statements this time around. This works because our sample dataset is in JSON and it's just a file in our repository. For now, this will suffice for our purposes—no callbacks, promises, or observables necessary! Put this at the bottom of the constructor: let data = require('./data/chapter1.json'); Once or maybe twice in your life, the keys in your dataset will match perfectly and you won't need to transform any data. This almost never happens, and today is not one of those times. We're going to use basic JavaScript array operations to filter out invalid data and map that data into a format that's easier for us to work with: let totalNumbers = data.filter((obj) => { return obj.population.length;   })   .map(     (obj) => {       return {         name: obj.name,         population: Number(obj.population[0].value)       };     }   ); This runs the data that we just imported through Array.prototype.filter, whereby any elements without a population array are stripped out. The resultant collection is then passed through Array.prototype.map, which creates an array of objects, each comprised of a name and a population value. We've turned our data into a list of two-value dictionaries. Let's now supply the data to our BasicBarChart class and instantiate it for the first time. Consider the line that says the following: var chart = new BasicChart(); Replace it with this line: var myChart = new BasicBarChart(totalNumbers); The myChart.data will now equal totalNumbers! Go back to the constructor in the BasicBarChart class. Remember the x and y scales from before? We can finally give them a domain and make them useful. Again, a scale is a simply a function that maps an input range to an output domain: x.domain(data.map((d) => { return d.name })); y.domain([0, d3.max(data, (d) => { return d.population; })]); Hey, there's another ES2016 feature! Instead of typing function() {} endlessly, you can now just put () => {} for anonymous functions. Other than being six keystrokes less, the "fat arrow" doesn't bind the value of this to something else, which can make life a lot easier. For more on this, visit http://mdn.io/Arrow_functions. Since most D3 elements are objects and functions at the same time, we can change the internal state of both scales without assigning the result to anything. The domain of x is a list of discrete values. The domain of y is a range from 0 to the d3.max of our dataset—the largest value. Now we're going to draw the axes on our graph: this.chart.append('g')         .attr('class', 'axis')         .attr('transform', `translate(0, ${this.height})`)         .call(xAxis); We've appended an element called g to the graph, given it the axis CSS class, and moved the element to a place in the bottom-left corner of the graph with the transform attribute. Finally, we call the xAxis function and let D3 handle the rest. 
The drawing of the other axis works exactly the same, but with different arguments: this.chart.append('g')         .attr('class', 'axis')         .attr('transform', `translate(${this.margin.left}, 0)`)         .call(yAxis); Now that our graph is labeled, it's finally time to draw some data: this.chart.selectAll('rect')         .data(data)         .enter()         .append('rect')         .attr('class', 'bar')         .attr('x', (d) => { return x(d.name); })         .attr('width', x.rangeBand())         .attr('y', (d) => { return y(d.population); })         .attr('height', (d) => { return this.height - y(d.population); }); Okay, there's plenty going on here, but this code is saying something very simple. This is what is says: For all rectangles (rect) in the graph, load our data Go through it For each item, append a rect Then define some attributes Ignore the fact that there aren't any rectangles initially; what you're doing is creating a selection that is bound to data and then operating on it. I can understand that it feels a bit weird to operate on non-existent elements (this was personally one of my biggest stumbling blocks when I was learning D3), but it's an idiom that shows its usefulness later on when we start adding and removing elements due to changing data. The x scale helps us calculate the horizontal positions, and rangeBand gives the width of the bar. The y scale calculates vertical positions, and we manually get the height of each bar from y to the bottom. Note that whenever we needed a different value for every element, we defined an attribute as a function (x, y, and height); otherwise, we defined it as a value (width). Keep this in mind when you're tinkering. Let's add some flourish and make each bar grow out of the horizontal axis. Time to dip our toes into animations! Modify the code you just added to resemble the following. I've highlighted the lines that are different: this.chart.selectAll('rect')   .data(data)   .enter()   .append('rect')   .attr('class', 'bar')   .attr('x', (d) => { return x(d.name); })   .attr('width', x.rangeBand())   .attr('y', () => { return y(this.margin.bottom); })   .attr('height', 0)   .transition()     .delay((d, i) => { return i*20; })     .duration(800)     .attr('y', (d) => { return y(d.population); })     .attr('height', (d) => {          return this.height - y(d.population);       }); The difference is that we statically put all bars at the bottom (margin.bottom) and then entered a transition with .transition(). From here on, we define the transition that we want. First, we wanted each bar's transition delayed by 20 milliseconds using i*20. Most D3 callbacks will return the datum (or "whatever data has been bound to this element," which is typically set to d) and the index (or the ordinal number of the item currently being evaluated, which is typically i) while setting the this argument to the currently selected DOM element. Because of this last point, we use the fat arrow—so that we can still use the class this.height property. Otherwise, we'd be trying to find the height property on our SVGRect element, which we're midway to trying to define! This gives the histogram a neat effect, gradually appearing from left to right instead of jumping up at once. Next, we say that we want each animation to last just shy of a second, with .duration(800). At the end, we define the final values for the animated attributes—y and height are the same as in the previous code—and D3 will take care of the rest. 
Save your file and the page should auto-refresh in the background. If everything went according to the plan, you should have a chart that looks like the following: According to this UNHCR data from June 2015, by far the largest number of displaced persons are from Syria. Hey, look at this—we kind of just did some data journalism here! Remember that you can look at the entire code on GitHub at http://github.com/aendrew/learning-d3/tree/chapter1 if you didn't get something similar to the preceding screenshot. We still need to do just a bit more, mainly by using CSS to style the SVG elements. We could have just gone to our HTML file and added CSS, but then that means opening that yucky index.html file. And where's the fun in writing HTML when we're learning some newfangled JavaScript?! First, create an index.css file in your src/ directory: html, body {   padding: 0;   margin: 0; }   .axis path, .axis line {   fill: none;   stroke: #eee;   shape-rendering: crispEdges; }   .axis text {   font-size: 11px; }   .bar {   fill: steelblue; } Then just add the following line to index.js: require('./index.css'); I know. Crazy, right?! No <style> tags needed! It's worth noting that anything involving require is the result of a Webpack loader; in this article, we've used both the CSS/Style and JSON loaders. Although the author of this text is a fan of Webpack, all we're doing is compiling the styles into bundle.js with Webpack instead of requiring them globally via a <style> tag. This is cool because instead of uploading a dozen files when deploying your finished code, you effectively deploy one optimized bundle. You can also scope CSS rules to be particular to when they’re being included and all sorts of other nifty stuff; for more information, refer to github.com/webpack/css-loader#local-scope. Looking at the preceding CSS, you can now see why we added all those classes to our shapes—we can now directly reference them when styling with CSS. We made the axes thin, gave them a light gray color, and used a smaller font for the labels. The bars should be light blue. Save and wait for the page to refresh. We've made our first D3 chart! I recommend fiddling with the values for width, height, and margin inside of BasicChart to get a feel of the power of D3. You'll notice that everything scales and adjusts to any size without you having to change other code. Smashing! Summary In this article, you learned what D3 is and took a glance at the core philosophy behind how it works. You also set up your computer for prototyping of ideas and to play with visualizations. This environment will be assumed throughout the article. We went through a simple example and created an animated histogram using some of the basics of D3. You learned about scales and axes, that the vertical axis is inverted, that any property defined as a function is recalculated for every data point, and that we use a combination of CSS and SVG to make things beautiful. We also did a lot of fancy stuff with ES2016, Babel, and Webpack and got Node.js installed. Go us! Most of all, this article has given you the basic tools so that you can start playing with D3.js on your own. Tinkering is your friend! Don't be afraid to break stuff—you can always reset to a chapter's default state by running $ git reset --soft origin/chapter1, replacing 1 with whichever chapter you're on. Next, we'll be looking at all this a bit more in depth, specifically how the DOM, SVG, and CSS interact with each other. 
This article discussed quite a lot, so if some parts got away from you, don't worry. Resources for Article: Further resources on this subject: An Introduction to Node.js Design Patterns [article] Developing Node.js Web Applications [article] Developing a Basic Site with Node.js and Express [article]
Morphology – Getting Our Feet Wet

Packt
04 Apr 2016
20 min read
In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur, authors of the book Mastering Natural Language Processing with Python, morphology may be defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of the language that has a meaning. In this article, we will discuss stemming and lemmatization, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and a morphological generator using machine learning tools, creating a search engine, and many other concepts. In brief, this article will include the following topics:

Introducing morphology
Creating a stemmer and lemmatizer
Developing a stemmer for non-English languages
Creating a morphological analyzer
Creating a morphological generator
Creating a search engine

(For more resources related to this topic, see here.)

Introducing morphology

Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language, which carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes, since they can exist even without adding affixes. Affixes are referred to as bound morphemes, since they cannot exist in a free form and always occur together with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme; it can exist on its own. The morphemes "un" and "able" are affixes or bound morphemes; they cannot exist in a free form but only together with a stem. There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages, and morphology behaves differently in each of them. Isolating languages are those in which words are merely free morphemes and do not carry any tense (past, present, or future) or number (singular or plural) information. Mandarin Chinese is an example of an isolating language. Agglutinative languages are those in which smaller units combine to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are those in which words can be broken down into simpler units, but these simpler units exhibit different meanings. Latin is an example of an inflecting language. There are morphological processes such as inflection, derivation, semi-affixes, combining forms, and cliticization. An inflection transforms a word into a form that represents person, number, tense, gender, case, aspect, or mood; here, the syntactic category of the token remains the same. In derivation, the syntactic category of the word is also changed. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on.

Understanding stemmers

Stemming may be defined as the process of obtaining a stem from a word by eliminating the affixes from it. For example, for the word "raining", a stemmer would return the root word or stem "rain" by removing the affix "ing". In order to increase the accuracy of information retrieval, search engines mostly use stemming to obtain a stem and store it as an index word. Treating different surface forms that share a stem as the same index term is a kind of query expansion known as conflation; a minimal sketch of this idea appears just below.
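To make the idea of conflation concrete, here is a minimal, hypothetical sketch of a stem-based inverted index; it is not part of the book's code. It only assumes that NLTK is installed, and the toy documents, the index layout, and the search() helper are invented purely for illustration:

from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
documents = {
    1: "it was raining heavily",
    2: "rain is expected tomorrow",
    3: "the sun is shining",
}

# Build an inverted index keyed by stems rather than surface forms.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)

def search(query):
    # A query term matches every document containing any form that shares its stem.
    return sorted(index.get(stemmer.stem(query.lower()), set()))

print(search("rains"))  # -> [1, 2]: 'raining' and 'rain' conflate to the stem 'rain'

Because queries and documents are reduced to the same stems, different surface forms of a word retrieve the same documents, which is exactly the conflation effect described above.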
Martin Porter designed a well-known stemming algorithm known as the Porter Stemming Algorithm. It is basically designed to replace and eliminate some well-known suffixes found in English words. To perform stemming in NLTK, we simply instantiate the PorterStemmer class and then call its stem() method. Let's take a look at the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'

The PorterStemmer class is trained on and has knowledge of many stems and word forms in the English language. The process of stemming takes place in a series of steps and transforms a word into a shorter form that carries the same, or a similar, meaning as the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface. Another stemming algorithm, known as the Lancaster stemming algorithm, was developed at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming. Let's consider the following code, which depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan = LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'

We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a regular expression and eliminates any matching prefix or suffix from a word. Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp = RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer or LancasterStemmer. The SnowballStemmer class is used to perform stemming in 13 languages other than English. To perform stemming with SnowballStemmer, an instance is first created for the language in which stemming needs to be performed, and then stemming is carried out using the stem() method. Consider the following example, which performs stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer = SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer = SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

The nltk.stem.api module contains the StemmerI class, in which the stem() function is declared. Consider the following code present in NLTK, which enables stemming to be performed:

class StemmerI(object):
    """
    An interface that helps to eliminate morphological affixes from
    tokens; the process is known as stemming.
    """

    def stem(self, token):
        """
        Eliminate affixes from the token and return the stem.
        """
        raise NotImplementedError()
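Since all of these stemmers implement the same StemmerI interface, they can be swapped freely. The following small comparison sketch is not from the book; the word list is invented for illustration, and the exact outputs depend on the NLTK version installed:

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# All three classes implement StemmerI, so they expose the same stem() method.
stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball-en': SnowballStemmer('english'),
}

words = ['working', 'happiness', 'generously', 'studies']

for name, stemmer in stemmers.items():
    # Each algorithm makes its own (sometimes aggressive) truncation choices.
    print(name, [stemmer.stem(w) for w in words])

Running a comparison like this on your own corpus is a quick way to decide whether the gentler Porter/Snowball behaviour or the more aggressive Lancaster behaviour suits your retrieval task.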
""" raise NotImplementedError() Here's the code used to perform stemming using multiple stemmers: >>> import nltk >>> from nltk.stem.porter import PorterStemmer >>> from nltk.stem.lancaster import LancasterStemmer >>> from nltk.stem import SnowballStemmer >>> def obtain_tokens(): With open('/home/p/NLTK/sample1.txt') as stem: tok = nltk.word_tokenize(stem.read()) return tokens >>> def stemming(filtered): stem=[] for x in filtered: stem.append(PorterStemmer().stem(x)) return stem >>> if_name_=="_main_": tok= obtain_tokens() >>> print("tokens is %s")%(tok) >>> stem_tokens= stemming(tok) >>> print("After stemming is %s")%stem_tokens >>> res=dict(zip(tok,stem_tokens)) >>> print("{tok:stemmed}=%s")%(result) Understanding lemmatization Lemmatization is the process in which we transform a word into a form that has a different word category. The word formed after lemmatization is entirely different from what it was initially. Consider an example of lemmatization in NLTK: >>> import nltk >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output=WordNetLemmatizer() >>> lemmatizer_output.lemmatize('working') 'working' >>> lemmatizer_output.lemmatize('working',pos='v') 'work' >>> lemmatizer_output.lemmatize('works') 'work' WordNetLemmatizer may be defined as a wrapper around the so-called WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is extracted, then the word is only returned in its original form. For example, for 'works', the lemma that is returned is in the singular form 'work'. This code snippet illustrates the difference between stemming and lemmatization: >>> import nltk >>> from nltk.stem import PorterStemmer >>> stemmer_output=PorterStemmer() >>> stemmer_output.stem('happiness') 'happi' >>> from nltk.stem import WordNetLemmatizer >>> lemmatizer_output.lemmatize('happiness') 'happiness' In the preceding code, 'happiness' is converted to 'happi' by stemming it. Lemmatization can't find the root word for 'happiness', so it returns the word "happiness". Developing a stemmer for non-English languages Polyglot is a software that is used to provide models called morfessor models, which are used to obtain morphemes from tokens. The Morpho project's goal is to create unsupervised data-driven processes. Its focuses on the creation of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing. They are useful in automatic recognition and the creation of language. With the help of the vocabulary dictionaries of polyglot, morfessor models on 50,000 tokens of different languages was used. Here's the code to obtain a language table using a polyglot: from polyglot.downloader import downloader print(downloader.supported_languages_table("morph2")) The output obtained from the preceding code is in the form of these languages listed as follows: 1. Piedmontese language 2. Lombard language 3. Gan Chinese 4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz 7. Pashto, Pushto 8. Kurdish 9. Portuguese 10. Kannada 11. Korean 12. Khmer 13. Kazakh 14. Ilokano 15. Polish 16. Panjabi, Punjabi 17. Georgian 18. Chuvash 19. Alemannic 20. Czech 21. Welsh 22. Chechen 23. Catalan; Valencian 24. Northern Sami 25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese 28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian 31. Swedish 32. Swahili 33. Sundanese 34. Serbian 35. Albanian 36. Japanese 37. Western Frisian 38. French 39. Finnish 40. Upper Sorbian 41. Faroese 42. Persian 43. Sinhala, Sinhalese 44. Italian 45. 
Amharic 46. Aragonese 47. Volapük 48. Icelandic 49. Sakha 50. Afrikaans 51. Indonesian 52. Interlingua 53. Azerbaijani 54. Ido 55. Arabic 56. Assamese 57. Yoruba 58. Yiddish 59. Waray-Waray 60. Croatian 61. Hungarian 62. Haitian; Haitian Creole 63. Quechua 64. Armenian 65. Hebrew (modern) 66. Silesian 67. Hindi 68. Divehi; Dhivehi; Mald... 69. German 70. Danish 71. Occitan 72. Tagalog 73. Turkmen 74. Thai 75. Tajik 76. Greek, Modern 77. Telugu 78. Tamil 79. Oriya 80. Ossetian, Ossetic 81. Tatar 82. Turkish 83. Kapampangan 84. Venetian 85. Manx 86. Gujarati 87. Galician 88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali 91. Cebuano 92. Zazaki 93. Walloon 94. Dutch 95. Norwegian 96. Norwegian Nynorsk 97. West Flemish 98. Chinese 99. Bosnian 100. Breton 101. Belarusian 102. Bulgarian 103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib... 106. Bengali 107. Burmese 108. Romansh 109. Marathi (Marāthī) 110. Malay 111. Maltese 112. Russian 113. Macedonian 114. Malayalam 115. Mongolian 116. Malagasy 117. Vietnamese 118. Spanish; Castilian 119. Estonian 120. Basque 121. Bishnupriya Manipuri 122. Asturian 123. English 124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin 127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan... 130. Latvian 131. Urdu 132. Lithuanian 133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ... The necessary models can be downloaded using the following code: %%bash polyglot download morph2.en morph2.ar [polyglot_data] Downloading package morph2.en to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.en is already up-to-date! [polyglot_data] Downloading package morph2.ar to [polyglot_data] /home/rmyeid/polyglot_data... [polyglot_data] Package morph2.ar is already up-to-date! Consider this example that obtains output from a polyglot: from polyglot.text import Text, Word tokens =["unconditional" ,"precooked", "impossible", "painful", "entered"] for s in tokens: s=Word(s, language="en") print("{:<20}{}".format(s,s.morphemes)) unconditional ['un','conditional'] precooked ['pre','cook','ed'] impossible ['im','possible'] painful ['pain','ful'] entered ['enter','ed'] If tokenization is not performed properly, then we can perform morphological analysis for the process of splitting text into its original constituents: sent="Ihopeyoufindthebookinteresting" para=Text(sent) para.language="en" para.morphemes WordList(['I','hope','you','find','the','book','interesting']) A morphological analyzers Morphological analysis may be defined as the process of obtaining grammatical information about a token given its suffix information. Morphological analysis can be performed in three ways: Morpheme-based morphology (or the item and arrangement approach), Lexeme-based morphology (or the item and process approach), and Word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token. It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output. In order to perform morphological analysis on a given non-whitespace token, pyEnchant dictionary is used. 
Consider the following code that performs morphological analysis:

>>> import enchant
>>> s = enchant.Dict("en_US")
>>> tok = []
>>> def tokenize(st1):
        if not st1:
            return
        for j in xrange(len(st1), -1, -1):
            if s.check(st1[0:j]):
                tok.append(st1[0:j])
                st1 = st1[j:]
                tokenize(st1)
                break
>>> tokenize("itismyfavouritebook")
>>> tok
['it', 'is', 'my', 'favourite', 'book']
>>> tok = []
>>> tokenize("ihopeyoufindthebookinteresting")
>>> tok
['i', 'hope', 'you', 'find', 'the', 'book', 'interesting']

We can determine the category of a word as follows:

Morphological hints: Suffix information helps us detect the category of a word. For example, the -ness and -ment suffixes occur with nouns.
Syntactic hints: Contextual information helps in determining the category of a word. For example, if we have found a word that has the noun category, then syntactic hints help determine whether an adjective appears before or after that noun.
Semantic hints: A semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it falls under the noun category.
Open class: This refers to a class of words that is not fixed; its membership keeps increasing as new words are added to the language. Words in an open class are usually nouns. Prepositions, by contrast, mostly form a closed class.
Morphology captured by the part-of-speech tagset: A part-of-speech tagset captures information that helps us perform morphological analysis. For example, a tag can indicate that the word 'plays' is a third-person singular form.

Omorfi (the open morphology of Finnish) is a package licensed under version 3 of the GNU GPL. It is used to perform numerous tasks such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.

A morphological generator

A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis: given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, part of speech = verb, tense = present, and the word occurs along with a third-person singular subject, then the morphological generator would produce its surface form, that is, goes.
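To make the input/output contract of a morphological generator concrete, here is a deliberately naive, hypothetical sketch for English present-tense verbs. It is not from the book, and the rules below are far simpler than what real generators (such as the tools listed next) implement:

# A toy morphological generator: given a root and grammatical features,
# produce a surface form. The inflection rules are intentionally naive.
def generate(root, pos, tense, person, number):
    if pos == 'verb' and tense == 'present' and person == 3 and number == 'singular':
        if root.endswith(('s', 'sh', 'ch', 'x', 'o')):
            return root + 'es'   # go -> goes, watch -> watches
        if root.endswith('y') and root[-2] not in 'aeiou':
            return root + 'ies'  # try -> tries
        return root + 's'        # walk -> walks
    return root                  # other feature bundles are left unhandled here

print(generate('go', 'verb', 'present', 3, 'singular'))    # goes
print(generate('walk', 'verb', 'present', 3, 'singular'))  # walks
print(generate('try', 'verb', 'present', 3, 'singular'))   # tries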
There are many Python-based tools that perform morphological analysis and generation. Some of them are as follows:

ParaMorfo: This is used to perform the morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs
HornMorpho: This is used for the morphological generation and analysis of Oromo and Amharic nouns and verbs, as well as Tigrinya verbs
AntiMorfo: This is used for the morphological generation and analysis of Quechua adjectives, verbs, and nouns, as well as Spanish verbs
MorfoMelayu: This is used for the morphological analysis of Malay words

Other examples of software used to perform morphological analysis and generation are as follows:

Morph is a morphological generator and analyzer for the English language and the RASP system
Morphy is a morphological generator, analyzer, and POS tagger for German
Morphisto is a morphological generator and analyzer for German
Morfette performs supervised learning (inflectional morphology) for Spanish and French

Search engines

PyStemmer 1.0.1 provides Snowball stemming algorithms that are useful for information retrieval tasks and for constructing a search engine. It includes the Porter stemming algorithm and many other stemming algorithms that are useful for stemming and information retrieval in many languages, including many European languages. We can construct a vector space search engine by converting texts into vectors. Here are the steps needed to construct a vector space search engine:

Stemming and elimination of stop words. A stemmer is a program that accepts words and converts them into stems. Tokens that share the same stem have nearly the same meaning. Stop words are also eliminated from the text. Consider the following code for the removal of stop words and tokenization:

def eliminatestopwords(self, wordlist):
    """
    Eliminate words which occur often and do not carry much significance
    from a context point of view.
    """
    return [word for word in wordlist if word not in self.stopwords]

def tokenize(self, string):
    """
    Perform the task of splitting the text into tokens and stemming them.
    """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

Mapping keywords into vector dimensions. Here's the code required to perform the mapping of keywords into vector dimensions:

def obtainvectorkeywordindex(self, documentList):
    """
    Map each keyword to a position (dimension) in the document vectors.
    """
    # Join the documents into a single string
    vocabstring = " ".join(documentList)
    vocablist = self.parser.tokenize(vocabstring)
    # Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)
    vectorIndex = {}
    offset = 0
    # Attach a position to each keyword; this position is the dimension
    # used to represent the keyword in the document vectors
    for word in uniqueVocablist:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # (keyword: position)
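Before moving on, here is a small standalone sketch of the keyword-to-dimension mapping, written without the parser and util helper classes assumed by the article's code; the toy documents and the stop word set are invented for illustration:

# Standalone illustration of mapping keywords to vector dimensions.
documents = ["the book is interesting", "I find the book useful"]
stopwords = {"the", "is", "i"}

tokens = " ".join(documents).lower().split()
keywords = [tok for tok in tokens if tok not in stopwords]

# Assign each unique keyword its own dimension, in order of first appearance.
vector_index = {}
for word in keywords:
    if word not in vector_index:
        vector_index[word] = len(vector_index)

print(vector_index)  # -> {'book': 0, 'interesting': 1, 'find': 2, 'useful': 3}

Each document can now be represented as a vector with one position per keyword, which is exactly what the term count model in the next step produces.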
Mapping of text strings to vectors. Here, a simple term count model is used. The code to convert text strings into vectors is as follows:

def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        vector[self.vectorKeywordIndex[word]] += 1  # simple term count model is used
    return vector

Searching similar documents. By finding the cosine of the angle between the vectors of two documents, we can determine whether they are similar or not. If the cosine value is 1, then the angle is 0 degrees and the vectors are said to be parallel (this means that the documents are related). If the cosine value is 0, then the angle is 90 degrees and the vectors are said to be perpendicular (this means that the documents are not related). This is the code to compute the cosine between text vectors using NumPy/SciPy:

from numpy import dot
from numpy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = ( X * Y ) / ||X|| x ||Y||
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))

Search keywords. We perform the mapping of keywords to a vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors with the help of a cosine measurement. Here is the code needed to search the vector space:

def searching(self, searchinglist):
    """
    Search for documents that match the given list of search terms.
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings

The following code can be used to detect the language of a source text:

>>> import nltk
>>> import sys
>>> try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('Error has occurred')
#----------------------------------------------------------------------
>>> def _calculate_languages_ratios(text):
        """
        Compute a score for each language the given text could be written in
        and return a dictionary that looks like {'german': 2, 'french': 4, 'english': 1}
        """
        languages_ratios = {}
        # nltk.wordpunct_tokenize() splits text into word and punctuation tokens:
        # wordpunct_tokenize("I hope you like the book interesting .")
        # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        tok = wordpunct_tokenize(text)
        words = [word.lower() for word in tok]
        # Count the unique stopwords of each language that occur in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            words_set = set(words)
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios
#----------------------------------------------------------------
>>> def detect_language(text):
        """
        Compute the score of the given text for each language and return the
        language with the highest score. It makes use of a stopword-counting
        approach, finding the unique stopwords present in the analyzed text.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language

if __name__ == '__main__':
    text = ''' All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things.
As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another. The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A personal or institutionalized system grounded in belief in a God or Gods and the activities connected with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good life'. It has never been the purpose of religion to divide people into groups of isolated followers that cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality. ''' >>> language = detect_language(text) >>> print(language) The preceding code will search for stop words and detect the language of the text, which is English. Summary In this article, we discussed stemming, lemmatization, and morphological analysis and generation. Resources for Article: Further resources on this subject: How is Python code organized[article] Machine learning and Python – the Dream Team[article] Putting the Fun in Functional Python[article]