
How-To Tutorials - Data


Machine Learning Tasks

Packt
01 Apr 2016
16 min read
In this article by David Julian, author of the book Designing Machine Learning Systems with Python, we first introduce the basic machine learning tasks.

Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification.

There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These types of problems are regression problems. Both classification and regression require a training set of correctly labelled data; they are supervised learning problems.

Originating from these basic machine learning tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to a prediction in an attempt to establish a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain.

We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even the search for extraterrestrial life.

An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain, but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship, but rather a group of instances that differ in ways that are important in the domain. For instance, we might establish the subgroups smoker = true and family history = true for the target variable heart disease = true.

Finally, we consider control-type tasks. These act to optimize control settings to maximize a payoff under different conditions. This can be achieved in several ways. We can clone expert behavior: the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn the relationship between conditions and optimal actions.

Clustering, on the other hand, is the task of grouping items without any information on those groups; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems and the "customers who bought this also bought..." suggestions on the checkout pages of online stores.
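To make these task types concrete, here is a minimal sketch of classification, regression, and clustering on synthetic data. It uses scikit-learn, which is my own choice for illustration; the article itself does not prescribe a library.

# Classification, regression, and clustering side by side on synthetic data.
# scikit-learn and the toy data are assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # 100 instances, 2 features

# Supervised: classification (discrete labels) and regression (real values)
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary class labels
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1]          # real-valued target

clf = LogisticRegression().fit(X, y_class)     # learns a decision boundary
reg = LinearRegression().fit(X, y_reg)         # learns a real-valued mapping

# Unsupervised: clustering groups instances with no labels at all
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print(clf.predict(X[:3]), reg.predict(X[:3]), clusters[:3])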
Data for machine learning

When considering raw data for machine learning applications, there are three separate aspects to consider: the volume of the data, the velocity of the data, and the variety of the data.

Data volume

The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring that our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force and throwing as much hardware at a problem as you can. With Moore's law, which predicted the trend of computing power doubling every two years, now reaching its limit, it is clear that scalability is not, by itself, going to be able to keep pace with the ever-increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost-effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches, from harnessing the capabilities of multi-core processors to large-scale distributed computing on many different platforms. Probably the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source implementation, Hadoop.

Data velocity

The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies, such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of stream processing. When input data arrives at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking for the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is increasingly common in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load.
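As a rough illustration of the simplest parallelism strategy mentioned above (running the same routine with different parameter sets on separate workers), here is a minimal sketch using Python's standard concurrent.futures module. The train() function and the parameter grid are placeholders I have invented, not code from the book.

# Run the same (stand-in) learning routine over a parameter grid in parallel.
from concurrent.futures import ProcessPoolExecutor

def train(params):
    # placeholder for fitting a model and returning a validation score
    c, gamma = params
    return {"C": c, "gamma": gamma, "score": 1.0 / (1.0 + abs(c - gamma))}

if __name__ == "__main__":
    grid = [(c, g) for c in (0.1, 1.0, 10.0) for g in (0.01, 0.1)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train, grid))
    best = max(results, key=lambda r: r["score"])
    print(best)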
Data variety

Collecting data from different sources invariably means dealing with misaligned data structures and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a quite different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application from the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units.

Models

The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output and, in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test set. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as the degree of similarity. In both cases, the process is iterative: it repeats a well-defined set of tasks that move the model closer to a correct hypothesis.

There are many models, and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, and commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem-solving power. However, it can also be a bit daunting for the designer to decide which model, or models, are best for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project.

There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data; the data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are.
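Before moving on, here is a small, hedged illustration of the grouping/grading distinction: a shallow decision tree partitions the instance space into a handful of segments, while a logistic model grades every point of the space with a continuous score. scikit-learn and the synthetic data are my own assumptions for the sketch.

# Grouping (finite segments) versus grading (a global, continuous model).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

grouping = DecisionTreeClassifier(max_depth=2).fit(X, y)  # finite segments
grading = LogisticRegression().fit(X, y)                  # global, continuous score

point = np.array([[0.30, 0.29]])
print(grouping.predict(point))        # class of the segment the point falls in
print(grading.predict_proba(point))   # graded score defined anywhere in the space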
The distinction between grouping and grading is not absolute, and many models contain elements of both.

Geometric models

One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many geometric concepts, such as linear transformations, still apply in this hyperspace. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them.

Probabilistic models

Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the downside, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data; as we will see later, this problem can be mitigated by using them as ensemble learners.

Linear models

A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques, such as support vector machines and neural networks. They can be applied to any predictive task, such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variations in training data, because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making them less sensitive to nuanced data. This trade-off is described by the terms variance (for overfitting models) and bias (for underfitting models). A linear model typically has low variance and high bias. Linear models are generally best approached from a geometric perspective; the short experiment below gives a feel for the variance contrast just described.
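In this sketch, both models are refit on slightly perturbed copies of the same synthetic data, and the tree's predictions typically move around far more than the linear model's. scikit-learn and the toy data are assumed purely for illustration.

# High-variance tree versus low-variance linear model under small data perturbations.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

def spread(model):
    # refit on jittered targets several times; a large spread means high variance
    preds = []
    for _ in range(20):
        y_jit = y + rng.normal(scale=0.5, size=100)
        preds.append(model.fit(X, y_jit).predict(X_test))
    return np.std(preds, axis=0).mean()

print("tree prediction spread:  ", spread(DecisionTreeRegressor()))
print("linear prediction spread:", spread(LinearRegression()))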
We know we can easily plot two dimensions of space in a Cartesian coordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as a fourth dimension, but when we start speaking of n dimensions, the physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in the space.

Model ensembles

Ensemble techniques can be divided broadly into two types. The averaging method: several estimators are run independently and their predictions are averaged; this includes the random forests and bagging methods. The boosting method: weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is not only to build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant development bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This seemingly simple statement actually encompasses a very large and diverse space.

Neural nets

When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard-wired into a different part of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single-algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks.
As in real brains, using a single algorithm to describe how each neuron communicates with the neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, uses very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match, and many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representations from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding a bicycle.
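To close the discussion of layered networks, here is a minimal sketch of a small multi-layer network that maps raw pixel-like inputs through successive hidden layers to a class label. scikit-learn's MLPClassifier and the synthetic data are my own assumptions, not the author's example.

# A tiny layered network: each hidden layer is one step in the hierarchy of representation.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 64))   # stand-in for 8x8 pixel values
y = (X[:, :32].sum(axis=1) > X[:, 32:].sum(axis=1)).astype(int)

# two hidden layers, each learning a higher level of abstraction
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))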


Why Mesos?

Packt
31 Mar 2016
8 min read
In this article by Dipa Dubhasi and Akhil Das, authors of the book Mastering Mesos, we delve into why Mesos is important. Apache Mesos is open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computing resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be easily built and run effectively. It is referred to as a metascheduler (a scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS". It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, make this one of the most exciting projects.

Introduction to the datacenter OS and the architecture of Mesos

Over the past decade, datacenters have graduated from packing multiple applications into a single server box to large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyperscale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy. Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on a PC, one each for the processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced. Another way of looking at the situation is to compare running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose the ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources.

Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support. The idea behind a datacenter OS is that of an intelligent software layer that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements.
In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure. The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multitenancy, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources. From a developer's perspective, this allows them to easily and safely build distributed applications without being restricted to a bunch of specialized tools, each catering to a specific set of requirements. For instance, consider the case of data science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change. Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues.

In a world where distributed computing becomes the norm, the datacenter OS is a boon. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications, as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability, even for noninfrastructure professionals. An example of an organization administering its hyperscale datacenters via a datacenter OS is Google, with its Borg (and next-generation Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and the flexibility to support data sharing across applications, to saving team effort, time, and money while launching and managing interoperable cluster applications. It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way, with a number of leading companies, such as Apple, Twitter, Netflix, and Airbnb among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core.

The architecture of Mesos

Mesos is an open source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka, among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making them available as a single reservoir of highly available resources for all the different frameworks to utilize.
For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM per node) and another 5-node cluster (4 CPUs and 16 GB RAM per node), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB of RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per framework. Some of the important features of Mesos are:

Scalability: it can elastically scale to over 50,000 nodes
Resource isolation: this is achieved through Linux/Docker containers
Efficiency: this is achieved through CPU- and memory-aware resource scheduling across multiple frameworks
High availability: this is achieved through Apache ZooKeeper
Interface: a web UI for monitoring the cluster state

Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos' design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are that different frameworks can independently devise methods to address their data locality, fault-tolerance, and other such needs, and that it simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile.

Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation has been shown to achieve a good degree of data locality for each framework sharing the same cluster (a toy sketch of the offer cycle follows this summary). An alternative architecture would implement a global scheduler that took framework requirements, organizational priorities, and resource availability as inputs and provided a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources, with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators.

Summary

In this article, we introduced Mesos and then dived into its architecture to understand its importance.
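Here is the toy sketch of the resource-offer cycle promised above. It is a plain Python illustration of the idea only; the class names and numbers are invented for the example and bear no relation to the actual Mesos API.

# Toy illustration of the resource-offer abstraction: the master decides how much
# to offer, each framework decides what to accept and launch.
class Framework:
    def __init__(self, name, cpus_per_task):
        self.name = name
        self.cpus_per_task = cpus_per_task

    def consider(self, offer_cpus):
        # the framework decides which part of the offer to accept
        tasks = offer_cpus // self.cpus_per_task
        accepted = tasks * self.cpus_per_task
        return accepted, tasks

class Master:
    def __init__(self, total_cpus):
        self.free_cpus = total_cpus

    def offer(self, framework, cpus):
        cpus = min(cpus, self.free_cpus)
        accepted, tasks = framework.consider(cpus)
        self.free_cpus -= accepted
        print(f"{framework.name}: offered {cpus} CPUs, "
              f"launched {tasks} tasks using {accepted}")

master = Master(total_cpus=180)          # the pooled cluster from the example above
master.offer(Framework("spark", 4), 40)
master.offer(Framework("hadoop", 2), 30)
print("unused CPUs:", master.free_cpus)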


Support Vector Machines as a Classification Engine

Packt
17 Mar 2016
9 min read
In this article by Tomasz Drabas, author of the book Practical Data Analysis Cookbook, we will discuss how Support Vector Machine models can be used as a classification engine.

Support Vector Machines

Support Vector Machines (SVMs) are a family of extremely powerful models that can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. While many classifiers exist that can classify linearly separable data (for example, logistic regression), SVMs can handle highly non-linear problems using a kernel trick that implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable.

The mechanics of the machine

Given a set of n points of the form (x1, y1), ..., (xn, yn), where xi is a z-dimensional input vector and yi is a class label, the SVM aims at finding the maximum margin hyperplane that separates the data points. In a two-dimensional dataset with linearly separable data points, the maximum margin hyperplane would be a line that maximizes the distance between each of the classes. The hyperplane can be expressed as a dot product of the input vectors x and a vector normal to the hyperplane, W: W . X = b, where b is the offset from the origin of the coordinate system. To find the hyperplane, we solve the optimization problem of minimizing ||W||^2 / 2 subject to yi(W . xi - b) >= 1 for every point; this constraint effectively states that no point can cross the hyperplane if it does not belong to the class on that side of the hyperplane.

Linear SVM

Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM, but here we decided to use MLPY (http://mlpy.sourceforge.net):

import pandas as pd
import numpy as np
import mlpy as ml

First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (see the https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the data):

# the file name of the dataset
r_filename = 'Data/Chapter03/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)

The dataset that we use was described in S. Moro, P. Cortez, and P. Rita, "A data-driven approach to predict the success of bank telemarketing", Decision Support Systems, Elsevier, 62:22-31, June 2014, and can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not. Once the file is loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_data(...)
method:

def split_data(data, y, x = 'All', test_size = 0.33):
    '''
    Method to split the data into training and testing
    '''
    import sys

    # dependent variable
    variables = {'y': y}

    # and all the independent
    if x == 'All':
        allColumns = list(data.columns)
        allColumns.remove(y)
        variables['x'] = allColumns
    else:
        if type(x) != list:
            print('The x parameter has to be a list...')
            sys.exit(1)
        else:
            variables['x'] = x

    # create a variable to flag the training sample
    data['train'] = np.random.rand(len(data)) < (1 - test_size)

    # split the data into training and testing
    train_x = data[data.train][variables['x']]
    train_y = data[data.train][variables['y']]
    test_x = data[~data.train][variables['x']]
    test_y = data[~data.train][variables['y']]

    return train_x, train_y, test_x, test_y, variables['x']

We randomly set 1/3 of the dataset aside for testing purposes and use the remaining 2/3 for training the model:

# split the data into training and testing
train_x, train_y, test_x, test_y, labels = hlp.split_data(
    csv_read,
    y = 'credit_application'
)

Once we have read the data and split it into training and testing datasets, we can estimate the model:

# create the classifier object
svm = ml.LibSvm(svm_type='c_svc', kernel_type='linear', C=100.0)

# fit the data
svm.learn(train_x, train_y)

The svm_type parameter of the .LibSvm(...) method controls which algorithm is used to estimate the SVM. Here, we use the c_svc method, a C-Support Vector Classifier. The C parameter specifies how much you want to avoid misclassifying observations: larger values of C will shrink the margin of the hyperplane so that more of the observations are correctly classified. You can also specify nu_svc with a nu parameter that controls how much of your sample (at most) can be misclassified and how many of your observations (at least) can become support vectors. Here, we estimate an SVM with a linear kernel, so let's talk about kernels.

Kernels

A kernel function K is effectively a function that computes a dot product between two n-dimensional vectors, K: R^n x R^n -> R. In other words, the kernel function takes two vectors and produces a scalar. The linear kernel does not effectively transform the data into a higher-dimensional space. This is not true for polynomial or Radial Basis Function (RBF) kernels, which transform the input feature space into higher dimensions. In the case of the polynomial kernel of degree d, the obtained feature space has (n+d choose d) dimensions for an R^n-dimensional input feature space. As you can see, the number of additional dimensions can grow very quickly, and this would pose significant problems in estimating the model if we had to explicitly transform the data into higher dimensions. Thankfully, we do not have to do this, as that's where the kernel trick comes into play. SVMs do not have to work explicitly in higher dimensions but can rather implicitly map the data to higher dimensions using pairwise inner products (instead of an explicit dot product) and then use these to find the maximum margin hyperplane. You can find a really good explanation of the kernel trick at http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.

Back to our example

The .learn(...) method of the .LibSvm(...) object estimates the model. Once the model is estimated, we can test how well it performs.
First, we use the estimated model to predict the classes for the observations in the testing dataset:

predicted_l = svm.pred(test_x)

Next, we will use some of the scikit-learn methods to print the basic statistics for our model:

def printModelSummary(actual, predicted):
    '''
    Method to print out model summaries
    '''
    import sklearn.metrics as mt

    print('Overall accuracy of the model is {0:.2f} percent'
        .format((actual == predicted).sum() / len(actual) * 100))
    print('Classification report:\n',
        mt.classification_report(actual, predicted))
    print('Confusion matrix:\n',
        mt.confusion_matrix(actual, predicted))
    print('ROC: ', mt.roc_auc_score(actual, predicted))

First, we calculate the overall accuracy of the model, expressed as the ratio of properly classified observations to the total number of observations in the testing sample. Next, we print the classification report. The precision is the model's ability to avoid classifying an observation as positive when it is not. It is the ratio of true positives to the overall number of positively classified records. The overall precision score is a weighted average of the individual precision scores, where the weight is the support. The support is the total number of actual observations in each class. The total precision for our model is not too bad: 89 out of 100. However, when we look at the precision of classifying the true positives, the situation is not as good: only 63 out of 100 were properly classified. Recall can be viewed as the model's capacity to find all the positive samples. It is the ratio of true positives to the sum of true positives and false negatives. The recall for the class 0.0 is almost perfect, but for class 1.0 it looks really bad. This might be a problem with the fact that our sample is not balanced, but it is more likely that the features we use to classify the data do not really capture the differences between the two groups. The f1-score is effectively a weighted amalgam of precision and recall: it is the ratio of twice the product of precision and recall to their sum. In one measure, it shows whether the model performs well or not. At the general level, the model does not perform badly, but when we look at the model's ability to classify the true signal, it fails gravely. It is a perfect example of why judging a model at the general level can be misleading when dealing with samples that are heavily unbalanced.

RBF kernel SVM

Given that the linear kernel performed poorly, our dataset might not be linearly separable, so let's try the RBF kernel. The RBF kernel is given as K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), where ||x - y||^2 is the squared Euclidean distance between the two vectors x and y, and sigma is a free parameter. The value of the kernel equals 1 when x = y and gradually falls to 0 as the distance approaches infinity. To fit an RBF version of our model, we can specify our svm object as follows:

svm = ml.LibSvm(svm_type='c_svc', kernel_type='rbf', gamma=0.1, C=1.0)

The gamma parameter here specifies how far the influence of a single support vector reaches. Visually, you can investigate the relationship between the gamma and C parameters at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html. The rest of the code for the model estimation follows in a similar fashion as with the linear kernel, and we obtain the following results: they are even worse than with the linear kernel, as precision and recall were lost across the board.
The SVM with the RBF kernel performed worse when classifying both the calls that resulted in applying for the credit card and those that did not.

Summary

In this article, we saw that the problem is not with the model; rather, the dataset that we use does not explain the variance sufficiently. This requires going back to the drawing board and selecting other features.
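As a footnote to the kernel discussion above, here is a small, library-free sketch of the RBF kernel function itself, just to make the formula concrete. It is not how LibSVM computes the kernel internally, and the sigma/gamma relationship noted in the comments is the usual convention rather than anything specific to this recipe.

# K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # squared Euclidean distance between the two vectors
    sq_dist = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0, 3.0])
print(rbf_kernel(x, x))           # identical vectors -> 1.0
print(rbf_kernel(x, x + 0.5))     # nearby vectors -> close to 1
print(rbf_kernel(x, x + 100.0))   # distant vectors -> close to 0

# Note: many libraries parameterize the RBF kernel as exp(-gamma * ||x - y||^2),
# with gamma = 1 / (2 * sigma^2); the gamma passed to LibSvm above plays that role.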


Integrating Imagery with Creating and Styling Features in OpenLayers 3

Packt
17 Mar 2016
21 min read
This article by Peter J Langley, author of the book OpenLayers 3.x Cookbook, sheds some light on three of the most important and talked about features of the OpenLayers library. (For more resources related to this topic, see here.) Introduction This article shows us the basics and the important things that we need to know when we start creating our first web-mapping application with OpenLayers. As we will see in this and the following recipes, OpenLayers is a big and complex framework, but at the same time, it is also very powerful and flexible. Although we're now spoilt for choice when it comes to picking a JavaScript mapping library (as we are with most JavaScript libraries and frameworks), OpenLayers is a mature, fully-featured, and well-supported library. In contrast to other libraries, such as Leaflet (http://leafletjs.com), which focus on a smaller download size in order to provide only the most common functionality as standard, OpenLayers tries to implement all the required things that a developer could need to create a web Geographic Information System (GIS) application. One aspect of OpenLayers 3 that immediately differentiates itself from OpenLayers 2, is that it's been built with the Google Closure library (https://developers.google.com/closure). Google Closure provides an extensive range of modular cross-browser JavaScript utility methods that OpenLayers 3 selectively includes. In GIS, a real-world phenomenon is represented by the concept of a feature. It can be a place, such as a city or a village, it can be a road or a railway, it can be a region, a lake, the border of a country, or something entirely arbitrary. Features can have a set of attributes, such as population, length, and so on. These can be represented visually through the use of points, lines, polygons, and so on, using some visual style: color, radius, width, and so on. OpenLayers offers us a great degree of flexibility when styling features. We can use static styles or dynamic styles influenced by feature attributes. Styles can be created through various methods, such as from style functions (ol.style.StyleFunction), or by applying new style instances (ol.style.Style) directly to a feature or layer. Let's take a look at all of this in the following recipes. Adding WMS layers Web Map Service (WMS) is a standard developed by the Open Geospatial Consortium (OGC), which is implemented by many geospatial servers, among which we can find the free and open source projects, GeoServer (http://geoserver.org) and MapServer (http://mapserver.org). More information on WMS can be found at http://en.wikipedia.org/wiki/Web_Map_Service. As a basic summary, a WMS server is a normal HTTP web server that accepts requests with some GIS-related parameters (such as projection, bounding box, and so on) and returns map tiles forming a mosaic that covers the requested bounding box. Here's the finished recipe's outcome using a WMS layer that covers the extent of the USA: We are going to work with remote WMS servers, so it is not necessary that you have one installed yourself. Note that we are not responsible for these servers, and that they may have problems, or they may not be available any longer when you read this section. Any other WMS server can be used, but the URL and layer name must be known. How to do it… We will add two WMS layers to work with. To do this, perform the following steps: Create an HTML file and add the OpenLayers dependencies. 
In particular, create the HTML to hold the map and the layer panel: <div id="js-map" class="map"></div> <div class="pane"> <h1>WMS layers</h1> <p>Select the WMS layer you wish to view:</p> <select id="js-layers" class="layers"> <option value="-10527519,3160212,4">Temperature " (USA)</option> <option value="-408479,7213209,6">Bedrock (UK)</option> </select> </div> Create the map instance with the default OpenStreetMap layer: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10527519, 3160212] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.OSM() }) ] }); Add the first WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/' + 'NDFDTemps/MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); Add the second WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Finally, add the layer-switching logic: document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); How it works… The HTML and CSS divide the page into two sections: one for the map, and the other for the layer-switching panel. The top part of our custom JavaScript file creates a new map instance with a single OpenStreetMap layer; this layer will become the background for the WMS layers in order to provide some context. Let's spend the rest of our time concentrating on how the WMS layers are created. WMS layers are encapsulated within the ol.layer.Tile layer type. The source is an instance of ol.source.TileWMS, which is a subclass of ol.source.TileImage. The ol.source.TileImage class is behind many source types, such as Bing Maps, and custom OpenStreetMap layers that are based on XYZ format. When using ol.source.TileWMS, we must at least pass in the URL of the WMS server and a layers parameter. Let's breakdown the first WMS layer as follows: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/' + 'MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); For the url property of the source, we provide the URL of the WMS server from NOAA (http://www.noaa.gov). The params property expects an object of key/value pairs. The content of this is appended to the previous URL as query string parameters, for example, http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/MapServer/WMSServer?LAYERS=16. As mentioned earlier, at minimum, this object requires the LAYERS property with a value. We request for the layer by the name of 16. 
Along with this parameter, we also explicitly ask for the tile images to be in the .PNG format (FORMAT: 'image/png') and that the background of the tiles be transparent (TRANSPARENT: true) rather than white, which would undesirably block out the background map layer. The default values for format and transparency are already image or PNG and false, respectively. This means you don't need to pass them in as parameters, OpenLayers will do it for you. We've shown you this for learning purposes, but this isn't strictly necessary. There are also other parameters that OpenLayers fills in for you if not specified, such as service (WMS), version (1.3.0), request (GetMap), and so on. For the attributions property, we created a new attribution instance to cover our usage of the WMS service, which simply contains a string of HTML linking back to the NOAA website. Lastly, we set the opacity property of the layer to 50% (0.50), which suitably overlays the OpenStreetMap layer underneath: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Check the WMS standard to know which parameters you can use within the params property. The use of layers is mandatory, so you always need to specify this value. This layer from the British Geological Survey (http://bgs.ac.uk) follows the same structure as the previous WMS layer. Similarly, we provided a source URL and a layers parameter for the HTTP request. The layer name is a string rather than a number this time, which is delimited by underscores. The naming convention is at the discretion of the WMS service itself. Like earlier, an attribution instance has been added to the layer, which contains a string of HTML linking back to the BGS website, covering our usage of the WMS service. The opacity property of this layer is a little less transparent than the last one, at 85% (0.85): document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); Finally, we added a change-event listener and handler to the select menu containing both the WMS layers. If you recall from the HTML, an option's value contains a comma-delimited string. For example, the Bedrock WMS layer option looks like this: <option value="-408479,7213209,6">Bedrock (UK)</option> This translates to x coordinate, y coordinate, and zoom level. With this in mind when the change event fires, we store the value of the newly-selected option in a variable named values. The split JavaScript method creates a three-item array from the string. The array now contains the xy coordinates and the zoom level, respectively. We store a reference to the view into a variable, namely view, as it's accessed more than once within the event handler. The map view is then centered to the new location with the setCenter method. We've made sure to convert the string values into float types for OpenLayers, via the parseFloat JavaScript method. The zoom level is then set via the setZoom method. Continuing with the Bedrock example, it will recenter at -408479, 7213209 with zoom level 6. 
Integrating with custom WMS services plays an essential role in many web-mapping applications. Learning how we did this in this recipe should give you a good idea of how to integrate with any other WMS services that you may use. There's more… It's worth mentioning that WMS services do not necessarily cover a global extent, and they will more likely cover only subset extents of the world. Case in point, the NOAA WMS layer covers only USA, and the BGS WMS layer only covers the UK. During this topic, we only looked at the request type of GetMap, but there's also a request type called GetCapabilities. Using the GetCapabilities request parameter on the same URL endpoint returns the capabilities (such as extent) that a WMS server supports. This is discussed in much more detail later in this book. If you don't specify the type of projection, the view default projection will be used. In our case, this will be EPSG:3857, which is passed up in a parameter named CRS (it's named SRS for the GetMap version requests less than 1.3.0). If you want to retrieve WMS tiles in different projections, you need to ensure that the WMS server supports that particular format. WMS servers return images no matter whether there is information in the bounding box that we are requesting or not. Taking this recipe as an example, if the viewable extent of the map is only the UK, blank images will get returned for WMS layer requests made for USA (via the NOAA tile requests). You can prevent these unnecessary HTTP requests by setting the visibility of any layers that do not cover the extent of the area being viewed to false. There are some useful methods of the ol.source.TileWMS class that are worth being aware of, such as updateParams, which can be used to set parameters for the WMS request, and getUrls, which return the URLs used for the WMS source. Creating features programmatically Loading data from an external source is not the only way to work with vector layers. Imagine a web-mapping application where users can create new features on the fly: landing zones, perimeters, areas of interest, and so on, and add them to a vector layer with some style. This scenario requires the ability to create and add the features programmatically. In this recipe, we will take a look at some of the ways to create a selection of features programmatically. How to do it… Here, we'll create some features programmatically without any file importing. The following instructions show you how this is done: Start by creating a new HTML file with the required OpenLayers dependencies. 
In particular, add the div element to hold the map:

<div id="js-map"></div>

Create an empty JavaScript file and instantiate a map with a background raster layer:

var map = new ol.Map({
  view: new ol.View({
    zoom: 3,
    center: [-2719935, 3385243]
  }),
  target: 'js-map',
  layers: [
    new ol.layer.Tile({
      source: new ol.source.MapQuest({layer: 'osm'})
    })
  ]
});

Create the point and circle features:

var point = new ol.Feature({
  geometry: new ol.geom.Point([-606604, 3228700])
});

var circle = new ol.Feature(
  new ol.geom.Circle([-391357, 4774562], 9e5)
);

Create the line and polygon features:

var line = new ol.Feature(
  new ol.geom.LineString([
    [-371789, 6711782], [1624133, 4539747]
  ])
);

var polygon = new ol.Feature(
  new ol.geom.Polygon([[
    [606604, 4285365],
    [1506726, 3933143],
    [1252344, 3248267],
    [195678, 3248267]
  ]])
);

Create the vector layer and add the features to the layer:

map.addLayer(new ol.layer.Vector({
  source: new ol.source.Vector({
    features: [point, circle, line, polygon]
  })
}));

How it works…

Although we've created some random features for this recipe, features in mapping applications would normally represent some phenomenon of the real world, with an appropriate geometry and a style associated with it. Let's go over the programmatic feature creation and how the features are added to a vector layer:

var point = new ol.Feature({
  geometry: new ol.geom.Point([-606604, 3228700])
});

Features are instances of ol.Feature. This constructor contains many useful methods, such as clone, setGeometry, getStyle, and others. When creating an instance of ol.Feature, we must either pass in a geometry of type ol.geom.Geometry or an object containing properties. We demonstrate both variations throughout this recipe. For the point feature, we pass in a configuration object. The only property that we supply is geometry. There are other properties available, such as style, and the use of custom properties to set the feature attributes ourselves, which come with getters and setters. The geometry is an instance of ol.geom.Point. The ol.geom class provides a variety of other feature types that we don't get to see in this recipe, such as MultiLineString and MultiPoint. The point geometry type simply requires an ol.Coordinate type array (xy coordinates):

var circle = new ol.Feature(
  new ol.geom.Circle([-391357, 4774562], 9e5)
);

Remember to express the coordinates in the appropriate projection, such as the one used by the view, or translate the coordinates yourself. For now, all features will be rendered with the default OpenLayers styling. The circle feature follows almost the same structure as the point feature. This time, however, we don't pass in a configuration object to ol.Feature, but instead directly instantiate an ol.geom.Geometry type of Circle. The circle geometry takes an array of coordinates and a second parameter for the radius. 9e5 (or 9e+5) is exponential notation for 900,000. The circle geometry also has useful methods, such as getCenter and setRadius:

var line = new ol.Feature(
  new ol.geom.LineString([
    [-371789, 6711782], [1624133, 4539747]
  ])
);

The only noticeable difference with the LineString feature is that ol.geom.LineString expects an array of coordinate arrays.
For more advanced line strings, use the ol.geom.MultiLineString geometry type (more information about them can be found on the OpenLayers API documentation: http://openlayers.org/en/v3.13.0/apidoc/): The LineString feature also has useful methods, such as getLength: var polygon = new ol.Feature( new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]) ); The final feature, a Polygon geometry type differs slightly from the LineString feature as it expects an ol.Coordinate type array within an array within another wrapping array. This is because the constructor (ol.geom.Polygon) expects an array of rings with each ring representing an array of coordinates. Ideally, each ring should be closed. The polygon feature also has useful methods, such as getArea and getLinearRing: map.addLayer(new ol.layer.Vector({ source: new ol.source.Vector({ features: [point, circle, line, polygon] }) })); The OGC's Simple Feature Access specification (http://www.opengeospatial.org/standards/sfa) contains an in-depth description of the standard. It also contains a UML class diagram where you can see all the geometry classes and hierarchy. Finally, we create the vector layer, with a vector source instance and then add all four features into an array and pass it to the features property. All the features we've created are subclasses of ol.geom.SimpleGeometry. This class provides useful base methods, such as getExtent and getFirstCoordinate. All features have a getType method that can be used to identify the type of feature, for example, 'Point' or 'LineString'. There's more… Sometimes, the polygon features may represent a region with a hole in it. To create the hollow part of a polygon, we use the LinearRing geometry. The outcome is best explained with the following screenshot: You can see that the polygon has a section cut out of it. To achieve this geometry, we must create the polygon in a slightly different way. Here are the steps: Create the polygon geometry: var polygon = new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]); Create and add the linear ring to the polygon geometry: polygon.appendLinearRing( new ol.geom.LinearRing([ [645740, 3766816], [1017529, 3786384], [1017529, 3532002], [626172, 3532002] ]) ); Create the completed feature: var polygonFeature = new ol.Feature(polygon); Finish off by adding the polygon feature to the vector layer: vectorLayer.getSource().addFeature(polygonFeature); We won't break this logic down any further, as it's quite self explanatory. Now, we're comfortable with geometry creation. The ol.geom.LinearRing feature can only be used in conjunction with a polygon geometry, not as a standalone feature. Styling features based on geometry type We can summarize that there are two ways to style a feature. The first is by applying the style to the layer so that every feature inherits the styling. The second is to apply the styling options directly to the feature, which we'll see with this recipe. This recipe shows you how we can choose which flavor of styling to apply to a feature depending on the geometry type. We will apply the style directly to the feature using the ol.Feature method, setStyle. When a point geometry type is detected, we will actually style the representing geometry as a star, rather than the default circle shape. 
Other styling will be applied when a geometry type of line string is detected and here's what the output of the recipe will look like: How to do it… To customize the feature styling based on the geometry type, follow these steps: Create the HTML file with OpenLayers dependencies, the jQuery library, and a div element that will hold the map instance. Create a custom JavaScript file and initialize a new map instance: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10732981, 4676723] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.MapQuest({layer: 'osm'}) }) ] Create a new vector layer and add it to the map. Have the source loader function retrieve the GeoJSON file, format the response, then pass it through our custom modifyFeatures method (which we'll implement next) before adding the features to the vector source: var vectorLayer = new ol.layer.Vector({ source: new ol.source.Vector({ loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } }) }); map.addLayer(vectorLayer); Finish off by implementing the modifyFeatures function so that it transforms the projection of the geometry and styles the feature that are based on the geometry type: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } }); return features; } How it works… Let's briefly look over the loader function of the vector source before we take a closer examination of the logic behind the styling: loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } Our external resource contains points and line strings in the format of GeoJSON. So we must create a new instance of ol.format.GeoJSON so that we can read in the data (format.readFeatures(data)) of the AJAX response to build out a collection of OpenLayers features. Before adding the group of features straight into the vector source (this refers to the vector source here), we pass the array of features through our modifyFeatures method. This method will apply all the necessary styling to each feature, then return the modified features in place, and feed the result into the addFeatures method. Let's break down the contents our modifyFeatures method: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); The logic begins by looping over each feature in the array using the JavaScript array method, forEach. The first argument passed into the anonymous iterator function is the (feature) feature. Within the loop iteration, we store the feature's geometry into a variable, namely geometry, as it's accessed more than once during the loop iteration. 
Unbeknown to you, the projection of coordinates within the GeoJSON file are in longitude/latitude, the EPSG:4326 projection code. The map's view, however, is in the EPSG:3857 projection. To ensure they appear where intended on the map, we use the transform geometry method, which takes the source and the destination projections as arguments and converts the coordinates of the geometry in place: if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } Next up is a conditional check on whether or not the geometry is a type of Point. The geometry instance has the getType method for this kind of purpose. Inline of the setStyle method of the feature instance, we create a new style object from the ol.style.Style constructor. The only direct property that we're interested in is the image property. By default, point geometries are styled as a circle. Instead, we want to style the point as a star. We can achieve this through the use of the ol.style.RegularShape constructor. We set up a fill style with color and a stroke style with width and color. The points property specifies the number of points for the star. In the case of a polygon shape, it represents the number of sides. The radius1 and radius2 properties are specifically to design star shapes for the configuration of the inner and outer radius, respectively: if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } The final piece of the method has a conditional check on the geometry type of LineString. If this is the case, we style this geometry type differently to the point geometry type. We provide a stroke style with a color, width,property and a custom lineDash. The lineDash array declares a line length of 8 followed by a gap length of 6. Summary In this article we looked at how to integrate WMS layers to our map from a basic HTTP web server by passing in some GIS related parameters. We also saw how to create and add features to our vector layer with some styling the idea behind this particular recipe was to enable the user to create the features programmatically without any file importing. We also saw how to style these features by applying styling option to the features individually based on their geometry type rather than styling the layer. Resources for Article: Further resources on this subject: What is OpenLayers?[article] Getting Started with OpenLayers[article] Creating Simple Maps with OpenLayers 3[article]

Welcome to Machine Learning using the .NET Framework

Oli Huggins
16 Mar 2016
26 min read
This article by, Jamie Dixon, the author of the book, Mastering .NET Machine Learning, will focus on some of the larger questions you might have about machine learning using the .NET Framework, namely: What is machine learning? Why should we consider it in the .NET Framework? How can I get started with coding? (For more resources related to this topic, see here.) What is machine learning? If you check out on Wikipedia, you will find a fairly abstract definition of machine learning: "Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions." I like to think of machine learning as computer programs that produce different results as they are exposed to more information without changing their source code (and consequently needed to be redeployed). For example, consider a game that I play with the computer. I show the computer this picture  and tell it "Blue Circle". I then show it this picture  and tell it "Red Circle". Next I show it this picture  and say "Green Triangle." Finally, I show it this picture  and ask it "What is this?". Ideally the computer would respond, "Green Circle." This is one example of machine learning. Although I did not change my code or recompile and redeploy, the computer program can respond accurately to data it has never seen before. Also, the computer code does not have to explicitly write each possible data permutation. Instead, we create models that the computer applies to new data. Sometimes the computer is right, sometimes it is wrong. We then feed the new data to the computer to retrain the model so the computer gets more and more accurate over time—or, at least, that is the goal. Once you decide to implement some machine learning into your code base, another decision has to be made fairly early in the process. How often do you want the computer to learn? For example, if you create a model by hand, how often do you update it? With every new data row? Every month? Every year? Depending on what you are trying to accomplish, you might create a real-time ML model, a near-time model, or a periodic model. Why .NET? If you are a Windows developer, using .NET is something you do without thinking. Indeed, a vast majority of Windows business applications written in the last 15 years use managed code—most of it written in C#. Although it is difficult to categorize millions of software developers, it is fair to say that .NET developers often come from nontraditional backgrounds. Perhaps a developer came to .NET from a BCSC degree but it is equally likely s/he started writing VBA scripts in Excel, moving up to Access applications, and then into VB.NET/C# applications. Therefore, most .NET developers are likely to be familiar with C#/VB.NET and write in an imperative and perhaps OO style. The problem with this rather narrow exposure is that most machine learning classes, books, and code examples are in R or Python and very much use a functional style of writing code. Therefore, the .NET developer is at a disadvantage when acquiring machine learning skills because of the need to learn a new development environment, a new language, and a new style of coding before learning how to write the first line of machine learning code. 
If, however, that same developer could use their familiar IDE (Visual Studio) and the same base libraries (the .NET Framework), they can concentrate on learning machine learning much sooner. Also, when creating machine learning models in .NET, they have immediate impact as you can slide the code right into an existing C#/VB.NET solution. On the other hand, .NET is under-represented in the data science community. There are a couple of different reasons floating around for that fact. The first is that historically Microsoft was a proprietary closed system and the academic community embraced open source systems such as Linux and Java. The second reason is that much academic research uses domain-specific languages such as R, whereas Microsoft concentrated .NET on general purpose programming languages. Research that moved to industry took their language with them. However, as the researcher's role is shifted from data science to building programs that can work at real time that customers touch, the researcher is getting more and more exposure to Windows and Windows development. Whether you like it or not, all companies which create software that face customers must have a Windows strategy, an iOS strategy, and an Android strategy. One real advantage to writing and then deploying your machine learning code in .NET is that you can get everything with one stop shopping. I know several large companies who write their models in R and then have another team rewrite them in Python or C++ to deploy them. Also, they might write their model in Python and then rewrite it in C# to deploy on Windows devices. Clearly, if you could write and deploy in one language stack, there is a tremendous opportunity for efficiency and speed to market. What version of the .NET Framework are we using? The .NET Framework has been around for general release since 2002. The base of the framework is the Common Language Runtime or CLR. The CLR is a virtual machine that abstracts much of the OS specific functionality like memory management and exception handling. The CLR is loosely based on the Java Virtual Machine (JVM). Sitting on top of the CLR is the Framework Class Library (FCL) that allows different languages to interoperate with the CLR and each other: the FCL is what allows VB.Net, C#, F#, and Iron Python code to work side-by-side with each other. Since its first release, the .NET framework has included more and more features. The first release saw support for the major platform libraries like WinForms, ASP.NET, and ADO.NET. Subsequent releases brought in things like Windows Communication Foundation (WCF), Language Integrated Query (LINQ), and Task Parallel Library (TPL). At the time of writing, the latest version is of the .Net Framework is 4.6.2. In addition to the full-Monty .NET Framework, over the years Microsoft has released slimmed down versions of the .NET Framework intended to run on machines that have limited hardware and OS support. The most famous of these releases was the Portable Class Library (PCL) that targeted Windows RT applications running Windows 8. The most recent incantation of this is Universal Windows Applications (UWA), targeting Windows 10. At Connect(); in November 2015, Microsoft announced GA of the latest edition of the .NET Framework. This release introduced the .Net Core 5. In January, they decided to rename it to .Net Core 1.0. .NET Core 1.0 is intended to be a slimmed down version of the full .NET Framework that runs on multiple operating systems (specifically targeting OS X and Linux). 
The next release of ASP.NET (ASP.NET Core 1.0) sits on top of .NET Core 1.0. ASP.NET Core 1.0 applications that run on Windows can still run the full .NET Framework. (https://blogs.msdn.microsoft.com/webdev/2016/01/19/asp-net-5-is-dead-int...) In this book, we will be using a mixture of ASP.NET 4.0, ASP.NET 5.0, and Universal Windows Applications. As you can guess, machine learning models (and the theory behind the models) change with a lot less frequency than framework releases so the most of the code you write on .NET 4.6 will work equally well with PCL and .NET Core 1.0. Saying that, the external libraries that we will use need some time to catch up—so they might work with PCL but not with .NET Core 1.0 yet. To make things realistic, the demonstration projects will use .NET 4.6 on ASP.NET 4.x for existing (Brownfield) applications. New (Greenfield) applications will be a mixture of a UWA using PCL and ASP.NET 5.0 applications. Why write your own? It seems like all of the major software companies are pitching machine learning services such as Google Analytics, Amazon Machine Learning Services, IBM Watson, Microsoft Cortana Analytics, to name a few. In addition, major software companies often try to sell products that have a machine learning component, such as Microsoft SQL Server Analysis Service, Oracle Database Add-In, IBM SPSS, or SAS JMP. I have not included some common analytical software packages such as PowerBI or Tableau because they are more data aggregation and report writing applications. Although they do analytics, they do not have a machine learning component (not yet at least). With all these options, why would you want to learn how to implement machine learning inside your applications, or in effect, write some code that you can purchase elsewhere? It is the classic build versus buy decision that every department or company has to make. You might want to build because: You really understand what you are doing and you can be a much more informed consumer and critic of any given machine learning package. In effect, you are building your internal skill set that your company will most likely prize. Another way to look at it, companies are not one tool away from purchasing competitive advantage because if they were, their competitors could also buy the same tool and cancel any advantage. However, companies can be one hire away or more likely one team away to truly have the ability to differentiate themselves in their market. You can get better performance by executing locally, which is especially important for real-time machine learning and can be implemented in disconnected or slow connection scenarios. This becomes particularly important when we start implementing machine learning with Internet of Things (IoT) devices in scenarios where the device has a lot more RAM than network bandwidth. Consider the Raspberry Pi running Windows 10 on a pipeline. Network communication might be spotty, but the machine has plenty of power to implement ML models. You are not beholden to any one vendor or company, for example, every time you implement an application with a specific vendor and are not thinking about how to move away from the vendor, you make yourself more dependent on the vendor and their inevitable recurring licensing costs. The next time you are talking to the CTO of a shop that has a lot of Oracle, ask him/her if they regret any decision to implement any of their business logic in Oracle databases. The answer will not surprise you. 
A majority of this book's code is written in F#—an open source language that runs great on Windows, Linux, and OS X. You can be much more agile and have much more flexibility in what you implement. For example, we will often re-train our models on the fly and when you write your own code, it is fairly easy to do this. If you use a third-party service, they may not even have API hooks to do model training and evaluation, so near-time model changes are impossible. Once you decide to go native, you have a choice of rolling your own code or using some of the open source assemblies out there. This book will introduce both the techniques to you, highlight some of the pros and cons of each technique, and let you decide how you want to implement them. For example, you can easily write your own basic classifier that is very effective in production but certain models, such as a neural network, will take a considerable amount of time and energy and probably will not give you the results that the open source libraries do. As a final note, since the libraries that we will look at are open source, you are free to customize pieces of it—the owners might even accept your changes. However, we will not be customizing these libraries in this book. Why open data? Many books on machine learning use datasets that come with the language install (such as R or Hadoop) or point to public repositories that have considerable visibility in the data science community. The most common ones are Kaggle (especially the Titanic competition) and the UC Irvine's datasets. While these are great datasets and give a common denominator, this book will expose you to datasets that come from government entities. The notion of getting data from government and hacking for social good is typically called open data. I believe that open data will transform how the government interacts with its citizens and will make government entities more efficient and transparent. Therefore, we will use open datasets in this book and hopefully you will consider helping out with the open data movement. Why F#? As we will be on the .NET Framework, we could use either C#, VB.NET, or F#. All three languages have strong support within Microsoft and all three will be around for many years. F# is the best choice for this book because it is unique in the .NET Framework for thinking in the scientific method and machine learning model creation. Data scientists will feel right at home with the syntax and IDE (languages such as R are also functional first languages). It is the best choice for .NET business developers because it is built right into Visual Studio and plays well with your existing C#/VB.NET code. The obvious alternative is C#. Can I do this all in C#? Yes, kind of. In fact, many of the .NET libraries we will use are written in C#. However, using C# in our code base will make it larger and have a higher chance of introducing bugs into the code. At certain points, I will show some examples in C#, but the majority of the book is in F#. Another alternative is to forgo .NET altogether and develop the machine learning models in R and Python. You could spin up a web service (such as AzureML), which might be good in some scenarios, but in disconnected or slow network environments, you will get stuck. Also, assuming comparable machines, executing locally will perform better than going over the wire. When we implement our models to do real-time analytics, anything we can do to minimize the performance hit is something to consider. 
A third alternative that the .NET developers will consider is to write the models in T-SQL. Indeed, many of our initial models have been implemented in T-SQL and are part of the SQL Server Analysis Server. The advantage of doing it on the data server is that the computation is as close as you can get to the data, so you will not suffer the latency of moving large amount of data over the wire. The downsides of using T-SQL are that you can't implement unit tests easily, your domain logic is moving away from the application and to the data server (which is considered bad form with most modern application architecture), and you are now reliant on a specific implementation of the database. F# is open source and runs on a variety of operating systems, so you can port your code much more easily. Getting ready for Machine Learning In this section, we will install Visual Studio, take a quick lap around F#, and install the major open source libraries that we will be using. Setting up Visual Studio To get going, you will need to download Visual Studio on a Microsoft Windows machine. As of this writing, the latest (free) version is Visual Studio 2015 Community. If you have a higher version already installed on your machine, you can skip this step. If you need a copy, head on over to the Visual Studio home page at https://www.visualstudio.com. Download the Visual Studio Community 2015 installer and execute it. Now, you will get the following screen: Select Custom installation and you will be taken to the following screen: Make sure Visual F# has a check mark next to it. Once it is installed, you should see Visual Studio in your Windows Start menu. Learning F# One of the great features about F# is that you can accomplish a whole lot with very little code. It is a very terse language compared to C# and VB.NET, so picking up the syntax is a bit easier. Although this is not a comprehensive introduction, this is going to introduce you to the major language features that we will use in this book. I encourage you to check out http://www.tryfsharp.org/ or the tutorials at http://fsharpforfunandprofit.com/ if you want to get a deeper understanding of the language. With that in mind, let's create our 1st F# project: Start Visual Studio. Navigate to File | New | Project as shown in the following screenshot: When the New Project dialog box appears, navigate the tree view to Visual F# | Windows | Console Application. Have a look at the following screenshot: Give your project a name, hit OK, and the Visual Studio Template generator will create the following boilerplate: Although Visual Studio created a Program.fs file that creates a basic console .exe application for us, we will start learning about F# in a different way, so we are going to ignore it for now. Right-click in the Solution Explorer and navigate to Add | New Item. When the Add New Item dialog box appears, select Script File. The Script1.fsx file is then added to the project. Once Script1.fsx is created, open it up, and enter the following into the file: let x = "Hello World" Highlight that entire row of code, right-click and select Execute In Interactive (or press Alt + Enter). And the F# Interactive console will pop up and you will see this: The F# Interactive is a type of REPL, which stands for Read-Evaluate-Print-Loop. If you are a .NET developer who has spent any time in SQL Server Management Studio, the F# Interactive will look very familiar to the Query Analyzer where you enter your code at the top and see how it executes at the bottom. 
Also, if you are a data scientist using R Studio, you are very familiar with the concept of a REPL. I have used the words REPL and FSI interchangeably in this book. There are a couple of things to notice about this first line of F# code you wrote. First, it looks very similar to C#. In fact, consider changing the code to this: It would be perfectly valid C#. Note that the red squiggly line, showing you that the F# compiler certainly does not think this is valid. Going back to the correct code, notice that type of x is not explicitly defined. F# uses the concept of inferred typing so that you don't have to write the type of the values that you create. I used the term value deliberately because unlike variables, which can be assigned in C# and VB.NET, values are immutable; once bound, they can never change. Here, we are permanently binding the name x to its value, Hello World. This notion of immutability might seem constraining at first, but it has profound and positive implications, especially when writing machine learning models. With our basic program idea proven out, let's move it over to a compliable assembly; in this case, an .exe that targets the console. Highlight the line that you just wrote, press Ctrl + C, and then open up Program.fs. Go into the code that was generated and paste it in: [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" 0 // return an integer exit code Then, add the following lines of code around what you just added: // Learn more about F# at http://fsharp.org // See the 'F# Tutorial' project for more help. open System [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" Console.WriteLine(x) let y = Console.ReadKey() 0 // return an integer exit code Press the Start button (or hit F5) and you should see your program run: You will notice that I had to bind the return value from Console.ReadKey() to y. In C# or VB.NET, you can get away with not handling the return value explicitly. In F#, you are not allowed to ignore the returned values. Although some might think this is a limitation, it is actually a strength of the language. It is much harder to make a mistake in F# because the language forces you to address execution paths explicitly versus accidentally sweeping them under the rug (or into a null, but we'll get to that later). In any event, let's go back to our script file and enter in another line of code: let ints = [|1;2;3;4;5;6|] If you send that line of code to the REPL, you should see this: val ints : int [] = [|1; 2; 3; 4; 5; 6|] This is an array, as if you did this in C#: var ints = new[] {1,2,3,4,5,6}; Notice that the separator is a semicolon in F# and not a comma. This differs from many other languages, including C#. The comma in F# is reserved for tuples, not for separating items in an array. We'll discuss tuples later. Now, let's sum up the values in our array: let summedValue = ints |> Array.sum While sending that line to the REPL, you should see this: val summedValue : int = 21 There are two things going on. We have the |> operator, which is a pipe forward operator. If you have experience with Linux or PowerShell, this should be familiar. However, if you have a background in C#, it might look unfamiliar. The pipe forward operator takes the result of the value on the left-hand side of the operator (in this case, ints) and pushes it into the function on the right-hand side (in this case, sum). The other new language construct is Array.sum. 
Array is a module in the core F# libraries, which has a series of functions that you can apply to your data. The function sum, well, sums the values in the array, as you can probably guess by inspecting the result. So, now, let's add a different function from the Array type: let multiplied = ints |> Array.map (fun i -> i * 2) If you send it to the REPL, you should see this: val multiplied : int [] = [|2; 4; 6; 8; 10; 12|] Array.map is an example of a high ordered function that is part of the Array type. Its parameter is another function. Effectively, we are passing a function into another function. In this case, we are creating an anonymous function that takes a parameter i and returns i * 2. You know it is an anonymous function because it starts with the keyword fun and the IDE makes it easy for us to understand that by making it blue. This anonymous function is also called a lambda expression, which has been in C# and VB.NET since .Net 3.5, so you might have run across it before. If you have a data science background using R, you are already quite familiar with lambdas. Getting back to the higher-ordered function Array.map, you can see that it applies the lambda function against each item of the array and returns a new array with the new values. We will be using Array.map (and its more generic kin Seq.map) a lot when we start implementing machine learning models as it is the best way to transform an array of data. Also, if you have been paying attention to the buzz words of map/reduce when describing big data applications such as Hadoop, the word map means exactly the same thing in this context. One final note is that because of immutability in F#, the original array is not altered, instead, multiplied is bound to a new array. Let's stay in the script and add in another couple more lines of code: let multiplyByTwo x = x * 2 If you send it to the REPL, you should see this: val multiplyByTwo : x:int -> int These two lines created a named function called multiplyByTwo. The function that takes a single parameter x and then returns the value of the parameter multiplied by 2. This is exactly the same as our anonymous function we created earlier in-line that we passed into the map function. The syntax might seem a bit strange because of the -> operator. You can read this as, "the function multiplyByTwo takes in a parameter called x of type int and returns an int." Note three things here. Parameter x is inferred to be an int because it is used in the body of the function as multiplied to another int. If the function reads x * 2.0, the x would have been inferred as a float. This is a significant departure from C# and VB.NET but pretty familiar for people who use R. Also, there is no return statement for the function, instead, the final expression of any function is always returned as the result. The last thing to note is that whitespace is important so that the indentation is required. If the code was written like this: let multiplyByTwo(x) = x * 2 The compiler would complain: Script1.fsx(8,1): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (7:1). Since F# does not use curly braces and semicolons (or the end keyword), such as C# or VB.NET, it needs to use something to separate code. That separation is whitespace. Since it is good coding practice to use whitespace judiciously, this should not be very alarming to people having a C# or VB.NET background. If you have a background in R or Python, this should seem natural to you. 
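As a quick aside on the type inference rule mentioned above, you can confirm it in the REPL with a hypothetical variant of the function (this snippet is only for illustration and is not part of the book's project):

let multiplyByTwoFloat x = x * 2.0

Sending it to the F# Interactive should print something like val multiplyByTwoFloat : x:float -> float, showing that x is now inferred as a float purely because it is multiplied by the float literal 2.0 in the function body.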
Since multiplyByTwo is the functional equivalent of the lambda created in Array.map (fun i -> i * 2), we can do this if we want: let multiplied' = ints |> Array.map (fun i -> multiplyByTwo i) If you send it to the REPL, you should see this: val multiplied' : int [] = [|2; 4; 6; 8; 10; 12|] Typically, we will use named functions when we need to use that function in several places in our code and we use a lambda expression when we only need that function for a specific line of code. There is another minor thing to note. I used the tick notation for the value multiplied when I wanted to create another value that was representing the same idea. This kind of notation is used frequently in the scientific community, but can get unwieldy if you attempt to use it for a third or even fourth (multiplied'''') representation. Next, let's add another named function to the REPL: let isEven x = match x % 2 = 0 with | true -> "even" | false -> "odd" isEven 2 isEven 3 If you send it to the REPL, you should see this: val isEven : x:int -> string This is a function named isEven that takes a single parameter x. The body of the function uses a pattern-matching statement to determine whether the parameter is odd or even. When it is odd, then it returns the string odd. When it is even, it returns the string even. There is one really interesting thing going on here. The match statement is a basic example of pattern matching and it is one of the coolest features of F#. For now, you can consider the match statement much like the switch statement that you may be familiar within R, Python, C#, or VB.NET. I would have written the conditional logic like this: let isEven' x = if x % 2 = 0 then "even" else "odd" But I prefer to use pattern matching for this kind of conditional logic. In fact, I will attempt to go through this entire book without using an if…then statement. With isEven written, I can now chain my functions together like this: let multipliedAndIsEven = ints |> Array.map (fun i -> multiplyByTwo i) |> Array.map (fun i -> isEven i) If you send it to REPL, you should see this: val multipliedAndIsEven : string [] = [|"even"; "even"; "even"; "even"; "even"; "even"|] In this case, the resulting array from the first pipe Array.map (fun i -> multiplyByTwo i))gets sent to the next function Array.map (fun i -> isEven i). This means we might have three arrays floating around in memory: ints which is passed into the first pipe, the result from the first pipe that is passed into the second pipe, and the result from the second pipe. From your mental model point of view, you can think about each array being passed from one function into the next. In this book, I will be chaining pipe forwards frequently as it is such a powerful construct and it perfectly matches the thought process when we are creating and using machine learning models. You now know enough F# to get you up and running with the first machine learning models in this book. I will be introducing other F# language features as the book goes along, but this is a good start. As you will see, F# is truly a powerful language where a simple syntax can lead to very complex work. Third-party libraries The following are a few third-party libraries that we will cover in our book later on: Math.NET Math.NET is an open source project that was created to augment (and sometimes replace) the functions that are available in System.Math. Its home page is http://www.mathdotnet.com/. 
We will be using Math.Net's Numerics and Symbolics namespaces in some of the machine learning algorithms that we will write by hand. A nice feature about Math.Net is that it has strong support for F#. Accord.NET Accord.NET is an open source project that was created to implement many common machine learning models. Its home page is http://accord-framework.net/. Although the focus of Accord.NET was for computer vision and signal processing, we will be using Accord.Net extensively in this book as it makes it very easy to implement algorithms in our problem domain. Numl Numl is an open source project that implements several common machine learning models as experiments. Its home page is http://numl.net/. Numl is newer than any of the other third-party libraries that we will use in the book, so it may not be as extensive as the other ones, but it can be very powerful and helpful in certain situations. Summary We covered a lot of ground in this article. We discussed what machine learning is, why you want to learn about it in the .NET stack, how to get up and running using F#, and had a brief introduction to the major open source libraries that we will be using in this book. With all this preparation out of the way, we are ready to start exploring machine learning. Further resources on this subject: ASP.Net Site Performance: Improving JavaScript Loading [article] Displaying MySQL data on an ASP.NET Web Page [article] Creating a NHibernate session to access database within ASP.NET [article]

Exploring HDFS

Packt
10 Mar 2016
17 min read
In this article by Tanmay Deshpande, the author of the book Hadoop Real World Solutions Cookbook - Second Edition, we'll cover the following recipes: Loading data from a local machine to HDFS Exporting HDFS data to a local machine Changing the replication factor of an existing file in HDFS Setting the HDFS block size for all the files in a cluster Setting the HDFS block size for a specific file in a cluster Enabling transparent encryption for HDFS Importing data from another Hadoop cluster Recycling deleted data from trash to HDFS Saving compressed data in HDFS Hadoop has two important components: Storage: This includes HDFS Processing: This includes Map Reduce HDFS takes care of the storage part of Hadoop. So, let's explore the internals of HDFS through various recipes. (For more resources related to this topic, see here.) Loading data from a local machine to HDFS In this recipe, we are going to load data from a local machine's disk to HDFS. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Performing this recipe is as simple as copying data from one folder to another. There are a couple of ways to copy data from the local machine to HDFS. Using the copyFromLocal command: To copy the file to HDFS, let's first create a directory on HDFS and then copy the file. Here are the commands to do this: hadoop fs -mkdir /mydir1 hadoop fs -copyFromLocal /usr/local/hadoop/LICENSE.txt /mydir1 Using the put command: We will first create the directory, and then put the local file in HDFS: hadoop fs -mkdir /mydir2 hadoop fs -put /usr/local/hadoop/LICENSE.txt /mydir2 You can validate that the files have been copied to the correct folders by listing the files: hadoop fs -ls /mydir1 hadoop fs -ls /mydir2 How it works... When you use the HDFS copyFromLocal or put command, the following things occur: First of all, the HDFS client (the command prompt, in this case) contacts NameNode because it needs to copy the file to HDFS. NameNode then asks the client to break the file into chunks according to the configured cluster block size. In Hadoop 2.X, the default block size is 128MB. Based on the capacity and availability of space in the DataNodes, NameNode decides where these blocks should be copied. Then, the client starts copying data to the specified DataNodes for a specific block. The blocks are copied sequentially one after another. When a single block is copied, the block is sent to the DataNode in packets that are 4MB in size. With each packet, a checksum is sent; once the packet copying is done, it is verified against the checksum to check whether it matches. The packets are then sent to the next DataNode where the block will be replicated. The HDFS client's responsibility is to copy the data only to the first node; the replication is taken care of by the respective DataNodes. Thus, the data block is pipelined from one DataNode to the next. While the block copying and replication are taking place, metadata on the file is updated in NameNode by the DataNodes. Exporting data from HDFS to a local machine In this recipe, we are going to export/copy data from HDFS to the local machine. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Performing this recipe is as simple as copying data from one folder to the other. There are a couple of ways in which you can export data from HDFS to the local machine.
Using the copyToLocal command, you'll get this code: hadoop fs -copyToLocal /mydir1/LICENSE.txt /home/ubuntu Using the get command, you'll get this code: hadoop fs -get /mydir1/LICENSE.txt /home/ubuntu How it works... When you use the HDFS copyToLocal or get command, the following things occur: First of all, the client contacts NameNode because it needs a specific file in HDFS. NameNode then checks whether such a file exists in its FSImage. If the file is not present, an error code is returned to the client. If the file exists, NameNode checks the metadata for the blocks and their replica placements in the DataNodes. NameNode then points the client to the DataNodes from which the blocks can be read one by one. The data is copied directly from the DataNodes to the client machine, and it never goes through NameNode, which avoids bottlenecks. Thus, the file is exported to the local machine from HDFS. Changing the replication factor of an existing file in HDFS In this recipe, we are going to take a look at how to change the replication factor of a file in HDFS. The default replication factor is 3. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Sometimes, there might be a need to increase or decrease the replication factor of a specific file in HDFS. In this case, we'll use the setrep command. This is how you can use the command: hadoop fs -setrep [-R] [-w] <noOfReplicas><path> ... In this command, a path can either be a file or a directory; if it's a directory, the replication factor is set recursively for all the files under it. The -w option makes the command wait until the replication is complete. The -R option is accepted for backward compatibility. First, let's check the replication factor of the file we copied to HDFS in the previous recipe: hadoop fs -ls /mydir1/LICENSE.txt -rw-r--r-- 3 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt Once you list the file, it will show you the read/write permissions on this file, and the very next parameter is the replication factor. We have the replication factor set to 3 for our cluster; hence, the number shown is 3. Let's change it to 2 using this command: hadoop fs -setrep -w 2 /mydir1/LICENSE.txt It will wait till the replication is adjusted. Once done, you can verify this again by running the ls command: hadoop fs -ls /mydir1/LICENSE.txt -rw-r--r-- 2 ubuntu supergroup 15429 2015-10-29 03:04 /mydir1/LICENSE.txt How it works... Once the setrep command is executed, NameNode is notified, and then NameNode decides whether replicas need to be added or removed on certain DataNodes. When you are using the -w option, this process may sometimes take a long time if the file size is big. Setting the HDFS block size for all the files in a cluster In this recipe, we are going to take a look at how to set the block size at the cluster level. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... The HDFS block size is configurable for all files in the cluster or for a single file as well. To change the block size at the cluster level itself, we need to modify the hdfs-site.xml file. By default, the HDFS block size is 128MB. In case we want to modify this, we need to update this property, as shown in the following code.
This property changes the default block size to 64MB: <property> <name>dfs.block.size</name> <value>67108864</value> <description>HDFS Block size</description> </property> If you have a multi-node Hadoop cluster, you should update this file in the nodes, that is, NameNode and DataNode. Make sure you save these changes and restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh This will set the block size for files that will now get added to the HDFS cluster. Make sure that this does not change the block size of the files that are already present in HDFS. There is no way to change the block sizes of existing files. How it works... By default, the HDFS block size is 128MB for Hadoop 2.X. Sometimes, we may want to change this default block size for optimization purposes. When this configuration is successfully updated, all the new files will be saved into blocks of this size. Ensure that these changes do not affect the files that are already present in HDFS; their block size will be defined at the time being copied. Setting the HDFS block size for a specific file in a cluster In this recipe, we are going to take a look at how to set the block size for a specific file only. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... In the previous recipe, we learned how to change the block size at the cluster level. But this is not always required. HDFS provides us with the facility to set the block size for a single file as well. The following command copies a file called myfile to HDFS, setting the block size to 1MB: hadoop fs -Ddfs.block.size=1048576 -put /home/ubuntu/myfile / Once the file is copied, you can verify whether the block size is set to 1MB and has been broken into exact chunks: hdfs fsck -blocks /myfile Connecting to namenode via http://localhost:50070/fsck?ugi=ubuntu&blocks=1&path=%2Fmyfile FSCK started by ubuntu (auth:SIMPLE) from /127.0.0.1 for path /myfile at Thu Oct 29 14:58:00 UTC 2015 .Status: HEALTHY Total size: 17276808 B Total dirs: 0 Total files: 1 Total symlinks: 0 Total blocks (validated): 17 (avg. block size 1016282 B) Minimally replicated blocks: 17 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 1 Average block replication: 1.0 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 3 Number of racks: 1 FSCK ended at Thu Oct 29 14:58:00 UTC 2015 in 2 milliseconds The filesystem under path '/myfile' is HEALTHY How it works... When we specify the block size at the time of copying a file, it overwrites the default block size and copies the file to HDFS by breaking the file into chunks of a given size. Generally, these modifications are made in order to perform other optimizations. Make sure you make these changes, and you are aware of their consequences. If the block size isn't adequate enough, it will increase the parallelization, but it will also increase the load on NameNode as it would have more entries in FSImage. On the other hand, if the block size is too big, then it will reduce the parallelization and degrade the processing performance. Enabling transparent encryption for HDFS When handling sensitive data, it is always important to consider the security measures. Hadoop allows us to encrypt sensitive data that's present in HDFS. In this recipe, we are going to see how to encrypt data in HDFS. 
Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... For many applications that hold sensitive data, it is very important to adhere to standards such as PCI, HIPAA, FISMA, and so on. To enable this, HDFS provides a facility called an encryption zone, a directory in which data is encrypted on write and decrypted on read. To use this encryption facility, we first need to enable the Hadoop Key Management Server (KMS): /usr/local/hadoop/sbin/kms.sh start This will start KMS in the Tomcat web server. Next, we need to append the following properties to core-site.xml and hdfs-site.xml. In core-site.xml, add the following property: <property> <name>hadoop.security.key.provider.path</name> <value>kms://http@localhost:16000/kms</value> </property> In hdfs-site.xml, add the following property: <property> <name>dfs.encryption.key.provider.uri</name> <value>kms://http@localhost:16000/kms</value> </property> Restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh Now, we are all set to use KMS. Next, we need to create a key that will be used for the encryption: hadoop key create mykey This will create a key and save it on KMS. Next, we have to create an encryption zone, which is a directory in HDFS where all the encrypted data is saved: hadoop fs -mkdir /zone hdfs crypto -createZone -keyName mykey -path /zone We will change the ownership to the current user: hadoop fs -chown ubuntu:ubuntu /zone If we put any file into this directory, it will be encrypted on write and decrypted at the time of reading: hadoop fs -put myfile /zone hadoop fs -cat /zone/myfile How it works... There are various levels at which encryption can be applied in order to comply with security standards, for example, application-level, database-level, file-level, and disk-level encryption. HDFS transparent encryption sits between the database-level and file-level encryption. KMS acts as a proxy between HDFS clients and the encryption key provider via HTTP REST APIs. There are two types of keys used for encryption: the Encryption Zone Key (EZK) and the Data Encryption Key (DEK). The EZK is used to encrypt the DEK; the encrypted form is called the Encrypted Data Encryption Key (EDEK), and this is saved on NameNode. When a file needs to be written to the HDFS encryption zone, the client gets the EDEK from NameNode and the EZK from KMS to form the DEK, which is used to encrypt the data and store it in HDFS (the encryption zone). When an encrypted file needs to be read, the client needs the DEK, which is formed by combining the EZK and the EDEK; these are obtained from KMS and NameNode, respectively. Thus, encryption and decryption are handled automatically by HDFS, and the end user does not need to worry about performing them on their own. You can read more on this topic at http://blog.cloudera.com/blog/2015/01/new-in-cdh-5-3-transparent-encryption-in-hdfs/. Importing data from another Hadoop cluster Sometimes, we may want to copy data from one HDFS cluster to another either for development, testing, or production migration. In this recipe, we will learn how to copy data from one HDFS cluster to another. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... Hadoop provides a utility called DistCp, which helps us copy data from one cluster to another.
Using this utility is as simple as copying from one folder to another: hadoop distcp hdfs://hadoopCluster1:9000/source hdfs://hadoopCluster2:9000/target This would use a Map Reduce job to copy data from one cluster to another. You can also specify multiple source files to be copied to the target. There are couple of other options that we can also use: -update: When we use DistCp with the update option, it will copy only those files from the source that are not part of the target or differ from the target. -overwrite: When we use DistCp with the overwrite option, it overwrites the target directory with the source. How it works... When DistCp is executed, it uses map reduce to copy the data and also assists in error handling and reporting. It expands the list of source files and directories and inputs them to map tasks. When copying from multiple sources, collisions are resolved in the destination based on the option (update/overwrite) that's provided. By default, it skips if the file is already present at the target. Once the copying is complete, the count of skipped files is presented. You can read more on DistCp at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html. Recycling deleted data from trash to HDFS In this recipe, we are going to see how recover deleted data from the trash to HDFS. Getting ready To perform this recipe, you should already have a running Hadoop cluster. How to do it... To recover accidently deleted data from HDFS, we first need to enable the trash folder, which is not enabled by default in HDFS. This can be achieved by adding the following property to core-site.xml: <property> <name>fs.trash.interval</name> <value>120</value> </property> Then, restart the HDFS daemons: /usr/local/hadoop/sbin/stop-dfs.sh /usr/local/hadoop/sbin/start-dfs.sh This will set the deleted file retention to 120 minutes. Now, let's try to delete a file from HDFS: hadoop fs -rmr /LICENSE.txt 15/10/30 10:26:26 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 0 minutes. Moved: 'hdfs://localhost:9000/LICENSE.txt' to trash at: hdfs://localhost:9000/user/ubuntu/.Trash/Current We have 120 minutes to recover this file before it is permanently deleted from HDFS. To restore the file to its original location, we can execute the following commands. First, let's confirm whether the file exists: hadoop fs -ls /user/ubuntu/.Trash/Current Found 1 items -rw-r--r-- 1 ubuntu supergroup 15429 2015-10-30 10:26 /user/ubuntu/.Trash/Current/LICENSE.txt Now, restore the deleted file or folder; it's better to use the distcp command instead of copying each file one by one: hadoop distcp hdfs //localhost:9000/user/ubuntu/.Trash/Current/LICENSE.txt hdfs://localhost:9000/ This will start a map reduce job to restore data from the trash to the original HDFS folder. Check the HDFS path; the deleted file should be back to its original form. How it works... Enabling trash enforces the file retention policy for a specified amount of time. So, when trash is enabled, HDFS does not execute any blocks deletions or movements immediately but only updates the metadata of the file and its location. This way, we can accidently stop deleting files from HDFS; make sure that trash is enabled before experimenting with this recipe. Saving compressed data on HDFS In this recipe, we are going to take a look at how to store and process compressed data in HDFS. Getting ready To perform this recipe, you should already have a running Hadoop. How to do it... 
It's always good to use compression while storing data in HDFS. HDFS supports various types of compression algorithms such as LZO, BIZ2, Snappy, GZIP, and so on. Every algorithm has its own pros and cons when you consider the time taken to compress and decompress and the space efficiency. These days people prefer Snappy compression as it aims to achieve a very high speed and reasonable amount compression. We can easily store and process any number of files in HDFS. To store compressed data, we don't need to specifically make any changes to the Hadoop cluster. You can simply copy the compressed data in the same way it's in HDFS. Here is an example of this: hadoop fs -mkdir /compressed hadoop fs –put file.bz2 /compressed Now, we'll run a sample program to take a look at how Hadoop automatically uncompresses the file and processes it: hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /compressed /compressed_out Once the job is complete, you can verify the output. How it works... Hadoop explores native libraries to find the support needed for various codecs and their implementations. Native libraries are specific to the platform that you run Hadoop on. You don't need to make any configurations changes to enable compression algorithms. As mentioned earlier, Hadoop supports various compression algorithms that are already familiar to the computer world. Based on your needs and requirements (more space or more time), you can choose your compression algorithm. Take a look at http://comphadoop.weebly.com/ for more information on this. Summary We covered major factors with respect to HDFS in this article which comprises of recipes that help us to load, extract, import, export and saving data in HDFS. It also covers enabling transparent encryption for HDFS as well adjusting block size of HDFS cluster. Resources for Article: Further resources on this subject: Hadoop and MapReduce [article] Advanced Hadoop MapReduce Administration [article] Integration with Hadoop [article]

Getting Started with Deep Learning

Packt
07 Mar 2016
12 min read
In this article by Joshua F. Wiley, author of the book, R Deep Learning Essentials, we will discuss deep learning, a powerful multilayered architecture for pattern recognition, signal detection, classification, and prediction. Although deep learning is not new, it has gained popularity in the past decade due to the advances in the computational capacity and new ways of efficient training models, as well as the availability of ever growing amount of data. In this article, you will learn what deep learning is. What is deep learning? To understand what deep learning is, perhaps it is easiest to start with what is meant by regular machine learning. In general terms, machine learning is devoted to developing and using algorithms that learn from raw data in order to make predictions. Prediction is a very general term. For example, predictions from machine learning may include predicting how much money a customer will spend at a given company, or whether a particular credit card purchase is fraudulent. Predictions also encompass more general pattern recognition, such as what letters are present in a given image, or whether a picture is of a horse, dog, person, face, building, and so on. Deep learning is a branch of machine learning where a multi-layered (deep) architecture is used to map the relations between inputs or observed features and the outcome. This deep architecture makes deep learning particularly suitable for handling a large number of variables and allows deep learning to generate features as part of the overall learning algorithm, rather than feature creation being a separate step. Deep learning has proven particularly effective in the fields of image recognition (including handwriting as well as photo or object classification) and natural language processing, such as recognizing speech. There are many types of machine learning algorithms. In this article, we are primarily going to focus on neural networks as these have been particularly popular in deep learning. However, this focus does not mean that it is the only technique available in machine learning or even deep learning, nor that other techniques are not valuable or even better suited, depending on the specific task. Conceptual overview of neural networks As their name suggests, neural networks draw their inspiration from neural processes and neurons in the body. Neural networks contain a series of neurons, or nodes, which are interconnected and process input. The connections between neurons are weighted, with these weights based on the function being used and learned from the data. Activation in one set of neurons and the weights (adaptively learned from the data) may then feed into other neurons, and the activation of some final neuron(s) is the prediction. To make this process more concrete, an example from human visual perception may be helpful. The term grandmother cell is used to refer to the concept that somewhere in the brain there is a cell or neuron that responds specifically to a complex and specific object, such as your grandmother. Such specificity would require thousands of cells to represent every unique entity or object we encounter. Instead, it is thought that visual perception occurs by building up more basic pieces into complex representations. 
For example, consider a picture of a square (Figure 1). Rather than our visual system having neurons that are activated only upon seeing the gestalt, or entirety, of a square, we can have cells that recognize horizontal and vertical lines (Figure 2). In this hypothetical case, there may be two neurons, one that is activated when it senses horizontal lines and another that is activated when it senses vertical lines. Finally, a higher-order process recognizes that it is seeing a square when both the lower-order neurons are activated simultaneously.

Neural networks share some of these same concepts, with inputs being processed by a first layer of neurons that may go on to trigger another layer. Neural networks are sometimes shown as graphical models. In Figure 3, inputs are data represented as squares. These may be pixels in an image, different aspects of sounds, or something else. The next layer of hidden neurons consists of neurons that recognize basic features, such as horizontal lines, vertical lines, or curved lines. Finally, the output may be a neuron that is activated by the simultaneous activation of two of the hidden neurons. In this article, observed data or features are depicted as squares, and unobserved or hidden layers as circles (Figure 3).

The term neural networks refers to a broad class of models and algorithms. Hidden neurons are generated based on some combination of the observed data, similar to a basis expansion in other statistical techniques; however, rather than choosing the form of the expansion, the weights used to create the hidden neurons are learned from the data. Neural networks can involve a variety of activation functions, which are transformations of the weighted raw data inputs used to create the hidden neurons. Common choices for activation functions are the sigmoid function, f(x) = 1 / (1 + e^(-x)), and the hyperbolic tangent function, f(x) = tanh(x). Finally, radial basis functions are sometimes used as they are efficient function approximators. Although there are a variety of these, the Gaussian form is common: f(x) = exp(-(x - c)^2 / (2 * s^2)), where c is the center and s the width of the basis function.

In a shallow neural network such as the one shown in Figure 3, with only a single hidden layer, going from the hidden units to the outputs is essentially a standard regression or classification problem. The hidden units can be denoted by h and the outputs by Y. Different outputs can be denoted by subscripts i = 1, ..., k and may represent different possible classifications, such as (in our case) a circle or square. The paths from each hidden unit to each output are the weights, and for the ith output they are denoted by wi. These weights are also learned from the data, just like the weights used to create the hidden layer. For classification, it is common to use a final transformation, the softmax function, softmax(z_i) = e^(z_i) / sum_j e^(z_j), as this ensures that the estimates are positive (using the exponential function) and that the probabilities of being in any given class sum to one. For linear regression, the identity function, which returns its input, is commonly used.

Confusion may arise as to why there are paths between every hidden unit and output as well as every input and hidden unit. These are commonly drawn to represent that, a priori, any of these relations are allowed to exist. The weights must then be learned from the data, with zero or near-zero weights essentially equating to dropping unnecessary relations. This only scratches the surface of the conceptual and practical aspects of neural networks.
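To make the shallow network of Figure 3 concrete, here is a small JavaScript sketch of a single forward pass: two sigmoid hidden neurons act as horizontal- and vertical-line detectors over a 3x3 binary image, and a softmax output scores "square" against "not square". The weights and biases are hand-picked for illustration rather than learned from data, and the code is not taken from the book (which works in R); treat it as a toy model of the idea, not a reference implementation.

function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

function softmax(scores) {
  var exps = scores.map(function(s) { return Math.exp(s); });
  var total = exps.reduce(function(a, b) { return a + b; }, 0);
  return exps.map(function(e) { return e / total; });  // positive values that sum to one
}

// A 3x3 binary image flattened row by row; 1 = dark pixel.
var square = [
  1, 1, 1,
  1, 0, 1,
  1, 1, 1
];

// Hidden neuron 1 responds to the top horizontal line, neuron 2 to the left vertical line.
var hiddenWeights = [
  [1, 1, 1, 0, 0, 0, 0, 0, 0],  // horizontal-line detector
  [1, 0, 0, 1, 0, 0, 1, 0, 0]   // vertical-line detector
];
var hiddenBias = [-2.5, -2.5];  // a detector only fires strongly when its whole line is dark

// Output layer: class 0 ("square") needs both detectors active; class 1 ("not square") is the default.
var outputWeights = [
  [10, 10],  // square
  [0, 0]     // not square
];
var outputBias = [-8, 0];

function forward(pixels) {
  var hidden = hiddenWeights.map(function(w, j) {
    return sigmoid(w.reduce(function(s, wi, i) { return s + wi * pixels[i]; }, hiddenBias[j]));
  });
  var scores = outputWeights.map(function(w, k) {
    return w.reduce(function(s, wj, j) { return s + wj * hidden[j]; }, outputBias[k]);
  });
  return softmax(scores);  // class probabilities: [p(square), p(not square)]
}

console.log(forward(square));                  // roughly [0.99, 0.01]
console.log(forward([0,0,0,0,0,0,0,0,0]));     // blank image: "not square" wins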
For a slightly more in-depth introduction to neural networks, see Chapter 11 of The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009), freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/. Next, we will turn to a brief introduction to deep neural networks.

Deep neural networks

Perhaps the simplest, if not the most informative, definition of a deep neural network (DNN) is that it is a neural network with multiple hidden layers. Although a relatively simple conceptual extension of neural networks, such deep architecture provides valuable advances in the capability of the models as well as new challenges in training them. Using multiple hidden layers allows a more sophisticated build-up from simple elements to more complex ones. When discussing neural networks, we considered the outputs to be whether the object was a circle or a square. In a deep neural network, many circles and squares could be combined to form other, more advanced shapes.

One can consider two complexity aspects of a model's architecture. The first is how wide or narrow it is, that is, how many neurons there are in a given layer. The second is how deep it is, or how many layers of neurons there are. For data that truly has such deep architecture, a DNN can fit it more accurately with fewer parameters than a neural network (NN), because more layers (each with fewer neurons) can be a more efficient and accurate representation; for example, because a shallow NN cannot build more advanced shapes from basic pieces, in order to provide accuracy equal to the DNN it must represent each unique object.

Again considering pattern recognition in images, if we are trying to train a model for text recognition, the raw data may be pixels from an image. The first layer of neurons could be trained to capture different letters of the alphabet, and then another layer could recognize sets of these letters as words. The advantage is that the second layer does not have to learn directly from the pixels, which are noisy and complex. In contrast, a shallow architecture may require far more parameters, as each hidden neuron would have to be capable of going directly from pixels in an image to a complete word, and many words may overlap, creating redundancy in the model.

One of the challenges in training deep neural networks is how to efficiently learn the weights. The models are often complex, and local minima abound, making the optimization problem a challenging one. One of the major advancements came in 2006, when it was shown that Deep Belief Networks (DBNs) could be trained one layer at a time (refer to A Fast Learning Algorithm for Deep Belief Nets, by Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh (2006), at http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf). A DBN is a type of DNN with multiple hidden layers and connections between (but not within) layers; that is, a neuron in layer 1 may be connected to a neuron in layer 2, but may not be connected to another neuron in layer 1. This is essentially the same definition as that of a Restricted Boltzmann Machine (RBM), an example of which is diagrammed in Figure 4, except that an RBM typically has one input layer and one hidden layer.

The restriction of no connections within a layer is valuable as it allows much faster training algorithms to be used, such as the contrastive divergence algorithm. If several RBMs are stacked together, they can form a DBN; essentially, the DBN can then be trained as a series of RBMs. The first RBM layer is trained and used to transform raw data into hidden neurons, which are then treated as a new set of inputs in a second RBM, and the process is repeated until all layers have been trained.
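The greedy, layer-by-layer procedure is easier to see in code than in prose. The JavaScript sketch below trains each layer as a tiny RBM using single-step contrastive divergence (CD-1) and then feeds the hidden activations forward as the inputs of the next layer. The function names, learning rate, epoch count, and toy data are assumptions made purely for illustration; this is not the book's training code (the book relies on R packages), and a practical implementation would use mini-batches, momentum, and far more data.

function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// Train one RBM with single-step contrastive divergence on binary inputs in [0, 1].
function trainRbm(data, nHidden, epochs, lr) {
  var nVisible = data[0].length;
  var W = [], bVis = [], bHid = [];
  for (var i = 0; i < nVisible; i++) {
    bVis.push(0);
    W.push(new Array(nHidden).fill(0));
  }
  for (var j = 0; j < nHidden; j++) { bHid.push(0); }

  function hiddenProbs(v) {
    return bHid.map(function(b, j) {
      var sum = b;
      for (var i = 0; i < nVisible; i++) { sum += v[i] * W[i][j]; }
      return sigmoid(sum);
    });
  }
  function visibleProbs(h) {
    return bVis.map(function(b, i) {
      var sum = b;
      for (var j = 0; j < nHidden; j++) { sum += h[j] * W[i][j]; }
      return sigmoid(sum);
    });
  }

  for (var e = 0; e < epochs; e++) {
    data.forEach(function(v0) {
      var h0 = hiddenProbs(v0);
      var h0Sample = h0.map(function(p) { return Math.random() < p ? 1 : 0; });
      var v1 = visibleProbs(h0Sample);   // one-step reconstruction of the visible units
      var h1 = hiddenProbs(v1);
      // CD-1 update: positive phase <v0 h0> minus negative phase <v1 h1>
      for (var i = 0; i < nVisible; i++) {
        for (var j = 0; j < nHidden; j++) {
          W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j]);
        }
        bVis[i] += lr * (v0[i] - v1[i]);
      }
      for (var j = 0; j < nHidden; j++) { bHid[j] += lr * (h0[j] - h1[j]); }
    });
  }
  // The hidden activations of this layer become the inputs of the next one.
  return { transform: hiddenProbs };
}

// Greedy stacking: train a layer, transform the data, repeat.
function pretrainDbn(data, layerSizes) {
  var layers = [];
  var current = data;
  layerSizes.forEach(function(nHidden) {
    var rbm = trainRbm(current, nHidden, 20, 0.1);
    layers.push(rbm);
    current = current.map(function(v) { return rbm.transform(v); });
  });
  return layers;
}

// Toy usage: four 6-pixel binary "images", pre-trained into two hidden layers of size 4 and 2.
var toyData = [
  [1, 1, 0, 0, 1, 1],
  [0, 0, 1, 1, 0, 0],
  [1, 1, 1, 0, 0, 0],
  [0, 0, 0, 1, 1, 1]
];
var dbn = pretrainDbn(toyData, [4, 2]);
console.log(dbn.length + ' layers pre-trained');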
The benefits of the realization that DBNs could be trained one layer at a time extend beyond just DBNs, however. DBNs are sometimes used as a pre-training stage for a deep neural network. This allows the comparatively fast, greedy layer-by-layer training to be used to provide good initial estimates, which are then refined in the deep neural network using other, slower training algorithms, such as back propagation.

So far we have been primarily focused on feed-forward neural networks, where the results from one layer and neuron feed forward to the next. Before closing this section, two specific kinds of deep neural networks that have grown in popularity are worth mentioning. The first is a Recurrent Neural Network (RNN), where neurons send feedback signals to each other. These feedback loops allow RNNs to work well with sequences. A recent example of an application of RNNs was to automatically generate click-bait such as "One trick to great hair salons don't want you to know" or "Top 10 reasons to visit Los Angeles: #6 will shock you!". RNNs work well for such jobs as they can be seeded from a large initial pool with a few words (even just trending search terms or names) and then predict or generate what the next word should be. This process can be repeated a few times until a short phrase, the click-bait, is generated. This example is drawn from a blog post by Lars Eidnes, available at http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/.

The second type is a Convolutional Neural Network (CNN). CNNs are most commonly used in image recognition. CNNs work by having each neuron respond to overlapping subregions of an image. The benefits of CNNs are that they require comparatively minimal pre-processing yet still do not require too many parameters, thanks to weight sharing (for example, across subregions of an image). This is particularly valuable for images as they are often not consistent. For example, imagine ten different people taking a picture of the same desk. Some may be closer or farther away, or at positions resulting in essentially the same image having different heights, widths, and amounts of image captured around the focal object. As with neural networks, this description only provides the briefest of overviews as to what DNNs are and some of the use cases to which they can be applied.

Summary

This article presented a brief introduction to NNs and DNNs. Using multiple hidden layers, DNNs have been a revolution in machine learning by providing a powerful unsupervised learning and feature-extraction component that can be standalone or integrated as part of a supervised model. There are many applications of such models, and they are increasingly used by large-scale companies such as Google, Microsoft, and Facebook. Examples of tasks for deep learning are image recognition (for example, automatically tagging faces or identifying keywords for an image), voice recognition, and text translation (for example, to go from English to Spanish, or vice versa). Work is being done on text recognition, such as sentiment analysis, which tries to identify whether a sentence or paragraph is generally positive or negative, and which is particularly useful to evaluate perceptions about a product or service.
Imagine being able to scrape reviews and social media for any mention of your product and analyze whether it was being discussed more favorably than the previous month or year!

Resources for Article:

Further resources on this subject:
Dealing with a Mess [article]
Design with Spring AOP [article]
Probability of R? [article]

article-image-introducing-openstack-trove
Packt
02 Mar 2016
17 min read
Save for later

Introducing OpenStack Trove

In this article, Alok Shrivastwa and Sunil Sarat, authors of the book OpenStack Trove Essentials, explain how OpenStack Trove truly and remarkably is a treasure, or collection of valuable things, especially for open source lovers like us; and, of course, it is an apt name for the Database as a Service (DBaaS) component of OpenStack. We shall see why this component shows such potential and is on its way to becoming one of the crucial components in the OpenStack world. In this article, we will cover the following:

DBaaS and its advantages
An introduction to OpenStack's Trove project and its components

Database as a Service

Data is a key component in today's world, and what would applications do without data? Data is very critical, especially in businesses such as the financial sector, social media, e-commerce, healthcare, and streaming media. Storing and retrieving data in a manageable way is absolutely key. Databases, as we all know, have been helping us manage data for quite some time now, and they form an integral part of any application. Also, the data-handling needs of different types of applications are different, which has given rise to an increase in the number of database types. As the overall complexity increases, it becomes increasingly challenging and difficult for database administrators (DBAs) to manage them.

DBaaS is a cloud-based, service-oriented approach to offering databases on demand for storing and managing data. DBaaS offers a flexible and scalable platform that is oriented towards self-service and easy management, particularly in terms of provisioning a business's environment using a database of choice in a matter of a few clicks and minutes, rather than waiting for days or, in some cases, weeks. The fundamental building block of any DBaaS is that it is deployed over a cloud platform, be it public (AWS, Azure, and so on) or private (VMware, OpenStack, and so on). In our case, we are looking at a private cloud running OpenStack. So, to the extent necessary, you might come across references to OpenStack and its other services, on which Trove depends.

XaaS (short for Anything/Everything as a Service, of which DBaaS is one such service) is fast gaining momentum. In the cloud world, everything is offered as a service, be it infrastructure, software, or, in this case, databases. Amazon Web Services (AWS) offers various services around this: the Relational Database Service (RDS) for the RDBMS (short for relational database management system) kind of system; SimpleDB and DynamoDB for NoSQL databases; and Redshift for data warehousing needs.

The OpenStack world was also not untouched by the growing demand for DBaaS, not just from users but also from DBAs, and as a result, Trove made its debut with the OpenStack Icehouse release in April 2014; since then it has been one of the most popular advanced services of OpenStack. It supports several SQL and NoSQL databases and provides full life cycle management of the databases.

Advantages

Now, you must be wondering why we should even consider DBaaS over traditional database management strategies. Here are a few points that might make it worth your time.

Reduced database management costs

In any organization, most of the DBAs' time is wasted in mundane tasks such as creating databases, creating instances, and so on.
They are not able to concentrate on tasks such as fine-tuning SQL queries so that applications run faster, not to mention the time taken to do it all manually (or with a bunch of scripts that need to be fired manually), so this in effect wastes resources in terms of both developers' and DBAs' time. This can be significantly reduced using a DBaaS.

Faster provisioning and standardization

With DBaaS, databases that are provisioned by the system will be compliant with standards, as there is very little human intervention involved. This is especially helpful in heavily regulated industries. As an example, consider the healthcare industry. Its members are bound by regulations such as HIPAA (short for the Health Insurance Portability and Accountability Act of 1996), which enforces certain controls on how data is to be stored and managed. Given this scenario, DBaaS makes the database provisioning process easy and compliant: the process only needs to be qualified once, and then every other database coming out of the automated provisioning system is compliant with the standards or controls set.

Easier administration

Since DBaaS is cloud based, which means there will be a lot of automation, administration becomes that much more automated and easier. Some important administration tasks are backup/recovery and software upgrade/downgrade management. As an example, with most databases, we should be able to push configuration modifications within minutes to all the database instances that have been spun out by the DBaaS system. This ensures that any new standards can easily be implemented.

Scaling and efficiency

Scaling (up or down) becomes immensely easy, and this reduces the resource hogging that developers used to build into their planning for a rainy day that, in most cases, never came. With DBaaS, since you don't commit resources upfront and only scale up or down as and when necessary, resource utilization will be highly efficient.

These are some of the advantages available to organizations that use DBaaS. Some of the concerns and roadblocks for organizations in adopting DBaaS, especially in a public cloud model, are as follows:

Companies don't want sensitive data to leave their premises.
Database access and speed are key to application performance.
Not being able to manage the underlying infrastructure inhibits some organizations from going to a DBaaS model.

In contrast to public cloud-based DBaaS, concerns regarding data security, performance, and visibility reduce significantly in the case of private DBaaS systems such as Trove. In addition, the benefits of a cloud environment are not lost either.

Trove

OpenStack Trove, which was originally called Red Dwarf, is a project that was initiated by HP, and many others contributed to it later on, including Rackspace. The project was in incubation until the Havana release of OpenStack. It was formally introduced in the Icehouse release in April 2014, and its mission is to provide scalable and reliable cloud DBaaS provisioning functionality for relational and non-relational database engines. As of the Liberty release, Trove is considered a big-tent service. Big-tent is a new approach that allows projects to enter the OpenStack code namespace. In order to be a big-tent service, a project only needs to follow some basic rules, which are listed here.
This allows the projects to have access to the shared teams in OpenStack, such as the infrastructure, release management, and documentation teams. The project should:

Align with the OpenStack mission
Subject itself to the rulings of the OpenStack Technical Committee
Support Keystone authentication
Be completely open source and open community based

At the time of writing this article, the adoption and maturity levels are as shown in the accompanying diagram: the age of the project is just 2 years, and it has a 27% adoption rate, meaning 27 of 100 people running OpenStack also run Trove. The maturity index is 1 on a scale of 1 to 5. It is derived from the following five aspects:

The presence of an installation guide
Whether the adoption percentage is greater or less than 75
Stable branches of the project
Whether it supports seven or more SDKs
Corporate diversity in the team working on the project

Without further ado, let's take a look at the architecture that Trove implements in order to provide DBaaS.

Architecture

The Trove project uses some shared components and some dedicated project-related components, as described in the following subsections.

Shared components

The Trove system shares two components with the other OpenStack projects: the backend database (MySQL/MariaDB) and the message bus.

The message bus

The AMQP (short for Advanced Message Queuing Protocol) message bus brokers the interactions between the task manager, API, guest agent, and conductor. This component ensures that Trove can be installed and configured as a distributed system.

MySQL/MariaDB

MySQL or MariaDB is used by Trove to store the state of the system.

API

This component is responsible for providing the RESTful API with JSON and XML support. It can be called the face of Trove to the external world, since all the other components talk to Trove through it. It talks to the task manager for complex tasks, but it can also talk to the guest agent directly to perform simple tasks, such as retrieving users.

The task manager

The task manager is the engine responsible for doing the majority of the work. It is responsible for provisioning instances, managing their life cycle, and performing different operations. The task manager normally sends common commands, which are of an abstract nature; it is the responsibility of the guest agent to read them and issue database-specific commands in order to execute them.

The guest agent

The guest agent runs inside the Nova instances that are used to run the database engines. The agent listens to the messaging bus for its topic and is responsible for actually translating and executing the commands that are sent to it by the task manager for the particular datastore. Different guest agents are required depending on the database engine that needs to be supported. The different guest agents (for example, the MySQL and PostgreSQL guest agents) may even have different capabilities, depending on what is supported by the particular database. This way, different datastores with different capabilities can be supported, and the system is kept extensible.

The conductor

The conductor component is responsible for updating the Trove backend database with the information that the guest agent sends regarding the instances. It eliminates the need for direct database access by all the guest agents for updating information.
Like the guest agent, the conductor also listens to its topic on the messaging bus and performs its functions based on the messages it receives. The following diagram can be used to illustrate the different components of Trove and their interaction with the dependent services.

Terminology

Let's take a look at some of the terminology that Trove uses.

Datastore

Datastore is the term used for the RDBMS or NoSQL database that Trove can manage; it is nothing more than an abstraction of the underlying database engine, for example, MySQL, MongoDB, Percona, Couchbase, and so on.

Datastore version

This is linked to the datastore and defines a set of packages to be installed, or already installed, on an image; as an example, take MySQL 5.5. The datastore version also links to a base image (operating system) that is stored in Glance. The configuration parameters that can be modified depend on the datastore and the datastore version.

Instance

An instance is an instantiation of a datastore version. It runs on OpenStack Nova and uses Cinder for persistent storage. It has a full OS and additionally runs the Trove guest agent.

Configuration group

A configuration group is a bunch of options that you can set. As an example, we can create a group and associate a number of instances with one configuration group, thereby keeping their configurations in sync.

Flavor

The flavor is similar to the Nova machine flavor, but it is just a definition of the memory and CPU requirements for the instance that will run and host the databases. Normally, it's a good idea to have a high memory-to-CPU ratio as a flavor for running database instances.

Database

This is the actual database that the users consume. Several databases can run in a single Trove instance. This is where the actual users or applications connect with their database clients.

The following diagram shows these different terms as a quick summary. Users or applications connect to databases, which reside in instances. The instances run in Nova but are instantiations of the datastore version belonging to a datastore. To explain this a little further, say we have two versions of MySQL that are being serviced. We will have one datastore but two datastore versions; any instantiation of either will be called an instance, while the actual MySQL database that will be used by the application will be called the database (shown as DB in the diagram).

A multi-datastore scenario

One of the important features of the Trove system is that it supports multiple databases to various degrees. In this subsection, we will see how Trove works with multiple Trove datastores. In the following diagram, we have represented all the components of Trove (the API, task manager, and conductor) except the guest agent databases as the Trove Controller. The guest agent code is different for every datastore that needs to be supported, and the guest agent for that particular datastore is installed on the corresponding image of the datastore version. The guest agents by default have to implement some basic actions for the datastore, namely create, resize, and delete, and individual guest agents have extensions that enable them to support additional features just for that datastore. The following diagram should help us understand the command proxy function of the guest agent. Please note that the commands shown are only indicative, and the actual commands will vary.
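The terminology above maps almost one-to-one onto the request that is sent to the Trove API when a new instance is asked for: you name a datastore and datastore version, pick a flavor, specify the Cinder volume size, and optionally seed databases and users. The JavaScript sketch below shows roughly what such a call against the v1.0 REST API looks like. The endpoint address, token handling, and exact payload fields are assumptions for illustration only and vary between OpenStack releases, so check the Trove API reference for your deployment before relying on them.

// Hypothetical endpoint and token; in a real deployment the token comes from Keystone.
var TROVE_ENDPOINT = 'http://controller:8779/v1.0/TENANT_ID';
var AUTH_TOKEN = 'KEYSTONE_TOKEN';

function createInstance() {
  var body = {
    instance: {
      name: 'my-first-db',
      flavorRef: '2',                                 // flavor ID: memory/CPU for the instance
      volume: { size: 2 },                            // Cinder volume size in GB
      datastore: { type: 'mysql', version: '5.6' },   // datastore and datastore version
      databases: [{ name: 'appdb' }],
      users: [{ name: 'appuser', password: 'secret', databases: [{ name: 'appdb' }] }]
    }
  };

  return fetch(TROVE_ENDPOINT + '/instances', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Auth-Token': AUTH_TOKEN
    },
    body: JSON.stringify(body)
  }).then(function(response) {
    return response.json();   // the new instance description, including its build status
  });
}

createInstance().then(function(instance) { console.log(instance); });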
At the time of writing this article, Trove's guest agents are installable only on Linux; hence, only databases on Linux systems are supported. Feature requests (https://blueprints.launchpad.net/trove/+spec/mssql-server-db-support) were created for the ability to create a guest agent for Windows and support Microsoft SQL databases, but they had not been approved at the time of writing and might remain a remote possibility.

Database software distribution support

Trove supports various databases; the following table shows the databases supported by this service at the time of writing. Automated installation is available for all of them, but there is some level of difference in the configuration capabilities of Trove with respect to the different databases. This has a lot to do with the lack of a common configuration base among them. At the time of writing this article, MySQL and MariaDB have the most configuration options available.

Database      Version
MySQL         5.5, 5.6
Percona       5.5, 5.6
MariaDB       5.5, 10.0
Couchbase     2.2, 3.0
Cassandra     2.1
Redis         2.8
PostgreSQL    9.3, 9.4
MongoDB       2.6, 3.0
DB2 Express   10.5
CouchDB       1.6

So, as you can see, almost all the major database applications that can run on Linux are already supported by Trove.

Putting it all together

Now that we have understood the architecture and terminology, let's take a look at the general steps that are followed:

1. Horizon or the Trove CLI requests a new database instance and passes the datastore name and version, along with the flavor ID and volume size, as mandatory parameters. Optional parameters such as the configuration group, AZ, replica-of, and so on can also be passed.
2. The Trove API requests Nova for an instance with the particular image and for a Cinder volume of a specific size to be added to the instance.
3. The Nova instance boots and follows these steps:
   The cloud-init scripts are run (like for all other Nova instances)
   The configuration files (for example, trove-guestagent.conf) are copied down to the instance
   The guest agent is installed
4. The Trove API will also have sent the request to the task manager, which then sends the prepare call to the message bus topic.
5. After booting, the guest agent listens to the message bus for any activities for it to do, and once it finds a message for itself, it processes the prepare command and performs the following functions:
   Installing the database distribution (if not already installed on the image)
   Creating the configuration file with the default configuration for the database engine (with any configuration from the associated configuration groups overriding the defaults)
   Starting the database engine and enabling auto-start
   Polling the database engine for availability (until the database engine is available or the timeout is reached)
   Reporting the status back to the Trove backend using the Trove conductor
6. The Trove manager reports back to the API and the status of the machine is changed.

Use cases

So, if you are wondering where we can use Trove, it fits in rather nicely with the following use cases.

Dev/test databases

Dev/test databases are an absolute killer feature, and almost all companies that start using Trove will definitely use it for their dev/test environments. This provides developers with the ability to freely create and dispose of database instances at will. This ability helps them be more productive and removes any lag between when they want a database and when they get it.
The capability to take a backup, run a database, and restore the backup to another server is especially key when it comes to these kinds of workloads.

Web application databases

Trove is used in production for any database that supports low-risk applications, such as some web applications. With the introduction of different redundancy mechanisms, such as master-slave replication in MySQL, it is becoming suitable for many more production environments.

Features

Trove is moving fast in terms of the features being added in the various releases. In this section, we will take a look at the features of three releases: the current release and the past two.

The Juno release

The Juno release saw a lot of features being added to the Trove system. Here is a non-exhaustive list:

Support for Neutron: we can now use both nova-network and Neutron for networking purposes
Replication: MySQL master/slave replication was added, and the API also allows us to detach a slave so it can be promoted
Clustering: MongoDB cluster support was added
Configuration group improvements: the ability to use a default configuration group for a datastore version was added, which allows us to build the datastore version with a base configuration matching your company standards
Basic error checking was added to configuration groups

The Kilo release

The Kilo release mainly worked on introducing new datastores. The following is the list of major features that were introduced:

Support for the GTID (short for global transaction identifier) replication strategy
New datastores, namely Vertica, DB2, and CouchDB, are supported

The Liberty release

The Liberty release introduced the following features to Trove. This is a non-exhaustive list:

Configuration groups for Redis and MongoDB
Cluster support for Redis and MongoDB
Percona XtraDB cluster support
Backup and restore for a single instance of MongoDB
User and database management for MongoDB
Horizon support for database clusters
A management API for datastores and versions
The ability to deploy Trove instances in a single admin tenant so that the Nova instances are hidden from the user

In order to see all the features introduced in the releases, please look at the release notes, which can be found at these URLs:

Juno: https://wiki.openstack.org/wiki/ReleaseNotes/Juno
Kilo: https://wiki.openstack.org/wiki/ReleaseNotes/Kilo
Liberty: https://wiki.openstack.org/wiki/ReleaseNotes/Liberty

Summary

In this article, we were introduced to the basic concepts of DBaaS and how Trove can help with this. With several changes being introduced and a score of one on five with respect to maturity, it might seem as if it is too early to adopt Trove. However, a lot of companies are giving Trove a go in their dev/test environments as well as for some web databases in production, which is why the adoption percentage is steadily on the rise. A few companies that are using Trove today are giants such as eBay, who run their dev/test databases on Trove; HP Helion Cloud, Rackspace Cloud, and Tesora (which is also one of the biggest contributors to the project) have DBaaS offerings based on the Trove component. Trove is increasingly being used in various companies, and it is helping reduce DBAs' mundane work and improve standardization.

Resources for Article:

Further resources on this subject:
OpenStack Performance, Availability [article]
Concepts for OpenStack [article]
Implementing OpenStack Networking and Security [article]

article-image-responsive-visualizations-using-d3js-and-bootstrap
Packt
01 Mar 2016
21 min read
Save for later

Responsive Visualizations Using D3.js and Bootstrap

In this article by Christoph Körner, the author of the book Learning Responsive Data Visualization, we will design and implement a responsive data visualization using Bootstrap and Media Queries based on real data. We will cover the following topics:

Absolute and relative units in the browser
Drawing charts with percentage values
Adapting charts using JavaScript event listeners
Learning to adapt the resolution of the data
Using Bootstrap's Media Queries
Understanding how to use Media Queries in CSS, LESS, and JavaScript
Learning how to use Bootstrap's grid system

(For more resources related to this topic, see here.)

First, we will discuss the most important absolute and relative units that are available in modern browsers. You will learn the difference between absolute pixels, relative percentages, em, rem, and many more. In the next section, we will take a look at what is really needed for a chart to be responsive. Adapting the width to the parent element is one of the requirements, and you will learn about two different ways to implement this. After this section, you will know when to use percentage values or JavaScript event listeners. We will also take a look at adapting the data resolution, which is another important property of responsive visualizations. In the next section, we will explore Media Queries and understand how we can use them to make viewport-dependent responsive charts. We will take advantage of Bootstrap's definitions of Media Queries for the most common device resolutions and integrate them into our responsive chart using CSS or LESS. Finally, we will also see how to include Media Queries in JavaScript. In the last section, we will take a look at Bootstrap's grid system and learn how to seamlessly integrate it with the charts. This will give us not only great flexibility but will also make it easier to combine multiple charts into one big dashboard application.

Units and lengths in the browser

Creating a responsive design, be it a website or graphics, depends strongly on the units and lengths that a browser can interpret. We can easily create an element that fills the entire width of a container using percentage values that are relative to the parent container, whereas achieving the same result with absolute values could be very tricky. Thus, mastering responsive graphics also means knowing all the absolute and relative units that are available in the browser.

Units for Absolute lengths

The most convenient and popular way in web design and development is to define and measure lengths and dimensions in absolute units, usually in pixels. The reason for this is that designers and developers often want to specify the exact dimensions of an object. The pixel unit, called px, was introduced as a visual unit based on a physical measurement, intended to be read from a device at a distance of approximately one arm's length; however, all modern browsers also allow lengths to be defined in physical units. The following list shows the most common absolute units and their relations to each other:

cm: centimeters (1 cm = 96 px/2.54)
mm: millimeters (1 mm = 1/10th of 1 cm)
in: inches (1 in = 2.54 cm = 96 px)
pt: points (1 pt = 1/72th of 1 in)
px: pixels (1 px = 1/96th of 1 in)

More information on the origin and meaning of the pixel unit can be found in the CSS3 Specification, available at http://www.w3.org/TR/css3-values/#viewport-relative-lengths.
Units for Relative lengths

In addition to absolute lengths, relative lengths, expressed as a percentage of the width or height of a parent element, have also been a common technique to style dynamic elements of web pages. Traditionally, the % unit has always been the unit of choice for this reason. However, with CSS3, a couple of additional relative units have found their way into browsers, for example, to define a length relative to the font size of an element. Here is a list of the relative length units that have been specified in the CSS3 specifications and will soon be available in modern browsers:

%: a percentage of the width/height of the absolute container
em: a factor of the font size of the element
rem: a factor of the font size of the root element
vw: 1% of the viewport's width
vh: 1% of the viewport's height
vmin: 1% of the viewport's smaller dimension (either vw or vh)
vmax: 1% of the viewport's larger dimension (either vw or vh)

I am aware that as web developers, we cannot really take advantage of any technology that will be supported soon; however, I want to point out one unit that will play an important role for future web developers: the rem unit. The rem unit defines the length of an element based on the font size of the root node (the html node in this case), rather than the font size of the current element, as em does. The rem unit is very powerful if we use it to define the lengths and spacings of a layout, because the layout can then also adapt when the user increases the font size of the browser (that is, for readability). I want to mention that Bootstrap 4 will replace all absolute pixel units for Media Queries with rem units for this reason. Looking at the figure on cross-browser compatibility of rem units, we also see that rem units are already supported in all major browsers. I recommend you start replacing all the absolute pixel units of your layout and spacing with rem units.

However, percentage units are not dead; we can still use them when they are appropriate. We will use them later to draw SVG elements with dimensions based on their parent elements' dimensions.

Units for Resolution

To round up this section on relative and absolute units, I want to mention that we can also express resolutions in different units. These resolution units can be used in Media Queries together with the min-resolution or max-resolution attribute:

dpi: dots per inch
dpcm: dots per centimeter
dppx: dots per px unit

Mathematical Expressions

We often have the problem of dealing with rational numbers or expressions in CSS; just imagine defining a 3-column grid with a width of 33% per column, or imagine needing to compute a simple expression in the CSS file. CSS3 provides a simple solution for this:

calc(exp): this computes the mathematical expression exp, which can consist of lengths, values, and the operators +, -, /, and *

Note that the + and - operators must be surrounded by whitespace; otherwise, they will be interpreted as the sign of the second number rather than as an operator. The other two operators, * and /, don't require whitespace, but I encourage you to add it for consistency. We can use these expressions in the following snippets.
.col-4 {
  width: calc(100%/3);
}
.col-sp-2 {
  width: calc(50% - 2em);
}

The preceding examples look great; however, as the figure on cross-browser compatibility of the calc() expression shows, we need to take care of the limitations of browser compatibility.

Responsive charts

Now that we know some basics about absolute and relative units, we can start to define, design, and implement responsive charts. A responsive chart is a chart that automatically adapts its look and feel to the resolution of the user's device; thus, responsive charts need to adapt the following properties:

The dimensions (width and height)
The resolution of data points
The interactions and interaction areas

Adapting the dimensions is the most obvious. The chart should always scale and adapt to the width of its parent element. In the previous section, you learned about relative and absolute lengths, so one might think that simply using relative values for the chart's dimensions would be enough. However, there are multiple ways, each with advantages and disadvantages, to achieve this; in this section, we will discuss three of them.

Adapting the resolution of the data is a little less obvious and often neglected. The resolution of data points (the number of data points per pixel) should adapt, so that we can see more points on a device with a higher resolution and fewer points on a low-resolution screen. In this section, we will see that this can only be achieved using JavaScript event listeners and by redrawing/updating the whole chart manually.

Adapting interactions and interaction areas is important not just for different screen resolutions but also for different devices. We interact differently with a TV than with a computer, and we use different input devices on a desktop and on a mobile phone. The chart should therefore allow interactions and interaction areas that are appropriate for a given device and screen resolution.

Using Relative Lengths in SVG

The first and most obvious solution for adapting the dimensions of a chart to its parent container is the use of relative values for lengths and coordinates. This means that we define the chart once with relative values and the browser takes care of recomputing all the values when the dimension of the parent container changes; no manual redrawing of the chart is required. First, we will add some CSS styles to scale the SVG element to the full width of the parent container:

.chart {
  height: 16rem;
  position: relative;
}
.chart > svg {
  width: 100%;
  height: 100%;
}

Next, we modify all our scales to work on a range of [0, 100] and subtract a padding from both sides:

var xScale = d3.scale.ordinal()
  .domain(flatData.map(xKey))
  .rangeBands([padding, 100 - 2*padding]);

var yScale = d3.scale.linear()
  .domain([0, d3.max(flatData, yKey)])
  .range([100 - 2*padding, padding]);

Finally, we can draw the chart as before, simply adding percentage signs (%) at the end of the attributes to indicate the use of percentage units:

$$bars
  .attr('x', function(d) { return (xScale(d.x) + i*barWidth ) + '%'; })
  .attr('y', function(d) { return yScale(d.y) + '%'; })
  .attr('height', function(d) { return (yScale(0) - yScale(d.y)) + '%'; })
  .attr('width', barWidth + '%')
  .attr('fill', colors(data.key));

Observe that we only slightly modified the code that plots the bars of the bar chart in order to use percentage values as coordinates for the attributes. However, the effect of this small change is enormous.
In the following figure, we can see the result of the chart in a browser window (bar chart using relative lengths). If we now increase the size of the browser, the bar chart scales nicely to the full width of the parent container, as the next figure shows (scaled bar chart using relative lengths).

If you are not impressed by this, you had better be. This is awesome, in my opinion, because it leaves all the hard work of recomputing the SVG element dimensions to the browser. We don't have to care about them, and these native computations give us maximal performance.

The previous example shows how to use percentage values to create a simple bar chart. However, what we didn't explain so far is why we didn't add any axes and labels to the chart. Well, despite the idea that we can exploit native rescaling by the browser, we need to face the limitations of this technique. Relative values are only allowed in standard attributes, such as width, height, x, y, cx, and cy, but not in SVG paths or transform functions.

Conclusion about using Relative Lengths

While this sounds like an excellent solution (and indeed it is wonderful for certain use cases), it has two major drawbacks:

Percentage values are not accepted in the SVG transform attributes or in the d attribute of path elements, only in standard attributes
Because the browser recomputes all the values automatically, we cannot adapt the resolution of the data of the charts

The first point is the biggest drawback: it means we can only position elements using the standard attributes width, height, x, y, cx, cy, and so on. However, we can still draw a bar chart that seamlessly adapts to its parent element without the use of JavaScript event listeners. The second argument doesn't play as big a role as the first one, and it can be circumvented using additional JavaScript event listeners, but I am sure you get the point.

Using the JavaScript Resize event

The last option is to use JavaScript event handlers and redraw the chart manually when the dimensions of the parent container change. Using this technique, we can always measure the width of the parent container (in absolute units) and use this length to update and redraw the chart accordingly. This gives us great flexibility over the data resolution, and we can also adapt the chart to a different aspect ratio when needed.

The Native Resize event

Theoretically, this solution sounds brilliant. We simply watch the parent container, or even the SVG container itself (if it uses a width of 100%), for resize events, and then redraw the chart when the dimensions of the element change. However, there is no native resize event on div or svg elements; modern browsers only support resize events on the window element. Hence, the event triggers only if the dimensions of the browser window change. This also means that we need to clean up listeners once we remove a chart from the page. Although this is a limitation, in most cases we can still use the window resize event to adapt the chart to its parent container; we just have to keep this in mind.
Let's always use the parent container's absolute dimensions for drawing and redrawing the chart; we need to define the following inside a redraw function:

var width = chart.clientWidth;
var height = width / aspectRatio;

Now, we can add a resize event listener to the window element and call the redraw function whenever the window dimensions change:

window.addEventListener('resize', function(event){
  redraw();
});

The benefit of this solution is that we can do everything we want in the redraw function, for example, modifying the aspect ratio, adapting the labels of the axis, or modifying the number of displayed elements. The following figure shows a resized version of the previous chart (resized chart with adapted axis ticks); we observe that this time, the axis ticks adapt nicely and don't overlap anymore. Moreover, the axis ticks now take the full available space.

Adapting the Resolution of the Data

There is another problem that can be nicely solved using these types of manual redraws: the problem of data resolution. How much data should be displayed in a small chart, and how much in a bigger chart? I think you agree that in the preceding figure (a small chart with high data resolution), we display too much data for the size of the graphic. This is bad and makes the chart useless. We should really adapt the resolution of the data to the small viewport during the redrawing process. Let's implement a function that returns only every i-th element of an array:

function adaptResolution(data, resolution) {
  resolution = resolution ? Math.ceil(resolution) : 1;
  return data.filter(function(d, i) {
    return i % resolution === 0;
  });
}

Great, let's define a width-dependent data resolution and filter the data accordingly:

var pixelsPerData = 20;
var resolution = pixelsPerData * (flatData.length) / width;

In the previous code, we observe that we can now define the minimum number of pixels that one data point should occupy and remove values accordingly by calling the following:

var flatDataRes = adaptResolution(flatData, resolution);

The following figure shows a small chart with a low number of values, which is perfectly readable even though it is very small (small chart with a proper data resolution). In the next figure, we can see the same chart, based on the same data, drawn in a bigger container. We immediately observe that the data resolution adapts accordingly, and again, the chart looks nice (big chart with a proper data resolution).

Conclusion of using Resize events

This is the most flexible solution, and therefore, in many situations, it is the solution of choice. However, you need to be aware that there are also drawbacks to this solution:

There is no easy way to listen for resize events of the parent container
We need to add event listeners
We need to make sure that event listeners are removed properly
We need to manually redraw the chart

Using Bootstrap's Media Queries

Bootstrap is an awesome library that gets you started quickly with new projects. It not only includes a huge number of useful HTML components but also normalized and standardized CSS styles. One particular style is the implementation of Media Queries for four typical device types (five types in Bootstrap 4). In this section, we will take a look at how to make use of these Media Queries in our styles and scripts.
The great thing about Bootstrap is that it successfully standardizes typical device dimensions for web developers; thus, beginners can simply use them without rethinking over and over which pixel width could be the most common one for tablets.

Media Queries in CSS

The quickest way to use Bootstrap's Media Queries is to simply copy them from the compiled source code. The queries are as follows:

/* Extra small devices (phones, etc. less than 768px) */
/* No media query since this is the default in Bootstrap */

/* Small devices (tablets, etc.) */
@media (min-width: 768px) { ... }

/* Medium devices (desktops, 992px and up) */
@media (min-width: 992px) { ... }

/* Large devices (large desktops, 1200px and up) */
@media (min-width: 1200px) { ... }

We can easily add these queries to our CSS styles and define certain properties and styles for our visualizations, such as four predefined widths, aspect ratios, spacing, and so on, in order to adapt the chart appearance to the device type of the user. Bootstrap 4 is currently in alpha; however, I think you can already start using its predefined device types in your CSS. The reason I am strongly arguing for Bootstrap 4 is its shift towards em units instead of pixels:

// Extra small devices (portrait phones, etc.)
// No media query since this is the default in Bootstrap

// Small devices (landscape phones, etc.)
@media (min-width: 34em) { ... }

// Medium devices (tablets, etc.)
@media (min-width: 48em) { ... }

// Large devices (desktops, etc.)
@media (min-width: 62em) { ... }

// Extra large devices (large desktops, etc.)
@media (min-width: 75em) { ... }

Once again, the huge benefit of this is that the layout can adapt when the user increases the font size of the browser, for example, to enhance readability.

Media Queries in LESS/SASS

In Bootstrap 3, you can include Media Query mixins in your LESS file, which then gets compiled to plain CSS. To use these mixins, you have to create a LESS file instead of CSS and import the Bootstrap variables.less file. In this file, Bootstrap defines all its dimensions, colors, and other variables. Let's create a style.less file and import variables.less:

// style.less
@import "bower_components/bootstrap/less/variables.less";

Perfect, that's all. Now, we can go ahead and start using Bootstrap's device types in our LESS file:

/* Extra small devices (phones, etc. less than 768px) */
/* No media query since this is the default in Bootstrap */

/* Small devices (tablets, etc.) */
@media (min-width: @screen-sm-min) { ... }

/* Medium devices (desktops, etc.) */
@media (min-width: @screen-md-min) { ... }

/* Large devices (large desktops, etc.) */
@media (min-width: @screen-lg-min) { ... }

Finally, we need to use a LESS compiler to transform our style.less file into plain CSS. To achieve this, we run the following command from the terminal:

lessc styles.less styles.css

As we can see, the command requires the LESS compiler lessc to be installed. If it's not yet installed on your system, go ahead and install it using the following command:

npm install -g less

If you are new to LESS, I recommend you read through the LESS documentation at http://lesscss.org/. Once you have checked out LESS, you can also look at the very similar SASS format, which is favored by Bootstrap 4. You can find the SASS documentation at http://sass-lang.com/. We can use the Bootstrap 4 Media Queries in a SASS file with the following mixins:
@include media-breakpoint-up(xs) { ... }
@include media-breakpoint-up(sm) { ... }
@include media-breakpoint-up(md) { ... }
@include media-breakpoint-up(lg) { ... }
@include media-breakpoint-up(xl) { ... }

In my opinion, including Bootstrap's LESS/SASS mixins in the styles of your visualization is the cleanest solution, because you always compile your CSS from the latest Bootstrap source and you don't have to copy CSS into your project.

Media Queries in JavaScript

Another great possibility for using Bootstrap's Media Queries to adapt your visualization to the user's device is to use them directly in JavaScript. The native window.matchMedia(mediaQuery) function gives you the same control over your JavaScript as Media Queries give us over CSS. Here is a little example of how to use it:

if (window.matchMedia("(min-width: 1200px)").matches) {
  /* the viewport is at least 1200 pixels wide */
} else {
  /* the viewport is less than 1200 pixels wide */
}

In the preceding code, we see that this function is quite easy to use and adds almost infinite customization possibilities to our visualization. More information about the matchMedia function can be found on the Mozilla website at https://developer.mozilla.org/de/docs/Web/API/Window/matchMedia.

However, apart from using the matchMedia function directly, we could also use a wrapper around the native API call. I can really recommend the enquire.js library by Nick Williams, which allows you to declare event listeners for viewport changes. It can be installed via the package manager bower by running the following command from the terminal:

bower install enquire

Then, we need to add enquire.js to the website and use it as in the following snippet:

enquire.register("screen and (min-width:1200px)", {
  // triggers when the media query matches.
  match : function() {
    /* the viewport is at least 1200 pixels wide */
  },
  // optional; triggers when the media query transitions
  unmatch : function() {
    /* the viewport is less than 1200 pixels wide */
  },
});

In the preceding code, we see that we can now add the match and unmatch listeners in almost the same way as listening for resize events, just much more flexibly. More information about enquire.js can be found on the GitHub page of the project at https://github.com/WickyNilliams/enquire.js. If we would like to use the Bootstrap device types, we could easily implement them (as needed) with enquire.js and trigger events for each device type. However, I prefer being very flexible and using the bare wrapper.

Using Bootstrap's Grid System

Another great and quick way of making your charts responsive and playing nicely with Bootstrap is to integrate them into Bootstrap's grid system. The best and cleanest integration, however, is to separate concerns and make the visualization as general and adaptive as possible. Let's take our bar chart example with the custom resize events and integrate it into a simple grid layout.
As usual, you can find the full source code of the example in the code examples:

<div class="container">
  <div class="row">
    <div class="col-md-8">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <h2>My Dashboard</h2>
      <p>This is a simple dashboard</p>
    </div>
  </div>
  <div class="row">
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-4">
      <div class="chart" data-url="…" …></div>
    </div>
  </div>
  <div class="row">
    <div class="col-md-6">
      <div class="chart" data-url="…" …></div>
    </div>
    <div class="col-md-6">
      <div class="chart" data-url="…" …></div>
    </div>
  </div>
</div>

We observe that, by making use of the parent container's width, we can simply add the charts as div elements in the columns of the grid. This is the preferred integration, where two components play together nicely but are not dependent on each other. In the following figure, we can see a screenshot of the simple dashboard that we just built (a simple dashboard using Bootstrap's grid layout). We observe that the visualizations already fit nicely into our grid layout, which makes it easy to compose them together.

Summary

In this article, you learned the essentials about absolute and relative units for defining lengths in a browser. We remember that the em and rem units play an important role because they allow a layout to adapt when a user increases the font size of the website. Then, you learned how to use relative units and JavaScript resize events to adapt the chart size and the data resolution according to the current container size. We looked into Media Queries in CSS, LESS, and JavaScript. Finally, we saw how to integrate charts with Bootstrap's grid system and implemented a simple Google Analytics-like dashboard with multiple charts.

article-image-breaking-bank
Packt
01 Mar 2016
32 min read
Save for later

Breaking the Bank

Packt
01 Mar 2016
32 min read
This article by Jon Jenkins, author of the book Learning Xero, covers the Xero core bank functionalities, including one of the most innovative tools of our time: automated bank feeds. We will walk through how to set up the different types of bank account you may have and the most efficient way to reconcile your bank accounts. If they don't reconcile, you will be shown how you can spot and correct any errors. Automated bank feeds have revolutionized the way in which a bank reconciliation is carried out and the speed at which you can do it. Thanks to Xero, there is no longer an excuse to not keep on top of your bank accounts and therefore maintain accurate and up-to-date information for your business's key decision makers. These are the topics we'll be covering in this article: Setting up a bank feed Using rules to speed up the process Completing a bank reconciliation Dealing with common bank reconciliation errors (For more resources related to this topic, see here.) Bank overview Reconciling bank accounts has never been as easy or as quick, and we are only just at the beginning of this journey as Xero continues to push the envelope. Xero is working with banks to not only bring bank data into your accounting system but to also push it back again. That's right; you could mark a supplier invoice paid in Xero and it could actually send the payment from your bank account. Dashboard When you log in to Xero, you are presented with the dashboard, which gives a great overview of what is going on within the business. It is also an excellent place to navigate to the main parts of Xero that you will need. If you have several bank accounts, the summary pane that shows the cash in and out during a month, as shown below, is very useful as you can hover over the bar chart to get a quick snapshot of the total cash in and out for the month with no effort at all. If you want the chart to sit at the top of your dashboard, click on Edit Dashboard at the bottom of the page, and drag and drop the chart. When finished, click on Done at the bottom of the page to lock it in place. Reconciling bank accounts is fundamental to good bookkeeping, and only once the accounts have been reconciled do you know that your records are up-to-date. It isn't worth spending lots of time looking at reports if the bank accounts haven't been reconciled, as there may be important items missing from your records. By default, all bank accounts added will be shown on your dashboard, which shows the main account information, the number of unreconciled items, and the balance per Xero and per statement. You may wish to just see a few key bank accounts, in which case you can turn some of them off by going to Accounts | Bank Accounts, where you will see a list of all bank accounts. Here, you can choose to remove the bank accounts showing on the dashboard by unchecking the Show account on Dashboard option. You can also choose the Change order option, which allows you to decide the order in which you see the bank accounts on the dashboard. Click on the up and down arrows to move the accounts as required. Bank feeds If you did not set up a bank account when you were setting up Xero, then we recommend you do that now, as you cannot set up a bank feed without one. You can do this from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts | Bank Accounts and Add Bank Account. Then, you are presented with a list of options, including Bank Account, Credit Card, or PayPal.
Enter your account details as requested and click on Save. It is very important at this stage to note that you may be presented with several options for your bank as they offer different accounts. If you choose the wrong one, your feed will not work. Some banks charge for a direct bank feed; you do not have to use this, so ignore the feeds ending with Direct Bank Feed and select the alternative one. The difference between a Direct Bank Feed and the Yodlee service that Xero uses is that the data comes directly from the bank and not via a third party, so it is deemed to be more reliable. Now that you have a bank account, you can add the feed by clicking on the Get bank feeds button, as shown in the following screenshot: On the Activate Bank Feed page, the fields will vary depending on the bank you use and the type of account you selected earlier. Enter the User Credentials as requested. You will then see a screen with the text Connecting to Bank, which states it might take a few minutes, so please bear with it; you are almost there. When prompted, select the bank account from the dropdown called Select the matching bank feed... that matches the account you are adding and choose whether you wish to import from a certain date or collect as much data as possible. How far it goes back varies by bank. If you are converting from another system, it would be wise to use the conversion date as the date from which you wish to start transactions, in order to avoid bringing in transactions processed in your old system (that is, if your conversion date was May 31, 2015, you would use June 1, 2015). Once you are happy with the information provided, click on OK. If you have several accounts, such as a savings account, then simply follow the process again for each account. Refresh feed Each bank feed is different, and some will automatically update; others, however, require refreshing. You can usually tell which bank accounts require a manual refresh, as they will show the following message at the bottom of the bank account tile on the dashboard: To refresh the feed from the dashboard, find the account to update and click the Manage button in the top right-hand corner and then Refresh Bank Feed. You can also do this from within Accounts | Bank Accounts | Manage Accounts | Refresh Bank Feed. To update your bank feed, you will need to refresh the feed each time you want to reconcile the bank account. Import statements You get over most disappointments in life, unlike when you find out that the bank account you have does not have a bank feed available. Your options here are simple: go and change banks. But if that is too much hassle for you, then you could always just import a file. Xero accepts OFX, QIF, and CSV. Should your bank offer a selection, go with them in this order. OFX and QIF files are formatted and should import without too many problems. CSV, on the other hand, is a different matter. Each bank CSV download will come in a different format and will need some work before it is ready for importing. This takes some time, so I would recommend using the Xero Help Guide and searching Import a Bank Statement to get the file formatted correctly. If you only do things on a monthly basis, uploading a statement is not too much of a chore. We would say at this point that the automated bank feed is one of the most revolutionary things to come out of accounting software, so not using it is a crime. You simply are not enjoying the full benefits of using Xero and cloud software without it.
Petty cash It probably costs the business more to find missing petty cash discrepancies than the discrepancy totals. Our advice is simple: try not to maintain a petty cash account if you can; it is just one more thing to reconcile and manage. We would advocate using a company debit card where possible, as the transactions will then go through your main bank account and you will know what is missing. Get staff to submit an expense claim, and if that is too much hassle, treat payments in and out as if they have gone through the director's loan account, as that is what happens in most small businesses. Should you wish to operate a petty cash account, you will need to mark the transactions as reconciled manually, as there is no feed to match bank payments and receipts against. In order to do this, you must first turn on the ability to mark transactions as reconciled manually. This can be found hiding in the Help section. When you click on Help, you should then see an option to Enable Mark as Reconciled, which you will need to click to turn on this functionality. Now that you have the ability to Mark as Reconciled, you can begin to reconcile the petty cash. Go to Manage | Reconcile Account. You will be presented with the four tabs below (if Cash Coding is not turned on for your user role, you will not see that tab). The Reconcile tab should be blank, as there is no feed or import in place. You will want to go to Account transactions, which is where the items you have marked as paid from petty cash will live. Underneath this section, you will also find a button called New Transactions, where you can create transactions on the fly that you may have missed or are not going to raise a sales invoice or supplier bill for. You can see from the following screenshot that we have added an example of a Spend Money transaction, but you can also create a Receive Money transaction when clicking on the New Transaction button. Click on Save when you have finished entering the details of your transaction. If you have outstanding sales invoices or purchase invoices that have been paid through the petty cash account, then you will need to mark them as paid using the petty cash bank account in order for them to appear in the Account Transactions section. To do this, navigate to those items and then complete the Make a Payment or Receive a Payment section in the bottom left-hand corner. From the main Account transactions screen, you can then mark your transactions as reconciled. Check off the items you wish to reconcile using the checkboxes on the left, then click on More | Mark as Reconciled. When you have completed the process, your bank account balance in Xero should match that of your petty cash sheet. The status of transactions marked as Reconciled Manually will change to Reconciled in black. When a transaction is reconciled, it has come in from either an import or a bank feed, and it will be green. If it is unreconciled, it will be orange. Loan account A loan account works in the same way as a bank account, and we would recommend that you set it up if a feed is available from your bank. Managing loans in Xero is easy, as you can set up a bank rule for the interest and use the Transfer facility to reconcile any loan repayments. Credit cards Like adding bank accounts, you can add credit cards from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts |Bank Accounts |Add Bank Account | Credit Card. 
Add a card You may see several options for the bank you use, so double-check you are using the right option or the feed will not work. If there is no feed set up for your particular bank or credit card account, you will be notified as follows: In this case, you will need to either import a statement in the OFX, QIF, or CSV format, or manually reconcile your credit card account. The process is the same as that detailed above for reconciling petty cash and will be matched against your credit card statement. If a feed is available, enter the login credentials requested to set up the feed in the same fashion as when adding a new bank account feed. Common issues Credit cards act in a different way than a bank account, in that each card is an account on its own, separate from the main account to which interest and balance payments are allocated. This means that even if you have just one business credit card, you will, in effect, have two accounts with the bank. You can add a feed for each account if you wish, but for the main account, the only transactions that will go through it are any interest accrued and balance payments to clear the card. We would suggest you set up the credit card account as a feed, as this is where you will see most transactions and therefore save the most time in processing. Each time interest is accrued, you will need to post it as a Spend Money transaction, and each time you make a payment, the amount will be a Transfer from one of your other accounts. Both these transactions will need to be marked as Reconciled Manually, as they will not appear on your credit card feed setup. This is done in the same way as outlined in the Petty cash section earlier. PayPal Just like a bank account, you can sync Xero with your PayPal account, even in multiple currencies. The ability to do this, coupled with using bank rules, can help supercharge your processing ability and cut down on posting errors. Add a feed There is a little bit of configuration required to set up a PayPal account. Go to Accounts | Bank Accounts | Add Bank Account | PayPal. Add the account name as you wish for it to appear in Xero and the currency for the account. To use the feed (why wouldn't you?), check the Set up automatic PayPal import option, which will then bring up the other options shown in the following screenshot: As previously suggested, if you are converting from another system, then import transactions from the conversion date, as all previous transactions should be dealt with in your old accounting system. Click on Save, and you will receive an e-mail from Xero to confirm your PayPal e-mail address. Click on the activation link in the e-mail. To complete the setup process, you need to update your PayPal settings in order for Xero to turn on the automatic feeds. In PayPal, go to My Account | Profile | My Selling Tools. Next to the API access, click on Update | Option 1. The box should then be Grant API Permission. In the Third Party Permission field, enter paypal_api1.xero.com, then click on Lookup. Under Available Permissions, make sure you check the following options: Click on Add, and you have finished the setup process. If you have multiple currency accounts, then complete this process for each currency. Bank rules Bank rules give you the ability to automate the processing of recurring payments and receipts based on your chosen criteria, and they can be a massive time saver if you have many transactions to deal with. An example would be the processing of PayPal fees on your PayPal account.
Rather than having to process each one, you could set up a bank rule to deal with the transaction. Bank rules cannot be used for bank transfers or allocating supplier payment or customer receipts. Add bank rules You can add a bank rule directly from the bank statement line by clicking on Create rule above the bank line details. This means waiting for something to come through the bank first, which we think makes sense, as that detail is taken into consideration when setting up the bank rule, making it simpler to set up. You can also enter bank rules you know will need adding by going to the relevant bank account and clicking Manage Account | Bank Rules | Create Rule. We have broken the bank rule down into different sections. Section 1 allows you to set the conditions that must be present in order for the bank rule to trigger. Using equals means the payee or description in this example must match exactly. If you were to change it to contains, then only part of the description need be present. This can be very useful when the description contains a reference number that changes each month. You do not want the bank rule to fail, so you might choose to remove the reference number and change the condition to contains instead. You must set at least one condition. Section 2 allows you to set a contact, which we suggest you do; otherwise, you will have to do this on the Bank Account screen each time before being able to reconcile that bank statement line. Section 3 allows you to fix a value to an account code. This can be useful if the bank rule you are setting up contains an element of a fixed amount and variable amount. An example might be a telephone bill where the line rental is fixed and the balance is for call charges that will vary month by month. Section 4 allows you to allocate a percentage to an account code. If there is not a fixed value amount in section 3, you can just use section 4 and post 100% of the cost to the account code of your choice. Likewise, if you had a bank statement line that you wanted to split between account codes, then you could do so by entering a second line and using the percentage column. Section 5 allows you to set a reference to be used when the bank rule runs and there are five options. We would suggest not using the by me during bank rec option, as this again creates extra work, since you will have to fill it in each time before you can reconcile that bank statement line. Section 6 allows you to choose which bank account you want the bank rule to run on. This is useful if you start paying for an item out of a different bank account, as you can edit the rule and change the bank account rather than having to create the rule all over again. Section 7 allows you to set a title for the bank rule. Use something that will make it easy for you to remember when on the Bank Rules screen. Edit bank rules If your bank rules are not firing the way you expected or at all, then you will want to edit them to get them right. It is worth spending time in this area, as once you have mastered setting up bank rules, they will save you time. To edit a bank rule, you will need to navigate to Accounts | Bank Accounts | Manage Accounts | Bank Rules. Click on the bank rule you wish to edit, make the necessary adjustments, and then click on Save. You will know if the bank rule is working, as it appears like the following when reconciling your bank account. 
If you do not wish to apply the rule, you can click on Don't apply rule in the bottom-left corner, or if you wish to check what it is going to do, click on View details first to verify the bank rule is going to post where you prefer. Re-order bank rules The order in which your bank rules sit is the order in which Xero runs them. This is important to remember if you have bank rules set up that may conflict with each other and not return the result you were expecting. An example might be you purchasing different items from a supermarket, such as petrol and stationery. In this example, we will call it Xeroco. In most instances, the bank statement line will show two different descriptions or references, in this case Xeroco for the main store and Xeroco Fuel for the gas station. You will need to set up your rules carefully, as using only contains for Xeroco will mean that your postings could end up going to the wrong account code. You would want the Xeroco Fuel bank rule to sit above the Xeroco rule. Because they both contain the same description, if Xeroco was first, it would always trigger and everything would get posted to stationery, including the petrol. If you set Xeroco Fuel as the first bank rule to run, then when the bank statement line does not contain both words, Xero will move on and run the Xeroco rule instead, which would prove successful. Gas will get posted to fuel and stationery will get posted to stationery. You can drag and drop bank rules to change the order in which they run. Hover over the two dots to the left of the bank rule number and you can drag them to the appropriate position. Bank reconciliation Bank reconciliation is one of the main drivers in knowing when your books and records are up-to-date. The introduction of automated bank feeds has revolutionized the way in which we can complete a bank reconciliation, which is the process of matching what has gone through the bank account and what has been posted in your accounting system. Below are some ways to utilize all the functionality in the Xero toolkit. Auto Suggest Xero is an intuitive system; it learns how you process transactions and is also able to make suggestions based on what you have posted. As shown below, Xero has found a match for the bank statement line on the left, which is why the right-hand panel is now green and you can see the OK button to reconcile the transaction, provided you are happy it is the correct selection. You can choose to turn Auto Suggest off. At the bottom of each page in the bank screen, you will find a checkbox, as shown in the following screenshot. Simply uncheck the Suggest previous entries box to turn it off. The more you use Xero, the better it learns, so we would advise sticking with it. It is not a substitute for checking, however, so please check before hitting the OK button. Find & Match When Xero cannot find a match and you know the bank statement line in question probably has an associated invoice or bill in the system, you can use Find & Match in the upper right-hand corner, as shown in the following screenshot: You can look through the list of unreconciled bank transactions shown in the panel or you can opt to use the Search facility and search by name, reference, or amount. In this example, you can see that we have now found two transactions from SMART Agency that total the £4,500 spent. As you can see in the following screenshot, there is also an option next to the monetary amounts that will allow you to split the transaction.
This is useful if someone has not paid the invoice in full. If the amount received was only £2,500 in total, for example, you could use Split to allocate £1,000 against the first transaction and £1,500 against the second transaction. When checked off, these turn green, and you can click on OK to reconcile that bank statement line. If you cannot find a match, you will need to investigate what it relates to and if you are missing some paperwork. If we had been clever when making the original supplier payment, we could have used the Batch Payment option in Accounts | Purchases | Awaiting Payment, checking off the items that make up the amount paid, and Batch Payment would have enabled us to tell Xero that there was a payment made totaling £4,500. Auto Suggest would have picked this up, making the reconciliation easier and quicker. You have already done the hard bit by working out how much to pay suppliers; you don't want to have to do it again when an amount comes through the bank and you can't remember what it was for. It is also good practice, as it means you will not inadvertently pay the same supplier again since the bill will be marked as paid. The same can be done for customer invoices using the Deposit button. This is very helpful when receiving check deposits or remittance advice well in advance of the actual receipt. By marking the invoices as paid, you will not chase customers for money unnecessarily, causing bad feelings along the way. Create There will be occasions when you will not have a bill or invoice in Xero against which to reconcile the bank statement line. In these situations, you will need to create a transaction to clear the bank statement line. In the example below, you can see that we have entered who the contact is, what the cost relates to, and added why we spent the money. Xero will now allow us to clear the bank statement line, as the OK button is visible. We would suggest that this option be used sparingly, as you should have paperwork posted into Xero in the form of a bill or invoice to deal with the majority of your bank statement lines. Bank transfers When you receive money in from or transfer money out to another bank account you have set up within Xero, there is a very simple way to deal with those transactions. Click on the Transfer tab and choose the bank account from the dropdown. You will then be able to reconcile that bank statement line. In the account that you have made the transfer to, you will find that Xero will make the auto suggest for you when you reconcile that bank account. Discuss and comments When you are performing the bank reconciliation, you may find that you get stuck on a bank statement line and genuinely do not know what to do with it. This is where the Discuss tab, shown in the following screenshot, can help: You can simply enter a note to yourself, for someone else in the business, or for your advisor to take a look at. Don't forget to click on Save when you are done. If someone can answer your query, they can then enter their comment in the Discuss tab and save it. Note that at present there is no notification process when you save a comment in the Discuss tab, so you are reliant on someone regularly checking it. You will see something similar to the note underneath the business name when you log in to Xero, so you can see there is a comment that needs action. Reconciliation Report This is the major tool in your armory to check whether your bank reconciles.
There is no greater feeling in life than your bank account reconciling and there being no unpresented items left hanging around. To run the report from within a bank account, click on the Reconciliation Report button next to the Manage Account button. From here, you can choose which bank account you wish to run the report from, so you do not need to keep moving between the accounts, and also a date option as you will probably want to run the report to various dates, especially if you encounter a problem. On the reconciliation report, you will see the balance in Xero, which is the cashbook balance (that is what would be in the bank if all the outstanding payments and receipts posted in Xero cleared and all the bank statement lines that have come from the bank feed were processed). The outstanding payments and receipts are invoices and bills you have marked as paid in Xero but have not been matched to a bank statement line yet. You need to keep an eye on these, as older unreconciled items would normally indicate a misallocation or uncleared item. Plus Un-Reconciled Bank Statement Lines are those items that have come through on a feed but have not yet been allocated to something in Xero. This might mean that there are missing bills or sales invoices in Xero, for example. Statement Balance is the number that should match what is on your online or paper statement, whether it is a bank account, credit card, or PayPal account. If the figures do not match, then it will need investigating. In the next section, we have highlighted some of the things that may have caused the imbalance and some ideas of what to do to rectify the situation. Manual reconciliation If you are unable to set up a bank feed or import a statement, then you can still reconcile bank accounts in Xero; it just feels a bit like going back in time. To complete a manual reconciliation, you will need to follow the same process as used to process petty cash, as discussed earlier in this article. Common errors and corrections Despite all the innovation and technological advances Xero has made in this area, there are still things that can go wrong—some human, some machine. The main thing is to recognize this and know how to deal with it in the event that it happens. We have highlighted some of the more common issues and resolutions in the following subsections. Account not reconciling There is no greater feeling than when you get that big green checkmark telling you that you have reconciled all your transactions. Fantastic job done, you think! But not quite. You need to check your bank statement, credit card statement, loan statement, or PayPal statement to make sure it definitely matches as per the preceding bank reconciliation report section. The job's not done until you have completed the manual check. Duplicated statement lines With all things technology, there is a chance that things can go wrong, and every now and again, you may find that your bank feed has duplicated a line item. This is why it is so important to check the actual statement against that in Xero. It is the only way to truly know if the accounts reconcile. A direct bank feed that costs money is deemed to be more robust, and some feeds through Yodlee are better than others. It all depends on your bank, so it is worth checking with the Xero community to get some guidance from fellow Xeroes. 
If you are convinced that you have a duplicated bank statement line, you can choose to delete it by finding the offending item in your account and clicking on the cross in the top left-hand corner of the bank statement line. When you hover over the cross, the item will turn red. Use this sparingly and only when you know you have a duplicated line. Missing statement lines As with duplicated statement lines, there is also the possibility of a bank statement line not being synced, and this can be picked up when checking that the Xero bank account figure matches that of your online or paper bank statement. If they do not match, then we recommend using the reconciliation report and working backwards month by month and then week by week to try and isolate the date at which the bank last reconciled. Once you have narrowed it down to a week, you can then start doing it day by day until you find the date, and then check off the items in Xero against those on your bank statement until you find the missing items. If there are several missing items, we would probably suggest doing an import via OFX, QIF, or CSV, but if there are only a few, then it would probably be best to enter them manually and then mark them as reconciled so the bank will reconcile. Remove & Redo We know you are great at what you do, but everyone has an off day. If you have allocated something incorrectly, you can easily amend it. Auto Suggest is fantastic, but you may get carried away and just keep hitting the OK button without paying enough attention. This is particularly problematic for businesses dealing with lots of invoices for similar amounts. If you do find that you have made an error, then you can remove the original allocation and redo it. You can do this by going to Accounts | Bank Accounts | Manage Account | Reconcile Account | Account Transactions. As you can see in the following screenshot, once you have found the offending item, you can check it off and then click on Remove & Redo. You will also find on the right-hand side a Search button, which will allow you to search for particular transactions rather than having to scroll through endless pages. If you happen to be in the actual bank transaction when you identify a problem, then you can click on Options | Remove & Redo. This will then push the transaction back into the bank account screen to reconcile again. Note that if you Remove & Redo a manually entered bank transaction, it will not reappear in the bank account for you to reconcile, as it was never there in the first place. What you will need to do is post the payment or receipt against the correct invoice or bill, and then manually mark it as reconciled again or create another spend or receive money transaction. Manually marked as reconciled A telltale sign that someone has inadvertently caused a problem is when you look at the Account Transactions tab in the bank account and there is a sea of green and reconciled statuses, and then you spot the odd black reconciled status. This is an indication that something has been posted to that account and marked as reconciled manually. This will need investigating, as it may be genuine, such as some missing bank statement lines, or it could be that someone has made a mistake and it needs to be removed. Understanding reports If the bank account does not reconcile and it is not something obvious, then we would suggest looking at the bank statements imported into Xero to see if there are any obvious problems.
Go to Accounts | Bank Accounts | Manage Account | Bank Statements. From this screen, have a look to see if there is any overlap of dates imported and then drill into the statements to check for anything that doesn't look right. If you do come across duplicated lines, you can remove them by checking off the box on the left and then clicking on the Delete button. You can see below that the bank statement line has been grayed out and has a status of Deleted next to it. If you later discover that you have made a mistake, then you can restore the bank statement line by checking off the box again but clicking on Restore this time. If you have incorrectly imported a duplicate statement, or the bank feed has done so, then rather than deleting the transactions one by one, you can choose to delete the entire statement. This can be achieved by clicking on the Delete Entire Statement button at the bottom-left of the Bank Statements screen: Make sure you have checked the Also delete reconciled transactions for this statement option before clicking Delete. If you are deleting the statement because it is incorrect, it only makes sense that you also clear any transactions associated with this statement to avoid further issues. Summary We have successfully added bank feeds that are now ready for automating the bank reconciliation process. In this article, we ran through the major bank functions and set up your bank feeds, exploring how to set up bank rules to make the bank reconciliation task even easier and quicker. On top of that, we also explored the possibilities of what could go wrong, but more importantly, how to identify errors and put them right. One of the biggest bookkeeping tasks you will undertake should now seem a lot easier.
Building a Recommendation Engine with Spark

Packt
24 Feb 2016
44 min read
In this article, we will explore individual machine learning models in detail, starting with recommendation engines. (For more resources related to this topic, see here.) Recommendation engines are probably among the best types of machine learning model known to the general public. Even if people do not know exactly what a recommendation engine is, they have most likely experienced one through the use of popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. Recommendations are a core part of all these businesses, and in some cases, they drive significant percentages of their revenue. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid in the discovery process (in this way, it is similar and, in fact, often complementary to search engines, which also play a role in discovery). However, unlike search engines, recommendation engines try to present people with relevant content that they did not necessarily search for or that they might not even have heard of. Typically, a recommendation engine tries to model the connections between users and some type of item. If we can do a good job of showing our users movies related to a given movie, we could aid in discovery and navigation on our site, again improving our users' experience, engagement, and the relevance of our content to them. However, recommendation engines are not limited to movies, books, or products. The techniques we will explore in this article can be applied to just about any user-to-item relationship as well as user-to-user connections, such as those found on social networks, allowing us to make recommendations such as people you may know or who to follow. Recommendation engines are most effective in two general scenarios (which are not mutually exclusive). They are explained here: Large number of available options for users: When there are a very large number of available items, it becomes increasingly difficult for the user to find something they want. Searching can help when the user knows what they are looking for, but often, the right item might be something previously unknown to them. In this case, being recommended relevant items that the user may not already know about can help them discover new items. A significant degree of personal taste involved: When personal taste plays a large role in selection, recommendation models, which often utilize a wisdom of the crowd approach, can be helpful in discovering items based on the behavior of others that have similar taste profiles. In this article, we will: Introduce the various types of recommendation engines Build a recommendation model using data about user preferences Use the trained model to compute recommendations for a given user as well as compute similar items for a given item (that is, related items) Apply standard evaluation metrics to the model that we created to measure how well it performs in terms of predictive capability Types of recommendation models Recommender systems are widely studied, and there are many approaches used, but there are two that are probably most prevalent: content-based filtering and collaborative filtering. Recently, other approaches such as ranking models have also gained in popularity. In practice, many approaches are hybrids, incorporating elements of many different methods into a model or combination of models.
Content-based filtering Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to generate items similar to a given item. These attributes are often textual content (such as titles, names, tags, and other metadata attached to an item), or in the case of media, they could include other features of the item, such as attributes extracted from audio and video content. In a similar manner, user recommendations can be generated based on attributes of users or user profiles, which are then matched to item attributes using the same measure of similarity. For example, a user can be represented by the combined attributes of the items they have interacted with. This becomes their user profile, which is then compared to item attributes to find items that match the user profile. Collaborative filtering Collaborative filtering is a form of wisdom of the crowd approach where the set of preferences of many users with respect to items is used to generate estimated preferences of users for items with which they have not yet interacted. The idea behind this is the notion of similarity. In a user-based approach, if two users have exhibited similar preferences (that is, patterns of interacting with the same items in broadly the same way), then we would assume that they are similar to each other in terms of taste. To generate recommendations for unknown items for a given user, we can use the known preferences of other users that exhibit similar behavior. We can do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if users with similar tastes have shown a preference for a set of items, these items would tend to be good candidates for recommendation. We can also take an item-based approach that computes some measure of similarity between items. This is usually based on the existing user-item preferences or ratings. Items that tend to be rated the same by similar users will be classed as similar under this approach. Once we have these similarities, we can represent a user in terms of the items they have interacted with and find items that are similar to these known items, which we can then recommend to the user. Again, a set of items similar to the known items is used to generate a combined score that estimates the preference for an unknown item. The user- and item-based approaches are usually referred to as nearest-neighbor models, since the estimated scores are computed based on the set of most similar users or items (that is, their neighbors). Finally, there are many model-based methods that attempt to model the user-item preferences themselves so that new preferences can be estimated directly by applying the model to unknown user-item combinations. Matrix factorization Since Spark's recommendation models currently only include an implementation of matrix factorization, we will focus our attention on this class of models. This focus is with good reason; these types of models have consistently been shown to perform extremely well in collaborative filtering and were among the best models in well-known competitions such as the Netflix prize. For more information on and a brief overview of the performance of the best algorithms for the Netflix prize, see http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html.
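Before moving on to the factorization models, the user-based nearest-neighbor idea described above can be made concrete with a minimal sketch in plain Scala; the user item sets and IDs below are invented purely for illustration:

// Sets of item IDs that two hypothetical users have interacted with.
val itemsOfUserA = Set(1, 2, 3, 5, 8)
val itemsOfUserB = Set(2, 3, 5, 13)

// Jaccard similarity: the size of the intersection divided by the size of the union.
val similarity = itemsOfUserA.intersect(itemsOfUserB).size.toDouble /
  itemsOfUserA.union(itemsOfUserB).size.toDouble
// Here, 3 shared items out of 6 distinct items gives a similarity of 0.5.

Users with the highest such similarity form the neighborhood whose known preferences are combined to score the items the target user has not yet seen.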
Explicit matrix factorization When we deal with data that consists of preferences of users that are provided by the users themselves, we refer to explicit preference data. This includes, for example, ratings, thumbs up, likes, and so on that are given by users to items. We can take these ratings and form a two-dimensional matrix with users as rows and items as columns. Each entry represents a rating given by a user to a certain item. Since in most cases, each user has only interacted with a relatively small set of items, this matrix has only a few non-zero entries (that is, it is very sparse). As a simple example, let's assume that we have the following user ratings for a set of movies: Tom, Star Wars, 5 Jane, Titanic, 4 Bill, Batman, 3 Jane, Star Wars, 2 Bill, Titanic, 3 We will form the following ratings matrix: A simple movie-rating matrix Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique. If we have U users and I items, then our user-item matrix is of dimension U x I and might look something like the one shown in the following diagram: A sparse ratings matrix If we want to find a lower dimension (low-rank) approximation to our user-item matrix with the dimension k, we would end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we would reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense, as shown in the following diagram: The user- and item-factor matrices These models are often also called latent feature models, as we are trying to discover some form of hidden features (which are represented by the factor matrices) that account for the structure of behavior inherent in the user-item rating matrix. While the latent features or factors are not directly interpretable, they might, perhaps, represent things such as the tendency of a user to like movies from a certain director, genre, style, or group of actors, for example. As we are directly modeling the user-item matrix, the prediction in these models is relatively straightforward: to compute a predicted rating for a user and item, we compute the vector dot product between the relevant row of the user-factor matrix (that is, the user's factor vector) and the relevant row of the item-factor matrix (that is, the item's factor vector). This is illustrated with the highlighted vectors in the following diagram: Computing recommendations from user- and item-factor vectors To find out the similarity between two items, we can use the same measures of similarity as we would use in the nearest-neighbor models, except that we can use the factor vectors directly by computing the similarity between two item-factor vectors, as illustrated in the following diagram: Computing similarity with item-factor vectors The benefit of factorization models is the relative ease of computing recommendations once the model is created. However, for very large user and itemsets, this can become a challenge as it requires storage and computation across potentially many millions of user- and item-factor vectors. Another advantage, as mentioned earlier, is that they tend to offer very good performance. 
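To make the prediction step concrete, here is a minimal sketch in plain Scala; the number of latent factors and the factor values are invented purely for illustration:

// Hypothetical user- and item-factor vectors with k = 3 latent factors.
val userFactor = Array(0.2, 0.9, 0.4)
val itemFactor = Array(0.8, 0.1, 0.6)

// The predicted rating (or preference score) is the dot product of the two vectors.
val predictedScore = userFactor.zip(itemFactor).map { case (u, i) => u * i }.sum
// 0.2 * 0.8 + 0.9 * 0.1 + 0.4 * 0.6 = 0.49

Item-to-item similarity works in the same spirit, except that two item-factor vectors are compared using a similarity measure such as cosine similarity rather than combined into a predicted score.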
Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization. On the downside, factorization models are relatively more complex to understand and interpret compared to nearest-neighbor models and are often more computationally intensive during the model's training phase. Implicit matrix factorization So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where the preferences between a user and item are not given to us, but are, instead, implied from the interactions they might have with an item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie). There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C. For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like the ones shown in the following screenshot. Here, the matrix P informs us that a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts—generally, the more a user has watched a movie, the higher the confidence that they actually like it. Representation of an implicit preference and confidence matrix The implicit model still creates a user- and item-factor matrix. In this case, however, the matrix that the model is attempting to approximate is not the overall ratings matrix but the preference matrix P. If we compute a recommendation by calculating the dot product of a user- and item-factor vector, the score will not be an estimate of a rating directly. It will rather be an estimate of the preference of a user for an item (though not strictly between 0 and 1, these scores will generally be fairly close to a scale of 0 to 1). Alternating least squares Alternating Least Squares (ALS) is an optimization technique to solve matrix factorization problems; this technique is powerful, achieves good performance, and has proven to be relatively easy to implement in a parallel fashion. Hence, it is well suited for platforms such as Spark. At the time of writing this, it is the only recommendation model implemented in MLlib. ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations). Spark's documentation for collaborative filtering contains references to the papers that underlie the ALS algorithms implemented for the explicit and implicit data cases. You can view the documentation at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html.
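To give a flavor of what a single ALS update involves, here is a rough sketch of the user-side step for one user, written with Breeze (a linear algebra library that MLlib itself depends on). This is not MLlib's actual distributed implementation, and the toy item factors, ratings, and parameter values are invented for illustration:

import breeze.linalg.{DenseMatrix, DenseVector}

// Toy setup: k = 2 latent factors, a small regularization parameter, the factor
// vectors of the three items this user has rated (one row per item), and the ratings.
val k = 2
val lambda = 0.01
val itemFactors = DenseMatrix((0.9, 0.1), (0.3, 0.7), (0.5, 0.5))
val userRatings = DenseVector(5.0, 3.0, 4.0)

// With the item factors Y held fixed, the user's factor vector x is the solution of a
// regularized least squares problem: (Y'Y + lambda * I) x = Y'r
val YtY = itemFactors.t * itemFactors
val regularized = YtY + (DenseMatrix.eye[Double](k) * lambda)
val userFactor = regularized \ (itemFactors.t * userRatings)

ALS then performs the mirror-image update for each item with the user factors held fixed, and the two steps alternate until the factors stop changing much or a fixed number of iterations is reached.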
Extracting the right features from your data In this section, we will use explicit rating data, without additional user or item metadata or other information related to the user-item interactions. Hence, the features that we need as inputs are simply the user IDs, movie IDs, and the ratings assigned to each user and movie pair. Extracting features from the MovieLens 100k dataset Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the --driver-memory option: >./bin/spark-shell --driver-memory 4g In this example, we will use the same MovieLens dataset. Use the directory in which you placed the MovieLens 100k dataset as the input path in the following code. First, let's inspect the raw ratings dataset: val rawData = sc.textFile("/PATH/ml-100k/u.data") rawData.first() You will see output similar to these lines of code: 14/03/30 11:42:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/03/30 11:42:41 WARN LoadSnappy: Snappy native library not loaded 14/03/30 11:42:41 INFO FileInputFormat: Total input paths to process : 1 14/03/30 11:42:41 INFO SparkContext: Starting job: first at <console>:15 14/03/30 11:42:41 INFO DAGScheduler: Got job 0 (first at <console>:15) with 1 output partitions (allowLocal=true) 14/03/30 11:42:41 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:15) 14/03/30 11:42:41 INFO DAGScheduler: Parents of final stage: List() 14/03/30 11:42:41 INFO DAGScheduler: Missing parents: List() 14/03/30 11:42:41 INFO DAGScheduler: Computing the requested partition locally 14/03/30 11:42:41 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 11:42:41 INFO SparkContext: Job finished: first at <console>:15, took 0.030533 s res0: String = 196  242  3  881250949 Recall that this dataset consisted of the user id, movie id, rating, timestamp fields separated by a tab ("\t") character. We don't need the time when the rating was made to train our model, so let's simply extract the first three fields: val rawRatings = rawData.map(_.split("\t").take(3)) We will first split each record on the "\t" character, which gives us an Array[String] array. We will then use Scala's take function to keep only the first 3 elements of the array, which correspond to user id, movie id, and rating, respectively. We can inspect the first record of our new RDD by calling rawRatings.first(), which collects just the first record of the RDD back to the driver program. This will result in the following output: 14/03/30 12:24:00 INFO SparkContext: Starting job: first at <console>:21 14/03/30 12:24:00 INFO DAGScheduler: Got job 1 (first at <console>:21) with 1 output partitions (allowLocal=true) 14/03/30 12:24:00 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:21) 14/03/30 12:24:00 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:24:00 INFO DAGScheduler: Missing parents: List() 14/03/30 12:24:00 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:24:00 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:24:00 INFO SparkContext: Job finished: first at <console>:21, took 0.00391 s res6: Array[String] = Array(196, 242, 3) We will use Spark's MLlib library to train our model. Let's take a look at what methods are available for us to use and what input is required.
First, import the ALS model from MLlib: import org.apache.spark.mllib.recommendation.ALS On the console, we can inspect the available methods on the ALS object using tab completion. Type in ALS. (note the dot) and then press the Tab key. You should see the autocompletion of the methods: ALS. asInstanceOf    isInstanceOf    main            toString        train           trainImplicit The method we want to use is train. If we type ALS.train and hit Enter, we will get an error. However, this error will tell us what the method signature looks like: ALS.train <console>:12: error: ambiguous reference to overloaded definition, both method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int)org.apache.spark.mllib.recommendation.MatrixFactorizationModel and  method train in object ALS of type (ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating], rank: Int, iterations: Int, lambda: Double)org.apache.spark.mllib.recommendation.MatrixFactorizationModel match expected type ?               ALS.train                   ^ So, we can see that at a minimum, we need to provide the input arguments, ratings, rank, and iterations. The second method also requires an argument called lambda. We'll cover these three shortly, but let's take a look at the ratings argument. First, let's import the Rating class that it references and use a similar approach to find out what an instance of Rating requires, by typing in Rating() and hitting Enter: import org.apache.spark.mllib.recommendation.Rating Rating() <console>:13: error: not enough arguments for method apply: (user: Int, product: Int, rating: Double)org.apache.spark.mllib.recommendation.Rating in object Rating. Unspecified value parameters user, product, rating.               Rating()                     ^ As we can see from the preceding output, we need to provide the ALS model with an RDD that consists of Rating records. A Rating class, in turn, is just a wrapper around user id, movie id (called product here), and the actual rating arguments. We'll create our rating dataset using the map method and transforming the array of IDs and ratings into a Rating object: val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) } Notice that we need to use toInt or toDouble to convert the raw rating data (which was extracted as Strings from the text file) to Int or Double numeric inputs. Also, note the use of a case statement that allows us to extract the relevant variable names and use them directly (this saves us from having to use something like val user = ratings(0)). For more on Scala case statements and pattern matching as used here, take a look at http://docs.scala-lang.org/tutorials/tour/pattern-matching.html. 
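As a quick aside, the following self-contained snippet illustrates how that Array pattern match binds each element to a name; the record values are made up for illustration:

// A single raw record of user id, movie id, and rating (values invented for illustration).
val record = Array("196", "242", "3")

// Matching on Array(user, movie, rating) only succeeds for three-element arrays
// and binds each element to a readable name.
val description = record match {
  case Array(user, movie, rating) => s"user $user rated movie $movie with $rating"
  case _ => "unexpected record format"
}
println(description)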
We now have an RDD[Rating] that we can verify by calling: ratings.first() 14/03/30 12:32:48 INFO SparkContext: Starting job: first at <console>:24 14/03/30 12:32:48 INFO DAGScheduler: Got job 2 (first at <console>:24) with 1 output partitions (allowLocal=true) 14/03/30 12:32:48 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:24) 14/03/30 12:32:48 INFO DAGScheduler: Parents of final stage: List() 14/03/30 12:32:48 INFO DAGScheduler: Missing parents: List() 14/03/30 12:32:48 INFO DAGScheduler: Computing the requested partition locally 14/03/30 12:32:48 INFO HadoopRDD: Input split: file:/Users/Nick/workspace/datasets/ml-100k/u.data:0+1979173 14/03/30 12:32:48 INFO SparkContext: Job finished: first at <console>:24, took 0.003752 s res8: org.apache.spark.mllib.recommendation.Rating = Rating(196,242,3.0) Training the recommendation model Once we have extracted these simple features from our raw data, we are ready to proceed with model training; MLlib takes care of this for us. All we have to do is provide the correctly-parsed input RDD we just created as well as our chosen model parameters. Training a model on the MovieLens 100k dataset We're now ready to train our model! The other inputs required for our model are as follows: rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and to store models for serving, particularly for a large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable. iterations: This refers to the number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models will converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default). lambda: This parameter controls the regularization of our model. Thus, lambda controls overfitting. The higher the value of lambda, the more regularization is applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter is something that should be tuned using out-of-sample test data and cross-validation approaches. We'll use a rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model: val model = ALS.train(ratings, 50, 10, 0.01) This returns a MatrixFactorizationModel object, which contains the user and item factors in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example: model.userFeatures res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] = FlatMappedRDD[659] at flatMap at ALS.scala:231 We can see that the factors are in the form of an Array[Double]. Note that the operations used in MLlib's ALS implementation are lazy transformations, so the actual computation will only be performed once we call some sort of action on the resulting RDDs of the user and item factors.
We can force the computation using a Spark action such as count: model.userFeatures.count This will trigger the computation, and we will see quite a bit of output text similar to the following lines of code: 14/03/30 13:10:40 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 665 (map at ALS.scala:147) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 664 (map at ALS.scala:146) 14/03/30 13:10:40 INFO DAGScheduler: Registering RDD 674 (mapPartitionsWithIndex at ALS.scala:164) ... 14/03/30 13:10:45 INFO SparkContext: Job finished: count at <console>:26, took 5.068255 s res16: Long = 943 If we call count for the movie factors, we will see the following output: model.productFeatures.count 14/03/30 13:15:21 INFO SparkContext: Starting job: count at <console>:26 14/03/30 13:15:21 INFO DAGScheduler: Got job 10 (count at <console>:26) with 1 output partitions (allowLocal=false) 14/03/30 13:15:21 INFO DAGScheduler: Final stage: Stage 165 (count at <console>:26) 14/03/30 13:15:21 INFO DAGScheduler: Parents of final stage: List(Stage 169, Stage 166) 14/03/30 13:15:21 INFO DAGScheduler: Missing parents: List() 14/03/30 13:15:21 INFO DAGScheduler: Submitting Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231), which has no missing parents 14/03/30 13:15:21 INFO DAGScheduler: Submitting 1 missing tasks from Stage 165 (FlatMappedRDD[883] at flatMap at ALS.scala:231) ... 14/03/30 13:15:21 INFO SparkContext: Job finished: count at <console>:26, took 0.030044 s res21: Long = 1682 As expected, we have a factor array for each user (943 factors) and movie (1682 factors). Training a model using implicit feedback data The standard matrix factorization approach in MLlib deals with explicit ratings. To work with implicit data, you can use the trainImplicit method. It is called in a manner similar to the standard train method. There is an additional parameter, alpha, that can be set (and in the same way, the regularization parameter, lambda, should be selected via testing and cross-validation methods). The alpha parameter controls the baseline level of confidence weighting applied. A higher level of alpha tends to make the model more confident about the fact that missing data equates to no preference for the relevant user-item pair. As an exercise, try to take the existing MovieLens dataset and convert it into an implicit dataset. One possible approach is to convert it to binary feedback (0s and 1s) by applying a threshold on the ratings at some level. Another approach could be to convert the ratings' values into confidence weights (for example, perhaps, low ratings could imply zero weights, or even negative weights, which are supported by MLlib's implementation). Train a model on this dataset and compare the results of the following section with those generated by your implicit model. Using the recommendation model Now that we have our trained model, we're ready to use it to make predictions. These predictions typically take one of two forms: recommendations for a given user and related or similar items for a given item. User recommendations In this case, we would like to generate recommended items for a given user. This usually takes the form of a top-K list, that is, the K items that our model predicts will have the highest probability of the user liking them. This is done by computing the predicted score for each item and ranking the list based on this score. The exact method to perform this computation depends on the model involved.
For example, in user-based approaches, the ratings of similar users on items are used to compute the recommendations for a user, while in an item-based approach, the computation is based on the similarity of items the user has rated to the candidate items. In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector. Generating movie recommendations from the MovieLens 100k dataset As MLlib's recommendation model is based on matrix factorization, we can use the factor matrices computed by our model to compute predicted scores (or ratings) for a user. We will focus on the explicit rating case using MovieLens data; however, the approach is the same when using the implicit model. The MatrixFactorizationModel class has a convenient predict method that will compute a predicted score for a given user and item combination: val predictedRating = model.predict(789, 123) 14/03/30 16:10:10 INFO SparkContext: Starting job: lookup at MatrixFactorizationModel.scala:45 14/03/30 16:10:10 INFO DAGScheduler: Got job 30 (lookup at MatrixFactorizationModel.scala:45) with 1 output partitions (allowLocal=false) ... 14/03/30 16:10:10 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.023077 s predictedRating: Double = 3.128545693368485 As we can see, this model predicts a rating of 3.12 for user 789 and movie 123. Note that you might see different results than those shown in this section because the ALS model is initialized randomly. So, different runs of the model will lead to different solutions.  The predict method can also take an RDD of (user, item) IDs as the input and will generate predictions for each of these. We can use this method to make predictions for many users and items at the same time. To generate the top-K recommended items for a user, MatrixFactorizationModel provides a convenience method called recommendProducts. This takes two arguments: user and num, where user is the user ID, and num is the number of items to recommend. It returns the top num items ranked in the order of the predicted score. Here, the scores are computed as the dot product between the user-factor vector and each item-factor vector. Let's generate the top 10 recommended items for user 789: val userId = 789 val K = 10 val topKRecs = model.recommendProducts(userId, K) We now have a set of predicted ratings for each movie for user 789. If we print this out, we could inspect the top 10 recommendations for this user: println(topKRecs.mkString("n")) You should see the following output on your console: Rating(789,715,5.931851273771102) Rating(789,12,5.582301095666215) Rating(789,959,5.516272981542168) Rating(789,42,5.458065302395629) Rating(789,584,5.449949837103569) Rating(789,750,5.348768847643657) Rating(789,663,5.30832117499004) Rating(789,134,5.278933936827717) Rating(789,156,5.250959077906759) Rating(789,432,5.169863417126231) Inspecting the recommendations We can give these recommendations a sense check by taking a quick look at the titles of the movies a user has rated and the recommended movies. First, we need to load the movie data. 
We'll collect this data as a Map[Int, String], mapping the movie ID to the title: val movies = sc.textFile("/PATH/ml-100k/u.item") val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap() titles(123) res68: String = Frighteners, The (1996) For our user 789, we can find out what movies they have rated, take the 10 movies with the highest rating, and then check the titles. We will do this now by first using the keyBy Spark function to create an RDD of key-value pairs from our ratings RDD, where the key will be the user ID. We will then use the lookup function to return just the ratings for this key (that is, that particular user ID) to the driver: val moviesForUser = ratings.keyBy(_.user).lookup(789) Let's see how many movies this user has rated. This will be the size of the moviesForUser collection: println(moviesForUser.size) We will see that this user has rated 33 movies. Next, we will take the 10 movies with the highest ratings by sorting the moviesForUser collection using the rating field of the Rating object. We will then extract the movie title for the relevant product ID attached to the Rating class from our mapping of movie titles and print out the top 10 titles with their ratings: moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println) You will see the following output displayed: (Godfather, The (1972),5.0) (Trainspotting (1996),5.0) (Dead Man Walking (1995),5.0) (Star Wars (1977),5.0) (Swingers (1996),5.0) (Leaving Las Vegas (1995),5.0) (Bound (1996),5.0) (Fargo (1996),5.0) (Last Supper, The (1995),5.0) (Private Parts (1997),4.0) Now, let's take a look at the top 10 recommendations for this user and see what the titles are, using the same approach as the one we used earlier (note that the recommendations are already sorted): topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println) (To Die For (1995),5.931851273771102) (Usual Suspects, The (1995),5.582301095666215) (Dazed and Confused (1993),5.516272981542168) (Clerks (1994),5.458065302395629) (Secret Garden, The (1993),5.449949837103569) (Amistad (1997),5.348768847643657) (Being There (1979),5.30832117499004) (Citizen Kane (1941),5.278933936827717) (Reservoir Dogs (1992),5.250959077906759) (Fantasia (1940),5.169863417126231) We leave it to you to decide whether these recommendations make sense. Item recommendations Item recommendations are about answering the following question: for a certain item, what are the items most similar to it? Here, the precise definition of similarity is dependent on the model involved. In most cases, similarity is computed by comparing the vector representation of two items using some similarity measure. Common similarity measures include Pearson correlation and cosine similarity for real-valued vectors and Jaccard similarity for binary vectors. Generating similar movies for the MovieLens 100K dataset The current MatrixFactorizationModel API does not directly support item-to-item similarity computations. Therefore, we will need to create our own code to do this. We will use the cosine similarity metric, and we will use the jblas linear algebra library (a dependency of MLlib) to compute the required vector dot products. This is similar to how the existing predict and recommendProducts methods work, except that we will use cosine similarity as opposed to just the dot product.
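Before writing the jblas version, it may help to see how a raw dot product differs from its cosine-normalized form. The following is a small, self-contained Scala sketch on two made-up toy vectors (the names v1, v2, dot, and norm are illustrative only and are not used elsewhere in the example):
// Plain Scala: dot product versus cosine similarity for two toy vectors
val v1 = Array(1.0, 2.0, 3.0)
val v2 = Array(2.0, 4.0, 6.0)
def dot(a: Array[Double], b: Array[Double]): Double = a.zip(b).map { case (x, y) => x * y }.sum
def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))
val dotProduct = dot(v1, v2)                     // 28.0 -- grows with the magnitude of the vectors
val cosine = dotProduct / (norm(v1) * norm(v2))  // 1.0 -- length-independent; v2 is just v1 scaled by 2
This normalization is what makes cosine similarity comparable across items whose factor vectors have different magnitudes.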
We would like to compare the factor vector of our chosen item with each of the other items, using our similarity metric. In order to perform linear algebra computations, we will first need to create a vector object out of the factor vectors, which are in the form of an Array[Double]. The JBLAS class, DoubleMatrix, takes an Array[Double] as the constructor argument as follows: import org.jblas.DoubleMatrix val aMatrix = new DoubleMatrix(Array(1.0, 2.0, 3.0)) aMatrix: org.jblas.DoubleMatrix = [1.000000; 2.000000; 3.000000] Note that using jblas, vectors are represented as a one-dimensional DoubleMatrix class, while matrices are a two-dimensional DoubleMatrix class. We will need a method to compute the cosine similarity between two vectors. Cosine similarity is a measure of the angle between two vectors in an n-dimensional space. It is computed by first calculating the dot product between the vectors and then dividing the result by a denominator, which is the norm (or length) of each vector multiplied together (specifically, the L2-norm is used in cosine similarity). In this way, cosine similarity is a normalized dot product. The cosine similarity measure takes on values between -1 and 1. A value of 1 implies completely similar, while a value of 0 implies independence (that is, no similarity). This measure is useful because it also captures negative similarity, that is, a value of -1 implies that not only are the vectors not similar, but they are also completely dissimilar. Let's create our cosineSimilarity function here: def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {   vec1.dot(vec2) / (vec1.norm2() * vec2.norm2()) } Note that we defined a return type for this function of Double. We are not required to do this, since Scala features type inference. However, it can often be useful to document return types for Scala functions. Let's try it out on one of our item factors for item 567. We will need to collect an item factor from our model; we will do this using the lookup method in a similar way that we did earlier to collect the ratings for a specific user. In the following lines of code, we also use the head function, since lookup returns an array of values, and we only need the first value (in fact, there will only be one value, which is the factor vector for this item). Since this will be an Array[Double], we will then need to create a DoubleMatrix object from it and compute the cosine similarity with itself: val itemId = 567 val itemFactor = model.productFeatures.lookup(itemId).head val itemVector = new DoubleMatrix(itemFactor) cosineSimilarity(itemVector, itemVector) A similarity metric should measure how close, in some sense, two vectors are to each other. 
Here, we can see that our cosine similarity metric tells us that this item vector is identical to itself, which is what we would expect: res113: Double = 1.0 Now, we are ready to apply our similarity metric to each item: val sims = model.productFeatures.map{ case (id, factor) =>  val factorVector = new DoubleMatrix(factor)   val sim = cosineSimilarity(factorVector, itemVector)   (id, sim) } Next, we can compute the top 10 most similar items by sorting out the similarity score for each item: // recall we defined K = 10 earlier val sortedSims = sims.top(K)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) In the preceding code snippet, we used Spark's top function, which is an efficient way to compute top-K results in a distributed fashion, instead of using collect to return all the data to the driver and sorting it locally (remember that we could be dealing with millions of users and items in the case of recommendation models). We need to tell Spark how to sort the (item id, similarity score) pairs in the sims RDD. To do this, we will pass an extra argument to top, which is a Scala Ordering object that tells Spark that it should sort by the value in the key-value pair (that is, sort by similarity). Finally, we can print the 10 items with the highest computed similarity metric to our given item: println(sortedSims.take(10).mkString("n")) You will see output like the following one: (567,1.0000000000000002) (1471,0.6932331537649621) (670,0.6898690594544726) (201,0.6897964975027041) (343,0.6891221044611473) (563,0.6864214133620066) (294,0.6812075443259535) (413,0.6754663844488256) (184,0.6702643811753909) (109,0.6594872765176396) Not surprisingly, we can see that the top-ranked similar item is our item. The rest are the other items in our set of items, ranked in order of our similarity metric. Inspecting the similar items Let's see what the title of our chosen movie is: println(titles(itemId)) Wes Craven's New Nightmare (1994) As we did for user recommendations, we can sense check our item-to-item similarity computations and take a look at the titles of the most similar movies. This time, we will take the top 11 so that we can exclude our given movie. So, we will take the numbers 1 to 11 in the list: val sortedSims2 = sims.top(K + 1)(Ordering.by[(Int, Double), Double] { case (id, similarity) => similarity }) sortedSims2.slice(1, 11).map{ case (id, sim) => (titles(id), sim) }.mkString("n") You will see the movie titles and scores displayed similar to this output: (Hideaway (1995),0.6932331537649621) (Body Snatchers (1993),0.6898690594544726) (Evil Dead II (1987),0.6897964975027041) (Alien: Resurrection (1997),0.6891221044611473) (Stephen King's The Langoliers (1995),0.6864214133620066) (Liar Liar (1997),0.6812075443259535) (Tales from the Crypt Presents: Bordello of Blood (1996),0.6754663844488256) (Army of Darkness (1993),0.6702643811753909) (Mystery Science Theater 3000: The Movie (1996),0.6594872765176396) (Scream (1996),0.6538249646863378) Once again note that you might see quite different results due to random model initialization. Now that you have computed similar items using cosine similarity, see if you can do the same with the user-factor vectors to compute similar users for a given user. Evaluating the performance of recommendation models How do we know whether the model we have trained is a good model? We need to be able to evaluate its predictive performance in some way. Evaluation metrics are measures of a model's predictive capability or accuracy. 
Some are direct measures of how well a model predicts the model's target variable (such as Mean Squared Error), while others are concerned with how well the model performs at predicting things that might not be directly optimized in the model but are often closer to what we care about in the real world (such as Mean average precision). Evaluation metrics provide a standardized way of comparing the performance of the same model with different parameter settings and of comparing performance across different models. Using these metrics, we can perform model selection to choose the best-performing model from the set of models we wish to evaluate. Here, we will show you how to calculate two common evaluation metrics used in recommender systems and collaborative filtering models: Mean Squared Error and Mean average precision at K. Mean Squared Error The Mean Squared Error (MSE) is a direct measure of the reconstruction error of the user-item rating matrix. It is also the objective function being minimized in certain models, specifically many matrix-factorization techniques, including ALS. As such, it is commonly used in explicit ratings settings. It is defined as the sum of the squared errors divided by the number of observations. The squared error, in turn, is the square of the difference between the predicted rating for a given user-item pair and the actual rating. We will use our user 789 as an example. Let's take the first rating for this user from the moviesForUser set of Ratings that we previously computed: val actualRating = moviesForUser.take(1)(0) actualRating: org.apache.spark.mllib.recommendation.Rating = Rating(789,1012,4.0) We will see that the rating for this user-item combination is 4. Next, we will compute the model's predicted rating: val predictedRating = model.predict(789, actualRating.product) ... 14/04/13 13:01:15 INFO SparkContext: Job finished: lookup at MatrixFactorizationModel.scala:46, took 0.025404 s predictedRating: Double = 4.001005374200248 We will see that the predicted rating is about 4, very close to the actual rating. Finally, we will compute the squared error between the actual rating and the predicted rating: val squaredError = math.pow(predictedRating - actualRating.rating, 2.0) squaredError: Double = 1.010777282523947E-6 So, in order to compute the overall MSE for the dataset, we need to compute this squared error for each (user, movie, actual rating, predicted rating) entry, sum them up, and divide them by the number of ratings. We will do this in the following code snippet. Note the following code is adapted from the Apache Spark programming guide for ALS at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html. First, we will extract the user and product IDs from the ratings RDD and make predictions for each user-item pair using model.predict. We will use the user-item pair as the key and the predicted rating as the value: val usersProducts = ratings.map{ case Rating(user, product, rating)  => (user, product)} val predictions = model.predict(usersProducts).map{     case Rating(user, product, rating) => ((user, product), rating) } Next, we extract the actual ratings and also map the ratings RDD so that the user-item pair is the key and the actual rating is the value. 
Now that we have two RDDs with the same form of key, we can join them together to create a new RDD with the actual and predicted ratings for each user-item combination: val ratingsAndPredictions = ratings.map{   case Rating(user, product, rating) => ((user, product), rating) }.join(predictions) Finally, we will compute the MSE by summing up the squared errors using reduce and dividing by the count method of the number of records: val MSE = ratingsAndPredictions.map{     case ((user, product), (actual, predicted)) =>  math.pow((actual - predicted), 2) }.reduce(_ + _) / ratingsAndPredictions.count println("Mean Squared Error = " + MSE) Mean Squared Error = 0.08231947642632852 It is common to use the Root Mean Squared Error (RMSE), which is just the square root of the MSE metric. This is somewhat more interpretable, as it is in the same units as the underlying data (that is, the ratings in this case). It is equivalent to the standard deviation of the differences between the predicted and actual ratings. We can compute it simply as follows: val RMSE = math.sqrt(MSE) println("Root Mean Squared Error = " + RMSE) Root Mean Squared Error = 0.2869137090247319 Mean average precision at K Mean average precision at K (MAPK) is the mean of the average precision at K (APK) metric across all instances in the dataset. APK is a metric commonly used in information retrieval. APK is a measure of the average relevance scores of a set of the top-K documents presented in response to a query. For each query instance, we will compare the set of top-K results with the set of actual relevant documents (that is, a ground truth set of relevant documents for the query). In the APK metric, the order of the result set matters, in that, the APK score would be higher if the result documents are both relevant and the relevant documents are presented higher in the results. It is, thus, a good metric for recommender systems in that typically we would compute the top-K recommended items for each user and present these to the user. Of course, we prefer models where the items with the highest predicted scores (which are presented at the top of the list of recommendations) are, in fact, the most relevant items for the user. APK and other ranking-based metrics are also more appropriate evaluation measures for implicit datasets; here, MSE makes less sense. In order to evaluate our model, we can use APK, where each user is the equivalent of a query, and the set of top-K recommended items is the document result set. The relevant documents (that is, the ground truth) in this case, is the set of items that a user interacted with. Hence, APK attempts to measure how good our model is at predicting items that a user will find relevant and choose to interact with. The code for the following average precision computation is based on https://github.com/benhamner/Metrics.  More information on MAPK can be found at https://www.kaggle.com/wiki/MeanAveragePrecision. 
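Before looking at the implementation, it may help to write the metric out explicitly. The following formula is my own summary of the function shown next; it is not notation taken from the referenced sources:

AP@K = \frac{1}{\min(m, K)} \sum_{k=1}^{K} P(k) \cdot rel(k)

Here, P(k) is the precision at cutoff k (the fraction of the first k predicted items that are relevant), rel(k) is 1 if the k-th predicted item appears in the user's actual set and 0 otherwise, and m is the number of actual items for that user. MAPK is then simply the mean of AP@K over all users.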
Our function to compute the APK is shown here: def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {   val predK = predicted.take(k)   var score = 0.0   var numHits = 0.0   for ((p, i) <- predK.zipWithIndex) {     if (actual.contains(p)) {       numHits += 1.0       score += numHits / (i.toDouble + 1.0)     }   }   if (actual.isEmpty) {     1.0   } else {     score / scala.math.min(actual.size, k).toDouble   } } As you can see, this takes as input a list of actual item IDs that are associated with the user and another list of predicted ids so that our estimate will be relevant for the user. We can compute the APK metric for our example user 789 as follows. First, we will extract the actual movie IDs for the user: val actualMovies = moviesForUser.map(_.product) actualMovies: Seq[Int] = ArrayBuffer(1012, 127, 475, 93, 1161, 286, 293, 9, 50, 294, 181, 1, 1008, 508, 284, 1017, 137, 111, 742, 248, 249, 1007, 591, 150, 276, 151, 129, 100, 741, 288, 762, 628, 124) We will then use the movie recommendations we made previously to compute the APK score using K = 10: val predictedMovies = topKRecs.map(_.product) predictedMovies: Array[Int] = Array(27, 497, 633, 827, 602, 849, 401, 584, 1035, 1014) val apk10 = avgPrecisionK(actualMovies, predictedMovies, 10) apk10: Double = 0.0 In this case, we can see that our model is not doing a very good job of predicting relevant movies for this user as the APK score is 0. In order to compute the APK for each user and average them to compute the overall MAPK, we will need to generate the list of recommendations for each user in our dataset. While this can be fairly intensive on a large scale, we can distribute the computation using our Spark functionality. However, one limitation is that each worker must have the full item-factor matrix available so that it can compute the dot product between the relevant user vector and all item vectors. This can be a problem when the number of items is extremely high as the item matrix must fit in the memory of one machine. There is actually no easy way around this limitation. One possible approach is to only compute recommendations for a subset of items from the total item set, using approximate techniques such as Locality Sensitive Hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing). We will now see how to go about this. First, we will collect the item factors and form a DoubleMatrix object from them: val itemFactors = model.productFeatures.map { case (id, factor) => factor }.collect() val itemMatrix = new DoubleMatrix(itemFactors) println(itemMatrix.rows, itemMatrix.columns) (1682,50) This gives us a matrix with 1682 rows and 50 columns, as we would expect from 1682 movies with a factor dimension of 50. Next, we will distribute the item matrix as a broadcast variable so that it is available on each worker node: val imBroadcast = sc.broadcast(itemMatrix) 14/04/13 21:02:01 INFO MemoryStore: ensureFreeSpace(672960) called with curMem=4006896, maxMem=311387750 14/04/13 21:02:01 INFO MemoryStore: Block broadcast_21 stored as values to memory (estimated size 657.2 KB, free 292.5 MB) imBroadcast: org.apache.spark.broadcast.Broadcast[org.jblas.DoubleMatrix] = Broadcast(21) Now we are ready to compute the recommendations for each user. We will do this by applying a map function to each user factor within which we will perform a matrix multiplication between the user-factor vector and the movie-factor matrix. 
The result is a vector (of length 1682, that is, the number of movies we have) with the predicted rating for each movie. We will then sort these predictions by the predicted rating: val allRecs = model.userFeatures.map{ case (userId, array) =>   val userVector = new DoubleMatrix(array)   val scores = imBroadcast.value.mmul(userVector)   val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)   val recommendedIds = sortedWithId.map(_._2 + 1).toSeq   (userId, recommendedIds) } allRecs: org.apache.spark.rdd.RDD[(Int, Seq[Int])] = MappedRDD[269] at map at <console>:29 As we can see, we now have an RDD that contains a list of movie IDs for each user ID. These movie IDs are sorted in order of the estimated rating. Note that we needed to add 1 to the returned movie ids (as highlighted in the preceding code snippet), as the item-factor matrix is 0-indexed, while our movie IDs start at 1. We also need the list of movie IDs for each user to pass into our APK function as the actual argument. We already have the ratings RDD ready, so we can extract just the user and movie IDs from it. If we use Spark's groupBy operator, we will get an RDD that contains a list of (userid, movieid) pairs for each user ID (as the user ID is the key on which we perform the groupBy operation): val userMovies = ratings.map{ case Rating(user, product, rating) => (user, product) }.groupBy(_._1) userMovies: org.apache.spark.rdd.RDD[(Int, Seq[(Int, Int)])] = MapPartitionsRDD[277] at groupBy at <console>:21 Finally, we can use Spark's join operator to join these two RDDs together on the user ID key. Then, for each user, we have the list of actual and predicted movie IDs that we can pass to our APK function. In a manner similar to how we computed MSE, we will sum each of these APK scores using a reduce action and divide by the number of users (that is, the count of the allRecs RDD): val K = 10 val MAPK = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, K) }.reduce(_ + _) / allRecs.count println("Mean Average Precision at K = " + MAPK) Mean Average Precision at K = 0.030486963254725705 Our model achieves a fairly low MAPK. However, note that typical values for recommendation tasks are usually relatively low, especially if the item set is extremely large. Try out a few parameter settings for lambda and rank (and alpha if you are using the implicit version of ALS) and see whether you can find a model that performs better based on the RMSE and MAPK evaluation metrics. Using MLlib's built-in evaluation functions While we have computed MSE, RMSE, and MAPK from scratch, and it a useful learning exercise to do so, MLlib provides convenience functions to do this for us in the RegressionMetrics and RankingMetrics classes. RMSE and MSE First, we will compute the MSE and RMSE metrics using RegressionMetrics. We will instantiate a RegressionMetrics instance by passing in an RDD of key-value pairs that represent the predicted and true values for each data point, as shown in the following code snippet. Here, we will again use the ratingsAndPredictions RDD we computed in our earlier example: import org.apache.spark.mllib.evaluation.RegressionMetrics val predictedAndTrue = ratingsAndPredictions.map { case ((user, product), (predicted, actual)) => (predicted, actual) } val regressionMetrics = new RegressionMetrics(predictedAndTrue) We can then access various metrics, including MSE and RMSE. 
We will print out these metrics here: println("Mean Squared Error = " + regressionMetrics.meanSquaredError) println("Root Mean Squared Error = " + regressionMetrics.rootMeanSquaredError) You will see that the output for MSE and RMSE is exactly the same as the metrics we computed earlier: Mean Squared Error = 0.08231947642632852 Root Mean Squared Error = 0.2869137090247319 MAP As we did for MSE and RMSE, we can compute ranking-based evaluation metrics using MLlib's RankingMetrics class. Similarly, to our own average precision function, we need to pass in an RDD of key-value pairs, where the key is an Array of predicted item IDs for a user, while the value is an array of actual item IDs. The implementation of the average precision at the K function in RankingMetrics is slightly different from ours, so we will get different results. However, the computation of the overall mean average precision (MAP, which does not use a threshold at K) is the same as our function if we select K to be very high (say, at least as high as the number of items in our item set): First, we will calculate MAP using RankingMetrics: import org.apache.spark.mllib.evaluation.RankingMetrics val predictedAndTrueForRanking = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2)   (predicted.toArray, actual.toArray) } val rankingMetrics = new RankingMetrics(predictedAndTrueForRanking) println("Mean Average Precision = " + rankingMetrics.meanAveragePrecision) You will see the following output: Mean Average Precision = 0.07171412913757183 Next, we will use our function to compute the MAP in exactly the same way as we did previously, except that we set K to a very high value, say 2000: val MAPK2000 = allRecs.join(userMovies).map{ case (userId, (predicted, actualWithIds)) =>   val actual = actualWithIds.map(_._2).toSeq   avgPrecisionK(actual, predicted, 2000) }.reduce(_ + _) / allRecs.count println("Mean Average Precision = " + MAPK2000) You will see that the MAP from our own function is the same as the one computed using RankingMetrics: Mean Average Precision = 0.07171412913757186 We will not cover cross validation in this article. However, note that the same techniques for cross-validation can be used to evaluate recommendation models, using the performance metrics such as MSE, RMSE, and MAP, which we covered in this section. Summary In this article, we used Spark's MLlib library to train a collaborative filtering recommendation model, and you learned how to use this model to make predictions for the items that a given user might have a preference for. We also used our model to find items that are similar or related to a given item. Finally, we explored common metrics to evaluate the predictive capability of our recommendation model. To learn more about Spark, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Fast Data Processing with Spark - Second Edition (https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark-second-edition) Spark Cookbook (https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook) Resources for Article: Further resources on this subject: Reactive Programming And The Flux Architecture [article] Spark - Architecture And First Program [article] The Design Patterns Out There And Setting Up Your Environment [article]
Make Things Pretty with ggplot2

Packt
24 Feb 2016
30 min read
 The objective of this article is to provide you with a general overview of the plotting environments in R and of the most efficient way of coding your graphs in it. We will go through the most important Integrated Development Environment (IDE) available for R as well as the most important packages available for plotting data; this will help you to get an overview of what is available in R and how those packages are compared with ggplot2. Finally, we will dig deeper into the grammar of graphics, which represents the basic concepts on which ggplot2 was designed. But first, let's make sure that you have a working version of R on your computer. (For more resources related to this topic, see here.) Getting ggplot2 up and running You can download the most up-to-date version of R from the R project website (http://www.r-project.org/). There, you will find a direct connection to the Comprehensive R Archive Network (CRAN), a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. In addition to access to the CRAN servers, on the website of the R project, you may also find information about R, a few technical manuals, the R journal, and details about the packages developed for R and stored in the CRAN repositories. At the time of writing, the current version of R is 3.1.2. If you have already installed R on your computer, you can check the actual version with the R.Version() code, or for a more concise result, you can use the R.version.string code that recalls only part of the output of the previous function. Packages in R In the next few pages of this article, we will quickly go through the most important visualization packages available in R, so in order to try the code, you will also need to have additional packages as well as ggplot2 up and running in your R installation. In the basic R installation, you will already have the graphics package available and loaded in the session; the lattice package is already available among the standard packages delivered with the basic installation, but it is not loaded by default. ggplot2, on the other hand, will need to be installed. You can install and load a package with the following code: > install.packages(“ggplot2”) > library(ggplot2) Keep in mind that every time R is started, you will need to load the package you need with the library(name_of_the_package) command to be able to use the functions contained in the package. In order to get a list of all the packages installed on your computer, you can use the call to the library() function without arguments. If, on the other hand, you would like to have a list of the packages currently loaded in the workspace, you can use the search() command. One more function that can turn out to be useful when managing your library of packages is .libPaths(), which provides you with the location of your R libraries. This function is very useful to trace back the package libraries you are currently using, if any, in addition to the standard library of packages, which on Windows is located by default in a path of the kind C:/Program Files/R/R-3.1.2/library. 
The following list is a short recap of the functions just discussed: .libPaths()   # get library location library()   # see all the packages installed search()   # see the packages currently loaded Integrated Development Environment (IDE) You will definitely be able to run the code and the examples explained in the article directly from the standard R Graphical User Interface (GUI), especially if you are frequently working with R in more complex projects or simply if you like to keep an eye on the different components of your code, such as scripts, plots, and help pages, you may well think about the possibility of using an IDE. The number of specific IDEs that get integrated with R is still limited, but some of them are quite efficient, well-designed and open source. RStudio RStudio (http://www.rstudio.com/) is a very nice and advanced programming environment developed specifically for R, and this would be my recommended choice of IDE as the R programming environment in most cases. It is available for all the major platforms (Windows, Linux, and Mac OS X), and it can be run on a local machine, such as your computer, or even over the Web, using RStudio Server. With RStudio Server, you can connect a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server. RStudio allows you to integrate several useful functionalities, in particular if you use R for a more complex project. The way the software interface is organized allows you to keep an eye on the different activities you very often deal with in R, such as working on different scripts, overviewing the installed packages, as well as having easy access to the help pages and the plots generated. This last feature is particularly interesting for ggplot2 since in RStudio, you will be able to easily access the history of the plots created instead of visualizing only the last created plot, as is the case in the default R GUI. One other very useful feature of RStudio is code completion. You can, in fact, start typing a comment, and upon pressing the Tab key, the interface will provide you with functions matching what you have written . This feature will turn out to be very useful in ggplot2, so you will not necessarily need to remember all the functions and you will also have guidance for the arguments of the functions as well. In Figure 1.1, you can see a screenshot from the current version of RStudio (v 0.98.1091): Figure 1.1: This is a screenshot of RStudio on Windows 8 The environment is composed of four different areas: Scripting area: In this area you can open, create, and write the scripts. Console area: This area is the actual R console in which the commands are executed. It is possible to type commands directly here in the console or write them in a script and then run them on the console (I would recommend the last option). Workspace/History area: In this area, you can find a practical summary of all the objects created in the workspace in which you are working and the history of the typed commands. Visualization area: Here, you can easily load packages, open R help files, and, even more importantly, visualize plots. The RStudio website provides a lot of material on how to use the program, such as manuals, tutorials, and videos, so if you are interested, refer to the website for more details. Eclipse and StatET Eclipse (http://www.eclipse.org/) is a very powerful IDE that was mainly developed in Java and initially intended for Java programming. 
Subsequently, several extension packages were also developed to optimize the programming environment for other programming languages, such as C++ and Python. Thanks to its original objective of being a tool for advanced programming, this IDE is particularly intended to deal with very complex programming projects, for instance, if you are working on a big project folder with many different scripts. In these circumstances, Eclipse could help you to keep your programming scripts in order and have easy access to them. One drawback of such a development environment is probably its big size (around 200 MB) and a slightly slow-starting environment. Eclipse does not support interaction with R natively, so in order to be able to write your code and execute it directly in the R console, you need to add StatET to your basic Eclipse installation. StatET (http://www.walware.de/goto/statet) is a plugin for the Eclipse IDE, and it offers a set of tools for R coding and package building. More detailed information on how to install Eclipse and StatET and how to configure the connections between R and Eclipse/StatET can be found on the websites of the related projects. Emacs and ESS Emacs (http://www.gnu.org/software/emacs/) is a customizable text editor and is very popular, particularly in the Linux environment. Although this text editor appears with a very simple GUI, it is an extremely powerful environment, particularly thanks to the numerous keyboard shortcuts that allow interaction with the environment in a very efficient manner after getting some practice. Also, if the user interface of a typical IDE, such as RStudio, is more sophisticated and advanced, Emacs may be useful if you need to work with R on systems with a poor graphical interface, such as servers and terminal windows. Like Eclipse, Emacs does not support interfacing with R by default, so you will need to install an add-on package on your Emacs that will enable such a connection, Emacs Speaks Statistics (ESS). ESS (http://ess.r-project.org/) is designed to support the editing of scripts and interacting with various statistical analysis programs including R. The objective of the ESS project is to provide efficient text editor support to statistical software, which in some cases comes with a more or less defined GUI, but for which the real power of the language is only accessible through the original scripting language. The plotting environments in R R provides a complete series of options to realize graphics, which makes it quite advanced with regard to data visualization. Along the next few sections of this article, we will go through the most important R packages for data visualization by quickly discussing some high-level differences and analogies. If you already have some experience with other R packages for data visualization, in particular graphics or lattice, the following sections will provide you with some references and examples of how the code used in such packages appears in comparison with that used in ggplot2. Moreover, you will also have an idea of the typical layout of the plots created with a certain package, so you will be able to identify the tool used to realize the plots you will come across. The core of graphics visualization in R is within the grDevices package, which provides the basic structure of data plotting, such as the colors and fonts used in the plots. 
Such a graphic engine was then used as the starting point in the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. The graphics package is often referred to as the base or traditional graphics environment since, historically, it was the first package for data visualization available in R, and it provides functions that allow the generation of complete plots. The grid package, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly to generate graphics, but it is used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built by implementing different visualization approaches—Trellis plots in the case of lattice and the grammar of graphics in the case of ggplot2. We will describe these principles in more detail in the coming sections. A diagram representing the connections between the tools just mentioned is shown in Figure 1.2. Just keep in mind that this is not a complete overview of the packages available but simply a small snapshot of the packages we will discuss. Many other packages are built on top of the tools just mentioned, but in the following sections, we will focus on the most relevant packages used in data visualization, namely graphics, lattice, and, of course, ggplot2. If you would like to get a more complete overview of the graphics tools available in R, you can have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. Figure 1.2: This is an overview of the most widely used R packages for graphics In order to see some examples of plots in graphics, lattice and ggplot2, we will go through a few examples of different plots over the following pages. The objective of providing these examples is not to do an exhaustive comparison of the three packages but simply to provide you with a simple comparison of how the different codes as well as the default plot layouts appear for these different plotting tools. For these examples, we will use the Orange dataset available in R; to load it in the workspace, simply write the following code: >data(Orange) This dataset contains records of the growth of orange trees. You can have a look at the data by recalling its first lines with the following code: >head(Orange) You will see that the dataset contains three columns. The first one, Tree, is an ID number indicating the tree on which the measurement was taken, while age and circumference refer to the age in days and the size of the tree in millimeters, respectively. If you want to have more information about this data, you can have a look at the help page of the dataset by typing the following code: ?Orange Here, you will find the reference of the data as well as a more detailed description of the variables included. Standard graphics and grid-based graphics The existence of these two different graphics environments brings these questions  to most users' minds—which package to use and under which circumstances? For simple and basic plots, where the data simply needs to be represented in a standard plot type (such as a scatter plot, histogram, or boxplot) without any additional manipulation, then all the plotting environments are fairly equivalent. 
In fact, it would probably be possible to produce the same type of plot with graphics as well as with lattice or ggplot2. Nevertheless, in general, the default graphic output of ggplot2 or lattice will be most likely superior compared to graphics since both these packages are designed considering the principles of human perception deeply and to make the evaluation of data contained in plots easier. When more complex data should be analyzed, then the grid-based packages, lattice and ggplot2, present a more sophisticated support in the analysis of multivariate data. On the other hand, these tools require greater effort to become proficient because of their flexibility and advanced functionalities. In both cases, lattice and ggplot2, the package provides a full set of tools for data visualization, so you will not need to use grid directly in most cases, but you will be able to do all your work directly with one of those packages. Graphics and standard plots The graphics package was originally developed based on the experience of the graphics environment in R. The approach implemented in this package is based on the principle of the pen-on-paper model, where the plot is drawn in the first function call and once content is added, it cannot be deleted or modified. In general, the functions available in this package can be divided into high-level and low-level functions. High-level functions are functions capable of drawing the actual plot, while low-level functions are functions used to add content to a graph that was already created with a high-level function. Let's assume that we would like to have a look at how age is related to the circumference of the trees in our dataset Orange; we could simply plot the data on a scatter plot using the high-level function plot() as shown in the following code: plot(age~circumference, data=Orange) This code creates the graph in Figure 1.3. As you would have noticed, we obtained the graph directly with a call to a function that contains the variables to plot in the form of y~x, and the dataset to locate them. As an alternative, instead of using a formula expression, you can use a direct reference to x and y, using code in the form of plot(x,y). In this case, you will have to use a direct reference to the data instead of using the data argument of the function. Type in the following code: plot(Orange$circumference, Orange$age) The preceding code results in the following output: Figure 1.3: Simple scatterplot of the dataset Orange using graphics For the time being, we are not interested in the plot's details, such as the title or the axis, but we will simply focus on how to add elements to the plot we just created. For instance, if we want to include a regression line as well as a smooth line to have an idea of the relation between the data, we should use a low-level function to add the just-created additional lines to the plot; this is done with the lines() function: plot(age~circumference, data=Orange)   ###Create basic plot abline(lm(Orange$age~Orange$circumference), col=”blue”) lines(loess.smooth(Orange$circumference,Orange$age), col=”red”) The graph generated as the output of this code is shown in Figure 1.4: Figure 1.4: This is a scatterplot of the Orange data with a regression line (in blue) and a smooth line (in red) realized with graphics As illustrated, with this package, we have built a graph by first calling one function, which draws the main plot frame, and then additional elements were included using other functions. 
With graphics, only additional elements can be included in the graph without changing the overall plot frame defined by the plot() function. This ability to add several graphical elements together to create a complex plot is one of the fundamental elements of R, and you will notice how all the different graphical packages rely on this principle. If you are interested in getting other code examples of plots in graphics, there is also some demo code available in R for this package, and it can be visualized with demo(graphics). In the coming sections, you will find a quick reference to how you can generate a similar plot using graphics and ggplot2. As will be described in more detail later on, in ggplot2, there are two main functions to realize plots, ggplot() and qplot(). The function qplot() is a wrapper function that is designed to easily create basic plots with ggplot2, and it has a similar code to the plot() function of graphics. Due to its simplicity, this function is the easiest way to start working with ggplot2, so we will use this function in the examples in the following sections. The code in these sections also uses our example dataset Orange; in this way, you can run the code directly on your console and see the resulting output. Scatterplot with individual data points To generate the plot generated using graphics, use the following code: plot(age~circumference, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange) The preceding code results in the following output: Scatterplots with the line of one tree To generate the plot using graphics, use the following code: plot(age~circumference, data=Orange[Orange$Tree==1,], type=”l”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange[Orange$Tree==1,], geom=”line”) The preceding code results in the following output: Scatterplots with the line and points of one tree To generate the plot using graphics, use the following code: plot(age~circumference, data=Orange[Orange$Tree==1,], type=”b”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference,age, data=Orange[Orange$Tree==1,], geom=c(“line”,”point”)) The preceding code results in the following output: Boxplot of orange dataset To generate the plot using graphics, use the following code: boxplot(circumference~Tree, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=”boxplot”) The preceding code results in the following output: Boxplot with individual observations To generate the plot using graphics, use the following code: boxplot(circumference~Tree, data=Orange) points(circumference~Tree, data=Orange) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=c(“boxplot”,”point”)) The preceding code results in the following output: Histogram of orange dataset To generate the plot using graphics, use the following code: hist(Orange$circumference) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”) The preceding code results in the following output: Histogram with reference line at median 
value in red To generate the plot using graphics, use the following code: hist(Orange$circumference) abline(v=median(Orange$circumference), col=”red”) The preceding code results in the following output: To generate the plot using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”)+geom_vline(xintercept = median(Orange$circumference), colour=”red”) The preceding code results in the following output: Lattice and the Trellis plots Along with with graphics, the base R installation also includes the lattice package. This package implements a family of techniques known as Trellis graphics, proposed by William Cleveland to visualize complex datasets with multiple variables. The objective of those design principles was to ensure the accurate and faithful communication of data information. These principles are embedded into the package and are already evident in the default plot design settings. One interesting feature of Trellis plots is the option of multipanel conditioning, which creates multiple plots by splitting the data on the basis of one variable. A similar option is also available in ggplot2, but in that case, it is called faceting. In lattice, we also have functions that are able to generate a plot with one single call, but once the plot is drawn, it is already final. Consequently, plot details as well as additional elements that need to be included in the graph, need to be specified already within the call to the main function. This is done by including all the specifications in the panel function argument. These specifications can be included directly in the main body of the function or specified in an independent function, which is then called; this last option usually generates more readable code, so this will be the approach used in the following examples. For instance, if we want to draw the same plot we just generated in the previous section with graphics, containing the age and circumference of trees and also the regression and smooth lines, we need to specify such elements within the function call. You may see an example of the code here; remember that lattice needs to be loaded in the workspace: require(lattice)              ##Load lattice if needed myPanel <- function(x,y){ panel.xyplot(x,y)            # Add the observations panel.lmline(x,y,col=”blue”)   # Add the regression panel.loess(x,y,col=”red”)      # Add the smooth line } xyplot(age~circumference, data=Orange, panel=myPanel) This code produces the plot in Figure 1.5: Figure 1.5: This is a scatter plot of the Orange data with the regression line (in blue) and the smooth line (in red) realized with lattice As you would have noticed, taking aside the code differences, the plot generated does not look very different from the one obtained with graphics. This is because we are not using any special visualization feature of lattice. As mentioned earlier, with this package, we have the option of multipanel conditioning, so let's take a look at this. Let's assume that we want to have the same plot but for the different trees in the dataset. Of course, in this case, you would not need the regression or the smooth line, since there will only be one tree in each plot window, but it could be nice to have the different observations connected. 
This is shown in the following code: myPanel <- function(x,y){ panel.xyplot(x,y, type=”b”) #the observations } xyplot(age~circumference | Tree, data=Orange, panel=myPanel) This code generates the graph shown in Figure 1.6: Figure 1.6: This is a scatterplot of the Orange data realized with lattice, with one subpanel representing the individual data of each tree. The number of trees in each panel is reported in the upper part of the plot area As illustrated, using the vertical bar |, we are able to obtain the plot conditional to the value of the variable Tree. In the upper part of the panels, you would notice the reference to the value of the conditional variable, which, in this case, is the column Tree. As mentioned before, ggplot2 offers this option too; we will see one example of that in the next section. In the next section, You would find a quick reference to how to convert a typical plot type from lattice to ggplot2. In this case, the examples are adapted to the typical plotting style of the lattice plots. Scatterplot with individual observations To plot the graph using lattice, use the following code: xyplot(age~circumference, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange) The preceding code results in the following output: Scatterplot of orange dataset with faceting To plot the graph using lattice, use the following code: xyplot(age~circumference|Tree, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange, facets=~Tree) The preceding code results in the following output: Faceting scatterplot with line and points To plot the graph using lattice, use the following code: xyplot(age~circumference|Tree, data=Orange, type=”b”) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange, geom=c(“line”,”point”), facets=~Tree) The preceding code results in the following output: Scatterplots with grouping data To plot the graph using lattice, use the following code: xyplot(age~circumference, data=Orange, groups=Tree, type=”b”) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference,age, data=Orange,color=Tree, geom=c(“line”,”point”)) The preceding code results in the following output: Boxplot of orange dataset To plot the graph using lattice, use the following code: bwplot(circumference~Tree, data=Orange) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(Tree,circumference, data=Orange, geom=”boxplot”) The preceding code results in the following output: Histogram of orange dataset To plot the graph using lattice, use the following code: histogram(Orange$circumference, type = “count”) To plot the graph using ggplot2, use the following code: qplot(circumference, data=Orange, geom=”histogram”) The preceding code results in the following output: Histogram with reference line at median value in red To plot the graph using lattice, use the following code: histogram(~circumference, data=Orange, type = “count”, panel=function(x,...){panel.histogram(x, ...);panel.abline(v=median(x), col=”red”)}) The preceding code results in the following output: To plot the graph using ggplot2, use the following code: qplot(circumference, data=Orange, 
geom="histogram") + geom_vline(xintercept = median(Orange$circumference), colour="red")
The preceding code results in the following output:
ggplot2 and the grammar of graphics
The ggplot2 package was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As is the case with lattice, this package is also based on grid, providing a series of high-level functions that allow the creation of complete plots. The approach it implements is the one described in The Grammar of Graphics by Leland Wilkinson. Briefly, The Grammar of Graphics assumes that a statistical graphic is a mapping of data to the aesthetic attributes and geometric objects used to represent data, such as points, lines, bars, and so on. Besides the aesthetic attributes, the plot can also contain statistical transformations or groupings of data. As in lattice, in ggplot2, we have the possibility of splitting data by a certain variable, obtaining a representation of each subset of data in an independent subplot; such a representation in ggplot2 is called faceting.
In a more formal way, the main components of the grammar of graphics are the data and its mapping, aesthetics, geometric objects, statistical transformations, scales, coordinates, and faceting:
The data that must be visualized is mapped to aesthetic attributes, which define how the data should be perceived
Geometric objects describe what is actually displayed on the plot, such as lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw
Statistical transformations are applied to the data to group them; examples of statistical transformations would be the smooth line or the regression lines of the previous examples or the binning of the histograms
Scales represent the connection between the aesthetic spaces and the actual values that should be represented. Scales may also be used to draw legends
Coordinates represent the coordinate system in which the data is drawn
Faceting, which we have already mentioned, is the grouping of data in subsets defined by a value of one variable
In ggplot2, there are two main high-level functions capable of directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot, and it is a simple function that serves a purpose similar to that served by the plot() function in graphics. The ggplot() function, on the other hand, is a much more advanced function that allows the user to have more control of the plot layout and details. In our journey into the world of ggplot2, we will see some examples of qplot(), in particular when we go through the different kinds of graphs, but we will dig a lot deeper into ggplot() since this last function is more suited to advanced examples. If you have a look at the different forums based on R programming, there is quite a bit of discussion as to which of these two functions is more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plots, where only the data should be represented and only minor modifications of standard layouts are required, the qplot() function will do the job. On the other hand, if you need to apply particular transformations to the data or if you would just like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend that you focus on ggplot().
As you will see, the code between these functions is not completely different since they are both based on the same underlying philosophy, but the way in which the options are set is quite different, so if you want to adapt a plot from one function to the other, you will essentially need to rewrite your code. If you just want to focus on learning only one of them, I would definitely recommend that you learn ggplot().
In the following code, you will see an example of a plot realized with ggplot2, where you can identify some of the components of the grammar of graphics. The example is realized with the ggplot() function, which allows a more direct comparison with the grammar of graphics, but just after it you will also find the corresponding qplot() code. Both code snippets generate the graph depicted in Figure 1.7:
require(ggplot2)                             ## Load ggplot2
data(Orange)                                 ## Load the data

ggplot(data=Orange,                          ## Data used
  aes(x=circumference,y=age, color=Tree))+   ## Aesthetic
geom_point()+                                ## Geometry
stat_smooth(method="lm",se=FALSE)            ## Statistics

### Corresponding code with qplot()
qplot(circumference,age,data=Orange,         ## Data used
  color=Tree,                                ## Aesthetic mapping
  geom=c("point","smooth"),method="lm",se=FALSE)
This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested in representing and how, on top of this, you add the components of the plot; in this case, we added the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign.
One more thing worth mentioning at this point is that if you run just the main body of the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attribute, which, in this case, is geom_point(). This is perfectly in line with the grammar of graphics since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. This is the stage where we specify that the data should be represented as points; before that, nothing was specified about which plot we were interested in drawing.
Figure 1.7: This is an example of plotting the Orange dataset with ggplot2
Summary
To learn more about similar technologies, the following books and videos published by Packt Publishing (https://www.packtpub.com/) are recommended:
ggplot2 Essentials (https://www.packtpub.com/big-data-and-business-intelligence/ggplot2-essentials)
Video: Building Interactive Graphs with ggplot2 and Shiny (https://www.packtpub.com/big-data-and-business-intelligence/building-interactive-graphs-ggplot2-and-shiny-video)
Dealing with a Mess

Packt
23 Feb 2016
59 min read
Analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets. What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are:
Many missing values (NAs)
Misspelled names in categorical variables
Inconsistent data coding
Numbers in the same column being in different units
Mis-recorded data and data entry mistakes
Extreme outliers
Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and its respective solution. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now.
(For more resources related to this topic, see here.)
Analysis with missing data
Missing data is another one of those topics that are largely ignored in most introductory texts. Part of the reason why this is the case is probably that many myths about analysis with missing data still abound. Additionally, some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence in introductory texts is that most of the more principled methodologies are fairly complicated—mathematically speaking. Nevertheless, the incredible ubiquity of problems related to missing data in real-life data analysis necessitates some broaching of the subject. This section serves as a gentle introduction to the subject and to one of the more effective techniques for dealing with it.
A common refrain on the subject is something along the lines of "the best way to deal with missing data is not to have any". It's true that missing data is a messy subject, and there are a lot of ways to do it wrong. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have, for example, disallowed survey participants from going on without answering all the questions on a form. You can coerce the participants in a longitudinal study to not drop out, too. Don't do this. Not only is it unethical, it is also prodigiously counter-productive; there are treatments for missing data, but there are no treatments for bad data.
The standard treatment for the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.
There are many different ways that missing data is dealt with in practice—some are good, some are not so good.
Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this article, though, is to get you started on what is often described as the gold standard of imputation techniques: multiple imputation.
Visualizing missing data
In order to demonstrate how to visualize patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform analysis on later in the article. To showcase how to use multiple imputation for a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values.
Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset.
set.seed(2)
miss_mtcars <- mtcars
First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:
some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA
Now, we are going to create four missing values in qsec, but only for automatic cars:
only_automatic <- which(miss_mtcars$am==0)
some_rows <- sample(only_automatic, 4)
miss_mtcars$qsec[some_rows] <- NA
Now, let's take a look at the dataset:
> miss_mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2
Valiant             18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1
Great, now let's visualize the missingness. The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don't have the package already, install it.
> library(mice)
> md.pattern(miss_mtcars)
   disp hp am gear carb wt vs qsec mpg cyl drat
12    1  1  1    1    1  1  1    1   1   1    1  0
 4    1  1  1    1    1  1  1    1   0   1    1  1
 2    1  1  1    1    1  1  1    1   1   0    1  1
 3    1  1  1    1    1  1  1    1   1   1    0  1
 3    1  1  1    1    1  0  1    1   1   1    1  1
 2    1  1  1    1    1  1  1    0   1   1    1  1
 1    1  1  1    1    1  1  1    1   0   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    1  2
 1    1  1  1    1    1  1  0    1   1   0    1  2
 2    1  1  1    1    1  1  0    1   1   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    0  3
      0  0  0    0    0  3  3    4   5   5    7 27
A row-wise missing data pattern refers to the columns that are missing for each row. This function aggregates and counts the number of rows with the same missing data pattern. This function outputs a binary (0 and 1) matrix.
Cells with a 1 represent non-missing data; 0s represent missing data. Since the rows are sorted in order of increasing missingness, the first row always refers to the missing data pattern containing the least amount of missing data. In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns in the miss_mtcars dataset. The left-most column is a count of the number of rows that display that missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.
As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one missing just mpg; four rows fit this pattern. There are only six rows that contain more than one missing value. Only one of these rows contains more than two missing values (as shown in the second-to-last row). As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their data missing. This dataset doesn't even hit 3 percent.
Now let's visualize the missing data pattern graphically using the VIM package. You will probably have to install this, too.
library(VIM)
aggr(miss_mtcars, numbers=TRUE)
Figure 11.1: The output of VIM's visual aggregation of missing data. The left plot shows the proportion of missing values for each column. The right plot depicts the prevalence of row-wise missing data patterns, like md.pattern
At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern. 37.5 percent of the rows contain no missing data whatsoever.
Types of missing data
The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Discrimination based on the missing data mechanism is crucial, since it informs us about the options for handling the missingness.
The first mechanism, MCAR, occurs when the data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected elements to go missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.
MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself.
For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism—it is the weight of the observation itself that is the cause of its being missing. Another example would be if, during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when all the participants' levels of depression are assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.
The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example. Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the question on income would be MNAR. If, however, the questionnaire included a question that asked if the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that when you control for the relationship between the missing variable and the observed variable(s) it is related to (for example, What is your income? and Is English your first language? respectively), the data are missing at random.
As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsecs were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, that if we removed am from our simulated dataset, qsec would become MNAR.
As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with since the properties of the missing data that caused its missingness have to be understood quantifiably, and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and are far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness being referred to as ignorable missingness.
MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains MCAR and MAR components, the mechanism that describes the whole data is MAR—just one MAR mechanism makes the whole dataset MAR.
So which one is it?
You may have noticed that the place of a particular dataset in the missing data mechanism taxonomy is dependent on the variables that it includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones that created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data—as happens in the real world—and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data is MCAR. It isn't, though; just because the variable to which another variable's missingness is systematically related is non-observed, doesn't mean that it doesn't exist.
This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.
Unsophisticated methods for dealing with missing data
Here we are going to look at various types of methods for dealing with missing data:
Complete case analysis
This method, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:
> mean(miss_mtcars$drat)
[1] NA
> mean(miss_mtcars$drat, na.rm=TRUE)
[1] 3.63
In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)
## OR
# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- miss_mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=comp_mtcars)
Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, since the included data elements are just a random sampling of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables we include, the more missing data we open ourselves up to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.
Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimates of the mean of qsec.
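As a quick, informal check (the exact numbers depend on the seed we set earlier, so none are quoted here), we can compare the mean of qsec computed on the complete mtcars data with the complete-case estimate from our marred copy:
# the mean of qsec with nothing missing
mean(mtcars$qsec)
# the complete-case estimate, which drops the qsecs of some automatic cars
mean(miss_mtcars$qsec, na.rm=TRUE)
Because the deleted qsec values all belong to automatic cars, the second estimate will tend to drift away from the first.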
Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients, even if the data is MNAR or MAR, as long as the relevant variables are included in the regression equations. For this reason, if there are relatively few missing values in a dataset that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.
Pairwise deletion
Also called available-case analysis, this technique is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it only uses the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrices. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.
Mean substitution
Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec,
                                            na.rm=TRUE)
# etc...
Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. Whilst mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be, systematically, too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.
Hot deck imputation
Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from another row in the dataset. In the least sophisticated formulation, a random non-missing element from the same dataset is used to fill in a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this article borrows this idea for one of its imputation methods. The term hot deck refers to the old practice of storing data in decks of punch cards. The deck that holds the donor value would be hot because it is the one that is currently being processed.
Regression imputation
This approach attempts to fill in the missing data in a column using regression to predict likely values of the missing elements using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete. This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, this produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model).
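To make the idea concrete, here is a minimal sketch of regression imputation for a single column. It imputes drat using only the fully observed columns as predictors, purely to keep the sketch self-contained; the choice of predictors is illustrative, and this is not the approach we will ultimately use:
# fit a linear model for drat using the columns with no missing values
reg_data   <- miss_mtcars
drat_model <- lm(drat ~ disp + hp + am + gear + carb, data=reg_data)
# replace the missing drat values with the model's predictions
na_drat    <- is.na(reg_data$drat)
reg_data$drat[na_drat] <- predict(drat_model, newdata=reg_data[na_drat, ])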
However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few data points lie right on the regression line—there is usually a normally distributed residual (error) term. Due to this, regression imputation underestimates the variability of the missing values. As a result, it produces biased estimates of the variance and covariance between different columns. However, we're on the right track.
Stochastic regression imputation
As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing.
However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.
There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this article.
Multiple imputation
The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets.
We perform our intended analysis separately with each of these m completed datasets. These analyses will then yield m different parameter estimates (like regression coefficients, and so on). The critical point is that these parameter estimates are different solely due to the variability in the imputed missing values, and hence, our uncertainty about what the imputed values should be. This is how multiple imputation integrates uncertainty, and outperforms more limited imputation methods that produce one imputed dataset, conferring an unwarranted sense of confidence in the filled-in data of our analysis. The following diagram illustrates this idea:
Figure 11.2: Multiple imputation in a nutshell
So how does mice come up with the imputed values? Let's focus on the univariate case—where only one column contains missing data and we use all the other (complete) columns to impute the missing values—before generalizing to a multivariate case.
mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on).
The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal "noise" term to the output of a linear regression predicting the missing variable. What makes this slightly different from just stochastic regression repeated m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model.
Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that's why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If, through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be.
To combat this, in multiple imputation, each imputation of the data contains two steps. The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds into the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second imputation, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets.
How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; these new coefficients are drawn from each of the coefficients' posterior distribution, which describes credible values of the estimate using the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses these new coefficient estimates from a sampling distribution that is created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot.
The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own. For this reason, we make several passes over all the columns that need imputing, until the imputation of all missing data in a particular column is informed by increasingly well-informed estimates of the missing data in the predictor columns. These passes over all the columns are called iterations.
So that you really understand how this iteration works, let's say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt and drat. First, all the missing data in all the columns are set to a placeholder value like the mean or a randomly sampled non-missing value from its column. Then, we visit mpg where the placeholder values are turned back into missing values. These missing values are predicted using the two-part procedure described in the univariate case. Then we move on to wt; the placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated with drat. This is one iteration. On the next iteration, it is not the placeholder values that get turned back into random values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each of the m imputations starts the iteration process all over again.
The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We'll discuss how to tell if this is necessary later in the section.
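Both the number of imputations (m) and the number of iterations per imputation (maxit) are arguments to the mice() function itself. A quick sketch (the particular values here are only illustrative, not a recommendation):
# run mice with an explicit number of imputations and iterations
imp_sketch <- mice(miss_mtcars, m=20, maxit=10, seed=3, printFlag=FALSE)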
Methods of imputation
The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution—but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because it's in units of a thousand pounds) it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching.
Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputations are then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value—d=3 is the default in mice.
This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values appear among the actual non-missing observed values. In this way, predictive mean matching is like hot-deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values.
Many of the other imputation methods in mice are specially suited for one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression—mice calls this method polyreg.
Multiple imputation in practice
There are a few steps to follow and decisions to make when using this powerful imputation technique:
Are the data MAR?: And be honest! If the mechanism is likely not MAR, then more complicated measures have to be taken.
Are there any derived terms, redundant variables, or irrelevant variables in the data set?: Any of these types of variables will interfere with the regression process. Irrelevant variables—like unique IDs—will not have any predictive power. Derived terms or redundant variables—like having a column for weight in pounds and grams, or a column for area in addition to a length and width column—will similarly interfere with the regression step.
Convert all categorical variables to factors: Otherwise, mice will not be able to tell that the variable is supposed to be categorical.
Choose the number of iterations and m: By default, these are both five. Using five iterations is usually okay—and we'll be able to tell if we need more. Five imputations are usually okay, too, but we can achieve more statistical power from more imputed datasets. I suggest setting m to 20, unless the processing power and time can't be spared.
Choose an imputation method for each variable: You can stick with the defaults as long as you are aware of what they are and think they're the right fit.
Choose the predictors: Let mice use all the available columns as predictors as long as derived terms and redundant/irrelevant columns are removed. Not only does using more predictors result in reduced bias, but it also increases the likelihood that the data is MAR.
Perform the imputations.
Audit the imputations.
Perform analysis with the imputations.
Pool the results of the analyses.
Before we get down to it, let's call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn't do and why:
# we are going to set the seed and printFlag to FALSE, but
# everything else will be left at the default arguments
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
print(imp)
------------------------------
Multiply imputed data set
Call:
mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
 ...
Random generator seed value:  3
The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets, by default. As we discussed, this isn't a bad default, but more imputations can only improve our statistical power (if only marginally); when we impute this dataset in earnest, we will use m=20.
The second thing we notice (on lines 8-10 of the output) is that it used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn't convert them to factors, mice thinks these are just regular numeric columns. We'll have to fix this.
The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg, with the exception of mpg itself. All the values in the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows all contain 0s—this is because these variables are complete, and don't need to use any predictors.
Since we thought carefully about whether there were any attributes that should be removed before we perform the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (like unique identifiers, redundant variables, and so on), we would have either had to remove them (easiest option), or instruct mice not to use them as predictors (harder); a sketch of the latter option follows.
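Purely for illustration (our dataset doesn't actually call for it), the predictor matrix stored in the returned object can be modified and passed back to mice; here we pretend we want to stop carb from ever being used as a predictor:
pred <- imp$predictorMatrix            # start from the default matrix
pred[, "carb"] <- 0                    # never use carb to predict other columns
imp_alt <- mice(miss_mtcars, predictorMatrix=pred,
                seed=3, printFlag=FALSE)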
Let's now correct the issues that we've discussed.
# convert categorical variables into factors
miss_mtcars$vs <- factor(miss_mtcars$vs)
miss_mtcars$cyl <- factor(miss_mtcars$cyl)

imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
imp$method
-------------------------------------
      mpg       cyl      disp        hp      drat
    "pmm" "polyreg"        ""        ""     "pmm"
       wt      qsec        vs        am      gear
    "pmm"     "pmm"  "logreg"        ""        ""
     carb
       ""
Now mice has corrected the imputation methods of cyl and vs to their correct defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute, but, for the sake of simplicity, we'll treat it as a categorical variable.
Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iteration over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters' posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration.
In contrast to when we used MCMC, convergence in mice is much faster; it usually occurs in just a few iterations. However, visually checking for convergence is highly recommended. We even check for it similarly; when we call the plot function on the variable that we assign the mice output to, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations.
plot(imp)
Figure 11.3: A subset of the trace plots produced by plotting an object returned by a mice imputation
As you can see from the preceding trace plots of imp, there are no clear trends and the variables are all overlapping from one iteration to the next. Put another way, the variance within a chain (there are m chains) should be about equal to the variance between the chains. This indicates that convergence was achieved. If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter to the mice function. To see an example of non-convergence, take a look at Figures 7 and 8 in the paper describing this package, written by the authors of the package themselves. It is available at http://www.jstatsoft.org/article/view/v045i03.
The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check.
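One quick sanity check worth knowing about: mice ships with lattice-style plotting methods for imputed objects, and its stripplot() method overlays the imputed values on the observed ones for each variable, so wildly implausible imputations stand out immediately (a small sketch, with the plotting character chosen arbitrarily):
# observed and imputed values plotted side by side for each variable
stripplot(imp, pch=20)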
With the following line, we're going to display the imputed values for the five missing mpgs for the first six imputations:
imp$imp$mpg[,1:6]
------------------------------------
                      1    2    3    4    5    6
Duster 360         19.2 16.4 17.3 15.5 15.0 19.2
Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3
Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4
Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5
Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2
These sure look reasonable. A better method for sanity checking is to call densityplot on the variable that we assign the mice output to:
densityplot(imp)
Figure 11.4: Density plots of all the imputed values for mpg, drat, wt, and qsec. Each imputation has its own density curve in each quadrant
This displays, for every attribute imputed, a density plot of the actual non-missing values (the thick line) and the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curves of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few variables that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed variables.
We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset and attempt to model mpg as a function of am, wt, and qsec. Instead of repeating the analyses on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows:
imp_models <- with(imp, lm(mpg ~ am + wt + qsec))
We could take a peek at the estimated coefficients from each dataset using lapply on the analyses attribute of the returned object:
lapply(imp_models$analyses, coef)
---------------------------------
[[1]]
(Intercept)          am          wt        qsec
 18.1534095   2.0284014  -4.4054825   0.8637856

[[2]]
(Intercept)          am          wt        qsec
   8.375455    3.336896   -3.520882    1.219775

[[3]]
(Intercept)          am          wt        qsec
   5.254578    3.277198   -3.233096    1.337469
.........
Finally, let's pool the results of the analyses (with the pool function), and call summary on it:
pooled_model <- pool(imp_models)
summary(pooled_model)
----------------------------------
                  est        se         t       df    Pr(>|t|)
(Intercept)  7.049781 9.2254581  0.764166 17.63319 0.454873254
am           3.182049 1.7445444  1.824000 21.36600 0.082171407
wt          -3.413534 0.9983207 -3.419276 14.99816 0.003804876
qsec         1.270712 0.3660131  3.471765 19.93296 0.002416595
                  lo 95     hi 95 nmis       fmi    lambda
(Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138
am           -0.4421495  6.806247    0 0.2290359 0.1600952
wt           -5.5414268 -1.285641    3 0.4324828 0.3615349
qsec          0.5070570  2.034366    4 0.2736026 0.2042003
Though we could have performed the pooling ourselves using the equations that Donald Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is less of a hassle and less error-prone to have the pool function do it for us. Readers who are interested in the pooling rules are encouraged to consult the aforementioned text.
As you can see, for each parameter, pool has combined the coefficient estimates and standard errors, and calculated the appropriate degrees of freedom. These allow us to t-test each coefficient against the null hypothesis that the coefficient is equal to 0, produce p-values for the t-test, and construct confidence intervals. The standard errors and confidence intervals are wider than those that would have resulted from linear regression on a single imputed dataset, but that's because they appropriately take into account our uncertainty regarding what the missing values would have been.
There are, at the present time, a limited number of analyses that can be automatically pooled by mice—the most important being lm/glm. If you recall, though, the generalized linear model is extremely flexible, and can be used to express a wide array of different analyses. By extension, we could use multiple imputation not only for linear regression but also for logistic regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more.
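As a sketch of what that looks like in practice (the model itself is just an example, not part of the analysis above), the same with-then-pool workflow applies to a GLM; here, a logistic regression of transmission type on weight and mpg, fit on each imputed dataset and then pooled:
# logistic regression on each of the imputed datasets, then pooled
imp_glms <- with(imp, glm(am ~ wt + mpg, family=binomial))
summary(pool(imp_glms))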
Analysis with unsanitized data
Very often, there will be errors or mistakes in data that can severely complicate analyses—especially with public data or data from outside of your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as a character vector, and subsequent analysis may, in the best-case scenario, fail; it is also possible, however, that our analysis will silently chug along, and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor, which will compel R to convert it into a factor, thinking that it is a categorical variable.
In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999s in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.
Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?
The package assertr seeks to do this by introducing a number of data checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks. More prosaically, assertr provides a suite of functions designed to verify assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look upfront, so that any deviation from this expectation can be dealt with immediately.
Given that the assertr grammar is designed to be able to describe a bouquet of error-checking routines, rather than list all the functions and functionalities that the package provides, it would be more helpful to visit particular use cases.
Two things before we start. First, make sure you install assertr.
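If it isn't on your system already, the usual installation call fetches it from CRAN:
install.packages("assertr")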
Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassignment of the returned data frame after it passes the check.
a_dataset <- CHECKING_VERB(a_dataset, ....)
Checking for out-of-bounds data
It's common for numeric values in a column to have a natural constraint on the values that they should hold. For example, if a column represents a percent of something, we might want to check if all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, if we added a column to mtcars that represented each car's weight as a percent of the heaviest car's weight, we could check that it stays between 0 and 1 as follows:
library(assertr)
mtcars.copy <- mtcars

mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                    max(mtcars.copy$wt),
                                    2)

mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.
Using within_bounds, we can also avoid the situation where NA values are encoded as 999, as long as the upper bound we supply to within_bounds is less than this value. within_bounds can take other information, such as whether the bounds should be inclusive or exclusive, or whether it should ignore NA values. To see the options for this, and all the other functions in assertr, use the help function on them.
Let's see an example of what it looks like when the assert function fails:
mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
------------------------------------------------------------
Error: Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10)
We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum.
With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here's how we might make sure our dataset is compliant:
iris <- assert(iris, within_bounds(0, Inf),
               Sepal.Length, Sepal.Width,
               Petal.Length, Petal.Width)

# or simply "-Species" because that
# will include all columns *except* Species
iris <- assert(iris, within_bounds(0, Inf),
               -Species)
On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let's say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results in the logical expression is FALSE, verify will cause an error.
vec <- runif(10, min=10, max=20) # randomly turn some elements negative vec <- vec * sample(c(1, -1), 10,                     replace=TRUE)   example <- data.frame(weird=vec)   example <- verify(example, ((weird < 20 & weird > 10) |                               (weird < -10 & weird > -20)))   # or   example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # passes   example$weird[4] <- 0 example <- verify(example, abs(weird) < 20 & abs(weird) > 10) # fails ------------------------------------- Error in verify(example, abs(weird) < 20 & abs(weird) > 10) :   verification failed! (1 failure) Checking the data type of a column By default, most of the data import functions in R will attempt to guess the data type for each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify this, we can use the assert function with the is.numeric base function: iris <- assert(iris, is.numeric, -Species) We can use the is.character and is.logical functions with assert, too. An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should be at the data import phase with the colClasses optional argument: iris <- read.csv("PATH_TO_IRIS_DATA.csv",                  colClasses=c("numeric", "numeric",                               "numeric", "numeric",                               "character")) This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type. Checking for unexpected categories Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of mislabeling of categories that can occur: an observation's class is mis-entered/mis-recorded/mistaken for that of another class, or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of iris could be misspelled (such as "versicolour", "verginica") or cases where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA"). Either way, these misspecifications prove to be a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Yet another way in which unexpected categories can haunt you is by producing statistics grouped by different values of a categorical variable; if the categories were extracted from the main data manually—with subset, for example, as opposed to with by, tapply, or aggregate—you'll be missing potentially crucial observations. If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set. 
# passes iris <- assert(iris, in_set("setosa", "versicolor",                             "virginica"), Species)   # mess up the data iris.copy <- iris # We have to make the 'Species' column not # a factor iris.copy$Species <- as.vector(iris$Species) iris.copy$Species[4:9] <- "SETOSA" iris.copy$Species[135] <- "verginica" iris.copy$Species[95] <- "i. versicolor"   # fails iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",                                       "virginica"), Species) ------------------------------------------- Error: Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4) If you don't know the categories that you should be expecting, a priori, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified: by(iris.copy, iris.copy$Species, nrow) Checking for outliers, entry errors, or unlikely data points Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type. One simple way to detect the presence of a major outlier is to confirm that every data point is within some n number of standard deviations away from the mean of the group. assertr has a function, within_n_sds—in conjunction with the insist verb—to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express that as follows: iris <- insist(iris, within_n_sds(5), -Species) An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it. There are more robust measures of central tendency and dispersion than the mean and standard deviation: the median and the median absolute deviation. The median absolute deviation is the median of the absolute differences between each element of a vector and the vector's median. assertr has a sister to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations away from its column's median. iris <- insist(iris, within_n_mads(4), -Species) iris$Petal.Length[5] <- 15 iris <- insist(iris, within_n_mads(4), -Species) --------------------------------------------- Error: Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5) In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully. The examples here have been focusing on outlier identification in the univariate case—across one dimension at a time. Often, there are times where an observation is truly anomalous but it wouldn't be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text.
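Before we move on, here is a small base-R sketch of why the median and median absolute deviation hold up better than the mean and standard deviation when an outlier is lurking in the data. The numbers are made up purely for illustration, and note that R's built-in mad() function scales its result by 1.4826 by default, so the raw value defined above is computed by hand here.

# one wildly mis-entered value among otherwise similar measurements
x <- c(8.2, 7.9, 8.4, 8.1, 7.8, 8.3, 8.0, 97)

# z-scores: the outlier inflates both the mean and the sd,
# so its own z-score ends up looking fairly tame (about 2.5)
round((x - mean(x)) / sd(x), 2)

# raw median absolute deviation, per the definition in the text
# (equivalently, mad(x, constant = 1); mad() scales by 1.4826 by default)
raw.mad <- median(abs(x - median(x)))

# "MAD-scores": the same outlier now stands out by a huge margin
round((x - median(x)) / raw.mad, 2)

This is the intuition behind reaching for insist with within_n_mads rather than within_n_sds when the data may already contain the very errors we are hunting for.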
Chaining assertions The assertr package aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user uses multiple checks on one data frame. The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable: iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...) To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the "piping" construct from the magrittr package. The pipe operator of magrittr, %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns: library(magrittr) 4 %>% sqrt              # 2 iris %>% head(n=3)      # the first 3 rows of iris iris <- iris %>% assert(within_bounds(0, Inf), -Species) Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example: iris <- iris %>%   assert(is.numeric, -Species) %>%   assert(within_bounds(0, Inf), -Species) %>%   assert(in_set("setosa", "versicolor", "virginica"), Species) %>%   insist(within_n_mads(4), -Species)   # or, equivalently   CHECKS <- . %>%   assert(is.numeric, -Species) %>%   assert(within_bounds(0, Inf), -Species) %>%   assert(in_set("setosa", "versicolor", "virginica"), Species) %>%   insist(within_n_mads(4), -Species)   iris <- iris %>% CHECKS When chaining assertions, I like to put the most integral and general one right at the top. I also like to put the assertions most likely to be violated right at the top so that execution is terminated before any more checks are run. There are many other capabilities built into assertr's multivariate outlier checking. For more information about these, read the package's vignette (vignette("assertr")). On the magrittr side, besides the forward-pipe operator, this package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution on the right-hand side of the pipe operator to occur at positions other than the first argument. For more information about the wonderful magrittr package, read its vignette. Other messiness As we discussed in this article's preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey: OpenRefine Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and for cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or misspecified categories and fix them at the click of a button. Regular expressions Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers.
These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying the search patterns in text) and associate R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose. tidyr There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:             member  attribute  value 1     Ringo Starr  birthyear   1940 2  Paul McCartney  birthyear   1942 3 George Harrison  birthyear   1943 4     John Lennon  birthyear   1940 5     Ringo Starr instrument  Drums 6  Paul McCartney instrument   Bass 7 George Harrison instrument Guitar 8     John Lennon instrument Guitar In another form (called wide or unstacked), each of the observation's variables are stored in each column:             member birthyear instrument 1 George Harrison      1943     Guitar 2     John Lennon      1940     Guitar 3  Paul McCartney      1942       Bass 4     Ringo Starr      1940      Drums If you ever need to convert between these representations, (which is a somewhat common operation, in practice) tidyr is your tool for the job. Exercises The following are a few exercises for you to strengthen your grasp over the concepts learned in this article: Normally, when there is missing data for a question such as "What is your income?", we strongly suspect an MNAR mechanism, because we live in a dystopia that equates wealth with worth. As a result, the participants with the lowest income may be embarrassed to answer that question. In the relevant section, we assumed that because the question was poorly worded and we could account for whether English was the first language of the participant, the mechanism is MAR. If we were wrong about this reason, and it was really because the lower income participants were reticent to admit their income, what would the missing data mechanism be now? If, however, the differences in income were fully explained by whether English was the first language of the participant, what would the missing data mechanism be in that case? Find a dataset on the web with missing data. What does it use to denote that data is missing? Think about that dataset's missing data mechanism. Is there a chance that this data is MNAR? Find a freely available government dataset on the web. Read the dataset's description, and think about what assumptions you might make about the data when planning a certain analysis. Translate these into actual code so that R can check them for you. Were there any deviations from your expectations? When two autonomous individuals decide to voluntarily trade, the transaction can be in both parties' best interests. Does it necessarily follow that a voluntary trade between nations benefits both states? Why or why not? Summary "Messy data"—no matter what definition you use—present a huge roadblock for people who work with data. This article focused on two of the most notorious and prolific culprits: missing data and data that has not been cleaned or audited for quality. On the missing data side, you learned how to visualize missing data patterns, and how to recognize different types of missing data. 
You saw a few unprincipled ways of tackling the problem, and learned why they were suboptimal solutions. Multiple imputation, so you learned, addresses the shortcomings of these approaches and, through its usage of several imputed data sets, correctly communicates our uncertainty surrounding the imputed values. On unsanitized data, we saw that the, perhaps, optimal solution (visually auditing the data) was untenable for moderately sized datasets or larger. We discovered that the grammar of the package assertr provides a mechanism to offload this auditing process to R. You now have a few assertr checking "recipes" under your belt for some of the more common manifestations of the mistakes that plague data that has not been scrutinized. You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r): Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r) R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) Resources for Article: Further resources on this subject: Debugging The Scheduler In Oracle 11G Databases [article] Looking Good – The Graphical Interface [article] Being Offline [article]
Probability of R?

Packt
23 Feb 2016
17 min read
It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics. In data analysis, probability is used to quantify uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor. (For more resources related to this topic, see here.) Basic probability Probability measures the likelihood that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence. Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur. The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive. The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows: P(heads) = 0.5 and P(tails) = 0.5 The probability of a coin flip yielding either heads or tails looks like this: P(heads or tails) = 1 And the probability of a coin flip yielding both heads and tails is denoted as follows: P(heads and tails) = 0 The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen. The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1): P(1) = 1/6 Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll a 1 and 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities: P(1 or 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3 While rolling a 1 or rolling a 2 aren't collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner: P(1 or not 1) = P(1) + P(not 1) = 1 These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events. Our last pedagogical example in basic probability theory uses a deck of cards.
Our deck has 52 cards—4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belongs to a different suit: Hearts, Clubs, Spades, or Diamonds. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card: P(Red) = 0.5, P(Black) = 0.5, P(Heart) = 13/52 = 0.25, and P(Ace) = 4/52 (about 0.077) What, then, is the probability of getting a black card and an Ace? Well, these events are independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore: P(Black and Ace) = P(Black) * P(Ace) = 0.5 * 4/52 = 2/52 Intuitively, this makes sense, because there are two black Aces out of a possible 52. What about the probability that we choose a red card and a Heart? These two outcomes are not independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows: P(A and B) = P(B) P(A|B) Where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A | B) means what's the probability of drawing a heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore: P(Red and Heart) = P(Red) P(Heart | Red) = 0.5 * 0.5 = 0.25 In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer: P(Heart) P(Red | Heart) = 0.25 * 1 = 0.25 So, these two forms are equivalent: P(B) P(A|B) = P(A) P(B|A) For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence: P(A|B) = P(A) P(B|A) / P(B) This equation is known as Bayes' Theorem. This equation is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics. Bayes' Theorem has been applied to and proven useful in an enormous amount of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election. At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive: P(H|E) = P(H) P(E|H) / P(E) where H is the hypothesis and E is the evidence. Let's see an example of Bayes' Theorem in action! There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it. Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times. When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help.
Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense. Ronald incorrectly assumed that each of the employees who tested positive was using the drug with 99% certainty and, therefore, that the chance that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate. How so? Let's find out by applying Bayes' theorem! Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive. We want to solve the left side of the equation, so let's plug in values. The first part of the right side of the equation, P(Positive Test | Ally User), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald—and most other people when they first heard of the problem. The second part, P(Ally User), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. Finally, the denominator of the equation is a normalizing constant, which ensures that the probabilities over all possible hypotheses add up to one. Finally, the value we are trying to solve for, P(Ally user | Positive Test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence. In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows: P(Positive Test) = P(Positive Test | Ally User) P(Ally User) + P(Positive Test | Non-user) P(Non-user) Which is: (.99 * .001) + (.01 * .999) = 0.01098 Plugging that into the denominator, our final answer is calculated as follows: P(Ally User | Positive Test) = (.99 * .001) / 0.01098, which comes to about .09 Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very, very strong evidence indeed to convince us that an employee was an Ally user. Ignoring the prior probability in cases like these is known as base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake. Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .001 squared, or about one in a million. Squaring our new posterior of .09, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated. Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work. Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get; after all, how many non-Ally users do you think eat pencils and live in sewers? A tale of two interpretations Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is.
Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp. The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity. The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate. The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people. Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework because exact meteorological conditions are not repeatable, as is required by frequentist probability. Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. Though practitioners may strongly align themselves with one side over another, good statisticians know that there's a time and a place for both approaches. Though Bayesianism as a valid way of looking at probability is debated, Bayes theorem is a fact about probability and is undisputed and non-controversial. Sampling from distributions Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring. That sentence probably sounds much scarier than it needs to be. Take a die roll for example. Figure 4.1: Probability distribution of outcomes of a die roll Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167 or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions). Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution—it's a distribution describing only two events. Parameters We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave. For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both the dice is modeled as uniform distributions, the behavior of each is a little different. 
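To get a hands-on feel for what sampling from these distributions looks like, here is a short base-R sketch; the sample sizes are arbitrary, and set.seed is only there to make the output reproducible.

set.seed(1)

# sampling from discrete uniform distributions: a 6-sided and a 12-sided die
six.sided    <- sample(1:6,  10000, replace=TRUE)
twelve.sided <- sample(1:12, 10000, replace=TRUE)

# each face should come up with relative frequency near 1/6 and 1/12, respectively
round(prop.table(table(six.sided)), 3)
round(prop.table(table(twelve.sided)), 3)

# sampling from a Bernoulli distribution (a fair coin) with rbinom and size=1
coin.flips <- rbinom(10000, size=1, prob=0.5)
mean(coin.flips)    # should land close to 0.5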
To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1 / n. The n for a 6-sided die and a 12-sided die is 6 and 12 respectively. For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5. Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 1/6 (about 0.167). Therefore, the probability of not rolling a 1 is 5/6. The binomial distribution The binomial distribution is a fun one. Like our uniform distribution described in the previous section, it is discrete. When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success. Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip—if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect to see in 30 flips of a fair coin. Figure 4.2: A binomial distribution (n=30, p=0.5) On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected. "How can we use the binomial distribution in practice?", you ask. Well, let's look at an application. Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:   > pbinom(10, size=30, prob=.5)   [1] 0.04936857 It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word? *If you're interested The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to: P(k successes) = (n choose k) * p^k * (1 - p)^(n - k) where p is the probability of getting one success and: (n choose k) = n! / (k! * (n - k)!) The normal distribution Remember when we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters. The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located and σ, the standard deviation, describes how wide or narrow the distribution is. Figure 4.3: Normal distributions with different parameters The distribution of heights of American females is approximately normally distributed with parameters µ= 65 inches and σ= 3.5 inches. Figure 4.4: Normal distributions with different parameters With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights.
We can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights. What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this: Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity    > f <- function(x){ dnorm(x, mean=65, sd=3.5) }   > integrate(f, 70, Inf)   0.07656373 with absolute error < 2.2e-06 The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller. Luckily for us, the normal distribution is so popular and well studied, that there is a function built into R, so we don't need to use integration ourselves.   > pnorm(70, mean=65, sd=3.5)   [1] 0.9234363  The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P (> 70 inches), we can either subtract this value by 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier. Summary You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r): Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r) R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) Resources for Article: Further resources on this subject: Dealing With A Mess [article] Navigating The Online Drupal Community [article] Design With Spring AOP [article]

Introduction to Clustering and Unsupervised Learning

Packt
23 Feb 2016
16 min read
The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn: The ways clustering tasks differ from the classification tasks How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.) Understanding clustering Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same—group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications: Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships. Clustering as a machine learning task Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will, sometimes, see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return the groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science. 
To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. 
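In R itself, the most convenient way to experiment with these ideas is the built-in kmeans() function, whose default algorithm is, if I recall correctly, the very Hartigan-Wong method cited next. Here is a minimal sketch on made-up data loosely mimicking the conference example; the publication counts below are invented purely for illustration.

set.seed(123)

# invented publication counts: three loose groups of scholars
scholars <- data.frame(
  comp.sci.papers = c(rnorm(10, 20, 3), rnorm(10, 2, 1),  rnorm(10, 18, 3)),
  math.papers     = c(rnorm(10, 2, 1),  rnorm(10, 20, 3), rnorm(10, 19, 3))
)

# nstart=10 runs ten random initializations and keeps the best solution
fit <- kmeans(scholars, centers=3, nstart=10)

fit$centers          # coordinates of the final cluster centroids
table(fit$cluster)   # how many scholars fell into each cluster

Because the final clusters depend on the random starting centers, the nstart argument, which keeps the best of several random starts, is a cheap way to hedge against an unlucky draw.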
One popular approach is described in: Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108. Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following lists some reasons why k-means is still used widely, as well as its main weaknesses:

Strengths:
- Uses simple principles that can be explained in non-statistical terms
- Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
- Performs well enough under many real-world use cases

Weaknesses:
- Not as sophisticated as more modern clustering algorithms
- Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters
- Requires a reasonable guess as to how many clusters naturally exist in the data
- Not ideal for non-spherical clusters or clusters of widely varying density

The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters. We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood. Using distance to assign and update clusters As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random.
These points are indicated by the star, triangle, and diamond in the following diagram: It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results. In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035. After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is: For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows: > sqrt((5 - 0)^2 + (1 - 2)^2) [1] 5.09902 Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time. As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all the three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds: Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster. 
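To make the assign-and-update idea concrete, here is a bare-bones sketch of a single pass in R. This is not how any production implementation is written (it ignores, for instance, the empty-cluster edge case), and the data points are randomly generated just for illustration.

set.seed(42)

# made-up two-dimensional feature matrix, already on a comparable scale
pts <- matrix(runif(60, 0, 10), ncol=2)
k <- 3

# initialization: choose k random examples to serve as the starting centers
centers <- pts[sample(nrow(pts), k), ]

# assignment step: label each point with its nearest center (Euclidean distance)
dists <- sapply(1:k, function(j) {
  sqrt(rowSums((pts - matrix(centers[j, ], nrow(pts), 2, byrow=TRUE))^2))
})
clusters <- max.col(-dists)   # column index of the smallest distance per row

# update step: move each center to the centroid of the points assigned to it
centers <- t(sapply(1:k, function(j) {
  colMeans(pts[clusters == j, , drop=FALSE])
}))

# in a full implementation, these two steps repeat until assignments stop changing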
The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. 
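One common way to look for that point of diminishing returns in R is to plot a heterogeneity measure, such as the total within-cluster sum of squares that kmeans() reports, against a range of candidate k values. A rough sketch follows, assuming dat is a data frame of normalized numeric features; both the name dat and the range of k are placeholders.

set.seed(2016)
ks <- 1:10

# total within-cluster sum of squares for each candidate k
tot.within <- sapply(ks, function(k) {
  kmeans(dat, centers=k, nstart=25)$tot.withinss
})

# the "elbow" is the k after which this curve stops dropping steeply
plot(ks, tot.within, type="b",
     xlab="number of clusters (k)",
     ylab="total within-cluster sum of squares")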
As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements. For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145. The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change a little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings. Summary This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r) Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article:   Further resources on this subject: Displaying SQL Server Data using a Linq Data Source [article] Probability of R? [article] Working with Commands and Plugins [article]