Data | 0 articles | Tech News, Tutorials & Expert Insights

01 Apr 2016

16 min read

Machine Learning Tasks

01 Apr 2016

In this article written by David Julian, author of the book Designing Machine Learning Systems with Python, the author wants to state that, he will first introduce the basic machine learning tasks. Classification is probably the most common task, due in part to the fact that it is relatively easy, well understood, and solves a lot of common problems. Multiclass classification (for instance, handwriting recognition) can sometimes be achieved by chaining binary classification tasks. However, we lose information this way, and we become unable to define a single decision boundary. For this reason, multiclass classification is often treated separately from binary classification. (For more resources related to this topic, see here.) There are cases where we are not interested in discrete classes but rather a real number, for instance, a probability. These type of problems are regression problems. Both classification and regression require a training set of correctly labelled data. They are supervised learning problems. Originating from these basic machine tasks are a number of derived tasks. In many applications, this may simply be applying the learning model to a prediction to establish a causal relationship. We must remember that explaining and predicting are not the same. A model can make a prediction, but unless we know explicitly how it made the prediction, we cannot begin to form a comprehensible explanation. An explanation requires human knowledge of the domain. We can also use a prediction model to find exceptions from a general pattern. Here, we are interested in the individual cases that deviate from the predictions. This is often called anomaly detection and has wide applications in areas such as detecting bank fraud, noise filtering, and even in the search for extraterrestrial life. An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering, to partition the entire domain but rather to find a subgroup that has a substantially different distribution. In essence, subgroup discovery is trying to find relationships between a dependent target variable and many independent explaining variables. We are not trying to find a complete relationship but rather a group of instances that are different in ways that are important in the domain. For instance, establishing the subgroups, smoker = true and family history =true, for a target variable of heart disease =true. Finally, we consider control type tasks. These act to optimize control setting to maximize a pay off is given different conditions. This can be achieved in several ways. We can clone expert behavior; the machine learns directly from a human and makes predictions of actions given different conditions. The task is to learn a prediction model for the expert's actions. This is similar to reinforcement learning, where the task is to learn about the relationship between conditions and optimal action. Clustering, on the other hand, is the task of grouping items without any information on that group; this is an unsupervised learning task. Clustering is basically making a measurement of similarity. Related to clustering is association, which is an unsupervised task to find a certain type of pattern in the data. This task is behind movie recommender systems, and customers who bought this also bought .. on checkout pages of online stores. Data for machine learning When considering raw data for machine learning applications, there are three separate aspects: The volume of the data The velocity of the data The variety of the data Data volume The volume problem can be approached from three different directions: efficiency, scalability, and parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of information. A component of this is the underlying processing power of the hardware. The other component, and one that we have more control over, is ensuring our algorithms are not wasting precious processing cycles on unnecessary tasks. Scalability is really about brute force, and throwing as much hardware at a problem as you can. With Moore's law, which predicts the trend of computer power doubling every two years and reaching its limit, it is clear that scalability is not, by its self, going to be able to keep pace with the ever increasing amounts of data. Simply adding more memory and faster processors is not, in many cases, going to be a cost effective solution. Parallelism is a growing area of machine learning, and it encompasses a number of different approaches from harnessing capabilities of multi core processors, to large scale distributed computing on many different platforms. Probably, the most common method is to simply run the same algorithm on many machines, each with a different set of parameters. Another method is to decompose a learning algorithm into an adaptive sequence of queries, and have these queries processed in parallel. A common implementation of this technique is known as MapReduce, or its open source version, Hadoop. Data velocity The velocity problem is often approached in terms of data producers and data consumers. The data transfer rate between the two is its velocity, and it can be measured in interactive response times. This is the time it takes from a query being made to its response being delivered. Response times are constrained by latencies such as hard disk read and write times, and the time it takes to transmit data across a network. Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way products and services are delivered. This increasing flow of data has led to the idea of streaming processing. When input data is at a velocity that makes it impossible to store in its entirety, a level of analysis is necessary as the data streams, in essence, deciding what data is useful and should be stored and what data can be thrown away. An extreme example is the Large Hadron Collider at CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as it is being generated, looking at the information needle in the data haystack. Another instance where processing data streams might be important is when an application requires an immediate response. This is becoming increasingly used in applications such as online gaming and stock market trading. It is not just the velocity of incoming data that we are interested in. In many applications, particularly on the web, the velocity of a system's output is also important. Consider applications such as recommender systems, which need to process large amounts of data and present a response in the time it takes for a web page to load. Data variety Collecting data from different sources invariably means dealing with misaligned data structures, and incompatible formats. It also often means dealing with different semantics and having to understand a data system that may have been built on a pretty different set of logical principles. We have to remember that, very often, data is repurposed for an entirely different application than the one it was originally intended for. There is a huge variety of data formats and underlying platforms. Significant time can be spent converting data into one consistent format. Even when this is done, the data itself needs to be aligned such that each record consists of the same number of features and is measured in the same units. Models The goal in machine learning is not to just solve an instance of a problem, but to create a model that will solve unique problems from new data. This is the essence of learning. A learning model must have a mechanism for evaluating its output, and in turn, changing its behavior to a state that is closer to a solution. A model is essentially a hypothesis: a proposed explanation for a phenomenon. The goal is to apply a generalization to the problem. In the case of supervised learning, problem knowledge gained from the training set is applied to the unlabeled test. In the case of an unsupervised learning problem, such as clustering, the system does not learn from a training set. It must learn from the characteristics of the dataset itself, such as degree of similarity. In both cases, the process is iterative. It repeats a well-defined set of tasks, that moves the model closer to a correct hypothesis. There are many models and as many variations on these models as there are unique solutions. We can see that the problems that machine learning systems solve (regression, classification, association, and so on) come up in many different settings. They have been used successfully in almost all branches of science, engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the domains they operate in. This diversity of models gives machine learning systems great problem solving powers. However, it can also be a bit daunting for the designer to decide what is the best model, or models, for a particular problem. To complicate things further, there are often several models that may solve your task, or your task may need several models. The most accurate and efficient pathway through an original problem is something you simply cannot know when you embark upon such a project. There are several modeling approaches. These are really different perspectives that we can use to help us understand the problem landscape. A distinction can be made regarding how a model divides up the instance space. The instance space can be considered all possible instances of your data, regardless of whether each instance actually appears in the data. The data is a subset of the instance space. There are two approaches to dividing up this space: grouping and grading. The key difference between the two is that grouping models divide the instance space into fixed discrete units called segments. Each segment has a finite resolution and cannot distinguish between classes beyond this resolution. Grading, on the other hand, forms a global model over the entire instance space, rather than dividing the space into segments. In theory, the resolution of a grading model is infinite, and it can distinguish between instances no matter how similar they are. The distinction between grouping and grading is not absolute, and many models contain elements of both. Geometric models One of the most useful approaches to machine learning modeling is through geometry. Geometric models use the concept of instance space. The most obvious example is when all the features are numerical and can become coordinates in a Cartesian coordinate system. When we only have two or three features, they are easy to visualize. Since many machine learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these spaces is impossible. Importantly, many of the geometric concepts, such as linear transformations, still apply in this hyper space. This can help us better understand our models. For instance, we expect many learning algorithms to be translation invariant, which means that it does not matter where we place the origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to measure similarity between instances; this gives us a method to cluster alike instances and form a decision boundary between them. Probabilistic models Often, we will want our models to output probabilities rather than just binary true or false. When we take a probabilistic approach, we assume that there is an underlying random process that creates a well-defined, but unknown, probability distribution. Probabilistic models are often expressed in the form of a tree. Tree models are ubiquitous in machine learning, and one of their main advantages is that they can inform us about the underlying structure of a problem. Decision trees are naturally easy to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if we have to predict a category, we can also expose the logical steps that gave rise to a particular result. Also, tree models generally require less data preparation than other models and can handle numerical and categorical data. On the down side, tree models can create overly complex models that do not generalize very well to new data. Another potential problem with tree models is that they can become very sensitive to changes in the input data, and as we will see later, this problem can be mitigated by using them as ensemble learners. Linear models A key concept in machine learning is that of the linear model. Linear models form the foundation of many advanced nonlinear techniques such as support vector machines and neural networks. They can be applied to any predictive task such as classification, regression, or probability estimation. When responding to small changes in the input data, and provided that our data consists of entirely uncorrelated features, linear models tend to be more stable than tree models. Tree models can over-respond to small variation in training data. This is because splits at the root of a tree have consequences that are not recoverable further down a branch, potentially making the rest of the tree significantly different. Linear models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as you would expect, this has the opposite effect of making it less sensitive to nuanced data. This is described by the terms variance (for over fitting models) and bias (for under fitting models). A linear model is typically low variance and high bias. Linear models are generally best approached from a geometric perspective. We know we can easily plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can still use many of the mathematical tools that we intuitively apply to three dimensions of space. While it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts (such as lines, planes, angles, and distance) to describe them. With geometric models, we describe each instance as having a set of real-valued features, each of which is a dimension in a space. Model ensembles Ensemble techniques can be divided broadly into two types. The Averaging Method: With this method, several estimators are run independently, and their predictions are averaged. This includes the random forests and bagging methods. The Boosting Methods: With this method, weak learners are built sequentially using weighted distributions of the data, based on the error rates. Ensemble methods use multiple models to obtain better performance than any single constituent model. The aim is to not only build diverse and robust models, but also to work within limitations such as processing speed and return times. When working with large datasets and quick response times, this can be a significant developmental bottleneck. Troubleshooting and diagnostics are important aspects of working with all machine learning models, but they are especially important when dealing with models that might take days to run. The types of machine learning ensembles that can be created are as diverse as the models themselves, and the main considerations revolve around three things: how we divide our data, how we select the models, and the methods we use to combine their results. This simplistic statement actually encompasses a very large and diverse space. Neural nets When we approach the problem of trying to mimic the brain, we are faced with a number of difficulties. Considering all the different things the brain does, we might first think that it consists of a number of different algorithms, each specialized to do a particular task, and each hard wired into different parts of the brain. This approach translates to considering the brain as a number of subsystems, each with its own program and task. For example, the auditory cortex for perceiving sound has its own algorithm that, say, does a Fourier transform on an incoming sound wave to detect the pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding the signals from the optic nerve and translating them into the sensation of sight. There is, however, growing evidence that the brain does not function like this at all. It appears, from biological studies, that brain tissue in different parts of the brain can relearn how to interpret inputs. So, rather than consisting of specialized subsystems that are programmed to perform specific tasks, the brain uses the same algorithm to learn different tasks. This single algorithm approach has many advantages, not least of which is that it is relatively easy to implement. It also means that we can create generalized models and then train them to perform specialized tasks. Like in real brains, using a singular algorithm to describe how each neuron communicates with the other neurons around it allows artificial neural networks to be adaptable and able to carry out multiple higher-level tasks. Much of the most important work being done with neural net models, and indeed machine learning in general, is through the use of very complex neural nets with many layers and features. This approach is often called deep architecture or deep learning. Human and animal learning occurs at a rate and depth that no machine can match. Many of the elements of biological learning still remain a mystery. One of the key areas of research, and one of the most useful in application, is that of object recognition. This is something quite fundamental to living systems, and higher animals have evolved to possessing an extraordinary ability to learn complex relationships between objects. Biological brains have many layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex objects, such as people's faces or handwritten digits, a fundamental task is to create a hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is to transform raw data, such as a set of pixel values, into something that we can describe as, say, a person riding bicycle. Resources for Article: Further resources on this subject: Python Data Structures [article] Exception Handling in MySQL for Python [article] Python Data Analysis Utilities [article]

0
0
4389

Packt

31 Mar 2016

8 min read

Why Mesos?

Packt

31 Mar 2016

8 min read

In this article by Dipa Dubhasi and Akhil Das authors of the book Mastering Mesos, delves into understanding the importance of Mesos. Apache Mesos is an open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a "distributed systems kernel/distributed datacenter OS". It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects. (For more resources related to this topic, see here.) Introduction to the datacenter OS and architecture of Mesos Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyper-scale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy. Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced. Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources. Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support. The idea behind a datacenter OS is that of an intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure. The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multilatency, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources. From a developer's perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, let's consider the case of Data Science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how the resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change. Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues. In a world where distributed computing becomes the norm, the datacenter OS is a boon in disguise. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability even for noninfrastructural professionals. Examples of some organizations administering their hyperscale datacenters via the datacenter OS are Google with the Borg (and next geneneration Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications. It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way with a number of leading companies, such as Apple, Twitter, Netflix, and AirBnB among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core. The architecture of Mesos Mesos is an open-source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making it available as a single reservoir of highly available resources for all the different frameworks to utilize. For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM) and another 5-node cluster (4 CPUs and 16 GB RAM), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per-framework. Some of the important features of Mesos are: Scalability: It can elastically scale to over 50,000 nodes Resource isolation: This is achieved through Linux/Docker containers Efficiency: This is achieved through CPU and memory-aware resource scheduling across multiple frameworks High availability: This is through Apache ZooKeeper Interface: A web UI for monitoring the cluster state Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos' design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are: Different frame data replication works can independently devise methods to address their data locality, fault-tolerance, and other such needs. It simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile Mesos' architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve good degree of data locality for each framework sharing the same cluster. An alternative architecture would implement a global scheduler that took framework requirements, organizational priorities, and resource availability as inputs and provided a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators. Summary Thus in this article, we introduced Mesos, and then dived deep into its architecture to understand importance of Mesos. Resources for Article: Further resources on this subject: Understanding Mesos Internals [article] Leveraging Python in the World of Big Data [article] Self-service Business Intelligence, Creating Value from Data [article]

0
0
1993

article-image-support-vector-machines-classification-engine

Packt

17 Mar 2016

9 min read

Support Vector Machines as a Classification Engine

Packt

17 Mar 2016

9 min read

In this article by Tomasz Drabas, author of the book, Practical Data Analysis Cookbook, we will discuss on how Support Vector Machine models can be used as a classification engine. (For more resources related to this topic, see here.) Support Vector Machines Support Vector Machines (SVMs) are a family of extremely powerful models that can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. While many classifiers exist that can classify linearly separable data (for example, logistic regression), SVMs can handle highly non-linear problems using a kernel trick that implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable. The mechanics of the machine Given a set of n points of a form (x1,y1)...(xn,yn), where xi is a z-dimensional input vector and yi is a class label, the SVM aims at finding the maximum margin hyperplane that separates the data points: In a two-dimensional dataset, with linearly separable data points (as shown in the preceding figure), the maximum margin hyperplane would be a line that would maximize the distance between each of the classes. The hyperplane could be expressed as a dot product of the set of input vectors x and a vector normal to the hyperplane W:W.X=b, where b is the offset from the origin of the coordinate system. To find the hyperplane, we solve the following optimization problem: The constraint of our optimization problem effectively states that no point can cross the hyperplane if it does not belong to the class on that side of the hyperplane. Linear SVM Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM but here, we decided to use MLPY (http://mlpy.sourceforge.net): import pandas as pd import numpy as np import mlpy as ml First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the data): # the file name of the dataset r_filename = 'Data/Chapter03/bank_contacts.csv' # read the data csv_read = pd.read_csv(r_filename) The dataset that we use was described in S. Moro, P. Cortez, and P. Rita. A data-driven approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014 and found here http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not. Once the file was loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_dataset(...) method: def split_data(data, y, x = 'All', test_size = 0.33): ''' Method to split the data into training and testing ''' import sys # dependent variable variables = {'y': y} # and all the independent if x == 'All': allColumns = list(data.columns) allColumns.remove(y) variables['x'] = allColumns else: if type(x) != list: print('The x parameter has to be a list...') sys.exit(1) else: variables['x'] = x # create a variable to flag the training sample data['train'] = np.random.rand(len(data)) < (1 - test_size) # split the data into training and testing train_x = data[data.train] [variables['x']] train_y = data[data.train] [variables['y']] test_x = data[~data.train][variables['x']] test_y = data[~data.train][variables['y']] return train_x, train_y, test_x, test_y, variables['x'] We randomly set 1/3 of the dataset aside for testing purposes and use the remaining 2/3 for the training of the model: # split the data into training and testing train_x, train_y, test_x, test_y, labels = hlp.split_data( csv_read, y = 'credit_application' ) Once we read the data and split it into training and testing datasets, we can estimate the model: # create the classifier object svm = ml.LibSvm(svm_type='c_svc', kernel_type='linear', C=100.0) # fit the data svm.learn(train_x,train_y) The svm_type parameter of the .LibSvm(...) method controls what algorithm to use to estimate the SVM. Here, we use the c_svc method—a C-support Vector Classifier. The method specifies how much you want to avoid misclassifying observations: the larger values of C parameter will shrink the margin for the hyperplane (theb) so that more of the observations are correctly classified. You can also specify nu_svc with a nu parameter that controls how much of your sample (at most) can be misclassified and how many of your observations (at least) can become support vectors. Here, we estimate an SVM with a linear kernel, so let's talk about kernels. Kernels A kernel function K is effectively a function that computes a dot product between two n-dimensional vectors, K: Rn.Rn --> R. In other words, the kernel function takes two vectors and produces a scalar: The linear kernel does not effectively transform the data into a higher dimensional space. This is not true for polynomial or Radial Basis Function (RBF) kernels that transform the input feature space into higher dimensions. In case of the polynomial kernel of degree d, the obtained feature space has (n+d/d) dimensions for the Rn dimensional input feature space. As you can see, the number of additional dimensions can grow very quickly and this would pose significant problems in estimating the model if we would explicitly transform the data into higher dimensions. Thankfully, we do not have to do this as that's where the kernel trick comes into play. The truth is that SVMs do not have to work explicitly in higher dimensions but can rather implicitly map the data to higher dimensions using pairwise inner products (instead of an explicit dot product) and then use it to find the maximum margin hyperplane. You can find a really good explanation of the kernel trick at http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html. Back to our example The .learn(...) method of the .LibSvm(...) object estimates the model. Once the model is estimated, we can test how well it performs. First, we use the estimated model to predict the classes for the observations in the testing dataset: predicted_l = svm.pred(test_x) Next, we will use some of the scikit-learn methods to print the basic statistics for our model: def printModelSummary(actual, predicted): ''' Method to print out model summaries ''' import sklearn.metrics as mt print('Overall accuracy of the model is {0:.2f} percent' .format( (actual == predicted).sum() / len(actual) * 100)) print('Classification report: n', mt.classification_report(actual, predicted)) print('Confusion matrix: n', mt.confusion_matrix(actual, predicted)) print('ROC: ', mt.roc_auc_score(actual, predicted)) First, we calculate the overall accuracy of the model expressed as a ratio of properly classified observations to the total number of observations in the testing sample. Next, we print the classification report: The precision is the model's ability to avoid classifying an observation as positive when it is not. It is a ratio of true positives to the overall number of positively classified records. The overall precision score is a weighted average of the individual precision scores where the weight is the support. The support is the total number of actual observations in each class. The total precision for our model is not too bad—89 out of 100. However, when we look at the precision to classify the true positives, the situation is not as good—only 63 out of 100 were properly classified. Recall can be viewed as the model's capacity to find all the positive samples. It is a ratio of true positives to the sum of true positives and false negatives. The recall for the class 0.0 is almost perfect but for class 1.0, it looks really bad. This might be a problem with the fact that our sample is not balanced, but it is more likely that the features we use to classify the data do not really capture the differences between the two groups. The f1-score is effectively a weighted amalgam of the precision and recall: it is a ratio of twice the product of precision and recall to their sum. In one measure, it shows whether the model performs well or not. At the general level, the model does not perform badly but when looked at the model's ability to classify the true signal, it fails gravely. It is a perfect example why judging the model at the general level might be misleading when dealing with samples that are heavily unbalanced. RBF kernel SVM Given that the linear kernel performed poorly, our dataset might not be linearly separable. Thus, let's try the RBF kernel. The RBF kernel is given as K(x,y)=e ||x-y||2/2a2, where ||x-y||2 is a Euclidean distance between the two vectors, x and y, and σ is a free parameter. The value of RBF equals to 1 when x=y and gradually falls to 0 when the distance approaches infinity. To fit an RBF version of our model, we can specify our svm object as follows: svm = ml.LibSvm(svm_type='c_svc', kernel_type='rbf', gamma=0.1, C=1.0) The gamma parameter here specifies how far the influence of a single support vector reaches. Visually, you can investigate the relationship between gamma and C parameters at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html. The rest of the code for the model estimation follows in a similar fashion as with the linear kernel and we obtain the following results: The results are even worse than the linear kernel as the precision and recall were lost across the board. The SVM with the RBF kernel performed worse when classifying calls that resulted in applying for the credit card and those that did not. Summary In this article, we saw that the problem is not with the model but rather, the dataset that we use does not explain the variance sufficiently. This requires going back to the drawing board and selecting other features. Resources for Article: Further resources on this subject: Push your data to the Web [article] Transferring Data from MS Access 2003 to SQL Server 2008 [article] Exporting data from MS Access 2003 to MySQL [article]

0
0
4260

article-image-integrating-imagery-creating-and-styling-features-openlayers-3

Packt

17 Mar 2016

21 min read

Integrating Imagery with Creating and Styling Features in OpenLayers 3

Packt

17 Mar 2016

21 min read

0
0
3554

article-image-welcome-to-machine-learning-using-the-net-framework

Oli Huggins

16 Mar 2016

26 min read

Welcome to Machine Learning using the .NET Framework

Oli Huggins

16 Mar 2016

26 min read

This article by, Jamie Dixon, the author of the book, Mastering .NET Machine Learning, will focus on some of the larger questions you might have about machine learning using the .NET Framework, namely: What is machine learning? Why should we consider it in the .NET Framework? How can I get started with coding? (For more resources related to this topic, see here.) What is machine learning? If you check out on Wikipedia, you will find a fairly abstract definition of machine learning: "Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions." I like to think of machine learning as computer programs that produce different results as they are exposed to more information without changing their source code (and consequently needed to be redeployed). For example, consider a game that I play with the computer. I show the computer this picture and tell it "Blue Circle". I then show it this picture and tell it "Red Circle". Next I show it this picture and say "Green Triangle." Finally, I show it this picture and ask it "What is this?". Ideally the computer would respond, "Green Circle." This is one example of machine learning. Although I did not change my code or recompile and redeploy, the computer program can respond accurately to data it has never seen before. Also, the computer code does not have to explicitly write each possible data permutation. Instead, we create models that the computer applies to new data. Sometimes the computer is right, sometimes it is wrong. We then feed the new data to the computer to retrain the model so the computer gets more and more accurate over time—or, at least, that is the goal. Once you decide to implement some machine learning into your code base, another decision has to be made fairly early in the process. How often do you want the computer to learn? For example, if you create a model by hand, how often do you update it? With every new data row? Every month? Every year? Depending on what you are trying to accomplish, you might create a real-time ML model, a near-time model, or a periodic model. Why .NET? If you are a Windows developer, using .NET is something you do without thinking. Indeed, a vast majority of Windows business applications written in the last 15 years use managed code—most of it written in C#. Although it is difficult to categorize millions of software developers, it is fair to say that .NET developers often come from nontraditional backgrounds. Perhaps a developer came to .NET from a BCSC degree but it is equally likely s/he started writing VBA scripts in Excel, moving up to Access applications, and then into VB.NET/C# applications. Therefore, most .NET developers are likely to be familiar with C#/VB.NET and write in an imperative and perhaps OO style. The problem with this rather narrow exposure is that most machine learning classes, books, and code examples are in R or Python and very much use a functional style of writing code. Therefore, the .NET developer is at a disadvantage when acquiring machine learning skills because of the need to learn a new development environment, a new language, and a new style of coding before learning how to write the first line of machine learning code. If, however, that same developer could use their familiar IDE (Visual Studio) and the same base libraries (the .NET Framework), they can concentrate on learning machine learning much sooner. Also, when creating machine learning models in .NET, they have immediate impact as you can slide the code right into an existing C#/VB.NET solution. On the other hand, .NET is under-represented in the data science community. There are a couple of different reasons floating around for that fact. The first is that historically Microsoft was a proprietary closed system and the academic community embraced open source systems such as Linux and Java. The second reason is that much academic research uses domain-specific languages such as R, whereas Microsoft concentrated .NET on general purpose programming languages. Research that moved to industry took their language with them. However, as the researcher's role is shifted from data science to building programs that can work at real time that customers touch, the researcher is getting more and more exposure to Windows and Windows development. Whether you like it or not, all companies which create software that face customers must have a Windows strategy, an iOS strategy, and an Android strategy. One real advantage to writing and then deploying your machine learning code in .NET is that you can get everything with one stop shopping. I know several large companies who write their models in R and then have another team rewrite them in Python or C++ to deploy them. Also, they might write their model in Python and then rewrite it in C# to deploy on Windows devices. Clearly, if you could write and deploy in one language stack, there is a tremendous opportunity for efficiency and speed to market. What version of the .NET Framework are we using? The .NET Framework has been around for general release since 2002. The base of the framework is the Common Language Runtime or CLR. The CLR is a virtual machine that abstracts much of the OS specific functionality like memory management and exception handling. The CLR is loosely based on the Java Virtual Machine (JVM). Sitting on top of the CLR is the Framework Class Library (FCL) that allows different languages to interoperate with the CLR and each other: the FCL is what allows VB.Net, C#, F#, and Iron Python code to work side-by-side with each other. Since its first release, the .NET framework has included more and more features. The first release saw support for the major platform libraries like WinForms, ASP.NET, and ADO.NET. Subsequent releases brought in things like Windows Communication Foundation (WCF), Language Integrated Query (LINQ), and Task Parallel Library (TPL). At the time of writing, the latest version is of the .Net Framework is 4.6.2. In addition to the full-Monty .NET Framework, over the years Microsoft has released slimmed down versions of the .NET Framework intended to run on machines that have limited hardware and OS support. The most famous of these releases was the Portable Class Library (PCL) that targeted Windows RT applications running Windows 8. The most recent incantation of this is Universal Windows Applications (UWA), targeting Windows 10. At Connect(); in November 2015, Microsoft announced GA of the latest edition of the .NET Framework. This release introduced the .Net Core 5. In January, they decided to rename it to .Net Core 1.0. .NET Core 1.0 is intended to be a slimmed down version of the full .NET Framework that runs on multiple operating systems (specifically targeting OS X and Linux). The next release of ASP.NET (ASP.NET Core 1.0) sits on top of .NET Core 1.0. ASP.NET Core 1.0 applications that run on Windows can still run the full .NET Framework. (https://blogs.msdn.microsoft.com/webdev/2016/01/19/asp-net-5-is-dead-int...) In this book, we will be using a mixture of ASP.NET 4.0, ASP.NET 5.0, and Universal Windows Applications. As you can guess, machine learning models (and the theory behind the models) change with a lot less frequency than framework releases so the most of the code you write on .NET 4.6 will work equally well with PCL and .NET Core 1.0. Saying that, the external libraries that we will use need some time to catch up—so they might work with PCL but not with .NET Core 1.0 yet. To make things realistic, the demonstration projects will use .NET 4.6 on ASP.NET 4.x for existing (Brownfield) applications. New (Greenfield) applications will be a mixture of a UWA using PCL and ASP.NET 5.0 applications. Why write your own? It seems like all of the major software companies are pitching machine learning services such as Google Analytics, Amazon Machine Learning Services, IBM Watson, Microsoft Cortana Analytics, to name a few. In addition, major software companies often try to sell products that have a machine learning component, such as Microsoft SQL Server Analysis Service, Oracle Database Add-In, IBM SPSS, or SAS JMP. I have not included some common analytical software packages such as PowerBI or Tableau because they are more data aggregation and report writing applications. Although they do analytics, they do not have a machine learning component (not yet at least). With all these options, why would you want to learn how to implement machine learning inside your applications, or in effect, write some code that you can purchase elsewhere? It is the classic build versus buy decision that every department or company has to make. You might want to build because: You really understand what you are doing and you can be a much more informed consumer and critic of any given machine learning package. In effect, you are building your internal skill set that your company will most likely prize. Another way to look at it, companies are not one tool away from purchasing competitive advantage because if they were, their competitors could also buy the same tool and cancel any advantage. However, companies can be one hire away or more likely one team away to truly have the ability to differentiate themselves in their market. You can get better performance by executing locally, which is especially important for real-time machine learning and can be implemented in disconnected or slow connection scenarios. This becomes particularly important when we start implementing machine learning with Internet of Things (IoT) devices in scenarios where the device has a lot more RAM than network bandwidth. Consider the Raspberry Pi running Windows 10 on a pipeline. Network communication might be spotty, but the machine has plenty of power to implement ML models. You are not beholden to any one vendor or company, for example, every time you implement an application with a specific vendor and are not thinking about how to move away from the vendor, you make yourself more dependent on the vendor and their inevitable recurring licensing costs. The next time you are talking to the CTO of a shop that has a lot of Oracle, ask him/her if they regret any decision to implement any of their business logic in Oracle databases. The answer will not surprise you. A majority of this book's code is written in F#—an open source language that runs great on Windows, Linux, and OS X. You can be much more agile and have much more flexibility in what you implement. For example, we will often re-train our models on the fly and when you write your own code, it is fairly easy to do this. If you use a third-party service, they may not even have API hooks to do model training and evaluation, so near-time model changes are impossible. Once you decide to go native, you have a choice of rolling your own code or using some of the open source assemblies out there. This book will introduce both the techniques to you, highlight some of the pros and cons of each technique, and let you decide how you want to implement them. For example, you can easily write your own basic classifier that is very effective in production but certain models, such as a neural network, will take a considerable amount of time and energy and probably will not give you the results that the open source libraries do. As a final note, since the libraries that we will look at are open source, you are free to customize pieces of it—the owners might even accept your changes. However, we will not be customizing these libraries in this book. Why open data? Many books on machine learning use datasets that come with the language install (such as R or Hadoop) or point to public repositories that have considerable visibility in the data science community. The most common ones are Kaggle (especially the Titanic competition) and the UC Irvine's datasets. While these are great datasets and give a common denominator, this book will expose you to datasets that come from government entities. The notion of getting data from government and hacking for social good is typically called open data. I believe that open data will transform how the government interacts with its citizens and will make government entities more efficient and transparent. Therefore, we will use open datasets in this book and hopefully you will consider helping out with the open data movement. Why F#? As we will be on the .NET Framework, we could use either C#, VB.NET, or F#. All three languages have strong support within Microsoft and all three will be around for many years. F# is the best choice for this book because it is unique in the .NET Framework for thinking in the scientific method and machine learning model creation. Data scientists will feel right at home with the syntax and IDE (languages such as R are also functional first languages). It is the best choice for .NET business developers because it is built right into Visual Studio and plays well with your existing C#/VB.NET code. The obvious alternative is C#. Can I do this all in C#? Yes, kind of. In fact, many of the .NET libraries we will use are written in C#. However, using C# in our code base will make it larger and have a higher chance of introducing bugs into the code. At certain points, I will show some examples in C#, but the majority of the book is in F#. Another alternative is to forgo .NET altogether and develop the machine learning models in R and Python. You could spin up a web service (such as AzureML), which might be good in some scenarios, but in disconnected or slow network environments, you will get stuck. Also, assuming comparable machines, executing locally will perform better than going over the wire. When we implement our models to do real-time analytics, anything we can do to minimize the performance hit is something to consider. A third alternative that the .NET developers will consider is to write the models in T-SQL. Indeed, many of our initial models have been implemented in T-SQL and are part of the SQL Server Analysis Server. The advantage of doing it on the data server is that the computation is as close as you can get to the data, so you will not suffer the latency of moving large amount of data over the wire. The downsides of using T-SQL are that you can't implement unit tests easily, your domain logic is moving away from the application and to the data server (which is considered bad form with most modern application architecture), and you are now reliant on a specific implementation of the database. F# is open source and runs on a variety of operating systems, so you can port your code much more easily. Getting ready for Machine Learning In this section, we will install Visual Studio, take a quick lap around F#, and install the major open source libraries that we will be using. Setting up Visual Studio To get going, you will need to download Visual Studio on a Microsoft Windows machine. As of this writing, the latest (free) version is Visual Studio 2015 Community. If you have a higher version already installed on your machine, you can skip this step. If you need a copy, head on over to the Visual Studio home page at https://www.visualstudio.com. Download the Visual Studio Community 2015 installer and execute it. Now, you will get the following screen: Select Custom installation and you will be taken to the following screen: Make sure Visual F# has a check mark next to it. Once it is installed, you should see Visual Studio in your Windows Start menu. Learning F# One of the great features about F# is that you can accomplish a whole lot with very little code. It is a very terse language compared to C# and VB.NET, so picking up the syntax is a bit easier. Although this is not a comprehensive introduction, this is going to introduce you to the major language features that we will use in this book. I encourage you to check out http://www.tryfsharp.org/ or the tutorials at http://fsharpforfunandprofit.com/ if you want to get a deeper understanding of the language. With that in mind, let's create our 1st F# project: Start Visual Studio. Navigate to File | New | Project as shown in the following screenshot: When the New Project dialog box appears, navigate the tree view to Visual F# | Windows | Console Application. Have a look at the following screenshot: Give your project a name, hit OK, and the Visual Studio Template generator will create the following boilerplate: Although Visual Studio created a Program.fs file that creates a basic console .exe application for us, we will start learning about F# in a different way, so we are going to ignore it for now. Right-click in the Solution Explorer and navigate to Add | New Item. When the Add New Item dialog box appears, select Script File. The Script1.fsx file is then added to the project. Once Script1.fsx is created, open it up, and enter the following into the file: let x = "Hello World" Highlight that entire row of code, right-click and select Execute In Interactive (or press Alt + Enter). And the F# Interactive console will pop up and you will see this: The F# Interactive is a type of REPL, which stands for Read-Evaluate-Print-Loop. If you are a .NET developer who has spent any time in SQL Server Management Studio, the F# Interactive will look very familiar to the Query Analyzer where you enter your code at the top and see how it executes at the bottom. Also, if you are a data scientist using R Studio, you are very familiar with the concept of a REPL. I have used the words REPL and FSI interchangeably in this book. There are a couple of things to notice about this first line of F# code you wrote. First, it looks very similar to C#. In fact, consider changing the code to this: It would be perfectly valid C#. Note that the red squiggly line, showing you that the F# compiler certainly does not think this is valid. Going back to the correct code, notice that type of x is not explicitly defined. F# uses the concept of inferred typing so that you don't have to write the type of the values that you create. I used the term value deliberately because unlike variables, which can be assigned in C# and VB.NET, values are immutable; once bound, they can never change. Here, we are permanently binding the name x to its value, Hello World. This notion of immutability might seem constraining at first, but it has profound and positive implications, especially when writing machine learning models. With our basic program idea proven out, let's move it over to a compliable assembly; in this case, an .exe that targets the console. Highlight the line that you just wrote, press Ctrl + C, and then open up Program.fs. Go into the code that was generated and paste it in: [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" 0 // return an integer exit code Then, add the following lines of code around what you just added: // Learn more about F# at http://fsharp.org // See the 'F# Tutorial' project for more help. open System [<EntryPoint>] let main argv = printfn "%A" argv let x = "Hello World" Console.WriteLine(x) let y = Console.ReadKey() 0 // return an integer exit code Press the Start button (or hit F5) and you should see your program run: You will notice that I had to bind the return value from Console.ReadKey() to y. In C# or VB.NET, you can get away with not handling the return value explicitly. In F#, you are not allowed to ignore the returned values. Although some might think this is a limitation, it is actually a strength of the language. It is much harder to make a mistake in F# because the language forces you to address execution paths explicitly versus accidentally sweeping them under the rug (or into a null, but we'll get to that later). In any event, let's go back to our script file and enter in another line of code: let ints = [|1;2;3;4;5;6|] If you send that line of code to the REPL, you should see this: val ints : int [] = [|1; 2; 3; 4; 5; 6|] This is an array, as if you did this in C#: var ints = new[] {1,2,3,4,5,6}; Notice that the separator is a semicolon in F# and not a comma. This differs from many other languages, including C#. The comma in F# is reserved for tuples, not for separating items in an array. We'll discuss tuples later. Now, let's sum up the values in our array: let summedValue = ints |> Array.sum While sending that line to the REPL, you should see this: val summedValue : int = 21 There are two things going on. We have the |> operator, which is a pipe forward operator. If you have experience with Linux or PowerShell, this should be familiar. However, if you have a background in C#, it might look unfamiliar. The pipe forward operator takes the result of the value on the left-hand side of the operator (in this case, ints) and pushes it into the function on the right-hand side (in this case, sum). The other new language construct is Array.sum. Array is a module in the core F# libraries, which has a series of functions that you can apply to your data. The function sum, well, sums the values in the array, as you can probably guess by inspecting the result. So, now, let's add a different function from the Array type: let multiplied = ints |> Array.map (fun i -> i * 2) If you send it to the REPL, you should see this: val multiplied : int [] = [|2; 4; 6; 8; 10; 12|] Array.map is an example of a high ordered function that is part of the Array type. Its parameter is another function. Effectively, we are passing a function into another function. In this case, we are creating an anonymous function that takes a parameter i and returns i * 2. You know it is an anonymous function because it starts with the keyword fun and the IDE makes it easy for us to understand that by making it blue. This anonymous function is also called a lambda expression, which has been in C# and VB.NET since .Net 3.5, so you might have run across it before. If you have a data science background using R, you are already quite familiar with lambdas. Getting back to the higher-ordered function Array.map, you can see that it applies the lambda function against each item of the array and returns a new array with the new values. We will be using Array.map (and its more generic kin Seq.map) a lot when we start implementing machine learning models as it is the best way to transform an array of data. Also, if you have been paying attention to the buzz words of map/reduce when describing big data applications such as Hadoop, the word map means exactly the same thing in this context. One final note is that because of immutability in F#, the original array is not altered, instead, multiplied is bound to a new array. Let's stay in the script and add in another couple more lines of code: let multiplyByTwo x = x * 2 If you send it to the REPL, you should see this: val multiplyByTwo : x:int -> int These two lines created a named function called multiplyByTwo. The function that takes a single parameter x and then returns the value of the parameter multiplied by 2. This is exactly the same as our anonymous function we created earlier in-line that we passed into the map function. The syntax might seem a bit strange because of the -> operator. You can read this as, "the function multiplyByTwo takes in a parameter called x of type int and returns an int." Note three things here. Parameter x is inferred to be an int because it is used in the body of the function as multiplied to another int. If the function reads x * 2.0, the x would have been inferred as a float. This is a significant departure from C# and VB.NET but pretty familiar for people who use R. Also, there is no return statement for the function, instead, the final expression of any function is always returned as the result. The last thing to note is that whitespace is important so that the indentation is required. If the code was written like this: let multiplyByTwo(x) = x * 2 The compiler would complain: Script1.fsx(8,1): warning FS0058: Possible incorrect indentation: this token is offside of context started at position (7:1). Since F# does not use curly braces and semicolons (or the end keyword), such as C# or VB.NET, it needs to use something to separate code. That separation is whitespace. Since it is good coding practice to use whitespace judiciously, this should not be very alarming to people having a C# or VB.NET background. If you have a background in R or Python, this should seem natural to you. Since multiplyByTwo is the functional equivalent of the lambda created in Array.map (fun i -> i * 2), we can do this if we want: let multiplied' = ints |> Array.map (fun i -> multiplyByTwo i) If you send it to the REPL, you should see this: val multiplied' : int [] = [|2; 4; 6; 8; 10; 12|] Typically, we will use named functions when we need to use that function in several places in our code and we use a lambda expression when we only need that function for a specific line of code. There is another minor thing to note. I used the tick notation for the value multiplied when I wanted to create another value that was representing the same idea. This kind of notation is used frequently in the scientific community, but can get unwieldy if you attempt to use it for a third or even fourth (multiplied'''') representation. Next, let's add another named function to the REPL: let isEven x = match x % 2 = 0 with | true -> "even" | false -> "odd" isEven 2 isEven 3 If you send it to the REPL, you should see this: val isEven : x:int -> string This is a function named isEven that takes a single parameter x. The body of the function uses a pattern-matching statement to determine whether the parameter is odd or even. When it is odd, then it returns the string odd. When it is even, it returns the string even. There is one really interesting thing going on here. The match statement is a basic example of pattern matching and it is one of the coolest features of F#. For now, you can consider the match statement much like the switch statement that you may be familiar within R, Python, C#, or VB.NET. I would have written the conditional logic like this: let isEven' x = if x % 2 = 0 then "even" else "odd" But I prefer to use pattern matching for this kind of conditional logic. In fact, I will attempt to go through this entire book without using an if…then statement. With isEven written, I can now chain my functions together like this: let multipliedAndIsEven = ints |> Array.map (fun i -> multiplyByTwo i) |> Array.map (fun i -> isEven i) If you send it to REPL, you should see this: val multipliedAndIsEven : string [] = [|"even"; "even"; "even"; "even"; "even"; "even"|] In this case, the resulting array from the first pipe Array.map (fun i -> multiplyByTwo i))gets sent to the next function Array.map (fun i -> isEven i). This means we might have three arrays floating around in memory: ints which is passed into the first pipe, the result from the first pipe that is passed into the second pipe, and the result from the second pipe. From your mental model point of view, you can think about each array being passed from one function into the next. In this book, I will be chaining pipe forwards frequently as it is such a powerful construct and it perfectly matches the thought process when we are creating and using machine learning models. You now know enough F# to get you up and running with the first machine learning models in this book. I will be introducing other F# language features as the book goes along, but this is a good start. As you will see, F# is truly a powerful language where a simple syntax can lead to very complex work. Third-party libraries The following are a few third-party libraries that we will cover in our book later on: Math.NET Math.NET is an open source project that was created to augment (and sometimes replace) the functions that are available in System.Math. Its home page is http://www.mathdotnet.com/. We will be using Math.Net's Numerics and Symbolics namespaces in some of the machine learning algorithms that we will write by hand. A nice feature about Math.Net is that it has strong support for F#. Accord.NET Accord.NET is an open source project that was created to implement many common machine learning models. Its home page is http://accord-framework.net/. Although the focus of Accord.NET was for computer vision and signal processing, we will be using Accord.Net extensively in this book as it makes it very easy to implement algorithms in our problem domain. Numl Numl is an open source project that implements several common machine learning models as experiments. Its home page is http://numl.net/. Numl is newer than any of the other third-party libraries that we will use in the book, so it may not be as extensive as the other ones, but it can be very powerful and helpful in certain situations. Summary We covered a lot of ground in this article. We discussed what machine learning is, why you want to learn about it in the .NET stack, how to get up and running using F#, and had a brief introduction to the major open source libraries that we will be using in this book. With all this preparation out of the way, we are ready to start exploring machine learning. Further resources on this subject: ASP.Net Site Performance: Improving JavaScript Loading [article] Displaying MySQL data on an ASP.NET Web Page [article] Creating a NHibernate session to access database within ASP.NET [article]

0
0
6676

Packt

10 Mar 2016

17 min read

Exploring HDFS

Packt

10 Mar 2016

17 min read

0
0
4675

article-image-getting-started-deep-learning

Packt

07 Mar 2016

12 min read

Getting Started with Deep Learning

Packt

07 Mar 2016

12 min read

In this article by Joshua F. Wiley, author of the book, R Deep Learning Essentials, we will discuss deep learning, a powerful multilayered architecture for pattern recognition, signal detection, classification, and prediction. Although deep learning is not new, it has gained popularity in the past decade due to the advances in the computational capacity and new ways of efficient training models, as well as the availability of ever growing amount of data. In this article, you will learn what deep learning is. What is deep learning? To understand what deep learning is, perhaps it is easiest to start with what is meant by regular machine learning. In general terms, machine learning is devoted to developing and using algorithms that learn from raw data in order to make predictions. Prediction is a very general term. For example, predictions from machine learning may include predicting how much money a customer will spend at a given company, or whether a particular credit card purchase is fraudulent. Predictions also encompass more general pattern recognition, such as what letters are present in a given image, or whether a picture is of a horse, dog, person, face, building, and so on. Deep learning is a branch of machine learning where a multi-layered (deep) architecture is used to map the relations between inputs or observed features and the outcome. This deep architecture makes deep learning particularly suitable for handling a large number of variables and allows deep learning to generate features as part of the overall learning algorithm, rather than feature creation being a separate step. Deep learning has proven particularly effective in the fields of image recognition (including handwriting as well as photo or object classification) and natural language processing, such as recognizing speech. There are many types of machine learning algorithms. In this article, we are primarily going to focus on neural networks as these have been particularly popular in deep learning. However, this focus does not mean that it is the only technique available in machine learning or even deep learning, nor that other techniques are not valuable or even better suited, depending on the specific task. Conceptual overview of neural networks As their name suggests, neural networks draw their inspiration from neural processes and neurons in the body. Neural networks contain a series of neurons, or nodes, which are interconnected and process input. The connections between neurons are weighted, with these weights based on the function being used and learned from the data. Activation in one set of neurons and the weights (adaptively learned from the data) may then feed into other neurons, and the activation of some final neuron(s) is the prediction. To make this process more concrete, an example from human visual perception may be helpful. The term grandmother cell is used to refer to the concept that somewhere in the brain there is a cell or neuron that responds specifically to a complex and specific object, such as your grandmother. Such specificity would require thousands of cells to represent every unique entity or object we encounter. Instead, it is thought that visual perception occurs by building up more basic pieces into complex representations. For example, the following is a picture of a square: Figure 1 Rather than our visual system having cells neurons that are activated only upon seeing the gestalt, or entirety, of a square, we can have cells that recognize horizontal and vertical lines, as shown in the following: Figure 2 In this hypothetical case, there may be two neurons, one which is activated when it senses horizontal lines and another that is activated when it senses vertical lines. Finally, a higher-order process recognizes that it is seeing a square when both the lower order neurons are activated simultaneously. Neural networks share some of these same concepts, with inputs being processed by a first layer of neurons that may go on to trigger another layer. Neural networks are sometimes shown as graphical models. In Figure 3, Inputs are data represented as squares. These may be pixels in an image or different aspects of sounds, or something else. The next layer of Hidden neurons is neurons that recognize basic features, such as horizontal lines, vertical lines, or curved lines. Finally, the output may be a neuron that is activated by the simultaneous activation of two of the hidden neurons. In this article, observed data or features are depicted as squares, and unobserved or hidden layers as circles: Figure 3 Neural networks are used to refer to a broad class of models and algorithms. Hidden neurons are generated based on some combination of the observed data, similar to a basis expansion in other statistical techniques; however, rather than choosing the form of the expansion, the weights used to create the hidden neurons are learned from the data. Neural networks can involve a variety of activation function(s), which are transformations of the weighted raw data inputs to create the hidden neurons. A common choice for activation functions is the sigmoid function: and the hyperbolic tangent function . Finally, radial basis functions are sometimes used as they are efficient function approximators. Although there are a variety of these, the Gaussian form is common: . In a shallow neural network such as is shown in Figure 3, with only a single hidden layer, from the hidden units to the outputs is essentially a standard regression or classification problem. The hidden units can be denoted by, h, the outputs by, Y. Different outputs can be denoted by subscripts i = 1, …, k and may represent different possible classifications, such as (in our case) a circle or square. The paths from each hidden unit to each output are the weights and for the ith output are denoted by wi. These weights are also learned from the data, just like the weights used to create the hidden layer. For classification, it is common to use a final transformation, the softmax function, which is as this ensures that the estimates are positive (using the exponential function) and that the probability of being in any given class sums to one. For linear regression, the identity function, which returns its input, is commonly used. Confusion may arise as to why there are paths between every hidden unit and output as well as every input and hidden unit. These are commonly drawn to represent that a priori any of these relations are allowed to exist. The weights must then be learned from the data, with zero or near zero weights essentially equating to dropping unnecessary relations. This only scratches the surface of the conceptual and practical aspects of neural networks. For a slightly more in-depth introduction to neural networks, see Chapter 11 of The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009) also freely available at http://statweb.stanford.edu/~tibs/ElemStatLearn/. Next, we will turn to a brief introduction to deep neural networks. Deep neural networks Perhaps the simplest, if not the most informative, definition of a deep neural network (DNN) is that it is a neural network with multiple hidden layers. Although a relatively simple conceptual extension of neural networks, such deep architecture provides valuable advances in terms of the capability of the models and new challenges in training them. Using multiple hidden layers allows a more sophisticated build-up from simple elements to more complex ones. When discussing neural networks, we considered the outputs to be whether the object was a circle or a square. In a deep neural network, many circles and squares could be combined to form other more advanced shapes. One can consider two complexity aspects of a model's architecture. One is how wide or narrow it is—that is, how many neurons in a given layer. The second is how deep it is, or how many layers of neurons there are. For data that truly has such deep architectures, a DNN can fit it more accurately with fewer parameters than a neural network (NN), because more layers (each with fewer neurons) can be a more efficient and accurate representation; for example, because the shallow NN cannot build more advanced shapes from basic pieces, in order to provide equal accuracy to the DNN it must represent each unique object. Again considering pattern recognition in images, if we are trying to train a model for text recognition the raw data may be pixels from an image. The first layer of neurons could be trained to capture different letters of the alphabet, and then another layer could recognize sets of these letters as words. The advantage is that the second layer does not have to directly learn from the pixels, which are noisy and complex. In contrast, a shallow architecture may require far more parameters, as each hidden neuron would have to be capable of going directly from pixels in an image to a complete word, and many words may overlap, creating redundancy in the model. One of the challenges in training deep neural networks is how to efficiently learn the weights. The models are often complex and local minima abound making the optimization problem a challenging one. One of the major advancements came in 2006, when it was shown that Deep Belief Networks (DBNs) could be trained one layer at a time (Refer A Fast Learning Algorithm for Deep Belief Nets, by Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, (2006) at http://www.cs.toronto.edu/~fritz/absps/ncfast.pdf). A DBN is a type of DNN where multiple hidden layers and connections between (but not within) layers (that is, a neuron in layer 1 may be connected to a neuron in layer 2, but may not be connected to another neuron in layer 1). This is the essentially the same definition of a Restricted Boltzmann Machine (RBM)—an example is diagrammed in Figure 4, except that a RBM typically has one input layer and one hidden layer: Figure 4 The restriction of no connections within a layer is valuable as it allows for much faster training algorithms to be used, such as the contrastive divergence algorithm. If several RBMs are stacked together, they can form a DBN. Essentially, the DBN can then be trained as a series of RBMs. The first RBM layer is trained and used to transform raw data into hidden neurons, which are then treated as a new set of inputs in a second RBM, and the process is repeated until all layers have been trained. The benefits of the realization that DBNs could be trained one layer at a time extend beyond just DBNs, however. DBNs are sometimes used as a pre-training stage for a deep neural network. This allows the comparatively fast, greedy layer-by-layer training to be used to provide good initial estimates, which are then refined in the deep neural network using other, slower, training algorithms such as back propagation. So far we have been primarily focused on feed-forward neural networks, where the results from one layer and neuron feed forward to the next. Before closing this section, two specific kinds of deep neural networks that have grown in popularity are worth mentioning. The first is a Recurrent Neural Network (RNN) where neurons send feedback signals to each other. These feedback loops allow RNNs to work well with sequences. A recent example of an application of RNNs was to automatically generate click-bait such as One trick to great hair salons don't want you to know or Top 10 reasons to visit Los Angeles: #6 will shock you!. RNNs work well for such jobs as they can be seeded from a large initial pool of a few words (even just trending search terms or names) and then predict/generate what the next word should be. This process can be repeated a few times until a short phrase is generated, the click-bait. This example is drawn from a blog post by Lars Eidnes, available at http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/. The second type is a Convolutional Neural Network (CNN). CNNs are most commonly used in image recognition. CNNs work by having each neuron respond to overlapping subregions of an image. The benefits of CNNs are that they require comparatively minimal pre-processing yet still do not require too many parameters through weight sharing (for example, across subregions of an image). This is particularly valuable for images as they are often not consistent. For example, imagine ten different people taking a picture of the same desk. Some may be closer or farther away or at positions resulting in essentially the same image having different heights, widths, and the amount of image captured around the focal object. As for neural networks, this description only provides the briefest of overviews as to what DNNs are and some of the use cases to which they can be applied. Summary This article presented a brief introduction to NNs and DNNs. Using multiple hidden layers, DNNs have been a revolution in machine learning by providing a powerful unsupervised learning and feature-extraction component that can be standalone or integrated as part of a supervised model. There are many applications of such models and they are increasingly used by large-scale companies such as Google, Microsoft, and Facebook. Examples of tasks for deep learning are image recognition (for example, automatically tagging faces or identifying keywords for an image), voice recognition, and text translation (for example, to go from English to Spanish, or vice versa). Work is being done on text recognition, such as sentiment analysis to try to identify whether a sentence or paragraph is generally positive or negative, which is particularly useful to evaluate perceptions about a product or service. Imagine being able to scrape reviews and social media for any mention of your product and analyze whether it was being discussed more favorably than the previous month or year or not! Resources for Article: Further resources on this subject: Dealing with a Mess [article] Design with Spring AOP [article] Probability of R? [article]

0
0
1801

article-image-introducing-openstack-trove

Packt

02 Mar 2016

17 min read

Introducing OpenStack Trove

Packt

02 Mar 2016

17 min read

In this article, Alok Shrivastwa and Sunil Sarat, authors of the book OpenStack Trove Essentials, explain how OpenStack Trove truly and remarkably is a treasure or collection of valuable things, especially for open source lovers like us and, of course, it is an apt name for the Database as a Service (DBaaS) component of OpenStack. In this article, we shall see why this component shows the potential and is on its way to becoming one of the crucial components in the OpenStack world. In this article, we will cover the following: DBaaS and its advantages An introduction to OpenStack's Trove project and its components Database as a Service Data is a key component in today's world, and what would applications do without data? Data is very critical, especially in the case of businesses such as the financial sector, social media, e-commerce, healthcare, and streaming media. Storing and retrieving data in a manageable way is absolutely key. Databases, as we all know, have been helping us manage data for quite some time now. Databases form an integral part of any application. Also, the data-handling needs of different type of applications are different, which has given rise to an increase in the number of database types. As the overall complexity increases, it becomes increasingly challenging and difficult for the database administrators (DBAs) to manage them. DBaaS is a cloud-based service-oriented approach to offering databases on demand for storing and managing data. DBaaS offers a flexible and scalable platform that is oriented towards self-service and easy management, particularly in terms of provisioning a business' environment using a database of choice in a matter of a few clicks and in minutes rather than waiting on it for days or even, in some cases, weeks. The fundamental building block of any DBaaS is that it will be deployed over a cloud platform, be it public (AWS, Azure, and so on) or private (VMware, OpenStack, and so on). In our case, we are looking at a private cloud running OpenStack. So, to the extent necessary, you might come across references to OpenStack and its other services, on which Trove depends. XaaS (short for Anything/Everything as a Service, of which DBaaS is one such service) is fast gaining momentum. In the cloud world, everything is offered as a service, be it infrastructure, software, or, in this case, databases. Amazon Web Services (AWS) offers various services around this: the Relational Database Service (RDS) for the RDBMS (short for relational database management system) kind of system; SimpleDB and DynamoDB for NoSQL databases; and Redshift for data warehousing needs. The OpenStack world was also not untouched by the growing demand for DBaaS, not just by users but also by DBAs, and as a result, Trove made its debut with the OpenStack release Icehouse in April 2014 and since then is one of the most popular advanced services of OpenStack. It supports several SQL and NoSQL databases and provides the full life cycle management of the databases. Advantages Now, you must be wondering why we must even consider DBaaS over traditional database management strategies. Here are a few points you might want to consider that might make it worth your time. Reduced database management costs In any organization, most of their DBAs' time is wasted in mundane tasks such as creating databases, creating instances, and so on. They are not able to concentrate on tasks such as fine-tuning SQL queries so that applications run faster, not to mention the time taken to do it all manually (or with a bunch of scripts that need to be fired manually), so this in effect is wasting resources in terms of both developers' and DBAs' time. This can be significantly reduced using a DBaaS. Faster provisioning and standardization With DBaaS, databases that are provisioned by the system will be compliant with standards as there is very little human intervention involved. This is especially helpful in the case of heavily regulated industries. As an example, let's look at members of the healthcare industry. They are bound by regulations such as HIPAA (short for Health Insurance Portability and Accountability Act of 1996), which enforces certain controls on how data is to be stored and managed. Given this scenario, DBaaS makes the database provisioning process easy and compliant as they only need to qualify the process once, and then every other database coming out of the automated provisioning system is then compliant with the standards or controls set. Easier administration Since DBaaS is cloud based, which means there will be a lot of automation, administration becomes that much more automated and easier. Some important administration tasks are backup/recovery and software upgrade/downgrade management. As an example, with most databases, we should be able to push configuration modifications within minutes to all the database instances that have been spun out by the DBaaS system. This ensures that any new standards being thought of can easily be implemented. Scaling and efficiency Scaling (up or down) becomes immensely easy, and this reduces resource hogging, which developers used as part of their planning for a rainy day, and in most cases, it never came. In the case of DBaaS, since you don't commit resources upfront and only scale up or down as and when necessary, resource utilization will be highly efficient. These are some of the advantages available to organizations that use DBaaS. Some of the concerns and roadblocks for organizations in adopting DBaaS, especially in a public cloud model, are as follows: Companies don't want to have sensitive data leave their premises. Database access and speed are key to application performance. Not being able to manage the underlying infrastructure inhibits some organizations from going to a DBaaS model. In contrast to public cloud-based DBaaS, concerns regarding data security, performance, and visibility reduce significantly in the case of private DBaaS systems such as Trove. In addition, the benefits of a cloud environment are not lost either. Trove OpenStack Trove, which was originally called Red Dwarf, is a project that was initiated by HP, and many others contributed to it later on, including Rackspace. The project was in incubation till the Havana release of OpenStack. It was formally introduced in the Icehouse release in April 2014, and its mission is to provide scalable and reliable cloud DBaaS provisioning functionality for relational and non-relational database engines. As of the Liberty release, Trove is considered as a big-tent service. Big-tent is a new approach that allows projects to enter the OpenStack code namespace. In order for a service to be a big-tent service, it only needs to follow some basic rules, which are listed here. This allows the projects to have access to the shared teams in OpenStack, such as the infrastructure teams, release management teams, and documentation teams. The project should: Align with the OpenStack mission Subject itself to the rulings of the OpenStack Technical Committee Support Keystone authentication Be completely open source and open community based At the time of writing the article, the adoption and maturity levels are as shown here: The previous diagram shows that the Age of the project is just 2 YRS and it has a 27% Adoption rate, meaning 27 of 100 people running OpenStack also run Trove. The maturity index is 1 on a scale of 1 to 5. It is derived from the following five aspects: The presence of an installation guide Whether the Adoption percentage is greater or lesser than 75 Stable branches of the project Whether it supports seven or more SDKs Corporate diversity in the team working on the project Without further ado, let's take a look at the architecture that Trove implements in order to provide DBaaS. Architecture The trove project uses some shared components and some dedicated project-related components as mentioned in the following subsections. Shared components The Trove system shares two components with the other OpenStack projects, the backend database (MySQL/MariaDB), and the message bus. The message bus The AMQP (short for Advanced Message Queuing Protocol) message bus brokers the interactions between the task manager, API, guest agent, and conductor. This component ensures that Trove can be installed and configured as a distributed system. MySQL/MariaDB MySQL or MariaDB is used by Trove to store the state of the system. API This component is responsible for providing the RESTful API with JSON and XML support. This component can be called the face of Trove to the external world since all the other components talk to Trove using this. It talks to the task manager for complex tasks, but it can also talk to the guest agent directly to perform simple tasks, such as retrieving users. The task manager The task manager is the engine responsible for doing the majority of the work. It is responsible for provisioning instances, managing the life cycle, and performing different operations. The task manager normally sends common commands, which are of an abstract nature; it is the responsibility of the guest agent to read them and issue database-specific commands in order to execute them. The guest agent The guest agent runs inside the Nova instances that are used to run the database engines. The agent listens to the messaging bus for the topic and is responsible for actually translating and executing the commands that are sent to it by the task manager component for the particular datastore. Let's also look at the different types of guest agents that are required depending on the database engine that needs to be supported. The different guest agents (for example, the MySQL and PostgreSQL guest agents) may even have different capabilities depending on what is supported on the particular database. This way, different datastores with different capabilities can be supported, and the system is kept extensible. The conductor The conductor component is responsible for updating the Trove backend database with the information that the guest agent sends regarding the instances. It eliminates the need for direct database access by all the guest agents for updating information. This is like the way the guest agent also listens to the topic on the messaging bus and performs its functions based on it. The following diagram can be used to illustrate the different components of Trove and also their interaction with the dependent services: Terminology Let's take a look at some of the terminology that Trove uses. Datastore Datastore is the term used for the RDBMS or NoSQL database that Trove can manage; it is nothing more than an abstraction of the underlying database engine, for example, MySQL, MongoDB, Percona, Couchbase, and so on. Datastore version This is linked to the datastore and defines a set of packages to be installed or already installed on an image. As an example, let's take MySQL 5.5. The datastore version will also link to a base image (operating system) that is stored in Glance. The configuration parameters that can be modified are also dependent on the datastore and the datastore version. Instance An instance is an instantiation of a datastore version. It runs on OpenStack Nova and uses Cinder for persistent storage. It has a full OS and additionally has the guest agent of Trove. Configuration group A configuration group is a bunch of options that you can set. As an example, we can create a group and associate a number of instances to one configuration group, thereby maintaining the configurations in sync. Flavor The flavor is similar to the Nova machine flavor, but it is just a definition of memory and CPU requirements for the instance that will run and host the databases. Normally, it's a good idea to have a high memory-to-CPU ratio as a flavor for running database instances. Database This is the actual database that the users consume. Several databases can run in a single Trove instance. This is where the actual users or applications connect with their database clients. The following diagram shows these different terminologies, as a quick summary. Users or applications connect to databases, which reside in instances. The instances run in Nova but are instantiations of the Datastore version belonging to a Datastore. Just to explain this a little further, say we have two versions of MySQL that are being serviced. We will have one datastore but two datastore versions, and any instantiation of that will be called an instance, and the actual MySQL database that will be used by the application will be called the database (shown as DB in the diagram). A multi-datastore scenario One of the important features of the Trove system is that it supports multiple databases to various degrees. In this subsection, we will see how Trove works with multiple Trove datastores. In the following diagram, we have represented all the components of Trove (the API, task manager, and conductor) except the Guest Agent databases as Trove Controller. The Guest Agent code is different for every datastore that needs to be supported and the Guest Agent for that particular datastore is installed on the corresponding image of the datastore version. The guest agents by default have to implement some of the basic actions for the datastore, namely, create, resize, and delete, and individual guest agents have extensions that enable them to support additional features just for that datastore. The following diagram should help us understand the command proxy function of the guest agent. Please note that the commands shown are only indicative, and the actual commands will vary. At the time of writing this article, Trove's guest agents are installable only on Linux; hence, only databases on Linux systems are supported. Feature requests (https://blueprints.launchpad.net/trove/+spec/mssql-server-db-support) were created for the ability to create a guest agent for Windows and support Microsoft SQL databases, but they have not yet been approved at the time of writing this and might be a remote possibility. Database software distribution support Trove supports various databases; the following table shows the databases supported by this service at the time of writing this. Automated installation is available for all the different databases, but there is some level of difference when it comes to the configuration capabilities of Trove with respect to different databases. This has lot to do with the lack of a common configuration base among the different databases. At the time of writing this article, MySQL and MariaDB have the most configuration options available, as shown in this list: Database Version MySQL 5.5, 5.6 Percona 5.5, 5.6 MariaDB 5.5, 10.0 Couchbase 2.2, 3.0 Cassandra 2.1 Redis 2.8 PostgreSQL 9.3, 9.4 MongoDB 2.6, 3.0 DB2 Expre 10.5 CouchDB 1.6 So, as you can see, almost all the major database applications that can run on Linux are already supported on Trove. Putting it all together Now that we have understood the architecture and terminologies, we will take a look at the general steps that are followed: Horizon/Trove CLI requests a new database instance and passes the datastore name and version, along with the flavor ID and volume size as mandatory parameters. Optional parameters such as the configuration group, AZ, replica-of, and so on can also be passed. The Trove API requests Nova for an instance with the particular image and a Cinder volume of a specific size to be added to the instance. The Nova instance boots and follows these steps: The cloud-init scripts are run(like all other Nova instances) The configuration files (for example, trove-guestagent.conf) are copied down to the instance The guest agent is installed The Trove API will also have sent the request to the task manager, which will then send the prepare call to the message bus topic. After booting, the guest agent listens to the message bus for any activities for it to do, and once it finds a message for itself, it processes the prepare command and performs the following functions: Installing the database distribution (if not already installed on the image) Creating the configuration file with the default configuration for the database engine (and any configuration from the configuration groups associated overriding the defaults) Starting the database engine and enabling auto-start Polling the database engine for availability (until the database engine is available or the timeout is reached) Reporting the status back to the Trove backend using the Trove conductor The Trove manager reports back to the API and the status of the machine is changed. Use cases So, if you are wondering all the places where we can use Trove, it fits in rather nicely with the following use cases. Dev/test databases Dev/test databases are an absolute killer feature, and almost all companies that start using Trove will definitely use it for their dev/test environments. This provides developers with the ability to freely create and dispose of database instances at will. This ability helps them be more productive and removes any lag from when they want it to when they get it. The capability of being able to take a backup, run a database, and restore the backup to another server is especially key when it comes to these kinds of workloads. Web application databases Trove is used in production for any database that supports low-risk applications, such as some web applications. With the introduction of different redundancy mechanisms, such as master-slave in MySQL, this is becoming more suited to many production environments. Features Trove is moving fast in terms of the features being added in the various releases. In this section, we will take a look at the features of three releases: the current release and the past two. The Juno release The Juno release saw a lot of features being added to the Trove system. Here is a non-exhaustive list: Support for Neutron: Now we can use both nova-network and Neutron for networking purposes Replication: MySQL master/slave replication was added. The API also allowed us to detach a slave for it to be promoted Clustering: Mongo DB cluster support was added Configuration group improvements: The functionality of using a default configuration group for a datastore version was added. This allows us to build the datastore version with a base configuration of your company standards Basic error checking was added to configuration groups The Kilo release The Kilo release majorly worked on introducing a new datastore. The following is the list of major features that were introduced: Support for the GTID (short for global transaction identifier) replication strategy New datastores, namely Vertica, DB2, and CouchDB, are supported The Liberty release The Liberty release introduced the following features to Trove. This is a non-exhaustive list. Configuration groups for Redis and MongoDB Cluster support for Redis and MongoDB Percona XtraDB cluster support Backup and restore for a single instance of MongoDB User and database management for Mongo DB Horizon support for database clusters A management API for datastores and versions The ability to deploy Trove instances in a single admin tenant so that the Nova instances are hidden from the user In order to see all the features introduced in the releases, please look at the release notes of the system, which can be found at these URLs: Juno : https://wiki.openstack.org/wiki/ReleaseNotes/Juno Kilo : https://wiki.openstack.org/wiki/ReleaseNotes/Kilo Liberty : https://wiki.openstack.org/wiki/ReleaseNotes/Liberty Summary In this article, we were introduced to the basic concepts of DBaaS and how Trove can help with this. With several changes being introduced and a score of one on five with respect to maturity, it might seem as if it is too early to adopt Trove. However, a lot of companies are giving Trove a go in their dev/test environments as well as for some web databases in production, which is why the adoption percentage is steadily on the rise. A few companies that are using Trove today are giants such as eBay, who run their dev/test Test databases on Trove; HP Helion Cloud, Rackspace Cloud, and Tesora (which is also one of the biggest contributors to the project) have DBaaS offerings based on the Trove component. Trove is increasingly being used in various companies, and it is helping in reducing DBAs' mundane work and improving standardization. Resources for Article: Further resources on this subject: OpenStack Performance, Availability [article] Concepts for OpenStack [article] Implementing OpenStack Networking and Security [article]

0
0
2661

article-image-responsive-visualizations-using-d3js-and-bootstrap

Packt

01 Mar 2016

21 min read

Responsive Visualizations Using D3.js and Bootstrap

Packt

01 Mar 2016

21 min read

0
0
12141

Packt

01 Mar 2016

32 min read

Breaking the Bank

Packt

01 Mar 2016

32 min read

0
0
3609

article-image-building-recommendation-engine-spark

Packt

24 Feb 2016

44 min read

Building a Recommendation Engine with Spark

Packt

24 Feb 2016

44 min read

0
1
16207

article-image-make-things-pretty-ggplot2

Packt

24 Feb 2016

30 min read

Make Things Pretty with ggplot2

Packt

24 Feb 2016

30 min read

0
0
1476

Packt

23 Feb 2016

59 min read

Dealing with a Mess

Packt

23 Feb 2016

59 min read

0
0
2395

Packt

23 Feb 2016

17 min read

Probability of R?

Packt

23 Feb 2016

17 min read

0
0
2417

article-image-introduction-clustering-and-unsupervised-learning

Packt

23 Feb 2016

16 min read

Introduction to Clustering and Unsupervised Learning

Packt

23 Feb 2016

16 min read

The act of clustering, or spotting patterns in data, is not much different from spotting patterns in groups of people. In this article, you will learn: The ways clustering tasks differ from the classification tasks How clustering defines a group, and how such groups are identified by k-means, a classic and easy-to-understand clustering algorithm The steps needed to apply clustering to a real-world task of identifying marketing segments among teenage social media users Before jumping into action, we'll begin by taking an in-depth look at exactly what clustering entails. (For more resources related to this topic, see here.) Understanding clustering Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. It does this without having been told how the groups should look ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction. It provides an insight into the natural groupings found within data. Without advance knowledge of what comprises a cluster, how can a computer possibly know where one group ends and another begins? The answer is simple. Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside. The definition of similarity might vary across applications, but the basic idea is always the same—group the data so that the related elements are placed together. The resulting clusters can then be used for action. For instance, you might find clustering methods employed in the following applications: Segmenting customers into groups with similar demographics or buying patterns for targeted marketing campaigns Detecting anomalous behavior, such as unauthorized network intrusions, by identifying patterns of use falling outside the known clusters Simplifying extremely large datasets by grouping features with similar values into a smaller number of homogeneous categories Overall, clustering is useful whenever diverse and varied data can be exemplified by a much smaller number of groups. It results in meaningful and actionable data structures that reduce complexity and provide insight into patterns of relationships. Clustering as a machine learning task Clustering is somewhat different from the classification, numeric prediction, and pattern detection tasks we examined so far. In each of these cases, the result is a model that relates features to an outcome or features to other features; conceptually, the model describes the existing patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label that has been inferred entirely from the relationships within the data. For this reason, you will, sometimes, see the clustering task referred to as unsupervised classification because, in a sense, it classifies unlabeled examples. The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return the groups A, B, and C—but it's up to you to apply an actionable and meaningful label. To see how this impacts the clustering task, let's consider a hypothetical example. Suppose you were organizing a conference on the topic of data science. To facilitate professional networking and collaboration, you planned to seat people in groups according to one of three research specialties: computer and/or database science, math and statistics, and machine learning. Unfortunately, after sending out the conference invitations, you realize that you had forgotten to include a survey asking which discipline the attendee would prefer to be seated with. In a stroke of brilliance, you realize that you might be able to infer each scholar's research specialty by examining his or her publication history. To this end, you begin collecting data on the number of articles each attendee published in computer science-related journals and the number of articles published in math or statistics-related journals. Using the data collected for several scholars, you create a scatterplot: As expected, there seems to be a pattern. We might guess that the upper-left corner, which represents people with many computer science publications but few articles on math, could be a cluster of computer scientists. Following this logic, the lower-right corner might be a group of mathematicians. Similarly, the upper-right corner, those with both math and computer science experience, may be machine learning experts. Our groupings were formed visually; we simply identified clusters as closely grouped data points. Yet in spite of the seemingly obvious groupings, we unfortunately have no way to know whether they are truly homogeneous without personally asking each scholar about his/her academic specialty. The labels we applied required us to make qualitative, presumptive judgments about the types of people that would fall into the group. For this reason, you might imagine the cluster labels in uncertain terms, as follows: Rather than defining the group boundaries subjectively, it would be nice to use machine learning to define them objectively. This might provide us with a rule in the form if a scholar has few math publications, then he/she is a computer science expert. Unfortunately, there's a problem with this plan. As we do not have data on the true class value for each point, a supervised learning algorithm would have no ability to learn such a pattern, as it would have no way of knowing what splits would result in homogenous groups. On the other hand, clustering algorithms use a process very similar to what we did by visually inspecting the scatterplot. Using a measure of how closely the examples are related, homogeneous groups can be identified. In the next section, we'll start looking at how clustering algorithms are implemented. This example highlights an interesting application of clustering. If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called semi-supervised learning. The k-means clustering algorithm The k-means algorithm is perhaps the most commonly used clustering method. Having been studied for several decades, it serves as the foundation for many more sophisticated clustering techniques. If you understand the simple principles it uses, you will have the knowledge needed to understand nearly any clustering algorithm in use today. Many such methods are listed on the following site, the CRAN Task View for clustering at http://cran.r-project.org/web/views/Cluster.html. As k-means has evolved over time, there are many implementations of the algorithm. One popular approach is described in : Hartigan JA, Wong MA. A k-means clustering algorithm. Applied Statistics. 1979; 28:100-108. Even though clustering methods have advanced since the inception of k-means, this is not to imply that k-means is obsolete. In fact, the method may be more popular now than ever. The following table lists some reasons why k-means is still used widely: Strengths Weaknesses Uses simple principles that can be explained in non-statistical terms Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings Performs well enough under many real-world use cases Not as sophisticated as more modern clustering algorithms Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters Requires a reasonable guess as to how many clusters naturally exist in the data Not ideal for non-spherical clusters or clusters of widely varying density The k-means algorithm assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time. The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless k and n are extremely small, it is not feasible to compute the optimal clusters across all the possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this means that it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters. We will cover the process in depth shortly, but the algorithm essentially involves two phases. First, it assigns examples to an initial set of k clusters. Then, it updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster. The process of updating and assigning occurs several times until changes no longer improve the cluster fit. At this point, the process stops and the clusters are finalized. Due to the heuristic nature of k-means, you may end up with somewhat different final results by making only slight changes to the starting conditions. If the results vary dramatically, this could indicate a problem. For instance, the data may not have natural groupings or the value of k has been poorly chosen. With this in mind, it's a good idea to try a cluster analysis more than once to test the robustness of your findings. To see how the process of assigning and updating works in practice, let's revisit the case of the hypothetical data science conference. Though this is a simple example, it will illustrate the basics of how k-means operates under the hood. Using distance to assign and update clusters As with k-NN, k-means treats feature values as coordinates in a multidimensional feature space. For the conference data, there are only two features, so we can represent the feature space as a two-dimensional scatterplot as depicted previously. The k-means algorithm begins by choosing k points in the feature space to serve as the cluster centers. These centers are the catalyst that spurs the remaining examples to fall into place. Often, the points are chosen by selecting k random examples from the training dataset. As we hope to identify three clusters, according to this method, k = 3 points will be selected at random. These points are indicated by the star, triangle, and diamond in the following diagram: It's worth noting that although the three cluster centers in the preceding diagram happen to be widely spaced apart, this is not always necessarily the case. Since they are selected at random, the three centers could have just as easily been three adjacent points. As the k-means algorithm is highly sensitive to the starting position of the cluster centers, this means that random chance may have a substantial impact on the final set of clusters. To address this problem, k-means can be modified to use different methods for choosing the initial centers. For example, one variant chooses random values occurring anywhere in the feature space (rather than only selecting among the values observed in the data). Another option is to skip this step altogether; by randomly assigning each example to a cluster, the algorithm can jump ahead immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters, which you may be able to use to improve your results. In 2007, an algorithm called k-means++ was introduced, which proposes an alternative method for selecting the initial cluster centers. It purports to be an efficient way to get much closer to the optimal clustering solution while reducing the impact of random chance. For more information, refer to Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. 2007:1027–1035. After choosing the initial cluster centers, the other examples are assigned to the cluster center that is nearest according to the distance function. You will remember that we studied distance functions while learning about k-Nearest Neighbors. Traditionally, k-means uses Euclidean distance, but Manhattan distance or Minkowski distance are also sometimes used. Recall that if n indicates the number of features, the formula for Euclidean distance between example x and example y is: For instance, if we are comparing a guest with five computer science publications and one math publication to a guest with zero computer science papers and two math papers, we could compute this in R as follows: > sqrt((5 - 0)^2 + (1 - 2)^2) [1] 5.09902 Using this distance function, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. Keep in mind that as we are using distance calculations, all the features need to be numeric, and the values should be normalized to a standard range ahead of time. As shown in the following diagram, the three cluster centers partition the examples into three segments labeled Cluster A, Cluster B, and Cluster C. The dashed lines indicate the boundaries for the Voronoi diagram created by the cluster centers. The Voronoi diagram indicates the areas that are closer to one cluster center than any other; the vertex where all the three boundaries meet is the maximal distance from all three cluster centers. Using these boundaries, we can easily see the regions claimed by each of the initial k-means seeds: Now that the initial assignment phase has been completed, the k-means algorithm proceeds to the update phase. The first step of updating the clusters involves shifting the initial centers to a new location, known as the centroid, which is calculated as the average position of the points currently assigned to that cluster. The following diagram illustrates how as the cluster centers shift to the new centroids, the boundaries in the Voronoi diagram also shift and a point that was once in Cluster B (indicated by an arrow) is added to Cluster A: As a result of this reassignment, the k-means algorithm will continue through another update phase. After shifting the cluster centroids, updating the cluster boundaries, and reassigning points into new clusters (as indicated by arrows), the figure looks like this: Because two more points were reassigned, another update must occur, which moves the centroids and updates the cluster boundaries. However, because these changes result in no reassignments, the k-means algorithm stops. The cluster assignments are now final: The final clusters can be reported in one of the two ways. First, you might simply report the cluster assignments such as A, B, or C for each example. Alternatively, you could report the coordinates of the cluster centroids after the final update. Given either reporting method, you are able to define the cluster boundaries by calculating the centroids or assigning each example to its nearest cluster. Choosing the appropriate number of clusters In the introduction to k-means, we learned that the algorithm is sensitive to the randomly-chosen cluster centers. Indeed, if we had selected a different combination of three starting points in the previous example, we may have found clusters that split the data differently from what we had expected. Similarly, k-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting k to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have a priori knowledge (a prior belief) about the true groupings and you can apply this information to choosing the number of clusters. For instance, if you were clustering movies, you might begin by setting k equal to the number of genres considered for the Academy Awards. In the data science conference seating problem that we worked through previously, k might reflect the number of academic fields of study that were invited. Sometimes the number of clusters is dictated by business requirements or the motivation for the analysis. For example, the number of tables in the meeting hall could dictate how many groups of people should be created from the data science attendee list. Extending this idea to another business case, if the marketing department only has resources to create three distinct advertising campaigns, it might make sense to set k = 3 to assign all the potential customers to one of the three appeals. Without any prior knowledge, one rule of thumb suggests setting k equal to the square root of (n / 2), where n is the number of examples in the dataset. However, this rule of thumb is likely to result in an unwieldy number of clusters for large datasets. Luckily, there are other statistical methods that can assist in finding a suitable k-means cluster set. A technique known as the elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k. As illustrated in the following diagrams, the homogeneity within clusters is expected to increase as additional clusters are added; similarly, heterogeneity will also continue to decrease with more clusters. As you could continue to see improvements until each example is in its own cluster, the goal is not to maximize homogeneity or minimize heterogeneity, but rather to find k so that there are diminishing returns beyond that point. This value of k is known as the elbow point because it looks like an elbow. There are numerous statistics to measure homogeneity and heterogeneity within the clusters that can be used with the elbow method (the following information box provides a citation for more detail). Still, in practice, it is not always feasible to iteratively test a large number of k values. This is in part because clustering large datasets can be fairly time consuming; clustering the data repeatedly is even worse. Regardless, applications requiring the exact optimal set of clusters are fairly rare. In most clustering applications, it suffices to choose a k value based on convenience rather than strict performance requirements. For a very thorough review of the vast assortment of cluster performance measures, refer to: Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of Intelligent Information Systems. 2001; 17:107-145. The process of setting k itself can sometimes lead to interesting insights. By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries. Groups that are more tightly clustered will change a little, while less homogeneous groups will form and disband over time. In general, it may be wise to spend little time worrying about getting k exactly right. The next example will demonstrate how even a tiny bit of subject-matter knowledge borrowed from a Hollywood film can be used to set k such that actionable and interesting clusters are found. As clustering is unsupervised, the task is really about what you make of it; the value is in the insights you take away from the algorithm's findings. Summary This article covered only the fundamentals of clustering. As a very mature machine learning method, there are many variants of the k-means algorithm as well as many other clustering algorithms that bring unique biases and heuristics to the task. Based on the foundation in this article, you will be able to understand and apply other clustering methods to new problems. To learn more about different machine learning techniques, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r) Mastering Scientific Computing with R (https://www.packtpub.com/application-development/mastering-scientific-computing-r) R for Data Science (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science) Resources for Article: Further resources on this subject: Displaying SQL Server Data using a Linq Data Source [article] Probability of R? [article] Working with Commands and Plugins [article]

0
0
9925

How-To Tutorials - Data

Machine Learning Tasks

Why Mesos?

Support Vector Machines as a Classification Engine

Integrating Imagery with Creating and Styling Features in OpenLayers 3

Welcome to Machine Learning using the .NET Framework

Exploring HDFS

Getting Started with Deep Learning

Introducing OpenStack Trove

Responsive Visualizations Using D3.js and Bootstrap

Breaking the Bank

Trending Topics

Building a Recommendation Engine with Spark

Make Things Pretty with ggplot2

Dealing with a Mess

Probability of R?

Introduction to Clustering and Unsupervised Learning