
How-To Tutorials - Data


Consistency Conflicts

Packt
10 Aug 2016
11 min read
In this article by Robert Strickland, author of the book Cassandra 3.x High Availability - Second Edition, we will discuss how, for any given call, it is possible to achieve either strong consistency or eventual consistency. In the former case, we can know for certain that the copy of the data that Cassandra returns will be the latest. In the case of eventual consistency, the data returned may or may not be the latest, or there may be no data returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you're reading from has not yet received the delete request.

Depending on the read_repair_chance setting and the consistency level chosen for the read operation, Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run.

How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence.

Now, let's discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following table details the various consistency levels and their effects on both read and write operations:

| Consistency level | Reads | Writes |
| --- | --- | --- |
| ANY | This is not supported for reads. | Data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down. |
| ONE | The replica from the closest node will be returned. | Data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient. |
| TWO | The replicas from the two closest nodes will be returned. | The same as ONE, except two replicas must be written. |
| THREE | The replicas from the three closest nodes will be returned. | The same as ONE, except three replicas must be written. |
| QUORUM | Replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned. | Data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers. |
| SERIAL | Permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read. | Similar to QUORUM, except that writes are conditional based on the support for lightweight transactions. |
| LOCAL_ONE | Similar to ONE, except that the read will be returned by the closest replica in the local data center. | Similar to ONE, except that the write must be acknowledged by at least one node in the local data center. |
| LOCAL_QUORUM | Similar to QUORUM, except that only replicas in the local data center are compared. | Similar to QUORUM, except the quorum must only be met using the local data center. |
| LOCAL_SERIAL | Similar to SERIAL, except only local replicas are used. | Similar to SERIAL, except only writes to local replicas must be acknowledged. |
| EACH_QUORUM | The opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp. | The opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center. |
| ALL | Replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned. | Data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers. |

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let's assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure. But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

- Write with a consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.
- Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it's up to you to determine how to best employ it for your specific data requirements.

A good rule of thumb to attain strong consistency is that the read consistency level plus write consistency level should be greater than the replication factor.

If you are unsure about which consistency levels to use for your specific use case, it's typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure.

It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

- Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn't match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.
- Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.
- Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra's mechanism to handle deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone. Cassandra performs garbage collection on data marked by a tombstone each time a compaction occurs. If you don't run the repair, you risk deleted data reappearing unexpectedly. In general, deletes should be avoided when possible as the unfettered buildup of tombstones can cause significant issues.

In the course of normal operations, Cassandra will repair old replicas when their records are requested. Thus, it can be said that read repair operations are lazy, such that they only occur when required.

With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let's take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let's presume your desire is to maintain data availability in the case of node failure. It's important to understand exactly what your failure tolerance is, and this will likely be different depending on the nature of the data. The definition of failure is probably going to vary among use cases as well, as one case might consider data loss a failure, whereas another accepts data loss as long as all queries return. Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application's consistency level configurations.
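As a minimal illustration of how an application coordinates these settings, the following sketch uses the DataStax Python driver to pin a consistency level to individual statements. This is not code from the book; the contact points, keyspace, and users table are hypothetical.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster and schema, for illustration only
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("my_keyspace")

# Write at QUORUM...
insert = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert, (42, "alice"))

# ...and read at QUORUM: assuming RF=3, write acks (2) + read replicas (2) > RF,
# so the read is guaranteed to see the latest write
select = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(select, (42,)).one()

Because the level is set per statement, an application can mix guarantees, for example QUORUM for account balances and ONE for page-view counters, rather than settling on a single cluster-wide trade-off.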
In order to assist you in your efforts to achieve this balance, let's consider a single data center cluster of 10 nodes and examine the impact of various configuration combinations (where RF corresponds to the replication factor):

| RF | Write CL | Read CL | Consistency | Availability | Use cases |
| --- | --- | --- | --- | --- | --- |
| 1 | ONE, QUORUM, or ALL | ONE, QUORUM, or ALL | Consistent | Doesn't tolerate any replica loss | Data can be lost and availability is not critical, such as analysis clusters |
| 2 | ONE | ONE | Eventual | Tolerates loss of one replica | Maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable |
| 2 | QUORUM or ALL | ONE | Consistent | Tolerates loss of one replica on reads, but none on writes | Read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies) |
| 2 | ONE | QUORUM or ALL | Consistent | Tolerates loss of one replica on writes, but none on reads | Write-heavy workloads where read consistency is more important than availability |
| 3 | ONE | ONE | Eventual | Tolerates loss of two replicas | Maximum read and write performance are required, and sometimes returning stale data is acceptable |
| 3 | QUORUM | ONE | Eventual | Tolerates loss of one replica on write and two on reads | Read throughput and availability are paramount, while write performance is less important, and sometimes returning stale data is acceptable |
| 3 | ONE | QUORUM | Eventual | Tolerates loss of two replicas on write and one on reads | Low write latencies and availability are paramount, while read performance is less important, and sometimes returning stale data is acceptable |
| 3 | QUORUM | QUORUM | Consistent | Tolerates loss of one replica | Consistency is paramount, while striking a balance between availability and read/write performance |
| 3 | ALL | ONE | Consistent | Tolerates loss of two replicas on reads, but none on writes | Additional fault tolerance and consistency on reads is paramount at the expense of write performance and availability |
| 3 | ONE | ALL | Consistent | Tolerates loss of two replicas on writes, but none on reads | Low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability |
| 3 | ANY | ONE | Eventual | Tolerates loss of all replicas on write and two on read | Maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE) |
| 3 | ANY | QUORUM | Eventual | Tolerates loss of all replicas on write and one on read | Maximum write performance and availability are paramount, and sometimes returning stale data is acceptable |
| 3 | ANY | ALL | Consistent | Tolerates loss of all replicas on writes, but none on reads | Write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately |

There are also two additional consistency levels, SERIAL and LOCAL_SERIAL, which can be used to read the latest value, even if it is part of an uncommitted transaction. Otherwise, they follow the semantics of QUORUM and LOCAL_QUORUM, respectively.

As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.
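For the rows that use ONE, QUORUM, or ALL, the Consistency column in the table follows the rule of thumb mentioned earlier: a combination is strongly consistent whenever the replicas acknowledged on write plus the replicas consulted on read exceed the replication factor. The short helper below is an illustrative sketch, not code from the book, that encodes that check.

# Illustrative only: encodes the "read CL + write CL > RF" rule of thumb.
# ANY and the SERIAL variants are intentionally left out of this sketch.
def replicas_involved(level, rf):
    if level == "ALL":
        return rf
    if level == "QUORUM":
        return rf // 2 + 1
    return {"ONE": 1, "TWO": 2, "THREE": 3}[level]

def strongly_consistent(rf, write_cl, read_cl):
    return replicas_involved(write_cl, rf) + replicas_involved(read_cl, rf) > rf

print(strongly_consistent(3, "QUORUM", "QUORUM"))  # True  (consistent row)
print(strongly_consistent(3, "ONE", "ONE"))        # False (eventual row)
print(strongly_consistent(2, "ONE", "QUORUM"))     # True  (matches the RF=2 row)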
Summary

In this article, we introduced the foundational concept of consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability.


Expanding Your Data Mining Toolbox

Packt
09 Aug 2016
15 min read
In this article by Megan Squire, author of Mastering Data Mining with Python, we will see that when faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involves database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other various subfields of computer science, information systems, mathematics, or business, just to name a few. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns.

Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a cooking school, where every beginner chef is first taught how to boil water and how to use a knife before moving to more advanced skills, such as making puff pastry or deboning a raw chicken. In data mining, we also have common techniques that even the newest data miners will learn: how to build a classifier and how to find clusters in data. The aim is to teach you some of the techniques you may not have seen yet in earlier data mining projects.

In this article, we will cover the following topics:

- What is data mining? We will situate data mining in the growing field of other similar concepts, and we will learn a bit about the history of how this discipline has grown and changed.
- How do we do data mining? Here, we compare several processes or methodologies commonly used in data mining projects.
- What are the techniques used in data mining? In this article, we will summarize each of the data analysis techniques that are typically included in a definition of data mining.
- How do we set up a data mining work environment? Finally, we will walk through setting up a Python-based development environment.

What is data mining?

We explained earlier that the goal of data mining is to find patterns in data, but this oversimplification falls apart quickly under scrutiny. After all, could we not also say that finding patterns is the goal of classical statistics, or business analytics, or machine learning, or even the newer practices of data science or big data? What is the difference between data mining and all of these other fields, anyway? And while we are at it, why is it called data mining if what we are really doing is mining for patterns? Don't we already have the data?

It was apparent from the beginning that the term data mining is indeed fraught with many problems. The term was originally used as something of a pejorative by statisticians who cautioned against going on fishing expeditions, where a data analyst is casting about for patterns in data without forming proper hypotheses first.
Nonetheless, the term rose to prominence in the 1990s, as the popular press caught wind of exciting research that was marrying the mature field of database management systems with the best algorithms from machine learning and artificial intelligence. The inclusion of the word mining inspires visions of a modern-day Gold Rush, in which the persistent and intrepid miner will discover (and perhaps profit from) previously hidden gems. The idea that data itself could be a rare and precious commodity was immediately appealing to the business and technology press, despite efforts by early pioneers to promote the more holistic term knowledge discovery in databases (KDD). The term data mining persisted, however, and ultimately some definitions of the field attempted to re-imagine the term data mining to refer to just one of the steps in a longer, more comprehensive knowledge discovery process. Today, data mining and KDD are considered very similar, closely related terms.

What about other related terms, such as machine learning, predictive analytics, big data, and data science? Are these the same as data mining or KDD? Let's draw some comparisons between each of these terms:

- Machine learning is a very specific subfield of computer science that focuses on developing algorithms that can learn from data in order to make predictions. Many data mining solutions will use techniques from machine learning, but not all data mining is trying to make predictions or learn from data. Sometimes we just want to find a pattern in the data. In fact, in this article we will be exploring a few data mining solutions that do use machine learning techniques, and many more that do not.
- Predictive analytics, sometimes just called analytics, is a general term for computational solutions that attempt to make predictions from data in a variety of domains. We can think of the terms business analytics, media analytics, and so on. Some, but not all, predictive analytics solutions will use machine learning techniques to perform their predictions. But again, in data mining, we are not always interested in prediction.
- Big data is a term that refers to the problems and solutions of dealing with very large sets of data, irrespective of whether we are searching for patterns in that data, or simply storing it. In terms of comparing big data to data mining, many data mining problems are made more interesting when the data sets are large, so solutions discovered for dealing with big data might come in handy to solve a data mining problem. Nonetheless, these two terms are merely complementary, not interchangeable.
- Data science is the closest of these terms to being interchangeable with the KDD process, of which data mining is one step. Because data science is an extremely popular buzzword at this time, its meaning will continue to evolve and change as the field continues to mature.

To show the relative search interest for these various terms over time, we can look at Google Trends. This tool shows how frequently people are searching for various keywords over time. In the following figure, the newcomer term data science is currently the hot buzzword, with data mining pulling into second place, followed by machine learning and predictive analytics. (I tried to include the search term knowledge discovery in databases as well, but the results were so close to zero that the line was invisible.) The y-axis shows the popularity of that particular search term as a 0-100 indexed value. In addition, I combined the weekly index values that Google Trends gives into a monthly average for each month in the period 2004-2015.

Google Trends search results for four common data-related terms
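If you want to reproduce a comparison like this programmatically rather than through the Google Trends website, the third-party pytrends package can fetch the same 0-100 indexed values. This is an assumption on my part rather than something the author used; the keyword list below simply mirrors the four terms in the figure.

from pytrends.request import TrendReq

terms = ["data science", "data mining", "machine learning", "predictive analytics"]
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(terms, timeframe="2004-01-01 2015-12-31")

weekly = pytrends.interest_over_time()          # 0-100 indexed values per term
monthly = weekly[terms].resample("M").mean()    # average the weekly values by month
print(monthly.tail())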
How do we do data mining?

Since data mining is traditionally seen as one of the steps in the overall KDD process, and increasingly in the data science process, in this article we get acquainted with the steps involved. There are several popular methodologies for doing the work of data mining. Here we highlight four methodologies: two that are taken from textbook introductions to the theory of data mining, one taken from a very practical process used in industry, and one designed for teaching beginners.

The Fayyad et al. KDD process

One early version of the knowledge discovery and data mining process was defined by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth in a 1996 article (The KDD Process for Extracting Useful Knowledge from Volumes of Data). This article was important at the time for refining the rapidly-changing KDD methodology into a concrete set of steps. The following steps lead from raw data at the beginning to knowledge at the end:

- Data selection: The input to this step is raw data, and the output of this selection step is a smaller subset of the data, called the target data.
- Data pre-processing: The target data is cleaned, oddities and outliers are removed, and missing data is accounted for. The output of this step is pre-processed data, or cleaned data.
- Data transformation: The cleaned data is organized into a format appropriate for the mining step, and the number of features or variables is reduced if need be. The output of this step is transformed data.
- Data mining: The transformed data is mined for patterns using one or more data mining algorithms appropriate to the problem at hand. The output of this step is the discovered patterns.
- Data interpretation/evaluation: The discovered patterns are evaluated for their ability to solve the problem at hand. The output of this step is knowledge.

Since this process leads from raw data to knowledge, it is appropriate that these authors were the ones who were really committed to the term knowledge discovery in databases rather than simply data mining.

The Han et al. KDD process

Another version of the knowledge discovery process is described in the popular data mining textbook Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei as the following steps, which also lead from raw data to knowledge at the end:

- Data cleaning: The input to this step is raw data, and the output is cleaned data.
- Data integration: In this step, the cleaned data is integrated (if it came from multiple sources). The output of this step is integrated data.
- Data selection: The data set is reduced to only the data needed for the problem at hand. The output of this step is a smaller data set.
- Data transformation: The smaller data set is consolidated into a form that will work with the upcoming data mining step. This is called transformed data.
- Data mining: The transformed data is processed by intelligent algorithms that are designed to discover patterns in that data. The output of this step is one or more patterns.
- Pattern evaluation: The discovered patterns are evaluated for their interestingness and their ability to solve the problem at hand. The output of this step is an interestingness measure applied to each pattern, representing knowledge.
- Knowledge representation: In this step, the knowledge is communicated to users through various means, including visualization.
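To make the flow that both of these models describe more concrete, here is a minimal, hypothetical sketch of how the selection, cleaning, transformation, mining, and evaluation steps might look with pandas and scikit-learn. The file and column names are invented for illustration and are not from the book.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

raw = pd.read_csv("transactions.csv")                     # raw data
target = raw[["amount", "items", "minutes_on_site"]]      # selection -> target data
cleaned = target.dropna()                                 # cleaning -> cleaned data
X = StandardScaler().fit_transform(cleaned)               # transformation
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)   # mining -> patterns
print(silhouette_score(X, labels))                        # evaluation -> interestingness

In a real project, each of these one-liners expands considerably, which is exactly why the methodologies described here treat them as distinct, iterative phases.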
In both the Fayyad and Han methodologies, it is expected that the process will iterate multiple times over steps, if such iteration is needed. For example, if during the transformation step the person doing the analysis realized that another data cleaning or pre-processing step is needed, both of these methodologies specify that the analyst should double back and complete a second iteration of the incomplete earlier step.

The CRISP-DM process

A third popular version of the KDD process that is used in many business and applied domains is called CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining. It consists of the following steps:

- Business Understanding: In this step, the analyst spends time understanding the reasons for the data mining project from a business perspective.
- Data Understanding: In this step, the analyst becomes familiar with the data and its potential promises and shortcomings, and begins to generate hypotheses. The analyst is tasked to reassess the business understanding (step 1) if needed.
- Data Preparation: This step includes all the data selection, integration, transformation, and pre-processing steps that are enumerated as separate steps in the other models. The CRISP-DM model has no expectation of what order these tasks will be done in.
- Modeling: This is the step in which the algorithms are applied to the data to discover the patterns. This step is closest to the actual data mining steps in the other KDD models. The analyst is tasked to reassess the data preparation step (step 3) if the modeling and mining step requires it.
- Evaluation: The model and discovered patterns are evaluated for their value in answering the business problem at hand. The analyst is tasked with revisiting the business understanding (step 1) if necessary.
- Deployment: The discovered knowledge and models are presented and put into production to solve the original problem at hand.

One of the strengths of this methodology is that iteration is built in. Between specific steps, it is expected that the analyst will check that the current step is still in agreement with certain previous steps. Another strength of this method is that the analyst is explicitly reminded to keep the business problem front and center in the project, even down in the evaluation steps.

The Six Steps process

When I teach the introductory data science course at my university, I use a hybrid methodology of my own creation. This methodology is called the Six Steps, and I designed it to be especially friendly for teaching. My Six Steps methodology removes some of the ambiguity that inexperienced students may have with open-ended tasks from CRISP-DM, such as Business Understanding, or a corporate-focused task such as Deployment. In addition, the Six Steps method keeps the focus on developing students' critical thinking skills by requiring them to answer Why are we doing this? and What does it mean? at the beginning and end of the process. My Six Steps method looks like this:

- Problem statement: In this step, the students identify what the problem is that they are trying to solve. Ideally, they motivate the case for why they are doing all this work.
- Data collection and storage: In this step, students locate data and plan their storage for the data needed for this problem.
They also provide information about where the data that is helping them answer their motivating question came from, as well as what format it is in and what all the fields mean.
- Data cleaning: In this phase, students carefully select only the data they really need, and pre-process the data into the format required for the mining step.
- Data mining: In this step, students formalize their chosen data mining methodology. They describe what algorithms they used and why. The output of this step is a model and discovered patterns.
- Representation and visualization: In this step, the students show the results of their work visually. The outputs of this step can be tables, drawings, graphs, charts, network diagrams, maps, and so on.
- Problem resolution: This is an important step for beginner data miners. This step explicitly encourages the student to evaluate whether the patterns they showed in step 5 are really an answer to the question or problem they posed in step 1. Students are asked to state the limitations of their model or results, and to identify parts of the motivating question that they could not answer with this method.

Which data mining methodology is the best?

A 2014 survey of the subscribers of Gregory Piatetsky-Shapiro's very popular data mining email newsletter KDNuggets included the question What main methodology are you using for your analytics, data mining, or data science projects?

- 43% of the poll respondents indicated that they were using the CRISP-DM methodology
- 27% of the respondents were using their own methodology or a hybrid
- 7% were using the traditional KDD methodology

These results are generally similar to the 2007 results from the same newsletter asking the same question. My best advice is that it does not matter too much which methodology you use for a data mining project, as long as you just pick one. If you do not have any methodology at all, then you run the risk of forgetting important steps. Choose one of the methods that seems like it might work for your project and your needs, and then just do your best to follow the steps.

We will vary our data mining methodology depending on which technique we are looking at in a given article. For example, even though the focus of the article as a whole is on the data mining step, we still need to motivate the project with a healthy dose of Business Understanding (CRISP-DM) or Problem Statement (Six Steps) so that we understand why we are doing the tasks and what the results mean. In addition, in order to learn a particular data mining method, we may also have to do some pre-processing, whether we call that data cleaning, integration, or transformation. But in general, we will try to keep these tasks to a minimum so that our focus on data mining remains clear. Finally, even though data visualization is typically very important for representing the results of your data mining process to your audience, we will also keep these tasks to a minimum so that we can remain focused on the primary job at hand: data mining.

Summary

In this article, we learned what it would take to expand our data mining toolbox to the master level. First, we took a long view of the field as a whole, starting with the history of data mining as a piece of the knowledge discovery in databases (KDD) process. We also compared the field of data mining to other similar terms such as data science, machine learning, and big data.
Next, we outlined the common tools and techniques that most experts consider to be most important to the KDD process, paying special attention to the techniques that are used most frequently in the mining and analysis steps. To really master data mining, it is important that we work on problems that are different from simple textbook examples. For this reason, we will be working on more exotic data mining techniques such as generating summaries and finding outliers, and focusing on more unusual data types, such as text and networks.


Key Elements of Time Series Analysis

Packt
08 Aug 2016
7 min read
In this article by Jay Gendron, author of the book Introduction to R for Business Intelligence, we will see that time series analysis is the most difficult analysis technique. It is true that this is a challenging topic. However, one may also argue that an introductory awareness of a difficult topic is better than perfect ignorance about it. Time series analysis is a technique designed to look at chronologically ordered data that may form cycles over time.

Key topics covered in this article include the following:

- Introducing key elements of time series analysis

Time series analysis is an upper-level college statistics course. It is also a demanding topic taught in econometrics. This article provides you with an understanding of a useful but difficult analysis technique. It provides a combination of theoretical learning and hands-on practice. The goal is to give you a basic understanding of working with time series data and a foundation to learn more.

Use Case: forecasting future ridership

The finance group approached the BI team and asked for help with forecasting future trends. They heard about your great work for the marketing team and wanted to get your perspective on their problem. Once a year they prepare an annual report that includes ridership details. They are hoping to include not only last year's ridership levels, but also a forecast of ridership levels in the coming year. These types of time-based predictions are forecasts.

The Ch6_ridership_data_2011-2012.csv data file is available at the website http://jgendron.github.io/com.packtpub.intro.r.bi/. This data is a subset of the bike sharing data. It contains two years of observations, including the date and a count of users by hour.

Introducing key elements of time series analysis

You just applied a linear regression model to time series data and saw it did not work. The biggest problem was not a failure in fitting a linear model to the trend. For this well-behaved time series, the average formed a linear plot over time. Where was the problem? The problem was in the seasonal fluctuations. The seasonal fluctuations were one year in length and then repeated. Most of the data points existed above and below the fitted line, instead of on it or near it. As we saw, the ability to make a point estimate prediction was poor.

There is an old adage that says even a broken clock is correct twice a day. This is a good analogy for analyzing seasonal time series data with linear regression. The fitted linear line would be a good predictor twice every cycle. You will need to do something about the seasonal fluctuations in order to make better forecasts; otherwise, they will simply be straight lines with no account of the seasonality.

With seasonality in mind, there are functions in R that can break apart the trend, seasonality, and random components of a time series. The decompose() function found in the forecast package shows how each of these three components influence the data. You can think of this technique as being similar to creating the correlogram plot during exploratory data analysis. It captures a greater understanding of the data in a single plot:
It captures a greater understanding of the data in a single plot: library(forecast); plot(decompose(airpass)) The output of this code is shown here: This decomposition capability is nice as it gives you insights about approaches you may want to take with the data, and with reference to the previous output, they are described as follows: The top panel provides a view of the original data for context. The next panel shows the trend. It smooths the data and removes the seasonal component. In this case, you will see that over time, air passenger volume has increased steadily and in the same direction. The third plot shows the seasonal component. Removing the trend helps reveal any seasonal cycles. This data shows a regular and repeated seasonal cycle through the years. The final plot is the randomness—everything else in the data. It is like the error term in linear regression. You will see less error in the middle of the series. The stationary assumption There is an assumption for creating time series models. The data must be stationary. Stationary data exists when its mean and variance do not change as a function of time. If you decompose a time series and witness a trend, seasonal component, or both, then you have non-stationary data. You can transform them into stationary data in order to meet the required assumption. Using a linear model for comparison, there is randomness around a mean—represented by data points scattered randomly around a fitted line. The data is independent of time and it does not follow other data in a cycle. This means that the data is stationary. Not all the data lies on the fitted line, but it is not moving. In order to analyze time series data, you need your data points to stay still. Imagine trying to count a class of primary school students while they are on the playground during recess. They are running about back and forth. In order to count them, you need them to stay still—be stationary. Transforming non-stationary data into stationary data allows you to analyze it. You can transform non-stationary data into stationary data using a technique called differencing. The differencing techniques Differencing subtracts each data point from the data point that is immediately in front of it in the series. This is done with the diff() function. Mathematically, it works as follows: Seasonal differencing is similar, but it subtracts each data point from its related data point in the next cycle. This is done with the diff() function, along with a lag parameter set to the number of data points in a cycle. Mathematically, it works as follows: Look at the results of differencing in this toy example. Build a small sample dataset of 36 data points that include an upward trend and seasonal component, as shown here: seq_down <- seq(.625, .125, -0.125) seq_up <- seq(0, 1.5, 0.25) y <- c(seq_down, seq_up, seq_down + .75, seq_up + .75, seq_down + 1.5, seq_up + 1.5) Then, plot the original data and the results obtained after calling the diff() function: par(mfrow = c(1, 3)) plot(y, type = "b", ylim = c(-.1, 3)) plot(diff(y), ylim = c(-.1, 3), xlim = c(0, 36)) plot(diff(diff(y), lag = 12), ylim = c(-.1, 3), xlim = c(0, 36)) par(mfrow = c(1, 1)) detach(package:TSA, unload=TRUE) These three panels show the results of differencing and seasonal differencing. 
Detach the TSA package to avoid conflicts with other functions in the forecast library we will use, as follows:

detach(package:TSA, unload=TRUE)

These three panes are described as follows:

- The left pane shows n = 36 data points, with 12 points in each of the three cycles. It also shows a steadily increasing trend. Either of these characteristics breaks the stationary data assumption.
- The center pane shows the results of differencing. Plotting the difference between each point and its next neighbor removes the trend. Also, notice that you get one less data point. With differencing, you get (n - 1) results.
- The right pane shows seasonal differencing with a lag of 12. The data is stationary. Notice that the trend-differenced data is now the input to the seasonal differencing. Also note that you will lose a cycle of data, getting (n - lag) results.

Summary

Congratulations, you truly deserve recognition for getting through a very tough topic. You now have more awareness about time series analysis than some people with formal statistical training.


Data Extracting, Transforming, and Loading

Packt
01 Aug 2016
15 min read
In this article by Yu-Wei Chiu, author of the book R for Data Science Cookbook, we cover the following topics:

- Scraping web data
- Accessing Facebook data

Before using data to answer critical business questions, the most important thing is to prepare it. Data is normally archived in files, and using Excel or text editors allows it to be easily obtained. However, data can be located in a range of different sources, such as databases, websites, and various file formats. Being able to import data from these sources is crucial.

There are four main types of data. Data recorded in a text format is the most simple. As some users require storing data in a structured format, files with a .tab or .csv extension can be used to arrange data in a fixed number of columns. For many years, Excel has held a leading role in the field of data processing, and this software uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, as most data is not stored in a database, we must know how to use the web scraping technique to obtain data from the internet. As part of this chapter, we will introduce how to scrape data from the internet using the rvest package.

Many experienced developers have already created packages to allow beginners to obtain data more easily, and we focus on leveraging these packages to perform data extraction, transformation, and loading. In this chapter, we will first learn how to utilize R packages to read data from a text format and scan files line by line. We then move to the topic of reading structured data from databases and Excel. Finally, we will learn how to scrape internet and social network data using the R web scraper.

Scraping web data

In most cases, the majority of data will not exist in your database, but it will instead be published in different forms on the internet. To dig more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.

Getting ready

For this recipe, prepare your environment with R installed on a computer with internet access.

How to do it...

Perform the following steps to scrape data from http://www.bloomberg.com/:

First, access the following link to browse the S&P 500 index on the Bloomberg Business website: http://www.bloomberg.com/quote/SPX:IND

Once the page appears as shown in the preceding screenshot, we can begin installing and loading the rvest package:

> install.packages("rvest")
> library(rvest)

Next, you can use the html function from the rvest package to scrape and parse the HTML page of the link to the S&P 500 index at http://www.bloomberg.com/:

> spx_quote <- html("http://www.bloomberg.com/quote/SPX:IND")

Use the browser's built-in web inspector to inspect the location of the detail quote (marked with a red rectangle) below the index chart. You can then move the mouse over the detail quote and click on the target element that you wish to scrape down.
As the following screenshot shows, the <div class="cell"> section holds all the information that we need.

Extract elements with the class of cell using the html_nodes function:

> cell <- spx_quote %>% html_nodes(".cell")

Furthermore, we can parse the label of the detailed quote from elements with the class of cell__label, extract text from the scraped HTML, and eventually clean spaces and newline characters from the extracted text:

> label <- cell %>%
+     html_nodes(".cell__label") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

We can also extract the value of the detailed quote from the element with the class of cell__value, extract text from the scraped HTML, as well as clean spaces and newline characters:

> value <- cell %>%
+     html_nodes(".cell__value") %>%
+     html_text() %>%
+     lapply(function(e) gsub("\n|\\s+", "", e))

Finally, we can set the extracted label as the names of value:

> names(value) <- label

Next, we can access the energy and oil market index page at this link (http://www.bloomberg.com/energy), as shown in the following screenshot. We can then use the web inspector to inspect the location of the table element. Finally, we can use html_table to extract the table element with the class of data-table:

> energy <- html("http://www.bloomberg.com/energy")
> energy.table <- energy %>% html_node(".data-table") %>% html_table()

How it works...

The most difficult step in scraping data from a website is that web data is published and structured in different formats. We have to fully understand how data is structured within the HTML tags before continuing. As HTML (Hypertext Markup Language) is a language that has similar syntax to XML, we can use the XML package to read and parse HTML pages. However, the XML package only provides the XPath method, which has two main shortcomings, as follows:

- Inconsistent behavior in different browsers
- It is hard to read and maintain

For these reasons, we recommend using the CSS selector over XPath when parsing HTML. Python users may be familiar with how to scrape data quickly using requests and the BeautifulSoup packages. The rvest package is the counterpart package in R, which provides the same capability to simply and efficiently harvest data from HTML pages.

In this recipe, our target is to scrape the finance data of the S&P 500 detail quote from http://www.bloomberg.com/. Our first step is to make sure that we can access our target webpage through the internet, which is followed by installing and loading the rvest package. After installation and loading is complete, we can then use the html function to read the source code of the page into spx_quote.

Once we have confirmed that we can read the HTML page, we can start parsing the detailed quote from the scraped HTML. However, we first need to inspect the CSS path of the detail quote. There are many ways to inspect the CSS path of a specific element. The most popular method is to use the development tool built into each browser (press F12 or FN + F12) to inspect the CSS path. Using Google Chrome as an example, you can open the development tool by pressing F12. A DevTools window may show up somewhere in the visual area (you may refer to https://developer.chrome.com/devtools/docs/dom-and-styles#inspecting-elements). You can then move the mouse cursor to the upper left of the DevTools window and select the Inspect Element icon (a magnifier icon). Next, click on the target element, and the DevTools window will highlight the source code of the selected area.
You can then move the mouse cursor to the highlighted area and right-click on it. From the pop-up menu, click on Copy CSS Path to extract the CSS path. Or, you can examine the source code and find that the selected element is structured in HTML code with the class of cell.

One highlight of rvest is that it is designed to work with magrittr, so that we can use the %>% pipeline operator to chain output parsed at each stage. Thus, we can first obtain the output source by calling spx_quote and then pipe the output to html_nodes. As the html_nodes function uses the CSS selector to parse elements, the function takes basic selectors with type (for example, div), ID (for example, #header), and class (for example, .cell). As the elements to be extracted have the class of cell, you should place a period (.) in front of cell.

Finally, we should extract both label and value from the previously parsed nodes. Here, we first extract the elements of class cell__label, and we then use html_text to extract text. We can then use the gsub function to clean spaces and newline characters from the parsed text. Likewise, we apply the same pipeline to extract the elements of the cell__value class. As we extracted both label and value from the detail quote, we can apply the label as the names of the extracted values. We have now organized data from the web into structured data.

Alternatively, we can also use rvest to harvest tabular data. Similarly to the process used to harvest the S&P 500 index, we can first access the energy and oil market index page. We can then use the web element inspector to find the element location of the table data. As we have found the element located in the class of data-table, we can use the html_table function to read the table content into an R data frame.

There's more...

Instead of using the web inspector built into each browser, we can consider using SelectorGadget (http://selectorgadget.com/) to search for the CSS path. SelectorGadget is a very powerful and simple to use extension for Google Chrome, which enables the user to extract the CSS path of the target element with only a few clicks:

To begin using SelectorGadget, access this link: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb. Then, click on the green button (circled in the red rectangle as shown in the following screenshot) to install the plugin to Chrome.

Next, click on the upper-right icon to open SelectorGadget, and then select the area that needs to be scraped down. The selected area will be colored green, and the gadget will display the CSS path of the area and the number of elements matched to the path.

Finally, you can paste the extracted CSS path into html_nodes as an input argument to parse the data.

Besides rvest, we can connect R to Selenium via RSelenium to scrape the web page. Selenium was originally designed for automating web applications; it enables the user to command a web browser to automate processes through simple scripts. However, we can also use Selenium to scrape data from the Internet.
The following instructions present a sample demo of how to scrape Bloomberg.com using RSelenium:

First, access this link to download the Selenium standalone server: http://www.seleniumhq.org/download/

Next, start the Selenium standalone server using the following command:

$ java -jar selenium-server-standalone-2.46.0.jar

If you can successfully launch the standalone server, you should see a message indicating that you can connect to the server, which binds to port 4444.

At this point, you can begin installing and loading RSelenium with the following commands:

> install.packages("RSelenium")
> library(RSelenium)

After RSelenium is installed, register the driver and connect to the Selenium server:

> remDr <- remoteDriver(remoteServerAddr = "localhost"
+                       , port = 4444
+                       , browserName = "firefox"
+ )

Examine the status of the registered driver:

> remDr$getStatus()

Next, we navigate to Bloomberg.com:

> remDr$open()
> remDr$navigate("http://www.bloomberg.com/quote/SPX:IND")

Finally, we can scrape the data using the CSS selector:

> webElem <- remDr$findElements('css selector', ".cell")
> webData <- sapply(webElem, function(x){
+     label <- x$findChildElement('css selector', '.cell__label')
+     value <- x$findChildElement('css selector', '.cell__value')
+     cbind(c("label" = label$getElementText(), "value" = value$getElementText()))
+ }
+ )

Accessing Facebook data

Social network data is another great source for a user who is interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, we can easily access the data without the need to inspect how the data is structured. In this recipe, we will illustrate how to use rvest and rjson to read and parse data from Facebook.

Getting ready

For this recipe, prepare your environment with R installed on a computer with Internet access.

How to do it…

Perform the following steps to access data from Facebook:

First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/), as shown in the following screenshot. Click on Tools & Support and select Graph API Explorer. Next, click on Get Token and choose Get Access Token. On the User Data Permissions pane, select user_tagged_places and then click on Get Access Token. Copy the generated access token to the clipboard.

Try to access the Facebook API using rvest:

> access_token <- '<access_token>'
> fb_data <- html(sprintf("https://graph.facebook.com/me/tagged_places?access_token=%s", access_token))

Install and load the rjson package:

> install.packages("rjson")
> library(rjson)

Extract the text from fb_data and then use fromJSON to read the JSON data:

> fb_json <- fromJSON(fb_data %>% html_text())

Use sapply to extract the name and ID of the place from fb_json:

> fb_place <- sapply(fb_json$data, function(e){e$place$name})
> fb_id <- sapply(fb_json$data, function(e){e$place$id})

Last, use data.frame to wrap the data:

> data.frame(place = fb_place, id = fb_id)

How it works…

In this recipe, we covered how to retrieve social network data through Facebook's Graph API. Unlike scraping web pages, you need to obtain a Facebook access token before making any request for insight information. There are two ways to retrieve the access token: the first is to use Facebook's Graph API Explorer, and the other is to create a Facebook application.
In this recipe, we illustrated how to use the Graph API Explorer to obtain the access token. Facebook's Graph API Explorer is where you can craft your request URLs to access Facebook data on your behalf. To access the explorer page, we first visit Facebook's developer page (https://developers.facebook.com/). The Graph API Explorer page is under the drop-down menu of Tools & Support. After entering the explorer page, we select Get Access Token from the drop-down menu of Get Token. Subsequently, a tabbed window will appear; we can check access permission to various levels of the application. For example, we can check tagged_places to access the locations that we previously tagged. After we have selected the permissions that we require, we can click on Get Access Token to allow Graph API Explorer to access our insight data. After completing these steps, you will see an access token, which is a temporary and short-lived token that you can use to access the Facebook API.

With the access token, we can then access the Facebook API with R. First, we need an HTTP request package. Similarly to the web scraping recipe, we can use the rvest package to make the request. We craft a request URL with the addition of the access_token (copied from Graph API Explorer) to the Facebook API. From the response, we should receive JSON formatted data. To read the attributes of the JSON format data, we install and load the rjson package. We can then use the fromJSON function to read the JSON format string extracted from the response. Finally, we read place and ID information through the use of the sapply function, and we can then use data.frame to transform the extracted information into a data frame. At the end of this recipe, we should see data formatted in the data frame.

There's more...

To learn more about the Graph API, you can read the official documentation from Facebook (https://developers.facebook.com/docs/reference/api/field_expansion/).

First, we need to install and load the Rfacebook package:

> install.packages("Rfacebook")
> library(Rfacebook)

We can then use built-in functions to retrieve data from the user or access similar information with the provision of an access token:

> getUsers("me", "<access_token>")

If you want to scrape public fan pages without logging into Facebook every time, you can create a Facebook app to access insight information on behalf of the app.

To create an authorized app token, log in to the Facebook developer page and click on Add a New App. You can create a new Facebook app with any name, providing that it has not already been registered. Finally, you can copy both the app ID and app secret and craft the access token as <APP ID>|<APP SECRET>. You can now use this token to scrape public fan page information with the Graph API.

Similarly to Rfacebook, we can then replace the access_token with <APP ID>|<APP SECRET>:

> getUsers("me", "<access_token>")

Summary

In this article, we learned how to utilize R packages to read data from a text format and scan files line by line. We also learned how to scrape internet and social network data using the R web scraper.


Overview of Certificate Management

Packt
18 Jul 2016
24 min read
In this article by David Steadman and Jeff Ingalls, the authors of Microsoft Identity Manager 2016 Handbook, we will look at certificate management in brief. Microsoft Identity Manager (MIM) certificate management (CM) is deemed the outcast in many discussions. We are here to tell you that this is not the case. We see many scenarios where CM makes the management of user-based certificates possible and improved. If you are currently using FIM certificate management or considering a new certificate management deployment with MIM, we think you will find that CM is a component to consider. CM is not a requirement for using smart cards, but it adds a lot of functionality and security to the process of managing the complete life cycle of your smart cards and software-based certificates in a single forest or multiforest scenario.

In this article, we will look at the following topics:

- What is CM?
- Certificate management components
- Certificate management agents
- The certificate management permission model

What is certificate management?

Certificate management extends MIM functionality by adding a management policy-driven workflow that enables the complete life cycle of initial enrollment, duplication, and the revocation of user-based certificates. Some smart card features include offline unblocking, duplicating cards, and recovering a certificate from a lost card. The concept of this policy is driven by a profile template within the CM application. Profile templates are stored in Active Directory, which means the application already has built-in redundancy. CM is based on the idea that the product acts as a proxy, or middle man, making requests to the CA and retrieving the results on behalf of the user. CM performs its functions with user agents that encrypt and decrypt its communications.

When discussing PKI (Public Key Infrastructure) and smart cards, you usually need to have some discussion about the level of assurance you would like for the identities secured by your PKI. For basic insight on PKI and assurance, take a look at http://bit.ly/CorePKI. In typical scenarios, many PKI designers argue that you should use a Hardware Security Module (HSM) to secure your PKI in order to get the assurance level to use smart cards. Our personal opinion is that HSMs are great if you need high assurance on your PKI, but smart cards increase your security even if your PKI has medium or low assurance. Using MIM CM with an HSM will not be covered in this article, but if you take a look at http://bit.ly/CMandLunSA, you will find some guidelines on how to use MIM CM and HSM Luna SA.

The Financial Company has a low-assurance PKI with only one enterprise root CA issuing the certificates. The Financial Company does not use an HSM with their PKI or their MIM CM. If you are running a medium- or high-assurance PKI within your company, policies on how to issue smart cards may differ from the example. More details on PKI design can be found at http://bit.ly/PKIDesign.

Certificate management components

Before we talk about certificate management, we need to understand the underlying components and architecture. As depicted before, we have several components at play. We will start from the left to the right. From a high level, we have the Enterprise CA. The Enterprise CA can be multiple CAs in the environment. Communication from the CM application server to the CA is over the DCOM/RPC channel. End user communication can be with the CM web page or with a new REST API via a modern client to enable the requesting of smart cards and the management of these cards.
End user communication can be with the CM web page or with a new REST API via a modern client to enable the requesting of smart cards and the management of these cards. From the CM perspective, the two mandatory components are the CM server and the CA modules. Looking at the logical architecture, we have the CA, and underneath this, we have the modules. The policy and exit module, once installed, control the communication and behavior of the CA based on your CM's needs. Moving down the stack, we have Active Directory integration. AD integration is the nuts and bolts of the operation. Integration into AD can be very complex in some environments, so understanding this area and how CM interacts with it is very important. We will cover the permission model later in this article, but it is worth mentioning that most of the configuration is done and stored in AD along with the database. CM uses its own SQL database, and the default name is FIMCertificateManagement. The CM application uses its own dedicated IIS application pool account to gain access to the CM database in order to record transactions on behalf of users. By default, the application pool account is granted the clmApp role during the installation of the database, as shown in the following screenshot:   In CM, we have a concept called the profile template. The profile template is stored in the configuration partition of AD, and the security permissions on this container and its contents determine what a user is authorized to see. As depicted in the following screenshot, CM stores the data in the Public Key Services (1) and the Profile Templates container. CM then reads all the stored templates and the permissions to determine what a user has the right to do (2): Profile templates are at the core of the CM logic. The three components comprising profile templates are certificate templates, profile details, and management policies. The first area of the profile template is certificate templates. Certificate templates define the extensions and data point that can be included in the certificate being requested. The next item is profile details, which determines the type of request (either a smart card or a software user-based certificate), where we will generate the certificates (either on the server or on the client side of the operations), and which certificate templates will be included in the request. The final area of a profile template is known as management policies. Management policies are the workflow engine of the process and contain the manager, the subscriber functions, and any data collection items. The e-mail function is initiated here and commonly referred to as the One Time Password (OTP) activity. Note the word "One". A trigger will only happen once here; therefore, multiple alerts using e-mail would have to be engineered through alternate means, such as using the MIM service and expiration activities. The permission model is a bit complex, but you'll soon see the flexibility it provides. Keep in mind that Service Connection Point (SCP) also has permissions applied to it to determine who can log in to the portal and what rights the user has within the portal. SCP is created upon installation during the wizard configuration. You will want to be aware of the SCP location in case you run into configuration issues with administrators not being able to perform particular functions. 
The SCP location is in the System container, within Microsoft, and within Certificate Lifecycle Manager, as shown here:

Typical location: CN=Certificate Lifecycle Manager,CN=Microsoft,CN=System,DC=THEFINANCIALCOMPANY,DC=NET

Certificate management agents

We covered several key components of the profile templates and where some of the permission model is stored. We now need to understand how the separation of duties is defined within the agent roles. The permission model provides granular control, which promotes the separation of duties. CM uses six agent accounts, and they can be named to fit your organization's requirements. We will walk through the initial setup again later in this article so that you can use our setup or alter it based on your needs. The Financial Company only requires the typical setup. We precreated the following accounts for TFC, but the wizard will create them for you if you do not use them. During the installation and configuration of CM, we will use the following accounts:

Besides the separation of duties, CM offers enrollment by proxy. Proxy enrollment of a request refers to providing a middle man that gives the end user a fluid workflow during enrollment. Most of this proxying is accomplished via the agent accounts in one way or another. The first account is the MIM CM Agent (MIMCMAgent), which is used by the CM server to encrypt data, from the smart card admin PINs to the data collection stored in the database. So, the agent account has an important role in protecting data and communication to and from the certificate authorities. The last role the CM agent account has is the capability to revoke certificates. The agent certificate thumbprint is very important, and you need to make sure the correct value is updated in the three areas: CM, web.config, and the certificate policy module under the Signing Certificates tab on the CA. We have identified these areas in the following.

For web.config:

<add key="Clm.SigningCertificate.Hash" value
<add key="Clm.Encryption.Certificate.Hash" value
<add key="Clm.SmartCard.ExchangeCertificate.Hash" value

The Signing Certificates tab is as shown in the following screenshot:

Now, when you run through the configuration wizard, these items are already updated, but it is good to know which locations need to be updated if you need to troubleshoot agent issues or even update/renew this certificate.

The second account we want to look at is the Key Recovery Agent (MIMCMKRAgent); this agent account is needed for CM to recover any archived private key certificates. Now, let's look at the Enrollment Agent (MIMCMEnrollAgent); the main purpose of this agent account is to provide the enrollment of smart cards. The Enrollment Agent, as we call it, is responsible for signing all smart card requests before they are submitted to the CA. Typical permission for this account on the CA is read and request. The Authorization Agent (MIMCMAuthAgent), or, as some folks call it, the authentication agent, is responsible for determining access rights for all objects from a DACL perspective. When you log in to the CM site, it is the authorization account's job to determine what you have the right to do based on all the ACLs applied to the core components. We will go over all the agent accounts and the rights they need later in this article during our setup. The CA Manager Agent (MIMCMManagerAgent) is used to perform core CA functions. More importantly, its job is to issue Certificate Revocation Lists (CRLs). This happens when a smart card or certificate is retired or revoked.
It is up to this account to make sure the CRL is updated with this critical information. We saved the best for last: Web Pool Agent (MIMCMWebAgent). This agent is used to run the CM web application. The agent is the account that contacts the SQL server to record all user and admin transactions. The following is a good depiction of all the accounts together and the high-level functions:   The certificate management permission model In CM, we think this part is the most complex because with the implementation, you can be as granular as possible. For this reason, this area is the most difficult to understand. We will uncover the permission model so that we can begin to understand how the permission model works within CM. When looking at CM, you need to formulate the type of management model you will be deploying. What we mean by this is will you have a centralized or delegated model? This plays a key part in deployment planning for CM and the permission you will need to apply. In the centralized model, a specific set of managers are assigned all the rights for the management policy. This includes permissions on the users. Most environments use this method as it is less complex for environments. Now, within this model, we have manager-initiated permission, and this is where CM permissions are assigned to groups containing the subscribers. Subscribers are the actual users doing the enrollment or participating in the workflow. This is the model that The Financial Company will use in its configuration. The delegated model is created by updating two flags in web.config called clm.RequestSecurity.Flags and clm.RequestSecurity.Groups. These two flags work hand in hand as if you have UseGroups, then it will evaluate all the groups within the forests to include universal/global security. Now, if you use UseGroups and define clm.RequestSecurity.Groups, then it will only look for these specific groups and evaluate via the Authorization Agent . The user will tell the Authorization Agent to only read the permission on the user and ignore any group membership permissions:   When we continue to look at the permission, there are five locations that permissions can be applied in. In the preceding figure is an outline of these locations, but we will go in more depth in the subsections in a bit. The basis of the figure is to understand the location and what permission can be applied. The following are the areas and the permissions that can be set: Service Connection Point: Extended Permissions Users or Groups: Extended Permissions Profile Template Objects: Container: Read or Write Template Object: Read/Write or Enroll Certificate Template: Read or Enroll CM Management Policy within the Web application: We have multiple options based on the need, such as Initiate Request Now, let's begin to discuss the core areas to understand what they can do. So, The Financial Company can design the enrollment option they want. In the example, we will use the main scenario we encounter, such as the helpdesk, manager, and user-(subscriber) based scenarios. For example, certain functions are delegated to the helpdesk to allow them to assist the user base without giving them full control over the environment (delegated model). Remember this as we look at the five core permission areas. Creating service accounts So far, in our MIM deployment, we have created quite a few service accounts. MIM CM, however, requires that we create a few more. 
During the configuration wizard, we will get the option of having the wizard create them for us, but we always recommend creating them manually in FIM/MIM CM deployments. One reason is that a few of these need to be assigned some certificates. If we use an HSM, we have to create it manually in order to make sure the certificates are indeed using the HSM. The wizard will ask for six different service accounts (agents), but we actually need seven. In The Financial Company, we created the following seven accounts to be used by FIM/MIM CM: MIMCMAgent MIMCMAuthAgent MIMCMCAManagerAgent MIMCMEnrollAgent MIMCMKRAgent MIMCMWebAgent MIMCMService The last one, MIMCMService, will not be used during the configuration wizard, but it will be used to run the MIM CM Update service. We also created the following security groups to help us out in the scenarios we will go over: MIMCM-Helpdesk: This is the next step in OTP for subscribers MIMCM-Managers: These are the managers of the CM environment MIMCM-Subscribers: This is group of users that will enroll Service Connection Point Service Connection Point (SCP)is located under the Systems folder within Active Directory. This location, as discussed in the earlier parts of the article, defines who functions as the user as it relates to logging in to the web application. As an example, if we just wanted every user to only log in, we would give them read rights. Again, authenticated users, have this by default, but if you only wanted a subset of users to access, you should remove authenticated users and add your group. When you run the configuration wizard, SCP is decided, but the default is the one shown in the following screenshot:   If a user is assigned to any of the MIM CM permissions available on SCP, the administrative view of the MIM CM portal will be shown. The MIM CM permissions are defined in a Microsoft TechNet article at http://bit.ly/MIMCMPermission. For your convenience, we have copied parts of the information here: MIM CM Audit: This generates and displays MIM CM policy templates, defines management policies within a profile template, and generates MIM CM reports. MIM CM Enrollment Agent: This performs certificate requests for the user or group on behalf of another user. The issued certificate's subject contains the target user's name and not the requester's name. MIM CM Request Enroll: This initiates, executes, or completes an enrollment request. MIM CM Request Recover: This initiates encryption key recovery from the CA database. MIM CM Request Renew: This initiates, executes, or completes an enrollment request. The renewal request replaces a user's certificate that is near its expiration date with a new certificate that has a new validity period. MIM CM Request Revoke: This revokes a certificate before the expiration of the certificate's validity period. This may be necessary, for example, if a user's computer or smart card is stolen. MIM CM Request Unblock Smart Card: This resets a smart card's user Personal Identification Number (PIN) so that he/she can access the key material on a smart card. The Active Directory extended permissions So, even if you have the SCP defined, we still need to set up the permissions on the user or group of users that we want to manage. As in our helpdesk example, if we want to perform certain functions, the most common one is offline unblock. This would require the MIMCM-HelpDesk group. We will create this group later in this article. 
It would contain all help desk users then on SCP; we would give them CM Request Unblock Smart Card and CM Enrollment Agent. Then, you need to assign the permission to the extended permission on MIMCM-Subscribers, which contains all the users we plan to manage with the helpdesk and offline unblock:   So, as you can see, we are getting into redundant permissions, but depending on the location, it means what the user can do. So, planning of the model is very important. Also, it is important to document what you have as with some slight tweak, things can and will break. The certificate templates permission In order for any of this to be possible, we still need to give permission to the manager of the user to enroll or read the certificate template, as this will be added to the profile template. For anyone to manage this certificate, everyone will need read and enroll permissions. This is pretty basic, but that is it, as shown in the following screenshot:   The profile template permission The profile template determines what a user can read within the template. To get to the profile template, we need to use Active Directory sites and services to manage profile templates. We need to activate the services node as this is not shown by default, and to do this, we will click on View | Show Services Node:   As an example if you want a user to enroll in the cert, he/she would need CM Enroll on the profile template, as shown in the following screenshot:   Now, this is for users, but let's say you want to delegate the creation of profile templates. For this, all you need to do is give the MIMCM-Managers delegate the right to create all child items on the profile template container, as follows:   The management policy permission For the management policy, we will break it down into two sections: a software-based policy and a smart card management policy. As we have different capabilities within CM based on the type, by default, CM comes with two sample policies (take a look at the following screenshot), which we use for duplication to create a new one. When configuring, it is good to know that you cannot combine software and smart card-based certificates in a policy:   The software management policy The software-based certificate policy has the following policies available through the CM life cycle:   The Duplicate Policy panel creates a duplicate of all the certificates in the current profile. Now, if the first profile is created for the user, all the other profiles created afterwards will be considered duplicate, and the first generated policy will be primary. The Enroll Policy panel defines the initial enrollment steps for certificates such as initiate enroll request and data collection during enroll initiation. The Online Update Policy panel is part of the automatic policy function when key items in the policy change. This includes certificates about to expire, when a certificate is added to the existing profile template or even removed. The Recover Policy panel allows for the recovery of the profile in the event that the user was deleted. This includes the cases where certs are deleted by accident. One thing to point out is if the certificate was a signing cert, the recovery policy would issue a new replacement cert. However, if the cert was used for encryption, you can recover the original using this policy. The Recover On Behalf Policy panel allows managers or helpdesk operations to be recovered on behalf the user in the event that they need any of the certificates. 
The Renew Policy panel is the workflow that defines the renew setting, such as revocation and who can initiate a request. The Suspend and Reinstate Policy panel enables a temporary revocation of the profile and puts a "certificate hold" status. More information about the CRL status can be found at http://bit.ly/MIMCMCertificateStatus. The Revoke Policy panel maintains the revocation policy and setting around being able to set the revocation reason and delay. Also, it allows the system to push a delta CRL. You also can define the initiators for this policy workflow. The smart card management policy The smart card policy has some similarities to the software-based policy, but it also has a few new workflows to manage the full life cycle of the smart card:   The Profile Details panel is by far the most commonly used part in this section of the policy as it defines all the smart card certificates that will be loaded in the policy along with the type of provider. One key item is creating and destroying virtual smart cards. One final key part is diversifying the admin key. This is best practice as this secures the admin PIN using diversification. So, before we continue, we want to go over this setting as we think it is an important topic. Diversifying the admin key is important because each card or batch of cards comes with a default admin key. Smart cards may have several PINs, an admin PIN, a PINunlock key (PUK), and a user PIN. This admin key, as CM refers to it, is also known as the administrator PIN. This PIN differs from the user's PIN. When personalizing the smart card, you configure the admin key, the PUK, and the user's PIN. The admin key and the PUK are used to reset the virtual smart card's PIN. However, you cannot configure both. You must use the PUK to unlock the PIN if you assign one during the virtual smart card's creation. It is important to note that you must use the PUK to reset the PIN if you provide both a PUK and an admin key. During the configuration of the profile template, you will be asked to enter this key as follows:   The admin key is typically used by smart card management solutions that enable a challenge response approach to PIN unlocking. The card provides a set of random data that the user reads (after the verification of identity) to the deployment admin. The admin then encrypts the data with the admin key (obtained as mentioned before) and gives the encrypted data back to the user. If the encrypted data matches that produced by the card during verification, the card will allow PIN resetting. As the admin key is never in the hands of anyone other than the deployment administrator, it cannot be intercepted or recorded by any other party (including the employee) and thus has significant security benefits beyond those in using a PUK—an important consideration during the personalization process. When enabled, the admin key is set to a card-unique value when the card is assigned to the user. The option to diversify admin keys with the default initialization provider allows MIM CM to use an algorithm to uniquely generate a new key on the card. The key is encrypted and securely transmitted to the client. It is not stored in the database or anywhere else. MIM CM recalculates the key as needed to manage the card:   The CM profile template contains a thumbprint for the certificate to be used in admin key diversification. CM looks in the personal store of the CM agent service account for the private key of the certificate in the profile template. 
Once located, the private key is used to calculate the admin key for the smart card. The admin key allows CM to manage the smart card (issuing, revoking, retiring, renewing, and so on). Loss of the private key prevents the management of cards diversified using this certificate. More detail on the control can be found at http://bit.ly/MIMCMDiversifyAdminKey. Continuing on, the Disable Policy panel defines the termination of the smart card before expiration, you can define the reason if you choose. Once disabled, it cannot be reused in the environment. The Duplicate Policy panel, similarly to the software-based one, produces a duplicate of all the certificates that will be on the smart card. The Enroll Policy panel, similarly to the software policy, defines who can initiate the workflow and printing options. The Online Update Policy panel, similarly to the software-based cert, allows for the updating of certificates if the profile template is updated. The update is triggered when a renewal happens or, similarly to the software policy, a cert is added or removed. The Offline Unblock Policy panel is the configuration of a process to allow offline unblocking. This is used when a user is not connected to the network. This process only supports Microsoft-based smart cards with challenge questions and answers via, in most cases, the user calling the helpdesk. The Recovery On Behalf Policy panel allows the recovery of certificates for the management or the business to recover if the cert is needed to decrypt information from a user whose contract was terminated or who left the company. The Replace Policy panel is utilized by being able to replace a user's certificate in the event of a them losing their card. If the card they had had a signing cert, then a new signing cert would be issued on this new card. Like with software certs, if the certificate type is encryption, then it would need to be restored on the replace policy. The Renew Policy panel will be used when the profile/certificate is in the renewal period and defines revocation details and options and initiates permission. The Suspend and Reinstate Policy panel is the same as the software-based policy for putting the certificate on hold. The Retire Policy panel is similar to the disable policy, but a key difference is that this policy allows the card to be reused within the environment. The Unblock Policy panel defines the users that can perform an actual unblocking of a smart card. More in-depth detail of these policies can be found at http://bit.ly/MIMCMProfiletempates. Summary In this article, we uncovered the basics of certificate management and the management components that are required to successfully deploy a CM solution. Then, we discussed and outlined, agent accounts and the roles they play. Finally, we looked into the management permission model from the policy template to the permissions and the workflow. Resources for Article: Further resources on this subject: Managing Network Devices [article] Logging and Monitoring [article] Creating Horizon Desktop Pools [article]


MicroStrategy 10

Packt
15 Jul 2016
13 min read
In this article by Dmitry Anoshin, Himani Rana, and Ning Ma, the authors of the book Mastering Business Intelligence with MicroStrategy, we are going to talk about MicroStrategy 10, one of the leading platforms on the market: it can handle all data analytics demands and offers a powerful solution. We will be discussing different concepts of MicroStrategy, such as its history, deployment, and so on. (For more resources related to this topic, see here.)

Meet MicroStrategy 10

MicroStrategy is a market leader in Business Intelligence (BI) products. It has rich functionality in order to meet the requirements of modern businesses. In 2015, MicroStrategy released version 10 of the platform. It offers both agility and governance like no other BI product. In addition, it is easy to use and enterprise ready. At the same time, it is great for both IT and business. In other words, MicroStrategy 10 offers an analytics platform that combines an easy and empowering user experience with enterprise-grade performance, management, and security capabilities. It is true bimodal BI and moves seamlessly between styles:

Data discovery and visualization
Enterprise reporting and dashboards
In-memory high performance BI
Scales from departments to enterprises
Administration and security

MicroStrategy 10 consists of three main products: MicroStrategy Desktop, MicroStrategy Mobile, and MicroStrategy Web. MicroStrategy Desktop lets users start discovering and visualizing data instantly. It is available for Mac and PC. It allows users to connect, prepare, discover, and visualize data. In addition, we can easily promote content to a MicroStrategy Server. Moreover, MicroStrategy Desktop has a brand new HTML5 interface and includes all connection drivers. It allows us to use data blending, data preparation, and data enrichment. Finally, it has powerful advanced analytics and can be integrated with R. To cut a long story short, we want to highlight the main changes in the new BI platform. The developer tools keep the same functionality, look, and architecture; the changes concern the Web interface and the Intelligence Server. Let's look closer at what MicroStrategy 10 can show us.

MicroStrategy 10 expands the analytical ecosystem by using third-party toolkits such as:

Data visualization libraries: We can easily plug in and use any visualization from the expanding range of Java libraries
Statistical toolkits: R, SAS, SPSS, KXEN, and others
Geolocation data visualization: Uses mapping capabilities to visualize and interact with location data

MicroStrategy 10 has more than 25 new data sources that we can connect to quickly and simply. In addition, it allows us to build reports on top of other BI tools, such as SAP Business Objects, Cognos, and Oracle BI. It has a new connector to Hadoop that connects natively. Moreover, it allows us to blend multiple data sources in-memory. MicroStrategy 10 also has rich functionality for working with data, such as:

Streamlined workflows to parse and prepare data
Multi-table in-memory support from different sources
Automatically parse and prepare data with every refresh
100+ inbuilt functions to profile and clean data
Create custom groups on the fly without coding

In terms of connecting to Hadoop, most BI products use Hive or Impala ODBC drivers in order to use SQL to get data from Hadoop. However, this method performs poorly. MicroStrategy 10 queries Hadoop directly. As a result, it is up to 50 times faster than going via ODBC.
Let's look at some of the main technical changes that have significantly improved MicroStrategy. The platform is now faster than ever before, because it doesn't have a two-billion-row limit on in-memory datasets and allows us to create analytical cubes up to 16 times bigger in size. It publishes cubes dramatically faster. Moreover, MicroStrategy 10 has higher data throughput, and cubes can be loaded in parallel 4 times faster with multi-threaded parallel loading. In addition, the in-memory engine allows us to create cubes 80 times larger than before, and we can access data from cubes 50% faster by using up to 8 parallel threads. Look at the following table, where we compare in-memory cube functionality in version 9 versus version 10:

Feature           Ver. 9        Ver. 10
Data volume       100 GB        ~2 TB
Number of rows    2 billion     200 billion
Load rate         8 GB/hour     ~200 GB/hour
Data model        Star schema   Any schema, tabular or multiple sets

In order to make the administration of MicroStrategy more effective, the new version ships with MicroStrategy Operations Manager. It gives MicroStrategy administrators powerful development tools to monitor, automate, and control systems. Operations Manager gives us:

Centralized management in a web browser
Enterprise Manager Console within the tool
Triggers and 24/7 alerts
System health monitors
Server management
Multiple environment administration

MicroStrategy 10 education and certification

MicroStrategy 10 offers new training courses that can be conducted offline in a training center, or online at http://www.microstrategy.com/us/services/education. We believe that certification is a good thing on your journey. The following certifications now exist for version 10:

MicroStrategy 10 Certified Associate Analyst
MicroStrategy 10 Certified Application Designer
MicroStrategy 10 Certified Application Developer
MicroStrategy 10 Certified Administrator

After passing all of these exams, you will become a MicroStrategy 10 Application Engineer. More details can be found here: http://www.microstrategy.com/Strategy/media/downloads/training-events/MicroStrategy-certification-matrix_v10.pdf.

History of MicroStrategy

Let us briefly look at the history of MicroStrategy, which began in 1991:

1991: Released first BI product, which allowed users to create graphical views and analyses of information data
2000: Released MicroStrategy 7 with a web interface
2003: First to release a fully integrated reporting tool, combining list reports, BI-style dashboards, and interface analyses in a single module
2005: Released MicroStrategy 8, including one-click actions and drag-and-drop dashboard creation
2009: Released MicroStrategy 9, delivering a seamless consolidated path from department to enterprise BI
2010: Unveiled new mobile BI capabilities for iPad and iPhone, and was featured on the iTunes Bestseller List
2011: Released MicroStrategy Cloud, the first SaaS offering from a major BI vendor
2012: Released Visual Data Discovery and groundbreaking new security platform, Usher
2013: Released expanded Analytics Platform and free Analytics Desktop client
2014: Announced availability of MicroStrategy Analytics via Amazon Web Services (AWS)
2015: MicroStrategy 10 was released, the first ever enterprise analytics solution for centralized and decentralized BI

Deploying MicroStrategy 10

We know only one way to master MicroStrategy: through practical exercises. Let's start by downloading and deploying MicroStrategy 10.2.
Overview of the training architecture

In order to master MicroStrategy and learn about some BI considerations, we need to download the all-important software, deploy it, and connect to a network. During the preparation of the training environment, we will cover the installation of MicroStrategy on a Linux operating system. This is very good practice, because many people work with Windows and are not familiar with Linux, so this chapter will provide additional knowledge of working with Linux, as well as of installing MicroStrategy and a web server. Look at the training architecture: there are three main components:

Red Hat Linux 6.4: Used for deploying the web server and Intelligence Server.
Windows machine: Runs the MicroStrategy Client and an Oracle database.
Virtual machine with Hadoop: A ready-made virtual machine with Hadoop, which will connect to MicroStrategy using the brand new connection.

In the real world, we should use separate machines for every component, and sometimes several machines in order to run one component. This is called clustering. Let's create a virtual machine.

Creating a Red Hat Linux virtual machine

Let's create a virtual machine with Red Hat Linux, which will host our Intelligence Server:

Go to http://www.redhat.com/ and create an account
Go to the software download center: https://access.redhat.com/downloads
Download RHEL: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.2/x86_64/product-software
Choose Red Hat Enterprise Linux Server
Download Red Hat Enterprise Linux 6.4 x86_64
Choose Binary DVD

Now we can create a virtual machine with RHEL 6.4. We have several options when choosing the software for deploying the virtual machine. In our case, we will use VMware Workstation. Before starting to deploy a new VM, we should adjust the default settings, such as increasing RAM and HDD, and adding one more network card in order to connect the external environment with the MicroStrategy Client and the sample database. In addition, we should create a new network. When the deployment of the RHEL virtual machine is complete, we should activate a subscription in order to install the required packages. Let us do this with one command in the terminal:

# subscription-manager register --username <username> --password <password> --auto-attach

Performing prerequisites for MicroStrategy 10

According to the installation and configuration guide, we should deploy all the necessary packages. In order to install them, we should execute the following under root:

# su
# yum install compat-libstdc++-33.i686
# yum install libXp.x86_64
# yum install elfutils-devel.x86_64
# yum install libstdc++-4.4.7-3.el6.i686
# yum install krb5-libs.i686
# yum install nss-pam-ldapd.i686
# yum install ksh.x86_64

The project design process

Project design is not just about creating a project in MicroStrategy Architect; it involves several steps and thorough analysis, such as how data is stored in the data warehouse, what reports the user wants based on the data, and so on. The following are the steps involved in our project design process:

Logical data model design

Once the business requirements are documented, the user must create a fact qualifier matrix to identify the attributes, facts, and hierarchies, which are the building blocks of any logical data model. An example of a fact qualifier is as follows:

A logical data model is created based on the source systems and designed before defining a data warehouse.
So, it's good for seeing which objects the users want and checking whether the objects are in the source systems. It represents the definition, characteristics, and relationships of the data. This graphical representation of information is easily understandable by business users too. A logical data model graphically represents the following concepts: Attributes: Provides a detailed description of the data Facts: Provide numerical information about the data Hierarchies: Provide relationships between data Data warehouse schema design Physical data warehouse design is based on the logical data model and represents the storage and retrieval of data from the data warehouse. Here, we determine the optimal schema design, which ensures reporting performance and maintenance. The key components of a physical data warehouse schema are columns and tables: Columns: These store attribute and fact data. The following are the three types of columns: ID column: Stores the ID for an attribute Description column: Stores text description of the attribute Fact column: Stores fact data Tables: Physical grouping of related data. Following are the types of tables: Lookup tables: Store information about attributes such as IDs and descriptions Relationship tables: Store information about relationship between two or more attributes Fact tables: Store factual data and the level of aggregation, which is defined based on the attributes of the fact table. They contain base fact columns or derived fact columns: Base fact: Stores the data at the lowest possible level of detail. Aggregate fact: Stores data at a higher or summarized level of detail. Mobile server installation and configuration While mobile client is easy to install, mobile server is not. Here we provide a step-by-step guide on how to install mobile server: Download MicroStrategyMobile.war. Mobile server is packed in a WAR file, just like Operation Manager or Web: Copy MicroStrategyMobile.war from <Microstrategy Installation folder>/Mobile/MobileServer to /usr/local/tomcat7/webapps. Then restart Tomcat, by issuing the ./shutdown.sh and ./startup.sh commands: Connect to the mobile server. Go to http://192.168.81.134:8080/MicroStrategyMobile/servlet/mstrWebAdmin. Then add the server name localhost.localdomain and click connect: Configure mobile server. You can configure (1) Authentication settings for the mobile server application; (2) Privileges and permissions; (3) SSL encryption; (4) Client authentication with a certificate server; (5) Destination folder for the photo uploader widget and signature capture input control. Performing Pareto analysis One good thing about data discovery tools is their agile approach to the data. We can connect any data source and easily slice and dice data. Let's try to use the Pareto principle in order to answer the question: How are sales distributed among the different products? The Pareto principle states that, for many events, roughly 80% of results come from 20% of the causes. For example, 80% of profits come from 20% of the products offered. This type of analysis is very popular in product analytics. In MicroStrategy Desktop, we can use shortcut metrics in order to quickly make complex calculations such as running sums or a percent of the total. Let's build a visualization in order to see the 20% of products that bring us 80% of the money: Choose Combo Chart. Drag and drop Salesamount to the vertical and Englishproductname to the horizontal. Add Orderdate to the filters and restrict to 60 days. 
Right-click on Sales amountand choose Descending Sort. Right-click on Salesamount | ShortcutMetrics | Percent Running Total. Drag and drop Metric Names to the Color By. Change the color of Salesamount and Percent Running Total. Change the shape of Percent Running Total. As a result, we get this chart: From this chart we can quickly understand our top 20% of products which bring us 80% of revenue. Splunk and MicroStrategy MicroStrategy 10 has announced a new connection to Splunk. I suppose that Splunk is not very popular in the world of Business Intelligence. Most people who have heard about Splunk think that it is just a platform for processing logs. The answers is both true and false. Splunk was derived from the world of spelunking, because searching for root causes in logs is a kind of spelunking without light, and Splunk solves this problem by indexing machine data from a tremendous number of data sources, starting from applications, hardware, sensors, and so on. What is Splunk Splunk's goal is making machine data accessible, usable, and valuable for everyone, and turning machine data into business value. It can: Collect data from anywhere Search and analyze everything Gain real-time Operational Intelligence In the BI world, everyone knows what a data warehouse is. Creating reports from Splunk Now we are ready to build reports using MicroStrategy Desktop and Splunk. Let's do it: Go to MicroStrategy Desktop, click add data, and choose Splunk Create a connection using the existing DNS based on Splunk ODBC: Choose one of tables (Splunk reports): Add other tables as new data sources. Now we can build a dashboard using data from Splunk by dragging and dropping attributes and metrics: Summary In this article we looked at MicroStrategy 10 and its features. We learned about its history and deployment. We also learnt about the project design process, the Pareto analysis and about the connection of Splunk and MicroStrategy. Resources for Article: Further resources on this subject: Stacked Denoising Autoencoders [article] Creating external tables in your Oracle 10g/11g Database [article] Clustering Methods [article]

Mining Twitter with Python – Influence and Engagement

Packt
11 Jul 2016
10 min read
In this article by Marco Bonzanini, author of the book Mastering Social Media Mining with Python, we will discussmining Twitter data. Here, we will analyze users, their connections, and their interactions. In this article, we will discuss how to measure influence and engagement on Twitter. (For more resources related to this topic, see here.) Measuring influence and engagement One of the most commonly mentioned characters in the social media arena is the mythical influencer. This figure is responsible for a paradigm shift in the recent marketing strategies (https://en.wikipedia.org/wiki/Influencer_marketing), which focus on targeting key individuals rather than the market as a whole. Influencers are typically active users within their community.In case of Twitter, an influencer tweets a lot about topics they care about. Influencers are well connected as they follow and are followed by many other users who are also involved in the community. In general, an influencer is also regarded as an expert in their area, and is typically trusted by other users. This description should explain why influencers are an important part of recent trends in marketing: an influencer can increase awareness or even become an advocate of a specific product or brand and can reach a vast number of supporters. Whether your main interest is Python programming or wine tasting, regardless how huge (or tiny) your social network is, you probably already have an idea who the influencers in your social circles are: a friend, acquaintance, or random stranger on the Internet whose opinion you trust and value because of their expertise on the given subject. A different, but somehow related, concept is engagement. User engagement, or customer engagement, is the assessment of the response to a particular offer, such as a product or service. In the context of social media, pieces of content are often created with the purpose to drive traffic towards the company website or e-commerce. Measuring engagement is important as it helps in defining and understanding strategies to maximize the interactions with your network, and ultimately bring business. On Twitter, users engage by the means of retweeting or liking a particular tweet, which in return, provides more visibility to the tweet itself. In this section, we'll discuss some interesting aspects of social media analysis regarding the possibility of measuring influence and engagement. On Twitter, a natural thought would be to associate influence with the number of users in a particular network. Intuitively, a high number of followers means that a user can reach more people, but it doesn't tell us how a tweet is perceived. 
The following script compares some statistics for two user profiles:

import sys
import json

def usage():
    print("Usage:")
    print("python {} <username1> <username2>".format(sys.argv[0]))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        usage()
        sys.exit(1)
    screen_name1 = sys.argv[1]
    screen_name2 = sys.argv[2]

After reading the two screen names from the command line, we will build up a list of followers for each of them, including their number of followers, to calculate the number of reachable users:

followers_file1 = 'users/{}/followers.jsonl'.format(screen_name1)
followers_file2 = 'users/{}/followers.jsonl'.format(screen_name2)
with open(followers_file1) as f1, open(followers_file2) as f2:
    reach1 = []
    reach2 = []
    for line in f1:
        profile = json.loads(line)
        reach1.append((profile['screen_name'], profile['followers_count']))
    for line in f2:
        profile = json.loads(line)
        reach2.append((profile['screen_name'], profile['followers_count']))

We will then load some basic statistics (followers and statuses count) from the two user profiles:

profile_file1 = 'users/{}/user_profile.json'.format(screen_name1)
profile_file2 = 'users/{}/user_profile.json'.format(screen_name2)
with open(profile_file1) as f1, open(profile_file2) as f2:
    profile1 = json.load(f1)
    profile2 = json.load(f2)
    followers1 = profile1['followers_count']
    followers2 = profile2['followers_count']
    tweets1 = profile1['statuses_count']
    tweets2 = profile2['statuses_count']

sum_reach1 = sum([x[1] for x in reach1])
sum_reach2 = sum([x[1] for x in reach2])
avg_followers1 = round(sum_reach1 / followers1, 2)
avg_followers2 = round(sum_reach2 / followers2, 2)

We will also load the timelines for the two users, in particular to observe the number of times their tweets have been favorited or retweeted:

timeline_file1 = 'user_timeline_{}.jsonl'.format(screen_name1)
timeline_file2 = 'user_timeline_{}.jsonl'.format(screen_name2)
with open(timeline_file1) as f1, open(timeline_file2) as f2:
    favorite_count1, retweet_count1 = [], []
    favorite_count2, retweet_count2 = [], []
    for line in f1:
        tweet = json.loads(line)
        favorite_count1.append(tweet['favorite_count'])
        retweet_count1.append(tweet['retweet_count'])
    for line in f2:
        tweet = json.loads(line)
        favorite_count2.append(tweet['favorite_count'])
        retweet_count2.append(tweet['retweet_count'])

The preceding numbers are then aggregated into the average number of favorites and the average number of retweets, both in absolute terms and per number of followers:

avg_favorite1 = round(sum(favorite_count1) / tweets1, 2)
avg_favorite2 = round(sum(favorite_count2) / tweets2, 2)
avg_retweet1 = round(sum(retweet_count1) / tweets1, 2)
avg_retweet2 = round(sum(retweet_count2) / tweets2, 2)
favorite_per_user1 = round(sum(favorite_count1) / followers1, 2)
favorite_per_user2 = round(sum(favorite_count2) / followers2, 2)
retweet_per_user1 = round(sum(retweet_count1) / followers1, 2)
retweet_per_user2 = round(sum(retweet_count2) / followers2, 2)
print("----- Stats {} -----".format(screen_name1))
print("{} followers".format(followers1))
print("{} users reached by 1-degree connections".format(sum_reach1))
print("Average number of followers for {}'s followers: {}".format(screen_name1, avg_followers1))
print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count1), avg_favorite1, favorite_per_user1))
print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count1), avg_retweet1, retweet_per_user1))
print("----- Stats {} -----".format(screen_name2))
print("{} followers".format(followers2))
print("{} users reached by 1-degree connections".format(sum_reach2))
print("Average number of followers for {}'s followers: {}".format(screen_name2, avg_followers2))
print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count2), avg_favorite2, favorite_per_user2))
print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count2), avg_retweet2, retweet_per_user2))

This script takes two arguments from the command line and assumes that the data has already been downloaded. In particular, for both users, we need the data about the followers and the respective user timelines. The script is somewhat verbose, because it computes the same operations for two profiles and prints everything on the terminal. We can break it down into different parts.

Firstly, we will look into the followers' followers. This will provide some information related to the part of the network immediately connected to the given user. In other words, it should answer the question: how many users can I reach if all my followers retweet me? We can achieve this by reading the users/<user>/followers.jsonl file and keeping a list of tuples, where each tuple represents one of the followers and is in the (screen_name, followers_count) form. Keeping the screen name at this stage is useful in case we want to observe who the users with the highest number of followers are (not computed in the script, but easy to produce using sorted()).

In the second step, we will read the user profile from the users/<user>/user_profile.json file so that we can get information about the total number of followers and the total number of tweets. With the data collected so far, we can compute the total number of users who are reachable within a degree of separation (a follower of a follower) and the average number of followers of a follower. This is achieved via the following lines:

sum_reach1 = sum([x[1] for x in reach1])
avg_followers1 = round(sum_reach1 / followers1, 2)

The first one uses a list comprehension to iterate through the list of tuples mentioned previously, while the second one is a simple arithmetic average, rounded to two decimal points.

The third part of the script reads the user timeline from the user_timeline_<user>.jsonl file and collects information about the number of retweets and favorites for each tweet. Putting everything together allows us to calculate how many times a user has been retweeted or favorited, and what the average number of retweets/favorites per tweet and per follower is.
To provide an example, I'll perform some vanity analysis and compare my account, @marcobonzanini, with Packt Publishing:

$ python twitter_influence.py marcobonzanini PacktPub

The script produces the following output:

----- Stats marcobonzanini -----
282 followers
1411136 users reached by 1-degree connections
Average number of followers for marcobonzanini's followers: 5004.03
Favorited 268 times (1.47 per tweet, 0.95 per user)
Retweeted 912 times (5.01 per tweet, 3.23 per user)
----- Stats PacktPub -----
10209 followers
29961760 users reached by 1-degree connections
Average number of followers for PacktPub's followers: 2934.84
Favorited 3554 times (0.33 per tweet, 0.35 per user)
Retweeted 6434 times (0.6 per tweet, 0.63 per user)

As you can see, the raw number of followers shows no contest, with Packt Publishing having approximately 35 times more followers than me. The interesting part of this analysis comes up when we compare the average number of retweets and favorites: apparently, my followers are much more engaged with my content than PacktPub's. Is this enough to declare that I'm an influencer while PacktPub is not? Clearly not. What we observe here is a natural consequence of the fact that my tweets are probably more focused on specific topics (Python and data science), hence my followers are already more interested in what I'm publishing. On the other hand, the content produced by Packt Publishing is highly diverse, as it ranges across many different technologies. This diversity is also reflected in PacktPub's followers, who include developers, designers, scientists, system administrators, and so on. For this reason, each of PacktPub's tweets is found interesting (that is, worth retweeting) by a smaller proportion of their followers.

Summary

In this article, we discussed mining data from Twitter by focusing on the analysis of user connections and interactions. In particular, we discussed how to compare influence and engagement between users. For more information on social media mining, refer to the following books by Packt Publishing:

Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/social-media-mining-r
Mastering Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-r

Further resources on this subject:

Probabilistic Graphical Models in R [article]
Machine Learning Tasks [article]
Support Vector Machines as a Classification Engine [article]


Stacked Denoising Autoencoders

Packt
11 Jul 2016
13 min read
In this article by John Hearty, author of the book Advanced Machine Learning with Python, we discuss autoencoders as valuable tools in themselves; significant accuracy can be obtained by stacking autoencoders to form a deep network. This is achieved by feeding the representation created by the encoder on one layer into the next layer's encoder as input to that layer. (For more resources related to this topic, see here.)

Stacked Denoising Autoencoders (SdA) are currently in use in many leading data science teams for sophisticated natural language analyses as well as a broad range of signal, image, and text analyses.

The implementation of an SdA will be very familiar after the previous chapter's discussion of deep belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks were used. Each layer of the deep architecture will have a dA and sigmoid component, with the autoencoder component being used to pretrain the sigmoid network. The performance measure used by an SdA is the training set error, with an intensive period of layer-to-layer (layer-wise) pretraining used to gradually align network parameters before a final period of fine-tuning. During fine-tuning, the network is trained using validation and test data, over fewer epochs but with larger update steps. The goal is to have the network converge at the end of the fine-tuning in order to deliver an accurate result.

In addition to delivering on the typical advantages of deep networks (the ability to learn feature representations for complex or high-dimensional datasets and to train a model without extensive feature engineering), stacked autoencoders have an additional, very interesting property.

Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data. Successive layers of an SdA may learn increasingly high-level features. While the first layer might learn some first-order features from the input data (such as learning edges in a photo image), a second layer may learn some grouping of first-order features (for instance, by learning given configurations of edges that correspond to contours or structural elements in the input image).

There's no golden rule to determine how many layers to use or how large those layers should be for a given problem. The best solution is usually to experiment with these model parameters until you find an optimal point. This experimentation is best done with a hyperparameter optimization technique or a genetic algorithm (subjects we'll discuss in later chapters of this book).

Higher layers may learn increasingly high-order configurations, enabling an SdA to learn to recognise facial features, alphanumerical characters, or the generalised forms of objects (such as a bird). This is what gives SdAs their unique capability to learn very sophisticated, high-level abstractions of their input data.

Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack autoencoders can improve the effectiveness of the deep architecture (with the main constraint becoming the computing cost in time). In this chapter, we'll look at stacking three autoencoders to solve a natural language processing challenge.

Applying SdA

Now that we've had a chance to understand the advantages and power of the SdA as a deep learning architecture, let's test our skills on a real-world dataset.
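To make the stacking idea concrete before we dive into the Theano implementation, here is a minimal NumPy sketch (not taken from the book's code) showing how the encoding produced by one layer becomes the input to the next; W1, b1, W2, and b2 stand in for parameters that would be learned during layer-wise pretraining:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_encode(x, W1, b1, W2, b2):
    # First encoder maps the raw input to a hidden representation.
    h1 = sigmoid(np.dot(x, W1) + b1)
    # Second encoder consumes the first layer's output, not the raw input.
    h2 = sigmoid(np.dot(h1, W2) + b2)
    return h2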
For this chapter, let's step away from image datasets and work with the OpinRank Review Dataset, a text dataset of around 259,000 hotel reviews from TripAdvisor, which is accessible via the UCI Machine Learning dataset Repository. This freely-available dataset provides review scores (as floating point numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our SdA to attempt to identify the scoring of each hotel from its review text.

We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we prepare text data in an upcoming chapter. The source data is available at https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset.

In order to get started, we're going to need an SdA (hereafter SdA) class!

class SdA(object):
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        n_ins=280,
        hidden_layers_sizes=[500, 500],
        n_outs=5,
        corruption_levels=[0.1, 0.1]
    ):

As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder as the input to the subsequent layer. This class supports the configuration of the layer count (reflected in, but not set by, the length of the hidden_layers_sizes and corruption_levels vectors). It also supports differentiated layer sizes (in nodes) at each layer, which can be set using hidden_layers_sizes. As we discussed, the ability to configure successive layers of the autoencoder is critical to developing successful representations.

Next, we need parameters to store the MLP (self.sigmoid_layers) and dA (self.dA_layers) elements of the SdA. In order to specify the depth of our architecture, we use the self.n_layers parameter to specify the number of sigmoid and dA layers required:

        self.sigmoid_layers = []
        self.dA_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

Next, we need to construct our sigmoid and dA layers. We begin by setting the hidden layer size either from the input vector size or from the activation of the preceding layer. Following this, the sigmoid_layers and dA_layers components are created, with the dA layer drawing from the dA class we discussed earlier in this article:

        for i in xrange(self.n_layers):
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)
            self.sigmoid_layers.append(sigmoid_layer)
            self.params.extend(sigmoid_layer.params)

            dA_layer = dA(numpy_rng=numpy_rng,
                          theano_rng=theano_rng,
                          input=layer_input,
                          n_visible=input_size,
                          n_hidden=hidden_layers_sizes[i],
                          W=sigmoid_layer.W,
                          bhid=sigmoid_layer.b)
            self.dA_layers.append(dA_layer)

Having implemented the layers of our SdA, we'll need a final logistic regression layer to complete the MLP component of the network:

        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs
        )

        self.params.extend(self.logLayer.params)
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
        self.errors = self.logLayer.errors(self.y)

This completes the architecture of our SdA. Next up, we need to generate the training functions used by the SdA class.
Each function will have the minibatch index (index) as an argument, together with several other elements; corruption_level and learning_rate are enabled here so that we can adjust them (for example, gradually increase or decrease them) during training. Additionally, we identify variables that help identify where the batch starts and ends: batch_begin and batch_end, respectively. defpretraining_functions(self, train_set_x, batch_size): index = T.lscalar('index')  corruption_level = T.scalar('corruption')  learning_rate = T.scalar('lr')  batch_begin = index * batch_size batch_end = batch_begin + batch_size   pretrain_fns = [] fordAinself.dA_layers: cost, updates = dA.get_cost_updates(corruption_level, learning_rate) fn = theano.function( inputs=[ index, theano.Param(corruption_level, default=0.2), theano.Param(learning_rate, default=0.1)                 ], outputs=cost, updates=updates, givens={ self.x: train_set_x[batch_begin: batch_end]                 }             ) pretrain_fns.append(fn)   returnpretrain_fns The ability to dynamically adjust the learning rate particularly is very helpful and may be applied in one of two ways. Once a technique has begun to converge on an appropriate solution, it is very helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in which the network oscillates between values located around the optimum, without ever hitting it. In some contexts, it can be helpful to tie the learning rate to the network's performance measure. If the error rate is high, it can make sense to make larger adjustments until the error rate begins to decrease! The pretraining function we've created takes the minibatch index and can optionally take the corruption level or learning rate. It performs one step of pretraining and outputs the cost value and vector of weight updates. In addition to pretraining, we need to build functions to support the fine-tuning stage, where the network is run iteratively over the validation and test data to optimize network parameters. The train_fn implements a single step of fine-tuning. The valid_score is a Python function that computes a validation score using the error measure produced by the SdA over validation data. Similarly, test_score computes the error score over test data. To get this process off the ground, we first need to set up training, validation, and test datasets. Each stage requires two datasets (set x and set y), containing the features and class labels, respectively. The required number of minibatches for validation and test is determined, and an index is created to track batch size (and provide a means of identifying at which entries a batch starts and ends). 
Training, validation, and testing occurs for each batch and afterward, both valid_score and test_score are calculated across all batches: defbuild_finetune_functions(self, datasets, batch_size, learning_rate):           (train_set_x, train_set_y) = datasets[0]         (valid_set_x, valid_set_y) = datasets[1]         (test_set_x, test_set_y) = datasets[2]   n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] n_valid_batches /= batch_size n_test_batches = test_set_x.get_value(borrow=True).shape[0] n_test_batches /= batch_size   index = T.lscalar('index')      gparams = T.grad(self.finetune_cost, self.params)     updates = [             (param, param - gparam * learning_rate) forparam, gparamin zip(self.params, gparams)         ]   train_fn = theano.function( inputs=[index], outputs=self.finetune_cost, updates=updates, givens={ self.x: train_set_x[ index * batch_size: (index + 1) * batch_size                 ], self.y: train_set_y[ index * batch_size: (index + 1) * batch_size                 ]             }, name='train'         )   test_score_i = theano.function(             [index], self.errors, givens={ self.x: test_set_x[ index * batch_size: (index + 1) * batch_size                 ], self.y: test_set_y[ index * batch_size: (index + 1) * batch_size                 ]             }, name='test'         )   valid_score_i = theano.function(             [index], self.errors, givens={ self.x: valid_set_x[ index * batch_size: (index + 1) * batch_size                 ], self.y: valid_set_y[ index * batch_size: (index + 1) * batch_size                 ]             }, name='valid'         )     defvalid_score(): return [valid_score_i(i) for i inxrange(n_valid_batches)]   deftest_score(): return [test_score_i(i) for i inxrange(n_test_batches)]   returntrain_fn, valid_score, test_score With the training functionality in place, the following code initiates our SdA: numpy_rng = numpy.random.RandomState(89677) print '... building the model' sda = SdA( numpy_rng=numpy_rng, n_ins=280, hidden_layers_sizes=[240, 170, 100], n_outs=5     ) It should be noted that, at this point, we should be trying an initial configuration of layer sizes to see how we do. In this case, the layer sizes used here are the product of some initial testing. As we discussed, training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all of the SdA's layers. The second is a process of fine-tuning over validation and test data. To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the layers using our previously-defined pretraining_fns: print '... getting the pretraining functions' pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x, batch_size=batch_size)   print '... pre-training the model' start_time = time.clock() corruption_levels = [.1, .2, .2] for i inxrange(sda.n_layers):   for epoch inxrange(pretraining_epochs):             c = [] forbatch_indexinxrange(n_train_batches): c.append(pretraining_fns[i](index=batch_index, corruption=corruption_levels[i], lr=pretrain_lr)) print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch), printnumpy.mean(c)   end_time = time.clock()   print>>sys.stderr, ('The pretraining code for file ' + os.path.split(__file__)[1] + ' ran for %.2fm' % ((end_time - start_time) / 60.)) At this point, we're able to initialize our SdA class via calling the preceding code stored within this book's GitHub repository, MasteringMLWithPython/Chapter3/SdA.py. 
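The chapter's code shows the layer-wise pretraining loop but not the fine-tuning loop itself; the following is a schematic sketch of how the functions returned by build_finetune_functions might be driven. The names finetune_lr, finetune_epochs, and datasets, along with the simple early-stopping logic, are illustrative assumptions rather than code taken from the book's repository.

```python
import numpy

# Hypothetical driver for the fine-tuning stage (illustrative only).
# 'sda', 'datasets', 'batch_size', and 'n_train_batches' are assumed to be
# the objects defined earlier in this article; 'finetune_lr' and
# 'finetune_epochs' are placeholder hyperparameters.
train_fn, valid_score, test_score = sda.build_finetune_functions(
    datasets=datasets, batch_size=batch_size, learning_rate=finetune_lr)

best_validation_error = numpy.inf
for epoch in xrange(finetune_epochs):
    for minibatch_index in xrange(n_train_batches):
        train_fn(minibatch_index)                  # one fine-tuning step
    validation_error = numpy.mean(valid_score())   # mean error over all validation batches
    if validation_error < best_validation_error:   # track the best model seen so far
        best_validation_error = validation_error
        test_error = numpy.mean(test_score())
print 'Best validation error %f, test error %f' % (best_validation_error, test_error)
```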
Assessing SdA performance

The SdA will take a significant length of time to run. With 15 epochs per layer and each layer typically taking an average of 11 minutes, the network will run for around 500 minutes on a modern desktop system with GPU acceleration and a single-threaded GotoBLAS. On a system without GPU acceleration, the network will take substantially longer to train, and it is recommended that you use the alternative, which runs over a significantly smaller input dataset: MasteringMLWithPython/Chapter3/SdA_no_blas.py. The results are of high quality, with a validation error score of 3.22% and a test error score of 3.14%. These results are particularly impressive given the ambiguous and sometimes challenging nature of natural language processing applications. It was noticeable that the network classified the 1-star and 5-star rating cases more accurately than the intermediate levels, largely because of the ambiguous nature of unpolarized or unemotional language. Part of the reason that this input data could be classified so well is the significant feature engineering applied to it. While time-consuming and sometimes problematic, we've seen that well-executed feature engineering combined with an optimized model can deliver an excellent level of accuracy.

Summary

In this article, we introduced the autoencoder, an effective dimensionality reduction technique with some unique applications. We focused on the theory behind the SdA, an extension of autoencoders whereby any number of autoencoders is stacked in a deep architecture.

Resources for Article: Further resources on this subject: Exception Handling in MySQL for Python [article] Clustering Methods [article] Machine Learning Using Spark MLlib [article]

Implementing Artificial Neural Networks with TensorFlow

Packt
08 Jul 2016
12 min read
In this article by Giancarlo Zaccone, the author of Getting Started with TensorFlow, we will learn about artificial neural networks (ANNs), an information processing system whose operating mechanism is inspired by biological neural circuits. Thanks to their characteristics, neural networks are the protagonists of a real revolution in machine learning systems and, more generally, in the context of Artificial Intelligence. An artificial neural network possesses many simple processing units variously connected to each other, according to various architectures. If we look at the schema of an ANN (see the schematic diagram below), it can be seen that the hidden units communicate with the external layer, both in input and output, while the input and output units communicate only with the hidden layer of the network.

Each unit or node simulates the role of the neuron in biological neural networks. A node, the so-called artificial neuron, performs a very simple operation: it becomes active if the total quantity of signal it receives exceeds its activation threshold, defined by the so-called activation function. If a node becomes active, it emits a signal that is transmitted along the transmission channels up to the other units to which it is connected. A connection point acts as a filter that converts the message into an inhibitory or excitatory signal, increasing or decreasing its intensity according to the connection's individual characteristics. The connection points simulate biological synapses and have the fundamental function of weighing the intensity of the transmitted signals, by multiplying them by weights whose values depend on the connection itself.

ANN schematic diagram

Neural network architectures

The way the nodes are connected and the total number of layers, that is, the levels of nodes between input and output, define the architecture of a neural network. For example, in a multilayer network, one can identify the artificial neurons of the layers such that:
Each neuron is connected with all those of the next layer
There are no connections between neurons belonging to the same layer
The number of layers and of neurons per layer depends on the problem to be solved

Now we start our exploration of neural network models, introducing the simplest neural network model: the Single Layer Perceptron, or the so-called Rosenblatt's Perceptron.

Single Layer Perceptron

The Single Layer Perceptron was the first neural network model, proposed in 1958 by Frank Rosenblatt. In this model, the content of the local memory of the neuron consists of a vector of weights, W = (w1, w2, ..., wn). The computation is performed over the calculation of a sum of the input vector X = (x1, x2, ..., xn), each element of which is multiplied by the corresponding element of the vector of weights; the value provided in the output (that is, a weighted sum) is then the input of an activation function. This function returns 1 if the result is greater than a certain threshold, otherwise it returns -1. In the following formula, the activation function is the so-called sign function:

    sign(x) = +1 if x > 0
    sign(x) = -1 otherwise

It is possible to use other activation functions, preferably non-linear ones (such as the sigmoid function that we will see in the next section). The learning procedure of the net is iterative: it slightly modifies the synaptic weights for each learning cycle (called an epoch) by using a selected set called the training set. At each cycle, the weights must be modified so as to minimize a cost function, which is specific to the problem under consideration.
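To make the weighted sum, the sign activation, and the weight adjustment more concrete, here is a minimal NumPy sketch of a single perceptron step. It is an illustration only, not part of the TensorFlow implementation that follows; the input values, array sizes, and learning rate are arbitrary choices.

```python
import numpy as np

# Illustrative sketch of a single-layer perceptron step (not the TensorFlow model below).
W = np.zeros(3)            # weight vector W = (w1, w2, w3)
b = 0.0                    # bias, acting as the activation threshold
learning_rate = 0.1        # arbitrary illustrative value

def sign(z):
    # The sign activation: +1 if the weighted sum is positive, -1 otherwise
    return 1 if z > 0 else -1

def predict(x):
    # Weighted sum of the inputs followed by the sign activation
    return sign(np.dot(W, x) + b)

# One learning step on a single training example (x, target):
# if the prediction is wrong, nudge the weights towards the correct answer.
x = np.array([1.0, -0.5, 0.2])   # example input vector X = (x1, x2, x3)
target = 1                       # desired output
error = target - predict(x)
W = W + learning_rate * error * x
b = b + learning_rate * error
```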
Finally, when the perceptron has been trained on the training set, it can be tested on other inputs (the test set) in order to verify its capacity for generalization.

Schema of a Rosenblatt's Perceptron

Let's now see how to implement a single layer neural network for an image classification problem using TensorFlow.

The logistic regression

This algorithm has nothing to do with the canonical linear regression; rather, it is an algorithm that allows us to solve supervised classification problems. In fact, to estimate the dependent variable, we now make use of the so-called logistic function, or sigmoid. It is precisely because of this feature that we call this algorithm logistic regression. The sigmoid function has an S-shaped pattern: the dependent variable takes values strictly between 0 and 1, which is precisely what serves us. In the case of logistic regression, we want our function to tell us the probability of an element belonging to a particular class. We recall again that supervised learning by the neural network is configured as an iterative process of optimization of the weights; these are then modified on the basis of the network's performance on the training set. Indeed, the aim is to minimize the loss function, which indicates the degree to which the behavior of the network deviates from the desired one. The performance of the network is then verified on a test set, consisting of images other than those used in training.

The basic steps of training that we're going to implement are as follows:
The weights are initialized with random values at the beginning of the training.
For each element of the training set, the error is calculated, that is, the difference between the desired output and the actual output. This error is used to adjust the weights.
The process is repeated, resubmitting to the network, in a random order, all the examples of the training set, until the error made on the entire training set falls below a certain threshold or until the maximum number of iterations is reached.

Let's now see in detail how to implement logistic regression with TensorFlow. The problem we want to solve is, again, to classify images from the MNIST dataset.

The TensorFlow implementation

First of all, we have to import all the necessary libraries:

import input_data
import tensorflow as tf
import matplotlib.pyplot as plt

We use the input_data.read_data_sets function to upload the images for our problem:

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Then we set the total number of epochs for the training phase:

training_epochs = 25

Also, we must define the other parameters necessary for building the model:

learning_rate = 0.01
batch_size = 100
display_step = 1

Now we move to the construction of the model.

Building the model

Define x as the input tensor; it represents an MNIST data image of shape 28 x 28 = 784 pixels:

x = tf.placeholder("float", [None, 784])

We recall that our problem consists in assigning a probability value to each of the possible classes of membership (the numbers from 0 to 9). At the end of this calculation, we will use a probability distribution, which tells us how confident we are in our prediction. So the output we're going to get will be an output tensor with 10 probabilities, each one corresponding to a digit (of course, the sum of the probabilities must be one):

y = tf.placeholder("float", [None, 10])

To assign probabilities to each image, we will use the so-called softmax activation function.
The softmax function is specified in two main steps:
Calculate the evidence that a certain image belongs to a particular class.
Convert the evidence into probabilities of belonging to each of the 10 possible classes.

To evaluate the evidence, we first define the weights input tensor as W:

W = tf.Variable(tf.zeros([784, 10]))

For a given image, we could evaluate the evidence for each class by simply multiplying the tensor W with the input tensor x. Using TensorFlow, we should have something like this:

evidence = tf.matmul(x, W)

In general, models include an extra parameter representing the bias, which indicates a certain degree of uncertainty; in our case, the final formula for the evidence is:

evidence = tf.matmul(x, W) + b

This means that for every class i (from 0 to 9) there is a weight vector Wi with 784 elements (28 x 28), where each element j is multiplied by the corresponding component j of the input image (784 components); the products are summed and the corresponding bias element bi is added. So, to define the evidence, we must also define the following tensor of biases:

b = tf.Variable(tf.zeros([10]))

The second step is finally to use the softmax function to obtain the output vector of probabilities, namely activation:

activation = tf.nn.softmax(tf.matmul(x, W) + b)

TensorFlow's tf.nn.softmax function provides a probability-based output from the input evidence tensor. Once we have implemented the model, we can proceed to specify the necessary code to find the network's weights W and biases b through the iterative training algorithm. In each iteration, the training algorithm takes the training data, applies the neural network, and compares the result with the expected output. In order to train our model and to know when we have a good one, we must know how to define the accuracy of our model. Our goal is to find values of the parameters W and b that minimize the value of the metric that indicates how bad the model is. Different metrics calculate the degree of error between the desired output and the output obtained on the training data. A common measure of error is the mean squared error, or the squared Euclidean distance. However, some research findings suggest using other metrics for a neural network like this one. In this example, we use the so-called cross-entropy error function, which is defined as follows:

cross_entropy = y*tf.log(activation)

In order to minimize the cross_entropy, we can use the following combination of tf.reduce_mean and tf.reduce_sum to build the cost function:

cost = tf.reduce_mean(-tf.reduce_sum(cross_entropy, reduction_indices=1))

Then we must minimize it using the gradient descent optimization algorithm:

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Only a few lines of code to build a neural network model!

Launching the session

It's the moment to build the session and launch our neural net model. We define these lists to visualize the training session:

avg_set = []
epoch_set = []

Then we initialize the TensorFlow variables:

init = tf.initialize_all_variables()

Start the session:

with tf.Session() as sess:
    sess.run(init)

As explained, each epoch is a training cycle:

    for epoch in range(training_epochs):
        avg_cost = 0.
total_batch = int(mnist.train.num_examples/batch_size) Then we loop over all batches:         for i in range(total_batch): batch_xs, batch_ys = mnist.train.next_batch(batch_size) Fit training using batch data: sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys}) Compute the average loss running the train_step function with the given image values (x) and the real output (y_): avg_cost += sess.run                         (cost, feed_dict={x: batch_xs,                                  y: batch_ys})/total_batch During the computation, we display a log per epoch step:         if epoch % display_step == 0:             print "Epoch:", '%04d' % (epoch+1), "cost=","{:.9f}".format(avg_cost)     print " Training phase finished" Let's get the accuracy of our mode.It is correct if the index with the highest y value is the same as in the real digit vector the mean of correct_prediction gives us the accuracy. We need to run the accuracy function with our test set (mnist.test). We use the keys images and labelsfor x and y_: correct_prediction = tf.equal                            (tf.argmax(activation, 1), tf.argmax(y, 1))       accuracy = tf.reduce_mean                        (tf.cast(correct_prediction, "float"))    print "MODEL accuracy:", accuracy.eval({x: mnist.test.images,                                       y: mnist.test.labels}) Test evaluation We have seen the training phase in the preceding sections; for each epoch we have printed the relative cost function: Python 2.7.10 (default, Oct 14 2015, 16:09:02) [GCC 5.2.1 20151010] on linux2 Type "copyright", "credits" or "license()" for more information. >>> ======================= RESTART ============================ >>> Extracting /tmp/data/train-images-idx3-ubyte.gz Extracting /tmp/data/train-labels-idx1-ubyte.gz Extracting /tmp/data/t10k-images-idx3-ubyte.gz Extracting /tmp/data/t10k-labels-idx1-ubyte.gz Epoch: 0001 cost= 1.174406662 Epoch: 0002 cost= 0.661956009 Epoch: 0003 cost= 0.550468774 Epoch: 0004 cost= 0.496588717 Epoch: 0005 cost= 0.463674555 Epoch: 0006 cost= 0.440907706 Epoch: 0007 cost= 0.423837747 Epoch: 0008 cost= 0.410590841 Epoch: 0009 cost= 0.399881751 Epoch: 0010 cost= 0.390916621 Epoch: 0011 cost= 0.383320325 Epoch: 0012 cost= 0.376767031 Epoch: 0013 cost= 0.371007620 Epoch: 0014 cost= 0.365922904 Epoch: 0015 cost= 0.361327561 Epoch: 0016 cost= 0.357258660 Epoch: 0017 cost= 0.353508228 Epoch: 0018 cost= 0.350164634 Epoch: 0019 cost= 0.347015593 Epoch: 0020 cost= 0.344140861 Epoch: 0021 cost= 0.341420144 Epoch: 0022 cost= 0.338980592 Epoch: 0023 cost= 0.336655581 Epoch: 0024 cost= 0.334488012 Epoch: 0025 cost= 0.332488823 Training phase finished As wesaw, during the training phase, the cost function is minimized.At the end of the test, we show how accurately the model is implemented: Model Accuracy: 0.9475 >>> Finally, using these lines of code, we could visualize the the training phase of the net: plt.plot(epoch_set,avg_set, 'o',  label='Logistic Regression Training phase') plt.ylabel('cost') plt.xlabel('epoch') plt.legend() plt.show() Training phase in logistic regression Summary In this article, we learned the implementation of artificial neural networks, Single Layer Perceptron, TensorFlow. We also learned how to build the model and launch the session.

Recommendation Systems

Packt
07 Jul 2016
12 min read
 In this article, Pradeepta Mishra, the author of R Data Mining Blueprints, says that in this age of Internet, everything available over the Internet is not useful for everyone. Different companies and entities use different approaches in finding out relevant content for their audiences. People started building algorithms to construct relevance score, based on that, recommendation can be build and suggested to the users. From our day to day life, every time I see an image on Google, 3-4 other images are recommended to me by Google. Every time I look for some videos on YouTube, 10 more videos are recommended to me. Every time I visit Amazon to buy some products, 5-6 products are recommended to me. And every time I read one blog or article, a few more articles and blogs are recommended to me. This is an evidence of algorithmic forces at play to recommend certain things based on users’ preferences or choices, since the users’ time is precious and content available over the Internet is unlimited. Hence, a recommendation engine helps organizations customize their offerings based on user preferences so that the user need not have to spend time in exploring what is required. In this article, the reader will learn the implementation of product recommendation using R. (For more resources related to this topic, see here.) Practical project The dataset contains a sample of 5000 users from the anonymous ratings data from the Jester Online Joke Recommender System collected between April 1999 and May 2003 (Golberg, Roeder, Gupta, and Perkins 2001). The dataset contains ratings for 100 jokes on a scale from -10 to 10. All users in the dataset have rated 36 or more jokes. Let's load the recommenderlab library and the Jester5K dataset: > library("recommenderlab") > data(Jester5k) > Jester5k@data@Dimnames[2] [[1]] [1] "j1" "j2" "j3" "j4" "j5" "j6" "j7" "j8" "j9" [10] "j10" "j11" "j12" "j13" "j14" "j15" "j16" "j17" "j18" [19] "j19" "j20" "j21" "j22" "j23" "j24" "j25" "j26" "j27" [28] "j28" "j29" "j30" "j31" "j32" "j33" "j34" "j35" "j36" [37] "j37" "j38" "j39" "j40" "j41" "j42" "j43" "j44" "j45" [46] "j46" "j47" "j48" "j49" "j50" "j51" "j52" "j53" "j54" [55] "j55" "j56" "j57" "j58" "j59" "j60" "j61" "j62" "j63" [64] "j64" "j65" "j66" "j67" "j68" "j69" "j70" "j71" "j72" [73] "j73" "j74" "j75" "j76" "j77" "j78" "j79" "j80" "j81" [82] "j82" "j83" "j84" "j85" "j86" "j87" "j88" "j89" "j90" [91] "j91" "j92" "j93" "j94" "j95" "j96" "j97" "j98" "j99" [100] "j100" The following image shows the distribution of real ratings given by 2000 users. > data<-sample(Jester5k,2000) > hist(getRatings(data),breaks=100,col="blue") The input dataset contains the individual ratings; the normalization function reduces the individual rating bias by centering the row (which is a standard z-score transformation), subtracting each element from the mean, and then dividing by standard deviation. The following graph shows normalized ratings for the preceding dataset: > hist(getRatings(normalize(data)),breaks=100,col="blue4") To create a recommender system: A recommendation engine is created using the recommender() function. A new recommendation algorithm can be added by the user using the recommenderRegistry$get_entries() function: > recommenderRegistry$get_entries(dataType = "realRatingMatrix") $IBCF_realRatingMatrix Recommender method: IBCF Description: Recommender based on item-based collaborative filtering (real data). 
Parameters: k method normalize normalize_sim_matrix alpha na_as_zero minRating 1 30 Cosine center FALSE 0.5 FALSE NA $POPULAR_realRatingMatrix Recommender method: POPULAR Description: Recommender based on item popularity (real data). Parameters: None $RANDOM_realRatingMatrix Recommender method: RANDOM Description: Produce random recommendations (real ratings). Parameters: None $SVD_realRatingMatrix Recommender method: SVD Description: Recommender based on SVD approximation with column-mean imputation (real data). Parameters: k maxiter normalize minRating 1 10 100 center NA $SVDF_realRatingMatrix Recommender method: SVDF Description: Recommender based on Funk SVD with gradient descend (real data). Parameters: k gamma lambda min_epochs max_epochs min_improvement normalize 1 10 0.015 0.001 50 200 1e-06 center minRating verbose 1 NA FALSE $UBCF_realRatingMatrix Recommender method: UBCF Description: Recommender based on user-based collaborative filtering (real data). Parameters: method nn sample normalize minRating 1 cosine 25 FALSE center NA The preceding registry command helps in identifying the methods available in the recommenderlab parameters for the model. There are six different methods for implementing recommender systems, such as popular, item-based, user-based, PCA, random, and SVD. Let's start the recommendation engine using the popular method: > rc <- Recommender(Jester5k, method = "POPULAR") > rc Recommender of type 'POPULAR' for 'realRatingMatrix' learned using 5000 users. > names(getModel(rc)) [1] "topN" "ratings" [3] "minRating" "normalize" [5] "aggregationRatings" "aggregationPopularity" [7] "minRating" "verbose" > getModel(rc)$topN Recommendations as 'topNList' with n = 100 for 1 users. The objects such as top N, verbose, aggregation popularity, and so on, can be printed using names of the getmodel()command: recom <- predict(rc, Jester5k, n=5) recom To generate a recommendation, we can use the predict function against the same dataset and validate the accuracy of the predictive model. Here we are generating the top 5 recommended jokes to each of the users. The result of the prediction is as follows: > head(as(recom,"list")) $u2841 [1] "j89" "j72" "j76" "j88" "j83" $u15547 [1] "j89" "j93" "j76" "j88" "j91" $u15221 character(0) $u15573 character(0) $u21505 [1] "j89" "j72" "j93" "j76" "j88" $u15994 character(0) For the same Jester5K dataset, let's try to implement item-based collaborative filtering (IBCF): > rc <- Recommender(Jester5k, method = "IBCF") > rc Recommender of type 'IBCF' for 'realRatingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j85" "j86" "j74" "j84" "j80" $u15547 [1] "j91" "j87" "j88" "j89" "j93" $u15221 character(0) $u15573 character(0) $u21505 [1] "j78" "j80" "j73" "j77" "j92" $u15994 character(0) The Principal component analysis (PCA) method is not applicable for real-rating-based datasets; this is because getting a correlation matrix and subsequent eigenvector and eigenvalue calculations would not be accurate. Hence we will not show its application. Next we are going to show how the random method works: > rc <- Recommender(Jester5k, method = "RANDOM") > rc Recommender of type 'RANDOM' for 'ratingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. 
> head(as(recom,"list")) [[1]] [1] "j90" "j74" "j86" "j78" "j85" [[2]] [1] "j87" "j88" "j74" "j92" "j79" [[3]] character(0) [[4]] character(0) [[5]] [1] "j95" "j86" "j93" "j78" "j83" [[6]] character(0) In the recommendation engine, the SVD approach is used to predict the missing ratings so that a recommendation can be generated. Using the singular value decomposition (SVD) method, the following recommendation can be generated: > rc <- Recommender(Jester5k, method = "SVD") > rc Recommender of type 'SVD' for 'realRatingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j74" "j71" "j84" "j79" "j80" $u15547 [1] "j89" "j93" "j76" "j81" "j88" $u15221 character(0) $u15573 character(0) $u21505 [1] "j80" "j73" "j100" "j72" "j78" $u15994 character(0) The result from user-based collaborative filtering is shown as follows: > rc <- Recommender(Jester5k, method = "UBCF") > rc Recommender of type 'UBCF' for 'realRatingMatrix' learned using 5000 users. > recom <- predict(rc, Jester5k, n=5) > recom Recommendations as 'topNList' with n = 5 for 5000 users. > head(as(recom,"list")) $u2841 [1] "j81" "j78" "j83" "j80" "j73" $u15547 [1] "j96" "j87" "j89" "j76" "j93" $u15221 character(0) $u15573 character(0) $u21505 [1] "j100" "j81" "j83" "j92" "j96" $u15994 character(0) Now let's compare the results obtained from all the five different algorithms except PCA (because PCA requires a binary dataset; it does not accept a real ratings matrix). Table 4: Comparison of results between different recommendation algorithms Popular IBCF Random method SVD UBCF > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) > head(as(recom,"list")) $u2841 $u2841 [[1]] $u2841 $u2841 [1] "j89" "j72" "j76" "j88" "j83" [1] "j85" "j86" "j74" "j84" "j80" [1] "j90" "j74" "j86" "j78" "j85" [1] "j74" "j71" "j84" "j79" "j80" [1] "j81" "j78" "j83" "j80" "j73"           $u15547 $u15547 [[2]] $u15547 $u15547 [1] "j89" "j93" "j76" "j88" "j91" [1] "j91" "j87" "j88" "j89" "j93" [1] "j87" "j88" "j74" "j92" "j79" [1] "j89" "j93" "j76" "j81" "j88" [1] "j96" "j87" "j89" "j76" "j93"           $u15221 $u15221 [[3]] $u15221 $u15221 character(0) character(0) character(0) character(0) character(0)           $u15573 $u15573 [[4]] $u15573 $u15573 character(0) character(0) character(0) character(0) character(0)           $u21505 $u21505 [[5]] $u21505 $u21505 [1] "j89" "j72" "j93" "j76" "j88" [1] "j78" "j80" "j73" "j77" "j92" [1] "j95" "j86" "j93" "j78" "j83" [1] "j80"   "j73" "j100" "j72" "j78" [1] "j100" "j81" "j83" "j92" "j96"           $u15994 $u15994 [[6]] $u15994 $u15994 character(0) character(0) character(0) character(0) character(0)             One thing is clear from the above table. For users 15573 and 15221, none of the five methods generate recommendation. Hence it is important to look at methods to evaluate the recommendation results. To validate the accuracy of the model, let's implement accuracy measures and compare the accuracies of all the models. For the evaluation of the model results, the dataset is divided into 90% for training and 10% for testing the algorithm. The definition of a good rating is updated as 5: > e <- evaluationScheme(Jester5k, method="split", + train=0.9,given=15, goodRating=5) > e Evaluation scheme with 15 items given Method: 'split' with 1 run(s). 
Training set proportion: 0.900 Good ratings: >=5.000000 Data set: 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings. The following script is used to build the collaborative filtering model and apply it on a new dataset for predicting the ratings. Then the prediction accuracy is computed. The error matrix is shown as follows: > #User based collaborative filtering > r1 <- Recommender(getData(e, "train"), "UBCF") > #Item based collaborative filtering > r2 <- Recommender(getData(e, "train"), "IBCF") > #PCA based collaborative filtering > #r3 <- Recommender(getData(e, "train"), "PCA") > #POPULAR based collaborative filtering > r4 <- Recommender(getData(e, "train"), "POPULAR") > #RANDOM based collaborative filtering > r5 <- Recommender(getData(e, "train"), "RANDOM") > #SVD based collaborative filtering > r6 <- Recommender(getData(e, "train"), "SVD") > #Predicted Ratings > p1 <- predict(r1, getData(e, "known"), type="ratings") > p2 <- predict(r2, getData(e, "known"), type="ratings") > #p3 <- predict(r3, getData(e, "known"), type="ratings") > p4 <- predict(r4, getData(e, "known"), type="ratings") > p5 <- predict(r5, getData(e, "known"), type="ratings") > p6 <- predict(r6, getData(e, "known"), type="ratings") > #calculate the error between the prediction and > #the unknown part of the test data > error <- rbind( + calcPredictionAccuracy(p1, getData(e, "unknown")), + calcPredictionAccuracy(p2, getData(e, "unknown")), + #calcPredictionAccuracy(p3, getData(e, "unknown")), + calcPredictionAccuracy(p4, getData(e, "unknown")), + calcPredictionAccuracy(p5, getData(e, "unknown")), + calcPredictionAccuracy(p6, getData(e, "unknown")) + ) > rownames(error) <- c("UBCF","IBCF","POPULAR","RANDOM","SVD") > error RMSE MSE MAE UBCF 4.485571 20.12034 3.511709 IBCF 4.606355 21.21851 3.466738 POPULAR 4.509973 20.33985 3.548478 RANDOM 7.917373 62.68480 6.464369 SVD 4.653111 21.65144 3.679550 From the preceding result, UBCF has the lowest error in comparison to other recommendation methods. Here, to evaluate the results of the predictive model, we are using the k-fold cross-validation method. k is assumed to have been taken as 4: > #Evaluation of a top-N recommender algorithm > scheme <- evaluationScheme(Jester5k, method="cross", k=4, + given=3,goodRating=5) > scheme Evaluation scheme with 3 items given Method: 'cross-validation' with 4 run(s). Good ratings: >=5.000000 Data set: 5000 x 100 rating matrix of class 'realRatingMatrix' with 362106 ratings. The result of the models from the evaluation scheme shows the runtime versus prediction time by different cross-validation results for different models. 
The result is shown as follows: > results <- evaluate(scheme, method="POPULAR", n=c(1,3,5,10,15,20)) POPULAR run fold/sample [model time/prediction time] 1 [0.14sec/2.27sec] 2 [0.16sec/2.2sec] 3 [0.14sec/2.24sec] 4 [0.14sec/2.23sec] > results <- evaluate(scheme, method="IBCF", n=c(1,3,5,10,15,20)) IBCF run fold/sample [model time/prediction time] 1 [0.4sec/0.38sec] 2 [0.41sec/0.37sec] 3 [0.42sec/0.38sec] 4 [0.43sec/0.37sec] > results <- evaluate(scheme, method="UBCF", n=c(1,3,5,10,15,20)) UBCF run fold/sample [model time/prediction time] 1 [0.13sec/6.31sec] 2 [0.14sec/6.47sec] 3 [0.15sec/6.21sec] 4 [0.13sec/6.18sec] > results <- evaluate(scheme, method="RANDOM", n=c(1,3,5,10,15,20)) RANDOM run fold/sample [model time/prediction time] 1 [0sec/0.27sec] 2 [0sec/0.26sec] 3 [0sec/0.27sec] 4 [0sec/0.26sec] > results <- evaluate(scheme, method="SVD", n=c(1,3,5,10,15,20)) SVD run fold/sample [model time/prediction time] 1 [0.36sec/0.36sec] 2 [0.35sec/0.36sec] 3 [0.33sec/0.36sec] 4 [0.36sec/0.36sec] The confusion matrix displays the level of accuracy provided by each of the models. We can estimate the accuracy measures such as precision, recall and TPR, FPR, and so on; the result is shown here: > getConfusionMatrix(results)[[1]] TP FP FN TN precision recall TPR FPR 1 0.2736 0.7264 17.2968 78.7032 0.2736000 0.01656597 0.01656597 0.008934588 3 0.8144 2.1856 16.7560 77.2440 0.2714667 0.05212659 0.05212659 0.027200530 5 1.3120 3.6880 16.2584 75.7416 0.2624000 0.08516269 0.08516269 0.046201487 10 2.6056 7.3944 14.9648 72.0352 0.2605600 0.16691259 0.16691259 0.092274243 15 3.7768 11.2232 13.7936 68.2064 0.2517867 0.24036802 0.24036802 0.139945095 20 4.8136 15.1864 12.7568 64.2432 0.2406800 0.30082509 0.30082509 0.189489883 Association rules as a method for recommendation engine, for building product recommendation in a retail/e-commerce scenario. Summary In this article, we discussed the way of recommending products to users based on similarities in their purchase patterns, content, item-to-item comparison and so on. So far, the accuracy is concerned, always the user-based collaborative filtering is giving better result in a real-rating-based matrix as an input. Similarly, the choice of methods for a specific use case is really difficult, so it is recommended to apply all six different methods. The best one should be selected automatically, and the recommendation should also get updates automatically. Resources for Article: Further resources on this subject: Data mining[article] Machine Learning with R[article] Machine learning and Python – the Dream Team[article]

Data Science with R

Packt
04 Jul 2016
16 min read
In this article by Matthias Templ, author of the book Simulation for Data Science with R, we will cover: What is meant bydata science A short overview of what Ris The essential tools for a data scientist in R (For more resources related to this topic, see here.) Data science Looking at the job market it is no doubt that the industry needs experts on data science. But what is data science and what's the difference to statistics or computational statistics? Statistics is computing with data. In computational statistics, methods and corresponding software are developed in a highly data-depended manner using modern computational tools. Computational statistics has a huge intersection with data science. Data science is the applied part of computational statistics plus data management including storage of data, data bases, and data security issues. The term data science is used when your work is driven by data with a less strong component on method and algorithm development as computational statistics, but with a lot of pure computer science topics related to storing, retrieving, and handling data sets. It is the marriage of computer science and computational statistics. As an example to show differences, we took the broad area of visualization. A data scientist is also interested in pure process related visualizations (airflows in an engine, for example),while in computational statistics, methods for visualization of data and statistical results are onlytouched upon. Data science is the management of the entire modelling process, from data collection to automatized reporting and presenting the results. Storage and managing data, data pre-processing (editing, imputation), data analysis, and modelling are included in this process. Data scientists use statistics and data-oriented computer science tools to solve the problems they face. R R has become an essential tool for statistics and data science(Godfrey 2013). As soon as data scientists have to analyze data, R might be the first choice. The opensource programming language and software environment, R, is currently one of the most widely used and popular software tools for statistics and data analysis. It is available at the Comprehensive R Archive Network (CRAN) as free software under the terms of the Free Software Foundation's GNU General Public License (GPL) in source code and binary form. The R Core Team defines R as an environment. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Base R includes: A suite of operators for calculations on arrays, mostly written in C and integrated in R Comprehensive, coherent, and integrated collection of methods for data analysis Graphical facilities for data analysis and display, either on-screen or in hard copy A well-developed, simple, and effective programming language thatincludes conditional statements, loops, user-defined recursive functions, and input and output facilities A flexible object-oriented system facilitating code reuse High performance computing with interfaces to compiled code and facilities for parallel and grid computing The ability to be extended with (add-on) packages An environment that allows communication with many other software tools Each R package provides a structured standard documentation including code application examples. Further documents(so called vignettes???)potentially show more applications of the packages and illustrate dependencies between the implemented functions and methods. 
R is not only used extensively in the academic world, but also companies in the area of social media (Google, Facebook, Twitter, and Mozilla Corporation), the banking world (Bank of America, ANZ Bank, Simple), food and pharmaceutical areas (FDA, Merck, and Pfizer), finance (Lloyd, London, and Thomas Cook), technology companies (Microsoft), car construction and logistic companies (Ford, John Deere, and Uber), newspapers (The New York Times and New Scientist), and companies in many other areas; they use R in a professional context(see also, Gentlemen 2009andTippmann 2015). International and national organizations nowadays widely use R in their statistical offices(Todorov and Templ 2012 and Templ and Todorov 2016). R can be extended with add-on packages, and some of those extensions are especially useful for data scientists as discussed in the following section. Tools for data scientists in R Data scientists typically like: The flexibility in reading and writing data including the connection to data bases To have easy-to-use, flexible, and powerful data manipulation features available To work with modern statistical methodology To use high-performance computing tools including interfaces to foreign languages and parallel computing Versatile presentation capabilities for generating tables and graphics, which can readily be used in text processing systems, such as LaTeX or Microsoft Word To create dynamical reports To build web-based applications An economical solution The following presented tools are related to these topics and helps data scientists in their daily work. Use a smart environment for R Would you prefer to have one environment that includes types of modern tools for scientific computing, programming and management of data and files, versioning, output generation that also supports a project philosophy, code completion, highlighting, markup languages and interfaces to other software, and automated connections to servers? Currently two software products supports this concept. The first one is Eclipse with the extensionSTATET or the modified Eclipse IDE from Open Analytics called Architect. The second is a very popular IDE for R called RStudio, which also includes the named features and additionally includes an integration of the packages shiny(RStudio, Inc. 2014)for web-based development and integration of R and rmarkdown(Allaire et al. 2015). It provides a modern scientific computing environment, well designed and easy to use, and most importantly, distributed under GPL License. Use of R as a mediator Data exchange between statistical systems, database systems, or output formats is often required. In this respect, R offers very flexible import and export interfaces either through its base installation but mostly through add-on packages, which are available from CRAN or GitHub. For example, the packages xml2(Wickham 2015a)allow to read XML files. For importing delimited files, fixed width files, and web log files, it is worth mentioning the package readr(Wickham and Francois 2015a)or data.table(Dowle et al. 2015)(functionfread), which are supposed to be faster than the available functions in base R. The packages XLConnect(Mirai Solutions GmbH 2015)can be used to read and write Microsoft Excel files including formulas, graphics, and so on. The readxlpackage(Wickham 2015b)is faster for data import but do not provide export features. 
The foreignpackages(R Core Team 2015)and a newer promising package called haven(Wickham and Miller 2015)allow to read file formats from various commercial statistical software. The connection to all major database systems is easily established with specialized packages. Note that theROBDCpackage(Ripley and Lapsley 2015)is slow but general, while other specialized packages exists for special data bases. Efficient data manipulation as the daily job Data manipulation, in general but in any case with large data, can be best done with the dplyrpackage(Wickham and Francois 2015b)or the data.tablepackage(Dowle et al. 2015). The computational speed of both packages is much faster than the data manipulation features of base R, while data.table is slightly faster than dplyr using keys and fast binary search based methods for performance improvements. In the author's viewpoint, the syntax of dplyr is much easier to learn for beginners as the base R data manipulation features, and it is possible to write thedplyr syntax using data pipelines that is internally provided by package magrittr(Bache and Wickham 2014). Let's take an example to see the logical concept. We want to compute a new variableEngineSizeas the square ofEngineSizefrom the data set Cars93. For each group, we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order: data(Cars93, package = "MASS") library("dplyr") Cars93 %>%   mutate(ES2 = EngineSize^2) %>%   group_by(Type) %>%   summarize(min.ES2 = min(ES2)) %>%   arrange(desc(min.ES2)) ## Source: local data frame [6 x 2] ## ##      Type min.ES2 ## 1   Large   10.89 ## 2     Van    5.76 ## 3 Compact    4.00 ## 4 Midsize    4.00 ## 5  Sporty    1.69 ## 6   Small    1.00 The code is somehow self-explanatory, while data manipulation in base R and data.table needs more expertise on syntax writing. In the case of large data files thatexceed available RAM, interfaces to (relational) database management systems are available, see the CRAN task view on high-performance computingthat includes also information about parallel computing. According to data manipulation, the excellent packages stringr, stringi, and lubridate for string operations and date-time handling should also be mentioned. The requirement of efficient data preprocessing A data scientist typically spends a major amount of time not only ondata management issues but also on fixing data quality problems. It is out of the scope of this book to mention all the tools for each data preprocessing topic. As an example, we concentrate on one particular topic—the handling of missing values. The VIMpackage(Templ, Alfons, and Filzmoser 2011)(Kowarik and Templ 2016)can be used for visual inspection and imputation of data. It is possible to visualize missing values using suitable plot methods and to analyze missing values' structure in microdata using univariate, bivariate, multiple, and multivariate plots. The information on missing values from specified variables is highlighted in selected variables. VIM can also evaluate imputations visually. Moreover, the VIMGUIpackage(Schopfhauser et al., 2014)provides a point and click graphical user interface (GUI). One plot, a parallel coordinate plot, for missing values is shown in the following graph. It highlights the values on certain chemical elements. In red, those values are marked that contain the missing in the chemical element Bi. 
It is easy to see missing at random situations with such plots as well as to detect any structure according to the missing pattern. Note that this data is compositional thus transformed using a log-ratio transformation from the package robCompositions(Templ, Hron, and Filzmoser 2011): library("VIM") data(chorizonDL, package = "VIM") ## for missing values x <- chorizonDL[,c(15,101:110)] library("robCompositions") x <- cenLR(x)$x.clr parcoordMiss(x,     plotvars=2:11, interactive = FALSE) legend("top", col = c("skyblue", "red"), lwd = c(1,1),     legend = c("observed in Bi", "missing in Bi")) To impute missing values,not onlykk-nearest neighbor and hot-deck methods are included, but also robust statistical methods implemented in an EMalgorithm, for example, in the functionirmi. The implemented methods can deal with a mixture of continuous, semi-continuous, binary, categorical, and count variables: any(is.na(x)) ## [1] TRUE ximputed <- irmi(x) ## Time difference of 0.01330566 secs any(is.na(ximputed)) ## [1] FALSE Visualization as a must While in former times, results were presented mostly in tables and data was analyzed by their values on screen; nowadays visualization of data and results becomes very important. Data scientists often heavily use visualizations to analyze data andalso for reporting and presenting results. It's already a nogo to not make use of visualizations. R features not only it's traditional graphical system but also an implementation of the grammar of graphics book(Wilkinson 2005)in the form of the R package(Wickham 2009). Why a data scientist should make use of ggplot2? Since it is a very flexible, customizable, consistent, and systematic approach to generate graphics. It allows to define own themes (for example, cooperative designs in companies) and support the users with legends and optimal plot layout. In ggplot2, the parts of a plot are defined independently. We do not go into details and refer to(Wickham 2009)or(???), but here's a simple example to show the user-friendliness of the implementation: library("ggplot2") ggplot(Cars93, aes(x = Horsepower, y = MPG.city)) + geom_point() + facet_wrap(~Cylinders) Here, we mapped Horsepower to the x variable and MPG.city to the y variable. We used Cylinder for faceting. We usedgeom_pointto tell ggplot2 to produce scatterplots. Reporting and webapplications Every analysis and report should be reproducible, especially when a data scientist does the job. Everything from the past should be able to compute at any time thereafter. Additionally,a task for a data scientist is to organize and managetext,code,data, andgraphics. The use of dynamical reporting tools raise the quality of outcomes and reduce the work-load. In R, the knitrpackage provides functionality for creating reproducible reports. It links code and text elements. The code is executed and the results are embedded in the text. Different output formats are possible such as PDF,HTML, orWord. The structuring can be most simply done using rmarkdown(Allaire et al., 2015). markdown is a markup language with many features, including headings of different sizes, text formatting, lists, links, HTML, JavaScript,LaTeX equations, tables, and citations. The aim is to generate documents from plain text. Cooperate designs and styles can be managed through CSS stylesheets. For data scientists, it is highly recommended to use these tools in their daily work. We already mentioned the automated generation from HTML pages from plain text with rmarkdown. The shinypackage(RStudio Inc. 
2014)allows to build web-based applications. The website generated with shiny changes instantly as users modify inputs. You can stay within the R environment to build shiny user interfaces. Interactivity can be integrated using JavaScript, and built-in support for animation and sliders. Following is a very simple example that includes a slider and presents a scatterplot with highlighting of outliers given. We do not go into detail on the code that should only prove that it is just as simple to make a web application with shiny: library("shiny") library("robustbase") ## Define server code server <- function(input, output) {   output$scatterplot <- renderPlot({     x <- c(rnorm(input$obs-10), rnorm(10, 5)); y <- x + rnorm(input$obs)     df <- data.frame("x" = x, "y" = y)     df$out <- ifelse(covMcd(df)$mah > qchisq(0.975, 1), "outlier", "non-outlier")     ggplot(df, aes(x=x, y=y, colour=out)) + geom_point()   }) }   ## Define UI ui <- fluidPage(   sidebarLayout(     sidebarPanel(       sliderInput("obs", "No. of obs.", min = 10, max = 500, value = 100, step = 10)     ),     mainPanel(plotOutput("scatterplot"))   ) )   ## Shiny app object shinyApp(ui = ui, server = server) Building R packages First, RStudio and the package devtools(Wickham and Chang 2016)make life easy when building packages. RStudio has a lot of facilities for package building, and it's integrated package devtools includes features for checking, building, and documenting a package efficiently, and includes roxygen2(Wickham, Danenberg, and Eugster)for automated documentation of packages. When code of a package is updated,load_all('pathToPackage')simulates a restart of R, the new installation of the package and the loading of the newly build packages. Note that there are many other functions available for testing, documenting, and checking. Secondly, build a package whenever you wrote more than two functions and whenever you deal with more than one data set. If you use it only for yourself, you may be lazy with documenting the functions to save time. Packages allow to share code easily, to load all functions and data with one line of code, to have the documentation integrated, and to support consistency checks and additional integrated unit tests. Advice for beginners is to read the manualWriting R Extensions, and use all the features that are provided by RStudio and devtools. Summary In this article, we discussed essential tools for data scientists in R. This covers methods for data pre-processing, data manipulation, and tools for reporting, reproducible work, visualization, R packaging, and writing web-applications. A data scientist should learn to use the presented tools and deepen the knowledge in the proposed methods and software tools. Having learnt these lessons, a data scientist is well-prepared to face the challenges in data analysis, data analytics, data science, and data problems in practice. References Allaire, J.J., J. Cheng, Xie Y, J. McPherson, W. Chang, J. Allen, H. Wickham, and H. Hyndman. 2015.Rmarkdown: Dynamic Documents for R.http://CRAN.R-project.org/package=rmarkdown. Bache, S.M., and W. Wickham. 2014.magrittr: A Forward-Pipe Operator for R.https://CRAN.R-project.org/package=magrittr. Dowle, M., A. Srinivasan, T. Short, S. Lianoglou, R. Saporta, and E. Antonyan. 2015.Data.table: Extension of Data.frame.https://CRAN.R-project.org/package=data.table. Gentlemen, R. 2009. "Data Analysts Captivated by R's Power."New York Times.http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html. 
Godfrey, A.J.R. 2013. "Statistical Analysis from a Blind Person's Perspective."The R Journal5 (1): 73–80. Kowarik, A., and M. Templ. 2016. "Imputation with the R Package VIM."Journal of Statistical Software. Mirai Solutions GmbH. 2015.XLConnect: Excel Connector for R.http://CRAN.R-project.org/package=XLConnect. R Core Team. 2015.Foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ….http://CRAN.R-project.org/package=foreign. Ripley, B., and M. Lapsley. 2015.RODBC: ODBC Database Access.http://CRAN.R-project.org/package=RODBC. RStudio Inc. 2014.Shiny: Web Application Framework for R.http://CRAN.R-project.org/package=shiny. Schopfhauser, D., M. Templ, A. Alfons, A. Kowarik, and B. Prantner. 2014.VIMGUI: Visualization and Imputation of Missing Values.http://CRAN.R-project.org/package=VIMGUI. Templ, M., A. Alfons, and P. Filzmoser. 2011. "Exploring Incomplete Data Using Visualization Techniques."Advances in Data Analysis and Classification6 (1): 29–47. Templ, M., and V. Todorov. 2016. "The Software Environment R for Official Statistics and Survey Methodology."Austrian Journal of Statistics45 (1): 97–124. Templ, M., K. Hron, and P. Filzmoser. 2011.RobCompositions: An R-Package for Robust Statistical Analysis of Compositional Data. John Wiley; Sons. Tippmann, S. 2015. "Programming Tools: Adventures with R."Nature, 109–10. doi:10.1038/517109a. Todorov, V., and M. Templ. 2012.R in the Statistical Office: Part II. Working paper 1/2012. United Nations Industrial Development. Wickham, H. 2009.Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.http://had.co.nz/ggplot2/book. 2015a.Xml2: Parse XML.http://CRAN.R-project.org/package=xml2. 2015b.Readxl: Read Excel Files.http://CRAN.R-project.org/package=readxl. Wickham, H., and W. Chang. 2016.Devtools: Tools to Make Developing R Packages Easier.https://CRAN.R-project.org/package=devtools. Wickham, H., and R. Francois. 2015a.Readr: Read Tabular Data.http://CRAN.R-project.org/package=readr. 2015b.dplyr: A Grammar of Data Manipulation.https://CRAN.R-project.org/package=dplyr. Wickham, H., and E. Miller. 2015.Haven: Import SPSS,Stata and SAS Files.http://CRAN.R-project.org/package=haven. Wickham, H., P. Danenberg, and M. Eugster.Roxygen2: In-Source Documentation for R.https://github.com/klutometis/roxygen. Wilkinson, L. 2005.The Grammar of Graphics (Statistics and Computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc. Resources for Article: Further resources on this subject: Adding Media to Our Site [article] Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article] JavaScript Execution with Selenium [article]
Getting Started with TensorFlow: an API Primer

Sam Abrahams
19 Jun 2016
8 min read
TensorFlow has picked up a lot of steam over the past couple of months, and there's been more and more interest in learning how to use the library. I've seen tons of tutorials out there that just slap together TensorFlow code, roughly describe what some of the lines do, and call it a day. Conversely, I've seen really dense tutorials that mix together universal machine learning concepts and TensorFlow's API. There is value in both of these sorts of examples, but I find them either a little too sparse or too confusing, respectively. In this post, I plan to focus solely on information related to the TensorFlow API, and not touch on general machine learning concepts (aside from describing computational graphs). Additionally, I will link directly to relevant portions of the TensorFlow API for further reading. While this post isn't going to be a proper tutorial, my hope is that focusing on the core components and workflows of the TensorFlow API will make working with other resources more accessible and comprehensible. As a final note, I'll be referring to the Python API and not the C++ API in this post. Definitions Let's start off with a glossary of key words you're going to see when using TensorFlow. Tensor: An n-dimensional matrix. For most practical purposes, you can think of them the same way you would a two-dimensional matrix for matrix algebra. In TensorFlow, the return value of any mathematical operation is a tensor. See here for more about TensorFlow Tensor objects. Graph: The computational graph that is defined by the user. It's constructed of nodes and edges, representing computations and connections between those computations, respectively. For a quick primer on computation graphs and how they work in backpropagation, check out Chris Olah's post here. A TensorFlow user can define more than one Graph object and run them separately. Additionally, it is possible to define a large graph and run only smaller portions of it. See here for more information about TensorFlow Graphs. Op, Operation (Ops, Operations): Any sort of computation on tensors. Operations (or Ops) can take in zero or more TensorFlow Tensor objects, and output zero or more Tensor objects as a result of the computation. Ops are used all throughout TensorFlow, from doing simple addition to matrix multiplication to initializing TensorFlow variables. Operations run only when they are passed to the Session object, which I'll discuss below. For the most part, nodes and operations are interchangeable concepts. In this guide, I'll try to use the term Operation or Op when referring to TensorFlow-specific operations and node when referring to general computation graph terminology. Here's the API reference for the Operation class. Node: A computation in the graph that takes as input zero or more tensors and outputs zero or more tensors. A node does not have to interact with any other nodes, and thus does not have to have any edges connected to it. Visually, these are usually depicted as ellipses or boxes. Edge: The directed connection between two nodes. In TensorFlow, each edge can be seen as one or more tensors, and usually represents the output of one node becoming the input of the next node. Visually, these are usually depicted as lines or arrows. Device: A CPU or GPU. In TensorFlow, computations can occur across many different CPUs and GPUs, and it must keep track of these devices in order to coordinate work properly.
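To make the glossary concrete, here is a minimal sketch (my own illustration, not from the original post) that builds a tiny graph out of two constant Ops feeding an add Op; nothing is actually computed until the graph is handed to a Session:

import tensorflow as tf

# Two constant Ops; each returns a Tensor handle into the default Graph
a = tf.constant(3.0, name='a')
b = tf.constant(4.0, name='b')

# An Op (node) whose incoming edges carry the tensors a and b
c = tf.add(a, b, name='c')

# Only now is the computation executed, on whatever device TensorFlow picks
with tf.Session() as sess:
    print(sess.run(c))  # prints 7.0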
The Typical TensorFlow Coding Workflow Writing a working TensorFlow model boils down to two steps: Build the Graph using a series of Operations, placeholders, and Variables. Run the Graph with training data repeatedly using the Session (you'll want to test the model while training to make sure it's learning properly). Sounds simple enough, and once you get a hang of it, it really is! We talked about Ops in the section above, but now I want to put special emphasis on placeholders, Variables, and the Session. They are fairly easy to grasp, but getting these core fundamentals solidified will give context to learning the rest of the TensorFlow API. Placeholders A Placeholder is a node in the graph that must be fed data via the feed_dict parameter in Session.run (see below). In general, these are used to specify input data and label data. Basically, you use placeholders to tell TensorFlow, "Hey TF, the data here is going to change each time you run the graph, but it will always be a tensor of size [N] and data-type [D]. Use that information to make sure that my matrix/tensor calculations are set up properly." TensorFlow needs to have that information in order to compile the program, as it has to guarantee that you don't accidentally try to multiply a 5x5 matrix with an 8x8 matrix (amongst other things). Placeholders are easy to define. Just make a variable that is assigned to the result of tensorflow.placeholder(): import tensorflow as tf # Create a Placeholder of size 100x400 that will contain 32-bit floating point numbers my_placeholder = tf.placeholder(tf.float32, shape=(100, 400)) Read more about Placeholder objects here. Note: We are required to feed data to the placeholder when we run our graph. We'll cover this in the Session section below. Variables Variables are objects that contain tensor information but persist across multiple calls to Session.run(). That is, they contain information that can be altered during the run of a graph, and then that altered state can be accessed the next time the graph is run. Variables are used to hold the weights and biases of a machine learning model while it trains, and their final values are what define the trained model. Defining and using Variables is mostly straightforward. Define a Variable with tensorflow.Variable() and update its information with the assign() method: import tensorflow as tf # Create a variable with the value 0 and the name of 'my_variable' my_var = tf.Variable(0, name='my_variable') # Increment the variable by one my_var.assign(my_var + 1) One catch with Variable objects is that you can't run Ops with them until you initialize them within the Session object. This is usually done with the Operation returned from tf.initialize_all_variables(), as I'll describe in the next section. Variable API reference The official how-to for Variable objects The Session Finally, let's talk about running the Session. The TensorFlow Session object is in charge of keeping track of all Variables, coordinating computation across devices, and generally doing anything that involves running the graph. You generally start a Session by calling tensorflow.Session(), and either directly assign the value of that statement to a handle or use a with ... as statement. The most important method in the Session object is run(), which takes in as input fetches, a list of Operations and Tensors that the user wishes to calculate; and feed_dict, which is an optional dictionary mapping Tensors (often Placeholders) to values that should override that Tensor. 
This is how you specify which values you want returned from your computation as well as the input values for training. Here is a toy example that uses a placeholder, a Variable, and the Session to showcase their basic use: import tensorflow as tf # Create a placeholder for inputting floating point data later a = tf.placeholder(tf.float32) # Make a base Variable object with the starting value of 0 start = tf.Variable(0.0) # Create a node that is the value of incrementing the 'start' Variable by the value of 'a' y = start.assign(start + a) # Open up a TensorFlow Session and assign it to the handle 'sess' sess = tf.Session() # Important: initialize the Variable, or else we won't be able to run our computation init = tf.initialize_all_variables() # 'init' is an Op: must be run by sess sess.run(init) # Now the Variable is initialized! # Get the value of 'y', feeding in different values for 'a', and print the result # Because we are using a Variable, the value should change each time print(sess.run(y, feed_dict={a:1})) # Prints 1.0 print(sess.run(y, feed_dict={a:0.5})) # Prints 1.5 print(sess.run(y, feed_dict={a:2.2})) # Prints 3.7 # Close the Session sess.close() Check out the documentation for TensorFlow's Session object here. Finishing Up Alright! This primer should give you a head start on understanding more of the resources out there for TensorFlow. The less you have to think about how TensorFlow works, the more time you can spend working out how to set up the best neural network you can! Good luck, and happy flowing! About the author Sam Abrahams is a freelance data engineer and animator in Los Angeles, CA, USA. He specializes in real-world applications of machine learning and is a contributor to TensorFlow. Sam runs a small tech blog, Memdump, and is an active member of the local hacker scene in West LA.
How to use SQLite with Ionic to store data?

Oli Huggins
13 Jun 2016
10 min read
Hybrid mobile apps face the challenging task of being as performant as native apps, but I always tell other developers that it depends not on the technology but on how we code. The Ionic Framework is a popular hybrid app development library, which uses optimal design patterns to create awe-inspiring mobile experiences. We cannot simply reuse web design patterns to create hybrid mobile apps. Storing data locally on a device is one such capability, which can make or break the performance of your app. In a web app, we may use localStorage to store data, but mobile apps require much more data to be stored and swift access to it. localStorage is synchronous, so it is slow at accessing the data. Also, web developers who have experience of coding in a backend language such as C#, PHP, or Java would find it more convenient to access data using SQL queries than using an object-based DB. SQLite is a lightweight embedded relational DBMS used in web browsers and web views for hybrid mobile apps. It is similar to the HTML5 WebSQL API and is asynchronous in nature, so it does not block the DOM or any other JS code. Ionic apps can leverage this tool using an open source Cordova plugin by Chris Brody (@brodybits). We can use this plugin directly or use it with the ngCordova library by the Ionic team, which abstracts Cordova plugin calls into AngularJS-based services. In this blog post, we will create an Ionic app with Trackers that let us track any information by storing it at any point in time. We can use this data to analyze the information and draw it on charts. We will be using the 'cordova-sqlite-ext' plugin and the ngCordova library. We will start by creating a new Ionic app with a blank starter template using the Ionic CLI command, '$ ionic start sqlite-sample blank'. We should also add the appropriate platforms for which we want to build our app. The command to add a specific platform is '$ ionic platform add <platform_name>'. Since we will be using ngCordova to manage the SQLite plugin from the Ionic app, we now have to install ngCordova in our app. Run the following bower command to download the ngCordova dependencies to the local bower 'lib' folder: bower install ngCordova We need to inject the JS file using a script tag in our index.html: <script src="lib/ngCordova/dist/ng-cordova.js"></script> Also, we need to include the ngCordova module as a dependency in our app.js main module declaration: angular.module('starter', ['ionic','ngCordova']) Now, we need to add the Cordova plugin for SQLite using the CLI command: cordova plugin add https://github.com/litehelpers/Cordova-sqlite-storage.git Since we will be using only the $cordovaSQLite service of ngCordova to access this plugin from our Ionic app, we need not inject any other plugin. We will have the following two views in our Ionic app: Trackers list: This list shows all the trackers we add to the DB. Tracker details: This is a view to show the list of data entries we make for a specific tracker. We need to create the routes by registering the states for the two views we want to create.
We need to add the following config block code for our 'starter' module in the app.js file only: .config(function($stateProvider,$urlRouterProvider){ $urlRouterProvider.otherwise('/') $stateProvider.state('home', { url: '/', controller:'TrackersListCtrl', templateUrl: 'js/trackers-list/template.html' }); $stateProvider.state('tracker', { url: '/tracker/:id', controller:'TrackerDetailsCtrl', templateUrl: 'js/tracker-details/template.html' }) }); Both views will have similar functionality but will display different entities. Our view will display a list of trackers from the SQLite DB table and also provide a feature to add a new tracker or delete an existing one. Create a new folder named trackers-list where we can store our controller and template for the view. We will also abstract our code to access the SQLite DB into an Ionic factory. We will implement the following methods: initDB: This will initialize or create a table for this entity if it does not exist. getAllTrackers: This will get all the trackers list rows from the created table. addNewTracker: This is a method to insert a new row for a new tracker into the table. deleteTracker: This is a method to delete a specific tracker using its ID. getTracker: This will get a specific tracker from the cached list using an ID to display anywhere. We will be injecting the $cordovaSQLite service into our factory to interact with our SQLite DB. We can open an existing DB or create a new DB using the command $cordovaSQLite.openDB("myApp.db"). We have to store the object reference returned from this method call, so we will store it in a module-level variable called db. We have to pass this object reference to our future $cordovaSQLite service calls. $cordovaSQLite has a handful of methods to provide varying features: openDB: This is a method to establish a connection to an existing DB or create a new DB. execute: This is a method to execute a single SQL command query. insertCollection: This is a method to insert bulk values. nestedExecute: This is a method to run nested queries. deleteDB: This is a method to delete a particular DB. We will see the usage of openDB and execute in this post. In our factory, we will create a standard method runQuery to adhere to DRY (Don't Repeat Yourself) principles. The code for the runQuery function is as follows: function runQuery(query,dataParams,successCb,errorCb) { $ionicPlatform.ready(function() { $cordovaSQLite.execute(db, query,dataParams).then(function(res) { successCb(res); }, function (err) { errorCb(err); }); }.bind(this)); } In the preceding code, we pass the query as a string, dataParams (dynamic query parameters) as an array, and successCb/errorCb as callback functions. We should always ensure that any Cordova plugin code is called only after the Cordova ready event has fired, which is ensured by the $ionicPlatform.ready() method. We then call the execute method of the $cordovaSQLite service, passing the 'db' object reference, query, and dataParams as arguments. The method returns a promise to which we register callbacks using the '.then' method. We pass the results or error using the success callback or error callback. Now, we will write code for each of the methods to initialize the DB, insert a new row, fetch all rows, and then delete a row.
initDB Method: function initDB() { db = $cordovaSQLite.openDB("myapp.db"); var query = "CREATE TABLE IF NOT EXISTS trackers_list (id integer primary key autoincrement, name string)"; runQuery(query,[],function(res) { console.log("table created "); }, function (err) { console.log(err); }); } In the preceding code, the openDB method is used to establish a connection with an existing DB or create a new DB. Then, we run the query to create a new table called 'trackers_list' if it does not exist. We define an id column with the integer primary key autoincrement properties, along with a name column of type string. addNewTracker Method: function addNewTracker(name) { var deferred = $q.defer(); var query = "INSERT INTO trackers_list (name) VALUES (?)"; runQuery(query,[name],function(response){ //Success Callback console.log(response); deferred.resolve(response); },function(error){ //Error Callback console.log(error); deferred.reject(error); }); return deferred.promise; } In the preceding code, we take 'name' as an argument, which will be passed into the insert query. We write the insert query and add a new row to the trackers_list table, where the ID will be auto-generated. We pass dynamic query parameters using the '?' character in our query string, which will be replaced by elements in the dataParams array passed as the second argument to the runQuery method. We also use the $q library to return a promise from our factory methods so that controllers can manage asynchronous calls. getAllTrackers Method: This method is the same as the addNewTracker method, only without the name parameter, and it has the following query: var query = "SELECT * from trackers_list"; This method will return a promise, which when resolved will give the response from the $cordovaSQLite method. The response object will have the following structure: { insertId: <specific_id>, rows: {item: function, length: <total_no_of_rows>} rowsAffected: 0 } The response object has the properties insertId, representing the new ID generated for the row, rowsAffected, giving the number of rows affected by the query, and a rows object with an item method property, to which we can pass the index of a row to retrieve it. We will write the following code in the controller to convert the response.rows object into an iterable array of rows to be displayed using the ng-repeat directive: for(var i=0;i<response.rows.length;i++) { $scope.trackersList.push({ id:response.rows.item(i).id, name:response.rows.item(i).name }); } The code in the template to display the list of trackers would be as follows: <ion-item ui-sref="tracker({id:tracker.id})" class="item-icon-right" ng-repeat="tracker in trackersList track by $index"> {{tracker.name}} <ion-delete-button class="ion-minus-circled" ng-click="deleteTracker($index,tracker.id)"> </ion-delete-button> <i class="icon ion-chevron-right"></i> </ion-item> deleteTracker Method: function deleteTracker(id) { var deferred = $q.defer(); var query = "DELETE FROM trackers_list WHERE id = ?"; runQuery(query,[id],function(response){ … [Same Code as addNewTrackerMethod] } The deleteTracker method has the same code as the addNewTracker method, where the only change is in the query and the argument passed. We pass 'id' as the argument to be used in the WHERE clause of the delete query to delete the row with that specific ID. Rest of the Ionic App Code: The rest of the app code has not been discussed in this post because we have already discussed the code that is intended for integration with SQLite.
You can implement your own version of this app or even use this sample code for any other use case. The tracker details view is implemented in the same way, storing data in the tracker_entries table, which uses tracker_id as a foreign key. It also uses this ID in the SELECT query to fetch entries for a specific tracker on its detail view. The exact, functioning code for the complete app developed during this tutorial is available on GitHub. About the author Rahat Khanna is a techno nerd experienced in developing web and mobile apps for many international MNCs and start-ups. He completed his Bachelor of Technology with a specialization in computer science and engineering. During the last 7 years, he has worked for a multinational IT services company and ran his own entrepreneurial venture in his early twenties. He has worked on projects ranging from static HTML websites to scalable web applications and engaging mobile apps. Along with his current job as a senior UI developer at Flipkart, a billion dollar e-commerce firm, he now blogs on the latest technology frameworks on sites such as www.airpair.com, appsonmob.com, and so on, and delivers talks at community events. He has been helping individual developers and start-ups with their Ionic projects to deliver amazing mobile apps.
Setting up Spark

Packt
10 Jun 2016
14 min read
In this article by Alexander Kozlov, author of the book Mastering Scala Machine Learning, we will discuss how to download the pre-built Spark package from http://spark.apache.org/downloads.html, if you haven't done so yet. The latest release of Spark, at the time of writing, is 1.6.1: Figure 3-1: The download site at http://spark.apache.org with recommended selections for this article (For more resources related to this topic, see here.) Alternatively, you can build Spark by downloading the full source distribution from https://github.com/apache/spark: $ git clone https://github.com/apache/spark.git Cloning into 'spark'... remote: Counting objects: 301864, done. ... $ cd spark $ sh ./dev/change-scala-version.sh 2.11 ... $ ./make-distribution.sh --name alex-build-2.6-yarn --skip-java-test --tgz -Pyarn -Phive -Phive-thriftserver -Pscala-2.11 -Phadoop-2.6 ... The command will download the necessary dependencies and create the spark-2.0.0-SNAPSHOT-bin-alex-spark-build-2.6-yarn.tgz file in the Spark directory; the version is 2.0.0, as it is the next release version at the time of writing. In general, you do not want to build from trunk unless you are interested in the latest features. If you want a released version, you can visit the corresponding tag. The full list of available versions is available via the git branch -r command. The spark*.tgz file is all you need to run Spark on any machine that has a Java JRE. The distribution comes with the docs/building-spark.md document, which describes other options for building Spark, including the incremental Scala compiler zinc. Full Scala 2.11 support is in the works for the next Spark 2.0.0 release. Applications Let's consider a few practical examples and libraries in Spark/Scala, starting with the very traditional problem of word counting. Word count Most modern machine learning algorithms require multiple passes over data. If the data fits in the memory of a single machine, the data is readily available and this does not present a performance bottleneck. However, if the data becomes too large to fit into RAM, one has a choice of either dumping pieces of the data on disk (or a database), which is about 100 times slower but has a much larger capacity, or splitting the dataset between multiple machines across the network and transferring the results. While there are still ongoing debates, for most practical systems analysis shows that storing the data over a set of network-connected nodes has a slight advantage over repeatedly storing and reading it from hard disks on a single node, particularly if we can split the workload effectively between multiple CPUs. An average disk has a bandwidth of about 100 MB/sec and transfers with a few ms of latency, depending on the rotation speed and caching. This is about 100 times slower than reading the data from memory, depending on the data size and caching implementation again. A modern data bus can transfer data at over 10 GB/sec. While the network speed still lags behind direct memory access, particularly with standard TCP/IP kernel networking layer overhead, specialized hardware can reach tens of GB/sec, and if run in parallel, it can potentially be as fast as reading from memory. In practice, network-transfer speeds are somewhere between 1 and 10 GB/sec, but still faster than the disk in most practical systems. Thus, we can potentially fit the data into the combined memory of all the cluster nodes and perform iterative machine learning algorithms across a system of them.
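As a back-of-the-envelope illustration of these figures, one full pass over a dataset takes roughly its size divided by the bandwidth of the medium; the dataset size and the exact bandwidth numbers below are assumptions chosen only to make the arithmetic concrete:

# Rough time for one full pass over a hypothetical 1 TB dataset
# (all bandwidth figures are illustrative assumptions)
dataset_gb = 1024.0
bandwidth_gb_per_sec = {'single disk': 0.1, 'network': 5.0, 'memory bus': 10.0}

for medium, bw in sorted(bandwidth_gb_per_sec.items()):
    print('%-12s ~%6.0f seconds per pass' % (medium, dataset_gb / bw))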
One problem with memory, however, is that it does not persist across node failures and reboots. A popular big data framework, Hadoop, made possible with the help of the original Dean/Ghemawat paper (Jeff Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004), uses exactly this disk-layer persistence to guarantee fault tolerance and store intermediate results. A Hadoop MapReduce program would first run a map function on each row of a dataset, emitting one or more key/value pairs. These key/value pairs would then be sorted, grouped, and aggregated by key so that the records with the same key would end up being processed together on the same reducer, which might be running on the same or another node. The reducer applies a reduce function that traverses all the values that were emitted for the same key and aggregates them accordingly. The persistence of intermediate results guarantees that if a reducer fails for one or another reason, the partial computations can be discarded and the reduce computation can be restarted from the checkpoint-saved results. Many simple ETL-like applications traverse the dataset only once with very little information preserved as state from one record to another. For example, one of the traditional applications of MapReduce is word count. The program needs to count the number of occurrences of each word in a document consisting of lines of text. In Scala, the word count is readily expressed as an application of the foldLeft method on a sorted list of words: val lines = scala.io.Source.fromFile("...").getLines.toSeq val counts = lines.flatMap(line => line.split("\\W+")).sorted.   foldLeft(List[(String,Int)]()){ (r,c) =>     r match {       case (key, count) :: tail =>         if (key == c) (c, count+1) :: tail         else (c, 1) :: r       case Nil =>         List((c, 1))   } } If I run this program, the output will be a list of (word, count) tuples. The program splits the lines into words, sorts the words, and then matches each word with the latest entry in the list of (word, count) tuples. The same computation in MapReduce would be expressed as follows: val linesRdd = sc.textFile("hdfs://...") val counts = linesRdd.flatMap(line => line.split("\\W+"))     .map(_.toLowerCase)     .map(word => (word, 1))     .reduceByKey(_+_) counts.collect First, we need to process each line of the text by splitting the line into words and generating (word, 1) pairs. This task is easily parallelized. Then, to parallelize the global count, we need to split the counting part by assigning a task to do the count for a subset of words. In Hadoop, we compute the hash of the word and divide the work based on the value of the hash. Once the map task finds all the entries for a given hash, it can send the key/value pairs to the reducer; the sending part is usually called shuffle in MapReduce vernacular. A reducer waits until it receives all the key/value pairs from all the mappers, combines the values (a partial combine can also happen on the mapper, if possible), and computes the overall aggregate, which in this case is just the sum. A single reducer will see all the values for a given word.
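The map/shuffle/reduce flow described above can also be mimicked with a toy, single-process Python simulation; this only illustrates the data flow, not how Hadoop actually schedules or persists the work:

import re
from collections import defaultdict

lines = ["Happy families are all alike",
         "every unhappy family is unhappy in its own way"]

# Map: emit a (word, 1) pair for every word
mapped = [(w.lower(), 1) for line in lines
          for w in re.split(r'\W+', line) if w]

# Shuffle: group the emitted values by key (in Hadoop this crosses the network)
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: aggregate the values for each key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts['unhappy'])  # 2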
Let's look at the log output of the word count operation in Spark (Spark is very verbose by default; you can manage the verbosity level by modifying the conf/log4j.properties file and replacing INFO with ERROR or FATAL): $ wget http://mirrors.sonic.net/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz $ tar xvf spark-1.6.1-bin-hadoop2.6.tgz $ cd spark-1.6.1-bin-hadoop2.6 $ mkdir leotolstoy $ (cd leotolstoy; wget http://www.gutenberg.org/files/1399/1399-0.txt) $ bin/spark-shell Welcome to       ____              __      / __/__  ___ _____/ /__     _\ \/ _ \/ _ `/ __/  '_/    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1       /_/   Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40) Type in expressions to have them evaluated. Type :help for more information. Spark context available as sc. SQL context available as sqlContext. scala> val linesRdd = sc.textFile("leotolstoy", minPartitions=10) linesRdd: org.apache.spark.rdd.RDD[String] = leotolstoy MapPartitionsRDD[3] at textFile at <console>:27 At this stage, the only thing that has happened is metadata manipulation; Spark has not touched the data itself. Spark estimates the size of the dataset and the number of partitions. By default, this is the number of HDFS blocks, but we can specify the minimum number of partitions explicitly with the minPartitions parameter: scala> val countsRdd = linesRdd.flatMap(line => line.split("\\W+")).      | map(_.toLowerCase).      | map(word => (word, 1)).      | reduceByKey(_+_) countsRdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:31 We just defined another RDD derived from the original linesRdd: scala> countsRdd.collect.filter(_._2 > 99) res3: Array[(String, Int)] = Array((been,1061), (them,841), (found,141), (my,794), (often,105), (table,185), (this,1410), (here,364), (asked,320), (standing,132), ("",13514), (we,592), (myself,140), (is,1454), (carriage,181), (got,277), (won,153), (girl,117), (she,4403), (moment,201), (down,467), (me,1134), (even,355), (come,667), (new,319), (now,872), (upon,207), (sister,115), (veslovsky,110), (letter,125), (women,134), (between,138), (will,461), (almost,124), (thinking,159), (have,1277), (answer,146), (better,231), (men,199), (after,501), (only,654), (suddenly,173), (since,124), (own,359), (best,101), (their,703), (get,304), (end,110), (most,249), (but,3167), (was,5309), (do,846), (keep,107), (having,153), (betsy,111), (had,3857), (before,508), (saw,421), (once,334), (side,163), (ough... Word count over 2 GB of text data (40,291 lines and 353,087 words) took under a second to read, split, and group by words. With extended logging, you could see the following: Spark opens a few ports to communicate with the executors and users Spark UI runs on port 4040 on http://localhost:4040 You can read the file either from local or distributed storage (HDFS, Cassandra, and S3) Spark will connect to Hive if Spark is built with Hive support Spark uses lazy evaluation and executes the pipeline only when necessary or when output is required Spark uses an internal scheduler to split the job into tasks, optimize the execution, and execute the tasks The results are stored into RDDs, which can either be saved or brought into the RAM of the node executing the shell with the collect method The art of parallel performance tuning is to split the workload between different nodes or threads so that the overhead is relatively small and the workload is balanced.
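For readers who prefer Python, roughly the same pipeline can be expressed in PySpark; this is a sketch assuming the pyspark shell, where sc is already defined, and the same leotolstoy directory as above:

import re

lines_rdd = sc.textFile("leotolstoy", minPartitions=10)
counts_rdd = (lines_rdd
              .flatMap(lambda line: re.split(r'\W+', line))
              .map(lambda word: word.lower())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Actions such as take trigger the actual execution, as in the Scala session
print(counts_rdd.filter(lambda kv: kv[1] > 99).take(10))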
Streaming word count Spark supports listening on incoming streams, partitioning them, and computing aggregates close to real time. Currently supported sources are Kafka, Flume, HDFS/S3, Kinesis, Twitter, as well as traditional MQs such as ZeroMQ and MQTT. In Spark, streaming is implemented as micro-batches. Internally, Spark divides input data into micro-batches, usually from subseconds to minutes in size, and performs RDD aggregation operations on these micro-batches. For example, let's extend the Flume example that we covered earlier. We'll need to modify the Flume configuration file to create a Spark polling sink. Instead of HDFS, replace the sink section: # The sink is Spark a1.sinks.k1.type=org.apache.spark.streaming.flume.sink.SparkSink a1.sinks.k1.hostname=localhost a1.sinks.k1.port=4989 Now, instead of writing to HDFS, Flume will wait for Spark to poll for data: object FlumeWordCount {   def main(args: Array[String]) {     // Create the context with a 2 second batch size     val sparkConf = new SparkConf().setMaster("local[2]")       .setAppName("FlumeWordCount")     val ssc = new StreamingContext(sparkConf, Seconds(2))     ssc.checkpoint("/tmp/flume_check")     val hostPort=args(0).split(":")     System.out.println("Opening a sink at host: [" + hostPort(0) +       "] port: [" + hostPort(1).toInt + "]")     val lines = FlumeUtils.createPollingStream(ssc, hostPort(0),       hostPort(1).toInt, StorageLevel.MEMORY_ONLY)     val words = lines       .map(e => new String(e.event.getBody.array)).         map(_.toLowerCase).flatMap(_.split("\\W+"))       .map(word => (word, 1L))       .reduceByKeyAndWindow(_+_, _-_, Seconds(6),         Seconds(2)).print     ssc.start()     ssc.awaitTermination()   } } To run the program, start the Flume agent in one window: $ ./bin/flume-ng agent -Dflume.log.level=DEBUG,console -n a1 -f ../chapter03/conf/flume-spark.conf ... Then run the FlumeWordCount object in another: $ cd ../chapter03 $ sbt "run-main org.akozlov.chapter03.FlumeWordCount localhost:4989" ... Now, any text typed to the netcat connection will be split into words and counted every two seconds for a six-second sliding window: $ echo "Happy families are all alike; every unhappy family is unhappy in its own way" | nc localhost 4987 ... ------------------------------------------- Time: 1464161488000 ms ------------------------------------------- (are,1) (is,1) (its,1) (family,1) (families,1) (alike,1) (own,1) (happy,1) (unhappy,2) (every,1) ...   ------------------------------------------- Time: 1464161490000 ms ------------------------------------------- (are,1) (is,1) (its,1) (family,1) (families,1) (alike,1) (own,1) (happy,1) (unhappy,2) (every,1) ... Spark/Scala allows you to seamlessly switch between streaming sources.
For example, the same program for the Kafka publish/subscribe topic model looks similar to the following: object KafkaWordCount {   def main(args: Array[String]) {     // Create the context with a 2 second batch size     val sparkConf = new SparkConf().setMaster("local[2]")       .setAppName("KafkaWordCount")     val ssc = new StreamingContext(sparkConf, Seconds(2))     ssc.checkpoint("/tmp/kafka_check")     System.out.println("Opening a Kafka consumer at zk:       [" + args(0) + "] for group group-1 and topic example")     val lines = KafkaUtils.createStream(ssc, args(0), "group-1",       Map("example" -> 1), StorageLevel.MEMORY_ONLY)     val words = lines       .flatMap(_._2.toLowerCase.split("\\W+"))       .map(word => (word, 1L))       .reduceByKeyAndWindow(_+_, _-_, Seconds(6),         Seconds(2)).print     ssc.start()     ssc.awaitTermination()   } } To start the Kafka broker, first download the latest binary distribution and start ZooKeeper. ZooKeeper is a distributed-services coordinator and is required by Kafka even in a single-node deployment: $ wget http://apache.cs.utah.edu/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz ... $ tar xf kafka_2.11-0.9.0.1.tgz $ bin/zookeeper-server-start.sh config/zookeeper.properties ... In another window, start the Kafka server: $ bin/kafka-server-start.sh config/server.properties ... Run the KafkaWordCount object: $ sbt "run-main org.akozlov.chapter03.KafkaWordCount localhost:2181" ... Now, publishing the stream of words into the Kafka topic will produce the window counts: $ echo "Happy families are all alike; every unhappy family is unhappy in its own way" | ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic example ... $ sbt "run-main org.akozlov.chapter03.FlumeWordCount localhost:4989 ... ------------------------------------------- Time: 1464162712000 ms ------------------------------------------- (are,1) (is,1) (its,1) (family,1) (families,1) (alike,1) (own,1) (happy,1) (unhappy,2) (every,1) As you can see, the program outputs results every two seconds. Spark streaming is sometimes called micro-batch processing. Streaming has many other applications (and frameworks), but this is too big a topic to be covered entirely here and needs to be treated separately. Spark SQL and DataFrame DataFrame was a relatively recent addition to Spark, introduced in version 1.3, allowing one to use the standard SQL language for data analysis. SQL is really great for simple exploratory analysis and data aggregations. According to the latest poll results, about 70% of Spark users use DataFrame. Although DataFrame recently became the most popular framework for working with tabular data, it is a relatively heavyweight object. The pipelines that use DataFrames may execute much slower than the ones that are based on Scala's vector or LabeledPoint, which will be discussed in the next chapter. The evidence from different developers is that response times can run to tens or hundreds of milliseconds, depending on the query, compared with sub-millisecond on simpler objects.
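As a rough Python-side illustration of the DataFrame API (a sketch assuming the pyspark shell, where sqlContext is available; the parquet path is made up), a table can be registered and queried with plain SQL:

# Load a DataFrame and expose it to SQL; registerTempTable is the Spark 1.x API
df = sqlContext.read.parquet("hdfs:///data/kddcup_parquet")
df.registerTempTable("kddcup")
sqlContext.sql(
    "select min(duration), max(duration), avg(duration) from kddcup"
).show()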
Spark implements its own shell for SQL, which can be invoked in addition to the standard Scala REPL shell: ./bin/spark-sql can be used to access the existing Hive/Impala or relational DB tables: $ ./bin/spark-sql … spark-sql> select min(duration), max(duration), avg(duration) from kddcup; … 0  58329  48.34243046395876 Time taken: 11.073 seconds, Fetched 1 row(s) In the standard Spark REPL, the same query can be performed by running the following command: $ ./bin/spark-shell … scala> val df = sqlContext.sql("select min(duration), max(duration), avg(duration) from kddcup") 16/05/12 13:35:34 INFO parse.ParseDriver: Parsing command: select min(duration), max(duration), avg(duration) from alex.kddcup_parquet 16/05/12 13:35:34 INFO parse.ParseDriver: Parse Completed df: org.apache.spark.sql.DataFrame = [_c0: bigint, _c1: bigint, _c2: double] scala> df.collect.foreach(println) … 16/05/12 13:36:32 INFO scheduler.DAGScheduler: Job 2 finished: collect at <console>:22, took 4.593210 s [0,58329,48.34243046395876] Summary In this article, we discussed Spark and Hadoop and their relationship with Scala. We also discussed functional programming at a very high level. We then considered a classic word count example and its implementation in Scala and Spark. Resources for Article: Further resources on this subject: Holistic View on Spark [article] Exploring Scala Performance [article] Getting Started with Apache Hadoop and Apache Spark [article]
Clustering Methods

Packt
09 Jun 2016
18 min read
In this article by Magnus Vilhelm Persson, author of the book Mastering Python Data Analysis, we will see how, when data comprises several separate distributions, we can find and characterize them. In this article, we will look at some ways to identify clusters in data. Groups of points with similar characteristics form clusters. There are many different algorithms and methods to achieve this, each with its good and bad points. We want to detect multiple separate distributions in the data; for each point, we determine the degree of association (or similarity) with another point or cluster. The degree of association needs to be high if they belong in a cluster together or low if they do not. This can, of course, just as previously, be a one-dimensional or a multidimensional problem. One of the inherent difficulties of cluster finding is determining how many clusters there are in the data. Various approaches to define this exist: some where the user needs to input the number of clusters and then the algorithm finds which points belong to which cluster, and some where the starting assumption is that every point is a cluster and then two nearby clusters are combined iteratively on a trial basis to see if they belong together. In this article, we will cover the following topics: A short introduction to cluster finding, reminding you of the general problem and an algorithm to solve it Analysis of a dataset in the context of cluster finding, the cholera outbreak in central London in 1854 By simple zeroth-order analysis, calculating the centroid of the whole dataset By finding the closest water pump for each recorded cholera-related death Applying the K-means nearest neighbor algorithm for cluster finding to the data and identifying two separate distributions (For more resources related to this topic, see here.) The algorithms and methods covered here are focused on those available in SciPy. Start a new Notebook, and put in the default imports. Perhaps you want to change to interactive Notebook plotting to try it out a bit more. For this article, we are adding the following specific imports. The ones related to clustering are from SciPy, while later on we will need some packages to transform astronomical coordinates. These packages are all preinstalled in the Anaconda Python 3 distribution and have been tested there: import scipy.cluster.hierarchy as hac import scipy.cluster.vq as vq Introduction to cluster finding There are many different algorithms for cluster identification. Many of them try to solve a specific problem in the best way. Therefore, the specific algorithm that you want to use might depend on the problem you are trying to solve and also on what algorithms are available in the specific package that you are using. Some of the first clustering algorithms consisted of simply finding the centroid positions that minimize the distances to all the points in each cluster. The points in each cluster are closer to that centroid than to other cluster centroids. As might be obvious at this point, the hardest part of this is figuring out how many clusters there are. If we can determine this, it is fairly straightforward to try various ways of moving the cluster centroid around, calculate the distance to each point, and then figure out where the cluster centroids are. There are also obvious situations where this might not be the best solution, for example, if you have two very elongated clusters next to each other.
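Before formalizing the distance, here is a minimal NumPy sketch (with made-up data and two fixed centroid guesses) of the assign-each-point-to-its-nearest-centroid idea described above:

import numpy as np

# Two made-up blobs of points and two fixed centroid guesses
points = np.random.randn(100, 2)
points[50:] += 5.0
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0]])

# Euclidean distance from every point to every centroid, then pick the nearest
dists = np.sqrt(((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
labels = dists.argmin(axis=1)
print(np.bincount(labels))  # roughly 50 points assigned to each centroid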
Commonly, the distance is the Euclidean distance:

d(p, \mu_k) = \| p - \mu_k \|_2 = \sqrt{\sum_j (p_j - \mu_{k,j})^2}

Here, p runs over the points \{p_1, p_2, \ldots, p_{N-1}, p_N\} in cluster C_k, that is, p \in C_k, and the distances are calculated from the cluster centroid, \mu_k. We have to find the cluster centroids that minimize the sum of the absolute distances to the points:

\underset{\mu_1, \ldots, \mu_K}{\arg\min} \sum_{k=1}^{K} \sum_{p \in C_k} \| p - \mu_k \|_2

In this first example, we shall work with fixed cluster centroids. Starting out simple – John Snow on Cholera In 1854, there was an outbreak of cholera in north-western London, in the neighborhood around Broad Street. The leading theories at the time claimed that cholera spread, just like it was believed that the plague spread: through foul, bad air. John Snow, a physician at the time, hypothesized that cholera spread through drinking water. During the outbreak, John tracked the deaths and drew them on a map of the area. Through his analysis, he concluded that most of the cases were centered on the Broad Street water pump. Rumors say that he then removed the handle of the water pump, thus stopping the epidemic. Today, we know that cholera is usually transmitted through contaminated food or water, thus confirming John's hypothesis. We will do a short but instructive reanalysis of John Snow's data. The data comes from the public data archives of The National Center for Geographic Information and Analysis (http://www.ncgia.ucsb.edu/ and http://www.ncgia.ucsb.edu/pubs/data.php). A cleaned-up map and copy of the data files, along with an example of geospatial information analysis of the data, can also be found at https://www.udel.edu/johnmack/frec682/cholera/cholera2.html. A wealth of information about physician and scientist John Snow's life and works can be found at http://johnsnow.matrix.msu.edu. To start the analysis, we read the data into a Pandas DataFrame; the data is already formatted as CSV files readable by Pandas: deaths = pd.read_csv('data/cholera_deaths.txt') pumps = pd.read_csv('data/cholera_pumps.txt') Each file contains two columns, one for X coordinates and one for Y coordinates. Let's check what it looks like: deaths.head() pumps.head() With this information, we can now plot all the pumps and deaths to visualize the data: plt.figure(figsize=(4,3.5)) plt.plot(deaths['X'], deaths['Y'], marker='o', lw=0, mew=1, mec='0.9', ms=6) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6) plt.axis('equal') plt.xlim((4.0,22.0)); plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title("John Snow's Cholera") It is fairly easy to see that the pump in the middle is important. As a first data exploration, we will simply calculate the mean centroid of the distribution and plot that in the figure as an ellipse.
We will calculate the mean and standard deviation along the x and y axes as the centroid position: fig = plt.figure(figsize=(4,3.5)) ax = fig.add_subplot(111) plt.plot(deaths['X'], deaths['Y'], marker='o', lw=0, mew=1, mec='0.9', ms=6) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6) from matplotlib.patches import Ellipse ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), width=deaths['X'].std(), height=deaths['Y'].std(), zorder=32, fc='None', ec='IndianRed', lw=2) ax.add_artist(ellipse) plt.plot(deaths['X'].mean(), deaths['Y'].mean(), '.', ms=10, mec='IndianRed', zorder=32) for i in pumps.index: plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), xytext=(-15,6), textcoords='offset points') plt.axis('equal') plt.xlim((4.0,22.5)) plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title("John Snow's Cholera") Here, we also plotted the pump index, which we can get from the DataFrame with the pumps.index method. The next step in the analysis is to see which pump is the closest to each point. We do this by calculating the distance from all pumps to all points. Then we want to figure out which pump is the closest for each point. We save the closest pump to each point in a separate column of the deaths DataFrame. With this dataset, the for-loop runs fairly quickly. However, the DataFrame subtract method chained with the sum() and idxmin() methods takes a few seconds. I strongly encourage you to play around with various ways to speed this up. We also use the .apply() method of DataFrame to square and square root the values. The simple brute-force first attempt at this took over a minute to run. The built-in functions and methods help a lot: deaths_tmp = deaths[['X','Y']].as_matrix() idx_arr = np.array([], dtype='int') for i in range(len(deaths)): idx_arr = np.append(idx_arr, (pumps.subtract(deaths_tmp[i])).apply(lambda x:x**2).sum(axis=1).apply(lambda x:x**0.5).idxmin()) deaths['C'] = idx_arr Quickly check whether everything seems fine by printing out the top rows of the table: deaths.head() Now we want to visualize what we have. With colors, we can show which water pump we associate each death with. To do this, we use a colormap, in this case, the jet colormap. By calling the colormap with a value between 0 and 1, it returns a color; thus we give it the pump indexes and then divide by the total number of pumps, 12 in our case: fig = plt.figure(figsize=(4,3.5)) ax = fig.add_subplot(111) np.unique(deaths['C'].values) plt.scatter(deaths['X'].as_matrix(), deaths['Y'].as_matrix(), color=plt.cm.jet(deaths['C']/12.), marker='o', lw=0.5, edgecolors='0.5', s=20) plt.plot(pumps['X'],pumps['Y'], marker='s', lw=0, mew=1, mec='0.9', color='0.3', ms=6) for i in pumps.index: plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), xytext=(-15,6), textcoords='offset points', ha='right') ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), width=deaths['X'].std(), height=deaths['Y'].std(), zorder=32, fc='None', ec='IndianRed', lw=2) ax.add_artist(ellipse) plt.axis('equal') plt.xlim((4.0,22.5)) plt.xlabel('X-coordinate') plt.ylabel('Y-coordinate') plt.title("John Snow's Cholera") The majority of deaths are closest to the pump in the center. This pump is located on Broad Street. Now, remember that we have used fixed positions for the cluster centroids. In this case, we are basically working on the assumption that the water pumps are related to the cholera cases.
Furthermore, the Euclidean distance is not really the real-life distance. People go along roads to get water, and the road there is not necessarily straight. Thus, one would have to map out the streets and calculate the distance to each pump from that. Even so, already at this level, it is clear that the center pump has something to do with the cholera cases. How would you account for the different distances? To calculate the distance, you would do what is called cost analysis (cf. when you ask your sat-nav for directions to a place). There are many different ways of doing cost analysis, and it also relates to the problem of finding the correct way through a maze. In addition to these things, we do not have any data in the time domain, that is, the cholera would possibly spread to other pumps with time, and thus the outbreak might have started at the Broad Street pump and spread to other nearby pumps. Without time data, it is difficult to figure out what happened. This is the general approach to cluster finding. The coordinates might be attributes instead (the length and weight of dogs, for example), and the location of the cluster centroid something that we would iteratively move around until we find the best position. K-means clustering The K-means algorithm is also referred to as vector quantization. What the algorithm does is find the cluster (centroid) positions that minimize the distances to all points in the cluster. This is done iteratively; the problem with the algorithm is that it can be a bit greedy, meaning that it will find the nearest minimum quickly. This is generally solved with some kind of basin-hopping approach, where the nearest minimum found is randomly perturbed and the algorithm restarted. Due to this fact, the algorithm is dependent on good initial guesses as input. Suicide rate versus GDP versus absolute latitude We will analyze the data of suicide rates versus GDP versus absolute latitude, or Degrees From Equator (DFE), for clusters. Our hypothesis from the visual inspection was that there were at least two distinct clusters, one with higher suicide rate, GDP, and absolute latitude and one with lower. Here, the Hierarchical Data Format (HDF) file is now read in as a DataFrame. This time, we want to discard all the rows where one or more column entries are NaN or empty. Thus, we use the appropriate DataFrame method for this: TABLE_FILE = 'data/data_ch4.h5' d2 = pd.read_hdf(TABLE_FILE) d2 = d2.dropna() Next, while the DataFrame is a very handy format, which we will utilize later on, the cluster algorithms in SciPy do not handle Pandas data types natively. Thus, we transfer the data to a NumPy array: rates = d2[['DFE','GDP_CD','Both']].as_matrix().astype('float') Next, to recap, we visualise the data with one histogram of the GDP and one scatter plot of all the data. We do this to aid us in the initial guesses of the cluster centroid positions: plt.subplots(1, 2, figsize=(8,3.5)) plt.subplot(121) plt.hist(rates.T[1], bins=20,color='SteelBlue') plt.xticks(rotation=45, ha='right') plt.yscale('log') plt.xlabel('GDP') plt.ylabel('Counts') plt.subplot(122) plt.scatter(rates.T[0], rates.T[2], s=2e5*rates.T[1]/rates.T[1].max(), color='SteelBlue', edgecolors='0.3'); plt.xlabel("Absolute Latitude (Degrees, 'DFE')") plt.ylabel("Suicide Rate (per 100')") plt.subplots_adjust(wspace=0.25); The scatter plot shows the GDP as size. The function to run the k-means clustering takes a special kind of normalized input.
The data arrays (columns) have to be normalized by the standard deviation of the array. Although this is straightforward, there is a function included in the module called whiten. It will scale the data with the standard deviation: w = vq.whiten(rates) To show what it does to the data, we plot the same plots as we did previously, but with the output from the whiten function: plt.subplots(1, 2, figsize=(8,3.5)) plt.subplot(121) plt.hist(w[:,1], bins=20, color='SteelBlue') plt.yscale('log') plt.subplot(122) plt.scatter(w.T[0], w.T[2], s=2e5*w.T[1]/w.T[1].max(), color='SteelBlue', edgecolors='0.3') plt.xticks(rotation=45, ha='right'); As you can see, all the data is scaled from the previous figure. However, as mentioned, the scaling is just the standard deviation. So let's calculate the scaling and save it to the sc variable: sc = rates.std(axis=0) Now we are ready to estimate the initial guesses for the cluster centroids. Reading off the first plot of the data, we guess the first centroid to be at 20 DFE, 20,000 GDP, and 10 suicides and the second at 45 DFE, 100,000 GDP, and 15 suicides. We put this in an array and scale it with our scale parameter to the same scale as the output from the whiten function. This is then sent to the kmeans2 function of SciPy: init_guess = np.array([[20,20E3,10],[45,100E3,15]]) init_guess /= sc z2_cb, z2_lbl = vq.kmeans2(w, init_guess, minit='matrix', iter=500) There is another function, kmeans (without the 2), which is a less complex version and does not stop iterating when it reaches a local minimum. It stops when the changes between two iterations go below some level. Thus, the standard K-means algorithm is represented in SciPy by the kmeans2 function. The function outputs the centroids' scaled positions (here z2_cb) and a lookup table (z2_lbl) telling us which row belongs to which centroid. To get the centroid positions in units we understand, we simply multiply by our scaling value: z2_cb_sc = z2_cb * sc At this point, we can plot the results. The following section is rather long and contains many different parts, so we will go through them section by section. However, the code should be run in one cell of the Notebook: # K-means clustering figure START plt.figure(figsize=(6,4)) plt.scatter(z2_cb_sc[0,0], z2_cb_sc[0,2], s=5e2*z2_cb_sc[0,1]/rates.T[1].max(), marker='+', color='k', edgecolors='k', lw=2, zorder=10, alpha=0.7); plt.scatter(z2_cb_sc[1,0], z2_cb_sc[1,2], s=5e2*z2_cb_sc[1,1]/rates.T[1].max(), marker='+', color='k', edgecolors='k', lw=3, zorder=10, alpha=0.7); The first steps are quite simple: we set up the figure size and plot the points of the cluster centroids. We hypothesized about two clusters; thus, we plot them with two different calls to plt.scatter. Here, z2_cb_sc[1,0] gets the second cluster's x coordinate (DFE); then switching the final 0 for a 2 gives us the y coordinate (rate). We set the size of the marker s-parameter to scale with the value of the third data axis, the GDP. We also do this further down for the data, just as in previous plots, so that it is easier to compare and differentiate the clusters. The zorder keyword gives the order in depth of the elements that are plotted; a high zorder will put them on top of everything else and a negative zorder will send them to the back.
s0 = abs(z2_lbl==0).astype('bool') s1 = abs(z2_lbl==1).astype('bool') pattern1 = 5*'x' pattern2 = 4*'/' plt.scatter(w.T[0][s0]*sc[0], w.T[2][s0]*sc[2], s=5e2*rates.T[1][s0]/rates.T[1].max(), lw=1, hatch=pattern1, edgecolors='0.3', color=plt.cm.Blues_r( rates.T[1][s0]/rates.T[1].max())); plt.scatter(rates.T[0][s1], rates.T[2][s1], s=5e2*rates.T[1][s1]/rates.T[1].max(), lw=1, hatch=pattern2, edgecolors='0.4', marker='s', color=plt.cm.Reds_r( rates.T[1][s1]/rates.T[1].max()+0.4)) In this section, we plot the points of the clusters. First, we get the selection (Boolean) arrays. They are simply found by setting all indexes that refer to cluster 0 to True and all else to False; this gives us the Boolean array for cluster 0 (the first cluster). The second Boolean array is matched for the second cluster (cluster 1). Next, we define the hatch pattern for the scatter plot markers, which we later give as input to the plotting function. The multiplier for the hatch pattern gives the density of the pattern. The scatter plots for the points are created in a similar fashion to the centroids, except that the markers are a bit more complex. They are both color-coded, like in the previous example with cholera deaths, but in a gradient instead of the exact same colors for all points. The gradient is defined by the GDP, which also defines the size of the points. The x and y data sent to the plot is different between the clusters, but they access the same data in the end because we multiply by our scaling factor. p1 = plt.scatter([],[], hatch='None', s=20E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p2 = plt.scatter([],[], hatch='None', s=40E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p3 = plt.scatter([],[], hatch='None', s=60E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) p4 = plt.scatter([],[], hatch='None', s=80E3*5e2/rates.T[1].max(), color='k', edgecolors='None',) labels = ["20'", "40'", "60'", ">80'"] plt.legend([p1, p2, p3, p4], labels, ncol=1, frameon=True, #fontsize=12, handlelength=1, loc=1, borderpad=0.75,labelspacing=0.75, handletextpad=0.75, title='GDP', scatterpoints=1.5) plt.ylim((-4,40)) plt.xlim((-4,80)) plt.title('K-means clustering') plt.xlabel("Absolute Latitude (Degrees, 'DFE')") plt.ylabel('Suicide Rate (per 100 000)'); The last tweak to the plot is made by creating a custom legend. We want to show different sizes of the points and what GDP they correspond to. As there is a continuous gradient from low to high, we cannot use the plotted points. Thus we create our own, but leave the x and y input coordinates as empty lists. This will not show anything in the plot, but we can use them to register entries in the legend. The various tweaks to the legend function control different aspects of the legend layout. I encourage you to experiment with it to see what happens: As for the final analysis, two different clusters are identified. Just as in our previous hypothesis, there is a cluster with a clear linear trend with relatively higher GDP, which is also located at higher absolute latitude. Although the identification is rather weak, it is clear that the two groups are separated. Countries with low GDP are clustered closer to the equator. What happens when you add more clusters? Try to add a cluster for the low-DFE, high-rate countries, visualize it, and think about what this could mean for the conclusion(s); a minimal sketch of this exercise follows after the summary. Summary In this article, we identified clusters using methods such as finding the centroid positions and K-means clustering.
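Returning to the exercise suggested just before the summary, a minimal sketch of adding a third cluster could look like the following; it reuses w, sc, and vq from the listings above, and the third initial guess (low DFE, high rate) is simply a made-up starting point:

# Three initial centroid guesses: the two from before plus a hypothetical
# low-DFE, high-rate cluster; values are in the original (unscaled) units
init_guess3 = np.array([[20, 20E3, 10],
                        [45, 100E3, 15],
                        [15, 10E3, 25]])
init_guess3 = init_guess3 / sc
z3_cb, z3_lbl = vq.kmeans2(w, init_guess3, minit='matrix', iter=500)
z3_cb_sc = z3_cb * sc  # centroid positions back in interpretable units
print(z3_cb_sc)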
For more information about Python Data Analysis, refer to the following books by Packt Publishing: Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/python-data-analysis) Getting Started with Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/getting-started-python-data-analysis)   Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Basics of Jupyter Notebook and Python [article] Scientific Computing APIs for Python [article]