
How-To Tutorials - Data


Predicting Sports Winners with Decision Trees and pandas

Packt
12 Aug 2015
6 min read
In this article by Robert Craig Layton, author of Learning Data Mining with Python, we will look at predicting the winner of games of the National Basketball Association (NBA) using a different type of classification algorithm: decision trees.

Collecting the data

The data we will be using is the match history data for the NBA for the 2013-2014 season. The Basketball-Reference.com website contains a significant number of resources and statistics collected from the NBA and other leagues. Perform the following steps to download the dataset:

1. Navigate to http://www.basketball-reference.com/leagues/NBA_2014_games.html in your web browser.
2. Click on the Export button next to the Regular Season heading.
3. Download the file to your data folder (and make a note of the path).

This will download a CSV file containing the results of the 1,230 games of the NBA regular season. We will load the file with the pandas library, which is an incredibly useful library for manipulating data. Python also contains a built-in library called csv that supports reading and writing CSV files, but we will use pandas instead because it provides more powerful functions for working with datasets. For this article, you will need to install pandas. The easiest way to do that is to use pip3, which you may previously have used to install scikit-learn:

```
$ pip3 install pandas
```

Using pandas to load the dataset

We can load the dataset using the read_csv function in pandas as follows:

```python
import pandas as pd
dataset = pd.read_csv(data_filename)
```

The result of this is a data frame, a data structure used by pandas. The pandas.read_csv function has parameters to fix some of the problems in the data, such as missing headings, which we can specify when loading the file:

```python
dataset = pd.read_csv(data_filename, parse_dates=["Date"], skiprows=[0,])
dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts",
                   "Home Team", "HomePts", "OT?", "Notes"]
```

We can now view a sample of the data frame:

```python
dataset.ix[:5]
```

Extracting new features

We extract our classes: 1 for a home win and 0 for a visitor win. We can specify this using the following code to extract those wins into a NumPy array:

```python
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values
```

The first two new features we want to create indicate whether each of the two teams won their previous game. This roughly approximates which team is currently playing well. We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether each team won the last time we saw it:

```python
from collections import defaultdict
won_last = defaultdict(int)
```

We can then iterate over all the rows and update the current row with each team's last result (win or loss):

```python
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    dataset.ix[index] = row
```

Within the same loop, we then update our dictionary with each team's result from this row, ready for the next time we see these teams:

```python
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
```

Decision trees

Decision trees are a class of classification algorithm that works like a flow chart: a sequence of nodes, where the values of a sample are used to decide which node to move to next.
We can use the DecisionTreeClassifier class to create a decision tree:

```python
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)
```

We now need to extract the dataset from our pandas data frame in order to use it with our scikit-learn classifier. We do this by specifying the columns we wish to use and using the values attribute of a view of the data frame:

```python
X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values
```

Decision trees are estimators and therefore have fit and predict methods. We can also use the cross_val_score method as before to get the average score:

```python
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
```

This scores 56.1%: slightly better than choosing randomly!

Predicting sports outcomes

We now have a method for testing how accurate our models are, using the cross_val_score method, which allows us to try new features. For the first new feature, we will create a feature that tells us whether the home team is generally better than the visitors, by checking whether they ranked higher in the previous season. To obtain the data, perform the following steps:

1. Head to http://www.basketball-reference.com/leagues/NBA_2013_standings.html.
2. Scroll down to Expanded Standings. This gives us a single list for the entire league.
3. Click on the Export link to the right of this heading.
4. Save the download in your data folder.

In your IPython Notebook, enter the following into a new cell. You'll need to ensure that the file was saved into the location pointed to by the data_folder variable:

```python
standings_filename = os.path.join(data_folder,
    "leagues_NBA_2013_standings_expanded-standings.csv")
standings = pd.read_csv(standings_filename, skiprows=[0, 1])
```

We then iterate over the rows and compare the teams' standings:

```python
dataset["HomeTeamRanksHigher"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
```

Between 2013 and 2014, one team was renamed, so we map it back to its previous name:

```python
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
```

Now, we can get the rankings for each team. We then compare them and update the feature in the row:

```python
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    dataset.ix[index] = row
```

Next, we use the cross_val_score function to test the result. First, we extract the dataset as before:

```python
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin",
                        "HomeTeamRanksHigher"]].values
```

Then, we create a new DecisionTreeClassifier and run the evaluation:

```python
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
```

This now scores 60.3%, even better than our previous result.

Unleash the full power of Python machine learning with our 'Learning Data Mining with Python' book.
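For readers who want to run the last-win pipeline end to end without downloading the CSV, here is a minimal, self-contained sketch. The tiny DataFrame, the team names, and the cv=2 setting are our own illustration, not the article's data; the imports follow the current scikit-learn layout (sklearn.model_selection), and new columns are assigned with .loc rather than the older .ix accessor. With such a toy sample the reported accuracy is meaningless; the point is only the shape of the pipeline.

```python
# Minimal sketch of the "last win" feature pipeline on a toy DataFrame.
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hand-made stand-in for the downloaded match history (team names invented).
dataset = pd.DataFrame({
    "Visitor Team": ["Bulls", "Lakers", "Bulls", "Heat", "Lakers", "Heat"],
    "VisitorPts":   [90,      101,      88,      95,     110,      97],
    "Home Team":    ["Heat",  "Bulls",  "Lakers", "Bulls", "Heat",  "Bulls"],
    "HomePts":      [95,      99,       92,      101,    100,      104],
})

# Class to predict: did the home team win?
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values

# Build the "did each team win its previous game?" features.
won_last = defaultdict(int)
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
for index, row in dataset.iterrows():
    home_team, visitor_team = row["Home Team"], row["Visitor Team"]
    dataset.loc[index, "HomeLastWin"] = won_last[home_team]
    dataset.loc[index, "VisitorLastWin"] = won_last[visitor_team]
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = int(not row["HomeWin"])

# Evaluate a decision tree on the two features with cross-validation.
X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy', cv=2)
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
```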


How to do Machine Learning with Python

Packt
12 Aug 2015
5 min read
In this article, Sunila Gollapudi, author of Practical Machine Learning, introduces the key aspects of machine learning semantics and the various toolkit options in Python.

Machine learning has been around for many years now and all of us, at some point in time, have been consumers of machine learning technology. One of the most common examples is facial recognition software, which can identify whether a digital photograph includes a particular person. Today, Facebook users can see automatic suggestions to tag their friends in their uploaded photos. Some cameras and software such as iPhoto also have this capability.

What is learning?

Let's spend some time understanding what the "learning" in machine learning means. We are referring to learning from some kind of observation or data in order to automatically carry out further actions. An intelligent system cannot be built without using learning to get there. The following are some questions that you'll need to answer to define your learning problem:

- What do you want to learn?
- What is the required data and where does it come from?
- Is the complete data available in one shot?
- What is the goal of learning, or why should there be learning at all?

Before we plunge into the internals of each learning type, let's quickly understand a simple predictive analytics process for building and validating models that solve a problem with maximum accuracy (a short scikit-learn sketch of this loop appears at the end of this article):

1. Check that the raw dataset is validated or cleansed, and split it into training, testing, and evaluation datasets.
2. Pick a model that best suits the problem and has an error function that will be minimized over the training set.
3. Make sure this model works on the testing set.
4. Iterate this process with other machine learning algorithms and/or attributes until there is reasonable performance on the test set.

The resulting model can then be applied to new inputs to predict the output. [Diagram in the original article: how learning is applied to predict behavior.]

Key aspects of machine learning semantics

[Concept map in the original article: the key aspects of machine learning semantics.]

Python

Python is one of the most highly adopted programming or scripting languages in the field of machine learning and data science. Python is known for its ease of learning, implementation, and maintenance. It is highly portable and can run on Unix, Windows, and Mac platforms. With the availability of libraries such as Pydoop and SciPy, its relevance in the world of big data analytics has increased tremendously. Some of the key reasons for the popularity of Python in solving machine learning problems are as follows:

- Python is well suited for data analysis.
- It is a versatile scripting language that can be used to write quick, basic scripts to test some functionality, or used in real-time applications leveraging its full-featured toolkits.
- Python comes with mature machine learning packages and can be used in a plug-and-play manner.

Toolkit options in Python

Before we go deeper into the toolkit options available in Python, let's first understand the trade-offs to consider before choosing one:

- What are my performance priorities? Do I need offline or real-time processing implementations?
- How transparent are the toolkits? Can I customize the library myself?
- What is the community status? How fast are bugs fixed, and how good are community support and access to experts?

Broadly, there are three options in Python:
- Python external bindings. These are interfaces to popular packages in the market such as Matlab, R, Octave, and so on. This option works well if you already have existing implementations in those frameworks.
- Python-based toolkits. There are a number of toolkits written in Python that come with a bundle of algorithms.
- Writing your own logic/toolkit.

Python has two core toolkits that are more like building blocks; almost all of the specialized toolkits below use these core ones:

- NumPy: Fast and efficient arrays built in Python
- SciPy: A collection of algorithms for standard operations, built on NumPy

There are also C/C++-based implementations such as LIBLINEAR, LIBSVM, OpenCV, and others. Some of the most popular Python toolkits are as follows:

- nltk: The natural language toolkit, focused on natural language processing (NLP).
- mlpy: A machine learning toolkit with support for some key algorithms such as classification, regression, and clustering, among others.
- PyML: A toolkit focused on support vector machines (SVM).
- PyBrain: A toolkit focused on neural networks and related functions.
- mdp-toolkit: A toolkit focused on data processing, with support for scheduling and parallelizing the processing.
- scikit-learn: One of the most popular toolkits, highly adopted by data scientists in recent years. It supports supervised and unsupervised learning, offers special support for feature selection and visualization, and is maintained by a large, active team known for excellent documentation.
- Pydoop: Python integration with the Hadoop platform. Pydoop and SciPy are heavily used in big data analytics.

Find out how to apply Python machine learning to your working environment with our 'Practical Machine Learning' book.
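To make the build-and-validate process described earlier concrete, here is a minimal scikit-learn sketch of that loop. The iris dataset and the two candidate models are our own choices for illustration; they are not taken from the article.

```python
# A minimal sketch of the train/validate loop: split the data, fit candidate
# models on the training split, and compare them on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)             # minimize error on the training set
    accuracy = model.score(X_test, y_test)  # check performance on the test set
    print("{0}: {1:.1%} test accuracy".format(name, accuracy))
```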


Divide and Conquer – Classification Using Decision Trees and Rules

Packt
11 Aug 2015
17 min read
In this article by Brett Lantz, author of the book Machine Learning with R, Second Edition, we will get a basic understanding of decision trees and rule learners, including the C5.0 decision tree algorithm. This coverage includes mechanisms such as choosing the best split and pruning the decision tree.

While deciding between several job offers with various levels of pay and benefits, many people begin by making lists of pros and cons, and eliminate options based on simple rules. For instance, "if I have to commute for more than an hour, I will be unhappy." Or, "if I make less than $50k, I won't be able to support my family." In this way, the complex and difficult decision of predicting one's future happiness can be reduced to a series of simple decisions.

This article covers decision trees and rule learners, two machine learning methods that also make complex decisions from sets of simple choices. These methods then present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement. By the end of this article, you will learn:

- How trees and rules "greedily" partition data into interesting segments
- The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms

We will begin by examining decision trees, followed by a look at classification rules.

Understanding decision trees

Decision tree learners are powerful classifiers that utilize a tree structure to model the relationships among the features and the potential outcomes. This structure earned its name because it mirrors how a literal tree begins at a wide trunk which, if followed upward, splits into narrower and narrower branches. In much the same way, a decision tree classifier uses a structure of branching decisions that channel examples into a final predicted class value.

To better understand how this works in practice, let's consider a tree that predicts whether a job offer should be accepted. A job offer to be considered begins at the root node, where it is then passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. When a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree.

A great benefit of decision tree algorithms is that the flowchart-like tree structure is not necessarily exclusively for the learner's internal use. After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. It also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in cases where the results need to be shared with others in order to inform future business practices.
With this in mind, some potential uses include:

- Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
- Marketing studies of customer behavior, such as satisfaction or churn, that will be shared with management or advertising agencies
- Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression

Although the previous applications illustrate the value of trees in informing decision processes, this is not to suggest that their utility ends there. In fact, decision trees are perhaps the single most widely used machine learning technique and can be applied to model almost any type of data, often with excellent out-of-the-box performance.

That said, in spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels, or a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree. They may also contribute to the tendency of decision trees to overfit data, though, as we will soon see, even this weakness can be overcome by adjusting some simple parameters.

Divide and conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on, until the process stops when the algorithm determines that the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met.

To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired. Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches is formed. Working down each branch, the algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached. Divide and conquer might stop at a node when:

- All (or nearly all) of the examples at the node have the same class
- There are no remaining features to distinguish among the examples
- The tree has grown to a predefined size limit

To illustrate the tree-building process, let's consider a simple example. Imagine that you work for a Hollywood studio, where your role is to decide whether the studio should move forward with producing the screenplays pitched by promising new authors. After returning from a vacation, your desk is piled high with proposals. Without the time to read each proposal cover-to-cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust.

To build the decision tree, you turn to the studio archives to examine the factors leading to the success and failure of the company's 30 most recent releases. You quickly notice a relationship between the film's estimated shooting budget, the number of A-list celebrities lined up for starring roles, and the level of success.
Excited about this finding, you produce a scatterplot to illustrate the pattern. Using the divide and conquer strategy, we can build a simple decision tree from this data. First, to create the tree's root node, we split on the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars. Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget.

At this point, we have partitioned the data into three groups. The group at the top-left corner of the scatterplot is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, the majority of movies are box office hits, with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops.

If we wanted, we could continue to divide and conquer the data by splitting it based on increasingly specific ranges of budget and celebrity count, until each of the currently misclassified values resides in its own tiny partition and is correctly classified. However, it is not advisable to overfit a decision tree in this way. Though there is nothing to stop us from splitting the data indefinitely, overly specific decisions do not always generalize more broadly. We'll avoid the problem of overfitting by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. This forms the basis of our stopping criterion.

You might have noticed that diagonal lines could have split the data even more cleanly. This is one limitation of the decision tree's knowledge representation, which uses axis-parallel splits. The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision boundaries. For example, a diagonal line could be created by a decision that asks, "is the number of celebrities greater than the estimated budget?" If so, then "it will be a critical success."

Our model for predicting the future success of movies can be represented in a simple tree. To evaluate a script, follow the branches through each decision until the script's success or failure has been predicted. In no time, you will be able to identify the most promising options among the backlog of scripts and get back to more important work, such as writing an Academy Awards acceptance speech.

Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section, you will learn about a popular algorithm for building decision tree models automatically.

The C5.0 decision tree algorithm

There are numerous implementations of decision trees, but one of the most well-known is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm. Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made publicly available, and it has therefore been incorporated into programs such as R.
To further confuse matters, a popular Java-based open source alternative to C4.5, titled J48, is included in R's RWeka package. Because the differences among C5.0, C4.5, and J48 are minor, the principles in this article apply to any of these three methods, and the algorithms should be considered synonymous.

The C5.0 algorithm has become the industry standard for producing decision trees because it does well on most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well but are much easier to understand and deploy. Additionally, the algorithm's weaknesses are relatively minor and can be largely avoided:

Strengths:
- An all-purpose classifier that does well on most problems
- Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
- Excludes unimportant features
- Can be used on both small and large datasets
- Results in a model that can be interpreted without a mathematical background (for relatively small trees)
- More efficient than other complex models

Weaknesses:
- Decision tree models are often biased toward splits on features having a large number of levels
- It is easy to overfit or underfit the model
- Can have trouble modeling some relationships due to reliance on axis-parallel splits
- Small changes in the training data can result in large changes to decision logic
- Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice.

Choosing the best split

The first challenge that a decision tree faces is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.

There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality. The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups.

Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. In mathematical notation, entropy is specified as follows:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)

In this formula, for a given segment of data (S), the term c refers to the number of class levels and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent).
We can calculate the entropy as follows:

```r
> -0.60 * log2(0.60) - 0.40 * log2(0.40)
[1] 0.9709506
```

We can examine the entropy for all possible two-class arrangements. If we know that the proportion of examples in one class is x, then the proportion in the other class is (1 - x). Using the curve() function, we can plot the entropy for all possible values of x:

```r
> curve(-x * log2(x) - (1 - x) * log2(1 - x),
        col = "red", xlab = "x", ylab = "Entropy", lwd = 4)
```

As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero.

To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

InfoGain(F) = Entropy(S_1) - Entropy(S_2)

One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into that partition. This can be stated in a formula as:

Entropy(S_2) = \sum_{i=1}^{n} w_i \, Entropy(P_i)

In simple terms, the total entropy resulting from a split is the sum of the entropies of each of the n partitions, weighted by the proportion of examples falling in that partition (w_i). The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature. If the information gain is zero, there is no reduction in entropy for splitting on the feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means the split results in completely homogeneous groups.

The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. To do so, a common practice is to test various splits that divide the values into groups greater than or less than a numeric threshold. This reduces the numeric feature to a two-level categorical feature, which allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split.

Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are the Gini index, the chi-squared statistic, and the gain ratio. For a review of these (and many more) criteria, refer to Mingers J. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning. 1989; 3:319-342.
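To make the weighting concrete, here is a small worked example; the numbers are invented for illustration and do not come from the text. A parent segment S with 10 examples (6 red, 4 white) is split by a feature F into a pure partition P1 (4 red) and a mixed partition P2 (2 red, 4 white):

```latex
% Worked information-gain calculation with invented numbers.
\begin{align*}
\mathrm{Entropy}(S)   &= -0.6\log_2 0.6 - 0.4\log_2 0.4 \approx 0.971 \\
\mathrm{Entropy}(P_1) &= 0 \quad\text{(pure: all red)} \\
\mathrm{Entropy}(P_2) &= -\tfrac{2}{6}\log_2\tfrac{2}{6} - \tfrac{4}{6}\log_2\tfrac{4}{6} \approx 0.918 \\
\mathrm{Entropy}(S_2) &= \tfrac{4}{10}\cdot 0 + \tfrac{6}{10}\cdot 0.918 \approx 0.551 \\
\mathrm{InfoGain}(F)  &= \mathrm{Entropy}(S_1) - \mathrm{Entropy}(S_2) \approx 0.971 - 0.551 = 0.420
\end{align*}
```

A gain of roughly 0.42 bits out of a possible 0.97 tells us that this split removes a little under half of the disorder in S.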
Pruning the decision tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data.

One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle but important patterns that it would have learned had it grown to a larger size.

An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered.

The implementation details of pruning operations are very technical and beyond the scope of this article. For a comparison of some of the available methods, see Esposito F, Malerba D, Semeraro G. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997; 19:476-491.

One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it takes care of many decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively.

Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see whether they improve performance on the test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options.

Summary

This article covered two classification methods that use so-called "greedy" algorithms to partition the data according to feature values. Decision trees use a divide and conquer strategy to create flowchart-like structures, while rule learners separate and conquer data to identify logical if-else rules. Both methods produce models that can be interpreted without a statistical background. One popular and highly configurable decision tree algorithm is C5.0. We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. This article merely scratched the surface of how trees and rules can be used.


Getting Started with Java Driver for MongoDB

Packt
11 Aug 2015
8 min read
In this article by Francesco Marchioni, author of the book MongoDB for Java Developers, you will learn how to perform all the create/read/update/delete (CRUD) operations that we have so far accomplished using the mongo shell.

Querying data

We will now see how to use the Java API to query for your documents. Querying for documents with MongoDB resembles JDBC queries; the main difference is that the returned object is a com.mongodb.DBCursor class, which is an iterator over the database result. In the following example, we iterate over the javastuff collection, which should contain the documents inserted so far:

```java
package com.packtpub.mongo.chapter2;

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class SampleQuery {

    private final static String HOST = "localhost";
    private final static int PORT = 27017;

    public static void main(String args[]) {
        try {
            MongoClient mongoClient = new MongoClient(HOST, PORT);
            DB db = mongoClient.getDB("sampledb");
            DBCollection coll = db.getCollection("javastuff");
            DBCursor cursor = coll.find();
            try {
                while (cursor.hasNext()) {
                    DBObject object = cursor.next();
                    System.out.println(object);
                }
            } finally {
                cursor.close();
            }
        } catch (Exception e) {
            System.err.println(e.getClass().getName() + ": " + e.getMessage());
        }
    }
}
```

Depending on the documents you have inserted, the output could be something like this:

```
{ "_id" : { "$oid" : "5513f8836c0df1301685315b"} , "name" : "john" , "age" : 35 ,
  "kids" : [ { "name" : "mike"} , { "name" : "faye"}] ,
  "info" : { "email" : "[email protected]" , "phone" : "876-134-667"}}
. . . .
```

Restricting the search to the first document

The find operator executed without any parameter returns the full cursor of a collection, pretty much like a SELECT * query in relational database terms. If you are interested in reading just the first document in the collection, you could use the findOne() operation. This method returns a single document, instead of the DBCursor that the find() operation returns. As you can see, the findOne() operator directly returns a DBObject instead of a com.mongodb.DBCursor class:

```java
DBObject myDoc = coll.findOne();
System.out.println(myDoc);
```

Querying the number of documents in a collection

Another typical construct that you probably know from SQL is the SELECT count(*) query, which is useful for retrieving the number of records in a table. In MongoDB terms, you can get this value simply by invoking getCount on a DBCollection class:

```java
DBCollection coll = db.getCollection("javastuff");
System.out.println(coll.getCount());
```

As an alternative, you could execute the count() method on the DBCursor object:

```java
DBCursor cursor = coll.find();
System.out.println(cursor.count());
```

Eager fetching of data using DBCursor

When find is executed and a DBCursor is returned, you have a pointer to the database documents. This means that the documents are fetched into memory as you call the next() method on the DBCursor.
On the other hand, you can eagerly load all the data into memory by executing the toArray() method, which returns a java.util.List structure:

```java
List list = collection.find(query).toArray();
```

The problem with this approach is that you could potentially fill up the memory with lots of eagerly loaded documents. You are therefore advised to include operators such as skip() and limit() to control the amount of data to be loaded into memory:

```java
List list = collection.find(query).skip(100).limit(10).toArray();
```

Just as you learned from the mongo shell, the skip operator can be used as an initial offset of your cursor, whilst the limit construct loads at most the first n occurrences in the cursor.

Filtering through the records

Typically, you will not need to fetch the whole set of documents in a collection. So, just as SQL uses WHERE conditions to filter records, in MongoDB you can restrict searches by creating a BasicDBObject and passing it to the find function as an argument. See the following example:

```java
DBCollection coll = db.getCollection("javastuff");

DBObject query = new BasicDBObject("name", "owen");

DBCursor cursor = coll.find(query);

try {
    while (cursor.hasNext()) {
        System.out.println(cursor.next());
    }
} finally {
    cursor.close();
}
```

In the preceding example, we retrieve the documents in the javastuff collection whose name key equals owen. That's the equivalent of an SQL query like this:

```sql
SELECT * FROM javastuff WHERE name='owen'
```

Building more complex searches

As your collections keep growing, you will need to be more selective with your searches. For example, you could include multiple keys in the BasicDBObject that is eventually passed to find, and apply query operators to them. Here is how to find documents whose name is not equal ($ne) to frank and whose age is greater than ($gt) 10:

```java
DBCollection coll = db.getCollection("javastuff");

DBObject query = new BasicDBObject("name",
        new BasicDBObject("$ne", "frank")).append("age",
        new BasicDBObject("$gt", 10));

DBCursor cursor = coll.find(query);
```

Updating documents

Having learned about create and read, we are halfway through our CRUD track. The next operation you will learn is update. The DBCollection class contains an update method that can be used for this purpose. Let's say we have the following document:

```
> db.javastuff.find({"name":"frank"}).pretty()
{
    "_id" : ObjectId("55142c27627b27560bd365b1"),
    "name" : "frank",
    "age" : 31,
    "info" : {
        "email" : "[email protected]",
        "phone" : "222-111-444"
    }
}
```

Now we want to change the age value for this document by setting it to 23:

```java
DBCollection coll = db.getCollection("javastuff");

DBObject newDocument = new BasicDBObject();
newDocument.put("age", 23);

DBObject searchQuery = new BasicDBObject().append("name", "frank");

coll.update(searchQuery, newDocument);
```

You might think that would do the trick, but wait! Let's have a look at our document using the mongo shell:

```
> db.javastuff.find({"age":23}).pretty()

{ "_id" : ObjectId("55142c27627b27560bd365b1"), "age" : 23 }
```

As you can see, the update statement has replaced the original document with another one, including only the keys and values we have passed to the update. In most cases, this is not what we want to achieve. If we want to update a particular value, we have to use the $set update modifier.
```java
DBCollection coll = db.getCollection("javastuff");

BasicDBObject newDocument = new BasicDBObject();
newDocument.append("$set", new BasicDBObject().append("age", 23));

BasicDBObject searchQuery = new BasicDBObject().append("name", "frank");

coll.update(searchQuery, newDocument);
```

So, supposing we restored the initial document with all its fields, this is the outcome of the update using the $set update modifier:

```
> db.javastuff.find({"age":23}).pretty()
{
    "_id" : ObjectId("5514326e627b383428c2ccd8"),
    "name" : "frank",
    "age" : 23,
    "info" : {
        "email" : "[email protected]",
        "phone" : "222-111-444"
    }
}
```

Please note that the DBCollection class also overloads the update method as update(DBObject q, DBObject o, boolean upsert, boolean multi). The first of the boolean parameters (upsert) determines whether the database should create the element if it does not exist. The second one (multi) causes the update to be applied to all matching objects.

Deleting documents

The operation to be used for deleting documents is remove. As with other operations, it comes in several variants. In its simplest form, when executed on a single returned document, it will remove it:

```java
MongoClient mongoClient = new MongoClient("localhost", 27017);

DB db = mongoClient.getDB("sampledb");
DBCollection coll = db.getCollection("javastuff");
DBObject doc = coll.findOne();

coll.remove(doc);
```

Most of the time you will need to filter the documents to be deleted. Here is how to delete the document whose name key is frank:

```java
DBObject document = new BasicDBObject();
document.put("name", "frank");

coll.remove(document);
```

Deleting a set of documents

Bulk deletion of documents can be achieved by including the keys in a List and building an $in modifier expression that uses this list. Let's see, for example, how to delete all records whose age ranges from 0 to 49:

```java
BasicDBObject deleteQuery = new BasicDBObject();
List<Integer> list = new ArrayList<Integer>();

for (int i = 0; i < 50; i++) {
    list.add(i);
}

deleteQuery.put("age", new BasicDBObject("$in", list));
coll.remove(deleteQuery);
```
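As a short illustration of the overloaded update method mentioned earlier, the following sketch applies a $set modifier to every matching document. It assumes the same sampledb/javastuff collection and the legacy 2.x driver API used throughout this article; the status field and the age condition are invented for the example.

```java
// Sketch of update(query, update, upsert, multi) with the legacy 2.x driver.
// upsert = false: do not insert a new document if nothing matches.
// multi  = true : apply the $set modifier to every matching document.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class MultiUpdateExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongoClient = new MongoClient("localhost", 27017);
        DB db = mongoClient.getDB("sampledb");
        DBCollection coll = db.getCollection("javastuff");

        BasicDBObject searchQuery = new BasicDBObject("age",
                new BasicDBObject("$gt", 30));           // all documents with age > 30
        BasicDBObject update = new BasicDBObject("$set",
                new BasicDBObject("status", "senior"));  // "status" is an invented field

        coll.update(searchQuery, update, false, true);   // upsert=false, multi=true
        mongoClient.close();
    }
}
```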
Summary

In this article, we covered how to perform the same operations that are available in the mongo shell.


Matrix and Pixel Manipulation along with Handling Files

Packt
11 Aug 2015
14 min read
In this article, by Daniel Lélis Baggio, author of the book OpenCV 3.0 Computer Vision with Java, you will learn to perform the basic operations required in computer vision, such as dealing with matrices, pixels, and opening files for prototype applications. The following topics will be covered:

- Basic matrix manipulation
- Pixel manipulation
- How to load and display images from files

Basic matrix manipulation

From a computer vision background, we can see an image as a matrix of numerical values, which represents its pixels. For a gray-level image, we usually assign values ranging from 0 (black) to 255 (white), with the numbers in between showing a mixture of both. These are generally 8-bit images. Each element of the matrix refers to a pixel of the gray-level image, the number of columns corresponds to the image's width, and the number of rows corresponds to the image's height. In order to represent a color image, we usually treat each pixel as a combination of three basic colors: red, green, and blue. So, each pixel in the matrix is represented by a triplet of colors.

It is important to observe that with 8 bits we get 2 to the power of eight (2^8), which is 256. So, we can represent the range from 0 to 255, which includes, respectively, the values used for black and white levels in 8-bit grayscale images. Besides this, we can also represent these levels as floating points and use 0.0 for black and 1.0 for white.

OpenCV has a variety of ways to represent images, so you are able to customize the intensity level through the number of bits, considering whether signed, unsigned, or floating-point data types are wanted, as well as the number of channels. OpenCV's convention is seen through the following expression:

```
CV_<bit_depth>{U|S|F}C(<number_of_channels>)
```

Here, U stands for unsigned, S for signed, and F stands for floating point. For instance, if an 8-bit unsigned single-channel image is required, the data type representation would be CV_8UC1, while a colored image represented by 32-bit floating point numbers would have the data type defined as CV_32FC3. If the number of channels is omitted, it evaluates to 1. We can see the ranges according to each bit depth and data type in the following list:

- CV_8U: 8-bit unsigned integers that range from 0 to 255
- CV_8S: 8-bit signed integers that range from -128 to 127
- CV_16U: 16-bit unsigned integers that range from 0 to 65,535
- CV_16S: 16-bit signed integers that range from -32,768 to 32,767
- CV_32S: 32-bit signed integers that range from -2,147,483,648 to 2,147,483,647
- CV_32F: 32-bit floating-point numbers that range from -FLT_MAX to FLT_MAX and include INF and NAN values
- CV_64F: 64-bit floating-point numbers that range from -DBL_MAX to DBL_MAX and include INF and NAN values

You will generally start a project by loading an image, but it is important to know how to deal with these values. Make sure you import org.opencv.core.CvType and org.opencv.core.Mat. Several constructors are available for matrices, for instance:

```java
Mat image2 = new Mat(480, 640, CvType.CV_8UC3);
Mat image3 = new Mat(new Size(640, 480), CvType.CV_8UC3);
```

Both of the preceding constructors will construct a matrix suitable to fit an image with 640 pixels of width and 480 pixels of height. Note that width corresponds to columns and height to rows.
Also pay attention to the constructor with the Size parameter, which expects the width and the height in that order. In case you want to check some of the matrix properties, the rows(), cols(), and elemSize() methods are available:

```java
System.out.println(image2 + "rows " + image2.rows() + " cols " +
    image2.cols() + " elementsize " + image2.elemSize());
```

The output of the preceding line is:

```
Mat [ 480*640*CV_8UC3, isCont=true, isSubmat=false, nativeObj=0xceeec70, dataAddr=0xeb50090 ]rows 480 cols 640 elementsize 3
```

The isCont property tells us whether this matrix uses extra padding when representing the image, so that it can be hardware-accelerated on some platforms; however, we won't cover it in detail right now. The isSubmat property tells us whether this matrix was created from another matrix and whether it refers to that matrix's data. The nativeObj object refers to the native object address, which is a Java Native Interface (JNI) detail, while dataAddr points to an internal data address. The element size is measured in number of bytes.

Another matrix constructor is the one that passes a scalar to be filled in as its elements. The syntax for this looks like the following:

```java
Mat image = new Mat(new Size(3, 3), CvType.CV_8UC3,
    new Scalar(new double[]{128, 3, 4}));
```

This constructor will initialize each element of the matrix with the triple {128, 3, 4}. A very useful way to print a matrix's contents is the auxiliary method dump() from Mat. Its output will look similar to the following:

```
[128, 3, 4, 128, 3, 4, 128, 3, 4;
 128, 3, 4, 128, 3, 4, 128, 3, 4;
 128, 3, 4, 128, 3, 4, 128, 3, 4]
```

It is important to note that while creating the matrix with a specified size and type, it will also immediately allocate memory for its contents.

Pixel manipulation

Pixel manipulation is often required in order to access pixels in an image. There are several ways to do this and each one has its advantages and disadvantages. A straightforward method is the put(row, col, value) method. For instance, in order to fill our preceding matrix with the values {1, 2, 3}, we use the following code:

```java
for (int i = 0; i < image.rows(); i++) {
    for (int j = 0; j < image.cols(); j++) {
        image.put(i, j, new byte[]{1, 2, 3});
    }
}
```

Note that in the array of bytes {1, 2, 3}, for our matrix, 1 stands for the blue channel, 2 for the green, and 3 for the red channel, as OpenCV stores its matrices internally in the BGR (blue, green, and red) format.

It is okay to access pixels this way for small matrices. The only problem is the overhead of JNI calls for big images. Remember that even a small 640 x 480 pixel image has 307,200 pixels and, if we think about a colored image, it has 921,600 values in its matrix. Imagine that it might take around 50 ms to make an overloaded call for each of the 307,200 pixels. On the other hand, if we manipulate the whole matrix on the Java side and then copy it to the native side in a single call, it takes around 13 ms.

If you want to manipulate the pixels on the Java side, perform the following steps:

1. Allocate memory with the same size as the matrix in a byte array.
2. Put the image contents into that array (optional).
3. Manipulate the byte array contents.
4. Make a single put call, copying the whole byte array to the matrix.
The following simple example iterates over all the image pixels and sets the blue channel to zero; that is, it sets to zero every element whose index modulo 3 equals zero, that is {0, 3, 6, 9, ...}:

```java
public void filter(Mat image) {
    int totalBytes = (int) (image.total() * image.elemSize());
    byte buffer[] = new byte[totalBytes];
    image.get(0, 0, buffer);
    for (int i = 0; i < totalBytes; i++) {
        if (i % 3 == 0) buffer[i] = 0;
    }
    image.put(0, 0, buffer);
}
```

First, we find out the number of bytes in the image by multiplying the total number of pixels (image.total()) by the element size in bytes (image.elemSize()). Then, we build a byte array of that size. We use the get(row, col, byte[]) method to copy the matrix contents into our newly created byte array. Then, we iterate over all the bytes and check the condition that refers to the blue channel (i % 3 == 0). Remember that OpenCV stores colors internally as {Blue, Green, Red}. We finally make another JNI call to image.put, which copies the whole byte array to OpenCV's native storage. An example of this filter can be seen in a screenshot in the original article, uploaded by Mromanchenko and licensed under CC BY-SA 3.0.

Be aware that Java does not have any unsigned byte data type, so be careful when working with it. The safe procedure is to cast it to an integer and use the And operator (&) with 0xff. A simple example of this would be int unsignedValue = myUnsignedByte & 0xff;. Now, unsignedValue can be checked in the range of 0 to 255.

Loading and displaying images from files

Most computer vision applications need to retrieve images from somewhere. In case you need to get them from files, OpenCV comes with several image file loaders. Unfortunately, some loaders depend on codecs that sometimes aren't shipped with the operating system, which might cause them not to load. From the documentation, we see that the following files are supported, with some caveats:

- Windows bitmaps: *.bmp, *.dib
- JPEG files: *.jpeg, *.jpg, *.jpe
- JPEG 2000 files: *.jp2
- Portable Network Graphics: *.png
- Portable image format: *.pbm, *.pgm, *.ppm
- Sun rasters: *.sr, *.ras
- TIFF files: *.tiff, *.tif

Note that Windows bitmaps, the portable image format, and Sun raster formats are supported by all platforms, but the other formats depend on a few details. On Microsoft Windows and Mac OS X, OpenCV can always read the jpeg, png, and tiff formats. On Linux, OpenCV will look for codecs supplied with the OS, as stated by the documentation, so remember to install the relevant packages (do not forget the development files, for example, "libjpeg-dev" in Debian and Ubuntu) to get the codec support, or turn on the OPENCV_BUILD_3RDPARTY_LIBS flag in CMake, as pointed out in imread's official documentation.

The imread method is supplied to get access to images through files. Use Imgcodecs.imread(name of the file) and check whether dataAddr() of the read image is different from zero to make sure the image has been loaded correctly, that is, that the filename has been typed correctly and its format is supported. A simple method to open a file could look like the one shown in the following code.
Make sure you import org.opencv.imgcodecs.Imgcodecs and org.opencv.core.Mat:

```java
public Mat openFile(String fileName) throws Exception {
    Mat newImage = Imgcodecs.imread(fileName);
    if (newImage.dataAddr() == 0) {
        throw new Exception("Couldn't open file " + fileName);
    }
    return newImage;
}
```

Displaying an image with Swing

OpenCV developers are used to a simple cross-platform GUI provided by OpenCV, called HighGUI, and its handy imshow method, which constructs a window easily and displays an image within it. This is nice for creating quick prototypes. As Java comes with a popular GUI API called Swing, we had better use it. Besides, no imshow method was available for Java until the 2.4.7.0 release. On the other hand, it is pretty simple to create such functionality ourselves. Let's break the work down into two classes: App and ImageViewer. The App class will be responsible for loading the file, while ImageViewer will display it. The application's work is simple and only needs to use Imgcodecs's imread method, which is shown as follows:

```java
package org.javaopencvbook;

import java.io.File;
…
import org.opencv.imgcodecs.Imgcodecs;

public class App {
    static {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }

    public static void main(String[] args) throws Exception {
        String filePath = "src/main/resources/images/cathedral.jpg";
        Mat newImage = Imgcodecs.imread(filePath);
        if (newImage.dataAddr() == 0) {
            System.out.println("Couldn't open file " + filePath);
        } else {
            ImageViewer imageViewer = new ImageViewer();
            imageViewer.show(newImage, "Loaded image");
        }
    }
}
```

Note that the App class will only read an example image file into the Mat object and then call the ImageViewer method to display it. Now, let's see how the ImageViewer class's show method works:

```java
package org.javaopencvbook.util;

import java.awt.BorderLayout;
import java.awt.Dimension;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

import javax.swing.ImageIcon;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JScrollPane;
import javax.swing.UIManager;
import javax.swing.UnsupportedLookAndFeelException;
import javax.swing.WindowConstants;

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

public class ImageViewer {
    private JLabel imageView;

    public void show(Mat image) {
        show(image, "");
    }

    public void show(Mat image, String windowName) {
        setSystemLookAndFeel();

        JFrame frame = createJFrame(windowName);

        Image loadedImage = toBufferedImage(image);
        imageView.setIcon(new ImageIcon(loadedImage));

        frame.pack();
        frame.setLocationRelativeTo(null);
        frame.setVisible(true);
    }

    private JFrame createJFrame(String windowName) {
        JFrame frame = new JFrame(windowName);
        imageView = new JLabel();
        final JScrollPane imageScrollPane = new JScrollPane(imageView);
        imageScrollPane.setPreferredSize(new Dimension(640, 480));
        frame.add(imageScrollPane, BorderLayout.CENTER);
        frame.setDefaultCloseOperation(WindowConstants.EXIT_ON_CLOSE);
        return frame;
    }

    private void setSystemLookAndFeel() {
        try {
            UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (UnsupportedLookAndFeelException e) {
            e.printStackTrace();
        }
    }

    public Image toBufferedImage(Mat matrix) {
        int type = BufferedImage.TYPE_BYTE_GRAY;
        if (matrix.channels() > 1) {
            type = BufferedImage.TYPE_3BYTE_BGR;
        }
        int bufferSize = matrix.channels() * matrix.cols() * matrix.rows();
        byte[] buffer = new byte[bufferSize];
        matrix.get(0, 0, buffer); // get all the pixels
        BufferedImage image = new BufferedImage(matrix.cols(), matrix.rows(), type);
        final byte[] targetPixels = ((DataBufferByte) image.getRaster().getDataBuffer()).getData();
        System.arraycopy(buffer, 0, targetPixels, 0, buffer.length);
        return image;
    }
}
```

Pay attention to the show and toBufferedImage methods. show will try to set Swing's look and feel to the default native look, which is cosmetic. Then, it will create a JFrame with a JScrollPane and a JLabel inside it. It will then call toBufferedImage, which converts an OpenCV Mat object to an AWT BufferedImage. This conversion is made through the creation of a byte array that will store the matrix contents. The appropriate size is allocated by multiplying the number of channels by the number of columns and rows. The matrix.get method puts all the elements into the byte array. Finally, the image's raster data buffer is accessed through the getDataBuffer() and getData() methods. It is then filled with a fast system call to the System.arraycopy method. The resulting image is then assigned to the JLabel and easily displayed.

Note that this method expects a matrix stored as either one-channel or three-channel unsigned 8-bit data. In case your image is stored as floating point, you should convert it using the following code before calling this method, supposing that the image you need to convert is a Mat object called originalImage:

```java
Mat byteImage = new Mat();
originalImage.convertTo(byteImage, CvType.CV_8UC3);
```

This way, you can call toBufferedImage with your converted byteImage object.

The image viewer can be easily installed in any Java OpenCV project and it will help you to show your images for debugging purposes.
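As a usage sketch, the following class wires together the openFile() and filter() methods shown earlier with the ImageViewer class: it loads a file, zeroes its blue channel, and displays the result. It assumes the OpenCV Java bindings and native library are set up as in the article and that the ImageViewer class above is on the classpath; the file path is just an example.

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;

public class FilterDemo {
    static {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }

    // openFile() and filter() as shown earlier in this article.
    public Mat openFile(String fileName) throws Exception {
        Mat newImage = Imgcodecs.imread(fileName);
        if (newImage.dataAddr() == 0) {
            throw new Exception("Couldn't open file " + fileName);
        }
        return newImage;
    }

    public void filter(Mat image) {
        int totalBytes = (int) (image.total() * image.elemSize());
        byte[] buffer = new byte[totalBytes];
        image.get(0, 0, buffer);
        for (int i = 0; i < totalBytes; i++) {
            if (i % 3 == 0) {
                buffer[i] = 0; // zero the blue channel (BGR layout)
            }
        }
        image.put(0, 0, buffer);
    }

    public static void main(String[] args) throws Exception {
        FilterDemo demo = new FilterDemo();
        Mat image = demo.openFile("src/main/resources/images/cathedral.jpg"); // example path
        demo.filter(image);
        new ImageViewer().show(image, "Blue channel removed"); // ImageViewer from the article
    }
}
```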
Neo4j – Modeling Bookings and Users

Packt
11 Aug 2015
14 min read
In this article, by Mahesh Lal, author of the book Neo4j Graph Data Modeling, we will explore how graphs can be used to solve problems that are dominantly solved using RDBMS, for example, bookings. We will discuss the following topics in this article: Modeling bookings in an RDBMS Modeling bookings in a graph Adding bookings to graphs Using Cypher to find bookings and journeys (For more resources related to this topic, see here.) Building a data model for booking flights We have a graph that allows people to search flights. At this point, a logical extension to the problem statement could be to allow users to book flights online after they decide the route on which they wish to travel. We were only concerned with flights and the cities. However, we need to tweak the model to include users, bookings, dates, and capacity of the flight in order to make bookings. Most teams choose to use an RDBMS for sensitive data such as user information and bookings. Let's understand how we can translate a model from an RDBMS to a graph. A flight booking generally has many moving parts. While it would be great to model all of the parts of a flight booking, a smaller subset would be more feasible, to demonstrate how to model data that is normally stored in a RDBMS. A flight booking will contain information about the user who booked it along with the date of booking. It's not uncommon to change multiple flights to get from one city to another. We can call these journey legs or journeys, and model them separately from the booking that has these journeys. It is also possible that the person booking the flight might be booking for some other people. Because of this, it is advisable to model passengers with their basic details separately from the user. We have intentionally skipped details such as payment and costs in order to keep the model simple. A simple model of the bookings ecosystem A booking generally contains information such as the date of booking, the user who booked it, and a date of commencement of the travel. A journey contains information about the flight code. Other information about the journey such as the departure and arrival time, and the source and destination cities can be evaluated on the basis of the flight which the journey is being undertaken. Both booking and journey will have their own specific IDs to identify them uniquely. Passenger information related to the booking must have the name of the passengers at the very least, but more commonly will have more information such as the age, gender, and e-mail. A rough model of the Booking, Journey, Passenger, and User looks like this: Figure 4.1: Bookings ecosystem Modeling bookings in an RDBMS To model data shown in Figure 4.1 in an RDBMS, we will have to create tables for bookings, journeys, passengers, and users. In the previous model, we have intentionally added booking_id to Journeys and user_id to Bookings. In an RDBMS, these will be used as foreign keys. We also need an additional table Bookings_Passengers_Relationships so that we can depict the many relationships between Bookings and Passengers. The multiple relationships between Bookings and Passengers help us to ensure that we capture passenger details for two purposes. The first is that a user can have a master list of travelers they have travelled with and the second use is to ensure that all the journeys taken by a person can be fetched when the passenger logs into their account or creates an account in the future. 
We are naming the foreign key references with a prefix fk_ in adherence to the popular convention. Figure 4.2: Modeling bookings in an RDBMS In an RDBMS, every record is a representation of an entity (or a relationship in case of relationship tables). In our case, we tried to represent a single booking record as a single block. This applies to all other entities in the system, such as the journeys, passengers, users, and flights. Each of the records has its own ID by which it can be uniquely identified. The properties starting with fk_ are foreign keys, which should be present in the tables to which the key points. In our model, passengers may or may not be the users of our application. Hence, we don't add a foreign key constraint to the Passengers table. To infer whether the passenger is one of the users or not, we will have to use other means of inferences, for example, the e-mail ID. Given the relationships of the data, which are inferred using the foreign key relationships and other indirect means, we can draw the logical graph of bookings as shown in the following diagram: Figure 4.3: Visualizing related entities in an RDBMS Figure 4.3 shows us the logical graph of how entities are connected in our domain. We can translate this into a Bookings subgraph. From the related entities of Figure 4.3, we can create a specification of the Bookings subgraph, which is as follows: Figure 4.4: Specification of subgraph of bookings Comparing Figure 5.3 and Figure 5.4, we observe that all the fk_ properties are removed from the nodes that represent the entities. Since we have explicit relationships that can now be used to traverse the graph, we don't need implicit relationships that rely on foreign keys to be enforced. We put the date of booking on the booking itself rather than on the relationship between User and Bookings. The date of booking can be captured either in the booking node or in the :MADE_BOOKING relationship. The advantage of capturing it in the booking node is that we can further run queries efficiently on it rather than relying on crude filtering methods to extract information from the subgraph. An important addition to the Bookings object is adding the properties year, month, and day. Since date is not a datatype supported by Neo4j, range queries become difficult. Timestamps solve this problem to some extent, for example, if we want to find all bookings made between June 01, 2015 and July 01, 2015, we can convert them into timestamps and search for all bookings that have timestamps between these two timestamps. This, however, is a very expensive process, and would need a store scan of bookings. To alleviate these problems, we can capture the year, day, and month on the booking. While adapting to the changing needs of the system, remodeling the data model is encouraged. It is also important that we build a data model with enough data captured for our needs—both current and future. It is a judgment-based decision, without any correct answer. As long as the data might be easily derived from existing data in the node, we recommend not to add it until needed. In this case, converting a timestamp to its corresponding date with its components might require additional programming effort. To avoid that, we can begin capturing the data right away. There might be other cases, for example, we want to introduce a property Name on a node with First name and Last name as properties. The derivation of Name from First name and Last name is straightforward. 
In this case, we advise not to capture the data till the need arises. Creating bookings and users in Neo4j For bookings to exist, we should create users in our data model. Creating users To create users, we create a constraint on the e-mail of the user, which we will use as an unique identifier as shown in the following query: neo4j-sh (?)$ CREATE CONSTRAINT ON (user:User)   ASSERT user.email IS UNIQUE; The output of the preceding query is as follows: +-------------------+ | No data returned. | +-------------------+ Constraints added: 1 With the constraint added, let's create a few users in our system: neo4j-sh (?)$ CREATE (:User{name:"Mahesh Lal",   email:"[email protected]"}), (:User{name:"John Doe", email:"[email protected]"}), (:User{name:"Vishal P", email:"[email protected]"}), (:User{name:"Dave Coeburg", email:"[email protected]"}), (:User{name:"Brian Heritage",     email:"[email protected]"}), (:User{name:"Amit Kumar", email:"[email protected]"}), (:User{name:"Pramod Bansal",     email:"[email protected]"}), (:User{name:"Deepali T", email:"[email protected]"}), (:User{name:"Hari Seldon", email:"[email protected]"}), (:User{name:"Elijah", email:"[email protected]"}); The output of the preceding query is as follows: +-------------------+ | No data returned. | +-------------------+ Nodes created: 10 Properties set: 20 Labels added: 10 Please add more users from users.cqy. Creating bookings in Neo4j As discussed earlier, a booking has multiple journey legs, and a booking is only complete when all its journey legs are booked. Bookings in our application aren't a single standalone entity. They involve multiple journeys and passengers. To create a booking, we need to ensure that journeys are created and information about passengers is captured. This results in a multistep process. To ensure that booking IDs remain unique and no two nodes have the same ID, we should add a constraint on the id property of booking: neo4j-sh (?)$ CREATE CONSTRAINT ON (b:Booking)   ASSERT b.id IS UNIQUE; The output will be as follows: +-------------------+ | No data returned. | +-------------------+ Constraints added: 1 We will create similar constraints for Journey as shown here: neo4j-sh (?)$ CREATE CONSTRAINT ON (journey:Journey)   ASSERT journey._id IS UNIQUE; The output is as follows: +-------------------+ | No data returned. | +-------------------+ Constraints added: 1 Add a constraint for the e-mail of passengers to be unique, as shown here: neo4j-sh (?)$ CREATE CONSTRAINT ON (p:Passenger)   ASSERT p.email IS UNIQUE; The output is as shown: +-------------------+ | No data returned. | +-------------------+ Constraints added: 1 With constraint creation, we can now focus on how bookings can be created. 
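One practical note before running the query: the _id values and the year, month, and day properties on the booking and journey nodes are generated by the application before the Cypher is executed; Neo4j does not create them for us. A minimal Python sketch of that preparation step follows — the helper name and the use of UTC are our assumptions, chosen to match the property names used in the query below:

import uuid
from datetime import datetime, timezone

def booking_properties(booking_ts):
    """Build the properties we attach to a :Booking node.

    booking_ts is an epoch timestamp in seconds, like the
    booking_date values used in the Cypher examples.
    """
    dt = datetime.fromtimestamp(booking_ts, tz=timezone.utc)
    return {
        "_id": str(uuid.uuid1()),     # time-based UUID, like the sample IDs in the query
        "booking_date": booking_ts,   # keep the raw timestamp
        "year": dt.year,              # denormalized parts for efficient range queries
        "month": dt.month,
        "day": dt.day,
    }

print(booking_properties(1417790677.274862))
# yields year 2014, month 12, day 5 (UTC), matching the booking query below

Whether you derive the date parts in UTC or in the booking's local time zone is a design decision; the important point is that they are computed once, at write time.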
We will be running this query in the Neo4j browser, as shown: //Get all flights and users MATCH (user:User{email:"[email protected]"}) MATCH (f1:Flight{code:"VS9"}), (f2:Flight{code:"AA9"}) //Create a booking for a date MERGE (user)-[m:MADE_BOOKING]->(booking:Booking {_id:"0f64711c-7e22-11e4-a1af-14109fda6b71", booking_date:1417790677.274862, year: 2014, month: 12, day: 5}) //Create or get passengers MERGE (p1:Passenger{email:"[email protected]"}) ON CREATE SET p1.name = "Vishal Punyani", p1.age= 30 MERGE (p2:Passenger{email:"[email protected]"}) ON CREATE SET p2.name = "John Doe", p2.age= 25 //Create journeys to be taken by flights MERGE (j1:Journey{_id: "712785b8-1aff-11e5-abd4-6c40089a9424", date_of_journey:1422210600.0, year:2015, month: 1, day: 26})-[:BY_FLIGHT]-> (f1) MERGE (j2:Journey{_id:"843de08c-1aff-11e5-8643-6c40089a9424", date_of_journey:1422210600.0, year:2015, month: 1, day: 26})-[:BY_FLIGHT]-> (f2) WITH user, booking, j1, j2, f1, f2, p1, p2 //Merge journeys and booking, Create and Merge passengers with bookings, and return data MERGE (booking)-[:HAS_PASSENGER]->(p1) MERGE (booking)-[:HAS_PASSENGER]->(p2) MERGE (booking)-[:HAS_JOURNEY]->(j1) MERGE (booking)-[:HAS_JOURNEY]->(j2) RETURN user, p1, p2, j1, j2, f1, f2, booking The output is as shown in the following screenshot: Figure 4.5: Booking that was just created We have added comments to the query to explain the different parts of the query. The query can be divided into the following parts: Finding flights and user Creating bookings Creating journeys Creating passengers and link to booking Linking journey to booking We have the same start date for both journeys, but in general, the start dates of journeys in the same booking will differ if: The traveler is flying across time zones. For example, if a traveler is flying from New York to Istanbul, the journeys from New York to London and from London to Istanbul will be on different dates. The traveler is booking multiple journeys in which they will be spending some time at a destination. Let's use bookings.cqy to add a few more bookings to the graph. We will use them to run further queries. Queries to find journeys and bookings With the data on bookings added in, we can now explore some interesting queries that can help us. Finding all journeys of a user All journeys that a user has undertaken will be all journeys that they have been a passenger on. We can use the user's e-mail to search for journeys on which the user has been a passenger. To find all the journeys that the user has been a passenger on, we should find the journeys via the bookings, and then using the bookings, we can find the journeys, flights, and cities as shown: neo4j-sh (?)$ MATCH (b:Booking)-[:HAS_PASSENGER]->(p:Passenger{email:"[email protected]"}) WITH b MATCH (b)-[:HAS_JOURNEY]->(j:Journey)-[:BY_FLIGHT]->(f:Flight) WITH b._id as booking_id, j.date_of_journey as date_of_journey, COLLECT(f) as flights ORDER BY date_of_journey DESC MATCH (source:City)-[:HAS_FLIGHT]->(f)-[:FLYING_TO]->(destination:City) WHERE f in flights RETURN booking_id, date_of_journey, source.name as from, f.code as by_flight, destination.name as to; The output of this query is as follows: While this query is useful to get all the journeys of the user, it can also be used to map all the locations the user has travelled to. 
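In an application, a query like this would normally be issued through a driver rather than the Neo4j shell. The sketch below uses the official Neo4j Python driver purely as an illustration; the Bolt URL, credentials, and column aliases are placeholders and are not part of the original example:

from neo4j import GraphDatabase

FIND_JOURNEYS = """
MATCH (b:Booking)-[:HAS_PASSENGER]->(:Passenger {email: $email})
WITH b
MATCH (b)-[:HAS_JOURNEY]->(j:Journey)-[:BY_FLIGHT]->(f:Flight)
WITH b._id AS booking_id, j.date_of_journey AS date_of_journey, COLLECT(f) AS flights
ORDER BY date_of_journey DESC
MATCH (source:City)-[:HAS_FLIGHT]->(f)-[:FLYING_TO]->(destination:City)
WHERE f IN flights
RETURN booking_id, date_of_journey,
       source.name AS origin, f.code AS by_flight, destination.name AS destination
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    # The e-mail is passed as a query parameter instead of being concatenated
    # into the Cypher string.
    for record in session.run(FIND_JOURNEYS, email="[email protected]"):
        print(record["booking_id"], record["origin"],
              record["by_flight"], record["destination"])
driver.close()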
Queries for finding the booking history of a user The query for finding all bookings by a user is straightforward, as shown here: neo4j-sh (?)$ MATCH (user:User{email:"[email protected]"})-[:MADE_BOOKING]->(b:Booking) RETURN b._id as booking_id; The output of the preceding query is as follows: +----------------------------------------+ | booking_id                             | +----------------------------------------+ | "251679be-1b3f-11e5-820e-6c40089a9424" | | "ff3dd694-7e7f-11e4-bb93-14109fda6b71" | | "7c63cc35-7e7f-11e4-8ffe-14109fda6b71" | | "f5f15252-1b62-11e5-8252-6c40089a9424" | | "d45de0c2-1b62-11e5-98a2-6c40089a9424" | | "fef04c30-7e2d-11e4-8842-14109fda6b71" | | "f87a515e-7e2d-11e4-b170-14109fda6b71" | | "75b3e78c-7e2b-11e4-a162-14109fda6b71" | +----------------------------------------+ 8 rows Upcoming journeys of a user Upcoming journeys of a user is straightforward. We can construct it by simply comparing today's date to the journey date as shown: neo4j-sh (?)$ MATCH (user:User{email:"[email protected]"})-[:MADE_BOOKING]->(:Booking)-[:HAS_JOURNEY]-(j:Journey) WHERE j.date_of_journey >=1418055307 WITH COLLECT(j) as journeys MATCH (j:Journey)-[:BY_FLIGHT]->(f:Flight) WHERE j in journeys WITH j.date_of_journey as date_of_journey, COLLECT(f) as flights MATCH (source:City)-[:HAS_FLIGHT]->(f)-[:FLYING_TO]->(destination:City) WHERE f in flights RETURN date_of_journey, source.name as from, f.code as by_flight, destination.name as to; The output of the preceding query is as follows: +-------------------------------------------------------------+ | date_of_journey | from         | by_flight | to           | +-------------------------------------------------------------+ | 1.4226426E9     | "New York"   | "VS8"     | "London"     | | 1.4212602E9     | "Los Angeles" | "UA1262" | "New York"   | | 1.4212602E9     | "Melbourne"   | "QF94"   | "Los Angeles" | | 1.4304186E9     | "New York"   | "UA1507" | "Los Angeles" | | 1.4311962E9     | "Los Angeles" | "AA920"   | "New York"   | +-------------------------------------------------------------+ 5 rows Summary In this article, you learned how you can model a domain that has traditionally been implemented using RDBMS. We saw how tables can be changed to nodes and relationships, and we explored what happened to relationship tables. You also learned about transactions in Cypher and wrote Cypher to manipulate the database. Resources for Article: Further resources on this subject: Selecting the Layout [article] Managing Alerts [article] Working with a Neo4j Embedded Database [article]
Bayesian Network Fundamentals

Packt
10 Aug 2015
25 min read
In this article by Ankur Ankan and Abinash Panda, the authors of Mastering Probabilistic Graphical Models Using Python, we'll cover the basics of random variables, probability theory, and graph theory. We'll also see the Bayesian models and the independencies in Bayesian models. A graphical model is essentially a way of representing a joint probability distribution over a set of random variables in a compact and intuitive form. There are two main types of graphical models, namely directed and undirected. We generally use a directed model, also known as a Bayesian network, when we mostly have a causal relationship between the random variables. Graphical models also give us tools to operate on these models to find conditional and marginal probabilities of variables, while keeping the computational complexity under control. (For more resources related to this topic, see here.)

Probability theory

To understand the concepts of probability theory, let's start with a real-life situation. Let's assume we want to go for an outing on a weekend. There are a lot of things to consider before going: the weather conditions, the traffic, and many other factors. If the weather is windy or cloudy, then it is probably not a good idea to go out. However, even if we have information about the weather, we cannot be completely sure whether to go or not; hence we have used the words probably or maybe. Similarly, if it is windy in the morning (or at the time we took our observations), we cannot be completely certain that it will be windy throughout the day. The same holds for cloudy weather; it might turn out to be a very pleasant day. Further, we are not completely certain of our observations. There are always some limitations in our ability to observe; sometimes, these observations could even be noisy. In short, uncertainty or randomness is the innate nature of the world. Probability theory provides us with the necessary tools to study this uncertainty. It helps us look into options that are unlikely yet probable.

Random variable

Probability deals with the study of events. From our intuition, we can say that some events are more likely than others, but to quantify the likeliness of a particular event, we require probability theory. It helps us predict the future by assessing how likely the outcomes are. Before going deeper into probability theory, let's first get acquainted with its basic terminologies and definitions. A random variable is a way of representing an attribute of the outcome. Formally, a random variable X is a function that maps a possible set of outcomes Ω to some set E, which is represented as follows:

X : Ω → E

As an example, let us consider the outing example again. To decide whether to go or not, we may consider the skycover (to check whether it is cloudy or not). Skycover is an attribute of the day. Mathematically, the random variable skycover (X) is interpreted as a function, which maps the day (Ω) to its skycover values (E). So when we say the event X = 40.1, it represents the set of all the days {ω} such that f_skycover(ω) = 40.1, where f_skycover is the mapping function. Formally speaking, X = 40.1 denotes the set {ω ∈ Ω : f_skycover(ω) = 40.1}. Random variables can either be discrete or continuous. A discrete random variable can only take a finite number of values. For example, the random variable representing the outcome of a coin toss can take only two values, heads or tails; and hence, it is discrete. Whereas, a continuous random variable can take an infinite number of values.
For example, a variable representing the speed of a car can take any number of values. For any event whose outcome is represented by some random variable (X), we can assign some value to each of the possible outcomes of X, which represents how probable it is. This is known as the probability distribution of the random variable and is denoted by P(X). For example, consider a set of restaurants. Let X be a random variable representing the quality of food in a restaurant. It can take a set of values, such as {good, bad, average}. P(X) represents the probability distribution of X, that is, P(X = good) = 0.3, P(X = average) = 0.5, and P(X = bad) = 0.2. This means there is a 30 percent chance of a restaurant serving good food, a 50 percent chance of it serving average food, and a 20 percent chance of it serving bad food.

Independence and conditional independence

In most situations, we are rather more interested in looking at multiple attributes at the same time. For example, to choose a restaurant, we won't be looking just at the quality of food; we might also want to look at other attributes, such as the cost, location, size, and so on. We can have a probability distribution over a combination of these attributes as well. This type of distribution is known as a joint probability distribution. Going back to our restaurant example, let the random variable for the quality of food be represented by Q, and the cost of food be represented by C. Q can have three categorical values, namely {good, average, bad}, and C can have the values {high, low}. So, the joint distribution P(Q, C) would have probability values for all the combinations of states of Q and C. P(Q = good, C = high) will represent the probability of a pricey restaurant with good quality food, while P(Q = bad, C = low) will represent the probability of a restaurant that is less expensive with bad quality food. Let us consider another random variable representing an attribute of a restaurant, its location L. The cost of food in a restaurant is not only affected by the quality of food but also by the location (generally, a restaurant located in a very good location would be more costly compared to a restaurant present in a not-very-good location). From our intuition, we can say that the probability of a costly restaurant located at a very good location in a city would be different (generally, more) from simply the probability of a costly restaurant, and the probability of a cheap restaurant located at a prime location of the city is different (generally, less) from simply the probability of a cheap restaurant. Formally speaking, P(C = high | L = good) will be different from P(C = high), and P(C = low | L = good) will be different from P(C = low). This indicates that the random variables C and L are not independent of each other. These attributes or random variables need not always be dependent on each other. For example, the quality of food doesn't depend upon the location of the restaurant. So, P(Q = good | L = good) or P(Q = good | L = bad) would be the same as P(Q = good), that is, our estimate of the quality of food of the restaurant will not change even if we have knowledge of its location. Hence, these random variables are independent of each other. In general, random variables X1, X2, ..., Xn can be considered independent of each other if:

P(X1, X2, ..., Xn) = P(X1) P(X2) ... P(Xn)

They may also be considered independent if:

P(Xi | Xj) = P(Xi) for all i ≠ j

We can easily derive this conclusion.
We know the following from the chain rule of probability:

P(X, Y) = P(X) P(Y | X)

If Y is independent of X, that is, if X ⊥ Y, then P(Y | X) = P(Y). Then:

P(X, Y) = P(X) P(Y)

Extending this result to multiple variables, we can easily get to the conclusion that a set of random variables is independent of each other if their joint probability distribution is equal to the product of the probabilities of each individual random variable. Sometimes, the variables might not be independent of each other. To make this clearer, let's add another random variable, that is, the number of people visiting the restaurant, N. Let's assume that, from our experience, we know the number of people visiting only depends on the cost of food at the restaurant and its location (generally, fewer people visit costly restaurants). Does the quality of food Q affect the number of people visiting the restaurant? To answer this question, let's look into the random variables affecting N: cost C and location L. As C is directly affected by Q, we can conclude that Q affects N. However, let's consider a situation when we know that the restaurant is costly, that is, C = high, and let's ask the same question, "does the quality of food affect the number of people coming to the restaurant?". The answer is no. The number of people coming only depends on the price and location, so if we know that the cost is high, then we can easily conclude that fewer people will visit, irrespective of the quality of food. Hence, (N ⊥ Q | C). This type of independence is called conditional independence.

Installing tools

Let's now see some coding examples using pgmpy, to represent joint distributions and independencies. Here, we will mostly work with IPython and pgmpy (and a few other libraries) for coding examples. So, before moving ahead, let's get a basic introduction to these.

IPython

IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, which offers enhanced introspection, rich media, additional shell syntax, tab completion, and a rich history. IPython provides the following features: powerful interactive shells (terminal and Qt-based); a browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media; support for interactive data visualization and use of GUI toolkits; flexible and embeddable interpreters to load into one's own projects; and easy-to-use, high-performance tools for parallel computing. You can install IPython using the following command:

>>> pip3 install ipython

To start the IPython command shell, you can simply type ipython3 in the terminal. For more installation instructions, you can visit http://ipython.org/install.html.

pgmpy

pgmpy is a Python library to work with Probabilistic Graphical Models. As it's currently not on PyPI, we will need to build it manually. You can get the source code from the Git repository using the following command:

>>> git clone https://github.com/pgmpy/pgmpy

Now, cd into the cloned directory, switch to the branch used for this book's version, and build it with the following code:

>>> cd pgmpy
>>> git checkout book/v0.1
>>> sudo python3 setup.py install

For more installation instructions, you can visit http://pgmpy.org/install.html. With both IPython and pgmpy installed, you should now be able to run the examples.

Representing independencies using pgmpy

To represent independencies, pgmpy has two classes, namely IndependenceAssertion and Independencies.
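Before looking at these classes, it can help to verify the definition of independence numerically on a small example. The joint distribution below is written out by hand (the numbers are ours, chosen to be consistent with the marginals P(Q) and P(L) used elsewhere in this article), and the check confirms that P(Q, L) = P(Q)P(L) holds for every combination of states:

# A joint distribution over food quality Q and location L, written out in full.
p_joint = {
    ("good", "good"): 0.18,    ("good", "bad"): 0.12,
    ("average", "good"): 0.30, ("average", "bad"): 0.20,
    ("bad", "good"): 0.12,     ("bad", "bad"): 0.08,
}

# Marginals P(Q) and P(L), obtained by summing the joint over the other variable.
p_q = {}
p_l = {}
for (q, l), p in p_joint.items():
    p_q[q] = p_q.get(q, 0) + p
    p_l[l] = p_l.get(l, 0) + p

# Q and L are independent iff P(Q = q, L = l) = P(Q = q) * P(L = l) for all states.
independent = all(abs(p - p_q[q] * p_l[l]) < 1e-9 for (q, l), p in p_joint.items())

print({q: round(p, 2) for q, p in p_q.items()})   # {'good': 0.3, 'average': 0.5, 'bad': 0.2}
print({l: round(p, 2) for l, p in p_l.items()})   # {'good': 0.6, 'bad': 0.4}
print(independent)                                # True

pgmpy packages exactly this kind of bookkeeping for us.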
The IndependenceAssertion class is used to represent individual assertions of the form (X ⊥ Y) or (X ⊥ Y | Z). Let's see some code to represent assertions:

# Firstly we need to import IndependenceAssertion
In [1]: from pgmpy.independencies import IndependenceAssertion

# Each assertion is in the form of [X, Y, Z] meaning X is
# independent of Y given Z.
In [2]: assertion1 = IndependenceAssertion('X', 'Y')
In [3]: assertion1
Out[3]: (X _|_ Y)

Here, assertion1 represents that the variable X is independent of the variable Y. To represent conditional assertions, we just need to add a third argument to IndependenceAssertion:

In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z')
In [5]: assertion2
Out[5]: (X _|_ Y | Z)

In the preceding example, assertion2 represents (X ⊥ Y | Z). IndependenceAssertion also allows us to represent assertions of the form (X ⊥ Y, Z | A). To do this, we just need to pass a list of random variables as arguments:

In [6]: assertion3 = IndependenceAssertion('X', ['Y', 'Z'], 'A')
In [7]: assertion3
Out[7]: (X _|_ Y, Z | A)

Moving on to the Independencies class, an Independencies object is used to represent a set of assertions. Often, in the case of Bayesian or Markov networks, we have more than one assertion corresponding to a given model, and to represent these independence assertions for the models, we generally use the Independencies object. Let's take a few examples:

In [8]: from pgmpy.independencies import Independencies
# There are multiple ways to create an Independencies object, we
# could either initialize an empty object or initialize with some
# assertions.

In [9]: independencies = Independencies() # Empty object
In [10]: independencies.get_assertions()
Out[10]: []

In [11]: independencies.add_assertions(assertion1, assertion2)
In [12]: independencies.get_assertions()
Out[12]: [(X _|_ Y), (X _|_ Y | Z)]

We can also directly initialize Independencies in these two ways:

In [13]: independencies = Independencies(assertion1, assertion2)
In [14]: independencies = Independencies(['X', 'Y'],
                                         ['A', 'B', 'C'])
In [15]: independencies.get_assertions()
Out[15]: [(X _|_ Y), (A _|_ B | C)]

Representing joint probability distributions using pgmpy

We can also represent joint probability distributions using pgmpy's JointProbabilityDistribution class. Let's say we want to represent the joint distribution over the outcomes of tossing two fair coins. So, in this case, the probability of all the possible outcomes would be 0.25, which is shown as follows:

In [16]: from pgmpy.factors import JointProbabilityDistribution as Joint
In [17]: distribution = Joint(['coin1', 'coin2'],
                              [2, 2],
                              [0.25, 0.25, 0.25, 0.25])

Here, the first argument includes the names of the random variables. The second argument is a list of the number of states of each random variable. The third argument is a list of probability values, assuming that the first variable changes its states the slowest.
So, the preceding distribution represents the following:

In [18]: print(distribution)
+---------+---------+------------------+
| coin1   | coin2   |  P(coin1,coin2)  |
+---------+---------+------------------+
| coin1_0 | coin2_0 |  0.2500          |
+---------+---------+------------------+
| coin1_0 | coin2_1 |  0.2500          |
+---------+---------+------------------+
| coin1_1 | coin2_0 |  0.2500          |
+---------+---------+------------------+
| coin1_1 | coin2_1 |  0.2500          |
+---------+---------+------------------+

We can also conduct independence queries over these distributions in pgmpy:

In [19]: distribution.check_independence('coin1', 'coin2')
Out[19]: True

Conditional probability distribution

Let's take an example to understand conditional probability better. Let's say we have a bag containing three apples and five oranges, and we want to randomly take out fruits from the bag one at a time without replacing them. Also, let the random variables X1 and X2 represent the outcomes of the first try and the second try respectively. So, as there are three apples and five oranges in the bag initially, P(X1 = apple) = 3/8 and P(X1 = orange) = 5/8. Now, let's say that in our first attempt we got an orange. Now, we cannot simply represent the probability of getting an apple or orange in our second attempt. The probabilities in the second attempt will depend on the outcome of our first attempt and therefore, we use conditional probability to represent such cases. Now, in the second attempt, we will have the following probabilities that depend on the outcome of our first try: P(X2 = apple | X1 = orange) = 3/7, P(X2 = orange | X1 = orange) = 4/7, P(X2 = apple | X1 = apple) = 2/7, and P(X2 = orange | X1 = apple) = 5/7. The Conditional Probability Distribution (CPD) of two variables X1 and X2 can be represented as P(X1 | X2), representing the probability of X1 given X2, that is, the probability of X1 after the event X2 has occurred and we know its outcome. Similarly, we can have P(X2 | X1), representing the probability of X2 after having an observation for X1. The simplest representation of a CPD is the tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of different states of the random variables and the probabilities corresponding to these states. Let's consider the earlier restaurant example. Let's begin by representing the marginal distribution of the quality of food with Q. As we mentioned earlier, it can be categorized into three values {good, bad, average}. For example, P(Q) can be represented in the tabular form as follows:

Quality   P(Q)
Good      0.3
Normal    0.5
Bad       0.2

Similarly, let's say P(L) is the probability distribution of the location of the restaurant. Its CPD can be represented as follows:

Location  P(L)
Good      0.6
Bad       0.4

As the cost of the restaurant C depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C, given Q and L:

Location       Good                     Bad
Quality        Good    Normal   Bad     Good    Normal   Bad
Cost = High    0.8     0.6      0.1     0.6     0.6      0.05
Cost = Low     0.2     0.4      0.9     0.4     0.4      0.95

Representing CPDs using pgmpy

Let's first see how to represent the tabular CPD using pgmpy for variables that have no conditional variables:

In [1]: from pgmpy.factors import TabularCPD

# For creating a TabularCPD object we need to pass three
# arguments: the variable name, its cardinality (that is, the number
# of states of the random variable), and the probability values
# corresponding to each state.
In [2]: quality = TabularCPD(variable='Quality',
                             variable_card=3,
                             values=[[0.3], [0.5], [0.2]])
In [3]: print(quality)
+----------------+-----+
| ['Quality', 0] | 0.3 |
+----------------+-----+
| ['Quality', 1] | 0.5 |
+----------------+-----+
| ['Quality', 2] | 0.2 |
+----------------+-----+

In [4]: quality.variables
Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0),
                                  State(var='Quality', state=1),
                                  State(var='Quality', state=2)])])

In [5]: quality.cardinality
Out[5]: array([3])

In [6]: quality.values
Out[6]: array([0.3, 0.5, 0.2])

You can see here that the values of the CPD are a 1D array instead of the 2D array which you passed as an argument. Actually, pgmpy internally stores the values of the TabularCPD as a flattened numpy array.

In [7]: location = TabularCPD(variable='Location',
                              variable_card=2,
                              values=[[0.6], [0.4]])
In [8]: print(location)
+-----------------+-----+
| ['Location', 0] | 0.6 |
+-----------------+-----+
| ['Location', 1] | 0.4 |
+-----------------+-----+

However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable:

In [9]: cost = TabularCPD(
                    variable='Cost',
                    variable_card=2,
                    values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05],
                            [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]],
                    evidence=['Q', 'L'],
                    evidence_card=[3, 2])

Graph theory

The second major framework for the study of probabilistic graphical models is graph theory. Graphs are the skeleton of PGMs, and are used to compactly encode the independence conditions of a probability distribution.

Nodes and edges

The foundation of graph theory was laid by Leonhard Euler when he solved the famous Seven Bridges of Konigsberg problem. The city of Konigsberg was set on both sides of the Pregel river and included two islands that were connected and maintained by seven bridges. The problem was to find a walk that crosses each of the bridges exactly once. To visualize the problem, let's think of the graph in Fig 1.1: Fig 1.1: The Seven Bridges of Konigsberg graph. Here, the nodes a, b, c, and d represent the land, and are known as the vertices of the graph. The line segments ab, bc, cd, da, ab, and bc connecting the land parts are the bridges and are known as the edges of the graph. So, we can think of the problem of crossing all the bridges once in a single walk as tracing along all the edges of the graph without lifting our pencils. Formally, a graph G = (V, E) is an ordered pair of finite sets. The elements of the set V are known as the nodes or the vertices of the graph, and the elements of the set E are the edges or the arcs of the graph. The number of nodes, or cardinality of G, denoted by |V|, is known as the order of the graph. Similarly, the number of edges, denoted by |E|, is known as the size of the graph. Here, we can see that the Konigsberg city graph shown in Fig 1.1 is of order 4 and size 7. In a graph, we say that two vertices u, v ∈ V are adjacent if (u, v) ∈ E. In the City graph, for example, a and b are adjacent because there is an edge connecting them. Also, for a vertex v ∈ V, we define the neighbor set of v as N(v) = {u ∈ V : (u, v) ∈ E}.
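These definitions are easy to check in code. The sketch below stores the City graph's bridges as a plain list of edges and derives the neighbor sets and degrees from it. Note that the text lists only six of the seven bridges explicitly, so the edge between b and d is our assumption, chosen so that the order (4) and size (7) stated above, and the neighbor sets discussed next, all agree:

from collections import defaultdict

# The seven bridges as a multigraph: parallel bridges are simply repeated edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"),
         ("a", "b"), ("b", "c"),
         ("b", "d")]   # assumed seventh bridge, not listed explicitly in the text

vertices = {"a", "b", "c", "d"}
print(len(vertices), len(edges))   # order 4, size 7

# Neighbor sets N(v) and degrees; the graph is undirected, so each edge
# contributes to both of its endpoints.
neighbors = defaultdict(set)
degree = defaultdict(int)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)
    degree[u] += 1
    degree[v] += 1

print(sorted(neighbors["c"]))   # ['b', 'd']
print(sorted(neighbors["d"]))   # ['a', 'b', 'c']
print(dict(degree))             # {'a': 3, 'b': 5, 'c': 3, 'd': 3}: every vertex has
                                # odd degree, which is why no walk can cross each
                                # bridge exactly once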
In the City graph, we can see that b and d are neighbors of c. Similarly, a, b, and c are neighbors of d. We define an edge to be a self loop if the start vertex and the end vertex of the edge are the same. More formally, any edge of the form (u, u), where u ∈ V, is a self loop. Until now, we have been talking only about graphs whose edges don't have a direction associated with them, which means that the edge (u, v) is the same as the edge (v, u). These types of graphs are known as undirected graphs. Similarly, we can think of a graph whose edges have a sense of direction associated with them. For these graphs, the edge set E would be a set of ordered pairs of vertices. These types of graphs are known as directed graphs. In the case of a directed graph, we also define the indegree and outdegree for a vertex. For a vertex v ∈ V, we define its outdegree as the number of edges originating from the vertex v, that is, d_out(v) = |{(v, u) : (v, u) ∈ E}|. Similarly, the indegree is defined as the number of edges that end at the vertex v, that is, d_in(v) = |{(u, v) : (u, v) ∈ E}|.

Walk, paths, and trails

For a graph G = (V, E) and u, v ∈ V, we define a u - v walk as an alternating sequence of vertices and edges, starting with u and ending with v. In the City graph of Fig 1.1, we can have an example of an a - d walk as W : a, (a, b), b, (b, c), c, (c, d), d. If there aren't multiple edges between the same vertices, then we simply represent a walk by a sequence of vertices. As in the case of the Butterfly graph shown in Fig 1.2, we can have a walk W : a, c, d, c, e: Fig 1.2: Butterfly graph—an undirected graph. A walk with no repeated edges is known as a trail. For example, the walk a, b, c, d, a in the City graph is a trail. Also, a walk with no repeated vertices, except possibly the first and the last, is known as a path. For example, the walk a, b, c, d in the City graph is a path. Also, a graph is known as cyclic if there are one or more paths that start and end at the same node. Such paths are known as cycles. Similarly, if there are no cycles in a graph, it is known as an acyclic graph.

Bayesian models

In most of the real-life cases when we are representing or modeling some event, we are dealing with a lot of random variables. Even if we consider all the random variables to be discrete, there would still be an exponentially large number of values in the joint probability distribution. Dealing with such a huge amount of data would be computationally expensive (and in some cases, even intractable), and would also require a huge amount of memory to store the probability of each combination of states of these random variables. However, in most of the cases, many of these variables are marginally or conditionally independent of each other. By exploiting these independencies, we can reduce the number of values we need to store to represent the joint probability distribution. For instance, in the previous restaurant example, the joint probability distribution across the four random variables that we discussed (that is, quality of food Q, location of restaurant L, cost of food C, and the number of people visiting N) would require us to store 23 independent values (3 × 2 × 2 × 2 − 1 = 23, with two states each for C and N). By the chain rule of probability, we know the following:

P(Q, L, C, N) = P(Q) P(L|Q) P(C|L, Q) P(N|C, Q, L)

Now, let us try to exploit the marginal and conditional independence between the variables, to make the representation more compact. Let's start by considering the independency between the location of the restaurant and the quality of food over there. As both of these attributes are independent of each other, P(L|Q) would be the same as P(L).
Therefore, we need to store only one parameter to represent it. From the conditional independence that we have seen earlier, we know that (N ⊥ Q | C, L). Thus, P(N|C, Q, L) would be the same as P(N|C, L), thus needing only four parameters. Therefore, we now need only (2 + 1 + 6 + 4 = 13) parameters to represent the whole distribution. We can conclude that exploiting independencies helps in the compact representation of the joint probability distribution. This forms the basis for the Bayesian network.

Representation

A Bayesian network is represented by a Directed Acyclic Graph (DAG) and a set of Conditional Probability Distributions (CPDs) in which: The nodes represent random variables The edges represent dependencies For each of the nodes, we have a CPD In our previous restaurant example, the nodes would be as follows: Quality of food (Q) Location (L) Cost of food (C) Number of people (N) As the cost of food was dependent on the quality of food (Q) and the location of the restaurant (L), there will be an edge each from Q → C and L → C. Similarly, as the number of people visiting the restaurant depends on the price of food and its location, there would be an edge each from L → N and C → N. The resulting structure of our Bayesian network is shown in Fig 1.3: Fig 1.3: Bayesian network for the restaurant example

Factorization of a distribution over a network

Each node in our Bayesian network for restaurants has a CPD associated with it. For example, the CPD for the cost of food in the restaurant is P(C|Q, L), as it only depends on the quality of food and location. For the number of people, it would be P(N|C, L). So, we can generalize that the CPD associated with each node would be P(node | Par(node)), where Par(node) denotes the parents of the node in the graph. Assuming some probability values, we will finally get a network as shown in Fig 1.4: Fig 1.4: Bayesian network of the restaurant along with CPDs. Let us go back to the joint probability distribution of all these attributes of the restaurant again. Considering the independencies among the variables, we concluded as follows:

P(Q, C, L, N) = P(Q) P(L) P(C|Q, L) P(N|C, L)

So now, looking into the Bayesian network (BN) for the restaurant, we can say that for any Bayesian network, the joint probability distribution P(X1, X2, ..., Xn) over all its random variables {X1, X2, ..., Xn} can be represented as follows:

P(X1, X2, ..., Xn) = ∏ P(Xi | ParG(Xi))

This is known as the chain rule for Bayesian networks. Also, we say that a distribution P factorizes over a graph G if P can be encoded as:

P(X1, X2, ..., Xn) = ∏ P(Xi | ParG(Xi))

Here, ParG(X) denotes the parents of X in the graph G.

Summary

In this article, we saw how we can represent a complex joint probability distribution using a directed graph and a conditional probability distribution associated with each node, which is collectively known as a Bayesian network.

Resources for Article:   Further resources on this subject: Web Scraping with Python [article] Exact Inference Using Graphical Models [article] wxPython: Design Approaches and Techniques [article]
Setting Up Synchronous Replication

Packt
10 Aug 2015
17 min read
In this article by the author, Hans-Jürgen Schönig, of the book, PostgreSQL Replication, Second Edition, we learn how to set up synchronous replication. In asynchronous replication, data is submitted and received by the slave (or slaves) after the transaction has been committed on the master. During the time between the master's commit and the point when the slave actually has fully received the data, it can still be lost. Here, you will learn about the following topics: Making sure that no single transaction can be lost Configuring PostgreSQL for synchronous replication Understanding and using application_name The performance impact of synchronous replication Optimizing replication for speed Synchronous replication can be the cornerstone of your replication setup, providing a system that ensures zero data loss. (For more resources related to this topic, see here.) Synchronous replication setup Synchronous replication has been made to protect your data at all costs. The core idea of synchronous replication is that a transaction must be on at least two servers before the master returns success to the client. Making sure that data is on at least two nodes is a key requirement to ensure no data loss in the event of a crash. Setting up synchronous replication works just like setting up asynchronous replication. Just a handful of parameters discussed here have to be changed to enjoy the blessings of synchronous replication. However, if you are about to create a setup based on synchronous replication, we recommend getting started with an asynchronous setup and gradually extending your configuration and turning it into synchronous replication. This will allow you to debug things more easily and avoid problems down the road. Understanding the downside to synchronous replication The most important thing you have to know about synchronous replication is that it is simply expensive. Synchronous replication and its downsides are two of the core reasons for which we have decided to include all this background information in this book. It is essential to understand the physical limitations of synchronous replication, otherwise you could end up in deep trouble. When setting up synchronous replication, try to keep the following things in mind: Minimize the latency Make sure you have redundant connections Synchronous replication is more expensive than asynchronous replication Always cross-check twice whether there is a real need for synchronous replication In many cases, it is perfectly fine to lose a couple of rows in the event of a crash. Synchronous replication can safely be skipped in this case. However, if there is zero tolerance, synchronous replication is a tool that should be used. Understanding the application_name parameter In order to understand a synchronous setup, a config variable called application_name is essential, and it plays an important role in a synchronous setup. In a typical application, people use the application_name parameter for debugging purposes, as it allows users to assign a name to their database connection. It can help track bugs, identify what an application is doing, and so on: test=# SHOW application_name; application_name ------------------ psql (1 row)   test=# SET application_name TO 'whatever'; SET test=# SHOW application_name; application_name ------------------ whatever (1 row) As you can see, it is possible to set the application_name parameter freely. The setting is valid for the session we are in, and will be gone as soon as we disconnect. 
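As a side note, application_name is an ordinary libpq connection parameter, so a client application can also set it at connection time instead of running SET. The sketch below uses the psycopg2 driver purely as an illustration; the host, port, database, and user are placeholders and this snippet is not part of the replication setup itself:

import psycopg2

# application_name is passed like any other libpq parameter in the DSN.
conn = psycopg2.connect(
    "host=localhost port=5432 dbname=test user=hs application_name=whatever"
)

cur = conn.cursor()
cur.execute("SHOW application_name")
print(cur.fetchone()[0])   # whatever

cur.close()
conn.close()

For replication, what matters is the application_name passed in the slave's primary_conninfo, which we will configure next; this is simply the same mechanism seen from a regular client.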
The question now is: What does application_name have to do with synchronous replication? Well, the story goes like this: if this application_name value happens to be part of synchronous_standby_names, the slave will be a synchronous one. In addition to that, to be a synchronous standby, it has to be: connected streaming data in real-time (that is, not fetching old WAL records) Once a standby becomes synced, it remains in that position until disconnection. In the case of cascaded replication (which means that a slave is again connected to a slave), the cascaded slave is not treated synchronously anymore. Only the first server is considered to be synchronous. With all of this information in mind, we can move forward and configure our first synchronous replication. Making synchronous replication work To show you how synchronous replication works, this article will include a full, working example outlining all the relevant configuration parameters. A couple of changes have to be made to the master. The following settings will be needed in postgresql.conf on the master: wal_level = hot_standby max_wal_senders = 5   # or any number synchronous_standby_names = 'book_sample' hot_standby = on # on the slave to make it readable Then we have to adapt pg_hba.conf. After that, the server can be restarted and the master is ready for action. We recommend that you set wal_keep_segments as well to keep more transaction logs. We also recommend setting wal_keep_segments to keep more transaction logs on the master database. This makes the entire setup way more robust. It is also possible to utilize replication slots. In the next step, we can perform a base backup just as we have done before. We have to call pg_basebackup on the slave. Ideally, we already include the transaction log when doing the base backup. The --xlog-method=stream parameter allows us to fire things up quickly and without any greater risks. The --xlog-method=stream and wal_keep_segments parameters are a good combo, and in our opinion, should be used in most cases to ensure that a setup works flawlessly and safely. We have already recommended setting hot_standby on the master. The config file will be replicated anyway, so you save yourself one trip to postgresql.conf to change this setting. Of course, this is not fine art but an easy and pragmatic approach. Once the base backup has been performed, we can move ahead and write a simple recovery.conf file suitable for synchronous replication, as follows: iMac:slavehs$ cat recovery.conf primary_conninfo = 'host=localhost                    application_name=book_sample                    port=5432'   standby_mode = on The config file looks just like before. The only difference is that we have added application_name to the scenery. Note that the application_name parameter must be identical to the synchronous_standby_names setting on the master. Once we have finished writing recovery.conf, we can fire up the slave. In our example, the slave is on the same server as the master. In this case, you have to ensure that those two instances will use different TCP ports, otherwise the instance that starts second will not be able to fire up. The port can easily be changed in postgresql.conf. After these steps, the database instance can be started. The slave will check out its connection information and connect to the master. Once it has replayed all the relevant transaction logs, it will be in synchronous state. The master and the slave will hold exactly the same data from then on. 
Checking the replication

Now that we have started the database instance, we can connect to the system and see whether things are working properly. To check for replication, we can connect to the master and take a look at pg_stat_replication. For this check, we can connect to any database inside our (master) instance, as follows:

postgres=# \x
Expanded display is on.
postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 62871
usesysid         | 10
usename          | hs
application_name | book_sample
client_addr      | ::1
client_hostname  |
client_port      | 59235
backend_start    | 2013-03-29 14:53:52.352741+01
state            | streaming
sent_location    | 0/30001E8
write_location   | 0/30001E8
flush_location   | 0/30001E8
replay_location  | 0/30001E8
sync_priority    | 1
sync_state       | sync

This system view will show exactly one line per slave attached to your master system. The \x command will make the output more readable for you. If you don't use \x to transpose the output, the lines will be so long that it will be pretty hard for you to comprehend the content of this table. In expanded display mode, each column will be in one line instead. You can see that the application_name parameter has been taken from the connect string passed to the master by the slave (which is book_sample in our example). As the application_name parameter matches the master's synchronous_standby_names setting, we have convinced the system to replicate synchronously. No transaction can be lost anymore because every transaction will end up on two servers instantly. The sync_state setting will tell you precisely how data is moving from the master to the slave. You can also use a list of application names, or simply a * sign in synchronous_standby_names to indicate that the first slave has to be synchronous.

Understanding performance issues

At various points in this book, we have already pointed out that synchronous replication is an expensive thing to do. Remember that we have to wait for a remote server and not just the local system. The network between those two nodes is definitely not something that is going to speed things up. Writing to more than one node is always more expensive than writing to only one node. Therefore, we definitely have to keep an eye on speed, otherwise we might face some pretty nasty surprises. Consider what you have learned about the CAP theory earlier in this book. Synchronous replication is exactly where the physical limitations will have a serious impact on performance. The main question you really have to ask yourself is: do I really want to replicate all transactions synchronously? In many cases, you don't. To prove our point, let's imagine a typical scenario: a bank wants to store accounting-related data as well as some logging data. We definitely don't want to lose a couple of million dollars just because a database node goes down. This kind of data might be worth the effort of replicating synchronously. The logging data is quite different, however. It might be far too expensive to cope with the overhead of synchronous replication. So, we want to replicate this data in an asynchronous way to ensure maximum throughput. How can we configure a system to handle important as well as not-so-important transactions nicely? The answer lies in a variable you have already seen earlier in the book—the synchronous_commit variable.
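To make the idea concrete before walking through the individual values, here is how an application might mix durability levels per transaction. This is only a sketch (psycopg2 is used for illustration, and the table names are made up); the meaning of each synchronous_commit setting is explained in the sections that follow:

import psycopg2

conn = psycopg2.connect("host=localhost dbname=bank user=app")  # placeholder DSN
cur = conn.cursor()

# Important data: keep the default, fully synchronous behaviour.
cur.execute("INSERT INTO account_transfers (amount) VALUES (%s)", (1000000,))
conn.commit()

# Unimportant data: downgrade just this transaction to asynchronous commit.
cur.execute("SET LOCAL synchronous_commit TO local")
cur.execute("INSERT INTO access_log (line) VALUES (%s)", ("GET /index.html",))
conn.commit()

# SET LOCAL only lasts until the end of the transaction, so the next
# transaction is fully synchronous again.
cur.close()
conn.close()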
Setting synchronous_commit to on

In the default PostgreSQL configuration, synchronous_commit has been set to on. In this case, commits will wait until a reply from the current synchronous standby indicates that it has received the commit record of the transaction and has flushed it to the disk. In other words, both servers must report that the data has been written safely. Unless both servers crash at the same time, your data will survive potential problems (crashing of both servers should be pretty unlikely).

Setting synchronous_commit to remote_write

Flushing to both disks can be highly expensive. In many cases, it is enough to know that the remote server has accepted the XLOG and passed it on to the operating system without flushing things to the disk on the slave. As we can be pretty certain that we don't lose two servers at the very same time, this is a reasonable compromise between performance and consistency with respect to data protection.

Setting synchronous_commit to off

The idea is to delay WAL writing to reduce disk flushes. This can be used if performance is more important than durability. In the case of replication, it means that we are not replicating in a fully synchronous way. Keep in mind that this can have a serious impact on your application. Imagine a transaction committing on the master and you wanting to query that data instantly on one of the slaves. There would still be a tiny window during which you can actually get outdated data.

Setting synchronous_commit to local

The local value will flush locally but not wait for the replica to respond. In other words, it will turn your transaction into an asynchronous one. Setting synchronous_commit to local can also cause a small time delay window, during which the slave can actually return slightly outdated data. This phenomenon has to be kept in mind when you decide to offload reads to the slave. In short, if you want to replicate synchronously, you have to ensure that synchronous_commit is set to either on or remote_write.

Changing durability settings on the fly

Changing the way data is replicated on the fly is easy and highly important to many applications, as it allows the user to control durability on the fly. Not all data has been created equal, and therefore, more important data should be written in a safer way than data that is not as important (such as log files). We have already set up a full synchronous replication infrastructure by adjusting synchronous_standby_names (master) along with the application_name (slave) parameter. The good thing about PostgreSQL is that you can change your durability requirements on the fly:

test=# BEGIN;
BEGIN
test=# CREATE TABLE t_test (id int4);
CREATE TABLE
test=# SET synchronous_commit TO local;
SET
test=# \x
Expanded display is on.
test=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 62871
usesysid         | 10
usename          | hs
application_name | book_sample
client_addr      | ::1
client_hostname  |
client_port      | 59235
backend_start    | 2013-03-29 14:53:52.352741+01
state            | streaming
sent_location    | 0/3026258
write_location   | 0/3026258
flush_location   | 0/3026258
replay_location  | 0/3026258
sync_priority    | 1
sync_state       | sync

test=# COMMIT;
COMMIT

In this example, we changed the durability requirements on the fly. This will make sure that this very specific transaction will not wait for the slave to flush to the disk. Note, as you can see, sync_state has not changed.
Don't be fooled by what you see here; you can completely rely on the behavior outlined in this section. PostgreSQL is perfectly able to handle each transaction separately. This is a unique feature of this wonderful open source database; it puts you in control and lets you decide which kind of durability requirements you want.

Understanding the practical implications and performance

We have already talked about the practical implications as well as the performance implications, but what good is theory without a practical example? Let's do a simple benchmark and see how replication behaves. We are performing this kind of testing to show you that the various levels of durability are not just a minor topic; they are the key to performance.

Let's assume a simple test: in the following scenario, we have connected two equally powerful machines (3 GHz, 8 GB RAM) over a 1 Gbit network. The two machines are next to each other. To demonstrate the impact of synchronous replication, we have left shared_buffers and all other memory parameters at their defaults, and only changed fsync to off to make sure that the effect of disk wait is reduced to practically zero.

The test is simple: we use a one-column table with only one integer field and 10,000 single transactions, each consisting of just one INSERT statement:

INSERT INTO t_test VALUES (1);

We can try this with full, synchronous replication (synchronous_commit = on):

real 0m6.043s
user 0m0.131s
sys  0m0.169s

As you can see, the test has taken around 6 seconds to complete. The test can be repeated with synchronous_commit = local now (which effectively means asynchronous replication):

real 0m0.909s
user 0m0.101s
sys  0m0.142s

In this simple test, you can see that the speed has gone up by as much as six times. Of course, this is a brute-force example, which does not fully reflect reality (that was not the goal anyway). What is important to understand, however, is that synchronous versus asynchronous replication is not a matter of a couple of percentage points. This should stress our point even more: replicate synchronously only if it is really needed, and if you really have to use synchronous replication, make sure that you limit the number of synchronous transactions to an absolute minimum.

Also, please make sure that your network is up to the job. Replicating data synchronously over network connections with high latency will kill your system performance like nothing else. Keep in mind that throwing expensive hardware at the problem will not solve the problem. Doubling the clock speed of your servers will do practically nothing for you, because the real limitation will always come from network latency. The performance penalty with just one connection is definitely a lot larger than that with many connections. Remember that things can be done in parallel, and network latency does not make us more I/O or CPU bound, so we can reduce the impact of slow transactions by firing up more concurrent work.

When synchronous replication is used, how can you still make sure that performance does not suffer too much? Basically, there are a couple of important suggestions that have proven to be helpful:

- Use longer transactions: Remember that the system must ensure on commit that the data is available on two servers. We don't care what happens in the middle of a transaction, because anybody outside our transaction cannot see the data anyway. A longer transaction will dramatically reduce network communication.
- Run stuff concurrently: If you have more than one transaction going on at the same time, it will be beneficial to performance. The reason is that the remote server will return the position inside the XLOG that is considered to be processed safely (flushed or accepted). This method ensures that many transactions can be confirmed at the same time.

Redundancy and stopping replication

When talking about synchronous replication, there is one phenomenon that must not be left out. Imagine we have a two-node cluster replicating synchronously. What happens if the slave dies? The answer is that the master cannot easily distinguish between a slow and a dead slave, so it will start waiting for the slave to come back.

At first glance, this looks like nonsense, but if you think about it more deeply, you will figure out that waiting is actually the only correct thing a synchronous setup can do. If somebody decides to go for synchronous replication, the data in the system must be worth something, so it must not be put at risk. It is better to refuse data and cry out to the end user than to risk data and silently ignore the requirements of high durability.

If you decide to use synchronous replication, you must consider using at least three nodes in your cluster. Otherwise, it will be very risky, and you cannot afford to lose a single node without facing significant downtime or risking data loss.

Summary

Here, we outlined the basic concept of synchronous replication and showed how data can be replicated synchronously. We also showed how durability requirements can be changed on the fly by modifying PostgreSQL runtime parameters. PostgreSQL gives users the choice of how a transaction should be replicated, and which level of durability is necessary for a certain transaction.

Resources for Article:

Further resources on this subject:
- Introducing PostgreSQL 9 [article]
- PostgreSQL – New Features [article]
- Installing PostgreSQL [article]

Oracle GoldenGate 12c — An Overview

Packt
10 Aug 2015
21 min read
In this article by John P Jeffries, author of the book Oracle GoldenGate 12c Implementer's Guide, we get an introduction to Oracle GoldenGate through a description of the key components, processes, and considerations required to build and implement a GoldenGate solution. John tells you how to address some of the issues that influence the decision-making process when you design a GoldenGate solution, and focuses on the additional configuration options available in Oracle GoldenGate 12c.

(For more resources related to this topic, see here.)

12c new features

Oracle has provided some exciting new features in their 12c version of GoldenGate, some of which we have already touched upon. Following the official desupport of Oracle Streams in Oracle Database 12c, Oracle has essentially migrated some of the key features to its strategic product. You will find that GoldenGate now has a tighter integration with the Oracle database, enabling enhanced functionality. Let's explore some of the new features available in Oracle GoldenGate 12c.

Integrated capture

Integrated capture has been available since Oracle GoldenGate 11gR2 with Oracle Database 11g (11.2.0.3). Originally decoupled from the database, GoldenGate's new architecture provides the option to integrate its Extract process(es) with the Oracle database. This enables GoldenGate to access the database's data dictionary and undo tablespace, providing replication support for advanced features and data types. Oracle GoldenGate 12c still supports the original Extract configuration, known as Classic Capture.

Integrated Replicat

Integrated Replicat is a new feature in Oracle GoldenGate 12c for the delivery of data to Oracle Database 11g (11.2.0.4) or 12c. This performance enhancement provides better scalability and load balancing by leveraging the database's parallel apply servers for automatic, dependency-aware parallel Replicat processes. With Integrated Replicat, there is no need for users to manually split the delivery process into multiple threads and manage multiple parameter files. GoldenGate now uses a lightweight streaming API to prepare, coordinate, and apply the data to the downstream database. Oracle GoldenGate 12c still supports the original Replicat configuration, known as Classic Delivery.

Downstream capture

Downstream capture was one of my favorite Oracle Streams features. It allows for a combined in-memory capture and apply process that achieves very low latency, even in heavy data load situations. Like Streams, GoldenGate builds on this feature by employing a real-time downstream capture process. This method uses Oracle Data Guard's log transportation mechanism, which writes changed data to standby redo logs. It provides a best-of-both-worlds approach, enabling a real-time mine configuration that falls back to archive log mining when the apply process cannot keep up. In addition, the real-time mine process is re-enabled automatically when the data throughput drops.

Installation

One of the major changes in Oracle GoldenGate 12c is the installation method. Like other Oracle products, Oracle GoldenGate 12c is now installed using the Java-based Oracle Universal Installer (OUI) in either interactive or silent mode. OUI reads the Oracle Inventory on your system to discover existing installations (Oracle Homes), allowing you to install, deinstall, or clone software products.

Upgrading to 12c

Whether you wish to upgrade your current GoldenGate installation from Oracle GoldenGate 11g Release 2 or from an earlier version, the steps are the same.
Simply stop all the running GoldenGate processes on your database server, back up the GoldenGate home, and then use OUI to perform a fresh installation. It is important to note, however, that when restarting replication, you should ensure the capture process begins from the point at which it was gracefully stopped, to guard against losing synchronization data.

Multitenant database replication

As the version suggests, Oracle GoldenGate 12c now supports data replication for Oracle Database 12c. Those familiar with the 12c database features will be aware of the multitenant container database (CDB) that provides database consolidation. Each CDB consists of a root container and one or more pluggable databases (PDBs). A PDB can contain multiple schemas and objects, just like a conventional database that GoldenGate replicates data to and from. The GoldenGate Extract process pulls data from multiple PDBs or containers on the source, combining the changed data into a single trail file. Replicat, however, is split into multiple process groups in order to apply the changes to each target PDB.

Coordinated Delivery

The Coordinated Delivery option applies to the GoldenGate Replicat process when configured in classic mode. It provides a performance gain by automatically splitting the delivered data from a remote trail file into multiple threads that are then applied to the target database in parallel. GoldenGate manages the coordination across selected events that require ordering, including DDL, primary key updates, event marker interface (EMI), and SQLEXEC. Coordinated Delivery can be used with both Oracle (from version 11.2.0.4) and non-Oracle databases.

Event-based processing

In GoldenGate 12c, event-based processing has been enhanced to allow specific events to be captured and acted upon automatically through an EMI. SQLEXEC provides the API to the EMI, enabling programmatic execution of tasks following an event. It is now possible, for example, to detect the start of a batch job or large transaction, trap the SQL statement(s), and ignore the subsequent multiple change records until the end of the source system transaction. The original DML can then be replayed on the target database as one transaction. This is a major step forward in performance tuning for data replication.

Enhanced security

Recent versions of GoldenGate have included security features such as the encryption of passwords and data. Oracle GoldenGate 12c now supports a credential store, better known as an Oracle wallet, that securely stores an alias associated with a username and password. The alias is then referenced in the GoldenGate parameter files rather than the actual username and password.

Conflict Detection and Resolution

In earlier versions of GoldenGate, Conflict Detection and Resolution (CDR) was somewhat lightweight and not readily available out of the box. Although conflict resolution was available in Oracle Streams, the GoldenGate administrator had to resolve any data conflict in the replication process programmatically, using GoldenGate's built-in tools. In the 12c version, the feature has emerged as an easily configurable option through Extract and Replicat parameters.

Dynamic Rollback

Selective backout of applied transactions is now possible using the Dynamic Rollback feature. The feature operates at the table and record level and supports point-in-time recovery.
This potentially eliminates the need for a full database restore following data corruption, erroneous deletions, or perhaps the removal of test data, thus avoiding hours of system downtime.

Streams to GoldenGate migration

Oracle Streams users can now migrate their data replication solution to Oracle GoldenGate 12c using a purpose-built utility. This is a welcome feature, given that Streams is no longer supported in Oracle Database 12c. The Streams2ogg tool auto-generates Oracle GoldenGate configuration files, which greatly simplifies the effort required in the migration process.

Performance

With today's demand for real-time access to real-time data, high performance is key. For example, businesses will no longer wait for information to arrive on their decision support systems (DSS) to make decisions, and users will expect the latest information to be available in the public cloud. Data has value and must be delivered in real time to meet the demand.

So, how long does it take to replicate a transaction from the source database to its target? This is known as end-to-end latency, which typically has a threshold that must not be breached in order to satisfy a predefined Service Level Agreement (SLA). GoldenGate refers to latency as lag, which can be measured at different intervals in the replication process. They are as follows:

- Source to Extract: The time taken for a record to be processed by the Extract, compared to the commit timestamp on the database
- Replicat to target: The time taken for the last record to be processed by the Replicat process, compared to the record creation time in the trail file

A well-designed system may still encounter spikes in latency, but lag should never be continuously high or growing. Peaks are typically caused by load on the source database system, where latency increases with the number of transactions per second. Lag should be measured as an average over a specified period. Trying to tune GoldenGate when the design is poor is a difficult situation to be in. For the system to perform well, you may need to revisit the design.

Availability

Another important NFR is availability. Normally quoted as a percentage, it states how long the system must be available over a specified period. For example, an NFR of 99.9 percent availability equates to a downtime of 8.76 hours in a year, which sounds like quite a lot, especially if it were to occur all at once. Oracle's maximum availability architecture (MAA) offers enhanced availability through products such as Real Application Clusters (RAC) and Active Data Guard (ADG). However, as we previously described, the network plays a major role in data replication. The NFR relates to the whole system, so you need to be sure your design covers redundancy for all components.

Event-based processing

It is important in any data replication environment to capture and manage events, such as trail records containing specific data or operations, or maybe the occurrence of a certain error. These are known as Event Markers. GoldenGate provides a mechanism to perform an action on a given event or condition. These are known as Event Actions and are triggered by Event Records. If you are familiar with Oracle Streams, Event Actions are similar to rules.

The Event Marker System

GoldenGate's Event Marker System, also known as the event marker interface (EMI), allows custom DML-driven processing on an event. It comprises an Event Record that triggers a given action.
An Event Record can be either a trail record that satisfies a condition evaluated by a WHERE or FILTER clause, or a record written to an event table that enables an action to occur. Typical actions are writing status information, reporting errors, ignoring certain records in a trail, invoking a shell script, or performing an administrative task.

The following Replicat code describes the process of capturing an event and performing an action, in this case logging DELETE operations made against the CREDITCARD_ACCOUNTS table using the EVENTACTIONS parameter:

MAP SRC.CREDITCARD_ACCOUNTS, TARGET TGT.CREDITCARD_ACCOUNTS_DIM;
TABLE SRC.CREDITCARD_ACCOUNTS, &
FILTER (@GETENV ('GGHEADER', 'OPTYPE') = 'DELETE'), &
EVENTACTIONS (LOG INFO);

By default, all logged information is written to the process group report file, the GoldenGate error log, and the system messages file. On Linux, this is the /var/log/messages file. Note that the TABLE parameter is also used in the Replicat's parameter file. This is a means of triggering an Event Action to be executed by the Replicat when it encounters an Event Marker.

The following code shows the use of the IGNORE option, which prevents certain records from being extracted or replicated; this is particularly useful to filter out system-type data. When used with the TRANSACTION option, the whole transaction, and not just the Event Record, is ignored:

TABLE SRC.CREDITCARD_ACCOUNTS, &
FILTER (@GETENV ('GGHEADER', 'OPTYPE') = 'DELETE'), &
EVENTACTIONS (IGNORE TRANSACTION);

The preceding code extends the previous example by stopping the Event Record itself from being replicated.

Using Event Actions to improve batch performance

All replication technologies typically suffer from one flaw: the way in which the data is replicated. Consider a table that is populated with a million rows as part of a batch process. This may be a bulk insert operation that Oracle completes on the source database as one transaction. However, Oracle will write each change to its redo logs as Logical Change Records (LCRs). GoldenGate will subsequently mine the logs, write the LCRs to a remote trail, convert each one back to DML, and apply them to the target database, one row at a time. The single source transaction becomes one million transactions, which causes a huge performance overhead.

To overcome this issue, we can use Event Actions to:

- Detect the DML statement (INSERT INTO TABLE SELECT ..)
- Ignore the data resulting from the SELECT part of the statement
- Replicate just the DML statement as an Event Record
- Execute just the DML statement on the target database

The solution requires a statement table on both the source and target databases to trigger the event. Also, both databases must be perfectly synchronized to avoid data integrity issues.

User tokens

User tokens are GoldenGate environment variables that are captured and stored in the trail record for replication. They can be accessed via the @GETENV function. We can use token data in column maps, in stored procedures called by SQLEXEC, and, of course, in macros.

Using user tokens to populate a heartbeat table

A vast array of user tokens exists in GoldenGate. Let's start by looking at a common method of replicating system information to populate a heartbeat table that can be used to monitor performance. We can use the TOKENS option of the Extract TABLE parameter to define a user token and associate it with GoldenGate environment data.
The following Extract configuration code shows the token declarations for the heartbeat table:

TABLE GGADMIN.GG_HB_OUT, &
TOKENS (EXTGROUP = @GETENV ("GGENVIRONMENT","GROUPNAME"), &
EXTTIME = @DATE ("YYYY-MM-DD HH:MI:SS.FFFFFF","JTS",@GETENV("JULIANTIMESTAMP")), &
EXTLAG = @GETENV ("LAG","SEC"), &
EXTSTAT_TOTAL = @GETENV ("DELTASTATS","DML"), &
), FILTER (@STREQ (EXTGROUP, @GETENV("GGENVIRONMENT","GROUPNAME")));

For the data pump, the example Extract configuration is shown here:

TABLE GGADMIN.GG_HB_OUT, &
TOKENS (PMPGROUP = @GETENV ("GGENVIRONMENT","GROUPNAME"), &
PMPTIME = @DATE ("YYYY-MM-DD HH:MI:SS.FFFFFF","JTS",@GETENV("JULIANTIMESTAMP")), &
PMPLAG = @GETENV ("LAG","SEC"));

Also, for the Replicat, the following configuration populates the heartbeat table on the target database with the token data derived from the Extract, data pump, and Replicat, containing system details and replication lag:

MAP GGADMIN.GG_HB_OUT_SRC, TARGET GGADMIN.GG_HB_IN_TGT, &
KEYCOLS (DB_NAME, EXTGROUP, PMPGROUP, REPGROUP), &
INSERTMISSINGUPDATES, &
COLMAP (USEDEFAULTS, &
ID = 0, &
SOURCE_COMMIT = @GETENV ("GGHEADER", "COMMITTIMESTAMP"), &
EXTGROUP = @TOKEN ("EXTGROUP"), &
EXTTIME = @TOKEN ("EXTTIME"), &
PMPGROUP = @TOKEN ("PMPGROUP"), &
PMPTIME = @TOKEN ("PMPTIME"), &
REPGROUP = @TOKEN ("REPGROUP"), &
REPTIME = @DATE ("YYYY-MM-DD HH:MI:SS.FFFFFF","JTS",@GETENV("JULIANTIMESTAMP")), &
EXTLAG = @TOKEN ("EXTLAG"), &
PMPLAG = @TOKEN ("PMPLAG"), &
REPLAG = @GETENV ("LAG","SEC"), &
EXTSTAT_TOTAL = @TOKEN ("EXTSTAT_TOTAL"));

As in the heartbeat table example, the defined user tokens can be called in a MAP statement using the @TOKEN function. The SOURCE_COMMIT and LAG metrics are self-explanatory. However, EXTSTAT_TOTAL, which is derived from DELTASTATS, is particularly useful for measuring the load on the source system when you evaluate latency peaks.

For applications, user tokens are useful for auditing data and trapping exceptions within the replicated data stream. Common user tokens are shown in the following code, which replicates the token data to five columns of an audit table:

MAP SRC.AUDIT_LOG, TARGET TGT.AUDIT_LOG, &
COLMAP (USEDEFAULTS, &
OSUSER = @TOKEN ("TKN_OSUSER"), &
DBNAME = @TOKEN ("TKN_DBNAME"), &
HOSTNAME = @TOKEN ("TKN_HOSTNAME"), &
TIMESTAMP = @TOKEN ("TKN_COMMITTIME"), &
BEFOREAFTERINDICATOR = @TOKEN ("TKN_BEFOREAFTERINDICATOR"));

The BEFOREAFTERINDICATOR environment variable is particularly useful for providing a status flag to check whether the data came from a Before or After image of an UPDATE or DELETE operation. By default, GoldenGate provides After images. To enable Before image extraction, the GETUPDATEBEFORES Extract parameter must be used on the source database.
The target AMOUNT column is therefore set to 0 when the equivalent source column is found to be missing or invalid; otherwise, a direct mapping occurs.

The @CASE function tests a list of values for a match and then returns a specified value. If no match is found, @CASE returns a default value. There is no limit to the number of cases to test; however, if the list is very large, a database lookup may be more appropriate. The following code shows the simplicity of the @CASE statement. Here, the country name is returned from the country code:

MAP SRC.CREDITCARD_STATEMENT, TARGET TGT.CREDITCARD_STATEMENT_DIM, &
COLMAP (USEDEFAULTS, &
COUNTRY = @CASE(COUNTRY_CODE, "UK", "United Kingdom", "USA", "United States of America"));

Other GoldenGate functions that perform tests exist: @EVAL and @VALONEOF. Similar to @CASE, @VALONEOF compares a column or string to a list of values. The difference is that it evaluates more than one value against a single column or string. When the following code is used with @IF, it returns "EUROPE" when TRUE and "UNKNOWN" when FALSE:

MAP SRC.CREDITCARD_STATEMENT, TARGET TGT.CREDITCARD_STATEMENT_DIM, &
COLMAP (USEDEFAULTS, &
REGION = @IF(@VALONEOF(COUNTRY_CODE, "UK", "E", "D"), "EUROPE", "UNKNOWN"));

The @EVAL function evaluates a list of conditions and returns a specified value. Optionally, if none are satisfied, it returns a default value. There is no limit to the number of evaluations you can list; however, it is best to list the most common evaluations at the beginning to enhance performance. The following code includes the BEFORE option, which compares the before value of the replicated source column to the current value of the target column. Depending on the evaluation, @EVAL will return "PAID MORE", "PAID LESS", or "PAID SAME":

MAP SRC.CREDITCARD_PAYMENTS, TARGET TGT.CREDITCARD_PAYMENTS, &
COLMAP (USEDEFAULTS, &
STATUS = @EVAL(AMOUNT < BEFORE.AMOUNT, "PAID LESS", AMOUNT > BEFORE.AMOUNT, "PAID MORE", AMOUNT = BEFORE.AMOUNT, "PAID SAME"));

The BEFORE option can be used with other GoldenGate functions, including the WHERE and FILTER clauses. However, for the Before image to be written to the trail and be available, the GETUPDATEBEFORES parameter must be enabled in the source database's Extract parameter file or the target database's Replicat parameter file, but not both. The GETUPDATEBEFORES parameter can be set globally for all tables defined in the Extract, or individually per table using GETUPDATEBEFORES and IGNOREUPDATEBEFORES, as seen in the following code:

EXTRACT EOLTP01
USERIDALIAS srcdb DOMAIN admin
SOURCECATALOG PDB1
EXTTRAIL ./dirdat/aa
GETAPPLOPS
IGNOREREPLICATES
GETUPDATEBEFORES
TABLE SRC.CHECK_PAYMENTS;
IGNOREUPDATEBEFORES
TABLE SRC.CHECK_PAYMENTS_STATUS;
TABLE SRC.CREDITCARD_ACCOUNTS;
TABLE SRC.CREDITCARD_PAYMENTS;

Tracing processes to find wait events

If you have worked with Oracle software, particularly in the performance tuning space, you will be familiar with tracing. Tracing enables additional information to be gathered from a given process or function to diagnose performance problems or even bugs. One example is the SQL trace, which can be enabled at the database session or system level to provide key information such as wait events and parse, fetch, and execute times.

Oracle GoldenGate 12c offers a similar tracing mechanism through the trace and trace2 options of the GGSCI SEND command. This is akin to a session-level SQL trace.
Also, in a similar fashion to a database system trace, tracing can be enabled in the GoldenGate process parameter files, which makes it permanent until the Extract or Replicat is stopped. trace provides processing information, whereas trace2 identifies the processes with wait events. The following commands show tracing being dynamically enabled for 2 minutes on a running Replicat process:

GGSCI (db12server02) 1> send ROLAP01 trace2 ./dirrpt/ROLAP01.trc

Wait for 2 minutes, then turn tracing off:

GGSCI (db12server02) 2> send ROLAP01 trace2 off
GGSCI (db12server02) 3> exit

To view the contents of the Replicat trace file, we can execute the following command. In the case of a coordinated Replicat, the trace file will contain information from all of its threads:

$ view dirrpt/ROLAP01.trc
statistics between 2015-08-08 Wed HKT 11:55:27 and 2015-08-08 Wed HKT 11:57:28
RPT_PROD_01.LIMIT_TP_RESP : n=2 : op=Insert; total=3; avg=1.5000; max=3msec
RPT_PROD_01.SUP_POOL_SMRY_HIST : n=1 : op=Insert; total=2; avg=2.0000; max=2msec
RPT_PROD_01.EVENTS : n=1 : op=Insert; total=2; avg=2.0000; max=2msec
RPT_PROD_01.DOC_SHIP_DTLS : n=17880 : op=FieldComp; total=22003; avg=1.2306; max=42msec
RPT_PROD_01.BUY_POOL_SMRY_HIST : n=1 : op=Insert; total=2; avg=2.0000; max=2msec
RPT_PROD_01.LIMIT_TP_LOG : n=2 : op=Insert; total=2; avg=1.0000; max=2msec
RPT_PROD_01.POOL_SMRY : n=1 : op=FieldComp; total=2; avg=2.0000; max=2msec
..
===============================================summary==============
Delete : n=2; total=2; avg=1.00;
Insert : n=78; total=356; avg=4.56;
FieldComp : n=85728; total=123018; avg=1.43;
total_op_num=85808 : total_op_time=123376 ms : total_avg_time=1.44 ms/op
total commit number=1

The trace file provides the following information:

- The table name
- The operation type (FieldComp is for a compressed field)
- The number of operations
- The average wait
- The maximum wait
- The summary totals

Armed with the preceding information, we can quickly see which operations against which tables are taking the longest time.

Exception handling

Oracle GoldenGate 12c now supports Conflict Detection and Resolution (CDR). However, out of the box, GoldenGate takes a catch-all approach to exception handling. For example, by default, should any operational failure occur, a Replicat process will ABEND and roll back the transaction to the last known checkpoint. This may not be ideal in a production environment. The HANDLECOLLISIONS and NOHANDLECOLLISIONS parameters can be used to control whether or not a Replicat process tries to resolve duplicate-record and missing-record errors. The way to determine what error occurred, and on which Replicat, is to create an exceptions handler.

Exception handling differs from CDR in that it traps and reports the Oracle errors suffered by the replicated data (DML and DDL). CDR, on the other hand, detects and resolves inconsistencies in the replicated data, such as mismatches with before and after images. Exceptions can always be trapped by the Oracle error they produce. GoldenGate provides an exception handler parameter called REPERROR that allows the Replicat to continue processing data after a predefined error. For example, we can include the following configuration in our Replicat parameter file to ignore ORA-00001 "unique constraint (%s.%s) violated":

REPERROR (DEFAULT, EXCEPTION)
REPERROR (DEFAULT2, ABEND)
REPERROR (-1, EXCEPTION)

Cloud computing

Cloud computing has grown enormously in recent years. Oracle has named its latest version of products 12c, the c standing for cloud, of course.
The architecture of Oracle Database 12c allows a multitenant container database to support multiple pluggable databases, a key enabler for cloud computing, rather than the inefficient schema consolidation typical of the previous Oracle database architecture, which is known to cause contention on shared resources during high load. The Oracle 12c architecture supports a database consolidation approach through its efficient memory management and dedicated background processes.

Online companies such as Amazon have leveraged the cloud concept by offering a Relational Database Service (RDS), which is becoming very popular for its speed of readiness, support, and low cost. Cloud environments are often huge, containing hundreds of servers, petabytes of storage, terabytes of memory, and countless CPU cores. The cloud has to support multiple applications in a multi-tiered, shared environment, often through virtualization technologies, where storage and CPUs are typically the driving factors for cost-effective options. Customers choose the hardware footprint that best suits their budget and system requirements, an approach commonly known as Platform as a Service (PaaS). Cloud computing is an extension of grid computing that offers both public and private clouds.

GoldenGate and Big Data

It is increasingly evident that organizations need to quickly access, analyze, and report on their data across the enterprise in order to stay agile in a competitive market. Data is becoming more of an asset to companies; it adds value to a business, but it may be stored in any number of current and legacy systems, making it difficult to realize its full potential. This is known as big data, and until recently it has been nearly impossible to perform real-time business analysis on the combined data from multiple sources. Nowadays, the ability to access all transactional data with low latency is essential. With the introduction of products such as Apache Hadoop, structured data from an RDBMS can be integrated with semi-structured and unstructured data, offering a common playing field to support business intelligence. When coupled with ODI, GoldenGate for Big Data provides real-time delivery to a suite of Apache products, such as Flume, HDFS, Hive, and HBase, to support big data analytics.

Summary

In this article, we provided an introduction to Oracle GoldenGate by describing the key components, processes, and considerations required to build and implement a GoldenGate solution.

Resources for Article:

Further resources on this subject:
- What is Oracle Public Cloud? [Article]
- Oracle GoldenGate- Advanced Administration Tasks - I [Article]
- Oracle B2B Overview [Article]

The Splunk Interface

Packt
10 Aug 2015
17 min read
In this article by Vincent Bumgarner and James D. Miller, authors of the book Implementing Splunk - Second Edition, we will walk through the most common elements in the Splunk interface, and will touch upon concepts that are covered in greater detail elsewhere in the book. You may want to dive right into the search section, but an overview of the user interface elements might save you some frustration later. We will cover the following topics:

- Logging in and app selection
- A detailed explanation of the search interface widgets
- A quick overview of the admin interface

(For more resources related to this topic, see here.)

Logging into Splunk

The Splunk GUI (Splunk is also accessible through its command-line interface [CLI] and REST API) is web-based, which means that no client needs to be installed. Newer browsers with fast JavaScript engines, such as Chrome, Firefox, and Safari, work better with the interface. As of Splunk Version 6.2.0, no browser extensions are required. Splunk Versions 4.2 and earlier require Flash to render graphs. Flash can still be used by older browsers, or for older apps that reference Flash explicitly.

The default port for a Splunk installation is 8000. The address will look like http://mysplunkserver:8000 or http://mysplunkserver.mycompany.com:8000.

The Splunk interface

If you have installed Splunk on your local machine, the address can be some variant of http://localhost:8000, http://127.0.0.1:8000, http://machinename:8000, or http://machinename.local:8000. Once you determine the address, the first page you will see is the login screen. The default username is admin with the password changeme. The first time you log in, you will be prompted to change the password for the admin user. It is a good idea to change this password to prevent unwanted changes to your deployment.

By default, accounts are configured and stored within Splunk. Authentication can be configured to use another system, for instance Lightweight Directory Access Protocol (LDAP). By default, Splunk authenticates locally. If LDAP is set up, the order is as follows: LDAP / Local.

The home app

After logging in, the default app is the Launcher app (some may refer to this as Home). This app is a launching pad for apps and tutorials. In earlier versions of Splunk, the Welcome tab provided two important shortcuts: Add data and the Launch search app. In version 6.2.0, the Home app is divided into distinct areas, or panes, that provide easy access to Explore Splunk Enterprise (Add Data, Splunk Apps, Splunk Docs, and Splunk Answers) as well as Apps (the App management page), Search & Reporting (the link to the Search app), and an area where you can set your default dashboard (choose a home dashboard).

The Explore Splunk Enterprise pane shows links to:

- Add data: This links to the Add Data to Splunk page. This interface is a great start for getting local data flowing into Splunk (making it available to Splunk users). The Preview data interface takes an enormous amount of complexity out of configuring dates and line breaking.
- Splunk Apps: This allows you to find and install more apps from the Splunk Apps Marketplace (http://apps.splunk.com). This marketplace is a useful resource where Splunk users and employees post Splunk apps, mostly free but some premium ones as well.
- Splunk Answers: This is one of your links to the wide range of Splunk documentation available, specifically http://answers.splunk.com, where you can engage with the Splunk community on Splunkbase (https://splunkbase.splunk.com/) and learn how to get the most out of your Splunk deployment.

The Apps section shows the apps that have GUI elements on your instance of Splunk. App is an overloaded term in Splunk. An app doesn't necessarily have a GUI at all; it is simply a collection of configurations wrapped into a directory structure that means something to Splunk.

Search & Reporting is the link to the Splunk Search & Reporting app. Beneath the Search & Reporting link, Splunk provides an outline which, when you hover over it, displays a Find More Apps balloon tip. Clicking on the link opens the same Browse more apps page as the Splunk Apps link mentioned earlier.

Choose a home dashboard provides an intuitive way to select an existing (simple XML) dashboard and set it as part of your Splunk Welcome or Home page. This sets you at a familiar starting point each time you enter Splunk. The following image displays the Choose Default Dashboard dialog:

Once you select an existing dashboard from the dropdown list, it will be part of your welcome screen every time you log into Splunk, until you change it. There are no dashboards installed by default after installing Splunk, except the Search & Reporting app. Once you have created additional dashboards, they can be selected as the default.

The top bar

The bar across the top of the window contains information about where you are, as well as quick links to preferences, other apps, and administration. The current app is specified in the upper-left corner. The following image shows the upper-left Splunk bar when using the Search & Reporting app:

Clicking on the text takes you to the default page for that app. In most apps, the text next to the logo is simply changed, but the whole block can be customized with logos and alternate text by modifying the app's CSS.

The upper-right corner of the window, as seen in the previous image, contains action links that are almost always available.

The name of the user who is currently logged in appears first. In this case, the user is Administrator. Clicking on the username allows you to select Edit Account (which will take you to the Your account page) or to Logout (of Splunk). Logout ends the session and forces the user to log in again. The following screenshot shows what the Your account page looks like:

This form presents the global preferences that a user is allowed to change. Other settings that affect users are configured through permissions on objects and settings on roles. (Note: preferences can also be configured using the CLI or by modifying specific Splunk configuration files.)

- Full name and Email address are stored for the administrator's convenience.
- Time zone can be changed for the logged-in user. This is a new feature in Splunk 4.3. Setting the time zone only affects the time zone used to display the data. It is very important that the date is parsed properly when events are indexed.
- Default app controls the starting page after login. Most users will want to change this to search.
- Restart backgrounded jobs controls whether unfinished queries should run again if Splunk is restarted.
- Set password allows you to change your password. This is only relevant if Splunk is configured to use internal authentication. For instance, if the system is configured to use Windows Active Directory via LDAP (a very common configuration), users must change their password in Windows.

The remaining items in the upper-right corner are as follows:

- Messages allows you to view any system-level error messages you may have pending. When there is a new message for you to review, a notification displays as a count next to the Messages menu. You can click the X to remove a message.
- The Settings link presents the user with the configuration pages for all Splunk Knowledge objects, Distributed Environment settings, System and Licensing, Data, and Users and Authentication settings. If you do not see some of these options, you do not have the permissions to view or edit them.
- The Activity menu lists shortcuts to Splunk Jobs, Triggered Alerts, and System Activity views. You can click Jobs (to open the search jobs manager window, where you can view and manage currently running searches), click Triggered Alerts (to view scheduled alerts that are triggered), or click System Activity (to see dashboards about user activity and the status of the system).
- Help lists links to video Tutorials, Splunk Answers, the Splunk Contact Support portal, and online Documentation.
- Find can be used to search for objects within your Splunk Enterprise instance. For example, if you type in error, it returns the saved objects that contain the term error. These saved objects include Reports, Dashboards, Alerts, and so on. You can also search for error in the Search & Reporting app by clicking Open error in search.

The search & reporting app

The Search & Reporting app (or just the search app) is where most actions in Splunk start. This app is a dashboard where you will begin your searching.

The summary view

Within the Search & Reporting app, the user is presented with the Summary view, which contains information about the data that user searches by default. This is an important distinction: in a mature Splunk installation, not all users will always search all data by default. But at first, if this is your first trip into Search & Reporting, you'll see the following:

From the screen depicted in the previous screenshot, you can access the Splunk documentation related to What to Search and How to Search. Once you have at least some data indexed, Splunk will provide some statistics on the available data under What to Search (remember that this reflects only the indexes that this particular user searches by default; there are other events that are indexed by Splunk, including events that Splunk indexes about itself). This is seen in the following image:

In previous versions of Splunk, panels such as the All indexed data panel provided statistics for a user's indexed data. Other panels gave a breakdown of data using three important pieces of metadata: Source, Sourcetype, and Hosts. In the current version, 6.2.0, you access this information by clicking on the button labeled Data Summary, which presents the following to the user:

This dialog splits the information into three tabs: Hosts, Sources, and Sourcetypes.

A host is a captured hostname for an event. In the majority of cases, the host field is set to the name of the machine where the data originated. There are cases where this is not known, so the host can also be configured arbitrarily.

A source in Splunk is a unique path or name. In a large installation, there may be thousands of machines submitting data, but all data on the same path across these machines counts as one source.
When the data source is not a file, the value of the source can be arbitrary, for instance the name of a script or a network port.

A source type is an arbitrary categorization of events. There may be many sources across many hosts in the same source type. For instance, given the sources /var/log/access.2012-03-01.log and /var/log/access.2012-03-02.log on the hosts fred and wilma, you could reference all these logs with the source type access or any other name that you like.

Let's move on now and discuss each of the Splunk widgets (just below the app name). The first widget is the navigation bar. As a general rule, within Splunk, items with downward triangles are menus, and items without a downward triangle are links. Next we find the Search bar. This is where the magic starts, and we'll go into great detail shortly.

Search

Okay, we've finally made it to search. This is where the real power of Splunk lies. For our first search, we will search for the word (not case specific) error. Click in the search bar, type the word error, and then either press Enter or click on the magnifying glass to the right of the bar. Upon initiating the search, we are taken to the search results page. Note that the search we just executed was across All time (by default); to change the search time, you can utilize the Splunk time picker.

Actions

Let's inspect the elements on this page. Below the Search bar, we have the event count, action icons, and menus. Starting from the left, we have the following:

- The number of events matched by the base search. Technically, this may not be the number of results pulled from disk, depending on your search. Also, if your query uses commands, this number may not match what is shown in the event listing.
- Job: This opens the Search job inspector window, which provides very detailed information about the query that was run.
- Pause: This causes the current search to stop locating events but keeps the job open. This is useful if you want to inspect the current results to determine whether you want to continue a long-running search.
- Stop: This stops the execution of the current search but keeps the results generated so far. This is useful when you have found enough and want to inspect or share the results found so far.
- Share: This shares the search job. This option extends the job's lifetime to seven days and sets the read permissions to everyone.
- Export: This exports the results. Select this option to output to CSV, raw events, XML, or JavaScript Object Notation (JSON) and specify the number of results to export.
- Print: This formats the page for printing and instructs the browser to print.
- Smart Mode: This controls the search experience. You can set it to speed up searches by cutting down on the event data it returns and, additionally, by reducing the number of fields that Splunk will extract by default from the data (Fast mode). You can, otherwise, set it to return as much event information as possible (Verbose mode). In Smart mode (the default setting), it toggles search behavior based on the type of search you're running.

Timeline

Now we'll skip to the timeline below the action icons. Along with providing a quick overview of the event distribution over a period of time, the timeline is also a very useful tool for selecting sections of time. Placing the pointer over the timeline displays a pop-up with the number of events in that slice of time. Clicking on the timeline selects the events for a particular slice of time. Clicking and dragging selects a range of time.
Once you have selected a period of time, clicking on Zoom to selection changes the time frame and reruns the search for that specific slice of time. Repeating this process is an effective way to drill down to specific events. Deselect shows all events for the time range selected in the time picker. Zoom out changes the window of time to a larger period around the events in the current time frame.

The field picker

To the left of the search results, we find the field picker. This is a great tool for discovering patterns and filtering search results.

Fields

The field list contains two lists:

- Selected Fields, which have their values displayed under the search event in the search results
- Interesting Fields, which are other fields that Splunk has picked out for you

Above the field list are two links: Hide Fields and All Fields.

- Hide Fields: Hides the field list area from view.
- All Fields: Takes you to the Selected Fields window.

Search results

We are almost through with all the widgets on the page. We still have a number of items to cover in the search results section though, just to be thorough. As you can see in the previous screenshot, at the top of this section we have the number of events displayed. When viewing all results in their raw form, this number will match the number above the timeline. This value can be changed either by making a selection on the timeline or by using other search commands.

Next, we have the action icons (described earlier) that affect these particular results. Under the action icons, we have four results tabs:

- Events list, which shows the raw events. This is the default view when running a simple search, as we have done so far.
- Patterns, which streamlines event pattern detection. It displays a list of the most common patterns among the set of events returned by your search. Each of these patterns represents a number of events that share a similar structure.
- Statistics, which populates when you run a search with transforming commands such as stats, top, chart, and so on. The previous keyword search for error does not display any results in this tab because it does not have any transforming commands.
- Visualization, which is populated by transforming searches. The results area of the Visualization tab includes a chart and the statistics table used to generate the chart. Not all searches are eligible for visualization.

Under the tabs described just now is the timeline.

Options

Beneath the timeline (starting at the left) is a row of option links that include:

- Show Fields: Shows the Selected Fields screen.
- List: Allows you to select an output option (Raw, List, or Table) for displaying the search results.
- Format: Provides the ability to set Result display options, such as Show row numbers, Wrap results, the Max lines (to display), and Drilldown as on or off.
- NN Per Page: Where you can indicate the number of results to show per page (10, 20, or 50).

To the right are options that you can use to choose a page of results, and to change the number of events per page. In prior versions of Splunk, these options were available from the Results display options popup dialog.

The events viewer

Finally, we make it to the actual events. Let's examine a single event. Starting at the left, we have:

- Event Details: Clicking here (indicated by the right-facing arrow) opens the selected event, providing specific information about the event by type, field, and value, and allows you to perform specific actions on a particular event field. In addition, Splunk version 6.2.0 offers a button labeled Event Actions to access workflow actions, a few of which are always available:
  - Build Eventtype: Event types are a way to name events that match a certain query.
  - Extract Fields: This launches an interface for creating custom field extractions.
  - Show Source: This pops up a window with a simulated view of the original source.
- The event number: Raw search results are always returned in most-recent-first order.
- Next to appear are any workflow actions that have been configured. Workflow actions let you create new searches or links to other sites, using data from an event.
- Next comes the parsed date from this event, displayed in the time zone selected by the user. This is an important and often confusing distinction. In most installations, everything is in one time zone: the servers, the user, and the events. When one of these three things is not in the same time zone as the others, things can get confusing.
- Next, we see the raw event itself. This is what Splunk saw as an event. With no help, Splunk can do a good job finding the date and breaking lines appropriately, but as we will see later, with a little help, event parsing can be more reliable and more efficient.
- Below the event are the fields that were selected in the field picker. Clicking on a value adds the field value to the search.

Summary

As you have seen, the Splunk GUI provides a rich interface for working with search results. We have really only scratched the surface and will cover more elements later.

Resources for Article:

Further resources on this subject:
- The Splunk Web Framework [Article]
- Loading data, creating an app, and adding dashboards and reports in Splunk [Article]
- Working with Apps in Splunk [Article]

Understanding Hadoop Backup and Recovery Needs

Packt
10 Aug 2015
25 min read
In this article by Gaurav Barot, Chintan Mehta, and Amij Patel, authors of the book Hadoop Backup and Recovery Solutions, we will discuss backup and recovery needs. In the present age of information explosion, data is the backbone of business organizations of all sizes. We need a complete data backup and recovery system, and a strategy to ensure that critical data is available and accessible when the organizations need it. Data must be protected against loss, damage, theft, and unauthorized changes. If disaster strikes, data recovery must be swift and smooth so that business does not get impacted.

Every organization has its own data backup and recovery needs and priorities, based on the applications and systems they are using. Today's IT organizations face the challenge of implementing reliable backup and recovery solutions in the most efficient, cost-effective manner. To meet this challenge, we need to carefully define our business requirements and recovery objectives before deciding on the right backup and recovery strategies or technologies to deploy.

(For more resources related to this topic, see here.)

Before jumping into the implementation approach, we first need to know about the backup and recovery strategies and how to plan them efficiently.

Understanding the backup and recovery philosophies

Backup and recovery is becoming more challenging and complicated, especially with the explosion of data growth and the increasing need for data security today. Imagine big players such as Facebook, Yahoo! (the first to implement Hadoop), and eBay: how challenging it is for them to handle unprecedented volumes and velocities of unstructured data, something that traditional relational databases can't handle and deliver.

To emphasize the importance of backup, let's take a look at a study conducted in 2009. This was the time when Hadoop was evolving and a handful of bugs still existed in Hadoop. Yahoo! had about 20,000 nodes running Apache Hadoop in 10 different clusters. HDFS lost only 650 blocks, out of 329 million total blocks. Now hold on a second. These blocks were lost due to the bugs found in the Hadoop package. So, imagine what the scenario would be now. I am sure you will bet on losing hardly a block.

Being a backup manager, your utmost target is to plan, strategize, and execute a foolproof backup strategy capable of retrieving data after any disaster. Simply put, the aim of the strategy is to protect the files in HDFS against disastrous situations and bring the files back to their normal state, just like James Bond resurrects after so many blows and probably death-like situations.

Coming back to the backup manager's role, the following are the activities of this role:

- Testing out various case scenarios to forestall any threats in the future
- Building a stable recovery point and setup for backup and recovery situations
- Preplanning and daily organization of the backup schedule
- Constantly supervising the backup and recovery process and avoiding threats, if any
- Repairing and constructing solutions for backup processes
- The ability to reheal, that is, recover from data threats if they arise (the resurrection power)
- Data protection, which includes the tasks of maintaining data replicas for long-term storage
- Resettling data from one destination to another

Basically, backup and recovery strategies should cover all the areas mentioned here.
For any system, data, application, or configuration transaction logs are mission critical, though much depends on the datasets, configurations, and applications that are used when designing the backup and recovery strategies. Hadoop is all about big data processing. After gathering some exabytes for data processing, the following are the obvious questions that we may come up with:

- What's the best way to back up data?
- Do we really need to take a backup of these large chunks of data?
- Where will we find more storage space if the current storage space runs out?
- Will we have to maintain distributed systems?
- What if our backup storage unit gets corrupted?

The answer to the preceding questions depends on the situation you may be facing; let's see a few situations.

One of the situations is where you may be dealing with a plethora of data. Hadoop is used for fact-finding semantics and data is in abundance. Here, the span of the data is short; it is short-lived, and the important sources of the data are already backed up. Such is the scenario wherein the policy of not backing up data at all is feasible, as there are already three copies (replicas) in our data nodes (HDFS). Moreover, since Hadoop is still vulnerable to human error, a backup of configuration files and NameNode metadata (dfs.name.dir) should be created.

You may also find yourself facing a situation where the data center on which Hadoop runs crashes and the data is not available for now; this results in a failure to connect with mission-critical data. A possible solution here is to back up Hadoop, like any other cluster, by copying its data to a second cluster; the Hadoop tool for this is DistCp, described next.

Replication of data using DistCp

To replicate data, the distcp command writes data to two different clusters. Let's look at the distcp command with a few examples and options. DistCp is a handy tool used for large inter/intra-cluster copying. It basically expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list. Let's understand how to use distcp with some basic examples.

The most common use case of distcp is intercluster copying. Let's see an example:

bash$ hadoop distcp2 hdfs://ka-16:8020/parth/ghiya hdfs://ka-001:8020/knowarth/parth

This command will expand the namespace under /parth/ghiya on the ka-16 NameNode into a temporary file, partition its contents among a set of map tasks, and start copying on each TaskTracker from ka-16 to ka-001. The command used for copying can be generalized as follows:

hadoop distcp2 hftp://namenode-location:50070/basePath hdfs://namenode-location

Here, hftp://namenode-location:50070/basePath is the source and hdfs://namenode-location is the destination. In the preceding command, namenode-location refers to the hostname and 50070 is the NameNode's HTTP server port.

Updating and overwriting using DistCp

The -update option is used when we want to copy files from the source that don't exist on the target or that differ from the target's contents. The -overwrite option overwrites the target files even if they exist at the source. These options can be invoked by simply adding -update or -overwrite to the command, as shown in the example commands below. In the example above, we used distcp2, which is an advanced version of DistCp. The process will go smoothly even if we use the distcp command.
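For illustration, here is a minimal sketch of both options, reusing the same source and target paths from the intercluster example above; the hostnames and paths are the earlier illustrative values, not ones you must use:

bash$ hadoop distcp2 -update hdfs://ka-16:8020/parth/ghiya hdfs://ka-001:8020/knowarth/parth
bash$ hadoop distcp2 -overwrite hdfs://ka-16:8020/parth/ghiya hdfs://ka-001:8020/knowarth/parth

The first command copies only the files that are missing on the target or whose contents differ, while the second unconditionally rewrites the target files. Running the same commands with distcp instead of distcp2 behaves the same way for these two options.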
Now, let's look at two versions of DistCp, the legacy DistCp or just DistCp and the new DistCp or the DistCp2: During the intercluster copy process, files that were skipped during the copy process have all their file attributes (permissions, owner group information, and so on) unchanged when we copy using legacy DistCp or just DistCp. This, however, is not the case in new DistCp. These values are now updated even if a file is skipped. Empty root directories among the source inputs were not created in the target folder in legacy DistCp, which is not the case anymore in the new DistCp. There is a common misconception that Hadoop protects data loss; therefore, we don't need to back up the data in the Hadoop cluster. Since Hadoop replicates data three times by default, this sounds like a safe statement; however, it is not 100 percent safe. While Hadoop protects from hardware failure on the data nodes—meaning that if one entire node goes down, you will not lose any data—there are other ways in which data loss may occur. Data loss may occur due to various reasons, such as Hadoop being highly susceptible to human errors, corrupted data writes, accidental deletions, rack failures, and many such instances. Any of these reasons are likely to cause data loss. Consider an example where a corrupt application can destroy all data replications. During the process, it will attempt to compute each replication and on not finding a possible match, it will delete the replica. User deletions are another example of how data can be lost, as Hadoop's trash mechanism is not enabled by default. Also, one of the most complicated and expensive-to-implement aspects of protecting data in Hadoop is the disaster recovery plan. There are many different approaches to this, and determining which approach is right requires a balance between cost, complexity, and recovery time. A real-life scenario can be Facebook. The data that Facebook holds increases exponentially from 15 TB to 30 PB, that is, 3,000 times the Library of Congress. With increasing data, the problem faced was physical movement of the machines to the new data center, which required man power. Plus, it also impacted services for a period of time. Data availability in a short period of time is a requirement for any service; that's when Facebook started exploring Hadoop. To conquer the problem while dealing with such large repositories of data is yet another headache. The reason why Hadoop was invented was to keep the data bound to neighborhoods on commodity servers and reasonable local storage, and to provide maximum availability to data within the neighborhood. So, a data plan is incomplete without data backup and recovery planning. A big data execution using Hadoop states a situation wherein the focus on the potential to recover from a crisis is mandatory. The backup philosophy We need to determine whether Hadoop, the processes and applications that run on top of it (Pig, Hive, HDFS, and more), and specifically the data stored in HDFS are mission critical. If the data center where Hadoop is running disappeared, will the business stop? Some of the key points that have to be taken into consideration have been explained in the sections that follow; by combining these points, we will arrive at the core of the backup philosophy. Changes since the last backup Considering the backup philosophy that we need to construct, the first thing we are going to look at are changes. We have a sound application running and then we add some changes. 
In case our system crashes and we need to go back to our last safe state, our backup strategy should have a clause of the changes that have been made. These changes can be either database changes or configuration changes. Our clause should include the following points in order to construct a sound backup strategy: Changes we made since our last backup The count of files changed Ensure that our changes are tracked The possibility of bugs in user applications since the last change implemented, which may cause hindrance and it may be necessary to go back to the last safe state After applying new changes to the last backup, if the application doesn't work as expected, then high priority should be given to the activity of taking the application back to its last safe state or backup. This ensures that the user is not interrupted while using the application or product. The rate of new data arrival The next thing we are going to look at is how many changes we are dealing with. Is our application being updated so much that we are not able to decide what the last stable version was? Data is produced at a surpassing rate. Consider Facebook, which alone produces 250 TB of data a day. Data production occurs at an exponential rate. Soon, terms such as zettabytes will come upon a common place. Our clause should include the following points in order to construct a sound backup: The rate at which new data is arriving The need for backing up each and every change The time factor involved in backup between two changes Policies to have a reserve backup storage The size of the cluster The size of a cluster is yet another important factor, wherein we will have to select cluster size such that it will allow us to optimize the environment for our purpose with exceptional results. Recalling the Yahoo! example, Yahoo! has 10 clusters all over the world, covering 20,000 nodes. Also, Yahoo! has the maximum number of nodes in its large clusters. Our clause should include the following points in order to construct a sound backup: Selecting the right resource, which will allow us to optimize our environment. The selection of the right resources will vary as per need. Say, for instance, users with I/O-intensive workloads will go for more spindles per core. A Hadoop cluster contains four types of roles, that is, NameNode, JobTracker, TaskTracker, and DataNode. Handling the complexities of optimizing a distributed data center. Priority of the datasets The next thing we are going to look at are the new datasets, which are arriving. With the increase in the rate of new data arrivals, we always face a dilemma of what to backup. Are we tracking all the changes in the backup? Now, if are we backing up all the changes, will our performance be compromised? Our clause should include the following points in order to construct a sound backup: Making the right backup of the dataset Taking backups at a rate that will not compromise performance Selecting the datasets or parts of datasets The next thing we are going to look at is what exactly is backed up. When we deal with large chunks of data, there's always a thought in our mind: Did we miss anything while selecting the datasets or parts of datasets that have not been backed up yet? 
Our clause should include the following points in order to construct a sound backup: Backup of necessary configuration files Backup of files and application changes The timeliness of data backups With such a huge amount of data collected daily (Facebook), the time interval between backups is yet another important factor. Do we back up our data daily? In two days? In three days? Should we backup small chunks of data daily, or should we back up larger chunks at a later period? Our clause should include the following points in order to construct a sound backup: Dealing with any impacts if the time interval between two backups is large Monitoring a timely backup strategy and going through it The frequency of data backups depends on various aspects. Firstly, it depends on the application and usage. If it is I/O intensive, we may need more backups, as each dataset is not worth losing. If it is not so I/O intensive, we may keep the frequency low. We can determine the timeliness of data backups from the following points: The amount of data that we need to backup The rate at which new updates are coming Determining the window of possible data loss and making it as low as possible Critical datasets that need to be backed up Configuration and permission files that need to be backed up Reducing the window of possible data loss The next thing we are going to look at is how to minimize the window of possible data loss. If our backup frequency is great then what are the chances of data loss? What's our chance of recovering the latest files? Our clause should include the following points in order to construct a sound backup: The potential to recover latest files in the case of a disaster Having a low data-loss probability Backup consistency The next thing we are going to look at is backup consistency. The probability of invalid backups should be less or even better zero. This is because if invalid backups are not tracked, then copies of invalid backups will be made further, which will again disrupt our backup process. Our clause should include the following points in order to construct a sound backup: Avoid copying data when it's being changed Possibly, construct a shell script, which takes timely backups Ensure that the shell script is bug-free Avoiding invalid backups We are going to continue the discussion on invalid backups. As you saw, HDFS makes three copies of our backup for the recovery process. What if the original backup was flawed with errors or bugs? The three copies will be corrupted copies; now, when we recover these flawed copies, the result indeed will be a catastrophe. Our clause should include the following points in order to construct a sound backup: Avoid having a long backup frequency Have the right backup process, and probably having an automated shell script Track unnecessary backups If our backup clause covers all the preceding mentioned points, we surely are on the way to making a good backup strategy. A good backup policy basically covers all these points; so, if a disaster occurs, it always aims to go to the last stable state. That's all about backups. Moving on, let's say a disaster occurs and we need to go to the last stable state. Let's have a look at the recovery philosophy and all the points that make a sound recovery strategy. The recovery philosophy After a deadly storm, we always try to recover from the after-effects of the storm. Similarly, after a disaster, we try to recover from the effects of the disaster. 
In just one moment, storage capacity which was a boon turns into a curse and just another expensive, useless thing. Starting off with the best question, what will be the best recovery philosophy? Well, it's obvious that the best philosophy will be one wherein we may never have to perform recovery at all. Also, there may be scenarios where we may need to do a manual recovery. Let's look at the possible levels of recovery before moving on to recovery in Hadoop: Recovery to the flawless state Recovery to the last supervised state Recovery to a possible past state Recovery to a sound state Recovery to a stable state So, obviously we want our recovery state to be flawless. But if it's not achieved, we are willing to compromise a little and allow the recovery to go to a possible past state we are aware of. Now, if that's not possible, again we are ready to compromise a little and allow it to go to the last possible sound state. That's how we deal with recovery: first aim for the best, and if not, then compromise a little. Just like the saying goes, "The bigger the storm, more is the work we have to do to recover," here also we can say "The bigger the disaster, more intense is the recovery plan we have to take." So, the recovery philosophy that we construct should cover the following points: An automation system setup that detects a crash and restores the system to the last working state, where the application runs as per expected behavior. The ability to track modified files and copy them. Track the sequences on files, just like an auditor trails his audits. Merge the files that are copied separately. Multiple version copies to maintain a version control. Should be able to treat the updates without impacting the application's security and protection. Delete the original copy only after carefully inspecting the changed copy. Treat new updates but first make sure they are fully functional and will not hinder anything else. If they hinder, then there should be a clause to go to the last safe state. Coming back to recovery in Hadoop, the first question we may think of is what happens when the NameNode goes down? When the NameNode goes down, so does the metadata file (the file that stores data about file owners and file permissions, where the file is stored on data nodes and more), and there will be no one present to route our read/write file request to the data node. Our goal will be to recover the metadata file. HDFS provides an efficient way to handle name node failures. There are basically two places where we can find metadata. First, fsimage and second, the edit logs. Our clause should include the following points: Maintain three copies of the name node. When we try to recover, we get four options, namely, continue, stop, quit, and always. Choose wisely. Give preference to save the safe part of the backups. If there is an ABORT! error, save the safe state. Hadoop provides four recovery modes based on the four options it provides (continue, stop, quit, and always): Continue: This allows you to continue over the bad parts. This option will let you cross over a few stray blocks and continue over to try to produce a full recovery mode. This can be the Prompt when found error mode. Stop: This allows you to stop the recovery process and make an image file of the copy. Now, the part that we stopped won't be recovered, because we are not allowing it to. In this case, we can say that we are having the safe-recovery mode. Quit: This exits the recovery process without making a backup at all. 
In this case, we can say that we are in the no-recovery mode.

Always: This goes one step further than continue. Always selects continue by default and thus steps over any stray blocks found further on without prompting again. This can be thought of as the prompt only once mode.

We will look at these in further discussions. Now, you may think that the backup and recovery philosophy is fine, but wasn't Hadoop designed to handle these failures? Of course, it was invented for this purpose, but there is always the possibility of a mishap at some level. Are we overconfident, unwilling to take the precautions that can protect us, and simply entrusting our data blindly to Hadoop? No, certainly we aren't. We are going to take every possible preventive step from our side. In the next topic, we look at why we need preventive measures to back up Hadoop.

Knowing the necessity of backing up Hadoop

Change is the fundamental law of nature. There may come a time when Hadoop is upgraded on the present cluster, as we see with system upgrades everywhere. As no upgrade is bug free, there is a probability that existing applications may not work the way they used to. There may be scenarios where we don't want to lose any data, let alone start HDFS from scratch. This is a scenario where backup is useful, as it lets a user go back to a point in time.

Looking at the HDFS replication process, the NameNode handles the client request to write a file on a DataNode. The DataNode then replicates the block and writes it to another DataNode, and that DataNode repeats the same process. Thus, we have three copies of the same block. How these DataNodes are selected for placing the copies of blocks is another issue, which we are going to cover later in Rack awareness; there you will see how to place these copies efficiently so as to handle situations such as hardware failure. The bottom line is that when a DataNode is down, there is no need to panic; we still have a copy on a different DataNode. This approach gives us various advantages, such as:

Security: This ensures that copies of a block are stored on different DataNodes
High write capacity: The client writes to only a single DataNode; the replication factor is handled by the DataNodes
Read options: This gives better options on where to read from; the NameNode maintains records of all the locations of the copies and their distances
Block circulation: The client writes only a single copy of the block; the others are handled through the replication pipeline

During the write operation, a DataNode receives data from the client and passes data to the next DataNode simultaneously; thus, our performance is not compromised. Data never passes through the NameNode. The NameNode takes the client's request to write data on a DataNode and processes the request by deciding on the division of files into blocks and the replication factor. The following figure shows the replication pipeline, wherein a block of the file is written and three different copies are made at different DataNode locations:

After hearing such a foolproof plan and seeing so many advantages, we again arrive at the same question: is there a need for backup in Hadoop? Of course there is. There often exists a common mistaken belief that Hadoop shelters you against data loss, which gives you the freedom to not take backups in your Hadoop cluster. Hadoop does, by default, replicate your data three times.
Although reassuring, the statement is not safe and does not guarantee foolproof protection against data loss. Hadoop gives you the power to protect your data over hardware failures; the scenario wherein one disk, cluster, node, or region may go down, data will still be preserved for you. However, there are many scenarios where data loss may occur. Consider an example where a classic human-prone error can be the storage locations that the user provides during operations in Hive. If the user provides a location wherein data already exists and they perform a query on the same table, the entire existing data will be deleted, be it of size 1 GB or 1 TB. In the following figure, the client gives a read operation but we have a faulty program. Going through the process, the NameNode is going to see its metadata file for the location of the DataNode containing the block. But when it reads from the DataNode, it's not going to match the requirements, so the NameNode will classify that block as an under replicated block and move on to the next copy of the block. Oops, again we will have the same situation. This way, all the safe copies of the block will be transferred to under replicated blocks, thereby HDFS fails and we need some other backup strategy: When copies do not match the way NameNode explains, it discards the copy and replaces it with a fresh copy that it has. HDFS replicas are not your one-stop solution for protection against data loss. The needs for recovery Now, we need to decide up to what level we want to recover. Like you saw earlier, we have four modes available, which recover either to a safe copy, the last possible state, or no copy at all. Based on your needs decided in the disaster recovery plan we defined earlier, you need to take appropriate steps based on that. We need to look at the following factors: The performance impact (is it compromised?) How large is the data footprint that my recovery method leaves? What is the application downtime? Is there just one backup or are there incremental backups? Is it easy to implement? What is the average recovery time that the method provides? Based on the preceding aspects, we will decide which modes of recovery we need to implement. The following methods are available in Hadoop: Snapshots: Snapshots simply capture a moment in time and allow you to go back to the possible recovery state. Replication: This involves copying data from one cluster and moving it to another cluster, out of the vicinity of the first cluster, so that if one cluster is faulty, it doesn't have an impact on the other. Manual recovery: Probably, the most brutal one is moving data manually from one cluster to another. Clearly, its downsides are large footprints and large application downtime. API: There's always a custom development using the public API available. We will move on to the recovery areas in Hadoop. Understanding recovery areas Recovering data after some sort of disaster needs a well-defined business disaster recovery plan. So, the first step is to decide our business requirements, which will define the need for data availability, precision in data, and requirements for the uptime and downtime of the application. Any disaster recovery policy should basically cover areas as per requirements in the disaster recovery principal. Recovery areas define those portions without which an application won't be able to come back to its normal state. 
If you are armed and fed with proper information, you will be able to decide the priority of which areas need to be recovered. Recovery areas cover the following core components: Datasets NameNodes Applications Database sets in HBase Let's go back to the Facebook example. Facebook uses a customized version of MySQL for its home page and other interests. But when it comes to Facebook Messenger, Facebook uses the NoSQL database provided by Hadoop. Now, looking from that point of view, Facebook will have both those things in recovery areas and will need different steps to recover each of these areas. Summary In this article, we went through the backup and recovery philosophy and what all points a good backup philosophy should have. We went through what a recovery philosophy constitutes. We saw the modes available for recovery in Hadoop. Then, we looked at why backup is important even though HDFS provides the replication process. Lastly, we looked at the recovery needs and areas. Quite a journey, wasn't it? Well, hold on tight. These are just your first steps into Hadoop User Group (HUG). Resources for Article: Further resources on this subject: Cassandra Architecture [article] Oracle GoldenGate 12c — An Overview [article] Backup and Restore Improvements [article]

NLTK for hackers

Packt
07 Aug 2015
9 min read
In this article written by Nitin Hardeniya, author of the book NLTK Essentials, we will learn that "Life is short, we need Python" is the mantra I follow and truly believe in. As fresh graduates, we learned and worked mostly with C/C++/Java. While these languages have amazing features, Python has a charm of its own. The day I started using Python, I loved it. I really did. The big coincidence is that I finally ended up working with Python during my initial projects on the job. I started to love the data structures, libraries, and ecosystem Python offers to beginners as well as expert programmers.

(For more resources related to this topic, see here.)

Python as a language has advanced very fast and spread widely. If you are a machine learning or natural language processing enthusiast, Python is 'the' go-to language these days. Python has some amazing ways of dealing with strings, a very easy and elegant coding style, and most importantly, a long list of open libraries. I could go on and on about Python and my love for it, but here I want to talk specifically about NLTK (Natural Language Toolkit), one of the most popular Python libraries for natural language processing. NLTK is simply awesome, and in my opinion, it's the best way to learn and implement some of the most complex NLP concepts. NLTK has a variety of generic text preprocessing tools, such as tokenization, stop word removal, and stemming, and at the same time has some very NLP-specific tools, such as part of speech tagging, chunking, named entity recognition, and dependency parsing. NLTK provides some of the easiest solutions to all the above stages of NLP, and that's why it is the most preferred library for any text processing or text mining application. NLTK not only provides pretrained models that can be applied directly to your dataset, it also provides ways to customize and build your own taggers, tokenizers, and so on. NLTK is a big library that has many tools available for an NLP developer. I have provided a cheat sheet of some of the most common steps and their solutions using NLTK. In our book, NLTK Essentials, I have tried to give you enough information to deal with all these processing steps using NLTK.

To show you the power of NLTK, let's try to develop a very easy application: finding the topics in unstructured text and presenting them as a word cloud.

Word cloud using NLTK

Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach and then move on to NLTK for a much more efficient, robust, and clean solution. We will start analyzing some example text content:

>>>import urllib2
>>># urllib2 is used to download the html content of the web link
>>>response = urllib2.urlopen('http://python.org/')
>>># You can read the entire content of a file using the read() method
>>>html = response.read()
>>>print len(html)
47020

For the current example, I have taken the content from Python's home page: https://www.python.org/. We don't have any clue about the kind of topics that are discussed in this URL, so let's say that we want to start an exploratory data analysis (EDA). Typically in a text domain, EDA can have many meanings, but we will go with a simple case: what kinds of terms dominate the document? What are the topics? How frequent are they?
The process will involve some level of preprocessing. We will first try to do this in a pure Python way, and then we will do it using NLTK. Let's start with cleaning the HTML tags. One way to do this is to select just the tokens, including numbers and characters. Anybody who has worked with regular expressions should be able to convert the HTML string into a list of tokens:

>>># split the html string on whitespace
>>>tokens = [tok for tok in html.split()]
>>>print "Total no of tokens :"+ str(len(tokens))
>>># first 100 tokens
>>>print tokens[0:100]
Total no of tokens :2860
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', ''type="text/css"', 'media="not', 'print,', 'braille,' ...]

As you can see, there is an excess of HTML tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this:

>>>import re
>>># using the re.split function https://docs.python.org/2/library/re.html
>>>tokens = re.split('\W+', html)
>>>print len(tokens)
>>>print tokens[0:100]
5787
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...]

This looks much cleaner now. But you can still do more; I leave it to you to try to remove as much noise as you can. You can also use word length as a criterion and remove words that have a length of one; this will remove elements such as 7, 8, and so on, which are just noise in this case. Now let's go to NLTK for the same task. There is a function called clean_html() that can do all the work we were looking for:

>>>import nltk
>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html
>>>clean = nltk.clean_html(html)
>>># clean will have the entire string with all the html noise removed
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]
['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '&#9660;', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '&#9650;', 'The', 'Python', 'Network', '&equiv;', 'Menu', 'Arts', 'Business' ...]

Cool, right? This definitely is much cleaner and easier to do. No EDA can start without a distribution, so let's try to get the frequency distribution. First, let's do it the Python way, then I will show you the NLTK recipe.

>>>import operator
>>>freq_dis={}
>>>for tok in tokens:
>>>    if tok in freq_dis:
>>>        freq_dis[tok]+=1
>>>    else:
>>>        freq_dis[tok]=1
>>># We want to sort this dictionary on values ( freq in this case )
>>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
>>> print sorted_freq_dist[:25]
[('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12), ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9), ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6), ('3:', 5), ('that', 5), ('sum', 5)]

Naturally, as this is Python's home page, Python and the >>> interpreter prompt are the most common terms, which also gives a sense of the website. A better and more efficient approach is to use NLTK's FreqDist() function.
For this, we will take a look at the same example we developed before:

>>>import nltk
>>>Freq_dist_nltk=nltk.FreqDist(tokens)
>>>print Freq_dist_nltk
>>>for k,v in Freq_dist_nltk.items():
>>>    print str(k)+':'+str(v)
<FreqDist: 'Python': 55, '>>>': 23, 'and': 21, ',': 18, 'to': 18, 'the': 14, 'of': 13, 'for': 12, 'Events': 11, 'News': 11, ...>
Python:55
>>>:23
and:21
,:18
to:18
the:14
of:13
for:12
Events:11
News:11

Let's now do some more interesting things. Let's plot this:

>>>Freq_dist_nltk.plot(50, cumulative=False)
>>># below is the plot for the frequency distributions

We can see that the frequencies drop quickly and the curve goes into a long tail. Still, there is some noise; there are words such as the, of, for, and =. These are useless words, and there is a term for them: stop words, such as the, a, and an. Articles and pronouns are generally present in most documents; hence, they are not discriminative enough to be informative. In most NLP and information retrieval tasks, people generally remove stop words. Let's go back to our running example:

>>>stopwords=[word.strip().lower() for word in open("PATH/english.stop.txt")]
>>>clean_tokens=[tok for tok in tokens if len(tok.lower())>1 and (tok.lower() not in stopwords)]
>>>Freq_dist_nltk=nltk.FreqDist(clean_tokens)
>>>Freq_dist_nltk.plot(50, cumulative=False)

This looks much cleaner now! After finishing this much, you should be able to build a word cloud from these terms. Please go to http://www.wordle.net/advanced for more word clouds.

Summary

To summarize, this article was intended to give you a brief introduction to natural language processing. The book does assume some background in NLP and programming in Python, but we have tried to give a very quick head start to Python and NLP.

Resources for Article:

Further resources on this subject:

Hadoop Monitoring and its aspects [Article]
Big Data Analysis (R and Hadoop) [Article]
SciPy for Signal Processing [Article]

Elasticsearch – Spicing Up a Search Using Geo

Packt
23 Jul 2015
18 min read
A geo point refers to the latitude and longitude of a point on Earth. Each location on it has its own unique latitude and longitude. Elasticsearch is aware of geo-based points and allows you to perform various operations on top of it. In many contexts, it's also required to consider a geo location component to obtain various functionalities. For example, say you need to search for all the nearby restaurants that serve Chinese food or I need to find the nearest cab that is free. In some other situation, I need to find to which state a particular geo point location belongs to understand where I am currently standing. This article by Vineeth Mohan, author of the book Elasticsearch Blueprints, is modeled such that all the examples mentioned are related to real-life scenarios, of restaurant searching, for better understanding. Here, we take the example of sorting restaurants based on geographical preferences. A number of cases ranging from the simple, such as finding the nearest restaurant, to the more complex case, such as categorization of restaurants based on distance are covered in this article. What makes Elasticsearch unique and powerful is the fact that you can combine geo operation with any other normal search query to yield results clubbed with both the location data and the query data. (For more resources related to this topic, see here.) Restaurant search Let's consider creating a search portal for restaurants. The following are its requirements: To find the nearest restaurant with Chinese cuisine, which has the word ChingYang in its name. To decrease the importance of all restaurants outside city limits. To find the distance between the restaurant and current point for each of the preceding restaurant matches. To find whether the person is in a particular city's limit or not. To aggregate all restaurants within a distance of 10 km. That is, for a radius of the first 10 km, we have to compute the number of restaurants. For the next 10 km, we need to compute the number of restaurants and so on. Data modeling for restaurants Firstly, we need to see the aspects of data and model it around a JSON document for Elasticsearch to make sense of the data. A restaurant has a name, its location information, and rating. To store the location information, Elasticsearch has a provision to understand the latitude and longitude information and has features to conduct searches based on it. Hence, it would be best to use this feature. Let's see how we can do this. First, let's see what our document should look like: { "name" : "Tamarind restaurant", "location" : {      "lat" : 1.10,      "lon" : 1.54 } } Now, let's define the schema for the same: curl -X PUT "http://$hostname:9200/restaurants" -d '{    "index": {        "number_of_shards": 1,        "number_of_replicas": 1  },    "analysis":{            "analyzer":{                    "flat" : {                "type" : "custom",                "tokenizer" : "keyword",                "filter" : "lowercase"            }        }    } }'   echo curl -X PUT "http://$hostname:9200/restaurants /restaurant/_mapping" -d '{    "restaurant" : {    "properties" : {        "name" : { "type" : "string" },        "location" : { "type" : "geo_point", "accuracy" : "1km" }    }}   }' Let's now index some documents in the index. An example of this would be the Tamarind restaurant data shown in the previous section. 
We can index the data as follows: curl -XPOST 'http://localhost:9200/restaurants/restaurant' -d '{    "name": "Tamarind restaurant",    "location": {        "lat": 1.1,        "lon": 1.54    } }' Likewise, we can index any number of documents. For the sake of convenience, we have indexed only a total of five restaurants for this article. The latitude and longitude should be of this format. Elasticsearch also accepts two other formats (geohash and lat_lon), but let's stick to this one. As we have mapped the field location to the type geo_point, Elasticsearch is aware of what this information means and how to act upon it. The nearest hotel problem Let's assume that we are at a particular point where the latitude is 1.234 and the longitude is 2.132. We need to find the nearest restaurants to this location. For this purpose, the function_score query is the best option. We can use the decay (Gauss) functionality of the function score query to achieve this: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": {    "function_score": {      "functions": [        {          "gauss": {            "location": {              "scale": "1km",               "origin": [                1.231,                1.012              ]            }          }        }      ]    } } }' Here, we tell Elasticsearch to give a higher score to the restaurants that are nearby the referral point we gave it. The closer it is, the higher is the importance. Maximum distance covered Now, let's move on to another example of finding restaurants that are within 10 kms from my current position. Those that are beyond 10 kms are of no interest to me. So, it almost makes up to a circle with a radius of 10 km from my current position, as shown in the following map: Our best bet here is using a geo distance filter. It can be used as follows: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": {    "filtered": {      "filter": {        "geo_distance": {          "distance": "100km",          "location": {            "lat": 1.232,            "lon": 1.112          }        }      }    } } }' Inside city limits Next, I need to consider only those restaurants that are inside a particular city limit; the rest are of no interest to me. As the city shown in the following map is rectangle in nature, this makes my job easier: Now, to see whether a geo point is inside a rectangle, we can use the bounding box filter. A rectangle is marked when you feed the top-left point and bottom-right point. Let's assume that the city is within the following rectangle with the top-left point as X and Y and the bottom-right point as A and B: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": {    "filtered": {      "query": {        "match_all": {}      },      "filter": {        "geo_bounding_box": {          "location": {            "top_left": {              "lat": 2,              "lon": 0            },            "bottom_right": {              "lat": 0,              "lon": 2            }          }        }      }    } } }' Distance values between the current point and each restaurant Now, consider the scenario where you need to find the distance between the user location and each restaurant. How can we achieve this requirement? We can use scripts; the current geo coordinates are passed to the script and then the query to find the distance between each restaurant is run, as in the following code. 
Here, the current location is given as (1, 2): curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{ "script_fields": {    "distance": {      "script": "doc['"'"'location'"'"'].arcDistanceInKm(1, 2)"    } }, "fields": [    "name" ], "query": {    "match": {      "name": "chinese"    } } }' We have used the function called arcDistanceInKm in the preceding query, which accepts the geo coordinates and then returns the distance between that point and the locations satisfied by the query. Note that the unit of distance calculated is in kilometers (km). You might have noticed a long list of quotes and double quotes before and after location in the script mentioned previously. This is the standard format and if we don't use this, it would result in returning the format error while processing. The distances are calculated from the current point to the filtered hotels and are returned in the distance field of response, as shown in the following code: { "took" : 3, "timed_out" : false, "_shards" : {    "total" : 1,    "successful" : 1,    "failed" : 0 }, "hits" : {    "total" : 2,    "max_score" : 0.7554128,    "hits" : [ {      "_index" : "restaurants",      "_type" : "restaurant",      "_id" : "AU08uZX6QQuJvMORdWRK",      "_score" : 0.7554128,      "fields" : {        "distance" : [ 112.92927483176413 ],        "name" : [ "Great chinese restaurant" ]      }    }, {      "_index" : "restaurants",      "_type" : "restaurant",      "_id" : "AU08uZaZQQuJvMORdWRM",      "_score" : 0.7554128,      "fields" : {        "distance" : [ 137.61635969665923 ],        "name" : [ "Great chinese restaurant" ]      }    } ] } } Note that the distances measured from the current point to the hotels are direct distances and not road distances. Restaurant out of city limits One of my friends called me and asked me to join him on his journey to the next city. As we were leaving the city, he was particular that he wants to eat at some restaurant off the city limits, but outside the next city. For this, the requirement was translated to any restaurant that is minimum 15 kms and a maximum of 100 kms from the center of the city. Hence, we have something like a donut in which we have to conduct our search, as show in the following map: The area inside the donut is a match, but the area outside is not. For this donut area calculation, we have the geo_distance_range filter to our rescue. Here, we can apply the minimum distance and maximum distance in the fields from and to to populate the results, as shown in the following code: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": {    "filtered": {      "query": {        "match_all": {}      },      "filter": {        "geo_distance_range": {          "from": "15km",          "to": "100km",          "location": {            "lat": 1.232,            "lon": 1.112          }        }      }    } } }' Restaurant categorization based on distance In an e-commerce solution, to search restaurants, it's required that you increase the searchable characteristics of the application. This means that if we are able to give a snapshot of results other than the top-10 results, it would add to the searchable characteristics of the search. For example, if we are able to show how many restaurants serve Indian, Thai, or other cuisines, it would actually help the user to get a better idea of the result set. 
In a similar manner, if we can tell them if the restaurant is near, at a medium distance, or far away, we can really pull a chord in the restaurant search user experience, as shown in the following map: Implementing this is not hard, as we have something called the distance range aggregation. In this aggregation type, we can handcraft the range of distance we are interested in and create a bucket for each of them. We can also define the key name we need, as shown in the following code: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "aggs": {    "distanceRanges": {      "geo_distance": {        "field": "location",        "origin": "1.231, 1.012",        "unit": "meters",        "ranges": [          {            "key": "Near by Locations",            "to": 200          },          {            "key": "Medium distance Locations",            "from": 200,            "to": 2000          },          {            "key": "Far Away Locations",            "from": 2000          }        ]      }    } } }' In the preceding code, we categorized the restaurants under three distance ranges, which are the nearby hotels (less than 200 meters), medium distant hotels (within 200 meters to 2,000 meters), and the far away ones (greater than 2,000 meters). This logic was translated to the Elasticsearch query using which, we received the results as follows: { "took": 44, "timed_out": false, "_shards": {    "total": 1,    "successful": 1,    "failed": 0 }, "hits": {    "total": 5,    "max_score": 0,    "hits": [         ] }, "aggregations": {    "distanceRanges": {      "buckets": [        {          "key": "Near by Locations",          "from": 0,          "to": 200,          "doc_count": 1        },        {          "key": "Medium distance Locations",          "from": 200,          "to": 2000,        "doc_count": 0        },        {          "key": "Far Away Locations",          "from": 2000,          "doc_count": 4        }      ]    } } } In the results, we received how many restaurants are there in each distance range indicated by the doc_count field. Aggregating restaurants based on their nearness In the previous example, we saw the aggregation of restaurants based on their distance from the current point to three different categories. Now, we can consider another scenario in which we classify the restaurants on the basis of the geohash grids that they belong to. This kind of classification can be advantageous if the user would like to get a geographical picture of how the restaurants are distributed. Here is the code for a geohash-based aggregation of restaurants: curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{ "size": 0, "aggs": {    "DifferentGrids": {      "geohash_grid": {        "field": "location",        "precision": 6      },      "aggs": {        "restaurants": {          "top_hits": {}        }      }    } } }' You can see from the preceding code that we used the geohash aggregation, which is named as DifferentGrids and the precision here, is to be set as 6. The precision field value can be varied within the range of 1 to 12, with 1 being the lowest and 12 being the highest reference of precision. Also, we used another aggregation named restaurants inside the DifferentGrids aggregation. The restaurant aggregation uses the top_hits query to fetch the aggregated details from the DifferentGrids aggregation, which otherwise, would return only the key and doc_count values. 
So, running the preceding code gives us the following result: {    "took":5,    "timed_out":false,    "_shards":{      "total":1,      "successful":1,      "failed":0    },    "hits":{      "total":5,      "max_score":0,      "hits":[        ]    },    "aggregations":{      "DifferentGrids":{          "buckets":[            {                "key":"s009",               "doc_count":2,                "restaurants":{... }            },            {                "key":"s01n",                "doc_count":1,                "restaurants":{... }            },            {                "key":"s00x",                "doc_count":1,                "restaurants":{... }            },            {                "key":"s00p",                "doc_count":1,                "restaurants":{... }            }          ]      }    } } As we can see from the response, there are four buckets with the key values, which are s009, s01n, s00x, and s00p. These key values represent the different geohash grids that the restaurants belong to. From the preceding result, we can evidently say that the s009 grid contains two restaurants inside it and all the other grids contain one each. A pictorial representation of the previous aggregation would be like the one shown on the following map: Summary We found that Elasticsearch can handle geo point and various geo-specific operations. A few geospecific and geopoint operations that we covered in this article were searching for nearby restaurants (restaurants inside a circle), searching for restaurants within a range (restaurants inside a concentric circle), searching for restaurants inside a city (restaurants inside a rectangle), searching for restaurants inside a polygon, and categorization of restaurants by the proximity. Apart from these, we can use Kibana, a flexible and powerful visualization tool provided by Elasticsearch for geo-based operations. Resources for Article: Further resources on this subject: Elasticsearch Administration [article] Extending ElasticSearch with Scripting [article] Indexing the Data [article]

Getting Started with Apache Spark

Packt
17 Jul 2015
7 min read
In this article by Rishi Yadav, the author of Spark Cookbook, we will cover the following recipes:

Installing Spark from binaries
Building the Spark source code with Maven

(For more resources related to this topic, see here.)

Introduction

Apache Spark is a general-purpose cluster computing system used to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.

Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of memory for storage. In MapReduce, memory is primarily used for the actual computation; Spark uses memory both to compute and to store objects. Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion. Though Spark is written in Scala, and this book only focuses on recipes in Scala, Spark also supports Java and Python.

Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements. The following figure shows the Spark ecosystem: the Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager, called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks; in short, it is an off-heap, in-memory storage layer that helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop's compute framework, with a robust resource management feature that Spark can seamlessly use.

Installing Spark from binaries

Spark can either be built from the source code, or precompiled binaries can be downloaded from http://spark.apache.org. For a standard use case, binaries are good enough, and this recipe will focus on installing Spark using binaries.

Getting ready

All the recipes in this book are developed using Ubuntu Linux but should work fine on any POSIX environment. Spark expects Java to be installed and the JAVA_HOME environment variable to be set. In Linux/Unix systems, there are certain standards for the location of files and directories, which we are going to follow in this book. The following is a quick cheat sheet of directories and their purpose:

/bin: Essential command binaries
/etc: Host-specific system configuration
/opt: Add-on application software packages
/var: Variable data
/tmp: Temporary files
/home: User home directories

How to do it...

At the time of writing this, Spark's current version is 1.4. Please check the latest version from Spark's download page at http://spark.apache.org/downloads.html. Binaries are developed with the most recent and stable version of Hadoop.
To use a specific version of Hadoop, the recommended approach is to build from sources, which will be covered in the next recipe. The following are the installation steps:

Open the terminal and download the binaries using the following command:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz

Unpack the binaries:
$ tar -zxf spark-1.4.0-bin-hadoop2.4.tgz

Rename the folder containing the binaries by stripping the version information:
$ sudo mv spark-1.4.0-bin-hadoop2.4 spark

Move the configuration folder to the /etc folder so that it can be made a symbolic link later:
$ sudo mv spark/conf/* /etc/spark

Create your company-specific installation directory under /opt. As the recipes in this book are tested on the infoobjects sandbox, we are going to use infoobjects as the directory name. Create the /opt/infoobjects directory:
$ sudo mkdir -p /opt/infoobjects

Move the spark directory to /opt/infoobjects as it's an add-on software package:
$ sudo mv spark /opt/infoobjects/

Change the ownership of the spark home directory to root:
$ sudo chown -R root:root /opt/infoobjects/spark

Change the permissions of the spark home directory, 0755 = user:read-write-execute group:read-execute world:read-execute:
$ sudo chmod -R 755 /opt/infoobjects/spark

Move to the spark home directory:
$ cd /opt/infoobjects/spark

Create the symbolic link:
$ sudo ln -s /etc/spark conf

Append to PATH in .bashrc:
$ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc

Open a new terminal. Create the log directory in /var:
$ sudo mkdir -p /var/log/spark

Make hduser the owner of the Spark log directory:
$ sudo chown -R hduser:hduser /var/log/spark

Create the Spark tmp directory:
$ mkdir /tmp/spark

Configure Spark with the help of the following command lines:
$ cd /etc/spark
$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh

Building the Spark source code with Maven

Installing Spark using binaries works fine in most cases. For advanced cases, such as the following (but not limited to), compiling from the source code is a better option:

Compiling for a specific Hadoop version
Adding the Hive integration
Adding the YARN integration

Getting ready

The following are the prerequisites for this recipe to work:

Java 1.6 or a later version
Maven 3.x

How to do it...
The following are the steps to build the Spark source code with Maven:

Increase MaxPermSize for the heap:
$ echo "export _JAVA_OPTIONS="-XX:MaxPermSize=1G"" >> /home/hduser/.bashrc

Open a new terminal window and download the Spark source code from GitHub:
$ wget https://github.com/apache/spark/archive/branch-1.4.zip

Unpack the archive:
$ unzip branch-1.4.zip

Move to the spark directory:
$ cd spark

Compile the sources with these flags: YARN enabled, Hadoop version 2.4, Hive enabled, and skipping tests for faster compilation:
$ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

Move the conf folder to the etc folder so that it can be made a symbolic link:
$ sudo mv spark/conf /etc/

Move the spark directory to /opt as it's an add-on software package:
$ sudo mv spark /opt/infoobjects/spark

Change the ownership of the spark home directory to root:
$ sudo chown -R root:root /opt/infoobjects/spark

Change the permissions of the spark home directory, 0755 = user:rwx group:r-x world:r-x:
$ sudo chmod -R 755 /opt/infoobjects/spark

Move to the spark home directory:
$ cd /opt/infoobjects/spark

Create a symbolic link:
$ sudo ln -s /etc/spark conf

Put the Spark executable in the path by editing .bashrc:
$ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc

Create the log directory in /var:
$ sudo mkdir -p /var/log/spark

Make hduser the owner of the Spark log directory:
$ sudo chown -R hduser:hduser /var/log/spark

Create the Spark tmp directory:
$ mkdir /tmp/spark

Configure Spark with the help of the following command lines:
$ cd /etc/spark
$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh

Summary

In this article, we learned what Apache Spark is, how we can install Spark from binaries, and how to build the Spark source code with Maven.

Resources for Article:

Further resources on this subject:

Big Data Analysis (R and Hadoop) [Article]
YARN and Hadoop [Article]
Hadoop and SQL [Article]

Clustering and Other Unsupervised Learning Methods

Packt
09 Jul 2015
19 min read
In this article by Ferran Garcia Pagans, author of the book Predictive Analytics Using Rattle and Qlik Sense, we will learn about the following: Define machine learning Introduce unsupervised and supervised methods Focus on K-means, a classic machine learning algorithm, in detail We'll create clusters of customers based on their annual money spent. This will give us a new insight. Being able to group our customers based on their annual money spent will allow us to see the profitability of each customer group and deliver more profitable marketing campaigns or create tailored discounts. Finally, we'll see hierarchical clustering, different clustering methods, and association rules. Association rules are generally used for market basket analysis. Machine learning – unsupervised and supervised learning Machine Learning (ML) is a set of techniques and algorithms that gives computers the ability to learn. These techniques are generic and can be used in various fields. Data mining uses ML techniques to create insights and predictions from data. In data mining, we usually divide ML methods into two main groups – supervisedlearning and unsupervisedlearning. A computer can learn with the help of a teacher (supervised learning) or can discover new knowledge without the assistance of a teacher (unsupervised learning). In supervised learning, the learner is trained with a set of examples (dataset) that contains the right answer; we call it the training dataset. We call the dataset that contains the answers a labeled dataset, because each observation is labeled with its answer. In supervised learning, you are supervising the computer, giving it the right answers. For example, a bank can try to predict the borrower's chance of defaulting on credit loans based on the experience of past credit loans. The training dataset would contain data from past credit loans, including if the borrower was a defaulter or not. In unsupervised learning, our dataset doesn't have the right answers and the learner tries to discover hidden patterns in the data. In this way, we call it unsupervised learning because we're not supervising the computer by giving it the right answers. A classic example is trying to create a classification of customers. The model tries to discover similarities between customers. In some machine learning problems, we don't have a dataset that contains past observations. These datasets are not labeled with the correct answers and we call them unlabeled datasets. In traditional data mining, the terms descriptive analytics and predictive analytics are used for unsupervised learning and supervised learning. In unsupervised learning, there is no target variable. The objective of unsupervised learning or descriptive analytics is to discover the hidden structure of data. There are two main unsupervised learning techniques offered by Rattle: Cluster analysis Association analysis Cluster analysis Sometimes, we have a group of observations and we need to split it into a number of subsets of similar observations. Cluster analysis is a group of techniques that will help you to discover these similarities between observations. Market segmentation is an example of cluster analysis. You can use cluster analysis when you have a lot of customers and you want to divide them into different market segments, but you don't know how to create these segments. Sometimes, especially with a large amount of customers, we need some help to understand our data. 
Clustering can help us to create different customer groups based on their buying behavior. In Rattle's Cluster tab, there are four cluster algorithms:
KMeans
EwKm
Hierarchical
BiCluster
The two most popular families of cluster algorithms are hierarchical clustering and centroid-based clustering.
Centroid-based clustering using the K-means algorithm
I'm going to use K-means as an example of this family because it is the most popular. With this algorithm, a cluster is represented by a point or center called the centroid. In the initialization step of K-means, we need to create k centroids; usually, the centroids are initialized randomly. In the following diagram, the observations or objects are represented with a point and three centroids are represented with three colored stars:
After this initialization step, the algorithm enters an iteration with two operations. The computer associates each object with the nearest centroid, creating k clusters. Then, the computer recalculates the centroids' positions; the new position is the mean of each attribute of every cluster member. This example is very simple, but in real life, when the algorithm associates the observations with the new centroids, some observations move from one cluster to another. The algorithm iterates, recalculating centroids and assigning observations to each cluster, until some finalization condition is reached, as shown in this diagram:
The inputs of a K-means algorithm are the observations and the number of clusters, k. The final result of a K-means algorithm is k centroids that represent each cluster and the observations associated with each cluster. The drawbacks of this technique are:
You need to know or decide the number of clusters, k.
The result of the algorithm has a big dependence on k.
The result of the algorithm depends on where the centroids are initialized.
There is no guarantee that the result is the optimum result. The algorithm can iterate around a local optimum.
In order to avoid a local optimum, you can run the algorithm many times, starting with different centroid positions. To compare the different runs, you can use the cluster distortion – the sum of the squared distances between each observation and its centroid.
Customer segmentation with K-means clustering
We're going to use the wholesale customer dataset we downloaded from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. You can download the dataset from here – https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#. The dataset contains 440 customers (observations) of a wholesale distributor. It includes the annual spend in monetary units on six product categories – Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen. We've created a new field called Food that includes all categories except Detergents_Paper, as shown in the following screenshot:
Load the new dataset into Rattle and go to the Cluster tab. Remember that, in unsupervised learning, there is no target variable. I want to create a segmentation based only on buying behavior; for this reason, I set Region and Channel to Ignore, as shown here:
In the following screenshot, you can see the options Rattle offers for K-means. The most important one is Number of clusters; as we've seen, the analyst has to decide the number of clusters before running K-means:
We have also seen that the initial position of the centroids can have some influence on the result of the algorithm.
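To make the two-step iteration described above concrete, here is a minimal K-means sketch written from scratch in Python with NumPy. It is only an illustration, not part of the Rattle workflow; the function name, the toy data, and the seed parameter are my own choices:

import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    # X is an (n_observations, n_attributes) array; k is the number of clusters
    rng = np.random.default_rng(seed)
    # Initialization step: pick k random observations as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Operation 1: associate each observation with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Operation 2: recalculate each centroid as the mean of its cluster members
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # finalization condition
            break
        centroids = new_centroids
    # Distortion: the sum of squared distances between each observation and its centroid
    distortion = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, distortion

# Three well-separated blobs of toy data standing in for customer observations
X = np.vstack([np.random.default_rng(i).normal(loc, 1.0, size=(50, 2))
               for i, loc in enumerate((0, 5, 10))])
centroids, labels, distortion = kmeans(X, k=3)
print(centroids)
print(distortion)

Running the sketch a few times with different seed values shows how the initial centroid positions can change the final clusters and the distortion, which is exactly the issue the Rattle options below deal with.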
The position of the centroids is random, but we need to be able to reproduce the same experiment multiple times. When we're creating a model with K-means, we'll iteratively re-run the algorithm, tuning some options in order to improve the performance of the model. In this case, we need to be able to reproduce exactly the same experiment. Under the hood, R has a pseudo-random number generator based on a starting point called Seed. If you want to reproduce the exact same experiment, you need to re-run the algorithm using the same Seed.
Sometimes, the performance of K-means depends on the initial position of the centroids, so you need to be able to re-run the model using a different initial position for the centroids. To run the model with different initial positions, run it with a different Seed.
After executing the model, Rattle will show some interesting information: the size of each cluster, the means of the variables in the dataset, the centroids' positions, and the Within cluster sum of squares value. This measure, also called distortion, is the sum of the squared differences between each point and its centroid. It's a measure of the quality of the model. Another interesting option is Runs; by using this option, Rattle will run the model the specified number of times and will choose the model with the best performance based on the Within cluster sum of squares value.
Deciding on the number of clusters can be difficult. To choose the number of clusters, we need a way to evaluate the performance of the algorithm. The sum of the squared distances between the observations and the associated centroids could be a performance measure. Each time we add a centroid to KMeans, this sum decreases; the difference in this measure between two numbers of centroids is the gain associated with the added centroids. Rattle provides an option to automate this test, called Iterate Clusters. If you set the Number of clusters value to 10 and check the Iterate Clusters option, Rattle will run KMeans iteratively, starting with 3 clusters and finishing with 10 clusters.
To compare each iteration, Rattle provides an iteration plot. In the iteration plot, the blue line shows the sum of the squared differences between each observation and its centroid. The red line shows the difference between the current sum of squared distances and the sum of squared distances of the previous iteration. For example, for four clusters, the red line has a very low value; this is because the difference between the sum of the squared differences with three clusters and with four clusters is very small. In the following screenshot, the peak in the red line suggests that six clusters could be a good choice, because there is an important drop in the Sum of WithinSS value at this point:
In this way, to finish my model, I only need to set the Number of clusters to 6, uncheck the Re-Scale checkbox, and click on the Execute button:
Finally, Rattle returns the six centroids of my clusters:
Now we have the six centroids and we want Rattle to associate each observation with a centroid. Go to the Evaluate tab, select the KMeans option, select the Training dataset, mark All in the report type, and click on the Execute button as shown in the following screenshot. This process will generate a CSV file with the original dataset and a new column called kmeans.
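If you prefer to see the same workflow outside Rattle, the following sketch with pandas and scikit-learn compares the Within cluster sum of squares for several values of k (the idea behind the Iterate Clusters plot), fits a final model, and writes out the dataset with a kmeans label column, much like the file Rattle's Evaluate tab produces. The file name, column names, and output name are assumptions about the downloaded UCI CSV, not the article's exact setup:

import pandas as pd
from sklearn.cluster import KMeans

spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
customers = pd.read_csv("Wholesale customers data.csv")   # assumed file name
spend = customers[spend_cols]                             # Region and Channel are ignored

# Within cluster sum of squares for k = 3..10; random_state plays the role of Seed
# and n_init plays the role of the Runs option
withinss = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(spend).inertia_
            for k in range(3, 11)}
for k in range(4, 11):
    # The drop between consecutive values of k is the red line in the iteration plot
    print(k, round(withinss[k]), round(withinss[k - 1] - withinss[k]))

# Fit the final model with the chosen number of clusters and label every customer
final_model = KMeans(n_clusters=6, n_init=10, random_state=42)
customers["kmeans"] = final_model.fit_predict(spend) + 1   # 1-based cluster labels
customers.to_csv("wholesale_customers_clustered.csv", index=False)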
The content of this attribute is a label (a number) representing the cluster associated with the observation (customer), as shown in the following screenshot:
After clicking on the Execute button, you will need to choose a folder to save the resulting file to and type in a filename. The generated data inside the CSV file will look similar to the following screenshot:
In the previous screenshot, you can see ten lines of the resulting file; note that the last column is kmeans.
Preparing the data in Qlik Sense
Our objective is to create the data model, but using the new CSV file with the kmeans column. We're going to update our application by replacing the customer data file with this new data file. Save the new file in the same folder as the original file, open the Qlik Sense application, and go to Data load editor.
There are two differences between the original file and this one. In the original file, we added a line to create a customer identifier called Customer_ID; in this second file, we already have this field in the dataset. The second difference is that this new file has the kmeans column. From Data load editor, go to the Wholesale customer data sheet, modify line 2, and add line 3. In line 2, we just load the content of Customer_ID, and in line 3, we load the content of the kmeans field and rename it to Cluster, as shown in the following screenshot. Finally, update the name of the file to the new one and click on the Load data button:
When the data load process finishes, open the data model viewer to check your data model, as shown here:
Note that you have the same data model with a new field called Cluster.
Creating a customer segmentation sheet in Qlik Sense
Now we can add a sheet to the application. We'll add three charts to see our clusters and how our customers are distributed among them. The first chart will describe the buying behavior of each cluster, as shown here:
The second chart will show all customers distributed in a scatter plot, and in the last chart we'll see the number of customers that belong to each cluster, as shown here:
I'll start with the chart at the bottom-right; it's a bar chart with Cluster as the dimension and Count([Customer_ID]) as the measure. This simple bar chart has something special – colors. Each customer cluster has a color code that we use in all charts; in this way, cluster 5 is blue in all three charts. To obtain this effect, we define the color with the expression color(fieldindex('Cluster', Cluster)), which is shown in the following screenshot:
You can find this color trick and more in this interesting blog by Rob Wunderlich – http://qlikviewcookbook.com/.
My second chart is the one at the top. I copied the previous chart and pasted it onto a free place. I kept the dimension but replaced the measure with six new measures:
Avg([Detergents_Paper])
Avg([Delicassen])
Avg([Fresh])
Avg([Frozen])
Avg([Grocery])
Avg([Milk])
I placed my last chart at the bottom-left. I used a scatter plot to represent all of my 440 customers. I wanted to show the money spent by each customer on food and detergents, and its cluster. I used the y axis to show the money spent on detergents and the x axis for the money spent on food. Finally, I used colors to highlight the cluster. The dimension is Customer_ID and the measures are Delicassen+Fresh+Frozen+Grocery+Milk (or Food) and [Detergents_Paper]. As the final step, I reused the color expression from the earlier charts.
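The numbers behind these charts can also be sanity-checked outside Qlik Sense with a quick pandas aggregation. This is only a hedged sketch; it assumes the clustered CSV produced earlier in this article's workflow (here named wholesale_customers_clustered.csv) and the UCI column names:

import pandas as pd

labelled = pd.read_csv("wholesale_customers_clustered.csv")   # assumed file name

# Food aggregates every category except Detergents_Paper, as in the scatter plot
labelled["Food"] = labelled[["Fresh", "Milk", "Grocery", "Frozen", "Delicassen"]].sum(axis=1)

# Average spend per category for each cluster, mirroring the measures of the bar chart
spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen", "Food"]
print(labelled.groupby("kmeans")[spend_cols].mean().round(1))

# Number of customers per cluster, mirroring Count([Customer_ID])
print(labelled["kmeans"].value_counts().sort_index())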
Now our first Qlik Sense application has two sheets – the original one is 100 percent Qlik Sense and helps us to understand our customers, channels, and regions. This new sheet uses clustering to give us a different point of view; it groups the customers by their similar buying behavior. All this information is useful for delivering better campaigns to our customers. Cluster 5 is our least profitable cluster, but it is the biggest one, with 227 customers. The main difference between cluster 5 and cluster 2 is the amount of money spent on fresh products. Can we deliver any offer to customers in cluster 5 to try to sell more fresh products? Select retail customers and ask yourself: who are our best retail customers? To which cluster do they belong? Are they buying all our product categories?
Hierarchical clustering
Hierarchical clustering tries to group objects based on their similarity. To explain how this algorithm works, we're going to start with seven points (or observations) lying in a straight line:
We start by calculating the distance between each pair of points. I'll come back to the term distance later; in this example, distance is the difference between two positions in the line. The points D and E are the ones with the smallest distance between them, so we group them in a cluster, as shown in this diagram:
Now, we substitute points D and E with their mean (the red point) and we look for the two points with the next smallest distance between them. In this second iteration, the closest points are B and C, as shown in this diagram:
We continue iterating until we've grouped all observations in the dataset, as shown here:
Note that, in this algorithm, we can decide on the number of clusters after running the algorithm. If we divide the dataset into two clusters, the first cluster is point G and the second cluster is A, B, C, D, E, and F. This gives the analyst the opportunity to see the big picture before deciding on the number of clusters. The lowest level of clustering is a trivial one; in this example, seven clusters with one point in each one.
The chart I've created while explaining the algorithm is a basic form of a dendrogram. The dendrogram is a tree diagram used in Rattle and other tools to illustrate the layout of the clusters produced by hierarchical clustering. In the following screenshot, we can see the dendrogram created by Rattle for the wholesale customer dataset. In Rattle's dendrogram, the y axis represents all observations or customers in the dataset, and the x axis represents the distance between the clusters:
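If you want to reproduce a dendrogram outside Rattle, the following illustrative sketch uses SciPy on the same spend columns. The file name, the Ward linkage method, and the cut into six clusters are my assumptions, not the article's Rattle settings (note that SciPy draws the observations on the x axis rather than the y axis):

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

spend_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
spend = pd.read_csv("Wholesale customers data.csv")[spend_cols]   # assumed file name

# Agglomerative clustering: repeatedly merge the two closest clusters
merges = linkage(spend, method="ward")

# The dendrogram shows the big picture before we decide on the number of clusters
dendrogram(merges, no_labels=True)
plt.xlabel("Customers")
plt.ylabel("Distance between merged clusters")
plt.show()

# Cut the tree into a chosen number of clusters after inspecting the dendrogram
labels = fcluster(merges, t=6, criterion="maxclust")
print(pd.Series(labels).value_counts().sort_index())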
Association analysis
Association rules, or association analysis, is also an important topic in data mining. This is an unsupervised method, so we start with an unlabeled dataset. An unlabeled dataset is a dataset without a variable that gives us the right answer. Association analysis attempts to find relationships between different entities. The classic example of association rules is market basket analysis. This means using a database of supermarket transactions to find items that are bought together. For example, a person who buys potatoes and burgers usually buys beer. This insight could be used to optimize the supermarket layout. Online stores are also a good example of association analysis; they usually suggest a new item to you based on the items you have bought. They analyze online transactions to find patterns in buyers' behavior. These algorithms assume all variables are categorical; they perform poorly with numeric variables. Association methods can take a long time to complete, and they use a lot of CPU and memory. Remember that Rattle runs on R and the R engine loads all data into RAM.
Suppose we have a dataset such as the following:
Our objective is to discover items that are purchased together. We'll create rules and represent them like this:
Chicken, Potatoes → Clothes
This rule means that when a customer buys Chicken and Potatoes, they tend to also buy Clothes. As we'll see, the output of the model will be a set of rules, and we need a way to evaluate the quality or interest of a rule. There are different measures, but we'll use only a few of them. Rattle provides three measures:
Support
Confidence
Lift
Support indicates how often the rule appears in the whole dataset. In our dataset, the rule Chicken, Potatoes → Clothes has a support of 42.86 percent (3 occurrences / 7 transactions).
Confidence measures how strong rules or associations are between items. In this dataset, the rule Chicken, Potatoes → Clothes has a confidence of 1: the items Chicken and Potatoes appear three times in the dataset, the items Chicken, Potatoes, and Clothes also appear three times, and 3/3 = 1. A confidence close to 1 indicates a strong association.
Lift measures how much more often the two sides of a rule appear together than we would expect if they were independent; a lift well above 1 suggests an interesting rule.
In the following screenshot, I've highlighted the options on the Associate tab that we have to choose from before executing an association method in Rattle:
The first option is the Baskets checkbox. Depending on the kind of input data, we'll decide whether or not to check this option. If the option is checked, as in the preceding screenshot, Rattle needs an identification variable and a target variable. After this example, we'll try another example without this option.
The second option is the minimum Support value; by default, it is set to 0.1. Rattle will not return rules with a Support value lower than the one you have set in this text box. If you choose a higher value, Rattle will only return rules that appear many times in your dataset; if you choose a lower value, Rattle will also return rules that appear in your dataset only a few times. Usually, if you set a high value for Support, the system will return only the obvious relationships. I suggest you start with a high Support value and execute the method several times, lowering the value with each execution. In this way, new rules will appear in each execution that you can analyze.
The third parameter you have to set is Confidence; this parameter tells you how strong the rules have to be. Finally, the length is the number of items that a rule contains. A rule like Beer → Chips has a length of two. The default option for Min Length is 2; if you set this variable to 2, Rattle will return all rules with two or more items in them.
After executing the model, you can see the rules created by Rattle by clicking on the Show Rules button, as illustrated here:
Rattle provides a very simple dataset for testing association rules in a file called dvdtrans.csv. Test the dataset to learn about association rules.
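Before moving on, you can also check the arithmetic behind Support, Confidence, and Lift with a small self-contained sketch in plain Python. The seven transactions below are made up for illustration and are not the book's dataset:

# A made-up list of 7 market-basket transactions, for illustration only
transactions = [
    {"Chicken", "Potatoes", "Clothes"},
    {"Chicken", "Potatoes", "Clothes", "Beer"},
    {"Chicken", "Potatoes", "Clothes", "Milk"},
    {"Potatoes", "Beer"},
    {"Milk", "Beer"},
    {"Chicken", "Milk"},
    {"Clothes", "Milk"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing the left-hand side, the fraction that also contain the right-hand side
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # How much more often the two sides occur together than if they were independent
    return confidence(lhs, rhs) / support(rhs)

lhs, rhs = {"Chicken", "Potatoes"}, {"Clothes"}
print(support(lhs | rhs))    # 3 occurrences / 7 transactions = 0.4286
print(confidence(lhs, rhs))  # 3 / 3 = 1.0
print(lift(lhs, rhs))        # 1.0 / (4/7) = 1.75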
Further learning
In this article, we introduced supervised and unsupervised learning, the two main subgroups of machine learning algorithms. If you want to learn more about machine learning, I suggest you complete the MOOC course Machine Learning at Coursera: https://www.coursera.org/learn/machine-learning
The acronym MOOC stands for Massive Open Online Course; these are courses open to participation via the Internet and are generally free. Coursera is one of the leading platforms for MOOC courses. Machine Learning is a great course designed and taught by Andrew Ng, Associate Professor at Stanford University; Chief Scientist at Baidu; and Chairman and Co-founder of Coursera.
Another very interesting resource is the book Machine Learning with R by Brett Lantz, Packt Publishing.
Summary
In this article, we were introduced to machine learning, and to supervised and unsupervised methods. We focused on unsupervised methods and covered centroid-based clustering, hierarchical clustering, and association rules. We used a simple dataset, but we saw how a clustering algorithm can complement a 100 percent Qlik Sense approach by adding more information.
Resources for Article:
Further resources on this subject:
Qlik Sense's Vision [article]
Securing QlikView Documents [article]
Conozca QlikView [article]