
How-To Tutorials - Data


About MongoDB

Packt
27 Nov 2014
17 min read
In this article by Amol Nayak, the author of MongoDB Cookbook, we describe various features of MongoDB. (For more resources related to this topic, see here.)

MongoDB is a document-oriented database and the most popular NoSQL database. The rankings at http://db-engines.com/en/ranking show that MongoDB sits in fifth place overall as of August 2014 and is the first NoSQL product in the list. It is used in production by a long list of companies across various domains, handling terabytes of data efficiently. MongoDB is designed to scale horizontally and cope with increasing data volumes. It is very simple to get started with, is backed by good support from MongoDB, the company behind it, and has a vast array of open source and proprietary tools built around it to improve developer and administrator productivity.

In this article, we will cover the following recipes:

- Single node installation of MongoDB with options from the config file
- Viewing database stats
- Creating an index and viewing plans of queries

Single node installation of MongoDB with options from the config file

Providing options from the command line does the job, but it gets awkward as soon as the number of options grows. We have a clean alternative: provide the startup options from a configuration file rather than as command-line arguments.

Getting ready

We assume that the MongoDB binaries have been downloaded from http://www.mongodb.org/downloads for your host operating system, extracted, and that the bin directory of MongoDB is on the operating system's path variable (this is not mandatory, but it really is convenient).

How to do it…

The /data/mongo/db directory for the database and /logs/ for the logs should be created and present on your filesystem, with the appropriate permissions to write to them. Let's take a look at the steps in detail:

1. Create a config file, which can have any arbitrary name. In our case, let's say we create the file at /conf/mongo.conf. We will then edit the file and add the following lines to it:

    port = 27000
    dbpath = /data/mongo/db
    logpath = /logs/mongo.log
    smallfiles = true

2. Start the Mongo server using the following command:

    > mongod --config /conf/mongo.conf

How it works…

The properties are specified as <property name> = <value>. For all those properties that don't take values, for example the smallfiles option, the value given is the Boolean value true. If you need verbose output, add v=true (or multiple v's to make it more verbose) to the config file. If you already know what the command-line option is, it is pretty easy to guess the name of the property in the file: it is the same as the command-line option, with just the hyphen removed.

Viewing database stats

In this recipe, we will see how to get the statistics of a database.

Getting ready

To find the stats of the database, we need a server up and running; a single node is sufficient. The data on which we will be operating needs to be imported into the database. Once these steps are completed, we are all set to go ahead with this recipe.

How to do it…

We will be using the test database for the purpose of this recipe. It already has the postalCodes collection in it.
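If you need to load the postal code data yourself, a minimal sketch using mongoimport might look like the following (the file name and the CSV format are assumptions, not part of the original recipe):

    $ mongoimport --db test --collection postalCodes --type csv --headerline --file postalCodes.csv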
Let's take a look at the steps in detail:

1. Connect to the server using the Mongo shell by typing the following command from the operating system terminal (it is assumed that the server is listening on port 27017):

    $ mongo

2. On the shell, execute the following command and observe the output:

    > db.stats()

3. Now, execute the following command, but this time with the scale parameter, and observe the output:

    > db.stats(1024)
    {
      "db" : "test",
      "collections" : 3,
      "objects" : 39738,
      "avgObjSize" : 143.32699179626553,
      "dataSize" : 5562,
      "storageSize" : 16388,
      "numExtents" : 8,
      "indexes" : 2,
      "indexSize" : 2243,
      "fileSize" : 196608,
      "nsSizeMB" : 16,
      "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
      },
      "ok" : 1
    }

How it works…

Let us start by looking at the collections field. If you look carefully at the number and also execute the show collections command on the Mongo shell, you will find one extra collection in the stats as compared to the ones listed by the command. The difference is one hidden collection whose name is system.namespaces. You may execute db.system.namespaces.find() to view its contents.

Getting back to the output of the stats operation on the database, the objects field in the result has an interesting value too. If we find the count of documents in the postalCodes collection, we see that it is 39732. The count shown here is 39738, which means there are six more documents. These six documents come from the system.namespaces and system.indexes collections; executing a count query on these two collections will confirm it. Note that the test database doesn't contain any other collection apart from postalCodes. The figures will change if the database contains more collections with documents in them.

The scale parameter, which is a parameter to the stats function, divides the number of bytes by the given scale value. In this case, it is 1024, and hence all the size values are in KB. Let's analyze the output shown earlier. The following table shows the meaning of the important fields:

db: The name of the database whose stats are being viewed.

collections: The total number of collections in the database.

objects: The count of documents across all collections in the database. If we find the stats of a collection by executing db.<collection>.stats(), we get the count of documents in that collection. This attribute is the sum of the counts of all the collections in the database.

avgObjSize: The size (in bytes) of all the objects in all the collections in the database, divided by the count of documents across all the collections. This value is not affected by the scale provided, even though it is a size field.

dataSize: The total size of the data held across all the collections in the database. This value is affected by the scale provided.

storageSize: The total amount of storage allocated to collections in this database for storing documents. This value is affected by the scale provided.

numExtents: The count of all extents in the database across all collections. This is basically the sum of numExtents in the collection stats for the collections in this database.

indexes: The sum of the number of indexes across all collections in the database.

indexSize: The size (in bytes) of all the indexes of all the collections in the database. This value is affected by the scale provided.

fileSize: The sum of the sizes of all the database files you will find on the filesystem for this database. The files are named test.0, test.1, and so on for the test database. This value is affected by the scale provided.

nsSizeMB: The size, in MB, of the .ns file of the database.

One more thing to note is the value of avgObjSize; there is something odd about it. Unlike the same field in a collection's stats, which is affected by the scale provided, in the database stats this value is always in bytes. This is pretty confusing, and one cannot really be sure why it is not scaled according to the provided scale.

Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need a server up and running. A simple single node is all we will need. The data with which we will be operating needs to be imported into the database. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that finds all the zip codes in a given state. To do this, perform the following steps:

1. Execute the following query to view the plan of a query:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

   Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

2. Let's execute the same query again, but this time limit the results to only 100 documents:

    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

   Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

3. We will now create an index on the state and pincode fields as follows:

    > db.postalCodes.ensureIndex({state:1, pincode:1})

4. Execute the following query:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

   Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result.

5. Since we want only the pin codes, we will modify the query as follows and view its plan:

    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()

   Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to keep in mind for index creation and some gotchas.

Analysis of the plan

Let's look at the first step and analyze the output of the query we executed:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the nonrelevant fields for now):

    {
      "cursor" : "BasicCursor",
      "n" : 6446,
      "nscannedObjects" : 39732,
      "nscanned" : 39732,
      …
      "millis" : 55,
      …
    }

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) has happened to search for the matching documents in the entire collection. The value of n is 6446, which is the number of results that matched the query.
The nscanned and nscannedObjects fields have a value of 39,732, which is the number of documents in the collection that were scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again, without the index but with the limit clause:

    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
    {
      "cursor" : "BasicCursor",
      …
      "n" : 100,
      "nscannedObjects" : 19951,
      "nscanned" : 19951,
      …
      "millis" : 30,
      …
    }

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in the time the query took to execute and in the number of objects scanned to retrieve the results. This is due to the fact that Mongo does not scan the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended that you use the limit function to restrict the number of results when the maximum number of documents to be accessed is known upfront. This might give better query performance. The word "might" is important, as in the absence of an index, the collection might still be scanned completely if the number of matches is not met.

Improvement using indexes

Moving on, we create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1); the order is not significant unless we plan to execute a multikey sort. It is, however, a deciding factor as to whether the result can be sorted using only the index or whether Mongo needs to sort it in memory later on, before returning the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

    {
      "cursor" : "BtreeCursor state_1_pincode_1",
      …
      "n" : 6446,
      "nscannedObjects" : 6446,
      "nscanned" : 6446,
      …
      "indexOnly" : false,
      …
      "millis" : 16,
      …
    }

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed being used. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and the number of documents scanned in the collection have now come down to the same number of documents as in the result. This is because we now used an index that gave us the starting document from which to scan, and then only the required number of documents were scanned. This is similar to using a book's index to find a word rather than scanning the entire book to search for it. The time, millis, has come down too, as expected.

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what it means. To understand this value, we need to look briefly at how indexes operate. Indexes store a subset of the fields of the original documents in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in the order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; it acts as a pointer to the original document in the collection.
Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another IO operation to fetch the complete document from the collection; this document is then returned to the user.

The value of indexOnly being false indicates that the data requested by the user in the query is not entirely present in the index, and an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the values been present in the index itself, an additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This will also make the query covered by the index, as the index entry has just the state's name and pin code, and the required data can be served completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Executing the following query results in the following plan:

    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
    {
      "cursor" : "BtreeCursor state_1_pincode_1",
      …
      "n" : 6446,
      "nscannedObjects" : 0,
      "nscanned" : 6446,
      …
      "indexOnly" : true,
      …
      "millis" : 15,
      …
    }

The values of the nscannedObjects and indexOnly fields are the ones to observe. As expected, since the data we requested in the projection of the find query is the pin code only, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index, and thus the nscanned value is 6446. We, however, didn't reach out to any document in the collection on disk, as this query was covered by the index alone and no additional IO was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance. Another thing to remember is that, wherever possible, try to use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 so that it is not retrieved, since it is not part of the index. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creation

We will now see some pitfalls in index creation and some facts about using array fields in indexes. Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in a query, one should bear in mind a possible performance bottleneck as the data size increases. Similarly, the $in operator should be preferred over the $or operator, as both can be used to achieve more or less the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator, and explain the plan for both of these queries.
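For reference, one way the two exercise queries might be written, as a sketch using the operators just mentioned:

    > db.postalCodes.find({$or: [{state:'Maharashtra'}, {state:'Gujarat'}]}).explain()
    > db.postalCodes.find({state: {$in: ['Maharashtra', 'Gujarat']}}).explain()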
What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint when creating indexes that contain array fields: when creating an index using multiple fields, no more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array. If multiple fields of the array type were allowed to be part of an index, we would have a large number of entries in the index, which would be the product of the lengths of these array fields. For example, a document with two array fields, each of length 10, would add 100 entries to the index, had it been allowed to create one index using these two array fields.

This should be good enough for now to scratch the surface of plain vanilla indexes.

Summary

This article provided detailed recipes that describe how to use different features of MongoDB. MongoDB is a leading document-oriented NoSQL database that offers linear scalability, making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features. In this article, we learned how to perform a single node installation of MongoDB with options from the config file. We also learned how to create an index from the shell and view the plans of queries.

Resources for Article:

Further resources on this subject:

- Ruby with MongoDB for Web Development [Article]
- MongoDB data modeling [Article]
- Using Mongoid [Article]

Creating reusable actions for agent behaviors with Lua

Packt
27 Nov 2014
18 min read
In this article by David Young, author of Learning Game AI Programming with Lua, we will create reusable actions for agent behaviors. (For more resources related to this topic, see here.)

Creating userdata

So far we've been using global data to store information about our agents. As we're going to create decision structures that require information about our agents, we'll create a local userData table variable that contains our specific agent data as well as the agent controller in order to manage animation handling:

local userData = {
    alive,      -- terminal flag
    agent,      -- Sandbox agent
    ammo,       -- current ammo
    controller, -- Agent animation controller
    enemy,      -- current enemy, can be nil
    health,     -- current health
    maxHealth   -- max Health
};

Moving forward, we will encapsulate more and more data as a means of isolating our systems from global variables. A userData table is perfect for storing any arbitrary piece of agent data that the agent doesn't already possess, and it provides a common storage area for data structures to manipulate agent data. The listed data members are some common pieces of information we'll be storing; when we start creating individual behaviors, we'll access and modify this data.

Agent actions

Ultimately, any decision logic or structure we create for our agents comes down to deciding what action our agent should perform. Actions themselves are isolated structures built around three distinct states:

- Uninitialized
- Running
- Terminated

The typical lifespan of an action begins in the uninitialized state; the action then goes through a one-time initialization, after which it is considered to be running. After an action completes the running phase, it moves to a terminated state where cleanup is performed. Once the cleanup of an action has been completed, the action is once again set to uninitialized until it is reactivated.

We'll start defining an action by declaring the three different states an action can be in, as well as a type specifier, so our data structures will know that a specific Lua table should be treated as an action. Remember, even though we use Lua in an object-oriented manner, Lua itself merely creates each instance of an object as a primitive table. It is up to the code we write to correctly interpret different tables as different objects. The Type variable will be used, moving forward, to distinguish one class type from another.

Action.lua:

Action = {};

Action.Status = {
    RUNNING = "RUNNING",
    TERMINATED = "TERMINATED",
    UNINITIALIZED = "UNINITIALIZED"
};

Action.Type = "Action";

Adding data members

To create an action, we'll pass in three functions that the action will use for initialization, updating, and cleanup. Additional information, such as the name of the action and a userData variable used for passing information to each callback function, is passed in at construction time.

Moving our systems away from global data and into instanced, object-oriented patterns requires each instance of an object to store its own data. As our Action class is generic, we use a custom data member, userData, to store action-specific information. Whenever a callback function for the action is executed, the same userData table passed in at construction time will be passed into each function. The update callback will receive an additional deltaTimeInMillis parameter in order to perform any time-specific update logic.
To flesh out the Action class's constructor function, we'll store each of the callback functions as well as initialize some common data members:

Action.lua:

function Action.new(name, initializeFunction, updateFunction,
        cleanUpFunction, userData)

    local action = {};

    -- The Action's data members.
    action.cleanUpFunction_ = cleanUpFunction;
    action.initializeFunction_ = initializeFunction;
    action.updateFunction_ = updateFunction;
    action.name_ = name or "";
    action.status_ = Action.Status.UNINITIALIZED;
    action.type_ = Action.Type;
    action.userData_ = userData;

    return action;
end

Initializing an action

Initializing an action begins by calling the action's initialize callback and then immediately sets the action into a running state. This transitions the action into the standard update loop from then on:

Action.lua:

function Action.Initialize(self)
    -- Run the initialize function if one is specified.
    if (self.status_ == Action.Status.UNINITIALIZED) then
        if (self.initializeFunction_) then
            self.initializeFunction_(self.userData_);
        end
    end

    -- Set the action to running after initializing.
    self.status_ = Action.Status.RUNNING;
end

Updating an action

Once an action has transitioned to a running state, it will receive callbacks to the update function every time the agent itself is updated, until the action decides to terminate. To avoid an infinite loop, the update function must return a terminated status when a condition is met; otherwise, our agents will never be able to finish the running action. An update function isn't a hard requirement for our actions, as actions terminate immediately by default if no callback function is present.

Action.lua:

function Action.Update(self, deltaTimeInMillis)
    if (self.status_ == Action.Status.TERMINATED) then
        -- Immediately return if the Action has already
        -- terminated.
        return Action.Status.TERMINATED;
    elseif (self.status_ == Action.Status.RUNNING) then
        if (self.updateFunction_) then
            -- Run the update function if one is specified.
            self.status_ = self.updateFunction_(
                deltaTimeInMillis, self.userData_);

            -- Ensure that a status was returned by the update
            -- function.
            assert(self.status_);
        else
            -- If no update function is present move the action
            -- into a terminated state.
            self.status_ = Action.Status.TERMINATED;
        end
    end

    return self.status_;
end

Action cleanup

Terminating an action is very similar to initializing one; it sets the status of the action to uninitialized once the cleanup callback has had an opportunity to finish any processing of the action. If a cleanup callback function isn't defined, the action immediately moves to an uninitialized state upon cleanup. During action cleanup, we check to make sure the action has fully terminated, and then run a cleanup function if one is specified.
Action.lua:

function Action.CleanUp(self)
    if (self.status_ == Action.Status.TERMINATED) then
        if (self.cleanUpFunction_) then
            self.cleanUpFunction_(self.userData_);
        end
    end

    self.status_ = Action.Status.UNINITIALIZED;
end

Action member functions

Now that we've created the basic initialize, update, and terminate functionalities, we can update our action constructor with CleanUp, Initialize, and Update member functions:

Action.lua:

function Action.new(name, initializeFunction, updateFunction,
        cleanUpFunction, userData)

    ...

    -- The Action's accessor functions.
    action.CleanUp = Action.CleanUp;
    action.Initialize = Action.Initialize;
    action.Update = Action.Update;

    return action;
end

Creating actions

With a basic Action class out of the way, we can start implementing the specific action logic our agents will use. Each action will consist of three callback functions—initialization, update, and cleanup—that we'll use when we instantiate our action instances.

The idle action

The first action we'll create is the basic and default choice for our agents going forward. The idle action wraps the IDLE animation request to our soldier's animation controller. As the animation controller will continue looping our IDLE command until a new command is queued, we'll time our idle action to run for 2 seconds and then terminate it to allow another action to run:

SoldierActions.lua:

function SoldierActions_IdleCleanUp(userData)
    -- No cleanup is required for idling.
end

function SoldierActions_IdleInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.IDLE);

    -- Since idle is a looping animation, cut off the idle
    -- Action after 2 seconds.
    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(
        userData.agent:GetSandbox());
    userData.idleEndTime = sandboxTimeInMillis + 2000;
end

Updating our action requires that we check how much time has passed; if 2 seconds have gone by, we terminate the action by returning the terminated state; otherwise, we return that the action is still running:

SoldierActions.lua:

function SoldierActions_IdleUpdate(deltaTimeInMillis, userData)
    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(
        userData.agent:GetSandbox());
    if (sandboxTimeInMillis >= userData.idleEndTime) then
        userData.idleEndTime = nil;
        return Action.Status.TERMINATED;
    end
    return Action.Status.RUNNING;
end

As we'll be using our idle action numerous times, we'll create a wrapper around initializing the action based on our three functions:

SoldierLogic.lua:

local function IdleAction(userData)
    return Action.new(
        "idle",
        SoldierActions_IdleInitialize,
        SoldierActions_IdleUpdate,
        SoldierActions_IdleCleanUp,
        userData);
end

The die action

Creating a basic death action is very similar to our idle action. In this case, as death in our animation controller is a terminating state, all we need to do is request that the DIE command be immediately executed. From this point, our die action is complete, and it's the responsibility of a higher-level system to stop any additional processing of logic behavior. Typically, our agents will request this state when their health drops to zero.
In the special case that our agent dies due to falling, the soldier's animation controller will manage the correct animation playback and set the soldier's health to zero:

SoldierActions.lua:

function SoldierActions_DieCleanUp(userData)
    -- No cleanup is required for death.
end

function SoldierActions_DieInitialize(userData)
    -- Issue a die command and immediately terminate.
    userData.controller:ImmediateCommand(
        userData.agent,
        SoldierController.Commands.DIE);

    return Action.Status.TERMINATED;
end

function SoldierActions_DieUpdate(deltaTimeInMillis, userData)
    return Action.Status.TERMINATED;
end

Creating a wrapper function to instantiate a death action is identical to our idle action:

SoldierLogic.lua:

local function DieAction(userData)
    return Action.new(
        "die",
        SoldierActions_DieInitialize,
        SoldierActions_DieUpdate,
        SoldierActions_DieCleanUp,
        userData);
end

The reload action

Reloading is the first action that requires an animation to complete before we can consider the action done, as the behavior will refill our agent's current ammunition count. As our animation controller is queue-based, the action itself never knows how many commands must be processed before the reload command has finished executing. To account for this during the update loop of our action, we wait till the command queue is empty, as the reload action will be the last command added to the queue. Once the queue is empty, we can terminate the action and allow the cleanup function to award the ammo:

SoldierActions.lua:

function SoldierActions_ReloadCleanUp(userData)
    userData.ammo = userData.maxAmmo;
end

function SoldierActions_ReloadInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.RELOAD);
    return Action.Status.RUNNING;
end

function SoldierActions_ReloadUpdate(deltaTimeInMillis, userData)
    if (userData.controller:QueueLength() > 0) then
        return Action.Status.RUNNING;
    end

    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function ReloadAction(userData)
    return Action.new(
        "reload",
        SoldierActions_ReloadInitialize,
        SoldierActions_ReloadUpdate,
        SoldierActions_ReloadCleanUp,
        userData);
end

The shoot action

Shooting is the first action that directly interacts with another agent. In order to apply damage to another agent, we need to modify how the soldier's shots deal with impacts. When the soldier shot bullets out of his rifle, we added a callback function to handle the cleanup of particles; now, we'll add additional functionality to decrement an agent's health if a particle impacts an agent:

Soldier.lua:

local function ParticleImpact(sandbox, collision)
    Sandbox.RemoveObject(sandbox, collision.objectA);

    local particleImpact = Core.CreateParticle(
        sandbox, "BulletImpact");
    Core.SetPosition(particleImpact, collision.pointA);
    Core.SetParticleDirection(
        particleImpact, collision.normalOnB);

    table.insert(
        impactParticles,
        { particle = particleImpact, ttl = 2.0 } );

    if (Agent.IsAgent(collision.objectB)) then
        -- Deal 5 damage per shot.
        Agent.SetHealth(
            collision.objectB,
            Agent.GetHealth(collision.objectB) - 5);
    end
end

Creating the shooting action requires more than just queuing up a shoot command to the animation controller.
As the SHOOT command loops, we'll queue an IDLE command immediately afterward so that the shoot action will terminate after a single bullet is fired. To have a chance at actually hitting an enemy agent, though, we first need to orient our agent to face toward its enemy. During the normal update loop of the action, we forcefully set the agent to point in the enemy's direction.

Forcefully setting the agent's forward direction during an action will allow our soldier to shoot, but it creates a visual artifact where the agent pops to the correct forward direction. See whether you can modify the shoot action's update to interpolate to the correct forward direction for better visual results.

SoldierActions.lua:

function SoldierActions_ShootCleanUp(userData)
    -- No cleanup is required for shooting.
end

function SoldierActions_ShootInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.SHOOT);
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.IDLE);

    return Action.Status.RUNNING;
end

function SoldierActions_ShootUpdate(deltaTimeInMillis, userData)
    -- Point toward the enemy so the Agent's rifle will shoot
    -- correctly.
    local forwardToEnemy = userData.enemy:GetPosition() -
        userData.agent:GetPosition();
    Agent.SetForward(userData.agent, forwardToEnemy);

    if (userData.controller:QueueLength() > 0) then
        return Action.Status.RUNNING;
    end

    -- Subtract a single bullet per shot.
    userData.ammo = userData.ammo - 1;
    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function ShootAction(userData)
    return Action.new(
        "shoot",
        SoldierActions_ShootInitialize,
        SoldierActions_ShootUpdate,
        SoldierActions_ShootCleanUp,
        userData);
end

The random move action

Random movement is an action that chooses a random point on the navmesh to move to. This action is very similar to other actions that move, except that it doesn't perform the movement itself. Instead, the random move action only chooses a valid point to move to and requires the move action to perform the movement:

SoldierActions.lua:

function SoldierActions_RandomMoveCleanUp(userData)

end

function SoldierActions_RandomMoveInitialize(userData)
    local sandbox = userData.agent:GetSandbox();

    local endPoint = Sandbox.RandomPoint(sandbox, "default");
    local path = Sandbox.FindPath(
        sandbox,
        "default",
        userData.agent:GetPosition(),
        endPoint);

    while #path == 0 do
        endPoint = Sandbox.RandomPoint(sandbox, "default");
        path = Sandbox.FindPath(
            sandbox,
            "default",
            userData.agent:GetPosition(),
            endPoint);
    end

    userData.agent:SetPath(path);
    userData.agent:SetTarget(endPoint);
    userData.movePosition = endPoint;

    return Action.Status.TERMINATED;
end

function SoldierActions_RandomMoveUpdate(deltaTimeInMillis, userData)
    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function RandomMoveAction(userData)
    return Action.new(
        "randomMove",
        SoldierActions_RandomMoveInitialize,
        SoldierActions_RandomMoveUpdate,
        SoldierActions_RandomMoveCleanUp,
        userData);
end

The move action

Our movement action is similar to the idle action, as the agent's walk animation will loop infinitely. In order for the agent to complete a move action, though, the agent must reach within a certain distance of its target position or time out.
In this case, we can use 1.5 meters, as that's close enough to the target position to terminate the move action, and half a second to indicate how long the move action can run for:

SoldierActions.lua:

function SoldierActions_MoveToCleanUp(userData)
    userData.moveEndTime = nil;
end

function SoldierActions_MoveToInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.MOVE);

    -- Since movement is a looping animation, cut off the move
    -- Action after 0.5 seconds.
    local sandboxTimeInMillis =
        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());
    userData.moveEndTime = sandboxTimeInMillis + 500;

    return Action.Status.RUNNING;
end

When applying the move action to our agents, the indirect soldier controller will manage all animation playback and steer our agents along their paths.

(Figure: the agent moving to a random position)

Setting a time limit for the move action will still allow our agents to move to their final target position, but it gives other actions a chance to execute in case the situation has changed. Movement paths can be long, and it is undesirable to not handle situations such as death until the move action has terminated:

SoldierActions.lua:

function SoldierActions_MoveToUpdate(deltaTimeInMillis, userData)
    -- Terminate the action after the allotted 0.5 seconds. The
    -- decision structure will simply repath if the Agent needs
    -- to move again.
    local sandboxTimeInMillis =
        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());
    if (sandboxTimeInMillis >= userData.moveEndTime) then
        userData.moveEndTime = nil;
        return Action.Status.TERMINATED;
    end

    path = userData.agent:GetPath();
    if (#path ~= 0) then
        offset = Vector.new(0, 0.05, 0);
        DebugUtilities_DrawPath(
            path, false, offset, DebugUtilities.Orange);
        Core.DrawCircle(
            path[#path] + offset, 1.5, DebugUtilities.Orange);
    end

    -- Terminate movement if the Agent is close enough to the
    -- target.
    if (Vector.Distance(userData.agent:GetPosition(),
        userData.agent:GetTarget()) < 1.5) then

        Agent.RemovePath(userData.agent);
        return Action.Status.TERMINATED;
    end

    return Action.Status.RUNNING;
end

SoldierLogic.lua:

local function MoveAction(userData)
    return Action.new(
        "move",
        SoldierActions_MoveToInitialize,
        SoldierActions_MoveToUpdate,
        SoldierActions_MoveToCleanUp,
        userData);
end

Summary

In this article, we took a look at creating userdata and reusable actions.

Resources for Article:

Further resources on this subject:

- Using Sprites for Animation [Article]
- Installing Gideros [Article]
- CryENGINE 3: Breaking Ground with Sandbox [Article]

Logistic regression

Packt
27 Nov 2014
9 min read
This article is written by Breck Baldwin and Krishna Dayanidhi, the authors of Natural Language Processing with Java and LingPipe Cookbook. In this article, we will cover logistic regression. (For more resources related to this topic, see here.)

Logistic regression is probably responsible for the majority of industrial classifiers, with the possible exception of naïve Bayes classifiers. It is almost certainly one of the best performing classifiers available, albeit at the cost of slow training and considerable complexity in configuration and tuning. Logistic regression is also known as maximum entropy, neural network classification with a single neuron, and by other names. The classifiers covered so far have been based on the underlying characters or tokens, but logistic regression uses unrestricted feature extraction, which allows arbitrary observations of the situation to be encoded in the classifier. This article closely follows a more complete tutorial at http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html.

How logistic regression works

All that logistic regression does is take a vector of feature weights over the data, apply a vector of coefficients, and do some simple math, which results in a probability for each class encountered in training. The complicated bit is in determining what the coefficients should be.

The following are some of the features produced by our training example for 21 tweets annotated for English (e) and non-English (n). There are relatively few features because feature weights are being pushed to 0.0 by our prior, and once a weight is 0.0, the feature is removed. Note that one category, n, is set to 0.0 for all features—this is a property of the logistic regression process, which fixes one category's features to 0.0 and adjusts all the other categories' features with respect to it:

FEATURE        e       n
I          :   0.37    0.0
!          :   0.30    0.0
Disney     :   0.15    0.0
"          :   0.08    0.0
to         :   0.07    0.0
anymore    :   0.06    0.0
isn        :   0.06    0.0
'          :   0.06    0.0
t          :   0.04    0.0
for        :   0.03    0.0
que        :  -0.01    0.0
moi        :  -0.01    0.0
_          :  -0.02    0.0
,          :  -0.08    0.0
pra        :  -0.09    0.0
?          :  -0.09    0.0

Take the string "I luv Disney", which will have only two non-zero features: I=0.37 and Disney=0.15 for e, and zeros for n. Since there is no feature that matches luv, it is ignored. The probability that the tweet is English breaks down to:

vectorMultiply(e, [I,Disney]) = exp(0.37*1 + 0.15*1) = 1.68
vectorMultiply(n, [I,Disney]) = exp(0*1 + 0*1) = 1

We rescale to a probability by summing the outcomes and dividing by the sum:

p(e|[I,Disney]) = 1.68/(1.68 + 1) = 0.62
p(n|[I,Disney]) = 1/(1.68 + 1) = 0.38

This is how the math works when running a logistic regression model. Training is another issue entirely.

Getting ready

This example assumes the same framework that we have been using all along to get training data from .csv files, train the classifier, and run it from the command line. Setting up to train the classifier is a bit complex because of the number of parameters and objects used in training. The main() method starts with what should be familiar classes and methods:

public static void main(String[] args) throws IOException {
    String trainingFile = args.length > 0 ? args[0]
        : "data/disney_e_n.csv";
    List<String[]> training
        = Util.readAnnotatedCsvRemoveHeader(new File(trainingFile));
    int numFolds = 0;
    XValidatingObjectCorpus<Classified<CharSequence>> corpus
        = Util.loadXValCorpus(training, numFolds);
    TokenizerFactory tokenizerFactory
        = IndoEuropeanTokenizerFactory.INSTANCE;

Note that we are using XValidatingObjectCorpus when a simpler implementation such as ListCorpus would do. We will not take advantage of any of its cross-validation features, because a numFolds param of 0 will have training visit the entire corpus. We are trying to keep the number of novel classes to a minimum, and we tend to always use this implementation in real-world gigs anyway.

Now, we will start to build the configuration for our classifier. The FeatureExtractor<E> interface provides a mapping from data to features; this will be used to train and run the classifier. In this case, we are using a TokenFeatureExtractor() method, which creates features based on the tokens found by the tokenizer supplied during construction. This is similar to what naïve Bayes reasons over:

FeatureExtractor<CharSequence> featureExtractor
    = new TokenFeatureExtractor(tokenizerFactory);

The minFeatureCount item is usually set to a number higher than 1, but with small training sets, this is needed to get any performance. The thought behind filtering feature counts is that logistic regression tends to overfit low-count features that, just by chance, exist in only one category of training data. As training data grows, the minFeatureCount value is usually adjusted by paying attention to cross-validation performance:

int minFeatureCount = 1;

The addInterceptFeature Boolean controls whether a category feature exists that models the prevalence of the category in training. The default name of the intercept feature is *&^INTERCEPT%$^&**, and you will see it in the weight vector output if it is being used. By convention, the intercept feature is set to 1.0 for all inputs. The idea is that if a category is just very common or very rare, there should be a feature that captures just this fact, independent of other features that might not be as cleanly distributed. This models the category probability in naïve Bayes in some way, but the logistic regression algorithm will decide how useful it is, as it does with all other features:

boolean addInterceptFeature = true;
boolean noninformativeIntercept = true;

These Booleans control what happens to the intercept feature if it is used. Priors, in the following code, are typically not applied to the intercept feature; this is the result if this parameter is true. Set the Boolean to false, and the prior will be applied to the intercept.

Next is the RegressionPrior instance, which controls how the model is fit. What you need to know is that priors help prevent logistic regression from overfitting the data by pushing coefficients towards 0. There is a non-informative prior that does not do this, with the consequence that if there is a feature that applies to just one category, it will be scaled to infinity, because the model keeps fitting better as the coefficient is increased in the numeric estimation. Priors, in this context, function as a way to not be overconfident in observations about the world.

Another dimension of the RegressionPrior instance is the expected variance of the features. Low variance will push coefficients to zero more aggressively. The prior returned by the static laplace() method tends to work well for NLP problems.
There is a lot going on, but it can be managed without a deep theoretical understanding:

double priorVariance = 2;
RegressionPrior prior
    = RegressionPrior.laplace(priorVariance,
        noninformativeIntercept);

Next, we will control how the algorithm searches for an answer:

AnnealingSchedule annealingSchedule
    = AnnealingSchedule.exponential(0.00025, 0.999);
double minImprovement = 0.000000001;
int minEpochs = 100;
int maxEpochs = 2000;

AnnealingSchedule is best understood by consulting the Javadoc, but what it does is change how much the coefficients are allowed to vary when fitting the model. The minImprovement parameter sets the amount by which the model fit has to improve to not terminate the search because the algorithm has converged. The minEpochs parameter sets a minimal number of iterations, and maxEpochs sets an upper limit if the search does not converge as determined by minImprovement.

Next is some code that allows for basic reporting/logging. LogLevel.INFO will report a great deal of information about the progress of the classifier as it tries to converge:

PrintWriter progressWriter = new PrintWriter(System.out, true);
progressWriter.println("Reading data.");
Reporter reporter = Reporters.writer(progressWriter);
reporter.setLevel(LogLevel.INFO);

Here ends the Getting ready section of one of our most complex classes—next, we will train and run the classifier.

How to do it...

It has been a bit of work setting up to train and run this class. We will just go through the steps to get it up and running:

1. Note that there is a more complex 14-argument train method as well, one that extends configurability. This is the 10-argument version:

    LogisticRegressionClassifier<CharSequence> classifier
        = LogisticRegressionClassifier.<CharSequence>train(
            corpus,
            featureExtractor,
            minFeatureCount,
            addInterceptFeature,
            prior,
            annealingSchedule,
            minImprovement,
            minEpochs,
            maxEpochs,
            reporter);

2. The train() method, depending on the LogLevel constant, will produce anything from nothing with LogLevel.NONE to prodigious output with LogLevel.ALL.

3. While we are not going to use it, we show how to serialize the trained model to disk:

    AbstractExternalizable.compileTo(classifier,
        new File("models/myModel.LogisticRegression"));

4. Once trained, we will apply the standard classification loop with:

    Util.consoleInputPrintClassification(classifier);

5. Run the preceding code in the IDE of your choice or use the command-line command:

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter3.TrainAndRunLogReg

The result is a big dump of information about the training:

Reading data.
:00 Feature Extractor class=class com.aliasi.tokenizer.TokenFeatureExtractor
:00 min feature count=1
:00 Extracting Training Data
:00 Cold start
:00 Regression callback handler=null
:00 Logistic Regression Estimation
:00 Monitoring convergence=true
:00 Number of dimensions=233
:00 Number of Outcomes=2
:00 Number of Parameters=233
:00 Number of Training Instances=21
:00 Prior=LaplaceRegressionPrior(Variance=2.0,noninformativeIntercept=true)
:00 Annealing Schedule=Exponential(initialLearningRate=2.5E-4,base=0.999)
:00 Minimum Epochs=100
:00 Maximum Epochs=2000
:00 Minimum Improvement Per Period=1.0E-9
:00 Has Informative Prior=true
:00 epoch=    0 lr=0.000250000 ll=   -20.9648 lp=  -232.0139 llp=  -252.9787 llp*=  -252.9787
:00 epoch=    1 lr=0.000249750 ll=   -20.9406 lp=  -232.0195 llp=  -252.9602 llp*=  -252.9602

The epoch reporting goes on until either the number of epochs is met or the search converges.
In the following case, the number of epochs was met:

:00 epoch= 1998 lr=0.000033868 ll=   -15.4568 lp=  -233.8125 llp=  -249.2693 llp*=  -249.2693
:00 epoch= 1999 lr=0.000033834 ll=   -15.4565 lp=  -233.8127 llp=  -249.2692 llp*=  -249.2692

Now, we can play with the classifier a bit:

Type a string to be classified. Empty string to quit.
I luv Disney
Rank  Category  Score               P(Category|Input)
0=e             0.626898085027528   0.626898085027528
1=n             0.373101914972472   0.373101914972472

This should look familiar; it is exactly the same result as the worked example at the start. That's it! You have trained up and used the world's most relevant industrial classifier. However, there's a lot more to harnessing the power of this beast.

Summary

In this article, we learned how to do logistic regression.

Resources for Article:

Further resources on this subject:

- Installing NumPy, SciPy, matplotlib, and IPython [Article]
- Introspecting Maya, Python, and PyMEL [Article]
- Understanding the Python regex engine [Article]

Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem. (For more resources related to this topic, see here.)

Predicting the output

The past marketing campaign targeted part of the customer base. Among another 1,000 clients, how do we identify the 100 that are keenest to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models for determining the scores, and we use two well-performing techniques, as follows:

- Logistic regression: This is a variation of linear regression used to predict a binary output
- Random forest: This is an ensemble based on decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. There are cross-validation methods that allow us to estimate model accuracy. Starting from that, we can measure the accuracy of both options and pick the one performing better. After choosing the most appropriate machine learning algorithm, we could optimize it using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.

These are the steps to build and evaluate the models:

1. Load the randomForest package containing the random forest algorithm:

    library('randomForest')

2. Define the formula specifying the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

    arrayFeatures <- names(dtBank)
    arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
    formulaAll <- paste('output', '~')
    formulaAll <- paste(formulaAll, arrayFeatures[1])
    for(nameFeature in arrayFeatures[-1]){
      formulaAll <- paste(formulaAll, '+', nameFeature)
    }
    formulaAll <- formula(formulaAll)

3. Initialize the table containing all the testing sets:

    dtTestBinded <- data.table()

4. Define the number of iterations:

    nIter <- 10

5. Start a for loop:

    for(iIter in 1:nIter){

6. Define the training and the test datasets:

    indexTrain <- sample(x = c(TRUE, FALSE),
                         size = nrow(dtBank),
                         replace = T,
                         prob = c(0.8, 0.2))
    dtTrain <- dtBank[indexTrain]
    dtTest <- dtBank[!indexTrain]

7. Select a subset from the test set in such a way that we have the same number of rows with output == 0 and output == 1. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 of its rows. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:

    dtTest1 <- dtTest[output == 1]
    dtTest0 <- dtTest[output == 0]
    n0 <- nrow(dtTest0)
    n1 <- nrow(dtTest1)
    dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
    dtTest <- rbind(dtTest0, dtTest1)

8. Build the random forest model using randomForest. The formula argument defines the relationship between the variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left at their defaults:

    modelRf <- randomForest(formula = formulaAll,
                            data = dtTrain)

9. Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLM). GLMs are a generalization of linear regression and allow us to define a link function that connects the linear predictor with the outputs.
The input is the same as for the random forest, with the addition of family = binomial(logit), defining that the regression is logistic:

    modelLr <- glm(formula = formulaAll,
                   data = dtTrain,
                   family = binomial(logit))

10. Predict the output of the random forest. The function is predict and its main arguments are object, defining the model, and newdata, defining the test set, as follows:

    dtTest[, outputRf := predict(object = modelRf, newdata = dtTest, type = 'response')]

11. Predict the output of the logistic regression, using predict as for the random forest. The other argument is type = 'response', and it is necessary in the case of logistic regression:

    dtTest[, outputLr := predict(object = modelLr, newdata = dtTest, type = 'response')]

12. Add the new test set to dtTestBinded:

    dtTestBinded <- rbind(dtTestBinded, dtTest)

13. End the for loop:

    }

We built dtTestBinded, which contains the output column defining which clients subscribed and the scores estimated by the models. Comparing the scores with the real output, we can validate the model performances.

In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function that builds the chart, following the given steps:

1. Define the function and its input, which includes the data table and the name of the score column:

    plotDistributions <- function(dtTestBinded, colPred){

2. Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth, which is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail:

    densityLr0 <- dtTestBinded[
      output == 0,
      density(get(colPred), adjust = 0.5)
      ]

3. Compute the distribution density for the clients that subscribed:

    densityLr1 <- dtTestBinded[
      output == 1,
      density(get(colPred), adjust = 0.5)
      ]

4. Define the colors in the chart using rgb. The colors are transparent red and transparent blue:

    col0 <- rgb(1, 0, 0, 0.3)
    col1 <- rgb(0, 0, 1, 0.3)

5. Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart:

    plot(densityLr0, xlim = c(0, 1), main = 'density')
    polygon(densityLr0, col = col0, border = 'black')

6. Add the clients that subscribed to the chart:

    polygon(densityLr1, col = col1, border = 'black')

7. Add the legend:

    legend(
      'top',
      c('0', '1'),
      pch = 16,
      col = c(col0, col1))

8. End the function:

    return()
    }

Now, we can use plotDistributions on the random forest output:

    par(mfrow = c(1, 1))
    plotDistributions(dtTestBinded, 'outputRf')

(Figure: density of the random forest scores for the non-subscribing and subscribing clients)

The x-axis represents the score and the y-axis represents the density, which is proportional to the number of clients with similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed in the sense that the density at each score is the average over the data with similar scores. The red and blue areas represent the non-subscribing and subscribing clients respectively. As can easily be noticed, the violet area comes from the overlapping of the two curves.
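Before interpreting the curves in detail, the overlap can also be quantified without a chart. The following is a minimal sketch using the columns already defined in dtTestBinded; the 0.3 cut-off is an arbitrary assumption, not a value prescribed by the article:

    # count test clients of each real output above/below a chosen score cut-off
    dtTestBinded[, .N, by = .(output, aboveCutOff = outputRf > 0.3)]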
For each score, we can identify which density is higher. If the highest curve is red, the client will be more likely to subscribe, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients have a more spread score, although higher, and their peak is around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe starting from their scores. However, if the marketing campaign targets all customers with a score higher than 0.3, they will likely belong to the blue cluster. In conclusion, using random forest, we are able to identify a small set of customers that will subscribe very likely. Summary In this article, you learned how to predict your output using proper machine learning techniques. Resources for Article: Further resources on this subject: Using R for Statistics, Research, and Graphics [article] Machine Learning in Bioinformatics [article] Learning Data Analytics with R and Hadoop [article]

article-image-no-nodistinct
Packt
25 Nov 2014
4 min read

No to nodistinct

This article is written by Stephen Redmond, the author of Mastering QlikView. There is a great skill in creating the right expression to calculate the right answer. Being able to do this in all circumstances relies on having a good knowledge of creating advanced expressions. Of course, the best path to mastery in this subject is actually getting out and doing it, but there is a great argument here for regularly practicing with dummy or test datasets. (For more resources related to this topic, see here.)

When presented with a problem that needs to be solved, all the QlikView masters will not necessarily know immediately how to answer it. What they will have though is a very good idea of where to start, that is, what to try and what not to try. This is what I hope to impart to you here. Knowing how to create many advanced expressions will arm you to know where to apply them, and where not to apply them. This is one area of QlikView that is alien to many people. For some reason, they fear the whole idea of concepts such as Aggr. However, the reality is that these concepts are actually very simple and supremely logical. Once you get your head around them, you will wonder what all the fuss was about.

No to nodistinct

The Aggr function has an optional clause: the possibility of stating that the aggregation will be either distinct or nodistinct. The default option is distinct, and as such, it is rarely ever stated. In this default operation, the aggregation will only produce distinct results for every combination of dimensions, just as you would expect from a normal chart or straight table.

The nodistinct option only makes sense within a chart, one that has more dimensions than are in the Aggr statement. In this case, the granularity of the chart is lower than the granularity of Aggr, and therefore, QlikView will only calculate that Aggr for the first occurrence of the lower granularity dimensions and will return null for the other rows. If we specify nodistinct, the same result will be calculated across all of the lower granularity dimensions.

This can be difficult to understand without seeing an example, so let's look at a common use case for this option. We will start with a dataset:

    ProductSales:
    Load * Inline [
    Product, Territory, Year, Sales
    Product A, Territory A, 2013, 100
    Product B, Territory A, 2013, 110
    Product A, Territory B, 2013, 120
    Product B, Territory B, 2013, 130
    Product A, Territory A, 2014, 140
    Product B, Territory A, 2014, 150
    Product A, Territory B, 2014, 160
    Product B, Territory B, 2014, 170
    ];

We will build a report from this data using a pivot table: Now, we want to bring the value in the Total column into a new column under each year, perhaps to calculate a percentage for each year.
We might think that, because the total is the sum for each Product and Territory, we might use an Aggr in the following manner: Sum(Aggr(Sum(Sales), Product, Territory)) However, as stated previously, because the chart includes an additional dimension (Year) than Aggr, the expression will only be calculated for the first occurrence of each of the lower granularity dimensions (in this case, for Year = 2013): The commonly suggested fix for this is to use Aggr without Sum and with nodistinct as shown: Aggr(NoDistinct Sum(Sales), Product, Territory) This will allow the Aggr expression to be calculated across all the Year dimension values, and at first, it will appear to solve the problem: The problem occurs when we decide to have a total row on this chart: As there is no aggregation function surrounding Aggr, it does not total correctly at the Product or Territory dimensions. We can't add an aggregation function, such as Sum, because it will break one of the other totals. However, there is something different that we can do; something that doesn't involve Aggr at all! We can use our old friend Total: Sum(Total<Product, Territory> Sales) This will calculate correctly at all the levels: There might be other use cases for using a nodistinct clause in Aggr, but they should be reviewed to see whether a simpler Total will work instead. Summary We discussed an important function, the Aggr function. We now know that the Aggr function is extremely useful, but we don't need to apply it in all circumstances where we have vertical calculations. Resources for Article: Further resources on this subject: Common QlikView script errors [article] Introducing QlikView elements [article] Creating sheet objects and starting new list using Qlikview 11 [article]

article-image-understanding-hbase-ecosystem
Packt
24 Nov 2014
11 min read

Understanding the HBase Ecosystem

This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.)

HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop filesystem, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a flexible, scalable, multidimensional spreadsheet into which any structure of data fits, with on-the-fly addition of new column fields and no need for a fixed column structure to be defined before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop distributed filesystem and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a more flexible schema.

HBase is modeled on Google BigTable, a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structured data, which can take advantage of most distributed filesystems (typically, the Hadoop Distributed File System known as HDFS).

The following is key information about HBase and its features:

Developed by: Apache
Written in: Java
Type: Column oriented
License: Apache License
Lacking features of relational databases: SQL support; relations; primary, foreign, and unique key constraints; normalization
Website: http://hbase.apache.org
Distributions: Apache, Cloudera
Download link: http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists: the user list: [email protected]; the developer list: [email protected]
Blog: http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout of HBase on top of Hadoop: There is more than one ZooKeeper in the setup, which provides high availability of the master status; a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run. There can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with their associated MemStore.

HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between client applications and HRegionServers. It is also responsible for monitoring and recording metadata changes and management. Slaves are called HRegionServers, which serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables, over which the table data is distributed. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster. Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run on, or are co-hosted with, the Hadoop DataNodes.
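To make the client-side picture concrete, the following is a minimal sketch of how an application might write and read a row through this architecture, using the third-party Python happybase client over HBase's Thrift gateway (the Thrift gateway is mentioned later in this article). The table name, column family, host, and port here are illustrative assumptions rather than anything prescribed by the book; the point is that the client library, not your code, resolves which RegionServer hosts the row's region.

    import happybase

    # connect to the Thrift gateway; host and port are deployment-specific assumptions
    connection = happybase.Connection('localhost', port=9090)

    table = connection.table('customers')          # hypothetical table with a 'cf' column family
    table.put(b'row-001', {b'cf:name': b'Alice'})  # the write is routed to the RegionServer hosting this row's region
    row = table.row(b'row-001')                    # the read is routed the same way
    print(row)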
Comparing architectural differences between RDBMS and HBase

Let's list the major differences between relational databases and HBase:

Storage unit: relational databases use tables as databases; HBase uses regions as databases.
Filesystems: relational databases support FAT, NTFS, and EXT; HBase uses HDFS.
Logging: relational databases use commit logs; HBase uses Write-Ahead Logs (WAL).
Reference system: relational databases use a coordinate system; HBase uses ZooKeeper.
Keys: relational databases use the primary key; HBase uses the row key.
Data distribution: relational databases support partitioning; HBase supports sharding.
Structure: relational databases use rows, columns, and cells; HBase uses rows, column families, columns, and cells.

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocations and replication. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication.

Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers, and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions into smaller subregions once they reach a threshold size, to reduce I/O time and overhead.

Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. However, HDFS is the most common choice because it provides data distribution and high availability out of the box; we just need to set some configuration parameters so that HBase can communicate with Hadoop.

Real-time, random big data access: HBase internally uses a log-structured merge-tree (LSM-tree) as its data storage architecture, which merges smaller files into larger files periodically to reduce disk seeks.

MapReduce: HBase has built-in support for the Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the package org.apache.hadoop.hbase.mapreduce for more details.

Java API for client access: HBase has solid Java API support (client/server) for easy development and programming.

Thrift and a RESTful web service: HBase provides not only Thrift and RESTful gateways but also web service gateways for integrating with and accessing HBase, besides Java code (the HBase Java APIs) for accessing and working with HBase.

Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exports metrics for monitoring purposes with tools such as Ganglia and Nagios.

Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency are supported.

Linear scalability (scale out): Scaling HBase is not scale up but scale out, which means that we don't need to make servers more powerful; we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, the RegionServer starts serving regions on the new node, and the cluster is scaled out; it is as simple as that.

Column oriented: HBase stores each column separately, in contrast with most relational databases, which use row-based storage. So in HBase, columns are stored contiguously and not the rows.
More about row- and column-oriented databases will follow. HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands. Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record. Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data. HBase in the Hadoop ecosystem Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem: HBase can work as a separate entity on the local filesystem (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way. Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key. Data representation in HBase Let's look into the representation of rows and columns in HBase table: An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data. So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it. Hadoop Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules: Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules. 
Hadoop distributed file system: This is the underlying distributed filesystem, which is abstracted on top of the local filesystem and provides high throughput for read and write operations on Hadoop data.

Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.

Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these serve different purposes, and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have the option of more than one NameNode for high availability.
JobTracker: This runs on the NameNode and manages the MapReduce jobs submitted to the cluster.
SecondaryNameNode: This maintains the backup of the metadata present on the NameNode, and also records the filesystem changes.
DataNode: This contains the actual data.
TaskTracker: This performs tasks on the local data, as assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have the ResourceManager instead of the JobTracker, the NodeManager instead of the TaskTrackers, and the YARN framework instead of the simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2:

HDFS daemons: Hadoop 1 has the NameNode, Secondary NameNode, and DataNode; Hadoop 2 has the NameNode (more than one, in active/standby mode), the checkpoint node, and the DataNode.
Processing daemons: Hadoop 1 uses MapReduce v1 with the JobTracker and TaskTracker; Hadoop 2 uses YARN (MRv2) with the ResourceManager, NodeManager, and Application Master.

Comparing HBase with Hadoop

As we now know what HBase and Hadoop are, let's compare HDFS and HBase for better understanding:

Hadoop/HDFS provides a filesystem for distributed storage; HBase provides tabular, column-oriented data storage.
HDFS is optimized for storing huge files with no random read/write of these files; HBase is optimized for tabular data with random read/write access.
HDFS uses flat files; HBase uses key-value pairs of data.
The HDFS data model is not flexible; HBase provides a flexible data model.
HDFS provides a filesystem and a processing framework; HBase provides tabular storage with built-in Hadoop MapReduce support.
HDFS is mostly optimized for write-once, read-many access; HBase is optimized for data that is both read and written many times.

Summary

So in this article, we discussed the introductory aspects of HBase and its features. We have also discussed HBase's components and their place in the HBase ecosystem.

Resources for Article: Further resources on this subject: The HBase's Data Storage [Article] HBase Administration, Performance Tuning [Article] Comparative Study of NoSQL Products [Article]
article-image-plot-function
Packt
18 Nov 2014
17 min read

The plot function

In this article, L. Felipe Martins, the author of the book IPython Notebook Essentials, discusses the plot() function, which is an important part of matplotlib, a Python library for the production of publication-quality graphs. (For more resources related to this topic, see here.)

The plot() function is the workhorse of the matplotlib library. In this section, we will explore the line-plotting and formatting capabilities included in this function. To make things a bit more concrete, let's consider the formula for logistic growth, as follows:

N(t) = a / (b + c e^(-rt))

This model is frequently used to represent growth that shows an initial exponential phase, and then is eventually limited by some factor. Examples are a population in an environment with limited resources, and new products and/or technological innovations, which initially attract a small and quickly growing market but eventually reach a saturation point.

A common strategy to understand a mathematical model is to investigate how it changes as the parameters defining it are modified. Let's say we want to see what happens to the shape of the curve when the parameter b changes. To be able to do what we want more efficiently, we are going to use a function factory. This way, we can quickly create logistic models with arbitrary values for r, a, b, and c. Run the following code in a cell:

    def make_logistic(r, a, b, c):
        def f_logistic(t):
            return a / (b + c * exp(-r * t))
        return f_logistic

The function factory pattern takes advantage of the fact that functions are first-class objects in Python. This means that functions can be treated as regular objects: they can be assigned to variables, stored in lists or dictionaries, and play the role of arguments and/or return values in other functions. In our example, we define the make_logistic() function, whose output is itself a Python function. Notice how the f_logistic() function is defined inside the body of make_logistic() and then returned in the last line.

Let's now use the function factory to create three functions representing logistic curves, as follows:

    r = 0.15
    a = 20.0
    c = 15.0
    b1, b2, b3 = 2.0, 3.0, 4.0
    logistic1 = make_logistic(r, a, b1, c)
    logistic2 = make_logistic(r, a, b2, c)
    logistic3 = make_logistic(r, a, b3, c)

In the preceding code, we first fix the values of r, a, and c, and define three logistic curves for different values of b. The important point to notice is that logistic1, logistic2, and logistic3 are functions. So, for example, we can use logistic1(2.5) to compute the value of the first logistic curve at time 2.5. We can now plot the functions using the following code:

    tmax = 40
    tvalues = linspace(0, tmax, 300)
    plot(tvalues, logistic1(tvalues))
    plot(tvalues, logistic2(tvalues))
    plot(tvalues, logistic3(tvalues))

The first line in the preceding code sets the maximum time value, tmax, to 40. Then, we define the set of times at which we want the functions evaluated with the assignment, as follows:

    tvalues = linspace(0, tmax, 300)

The linspace() function is very convenient for generating points for plotting. The preceding code creates an array of 300 equally spaced points in the interval from 0 to tmax. Note that, contrary to other functions, such as range() and arange(), the right endpoint of the interval is included by default. (To exclude the right endpoint, use the endpoint=False option.) After defining the array of time values, the plot() function is called to graph the curves.
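As a quick, illustrative check of the endpoint behaviour just mentioned (this snippet is ours, not from the book), you can compare linspace() and arange() directly in the same pylab environment:

    linspace(0, 1, 5)                   # array([ 0.  , 0.25, 0.5 , 0.75, 1.  ]) -- right endpoint included
    linspace(0, 1, 5, endpoint=False)   # array([ 0. , 0.2, 0.4, 0.6, 0.8])      -- right endpoint excluded
    arange(0, 1, 0.25)                  # array([ 0.  , 0.25, 0.5 , 0.75])       -- right endpoint excluded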
In its most basic form, it plots a single curve in a default color and line style. In this usage, the two arguments are two arrays. The first array gives the horizontal coordinates of the points being plotted, and the second array gives the vertical coordinates. A typical example will be the following function call: plot(x,y) The variables x and y must refer to NumPy arrays (or any Python iterable values that can be converted into an array) and must have the same dimensions. The points plotted have coordinates as follows: x[0], y[0] x[1], y[1] x[2], y[2] … The preceding command will produce the following plot, displaying the three logistic curves: You may have noticed that before the graph is displayed, there is a line of text output that looks like the following: [<matplotlib.lines.Line2D at 0x7b57c50>] This is the return value of the last call to the plot() function, which is a list (or with a single element) of objects of the Line2D type. One way to prevent the output from being shown is to enter None as the last row in the cell. Alternatively, we can assign the return value of the last call in the cell to a dummy variable: _dummy_ = plot(tvalues, logistic3(tvalues)) The plot() function supports plotting several curves in the same function call. We need to change the contents of the cell that are shown in the following code and run it again: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues),      tvalues, logistic2(tvalues),      tvalues, logistic3(tvalues)) This form saves some typing but turns out to be a little less flexible when it comes to customizing line options. Notice that the text output produced now is a list with three elements: [<matplotlib.lines.Line2D at 0x9bb6cc0>, <matplotlib.lines.Line2D at 0x9bb6ef0>, <matplotlib.lines.Line2D at 0x9bb9518>] This output can be useful in some instances. For now, we will stick with using one call to plot() for each curve, since it produces code that is clearer and more flexible. Let's now change the line options in the plot and set the plot bounds. Change the contents of the cell to read as follows: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-') plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':') plot(tvalues, logistic3(tvalues),      linewidth=3.5, color=(0.0, 0.0, 0.5), linestyle='--') axis([0, tmax, 0, 11.]) None Running the preceding command lines will produce the following plots: The options set in the preceding code are as follows: The first curve is plotted with a line width of 1.5, with the HTML color of DarkGreen, and a filled-line style The second curve is plotted with a line width of 2.0, colored with the RGB value given by the hexadecimal string '#8B0000', and a dotted-line style The third curve is plotted with a line width of 3.0, colored with the RGB components, (0.0, 0.0, 0.5), and a dashed-line style Notice that there are different ways of specifying a fixed color: a HTML color name, a hexadecimal string, or a tuple of floating-point values. In the last case, the entries in the tuple represent the intensity of the red, green, and blue colors, respectively, and must be floating-point values between 0.0 and 1.0. A complete list of HTML name colors can be found at http://www.w3schools.com/html/html_colornames.asp. Editor's Tip: For more insights on colors, check out https://dgtl.link/colors Line styles are specified by a symbolic string. 
The allowed values are shown in the following list:

'-': Solid (the default)
'--': Dashed
':': Dotted
'-.': Dash-dot
'None', ' ', or '': Not displayed

After the calls to plot(), we set the graph bounds with the function call:

    axis([0, tmax, 0, 11.])

The argument to axis() is a four-element list that specifies, in this order, the minimum and maximum values of the horizontal coordinates, and the minimum and maximum values of the vertical coordinates. It may seem non-intuitive that the bounds for the variables are set after the plots are drawn. In the interactive mode, matplotlib remembers the state of the graph being constructed, and graphics objects are updated in the background after each command is issued. The graph is only rendered when all computations in the cell are done, so that all previously specified options take effect. Note that starting a new cell clears all the graph data. This interactive behavior is part of the matplotlib.pyplot module, which is one of the components imported by pylab.

Besides drawing a line connecting the data points, it is also possible to draw markers at specified points. Change the graphing commands as indicated in the following code snippet, and then run the cell again:

    plot(tvalues, logistic1(tvalues),
         linewidth=1.5, color='DarkGreen', linestyle='-',
         marker='o', markevery=50, markerfacecolor='GreenYellow',
         markersize=10.0)
    plot(tvalues, logistic2(tvalues),
         linewidth=2.0, color='#8B0000', linestyle=':',
         marker='s', markevery=50, markerfacecolor='Salmon',
         markersize=10.0)
    plot(tvalues, logistic3(tvalues),
         linewidth=2.0, color=(0.0, 0.0, 0.5), linestyle='--',
         marker = '*', markevery=50, markerfacecolor='SkyBlue',
         markersize=12.0)
    axis([0, tmax, 0, 11.])
    None

Now, the graph will look as shown in the following figure: The only difference from the previous code is that now we added options to draw markers. The following are the options we use:

The marker option specifies the shape of the marker. Shapes are given as symbolic strings. In the preceding examples, we use 'o' for a circular marker, 's' for a square, and '*' for a star. A complete list of available markers can be found at http://matplotlib.org/api/markers_api.html#module-matplotlib.markers.
The markevery option specifies a stride within the data points for the placement of markers. In our example, we place a marker after every 50 data points.
The markerfacecolor option specifies the fill color of the marker.
The markersize option specifies the size of the marker. The size is given in points.

There are a large number of other options that can be applied to lines in matplotlib. A complete list is available at http://matplotlib.org/api/artist_api.html#module-matplotlib.lines.
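Returning briefly to the color specifications discussed earlier, here is a small illustrative snippet of our own (not from the book) showing three equivalent ways of requesting the same color in the pylab environment used throughout this article; the HTML name DarkRed, the hexadecimal string '#8B0000', and the RGB tuple built from (139, 0, 0) all describe the same color:

    plot(tvalues, logistic1(tvalues), color='DarkRed')              # HTML color name
    plot(tvalues, logistic2(tvalues), color='#8B0000')              # hexadecimal string
    plot(tvalues, logistic3(tvalues), color=(139/255., 0.0, 0.0))   # RGB tuple of floats in [0, 1]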
Each of the format specifiers is then associated sequentially with one of the data arguments of the method. A full documentation covering the details of string formatting is available at https://docs.python.org/2/library/string.html. The axis labels are set in the calls: xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') As in the title() functions, the xlabel() and ylabel() functions accept any Python string. Note that in the '$t$' and '$N(t)=a/(b+ce^{-rt}$' strings, we use LaTeX to format the mathematical formulas. This is indicated by the dollar signs, $...$, in the string. After the addition of a title and labels, our graph looks like the following: Next, we need a way to identify each of the curves in the picture. One way to do that is to use a legend, which is indicated as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)]) The legend() function accepts a list of strings. Each string is associated with a curve in the order they are added to the plot. Notice that we are again using formatted strings. Unfortunately, the preceding code does not produce great results. The legend, by default, is placed in the top-right corner of the plot, which, in this case, hides part of the graph. This is easily fixed using the loc option in the legend function, as shown in the following code: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], loc='upper left') Running this code, we obtain the final version of our logistic growth plot, as follows: The legend location can be any of the strings: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. It is also possible to specify the location of the legend precisely with the bbox_to_anchor option. To see how this works, modify the code for the legend as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], bbox_to_anchor=(0.9,0.35)) Notice that the bbox_to_anchor option, by default, uses a coordinate system that is not the same as the one we specified for the plot. The x and y coordinates of the box in the preceding example are interpreted as a fraction of the width and height, respectively, of the whole figure. A little trial-and-error is necessary to place the legend box precisely where we want it. Note that the legend box can be placed outside the plot area. For example, try the coordinates (1.32,1.02). The legend() function is quite flexible and has quite a few other options that are documented at http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend. Text and annotations In this subsection, we will show how to add annotations to plots in matplotlib. We will build a plot demonstrating the fact that the tangent to a curve must be horizontal at the highest and lowest points. We start by defining the function associated with the curve and the set of values at which we want the curve to be plotted, which is shown in the following code: f = lambda x: (x**3 - 6*x**2 + 9*x + 3) / (1 + 0.25*x**2) xvalues = linspace(0, 5, 200) The first line in the preceding code uses a lambda expression to define the f() function. We use this approach here because the formula for the function is a simple, one-line expression. 
The general form of a lambda expression is as follows: lambda <arguments> : <return expression> This expression by itself creates an anonymous function that can be used in any place that a function object is expected. Note that the return value must be a single expression and cannot contain any statements. The formula for the function may seem unusual, but it was chosen by trial-and-error and a little bit of calculus so that it produces a nice graph in the interval from 0 to 5. The xvalues array is defined to contain 200 equally spaced points on this interval. Let's create an initial plot of our curve, as shown in the following code: plot(xvalues, f(xvalues), lw=2, color='FireBrick') axis([0, 5, -1, 8]) grid() xlabel('$x$') ylabel('$f(x)$') title('Extreme values of a function') None # Prevent text output Most of the code in this segment is explained in the previous section. The only new bit is that we use the grid() function to draw a grid. Used with no arguments, the grid coincides with the tick marks on the plot. As everything else in matplotlib, grids are highly customizable. Check the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.grid. When the preceding code is executed, the following plot is produced: Note that the curve has a highest point (maximum) and a lowest point (minimum). These are collectively called the extreme values of the function (on the displayed interval, this function actually grows without bounds as x becomes large). We would like to locate these on the plot with annotations. We will first store the relevant points as follows: x_min = 3.213 f_min = f(x_min) x_max = 0.698 f_max = f(x_max) p_min = array([x_min, f_min]) p_max = array([x_max, f_max]) print p_min print p_max The variables, x_min and f_min, are defined to be (approximately) the coordinates of the lowest point in the graph. Analogously, x_max and f_max represent the highest point. Don't be concerned with how these points were found. For the purposes of graphing, even a rough approximation by trial-and-error would suffice. Now, add the following code to the cell that draws the plot, right below the title() command, as shown in the following code: arrow_props = dict(facecolor='DimGray', width=3, shrink=0.05,              headwidth=7) delta = array([0.1, 0.1]) offset = array([1.0, .85]) annotate('Maximum', xy=p_max+delta, xytext=p_max+offset,          arrowprops=arrow_props, verticalalignment='bottom',          horizontalalignment='left', fontsize=13) annotate('Minimum', xy=p_min-delta, xytext=p_min-offset,          arrowprops=arrow_props, verticalalignment='top',          horizontalalignment='right', fontsize=13) Run the cell to produce the plot shown in the following diagram: In the code, start by assigning the variables arrow_props, delta, and offset, which will be used to set the arguments in the calls to annotate(). The annotate() function adds a textual annotation to the graph with an optional arrow indicating the point being annotated. The first argument of the function is the text of the annotation. The next two arguments give the locations of the arrow and the text: xy: This is the point being annotated and will correspond to the tip of the arrow. We want this to be the maximum/minimum points, p_min and p_max, but we add/subtract the delta vector so that the tip is a bit removed from the actual point. xytext: This is the point where the text will be placed as well as the base of the arrow. We specify this as offsets from p_min and p_max using the offset vector. 
All other arguments of annotate() are formatting options:

arrowprops: This is a Python dictionary containing the arrow properties. We predefine the dictionary, arrow_props, and use it here. Arrows can be quite sophisticated in matplotlib, and you are directed to the documentation for details.
verticalalignment and horizontalalignment: These specify how the arrow should be aligned with the text.
fontsize: This signifies the size of the text. Text is also highly configurable, and the reader is directed to the documentation for details.

The annotate() function has a huge number of options; for complete details of what is available, users should consult the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.annotate.

We now want to add a comment on what is being demonstrated by the plot by adding an explanatory textbox. Add the following code to the cell right after the calls to annotate():

    bbox_props = dict(boxstyle='round', lw=2, fc='Beige')
    text(2, 6, 'Maximum and minimum points\nhave horizontal tangents',
         bbox=bbox_props, fontsize=12, verticalalignment='top')

The text() function is used to place text at an arbitrary position of the plot. The first two arguments are the position of the textbox, and the third argument is a string containing the text to be displayed. Notice the use of '\n' to indicate a line break. The other arguments are configuration options. The bbox argument is a dictionary with the options for the box. If omitted, the text will be displayed without any surrounding box. In the example code, the box is a rectangle with rounded corners, with a border width of 2 and the face color beige.

As a final detail, let's add the tangent lines at the extreme points. Add the following code:

    plot([x_min-0.75, x_min+0.75], [f_min, f_min],
         color='RoyalBlue', lw=3)
    plot([x_max-0.75, x_max+0.75], [f_max, f_max],
         color='RoyalBlue', lw=3)

Since the tangents are segments of straight lines, we simply give the coordinates of the endpoints. The reason for adding the code for the tangents at the top of the cell is that this causes them to be plotted first, so that the graph of the function is drawn on top of the tangents. This is the final result: The examples we have seen so far only scratch the surface of what is possible with matplotlib. The reader should read the matplotlib documentation for more examples.

Summary

In this article, we learned how to use matplotlib to produce presentation-quality plots. We covered two-dimensional plots and how to set plot options, and annotate and configure plots. You also learned how to add labels, titles, and legends.

Edited on July 27, 2018 to replace a broken reference link.

Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] SciPy for Computational Geometry [Article] Fast Array Operations with NumPy [Article]

article-image-hbases-data-storage
Packt
13 Nov 2014
9 min read

The HBase's Data Storage

In this article by Nishant Garg, author of HBase Essentials, we will look at HBase's data storage from its architectural viewpoint. (For more resources related to this topic, see here.) For most developers or users, the preceding topics are not of much interest, but for an administrator, it really makes sense to understand how the underlying data is stored or replicated within HBase. Administrators are the people who deal with HBase, starting from its installation through cluster management (performance tuning, monitoring, failure, recovery, data security, and so on). Let's start with data storage in HBase first.

Data storage

In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions, and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In the HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage:

HLog (the write-ahead log file, also known as WAL)
HFile (the real data storage file)

In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables. In older versions prior to 0.96.0, HBase had two catalog tables called -ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META table. From version 0.96.0 onwards, the -ROOT- table is removed. The .META table is renamed as hbase:meta, and the location of hbase:meta is now stored in ZooKeeper. The following is the structure of the hbase:meta table.

Key: the region key of the format ([table],[region start key],[region id]). A region with an empty start key is the first region in a table.

The values are as follows:

info:regioninfo (the serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start time of the RegionServer process that contains this region)

When the table is split, two new columns will be created, info:splitA and info:splitB. These columns represent the two newly created regions. The values for these columns are also serialized HRegionInfo instances. Once the split process is complete, the row that contains the old region information is deleted.

In the case of data reading, the client application first connects to ZooKeeper and looks up the location of the hbase:meta table. Next, the client's HTable instance queries the hbase:meta table, finds out the region that contains the rows of interest, and also locates the region server that is serving the identified region. The information about the region and region server is then cached by the client application for future interactions, which avoids repeating the lookup process. If the region is reassigned by the load balancer process or if the region server has expired, a fresh lookup is done on the hbase:meta catalog table to get the new location of the user table region, and the cache is updated accordingly.

At the object level, the HRegionServer class is responsible for creating a connection with the region by creating HRegion objects. This HRegion instance sets up a store instance that has one or more StoreFile instances (wrapped around HFile) and a MemStore. MemStore accumulates the data edits as they happen and buffers them in memory.
This is also important for accessing the recent edits of table data. As shown in the preceding diagram, the HRegionServer instance (the region server) contains the map of HRegion instances (regions) and also has an HLog instance that represents the WAL. There is a single block cache instance at the region-server level, which holds data from all the regions hosted on that region server. A block cache instance is created at region server startup, and it can have an implementation of LruBlockCache, SlabCache, or BucketCache. The block cache also supports multilevel caching; that is, a block cache might have a first-level cache, L1, as LruBlockCache and a second-level cache, L2, as SlabCache or BucketCache. All these cache implementations have their own way of managing memory; for example, LruBlockCache is an on-heap data structure and resides on the JVM heap, whereas the other two implementations also use memory outside of the JVM heap.

HLog (the write-ahead log – WAL)

In the case of writing data, when the client calls HTable.put(Put), the data is first written to the write-ahead log file (which contains the actual data and a sequence number, together represented by the HLogKey class) and also written to the MemStore. Writing data directly into the MemStore can be dangerous, as it is a volatile in-memory buffer and always open to the risk of losing data in the case of a server failure. Once the MemStore is full, its contents are flushed to disk by creating a new HFile on HDFS. While inserting data from the HBase shell, the flush command can be used to write the in-memory (memstore) data to the store files.

If there is a server failure, the WAL can be replayed to recover everything up to the point where the server was prior to the crash. Hence, the WAL guarantees that the data is never lost. Also, as another level of assurance, the actual write-ahead log resides on HDFS, which is a replicated filesystem. Any other server having a replicated copy can open the log.

The HLog class represents the WAL. When an HRegion object is instantiated, the single HLog instance is passed as a parameter to the constructor of HRegion. In the case of an update operation, it saves the data directly to the shared WAL and also keeps track of the changes by incrementing the sequence number for each edit.

The WAL uses a Hadoop SequenceFile, which stores records as sets of key-value pairs. Here, the HLogKey instance represents the key, and the key-value represents the rowkey, column family, column qualifier, timestamp, type, and value, along with the region and table name where the data needs to be stored. Also, the structure starts with two fixed-length numbers that indicate the size of the key and the size of the value. The following diagram shows the structure of a key-value pair:

The WALEdit class instance takes care of atomicity at the log level by wrapping each update. For example, in the case of a multicolumn update for a row, each column is represented as a separate KeyValue instance. If the server fails after writing only a few of the columns to the WAL, it ends up with only a half-persisted row, and the remaining updates are not persisted. Atomicity is guaranteed by wrapping all updates that comprise multiple columns into a single WALEdit instance and writing it in a single operation. For durability, a log writer's sync() method is called, which gets the acknowledgement from the low-level filesystem on each update.
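To make the length-prefixed layout described above more concrete, here is a small illustrative Python sketch of a simplified key-value record: two fixed-length integers first (the key length and the value length), followed by the key bytes and the value bytes. This is our own simplified model for illustration only, not HBase's actual on-disk code; the real KeyValue carries the row key, column family, qualifier, timestamp, and type in a specific binary layout, both in the WAL and in HFiles.

    import struct

    def pack_keyvalue(key: bytes, value: bytes) -> bytes:
        # two fixed-length big-endian integers, then the key and value payloads
        return struct.pack('>II', len(key), len(value)) + key + value

    def unpack_keyvalue(buf: bytes):
        key_len, value_len = struct.unpack_from('>II', buf, 0)
        start = struct.calcsize('>II')
        key = buf[start:start + key_len]
        value = buf[start + key_len:start + key_len + value_len]
        return key, value

    record = pack_keyvalue(b'row1/cf:qualifier/ts', b'some cell value')
    print(unpack_keyvalue(record))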
This method also takes care of writing the WAL to the replication servers (from one datanode to another). The log flush time can be set to as low as you want, or even be kept in sync for every edit to ensure high durability but at the cost of performance. To take care of the size of the write ahead log file, the LogRoller instance runs as a background thread and takes care of rolling log files at certain intervals (the default is 60 minutes). Rolling of the log file can also be controlled based on the size and hbase.regionserver.logroll.multiplier. It rotates the log file when it becomes 90 percent of the block size, if set to 0.9. HFile (the real data storage file) HFile represents the real data storage file. The files contain a variable number of data blocks and fixed number of file info blocks and trailer blocks. The index blocks records the offsets of the data and meta blocks. Each data block contains a magic header and a number of serialized KeyValue instances. The default size of the block is 64 KB and can be as large as the block size. Hence, the default block size for files in HDFS is 64 MB, which is 1,024 times the HFile default block size but there is no correlation between these two blocks. Each key-value in the HFile is represented as a low-level byte array. Within the HBase root directory, we have different files available at different levels. Write-ahead log files represented by the HLog instances are created in a directory called WALs under the root directory defined by the hbase.rootdir property in hbase-site.xml. This WALs directory also contains a subdirectory for each HRegionServer. In each subdirectory, there are several write-ahead log files (because of log rotation). All regions from that region server share the same HLog files. In HBase, every table also has its own directory created under the data/default directory. This data/default directory is located under the root directory defined by the hbase.rootdir property in hbase-site.xml. Each table directory contains a file called .tableinfo within the .tabledesc folder. This .tableinfo file stores the metadata information about the table, such as table and column family schemas, and is represented as the serialized HTableDescriptor class. Each table directory also has a separate directory for every region comprising the table, and the name of this directory is created using the MD5 hash portion of a region name. The region directory also has a .regioninfo file that contains the serialized information of the HRegionInfo instance for the given region. Once the region exceeds the maximum configured region size, it splits and a matching split directory is created within the region directory. This size is configured using the hbase.hregion.max.filesize property or the configuration done at the column-family level using the HColumnDescriptor instance. In the case of multiple flushes by the MemStore, the number of files might get increased on this disk. The compaction process running in the background combines the files to the largest configured file size and also triggers region split. Summary In this article, we have learned about the internals of HBase and how it stores the data. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Advanced Hadoop MapReduce Administration [Article] HBase Administration, Performance Tuning [Article]

article-image-postmodel-workflow
Packt
04 Nov 2014
23 min read

Postmodel Workflow

 This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes: K-fold cross validation Automatic cross validation Cross validation with ShuffleSplit Stratified k-fold Poor man's grid search Brute force grid search Using dummy estimators to compare results (For more resources related to this topic, see here.) Even though by design the articles are unordered, you could argue by virtue of the art of data science, we've saved the best for last. For the most part, each recipe within this article is applicable to the various models we've worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the "best" model. We'll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting. K-fold cross validation In this recipe, we'll create, quite possibly, the most important post-model validation exercise—cross validation. We'll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes. Getting ready We'll create some data and then fit a classifier on the different folds. It's probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters. How to do it... First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset: >>> N = 1000>>> holdout = 200>>> from sklearn.datasets import make_regression>>> X, y = make_regression(1000, shuffle=True) Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would: >>> X_h, y_h = X[:holdout], y[:holdout]>>> X_t, y_t = X[holdout:], y[holdout:]>>> from sklearn.cross_validation import KFold K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It's assumed to be True. Let's create the cross validation object: >>> kfold = KFold(len(y_t), n_folds=4) Now, we can iterate through the k-fold object: >>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(kfold):       print output_string.format(i, len(y_t[train]),       len(y_t[test]))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Each iteration should return the same split size. How it works... It's probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t). From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it's useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures. 
We're going to mix it up and use pandas for this part: >>> import numpy as np>>> import pandas as pd>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)>>> measurements = pd.DataFrame({'patient_id': patients,                   'ys': np.random.normal(0, 1, 800)}) Now that we have the data, we only want to hold out certain customers instead of data points: >>> custids = np.unique(measurements.patient_id)>>> customer_kfold = KFold(custids.size, n_folds=4)>>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(customer_kfold):       train_cust_ids = custids[train]       training = measurements[measurements.patient_id.isin(                 train_cust_ids)]       testing = measurements[~measurements.patient_id.isin(                 train_cust_ids)]       print output_string.format(i, len(training), len(testing))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Automatic cross validation We've looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance. Getting ready First, we'll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it'll be a random forest. We'll then create a dataset and use the cross validation functions. How to do it... First import the ensemble module and we'll get started: >>> from sklearn import ensemble>>> rf = ensemble.RandomForestRegressor(max_features='auto') Okay, so now, let's create some regression data: >>> from sklearn import datasets>>> X, y = datasets.make_regression(10000, 10) Now that we have the data, we can import the cross_validation module and get access to the functions we'll use: >>> from sklearn import cross_validation>>> scores = cross_validation.cross_val_score(rf, X, y)>>> print scores[ 0.86823874 0.86763225 0.86986129] How it works... For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel. We can activate verbose mode play by play: >>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)[CV] no parameters to be set[CV] no parameters to be set, score=0.872866 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.873679 - 0.6s[CV] no parameters to be set[CV] no parameters to be set, score=0.878018 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.871598 - 0.6s[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs. It's also worth knowing that we can score our function predicated on which kind of model we're trying to fit. Cross validation with ShuffleSplit ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified. Getting ready ShuffleSplit is another cross validation technique that is very simple. We'll specify the total elements in the dataset, and it will take care of the rest. We'll walk through an example of estimating the mean of a univariate dataset. 
This is somewhat similar to resampling, but it'll illustrate one reason why we want to use cross validation while showing cross validation. How to do it... First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean: >>> import numpy as np>>> true_loc = 1000>>> true_scale = 10>>> N = 1000>>> dataset = np.random.normal(true_loc, true_scale, N)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');>>> ax.set_title("Histogram of dataset");>>> f.savefig("978-1-78398-948-5_06_06.png") NumPy will give the following output: Now, let's take the first half of the data and guess the mean: >>> from sklearn import cross_validation>>> holdout_set = dataset[:500]>>> fitting_set = dataset[500:]>>> estimate = fitting_set[:N/2].mean()>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.set_title("True Mean vs Regular Estimate")>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.set_xlim(999, 1001)>>> ax.legend()>>> f.savefig("978-1-78398-948-5_06_07.png") We'll get the following output: Now, we can use ShuffleSplit to fit the estimator on several smaller datasets: >>> from sklearn.cross_validation import ShuffleSplit>>> shuffle_split = ShuffleSplit(len(fitting_set))>>> mean_p = []>>> for train, _ in shuffle_split:       mean_p.append(fitting_set[train].mean())       shuf_estimate = np.mean(mean_p)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,             alpha=.65, label='shufflesplit estimate')>>> ax.set_title("All Estimates")>>> ax.set_xlim(999, 1001)>>> ax.legend(loc=3) The output will be as follows: As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate. Stratified k-fold In this recipe, we'll quickly look at stratified k-fold valuation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions. Getting ready We're going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how the class proportions are maintained: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11]) Let's check the overall class weight distribution: >>> y.mean()0.90300000000000002 Roughly, 90.5 percent of the samples are 1, with the balance 0. How to do it... Let's create a stratified k-fold object and iterate it through each fold. We'll measure the proportion of verse that are 1. After that we'll plot the proportion of classes by the split number to see how and if it changes. 
This code will hopefully illustrate how this is beneficial. We'll also plot this code against a basic ShuffleSplit: >>> from sklearn import cross_validation>>> n_folds = 50>>> strat_kfold = cross_validation.StratifiedKFold(y,                 n_folds=n_folds)>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),                 n_iter=n_folds)>>> kfold_y_props = []>>> shuff_y_props = []>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,         shuff_split):        kfold_y_props.append(y[k_train].mean())       shuff_y_props.append(y[s_train].mean()) Now, let's plot the proportions over each fold: >>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",           color='k')>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",           color='k', ls='--')>>> ax.set_title("Comparing class proportions.")>>> ax.legend(loc='best') The output will be as follows: We can see that the proportion of each fold for stratified k-fold is stable across folds. How it works... Stratified k-fold works by taking the y value. First, getting the overall proportion of the classes, then intelligently splitting the training and test set into the proportions. This will generalize to multiple labels: >>> import numpy as np>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],                   size=1000)>>> import itertools as it>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):       print np.bincount(three_classes[train])[ 0 90 314 395][ 0 90 314 395][ 0 90 314 395][ 0 91 315 395][ 0 91 315 396] As we can see, we got roughly the sample sizes of each class for our training and testing proportions. Poor man's grid search In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization. Getting ready In this recipe, we will perform the following tasks: Design a basic search grid in the parameter space Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset Choose the point in the parameter space that minimizes/maximizes the evaluation function Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be 2 dimensional to help us with the visualization: The parameter space will then be the Cartesian product of the those two sets: We'll see in a bit how we can iterate through this space with itertools. Let's create the dataset and then get started: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=2000, n_features=10) How to do it... Earlier we said that we'd use grid search to tune two parameters—criteria and max_features. We need to represent those as Python sets, and then use itertools product to iterate through them: >>> criteria = {'gini', 'entropy'}>>> max_features = {'auto', 'log2', None}>>> import itertools as it>>> parameter_space = it.product(criteria, max_features) Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare different parameter spaces. 
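One practical note before we write the loop (this aside is ours, not the recipe's): it.product returns a one-shot iterator, so if you materialize parameter_space to inspect the grid, you have to rebuild it before the loop can consume it:
>>> grid = list(parameter_space)    # this exhausts the iterator
>>> len(grid)                       # 2 criteria x 3 max_features = 6 combinations
6
>>> parameter_space = it.product(criteria, max_features)   # rebuild it for the loop below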
We'll also use a test and train split of 50, 50: import numpy as nptrain_set = np.random.choice([True, False], size=len(y))from sklearn.tree import DecisionTreeClassifieraccuracies = {}for criterion, max_feature in parameter_space:   dt = DecisionTreeClassifier(criterion=criterion,         max_features=max_feature)   dt.fit(X[train_set], y[train_set])   accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])                                         == y[~train_set]).mean()>>> accuracies{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini','auto'): 0.9638671875, ('gini', 'log2'): 0.96875} So we now have the accuracies and its performance. Let's visualize the performance: >>> from matplotlib import pyplot as plt>>> from matplotlib import cm>>> cmap = cm.RdBu_r>>> f, ax = plt.subplots(figsize=(7, 4))>>> ax.set_xticklabels([''] + list(criteria))>>> ax.set_yticklabels([''] + list(max_features))>>> plot_array = []>>> for max_feature in max_features:m = []>>> for criterion in criteria:       m.append(accuracies[(criterion, max_feature)])       plot_array.append(m)>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())             - 0.001, vmax=np.max(accuracies.values()) + 0.001,             cmap=cmap)>>> f.colorbar(colors) The following is the output: It's fairly easy to see which one performed best here. Hopefully, you can see how this process can be taken to the further stage with a brute force method. How it works... This works fairly simply, we just have to perform the following steps: Choose a set of parameters. Iterate through them and find the accuracy of each step. Find the best performer by visual inspection. Brute force grid search In this recipe, we'll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we'll utilize built-in methods. We'll also walk through an example of performing randomized optimization. This is an alternative to brute force search. Essentially, we're trading computer cycles to make sure that we search the entire space. We were fairly calm in the last recipe. However, you could imagine a model that has several steps, first imputation for fix missing data, then PCA reduce the dimensionality to classification. Your parameter space could get very large, very fast; therefore, it can be advantageous to only search a part of that space. Getting ready To get started, we'll need to perform the following steps: Create some classification data. We'll then create a LogisticRegression object that will be the model we're fitting. After that, we'll create the search objects, GridSearch and RandomizedSearchCV. How to do it... Run the following code to create some classification data: >>> from sklearn.datasets import make_classification>>> X, y = make_classification(1000, n_features=5) Now, we'll create our logistic regression object: >>> from sklearn.linear_model import LogisticRegression>>> lr = LogisticRegression(class_weight='auto') We need to specify the parameters we want to search. 
For GridSearch, we can just specify the ranges that we care about, but for RandomizedSearchCV, we'll need to actually specify the distribution over the same space from which to sample: >>> lr.fit(X, y)LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},                   dual=False,fit_intercept=True,                  intercept_scaling=1, penalty='l2',                   random_state=None, tol=0.0001)>>> grid_search_params = {'penalty': ['l1', 'l2'],'C': [1, 2, 3, 4]} The only change we'll need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution: >>> import scipy.stats as st>>> import numpy as np>>> random_search_params = {'penalty': ['l1', 'l2'],'C': st.randint(1, 4)} How it works... Now, we'll fit the classifier. This works by passing lr to the parameter search objects: >>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV>>> gs = GridSearchCV(lr, grid_search_params) GridSearchCV implements the same API as the other models: >>> gs.fit(X, y)GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,             class_weight='auto', dual=False, fit_intercept=True,             intercept_scaling=1, penalty='l2', random_state=None,             tol=0.0001), fit_params={}, iid=True, loss_func=None,             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C':             [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True,             score_func=None, scoring=None, verbose=0) As we can see with the param_grid parameter, our penalty and C are both arrays. To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters. We can also look at the marginal performance of the grid search: >>> gs.grid_scores_[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}] We might want to get the max score: >>> gs.grid_scores_[1][1]0.90100000000000002>>> max(gs.grid_scores_, key=lambda x: x[1])mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1} The parameters obtained are the best choices for our logistic regression. Using dummy estimators to compare results This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build. Getting ready In this recipe, we'll perform the following tasks: Create some data random data. Fit the various dummy estimators. We'll perform these two steps for regression data and classification data. How to do it... 
First, we'll create the random data: >>> from sklearn.datasets import make_regression, make_classification# classification if for later>>> X, y = make_regression()>>> from sklearn import dummy>>> dumdum = dummy.DummyRegressor()>>> dumdum.fit(X, y)DummyRegressor(constant=None, strategy='mean') By default, the estimator will predict by just taking the mean of the values and predicting the mean values: >>> dumdum.predict(X)[:5]array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907]) There are other two other strategies we can try. We can predict a supplied constant (refer to constant=None from the preceding command). We can also predict the median value. Supplying a constant will only be considered if strategy is "constant". Let's have a look: >>> predictors = [("mean", None),                 ("median", None),                 ("constant", 10)]>>> for strategy, constant in predictors:       dumdum = dummy.DummyRegressor(strategy=strategy,                 constant=constant)>>> dumdum.fit(X, y)>>> print "strategy: {}".format(strategy), ",".join(map(str,         dumdum.predict(X)[:5]))strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248strategy: constant 10.0,10.0,10.0,10.0,10.0 We actually have four options for classifiers. These strategies are similar to the continuous case, it's just slanted toward classification problems: >>> predictors = [("constant", 0),                 ("stratified", None),                 ("uniform", None),                 ("most_frequent", None)] We'll also need to create some classification data: >>> X, y = make_classification()>>> for strategy, constant in predictors:       dumdum = dummy.DummyClassifier(strategy=strategy,                 constant=constant)       dumdum.fit(X, y)       print "strategy: {}".format(strategy), ",".join(map(str,             dumdum.predict(X)[:5]))strategy: constant 0,0,0,0,0strategy: stratified 1,0,0,1,0strategy: uniform 0,0,0,1,1strategy: most_frequent 1,1,1,1,1 How it works... It's always good to test your models against the simplest models and that's exactly what the dummy estimators give you. For example, imagine a fraud model. In this model, only 5 percent of the data set is fraud. Therefore, we can probably fit a pretty good model just by never guessing any fraud. We can create this model by using the stratified strategy, using the following command. We can also get a good example of why class imbalance causes problems: >>> X, y = make_classification(20000, weights=[.95, .05])>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')>>> dumdum.fit(X, y)DummyClassifier(constant=None, random_state=None, strategy='most_frequent')>>> from sklearn.metrics import accuracy_score>>> print accuracy_score(y, dumdum.predict(X))0.94575 We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time. Summary This article taught us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model. Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Machine Learning in IPython with scikit-learn [article] Our First Machine Learning Method – Linear Classification [article]

Loading data, creating an app, and adding dashboards and reports in Splunk

Packt
31 Oct 2014
13 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock, authors of Splunk Operational Intelligence Cookbook, we will take a look at how to load sample data into Splunk, how to create an application, and how to add dashboards and reports in Splunk. (For more resources related to this topic, see here.) Loading the sample data While most of the data you will index with Splunk will be collected in real time, there might be instances where you have a set of data that you would like to put into Splunk, either to backfill some missing or incomplete data, or just to take advantage of its searching and reporting tools. This recipe will show you how to perform one-time bulk loads of data from files located on the Splunk server. We will also use this recipe to load the data samples that will be used as we build our Operational Intelligence app in Splunk. There are two files that make up our sample data. The first is access_log, which represents data from our web layer and is modeled on an Apache web server. The second file is app_log, which represents data from our application layer and is modeled on the log4j application log data. Getting ready To step through this recipe, you will need a running Splunk server and should have a copy of the sample data generation app (OpsDataGen.spl). (This file is part of the downloadable code bundle, which is available on the book's website.) How to do it... Follow the given steps to load the sample data generator on your system: Log in to your Splunk server using your credentials. From the home launcher, select the Apps menu in the top-left corner and click on Manage Apps. Select Install App from file. Select the location of the OpsDataGen.spl file on your computer, and then click on the Upload button to install the application. After installation, a message should appear in a blue bar at the top of the screen, letting you know that the app has installed successfully. You should also now see the OpsDataGen app in the list of apps. By default, the app installs with the data-generation scripts disabled. In order to generate data, you will need to enable either a Windows or Linux script, depending on your Splunk operating system. To enable the script, select the Settings menu from the top-right corner of the screen, and then select Data inputs. From the Data inputs screen that follows, select Scripts. On the Scripts screen, locate the OpsDataGen script for your operating system and click on Enable. For Linux, it will be $SPLUNK_HOME/etc/apps/OpsDataGen/bin/AppGen.path For Windows, it will be $SPLUNK_HOMEetcappsOpsDataGenbinAppGen-win.path The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. It also displays where to click to enable the correct one based on the operating system Splunk is installed on. Select the Settings menu from the top-right corner of the screen, select Data inputs, and then select Files & directories. On the Files & directories screen, locate the two OpsDataGen inputs for your operating system and for each click on Enable. For Linux, it will be: $SPLUNK_HOME/etc/apps/OpsDataGen/data/access_log $SPLUNK_HOME/etc/apps/OpsDataGen/data/app_log For Windows, it will be: $SPLUNK_HOMEetcappsOpsDataGendataaccess_log $SPLUNK_HOMEetcappsOpsDataGendataapp_log The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. 
It also displays where to click to enable the correct one based on the operating system Splunk is installed on. The data will now be generated in real time. You can test this by navigating to the Splunk search screen and running the following search over an All time (real-time) time range: index=main sourcetype=log4j OR sourcetype=access_combined After a short while, you should see data from both source types flowing into Splunk, and the data generation is now working as displayed in the following screenshot: How it works... In this case, you installed a Splunk application that leverages a scripted input. The script we wrote generates data for two source types. The access_combined source type contains sample web access logs, and the log4j source type contains application logs. Creating an Operational Intelligence application This recipe will show you how to create an empty Splunk app that we will use as the starting point in building our Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the previous recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow the given steps to create the Operational Intelligence application: Log in to your Splunk server. From the top menu, select Apps and then select Manage Apps. Click on the Create app button. Complete the fields in the box that follows. Name the app Operational Intelligence and give it a folder name of operational_intelligence. Add in a version number and provide an author name. Ensure that Visible is set to Yes, and the barebones template is selected. When the form is completed, click on Save. This should be followed by a blue bar with the message, Successfully saved operational_intelligence. Congratulations, you just created a Splunk application! How it works... When an app is created through the Splunk GUI, as in this recipe, Splunk essentially creates a new folder (or directory) named operational_intelligence within the $SPLUNK_HOME/etc/apps directory. Within the $SPLUNK_HOME/etc/apps/operational_intelligence directory, you will find four new subdirectories that contain all the configuration files needed for our barebones Operational Intelligence app that we just created. The eagle-eyed among you would have noticed that there were two templates, barebones and sample_app, out of which any one could have been selected when creating the app. The barebones template creates an application with nothing much inside of it, and the sample_app template creates an application populated with sample dashboards, searches, views, menus, and reports. If you wish to, you can also develop your own custom template if you create lots of apps, which might enforce certain color schemes for example. There's more... As Splunk apps are just a collection of directories and files, there are other methods to add apps to your Splunk Enterprise deployment. Creating an application from another application It is relatively simple to create a new app from an existing app without going through the Splunk GUI, should you wish to do so. This approach can be very useful when we are creating multiple apps with different inputs.conf files for deployment to Splunk Universal Forwarders. Taking the app we just created as an example, copy the entire directory structure of the operational_intelligence app and name it copied_app. 
cp -r $SPLUNK_HOME$/etc/apps/operational_intelligence/* $SPLUNK_HOME$/etc/apps/copied_app Within the directory structure of copied_app, we must now edit the app.conf file in the default directory. Open $SPLUNK_HOME$/etc/apps/copied_app/default/app.conf and change the label field to My Copied App, provide a new description, and then save the conf file. ## Splunk app configuration file#[install]is_configured = 0[ui]is_visible = 1label = My Copied App[launcher]author = John Smithdescription = My Copied applicationversion = 1.0 Now, restart Splunk, and the new My Copied App application should now be seen in the application menu. $SPLUNK_HOME$/bin/splunk restart Downloading and installing a Splunk app Splunk has an entire application website with hundreds of applications, created by Splunk, other vendors, and even users of Splunk. These are great ways to get started with a base application, which you can then modify to meet your needs. If the Splunk server that you are logged in to has access to the Internet, you can click on the Apps menu as you did earlier and then select the Find More Apps button. From here, you can search for apps and install them directly. An alternative way to install a Splunk app is to visit http://apps.splunk.com and search for the app. You will then need to download the application locally. From your Splunk server, click on the Apps menu and then on the Manage Apps button. After that, click on the Install App from File button and upload the app you just downloaded, in order to install it. Once the app has been installed, go and look at the directory structure that the installed application just created. Familiarize yourself with some of the key files and where they are located. When downloading applications from the Splunk apps site, it is best practice to test and verify them in a nonproduction environment first. The Splunk apps site is community driven and, as a result, quality checks and/or technical support for some of the apps might be limited. Adding dashboards and reports Dashboards are a great way to present many different pieces of information. Rather than having lots of disparate dashboards across your Splunk environment, it makes a lot of sense to group related dashboards into a common Splunk application, for example, putting operational intelligence dashboards into a common Operational Intelligence application. In this recipe, you will learn how to move the dashboards and associated reports into our new Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the Loading the sample data recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow these steps to move your dashboards into the new application: Log in to your Splunk server. Select the newly created Operational Intelligence application. From the top menu, select Settings and then select the User interface menu item. Click on the Views section. In the App Context dropdown, select Searching & Reporting (search) or whatever application you were in when creating the dashboards: Locate the website_monitoring dashboard row in the list of views and click on the Move link to the right of the row. In the Move Object pop up, select the Operational Intelligence (operational_intelligence) application that was created earlier and then click on the Move button. 
A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Repeat from step 5 to move the product_monitoring dashboard as well. After the Website Monitoring and Product Monitoring dashboards have been moved, we now want to move all the reports that were created, as these power the dashboards and provide operational intelligence insight. From the top menu, select Settings and this time select Searches, reports, and alerts. Select the Search & Reporting (search) context and filter by cp0* to view the searches (reports) that are created. Click on the Move link of the first cp0* search in the list. Select to move the object to the Operational Intelligence (operational_intelligence) application and click on the Move button. A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Select the Search & Reporting (search) context and repeat from step 11 to move all the other searches over to the new Operational Intelligence application—this seems like a lot but will not take you long! All of the dashboards and reports are now moved over to your new Operational Intelligence application. How it works... In the previous recipe, we revealed how Splunk apps are essentially just collections of directories and files. Dashboards are XML files found within the $SPLUNK_HOME/etc/apps directory structure. When moving a dashboard from one app to another, Splunk is essentially just moving the underlying file from a directory inside one app to a directory in the other app. In this recipe, you moved the dashboards from the Search & Reporting app to the Operational Intelligence app, as represented in the following screenshot: As visualizations on the dashboards leverage the underlying saved searches (or reports), you also moved these reports to the new app so that the dashboards maintain permissions to access them. Rather than moving the saved searches, you could have changed the permissions of each search to Global such that they could be seen from all the other apps in Splunk. However, the other reason you moved the reports was to keep everything contained within a single Operational Intelligence application, which you will continue to build on going forward. It is best practice to avoid setting permissions to Global for reports and dashboards, as this makes them available to all the other applications when they most likely do not need to be. Additionally, setting global permissions can make things a little messy from a housekeeping perspective and crowd the lists of reports and views that belong to specific applications. The exception to this rule might be for knowledge objects such as tags, event types, macros, and lookups, which often have advantages to being available across all applications. There's more… As you went through this recipe, you likely noticed that the dashboards had application-level permissions, but the reports had private-level permissions. The reports are private as this is the default setting in Splunk when they are created. This private-level permission restricts access to only your user account and admin users. In order to make the reports available to other users of your application, you will need to change the permissions of the reports to Shared in App as we did when adjusting the permissions of reports. 
Changing the permissions of saved reports Changing the sharing permission levels of your reports from the default Private to App is relatively straightforward: Ensure that you are in your newly created Operational Intelligence application. Select the Reports menu item to see the list of reports. Click on Edit next to the report you wish to change the permissions for. Then, click on Edit Permissions from the drop-down list. An Edit Permissions pop-up box will appear. In the Display for section, change from Owner to App, and then click on Save. The box will close, and you will see that the Sharing permissions in the table will now display App for the specific report. This report will now be available to all the users of your application. Summary In this article, we loaded the sample data into Splunk. We also saw how to organize dashboards and knowledge into a custom Splunk app. Resources for Article: Further resources on this subject: Working with Pentaho Mobile BI [Article] Visualization of Big Data [Article] Highlights of Greenplum [Article]

Theming with Highcharts

Packt
30 Oct 2014
10 min read
Besides the charting capabilities offered by Highcharts, theming is yet another strong feature of Highcharts. With its extensive theming API, charts can be customized completely to match the branding of a website or an app. Almost all of the chart elements are customizable through this API. In this article by Bilal Shahid, author of Highcharts Essentials, we will do the following things: (For more resources related to this topic, see here.) Use different fill types and fonts Create a global theme for our charts Use jQuery easing for animations Using Google Fonts with Highcharts Google provides an easy way to include hundreds of high quality web fonts to web pages. These fonts work in all major browsers and are served by Google CDN for lightning fast delivery. These fonts can also be used with Highcharts to further polish the appearance of our charts. This section assumes that you know the basics of using Google Web Fonts. If you are not familiar with them, visit https://developers.google.com/fonts/docs/getting_started. We will style the following example with Google Fonts. We will use the Merriweather family from Google Fonts and link to its style sheet from our web page inside the <head> tag: <link href='http://fonts.googleapis.com/css?family=Merriweather:400italic,700italic' rel='stylesheet' type='text/css'> Having included the style sheet, we can actually use the font family in our code for the labels in yAxis: yAxis: [{ ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 400,      fontStyle: 'italic',      fontSize: '14px',      color: '#ffffff'    } } }, { ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 700,      fontStyle: 'italic',      fontSize: '21px',      color: '#ffffff'    },    ... } }] For the outer axis, we used a font size of 21px with font weight of 700. For the inner axis, we lowered the font size to 14px and used font weight of 400 to compensate for the smaller font size. The following is the modified speedometer: In the next section, we will continue with the same example to include jQuery UI easing in chart animations. Using jQuery UI easing for series animation Animations occurring at the point of initialization of charts can be disabled or customized. The customization requires modifying two properties: animation.duration and animation.easing. The duration property accepts the number of milliseconds for the duration of the animation. The easing property can have various values depending on the framework currently being used. For a standalone jQuery framework, the values can be either linear or swing. Using the jQuery UI framework adds a couple of more options for the easing property to choose from. In order to follow this example, you must include the jQuery UI framework to the page. You can also grab the standalone easing plugin from http://gsgd.co.uk/sandbox/jquery/easing/ and include it inside your <head> tag. We can now modify the series to have a modified animation: plotOptions: { ... series: {    animation: {      duration: 1000,      easing: 'easeOutBounce'    } } } The preceding code will modify the animation property for all the series in the chart to have duration set to 1000 milliseconds and easing to easeOutBounce. Each series can have its own different animation by defining the animation property separately for each series as follows: series: [{ ... animation: {    duration: 500,    easing: 'easeOutBounce' } }, { ... 
animation: {    duration: 1500,    easing: 'easeOutBounce' } }, { ... animation: {      duration: 2500,    easing: 'easeOutBounce' } }] Different animation properties for different series can pair nicely with column and bar charts to produce visually appealing effects. Creating a global theme for our charts A Highcharts theme is a collection of predefined styles that are applied before a chart is instantiated. A theme will be applied to all the charts on the page after the point of its inclusion, given that the styling options have not been modified within the chart instantiation. This provides us with an easy way to apply custom branding to charts without the need to define styles over and over again. In the following example, we will create a basic global theme for our charts. This way, we will get familiar with the fundamentals of Highcharts theming and some API methods. We will define our theme inside a separate JavaScript file to make the code reusable and keep things clean. Our theme will be contained in an options object that will, in turn, contain styling for different Highcharts components. Consider the following code placed in a file named custom-theme.js. This is a basic implementation of a Highcharts custom theme that includes colors and basic font styles along with some other modifications for axes: Highcharts.customTheme = {      colors: ['#1BA6A6', '#12734F', '#F2E85C', '#F27329', '#D95D30', '#2C3949', '#3E7C9B', '#9578BE'],      chart: {        backgroundColor: {            radialGradient: {cx: 0, cy: 1, r: 1},            stops: [                [0, '#ffffff'],                [1, '#f2f2ff']            ]        },        style: {            fontFamily: 'arial, sans-serif',            color: '#333'        }    },    title: {        style: {            color: '#222',            fontSize: '21px',            fontWeight: 'bold'        }    },    subtitle: {        style: {            fontSize: '16px',            fontWeight: 'bold'        }    },    xAxis: {        lineWidth: 1,        lineColor: '#cccccc',        tickWidth: 1,        tickColor: '#cccccc',        labels: {            style: {                fontSize: '12px'            }        }    },    yAxis: {        gridLineWidth: 1,        gridLineColor: '#d9d9d9',        labels: {           style: {                fontSize: '12px'            }        }    },    legend: {        itemStyle: {            color: '#666',            fontSize: '9px'        },        itemHoverStyle:{            color: '#222'        }      } }; Highcharts.setOptions( Highcharts.customTheme ); We start off by modifying the Highcharts object to include an object literal named customTheme that contains styles for our charts. Inside customTheme, the first option we defined is for series colors. We passed an array containing eight colors to be applied to series. In the next part, we defined a radial gradient as a background for our charts and also defined the default font family and text color. The next two object literals contain basic font styles for the title and subtitle components. Then comes the styles for the x and y axes. For the xAxis, we define lineColor and tickColor to be #cccccc with the lineWidth value of 1. The xAxis component also contains the font style for its labels. The y axis gridlines appear parallel to the x axis that we have modified to have the width and color at 1 and #d9d9d9 respectively. Inside the legend component, we defined styles for the normal and mouse hover states. 
These two states are stated by itemStyle and itemHoverStyle respectively. In normal state, the legend will have a color of #666 and font size of 9px. When hovered over, the color will change to #222. In the final part, we set our theme as the default Highcharts theme by using an API method Highcharts.setOptions(), which takes a settings object to be applied to Highcharts; in our case, it is customTheme. The styles that have not been defined in our custom theme will remain the same as the default theme. This allows us to partially customize a predefined theme by introducing another theme containing different styles. In order to make this theme work, include the file custom-theme.js after the highcharts.js file: <script src="js/highcharts.js"></script> <script src="js/custom-theme.js"></script> The output of our custom theme is as follows: We can also tell our theme to include a web font from Google without having the need to include the style sheet manually in the header, as we did in a previous section. For that purpose, Highcharts provides a utility method named Highcharts.createElement(). We can use it as follows by placing the code inside the custom-theme.js file: Highcharts.createElement( 'link', {    href: 'http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,400,300,700',    rel: 'stylesheet',    type: 'text/css' }, null, document.getElementsByTagName( 'head' )[0], null ); The first argument is the name of the tag to be created. The second argument takes an object as tag attributes. The third argument is for CSS styles to be applied to this element. Since, there is no need for CSS styles on a link element, we passed null as its value. The final two arguments are for the parent node and padding, respectively. We can now change the default font family for our charts to 'Open Sans': chart: {    ...    style: {        fontFamily: "'Open Sans', sans-serif",        ...    } } The specified Google web font will now be loaded every time a chart with our custom theme is initialized, hence eliminating the need to manually insert the required font style sheet inside the <head> tag. This screenshot shows a chart with 'Open Sans' Google web font. Summary In this article, you learned about incorporating Google fonts and jQuery UI easing into our chart for enhanced styling. Resources for Article: Further resources on this subject: Integrating with other Frameworks [Article] Highcharts [Article] More Line Charts, Area Charts, and Scatter Plots [Article]

Hosting the service in IIS using the TCP protocol

Packt
30 Oct 2014
8 min read
In this article by Mike Liu, the author of WCF Multi-layer Services Development with Entity Framework, Fourth Edtion, we will learn how to create and host a service in IIS using the TCP protocol. (For more resources related to this topic, see here.) Hosting WCF services in IIS using the HTTP protocol gives the best interoperability to the service, because the HTTP protocol is supported everywhere today. However, sometimes interoperability might not be an issue. For example, the service may be invoked only within your network with all Microsoft clients only. In this case, hosting the service by using the TCP protocol might be a better solution. Benefits of hosting a WCF service using the TCP protocol Compared to HTTP, there are a few benefits in hosting a WCF service using the TCP protocol: It supports connection-based, stream-oriented delivery services with end-to-end error detection and correction It is the fastest WCF binding for scenarios that involve communication between different machines It supports duplex communication, so it can be used to implement duplex contracts It has a reliable data delivery capability (this is applied between two TCP/IP nodes and is not the same thing as WS-ReliableMessaging, which applies between endpoints) Preparing the folders and files First, we need to prepare the folders and files for the host application, just as we did for hosting the service using the HTTP protocol. We will use the previous HTTP hosting application as the base to create the new TCP hosting application: Create the folders: In Windows Explorer, create a new folder called HostIISTcp under C:SOAwithWCFandEFProjectsHelloWorld and a new subfolder called bin under the HostIISTcp folder. You should now have the following new folders: C:SOAwithWCFandEFProjectsHelloWorld HostIISTcp and a bin folder inside the HostIISTcp folder. Copy the files: Now, copy all the files from the HostIIS hosting application folder at C:SOAwithWCFandEFProjectsHelloWorldHostIIS to the new folder that we created at C:SOAwithWCFandEFProjectsHelloWorldHostIISTcp. Create the Visual Studio solution folder: To make it easier to be viewed and managed from the Visual Studio Solution Explorer, you can add a new solution folder, HostIISTcp, to the solution and add the Web.config file to this folder. Add another new solution folder, bin, under HostIISTcp and add the HelloWorldService.dll and HelloWorldService.pdb files under this bin folder. Add the following post-build events to the HelloWorldService project, so next time, all the files will be copied automatically when the service project is built: xcopy "$(AssemblyName).dll" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y xcopy "$(AssemblyName).pdb" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y Modify the Web.config file: The Web.config file that we have copied from HostIIS is using the default basicHttpBinding as the service binding. To make our service use the TCP binding, we need to change the binding to TCP and add a TCP base address. Open the Web.config file and add the following node to it under the <system.serviceModel> node: <services> <service name="HelloWorldService.HelloWorldService">    <endpoint address="" binding="netTcpBinding"    contract="HelloWorldService.IHelloWorldService"/>    <host>      <baseAddresses>        <add baseAddress=        "net.tcp://localhost/HelloWorldServiceTcp/"/>      </baseAddresses>    </host> </service> </services> In this new services node, we have defined one service called HelloWorldService.HelloWorldService. 
The base address of this service is net.tcp://localhost/HelloWorldServiceTcp/. Remember, we have defined the host activation relative address as ./HelloWorldService.svc, so we can invoke this service from the client application with the following URL: http://localhost/HelloWorldServiceTcp/HelloWorldService.svc. For the file-less WCF activation, if no endpoint is defined explicitly, HTTP and HTTPS endpoints will be defined by default. In this example, we would like to expose only one TCP endpoint, so we have added an endpoint explicitly (as soon as this endpoint is added explicitly, the default endpoints will not be added). If you don't add this TCP endpoint explicitly here, the TCP client that we will create in the next section will still work, but on the client config file you will see three endpoints instead of one and you will have to specify which endpoint you are using in the client program. The following is the full content of the Web.config file: <?xml version="1.0"?> <!-- For more information on how to configure your ASP.NET application, please visit http://go.microsoft.com/fwlink/?LinkId=169433 --> <configuration> <system.web>    <compilation debug="true" targetFramework="4.5"/>    <httpRuntime targetFramework="4.5" /> </system.web>   <system.serviceModel>    <serviceHostingEnvironment >      <serviceActivations>        <add factory="System.ServiceModel.Activation.ServiceHostFactory"          relativeAddress="./HelloWorldService.svc"          service="HelloWorldService.HelloWorldService"/>      </serviceActivations>    </serviceHostingEnvironment>      <behaviors>      <serviceBehaviors>        <behavior>          <serviceMetadata httpGetEnabled="true"/>        </behavior>      </serviceBehaviors>    </behaviors>    <services>      <service name="HelloWorldService.HelloWorldService">        <endpoint address="" binding="netTcpBinding"         contract="HelloWorldService.IHelloWorldService"/>        <host>          <baseAddresses>            <add baseAddress=            "net.tcp://localhost/HelloWorldServiceTcp/"/>          </baseAddresses>        </host>      </service>    </services> </system.serviceModel>   </configuration> Enabling the TCP WCF activation for the host machine By default, the TCP WCF activation service is not enabled on your machine. This means your IIS server won't be able to host a WCF service with the TCP protocol. You can follow these steps to enable the TCP activation for WCF services: Go to Control Panel | Programs | Turn Windows features on or off. Expand the Microsoft .Net Framework 3.5.1 node on Windows 7 or .Net Framework 4.5 Advanced Services on Windows 8. Check the checkbox for Windows Communication Foundation Non-HTTP Activation on Windows 7 or TCP Activation on Windows 8. The following screenshot depicts the options required to enable WCF activation on Windows 7: The following screenshot depicts the options required to enable TCP WCF activation on Windows 8: Repair the .NET Framework: After you have turned on the TCP WCF activation, you have to repair .NET. Just go to Control Panel, click on Uninstall a Program, select Microsoft .NET Framework 4.5.1, and then click on Repair. Creating the IIS application Next, we need to create an IIS application named HelloWorldServiceTcp to host the WCF service, using the TCP protocol. Follow these steps to create this application in IIS: Open IIS Manager. Add a new IIS application, HelloWorldServiceTcp, pointing to the HostIISTcp physical folder under your project's folder. 
Choose DefaultAppPool as the application pool for the new application. Again, make sure your default app pool is a .NET 4.0.30319 application pool. Enable the TCP protocol for the application. Right-click on HelloWorldServiceTcp, select Manage Application | Advanced Settings, and then add net.tcp to Enabled Protocols. Make sure you use all lowercase letters and separate it from the existing HTTP protocol with a comma. Now the service is hosted in IIS using the TCP protocol. To view the WSDL of the service, browse to http://localhost/HelloWorldServiceTcp/HelloWorldService.svc and you should see the service description and a link to the WSDL of the service. Testing the WCF service hosted in IIS using the TCP protocol Now, we have the service hosted in IIS using the TCP protocol; let's create a new test client to test it: Add a new console application project to the solution, named HelloWorldClientTcp. Add a reference to System.ServiceModel in the new project. Add a service reference to the WCF service in the new project, naming the reference HelloWorldServiceRef and use the URL http://localhost/HelloWorldServiceTcp/HelloWorldService.svc?wsdl. You can still use the SvcUtil.exe command-line tool to generate the proxy and config files for the service hosted with TCP, just as we did in previous sections. Actually, behind the scenes Visual Studio is also calling SvcUtil.exe to generate the proxy and config files. Add the following code to the Main method of the new project: var client = new HelloWorldServiceRef.HelloWorldServiceClient (); Console.WriteLine(client.GetMessage("Mike Liu")); Finally, set the new project as the startup project. Now, if you run the program, you will get the same result as before; however, this time the service is hosted in IIS using the TCP protocol. Summary In this article, we created and tested an IIS application to host the service with the TCP protocol. Resources for Article: Further resources on this subject: Microsoft WCF Hosting and Configuration [Article] Testing and Debugging Windows Workflow Foundation 4.0 (WF) Program [Article] Applying LINQ to Entities to a WCF Service [Article]

Data visualization

Packt
27 Oct 2014
8 min read
Data visualization is one of the most important tasks in data science track. Through effective visualization we can easily uncover underlying pattern among variables with doing any sophisticated statistical analysis. In this cookbook we have focused on graphical analysis using R in a very simple way with each independent example. We have covered default R functionality along with more advance visualization techniques such as lattice, ggplot2, and three-dimensional plots. Readers will not only learn the code to produce the graph but also learn why certain code has been written with specific examples. R Graphs Cookbook Second Edition written by Jaynal Abedin and Hrishi V. Mittal is such a book where the user will learn how to produce various graphs using R and how to customize them and finally how to make ready for publication. This practical recipe book starts with very brief description about R graphics system and then gradually goes through basic to advance plots with examples. Beside the R default graphics this recipe book introduces advance graphic system such as lattice and ggplot2; the grammar of graphics. We have also provided examples on how to inspect large dataset using advanced visualization such as tableplot and three dimensional visualizations. We also cover the following topics: How to create various types of bar charts using default R functions, lattice and ggplot2 How to produce density plots along with histograms using lattice and ggplot2 and customized them for publication How to produce graphs of frequency tabulated data How to inspect large dataset by simultaneously visualizing numeric and categorical variables in a single plot How to annotate graphs using ggplot2 (For more resources related to this topic, see here.) This recipe book is targeted to those reader groups who already exposed to R programming and want to learn effective graphics with the power of R and its various libraries. This hands-on guide starts with very short introduction to R graphics system and then gets straight to the point – actually creating graphs, instead of just theoretical learning. Each recipe is specifically tailored to full fill reader’s appetite for visually representing the data in the best way possible. Now, we will present few examples so that you can have an idea about the content of this recipe book: The ggplot2 R package is based on The Grammar of Graphics by Leland Wilkinson, Springer). Using this package, we can produce a variety of traditional graphics, and the user can produce their customized graphs as well. The beauty of this package is in its layered graphics facilities; through the use of layered graphics utilities, we can produce almost any kind of data visualization. Recently, ggplot2 is the most searched keyword in the R community, including the most popular R blog (www.r-bloggers.com). The comprehensive theme system allows the user to produce publication quality graphs with a variety of themes of choice. If we want to explain this package in a single sentence, then we can say that if whatever we can think about data visualization can be structured in a data frame, the visualization is a matter of few seconds. In the specific chapter on ggplot2 , we will see different examples and use themes to produce publication quality graphs. However, in this introductory chapter, we will show you one of the important features of the ggplot2 package that produces various types of graphs. 
The main function is ggplot(), but with the help of a different geom function, we can easily produce different types of graphs, such as the following: geom_point(): This will create scatter plot geom_line(): This will create a line chart geom_bar(): This will create a bar chart geom_boxplot(): This will create a box plot geom_text(): This will write certain text inside the plot area Now, we will see a simple example of the use of different geom functions with the default R mtcars dataset: # loading ggplot2 library library(ggplot2) # creating a basic ggplot object p <- ggplot(data=mtcars) # Creating scatter plot of mpg and disp variable p1 <- p+geom_point(aes(x=disp,y=mpg)) # creating line chart from the same ggplot object but different # geom function p2 <- p+geom_line(aes(x=disp,y=mpg)) # creating bar chart of mpg variable p3 <- p+geom_bar(aes(x=mpg)) # creating boxplot of mpg over gear p4 <- p+geom_boxplot(aes(x=factor(gear),y=mpg)) # writing certain text into the scatter plot p5 <- p1+geom_text(x=200,y=25,label="Scatter plot") The visualization of the preceding five plot will look like the following figure: Visualizing an empirical Cumulative Distribution function The empirical Cumulative Distribution function (CDF) is the non-parametric maximum-likelihood estimation of the CDF. In this recipe, we will see how the empirical CDF can be produced. Getting ready To produce this plot, we need to use the latticeExtra library. We will use the simulated dataset as shown in the following code: # Set a seed value to make the data reproducible set.seed(12345) qqdata <-data.frame(disA=rnorm(n=100,mean=20,sd=3),                disB=rnorm(n=100,mean=25,sd=4),                disC=rnorm(n=100,mean=15,sd=1.5),                age=sample((c(1,2,3,4)),size=100,replace=T),                sex=sample(c("Male","Female"),size=100,replace=T),                 econ_status=sample(c("Poor","Middle","Rich"),                size=100,replace=T)) How to do it… To plot an empirical CDF, we first need to call the latticeExtra library (note that this library has a dependency on RColorBrewer). Now, to plot the empirical CDF, we can use the following simple code: library(latticeExtra) ecdfplot(~disA|sex,data=qqdata) Graph annotation with ggplot To produce publication-quality data visualization, we often need to annotate the graph with various texts, symbols, or even shapes. In this recipe, we will see how we can easily annotate an existing graph. Getting ready In this recipe, we will use the disA and disD variables from ggplotdata. Let's call ggplotdata for this recipe. We also need to call the grid and gridExtra libraries for this recipe. How to do it... In this recipe, we will execute the following annotation on an existing scatter plot. 
So, the whole procedure will be as follows:
Create a scatter plot
Add customized text within the plot
Highlight a certain region to indicate extreme values
Draw a line segment with an arrow within the scatter plot to indicate a single extreme observation
Now, we will implement each of the steps one by one:
library(grid)
library(gridExtra)
# creating the scatter plot and printing it
annotation_obj <- ggplot(data=ggplotdata, aes(x=disA, y=disD)) + geom_point()
annotation_obj
# adding custom text at the (18,29) position
annotation_obj1 <- annotation_obj + annotate(geom="text", x=18, y=29, label="Extreme value", size=3)
annotation_obj1
# highlighting a certain region with a box
annotation_obj2 <- annotation_obj1 + annotate("rect", xmin=24, xmax=27, ymin=17, ymax=22, alpha=.2)
annotation_obj2
# drawing a line segment with an arrow
annotation_obj3 <- annotation_obj2 + annotate("segment", x=16, xend=17.5, y=25, yend=27.5, colour="red",
    arrow=arrow(length=unit(0.5, "cm")), size=2)
annotation_obj3
The preceding four steps are displayed in the following single graph:
How it works...
The annotate() function takes a geom as input, such as "text", "rect", or "segment", and then takes further inputs that specify where that geom should be drawn or placed. In this particular recipe, we used three geom instances: text to write customized text within the plot, rect to highlight a certain region of the plot, and segment to draw an arrow. The alpha argument represents the transparency of the highlighted region, and the size argument represents the size of the text and the line width of the line segment.
Summary
This article gives just a sample of the kind of recipes included in the book and shows how each recipe is structured.
Resources for Article:
Further resources on this subject:
Using R for Statistics, Research, and Graphics [Article]
First steps with R [Article]
Aspects of Data Manipulation in R [Article]

The EMR Architecture

Packt
27 Oct 2014
6 min read
This article is written by Amarkant Singh and Vijay Rayapati, the authors of Learning Big Data with Amazon Elastic MapReduce. The goal of this article is to introduce you to the EMR architecture and EMR use cases. (For more resources related to this topic, see here.)
Traditionally, very few companies had access to large-scale infrastructure to build Big Data applications. However, cloud computing has democratized access to infrastructure, allowing developers and companies to quickly run new experiments without worrying about setting up or scaling infrastructure. A cloud provides an infrastructure-as-a-service platform that allows businesses to build applications and host them reliably on scalable infrastructure. It includes a variety of application-level services to help developers accelerate their development and deployment times. Amazon EMR is one of the hosted services provided by AWS and is built on top of scalable AWS infrastructure for building Big Data applications.
The EMR architecture
Let's get familiar with EMR. This section outlines its key concepts.
Hadoop offers distributed processing by using the MapReduce framework for the execution of tasks on a set of servers or compute nodes (also known as a cluster). One of the nodes in the Hadoop cluster controls the distribution of tasks to the other nodes and is called the Master Node. The nodes executing the tasks using MapReduce are called Slave Nodes:
Amazon EMR is designed to work with many other AWS services, such as S3 for input/output data storage, and DynamoDB and Redshift for output data. EMR uses AWS CloudWatch metrics to monitor cluster performance and raise notifications for user-specified alarms. We can create on-demand Hadoop clusters using EMR while storing the input and output data in S3, without worrying about managing a 24*7 cluster or HDFS for data storage. The Amazon EMR job flow is shown in the following diagram:
Types of nodes
Amazon EMR provides three different roles for the servers or nodes in the cluster, and they map to the Hadoop roles of master and slave nodes. When you create an EMR cluster, it is called a job flow, and it executes a set of jobs or job steps one after the other:
Master node: This node controls and manages the cluster. It distributes the MapReduce tasks to the nodes in the cluster and monitors the status of task execution. Every EMR cluster has exactly one master node, in a master instance group.
Core nodes: These nodes execute MapReduce tasks and provide HDFS for storing the data related to task execution. The EMR cluster has core nodes as part of a core instance group. The core node corresponds to the slave node in Hadoop. So, basically, these nodes have a two-fold responsibility: the first is to execute the map and reduce tasks allocated by the master, and the second is to hold the data blocks.
Task nodes: These nodes are used only for MapReduce task execution, and they are optional when launching the EMR cluster. The task node corresponds to the slave node in Hadoop and is part of a task instance group in EMR.
When you scale down your clusters, you cannot remove any core nodes. This is because EMR does not want to let you lose your data blocks. You can remove nodes from a task group while scaling down your cluster. You should also use only task instance groups for spot instances, as spot instances can be taken away as per your bid price, and you would not want to lose your data blocks.
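To see how the three instance groups fit together in practice, the following is a minimal sketch using boto3, the current AWS SDK for Python (which postdates this article). The bucket, key pair, release label, instance types, and bid price are placeholders, not values from the article:
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-job-flow",
    LogUri="s3://my-emr-logs/",                 # hypothetical log bucket
    ReleaseLabel="emr-5.36.0",                  # any available EMR release
    Applications=[{"Name": "Hadoop"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "Ec2KeyName": "my-key-pair",            # hypothetical key pair
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            # Exactly one master node per cluster
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m4.large",
             "InstanceCount": 1},
            # Core nodes run tasks and hold HDFS blocks, so keep them on-demand
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m4.large",
             "InstanceCount": 2},
            # Task nodes only run tasks, so spot instances are safe here
            {"Name": "task", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m4.large", "InstanceCount": 2},
        ],
    },
)
print("Started cluster:", response["JobFlowId"])
Scaling the cluster later is a matter of calling modify_instance_groups on the task instance group's ID; as noted above, the core group is the one you should avoid shrinking once it holds data.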
You can launch a cluster having just one node, that is, with just one master node and no other nodes. In that case, the same node acts as both the master and core node. For simplicity, you can think of a node as an EC2 server in EMR.
EMR use cases
Amazon EMR can be used to build a variety of applications such as recommendation engines, data analysis, log processing, event/click stream analysis, data transformations (ETL), fraud detection, scientific simulations, genomics, financial analysis, or data correlation in various industries. The following section outlines some of the use cases in detail.
Web log processing
We can use EMR to process logs to understand the usage of content such as videos and file downloads, the top web URLs accessed by end users, user consumption from different parts of the world, and more. We can process any web or mobile application logs using EMR to extract insights that are relevant for your business. We can move all our web access, application, or mobile logs to Amazon S3 for analysis using EMR, even if we are not using AWS to run our production applications.
Clickstream analysis
By using clickstream analysis, we can segment users into different groups and understand their behavior with respect to advertisements or application usage. Ad networks or advertisers can perform clickstream analysis on ad-impression logs to deliver more effective campaigns or advertisements to end users. Reports generated from this analysis can include various metrics such as source traffic distribution, purchase funnel, lead source ROI, and abandoned carts, among others.
Product recommendation engine
Recommendation engines can be built using EMR for e-commerce, retail, or web businesses. Many e-commerce businesses have a large inventory of products across different categories and regularly add new products or categories. It can be very difficult for end users to search for and identify the right products quickly. With recommendation engines, we can help end users quickly find relevant products, or suggest products based on what they are viewing, and so on. We may also want to notify users via e-mail based on their past purchase behavior.
Scientific simulations
When you need distributed processing with large-scale infrastructure for scientific or research simulations, EMR can be of great help. We can quickly launch large clusters in a matter of minutes and install specific MapReduce programs for analysis using EMR. AWS also offers genomics datasets for free on S3.
Data transformations
We can perform complex extract, transform, and load (ETL) processes using EMR for either data analysis or data warehousing needs. It can be as simple as transforming XML data into JSON data for further usage, or moving all the financial transaction records of a bank into a common date-time format for archiving purposes. You can also use EMR to move data between different systems in AWS, such as DynamoDB, Redshift, and S3.
Summary
In this article, we learned about the EMR architecture. We looked at the various EMR node types and their roles in detail.
Resources for Article:
Further resources on this subject:
Introduction to MapReduce [Article]
Understanding MapReduce [Article]
HDFS and MapReduce [Article]

Clustering with K-Means

Packt
27 Oct 2014
9 min read
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.
Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:
Clustering might reveal the following two groups, indicated by squares and circles.
Clustering could also reveal the following four groups:
Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections, we will work through an example of using the K-Means algorithm to cluster a data set.
Clustering with the K-Means Algorithm
The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article.
The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:
J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2
where \mu_k is the centroid for cluster k and C_k is the set of instances assigned to that cluster. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances.
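To make the cost function concrete, the following is a small sketch (not from the book) that computes the total distortion for a given set of centroids and cluster assignments using NumPy; the toy points and labels are made up for illustration:
import numpy as np

def kmeans_cost(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Two compact toy clusters around (0, 0) and (5, 5)
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centroids = np.array([[0.25, 0.1], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

print(kmeans_cost(X, centroids, labels))  # small value, because the clusters are compact
Moving either centroid away from its points, or assigning a point to the wrong cluster, increases this value, which is exactly what the iterative procedure described next tries to avoid.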
The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location. Let's work through an example by hand using the training data shown in the following table.

Instance  X0  X1
1         7   5
2         5   7
3         7   7
4         3   3
5         4   6
6         1   4
7         0   0
8         2   2
9         8   7
10        6   8
11        5   5
12        3   7

There are two explanatory variables; each instance has two features. The instances are plotted in the following figure. Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the "Cluster" column of the following table.

Instance     X0  X1  C1 distance  C2 distance  Last Cluster  Cluster  Changed?
1            7   5   3.16228      2            None          C2       Yes
2            5   7   1.41421      2            None          C1       Yes
3            7   7   3.16228      2.82843      None          C2       Yes
4            3   3   3.16228      2.82843      None          C2       Yes
5            4   6   0            1.41421      None          C1       Yes
6            1   4   3.60555      4.12311      None          C1       Yes
7            0   0   7.21110      7.07107      None          C2       Yes
8            2   2   4.47214      4.24264      None          C2       Yes
9            8   7   4.12311      3.60555      None          C2       Yes
10           6   8   2.82843      3.16228      None          C1       Yes
11           5   5   1.41421      0            None          C2       Yes
12           3   7   1.41421      2.82843      None          C1       Yes
C1 centroid  4   6
C2 centroid  5   5

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with Xs, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances. Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids.

Instance     X0        X1        C1 distance  C2 distance  Last Cluster  New Cluster  Changed?
1            7         5         3.492850     2.575394     C2            C2           No
2            5         7         1.341641     2.889107     C1            C1           No
3            7         7         3.255764     3.749830     C2            C1           Yes
4            3         3         3.492850     1.943067     C2            C2           No
5            4         6         0.447214     1.943067     C1            C1           No
6            1         4         3.687818     3.574285     C1            C2           Yes
7            0         0         7.443118     6.169378     C2            C2           No
8            2         2         4.753946     3.347250     C2            C2           No
9            8         7         4.242641     4.463000     C2            C1           Yes
10           6         8         2.720294     4.113194     C1            C1           No
11           5         5         1.843909     0.958315     C2            C2           No
12           3         7         1            3.260775     C1            C1           No
C1 centroid  3.8       6.4
C2 centroid  4.571429  4.142857

The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments. Now we will move the centroids to the means of their constituents' locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure. None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these thresholds are small enough, K-Means will converge on an optimum.
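If you want to check the hand-worked iterations programmatically, here is a short sketch (again, not from the book) that repeats the assignment and update steps on the same twelve instances, starting from the same initial centroids (the fifth and eleventh instances):
import numpy as np

X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
              [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]], dtype=float)

# Initial centroids: instance 5 for C1 and instance 11 for C2
centroids = np.array([X[4], X[10]])

for iteration in range(3):
    # Assignment step: distance from every instance to every centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned instances
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print("Iteration", iteration + 1)
    print("  assignments:", ["C1" if lab == 0 else "C2" for lab in labels])
    print("  centroids:\n", centroids)
After the first update, the centroids land at (3.8, 6.4) and (4.571429, 4.142857), matching the table above, and the assignments stop changing after a couple of iterations.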
This optimum will not necessarily be the global optimum.
Local Optima
Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:
K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations are more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times. In each repetition, it is randomly initialized to different starting cluster positions. The initialization that minimizes the cost function best is selected.
The Elbow Method
If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow.
Let's use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters. We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt
>>> # two clusters of ten points each, drawn from uniform distributions
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T
>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()
The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let's use the elbow method on the following data set with three clusters:
The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set.
Summary
In this article, we explained what clustering is, worked through the K-Means algorithm by hand, and saw how the elbow method can be used to choose K.
Resources for Article:
Further resources on this subject:
Machine Learning in IPython with scikit-learn [Article]
Machine Learning in Bioinformatics [Article]
Specialized Machine Learning Topics [Article]