
How-To Tutorials - Data

1204 Articles

Reducing Data Size

Packt
20 Nov 2013
5 min read
Selecting attributes using models

Weighting by the PCA approach, mentioned previously, is an example where the combination of attributes within an example drives the generation of the principal components, and the correlation of an attribute with these generates the attribute's weight. When building classifiers, it is logical to take this a stage further and use the potential model itself as the determinant of whether the addition or removal of an attribute makes for better predictions. RapidMiner provides a number of operators to facilitate this, and the following sections go into detail for one of these operators with the intention of showing how applicable the techniques are to other similar operators.

The operator that will be explained in detail is Forward Selection. It is similar to a number of others in the Optimization group within the Attribute selection and Data transformation section of the RapidMiner GUI operator tree. These operators include Backward Elimination and a number of Optimize Selection operators, and the techniques illustrated are transferable to them. A process that uses Forward Selection is shown in the next screenshot.

The Retrieve operator (labeled 1) simply retrieves the sonar data from the local sample repository. This data has 208 examples and 60 regular attributes named attribute_1 to attribute_60. The label is named class and has two values, Rock and Mine.

The Forward Selection operator (labeled 2) tests the performance of a model on examples containing more and more attributes. The inner operators within this operator perform the testing.

The Log to Data operator (labeled 3) creates an example set from the log entries that were written inside the Forward Selection operator. Example sets are easier to process and store in the repository.

The Guess Types operator (labeled 4) changes the types of attributes based on their contents. This is simply a cosmetic step to change real numbers into integers to make plotting them look better.

Now, let's return to the Forward Selection operator, which starts by invoking its inner operators to check the model performance using each of the 60 regular attributes individually. This means it runs 60 times. The attribute that gives the best performance is retained, and the process is repeated with two attributes, using each of the remaining 59 attributes along with the best attribute from the first run. The best pair of attributes is then retained, and the process is repeated with three attributes using each of the remaining 58. This continues until the stopping conditions are met. For illustrative purposes, the parameters shown in the following screenshot are chosen to allow it to continue for 60 iterations and use all 60 attributes.

The inner operator to the Forward Selection operator is a simple cross-validation with the number of folds set to three. Using cross-validation ensures that the performance is an estimate of what the performance would be on unseen data. Some overfitting will inevitably occur, and it is likely that setting the number of validations to three will increase this. However, this process is for illustrative purposes and needs to run reasonably quickly, and a low cross-validation count facilitates this.

Inside the Validation operator itself, there are operators to generate a model, calculate performance, and log data. These are shown in the following screenshot.

The Naïve Bayes operator is a simple model that does not require a large runtime to complete. Within the Validation operator, it runs on different training partitions of the data. The Apply Model and Performance operators check the performance of the model using the test partitions. The Log operator outputs information each time it is called, and the following screenshot shows the details of what it logs. Running the process gives the log output shown in the next screenshot.

It is worth understanding this output because it gives a good overview of how the operators work and fit together in a process. For example, the attributes applyCountPerformance, applyCountValidation, and applyCountForwardSelection increment by one each time the respective operator is executed. The expected behavior is that applyCountPerformance will increment with each new row in the result, applyCountValidation will increment every three rows (corresponding to the number of cross-validation folds), and applyCountForwardSelection will remain at 1 throughout the process.

Note that validationPerformance is missing for the first three rows. This is because the Validation operator has not calculated a performance yet. The logged validationPerformance value is the average of the innerPerformance values recorded within the Validation operator. So, for example, the values for innerPerformance are 0.652, 0.514, and 0.580 for the first three rows; these average out to 0.582, which is the value for validationPerformance in the fourth row. The featureNames attribute shows the attributes that were used to create the various performance measurements.

The results are plotted as a graph as shown. The plot shows that as the number of attributes increases, the validation performance increases and reaches a maximum when the number of attributes is 23. From there, it steadily decreases as the number of attributes reaches 60. The best performance is given by the attribute set in place immediately before validationPerformance reaches its maximum value. In this case, the attributes are: attribute_12, attribute_40, attribute_16, attribute_11, attribute_6, attribute_28, attribute_19, attribute_17, attribute_44, attribute_37, attribute_30, attribute_53, attribute_47, attribute_22, attribute_41, attribute_54, attribute_34, attribute_23, attribute_27, attribute_39, attribute_57, attribute_36, attribute_10.

The point is that the number of attributes has been reduced and the model accuracy has actually increased. In real-world situations with large datasets, a reduction in the attribute count accompanied by an increase in performance is very valuable.

Summary

This article has covered the important topic of reducing data size by the removal of both examples and attributes. This is important to speed up processing time, and in some cases it can even improve classification accuracy. Generally, though, classification accuracy reduces as data is reduced.


Visualization of Big Data

Packt
20 Nov 2013
7 min read
Data visualization

Data visualization is simply the representation of your data in graphical form. It is needed to study patterns and trends in the enriched dataset, and it is the easiest way for people to understand data. KPI Library has developed A Periodic Table of Visualization Methods, which includes six types of visualization methods: data visualization, information visualization, concept visualization, strategy visualization, metaphor visualization, and compound visualization.

Data source preparation

Throughout the article, we will be working further with CTools to build a more interactive dashboard. We will use the nyse_stocks data, but we need to change its structure. The data source for the dashboard will be a PDI transformation.

Repopulating the nyse_stocks Hive table

Execute the following steps:

1. Launch Spoon and open the nyse_stock_transfer.ktr file from the code folder.
2. Move NYSE-2000-2001.tsv.gz into the same folder as the transformation file.
3. Run the transformation until it is finished. This process will produce the NYSE-2000-2001-convert.tsv.gz file.
4. Open the sandbox by visiting http://192.168.1.122:8000.
5. On the menu bar, choose the File Browser menu, click on Upload, and choose Files. Navigate to your NYSE-2000-2001-convert.tsv.gz file and wait until the upload finishes.
6. On the menu bar, choose the HCatalog / Tables menu. From here, drop the existing nyse_stocks table.
7. On the left-hand side pane, click on the Create a new table from a file link.
8. In the Table Name textbox, type nyse_stocks.
9. Click on the NYSE-2000-2001-convert.tsv.gz file. If the file does not appear, make sure you navigate to the right user or name path.
10. On the Create a new table from a file page, accept all the options and click on the Create Table button. Once it is finished, the page redirects to the HCatalog Table List.
11. Click on the Browse Data button next to nyse_stocks. Make sure the month and year columns are now available.

Pentaho's data source integration

Execute the following steps:

1. Launch Spoon and open hive_java_query.ktr from the code folder. This transformation acts as our data source. The transformation consists of several steps, but the most important are the three initial steps:
   - Generate Rows: Its function is to generate a data row and trigger execution of the next steps in the sequence, which are Get Variable and User Defined Java Class.
   - Get Variable: This enables the transformation to identify a variable and convert it into a row field with its value.
   - User Defined Java Class: This contains the Java code to query Hive data.
2. Double-click on the User Defined Java Class step. The code begins with all the imports of the needed Java packages, followed by the processRow() method. The code is actually a query to the Hive database using JDBC objects. What makes it different is the following code:

       ResultSet res = stmt.executeQuery(sql);
       while (res.next()) {
           get(Fields.Out, "period").setValue(rowd, res.getString(3) + "-" + res.getString(4));
           get(Fields.Out, "stock_price_close").setValue(rowd, res.getDouble(1));
           putRow(data.outputRowMeta, rowd);
       }

   The code executes a SQL query statement against Hive. The result is iterated over and filled into PDI's output rows. Column 1 of the result is reproduced as stock_price_close. The concatenation of columns 3 and 4 of the result becomes period.
3. On the User Defined Java Class step, click on the Preview this transformation menu. It may take minutes because of the MapReduce process and because it is a single-node Hadoop cluster. You will have better performance when adding more nodes to achieve an optimum cluster setup. You will have a data preview like the one in the following screenshot.

Consuming PDI as a CDA data source

To consume data through CTools, use Community Data Access (CDA), as it is the standard data access layer. CDA is able to connect to several sources, including a Pentaho Data Integration transformation. The following steps will help you create CDA data sources consuming a PDI transformation:

1. Copy the Chapter 5 folder from your book's code bundle folder into [BISERVER]/pentaho-solutions and launch PUC.
2. In the Browser Panel window, you should see a newly added Chapter 5. If it does not appear, on the Tools menu, click on Refresh and select Repository Cache.
3. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit. Create the appropriate data sources.
4. In the Browser Panel window, double-click on stock_price_dashboard_hive.cda inside Chapter 5 to open a CDA data browser. The listbox contains the data source names that we created before; choose DataAccess ID: line_trend_data to preview its data. It will show a table with three columns (stock_symbol, period, and stock_price_close) and one parameter, stock_param_data, with a default value, ALLSTOCKS.
5. Explore all the other data sources to gain a better understanding for the next examples.

Visualizing data using CTools

After we prepare the Pentaho Data Integration transformation as a data source, let us move on to developing data visualizations using CTools.

Visualizing trends using a line chart

The following steps will help you create a line chart using a PDI data source:

1. In the PUC Browser Panel window, right-click on NYSE Stock Price - Hive and choose Edit; the CDE editor appears.
2. In the menu bar, click on the Layout menu. Explore the layout of this dashboard. Its structure can be represented by the following diagram.
3. Using the same procedure to create a line chart component, type in the values for the following line chart properties:
   - Name: ccc_line_chart
   - Title: Average Close Price Trend
   - Datasource: line_trend_data
   - Height: 300
   - HtmlObject: Panel_1
   - seriesInRows: False
4. Click on Save from the menu and, in the Browser Panel window, double-click on the NYSE Stock Price - Hive menu to open the dashboard page.

Interactivity using a parameter

The following steps will help you create a stock parameter and link it to the chart component and data source:

1. Open the CDE editor again and click on the Components menu.
2. In the left-hand side panel, click on Generic and choose the Simple Parameter component. A parameter component is now added to the components group. Click on it and type stock_param in the Name property.
3. In the left-hand side panel, click on Selects and choose the Select Component component. Type in the values for the following properties:
   - Name: select_stock
   - Parameter: stock_param
   - HtmlObject: Filter_Panel_1
   - Values array: ["ALLSTOCKS","ALLSTOCKS"], ["ARM","ARM"], ["BBX","BBX"], ["DAI","DAI"], ["ROS","ROS"]
   To insert values in the Values array textbox, you need to create several value pairs. To add a new pair, click on the textbox and a dialog will appear. Then click on the Add button to create a new pair of Arg and Value textboxes and type in the values as stated in this step. The dialog entries will look like the following screenshot.
4. On the same editor page, select ccc_line_chart and click on the Parameters property. A parameter dialog appears; click on the Add button to create the first index of a parameter pair. Type stock_param_data and stock_param in the Arg and Value textboxes, respectively. This will link the global stock_param parameter with the data source's stock_param_data parameter, which we specified in the previous walkthroughs.
5. While still on ccc_line_chart, click on Listeners. In the listbox, choose stock_param and click on the OK button to accept it. This configuration will reload the chart whenever the value of the stock_param parameter changes.
6. Open the NYSE Stock Price - Hive dashboard page again. Now, you have a filter that interacts well with the line chart data, as shown in the following screenshot.

Summary

In this article we learned about preparing a data source, visualizing data using CTools, and creating an interactive analytical dashboard that consumes data from Hive.


Learning Data Analytics with R and Hadoop

Packt
20 Nov 2013
6 min read
Understanding the data analytics project life cycle

When dealing with data analytics projects, there are some fixed tasks that should be followed to get the expected output. So here we are going to build a data analytics project life cycle, which will be a set of standard data-driven processes to lead from data to insights effectively. The defined data analytics processes of a project life cycle should be followed in sequence to achieve the goal effectively using the input datasets. This data analytics process may include identifying the data analytics problem, designing and collecting datasets, performing the data analytics, and visualizing the data. The data analytics project life cycle stages are shown in the following diagram. Let's get some perspective on these stages for performing data analytics.

Identifying the problem

Today, business analytics trends change as data analytics is performed over web datasets to grow business. Since data sizes increase gradually day by day, analytical applications need to be scalable for collecting insights from these datasets. With the help of web analytics, we can solve business analytics problems. Let's assume that we have a large e-commerce website and we want to know how to increase the business. We can identify the important pages of our website by categorizing them by popularity into high, medium, and low. Based on these popular pages, their types, their traffic sources, and their content, we will be able to decide the roadmap to improve business by improving web traffic as well as content.

Designing data requirements

To perform data analytics for a specific problem, we need datasets from related domains. Based on the domain and problem specification, the data source can be decided, and based on the problem definition, the data attributes of these datasets can be defined. For example, if we are going to perform social media analytics (problem specification), we use Facebook or Twitter as the data source. For identifying the user characteristics, we need user profile information, likes, and posts as data attributes.

Preprocessing data

In data analytics, we do not use the same data sources, data attributes, data tools, and algorithms all the time, as they do not all use data in the same format. This leads to data operations such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, to provide the data in a supported format to all the data tools and algorithms that will be used in the analytics. In simple terms, preprocessing is used to perform the data operations needed to translate data into a fixed data format before providing the data to algorithms or tools. The data analytics process will then be initiated with this formatted data as the input. In the case of Big Data, the datasets need to be formatted and uploaded to the Hadoop Distributed File System (HDFS) and used further by various nodes with Mappers and Reducers in Hadoop clusters.

Performing analytics over data

After the data is available in the required format for the data analytics algorithms, the data analytics operations will be performed. These operations are performed to discover meaningful information from the data in order to make better business decisions, using data mining concepts. They may use either descriptive or predictive analytics for business intelligence. Analytics can be performed with various machine learning as well as custom algorithmic concepts, such as regression, classification, clustering, and model-based recommendation. For Big Data, the same algorithms can be translated to MapReduce algorithms for running them on Hadoop clusters by translating their data analytics logic to MapReduce jobs. These models need to be further evaluated as well as improved by the various evaluation stages of machine learning. Improved or optimized algorithms can provide better insights.

Visualizing data

Data visualization is used for displaying the output of data analytics. Visualization is an interactive way to represent data insights. This can be done with various data visualization software packages as well as R packages. R has a variety of packages for the visualization of datasets, including:

- ggplot2: An implementation of the Grammar of Graphics by Dr. Hadley Wickham (http://had.co.nz/). For more information, refer to http://cran.r-project.org/web/packages/ggplot2/.
- rCharts: An R package to create, customize, and publish interactive JavaScript visualizations from R, using a familiar lattice-style plotting interface, by Markus Gesmann and Diego de Castillo. For more information, refer to http://ramnathv.github.io/rCharts/.

Some popular examples of visualization with R are as follows:

- Plots for facet scales (ggplot): The following figure shows the comparison of males and females on different measures, namely education, income, life expectancy, and literacy, using ggplot.
- Dashboard charts: This is an rCharts type. Using this, we can build interactive animated dashboards with R.

Understanding data analytics problems

In this section, we have included three practical data analytics problems, with various stages of data-driven activity using R and Hadoop technologies. These data analytics problem definitions are designed so that readers can understand how Big Data analytics can be done with the analytical power of the functions and packages of R, and with the computational power of Hadoop. The data analytics problem definitions are as follows:

- Exploring the categorization of web pages
- Computing the frequency of changes in the stock market
- Predicting the sale price of a blue book for bulldozers (case study)

Exploring web page categorization

This data analytics problem is designed to identify the category of a web page of a website, which may be categorized by popularity as high, medium, or low (regular), based on the visit count of the pages. While designing the data requirement stage of the data analytics life cycle, we will see how to collect these types of data from Google Analytics.

Identifying the problem

As this is a web analytics problem, the goal is to identify the importance of the web pages designed for a website. Based on this information, the content, design, or visits of the less popular pages can be improved or increased.

Designing data requirements

In this section, we will be working on the data requirements as well as data collection for this data analytics problem. First, let's see how the data requirement can be met for this problem. Since this is a web analytics problem, we will use Google Analytics as the data source. To retrieve this data from Google Analytics, we need an existing Google Analytics account with web traffic data stored in it. To calculate the popularity, we will require the visit information for all of the web pages.
Also, there are many other attributes available in Google Analytics with respect to dimensions and metrics.
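As a small, self-contained illustration of the ggplot2 faceting mentioned in the Visualizing data section above, the following sketch builds a made-up data frame (the values are invented purely for the example) and draws one panel per measure, each with its own scale:

    library(ggplot2)

    # Hypothetical data: one value per gender for each of four measures
    df <- data.frame(
      gender  = rep(c("Male", "Female"), times = 4),
      measure = rep(c("education", "income", "life expectancy", "literacy"), each = 2),
      value   = c(8.1, 8.4, 52000, 48000, 76, 81, 0.95, 0.97)
    )

    # One panel per measure, each panel free to use its own y scale
    ggplot(df, aes(x = gender, y = value)) +
      geom_bar(stat = "identity") +
      facet_wrap(~ measure, scales = "free_y")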


Data Analytics

Packt
19 Nov 2013
4 min read
Introduction

Data analytics is the art of taking data and deriving information from it in order to make informed decisions. A large part of building and validating datasets for the decision-making process is data integration: the moving, cleansing, and transformation of data from the source to a target. This article will focus on some of the tools that take Kettle beyond its normal data processing capabilities and integrate processes into analytical tools.

Reading data from a SAS datafile

SAS is one of the leading analytics suites, providing robust commercial tools for decision making in many different fields. Kettle can read files written in SAS' specialized data format, known as sas7bdat, using a new (since Version 4.3) input step called SAS Input. While SAS does support other format types (such as CSV and Excel), sas7bdat is a format most similar to other analytics packages' special formats (such as Weka's ARFF file format). This recipe will show you how to do it.

Why read a SAS file?

There are two main reasons for wanting to read a SAS file as part of a Kettle process. The first is that a dataset created by a SAS program is already in place, but the output of this process is used elsewhere in other Business Intelligence solutions (for instance, using the output for integration into reports, visualizations, or other analytic tools). The second is when there is already a standard library of business logic and rules built in Kettle that the dataset needs to run through before it can be used.

Getting ready

To be able to use the SAS Input step, a sas7bdat file will be required. The Centers for Disease Control and Prevention have some sample datasets as part of the NHANES Dietary dataset. Their tutorial datasets can be found on their website at http://www.cdc.gov/nchs/tutorials/dietary/downloads/downloads.htm. We will be using the calcmilk.sas7bdat dataset for this recipe.

How to do it...

Perform the following steps to read in the calcmilk.sas7bdat dataset:

1. Open Spoon and create a new transformation.
2. From the input folder of the Design palette, bring over a Get File Names step.
3. Open the Get File Names step. Click on the Browse button, find the calcmilk.sas7bdat file downloaded for the recipe, and click on OK.
4. From the input folder of the Design palette, bring over a SAS Input step.
5. Create a hop from the Get File Names step to the SAS Input step.
6. Open the SAS Input step. For the Field in the input to use as filename field, select the Filename field from the dropdown.
7. Click on Get Fields. Select the calcmilk.sas7bdat file and click on OK. If you are using Version 4.4 of Kettle, you will receive a java.lang.NoClassDefFoundError message. There is a workaround, which can be found on the Pentaho wiki at http://wiki.pentaho.com/display/EAI/SAS+Input.
8. To clean up the stream and only have the calcmilk data, add a Select Values step and add a hop from the SAS Input step to the Select Values step.
9. Open the Select Values step and switch to the Remove tab. Select the fields generated from the Get File Names step (filename, short_filename, path, and so on). Click on OK to close the step.
10. Preview the Select Values step. The data from the SAS Input step should appear in a data grid, as shown in the following screenshot.

How it works...

The SAS Input step takes advantage of Kasper Sørensen's Sassy Reader project (http://sassyreader.eobjects.org). Sassy is a Java library used to read datasets in the sas7bdat format and is derived from the R package created by Matt Shotwell (https://github.com/BioStatMatt/sas7bdat). Before those projects, it was not possible to read the proprietary file format outside of SAS' own tools. The SAS Input step requires the processed filenames to be provided by another step (such as the Get File Names step). Also of note: while the sas7bdat format only has two data types (strings and numbers), PDI is able to convert fields to any of the built-in formats (dates, integers, and so on).
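As an aside, since the Sassy Reader is derived from Matt Shotwell's R package, the same file can also be inspected directly in R before building the Kettle transformation. A minimal sketch, assuming the sas7bdat package is installed and the file sits in the working directory:

    library(sas7bdat)

    # Read the SAS file into a regular R data frame
    calcmilk <- read.sas7bdat("calcmilk.sas7bdat")

    # Inspect the column names and types that the SAS Input step will also see
    str(calcmilk)
    head(calcmilk)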


Iteration and Searching Keys

Packt
13 Nov 2013
7 min read
Introducing Sample04 to show you loops and searches

Sample04 uses the same LevelDbHelpers.h as before. Please download the entire sample and look at main04.cpp to see the code in context. Running Sample04 starts by printing the output from the entire database, as shown in the following screenshot (console output of listing keys).

Creating test records with a loop

The test data being used here was created with a simple loop and forms a linked list as well. It is explained in more detail in the Simple Relational Style section. The loop creating the test data uses the new C++11 range-based for style of loop:

    vector<string> words {"Packt", "Packer", "Unpack", "packing", "Packt2", "Alpacca"};
    string prevKey;
    WriteOptions syncW;
    syncW.sync = true;
    WriteBatch wb;
    for (auto key : words) {
        wb.Put(key, prevKey + "\tsome more content");
        prevKey = key;
    }
    assert(db->Write(syncW, &wb).ok());

Note how we're using a string to hang onto prevKey. There may be a temptation to use a Slice here to refer to the previous value of key, but remember the warning about a Slice only having a data pointer. This would be a classic bug introduced with a Slice pointing to a value that can be changed underneath it! We're adding all the keys using a WriteBatch not just for consistency, but also so that the storage engine knows it's getting a bunch of updates in one go and can optimize the file writing.

I will be using the term Record regularly from now on. It's easier to say than Key-value Pair and is also indicative of the richer, multi-value data we're storing.

Stepping through all the records with iterators

The model for multiple-record reading in LevelDB is simple iteration: find a starting point and then step forwards or backwards. This is done with an Iterator object that manages the order and starting point of your stepping through keys and values. You call methods on the Iterator to choose where to start, to step, and to get back the key and value. Each Iterator gets a consistent snapshot of the database, ignoring updates during iteration. Create a new Iterator to see changes.

If you have used declarative database APIs such as SQL-based databases, you will be used to performing a query and then operating on the results. Many of these APIs, and older, record-oriented databases, have the concept of a cursor that maintains the current position in the results and can only move forward. Some of them allow you to move the cursor back to previous records. Iterating through individual records may seem clunky and old-fashioned if you are used to getting collections from servers. However, remember that LevelDB is a local database; each step doesn't represent a network operation! The iterable cursor approach is all that LevelDB offers, and it is called an Iterator. If you want some way of mapping a collected set of results directly to a listbox or other container, you will have to implement it on top of the Iterator, as we will see later.

Iterating forwards, we just get an Iterator from our database and jump to the first record with SeekToFirst():

    Iterator* idb = db->NewIterator(ropt);
    for (idb->SeekToFirst(); idb->Valid(); idb->Next())
        cout << idb->key() << endl;

Going backwards is very similar, but inherently less efficient as a storage trade-off:

    for (idb->SeekToLast(); idb->Valid(); idb->Prev())
        cout << idb->key() << endl;

If you wanted to see the values as well as the keys, just use the value() method on the iterator (the test data in Sample04 would make it look a bit confusing, so it isn't being done here):

    cout << idb->key() << " " << idb->value() << endl;

Unlike some other programming iterators, there's no concept of a special forward or backward iterator and no obligation to keep going in the same direction. Consider searching an HR database for the ten highest-paid managers. With a key of Job+Salary, you would iterate through a range until you know you have hit the end of the managers, then iterate backwards to get the last ten.

An iterator is created by NewIterator(), so you have to remember to delete it or it will leak memory. Iteration is over a consistent snapshot of the data, and any data changes through Put or Delete operations won't show until another iterator is created with NewIterator().

Searching for ranges of keys

The second half of the console output is from our examples of iterating through partial keys, which are case-sensitive by default with the default BytewiseComparator (console output of searches).

As we've seen many times, the Get function looks for an exact match for a key. However, if you have an Iterator, you can use Seek and it will jump to the first key that either matches exactly or is immediately after the partial key you specify. If we are just looking for keys with a common prefix, the optimal comparison is using the starts_with method of the Slice class:

    void listKeysStarting(Iterator* idb, const Slice& prefix) {
        cout << "List all keys starting with " << prefix.ToString() << endl;
        for (idb->Seek(prefix); idb->Valid() && idb->key().starts_with(prefix); idb->Next())
            cout << idb->key() << endl;
    }

Going backwards is a little bit more complicated. We use a key that is guaranteed to fail. You could think of it as being between the last key starting with our prefix and the next key out of the desired range. When we Seek to that key, we need to step back once to the previous key. If that's valid and matching, it's the last key in our range:

    void listBackwardsKeysStarting(Iterator* idb, const Slice& prefix) {
        cout << "List all keys starting with " << prefix.ToString() << " backwards" << endl;
        const string keyAfter = prefix.ToString() + "\xFF";
        idb->Seek(keyAfter);
        if (idb->Valid())
            idb->Prev();        // step to last key with actual prefix
        else                    // no key just after our range, but
            idb->SeekToLast();  // maybe the last key matches?
        for (; idb->Valid() && idb->key().starts_with(prefix); idb->Prev())
            cout << idb->key() << endl;
    }

What if you want to get keys within a range? For the first time, I disagree with the documentation included with LevelDB. Their iteration example shows a loop similar to the one in the following code, but checks the key values with idb->key().ToString() < limit. That is a more expensive way to iterate keys, as it generates a temporary string object for every key being checked, which adds up if there are thousands of keys in the range:

    void listKeysBetween(Iterator* idb, const Slice& startKey, const Slice& endKey) {
        cout << "List all keys >= " << startKey.ToString() << " and < " << endKey.ToString() << endl;
        for (idb->Seek(startKey); idb->Valid() && idb->key().compare(endKey) < 0; idb->Next())
            cout << idb->key() << endl;
    }

Instead, we can use another built-in method of Slice: the compare() method, which returns a result < 0, 0, or > 0 to indicate whether the Slice is less than, equal to, or greater than the other Slice it is being compared to. This has the same semantics as the standard C memcmp. The code shown in the previous snippet will find keys that are the same as, or after, the startKey and are before the endKey. If you want the range to include the endKey, change the comparison to compare(endKey) <= 0.

Summary

In this article, we learned the concept of an iterator in LevelDB as a way to step through records sorted by their keys. The database became far more useful with searches to get the starting point for the iterator, and samples showing how to efficiently check keys as you step through a range.


Gathering all rejects prior to killing a job

Packt
13 Nov 2013
3 min read
Getting ready

Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached and the output is being sent to a temporary store (tHashMap).

How to do it...

1. Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.
2. Add an onSubjobOk link from the tFileInputDelimited component to tJava.
3. Add a flow from the tHashInput component to the tFileOutputDelimited component.
4. Right-click the tJava component, select Trigger and then Runif, and link the trigger to the tDie component. Click the if link and add the following code:
   ((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0
5. Right-click the tJava component, select Trigger and then Runif, and link this trigger to the tHashInput component. Click the if link and add the following code:
   ((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) == 0
6. The job should now look like the following screenshot.
7. Drag the generic schema sc_cook_ch3_0010_genericCustomer to both the tHashInput and tFileOutputDelimited components.
8. Run the job. You should see that the tDie component is activated, because the file contained two errors.

How it works...

What we have done in this exercise is create a validation stage prior to processing the data. Valid rows are held in temporary storage (tHashOutput) and invalid rows are written to a reject file until all input rows are processed. The job then checks how many records were rejected (using the RunIf link). In this instance, there are invalid rows, so the RunIf link is triggered and the job is killed using tDie.

By ensuring that the data is correct before we start to process it into a target, we know that the data will be fit for writing to the target, thus avoiding the need for rollback procedures. The records captured can then be sent to the support team, who will then have a record of all incorrect rows. These rows can be fixed in situ within the source file and the job simply re-run from the beginning.

There's more...

This technique is particularly important when rollback/correction of a job may be complex, or where there may be a higher-than-expected number of errors in an input. An example would be when there are multiple executions of a job that appends to a target file. If the job fails midway through, then rolling back involves identifying which records were appended to the file by the job before failure, removing them from the file, fixing the offending record, and then re-running. This runs the risk of a second error causing the same thing to happen again. On the other hand, if the job does not die but a subsection of the data is rejected, then the rejects must be manipulated into the target file via a second manual execution of the job. So, this method enables us to be certain that our records will not fail to write due to incorrect data, and therefore saves our target from becoming corrupted.

Summary

This article has shown how rejects are gathered before killing a job, and how rejected rows can then be corrected and fed back into the target.

Wrapping OpenCV

Packt
11 Nov 2013
2 min read
Architecture overview

In this section, we will examine and compare the architectures of OpenCV and Emgu CV.

OpenCV

In the hello-world project, we already saw that our code had something to do with the bin folder of the Emgu library that we installed. Those files are the OpenCV DLLs, which have filenames starting with opencv_. So Emgu CV users need to have some basic knowledge about OpenCV.

OpenCV is broadly structured into five main components. Four of them are described in the following list:

- CV: This component includes the algorithms for computer vision and basic image processing. All the methods for basic processing are found here.
- ML: Short for Machine Learning, this contains popular machine learning algorithms along with clustering tools and statistical classifiers.
- HighGUI: This is designed to construct user-friendly interfaces to load and store media data.
- CXCore: This is the most important one. It provides all the basic data structures and contents.

The components can be seen in the following diagram.

The preceding structure map does not include CvAux, which contains many areas. It can be divided into two parts: defunct areas and experimental algorithms. CvAux is not particularly well documented in the Wiki, but it covers many features. Some of them may migrate to CV in the future; others probably never will.

Emgu CV

Emgu CV can be seen as two layers on top of OpenCV, which are explained as follows:

- Layer 1 is the basic layer. It includes enumeration, structure, and function mappings. The namespaces are direct wrappers of the OpenCV components.
- Layer 2 is the upper layer. It takes good advantage of the .NET framework and mixes the classes together. It can be seen as the bridge from OpenCV to .NET.

The architecture of Emgu CV can be seen in the following diagram, which includes more details.

After we create our new Emgu CV project, the first thing we will do is add references. Now we can see what those DLLs are used for:

- Emgu.Util.dll: A collection of .NET utilities
- Emgu.CV.dll: Basic image-processing algorithms from OpenCV
- Emgu.CV.UI.dll: Useful tools for Emgu controls
- Emgu.CV.GPU.dll: GPU processing (Nvidia CUDA)
- Emgu.CV.ML.dll: Machine learning algorithms


Specialized Machine Learning Topics

Packt
31 Oct 2013
20 min read
As you attempted to gather data, you might have realized that the information was trapped in a proprietary spreadsheet format or spread across pages on the Web. Making matters worse, after spending hours manually reformatting the data, perhaps your computer slowed to a crawl after running out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred; it does get easier with time.

You might find the information particularly useful if you tend to work with data that are:

- Stored in unstructured or proprietary formats such as web pages, web APIs, or spreadsheets
- From a domain such as bioinformatics or social network analysis, which presents additional challenges
- So extremely large that R cannot store the dataset in memory, or machine learning takes a very long time to complete

You're not alone if you suffer from any of these problems. Although there is no panacea (these issues are the bane of the data scientist, as well as the reason data skills are in high demand), the dedicated efforts of the R community have produced a number of packages that provide a head start toward solving them. This article provides a cookbook of such solutions. Even if you are an experienced R veteran, you may discover a package that simplifies your workflow, or perhaps one day you will author a package that makes work easier for everybody else!

Working with specialized data

Unlike the analyses in this article, real-world data are rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging. Munging has become even more important as the size of typical datasets has grown from megabytes to gigabytes and data are gathered from unrelated and messy sources, many of which are domain-specific. Several packages and resources for working with specialized or domain-specific data are listed in the following sections.

Getting data from the Web with the RCurl package

The RCurl package by Duncan Temple Lang provides an R interface to the curl (client for URLs) utility, a command-line tool for transferring data over networks. The curl utility is useful for web scraping, which refers to the practice of harvesting data from websites and transforming it into a structured form. Documentation for the RCurl package can be found on the Web at http://www.omegahat.org/RCurl/. After installing the RCurl package, downloading a page is as simple as typing:

    > library(RCurl)
    > webpage <- getURL("http://www.packtpub.com/")

This will save the full text of the Packt Publishing homepage (including all web markup) into the R character object named webpage. As shown in the following lines, this is not very useful as-is:

    > str(webpage)
     chr "<!DOCTYPE html>\n<html ..."

Reading and writing XML with the XML package

The XML package provides functionality for reading and writing XML documents in R. More information on the XML package, including simple examples to get you started quickly, can be found at the project's website: http://www.omegahat.org/RSXML/.

Reading and writing JSON with the rjson package

The rjson package by Alex Couture-Beil can be used to read and write files in the JavaScript Object Notation (JSON) format. JSON is a standard, plaintext format, most often used for data structures and objects on the Web. The format has become popular recently due to its utility in creating web applications, but despite the name, it is not limited to web browsers. For details about the JSON format, go to http://www.json.org/.

The JSON format stores objects in plain text strings. After installing the rjson package, to convert from JSON to R:

    > library(rjson)
    > r_object <- fromJSON(json_string)

To convert from an R object to a JSON object:

    > json_string <- toJSON(r_object)

Used with the RCurl package (noted previously), it is possible to write R programs that utilize JSON data directly from many online data stores.

Reading and writing Microsoft Excel spreadsheets using xlsx

The xlsx package by Adrian A. Dragulescu offers functions to read and write to spreadsheets in the Excel 2007 (or earlier) format, a common task in many business environments. The package is based on the Apache POI Java API for working with Microsoft's documents. For more information on xlsx, including a quick start document, go to https://code.google.com/p/rexcel/.

Working with bioinformatics data

Data analysis in the field of bioinformatics offers a number of challenges relative to other fields due to the unique nature of genetic data. The use of DNA and protein microarrays has resulted in datasets that are often much wider than they are long (that is, they have more features than examples). This creates problems when attempting to apply conventional visualizations, statistical tests, and machine learning methods to such data. A CRAN task view for statistical genetics/bioinformatics is available at http://cran.r-project.org/web/views/Genetics.html.

The Bioconductor project (http://www.bioconductor.org/) of the Fred Hutchinson Cancer Research Center in Seattle, Washington, provides a centralized hub for methods of analyzing genomic data. Using R as its foundation, Bioconductor adds packages and documentation specific to the field of bioinformatics. Bioconductor provides workflows for analyzing microarray data from common platforms, including Affymetrix, Illumina, Nimblegen, and Agilent. Additional functionality includes sequence annotation, multiple testing procedures, specialized visualizations, and many other functions.

Working with social network data and graph data

Social network data and graph data present many challenges. These data record connections, or links, between people or objects. With N people, an N by N matrix of links is possible, which creates tremendous complexity as the number of people grows. The network is then analyzed using statistical measures and visualizations to search for meaningful patterns of relationships. The network package by Carter T. Butts, David Hunter, and Mark S. Handcock offers a specialized data structure for working with such networks. A closely related package, sna, allows analysis and visualization of the network objects. For more information on network and sna, refer to the project website hosted by the University of Washington: http://www.statnet.org/.

Improving the performance of R

R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can push the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the data have many features or if complex learning algorithms are being used. CRAN has a high performance computing task view that lists packages pushing the boundaries on what is possible in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html.
Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster, perhaps by spreading the work over additional computers or processors, by utilizing specialized computer hardware, or by providing machine learning optimized to Big Data problems. Some of these packages are listed as follows.

Managing very large datasets

Very large datasets can sometimes cause R to grind to a halt when the system runs out of memory to store the data. Even if the entire dataset can fit in memory, additional RAM is needed to read the data from disk, which necessitates a total memory size much larger than the dataset itself. Furthermore, very large datasets can take a long amount of time to process for no reason other than the sheer volume of records; even a quick operation can add up when performed many millions of times. Years ago, many would suggest performing data preparation of massive datasets outside R in another programming language, then using R to perform analyses on a smaller subset of data. However, this is no longer necessary, as several packages have been contributed to R to address these Big Data problems.

Making data frames faster with data.table

The data.table package by Dowle, Short, and Lianoglou provides an enhanced version of a data frame called a data table. The data.table objects are typically much faster than data frames for subsetting, joining, and grouping operations. Yet, because it is essentially an improved data frame, the resulting objects can still be used by any R function that accepts a data frame. The data.table project is found on the Web at http://datatable.r-forge.r-project.org/. One limitation of data.table structures is that, like data frames, they are limited by the available system memory. The next two sections discuss packages that overcome this shortcoming at the expense of breaking compatibility with many R functions.

Creating disk-based data frames with ff

The ff package by Daniel Adler, Christian Glaser, Oleg Nenadic, Jens Oehlschlagel, and Walter Zucchini provides an alternative to a data frame (ffdf) that allows datasets of over two billion rows to be created, even if this far exceeds the available system memory. The ffdf structure has a physical component that stores the data on disk in a highly efficient form and a virtual component that acts like a typical R data frame but transparently points to the data stored in the physical component. You can imagine the ffdf object as a map that points to a location of data on a disk. The ff project is on the Web at http://ff.r-forge.r-project.org/.

A downside of ffdf data structures is that they cannot be used natively by most R functions. Instead, the data must be processed in small chunks, and the results should be combined later on. The upside of chunking the data is that the task can be divided across several processors simultaneously using the parallel computing methods presented later in this article. The ffbase package by Edwin de Jonge, Jan Wijffels, and Jan van der Laan addresses this issue somewhat by adding capabilities for basic statistical analyses using ff objects. This makes it possible to use ff objects directly for data exploration. The ffbase project is hosted at http://github.com/edwindj/ffbase.
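Before moving on, here is a minimal, illustrative sketch of the data.table grouped-aggregation syntax mentioned above; the table and column names are invented for the example:

    library(data.table)

    # A toy table: one million sales values spread over four regions
    dt <- data.table(region = sample(c("N", "S", "E", "W"), 1e6, replace = TRUE),
                     sales  = rnorm(1e6, mean = 100, sd = 15))

    setkey(dt, region)              # index the grouping column

    # Grouped aggregation: mean sales and row count per region
    dt[, .(mean_sales = mean(sales), n = .N), by = region]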
Using massive matrices with bigmemory

The bigmemory package by Michael J. Kane and John W. Emerson allows extremely large matrices that exceed the amount of available system memory. The matrices can be stored on disk or in shared memory, allowing them to be used by other processes on the same computer or across a network. This facilitates parallel computing methods, such as those covered later in this article. Additional documentation on the bigmemory package can be found at http://www.bigmemory.org/.

Because bigmemory matrices are intentionally unlike data frames, they cannot be used directly with most of the machine learning methods covered in this book. They also can only be used with numeric data. That said, since they are similar to a typical R matrix, it is easy to create smaller samples or chunks that can be converted to standard R data structures. The authors also provide the bigalgebra, biganalytics, and bigtabulate packages, which allow simple analyses to be performed on the matrices. Of particular note is the bigkmeans() function in the biganalytics package, which performs k-means clustering.

Learning faster with parallel computing

In the early days of computing, programs were entirely serial, which limited them to performing a single task at a time. The next instruction could not be performed until the previous instruction was complete. However, many tasks can be completed more efficiently by allowing work to be performed simultaneously. This need was addressed by the development of parallel computing methods, which use a set of two or more processors or computers to solve a larger problem.

Many modern computers are designed for parallel computing. Even in the case that they have a single processor, they often have two or more cores which are capable of working in parallel. This allows tasks to be accomplished independently from one another. Networks of multiple computers called clusters can also be used for parallel computing. A large cluster may include a variety of hardware and be separated over large distances. In this case, the cluster is known as a grid. Taken to an extreme, a cluster or grid of hundreds or thousands of computers running commodity hardware could be a very powerful system.

The catch, however, is that not every problem can be parallelized; certain problems are more conducive to parallel execution than others. You might expect that adding 100 processors would result in 100 times the work being accomplished in the same amount of time (that is, the execution time is 1/100), but this is typically not the case. The reason is that it takes effort to manage the workers; the work first must be divided into non-overlapping tasks and second, each of the workers' results must be combined into one final answer.

So-called embarrassingly parallel problems are the ideal. These tasks are easy to reduce into non-overlapping blocks of work, and the results are easy to recombine. An example of an embarrassingly parallel machine learning task would be 10-fold cross-validation; once the samples are decided, each of the 10 evaluations is independent, meaning that its result does not affect the others. As you will soon see, this task can be sped up quite dramatically using parallel computing.

Measuring execution time

Efforts to speed up R will be wasted if it is not possible to systematically measure how much time was saved. Although you could sit and observe a clock, an easier solution is to wrap the offending code in a system.time() function.
For example, on the author's laptop, the system.time() function notes that it takes about 0.13 seconds to generate a million random numbers:

    > system.time(rnorm(1000000))
       user  system elapsed
       0.13    0.00    0.13

The same function can be used to evaluate the improvement in performance obtained with the methods just described, or with any R function.

Working in parallel with foreach

The foreach package by Steve Weston of Revolution Analytics provides perhaps the easiest way to get started with parallel computing, particularly if you are running R on the Windows operating system, as some of the other packages are platform-specific. The core of the package is a new foreach looping construct. If you have worked with other programming languages, this may be familiar. Essentially, it allows looping over a number of items in a set, without explicitly counting the number of items; in other words, for each item in the set, do something.

In addition to the foreach package, Revolution Analytics has developed high-performance, enterprise-ready R builds. Free versions are available for trial and academic use. For more information, see their website at http://www.revolutionanalytics.com/.

If you're thinking that R already provides a set of apply functions to loop over sets of items (for example, apply(), lapply(), sapply(), and so on), you are correct. However, the foreach loop has an additional benefit: iterations of the loop can be completed in parallel using a very simple syntax. The sister package doParallel provides a parallel backend for foreach that utilizes the parallel package included with R (Version 2.14.0 and later). The parallel package includes components of the multicore and snow packages described in the following sections.

Using a multitasking operating system with multicore

The multicore package by Simon Urbanek allows parallel processing on single machines that have multiple processors or processor cores. Because it utilizes the multitasking capabilities of the operating system, it is not supported natively on Windows systems. An easy way to get started with the multicore package is using the mclapply() function, which is a parallelized version of lapply(). The multicore project is hosted at http://www.rforge.net/multicore/.

Networking multiple workstations with snow and snowfall

The snow package (simple network of workstations) by Luke Tierney, A. J. Rossini, Na Li, and H. Sevcikova allows parallel computing on multicore or multiprocessor machines as well as on a network of multiple machines. The snowfall package by Jochen Knaus provides an easier-to-use interface for snow. For more information on snow, including a detailed FAQ and information on how to configure parallel computing over a network, see http://www.imbi.uni-freiburg.de/parallel/.

Parallel cloud computing with MapReduce and Hadoop

The MapReduce programming model was developed at Google as a way to process their data on a large cluster of networked computers. MapReduce defined parallel programming as a two-step process:

- A map step, in which a problem is divided into smaller tasks that are distributed across the computers in the cluster
- A reduce step, in which the results of the small chunks of work are collected and synthesized into a final solution to the original problem

A popular open source alternative to the proprietary MapReduce framework is Apache Hadoop. The Hadoop software comprises the MapReduce concept plus a distributed filesystem capable of storing large amounts of data across a cluster of computers.
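As a toy illustration of the two-step model just described, the same idea can be expressed in plain, single-machine R with no Hadoop involved; the point is only to show how a problem splits into independent map tasks whose partial results are then reduced:

    # "Map" step: split the work into independent chunks and summarize each one
    chunks  <- split(1:1000000, rep(1:4, length.out = 1000000))
    partial <- lapply(chunks, function(x) sum(as.numeric(x)))

    # "Reduce" step: combine the partial results into the final answer
    total <- Reduce(`+`, partial)
    total == sum(as.numeric(1:1000000))   # TRUE

Because each chunk is summed independently, the lapply() call could be swapped for a parallelized equivalent such as mclapply() from the multicore package described earlier.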
Parallel cloud computing with MapReduce and Hadoop

The MapReduce programming model was developed at Google as a way to process their data on a large cluster of networked computers. MapReduce defined parallel programming as a two-step process:

A map step, in which a problem is divided into smaller tasks that are distributed across the computers in the cluster
A reduce step, in which the results of the small chunks of work are collected and synthesized into a final solution to the original problem

A popular open source alternative to the proprietary MapReduce framework is Apache Hadoop. The Hadoop software comprises the MapReduce concept plus a distributed filesystem capable of storing large amounts of data across a cluster of computers. Packt Publishing has published quite a number of books on Hadoop. To view the list of books on this topic, refer to Hadoop titles from Packt. Several R projects that provide an R interface to Hadoop are in development. One such project is RHIPE by Saptarshi Guha, which attempts to bring the divide and recombine philosophy into R by managing the communication between R and Hadoop. The RHIPE package is not yet available at CRAN, but it can be built from the source available on the Web at http://www.datadr.org. The RHadoop project by Revolution Analytics provides an R interface to Hadoop. The project provides a package, rmr, intended to be an easy way for R developers to write MapReduce programs. Additional RHadoop packages provide R functions for accessing Hadoop's distributed data stores. At the time of publication, development of RHadoop is progressing very rapidly. For more information about the project, see https://github.com/RevolutionAnalytics/RHadoop/wiki.

GPU computing

An alternative to parallel processing uses a computer's graphics processing unit (GPU) to increase the speed of mathematical calculations. A GPU is a specialized processor that is optimized for rapidly displaying images on a computer screen. Because a computer often needs to display complex 3D graphics (particularly for video games), many GPUs use hardware designed for parallel processing and extremely efficient matrix and vector calculations. A side benefit is that they can be used for efficiently solving certain types of mathematical problems. Where a computer processor may have on the order of 16 cores, a GPU may have thousands. The downside of GPU computing is that it requires specific hardware that is not included with many computers. In most cases, a GPU from the manufacturer Nvidia is required, as they provide a proprietary framework called CUDA (Compute Unified Device Architecture) that makes the GPU programmable using common languages such as C++. For more information on Nvidia's role in GPU computing, go to http://www.nvidia.com/object/what-is-gpu-computing.html. The gputools package by Josh Buckner, Mark Seligman, and Justin Wilson implements several R functions, such as matrix operations, clustering, and regression modeling, using the Nvidia CUDA toolkit. The package requires a CUDA 1.3 or higher GPU and the installation of the Nvidia CUDA toolkit.

Deploying optimized learning algorithms

Some of the machine learning algorithms covered in this book are able to work on extremely large datasets with relatively minor modifications. For instance, it would be fairly straightforward to implement naive Bayes or the Apriori algorithm using one of the Big Data packages described previously. Some types of models, such as ensembles, lend themselves well to parallelization, since the work of each model can be distributed across processors or computers in a cluster. On the other hand, some algorithms require larger changes to the data or algorithm, or need to be rethought altogether, before they can be used with massive datasets.

Building bigger regression models with biglm

The biglm package by Thomas Lumley provides functions for training regression models on datasets that may be too large to fit into memory. It works by an iterative process in which the model is updated little by little using small chunks of data. The results will be nearly identical to what would have been obtained by running the conventional lm() function on the entire dataset. The biglm() function allows the use of a SQL database in place of a data frame, and the model can also be trained with chunks obtained from data objects created by the ff package described previously.
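A minimal, hedged sketch of this chunked workflow follows; the formula, the column names, and the chunk1, chunk2, and chunk3 data frames are invented purely for illustration:

library(biglm)

# chunk1, chunk2, and chunk3 are hypothetical data frames holding
# successive blocks of rows with columns price, sqft, and bedrooms
fit <- biglm(price ~ sqft + bedrooms, data = chunk1)

# fold in further chunks as they are read from disk or a database
fit <- update(fit, chunk2)
fit <- update(fit, chunk3)

summary(fit)

Only a compact summary of the data seen so far is kept in memory, which is what allows the full dataset to remain on disk or in a database.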
Growing bigger and faster random forests with bigrf

The bigrf package by Aloysius Lim implements the training of random forests for classification and regression on datasets that are too large to fit into memory, using the bigmemory objects described earlier in this article. The package also allows faster parallel processing using the foreach package described previously. Trees can be grown in parallel (on a single computer or across multiple computers), as can forests, and additional trees can be added to the forest at any time or merged with other forests. For more information, including examples and Windows installation instructions, see the package wiki hosted at GitHub: https://github.com/aloysius-lim/bigrf.

Training and evaluating models in parallel with caret

The caret package by Max Kuhn will transparently utilize a parallel backend if one has been registered with R (for instance, using the foreach package described previously). Many of the tasks involved in training and evaluating models, such as creating random samples and repeatedly testing predictions for 10-fold cross-validation, are embarrassingly parallel, which makes caret particularly well suited to parallel execution. Configuration instructions and a case study of the performance improvements from enabling parallel processing in caret are available at the project's website: http://caret.r-forge.r-project.org/parallel.html.

Summary

It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of Big Data. And the burgeoning data science community is facilitated by the free and open source R programming language, which provides a very low barrier to entry: you simply need to be willing to learn. The topics you have learned provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the No Free Lunch theorem: no learning algorithm can rule them all. There will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task at hand. In the coming years, it will be interesting to see how the human side changes as the line between machine learning and human learning is blurred. Services such as Amazon's Mechanical Turk provide crowd-sourced intelligence, offering a cluster of human minds ready to perform simple tasks at a moment's notice. Perhaps one day, just as we have used computers to perform tasks that human beings cannot do easily, computers will employ human beings to do the reverse; food for thought.

Resources for Article: Further resources on this subject: First steps with R [Article] SciPy for Computational Geometry [Article] Generating Reports in Notebooks in RStudio [Article]

Highlights of Greenplum

Packt
30 Oct 2013
10 min read
(For more resources related to this topic, see here.)

Big Data analytics – platform requirements

Organizations are striving to become more data driven and to leverage data for competitive advantage. It is inevitable that any current business intelligence infrastructure needs to be upgraded to include Big Data technologies, and analytics needs to be embedded into every core business process. The following diagram depicts a matrix that connects requirements from low storage/cost to high storage/cost information management systems and analytics applications. The following section lists the capabilities that an integrated platform for Big Data analytics should have:

A data integration platform that can integrate data from any source, of any type, and of any volume. This includes efficient data extraction, data cleansing, transformation, and loading capabilities.
A data storage platform that can hold structured, unstructured, and semi-structured data, with the capability to slice and dice the data to any degree regardless of its format. In short, while we store data, we should be able to use the platform best suited to a given data format (for example, a relational store for structured data, a NoSQL store for semi-structured data, and a file store for unstructured data) and still be able to join data across platforms to run analytics.
Support for running standard analytics functions and standard analytical tools on data that has the characteristics described previously.
Modular and elastically scalable hardware that doesn't force changes to the architecture/design as the need to handle bigger data and more complex processing grows.
A centralized management and monitoring system.
A highly available and fault-tolerant platform that can seamlessly repair itself in the event of any hardware failure.
Support for advanced visualizations to communicate insights in an effective way.
A collaboration platform that can help end users perform the functions of loading, exploring, and visualizing data, and other workflow aspects, as an end-to-end process.

Core components

The following figure depicts the core software components of Greenplum UAP. In this section, we will take a brief look at what each component is and take a deep dive into their functions in the sections to follow.

Greenplum Database

Greenplum Database is a shared nothing, massively parallel processing solution built to support next generation data warehousing and Big Data analytics processing. It stores and analyzes voluminous structured data. It comes in a software-only version that works on commodity servers (this being its unique selling point) and is also available as an appliance (DCA) that can take advantage of large clusters of powerful servers, storage, and switches. GPDB (Greenplum Database) comes with a parallel query optimizer that uses a cost-based algorithm to evaluate and select optimal query plans. Its high-speed interconnect supports continuous pipelining for data processing. In its new distribution under Pivotal, Greenplum Database is called Pivotal (Greenplum) Database.

Shared nothing, massively parallel processing (MPP) systems, and elastic scalability

Traditionally, applications have been benchmarked for a certain level of performance, and the core hardware and its architecture determined their readiness for further scalability, which came at a cost, be it in terms of changes to the design or hardware augmentation.
With growing data volumes, scalability and total cost of ownership are becoming big challenges, and the need for elastic scalability has become paramount. This section compares shared disk, shared memory, and shared nothing data architectures and introduces the concept of massively parallel processing. The Greenplum Database and HD components implement a shared nothing data architecture with a master/worker paradigm, demonstrating massively parallel processing capabilities.

Shared disk data architecture

Have a look at the following figure, which gives an idea of shared disk data architecture. Shared disk data architecture refers to an architecture where a single data disk holds all the data and each node in the cluster accesses this data for processing. Any data operation can be performed by any node at a given point in time, and if two nodes attempt to persist/write a tuple at the same time, a disk-based lock or intent-lock communication is exchanged to ensure consistency, which affects performance. Furthermore, as the number of nodes increases, contention at the database level increases. These architectures are write-limited, as there is a need to handle the locks across the nodes in the cluster. Even in the case of reads, partitioning should be implemented effectively to avoid complete table scans.

Shared memory data architecture

Have a look at the following figure, which gives an idea of shared memory data architecture. In-memory data grids come under the shared memory data architecture category. In this architecture paradigm, data is held in memory that is accessible to all the nodes within the cluster. The major advantage of this architecture is that there is no disk I/O involved and data access is very quick. This advantage comes with the additional need to load the data into memory and keep it synchronized with the underlying data store. The memory layer seen in the figure can be distributed and local to the compute nodes, or it can exist as a separate data node.

Shared nothing data architecture

Though an old paradigm, shared nothing data architecture is gaining traction in the context of Big Data. Here, the data is distributed across the nodes in the cluster and every processor operates on the data local to itself. The location where the data resides is referred to as the data node, and where the processing logic resides is called the compute node. It can happen that both nodes, compute and data, are physically one. The nodes within the cluster are connected using high-speed interconnects. The following figure depicts two aspects of the architecture: the one on the left represents data and compute as decoupled processes, and the one on the right represents data and compute processes co-located. One of the most important aspects of shared nothing data architecture is the fact that there will not be any contention or locks that need to be addressed. Data is distributed across the nodes within the cluster using a distribution plan that is defined as a part of the schema definition. Additionally, for higher query efficiency, partitioning can be done at the node level. Any requirement for a distributed lock would bring in complexity, so an efficient distribution and partitioning strategy is a key success factor. Reads are usually the most efficient, relative to shared disk databases.
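As a hedged illustration of how such a distribution plan is declared in Greenplum, consider the following sketch; the table and column names are invented, and only the DISTRIBUTED clause is the point of interest:

-- rows are hashed on customer_id and spread evenly across the segments
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      numeric(12,2)
) DISTRIBUTED BY (customer_id);

-- when no suitable key exists, rows can be distributed round-robin instead
CREATE TABLE audit_log (
    event_time  timestamp,
    message     text
) DISTRIBUTED RANDOMLY;

Queries that join or group on the distribution key can largely be satisfied locally on each segment; joins on other columns trigger the temporary redistribution step described next.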
Again, the efficiency is determined by the distribution policy, if a query needs to join data across the nodes in the cluster, users would see a temporary redistribution step that would bring required data elements together into another node before the query result is returned. Shared nothing data architecture thus supports massive parallel processing capabilities. Some of the features of shared nothing data architecture are as follows: It can scale extremely well on general purpose systems It provides automatic parallelization in loading and querying any database It has optimized I/O and can scan and process nodes in parallel It supports linear scalability, also referred to as elastic scalability, by adding a new node to the cluster, additional storage, and processing capability, both in terms of load performance and query performance is gained The Greenplum high-availability architecture In addition to primary Greenplum system components, we can also optionally deploy redundant components for high availability and avoiding single point of failure. The following components need to be implemented for data redundancy: Mirror segment instances: A mirror segment always resides on a different host than its primary segment. Mirroring provides you with a replica of the database contained in a segment. This may be useful in the event of disk/hardware failure. The metadata regarding the replica is stored on the master server in system catalog tables. Standby master host: For a fully redundant Greenplum Database system, a mirror of the Greenplum master can be deployed. A backup Greenplum master host serves as a warm standby in cases when the primary master host becomes unavailable. The standby master host is synchronized periodically and kept up-to-date using transaction replication log process that runs on the standby master to keep the master host and standby in sync. In the event of master host failure the standby master is activated and constructed using the transaction logs. Dual interconnect switches: A highly available interconnect can be achieved by deploying redundant network interfaces on all Greenplum hosts and a dual Gigabit Ethernet. The default configuration is to have one network interface per primary segment instance on a segment host (both the interconnects are by default 10Gig in DCA). External tables External tables in Greenplum refer to those database tables that help Greenplum Database access data from a source that is outside of the database. We can have different external tables for different formats. Greenplum supports fast, parallel, as well as nonparallel data loading and unloading. The external tables act as an interfacing point to external data source and give an impression of a local data source to the accessing function. File-based data sources are supported by external tables. 
The following file formats can be loaded onto external tables:

Regular file-based source (supports Text, CSV, and XML data formats): file:// or gpfdist:// protocol
Web-based file source (supports Text, CSV, OS commands, and scripts): http:// protocol
Hadoop-based file source (supports Text and custom/user-defined formats): gphdfs:// protocol

Following is the syntax for the creation and deletion of readable and writable external tables. To create a read-only external table:

CREATE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>)
LOCATION (<<file paths>>) | EXECUTE '<<query>>'
FORMAT '<<format name, for example, 'TEXT'>>' (DELIMITER '<<the delimiter>>');

To create a writable external table:

CREATE WRITABLE EXTERNAL (WEB) TABLE <<table name>> (<<column definitions>>)
LOCATION (<<file paths>>) | EXECUTE '<<query>>'
FORMAT '<<format name, for example, 'TEXT'>>' (DELIMITER '<<the delimiter>>');

To drop an external table:

DROP EXTERNAL (WEB) TABLE <<table name>>;

Following are examples of using the file:// and gphdfs:// protocols:

CREATE EXTERNAL TABLE test_load_file (
    id int,
    name text,
    date date,
    description text
)
LOCATION (
    'file://filehost:6781/data/folder1/*',
    'file://filehost:6781/data/folder2/*',
    'file://filehost:6781/data/folder3/*.csv'
)
FORMAT 'CSV' (HEADER);

In the preceding example, data is loaded from three different file server locations; also, as you can see, the wildcard notation for each of the locations can be different. Now, in cases where the files are located on HDFS, the following notation needs to be used (in this example, the file is '|' delimited):

CREATE EXTERNAL TABLE test_load_file (
    id int,
    name text,
    date date,
    description text
)
LOCATION (
    'gphdfs://hdfshost:8081/data/filename.txt'
)
FORMAT 'TEXT' (DELIMITER '|');

Summary

In this article, we have learned about Greenplum UAP and also Greenplum Database. This article also gives information about the core components of Greenplum UAP.

Resources for Article: Further resources on this subject: Making Big Data Work for Hadoop and Solr [Article] Big Data Analysis [Article] Core Data iOS: Designing a Data Model and Building Data Objects [Article]

Getting started with Haskell

Packt
30 Oct 2013
9 min read
(For more resources related to this topic, see here.)

So what is Haskell? It is a fast, type-safe, purely functional programming language with powerful type inference. Having said that, let us try to understand what it gives us. First, a purely functional programming language means that, in general, functions in Haskell don't have side effects; there is a special type for impure functions, or functions with side effects. Second, Haskell has a strong, static type system with automatic and robust type inference. In practice, this means that you do not usually need to specify the types of functions, and the type checker does not allow passing incompatible types. In strongly typed languages, types are considered to be a specification, due to the Curry-Howard correspondence, the direct relationship between programs and mathematical proofs. Greatly simplified, the correspondence states that if a value of a type exists (that is, the type is inhabited), then the corresponding mathematical proof is correct. Or, jokingly, if a program compiles, there is a 99 percent probability that it works according to the specification. Whether the types themselves conform to the natural-language specification is, of course, still an open question; Haskell won't help you with that.

The Haskell platform

The glorious Glasgow Haskell Compilation System, or simply the Glasgow Haskell Compiler (GHC), is the most widely used Haskell compiler and the current de facto standard. The compiler is packaged into the Haskell Platform, which follows Python's "batteries included" principle. The platform is updated twice a year with new compilers and libraries. It usually includes a compiler, an interactive Read-Evaluate-Print Loop (REPL) interpreter, the Haskell 98/2010 libraries (the so-called Prelude) that include most of the common definitions and functions, and a set of commonly used libraries. If you are on Windows or Mac OS X, it is strongly recommended to use the prepackaged installers of the Haskell platform at http://www.haskell.org/platform/.

Quick tour of Haskell

To start with development, we should first be familiar with a few basic features of Haskell. We really need to know about laziness, data types, pattern matching, type classes, and a basic notion of monads to get started with Haskell.

Laziness

Haskell is a language with lazy evaluation. From a programmer's point of view, that means that a value is evaluated if and only if it is really needed. Imperative languages usually have strict evaluation, that is, function arguments are evaluated before function application. To see the difference, let's take a look at a simple expression in Haskell:

let x = 2 + 3

In a strict or eager language, the 2 + 3 expression would be immediately evaluated to 5, whereas in Haskell, only a promise to do this evaluation will be created, and the value will be evaluated only when it is needed. In other words, this statement just introduces a definition of x that might be used later, unlike in a strict language, where it is an operator that assigns the computed value 5 to a memory cell named x. This strategy also allows the sharing of evaluations, because laziness assumes that computations can be executed whenever they are needed and, therefore, the results can be memoized. It can reduce the running time by an exponential factor over strict evaluation. Laziness also allows us to manipulate infinite data structures. For instance, we can construct an infinite list of natural numbers as follows:

let ns = [1..]

Moreover, we can manipulate it as if it were a normal list, even though some caution is needed, as you can get an infinite loop. We can take the first five elements of this infinite list by means of the built-in function take:

take 5 ns

By running this example in GHCi you will get [1,2,3,4,5].
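As another classic illustration of the same idea, an infinite list can even be defined in terms of itself. The following sketch is our own addition; the name fibs is just a conventional choice:

-- an infinite list of Fibonacci numbers, defined lazily in terms of itself
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

-- only the first ten elements are ever computed here
firstTen :: [Integer]
firstTen = take 10 fibs   -- [0,1,1,2,3,5,8,13,21,34]

Laziness guarantees that each element of fibs is computed at most once, and only when something, such as take, actually demands it.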
Functions as first-class citizens

The notion of a function is one of the core ideas in functional languages, and Haskell is no exception. The definition of a function includes the function body and an optional type declaration. For instance, the take function is defined in Prelude as follows:

take :: Int -> [a] -> [a]
take = ...

Here, the type declaration says that the function takes an integer as its first argument and a list of objects of type a as its second, and returns a new list of the same type. Haskell also allows partial application of a function. For example, you can construct a function that takes the first five elements of a list as follows:

take5 :: [a] -> [a]
take5 = take 5

Functions are themselves values, that is, you may pass a function as an argument to another function. Prelude defines the map function as a function that takes another function as an argument:

map :: (a -> b) -> [a] -> [b]

map takes a function and applies it to each element of a list. Thus, functions are first-class citizens in the language and it is possible to manipulate them as if they were normal values.

Data types

Data types are at the core of a strongly typed language such as Haskell. The distinctive feature of Haskell data types is that they are all immutable, that is, once an object is constructed it cannot be changed. This might seem strange at first sight, but in the long run it has a few positive consequences. First, it enables the parallelization of computations. Second, all data is referentially transparent, that is, there is no difference between a reference to an object and the object itself. These two properties allow the compiler to reason about code optimization at a higher level than a C/C++ compiler can. Let us consider the following data type definitions in Haskell:

type TimeStamp = Integer
newtype Price = Price Double
data Quote = AskQuote TimeStamp Price | BidQuote TimeStamp Price

These illustrate the three common ways to define data types in Haskell. The first declaration just creates a synonym for an existing data type, and the type checker won't prevent you from using Integer instead of TimeStamp; you can also use a value of type TimeStamp with any function that expects to work with an Integer. The second declaration creates a new type for prices, and you are not allowed to use Double instead of Price; the compiler will raise an error in such cases and thus enforce the correct usage. The last declaration introduces an algebraic data type (ADT) and says that a value of type Quote can be constructed either by the data constructor AskQuote or by BidQuote, with a time stamp and a price as their parameters. Quote itself is called a type constructor. A type constructor can be parameterized by type variables. For example, the types Maybe and Either, two frequently used standard types, are defined as follows:

data Maybe a = Nothing | Just a
data Either a b = Left a | Right b

The type variables a and b can be substituted by any other type.
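Pattern matching, listed earlier among the features we need to know, is how values of such algebraic data types are taken apart again. The following small sketch reuses the Quote and Price types defined above; the function names and the idea of extracting the price are our own invention for illustration:

-- extract the price from either kind of quote by matching on the constructor
quotePrice :: Quote -> Price
quotePrice (AskQuote _ p) = p
quotePrice (BidQuote _ p) = p

-- pattern matching also works on Maybe: supply a default when there is no value
priceOrZero :: Maybe Price -> Price
priceOrZero (Just p) = p
priceOrZero Nothing  = Price 0

Each equation is tried from top to bottom, and the first pattern that matches the constructor of the argument is used.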
Type classes

Type classes in Haskell are not classes like those in object-oriented languages. They are more like interfaces with optional default implementations. You can find similar features in other languages under names such as traits or mixins, but unlike those, this feature in Haskell enables ad hoc polymorphism, that is, a function can be applied to arguments of different types. It is also known as function overloading or operator overloading. A polymorphic function can specify a different implementation for different types. In principle, a type class consists of function declarations over some type. The Eq type class is a standard type class that specifies how to compare two objects of the same type:

class Eq a where
  (==) :: a -> a -> Bool
  (/=) :: a -> a -> Bool
  (/=) x y = not $ (==) x y

Here, Eq is the name of the type class, a is a type variable, and == and /= are the operations defined in the type class. This definition means that some type a is of class Eq if it defines the operations == and /=. Moreover, the definition provides a default implementation of the operation /=, so if you decide to implement this class for some data type, you only need to provide an implementation of a single operation. There are a number of useful type classes defined in Prelude. For example, Eq is used to define equality between two objects; Ord specifies a total ordering; Show and Read introduce a String representation of an object; the Enum type class describes enumerations, that is, data types with nullary constructors. It would be quite tedious to implement these trivial but useful type classes for each data type, so Haskell supports automatic derivation of most of the standard type classes:

newtype Price = Price Double deriving (Eq, Show, Read, Ord)

Now, objects of type Price can be converted from/to String by means of the read/show functions and compared using the equality operators. Those who want to know all the details can look them up in the Haskell language report.

IO monad

Being a purely functional language, Haskell requires a marker for impure functions, and the IO monad is that marker. The main function is the entry point of any Haskell program. It cannot be a pure function because it changes the state of the "world", at least by creating a new process in the OS. Let us take a look at its type:

main :: IO ()
main = do
  putStrLn "Hello, World!"

The IO () type signals that there is a computation that performs some I/O operation and returns an empty result. There is really only one way to perform I/O in Haskell: from the main procedure of your program. It is not possible to execute an I/O action from an arbitrary function, unless that function is in the IO monad and is called from main, directly or indirectly.

Summary

In this article, we walked through the main unique features of the Haskell language and learned a bit of its history.

Resources for Article: Further resources on this subject: Core Data iOS: Designing a Data Model and Building Data Objects [Article] The OpenFlow Controllers [Article] Managing Network Layout [Article]

Advanced Data Operations

Packt
30 Oct 2013
11 min read
(For more resources related to this topic, see here.) Recipe 1 – handling multi-valued cells It is a common problem in many tables: what do you do if multiple values apply to a single cell? For instance, consider a Clients table with the usual name, address, and telephone fields. A typist is adding new contacts to this table, when he/she suddenly discovers that Mr. Thompson has provided two addresses with a different telephone number for each of them. There are essentially three possible reactions to this: Adding only one address to the table: This is the easiest thing to do, as it eliminates half of the typing work. Unfortunately, this implies that half of the information is lost as well, so the completeness of the table is in danger. Adding two rows to the table: While the table is now complete, we now have redundant data. Redundancy is also dangerous, because it leads to error: the two rows might accidentally be treated as two different Mr. Thompsons, which can quickly become problematic if Mr. Thompson is billed twice for his subscription. Furthermore, as the rows have no connection, information updated in one of them will not automatically propagate to the other. Adding all information to one row: In this case, two addresses and two telephone numbers are added to the respective fields. We say the field is overloaded with regard to its originally envisioned definition. At first sight, this is both complete yet not redundant, but a subtle problem arises. While humans can perfectly make sense of this information, automated processes cannot. Imagine an envelope labeler, which will now print two addresses on a single envelope, or an automated dialer, which will treat the combined digits of both numbers as a single telephone number. The field has indeed lost its precise semantics. Note that there are various technical solutions to deal with the problem of multiple values, such as table relations. However, if you are not in control of the data model you are working with, you'll have to choose any of the preceding solutions. Luckily, OpenRefine is able to offer the best of both worlds. Since it is also an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. In the Powerhouse Museum dataset, the Categories field is multi-valued, as each object in the collection can belong to different categories. Before we can perform meaningful operations on this field, we have to tell OpenRefine to somehow treat it a little different. Suppose we want to give the Categories field a closer look to check how many different categories are there and which categories are the most prominent. First, let's see what happens if we try to create a text facet on this field by clicking on the dropdown next to Categories and navigating to Facet| Text Facet as shown in the following screenshot. This doesn't work as expected because there are too many combinations of individual categories. OpenRefine simply gives up, saying that there are 14,805 choices in total, which is above the limit for display. While you can increase the maximum value by clicking on Set choice count limit, we strongly advise against this. First of all, it would make OpenRefine painfully slow as it would offer us a list of 14,805 possibilities, which is too large for an overview anyway. Second, it wouldn't help us at all because OpenRefine would only list the combined field values (such as Hen eggs | Sectional models | Animal Samples and Products). 
This does not allow us to inspect the individual categories, which is what we're interested in. To solve this, leave the facet open, but go to the Categories dropdown again and select Edit cells | Split multi-valued cells… as shown in the following screenshot: OpenRefine now asks What separator currently separates the values?. As we can see in the first few records, the values are separated by a vertical bar or pipe character, as the vertical line tokens are called. Therefore, enter a vertical bar | in the dialog. If you are not able to find the corresponding key on your keyboard, try selecting the character from one of the Categories cells and copying it so you can paste it in the dialog. Then, click on OK. After a few seconds, you will see that OpenRefine has split the cell values, and the Categories facet on the left now displays the individual categories. By default, it shows them in alphabetical order, but we will get more valuable insights if we sort them by the number of occurrences. This is done by changing the Sort by option from name to count, revealing the most popular categories. One thing we can do now, which we couldn't do when the field was still multi-valued, is change the name of a single category across all records. For instance, to change the name of Clothing and Dress, hover over its name in the created Categories facet and click on the edit link, as you can see in the following screenshot: Enter a new name such as Clothing and click on Apply. OpenRefine changes all occurrences of Clothing and Dress into Clothing, and the facet is updated to reflect this modification. Once you are done editing the separate values, it is time to merge them back together. Go to the Categories dropdown, navigate to Edit cells | Join multi-valued cells…, and enter the separator of your choice. This does not need to be the same separator as before, and multiple characters are also allowed. For instance, you could opt to separate the fields with a comma followed by a space.

Recipe 3 – clustering similar cells

Thanks to OpenRefine, you don't have to worry about inconsistencies that slipped in during the creation process of your data. If you have been investigating the various categories after splitting the multi-valued cells, you might have noticed that the same category labels do not always have the same spelling. For instance, there is Agricultural Equipment and Agricultural equipment (capitalization differences), Costumes and Costume (pluralization differences), and various other issues. The good news is that these can be resolved automatically; well, almost. But OpenRefine definitely makes it a lot easier. The process of finding the same items with slightly different spelling is called clustering. After you have split the multi-valued cells, you can click on the Categories dropdown and navigate to Edit cells | Cluster and edit…. OpenRefine presents you with a dialog box where you can choose between different clustering methods, each of which can use various similarity functions. When the dialog opens, key collision and fingerprint have been chosen as the default settings. After some time (this can take a while, depending on the project size), OpenRefine will execute the clustering algorithm on the Categories field. It lists the found clusters in rows, along with the spelling variations in each cluster and the proposed value for the whole cluster, as shown in the following screenshot: Note that OpenRefine does not automatically merge the values of the cluster.
Instead, it wants you to confirm whether the values indeed point to the same concept. This avoids similar names, which still have a different meaning, accidentally ending up as the same. Before we start making decisions, let's first understand what all of the columns mean. The Cluster Size column indicates how many different spellings of a certain concept were thought to be found. The Row Count column indicates how many rows contain either of the found spellings. In Values in Cluster, you can see the different spellings and how many rows contain a particular spelling. Furthermore, these spellings are clickable, so you can indicate which one is correct. If you hover over the spellings, a Browse this cluster link appears, which you can use to inspect all items in the cluster in a separate browser tab. The Merge column contains a checkbox. If you check it, all values in that cluster will be changed to the value in the New Cell Value column when you click on one of the Merge Selected buttons. You can also manually choose a new cell value if the automatic value is not the best choice. So, let's perform our first clustering operation. I strongly advise you to scroll carefully through the list to avoid clustering values that don't belong together. In this case, however, the algorithm hasn't acted too aggressively: in fact, all suggested clusters are correct. Instead of manually ticking the Merge? checkbox on every single one of them, we can just click on Select All at the bottom. Then, click on the Merge Selected & Re-Cluster button, which will merge all the selected clusters but won't close the window yet, so we can try other clustering algorithms as well. OpenRefine immediately reclusters with the same algorithm, but no other clusters are found since we have merged all of them. Let's see what happens when we try a different similarity function. From the Keying Function menu, click on ngram fingerprint. Note that we get an additional parameter, Ngram Size, which we can experiment with to obtain less or more aggressive clustering. We see that OpenRefine has found several clusters again. It might be tempting to click on the Select All button again, but remember we warned to carefully inspect all rows in the list. Can you spot the mistake? Have a closer look at the following screenshot: Indeed, the clustering algorithm has decided that Shirts and T-shirts are similar enough to be merged. Unfortunately, this is not true. So, either manually select all correct suggestions, or deselect the ones that are not. Then, click on the Merge Selected & Re-Cluster button. Apart from trying different similarity functions, we can also try totally different clustering methods. From the Method menu, click on nearest neighbor. We again see new clustering parameters appear (Radius and Block Chars, but we will use their default settings for now). OpenRefine again finds several clusters, but now, it has been a little too aggressive. In fact, several suggestions are wrong, such as the Lockets / Pockets / Rockets cluster. Some other suggestions, such as "Photocopiers" and "Photocopier", are fine. In this situation, it might be best to manually pick the few correct ones among the many incorrect clusters. Assuming that all clusters have been identified, click on the Merge Selected & Close button, which will apply merging to the selected items and take you back into the main OpenRefine window. If you look at the data now or use a text facet on the Categories field, you will notice that the inconsistencies have disappeared. 
What are clustering methods? OpenRefine offers two different clustering methods, key collision and nearest neighbor, which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. For instance, suppose we have a keying function which removes all spaces; then, A B C, AB C, and ABC will be mapped to the same key: ABC. In practice, the keying functions are constructed in a more sophisticated and helpful way. Nearest neighbor, on the other hand, is a technique in which each unique value is compared to every other unique value using a distance function. For instance, if we count every modification as one unit, the distance between Boot and Bots is 2: one addition and one deletion. This corresponds to an actual distance function in OpenRefine, namely levenshtein. In practice, it is hard to predict which combination of method and function is the best for a given field. Therefore, it is best to try out the various options, each time carefully inspecting whether the clustered values actually belong together. The OpenRefine interface helps you by putting the various options in the order they are most likely to help: for instance, trying key collision before nearest neighbor. Summary In this article we learned about how to handle multi-valued cells and clustering of similar cells in OpenRefine. Multi-valued cells are a common problem in many tables. This article showed us what to do if multiple values apply to a single cell. Since OpenRefine is an automated piece of software, it needs to be informed whether a field is multi-valued before it can perform sensible operations on it. This article also showed an example of how to go about it. It also shed light on clustering methods. OpenRefine offers two different clustering methods, key collision and nearest neighbor , which fundamentally differ in how they function. With key collision, the idea is that a keying function is used to map a field value to a certain key. Values that are mapped to the same key are placed inside the same cluster. Resources for Article : Further resources on this subject: Business Intelligence and Data Warehouse Solution - Architecture and Design [Article] Self-service Business Intelligence, Creating Value from Data [Article] Oracle Business Intelligence : Getting Business Information from Data [Article]

Getting Started with Pentaho Data Integration

Packt
30 Oct 2013
16 min read
(For more resources related to this topic, see here.) Pentaho Data Integration and Pentaho BI Suite Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are: Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server. Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project. Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort. Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user. All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services. This set of software and services form a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite. Exploring the Pentaho Demo The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo: The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org. Pentaho Data Integration Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle. The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the article. In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect. 
When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said: We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community. By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project. From that moment, the tool has grown with no pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho: June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included among other changes, enhancements for large-scale environments and multilingual capabilities. February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kind of elements you design in Kettle. May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling. November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely. October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well. April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering. June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon. November 2010: PDI 4.1 is released with many bug fixes. August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories. April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features. November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data—the ability of reading, searching, and in general transforming large and complex collections of datasets. 2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability. Using PDI in real-world scenarios Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data. In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool, that it is common to see it used for these and for many other purposes. Here you have some examples. 
Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on the business area or business rules. But in every case, without exception, the process involves the following steps:

Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.

Data cleansing

It's important and even critical that data be correct and accurate: for the efficiency of the business, to draw trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying that the data meets certain rules, discarding or correcting the records that don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget, so they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor to type the information in by hand. Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems in order to comply with government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow.
Some examples are pre-processing data for an online report, sending e-mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on. The use of PDI integrated with other tools is beyond the scope of this article. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Installing PDI

In order to work with PDI, you need to install the software. It's a simple task, so let's do it now.

Time for action – installing PDI

These are the instructions to install PDI, whatever operating system you may be using. The only prerequisite to install the tool is to have JRE 6.0 installed. If you don't have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:

Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration. Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot.
Download the file that matches your platform. The preceding screenshot should help you.
Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.
If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:

cd /home/pdi_user/kettle
chmod +x *.sh

On Mac OS, you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS or Data Integration 64-bit.app/Contents/MacOS, depending on your system.

What just happened?

You have installed the tool in just a few minutes. Now, you have all you need to start working.

Launching the PDI graphical designer – Spoon

Now that you've installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's launch Spoon and see what it looks like.

Time for action – starting and customizing Spoon

In this section, you are going to launch the PDI graphical designer and get familiarized with its main features.

Start Spoon. If your system is Windows, run Spoon.bat. You can just double-click on the Spoon.bat icon, or on Spoon if your Windows system doesn't show extensions for known file types. Alternatively, open a command window (by selecting Run in the Windows Start menu and executing cmd) and run Spoon.bat in the terminal. On other platforms, such as Unix, Linux, and so on, open a terminal window and type spoon.sh. If you didn't make spoon.sh executable, you may type sh spoon.sh instead. Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app or Data Integration 64-bit.app icon.
As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.
A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting. Eventually, close the window and proceed.
Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu.
Click on Options... in the Tools menu. A window appears where you can change various general and visual characteristics.
Uncheck the highlighted checkboxes, as shown in the following screenshot: Select the tab window Look & Feel. Change the Grid size and Preferred Language settings as shown in the following screenshot: Click on the OK button. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead: What just happened? You ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration. In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before. The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen. You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon! Spoon Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. Setting preferences in the Options window In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings. Remember to restart Spoon in order to see the changes applied. In particular, please take note of the following suggestion about the configuration of the preferred language. If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language. One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them. You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen. Storing transformations and jobs in a repository The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it. As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods: Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose. Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively. It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. 
Therefore, you must choose the method when you start the tool. By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories, or, in other words, to work with the files method? Mainly for two reasons:

- Working with files is more natural and practical for most users.
- Working with a database repository requires minimal database knowledge, and also requires that you have access to a database engine from your computer. Although it would be an advantage for you to have both preconditions, maybe you haven't got both of them.

There is a third method called File repository, which is a mix of the two above: it's a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the more broadly used. Therefore, throughout this article we will use the files method.

Creating your first transformation
Until now, you've seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It's time to create your first transformation.
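Before you do, a quick aside on the files method described above. Since every transformation and job is stored as a plain XML file, you can inspect your work with any scripting language. The following is a minimal sketch, not part of PDI itself; the folder path is just an example, and it assumes the usual Kettle layout in which each <step> element of a .ktr file has a <name> child under the <transformation> root.

```python
# Minimal sketch: list the step names of every Kettle transformation (.ktr) in a folder.
# The folder path is an example; the <step>/<name> layout is an assumption about .ktr files.
import glob
import os
import xml.etree.ElementTree as ET

project_folder = "/home/pdi_user/kettle/my_project"  # example path

for ktr_path in glob.glob(os.path.join(project_folder, "*.ktr")):
    root = ET.parse(ktr_path).getroot()  # root element is expected to be <transformation>
    step_names = [step.findtext("name") for step in root.iter("step")]
    print(os.path.basename(ktr_path), "->", step_names)
```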
IBM SPSS Modeler – Pushing the Limits

Packt
30 Oct 2013
16 min read
(For more resources related to this topic, see here.)

Using the Feature Selection node creatively to remove or decapitate perfect predictors
In this recipe, we will identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, usually indicating circular logic and not a prediction of value. It is a common and serious problem. When this occurs, we have accidentally allowed information into the model that could not possibly be known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a good predictor of lateness because the lateness caused the letter, not the other way around.

The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers ("caput" means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe.

The following table shows the three time periods: the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future, but we cannot use information from the future to predict the future. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. As an example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning.

|              | Contract Start (Past) | Expiration (Now)  | Outcome (Future) | Renewal Date (Future) |
|--------------|-----------------------|-------------------|------------------|-----------------------|
| Joe          | January 1, 2010       | January 1, 2012   | Renewed          | January 2, 2012       |
| Ann          | February 15, 2010     | February 15, 2012 | Out of Contract  | Null                  |
| Bill         | March 21, 2010        | March 21, 2012    | Churn            | NA                    |
| Jack         | April 5, 2010         | April 5, 2012     | Renewed          | April 9, 2012         |
| New Customer | 24 Months Ago         | Today             | ???              | ???                   |

Getting ready
We will start with a blank stream, and will be using the cup98lrn reduced vars2.txt data set.

How to do it...
To identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model:

1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
2. Force TARGET_B to be flag and make it the target.
3. Add a Feature Selection Modeling node and run it.
4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.
5. Review what you know about the top variables, and check whether any could be related to the target by definition or could possibly be based on information that actually postdates the information in the target.
6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members are all found in one category.
8. Continue steps 6 and 7 for the first several variables.
9. Variables that are problematic (steps 5 and/or 7) need to be set to None in the Type node.

How it works...
Which variables need decapitation? The problem is information that, although it was known at the time that you extracted it, was not known at the time of decision. In this case, the time of decision is the decision that the potential donor made to donate or not to donate. Was the amount, TARGET_D, known before the decision was made to donate? Clearly not.
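It can help to see what such a leaker looks like numerically before stating the timing rule more formally. The following Python sketch is not part of SPSS Modeler, and the column names late_letter and region are made up for illustration; it scores each candidate predictor by how pure its categories are with respect to the target, so a circular field such as a late-letter flag scores close to 1.0 and floats to the top, just as it would in the Feature Selection results.

```python
# Illustrative sketch (outside SPSS Modeler): flag near-perfect categorical predictors
# by checking how "pure" each predictor's categories are with respect to the target.
# The toy columns below (late_letter, region) are hypothetical stand-ins.
import pandas as pd

def near_perfect_predictors(df, target, threshold=0.99):
    suspects = []
    for col in df.columns:
        if col == target:
            continue
        # For each category of the predictor, purity = share of the majority target value.
        purity = df.groupby(col)[target].agg(
            lambda s: s.value_counts(normalize=True).iloc[0]
        )
        # Weight category purities by category size to get one overall score per predictor.
        weights = df[col].value_counts(normalize=True)
        score = (purity * weights.reindex(purity.index)).sum()
        if score >= threshold:
            suspects.append((col, round(score, 4)))
    return sorted(suspects, key=lambda x: -x[1])

# Toy data: the "late_letter" flag is circular with the target and gets flagged.
toy = pd.DataFrame({
    "late_letter": [1, 1, 1, 0, 0, 0],
    "region":      ["N", "S", "N", "S", "N", "S"],
    "TARGET_B":    [1, 1, 1, 0, 0, 0],
})
print(near_perfect_predictors(toy, "TARGET_B"))  # [('late_letter', 1.0)]
```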
No information that dates after the information in the target variable can ever be used in a predictive model. This recipe is built on the following foundation: variables with this problem will float up to the top of the Feature Selection results. They may not always be perfect predictors, but perfect predictors always must go. For example, you might find that, if a customer initially rejects or postpones a purchase, there should be a follow-up sales call in 90 days. They are recorded as rejected offer in the campaign, and as a result most of them had a follow-up call in the 90 days after the campaign. Since a couple of the follow-up calls might not have happened, it won't be a perfect predictor, but it still must go.

Note that variables such as RFA_2 and RFA_2A are both very recent information and highly predictive. Are they a problem? You can't be absolutely certain without knowing the data. Here the information recorded in these variables is calculated just prior to the campaign. If the calculation were made just after, they would have to go. The CHAID tree almost certainly would have shown evidence of perfect prediction in this case.

There's more...
Sometimes a model has to have a lot of lead time; predicting today's weather is a different challenge than next year's prediction in the farmer's almanac. When more lead time is desired, you could consider dropping all of the _2 series variables. What would the advantage be? What if you were buying advertising space and there was a 45-day delay for the advertisement to appear? If the _2 variables occur between your advertising deadline and your campaign, you might have to use information attained in the _3 campaign.

Next-Best-Offer for large datasets
Association models have been the basis for next-best-offer recommendation engines for a long time. Recommendation engines are widely used for presenting customers with cross-sell offers. For example, if a customer purchases a shirt, pants, and a belt, which shoes would he also likely buy? This type of analysis is often called market-basket analysis, as we are trying to understand which items customers purchase in the same basket/transaction.

Recommendations must be very granular (for example, at the product level) to be usable at the check-out register, website, and so on. For example, knowing that female customers purchase a wallet 63.9 percent of the time when they buy a purse is not directly actionable. However, knowing that customers that purchase a specific purse (for example, SKU 25343) also purchase a specific wallet (for example, SKU 98343) 51.8 percent of the time can be the basis for future recommendations.

Product-level recommendations require the analysis of massive data sets (that is, millions of rows). Usually, this data is in the form of sales transactions where each line item (that is, row of data) represents a single product. The line items are tied together by a single transaction ID.

IBM SPSS Modeler association models support both tabular and transactional data. The tabular format requires each product to be represented as a column. As most product-level recommendations would contain thousands of products, this format is not practical. The transactional format uses the transactional data directly and requires only two inputs: the transaction ID and the product/item.

Getting ready
This example uses the files stransactions.sav and scoring.csv.

How to do it...
To recommend the next best offer for large datasets:

1. Start with a new stream by navigating to File | New Stream.
2. Go to File | Stream Properties from the IBM SPSS Modeler menu bar. On the Options tab, change the Maximum members for nominal fields to 50000. Click on OK.
3. Add a Statistics File source node to the upper left of the stream. Set the file field by navigating to transactions.sav. On the Types tab, change the Product_Code field to Nominal and click on the Read Values button. Click on OK.
4. Add a CARMA Modeling node connected to the Statistics File source node in step 3. On the Fields tab, click on Use custom settings and check the Use transactional format check box. Select Transaction_ID as the ID field and Product_Code as the Content field. On the Model tab of the CARMA Modeling node, change the Minimum rule support (%) to 0.0 and the Minimum rule confidence (%) to 5.0. Click on the Run button to build the model.
5. Double-click the generated model to ensure that you have approximately 40,000 rules.
6. Add a Var File source node to the middle left of the stream. Set the file field by navigating to scoring.csv. On the Types tab, click on the Read Values button. Click on the Preview button to preview the data. Click on OK to dismiss all dialogs.
7. Add a Sort node connected to the Var File node in step 6. Choose Transaction_ID and Line_Number (with Ascending sort) by clicking the down arrow on the right of the dialog. Click on OK.
8. Connect the Sort node in step 7 to the generated model (replacing the current link).
9. Add an Aggregate node connected to the generated model.
10. Add a Merge node connected to the generated model. Connect the Aggregate node in step 9 to the Merge node. On the Merge tab, choose Keys as the Merge Method, select Transaction_ID, and click on the right arrow. Click on OK.
11. Add a Select node connected to the Merge node in step 10. Set the condition to Record_Count = Line_Number. Click on OK. At this point, the stream should look as follows:
12. Add a Table node connected to the Select node in step 11. Right-click on the Table node and click on Run to see the next-best-offer for the input data.

How it works...
In steps 1-5, we set up the CARMA model to use the transactional data (without needing to restructure the data). CARMA was selected over Apriori for its improved performance and stability with large data sets.

For recommendation engines, the settings on the Model tab are somewhat arbitrary and are driven by the practical limitations of the number of rules generated. Lowering the thresholds for confidence and rule support generates more rules. Having more rules can have a negative impact on scoring performance, but will result in more (albeit weaker) recommendations.

- Rule Support: How many transactions contain the entire rule, that is, both the antecedents ("if" products) and the consequents ("then" products).
- Confidence: If a transaction contains all the antecedents ("if" products), what percentage of the time does it contain the consequents ("then" products).

In step 5, when we examine the model, we see the generated Association Rules with the corresponding rule support and confidence values. In the remaining steps (7-12), we score a new transaction and generate 3 next-best-offers based on the model containing the Association Rules. Since the model was built with transactional data, the scoring data must also be transactional. This means that each row is scored using the current row and the prior rows with the same transaction ID. The only row we generally care about is the last row for each transaction, where all the data has been presented to the model.
To accomplish this, we count the number of rows for each transaction and select the line number that equals the total row count (that is, the last row for each transaction). Notice that the model returns 3 recommended products, each with a confidence, in order of decreasing confidence. A next-best-offer engine would present the customer with the best option first (or potentially all three options ordered by decreasing confidence). Note that, if there is no rule that applies to the transaction, nulls will be returned in some or all of the corresponding columns.

There's more...
In this recipe, you'll notice that we generate recommendations across the entire transactional data set. By using all transactions, we are creating generalized next-best-offer recommendations; however, we know that we can probably segment (that is, cluster) our customers into different behavioral groups (for example, fashion conscious, value shoppers, and so on). Partitioning the transactions by behavioral segment and generating separate models for each segment will result in rules that are more accurate and actionable for each group. The biggest challenge with this approach is that you will have to identify the customer segment for each customer before making recommendations (that is, scoring). A unified approach would be to use the general recommendations for a customer until a customer segment can be assigned, and then use the segmented models.

Correcting a confusion matrix for an imbalanced target variable by incorporating priors
Classification models generate probabilities and a classification predicted class value. When there is a significant imbalance in the proportion of True values in the target variable, the confusion matrix as seen in the Analysis node output will show that the model has all predicted class values equal to the False value, leading an analyst to conclude that the model is not effective and needs to be retrained. Most often, the conventional wisdom is to use a Balance node to balance the proportion of True and False values in the target variable, thus eliminating the problem in the confusion matrix. However, in many cases the classifier is working fine without the Balance node; it is the interpretation of the model that is biased. Each model generates a probability that the record belongs to the True class, and the predicted class is derived from this value by applying a threshold of 0.5. Often, no record has a propensity that high, resulting in every predicted class value being assigned False. In this recipe we learn how to adjust the predicted class for classification problems with imbalanced data by incorporating the prior probability of the target variable.

Getting ready
This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream Recipe – correct with priors.str.

How to do it...
To incorporate prior probabilities when there is an imbalanced target variable:

1. Open the stream Recipe – correct with priors.str by navigating to File | Open Stream.
2. Make sure the datafile points to the correct path to the datafile cup98lrn_reduced_vars3.sav.
3. Open the generated model TARGET_B and open the Settings tab. Note that Compute Raw Propensity is checked. Close the generated model.
4. Duplicate the generated model by copying and pasting the node in the stream. Connect the duplicated model to the original generated model.
5. Add a Type node to the stream and connect it to the generated model.
6. Open the Type node and scroll to the bottom of the list.
   Note that the fields related to the two models have not yet been instantiated. Click on Read Values so that they are fully instantiated.
7. Insert a Filler node and connect it to the Type node. Open the Filler node and, in the variable list, select $N1-TARGET_B. Inside the Condition section, type '$RP1-TARGET_B' >= TARGET_B_Mean. Click on OK to dismiss the Filler node (after exiting the Expression Builder).
8. Insert an Analysis node into the stream. Open the Analysis node and click on the check box for Coincidence Matrices. Click on OK.
9. Run the stream to the Analysis node. Notice that the coincidence matrix (confusion matrix) for $N-TARGET_B has no predictions with value = 1, but the coincidence matrix for the second model, the one adjusted by step 7 ($N1-TARGET_B), has more than 30 percent of the records labeled as value = 1.

How it works...
Classification algorithms do not generate categorical predictions; they generate probabilities, likelihoods, or confidences. For this data set, the target variable, TARGET_B, has two values: 1 and 0. The classifier output from any classification algorithm will be a number between 0 and 1. To convert the probability to a 1 or 0 label, the probability is thresholded, and the default in Modeler (and all predictive analytics software) is a threshold of 0.5. This recipe changes that default threshold to the prior probability.

The proportion of TARGET_B = 1 values in the data is 5.1 percent, and therefore this is the classic imbalanced target variable problem. One solution to this problem is to resample the data so that the proportions of 1s and 0s are equal, normally achieved through use of the Balance node in Modeler. Moreover, one can create the Balance node by running a Distribution node for TARGET_B and using the Generate | Balance node (reduce) option. The justification for balancing the sample is that, if one doesn't do it, all the records will be classified with value = 0.

The reason for all the classification decisions having value 0 is not that the Neural Network isn't working properly. Consider the histogram of predictions from the Neural Network shown in the following screenshot. Notice that the maximum value of the predictions is less than 0.4, but the center of density is about 0.05. The actual shape of the histogram and the maximum predicted value depend on the Neural Network; some may have maximum values slightly above 0.5. If the threshold for the classification decision is set to 0.5, since no neural network predicted confidence is greater than 0.5, all of the classification labels will be 0. However, if one sets the threshold to the TARGET_B prior probability, 0.051, many of the predictions will exceed that value and be labeled as 1. We can see the result of the new threshold by color-coding the histogram of the previous figure with the new class label, in the following screenshot.

This recipe used a Filler node to modify the existing predicted target value. The categorical prediction from the Neural Network whose prediction is being changed is $N1-TARGET_B. The $ variables are special field names that are used automatically in the Analysis node and Evaluation node. It's possible to construct one's own $ fields with a Derive node, but it is safer to modify the one that's already in the data.

There's more...
The same procedure defined in this recipe works for other modeling algorithms as well, including logistic regression. Decision trees are a different matter. Consider the following screenshot.
This result, in which the C5 tree doesn't split at all, is a consequence of the imbalanced target variable. Rather than balancing the sample, there are other ways to get a tree built. For C&RT or Quest trees, go to the Build Options, select the Costs & Priors item, and select Equal for all classes for priors (equal priors). This option forces C&RT to treat the two classes mathematically as if their counts were equal. It is equivalent to running the Balance node to boost samples so that there are equal numbers of 0s and 1s; however, it is done without adding additional records to the data and slowing down training, because equal priors is purely a mathematical reweighting. The C5 tree doesn't have the option of setting priors. An alternative, one that will work not only with C5 but also with C&RT, CHAID, and Quest trees, is to change the Misclassification Costs so that the cost of classifying a one as a zero is 20, approximately the ratio of the 95 percent 0s to the 5 percent 1s.
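The thresholding idea at the heart of this recipe is easy to reproduce outside Modeler. The following is a minimal Python sketch with purely synthetic propensities (numpy assumed; no Modeler data is used): it compares the confusion matrix obtained with the default 0.5 cut-off against one obtained by thresholding at the prior of the positive class, mirroring the 5.1 percent prior of TARGET_B.

```python
# Sketch of the thresholding idea from this recipe, outside Modeler.
# The propensities below are synthetic; 0.051 mirrors the prior of TARGET_B = 1.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
y = (rng.random(n) < 0.051).astype(int)              # imbalanced target, about 5.1% ones
# Fake model scores: positives tend to score higher, but essentially never above 0.5.
scores = np.clip(rng.normal(0.05 + 0.08 * y, 0.04), 0, 1)

def confusion(y_true, y_pred):
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

prior = y.mean()
print("threshold 0.5  :", confusion(y, (scores >= 0.5).astype(int)))
print("threshold prior:", confusion(y, (scores >= prior).astype(int)))
```

With the default cut-off every record lands in the predicted-0 column, while the prior-based cut-off recovers a usable split, which is the same effect the Filler node produces in the stream.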
Working with axes (Should know)

Packt
30 Oct 2013
4 min read
(For more resources related to this topic, see here.)

Getting ready
We start with the same boilerplate that we used when creating basic charts.

How to do it...
The following code creates some sample data that grows exponentially. We then use the transform and tickSize setting on the Y axis to adjust how our data is displayed:

    ...
    <script>
    var data = [], i;
    for (i = 1; i <= 50; i++) {
        data.push([i, Math.exp(i / 10, 2)]);
    }
    $('#sampleChart').plot(
        [ data ],
        {
            yaxis: {
                transform: function (v) {
                    return v == 0 ? v : Math.log(v);
                },
                tickSize: 50
            }
        }
    );
    </script>
    ...

Flot draws a chart with a logarithmic Y axis, so that our exponential data is easier to read:

Next, we use Flot's ability to display multiple axes on the same chart as follows:

    ...
    var sine = [];
    for (i = 0; i < Math.PI * 2; i += 0.1) {
        sine.push([i, Math.sin(i)]);
    }
    var cosine = [];
    for (i = 0; i < Math.PI * 2; i += 0.1) {
        cosine.push([i, Math.cos(i) * 20]);
    }
    $('#sampleChart').plot(
        [
            { label: 'sine', data: sine },
            { label: 'cosine', data: cosine, yaxis: 2 }
        ],
        {
            yaxes: [
                {},
                { position: 'right' }
            ]
        }
    );
    ...

Flot draws the two series overlapping each other. The Y axis for the sine series is drawn on the left by default and the Y axis for the cosine series is drawn on the right as specified:

How it works...
The transform setting expects a function that takes a value, which is the y coordinate of our data, and returns a transformed value. In this case, we calculate the logarithm of our original data value so that our exponential data is displayed on a linear scale. We also use the tickSize setting to ensure that our labels do not overlap after the axis has been transformed.

The yaxis setting under the series object is a number that specifies which axis the series should be associated with. When we specify the number 2, Flot automatically draws a second axis on the chart. We then use the yaxes setting to specify that the second axis should be positioned on the right of the chart. In this case, the sine data ranges from -1.0 to 1.0, whereas the cosine data ranges from -20 to 20. The cosine axis is drawn on the right and is independent of the sine axis.

There's more...
Flot doesn't have a built-in ability to interact with axes, but it does give you all the information you need to construct a solution.

Making axes interactive
Here, we use Flot's getAxes method to add interactivity to our axes as follows:

    ...
    var showFahrenheit = false,
        temperatureFormatter = function (val, axis) {
            if (showFahrenheit) {
                val = val * 9 / 5 + 32;
            }
            return val.toFixed(1);
        },
        drawPlot = function () {
            var plot = $.plot(
                '#sampleChart',
                [[[0, 0], [1, 3], [3, 1]]],
                {
                    yaxis: {
                        tickFormatter: temperatureFormatter
                    }
                }
            );
            var plotPlaceholder = plot.getPlaceholder();
            $.each(plot.getAxes(), function (i, axis) {
                var box = axis.box;
                var axisTarget = $('<div />');
                axisTarget.
                    css({
                        position: 'absolute',
                        left: box.left,
                        top: box.top,
                        width: box.width,
                        height: box.height
                    }).
                    click(function () {
                        showFahrenheit = !showFahrenheit;
                        drawPlot();
                    }).
                    appendTo(plotPlaceholder);
            });
        };
    drawPlot();
    ...

First, note that we use a different way of creating a plot. Instead of calling the plot method on a jQuery collection that matches the placeholder element, we use the plot method directly from the jQuery object. This gives us immediate access to the Flot object, which we use to get the axes of our chart.
You could also have used the following data method to gain access to the Flot object:

    var plot = $('#sampleChart').plot(...).data('plot');

Once we have the Flot object, we use the getAxes method to retrieve a list of axis objects. We use jQuery's each method to iterate over each axis, and we create a div element that acts as a target for interaction. We set the div element's CSS so that it is in the same position and size as the axis' bounding box, and we attach an event handler to the click event before appending the div element to the plot's placeholder element. In this case, the event handler toggles a Boolean flag and redraws the plot. The flag determines whether the axis labels are displayed in Fahrenheit or Celsius, by changing the result of the function specified in the tickFormatter setting.

Summary
Now you will be able to customize a chart's axes, transform the shape of a graph by using a logarithmic scale, display multiple data series with their own independent axes, and make the axes interactive.
Cloudera Hadoop and HP Vertica

Packt
29 Oct 2013
9 min read
(For more resources related to this topic, see here.)

Cloudera Hadoop
Hadoop is one of the names we think about when it comes to Big Data. I'm not going into details about it since there is plenty of information out there; moreover, as somebody once said, "If you decided to use Hadoop for your data warehouse, then you probably have a good reason for it". Let's not forget: it is primarily a distributed filesystem, not a relational database. That said, there are many cases when we may need to use this technology for number crunching, for example, together with MicroStrategy for analysis and reporting.

There are mainly two ways to leverage Hadoop data from MicroStrategy: the first is Hive and the second is Impala. They both work as SQL bridges to the underlying Hadoop structures, converting standard SELECT statements into jobs. The connection is handled by a proprietary 32-bit ODBC driver available for free from the Cloudera website. In my tests, Impala turned out to be considerably faster than Hive, so I will show you how to use it from our MicroStrategy virtual machine.

Please note that I am using Version 9.3.0 for consistency with the rest of the book. If you're serious about Big Data and Hadoop, I strongly recommend upgrading to 9.3.1 for enhanced performance and easier setup. See MicroStrategy knowledge base document TN43588: Post-Certification of Cloudera Impala 1.0 with MicroStrategy 9.3.1. The ODBC driver is the same for both Hive and Impala; only the driver settings change.

Connecting to a Hadoop database
To show how we can connect to a Hadoop database, I will use two virtual machines: one with the MicroStrategy Suite and the second with the Cloudera Hadoop distribution, specifically a virtual appliance that is available for download from their website. The configuration of the Hadoop cluster is out of scope; moreover, I am not a Hadoop expert. I'll simply give some hints; feel free to use any other configuration/vendor, the procedure and ODBC parameters should be similar.

Getting ready
Start by going to http://at5.us/AppAU1. The Cloudera VM download is almost 3 GB (cloudera-quickstart-vm-4.3.0-vmware.tar.gz) and features the CDH4 version. After unpacking the archive, you'll find a cloudera-quickstart-vm-4.3.0-vmware.ovf file that can be opened with VMware, as in the following screen capture:

Accept the defaults and click on Import to generate the cloudera-quickstart-vm-4.3.0-vmware virtual machine. Before starting the Cloudera appliance, change the network card settings from NAT to Bridged since we need to access the database from another VM:

Leave the rest of the parameters as per the default and start the machine. After a while, you'll be presented with the graphical interface of CentOS Linux. If the network has started correctly, the machine should have received an IP address from your network DHCP. We need a fixed rather than dynamic address in the Hadoop VM, so:

1. Open the System | Preferences | Network Connections menu.
2. Select the name of your card (it should be something like Auto eth1) and click on Edit….
3. Move to the IPv4 Settings tab and change the Method from Automatic (DHCP) to Manual.
4. Click on the Add button to create a new address. Ask your network administrator for details here and fill in Address, Netmask, and Gateway.
5. Click on Apply… and, when prompted, type the root password cloudera and click on Authenticate. Then click on Close.
6. Check whether the change was successful by opening a Terminal window (Applications | System Tools | Terminal) and issuing the ifconfig command; the answer should include the address that you typed in step 4.
7. From the MicroStrategy Suite virtual machine, test whether you can ping the Cloudera VM.

When we first start Hadoop, there are no tables in the database, so we create the samples:

1. In the Cloudera virtual machine, from the main page in Firefox, open Cloudera Manager and click on I Agree in the Information Assurance Policy dialog.
2. Log in with username admin and password admin.
3. Look for a line with a service named oozie1; notice that it is stopped.
4. Click on the Actions button and select Start…. Confirm with the Start button in the dialog.
5. A Status window will pop up; wait until the Progress is reported as Finished and close it.
6. Now click on the Hue button in the bookmarks toolbar.
7. Sign up with username admin and password admin; you are now in the Hue home page.
8. Click on the first button in the blue Hue toolbar (tool tip: About Hue) to go to the Quick Start Wizard.
9. Click on the Next button to go to the Step 2: Examples tab.
10. Click on Beeswax (Hive UI) and wait until a message over the toolbar says Examples refreshed.
11. Now, in the Hue toolbar, click on the seventh button from the left (tool tip: Metastore Manager); you will see the default database with two tables: sample_07 and sample_08.
12. Enable the checkbox of sample_08 and click on the Browse Data button. After a while, the Results tab shows a grid with data.

So far so good. We now go back to the Cloudera Manager to start the Impala service:

1. Click on the Cloudera Manager bookmark button.
2. In the Impala1 row, open the Actions menu and choose Start…, then confirm Start.
3. Wait until the Progress says Finished, then click on Close in the command details window.
4. Go back to Hue and click on the fourth button on the toolbar (tool tip: Cloudera Impala (TM) Query UI).
5. In the Query Editor text area, type select * from sample_08 and click on Execute to see the table content.

Next, we open the MicroStrategy virtual machine and download the 32-bit Cloudera ODBC Driver for Apache Hive, Version 2.0, from http://at5.us/AppAU2. Download the ClouderaHiveODBCSetup_v2_00.exe file and save it in C:\install.

How to do it...
We install the ODBC driver:

1. Run C:\install\ClouderaHiveODBCSetup_v2_00.exe and click on the Next button until you reach Finish at the end of the setup, accepting every default.
2. Go to Start | All Programs | Administrative Tools | Data Sources (ODBC) to open the 32-bit ODBC Data Source Administrator (if you're on 64-bit Windows, it's in the SysWOW64 folder).
3. Click on System DSN and hit the Add… button.
4. Select Cloudera ODBC Driver for Apache Hive and click on Finish.
5. Fill in the Hive ODBC DSN Configuration with these case-sensitive parameters (change the Host IP according to the address used in step 4 of the Getting ready section):
   - Data Source Name: Cloudera VM
   - Host: 192.168.1.40
   - Port: 21050
   - Database: default
   - Type: HS2NoSasl
6. Click on OK and then on OK again to close the ODBC Data Source Administrator.
7. Now open the MicroStrategy Desktop application and log in with administrator and the corresponding password.
8. Right-click on MicroStrategy Analytics Modules and select Create New Project….
9. Click on the Create project button and name it HADOOP, uncheck Enable Change Journal for this project, and click on OK.
10. When the wizard finishes creating the project, click on Select tables from the Warehouse Catalog and hit the button labeled New….
11. Click on Next and type Cloudera VM in the Name textbox of the Database Instance Definition window. In this same window, open the Database type combobox and scroll down until you find Generic DBMS. Click on Next.
12. In Local system ODBC data sources, pick Cloudera VM and type admin in both the Database login and Password textboxes. Click on Next, then on Finish, and then on OK.
13. When a Warehouse Catalog Browser error appears, click on Yes.
14. In the Warehouse Catalog Options window, click on Edit… on the right, below Cloudera VM.
15. Select the Advanced tab and enable the radio button labeled Use 2.0 ODBC calls in the ODBC Version group. Click on OK.
16. Now select the category Catalog | Read Settings in the left tree and enable the first radio button, labeled Use standard ODBC calls to obtain the database catalog. Click on OK to close this window.
17. When the Warehouse Catalog window appears, click on the lightning button (tool tip: Read the Warehouse Catalog) to refresh the list of available tables. Pick sample_08 and move it to the right of the shopping cart. Then right-click on it and choose Import Prefix.
18. Click on Save and Close and then on OK twice to close the Project Creation Assistant.

You can now open the project and update the schema. From here, the procedure to create objects is the same as in any other project:

1. Go to the Schema Objects | Attributes folder and create a new Job attribute with these columns:
   - ID: Table: sample_08, Column: code
   - DESC: Table: sample_08, Column: description
2. Go to the Fact folder and create a new Salary fact with the salary column. Update the schema.
3. Go to the Public Objects | Metrics folder and create a new Salary metric based on the Salary fact with Sum as the aggregation function.
4. Go to My Personal Objects | My Reports and create a new report with the Job attribute and the Salary metric:

There you go; you just created your first Hadoop report.

How it works...
Executing Hadoop reports is no different from running any other standard DBMS report. The ODBC driver handles the communication with the Cloudera machine, and Impala manages the creation of jobs to retrieve data. From the MicroStrategy perspective, it is just another SELECT query that returns a dataset. (A minimal connectivity sketch outside MicroStrategy appears at the end of this section.)

There's more...
Impala and Hive do not support the whole set of ANSI SQL syntax, so in some cases you may receive an error if a specific feature is not implemented. See the Cloudera documentation for details.

HP Vertica
Vertica Analytic Database is grid-based, column-oriented, and designed to manage large, fast-growing volumes of data while providing rapid query performance. It features a storage organization that favors SELECT statements over UPDATE and DELETE, plus a high compression that stores columns of a homogeneous datatype together. The Community (free) Edition allows up to three hosts and 1 TB of data, which is fairly sufficient for small to medium BI projects with MicroStrategy. There are several clients available for different operating systems, including 32-bit and 64-bit ODBC drivers for Windows.
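As referenced in the How it works section above, you can sanity-check the Cloudera VM data source from any ODBC-capable client before involving MicroStrategy. The following is a minimal sketch, not part of the setup described in this article: it assumes the pyodbc package is installed on the Windows machine, and the query simply mirrors the Job attribute (code, description) and Salary metric (sum of salary) built in the HADOOP project.

```python
# Hypothetical sanity check of the "Cloudera VM" DSN from Python; pyodbc is an assumption,
# not part of the article's setup. The SQL mirrors the Job attribute and Salary metric.
import pyodbc

conn = pyodbc.connect("DSN=Cloudera VM", autocommit=True)  # DSN defined in the recipe
cursor = conn.cursor()
cursor.execute("""
    SELECT code, description, SUM(salary) AS total_salary
    FROM sample_08
    GROUP BY code, description
    ORDER BY total_salary DESC
    LIMIT 10
""")
for code, description, total_salary in cursor.fetchall():
    print(code, description, total_salary)
conn.close()
```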