
How-To Tutorials - Data

1204 Articles

Analyzing Financial Data in QlikView

Packt
15 Sep 2015
8 min read
In this article by Diane Blackwood, author of the book QlikView for Finance, we look at how QlikView is an easy-to-use business intelligence product designed to facilitate ad hoc relationship analysis. However, it can also be used by financial users in formal corporate performance applications. QlikView uses a direct-discovery methodology to analyze data from multiple sources, letting you do your own business discovery and taking you out of the data management stage and into the data relationship investigation stage. Investigating relationships and outliers in financial data can lead to more effective management. (For more resources related to this topic, see here.)

You could use QlikView when you wish to analyze and quickly see trends and exceptions that, with normal financial application-oriented BI products, would not be readily apparent without days of consultant and technology department setup. With QlikView, you can also analyze data relationships that are not measured in monetary units. Certainly, QlikView can be used to analyze sales trends and stock performance, but other relationships soon become apparent when you start using QlikView. Also, with the free downloadable personal edition of QlikView, you can start analyzing your own data right away.

QlikView consists of two parts:
The sheet: This can contain sheet objects, such as charts or list boxes, which show clickable information.
The load script: This stores information about the data and the data sources that the data is coming from.

Financial professionals are always using Excel to examine their data, and we can load data from an Excel sheet into QlikView. This can also help you to create a basic document sheet containing a chart. The newest version of QlikView comes with sample Sales Order data that can be used to investigate and create sheet objects. In order to use data from other file types, you can use the File Wizard (Type), which you start from the Edit Script dialog by clicking on the Table Files button. Using the Edit Script dialog, you can view your load script, edit it, and add other data sources. You can also reload your data by clicking on the Reload button. If you just want to analyze the information in an existing QlikView file, you do not need to work with the script at all.

We will use some sample financial data that was downloaded from an ERP system to Excel in order to demonstrate how an analysis might work. Our QlikView Financial Analysis of Cheyenne Company will appear as follows:

Figure 1: Our Financial Analysis QlikView Application

When we create objects for analysis purposes in QlikView, the drop-down menu shows that there are multiple sheet object types to choose from, such as List Box, Statistics Box, Chart, Input Box, Current Selections Box, MultiBox, Table Box, Button, Text Object, Line/Arrow Object, Slider/Calendar Object, and Bookmark Object. In our example, we chose the Statistics Box sheet object to add the grand total to our analysis. From this, we can see that the total company is out of balance by $1.59. From an auditor's point of view, this amount is probably small enough to be immaterial, but, from our point of view as financial professionals, we want to know where our books are falling out of balance. To make our investigation easier, we should add one additional sheet object: a List Box for Company. This is done by right-clicking to open the context menu and selecting New Sheet Object and then List Box.
Figure 2: Added Company List Box

We can now see that we are actually out of balance in three companies. Cheyenne Co. L.P. is out by $1.59, while Cheyenne Holding and Cheyenne National Inc. have entries that balance at the total-companies level but do not balance at the individual company level. We can analyze our data using the list boxes just by selecting a Company and viewing the Account Groups and Cost Centers that are included (white) and excluded (gray). This is the standard QlikView color scheme; our selected company is shown in green and in the Current Selections Box. By selecting Cheyenne Holding, we can verify that it is indeed a holding company and does not have any manufacturing or sales account groups or cost centers. Alternatively, if we choose Provo, we can see that it is in balance.

To load more than one spreadsheet or load from a different data source, we must edit the load script. From the Edit Script interface, we can modify and execute a script that connects the QlikView document to an ODBC data source or to data files of different types and grabs the data source information as well. Our first script was generated automatically, but scripts can be typed manually, or automatically generated scripts can be modified. Complex script statements must, at least partially, be entered manually. The Edit Script dialog uses autocomplete, so, as you type, the program tries to predict what is wanted in the script without you having to type it completely. The predictions include words that are part of the script syntax, and the script is also color coded by syntax component. The Edit Script interface and behavior can be customized to your preferences by selecting Tools and then Editor Preferences. A menu bar is found at the top of the Edit Script dialog with various script-related commands; the most frequently used commands also appear in the toolbar, which additionally contains a drop-down list for the tabs of the Edit Script wizard.

The first script in the Edit Script interface is the automatically generated one that was created by the wizard when we started the QlikView file. The automatically generated script picks up the column names from the Excel file and puts in some default formatting scripting. The language selection that we made during the initial installation of QlikView determines the defaults assigned to this portion of the script. We can add data from multiple sources, such as ODBC links, additional Excel files, sources from the Web, FTP, and even other QlikView files.

Our first Excel file, which we used to create the initial QlikView document, is already in our script. It happened to be October 2013 data, but suppose we wanted to add another month, such as November, to our analysis. We would just navigate to the Edit Script interface from the File menu and then click on the script itself. Make sure that the cursor is at the bottom of the script, after the first Excel file path and description. If you do not position your cursor where you want your additional script information to populate, you may generate your new script code in the middle of your existing script code. If you make a mistake, click on CANCEL and start over. After navigating to the script location where you want to add your new code, click on the Table Files button (the first button in the column towards the center right, below the script). Click on NEXT through the next four screens unless you need to add column labels.
Comments can be added to scripts using // for a single line, or by surrounding the comment with a beginning /* and an ending */. Comments show up as green. After clicking on the OK button to get out of the Script Editor, there is another File menu item that can be used to verify that QlikView has correctly interpreted the joins: the Table Viewer menu item. You cannot edit in the Table view, but it is a convenient way to visualize how the table fields interact. Save the changes to the script by clicking on the OK button in the lower-right corner. Now, from the File menu, navigate to Edit Script and then to the Reload menu item and click on it to reload your data; otherwise, your new month of data will not be loaded. If you receive any error messages, the solutions can be researched in QlikView Help.

In this case, the column headers were the same, so QlikView knew to add the data from the two spreadsheets together into one table. However, because of this, if we look at our Company List Box and Amount Statistics Box, we see everything added together.

Figure 3: Data Doubled after Reload with Additional File

The reason this data is doubled is that we do not have any way to split the months or select only October or November. Now that we have more than one month of data, we can add another List Box with the months. This will automatically link up to our Chart and Straight Table sheet objects to separate our monthly data. Once added, we can select OCTOBER or NOVEMBER from our new List Box, and our sheet objects automatically show the correct sum for the individual months. We can then use the List Box and linked objects to further analyze our financial data.

Summary

You can find further books on QlikView published by Packt on the Packt website, http://www.packtpub.com. Some of them are listed as follows:
Learning QlikView Data Visualization by Karl Pover
Predictive Analytics using Rattle and QlikView by Ferran Garcia Pagans

Resources for Article:
Further resources on this subject:
Common QlikView script errors [article]
Securing QlikView Documents [article]
Conozca QlikView [article]


Apache Spark

Packt
14 Sep 2015
9 min read
In this article by Mike, author of the book Mastering Apache Spark, many Hadoop-based tools built on a Hadoop CDH cluster are introduced. (For more resources related to this topic, see here.) His premise, when approaching any big data system, is that none of the components exists in isolation. There are many functions that need to be addressed in a big data system, with components passing data along an ETL (Extract, Transform, and Load) chain or calling subcomponents to carry out processing. Some of these functions are:
Data Movement
Scheduling
Storage
Data Acquisition
Real-Time Data Processing
Batch Data Processing
Monitoring
Reporting
This list is not exhaustive, but it gives you an idea of the functional areas that are involved. For instance, HDFS (Hadoop Distributed File System) might be used for storage, Oozie for scheduling, Hue for monitoring, and Spark for real-time processing. His point, though, is that none of these systems exists in isolation; they either sit in an ETL chain when processing data and rely on other subcomponents, as Oozie does, or depend on other components to provide functionality that they do not have. His contention is that integration between big data systems is an important factor. One needs to consider where the data is coming from, how it will be processed, and where it is going. Given this consideration, the integration options for a big data component need to be investigated both in terms of what is available now and what might be available in the future.

In the book, the author has organized the system functionality by chapter and tried to determine what tools might be available to carry out these functions. Then, with the help of simple examples using code and data, he has shown how the systems might be used together. The book is based upon Apache Spark, so, as you might expect, it investigates the four main functional modules of Spark:
MLlib for machine learning
Streaming for data stream processing
SQL for data processing in a tabular format
GraphX for graph-based processing
However, the book attempts to extend these common, real-time big data processing areas by examining extra areas such as graph-based storage and real-time cloud-based processing via Databricks. It provides examples of integration with external tools, such as Kafka and Flume, as well as Scala-based development examples. In order to Spark your interest, and prepare you for the book's contents, he has described the contents of the book by subject and given you a sample of the content.

Overview

The introduction sets the scene for the book by examining topics such as Spark cluster design and the choice of cluster managers. It considers the issues affecting cluster performance and explains how real-time big data processing can be carried out in the cloud. The following diagram describes the topics that are explained in the book:

The Spark Streaming examples are provided along with details for checkpointing to avoid data loss. Installation and integration examples are provided for Kafka (messaging) and Flume (data movement). The functionality of Spark MLlib is extended via 0xdata H2O, and an example deep learning neural system is created and tested. Spark SQL is investigated and integrated with Hive to show that Spark can become a real-time processing engine for Hive. Spark storage is considered, by example, using Aurelius (Datastax) Titan along with underlying storage in HBase and Cassandra.
The use of Tinkerpop and the Gremlin shell is explained by example for graph processing. Finally, of course, many methods of integrating Spark with HDFS are shown with the help of examples. This gives you a flavor of what is in the book, but it doesn't give you the detail. Keep reading to find out what is in each area.

Spark MLlib

The Spark MLlib chapter examines data classification with Naïve Bayes, data clustering with K-Means, and neural processing with ANN (Artificial Neural Network). If these terms do not mean anything to you, don't worry. They are explained both in terms of theory and practically with examples. The author has always been interested in neural networks, and was pleased to be able to base the ANN section on the work by Bert Greevenbosch (www.bertgreevenbosch.nl). This allows him to show how Apache Spark can be built from source code and extended, in the same process, with extra functionality. The following diagram shows a real, biological neuron on the left and a simulated neuron on the right. It also explains, step by step, how computational neurons are simulated from the real neurons in your head. The chapter then goes on to describe how neural networks are created and how processing takes place. It's an interesting topic: the integration of big data systems and neural processing.

Spark Streaming

An important issue when processing stream-based data is failure recovery. Here, we examine error recovery and checkpointing with the help of an Apache Spark example. The chapter also provides examples of TCP, file, Flume, and Kafka-based stream processing using Spark. Even though he has provided step-by-step, code-based examples, data stream processing can become complicated. He has tried to reduce complexity so that learning does not become a challenge. For example, when introducing a Kafka-based example, the following diagram is used to explain the test components, the data flow, and the component setup in a logical, step-by-step manner:

Spark SQL

When introducing Spark SQL, he describes the data file formats that might be used to assist with data integration. He then moves on to describe, with the help of an example, the use of data frames, followed closely by practical SQL examples. Finally, integration with Apache Hive is introduced to provide big data warehouse real-time processing by example. User-defined functions are also explained, showing how they can be defined in multiple ways and used with Spark SQL.

Spark GraphX

Graph processing is examined by showing how a simple graph can be created in Scala. Then, sample graph algorithms such as PageRank and Triangles are introduced. With permission from Kenny Bastani (http://www.kennybastani.com/), the Mazerunner prototype application is discussed. A step-by-step approach describes how Docker, Neo4j, and Mazerunner can be installed. Then, the functionality of both Neo4j and Mazerunner is used to move data between Neo4j and HDFS. The following diagram gives an overview of the architecture that will be introduced:

Spark storage

Apache Spark is a highly functional, real-time, distributed big data processing system; however, it does not provide any data storage. In many places within the book, examples are provided for using HDFS-based storage, but what if you want graph-based storage? What if you want to process and store data as a graph? The Aurelius (Datastax) Titan graph database is examined in the book, and the underlying storage options with Cassandra and HBase are used with Scala examples.
Graph-based processing is examined using Tinkerpop and Gremlin-based scripts. Using a simple, example-based approach, both the architecture involved and multiple ways of using the Gremlin shell are introduced in the following diagram:

Spark H2O

While Apache Spark is highly functional and agile, allowing data to move easily between its modules, how might we extend it? By considering the H2O product from http://h2o.ai/, the machine learning functionality of Apache Spark can be extended: H2O plus Spark equals Sparkling Water. Sparkling Water is used to create a deep learning neural processing example for data processing. The H2O web-based Flow application is also introduced for analytics and data investigation.

Spark Databricks

Having created big data processing clusters on physical machines, the next logical step is to move processing into the cloud. This might be carried out by obtaining cloud-based storage, using Spark as a cloud-based service, or using a Spark-based management system. The people who designed Apache Spark have created a Spark cloud-based processing platform called Databricks (https://databricks.com/). He has dedicated two chapters in the book to this service, because he feels that it is important to investigate future trends. All aspects of Databricks are examined, from user and cluster management to the use of Notebooks for data processing. The languages that can be used are investigated, as are ways of developing code on local machines and then moving it to the cloud in order to save money. Data import is examined with examples, as is the DbUtils package for data processing. The REST interface for Spark cloud instance management is investigated, because it offers integration options between your potential cloud instance and external systems. Finally, options for moving data and functionality are investigated in terms of data and folder import/export, along with library import and cluster creation on demand.

Databricks visualisation

The various options for cloud-based big data visualization using Databricks are investigated. Multiple ways are described for creating reports with the help of tables and SQL bar graphs. Pie charts and world maps are used to present data. Databricks allows geolocation data to be combined with your raw data to create geographical real-time charts. The following figure, taken from the book, shows the result of a worked example combining GeoNames data with geolocation data; the color-coded, country-based data counts are the result. It's difficult to demonstrate this in a book, but imagine this map, based upon stream-based data, continuously updating in real time. In a similar way, it is possible to create dashboards from your Databricks reports and make them available to your external customers via a web-based URL.

Summary

Mike hopes that this article has given you an idea of the book's contents, and also that it has intrigued you, so that you will search out a copy of the Spark-based book, Mastering Apache Spark, and try out all of these examples for yourself. The book comes with a code package that provides the example-based sample code, as well as build and execution scripts. This should provide you with an easy start and a platform to build your own Spark-based code.

Resources for Article:
Further resources on this subject:
Sabermetrics with Apache Spark [article]
Getting Started with Apache Spark [article]
Machine Learning Using Spark MLlib [article]


PostgreSQL in Action

Packt
14 Sep 2015
10 min read
In this article by Salahadin Juba, Achim Vannahme, and Andrey Volkov, authors of the book Learning PostgreSQL, we will discuss PostgreSQL (pronounced Post-Gres-Q-L), or Postgres, an open source, object-relational database management system. It emphasizes extensibility, creativity, and compatibility. It competes with major relational database vendors, such as Oracle, MySQL, SQL Server, and others. It is used by different sectors, including government agencies and the public and private sectors. It is cross-platform and runs on most modern operating systems, including Windows, Mac, and Linux flavors. It conforms to SQL standards and is ACID compliant. (For more resources related to this topic, see here.)

An overview of PostgreSQL

PostgreSQL has many rich features. It provides enterprise-level services, including performance and scalability. It has a very supportive community and very good documentation.

The history of PostgreSQL

The name PostgreSQL comes from post-Ingres database. The history of PostgreSQL can be summarized as follows:
Academia: University of California at Berkeley (UC Berkeley)
1977-1985, Ingres project: Michael Stonebraker created an RDBMS according to the formal relational model.
1986-1994, postgres: Michael Stonebraker created postgres in order to support complex data types and the object-relational model.
1995, Postgres95: Andrew Yu and Jolly Chen replaced the postgres query language with an extended subset of SQL.
Industry
1996, PostgreSQL: Several developers dedicated a lot of labor and time to stabilize Postgres95. The first open source version was released on January 29, 1997. With the introduction of new features and enhancements, and with the start of open source projects, the Postgres95 name was changed to PostgreSQL.
PostgreSQL began at version 6, with a very strong starting point thanks to several years of research and development. Being open source with a very good reputation, PostgreSQL has attracted hundreds of developers. Currently, PostgreSQL has innumerable extensions and a very active community.

Advantages of PostgreSQL

PostgreSQL provides many features that attract developers, administrators, architects, and companies.

Business advantages of PostgreSQL

PostgreSQL is free, open source software (OSS); it has been released under the PostgreSQL license, which is similar to the BSD and MIT licenses. The PostgreSQL license is highly permissive, and PostgreSQL is not subject to monopoly and acquisition. This gives a company the following advantages:
There is no licensing cost associated with PostgreSQL, and the number of deployments is unlimited, enabling a more profitable business model.
PostgreSQL is SQL standards compliant, so finding professional developers is not very difficult. PostgreSQL is easy to learn, and porting code from another database vendor to PostgreSQL is cost efficient.
PostgreSQL administrative tasks are easy to automate, so staffing costs are significantly reduced.
PostgreSQL is cross-platform and has drivers for all modern programming languages, so there is no need to change company policy about the software stack in order to use PostgreSQL.
PostgreSQL is scalable and has high performance.
PostgreSQL is very reliable; it rarely crashes. PostgreSQL is also ACID compliant, which means that it can tolerate some hardware failure. In addition, it can be configured and installed as a cluster to ensure high availability (HA).
User advantages of PostgreSQL

PostgreSQL is very attractive for developers, administrators, and architects; it has rich features that enable developers to perform tasks in an agile way. The following are some attractive features for developers:
There is a new release almost every year; until now, starting from Postgres95, there have been 23 major releases.
Very good documentation and an active community enable developers to find and solve problems quickly. The PostgreSQL manual is more than 2,500 pages in length.
A rich extension repository enables developers to focus on the business logic. It also enables developers to meet requirement changes easily.
The source code is available free of charge and can be customized and extended without a huge effort.
Rich client and administrative tools enable developers to perform routine tasks, such as describing database objects, exporting and importing data, and dumping and restoring databases, very quickly.
Database administration tasks do not require a lot of time and can be automated.
PostgreSQL can be integrated easily with other database management systems, giving software architects good flexibility when putting together software designs.

Applications of PostgreSQL

PostgreSQL can be used for a variety of applications. The main PostgreSQL application domains can be classified into two categories:
Online transactional processing (OLTP): OLTP is characterized by a large number of CRUD operations, very fast processing of operations, and maintaining data integrity in a multi-access environment. Performance is measured in the number of transactions per second.
Online analytical processing (OLAP): OLAP is characterized by a small number of requests, complex queries that involve data aggregation, a huge amount of data from different sources and in different formats, data mining, and historical data analysis.
OLTP is used to model business operations, such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database is relatively small in size compared to an OLAP database. OLTP normally follows relational model concepts, such as normalization, when designing the database, while OLAP is less relational and the schema is often star shaped. Unlike OLTP, the main operation of OLAP is data retrieval. OLAP data is often generated by a process called Extract, Transform, and Load (ETL). ETL is used to load data into the OLAP database from different data sources and different formats. PostgreSQL can be used out of the box for OLTP applications. For OLAP, there are many extensions and tools to support it, such as the PostgreSQL COPY command and Foreign Data Wrappers (FDW).

Success stories

PostgreSQL is used in many application domains, including communication, media, geographical, and e-commerce applications. Many companies provide consultation as well as commercial services, such as migrating proprietary RDBMSs to PostgreSQL in order to cut licensing costs. These companies often influence and enhance PostgreSQL by developing and submitting new features. The following are a few companies that use PostgreSQL:
Skype uses PostgreSQL to store user chats and activities. Skype has also affected PostgreSQL by developing many tools, called Skytools.
Instagram is a social networking service that enables its users to share pictures and photos. Instagram has more than 100 million active users.
The American Chemical Society (ACS): More than one terabyte of data for their journal archive is stored using PostgreSQL.
In addition to the preceding list of companies, PostgreSQL is used by HP, VMware, and Heroku. PostgreSQL is used by many scientific communities and organizations, such as NASA, due to its extensibility and rich data types.

Forks

There are more than 20 PostgreSQL forks; PostgreSQL's extensible APIs make postgres a great candidate to fork. Over the years, many groups have forked PostgreSQL and contributed their findings back to it. The following is a list of popular PostgreSQL forks:
HadoopDB is a hybrid between the PostgreSQL RDBMS and MapReduce technologies, targeting analytical workloads.
Greenplum is a proprietary DBMS that was built on the foundation of PostgreSQL. It utilizes shared-nothing and massively parallel processing (MPP) architectures. It is used as a data warehouse and for analytical workloads.
The EnterpriseDB advanced server is a proprietary DBMS that provides Oracle capabilities to cap Oracle fees.
Postgres-XC (eXtensible Cluster) is a multi-master PostgreSQL cluster based on the shared-nothing architecture. It emphasizes write scalability and provides the same APIs to applications that PostgreSQL provides.
Vertica is a column-oriented database system, which was started by Michael Stonebraker in 2005 and acquired by HP in 2011. Vertica reused the SQL parser, semantic analyzer, and standard SQL rewrites from the PostgreSQL implementation.
Netezza is a popular data warehouse appliance solution that was started as a PostgreSQL fork.
Amazon Redshift is a popular data warehouse management system based on PostgreSQL 8.0.2. It is mainly designed for OLAP applications.

The PostgreSQL architecture

PostgreSQL uses the client/server model; the client and server programs could be on different hosts. The communication between the client and server is normally done via TCP/IP protocols or Linux sockets. PostgreSQL can handle multiple connections from a client. A common PostgreSQL setup consists of the following operating system processes:
Client process or program (frontend): The database frontend application performs a database action. The frontend could be a web server that wants to display a web page or a command-line tool used to perform maintenance tasks. PostgreSQL provides frontend tools such as psql, createdb, dropdb, and createuser.
Server process (backend): The server process manages database files, accepts connections from client applications, and performs actions on behalf of the client; the server process name is postgres. PostgreSQL forks a new process for each new connection; thus, the client and server processes communicate with each other without the intervention of the server's main process (postgres), and they have a certain lifetime, determined by accepting and terminating a client connection.

The abstract architecture of PostgreSQL

The aforementioned abstract, conceptual PostgreSQL architecture can give an overview of PostgreSQL's capabilities and its interactions with the client as well as the operating system. The PostgreSQL server can be divided roughly into four subsystems, as follows:
Process manager: The process manager manages client connections, such as the forking and terminating of processes.
Query processor: When a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop determines the query type. A utility query is passed to the utilities subsystem.
Select, insert, update, and delete queries are rewritten by the rewriter, and then an execution plan is generated by the planner; finally, the query is executed, and the result is returned to the client.
Utilities: The utilities subsystem provides the means to maintain the database, such as claiming storage, updating statistics, exporting and importing data in a certain format, and logging.
Storage manager: The storage manager handles the memory cache, disk buffers, and storage allocation.
Almost all PostgreSQL components can be configured, including the logger, planner, statistical analyzer, and storage manager. PostgreSQL configuration is governed by the nature of the application, such as OLAP or OLTP. The following diagram shows the PostgreSQL abstract, conceptual architecture:

PostgreSQL's abstract, conceptual architecture

The PostgreSQL community

PostgreSQL has a very cooperative, active, and organized community. In the last 8 years, the PostgreSQL community published eight major releases. Announcements are brought to developers via the PostgreSQL weekly newsletter. There are dozens of mailing lists organized into categories, such as users, developers, and associations. Examples of user mailing lists are pgsql-general, psql-doc, and psql-bugs. pgsql-general is a very important mailing list for beginners. All non-bug-related questions about PostgreSQL installation, tuning, basic administration, and PostgreSQL features, as well as general discussions, are submitted to this list. The PostgreSQL community runs a blog aggregation service called Planet PostgreSQL—https://planet.postgresql.org/. Several PostgreSQL developers and companies use this service to share their experience and knowledge.

Summary

PostgreSQL is an open source, object-relational database system. It supports many advanced features and complies with the ANSI-SQL:2008 standard. It has won industry recognition and user appreciation. The PostgreSQL slogan, "The world's most advanced open source database," reflects the sophistication of PostgreSQL's features. PostgreSQL is the result of many years of research and collaboration between academia and industry. Companies in their infancy often favor PostgreSQL due to licensing costs, and PostgreSQL can aid profitable business models. PostgreSQL is also favoured by many developers because of its capabilities and advantages.

Resources for Article:
Further resources on this subject:
Introducing PostgreSQL 9 [article]
PostgreSQL – New Features [article]
Installing PostgreSQL [article]


Introducing Bayesian Inference

Packt
14 Sep 2015
29 min read
In this article by Dr. Hari M. Kudovely, the author of Learning Bayesian Models with R, we will look at Bayesian inference in depth. The Bayes theorem is the basis for updating beliefs or model parameter values in Bayesian inference, given the observations. In this article, a more formal treatment of Bayesian inference will be given. To begin with, let us try to understand how uncertainties in a real-world problem are treated in the Bayesian approach. (For more resources related to this topic, see here.)

Bayesian view of uncertainty

Classical or frequentist statistics typically takes the view that any physical process generating data containing noise can be modeled by a stochastic model with fixed values of parameters. The parameter values are learned from the observed data through procedures such as maximum likelihood estimation. The essential idea is to search the parameter space to find the parameter values that maximize the probability of observing the data seen so far. Neither the uncertainty in the estimation of model parameters from data, nor the uncertainty in the model itself that explains the phenomena under study, is dealt with in a formal way. The Bayesian approach, on the other hand, treats all sources of uncertainty using probabilities. Therefore, neither the model used to explain an observed dataset nor its parameters are fixed; both are treated as uncertain variables. Bayesian inference provides a framework to learn the entire distribution of model parameters, not just the values that maximize the probability of observing the given data. The learning can come from both the evidence provided by observed data and domain knowledge from experts. There is also a framework to select the best model among the family of models suited to explain a given dataset. Once we have the distribution of model parameters, we can eliminate the effect of the uncertainty of parameter estimation on the future values of a random variable predicted using the learned model. This is done by averaging over the model parameter values through marginalization of the joint probability distribution.

Consider the joint probability distribution of N random variables again, P(X1, X2, ..., XN, θ | m). This time, we have added one more term, m, to the argument of the probability distribution, in order to indicate explicitly that the parameters θ are generated by the model m. Then, according to the Bayes theorem, the probability distribution of the model parameters conditioned on the observed data X and model m is given by:
P(θ | X, m) = P(X | θ, m) P(θ | m) / P(X | m)
Formally, the term on the LHS of the equation, P(θ | X, m), is called the posterior probability distribution. The second term appearing in the numerator of the RHS, P(θ | m), is called the prior probability distribution. It represents the prior belief about the model parameters, before observing any data, say, from the domain knowledge. Prior distributions can also have parameters, and these are called hyperparameters. The term P(X | θ, m) is the likelihood of model m explaining the observed data. Since the denominator P(X | m) does not depend on θ, it can be considered as a normalization constant. The preceding equation can be rewritten in an iterative form as follows:
P(θ | X1, ..., Xn, m) ∝ P(Xn | θ, m) P(θ | X1, ..., Xn-1, m)
Here, Xn represents the values of observations that are obtained at time step n, P(θ | X1, ..., Xn-1, m) is the parameter distribution updated until time step n - 1, and P(θ | X1, ..., Xn, m) is the model parameter distribution updated after seeing the observations Xn at time step n.
Casting the Bayes theorem in this iterative form is useful for online learning, and it suggests the following:
Model parameters can be learned in an iterative way as more and more data or evidence is obtained.
The posterior distribution estimated using the data seen so far can be treated as a prior model when the next set of observations is obtained.
Even if no data is available, one could make predictions based on a prior distribution created using the domain knowledge alone.
To make these points clear, let's take a simple illustrative example. Consider the case where one is trying to estimate the distribution of the height of males in a given region. The data used for this example is the height measurement in centimeters obtained from M volunteers sampled randomly from the population. We assume that the heights are distributed according to a normal distribution with mean μ and variance σ². As mentioned earlier, in classical statistics, one tries to estimate the values of μ and σ² from the observed data. Apart from the best estimate value for each parameter, one could also determine an error term of the estimate. In the Bayesian approach, on the other hand, μ and σ² are also treated as random variables. Let's, for simplicity, assume that σ² is a known constant. Also, let's assume that the prior distribution for μ is a normal distribution with (hyper)parameters μ0 and σ0². In this case, the expression for the posterior distribution of μ is given by:
P(μ | X, σ²) ∝ N(μ | μ0, σ0²) ∏ N(xi | μ, σ²)
Here, for convenience, we have used the notation X for the set of observations x1, x2, ..., xM. It is a simple exercise to expand the terms in the product and complete the squares in the exponential. Though the resulting expression for the posterior distribution looks complex, it has a very simple interpretation. The posterior distribution is also a normal distribution, with the following mean:
μM = (σ0² / (σ0² + σ²/M)) x̄ + ((σ²/M) / (σ0² + σ²/M)) μ0
Here, x̄ represents the sample mean. The variance is as follows:
σM² = 1 / (1/σ0² + M/σ²)
The posterior mean is a weighted sum of the prior mean μ0 and the sample mean x̄. As the sample size M increases, the weight of the sample mean increases and that of the prior decreases. Similarly, the posterior precision (the inverse of the variance) is the sum of the prior precision 1/σ0² and the precision of the sample mean M/σ²:
1/σM² = 1/σ0² + M/σ²
As M increases, the contribution of precision from the observations (evidence) outweighs that from the prior knowledge.
Let's take a concrete example where we consider an age distribution with population mean 5.5 and population standard deviation 0.5. We sample 10,000 people from this population by using the following R script:
>set.seed(100)
>age_samples <- rnorm(10000,mean = 5.5,sd = 0.5)
We can calculate the posterior mean using the following R function:
>age_mean <- function(n){
   mu0 <- 5
   sd0 <- 1
   mus <- mean(age_samples[1:n])
   sds <- sd(age_samples[1:n])
   mu_n <- (sd0^2/(sd0^2 + sds^2/n)) * mus + (sds^2/n/(sd0^2 + sds^2/n)) * mu0
   mu_n
 }
>samp <- c(25,50,100,200,400,500,1000,2000,5000,10000)
>mu <- sapply(samp,age_mean,simplify = "array")
>plot(samp,mu,type="b",col="blue",ylim=c(5.3,5.7),xlab="no of samples",ylab="estimate of mean")
>abline(5.5,0)
One can see that, as the number of samples increases, the estimated mean asymptotically approaches the population mean. The initial low value is due to the influence of the prior, which is, in this case, 5.0. This simple and intuitive picture of how the prior knowledge and the evidence from observations contribute to the overall model parameter estimate holds in any Bayesian inference. The precise mathematical expression for how they combine would be different.
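As a small extension of the preceding example (this sketch is not part of the original article), the same closed-form update can also return the posterior standard deviation, which shows the posterior uncertainty shrinking as more data arrives. The helper name posterior_normal is hypothetical; the prior hyperparameters mu0 = 5 and sd0 = 1 match the example above, and age_samples is the vector generated earlier:
>posterior_normal <- function(n, mu0 = 5, sd0 = 1){
   xbar <- mean(age_samples[1:n])       # sample mean of the first n observations
   s <- sd(age_samples[1:n])            # sample standard deviation, treated as the known sigma
   w <- sd0^2/(sd0^2 + s^2/n)           # weight given to the evidence
   mu_n <- w * xbar + (1 - w) * mu0     # posterior mean: weighted sum of sample mean and prior mean
   prec_n <- 1/sd0^2 + n/s^2            # posterior precision: prior precision plus data precision
   c(mean = mu_n, sd = sqrt(1/prec_n))  # posterior standard deviation shrinks as n grows
 }
>posterior_normal(100)
>posterior_normal(10000)
Comparing the two calls shows the posterior mean moving towards the sample mean and the posterior standard deviation decreasing, which is the precision argument made above expressed in code.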
Therefore, one could start using a model for prediction with just prior information, either from the domain knowledge or from data collected in the past. Also, as new observations arrive, the model can be updated using the Bayesian scheme.

Choosing the right prior distribution

In the preceding simple example, we saw that when the likelihood function has the form of a normal distribution and the prior distribution is chosen to be normal, the posterior also turns out to be a normal distribution, and we could get a closed-form analytical expression for the posterior mean. Since the posterior is obtained by multiplying the prior and likelihood functions and normalizing by integration over the parameter variables, the form of the prior distribution has a significant influence on the posterior. This section gives some more details about the different types of prior distributions and guidelines as to which ones to use in a given context. There are different ways of classifying prior distributions in a formal way. One of the approaches is based on how much information a prior provides. In this scheme, prior distributions are classified as Informative, Weakly Informative, Least Informative, and Non-informative. Here, we take more of a practitioner's approach and illustrate some of the important classes of prior distributions commonly used in practice.

Non-informative priors

Let's start with the case where we do not have any prior knowledge about the model parameters. In this case, we want to express complete ignorance about the model parameters through a mathematical expression. This is achieved through what are called non-informative priors. For example, in the case of a single random variable x that can take any value between -∞ and ∞, the non-informative prior for its mean μ would be a uniform distribution over the whole range:
P(μ) ∝ constant
Here, the complete ignorance of the parameter value is captured through a uniform distribution function in the parameter space. Note that a uniform distribution is not a proper distribution function, since its integral over the domain is not equal to 1; therefore, it is not normalizable. However, one can use an improper distribution function for the prior as long as, once it is multiplied by the likelihood function, the resulting posterior can be normalized. If the parameter of interest is the variance σ², then by definition it can only take non-negative values. In this case, we transform the variable so that the transformed variable, log(σ²), has a uniform probability in the range from -∞ to ∞. It is easy to show, using simple differential calculus, that the corresponding non-informative distribution function in the original variable σ² would be as follows:
P(σ²) ∝ 1/σ²
Another well-known non-informative prior used in practical applications is the Jeffreys prior, which is named after the British statistician Harold Jeffreys. This prior is invariant under reparametrization of θ and is defined as proportional to the square root of the determinant of the Fisher information matrix:
P(θ) ∝ sqrt(det I(θ))
Here, it is worth discussing the Fisher information matrix a little bit. If X is a random variable distributed according to P(X | θ), we may like to know how much information observations of X carry about the unknown parameter θ. This is what the Fisher information matrix provides. It is defined as the second moment of the score (the first derivative of the logarithm of the likelihood function):
I(θ) = E[ (∂ ln P(X | θ)/∂θ) (∂ ln P(X | θ)/∂θ)^T ]
Let's take a simple two-dimensional problem to understand the Fisher information matrix and the Jeffreys prior. This example is given by Prof. D. Wittman of the University of California.
Let's consider two types of food items: buns and hot dogs. Let's assume that they are generally produced in pairs (a hot dog and bun pair), but occasionally hot dogs are also produced independently in a separate process. There are two observables, the number of hot dogs and the number of buns, and two model parameters: the production rate of pairs and the production rate of hot dogs alone. We assume that the uncertainty in the measurements of the counts of these two food products is distributed according to a normal distribution, with their respective variances. In this case, the Fisher information matrix for this problem can be written down explicitly, and its inverse corresponds to the covariance matrix.

Subjective priors

One of the key strengths of Bayesian statistics compared to classical (frequentist) statistics is that the framework allows one to capture subjective beliefs about any random variables. Usually, people will have intuitive feelings about the minimum, maximum, mean, and most probable or peak values of a random variable. For example, if one is interested in the distribution of hourly temperatures in winter in a tropical country, then people who are familiar with tropical climates, or climatology experts, will have a belief that, in winter, the temperature can go as low as 15°C and as high as 27°C, with the most probable temperature value being 23°C. This can be captured as a prior distribution through the Triangle distribution. The Triangle distribution has three parameters corresponding to a minimum value (a), the most probable value (b), and a maximum value (c). Its mean and variance are given by:
mean = (a + b + c)/3
variance = (a² + b² + c² − ab − ac − bc)/18
One can also use a PERT distribution to represent a subjective belief about the minimum, maximum, and most probable value of a random variable. The PERT distribution is a reparametrized Beta distribution, rescaled to the interval [a, c] with its shape parameters chosen so that the mode is b and the mean is (a + 4b + c)/6. The PERT distribution is commonly used for project completion time analysis, and the name originates from project evaluation and review techniques. Another area where Triangle and PERT distributions are commonly used is risk modeling.
Often, people also have a belief about the relative probabilities of values of a random variable. For example, when studying the distribution of ages in a population such as Japan or some European countries, where there are more old people than young, an expert could give relative weights for the probability of different ages in the population. This can be captured through a relative distribution, specified by a minimum value (min), a maximum value (max), the set of possible observed values ({values}), and their relative weights ({weights}). The weights need not sum to 1.

Conjugate priors

If both the prior and posterior distributions are in the same family of distributions, then they are called conjugate distributions, and the corresponding prior is called a conjugate prior for the likelihood function. Conjugate priors are very helpful for getting analytical closed-form expressions for the posterior distribution. In the simple example we considered, we saw that when the noise is distributed according to a normal distribution, choosing a normal prior for the mean resulted in a normal posterior.
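Another classic example of conjugacy, which also appears in the list that follows, is the Beta prior paired with a Binomial likelihood. As a minimal sketch (not taken from the book, with illustrative numbers), the posterior can be written down directly in R without any numerical integration:
># Beta(a, b) prior on a success probability; observe k successes in n trials.
># The posterior is Beta(a + k, b + n - k).
>a <- 2; b <- 2                            # prior hyperparameters
>n <- 50; k <- 32                          # hypothetical data: 32 successes in 50 trials
>a_post <- a + k
>b_post <- b + n - k
>a_post/(a_post + b_post)                  # posterior mean of the success probability
>qbeta(c(0.025, 0.975), a_post, b_post)    # 95% credible interval from the Beta posterior
This is exactly what conjugacy buys: the posterior stays in the Beta family, so its mean and quantiles are available in closed form.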
The following are some well-known conjugate pairs of likelihood function and prior:
Binomial likelihood (parameter: probability) — Beta prior
Poisson likelihood (parameter: rate) — Gamma prior
Categorical likelihood (parameters: probability, number of categories) — Dirichlet prior
Univariate normal likelihood with known variance (parameter: mean) — Normal prior
Univariate normal likelihood with known mean (parameter: variance) — Inverse Gamma prior

Hierarchical priors

Sometimes, it is useful to define prior distributions for the hyperparameters themselves. This is consistent with the Bayesian view that all parameters should be treated as uncertain by using probabilities. These distributions are called hyper-prior distributions. In theory, one can continue this over many levels as a hierarchical model, and this is one way of eliciting the optimal prior distributions. For example, P(θ | α) is the prior distribution with a hyperparameter α. We could define a prior distribution for α as well, P(α | β). Here, P(α | β) is the hyper-prior distribution for the hyperparameter α, parametrized by the hyper-hyper-parameter β. One can define a prior distribution for β in the same way and continue the process indefinitely. The practical reason for formalizing such models is that, at some level of the hierarchy, one can define a uniform prior for the hyperparameters, reflecting complete ignorance about the parameter distribution, and effectively truncate the hierarchy. In practical situations, this is typically done at the second level; in the preceding example, that corresponds to using a uniform distribution for the hyperparameter at that level.
I want to conclude this section by stressing one important point. Though the prior distribution has a significant role in Bayesian inference, one need not worry about it too much, as long as the prior chosen is reasonable and consistent with the domain knowledge and evidence seen so far. The reasons are that, first of all, as we gather more evidence, the significance of the prior gets washed out. Secondly, when we use Bayesian models for prediction, we average over the uncertainty in the estimation of the parameters using the posterior distribution. This averaging is the key ingredient of Bayesian inference, and it removes many of the ambiguities in the selection of the right prior.

Estimation of posterior distribution

So far, we have discussed the essential concept behind Bayesian inference and also how to choose a prior distribution. Since one needs to compute the posterior distribution of the model parameters before one can use the models for prediction, we discuss this task in this section. Though the Bayes rule has a very simple-looking form, the computation of the posterior distribution in a practically usable way is often very challenging. This is primarily because the computation of the normalization constant P(X | m) involves N-dimensional integrals when there are N parameters. Even when one uses a conjugate prior, this computation can be very difficult to carry out analytically or numerically. This was one of the main reasons for not using Bayesian inference for multivariate modeling until recent decades. In this section, we will look at various approximate ways of computing posterior distributions that are used in practice.

Maximum a posteriori estimation

Maximum a posteriori (MAP) estimation is a point estimate that corresponds to taking the maximum value, or mode, of the posterior distribution.
Though taking a point estimate does not capture the variability in the parameter estimation, it does take into account the effect of the prior distribution to some extent, when compared to maximum likelihood estimation. MAP estimation is also called poor man's Bayesian inference. From the Bayes rule, we have:
θ_MAP = argmax over θ of P(θ | X, m) = argmax over θ of P(X | θ, m) P(θ | m)
Here, for convenience, we have used the notation X for the N-dimensional vector of observations. The last relation follows because the denominator of the RHS of the Bayes rule is independent of θ. Compare this with the following maximum likelihood estimate:
θ_ML = argmax over θ of P(X | θ, m)
The difference between the MAP and ML estimates is that, whereas ML finds the mode of the likelihood function, MAP finds the mode of the product of the likelihood function and the prior.

Laplace approximation

We saw that the MAP estimate just finds the maximum value of the posterior distribution. The Laplace approximation goes one step further and also computes the local curvature around the maximum, up to quadratic terms. This is equivalent to assuming that the posterior distribution is approximately Gaussian (normal) around the maximum, which would be the case if the amount of data were large compared to the number of parameters: M >> N. The curvature is captured by A, an N x N Hessian matrix obtained from the second derivatives of the log of the posterior distribution at the mode. Evaluating the posterior at the MAP value and using the definition of conditional probability, the Laplace approximation yields an expression for the model evidence P(X | m). In the limit of a large number of samples, one can show that this expression simplifies, and the term that remains is called the Bayesian information criterion (BIC). It can be used for model selection or model comparison, and it is one of the goodness-of-fit measures for a statistical model. Another similar criterion that is commonly used is the Akaike information criterion (AIC).
Now we will discuss how BIC can be used to compare different models for model selection (a short R sketch of such a comparison appears just below). In the Bayesian framework, two models, say m1 and m2, are compared using the Bayes factor. The Bayes factor B12 is defined as the ratio of posterior odds to prior odds:
B12 = [P(m1 | X) / P(m2 | X)] / [P(m1) / P(m2)]
Here, the posterior odds is the ratio of the posterior probabilities of the two models given the data, and the prior odds is the ratio of the prior probabilities of the two models, as given in the preceding equation. If B12 > 1, model m1 is preferred by the data, and if B12 < 1, model m2 is preferred by the data. In reality, it is difficult to compute the Bayes factor because it is difficult to get the precise prior probabilities. It can be shown that, in the large-N limit, the difference in the BIC values of the two models can be viewed as a rough approximation to the logarithm of the Bayes factor.

Monte Carlo simulations

The two approximations that we have discussed so far, the MAP and Laplace approximations, are useful when the posterior is a very sharply peaked function about the maximum value. Often, in real-life situations, the posterior will have long tails. This is, for example, the case in e-commerce, where the probability of a user purchasing a product has a long tail in the space of all products. So, in many practical situations, both the MAP and Laplace approximations fail to give good results. Another approach is to directly sample from the posterior distribution. Monte Carlo simulation is a technique for sampling from the posterior distribution and is one of the workhorses of Bayesian inference in practical applications.
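Before turning to sampling-based methods, here is a small sketch (not from the book) of BIC-based model comparison using base R's BIC() function, which follows the convention that a lower BIC is better. The simulated data and model names are illustrative:
>set.seed(1)
>x1 <- rnorm(200); x2 <- rnorm(200)
>y <- 1 + 2*x1 + rnorm(200)                # y depends on x1 only
>m1 <- lm(y ~ x1)                          # the simpler, "true" model
>m2 <- lm(y ~ x1 + x2)                     # adds an irrelevant predictor
>BIC(m1); BIC(m2)                          # the extra parameter in m2 is penalized by log(n)
># Under R's convention, exp((BIC(m2) - BIC(m1))/2) gives a rough approximation
># to the Bayes factor in favour of m1.
Here, m1 should come out with the smaller BIC, illustrating how the penalty term guards against overfitting when comparing models.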
In this section, we will introduce the reader to Markov Chain Monte Carlo (MCMC) simulations and also discuss two common MCMC methods used in practice. As discussed earlier, let θ be the set of parameters that we are interested in estimating from the data through the posterior distribution. Consider the case of the parameters being discrete, where each parameter has K possible values. Set up a Markov process whose states are the possible values of θ, with a suitable transition probability matrix. The essential idea behind MCMC simulations is that one can choose the transition probabilities in such a way that the steady state distribution of the Markov chain corresponds to the posterior distribution we are interested in. Once this is done, sampling from the Markov chain output, after it has reached a steady state, will give samples of θ distributed according to the posterior distribution.
Now, the question is how to set up the Markov process in such a way that its steady state distribution corresponds to the posterior of interest. There are two well-known methods for this. One is the Metropolis-Hastings algorithm and the second is Gibbs sampling. We will discuss both in some detail here.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm was one of the first major algorithms proposed for MCMC. It has a very simple concept—something similar to a hill-climbing algorithm in optimization:
1. Let θt be the state of the system at time step t. To move the system to another state at time step t + 1, generate a candidate state θ' by sampling from a proposal distribution q(θ' | θt). The proposal distribution is chosen in such a way that it is easy to sample from.
2. Accept the proposed move with probability min(1, [P(θ') q(θt | θ')] / [P(θt) q(θ' | θt)]), where P is the posterior distribution that we want to simulate.
3. If the move is accepted, set θt+1 = θ'; if not, set θt+1 = θt.
4. Continue the process until the distribution converges to the steady state.
Under certain conditions, the preceding update rule will guarantee that, in the large time limit, the Markov process approaches a steady state distributed according to the posterior. The intuition behind the Metropolis-Hastings algorithm is simple. The proposal distribution q(θ' | θt) gives the conditional probability of proposing state θ' for the transition in the next time step from the current state θt. Therefore, P(θt) q(θ' | θt) is the probability that the system is currently in state θt and would make a transition to state θ' in the next time step. Similarly, P(θ') q(θt | θ') is the probability that the system is currently in state θ' and would make a transition to state θt in the next time step. If the ratio of these two probabilities is more than 1, accept the move; otherwise, accept the move only with the probability given by the ratio. Therefore, the Metropolis-Hastings algorithm is like a hill-climbing algorithm where one accepts all moves in the upward direction and, once in a while, accepts moves in the downward direction with a smaller probability. The downward moves help the system avoid getting stuck around a single local mode of the posterior.
Let's revisit the example of estimating the posterior distribution of the mean and variance of the height of people in a population, discussed in the introductory section. This time, we will estimate the posterior distribution by using the Metropolis-Hastings algorithm.
The following lines of R code do this job: >set.seed(100) >mu_t <- 5.5 >sd_t <- 0.5 >age_samples <- rnorm(10000,mean = mu_t,sd = sd_t) >#function to compute log likelihood >loglikelihood <- function(x,mu,sigma){ singlell <- dnorm(x,mean = mu,sd = sigma,log = T) sumll <- sum(singlell) sumll } >#function to compute prior distribution for mean on log scale >d_prior_mu <- function(mu){ dnorm(mu,0,10,log=T) } >#function to compute prior distribution for std dev on log scale >d_prior_sigma <- function(sigma){ dunif(sigma,0,5,log=T) } >#function to compute posterior distribution on log scale >d_posterior <- function(x,mu,sigma){ loglikelihood(x,mu,sigma) + d_prior_mu(mu) + d_prior_sigma(sigma) } >#function to make transition moves tran_move <- function(x,dist = .1){ x + rnorm(1,0,dist) } >num_iter <- 10000 >posterior <- array(dim = c(2,num_iter)) >accepted <- array(dim=num_iter - 1) >theta_posterior <-array(dim=c(2,num_iter)) >values_initial <- list(mu = runif(1,4,8),sigma = runif(1,1,5)) >theta_posterior[1,1] <- values_initial$mu >theta_posterior[2,1] <- values_initial$sigma >for (t in 2:num_iter){ #proposed next values for parameters theta_proposed <- c(tran_move(theta_posterior[1,t-1]) ,tran_move(theta_posterior[2,t-1])) p_proposed <- d_posterior(age_samples,mu = theta_proposed[1] ,sigma = theta_proposed[2]) p_prev <-d_posterior(age_samples,mu = theta_posterior[1,t-1] ,sigma = theta_posterior[2,t-1]) eps <- exp(p_proposed - p_prev) # proposal is accepted if posterior density is higher w/ theta_proposed # if posterior density is not higher, it is accepted with probability eps accept <- rbinom(1,1,prob = min(eps,1)) accepted[t - 1] <- accept if (accept == 1){ theta_posterior[,t] <- theta_proposed } else { theta_posterior[,t] <- theta_posterior[,t-1] } } To plot the resulting posterior distribution, we use the sm package in R: >library(sm) x <- cbind(c(theta_posterior[1,1:num_iter]),c(theta_posterior[2,1:num_iter])) xlim <- c(min(x[,1]),max(x[,1])) ylim <- c(min(x[,2]),max(x[,2])) zlim <- c(0,max(1)) sm.density(x, xlab = "mu",ylab="sigma", zlab = " ",zlim = zlim, xlim = xlim ,ylim = ylim,col="white") title("Posterior density")  The resulting posterior distribution will look like the following figure:   Though the Metropolis-Hasting algorithm is simple to implement for any Bayesian inference problem, in practice it may not be very efficient in many cases. The main reason for this is that, unless one carefully chooses a proposal distribution , there would be too many rejections and it would take a large number of updates to reach the steady state. This is particularly the case when the number of parameters are high. There are various modifications of the basic Metropolis-Hasting algorithms that try to overcome these difficulties. We will briefly describe these when we discuss various R packages for the Metropolis-Hasting algorithm in the following section. R packages for the Metropolis-Hasting algorithm There are several contributed packages in R for MCMC simulation using the Metropolis-Hasting algorithm, and here we describe some popular ones. The mcmc package contributed by Charles J. Geyer and Leif T. Johnson is one of the popular packages in R for MCMC simulations. It has the metrop function for running the basic Metropolis-Hasting algorithm. The metrop function uses a multivariate normal distribution as the proposal distribution. Sometimes, it is useful to make a variable transformation to improve the speed of convergence in MCMC. The mcmc package has a function named morph for doing this. 
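If you prefer not to hand-roll the sampler, the same posterior can be passed to the metrop function of the mcmc package mentioned above. The following is only a rough sketch: it assumes the d_posterior function and the age_samples data from the previous listing are already defined, and the starting values and the scale of the proposal are arbitrary choices that would normally need tuning.

library(mcmc)
# log unnormalized posterior as a function of the parameter vector c(mu, sigma)
lupost <- function(theta) {
  if (theta[2] <= 0) return(-Inf)   # reject non-positive standard deviations
  d_posterior(age_samples, mu = theta[1], sigma = theta[2])
}
out <- metrop(lupost, initial = c(5, 1), nbatch = 10000, scale = 0.05)
out$accept        # acceptance rate; adjust 'scale' if it is too low or too high
head(out$batch)   # each row is one posterior sample of (mu, sigma)

Compared to the manual loop above, metrop takes care of the accept/reject bookkeeping and reports the acceptance rate, which makes it easier to tune the proposal scale.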
Combining the metrop and morph functions, morph.metrop first transforms the variable, does a Metropolis step on the transformed density, and converts the results back to the original variable. Apart from the mcmc package, two other useful packages in R are MHadaptive, contributed by Corey Chivers, and the Evolutionary Monte Carlo (EMC) algorithm package by Gopi Goswami.

Gibbs sampling

As mentioned before, the Metropolis-Hastings algorithm suffers from the drawback of poor convergence, due to too many rejections, if one does not choose a good proposal distribution. To avoid this problem, two physicists, Stuart Geman and Donald Geman, proposed a new algorithm. This algorithm is called Gibbs sampling and it is named after the famous physicist J W Gibbs. Currently, Gibbs sampling is the workhorse of MCMC for Bayesian inference. Let θ = {θ1, θ2, ..., θN} be the set of parameters of the model that we wish to estimate:

1. Start with an initial state θ(0) = {θ1(0), θ2(0), ..., θN(0)}.
2. At each time step t, update the components one by one, by drawing from a distribution conditional on the most recent values of the rest of the components:
θ1(t+1) is drawn from P(θ1 | θ2(t), θ3(t), ..., θN(t), X)
θ2(t+1) is drawn from P(θ2 | θ1(t+1), θ3(t), ..., θN(t), X)
...
θN(t+1) is drawn from P(θN | θ1(t+1), θ2(t+1), ..., θN-1(t+1), X)
3. After N steps, all the components of the parameter will be updated.
4. Continue with step 2 until the Markov process converges to a steady state.

Gibbs sampling is a very efficient algorithm since there are no rejections. However, to be able to use Gibbs sampling, the form of the conditional distributions of the posterior distribution should be known; a small illustrative sketch is given at the end of this section.

R packages for Gibbs sampling

Unfortunately, there are not many contributed general-purpose Gibbs sampling packages in R. The gibbs.met package provides two generic functions for performing MCMC in a naïve way for a user-defined target distribution. The first function is gibbs_met. This performs Gibbs sampling with each 1-dimensional distribution sampled by using the Metropolis algorithm, with a normal distribution as the proposal distribution. The second function, met_gaussian, updates the whole state with an independent normal distribution centered around the previous state. The gibbs.met package is useful for general-purpose MCMC on moderate dimensional problems.

Apart from the general-purpose MCMC packages, there are several packages in R designed to solve particular types of machine-learning problems. The GibbsACOV package can be used for one-way mixed-effects ANOVA and ANCOVA models. The lda package performs collapsed Gibbs sampling methods for topic (LDA) models. The stocc package fits a spatial occupancy model via Gibbs sampling. The binomlogit package implements an efficient MCMC for Binomial Logit models. Bmk is a package for doing diagnostics of MCMC output. Bayesian Output Analysis Program (BOA) is another similar package. RBugs is an interface to the well-known OpenBUGS MCMC package. The ggmcmc package is a graphical tool for analyzing MCMC simulation. MCMCglmm is a package for generalized linear mixed models and BoomSpikeSlab is a package for doing MCMC for Spike and Slab regression. Finally, SamplerCompare is a package (more of a framework) for comparing the performance of various MCMC packages.
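To make the update scheme concrete, here is a minimal Gibbs sampler sketch for the same kind of normal model of heights. It is not taken from the original example: to keep the conditional distributions in closed form, it assumes a flat prior on the mean and an Inverse-Gamma prior on the variance, which differ from the priors used in the Metropolis-Hastings listing above.

set.seed(100)
x <- rnorm(10000, mean = 5.5, sd = 0.5)   # simulated height data, as before
n <- length(x)

num_iter <- 5000
mu_s    <- numeric(num_iter)   # posterior samples of the mean
sigma_s <- numeric(num_iter)   # posterior samples of the standard deviation
mu_s[1] <- mean(x); sigma_s[1] <- sd(x)

# assumed priors: flat prior on mu, Inverse-Gamma(a0, b0) prior on the variance
a0 <- 0.01; b0 <- 0.01

for (t in 2:num_iter) {
  # 1. draw mu conditional on the current variance: Normal(mean(x), sigma^2 / n)
  mu_s[t] <- rnorm(1, mean = mean(x), sd = sigma_s[t - 1] / sqrt(n))
  # 2. draw the variance conditional on the new mu: Inverse-Gamma(a0 + n/2, b0 + SS/2)
  ss <- sum((x - mu_s[t])^2)
  var_t <- 1 / rgamma(1, shape = a0 + n / 2, rate = b0 + ss / 2)
  sigma_s[t] <- sqrt(var_t)
}

# posterior means after discarding a burn-in (both should be close to 5.5 and 0.5)
c(mean(mu_s[-(1:500)]), mean(sigma_s[-(1:500)]))

Notice that every draw is accepted as-is; the price is that the two conditional distributions had to be worked out analytically beforehand, which is exactly the limitation mentioned above.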
Variational approximation

In the variational approximation scheme, one assumes that the posterior distribution P(θ|X) can be approximated in a factorized form:

P(θ|X) ≈ Q(θ|X) = Q1(θ1|X) Q2(θ2|X) ... QN(θN|X)

Note that the factorized form is also a conditional distribution, so each Qi(θi|X) can have dependence on the other θj's through the conditioned variable X. In other words, this is not a trivial factorization making each parameter independent.

The advantage of this factorization is that one can choose more analytically tractable forms of the distribution functions Qi(θi|X). In fact, one can vary the functions Qi in such a way that the product Q(θ|X) is as close to the true posterior P(θ|X) as possible. This is mathematically formulated as a variational calculus problem, as explained here.

Let's use some measure to compute the distance between the two probability distributions Q(θ|X) and P(θ|X), where θ = {θ1, θ2, ..., θN}. One of the standard measures of distance between probability distributions is the Kullback-Leibler divergence, or KL-divergence for short. It is defined as follows:

KL(Q||P) = ∫ Q(θ|X) ln [Q(θ|X) / P(θ|X)] dθ

The reason why it is called a divergence and not a distance is that KL(Q||P) is not symmetric with respect to Q and P. One can use the relation P(θ|X) = P(X, θ)/P(X) and rewrite the preceding expression as an equation for log P(X):

ln P(X) = L(Q) + KL(Q||P)

Here:

L(Q) = ∫ Q(θ|X) ln [P(X, θ) / Q(θ|X)] dθ

Note that, in the equation for ln P(X), there is no dependence on Q on the LHS. Therefore, maximizing L(Q) with respect to Q will minimize KL(Q||P), since their sum is a term independent of Q. By choosing analytically tractable functions for Q, one can do this maximization in practice. It will result in both an approximation for the posterior and a lower bound for ln P(X), which is the logarithm of evidence or marginal likelihood, since KL(Q||P) >= 0. Therefore, variational approximation gives us two quantities in one shot. A posterior distribution can be used to make predictions about future observations (as explained in the next section) and a lower bound for evidence can be used for model selection.

How does one implement this minimization of KL-divergence in practice? Without going into mathematical details, here we write a final expression for the solution:

ln Qi(θi|X) = E[ln P(X, θ)] + constant, where the expectation is taken over all θj with j ≠ i

Here, E[ln P(X, θ)] implies that the expectation of the logarithm of the joint distribution P(X, θ) is taken over all the parameters θj except for θi. Therefore, the minimization of KL-divergence leads to a set of coupled equations, one for each Qi, that needs to be solved self-consistently to obtain the final solution.

Though the variational approximation looks very complex mathematically, it has a very simple, intuitive explanation. The posterior distribution of each parameter θi is obtained by averaging the log of the joint distribution over all the other variables. This is analogous to the Mean Field theory in physics where, if there are N interacting charged particles, the system can be approximated by saying that each particle is in a constant external field, which is the average of the fields produced by all the other particles.

We will end this section by mentioning a few R packages for variational approximation. The VBmix package can be used for variational approximation in Bayesian mixture models. A similar package is vbdm, used for Bayesian discrete mixture models. The package vbsr is used for variational inference in Spike Regression Regularized Linear Models.

Prediction of future observations

Once we have the posterior distribution inferred from data using some of the methods described already, it can be used to predict future observations. The probability of observing a value Y, given the observed data X and the posterior distribution of the parameters θ, is given by:

P(Y|X) = ∫ P(Y|θ) P(θ|X) dθ

Note that, in this expression, the likelihood function P(Y|θ) is averaged by using the distribution of the parameter given by the posterior P(θ|X). This is, in fact, the core strength of the Bayesian inference. This Bayesian averaging eliminates the uncertainty in estimating the parameter values and makes the prediction more robust.
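As a quick illustration of this Bayesian averaging, the posterior samples produced by the Metropolis-Hastings run earlier in this article can be turned into a posterior predictive distribution for a future observation. The sketch below assumes that the theta_posterior matrix and num_iter from that listing are still in memory; the burn-in length of 1,000 iterations is an arbitrary choice.

# discard burn-in samples, keep draws of (mu, sigma) from the posterior
burn_in <- 1000
mu_draws    <- theta_posterior[1, (burn_in + 1):num_iter]
sigma_draws <- theta_posterior[2, (burn_in + 1):num_iter]
# for each posterior draw of the parameters, simulate one future observation Y
y_pred <- rnorm(length(mu_draws), mean = mu_draws, sd = sigma_draws)
# posterior predictive mean and a 95% predictive interval
mean(y_pred)
quantile(y_pred, probs = c(0.025, 0.975))

Each simulated Y uses a different (mu, sigma) pair, so the spread of y_pred reflects both the sampling noise of the model and the remaining uncertainty in the parameters, which is what the integral above expresses.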
Summary

In this article, we covered the basic principles of Bayesian inference. Starting with how uncertainty is treated differently in Bayesian statistics compared to classical statistics, we discussed the various components of Bayes' rule in depth. First, we learned about the different types of prior distributions and how to choose the right one for your problem. Then we learned how to estimate the posterior distribution using techniques such as MAP estimation, Laplace approximation, and MCMC simulations.

Resources for Article:

Further resources on this subject:

Bayesian Network Fundamentals [article]
Learning Data Analytics with R and Hadoop [article]
First steps with R [article]
Understanding Model-based Clustering

Packt
14 Sep 2015
10 min read
 In this article by Ashish Gupta, author of the book, Rapid – Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article: Learning model-based clustering Understanding Dirichlet clustering Understanding topic modeling (For more resources related to this topic, see here.) Learning model-based clustering In model-based clustering, we assume that data is generated by a model and try to get the model from the data. The right model will fit the data better than other models. In the K-means algorithm, we provide the initial set of cluster, and K-means provides us with the data points in the clusters. Think about a case where clusters are not distributed normally, then the improvement of a cluster will not be good using K-means. In this scenario, the model-based clustering algorithm will do the job. Another idea you can think of when dividing the clusters is—hierarchical clustering—and we need to find out the overlapping information. This situation will also be covered by model-based clustering algorithms. If all components are not well separated, a cluster can consist of multiple mixture components. In simple terms, in model-based clustering, data is a mixture of two or more components. Each component has an associated probability and is described by a density function. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by an EM (expectation-maximization) algorithm for maximum likelihood. The generated models are compared by a Bayesian Information criterion (BIC). The model with the lowest BIC is preferred. In the equation BIC = -2 log(L) + mlog(n), L is the likelihood function and m is the number of free parameters to be estimated. n is the number of data points. Understanding Dirichlet clustering Dirichlet clustering is a model-based clustering method. This algorithm is used to understand the data and cluster the data. Dirichlet clustering is a process of nonparametric and Bayesian modeling. It is nonparametric because it can have infinite number of parameters. Dirichlet clustering is based on Dirichlet distribution. For this algorithm, we have a probabilistic mixture of a number of models that are used to explain data. Each data point will be coming from one of the available models. The models are taken from the sample of a prior distribution of models, and points are assigned to these models iteratively. In each iteration probability, a point generated by a particular model is calculated. After the points are assigned to a model, new parameters for each of the model are sampled. This sample is from the posterior distribution of the model parameters, and it considers all the observed data points assigned to the model. This sampling provides more information than normal clustering listed as follows: As we are assigning points to different models, we can find out how many models are supported by the data. The other information that we can get is how well the data is described by a model and how two points are explained by the same model. Topic modeling In machine learning, topic modeling is nothing but finding out a topic from the text document using a statistical model. A document on particular topics has some particular words. 
For example, if you are reading an article on sports, there are high chances that you will get words such as football, baseball, Formula One and Olympics. So a topic model actually uncovers the hidden sense of the article or a document. Topic models are nothing but the algorithms that can discover the main themes from a large set of unstructured document. It uncovers the semantic structure of the text. Topic modeling enables us to organize large scale electronic archives. Mahout has the implementation of one of the topic modeling algorithms—Latent Dirichlet Allocation (LDA). LDA is a statistical model of document collection that tries to capture the intuition of the documents. In normal clustering algorithms, if words having the same meaning don't occur together, then the algorithm will not associate them, but LDA can find out which two words are used in similar context, and LDA is better than other algorithms in finding out the association in this way. LDA is a generative, probabilistic model. It is generative because the model is tweaked to fit the data, and using the parameters of the model, we can generate the data on which it fits. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, a LDA model can be represented as follows: The notation used in this image represents the following: M, N, and K represent the number of documents, the number of words in the document, and the number of topics in the document respectively. is the prior weight of the K topic in a document. is the prior weight of the w word in a topic. φ is the probability of a word occurring in a topic. Θ is the topic distribution. z is the identity of a topic of all the words in all the documents. w is the identity of all the words in all the documents. How LDA works in a map-reduce mode? So these are the steps that LDA follows in mapper and reducer steps: Mapper phase: The program starts with an empty topic model. All the documents are read by different mappers. The probabilities of each topic for each word in the document are calculated. Reducer Phase: The reducer receives the count of probabilities. These counts are summed and the model is normalized. This process is iterative, and in each iteration the sum of the probabilities is calculated and the process stops when it stops changing. A parameter set, which is similar to the convergence threshold in K-means, is set to check the changes. In the end, LDA estimates how well the model fits the data. In Mahout, the Collapsed Variation Bayes (CVB) algorithm is implemented for LDA. LDA uses a term frequency vector as an input and not tf-idf vectors. We need to take care of the two parameters while running the LDA algorithm—the number of topics and the number of words in the documents. A higher number of topics will provide very low level topics while a lower number will provide a generalized topic at high level, such as sports. In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization of hierarchical Bayesian models. An expectation step reads each document and calculates the probability of each topic for each word in every document. The maximization step takes the counts and sums all the probabilities and normalizes them. Running LDA using Mahout To run LDA using Mahout, we will use the 20 Newsgroups dataset. 
We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics. Let's run this example to view how topic modeling works in Mahout.

Dataset selection

We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

Perform the following steps to execute the CVB algorithm:

1. Create a 20newsdata directory and unzip the data here:
mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz
There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train.
2. Now, create another 20newsdataall directory and merge both the training and test data of the group. Move to the home directory and execute the following commands:
mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
3. Create a directory in Hadoop and save this data in HDFS:
hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata
4. Mahout CVB will accept the data in the vector format. For this, first generate a sequence file from the directory as follows:
bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out
5. Convert the sequence file to sparse vectors but, as discussed earlier, using the term frequency weight:
bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf
6. Convert the sparse vectors to the matrix form required by the CVB algorithm:
bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix
7. Now run the CVB algorithm on this matrix:
bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel
The parameters used in the preceding command can be explained as follows:
-i: This is the input path of the document vectors
-o: This is the output path of the topic-term distribution
-k: This is the number of latent topics
-x: This is the maximum number of iterations
-dict: This is the term dictionary file
-dt: This is the output path of the document-topic distribution
-mt: This is the model state path after each iteration
Once the command finishes, you will see the job information on the screen.
8. To view the output, run the following command:
bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt
The parameters used in the preceding command can be explained as follows:
-i: This is the input location of the CVB output
-d: This is the dictionary file location created during vector creation
-dt: This is the dictionary file type (sequence or text)
-vs: This is the vector size
-sort: This is a flag (true or false) to sort the output
-o: This is the output location on the local filesystem
Now your output will be saved in the local filesystem. Open the file and you will see the terms of each topic along with their probabilities.

Summary

In this article, we learned about model-based clustering, the Dirichlet process, and topic modeling. In model-based clustering, we tried to obtain the model from the data, while the Dirichlet process is used to understand the data.
Topic modeling helps us to identify the topics in an article or in a set of documents. We discussed how Mahout has implemented topic modeling using the latent Dirichlet process and how it is implemented in map reduce. We discussed how to use Mahout to find out the topic distribution on a set of documents. Resources for Article: Further resources on this subject: Learning Random Forest Using Mahout[article] Implementing the Naïve Bayes classifier in Mahout[article] Clustering [article]
Sabermetrics with Apache Spark

Packt
09 Sep 2015
22 min read
 In this article by Rindra Ramamonjison, the author of the book called Apache Spark Graph Processing, we will gain useful insights that are required to quickly process big data, and handle its complexities. It is not the secret analytics that have made a big impact in sports. The quest for an objective understanding of the game has a name even—"sabermetrics". Analytics has proven invaluable in many aspects, from building dream teams under tight cap constraints, to selecting game-specific strategies, to actively engaging with fans, and so on. In the following sections, we will analyze NCAA Men's college basketball game stats, gathered during a single season. As sports data experts, we are going to leverage Spark's graph processing library to answer several questions for retrospection. Apache Spark is a fast, general-purpose technology, which greatly simplifies the parallel processing of large data that is distributed over a computing cluster. While Spark handles different types of processing, here, we will focus on its graph-processing capability. In particular, our goal is to expose the powerful yet generic graph-aggregation operator of Spark—aggregateMessages. We can think of this operator as a version of MapReduce for aggregating the neighborhood information in graphs. In fact, many graph-processing algorithms, such as PageRank rely on iteratively accessing the properties of neighboring vertices and adjacent edges. By applying aggregateMessages on the NCAA College Basketball datasets, we will: Identify the basic mechanisms and understand the patterns for using aggregateMessages Apply aggregateMessages to create custom graph aggregation operations Optimize the performance and efficiency of aggregateMessages (For more resources related to this topic, see here.) NCAA College Basketball datasets As an illustrative example, the NCAA College Basketball datasets consist of two CSV datasets. This first one called teams.csv contains the list of all the college teams that played in NCAA Division I competition. Each team is associated with a 4-digit ID number. The second dataset called stats.csv contains the score and statistics of every game played during the 2014-2015 regular season. Loading team data into RDDs To start with, we parse and load these datasets into RDDs (Resilient Distributed Datasets), which are the core Spark abstraction for any data that is distributed and stored over a cluster. 
First, we create a class called GameStats that records a team's statistics during a game: case class GameStats( val score: Int, val fieldGoalMade: Int, val fieldGoalAttempt: Int, val threePointerMade: Int, val threePointerAttempt: Int, val threeThrowsMade: Int, val threeThrowsAttempt: Int, val offensiveRebound: Int, val defensiveRebound: Int, val assist: Int, val turnOver: Int, val steal: Int, val block: Int, val personalFoul: Int ) Loading game stats into RDDs We also add the following methods to GameStats in order to know how efficient a team's offense was: // Field Goal percentage def fgPercent: Double = 100.0 * fieldGoalMade / fieldGoalAttempt // Three Point percentage def tpPercent: Double = 100.0 * threePointerMade / threePointerAttempt // Free throws percentage def ftPercent: Double = 100.0 * threeThrowsMade / threeThrowsAttempt override def toString: String = "Score: " + score Next, we create a couple of classes for the games' result: abstract class GameResult( val season: Int, val day: Int, val loc: String ) case class FullResult( override val season: Int, override val day: Int, override val loc: String, val winnerStats: GameStats, val loserStats: GameStats ) extends GameResult(season, day, loc) FullResult has the year and day of the season, the location where the game was played, and the game statistics of both the winning and losing teams. Next, we will create a statistics graph of the regular seasons. In this graph, the nodes are the teams, whereas each edge corresponds to a specific game. To create the graph, let's parse the CSV file called teams.csv into the RDD teams: val teams: RDD[(VertexId, String)] = sc.textFile("./data/teams.csv"). filter(! _.startsWith("#")). map {line => val row = line split ',' (row(0).toInt, row(1)) } We can check the first few teams in this new RDD: scala> teams.take(3).foreach{println} (1101,Abilene Chr) (1102,Air Force) (1103,Akron) We do the same thing to obtain an RDD of the game results, which will have a type called RDD[Edge[FullResult]]. We just parse stats.csv, and record the fields that we need: The ID of the winning team The ID of the losing team The game statistics of both the teams val detailedStats: RDD[Edge[FullResult]] = sc.textFile("./data/stats.csv"). filter(! _.startsWith("#")). map {line => val row = line split ',' Edge(row(2).toInt, row(4).toInt, FullResult( row(0).toInt, row(1).toInt, row(6), GameStats( score = row(3).toInt, fieldGoalMade = row(8).toInt, fieldGoalAttempt = row(9).toInt, threePointerMade = row(10).toInt, threePointerAttempt = row(11).toInt, threeThrowsMade = row(12).toInt, threeThrowsAttempt = row(13).toInt, offensiveRebound = row(14).toInt, defensiveRebound = row(15).toInt, assist = row(16).toInt, turnOver = row(17).toInt, steal = row(18).toInt, block = row(19).toInt, personalFoul = row(20).toInt ), GameStats( score = row(5).toInt, fieldGoalMade = row(21).toInt, fieldGoalAttempt = row(22).toInt, threePointerMade = row(23).toInt, threePointerAttempt = row(24).toInt, threeThrowsMade = row(25).toInt, threeThrowsAttempt = row(26).toInt, offensiveRebound = row(27).toInt, defensiveRebound = row(28).toInt, assist = row(20).toInt, turnOver = row(30).toInt, steal = row(31).toInt, block = row(32).toInt, personalFoul = row(33).toInt ) ) ) } We can avoid typing all this by using the nice spark-csv package that reads CSV files into SchemaRDD. 
Let's check what we got: scala> detailedStats.take(3).foreach(println) Edge(1165,1384,FullResult(2006,8,N,Score: 75-54)) Edge(1393,1126,FullResult(2006,8,H,Score: 68-37)) Edge(1107,1324,FullResult(2006,9,N,Score: 90-73)) We then create our score graph using the collection of teams (of the type called RDD[(VertexId, String)]) as vertices, and the collection called detailedStats (of the type called RDD[(VertexId, String)]) as edges: scala> val scoreGraph = Graph(teams, detailedStats) For curiosity, let's see which team has won against the 2015 NCAA national champ Duke during the regular season. It seems Duke has lost only four games during the regular season: scala> scoreGraph.triplets.filter(_.dstAttr == "Duke").foreach(println)((1274,Miami FL),(1181,Duke),FullResult(2015,71,A,Score: 90-74)) ((1301,NC State),(1181,Duke),FullResult(2015,69,H,Score: 87-75)) ((1323,Notre Dame),(1181,Duke),FullResult(2015,86,H,Score: 77-73)) ((1323,Notre Dame),(1181,Duke),FullResult(2015,130,N,Score: 74-64)) Aggregating game stats After we have our graph ready, let's start aggregating the stats data in scoreGraph. In Spark, aggregateMessages is the operator for such a kind of jobs. For example, let's find out the average field goals made per game by the winners. In other words, the games that a team has lost will not be counted. To get the average for each team, we first need to have the number of games won by the team, and the total field goals that the team made in these games: // Aggregate the total field goals made by winning teams type Msg = (Int, Int) type Context = EdgeContext[String, FullResult, Msg] val winningFieldGoalMade: VertexRDD[Msg] = scoreGraph aggregateMessages( // sendMsg (ec: Context) => ec.sendToSrc(1, ec.attr.winnerStats.fieldGoalMade), // mergeMsg (x: Msg, y: Msg) => (x._1 + y._1, x._2+ y._2) ) The aggregateMessage operator There is a lot going on in the previous call to aggregateMessages. So, let's see it working in slow motion. When we called aggregateMessages on the scoreGraph, we had to pass two functions as arguments. SendMsg The first function has a signature called EdgeContext[VD, ED, Msg] => Unit. It takes an EdgeContext as input. Since it does not return anything, its return type is Unit. This function is needed for sending message between the nodes. Okay, but what is the EdgeContext type? EdgeContext represents an edge along with its neighboring nodes. It can access both the edge attribute, and the source and destination nodes' attributes. In addition, EdgeContext has two methods to send messages along the edge to its source node, or to its destination node. These methods are called sendToSrc and sendToDst respectively. Then, the type of messages being sent through the graph is defined by Msg. Similar to vertex and edge types, we can define the concrete type that Msg takes as we wish. Merge In addition to sendMsg, the second function that we need to pass to aggregateMessages is a mergeMsg function with the (Msg, Msg) => Msg signature. As its name implies, mergeMsg is used to merge two messages, received at each node into a new one. Its output must also be of the Msg type. Using these two functions, aggregateMessages returns the aggregated messages inside VertexRDD[Msg]. Example In our example, we need to aggregate the number of games played and the number of field goals made. Therefore, Msg is simply a pair of Int. Furthermore, each edge context needs to send a message to only its source node, that is, the winning team. 
This is because we want to compute the total field goals made by each team for only the games that it has won. The actual message sent to each "winner" node is the pair of integers (1, ec.attr.winnerStats.fieldGoalMade). Here, 1 serves as a counter for the number of games won by the source node. The second integer, which is the number of field goals in one game, is extracted from the edge attribute. As we set out to compute the average field goals per winning game for all teams, we need to apply the mapValues operator to the output of aggregateMessages, which is as follows: // Average field goals made per Game by the winning teams val avgWinningFieldGoalMade: VertexRDD[Double] = winningFieldGoalMade mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Int) => total.toDouble/count }) Here is the output: scala> avgWinningFieldGoalMade.take(5).foreach(println) (1260,24.71641791044776) (1410,23.56578947368421) (1426,26.239436619718308) (1166,26.137614678899084) (1434,25.34285714285714) Abstracting out the aggregation This was kind of cool! We can surely do the same thing for the average points per game scored by the winning teams: // Aggregate the points scored by winning teams val winnerTotalPoints: VertexRDD[(Int, Int)] = scoreGraph.aggregateMessages( // sendMsg triplet => triplet.sendToSrc(1, triplet.attr.winnerStats.score), // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) // Average field goals made per Game by winning teams var winnersPPG: VertexRDD[Double] = winnerTotalPoints mapValues ( (id: VertexId, x: (Int, Int)) => x match { case (count: Int, total: Int) => total.toDouble/count }) Let's check the output: scala> winnersPPG.take(5).foreach(println) (1260,71.19402985074628) (1410,71.11842105263158) (1426,76.30281690140845) (1166,76.89449541284404) (1434,74.28571428571429) What if the coach wants to know the top five teams with the highest average three pointers made per winning game? By the way, he might also ask about the teams that are the most efficient in three pointers. Keeping things DRY We can copy and modify the previous code, but that would be quite repetitive. Instead, let's abstract out the average aggregation operator so that it can work on any statistics that the coach needs. Luckily, Scala's higher-order functions are there to help in this task. Let's define the functions that take a team's GameStats as an input, and return specific statistic that we are interested in. 
For now, we will need the number of three pointer made, and the average three pointer percentage: // Getting individual stats def threePointMade(stats: GameStats) = stats.threePointerMade def threePointPercent(stats: GameStats) = stats.tpPercent Then, we create a generic function that takes as an input a stats graph, and one of the functions defined previously, which has a signature called GameStats => Double: // Generic function for stats averaging def averageWinnerStat(graph: Graph[String, FullResult])(getStat: GameStats => Double): VertexRDD[Double] = { type Msg = (Int, Double) val winningScore: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg triplet => triplet.sendToSrc(1, getStat(triplet.attr.winnerStats)), // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) winningScore mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double) => total/count }) } Now, we can get the average stats by passing the threePointMade and threePointPercent to averageWinnerStat functions: val winnersThreePointMade = averageWinnerStat(scoreGraph)(threePointMade) val winnersThreePointPercent = averageWinnerStat(scoreGraph)(threePointPercent) With little efforts, we can tell the coach which five winning teams score the highest number of threes per game: scala> winnersThreePointMade.sortBy(_._2,false).take(5).foreach(println) (1440,11.274336283185841) (1125,9.521929824561404) (1407,9.008849557522124) (1172,8.967441860465117) (1248,8.915384615384616) While we are at it, let's find out the five most efficient teams in three pointers: scala> winnersThreePointPercent.sortBy(_._2,false).take(5).foreach(println) (1101,46.90555728464225) (1147,44.224282479431224) (1294,43.754532434101534) (1339,43.52308905887638) (1176,43.080814169045105) Interestingly, the teams that made the most three pointers per winning game are not always the one who are the most efficient ones at it. But it is okay because at least they have won these games. Coach wants more numbers The coach seems to argue against this argument. He asks us to get the same statistics, but he wants the average over all the games that each team has played. We then have to aggregate the information at all the nodes, and not only at the destination nodes. To make our previous abstraction more flexible, let's create the following types: trait Teams case class Winners extends Teams case class Losers extends Teams case class AllTeams extends Teams We modify the previous higher-order function to have an extra argument called Teams, which will help us specify those nodes where we want to collect and aggregate the required game stats. The new function becomes as the following: def averageStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, tms: Teams): VertexRDD[Double] = { type Msg = (Int, Double) val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg tms match { case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.winnerStats))) case _ : Losers => t => t.sendToDst((1, getStat(t.attr.loserStats))) case _ => t => { t.sendToSrc((1, getStat(t.attr.winnerStats))) t.sendToDst((1, getStat(t.attr.loserStats))) } } , // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) aggrStats mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double) => total/count }) } Now, aggregateStat allows us to choose if we want to aggregate the stats for winners only, for losers only, or for the all teams. 
Since the coach wants the overall stats averaged over all the games played, we aggregate the stats by passing the AllTeams() flag in aggregateStat. In this case, we define the sendMsg argument in aggregateMessages to send the required stats to both source (the winner) and destination (the loser) using the EdgeContext class's sendToSrc and sendToDst functions respectively. This mechanism is pretty straightforward. We just need to make sure that we send the right information to the right node. In this case, we send winnerStats to the winner, and loserStatsto the loser. Okay, you get the idea now. So, let's apply it to please our coach. Here are the teams with the overall highest three pointers per page: // Average Three Point Made Per Game for All Teams val allThreePointMade = averageStat(scoreGraph)(threePointMade, AllTeams()) scala> allThreePointMade.sortBy(_._2, false).take(5).foreach(println) (1440,10.180811808118081) (1125,9.098412698412698) (1172,8.575657894736842) (1184,8.428571428571429) (1407,8.411149825783973) And here are the five most efficient teams overall in three pointers per game: // Average Three Point Percent for All Teams val allThreePointPercent = averageStat(scoreGraph)(threePointPercent, AllTeams()) Let's check the output: scala> allThreePointPercent.sortBy(_._2,false).take(5).foreach(println) (1429,38.8351815824302) (1323,38.522819895594) (1181,38.43052051444854) (1294,38.41227053353959) (1101,38.097896464168954) Actually, there is only a 2 percent difference between the most efficient team and the one in the fiftieth position. Most NCAA teams are therefore pretty efficient behind the line. I bet coach knew this already! Average points per game We can also reuse the averageStat function to get the average points per game for the winners. In particular, let's take a look at the two teams that won games with the highest and lowest scores: // Winning teams val winnerAvgPPG = averageStat(scoreGraph)(score, Winners()) Let's check the output: scala> winnerAvgPPG.max()(Ordering.by(_._2)) res36: (org.apache.spark.graphx.VertexId, Double) = (1322,90.73333333333333) scala> winnerAvgPPG.min()(Ordering.by(_._2)) res39: (org.apache.spark.graphx.VertexId, Double) = (1197,60.5) Apparently, the most defensive team can win game by scoring only 60 points, whereas the most offensive team can score an average of 90 points. Next, let's average the points per game for all games played and look at the two teams with the best and worst offense during the 2015 season: // Average Points Per Game of All Teams val allAvgPPG = averageStat(scoreGraph)(score, AllTeams()) Let's see the output: scala> allAvgPPG.max()(Ordering.by(_._2)) res42: (org.apache.spark.graphx.VertexId, Double) = (1322,83.81481481481481) scala> allAvgPPG.min()(Ordering.by(_._2)) res43: (org.apache.spark.graphx.VertexId, Double) = (1212,51.111111111111114) To no one's surprise, the best offensive team is the same as the one who scores the most in winning games. To win the games, 50 points are not enough in an average for a team to win the games. Defense stats – the D matters as in direction Previously, we obtained some statistics such as field goals or a three-point percentage that a team achieves. What if we want to aggregate instead the average points or rebounds that each team concedes to their opponents? To compute this, we define a new higher-order function called averageConcededStat. Compared to averageStat, this function needs to send loserStats to the winning team, and the winnerStats function to the losing team. 
To make things more interesting, we are going to make the team name as a part of the message Msg: def averageConcededStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, rxs: Teams): VertexRDD[(String, Double)] = { type Msg = (Int, Double, String) val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg rxs match { case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr)) case _ : Losers => t => t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr)) case _ => t => { t.sendToSrc((1, getStat(t.attr.loserStats),t.srcAttr)) t.sendToDst((1, getStat(t.attr.winnerStats),t.dstAttr)) } } , // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2, x._3) ) aggrStats mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double, name: String) => (name, total/count) }) } With this, we can calculate the average points conceded by the winning and losing teams as follows: val winnersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Winners()) val losersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Losers()) Let's check the output: scala> losersAvgConcededPoints.min()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905)) scala> winnersAvgConcededPoints.min()(Ordering.by(_._2)) res: (org.apache.spark.graphx.VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905)) scala> losersAvgConcededPoints.max()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1464,(Youngstown St,78.85714285714286)) scala> winnersAvgConcededPoints.max()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1464,(Youngstown St,71.125)) The previous tells us that Abilene Christian University is the most defensive team. They concede the least points whether they win a game or not. On the other hand, Youngstown has the worst defense. Joining aggregated stats into graphs The previous example shows us how flexible the aggregateMessages operator is. We can define the Msg type of the messages to be aggregated to fit our needs. Moreover, we can select which nodes receive the messages. Finally, we can also define how we want to merge the messages. As a final example, let's aggregate many statistics about each team, and join this information into the nodes of the graph. To start, we create its own class for the team stats: // Average Stats of All Teams case class TeamStat( wins: Int = 0 // Number of wins ,losses: Int = 0 // Number of losses ,ppg: Int = 0 // Points per game ,pcg: Int = 0 // Points conceded per game ,fgp: Double = 0 // Field goal percentage ,tpp: Double = 0 // Three point percentage ,ftp: Double = 0 // Free Throw percentage ){ override def toString = wins + "-" + losses } Then, we collect the average stats for all teams using aggregateMessages in the following. 
For this, we define the type of the message to be an 8-element tuple that holds the counter for games played, wins, losses, and other statistics that will be stored in TeamStat as listed previously: type Msg = (Int, Int, Int, Int, Int, Double, Double, Double) val aggrStats: VertexRDD[Msg] = scoreGraph.aggregateMessages( // sendMsg t => { t.sendToSrc(( 1, 1, 0, t.attr.winnerStats.score, t.attr.loserStats.score, t.attr.winnerStats.fgPercent, t.attr.winnerStats.tpPercent, t.attr.winnerStats.ftPercent )) t.sendToDst(( 1, 0, 1, t.attr.loserStats.score, t.attr.winnerStats.score, t.attr.loserStats.fgPercent, t.attr.loserStats.tpPercent, t.attr.loserStats.ftPercent )) } , // mergeMsg (x, y) => ( x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4, x._5 + y._5, x._6 + y._6, x._7 + y._7, x._8 + y._8 ) ) Given the aggregate message called aggrStats, we map them into a collection of TeamStat: val teamStats: VertexRDD[TeamStat] = aggrStats mapValues { (id: VertexId, m: Msg) => m match { case ( count: Int, wins: Int, losses: Int, totPts: Int, totConcPts: Int, totFG: Double, totTP: Double, totFT: Double) => TeamStat( wins, losses, totPts/count, totConcPts/count, totFG/count, totTP/count, totFT/count) } } Next, let's join teamStats into the graph. For this, we first create a class called Team as a new type for the vertex attribute. Team will have a name and TeamStat: case class Team(name: String, stats: Option[TeamStat]) { override def toString = name + ": " + stats } Next, we use the joinVertices operator that we have seen in the previous chapter: // Joining the average stats to vertex attributes def addTeamStat(id: VertexId, t: Team, stats: TeamStat) = Team(t.name, Some(stats)) val statsGraph: Graph[Team, FullResult] = scoreGraph.mapVertices((_, name) => Team(name, None)). joinVertices(teamStats)(addTeamStat) We can see that the join has worked well by printing the first three vertices in the new graph called statsGraph: scala> statsGraph.vertices.take(3).foreach(println) (1260,Loyola-Chicago: Some(17-13)) (1410,TX Pan American: Some(7-21)) (1426,UT Arlington: Some(15-15)) To conclude this task, let's find out the top 10 teams in the regular seasons. To do so, we define an ordering for Option[TeamStat] as follows: import scala.math.Ordering object winsOrdering extends Ordering[Option[TeamStat]] { def compare(x: Option[TeamStat], y: Option[TeamStat]) = (x, y) match { case (None, None) => 0 case (Some(a), None) => 1 case (None, Some(b)) => -1 case (Some(a), Some(b)) => if (a.wins == b.wins) a.losses compare b.losses else a.wins compare b.wins }} Finally, we get the following: import scala.reflect.classTag import scala.reflect.ClassTag scala> statsGraph.vertices.sortBy(v => v._2.stats,false)(winsOrdering, classTag[Option[TeamStat]]). | take(10).foreach(println) (1246,Kentucky: Some(34-0)) (1437,Villanova: Some(32-2)) (1112,Arizona: Some(31-3)) (1458,Wisconsin: Some(31-3)) (1211,Gonzaga: Some(31-2)) (1320,Northern Iowa: Some(30-3)) (1323,Notre Dame: Some(29-5)) (1181,Duke: Some(29-4)) (1438,Virginia: Some(29-3)) (1268,Maryland: Some(27-6)) Note that the ClassTag parameter is required in sortBy to make use of Scala's reflection. This is why we had the previous imports. Performance optimization with tripletFields In addition to sendMsg and mergeMsg, aggregateMessages can also take an optional argument called tripletsFields, which indicates what data is accessed in the EdgeContext. The main reason for explicitly specifying such information is to help optimize the performance of the aggregateMessages operation. 
In fact, TripletFields represents a subset of the fields of EdgeTriplet, and it enables GraphX to populate only thse fields when necessary. The default value is TripletFields. All which means that the sendMsg function may access any of the fields in the EdgeContext. Otherwise, the tripletFields argument is used to tell GraphX that only part of the EdgeContext will be required so that an efficient join strategy can be used. All the possible options for the tripletsFields are listed here: TripletFields.All: Expose all the fields (source, edge, and destination) TripletFields.Dst: Expose the destination and edge fields, but not the source field TripletFields.EdgeOnly: Expose only the edge field. TripletFields.None: None of the triplet fields are exposed TripletFields.Src: Expose the source and edge fields, but not the destination field Using our previous example, if we are interested in computing the total number of wins and losses for each team, we will not need to access any field of the EdgeContext. In this case, we should use TripletFields. None to indicate so: // Number of wins of the teams val numWins: VertexRDD[Int] = scoreGraph.aggregateMessages( triplet => { triplet.sendToSrc(1) // No attribute is passed but an integer }, (x, y) => x + y, TripletFields.None ) // Number of losses of the teams val numLosses: VertexRDD[Int] = scoreGraph.aggregateMessages( triplet => { triplet.sendToDst(1) // No attribute is passed but an integer }, (x, y) => x + y, TripletFields.None ) To see that this works, let's print the top five and bottom five teams: scala> numWins.sortBy(_._2,false).take(5).foreach(println) (1246,34) (1437,32) (1112,31) (1458,31) (1211,31) scala> numLosses.sortBy(_._2, false).take(5).foreach(println) (1363,28) (1146,27) (1212,27) (1197,27) (1263,27) Should you want the name of the top five teams, you need to access the srcAttr attribute. In this case, we need to set tripletFields to TripletFields.Src: Kentucky as undefeated team in regular season: val numWinsOfTeams: VertexRDD[(String, Int)] = scoreGraph.aggregateMessages( t => { t.sendToSrc(t.srcAttr, 1) // Pass source attribute only }, (x, y) => (x._1, x._2 + y._2), TripletFields.Src ) Et voila! scala> numWinsOfTeams.sortBy(_._2._2, false).take(5).foreach(println) (1246,(Kentucky,34)) (1437,(Villanova,32)) (1112,(Arizona,31)) (1458,(Wisconsin,31)) (1211,(Gonzaga,31)) scala> numWinsOfTeams.sortBy(_._2._2).take(5).foreach(println) (1146,(Cent Arkansas,2)) (1197,(Florida A&M,2)) (1398,(Tennessee St,3)) (1263,(Maine,3)) (1420,(UMBC,4)) Kentucky has not lost any of its 34 games during the regular season. Too bad that they could not make it into the championship final. Warning about the MapReduceTriplets operator Prior to Spark 1.2, there was no aggregateMessages method in graph. Instead, the now deprecated mapReduceTriplets was the primary aggregation operator. The API for mapReduceTriplets is: class Graph[VD, ED] { def mapReduceTriplets[Msg]( map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)], reduce: (Msg, Msg) => Msg) : VertexRDD[Msg] } Compared to mapReduceTriplets, the new operator called aggregateMessages is more expressive as it employs the message passing mechanism instead of returning an iterator of messages as mapReduceTriplets does. In addition, aggregateMessages explicitly requires the user to specify the TripletFields object for performance improvement as we explained previously. In addition to the API improvements, aggregateMessages is optimized for performance. 
Because mapReduceTriplets is now deprecated, we will not discuss it further. If you have to use it with earlier versions of Spark, you can refer to the Spark programming guide. Summary In brief, AggregateMessages is a useful and generic operator that provides a functional abstraction for aggregating neighborhood information in the Spark graphs. Its definition is summarized here: class Graph[VD, ED] { def aggregateMessages[Msg: ClassTag]( sendMsg: EdgeContext[VD, ED, Msg] => Unit, mergeMsg: (Msg, Msg) => Msg, tripletFields: TripletFields = TripletFields.All) : VertexRDD[Msg] } This operator applies a user-defined sendMsg function to each edge in the graph using an EdgeContext. Each EdgeContext access the required information about the edge and passes this information to its source node and/or destination node using the sendToSrc and/or sendToDst respectively. After all the messages are received by the nodes, the mergeMsg function is used to aggregate these messages at each node. Some interesting reads Six keys to sports analytics Moneyball: The Art Of Winning An Unfair Game Golden State Warriors at the forefront of NBA data analysis How Data and Analytics Have Changed 'The Beautiful Game' NHL, SAP partnership to lead statistical revolution Resources for Article: Further resources on this subject: The Spark programming model[article] Apache Karaf – Provisioning and Clusters[article] Machine Learning Using Spark MLlib [article]
Meeting SAP Lumira

Packt
02 Sep 2015
12 min read
In this article by Dmitry Anoshin, author of the book SAP Lumira Essentials, Dmitry talks about living in a century of information technology. There are a lot of electronic devices around us which generate lots of data. For example, you can surf the Internet, visit a couple of news portals, order new Nike Air Max shoes from a web store, write a couple of messages to your friend, and chat on Facebook. Your every action produces data. We can multiply that action by the amount of people who have access to the internet or just use a cell phone, and we get really BIG DATA. Of course, you have a question: how big is it? Now, it starts from terabytes or even petabytes. The volume is not the only issue; moreover, we struggle with the variety of data. As a result, it is not enough to analyze only the structured data. We should dive deep in to unstructured data, such as machine data which are generated by various machines. (For more resources related to this topic, see here.) Nowadays, we should have a new core competence—dealing with Big Data—, because these vast data volumes won't be just stored, they need to be analysed and mined for information that management can use in order to make right business decisions. This helps to make the business more competitive and efficient. Unfortunately, in modern organizations there are still many manual steps needed in order to get data and try to answer your business questions. You need the help of your IT guys, or need to wait until new data is available in your enterprise data warehouse. In addition, you are often working with an inflexible BI tool, which can only refresh a report or export it in to Excel. You definitely need a new approach, which gives you a competitive advantage, dramatically reduces errors, and accelerates business decisions. So, we can highlight some of the key points for this kind of analytics: Integrating data from heterogeneous systems Giving more access to data Using sophisticated analytics Reducing manual coding Simplifying processes Reducing time to prepare data Focusing on self-service Leveraging powerful computing resources We could continue this list with many other bullet points. If you are a fan of traditional BI tools, you may think that it is almost impossible. Yes, you are right, it is impossible. That's why we need to change the rules of the game. As the business world changes, you must change as well. Maybe you have guessed what this means, but if not, I can help you. I will focus on a new approach of doing data analytics, which is more flexible and powerful. It is called data discovery. Of course, we need the right way in order to overcome all the challenges of the modern world. That's why we have chosen SAP Lumira—one of the most powerful data discovery tools in the modern market. But before diving deep into this amazing tool, let's consider some of the challenges of data discovery that are in our path, as well as data discovery advantages. Data discovery challenges Let's imagine that you have several terabytes of data. Unfortunately, it is raw unstructured data. In order to get business insight from this data you have to spend a lot of time in order to prepare and clean the data. In addition, you are restricted by the capabilities of your machine. That's why a good data discovery tool usually is combined of software and hardware. As a result, this gives you more power for exploratory data analysis. Let's imagine that this entire Big Data store is in Hadoop or any NoSQL data store. 
You have to at least be at good programmer in order to do analytics on this data. Here we can find other benefit of a good data discovery tool: it gives a powerful tool to business users, who are not as technical and maybe don't even know SQL. Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resilience of these clusters comes from the software's ability to detect and handle failures at the application layer. A NoSQL data store is a next generation database, mostly addressing some of the following points: non-relational, distributed, open-source, and horizontally scalable. Data discovery versus business intelligence You may be confused about data discovery and business intelligence technologies; it seems they are very close to each other or even BI tools can do all what data discovery can do. And why do we need a separate data discovery tool, such as, SAP Lumira? In order to better understand the difference between the two technologies, you can look at the table below:   Enterprise BI Data discovery Key users All users Advanced analysts Approach Vertically-oriented (top to bottom), semantic layers, requests to existing repositories Vertically-oriented (bottom-up), mushup, putting data in the selected repository Interface Reports, dashboards Visualization Users Reporting Analysis Implementation By IT consultants By business users Let's consider the pros and cons of data discovery: Pros: Rapidly analyze data with a short shelf life Ideal for small teams Best for tactical analysis Great for answering on-off questions quickly Cons: Difficult to handle for enterprise organizations Difficult for junior users Lack of scalability As a result, it is clear that BI and data discovery handles their own tasks and complement each other. The role of data discovery Most organizations have a data warehouse. It was planned to supporting daily operations and to help make business decisions. But sometimes organizations need to meet new challenges. For example, Retail Company wants to improve their customer experience and decide to work closely with the customer database. Analysts try to segment customers into cohorts and try to analyse customer's behavior. They need to handle all customer data, which is quite big. In addition, they can use external data in order to learn more about their customers. If they start to use a corporate BI tool, every interaction, such as adding new a field or filter, can take 10-30 minutes. Another issue is adding a new field to an existing report. Usually, it is impossible without the help of IT staff, due to security or the complexities of the BI Enterprise solution. This is unacceptable in a modern business. Analysts want get an answer to their business questions immediately, and they prefer to visualize data because, as you know, human perception of visualization is much higher than text. In addition, these analysts may be independent from IT. They have their data discovery tool and they can connect to any data sources in the organization and check their crazy hypotheses. There are hundreds of examples where BI and DWH is weak, and data discovery is strong. Introducing SAP Lumira Starting from this point, we will focus on learning SAP Lumira. First of all, we need to understand what SAP Lumira is exactly. 
SAP Lumira is a family of data discovery tools which give us an opportunity to create amazing visualizations or even tell fantastic stories based on our big or small data. We can connect most of the popular data sources, such as Relational Database Management Systems (RDBMSs), flat files, excel spreadsheets or SAP applications. We are able to create datasets with measures, dimensions, hierarchies, or variables. In addition, Lumira allows us to prepare, edit, and clean our data before it is processed. SAP Lumira offers us a huge arsenal of graphical charts and tables to visualize our data. In addition, we can create data stories or even infographics based on our data by grouping charts, single cells, or tables together on boards to create presentation- style dashboards. Moreover, we can add images or text in order to add details. The following are the three main products in the Lumira family offered by SAP: SAP Lumira Desktop SAP Lumira Server SAP Lumira Cloud Lumira Desktop can be either a personal edition or a standard edition. Both of them give you the opportunity to analyse data on your local machine. You can even share your visualizations or insights via PDF or XLS. Lumira Server is also in two variations—Edge and Server. As you know, SAP BusinessObjects also has two types of license for the same software, Edge and Enterprise, and they differ only in terms of the number of users and the type of license. The Edge version is smaller; for example, it can cover the needs of a team or even the whole department. Lumira Cloud is Software as a Service (SaaS). It helps to quickly visualize large volumes of data without having to sacrifice performance or security. It is especially designed to speed time to insight. In addition, it saves time and money with flexible licensing options. Data connectors We met SAP Lumira for the first time and we played with the interface, and the reader could adjust the general settings of SAP Lumira. In addition, we can find this interesting menu in the middle of the window: There are several steps which help us to discover our data and gain business insights. In this article we start from first step by exploring data in SAP Lumira to create a document and acquire a dataset, which can include part or all of the original values from a data source. This is through Acquire Data. Let's click on Acquire Data. This new window will come up: There are four areas on this window. They are: A list of possible data sources (1): Here, the user can connect to his data source. Recently used objects (2): The user can open his previous connections or files. Ordinary buttons (3), such as Previous, Next, Create, and Cancel. This small chat box (4) we can find at almost every page. SAP Lumira cares about the quality of the product and gives the opportunity to the user to make a screen print and send feedback to SAP. Let's go deeper and consider more closely every connection in the table below: Data Source Description Microsoft Excel Excel data sheets Flat file CSV, TXT, LOG, PRN, or TSV SAP HANA There are two possible ways: Offline (downloading data) and Online (connected to SAP HANA) SAP BusinessObjects universe UNV or UNX SQL Databases Query data via SQL from relational databases SAP Business warehouse Downloaded data from a BEx Query or an InfoProvider Let's try to connect some data sources and extract some data from them. Microsoft spreadsheets Let's start with the easiest exercise. 
For example, our inventory manager has asked us to analyse flop products, that is, products that are not popular, and he has sent us two Excel spreadsheets, Unicorn_flop_products.xls and Unicorn_flop_price.xls. There are two different worksheets because prices and product attributes live in different systems. Both files have a unique field, SKU. As a result, it is possible to merge them on this field and analyse them as one data set.
An SKU, or stock keeping unit, is a distinct item for sale, such as a product or service, together with the attributes that distinguish it from other items. For a product, these attributes include, but are not limited to, manufacturer, product description, material, size, color, packaging, and warranty terms. When a business takes inventory, it counts the quantity of each SKU.
Connecting to the SAP BO universe
The universe is a core concept in the SAP BusinessObjects BI platform. It is the semantic layer that isolates business users from the technical complexities of the databases where their corporate information is stored. For the ease of the end user, universes are made up of objects and classes that map to data in the database, using everyday terms that describe the business environment.
Introducing the Unicorn Fashion universe
The Unicorn Fashion company uses the SAP BusinessObjects BI platform (BIP) as its primary BI tool. There is a Unicorn Fashion universe, which was built on top of the unicorn datamart and has a similar structure and joins. The following image shows the Unicorn Fashion universe:
It unites two business processes, Sales (orange) and Stock (green), and has the following structure in the business layer:
Product: This specifies the attributes of an SKU, such as brand, category, and so on
Price: This denotes the different pricing of the SKU
Sales: This specifies the sales business process
Order: This denotes the order number, the shipping information, and order measures
Sales Date: This specifies the attributes of the order date, such as month, year, and so on
Sales Measures: This denotes various aggregated measures, such as shipped items, revenue waterfall, and so on
Stock: This specifies information about the quantity in stock
Stock Date: This denotes the attributes of the stock date, such as month, year, and so on
Summary
This article was a step-by-step introduction to SAP Lumira essentials, starting with an overview of the SAP Lumira family of products. The book demonstrates various data discovery techniques using real-world scenarios from an online e-commerce retailer, and it provides detailed recipes for installing, administering, and customizing SAP Lumira. It shows how to work with data, from acquiring it from various data sources, through preparing and cleaning it, to visualizing it with SAP Lumira's rich functionality. It also teaches you how to present data as a data story or an infographic and publish it across your organization or on the web. You will learn data discovery techniques, build visualizations, create stories, and share them with one of the most powerful tools on the market: SAP Lumira. The book focuses on extracting data from different sources such as plain text files, Microsoft Excel spreadsheets, the SAP BusinessObjects BI platform, SAP HANA, and SQL databases, and finally shows how to publish the results of your work to various destinations, such as SAP BI clients and SAP Lumira Cloud.
Big Data

 In this article by Henry Garner, author of the book Clojure for Data Science, we'll be working with a relatively modest dataset of only 100,000 records. This isn't big data (at 100 MB, it will fit comfortably in the memory of one machine), but it's large enough to demonstrate the common techniques of large-scale data processing. Using Hadoop (the popular framework for distributed computation) as its case study, this article will focus on how to scale algorithms to very large volumes of data through parallelism. Before we get to Hadoop and distributed data processing though, we'll see how some of the same principles that enable Hadoop to be effective at a very large scale can also be applied to data processing on a single machine, by taking advantage of the parallel capacity available in all modern computers. (For more resources related to this topic, see here.) The reducers library The count operation we implemented previously is a sequential algorithm. Each line is processed one at a time until the sequence is exhausted. But there is nothing about the operation that demands that it must be done in this way. We could split the number of lines into two sequences (ideally of roughly equal length) and reduce over each sequence independently. When we're done, we would just add together the total number of lines from each sequence to get the total number of lines in the file: If each Reduce ran on its own processing unit, then the two count operations would run in parallel. All the other things being equal, the algorithm would run twice as fast. This is one of the aims of the clojure.core.reducers library—to bring the benefit of parallelism to algorithms implemented on a single machine by taking advantage of multiple cores. Parallel folds with reducers The parallel implementation of reduce implemented by the reducers library is called fold. To make use of a fold, we have to supply a combiner function that will take the results of our reduced sequences (the partial row counts) and return the final result. Since our row counts are numbers, the combiner function is simply +. Reducers are a part of Clojure's standard library, they do not need to be added as an external dependency. The adjusted example, using clojure.core.reducers as r, looks like this: (defn ex-5-5 [] (->> (io/reader "data/soi.csv") (line-seq) (r/fold + (fn [i x] (inc i))))) The combiner function, +, has been included as the first argument to fold and our unchanged reduce function is supplied as the second argument. We no longer need to pass the initial value of zero—fold will get the initial value by calling the combiner function with no arguments. Our preceding example works because +, called with no arguments, already returns zero: (defn ex-5-6 [] (+)) ;; 0 To participate in folding then, it's important that the combiner function have two implementations: one with zero arguments that returns the identity value and another with two arguments that combines the arguments. Different folds will, of course, require different combiner functions and identity values. For example, the identity value for multiplication is 1. We can visualize the process of seeding the computation with an identity value, iteratively reducing over the sequence of xs and combining the reductions into an output value as a tree: There may be more than two reductions to combine, of course. The default implementation of fold will split the input collection into chunks of 512 elements. 
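The partition size is not fixed: r/fold also has a four-argument arity that takes the partition size as its first argument, and parallel folding only applies to foldable collections such as vectors (other sequences are reduced serially). As a sketch, not one of the book's numbered examples, we could realize the lines into a vector and ask for larger chunks (the 4096 below is an arbitrary choice) to trade coordination overhead against parallelism:

(defn ex-5-5-chunked []
  (->> (io/reader "data/soi.csv")
       (line-seq)
       (into [])                              ;; realize into a vector so fold can partition the work
       (r/fold 4096 + (fn [i x] (inc i)))))   ;; partition size, combiner, reducer

Returning to the default partition size of 512: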
Our 166,000-element sequence will therefore generate 325 reductions to be combined. We're going to run out of page real estate quite quickly with a tree representation diagram, so let's visualize the process more schematically instead—as a two-step reduce and combine process. The first step performs a parallel reduce across all the chunks in the collection. The second step performs a serial reduce over the intermediate results to arrive at the final result: The preceding representation shows reduce over several sequences of xs, represented here as circles, into a series of outputs, represented here as squares. The squares are combined serially to produce the final result, represented by a star. Loading large files with iota Calling fold on a lazy sequence requires Clojure to realize the sequence into memory and then chunk the sequence into groups for parallel execution. For situations where the calculation performed on each row is small, the overhead involved in coordination outweighs the benefit of parallelism. We can improve the situation slightly by using a library called iota (https://github.com/thebusby/iota). The iota library loads files directly into the data structures suitable for folding over with reducers that can handle files larger than available memory by making use of memory-mapped files. With iota in the place of our line-seq function, our line count simply becomes: (defn ex-5-7 [] (->> (iota/seq "data/soi.csv") (r/fold + (fn [i x] (inc i))))) So far, we've just been working with the sequences of unformatted lines, but if we're going to do anything more than counting the rows, we'll want to parse them into a more useful data structure. This is another area in which Clojure's reducers can help make our code more efficient. Creating a reducers processing pipeline We already know that the file is comma-separated, so let's first create a function to turn each row into a vector of fields. All fields except the first two contain numeric data, so let's parse them into doubles while we're at it: (defn parse-double [x] (Double/parseDouble x)) (defn parse-line [line] (let [[text-fields double-fields] (->> (str/split line #",") (split-at 2))] (concat text-fields (map parse-double double-fields)))) We're using the reducers version of map to apply our parse-line function to each of the lines from the file in turn: (defn ex-5-8 [] (->> (iota/seq "data/soi.csv") (r/drop 1) (r/map parse-line) (r/take 1) (into []))) ;; [("01" "AL" 0.0 1.0 889920.0 490850.0 ...)] The final into function call converts the reducers' internal representation (a reducible collection) into a Clojure vector. The previous example should return a sequence of 77 fields, representing the first row of the file after the header. We're just dropping the column names at the moment, but it would be great if we could make use of these to return a map representation of each record, associating the column name with the field value. The keys of the map would be the column headings and the values would be the parsed fields. 
The clojure.core function zipmap will create a map out of two sequences—one for the keys and one for the values: (defn parse-columns [line] (->> (str/split line #",") (map keyword))) (defn ex-5-9 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/take 1) (into [])))) This function returns a map representation of each row, a much more user-friendly data structure: [{:N2 1505430.0, :A19300 181519.0, :MARS4 256900.0 ...}] A great thing about Clojure's reducers is that in the preceding computation, calls to r/map, r/drop and r/take are composed into a reduction that will be performed in a single pass over the data. This becomes particularly valuable as the number of operations increases. Let's assume that we'd like to filter out zero ZIP codes. We could extend the reducers pipeline like this: (defn ex-5-10 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/remove (fn [record] (zero? (:zipcode record)))) (r/take 1) (into [])))) The r/remove step is now also being run together with the r/map, r/drop and r/take calls. As the size of the data increases, it becomes increasingly important to avoid making multiple iterations over the data unnecessarily. Using Clojure's reducers ensures that our calculations are compiled into a single pass. Curried reductions with reducers To make the process clearer, we can create a curried version of each of our previous steps. To parse the lines, create a record from the fields and filter zero ZIP codes. The curried version of the function is a reduction waiting for a collection: (def line-formatter (r/map parse-line)) (defn record-formatter [column-names] (r/map (fn [fields] (zipmap column-names fields)))) (def remove-zero-zip (r/remove (fn [record] (zero? (:zipcode record))))) In each case, we're calling one of reducers' functions, but without providing a collection. The response is a curried version of the function that can be applied to the collection at a later time. The curried functions can be composed together into a single parse-file function using comp: (defn load-data [file] (let [data (iota/seq file) col-names (parse-columns (first data)) parse-file (comp remove-zero-zip (record-formatter col-names) line-formatter)] (parse-file (rest data)))) It's only when the parse-file function is called with a sequence that the pipeline is actually executed. Statistical folds with reducers With the data parsed, it's time to perform some descriptive statistics. Let's assume that we'd like to know the mean number of returns (column N1) submitted to the IRS by ZIP code. One way of doing this—the way we've done several times throughout the book—is by adding up the values and dividing it by the count. Our first attempt might look like this: (defn ex-5-11 [] (let [data (load-data "data/soi.csv") xs (into [] (r/map :N1 data))] (/ (reduce + xs) (count xs)))) ;; 853.37 While this works, it's comparatively slow. We iterate over the data once to create xs, a second time to calculate the sum, and a third time to calculate the count. The bigger our dataset gets, the larger the time penalty we'll pay. Ideally, we would be able to calculate the mean value in a single pass over the data, just like our parse-file function previously. It would be even better if we can perform it in parallel too. 
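To preview where we are heading, here is one sketch of such a single-pass, parallelizable mean; the mean-fold name and the {:sum :n} accumulator are illustrative and not part of the book's code. It accumulates the sum and the count together and only divides at the end. The next section explains why this indirection is necessary.

(defn mean-fold [data]
  (let [combiner (fn
                   ([] {:sum 0 :n 0})            ;; identity value seeding each partition
                   ([a b] (merge-with + a b)))   ;; merge partial sums and counts
        reducer  (fn [{:keys [sum n]} x]
                   {:sum (+ sum x) :n (inc n)})
        {:keys [sum n]} (r/fold combiner reducer data)]
    (/ sum n)))

;; Usage sketch:
;; (->> (load-data "data/soi.csv")
;;      (r/map :N1)
;;      (mean-fold))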
Associativity
Before we proceed, it's useful to take a moment to reflect on why the following code wouldn't do what we want:

(defn mean
  ([] 0)
  ([x y] (/ (+ x y) 2)))

Our mean function is a function of two arities. Without arguments, it returns zero, the identity for the mean computation. With two arguments, it returns their mean:

(defn ex-5-12 []
  (->> (load-data "data/soi.csv")
       (r/map :N1)
       (r/fold mean)))

;; 930.54

The preceding example folds over the N1 data with our mean function and produces a different result from the one we obtained previously. If we could expand out the computation for the first three xs, we might see something like the following code:

(mean (mean (mean 0 a) b) c)

This is a bad idea, because the mean function is not associative. For an associative function f, the following holds true:

(f (f a b) c) = (f a (f b c))

Addition is associative, but multiplication and division are not. So the mean function is not associative either. Contrast the mean function with the following simple addition:

(+ 1 (+ 2 3))

This yields an identical result to:

(+ (+ 1 2) 3)

It doesn't matter how the arguments to + are partitioned. Associativity is an important property of functions used to reduce over a set of data because, by definition, the results of a previous calculation are treated as inputs to the next. The easiest way of converting the mean function into an associative function is to calculate the sum and the count separately. Since the sum and the count are associative, they can be calculated in parallel over the data. The mean can then be calculated simply by dividing one by the other.
Multiple regression with gradient descent
The normal equation uses matrix algebra to arrive at the least squares estimates very quickly and efficiently. Where all the data fits in memory, this is a very convenient and concise equation. Where the data exceeds the memory available to a single machine, however, the calculation becomes unwieldy. The reason for this is matrix inversion. The calculation of the inverse (X^T X)^-1 is not something that can be accomplished in a fold over the data: each cell in the output matrix depends on many others in the input matrix. These complex relationships require that the matrix be processed in a nonsequential way.
An alternative approach to solving linear regression problems, and many other related machine learning problems, is a technique called gradient descent. Gradient descent reframes the problem as the solution to an iterative algorithm, one that does not calculate the answer in one very computationally intensive step, but rather converges towards the correct answer over a series of much smaller steps.
The gradient descent update rule
Gradient descent works by the iterative application of a function that moves the parameters in the direction of their optimum values. To apply this function, we need to know the gradient of the cost function at the current parameters. Calculating the formula for the gradient involves calculus that's beyond the scope of this book. Fortunately, the resulting formula isn't terribly difficult to interpret: the partial derivative, or gradient, of our cost function J(β) for the parameter at index j works out as

∂J(β)/∂βj = (ŷ - y)xj

In other words, the gradient of the cost function with respect to the parameter at index j is equal to the difference between our prediction and the true value of y, multiplied by the value of x at index j. Since we're seeking to descend the gradient, we want to subtract some proportion of the gradient from the current parameter values.
Thus, at each step of gradient descent, we perform the following update:

βj := βj - α ∂J(β)/∂βj

Here, := is the assignment operator and α is a factor called the learning rate. The learning rate controls how large an adjustment we wish to make to the parameters at each iteration, as a fraction of the gradient. If our prediction ŷ nearly matches the actual value of y, then there would be little need to change the parameters. In contrast, a larger error will result in a larger adjustment to the parameters. This rule is called the Widrow-Hoff learning rule or the Delta rule.
The gradient descent learning rate
As we've seen, gradient descent is an iterative algorithm. The learning rate, usually represented by α, dictates the speed at which gradient descent converges to the final answer. If the learning rate is too small, convergence will happen very slowly. If it is too large, gradient descent will not find values close to the optimum and may even diverge from the correct answer:
In the preceding chart, a small learning rate leads to a slow convergence over many iterations of the algorithm. While the algorithm does reach the minimum, it does so over many more steps than is ideal and, therefore, may take considerable time. By contrast, in the following diagram, we can see the effect of a learning rate that is too large. The parameter estimates are changed so significantly between iterations that they actually overshoot the optimum values and diverge from the minimum value:
The gradient descent algorithm requires us to iterate repeatedly over our dataset. With the correct value of alpha, each iteration should successively yield better approximations of the ideal parameters. We can choose to terminate the algorithm either when the change between iterations is very small or after a predetermined number of iterations.
Feature scaling
As more features are added to the linear model, it is important to scale the features appropriately. Gradient descent will not perform very well if the features have radically different scales, since it won't be possible to pick a learning rate that suits them all. A simple scaling we can perform is to subtract the mean value from each of the values and divide by the standard deviation. This will tend to produce values with zero mean that generally vary between -3 and 3:

(defn feature-scales [features]
  (->> (prepare-data)
       (t/map #(select-keys % features))
       (t/facet)
       (t/fuse {:mean (m/mean)
                :sd   (m/standard-deviation)})))

The feature-scales function in the preceding code uses t/facet to calculate the mean value and standard deviation of all the input features:

(defn ex-5-24 []
  (let [data (iota/seq "data/soi.csv")
        features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2]]
    (->> (feature-scales features)
         (t/tesser (chunks data)))))

;; {:MARS2 {:sd 533.4496892658647, :mean 317.0412009748016}...}

If you run the preceding example, you'll see the different means and standard deviations returned by the feature-scales function. Since our feature scales and input records are represented as maps, we can perform the scaling across all the features at once using Clojure's merge-with function:

(defn scale-features [factors]
  (let [f (fn [x {:keys [mean sd]}]
            (/ (- x mean) sd))]
    (fn [x]
      (merge-with f x factors))))

Likewise, we can perform the all-important reversal with unscale-features:

(defn unscale-features [factors]
  (let [f (fn [x {:keys [mean sd]}]
            (+ (* x sd) mean))]
    (fn [x]
      (merge-with f x factors))))

Let's scale our features and take a look at the very first feature.
Tesser won't allow us to execute a fold without a reduce, so we'll temporarily revert to using Clojure's reducers: (defn ex-5-25 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map #(select-keys % features )) (r/map (scale-features factors)) (into []) (first)))) ;; {:MARS2 -0.14837567114357617, :NUMDEP 0.30617757526890155, ;; :AGI_STUB -0.714280814223704, :A00200 -0.5894942801950217, ;; :A02300 0.031741856083514465} This simple step will help gradient descent perform optimally on our data. Feature extraction Although we've used maps to represent our input data in this article, it's going to be more convenient when running gradient descent to represent our features as a matrix. Let's write a function to transform our input data into a map of xs and y. The y axis will be a scalar response value and xs will be a matrix of scaled feature values. We're adding a bias term to the returned matrix of features: (defn feature-matrix [record features] (let [xs (map #(% record) features)] (i/matrix (cons 1 xs)))) (defn extract-features [fy features] (fn [record] {:y (fy record) :xs (feature-matrix record features)})) Our feature-matrix function simply accepts an input of a record and the features to convert into a matrix. We call this from within extract-features, which returns a function that we can call on each input record: (defn ex-5-26 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map (scale-features factors)) (r/map (extract-features :A02300 features)) (into []) (first)))) ;; {:y 433.0, :xs A 5x1 matrix ;; ------------- ;; 1.00e+00 ;; -5.89e-01 ;; -7.14e-01 ;; 3.06e-01 ;; -1.48e-01 ;; } The preceding example shows the data converted into a format suitable to perform gradient descent: a map containing the y response variable and a matrix of values, including the bias term. Applying a single step of gradient descent The objective of calculating the cost is to determine the amount by which to adjust each of the coefficients. Once we've calculated the average cost, as we did previously, we need to update the estimate of our coefficients β. Together, these steps represent a single iteration of gradient descent: We can return the updated coefficients in a post-combiner step that makes use of the average cost, the value of alpha, and the previous coefficients. 
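As a quick numerical sketch of that update (the values below are toy numbers, not taken from the IRS data, and the exact printed matrix representation will depend on your Incanter version), subtracting alpha times the cost gradient from a coefficient vector of [1.0 2.0] looks like this:

;; toy illustration of a single coefficient update
(let [coefs (i/matrix [1.0 2.0])
      cost  (i/matrix [0.5 -0.25])   ;; pretend average cost gradient
      alpha 0.1]
  (i/minus coefs (i/mult cost alpha)))
;; => a 2x1 matrix with values close to [0.95 2.025]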
Let's create a utility function update-coefficients, which will receive the coefficients and alpha and return a function that will calculate the new coefficients, given a total model cost: (defn update-coefficients [coefs alpha] (fn [cost] (->> (i/mult cost alpha) (i/minus coefs)))) With the preceding function in place, we have everything we need to package up a batch gradient descent update rule: (defn gradient-descent-fold [{:keys [fy features factors coefs alpha]}] (let [zeros-matrix (i/matrix 0 (count features) 1)] (->> (prepare-data) (t/map (scale-features factors)) (t/map (extract-features fy features)) (t/map (calculate-error (i/trans coefs))) (t/fold (matrix-mean (inc (count features)) 1)) (t/post-combine (update-coefficients coefs alpha))))) (defn ex-5-31 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1}] (->> (gradient-descent-fold options) (t/tesser data)))) ;; A 6x1 matrix ;; ------------- ;; -4.20e+02 ;; -1.38e+06 ;; -5.06e+07 ;; -9.53e+02 ;; -1.42e+06 ;; -4.86e+05 The resulting matrix represents the values of the coefficients after the first iteration of gradient descent. Running iterative gradient descent Gradient descent is an iterative algorithm, and we will usually need to run it many times to convergence. With a large dataset, this can be very time-consuming. To save time, we've included a random sample of soi.csv in the data directory called soi-sample.csv. The smaller size allows us to run iterative gradient descent in a reasonable timescale. The following code runs gradient descent for 100 iterations, plotting the values of the parameters between each iteration on an xy-plot: (defn descend [options data] (fn [coefs] (->> (gradient-descent-fold (assoc options :coefs coefs)) (t/tesser data)))) (defn ex-5-32 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi-sample.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1} iterations 100 xs (range iterations) ys (->> (iterate (descend options data) coefs) (take iterations))] (-> (c/xy-plot xs (map first ys) :x-label "Iterations" :y-label "Coefficient") (c/add-lines xs (map second ys)) (c/add-lines xs (map #(nth % 2) ys)) (c/add-lines xs (map #(nth % 3) ys)) (c/add-lines xs (map #(nth % 4) ys)) (i/view)))) If you run the example, you should see a chart similar to the following: In the preceding chart, you can see how the parameters converge to relatively stable the values over the course of 100 iterations. Scaling gradient descent with Hadoop The length of time each iteration of batch gradient descent takes to run is determined by the size of your data and by how many processors your computer has. Although several chunks of data are processed in parallel, the dataset is large and the processors are finite. We've achieved a speed gain by performing calculations in parallel, but if we double the size of the dataset, the runtime will double as well. Hadoop is one of several systems that has emerged in the last decade which aims to parallelize work that exceeds the capabilities of a single machine. 
Rather than running code across multiple processors, Hadoop takes care of running a calculation across many servers. In fact, Hadoop clusters can, and some do, consist of many thousands of servers. Hadoop consists of two primary subsystems— the Hadoop Distributed File System (HDFS)—and the job processing system, MapReduce. HDFS stores files in chunks. A given file may be composed of many chunks and chunks are often replicated across many servers. In this way, Hadoop can store quantities of data much too large for any single server and, through replication, ensure that the data is stored reliably in the event of hardware failure too. As the name implies, the MapReduce programming model is built around the concept of map and reduce steps. Each job is composed of at least one map step and may optionally specify a reduce step. An entire job may consist of several map and reduce steps chained together. In the respect that reduce steps are optional, Hadoop has a slightly more flexible approach to distributed calculation than Tesser. Gradient descent on Hadoop with Tesser and Parkour Tesser's Hadoop capabilities are available in the tesser.hadoop namespace, which we're including as h. The primary public API function in the Hadoop namespace is h/fold. The fold function expects to receive at least four arguments, representing the configuration of the Hadoop job, the input file we want to process, a working directory for Hadoop to store its intermediate files, and the fold we want to run, referenced as a Clojure var. Any additional arguments supplied will be passed as arguments to the fold when it is executed. The reason for using a var to represent our fold is that the function call initiating the fold may happen on a completely different computer than the one that actually executes it. In a distributed setting, the var and arguments must entirely specify the behavior of the function. We can't, in general, rely on other mutable local state (for example, the value of an atom, or the value of variables closing over the function) to provide any additional context. Parkour distributed sources and sinks The data which we want our Hadoop job to process may exist on multiple machines too, stored distributed in chunks on HDFS. Tesser makes use of a library called Parkour (https://github.com/damballa/parkour/) to handle accessing potentially distributed data sources. Although Hadoop is designed to be run and distributed across many servers, it can also run in local mode. Local mode is suitable for testing and enables us to interact with the local filesystem as if it were HDFS. Another namespace we'll be using from Parkour is the parkour.conf namespace. This will allow us to create a default Hadoop configuration and operate it in local mode: (defn ex-5-33 [] (->> (text/dseq "data/soi.csv") (r/take 2) (into []))) In the preceding example, we use Parkour's text/dseq function to create a representation of the IRS input data. The return value implements Clojure's reducers protocol, so we can use r/take on the result. Running a feature scale fold with Hadoop Hadoop needs a location to write its temporary files while working on a task, and will complain if we try to overwrite an existing directory. Since we'll be executing several jobs over the course of the next few examples, let's create a little utility function that returns a new file with a randomly-generated name. 
(defn rand-file [path] (io/file path (str (long (rand 0x100000000))))) (defn ex-5-34 [] (let [conf (conf/ig) input (text/dseq "data/soi.csv") workdir (rand-file "tmp") features [:A00200 :AGI_STUB :NUMDEP :MARS2]] (h/fold conf input workdir #'feature-scales features))) Parkour provides a default Hadoop configuration object with the shorthand (conf/ig). This will return an empty configuration. The default value is enough, we don't need to supply any custom configuration. All of our Hadoop jobs will write their temporary files to a random directory inside the project's tmp directory. Remember to delete this folder later, if you're concerned about preserving disk space. If you run the preceding example now, you should get an output similar to the following: ;; {:MARS2 317.0412009748016, :NUMDEP 581.8504423822615, ;; :AGI_STUB 3.499939975269811, :A00200 37290.58880658831} Although the return value is identical to the values we got previously, we're now making use of Hadoop behind the scenes to process our data. In spite of this, notice that Tesser will return the response from our fold as a single Clojure data structure. Running gradient descent with Hadoop Since tesser.hadoop folds return Clojure data structures just like tesser.core folds, defining a gradient descent function that makes use of our scaled features is very simple: (defn hadoop-gradient-descent [conf input-file workdir] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) input (text/dseq input-file) options {:column-names column-names :features features :coefs coefs :fy :A02300 :alpha 1e-3} factors (h/fold conf input (rand-file workdir) #'feature-scales features) descend (fn [coefs] (h/fold conf input (rand-file workdir) #'gradient-descent-fold (merge options {:coefs coefs :factors factors})))] (take 5 (iterate descend coefs)))) The preceding code defines a hadoop-gradient-descent function that iterates a descend function 5 times. Each iteration of descend calculates the improved coefficients based on the gradient-descent-fold function. The final return value is a vector of coefficients after 5 iterations of a gradient descent. We run the job on the full IRS data in the following example: ( defn ex-5-35 [] (let [workdir "tmp" out-file (rand-file workdir)] (hadoop-gradient-descent (conf/ig) "data/soi.csv" workdir))) After several iterations, you should see an output similar to the following: ;; ([0 0 0 0 0] ;; (20.9839310796048 46.87214911003046 -7.363493937722712 ;; 101.46736841329326 55.67860863427868) ;; (40.918665605227744 56.55169901254631 -13.771345753228694 ;; 162.1908841131747 81.23969785586247) ;; (59.85666340457121 50.559130068258995 -19.463888245285332 ;; 202.32407094149158 92.77424653758085) ;; (77.8477613139478 38.67088624825574 -24.585818946408523 ;; 231.42399118694212 97.75201693843269)) We've seen how we're able to calculate gradient descent using distributed techniques locally. Now, let's see how we can run this on a cluster of our own. Summary In this article, we learned some of the fundamental techniques of distributed data processing and saw how the functions used locally for data processing, map and reduce, are powerful ways of processing even very large quantities of data. We learned how Hadoop can scale unbounded by the capabilities of any single server by running functions on smaller subsets of the data whose outputs are themselves combined to finally produce a result. 
Once you understand the tradeoffs, this "divide and conquer" approach toward processing data is a simple and very general way of analyzing data on a large scale. We saw both the power and limitations of simple folds to process data using both Clojure's reducers and Tesser. We've also begun exploring how Parkour exposes more of Hadoop's underlying capabilities.
Starting with YARN Basics

In this article by Akhil Arora and Shrey Mehrotra, authors of the book Learning YARN, we will be discussing how Hadoop was developed as a solution to handle big data in a cost effective and easiest way possible. Hadoop consisted of a storage layer, that is, Hadoop Distributed File System (HDFS) and the MapReduce framework for managing resource utilization and job execution on a cluster. With the ability to deliver high performance parallel data analysis and to work with commodity hardware, Hadoop is used for big data analysis and batch processing of historical data through MapReduce programming. (For more resources related to this topic, see here.) With the exponential increase in the usage of social networking sites such as Facebook, Twitter, and LinkedIn and e-commerce sites such as Amazon, there was the need of a framework to support not only MapReduce batch processing, but real-time and interactive data analysis as well. Enterprises should be able to execute other applications over the cluster to ensure that cluster capabilities are utilized to the fullest. The data storage framework of Hadoop was able to counter the growing data size, but resource management became a bottleneck. The resource management framework for Hadoop needed a new design to solve the growing needs of big data. YARN, an acronym for Yet Another Resource Negotiator, has been introduced as a second-generation resource management framework for Hadoop. YARN is added as a subproject of Apache Hadoop. With MapReduce focusing only on batch processing, YARN is designed to provide a generic processing platform for data stored across a cluster and a robust cluster resource management framework. In this article, we will cover the following topics: Introduction to MapReduce v1 Shortcomings of MapReduce v1 An overview of the YARN components The YARN architecture How YARN satisfies big data needs Projects powered by YARN Introduction to MapReduce v1 MapReduce is a software framework used to write applications that simultaneously process vast amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is a batch-oriented model where a large amount of data is stored in Hadoop Distributed File System (HDFS), and the computation on data is performed as MapReduce phases. The basic principle for the MapReduce framework is to move computed data rather than move data over the network for computation. The MapReduce tasks are scheduled to run on the same physical nodes on which data resides. This significantly reduces the network traffic and keeps most of the I/O on the local disk or within the same rack. The high-level architecture of the MapReduce framework has three main modules: MapReduce API: This is the end-user API used for programming the MapReduce jobs to be executed on the HDFS data. MapReduce framework: This is the runtime implementation of various phases in a MapReduce job such as the map, sort/shuffle/merge aggregation, and reduce phases. MapReduce system: This is the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, and so on. The MapReduce system consists of two components—JobTracker and TaskTracker. JobTracker is the master daemon within Hadoop that is responsible for resource management, job scheduling, and management. 
The responsibilities of the JobTracker are as follows:
Hadoop clients communicate with the JobTracker to submit or kill jobs and to poll for job progress
The JobTracker validates a client request and, if it is valid, allocates TaskTracker nodes for the execution of map-reduce tasks
The JobTracker monitors TaskTracker nodes and their resource utilization, that is, how many tasks are currently running and how many map-reduce task slots are available, and decides whether a TaskTracker node needs to be marked as a blacklisted node, and so on
The JobTracker monitors the progress of jobs and, if a job or task fails, it automatically reinitializes the job or task on a different TaskTracker node
The JobTracker also keeps the history of the jobs executed on the cluster
TaskTracker is a per-node daemon responsible for the execution of map-reduce tasks. A TaskTracker node is configured to accept a number of map-reduce tasks from the JobTracker, that is, the total number of map-reduce tasks a TaskTracker can execute simultaneously. Its responsibilities are as follows:
The TaskTracker initializes a new JVM process to perform the MapReduce logic. Running a task in a separate JVM ensures that a task failure does not harm the health of the TaskTracker daemon.
The TaskTracker monitors these JVM processes and updates the task progress to the JobTracker at regular intervals.
The TaskTracker also sends a heartbeat signal and its current resource utilization metric (available task slots) to the JobTracker every few minutes.
Shortcomings of MapReduce v1
Though the Hadoop MapReduce framework was widely used, the following limitations were found with the framework:
Batch processing only: The resources across the cluster are tightly coupled with map-reduce programming. The framework does not support the integration of other data processing frameworks and forces everything to look like a MapReduce job. Emerging customer requirements demand support for real-time and near real-time processing of the data stored on the distributed file system.
Nonscalability and inefficiency: The MapReduce framework depends completely on the master daemon, that is, the JobTracker. It manages the cluster resources, the execution of jobs, and fault tolerance as well. It has been observed that Hadoop cluster performance degrades drastically when the cluster size increases above 4,000 nodes or the count of concurrent tasks crosses 40,000. The centralized handling of the job control flow resulted in endless scalability concerns for the scheduler.
Unavailability and unreliability: Availability and reliability are considered critical aspects of a framework such as Hadoop. A single point of failure for the MapReduce framework is the failure of the JobTracker daemon. The JobTracker manages the jobs and resources across the cluster. If it goes down, information related to running or queued jobs and the job history is lost, and the queued and running jobs are killed. The MapReduce v1 framework doesn't have any provision to recover the lost data or jobs.
Partitioning of resources: The MapReduce framework divides a job into multiple map and reduce tasks. The nodes running the TaskTracker daemon are considered resources, and the capability of a resource to execute MapReduce jobs is expressed as the number of map-reduce tasks it can execute simultaneously. The framework forced the cluster resources to be partitioned into map and reduce task slots, and such partitioning resulted in lower utilization of the cluster resources.
If you have a running Hadoop 1.x cluster, you can refer to the JobTracker web interface to view the map and reduce task slots of the active TaskTracker nodes. The link for the active TaskTracker list is as follows: http://JobTrackerHost:50030/machines.jsp?type=active Management of user logs and job resources: The user logs refer to the logs generated by a MapReduce job. Logs for MapReduce jobs. These logs can be used to validate the correctness of a job or to perform log analysis to tune up the job's performance. In MapReduce v1, the user logs are generated and stored on the local file system of the slave nodes. Accessing logs on the slaves is a pain as users might not have the permissions issued. Since logs were stored on the local file system of a slave, in case the disk goes down, the logs will be lost. A MapReduce job might require some extra resources for job execution. In the MapReduce v1 framework, the client copies job resources to the HDFS with the replication of 10. Accessing resources remotely or through HDFS is not efficient. Thus, there's a need for localization of resources and a robust framework to manage job resources. In January 2008, Arun C. Murthy logged a bug in JIRA against the MapReduce architecture, which resulted in a generic resource scheduler and a per job user-defined component that manages the application execution. You can see this at https://issues.apache.org/jira/browse/MAPREDUCE-279 An overview of YARN components YARN divides the responsibilities of JobTracker into separate components, each having a specified task to perform. In Hadoop-1, the JobTracker takes care of resource management, job scheduling, and job monitoring. YARN divides these responsibilities of JobTracker into ResourceManager and ApplicationMaster. Instead of TaskTracker, it uses NodeManager as the worker daemon for execution of map-reduce tasks. The ResourceManager and the NodeManager form the computation framework for YARN, and ApplicationMaster is an application-specific framework for application management.   ResourceManager A ResourceManager is a per cluster service that manages the scheduling of compute resources to applications. It optimizes cluster utilization in terms of memory, CPU cores, fairness, and SLAs. To allow different policy constraints, it has algorithms in terms of pluggable schedulers such as capacity and fair that allows resource allocation in a particular way. ResourceManager has two main components: Scheduler: This is a pure pluggable component that is only responsible for allocating resources to applications submitted to the cluster, applying constraint of capacities and queues. Scheduler does not provide any guarantee for job completion or monitoring, it only allocates the cluster resources governed by the nature of job and resource requirement. ApplicationsManager (AsM): This is a service used to manage application masters across the cluster that is responsible for accepting the application submission, providing the resources for application master to start, monitoring the application progress, and restart, in case of application failure. NodeManager The NodeManager is a per node worker service that is responsible for the execution of containers based on the node capacity. Node capacity is calculated based on the installed memory and the number of CPU cores. The NodeManager service sends a heartbeat signal to the ResourceManager to update its health status. The NodeManager service is similar to the TaskTracker service in MapReduce v1. 
NodeManager also sends the status to ResourceManager, which could be the status of the node on which it is running or the status of tasks executing on it. ApplicationMaster An ApplicationMaster is a per application framework-specific library that manages each instance of an application that runs within YARN. YARN treats ApplicationMaster as a third-party library responsible for negotiating the resources from the ResourceManager scheduler and works with NodeManager to execute the tasks. The ResourceManager allocates containers to the ApplicationMaster and these containers are then used to run the application-specific processes. ApplicationMaster also tracks the status of the application and monitors the progress of the containers. When the execution of a container gets complete, the ApplicationMaster unregisters the containers with the ResourceManager and unregisters itself after the execution of the application is complete. Container A container is a logical bundle of resources in terms of memory, CPU, disk, and so on that is bound to a particular node. In the first version of YARN, a container is equivalent to a block of memory. The ResourceManager scheduler service dynamically allocates resources as containers. A container grants rights to an ApplicationMaster to use a specific amount of resources of a specific host. An ApplicationMaster is considered as the first container of an application and it manages the execution of the application logic on allocated containers. The YARN architecture In the previous topic, we discussed the YARN components. Here we'll discuss the high-level architecture of YARN and look at how the components interact with each other. The ResourceManager service runs on the master node of the cluster. A YARN client submits an application to the ResourceManager. An application can be a single MapReduce job, a directed acyclic graph of jobs, a java application, or any shell script. The client also defines an ApplicationMaster and a command to start the ApplicationMaster on a node. The ApplicationManager service of resource manager will validate and accept the application request from the client. The scheduler service of resource manager will allocate a container for the ApplicationMaster on a node and the NodeManager service on that node will use the command to start the ApplicationMaster service. Each YARN application has a special container called ApplicationMaster. The ApplicationMaster container is the first container of an application. The ApplicationMaster requests resources from the ResourceManager. The RequestRequest will have the location of the node, memory, and CPU cores required. The ResourceManager will allocate the resources as containers on a set of nodes. The ApplicationMaster will connect to the NodeManager services and request NodeManager to start containers. The ApplicationMaster manages the execution of the containers and will notify the ResourceManager once the application execution is over. Application execution and progress monitoring is the responsibility of ApplicationMaster rather than ResourceManager. The NodeManager service runs on each slave of the YARN cluster. It is responsible for running application's containers. The resources specified for a container are taken from the NodeManager resources. Each NodeManager periodically updates ResourceManager for the set of available resources. The ResourceManager scheduler service uses this resource matrix to allocate new containers to ApplicationMaster or to start execution of a new application. 
How YARN satisfies big data needs We talked about the MapReduce v1 framework and some limitations of the framework. Let's now discuss how YARN solves these issues: Scalability and higher cluster utilization: Scalability is the ability of a software or product to implement well under an expanding workload. In YARN, the responsibility of resource management and job scheduling / monitoring is divided into separate daemons, allowing YARN daemons to scale the cluster without degrading the performance of the cluster. With a flexible and generic resource model in YARN, the scheduler handles an overall resource profile for each type of application. This structure makes the communication and storage of resource requests efficient for the scheduler resulting in higher cluster utilization. High availability for components: Fault tolerance is a core design principle for any multitenancy platform such as YARN. This responsibility is delegated to ResourceManager and ApplicationMaster. The application specific framework, ApplicationMaster, handles the failure of a container. The ResourceManager handles the failure of NodeManager and ApplicationMaster. Flexible resource model: In MapReduce v1, resources are defined as the number of map and reduce task slots available for the execution of a job. Every resource request cannot be mapped as map/reduce slots. In YARN, a resource-request is defined in terms of memory, CPU, locality, and so on. It results in a generic definition for a resource request by an application. The NodeManager node is the worker node and its capability is calculated based on the installed memory and cores of the CPU. Multiple data processing algorithms: The MapReduce framework is bounded to batch processing only. YARN is developed with a need to perform a wide variety of data processing over the data stored over Hadoop HDFS. YARN is a framework for generic resource management and allows users to execute multiple data processing algorithms over the data. Log aggregation and resource localization: As discussed earlier, accessing and managing user logs is difficult in the Hadoop 1.x framework. To manage user logs, YARN introduced a concept of log aggregation. In YARN, once the application is finished, the NodeManager service aggregates the user logs related to an application and these aggregated logs are written out to a single log file in HDFS. To access the logs, users can use either the YARN command-line options, YARN web interface, or can fetch directly from HDFS. A container might require external resources such as jars, files, or scripts on a local file system. These are made available to containers before they are started. An ApplicationMaster defines a list of resources that are required to run the containers. For efficient disk utilization and access security, the NodeManager ensures the availability of specified resources and their deletion after use. Projects powered by YARN Efficient and reliable resource management is a basic need of a distributed application framework. YARN provides a generic resource management framework to support data analysis through multiple data processing algorithms. There are a lot of projects that have started using YARN for resource management. We've listed a few of these projects here and discussed how YARN integration solves their business requirements: Apache Giraph: Giraph is a framework for offline batch processing of semistructured graph data stored using Hadoop. 
With the Hadoop 1.x version, Giraph had no control over the scheduling policies, heap memory of the mappers, and locality awareness for the running job. Also, defining a Giraph job on the basis of mappers / reducers slots was a bottleneck. YARN's flexible resource allocation model, locality awareness principle, and application master framework ease the Giraph's job management and resource allocation to tasks. Apache Spark: Spark enables iterative data processing and machine learning algorithms to perform analysis over data available through HDFS, HBase, or other storage systems. Spark uses YARN's resource management capabilities and framework to submit the DAG of a job. The spark user can focus more on data analytics' use cases rather than how spark is integrated with Hadoop or how jobs are executed. Some other projects powered by YARN are as follows: MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-279 Giraph: https://issues.apache.org/jira/browse/GIRAPH-13 Spark: http://spark.apache.org/ OpenMPI: https://issues.apache.org/jira/browse/MAPREDUCE-2911 HAMA: https://issues.apache.org/jira/browse/HAMA-431 HBase: https://issues.apache.org/jira/browse/HBASE-4329 Storm: http://hortonworks.com/labs/storm/ A page on Hadoop wiki lists a number of projects/applications that are migrating to or using YARN as their resource management tool. You can see this at http://wiki.apache.org/hadoop/PoweredByYarn. Summary This article covered an introduction to YARN, its components, architecture, and different projects powered by YARN. It also explained how YARN solves big data needs. Resources for Article: Further resources on this subject: YARN and Hadoop[article] Introduction to Hadoop[article] Hive in Hadoop [article]

Building a "Click-to-Go" Robot

Packt
28 Aug 2015
16 min read
 In this article by Özen Özkaya and Giray Yıllıkçı, author of the book Arduino Computer Vision Programming, you will learn how to approach computer vision applications, how to divide an application development process into basic steps, how to realize these design steps and how to combine a vision system with the Arduino. Now it is time to connect all the pieces into one! In this article you will learn about building a vision-assisted robot which can go to any point you want within the boundaries of the camera's sight. In this scenario there will be a camera attached to the ceiling and, once you get the video stream from the robot and click on any place in the view, the robot will go there. This application will give you an all-in-one development application. Before getting started, let's try to draw the application scheme and define the potential steps. We want to build a vision-enabled robot which can be controlled via a camera attached to the ceiling and, when we click on any point in the camera view, we want our robot to go to this specific point. This operation requires a mobile robot that can communicate with the vision system. The vision system should be able to detect or recognize the robot and calculate the position and orientation of the robot. The vision system should also give us the opportunity to click on any point in the view and it should calculate the path and the robot movements to get to the destination. This scheme requires a communication line between the robot and the vision controller. In the following illustration, you can see the physical scheme of the application setup on the left hand side and the user application window on the right hand side: After interpreting the application scheme, the next step is to divide the application into small steps by using the computer vision approach. In the data acquisition phase, we'll only use the scene's video stream. There won't be an external sensor on the robot because, for this application, we don't need one. Camera selection is important and the camera distance (the height from the robot plane) should be enough to see the whole area. We'll use the blue and red circles above the robot to detect the robot and calculate its orientation. We don't need smaller details. A resolution of about 640x480 pixels is sufficient for a camera distance of 120 cm. We need an RGB camera stream because we'll use the color properties of the circles. We will use the Logitech C110, which is an affordable webcam. Any other OpenCV compatible webcam will work because this application is not very demanding in terms of vision input. If you need more cable length you can use a USB extension cable. In the preprocessing phase, the first step is to remove the small details from the surface. Blurring is a simple and effective operation for this purpose. If you need to, you can resize your input image to reduce the image size and processing time. Do not forget that, if you resize to too small a resolution, you won't be able to extract useful information. The following picture is of the Logitech C110 webcam: The next step is processing. There are two main steps in this phase. The first step is to detect the circles in the image. The second step is to calculate the robot orientation and the path to the destination point. The robot can then follow the path and reach its destination. In color processing with which we can apply color filters to the image to get the image masks of the red circle and the blue circle, as shown in the following picture. 
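To make the color-filtering step concrete, here is a minimal Python/OpenCV sketch. The book's own vision code is written in C++, so treat this purely as an illustration; the HSV threshold values are assumptions that must be tuned for your camera and lighting, not values from the book.

import cv2
import numpy as np

def circle_masks(frame_bgr):
    """Return binary masks for the blue and red markers in a BGR frame."""
    # Blur first to suppress small surface details, as described above.
    blurred = cv2.GaussianBlur(frame_bgr, (9, 9), 0)
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)

    # Illustrative hue/saturation/value ranges; tune for your setup.
    blue_mask = cv2.inRange(hsv, (100, 120, 70), (130, 255, 255))
    # Red wraps around hue 0, so combine two ranges.
    red_mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255)) | \
               cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    return blue_mask, red_mask

def mask_center(mask):
    """Centroid of the largest blob in a mask, or None if nothing is found."""
    # Index [-2] picks the contour list in both OpenCV 3.x and 4.x.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    m = cv2.moments(max(contours, key=cv2.contourArea))
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

The two centroids returned here are exactly what the blob-analysis and orientation calculation described next operate on.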
Then we can use contour detection or blob analysis to detect the circles and extract useful features. It is important to keep it simple and logical: Blob analysis detects the bounding boxes of two circles on the robot and, if we draw a line between the centers of the circles, once we calculate the line angle, we will get the orientation of the robot itself. The mid-point of this line will be the center of the robot. If we draw a line from the center of the robot to the destination point we obtain the straightest route. The circles on the robot can also be detected by using the Hough transform for circles but, because it is a relatively slow algorithm and it is hard to extract image statistics from the results, the blob analysis-based approach is better. Another approach is by using the SURF, SIFT or ORB features. But these methods probably won't provide fast real-time behavior, so blob analysis will probably work better. After detecting blobs, we can apply post-filtering to remove the unwanted blobs. We can use the diameter of the circles, the area of the bounding box, and the color information, to filter the unwanted blobs. By using the properties of the blobs (extracted features), it is possible to detect or recognize the circles, and then the robot. To be able to check if the robot has reached the destination or not, a distance calculation from the center of the robot to the destination point would be useful. In this scenario, the robot will be detected by our vision controller. Detecting the center of the robot is sufficient to track the robot. Once we calculate the robot's position and orientation, we can combine this information with the distance and orientation to the destination point and we can send the robot the commands to move it! Efficient planning algorithms can be applied in this phase but, we'll implement a simple path planning approach. Firstly, the robot will orientate itself towards the destination point by turning right or left and then it will go forward to reach the destination. This scenario will work for scenarios without obstacles. If you want to extend the application for a complex environment with obstacles, you should implement an obstacle detection mechanism and an efficient path planning algorithm. We can send the commands such as Left!, Right!, Go!, or Stop! to the robot over a wireless line. RF communication is an efficient solution for this problem. In this scenario, we need two NRF24L01 modules—the first module is connected to the robot controller and the other is connected to the vision controller. The Arduino is the perfect means to control the robot and communicate with the vision controller. The vision controller can be built on any hardware platform such as a PC, tablet, or a smartphone. The vision controller application can be implemented on lots of operating systems as OpenCV is platform-independent. We preferred Windows and a laptop to run our vision controller application. As you can see, we have divided our application into small and easy-to-implement parts. Now it is time to build them all! Building a robot It is time to explain how to build our Click-to-Go robot. Before going any further we would like to boldly say that robotic projects can teach us the fundamental fields of science such as mechanics, electronics, and programming. As we go through the building process of our Click-to-Go robot, you will see that we have kept it as simple as possible. 
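Before we start building, here is a small Python sketch of the "orient, then go forward" decision described above. The Left!, Right!, Go!, and Stop! strings are the same commands the Arduino code later in this article expects; the angle tolerance, the stop radius, and the sign convention for turning are illustrative assumptions that depend on your image coordinate setup.

import math

def steering_command(robot_center, robot_heading_rad, target,
                     angle_tol_rad=math.radians(10), stop_radius_px=15):
    """Turn-then-go planner: returns one of 'Left!', 'Right!', 'Go!', 'Stop!'."""
    dx = target[0] - robot_center[0]
    dy = target[1] - robot_center[1]
    if math.hypot(dx, dy) < stop_radius_px:
        return "Stop!"
    # Signed difference between the desired bearing and the current heading,
    # wrapped to [-pi, pi].
    error = math.atan2(dy, dx) - robot_heading_rad
    error = math.atan2(math.sin(error), math.cos(error))
    if abs(error) > angle_tol_rad:
        # Swap 'Right!'/'Left!' here if your image y-axis convention differs.
        return "Right!" if error > 0 else "Left!"
    return "Go!"

In the full application, the returned string is written to the transmitter Arduino over the serial line, which forwards it to the robot over the nRF24L01 link.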
Moreover, instead of buying ready-to-use robot kits, we have built our own simple and robust robot. Of course, if you are planning to buy a robot kit or already have a kit available, you can simply adapt your existing robot into this project. Our robot design is relatively simple in terms of mechanics. We will use only a box-shaped container platform, two gear motors with two individual wheels, a battery to drive the motors, one nRF24L01 Radio Frequency (RF) transceiver module, a bunch of jumper wires, an L293D IC and, of course, one Arduino Uno board module. We will use one more nRF24L01 and one more Arduino Uno for the vision controller communication circuit. Our Click-to-Go robot will be operated by a simplified version of a differential drive. A differential drive can be summarized as a relative speed change on the wheels, which assigns a direction to the robot. In other words, if both wheels spin at the same rate, the robot goes forward. To drive in reverse, the wheels spin in the opposite direction. To turn left, the left wheel turns backwards and the right wheel stays still or turns forwards. Similarly, to turn right, the right wheel turns backwards and the left stays still or turns forwards. You can get curved paths by varying the rotation speeds of the wheels. Yet, to cover every aspect of this comprehensive project, we will drive the wheels of both the motors forward to go forwards. To turn left, the left wheel stays still and the right wheel turns forward. Symmetrically, to turn right, the right motor stays still and the left motor runs forward. We will not use running motors in a reverse direction to go backwards. Instead, we will change the direction of the robot by turning right or left. Building mechanics As we stated earlier, the mechanics of the robot are fairly simple. First of all we need a small box-shaped container to use as both a rigid surface and the storage for the battery and electronics. For this purpose, we will use a simple plywood box. We will attach gear motors in front of the plywood box and any kind of support surface to the bottom of the box. As can be seen in the following picture, we used a small wooden rod to support the back of the robot to level the box: If you think that the wooden rod support is dragging, we recommend adding a small ball support similar to Pololu's ball caster, shown at https://www.pololu.com/product/950. It is not a very expensive component and it significantly improves the mobility of the robot. You may want to drill two holes next to the motor wirings to keep the platform tidy. The easiest way to attach the motors and the support rod is by using two-sided tape. Just make sure that the tape is not too thin. It is much better to use two-sided foamy tape. The topside of the robot can be covered with a black shell to enhance the contrast between the red and blue circles. We will use these circles to ascertain the orientation of the robot during the operation, as mentioned earlier. For now, don't worry too much about this detail. Just be aware that we need to cover the top of the robot with a flat surface. We will explain in detail on how these red and blue circles are used. It is worth mentioning that we used large water bottle lids. It is better to use matt surfaces instead of shiny surfaces to avoid glare in the image. The finished Click-to-Go robot should be similar to the robot shown in the following picture. 
The robot's head is on the side with the red circle: As we have now covered building the mechanics of our robot we can move on to building the electronics. Building the electronics We will use two separate Arduino Unos for this vision-enabled robot project, one each for the robot and the transmitter system. The electronic setup needs a little bit more attention than the mechanics. The electronic components of the robot and the transmitter units are similar. However, the robot needs more work. We have selected nRF24L01 modules for the wireless communication module,. These modules are reliable and easy to find from both the Internet and local hobby stores. It is possible to use any pair of wireless connectivity modules but, for this project, we will stick with nRF24L01 modules, as shown in this picture: For the driving motors we will need to use a quadruple half-H driver, L293D. Again, every electronic shop should have these ICs. As a reminder, you may need to buy a couple of spare L293D ICs in case you burn the IC by mistake. Following is the picture of the L293D IC: We will need a bunch of jumper wires to connect the components together. It is nice to have a small breadboard for the robot/receiver, to wire the L293D. The transmitter part is very simple so a breadboard is not essential. Robot/receiver and transmitter drawings The drawings of both the receiver and the transmitter have two common modules: Arduino Uno and nRF24L01 connectivity modules. The connections of the nRF24L01 modules on both sides are the same. In addition to these connectivity modules, for the receiver, we need to put some effort into connecting the L293D IC and the battery to power up the motors. In the following picture, we can see a drawing of the transmitter. As it will always be connected to the OpenCV platform via the USB cable, there is no need to feed the system with an external battery: As shown in the following picture of the receiver and the robot, it is a good idea to separate the motor battery from the battery that feeds the Arduino Uno board because the motors may draw high loads or create high loads, which can easily damage the Arduino board's pin outs. Another reason is to keep the Arduino working even if the battery motor has drained. Separating the feeder batteries is a very good practice to follow if you are planning to use more than one 12V battery. To keep everything safe, we fed the Arduino Uno with a 6V battery pack and the motors with a 9V battery: Drawings of receiver systems can be little bit confusing and lead to errors. It is a good idea to open the drawings and investigate how the connections are made by using Fritzing. You can download the Fritzing drawings of this project from https://github.com/ozenozkaya/click_to_go_robot_drawings. To download the Fritzing application, visit the Fritzing download page: http://fritzing.org/download/ Building the robot controller and communications We are now ready to go through the software implementation of the robot and the transmitter. Basically what we are doing here is building the required connectivity to send data to the remote robot continuously from OpenCV via a transmitter. OpenCV will send commands to the transmitter through a USB cable to the first Arduino board, which will then send the data to the unit on the robot. And it will send this data to the remote robot over the RF module. Follow these steps: Before explaining the code, we need to import the RF24 library. 
To download RF24 library drawings please go to the GitHub link at https://github.com/maniacbug/RF24. After downloading the library, go to Sketch | Include Library | Add .ZIP Library… to include the library in the Arduino IDE environment. After clicking Add .ZIP Library…, a window will appear. Go into the downloads directory and select the RF24-master folder that you just downloaded. Now you are ready to use the RF24 library. As a reminder, it is pretty much the same to include a library in Arduino IDE as on other platforms. It is time to move on to the explanation of the code! It is important to mention that we use the same code for both the robot and the transmitter, with a small trick! The same code works differently for the robot and the transmitter. Now, let's make everything simpler during the code explanation. The receiver mode needs to ground an analog 4 pin. The idea behind the operation is simple; we are setting the role_pin to high through its internal pull-up resistor. So, it will read high even if you don't connect it, but you can still safely connect it to ground and it will read low. Basically, the analog 4 pin reads 0 if the there is a connection with a ground pin. On the other hand, if there is no connection to the ground, the analog 4 pin value is kept as 1. By doing this at the beginning, we determine the role of the board and can use the same code on both sides. Here is the code: #include <SPI.h> #include "nRF24L01.h" #include "RF24.h" #define MOTOR_PIN_1 3 #define MOTOR_PIN_2 5 #define MOTOR_PIN_3 6 #define MOTOR_PIN_4 7 #define ENABLE_PIN 4 #define SPI_ENABLE_PIN 9 #define SPI_SELECT_PIN 10 const int role_pin = A4; typedef enum {transmitter = 1, receiver} e_role; unsigned long motor_value[2]; String input_string = ""; boolean string_complete = false; RF24 radio(SPI_ENABLE_PIN, SPI_SELECT_PIN); const uint64_t pipes[2] = { 0xF0F0F0F0E1LL, 0xF0F0F0F0D2LL }; e_role role = receiver; void setup() { pinMode(role_pin, INPUT); digitalWrite(role_pin, HIGH); delay(20); radio.begin(); radio.setRetries(15, 15); Serial.begin(9600); Serial.println(" Setup Finished"); if (digitalRead(role_pin)) { Serial.println(digitalRead(role_pin)); role = transmitter; } else { Serial.println(digitalRead(role_pin)); role = receiver; } if (role == transmitter) { radio.openWritingPipe(pipes[0]); radio.openReadingPipe(1, pipes[1]); } else { pinMode(MOTOR_PIN_1, OUTPUT); pinMode(MOTOR_PIN_2, OUTPUT); pinMode(MOTOR_PIN_3, OUTPUT); pinMode(MOTOR_PIN_4, OUTPUT); pinMode(ENABLE_PIN, OUTPUT); digitalWrite(ENABLE_PIN, HIGH); radio.openWritingPipe(pipes[1]); radio.openReadingPipe(1, pipes[0]); } radio.startListening(); } void loop() { // TRANSMITTER CODE BLOCK // if (role == transmitter) { Serial.println("Transmitter"); if (string_complete) { if (input_string == "Right!") { motor_value[0] = 0; motor_value[1] = 120; } else if (input_string == "Left!") { motor_value[0] = 120; motor_value[1] = 0; } else if (input_string == "Go!") { motor_value[0] = 120; motor_value[1] = 110; } else { motor_value[0] = 0; motor_value[1] = 0; } input_string = ""; string_complete = false; } radio.stopListening(); radio.write(motor_value, 2 * sizeof(unsigned long)); radio.startListening(); delay(20); } // RECEIVER CODE BLOCK // if (role == receiver) { Serial.println("Receiver"); if (radio.available()) { bool done = false; while (!done) { done = radio.read(motor_value, 2 * sizeof(unsigned long)); delay(20); } Serial.println(motor_value[0]); Serial.println(motor_value[1]); analogWrite(MOTOR_PIN_1, motor_value[1]); 
digitalWrite(MOTOR_PIN_2, LOW); analogWrite(MOTOR_PIN_3, motor_value[0]); digitalWrite(MOTOR_PIN_4 , LOW); radio.stopListening(); radio.startListening(); } } } void serialEvent() { while (Serial.available()) { // get the new byte: char inChar = (char)Serial.read(); // add it to the inputString: input_string += inChar; // if the incoming character is a newline, set a flag // so the main loop can do something about it: if (inChar == '!' || inChar == '?') { string_complete = true; Serial.print("data_received"); } } } This example code is taken from one of the examples in the RF24 library. We have changed it in order to serve our needs in this project. The original example can be found in the RF24-master/Examples/pingpair directory. Summary We have combined everything we have learned up to now and built an all-in-one application. By designing and building the Click-to-Go robot from scratch you have embraced the concepts. You can see that the vision approach very well, even for complex applications. You now know how to divide a computer vision application into small pieces, how to design and implement each design step, and how to efficiently use the tools you have. Resources for Article: Further resources on this subject: Getting Started with Arduino[article] Arduino Development[article] Programmable DC Motor Controller with an LCD [article]

Rendering Stereoscopic 3D Models using OpenGL

Packt
24 Aug 2015
8 min read
In this article, by Raymond C. H. Lo and William C. Y. Lo, authors of the book OpenGL Data Visualization Cookbook, we will demonstrate how to visualize data with stunning stereoscopic 3D technology using OpenGL. Stereoscopic 3D devices are becoming increasingly popular, and the latest generation's wearable computing devices (such as the 3D vision glasses from NVIDIA, Epson, and more recently, the augmented reality 3D glasses from Meta) can now support this feature natively. The ability to visualize data in a stereoscopic 3D environment provides a powerful and highly intuitive platform for the interactive display of data in many applications. For example, we may acquire data from the 3D scan of a model (such as in architecture, engineering, and dentistry or medicine) and would like to visualize or manipulate 3D objects in real time. Unfortunately, OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we will integrate a new library named Open Asset Import Library (Assimp) into our code. The main dependencies include the GLFW library that requires OpenGL version 3.2 and higher. (For more resources related to this topic, see here.) Stereoscopic 3D rendering 3D television and 3D glasses are becoming much more prevalent with the latest trends in consumer electronics and technological advances in wearable computing. In the market, there are currently many hardware options that allow us to visualize information with stereoscopic 3D technology. One common format is side-by-side 3D, which is supported by many 3D glasses as each eye sees an image of the same scene from a different perspective. In OpenGL, creating side-by-side 3D rendering requires asymmetric adjustment as well as viewport adjustment (that is, the area to be rendered) – asymmetric frustum parallel projection or equivalently to lens-shift in photography. This technique introduces no vertical parallax and widely adopted in the stereoscopic rendering. To illustrate this concept, the following diagram shows the geometry of the scene that a user sees from the right eye: The intraocular distance (IOD) is the distance between two eyes. As we can see from the diagram, the Frustum Shift represents the amount of skew/shift for asymmetric frustrum adjustment. Similarly, for the left eye image, we perform the transformation with a mirrored setting. The implementation of this setup is described in the next section. How to do it... The following code illustrates the steps to construct the projection and view matrices for stereoscopic 3D visualization. The code uses the intraocular distance, the distance of the image plane, and the distance of the near clipping plane to compute the appropriate frustum shifts value. 
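Using the same variable names as the code that follows, the asymmetric (lens-shift) frustum for each eye is built as

$$\text{frustumshift} = \frac{IOD}{2}\cdot\frac{z_{near}}{z_{image}},\qquad top = z_{near}\tan\!\left(\frac{fov}{2}\right),\qquad left/right = \mp\, a\cdot top + d\cdot\text{frustumshift},$$

where $a$ is the aspect ratio, $z_{image}$ is the distance of the virtual image plane (depthZ in the code), and $d = +1$ for the left eye and $d = -1$ for the right eye.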
In the source file, common/controls.cpp, we add the implementation for the stereo 3D matrix setup: void computeStereoViewProjectionMatrices(GLFWwindow* window, float IOD, float depthZ, bool left_eye){ int width, height; glfwGetWindowSize(window, &width, &height); //up vector glm::vec3 up = glm::vec3(0,-1,0); glm::vec3 direction_z(0, 0, -1); //mirror the parameters with the right eye float left_right_direction = -1.0f; if(left_eye) left_right_direction = 1.0f; float aspect_ratio = (float)width/(float)height; float nearZ = 1.0f; float farZ = 100.0f; double frustumshift = (IOD/2)*nearZ/depthZ; float top = tan(g_initial_fov/2)*nearZ; float right = aspect_ratio*top+frustumshift*left_right_direction; //half screen float left = -aspect_ratio*top+frustumshift*left_right_direction; float bottom = -top; g_projection_matrix = glm::frustum(left, right, bottom, top, nearZ, farZ); // update the view matrix g_view_matrix = glm::lookAt( g_position-direction_z+ glm::vec3(left_right_direction*IOD/2, 0, 0), //eye position g_position+ glm::vec3(left_right_direction*IOD/2, 0, 0), //centre position up //up direction ); In the rendering loop in main.cpp, we define the viewports for each eye (left and right) and set up the projection and view matrices accordingly. For each eye, we translate our camera position by half of the intraocular distance, as illustrated in the previous figure: if(stereo){ //draw the LEFT eye, left half of the screen glViewport(0, 0, width/2, height); //computes the MVP matrix from the IOD and virtual image plane distance computeStereoViewProjectionMatrices(g_window, IOD, depthZ, true); //gets the View and Model Matrix and apply to the rendering glm::mat4 projection_matrix = getProjectionMatrix(); glm::mat4 view_matrix = getViewMatrix(); glm::mat4 model_matrix = glm::mat4(1.0); model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f)); glm::mat4 mvp = projection_matrix * view_matrix * model_matrix; //sends our transformation to the currently bound shader, //in the "MVP" uniform variable glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]); //render scene, with different drawing modes if(drawTriangles) obj_loader->draw(GL_TRIANGLES); if(drawPoints) obj_loader->draw(GL_POINTS); if(drawLines) obj_loader->draw(GL_LINES); //Draw the RIGHT eye, right half of the screen glViewport(width/2, 0, width/2, height); computeStereoViewProjectionMatrices(g_window, IOD, depthZ, false); projection_matrix = getProjectionMatrix(); view_matrix = getViewMatrix(); model_matrix = glm::mat4(1.0); model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f)); mvp = projection_matrix * view_matrix * model_matrix; glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]); if(drawTriangles) obj_loader->draw(GL_TRIANGLES); if(drawPoints) obj_loader->draw(GL_POINTS); if(drawLines) obj_loader->draw(GL_LINES); } The final rendering result consists of two separate images on each side of the display, and note that each image is compressed horizontally by a scaling factor of two. 
For some display systems, each side of the display is required to preserve the same aspect ratio depending on the specifications of the display. Here are the final screenshots of the same models in true 3D using stereoscopic 3D rendering: Here's the rendering of the architectural model in stereoscopic 3D: How it works... The stereoscopic 3D rendering technique is based on the parallel axis and asymmetric frustum perspective projection principle. In simpler terms, we rendered a separate image for each eye as if the object was seen at a different eye position but viewed on the same plane. Parameters such as the intraocular distance and frustum shift can be dynamically adjusted to provide the desired 3D stereo effects. For example, by increasing or decreasing the frustum asymmetry parameter, the object will appear to be moved in front or behind the plane of the screen. By default, the zero parallax plane is set to the middle of the view volume. That is, the object is set up so that the center position of the object is positioned at the screen level, and some parts of the object will appear in front of or behind the screen. By increasing the frustum asymmetry (that is, positive parallax), the scene will appear to be pushed behind the screen. Likewise, by decreasing the frustum asymmetry (that is, negative parallax), the scene will appear to be pulled in front of the screen. The glm::frustum function sets up the projection matrix, and we implemented the asymmetric frustum projection concept illustrated in the drawing. Then, we use the glm::lookAt function to adjust the eye position based on the IOP value we have selected. To project the images side by side, we use the glViewport function to constrain the area within which the graphics can be rendered. The function basically performs an affine transformation (that is, scale and translation) which maps the normalized device coordinate to the window coordinate. Note that the final result is a side-by-side image in which the graphic is scaled by a factor of two vertically (or compressed horizontally). Depending on the hardware configuration, we may need to adjust the aspect ratio. The current implementation supports side-by-side 3D, which is commonly used in most wearable Augmented Reality (AR) or Virtual Reality (VR) glasses. Fundamentally, the rendering technique, namely the asymmetric frustum perspective projection described in our article, is platform-independent. For example, we have successfully tested our implementation on the Meta 1 Developer Kit (https://www.getameta.com/products) and rendered the final results on the optical see-through stereoscopic 3D display: Here is the front view of the Meta 1 Developer Kit, showing the optical see-through stereoscopic 3D display and 3D range-sensing camera: The result is shown as follows, with the stereoscopic 3D graphics rendered onto the real world (which forms the basis of augmented reality): See also In addition, we can easily extend our code to support shutter glasses-based 3D monitors by utilizing the Quad Buffered OpenGL APIs (refer to the GL_BACK_RIGHT and GL_BACK_LEFT flags in the glDrawBuffer function). Unfortunately, such 3D formats require specific hardware synchronization and often require higher frame rate display (for example, 120Hz) as well as a professional graphics card. Further information on how to implement stereoscopic 3D in your application can be found at http://www.nvidia.com/content/GTC-2010/pdfs/2010_GTC2010.pdf. 
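The article's implementation is in C++, but the quad-buffered idea mentioned above can be sketched in a few lines of Python with PyOpenGL. Here, draw_scene and compute_mvp stand in for your own rendering and matrix code (they are not part of the article), and the OpenGL context is assumed to have been created with quad-buffered stereo enabled.

from OpenGL.GL import (glDrawBuffer, glClear, GL_BACK_LEFT, GL_BACK_RIGHT,
                       GL_COLOR_BUFFER_BIT, GL_DEPTH_BUFFER_BIT)

def render_stereo_frame(draw_scene, compute_mvp):
    # Render the same scene twice, once into each back buffer of a
    # quad-buffered (shutter-glasses) stereo context.
    for buffer_id, left_eye in ((GL_BACK_LEFT, True), (GL_BACK_RIGHT, False)):
        glDrawBuffer(buffer_id)
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
        draw_scene(compute_mvp(left_eye))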
Summary In this article, we covered how to visualize data with stunning stereoscopic 3D technology using OpenGL. OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we have integrated a new library named Assimp into the code. Resources for Article: Further resources on this subject: Organizing a Virtual Filesystem [article] Using OpenCL [article] Introduction to Modern OpenGL [article]

Identifying Big Data Evidence in Hadoop

Packt
21 Aug 2015
13 min read
In this article by Joe Sremack, author of the book Big Data Forensics, we will cover the following topics: An overview of how to identify Big Data forensic evidence Techniques for previewing Hadoop data (For more resources related to this topic, see here.) Hadoop and other Big Data systems pose unique challenges to forensic investigators. Hadoop clusters are distributed systems with voluminous data storage, complex data processing, and data that is split and made redundant at the data block level. Unlike with traditional forensics, performing forensics on Hadoop using the traditional methods is not always feasible. Instead, forensic investigators, experts, and legal professionals—such as attorneys and court officials—need to understand how forensics is performed against these complex systems. The first step in a forensic investigation is to identify the evidence. In this article, several of the concepts involved in identifying forensic evidence from Hadoop and its applications are covered. Identifying forensic evidence is a complex process for any type of investigation. It involves surveying a set of possible sources of evidence and determining which sources warrant collection. Data in any organization's systems is rarely well organized or documented. Investigators will need to take a set of investigation requirements and determine which data need to be collected. This requires a few first steps: Properly reviewing system and data documentation. Interviewing staff. Locating backup and non-centralized data repositories. Previewing data. The process of identifying Big Data evidence is made difficult by the large volume of data, distributed filesystems, the numerous types of data, and the potential for large-scale redundancy in evidence. Big Data solutions are also unique in that evidence can reside in different layers. Within Hadoop, evidence can take on multiple forms—such as file stored in the Hadoop Distributed File System (HDFS) to data extracted from application. To properly identify the evidence in Hadoop, multiple layers are examined. While all the data may reside in HDFS, the form may differ in a Hadoop application (for example, HBase), or the data may be more easily extracted to a viable format through HDFS using an application (such as Pig or Sqoop). Identifying Big Data evidence can also be complicated by redundancies caused by: Systems that input to or receive output from Big Data systems Archived systems that may have previously stored the evidence in the Big Data system A primary goal of identifying evidence is to capture all relevant evidence while minimizing redundant information. Outsiders looking at a company's data needs may assume that identifying information is as simple as asking several individuals where the data resides. 
In reality, the process is much more complicated for a number of possible reasons: The organization may be an adverse party and cannot be trusted to provide reliable information about the data The organization is large and no single person knows where all data is stored and what the contents of the data are The organization is divided into business units with no two business units knowing what data the other one stores Data is stored with a third-party data hosting provider IT staff may know where data and systems reside, but only the business users know the type of content the data stores For example, one might assume a pharmaceutical sales company would have an internal system structured with the following attributes: A division where the data is collected from a sales database An HR department database containing employee compensation, performance, and retention information A database of customer demographic information An accounting department database to assess what costs are associated with each sale In such a system, that data is then cleanly unified and compelling analyses are created to drive sales. In reality, an investigator will probably find that the Big Data sales system is actually comprised of a larger set of data that originates inside and outside the organization. There may be a collection of spreadsheets on sales employees' desktops and laptops, along with some of the older versions on backup tapes and file server shared folders. There may be a new Salesforce database implemented two years ago that is incomplete and is actually the replacement for a previous database, which was custom-developed and used by 75% of employees. A Hadoop instance running HBase for analysis receives a filtered set of data from social media feeds, the Salesforce database, and sales reports. All of these data sources may be managed by different teams, so identifying how to collect this information requires a series of steps to isolate the relevant information. The problem for large—or even midsize—companies is much more difficult than our pharmaceutical sales company example. Simply creating a map of every data source and the contents of those systems could require weeks of in-depth interviews with key business owners and staff. Several departments may have their own databases and Big Data solutions that may or may not be housed in a centralized repository. Backups for these systems could be located anywhere. Data retention policies will vary by department—and most likely by system. Data warehouses and other aggregators may contain important information that will not show themselves through normal interviews with staff. These data warehouses and aggregators may have previously generated reports that could serve as valuable reference points for future analysis; however, all data may not be available online, and some data may be inaccessible. In such cases, the company's data will most likely reside in off-site servers maintained by an outsourcing vendor, or worse, in a cloud-based solution. Big Data evidence can be intertwined with non-Big Data evidence. Email, document files, and other evidence can be extremely valuable for performing an investigation. The process for identifying Big Data evidence is very similar to the process for identifying other evidence, so the identification process described in this book can be carried out in conjunction with identifying other evidence. 
An important consideration for investigators to keep in mind is whether Big Data evidence should be collected (that is, determining whether it is relevant or if the same evidence can be collected more easily from other non-Big Data systems). Investigators must also consider whether evidence needs to be collected to meet the requirements of an investigation. The following figure illustrates the process for identifying Big Data evidence: Initial Steps The process for identifying evidence is: Examining requirements Examining the organization's system architecture Determining the kinds of data in each system Assessing which systems to collect In the book, Big Data Forensics, the topics of examining requirements and examining the organization's system architecture are covered in detail. The purpose of these two steps is to take the requirements of the investigation and match those to known data sources. From this, the investigator can begin to document which data sources should be examined and what types of data may be relevant. Assessing data viability Assessing the viability of data serves several purposes. It can: Allow the investigator to identify which data sources are potentially relevant Yield information that can corroborate the interview and documentation review information Highlight data limitations or gaps Provide the investigator with information to create a better data collection plan Up until this point in the investigation, the investigator has only gathered information about the data. Previewing and assessing samples of the data gives the investigator the chance to actually see what information is contained in the data and determine which data sources can meet the requirements of the investigation. Assessing the viability and relevance of data in a Big Data forensic investigation is different from that of a traditional digital forensic investigation. In a traditional digital forensic investigation, the data is typically not previewed out of fear of altering the data or metadata. With Big Data, however, the data can be previewed in some situations where metadata is not relevant or available. This factor opens up the opportunity for a forensic investigator to preview data when identifying which data should be collected. The main considerations for each source of data include the following: Data quality Data completeness Supporting documentation Validating the collected data Previous systems where the data resided How the data enter and leave the system The available formats for extraction How well the data meet the data requirements There are several methods for previewing data. The first is to review a data extract or the results of a query—or collect sample text files that are stored in Hadoop. This method allows the investigator to determine the types of information available and how the information is represented in the data. In highly complex systems consisting of thousands of data sources, this may not be feasible or requires a significant investment of time and effort. The second method is to review reports or canned query output that were derived from the data. Some Big Data solutions are designed with reporting applications connected to the Big Data system. These reports are a powerful tool, enabling an investigator to quickly gain an understanding of the contents of the system without requiring much up-front effort to gain access to the systems. Data retention policies and data purge schedules should be reviewed and considered in this step as well. 
Given the large volume of data involved, many organizations routinely purge data after a certain period of time. Data purging can mean the archival of data to near-line or offline storage, or it can mean the destruction of old data without backup. When data is archived, the investigator should also determine whether any of the data in near-line or offline backup media needs to be collected or if the live system data is sufficient. Regardless, the investigator should determine what the next purge cycle is and whether that necessitates an expedited collection to prevent the loss of critical information. Additionally, the investigator should determine whether the organization should implement a litigation hold, which halts data purging during the investigation. When data is purged without backup, the investigator must determine:

- How the purge affects the investigation
- When the data needs to be collected
- Whether supplemental data sources must be collected to account for the lost data (for example, reports previously created from the purged data or other systems that created or received the purged data)

Identifying HDFS evidence

HDFS evidence can be identified in a number of ways. In some cases, the investigator does not want to preview the data, to retain the integrity of the metadata. In other cases, the investigator is only interested in collecting a subset of the data. Limiting the data can be necessary when the data volume prohibits a complete collection or forensically imaging the entire cluster is impossible. The primary methods for identifying HDFS evidence are to:

- Generate directory listings from the cluster
- Review the total data volume on each of the nodes

Generating directory listings from the cluster is a straightforward process of accessing the cluster from a client and running the Hadoop directory listing command. The cluster is accessed from a client by either directly logging in through a cluster node or by logging in from a remote machine. The command to print a directory listing is as follows:

# hdfs dfs -lsr /

This generates a recursive directory listing of all HDFS files starting from the root directory. The command produces the filenames, directories, permissions, and file sizes of all files. The output can be piped to an output file, which should be saved to an external storage device for review.

Identifying Hive evidence

Hive evidence can be identified through HiveQL commands. The following table lists the commands that can be used to get a full listing of all databases and tables as well as the tables' formats:

Command                                 Description
SHOW DATABASES;                         This lists all available databases
SHOW TABLES;                            This lists all tables in the current database
USE databaseName;                       This makes databaseName the current database
DESCRIBE (FORMATTED|EXTENDED) table;    This lists the formatting details of the table

Identifying all tables and their formats requires iterating through every database and generating a list of tables and each table's formats. This process can be performed either manually or through an automated HiveQL script file; a scripted sketch of this step is shown at the end of this subsection. These commands do not provide information about database and table metadata—such as the number of records and the last modified date—but they do give a full listing of all available, online Hive data. HiveQL can also be used to preview the data using subset queries.
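As noted above, the enumeration of databases and tables can be scripted. The following is a minimal sketch, assuming the hive command-line client is installed and on the PATH; the helper names are ours, not the book's, and the output parsing is deliberately naive.

import subprocess

def hive_query(sql):
    """Run a HiveQL statement through the hive CLI and return its stdout lines."""
    # -S runs in silent mode, -e executes the given statement(s).
    result = subprocess.run(["hive", "-S", "-e", sql],
                            capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]

def enumerate_hive_tables():
    """Yield (database, table, DESCRIBE FORMATTED output) for every table."""
    for db in hive_query("SHOW DATABASES;"):
        for table in hive_query("USE {0}; SHOW TABLES;".format(db)):
            detail = hive_query("DESCRIBE FORMATTED {0}.{1};".format(db, table))
            yield db, table, detail

Row-level previews, such as the LIMIT query shown next, can be run through the same helper.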
The following example shows how to identify the top ten rows in a Hive table:

SELECT * FROM Table1 LIMIT 10

Identifying HBase evidence

HBase evidence is stored in tables, and identifying the names of the tables and the properties of each is important for data collection. HBase stores metadata information in the -ROOT- and .META. tables. These tables can be queried using HBase shell commands to identify the information about all tables in the HBase cluster. Information about the HBase cluster can be gathered using the status command from the HBase shell:

status

This produces the following output:

2 servers, 0 dead, 1.5000 average load

For additional information about the names and locations of the servers—as well as the total disk sizes for the memstores and HFiles—the status command can be given the detailed parameter. The list command outputs every HBase table. The one table created in HBase, testTable, is shown via the following command:

list

This produces the following output:

TABLE
testTable
1 row(s) in 0.0370 seconds

=> ["testTable"]

Information about each table can be generated using the describe command:

describe 'testTable'

The following output is generated:

'testTable', {NAME => 'account', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'address', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.0300 seconds

The describe command yields several useful pieces of information about each table. Each of the column families is listed, and for each family, the encoding, the number of columns (represented as versions), and whether deleted cells are retained are also listed. Security information about each table can be gathered using the user_permission command as follows:

user_permission 'testTable'

This command is useful for identifying the users who currently have access to the table. As mentioned before, user accounts are not as meaningful in Hadoop because of the distributed nature of Hadoop configurations, but in some cases, knowing who had access to tables can be tied back to system logs to identify the individuals who accessed the system and data.

Summary

Hadoop evidence comes in many forms. The methods for identifying the evidence require the forensic investigator to understand the Hadoop architecture and the options for identifying the evidence within HDFS and Hadoop applications. In Big Data Forensics, these topics are covered in more depth—from the internals of Hadoop to conducting a collection of a distributed server.

Resources for Article:

Further resources on this subject:
- Introduction to Hadoop [article]
- Hadoop Monitoring and its aspects [article]
- Understanding Hadoop Backup and Recovery Needs [article]

Scientific Computing APIs for Python

Packt
20 Aug 2015
23 min read
In this article by Hemant Kumar Mehta, author of the book Mastering Python Scientific Computing, we will have a comprehensive discussion of the features and capabilities of various scientific computing APIs and toolkits in Python. Besides the basics, we will also discuss some example programs for each of the APIs. As symbolic computing is a relatively different area of computerized mathematics, we have kept a special subsection within the SymPy section to discuss the basics of a computerized algebra system. In this article, we will cover the following topics:

- Scientific numerical computing using NumPy and SciPy
- Symbolic computing using SymPy

(For more resources related to this topic, see here.)

Numerical scientific computing in Python

Scientific computing mainly demands facilities for performing calculations on algebraic equations, matrices, differentiation, integration, differential equations, statistics, equation solvers, and much more. By default, Python doesn't come with these capabilities. However, the development of NumPy and SciPy has enabled us to perform these operations and far more advanced functionality beyond them. NumPy and SciPy are very powerful Python packages that enable users to efficiently perform the desired operations for all types of scientific applications.

NumPy package

NumPy is the basic Python package for scientific computing. It provides multi-dimensional arrays and basic mathematical operations such as linear algebra. Python provides several data structures to store user data; the most popular ones are lists and dictionaries. List objects may store any type of Python object as an element, and these elements can be processed using loops or iterators. Dictionary objects store data in a key-value format.

The ndarrays data structure

The ndarrays are similar to lists but highly flexible and efficient. An ndarray is an array object that represents a multidimensional array of fixed-size items, and the array must be homogeneous. It has an associated dtype object that defines the data type of the elements in the array. This object defines the type of the data (integer, float, or Python object), the size of the data in bytes, and the byte ordering (big-endian or little-endian). Moreover, if the type of data is a record or sub-array, it also contains details about them. The actual array can be constructed using any one of the array, zeros, or empty methods.

Another important aspect of ndarrays is that their size can be modified dynamically. Moreover, if the user needs to remove some elements from an array, this can be done using the module for masked arrays. In a number of situations, scientific computing demands the deletion or removal of incorrect or erroneous data. The numpy.ma module provides the masked array facility to easily exclude selected elements from arrays. A masked array is nothing but a normal ndarray with a mask, where the mask is an associated array of true and false values. If the mask has a true value at a particular position, the corresponding element in the main array is masked (invalid); if the mask value is false, the element is valid. While performing any computation on such ndarrays, the masked elements are not considered.

File handling

Another important aspect of scientific computing is storing data in files, and NumPy supports reading and writing of both text and binary files.
Mostly, text files are a good way of reading, writing, and exchanging data, as they are inherently portable and most platforms have built-in capabilities to manipulate them. However, for some applications it is better to use binary files, or the desired data for such applications can only be stored in binary files. Sometimes the size and nature of the data, such as images or sound, require it to be stored in binary files. In comparison to text files, binary files are harder to manage as they have specific formats. However, binary files are comparatively very small, and read/write operations on them are much faster than on text files. This fast read/write is most suitable for applications working on large datasets. The only drawback of binary files manipulated with NumPy is that they are accessible only through NumPy.

Python has text file manipulation functions such as open, readlines, and writelines. However, it is not performance-efficient to use these functions for scientific data manipulation; these default Python functions are very slow at reading and writing data in files. NumPy offers a high-performance alternative that loads the data into ndarrays before the actual computation. In NumPy, text files can be accessed using the numpy.loadtxt and numpy.savetxt functions. The loadtxt function can be used to load data from text files into ndarrays. NumPy also has separate functions to manipulate data in binary files; the functions for reading and writing are numpy.load and numpy.save, respectively.

Sample NumPy programs

A NumPy array can be created from a list or a tuple using the array method, which can also transform sequences of sequences into a two-dimensional array:

import numpy as np
x = np.array([4,432,21], int)
print x                            #Output [  4 432  21]
x2d = np.array( ((100,200,300), (111,222,333), (123,456,789)) )
print x2d

Output:

[  4 432  21]
[[100 200 300]
 [111 222 333]
 [123 456 789]]

Basic matrix arithmetic operations can be easily performed on two-dimensional arrays, as in the following program. These operations are applied element-wise, hence the operand arrays must be of equal size; if the sizes do not match, performing these operations will cause a runtime error. Consider the following example of arithmetic operations on one-dimensional arrays:

import numpy as np
x = np.array([4,5,6])
y = np.array([1,2,3])
print x + y                      # output [5 7 9]
print x * y                      # output [ 4 10 18]
print x - y                      # output [3 3 3]
print x / y                      # output [4 2 2]
print x % y                      # output [0 1 0]

There is a separate subclass named matrix to perform matrix operations. Let us understand matrix operations with the following example, which demonstrates the difference between array-based multiplication and matrix multiplication. NumPy matrices are two-dimensional, whereas arrays can be of any dimension.
import numpy as np
x1 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
x2 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
print "First 2-D Array: x1"
print x1
print "Second 2-D Array: x2"
print x2
print "Array Multiplication"
print x1*x2

mx1 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
mx2 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
print "Matrix Multiplication"
print mx1*mx2

Output:

First 2-D Array: x1
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Second 2-D Array: x2
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Array Multiplication
[[1 4 9]
 [1 4 9]
 [1 4 9]]
Matrix Multiplication
[[ 6 12 18]
 [ 6 12 18]
 [ 6 12 18]]

The following is a simple program to demonstrate simple statistical functions available in NumPy:

import numpy as np
x = np.random.randn(10)   # Creates an array of 10 random elements
print x
mean = x.mean()
print mean
std = x.std()
print std
var = x.var()
print var

First sample output:

[ 0.08291261  0.89369115  0.641396   -0.97868652  0.46692439 -0.13954144
 -0.29892453  0.96177167  0.09975071  0.35832954]
0.208762357623
0.559388806817
0.312915837192

Second sample output:

[ 1.28239629  0.07953693 -0.88112438 -2.37757502  1.31752476  1.50047537
  0.19905071 -0.48867481  0.26767073  2.660184  ]
0.355946458357
1.35007701045
1.82270793415

The above programs are some simple examples of NumPy usage.

SciPy package

SciPy extends Python and NumPy by providing advanced mathematical functions for differentiation, integration, differential equations, optimization, interpolation, advanced statistics, equation solvers, and more. SciPy is written on top of the NumPy array framework. It uses the arrays and the basic array operations provided by NumPy and extends them to cover most of the mathematical areas regularly required by scientists and engineers for their applications. In this article, we will cover examples of some basic functionality.

Optimization package

The optimization package in SciPy provides facilities to solve univariate and multivariate minimization problems, using a number of algorithms and methods. Minimization problems have a wide range of applications in scientific and commercial domains: linear regression, searching for a function's minimum and maximum values, finding the root of a function, and linear programming are typical examples. All of these are supported by the optimization package.

Interpolation package

A number of interpolation methods and algorithms are provided in this package as built-in functions. It provides facilities to perform univariate and multivariate interpolation as well as one-dimensional and two-dimensional splines. We use univariate interpolation when the data depends on one variable, and multivariate interpolation when it depends on more than one variable. Besides these functionalities, it also provides Lagrange and Taylor polynomial interpolators.

Integration and differential equations in SciPy

Integration is an important mathematical tool for scientific computations. The SciPy integration sub-package provides functionality to perform numerical integration on both equations and data, and it also has an ordinary differential equation integrator. It provides various functions to perform numerical integration using a number of methods from numerical analysis.

Stats module

The SciPy stats module contains functions for most of the probability distributions and a wide range of statistical functions.
The supported probability distributions include various continuous, multivariate, and discrete distributions. The statistical functions range from simple means to complex statistical concepts, including skewness, kurtosis, and the chi-square test, to name a few.

Clustering package and spatial algorithms in SciPy

Cluster analysis is a popular data mining technique with a wide range of scientific and commercial applications. In the science domain, biology, particle physics, astronomy, life science, and bioinformatics are a few fields that widely use cluster analysis for problem solving. Cluster analysis is also used extensively in computer science for fraud detection, security analysis, image processing, and so on. The clustering package provides functionality for k-means clustering, vector quantization, and hierarchical and agglomerative clustering. The spatial class has functions to analyze distances between data points using triangulations, Voronoi diagrams, and convex hulls of a set of points. It also has KDTree implementations for performing nearest-neighbor lookups.

Image processing in SciPy

SciPy provides support for various image processing operations, including basic reading and writing of image files, displaying images, and simple image manipulation operations such as cropping, flipping, and rotating. It also supports image filtering functions such as mathematical morphology, smoothing, denoising, and sharpening of images, as well as operations such as image segmentation by labeling pixels corresponding to different objects, classification, and feature extraction, for example edge detection.

Sample SciPy programs

In the subsequent subsections, we will discuss some example programs using SciPy modules and packages. We start with a simple program performing standard statistical computations. After this, we will discuss a program that finds a minimal solution using optimization. At last, we will discuss image processing programs.

Statistics using SciPy

The stats module of SciPy has functions to perform simple statistical operations and work with various probability distributions. The following program demonstrates simple statistical calculations using the SciPy stats.describe function. This single function operates on an array and returns the number of elements, minimum value, maximum value, mean, variance, skewness, and kurtosis:

import scipy as sp
import scipy.stats as st
s = sp.randn(10)
n, min_max, mean, var, skew, kurt = st.describe(s)
print("Number of elements: {0:d}".format(n))
print("Minimum: {0:3.5f} Maximum: {1:2.5f}".format(min_max[0], min_max[1]))
print("Mean: {0:3.5f}".format(mean))
print("Variance: {0:3.5f}".format(var))
print("Skewness : {0:3.5f}".format(skew))
print("Kurtosis: {0:3.5f}".format(kurt))

Output:

Number of elements: 10
Minimum: -2.00080 Maximum: 0.91390
Mean: -0.55638
Variance: 0.93120
Skewness : 0.16958
Kurtosis: -1.15542

Optimization in SciPy

Generally, in mathematical optimization, a non-convex function called the Rosenbrock function is used to test the performance of an optimization algorithm. The following program demonstrates a minimization problem on this function. The Rosenbrock function of N variables is given by the following equation, and it has a minimum value of 0 at xi = 1.
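In symbols, and matching the sum computed by the rosenbrock function in the program below, the function is

$$f(\mathbf{x}) = \sum_{i=1}^{N-1}\left[\,100\,\bigl(x_{i+1}-x_i^{2}\bigr)^{2} + \bigl(1-x_i\bigr)^{2}\right].$$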
Optimization in SciPy

In mathematical optimization, a non-convex function called the Rosenbrock function is commonly used to test the performance of optimization algorithms. The following program demonstrates a minimization problem on this function. The Rosenbrock function of N variables is given by

    f(x) = sum over i = 1..N-1 of [ 100*(x[i+1] - x[i]**2)**2 + (1 - x[i])**2 ]

and it has a minimum value of 0 at xi = 1.

The program for the above function is:

import numpy as np
from scipy.optimize import minimize

# Definition of Rosenbrock function
def rosenbrock(x):
     return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

x0 = np.array([1, 0.7, 0.8, 2.9, 1.1])
res = minimize(rosenbrock, x0, method = 'nelder-mead', options = {'xtol': 1e-8, 'disp': True})

print(res.x)

Output is:
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 516
         Function evaluations: 827
[ 1.  1.  1.  1.  1.]

The last line is the output of print(res.x), in which all the elements of the array are 1.

Image processing using SciPy

The following two programs demonstrate the image processing functionality of SciPy. The first simply displays the standard test image widely used in the field of image processing, called Lena. The second applies geometric transformations to this image: it performs cropping and a rotation by 45 degrees.

The following program displays the Lena image using the matplotlib API. The imshow method renders the ndarray as an image and the show method displays it.

from scipy import misc
l = misc.lena()
misc.imsave('lena.png', l)
import matplotlib.pyplot as plt
plt.gray()
plt.imshow(l)
plt.show()

Output:
The output of the above program is the following screenshot:

The following program performs geometric transformations and displays the transformed images along with the original image in a two-by-two grid of axes.

import scipy
from scipy import ndimage
import matplotlib.pyplot as plt
import numpy as np

lena = scipy.misc.lena()
lx, ly = lena.shape
crop_lena = lena[lx/4:-lx/4, ly/4:-ly/4]
crop_eyes_lena = lena[lx/2:-lx/2.2, ly/2.1:-ly/3.2]
rotate_lena = ndimage.rotate(lena, 45)

# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].imshow(lena, cmap=plt.cm.gray)
axarr[0, 0].axis('off')
axarr[0, 0].set_title('Original Lena Image')
axarr[0, 1].imshow(crop_lena, cmap=plt.cm.gray)
axarr[0, 1].axis('off')
axarr[0, 1].set_title('Cropped Lena')
axarr[1, 0].imshow(crop_eyes_lena, cmap=plt.cm.gray)
axarr[1, 0].axis('off')
axarr[1, 0].set_title('Lena Cropped Eyes')
axarr[1, 1].imshow(rotate_lena, cmap=plt.cm.gray)
axarr[1, 1].axis('off')
axarr[1, 1].set_title('45 Degree Rotated Lena')

plt.show()

Output:

SciPy and NumPy are the core of Python's support for scientific computing, as they provide solid numerical computing functionality.

Symbolic computations using SymPy

Computations performed by a computer on mathematical symbols, without evaluating them or changing their meaning, are called symbolic computations. Symbolic computing is also known as computerized algebra, and such systems are called computer algebra systems. The following subsections give a brief introduction to SymPy.

Computer Algebra System (CAS)

Let us discuss the concept of a CAS. A CAS is a piece of software or a toolkit that performs computations on mathematical expressions using a computer instead of doing them manually. In the beginning, using computers for these applications was called computer algebra; now the concept is called symbolic computing. CAS systems can be grouped into two types: general purpose CAS, and CAS designed for a specific problem domain. General purpose systems are applicable to most areas of algebraic mathematics, while specialized CAS are designed for specific areas such as group theory or number theory. Most of the time, we prefer a general purpose CAS to manipulate mathematical expressions for scientific applications.
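The following tiny sketch (an illustrative example, not part of the original text) contrasts numerical and symbolic evaluation of the same quantity; the numerical result is an approximation, whereas the symbolic result is kept in exact form.

import math
import sympy

print math.sqrt(8)     # 2.82842712475, a floating point approximation
print sympy.sqrt(8)    # 2*sqrt(2), an exact symbolic result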
Features of a general purpose CAS

The desired features of a general purpose computer algebra system for scientific applications include the following:

A user interface to manipulate mathematical expressions.
An interface for programming and debugging.
A simplifier: such systems must simplify various mathematical expressions, so a simplifier is an essential component of any computerized algebra system.
An exhaustive set of functions to perform the various mathematical operations required by algebraic computations.
Efficient memory management, since most applications perform extensive computations.
Support for mathematical computations on high precision numbers and large quantities.

A brief idea of SymPy

SymPy is an open source, Python-based implementation of a computer algebra system (CAS). The philosophy behind SymPy's development is to design and build a CAS that has all the desired features, yet whose code is as simple as possible, so that it is highly and easily extensible. It is written completely in Python and does not require any external library.

The basic idea behind using SymPy is the creation and manipulation of expressions. Using SymPy, the user represents mathematical expressions in the Python language using SymPy classes and objects. These expressions are composed of numbers, symbols, operators, functions, and so on. The functions are modules that perform a piece of mathematical functionality, such as logarithms or trigonometry.

The development of SymPy was started by Ondřej Čertík in August 2006. Since then, it has grown considerably with the contributions of more than a hundred contributors. The library now consists of 26 different integrated modules. These modules can perform the computations required for basic symbolic arithmetic, calculus, algebra, discrete mathematics, quantum physics, and plotting and printing, with the option to export the output of the computations to LaTeX and other formats.

The capabilities of SymPy can be divided into two categories, core capabilities and advanced capabilities, as the SymPy library is divided into a core module and several advanced optional modules. The functionality supported by the various modules is as follows:

Core capabilities

The core module supports the basic functionality required by mathematical algebra operations: basic arithmetic such as multiplication, addition, subtraction, division, and exponentiation. It also supports simplification of complex expressions and provides series and symbol expansion. The core module supports functions related to trigonometry, hyperbolic functions, exponentials, roots of equations, polynomials, factorials and gamma functions, logarithms, and a number of special functions for B-splines, spherical harmonics, tensor functions, orthogonal polynomials, and more. Strong support for pattern matching operations is also part of the core module. The core capabilities of SymPy also include substitutions, as required by algebraic operations. It supports not only high precision arithmetic over integers, rationals, and floating point numbers, but also the non-commutative variables and symbols required in polynomial operations.
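The following short sketch (an assumed example rather than a listing from the original article) touches a few of these core capabilities: trigonometric simplification, substitution, and series expansion.

import sympy
from sympy import sin, cos, exp, simplify, series

x = sympy.Symbol('x')

# Trigonometric simplification
print simplify(sin(x) ** 2 + cos(x) ** 2)    # 1

# Substitution of a symbol by a value
e = x ** 2 + 2 * x + 1
print e.subs(x, 3)                           # 16

# Series expansion of exp(x) around 0, up to order 4
print series(exp(x), x, 0, 4)                # 1 + x + x**2/2 + x**3/6 + O(x**4)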
Polynomials

Functions that perform polynomial operations belong to the polynomial module. These include basic polynomial operations such as division, greatest common divisor (GCD), least common multiple (LCM), square-free factorization, and representation of polynomials with symbolic coefficients, as well as some special operations such as computing the resultant, deriving trigonometric identities, partial fraction decomposition, and facilities for Gröbner bases over polynomial rings and fields.

Calculus

This module provides the functionality needed for basic and advanced calculus. It supports limits through a limit function, as well as differentiation, integration, series expansion, differential equations, and the calculus of finite differences. SymPy also has special support for definite integrals and integral transforms. For differentiation, it supports numerical differentiation, composition of derivatives, and fractional derivatives.

Solving equations

Solver is the SymPy module that provides equation-solving functionality. This module supports solving complex polynomials, finding the roots of polynomials, and solving systems of polynomial equations. There is a function to solve algebraic equations. It provides support for solving differential equations, including ordinary differential equations, some forms of partial differential equations, and initial and boundary value problems, and it also supports solving difference equations. In mathematics, difference equations are also called recurrence relations, that is, equations that recursively define a sequence or a multidimensional array of values.

Discrete math

Discrete mathematics covers mathematical structures that are discrete in nature, as opposed to continuous mathematics such as calculus. It deals with integers, graphs, statements from logic theory, and so on. This module has full support for binomial coefficients, products, and summations. It also supports various functions from number theory, including residue theory, Euler's totient, partitions, and a number of functions dealing with prime numbers and their factorizations. SymPy also supports the creation and manipulation of logic expressions using symbolic and Boolean values.

Matrices

SymPy has strong support for operations related to matrices and determinants, which belong to the linear algebra category of mathematics. It supports the creation of matrices, basic matrix operations such as multiplication and addition, matrices of zeros and ones, creation of random matrices, and operations on matrix elements. It also supports special functions such as computing the Hessian matrix for a function, the Gram-Schmidt process on a set of vectors, and the computation of the Wronskian for a matrix of functions. It has full support for eigenvalues and eigenvectors, matrix inversion, and the solution of matrices and determinants. For computing determinants, it supports the Bareiss fraction-free algorithm and the Berkowitz algorithm, among other methods. For matrices, it also supports nullspace calculation, cofactor expansion tools, derivative calculation for matrix elements, and calculation of the dual of a matrix.

Geometry

SymPy also has a module that supports operations associated with two-dimensional (2-D) geometry. It supports the creation of 2-D entities or objects such as points, lines, circles, ellipses, polygons, triangles, rays, and segments. It also allows us to query these entities, for example, the area of suitable objects such as an ellipse, circle, or triangle, or the intersection points of lines. Other supported queries include tangency determination and finding the similarity and intersection of entities.
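A minimal sketch of these geometric entities and queries (an illustrative example, not from the original article) is shown below.

from sympy import Point, Circle, Line, Triangle

# A circle of radius 5 centred at the origin
c = Circle(Point(0, 0), 5)
print c.area                    # 25*pi

# Intersection of the x-axis with the circle
l = Line(Point(-10, 0), Point(10, 0))
print c.intersection(l)         # the points (-5, 0) and (5, 0)

# Area of a right triangle
t = Triangle(Point(0, 0), Point(4, 0), Point(0, 3))
print abs(t.area)               # 6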
Plotting

There is a very good module for drawing two-dimensional and three-dimensional plots. At present, the plots are rendered using the matplotlib package, and other backends such as TextBackend, Pyglet, and textplot are also supported. It has a good interactive interface, facilities for customization, and support for plotting various geometric entities. The plotting module has the following functions:

Plotting of 2-D line plots.
Plotting of 2-D parametric plots.
Plotting of 2-D implicit and region plots.
Plotting of 3-D plots of functions involving two variables.
Plotting of 3-D line and surface plots.

Physics

There is a module for solving problems from the physics domain. It supports functionality for mechanics, including classical and quantum mechanics, and high energy physics. It has functions to support Pauli algebra and quantum harmonic oscillators in 1-D and 3-D, and it also has functionality for optics. A separate module integrates unit systems into SymPy, allowing users to select a specific unit system for performing their computations and to convert between units. The unit systems are composed of units and constants for computations.

Statistics

The statistics module was introduced in SymPy to support the various statistical concepts required in mathematical computations. Apart from supporting various continuous and discrete statistical distributions, it also supports functionality related to symbolic probability. Generally, these distributions support functions for random number generation in SymPy.

Printing

SymPy has a module that provides full support for pretty-printing, that is, rendering expressions and similar content, such as source code, text files, and markup files, in a nicely formatted, readable way. This module produces the desired output using ASCII and/or Unicode characters. It supports various printers, such as the LaTeX and MathML printers. It is also capable of producing source code in various programming languages, such as C, Python, or FORTRAN, and of producing content in markup languages such as HTML and XML.

SymPy modules

The following list gives the formal names of the modules discussed in the preceding paragraphs:

Assumptions: assumption engine
Concrete: symbolic products and summations
Core: basic class structure: Basic, Add, Mul, Pow, and so on
functions: elementary and special functions
galgebra: geometric algebra
geometry: geometric entities
integrals: symbolic integrator
interactive: interactive sessions (e.g. IPython)
logic: boolean algebra, theorem proving
matrices: linear algebra, matrices
mpmath: fast arbitrary precision numerical math
ntheory: number theoretical functions
parsing: Mathematica and Maxima parsers
physics: physical units, quantum stuff
plotting: 2D and 3D plots using Pyglet
polys: polynomial algebra, factorization
printing: pretty-printing, code generation
series: symbolic limits and truncated series
simplify: rewrite expressions in other forms
solvers: algebraic, recurrence, differential
statistics: standard probability distributions
utilities: test framework, compatibility stuff

There are numerous symbolic computing systems available in various mathematical toolkits.
There is some proprietary software, such as Maple and Mathematica, and there are also open source alternatives such as Singular and AXIOM. However, these products have their own scripting languages, it is difficult to extend their functionality, and they have slow development cycles. SymPy, on the other hand, is highly extensible, is designed and developed in Python, and is an open source API that supports a speedy development life cycle.

Simple exemplary programs

These are some very simple examples that give an idea of the capabilities of SymPy. Each is less than ten lines of SymPy source code, and together they cover topics ranging from basic symbol manipulation to limits, differentiation, and integration. We can test the execution of these programs on SymPy Live, an online SymPy instance running on Google App Engine, available at http://live.sympy.org/.

Basic symbol manipulation

The following code defines three symbols and an expression on these symbols, and finally prints the expression.

import sympy
a = sympy.Symbol('a')
b = sympy.Symbol('b')
c = sympy.Symbol('c')
e = ( a * b * b + 2 * b * a * b) + (a * a + c * c)
print e

Output:
a**2 + 3*a*b**2 + c**2     (here ** represents the power operation).

Expression expansion in SymPy

The following program demonstrates the concept of expression expansion. It defines two symbols and a simple expression on these symbols, and finally prints the expression and its expanded form.

import sympy
a = sympy.Symbol('a')
b = sympy.Symbol('b')
e = (a + b) ** 4
print e
print e.expand()

Output:
(a + b)**4
a**4 + 4*a**3*b + 6*a**2*b**2 + 4*a*b**3 + b**4

Simplification of an expression or formula

SymPy has the facility to simplify mathematical expressions. The following program simplifies two expressions and displays the results. Note that exp and simplify must be imported from sympy for the code to run, and the results are printed explicitly.

import sympy
from sympy import exp, simplify
x = sympy.Symbol('x')
a = 1/x + (x*exp(x) - 1)/x
print simplify(a)
print simplify((x ** 3 +  x ** 2 - x - 1)/(x ** 2 + 2 * x + 1))

Output:
exp(x)
x - 1

Simple integrations

The following program calculates the integrals of two simple functions.

import sympy
from sympy import integrate
x = sympy.Symbol('x')
print integrate(x ** 3 + 2 * x ** 2 + x, x)
print integrate(x / (x ** 2 + 2 * x), x)

Output:
x**4/4+2*x**3/3+x**2/2
log(x + 2)

Summary

In this article, we have discussed the concepts, features, and selected sample programs of various scientific computing APIs and toolkits. The article started with a discussion of NumPy and SciPy. After covering NumPy, we discussed concepts associated with symbolic computing and SymPy. In the remaining part of the article, we discussed interactive computing and data analysis and visualization, along with their APIs and toolkits: IPython is the Python toolkit for interactive computing, Pandas is the data analysis package, and Matplotlib is the data visualization API.

Resources for Article:

Further resources on this subject:

Optimization in Python [article]
How to do Machine Learning with Python [article]
Bayesian Network Fundamentals [article]
Dimensionality Reduction with Principal Component Analysis

Packt
18 Aug 2015
12 min read
In this article by Eric Mayor, author of the book Learning Predictive Analytics with R, we will be discussing how to use PCA in R. Nowadays, the access to data is easier and cheaper than ever before. This leads to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not trivial, as the quantity often makes the analysis difficult or unpractical. For instance, the data is often more abundant than available memory on the machines. The available computational power is also often not enough to analyze the data in a reasonable time frame. One solution is to have recourse to technologies that deal with high dimensionality in data (big data). These solutions typically use the memory and computing power of several machines for analysis (computer clusters). However, most organizations do not have such infrastructure. Therefore, a more practical solution is to reduce the dimensionality of the data while keeping the essential of the information intact. (For more resources related to this topic, see here.) Another reason to reduce dimensionality is that, in some cases, there are more attributes than observations. If some scientists were to store the genome of all inhabitants of Europe and the United States, the number of cases (approximately 1 billion) would be much less than the 3 billion base pairs in the human DNA. Most analyses do not work well or not at all when observations are less than attributes. Confronted with this problem, the data analysts might select groups of attributes, which go together (like height and weight, for instance) according to their domain knowledge, and reduce the dimensionality of the dataset. The data is often structured across several relatively independent dimensions, with each dimension measured using several attributes. This is where principal component analysis (PCA) is an essential tool, as it permits each observation to receive a score on each of the dimensions (determined by PCA itself) while allowing to discard the attributes from which the dimensions are computed. What is meant here is that for each of the obtained dimensions, values (scores) will be produced that combines several attributes. These can be used for further analysis. In the next section, we will use PCA to combine several attributes in the questionnaire data. We will see that the participants' self-report on items such as lively, excited, enthusiastic (and many more) can be combined in a single dimension we call positive arousal. In this sense, PCA performs both dimensionality reduction (discard the attributes) and feature extraction (compute the dimensions). The features can then be used for classification and regression problems. Another use of PCA is to check that the underlying structure of the data corresponds to a theoretical model. For instance, in a questionnaire, a group of questions might assess construct A, and another group construct B, and so on. PCA will find two different factors if indeed there is more similarity in the answers of participants within each group of questions compared to the overall questionnaire. Researchers in fields such as psychology use PCA mainly to test their theoretical model in this fashion. This article is a shortened version of the chapter with the same name in my book Learning predictive analytics with R. In what follows, we will examine how to use PCA in R. We will notably discover how to interpret the results and perform diagnostics. 
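Before turning to the questionnaire data, a minimal sketch of the idea using base R's prcomp() on the built-in mtcars data (an illustrative example, not the data analyzed in this article) shows how observations receive scores on a reduced set of dimensions:

# Principal components of the numeric mtcars attributes (illustrative only)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each component
summary(pca)

# Scores of the first observations on the first two components
head(pca$x[, 1:2])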
Learning PCA in R In this section, we will learn more about how to use PCA in order to obtain knowledge from data, and ultimately reduce the number of attributes (also called features). The first dataset we will use is the msg dataset from the psych package. The msq (for motivational state questionnaire) data set is composed of 92 attributes, of which 72 is the rating of adjectives by 3896 participants describing their mood. We will only use these 72 attributes for the current purpose, which is the exploration of the structure of the questionnaire. We will, therefore, start by installing and loading the package and the data, and assign the data we are interested in (the 75 attributes mentioned) to an object called motiv. install.packages("psych") library(psych) data(msq) motiv = msq[,1:72] Dealing with missing values Missing values are a common problem in real-life datasets like the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing values are encountered. Let's see how many missing values (we will call them NAs) there are for each attribute in this dataset: apply(is.na(motiv),2,sum) View of missing data in the dataset We can see that for many attributes this is unproblematic (there are only few missing values). However, in the case of several attributes, the number of NAs is quite high (anxious: 1849 NAs, cheerful: 1850, idle: 1848, inactive: 1846, tranquil: 1843). The most probable explanation is that these items have been dropped from some of the samples in which the data has been collected. Removing all cases with NAs from the dataset would result in an empty dataset in this particular case. Try the following in your console to verify this claim: na.omit(motiv) For this reason, we will simply deal with the problem by removing these attributes from the analysis. Remember that there are other ways, such as data imputation, to solve this issue while keeping the attributes. In order to do so, we first need to know which column number corresponds to the attributes we want to suppress. An easy way to do this is simply by printing the names of each columns vertically (using cbind() for something it was not exactly made for). We will omit cases with missing values on other attributes from the analysis later. head(cbind(names(motiv)),5) Here, we only print the result for the first five columns of the data frame: [,1] [1,] active [2,] afraid [3,] alert [4,] angry [5,] anxious We invite the reader to verify on his screen whether we suppress the correct columns from the analysis using the following vector to match the attributes: ToSuppress = c(5, 15, 37, 38, 66) Let's check whether it is correct: names(motiv[ToSuppress]) The following is the output: [1] "anxious" "cheerful" "idle" "inactive" "tranquil" Naming the components using the loadings We will run the analysis using the principal() function from the psych package. This will allow examining the loadings in a sorted fashion and making them independent from the others using a rotation. We will extract five components. The psych package has already been loaded, as the data we are analyzing comes from a dataset, which is included in psych. Yet, we need to install another package, the GPArotation package, which is required for the analysis we will perform next: install.packages("GPArotation") library(GPArotation) We can now run the analysis. This time we want to apply an orthogonal rotation varimax, in order to obtain independent factorial scores. 
Precisions are provided in the paper Principal component analysis by Abdi and Williams (2010). We also want that the analysis uses imputed data for missing values and estimates the scores for each observation on each retained principal component. We use the print.psych() function to print the sorted loadings, which will make the interpretation of the principal components easier: Pca2 = principal(motiv[,-ToSuppress],nfactors = 5, rotate = "varimax", missing = T, scores = T) print.psych(Pca2, sort =T) The annotated output displayed in the following screenshot has been slightly altered in order to allow it to fit on one page. Results of the PCA with the principal() function using varimax rotation The name of the items are displayed in the first column of the first matrix of results. The column item displays their order. The component loadings for the five retained components are displayed next (RC1 to RC5). We will comment on h2 and u2 later. The loadings of the attributes on the loadings can be used to name the principal components. As can be seen in the preceding screenshot, the first component could be called Positive arousal, the second component could be called Negative arousal, the third could be called Serenity, the fourth Exhaustion, and the last component could be called Fear. It is worth noting that the MSQ has theoretically four components obtained from two dimensions: energy and tension. Therefore, the fifth component we found is not accounted for by the model. The h2 value indicates the proportion of variance of the variable that is accounted for by the selected components. The u2 value indicates the part of variance not explained by the components. The sum of h2 and u2 is 1. Below the first result matrix, a second matrix indicates the proportion of variance explained by the components on the whole dataset. For instance, the first component explains 21 percent of the variance in the dataset Proportion Var and 38 percent of the variance explained by the five components. Overall, the five components explain 56 percent of the variance in the dataset Cumulative Var. As a remainder, the purpose of PCA is to replace all the original attributes by the scores on these components in further analysis. This is what we will discuss next. PCA scores At this point, you might wonder where data reduction comes into play. Well, the computation of the scores for each factor allows reducing the number of attributes for a dataset. This has been done in the previous section. Accessing the PCA scores We will now examine these scores more thoroughly. Let's start by checking whether the scores are indeed uncorrelated: round(cor(Pca2$scores),3) This is the output: RC1 RC2 RC3 RC4 RC5 RC1 1.000 -0.001 0.000 0.000 0.001 RC2 -0.001 1.000 0.000 0.000 0.000 RC3 0.000 0.000 1.000 -0.001 0.001 RC4 0.000 0.000 -0.001 1.000 0.000 RC5 0.001 0.000 0.001 0.000 1.000 As can be seen in the previous output, the correlations between the values are equal to or lower than 0.001 for every pair of components. The components are basically uncorrelated, as we requested. Let's start by checking that it indeed contains the same number of rows: nrow(Pca2$scores) == nrow(msq)  Here's the output: [1] TRUE As the principal() function didn't remove any cases but imputed the data instead, we can now append the factor scores to the original dataset (a copy of it): bound = cbind(msq,Pca2$score) PCA diagnostics Here, we will briefly discuss two diagnostics that can be performed on the dataset that will be subjected to analysis. 
These diagnostics should be performed analyzing the data with PCA in order to ascertain that PCA is an optimal analysis for the considered data. The first diagnostic is Bartlett's test of sphericity. This test examines the relationship between the variables together, instead of 2 x 2 as in a correlation. To be more specific, it tests whether the correlation matrix is different from an identity matrix (which has ones on the diagonals and zeroes elsewhere). The null hypothesis is that the variables are independent (no underlying structure). The paper When is a correlation matrix appropriate for factor analysis? Some decision rules by Dzuiban and Shirkey (1974) provides more information on the topic. This test can be performed by the cortest.normal() function in the psych package. In this case, the first argument is the correlation matrix of the data we want to subject to PCA, and the n1 argument is the number of cases. We will use the same data set as in the PCA (we first assign the name M to it): M = na.omit(motiv[-ToSuppress]) cortest.normal(cor(M), n1 = nrow(M)) The output is provided as follows: Tests of correlation matrices Call:cortest.normal(R1 = cor(M), n1 = nrow(M)) Chi Square value 829268 with df = 2211 with probability < 0 The last line of the output shows that the data subjected to the analysis is clearly different from an identity matrix: The probability to obtain these results if it were an identity matrix is close to zero (do not pay attention to the > 0, it is simply extremely close to zero). You might be curious about what is the output of an identity matrix. It turns out we have something close to it: the correlations of the PCA scores with the varimax rotation that we examined before. Let's subject the scores to the analysis: cortest.normal(cor(Pca2$scores), n1 = nrow(Pca2$scores)) Tests of correlation matrices Call:cortest.normal(R1 = cor(Pca2$scores), n1 = nrow(Pca2$scores)) Chi Square value 0.01 with df = 10 with probability < 1 In this case, the results show that the correlation matrix is not significantly different from an identity matrix. The other diagnostic is the Kaiser Meyer Olkin (KMO) index, which indicates the part of the data that can be explained by elements present in the dataset. The higher this score the more the proportion of the data is explainable, for instance, by PCA. The KMO (also called MSA for Measure of Sample Adequacy) ranges from 0 (nothing is explainable) to 1 (everything is explainable). It can be returned for each item separately, or for the overall dataset. We will examine this value. It is the first component of the object returned by the KMO()function from psych package. The second component is the list of the values for the individual items (not examined here). It simply takes a matrix, a data frame, or a correlation matrix as an argument. Let's run in on our data: KMO(motiv)[1] This returns a value of 0.9715381, meaning that most of our data is explainable by the analysis. Summary In this article, we have discussed how the PCA algorithm works, how to select the appropriate number of components and how to use PCA scores for further analysis. Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] How to do Machine Learning with Python [article] Clustering and Other Unsupervised Learning Methods [article]

Oracle 12c SQL and PL/SQL New Features

Oli Huggins
16 Aug 2015
30 min read
In this article by Saurabh K. Gupta, author of the book Oracle Advanced PL/SQL Developer Professional Guide, Second Edition you will learn new features in Oracle 12c SQL and PL/SQL. (For more resources related to this topic, see here.) Oracle 12c SQL and PL/SQL new features SQL is the most widely used data access language while PL/SQL is an exigent language that can integrate seamlessly with SQL commands. The biggest benefit of running PL/SQL is that the code processing happens natively within the Oracle Database. In the past, there have been debates and discussions on server side programming while the client invokes the PL/SQL routines to perform a task. The server side programming approach has many benefits. It reduces the network round trips between the client and the database. It reduces the code size and eases the code portability because PL/SQL can run on all platforms, wherever Oracle Database is supported. Oracle Database 12c introduces many language features and enhancements which are focused on SQL to PL/SQL integration, code migration, and ANSI compliance. This section discusses the SQL and PL/SQL new features in Oracle database 12c. IDENTITY columns Oracle Database 12c Release 1 introduces identity columns in the tables in compliance with the American National Standard Institute (ANSI) SQL standard. A table column, marked as IDENTITY, automatically generate an incremental numeric value at the time of record creation. Before the release of Oracle 12c, developers had to create an additional sequence object in the schema and assign its value to the column. The new feature simplifies code writing and benefits the migration of a non-Oracle database to Oracle. The following script declares an identity column in the table T_ID_COL: /*Create a table for demonstration purpose*/ CREATE TABLE t_id_col (id NUMBER GENERATED AS IDENTITY, name VARCHAR2(20)) / The identity column metadata can be queried from the dictionary views USER_TAB_COLSand USER_TAB_IDENTITY_COLS. Note that Oracle implicitly creates a sequence to generate the number values for the column. However, Oracle allows the configuration of the sequence attributes of an identity column. The custom sequence configuration is listed under IDENTITY_OPTIONS in USER_TAB_IDENTITY_COLS view: /*Query identity column information in USER_TAB_COLS*/ SELECT column_name, data_default, user_generated, identity_column FROM user_tab_cols WHERE table_name='T_ID_COL' / COLUMN_NAME DATA_DEFAULT USE IDE -------------- ------------------------------ --- --- ID "SCOTT"."ISEQ$$_93001".nextval YES YES NAME YES NO Let us check the attributes of the preceding sequence that Oracle has implicitly created. 
Note that the query uses REGEXP_SUBSTR to print the sequence configuration in multiple rows: /*Check the sequence configuration from USER_TAB_IDENTITY_COLS view*/ SELECT table_name,column_name, generation_type, REGEXP_SUBSTR(identity_options,'[^,]+', 1, LEVEL) identity_options FROM user_tab_identity_cols WHERE table_name = 'T_ID_COL' CONNECT BY REGEXP_SUBSTR(identity_options,'[^,]+',1,level) IS NOT NULL / TABLE_NAME COLUMN_NAME GENERATION IDENTITY_OPTIONS ---------- ---------------------- --------------------------------- T_ID_COL ID ALWAYS START WITH: 1 T_ID_COL ID ALWAYS INCREMENT BY: 1 T_ID_COL ID ALWAYS MAX_VALUE: 9999999999999999999999999999 T_ID_COL ID ALWAYS MIN_VALUE: 1 T_ID_COL ID ALWAYS CYCLE_FLAG: N T_ID_COL ID ALWAYS CACHE_SIZE: 20 T_ID_COL ID ALWAYS ORDER_FLAG: N 7 rows selected While inserting data in the table T_ID_COL, do not include the identity column as its value is automatically generated: /*Insert test data in the table*/ BEGIN INSERT INTO t_id_col (name) VALUES ('Allen'); INSERT INTO t_id_col (name) VALUES ('Matthew'); INSERT INTO t_id_col (name) VALUES ('Peter'); COMMIT; END; / Let us check the data in the table. Note the identity column values: /*Query the table*/ SELECT id, name FROM t_id_col / ID NAME ----- -------------------- 1 Allen 2 Matthew 3 Peter The sequence created under the covers for identity columns is tightly coupled with the column. If a user tries to insert a user-defined input for the identity column, the operation throws an exception ORA-32795: INSERT INTO t_id_col VALUES (7,'Steyn'); insert into t_id_col values (7,'Steyn') * ERROR at line 1: ORA-32795: cannot insert into a generated always identity column Default column value to a sequence in Oracle 12c Oracle Database 12c allows developers to default a column directly to a sequence‑generated value. The DEFAULT clause of a table column can be assigned to SEQUENCE.CURRVAL or SEQUENCE.NEXTVAL. The feature will be useful while migrating non-Oracle data definitions to Oracle. The DEFAULT ON NULL clause Starting with Oracle Database 12c, a column can be assigned a default non-null value whenever the user tries to insert NULL into the column. The default value will be specified in the DEFAULT clause of the column with a new ON NULL extension. Note that the DEFAULT ON NULL cannot be used with an object type column. The following script creates a table t_def_cols. A column ID has been defaulted to a sequence while the column DOJ will always have a non-null value: /*Create a sequence*/ CREATE SEQUENCE seq START WITH 100 INCREMENT BY 10 / /*Create a table with a column defaulted to the sequence value*/ CREATE TABLE t_def_cols ( id number default seq.nextval primary key, name varchar2(30), doj date default on null '01-Jan-2000' ) / The following PL/SQL block inserts the test data: /*Insert the test data in the table*/ BEGIN INSERT INTO t_def_cols (name, doj) values ('KATE', '27-FEB-2001'); INSERT INTO t_def_cols (name, doj) values ('NANCY', '17-JUN-1998'); INSERT INTO t_def_cols (name, doj) values ('LANCE', '03-JAN-2004'); INSERT INTO t_def_cols (name) values ('MARY'); COMMIT; END; / Query the table and check the values for the ID and DOJ columns. ID gets the value from the sequence SEQ while DOJ for MARY has been defaulted to 01-JAN-2000. 
/*Query the table to verify sequence and default on null values*/ SELECT * FROM t_def_cols / ID NAME DOJ ---------- -------- --------- 100 KATE 27-FEB-01 110 NANCY 17-JUN-98 120 LANCE 03-JAN-04 130 MARY 01-JAN-00 Support for 32K VARCHAR2 Oracle Database 12c supports the VARCHAR2, NVARCHAR2, and RAW datatypes up to 32,767 bytes in size. The previous maximum limit for the VARCHAR2 (and NVARCHAR2) and RAW datatypes was 4,000 bytes and 2,000 bytes respectively. The support for extended string datatypes will benefit the non-Oracle to Oracle migrations. The feature can be controlled using the initialization parameter MAX_STRING_SIZE. It accepts two values: STANDARD (default)—The maximum size prior to the release of Oracle Database 12c will apply. EXTENDED—The new size limit for string datatypes apply. Note that, after the parameter is set to EXTENDED, the setting cannot be rolled back. The steps to increase the maximum string size in a database are: Restart the database in UPGRADE mode. In the case of a pluggable database, the PDB must be opened in MIGRATEmode. Use the ALTER SYSTEM command to set MAX_STRING_SIZE to EXTENDED. As SYSDBA, execute the $ORACLE_HOME/rdbms/admin/utl32k.sql script. The script is used to increase the maximum size limit of VARCHAR2, NVARCHAR2, and RAW wherever required. Restart the database in NORMAL mode. As SYSDBA, execute utlrp.sql to recompile the schema objects with invalid status. The points to be considered while working with the 32k support for string types are: COMPATIBLE must be 12.0.0.0 After the parameter is set to EXTENDED, the parameter cannot be rolled back to STANDARD In RAC environments, all the instances of the database comply with the setting of MAX_STRING_SIZE Row limiting using FETCH FIRST For Top-N queries, Oracle Database 12c introduces a new clause, FETCH FIRST, to simplify the code and comply with ANSI SQL standard guidelines. The clause is used to limit the number of rows returned by a query. The new clause can be used in conjunction with ORDER BY to retrieve Top-N results. The row limiting clause can be used with the FOR UPDATE clause in a SQL query. In the case of a materialized view, the defining query should not contain the FETCH clause. Another new clause, OFFSET, can be used to skip the records from the top or middle, before limiting the number of rows. For consistent results, the offset value must be a positive number, less than the total number of rows returned by the query. For all other offset values, the value is counted as zero. Keywords with the FETCH FIRST clause are: FIRST | NEXT—Specify FIRST to begin row limiting from the top. Use NEXT with OFFSET to skip certain rows. ROWS | PERCENT—Specify the size of the result set as a fixed number of rows or percentage of total number of rows returned by the query. ONLY | WITH TIES—Use ONLY to fix the size of the result set, irrespective of duplicate sort keys. If you want records with matching sort keys, specify WITH TIES. 
The following query demonstrates the use of the FETCH FIRST and OFFSET clauses in Top-N queries: /*Create the test table*/ CREATE TABLE t_fetch_first (empno VARCHAR2(30), deptno NUMBER, sal NUMBER, hiredate DATE) / The following PL/SQL block inserts sample data for testing: /*Insert the test data in T_FETCH_FIRST table*/ BEGIN INSERT INTO t_fetch_first VALUES (101, 10, 1500, '01-FEB-2011'); INSERT INTO t_fetch_first VALUES (102, 20, 1100, '15-JUN-2001'); INSERT INTO t_fetch_first VALUES (103, 20, 1300, '20-JUN-2000'); INSERT INTO t_fetch_first VALUES (104, 30, 1550, '30-DEC-2001'); INSERT INTO t_fetch_first VALUES (105, 10, 1200, '11-JUL-2012'); INSERT INTO t_fetch_first VALUES (106, 30, 1400, '16-AUG-2004'); INSERT INTO t_fetch_first VALUES (107, 20, 1350, '05-JAN-2007'); INSERT INTO t_fetch_first VALUES (108, 20, 1000, '18-JAN-2009'); COMMIT; END; / The SELECT query pulls in the top-5 rows when sorted by their salary: /*Query to list top-5 employees by salary*/ SELECT * FROM t_fetch_first ORDER BY sal DESC FETCH FIRST 5 ROWS ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ------- --------- 104 30 1550 30-DEC-01 101 10 1500 01-FEB-11 106 30 1400 16-AUG-04 107 20 1350 05-JAN-07 103 20 1300 20-JUN-00 The SELECT query lists the top 25% of employees (2) when sorted by their hiredate: /*Query to list top-25% employees by hiredate*/ SELECT * FROM t_fetch_first ORDER BY hiredate FETCH FIRST 25 PERCENT ROW ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ----- --------- 103 20 1300 20-JUN-00 102 20 1100 15-JUN-01 The SELECT query skips the first five employees and displays the next two—the 6th and 7th employee data: /*Query to list 2 employees after skipping first 5 employees*/ SELECT * FROM t_fetch_first ORDER BY SAL DESC OFFSET 5 ROWS FETCH NEXT 2 ROWS ONLY / Invisible columns Oracle Database 12c supports invisible columns, which implies that the visibility of a column. A column marked invisible does not appear in the following operations: SELECT * FROM queries on the table SQL* Plus DESCRIBE command Local records of %ROWTYPE Oracle Call Interface (OCI) description A column can be made invisible by specifying the INVISIBLE clause against the column. Columns of all types (except user-defined types), including virtual columns, can be marked invisible, provided the tables are not temporary tables, external tables, or clustered ones. An invisible column can be explicitly selected by the SELECT statement. Similarly, the INSERTstatement will not insert values in an invisible column unless explicitly specified. Furthermore, a table can be partitioned based on an invisible column. A column retains its nullity feature even after it is made invisible. An invisible column can be made visible, but the ordering of the column in the table may change. In the following script, the column NICKNAME is set as invisible in the table t_inv_col: /*Create a table to demonstrate invisible columns*/ CREATE TABLE t_inv_col (id NUMBER, name VARCHAR2(30), nickname VARCHAR2 (10) INVISIBLE, dob DATE ) / The information about the invisible columns can be found in user_tab_cols. Note that the invisible column is marked as hidden: /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 NAME NO 3 DOB NO NICKNAME YES Hidden columns are different from invisible columns. 
Invisible columns can be made visible and vice versa, but hidden columns cannot be made visible. If we try to make the NICKNAME visible and NAME invisible, observe the change in column ordering: /*Script to change visibility of NICKNAME column*/ ALTER TABLE t_inv_col MODIFY nickname VISIBLE / /*Script to change visibility of NAME column*/ ALTER TABLE t_inv_col MODIFY name INVISIBLE / /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 DOB NO 3 NICKNAME NO NAME YES Temporal databases Temporal databases were released as a new feature in ANSI SQL:2011. The term temporal data can be understood as a piece of information that can be associated with a period within which the information is valid. Before the feature was included in Oracle Database 12c, data whose validity is linked with a time period had to be handled either by the application or using multiple predicates in the queries. Oracle 12c partially inherits the feature from the ANSI SQL:2011 standard to support the entities whose business validity can be bracketed with a time dimension. If you're relating a temporal database with the Oracle Database 11g Total Recall feature, you're wrong. The total recall feature records the transaction time of the data in the database to secure the transaction validity and not the functional validity. For example, an investment scheme is active between January to December. The date recorded in the database at the time of data loading is the transaction timestamp. [box type="info" align="" class="" width=""]Starting from Oracle 12c, the Total Recall feature has been rebranded as Flashback Data Archive and has been made available for all versions of Oracle Database.[/box] The valid time temporal can be enabled for a table by adding a time dimension using the PERIOD FOR clause on the date or timestamp columns of the table. The following script creates a table t_tmp_db with the valid time temporal: /*Create table with valid time temporal*/ CREATE TABLE t_tmp_db( id NUMBER, name VARCHAR2(30), policy_no VARCHAR2(50), policy_term number, pol_st_date date, pol_end_date date, PERIOD FOR pol_valid_time (pol_st_date, pol_end_date)) / Create some sample data in the table: /*Insert test data in the table*/ BEGIN INSERT INTO t_tmp_db VALUES (100, 'Packt', 'PACKT_POL1', 1, '01-JAN-2015', '31-DEC-2015'); INSERT INTO t_tmp_db VALUES (110, 'Packt', 'PACKT_POL2', 2, '01-JAN-2015', '30-JUN-2015'); INSERT INTO t_tmp_db VALUES (120, 'Packt', 'PACKT_POL3', 3, '01-JUL-2015', '31-DEC-2015'); COMMIT; END; / Let us set the current time period window using DBMS_FLASHBACK_ARCHIVE. Grant the EXECUTE privilege on the package to the scott user. /*Connect to sysdba to grant execute privilege to scott*/ conn / as sysdba GRANT EXECUTE ON dbms_flashback_archive to scott / Grant succeeded. /*Connect to scott*/ conn scott/tiger /*Set the valid time period as CURRENT*/ EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('CURRENT'); PL/SQL procedure successfully completed. Setting the valid time period as CURRENT means that all the tables with a valid time temporal will only list the rows that are valid with respect to today's date. You can set the valid time to a particular date too. 
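For instance, a sketch of pointing the session at a specific date could look like the following; the date and format are only examples, and the session is switched back to CURRENT so that the queries that follow behave as described.

/*Set the valid time period to a specific date (illustrative only)*/
EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('ASOF', TO_TIMESTAMP('01-AUG-2015','DD-MON-YYYY'));

/*Switch back to CURRENT for the examples that follow*/
EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('CURRENT');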
/*Query the table*/ SELECT * from t_tmp_db / ID POLICY_NO POL_ST_DATE POL_END_DATE --------- ---------- ----------------------- ------------------- 100 PACKT_POL1 01-JAN-15 31-DEC-15 110 PACKT_POL2 01-JAN-15 30-JUN-15 [box type="info" align="" class="" width=""]Due to dependency on the current date, the result may vary when the reader runs the preceding queries.[/box] The query lists only those policies that are active as of March 2015. Since the third policy starts in July 2015, it is currently not active. In-Database Archiving Oracle Database 12c introduces In-Database Archiving to archive the low priority data in a table. The inactive data remains in the database but is not visible to the application. You can mark old data for archival, which is not actively required in the application except for regulatory purposes. Although the archived data is not visible to the application, it is available for querying and manipulation. In addition, the archived data can be compressed to improve backup performance. A table can be enabled by specifying the ROW ARCHIVAL clause at the table level, which adds a hidden column ORA_ARCHIVE_STATE to the table structure. The column value must be updated to mark a row for archival. For example: /*Create a table with row archiving*/ CREATE TABLE t_row_arch( x number, y number, z number) ROW ARCHIVAL / When we query the table structure in the USER_TAB_COLS view, we find an additional hidden column, which Oracle implicitly adds to the table: /*Query the columns information from user_tab_cols view*/ SELECT column_id,column_name,data_type, hidden_column FROM user_tab_cols WHERE table_name='T_ROW_ARCH' / COLUMN_ID COLUMN_NAME DATA_TYPE HID ---------- ------------------ ---------- --- ORA_ARCHIVE_STATE VARCHAR2 YES 1 X NUMBER NO 2 Y NUMBER NO 3 Z NUMBER NO Let us create test data in the table: /Insert test data in the table*/ BEGIN INSERT INTO t_row_arch VALUES (10,20,30); INSERT INTO t_row_arch VALUES (11,22,33); INSERT INTO t_row_arch VALUES (21,32,43); INSERT INTO t_row_arch VALUES (51,82,13); commit; END; / For testing purpose, let us archive the rows in the table where X > 50 by updating the ora_archive_state column: /*Update ORA_ARCHIVE_STATE column in the table*/ UPDATE t_row_arch SET ora_archive_state = 1 WHERE x > 50 / COMMIT / By default, the session displays only the active records from an archival-enabled table: /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ------ -------- ---------- 10 20 30 11 22 33 21 32 43 If you wish to display all the records, change the session setting: /*Change the session parameter to display the archived records*/ ALTER SESSION SET ROW ARCHIVAL VISIBILITY = ALL / Session altered. /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ---------- ---------- ---------- 10 20 30 11 22 33 21 32 43 51 82 13 Defining a PL/SQL subprogram in the SELECT query and PRAGMA UDF Oracle Database 12c includes two new features to enhance the performance of functions when called from SELECT statements. With Oracle 12c, a PL/SQL subprogram can be created inline with the SELECT query in the WITH clause declaration. The function created in the WITH clause subquery is not stored in the database schema and is available for use only in the current query. Since a procedure created in the WITH clause cannot be called from the SELECT query, it can be called in the function created in the declaration section. The feature can be very handy in read-only databases where the developers were not able to create PL/SQL wrappers. 
Oracle Database 12c adds the new PRAGMA UDF to create a standalone function with the same objective. Earlier, the SELECT queries could invoke a PL/SQL function, provided the function didn't change the database purity state. The query performance used to degrade because of the context switch from SQL to the PL/SQL engine (and vice versa) and the different memory representations of data type representation in the processing engines. In the following example, the function fun_with_plsql calculates the annual compensation of an employee's monthly salary: /*Create a function in WITH clause declaration*/ WITH FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER IS BEGIN RETURN (p_sal * 12); END; SELECT ename, deptno, fun_with_plsql (sal) "annual_sal" FROM emp / ENAME DEPTNO annual_sal ---------- --------- ---------- SMITH 20 9600 ALLEN 30 19200 WARD 30 15000 JONES 20 35700 MARTIN 30 15000 BLAKE 30 34200 CLARK 10 29400 SCOTT 20 36000 KING 10 60000 TURNER 30 18000 ADAMS 20 13200 JAMES 30 11400 FORD 20 36000 MILLER 10 15600 14 rows selected. [box type="info" align="" class="" width=""]If the query containing the WITH clause declaration is not a top-level statement, then the top level statement must use the WITH_PLSQL hint. The hint will be used if INSERT, UPDATE, or DELETE statements are trying to use a SELECT with a WITHclause definition. Failure to include the hint results in an exception ORA-32034: unsupported use of WITH clause.[/box] A function can be created with the PRAGMA UDF to inform the compiler that the function is always called in a SELECT statement. Note that the standalone function created in the following code carries the same name as the one in the last example. The local WITH clause declaration takes precedence over the standalone function in the schema. /*Create a function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER is PRAGMA UDF; BEGIN RETURN (p_sal *12); END; / Since the objective of the feature is performance, let us go ahead with a case study to compare the performance when using a standalone function, a PRAGMA UDF function, and a WITHclause declared function. Test setup The exercise uses a test table with 1 million rows, loaded with random data. /*Create a table for performance test study*/ CREATE TABLE t_fun_plsql (id number, str varchar2(30)) / /*Generate and load random data in the table*/ INSERT /*+APPEND*/ INTO t_fun_plsql SELECT ROWNUM, DBMS_RANDOM.STRING('X', 20) FROM dual CONNECT BY LEVEL <= 1000000 / COMMIT / Case 1: Create a PL/SQL standalone function as it used to be until Oracle Database 12c. The function counts the numbers in the str column of the table. /*Create a standalone function without Oracle 12c enhancements*/ CREATE OR REPLACE FUNCTION f_count_num (p_str VARCHAR2) RETURN PLS_INTEGER IS BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / The PL/SQL block measures the elapsed and CPU time when working with a pre-Oracle 12c standalone function. These numbers will serve as the baseline for our case study. 
/*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a standalone function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 1: Performance of a standalone function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a standalone function: Total elapsed time:1559 Total CPU time:1366 PL/SQL procedure successfully completed. Case 2: Create a PL/SQL function using PRAGMA UDF to count the numbers in the str column. /*Create the function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION f_count_num_pragma (p_str VARCHAR2) RETURN PLS_INTEGER IS PRAGMA UDF; BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / Let us now check the performance of the PRAGMA UDF function using the following PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a PRAGMA UDF function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num_pragma (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 2: Performance of a PRAGMA UDF function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a PRAGMA UDF function: Total elapsed time:664 Total CPU time:582 PL/SQL procedure successfully completed. Case 3: The following PL/SQL block dynamically executes the function in the WITH clause subquery. Note that, unlike other SELECT statements, a SELECT query with a WITH clause declaration cannot be executed statically in the body of a PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of inline function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; l_sql VARCHAR2(32767); c1 sys_refcursor; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.get_time; l_cpu_time := DBMS_UTILITY.get_cpu_time; l_sql := 'WITH FUNCTION f_count_num_with (p_str VARCHAR2) RETURN NUMBER IS BEGIN RETURN (REGEXP_COUNT(p_str,'''||''||'d'||''')); END; SELECT f_count_num_with(str) FROM t_fun_plsql'; OPEN c1 FOR l_sql; FETCH c1 bulk collect INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 3: Performance of an inline function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of an inline function: Total elapsed time:830 Total CPU time:718 PL/SQL procedure successfully completed. Comparative analysis Comparing the results from the preceding three cases, it's clear that the Oracle 12c flavor of PL/SQL functions out-performs the pre-12c standalone function by a high margin. 
From the following matrix, it is apparent that using the PRAGMA UDF or the WITH clause declaration improves performance by (roughly) a factor of two.

Case  Description                                                 Elapsed time  CPU time  Performance gain factor (by CPU time)
1     Standalone PL/SQL function in pre-Oracle 12c database       1559          1366      1x
2     Standalone PL/SQL PRAGMA UDF function in Oracle 12c         664           582       2.3x
3     Function created in WITH clause declaration in Oracle 12c   830           718       1.9x

[box type="info" align="" class="" width=""]Note that the numbers may differ slightly in the reader's testing environment, but the comparison should lead to the same conclusion.[/box]

The PL/SQL program unit whitelisting

Prior to Oracle 12c, a standalone or packaged PL/SQL unit could be invoked by all other programs in the session's schema. Oracle Database 12c allows users to prevent unauthorized access to PL/SQL program units: you can now specify the whitelist of program units that are allowed to invoke a particular program. The program header or the package specification lists these units in the ACCESSIBLE BY clause. Any other program unit, including cross-schema references (even SYS-owned objects), that tries to access a protected subprogram receives the exception PLS-00904: insufficient privilege to access object [object name].

The feature can be very useful in an extremely sensitive development environment. Suppose a package PKG_FIN_PROC contains sensitive implementation routines for a financial institution, and its subprograms are called by another PL/SQL package, PKG_FIN_INTERNALS. The API layer exposes a fixed list of programs through a public API called PKG_CLIENT_ACCESS. To restrict access to the packaged routines in PKG_FIN_PROC, we can build a safety net that allows access only to authorized programs.

The following PL/SQL package PKG_FIN_PROC contains two subprograms—P_FIN_QTR and P_FIN_ANN. The ACCESSIBLE BY clause includes PKG_FIN_INTERNALS, which means that all other program units, including anonymous PL/SQL blocks, are blocked from invoking PKG_FIN_PROC constructs.

/*Package with the ACCESSIBLE BY clause*/
CREATE OR REPLACE PACKAGE pkg_fin_proc
ACCESSIBLE BY (PACKAGE pkg_fin_internals)
IS
   PROCEDURE p_fin_qtr;
   PROCEDURE p_fin_ann;
END;
/

[box type="info" align="" class="" width=""]The ACCESSIBLE BY clause can be specified for schema-level programs only.[/box]

Let's see what happens when we invoke the packaged subprogram from an anonymous PL/SQL block.

/*Invoke the packaged subprogram from the PL/SQL block*/
BEGIN
   pkg_fin_proc.p_fin_qtr;
END;
/

   pkg_fin_proc.p_fin_qtr;
   *
ERROR at line 2:
ORA-06550: line 2, column 4:
PLS-00904: insufficient privilege to access object PKG_FIN_PROC
ORA-06550: line 2, column 4:
PL/SQL: Statement ignored

As expected, the compiler throws an exception because invoking the whitelisted package from an anonymous block is not allowed. The ACCESSIBLE BY clause can be included in the header of PL/SQL procedures and functions, packages, and object types.
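By contrast, a unit named in the ACCESSIBLE BY list can reference the protected package freely. The following is a minimal sketch, assuming a package body for PKG_FIN_PROC has been created separately; the procedure name p_run_qtr_close is purely illustrative.

/*A whitelisted caller compiles and can invoke the protected package*/
CREATE OR REPLACE PACKAGE pkg_fin_internals IS
   PROCEDURE p_run_qtr_close;
END;
/

CREATE OR REPLACE PACKAGE BODY pkg_fin_internals IS
   PROCEDURE p_run_qtr_close IS
   BEGIN
      pkg_fin_proc.p_fin_qtr;   -- allowed: PKG_FIN_INTERNALS is listed in ACCESSIBLE BY
   END;
END;
/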
Granting roles to PL/SQL program units

Before Oracle Database 12c, a PL/SQL unit created with definer's rights (the default AUTHID) always executed with the definer's rights, whether or not the invoker held the required privileges. This can lead to an unfair situation in which the invoking user performs unwanted operations without possessing the correct set of privileges. Similarly, for an invoker's rights unit, if the invoking user possesses a higher set of privileges than the definer, they might end up performing unauthorized operations.

Oracle Database 12c secures definer's rights by allowing the defining user to grant complementary roles to individual PL/SQL subprograms and packages. From a security standpoint, granting roles to schema-level subprograms provides granular control, as the privileges of the invoker are validated at the time of execution.

In the following example, we will create two users: U1 and U2. The user U1 creates a PL/SQL procedure P_INC_PRICE that adds a surcharge to the price of a product by a certain amount. U1 grants the execute privilege to user U2.

Test setup

Let's create the two users and give them the required privileges.

/*Create a user with a password*/
CREATE USER u1 IDENTIFIED BY u1
/
User created.

/*Grant connect privileges to the user*/
GRANT CONNECT, RESOURCE TO u1
/
Grant succeeded.

/*Create a user with a password*/
CREATE USER u2 IDENTIFIED BY u2
/
User created.

/*Grant connect privileges to the user*/
GRANT CONNECT, RESOURCE TO u2
/
Grant succeeded.

The user U1 owns the PRODUCTS table. Let's create and populate the table.

/*Connect to U1*/
CONN u1/u1

/*Create the table PRODUCTS*/
CREATE TABLE products
(
   prod_id    INTEGER,
   prod_name  VARCHAR2(30),
   prod_cat   VARCHAR2(30),
   price      INTEGER
)
/

/*Insert the test data in the table*/
BEGIN
   DELETE FROM products;
   INSERT INTO products VALUES (101, 'Milk',   'Dairy', 20);
   INSERT INTO products VALUES (102, 'Cheese', 'Dairy', 50);
   INSERT INTO products VALUES (103, 'Butter', 'Dairy', 75);
   INSERT INTO products VALUES (104, 'Cream',  'Dairy', 80);
   INSERT INTO products VALUES (105, 'Curd',   'Dairy', 25);
   COMMIT;
END;
/

The procedure p_inc_price is designed to increase the price of a product by a given amount. Note that the procedure is created with definer's rights.

/*Create the procedure with the definer's rights*/
CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER)
IS
BEGIN
   UPDATE products
   SET price = price + p_amt
   WHERE prod_id = p_prod_id;
END;
/

The user U1 grants the execute privilege on p_inc_price to U2.

/*Grant execute on the procedure to the user U2*/
GRANT EXECUTE ON p_inc_price TO U2
/

The user U2 logs in and executes the procedure P_INC_PRICE to increase the price of Milk by 5 units.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

PL/SQL procedure successfully completed.

The last code listing exposes a gray area: the user U2, though not authorized to view the PRODUCTS data, manipulates it with the definer's rights. We need a solution to the problem. The first step is to change the procedure from definer's rights to invoker's rights.

/*Connect to U1*/
CONN u1/u1

/*Modify the privilege authentication for the procedure to invoker's rights*/
CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER)
AUTHID CURRENT_USER
IS
BEGIN
   UPDATE products
   SET price = price + p_amt
   WHERE prod_id = p_prod_id;
END;
/
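Optionally, the change can be verified from the data dictionary. The query below assumes the AUTHID column of the USER_PROCEDURES view, which reports the rights model of schema-level units.

/*Confirm that P_INC_PRICE now uses invoker's rights*/
SELECT object_name, authid
FROM   user_procedures
WHERE  object_name = 'P_INC_PRICE'
/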
Now, if we execute the procedure from U2, it throws an exception because it cannot find the PRODUCTS table in its own schema.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

BEGIN
*
ERROR at line 1:
ORA-00942: table or view does not exist
ORA-06512: at "U1.P_INC_PRICE", line 5
ORA-06512: at line 2

In a similar scenario in the past, a database administrator could simply have granted select or update privileges to U2, which is not an optimal solution from a security standpoint. Oracle 12c allows us to create program units with invoker's rights and grant the required roles to the program units rather than to the users. An invoker's rights unit then executes with the invoker's privileges plus the roles granted to the program itself. Let's check out the steps to create a role and assign it to the procedure.

SYSDBA creates the role and assigns it to the user U1. Using the ADMIN or DELEGATE option with the grant enables the user to grant the role to other entities.

/*Connect as SYSDBA*/
CONN / as sysdba

/*Create a role*/
CREATE ROLE prod_role
/

/*Grant role to user U1 with delegate option*/
GRANT prod_role TO U1 WITH DELEGATE OPTION
/

Now, user U1 assigns the required set of privileges to the role. The role is then assigned to the required subprogram. Note that only roles, and not individual privileges, can be granted to schema-level subprograms.

/*Connect to U1*/
CONN u1/u1

/*Grant SELECT and UPDATE privileges on PRODUCTS to the role*/
GRANT SELECT, UPDATE ON PRODUCTS TO prod_role
/

/*Grant role to the procedure*/
GRANT prod_role TO PROCEDURE p_inc_price
/

User U2 tries to execute the procedure again. This time the procedure executes successfully, which means the price of Milk has been increased by 5 units.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

PL/SQL procedure successfully completed.

User U1 verifies the result with a SELECT query.

/*Connect to U1*/
CONN u1/u1

/*Query the table to verify the change*/
SELECT * FROM products
/

   PROD_ID PROD_NAME  PROD_CAT        PRICE
---------- ---------- ---------- ----------
       101 Milk       Dairy              25
       102 Cheese     Dairy              50
       103 Butter     Dairy              75
       104 Cream      Dairy              80
       105 Curd       Dairy              25

Miscellaneous PL/SQL enhancements

Besides the preceding key features, Oracle 12c brings a number of smaller PL/SQL changes:

- An invoker's rights function can now be result-cached. Until Oracle 11g, only definer's rights programs were allowed to cache their results; Oracle 12c adds the invoking user's identity to the result cache so that cached results are not shared across invokers.
- The compilation parameter PLSQL_DEBUG has been deprecated.
- Two conditional compilation inquiry directives, $$PLSQL_UNIT_OWNER and $$PLSQL_UNIT_TYPE, have been introduced (see the short sketch after this list).
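The following sketch illustrates the first and last items. The names f_cached_price and p_show_unit_info are hypothetical, and the PRODUCTS table from the earlier example is assumed to be available.

/*Invoker's rights function that is result-cached (allowed in 12c)*/
CREATE OR REPLACE FUNCTION f_cached_price (p_prod_id NUMBER)
RETURN NUMBER
AUTHID CURRENT_USER
RESULT_CACHE
IS
   l_price products.price%TYPE;
BEGIN
   SELECT price INTO l_price FROM products WHERE prod_id = p_prod_id;
   RETURN l_price;
END;
/

/*Print the new inquiry directives from inside a named unit*/
CREATE OR REPLACE PROCEDURE p_show_unit_info IS
BEGIN
   DBMS_OUTPUT.PUT_LINE ('Unit owner: '||$$PLSQL_UNIT_OWNER);
   DBMS_OUTPUT.PUT_LINE ('Unit type : '||$$PLSQL_UNIT_TYPE);
END;
/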
Summary

This article covers the top-rated new features of Oracle 12c SQL and PL/SQL, as well as some miscellaneous PL/SQL enhancements.

Further resources on this subject:
- What is Oracle Public Cloud? [article]
- Oracle APEX 4.2 reporting [article]
- Oracle E-Business Suite with Desktop Integration [article]