
How-To Tutorials - Data

1204 Articles

Analyzing Financial Data in QlikView

Packt
15 Sep 2015
8 min read
In this article by Diane Blackwood, author of the book QlikView for Finance, we look at how QlikView is an easy-to-use business intelligence product designed to facilitate ad hoc relationship analysis. However, it can also be used by financial users in formal corporate performance applications. QlikView uses a direct-discovery methodology to analyze data from multiple sources, letting you do your own business discovery and taking you out of the data management stage and into the data relationship investigation stage. Investigating relationships and outliers in financial data can lead to more effective management. (For more resources related to this topic, see here.)

You could use QlikView when you wish to analyze and quickly see trends and exceptions that, with normal financial application-oriented BI products, would not be readily apparent without days of consultant and technology department setup. With QlikView, you can also analyze data relationships that are not measured in monetary units. Certainly, QlikView can be used to analyze sales trends and stock performance, but other relationships soon become apparent when you start using QlikView. Also, with the free downloadable personal edition of QlikView, you can start analyzing your own data right away.

QlikView consists of two parts:
The sheet: This can contain sheet objects, such as charts or list boxes, which show clickable information.
The load script: This stores information about the data and the data sources that the data is coming from.

Financial professionals are always using Excel to examine their data, and we can load data from an Excel sheet into QlikView. This can also help you to create a basic document sheet containing a chart. The newest version of QlikView comes with sample Sales Order data that can be used to investigate and create sheet objects. In order to use data from other file types, you can use the File Wizard (Type), which you start from the Edit Script dialog by clicking on the Table Files button. Using the Edit Script dialog, you can view your load script, edit it, and add other data sources. You can also reload your data by clicking on the Reload button. If you just want to analyze the information in an existing QlikView file, you do not need to work with the script at all.

We will use some sample financial data that was downloaded from an ERP system to Excel in order to demonstrate how an analysis might work. Our QlikView Financial Analysis of Cheyenne Company will appear as follows:

Figure 1: Our Financial Analysis QlikView Application

When we create objects for analysis purposes in QlikView, the drop-down menu shows that there are multiple sheet object types to choose from, such as List Box, Statistics Box, Chart, Input Box, Current Selections Box, MultiBox, Table Box, Button, Text Object, Line/Arrow Object, Slider/Calendar Object, and Bookmark Object. In our example, we chose the Statistics Box sheet object to add the grand total to our analysis. From this, we can see that the total company is out of balance by $1.59. From an auditor's point of view, this amount is probably small enough to be immaterial, but, from our point of view as financial professionals, we want to know where our books are falling out of balance. To make our investigation easier, we should add one additional sheet object: a List Box for Company. This is done by right-clicking to open the context menu and selecting New Sheet Object and then List Box.
Figure 2: Added Company List Box

We can now see that we are actually out of balance in three companies. Cheyenne Co. L.P. is out by $1.59, while Cheyenne Holding and Cheyenne National Inc. have entries that balance at the total-companies level but do not balance at the individual company level. We can analyze our data using the list boxes just by selecting a Company and viewing the Account Groups and Cost Centers that are included (white) and excluded (gray). This is the standard QlikView color scheme; our selected company is shown in green and in the Current Selections Box. By selecting Cheyenne Holding, we can verify that it is indeed a holding company and does not have any manufacturing or sales account groups or cost centers. Alternatively, if we choose Provo, we can see that it is in balance.

To load more than one spreadsheet or load from a different data source, we must edit the load script. From the Edit Script interface, we can modify and execute a script that connects the QlikView document to an ODBC data source or to data files of different types and grabs the data source information as well. Our first script was generated automatically, but scripts can be typed manually, or automatically generated scripts can be modified. Complex script statements must, at least partially, be entered manually. The Edit Script dialog uses autocomplete, so, as you type, the program tries to predict what is wanted in the script without you having to type it completely. The predictions include words that are part of the script syntax, and the script is also color coded by syntax component. The Edit Script interface and behavior can be customized to your preferences by selecting Tools and then Editor Preferences. A menu bar is found at the top of the Edit Script dialog with various script-related commands; the most frequently used commands also appear in the toolbar, which additionally contains a drop-down list for the tabs of the Edit Script wizard.

The first script in the Edit Script interface is the automatically generated one that was created by the wizard when we started the QlikView file. The automatically generated script picks up the column names from the Excel file and puts in some default formatting scripting. The language selection that we made during the initial installation of QlikView determines the defaults assigned to this portion of the script. We can add data from multiple sources, such as ODBC links, additional Excel files, sources from the Web, FTP, and even other QlikView files.

Our first Excel file, which we used to create the initial QlikView document, is already in our script. It happened to be October 2013 data, but suppose we wanted to add another month, such as November, to our analysis. We would just navigate to the Edit Script interface from the File menu and then click on the script itself. Make sure that the cursor is at the bottom of the script, after the first Excel file path and description. If you do not position your cursor where you want your additional script information to populate, you may generate your new script code in the middle of your existing script code. If you make a mistake, click on CANCEL and start over. After navigating to the script location where you want to add your new code, click on the Table Files button (the first button in the column towards the center right, below the script). Click on NEXT through the next four screens unless you need to add column labels.
Comments can be added to scripts using // for a single line, or by surrounding the comment with a beginning /* and an ending */. Comments show up as green. After clicking on the OK button to get out of the Script Editor, there is another File menu item that can be used to verify that QlikView has correctly interpreted the joins: the Table Viewer menu item. You cannot edit in the Table view, but it is a convenient way to visualize how the table fields interact. Save the changes to the script by clicking on the OK button in the lower-right corner. Now, from the File menu, navigate to Edit Script and then to the Reload menu item and click on it to reload your data; otherwise, your new month of data will not be loaded. If you receive any error messages, the solutions can be researched in QlikView Help.

In this case, the column headers were the same, so QlikView knew to add the data from the two spreadsheets together into one table. However, because of this, if we look at our Company List Box and Amount Statistics Box, we see everything added together.

Figure 3: Data Doubled after Reload with Additional File

The reason this data is doubled is that we do not have any way to split the months or select only October or November. Now that we have more than one month of data, we can add another List Box with the months. This will automatically link up to our Chart and Straight Table sheet objects to separate our monthly data. Once added, we can select OCTOBER or NOVEMBER from our new List Box, and our sheet objects automatically show the correct sum for the individual months. We can then use the List Box and linked objects to further analyze our financial data.

Summary

You can find further books on QlikView published by Packt on the Packt website, http://www.packtpub.com. Some of them are listed as follows:
Learning QlikView Data Visualization by Karl Pover
Predictive Analytics using Rattle and QlikView by Ferran Garcia Pagans

Resources for Article:
Further resources on this subject:
Common QlikView script errors [article]
Securing QlikView Documents [article]
Conozca QlikView [article]


Apache Spark

Packt
14 Sep 2015
9 min read
In this article by Mike, author of the book Mastering Apache Spark, many Hadoop-based tools built on a Hadoop CDH cluster are introduced. (For more resources related to this topic, see here.) His premise, when approaching any big data system, is that none of the components exists in isolation. There are many functions that need to be addressed in a big data system, with components passing data along an ETL (Extract, Transform, and Load) chain or calling subcomponents to carry out processing. Some of these functions are:
Data Movement
Scheduling
Storage
Data Acquisition
Real-Time Data Processing
Batch Data Processing
Monitoring
Reporting
This list is not exhaustive, but it gives you an idea of the functional areas that are involved. For instance, HDFS (Hadoop Distributed File System) might be used for storage, Oozie for scheduling, Hue for monitoring, and Spark for real-time processing. His point, though, is that none of these systems exists in isolation; they either sit in an ETL chain when processing data and rely on other subcomponents, as Oozie does, or depend on other components to provide functionality that they do not have. His contention is that integration between big data systems is an important factor. One needs to consider where the data is coming from, how it will be processed, and where it is going. Given this consideration, the integration options for a big data component need to be investigated both in terms of what is available now and what might be available in the future.

In the book, the author has organized the system functionality by chapter and tried to determine what tools might be available to carry out these functions. Then, with the help of simple examples using code and data, he has shown how the systems might be used together. The book is based upon Apache Spark, so, as you might expect, it investigates the four main functional modules of Spark:
MLlib for machine learning
Streaming for data stream processing
SQL for data processing in a tabular format
GraphX for graph-based processing
However, the book attempts to extend these common, real-time big data processing areas by examining extra areas such as graph-based storage and real-time cloud-based processing via Databricks. It provides examples of integration with external tools, such as Kafka and Flume, as well as Scala-based development examples. In order to Spark your interest, and prepare you for the book's contents, he has described the contents of the book by subject and given you a sample of the content.

Overview

The introduction sets the scene for the book by examining topics such as Spark cluster design and the choice of cluster managers. It considers the issues affecting cluster performance and explains how real-time big data processing can be carried out in the cloud. The following diagram describes the topics that are explained in the book:

The Spark Streaming examples are provided along with details for checkpointing to avoid data loss. Installation and integration examples are provided for Kafka (messaging) and Flume (data movement). The functionality of Spark MLlib is extended via 0xdata H2O, and an example deep learning neural system is created and tested. Spark SQL is investigated and integrated with Hive to show that Spark can become a real-time processing engine for Hive. Spark storage is considered, by example, using Aurelius (Datastax) Titan along with underlying storage in HBase and Cassandra.
The use of Tinkerpop and the Gremlin shell is explained by example for graph processing. Finally, of course, many methods of integrating Spark with HDFS are shown with the help of examples. This gives you a flavor of what is in the book, but it doesn't give you the detail. Keep reading to find out what is in each area.

Spark MLlib

The Spark MLlib chapter examines data classification with Naïve Bayes, data clustering with K-Means, and neural processing with ANN (Artificial Neural Network). If these terms do not mean anything to you, don't worry. They are explained both in terms of theory and practically with examples. The author has always been interested in neural networks, and was pleased to be able to base the ANN section on the work by Bert Greevenbosch (www.bertgreevenbosch.nl). This allows him to show how Apache Spark can be built from source code and extended, in the same process, with extra functionality. The following diagram shows a real, biological neuron on the left and a simulated neuron on the right. It also explains, step by step, how computational neurons are simulated from the real neurons in your head. The chapter then goes on to describe how neural networks are created and how processing takes place. It's an interesting topic: the integration of big data systems and neural processing.

Spark Streaming

An important issue when processing stream-based data is failure recovery. Here, we examine error recovery and checkpointing with the help of an Apache Spark example. The chapter also provides examples of TCP, file, Flume, and Kafka-based stream processing using Spark. Even though he has provided step-by-step, code-based examples, data stream processing can become complicated. He has tried to reduce complexity so that learning does not become a challenge. For example, when introducing a Kafka-based example, the following diagram is used to explain the test components, the data flow, and the component setup in a logical, step-by-step manner:

Spark SQL

When introducing Spark SQL, he describes the data file formats that might be used to assist with data integration. He then moves on to describe, with the help of an example, the use of data frames, followed closely by practical SQL examples. Finally, integration with Apache Hive is introduced to provide big data warehouse real-time processing by example. User-defined functions are also explained, showing how they can be defined in multiple ways and used with Spark SQL.

Spark GraphX

Graph processing is examined by showing how a simple graph can be created in Scala. Then, sample graph algorithms such as PageRank and Triangles are introduced. With permission from Kenny Bastani (http://www.kennybastani.com/), the Mazerunner prototype application is discussed. A step-by-step approach describes how Docker, Neo4j, and Mazerunner can be installed. Then, the functionality of both Neo4j and Mazerunner is used to move data between Neo4j and HDFS. The following diagram gives an overview of the architecture that will be introduced:

Spark storage

Apache Spark is a highly functional, real-time, distributed big data processing system; however, it does not provide any data storage. In many places within the book, examples are provided for using HDFS-based storage, but what if you want graph-based storage? What if you want to process and store data as a graph? The Aurelius (Datastax) Titan graph database is examined in the book, and the underlying storage options with Cassandra and HBase are used with Scala examples.
Graph-based processing is examined using Tinkerpop and Gremlin-based scripts. Using a simple, example-based approach, both the architecture involved and multiple ways of using the Gremlin shell are introduced in the following diagram:

Spark H2O

While Apache Spark is highly functional and agile, allowing data to move easily between its modules, how might we extend it? By considering the H2O product from http://h2o.ai/, the machine learning functionality of Apache Spark can be extended: H2O plus Spark equals Sparkling Water. Sparkling Water is used to create a deep learning neural processing example for data processing. The H2O web-based Flow application is also introduced for analytics and data investigation.

Spark Databricks

Having created big data processing clusters on physical machines, the next logical step is to move processing into the cloud. This might be carried out by obtaining cloud-based storage, using Spark as a cloud-based service, or using a Spark-based management system. The people who designed Apache Spark have created a Spark cloud-based processing platform called Databricks (https://databricks.com/). He has dedicated two chapters in the book to this service, because he feels that it is important to investigate future trends. All aspects of Databricks are examined, from user and cluster management to the use of Notebooks for data processing. The languages that can be used are investigated, as are ways of developing code on local machines and then moving it to the cloud in order to save money. Data import is examined with examples, as is the DbUtils package for data processing. The REST interface for Spark cloud instance management is investigated, because it offers integration options between your potential cloud instance and external systems. Finally, options for moving data and functionality are investigated in terms of data and folder import/export, along with library import and cluster creation on demand.

Databricks visualisation

The various options for cloud-based big data visualization using Databricks are investigated. Multiple ways are described for creating reports with the help of tables and SQL bar graphs. Pie charts and world maps are used to present data. Databricks allows geolocation data to be combined with your raw data to create geographical real-time charts. The following figure, taken from the book, shows the result of a worked example combining GeoNames data with geolocation data; the color-coded, country-based data counts are the result. It's difficult to demonstrate this in a book, but imagine this map, based upon stream-based data, continuously updating in real time. In a similar way, it is possible to create dashboards from your Databricks reports and make them available to your external customers via a web-based URL.

Summary

Mike hopes that this article has given you an idea of the book's contents, and also that it has intrigued you, so that you will search out a copy of the Spark-based book, Mastering Apache Spark, and try out all of these examples for yourself. The book comes with a code package that provides the example-based sample code, as well as build and execution scripts. This should provide you with an easy start and a platform to build your own Spark-based code.

Resources for Article:
Further resources on this subject:
Sabermetrics with Apache Spark [article]
Getting Started with Apache Spark [article]
Machine Learning Using Spark MLlib [article]


PostgreSQL in Action

Packt
14 Sep 2015
10 min read
In this article by Salahadin Juba, Achim Vannahme, and Andrey Volkov, authors of the book Learning PostgreSQL, we will discuss PostgreSQL (pronounced Post-Gres-Q-L), or Postgres, an open source, object-relational database management system. It emphasizes extensibility, creativity, and compatibility. It competes with major relational database vendors, such as Oracle, MySQL, SQL Server, and others. It is used by different sectors, including government agencies and the public and private sectors. It is cross-platform and runs on most modern operating systems, including Windows, Mac, and Linux flavors. It conforms to SQL standards and is ACID compliant. (For more resources related to this topic, see here.)

An overview of PostgreSQL

PostgreSQL has many rich features. It provides enterprise-level services, including performance and scalability. It has a very supportive community and very good documentation.

The history of PostgreSQL

The name PostgreSQL comes from post-Ingres database. The history of PostgreSQL can be summarized as follows:
Academia: University of California at Berkeley (UC Berkeley)
1977-1985, Ingres project: Michael Stonebraker created an RDBMS according to the formal relational model.
1986-1994, postgres: Michael Stonebraker created postgres in order to support complex data types and the object-relational model.
1995, Postgres95: Andrew Yu and Jolly Chen replaced the postgres query language with an extended subset of SQL.
Industry
1996, PostgreSQL: Several developers dedicated a lot of labor and time to stabilize Postgres95. The first open source version was released on January 29, 1997. With the introduction of new features and enhancements, and with the start of open source projects, the Postgres95 name was changed to PostgreSQL.
PostgreSQL began at version 6, with a very strong starting point thanks to several years of research and development. Being open source with a very good reputation, PostgreSQL has attracted hundreds of developers. Currently, PostgreSQL has innumerable extensions and a very active community.

Advantages of PostgreSQL

PostgreSQL provides many features that attract developers, administrators, architects, and companies.

Business advantages of PostgreSQL

PostgreSQL is free, open source software (OSS); it has been released under the PostgreSQL license, which is similar to the BSD and MIT licenses. The PostgreSQL license is highly permissive, and PostgreSQL is not subject to monopoly and acquisition. This gives a company the following advantages:
There is no licensing cost associated with PostgreSQL, and the number of deployments is unlimited, enabling a more profitable business model.
PostgreSQL is SQL standards compliant, so finding professional developers is not very difficult. PostgreSQL is easy to learn, and porting code from another database vendor to PostgreSQL is cost efficient.
PostgreSQL administrative tasks are easy to automate, so staffing costs are significantly reduced.
PostgreSQL is cross-platform and has drivers for all modern programming languages, so there is no need to change company policy about the software stack in order to use PostgreSQL.
PostgreSQL is scalable and has high performance.
PostgreSQL is very reliable; it rarely crashes. PostgreSQL is also ACID compliant, which means that it can tolerate some hardware failure. In addition, it can be configured and installed as a cluster to ensure high availability (HA).
User advantages of PostgreSQL

PostgreSQL is very attractive for developers, administrators, and architects; it has rich features that enable developers to perform tasks in an agile way. The following are some attractive features for developers:
There is a new release almost every year; until now, starting from Postgres95, there have been 23 major releases.
Very good documentation and an active community enable developers to find and solve problems quickly. The PostgreSQL manual is more than 2,500 pages in length.
A rich extension repository enables developers to focus on the business logic. It also enables developers to meet requirement changes easily.
The source code is available free of charge and can be customized and extended without a huge effort.
Rich client and administrative tools enable developers to perform routine tasks, such as describing database objects, exporting and importing data, and dumping and restoring databases, very quickly.
Database administration tasks do not require a lot of time and can be automated.
PostgreSQL can be integrated easily with other database management systems, giving software architects good flexibility when putting together software designs.

Applications of PostgreSQL

PostgreSQL can be used for a variety of applications. The main PostgreSQL application domains can be classified into two categories:
Online transactional processing (OLTP): OLTP is characterized by a large number of CRUD operations, very fast processing of operations, and maintaining data integrity in a multi-access environment. Performance is measured in the number of transactions per second.
Online analytical processing (OLAP): OLAP is characterized by a small number of requests, complex queries that involve data aggregation, a huge amount of data from different sources and in different formats, data mining, and historical data analysis.
OLTP is used to model business operations, such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database is relatively small in size compared to an OLAP database. OLTP normally follows relational model concepts, such as normalization, when designing the database, while OLAP is less relational and the schema is often star shaped. Unlike OLTP, the main operation of OLAP is data retrieval. OLAP data is often generated by a process called Extract, Transform, and Load (ETL). ETL is used to load data into the OLAP database from different data sources and different formats. PostgreSQL can be used out of the box for OLTP applications. For OLAP, there are many extensions and tools to support it, such as the PostgreSQL COPY command and Foreign Data Wrappers (FDW).

Success stories

PostgreSQL is used in many application domains, including communication, media, geographical, and e-commerce applications. Many companies provide consultation as well as commercial services, such as migrating proprietary RDBMSs to PostgreSQL in order to cut licensing costs. These companies often influence and enhance PostgreSQL by developing and submitting new features. The following are a few companies that use PostgreSQL:
Skype uses PostgreSQL to store user chats and activities. Skype has also affected PostgreSQL by developing many tools, called Skytools.
Instagram is a social networking service that enables its users to share pictures and photos. Instagram has more than 100 million active users.
The American Chemical Society (ACS): More than one terabyte of data for their journal archive is stored using PostgreSQL.
In addition to the preceding list of companies, PostgreSQL is used by HP, VMware, and Heroku. PostgreSQL is used by many scientific communities and organizations, such as NASA, due to its extensibility and rich data types.

Forks

There are more than 20 PostgreSQL forks; PostgreSQL's extensible APIs make postgres a great candidate to fork. Over the years, many groups have forked PostgreSQL and contributed their findings back to it. The following is a list of popular PostgreSQL forks:
HadoopDB is a hybrid between the PostgreSQL RDBMS and MapReduce technologies, targeting analytical workloads.
Greenplum is a proprietary DBMS that was built on the foundation of PostgreSQL. It utilizes shared-nothing and massively parallel processing (MPP) architectures. It is used as a data warehouse and for analytical workloads.
The EnterpriseDB advanced server is a proprietary DBMS that provides Oracle capabilities to cap Oracle fees.
Postgres-XC (eXtensible Cluster) is a multi-master PostgreSQL cluster based on the shared-nothing architecture. It emphasizes write scalability and provides the same APIs to applications that PostgreSQL provides.
Vertica is a column-oriented database system, which was started by Michael Stonebraker in 2005 and acquired by HP in 2011. Vertica reused the SQL parser, semantic analyzer, and standard SQL rewrites from the PostgreSQL implementation.
Netezza is a popular data warehouse appliance solution that was started as a PostgreSQL fork.
Amazon Redshift is a popular data warehouse management system based on PostgreSQL 8.0.2. It is mainly designed for OLAP applications.

The PostgreSQL architecture

PostgreSQL uses the client/server model; the client and server programs could be on different hosts. The communication between the client and server is normally done via TCP/IP protocols or Linux sockets. PostgreSQL can handle multiple connections from a client. A common PostgreSQL setup consists of the following operating system processes:
Client process or program (frontend): The database frontend application performs a database action. The frontend could be a web server that wants to display a web page or a command-line tool used to perform maintenance tasks. PostgreSQL provides frontend tools such as psql, createdb, dropdb, and createuser.
Server process (backend): The server process manages database files, accepts connections from client applications, and performs actions on behalf of the client; the server process name is postgres. PostgreSQL forks a new process for each new connection; thus, the client and server processes communicate with each other without the intervention of the server's main process (postgres), and they have a certain lifetime, determined by accepting and terminating a client connection.

The abstract architecture of PostgreSQL

The aforementioned abstract, conceptual PostgreSQL architecture can give an overview of PostgreSQL's capabilities and its interactions with the client as well as the operating system. The PostgreSQL server can be divided roughly into four subsystems, as follows:
Process manager: The process manager manages client connections, such as the forking and terminating of processes.
Query processor: When a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop determines the query type. A utility query is passed to the utilities subsystem.
Select, insert, update, and delete queries are rewritten by the rewriter, and then an execution plan is generated by the planner; finally, the query is executed, and the result is returned to the client.
Utilities: The utilities subsystem provides the means to maintain the database, such as claiming storage, updating statistics, exporting and importing data in a certain format, and logging.
Storage manager: The storage manager handles the memory cache, disk buffers, and storage allocation.
Almost all PostgreSQL components can be configured, including the logger, planner, statistical analyzer, and storage manager. PostgreSQL configuration is governed by the nature of the application, such as OLAP or OLTP. The following diagram shows the PostgreSQL abstract, conceptual architecture:

PostgreSQL's abstract, conceptual architecture

The PostgreSQL community

PostgreSQL has a very cooperative, active, and organized community. In the last 8 years, the PostgreSQL community published eight major releases. Announcements are brought to developers via the PostgreSQL weekly newsletter. There are dozens of mailing lists organized into categories, such as users, developers, and associations. Examples of user mailing lists are pgsql-general, psql-doc, and psql-bugs. pgsql-general is a very important mailing list for beginners. All non-bug-related questions about PostgreSQL installation, tuning, basic administration, and PostgreSQL features, as well as general discussions, are submitted to this list. The PostgreSQL community runs a blog aggregation service called Planet PostgreSQL—https://planet.postgresql.org/. Several PostgreSQL developers and companies use this service to share their experience and knowledge.

Summary

PostgreSQL is an open source, object-relational database system. It supports many advanced features and complies with the ANSI-SQL:2008 standard. It has won industry recognition and user appreciation. The PostgreSQL slogan, "The world's most advanced open source database," reflects the sophistication of PostgreSQL's features. PostgreSQL is the result of many years of research and collaboration between academia and industry. Companies in their infancy often favor PostgreSQL due to licensing costs, and PostgreSQL can aid profitable business models. PostgreSQL is also favoured by many developers because of its capabilities and advantages.

Resources for Article:
Further resources on this subject:
Introducing PostgreSQL 9 [article]
PostgreSQL – New Features [article]
Installing PostgreSQL [article]


Introducing Bayesian Inference

Packt
14 Sep 2015
29 min read
In this article by Dr. Hari M. Kudovely, the author of Learning Bayesian Models with R, we will look at Bayesian inference in depth. The Bayes theorem is the basis for updating beliefs or model parameter values in Bayesian inference, given the observations. In this article, a more formal treatment of Bayesian inference will be given. To begin with, let us try to understand how uncertainties in a real-world problem are treated in the Bayesian approach. (For more resources related to this topic, see here.)

Bayesian view of uncertainty

Classical or frequentist statistics typically takes the view that any physical process generating data containing noise can be modeled by a stochastic model with fixed values of parameters. The parameter values are learned from the observed data through procedures such as maximum likelihood estimation. The essential idea is to search the parameter space to find the parameter values that maximize the probability of observing the data seen so far. Neither the uncertainty in the estimation of model parameters from data, nor the uncertainty in the model itself that explains the phenomena under study, is dealt with in a formal way. The Bayesian approach, on the other hand, treats all sources of uncertainty using probabilities. Therefore, neither the model used to explain an observed dataset nor its parameters are fixed; both are treated as uncertain variables. Bayesian inference provides a framework to learn the entire distribution of model parameters, not just the values that maximize the probability of observing the given data. The learning can come from both the evidence provided by observed data and domain knowledge from experts. There is also a framework to select the best model among the family of models suited to explain a given dataset. Once we have the distribution of model parameters, we can eliminate the effect of the uncertainty of parameter estimation on the future values of a random variable predicted using the learned model. This is done by averaging over the model parameter values through marginalization of the joint probability distribution.

Consider the joint probability distribution of N random variables again, P(X1, X2, ..., XN, θ | m). This time, we have added one more term, m, to the argument of the probability distribution, in order to indicate explicitly that the parameters θ are generated by the model m. Then, according to the Bayes theorem, the probability distribution of the model parameters conditioned on the observed data X and model m is given by:
P(θ | X, m) = P(X | θ, m) P(θ | m) / P(X | m)
Formally, the term on the LHS of the equation, P(θ | X, m), is called the posterior probability distribution. The second term appearing in the numerator of the RHS, P(θ | m), is called the prior probability distribution. It represents the prior belief about the model parameters, before observing any data, say, from the domain knowledge. Prior distributions can also have parameters, and these are called hyperparameters. The term P(X | θ, m) is the likelihood of model m explaining the observed data. Since the denominator P(X | m) does not depend on θ, it can be considered as a normalization constant. The preceding equation can be rewritten in an iterative form as follows:
P(θ | X1, ..., Xn, m) ∝ P(Xn | θ, m) P(θ | X1, ..., Xn-1, m)
Here, Xn represents the values of observations that are obtained at time step n, P(θ | X1, ..., Xn-1, m) is the parameter distribution updated until time step n - 1, and P(θ | X1, ..., Xn, m) is the model parameter distribution updated after seeing the observations Xn at time step n.
Casting the Bayes theorem in this iterative form is useful for online learning, and it suggests the following:
Model parameters can be learned in an iterative way as more and more data or evidence is obtained.
The posterior distribution estimated using the data seen so far can be treated as a prior model when the next set of observations is obtained.
Even if no data is available, one could make predictions based on a prior distribution created using the domain knowledge alone.
To make these points clear, let's take a simple illustrative example. Consider the case where one is trying to estimate the distribution of the height of males in a given region. The data used for this example is the height measurement in centimeters obtained from M volunteers sampled randomly from the population. We assume that the heights are distributed according to a normal distribution with mean μ and variance σ². As mentioned earlier, in classical statistics, one tries to estimate the values of μ and σ² from the observed data. Apart from the best estimate value for each parameter, one could also determine an error term of the estimate. In the Bayesian approach, on the other hand, μ and σ² are also treated as random variables. Let's, for simplicity, assume that σ² is a known constant. Also, let's assume that the prior distribution for μ is a normal distribution with (hyper)parameters μ0 and σ0². In this case, the expression for the posterior distribution of μ is given by:
P(μ | X, σ²) ∝ N(μ | μ0, σ0²) ∏ N(xi | μ, σ²)
Here, for convenience, we have used the notation X for the set of observations x1, x2, ..., xM. It is a simple exercise to expand the terms in the product and complete the squares in the exponential. Though the resulting expression for the posterior distribution looks complex, it has a very simple interpretation. The posterior distribution is also a normal distribution, with the following mean:
μM = (σ0² / (σ0² + σ²/M)) x̄ + ((σ²/M) / (σ0² + σ²/M)) μ0
Here, x̄ represents the sample mean. The variance is as follows:
σM² = 1 / (1/σ0² + M/σ²)
The posterior mean is a weighted sum of the prior mean μ0 and the sample mean x̄. As the sample size M increases, the weight of the sample mean increases and that of the prior decreases. Similarly, the posterior precision (the inverse of the variance) is the sum of the prior precision 1/σ0² and the precision of the sample mean M/σ²:
1/σM² = 1/σ0² + M/σ²
As M increases, the contribution of precision from the observations (evidence) outweighs that from the prior knowledge.
Let's take a concrete example where we consider an age distribution with population mean 5.5 and population standard deviation 0.5. We sample 10,000 people from this population by using the following R script:
>set.seed(100)
>age_samples <- rnorm(10000,mean = 5.5,sd = 0.5)
We can calculate the posterior mean using the following R function:
>age_mean <- function(n){
   mu0 <- 5
   sd0 <- 1
   mus <- mean(age_samples[1:n])
   sds <- sd(age_samples[1:n])
   mu_n <- (sd0^2/(sd0^2 + sds^2/n)) * mus + (sds^2/n/(sd0^2 + sds^2/n)) * mu0
   mu_n
 }
>samp <- c(25,50,100,200,400,500,1000,2000,5000,10000)
>mu <- sapply(samp,age_mean,simplify = "array")
>plot(samp,mu,type="b",col="blue",ylim=c(5.3,5.7),xlab="no of samples",ylab="estimate of mean")
>abline(5.5,0)
One can see that, as the number of samples increases, the estimated mean asymptotically approaches the population mean. The initial low value is due to the influence of the prior, which is, in this case, 5.0. This simple and intuitive picture of how the prior knowledge and the evidence from observations contribute to the overall model parameter estimate holds in any Bayesian inference. The precise mathematical expression for how they combine would be different.
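As a small extension of the preceding example (this sketch is not part of the original article), the same closed-form update can also return the posterior standard deviation, which shows the posterior uncertainty shrinking as more data arrives. The helper name posterior_normal is hypothetical; the prior hyperparameters mu0 = 5 and sd0 = 1 match the example above, and age_samples is the vector generated earlier:
>posterior_normal <- function(n, mu0 = 5, sd0 = 1){
   xbar <- mean(age_samples[1:n])       # sample mean of the first n observations
   s <- sd(age_samples[1:n])            # sample standard deviation, treated as the known sigma
   w <- sd0^2/(sd0^2 + s^2/n)           # weight given to the evidence
   mu_n <- w * xbar + (1 - w) * mu0     # posterior mean: weighted sum of sample mean and prior mean
   prec_n <- 1/sd0^2 + n/s^2            # posterior precision: prior precision plus data precision
   c(mean = mu_n, sd = sqrt(1/prec_n))  # posterior standard deviation shrinks as n grows
 }
>posterior_normal(100)
>posterior_normal(10000)
Comparing the two calls shows the posterior mean moving towards the sample mean and the posterior standard deviation decreasing, which is the precision argument made above expressed in code.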
Therefore, one could start using a model for prediction with just prior information, either from the domain knowledge or from data collected in the past. Also, as new observations arrive, the model can be updated using the Bayesian scheme.

Choosing the right prior distribution

In the preceding simple example, we saw that when the likelihood function has the form of a normal distribution and the prior distribution is chosen to be normal, the posterior also turns out to be a normal distribution, and we could get a closed-form analytical expression for the posterior mean. Since the posterior is obtained by multiplying the prior and likelihood functions and normalizing by integration over the parameter variables, the form of the prior distribution has a significant influence on the posterior. This section gives some more details about the different types of prior distributions and guidelines as to which ones to use in a given context. There are different ways of classifying prior distributions in a formal way. One of the approaches is based on how much information a prior provides. In this scheme, prior distributions are classified as Informative, Weakly Informative, Least Informative, and Non-informative. Here, we take more of a practitioner's approach and illustrate some of the important classes of prior distributions commonly used in practice.

Non-informative priors

Let's start with the case where we do not have any prior knowledge about the model parameters. In this case, we want to express complete ignorance about the model parameters through a mathematical expression. This is achieved through what are called non-informative priors. For example, in the case of a single random variable x that can take any value between -∞ and ∞, the non-informative prior for its mean μ would be a uniform distribution over the whole range:
P(μ) ∝ constant
Here, the complete ignorance of the parameter value is captured through a uniform distribution function in the parameter space. Note that a uniform distribution is not a proper distribution function, since its integral over the domain is not equal to 1; therefore, it is not normalizable. However, one can use an improper distribution function for the prior as long as, once it is multiplied by the likelihood function, the resulting posterior can be normalized. If the parameter of interest is the variance σ², then by definition it can only take non-negative values. In this case, we transform the variable so that the transformed variable, log(σ²), has a uniform probability in the range from -∞ to ∞. It is easy to show, using simple differential calculus, that the corresponding non-informative distribution function in the original variable σ² would be as follows:
P(σ²) ∝ 1/σ²
Another well-known non-informative prior used in practical applications is the Jeffreys prior, which is named after the British statistician Harold Jeffreys. This prior is invariant under reparametrization of θ and is defined as proportional to the square root of the determinant of the Fisher information matrix:
P(θ) ∝ sqrt(det I(θ))
Here, it is worth discussing the Fisher information matrix a little bit. If X is a random variable distributed according to P(X | θ), we may like to know how much information observations of X carry about the unknown parameter θ. This is what the Fisher information matrix provides. It is defined as the second moment of the score (the first derivative of the logarithm of the likelihood function):
I(θ) = E[ (∂ ln P(X | θ)/∂θ) (∂ ln P(X | θ)/∂θ)^T ]
Let's take a simple two-dimensional problem to understand the Fisher information matrix and the Jeffreys prior. This example is given by Prof. D. Wittman of the University of California.
Let's consider two types of food items: buns and hot dogs. Let's assume that they are generally produced in pairs (a hot dog and bun pair), but occasionally hot dogs are also produced independently in a separate process. There are two observables, the number of hot dogs and the number of buns, and two model parameters: the production rate of pairs and the production rate of hot dogs alone. We assume that the uncertainty in the measurements of the counts of these two food products is distributed according to a normal distribution, with their respective variances. In this case, the Fisher information matrix for this problem can be written down explicitly, and its inverse corresponds to the covariance matrix.

Subjective priors

One of the key strengths of Bayesian statistics compared to classical (frequentist) statistics is that the framework allows one to capture subjective beliefs about any random variables. Usually, people will have intuitive feelings about the minimum, maximum, mean, and most probable or peak values of a random variable. For example, if one is interested in the distribution of hourly temperatures in winter in a tropical country, then people who are familiar with tropical climates, or climatology experts, will have a belief that, in winter, the temperature can go as low as 15°C and as high as 27°C, with the most probable temperature value being 23°C. This can be captured as a prior distribution through the Triangle distribution. The Triangle distribution has three parameters corresponding to a minimum value (a), the most probable value (b), and a maximum value (c). Its mean and variance are given by:
mean = (a + b + c)/3
variance = (a² + b² + c² − ab − ac − bc)/18
One can also use a PERT distribution to represent a subjective belief about the minimum, maximum, and most probable value of a random variable. The PERT distribution is a reparametrized Beta distribution, rescaled to the interval [a, c] with its shape parameters chosen so that the mode is b and the mean is (a + 4b + c)/6. The PERT distribution is commonly used for project completion time analysis, and the name originates from project evaluation and review techniques. Another area where Triangle and PERT distributions are commonly used is risk modeling.
Often, people also have a belief about the relative probabilities of values of a random variable. For example, when studying the distribution of ages in a population such as Japan or some European countries, where there are more old people than young, an expert could give relative weights for the probability of different ages in the population. This can be captured through a relative distribution, specified by a minimum value (min), a maximum value (max), the set of possible observed values ({values}), and their relative weights ({weights}). The weights need not sum to 1.

Conjugate priors

If both the prior and posterior distributions are in the same family of distributions, then they are called conjugate distributions, and the corresponding prior is called a conjugate prior for the likelihood function. Conjugate priors are very helpful for getting analytical closed-form expressions for the posterior distribution. In the simple example we considered, we saw that when the noise is distributed according to a normal distribution, choosing a normal prior for the mean resulted in a normal posterior.
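Another classic example of conjugacy, which also appears in the list that follows, is the Beta prior paired with a Binomial likelihood. As a minimal sketch (not taken from the book, with illustrative numbers), the posterior can be written down directly in R without any numerical integration:
># Beta(a, b) prior on a success probability; observe k successes in n trials.
># The posterior is Beta(a + k, b + n - k).
>a <- 2; b <- 2                            # prior hyperparameters
>n <- 50; k <- 32                          # hypothetical data: 32 successes in 50 trials
>a_post <- a + k
>b_post <- b + n - k
>a_post/(a_post + b_post)                  # posterior mean of the success probability
>qbeta(c(0.025, 0.975), a_post, b_post)    # 95% credible interval from the Beta posterior
This is exactly what conjugacy buys: the posterior stays in the Beta family, so its mean and quantiles are available in closed form.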
The following are some well-known conjugate pairs of likelihood function and prior:
Binomial likelihood (parameter: probability) — Beta prior
Poisson likelihood (parameter: rate) — Gamma prior
Categorical likelihood (parameters: probability, number of categories) — Dirichlet prior
Univariate normal likelihood with known variance (parameter: mean) — Normal prior
Univariate normal likelihood with known mean (parameter: variance) — Inverse Gamma prior

Hierarchical priors

Sometimes, it is useful to define prior distributions for the hyperparameters themselves. This is consistent with the Bayesian view that all parameters should be treated as uncertain by using probabilities. These distributions are called hyper-prior distributions. In theory, one can continue this over many levels as a hierarchical model, and this is one way of eliciting the optimal prior distributions. For example, P(θ | α) is the prior distribution with a hyperparameter α. We could define a prior distribution for α as well, P(α | β). Here, P(α | β) is the hyper-prior distribution for the hyperparameter α, parametrized by the hyper-hyper-parameter β. One can define a prior distribution for β in the same way and continue the process indefinitely. The practical reason for formalizing such models is that, at some level of the hierarchy, one can define a uniform prior for the hyperparameters, reflecting complete ignorance about the parameter distribution, and effectively truncate the hierarchy. In practical situations, this is typically done at the second level; in the preceding example, that corresponds to using a uniform distribution for the hyperparameter at that level.
I want to conclude this section by stressing one important point. Though the prior distribution has a significant role in Bayesian inference, one need not worry about it too much, as long as the prior chosen is reasonable and consistent with the domain knowledge and evidence seen so far. The reasons are that, first of all, as we gather more evidence, the significance of the prior gets washed out. Secondly, when we use Bayesian models for prediction, we average over the uncertainty in the estimation of the parameters using the posterior distribution. This averaging is the key ingredient of Bayesian inference, and it removes many of the ambiguities in the selection of the right prior.

Estimation of posterior distribution

So far, we have discussed the essential concept behind Bayesian inference and also how to choose a prior distribution. Since one needs to compute the posterior distribution of the model parameters before one can use the models for prediction, we discuss this task in this section. Though the Bayes rule has a very simple-looking form, the computation of the posterior distribution in a practically usable way is often very challenging. This is primarily because the computation of the normalization constant P(X | m) involves N-dimensional integrals when there are N parameters. Even when one uses a conjugate prior, this computation can be very difficult to carry out analytically or numerically. This was one of the main reasons for not using Bayesian inference for multivariate modeling until recent decades. In this section, we will look at various approximate ways of computing posterior distributions that are used in practice.

Maximum a posteriori estimation

Maximum a posteriori (MAP) estimation is a point estimate that corresponds to taking the maximum value, or mode, of the posterior distribution.
Though taking a point estimate does not capture the variability in the parameter estimation, it does take into account the effect of the prior distribution to some extent, when compared to maximum likelihood estimation. MAP estimation is also called poor man's Bayesian inference. From the Bayes rule, we have:
θ_MAP = argmax over θ of P(θ | X, m) = argmax over θ of P(X | θ, m) P(θ | m)
Here, for convenience, we have used the notation X for the N-dimensional vector of observations. The last relation follows because the denominator of the RHS of the Bayes rule is independent of θ. Compare this with the following maximum likelihood estimate:
θ_ML = argmax over θ of P(X | θ, m)
The difference between the MAP and ML estimates is that, whereas ML finds the mode of the likelihood function, MAP finds the mode of the product of the likelihood function and the prior.

Laplace approximation

We saw that the MAP estimate just finds the maximum value of the posterior distribution. The Laplace approximation goes one step further and also computes the local curvature around the maximum, up to quadratic terms. This is equivalent to assuming that the posterior distribution is approximately Gaussian (normal) around the maximum, which would be the case if the amount of data were large compared to the number of parameters: M >> N. The curvature is captured by A, an N x N Hessian matrix obtained from the second derivatives of the log of the posterior distribution at the mode. Evaluating the posterior at the MAP value and using the definition of conditional probability, the Laplace approximation yields an expression for the model evidence P(X | m). In the limit of a large number of samples, one can show that this expression simplifies, and the term that remains is called the Bayesian information criterion (BIC). It can be used for model selection or model comparison, and it is one of the goodness-of-fit measures for a statistical model. Another similar criterion that is commonly used is the Akaike information criterion (AIC).
Now we will discuss how BIC can be used to compare different models for model selection (a short R sketch of such a comparison appears just below). In the Bayesian framework, two models, say m1 and m2, are compared using the Bayes factor. The Bayes factor B12 is defined as the ratio of posterior odds to prior odds:
B12 = [P(m1 | X) / P(m2 | X)] / [P(m1) / P(m2)]
Here, the posterior odds is the ratio of the posterior probabilities of the two models given the data, and the prior odds is the ratio of the prior probabilities of the two models, as given in the preceding equation. If B12 > 1, model m1 is preferred by the data, and if B12 < 1, model m2 is preferred by the data. In reality, it is difficult to compute the Bayes factor because it is difficult to get the precise prior probabilities. It can be shown that, in the large-N limit, the difference in the BIC values of the two models can be viewed as a rough approximation to the logarithm of the Bayes factor.

Monte Carlo simulations

The two approximations that we have discussed so far, the MAP and Laplace approximations, are useful when the posterior is a very sharply peaked function about the maximum value. Often, in real-life situations, the posterior will have long tails. This is, for example, the case in e-commerce, where the probability of a user purchasing a product has a long tail in the space of all products. So, in many practical situations, both the MAP and Laplace approximations fail to give good results. Another approach is to directly sample from the posterior distribution. Monte Carlo simulation is a technique for sampling from the posterior distribution and is one of the workhorses of Bayesian inference in practical applications.
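Before turning to sampling-based methods, here is a small sketch (not from the book) of BIC-based model comparison using base R's BIC() function, which follows the convention that a lower BIC is better. The simulated data and model names are illustrative:
>set.seed(1)
>x1 <- rnorm(200); x2 <- rnorm(200)
>y <- 1 + 2*x1 + rnorm(200)                # y depends on x1 only
>m1 <- lm(y ~ x1)                          # the simpler, "true" model
>m2 <- lm(y ~ x1 + x2)                     # adds an irrelevant predictor
>BIC(m1); BIC(m2)                          # the extra parameter in m2 is penalized by log(n)
># Under R's convention, exp((BIC(m2) - BIC(m1))/2) gives a rough approximation
># to the Bayes factor in favour of m1.
Here, m1 should come out with the smaller BIC, illustrating how the penalty term guards against overfitting when comparing models.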
In this section, we will introduce the reader to Markov Chain Monte Carlo (MCMC) simulations and also discuss two common MCMC methods used in practice. As discussed earlier, let θ be the set of parameters that we are interested in estimating from the data through the posterior distribution. Consider the case of the parameters being discrete, where each parameter has K possible values. Set up a Markov process whose states are the possible values of θ, with a suitable transition probability matrix. The essential idea behind MCMC simulations is that one can choose the transition probabilities in such a way that the steady state distribution of the Markov chain corresponds to the posterior distribution we are interested in. Once this is done, sampling from the Markov chain output, after it has reached a steady state, will give samples of θ distributed according to the posterior distribution.
Now, the question is how to set up the Markov process in such a way that its steady state distribution corresponds to the posterior of interest. There are two well-known methods for this. One is the Metropolis-Hastings algorithm and the second is Gibbs sampling. We will discuss both in some detail here.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm was one of the first major algorithms proposed for MCMC. It has a very simple concept—something similar to a hill-climbing algorithm in optimization:
1. Let θt be the state of the system at time step t. To move the system to another state at time step t + 1, generate a candidate state θ' by sampling from a proposal distribution q(θ' | θt). The proposal distribution is chosen in such a way that it is easy to sample from.
2. Accept the proposed move with probability min(1, [P(θ') q(θt | θ')] / [P(θt) q(θ' | θt)]), where P is the posterior distribution that we want to simulate.
3. If the move is accepted, set θt+1 = θ'; if not, set θt+1 = θt.
4. Continue the process until the distribution converges to the steady state.
Under certain conditions, the preceding update rule will guarantee that, in the large time limit, the Markov process approaches a steady state distributed according to the posterior. The intuition behind the Metropolis-Hastings algorithm is simple. The proposal distribution q(θ' | θt) gives the conditional probability of proposing state θ' for the transition in the next time step from the current state θt. Therefore, P(θt) q(θ' | θt) is the probability that the system is currently in state θt and would make a transition to state θ' in the next time step. Similarly, P(θ') q(θt | θ') is the probability that the system is currently in state θ' and would make a transition to state θt in the next time step. If the ratio of these two probabilities is more than 1, accept the move; otherwise, accept the move only with the probability given by the ratio. Therefore, the Metropolis-Hastings algorithm is like a hill-climbing algorithm where one accepts all moves in the upward direction and, once in a while, accepts moves in the downward direction with a smaller probability. The downward moves help the system avoid getting stuck around a single local mode of the posterior.
Let's revisit the example of estimating the posterior distribution of the mean and variance of the height of people in a population, discussed in the introductory section. This time, we will estimate the posterior distribution by using the Metropolis-Hastings algorithm.
The following lines of R code do this job: >set.seed(100) >mu_t <- 5.5 >sd_t <- 0.5 >age_samples <- rnorm(10000,mean = mu_t,sd = sd_t) >#function to compute log likelihood >loglikelihood <- function(x,mu,sigma){ singlell <- dnorm(x,mean = mu,sd = sigma,log = T) sumll <- sum(singlell) sumll } >#function to compute prior distribution for mean on log scale >d_prior_mu <- function(mu){ dnorm(mu,0,10,log=T) } >#function to compute prior distribution for std dev on log scale >d_prior_sigma <- function(sigma){ dunif(sigma,0,5,log=T) } >#function to compute posterior distribution on log scale >d_posterior <- function(x,mu,sigma){ loglikelihood(x,mu,sigma) + d_prior_mu(mu) + d_prior_sigma(sigma) } >#function to make transition moves tran_move <- function(x,dist = .1){ x + rnorm(1,0,dist) } >num_iter <- 10000 >posterior <- array(dim = c(2,num_iter)) >accepted <- array(dim=num_iter - 1) >theta_posterior <-array(dim=c(2,num_iter)) >values_initial <- list(mu = runif(1,4,8),sigma = runif(1,1,5)) >theta_posterior[1,1] <- values_initial$mu >theta_posterior[2,1] <- values_initial$sigma >for (t in 2:num_iter){ #proposed next values for parameters theta_proposed <- c(tran_move(theta_posterior[1,t-1]) ,tran_move(theta_posterior[2,t-1])) p_proposed <- d_posterior(age_samples,mu = theta_proposed[1] ,sigma = theta_proposed[2]) p_prev <-d_posterior(age_samples,mu = theta_posterior[1,t-1] ,sigma = theta_posterior[2,t-1]) eps <- exp(p_proposed - p_prev) # proposal is accepted if posterior density is higher w/ theta_proposed # if posterior density is not higher, it is accepted with probability eps accept <- rbinom(1,1,prob = min(eps,1)) accepted[t - 1] <- accept if (accept == 1){ theta_posterior[,t] <- theta_proposed } else { theta_posterior[,t] <- theta_posterior[,t-1] } } To plot the resulting posterior distribution, we use the sm package in R: >library(sm) x <- cbind(c(theta_posterior[1,1:num_iter]),c(theta_posterior[2,1:num_iter])) xlim <- c(min(x[,1]),max(x[,1])) ylim <- c(min(x[,2]),max(x[,2])) zlim <- c(0,max(1)) sm.density(x, xlab = "mu",ylab="sigma", zlab = " ",zlim = zlim, xlim = xlim ,ylim = ylim,col="white") title("Posterior density")  The resulting posterior distribution will look like the following figure:   Though the Metropolis-Hasting algorithm is simple to implement for any Bayesian inference problem, in practice it may not be very efficient in many cases. The main reason for this is that, unless one carefully chooses a proposal distribution , there would be too many rejections and it would take a large number of updates to reach the steady state. This is particularly the case when the number of parameters are high. There are various modifications of the basic Metropolis-Hasting algorithms that try to overcome these difficulties. We will briefly describe these when we discuss various R packages for the Metropolis-Hasting algorithm in the following section. R packages for the Metropolis-Hasting algorithm There are several contributed packages in R for MCMC simulation using the Metropolis-Hasting algorithm, and here we describe some popular ones. The mcmc package contributed by Charles J. Geyer and Leif T. Johnson is one of the popular packages in R for MCMC simulations. It has the metrop function for running the basic Metropolis-Hasting algorithm. The metrop function uses a multivariate normal distribution as the proposal distribution. Sometimes, it is useful to make a variable transformation to improve the speed of convergence in MCMC. The mcmc package has a function named morph for doing this. 
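If you prefer not to hand-roll the sampler, the same posterior can be passed to the metrop function of the mcmc package mentioned above. The following is only a rough sketch: it assumes the d_posterior function and the age_samples data from the previous listing are already defined, and the starting values and the scale of the proposal are arbitrary choices that would normally need tuning.

library(mcmc)
# log unnormalized posterior as a function of the parameter vector c(mu, sigma)
lupost <- function(theta) {
  if (theta[2] <= 0) return(-Inf)   # reject non-positive standard deviations
  d_posterior(age_samples, mu = theta[1], sigma = theta[2])
}
out <- metrop(lupost, initial = c(5, 1), nbatch = 10000, scale = 0.05)
out$accept        # acceptance rate; adjust 'scale' if it is too low or too high
head(out$batch)   # each row is one posterior sample of (mu, sigma)

Compared to the manual loop above, metrop takes care of the accept/reject bookkeeping and reports the acceptance rate, which makes it easier to tune the proposal scale.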
Combining the metrop and morph functions, morph.metrop first transforms the variable, does a Metropolis step on the transformed density, and converts the results back to the original variable. Apart from the mcmc package, two other useful packages in R are MHadaptive, contributed by Corey Chivers, and the Evolutionary Monte Carlo (EMC) algorithm package by Gopi Goswami.

Gibbs sampling

As mentioned before, the Metropolis-Hastings algorithm suffers from the drawback of poor convergence, due to too many rejections, if one does not choose a good proposal distribution. To avoid this problem, two physicists, Stuart Geman and Donald Geman, proposed a new algorithm. This algorithm is called Gibbs sampling and it is named after the famous physicist J W Gibbs. Currently, Gibbs sampling is the workhorse of MCMC for Bayesian inference. Let θ = {θ1, θ2, ..., θN} be the set of parameters of the model that we wish to estimate:

1. Start with an initial state θ(0) = {θ1(0), θ2(0), ..., θN(0)}.
2. At each time step t, update the components one by one, by drawing from a distribution conditional on the most recent values of the rest of the components:
θ1(t+1) is drawn from P(θ1 | θ2(t), θ3(t), ..., θN(t), X)
θ2(t+1) is drawn from P(θ2 | θ1(t+1), θ3(t), ..., θN(t), X)
...
θN(t+1) is drawn from P(θN | θ1(t+1), θ2(t+1), ..., θN-1(t+1), X)
3. After N steps, all the components of the parameter will be updated.
4. Continue with step 2 until the Markov process converges to a steady state.

Gibbs sampling is a very efficient algorithm since there are no rejections. However, to be able to use Gibbs sampling, the form of the conditional distributions of the posterior distribution should be known; a small illustrative sketch is given at the end of this section.

R packages for Gibbs sampling

Unfortunately, there are not many contributed general-purpose Gibbs sampling packages in R. The gibbs.met package provides two generic functions for performing MCMC in a naïve way for a user-defined target distribution. The first function is gibbs_met. This performs Gibbs sampling with each 1-dimensional distribution sampled by using the Metropolis algorithm, with a normal distribution as the proposal distribution. The second function, met_gaussian, updates the whole state with an independent normal distribution centered around the previous state. The gibbs.met package is useful for general-purpose MCMC on moderate dimensional problems.

Apart from the general-purpose MCMC packages, there are several packages in R designed to solve particular types of machine-learning problems. The GibbsACOV package can be used for one-way mixed-effects ANOVA and ANCOVA models. The lda package performs collapsed Gibbs sampling methods for topic (LDA) models. The stocc package fits a spatial occupancy model via Gibbs sampling. The binomlogit package implements an efficient MCMC for Binomial Logit models. Bmk is a package for doing diagnostics of MCMC output. Bayesian Output Analysis Program (BOA) is another similar package. RBugs is an interface to the well-known OpenBUGS MCMC package. The ggmcmc package is a graphical tool for analyzing MCMC simulation. MCMCglmm is a package for generalized linear mixed models and BoomSpikeSlab is a package for doing MCMC for Spike and Slab regression. Finally, SamplerCompare is a package (more of a framework) for comparing the performance of various MCMC packages.
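To make the update scheme concrete, here is a minimal Gibbs sampler sketch for the same kind of normal model of heights. It is not taken from the original example: to keep the conditional distributions in closed form, it assumes a flat prior on the mean and an Inverse-Gamma prior on the variance, which differ from the priors used in the Metropolis-Hastings listing above.

set.seed(100)
x <- rnorm(10000, mean = 5.5, sd = 0.5)   # simulated height data, as before
n <- length(x)

num_iter <- 5000
mu_s    <- numeric(num_iter)   # posterior samples of the mean
sigma_s <- numeric(num_iter)   # posterior samples of the standard deviation
mu_s[1] <- mean(x); sigma_s[1] <- sd(x)

# assumed priors: flat prior on mu, Inverse-Gamma(a0, b0) prior on the variance
a0 <- 0.01; b0 <- 0.01

for (t in 2:num_iter) {
  # 1. draw mu conditional on the current variance: Normal(mean(x), sigma^2 / n)
  mu_s[t] <- rnorm(1, mean = mean(x), sd = sigma_s[t - 1] / sqrt(n))
  # 2. draw the variance conditional on the new mu: Inverse-Gamma(a0 + n/2, b0 + SS/2)
  ss <- sum((x - mu_s[t])^2)
  var_t <- 1 / rgamma(1, shape = a0 + n / 2, rate = b0 + ss / 2)
  sigma_s[t] <- sqrt(var_t)
}

# posterior means after discarding a burn-in (both should be close to 5.5 and 0.5)
c(mean(mu_s[-(1:500)]), mean(sigma_s[-(1:500)]))

Notice that every draw is accepted as-is; the price is that the two conditional distributions had to be worked out analytically beforehand, which is exactly the limitation mentioned above.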
Variational approximation

In the variational approximation scheme, one assumes that the posterior distribution P(θ|X) can be approximated in a factorized form:

P(θ|X) ≈ Q(θ|X) = Q1(θ1|X) Q2(θ2|X) ... QN(θN|X)

Note that the factorized form is also a conditional distribution, so each Qi(θi|X) can have dependence on the other θj's through the conditioned variable X. In other words, this is not a trivial factorization making each parameter independent.

The advantage of this factorization is that one can choose more analytically tractable forms of the distribution functions Qi(θi|X). In fact, one can vary the functions Qi in such a way that the product Q(θ|X) is as close to the true posterior P(θ|X) as possible. This is mathematically formulated as a variational calculus problem, as explained here.

Let's use some measure to compute the distance between the two probability distributions Q(θ|X) and P(θ|X), where θ = {θ1, θ2, ..., θN}. One of the standard measures of distance between probability distributions is the Kullback-Leibler divergence, or KL-divergence for short. It is defined as follows:

KL(Q||P) = ∫ Q(θ|X) ln [Q(θ|X) / P(θ|X)] dθ

The reason why it is called a divergence and not a distance is that KL(Q||P) is not symmetric with respect to Q and P. One can use the relation P(θ|X) = P(X, θ)/P(X) and rewrite the preceding expression as an equation for log P(X):

ln P(X) = L(Q) + KL(Q||P)

Here:

L(Q) = ∫ Q(θ|X) ln [P(X, θ) / Q(θ|X)] dθ

Note that, in the equation for ln P(X), there is no dependence on Q on the LHS. Therefore, maximizing L(Q) with respect to Q will minimize KL(Q||P), since their sum is a term independent of Q. By choosing analytically tractable functions for Q, one can do this maximization in practice. It will result in both an approximation for the posterior and a lower bound for ln P(X), which is the logarithm of evidence or marginal likelihood, since KL(Q||P) >= 0. Therefore, variational approximation gives us two quantities in one shot. A posterior distribution can be used to make predictions about future observations (as explained in the next section) and a lower bound for evidence can be used for model selection.

How does one implement this minimization of KL-divergence in practice? Without going into mathematical details, here we write a final expression for the solution:

ln Qi(θi|X) = E[ln P(X, θ)] + constant, where the expectation is taken over all θj with j ≠ i

Here, E[ln P(X, θ)] implies that the expectation of the logarithm of the joint distribution P(X, θ) is taken over all the parameters θj except for θi. Therefore, the minimization of KL-divergence leads to a set of coupled equations, one for each Qi, that needs to be solved self-consistently to obtain the final solution.

Though the variational approximation looks very complex mathematically, it has a very simple, intuitive explanation. The posterior distribution of each parameter θi is obtained by averaging the log of the joint distribution over all the other variables. This is analogous to the Mean Field theory in physics where, if there are N interacting charged particles, the system can be approximated by saying that each particle is in a constant external field, which is the average of the fields produced by all the other particles.

We will end this section by mentioning a few R packages for variational approximation. The VBmix package can be used for variational approximation in Bayesian mixture models. A similar package is vbdm, used for Bayesian discrete mixture models. The package vbsr is used for variational inference in Spike Regression Regularized Linear Models.

Prediction of future observations

Once we have the posterior distribution inferred from data using some of the methods described already, it can be used to predict future observations. The probability of observing a value Y, given the observed data X and the posterior distribution of the parameters θ, is given by:

P(Y|X) = ∫ P(Y|θ) P(θ|X) dθ

Note that, in this expression, the likelihood function P(Y|θ) is averaged by using the distribution of the parameter given by the posterior P(θ|X). This is, in fact, the core strength of the Bayesian inference. This Bayesian averaging eliminates the uncertainty in estimating the parameter values and makes the prediction more robust.
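As a quick illustration of this Bayesian averaging, the posterior samples produced by the Metropolis-Hastings run earlier in this article can be turned into a posterior predictive distribution for a future observation. The sketch below assumes that the theta_posterior matrix and num_iter from that listing are still in memory; the burn-in length of 1,000 iterations is an arbitrary choice.

# discard burn-in samples, keep draws of (mu, sigma) from the posterior
burn_in <- 1000
mu_draws    <- theta_posterior[1, (burn_in + 1):num_iter]
sigma_draws <- theta_posterior[2, (burn_in + 1):num_iter]
# for each posterior draw of the parameters, simulate one future observation Y
y_pred <- rnorm(length(mu_draws), mean = mu_draws, sd = sigma_draws)
# posterior predictive mean and a 95% predictive interval
mean(y_pred)
quantile(y_pred, probs = c(0.025, 0.975))

Each simulated Y uses a different (mu, sigma) pair, so the spread of y_pred reflects both the sampling noise of the model and the remaining uncertainty in the parameters, which is what the integral above expresses.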
Summary

In this article, we covered the basic principles of Bayesian inference. Starting with how uncertainty is treated differently in Bayesian statistics compared to classical statistics, we discussed the various components of Bayes' rule in depth. First, we learned about the different types of prior distributions and how to choose the right one for your problem. Then we learned how to estimate the posterior distribution using techniques such as MAP estimation, Laplace approximation, and MCMC simulations.

Resources for Article:

Further resources on this subject:

Bayesian Network Fundamentals [article]
Learning Data Analytics with R and Hadoop [article]
First steps with R [article]
Understanding Model-based Clustering

Packt
14 Sep 2015
10 min read
 In this article by Ashish Gupta, author of the book, Rapid – Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article: Learning model-based clustering Understanding Dirichlet clustering Understanding topic modeling (For more resources related to this topic, see here.) Learning model-based clustering In model-based clustering, we assume that data is generated by a model and try to get the model from the data. The right model will fit the data better than other models. In the K-means algorithm, we provide the initial set of cluster, and K-means provides us with the data points in the clusters. Think about a case where clusters are not distributed normally, then the improvement of a cluster will not be good using K-means. In this scenario, the model-based clustering algorithm will do the job. Another idea you can think of when dividing the clusters is—hierarchical clustering—and we need to find out the overlapping information. This situation will also be covered by model-based clustering algorithms. If all components are not well separated, a cluster can consist of multiple mixture components. In simple terms, in model-based clustering, data is a mixture of two or more components. Each component has an associated probability and is described by a density function. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by an EM (expectation-maximization) algorithm for maximum likelihood. The generated models are compared by a Bayesian Information criterion (BIC). The model with the lowest BIC is preferred. In the equation BIC = -2 log(L) + mlog(n), L is the likelihood function and m is the number of free parameters to be estimated. n is the number of data points. Understanding Dirichlet clustering Dirichlet clustering is a model-based clustering method. This algorithm is used to understand the data and cluster the data. Dirichlet clustering is a process of nonparametric and Bayesian modeling. It is nonparametric because it can have infinite number of parameters. Dirichlet clustering is based on Dirichlet distribution. For this algorithm, we have a probabilistic mixture of a number of models that are used to explain data. Each data point will be coming from one of the available models. The models are taken from the sample of a prior distribution of models, and points are assigned to these models iteratively. In each iteration probability, a point generated by a particular model is calculated. After the points are assigned to a model, new parameters for each of the model are sampled. This sample is from the posterior distribution of the model parameters, and it considers all the observed data points assigned to the model. This sampling provides more information than normal clustering listed as follows: As we are assigning points to different models, we can find out how many models are supported by the data. The other information that we can get is how well the data is described by a model and how two points are explained by the same model. Topic modeling In machine learning, topic modeling is nothing but finding out a topic from the text document using a statistical model. A document on particular topics has some particular words. 
For example, if you are reading an article on sports, there are high chances that you will get words such as football, baseball, Formula One and Olympics. So a topic model actually uncovers the hidden sense of the article or a document. Topic models are nothing but the algorithms that can discover the main themes from a large set of unstructured document. It uncovers the semantic structure of the text. Topic modeling enables us to organize large scale electronic archives. Mahout has the implementation of one of the topic modeling algorithms—Latent Dirichlet Allocation (LDA). LDA is a statistical model of document collection that tries to capture the intuition of the documents. In normal clustering algorithms, if words having the same meaning don't occur together, then the algorithm will not associate them, but LDA can find out which two words are used in similar context, and LDA is better than other algorithms in finding out the association in this way. LDA is a generative, probabilistic model. It is generative because the model is tweaked to fit the data, and using the parameters of the model, we can generate the data on which it fits. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, a LDA model can be represented as follows: The notation used in this image represents the following: M, N, and K represent the number of documents, the number of words in the document, and the number of topics in the document respectively. is the prior weight of the K topic in a document. is the prior weight of the w word in a topic. φ is the probability of a word occurring in a topic. Θ is the topic distribution. z is the identity of a topic of all the words in all the documents. w is the identity of all the words in all the documents. How LDA works in a map-reduce mode? So these are the steps that LDA follows in mapper and reducer steps: Mapper phase: The program starts with an empty topic model. All the documents are read by different mappers. The probabilities of each topic for each word in the document are calculated. Reducer Phase: The reducer receives the count of probabilities. These counts are summed and the model is normalized. This process is iterative, and in each iteration the sum of the probabilities is calculated and the process stops when it stops changing. A parameter set, which is similar to the convergence threshold in K-means, is set to check the changes. In the end, LDA estimates how well the model fits the data. In Mahout, the Collapsed Variation Bayes (CVB) algorithm is implemented for LDA. LDA uses a term frequency vector as an input and not tf-idf vectors. We need to take care of the two parameters while running the LDA algorithm—the number of topics and the number of words in the documents. A higher number of topics will provide very low level topics while a lower number will provide a generalized topic at high level, such as sports. In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization of hierarchical Bayesian models. An expectation step reads each document and calculates the probability of each topic for each word in every document. The maximization step takes the counts and sums all the probabilities and normalizes them. Running LDA using Mahout To run LDA using Mahout, we will use the 20 Newsgroups dataset. 
We will convert the corpus to vectors, run LDA on these vectors, and get the resultant topics. Let's run this example to view how topic modeling works in Mahout.

Dataset selection

We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

Perform the following steps to execute the CVB algorithm:

1. Create a 20newsdata directory and unzip the data here:
mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz
There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train.
2. Now, create another 20newsdataall directory and merge both the training and test data of the group. Move to the home directory and execute the following commands:
mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
3. Create a directory in Hadoop and save this data in HDFS:
hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata
4. Mahout CVB will accept the data in the vector format. For this, first generate a sequence file from the directory as follows:
bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out
5. Convert the sequence file to sparse vectors but, as discussed earlier, using the term frequency weight:
bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf
6. Convert the sparse vectors to the matrix form required by the CVB algorithm:
bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix
7. Now run the CVB algorithm on this matrix:
bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel
The parameters used in the preceding command can be explained as follows:
-i: This is the input path of the document vectors
-o: This is the output path of the topic-term distribution
-k: This is the number of latent topics
-x: This is the maximum number of iterations
-dict: This is the term dictionary file
-dt: This is the output path of the document-topic distribution
-mt: This is the model state path after each iteration
Once the command finishes, you will see the job information on the screen.
8. To view the output, run the following command:
bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt
The parameters used in the preceding command can be explained as follows:
-i: This is the input location of the CVB output
-d: This is the dictionary file location created during vector creation
-dt: This is the dictionary file type (sequence or text)
-vs: This is the vector size
-sort: This is a flag (true or false) to sort the output
-o: This is the output location on the local filesystem
Now your output will be saved in the local filesystem. Open the file and you will see the terms of each topic along with their probabilities.

Summary

In this article, we learned about model-based clustering, the Dirichlet process, and topic modeling. In model-based clustering, we tried to obtain the model from the data, while the Dirichlet process is used to understand the data.
Topic modeling helps us to identify the topics in an article or in a set of documents. We discussed how Mahout has implemented topic modeling using the latent Dirichlet process and how it is implemented in map reduce. We discussed how to use Mahout to find out the topic distribution on a set of documents. Resources for Article: Further resources on this subject: Learning Random Forest Using Mahout[article] Implementing the Naïve Bayes classifier in Mahout[article] Clustering [article]
Sabermetrics with Apache Spark

Packt
09 Sep 2015
22 min read
 In this article by Rindra Ramamonjison, the author of the book called Apache Spark Graph Processing, we will gain useful insights that are required to quickly process big data, and handle its complexities. It is not the secret analytics that have made a big impact in sports. The quest for an objective understanding of the game has a name even—"sabermetrics". Analytics has proven invaluable in many aspects, from building dream teams under tight cap constraints, to selecting game-specific strategies, to actively engaging with fans, and so on. In the following sections, we will analyze NCAA Men's college basketball game stats, gathered during a single season. As sports data experts, we are going to leverage Spark's graph processing library to answer several questions for retrospection. Apache Spark is a fast, general-purpose technology, which greatly simplifies the parallel processing of large data that is distributed over a computing cluster. While Spark handles different types of processing, here, we will focus on its graph-processing capability. In particular, our goal is to expose the powerful yet generic graph-aggregation operator of Spark—aggregateMessages. We can think of this operator as a version of MapReduce for aggregating the neighborhood information in graphs. In fact, many graph-processing algorithms, such as PageRank rely on iteratively accessing the properties of neighboring vertices and adjacent edges. By applying aggregateMessages on the NCAA College Basketball datasets, we will: Identify the basic mechanisms and understand the patterns for using aggregateMessages Apply aggregateMessages to create custom graph aggregation operations Optimize the performance and efficiency of aggregateMessages (For more resources related to this topic, see here.) NCAA College Basketball datasets As an illustrative example, the NCAA College Basketball datasets consist of two CSV datasets. This first one called teams.csv contains the list of all the college teams that played in NCAA Division I competition. Each team is associated with a 4-digit ID number. The second dataset called stats.csv contains the score and statistics of every game played during the 2014-2015 regular season. Loading team data into RDDs To start with, we parse and load these datasets into RDDs (Resilient Distributed Datasets), which are the core Spark abstraction for any data that is distributed and stored over a cluster. 
First, we create a class called GameStats that records a team's statistics during a game: case class GameStats( val score: Int, val fieldGoalMade: Int, val fieldGoalAttempt: Int, val threePointerMade: Int, val threePointerAttempt: Int, val threeThrowsMade: Int, val threeThrowsAttempt: Int, val offensiveRebound: Int, val defensiveRebound: Int, val assist: Int, val turnOver: Int, val steal: Int, val block: Int, val personalFoul: Int ) Loading game stats into RDDs We also add the following methods to GameStats in order to know how efficient a team's offense was: // Field Goal percentage def fgPercent: Double = 100.0 * fieldGoalMade / fieldGoalAttempt // Three Point percentage def tpPercent: Double = 100.0 * threePointerMade / threePointerAttempt // Free throws percentage def ftPercent: Double = 100.0 * threeThrowsMade / threeThrowsAttempt override def toString: String = "Score: " + score Next, we create a couple of classes for the games' result: abstract class GameResult( val season: Int, val day: Int, val loc: String ) case class FullResult( override val season: Int, override val day: Int, override val loc: String, val winnerStats: GameStats, val loserStats: GameStats ) extends GameResult(season, day, loc) FullResult has the year and day of the season, the location where the game was played, and the game statistics of both the winning and losing teams. Next, we will create a statistics graph of the regular seasons. In this graph, the nodes are the teams, whereas each edge corresponds to a specific game. To create the graph, let's parse the CSV file called teams.csv into the RDD teams: val teams: RDD[(VertexId, String)] = sc.textFile("./data/teams.csv"). filter(! _.startsWith("#")). map {line => val row = line split ',' (row(0).toInt, row(1)) } We can check the first few teams in this new RDD: scala> teams.take(3).foreach{println} (1101,Abilene Chr) (1102,Air Force) (1103,Akron) We do the same thing to obtain an RDD of the game results, which will have a type called RDD[Edge[FullResult]]. We just parse stats.csv, and record the fields that we need: The ID of the winning team The ID of the losing team The game statistics of both the teams val detailedStats: RDD[Edge[FullResult]] = sc.textFile("./data/stats.csv"). filter(! _.startsWith("#")). map {line => val row = line split ',' Edge(row(2).toInt, row(4).toInt, FullResult( row(0).toInt, row(1).toInt, row(6), GameStats( score = row(3).toInt, fieldGoalMade = row(8).toInt, fieldGoalAttempt = row(9).toInt, threePointerMade = row(10).toInt, threePointerAttempt = row(11).toInt, threeThrowsMade = row(12).toInt, threeThrowsAttempt = row(13).toInt, offensiveRebound = row(14).toInt, defensiveRebound = row(15).toInt, assist = row(16).toInt, turnOver = row(17).toInt, steal = row(18).toInt, block = row(19).toInt, personalFoul = row(20).toInt ), GameStats( score = row(5).toInt, fieldGoalMade = row(21).toInt, fieldGoalAttempt = row(22).toInt, threePointerMade = row(23).toInt, threePointerAttempt = row(24).toInt, threeThrowsMade = row(25).toInt, threeThrowsAttempt = row(26).toInt, offensiveRebound = row(27).toInt, defensiveRebound = row(28).toInt, assist = row(20).toInt, turnOver = row(30).toInt, steal = row(31).toInt, block = row(32).toInt, personalFoul = row(33).toInt ) ) ) } We can avoid typing all this by using the nice spark-csv package that reads CSV files into SchemaRDD. 
Let's check what we got: scala> detailedStats.take(3).foreach(println) Edge(1165,1384,FullResult(2006,8,N,Score: 75-54)) Edge(1393,1126,FullResult(2006,8,H,Score: 68-37)) Edge(1107,1324,FullResult(2006,9,N,Score: 90-73)) We then create our score graph using the collection of teams (of the type called RDD[(VertexId, String)]) as vertices, and the collection called detailedStats (of the type called RDD[(VertexId, String)]) as edges: scala> val scoreGraph = Graph(teams, detailedStats) For curiosity, let's see which team has won against the 2015 NCAA national champ Duke during the regular season. It seems Duke has lost only four games during the regular season: scala> scoreGraph.triplets.filter(_.dstAttr == "Duke").foreach(println)((1274,Miami FL),(1181,Duke),FullResult(2015,71,A,Score: 90-74)) ((1301,NC State),(1181,Duke),FullResult(2015,69,H,Score: 87-75)) ((1323,Notre Dame),(1181,Duke),FullResult(2015,86,H,Score: 77-73)) ((1323,Notre Dame),(1181,Duke),FullResult(2015,130,N,Score: 74-64)) Aggregating game stats After we have our graph ready, let's start aggregating the stats data in scoreGraph. In Spark, aggregateMessages is the operator for such a kind of jobs. For example, let's find out the average field goals made per game by the winners. In other words, the games that a team has lost will not be counted. To get the average for each team, we first need to have the number of games won by the team, and the total field goals that the team made in these games: // Aggregate the total field goals made by winning teams type Msg = (Int, Int) type Context = EdgeContext[String, FullResult, Msg] val winningFieldGoalMade: VertexRDD[Msg] = scoreGraph aggregateMessages( // sendMsg (ec: Context) => ec.sendToSrc(1, ec.attr.winnerStats.fieldGoalMade), // mergeMsg (x: Msg, y: Msg) => (x._1 + y._1, x._2+ y._2) ) The aggregateMessage operator There is a lot going on in the previous call to aggregateMessages. So, let's see it working in slow motion. When we called aggregateMessages on the scoreGraph, we had to pass two functions as arguments. SendMsg The first function has a signature called EdgeContext[VD, ED, Msg] => Unit. It takes an EdgeContext as input. Since it does not return anything, its return type is Unit. This function is needed for sending message between the nodes. Okay, but what is the EdgeContext type? EdgeContext represents an edge along with its neighboring nodes. It can access both the edge attribute, and the source and destination nodes' attributes. In addition, EdgeContext has two methods to send messages along the edge to its source node, or to its destination node. These methods are called sendToSrc and sendToDst respectively. Then, the type of messages being sent through the graph is defined by Msg. Similar to vertex and edge types, we can define the concrete type that Msg takes as we wish. Merge In addition to sendMsg, the second function that we need to pass to aggregateMessages is a mergeMsg function with the (Msg, Msg) => Msg signature. As its name implies, mergeMsg is used to merge two messages, received at each node into a new one. Its output must also be of the Msg type. Using these two functions, aggregateMessages returns the aggregated messages inside VertexRDD[Msg]. Example In our example, we need to aggregate the number of games played and the number of field goals made. Therefore, Msg is simply a pair of Int. Furthermore, each edge context needs to send a message to only its source node, that is, the winning team. 
This is because we want to compute the total field goals made by each team for only the games that it has won. The actual message sent to each "winner" node is the pair of integers (1, ec.attr.winnerStats.fieldGoalMade). Here, 1 serves as a counter for the number of games won by the source node. The second integer, which is the number of field goals in one game, is extracted from the edge attribute. As we set out to compute the average field goals per winning game for all teams, we need to apply the mapValues operator to the output of aggregateMessages, which is as follows: // Average field goals made per Game by the winning teams val avgWinningFieldGoalMade: VertexRDD[Double] = winningFieldGoalMade mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Int) => total.toDouble/count }) Here is the output: scala> avgWinningFieldGoalMade.take(5).foreach(println) (1260,24.71641791044776) (1410,23.56578947368421) (1426,26.239436619718308) (1166,26.137614678899084) (1434,25.34285714285714) Abstracting out the aggregation This was kind of cool! We can surely do the same thing for the average points per game scored by the winning teams: // Aggregate the points scored by winning teams val winnerTotalPoints: VertexRDD[(Int, Int)] = scoreGraph.aggregateMessages( // sendMsg triplet => triplet.sendToSrc(1, triplet.attr.winnerStats.score), // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) // Average field goals made per Game by winning teams var winnersPPG: VertexRDD[Double] = winnerTotalPoints mapValues ( (id: VertexId, x: (Int, Int)) => x match { case (count: Int, total: Int) => total.toDouble/count }) Let's check the output: scala> winnersPPG.take(5).foreach(println) (1260,71.19402985074628) (1410,71.11842105263158) (1426,76.30281690140845) (1166,76.89449541284404) (1434,74.28571428571429) What if the coach wants to know the top five teams with the highest average three pointers made per winning game? By the way, he might also ask about the teams that are the most efficient in three pointers. Keeping things DRY We can copy and modify the previous code, but that would be quite repetitive. Instead, let's abstract out the average aggregation operator so that it can work on any statistics that the coach needs. Luckily, Scala's higher-order functions are there to help in this task. Let's define the functions that take a team's GameStats as an input, and return specific statistic that we are interested in. 
For now, we will need the number of three pointer made, and the average three pointer percentage: // Getting individual stats def threePointMade(stats: GameStats) = stats.threePointerMade def threePointPercent(stats: GameStats) = stats.tpPercent Then, we create a generic function that takes as an input a stats graph, and one of the functions defined previously, which has a signature called GameStats => Double: // Generic function for stats averaging def averageWinnerStat(graph: Graph[String, FullResult])(getStat: GameStats => Double): VertexRDD[Double] = { type Msg = (Int, Double) val winningScore: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg triplet => triplet.sendToSrc(1, getStat(triplet.attr.winnerStats)), // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) winningScore mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double) => total/count }) } Now, we can get the average stats by passing the threePointMade and threePointPercent to averageWinnerStat functions: val winnersThreePointMade = averageWinnerStat(scoreGraph)(threePointMade) val winnersThreePointPercent = averageWinnerStat(scoreGraph)(threePointPercent) With little efforts, we can tell the coach which five winning teams score the highest number of threes per game: scala> winnersThreePointMade.sortBy(_._2,false).take(5).foreach(println) (1440,11.274336283185841) (1125,9.521929824561404) (1407,9.008849557522124) (1172,8.967441860465117) (1248,8.915384615384616) While we are at it, let's find out the five most efficient teams in three pointers: scala> winnersThreePointPercent.sortBy(_._2,false).take(5).foreach(println) (1101,46.90555728464225) (1147,44.224282479431224) (1294,43.754532434101534) (1339,43.52308905887638) (1176,43.080814169045105) Interestingly, the teams that made the most three pointers per winning game are not always the one who are the most efficient ones at it. But it is okay because at least they have won these games. Coach wants more numbers The coach seems to argue against this argument. He asks us to get the same statistics, but he wants the average over all the games that each team has played. We then have to aggregate the information at all the nodes, and not only at the destination nodes. To make our previous abstraction more flexible, let's create the following types: trait Teams case class Winners extends Teams case class Losers extends Teams case class AllTeams extends Teams We modify the previous higher-order function to have an extra argument called Teams, which will help us specify those nodes where we want to collect and aggregate the required game stats. The new function becomes as the following: def averageStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, tms: Teams): VertexRDD[Double] = { type Msg = (Int, Double) val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg tms match { case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.winnerStats))) case _ : Losers => t => t.sendToDst((1, getStat(t.attr.loserStats))) case _ => t => { t.sendToSrc((1, getStat(t.attr.winnerStats))) t.sendToDst((1, getStat(t.attr.loserStats))) } } , // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2) ) aggrStats mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double) => total/count }) } Now, aggregateStat allows us to choose if we want to aggregate the stats for winners only, for losers only, or for the all teams. 
Since the coach wants the overall stats averaged over all the games played, we aggregate the stats by passing the AllTeams() flag in aggregateStat. In this case, we define the sendMsg argument in aggregateMessages to send the required stats to both source (the winner) and destination (the loser) using the EdgeContext class's sendToSrc and sendToDst functions respectively. This mechanism is pretty straightforward. We just need to make sure that we send the right information to the right node. In this case, we send winnerStats to the winner, and loserStatsto the loser. Okay, you get the idea now. So, let's apply it to please our coach. Here are the teams with the overall highest three pointers per page: // Average Three Point Made Per Game for All Teams val allThreePointMade = averageStat(scoreGraph)(threePointMade, AllTeams()) scala> allThreePointMade.sortBy(_._2, false).take(5).foreach(println) (1440,10.180811808118081) (1125,9.098412698412698) (1172,8.575657894736842) (1184,8.428571428571429) (1407,8.411149825783973) And here are the five most efficient teams overall in three pointers per game: // Average Three Point Percent for All Teams val allThreePointPercent = averageStat(scoreGraph)(threePointPercent, AllTeams()) Let's check the output: scala> allThreePointPercent.sortBy(_._2,false).take(5).foreach(println) (1429,38.8351815824302) (1323,38.522819895594) (1181,38.43052051444854) (1294,38.41227053353959) (1101,38.097896464168954) Actually, there is only a 2 percent difference between the most efficient team and the one in the fiftieth position. Most NCAA teams are therefore pretty efficient behind the line. I bet coach knew this already! Average points per game We can also reuse the averageStat function to get the average points per game for the winners. In particular, let's take a look at the two teams that won games with the highest and lowest scores: // Winning teams val winnerAvgPPG = averageStat(scoreGraph)(score, Winners()) Let's check the output: scala> winnerAvgPPG.max()(Ordering.by(_._2)) res36: (org.apache.spark.graphx.VertexId, Double) = (1322,90.73333333333333) scala> winnerAvgPPG.min()(Ordering.by(_._2)) res39: (org.apache.spark.graphx.VertexId, Double) = (1197,60.5) Apparently, the most defensive team can win game by scoring only 60 points, whereas the most offensive team can score an average of 90 points. Next, let's average the points per game for all games played and look at the two teams with the best and worst offense during the 2015 season: // Average Points Per Game of All Teams val allAvgPPG = averageStat(scoreGraph)(score, AllTeams()) Let's see the output: scala> allAvgPPG.max()(Ordering.by(_._2)) res42: (org.apache.spark.graphx.VertexId, Double) = (1322,83.81481481481481) scala> allAvgPPG.min()(Ordering.by(_._2)) res43: (org.apache.spark.graphx.VertexId, Double) = (1212,51.111111111111114) To no one's surprise, the best offensive team is the same as the one who scores the most in winning games. To win the games, 50 points are not enough in an average for a team to win the games. Defense stats – the D matters as in direction Previously, we obtained some statistics such as field goals or a three-point percentage that a team achieves. What if we want to aggregate instead the average points or rebounds that each team concedes to their opponents? To compute this, we define a new higher-order function called averageConcededStat. Compared to averageStat, this function needs to send loserStats to the winning team, and the winnerStats function to the losing team. 
To make things more interesting, we are going to make the team name as a part of the message Msg: def averageConcededStat(graph: Graph[String, FullResult])(getStat: GameStats => Double, rxs: Teams): VertexRDD[(String, Double)] = { type Msg = (Int, Double, String) val aggrStats: VertexRDD[Msg] = graph.aggregateMessages[Msg]( // sendMsg rxs match { case _ : Winners => t => t.sendToSrc((1, getStat(t.attr.loserStats), t.srcAttr)) case _ : Losers => t => t.sendToDst((1, getStat(t.attr.winnerStats), t.dstAttr)) case _ => t => { t.sendToSrc((1, getStat(t.attr.loserStats),t.srcAttr)) t.sendToDst((1, getStat(t.attr.winnerStats),t.dstAttr)) } } , // mergeMsg (x, y) => (x._1 + y._1, x._2+ y._2, x._3) ) aggrStats mapValues ( (id: VertexId, x: Msg) => x match { case (count: Int, total: Double, name: String) => (name, total/count) }) } With this, we can calculate the average points conceded by the winning and losing teams as follows: val winnersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Winners()) val losersAvgConcededPoints = averageConcededStat(scoreGraph)(score, Losers()) Let's check the output: scala> losersAvgConcededPoints.min()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905)) scala> winnersAvgConcededPoints.min()(Ordering.by(_._2)) res: (org.apache.spark.graphx.VertexId, (String, Double)) = (1101,(Abilene Chr,74.04761904761905)) scala> losersAvgConcededPoints.max()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1464,(Youngstown St,78.85714285714286)) scala> winnersAvgConcededPoints.max()(Ordering.by(_._2)) res: (VertexId, (String, Double)) = (1464,(Youngstown St,71.125)) The previous tells us that Abilene Christian University is the most defensive team. They concede the least points whether they win a game or not. On the other hand, Youngstown has the worst defense. Joining aggregated stats into graphs The previous example shows us how flexible the aggregateMessages operator is. We can define the Msg type of the messages to be aggregated to fit our needs. Moreover, we can select which nodes receive the messages. Finally, we can also define how we want to merge the messages. As a final example, let's aggregate many statistics about each team, and join this information into the nodes of the graph. To start, we create its own class for the team stats: // Average Stats of All Teams case class TeamStat( wins: Int = 0 // Number of wins ,losses: Int = 0 // Number of losses ,ppg: Int = 0 // Points per game ,pcg: Int = 0 // Points conceded per game ,fgp: Double = 0 // Field goal percentage ,tpp: Double = 0 // Three point percentage ,ftp: Double = 0 // Free Throw percentage ){ override def toString = wins + "-" + losses } Then, we collect the average stats for all teams using aggregateMessages in the following. 
For this, we define the type of the message to be an 8-element tuple that holds the counter for games played, wins, losses, and other statistics that will be stored in TeamStat as listed previously: type Msg = (Int, Int, Int, Int, Int, Double, Double, Double) val aggrStats: VertexRDD[Msg] = scoreGraph.aggregateMessages( // sendMsg t => { t.sendToSrc(( 1, 1, 0, t.attr.winnerStats.score, t.attr.loserStats.score, t.attr.winnerStats.fgPercent, t.attr.winnerStats.tpPercent, t.attr.winnerStats.ftPercent )) t.sendToDst(( 1, 0, 1, t.attr.loserStats.score, t.attr.winnerStats.score, t.attr.loserStats.fgPercent, t.attr.loserStats.tpPercent, t.attr.loserStats.ftPercent )) } , // mergeMsg (x, y) => ( x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4, x._5 + y._5, x._6 + y._6, x._7 + y._7, x._8 + y._8 ) ) Given the aggregate message called aggrStats, we map them into a collection of TeamStat: val teamStats: VertexRDD[TeamStat] = aggrStats mapValues { (id: VertexId, m: Msg) => m match { case ( count: Int, wins: Int, losses: Int, totPts: Int, totConcPts: Int, totFG: Double, totTP: Double, totFT: Double) => TeamStat( wins, losses, totPts/count, totConcPts/count, totFG/count, totTP/count, totFT/count) } } Next, let's join teamStats into the graph. For this, we first create a class called Team as a new type for the vertex attribute. Team will have a name and TeamStat: case class Team(name: String, stats: Option[TeamStat]) { override def toString = name + ": " + stats } Next, we use the joinVertices operator that we have seen in the previous chapter: // Joining the average stats to vertex attributes def addTeamStat(id: VertexId, t: Team, stats: TeamStat) = Team(t.name, Some(stats)) val statsGraph: Graph[Team, FullResult] = scoreGraph.mapVertices((_, name) => Team(name, None)). joinVertices(teamStats)(addTeamStat) We can see that the join has worked well by printing the first three vertices in the new graph called statsGraph: scala> statsGraph.vertices.take(3).foreach(println) (1260,Loyola-Chicago: Some(17-13)) (1410,TX Pan American: Some(7-21)) (1426,UT Arlington: Some(15-15)) To conclude this task, let's find out the top 10 teams in the regular seasons. To do so, we define an ordering for Option[TeamStat] as follows: import scala.math.Ordering object winsOrdering extends Ordering[Option[TeamStat]] { def compare(x: Option[TeamStat], y: Option[TeamStat]) = (x, y) match { case (None, None) => 0 case (Some(a), None) => 1 case (None, Some(b)) => -1 case (Some(a), Some(b)) => if (a.wins == b.wins) a.losses compare b.losses else a.wins compare b.wins }} Finally, we get the following: import scala.reflect.classTag import scala.reflect.ClassTag scala> statsGraph.vertices.sortBy(v => v._2.stats,false)(winsOrdering, classTag[Option[TeamStat]]). | take(10).foreach(println) (1246,Kentucky: Some(34-0)) (1437,Villanova: Some(32-2)) (1112,Arizona: Some(31-3)) (1458,Wisconsin: Some(31-3)) (1211,Gonzaga: Some(31-2)) (1320,Northern Iowa: Some(30-3)) (1323,Notre Dame: Some(29-5)) (1181,Duke: Some(29-4)) (1438,Virginia: Some(29-3)) (1268,Maryland: Some(27-6)) Note that the ClassTag parameter is required in sortBy to make use of Scala's reflection. This is why we had the previous imports. Performance optimization with tripletFields In addition to sendMsg and mergeMsg, aggregateMessages can also take an optional argument called tripletsFields, which indicates what data is accessed in the EdgeContext. The main reason for explicitly specifying such information is to help optimize the performance of the aggregateMessages operation. 
In fact, TripletFields represents a subset of the fields of EdgeTriplet, and it enables GraphX to populate only thse fields when necessary. The default value is TripletFields. All which means that the sendMsg function may access any of the fields in the EdgeContext. Otherwise, the tripletFields argument is used to tell GraphX that only part of the EdgeContext will be required so that an efficient join strategy can be used. All the possible options for the tripletsFields are listed here: TripletFields.All: Expose all the fields (source, edge, and destination) TripletFields.Dst: Expose the destination and edge fields, but not the source field TripletFields.EdgeOnly: Expose only the edge field. TripletFields.None: None of the triplet fields are exposed TripletFields.Src: Expose the source and edge fields, but not the destination field Using our previous example, if we are interested in computing the total number of wins and losses for each team, we will not need to access any field of the EdgeContext. In this case, we should use TripletFields. None to indicate so: // Number of wins of the teams val numWins: VertexRDD[Int] = scoreGraph.aggregateMessages( triplet => { triplet.sendToSrc(1) // No attribute is passed but an integer }, (x, y) => x + y, TripletFields.None ) // Number of losses of the teams val numLosses: VertexRDD[Int] = scoreGraph.aggregateMessages( triplet => { triplet.sendToDst(1) // No attribute is passed but an integer }, (x, y) => x + y, TripletFields.None ) To see that this works, let's print the top five and bottom five teams: scala> numWins.sortBy(_._2,false).take(5).foreach(println) (1246,34) (1437,32) (1112,31) (1458,31) (1211,31) scala> numLosses.sortBy(_._2, false).take(5).foreach(println) (1363,28) (1146,27) (1212,27) (1197,27) (1263,27) Should you want the name of the top five teams, you need to access the srcAttr attribute. In this case, we need to set tripletFields to TripletFields.Src: Kentucky as undefeated team in regular season: val numWinsOfTeams: VertexRDD[(String, Int)] = scoreGraph.aggregateMessages( t => { t.sendToSrc(t.srcAttr, 1) // Pass source attribute only }, (x, y) => (x._1, x._2 + y._2), TripletFields.Src ) Et voila! scala> numWinsOfTeams.sortBy(_._2._2, false).take(5).foreach(println) (1246,(Kentucky,34)) (1437,(Villanova,32)) (1112,(Arizona,31)) (1458,(Wisconsin,31)) (1211,(Gonzaga,31)) scala> numWinsOfTeams.sortBy(_._2._2).take(5).foreach(println) (1146,(Cent Arkansas,2)) (1197,(Florida A&M,2)) (1398,(Tennessee St,3)) (1263,(Maine,3)) (1420,(UMBC,4)) Kentucky has not lost any of its 34 games during the regular season. Too bad that they could not make it into the championship final. Warning about the MapReduceTriplets operator Prior to Spark 1.2, there was no aggregateMessages method in graph. Instead, the now deprecated mapReduceTriplets was the primary aggregation operator. The API for mapReduceTriplets is: class Graph[VD, ED] { def mapReduceTriplets[Msg]( map: EdgeTriplet[VD, ED] => Iterator[(VertexId, Msg)], reduce: (Msg, Msg) => Msg) : VertexRDD[Msg] } Compared to mapReduceTriplets, the new operator called aggregateMessages is more expressive as it employs the message passing mechanism instead of returning an iterator of messages as mapReduceTriplets does. In addition, aggregateMessages explicitly requires the user to specify the TripletFields object for performance improvement as we explained previously. In addition to the API improvements, aggregateMessages is optimized for performance. 
Because mapReduceTriplets is now deprecated, we will not discuss it further. If you have to use it with earlier versions of Spark, you can refer to the Spark programming guide. Summary In brief, AggregateMessages is a useful and generic operator that provides a functional abstraction for aggregating neighborhood information in the Spark graphs. Its definition is summarized here: class Graph[VD, ED] { def aggregateMessages[Msg: ClassTag]( sendMsg: EdgeContext[VD, ED, Msg] => Unit, mergeMsg: (Msg, Msg) => Msg, tripletFields: TripletFields = TripletFields.All) : VertexRDD[Msg] } This operator applies a user-defined sendMsg function to each edge in the graph using an EdgeContext. Each EdgeContext access the required information about the edge and passes this information to its source node and/or destination node using the sendToSrc and/or sendToDst respectively. After all the messages are received by the nodes, the mergeMsg function is used to aggregate these messages at each node. Some interesting reads Six keys to sports analytics Moneyball: The Art Of Winning An Unfair Game Golden State Warriors at the forefront of NBA data analysis How Data and Analytics Have Changed 'The Beautiful Game' NHL, SAP partnership to lead statistical revolution Resources for Article: Further resources on this subject: The Spark programming model[article] Apache Karaf – Provisioning and Clusters[article] Machine Learning Using Spark MLlib [article]
Meeting SAP Lumira

Packt
02 Sep 2015
12 min read
In this article by Dmitry Anoshin, author of the book SAP Lumira Essentials, Dmitry talks about living in a century of information technology. There are a lot of electronic devices around us which generate lots of data. For example, you can surf the Internet, visit a couple of news portals, order new Nike Air Max shoes from a web store, write a couple of messages to your friend, and chat on Facebook. Your every action produces data. We can multiply that action by the amount of people who have access to the internet or just use a cell phone, and we get really BIG DATA. Of course, you have a question: how big is it? Now, it starts from terabytes or even petabytes. The volume is not the only issue; moreover, we struggle with the variety of data. As a result, it is not enough to analyze only the structured data. We should dive deep in to unstructured data, such as machine data which are generated by various machines. (For more resources related to this topic, see here.) Nowadays, we should have a new core competence—dealing with Big Data—, because these vast data volumes won't be just stored, they need to be analysed and mined for information that management can use in order to make right business decisions. This helps to make the business more competitive and efficient. Unfortunately, in modern organizations there are still many manual steps needed in order to get data and try to answer your business questions. You need the help of your IT guys, or need to wait until new data is available in your enterprise data warehouse. In addition, you are often working with an inflexible BI tool, which can only refresh a report or export it in to Excel. You definitely need a new approach, which gives you a competitive advantage, dramatically reduces errors, and accelerates business decisions. So, we can highlight some of the key points for this kind of analytics: Integrating data from heterogeneous systems Giving more access to data Using sophisticated analytics Reducing manual coding Simplifying processes Reducing time to prepare data Focusing on self-service Leveraging powerful computing resources We could continue this list with many other bullet points. If you are a fan of traditional BI tools, you may think that it is almost impossible. Yes, you are right, it is impossible. That's why we need to change the rules of the game. As the business world changes, you must change as well. Maybe you have guessed what this means, but if not, I can help you. I will focus on a new approach of doing data analytics, which is more flexible and powerful. It is called data discovery. Of course, we need the right way in order to overcome all the challenges of the modern world. That's why we have chosen SAP Lumira—one of the most powerful data discovery tools in the modern market. But before diving deep into this amazing tool, let's consider some of the challenges of data discovery that are in our path, as well as data discovery advantages. Data discovery challenges Let's imagine that you have several terabytes of data. Unfortunately, it is raw unstructured data. In order to get business insight from this data you have to spend a lot of time in order to prepare and clean the data. In addition, you are restricted by the capabilities of your machine. That's why a good data discovery tool usually is combined of software and hardware. As a result, this gives you more power for exploratory data analysis. Let's imagine that this entire Big Data store is in Hadoop or any NoSQL data store. 
You have to at least be at good programmer in order to do analytics on this data. Here we can find other benefit of a good data discovery tool: it gives a powerful tool to business users, who are not as technical and maybe don't even know SQL. Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resilience of these clusters comes from the software's ability to detect and handle failures at the application layer. A NoSQL data store is a next generation database, mostly addressing some of the following points: non-relational, distributed, open-source, and horizontally scalable. Data discovery versus business intelligence You may be confused about data discovery and business intelligence technologies; it seems they are very close to each other or even BI tools can do all what data discovery can do. And why do we need a separate data discovery tool, such as, SAP Lumira? In order to better understand the difference between the two technologies, you can look at the table below:   Enterprise BI Data discovery Key users All users Advanced analysts Approach Vertically-oriented (top to bottom), semantic layers, requests to existing repositories Vertically-oriented (bottom-up), mushup, putting data in the selected repository Interface Reports, dashboards Visualization Users Reporting Analysis Implementation By IT consultants By business users Let's consider the pros and cons of data discovery: Pros: Rapidly analyze data with a short shelf life Ideal for small teams Best for tactical analysis Great for answering on-off questions quickly Cons: Difficult to handle for enterprise organizations Difficult for junior users Lack of scalability As a result, it is clear that BI and data discovery handles their own tasks and complement each other. The role of data discovery Most organizations have a data warehouse. It was planned to supporting daily operations and to help make business decisions. But sometimes organizations need to meet new challenges. For example, Retail Company wants to improve their customer experience and decide to work closely with the customer database. Analysts try to segment customers into cohorts and try to analyse customer's behavior. They need to handle all customer data, which is quite big. In addition, they can use external data in order to learn more about their customers. If they start to use a corporate BI tool, every interaction, such as adding new a field or filter, can take 10-30 minutes. Another issue is adding a new field to an existing report. Usually, it is impossible without the help of IT staff, due to security or the complexities of the BI Enterprise solution. This is unacceptable in a modern business. Analysts want get an answer to their business questions immediately, and they prefer to visualize data because, as you know, human perception of visualization is much higher than text. In addition, these analysts may be independent from IT. They have their data discovery tool and they can connect to any data sources in the organization and check their crazy hypotheses. There are hundreds of examples where BI and DWH is weak, and data discovery is strong. Introducing SAP Lumira Starting from this point, we will focus on learning SAP Lumira. First of all, we need to understand what SAP Lumira is exactly. 
SAP Lumira is a family of data discovery tools which give us an opportunity to create amazing visualizations or even tell fantastic stories based on our big or small data. We can connect most of the popular data sources, such as Relational Database Management Systems (RDBMSs), flat files, excel spreadsheets or SAP applications. We are able to create datasets with measures, dimensions, hierarchies, or variables. In addition, Lumira allows us to prepare, edit, and clean our data before it is processed. SAP Lumira offers us a huge arsenal of graphical charts and tables to visualize our data. In addition, we can create data stories or even infographics based on our data by grouping charts, single cells, or tables together on boards to create presentation- style dashboards. Moreover, we can add images or text in order to add details. The following are the three main products in the Lumira family offered by SAP: SAP Lumira Desktop SAP Lumira Server SAP Lumira Cloud Lumira Desktop can be either a personal edition or a standard edition. Both of them give you the opportunity to analyse data on your local machine. You can even share your visualizations or insights via PDF or XLS. Lumira Server is also in two variations—Edge and Server. As you know, SAP BusinessObjects also has two types of license for the same software, Edge and Enterprise, and they differ only in terms of the number of users and the type of license. The Edge version is smaller; for example, it can cover the needs of a team or even the whole department. Lumira Cloud is Software as a Service (SaaS). It helps to quickly visualize large volumes of data without having to sacrifice performance or security. It is especially designed to speed time to insight. In addition, it saves time and money with flexible licensing options. Data connectors We met SAP Lumira for the first time and we played with the interface, and the reader could adjust the general settings of SAP Lumira. In addition, we can find this interesting menu in the middle of the window: There are several steps which help us to discover our data and gain business insights. In this article we start from first step by exploring data in SAP Lumira to create a document and acquire a dataset, which can include part or all of the original values from a data source. This is through Acquire Data. Let's click on Acquire Data. This new window will come up: There are four areas on this window. They are: A list of possible data sources (1): Here, the user can connect to his data source. Recently used objects (2): The user can open his previous connections or files. Ordinary buttons (3), such as Previous, Next, Create, and Cancel. This small chat box (4) we can find at almost every page. SAP Lumira cares about the quality of the product and gives the opportunity to the user to make a screen print and send feedback to SAP. Let's go deeper and consider more closely every connection in the table below: Data Source Description Microsoft Excel Excel data sheets Flat file CSV, TXT, LOG, PRN, or TSV SAP HANA There are two possible ways: Offline (downloading data) and Online (connected to SAP HANA) SAP BusinessObjects universe UNV or UNX SQL Databases Query data via SQL from relational databases SAP Business warehouse Downloaded data from a BEx Query or an InfoProvider Let's try to connect some data sources and extract some data from them. Microsoft spreadsheets Let's start with the easiest exercise. 
For example, our inventory manager has asked us to analyse flop products, that is, products that are not popular, and he has sent us two Excel spreadsheets, Unicorn_flop_products.xls and Unicorn_flop_price.xls. There are two different worksheets because prices and product attributes live in different systems. Both files have a unique field, SKU. As a result, it is possible to merge them on this field and analyse them as one data set.
An SKU, or stock keeping unit, is a distinct item for sale, such as a product or service, together with the attributes that distinguish it from other items. For a product, these attributes include, but are not limited to, manufacturer, product description, material, size, color, packaging, and warranty terms. When a business takes inventory, it counts the quantity of each SKU.
Connecting to the SAP BO universe
The universe is a core concept in the SAP BusinessObjects BI platform. It is the semantic layer that isolates business users from the technical complexities of the databases where their corporate information is stored. For the ease of the end user, universes are made up of objects and classes that map to data in the database, using everyday terms that describe the business environment.
Introducing the Unicorn Fashion universe
The Unicorn Fashion company uses the SAP BusinessObjects BI platform (BIP) as its primary BI tool. There is a Unicorn Fashion universe, which was built on top of the unicorn datamart and has a similar structure and joins. The following image shows the Unicorn Fashion universe:
It unites two business processes, Sales (orange) and Stock (green), and has the following structure in the business layer:
Product: This specifies the attributes of an SKU, such as brand, category, and so on
Price: This denotes the different pricing of the SKU
Sales: This specifies the sales business process
Order: This denotes the order number, the shipping information, and order measures
Sales Date: This specifies the attributes of the order date, such as month, year, and so on
Sales Measures: This denotes various aggregated measures, such as shipped items, revenue waterfall, and so on
Stock: This specifies information about the quantity in stock
Stock Date: This denotes the attributes of the stock date, such as month, year, and so on
Summary
This article was a step-by-step introduction to SAP Lumira essentials, starting with an overview of the SAP Lumira family of products. The book demonstrates various data discovery techniques using real-world scenarios from an online e-commerce retailer, and it provides detailed recipes for installing, administering, and customizing SAP Lumira. It shows how to work with data, from acquiring it from various data sources, through preparing and cleaning it, to visualizing it with SAP Lumira's rich functionality. It also teaches you how to present data as a data story or an infographic and publish it across your organization or on the web. You will learn data discovery techniques, build visualizations, create stories, and share them with one of the most powerful tools on the market: SAP Lumira. The book focuses on extracting data from different sources such as plain text files, Microsoft Excel spreadsheets, the SAP BusinessObjects BI platform, SAP HANA, and SQL databases, and finally shows how to publish the results of your work to various destinations, such as SAP BI clients and SAP Lumira Cloud.
Big Data

 In this article by Henry Garner, author of the book Clojure for Data Science, we'll be working with a relatively modest dataset of only 100,000 records. This isn't big data (at 100 MB, it will fit comfortably in the memory of one machine), but it's large enough to demonstrate the common techniques of large-scale data processing. Using Hadoop (the popular framework for distributed computation) as its case study, this article will focus on how to scale algorithms to very large volumes of data through parallelism. Before we get to Hadoop and distributed data processing though, we'll see how some of the same principles that enable Hadoop to be effective at a very large scale can also be applied to data processing on a single machine, by taking advantage of the parallel capacity available in all modern computers. (For more resources related to this topic, see here.) The reducers library The count operation we implemented previously is a sequential algorithm. Each line is processed one at a time until the sequence is exhausted. But there is nothing about the operation that demands that it must be done in this way. We could split the number of lines into two sequences (ideally of roughly equal length) and reduce over each sequence independently. When we're done, we would just add together the total number of lines from each sequence to get the total number of lines in the file: If each Reduce ran on its own processing unit, then the two count operations would run in parallel. All the other things being equal, the algorithm would run twice as fast. This is one of the aims of the clojure.core.reducers library—to bring the benefit of parallelism to algorithms implemented on a single machine by taking advantage of multiple cores. Parallel folds with reducers The parallel implementation of reduce implemented by the reducers library is called fold. To make use of a fold, we have to supply a combiner function that will take the results of our reduced sequences (the partial row counts) and return the final result. Since our row counts are numbers, the combiner function is simply +. Reducers are a part of Clojure's standard library, they do not need to be added as an external dependency. The adjusted example, using clojure.core.reducers as r, looks like this: (defn ex-5-5 [] (->> (io/reader "data/soi.csv") (line-seq) (r/fold + (fn [i x] (inc i))))) The combiner function, +, has been included as the first argument to fold and our unchanged reduce function is supplied as the second argument. We no longer need to pass the initial value of zero—fold will get the initial value by calling the combiner function with no arguments. Our preceding example works because +, called with no arguments, already returns zero: (defn ex-5-6 [] (+)) ;; 0 To participate in folding then, it's important that the combiner function have two implementations: one with zero arguments that returns the identity value and another with two arguments that combines the arguments. Different folds will, of course, require different combiner functions and identity values. For example, the identity value for multiplication is 1. We can visualize the process of seeding the computation with an identity value, iteratively reducing over the sequence of xs and combining the reductions into an output value as a tree: There may be more than two reductions to combine, of course. The default implementation of fold will split the input collection into chunks of 512 elements. 
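The partition size is not fixed: r/fold also has a four-argument arity that takes the partition size as its first argument, and parallel folding only applies to foldable collections such as vectors (other sequences are reduced serially). As a sketch, not one of the book's numbered examples, we could realize the lines into a vector and ask for larger chunks (the 4096 below is an arbitrary choice) to trade coordination overhead against parallelism:

(defn ex-5-5-chunked []
  (->> (io/reader "data/soi.csv")
       (line-seq)
       (into [])                              ;; realize into a vector so fold can partition the work
       (r/fold 4096 + (fn [i x] (inc i)))))   ;; partition size, combiner, reducer

Returning to the default partition size of 512: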
Our 166,000-element sequence will therefore generate 325 reductions to be combined. We're going to run out of page real estate quite quickly with a tree representation diagram, so let's visualize the process more schematically instead—as a two-step reduce and combine process. The first step performs a parallel reduce across all the chunks in the collection. The second step performs a serial reduce over the intermediate results to arrive at the final result: The preceding representation shows reduce over several sequences of xs, represented here as circles, into a series of outputs, represented here as squares. The squares are combined serially to produce the final result, represented by a star. Loading large files with iota Calling fold on a lazy sequence requires Clojure to realize the sequence into memory and then chunk the sequence into groups for parallel execution. For situations where the calculation performed on each row is small, the overhead involved in coordination outweighs the benefit of parallelism. We can improve the situation slightly by using a library called iota (https://github.com/thebusby/iota). The iota library loads files directly into the data structures suitable for folding over with reducers that can handle files larger than available memory by making use of memory-mapped files. With iota in the place of our line-seq function, our line count simply becomes: (defn ex-5-7 [] (->> (iota/seq "data/soi.csv") (r/fold + (fn [i x] (inc i))))) So far, we've just been working with the sequences of unformatted lines, but if we're going to do anything more than counting the rows, we'll want to parse them into a more useful data structure. This is another area in which Clojure's reducers can help make our code more efficient. Creating a reducers processing pipeline We already know that the file is comma-separated, so let's first create a function to turn each row into a vector of fields. All fields except the first two contain numeric data, so let's parse them into doubles while we're at it: (defn parse-double [x] (Double/parseDouble x)) (defn parse-line [line] (let [[text-fields double-fields] (->> (str/split line #",") (split-at 2))] (concat text-fields (map parse-double double-fields)))) We're using the reducers version of map to apply our parse-line function to each of the lines from the file in turn: (defn ex-5-8 [] (->> (iota/seq "data/soi.csv") (r/drop 1) (r/map parse-line) (r/take 1) (into []))) ;; [("01" "AL" 0.0 1.0 889920.0 490850.0 ...)] The final into function call converts the reducers' internal representation (a reducible collection) into a Clojure vector. The previous example should return a sequence of 77 fields, representing the first row of the file after the header. We're just dropping the column names at the moment, but it would be great if we could make use of these to return a map representation of each record, associating the column name with the field value. The keys of the map would be the column headings and the values would be the parsed fields. 
The clojure.core function zipmap will create a map out of two sequences—one for the keys and one for the values: (defn parse-columns [line] (->> (str/split line #",") (map keyword))) (defn ex-5-9 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/take 1) (into [])))) This function returns a map representation of each row, a much more user-friendly data structure: [{:N2 1505430.0, :A19300 181519.0, :MARS4 256900.0 ...}] A great thing about Clojure's reducers is that in the preceding computation, calls to r/map, r/drop and r/take are composed into a reduction that will be performed in a single pass over the data. This becomes particularly valuable as the number of operations increases. Let's assume that we'd like to filter out zero ZIP codes. We could extend the reducers pipeline like this: (defn ex-5-10 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/remove (fn [record] (zero? (:zipcode record)))) (r/take 1) (into [])))) The r/remove step is now also being run together with the r/map, r/drop and r/take calls. As the size of the data increases, it becomes increasingly important to avoid making multiple iterations over the data unnecessarily. Using Clojure's reducers ensures that our calculations are compiled into a single pass. Curried reductions with reducers To make the process clearer, we can create a curried version of each of our previous steps. To parse the lines, create a record from the fields and filter zero ZIP codes. The curried version of the function is a reduction waiting for a collection: (def line-formatter (r/map parse-line)) (defn record-formatter [column-names] (r/map (fn [fields] (zipmap column-names fields)))) (def remove-zero-zip (r/remove (fn [record] (zero? (:zipcode record))))) In each case, we're calling one of reducers' functions, but without providing a collection. The response is a curried version of the function that can be applied to the collection at a later time. The curried functions can be composed together into a single parse-file function using comp: (defn load-data [file] (let [data (iota/seq file) col-names (parse-columns (first data)) parse-file (comp remove-zero-zip (record-formatter col-names) line-formatter)] (parse-file (rest data)))) It's only when the parse-file function is called with a sequence that the pipeline is actually executed. Statistical folds with reducers With the data parsed, it's time to perform some descriptive statistics. Let's assume that we'd like to know the mean number of returns (column N1) submitted to the IRS by ZIP code. One way of doing this—the way we've done several times throughout the book—is by adding up the values and dividing it by the count. Our first attempt might look like this: (defn ex-5-11 [] (let [data (load-data "data/soi.csv") xs (into [] (r/map :N1 data))] (/ (reduce + xs) (count xs)))) ;; 853.37 While this works, it's comparatively slow. We iterate over the data once to create xs, a second time to calculate the sum, and a third time to calculate the count. The bigger our dataset gets, the larger the time penalty we'll pay. Ideally, we would be able to calculate the mean value in a single pass over the data, just like our parse-file function previously. It would be even better if we can perform it in parallel too. 
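To preview where we are heading, here is one sketch of such a single-pass, parallelizable mean; the mean-fold name and the {:sum :n} accumulator are illustrative and not part of the book's code. It accumulates the sum and the count together and only divides at the end. The next section explains why this indirection is necessary.

(defn mean-fold [data]
  (let [combiner (fn
                   ([] {:sum 0 :n 0})            ;; identity value seeding each partition
                   ([a b] (merge-with + a b)))   ;; merge partial sums and counts
        reducer  (fn [{:keys [sum n]} x]
                   {:sum (+ sum x) :n (inc n)})
        {:keys [sum n]} (r/fold combiner reducer data)]
    (/ sum n)))

;; Usage sketch:
;; (->> (load-data "data/soi.csv")
;;      (r/map :N1)
;;      (mean-fold))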
Associativity
Before we proceed, it's useful to take a moment to reflect on why the following code wouldn't do what we want:

(defn mean
  ([] 0)
  ([x y] (/ (+ x y) 2)))

Our mean function is a function of two arities. Without arguments, it returns zero, the identity for the mean computation. With two arguments, it returns their mean:

(defn ex-5-12 []
  (->> (load-data "data/soi.csv")
       (r/map :N1)
       (r/fold mean)))

;; 930.54

The preceding example folds over the N1 data with our mean function and produces a different result from the one we obtained previously. If we could expand out the computation for the first three xs, we might see something like the following code:

(mean (mean (mean 0 a) b) c)

This is a bad idea, because the mean function is not associative. For an associative function f, the following holds true:

(f (f a b) c) = (f a (f b c))

Addition is associative, but multiplication and division are not. So the mean function is not associative either. Contrast the mean function with the following simple addition:

(+ 1 (+ 2 3))

This yields an identical result to:

(+ (+ 1 2) 3)

It doesn't matter how the arguments to + are partitioned. Associativity is an important property of functions used to reduce over a set of data because, by definition, the results of a previous calculation are treated as inputs to the next. The easiest way of converting the mean function into an associative function is to calculate the sum and the count separately. Since the sum and the count are associative, they can be calculated in parallel over the data. The mean can then be calculated simply by dividing one by the other.
Multiple regression with gradient descent
The normal equation uses matrix algebra to arrive at the least squares estimates very quickly and efficiently. Where all the data fits in memory, this is a very convenient and concise equation. Where the data exceeds the memory available to a single machine, however, the calculation becomes unwieldy. The reason for this is matrix inversion. The calculation of the inverse (X^T X)^-1 is not something that can be accomplished in a fold over the data: each cell in the output matrix depends on many others in the input matrix. These complex relationships require that the matrix be processed in a nonsequential way.
An alternative approach to solving linear regression problems, and many other related machine learning problems, is a technique called gradient descent. Gradient descent reframes the problem as the solution to an iterative algorithm, one that does not calculate the answer in one very computationally intensive step, but rather converges towards the correct answer over a series of much smaller steps.
The gradient descent update rule
Gradient descent works by the iterative application of a function that moves the parameters in the direction of their optimum values. To apply this function, we need to know the gradient of the cost function at the current parameters. Calculating the formula for the gradient involves calculus that's beyond the scope of this book. Fortunately, the resulting formula isn't terribly difficult to interpret: the partial derivative, or gradient, of our cost function J(β) for the parameter at index j works out as

∂J(β)/∂βj = (ŷ - y)xj

In other words, the gradient of the cost function with respect to the parameter at index j is equal to the difference between our prediction and the true value of y, multiplied by the value of x at index j. Since we're seeking to descend the gradient, we want to subtract some proportion of the gradient from the current parameter values.
Thus, at each step of gradient descent, we perform the following update:

βj := βj - α ∂J(β)/∂βj

Here, := is the assignment operator and α is a factor called the learning rate. The learning rate controls how large an adjustment we wish to make to the parameters at each iteration, as a fraction of the gradient. If our prediction ŷ nearly matches the actual value of y, then there would be little need to change the parameters. In contrast, a larger error will result in a larger adjustment to the parameters. This rule is called the Widrow-Hoff learning rule or the Delta rule.
The gradient descent learning rate
As we've seen, gradient descent is an iterative algorithm. The learning rate, usually represented by α, dictates the speed at which gradient descent converges to the final answer. If the learning rate is too small, convergence will happen very slowly. If it is too large, gradient descent will not find values close to the optimum and may even diverge from the correct answer:
In the preceding chart, a small learning rate leads to a slow convergence over many iterations of the algorithm. While the algorithm does reach the minimum, it does so over many more steps than is ideal and, therefore, may take considerable time. By contrast, in the following diagram, we can see the effect of a learning rate that is too large. The parameter estimates are changed so significantly between iterations that they actually overshoot the optimum values and diverge from the minimum value:
The gradient descent algorithm requires us to iterate repeatedly over our dataset. With the correct value of alpha, each iteration should successively yield better approximations of the ideal parameters. We can choose to terminate the algorithm either when the change between iterations is very small or after a predetermined number of iterations.
Feature scaling
As more features are added to the linear model, it is important to scale the features appropriately. Gradient descent will not perform very well if the features have radically different scales, since it won't be possible to pick a learning rate that suits them all. A simple scaling we can perform is to subtract the mean value from each of the values and divide by the standard deviation. This will tend to produce values with zero mean that generally vary between -3 and 3:

(defn feature-scales [features]
  (->> (prepare-data)
       (t/map #(select-keys % features))
       (t/facet)
       (t/fuse {:mean (m/mean)
                :sd   (m/standard-deviation)})))

The feature-scales function in the preceding code uses t/facet to calculate the mean value and standard deviation of all the input features:

(defn ex-5-24 []
  (let [data (iota/seq "data/soi.csv")
        features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2]]
    (->> (feature-scales features)
         (t/tesser (chunks data)))))

;; {:MARS2 {:sd 533.4496892658647, :mean 317.0412009748016}...}

If you run the preceding example, you'll see the different means and standard deviations returned by the feature-scales function. Since our feature scales and input records are represented as maps, we can perform the scaling across all the features at once using Clojure's merge-with function:

(defn scale-features [factors]
  (let [f (fn [x {:keys [mean sd]}]
            (/ (- x mean) sd))]
    (fn [x]
      (merge-with f x factors))))

Likewise, we can perform the all-important reversal with unscale-features:

(defn unscale-features [factors]
  (let [f (fn [x {:keys [mean sd]}]
            (+ (* x sd) mean))]
    (fn [x]
      (merge-with f x factors))))

Let's scale our features and take a look at the very first feature.
Tesser won't allow us to execute a fold without a reduce, so we'll temporarily revert to using Clojure's reducers: (defn ex-5-25 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map #(select-keys % features )) (r/map (scale-features factors)) (into []) (first)))) ;; {:MARS2 -0.14837567114357617, :NUMDEP 0.30617757526890155, ;; :AGI_STUB -0.714280814223704, :A00200 -0.5894942801950217, ;; :A02300 0.031741856083514465} This simple step will help gradient descent perform optimally on our data. Feature extraction Although we've used maps to represent our input data in this article, it's going to be more convenient when running gradient descent to represent our features as a matrix. Let's write a function to transform our input data into a map of xs and y. The y axis will be a scalar response value and xs will be a matrix of scaled feature values. We're adding a bias term to the returned matrix of features: (defn feature-matrix [record features] (let [xs (map #(% record) features)] (i/matrix (cons 1 xs)))) (defn extract-features [fy features] (fn [record] {:y (fy record) :xs (feature-matrix record features)})) Our feature-matrix function simply accepts an input of a record and the features to convert into a matrix. We call this from within extract-features, which returns a function that we can call on each input record: (defn ex-5-26 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map (scale-features factors)) (r/map (extract-features :A02300 features)) (into []) (first)))) ;; {:y 433.0, :xs A 5x1 matrix ;; ------------- ;; 1.00e+00 ;; -5.89e-01 ;; -7.14e-01 ;; 3.06e-01 ;; -1.48e-01 ;; } The preceding example shows the data converted into a format suitable to perform gradient descent: a map containing the y response variable and a matrix of values, including the bias term. Applying a single step of gradient descent The objective of calculating the cost is to determine the amount by which to adjust each of the coefficients. Once we've calculated the average cost, as we did previously, we need to update the estimate of our coefficients β. Together, these steps represent a single iteration of gradient descent: We can return the updated coefficients in a post-combiner step that makes use of the average cost, the value of alpha, and the previous coefficients. 
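As a quick numerical sketch of that update (the values below are toy numbers, not taken from the IRS data, and the exact printed matrix representation will depend on your Incanter version), subtracting alpha times the cost gradient from a coefficient vector of [1.0 2.0] looks like this:

;; toy illustration of a single coefficient update
(let [coefs (i/matrix [1.0 2.0])
      cost  (i/matrix [0.5 -0.25])   ;; pretend average cost gradient
      alpha 0.1]
  (i/minus coefs (i/mult cost alpha)))
;; => a 2x1 matrix with values close to [0.95 2.025]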
Let's create a utility function update-coefficients, which will receive the coefficients and alpha and return a function that will calculate the new coefficients, given a total model cost: (defn update-coefficients [coefs alpha] (fn [cost] (->> (i/mult cost alpha) (i/minus coefs)))) With the preceding function in place, we have everything we need to package up a batch gradient descent update rule: (defn gradient-descent-fold [{:keys [fy features factors coefs alpha]}] (let [zeros-matrix (i/matrix 0 (count features) 1)] (->> (prepare-data) (t/map (scale-features factors)) (t/map (extract-features fy features)) (t/map (calculate-error (i/trans coefs))) (t/fold (matrix-mean (inc (count features)) 1)) (t/post-combine (update-coefficients coefs alpha))))) (defn ex-5-31 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1}] (->> (gradient-descent-fold options) (t/tesser data)))) ;; A 6x1 matrix ;; ------------- ;; -4.20e+02 ;; -1.38e+06 ;; -5.06e+07 ;; -9.53e+02 ;; -1.42e+06 ;; -4.86e+05 The resulting matrix represents the values of the coefficients after the first iteration of gradient descent. Running iterative gradient descent Gradient descent is an iterative algorithm, and we will usually need to run it many times to convergence. With a large dataset, this can be very time-consuming. To save time, we've included a random sample of soi.csv in the data directory called soi-sample.csv. The smaller size allows us to run iterative gradient descent in a reasonable timescale. The following code runs gradient descent for 100 iterations, plotting the values of the parameters between each iteration on an xy-plot: (defn descend [options data] (fn [coefs] (->> (gradient-descent-fold (assoc options :coefs coefs)) (t/tesser data)))) (defn ex-5-32 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi-sample.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1} iterations 100 xs (range iterations) ys (->> (iterate (descend options data) coefs) (take iterations))] (-> (c/xy-plot xs (map first ys) :x-label "Iterations" :y-label "Coefficient") (c/add-lines xs (map second ys)) (c/add-lines xs (map #(nth % 2) ys)) (c/add-lines xs (map #(nth % 3) ys)) (c/add-lines xs (map #(nth % 4) ys)) (i/view)))) If you run the example, you should see a chart similar to the following: In the preceding chart, you can see how the parameters converge to relatively stable the values over the course of 100 iterations. Scaling gradient descent with Hadoop The length of time each iteration of batch gradient descent takes to run is determined by the size of your data and by how many processors your computer has. Although several chunks of data are processed in parallel, the dataset is large and the processors are finite. We've achieved a speed gain by performing calculations in parallel, but if we double the size of the dataset, the runtime will double as well. Hadoop is one of several systems that has emerged in the last decade which aims to parallelize work that exceeds the capabilities of a single machine. 
Rather than running code across multiple processors, Hadoop takes care of running a calculation across many servers. In fact, Hadoop clusters can, and some do, consist of many thousands of servers. Hadoop consists of two primary subsystems— the Hadoop Distributed File System (HDFS)—and the job processing system, MapReduce. HDFS stores files in chunks. A given file may be composed of many chunks and chunks are often replicated across many servers. In this way, Hadoop can store quantities of data much too large for any single server and, through replication, ensure that the data is stored reliably in the event of hardware failure too. As the name implies, the MapReduce programming model is built around the concept of map and reduce steps. Each job is composed of at least one map step and may optionally specify a reduce step. An entire job may consist of several map and reduce steps chained together. In the respect that reduce steps are optional, Hadoop has a slightly more flexible approach to distributed calculation than Tesser. Gradient descent on Hadoop with Tesser and Parkour Tesser's Hadoop capabilities are available in the tesser.hadoop namespace, which we're including as h. The primary public API function in the Hadoop namespace is h/fold. The fold function expects to receive at least four arguments, representing the configuration of the Hadoop job, the input file we want to process, a working directory for Hadoop to store its intermediate files, and the fold we want to run, referenced as a Clojure var. Any additional arguments supplied will be passed as arguments to the fold when it is executed. The reason for using a var to represent our fold is that the function call initiating the fold may happen on a completely different computer than the one that actually executes it. In a distributed setting, the var and arguments must entirely specify the behavior of the function. We can't, in general, rely on other mutable local state (for example, the value of an atom, or the value of variables closing over the function) to provide any additional context. Parkour distributed sources and sinks The data which we want our Hadoop job to process may exist on multiple machines too, stored distributed in chunks on HDFS. Tesser makes use of a library called Parkour (https://github.com/damballa/parkour/) to handle accessing potentially distributed data sources. Although Hadoop is designed to be run and distributed across many servers, it can also run in local mode. Local mode is suitable for testing and enables us to interact with the local filesystem as if it were HDFS. Another namespace we'll be using from Parkour is the parkour.conf namespace. This will allow us to create a default Hadoop configuration and operate it in local mode: (defn ex-5-33 [] (->> (text/dseq "data/soi.csv") (r/take 2) (into []))) In the preceding example, we use Parkour's text/dseq function to create a representation of the IRS input data. The return value implements Clojure's reducers protocol, so we can use r/take on the result. Running a feature scale fold with Hadoop Hadoop needs a location to write its temporary files while working on a task, and will complain if we try to overwrite an existing directory. Since we'll be executing several jobs over the course of the next few examples, let's create a little utility function that returns a new file with a randomly-generated name. 
(defn rand-file [path] (io/file path (str (long (rand 0x100000000))))) (defn ex-5-34 [] (let [conf (conf/ig) input (text/dseq "data/soi.csv") workdir (rand-file "tmp") features [:A00200 :AGI_STUB :NUMDEP :MARS2]] (h/fold conf input workdir #'feature-scales features))) Parkour provides a default Hadoop configuration object with the shorthand (conf/ig). This will return an empty configuration. The default value is enough, we don't need to supply any custom configuration. All of our Hadoop jobs will write their temporary files to a random directory inside the project's tmp directory. Remember to delete this folder later, if you're concerned about preserving disk space. If you run the preceding example now, you should get an output similar to the following: ;; {:MARS2 317.0412009748016, :NUMDEP 581.8504423822615, ;; :AGI_STUB 3.499939975269811, :A00200 37290.58880658831} Although the return value is identical to the values we got previously, we're now making use of Hadoop behind the scenes to process our data. In spite of this, notice that Tesser will return the response from our fold as a single Clojure data structure. Running gradient descent with Hadoop Since tesser.hadoop folds return Clojure data structures just like tesser.core folds, defining a gradient descent function that makes use of our scaled features is very simple: (defn hadoop-gradient-descent [conf input-file workdir] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) input (text/dseq input-file) options {:column-names column-names :features features :coefs coefs :fy :A02300 :alpha 1e-3} factors (h/fold conf input (rand-file workdir) #'feature-scales features) descend (fn [coefs] (h/fold conf input (rand-file workdir) #'gradient-descent-fold (merge options {:coefs coefs :factors factors})))] (take 5 (iterate descend coefs)))) The preceding code defines a hadoop-gradient-descent function that iterates a descend function 5 times. Each iteration of descend calculates the improved coefficients based on the gradient-descent-fold function. The final return value is a vector of coefficients after 5 iterations of a gradient descent. We run the job on the full IRS data in the following example: ( defn ex-5-35 [] (let [workdir "tmp" out-file (rand-file workdir)] (hadoop-gradient-descent (conf/ig) "data/soi.csv" workdir))) After several iterations, you should see an output similar to the following: ;; ([0 0 0 0 0] ;; (20.9839310796048 46.87214911003046 -7.363493937722712 ;; 101.46736841329326 55.67860863427868) ;; (40.918665605227744 56.55169901254631 -13.771345753228694 ;; 162.1908841131747 81.23969785586247) ;; (59.85666340457121 50.559130068258995 -19.463888245285332 ;; 202.32407094149158 92.77424653758085) ;; (77.8477613139478 38.67088624825574 -24.585818946408523 ;; 231.42399118694212 97.75201693843269)) We've seen how we're able to calculate gradient descent using distributed techniques locally. Now, let's see how we can run this on a cluster of our own. Summary In this article, we learned some of the fundamental techniques of distributed data processing and saw how the functions used locally for data processing, map and reduce, are powerful ways of processing even very large quantities of data. We learned how Hadoop can scale unbounded by the capabilities of any single server by running functions on smaller subsets of the data whose outputs are themselves combined to finally produce a result. 
Once you understand the tradeoffs, this "divide and conquer" approach toward processing data is a simple and very general way of analyzing data on a large scale. We saw both the power and limitations of simple folds to process data using both Clojure's reducers and Tesser. We've also begun exploring how Parkour exposes more of Hadoop's underlying capabilities.
Starting with YARN Basics

In this article by Akhil Arora and Shrey Mehrotra, authors of the book Learning YARN, we will be discussing how Hadoop was developed as a solution to handle big data in a cost effective and easiest way possible. Hadoop consisted of a storage layer, that is, Hadoop Distributed File System (HDFS) and the MapReduce framework for managing resource utilization and job execution on a cluster. With the ability to deliver high performance parallel data analysis and to work with commodity hardware, Hadoop is used for big data analysis and batch processing of historical data through MapReduce programming. (For more resources related to this topic, see here.) With the exponential increase in the usage of social networking sites such as Facebook, Twitter, and LinkedIn and e-commerce sites such as Amazon, there was the need of a framework to support not only MapReduce batch processing, but real-time and interactive data analysis as well. Enterprises should be able to execute other applications over the cluster to ensure that cluster capabilities are utilized to the fullest. The data storage framework of Hadoop was able to counter the growing data size, but resource management became a bottleneck. The resource management framework for Hadoop needed a new design to solve the growing needs of big data. YARN, an acronym for Yet Another Resource Negotiator, has been introduced as a second-generation resource management framework for Hadoop. YARN is added as a subproject of Apache Hadoop. With MapReduce focusing only on batch processing, YARN is designed to provide a generic processing platform for data stored across a cluster and a robust cluster resource management framework. In this article, we will cover the following topics: Introduction to MapReduce v1 Shortcomings of MapReduce v1 An overview of the YARN components The YARN architecture How YARN satisfies big data needs Projects powered by YARN Introduction to MapReduce v1 MapReduce is a software framework used to write applications that simultaneously process vast amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is a batch-oriented model where a large amount of data is stored in Hadoop Distributed File System (HDFS), and the computation on data is performed as MapReduce phases. The basic principle for the MapReduce framework is to move computed data rather than move data over the network for computation. The MapReduce tasks are scheduled to run on the same physical nodes on which data resides. This significantly reduces the network traffic and keeps most of the I/O on the local disk or within the same rack. The high-level architecture of the MapReduce framework has three main modules: MapReduce API: This is the end-user API used for programming the MapReduce jobs to be executed on the HDFS data. MapReduce framework: This is the runtime implementation of various phases in a MapReduce job such as the map, sort/shuffle/merge aggregation, and reduce phases. MapReduce system: This is the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, and so on. The MapReduce system consists of two components—JobTracker and TaskTracker. JobTracker is the master daemon within Hadoop that is responsible for resource management, job scheduling, and management. 
The responsibilities of the JobTracker are as follows:
Hadoop clients communicate with the JobTracker to submit or kill jobs and to poll for job progress
The JobTracker validates a client request and, if it is valid, allocates TaskTracker nodes for the execution of map-reduce tasks
The JobTracker monitors TaskTracker nodes and their resource utilization, that is, how many tasks are currently running and how many map-reduce task slots are available, and decides whether a TaskTracker node needs to be marked as a blacklisted node, and so on
The JobTracker monitors the progress of jobs and, if a job or task fails, it automatically reinitializes the job or task on a different TaskTracker node
The JobTracker also keeps the history of the jobs executed on the cluster
TaskTracker is a per-node daemon responsible for the execution of map-reduce tasks. A TaskTracker node is configured to accept a number of map-reduce tasks from the JobTracker, that is, the total number of map-reduce tasks a TaskTracker can execute simultaneously. Its responsibilities are as follows:
The TaskTracker initializes a new JVM process to perform the MapReduce logic. Running a task in a separate JVM ensures that a task failure does not harm the health of the TaskTracker daemon.
The TaskTracker monitors these JVM processes and updates the task progress to the JobTracker at regular intervals.
The TaskTracker also sends a heartbeat signal and its current resource utilization metric (available task slots) to the JobTracker every few minutes.
Shortcomings of MapReduce v1
Though the Hadoop MapReduce framework was widely used, the following limitations were found with the framework:
Batch processing only: The resources across the cluster are tightly coupled with map-reduce programming. The framework does not support the integration of other data processing frameworks and forces everything to look like a MapReduce job. Emerging customer requirements demand support for real-time and near real-time processing of the data stored on the distributed file system.
Nonscalability and inefficiency: The MapReduce framework depends completely on the master daemon, that is, the JobTracker. It manages the cluster resources, the execution of jobs, and fault tolerance as well. It has been observed that Hadoop cluster performance degrades drastically when the cluster size increases above 4,000 nodes or the count of concurrent tasks crosses 40,000. The centralized handling of the job control flow resulted in endless scalability concerns for the scheduler.
Unavailability and unreliability: Availability and reliability are considered critical aspects of a framework such as Hadoop. A single point of failure for the MapReduce framework is the failure of the JobTracker daemon. The JobTracker manages the jobs and resources across the cluster. If it goes down, information related to running or queued jobs and the job history is lost, and the queued and running jobs are killed. The MapReduce v1 framework doesn't have any provision to recover the lost data or jobs.
Partitioning of resources: The MapReduce framework divides a job into multiple map and reduce tasks. The nodes running the TaskTracker daemon are considered resources, and the capability of a resource to execute MapReduce jobs is expressed as the number of map-reduce tasks it can execute simultaneously. The framework forced the cluster resources to be partitioned into map and reduce task slots, and such partitioning resulted in lower utilization of the cluster resources.
If you have a running Hadoop 1.x cluster, you can refer to the JobTracker web interface to view the map and reduce task slots of the active TaskTracker nodes. The link for the active TaskTracker list is as follows: http://JobTrackerHost:50030/machines.jsp?type=active Management of user logs and job resources: The user logs refer to the logs generated by a MapReduce job. Logs for MapReduce jobs. These logs can be used to validate the correctness of a job or to perform log analysis to tune up the job's performance. In MapReduce v1, the user logs are generated and stored on the local file system of the slave nodes. Accessing logs on the slaves is a pain as users might not have the permissions issued. Since logs were stored on the local file system of a slave, in case the disk goes down, the logs will be lost. A MapReduce job might require some extra resources for job execution. In the MapReduce v1 framework, the client copies job resources to the HDFS with the replication of 10. Accessing resources remotely or through HDFS is not efficient. Thus, there's a need for localization of resources and a robust framework to manage job resources. In January 2008, Arun C. Murthy logged a bug in JIRA against the MapReduce architecture, which resulted in a generic resource scheduler and a per job user-defined component that manages the application execution. You can see this at https://issues.apache.org/jira/browse/MAPREDUCE-279 An overview of YARN components YARN divides the responsibilities of JobTracker into separate components, each having a specified task to perform. In Hadoop-1, the JobTracker takes care of resource management, job scheduling, and job monitoring. YARN divides these responsibilities of JobTracker into ResourceManager and ApplicationMaster. Instead of TaskTracker, it uses NodeManager as the worker daemon for execution of map-reduce tasks. The ResourceManager and the NodeManager form the computation framework for YARN, and ApplicationMaster is an application-specific framework for application management.   ResourceManager A ResourceManager is a per cluster service that manages the scheduling of compute resources to applications. It optimizes cluster utilization in terms of memory, CPU cores, fairness, and SLAs. To allow different policy constraints, it has algorithms in terms of pluggable schedulers such as capacity and fair that allows resource allocation in a particular way. ResourceManager has two main components: Scheduler: This is a pure pluggable component that is only responsible for allocating resources to applications submitted to the cluster, applying constraint of capacities and queues. Scheduler does not provide any guarantee for job completion or monitoring, it only allocates the cluster resources governed by the nature of job and resource requirement. ApplicationsManager (AsM): This is a service used to manage application masters across the cluster that is responsible for accepting the application submission, providing the resources for application master to start, monitoring the application progress, and restart, in case of application failure. NodeManager The NodeManager is a per node worker service that is responsible for the execution of containers based on the node capacity. Node capacity is calculated based on the installed memory and the number of CPU cores. The NodeManager service sends a heartbeat signal to the ResourceManager to update its health status. The NodeManager service is similar to the TaskTracker service in MapReduce v1. 
NodeManager also sends the status to ResourceManager, which could be the status of the node on which it is running or the status of tasks executing on it. ApplicationMaster An ApplicationMaster is a per application framework-specific library that manages each instance of an application that runs within YARN. YARN treats ApplicationMaster as a third-party library responsible for negotiating the resources from the ResourceManager scheduler and works with NodeManager to execute the tasks. The ResourceManager allocates containers to the ApplicationMaster and these containers are then used to run the application-specific processes. ApplicationMaster also tracks the status of the application and monitors the progress of the containers. When the execution of a container gets complete, the ApplicationMaster unregisters the containers with the ResourceManager and unregisters itself after the execution of the application is complete. Container A container is a logical bundle of resources in terms of memory, CPU, disk, and so on that is bound to a particular node. In the first version of YARN, a container is equivalent to a block of memory. The ResourceManager scheduler service dynamically allocates resources as containers. A container grants rights to an ApplicationMaster to use a specific amount of resources of a specific host. An ApplicationMaster is considered as the first container of an application and it manages the execution of the application logic on allocated containers. The YARN architecture In the previous topic, we discussed the YARN components. Here we'll discuss the high-level architecture of YARN and look at how the components interact with each other. The ResourceManager service runs on the master node of the cluster. A YARN client submits an application to the ResourceManager. An application can be a single MapReduce job, a directed acyclic graph of jobs, a java application, or any shell script. The client also defines an ApplicationMaster and a command to start the ApplicationMaster on a node. The ApplicationManager service of resource manager will validate and accept the application request from the client. The scheduler service of resource manager will allocate a container for the ApplicationMaster on a node and the NodeManager service on that node will use the command to start the ApplicationMaster service. Each YARN application has a special container called ApplicationMaster. The ApplicationMaster container is the first container of an application. The ApplicationMaster requests resources from the ResourceManager. The RequestRequest will have the location of the node, memory, and CPU cores required. The ResourceManager will allocate the resources as containers on a set of nodes. The ApplicationMaster will connect to the NodeManager services and request NodeManager to start containers. The ApplicationMaster manages the execution of the containers and will notify the ResourceManager once the application execution is over. Application execution and progress monitoring is the responsibility of ApplicationMaster rather than ResourceManager. The NodeManager service runs on each slave of the YARN cluster. It is responsible for running application's containers. The resources specified for a container are taken from the NodeManager resources. Each NodeManager periodically updates ResourceManager for the set of available resources. The ResourceManager scheduler service uses this resource matrix to allocate new containers to ApplicationMaster or to start execution of a new application. 
How YARN satisfies big data needs We talked about the MapReduce v1 framework and some limitations of the framework. Let's now discuss how YARN solves these issues: Scalability and higher cluster utilization: Scalability is the ability of a software or product to implement well under an expanding workload. In YARN, the responsibility of resource management and job scheduling / monitoring is divided into separate daemons, allowing YARN daemons to scale the cluster without degrading the performance of the cluster. With a flexible and generic resource model in YARN, the scheduler handles an overall resource profile for each type of application. This structure makes the communication and storage of resource requests efficient for the scheduler resulting in higher cluster utilization. High availability for components: Fault tolerance is a core design principle for any multitenancy platform such as YARN. This responsibility is delegated to ResourceManager and ApplicationMaster. The application specific framework, ApplicationMaster, handles the failure of a container. The ResourceManager handles the failure of NodeManager and ApplicationMaster. Flexible resource model: In MapReduce v1, resources are defined as the number of map and reduce task slots available for the execution of a job. Every resource request cannot be mapped as map/reduce slots. In YARN, a resource-request is defined in terms of memory, CPU, locality, and so on. It results in a generic definition for a resource request by an application. The NodeManager node is the worker node and its capability is calculated based on the installed memory and cores of the CPU. Multiple data processing algorithms: The MapReduce framework is bounded to batch processing only. YARN is developed with a need to perform a wide variety of data processing over the data stored over Hadoop HDFS. YARN is a framework for generic resource management and allows users to execute multiple data processing algorithms over the data. Log aggregation and resource localization: As discussed earlier, accessing and managing user logs is difficult in the Hadoop 1.x framework. To manage user logs, YARN introduced a concept of log aggregation. In YARN, once the application is finished, the NodeManager service aggregates the user logs related to an application and these aggregated logs are written out to a single log file in HDFS. To access the logs, users can use either the YARN command-line options, YARN web interface, or can fetch directly from HDFS. A container might require external resources such as jars, files, or scripts on a local file system. These are made available to containers before they are started. An ApplicationMaster defines a list of resources that are required to run the containers. For efficient disk utilization and access security, the NodeManager ensures the availability of specified resources and their deletion after use. Projects powered by YARN Efficient and reliable resource management is a basic need of a distributed application framework. YARN provides a generic resource management framework to support data analysis through multiple data processing algorithms. There are a lot of projects that have started using YARN for resource management. We've listed a few of these projects here and discussed how YARN integration solves their business requirements: Apache Giraph: Giraph is a framework for offline batch processing of semistructured graph data stored using Hadoop. 
With the Hadoop 1.x version, Giraph had no control over the scheduling policies, heap memory of the mappers, and locality awareness for the running job. Also, defining a Giraph job on the basis of mappers / reducers slots was a bottleneck. YARN's flexible resource allocation model, locality awareness principle, and application master framework ease the Giraph's job management and resource allocation to tasks. Apache Spark: Spark enables iterative data processing and machine learning algorithms to perform analysis over data available through HDFS, HBase, or other storage systems. Spark uses YARN's resource management capabilities and framework to submit the DAG of a job. The spark user can focus more on data analytics' use cases rather than how spark is integrated with Hadoop or how jobs are executed. Some other projects powered by YARN are as follows: MapReduce: https://issues.apache.org/jira/browse/MAPREDUCE-279 Giraph: https://issues.apache.org/jira/browse/GIRAPH-13 Spark: http://spark.apache.org/ OpenMPI: https://issues.apache.org/jira/browse/MAPREDUCE-2911 HAMA: https://issues.apache.org/jira/browse/HAMA-431 HBase: https://issues.apache.org/jira/browse/HBASE-4329 Storm: http://hortonworks.com/labs/storm/ A page on Hadoop wiki lists a number of projects/applications that are migrating to or using YARN as their resource management tool. You can see this at http://wiki.apache.org/hadoop/PoweredByYarn. Summary This article covered an introduction to YARN, its components, architecture, and different projects powered by YARN. It also explained how YARN solves big data needs. Resources for Article: Further resources on this subject: YARN and Hadoop[article] Introduction to Hadoop[article] Hive in Hadoop [article]

Building a "Click-to-Go" Robot

Packt
28 Aug 2015
16 min read
 In this article by Özen Özkaya and Giray Yıllıkçı, author of the book Arduino Computer Vision Programming, you will learn how to approach computer vision applications, how to divide an application development process into basic steps, how to realize these design steps and how to combine a vision system with the Arduino. Now it is time to connect all the pieces into one! In this article you will learn about building a vision-assisted robot which can go to any point you want within the boundaries of the camera's sight. In this scenario there will be a camera attached to the ceiling and, once you get the video stream from the robot and click on any place in the view, the robot will go there. This application will give you an all-in-one development application. Before getting started, let's try to draw the application scheme and define the potential steps. We want to build a vision-enabled robot which can be controlled via a camera attached to the ceiling and, when we click on any point in the camera view, we want our robot to go to this specific point. This operation requires a mobile robot that can communicate with the vision system. The vision system should be able to detect or recognize the robot and calculate the position and orientation of the robot. The vision system should also give us the opportunity to click on any point in the view and it should calculate the path and the robot movements to get to the destination. This scheme requires a communication line between the robot and the vision controller. In the following illustration, you can see the physical scheme of the application setup on the left hand side and the user application window on the right hand side: After interpreting the application scheme, the next step is to divide the application into small steps by using the computer vision approach. In the data acquisition phase, we'll only use the scene's video stream. There won't be an external sensor on the robot because, for this application, we don't need one. Camera selection is important and the camera distance (the height from the robot plane) should be enough to see the whole area. We'll use the blue and red circles above the robot to detect the robot and calculate its orientation. We don't need smaller details. A resolution of about 640x480 pixels is sufficient for a camera distance of 120 cm. We need an RGB camera stream because we'll use the color properties of the circles. We will use the Logitech C110, which is an affordable webcam. Any other OpenCV compatible webcam will work because this application is not very demanding in terms of vision input. If you need more cable length you can use a USB extension cable. In the preprocessing phase, the first step is to remove the small details from the surface. Blurring is a simple and effective operation for this purpose. If you need to, you can resize your input image to reduce the image size and processing time. Do not forget that, if you resize to too small a resolution, you won't be able to extract useful information. The following picture is of the Logitech C110 webcam: The next step is processing. There are two main steps in this phase. The first step is to detect the circles in the image. The second step is to calculate the robot orientation and the path to the destination point. The robot can then follow the path and reach its destination. In color processing with which we can apply color filters to the image to get the image masks of the red circle and the blue circle, as shown in the following picture. 
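To make the color-filtering step concrete, here is a minimal Python/OpenCV sketch. The book's own vision code is written in C++, so treat this purely as an illustration; the HSV threshold values are assumptions that must be tuned for your camera and lighting, not values from the book.

import cv2
import numpy as np

def circle_masks(frame_bgr):
    """Return binary masks for the blue and red markers in a BGR frame."""
    # Blur first to suppress small surface details, as described above.
    blurred = cv2.GaussianBlur(frame_bgr, (9, 9), 0)
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)

    # Illustrative hue/saturation/value ranges; tune for your setup.
    blue_mask = cv2.inRange(hsv, (100, 120, 70), (130, 255, 255))
    # Red wraps around hue 0, so combine two ranges.
    red_mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255)) | \
               cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    return blue_mask, red_mask

def mask_center(mask):
    """Centroid of the largest blob in a mask, or None if nothing is found."""
    # Index [-2] picks the contour list in both OpenCV 3.x and 4.x.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    m = cv2.moments(max(contours, key=cv2.contourArea))
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

The two centroids returned here are exactly what the blob-analysis and orientation calculation described next operate on.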
Then we can use contour detection or blob analysis to detect the circles and extract useful features. It is important to keep it simple and logical: Blob analysis detects the bounding boxes of two circles on the robot and, if we draw a line between the centers of the circles, once we calculate the line angle, we will get the orientation of the robot itself. The mid-point of this line will be the center of the robot. If we draw a line from the center of the robot to the destination point we obtain the straightest route. The circles on the robot can also be detected by using the Hough transform for circles but, because it is a relatively slow algorithm and it is hard to extract image statistics from the results, the blob analysis-based approach is better. Another approach is by using the SURF, SIFT or ORB features. But these methods probably won't provide fast real-time behavior, so blob analysis will probably work better. After detecting blobs, we can apply post-filtering to remove the unwanted blobs. We can use the diameter of the circles, the area of the bounding box, and the color information, to filter the unwanted blobs. By using the properties of the blobs (extracted features), it is possible to detect or recognize the circles, and then the robot. To be able to check if the robot has reached the destination or not, a distance calculation from the center of the robot to the destination point would be useful. In this scenario, the robot will be detected by our vision controller. Detecting the center of the robot is sufficient to track the robot. Once we calculate the robot's position and orientation, we can combine this information with the distance and orientation to the destination point and we can send the robot the commands to move it! Efficient planning algorithms can be applied in this phase but, we'll implement a simple path planning approach. Firstly, the robot will orientate itself towards the destination point by turning right or left and then it will go forward to reach the destination. This scenario will work for scenarios without obstacles. If you want to extend the application for a complex environment with obstacles, you should implement an obstacle detection mechanism and an efficient path planning algorithm. We can send the commands such as Left!, Right!, Go!, or Stop! to the robot over a wireless line. RF communication is an efficient solution for this problem. In this scenario, we need two NRF24L01 modules—the first module is connected to the robot controller and the other is connected to the vision controller. The Arduino is the perfect means to control the robot and communicate with the vision controller. The vision controller can be built on any hardware platform such as a PC, tablet, or a smartphone. The vision controller application can be implemented on lots of operating systems as OpenCV is platform-independent. We preferred Windows and a laptop to run our vision controller application. As you can see, we have divided our application into small and easy-to-implement parts. Now it is time to build them all! Building a robot It is time to explain how to build our Click-to-Go robot. Before going any further we would like to boldly say that robotic projects can teach us the fundamental fields of science such as mechanics, electronics, and programming. As we go through the building process of our Click-to-Go robot, you will see that we have kept it as simple as possible. 
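Before we start building, here is a small Python sketch of the "orient, then go forward" decision described above. The Left!, Right!, Go!, and Stop! strings are the same commands the Arduino code later in this article expects; the angle tolerance, the stop radius, and the sign convention for turning are illustrative assumptions that depend on your image coordinate setup.

import math

def steering_command(robot_center, robot_heading_rad, target,
                     angle_tol_rad=math.radians(10), stop_radius_px=15):
    """Turn-then-go planner: returns one of 'Left!', 'Right!', 'Go!', 'Stop!'."""
    dx = target[0] - robot_center[0]
    dy = target[1] - robot_center[1]
    if math.hypot(dx, dy) < stop_radius_px:
        return "Stop!"
    # Signed difference between the desired bearing and the current heading,
    # wrapped to [-pi, pi].
    error = math.atan2(dy, dx) - robot_heading_rad
    error = math.atan2(math.sin(error), math.cos(error))
    if abs(error) > angle_tol_rad:
        # Swap 'Right!'/'Left!' here if your image y-axis convention differs.
        return "Right!" if error > 0 else "Left!"
    return "Go!"

In the full application, the returned string is written to the transmitter Arduino over the serial line, which forwards it to the robot over the nRF24L01 link.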
Moreover, instead of buying ready-to-use robot kits, we have built our own simple and robust robot. Of course, if you are planning to buy a robot kit or already have a kit available, you can simply adapt your existing robot into this project. Our robot design is relatively simple in terms of mechanics. We will use only a box-shaped container platform, two gear motors with two individual wheels, a battery to drive the motors, one nRF24L01 Radio Frequency (RF) transceiver module, a bunch of jumper wires, an L293D IC and, of course, one Arduino Uno board module. We will use one more nRF24L01 and one more Arduino Uno for the vision controller communication circuit. Our Click-to-Go robot will be operated by a simplified version of a differential drive. A differential drive can be summarized as a relative speed change on the wheels, which assigns a direction to the robot. In other words, if both wheels spin at the same rate, the robot goes forward. To drive in reverse, the wheels spin in the opposite direction. To turn left, the left wheel turns backwards and the right wheel stays still or turns forwards. Similarly, to turn right, the right wheel turns backwards and the left stays still or turns forwards. You can get curved paths by varying the rotation speeds of the wheels. Yet, to cover every aspect of this comprehensive project, we will drive the wheels of both the motors forward to go forwards. To turn left, the left wheel stays still and the right wheel turns forward. Symmetrically, to turn right, the right motor stays still and the left motor runs forward. We will not use running motors in a reverse direction to go backwards. Instead, we will change the direction of the robot by turning right or left. Building mechanics As we stated earlier, the mechanics of the robot are fairly simple. First of all we need a small box-shaped container to use as both a rigid surface and the storage for the battery and electronics. For this purpose, we will use a simple plywood box. We will attach gear motors in front of the plywood box and any kind of support surface to the bottom of the box. As can be seen in the following picture, we used a small wooden rod to support the back of the robot to level the box: If you think that the wooden rod support is dragging, we recommend adding a small ball support similar to Pololu's ball caster, shown at https://www.pololu.com/product/950. It is not a very expensive component and it significantly improves the mobility of the robot. You may want to drill two holes next to the motor wirings to keep the platform tidy. The easiest way to attach the motors and the support rod is by using two-sided tape. Just make sure that the tape is not too thin. It is much better to use two-sided foamy tape. The topside of the robot can be covered with a black shell to enhance the contrast between the red and blue circles. We will use these circles to ascertain the orientation of the robot during the operation, as mentioned earlier. For now, don't worry too much about this detail. Just be aware that we need to cover the top of the robot with a flat surface. We will explain in detail on how these red and blue circles are used. It is worth mentioning that we used large water bottle lids. It is better to use matt surfaces instead of shiny surfaces to avoid glare in the image. The finished Click-to-Go robot should be similar to the robot shown in the following picture. 
The robot's head is on the side with the red circle: As we have now covered building the mechanics of our robot we can move on to building the electronics. Building the electronics We will use two separate Arduino Unos for this vision-enabled robot project, one each for the robot and the transmitter system. The electronic setup needs a little bit more attention than the mechanics. The electronic components of the robot and the transmitter units are similar. However, the robot needs more work. We have selected nRF24L01 modules for the wireless communication module,. These modules are reliable and easy to find from both the Internet and local hobby stores. It is possible to use any pair of wireless connectivity modules but, for this project, we will stick with nRF24L01 modules, as shown in this picture: For the driving motors we will need to use a quadruple half-H driver, L293D. Again, every electronic shop should have these ICs. As a reminder, you may need to buy a couple of spare L293D ICs in case you burn the IC by mistake. Following is the picture of the L293D IC: We will need a bunch of jumper wires to connect the components together. It is nice to have a small breadboard for the robot/receiver, to wire the L293D. The transmitter part is very simple so a breadboard is not essential. Robot/receiver and transmitter drawings The drawings of both the receiver and the transmitter have two common modules: Arduino Uno and nRF24L01 connectivity modules. The connections of the nRF24L01 modules on both sides are the same. In addition to these connectivity modules, for the receiver, we need to put some effort into connecting the L293D IC and the battery to power up the motors. In the following picture, we can see a drawing of the transmitter. As it will always be connected to the OpenCV platform via the USB cable, there is no need to feed the system with an external battery: As shown in the following picture of the receiver and the robot, it is a good idea to separate the motor battery from the battery that feeds the Arduino Uno board because the motors may draw high loads or create high loads, which can easily damage the Arduino board's pin outs. Another reason is to keep the Arduino working even if the battery motor has drained. Separating the feeder batteries is a very good practice to follow if you are planning to use more than one 12V battery. To keep everything safe, we fed the Arduino Uno with a 6V battery pack and the motors with a 9V battery: Drawings of receiver systems can be little bit confusing and lead to errors. It is a good idea to open the drawings and investigate how the connections are made by using Fritzing. You can download the Fritzing drawings of this project from https://github.com/ozenozkaya/click_to_go_robot_drawings. To download the Fritzing application, visit the Fritzing download page: http://fritzing.org/download/ Building the robot controller and communications We are now ready to go through the software implementation of the robot and the transmitter. Basically what we are doing here is building the required connectivity to send data to the remote robot continuously from OpenCV via a transmitter. OpenCV will send commands to the transmitter through a USB cable to the first Arduino board, which will then send the data to the unit on the robot. And it will send this data to the remote robot over the RF module. Follow these steps: Before explaining the code, we need to import the RF24 library. 
To download RF24 library drawings please go to the GitHub link at https://github.com/maniacbug/RF24. After downloading the library, go to Sketch | Include Library | Add .ZIP Library… to include the library in the Arduino IDE environment. After clicking Add .ZIP Library…, a window will appear. Go into the downloads directory and select the RF24-master folder that you just downloaded. Now you are ready to use the RF24 library. As a reminder, it is pretty much the same to include a library in Arduino IDE as on other platforms. It is time to move on to the explanation of the code! It is important to mention that we use the same code for both the robot and the transmitter, with a small trick! The same code works differently for the robot and the transmitter. Now, let's make everything simpler during the code explanation. The receiver mode needs to ground an analog 4 pin. The idea behind the operation is simple; we are setting the role_pin to high through its internal pull-up resistor. So, it will read high even if you don't connect it, but you can still safely connect it to ground and it will read low. Basically, the analog 4 pin reads 0 if the there is a connection with a ground pin. On the other hand, if there is no connection to the ground, the analog 4 pin value is kept as 1. By doing this at the beginning, we determine the role of the board and can use the same code on both sides. Here is the code: #include <SPI.h> #include "nRF24L01.h" #include "RF24.h" #define MOTOR_PIN_1 3 #define MOTOR_PIN_2 5 #define MOTOR_PIN_3 6 #define MOTOR_PIN_4 7 #define ENABLE_PIN 4 #define SPI_ENABLE_PIN 9 #define SPI_SELECT_PIN 10 const int role_pin = A4; typedef enum {transmitter = 1, receiver} e_role; unsigned long motor_value[2]; String input_string = ""; boolean string_complete = false; RF24 radio(SPI_ENABLE_PIN, SPI_SELECT_PIN); const uint64_t pipes[2] = { 0xF0F0F0F0E1LL, 0xF0F0F0F0D2LL }; e_role role = receiver; void setup() { pinMode(role_pin, INPUT); digitalWrite(role_pin, HIGH); delay(20); radio.begin(); radio.setRetries(15, 15); Serial.begin(9600); Serial.println(" Setup Finished"); if (digitalRead(role_pin)) { Serial.println(digitalRead(role_pin)); role = transmitter; } else { Serial.println(digitalRead(role_pin)); role = receiver; } if (role == transmitter) { radio.openWritingPipe(pipes[0]); radio.openReadingPipe(1, pipes[1]); } else { pinMode(MOTOR_PIN_1, OUTPUT); pinMode(MOTOR_PIN_2, OUTPUT); pinMode(MOTOR_PIN_3, OUTPUT); pinMode(MOTOR_PIN_4, OUTPUT); pinMode(ENABLE_PIN, OUTPUT); digitalWrite(ENABLE_PIN, HIGH); radio.openWritingPipe(pipes[1]); radio.openReadingPipe(1, pipes[0]); } radio.startListening(); } void loop() { // TRANSMITTER CODE BLOCK // if (role == transmitter) { Serial.println("Transmitter"); if (string_complete) { if (input_string == "Right!") { motor_value[0] = 0; motor_value[1] = 120; } else if (input_string == "Left!") { motor_value[0] = 120; motor_value[1] = 0; } else if (input_string == "Go!") { motor_value[0] = 120; motor_value[1] = 110; } else { motor_value[0] = 0; motor_value[1] = 0; } input_string = ""; string_complete = false; } radio.stopListening(); radio.write(motor_value, 2 * sizeof(unsigned long)); radio.startListening(); delay(20); } // RECEIVER CODE BLOCK // if (role == receiver) { Serial.println("Receiver"); if (radio.available()) { bool done = false; while (!done) { done = radio.read(motor_value, 2 * sizeof(unsigned long)); delay(20); } Serial.println(motor_value[0]); Serial.println(motor_value[1]); analogWrite(MOTOR_PIN_1, motor_value[1]); 
digitalWrite(MOTOR_PIN_2, LOW); analogWrite(MOTOR_PIN_3, motor_value[0]); digitalWrite(MOTOR_PIN_4 , LOW); radio.stopListening(); radio.startListening(); } } } void serialEvent() { while (Serial.available()) { // get the new byte: char inChar = (char)Serial.read(); // add it to the inputString: input_string += inChar; // if the incoming character is a newline, set a flag // so the main loop can do something about it: if (inChar == '!' || inChar == '?') { string_complete = true; Serial.print("data_received"); } } } This example code is taken from one of the examples in the RF24 library. We have changed it in order to serve our needs in this project. The original example can be found in the RF24-master/Examples/pingpair directory. Summary We have combined everything we have learned up to now and built an all-in-one application. By designing and building the Click-to-Go robot from scratch you have embraced the concepts. You can see that the vision approach very well, even for complex applications. You now know how to divide a computer vision application into small pieces, how to design and implement each design step, and how to efficiently use the tools you have. Resources for Article: Further resources on this subject: Getting Started with Arduino[article] Arduino Development[article] Programmable DC Motor Controller with an LCD [article]

Rendering Stereoscopic 3D Models using OpenGL

Packt
24 Aug 2015
8 min read
In this article, by Raymond C. H. Lo and William C. Y. Lo, authors of the book OpenGL Data Visualization Cookbook, we will demonstrate how to visualize data with stunning stereoscopic 3D technology using OpenGL. Stereoscopic 3D devices are becoming increasingly popular, and the latest generation's wearable computing devices (such as the 3D vision glasses from NVIDIA, Epson, and more recently, the augmented reality 3D glasses from Meta) can now support this feature natively. The ability to visualize data in a stereoscopic 3D environment provides a powerful and highly intuitive platform for the interactive display of data in many applications. For example, we may acquire data from the 3D scan of a model (such as in architecture, engineering, and dentistry or medicine) and would like to visualize or manipulate 3D objects in real time. Unfortunately, OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we will integrate a new library named Open Asset Import Library (Assimp) into our code. The main dependencies include the GLFW library that requires OpenGL version 3.2 and higher. (For more resources related to this topic, see here.) Stereoscopic 3D rendering 3D television and 3D glasses are becoming much more prevalent with the latest trends in consumer electronics and technological advances in wearable computing. In the market, there are currently many hardware options that allow us to visualize information with stereoscopic 3D technology. One common format is side-by-side 3D, which is supported by many 3D glasses as each eye sees an image of the same scene from a different perspective. In OpenGL, creating side-by-side 3D rendering requires asymmetric adjustment as well as viewport adjustment (that is, the area to be rendered) – asymmetric frustum parallel projection or equivalently to lens-shift in photography. This technique introduces no vertical parallax and widely adopted in the stereoscopic rendering. To illustrate this concept, the following diagram shows the geometry of the scene that a user sees from the right eye: The intraocular distance (IOD) is the distance between two eyes. As we can see from the diagram, the Frustum Shift represents the amount of skew/shift for asymmetric frustrum adjustment. Similarly, for the left eye image, we perform the transformation with a mirrored setting. The implementation of this setup is described in the next section. How to do it... The following code illustrates the steps to construct the projection and view matrices for stereoscopic 3D visualization. The code uses the intraocular distance, the distance of the image plane, and the distance of the near clipping plane to compute the appropriate frustum shifts value. 
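Using the same variable names as the code that follows, the asymmetric (lens-shift) frustum for each eye is built as

$$\text{frustumshift} = \frac{IOD}{2}\cdot\frac{z_{near}}{z_{image}},\qquad top = z_{near}\tan\!\left(\frac{fov}{2}\right),\qquad left/right = \mp\, a\cdot top + d\cdot\text{frustumshift},$$

where $a$ is the aspect ratio, $z_{image}$ is the distance of the virtual image plane (depthZ in the code), and $d = +1$ for the left eye and $d = -1$ for the right eye.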
In the source file, common/controls.cpp, we add the implementation for the stereo 3D matrix setup: void computeStereoViewProjectionMatrices(GLFWwindow* window, float IOD, float depthZ, bool left_eye){ int width, height; glfwGetWindowSize(window, &width, &height); //up vector glm::vec3 up = glm::vec3(0,-1,0); glm::vec3 direction_z(0, 0, -1); //mirror the parameters with the right eye float left_right_direction = -1.0f; if(left_eye) left_right_direction = 1.0f; float aspect_ratio = (float)width/(float)height; float nearZ = 1.0f; float farZ = 100.0f; double frustumshift = (IOD/2)*nearZ/depthZ; float top = tan(g_initial_fov/2)*nearZ; float right = aspect_ratio*top+frustumshift*left_right_direction; //half screen float left = -aspect_ratio*top+frustumshift*left_right_direction; float bottom = -top; g_projection_matrix = glm::frustum(left, right, bottom, top, nearZ, farZ); // update the view matrix g_view_matrix = glm::lookAt( g_position-direction_z+ glm::vec3(left_right_direction*IOD/2, 0, 0), //eye position g_position+ glm::vec3(left_right_direction*IOD/2, 0, 0), //centre position up //up direction ); In the rendering loop in main.cpp, we define the viewports for each eye (left and right) and set up the projection and view matrices accordingly. For each eye, we translate our camera position by half of the intraocular distance, as illustrated in the previous figure: if(stereo){ //draw the LEFT eye, left half of the screen glViewport(0, 0, width/2, height); //computes the MVP matrix from the IOD and virtual image plane distance computeStereoViewProjectionMatrices(g_window, IOD, depthZ, true); //gets the View and Model Matrix and apply to the rendering glm::mat4 projection_matrix = getProjectionMatrix(); glm::mat4 view_matrix = getViewMatrix(); glm::mat4 model_matrix = glm::mat4(1.0); model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f)); glm::mat4 mvp = projection_matrix * view_matrix * model_matrix; //sends our transformation to the currently bound shader, //in the "MVP" uniform variable glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]); //render scene, with different drawing modes if(drawTriangles) obj_loader->draw(GL_TRIANGLES); if(drawPoints) obj_loader->draw(GL_POINTS); if(drawLines) obj_loader->draw(GL_LINES); //Draw the RIGHT eye, right half of the screen glViewport(width/2, 0, width/2, height); computeStereoViewProjectionMatrices(g_window, IOD, depthZ, false); projection_matrix = getProjectionMatrix(); view_matrix = getViewMatrix(); model_matrix = glm::mat4(1.0); model_matrix = glm::translate(model_matrix, glm::vec3(0.0f, 0.0f, -depthZ)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateY, glm::vec3(0.0f, 1.0f, 0.0f)); model_matrix = glm::rotate(model_matrix, glm::pi<float>() * rotateX, glm::vec3(1.0f, 0.0f, 0.0f)); mvp = projection_matrix * view_matrix * model_matrix; glUniformMatrix4fv(matrix_id, 1, GL_FALSE, &mvp[0][0]); if(drawTriangles) obj_loader->draw(GL_TRIANGLES); if(drawPoints) obj_loader->draw(GL_POINTS); if(drawLines) obj_loader->draw(GL_LINES); } The final rendering result consists of two separate images on each side of the display, and note that each image is compressed horizontally by a scaling factor of two. 
For some display systems, each side of the display is required to preserve the same aspect ratio depending on the specifications of the display. Here are the final screenshots of the same models in true 3D using stereoscopic 3D rendering: Here's the rendering of the architectural model in stereoscopic 3D: How it works... The stereoscopic 3D rendering technique is based on the parallel axis and asymmetric frustum perspective projection principle. In simpler terms, we rendered a separate image for each eye as if the object was seen at a different eye position but viewed on the same plane. Parameters such as the intraocular distance and frustum shift can be dynamically adjusted to provide the desired 3D stereo effects. For example, by increasing or decreasing the frustum asymmetry parameter, the object will appear to be moved in front or behind the plane of the screen. By default, the zero parallax plane is set to the middle of the view volume. That is, the object is set up so that the center position of the object is positioned at the screen level, and some parts of the object will appear in front of or behind the screen. By increasing the frustum asymmetry (that is, positive parallax), the scene will appear to be pushed behind the screen. Likewise, by decreasing the frustum asymmetry (that is, negative parallax), the scene will appear to be pulled in front of the screen. The glm::frustum function sets up the projection matrix, and we implemented the asymmetric frustum projection concept illustrated in the drawing. Then, we use the glm::lookAt function to adjust the eye position based on the IOP value we have selected. To project the images side by side, we use the glViewport function to constrain the area within which the graphics can be rendered. The function basically performs an affine transformation (that is, scale and translation) which maps the normalized device coordinate to the window coordinate. Note that the final result is a side-by-side image in which the graphic is scaled by a factor of two vertically (or compressed horizontally). Depending on the hardware configuration, we may need to adjust the aspect ratio. The current implementation supports side-by-side 3D, which is commonly used in most wearable Augmented Reality (AR) or Virtual Reality (VR) glasses. Fundamentally, the rendering technique, namely the asymmetric frustum perspective projection described in our article, is platform-independent. For example, we have successfully tested our implementation on the Meta 1 Developer Kit (https://www.getameta.com/products) and rendered the final results on the optical see-through stereoscopic 3D display: Here is the front view of the Meta 1 Developer Kit, showing the optical see-through stereoscopic 3D display and 3D range-sensing camera: The result is shown as follows, with the stereoscopic 3D graphics rendered onto the real world (which forms the basis of augmented reality): See also In addition, we can easily extend our code to support shutter glasses-based 3D monitors by utilizing the Quad Buffered OpenGL APIs (refer to the GL_BACK_RIGHT and GL_BACK_LEFT flags in the glDrawBuffer function). Unfortunately, such 3D formats require specific hardware synchronization and often require higher frame rate display (for example, 120Hz) as well as a professional graphics card. Further information on how to implement stereoscopic 3D in your application can be found at http://www.nvidia.com/content/GTC-2010/pdfs/2010_GTC2010.pdf. 
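The article's implementation is in C++, but the quad-buffered idea mentioned above can be sketched in a few lines of Python with PyOpenGL. Here, draw_scene and compute_mvp stand in for your own rendering and matrix code (they are not part of the article), and the OpenGL context is assumed to have been created with quad-buffered stereo enabled.

from OpenGL.GL import (glDrawBuffer, glClear, GL_BACK_LEFT, GL_BACK_RIGHT,
                       GL_COLOR_BUFFER_BIT, GL_DEPTH_BUFFER_BIT)

def render_stereo_frame(draw_scene, compute_mvp):
    # Render the same scene twice, once into each back buffer of a
    # quad-buffered (shutter-glasses) stereo context.
    for buffer_id, left_eye in ((GL_BACK_LEFT, True), (GL_BACK_RIGHT, False)):
        glDrawBuffer(buffer_id)
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
        draw_scene(compute_mvp(left_eye))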
Summary In this article, we covered how to visualize data with stunning stereoscopic 3D technology using OpenGL. OpenGL does not provide any mechanism to load, save, or manipulate 3D models. Thus, to support this, we have integrated a new library named Assimp into the code. Resources for Article: Further resources on this subject: Organizing a Virtual Filesystem [article] Using OpenCL [article] Introduction to Modern OpenGL [article]

Identifying Big Data Evidence in Hadoop

Packt
21 Aug 2015
13 min read
In this article by Joe Sremack, author of the book Big Data Forensics, we will cover the following topics: An overview of how to identify Big Data forensic evidence Techniques for previewing Hadoop data (For more resources related to this topic, see here.) Hadoop and other Big Data systems pose unique challenges to forensic investigators. Hadoop clusters are distributed systems with voluminous data storage, complex data processing, and data that is split and made redundant at the data block level. Unlike with traditional forensics, performing forensics on Hadoop using the traditional methods is not always feasible. Instead, forensic investigators, experts, and legal professionals—such as attorneys and court officials—need to understand how forensics is performed against these complex systems. The first step in a forensic investigation is to identify the evidence. In this article, several of the concepts involved in identifying forensic evidence from Hadoop and its applications are covered. Identifying forensic evidence is a complex process for any type of investigation. It involves surveying a set of possible sources of evidence and determining which sources warrant collection. Data in any organization's systems is rarely well organized or documented. Investigators will need to take a set of investigation requirements and determine which data need to be collected. This requires a few first steps: Properly reviewing system and data documentation. Interviewing staff. Locating backup and non-centralized data repositories. Previewing data. The process of identifying Big Data evidence is made difficult by the large volume of data, distributed filesystems, the numerous types of data, and the potential for large-scale redundancy in evidence. Big Data solutions are also unique in that evidence can reside in different layers. Within Hadoop, evidence can take on multiple forms—such as file stored in the Hadoop Distributed File System (HDFS) to data extracted from application. To properly identify the evidence in Hadoop, multiple layers are examined. While all the data may reside in HDFS, the form may differ in a Hadoop application (for example, HBase), or the data may be more easily extracted to a viable format through HDFS using an application (such as Pig or Sqoop). Identifying Big Data evidence can also be complicated by redundancies caused by: Systems that input to or receive output from Big Data systems Archived systems that may have previously stored the evidence in the Big Data system A primary goal of identifying evidence is to capture all relevant evidence while minimizing redundant information. Outsiders looking at a company's data needs may assume that identifying information is as simple as asking several individuals where the data resides. 
In reality, the process is much more complicated for a number of possible reasons: The organization may be an adverse party and cannot be trusted to provide reliable information about the data The organization is large and no single person knows where all data is stored and what the contents of the data are The organization is divided into business units with no two business units knowing what data the other one stores Data is stored with a third-party data hosting provider IT staff may know where data and systems reside, but only the business users know the type of content the data stores For example, one might assume a pharmaceutical sales company would have an internal system structured with the following attributes: A division where the data is collected from a sales database An HR department database containing employee compensation, performance, and retention information A database of customer demographic information An accounting department database to assess what costs are associated with each sale In such a system, that data is then cleanly unified and compelling analyses are created to drive sales. In reality, an investigator will probably find that the Big Data sales system is actually comprised of a larger set of data that originates inside and outside the organization. There may be a collection of spreadsheets on sales employees' desktops and laptops, along with some of the older versions on backup tapes and file server shared folders. There may be a new Salesforce database implemented two years ago that is incomplete and is actually the replacement for a previous database, which was custom-developed and used by 75% of employees. A Hadoop instance running HBase for analysis receives a filtered set of data from social media feeds, the Salesforce database, and sales reports. All of these data sources may be managed by different teams, so identifying how to collect this information requires a series of steps to isolate the relevant information. The problem for large—or even midsize—companies is much more difficult than our pharmaceutical sales company example. Simply creating a map of every data source and the contents of those systems could require weeks of in-depth interviews with key business owners and staff. Several departments may have their own databases and Big Data solutions that may or may not be housed in a centralized repository. Backups for these systems could be located anywhere. Data retention policies will vary by department—and most likely by system. Data warehouses and other aggregators may contain important information that will not show themselves through normal interviews with staff. These data warehouses and aggregators may have previously generated reports that could serve as valuable reference points for future analysis; however, all data may not be available online, and some data may be inaccessible. In such cases, the company's data will most likely reside in off-site servers maintained by an outsourcing vendor, or worse, in a cloud-based solution. Big Data evidence can be intertwined with non-Big Data evidence. Email, document files, and other evidence can be extremely valuable for performing an investigation. The process for identifying Big Data evidence is very similar to the process for identifying other evidence, so the identification process described in this book can be carried out in conjunction with identifying other evidence. 
An important consideration for investigators to keep in mind is whether Big Data evidence should be collected (that is, determining whether it is relevant or if the same evidence can be collected more easily from other non-Big Data systems). Investigators must also consider whether evidence needs to be collected to meet the requirements of an investigation. The following figure illustrates the process for identifying Big Data evidence: Initial Steps The process for identifying evidence is: Examining requirements Examining the organization's system architecture Determining the kinds of data in each system Assessing which systems to collect In the book, Big Data Forensics, the topics of examining requirements and examining the organization's system architecture are covered in detail. The purpose of these two steps is to take the requirements of the investigation and match those to known data sources. From this, the investigator can begin to document which data sources should be examined and what types of data may be relevant. Assessing data viability Assessing the viability of data serves several purposes. It can: Allow the investigator to identify which data sources are potentially relevant Yield information that can corroborate the interview and documentation review information Highlight data limitations or gaps Provide the investigator with information to create a better data collection plan Up until this point in the investigation, the investigator has only gathered information about the data. Previewing and assessing samples of the data gives the investigator the chance to actually see what information is contained in the data and determine which data sources can meet the requirements of the investigation. Assessing the viability and relevance of data in a Big Data forensic investigation is different from that of a traditional digital forensic investigation. In a traditional digital forensic investigation, the data is typically not previewed out of fear of altering the data or metadata. With Big Data, however, the data can be previewed in some situations where metadata is not relevant or available. This factor opens up the opportunity for a forensic investigator to preview data when identifying which data should be collected. The main considerations for each source of data include the following: Data quality Data completeness Supporting documentation Validating the collected data Previous systems where the data resided How the data enter and leave the system The available formats for extraction How well the data meet the data requirements There are several methods for previewing data. The first is to review a data extract or the results of a query—or collect sample text files that are stored in Hadoop. This method allows the investigator to determine the types of information available and how the information is represented in the data. In highly complex systems consisting of thousands of data sources, this may not be feasible or requires a significant investment of time and effort. The second method is to review reports or canned query output that were derived from the data. Some Big Data solutions are designed with reporting applications connected to the Big Data system. These reports are a powerful tool, enabling an investigator to quickly gain an understanding of the contents of the system without requiring much up-front effort to gain access to the systems. Data retention policies and data purge schedules should be reviewed and considered in this step as well. 
Given the large volume of data involved, many organizations routinely purge data after a certain period of time. Data purging can mean the archival of data to near-line or offline storage, or it can mean the destruction of old data without backup. When data is archived, the investigator should also determine whether any of the data in near-line or offline backup media needs to be collected or if the live system data is sufficient. Regardless, the investigator should determine what the next purge cycle is and whether that necessitates an expedited collection to prevent the loss of critical information. Additionally, the investigator should determine whether the organization should implement a litigation hold, which halts data purging during the investigation. When data is purged without backup, the investigator must determine:

- How the purge affects the investigation
- When the data needs to be collected
- Whether supplemental data sources must be collected to account for the lost data (for example, reports previously created from the purged data or other systems that created or received the purged data)

Identifying HDFS evidence

HDFS evidence can be identified in a number of ways. In some cases, the investigator does not want to preview the data, to retain the integrity of the metadata. In other cases, the investigator is only interested in collecting a subset of the data. Limiting the data can be necessary when the data volume prohibits a complete collection or forensically imaging the entire cluster is impossible. The primary methods for identifying HDFS evidence are to:

- Generate directory listings from the cluster
- Review the total data volume on each of the nodes

Generating directory listings from the cluster is a straightforward process of accessing the cluster from a client and running the Hadoop directory listing command. The cluster is accessed from a client by either directly logging in through a cluster node or by logging in from a remote machine. The command to print a directory listing is as follows:

# hdfs dfs -lsr /

This generates a recursive directory listing of all HDFS files starting from the root directory. The command produces the filenames, directories, permissions, and file sizes of all files. The output can be piped to an output file, which should be saved to an external storage device for review.

Identifying Hive evidence

Hive evidence can be identified through HiveQL commands. The following table lists the commands that can be used to get a full listing of all databases and tables as well as the tables' formats:

Command                                 Description
SHOW DATABASES;                         This lists all available databases
SHOW TABLES;                            This lists all tables in the current database
USE databaseName;                       This makes databaseName the current database
DESCRIBE (FORMATTED|EXTENDED) table;    This lists the formatting details of the table

Identifying all tables and their formats requires iterating through every database and generating a list of tables and each table's formats. This process can be performed either manually or through an automated HiveQL script file; a scripted sketch of this step is shown at the end of this subsection. These commands do not provide information about database and table metadata—such as the number of records and the last modified date—but they do give a full listing of all available, online Hive data. HiveQL can also be used to preview the data using subset queries.
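As noted above, the enumeration of databases and tables can be scripted. The following is a minimal sketch, assuming the hive command-line client is installed and on the PATH; the helper names are ours, not the book's, and the output parsing is deliberately naive.

import subprocess

def hive_query(sql):
    """Run a HiveQL statement through the hive CLI and return its stdout lines."""
    # -S runs in silent mode, -e executes the given statement(s).
    result = subprocess.run(["hive", "-S", "-e", sql],
                            capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]

def enumerate_hive_tables():
    """Yield (database, table, DESCRIBE FORMATTED output) for every table."""
    for db in hive_query("SHOW DATABASES;"):
        for table in hive_query("USE {0}; SHOW TABLES;".format(db)):
            detail = hive_query("DESCRIBE FORMATTED {0}.{1};".format(db, table))
            yield db, table, detail

Row-level previews, such as the LIMIT query shown next, can be run through the same helper.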
The following example shows how to identify the top ten rows in a Hive table:

SELECT * FROM Table1 LIMIT 10

Identifying HBase evidence

HBase evidence is stored in tables, and identifying the names of the tables and the properties of each is important for data collection. HBase stores metadata information in the -ROOT- and .META. tables. These tables can be queried using HBase shell commands to identify the information about all tables in the HBase cluster. Information about the HBase cluster can be gathered using the status command from the HBase shell:

status

This produces the following output:

2 servers, 0 dead, 1.5000 average load

For additional information about the names and locations of the servers—as well as the total disk sizes for the memstores and HFiles—the status command can be given the detailed parameter. The list command outputs every HBase table. The one table created in HBase, testTable, is shown via the following command:

list

This produces the following output:

TABLE
testTable
1 row(s) in 0.0370 seconds

=> ["testTable"]

Information about each table can be generated using the describe command:

describe 'testTable'

The following output is generated:

'testTable', {NAME => 'account', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}, {NAME => 'address', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
1 row(s) in 0.0300 seconds

The describe command yields several useful pieces of information about each table. Each of the column families is listed, and for each family, the encoding, the number of columns (represented as versions), and whether deleted cells are retained are also listed. Security information about each table can be gathered using the user_permission command as follows:

user_permission 'testTable'

This command is useful for identifying the users who currently have access to the table. As mentioned before, user accounts are not as meaningful in Hadoop because of the distributed nature of Hadoop configurations, but in some cases, knowing who had access to tables can be tied back to system logs to identify the individuals who accessed the system and data.

Summary

Hadoop evidence comes in many forms. The methods for identifying the evidence require the forensic investigator to understand the Hadoop architecture and the options for identifying the evidence within HDFS and Hadoop applications. In Big Data Forensics, these topics are covered in more depth—from the internals of Hadoop to conducting a collection of a distributed server.

Resources for Article:

Further resources on this subject:
- Introduction to Hadoop [article]
- Hadoop Monitoring and its aspects [article]
- Understanding Hadoop Backup and Recovery Needs [article]

Scientific Computing APIs for Python

Packt
20 Aug 2015
23 min read
In this article by Hemant Kumar Mehta, author of the book Mastering Python Scientific Computing, we will have a comprehensive discussion of the features and capabilities of various scientific computing APIs and toolkits in Python. Besides the basics, we will also discuss some example programs for each of the APIs. As symbolic computing is a relatively different area of computerized mathematics, we have kept a special subsection within the SymPy section to discuss the basics of a computerized algebra system. In this article, we will cover the following topics:

- Scientific numerical computing using NumPy and SciPy
- Symbolic computing using SymPy

(For more resources related to this topic, see here.)

Numerical scientific computing in Python

Scientific computing mainly demands facilities for performing calculations on algebraic equations, matrices, differentiation, integration, differential equations, statistics, equation solvers, and much more. By default, Python doesn't come with these capabilities. However, the development of NumPy and SciPy has enabled us to perform these operations and far more advanced functionality beyond them. NumPy and SciPy are very powerful Python packages that enable users to efficiently perform the desired operations for all types of scientific applications.

NumPy package

NumPy is the basic Python package for scientific computing. It provides multi-dimensional arrays and basic mathematical operations such as linear algebra. Python provides several data structures to store user data; the most popular ones are lists and dictionaries. List objects may store any type of Python object as an element, and these elements can be processed using loops or iterators. Dictionary objects store data in a key-value format.

The ndarrays data structure

The ndarrays are similar to lists but highly flexible and efficient. An ndarray is an array object that represents a multidimensional array of fixed-size items, and the array must be homogeneous. It has an associated dtype object that defines the data type of the elements in the array. This object defines the type of the data (integer, float, or Python object), the size of the data in bytes, and the byte ordering (big-endian or little-endian). Moreover, if the type of data is a record or sub-array, it also contains details about them. The actual array can be constructed using any one of the array, zeros, or empty methods.

Another important aspect of ndarrays is that their size can be modified dynamically. Moreover, if the user needs to remove some elements from an array, this can be done using the module for masked arrays. In a number of situations, scientific computing demands the deletion or removal of incorrect or erroneous data. The numpy.ma module provides the masked array facility to easily exclude selected elements from arrays. A masked array is nothing but a normal ndarray with a mask, where the mask is an associated array of true and false values. If the mask has a true value at a particular position, the corresponding element in the main array is masked (invalid); if the mask value is false, the element is valid. While performing any computation on such ndarrays, the masked elements are not considered.

File handling

Another important aspect of scientific computing is storing data in files, and NumPy supports reading and writing of both text and binary files.
Mostly, text files are a good way of reading, writing, and exchanging data, as they are inherently portable and most platforms have built-in capabilities to manipulate them. However, for some applications it is better to use binary files, or the desired data for such applications can only be stored in binary files. Sometimes the size and nature of the data, such as images or sound, require it to be stored in binary files. In comparison to text files, binary files are harder to manage as they have specific formats. However, binary files are comparatively very small, and read/write operations on them are much faster than on text files. This fast read/write is most suitable for applications working on large datasets. The only drawback of binary files manipulated with NumPy is that they are accessible only through NumPy.

Python has text file manipulation functions such as open, readlines, and writelines. However, it is not performance-efficient to use these functions for scientific data manipulation; these default Python functions are very slow at reading and writing data in files. NumPy offers a high-performance alternative that loads the data into ndarrays before the actual computation. In NumPy, text files can be accessed using the numpy.loadtxt and numpy.savetxt functions. The loadtxt function can be used to load data from text files into ndarrays. NumPy also has separate functions to manipulate data in binary files; the functions for reading and writing are numpy.load and numpy.save, respectively.

Sample NumPy programs

A NumPy array can be created from a list or a tuple using the array method, which can also transform sequences of sequences into a two-dimensional array:

import numpy as np
x = np.array([4,432,21], int)
print x                            #Output [  4 432  21]
x2d = np.array( ((100,200,300), (111,222,333), (123,456,789)) )
print x2d

Output:

[  4 432  21]
[[100 200 300]
 [111 222 333]
 [123 456 789]]

Basic matrix arithmetic operations can be easily performed on two-dimensional arrays, as in the following program. These operations are applied element-wise, hence the operand arrays must be of equal size; if the sizes do not match, performing these operations will cause a runtime error. Consider the following example of arithmetic operations on one-dimensional arrays:

import numpy as np
x = np.array([4,5,6])
y = np.array([1,2,3])
print x + y                      # output [5 7 9]
print x * y                      # output [ 4 10 18]
print x - y                      # output [3 3 3]
print x / y                      # output [4 2 2]
print x % y                      # output [0 1 0]

There is a separate subclass named matrix to perform matrix operations. Let us understand matrix operations with the following example, which demonstrates the difference between array-based multiplication and matrix multiplication. NumPy matrices are two-dimensional, whereas arrays can be of any dimension.
import numpy as np
x1 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
x2 = np.array( ((1,2,3), (1,2,3), (1,2,3)) )
print "First 2-D Array: x1"
print x1
print "Second 2-D Array: x2"
print x2
print "Array Multiplication"
print x1*x2

mx1 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
mx2 = np.matrix( ((1,2,3), (1,2,3), (1,2,3)) )
print "Matrix Multiplication"
print mx1*mx2

Output:

First 2-D Array: x1
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Second 2-D Array: x2
[[1 2 3]
 [1 2 3]
 [1 2 3]]
Array Multiplication
[[1 4 9]
 [1 4 9]
 [1 4 9]]
Matrix Multiplication
[[ 6 12 18]
 [ 6 12 18]
 [ 6 12 18]]

The following is a simple program to demonstrate simple statistical functions available in NumPy:

import numpy as np
x = np.random.randn(10)   # Creates an array of 10 random elements
print x
mean = x.mean()
print mean
std = x.std()
print std
var = x.var()
print var

First sample output:

[ 0.08291261  0.89369115  0.641396   -0.97868652  0.46692439 -0.13954144
 -0.29892453  0.96177167  0.09975071  0.35832954]
0.208762357623
0.559388806817
0.312915837192

Second sample output:

[ 1.28239629  0.07953693 -0.88112438 -2.37757502  1.31752476  1.50047537
  0.19905071 -0.48867481  0.26767073  2.660184  ]
0.355946458357
1.35007701045
1.82270793415

The above programs are some simple examples of NumPy usage.

SciPy package

SciPy extends Python and NumPy by providing advanced mathematical functions for differentiation, integration, differential equations, optimization, interpolation, advanced statistics, equation solvers, and more. SciPy is written on top of the NumPy array framework. It uses the arrays and the basic array operations provided by NumPy and extends them to cover most of the mathematical areas regularly required by scientists and engineers for their applications. In this article, we will cover examples of some basic functionality.

Optimization package

The optimization package in SciPy provides facilities to solve univariate and multivariate minimization problems, using a number of algorithms and methods. Minimization problems have a wide range of applications in scientific and commercial domains: linear regression, searching for a function's minimum and maximum values, finding the root of a function, and linear programming are typical examples. All of these are supported by the optimization package.

Interpolation package

A number of interpolation methods and algorithms are provided in this package as built-in functions. It provides facilities to perform univariate and multivariate interpolation as well as one-dimensional and two-dimensional splines. We use univariate interpolation when the data depends on one variable, and multivariate interpolation when it depends on more than one variable. Besides these functionalities, it also provides Lagrange and Taylor polynomial interpolators.

Integration and differential equations in SciPy

Integration is an important mathematical tool for scientific computations. The SciPy integration sub-package provides functionality to perform numerical integration on both equations and data, and it also has an ordinary differential equation integrator. It provides various functions to perform numerical integration using a number of methods from numerical analysis.

Stats module

The SciPy stats module contains functions for most of the probability distributions and a wide range of statistical functions.
The supported probability distributions include various continuous, multivariate, and discrete distributions. The statistical functions range from simple means to complex statistical concepts, including skewness, kurtosis, and the chi-square test, to name a few.

Clustering package and spatial algorithms in SciPy

Cluster analysis is a popular data mining technique with a wide range of scientific and commercial applications. In the science domain, biology, particle physics, astronomy, life science, and bioinformatics are a few fields that widely use cluster analysis for problem solving. Cluster analysis is also used extensively in computer science for fraud detection, security analysis, image processing, and so on. The clustering package provides functionality for k-means clustering, vector quantization, and hierarchical and agglomerative clustering. The spatial class has functions to analyze distances between data points using triangulations, Voronoi diagrams, and convex hulls of a set of points. It also has KDTree implementations for performing nearest-neighbor lookups.

Image processing in SciPy

SciPy provides support for various image processing operations, including basic reading and writing of image files, displaying images, and simple image manipulation operations such as cropping, flipping, and rotating. It also supports image filtering functions such as mathematical morphology, smoothing, denoising, and sharpening of images, as well as operations such as image segmentation by labeling pixels corresponding to different objects, classification, and feature extraction, for example edge detection.

Sample SciPy programs

In the subsequent subsections, we will discuss some example programs using SciPy modules and packages. We start with a simple program performing standard statistical computations. After this, we will discuss a program that finds a minimal solution using optimization. At last, we will discuss image processing programs.

Statistics using SciPy

The stats module of SciPy has functions to perform simple statistical operations and work with various probability distributions. The following program demonstrates simple statistical calculations using the SciPy stats.describe function. This single function operates on an array and returns the number of elements, minimum value, maximum value, mean, variance, skewness, and kurtosis:

import scipy as sp
import scipy.stats as st
s = sp.randn(10)
n, min_max, mean, var, skew, kurt = st.describe(s)
print("Number of elements: {0:d}".format(n))
print("Minimum: {0:3.5f} Maximum: {1:2.5f}".format(min_max[0], min_max[1]))
print("Mean: {0:3.5f}".format(mean))
print("Variance: {0:3.5f}".format(var))
print("Skewness : {0:3.5f}".format(skew))
print("Kurtosis: {0:3.5f}".format(kurt))

Output:

Number of elements: 10
Minimum: -2.00080 Maximum: 0.91390
Mean: -0.55638
Variance: 0.93120
Skewness : 0.16958
Kurtosis: -1.15542

Optimization in SciPy

Generally, in mathematical optimization, a non-convex function called the Rosenbrock function is used to test the performance of an optimization algorithm. The following program demonstrates a minimization problem on this function. The Rosenbrock function of N variables is given by the following equation, and it has a minimum value of 0 at xi = 1.
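In symbols, and matching the sum computed by the rosenbrock function in the program below, the function is

$$f(\mathbf{x}) = \sum_{i=1}^{N-1}\left[\,100\,\bigl(x_{i+1}-x_i^{2}\bigr)^{2} + \bigl(1-x_i\bigr)^{2}\right].$$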
Optimization in SciPy

In mathematical optimization, a non-convex function called the Rosenbrock function is commonly used to test the performance of optimization algorithms. The following program demonstrates a minimization problem on this function. The Rosenbrock function of N variables is given by

    f(x) = sum over i = 1..N-1 of [ 100*(x[i+1] - x[i]**2)**2 + (1 - x[i])**2 ]

and it has a minimum value of 0 at xi = 1.

The program for the above function is:

import numpy as np
from scipy.optimize import minimize

# Definition of Rosenbrock function
def rosenbrock(x):
     return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

x0 = np.array([1, 0.7, 0.8, 2.9, 1.1])
res = minimize(rosenbrock, x0, method = 'nelder-mead', options = {'xtol': 1e-8, 'disp': True})

print(res.x)

Output is:
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 516
         Function evaluations: 827
[ 1.  1.  1.  1.  1.]

The last line is the output of print(res.x), in which all the elements of the array are 1.

Image processing using SciPy

The following two programs demonstrate the image processing functionality of SciPy. The first simply displays the standard test image widely used in the field of image processing, called Lena. The second applies geometric transformations to this image: it performs cropping and a rotation by 45 degrees.

The following program displays the Lena image using the matplotlib API. The imshow method renders the ndarray as an image and the show method displays it.

from scipy import misc
l = misc.lena()
misc.imsave('lena.png', l)
import matplotlib.pyplot as plt
plt.gray()
plt.imshow(l)
plt.show()

Output:
The output of the above program is the following screenshot:

The following program performs geometric transformations and displays the transformed images along with the original image in a two-by-two grid of axes.

import scipy
from scipy import ndimage
import matplotlib.pyplot as plt
import numpy as np

lena = scipy.misc.lena()
lx, ly = lena.shape
crop_lena = lena[lx/4:-lx/4, ly/4:-ly/4]
crop_eyes_lena = lena[lx/2:-lx/2.2, ly/2.1:-ly/3.2]
rotate_lena = ndimage.rotate(lena, 45)

# Four axes, returned as a 2-d array
f, axarr = plt.subplots(2, 2)
axarr[0, 0].imshow(lena, cmap=plt.cm.gray)
axarr[0, 0].axis('off')
axarr[0, 0].set_title('Original Lena Image')
axarr[0, 1].imshow(crop_lena, cmap=plt.cm.gray)
axarr[0, 1].axis('off')
axarr[0, 1].set_title('Cropped Lena')
axarr[1, 0].imshow(crop_eyes_lena, cmap=plt.cm.gray)
axarr[1, 0].axis('off')
axarr[1, 0].set_title('Lena Cropped Eyes')
axarr[1, 1].imshow(rotate_lena, cmap=plt.cm.gray)
axarr[1, 1].axis('off')
axarr[1, 1].set_title('45 Degree Rotated Lena')

plt.show()

Output:

SciPy and NumPy are the core of Python's support for scientific computing, as they provide solid numerical computing functionality.

Symbolic computations using SymPy

Computations performed by a computer on mathematical symbols, without evaluating them or changing their meaning, are called symbolic computations. Symbolic computing is also known as computerized algebra, and such systems are called computer algebra systems. The following subsections give a brief introduction to SymPy.

Computer Algebra System (CAS)

Let us discuss the concept of a CAS. A CAS is a piece of software or a toolkit that performs computations on mathematical expressions using a computer instead of doing them manually. In the beginning, using computers for these applications was called computer algebra; now the concept is called symbolic computing. CAS systems can be grouped into two types: general purpose CAS, and CAS designed for a specific problem domain. General purpose systems are applicable to most areas of algebraic mathematics, while specialized CAS are designed for specific areas such as group theory or number theory. Most of the time, we prefer a general purpose CAS to manipulate mathematical expressions for scientific applications.
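The following tiny sketch (an illustrative example, not part of the original text) contrasts numerical and symbolic evaluation of the same quantity; the numerical result is an approximation, whereas the symbolic result is kept in exact form.

import math
import sympy

print math.sqrt(8)     # 2.82842712475, a floating point approximation
print sympy.sqrt(8)    # 2*sqrt(2), an exact symbolic result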
Features of a general purpose CAS

The desired features of a general purpose computer algebra system for scientific applications include the following:

A user interface to manipulate mathematical expressions.
An interface for programming and debugging.
A simplifier: such systems must simplify various mathematical expressions, so a simplifier is an essential component of any computerized algebra system.
An exhaustive set of functions to perform the various mathematical operations required by algebraic computations.
Efficient memory management, since most applications perform extensive computations.
Support for mathematical computations on high precision numbers and large quantities.

A brief idea of SymPy

SymPy is an open source, Python-based implementation of a computer algebra system (CAS). The philosophy behind SymPy's development is to design and build a CAS that has all the desired features, yet whose code is as simple as possible, so that it is highly and easily extensible. It is written completely in Python and does not require any external library.

The basic idea behind using SymPy is the creation and manipulation of expressions. Using SymPy, the user represents mathematical expressions in the Python language using SymPy classes and objects. These expressions are composed of numbers, symbols, operators, functions, and so on. The functions are modules that perform a piece of mathematical functionality, such as logarithms or trigonometry.

The development of SymPy was started by Ondřej Čertík in August 2006. Since then, it has grown considerably with the contributions of more than a hundred contributors. The library now consists of 26 different integrated modules. These modules can perform the computations required for basic symbolic arithmetic, calculus, algebra, discrete mathematics, quantum physics, and plotting and printing, with the option to export the output of the computations to LaTeX and other formats.

The capabilities of SymPy can be divided into two categories, core capabilities and advanced capabilities, as the SymPy library is divided into a core module and several advanced optional modules. The functionality supported by the various modules is as follows:

Core capabilities

The core module supports the basic functionality required by mathematical algebra operations: basic arithmetic such as multiplication, addition, subtraction, division, and exponentiation. It also supports simplification of complex expressions and provides series and symbol expansion. The core module supports functions related to trigonometry, hyperbolic functions, exponentials, roots of equations, polynomials, factorials and gamma functions, logarithms, and a number of special functions for B-splines, spherical harmonics, tensor functions, orthogonal polynomials, and more. Strong support for pattern matching operations is also part of the core module. The core capabilities of SymPy also include substitutions, as required by algebraic operations. It supports not only high precision arithmetic over integers, rationals, and floating point numbers, but also the non-commutative variables and symbols required in polynomial operations.
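The following short sketch (an assumed example rather than a listing from the original article) touches a few of these core capabilities: trigonometric simplification, substitution, and series expansion.

import sympy
from sympy import sin, cos, exp, simplify, series

x = sympy.Symbol('x')

# Trigonometric simplification
print simplify(sin(x) ** 2 + cos(x) ** 2)    # 1

# Substitution of a symbol by a value
e = x ** 2 + 2 * x + 1
print e.subs(x, 3)                           # 16

# Series expansion of exp(x) around 0, up to order 4
print series(exp(x), x, 0, 4)                # 1 + x + x**2/2 + x**3/6 + O(x**4)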
Polynomials

Functions that perform polynomial operations belong to the polynomial module. These include basic polynomial operations such as division, greatest common divisor (GCD), least common multiple (LCM), square-free factorization, and representation of polynomials with symbolic coefficients, as well as some special operations such as computing the resultant, deriving trigonometric identities, partial fraction decomposition, and facilities for Gröbner bases over polynomial rings and fields.

Calculus

This module provides the functionality needed for basic and advanced calculus. It supports limits through a limit function, as well as differentiation, integration, series expansion, differential equations, and the calculus of finite differences. SymPy also has special support for definite integrals and integral transforms. For differentiation, it supports numerical differentiation, composition of derivatives, and fractional derivatives.

Solving equations

Solver is the SymPy module that provides equation-solving functionality. This module supports solving complex polynomials, finding the roots of polynomials, and solving systems of polynomial equations. There is a function to solve algebraic equations. It provides support for solving differential equations, including ordinary differential equations, some forms of partial differential equations, and initial and boundary value problems, and it also supports solving difference equations. In mathematics, difference equations are also called recurrence relations, that is, equations that recursively define a sequence or a multidimensional array of values.

Discrete math

Discrete mathematics covers mathematical structures that are discrete in nature, as opposed to continuous mathematics such as calculus. It deals with integers, graphs, statements from logic theory, and so on. This module has full support for binomial coefficients, products, and summations. It also supports various functions from number theory, including residue theory, Euler's totient, partitions, and a number of functions dealing with prime numbers and their factorizations. SymPy also supports the creation and manipulation of logic expressions using symbolic and Boolean values.

Matrices

SymPy has strong support for operations related to matrices and determinants, which belong to the linear algebra category of mathematics. It supports the creation of matrices, basic matrix operations such as multiplication and addition, matrices of zeros and ones, creation of random matrices, and operations on matrix elements. It also supports special functions such as computing the Hessian matrix for a function, the Gram-Schmidt process on a set of vectors, and the computation of the Wronskian for a matrix of functions. It has full support for eigenvalues and eigenvectors, matrix inversion, and the solution of matrices and determinants. For computing determinants, it supports the Bareiss fraction-free algorithm and the Berkowitz algorithm, among other methods. For matrices, it also supports nullspace calculation, cofactor expansion tools, derivative calculation for matrix elements, and calculation of the dual of a matrix.

Geometry

SymPy also has a module that supports operations associated with two-dimensional (2-D) geometry. It supports the creation of 2-D entities or objects such as points, lines, circles, ellipses, polygons, triangles, rays, and segments. It also allows us to query these entities, for example, the area of suitable objects such as an ellipse, circle, or triangle, or the intersection points of lines. Other supported queries include tangency determination and finding the similarity and intersection of entities.
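A minimal sketch of these geometric entities and queries (an illustrative example, not from the original article) is shown below.

from sympy import Point, Circle, Line, Triangle

# A circle of radius 5 centred at the origin
c = Circle(Point(0, 0), 5)
print c.area                    # 25*pi

# Intersection of the x-axis with the circle
l = Line(Point(-10, 0), Point(10, 0))
print c.intersection(l)         # the points (-5, 0) and (5, 0)

# Area of a right triangle
t = Triangle(Point(0, 0), Point(4, 0), Point(0, 3))
print abs(t.area)               # 6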
Plotting

There is a very good module for drawing two-dimensional and three-dimensional plots. At present, the plots are rendered using the matplotlib package, and other backends such as TextBackend, Pyglet, and textplot are also supported. It has a good interactive interface, facilities for customization, and support for plotting various geometric entities. The plotting module has the following functions:

Plotting of 2-D line plots.
Plotting of 2-D parametric plots.
Plotting of 2-D implicit and region plots.
Plotting of 3-D plots of functions involving two variables.
Plotting of 3-D line and surface plots.

Physics

There is a module for solving problems from the physics domain. It supports functionality for mechanics, including classical and quantum mechanics, and high energy physics. It has functions to support Pauli algebra and quantum harmonic oscillators in 1-D and 3-D, and it also has functionality for optics. A separate module integrates unit systems into SymPy, allowing users to select a specific unit system for performing their computations and to convert between units. The unit systems are composed of units and constants for computations.

Statistics

The statistics module was introduced in SymPy to support the various statistical concepts required in mathematical computations. Apart from supporting various continuous and discrete statistical distributions, it also supports functionality related to symbolic probability. Generally, these distributions support functions for random number generation in SymPy.

Printing

SymPy has a module that provides full support for pretty-printing, that is, rendering expressions and similar content, such as source code, text files, and markup files, in a nicely formatted, readable way. This module produces the desired output using ASCII and/or Unicode characters. It supports various printers, such as the LaTeX and MathML printers. It is also capable of producing source code in various programming languages, such as C, Python, or FORTRAN, and of producing content in markup languages such as HTML and XML.

SymPy modules

The following list gives the formal names of the modules discussed in the preceding paragraphs:

Assumptions: assumption engine
Concrete: symbolic products and summations
Core: basic class structure: Basic, Add, Mul, Pow, and so on
functions: elementary and special functions
galgebra: geometric algebra
geometry: geometric entities
integrals: symbolic integrator
interactive: interactive sessions (e.g. IPython)
logic: boolean algebra, theorem proving
matrices: linear algebra, matrices
mpmath: fast arbitrary precision numerical math
ntheory: number theoretical functions
parsing: Mathematica and Maxima parsers
physics: physical units, quantum stuff
plotting: 2D and 3D plots using Pyglet
polys: polynomial algebra, factorization
printing: pretty-printing, code generation
series: symbolic limits and truncated series
simplify: rewrite expressions in other forms
solvers: algebraic, recurrence, differential
statistics: standard probability distributions
utilities: test framework, compatibility stuff

There are numerous symbolic computing systems available in various mathematical toolkits.
There is some proprietary software, such as Maple and Mathematica, and there are also open source alternatives such as Singular and AXIOM. However, these products have their own scripting languages, it is difficult to extend their functionality, and they have slow development cycles. SymPy, on the other hand, is highly extensible, is designed and developed in Python, and is an open source API that supports a speedy development life cycle.

Simple exemplary programs

These are some very simple examples that give an idea of the capabilities of SymPy. Each is less than ten lines of SymPy source code, and together they cover topics ranging from basic symbol manipulation to limits, differentiation, and integration. We can test the execution of these programs on SymPy Live, an online SymPy instance running on Google App Engine, available at http://live.sympy.org/.

Basic symbol manipulation

The following code defines three symbols and an expression on these symbols, and finally prints the expression.

import sympy
a = sympy.Symbol('a')
b = sympy.Symbol('b')
c = sympy.Symbol('c')
e = ( a * b * b + 2 * b * a * b) + (a * a + c * c)
print e

Output:
a**2 + 3*a*b**2 + c**2     (here ** represents the power operation).

Expression expansion in SymPy

The following program demonstrates the concept of expression expansion. It defines two symbols and a simple expression on these symbols, and finally prints the expression and its expanded form.

import sympy
a = sympy.Symbol('a')
b = sympy.Symbol('b')
e = (a + b) ** 4
print e
print e.expand()

Output:
(a + b)**4
a**4 + 4*a**3*b + 6*a**2*b**2 + 4*a*b**3 + b**4

Simplification of an expression or formula

SymPy has the facility to simplify mathematical expressions. The following program simplifies two expressions and displays the results. Note that exp and simplify must be imported from sympy for the code to run, and the results are printed explicitly.

import sympy
from sympy import exp, simplify
x = sympy.Symbol('x')
a = 1/x + (x*exp(x) - 1)/x
print simplify(a)
print simplify((x ** 3 +  x ** 2 - x - 1)/(x ** 2 + 2 * x + 1))

Output:
exp(x)
x - 1

Simple integrations

The following program calculates the integrals of two simple functions.

import sympy
from sympy import integrate
x = sympy.Symbol('x')
print integrate(x ** 3 + 2 * x ** 2 + x, x)
print integrate(x / (x ** 2 + 2 * x), x)

Output:
x**4/4+2*x**3/3+x**2/2
log(x + 2)

Summary

In this article, we have discussed the concepts, features, and selected sample programs of various scientific computing APIs and toolkits. The article started with a discussion of NumPy and SciPy. After covering NumPy, we discussed concepts associated with symbolic computing and SymPy. In the remaining part of the article, we discussed interactive computing and data analysis and visualization, along with their APIs and toolkits: IPython is the Python toolkit for interactive computing, Pandas is the data analysis package, and Matplotlib is the data visualization API.

Resources for Article:

Further resources on this subject:

Optimization in Python [article]
How to do Machine Learning with Python [article]
Bayesian Network Fundamentals [article]
Dimensionality Reduction with Principal Component Analysis

Packt
18 Aug 2015
12 min read
In this article by Eric Mayor, author of the book Learning Predictive Analytics with R, we will be discussing how to use PCA in R. Nowadays, the access to data is easier and cheaper than ever before. This leads to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not trivial, as the quantity often makes the analysis difficult or unpractical. For instance, the data is often more abundant than available memory on the machines. The available computational power is also often not enough to analyze the data in a reasonable time frame. One solution is to have recourse to technologies that deal with high dimensionality in data (big data). These solutions typically use the memory and computing power of several machines for analysis (computer clusters). However, most organizations do not have such infrastructure. Therefore, a more practical solution is to reduce the dimensionality of the data while keeping the essential of the information intact. (For more resources related to this topic, see here.) Another reason to reduce dimensionality is that, in some cases, there are more attributes than observations. If some scientists were to store the genome of all inhabitants of Europe and the United States, the number of cases (approximately 1 billion) would be much less than the 3 billion base pairs in the human DNA. Most analyses do not work well or not at all when observations are less than attributes. Confronted with this problem, the data analysts might select groups of attributes, which go together (like height and weight, for instance) according to their domain knowledge, and reduce the dimensionality of the dataset. The data is often structured across several relatively independent dimensions, with each dimension measured using several attributes. This is where principal component analysis (PCA) is an essential tool, as it permits each observation to receive a score on each of the dimensions (determined by PCA itself) while allowing to discard the attributes from which the dimensions are computed. What is meant here is that for each of the obtained dimensions, values (scores) will be produced that combines several attributes. These can be used for further analysis. In the next section, we will use PCA to combine several attributes in the questionnaire data. We will see that the participants' self-report on items such as lively, excited, enthusiastic (and many more) can be combined in a single dimension we call positive arousal. In this sense, PCA performs both dimensionality reduction (discard the attributes) and feature extraction (compute the dimensions). The features can then be used for classification and regression problems. Another use of PCA is to check that the underlying structure of the data corresponds to a theoretical model. For instance, in a questionnaire, a group of questions might assess construct A, and another group construct B, and so on. PCA will find two different factors if indeed there is more similarity in the answers of participants within each group of questions compared to the overall questionnaire. Researchers in fields such as psychology use PCA mainly to test their theoretical model in this fashion. This article is a shortened version of the chapter with the same name in my book Learning predictive analytics with R. In what follows, we will examine how to use PCA in R. We will notably discover how to interpret the results and perform diagnostics. 
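Before turning to the questionnaire data, a minimal sketch of the idea using base R's prcomp() on the built-in mtcars data (an illustrative example, not the data analyzed in this article) shows how observations receive scores on a reduced set of dimensions:

# Principal components of the numeric mtcars attributes (illustrative only)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance captured by each component
summary(pca)

# Scores of the first observations on the first two components
head(pca$x[, 1:2])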
Learning PCA in R In this section, we will learn more about how to use PCA in order to obtain knowledge from data, and ultimately reduce the number of attributes (also called features). The first dataset we will use is the msg dataset from the psych package. The msq (for motivational state questionnaire) data set is composed of 92 attributes, of which 72 is the rating of adjectives by 3896 participants describing their mood. We will only use these 72 attributes for the current purpose, which is the exploration of the structure of the questionnaire. We will, therefore, start by installing and loading the package and the data, and assign the data we are interested in (the 75 attributes mentioned) to an object called motiv. install.packages("psych") library(psych) data(msq) motiv = msq[,1:72] Dealing with missing values Missing values are a common problem in real-life datasets like the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing values are encountered. Let's see how many missing values (we will call them NAs) there are for each attribute in this dataset: apply(is.na(motiv),2,sum) View of missing data in the dataset We can see that for many attributes this is unproblematic (there are only few missing values). However, in the case of several attributes, the number of NAs is quite high (anxious: 1849 NAs, cheerful: 1850, idle: 1848, inactive: 1846, tranquil: 1843). The most probable explanation is that these items have been dropped from some of the samples in which the data has been collected. Removing all cases with NAs from the dataset would result in an empty dataset in this particular case. Try the following in your console to verify this claim: na.omit(motiv) For this reason, we will simply deal with the problem by removing these attributes from the analysis. Remember that there are other ways, such as data imputation, to solve this issue while keeping the attributes. In order to do so, we first need to know which column number corresponds to the attributes we want to suppress. An easy way to do this is simply by printing the names of each columns vertically (using cbind() for something it was not exactly made for). We will omit cases with missing values on other attributes from the analysis later. head(cbind(names(motiv)),5) Here, we only print the result for the first five columns of the data frame: [,1] [1,] active [2,] afraid [3,] alert [4,] angry [5,] anxious We invite the reader to verify on his screen whether we suppress the correct columns from the analysis using the following vector to match the attributes: ToSuppress = c(5, 15, 37, 38, 66) Let's check whether it is correct: names(motiv[ToSuppress]) The following is the output: [1] "anxious" "cheerful" "idle" "inactive" "tranquil" Naming the components using the loadings We will run the analysis using the principal() function from the psych package. This will allow examining the loadings in a sorted fashion and making them independent from the others using a rotation. We will extract five components. The psych package has already been loaded, as the data we are analyzing comes from a dataset, which is included in psych. Yet, we need to install another package, the GPArotation package, which is required for the analysis we will perform next: install.packages("GPArotation") library(GPArotation) We can now run the analysis. This time we want to apply an orthogonal rotation varimax, in order to obtain independent factorial scores. 
Precisions are provided in the paper Principal component analysis by Abdi and Williams (2010). We also want that the analysis uses imputed data for missing values and estimates the scores for each observation on each retained principal component. We use the print.psych() function to print the sorted loadings, which will make the interpretation of the principal components easier: Pca2 = principal(motiv[,-ToSuppress],nfactors = 5, rotate = "varimax", missing = T, scores = T) print.psych(Pca2, sort =T) The annotated output displayed in the following screenshot has been slightly altered in order to allow it to fit on one page. Results of the PCA with the principal() function using varimax rotation The name of the items are displayed in the first column of the first matrix of results. The column item displays their order. The component loadings for the five retained components are displayed next (RC1 to RC5). We will comment on h2 and u2 later. The loadings of the attributes on the loadings can be used to name the principal components. As can be seen in the preceding screenshot, the first component could be called Positive arousal, the second component could be called Negative arousal, the third could be called Serenity, the fourth Exhaustion, and the last component could be called Fear. It is worth noting that the MSQ has theoretically four components obtained from two dimensions: energy and tension. Therefore, the fifth component we found is not accounted for by the model. The h2 value indicates the proportion of variance of the variable that is accounted for by the selected components. The u2 value indicates the part of variance not explained by the components. The sum of h2 and u2 is 1. Below the first result matrix, a second matrix indicates the proportion of variance explained by the components on the whole dataset. For instance, the first component explains 21 percent of the variance in the dataset Proportion Var and 38 percent of the variance explained by the five components. Overall, the five components explain 56 percent of the variance in the dataset Cumulative Var. As a remainder, the purpose of PCA is to replace all the original attributes by the scores on these components in further analysis. This is what we will discuss next. PCA scores At this point, you might wonder where data reduction comes into play. Well, the computation of the scores for each factor allows reducing the number of attributes for a dataset. This has been done in the previous section. Accessing the PCA scores We will now examine these scores more thoroughly. Let's start by checking whether the scores are indeed uncorrelated: round(cor(Pca2$scores),3) This is the output: RC1 RC2 RC3 RC4 RC5 RC1 1.000 -0.001 0.000 0.000 0.001 RC2 -0.001 1.000 0.000 0.000 0.000 RC3 0.000 0.000 1.000 -0.001 0.001 RC4 0.000 0.000 -0.001 1.000 0.000 RC5 0.001 0.000 0.001 0.000 1.000 As can be seen in the previous output, the correlations between the values are equal to or lower than 0.001 for every pair of components. The components are basically uncorrelated, as we requested. Let's start by checking that it indeed contains the same number of rows: nrow(Pca2$scores) == nrow(msq)  Here's the output: [1] TRUE As the principal() function didn't remove any cases but imputed the data instead, we can now append the factor scores to the original dataset (a copy of it): bound = cbind(msq,Pca2$score) PCA diagnostics Here, we will briefly discuss two diagnostics that can be performed on the dataset that will be subjected to analysis. 
These diagnostics should be performed analyzing the data with PCA in order to ascertain that PCA is an optimal analysis for the considered data. The first diagnostic is Bartlett's test of sphericity. This test examines the relationship between the variables together, instead of 2 x 2 as in a correlation. To be more specific, it tests whether the correlation matrix is different from an identity matrix (which has ones on the diagonals and zeroes elsewhere). The null hypothesis is that the variables are independent (no underlying structure). The paper When is a correlation matrix appropriate for factor analysis? Some decision rules by Dzuiban and Shirkey (1974) provides more information on the topic. This test can be performed by the cortest.normal() function in the psych package. In this case, the first argument is the correlation matrix of the data we want to subject to PCA, and the n1 argument is the number of cases. We will use the same data set as in the PCA (we first assign the name M to it): M = na.omit(motiv[-ToSuppress]) cortest.normal(cor(M), n1 = nrow(M)) The output is provided as follows: Tests of correlation matrices Call:cortest.normal(R1 = cor(M), n1 = nrow(M)) Chi Square value 829268 with df = 2211 with probability < 0 The last line of the output shows that the data subjected to the analysis is clearly different from an identity matrix: The probability to obtain these results if it were an identity matrix is close to zero (do not pay attention to the > 0, it is simply extremely close to zero). You might be curious about what is the output of an identity matrix. It turns out we have something close to it: the correlations of the PCA scores with the varimax rotation that we examined before. Let's subject the scores to the analysis: cortest.normal(cor(Pca2$scores), n1 = nrow(Pca2$scores)) Tests of correlation matrices Call:cortest.normal(R1 = cor(Pca2$scores), n1 = nrow(Pca2$scores)) Chi Square value 0.01 with df = 10 with probability < 1 In this case, the results show that the correlation matrix is not significantly different from an identity matrix. The other diagnostic is the Kaiser Meyer Olkin (KMO) index, which indicates the part of the data that can be explained by elements present in the dataset. The higher this score the more the proportion of the data is explainable, for instance, by PCA. The KMO (also called MSA for Measure of Sample Adequacy) ranges from 0 (nothing is explainable) to 1 (everything is explainable). It can be returned for each item separately, or for the overall dataset. We will examine this value. It is the first component of the object returned by the KMO()function from psych package. The second component is the list of the values for the individual items (not examined here). It simply takes a matrix, a data frame, or a correlation matrix as an argument. Let's run in on our data: KMO(motiv)[1] This returns a value of 0.9715381, meaning that most of our data is explainable by the analysis. Summary In this article, we have discussed how the PCA algorithm works, how to select the appropriate number of components and how to use PCA scores for further analysis. Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] How to do Machine Learning with Python [article] Clustering and Other Unsupervised Learning Methods [article]

Oracle 12c SQL and PL/SQL New Features

Oli Huggins
16 Aug 2015
30 min read
In this article by Saurabh K. Gupta, author of the book Oracle Advanced PL/SQL Developer Professional Guide, Second Edition you will learn new features in Oracle 12c SQL and PL/SQL. (For more resources related to this topic, see here.) Oracle 12c SQL and PL/SQL new features SQL is the most widely used data access language while PL/SQL is an exigent language that can integrate seamlessly with SQL commands. The biggest benefit of running PL/SQL is that the code processing happens natively within the Oracle Database. In the past, there have been debates and discussions on server side programming while the client invokes the PL/SQL routines to perform a task. The server side programming approach has many benefits. It reduces the network round trips between the client and the database. It reduces the code size and eases the code portability because PL/SQL can run on all platforms, wherever Oracle Database is supported. Oracle Database 12c introduces many language features and enhancements which are focused on SQL to PL/SQL integration, code migration, and ANSI compliance. This section discusses the SQL and PL/SQL new features in Oracle database 12c. IDENTITY columns Oracle Database 12c Release 1 introduces identity columns in the tables in compliance with the American National Standard Institute (ANSI) SQL standard. A table column, marked as IDENTITY, automatically generate an incremental numeric value at the time of record creation. Before the release of Oracle 12c, developers had to create an additional sequence object in the schema and assign its value to the column. The new feature simplifies code writing and benefits the migration of a non-Oracle database to Oracle. The following script declares an identity column in the table T_ID_COL: /*Create a table for demonstration purpose*/ CREATE TABLE t_id_col (id NUMBER GENERATED AS IDENTITY, name VARCHAR2(20)) / The identity column metadata can be queried from the dictionary views USER_TAB_COLSand USER_TAB_IDENTITY_COLS. Note that Oracle implicitly creates a sequence to generate the number values for the column. However, Oracle allows the configuration of the sequence attributes of an identity column. The custom sequence configuration is listed under IDENTITY_OPTIONS in USER_TAB_IDENTITY_COLS view: /*Query identity column information in USER_TAB_COLS*/ SELECT column_name, data_default, user_generated, identity_column FROM user_tab_cols WHERE table_name='T_ID_COL' / COLUMN_NAME DATA_DEFAULT USE IDE -------------- ------------------------------ --- --- ID "SCOTT"."ISEQ$$_93001".nextval YES YES NAME YES NO Let us check the attributes of the preceding sequence that Oracle has implicitly created. 
Note that the query uses REGEXP_SUBSTR to print the sequence configuration in multiple rows: /*Check the sequence configuration from USER_TAB_IDENTITY_COLS view*/ SELECT table_name,column_name, generation_type, REGEXP_SUBSTR(identity_options,'[^,]+', 1, LEVEL) identity_options FROM user_tab_identity_cols WHERE table_name = 'T_ID_COL' CONNECT BY REGEXP_SUBSTR(identity_options,'[^,]+',1,level) IS NOT NULL / TABLE_NAME COLUMN_NAME GENERATION IDENTITY_OPTIONS ---------- ---------------------- --------------------------------- T_ID_COL ID ALWAYS START WITH: 1 T_ID_COL ID ALWAYS INCREMENT BY: 1 T_ID_COL ID ALWAYS MAX_VALUE: 9999999999999999999999999999 T_ID_COL ID ALWAYS MIN_VALUE: 1 T_ID_COL ID ALWAYS CYCLE_FLAG: N T_ID_COL ID ALWAYS CACHE_SIZE: 20 T_ID_COL ID ALWAYS ORDER_FLAG: N 7 rows selected While inserting data in the table T_ID_COL, do not include the identity column as its value is automatically generated: /*Insert test data in the table*/ BEGIN INSERT INTO t_id_col (name) VALUES ('Allen'); INSERT INTO t_id_col (name) VALUES ('Matthew'); INSERT INTO t_id_col (name) VALUES ('Peter'); COMMIT; END; / Let us check the data in the table. Note the identity column values: /*Query the table*/ SELECT id, name FROM t_id_col / ID NAME ----- -------------------- 1 Allen 2 Matthew 3 Peter The sequence created under the covers for identity columns is tightly coupled with the column. If a user tries to insert a user-defined input for the identity column, the operation throws an exception ORA-32795: INSERT INTO t_id_col VALUES (7,'Steyn'); insert into t_id_col values (7,'Steyn') * ERROR at line 1: ORA-32795: cannot insert into a generated always identity column Default column value to a sequence in Oracle 12c Oracle Database 12c allows developers to default a column directly to a sequence‑generated value. The DEFAULT clause of a table column can be assigned to SEQUENCE.CURRVAL or SEQUENCE.NEXTVAL. The feature will be useful while migrating non-Oracle data definitions to Oracle. The DEFAULT ON NULL clause Starting with Oracle Database 12c, a column can be assigned a default non-null value whenever the user tries to insert NULL into the column. The default value will be specified in the DEFAULT clause of the column with a new ON NULL extension. Note that the DEFAULT ON NULL cannot be used with an object type column. The following script creates a table t_def_cols. A column ID has been defaulted to a sequence while the column DOJ will always have a non-null value: /*Create a sequence*/ CREATE SEQUENCE seq START WITH 100 INCREMENT BY 10 / /*Create a table with a column defaulted to the sequence value*/ CREATE TABLE t_def_cols ( id number default seq.nextval primary key, name varchar2(30), doj date default on null '01-Jan-2000' ) / The following PL/SQL block inserts the test data: /*Insert the test data in the table*/ BEGIN INSERT INTO t_def_cols (name, doj) values ('KATE', '27-FEB-2001'); INSERT INTO t_def_cols (name, doj) values ('NANCY', '17-JUN-1998'); INSERT INTO t_def_cols (name, doj) values ('LANCE', '03-JAN-2004'); INSERT INTO t_def_cols (name) values ('MARY'); COMMIT; END; / Query the table and check the values for the ID and DOJ columns. ID gets the value from the sequence SEQ while DOJ for MARY has been defaulted to 01-JAN-2000. 
/*Query the table to verify sequence and default on null values*/ SELECT * FROM t_def_cols / ID NAME DOJ ---------- -------- --------- 100 KATE 27-FEB-01 110 NANCY 17-JUN-98 120 LANCE 03-JAN-04 130 MARY 01-JAN-00 Support for 32K VARCHAR2 Oracle Database 12c supports the VARCHAR2, NVARCHAR2, and RAW datatypes up to 32,767 bytes in size. The previous maximum limit for the VARCHAR2 (and NVARCHAR2) and RAW datatypes was 4,000 bytes and 2,000 bytes respectively. The support for extended string datatypes will benefit the non-Oracle to Oracle migrations. The feature can be controlled using the initialization parameter MAX_STRING_SIZE. It accepts two values: STANDARD (default)—The maximum size prior to the release of Oracle Database 12c will apply. EXTENDED—The new size limit for string datatypes apply. Note that, after the parameter is set to EXTENDED, the setting cannot be rolled back. The steps to increase the maximum string size in a database are: Restart the database in UPGRADE mode. In the case of a pluggable database, the PDB must be opened in MIGRATEmode. Use the ALTER SYSTEM command to set MAX_STRING_SIZE to EXTENDED. As SYSDBA, execute the $ORACLE_HOME/rdbms/admin/utl32k.sql script. The script is used to increase the maximum size limit of VARCHAR2, NVARCHAR2, and RAW wherever required. Restart the database in NORMAL mode. As SYSDBA, execute utlrp.sql to recompile the schema objects with invalid status. The points to be considered while working with the 32k support for string types are: COMPATIBLE must be 12.0.0.0 After the parameter is set to EXTENDED, the parameter cannot be rolled back to STANDARD In RAC environments, all the instances of the database comply with the setting of MAX_STRING_SIZE Row limiting using FETCH FIRST For Top-N queries, Oracle Database 12c introduces a new clause, FETCH FIRST, to simplify the code and comply with ANSI SQL standard guidelines. The clause is used to limit the number of rows returned by a query. The new clause can be used in conjunction with ORDER BY to retrieve Top-N results. The row limiting clause can be used with the FOR UPDATE clause in a SQL query. In the case of a materialized view, the defining query should not contain the FETCH clause. Another new clause, OFFSET, can be used to skip the records from the top or middle, before limiting the number of rows. For consistent results, the offset value must be a positive number, less than the total number of rows returned by the query. For all other offset values, the value is counted as zero. Keywords with the FETCH FIRST clause are: FIRST | NEXT—Specify FIRST to begin row limiting from the top. Use NEXT with OFFSET to skip certain rows. ROWS | PERCENT—Specify the size of the result set as a fixed number of rows or percentage of total number of rows returned by the query. ONLY | WITH TIES—Use ONLY to fix the size of the result set, irrespective of duplicate sort keys. If you want records with matching sort keys, specify WITH TIES. 
The following query demonstrates the use of the FETCH FIRST and OFFSET clauses in Top-N queries: /*Create the test table*/ CREATE TABLE t_fetch_first (empno VARCHAR2(30), deptno NUMBER, sal NUMBER, hiredate DATE) / The following PL/SQL block inserts sample data for testing: /*Insert the test data in T_FETCH_FIRST table*/ BEGIN INSERT INTO t_fetch_first VALUES (101, 10, 1500, '01-FEB-2011'); INSERT INTO t_fetch_first VALUES (102, 20, 1100, '15-JUN-2001'); INSERT INTO t_fetch_first VALUES (103, 20, 1300, '20-JUN-2000'); INSERT INTO t_fetch_first VALUES (104, 30, 1550, '30-DEC-2001'); INSERT INTO t_fetch_first VALUES (105, 10, 1200, '11-JUL-2012'); INSERT INTO t_fetch_first VALUES (106, 30, 1400, '16-AUG-2004'); INSERT INTO t_fetch_first VALUES (107, 20, 1350, '05-JAN-2007'); INSERT INTO t_fetch_first VALUES (108, 20, 1000, '18-JAN-2009'); COMMIT; END; / The SELECT query pulls in the top-5 rows when sorted by their salary: /*Query to list top-5 employees by salary*/ SELECT * FROM t_fetch_first ORDER BY sal DESC FETCH FIRST 5 ROWS ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ------- --------- 104 30 1550 30-DEC-01 101 10 1500 01-FEB-11 106 30 1400 16-AUG-04 107 20 1350 05-JAN-07 103 20 1300 20-JUN-00 The SELECT query lists the top 25% of employees (2) when sorted by their hiredate: /*Query to list top-25% employees by hiredate*/ SELECT * FROM t_fetch_first ORDER BY hiredate FETCH FIRST 25 PERCENT ROW ONLY / EMPNO DEPTNO SAL HIREDATE -------- ------ ----- --------- 103 20 1300 20-JUN-00 102 20 1100 15-JUN-01 The SELECT query skips the first five employees and displays the next two—the 6th and 7th employee data: /*Query to list 2 employees after skipping first 5 employees*/ SELECT * FROM t_fetch_first ORDER BY SAL DESC OFFSET 5 ROWS FETCH NEXT 2 ROWS ONLY / Invisible columns Oracle Database 12c supports invisible columns, which implies that the visibility of a column. A column marked invisible does not appear in the following operations: SELECT * FROM queries on the table SQL* Plus DESCRIBE command Local records of %ROWTYPE Oracle Call Interface (OCI) description A column can be made invisible by specifying the INVISIBLE clause against the column. Columns of all types (except user-defined types), including virtual columns, can be marked invisible, provided the tables are not temporary tables, external tables, or clustered ones. An invisible column can be explicitly selected by the SELECT statement. Similarly, the INSERTstatement will not insert values in an invisible column unless explicitly specified. Furthermore, a table can be partitioned based on an invisible column. A column retains its nullity feature even after it is made invisible. An invisible column can be made visible, but the ordering of the column in the table may change. In the following script, the column NICKNAME is set as invisible in the table t_inv_col: /*Create a table to demonstrate invisible columns*/ CREATE TABLE t_inv_col (id NUMBER, name VARCHAR2(30), nickname VARCHAR2 (10) INVISIBLE, dob DATE ) / The information about the invisible columns can be found in user_tab_cols. Note that the invisible column is marked as hidden: /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 NAME NO 3 DOB NO NICKNAME YES Hidden columns are different from invisible columns. 
Invisible columns can be made visible and vice versa, but hidden columns cannot be made visible. If we try to make the NICKNAME visible and NAME invisible, observe the change in column ordering: /*Script to change visibility of NICKNAME column*/ ALTER TABLE t_inv_col MODIFY nickname VISIBLE / /*Script to change visibility of NAME column*/ ALTER TABLE t_inv_col MODIFY name INVISIBLE / /*Query the USER_TAB_COLS for metadata information*/ SELECT column_id, column_name, hidden_column FROM user_tab_cols WHERE table_name = 'T_INV_COL' ORDER BY column_id / COLUMN_ID COLUMN_NAME HID ---------- ------------ --- 1 ID NO 2 DOB NO 3 NICKNAME NO NAME YES Temporal databases Temporal databases were released as a new feature in ANSI SQL:2011. The term temporal data can be understood as a piece of information that can be associated with a period within which the information is valid. Before the feature was included in Oracle Database 12c, data whose validity is linked with a time period had to be handled either by the application or using multiple predicates in the queries. Oracle 12c partially inherits the feature from the ANSI SQL:2011 standard to support the entities whose business validity can be bracketed with a time dimension. If you're relating a temporal database with the Oracle Database 11g Total Recall feature, you're wrong. The total recall feature records the transaction time of the data in the database to secure the transaction validity and not the functional validity. For example, an investment scheme is active between January to December. The date recorded in the database at the time of data loading is the transaction timestamp. [box type="info" align="" class="" width=""]Starting from Oracle 12c, the Total Recall feature has been rebranded as Flashback Data Archive and has been made available for all versions of Oracle Database.[/box] The valid time temporal can be enabled for a table by adding a time dimension using the PERIOD FOR clause on the date or timestamp columns of the table. The following script creates a table t_tmp_db with the valid time temporal: /*Create table with valid time temporal*/ CREATE TABLE t_tmp_db( id NUMBER, name VARCHAR2(30), policy_no VARCHAR2(50), policy_term number, pol_st_date date, pol_end_date date, PERIOD FOR pol_valid_time (pol_st_date, pol_end_date)) / Create some sample data in the table: /*Insert test data in the table*/ BEGIN INSERT INTO t_tmp_db VALUES (100, 'Packt', 'PACKT_POL1', 1, '01-JAN-2015', '31-DEC-2015'); INSERT INTO t_tmp_db VALUES (110, 'Packt', 'PACKT_POL2', 2, '01-JAN-2015', '30-JUN-2015'); INSERT INTO t_tmp_db VALUES (120, 'Packt', 'PACKT_POL3', 3, '01-JUL-2015', '31-DEC-2015'); COMMIT; END; / Let us set the current time period window using DBMS_FLASHBACK_ARCHIVE. Grant the EXECUTE privilege on the package to the scott user. /*Connect to sysdba to grant execute privilege to scott*/ conn / as sysdba GRANT EXECUTE ON dbms_flashback_archive to scott / Grant succeeded. /*Connect to scott*/ conn scott/tiger /*Set the valid time period as CURRENT*/ EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('CURRENT'); PL/SQL procedure successfully completed. Setting the valid time period as CURRENT means that all the tables with a valid time temporal will only list the rows that are valid with respect to today's date. You can set the valid time to a particular date too. 
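For instance, a sketch of pointing the session at a specific date could look like the following; the date and format are only examples, and the session is switched back to CURRENT so that the queries that follow behave as described.

/*Set the valid time period to a specific date (illustrative only)*/
EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('ASOF', TO_TIMESTAMP('01-AUG-2015','DD-MON-YYYY'));

/*Switch back to CURRENT for the examples that follow*/
EXEC DBMS_FLASHBACK_ARCHIVE.ENABLE_AT_VALID_TIME('CURRENT');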
/*Query the table*/ SELECT * from t_tmp_db / ID POLICY_NO POL_ST_DATE POL_END_DATE --------- ---------- ----------------------- ------------------- 100 PACKT_POL1 01-JAN-15 31-DEC-15 110 PACKT_POL2 01-JAN-15 30-JUN-15 [box type="info" align="" class="" width=""]Due to dependency on the current date, the result may vary when the reader runs the preceding queries.[/box] The query lists only those policies that are active as of March 2015. Since the third policy starts in July 2015, it is currently not active. In-Database Archiving Oracle Database 12c introduces In-Database Archiving to archive the low priority data in a table. The inactive data remains in the database but is not visible to the application. You can mark old data for archival, which is not actively required in the application except for regulatory purposes. Although the archived data is not visible to the application, it is available for querying and manipulation. In addition, the archived data can be compressed to improve backup performance. A table can be enabled by specifying the ROW ARCHIVAL clause at the table level, which adds a hidden column ORA_ARCHIVE_STATE to the table structure. The column value must be updated to mark a row for archival. For example: /*Create a table with row archiving*/ CREATE TABLE t_row_arch( x number, y number, z number) ROW ARCHIVAL / When we query the table structure in the USER_TAB_COLS view, we find an additional hidden column, which Oracle implicitly adds to the table: /*Query the columns information from user_tab_cols view*/ SELECT column_id,column_name,data_type, hidden_column FROM user_tab_cols WHERE table_name='T_ROW_ARCH' / COLUMN_ID COLUMN_NAME DATA_TYPE HID ---------- ------------------ ---------- --- ORA_ARCHIVE_STATE VARCHAR2 YES 1 X NUMBER NO 2 Y NUMBER NO 3 Z NUMBER NO Let us create test data in the table: /Insert test data in the table*/ BEGIN INSERT INTO t_row_arch VALUES (10,20,30); INSERT INTO t_row_arch VALUES (11,22,33); INSERT INTO t_row_arch VALUES (21,32,43); INSERT INTO t_row_arch VALUES (51,82,13); commit; END; / For testing purpose, let us archive the rows in the table where X > 50 by updating the ora_archive_state column: /*Update ORA_ARCHIVE_STATE column in the table*/ UPDATE t_row_arch SET ora_archive_state = 1 WHERE x > 50 / COMMIT / By default, the session displays only the active records from an archival-enabled table: /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ------ -------- ---------- 10 20 30 11 22 33 21 32 43 If you wish to display all the records, change the session setting: /*Change the session parameter to display the archived records*/ ALTER SESSION SET ROW ARCHIVAL VISIBILITY = ALL / Session altered. /*Query the table*/ SELECT * FROM t_row_arch / X Y Z ---------- ---------- ---------- 10 20 30 11 22 33 21 32 43 51 82 13 Defining a PL/SQL subprogram in the SELECT query and PRAGMA UDF Oracle Database 12c includes two new features to enhance the performance of functions when called from SELECT statements. With Oracle 12c, a PL/SQL subprogram can be created inline with the SELECT query in the WITH clause declaration. The function created in the WITH clause subquery is not stored in the database schema and is available for use only in the current query. Since a procedure created in the WITH clause cannot be called from the SELECT query, it can be called in the function created in the declaration section. The feature can be very handy in read-only databases where the developers were not able to create PL/SQL wrappers. 
Oracle Database 12c adds the new PRAGMA UDF to create a standalone function with the same objective. Earlier, the SELECT queries could invoke a PL/SQL function, provided the function didn't change the database purity state. The query performance used to degrade because of the context switch from SQL to the PL/SQL engine (and vice versa) and the different memory representations of data type representation in the processing engines. In the following example, the function fun_with_plsql calculates the annual compensation of an employee's monthly salary: /*Create a function in WITH clause declaration*/ WITH FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER IS BEGIN RETURN (p_sal * 12); END; SELECT ename, deptno, fun_with_plsql (sal) "annual_sal" FROM emp / ENAME DEPTNO annual_sal ---------- --------- ---------- SMITH 20 9600 ALLEN 30 19200 WARD 30 15000 JONES 20 35700 MARTIN 30 15000 BLAKE 30 34200 CLARK 10 29400 SCOTT 20 36000 KING 10 60000 TURNER 30 18000 ADAMS 20 13200 JAMES 30 11400 FORD 20 36000 MILLER 10 15600 14 rows selected. [box type="info" align="" class="" width=""]If the query containing the WITH clause declaration is not a top-level statement, then the top level statement must use the WITH_PLSQL hint. The hint will be used if INSERT, UPDATE, or DELETE statements are trying to use a SELECT with a WITHclause definition. Failure to include the hint results in an exception ORA-32034: unsupported use of WITH clause.[/box] A function can be created with the PRAGMA UDF to inform the compiler that the function is always called in a SELECT statement. Note that the standalone function created in the following code carries the same name as the one in the last example. The local WITH clause declaration takes precedence over the standalone function in the schema. /*Create a function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION fun_with_plsql (p_sal NUMBER) RETURN NUMBER is PRAGMA UDF; BEGIN RETURN (p_sal *12); END; / Since the objective of the feature is performance, let us go ahead with a case study to compare the performance when using a standalone function, a PRAGMA UDF function, and a WITHclause declared function. Test setup The exercise uses a test table with 1 million rows, loaded with random data. /*Create a table for performance test study*/ CREATE TABLE t_fun_plsql (id number, str varchar2(30)) / /*Generate and load random data in the table*/ INSERT /*+APPEND*/ INTO t_fun_plsql SELECT ROWNUM, DBMS_RANDOM.STRING('X', 20) FROM dual CONNECT BY LEVEL <= 1000000 / COMMIT / Case 1: Create a PL/SQL standalone function as it used to be until Oracle Database 12c. The function counts the numbers in the str column of the table. /*Create a standalone function without Oracle 12c enhancements*/ CREATE OR REPLACE FUNCTION f_count_num (p_str VARCHAR2) RETURN PLS_INTEGER IS BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / The PL/SQL block measures the elapsed and CPU time when working with a pre-Oracle 12c standalone function. These numbers will serve as the baseline for our case study. 
/*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a standalone function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 1: Performance of a standalone function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a standalone function: Total elapsed time:1559 Total CPU time:1366 PL/SQL procedure successfully completed. Case 2: Create a PL/SQL function using PRAGMA UDF to count the numbers in the str column. /*Create the function with PRAGMA UDF*/ CREATE OR REPLACE FUNCTION f_count_num_pragma (p_str VARCHAR2) RETURN PLS_INTEGER IS PRAGMA UDF; BEGIN RETURN (REGEXP_COUNT(p_str,'d')); END; / Let us now check the performance of the PRAGMA UDF function using the following PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of a PRAGMA UDF function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; CURSOR C1 IS SELECT f_count_num_pragma (str) FROM t_fun_plsql; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.GET_TIME (); l_cpu_time := DBMS_UTILITY.GET_CPU_TIME (); OPEN c1; FETCH c1 BULK COLLECT INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 2: Performance of a PRAGMA UDF function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of a PRAGMA UDF function: Total elapsed time:664 Total CPU time:582 PL/SQL procedure successfully completed. Case 3: The following PL/SQL block dynamically executes the function in the WITH clause subquery. Note that, unlike other SELECT statements, a SELECT query with a WITH clause declaration cannot be executed statically in the body of a PL/SQL block. /*Set server output on to display messages*/ SET SERVEROUTPUT ON /*Anonymous block to measure performance of inline function*/ DECLARE l_el_time PLS_INTEGER; l_cpu_time PLS_INTEGER; l_sql VARCHAR2(32767); c1 sys_refcursor; TYPE t_tab_rec IS TABLE OF PLS_INTEGER; l_tab t_tab_rec; BEGIN l_el_time := DBMS_UTILITY.get_time; l_cpu_time := DBMS_UTILITY.get_cpu_time; l_sql := 'WITH FUNCTION f_count_num_with (p_str VARCHAR2) RETURN NUMBER IS BEGIN RETURN (REGEXP_COUNT(p_str,'''||''||'d'||''')); END; SELECT f_count_num_with(str) FROM t_fun_plsql'; OPEN c1 FOR l_sql; FETCH c1 bulk collect INTO l_tab; CLOSE c1; DBMS_OUTPUT.PUT_LINE ('Case 3: Performance of an inline function'); DBMS_OUTPUT.PUT_LINE ('Total elapsed time:'||to_char(DBMS_UTILITY.GET_TIME () - l_el_time)); DBMS_OUTPUT.PUT_LINE ('Total CPU time:'||to_char(DBMS_UTILITY.GET_CPU_TIME () - l_cpu_time)); END; / Performance of an inline function: Total elapsed time:830 Total CPU time:718 PL/SQL procedure successfully completed. Comparative analysis Comparing the results from the preceding three cases, it's clear that the Oracle 12c flavor of PL/SQL functions out-performs the pre-12c standalone function by a high margin. 
From the following matrix, it is apparent that using the PRAGMA UDF or the WITH clause declaration improves performance by (roughly) a factor of two.

Case  Description                                                 Elapsed time  CPU time  Performance gain factor (by CPU time)
1     Standalone PL/SQL function in pre-Oracle 12c database       1559          1366      1x
2     Standalone PL/SQL PRAGMA UDF function in Oracle 12c         664           582       2.3x
3     Function created in WITH clause declaration in Oracle 12c   830           718       1.9x

[box type="info" align="" class="" width=""]Note that the numbers may differ slightly in the reader's testing environment, but the comparison should lead to the same conclusion.[/box]

The PL/SQL program unit whitelisting

Prior to Oracle 12c, a standalone or packaged PL/SQL unit could be invoked by all other programs in the session's schema. Oracle Database 12c allows users to prevent unauthorized access to PL/SQL program units: you can now specify the whitelist of program units that are allowed to invoke a particular program. The program header or the package specification lists these units in the ACCESSIBLE BY clause. Any other program unit, including cross-schema references (even SYS-owned objects), that tries to access a protected subprogram receives the exception PLS-00904: insufficient privilege to access object [object name].

The feature can be very useful in an extremely sensitive development environment. Suppose a package PKG_FIN_PROC contains sensitive implementation routines for a financial institution, and its subprograms are called by another PL/SQL package, PKG_FIN_INTERNALS. The API layer exposes a fixed list of programs through a public API called PKG_CLIENT_ACCESS. To restrict access to the packaged routines in PKG_FIN_PROC, we can build a safety net that allows access only to authorized programs.

The following PL/SQL package PKG_FIN_PROC contains two subprograms—P_FIN_QTR and P_FIN_ANN. The ACCESSIBLE BY clause includes PKG_FIN_INTERNALS, which means that all other program units, including anonymous PL/SQL blocks, are blocked from invoking PKG_FIN_PROC constructs.

/*Package with the ACCESSIBLE BY clause*/
CREATE OR REPLACE PACKAGE pkg_fin_proc
ACCESSIBLE BY (PACKAGE pkg_fin_internals)
IS
   PROCEDURE p_fin_qtr;
   PROCEDURE p_fin_ann;
END;
/

[box type="info" align="" class="" width=""]The ACCESSIBLE BY clause can be specified for schema-level programs only.[/box]

Let's see what happens when we invoke the packaged subprogram from an anonymous PL/SQL block.

/*Invoke the packaged subprogram from the PL/SQL block*/
BEGIN
   pkg_fin_proc.p_fin_qtr;
END;
/

   pkg_fin_proc.p_fin_qtr;
   *
ERROR at line 2:
ORA-06550: line 2, column 4:
PLS-00904: insufficient privilege to access object PKG_FIN_PROC
ORA-06550: line 2, column 4:
PL/SQL: Statement ignored

As expected, the compiler throws an exception because invoking the whitelisted package from an anonymous block is not allowed. The ACCESSIBLE BY clause can be included in the header of PL/SQL procedures and functions, packages, and object types.
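By contrast, a unit named in the ACCESSIBLE BY list can reference the protected package freely. The following is a minimal sketch, assuming a package body for PKG_FIN_PROC has been created separately; the procedure name p_run_qtr_close is purely illustrative.

/*A whitelisted caller compiles and can invoke the protected package*/
CREATE OR REPLACE PACKAGE pkg_fin_internals IS
   PROCEDURE p_run_qtr_close;
END;
/

CREATE OR REPLACE PACKAGE BODY pkg_fin_internals IS
   PROCEDURE p_run_qtr_close IS
   BEGIN
      pkg_fin_proc.p_fin_qtr;   -- allowed: PKG_FIN_INTERNALS is listed in ACCESSIBLE BY
   END;
END;
/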
Granting roles to PL/SQL program units

Before Oracle Database 12c, a PL/SQL unit created with definer's rights (the default AUTHID) always executed with the definer's rights, whether or not the invoker held the required privileges. This can lead to an unfair situation in which the invoking user performs unwanted operations without possessing the correct set of privileges. Similarly, for an invoker's rights unit, if the invoking user possesses a higher set of privileges than the definer, they might end up performing unauthorized operations.

Oracle Database 12c secures definer's rights by allowing the defining user to grant complementary roles to individual PL/SQL subprograms and packages. From a security standpoint, granting roles to schema-level subprograms provides granular control, as the privileges of the invoker are validated at the time of execution.

In the following example, we will create two users: U1 and U2. The user U1 creates a PL/SQL procedure P_INC_PRICE that adds a surcharge to the price of a product by a certain amount. U1 grants the execute privilege to user U2.

Test setup

Let's create the two users and give them the required privileges.

/*Create a user with a password*/
CREATE USER u1 IDENTIFIED BY u1
/
User created.

/*Grant connect privileges to the user*/
GRANT CONNECT, RESOURCE TO u1
/
Grant succeeded.

/*Create a user with a password*/
CREATE USER u2 IDENTIFIED BY u2
/
User created.

/*Grant connect privileges to the user*/
GRANT CONNECT, RESOURCE TO u2
/
Grant succeeded.

The user U1 owns the PRODUCTS table. Let's create and populate the table.

/*Connect to U1*/
CONN u1/u1

/*Create the table PRODUCTS*/
CREATE TABLE products
(
   prod_id    INTEGER,
   prod_name  VARCHAR2(30),
   prod_cat   VARCHAR2(30),
   price      INTEGER
)
/

/*Insert the test data in the table*/
BEGIN
   DELETE FROM products;
   INSERT INTO products VALUES (101, 'Milk',   'Dairy', 20);
   INSERT INTO products VALUES (102, 'Cheese', 'Dairy', 50);
   INSERT INTO products VALUES (103, 'Butter', 'Dairy', 75);
   INSERT INTO products VALUES (104, 'Cream',  'Dairy', 80);
   INSERT INTO products VALUES (105, 'Curd',   'Dairy', 25);
   COMMIT;
END;
/

The procedure p_inc_price is designed to increase the price of a product by a given amount. Note that the procedure is created with definer's rights.

/*Create the procedure with the definer's rights*/
CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER)
IS
BEGIN
   UPDATE products
   SET price = price + p_amt
   WHERE prod_id = p_prod_id;
END;
/

The user U1 grants the execute privilege on p_inc_price to U2.

/*Grant execute on the procedure to the user U2*/
GRANT EXECUTE ON p_inc_price TO U2
/

The user U2 logs in and executes the procedure P_INC_PRICE to increase the price of Milk by 5 units.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

PL/SQL procedure successfully completed.

The last code listing exposes a gray area: the user U2, though not authorized to view the PRODUCTS data, manipulates it with the definer's rights. We need a solution to the problem. The first step is to change the procedure from definer's rights to invoker's rights.

/*Connect to U1*/
CONN u1/u1

/*Modify the privilege authentication for the procedure to invoker's rights*/
CREATE OR REPLACE PROCEDURE p_inc_price (p_prod_id NUMBER, p_amt NUMBER)
AUTHID CURRENT_USER
IS
BEGIN
   UPDATE products
   SET price = price + p_amt
   WHERE prod_id = p_prod_id;
END;
/
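Optionally, the change can be verified from the data dictionary. The query below assumes the AUTHID column of the USER_PROCEDURES view, which reports the rights model of schema-level units.

/*Confirm that P_INC_PRICE now uses invoker's rights*/
SELECT object_name, authid
FROM   user_procedures
WHERE  object_name = 'P_INC_PRICE'
/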
Now, if we execute the procedure from U2, it throws an exception because it cannot find the PRODUCTS table in its own schema.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

BEGIN
*
ERROR at line 1:
ORA-00942: table or view does not exist
ORA-06512: at "U1.P_INC_PRICE", line 5
ORA-06512: at line 2

In a similar scenario in the past, a database administrator could simply have granted select or update privileges to U2, which is not an optimal solution from a security standpoint. Oracle 12c allows us to create program units with invoker's rights and grant the required roles to the program units rather than to the users. An invoker's rights unit then executes with the invoker's privileges plus the roles granted to the program itself. Let's check out the steps to create a role and assign it to the procedure.

SYSDBA creates the role and assigns it to the user U1. Using the ADMIN or DELEGATE option with the grant enables the user to grant the role to other entities.

/*Connect as SYSDBA*/
CONN / as sysdba

/*Create a role*/
CREATE ROLE prod_role
/

/*Grant role to user U1 with delegate option*/
GRANT prod_role TO U1 WITH DELEGATE OPTION
/

Now, user U1 assigns the required set of privileges to the role. The role is then assigned to the required subprogram. Note that only roles, and not individual privileges, can be granted to schema-level subprograms.

/*Connect to U1*/
CONN u1/u1

/*Grant SELECT and UPDATE privileges on PRODUCTS to the role*/
GRANT SELECT, UPDATE ON PRODUCTS TO prod_role
/

/*Grant role to the procedure*/
GRANT prod_role TO PROCEDURE p_inc_price
/

User U2 tries to execute the procedure again. This time the procedure executes successfully, which means the price of Milk has been increased by 5 units.

/*Connect to U2*/
CONN u2/u2

/*Invoke the procedure P_INC_PRICE in a PL/SQL block*/
BEGIN
   U1.P_INC_PRICE (101,5);
   COMMIT;
END;
/

PL/SQL procedure successfully completed.

User U1 verifies the result with a SELECT query.

/*Connect to U1*/
CONN u1/u1

/*Query the table to verify the change*/
SELECT * FROM products
/

   PROD_ID PROD_NAME  PROD_CAT        PRICE
---------- ---------- ---------- ----------
       101 Milk       Dairy              25
       102 Cheese     Dairy              50
       103 Butter     Dairy              75
       104 Cream      Dairy              80
       105 Curd       Dairy              25

Miscellaneous PL/SQL enhancements

Besides the preceding key features, Oracle 12c brings a number of smaller PL/SQL changes:

- An invoker's rights function can now be result-cached. Until Oracle 11g, only definer's rights programs were allowed to cache their results; Oracle 12c adds the invoking user's identity to the result cache so that cached results are not shared across invokers.
- The compilation parameter PLSQL_DEBUG has been deprecated.
- Two conditional compilation inquiry directives, $$PLSQL_UNIT_OWNER and $$PLSQL_UNIT_TYPE, have been introduced (see the short sketch after this list).
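The following sketch illustrates the first and last items. The names f_cached_price and p_show_unit_info are hypothetical, and the PRODUCTS table from the earlier example is assumed to be available.

/*Invoker's rights function that is result-cached (allowed in 12c)*/
CREATE OR REPLACE FUNCTION f_cached_price (p_prod_id NUMBER)
RETURN NUMBER
AUTHID CURRENT_USER
RESULT_CACHE
IS
   l_price products.price%TYPE;
BEGIN
   SELECT price INTO l_price FROM products WHERE prod_id = p_prod_id;
   RETURN l_price;
END;
/

/*Print the new inquiry directives from inside a named unit*/
CREATE OR REPLACE PROCEDURE p_show_unit_info IS
BEGIN
   DBMS_OUTPUT.PUT_LINE ('Unit owner: '||$$PLSQL_UNIT_OWNER);
   DBMS_OUTPUT.PUT_LINE ('Unit type : '||$$PLSQL_UNIT_TYPE);
END;
/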
Summary

This article covers the top-rated new features of Oracle 12c SQL and PL/SQL, as well as some miscellaneous PL/SQL enhancements.

Further resources on this subject:
- What is Oracle Public Cloud? [article]
- Oracle APEX 4.2 reporting [article]
- Oracle E-Business Suite with Desktop Integration [article]