Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1204 Articles
article-image-integrating-storm-and-hadoop
Packt
04 Sep 2013
17 min read
Save for later

Integrating Storm and Hadoop

Packt
04 Sep 2013
17 min read
(For more resources related to this topic, see here.) In this article, we will implement the Batch and Service layers to complete the architecture. There are some key concepts underlying this big data architecture: Immutable state Abstraction and composition Constrain complexity Immutable state is the key, in that it provides true fault-tolerance for the architecture. If a failure is experienced at any level, we can always rebuild the data from the original immutable data. This is in contrast to many existing data systems, where the paradigm is to act on mutable data. This approach may seem simple and logical; however, it exposes the system to a particular kind of risk in which the state is lost or corrupted. It also constrains the system, in that you can only work with the current view of the data; it isn't possible to derive new views of the data. When the architecture is based on a fundamentally immutable state, it becomes both flexible and fault-tolerant. Abstractions allow us to remove complexity in some cases, and in others they can introduce complexity. It is important to achieve an appropriate set of abstractions that increase our productivity and remove complexity, but at an appropriate cost. It must be noted that all abstractions leak, meaning that when failures occur at a lower abstraction, they will affect the higher-level abstractions. It is therefore often important to be able to make changes within the various layers and understand more than one layer of abstraction. The designs we choose to implement our abstractions must therefore not prevent us from reasoning about or working at the lower levels of abstraction when required. Open source projects are often good at this, because of the obvious access to the code of the lower level abstractions, but even with source code available, it is easy to convolute the abstraction to the extent that it becomes a risk. In a big data solution, we have to work at higher levels of abstraction in order to be productive and deal with the massive complexity, so we need to choose our abstractions carefully. In the case of Storm, Trident represents an appropriate abstraction for dealing with the data-processing complexity, but the lower level Storm API on which Trident is based isn't hidden from us. We are therefore able to easily reason about Trident based on an understanding of lower-level abstractions within Storm. Another key issue to consider when dealing with complexity and productivity is composition. Composition within a given layer of abstraction allows us to quickly build out a solution that is well tested and easy to reason about. Composition is fundamentally decoupled, while abstraction contains some inherent coupling to the lower-level abstractions—something that we need to be aware of. Finally, a big data solution needs to constrain complexity. Complexity always equates to risk and cost in the long run, both from a development perspective and from an operational perspective. Real-time solutions will always be more complex than batch-based systems; they also lack some of the qualities we require in terms of performance. Nathan Marz's Lambda architecture attempts to address this by combining the qualities of each type of system to constrain complexity and deliver a truly fault-tolerant architecture. We divided this flow into preprocessing and "at time" phases, using streams and DRPC streams respectively. We also introduced time windows that allowed us to segment the preprocessed data. 
In this article, we complete the entire architecture by implementing the Batch and Service layers. The Service layer is simply a store of a view of the data. In this case, we will store this view in Cassandra, as it is a convenient place to access the state alongside Trident's state. The preprocessed view is identical to the preprocessed view created by Trident, counted elements of the TF-IDF formula (D, DF, and TF), but in the batch case, the dataset is much larger, as it includes the entire history. The Batch layer is implemented in Hadoop using MapReduce to calculate the preprocessed view of the data. MapReduce is extremely powerful, but like the lower-level Storm API, is potentially too low-level for the problem at hand for the following reasons: We need to describe the problem as a data pipeline; MapReduce isn't congruent with such a way of thinking Productivity We would like to think of a data pipeline in terms of streams of data, tuples within the stream and predicates acting on those tuples. This allows us to easily describe a solution to a data processing problem, but it also promotes composability, in that predicates are fundamentally composable, but pipelines themselves can also be composed to form larger, more complex pipelines. Cascading provides such an abstraction for MapReduce in the same way as Trident does for Storm. With these tools, approaches, and considerations in place, we can now complete our real-time big data architecture. There are a number of elements, that we will update, and a number of elements that we will add. The following figure illustrates the final architecture, where the elements in light grey will be updated from the existing recipe, and the elements in dark grey will be added in this article: Implementing TF-IDF in Hadoop TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand. Getting ready Before you begin, please ensure that you have installed Hadoop by following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. How to do it… Start by creating the project using the lein command: lein new tfidf-cascalog Next, you need to edit the project.clj file to include the dependencies: (defproject tfidf-cascalog "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.1"] [org.apache.cassandra/cassandra-all "1.1.5"] [clojurewerkz/cassaforte "1.0.0-beta11-SNAPSHOT"] [quintona/cascading-cassandra "0.0.7-SNAPSHOT"] [clj-time "0.5.0"] [cascading.avro/avro-scheme "2.2-SNAPSHOT"] [cascalog-more-taps "0.3.0"] [org.apache.httpcomponents/httpclient "4.2.3"]] :profiles{:dev{:dependencies[[org.apache.hadoop/hadoop-core "0.20.2-dev"] [lein-midje "3.0.1"] [cascalog/midje-cascalog "1.10.1"]]}}) It is always a good idea to validate your dependencies; to do this, execute lein deps and review any errors. 
In this particular case, cascading-cassandra has not been deployed to clojars, and so you will receive an error message. Simply download the source from https://github.com/quintona/cascading-cassandra and install it into your local repository using Maven. It is also good practice to understand your dependency tree. This is important to not only prevent duplicate classpath issues, but also to understand what licenses you are subject to. To do this, simply run lein pom, followed by mvn dependency:tree. You can then review the tree for conflicts. In this particular case, you will notice that there are two conflicting versions of Avro. You can fix this by adding the appropriate exclusions: [org.apache.cassandra/cassandra-all "1.1.5" :exclusions [org.apache.cassandra.deps/avro]] We then need to create the Clojure-based Cascade queries that will process the document data. We first need to create the query that will create the "D" view of the data; that is, the D portion of the TF-IDF function. This is achieved by defining a Cascalog function that will output a key and a value, which is composed of a set of predicates: (defn D [src] (let [src (select-fields src ["?doc-id"])] (<- [?key ?d-str] (src ?doc-id) (c/distinct-count ?doc-id :> ?n-docs) (str "twitter" :> ?key) (str ?n-docs :> ?d-str)))) You can define this and any of the following functions in the REPL, or add them to core.clj in your project. If you want to use the REPL, simply use lein repl from within the project folder. The required namespace (the use statement), require, and import definitions can be found in the source code bundle. We then need to add similar functions to calculate the TF and DF values: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) (defn TF [src] (<- [?key ?tf-count-str] (src ?doc-id ?time ?tf-word) (c/count ?tf-count) (str ?doc-id ?tf-word :> ?key) (str ?tf-count :> ?tf-count-str))) This Batch layer is only interested in calculating views for all the data leading up to, but not including, the current hour. This is because the data for the current hour will be provided by Trident when it merges this batch view with the view it has calculated. In order to achieve this, we need to filter out all the records that are within the current hour. The following function makes that possible: (deffilterop timing-correct? [doc-time] (let [now (local-now) interval (in-minutes (interval (from-long doc-time) now))] (if (< interval 60) false true)) Each of the preceding query definitions require a clean stream of words. The text contained in the source documents isn't clean. It still contains stop words. In order to filter these and emit a clean set of words for these queries, we can compose a function that splits the text into words and filters them based on a list of stop words and the time function defined previously: (defn etl-docs-gen [rain stop] (<- [?doc-id ?time ?word] (rain ?doc-id ?time ?line) (split ?line :> ?word-dirty) ((c/comp s/trim s/lower-case) ?word-dirty :> ?word) (stop ?word :> false) (timing-correct? ?time))) We will be storing the outputs from our queries to Cassandra, which requires us to define a set of taps for these views: (defn create-tap [rowkey cassandra-ip] (let [keyspace storm_keyspace column-family "tfidfbatch" scheme (CassandraScheme. 
cassandra-ip "9160" keyspace column-family rowkey {"cassandra.inputPartitioner""org.apache.cassandra.dht.RandomPartitioner" "cassandra.outputPartitioner" "org.apache.cassandra.dht.RandomPartitioner"}) tap (CassandraTap. scheme)] tap)) (defn create-d-tap [cassandra-ip] (create-tap "d"cassandra-ip)) (defn create-df-tap [cassandra-ip] (create-tap "df" cassandra-ip)) (defn create-tf-tap [cassandra-ip] (create-tap "tf" cassandra-ip)) The way this schema is created means that it will use a static row key and persist name-value pairs from the tuples as column:value within that row. This is congruent with the approach used by the Trident Cassandra adaptor. This is a convenient approach, as it will make our lives easier later. We can complete the implementation by a providing a function that ties everything together and executes the queries: (defn execute [in stop cassandra-ip] (cc/connect! cassandra-ip) (sch/set-keyspace storm_keyspace) (let [input (tap/hfs-tap (AvroScheme. (load-schema)) in) stop (hfs-delimited stop :skip-header? true) src (etl-docs-gen input stop)] (?- (create-d-tap cassandra-ip) (D src)) (?- (create-df-tap cassandra-ip) (DF src)) (?- (create-tf-tap cassandra-ip) (TF src)))) Next, we need to get some data to test with. I have created some test data, which is available at https://bitbucket.org/qanderson/tfidf-cascalog. Simply download the project and copy the contents of src/data to the data folder in your project structure. We can now test this entire implementation. To do this, we need to insert the data into Hadoop: hadoop fs -copyFromLocal ./data/document.avro data/document.avro hadoop fs -copyFromLocal ./data/en.stop data/en.stop Then launch the execution from the REPL: => (execute "data/document" "data/en.stop" "127.0.0.1") How it works… There are many excellent guides on the Cascalog wiki (https://github.com/nathanmarz/cascalog/wiki), but for completeness's sake, the nature of a Cascalog query will be explained here. Before that, however, a revision of Cascading pipelines is required. The following is quoted from the Cascading documentation (http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/): Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams. This concept is embodied in Cascalog through the definition of queries. A query takes a set of inputs and applies a list of predicates across the fields in each tuple of the input stream. Queries are composed through the application of many predicates. Queries can also be composed to form larger, more complex queries. In either event, these queries are reduced down into a Cascading pipeline. Cascalog therefore provides an extremely terse and powerful abstraction on top of Cascading; moreover, it enables an excellent development workflow through the REPL. Queries can be easily composed and executed against smaller representative datasets within the REPL, providing the idiomatic API and development workflow that makes Clojure beautiful. 
If we unpack the query we defined for TF, we will find the following code: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) The <- macro defines a query, but does not execute it. The initial vector, [?key ?df-count-str], defines the output fields, which is followed by a list of predicate functions. Each predicate can be one of the following three types: Generators: A source of data where the underlying source is either a tap or another query. Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter. Operations typically act within the scope of a single tuple. Aggregators: Functions that act across tuples to create aggregate representations of data. For example, count and sum. The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered as input variables for operations and output variables for generators and aggregators. The (src ?doc-id ?time ?df-word) predicate function names the first three values within the input tuple, whose names are applicable within the query scope. Therefore, if the tuple ("doc1" 123324 "This") arrives in this query, the variables would effectively bind as follows: ?doc-id: "doc1" ?time: 123324 ?df-word: "This" Each predicate within the scope of the query can use any bound value or add new bound variables to the scope of the query. The final set of bound values that are emitted is defined by the output vector. We defined three queries, each calculating a portion of the value required for the TF-IDF algorithm. These are fed from two single taps, which are files stored in the Hadoop filesystem. The document file is stored using Apache Avro, which provides a high-performance and dynamic serialization layer. Avro takes a record definition and enables serialization/deserialization based on it. The record structure, in this case, is for a document and is defined as follows: {"namespace": "storm.cookbook", "type": "record", "name": "Document", "fields": [ {"name": "docid", "type": "string"}, {"name": "time", "type": "long"}, {"name": "line", "type": "string"} ] } Both the stop words and documents are fed through an ETL function that emits a clean set of words that have been filtered. The words are derived by splitting the line field using a regular expression: (defmapcatop split [line] (s/split line #"[[](),.)s]+")) The ETL function is also a query, which serves as a source for our downstream queries, and defines the [?doc-id ?time ?word] output fields. The output tap, or sink, is based on the Cassandra scheme. A query defines predicate logic, not the source and destination of data. The sink ensures that the outputs of our queries are sent to Cassandra. The ?- macro executes a query, and it is only at execution time that a query is bound to its source and destination, again allowing for extreme levels of composition. The following, therefore, executes the TF query and outputs to Cassandra: (?- (create-tf-tap cassandra-ip) (TF src)) There's more… The Avro test data was created using the test data from the Cascading tutorial at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/. Within this tutorial is the rain.txt tab-separated data file. A new column was created called time that holds the Unix epoc time in milliseconds. 
The updated text file was then processed using some basic Java code that leverages Avro: Schema schema = Schema.parse(SandboxMain.class.getResourceAsStream("/document.avsc")); File file = new File("document.avro"); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); dataFileWriter.create(schema, file); BufferedReader reader = new BufferedReader(new InputStreamReader(SandboxMain.class.getResourceAsStream("/rain.txt"))); String line = null; try { while ((line = reader.readLine()) != null) { String[] tokens = line.split("t"); GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tokens[0]); docEntry.put("time", Long.parseLong(tokens[1])); docEntry.put("line", tokens[2]); dataFileWriter.append(docEntry); } } catch (IOException e) { e.printStackTrace(); } dataFileWriter.close(); Persisting documents from Storm In the previous recipe, we looked at deriving precomputed views of our data taking some immutable data as the source. In that recipe, we used statically created data. In an operational system, we need Storm to store the immutable data into Hadoop so that it can be used in any preprocessing that is required. How to do it… As each tuple is processed in Storm, we must generate an Avro record based on the document record definition and append it to the data file within the Hadoop filesystem. We must create a Trident function that takes each document tuple and stores the associated Avro record. Within the tfidf-topology project created in, inside the storm.cookbook.tfidf.function package, create a new class named PersistDocumentFunction that extends BaseFunction. Within the prepare function, initialize the Avro schema and document writer: public void prepare(Map conf, TridentOperationContext context) { try { String path = (String) conf.get("DOCUMENT_PATH"); schema = Schema.parse(PersistDocumentFunction.class .getResourceAsStream("/document.avsc")); File file = new File(path); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); if(file.exists()) dataFileWriter.appendTo(file); else dataFileWriter.create(schema, file); } catch (IOException e) { throw new RuntimeException(e); } } As each tuple is received, coerce it into an Avro record and add it to the file: public void execute(TridentTuple tuple, TridentCollector collector) { GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tuple.getStringByField("documentId")); docEntry.put("time", Time.currentTimeMillis()); docEntry.put("line", tuple.getStringByField("document")); try { dataFileWriter.append(docEntry); dataFileWriter.flush(); } catch (IOException e) { LOG.error("Error writing to document record: " + e); throw new RuntimeException(e); } } Next, edit the TermTopology.build topology and add the function to the document stream: documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); Finally, include the document path into the topology configuration: conf.put("DOCUMENT_PATH", "document.avro"); How it works… There are various logical streams within the topology, and certainly the input for the topology is not in the appropriate state for the recipes in this article containing only URLs. 
We therefore need to select the correct stream from which to consume tuples, coerce these into Avro records, and serialize them into a file. The previous recipe will then periodically consume this file. Within the context of the topology definition, include the following code: Stream documentStream = getUrlStream(topology, spout) .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source")); documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); The function should consume tuples from the document stream whose tuples are populated with already fetched documents.
Read more
  • 0
  • 0
  • 1293

article-image-sql-server-integration-services-ssis
Packt
03 Sep 2013
5 min read
Save for later

SQL Server Integration Services (SSIS)

Packt
03 Sep 2013
5 min read
(For more resources related to this topic, see here.) SSIS as an ETL – extract, transform, and load tool The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while moving the data between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can even be used to move data between external non-SQL systems without requiring SQL server to be the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file. SSIS also provides a workflow engine for automation of the different tasks (for example, data flows, tasks executions, and so on.) that are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL Agent can run multiple jobs independent of each other. An SSIS solution consists of one or more package, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, automation of command line commands, and others. In particular, a control flow usually includes one or more data flow tasks, which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has one control flow, and as many data flows as necessary. Data flow execution is dictated by the content of the control flow. A detailed discussion on SSIS and its components are outside the scope of this article and it assumes that you are familiar with the basic SSIS package development using Business Intelligence Development Studio (SQL Server 2005/2008/2008 R2) or SQL Server Data Tools (SQL Server 2012). If you are a beginner in SSIS, it is highly recommended to read from a bunch of good SSIS books available as a prerequisite. In the rest of this article, we will focus on how to consume Hive data from SSIS using the Hive ODBC driver. The prerequisites to develop the package shown in this article are SQL Server Data Tools, (which comes as a part of SQL Server 2012 Client Tools and Components) and the 32-bit Hive ODBC Driver installed. You will also need your Hadoop cluster up with Hive running on it. Developing the package SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, deploy, and develop SSIS packages. SSDT is installed when you choose to install SQL Server Client tools and Workstation Components from your SQL Server installation media. SSDT supports creation of Integration Services, Analysis Services, and Reporting Services projects. Here, we will focus on Integration Services project type. Creating the project Launch SQL Server Data Tools from SQL Server 2012 Program folders as shown in the following screenshot: Create a new Project and choose Integration Services Project in the New Project dialog as shown in the following screenshot: This should create the SSIS project with a blank Package.dtsx inside it visible in the Solution Explorer window of the project as shown in the following screenshot: Creating the Data Flow A Data Flow is a SSIS package component, which consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. 
Before you can add a data flow to a package, the package control flow must include a Data Flow task. The Data Flow task is the executable within the SSIS package, which creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. To create a Data Flow task, perform the following steps: Double-click (or drag-and-drop) on a Data Flow Task from the toolbox in the left. This should place a Data Flow Task in the Control Flow canvas of the package as in the following screenshot: Double-click on the Data Flow Task or click on the Data Flow tab in SSDT to edit the task and design the source and destination components as in the following screenshot: Creating the source Hive connection The first thing we need to do is create a connection manager that will connect to our Hive data tables hosted in the Hadoop cluster. We will use an ADO.NET connection, which will use the DSN HadoopOnLinux we created earlier to connect to Hive. To create the connection, perform the following steps: Right-click on the Connection Managers section in the project and click on New ADO.Net Connection... as shown in the following screenshot: From the list of providers, navigate to .Net Providers | ODBC Data Provider and click on OK in the Connection Manager window as shown in the following screenshot: Select the HadoopOnLinux DSN from the Data Sources list. Provide the Hadoop cluster credentials and test connection should succeed as shown in the following screenshot: Summary In this way we learned how to create an SQL Server Integration Services package to move data from Hadoop to SQL Server using the Hive ODBC driver. Resources for Article: Further resources on this subject: Microsoft SQL Azure Tools [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Getting Started with SQL Developer: Part 1 [Article]
Read more
  • 0
  • 0
  • 1827

article-image-working-time
Packt
03 Sep 2013
22 min read
Save for later

Working with Time

Packt
03 Sep 2013
22 min read
(For more resources related to this topic, see here.) Time handling features are an important part of every BI system. Programming languages, database systems, they all incorporate various time-related functions and Microsoft SQL Server Analysis Services (SSAS) is no exception there. In fact, that's one of its main strengths. The MDX language has various time-related functions designed to work with a special type of dimension called the Time and its typed attributes. While it's true that some of those functions work with any type of dimension, their usefulness is most obvious when applied to time-type dimensions. An additional prerequisite is the existence of multi-level hierarchies, also known as user hierarchies, in which types of levels must be set correctly or some of the time-related functions will either give false results or will not work at all. In this article we're dealing with typical operations, such as year-to-date calculations, running totals, and jumping from one period to another. We go into detail with each operation, explaining known and less known variants and pitfalls. We will discuss why some time calculations can create unnecessary data for the periods that should not have data at all, and why we should prevent it from happening. We will then show you how to prevent time calculations from having values after a certain point in time. In most BI projects, there are always reporting requirements to show measures for today, yesterday, month-to-date, quarter-to-date, year-to-date, and so on. We have three recipes to explore various ways to calculate today's date, and how to turn it into a set and use MDX's powerful set operations to calculate other related periods. Calculating date and time spans is also a common reporting requirement. Calculating the YTD (Year-To-Date) value In this recipe we will look at how to calculate the Year-To-Date value of a measure, that is, the accumulated value of all dates in a year up to the current member on the date dimension. An MDX function YTD() can be used to calculate the Year-To-Date value, but not without its constraints. In this recipe, we will discuss the constraints when using the YTD() function and also the alternative solutions. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In order for this type of calculation to work, we need a dimension marked as Time in the Type property, in the Dimension Structure tab of SSDT. That should not be a problem because almost every database contains at least one such dimension and Adventure Works is no exception here. In this example, we're going to use the Date dimension. We can verify in SSDT that the Date dimension's Type property is set to Time. See the following screenshot from SSDT: Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] Once executed, the preceding query returns reseller sales values for every week in the database. How to do it... We are going to use the YTD() function, which takes only one member expression, and returns all dates in the year up to the specified member. Then we will use the aggregation function SUM() to sum up the Reseller Sales Amount. Follow these steps to create a calculated measure with YTD calculation: Add the WITH block of the query. 
Create a new calculated measure within the WITH block and name it Reseller Sales YTD. The new measure should return the sum of the measure Reseller Sales Amount using the YTD() function and the current date member of the hierarchy of interest. Add the new measure on axis 0 and execute the complete query: WITH MEMBER [Measures].[Reseller Sales YTD] AS Sum( YTD( [Date].[Calendar Weeks].CurrentMember ), [Measures].[Reseller Sales Amount] ) SELECT { [Measures].[Reseller Sales Amount], [Measures].[Reseller Sales YTD] } ON 0, { [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1 FROM [Adventure Works] The result will include the second column, the one with the YTD values. Notice how the values in the second column increase over time: How it works... The YTD() function returns the set of members from the specified date hierarchy, starting from the first date of the year and ending with the specified member. The first date of the year is calculated according to the level [Calendar Year] marked as Years type in the hierarchy [Calendar Weeks]. In our example, the YTD() value for the member Week 9 CY 2008 is a set of members starting from Week 1 CY 2008 and going up to that member because the upper level containing years is of the Years type. The set is then summed up using the SUM() function and the Reseller Sales Amount measure. If we scroll down, we'll see that the cumulative sum resets every year, which means that YTD() works as expected. In this example we used the most common aggregation function, SUM(), in order to aggregate the values of the measure throughout the calculated set. SUM() was used because the aggregation type of the Reseller Sales Amount measure is Sum. Alternatively, we could have used the Aggregate() function instead. More information about that function can be found later in this recipe. There's more... Sometimes it is necessary to create a single calculation that will work for any user hierarchy of the date dimension. In that case, the solution is to prepare several YTD() functions, each using a different hierarchy, cross join them, and then aggregate that set using a proper aggregation function (Sum, Aggregate, and so on). However, bear in mind that this will only work if all user hierarchies used in the expression share the same year level. In other words, that there is no offset in years among them (such as exists between the fiscal and calendar hierarchies in Adventure Works cube in 2008 R2). Why does it have to be so? Because the cross join produces the set intersection of members on those hierarchies. Sets are generated relative to the position when the year starts. If there is offset in years, it is possible that sets won't have an intersection. In that case, the result will be an empty space. Now let's continue with a couple of working examples. Here's an example that works for both monthly and weekly hierarchies: WITHMEMBER [Measures].[Reseller Sales YTD] ASSum( YTD( [Date].[Calendar Weeks].CurrentMember ) *YTD( [Date].[Calendar].CurrentMember ),[Measures].[Reseller Sales Amount] )SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales YTD] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] If we replace [Date].[Calendar Weeks].[Calendar Week].MEMBERS with [Date].[Calendar].[Month].MEMBERS, the calculation will continue to work. Without the cross join part, that wouldn't be the case. Try it in order to see for yourself! 
Just be aware that if you slice by additional attribute hierarchies, the calculation might become wrong. In short, there are many obstacles to getting the time-based calculation right. It partially depends on the design of the time dimension (which attributes exist, which are hidden, how the relations are defined, and so on), and partially on the complexity of the calculations provided and their ability to handle various scenarios. A better place to define time-based calculation is the MDX script. There, we can define scoped assignments, but that's a separate topic which will be covered later in the recipe, Using utility dimension to implement time-based calculations. In the meantime, here are some articles related to that topic: http://tinyurl.com/MoshaDateCalcs http://tinyurl.com/DateToolDim Inception-To-Date calculation A similar calculation is the Inception-To-Date calculation in which we're calculating the sum of all dates up to the current member, that is, we do not perform a reset at the beginning of every year. In that case, the YTD() part of the expression should be replaced with this: Null : [Date].[Calendar Weeks].CurrentMember Using the argument in the YTD() function The argument of the YTD() function is optional. When not specified, the first dimension of the Time type in the measure group is used. More precisely, the current member of the first user hierarchy with a level of type Years. This is quite convenient in the case of a simple Date dimension; a dimension with a single user hierarchy. In the case of multiple hierarchies or a role-playing dimension, the YTD() function might not work, if we forget to specify the hierarchy for which we expect it to work. This can be easily verified. Omit the [Date].[Calendar Weeks].CurrentMember part in the initial query and see that both columns return the same values. The YTD() function is not working anymore. Therefore, it is best to always use the argument in the YTD() function. Common problems and how to avoid them In our example we used the [Date].[Calendar Weeks] user hierarchy. That hierarchy has the level Calendar Year created from the same attribute. The type of attribute is Years, which can be verified in the Properties pane of SSDT: However, the Date dimension in the Adventure Works cube has fiscal attributes and user hierarchies built from them as well. The fiscal hierarchy equivalent to [Date].[Calendar Weeks] hierarchy is the [Date].[Fiscal Weeks] hierarchy. There, the top level is named Fiscal Year, created from the same attribute. This time, the type of the attribute is FiscalYear, not Year. If we exchange those two hierarchies in our example query, the YTD() function will not work on the new hierarchy. It will return an error: The name of the solution is the PeriodsToDate() function. YTD() is in fact a short version of the PeriodsToDate() function, which works only if the Year type level is specified in a user hierarchy. When it is not so (that is, some BI developers tend to forget to set it up correctly or in the case that the level is defined as, let's say, FiscalYear like in this test), we can use the PeriodsToDate() function as follows: MEMBER [Measures].[Reseller Sales YTD] ASSum( PeriodsToDate( [Date].[Fiscal Weeks].[Fiscal Year],[Date].[Fiscal Weeks].CurrentMember ),[Measures].[Reseller Sales Amount] ) PeriodsToDate() might therefore be used as a safer variant of the YTD() function. 
YTD() and future dates It's worth noting that the value returned by a SUM-YTD combination is never empty once a value is encountered in a particular year. Only the years with no values at all will remain completely blank for all their descendants. In our example with the [Calendar Weeks] hierarchy, scrolling down to the Week 23 CY 2008, you will see that this is the last week that has reseller sales. However, the Year-To-Date value is not empty for the rest of the weeks for year 2008, as shown in the following screenshot: This can cause problems for the descendants of the member that represents the current year (and future years as well). The NON EMPTY keyword will not be able to remove empty rows, meaning we'll get YTD values in the future. We might be tempted to use the NON_EMPTY_BEHAVIOR operator to solve this problem but it wouldn't help. Moreover, it would be completely wrong to use it, because it is only a hint to the engine which may or may not be used. It is not a mechanism for removing empty values. In short, we need to set some rows to null, those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the recipe, Finding the last date with data. Calculating the YoY (Year-over-Year) growth (parallel periods) This recipe explains how to calculate the value in a parallel period, the value for the same period in a previous year, previous quarter, or some other level in the date dimension. We're going to cover the most common scenario – calculating the value for the same period in the previous year, because most businesses have yearly cycles. A ParallelPeriod() is a function that is closely related to time series. It returns a member from a prior period in the same relative position as a specified member. For example, if we specify June 2008 as the member, Year as the level, and 1 as the lag, the ParallelPeriod() function will return June 2007. Once we have the measure from the prior parallel period, we can calculate how much the measure in the current period has increased or decreased with respect to the parallel period's value. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button, and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] Once executed, the previous query returns the value of Reseller Sales Amount for all fiscal months. How to do it... Follow these steps to create a calculated measure with YoY calculation: Add the WITH block of the query. Create a new calculated measure there and name it Reseller Sales PP. The new measure should return the value of the measure Reseller Sales Amount measure using the ParallelPeriod() function. In other words, the definition of the new measure should be as follows: MEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ) Specify the format string property of the new measure to match the format of the original measure. In this case that should be the currency format. Create the second calculated measure and name it Reseller Sales YoY %. The definition of that measure should be the ratio of the current member's value against the parallel period member's value. 
Be sure to handle potential division by zero errors (see the recipe Handling division by zero errors). Include both calculated measures on axis 0 and execute the query, which should look like: WITHMEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ), FORMAT_STRING = 'Currency'MEMBER [Measures].[Reseller Sales YoY %] Asiif( [Measures].[Reseller Sales PP] = 0, null,( [Measures].[Reseller Sales Amount] /[Measures].[Reseller Sales PP] ) ), FORMAT_STRING = 'Percent'SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales PP],[Measures].[Reseller Sales YoY %] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] The result will include two additional columns, one with the PP values and the other with the YoY change. Notice how the values in the second column repeat over time and that YoY % ratio shows the growth over time: How it works... The ParallelPeriod() function takes three arguments, a level expression, an index, and a member expression, and all three arguments are optional. The first argument indicates the level on which to look for that member's ancestor, typically the year level like in this example. The second argument indicates how many members to go back on the ancestor's level, typically one, as in this example. The last argument indicates the member for which the function is to be applied. Given the right combination of arguments, the function returns a member that is in the same relative position as a specified member, under a new ancestor. The value for the parallel period's member is obtained using a tuple which is formed with a measure and the new member. In our example, this represents the definition of the PP measure. The growth is calculated as the ratio of the current member's value over the parallel period member's value, in other words, as a ratio of two measures. In our example, that was YoY % measure. In our example we've also taken care of a small detail, setting the FORMAT_STRING to Percent. There's more... The ParallelPeriod() function is very closely related to time series, and typically used on date dimensions. However, it can be used on any type of dimension. For example, this query is perfectly valid: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ ParallelPeriod( [Geography].[Geography].[Country],2,[Geography].[Geography].[State-Province].&[CA]&[US] ) } ON 1FROM[Adventure Works] The query returns Hamburg on rows, which is the third state-province in the alphabetical list of states-provinces under Germany. Germany is two countries back from the USA, whose member California, used in this query, is the third state-province underneath that country in the Geography.Geography user hierarchy. We can verify this by browsing the Geography user hierarchy in the Geography dimension in SQL Server Management Studio, as shown in the following screenshot. The UK one member back from the USA, has only one state-province: England. If we change the second argument to 1 instead, we'll get nothing on rows because there's no third state-province under the UK. Feel free to try it: All arguments of the ParallelPeriod() function are optional. When not specified, the first dimension of type Time in the measure group is used, more precisely, the previous member of the current member's parent. This can lead to unexpected results as discussed in the previous recipe. 
Therefore, it is recommended that you use all the arguments of the ParallelPeriod() function. ParallelPeriod is not a time-aware function The ParallelPeriod() function simply looks for the member from the prior period based on its relative position to its ancestor. For example, if your hierarchy is missing the first six months in the year 2005, for member January 2006, the function will find July 2005 as its parallel period (lagging by one year) because July is indeed the first month in the year 2005. This is exactly the case in Adventure Works DW SSAS prior to 2012. You can test the following scenario in Adventure Works DW SSAS 2008 R2. In our example we used the [Date].[Fiscal] user hierarchy. That hierarchy has all 12 months in every year which is not the case with the [Date].[Calendar] user hierarchy where there's only six months in the first year. This can lead to strange results. For example, if you search-replace the word "Fiscal" with the word "Calendar" in the query we used in this recipe, you'll get this as the result: Notice how the values are incorrect for the year 2006. That's because the ParallelPeriod() function is not a time-aware function, it merely does what it's designed for taking the member that is in the same relative position. Gaps in your time dimension are another potential problem. Therefore, always make the complete date dimensions, with all 12 months in every year and all dates in them, not just working days or similar shortcuts. Remember, Analysis Services isn't doing the date math. It's just navigating using the member's relative position. Therefore, make sure you have laid a good foundation for that. However, that's not always possible. There's an offset of six months between fiscal and calendar years, meaning if you want both of them as date hierarchies, you have a problem; one of them will not have all of the months in the first year. The solution is to test the current member in the calculation and to provide a special logic for the first year, fiscal or calendar; the one that doesn't have all months in it. This is most efficiently done with a scope statement in the MDX script. Another problem in calculating the YoY value is leap years. Calculating moving averages The moving average, also known as the rolling average, is a statistical technique often used in events with unpredictable short-term fluctuations in order to smooth their curve and to visualize the pattern of behavior. The key to get the moving average is to know how to construct a set of members up to and including a specified member, and to get the average value over the number of members in the set. In this recipe, we're going to look at two different ways to calculate moving averages in MDX. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date hierarchy of the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Internet Order Count] } ON 0,{ [Date].[Date].[Date].MEMBERS} ON 1FROM[Adventure Works] Execute it. The result shows the count of Internet orders for each date in the Date.Date attribute hierarchy. Our task is to calculate the simple moving average (SMA) for dates in the year 2008 based on the count of orders in the previous 30 days. How to do it... 
We are going to use the LastPeriods() function with a 30 day moving window, and a member expression, [Date].[Date].CurrentMember, as two parameters, and also the AVG() function, to calculate the moving average of Internet order count in the last 30 days. Follow these steps to calculate moving averages: Add the WHERE part of the query and put the year 2006 inside using any available hierarchy. Add the WITH part and define a new calculated measure. Name it SMA 30. Define that measure using the AVG() and LastPeriods() functions. Test to see if you get a managed query similar to this. If so, execute it: WITHMEMBER [Measures].[SMA 30] ASAvg( LastPeriods( 30, [Date].[Date].CurrentMember ),[Measures].[Internet Order Count] )SELECT{ [Measures].[Internet Order Count],[Measures].[SMA 30] } ON 0,{ [Date].[Date].[Date].MEMBERS } ON 1FROM[Adventure Works]WHERE( [Date].[Calendar Year].&[2008] ) The second column in the result set will represent the simple moving average based on the last 30 days. Our final result will look like the following screenshot: How it works... The moving average is a calculation that uses the moving window of N items for which it calculates the statistical mean, that is, the average value. The window starts with the first item and then progressively shifts to the next one until the whole set of items is passed. The function that acts as the moving window is the LastPeriods() function. It returns N items, in this example, 30 dates. That set is then used to calculate the average orders using the AVG() function. Note that the number of members returned by the LastPeriods() function is equal to the span, 30, starting with the member that lags 30 - 1 from the specified member expression, and ending with the specified member. There's more... Another way of specifying what the LastPeriods() function does is to use a range of members with a range-based shortcut. The last member of the range is usually the current member of the hierarchy on an axis. The first member is the N-1th member moving backwards on the same level in that hierarchy, which can be constructed using the Lag(N-1) function. The following expression employing the Lag() function and a range-based shortcut is equivalent to the LastPeriods() in the preceding example: [Date].[Date].CurrentMember.Lag(29) : [Date].[Date].CurrentMember Note that the members returned from the range-based shortcut are inclusive of both the starting member and the ending member. We can easily modify the moving window scope to fit different requirements. For example, in case we need to calculate a 30-day moving average up to the previous member, we can use this syntax: [Date].[Date].CurrentMember.Lag(30) : [Date].[Date].PrevMember The LastPeriods() function is not on the list of optimized functions on this web page: http://tinyurl.com/Improved2008R2. However, tests show no difference in duration with respect to its range alternative. Still, if you come across a situation where the LastPeriods() function performs slowly, try its range alternative. Finally, in case we want to parameterize the expression (for example, to be used in SQL Server Reporting Services), these would be generic forms of the previous expressions: [Date].[Date].CurrentMember.Lag( @span - @offset ) :[Date].[Date].CurrentMember.Lag( @offset ) And LastPeriods( @span, [Date].[Date].CurrentMember.Lag( @offset ) ) The @span parameter is a positive value which determines the size of the window. 
The @offset parameter determines how much the right side of the window is moved from the current member's position. This shift can be either a positive or negative value. The value of zero means there is no shift at all, the most common scenario. Other ways to calculate the moving averages The simple moving average is just one of many variants of calculating the moving averages. A good overview of a possible variant can be found in Wikipedia: http://tinyurl.com/WikiMovingAvg MDX examples of other variants of moving averages can be found in Mosha Pasumansky's blog article: http://tinyurl.com/MoshaMovingAvg Moving averages and the future dates It's worth noting that the value returned by the moving average calculation is not empty for dates in future because the window is looking backwards, so that there will always be values for future dates. This can be easily verified by scrolling down in our example using the LastPeriods() function, as shown in the following screenshot: In this case the NON EMPTY keyword will not be able to remove empty rows. We might be tempted to use NON_EMPTY_BEHAVIOR to solve this problem but it wouldn't help. Moreover, it would be completely wrong. We don't want to set all the empty rows to null, but only those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the following recipes. Summary This article presents various time-related functions in MDX language that are designed to work with a special type of dimension called the Time and its typed attributes. Resources for Article: Further resources on this subject: What are SSAS 2012 dimensions and cube? [Article] Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Terms and Concepts Related to MDX [Article]
Read more
  • 0
  • 0
  • 2656
Visually different images

article-image-oracle-apex-42-reporting
Packt
03 Sep 2013
20 min read
Save for later

Oracle APEX 4.2 reporting

Packt
03 Sep 2013
20 min read
(For more resources related to this topic, see here.) The objective of the first chapter is to quickly introduce you to the technology and then dive deep into the understanding the fabric of the tool. The chapter also helps you set the environment, which will be used throughout the book. Chapter 1, Know Your Horse Before You Ride It, starts with discussing the various features of APEX. This is to give heads up to the readers about the features offered by the tool and to inform them about some of the strengths of the tool. In order to understand the technology better, we discuss the various web server combinations possible with APEX, namely the Internal mod_plsql, External mod_plsql, and Listener configuration. While talking about the Internal mod_plsql configuration, we see the steps to enable the XMLDB HTTP server. In the Internal mod_plsql configuration, Oracle uses a DAD defined in EPG to talk to the database and the web server. So, we try to create a miniature APEX of our own, by creating our DAD using it to talk to the database and the web server. We then move on to learn about the External mod_plsql configuration. We discuss the architecture and the roles of the configuration files, such as dads.conf and httpd.conf. We also have a look at a typical dads.conf file and draw correlations between the configurations in the Internal and External mod_plsql configuration. We then move on to talk about the wwv_flow_epg_include_mod_local procedure that can help us use the DAD of APEX to call our own stored PL/SQL procedures. We then move on to talk about APEX Listener which is a JEE alternative to mod_plsql and is Oracle’s direction for the future. Once we are through with understanding the possible configurations, we see the steps to set up our environment. We use the APEX Listener configuration and see the steps to install the APEX engine, create a Weblogic domain, set the listener in the domain, and create an APEX workspace. With our environment in place, we straight away get into understanding the anatomy of APEX by analyzing various parts of its URL. This discussion includes a natter on the session management, request handling, debugging, error handling, use of TKPROF for tracing an APEX page execution, cache management and navigation, and value passing in APEX. We also try to understand the design behind the zero session ID in this section. Our discussions till now would have given you a brief idea about the technology, so we try to dig in a little deeper and understand the mechanism used by APEX to send web requests to its PL/SQL engine in the database by decoding the APEX page submission. We see the use of the wwv_flow.accept procedure and understand the role of page submission. We try to draw an analogy of an APEX form with a simple HTML to get a thorough understanding about the concept. The next logical thing after page submission is to see the SQL and PL/SQL queries and blocks reaching the database. We turn off the database auditing and see the OWA web toolkit requests flowing to the database as soon as we open an APEX page. We then broaden our vision by quickly knowing about some of the lesser known alternatives of mod_plsql. We end the chapter with a note of caution and try to understand the most valid criticisms of the technology, by understanding the SQL injection and Cross-site Scripting (XSS). After going through the architecture we straight away spring into action and begin the process of learning how to build the reports in APEX. 
The objective of this chapter is to help you understand and implement the most common reporting requirements along with introducing some interesting ways to frame the analytical queries in Oracle. The chapter also hugely focuses on the methods to implement different kinds of formatting in the APEX classic reports. We start Chapter 2, Reports, by creating the objects that will be used throughout the book and installing the reference application that contains the supporting code for the topics discussed in the second chapter. We start the chapter by setting up an authentication mechanism. We discuss external table authentication in this chapter. We then move on to see the mechanism of capturing the environment variables in APEX. These variables can help us set some logic related to a user’s session and environment. The variables also help us capture some properties of the underlying database session of an APEX session. We capture the variables using the USERENV namespace, DBMS_SESSION package, and owa_util package. After having a good idea of the ways and means to capture the environment variables, we build our understanding of developing the search functionality in a classic APEX report. This is mostly a talk about the APEX classic report features, we also use this opportunity to see the process to enable sorting in the report columns, and to create a link that helps us download the report in the CSV format. We then discuss various ways to implement the group reports in APEX. The discussion shows a way to implement this purely, by using the APEX’s feature, and then talks about getting the similar results by using the Oracle database feature. We talk about the APEX’s internal grouping feature and the Oracle grouping sets. The section also shows the first use of JavaScript to manipulate the report output. It also talks about a method to use the SQL query to create the necessary HTML, which can be used to display a data column in a classic report. We take the formatting discussion further, by talking about a few advanced ways of highlighting the report data using the APEX’s classic report features, and by editing the APEX templates. After trying our hand at formatting, we try to understand the mechanism to implement matrix reports in APEX. We use matrix reports to understand the use of the database features, such as the with clause, the pivot operator, and a number of string aggregation techniques in Oracle. The discussion on the string aggregation techniques includes the talk on the LISTAGG function, the wm_concat function, and the use of the hierarchical queries for this purpose. We also see the first use of the APEX items as the substitution variables in this book. We will see more use of the APEX items as the substitution variables to solve the vexed problems in other parts of the book as well. We do justice with the frontend as well, by talking about the use of jQuery, CSS, and APEX Dynamic Actions for making the important parts of data stand out. Then, we see the implementation of the handlers, such as this.affectedElements and the use of the jQuery functions, such as this.css() in this section. We also see the advanced formatting methods using the APEX templates. We then end this part of the discussion, by creating a matrix report using the APEX Dynamic Query report region. 
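To give a concrete flavour of the string aggregation and matrix techniques named above, here is a hedged sketch; it uses the standard HR sample schema (employees, department_id, job_id, salary) rather than the book's oehr_ tables, so treat the object names as assumptions:

-- String aggregation with LISTAGG: one row per department, employee names as a list
SELECT department_id,
       LISTAGG(last_name, ', ') WITHIN GROUP (ORDER BY last_name) AS employees
FROM employees
GROUP BY department_id;

-- A simple matrix (cross-tab) report using a WITH clause and the PIVOT operator
WITH base AS (
  SELECT department_id, job_id, salary FROM employees
)
SELECT *
FROM base
PIVOT (SUM(salary) FOR job_id IN ('IT_PROG' AS it_prog, 'SA_REP' AS sa_rep, 'ST_CLERK' AS st_clerk));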
The hierarchical reports are always intriguing, because of the enormous ability to present the relations among different rows of data, and because of the use of the hierarchical queries in a number of unrelated places to answer the business queries. We reserve a discussion on the Oracle’s hierarchical queries for a later part of the section and start it, by understanding the implementation of the hierarchical reports by linking the values in APEX using drilldowns. We see the use of the APEX items as the substitution variable to create the dynamic messages. Since we have devised a mechanism to drill down, we should also build a ladder for the user to climb up the hierarchical chain. Now, the hierarchical chain for every person will be different depending on his position in the organization, so we build a mechanism to build the dynamic bread crumbs using the PL/SQL region in APEX. We then talk about two different methods of implementing the hierarchical queries in APEX. We talk about the connect, by clause first, and then continue our discussion by learning the use of the recursive with clause for the hierarchical reporting. We end our discussion on the hierarchical reporting, by talking about creating the APEX’s Tree region that displays the hierarchical data in the form of a tree. The reports are often associated with the supporting files that give more information about the business query. A typical user might want to upload a bunch of files while executing his piece of the task and the end user of his action might want to check out the files uploaded by him/her. To understand the implementation of this requirement, we check out the various ways to implement uploading and downloading the files in APEX. We start our discussion, by devising the process of uploading the files for the employees listed in the oehr_employees table. The solution of uploading the files involves the implementation of the dynamic action to capture the ID of the employee on whose row the user has clicked. It shows the use of the APEX items as the substitution variables for creating the dynamic labels and the use of JavaScript to feed one APEX item based on the another. We extensively talk about the implementation of jQuery in Dynamic Actions in this section as well. Finally, we check out the use of the APEX’s file browse item along with the WWV_FLOW_FILES table to capture the file uploaded by the user. Discussion on the methods to upload the files is immediately followed by talking about the ways to download these files in APEX. We nest the use of the functions, such as HTF.ANCHOR and APEX_UTIL.GET_BLOB_FILE_SRC for one of the ways to download a file, and also talk about the use of dbms_lob.getlength along with the APEX format mask for downloading the files. We then engineer our own stored procedure that can download a blob stored in the database as a file. We end this discussion, by having a look at the APEX’s p process, which can also be used for downloading. AJAX is the mantra of the new age and we try our hand at it, by implementing the soft deletion in APEX. We see the mechanism of refreshing just the report and not reloading the page as soon as the user clicks to delete a row. We use JavaScript along with the APEX page process and the APEX templates to achieve this objective. Slicing and dicing along with auditing are two of the most common requirements in the reporting world. 
We see the implementation of both of these using the traditional method of JavaScript with page processes and the newer method of using Dynamic Actions. We extend our use case a little further and learn about a two-way interaction between a JavaScript function and a page process. We learn to pass values back and forth between the two. While most business intelligence and reporting solutions are a one-way road focused on data presentation, Oracle APEX can go a step further and give the user an interface for data manipulation as well. To understand this strength of APEX, we have a look at the process of creating tabular forms in APEX. We extend our understanding of tabular forms to see a magical use of jQuery to convert certain sections of a report from display-only to editable textboxes. We move our focus from implementing interesting and tricky frontend requirements to framing the queries to display complex data types. We use the string aggregation methods to display data in a column containing a varray. The time dimension is one of the most widely used dimensions in reporting circles, and comparing current performance with past records is a favorite requirement of most businesses. With this in mind, we shift our focus to understanding and implementing time series reports in APEX. We start our discussion by understanding the method to implement a report that shows the contribution of each business line in every quarter. We implement this by using partitioning dimensions on the fly. We also use analytical functions, such as ratio_to_report, lead, and lag, in the process of creating the time series reports. We use our understanding of the time dimension to build a report that helps a user compare one time period with another. The report gives the user the freedom to select the time segments that he wishes to compare. We then work around a limitation of this report by using the query partition clause for data densification. We bring our discussion on reports based on the time dimension to a close by presenting a report based on the model clause. The report serves as an example of the enormous coding possibilities opened up by the model clause in Oracle. Chapter 3, In the APEX Mansion – Interactive Reports, is all about Interactive Reports and dynamic reporting. While the second chapter was more about data presentation using complex queries and presentation methods, this chapter takes the presentation part a step further by creating more visually appealing Interactive Reports. We start the discussion of this chapter by talking about the ins and outs of Interactive Reports. We dissect this feature of APEX to learn about every possible way of using Interactive Reports to make more sense of data. The chapter has a reference application of its own, which shows the code in action. We start our discussion by exploring the various features of the Actions menu in Interactive Reports. 
The discussion is on search functionality, Select Columns feature, filtering, linking and filtering Interactive Reports using URLs, customizing Rows per page feature of an IR, using Control Break, creating computations in IR, creating charts in IR, using the Flashback feature and a method to see the back end flashback query, configuring the e-mail functionality for downloading a report, and for subscription of reports and methods to understand the download of reports in HTML, CSV, and PDF formats. Once we are through with understanding the Actions menu, we move on to understand the various configuration options in an IR. We talk about the Link section, the Icon View section, the Detail section, the Column Group section, and the Advanced section of the Report Attributes page of an IR. While discussing these sections, we understand the process of setting different default views of an IR for different users. Once we are through with our dissection of an IR, we put our knowledge to action, by inserting our own item in the Actions menu of an IR using Dynamic Actions and jQuery. We continue our quest for finding the newer formatting methods, by using different combinations of SQL, CSS, APEX templates, and jQuery to achieve unfathomable results. The objectives attained in this section include formatting a column of an IR based on another column, using CSS in the page header to format the APEX data, changing the font color of the alternate rows in APEX, using a user-defined CSS class in APEX, conditionally highlighting a column in IR using CSS and jQuery, and formatting an IR using the region query. After going through a number of examples on the use of CSS and jQuery in APEX, we lay down a process to use any kind of changes in an IR. We present this process by an example that changes one of the icons used in an IR to a different icon. APEX also has a number of views for IR, which can be used for intelligent programming. We talk about an example that uses the apex_application_page_ir_rpt view that show different IR reports on the user request. After a series of discussion on Interactive Reports, we move on to find a solution to an incurable problem of IR. We see a method to put multiple IR on the same page and link them as the master-child reports. We had seen an authentication mechanism (external table authentication) and data level authorization in Chapter 2, Reports. We use this chapter to see the object level authorization in APEX. We see the method to give different kinds of rights on the data of a column in a report to different user groups. After solving a number of problems and learning a number of things, we create a visual treat for ourselves. We create an Interactive report dashboard using Dynamic Actions. This dashboard presents different views of an IR as gadgets on a separate APEX page. We conclude this chapter, by looking at advanced ways of creating the dynamic reports in APEX. We look at the use of the table function in both native and interface approach, and we also look at the method to use the APEX collections for creating the dynamic IRs in APEX. Chapter 4, The Fairy Tale Begins – Advanced Reporting, is all about advanced reporting and pretty graphs. Since we are talking about advanced reporting, we see the process of setting the LDAP authentication in APEX. We also see the use of JXplorer to help us get the necessary DN for setting up the LDAP authentication. We also see the means to authenticate an LDAP user using PL/SQL in this section. 
This chapter also has a reference application of its own that shows the code in action. We start the reporting building in this chapter, by creating the Sparkline reports. This report uses jQuery for producing the charts. We then move on to use another jQuery library to create a report with slider. This report lets the user set the value of the salary of any employee using a slider. We then get into the world of the HTML charts. We start our talk, by looking at the various features of creating the HTML chart regions in APEX. We understand a method to implement the Top N and Bottom N chart reports in the HTML charts. We understand the APEX’s method to implement the HTML charts and use it to create an HTML chart on our own. We extend this technique of generating HTML from region source a little further, by using XMLTYPE to create the necessary HTML for displaying a report. We take our spaceship into a different galaxy of the charts world and see the use of Google Visualizations for creating the charts in APEX. We then switch back to AnyChart, which is a flash charting solution, and has been tightly integrated with APEX. It works on an XML and we talk about customizing this XML to produce different results. We put our knowledge to action, by creating a logarithmic chart and changing the style of a few display labels in APEX. We continue our discussion on AnyChart, and use the example of Doughnut Chart to understand advanced ways of using AnyChart in APEX. We use our knowledge to create Scatter chart, 3D stacked chart, Gauge chart, Gantt chart, Candlestick chart, Flash image maps, and SQL calendars. We move out of the pretty heaven of AnyChart only to get into another beautiful space of understanding the methods of displaying the reports with the images in APEX. We implement the reports with the images using the APEX’s format masks and also using HTF.IMG with the APEX_UTIL.GET_BLOB_FILE_SRC function. We then divert our attention to advanced jQuery uses such as, the creation of the Dialog box and the Context menu in APEX. We create a master-detail report using dialog boxes, where the child report is shown in the Dialog box. We close our discussion in this chapter, by talking about creating wizards in APEX and a method to show different kinds of customized error messages for problems appearing at different points of a page process. Chapter 5, Flight to Space Station: Advanced APEX, is all about advanced APEX. The topics discussed in this chapter fall into a niche category of reporting the implementation. This chapter has two reference applications of its own that shows the code in action. We start this chapter, by creating both the client-side and server-side image maps APEX. These maps are often used where regular shapes are involved. We then see the process of creating the PL/SQL Server Pages (PSPs). PSPs are similar to JSP and are used for putting the PL/SQL and HTML code in a single file. Loadpsp then converts this file into a stored procedure with the necessary calls to owa web toolkit functions. The next utility we check out is loadjava. This utility helps us load a Java class as a database object. This utility will be helpful, if some of the Java classes are required for code processing. We had seen the use of AnyChart in the previous chapter, and this chapter introduces you to FusionChart. We create a Funnel Chart using FusionChart in this chapter. We also use our knowledge of the PL/SQL region in APEX to create a Tag Cloud. We then stroll into the world of the APEX plugins. 
We understand the interface functions, which have to be created for plugins. We discuss the concepts and then use them to develop an Item type plugin and a Dynamic Action plugin. We understand the process of defining a Custom Attribute in a plugin, and then see the process of using it in APEX. We move on to learn about Websheets and their various features. We acquaint ourselves with the interface of the Websheet applications and understand the concept of sharing Websheets. We also create a few reports in a Websheet application and see the process of importing and using an image in Websheets. We then spend some time to learn about data grids in Websheets and the method to create them. We have a look at the Administration and View dropdowns in a Websheet application. We get into the administration mode and understand the process of configuring the sending of mails to Gmail server from APEX. We extend our administration skills further, by understanding the ways to download an APEX application using utilities, such as oracle.apex.APEXExport. Reporting and OLAP go hand in hand, so we see the method of using the OLAP cubes in APEX. We see the process of modeling a cube and understand the mechanism to use its powerful features. We then have a talk about Oracle Advanced Queues that can enable us to do reliable communication among different systems with different workloads in the enterprise, and gives improved performance. We spend a brief time to understand some of the other features of APEX, which might not be directly related to reporting, but are good to know. Some of these features include locking and unlocking of pages in APEX, the Database Object Dependencies report, Shortcuts, the Dataloading wizard, and the APEX views. We bring this exclusive chapter to an end, by discussing about the various packages that enable us to schedule the background jobs in APEX, and by discussing various other APEX and Database API, which can help us in the development process.
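As a hedged illustration of the job-scheduling packages referred to at the end of this summary — the job name and the my_refresh_report_data procedure are hypothetical, not objects from the book — a DBMS_SCHEDULER call typically takes this shape:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'NIGHTLY_REPORT_REFRESH',             -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN my_refresh_report_data; END;', -- hypothetical procedure
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',                -- run every day at 2 a.m.
    enabled         => TRUE);
END;
/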

OAuth Authentication

Packt
02 Sep 2013
6 min read
(For more resources related to this topic, see here.) Understanding OAuth OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider, and is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are resources that want to protect resources, such as a SAML Service Provider. If you have ever been to a site that has asked you to log in using your Twitter or LinkedIn credentials then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) is never passed to the OAuth Client, just a range of tokens that the Client requested from the Provider and which are authorized by the user. OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. Preparing Facebook as an OAuth Provider Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t let a redirect to a URL (such as our OpenAM installation) without being aware of the URL. The steps for preparing Facebook as an OAuth provider are as follows: Under the settings for the App in the section Website with Facebook Login we need to add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp as shown in the following screenshot: Click on the Save Changes button on Facebook. My OpenAM installation for this chapter was directly available on the Internet just in case Facebook checked for a valid URL destination. Configuring an OAuth authentication module OpenAM has the concept of authentication modules, which support different ways of authentication, such as OAuth, or against its Data Store, or LDAP or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client. Log in to OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm). Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button. Enter a name for the New Module Instance and select OAuth 2.0 as the Type and click on the OK button. I used the name Facebook. You will then see a screen as shown: For Client Id, use the App ID value provided from Facebook. For the Client Secret use the App Secret value provided from Facebook as shown in the preceding screenshot. Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their default values. Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream as shown in the preceding screenshot. Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value. The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. This goes the same for the Attribute Mapper Configuration. 
We’ll leave all these sections as their defaults as shown in the preceding screenshot. OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled as shown in the preceding screenshot. When a user authenticates against an OAuth Provider, they are likely to not already have an account with OpenAM. If they do not have a valid OpenAM account then they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled as shown in the preceding screenshot. Forcing authentication against particular authentication modules In the writing of this book I disabled the Create account if it does not exist option while I was testing. Then when I tried to log into OpenAM I was redirected to Facebook, which then passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials I could not log in. For your own testing, it would be recommended to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than OAuth authentication module, and log in successfully. For our newly created accounts we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled. The flip side to Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL which we could possibly call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM is to either not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user. If we had set earlier that we wanted to enforce password and activation token policies, then we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank. Click on the Save button. Summary This article served as a quick primer on what OAuth is and how to achieve it with OpenAM. It covered the concept of using Facebook as an OAuth provider and configuring an OAuth module. It focused on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. This would really help when we might want to allow authentication against Facebook or Google. Resources for Article: Further resources on this subject: Getting Started with OpenSSO [Article] OpenAM: Oracle DSEE and Multiple Data Stores [Article] OpenAM Identity Stores: Types, Supported Types, Caching and Notification [Article]

Memory and cache

Packt
02 Sep 2013
10 min read
(For more resources related to this topic, see here.) You can find this instruction in the OGlobalConfiguration.java file in the autoConfig() method. Furthermore, you can enable/disable level 1 cache, level 2 cache, or both. You can also set the number of records that will be stored in each level as follows: cache.level1.size: This sets the number of records to be stored in the level 1 caches (default -1, no limit) cache.level2.size: This sets the number of records to be stored in the level 2 cache (default -1, no limit) cache.level1.enabled: This is a boolean value, it enables/disables the level 1 cache (default, true) cache.level2.enabled: This is a boolean value, it enables/disables the level 2 cache (default, true) Mapping files OrientDB uses NIO to map data files in memory. However, you can change the way this mapping is performed. This is achieved by modifying the file access strategy. Mode 0: It uses the memory mapping for all the operations. Mode 1 (default): It uses the memory mapping, but new reads are performed only if there is enough memory, otherwise the regular Java NIO file read/write is used. Mode 2: It uses the memory mapping only if the data has been previously loaded. Mode 3: It uses memory mapping until there is space available, then use regular JAVA NIO file read/write. Mode 4: It disables all the memory mapping techniques. To set the strategy mode, you must use the file.mmap.strategy configuration property. Connections When you have to connect with a remote database you have some options to improve your application performance. You can use the connection pools, and define the timeout value to acquire a new connection. The pool has two attributes: minPool: It is the minimum number of opened connections maxPool: It is the maximum number of opened connections When the first connection is requested to the pool, a number of connections corresponding to the minPool attribute are opened against the server. If a thread requires a new connection, the requests are satisfied by using a connection from the pool. If all the connections are busy, a new one is created until the value of maxPool is reached. Then the thread will wait, so that a connection is freed. Minimum and maximum connections are defined by using the client.channel.minPool (default value 1) and client.channel.maxPool (default value 5) properties. However, you can override these values in the client code by using the setProperty() method of the connection class. For example: database = new ODatabaseDocumentTx("remote:localhost/demo");database.setProperty("minPool", 10);database.setProperty("maxPool", 50);database.open("admin", "admin"); You can also change the connection timeout values. In fact, you may experience some problem, if there are network latencies or if some server-side operations require more time to be performed. Generally these kinds of problems are shown in the logfile with warnings: WARNING: Connection re-acquired transparently after XXXms and Yretries: no errors will be thrown at application level You can try to change the network.lockTimeout and the network.socketTimeout values. The first one indicates the timeout in milliseconds to acquire a lock against a channel (default is 15000), the second one indicates the TCP/IP socket timeout in milliseconds (default is 10000). There are some other properties you can try to modify to resolve network issues. 
These are as follows: network.socketBufferSize: This is the TCP/IP socket buffer size in bytes (default 32 KB) network.retry: This indicates the number of retries a client should do to establish a connection against a server (default is 5) network.retryDelay: This indicates the number of milliseconds a client will wait before retrying to establish a new connection (default is 500) Transactions If your primary objective is the performance, avoid using transactions. However, if it is very important for you to have transactions to group operations, you can increase overall performance by disabling the transaction log. To do so just set the tx.useLog property to false. If you disable the transaction log, OrientDB cannot rollback operations in case JVM crashes. Other transaction parameters are as follows: tx.log.synch: It is a Boolean value. If set, OrientDB executes a synch against the filesystem for each transaction log entry. This slows down the transactions, but provides reliability on non- reliable devices. Default value is false. tx.commit.synch: It is a Boolean value. If set, it performs a storage synch after a commit. Default value is true. Massive insertions If you want to do a massive insertion, there are some tricks to speed up the operation. First of all, do it via Java API. This is the fastest way to communicate with OrientDB. Second, instruct the server about your intention: db.declareIntent( new OIntentMassiveInsert() );//your code here....db.declareIntent( null ); Here db is an opened database connection. By declaring the OIntentMassiveInsert() intent, you are instructing OrientDB to reconfigure itself (that is, it applies a set of preconfigured configuration values) because a massive insert operation will begin. During the massive insert, avoid creating a new ODocument instance for each record to insert. On the contrary, just create an instance the first time, and then clean it using the reset() method: ODocument doc = new ODocument();for(int i=0; i< 9999999; i++){doc.reset(); //here you will reset the ODocument instancedoc.setClassName("Author");doc.field("id", i);doc.field("name", "John");doc.save();} This trick works only in a non-transactional context. Finally, avoid transactions if you can. If you are using a graph database and you have to perform a massive insertion of vertices, you can still reset just one vertex: ODocument doc = db.createVertex();...doc.reset();... Moreover, since a graph database caches the most used elements, you may disable this: db.setRetainObjects(false); Datafile fragmentation Each time a record is updated or deleted, a hole is created in the datafiles structure. OrientDB tracks these holes and tries to reuse them. However, many updates and deletes can cause a fragmentation of datafiles, just like in a filesystem. To limit this problem, it is suggested to set the oversize attribute of the classes you create. The oversize attribute is used to allocate more space for records once they are created, so as to avoid defragmentation upon updates. The oversize attribute is a multiplying factor where 1.0 or 0 means no oversize. The default values are 0 for document, and 2 for vertices. OrientDB has a defrag algorithm that starts automatically when certain conditions are verified. You can set some of these conditions by using the following configuration parameter: file.defrag.holeMaxDistance: It defines the maximum distance in bytes between two holes that triggers the defrag procedure. The default is 32 KB, -1 means dynamic size. 
The dynamic size is computed in the ODataLocal class in the getCloserHole() method, as Math.max(32768 * (int) (size / 10000000), 32768), where size is the current size of the file. The profiler OrientDB has an embedded profiler that you can use to analyze the behavior of the server. The configuration parameters that act on the profiler are as follows: profiler.enabled: This is a boolean value (enable/disable the profiler), the default value is false. profiler.autoDump.interval: It is the number of seconds between profiler dump. The default value is 0, which means no dump. profiler.autoDump.reset: This is a boolean value, reset the profile at every dump. The default is true. The dump is a JSON string structured in sections. The first one is a huge collection of information gathered at runtime related to the configuration and resources used by each object in the database. The keys are structured as follows: db.<db-name>: They are database-related metrics db.<db-name>.cache: They are metrics about databases' caching db.<db-name>.data: They are metrics about databases' datafiles, mainly data holes db.<db-name>.index: They are metrics about databases' indexes system.disk: They are filesystem-related metrics system.memory: They are RAM-related metrics system.config.cpus: They are the number of the cores process.network: They are network metrics process.runtime: They provide process runtime information and metrics server.connections.actives: They are number of active connections The second part of the dump is a collection of chronos. A chrono is a log of an operation, for example, a create operation, an update operation, and so on. Each chrono has the following attributes: last: It is the last time recorded min: It is the minimum time recorded max: It is the maximum time recorded average: It is the average time recorded total: It is the total time recorded entries: It is the number of times the specific metric has been recorded Finally, there are sections about many counters. Query tips In the following paragraphs some useful information on how to optimize the queries execution is given. The explain command You can see how OrientDB accesses the data by using the explain command in the console. To use this command simply write explain followed by the select statement: orientdb> explain select from Posts A set of key-value pairs are returned. Keys mean the following: resultType: It is the type of the returned resultset. It can be collection, document, or number. resultSize: It is the number of records retrieved if the resultType is collection. recordReads: It is the number of records read from datafiles. involvedIndexes: They are the indices involved in the query. indexReads: It is the number of records read from the indices. documentReads: They are the documents read from the datafiles. This number could be different from recordReads, because in a scanned cluster there can be different kinds of records. documentAnalyzedCompatibleClass: They are the documents analyzed belonging to the class requested by the query. This number could be different from documentReads, because a cluster may contain several different classes. elapsed: This time is measured in nanoseconds, it is the time elapsed to execute the statement. As you can see, OrientDB can use indices to speed up the reads. 
Indexes You can define indexes as we do in a relational database using the create index statement or via Java API using the createIndex() method of the OClass class: create index <class>.<property> [unique|notunique|fulltext] [field type] Or for composite index (an index on more than one property): create index <index_name> on <class> (<field1>,<field2>)[unique|notunique|fulltext] If you create a composite index, OrientDB will use it also when in a where clause you don't specify a criteria against all the indexed fields. So you can avoid this to build an index for each field you use in the queries if you have already built a composite one. This is the case of a partial match search and further information about it can be found in the OrientDB wiki at https://github.com/nuvolabase/orientdb/wiki/Indexes#partial-match-search. Generally, the indexes don't work with the like operator. If you want to perform the following query: select from Authors where name like 'j%' And you want use an index, you must define on the field name a FULLTEXT index. FULLTEXT indices permit to index string fields. However keep in mind that indices slow down the insert, update, and delete operations. Summary In this article we have seen some strategies that try to optimize both the OrientDB server installation and queries. Resources for Article: Further resources on this subject: Comparative Study of NoSQL Products [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Microsoft SQL Azure Tools [Article]

Managing a Hadoop Cluster

Packt
30 Aug 2013
13 min read
(For more resources related to this topic, see here.) From the perspective of functionality, a Hadoop cluster is composed of an HDFS cluster and a MapReduce cluster . The HDFS cluster consists of the default filesystem for Hadoop. It has one or more NameNodes to keep track of the filesystem metadata, while actual data blocks are stored on distributed slave nodes managed by DataNode. Similarly, a MapReduce cluster has one JobTracker daemon on the master node and a number of TaskTrackers on the slave nodes. The JobTracker manages the life cycle of MapReduce jobs. It splits jobs into smaller tasks and schedules the tasks to run by the TaskTrackers. A TaskTracker executes tasks assigned by the JobTracker in parallel by forking one or a number of JVM processes. As a Hadoop cluster administrator, you will be responsible for managing both the HDFS cluster and the MapReduce cluster. In general, system administrators should maintain the health and availability of the cluster. More specifically, for an HDFS cluster, it means the management of the NameNodes and DataNodes and the management of the JobTrackers and TaskTrackers for MapReduce. Other administrative tasks include the management of Hadoop jobs, for example configuring job scheduling policy with schedulers. Managing the HDFS cluster The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can negatively affect the efficiency of the cluster. Even worse, it can make the cluster not function properly. For example, DataNode's unavailability caused by network segmentation can lead to some under-replicated data blocks. When this happens, HDFS will automatically replicate those data blocks, which will bring a lot of overhead to the cluster and cause the cluster to be too unstable to be available for use. In this recipe, we will show commands to manage an HDFS cluster. Getting ready Before getting started, we assume that our Hadoop cluster has been properly configured and all the daemons are running without any problems. Log in to the master node from the administrator machine with the following command: ssh hduser@master How to do it... Use the following steps to check the status of an HDFS cluster with hadoop fsck: Check the status of the root filesystem with the following command: hadoop fsck / We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:14:11 EST 2013 .. /user/hduser/.staging/job_201302281211_0002/job.jar: Under replicated blk_-665238265064328579_1016. Target Replicas is 10 but found 5 replica(s). .................................Status: HEALTHY Total size: 14420321969 B Total dirs: 22 Total files: 35 Total blocks (validated): 241 (avg. block size 59835360 B) Minimally replicated blocks: 241 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 2 (0.8298755 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 2.0248964 Corrupt blocks: 0 Missing replicas: 10 (2.0491803 %) Number of data-nodes: 5 Number of racks: 1 FSCK ended at Thu Feb 28 17:14:11 EST 2013 in 28 milliseconds The filesystem under path '/' is HEALTHY The output shows that some percentage of data blocks is under-replicated. But because HDFS can automatically make duplication for those data blocks, the HDFS filesystem and the '/' directory are both HEALTHY. 
Check the status of all the files on HDFS with the following command: hadoop fsck / -files We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:40:35 EST 2013 / <dir> /home <dir> /home/hduser <dir> /home/hduser/hadoop <dir> /home/hduser/hadoop/tmp <dir> /home/hduser/hadoop/tmp/mapred <dir> /home/hduser/hadoop/tmp/mapred/system <dir> /home/hduser/hadoop/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK /user <dir> /user/hduser <dir> /user/hduser/randtext <dir> /user/hduser/randtext/_SUCCESS 0 bytes, 0 block(s): OK /user/hduser/randtext/_logs <dir> /user/hduser/randtext/_logs/history <dir> /user/hduser/randtext/_logs/history/job_201302281451_0002_13620904 21087_hduser_random-text-writer 23995 bytes, 1 block(s): OK /user/hduser/randtext/_logs/history/job_201302281451_0002_conf.xml 22878 bytes, 1 block(s): OK /user/hduser/randtext/part-00001 1102231864 bytes, 17 block(s): OK Status: HEALTHY Hadoop will scan and list all the files in the cluster. This command scans all ? les on HDFS and prints the size and status. Check the locations of file blocks with the following command: hadoop fsck / -files -locations The output of this command will contain the following information: The first line tells us that file part-00000 has 17 blocks in total and each block has 2 replications (replication factor has been set to 2). The following lines list the location of each block on the DataNode. For example, block blk_6733127705602961004_1127 has been replicated on hosts 10.145.231.46 and 10.145.223.184. The number 50010 is the port number of the DataNode. Check the locations of file blocks containing rack information with the following command: hadoop fsck / -files -blocks -racks Delete corrupted files with the following command: hadoop fsck -delete Move corrupted files to /lost+found with the following command: hadoop fsck -move Use the following steps to check the status of an HDFS cluster with hadoop dfsadmin: Report the status of each slave node with the following command: hadoop dfsadmin -report The output will be similar to the following: Configured Capacity: 422797230080 (393.76 GB) Present Capacity: 399233617920 (371.82 GB) DFS Remaining: 388122796032 (361.47 GB) DFS Used: 11110821888 (10.35 GB) DFS Used%: 2.78% Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 ------------------------------------------------- Datanodes available: 5 (5 total, 0 dead) Name: 10.145.223.184:50010 Decommission Status : Normal Configured Capacity: 84559446016 (78.75 GB) DFS Used: 2328719360 (2.17 GB) Non DFS Used: 4728565760 (4.4 GB) DFS Remaining: 77502160896(72.18 GB) DFS Used%: 2.75% DFS Remaining%: 91.65% Last contact: Thu Feb 28 20:30:11 EST 2013 ... The first section of the output shows the summary of the HDFS cluster, including the configured capacity, present capacity, remaining capacity, used space, number of under-replicated data blocks, number of data blocks with corrupted replicas, and number of missing blocks. The following sections of the output information show the status of each HDFS slave node, including the name (ip:port) of the DataNode machine, commission status, configured capacity, HDFS and non-HDFS used space amount, HDFS remaining space, and the time that the slave node contacted the master. 
Refresh all the DataNodes using the following command: hadoop dfsadmin -refreshNodes Check the status of the safe mode using the following command: hadoop dfsadmin -safemode get We will be able to get the following output: Safe mode is OFF The output tells us that the NameNode is not in safe mode. In this case, the filesystem is both readable and writable. If the NameNode is in safe mode, the filesystem will be read-only (write protected). Manually put the NameNode into safe mode using the following command: hadoop dfsadmin -safemode enter This command is useful for system maintenance. Make the NameNode to leave safe mode using the following command: hadoop dfsadmin -safemode leave If the NameNode has been in safe mode for a long time or it has been put into safe mode manually, we need to use this command to let the NameNode leave this mode. Wait until NameNode leaves safe mode using the following command: hadoop dfsadmin -safemode wait This command is useful when we want to wait until HDFS finishes data block replication or wait until a newly commissioned DataNode to be ready for service. Save the metadata of the HDFS filesystem with the following command: hadoop dfsadmin -metasave meta.log The meta.log file will be created under the directory $HADOOP_HOME/logs. Its content will be similar to the following: 21 files and directories, 88 blocks = 109 total Live Datanodes: 5 Dead Datanodes: 0 Metasave: Blocks waiting for replication: 0 Metasave: Blocks being replicated: 0 Metasave: Blocks 0 waiting deletion from 0 datanodes. Metasave: Number of datanodes: 5 10.145.223.184:50010 IN 84559446016(78.75 GB) 2328719360(2.17 GB) 2.75% 77502132224(72.18 GB) Thu Feb 28 21:43:52 EST 2013 10.152.166.137:50010 IN 84559446016(78.75 GB) 2357415936(2.2 GB) 2.79% 77492854784(72.17 GB) Thu Feb 28 21:43:52 EST 2013 10.145.231.46:50010 IN 84559446016(78.75 GB) 2048004096(1.91 GB) 2.42% 77802893312(72.46 GB) Thu Feb 28 21:43:54 EST 2013 10.152.161.43:50010 IN 84559446016(78.75 GB) 2250854400(2.1 GB) 2.66% 77600096256(72.27 GB) Thu Feb 28 21:43:52 EST 2013 10.152.175.122:50010 IN 84559446016(78.75 GB) 2125828096(1.98 GB) 2.51% 77724323840(72.39 GB) Thu Feb 28 21:43:53 EST 2013 21 files and directories, 88 blocks = 109 total ... How it works... The HDFS filesystem will be write protected when NameNode enters safe mode. When an HDFS cluster is started, it will enter safe mode first. The NameNode will check the replication factor for each data block. If the replica count of a data block is smaller than the configured value, which is 3 by default, the data block will be marked as under-replicated. Finally, an under-replication factor, which is the percentage of under-replicated data blocks, will be calculated. If the percentage number is larger than the threshold value, the NameNode will stay in safe mode until enough new replicas are created for the under-replicated data blocks so as to make the under-replication factor lower than the threshold. 
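The threshold mentioned above is configurable. As a hedged example — the property names below come from the Hadoop 1.x hdfs-default.xml and the values shown are the usual defaults, so verify them against your own distribution — the settings go into $HADOOP_HOME/conf/hdfs-site.xml:

<!-- Fraction of blocks that must satisfy dfs.replication.min before the NameNode leaves safe mode -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
</property>
<!-- Extra time, in milliseconds, spent in safe mode after the threshold is reached -->
<property>
  <name>dfs.safemode.extension</name>
  <value>30000</value>
</property>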
We can get the usage of the fsck command using: hadoop fsck The usage information will be similar to the following: Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]] <path> start checking from this path -move move corrupted files to /lost+found -delete delete corrupted files -files print out files being checked -openforwrite print out files opened for write -blocks print out block report -locations print out locations for every block -racks print out network topology for data-node locations By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status.   We can get the usage of the dfsadmin command using: hadoop dfsadmin The output will be similar to the following: Usage: java DFSAdmin [-report] [-safemode enter | leave | get | wait] [-saveNamespace] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-refreshServiceAcl] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-setSpaceQuota <quota> <dirname>...<dirname>] [-clrSpaceQuota <dirname>...<dirname>] [-setBalancerBandwidth <bandwidth in bytes per second>] [-help [cmd]] There's more… Besides using command line, we can use the web UI to check the status of an HDFS cluster. For example, we can get the status information of HDFS by opening the link http://master:50070/dfshealth.jsp. We will get a web page that shows the summary of the HDFS cluster such as the configured capacity and remaining space. For example, the web page will be similar to the following screenshot: By clicking on the Live Nodes link, we can check the status of each DataNode. We will get a web page similar to the following screenshot: By clicking on the link of each node, we can browse the directory of the HDFS filesystem. The web page will be similar to the following screenshot: The web page shows that file /user/hduser/randtext has been split into five partitions. We can browse the content of each partition by clicking on the part-0000x link. Configuring SecondaryNameNode Hadoop NameNode is a single point of failure. By configuring SecondaryNameNode, the filesystem image and edit log files can be backed up periodically. And in case of NameNode failure, the backup files can be used to recover the NameNode. In this recipe, we will outline steps to configure SecondaryNameNode. Getting ready We assume that Hadoop has been configured correctly. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to configure SecondaryNameNode: Stop the cluster using the following command: stop-all.sh Add or change the following into the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>fs.checkpoint.dir</name> <value>/hadoop/dfs/namesecondary</value> </property> If this property is not set explicitly, the default checkpoint directory will be ${hadoop.tmp.dir}/dfs/namesecondary. 
Start the cluster using the following command: start-all.sh The tree structure of the NameNode data directory will be similar to the following: ${dfs.name.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage ├── in_use.lock └── previous.checkpoint ├── edits ├── fsimage ├── fstime └── VERSION And the tree structure of the SecondaryNameNode data directory will be similar to the following: ${fs.checkpoint.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage └── in_use.lock There's more... To increase redundancy, we can configure NameNode to write filesystem metadata on multiple locations. For example, we can add an NFS shared directory for backup by changing the following property in the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>dfs.name.dir</name> <value>/hadoop/dfs/name,/nfs/name</value> </property> Managing the MapReduce cluster A typical MapReduce cluster is composed of one master node that runs the JobTracker and a number of slave nodes that run TaskTrackers. The task of managing a MapReduce cluster includes maintaining the health as well as the membership between TaskTrackers and the JobTracker. In this recipe, we will outline commands to manage a MapReduce cluster. Getting ready We assume that the Hadoop cluster has been properly configured and running. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to manage a MapReduce cluster: List all the active TaskTrackers using the following command: hadoop -job -list-active-trackers This command can help us check the registration status of the TaskTrackers in the cluster. Check the status of the JobTracker safe mode using the following command: hadoop mradmin -safemode get We will get the following output: Safe mode is OFF The output tells us that the JobTracker is not in safe mode. We can submit jobs to the cluster. If the JobTracker is in safe mode, no jobs can be submitted to the cluster. Manually let the JobTracker enter safe mode using the following command: hadoop mradmin -safemode enter This command is handy when we want to maintain the cluster. Let the JobTracker leave safe mode using the following command: hadoop mradmin -safemode leave When maintenance tasks are done, you need to run this command. If we want to wait for safe mode to exit, the following command can be used: hadoop mradmin -safemode wait Reload the MapReduce queue configuration using the following command: hadoop mradmin -refreshQueues Reload active TaskTrackers using the following command: hadoop mradmin -refreshNodes How it works... Get the usage of the mradmin command using the following: hadoop mradmin The usage information will be similar to the following: Usage: java MRAdmin [-refreshServiceAcl] [-refreshQueues] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshNodes] [-safemode <enter | leave | get | wait>] [-help [cmd]] ... The meaning of the command options is listed in the following table: Option Description -refreshServiceAcl Force JobTracker to reload service ACL. -refreshQueues Force JobTracker to reload queue configurations. -refreshUserToGroupsMappings Force JobTracker to reload user group mappings. -refreshSuperUserGroupsConfiguration Force JobTracker to reload super user group mappings. -refreshNodes Force JobTracker to refresh the JobTracker hosts. -help [cmd] Show the help info for a command or all commands. 
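Beyond mradmin, day-to-day MapReduce administration is mostly done with the hadoop job command. The following is a hedged sketch of common Hadoop 1.x invocations; the job ID is reused from the earlier sample output purely for illustration:

hadoop job -list                                     # list the jobs currently running
hadoop job -status job_201302281451_0002             # show completion percentage and counters
hadoop job -set-priority job_201302281451_0002 HIGH  # raise a job's scheduling priority
hadoop job -kill job_201302281451_0002               # kill a misbehaving job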
Summary In this article, we learned Managing the HDFS cluster, configuring SecondaryNameNode, and managing the MapReduce cluster. As a Hadoop cluster administrator, as the system administrator is responsible for managing both the HDFS cluster and the MapReduce cluster, he/she must be aware of how to manage these in order to maintain the health and availability of the cluster. More specifically, for an HDFS cluster, it means the management of the NameNodes and DataNodes and the management of the JobTrackers and TaskTrackers for MapReduce, which is covered in this article. Resources for Article : Further resources on this subject: Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article] Advanced Hadoop MapReduce Administration [Article] Understanding MapReduce [Article]

Working with Import Process (Intermediate)

Packt
30 Aug 2013
5 min read
(For more resources related to this topic, see here.) Getting ready The first thing we need to do is to download the latest version of Sqoop from following location http://www.apache.org/dist/sqoop/ and extract it on your machine. Now I am calling the Sqoop installation dir as $SQOOP_HOME. Given here are the prerequisites for Sqoop import process. Installed and running Relational Database Management System (MySQL). Installed and running Hadoop Cluster. Set $HADOOP_HOME environment variable. Following are the common arguments of import process. Parameters Description --connect <jdbc-uri> This command specifies the server or database to connect. It also specifies the port. Example: --connect jdbc:mysql://host:port/databaseName --connection-manager <class-name>   Specify connection manager class name.   --driver <class-name> Specify the fully qualified name of JDBC driver class. --password <password> Set authentication password required to connect to input source. --username <username> Set authentication username. How to do it Let’s see how to work with import process First, we will start with import single RDBMS table into Hadoop. Query1: $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/abc/tableName The content of output file in HDFS will look like: Next, we will put some light on approach of import only selected rows and selected columns of RDBMS table into Hadoop. Query 2: Import selected columns $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” Query 3: Import selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --where ‘student_id<100’ Query 4: Import selected columns of selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” -- where ‘student_id<100’ How it works… Now let’s see how the above steps work: Import single table: Apart from the common arguments of import process, as explained previously, this part covers some other arguments which are required to import a table into Hadoop Distributed File System. Parameters Description --table <table-name> Name of input table to fetch. --target-dir<dir> Location of output/target dir in HDFS. --direct If user want to use non-JDBC based access mechanism for faster database access --options-file <file-path> All the command line options that are common in most of commands can put in options file for convenience.  The Query1 will run a MapReduce job and import all the rows of given table to HDFS (where, /user/abc/tableName is the location of output files). The records imported in HDFS preserve their original columns order, which means, if input table contains four columns A, B, C and D, then content in HDFS file will look like: A1, B1, C1, D1 A2, B2, C2, D2 Import selected columns: By default, the import query will select all columns of input table for import, but we can select the subset of columns by specifying the comma separated list of columns in --columns argument. The Query2 will only fetch three columns (student_id, address and name) of student table. 
Import selected columns: By default, the import query selects all columns of the input table, but we can select a subset of columns by specifying a comma-separated list in the --columns argument. Query 2 fetches only three columns (student_id, address, and name) of the student table.

If the import query contains the --columns argument, the order of the columns in the output files is the same as the order specified in --columns. The output in HDFS will look like:

student_id, address, name
1, Delhi, XYZ
2, Mumbai, PQR
..........

If the import query specifies the columns in the order "address, name, student_id", then the output in HDFS will look like:

address, name, student_id
Delhi, XYZ, 1
Mumbai, PQR, 2
.............

Import selected rows: By default, all the rows of the input table are imported to HDFS, but we can control which rows are imported by using the --where argument in the import statement. Query 3 imports only those rows whose student_id value is less than 100. Query 4 uses both the --columns and --where arguments in one statement; for it, Sqoop internally generates a query of the form "select student_id, address, name from student where student_id<100".

There's more...

This section covers some more examples of the import process.

Import all tables: So far we have imported a single table into HDFS. The import-all-tables tool lets us import a set of tables from an RDBMS to HDFS, creating a separate directory in HDFS for each table. The mandatory conditions for the import-all-tables tool are:

All tables must have a single-column primary key.
You must intend to import all the columns of each table.
No --where, --columns, or --query arguments are permitted.

Query 5:

$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password

This query imports all tables (tableName and tableName1) of database db1 into HDFS, with one output directory per table.

Summary

We learned how to import a single RDBMS table into HDFS, import selected columns and selected rows, and import a set of RDBMS tables.

Resources for Article:

Further resources on this subject:
Introduction to Logging in Tomcat 7 [Article]
Configuring Apache and Nginx [Article]
Geronimo Architecture: Part 2 [Article]


What are SSAS 2012 dimensions and cube?

Packt
29 Aug 2013
7 min read
What is SSAS?

SQL Server Analysis Services is an online analytical processing tool that greatly speeds up the kinds of queries and calculations that are common in a business intelligence environment. It looks like a relational database, but it has differences. SSAS does not remove the need for relational databases, but combining the two helps in developing business intelligence solutions.

Why do we need SSAS?

SSAS provides a very clear graphical interface for end users to build queries, and acts as a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse, and all reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting, but SSAS is a database service specifically designed for this type of workload, and in most cases it achieves much better query performance.

SSAS 2012 architecture

In this article we will explain the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. The new version of SSAS that closely resembles PowerPivot uses the tabular model. When installing SSAS, we must select either the tabular model or the multidimensional model for the instance that runs inside the server; both data models are developed under the same code base but are sometimes treated separately. The concepts involved in designing the two data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa, without rebuilding everything from the start. From the end user's point of view, both data models do almost the same things and appear almost identical when used through a client tool such as Excel.

The tabular model

Building a database using the tabular model is very similar to building one in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we have probably made a design mistake somewhere, because everything we need should be contained within an individual database. Tabular models are designed using SQL Server Data Tools (SSDT), with a data project in SSDT mapping onto a database in Analysis Services.

The multidimensional model

This data model is very similar to the tabular model in that data is managed in databases, and databases are designed in SSDT and managed using SQL Server Management Studio. The differences become apparent below the database level, where multidimensional rather than relational concepts apply: in the multidimensional model, data is modeled as a series of cubes and dimensions, not tables.

The future of Analysis Services

Having two data models inside SSAS, along with two query and calculation languages, is clearly not an ideal state of affairs. It means we have to select a data model at the start of our project, when we might not even know enough about our requirements to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement, and that new advanced features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously.

What's new in SSAS 2012?

There is no easy way of converting a multidimensional data model into a tabular data model. There are tools on the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful.

Ease of use

If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find the tabular model much easier to learn than the multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but the development process is also much more straightforward and there are far fewer features to learn.

Compatibility with PowerPivot

The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces are practically the same, and both use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around: a tabular model cannot be converted to a PowerPivot model.

Processing performance characteristics

Comparing the processing performance of the multidimensional and tabular data models is difficult. It may be slower to process a large table in the tabular data model than the equivalent measure group in a multidimensional one, because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time.

What is an SSAS dimension?

A database dimension is a collection of related objects, in other words attributes, which provide information about the fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide the paths to assist users when they browse through the data in a cube. By default these attributes are visible as attribute hierarchies, and they can be used to understand the fact data in a cube.

What is an SSAS cube?

A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of the cube that you slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve the values of measures as if they were contained in cells in the cube. Cells are defined for every possible summarized value: a cell is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection.

Summary

We talked about the new features and services in SSAS 2012, what you can do with them, and why they are so useful.

Resources for Article:

Further resources on this subject:
Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article]
Performing Common MDX-related Tasks [Article]
How to Perform Iteration on Sets in MDX [Article]


Creating a pivot table

Packt
29 Aug 2013
8 min read
A pivot table is the core business intelligence tool that helps turn meaningless data from various sources into a meaningful result. By using different ways of presenting data, we are able to identify relations between seemingly separate data and reach conclusions that help us identify our strengths and areas of improvement.

Getting ready

Prepare the two files entitled DatabaseData_v2.xlsx and GDPData_v2.xlsx. We will be using these results along with other data sources to create a meaningful PowerPivot table that will be used for intelligent business analysis.

How to do it...

For each of the two files, we will build upon the file and add a pivot table to it, gaining exposure using the data we are already familiar with.

The following are the steps to create a pivot table with the DatabaseData_v2.xlsx file, which results in the creation of a DatabaseData_v3.xlsx file:

1. Open the PowerPivot window of the DatabaseData_v2.xlsx file with its 13 tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Select the checkboxes as follows:
   Select CountryRegion | Name and move it under Row Labels.
   Select Address | City and move it under Row Labels.
   Select Address | AddressLine1 as Count of AddressLine1 and move it under Values.
4. This shows the number of clients per city and per country. However, it is very difficult to navigate, as each country name has to be collapsed in order to see the next country. Let us move the CountryRegion | Name column to Slicers Vertical.
5. Now the pivot table should display simple results: the number of clients in a region, filterable by country using slicers. Let us apply some formatting to allow for a better understanding of the data. Right-click on Name under the Slicers Vertical area of the PowerPivot Field List dashboard, select Field Settings, and then change the name to Country Name. The title of the slicer changes from Name to Country Name, allowing anyone who views this data to better understand what it represents.
6. Similarly, right-click on Count of AddressLine1 under Values, select Edit Measure, and then change its name to Number of Clients. Also change the data title City under the Row Labels area to City Name.
7. Watch the results change as we click on different country names. We can filter for multiple countries by holding the Ctrl key while clicking, and can remove all filters by clicking the small button on the top-right of the slicer. This is definitely easier to navigate and to understand compared to what we did at first without using slicers, which is how it would appear in Excel 2010 without PowerPivot.
8. However, this table is still too big. Clicking on Canada gives too many cities whose names many of us have not heard of before. Let us break the data down further by including states/provinces. Select StateProvince | Name, move it under Slicers Horizontal, and change its title to State Name. It is a good thing that we are renaming these as we go along; otherwise, there would have been two datasets called Name, and anyone would be confused as we moved along.
9. Now we should see the state names filter at the top, the country name filter on the left, and a list of cities with the number of clients in the middle. This, however, is kind of awkward. Let us rearrange the filters by having the largest filter (country) at the top and the sub-filter (state) on the left-hand side. This can be done simply by dragging the Country Name dataset to Slicers Horizontal and State Name to Slicers Vertical.
10. Again, play with the results and try to understand the features: try filtering by a country and by a state/province. Now there are limited numbers of cities shown for each country and each state/province, making it easier to see the list of cities.
11. However, for countries such as the United States, there are just too many states. Let us change the formatting of the vertical filter to display three states per line, so it is easier to find the state we are looking for. This can be done by right-clicking on the vertical filter, selecting Size and Properties | Position and Layout, and then changing the Number of Columns value. Repeat the same step for Country Name to display six columns, and then change the sizes of the filters to look more selectable.
12. Change the name of the sheet to PivotTable and then save the file as DatabaseData_v3.xlsx.

The following are the steps to create a pivot table with the GDPData_v2.xlsx file, which results in the creation of a GDPData_v3.xlsx file:

1. Open the PowerPivot window of the GDPData_v2.xlsx file with its two tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Move the datasets from the year 2000 to the year 2010 to the Values field, move Country Name into the Row Labels field, and move Country Name again into the Slicers Horizontal field.
4. In the slicer, select five countries: Canada, China, Korea, Japan, and United States.
5. Select all fields and reduce the number of decimal places. We can now clearly see that GDP in China tripled over the decade, and that only China and Korea saw an increase in GDP from 2008 to 2009 while the GDP of the other nations dropped due to the 2008 financial crisis. Knowing the relevant background information on world finance events, we can make intelligent analyses, such as which markets to invest in if we are worried about another financial crisis taking place.
6. As the data gets larger, looking at the raw GDP numbers becomes increasingly difficult. In such cases, we can switch the type of data displayed by using the Show Value As button in the PivotTable Tools | Options menu. Play around with it and see how it works: % of Column Total shows each GDP as a percentage of the year, % Difference From allows the user to set one value as the standard and compare the rest to it, and the Rank Largest to Smallest option simply shows the ranking based on which country earns the most GDP.
7. Change the name of the sheet to PivotTable and then save the file as GDPData_v3.xlsx.

How it works...

We looked at two different files and focused on two different fields. The first file was more qualitative and showed the relationship between regions and the number of clients, using various features of pivot tables such as slicers. We also looked at how to format various aspects of a pivot table for easier processing and a better understanding of the represented data. Slicers embedded in the pivot table are a unique and very powerful feature of PowerPivot that allow us to sort through data simply by clicking the different criteria. Adding more slicers helps to customize the data further, enabling the user to create all sorts of views imaginable. There are no differences between horizontal and vertical slicers aside from the fact that they are at different locations.

From the second file, we focused more on the quantitative data and different ways of representing it. By using slicers to limit the number of countries, we were able to focus on the data presented, represent the GDP in various formats such as percentages and ranks, and compare the differences between the numbers by selecting one as a standard. A similar method of representing data in a different format could be applied to the first file to show the percentage of clients per nation, and so on.

There's more...

We covered the very basic setup of creating a pivot table. We can also create relationships between data and create custom fields, so that better results are produced. So don't worry about why the pivot table looks so small! For those who do not like working with a pivot table, there is also a feature that will convert all cells into Excel formulas. Under the PowerPivot Tools | OLAP Tools option, the Convert to Formula button does exactly that. However, be warned that it cannot be undone, as the changes are permanent.

Summary

In this article, we learned how to use raw data to make pivot tables that can help us make smart business decisions!

Resources for Article:

Further resources on this subject:
SAP HANA integration with Microsoft Excel [Article]
Managing Core Microsoft SQL Server 2008 R2 Technologies [Article]
Eloquent relationships [Article]

Schemas and Models

Packt
27 Aug 2013
12 min read
So what is a schema? At its simplest, a schema is a way to describe the structure of data. Typically this involves giving each piece of data a label, and stating what type of data it is, for example, a number, date, string, and so on.

In the following example, we are creating a new Mongoose schema called userSchema. We are stating that a database document using this schema will have three pieces of data, which are as follows:

name: This data will contain a string
email: This will also contain a string value
createdOn: This data will contain a date

The following is the schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Field sizes

Note that, unlike some other systems, there is no need to set the field size. This can be useful if you need to change the amount of data stored in a particular object. For example, your system might impose a 16-character limit on usernames, so you set the size of the field to 16 characters. Later, you realize that you want to encrypt the usernames, but this will double the length of the data stored. If your database schema uses fixed field sizes, you will need to refactor it, which can take a long time on a large database. With Mongoose, you can just start encrypting that data object without worrying about it. If you're storing large documents, you should bear in mind that MongoDB imposes a maximum document size of 16 MB. There are ways around even this limit, using the MongoDB GridFS API.

Data types allowed in schemas

There are eight types of data that can, by default, be set in a Mongoose schema. These are also referred to as SchemaTypes; they are:

String
Number
Date
Boolean
Buffer
ObjectId
Mixed
Array

The first four SchemaTypes are self-explanatory, but let's take a quick look at them all.

String

This SchemaType stores a string value, UTF-8 encoded.

Number

This SchemaType stores a number value, with restrictions. Mongoose does not natively support long and double datatypes, for example, although MongoDB does. However, Mongoose can be extended using plugins to support these other types.

Date

This SchemaType holds a date and time object, typically returned from MongoDB as an ISODate object, for example, ISODate("2013-04-03T12:56:26.009Z").

Boolean

This SchemaType has only two values: true or false.

Buffer

This SchemaType is primarily used for storing binary information, for example, images stored in MongoDB.

ObjectId

This SchemaType is used to assign a unique identifier to a key other than _id. Rather than just specifying the type of ObjectId, you need to specify the fully qualified version, Schema.Types.ObjectId. For example:

projectSchema.add({owner: mongoose.Schema.Types.ObjectId});

Mixed

A mixed data object can contain any type of data. It can be declared either by setting an empty object, or by using the fully qualified Schema.Types.Mixed. The following two commands will do the same thing:

var djSchema = new mongoose.Schema({mixedUp: {}});
var djSchema = new mongoose.Schema({mixedUp: Schema.Types.Mixed});

While this sounds like it might be great, there is a big caveat. Changes to data of mixed type cannot be automatically detected by Mongoose, so it doesn't know that it needs to save them.

Tracking changes to Mixed type

As Mongoose can't automatically see changes made to mixed type data, you have to manually declare when the data has changed. Fortunately, Mongoose exposes a method called markModified to do just this, passing it the path of the data object that has changed.

dj.mixedUp = { valueone: "a new value" };
dj.markModified('mixedUp');
dj.save();

Array

The array datatype can be used in two ways. First, as a simple array of values of the same data type, as shown in the following code snippet:

var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [String]
});

Second, the array datatype can be used to store a collection of subdocuments using nested schemas. Here's an example of how this can work:

var emailSchema = new mongoose.Schema({
  email: String,
  verified: Boolean
});
var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [emailSchema]
});

Warning – array defined as mixed type

A word of caution. If you declare an empty array it will be treated as mixed type, meaning that Mongoose will not be able to automatically detect any changes made to the data. So avoid these two types of array declaration, unless you intentionally want a mixed type:

var emailSchema = new mongoose.Schema({addresses: []});
var emailSchema = new mongoose.Schema({addresses: Array});

Custom SchemaTypes

If your data requires a different datatype which is not covered earlier in this article, Mongoose offers the option of extending it with custom SchemaTypes. The extension method is managed using Mongoose plugins. Some examples of SchemaType extensions that have already been created are long, double, RegEx, and even email.

Where to write the schemas

As your schemas sit on top of Mongoose, the only absolute is that they need to be defined after Mongoose is required. You don't need an active or open connection to define your schemas. That being said, it is advisable to make your connection early on, so that it is available as soon as possible, bearing in mind that remote databases or replica sets may take longer to connect than your localhost development server. While no action can be taken on the database through the schemas and models until the connection is open, Mongoose can buffer requests made from when the connection is defined. Mongoose models also rely on the connection being defined, so there's another reason to get the connection set up early in the code and then define the schemas and models.

Writing a schema

Let's write the schema for a User in our MongoosePM application. The first thing we have to do is declare a variable to hold the schema. I recommend taking the object name (for example, user or project) and adding Schema to the end of it. This makes following the code later on super easy. The second thing we need to do is create a new Mongoose schema object to assign to this variable. The skeleton of this is as follows:

var userSchema = new mongoose.Schema({ });

We can add in the basic values of name, email, and createdOn that we looked at earlier, giving us our first user schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Modifying an existing schema

Suppose we run the application with this for a while, and then decide that we want to record the last time each user logged on, and the last time their record was modified. No problem! We don't have to refactor the database or take it offline while we upgrade the schema, we simply add a couple of entries to the Mongoose schema. If a key requested in the schema doesn't exist, neither Mongoose nor MongoDB will throw errors; Mongoose will just return null values.
When saving the MongoDB documents, the new keys and values will be added and stored as required. If the value is null, then the key is not added. So let's add modifiedOn and lastLogin to our userSchema:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date,
  modifiedOn: Date,
  lastLogin: Date
});

Setting a default value

Mongoose allows us to set a default value for a data key when the document is first created. Looking at our schema created earlier, a possible candidate for this is createdOn. When a user first signs up, we want the date and time to be set. We could do this by adding a timestamp to the controller function when we create a user, or, to make a point, we can modify the schema to set a default value. To do this, we need to change the information we are sending about the createdOn data object. What we have currently is:

createdOn: Date

This is short for:

createdOn: { type: Date }

We can add another entry to this object to set a default value here, using the JavaScript Date object:

createdOn: { type: Date, default: Date.now }

Now every time a new user is created, its createdOn value will be set to the current date and time. Note that in JavaScript default is a reserved word. While the language allows reserved words to be used as keys, some IDEs and linters regard it as an error. If this causes issues for you or your environment, you can wrap it in quotes, like in the following code snippet:

createdOn: { type: Date, 'default': Date.now }

Only allowing unique entries

If we want to ensure that there is only ever one user per e-mail address, we can specify that the email field should be unique:

email: {type: String, unique:true}

With this in place, when saving to the database, MongoDB will check to see if the e-mail value already exists in another document. If it finds it, MongoDB (not Mongoose) will return an E11000 error. Note that this approach also defines a MongoDB index on the email field.

Our final User schema

Your userSchema should now look like the following:

var userSchema = new mongoose.Schema({
  name: String,
  email: {type: String, unique:true},
  createdOn: { type: Date, default: Date.now },
  modifiedOn: Date,
  lastLogin: Date
});

A corresponding document from the database would look like the following (line breaks are added for readability):

{
  "__v" : 0,
  "_id" : ObjectId("5126b7a1f8a44d1e32000001"),
  "createdOn" : ISODate("2013-02-22T00:11:13.436Z"),
  "email" : "[email protected]",
  "lastLogin" : ISODate("2013-04-03T12:54:42.734Z"),
  "modifiedOn" : ISODate("2013-04-03T12:56:26.009Z"),
  "name" : "Simon Holmes"
}

What's that "__v" thing?

You may have noticed a data entity in the document that we didn't set: __v. This is an internal versioning number automatically set by Mongoose when a document is created. It doesn't increment when a document is changed, but instead is automatically incremented whenever an array within the document is updated in such a way that might cause the indexed position of some of the entries to have changed.

Why is this needed? When working with an array you will typically access the individual elements through their positional index, for example, myArray[3]. But what happens if somebody else deletes the element in myArray[2] while you are editing the data in myArray[3]? Your original data is now contained in myArray[2] but you don't know this, so you quite happily overwrite whatever data is now stored in myArray[3]. The __v gives you a method to sanity check this, and prevent this scenario from happening.
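To see the unique email constraint described above in action, the following is a minimal sketch of catching the E11000 error when a duplicate is saved. Building models is covered later in this article; the variable names, e-mail address, and error-handling style here are illustrative rather than taken from the MongoosePM project:

var User = mongoose.model('User', userSchema);

// A second user with an e-mail address that already exists in the collection
var duplicate = new User({ name: 'Simon again', email: 'simon@example.com' });

duplicate.save(function (err, savedUser) {
  if (err && err.code === 11000) {
    // MongoDB rejected the write because the unique index on email was violated
    console.log('That e-mail address is already registered');
  } else if (err) {
    console.log('Save failed: ' + err.message);
  } else {
    console.log('Saved user: ' + savedUser.name);
  }
});

Checking err.code in the save callback keeps the duplicate-address case separate from any other database error, so the application can show a friendly message instead of a generic failure.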
Defining the Project schema

As part of our MongoosePM application we also need to think about Projects. After all, PM here does stand for Project Manager. Let's take what we've learned and create the Project schema. We are going to want a few types of data to start with:

projectName: A string containing the name of the project.
createdOn: The date when the document was first created and saved. This option is set to automatically save the current date and time.
modifiedOn: The date and time when the document was last changed.
createdBy: A string that will for now contain the unique ID of the user who created the project.
tasks: A string to hold task information.

Transforming these requirements into a Mongoose schema definition, we create the following:

var projectSchema = new mongoose.Schema({
  projectName: String,
  createdOn: Date,
  modifiedOn: { type: Date, default: Date.now },
  createdBy: String,
  tasks: String
});

This is our starting point, and we will build upon it. For now we have these basic data objects as mentioned previously in this article. Here's an example of a corresponding document from the database (line breaks added for readability):

{
  "projectName" : "Another test",
  "createdBy" : "5126b7a1f8a44d1e32000001",
  "createdOn" : ISODate("2013-04-03T17:47:51.031Z"),
  "tasks" : "Just a simple task",
  "_id" : ObjectId("515c6b47596acf8e35000001"),
  "modifiedOn" : ISODate("2013-04-03T17:47:51.032Z"),
  "__v" : 0
}

Improving the Project schema

Throughout the rest of the article we will be improving this schema, but the beauty of using Mongoose is that we can do this relatively easily. Putting together a basic schema like this to build upon is a great approach for prototyping: you have the data you need there, and can add complexity where you need it, when you need it.

Building models

A single instance of a model maps directly to a single document in the database. With this 1:1 relationship, it is the model that handles all document interaction: creating, reading, saving, and deleting. This makes the model a very powerful tool. Building the model is pretty straightforward. When using the default Mongoose connection, we can call the mongoose.model command, passing it two arguments:

The name of the model
The name of the schema to compile

So if we were to build a model from our user schema we would use this line:

mongoose.model( 'User', userSchema );

If you're using a named Mongoose connection, the approach is very similar:

adminConnection.model( 'User', userSchema );

Instances

It is useful to have a good understanding of how a model works. After building the User model using the previous line, we could create two instances:

var userOne = new User({ name: 'Simon' });
var userTwo = new User({ name: 'Sally' });

Summary

In this article, we have looked at how schemas and models relate to your data. You should now understand the roles of both schemas and models. We have looked at how to create simple schemas and the types of data they can contain. We have also seen that it is possible to extend this if the native types are not enough. In the MongoosePM project, you should now have added a User schema and a Project schema, and built models of both of these.

Resources for Article:

Further resources on this subject:
Understanding Express Routes [Article]
Validating and Using the Model Data [Article]
Creating Your First Web Page Using ExpressionEngine: Part 1 [Article]


Creating Multivariate Charts

Packt
26 Aug 2013
10 min read
With an increasing number of variables, any analysis can become challenging and observations harder to make; however, Tableau simplifies the process for the designer and uses effective layouts for the reader even in multivariate analysis. Using various combinations of colors and charts, we can create compelling graphics that generate critical insights from our data. Among the charts covered in this article, facets and area charts are easier to understand and easier to create compared to bullet graphs and dual axes charts.

Creating facets

Facets are one of the powerful features in Tableau. Edward Tufte, a pioneer in the field of information graphics, championed these types of charts, also called grid or panel charts; he called them small multiples. These charts show the same measure(s) across various values of one or two variables for easier comparison.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data file is loaded on the new worksheet, perform the following steps to create a simple faceted chart:

1. Drag-and-drop Market from Dimensions into the Columns shelf.
2. Drag-and-drop Product Type from Dimensions into the Rows shelf.
3. Drag-and-drop Profit from Measures into the Rows shelf next to Product Type.
4. Optionally, you can drag-and-drop Market into the Color Marks box to give color to the four bars of the different Market areas.

How it works...

When there is one dimension on one of the shelves, either Columns or Rows, and one measure on the other shelf, Tableau creates a univariate bar chart; but when we drop additional dimensions along with the measure, Tableau creates small charts, or facets, and displays univariate charts broken down by a dimension.

There's more...

A company named Juice Analytics has a great blog article on the topic of small multiples. This article lists the benefits of using small multiples as well as some examples of small multiples in practice. Find this blog at http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/.

Creating area charts

An area chart is an extension of a line chart. The area chart shows the line of the measure but fills the area below the line to emphasize the value of the measure. A special case of the area chart is the stacked area chart, which shows a line per measure and fills the area between the lines. Tableau's implementation of area charts uses one date variable and one or more measures.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded on the new worksheet, perform the following steps to create an area chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. Select Order Date from Dimensions and Order Quantity from Measures by clicking while holding the Ctrl key.
3. Click on Area charts (continuous) from the Show Me toolbar.
4. Drag-and-drop Order Date into the Columns shelf next to YEAR(Order Date).
5. Expand YEAR(Order Date), seen on the right-hand side, by clicking on the plus sign.
6. Drag-and-drop Region from Dimensions into the Rows shelf to the left of SUM(Order Quantity).

How it works...

When we added Order Date for the first time, Tableau, by default, aggregated the date field by year; therefore, we added Order Date again to create an aggregation by quarter of the Order Date. We also added Region to create facets on the regions that show trends of order quantity over time.

There's more...

A blog post by Visually, an information graphics company, discusses the key differences between line charts and area charts. You can find this post at http://blog.visual.ly/line-vs-area-charts/.

Creating bullet graphs

Stephen Few, an information visualization consultant and author, designed this chart to solve some of the problems that the gauge and meter types of charts pose. Gauges, although simple to understand, take a lot of space to show only one measure. Bullet graphs are a combination of the bar graph and thermometer types of charts, and they show a measure of interest in the form of a bar graph (which is the bullet) and target variables.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a bullet graph:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Type and Market from Dimensions and Budget Sales and Sales from Measures.
3. Click on the bullet graphs icon on the Show Me toolbar.
4. Right-click on the x axis (the Budget Sales axis) and click on Swap Reference Line Fields.

How it works...

Although bullet graphs maximize the available space to show relevant information, readers require a detailed explanation of what all the components of the graphic are encoding. In this recipe, since we want to compare the budgeted sales with the actual sales, we had to swap the reference line from Sales to Budget Sales. The black bar on the graphic shows the budgeted sales and the blue bar shows the actual sales. The dark gray background color shows 60 percent of the actual sales and the lighter gray shows 80 percent of the actual sales. As we can see in this chart, the blue bars crossed all the black lines, which tells us that both coffee types and all market regions exceeded the budgeted sales.

There's more...

A blog post by Data Pig Technologies discusses some of the problems with the bullet graph. The main problem is intuitive understanding of this chart. You can read about this problem and the reply by Stephen Few at http://datapigtechnologies.com/blog/index.php/the-good-and-bad-of-bullet-graphs/.

Creating dual axes charts

Dual axes charts are useful to compare two similar types of measures that may have different measurement units, such as pounds and dollars. In this recipe, we will look at the dual axes chart.

Getting ready

Let's use the same sample file, Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a dual axes chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Date, Type, and Market from Dimensions and Sales and Budget Sales from Measures.
3. Click on the dual line graph icon on the Show Me toolbar.
4. Click-and-drag Market from the Rows shelf into the Columns shelf.
5. Right-click on the Sales vertical axis and click on Synchronize Axis.

How it works...

Tableau creates two vertical axes and automatically places Sales on one vertical axis and Budget Sales on the other. The scales on the two vertical axes are different, however. By synchronizing the axes, we get the same scale on both axes for better comparison and a more accurate representation of the patterns.

Creating Gantt charts

Gantt charts are most commonly used in project management, as these charts show various activities and tasks along with the time required to complete those tasks. Gantt charts are even more useful when they show dependencies among various tasks. This type of chart is very helpful when the number of activities is low (around 20-30); otherwise the chart becomes too big to be understood easily.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a Gantt chart:

1. Click on Analysis from the top menu toolbar, and if Aggregate Measures is checked, click on it again to uncheck that option.
2. Click on the Show Me button to bring the Show Me toolbar onto the screen.
3. While holding the Ctrl key, click on Order Date and Category (under Products) from Dimensions and Time to Ship from Measures.
4. Click on the Gantt chart icon on the Show Me toolbar.
5. Drag-and-drop Order Date into the Filters pane. Select Years from the Filter Field [Order Date] options dialog box and hit Next. Check 2012 from the list and hit OK.
6. Right-click on YEAR(Order Date) on the Columns shelf and select the Day May 8, 2011 option.
7. Drag-and-drop Order Date into the Filters pane. Select Months from the Filter Field [Order Date] options dialog box and hit Next. Check December from the list and hit OK.
8. Drag-and-drop Region from Dimensions into the Color Marks input box.
9. Drag-and-drop Region from Dimensions into the Rows shelf before Category.

How it works...

Representing time this way helps the reader to discern which activity took the longest amount of time. We added the Order Date field twice in the Filters pane to first filter for the year 2012 and then for the month of December. In this recipe, out of all the products shipped in December of 2012, we can easily see that the red bars for the West region in the Office Supplies category are longer, suggesting that these products took the longest amount of time to ship.

There's more...

Andy Kriebel, a Tableau data visualization expert, has a great example of Gantt charts using US presidential data. The following link shows the lengths of terms in office of Presidents from various parties: http://vizwiz.blogspot.com/2010/09/tableau-tip-creating-waterfall-chart.html

Creating heat maps

A heat map is a visual representation of numbers in a table or a grid such that the bigger numbers are encoded by darker colors or bigger sizes and the smaller numbers by lighter colors or smaller sizes. This type of representation makes pattern detection easier for the reader.

Getting ready

Let's use the same sample file, Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a heat map chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Sub-Category (under Products), Region, and Ship Mode from Dimensions and Profit from Measures.
3. Click on the heat maps chart icon on the Show Me toolbar.
4. Drag-and-drop Profit from Measures into the Color Marks box.

Summary

When we created the chart for the first time, Tableau assigned various sizes to the square boxes, but when we placed Profit as a color mark, red was used for low amounts of profit and green was used for higher amounts of profit. This made spotting patterns very easy. Binders and Binder Accessories, shipped by Regular Air in the Central region, generated very high amounts of profit, and Tables, shipped by Delivery Trucks in the East region, generated very low amounts of profit (it actually created losses for the company).

Resources for Article:

Further resources on this subject:
Constructing and Evaluating Your Design Solution [Article]
Basic use of Local Storage [Article]
Creating Interactive Graphics and Animation [Article]


Making Big Data Work for Hadoop and Solr

Packt
23 Aug 2013
8 min read
Understanding data processing workflows

Based on the data, the configuration, and the requirements, data can be processed at multiple levels while it is getting ready for search. Cascading and LucidWorks Big Data are a few such application platforms with which a complex data processing workflow can be rapidly developed on the Hadoop framework. In Cascading, the data is processed in different phases, with each phase containing a pipe responsible for carrying data units and applying a filter. The following diagram shows how incoming data can be processed in the pipeline-based workflow:

Once data has passed through the workflow, it can be persisted at the end in a repository, and later synced with the various nodes running in a distributed environment. The pipelining technique offers the following advantages:

The Apache Solr engine has less work to handle during index creation
Incremental indexing can be supported
By introducing an intermediate store, you can have regular data backups at the required stages
The data can be transferred to a different type of storage, such as HDFS, directly through multiple processing units
The data can be merged, joined, and processed as per the needs of different data sources

LucidWorks Big Data is a more powerful product which helps the user generate bulk indexes on Hadoop, classify and analyze the data, and provide distributed searching capabilities. Sharding is the process of breaking one index into multiple logical units, called shards, across multiple records. In the case of Solr, the results from the shards are aggregated and returned.

Big Data based technologies can be used with Apache Solr for various operations. Index creation itself can be made to run on a distributed system in order to speed up the overall index generation activity. Once that is done, the index can be distributed across the different nodes participating in Big Data, and Solr can be made to run in a distributed manner for searching the data. You can set up your Solr instance in the following different configurations:

Standalone machine

This configuration uses a single high-end server containing the indexes and Solr search; it is suitable for development and, in some cases, production.

Distributed setup

A distributed setup is suitable for large-scale indexes where the index is difficult to store on one system. In this case the index has to be distributed across multiple machines. Although the distributed configuration of Solr offers ample flexibility in terms of processing, it has its own limitations: a number of Apache Solr features, such as MoreLikeThis and Joins, are not supported. The following diagram depicts the distributed setup:

Replicated mode

In this mode, more than one Solr instance exists; among them, the master instance provides shared access to its slaves for replicating the indexes across multiple systems. The master continues to participate in index creation, search, and so on. Slaves sync up their storage through various replication techniques, such as the rsync utility. By default, Solr includes Java-based replication that uses the HTTP protocol for communication. This replication is recommended due to its benefits over other external replication techniques. This mode is not used anymore with the release of the Solr 4.x versions.

Sharded mode

This mode combines the best of both worlds and brings in the real value of a distributed system with high availability. In this configuration, the system has multiple masters, and each master holds multiple slaves to which the replication is carried out. A load balancer is used to spread the load equally across the nodes. The following diagram depicts the distributed and replicated setup:

If Apache Solr is deployed on a Hadoop-like framework, it falls into this category. Solr also provides SolrCloud for distributed Solr. We are going to look at the different approaches in the next sections.

Using the Solr-1045 patch – map-side indexing

The work on the Solr-1045 patch started with the goal of achieving index generation/building using an Apache MapReduce task. The Solr-1045 patch converts all the input records to a set of <key, value> pairs in each map task that runs on Hadoop. It then goes on to create a SolrInputDocument from each <key, value> pair, and later creates the Solr indexes. The following diagram depicts this process:

Reduce tasks can be used to perform deduplication of indexes and merge them together if required. Although index merging seems to be an interesting feature, it is actually a costly affair in terms of processing, and you will not find many implementations with merge-index functionality. Once the indexes are created, you can load them on your Solr instance and use them for searching.

You can download this particular patch from https://issues.apache.org/jira/browse/SOLR-1045 and patch your Solr instance. To apply a patch to your Solr instance, you need to first build your Solr instance from source. You can download the patch from Apache JIRA. Before running the patch, first do a dry run, which does not actually apply the patch. You can do it with the following commands:

cd <solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If it is successful, you can run the same command without the --dry-run option to apply the patch. Let's look at some of the important classes in the patch:

SolrIndexUpdateMapper: A Hadoop mapper responsible for creating indexes out of the <key, value> pairs of input.
SolrXMLDocRecordReader: This class is responsible for reading Solr input XML files.
SolrIndexUpdater: This class creates a MapReduce job configuration, runs the job to read the documents, and updates the Solr instance. Right now it is built using the Lucene index updater.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1045 patch:

Benefits:
It achieves complete parallelism by creating the index right in the map task.
Merging of indexes is possible in the reduce phase of MapReduce.

Drawbacks:
When the indexing is done at the map side, all the <key, value> pairs received by the reducer gain equal weight/importance, so it is difficult to use this patch with data that carries ranking/weight information.

Using the Solr-1301 patch – reduce-side indexing

This patch also focuses on using the Apache MapReduce framework for index creation. Keyword search can then happen over Apache Solr or Apache SolrCloud. Unlike Solr-1045, in this patch the indexes are created in the reduce phase of MapReduce. A map task is responsible for converting the input records to <key, value> pairs; these are then passed to the reducer, which converts them into SolrInputDocument objects and creates the indexes out of them. The index is then passed on as the output of the Hadoop MapReduce process. The following diagram depicts this process:

To use the Solr-1301 patch, you need to set up a Hadoop cluster. Once the index is created through the Hadoop patch, it should then be provisioned to the Solr server.
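To make the map phase of reduce-side indexing more concrete, the following is a minimal, generic Hadoop mapper of the kind the patch's CSVMapper performs, emitting a key and the raw record for the reduce phase to turn into documents. This is only a sketch: the class name, the choice of the first column as the key, and the plain comma split are illustrative assumptions, not code taken from the patch itself.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: emits the first CSV column as the key and the full
// record as the value, which a reducer can later turn into SolrInputDocuments.
public class CsvKeyValueMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        if (record.isEmpty()) {
            return; // skip blank lines
        }
        String[] fields = record.split(",");
        outKey.set(fields[0]);          // assume the first column is the document id
        context.write(outKey, line);    // pass the whole record to the reduce phase
    }
}

Keeping the mapper this thin is deliberate: all of the Solr-specific work (document conversion, batching, and committing) happens on the reduce side, which is exactly what preserves any per-document weighting.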
The patch contains a default converter for CSV files. Let's look at some of the important classes which are part of this patch:

CSVDocumentConverter: This class is responsible for converting the output of the map task, that is, key-value pairs, to SolrInputDocument; you can have multiple document converters.
CSVReducer: This is a reducer implementation for Hadoop reducers.
CSVIndexer: This is the main class to be called from your command line for creating indexes using MapReduce. You need to provide the input path for your data and the output path for storing shards.
SolrDocumentConverter: This class is used in your map task for converting your objects into Solr documents.
SolrRecordWriter: This class is an extension of mapreduce.RecordWriter; it breaks the data into multiple (key, value) pairs, which are then converted into a collection of SolrInputDocument(s), and this data is submitted to SolrEmbeddedServer in batches. Once completed, it commits the changes and runs the optimizer on the embedded server.
CSVMapper: This class parses CSV files and gets the key-value pairs out of them. This is a mapper class.
SolrOutputFormat: This class is responsible for converting key-value pairs and writing the data to the file/HDFS in zip/raw format.

Perform the following steps to run this patch:

1. Create a local folder with a configuration folder and a library folder: conf containing the Solr configuration (solr-config.xml, schema.xml), and lib containing the libraries.
2. Create your own converter class implementing SolrDocumentConverter; this will be used by SolrOutputFormat to convert output records to Solr documents. You may also override the OutputFormat class provided by Solr.
3. Write the Hadoop MapReduce job configuration:

SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
conf.setOutputFormat(SolrOutputFormat.class);
SolrDocumentConverter.setSolrDocumentConverter(<yourclassname>.class, conf);

4. Zip your configuration and load it into HDFS. The ZIP file name should be solr.zip (unless you change the patch code).
5. Now run the patch; each of the jobs will instantiate an EmbeddedSolrInstance which will in turn do the conversion, and finally the SolrOutputDocument(s) get stored in the output format.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1301 patch:

Benefits:
With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to prioritization during a search query.

Drawbacks:
Merging of indexes is not possible as it is with Solr-1045, because the indexes are created in the reduce phase.
The reducer becomes the crucial component of the system, due to the major tasks being performed there.

Summary

In this article, we have looked at the different possible approaches to making Big Data work with Apache Hadoop and Solr, along with the benefits and drawbacks of these approaches.

Resources for Article:

Further resources on this subject:
Advanced Hadoop MapReduce Administration [Article]
Apache Solr Configuration [Article]
Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article]

What is Hazelcast?

Packt
22 Aug 2013
10 min read
(For more resources related to this topic, see here.) Starting out as usual In most modern software systems, data is the key. For more traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience rather than performance. For example, here is what a traditional architecture might look like (which hopefully looks rather familiar) This presents us with an issue in terms of application scalability, in that it is relatively easy to scale our application layer by throwing more hardware at it to increase the processing capacity. But the monolithic constraints of our data layer would only allow us to do this so far before diminishing returns or resource saturation stunted further performance increases; so what can we do to address this? In the past and in legacy architectures, the only solution would be to increase the performance capability of our database infrastructure, potentially by buying a bigger, faster server or by further tweaking and fettling the utilization of currently available resources. Both options are dramatic, either in terms of financial cost and/or manpower; so what else could we do? Data deciding to hang around In order for us to gain a bit more performance out of our existing setup, we can hold copies of our data away from the primary database and use these in preference wherever possible. There are a number of different strategies we could adopt, from transparent second-level caching layers to external key-value object storage. The detail and exact use of each varies significantly depending on the technology or its place in the architecture, but the main desire of these systems is to sit alongside the primary database infrastructure and attempt to protect it from an excessive load. This would then tend to lead to an increased performance of the primary database by reducing the overall dependency on it. However, this strategy tends to be only particularly valuable as a short-term solution, effectively buying us a little more time before the database once again starts to reach saturation. The other downside is that it only protects our database from read-based load; if our application is predominately write-heavy, this strategy has very little to offer. So our expanded architecture could look a bit like the following figure: Therein lies the problem However, in insulating the database from the read load, we have introduced a problem in the form of a cache consistency issue, in that, how does our local data cache deal with changing data underneath it within the primary database? The answer is rather depressing: it can't! The exact manifestation of any issues will largely depend on the data needs of the application and how frequently the data changes; but typically, caching systems will operate in one of the two following modes to combat the problem: Time bound cache: Holds entries for a defined period (time-to-live or TTL) Write through cache: Holds entries until they are invalidated by subsequent updates Time bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than the cache expiry time of it, the cache is providing no real benefit. 
Write through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write through caches exist within the overall architecture, there will be consistency issues between them. We can avoid this by having a more intelligent cache that features a communication mechanism between nodes, so that entry invalidations can be propagated to the other nodes. In practice, an ideal cache would combine both features, so that entries are held for a known maximum time, but invalidations are also passed around as changes are made. So our evolved architecture would look a bit like the following figure:

So far we have looked through the general issues in scaling our data layer and introduced strategies to help combat the trade-offs we will encounter along the way; however, the real world isn't quite as simple. There are various cache servers and in-memory database products in this area; however, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues we experienced with just our primary database: we could encounter resource saturation or capacity issues if the product is a single instance, or, if the distribution doesn't provide consistency control, inconsistent data, which might harm our application.

Breaking the mould

Hazelcast is a radical new approach to data, designed from the ground up around distribution. It embraces a new scalable way of thinking, in that data should be shared around for both resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate.

The first major feature to understand about Hazelcast is its masterless nature; each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically delegating which node is responsible for which data. As new nodes join or drop out, the process is repeated and the cluster rebalances accordingly. This makes Hazelcast incredibly simple to get up and running, as the system is self-discovering, self-clustering, and works straight out of the box.

However, the second feature to remember is that we are persisting data entirely in-memory; this makes it incredibly fast, but the speed comes at a price. When a node is shut down, all the data that was held on it is lost. We combat this risk to resilience through replication, by holding enough copies of each piece of data across multiple nodes so that, in the event of failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. But don't pull the plug on more than one node at a time until the cluster has reacted to the change in membership and re-established the appropriate number of backup copies of data. So when we introduce our new masterless distributed cluster, we get something like the following figure:

We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node owns a number of partitions of the overall data, so the load is fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than at any individual node, and we can address that simply by adding more nodes.
In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used, so we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is, is an internal operational detail and can vary over time; we as developers never actually need to know. If we imagine that our data is split into a number of partitions, with each partition slice owned by one node and backed up on another, we could visualize the interactions like the following figure:

This means that for data belonging to Partition 1, our application will have to communicate with Node 1, with Node 2 for data belonging to Partition 2, and so on. The slicing of the data into each partition is dynamic; in practice, where there are more partitions than nodes, each node will own a number of different partitions and hold backups for others. As we have mentioned before, all of this is an internal operational detail, and our application does not need to know it, but it is important that we understand what is going on behind the scenes.

Moving to new ground

So far we have been talking mostly about simple persisted data and caches, but in reality we should not think of Hazelcast as purely a cache; it is much more powerful than that. It is an in-memory data grid that supports a number of distributed collections and features. We can load data from various sources into differing structures, send messages across the cluster, take out locks to guard against concurrent activity, and listen to the goings-on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection, or function in a manner comparable to other similar technologies, but all with the distribution and resilience capabilities already built in.

Standard utility collections:
Map: key-value pairs
List: a collection of objects
Set: a non-duplicated collection
Queue: an offer/poll FIFO collection

Specialized collections:
Multi-Map: a key to list-of-values collection
Lock: a cluster-wide mutex
Topic: publish/subscribe messaging

Concurrency utilities:
AtomicNumber: a cluster-wide atomic counter
IdGenerator: cluster-wide unique identifier generation
Semaphore: concurrency limitation
CountdownLatch: concurrent activity gate-keeping

Listeners: application notifications as things happen

In addition to data storage collections, Hazelcast also features a distributed executor service, allowing runnable tasks to be created that can be run anywhere on the cluster to obtain, manipulate, and store results. We could have a number of collections containing source data, then spin up a number of tasks to process the disparate data (for example, averaging or aggregating) and output the results into another collection for consumption. Again, just as we can scale up our data capacity by adding more nodes, we can increase the execution capacity in exactly the same way.

This essentially means that by building our data layer around Hazelcast, if our application's needs rapidly increase, we can keep adding nodes to satisfy seemingly extensive demands, all without having to redesign or re-architect the actual application. With Hazelcast, we are dealing more with a technology than a server product: a library to build a system around rather than something to retrospectively bolt on or blindly connect to as an off-the-shelf commercial system.
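Since the code examples elsewhere in this collection are in Python, here is a minimal sketch of talking to an already-running cluster from outside it using the hazelcast-python-client package; the cluster address, map name, and keys are illustrative, and the exact client configuration options depend on the client version you have installed:

import hazelcast

# Connect to an already-running cluster; the address is illustrative
client = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])

# A distributed map: the cluster decides which node owns each key's partition,
# and backup copies are maintained internally for resilience
capitals = client.get_map("capitals").blocking()
capitals.put("GB", "London")
capitals.put("FR", "Paris")

print(capitals.get("GB"))   # served by whichever node owns this key's partition
print(capitals.size())      # counted across the whole cluster

client.shutdown()

The same kinds of map, queue, and topic operations are available when Hazelcast nodes are embedded directly inside the application's own JVM processes, which is the mode discussed next.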
While it is possible (and in some simple cases quite practical) to run Hazelcast as a separate, server-like cluster and connect to it remotely from our application, some of the greatest benefits come when we develop our own classes and tasks to run within it and alongside it.

With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help solve. We can use the technology in many ways: in isolation to hold data such as user sessions, alongside a more long-term persistent data store to increase capacity, or as a platform for high-performance, scalable operations on our data. By moving more and more responsibility away from monolithic systems to such a generic, scalable one, there is no limit to the performance we can unlock. This allows us to keep our application and data layers separate while scaling them up independently as our application grows, avoiding our application becoming a victim of its own success while it hopefully takes the world by storm.

Summary

In this article, we learned about Hazelcast: a masterless, in-memory data grid whose large range of generic, distributed capabilities can help solve a whole world of data scalability problems.
Learning How to Classify with Real-world Examples

The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.

The following four attributes of each plant were measured:

Sepal length
Sepal width
Petal length
Petal width

In general, we will call any measurement from our data a feature. Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements?

This is the supervised learning or classification problem: given labeled examples, we can design a rule that will eventually be applied to other examples. This is the same setting that is used for spam classification; given the examples of spam and ham (non-spam e-mail) that the user gave the system, can we determine whether a new, incoming message is spam or not?

For the moment, the Iris dataset serves our purposes well. It is small (150 examples, 4 features each) and can easily be visualized and manipulated.

The first step is visualization

Because this dataset is so small, we can easily plot all of the points and all two-dimensional projections on a page. We will thus build intuitions that can then be extended to datasets with many more dimensions and datapoints. Each subplot in the following screenshot shows all the points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circles) and Iris Virginica are indicated with "x" marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.

We are using Matplotlib, the most well-known plotting package for Python. We present the code to generate the top-left plot; the code for the other plots is similar:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
# Map the numeric targets to species names; these string labels are used below
labels = data['target_names'][target]

for t, marker, c in zip(xrange(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker,
                c=c)

Building our first classification model

If the goal is to separate the three types of flower, we can immediately make a few suggestions. For example, the petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cutoff is, as follows:

plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))

This prints 1.9 and 3.0. Therefore, we can build a simple model: if the petal length is smaller than two, this is an Iris Setosa flower; otherwise, it is either Iris Virginica or Iris Versicolor.
if example[2] < 2:
    # example holds the four measurements of a single flower
    print('Iris Setosa')
else:
    print('Iris Virginica or Iris Versicolour')

This is our first model, and it works very well in that it separates the Iris Setosa flowers from the other two species without making any mistakes. What we had here was a simple structure: a simple threshold on one of the dimensions. Then we searched for the best dimension threshold. We performed this visually and with some calculation; machine learning happens when we write code to perform this for us.

The example where we distinguished Iris Setosa from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation. We can, however, try to do it the best possible way. For this, we will perform a little computation. We first select only the non-Setosa features and labels:

features = features[~is_setosa]
labels = labels[~is_setosa]
virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on the arrays. is_setosa is a Boolean array, and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, virginica, using an equality comparison on labels.

Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly:

best_acc = -1.0
for fi in xrange(features.shape[1]):
    # We are going to generate all possible thresholds for this feature
    thresh = features[:, fi].copy()
    thresh.sort()
    # Now test all thresholds:
    for t in thresh:
        pred = (features[:, fi] > t)
        acc = (pred == virginica).mean()
        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t

The last few lines select the best model. First we compare the predictions, pred, with the actual labels, virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all possible thresholds for all possible features have been tested, and the best_fi and best_t variables hold our model. To apply it to a new example, we perform the following:

if example[best_fi] > best_t:
    print('virginica')
else:
    print('versicolor')

What does this model look like? If we run it on the whole data, the best model that we get is split on the petal length. We can visualize the decision boundary. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Anything that falls in the white region will be called Iris Virginica and anything that falls on the shaded side will be classified as Iris Versicolor:

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where the points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold that will achieve exactly the same accuracy. Our method chose the first threshold, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the preceding section is a simple model; it achieves 94 percent accuracy on its training data. However, this evaluation may be overly optimistic. We used the data to define what the threshold would be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we have tried on this dataset.
The logic is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm has not seen at training time. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two blocks: on one block we'll train the model, and on the other (the one we held out of training) we'll test it. The output is as follows:

Training error was 96.0%.
Testing error was 90.0% (N = 50).

The result on the testing data is lower than that on the training data. This may surprise an inexperienced machine learner, but it is expected and typical. To see why, look back at the plot that showed the decision boundary. Consider what would happen if some of the examples close to the boundary were not there, or if one of the points between the two lines was missing: it is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the "wrong" side of the border.

The error on the training data is called the training error and is always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing error: the error on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the two errors is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold data out of training, is that we only used part of the data (in this case, we used half of it) for training. On the other hand, if we use too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well. We can achieve something quite similar by cross-validation.

One extreme (but sometimes useful) form of cross-validation is leave-one-out. We take an example out of the training data, learn a model without this example, and then see if the model classifies this example correctly:

error = 0.0
for ei in range(len(features)):
    # select all but the one at position 'ei':
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    # learn_model and apply_model wrap the threshold search and
    # prediction steps shown earlier
    model = learn_model(features[training], virginica[training])
    predictions = apply_model(features[testing], virginica[testing], model)
    error += np.sum(predictions != virginica[testing])

At the end of this loop, we will have tested a series of models on all the examples. However, there is no circularity problem, because each example was tested on a model that was built without taking that example into account. Therefore, the overall estimate is a reliable estimate of how well the models would generalize.

The major problem with leave-one-out cross-validation is that we are now being forced to perform 100 times more work. In fact, we must learn a whole new model for each and every example, and this cost will grow as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation; here, "x" stands for a small number, say, five. In order to perform five-fold cross-validation, we break up the data into five groups, that is, five folds. Then we learn five models, leaving one fold out of each.
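Here is a minimal, self-contained sketch of such a five-fold loop for the threshold model, assuming features and virginica are the non-Setosa arrays built earlier; fit_threshold and apply_threshold are illustrative names that simply package the search and prediction code shown above, not functions from any library (note also that this plain shuffle does not guarantee class-balanced folds, a caveat discussed below):

import numpy as np

def fit_threshold(features, labels):
    # Exhaustive search over every feature and candidate threshold, as shown earlier
    best_acc = -1.0
    best_fi, best_t = 0, 0.0
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > best_acc:
                best_acc, best_fi, best_t = acc, fi, t
    return best_fi, best_t

def apply_threshold(model, features):
    best_fi, best_t = model
    return features[:, best_fi] > best_t

folds = 5
indices = np.random.permutation(len(features))   # shuffle before splitting into folds
accuracies = []
for fold in np.array_split(indices, folds):
    testing = np.zeros(len(features), bool)
    testing[fold] = True
    training = ~testing
    model = fit_threshold(features[training], virginica[training])
    predictions = apply_threshold(model, features[testing])
    accuracies.append((predictions == virginica[testing]).mean())

print('Mean cross-validated accuracy: {0:.1%}'.format(np.mean(accuracies)))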
As the preceding sketch shows, the resulting code is similar to the leave-one-out code given earlier in this section, but here we leave 20 percent of the data out instead of just one element; we test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks: the dataset is split into five pieces, and for each fold, you hold out one of the blocks for testing and train on the other four.

You can use any number of folds you wish. Five or ten folds are typical; this corresponds to training with 80 or 90 percent of your data and should already be close to what you would get from using all the data. In the extreme case, if you have as many folds as datapoints, you simply perform leave-one-out cross-validation. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, the results will not be representative. We will not go into the details of how to do this because the machine learning packages will handle it for you.

We have now generated several models instead of just one. So, what final model do we return and use for new data? The simplest solution is to use a single overall model trained on all your training data; the cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well, and at the end of the cross-validation loop, you can use all your data to train a final model.

Although it was not properly recognized when machine learning was starting out, nowadays it is seen as a very bad sign to even discuss the training error of a classification system, because the results can be very misleading. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation schedule.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this article, you will see many other types of models, and we're not even going to cover everything that is out there. What makes up a classification model? We can break it up into three parts:

The structure of the model: Here, we use a threshold on a single feature.
The search procedure: Here, we try every possible combination of feature and threshold.
The loss function: Using the loss function, we decide which of the possibilities is less bad (because we can rarely talk about the perfect solution). We can use the training error, or define the point the other way around and say that we want the best accuracy. Traditionally, people want the loss function to be minimized.

We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but only test three values for each feature: the mean value of the feature, the mean plus one standard deviation, and the mean minus one standard deviation. This could make sense if testing each value was very costly in terms of computer time (or if we had millions and millions of datapoints); then the exhaustive search we used would be infeasible, and we would have to perform an approximation like this.

Alternatively, we might have different loss functions. It might be that one type of error is much more costly than another. In a medical setting, false negatives and false positives are not equivalent.
A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have the disease) might lead to additional tests for confirmation purposes or unnecessary treatment (which can still have costs, including side effects from the treatment). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and treatment is cheap with very few negative side effects, you want to minimize the false negatives as much as you can. With spam filtering, we may face the same problem: incorrectly deleting a non-spam e-mail can be very dangerous for the user, while letting a spam e-mail through is just a minor annoyance.

What the cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes (achieving the highest accuracy). However, if some mistakes are more costly than others, it might be better to accept a lower overall accuracy to minimize overall costs.

Finally, we can also have other classification structures. A simple threshold rule is very limiting and will only work in the very simplest cases, such as with the Iris dataset.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We will now look at another agricultural dataset; it is still small, but now too big to plot exhaustively on a page as we did with Iris. This is a dataset of measurements of wheat seeds. Seven features are present, as follows:

Area (A)
Perimeter (P)
Compactness (C = 4πA/P²)
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove

There are three classes that correspond to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset, and its features were automatically computed from digital images.

This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. Later, we will work through the computer vision side of this problem and compute features from images. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and Seeds datasets used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/

Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. This is a general area normally termed feature engineering; it is sometimes seen as less glamorous than algorithms, but it may matter more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).
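As a concrete sketch of deriving such a combined feature, compactness can be computed directly from the area and perimeter columns; the column positions below assume the feature order listed above and that the Seeds measurements have already been loaded into a NumPy array called features, so they are illustrative rather than taken from the article's code:

import numpy as np

# features is assumed to hold the Seeds measurements, one row per kernel,
# with area in column 0 and perimeter in column 1 (the order listed above)
area = features[:, 0]
perimeter = features[:, 1]

# Compactness C = 4*pi*A / P**2: close to 1 for round kernels, smaller for elongated ones
compactness = 4 * np.pi * area / perimeter ** 2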
In this case, the original researchers computed the "compactness", which is a typical feature for shapes (also called "roundness"). This feature will have the same value for two kernels, one of which is twice as big as the other, as long as they have the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) as compared to kernels that are elongated (when the feature is close to zero).

The goals of a good feature are to simultaneously vary with what matters and be invariant to what does not. For example, compactness does not vary with size, but it varies with shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal.

You will need to use background knowledge to intuit which will be good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. Often though, you can use your knowledge of the specific problem to design a specific feature. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier.

A natural question is whether or not we can select good features automatically. This problem is known as feature selection. Many methods have been proposed for this problem, but in practice, very simple ideas work best. It does not make sense to use feature selection in these small problems, but if you had thousands of features, throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

With this dataset, even if we just try to separate two classes using the previous method, we do not get very good results. Let me introduce, therefore, a new classifier: the nearest neighbor classifier. If we consider that each example is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between examples. We can choose different ways of computing the distance, for example:

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1) ** 2)

Now when classifying, we adopt a simple rule: given a new example, we look in the dataset for the point that is closest to it (its nearest neighbor) and look at its label:

def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example) for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]

In this case, our model involves saving all of the training data and labels and computing everything at classification time. A better implementation would be to index the training data at learning time to speed up classification, but that requires a more complex algorithm. Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol. Using ten folds of cross-validation for this dataset with this algorithm, we obtain 88 percent accuracy.
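As a quick usage sketch, a leave-one-out check of nn_classify (a simple stand-in for the ten-fold protocol mentioned above) could look like the following, assuming features and labels hold the Seeds measurements and variety names as NumPy arrays:

import numpy as np

correct = 0.0
for ei in range(len(features)):
    # Train on everything except example 'ei', then classify that example
    training = np.ones(len(features), bool)
    training[ei] = False
    prediction = nn_classify(features[training], labels[training], features[ei])
    correct += (prediction == labels[ei])

print('Leave-one-out accuracy: {0:.1%}'.format(correct / len(features)))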
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. We will now examine the decision boundary. For this, we will be forced to simplify and look at only two dimensions (just so that we can plot it on paper).

In the preceding screenshot, the Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance according to the preceding function, we are, for the most part, only taking the x axis into account.

If you have a physics background, you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this simple pair of operations:

# subtract the mean for each feature:
features -= features.mean(axis=0)
# divide each feature by its standard deviation
features /= features.std(axis=0)

Independent of what the original values were, after Z-scoring, a value of zero is the mean, positive values are above the mean, and negative values are below it. Now every feature is on the same scale (technically, every feature is now dimensionless; it has no units) and we can mix dimensions more confidently. In fact, if we now run our nearest neighbor classifier, we obtain 94 percent accuracy! Look at the decision space again in two dimensions; it looks as shown in the following screenshot:

The boundaries are now much more complex and there is interaction between the two dimensions. In the full dataset, everything is happening in a seven-dimensional space that is very hard to visualize, but the same principle applies: where before a few dimensions were dominant, now they are all given the same importance.

The nearest neighbor classifier is simple, but sometimes good enough. We can generalize it to a k-nearest neighbor classifier by considering not just the closest point but the k closest points. All k neighbors vote to select the label. k is typically a small number, such as 5, but can be larger, particularly if the dataset is very large.

Binary and multiclass classification

The first classifier we saw, the threshold classifier, was a simple binary classifier (the result is either one class or the other, as a point is either above the threshold or it is not). The second classifier we used, the nearest neighbor classifier, is naturally a multiclass classifier (the output can be one of several classes). It is often simpler to define a simple binary method than one that works on multiclass problems. However, we can reduce the multiclass problem to a series of binary decisions. This is what we did earlier with the Iris dataset, in a haphazard way; we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

Is it an Iris Setosa (yes or no)?
If no, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of "one classifier versus the rest of the classifiers". For each possible label l, we build a classifier of the type "is this l or something else?". When applying the rule, exactly one of the classifiers would say "yes" and we would have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.

Alternatively, we can build a classification tree. Split the possible labels into two and build a classifier that asks, "should this example go to the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.

There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases, and which one you use normally does not make much of a difference to the final result.

Most classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem.

Summary

In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with a classic dataset. This, by now, is considered a very small problem. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples, but the intuitions we gained here will still be valid.

Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning. We also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that was not used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).

We also had a look at the problem of feature engineering. Features are not something that is predefined for you; choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvement in accuracy, because better data beats fancier methods. In this article, we wrote all of our own code (except when we used NumPy, of course); we needed to build up intuitions on simple cases to illustrate the basic concepts.