Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-understanding-point-time-recovery

04 Sep 2013

28 min read

Understanding Point-In-Time-Recovery

04 Sep 2013

0
0
2323

article-image-integrating-storm-and-hadoop

Packt

04 Sep 2013

17 min read

Integrating Storm and Hadoop

Packt

04 Sep 2013

17 min read

(For more resources related to this topic, see here.) In this article, we will implement the Batch and Service layers to complete the architecture. There are some key concepts underlying this big data architecture: Immutable state Abstraction and composition Constrain complexity Immutable state is the key, in that it provides true fault-tolerance for the architecture. If a failure is experienced at any level, we can always rebuild the data from the original immutable data. This is in contrast to many existing data systems, where the paradigm is to act on mutable data. This approach may seem simple and logical; however, it exposes the system to a particular kind of risk in which the state is lost or corrupted. It also constrains the system, in that you can only work with the current view of the data; it isn't possible to derive new views of the data. When the architecture is based on a fundamentally immutable state, it becomes both flexible and fault-tolerant. Abstractions allow us to remove complexity in some cases, and in others they can introduce complexity. It is important to achieve an appropriate set of abstractions that increase our productivity and remove complexity, but at an appropriate cost. It must be noted that all abstractions leak, meaning that when failures occur at a lower abstraction, they will affect the higher-level abstractions. It is therefore often important to be able to make changes within the various layers and understand more than one layer of abstraction. The designs we choose to implement our abstractions must therefore not prevent us from reasoning about or working at the lower levels of abstraction when required. Open source projects are often good at this, because of the obvious access to the code of the lower level abstractions, but even with source code available, it is easy to convolute the abstraction to the extent that it becomes a risk. In a big data solution, we have to work at higher levels of abstraction in order to be productive and deal with the massive complexity, so we need to choose our abstractions carefully. In the case of Storm, Trident represents an appropriate abstraction for dealing with the data-processing complexity, but the lower level Storm API on which Trident is based isn't hidden from us. We are therefore able to easily reason about Trident based on an understanding of lower-level abstractions within Storm. Another key issue to consider when dealing with complexity and productivity is composition. Composition within a given layer of abstraction allows us to quickly build out a solution that is well tested and easy to reason about. Composition is fundamentally decoupled, while abstraction contains some inherent coupling to the lower-level abstractions—something that we need to be aware of. Finally, a big data solution needs to constrain complexity. Complexity always equates to risk and cost in the long run, both from a development perspective and from an operational perspective. Real-time solutions will always be more complex than batch-based systems; they also lack some of the qualities we require in terms of performance. Nathan Marz's Lambda architecture attempts to address this by combining the qualities of each type of system to constrain complexity and deliver a truly fault-tolerant architecture. We divided this flow into preprocessing and "at time" phases, using streams and DRPC streams respectively. We also introduced time windows that allowed us to segment the preprocessed data. In this article, we complete the entire architecture by implementing the Batch and Service layers. The Service layer is simply a store of a view of the data. In this case, we will store this view in Cassandra, as it is a convenient place to access the state alongside Trident's state. The preprocessed view is identical to the preprocessed view created by Trident, counted elements of the TF-IDF formula (D, DF, and TF), but in the batch case, the dataset is much larger, as it includes the entire history. The Batch layer is implemented in Hadoop using MapReduce to calculate the preprocessed view of the data. MapReduce is extremely powerful, but like the lower-level Storm API, is potentially too low-level for the problem at hand for the following reasons: We need to describe the problem as a data pipeline; MapReduce isn't congruent with such a way of thinking Productivity We would like to think of a data pipeline in terms of streams of data, tuples within the stream and predicates acting on those tuples. This allows us to easily describe a solution to a data processing problem, but it also promotes composability, in that predicates are fundamentally composable, but pipelines themselves can also be composed to form larger, more complex pipelines. Cascading provides such an abstraction for MapReduce in the same way as Trident does for Storm. With these tools, approaches, and considerations in place, we can now complete our real-time big data architecture. There are a number of elements, that we will update, and a number of elements that we will add. The following figure illustrates the final architecture, where the elements in light grey will be updated from the existing recipe, and the elements in dark grey will be added in this article: Implementing TF-IDF in Hadoop TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand. Getting ready Before you begin, please ensure that you have installed Hadoop by following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. How to do it… Start by creating the project using the lein command: lein new tfidf-cascalog Next, you need to edit the project.clj file to include the dependencies: (defproject tfidf-cascalog "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.1"] [org.apache.cassandra/cassandra-all "1.1.5"] [clojurewerkz/cassaforte "1.0.0-beta11-SNAPSHOT"] [quintona/cascading-cassandra "0.0.7-SNAPSHOT"] [clj-time "0.5.0"] [cascading.avro/avro-scheme "2.2-SNAPSHOT"] [cascalog-more-taps "0.3.0"] [org.apache.httpcomponents/httpclient "4.2.3"]] :profiles{:dev{:dependencies[[org.apache.hadoop/hadoop-core "0.20.2-dev"] [lein-midje "3.0.1"] [cascalog/midje-cascalog "1.10.1"]]}}) It is always a good idea to validate your dependencies; to do this, execute lein deps and review any errors. In this particular case, cascading-cassandra has not been deployed to clojars, and so you will receive an error message. Simply download the source from https://github.com/quintona/cascading-cassandra and install it into your local repository using Maven. It is also good practice to understand your dependency tree. This is important to not only prevent duplicate classpath issues, but also to understand what licenses you are subject to. To do this, simply run lein pom, followed by mvn dependency:tree. You can then review the tree for conflicts. In this particular case, you will notice that there are two conflicting versions of Avro. You can fix this by adding the appropriate exclusions: [org.apache.cassandra/cassandra-all "1.1.5" :exclusions [org.apache.cassandra.deps/avro]] We then need to create the Clojure-based Cascade queries that will process the document data. We first need to create the query that will create the "D" view of the data; that is, the D portion of the TF-IDF function. This is achieved by defining a Cascalog function that will output a key and a value, which is composed of a set of predicates: (defn D [src] (let [src (select-fields src ["?doc-id"])] (<- [?key ?d-str] (src ?doc-id) (c/distinct-count ?doc-id :> ?n-docs) (str "twitter" :> ?key) (str ?n-docs :> ?d-str)))) You can define this and any of the following functions in the REPL, or add them to core.clj in your project. If you want to use the REPL, simply use lein repl from within the project folder. The required namespace (the use statement), require, and import definitions can be found in the source code bundle. We then need to add similar functions to calculate the TF and DF values: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) (defn TF [src] (<- [?key ?tf-count-str] (src ?doc-id ?time ?tf-word) (c/count ?tf-count) (str ?doc-id ?tf-word :> ?key) (str ?tf-count :> ?tf-count-str))) This Batch layer is only interested in calculating views for all the data leading up to, but not including, the current hour. This is because the data for the current hour will be provided by Trident when it merges this batch view with the view it has calculated. In order to achieve this, we need to filter out all the records that are within the current hour. The following function makes that possible: (deffilterop timing-correct? [doc-time] (let [now (local-now) interval (in-minutes (interval (from-long doc-time) now))] (if (< interval 60) false true)) Each of the preceding query definitions require a clean stream of words. The text contained in the source documents isn't clean. It still contains stop words. In order to filter these and emit a clean set of words for these queries, we can compose a function that splits the text into words and filters them based on a list of stop words and the time function defined previously: (defn etl-docs-gen [rain stop] (<- [?doc-id ?time ?word] (rain ?doc-id ?time ?line) (split ?line :> ?word-dirty) ((c/comp s/trim s/lower-case) ?word-dirty :> ?word) (stop ?word :> false) (timing-correct? ?time))) We will be storing the outputs from our queries to Cassandra, which requires us to define a set of taps for these views: (defn create-tap [rowkey cassandra-ip] (let [keyspace storm_keyspace column-family "tfidfbatch" scheme (CassandraScheme. cassandra-ip "9160" keyspace column-family rowkey {"cassandra.inputPartitioner""org.apache.cassandra.dht.RandomPartitioner" "cassandra.outputPartitioner" "org.apache.cassandra.dht.RandomPartitioner"}) tap (CassandraTap. scheme)] tap)) (defn create-d-tap [cassandra-ip] (create-tap "d"cassandra-ip)) (defn create-df-tap [cassandra-ip] (create-tap "df" cassandra-ip)) (defn create-tf-tap [cassandra-ip] (create-tap "tf" cassandra-ip)) The way this schema is created means that it will use a static row key and persist name-value pairs from the tuples as column:value within that row. This is congruent with the approach used by the Trident Cassandra adaptor. This is a convenient approach, as it will make our lives easier later. We can complete the implementation by a providing a function that ties everything together and executes the queries: (defn execute [in stop cassandra-ip] (cc/connect! cassandra-ip) (sch/set-keyspace storm_keyspace) (let [input (tap/hfs-tap (AvroScheme. (load-schema)) in) stop (hfs-delimited stop :skip-header? true) src (etl-docs-gen input stop)] (?- (create-d-tap cassandra-ip) (D src)) (?- (create-df-tap cassandra-ip) (DF src)) (?- (create-tf-tap cassandra-ip) (TF src)))) Next, we need to get some data to test with. I have created some test data, which is available at https://bitbucket.org/qanderson/tfidf-cascalog. Simply download the project and copy the contents of src/data to the data folder in your project structure. We can now test this entire implementation. To do this, we need to insert the data into Hadoop: hadoop fs -copyFromLocal ./data/document.avro data/document.avro hadoop fs -copyFromLocal ./data/en.stop data/en.stop Then launch the execution from the REPL: => (execute "data/document" "data/en.stop" "127.0.0.1") How it works… There are many excellent guides on the Cascalog wiki (https://github.com/nathanmarz/cascalog/wiki), but for completeness's sake, the nature of a Cascalog query will be explained here. Before that, however, a revision of Cascading pipelines is required. The following is quoted from the Cascading documentation (http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/): Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams. This concept is embodied in Cascalog through the definition of queries. A query takes a set of inputs and applies a list of predicates across the fields in each tuple of the input stream. Queries are composed through the application of many predicates. Queries can also be composed to form larger, more complex queries. In either event, these queries are reduced down into a Cascading pipeline. Cascalog therefore provides an extremely terse and powerful abstraction on top of Cascading; moreover, it enables an excellent development workflow through the REPL. Queries can be easily composed and executed against smaller representative datasets within the REPL, providing the idiomatic API and development workflow that makes Clojure beautiful. If we unpack the query we defined for TF, we will find the following code: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) The <- macro defines a query, but does not execute it. The initial vector, [?key ?df-count-str], defines the output fields, which is followed by a list of predicate functions. Each predicate can be one of the following three types: Generators: A source of data where the underlying source is either a tap or another query. Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter. Operations typically act within the scope of a single tuple. Aggregators: Functions that act across tuples to create aggregate representations of data. For example, count and sum. The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered as input variables for operations and output variables for generators and aggregators. The (src ?doc-id ?time ?df-word) predicate function names the first three values within the input tuple, whose names are applicable within the query scope. Therefore, if the tuple ("doc1" 123324 "This") arrives in this query, the variables would effectively bind as follows: ?doc-id: "doc1" ?time: 123324 ?df-word: "This" Each predicate within the scope of the query can use any bound value or add new bound variables to the scope of the query. The final set of bound values that are emitted is defined by the output vector. We defined three queries, each calculating a portion of the value required for the TF-IDF algorithm. These are fed from two single taps, which are files stored in the Hadoop filesystem. The document file is stored using Apache Avro, which provides a high-performance and dynamic serialization layer. Avro takes a record definition and enables serialization/deserialization based on it. The record structure, in this case, is for a document and is defined as follows: {"namespace": "storm.cookbook", "type": "record", "name": "Document", "fields": [ {"name": "docid", "type": "string"}, {"name": "time", "type": "long"}, {"name": "line", "type": "string"} ] } Both the stop words and documents are fed through an ETL function that emits a clean set of words that have been filtered. The words are derived by splitting the line field using a regular expression: (defmapcatop split [line] (s/split line #"[[](),.)s]+")) The ETL function is also a query, which serves as a source for our downstream queries, and defines the [?doc-id ?time ?word] output fields. The output tap, or sink, is based on the Cassandra scheme. A query defines predicate logic, not the source and destination of data. The sink ensures that the outputs of our queries are sent to Cassandra. The ?- macro executes a query, and it is only at execution time that a query is bound to its source and destination, again allowing for extreme levels of composition. The following, therefore, executes the TF query and outputs to Cassandra: (?- (create-tf-tap cassandra-ip) (TF src)) There's more… The Avro test data was created using the test data from the Cascading tutorial at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/. Within this tutorial is the rain.txt tab-separated data file. A new column was created called time that holds the Unix epoc time in milliseconds. The updated text file was then processed using some basic Java code that leverages Avro: Schema schema = Schema.parse(SandboxMain.class.getResourceAsStream("/document.avsc")); File file = new File("document.avro"); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); dataFileWriter.create(schema, file); BufferedReader reader = new BufferedReader(new InputStreamReader(SandboxMain.class.getResourceAsStream("/rain.txt"))); String line = null; try { while ((line = reader.readLine()) != null) { String[] tokens = line.split("t"); GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tokens[0]); docEntry.put("time", Long.parseLong(tokens[1])); docEntry.put("line", tokens[2]); dataFileWriter.append(docEntry); } } catch (IOException e) { e.printStackTrace(); } dataFileWriter.close(); Persisting documents from Storm In the previous recipe, we looked at deriving precomputed views of our data taking some immutable data as the source. In that recipe, we used statically created data. In an operational system, we need Storm to store the immutable data into Hadoop so that it can be used in any preprocessing that is required. How to do it… As each tuple is processed in Storm, we must generate an Avro record based on the document record definition and append it to the data file within the Hadoop filesystem. We must create a Trident function that takes each document tuple and stores the associated Avro record. Within the tfidf-topology project created in, inside the storm.cookbook.tfidf.function package, create a new class named PersistDocumentFunction that extends BaseFunction. Within the prepare function, initialize the Avro schema and document writer: public void prepare(Map conf, TridentOperationContext context) { try { String path = (String) conf.get("DOCUMENT_PATH"); schema = Schema.parse(PersistDocumentFunction.class .getResourceAsStream("/document.avsc")); File file = new File(path); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); if(file.exists()) dataFileWriter.appendTo(file); else dataFileWriter.create(schema, file); } catch (IOException e) { throw new RuntimeException(e); } } As each tuple is received, coerce it into an Avro record and add it to the file: public void execute(TridentTuple tuple, TridentCollector collector) { GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tuple.getStringByField("documentId")); docEntry.put("time", Time.currentTimeMillis()); docEntry.put("line", tuple.getStringByField("document")); try { dataFileWriter.append(docEntry); dataFileWriter.flush(); } catch (IOException e) { LOG.error("Error writing to document record: " + e); throw new RuntimeException(e); } } Next, edit the TermTopology.build topology and add the function to the document stream: documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); Finally, include the document path into the topology configuration: conf.put("DOCUMENT_PATH", "document.avro"); How it works… There are various logical streams within the topology, and certainly the input for the topology is not in the appropriate state for the recipes in this article containing only URLs. We therefore need to select the correct stream from which to consume tuples, coerce these into Avro records, and serialize them into a file. The previous recipe will then periodically consume this file. Within the context of the topology definition, include the following code: Stream documentStream = getUrlStream(topology, spout) .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source")); documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); The function should consume tuples from the document stream whose tuples are populated with already fetched documents.

0
0
1293

article-image-sql-server-integration-services-ssis

Packt

03 Sep 2013

5 min read

SQL Server Integration Services (SSIS)

Packt

03 Sep 2013

5 min read

0
0
1827

Packt

03 Sep 2013

22 min read

Working with Time

Packt

03 Sep 2013

22 min read

0
0
2656

Packt

03 Sep 2013

20 min read

Oracle APEX 4.2 reporting

Packt

03 Sep 2013

20 min read

(For more resources related to this topic, see here.) The objective of the first chapter is to quickly introduce you to the technology and then dive deep into the understanding the fabric of the tool. The chapter also helps you set the environment, which will be used throughout the book. Chapter 1, Know Your Horse Before You Ride It, starts with discussing the various features of APEX. This is to give heads up to the readers about the features offered by the tool and to inform them about some of the strengths of the tool. In order to understand the technology better, we discuss the various web server combinations possible with APEX, namely the Internal mod_plsql, External mod_plsql, and Listener configuration. While talking about the Internal mod_plsql configuration, we see the steps to enable the XMLDB HTTP server. In the Internal mod_plsql configuration, Oracle uses a DAD defined in EPG to talk to the database and the web server. So, we try to create a miniature APEX of our own, by creating our DAD using it to talk to the database and the web server. We then move on to learn about the External mod_plsql configuration. We discuss the architecture and the roles of the configuration files, such as dads.conf and httpd.conf. We also have a look at a typical dads.conf file and draw correlations between the configurations in the Internal and External mod_plsql configuration. We then move on to talk about the wwv_flow_epg_include_mod_local procedure that can help us use the DAD of APEX to call our own stored PL/SQL procedures. We then move on to talk about APEX Listener which is a JEE alternative to mod_plsql and is Oracle’s direction for the future. Once we are through with understanding the possible configurations, we see the steps to set up our environment. We use the APEX Listener configuration and see the steps to install the APEX engine, create a Weblogic domain, set the listener in the domain, and create an APEX workspace. With our environment in place, we straight away get into understanding the anatomy of APEX by analyzing various parts of its URL. This discussion includes a natter on the session management, request handling, debugging, error handling, use of TKPROF for tracing an APEX page execution, cache management and navigation, and value passing in APEX. We also try to understand the design behind the zero session ID in this section. Our discussions till now would have given you a brief idea about the technology, so we try to dig in a little deeper and understand the mechanism used by APEX to send web requests to its PL/SQL engine in the database by decoding the APEX page submission. We see the use of the wwv_flow.accept procedure and understand the role of page submission. We try to draw an analogy of an APEX form with a simple HTML to get a thorough understanding about the concept. The next logical thing after page submission is to see the SQL and PL/SQL queries and blocks reaching the database. We turn off the database auditing and see the OWA web toolkit requests flowing to the database as soon as we open an APEX page. We then broaden our vision by quickly knowing about some of the lesser known alternatives of mod_plsql. We end the chapter with a note of caution and try to understand the most valid criticisms of the technology, by understanding the SQL injection and Cross-site Scripting (XSS). After going through the architecture we straight away spring into action and begin the process of learning how to build the reports in APEX. The objective of this chapter is to help you understand and implement the most common reporting requirements along with introducing some interesting ways to frame the analytical queries in Oracle. The chapter also hugely focuses on the methods to implement different kinds of formatting in the APEX classic reports. We start Chapter 2, Reports, by creating the objects that will be used throughout the book and installing the reference application that contains the supporting code for the topics discussed in the second chapter. We start the chapter by setting up an authentication mechanism. We discuss external table authentication in this chapter. We then move on to see the mechanism of capturing the environment variables in APEX. These variables can help us set some logic related to a user’s session and environment. The variables also help us capture some properties of the underlying database session of an APEX session. We capture the variables using the USERENV namespace, DBMS_SESSION package, and owa_util package. After having a good idea of the ways and means to capture the environment variables, we build our understanding of developing the search functionality in a classic APEX report. This is mostly a talk about the APEX classic report features, we also use this opportunity to see the process to enable sorting in the report columns, and to create a link that helps us download the report in the CSV format. We then discuss various ways to implement the group reports in APEX. The discussion shows a way to implement this purely, by using the APEX’s feature, and then talks about getting the similar results by using the Oracle database feature. We talk about the APEX’s internal grouping feature and the Oracle grouping sets. The section also shows the first use of JavaScript to manipulate the report output. It also talks about a method to use the SQL query to create the necessary HTML, which can be used to display a data column in a classic report. We take the formatting discussion further, by talking about a few advanced ways of highlighting the report data using the APEX’s classic report features, and by editing the APEX templates. After trying our hand at formatting, we try to understand the mechanism to implement matrix reports in APEX. We use matrix reports to understand the use of the database features, such as the with clause, the pivot operator, and a number of string aggregation techniques in Oracle. The discussion on the string aggregation techniques includes the talk on the LISTAGG function, the wm_concat function, and the use of the hierarchical queries for this purpose. We also see the first use of the APEX items as the substitution variables in this book. We will see more use of the APEX items as the substitution variables to solve the vexed problems in other parts of the book as well. We do justice with the frontend as well, by talking about the use of jQuery, CSS, and APEX Dynamic Actions for making the important parts of data stand out. Then, we see the implementation of the handlers, such as this.affectedElements and the use of the jQuery functions, such as this.css() in this section. We also see the advanced formatting methods using the APEX templates. We then end this part of the discussion, by creating a matrix report using the APEX Dynamic Query report region. The hierarchical reports are always intriguing, because of the enormous ability to present the relations among different rows of data, and because of the use of the hierarchical queries in a number of unrelated places to answer the business queries. We reserve a discussion on the Oracle’s hierarchical queries for a later part of the section and start it, by understanding the implementation of the hierarchical reports by linking the values in APEX using drilldowns. We see the use of the APEX items as the substitution variable to create the dynamic messages. Since we have devised a mechanism to drill down, we should also build a ladder for the user to climb up the hierarchical chain. Now, the hierarchical chain for every person will be different depending on his position in the organization, so we build a mechanism to build the dynamic bread crumbs using the PL/SQL region in APEX. We then talk about two different methods of implementing the hierarchical queries in APEX. We talk about the connect, by clause first, and then continue our discussion by learning the use of the recursive with clause for the hierarchical reporting. We end our discussion on the hierarchical reporting, by talking about creating the APEX’s Tree region that displays the hierarchical data in the form of a tree. The reports are often associated with the supporting files that give more information about the business query. A typical user might want to upload a bunch of files while executing his piece of the task and the end user of his action might want to check out the files uploaded by him/her. To understand the implementation of this requirement, we check out the various ways to implement uploading and downloading the files in APEX. We start our discussion, by devising the process of uploading the files for the employees listed in the oehr_employees table. The solution of uploading the files involves the implementation of the dynamic action to capture the ID of the employee on whose row the user has clicked. It shows the use of the APEX items as the substitution variables for creating the dynamic labels and the use of JavaScript to feed one APEX item based on the another. We extensively talk about the implementation of jQuery in Dynamic Actions in this section as well. Finally, we check out the use of the APEX’s file browse item along with the WWV_FLOW_FILES table to capture the file uploaded by the user. Discussion on the methods to upload the files is immediately followed by talking about the ways to download these files in APEX. We nest the use of the functions, such as HTF.ANCHOR and APEX_UTIL.GET_BLOB_FILE_SRC for one of the ways to download a file, and also talk about the use of dbms_lob.getlength along with the APEX format mask for downloading the files. We then engineer our own stored procedure that can download a blob stored in the database as a file. We end this discussion, by having a look at the APEX’s p process, which can also be used for downloading. AJAX is the mantra of the new age and we try our hand at it, by implementing the soft deletion in APEX. We see the mechanism of refreshing just the report and not reloading the page as soon as the user clicks to delete a row. We use JavaScript along with the APEX page process and the APEX templates to achieve this objective. Slicing and dicing along with auditing are two of the most common requirements in the reporting world. We see the implementation of both of these using both the traditional method JavaScript with the page processes and the new method of using Dynamic Actions. We extend our use case a little further and learn about a two way interaction between the JavaScript function and the page process. We learn to pass the values back and forth between the two. While most business intelligence and reporting solutions are a one way road and are focused on the data presentation, Oracle APEX can go a step further, and can give an interface to the user for the data manipulations as well. To understand this strength of APEX, we have a look at the process of creating the tabular forms in APEX. We extend our understanding of the tabular forms to see a magical use of jQuery to convert a certain sections of a report from display-only to editable textboxes. We move our focus from implementing the interesting and tricky frontend requirements to framing the queries to display the complex data types. We use the string aggregation methods to display data in a column containing a varray. Time dimension is one of the most widely used dimension in reporting circles and comparing current performance with the past records is a favorite requirement of most businesses. With this in mind, we shift our focus to understand and implement the time series reports in APEX. We start our discussion by understanding the method to implement a report that shows the contribution of each business line in every quarter. We implement this by using partitioning dimensions on the fly. We also use the analytical functions, such as ratio_to_report, lead, and lag in the process of creating the time series reports. We use our understanding of time dimension to build a report that helps a user compare one time period to the other. The report gives the user the freedom to select the time segments, which he wishes to compare. We then outwit a limitation of this report, by using the query partition clause for the data densification. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. Chapter 3, In the APEX Mansion – Interactive Reports is all about Interactive Reports and the dynamic reporting. While the second chapter was more about the data presentation using the complex queries and the presentation methods, this chapter is about taking the presentation part a step ahead, by creating more visually appealing Interactive Reports. We start the discussion of this chapter, by talking about the ins and outs of Interactive Reports. We postmortem this feature of APEX to learn about every possible way of using Interactive Reports for making more sense of data. The chapter has a reference application of its own, which shows the code in action. We start our discussion, by exploring at various features of the Actions menu in Interactive Reports. The discussion is on search functionality, Select Columns feature, filtering, linking and filtering Interactive Reports using URLs, customizing Rows per page feature of an IR, using Control Break, creating computations in IR, creating charts in IR, using the Flashback feature and a method to see the back end flashback query, configuring the e-mail functionality for downloading a report, and for subscription of reports and methods to understand the download of reports in HTML, CSV, and PDF formats. Once we are through with understanding the Actions menu, we move on to understand the various configuration options in an IR. We talk about the Link section, the Icon View section, the Detail section, the Column Group section, and the Advanced section of the Report Attributes page of an IR. While discussing these sections, we understand the process of setting different default views of an IR for different users. Once we are through with our dissection of an IR, we put our knowledge to action, by inserting our own item in the Actions menu of an IR using Dynamic Actions and jQuery. We continue our quest for finding the newer formatting methods, by using different combinations of SQL, CSS, APEX templates, and jQuery to achieve unfathomable results. The objectives attained in this section include formatting a column of an IR based on another column, using CSS in the page header to format the APEX data, changing the font color of the alternate rows in APEX, using a user-defined CSS class in APEX, conditionally highlighting a column in IR using CSS and jQuery, and formatting an IR using the region query. After going through a number of examples on the use of CSS and jQuery in APEX, we lay down a process to use any kind of changes in an IR. We present this process by an example that changes one of the icons used in an IR to a different icon. APEX also has a number of views for IR, which can be used for intelligent programming. We talk about an example that uses the apex_application_page_ir_rpt view that show different IR reports on the user request. After a series of discussion on Interactive Reports, we move on to find a solution to an incurable problem of IR. We see a method to put multiple IR on the same page and link them as the master-child reports. We had seen an authentication mechanism (external table authentication) and data level authorization in Chapter 2, Reports. We use this chapter to see the object level authorization in APEX. We see the method to give different kinds of rights on the data of a column in a report to different user groups. After solving a number of problems and learning a number of things, we create a visual treat for ourselves. We create an Interactive report dashboard using Dynamic Actions. This dashboard presents different views of an IR as gadgets on a separate APEX page. We conclude this chapter, by looking at advanced ways of creating the dynamic reports in APEX. We look at the use of the table function in both native and interface approach, and we also look at the method to use the APEX collections for creating the dynamic IRs in APEX. Chapter 4, The Fairy Tale Begins – Advanced Reporting, is all about advanced reporting and pretty graphs. Since we are talking about advanced reporting, we see the process of setting the LDAP authentication in APEX. We also see the use of JXplorer to help us get the necessary DN for setting up the LDAP authentication. We also see the means to authenticate an LDAP user using PL/SQL in this section. This chapter also has a reference application of its own that shows the code in action. We start the reporting building in this chapter, by creating the Sparkline reports. This report uses jQuery for producing the charts. We then move on to use another jQuery library to create a report with slider. This report lets the user set the value of the salary of any employee using a slider. We then get into the world of the HTML charts. We start our talk, by looking at the various features of creating the HTML chart regions in APEX. We understand a method to implement the Top N and Bottom N chart reports in the HTML charts. We understand the APEX’s method to implement the HTML charts and use it to create an HTML chart on our own. We extend this technique of generating HTML from region source a little further, by using XMLTYPE to create the necessary HTML for displaying a report. We take our spaceship into a different galaxy of the charts world and see the use of Google Visualizations for creating the charts in APEX. We then switch back to AnyChart, which is a flash charting solution, and has been tightly integrated with APEX. It works on an XML and we talk about customizing this XML to produce different results. We put our knowledge to action, by creating a logarithmic chart and changing the style of a few display labels in APEX. We continue our discussion on AnyChart, and use the example of Doughnut Chart to understand advanced ways of using AnyChart in APEX. We use our knowledge to create Scatter chart, 3D stacked chart, Gauge chart, Gantt chart, Candlestick chart, Flash image maps, and SQL calendars. We move out of the pretty heaven of AnyChart only to get into another beautiful space of understanding the methods of displaying the reports with the images in APEX. We implement the reports with the images using the APEX’s format masks and also using HTF.IMG with the APEX_UTIL.GET_BLOB_FILE_SRC function. We then divert our attention to advanced jQuery uses such as, the creation of the Dialog box and the Context menu in APEX. We create a master-detail report using dialog boxes, where the child report is shown in the Dialog box. We close our discussion in this chapter, by talking about creating wizards in APEX and a method to show different kinds of customized error messages for problems appearing at different points of a page process. Chapter 5, Flight to Space Station: Advanced APEX, is all about advanced APEX. The topics discussed in this chapter fall into a niche category of reporting the implementation. This chapter has two reference applications of its own that shows the code in action. We start this chapter, by creating both the client-side and server-side image maps APEX. These maps are often used where regular shapes are involved. We then see the process of creating the PL/SQL Server Pages (PSPs). PSPs are similar to JSP and are used for putting the PL/SQL and HTML code in a single file. Loadpsp then converts this file into a stored procedure with the necessary calls to owa web toolkit functions. The next utility we check out is loadjava. This utility helps us load a Java class as a database object. This utility will be helpful, if some of the Java classes are required for code processing. We had seen the use of AnyChart in the previous chapter, and this chapter introduces you to FusionChart. We create a Funnel Chart using FusionChart in this chapter. We also use our knowledge of the PL/SQL region in APEX to create a Tag Cloud. We then stroll into the world of the APEX plugins. We understand the interface functions, which have to be created for plugins. We discuss the concepts and then use them to develop an Item type plugin and a Dynamic Action plugin. We understand the process of defining a Custom Attribute in a plugin, and then see the process of using it in APEX. We move on to learn about Websheets and their various features. We acquaint ourselves with the interface of the Websheet applications and understand the concept of sharing Websheets. We also create a few reports in a Websheet application and see the process of importing and using an image in Websheets. We then spend some time to learn about data grids in Websheets and the method to create them. We have a look at the Administration and View dropdowns in a Websheet application. We get into the administration mode and understand the process of configuring the sending of mails to Gmail server from APEX. We extend our administration skills further, by understanding the ways to download an APEX application using utilities, such as oracle.apex.APEXExport. Reporting and OLAP go hand in hand, so we see the method of using the OLAP cubes in APEX. We see the process of modeling a cube and understand the mechanism to use its powerful features. We then have a talk about Oracle Advanced Queues that can enable us to do reliable communication among different systems with different workloads in the enterprise, and gives improved performance. We spend a brief time to understand some of the other features of APEX, which might not be directly related to reporting, but are good to know. Some of these features include locking and unlocking of pages in APEX, the Database Object Dependencies report, Shortcuts, the Dataloading wizard, and the APEX views. We bring this exclusive chapter to an end, by discussing about the various packages that enable us to schedule the background jobs in APEX, and by discussing various other APEX and Database API, which can help us in the development process.

0
0
2406

Packt

02 Sep 2013

6 min read

OAuth Authentication

Packt

02 Sep 2013

6 min read

(For more resources related to this topic, see here.) Understanding OAuth OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider, and is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are resources that want to protect resources, such as a SAML Service Provider. If you have ever been to a site that has asked you to log in using your Twitter or LinkedIn credentials then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) is never passed to the OAuth Client, just a range of tokens that the Client requested from the Provider and which are authorized by the user. OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. Preparing Facebook as an OAuth Provider Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t let a redirect to a URL (such as our OpenAM installation) without being aware of the URL. The steps for preparing Facebook as an OAuth provider are as follows: Under the settings for the App in the section Website with Facebook Login we need to add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp as shown in the following screenshot: Click on the Save Changes button on Facebook. My OpenAM installation for this chapter was directly available on the Internet just in case Facebook checked for a valid URL destination. Configuring an OAuth authentication module OpenAM has the concept of authentication modules, which support different ways of authentication, such as OAuth, or against its Data Store, or LDAP or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client. Log in to OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm). Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button. Enter a name for the New Module Instance and select OAuth 2.0 as the Type and click on the OK button. I used the name Facebook. You will then see a screen as shown: For Client Id, use the App ID value provided from Facebook. For the Client Secret use the App Secret value provided from Facebook as shown in the preceding screenshot. Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their default values. Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream as shown in the preceding screenshot. Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value. The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. This goes the same for the Attribute Mapper Configuration. We’ll leave all these sections as their defaults as shown in the preceding screenshot. OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled as shown in the preceding screenshot. When a user authenticates against an OAuth Provider, they are likely to not already have an account with OpenAM. If they do not have a valid OpenAM account then they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled as shown in the preceding screenshot. Forcing authentication against particular authentication modules In the writing of this book I disabled the Create account if it does not exist option while I was testing. Then when I tried to log into OpenAM I was redirected to Facebook, which then passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials I could not log in. For your own testing, it would be recommended to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than OAuth authentication module, and log in successfully. For our newly created accounts we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled. The flip side to Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL which we could possibly call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM is to either not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user. If we had set earlier that we wanted to enforce password and activation token policies, then we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank. Click on the Save button. Summary This article served as a quick primer on what OAuth is and how to achieve it with OpenAM. It covered the concept of using Facebook as an OAuth provider and configuring an OAuth module. It focused on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. This would really help when we might want to allow authentication against Facebook or Google. Resources for Article: Further resources on this subject: Getting Started with OpenSSO [Article] OpenAM: Oracle DSEE and Multiple Data Stores [Article] OpenAM Identity Stores: Types, Supported Types, Caching and Notification [Article]

0
0
3695

Packt

02 Sep 2013

10 min read

Memory and cache

Packt

02 Sep 2013

10 min read

(For more resources related to this topic, see here.) You can find this instruction in the OGlobalConfiguration.java file in the autoConfig() method. Furthermore, you can enable/disable level 1 cache, level 2 cache, or both. You can also set the number of records that will be stored in each level as follows: cache.level1.size: This sets the number of records to be stored in the level 1 caches (default -1, no limit) cache.level2.size: This sets the number of records to be stored in the level 2 cache (default -1, no limit) cache.level1.enabled: This is a boolean value, it enables/disables the level 1 cache (default, true) cache.level2.enabled: This is a boolean value, it enables/disables the level 2 cache (default, true) Mapping files OrientDB uses NIO to map data files in memory. However, you can change the way this mapping is performed. This is achieved by modifying the file access strategy. Mode 0: It uses the memory mapping for all the operations. Mode 1 (default): It uses the memory mapping, but new reads are performed only if there is enough memory, otherwise the regular Java NIO file read/write is used. Mode 2: It uses the memory mapping only if the data has been previously loaded. Mode 3: It uses memory mapping until there is space available, then use regular JAVA NIO file read/write. Mode 4: It disables all the memory mapping techniques. To set the strategy mode, you must use the file.mmap.strategy configuration property. Connections When you have to connect with a remote database you have some options to improve your application performance. You can use the connection pools, and define the timeout value to acquire a new connection. The pool has two attributes: minPool: It is the minimum number of opened connections maxPool: It is the maximum number of opened connections When the first connection is requested to the pool, a number of connections corresponding to the minPool attribute are opened against the server. If a thread requires a new connection, the requests are satisfied by using a connection from the pool. If all the connections are busy, a new one is created until the value of maxPool is reached. Then the thread will wait, so that a connection is freed. Minimum and maximum connections are defined by using the client.channel.minPool (default value 1) and client.channel.maxPool (default value 5) properties. However, you can override these values in the client code by using the setProperty() method of the connection class. For example: database = new ODatabaseDocumentTx("remote:localhost/demo");database.setProperty("minPool", 10);database.setProperty("maxPool", 50);database.open("admin", "admin"); You can also change the connection timeout values. In fact, you may experience some problem, if there are network latencies or if some server-side operations require more time to be performed. Generally these kinds of problems are shown in the logfile with warnings: WARNING: Connection re-acquired transparently after XXXms and Yretries: no errors will be thrown at application level You can try to change the network.lockTimeout and the network.socketTimeout values. The first one indicates the timeout in milliseconds to acquire a lock against a channel (default is 15000), the second one indicates the TCP/IP socket timeout in milliseconds (default is 10000). There are some other properties you can try to modify to resolve network issues. These are as follows: network.socketBufferSize: This is the TCP/IP socket buffer size in bytes (default 32 KB) network.retry: This indicates the number of retries a client should do to establish a connection against a server (default is 5) network.retryDelay: This indicates the number of milliseconds a client will wait before retrying to establish a new connection (default is 500) Transactions If your primary objective is the performance, avoid using transactions. However, if it is very important for you to have transactions to group operations, you can increase overall performance by disabling the transaction log. To do so just set the tx.useLog property to false. If you disable the transaction log, OrientDB cannot rollback operations in case JVM crashes. Other transaction parameters are as follows: tx.log.synch: It is a Boolean value. If set, OrientDB executes a synch against the filesystem for each transaction log entry. This slows down the transactions, but provides reliability on non- reliable devices. Default value is false. tx.commit.synch: It is a Boolean value. If set, it performs a storage synch after a commit. Default value is true. Massive insertions If you want to do a massive insertion, there are some tricks to speed up the operation. First of all, do it via Java API. This is the fastest way to communicate with OrientDB. Second, instruct the server about your intention: db.declareIntent( new OIntentMassiveInsert() );//your code here....db.declareIntent( null ); Here db is an opened database connection. By declaring the OIntentMassiveInsert() intent, you are instructing OrientDB to reconfigure itself (that is, it applies a set of preconfigured configuration values) because a massive insert operation will begin. During the massive insert, avoid creating a new ODocument instance for each record to insert. On the contrary, just create an instance the first time, and then clean it using the reset() method: ODocument doc = new ODocument();for(int i=0; i< 9999999; i++){doc.reset(); //here you will reset the ODocument instancedoc.setClassName("Author");doc.field("id", i);doc.field("name", "John");doc.save();} This trick works only in a non-transactional context. Finally, avoid transactions if you can. If you are using a graph database and you have to perform a massive insertion of vertices, you can still reset just one vertex: ODocument doc = db.createVertex();...doc.reset();... Moreover, since a graph database caches the most used elements, you may disable this: db.setRetainObjects(false); Datafile fragmentation Each time a record is updated or deleted, a hole is created in the datafiles structure. OrientDB tracks these holes and tries to reuse them. However, many updates and deletes can cause a fragmentation of datafiles, just like in a filesystem. To limit this problem, it is suggested to set the oversize attribute of the classes you create. The oversize attribute is used to allocate more space for records once they are created, so as to avoid defragmentation upon updates. The oversize attribute is a multiplying factor where 1.0 or 0 means no oversize. The default values are 0 for document, and 2 for vertices. OrientDB has a defrag algorithm that starts automatically when certain conditions are verified. You can set some of these conditions by using the following configuration parameter: file.defrag.holeMaxDistance: It defines the maximum distance in bytes between two holes that triggers the defrag procedure. The default is 32 KB, -1 means dynamic size. The dynamic size is computed in the ODataLocal class in the getCloserHole() method, as Math.max(32768 * (int) (size / 10000000), 32768), where size is the current size of the file. The profiler OrientDB has an embedded profiler that you can use to analyze the behavior of the server. The configuration parameters that act on the profiler are as follows: profiler.enabled: This is a boolean value (enable/disable the profiler), the default value is false. profiler.autoDump.interval: It is the number of seconds between profiler dump. The default value is 0, which means no dump. profiler.autoDump.reset: This is a boolean value, reset the profile at every dump. The default is true. The dump is a JSON string structured in sections. The first one is a huge collection of information gathered at runtime related to the configuration and resources used by each object in the database. The keys are structured as follows: db.<db-name>: They are database-related metrics db.<db-name>.cache: They are metrics about databases' caching db.<db-name>.data: They are metrics about databases' datafiles, mainly data holes db.<db-name>.index: They are metrics about databases' indexes system.disk: They are filesystem-related metrics system.memory: They are RAM-related metrics system.config.cpus: They are the number of the cores process.network: They are network metrics process.runtime: They provide process runtime information and metrics server.connections.actives: They are number of active connections The second part of the dump is a collection of chronos. A chrono is a log of an operation, for example, a create operation, an update operation, and so on. Each chrono has the following attributes: last: It is the last time recorded min: It is the minimum time recorded max: It is the maximum time recorded average: It is the average time recorded total: It is the total time recorded entries: It is the number of times the specific metric has been recorded Finally, there are sections about many counters. Query tips In the following paragraphs some useful information on how to optimize the queries execution is given. The explain command You can see how OrientDB accesses the data by using the explain command in the console. To use this command simply write explain followed by the select statement: orientdb> explain select from Posts A set of key-value pairs are returned. Keys mean the following: resultType: It is the type of the returned resultset. It can be collection, document, or number. resultSize: It is the number of records retrieved if the resultType is collection. recordReads: It is the number of records read from datafiles. involvedIndexes: They are the indices involved in the query. indexReads: It is the number of records read from the indices. documentReads: They are the documents read from the datafiles. This number could be different from recordReads, because in a scanned cluster there can be different kinds of records. documentAnalyzedCompatibleClass: They are the documents analyzed belonging to the class requested by the query. This number could be different from documentReads, because a cluster may contain several different classes. elapsed: This time is measured in nanoseconds, it is the time elapsed to execute the statement. As you can see, OrientDB can use indices to speed up the reads. Indexes You can define indexes as we do in a relational database using the create index statement or via Java API using the createIndex() method of the OClass class: create index <class>.<property> [unique|notunique|fulltext] [field type] Or for composite index (an index on more than one property): create index <index_name> on <class> (<field1>,<field2>)[unique|notunique|fulltext] If you create a composite index, OrientDB will use it also when in a where clause you don't specify a criteria against all the indexed fields. So you can avoid this to build an index for each field you use in the queries if you have already built a composite one. This is the case of a partial match search and further information about it can be found in the OrientDB wiki at https://github.com/nuvolabase/orientdb/wiki/Indexes#partial-match-search. Generally, the indexes don't work with the like operator. If you want to perform the following query: select from Authors where name like 'j%' And you want use an index, you must define on the field name a FULLTEXT index. FULLTEXT indices permit to index string fields. However keep in mind that indices slow down the insert, update, and delete operations. Summary In this article we have seen some strategies that try to optimize both the OrientDB server installation and queries. Resources for Article: Further resources on this subject: Comparative Study of NoSQL Products [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Microsoft SQL Azure Tools [Article]

0
0
1122

Packt

30 Aug 2013

13 min read

Managing a Hadoop Cluster

Packt

30 Aug 2013

13 min read

0
0
13243

article-image-working-import-process-intermediate

Packt

30 Aug 2013

5 min read

Working with Import Process (Intermediate)

Packt

30 Aug 2013

5 min read

(For more resources related to this topic, see here.) Getting ready The first thing we need to do is to download the latest version of Sqoop from following location http://www.apache.org/dist/sqoop/ and extract it on your machine. Now I am calling the Sqoop installation dir as $SQOOP_HOME. Given here are the prerequisites for Sqoop import process. Installed and running Relational Database Management System (MySQL). Installed and running Hadoop Cluster. Set $HADOOP_HOME environment variable. Following are the common arguments of import process. Parameters Description --connect <jdbc-uri> This command specifies the server or database to connect. It also specifies the port. Example: --connect jdbc:mysql://host:port/databaseName --connection-manager <class-name> Specify connection manager class name. --driver <class-name> Specify the fully qualified name of JDBC driver class. --password <password> Set authentication password required to connect to input source. --username <username> Set authentication username. How to do it Let’s see how to work with import process First, we will start with import single RDBMS table into Hadoop. Query1: $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/abc/tableName The content of output file in HDFS will look like: Next, we will put some light on approach of import only selected rows and selected columns of RDBMS table into Hadoop. Query 2: Import selected columns $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” Query 3: Import selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --where ‘student_id<100’ Query 4: Import selected columns of selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” -- where ‘student_id<100’ How it works… Now let’s see how the above steps work: Import single table: Apart from the common arguments of import process, as explained previously, this part covers some other arguments which are required to import a table into Hadoop Distributed File System. Parameters Description --table <table-name> Name of input table to fetch. --target-dir<dir> Location of output/target dir in HDFS. --direct If user want to use non-JDBC based access mechanism for faster database access --options-file <file-path> All the command line options that are common in most of commands can put in options file for convenience. The Query1 will run a MapReduce job and import all the rows of given table to HDFS (where, /user/abc/tableName is the location of output files). The records imported in HDFS preserve their original columns order, which means, if input table contains four columns A, B, C and D, then content in HDFS file will look like: A1, B1, C1, D1 A2, B2, C2, D2 Import selected columns: By default, the import query will select all columns of input table for import, but we can select the subset of columns by specifying the comma separated list of columns in --columns argument. The Query2 will only fetch three columns (student_id, address and name) of student table. If import query contains the --columns argument, then the order of column in output files are same as order specified in --columns argument. The output in HDFS will look like: student_id, address, name 1, Delhi, XYZ 2, Mumbai, PQR .......... If the input query contains the column in following order -- “address, name, student_id”, then the output in HDFS will look like. address, name, student_id Delhi, XYZ, 1 Mumbai, PQR, 2 ............. Import selected rows: By default, all the rows of input table will be imported to HDFS, but we can control which rows need to be import by using a --where argument in the import statement. The Query3 will import only those rows into HDFS which has value of “student_id” column greater than 100. The Query4 use both --columns and --where arguments in one statement. For Query4, Sqoop will internally generates the query of the form “select student_id, address, name from student where student_id<100”. There’s more... This section covers some more examples of import process. Import all tables: So far we have imported a single table into HDFS, this section introduces an import-all-tables tool, by which we can import a set of tables from an RDBMS to HDFS. The import-all-tables tool creates a separate directory in HDFS for each RDBMS table. The following are the mandatory conditions for import-all-tables tool: All tables must have a single primary key column. User must intend to import all the columns of each table. No --where, --columns and --query arguments are permitted. Example: Query 5: $ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password This query will import all tables (tableName and tableName1) of database db1 into HDFS. Output directories in HDFS look like: Summary We learned a lot in this article, about import single RDBMS table into HDFS, import selected columns and selected rows, and import set of RDBMS tables. Resources for Article : Further resources on this subject: Introduction to Logging in Tomcat 7 [Article] Configuring Apache and Nginx [Article] Geronimo Architecture: Part 2 [Article]

0
0
794

article-image-what-are-ssas-2012-dimensions-and-cube

Packt

29 Aug 2013

7 min read

What are SSAS 2012 dimensions and cube?

Packt

29 Aug 2013

7 min read

(For more resources related to this topic, see here.) What is SSAS? SQL Server Analysis Services is an online analytical processing tool that highly boosts the different types of SQL queries and calculations that are accepted in the business intelligence environment. It looks like a relation database, but it has differences. SSAS does not replace the requirement of relational databases, but if you combine the two, it would help to develop the business intelligence solutions. Why do we need SSAS? SSAS provide a very clear graphical interface for the end users to build queries. It is a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse. All reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting. SSAS are database services specifically designed for this type of workload, and in most cases it has achieved much better query performance. SSAS 2012 architecture In this article we will explain about the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. This new version of SSAS that closely resembles PowerPivot uses the tabular model. When installing SSAS, we must select either the tabular model or multidimensional model for installing an instance that runs inside the server; both data models are developed under the same code but sometimes both are treated separately. The concepts included in designing both data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa without rebuilding everything from the start. The main point of view of the end users is that both data models do almost the same things and appear almost equally when used through a client tool such as Excel. The tabular model A concept of building a database using the tabular model is very similar to building it in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we probably have made a design mistake somewhere because everything we need should be contained within an individual database. Tabular models are designed by using SQL Server Data Tools (SSDT), and a data project in SSDT mapping onto a database in Analysis Services. The multidimensional model This data model is very similar to the tabular model. Data is managed in databases, and databases are designed in SSDT, which are in turn managed by using SQL Server Management Studio. The differences may become similar below the database level, where the multidimensional data model rather than relational concepts are accepted. In the multidimensional model, data is modeled as a series of cubes and dimensions and not tables. The future of Analysis Services We have two data models inside SSAS, along with two query and calculation languages; it is clearly not an ideal state of affairs. It means we have to select a data model to use at the start of our project, when we might not even know enough about our need to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement. It is just like saying that the new advanced features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously. What's new in SSAS 2012? As we know, there is no easy way of transferring a multidimensional data model into a tabular data model. We may have many tools in the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful. Ease of use If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find a tabular model much easier to learn than a multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but also the development process is much more straightforward and there are far fewer features to learn. Compatibility with PowerPivot The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces used are practically the same, as both the interfaces use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around, and a tabular model cannot be converted to a PowerPivot model. Processing performance characteristics If we compare the processing performance of the multidimensional and tabular data models, that will become difficult. It may be slower to process a large table following the tabular data model than the equivalent measure group in a multidimensional one because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time. What is SSAS dimension? A database dimension is a collection of related objects; in other words, attributes; they provide the information about fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide the paths to assist users when they browse through the data in a cube. By default these attributes are visible as attribute hierarchies, and they can be used to understand fact data in a cube. What is SSAS cube? A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of a cube that you use to slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve values—of measures—as if they are contained in cells in the cube. The cells are defined for every possible summarized value. A cell, in the cube, is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection. Summary We talked about the special new features and services present, what you can do with them, and why they’re so great. Resources for Article: Further resources on this subject: Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Performing Common MDX-related Tasks [Article] How to Perform Iteration on Sets in MDX [Article]

0
0
1899

Packt

29 Aug 2013

8 min read

Creating a pivot table

Packt

29 Aug 2013

8 min read

(For more resources related to this topic, see here.) A pivot table is the core business intelligence tool that helps to turn meaningless data from various sources to a meaningful result. By using different ways of presenting data, we are able to identify relations between seemingly separate data and reach conclusions to help us identify our strengths and areas of improvement. Getting ready Prepare the two files entitled DatabaseData_v2.xlsx and GDPData_v2.xlsx.We will be using these results along with other data sources to create a meaningful PowerPivot table that will be used for intelligent business analysis. How to do it... For each of the two files, we will build upon the file and add a pivot table to it, gaining exposure using the data we are already familiar with. The following are the steps to create a pivot table with the DatabaseData_v2.xlsx file, which results in the creation of a DatabaseData_v3.xlsx file: Open the PowerPivot window of the DatabaseData_v2.xlsx file with its 13 tables. Click on the PivotTable button near the middle of the top row and save as New Worksheet. Select the checkboxes as shown in the following screenshot: Select CountryRegion | Name and move it under Row Labels Select Address | City and move it under Row Labels Select Address | AddressLine1 as Count of AddressLine1 and move it under Values Now, this shows the number of clients per city and per country. However, it is very difficult to navigate, as each country name has to be collapsed in order to see the next country. Let us move the CountryRegion | Name column to Slicers Vertical. Now, the PowerPivot Field List dashboard should appear as shown in the following screenshot: Now, the pivot table should display simple results: the number of clients in a region, filterable by the country using slicers. Let us apply some formatting to allow for a better understanding of the data. Right-click on Name under the Slicers Vertical area of the PowerPivot Field List dashboard. Select Field Settings, then change the name to Country Name. We now see that the title of the slicer has changed from Name to Country Name, allowing anyone who views this data to understand better what the data represents. Similarly, right-click on Count of AddressLine1 under Values, select Edit Measure, and then change its name to Number of Clients. Also change the data title City under the Row Labels area to City Name. The result should appear as shown in the following screenshot: Let's see our results change as we click on different country names. We can filter for multiple countries by holding the Ctrl key while clicking, and can remove all filters by clicking the small button on the top-right of slicers. This is definitely easier to navigate through and to understand compared to what we did at first without using slicers, which is how it would appear in Excel 2010 without PowerPivot. However, this table is still too big. Clicking on Canada gives too many cities whose names many of us have not heard about before. Let us break the data further down by including states/provinces. Select StateProvince | Name and move it under Slicers Horizontal and change its title to State Name. It is a good thing that we are renaming these as we go along. Otherwise, there would have been two datasets called Name, and anyone would be confused as we moved along. Now, we should see the state names filter on the top, the country name filter on the left, and a list of cities with the number of clients in the middle part. This, however, is kind of awkward. Let us rearrange the filters by having the largest filter (country) at the top and the sub-filter. (state) on the left-hand side This can be done simply by dragging the Country Name dataset to Slicers Horizontal and State Name to Slicers Vertical. After moving the slicers around a bit, the result should appear as shown in the following screenshot: Again, play with the results and try to understand the features: try filtering by a country—and by a state/province—now there are limited numbers of cities shown for each country and each state/province, making it easier to see the list of cities. However, for countries such as the United States, there are just too many states. Let us change the formatting of the vertical filter to display three states per line, so it is easier to find the state we are looking for. This can be done by right-clicking on the vertical filter, selecting Size and Properties| Position and Layout, and then by changing the Number of Columns value. Repeat the same step for Country Name to display six columns and then change the sizes of the filters to look more selectable. Change the name of the sheet as PivotTable and then save the file as DatabaseData_v3.xlsx. The following are the steps to create a pivot table with the GDPData_v2.xlsx file, which results in the creation of a GDPData_v3.xlsx file: Open the PowerPivot window of the GDPData_v2.xlsx file with its two tables. Click on the PivotTable button near the middle of the top row and save as New Worksheet. Move the dataset from the year 2000 to the year 2010 to the Value field, and move Country Name in the Row Labels field, and Country Name again into the Slicers Horizontal field. In the slicer, select five countries: Canada, China, Korea, Japan, and United States as shown in the following screenshot: Select all fields and reduce the number of its decimal places. We can now clearly see that GDP in China has tripled over the decade, and that only China and Korea saw an increase in GDP from 2008 to 2009 while the GDP of other nations dropped due to the 2008 financial crisis. Knowing the relevant background information of world finance events, we can make intelligent analysis such as which markets to invest in if we are worried about another financial crisis taking place. As the data get larger in size, looking at the GDP number becomes increasingly difficult. In such cases, we can switch the type of data displayed by using available buttons in the PivotTable Tools | Options menu, the Show Value As button. Play around with it and see how it works: % of Column Total shows each GDP as a percentage of the year, while % Different From allows the user to set one value as the standard and compare the rest to it, and the Rank Largest to Smallest option simply shows the ranking based on which country earns the most GDP. Change the name of the sheet as PivotTable and then save the file as GDPData_v3.xlsx. How it works... We looked at two different files and focused on two different fields. The first file was more qualitative and showed the relationship between regions and number of clients, using various features of pivot tables such as slicers. We also looked at how to format various aspects of a pivot table for easier processing and for a better understanding of the represented data. Slicers embedded in the pivot table are a unique and very powerful feature of PowerPivot that allow us to sort through data simply by clicking the different criteria. The increasing numbers of slicers help to customize the data further, enabling the user to create all sorts of data imaginable. There are no differences in horizontal and vertical slicers aside from the fact that they are at different locations. From the second file, we focused more on the quantitative data and different ways of representing the data. By using slicers to limit the number of countries, we were able to focus more on the data presented, and manage to represent the GDP in various formats such as percentages and ranks, and were able to compare the difference between the numbers by selecting one as a standard. A similar method of representing data in a different format could be applied to the first file to show the percentage of clients per nation, and so on. There's more... We covered the very basic setup of creating a pivot table. We can also analyze creating relationships between data and creating custom fields, so that better results are created. So don't worry about why the pivot table looks so small! For those who do not like working with a pivot table, there is also a feature that will convert all cells into Excel formula. Under the PowerPivot Tools | OLAP Tools option, the Convert to Formula button does exactly that. However, be warned that it cannot be undone as the changes are permanent. Summary In this article, we learned how to use the raw data to make some pivot tables that can help us make smart business decisions! Resources for Article: Further resources on this subject: SAP HANA integration with Microsoft Excel [Article] Managing Core Microsoft SQL Server 2008 R2 Technologies [Article] Eloquent relationships [Article]

0
0
1087

Packt

27 Aug 2013

12 min read

Schemas and Models

Packt

27 Aug 2013

12 min read

0
0
3132

article-image-creating-multivariate-charts

Packt

26 Aug 2013

10 min read

Creating Multivariate Charts

Packt

26 Aug 2013

10 min read

(For more resources related to this topic, see here.) With increasing number of variables, any analysis can become challenging and any observations harder; however, Tableau simplifies the process for the designer and uses effective layouts for the reader even in multivariate analysis. Using various combinations of colors and charts, we can create compelling graphics that generate critical insights from our data. Among the charts covered in this article, facets and area charts are easier to understand and easier to create compared to bullet graphs and dual axes charts. Creating facets Facets are one of the powerful features in Tableau. Edward Tufte, a pioneer in the field of information graphics, championed these types of charts, also called grid or panel charts; he called them small multiples. These charts show the same measure(s) across various values of one or two variables for easier comparison. Getting ready Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source. How to do it... Once the data file is loaded on the new worksheet, perform the following steps to create a simple faceted chart: Drag-and-drop Market from Dimensions into the Columns shelf. Drag-and-drop Product Type from Dimensions into the Rows shelf. Drag-and-drop Profit from Measures into the Rows shelf next to Product Type. Optionally, you can drag-and-drop Market into the Color Marks box to give color to the four bars of different Market areas. The chart should look like the one in the following screenshot: How it works... When there is one dimension on one of the shelves, either Columns or Rows, and one measure on the other shelf, Tableau creates a univariate bar chart, but when we drop additional dimensions along with the measure, Tableau creates small charts or facets and displays univariate charts broken down by a dimension. There's more... A company named Juice Analytics has a great blog article on the topic of small multiples. This article lists the benefits of using small multiples as well as some examples of small multiples in practice. Find this blog at http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/. Creating area charts An area chart is an extension of a line chart. The area chart shows the line of the measure but fills the area below the line to emphasize on the value of the measure. A special case of area chart is a stacked area chart, which shows a line per measure and the area between the lines is filled. Tableau's implementation of area charts uses one date variable and one or more measures. Getting ready Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source. How to do it... Once the data is loaded on the new worksheet, perform the following steps to create an area chart: Click on the Show Me button to bring the Show Me toolbar on the screen. Select Order Date from Dimensions and Order Quantity from Measures by clicking and holding the Ctrl key. Click on Area charts (continuous) from the Show Me toolbar. Drag-and-drop Order Date into the Columns shelf next to YEAR(Order Date. Expand YEAR(Order Date), seen on the right-hand side, by clicking on the plus sign. Drag-and-drop Region from Dimensions into the the Rows shelf to the left of SUM(Order Quantity). The chart should look like the one in the following screenshot: How it works… When we added Order Date for the first time, Tableau, by default, aggregated the date field by year; therefore, we added Order Date again to create aggregation by quarter of the Order Date. We also added Region to create facets on the regions that provide trends of order quantity over time. There's more... A blog post by visually, an information graphics company, discusses the key differences between line charts and area charts. You can find this post at http://blog.visual.ly/line-vs-area-charts/. Creating bullet graphs Stephen Few, an information visualization consultant and author, designed this chart to solve some of the problems that the gauges and meters type of charts poses. Gauges, although simple to understand, take a lot of space to show only one measure. Bullet graphs are a combination of the bar graph and thermometer types of charts, and they show a measure of interest in the form of a bar graph (which is the bullet) and target variables. Getting ready Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source. How to do it... Once the data is loaded on the sheet, perform the following steps to create a bullet graph: Click on the Show Me button to bring the Show Me toolbar on the screen. While holding the Ctrl key, click on Type and Market from Dimensions and Budget Sales and Sales from Measures. Click on the bullet graphs icon on the Show Me toolbar. Right-click on the x axis (the Budget Sales axis) and click on Swap Reference Line Fields. The final chart should look like the one in the following screenshot: How it works... Although bullet graphs maximize the available space to show relevant information, readers require detailed explanation as to what all the components of the graphic are encoding. In this recipe, since we want to compare the budgeted sales with the actual sales, we had to swap the reference line from Sales to Budget Sales. The black bar on the graphic shows the budgeted sales and the blue bar shows the actual sales. The dark gray background color shows 60 percent of the actual sales and the lighter gray shows 80 percent of the actual sales. As we can see in this chart, blue bars crossed all the black lines, and that tells us that both the coffee types and all market regions exceeded the budgeted sales. There's more... A blog post by Data Pig Technologies discusses some of the problems with the bullet graph. The main problem is intuitive understanding of this chart. You can read about this problem and the reply by Stephen Few at http://datapigtechnologies.com/blog/index.php/the-good-and-bad-of-bullet-graphs/. Creating dual axes charts Dual axes charts are useful to compare two similar types of measures that may have different types of measurement units, such as pounds and dollars. In this recipe, we will look at the dual axes chart. Getting ready Let's use the same sample file, Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source. How to do it... Once the data is loaded on the sheet, perform the following steps to create a dual axes chart: Click on the Show Me button to bring the Show Me toolbar on the screen. While holding the Ctrl key, click on Date, Type, and Market from Dimensions and Sales and Budget Sales from Measures. Click on the dual line graph icon on the Show Me toolbar. Click-and-drag Market from the Rows shelf into the Columns shelf. Right-click on the Sales vertical axis and click on Synchronize Axis. The chart should look like the one shown in the following screenshot: How it works... Tableau will create two vertical axes and automatically place Sales on one dual axes charts vertical axis and Budget Sales on the other. The scales on both the vertical axes are different, however. By synchronizing the axes, we get the same scales on both axes for better comparison and accurate representation of the patterns. Creating Gantt charts Gantt charts are most commonly used in project management as these charts show various activities and tasks with the time required to complete those tasks. Gantt charts are even more useful when they show dependencies among various tasks. This type of chart is very helpful when the number of activities is low (around 20-30), otherwise the chart becomes too big to be understood easily. Getting ready Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source. How to do it... Once the data is loaded, perform the following steps to create a Gantt chart: Click on Analysis from the top menu toolbar, and if Aggregate Measures is checked, click on it again to uncheck that option. Click on the Show Me button to bring the Show Me toolbar on the screen. While holding the Ctrl key, click on Order Date and Category (under Products) from Dimensions and Time to Ship from Measures. Click on the Gantt chart icon on the Show Me toolbar. Drag-and-drop Order Date into the Filters pane. Select Years from the Filter Field [Order Date] options dialog box and hit Next. Check 2012 from the list and hit OK. Right-click on YEAR(Order Date) on the Columns shelf and select the Day May 8, 2011 option. Drag-and-drop Order Date into the Filters pane. Select Months from the Filter Field [Order Date] options dialog box and hit Next. Check December from the list and hit OK. Drag-and-drop Region from Dimensions into the Color Marks input box. Drag-and-drop Region from Dimensions into the Rows shelf before Category. The generated Gantt chart should look like the one in the following screenshot: How it works... Representing time this way helps the reader to discern which activity took the longest amount of time. We added the Order Date field two times in the Filters pane to first filter for the year 2012 and then for the month of December. In this recipe, out of all the products shipped in December of 2012, we can easily see the red bars for the West region in the Office Supplies category is longer, suggesting that these products took the longest amount of time to ship. There's more... Andy Kriebel, a Tableau data visualization expert, has a great example of Gantt charts using US presidential data. The following link shows the lengths of terms in office of Presidents from various parties: http://vizwiz.blogspot.com/2010/09/tableau-tip-creating-waterfall-chart.html Creating heat maps A heat map is a visual representation of numbers in a table or a grid such that the bigger numbers are encoded by darker colors or bigger sizes and the smaller numbers by lighter colors or smaller sizes. This type of representation makes the reader's pattern detection from the data easier. Getting ready Let's use the same sample file, Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source. How to do it... Once the data is loaded, perform the following steps to create a heat map chart: Click on the Show Me button to bring the Show Me toolbar on the screen. While holding the Ctrl key, click on Sub-Category (under Products), Region, and Ship Mode from Dimensions and Profit from Measures. Click on the heat maps chart icon on the Show Me toolbar. Drag-and-drop Profit from Measures into the Color Marks box. The generated chart should look like the one in the following screenshot: Summary When we created the chart for the first time, Tableau assigned various sizes to the square boxes, but when we placed Profit as a color mark, red was used for low amounts of profit and green was used for higher amounts of profit. This made spotting of patterns very easy. Binders and Binder Accessories, shipped by Regular Air in the Central region, generated very high amounts of profit and Tables, shipped by Delivery Trucks in the East region, generated very low amounts of profit (it actually created losses for the company). Resources for Article: Further resources on this subject: Constructing and Evaluating Your Design Solution [Article] Basic use of Local Storage [Article] Creating Interactive Graphics and Animation [Article]

0
0
3201

article-image-making-big-data-work-hadoop-and-solr

Packt

23 Aug 2013

8 min read

Making Big Data Work for Hadoop and Solr

Packt

23 Aug 2013

8 min read

(For more resources related to this topic, see here.) Understanding data processing workflows Based on the data, configuration, and the requirements, data can be processed at multiple levels while it is getting ready for search. Cascading and LucidWorks Big Data are few such application platforms with which a complex data processing workflow can be rapidly developed on the Hadoop framework. In Cascading, the data is processed in different phases, with each phase containing a pipe responsible for carrying data units and applying a filter. The following diagram shows how incoming data can be processed in the pipeline-based workflow: Once a data is passed through the workflow, it can be persisted at the end with repository, and later synced with various nodes running in a distributed environment. The pipelining technique offers the following advantages: Apache Solr engine has minimum work to handle while index creation Incremental indexing can be supported By introducing intermediate store, you can have regular data backups at required stages The data can be transferred to a different type of storage such as HDFS directly through multiple processing units The data can be merged, joined, and processed as per the needs for different data sources LucidWorks Big Data is a more powerful product which helps the user to generate bulk indexes on Hadoop, allowing them to classify and analyze the data, and provide distributed searching capabilities. Sharding is a process of breaking one index into multiple logical units called shards across multiple records. In case of Solr, the results will be aggregated and returned. Big Data based technologies can be used with Apache Solr for various operations. Index creation itself can be made to run on distributed system in order to speed up the overall index generation activity. Once that is done, it can be distributed on different nodes participating in Big Data, and Solr can be made to run in a distributed manner for searching the data. You can set up your Solr instance in the following different configurations: Standalone machine This configuration uses single high end server containing indexes and Solr search; it is suitable for development, and in some cases, production. Distributed setup A distributed setup is suitable for large scale indexes where the index is difficult to store on one system. In this case index has to be distributed across multiple machines. Although distributed configuration of Solr offers ample flexibility in terms of processing, it has its own limitations. A lot of features of Apache Solr such as MoreLikeThis and Joins are not supported. The following diagram depicts the distributed setup: Replicated mode In this mode, more than one Solr instance exists; among them the master instance provides shared access to its slaves for replicating the indexes across multiple systems. Master continues to participate in index creation, search, and so on. Slaves sync up the storage through various replication techniques such as rsync utility. By default, Solr includes Java-based replication that uses HTTP protocol for communication. This replication is recommended due to its benefits over other external replication techniques. This mode is not used anymore with the release of Solr 4.x versions. Sharded mode This mode combines the best of both the worlds and brings in the real value of distributed system with high availability. In this configuration, the system has multiple masters, and each master holds multiple slaves where the replication has gone through. Load balancer is used to handle the load on multiple nodes equally. The following diagram depicts the distributed and replicated setup: If Apache Solr is deployed on a Hadoop-like framework, it falls into this category. Solr also provides SolrCloud for distributed Solr. We are going to look at different approaches in the next section. Using Solr 1045 patch – map-side indexing The work for Solr-1045 patch started with a goal to achieve index generation/building using the Apache MapReduce task. Solr-1045 patch converts all the input records to a set of <key, value> pairs in each map task that runs on Hadoop. Further it goes on creating SolrInputDocument from the <key, value>, and later creating the Solr indexes.The following diagram depicts this process: Reduce tasks can be used to perform deduplication of indexes, and merge them together if required. Although merge index seems to be an interesting feature, it is actually a costly affair in terms of processing, and you will not find many implementations with merge index functionality. Once the indexes are created, you can load them on your Solr instance and use them for searching. You can download this particular patch from https://issues.apache.org/jira/browse/SOLR-1045, and patch your Solr instance. To apply a patch to your Solr instance, you need to first build your Solr instance using source. You can download the patch from Apache JIRA. Before running the patch, first do a dry run which does not actually apply patch. You can do it with following command: cd <solr-trunk-dir>svn patch <name-of-patch> --dry-run If it is successful, you can run the patch without the –dry-run option to apply the patch. Let's look at some of the important classes in the patch. Important class Description SolrIndexUpdateMapper This class is a Hadoop mapper responsible for creating indexes out of <key, value> pairs of input. SolrXMLDocRecordReader This class is responsible for reading Solr input XML files. SolrIndexUpdater This class creates a MapReduce job configuration, runs the job to read the document, and updates the Solr instance. Right now it is built using the Lucene index updater. Benefits and drawbacks The following are the benefits and drawbacks of using the Solr-1045 patch: Benefits It achieves complete parallelism by index creation right at the map task. Merging of indexes is possible in the reduce phase of MapReduce. Drawbacks When the indexing is done at map-side, all the <key, value> pairs received by reducer gain equal weight/importance. So, it is difficult to use this patch with data that carries ranking/weight information. Using Solr 1301 patch – reduce-side indexing This patch focuses on using the Apache MapReduce framework for index creation. Keyword search can happen over Apache Solr or Apache SolrCloud. Unlike Solr-1045, in this patch, the indexes are created in the reduce phase of MapReduce. In this patch, a map task is responsible for converting input records to a <key, value> pair; later, they are passed to the reducer, which in turn converts them into SolrInputDocument, and then creates indexes out of it. This index is then passed as outputs of Hadoop MapReduce process. The following diagram depicts this process: To use Solr-1301 patch, you need to set up a Hadoop cluster. Once the index is created through Hadoop patch, it should then be provisioned to Solr server. The patch contains default converter for CSV files. Let's look at some of the important classes which are part of this patch. Important class Description CSVDocumentConverter This class is responsible for converting output of the map task, that is, key-value pair to SolrInputDocument; you can have multiple document converters. CSVReducer This is a reducer code implemented for Hadoop reducers. CSVIndexer This is the main class to be called from your command line for creating indexes using MapReduce. You need to provide input path for your data and output path for storing shards. SolrDocumentConverter This class is used in your map task for converting your objects in Solr document. SolrRecordWriter This class is an extension of mapreduce.RecordWriter; it breaks the data into multiple (key, value) pairs which are then converted into collection of SolrInputDocument(s), and then this data is submitted to SolrEmbeddedServer in batches. Once completed, it will commit the changes and run the optimizer on the embedded server. CSVMapper This class parses CSV file and gets key-value pair out of it. This is a mapper class. SolrOutputFormat This class is responsible for converting key-value pairs to write the data on file/HDFS as zip/raw format. Perform the following steps to run this patch: Create a local folder with configuration and library folder, conf containing Solr configuration (solr-config.xml, schema.xml), and lib containing library. Create your own converter class implementing SolrDocumentConverter; this will be used by SolrOutputFormat to convert output records to Solr document. You may also override the OutputFormat class provided by Solr. Write the Hadoop MapReduce job in the configuration writer: SolrOutputFormat.setupSolrHomeCache(newFile(solrConfigDir), conf);conf.setOutputFormat(SolrOutputFormat.class);SolrDocumentConverter.setSolrDocumentConverter(<yourclassname>.class, conf); Zip your configuration, and load it in HDFS. The ZIP file name should be solr.zip (unless you change the patch code). Now run the patch, each of the jobs will instantiate EmbeddedSolrInstance which will in turn do the conversion, and finally the SolrOutputDocument(s) get stored in the output format. Benefits and drawbacks The following are the benefits and drawbacks of using Solr-1301 patch: Benefits With reduced size index generation, it is possible to preserve the weights of documents, which can contribute while performing a prioritization during a search query. Drawbacks Merging of indexes is not possible like in Solr-1045, as the indexes are created in the reduce phase. Reducer becomes the crucial component of the system due to major tasks being performed. Summary In this article, we have understood different possible approaches of how Big Data can be made to work with Apache Hadoop and Solr. We also looked at the benefits of and drawbacks these approaches. Resources for Article : Further resources on this subject: Advanced Hadoop MapReduce Administration [Article] Apache Solr Configuration [Article] Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article]

0
0
2381

Packt

22 Aug 2013

10 min read

What is Hazelcast?

Packt

22 Aug 2013

10 min read

(For more resources related to this topic, see here.) Starting out as usual In most modern software systems, data is the key. For more traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience rather than performance. For example, here is what a traditional architecture might look like (which hopefully looks rather familiar) This presents us with an issue in terms of application scalability, in that it is relatively easy to scale our application layer by throwing more hardware at it to increase the processing capacity. But the monolithic constraints of our data layer would only allow us to do this so far before diminishing returns or resource saturation stunted further performance increases; so what can we do to address this? In the past and in legacy architectures, the only solution would be to increase the performance capability of our database infrastructure, potentially by buying a bigger, faster server or by further tweaking and fettling the utilization of currently available resources. Both options are dramatic, either in terms of financial cost and/or manpower; so what else could we do? Data deciding to hang around In order for us to gain a bit more performance out of our existing setup, we can hold copies of our data away from the primary database and use these in preference wherever possible. There are a number of different strategies we could adopt, from transparent second-level caching layers to external key-value object storage. The detail and exact use of each varies significantly depending on the technology or its place in the architecture, but the main desire of these systems is to sit alongside the primary database infrastructure and attempt to protect it from an excessive load. This would then tend to lead to an increased performance of the primary database by reducing the overall dependency on it. However, this strategy tends to be only particularly valuable as a short-term solution, effectively buying us a little more time before the database once again starts to reach saturation. The other downside is that it only protects our database from read-based load; if our application is predominately write-heavy, this strategy has very little to offer. So our expanded architecture could look a bit like the following figure: Therein lies the problem However, in insulating the database from the read load, we have introduced a problem in the form of a cache consistency issue, in that, how does our local data cache deal with changing data underneath it within the primary database? The answer is rather depressing: it can't! The exact manifestation of any issues will largely depend on the data needs of the application and how frequently the data changes; but typically, caching systems will operate in one of the two following modes to combat the problem: Time bound cache: Holds entries for a defined period (time-to-live or TTL) Write through cache: Holds entries until they are invalidated by subsequent updates Time bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than the cache expiry time of it, the cache is providing no real benefit. Write through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write through caches exist within the overall architecture, then there will be consistency issues between them. We can avoid this by having a more intelligent cache, which features a communication mechanism between nodes, that can propagate entry invalidations to each other. In practice, an ideal cache would feature a combination of both features, so that entries would be held for a known maximum time, but also passes around invalidations as changes are made. So our evolved architecture would look a bit like the following figure: So far we've had a look through the general issues in scaling our data layer, and introduced strategies to help combat the trade-offs we will encounter along the way; however, the real world isn't quite as simple. There are various cache servers and in-memory database products in this area: however, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues we experienced with just our primary database, in that we could encounter resource saturation or capacity issues if the product is a single instance, or if the distribution doesn't provide consistency control, perhaps inconsistent data, which might harm our application. Breaking the mould Hazelcast is a radical new approach to data, designed from the ground up around distribution. It embraces a new scalable way of thinking; in that data should be shared around for both resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate. The first major feature to understand about Hazelcast is its master less nature; each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically delegating as to which node is responsible for what data. In this way as new nodes join or dropout, the process is repeated and the cluster rebalances accordingly. This makes Hazelcast incredibly simple to get up and running, as the system is self-discovering, self-clustering, and works straight out of the box. However, the second feature to remember is that we are persisting data entirely in-memory; this makes it incredibly fast but this speed comes at a price. When a node is shutdown, all the dta that was held on it is lost. We combat this risk to resilience through replication, by holding enough copies of a piece of data across multiple nodes. In the event of failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. But don't pull the plug on more than one node at a time, until the cluster has reacted to the change in membership and reestablished the appropriate number of backup copies of data. So when we introduce our new master less distributed cluster, we get something like the following figure: We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node is the owner of a number of partitions of the overall data, so the load will be fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than any individual node. We can address this issue simply by adding more nodes. In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used, as such we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is an internal operational detail and can vary over time — we as developers never actually need to know. If we imagine that our data is split into a number of partitions, that each partition slice is owned by one node and backed up on another, we could then visualize the interactions like the following figure: This means that for data belonging to Partition 1, our application will have to communicate to Node 1, Node 2 for data belonging to Partition 2, and so on. The slicing of the data into each partition is dynamic; so in practice, where there are more partitions than nodes, each node will own a number of different partitions and hold backups for others. As we have mentioned before, all of this is an internal operational detail, and our application does not need to know it, but it is important that we understand what is going on behind the scenes. Moving to new ground So far we have been talking mostly about simple persisted data and caches, but in reality, we should not think of Hazelcast as purely a cache, as it is much more powerful than just that. It is an in-memory data grid that supports a number of distributed collections and features. We can load in data from various sources into differing structures, send messages across the cluster; take out locks to guard against concurrent activity, and listen to the goings on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection, or function in a manner comparable to other similar technologies, but all with the distribution and resilience capabilities already built in. Standard utility collections Map: Key-value pairs List: Collection of objects? Set: Non-duplicated collection Queue: Offer/poll FIFO collection Specialized collection Multi-Map: Key-list of values collection Lock: Cluster wide mutex Topic: Publish/subscribe messaging Concurrency utilities AtomicNumber: Cluster-wide atomic counter IdGenerator: Cluster-wide unique identifier generation Semaphore: Concurrency limitation CountdownLatch: Concurrent activity gate-keeping Listeners: Application notifications as things happen In addition to data storage collections, Hazelcast also features a distributed executor service allowing runnable tasks to be created that can be run anywhere on the cluster to obtain, manipulate, and store results. We could have a number of collections containing source data, then spin up a number of tasks to process the disparate data (for example, averaging or aggregating) and outputting the results into another collection for consumption. Again, just as we could scale up our data capacities by adding more nodes, we can also increase the execution capacity in exactly the same way. This essentially means that by building our data layer around Hazelcast, if our application needs rapidly increase, we can continuously increase the number of nodes to satisfy seemingly extensive demands, all without having to redesign or re-architect the actual application. With Hazelcast, we are dealing more with a technology than a server product, a library to build a system around rather than retrospectively bolting it on, or blindly connecting to an off-the-shelf commercial system. While it is possible (and in some simple cases quite practical) to run Hazelcast as a separate server-like cluster and connect to it remotely from our application; some of the greatest benefits come when we develop our own classes and tasks run within it and alongside it. With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help solve. We can use the technology in many ways; in isolation to hold data such as user sessions, run it alongside a more long-term persistent data store to increase capacity, or shift towards performing high performance and scalable operations on our data. By moving more and more responsibility away from monolithic systems to such a generic scalable one, there is no limit to the performance we can unlock. This will allow us to keep our application and data layers separate, but enabling the ability to scale them up independently as our application grows. This will avoid our application becoming a victim of its own success, while hopefully taking the world by storm. Summary In this article, we learned about Hazelcast. With such a large range of generic capabilities, Hazelcast can solve a world of problems. Resources for Article: Further resources on this subject: JBoss AS Perspective [Article] Drools Integration Modules: Spring Framework and Apache Camel [Article] JBoss RichFaces 3.3 Supplemental Installation [Article]

0
0
3196

How-To Tutorials - Data

Understanding Point-In-Time-Recovery

Integrating Storm and Hadoop

SQL Server Integration Services (SSIS)

Working with Time

Oracle APEX 4.2 reporting

OAuth Authentication

Memory and cache

Managing a Hadoop Cluster

Working with Import Process (Intermediate)

What are SSAS 2012 dimensions and cube?

Trending Topics

Creating a pivot table

Schemas and Models

Creating Multivariate Charts

Making Big Data Work for Hadoop and Solr

What is Hazelcast?

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access