Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1204 Articles
article-image-integrating-storm-and-hadoop
Packt
04 Sep 2013
17 min read
Save for later

Integrating Storm and Hadoop

Packt
04 Sep 2013
17 min read
(For more resources related to this topic, see here.) In this article, we will implement the Batch and Service layers to complete the architecture. There are some key concepts underlying this big data architecture: Immutable state Abstraction and composition Constrain complexity Immutable state is the key, in that it provides true fault-tolerance for the architecture. If a failure is experienced at any level, we can always rebuild the data from the original immutable data. This is in contrast to many existing data systems, where the paradigm is to act on mutable data. This approach may seem simple and logical; however, it exposes the system to a particular kind of risk in which the state is lost or corrupted. It also constrains the system, in that you can only work with the current view of the data; it isn't possible to derive new views of the data. When the architecture is based on a fundamentally immutable state, it becomes both flexible and fault-tolerant. Abstractions allow us to remove complexity in some cases, and in others they can introduce complexity. It is important to achieve an appropriate set of abstractions that increase our productivity and remove complexity, but at an appropriate cost. It must be noted that all abstractions leak, meaning that when failures occur at a lower abstraction, they will affect the higher-level abstractions. It is therefore often important to be able to make changes within the various layers and understand more than one layer of abstraction. The designs we choose to implement our abstractions must therefore not prevent us from reasoning about or working at the lower levels of abstraction when required. Open source projects are often good at this, because of the obvious access to the code of the lower level abstractions, but even with source code available, it is easy to convolute the abstraction to the extent that it becomes a risk. In a big data solution, we have to work at higher levels of abstraction in order to be productive and deal with the massive complexity, so we need to choose our abstractions carefully. In the case of Storm, Trident represents an appropriate abstraction for dealing with the data-processing complexity, but the lower level Storm API on which Trident is based isn't hidden from us. We are therefore able to easily reason about Trident based on an understanding of lower-level abstractions within Storm. Another key issue to consider when dealing with complexity and productivity is composition. Composition within a given layer of abstraction allows us to quickly build out a solution that is well tested and easy to reason about. Composition is fundamentally decoupled, while abstraction contains some inherent coupling to the lower-level abstractions—something that we need to be aware of. Finally, a big data solution needs to constrain complexity. Complexity always equates to risk and cost in the long run, both from a development perspective and from an operational perspective. Real-time solutions will always be more complex than batch-based systems; they also lack some of the qualities we require in terms of performance. Nathan Marz's Lambda architecture attempts to address this by combining the qualities of each type of system to constrain complexity and deliver a truly fault-tolerant architecture. We divided this flow into preprocessing and "at time" phases, using streams and DRPC streams respectively. We also introduced time windows that allowed us to segment the preprocessed data. 
In this article, we complete the entire architecture by implementing the Batch and Service layers. The Service layer is simply a store of a view of the data. In this case, we will store this view in Cassandra, as it is a convenient place to access the state alongside Trident's state. The preprocessed view is identical to the preprocessed view created by Trident, counted elements of the TF-IDF formula (D, DF, and TF), but in the batch case, the dataset is much larger, as it includes the entire history. The Batch layer is implemented in Hadoop using MapReduce to calculate the preprocessed view of the data. MapReduce is extremely powerful, but like the lower-level Storm API, is potentially too low-level for the problem at hand for the following reasons: We need to describe the problem as a data pipeline; MapReduce isn't congruent with such a way of thinking Productivity We would like to think of a data pipeline in terms of streams of data, tuples within the stream and predicates acting on those tuples. This allows us to easily describe a solution to a data processing problem, but it also promotes composability, in that predicates are fundamentally composable, but pipelines themselves can also be composed to form larger, more complex pipelines. Cascading provides such an abstraction for MapReduce in the same way as Trident does for Storm. With these tools, approaches, and considerations in place, we can now complete our real-time big data architecture. There are a number of elements, that we will update, and a number of elements that we will add. The following figure illustrates the final architecture, where the elements in light grey will be updated from the existing recipe, and the elements in dark grey will be added in this article: Implementing TF-IDF in Hadoop TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand. Getting ready Before you begin, please ensure that you have installed Hadoop by following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. How to do it… Start by creating the project using the lein command: lein new tfidf-cascalog Next, you need to edit the project.clj file to include the dependencies: (defproject tfidf-cascalog "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.1"] [org.apache.cassandra/cassandra-all "1.1.5"] [clojurewerkz/cassaforte "1.0.0-beta11-SNAPSHOT"] [quintona/cascading-cassandra "0.0.7-SNAPSHOT"] [clj-time "0.5.0"] [cascading.avro/avro-scheme "2.2-SNAPSHOT"] [cascalog-more-taps "0.3.0"] [org.apache.httpcomponents/httpclient "4.2.3"]] :profiles{:dev{:dependencies[[org.apache.hadoop/hadoop-core "0.20.2-dev"] [lein-midje "3.0.1"] [cascalog/midje-cascalog "1.10.1"]]}}) It is always a good idea to validate your dependencies; to do this, execute lein deps and review any errors. 
In this particular case, cascading-cassandra has not been deployed to clojars, and so you will receive an error message. Simply download the source from https://github.com/quintona/cascading-cassandra and install it into your local repository using Maven. It is also good practice to understand your dependency tree. This is important to not only prevent duplicate classpath issues, but also to understand what licenses you are subject to. To do this, simply run lein pom, followed by mvn dependency:tree. You can then review the tree for conflicts. In this particular case, you will notice that there are two conflicting versions of Avro. You can fix this by adding the appropriate exclusions: [org.apache.cassandra/cassandra-all "1.1.5" :exclusions [org.apache.cassandra.deps/avro]] We then need to create the Clojure-based Cascade queries that will process the document data. We first need to create the query that will create the "D" view of the data; that is, the D portion of the TF-IDF function. This is achieved by defining a Cascalog function that will output a key and a value, which is composed of a set of predicates: (defn D [src] (let [src (select-fields src ["?doc-id"])] (<- [?key ?d-str] (src ?doc-id) (c/distinct-count ?doc-id :> ?n-docs) (str "twitter" :> ?key) (str ?n-docs :> ?d-str)))) You can define this and any of the following functions in the REPL, or add them to core.clj in your project. If you want to use the REPL, simply use lein repl from within the project folder. The required namespace (the use statement), require, and import definitions can be found in the source code bundle. We then need to add similar functions to calculate the TF and DF values: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) (defn TF [src] (<- [?key ?tf-count-str] (src ?doc-id ?time ?tf-word) (c/count ?tf-count) (str ?doc-id ?tf-word :> ?key) (str ?tf-count :> ?tf-count-str))) This Batch layer is only interested in calculating views for all the data leading up to, but not including, the current hour. This is because the data for the current hour will be provided by Trident when it merges this batch view with the view it has calculated. In order to achieve this, we need to filter out all the records that are within the current hour. The following function makes that possible: (deffilterop timing-correct? [doc-time] (let [now (local-now) interval (in-minutes (interval (from-long doc-time) now))] (if (< interval 60) false true)) Each of the preceding query definitions require a clean stream of words. The text contained in the source documents isn't clean. It still contains stop words. In order to filter these and emit a clean set of words for these queries, we can compose a function that splits the text into words and filters them based on a list of stop words and the time function defined previously: (defn etl-docs-gen [rain stop] (<- [?doc-id ?time ?word] (rain ?doc-id ?time ?line) (split ?line :> ?word-dirty) ((c/comp s/trim s/lower-case) ?word-dirty :> ?word) (stop ?word :> false) (timing-correct? ?time))) We will be storing the outputs from our queries to Cassandra, which requires us to define a set of taps for these views: (defn create-tap [rowkey cassandra-ip] (let [keyspace storm_keyspace column-family "tfidfbatch" scheme (CassandraScheme. 
cassandra-ip "9160" keyspace column-family rowkey {"cassandra.inputPartitioner""org.apache.cassandra.dht.RandomPartitioner" "cassandra.outputPartitioner" "org.apache.cassandra.dht.RandomPartitioner"}) tap (CassandraTap. scheme)] tap)) (defn create-d-tap [cassandra-ip] (create-tap "d"cassandra-ip)) (defn create-df-tap [cassandra-ip] (create-tap "df" cassandra-ip)) (defn create-tf-tap [cassandra-ip] (create-tap "tf" cassandra-ip)) The way this schema is created means that it will use a static row key and persist name-value pairs from the tuples as column:value within that row. This is congruent with the approach used by the Trident Cassandra adaptor. This is a convenient approach, as it will make our lives easier later. We can complete the implementation by a providing a function that ties everything together and executes the queries: (defn execute [in stop cassandra-ip] (cc/connect! cassandra-ip) (sch/set-keyspace storm_keyspace) (let [input (tap/hfs-tap (AvroScheme. (load-schema)) in) stop (hfs-delimited stop :skip-header? true) src (etl-docs-gen input stop)] (?- (create-d-tap cassandra-ip) (D src)) (?- (create-df-tap cassandra-ip) (DF src)) (?- (create-tf-tap cassandra-ip) (TF src)))) Next, we need to get some data to test with. I have created some test data, which is available at https://bitbucket.org/qanderson/tfidf-cascalog. Simply download the project and copy the contents of src/data to the data folder in your project structure. We can now test this entire implementation. To do this, we need to insert the data into Hadoop: hadoop fs -copyFromLocal ./data/document.avro data/document.avro hadoop fs -copyFromLocal ./data/en.stop data/en.stop Then launch the execution from the REPL: => (execute "data/document" "data/en.stop" "127.0.0.1") How it works… There are many excellent guides on the Cascalog wiki (https://github.com/nathanmarz/cascalog/wiki), but for completeness's sake, the nature of a Cascalog query will be explained here. Before that, however, a revision of Cascading pipelines is required. The following is quoted from the Cascading documentation (http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/): Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams. This concept is embodied in Cascalog through the definition of queries. A query takes a set of inputs and applies a list of predicates across the fields in each tuple of the input stream. Queries are composed through the application of many predicates. Queries can also be composed to form larger, more complex queries. In either event, these queries are reduced down into a Cascading pipeline. Cascalog therefore provides an extremely terse and powerful abstraction on top of Cascading; moreover, it enables an excellent development workflow through the REPL. Queries can be easily composed and executed against smaller representative datasets within the REPL, providing the idiomatic API and development workflow that makes Clojure beautiful. 
If we unpack the query we defined for TF, we will find the following code: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) The <- macro defines a query, but does not execute it. The initial vector, [?key ?df-count-str], defines the output fields, which is followed by a list of predicate functions. Each predicate can be one of the following three types: Generators: A source of data where the underlying source is either a tap or another query. Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter. Operations typically act within the scope of a single tuple. Aggregators: Functions that act across tuples to create aggregate representations of data. For example, count and sum. The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered as input variables for operations and output variables for generators and aggregators. The (src ?doc-id ?time ?df-word) predicate function names the first three values within the input tuple, whose names are applicable within the query scope. Therefore, if the tuple ("doc1" 123324 "This") arrives in this query, the variables would effectively bind as follows: ?doc-id: "doc1" ?time: 123324 ?df-word: "This" Each predicate within the scope of the query can use any bound value or add new bound variables to the scope of the query. The final set of bound values that are emitted is defined by the output vector. We defined three queries, each calculating a portion of the value required for the TF-IDF algorithm. These are fed from two single taps, which are files stored in the Hadoop filesystem. The document file is stored using Apache Avro, which provides a high-performance and dynamic serialization layer. Avro takes a record definition and enables serialization/deserialization based on it. The record structure, in this case, is for a document and is defined as follows: {"namespace": "storm.cookbook", "type": "record", "name": "Document", "fields": [ {"name": "docid", "type": "string"}, {"name": "time", "type": "long"}, {"name": "line", "type": "string"} ] } Both the stop words and documents are fed through an ETL function that emits a clean set of words that have been filtered. The words are derived by splitting the line field using a regular expression: (defmapcatop split [line] (s/split line #"[[](),.)s]+")) The ETL function is also a query, which serves as a source for our downstream queries, and defines the [?doc-id ?time ?word] output fields. The output tap, or sink, is based on the Cassandra scheme. A query defines predicate logic, not the source and destination of data. The sink ensures that the outputs of our queries are sent to Cassandra. The ?- macro executes a query, and it is only at execution time that a query is bound to its source and destination, again allowing for extreme levels of composition. The following, therefore, executes the TF query and outputs to Cassandra: (?- (create-tf-tap cassandra-ip) (TF src)) There's more… The Avro test data was created using the test data from the Cascading tutorial at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/. Within this tutorial is the rain.txt tab-separated data file. A new column was created called time that holds the Unix epoc time in milliseconds. 
The updated text file was then processed using some basic Java code that leverages Avro: Schema schema = Schema.parse(SandboxMain.class.getResourceAsStream("/document.avsc")); File file = new File("document.avro"); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); dataFileWriter.create(schema, file); BufferedReader reader = new BufferedReader(new InputStreamReader(SandboxMain.class.getResourceAsStream("/rain.txt"))); String line = null; try { while ((line = reader.readLine()) != null) { String[] tokens = line.split("t"); GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tokens[0]); docEntry.put("time", Long.parseLong(tokens[1])); docEntry.put("line", tokens[2]); dataFileWriter.append(docEntry); } } catch (IOException e) { e.printStackTrace(); } dataFileWriter.close(); Persisting documents from Storm In the previous recipe, we looked at deriving precomputed views of our data taking some immutable data as the source. In that recipe, we used statically created data. In an operational system, we need Storm to store the immutable data into Hadoop so that it can be used in any preprocessing that is required. How to do it… As each tuple is processed in Storm, we must generate an Avro record based on the document record definition and append it to the data file within the Hadoop filesystem. We must create a Trident function that takes each document tuple and stores the associated Avro record. Within the tfidf-topology project created in, inside the storm.cookbook.tfidf.function package, create a new class named PersistDocumentFunction that extends BaseFunction. Within the prepare function, initialize the Avro schema and document writer: public void prepare(Map conf, TridentOperationContext context) { try { String path = (String) conf.get("DOCUMENT_PATH"); schema = Schema.parse(PersistDocumentFunction.class .getResourceAsStream("/document.avsc")); File file = new File(path); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); if(file.exists()) dataFileWriter.appendTo(file); else dataFileWriter.create(schema, file); } catch (IOException e) { throw new RuntimeException(e); } } As each tuple is received, coerce it into an Avro record and add it to the file: public void execute(TridentTuple tuple, TridentCollector collector) { GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tuple.getStringByField("documentId")); docEntry.put("time", Time.currentTimeMillis()); docEntry.put("line", tuple.getStringByField("document")); try { dataFileWriter.append(docEntry); dataFileWriter.flush(); } catch (IOException e) { LOG.error("Error writing to document record: " + e); throw new RuntimeException(e); } } Next, edit the TermTopology.build topology and add the function to the document stream: documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); Finally, include the document path into the topology configuration: conf.put("DOCUMENT_PATH", "document.avro"); How it works… There are various logical streams within the topology, and certainly the input for the topology is not in the appropriate state for the recipes in this article containing only URLs. 
We therefore need to select the correct stream from which to consume tuples, coerce these into Avro records, and serialize them into a file. The previous recipe will then periodically consume this file. Within the context of the topology definition, include the following code: Stream documentStream = getUrlStream(topology, spout) .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source")); documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); The function should consume tuples from the document stream whose tuples are populated with already fetched documents.
Read more
  • 0
  • 0
  • 1293

article-image-sql-server-integration-services-ssis
Packt
03 Sep 2013
5 min read
Save for later

SQL Server Integration Services (SSIS)

Packt
03 Sep 2013
5 min read
(For more resources related to this topic, see here.) SSIS as an ETL – extract, transform, and load tool The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while moving the data between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can even be used to move data between external non-SQL systems without requiring SQL server to be the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file. SSIS also provides a workflow engine for automation of the different tasks (for example, data flows, tasks executions, and so on.) that are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL Agent can run multiple jobs independent of each other. An SSIS solution consists of one or more package, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, automation of command line commands, and others. In particular, a control flow usually includes one or more data flow tasks, which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has one control flow, and as many data flows as necessary. Data flow execution is dictated by the content of the control flow. A detailed discussion on SSIS and its components are outside the scope of this article and it assumes that you are familiar with the basic SSIS package development using Business Intelligence Development Studio (SQL Server 2005/2008/2008 R2) or SQL Server Data Tools (SQL Server 2012). If you are a beginner in SSIS, it is highly recommended to read from a bunch of good SSIS books available as a prerequisite. In the rest of this article, we will focus on how to consume Hive data from SSIS using the Hive ODBC driver. The prerequisites to develop the package shown in this article are SQL Server Data Tools, (which comes as a part of SQL Server 2012 Client Tools and Components) and the 32-bit Hive ODBC Driver installed. You will also need your Hadoop cluster up with Hive running on it. Developing the package SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, deploy, and develop SSIS packages. SSDT is installed when you choose to install SQL Server Client tools and Workstation Components from your SQL Server installation media. SSDT supports creation of Integration Services, Analysis Services, and Reporting Services projects. Here, we will focus on Integration Services project type. Creating the project Launch SQL Server Data Tools from SQL Server 2012 Program folders as shown in the following screenshot: Create a new Project and choose Integration Services Project in the New Project dialog as shown in the following screenshot: This should create the SSIS project with a blank Package.dtsx inside it visible in the Solution Explorer window of the project as shown in the following screenshot: Creating the Data Flow A Data Flow is a SSIS package component, which consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. 
Before you can add a data flow to a package, the package control flow must include a Data Flow task. The Data Flow task is the executable within the SSIS package, which creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. To create a Data Flow task, perform the following steps: Double-click (or drag-and-drop) on a Data Flow Task from the toolbox in the left. This should place a Data Flow Task in the Control Flow canvas of the package as in the following screenshot: Double-click on the Data Flow Task or click on the Data Flow tab in SSDT to edit the task and design the source and destination components as in the following screenshot: Creating the source Hive connection The first thing we need to do is create a connection manager that will connect to our Hive data tables hosted in the Hadoop cluster. We will use an ADO.NET connection, which will use the DSN HadoopOnLinux we created earlier to connect to Hive. To create the connection, perform the following steps: Right-click on the Connection Managers section in the project and click on New ADO.Net Connection... as shown in the following screenshot: From the list of providers, navigate to .Net Providers | ODBC Data Provider and click on OK in the Connection Manager window as shown in the following screenshot: Select the HadoopOnLinux DSN from the Data Sources list. Provide the Hadoop cluster credentials and test connection should succeed as shown in the following screenshot: Summary In this way we learned how to create an SQL Server Integration Services package to move data from Hadoop to SQL Server using the Hive ODBC driver. Resources for Article: Further resources on this subject: Microsoft SQL Azure Tools [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Getting Started with SQL Developer: Part 1 [Article]
Read more
  • 0
  • 0
  • 1827

article-image-working-time
Packt
03 Sep 2013
22 min read
Save for later

Working with Time

Packt
03 Sep 2013
22 min read
(For more resources related to this topic, see here.) Time handling features are an important part of every BI system. Programming languages, database systems, they all incorporate various time-related functions and Microsoft SQL Server Analysis Services (SSAS) is no exception there. In fact, that's one of its main strengths. The MDX language has various time-related functions designed to work with a special type of dimension called the Time and its typed attributes. While it's true that some of those functions work with any type of dimension, their usefulness is most obvious when applied to time-type dimensions. An additional prerequisite is the existence of multi-level hierarchies, also known as user hierarchies, in which types of levels must be set correctly or some of the time-related functions will either give false results or will not work at all. In this article we're dealing with typical operations, such as year-to-date calculations, running totals, and jumping from one period to another. We go into detail with each operation, explaining known and less known variants and pitfalls. We will discuss why some time calculations can create unnecessary data for the periods that should not have data at all, and why we should prevent it from happening. We will then show you how to prevent time calculations from having values after a certain point in time. In most BI projects, there are always reporting requirements to show measures for today, yesterday, month-to-date, quarter-to-date, year-to-date, and so on. We have three recipes to explore various ways to calculate today's date, and how to turn it into a set and use MDX's powerful set operations to calculate other related periods. Calculating date and time spans is also a common reporting requirement. Calculating the YTD (Year-To-Date) value In this recipe we will look at how to calculate the Year-To-Date value of a measure, that is, the accumulated value of all dates in a year up to the current member on the date dimension. An MDX function YTD() can be used to calculate the Year-To-Date value, but not without its constraints. In this recipe, we will discuss the constraints when using the YTD() function and also the alternative solutions. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In order for this type of calculation to work, we need a dimension marked as Time in the Type property, in the Dimension Structure tab of SSDT. That should not be a problem because almost every database contains at least one such dimension and Adventure Works is no exception here. In this example, we're going to use the Date dimension. We can verify in SSDT that the Date dimension's Type property is set to Time. See the following screenshot from SSDT: Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] Once executed, the preceding query returns reseller sales values for every week in the database. How to do it... We are going to use the YTD() function, which takes only one member expression, and returns all dates in the year up to the specified member. Then we will use the aggregation function SUM() to sum up the Reseller Sales Amount. Follow these steps to create a calculated measure with YTD calculation: Add the WITH block of the query. 
Create a new calculated measure within the WITH block and name it Reseller Sales YTD. The new measure should return the sum of the measure Reseller Sales Amount using the YTD() function and the current date member of the hierarchy of interest. Add the new measure on axis 0 and execute the complete query: WITH MEMBER [Measures].[Reseller Sales YTD] AS Sum( YTD( [Date].[Calendar Weeks].CurrentMember ), [Measures].[Reseller Sales Amount] ) SELECT { [Measures].[Reseller Sales Amount], [Measures].[Reseller Sales YTD] } ON 0, { [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1 FROM [Adventure Works] The result will include the second column, the one with the YTD values. Notice how the values in the second column increase over time: How it works... The YTD() function returns the set of members from the specified date hierarchy, starting from the first date of the year and ending with the specified member. The first date of the year is calculated according to the level [Calendar Year] marked as Years type in the hierarchy [Calendar Weeks]. In our example, the YTD() value for the member Week 9 CY 2008 is a set of members starting from Week 1 CY 2008 and going up to that member because the upper level containing years is of the Years type. The set is then summed up using the SUM() function and the Reseller Sales Amount measure. If we scroll down, we'll see that the cumulative sum resets every year, which means that YTD() works as expected. In this example we used the most common aggregation function, SUM(), in order to aggregate the values of the measure throughout the calculated set. SUM() was used because the aggregation type of the Reseller Sales Amount measure is Sum. Alternatively, we could have used the Aggregate() function instead. More information about that function can be found later in this recipe. There's more... Sometimes it is necessary to create a single calculation that will work for any user hierarchy of the date dimension. In that case, the solution is to prepare several YTD() functions, each using a different hierarchy, cross join them, and then aggregate that set using a proper aggregation function (Sum, Aggregate, and so on). However, bear in mind that this will only work if all user hierarchies used in the expression share the same year level. In other words, that there is no offset in years among them (such as exists between the fiscal and calendar hierarchies in Adventure Works cube in 2008 R2). Why does it have to be so? Because the cross join produces the set intersection of members on those hierarchies. Sets are generated relative to the position when the year starts. If there is offset in years, it is possible that sets won't have an intersection. In that case, the result will be an empty space. Now let's continue with a couple of working examples. Here's an example that works for both monthly and weekly hierarchies: WITHMEMBER [Measures].[Reseller Sales YTD] ASSum( YTD( [Date].[Calendar Weeks].CurrentMember ) *YTD( [Date].[Calendar].CurrentMember ),[Measures].[Reseller Sales Amount] )SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales YTD] } ON 0,{ [Date].[Calendar Weeks].[Calendar Week].MEMBERS } ON 1FROM[Adventure Works] If we replace [Date].[Calendar Weeks].[Calendar Week].MEMBERS with [Date].[Calendar].[Month].MEMBERS, the calculation will continue to work. Without the cross join part, that wouldn't be the case. Try it in order to see for yourself! 
Just be aware that if you slice by additional attribute hierarchies, the calculation might become wrong. In short, there are many obstacles to getting the time-based calculation right. It partially depends on the design of the time dimension (which attributes exist, which are hidden, how the relations are defined, and so on), and partially on the complexity of the calculations provided and their ability to handle various scenarios. A better place to define time-based calculation is the MDX script. There, we can define scoped assignments, but that's a separate topic which will be covered later in the recipe, Using utility dimension to implement time-based calculations. In the meantime, here are some articles related to that topic: http://tinyurl.com/MoshaDateCalcs http://tinyurl.com/DateToolDim Inception-To-Date calculation A similar calculation is the Inception-To-Date calculation in which we're calculating the sum of all dates up to the current member, that is, we do not perform a reset at the beginning of every year. In that case, the YTD() part of the expression should be replaced with this: Null : [Date].[Calendar Weeks].CurrentMember Using the argument in the YTD() function The argument of the YTD() function is optional. When not specified, the first dimension of the Time type in the measure group is used. More precisely, the current member of the first user hierarchy with a level of type Years. This is quite convenient in the case of a simple Date dimension; a dimension with a single user hierarchy. In the case of multiple hierarchies or a role-playing dimension, the YTD() function might not work, if we forget to specify the hierarchy for which we expect it to work. This can be easily verified. Omit the [Date].[Calendar Weeks].CurrentMember part in the initial query and see that both columns return the same values. The YTD() function is not working anymore. Therefore, it is best to always use the argument in the YTD() function. Common problems and how to avoid them In our example we used the [Date].[Calendar Weeks] user hierarchy. That hierarchy has the level Calendar Year created from the same attribute. The type of attribute is Years, which can be verified in the Properties pane of SSDT: However, the Date dimension in the Adventure Works cube has fiscal attributes and user hierarchies built from them as well. The fiscal hierarchy equivalent to [Date].[Calendar Weeks] hierarchy is the [Date].[Fiscal Weeks] hierarchy. There, the top level is named Fiscal Year, created from the same attribute. This time, the type of the attribute is FiscalYear, not Year. If we exchange those two hierarchies in our example query, the YTD() function will not work on the new hierarchy. It will return an error: The name of the solution is the PeriodsToDate() function. YTD() is in fact a short version of the PeriodsToDate() function, which works only if the Year type level is specified in a user hierarchy. When it is not so (that is, some BI developers tend to forget to set it up correctly or in the case that the level is defined as, let's say, FiscalYear like in this test), we can use the PeriodsToDate() function as follows: MEMBER [Measures].[Reseller Sales YTD] ASSum( PeriodsToDate( [Date].[Fiscal Weeks].[Fiscal Year],[Date].[Fiscal Weeks].CurrentMember ),[Measures].[Reseller Sales Amount] ) PeriodsToDate() might therefore be used as a safer variant of the YTD() function. 
YTD() and future dates It's worth noting that the value returned by a SUM-YTD combination is never empty once a value is encountered in a particular year. Only the years with no values at all will remain completely blank for all their descendants. In our example with the [Calendar Weeks] hierarchy, scrolling down to the Week 23 CY 2008, you will see that this is the last week that has reseller sales. However, the Year-To-Date value is not empty for the rest of the weeks for year 2008, as shown in the following screenshot: This can cause problems for the descendants of the member that represents the current year (and future years as well). The NON EMPTY keyword will not be able to remove empty rows, meaning we'll get YTD values in the future. We might be tempted to use the NON_EMPTY_BEHAVIOR operator to solve this problem but it wouldn't help. Moreover, it would be completely wrong to use it, because it is only a hint to the engine which may or may not be used. It is not a mechanism for removing empty values. In short, we need to set some rows to null, those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the recipe, Finding the last date with data. Calculating the YoY (Year-over-Year) growth (parallel periods) This recipe explains how to calculate the value in a parallel period, the value for the same period in a previous year, previous quarter, or some other level in the date dimension. We're going to cover the most common scenario – calculating the value for the same period in the previous year, because most businesses have yearly cycles. A ParallelPeriod() is a function that is closely related to time series. It returns a member from a prior period in the same relative position as a specified member. For example, if we specify June 2008 as the member, Year as the level, and 1 as the lag, the ParallelPeriod() function will return June 2007. Once we have the measure from the prior parallel period, we can calculate how much the measure in the current period has increased or decreased with respect to the parallel period's value. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button, and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] Once executed, the previous query returns the value of Reseller Sales Amount for all fiscal months. How to do it... Follow these steps to create a calculated measure with YoY calculation: Add the WITH block of the query. Create a new calculated measure there and name it Reseller Sales PP. The new measure should return the value of the measure Reseller Sales Amount measure using the ParallelPeriod() function. In other words, the definition of the new measure should be as follows: MEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ) Specify the format string property of the new measure to match the format of the original measure. In this case that should be the currency format. Create the second calculated measure and name it Reseller Sales YoY %. The definition of that measure should be the ratio of the current member's value against the parallel period member's value. 
Be sure to handle potential division by zero errors (see the recipe Handling division by zero errors). Include both calculated measures on axis 0 and execute the query, which should look like: WITHMEMBER [Measures].[Reseller Sales PP] As( [Measures].[Reseller Sales Amount],ParallelPeriod( [Date].[Fiscal].[Fiscal Year], 1,[Date].[Fiscal].CurrentMember ) ), FORMAT_STRING = 'Currency'MEMBER [Measures].[Reseller Sales YoY %] Asiif( [Measures].[Reseller Sales PP] = 0, null,( [Measures].[Reseller Sales Amount] /[Measures].[Reseller Sales PP] ) ), FORMAT_STRING = 'Percent'SELECT{ [Measures].[Reseller Sales Amount],[Measures].[Reseller Sales PP],[Measures].[Reseller Sales YoY %] } ON 0,{ [Date].[Fiscal].[Month].MEMBERS } ON 1FROM[Adventure Works] The result will include two additional columns, one with the PP values and the other with the YoY change. Notice how the values in the second column repeat over time and that YoY % ratio shows the growth over time: How it works... The ParallelPeriod() function takes three arguments, a level expression, an index, and a member expression, and all three arguments are optional. The first argument indicates the level on which to look for that member's ancestor, typically the year level like in this example. The second argument indicates how many members to go back on the ancestor's level, typically one, as in this example. The last argument indicates the member for which the function is to be applied. Given the right combination of arguments, the function returns a member that is in the same relative position as a specified member, under a new ancestor. The value for the parallel period's member is obtained using a tuple which is formed with a measure and the new member. In our example, this represents the definition of the PP measure. The growth is calculated as the ratio of the current member's value over the parallel period member's value, in other words, as a ratio of two measures. In our example, that was YoY % measure. In our example we've also taken care of a small detail, setting the FORMAT_STRING to Percent. There's more... The ParallelPeriod() function is very closely related to time series, and typically used on date dimensions. However, it can be used on any type of dimension. For example, this query is perfectly valid: SELECT{ [Measures].[Reseller Sales Amount] } ON 0,{ ParallelPeriod( [Geography].[Geography].[Country],2,[Geography].[Geography].[State-Province].&[CA]&[US] ) } ON 1FROM[Adventure Works] The query returns Hamburg on rows, which is the third state-province in the alphabetical list of states-provinces under Germany. Germany is two countries back from the USA, whose member California, used in this query, is the third state-province underneath that country in the Geography.Geography user hierarchy. We can verify this by browsing the Geography user hierarchy in the Geography dimension in SQL Server Management Studio, as shown in the following screenshot. The UK one member back from the USA, has only one state-province: England. If we change the second argument to 1 instead, we'll get nothing on rows because there's no third state-province under the UK. Feel free to try it: All arguments of the ParallelPeriod() function are optional. When not specified, the first dimension of type Time in the measure group is used, more precisely, the previous member of the current member's parent. This can lead to unexpected results as discussed in the previous recipe. 
Therefore, it is recommended that you use all the arguments of the ParallelPeriod() function. ParallelPeriod is not a time-aware function The ParallelPeriod() function simply looks for the member from the prior period based on its relative position to its ancestor. For example, if your hierarchy is missing the first six months in the year 2005, for member January 2006, the function will find July 2005 as its parallel period (lagging by one year) because July is indeed the first month in the year 2005. This is exactly the case in Adventure Works DW SSAS prior to 2012. You can test the following scenario in Adventure Works DW SSAS 2008 R2. In our example we used the [Date].[Fiscal] user hierarchy. That hierarchy has all 12 months in every year which is not the case with the [Date].[Calendar] user hierarchy where there's only six months in the first year. This can lead to strange results. For example, if you search-replace the word "Fiscal" with the word "Calendar" in the query we used in this recipe, you'll get this as the result: Notice how the values are incorrect for the year 2006. That's because the ParallelPeriod() function is not a time-aware function, it merely does what it's designed for taking the member that is in the same relative position. Gaps in your time dimension are another potential problem. Therefore, always make the complete date dimensions, with all 12 months in every year and all dates in them, not just working days or similar shortcuts. Remember, Analysis Services isn't doing the date math. It's just navigating using the member's relative position. Therefore, make sure you have laid a good foundation for that. However, that's not always possible. There's an offset of six months between fiscal and calendar years, meaning if you want both of them as date hierarchies, you have a problem; one of them will not have all of the months in the first year. The solution is to test the current member in the calculation and to provide a special logic for the first year, fiscal or calendar; the one that doesn't have all months in it. This is most efficiently done with a scope statement in the MDX script. Another problem in calculating the YoY value is leap years. Calculating moving averages The moving average, also known as the rolling average, is a statistical technique often used in events with unpredictable short-term fluctuations in order to smooth their curve and to visualize the pattern of behavior. The key to get the moving average is to know how to construct a set of members up to and including a specified member, and to get the average value over the number of members in the set. In this recipe, we're going to look at two different ways to calculate moving averages in MDX. Getting ready Start SQL Server Management Studio and connect to your SSAS 2012 instance. Click on the New Query button and check that the target database is Adventure Works DW 2012. In this example we're going to use the Date hierarchy of the Date dimension. Here's the query we'll start from: SELECT{ [Measures].[Internet Order Count] } ON 0,{ [Date].[Date].[Date].MEMBERS} ON 1FROM[Adventure Works] Execute it. The result shows the count of Internet orders for each date in the Date.Date attribute hierarchy. Our task is to calculate the simple moving average (SMA) for dates in the year 2008 based on the count of orders in the previous 30 days. How to do it... 
We are going to use the LastPeriods() function with a 30 day moving window, and a member expression, [Date].[Date].CurrentMember, as two parameters, and also the AVG() function, to calculate the moving average of Internet order count in the last 30 days. Follow these steps to calculate moving averages: Add the WHERE part of the query and put the year 2006 inside using any available hierarchy. Add the WITH part and define a new calculated measure. Name it SMA 30. Define that measure using the AVG() and LastPeriods() functions. Test to see if you get a managed query similar to this. If so, execute it: WITHMEMBER [Measures].[SMA 30] ASAvg( LastPeriods( 30, [Date].[Date].CurrentMember ),[Measures].[Internet Order Count] )SELECT{ [Measures].[Internet Order Count],[Measures].[SMA 30] } ON 0,{ [Date].[Date].[Date].MEMBERS } ON 1FROM[Adventure Works]WHERE( [Date].[Calendar Year].&[2008] ) The second column in the result set will represent the simple moving average based on the last 30 days. Our final result will look like the following screenshot: How it works... The moving average is a calculation that uses the moving window of N items for which it calculates the statistical mean, that is, the average value. The window starts with the first item and then progressively shifts to the next one until the whole set of items is passed. The function that acts as the moving window is the LastPeriods() function. It returns N items, in this example, 30 dates. That set is then used to calculate the average orders using the AVG() function. Note that the number of members returned by the LastPeriods() function is equal to the span, 30, starting with the member that lags 30 - 1 from the specified member expression, and ending with the specified member. There's more... Another way of specifying what the LastPeriods() function does is to use a range of members with a range-based shortcut. The last member of the range is usually the current member of the hierarchy on an axis. The first member is the N-1th member moving backwards on the same level in that hierarchy, which can be constructed using the Lag(N-1) function. The following expression employing the Lag() function and a range-based shortcut is equivalent to the LastPeriods() in the preceding example: [Date].[Date].CurrentMember.Lag(29) : [Date].[Date].CurrentMember Note that the members returned from the range-based shortcut are inclusive of both the starting member and the ending member. We can easily modify the moving window scope to fit different requirements. For example, in case we need to calculate a 30-day moving average up to the previous member, we can use this syntax: [Date].[Date].CurrentMember.Lag(30) : [Date].[Date].PrevMember The LastPeriods() function is not on the list of optimized functions on this web page: http://tinyurl.com/Improved2008R2. However, tests show no difference in duration with respect to its range alternative. Still, if you come across a situation where the LastPeriods() function performs slowly, try its range alternative. Finally, in case we want to parameterize the expression (for example, to be used in SQL Server Reporting Services), these would be generic forms of the previous expressions: [Date].[Date].CurrentMember.Lag( @span - @offset ) :[Date].[Date].CurrentMember.Lag( @offset ) And LastPeriods( @span, [Date].[Date].CurrentMember.Lag( @offset ) ) The @span parameter is a positive value which determines the size of the window. 
The @offset parameter determines how much the right side of the window is moved from the current member's position. This shift can be either a positive or negative value. The value of zero means there is no shift at all, the most common scenario. Other ways to calculate the moving averages The simple moving average is just one of many variants of calculating the moving averages. A good overview of a possible variant can be found in Wikipedia: http://tinyurl.com/WikiMovingAvg MDX examples of other variants of moving averages can be found in Mosha Pasumansky's blog article: http://tinyurl.com/MoshaMovingAvg Moving averages and the future dates It's worth noting that the value returned by the moving average calculation is not empty for dates in future because the window is looking backwards, so that there will always be values for future dates. This can be easily verified by scrolling down in our example using the LastPeriods() function, as shown in the following screenshot: In this case the NON EMPTY keyword will not be able to remove empty rows. We might be tempted to use NON_EMPTY_BEHAVIOR to solve this problem but it wouldn't help. Moreover, it would be completely wrong. We don't want to set all the empty rows to null, but only those positioned after the member representing today's date. We'll cover the proper approach to this challenge in the following recipes. Summary This article presents various time-related functions in MDX language that are designed to work with a special type of dimension called the Time and its typed attributes. Resources for Article: Further resources on this subject: What are SSAS 2012 dimensions and cube? [Article] Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Terms and Concepts Related to MDX [Article]
Read more
  • 0
  • 0
  • 2656
Visually different images

article-image-oracle-apex-42-reporting
Packt
03 Sep 2013
20 min read
Save for later

Oracle APEX 4.2 reporting

Packt
03 Sep 2013
20 min read
(For more resources related to this topic, see here.) The objective of the first chapter is to quickly introduce you to the technology and then dive deep into the understanding the fabric of the tool. The chapter also helps you set the environment, which will be used throughout the book. Chapter 1, Know Your Horse Before You Ride It, starts with discussing the various features of APEX. This is to give heads up to the readers about the features offered by the tool and to inform them about some of the strengths of the tool. In order to understand the technology better, we discuss the various web server combinations possible with APEX, namely the Internal mod_plsql, External mod_plsql, and Listener configuration. While talking about the Internal mod_plsql configuration, we see the steps to enable the XMLDB HTTP server. In the Internal mod_plsql configuration, Oracle uses a DAD defined in EPG to talk to the database and the web server. So, we try to create a miniature APEX of our own, by creating our DAD using it to talk to the database and the web server. We then move on to learn about the External mod_plsql configuration. We discuss the architecture and the roles of the configuration files, such as dads.conf and httpd.conf. We also have a look at a typical dads.conf file and draw correlations between the configurations in the Internal and External mod_plsql configuration. We then move on to talk about the wwv_flow_epg_include_mod_local procedure that can help us use the DAD of APEX to call our own stored PL/SQL procedures. We then move on to talk about APEX Listener which is a JEE alternative to mod_plsql and is Oracle’s direction for the future. Once we are through with understanding the possible configurations, we see the steps to set up our environment. We use the APEX Listener configuration and see the steps to install the APEX engine, create a Weblogic domain, set the listener in the domain, and create an APEX workspace. With our environment in place, we straight away get into understanding the anatomy of APEX by analyzing various parts of its URL. This discussion includes a natter on the session management, request handling, debugging, error handling, use of TKPROF for tracing an APEX page execution, cache management and navigation, and value passing in APEX. We also try to understand the design behind the zero session ID in this section. Our discussions till now would have given you a brief idea about the technology, so we try to dig in a little deeper and understand the mechanism used by APEX to send web requests to its PL/SQL engine in the database by decoding the APEX page submission. We see the use of the wwv_flow.accept procedure and understand the role of page submission. We try to draw an analogy of an APEX form with a simple HTML to get a thorough understanding about the concept. The next logical thing after page submission is to see the SQL and PL/SQL queries and blocks reaching the database. We turn off the database auditing and see the OWA web toolkit requests flowing to the database as soon as we open an APEX page. We then broaden our vision by quickly knowing about some of the lesser known alternatives of mod_plsql. We end the chapter with a note of caution and try to understand the most valid criticisms of the technology, by understanding the SQL injection and Cross-site Scripting (XSS). After going through the architecture we straight away spring into action and begin the process of learning how to build the reports in APEX. 
The objective of this chapter is to help you understand and implement the most common reporting requirements along with introducing some interesting ways to frame the analytical queries in Oracle. The chapter also hugely focuses on the methods to implement different kinds of formatting in the APEX classic reports. We start Chapter 2, Reports, by creating the objects that will be used throughout the book and installing the reference application that contains the supporting code for the topics discussed in the second chapter. We start the chapter by setting up an authentication mechanism. We discuss external table authentication in this chapter. We then move on to see the mechanism of capturing the environment variables in APEX. These variables can help us set some logic related to a user’s session and environment. The variables also help us capture some properties of the underlying database session of an APEX session. We capture the variables using the USERENV namespace, DBMS_SESSION package, and owa_util package. After having a good idea of the ways and means to capture the environment variables, we build our understanding of developing the search functionality in a classic APEX report. This is mostly a talk about the APEX classic report features, we also use this opportunity to see the process to enable sorting in the report columns, and to create a link that helps us download the report in the CSV format. We then discuss various ways to implement the group reports in APEX. The discussion shows a way to implement this purely, by using the APEX’s feature, and then talks about getting the similar results by using the Oracle database feature. We talk about the APEX’s internal grouping feature and the Oracle grouping sets. The section also shows the first use of JavaScript to manipulate the report output. It also talks about a method to use the SQL query to create the necessary HTML, which can be used to display a data column in a classic report. We take the formatting discussion further, by talking about a few advanced ways of highlighting the report data using the APEX’s classic report features, and by editing the APEX templates. After trying our hand at formatting, we try to understand the mechanism to implement matrix reports in APEX. We use matrix reports to understand the use of the database features, such as the with clause, the pivot operator, and a number of string aggregation techniques in Oracle. The discussion on the string aggregation techniques includes the talk on the LISTAGG function, the wm_concat function, and the use of the hierarchical queries for this purpose. We also see the first use of the APEX items as the substitution variables in this book. We will see more use of the APEX items as the substitution variables to solve the vexed problems in other parts of the book as well. We do justice with the frontend as well, by talking about the use of jQuery, CSS, and APEX Dynamic Actions for making the important parts of data stand out. Then, we see the implementation of the handlers, such as this.affectedElements and the use of the jQuery functions, such as this.css() in this section. We also see the advanced formatting methods using the APEX templates. We then end this part of the discussion, by creating a matrix report using the APEX Dynamic Query report region. 
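To give a concrete flavour of the string aggregation and matrix techniques named above, here is a hedged sketch; it uses the standard HR sample schema (employees, department_id, job_id, salary) rather than the book's oehr_ tables, so treat the object names as assumptions:

-- String aggregation with LISTAGG: one row per department, employee names as a list
SELECT department_id,
       LISTAGG(last_name, ', ') WITHIN GROUP (ORDER BY last_name) AS employees
FROM employees
GROUP BY department_id;

-- A simple matrix (cross-tab) report using a WITH clause and the PIVOT operator
WITH base AS (
  SELECT department_id, job_id, salary FROM employees
)
SELECT *
FROM base
PIVOT (SUM(salary) FOR job_id IN ('IT_PROG' AS it_prog, 'SA_REP' AS sa_rep, 'ST_CLERK' AS st_clerk));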
The hierarchical reports are always intriguing, because of the enormous ability to present the relations among different rows of data, and because of the use of the hierarchical queries in a number of unrelated places to answer the business queries. We reserve a discussion on the Oracle’s hierarchical queries for a later part of the section and start it, by understanding the implementation of the hierarchical reports by linking the values in APEX using drilldowns. We see the use of the APEX items as the substitution variable to create the dynamic messages. Since we have devised a mechanism to drill down, we should also build a ladder for the user to climb up the hierarchical chain. Now, the hierarchical chain for every person will be different depending on his position in the organization, so we build a mechanism to build the dynamic bread crumbs using the PL/SQL region in APEX. We then talk about two different methods of implementing the hierarchical queries in APEX. We talk about the connect, by clause first, and then continue our discussion by learning the use of the recursive with clause for the hierarchical reporting. We end our discussion on the hierarchical reporting, by talking about creating the APEX’s Tree region that displays the hierarchical data in the form of a tree. The reports are often associated with the supporting files that give more information about the business query. A typical user might want to upload a bunch of files while executing his piece of the task and the end user of his action might want to check out the files uploaded by him/her. To understand the implementation of this requirement, we check out the various ways to implement uploading and downloading the files in APEX. We start our discussion, by devising the process of uploading the files for the employees listed in the oehr_employees table. The solution of uploading the files involves the implementation of the dynamic action to capture the ID of the employee on whose row the user has clicked. It shows the use of the APEX items as the substitution variables for creating the dynamic labels and the use of JavaScript to feed one APEX item based on the another. We extensively talk about the implementation of jQuery in Dynamic Actions in this section as well. Finally, we check out the use of the APEX’s file browse item along with the WWV_FLOW_FILES table to capture the file uploaded by the user. Discussion on the methods to upload the files is immediately followed by talking about the ways to download these files in APEX. We nest the use of the functions, such as HTF.ANCHOR and APEX_UTIL.GET_BLOB_FILE_SRC for one of the ways to download a file, and also talk about the use of dbms_lob.getlength along with the APEX format mask for downloading the files. We then engineer our own stored procedure that can download a blob stored in the database as a file. We end this discussion, by having a look at the APEX’s p process, which can also be used for downloading. AJAX is the mantra of the new age and we try our hand at it, by implementing the soft deletion in APEX. We see the mechanism of refreshing just the report and not reloading the page as soon as the user clicks to delete a row. We use JavaScript along with the APEX page process and the APEX templates to achieve this objective. Slicing and dicing along with auditing are two of the most common requirements in the reporting world. 
We see the implementation of both of these using the traditional method of JavaScript with page processes and the newer method of using Dynamic Actions. We extend our use case a little further and learn about a two-way interaction between a JavaScript function and a page process. We learn to pass values back and forth between the two. While most business intelligence and reporting solutions are a one-way road focused on data presentation, Oracle APEX can go a step further and give the user an interface for data manipulation as well. To understand this strength of APEX, we have a look at the process of creating tabular forms in APEX. We extend our understanding of tabular forms to see a magical use of jQuery to convert certain sections of a report from display-only to editable textboxes. We move our focus from implementing interesting and tricky frontend requirements to framing the queries to display complex data types. We use the string aggregation methods to display data in a column containing a varray. The time dimension is one of the most widely used dimensions in reporting circles, and comparing current performance with past records is a favorite requirement of most businesses. With this in mind, we shift our focus to understanding and implementing time series reports in APEX. We start our discussion by understanding the method to implement a report that shows the contribution of each business line in every quarter. We implement this by using partitioning dimensions on the fly. We also use analytical functions, such as ratio_to_report, lead, and lag, in the process of creating the time series reports. We use our understanding of the time dimension to build a report that helps a user compare one time period with another. The report gives the user the freedom to select the time segments that he wishes to compare. We then work around a limitation of this report by using the query partition clause for data densification. We bring our discussion on reports based on the time dimension to a close by presenting a report based on the model clause. The report serves as an example of the enormous coding possibilities opened up by the model clause in Oracle. Chapter 3, In the APEX Mansion – Interactive Reports, is all about Interactive Reports and dynamic reporting. While the second chapter was more about data presentation using complex queries and presentation methods, this chapter takes the presentation part a step further by creating more visually appealing Interactive Reports. We start the discussion of this chapter by talking about the ins and outs of Interactive Reports. We dissect this feature of APEX to learn about every possible way of using Interactive Reports to make more sense of data. The chapter has a reference application of its own, which shows the code in action. We start our discussion by exploring the various features of the Actions menu in Interactive Reports. 
The discussion is on search functionality, Select Columns feature, filtering, linking and filtering Interactive Reports using URLs, customizing Rows per page feature of an IR, using Control Break, creating computations in IR, creating charts in IR, using the Flashback feature and a method to see the back end flashback query, configuring the e-mail functionality for downloading a report, and for subscription of reports and methods to understand the download of reports in HTML, CSV, and PDF formats. Once we are through with understanding the Actions menu, we move on to understand the various configuration options in an IR. We talk about the Link section, the Icon View section, the Detail section, the Column Group section, and the Advanced section of the Report Attributes page of an IR. While discussing these sections, we understand the process of setting different default views of an IR for different users. Once we are through with our dissection of an IR, we put our knowledge to action, by inserting our own item in the Actions menu of an IR using Dynamic Actions and jQuery. We continue our quest for finding the newer formatting methods, by using different combinations of SQL, CSS, APEX templates, and jQuery to achieve unfathomable results. The objectives attained in this section include formatting a column of an IR based on another column, using CSS in the page header to format the APEX data, changing the font color of the alternate rows in APEX, using a user-defined CSS class in APEX, conditionally highlighting a column in IR using CSS and jQuery, and formatting an IR using the region query. After going through a number of examples on the use of CSS and jQuery in APEX, we lay down a process to use any kind of changes in an IR. We present this process by an example that changes one of the icons used in an IR to a different icon. APEX also has a number of views for IR, which can be used for intelligent programming. We talk about an example that uses the apex_application_page_ir_rpt view that show different IR reports on the user request. After a series of discussion on Interactive Reports, we move on to find a solution to an incurable problem of IR. We see a method to put multiple IR on the same page and link them as the master-child reports. We had seen an authentication mechanism (external table authentication) and data level authorization in Chapter 2, Reports. We use this chapter to see the object level authorization in APEX. We see the method to give different kinds of rights on the data of a column in a report to different user groups. After solving a number of problems and learning a number of things, we create a visual treat for ourselves. We create an Interactive report dashboard using Dynamic Actions. This dashboard presents different views of an IR as gadgets on a separate APEX page. We conclude this chapter, by looking at advanced ways of creating the dynamic reports in APEX. We look at the use of the table function in both native and interface approach, and we also look at the method to use the APEX collections for creating the dynamic IRs in APEX. Chapter 4, The Fairy Tale Begins – Advanced Reporting, is all about advanced reporting and pretty graphs. Since we are talking about advanced reporting, we see the process of setting the LDAP authentication in APEX. We also see the use of JXplorer to help us get the necessary DN for setting up the LDAP authentication. We also see the means to authenticate an LDAP user using PL/SQL in this section. 
This chapter also has a reference application of its own that shows the code in action. We start the reporting building in this chapter, by creating the Sparkline reports. This report uses jQuery for producing the charts. We then move on to use another jQuery library to create a report with slider. This report lets the user set the value of the salary of any employee using a slider. We then get into the world of the HTML charts. We start our talk, by looking at the various features of creating the HTML chart regions in APEX. We understand a method to implement the Top N and Bottom N chart reports in the HTML charts. We understand the APEX’s method to implement the HTML charts and use it to create an HTML chart on our own. We extend this technique of generating HTML from region source a little further, by using XMLTYPE to create the necessary HTML for displaying a report. We take our spaceship into a different galaxy of the charts world and see the use of Google Visualizations for creating the charts in APEX. We then switch back to AnyChart, which is a flash charting solution, and has been tightly integrated with APEX. It works on an XML and we talk about customizing this XML to produce different results. We put our knowledge to action, by creating a logarithmic chart and changing the style of a few display labels in APEX. We continue our discussion on AnyChart, and use the example of Doughnut Chart to understand advanced ways of using AnyChart in APEX. We use our knowledge to create Scatter chart, 3D stacked chart, Gauge chart, Gantt chart, Candlestick chart, Flash image maps, and SQL calendars. We move out of the pretty heaven of AnyChart only to get into another beautiful space of understanding the methods of displaying the reports with the images in APEX. We implement the reports with the images using the APEX’s format masks and also using HTF.IMG with the APEX_UTIL.GET_BLOB_FILE_SRC function. We then divert our attention to advanced jQuery uses such as, the creation of the Dialog box and the Context menu in APEX. We create a master-detail report using dialog boxes, where the child report is shown in the Dialog box. We close our discussion in this chapter, by talking about creating wizards in APEX and a method to show different kinds of customized error messages for problems appearing at different points of a page process. Chapter 5, Flight to Space Station: Advanced APEX, is all about advanced APEX. The topics discussed in this chapter fall into a niche category of reporting the implementation. This chapter has two reference applications of its own that shows the code in action. We start this chapter, by creating both the client-side and server-side image maps APEX. These maps are often used where regular shapes are involved. We then see the process of creating the PL/SQL Server Pages (PSPs). PSPs are similar to JSP and are used for putting the PL/SQL and HTML code in a single file. Loadpsp then converts this file into a stored procedure with the necessary calls to owa web toolkit functions. The next utility we check out is loadjava. This utility helps us load a Java class as a database object. This utility will be helpful, if some of the Java classes are required for code processing. We had seen the use of AnyChart in the previous chapter, and this chapter introduces you to FusionChart. We create a Funnel Chart using FusionChart in this chapter. We also use our knowledge of the PL/SQL region in APEX to create a Tag Cloud. We then stroll into the world of the APEX plugins. 
We understand the interface functions, which have to be created for plugins. We discuss the concepts and then use them to develop an Item type plugin and a Dynamic Action plugin. We understand the process of defining a Custom Attribute in a plugin, and then see the process of using it in APEX. We move on to learn about Websheets and their various features. We acquaint ourselves with the interface of the Websheet applications and understand the concept of sharing Websheets. We also create a few reports in a Websheet application and see the process of importing and using an image in Websheets. We then spend some time to learn about data grids in Websheets and the method to create them. We have a look at the Administration and View dropdowns in a Websheet application. We get into the administration mode and understand the process of configuring the sending of mails to Gmail server from APEX. We extend our administration skills further, by understanding the ways to download an APEX application using utilities, such as oracle.apex.APEXExport. Reporting and OLAP go hand in hand, so we see the method of using the OLAP cubes in APEX. We see the process of modeling a cube and understand the mechanism to use its powerful features. We then have a talk about Oracle Advanced Queues that can enable us to do reliable communication among different systems with different workloads in the enterprise, and gives improved performance. We spend a brief time to understand some of the other features of APEX, which might not be directly related to reporting, but are good to know. Some of these features include locking and unlocking of pages in APEX, the Database Object Dependencies report, Shortcuts, the Dataloading wizard, and the APEX views. We bring this exclusive chapter to an end, by discussing about the various packages that enable us to schedule the background jobs in APEX, and by discussing various other APEX and Database API, which can help us in the development process.
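As a hedged illustration of the job-scheduling packages referred to at the end of this summary — the job name and the my_refresh_report_data procedure are hypothetical, not objects from the book — a DBMS_SCHEDULER call typically takes this shape:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'NIGHTLY_REPORT_REFRESH',             -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN my_refresh_report_data; END;', -- hypothetical procedure
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',                -- run every day at 2 a.m.
    enabled         => TRUE);
END;
/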

OAuth Authentication

Packt
02 Sep 2013
6 min read
(For more resources related to this topic, see here.) Understanding OAuth OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider, and is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are resources that want to protect resources, such as a SAML Service Provider. If you have ever been to a site that has asked you to log in using your Twitter or LinkedIn credentials then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) is never passed to the OAuth Client, just a range of tokens that the Client requested from the Provider and which are authorized by the user. OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. Preparing Facebook as an OAuth Provider Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t let a redirect to a URL (such as our OpenAM installation) without being aware of the URL. The steps for preparing Facebook as an OAuth provider are as follows: Under the settings for the App in the section Website with Facebook Login we need to add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp as shown in the following screenshot: Click on the Save Changes button on Facebook. My OpenAM installation for this chapter was directly available on the Internet just in case Facebook checked for a valid URL destination. Configuring an OAuth authentication module OpenAM has the concept of authentication modules, which support different ways of authentication, such as OAuth, or against its Data Store, or LDAP or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client. Log in to OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm). Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button. Enter a name for the New Module Instance and select OAuth 2.0 as the Type and click on the OK button. I used the name Facebook. You will then see a screen as shown: For Client Id, use the App ID value provided from Facebook. For the Client Secret use the App Secret value provided from Facebook as shown in the preceding screenshot. Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their default values. Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream as shown in the preceding screenshot. Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value. The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. This goes the same for the Attribute Mapper Configuration. 
We’ll leave all these sections as their defaults as shown in the preceding screenshot. OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled as shown in the preceding screenshot. When a user authenticates against an OAuth Provider, they are likely to not already have an account with OpenAM. If they do not have a valid OpenAM account then they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled as shown in the preceding screenshot. Forcing authentication against particular authentication modules In the writing of this book I disabled the Create account if it does not exist option while I was testing. Then when I tried to log into OpenAM I was redirected to Facebook, which then passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials I could not log in. For your own testing, it would be recommended to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than OAuth authentication module, and log in successfully. For our newly created accounts we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled. The flip side to Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL which we could possibly call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM is to either not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user. If we had set earlier that we wanted to enforce password and activation token policies, then we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank. Click on the Save button. Summary This article served as a quick primer on what OAuth is and how to achieve it with OpenAM. It covered the concept of using Facebook as an OAuth provider and configuring an OAuth module. It focused on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. This would really help when we might want to allow authentication against Facebook or Google. Resources for Article: Further resources on this subject: Getting Started with OpenSSO [Article] OpenAM: Oracle DSEE and Multiple Data Stores [Article] OpenAM Identity Stores: Types, Supported Types, Caching and Notification [Article]

Memory and cache

Packt
02 Sep 2013
10 min read
(For more resources related to this topic, see here.) You can find this instruction in the OGlobalConfiguration.java file in the autoConfig() method. Furthermore, you can enable/disable level 1 cache, level 2 cache, or both. You can also set the number of records that will be stored in each level as follows: cache.level1.size: This sets the number of records to be stored in the level 1 caches (default -1, no limit) cache.level2.size: This sets the number of records to be stored in the level 2 cache (default -1, no limit) cache.level1.enabled: This is a boolean value, it enables/disables the level 1 cache (default, true) cache.level2.enabled: This is a boolean value, it enables/disables the level 2 cache (default, true) Mapping files OrientDB uses NIO to map data files in memory. However, you can change the way this mapping is performed. This is achieved by modifying the file access strategy. Mode 0: It uses the memory mapping for all the operations. Mode 1 (default): It uses the memory mapping, but new reads are performed only if there is enough memory, otherwise the regular Java NIO file read/write is used. Mode 2: It uses the memory mapping only if the data has been previously loaded. Mode 3: It uses memory mapping until there is space available, then use regular JAVA NIO file read/write. Mode 4: It disables all the memory mapping techniques. To set the strategy mode, you must use the file.mmap.strategy configuration property. Connections When you have to connect with a remote database you have some options to improve your application performance. You can use the connection pools, and define the timeout value to acquire a new connection. The pool has two attributes: minPool: It is the minimum number of opened connections maxPool: It is the maximum number of opened connections When the first connection is requested to the pool, a number of connections corresponding to the minPool attribute are opened against the server. If a thread requires a new connection, the requests are satisfied by using a connection from the pool. If all the connections are busy, a new one is created until the value of maxPool is reached. Then the thread will wait, so that a connection is freed. Minimum and maximum connections are defined by using the client.channel.minPool (default value 1) and client.channel.maxPool (default value 5) properties. However, you can override these values in the client code by using the setProperty() method of the connection class. For example: database = new ODatabaseDocumentTx("remote:localhost/demo");database.setProperty("minPool", 10);database.setProperty("maxPool", 50);database.open("admin", "admin"); You can also change the connection timeout values. In fact, you may experience some problem, if there are network latencies or if some server-side operations require more time to be performed. Generally these kinds of problems are shown in the logfile with warnings: WARNING: Connection re-acquired transparently after XXXms and Yretries: no errors will be thrown at application level You can try to change the network.lockTimeout and the network.socketTimeout values. The first one indicates the timeout in milliseconds to acquire a lock against a channel (default is 15000), the second one indicates the TCP/IP socket timeout in milliseconds (default is 10000). There are some other properties you can try to modify to resolve network issues. 
These are as follows: network.socketBufferSize: This is the TCP/IP socket buffer size in bytes (default 32 KB) network.retry: This indicates the number of retries a client should do to establish a connection against a server (default is 5) network.retryDelay: This indicates the number of milliseconds a client will wait before retrying to establish a new connection (default is 500) Transactions If your primary objective is the performance, avoid using transactions. However, if it is very important for you to have transactions to group operations, you can increase overall performance by disabling the transaction log. To do so just set the tx.useLog property to false. If you disable the transaction log, OrientDB cannot rollback operations in case JVM crashes. Other transaction parameters are as follows: tx.log.synch: It is a Boolean value. If set, OrientDB executes a synch against the filesystem for each transaction log entry. This slows down the transactions, but provides reliability on non- reliable devices. Default value is false. tx.commit.synch: It is a Boolean value. If set, it performs a storage synch after a commit. Default value is true. Massive insertions If you want to do a massive insertion, there are some tricks to speed up the operation. First of all, do it via Java API. This is the fastest way to communicate with OrientDB. Second, instruct the server about your intention: db.declareIntent( new OIntentMassiveInsert() );//your code here....db.declareIntent( null ); Here db is an opened database connection. By declaring the OIntentMassiveInsert() intent, you are instructing OrientDB to reconfigure itself (that is, it applies a set of preconfigured configuration values) because a massive insert operation will begin. During the massive insert, avoid creating a new ODocument instance for each record to insert. On the contrary, just create an instance the first time, and then clean it using the reset() method: ODocument doc = new ODocument();for(int i=0; i< 9999999; i++){doc.reset(); //here you will reset the ODocument instancedoc.setClassName("Author");doc.field("id", i);doc.field("name", "John");doc.save();} This trick works only in a non-transactional context. Finally, avoid transactions if you can. If you are using a graph database and you have to perform a massive insertion of vertices, you can still reset just one vertex: ODocument doc = db.createVertex();...doc.reset();... Moreover, since a graph database caches the most used elements, you may disable this: db.setRetainObjects(false); Datafile fragmentation Each time a record is updated or deleted, a hole is created in the datafiles structure. OrientDB tracks these holes and tries to reuse them. However, many updates and deletes can cause a fragmentation of datafiles, just like in a filesystem. To limit this problem, it is suggested to set the oversize attribute of the classes you create. The oversize attribute is used to allocate more space for records once they are created, so as to avoid defragmentation upon updates. The oversize attribute is a multiplying factor where 1.0 or 0 means no oversize. The default values are 0 for document, and 2 for vertices. OrientDB has a defrag algorithm that starts automatically when certain conditions are verified. You can set some of these conditions by using the following configuration parameter: file.defrag.holeMaxDistance: It defines the maximum distance in bytes between two holes that triggers the defrag procedure. The default is 32 KB, -1 means dynamic size. 
The dynamic size is computed in the ODataLocal class in the getCloserHole() method, as Math.max(32768 * (int) (size / 10000000), 32768), where size is the current size of the file. The profiler OrientDB has an embedded profiler that you can use to analyze the behavior of the server. The configuration parameters that act on the profiler are as follows: profiler.enabled: This is a boolean value (enable/disable the profiler), the default value is false. profiler.autoDump.interval: It is the number of seconds between profiler dump. The default value is 0, which means no dump. profiler.autoDump.reset: This is a boolean value, reset the profile at every dump. The default is true. The dump is a JSON string structured in sections. The first one is a huge collection of information gathered at runtime related to the configuration and resources used by each object in the database. The keys are structured as follows: db.<db-name>: They are database-related metrics db.<db-name>.cache: They are metrics about databases' caching db.<db-name>.data: They are metrics about databases' datafiles, mainly data holes db.<db-name>.index: They are metrics about databases' indexes system.disk: They are filesystem-related metrics system.memory: They are RAM-related metrics system.config.cpus: They are the number of the cores process.network: They are network metrics process.runtime: They provide process runtime information and metrics server.connections.actives: They are number of active connections The second part of the dump is a collection of chronos. A chrono is a log of an operation, for example, a create operation, an update operation, and so on. Each chrono has the following attributes: last: It is the last time recorded min: It is the minimum time recorded max: It is the maximum time recorded average: It is the average time recorded total: It is the total time recorded entries: It is the number of times the specific metric has been recorded Finally, there are sections about many counters. Query tips In the following paragraphs some useful information on how to optimize the queries execution is given. The explain command You can see how OrientDB accesses the data by using the explain command in the console. To use this command simply write explain followed by the select statement: orientdb> explain select from Posts A set of key-value pairs are returned. Keys mean the following: resultType: It is the type of the returned resultset. It can be collection, document, or number. resultSize: It is the number of records retrieved if the resultType is collection. recordReads: It is the number of records read from datafiles. involvedIndexes: They are the indices involved in the query. indexReads: It is the number of records read from the indices. documentReads: They are the documents read from the datafiles. This number could be different from recordReads, because in a scanned cluster there can be different kinds of records. documentAnalyzedCompatibleClass: They are the documents analyzed belonging to the class requested by the query. This number could be different from documentReads, because a cluster may contain several different classes. elapsed: This time is measured in nanoseconds, it is the time elapsed to execute the statement. As you can see, OrientDB can use indices to speed up the reads. 
Indexes You can define indexes as we do in a relational database using the create index statement or via Java API using the createIndex() method of the OClass class: create index <class>.<property> [unique|notunique|fulltext] [field type] Or for composite index (an index on more than one property): create index <index_name> on <class> (<field1>,<field2>)[unique|notunique|fulltext] If you create a composite index, OrientDB will use it also when in a where clause you don't specify a criteria against all the indexed fields. So you can avoid this to build an index for each field you use in the queries if you have already built a composite one. This is the case of a partial match search and further information about it can be found in the OrientDB wiki at https://github.com/nuvolabase/orientdb/wiki/Indexes#partial-match-search. Generally, the indexes don't work with the like operator. If you want to perform the following query: select from Authors where name like 'j%' And you want use an index, you must define on the field name a FULLTEXT index. FULLTEXT indices permit to index string fields. However keep in mind that indices slow down the insert, update, and delete operations. Summary In this article we have seen some strategies that try to optimize both the OrientDB server installation and queries. Resources for Article: Further resources on this subject: Comparative Study of NoSQL Products [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Microsoft SQL Azure Tools [Article]

Managing a Hadoop Cluster

Packt
30 Aug 2013
13 min read
(For more resources related to this topic, see here.) From the perspective of functionality, a Hadoop cluster is composed of an HDFS cluster and a MapReduce cluster . The HDFS cluster consists of the default filesystem for Hadoop. It has one or more NameNodes to keep track of the filesystem metadata, while actual data blocks are stored on distributed slave nodes managed by DataNode. Similarly, a MapReduce cluster has one JobTracker daemon on the master node and a number of TaskTrackers on the slave nodes. The JobTracker manages the life cycle of MapReduce jobs. It splits jobs into smaller tasks and schedules the tasks to run by the TaskTrackers. A TaskTracker executes tasks assigned by the JobTracker in parallel by forking one or a number of JVM processes. As a Hadoop cluster administrator, you will be responsible for managing both the HDFS cluster and the MapReduce cluster. In general, system administrators should maintain the health and availability of the cluster. More specifically, for an HDFS cluster, it means the management of the NameNodes and DataNodes and the management of the JobTrackers and TaskTrackers for MapReduce. Other administrative tasks include the management of Hadoop jobs, for example configuring job scheduling policy with schedulers. Managing the HDFS cluster The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can negatively affect the efficiency of the cluster. Even worse, it can make the cluster not function properly. For example, DataNode's unavailability caused by network segmentation can lead to some under-replicated data blocks. When this happens, HDFS will automatically replicate those data blocks, which will bring a lot of overhead to the cluster and cause the cluster to be too unstable to be available for use. In this recipe, we will show commands to manage an HDFS cluster. Getting ready Before getting started, we assume that our Hadoop cluster has been properly configured and all the daemons are running without any problems. Log in to the master node from the administrator machine with the following command: ssh hduser@master How to do it... Use the following steps to check the status of an HDFS cluster with hadoop fsck: Check the status of the root filesystem with the following command: hadoop fsck / We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:14:11 EST 2013 .. /user/hduser/.staging/job_201302281211_0002/job.jar: Under replicated blk_-665238265064328579_1016. Target Replicas is 10 but found 5 replica(s). .................................Status: HEALTHY Total size: 14420321969 B Total dirs: 22 Total files: 35 Total blocks (validated): 241 (avg. block size 59835360 B) Minimally replicated blocks: 241 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 2 (0.8298755 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 2.0248964 Corrupt blocks: 0 Missing replicas: 10 (2.0491803 %) Number of data-nodes: 5 Number of racks: 1 FSCK ended at Thu Feb 28 17:14:11 EST 2013 in 28 milliseconds The filesystem under path '/' is HEALTHY The output shows that some percentage of data blocks is under-replicated. But because HDFS can automatically make duplication for those data blocks, the HDFS filesystem and the '/' directory are both HEALTHY. 
Check the status of all the files on HDFS with the following command: hadoop fsck / -files We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:40:35 EST 2013 / <dir> /home <dir> /home/hduser <dir> /home/hduser/hadoop <dir> /home/hduser/hadoop/tmp <dir> /home/hduser/hadoop/tmp/mapred <dir> /home/hduser/hadoop/tmp/mapred/system <dir> /home/hduser/hadoop/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK /user <dir> /user/hduser <dir> /user/hduser/randtext <dir> /user/hduser/randtext/_SUCCESS 0 bytes, 0 block(s): OK /user/hduser/randtext/_logs <dir> /user/hduser/randtext/_logs/history <dir> /user/hduser/randtext/_logs/history/job_201302281451_0002_13620904 21087_hduser_random-text-writer 23995 bytes, 1 block(s): OK /user/hduser/randtext/_logs/history/job_201302281451_0002_conf.xml 22878 bytes, 1 block(s): OK /user/hduser/randtext/part-00001 1102231864 bytes, 17 block(s): OK Status: HEALTHY Hadoop will scan and list all the files in the cluster. This command scans all ? les on HDFS and prints the size and status. Check the locations of file blocks with the following command: hadoop fsck / -files -locations The output of this command will contain the following information: The first line tells us that file part-00000 has 17 blocks in total and each block has 2 replications (replication factor has been set to 2). The following lines list the location of each block on the DataNode. For example, block blk_6733127705602961004_1127 has been replicated on hosts 10.145.231.46 and 10.145.223.184. The number 50010 is the port number of the DataNode. Check the locations of file blocks containing rack information with the following command: hadoop fsck / -files -blocks -racks Delete corrupted files with the following command: hadoop fsck -delete Move corrupted files to /lost+found with the following command: hadoop fsck -move Use the following steps to check the status of an HDFS cluster with hadoop dfsadmin: Report the status of each slave node with the following command: hadoop dfsadmin -report The output will be similar to the following: Configured Capacity: 422797230080 (393.76 GB) Present Capacity: 399233617920 (371.82 GB) DFS Remaining: 388122796032 (361.47 GB) DFS Used: 11110821888 (10.35 GB) DFS Used%: 2.78% Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 ------------------------------------------------- Datanodes available: 5 (5 total, 0 dead) Name: 10.145.223.184:50010 Decommission Status : Normal Configured Capacity: 84559446016 (78.75 GB) DFS Used: 2328719360 (2.17 GB) Non DFS Used: 4728565760 (4.4 GB) DFS Remaining: 77502160896(72.18 GB) DFS Used%: 2.75% DFS Remaining%: 91.65% Last contact: Thu Feb 28 20:30:11 EST 2013 ... The first section of the output shows the summary of the HDFS cluster, including the configured capacity, present capacity, remaining capacity, used space, number of under-replicated data blocks, number of data blocks with corrupted replicas, and number of missing blocks. The following sections of the output information show the status of each HDFS slave node, including the name (ip:port) of the DataNode machine, commission status, configured capacity, HDFS and non-HDFS used space amount, HDFS remaining space, and the time that the slave node contacted the master. 
Refresh all the DataNodes using the following command: hadoop dfsadmin -refreshNodes Check the status of the safe mode using the following command: hadoop dfsadmin -safemode get We will be able to get the following output: Safe mode is OFF The output tells us that the NameNode is not in safe mode. In this case, the filesystem is both readable and writable. If the NameNode is in safe mode, the filesystem will be read-only (write protected). Manually put the NameNode into safe mode using the following command: hadoop dfsadmin -safemode enter This command is useful for system maintenance. Make the NameNode to leave safe mode using the following command: hadoop dfsadmin -safemode leave If the NameNode has been in safe mode for a long time or it has been put into safe mode manually, we need to use this command to let the NameNode leave this mode. Wait until NameNode leaves safe mode using the following command: hadoop dfsadmin -safemode wait This command is useful when we want to wait until HDFS finishes data block replication or wait until a newly commissioned DataNode to be ready for service. Save the metadata of the HDFS filesystem with the following command: hadoop dfsadmin -metasave meta.log The meta.log file will be created under the directory $HADOOP_HOME/logs. Its content will be similar to the following: 21 files and directories, 88 blocks = 109 total Live Datanodes: 5 Dead Datanodes: 0 Metasave: Blocks waiting for replication: 0 Metasave: Blocks being replicated: 0 Metasave: Blocks 0 waiting deletion from 0 datanodes. Metasave: Number of datanodes: 5 10.145.223.184:50010 IN 84559446016(78.75 GB) 2328719360(2.17 GB) 2.75% 77502132224(72.18 GB) Thu Feb 28 21:43:52 EST 2013 10.152.166.137:50010 IN 84559446016(78.75 GB) 2357415936(2.2 GB) 2.79% 77492854784(72.17 GB) Thu Feb 28 21:43:52 EST 2013 10.145.231.46:50010 IN 84559446016(78.75 GB) 2048004096(1.91 GB) 2.42% 77802893312(72.46 GB) Thu Feb 28 21:43:54 EST 2013 10.152.161.43:50010 IN 84559446016(78.75 GB) 2250854400(2.1 GB) 2.66% 77600096256(72.27 GB) Thu Feb 28 21:43:52 EST 2013 10.152.175.122:50010 IN 84559446016(78.75 GB) 2125828096(1.98 GB) 2.51% 77724323840(72.39 GB) Thu Feb 28 21:43:53 EST 2013 21 files and directories, 88 blocks = 109 total ... How it works... The HDFS filesystem will be write protected when NameNode enters safe mode. When an HDFS cluster is started, it will enter safe mode first. The NameNode will check the replication factor for each data block. If the replica count of a data block is smaller than the configured value, which is 3 by default, the data block will be marked as under-replicated. Finally, an under-replication factor, which is the percentage of under-replicated data blocks, will be calculated. If the percentage number is larger than the threshold value, the NameNode will stay in safe mode until enough new replicas are created for the under-replicated data blocks so as to make the under-replication factor lower than the threshold. 
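The threshold mentioned above is configurable. As a hedged example — the property names below come from the Hadoop 1.x hdfs-default.xml and the values shown are the usual defaults, so verify them against your own distribution — the settings go into $HADOOP_HOME/conf/hdfs-site.xml:

<!-- Fraction of blocks that must satisfy dfs.replication.min before the NameNode leaves safe mode -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
</property>
<!-- Extra time, in milliseconds, spent in safe mode after the threshold is reached -->
<property>
  <name>dfs.safemode.extension</name>
  <value>30000</value>
</property>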
We can get the usage of the fsck command using: hadoop fsck The usage information will be similar to the following: Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]] <path> start checking from this path -move move corrupted files to /lost+found -delete delete corrupted files -files print out files being checked -openforwrite print out files opened for write -blocks print out block report -locations print out locations for every block -racks print out network topology for data-node locations By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status.   We can get the usage of the dfsadmin command using: hadoop dfsadmin The output will be similar to the following: Usage: java DFSAdmin [-report] [-safemode enter | leave | get | wait] [-saveNamespace] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-refreshServiceAcl] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-setSpaceQuota <quota> <dirname>...<dirname>] [-clrSpaceQuota <dirname>...<dirname>] [-setBalancerBandwidth <bandwidth in bytes per second>] [-help [cmd]] There's more… Besides using command line, we can use the web UI to check the status of an HDFS cluster. For example, we can get the status information of HDFS by opening the link http://master:50070/dfshealth.jsp. We will get a web page that shows the summary of the HDFS cluster such as the configured capacity and remaining space. For example, the web page will be similar to the following screenshot: By clicking on the Live Nodes link, we can check the status of each DataNode. We will get a web page similar to the following screenshot: By clicking on the link of each node, we can browse the directory of the HDFS filesystem. The web page will be similar to the following screenshot: The web page shows that file /user/hduser/randtext has been split into five partitions. We can browse the content of each partition by clicking on the part-0000x link. Configuring SecondaryNameNode Hadoop NameNode is a single point of failure. By configuring SecondaryNameNode, the filesystem image and edit log files can be backed up periodically. And in case of NameNode failure, the backup files can be used to recover the NameNode. In this recipe, we will outline steps to configure SecondaryNameNode. Getting ready We assume that Hadoop has been configured correctly. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to configure SecondaryNameNode: Stop the cluster using the following command: stop-all.sh Add or change the following into the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>fs.checkpoint.dir</name> <value>/hadoop/dfs/namesecondary</value> </property> If this property is not set explicitly, the default checkpoint directory will be ${hadoop.tmp.dir}/dfs/namesecondary. 
Start the cluster using the following command: start-all.sh The tree structure of the NameNode data directory will be similar to the following: ${dfs.name.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage ├── in_use.lock └── previous.checkpoint ├── edits ├── fsimage ├── fstime └── VERSION And the tree structure of the SecondaryNameNode data directory will be similar to the following: ${fs.checkpoint.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage └── in_use.lock There's more... To increase redundancy, we can configure NameNode to write filesystem metadata on multiple locations. For example, we can add an NFS shared directory for backup by changing the following property in the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>dfs.name.dir</name> <value>/hadoop/dfs/name,/nfs/name</value> </property> Managing the MapReduce cluster A typical MapReduce cluster is composed of one master node that runs the JobTracker and a number of slave nodes that run TaskTrackers. The task of managing a MapReduce cluster includes maintaining the health as well as the membership between TaskTrackers and the JobTracker. In this recipe, we will outline commands to manage a MapReduce cluster. Getting ready We assume that the Hadoop cluster has been properly configured and running. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to manage a MapReduce cluster: List all the active TaskTrackers using the following command: hadoop -job -list-active-trackers This command can help us check the registration status of the TaskTrackers in the cluster. Check the status of the JobTracker safe mode using the following command: hadoop mradmin -safemode get We will get the following output: Safe mode is OFF The output tells us that the JobTracker is not in safe mode. We can submit jobs to the cluster. If the JobTracker is in safe mode, no jobs can be submitted to the cluster. Manually let the JobTracker enter safe mode using the following command: hadoop mradmin -safemode enter This command is handy when we want to maintain the cluster. Let the JobTracker leave safe mode using the following command: hadoop mradmin -safemode leave When maintenance tasks are done, you need to run this command. If we want to wait for safe mode to exit, the following command can be used: hadoop mradmin -safemode wait Reload the MapReduce queue configuration using the following command: hadoop mradmin -refreshQueues Reload active TaskTrackers using the following command: hadoop mradmin -refreshNodes How it works... Get the usage of the mradmin command using the following: hadoop mradmin The usage information will be similar to the following: Usage: java MRAdmin [-refreshServiceAcl] [-refreshQueues] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshNodes] [-safemode <enter | leave | get | wait>] [-help [cmd]] ... The meaning of the command options is listed in the following table: Option Description -refreshServiceAcl Force JobTracker to reload service ACL. -refreshQueues Force JobTracker to reload queue configurations. -refreshUserToGroupsMappings Force JobTracker to reload user group mappings. -refreshSuperUserGroupsConfiguration Force JobTracker to reload super user group mappings. -refreshNodes Force JobTracker to refresh the JobTracker hosts. -help [cmd] Show the help info for a command or all commands. 
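Beyond mradmin, day-to-day MapReduce administration is mostly done with the hadoop job command. The following is a hedged sketch of common Hadoop 1.x invocations; the job ID is reused from the earlier sample output purely for illustration:

hadoop job -list                                     # list the jobs currently running
hadoop job -status job_201302281451_0002             # show completion percentage and counters
hadoop job -set-priority job_201302281451_0002 HIGH  # raise a job's scheduling priority
hadoop job -kill job_201302281451_0002               # kill a misbehaving job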
Summary In this article, we learned Managing the HDFS cluster, configuring SecondaryNameNode, and managing the MapReduce cluster. As a Hadoop cluster administrator, as the system administrator is responsible for managing both the HDFS cluster and the MapReduce cluster, he/she must be aware of how to manage these in order to maintain the health and availability of the cluster. More specifically, for an HDFS cluster, it means the management of the NameNodes and DataNodes and the management of the JobTrackers and TaskTrackers for MapReduce, which is covered in this article. Resources for Article : Further resources on this subject: Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article] Advanced Hadoop MapReduce Administration [Article] Understanding MapReduce [Article]

Working with Import Process (Intermediate)

Packt
30 Aug 2013
5 min read
(For more resources related to this topic, see here.) Getting ready The first thing we need to do is to download the latest version of Sqoop from following location http://www.apache.org/dist/sqoop/ and extract it on your machine. Now I am calling the Sqoop installation dir as $SQOOP_HOME. Given here are the prerequisites for Sqoop import process. Installed and running Relational Database Management System (MySQL). Installed and running Hadoop Cluster. Set $HADOOP_HOME environment variable. Following are the common arguments of import process. Parameters Description --connect <jdbc-uri> This command specifies the server or database to connect. It also specifies the port. Example: --connect jdbc:mysql://host:port/databaseName --connection-manager <class-name>   Specify connection manager class name.   --driver <class-name> Specify the fully qualified name of JDBC driver class. --password <password> Set authentication password required to connect to input source. --username <username> Set authentication username. How to do it Let’s see how to work with import process First, we will start with import single RDBMS table into Hadoop. Query1: $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/abc/tableName The content of output file in HDFS will look like: Next, we will put some light on approach of import only selected rows and selected columns of RDBMS table into Hadoop. Query 2: Import selected columns $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” Query 3: Import selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --where ‘student_id<100’ Query 4: Import selected columns of selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” -- where ‘student_id<100’ How it works… Now let’s see how the above steps work: Import single table: Apart from the common arguments of import process, as explained previously, this part covers some other arguments which are required to import a table into Hadoop Distributed File System. Parameters Description --table <table-name> Name of input table to fetch. --target-dir<dir> Location of output/target dir in HDFS. --direct If user want to use non-JDBC based access mechanism for faster database access --options-file <file-path> All the command line options that are common in most of commands can put in options file for convenience.  The Query1 will run a MapReduce job and import all the rows of given table to HDFS (where, /user/abc/tableName is the location of output files). The records imported in HDFS preserve their original columns order, which means, if input table contains four columns A, B, C and D, then content in HDFS file will look like: A1, B1, C1, D1 A2, B2, C2, D2 Import selected columns: By default, the import query will select all columns of input table for import, but we can select the subset of columns by specifying the comma separated list of columns in --columns argument. The Query2 will only fetch three columns (student_id, address and name) of student table. 
Import selected columns: By default, the import query selects all columns of the input table, but we can select a subset of columns by specifying a comma-separated list in the --columns argument. Query 2 fetches only three columns (student_id, address, and name) of the student table.

If the import query contains the --columns argument, the order of the columns in the output files is the same as the order specified in --columns. The output in HDFS will look like:

student_id, address, name
1, Delhi, XYZ
2, Mumbai, PQR
..........

If the import query specifies the columns in the order "address, name, student_id", then the output in HDFS will look like:

address, name, student_id
Delhi, XYZ, 1
Mumbai, PQR, 2
.............

Import selected rows: By default, all the rows of the input table are imported to HDFS, but we can control which rows are imported by using the --where argument in the import statement. Query 3 imports only those rows whose student_id value is less than 100. Query 4 uses both the --columns and --where arguments in one statement; for it, Sqoop internally generates a query of the form "select student_id, address, name from student where student_id<100".

There's more...

This section covers some more examples of the import process.

Import all tables: So far we have imported a single table into HDFS. The import-all-tables tool lets us import a set of tables from an RDBMS to HDFS, creating a separate directory in HDFS for each table. The mandatory conditions for the import-all-tables tool are:

All tables must have a single-column primary key.
You must intend to import all the columns of each table.
No --where, --columns, or --query arguments are permitted.

Query 5:

$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password

This query imports all tables (tableName and tableName1) of database db1 into HDFS, with one output directory per table.

Summary

We learned how to import a single RDBMS table into HDFS, import selected columns and selected rows, and import a set of RDBMS tables.

Resources for Article:

Further resources on this subject:
Introduction to Logging in Tomcat 7 [Article]
Configuring Apache and Nginx [Article]
Geronimo Architecture: Part 2 [Article]


What are SSAS 2012 dimensions and cube?

Packt
29 Aug 2013
7 min read
What is SSAS?

SQL Server Analysis Services is an online analytical processing tool that greatly speeds up the kinds of queries and calculations that are common in a business intelligence environment. It looks like a relational database, but it has differences. SSAS does not remove the need for relational databases, but combining the two helps in developing business intelligence solutions.

Why do we need SSAS?

SSAS provides a very clear graphical interface for end users to build queries, and acts as a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse, and all reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting, but SSAS is a database service specifically designed for this type of workload, and in most cases it achieves much better query performance.

SSAS 2012 architecture

In this article we will explain the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. The new version of SSAS that closely resembles PowerPivot uses the tabular model. When installing SSAS, we must select either the tabular model or the multidimensional model for the instance that runs inside the server; both data models are developed under the same code base but are sometimes treated separately. The concepts involved in designing the two data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa, without rebuilding everything from the start. From the end user's point of view, both data models do almost the same things and appear almost identical when used through a client tool such as Excel.

The tabular model

Building a database using the tabular model is very similar to building one in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we have probably made a design mistake somewhere, because everything we need should be contained within an individual database. Tabular models are designed using SQL Server Data Tools (SSDT), with a data project in SSDT mapping onto a database in Analysis Services.

The multidimensional model

This data model is very similar to the tabular model in that data is managed in databases, and databases are designed in SSDT and managed using SQL Server Management Studio. The differences become apparent below the database level, where multidimensional rather than relational concepts apply: in the multidimensional model, data is modeled as a series of cubes and dimensions, not tables.

The future of Analysis Services

Having two data models inside SSAS, along with two query and calculation languages, is clearly not an ideal state of affairs. It means we have to select a data model at the start of our project, when we might not even know enough about our requirements to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement, and that new advanced features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously.

What's new in SSAS 2012?

There is no easy way of converting a multidimensional data model into a tabular data model. There are tools on the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful.

Ease of use

If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find the tabular model much easier to learn than the multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but the development process is also much more straightforward and there are far fewer features to learn.

Compatibility with PowerPivot

The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces are practically the same, and both use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around: a tabular model cannot be converted to a PowerPivot model.

Processing performance characteristics

Comparing the processing performance of the multidimensional and tabular data models is difficult. It may be slower to process a large table in the tabular data model than the equivalent measure group in a multidimensional one, because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time.

What is an SSAS dimension?

A database dimension is a collection of related objects, in other words attributes, which provide information about the fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide the paths to assist users when they browse through the data in a cube. By default these attributes are visible as attribute hierarchies, and they can be used to understand the fact data in a cube.

What is an SSAS cube?

A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of the cube that you slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve the values of measures as if they were contained in cells in the cube. Cells are defined for every possible summarized value: a cell is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection.

Summary

We talked about the new features and services in SSAS 2012, what you can do with them, and why they are so useful.

Resources for Article:

Further resources on this subject:
Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article]
Performing Common MDX-related Tasks [Article]
How to Perform Iteration on Sets in MDX [Article]


Creating a pivot table

Packt
29 Aug 2013
8 min read
A pivot table is the core business intelligence tool that helps turn meaningless data from various sources into a meaningful result. By using different ways of presenting data, we are able to identify relations between seemingly separate data and reach conclusions that help us identify our strengths and areas of improvement.

Getting ready

Prepare the two files entitled DatabaseData_v2.xlsx and GDPData_v2.xlsx. We will be using these results along with other data sources to create a meaningful PowerPivot table that will be used for intelligent business analysis.

How to do it...

For each of the two files, we will build upon the file and add a pivot table to it, gaining exposure using the data we are already familiar with.

The following are the steps to create a pivot table with the DatabaseData_v2.xlsx file, which results in the creation of a DatabaseData_v3.xlsx file:

1. Open the PowerPivot window of the DatabaseData_v2.xlsx file with its 13 tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Select the checkboxes as follows:
   Select CountryRegion | Name and move it under Row Labels.
   Select Address | City and move it under Row Labels.
   Select Address | AddressLine1 as Count of AddressLine1 and move it under Values.
4. This shows the number of clients per city and per country. However, it is very difficult to navigate, as each country name has to be collapsed in order to see the next country. Let us move the CountryRegion | Name column to Slicers Vertical.
5. Now the pivot table should display simple results: the number of clients in a region, filterable by country using slicers. Let us apply some formatting to allow for a better understanding of the data. Right-click on Name under the Slicers Vertical area of the PowerPivot Field List dashboard, select Field Settings, and then change the name to Country Name. The title of the slicer changes from Name to Country Name, allowing anyone who views this data to better understand what it represents.
6. Similarly, right-click on Count of AddressLine1 under Values, select Edit Measure, and then change its name to Number of Clients. Also change the data title City under the Row Labels area to City Name.
7. Watch the results change as we click on different country names. We can filter for multiple countries by holding the Ctrl key while clicking, and can remove all filters by clicking the small button on the top-right of the slicer. This is definitely easier to navigate and to understand compared to what we did at first without using slicers, which is how it would appear in Excel 2010 without PowerPivot.
8. However, this table is still too big. Clicking on Canada gives too many cities whose names many of us have not heard of before. Let us break the data down further by including states/provinces. Select StateProvince | Name, move it under Slicers Horizontal, and change its title to State Name. It is a good thing that we are renaming these as we go along; otherwise, there would have been two datasets called Name, and anyone would be confused as we moved along.
9. Now we should see the state names filter at the top, the country name filter on the left, and a list of cities with the number of clients in the middle. This, however, is kind of awkward. Let us rearrange the filters by having the largest filter (country) at the top and the sub-filter (state) on the left-hand side. This can be done simply by dragging the Country Name dataset to Slicers Horizontal and State Name to Slicers Vertical.
10. Again, play with the results and try to understand the features: try filtering by a country and by a state/province. Now there are limited numbers of cities shown for each country and each state/province, making it easier to see the list of cities.
11. However, for countries such as the United States, there are just too many states. Let us change the formatting of the vertical filter to display three states per line, so it is easier to find the state we are looking for. This can be done by right-clicking on the vertical filter, selecting Size and Properties | Position and Layout, and then changing the Number of Columns value. Repeat the same step for Country Name to display six columns, and then change the sizes of the filters to look more selectable.
12. Change the name of the sheet to PivotTable and then save the file as DatabaseData_v3.xlsx.

The following are the steps to create a pivot table with the GDPData_v2.xlsx file, which results in the creation of a GDPData_v3.xlsx file:

1. Open the PowerPivot window of the GDPData_v2.xlsx file with its two tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Move the datasets from the year 2000 to the year 2010 to the Values field, move Country Name into the Row Labels field, and move Country Name again into the Slicers Horizontal field.
4. In the slicer, select five countries: Canada, China, Korea, Japan, and United States.
5. Select all fields and reduce the number of decimal places. We can now clearly see that GDP in China tripled over the decade, and that only China and Korea saw an increase in GDP from 2008 to 2009 while the GDP of the other nations dropped due to the 2008 financial crisis. Knowing the relevant background information on world finance events, we can make intelligent analyses, such as which markets to invest in if we are worried about another financial crisis taking place.
6. As the data gets larger, looking at the raw GDP numbers becomes increasingly difficult. In such cases, we can switch the type of data displayed by using the Show Value As button in the PivotTable Tools | Options menu. Play around with it and see how it works: % of Column Total shows each GDP as a percentage of the year, % Difference From allows the user to set one value as the standard and compare the rest to it, and the Rank Largest to Smallest option simply shows the ranking based on which country earns the most GDP.
7. Change the name of the sheet to PivotTable and then save the file as GDPData_v3.xlsx.

How it works...

We looked at two different files and focused on two different fields. The first file was more qualitative and showed the relationship between regions and the number of clients, using various features of pivot tables such as slicers. We also looked at how to format various aspects of a pivot table for easier processing and a better understanding of the represented data. Slicers embedded in the pivot table are a unique and very powerful feature of PowerPivot that allow us to sort through data simply by clicking the different criteria. Adding more slicers helps to customize the data further, enabling the user to create all sorts of views imaginable. There are no differences between horizontal and vertical slicers aside from the fact that they are at different locations.

From the second file, we focused more on the quantitative data and different ways of representing it. By using slicers to limit the number of countries, we were able to focus on the data presented, represent the GDP in various formats such as percentages and ranks, and compare the differences between the numbers by selecting one as a standard. A similar method of representing data in a different format could be applied to the first file to show the percentage of clients per nation, and so on.

There's more...

We covered the very basic setup of creating a pivot table. We can also create relationships between data and create custom fields, so that better results are produced. So don't worry about why the pivot table looks so small! For those who do not like working with a pivot table, there is also a feature that will convert all cells into Excel formulas. Under the PowerPivot Tools | OLAP Tools option, the Convert to Formula button does exactly that. However, be warned that it cannot be undone, as the changes are permanent.

Summary

In this article, we learned how to use raw data to make pivot tables that can help us make smart business decisions!

Resources for Article:

Further resources on this subject:
SAP HANA integration with Microsoft Excel [Article]
Managing Core Microsoft SQL Server 2008 R2 Technologies [Article]
Eloquent relationships [Article]

Schemas and Models

Packt
27 Aug 2013
12 min read
So what is a schema? At its simplest, a schema is a way to describe the structure of data. Typically this involves giving each piece of data a label, and stating what type of data it is, for example, a number, date, string, and so on.

In the following example, we are creating a new Mongoose schema called userSchema. We are stating that a database document using this schema will have three pieces of data, which are as follows:

name: This data will contain a string
email: This will also contain a string value
createdOn: This data will contain a date

The following is the schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Field sizes

Note that, unlike some other systems, there is no need to set the field size. This can be useful if you need to change the amount of data stored in a particular object. For example, your system might impose a 16-character limit on usernames, so you set the size of the field to 16 characters. Later, you realize that you want to encrypt the usernames, but this will double the length of the data stored. If your database schema uses fixed field sizes, you will need to refactor it, which can take a long time on a large database. With Mongoose, you can just start encrypting that data object without worrying about it. If you're storing large documents, you should bear in mind that MongoDB imposes a maximum document size of 16 MB. There are ways around even this limit, using the MongoDB GridFS API.

Data types allowed in schemas

There are eight types of data that can, by default, be set in a Mongoose schema. These are also referred to as SchemaTypes; they are:

String
Number
Date
Boolean
Buffer
ObjectId
Mixed
Array

The first four SchemaTypes are self-explanatory, but let's take a quick look at them all.

String

This SchemaType stores a string value, UTF-8 encoded.

Number

This SchemaType stores a number value, with restrictions. Mongoose does not natively support long and double datatypes, for example, although MongoDB does. However, Mongoose can be extended using plugins to support these other types.

Date

This SchemaType holds a date and time object, typically returned from MongoDB as an ISODate object, for example, ISODate("2013-04-03T12:56:26.009Z").

Boolean

This SchemaType has only two values: true or false.

Buffer

This SchemaType is primarily used for storing binary information, for example, images stored in MongoDB.

ObjectId

This SchemaType is used to assign a unique identifier to a key other than _id. Rather than just specifying the type of ObjectId, you need to specify the fully qualified version, Schema.Types.ObjectId. For example:

projectSchema.add({owner: mongoose.Schema.Types.ObjectId});

Mixed

A mixed data object can contain any type of data. It can be declared either by setting an empty object, or by using the fully qualified Schema.Types.Mixed. The following two commands will do the same thing:

var djSchema = new mongoose.Schema({mixedUp: {}});
var djSchema = new mongoose.Schema({mixedUp: Schema.Types.Mixed});

While this sounds like it might be great, there is a big caveat. Changes to data of mixed type cannot be automatically detected by Mongoose, so it doesn't know that it needs to save them.

Tracking changes to Mixed type

As Mongoose can't automatically see changes made to mixed type data, you have to manually declare when the data has changed. Fortunately, Mongoose exposes a method called markModified to do just this, passing it the path of the data object that has changed.

dj.mixedUp = { valueone: "a new value" };
dj.markModified('mixedUp');
dj.save();

Array

The array datatype can be used in two ways. First, as a simple array of values of the same data type, as shown in the following code snippet:

var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [String]
});

Second, the array datatype can be used to store a collection of subdocuments using nested schemas. Here's an example of how this can work:

var emailSchema = new mongoose.Schema({
  email: String,
  verified: Boolean
});
var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [emailSchema]
});

Warning – array defined as mixed type

A word of caution. If you declare an empty array it will be treated as mixed type, meaning that Mongoose will not be able to automatically detect any changes made to the data. So avoid these two types of array declaration, unless you intentionally want a mixed type:

var emailSchema = new mongoose.Schema({addresses: []});
var emailSchema = new mongoose.Schema({addresses: Array});

Custom SchemaTypes

If your data requires a different datatype which is not covered earlier in this article, Mongoose offers the option of extending it with custom SchemaTypes. The extension method is managed using Mongoose plugins. Some examples of SchemaType extensions that have already been created are long, double, RegEx, and even email.

Where to write the schemas

As your schemas sit on top of Mongoose, the only absolute is that they need to be defined after Mongoose is required. You don't need an active or open connection to define your schemas. That being said, it is advisable to make your connection early on, so that it is available as soon as possible, bearing in mind that remote databases or replica sets may take longer to connect than your localhost development server. While no action can be taken on the database through the schemas and models until the connection is open, Mongoose can buffer requests made from when the connection is defined. Mongoose models also rely on the connection being defined, so there's another reason to get the connection set up early in the code and then define the schemas and models.

Writing a schema

Let's write the schema for a User in our MongoosePM application. The first thing we have to do is declare a variable to hold the schema. I recommend taking the object name (for example, user or project) and adding Schema to the end of it. This makes following the code later on super easy. The second thing we need to do is create a new Mongoose schema object to assign to this variable. The skeleton of this is as follows:

var userSchema = new mongoose.Schema({ });

We can add in the basic values of name, email, and createdOn that we looked at earlier, giving us our first user schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Modifying an existing schema

Suppose we run the application with this for a while, and then decide that we want to record the last time each user logged on, and the last time their record was modified. No problem! We don't have to refactor the database or take it offline while we upgrade the schema, we simply add a couple of entries to the Mongoose schema. If a key requested in the schema doesn't exist, neither Mongoose nor MongoDB will throw errors; Mongoose will just return null values.
When saving the MongoDB documents, the new keys and values will be added and stored as required. If the value is null, then the key is not added. So let's add modifiedOn and lastLogin to our userSchema:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date,
  modifiedOn: Date,
  lastLogin: Date
});

Setting a default value

Mongoose allows us to set a default value for a data key when the document is first created. Looking at our schema created earlier, a possible candidate for this is createdOn. When a user first signs up, we want the date and time to be set. We could do this by adding a timestamp to the controller function when we create a user, or, to make a point, we can modify the schema to set a default value. To do this, we need to change the information we are sending about the createdOn data object. What we have currently is:

createdOn: Date

This is short for:

createdOn: { type: Date }

We can add another entry to this object to set a default value here, using the JavaScript Date object:

createdOn: { type: Date, default: Date.now }

Now every time a new user is created, its createdOn value will be set to the current date and time. Note that in JavaScript default is a reserved word. While the language allows reserved words to be used as keys, some IDEs and linters regard it as an error. If this causes issues for you or your environment, you can wrap it in quotes, like in the following code snippet:

createdOn: { type: Date, 'default': Date.now }

Only allowing unique entries

If we want to ensure that there is only ever one user per e-mail address, we can specify that the email field should be unique:

email: {type: String, unique:true}

With this in place, when saving to the database, MongoDB will check to see if the e-mail value already exists in another document. If it finds it, MongoDB (not Mongoose) will return an E11000 error. Note that this approach also defines a MongoDB index on the email field.

Our final User schema

Your userSchema should now look like the following:

var userSchema = new mongoose.Schema({
  name: String,
  email: {type: String, unique:true},
  createdOn: { type: Date, default: Date.now },
  modifiedOn: Date,
  lastLogin: Date
});

A corresponding document from the database would look like the following (line breaks are added for readability):

{
  "__v" : 0,
  "_id" : ObjectId("5126b7a1f8a44d1e32000001"),
  "createdOn" : ISODate("2013-02-22T00:11:13.436Z"),
  "email" : "[email protected]",
  "lastLogin" : ISODate("2013-04-03T12:54:42.734Z"),
  "modifiedOn" : ISODate("2013-04-03T12:56:26.009Z"),
  "name" : "Simon Holmes"
}

What's that "__v" thing?

You may have noticed a data entity in the document that we didn't set: __v. This is an internal versioning number automatically set by Mongoose when a document is created. It doesn't increment when a document is changed, but instead is automatically incremented whenever an array within the document is updated in such a way that might cause the indexed position of some of the entries to have changed.

Why is this needed? When working with an array you will typically access the individual elements through their positional index, for example, myArray[3]. But what happens if somebody else deletes the element in myArray[2] while you are editing the data in myArray[3]? Your original data is now contained in myArray[2] but you don't know this, so you quite happily overwrite whatever data is now stored in myArray[3]. The __v gives you a method to sanity check this, and prevent this scenario from happening.
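To see the unique email constraint described above in action, the following is a minimal sketch of catching the E11000 error when a duplicate is saved. Building models is covered later in this article; the variable names, e-mail address, and error-handling style here are illustrative rather than taken from the MongoosePM project:

var User = mongoose.model('User', userSchema);

// A second user with an e-mail address that already exists in the collection
var duplicate = new User({ name: 'Simon again', email: 'simon@example.com' });

duplicate.save(function (err, savedUser) {
  if (err && err.code === 11000) {
    // MongoDB rejected the write because the unique index on email was violated
    console.log('That e-mail address is already registered');
  } else if (err) {
    console.log('Save failed: ' + err.message);
  } else {
    console.log('Saved user: ' + savedUser.name);
  }
});

Checking err.code in the save callback keeps the duplicate-address case separate from any other database error, so the application can show a friendly message instead of a generic failure.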
Defining the Project schema

As part of our MongoosePM application we also need to think about Projects. After all, PM here does stand for Project Manager. Let's take what we've learned and create the Project schema. We are going to want a few types of data to start with:

projectName: A string containing the name of the project.
createdOn: The date when the document was first created and saved. This option is set to automatically save the current date and time.
modifiedOn: The date and time when the document was last changed.
createdBy: A string that will for now contain the unique ID of the user who created the project.
tasks: A string to hold task information.

Transforming these requirements into a Mongoose schema definition, we create the following:

var projectSchema = new mongoose.Schema({
  projectName: String,
  createdOn: Date,
  modifiedOn: { type: Date, default: Date.now },
  createdBy: String,
  tasks: String
});

This is our starting point, and we will build upon it. For now we have these basic data objects as mentioned previously in this article. Here's an example of a corresponding document from the database (line breaks added for readability):

{
  "projectName" : "Another test",
  "createdBy" : "5126b7a1f8a44d1e32000001",
  "createdOn" : ISODate("2013-04-03T17:47:51.031Z"),
  "tasks" : "Just a simple task",
  "_id" : ObjectId("515c6b47596acf8e35000001"),
  "modifiedOn" : ISODate("2013-04-03T17:47:51.032Z"),
  "__v" : 0
}

Improving the Project schema

Throughout the rest of the article we will be improving this schema, but the beauty of using Mongoose is that we can do this relatively easily. Putting together a basic schema like this to build upon is a great approach for prototyping: you have the data you need there, and can add complexity where you need it, when you need it.

Building models

A single instance of a model maps directly to a single document in the database. With this 1:1 relationship, it is the model that handles all document interaction: creating, reading, saving, and deleting. This makes the model a very powerful tool. Building the model is pretty straightforward. When using the default Mongoose connection, we can call the mongoose.model command, passing it two arguments:

The name of the model
The name of the schema to compile

So if we were to build a model from our user schema we would use this line:

mongoose.model( 'User', userSchema );

If you're using a named Mongoose connection, the approach is very similar:

adminConnection.model( 'User', userSchema );

Instances

It is useful to have a good understanding of how a model works. After building the User model using the previous line, we could create two instances:

var userOne = new User({ name: 'Simon' });
var userTwo = new User({ name: 'Sally' });

Summary

In this article, we have looked at how schemas and models relate to your data. You should now understand the roles of both schemas and models. We have looked at how to create simple schemas and the types of data they can contain. We have also seen that it is possible to extend this if the native types are not enough. In the MongoosePM project, you should now have added a User schema and a Project schema, and built models of both of these.

Resources for Article:

Further resources on this subject:
Understanding Express Routes [Article]
Validating and Using the Model Data [Article]
Creating Your First Web Page Using ExpressionEngine: Part 1 [Article]


Creating Multivariate Charts

Packt
26 Aug 2013
10 min read
With an increasing number of variables, any analysis can become challenging and observations harder to make; however, Tableau simplifies the process for the designer and uses effective layouts for the reader even in multivariate analysis. Using various combinations of colors and charts, we can create compelling graphics that generate critical insights from our data. Among the charts covered in this article, facets and area charts are easier to understand and easier to create compared to bullet graphs and dual axes charts.

Creating facets

Facets are one of the powerful features in Tableau. Edward Tufte, a pioneer in the field of information graphics, championed these types of charts, also called grid or panel charts; he called them small multiples. These charts show the same measure(s) across various values of one or two variables for easier comparison.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data file is loaded on the new worksheet, perform the following steps to create a simple faceted chart:

1. Drag-and-drop Market from Dimensions into the Columns shelf.
2. Drag-and-drop Product Type from Dimensions into the Rows shelf.
3. Drag-and-drop Profit from Measures into the Rows shelf next to Product Type.
4. Optionally, you can drag-and-drop Market into the Color Marks box to give color to the four bars of the different Market areas.

How it works...

When there is one dimension on one of the shelves, either Columns or Rows, and one measure on the other shelf, Tableau creates a univariate bar chart; but when we drop additional dimensions along with the measure, Tableau creates small charts, or facets, and displays univariate charts broken down by a dimension.

There's more...

A company named Juice Analytics has a great blog article on the topic of small multiples. This article lists the benefits of using small multiples as well as some examples of small multiples in practice. Find this blog at http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/.

Creating area charts

An area chart is an extension of a line chart. The area chart shows the line of the measure but fills the area below the line to emphasize the value of the measure. A special case of the area chart is the stacked area chart, which shows a line per measure and fills the area between the lines. Tableau's implementation of area charts uses one date variable and one or more measures.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded on the new worksheet, perform the following steps to create an area chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. Select Order Date from Dimensions and Order Quantity from Measures by clicking while holding the Ctrl key.
3. Click on Area charts (continuous) from the Show Me toolbar.
4. Drag-and-drop Order Date into the Columns shelf next to YEAR(Order Date).
5. Expand YEAR(Order Date), seen on the right-hand side, by clicking on the plus sign.
6. Drag-and-drop Region from Dimensions into the Rows shelf to the left of SUM(Order Quantity).

How it works...

When we added Order Date for the first time, Tableau, by default, aggregated the date field by year; therefore, we added Order Date again to create an aggregation by quarter of the Order Date. We also added Region to create facets on the regions that show trends of order quantity over time.

There's more...

A blog post by Visually, an information graphics company, discusses the key differences between line charts and area charts. You can find this post at http://blog.visual.ly/line-vs-area-charts/.

Creating bullet graphs

Stephen Few, an information visualization consultant and author, designed this chart to solve some of the problems that the gauge and meter types of charts pose. Gauges, although simple to understand, take a lot of space to show only one measure. Bullet graphs are a combination of the bar graph and thermometer types of charts, and they show a measure of interest in the form of a bar graph (which is the bullet) and target variables.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a bullet graph:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Type and Market from Dimensions and Budget Sales and Sales from Measures.
3. Click on the bullet graphs icon on the Show Me toolbar.
4. Right-click on the x axis (the Budget Sales axis) and click on Swap Reference Line Fields.

How it works...

Although bullet graphs maximize the available space to show relevant information, readers require a detailed explanation of what all the components of the graphic are encoding. In this recipe, since we want to compare the budgeted sales with the actual sales, we had to swap the reference line from Sales to Budget Sales. The black bar on the graphic shows the budgeted sales and the blue bar shows the actual sales. The dark gray background color shows 60 percent of the actual sales and the lighter gray shows 80 percent of the actual sales. As we can see in this chart, the blue bars crossed all the black lines, which tells us that both coffee types and all market regions exceeded the budgeted sales.

There's more...

A blog post by Data Pig Technologies discusses some of the problems with the bullet graph. The main problem is intuitive understanding of this chart. You can read about this problem and the reply by Stephen Few at http://datapigtechnologies.com/blog/index.php/the-good-and-bad-of-bullet-graphs/.

Creating dual axes charts

Dual axes charts are useful to compare two similar types of measures that may have different measurement units, such as pounds and dollars. In this recipe, we will look at the dual axes chart.

Getting ready

Let's use the same sample file, Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a dual axes chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Date, Type, and Market from Dimensions and Sales and Budget Sales from Measures.
3. Click on the dual line graph icon on the Show Me toolbar.
4. Click-and-drag Market from the Rows shelf into the Columns shelf.
5. Right-click on the Sales vertical axis and click on Synchronize Axis.

How it works...

Tableau creates two vertical axes and automatically places Sales on one vertical axis and Budget Sales on the other. The scales on the two vertical axes are different, however. By synchronizing the axes, we get the same scale on both axes for better comparison and a more accurate representation of the patterns.

Creating Gantt charts

Gantt charts are most commonly used in project management, as these charts show various activities and tasks along with the time required to complete those tasks. Gantt charts are even more useful when they show dependencies among various tasks. This type of chart is very helpful when the number of activities is low (around 20-30); otherwise the chart becomes too big to be understood easily.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a Gantt chart:

1. Click on Analysis from the top menu toolbar, and if Aggregate Measures is checked, click on it again to uncheck that option.
2. Click on the Show Me button to bring the Show Me toolbar onto the screen.
3. While holding the Ctrl key, click on Order Date and Category (under Products) from Dimensions and Time to Ship from Measures.
4. Click on the Gantt chart icon on the Show Me toolbar.
5. Drag-and-drop Order Date into the Filters pane. Select Years from the Filter Field [Order Date] options dialog box and hit Next. Check 2012 from the list and hit OK.
6. Right-click on YEAR(Order Date) on the Columns shelf and select the Day May 8, 2011 option.
7. Drag-and-drop Order Date into the Filters pane. Select Months from the Filter Field [Order Date] options dialog box and hit Next. Check December from the list and hit OK.
8. Drag-and-drop Region from Dimensions into the Color Marks input box.
9. Drag-and-drop Region from Dimensions into the Rows shelf before Category.

How it works...

Representing time this way helps the reader to discern which activity took the longest amount of time. We added the Order Date field twice in the Filters pane to first filter for the year 2012 and then for the month of December. In this recipe, out of all the products shipped in December of 2012, we can easily see that the red bars for the West region in the Office Supplies category are longer, suggesting that these products took the longest amount of time to ship.

There's more...

Andy Kriebel, a Tableau data visualization expert, has a great example of Gantt charts using US presidential data. The following link shows the lengths of terms in office of Presidents from various parties: http://vizwiz.blogspot.com/2010/09/tableau-tip-creating-waterfall-chart.html

Creating heat maps

A heat map is a visual representation of numbers in a table or a grid such that the bigger numbers are encoded by darker colors or bigger sizes and the smaller numbers by lighter colors or smaller sizes. This type of representation makes pattern detection easier for the reader.

Getting ready

Let's use the same sample file, Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a heat map chart:

1. Click on the Show Me button to bring the Show Me toolbar onto the screen.
2. While holding the Ctrl key, click on Sub-Category (under Products), Region, and Ship Mode from Dimensions and Profit from Measures.
3. Click on the heat maps chart icon on the Show Me toolbar.
4. Drag-and-drop Profit from Measures into the Color Marks box.

Summary

When we created the chart for the first time, Tableau assigned various sizes to the square boxes, but when we placed Profit as a color mark, red was used for low amounts of profit and green was used for higher amounts of profit. This made spotting patterns very easy. Binders and Binder Accessories, shipped by Regular Air in the Central region, generated very high amounts of profit, and Tables, shipped by Delivery Trucks in the East region, generated very low amounts of profit (it actually created losses for the company).

Resources for Article:

Further resources on this subject:
Constructing and Evaluating Your Design Solution [Article]
Basic use of Local Storage [Article]
Creating Interactive Graphics and Animation [Article]


Making Big Data Work for Hadoop and Solr

Packt
23 Aug 2013
8 min read
Understanding data processing workflows

Based on the data, the configuration, and the requirements, data can be processed at multiple levels while it is getting ready for search. Cascading and LucidWorks Big Data are a few such application platforms with which a complex data processing workflow can be rapidly developed on the Hadoop framework. In Cascading, the data is processed in different phases, with each phase containing a pipe responsible for carrying data units and applying a filter. The following diagram shows how incoming data can be processed in the pipeline-based workflow:

Once data has passed through the workflow, it can be persisted at the end in a repository, and later synced with the various nodes running in a distributed environment. The pipelining technique offers the following advantages:

The Apache Solr engine has less work to handle during index creation
Incremental indexing can be supported
By introducing an intermediate store, you can have regular data backups at the required stages
The data can be transferred to a different type of storage, such as HDFS, directly through multiple processing units
The data can be merged, joined, and processed as per the needs of different data sources

LucidWorks Big Data is a more powerful product which helps the user generate bulk indexes on Hadoop, classify and analyze the data, and provide distributed searching capabilities. Sharding is the process of breaking one index into multiple logical units, called shards, across multiple records. In the case of Solr, the results from the shards are aggregated and returned.

Big Data based technologies can be used with Apache Solr for various operations. Index creation itself can be made to run on a distributed system in order to speed up the overall index generation activity. Once that is done, the index can be distributed across the different nodes participating in Big Data, and Solr can be made to run in a distributed manner for searching the data. You can set up your Solr instance in the following different configurations:

Standalone machine

This configuration uses a single high-end server containing the indexes and Solr search; it is suitable for development and, in some cases, production.

Distributed setup

A distributed setup is suitable for large-scale indexes where the index is difficult to store on one system. In this case the index has to be distributed across multiple machines. Although the distributed configuration of Solr offers ample flexibility in terms of processing, it has its own limitations: a number of Apache Solr features, such as MoreLikeThis and Joins, are not supported. The following diagram depicts the distributed setup:

Replicated mode

In this mode, more than one Solr instance exists; among them, the master instance provides shared access to its slaves for replicating the indexes across multiple systems. The master continues to participate in index creation, search, and so on. Slaves sync up their storage through various replication techniques, such as the rsync utility. By default, Solr includes Java-based replication that uses the HTTP protocol for communication. This replication is recommended due to its benefits over other external replication techniques. This mode is not used anymore with the release of the Solr 4.x versions.

Sharded mode

This mode combines the best of both worlds and brings in the real value of a distributed system with high availability. In this configuration, the system has multiple masters, and each master holds multiple slaves to which the replication is carried out. A load balancer is used to spread the load equally across the nodes. The following diagram depicts the distributed and replicated setup:

If Apache Solr is deployed on a Hadoop-like framework, it falls into this category. Solr also provides SolrCloud for distributed Solr. We are going to look at the different approaches in the next sections.

Using the Solr-1045 patch – map-side indexing

The work on the Solr-1045 patch started with the goal of achieving index generation/building using an Apache MapReduce task. The Solr-1045 patch converts all the input records to a set of <key, value> pairs in each map task that runs on Hadoop. It then goes on to create a SolrInputDocument from each <key, value> pair, and later creates the Solr indexes. The following diagram depicts this process:

Reduce tasks can be used to perform deduplication of indexes and merge them together if required. Although index merging seems to be an interesting feature, it is actually a costly affair in terms of processing, and you will not find many implementations with merge-index functionality. Once the indexes are created, you can load them on your Solr instance and use them for searching.

You can download this particular patch from https://issues.apache.org/jira/browse/SOLR-1045 and patch your Solr instance. To apply a patch to your Solr instance, you need to first build your Solr instance from source. You can download the patch from Apache JIRA. Before running the patch, first do a dry run, which does not actually apply the patch. You can do it with the following commands:

cd <solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If it is successful, you can run the same command without the --dry-run option to apply the patch. Let's look at some of the important classes in the patch:

SolrIndexUpdateMapper: A Hadoop mapper responsible for creating indexes out of the <key, value> pairs of input.
SolrXMLDocRecordReader: This class is responsible for reading Solr input XML files.
SolrIndexUpdater: This class creates a MapReduce job configuration, runs the job to read the documents, and updates the Solr instance. Right now it is built using the Lucene index updater.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1045 patch:

Benefits:
It achieves complete parallelism by creating the index right in the map task.
Merging of indexes is possible in the reduce phase of MapReduce.

Drawbacks:
When the indexing is done at the map side, all the <key, value> pairs received by the reducer gain equal weight/importance, so it is difficult to use this patch with data that carries ranking/weight information.

Using the Solr-1301 patch – reduce-side indexing

This patch also focuses on using the Apache MapReduce framework for index creation. Keyword search can then happen over Apache Solr or Apache SolrCloud. Unlike Solr-1045, in this patch the indexes are created in the reduce phase of MapReduce. A map task is responsible for converting the input records to <key, value> pairs; these are then passed to the reducer, which converts them into SolrInputDocument objects and creates the indexes out of them. The index is then passed on as the output of the Hadoop MapReduce process. The following diagram depicts this process:

To use the Solr-1301 patch, you need to set up a Hadoop cluster. Once the index is created through the Hadoop patch, it should then be provisioned to the Solr server.
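To make the map phase of reduce-side indexing more concrete, the following is a minimal, generic Hadoop mapper of the kind the patch's CSVMapper performs, emitting a key and the raw record for the reduce phase to turn into documents. This is only a sketch: the class name, the choice of the first column as the key, and the plain comma split are illustrative assumptions, not code taken from the patch itself.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: emits the first CSV column as the key and the full
// record as the value, which a reducer can later turn into SolrInputDocuments.
public class CsvKeyValueMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        if (record.isEmpty()) {
            return; // skip blank lines
        }
        String[] fields = record.split(",");
        outKey.set(fields[0]);          // assume the first column is the document id
        context.write(outKey, line);    // pass the whole record to the reduce phase
    }
}

Keeping the mapper this thin is deliberate: all of the Solr-specific work (document conversion, batching, and committing) happens on the reduce side, which is exactly what preserves any per-document weighting.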
The patch contains a default converter for CSV files. Let's look at some of the important classes which are part of this patch:

CSVDocumentConverter: This class is responsible for converting the output of the map task, that is, key-value pairs, to SolrInputDocument; you can have multiple document converters.
CSVReducer: This is a reducer implementation for Hadoop reducers.
CSVIndexer: This is the main class to be called from your command line for creating indexes using MapReduce. You need to provide the input path for your data and the output path for storing shards.
SolrDocumentConverter: This class is used in your map task for converting your objects into Solr documents.
SolrRecordWriter: This class is an extension of mapreduce.RecordWriter; it breaks the data into multiple (key, value) pairs, which are then converted into a collection of SolrInputDocument(s), and this data is submitted to SolrEmbeddedServer in batches. Once completed, it commits the changes and runs the optimizer on the embedded server.
CSVMapper: This class parses CSV files and gets the key-value pairs out of them. This is a mapper class.
SolrOutputFormat: This class is responsible for converting key-value pairs and writing the data to the file/HDFS in zip/raw format.

Perform the following steps to run this patch:

1. Create a local folder with a configuration folder and a library folder: conf containing the Solr configuration (solr-config.xml, schema.xml), and lib containing the libraries.
2. Create your own converter class implementing SolrDocumentConverter; this will be used by SolrOutputFormat to convert output records to Solr documents. You may also override the OutputFormat class provided by Solr.
3. Write the Hadoop MapReduce job configuration:

SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
conf.setOutputFormat(SolrOutputFormat.class);
SolrDocumentConverter.setSolrDocumentConverter(<yourclassname>.class, conf);

4. Zip your configuration and load it into HDFS. The ZIP file name should be solr.zip (unless you change the patch code).
5. Now run the patch; each of the jobs will instantiate an EmbeddedSolrInstance which will in turn do the conversion, and finally the SolrOutputDocument(s) get stored in the output format.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1301 patch:

Benefits:
With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to prioritization during a search query.

Drawbacks:
Merging of indexes is not possible as it is with Solr-1045, because the indexes are created in the reduce phase.
The reducer becomes the crucial component of the system, due to the major tasks being performed there.

Summary

In this article, we have looked at the different possible approaches to making Big Data work with Apache Hadoop and Solr, along with the benefits and drawbacks of these approaches.

Resources for Article:

Further resources on this subject:
Advanced Hadoop MapReduce Administration [Article]
Apache Solr Configuration [Article]
Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article]

What is Hazelcast?

Packt
22 Aug 2013
10 min read
(For more resources related to this topic, see here.) Starting out as usual In most modern software systems, data is the key. For more traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience rather than performance. For example, here is what a traditional architecture might look like (which hopefully looks rather familiar) This presents us with an issue in terms of application scalability, in that it is relatively easy to scale our application layer by throwing more hardware at it to increase the processing capacity. But the monolithic constraints of our data layer would only allow us to do this so far before diminishing returns or resource saturation stunted further performance increases; so what can we do to address this? In the past and in legacy architectures, the only solution would be to increase the performance capability of our database infrastructure, potentially by buying a bigger, faster server or by further tweaking and fettling the utilization of currently available resources. Both options are dramatic, either in terms of financial cost and/or manpower; so what else could we do? Data deciding to hang around In order for us to gain a bit more performance out of our existing setup, we can hold copies of our data away from the primary database and use these in preference wherever possible. There are a number of different strategies we could adopt, from transparent second-level caching layers to external key-value object storage. The detail and exact use of each varies significantly depending on the technology or its place in the architecture, but the main desire of these systems is to sit alongside the primary database infrastructure and attempt to protect it from an excessive load. This would then tend to lead to an increased performance of the primary database by reducing the overall dependency on it. However, this strategy tends to be only particularly valuable as a short-term solution, effectively buying us a little more time before the database once again starts to reach saturation. The other downside is that it only protects our database from read-based load; if our application is predominately write-heavy, this strategy has very little to offer. So our expanded architecture could look a bit like the following figure: Therein lies the problem However, in insulating the database from the read load, we have introduced a problem in the form of a cache consistency issue, in that, how does our local data cache deal with changing data underneath it within the primary database? The answer is rather depressing: it can't! The exact manifestation of any issues will largely depend on the data needs of the application and how frequently the data changes; but typically, caching systems will operate in one of the two following modes to combat the problem: Time bound cache: Holds entries for a defined period (time-to-live or TTL) Write through cache: Holds entries until they are invalidated by subsequent updates Time bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than the cache expiry time of it, the cache is providing no real benefit. 
Write through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write through caches exist within the overall architecture, there will be consistency issues between them. We can avoid this by having a more intelligent cache that features a communication mechanism between nodes, so that entry invalidations can be propagated to the other nodes. In practice, an ideal cache would combine both features, so that entries are held for a known maximum time, but invalidations are also passed around as changes are made. So our evolved architecture would look a bit like the following figure:

So far we have looked through the general issues in scaling our data layer and introduced strategies to help combat the trade-offs we will encounter along the way; however, the real world isn't quite as simple. There are various cache servers and in-memory database products in this area; however, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues we experienced with just our primary database: we could encounter resource saturation or capacity issues if the product is a single instance, or, if the distribution doesn't provide consistency control, inconsistent data, which might harm our application.

Breaking the mould

Hazelcast is a radical new approach to data, designed from the ground up around distribution. It embraces a new scalable way of thinking, in that data should be shared around for both resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate.

The first major feature to understand about Hazelcast is its masterless nature; each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically delegating which node is responsible for which data. As new nodes join or drop out, the process is repeated and the cluster rebalances accordingly. This makes Hazelcast incredibly simple to get up and running, as the system is self-discovering, self-clustering, and works straight out of the box.

However, the second feature to remember is that we are persisting data entirely in-memory; this makes it incredibly fast, but the speed comes at a price. When a node is shut down, all the data that was held on it is lost. We combat this risk to resilience through replication, by holding enough copies of each piece of data across multiple nodes so that, in the event of failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. But don't pull the plug on more than one node at a time until the cluster has reacted to the change in membership and re-established the appropriate number of backup copies of data. So when we introduce our new masterless distributed cluster, we get something like the following figure:

We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node owns a number of partitions of the overall data, so the load is fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than at any individual node, and we can address that simply by adding more nodes.
In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used, so we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is, is an internal operational detail and can vary over time; we as developers never actually need to know. If we imagine that our data is split into a number of partitions, with each partition slice owned by one node and backed up on another, we could visualize the interactions like the following figure:

This means that for data belonging to Partition 1, our application will have to communicate with Node 1, with Node 2 for data belonging to Partition 2, and so on. The slicing of the data into each partition is dynamic; in practice, where there are more partitions than nodes, each node will own a number of different partitions and hold backups for others. As we have mentioned before, all of this is an internal operational detail, and our application does not need to know it, but it is important that we understand what is going on behind the scenes.

Moving to new ground

So far we have been talking mostly about simple persisted data and caches, but in reality we should not think of Hazelcast as purely a cache; it is much more powerful than that. It is an in-memory data grid that supports a number of distributed collections and features. We can load data from various sources into differing structures, send messages across the cluster, take out locks to guard against concurrent activity, and listen to the goings-on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection, or function in a manner comparable to other similar technologies, but all with the distribution and resilience capabilities already built in.

Standard utility collections:
Map: key-value pairs
List: a collection of objects
Set: a non-duplicated collection
Queue: an offer/poll FIFO collection

Specialized collections:
Multi-Map: a key to list-of-values collection
Lock: a cluster-wide mutex
Topic: publish/subscribe messaging

Concurrency utilities:
AtomicNumber: a cluster-wide atomic counter
IdGenerator: cluster-wide unique identifier generation
Semaphore: concurrency limitation
CountdownLatch: concurrent activity gate-keeping

Listeners: application notifications as things happen

In addition to data storage collections, Hazelcast also features a distributed executor service, allowing runnable tasks to be created that can be run anywhere on the cluster to obtain, manipulate, and store results. We could have a number of collections containing source data, then spin up a number of tasks to process the disparate data (for example, averaging or aggregating) and output the results into another collection for consumption. Again, just as we can scale up our data capacity by adding more nodes, we can increase the execution capacity in exactly the same way.

This essentially means that by building our data layer around Hazelcast, if our application's needs rapidly increase, we can keep adding nodes to satisfy seemingly extensive demands, all without having to redesign or re-architect the actual application. With Hazelcast, we are dealing more with a technology than a server product: a library to build a system around rather than something to retrospectively bolt on or blindly connect to as an off-the-shelf commercial system.
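Since the code examples elsewhere in this collection are in Python, here is a minimal sketch of talking to an already-running cluster from outside it using the hazelcast-python-client package; the cluster address, map name, and keys are illustrative, and the exact client configuration options depend on the client version you have installed:

import hazelcast

# Connect to an already-running cluster; the address is illustrative
client = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])

# A distributed map: the cluster decides which node owns each key's partition,
# and backup copies are maintained internally for resilience
capitals = client.get_map("capitals").blocking()
capitals.put("GB", "London")
capitals.put("FR", "Paris")

print(capitals.get("GB"))   # served by whichever node owns this key's partition
print(capitals.size())      # counted across the whole cluster

client.shutdown()

The same kinds of map, queue, and topic operations are available when Hazelcast nodes are embedded directly inside the application's own JVM processes, which is the mode discussed next.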
While it is possible (and in some simple cases quite practical) to run Hazelcast as a separate, server-like cluster and connect to it remotely from our application, some of the greatest benefits come when we develop our own classes and tasks to run within it and alongside it.

With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help solve. We can use the technology in many ways: in isolation to hold data such as user sessions, alongside a more long-term persistent data store to increase capacity, or as a platform for high-performance, scalable operations on our data. By moving more and more responsibility away from monolithic systems to such a generic, scalable one, there is no limit to the performance we can unlock. This allows us to keep our application and data layers separate while scaling them up independently as our application grows, avoiding our application becoming a victim of its own success while it hopefully takes the world by storm.

Summary

In this article, we learned about Hazelcast: a masterless, in-memory data grid whose large range of generic, distributed capabilities can help solve a whole world of data scalability problems.
Learning How to Classify with Real-world Examples

The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.

The following four attributes of each plant were measured:

Sepal length
Sepal width
Petal length
Petal width

In general, we will call any measurement from our data a feature. Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements?

This is the supervised learning or classification problem: given labeled examples, we can design a rule that will eventually be applied to other examples. This is the same setting that is used for spam classification; given the examples of spam and ham (non-spam e-mail) that the user gave the system, can we determine whether a new, incoming message is spam or not?

For the moment, the Iris dataset serves our purposes well. It is small (150 examples, 4 features each) and can easily be visualized and manipulated.

The first step is visualization

Because this dataset is so small, we can easily plot all of the points and all two-dimensional projections on a page. We will thus build intuitions that can then be extended to datasets with many more dimensions and datapoints. Each subplot in the following screenshot shows all the points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circles) and Iris Virginica are indicated with "x" marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.

We are using Matplotlib, the most well-known plotting package for Python. We present the code to generate the top-left plot; the code for the other plots is similar:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
# Map the numeric targets to species names; these string labels are used below
labels = data['target_names'][target]

for t, marker, c in zip(xrange(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker,
                c=c)

Building our first classification model

If the goal is to separate the three types of flower, we can immediately make a few suggestions. For example, the petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cutoff is, as follows:

plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))

This prints 1.9 and 3.0. Therefore, we can build a simple model: if the petal length is smaller than two, this is an Iris Setosa flower; otherwise, it is either Iris Virginica or Iris Versicolor.
if example[2] < 2:
    # example holds the four measurements of a single flower
    print('Iris Setosa')
else:
    print('Iris Virginica or Iris Versicolour')

This is our first model, and it works very well in that it separates the Iris Setosa flowers from the other two species without making any mistakes. What we had here was a simple structure: a simple threshold on one of the dimensions. Then we searched for the best dimension threshold. We performed this visually and with some calculation; machine learning happens when we write code to perform this for us.

The example where we distinguished Iris Setosa from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation. We can, however, try to do it the best possible way. For this, we will perform a little computation. We first select only the non-Setosa features and labels:

features = features[~is_setosa]
labels = labels[~is_setosa]
virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on the arrays. is_setosa is a Boolean array, and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, virginica, using an equality comparison on labels.

Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly:

best_acc = -1.0
for fi in xrange(features.shape[1]):
    # We are going to generate all possible thresholds for this feature
    thresh = features[:, fi].copy()
    thresh.sort()
    # Now test all thresholds:
    for t in thresh:
        pred = (features[:, fi] > t)
        acc = (pred == virginica).mean()
        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t

The last few lines select the best model. First we compare the predictions, pred, with the actual labels, virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all possible thresholds for all possible features have been tested, and the best_fi and best_t variables hold our model. To apply it to a new example, we perform the following:

if example[best_fi] > best_t:
    print('virginica')
else:
    print('versicolor')

What does this model look like? If we run it on the whole data, the best model that we get is split on the petal length. We can visualize the decision boundary. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Anything that falls in the white region will be called Iris Virginica and anything that falls on the shaded side will be classified as Iris Versicolor:

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where the points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold that will achieve exactly the same accuracy. Our method chose the first threshold, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the preceding section is a simple model; it achieves 94 percent accuracy on its training data. However, this evaluation may be overly optimistic. We used the data to define what the threshold would be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we have tried on this dataset.
The logic is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm has not seen at training time. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two blocks: on one block we'll train the model, and on the other (the one we held out of training) we'll test it. The output is as follows:

Training error was 96.0%.
Testing error was 90.0% (N = 50).

The result on the testing data is lower than that on the training data. This may surprise an inexperienced machine learner, but it is expected and typical. To see why, look back at the plot that showed the decision boundary. Consider what would happen if some of the examples close to the boundary were not there, or if one of the points between the two lines was missing: it is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the "wrong" side of the border.

The error on the training data is called the training error and is always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing error: the error on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the two errors is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold data out of training, is that we only used part of the data (in this case, we used half of it) for training. On the other hand, if we use too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well. We can achieve something quite similar by cross-validation.

One extreme (but sometimes useful) form of cross-validation is leave-one-out. We take an example out of the training data, learn a model without this example, and then see if the model classifies this example correctly:

error = 0.0
for ei in range(len(features)):
    # select all but the one at position 'ei':
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    # learn_model and apply_model wrap the threshold search and
    # prediction steps shown earlier
    model = learn_model(features[training], virginica[training])
    predictions = apply_model(features[testing], virginica[testing], model)
    error += np.sum(predictions != virginica[testing])

At the end of this loop, we will have tested a series of models on all the examples. However, there is no circularity problem, because each example was tested on a model that was built without taking that example into account. Therefore, the overall estimate is a reliable estimate of how well the models would generalize.

The major problem with leave-one-out cross-validation is that we are now being forced to perform 100 times more work. In fact, we must learn a whole new model for each and every example, and this cost will grow as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation; here, "x" stands for a small number, say, five. In order to perform five-fold cross-validation, we break up the data into five groups, that is, five folds. Then we learn five models, leaving one fold out of each.
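Here is a minimal, self-contained sketch of such a five-fold loop for the threshold model, assuming features and virginica are the non-Setosa arrays built earlier; fit_threshold and apply_threshold are illustrative names that simply package the search and prediction code shown above, not functions from any library (note also that this plain shuffle does not guarantee class-balanced folds, a caveat discussed below):

import numpy as np

def fit_threshold(features, labels):
    # Exhaustive search over every feature and candidate threshold, as shown earlier
    best_acc = -1.0
    best_fi, best_t = 0, 0.0
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > best_acc:
                best_acc, best_fi, best_t = acc, fi, t
    return best_fi, best_t

def apply_threshold(model, features):
    best_fi, best_t = model
    return features[:, best_fi] > best_t

folds = 5
indices = np.random.permutation(len(features))   # shuffle before splitting into folds
accuracies = []
for fold in np.array_split(indices, folds):
    testing = np.zeros(len(features), bool)
    testing[fold] = True
    training = ~testing
    model = fit_threshold(features[training], virginica[training])
    predictions = apply_threshold(model, features[testing])
    accuracies.append((predictions == virginica[testing]).mean())

print('Mean cross-validated accuracy: {0:.1%}'.format(np.mean(accuracies)))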
As the preceding sketch shows, the resulting code is similar to the leave-one-out code given earlier in this section, but here we leave 20 percent of the data out instead of just one element; we test each of these models on the left-out fold and average the results. The preceding figure illustrates this process for five blocks: the dataset is split into five pieces, and for each fold, you hold out one of the blocks for testing and train on the other four.

You can use any number of folds you wish. Five or ten folds are typical; this corresponds to training with 80 or 90 percent of your data and should already be close to what you would get from using all the data. In the extreme case, if you have as many folds as datapoints, you simply perform leave-one-out cross-validation. When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, the results will not be representative. We will not go into the details of how to do this because the machine learning packages will handle it for you.

We have now generated several models instead of just one. So, what final model do we return and use for new data? The simplest solution is to use a single overall model trained on all your training data; the cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well, and at the end of the cross-validation loop, you can use all your data to train a final model.

Although it was not properly recognized when machine learning was starting out, nowadays it is seen as a very bad sign to even discuss the training error of a classification system, because the results can be very misleading. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation schedule.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this article, you will see many other types of models, and we're not even going to cover everything that is out there. What makes up a classification model? We can break it up into three parts:

The structure of the model: Here, we use a threshold on a single feature.
The search procedure: Here, we try every possible combination of feature and threshold.
The loss function: Using the loss function, we decide which of the possibilities is less bad (because we can rarely talk about the perfect solution). We can use the training error, or define the point the other way around and say that we want the best accuracy. Traditionally, people want the loss function to be minimized.

We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but only test three values for each feature: the mean value of the feature, the mean plus one standard deviation, and the mean minus one standard deviation. This could make sense if testing each value was very costly in terms of computer time (or if we had millions and millions of datapoints); then the exhaustive search we used would be infeasible, and we would have to perform an approximation like this.

Alternatively, we might have different loss functions. It might be that one type of error is much more costly than another. In a medical setting, false negatives and false positives are not equivalent.
A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have the disease) might lead to additional tests for confirmation purposes or unnecessary treatment (which can still have costs, including side effects from the treatment). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and treatment is cheap with very few negative side effects, you want to minimize the false negatives as much as you can. With spam filtering, we may face the same problem: incorrectly deleting a non-spam e-mail can be very dangerous for the user, while letting a spam e-mail through is just a minor annoyance.

What the cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes (achieving the highest accuracy). However, if some mistakes are more costly than others, it might be better to accept a lower overall accuracy to minimize overall costs.

Finally, we can also have other classification structures. A simple threshold rule is very limiting and will only work in the very simplest cases, such as with the Iris dataset.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We will now look at another agricultural dataset; it is still small, but now too big to plot exhaustively on a page as we did with Iris. This is a dataset of measurements of wheat seeds. Seven features are present, as follows:

Area (A)
Perimeter (P)
Compactness (C = 4πA/P²)
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove

There are three classes that correspond to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset, and its features were automatically computed from digital images.

This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. Later, we will work through the computer vision side of this problem and compute features from images. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and Seeds datasets used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/

Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. This is a general area normally termed feature engineering; it is sometimes seen as less glamorous than algorithms, but it may matter more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).
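As a concrete sketch of deriving such a combined feature, compactness can be computed directly from the area and perimeter columns; the column positions below assume the feature order listed above and that the Seeds measurements have already been loaded into a NumPy array called features, so they are illustrative rather than taken from the article's code:

import numpy as np

# features is assumed to hold the Seeds measurements, one row per kernel,
# with area in column 0 and perimeter in column 1 (the order listed above)
area = features[:, 0]
perimeter = features[:, 1]

# Compactness C = 4*pi*A / P**2: close to 1 for round kernels, smaller for elongated ones
compactness = 4 * np.pi * area / perimeter ** 2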
In this case, the original researchers computed the "compactness", which is a typical feature for shapes (also called "roundness"). This feature will have the same value for two kernels, one of which is twice as big as the other, as long as they have the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) as compared to kernels that are elongated (when the feature is close to zero).

The goals of a good feature are to simultaneously vary with what matters and be invariant to what does not. For example, compactness does not vary with size, but it varies with shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal.

You will need to use background knowledge to intuit which will be good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. Often though, you can use your knowledge of the specific problem to design a specific feature. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier.

A natural question is whether or not we can select good features automatically. This problem is known as feature selection. Many methods have been proposed for this problem, but in practice, very simple ideas work best. It does not make sense to use feature selection in these small problems, but if you had thousands of features, throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

With this dataset, even if we just try to separate two classes using the previous method, we do not get very good results. Let me introduce, therefore, a new classifier: the nearest neighbor classifier. If we consider that each example is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between examples. We can choose different ways of computing the distance, for example:

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1) ** 2)

Now when classifying, we adopt a simple rule: given a new example, we look in the dataset for the point that is closest to it (its nearest neighbor) and look at its label:

def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example) for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]

In this case, our model involves saving all of the training data and labels and computing everything at classification time. A better implementation would be to index the training data at learning time to speed up classification, but that requires a more complex algorithm. Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol. Using ten folds of cross-validation for this dataset with this algorithm, we obtain 88 percent accuracy.
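As a quick usage sketch, a leave-one-out check of nn_classify (a simple stand-in for the ten-fold protocol mentioned above) could look like the following, assuming features and labels hold the Seeds measurements and variety names as NumPy arrays:

import numpy as np

correct = 0.0
for ei in range(len(features)):
    # Train on everything except example 'ei', then classify that example
    training = np.ones(len(features), bool)
    training[ei] = False
    prediction = nn_classify(features[training], labels[training], features[ei])
    correct += (prediction == labels[ei])

print('Leave-one-out accuracy: {0:.1%}'.format(correct / len(features)))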
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. We will now examine the decision boundary. For this, we will be forced to simplify and look at only two dimensions (just so that we can plot it on paper).

In the preceding screenshot, the Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance according to the preceding function, we are, for the most part, only taking the x axis into account.

If you have a physics background, you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this simple pair of operations:

# subtract the mean for each feature:
features -= features.mean(axis=0)
# divide each feature by its standard deviation
features /= features.std(axis=0)

Independent of what the original values were, after Z-scoring, a value of zero is the mean, positive values are above the mean, and negative values are below it. Now every feature is on the same scale (technically, every feature is now dimensionless; it has no units) and we can mix dimensions more confidently. In fact, if we now run our nearest neighbor classifier, we obtain 94 percent accuracy! Look at the decision space again in two dimensions; it looks as shown in the following screenshot:

The boundaries are now much more complex and there is interaction between the two dimensions. In the full dataset, everything is happening in a seven-dimensional space that is very hard to visualize, but the same principle applies: where before a few dimensions were dominant, now they are all given the same importance.

The nearest neighbor classifier is simple, but sometimes good enough. We can generalize it to a k-nearest neighbor classifier by considering not just the closest point but the k closest points. All k neighbors vote to select the label. k is typically a small number, such as 5, but can be larger, particularly if the dataset is very large.

Binary and multiclass classification

The first classifier we saw, the threshold classifier, was a simple binary classifier (the result is either one class or the other, as a point is either above the threshold or it is not). The second classifier we used, the nearest neighbor classifier, is naturally a multiclass classifier (the output can be one of several classes). It is often simpler to define a simple binary method than one that works on multiclass problems. However, we can reduce the multiclass problem to a series of binary decisions. This is what we did earlier with the Iris dataset, in a haphazard way; we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

Is it an Iris Setosa (yes or no)?
If no, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of "one classifier versus the rest of the classifiers". For each possible label l, we build a classifier of the type "is this l or something else?". When applying the rule, exactly one of the classifiers would say "yes" and we would have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.

Alternatively, we can build a classification tree. Split the possible labels into two and build a classifier that asks, "should this example go to the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.

There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases, and which one you use normally does not make much of a difference to the final result.

Most classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem.

Summary

In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with a classic dataset. This, by now, is considered a very small problem. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples, but the intuitions we gained here will still be valid.

Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning. We also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that was not used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).

We also had a look at the problem of feature engineering. Features are not something that is predefined for you; choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvement in accuracy, because better data beats fancier methods. In this article, we wrote all of our own code (except when we used NumPy, of course); we needed to build up intuitions on simple cases to illustrate the basic concepts.