Working with Incanter Datasets

Packt
04 Feb 2015
28 min read
In this article by Eric Rochester, author of the book Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes:

Loading Incanter's sample datasets
Loading Clojure data structures into datasets
Viewing datasets interactively with view
Converting datasets to matrices
Using infix formulas in Incanter
Selecting columns with $
Selecting rows with $
Filtering datasets with $where
Grouping data with $group-by
Saving datasets to CSV and JSON
Projecting from multiple datasets with $join

(For more resources related to this topic, see here.)

Introduction

Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure. Incanter's core data structure is the dataset, so we'll spend some time in this article looking at how to use datasets effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful.

At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a numeric value. However, some operations require the data to be numeric only. First you'll learn how to populate and view datasets, then you'll learn different ways to query and project the parts of the dataset that you're interested in onto a new dataset. Finally, we'll take a look at how to save datasets and merge multiple datasets together.

Loading Incanter's sample datasets

Incanter comes with a set of default datasets that are useful for exploring Incanter's functions. I haven't made use of them in this book, since there is so much data available in other places, but they're a great way to get a feel for what you can do with Incanter. Some of these datasets—for instance, the Iris dataset—are widely used to teach and test statistical algorithms. It contains the species and the petal and sepal dimensions for 150 irises, 50 of each species. This is the dataset that we'll access today. In this recipe, we'll load a dataset and see what it contains.

Getting ready

We'll need to include Incanter in our Leiningen project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We'll also need to include the right Incanter namespaces in our script or REPL:

(use '(incanter core datasets))

How to do it…

Once the namespaces are available, we can access the datasets easily:

user=> (def iris (get-dataset :iris))
#'user/iris
user=> (col-names iris)
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
user=> (nrow iris)
150
user=> (set ($ :Species iris))
#{"versicolor" "virginica" "setosa"}

How it works…

We use the get-dataset function to access the built-in datasets. In this case, we're loading Fisher's Iris dataset, sometimes called Anderson's dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different irises of three different species.

Incanter's sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data. By the way, the names of the functions should be familiar to you if you've previously used R. Incanter often uses the names of R's functions instead of the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count.
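To make that naming convention a little more concrete, here is a short REPL sketch that is not part of the original recipe; it assumes the same namespaces and the iris dataset defined above, and uses ncol and dim, which are incanter.core's counterparts of the R functions of the same names:

;; Assumes (use '(incanter core datasets)) and the iris dataset from above.
user=> (ncol iris)   ; number of columns, R-style
5
user=> (dim iris)    ; [rows columns] in one call
[150 5]
user=> (count (col-names iris))
5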
There's more...

Incanter's API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles.

Loading Clojure data structures into datasets

While they are good for learning, Incanter's built-in datasets probably won't be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We'll take a look at a couple of these in this recipe.

Getting ready

We'll just need Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We'll also need to include this in our script or REPL:

(use 'incanter.core)

How to do it…

The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we'll start with slightly more complicated inputs. Generally, you'll be working with at least a matrix. If you pass this to to-dataset, what do you get?

user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))
#'user/matrix-set
user=> (nrow matrix-set)
2
user=> (col-names matrix-set)
[:col-0 :col-1 :col-2]

All the data's here, but it can be labeled in a better way. Does to-dataset handle maps?

user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))
#'user/map-set
user=> (nrow map-set)
1
user=> (col-names map-set)
[:a :c :b]

So, map keys become the column labels. That's much more intuitive. Let's throw a sequence of maps at it:

user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},
                                  {:a 4, :b 5, :c 6}]))
#'user/maps-set
user=> (nrow maps-set)
2
user=> (col-names maps-set)
[:a :c :b]

This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset:

user=> (def matrix-set-2
         (dataset [:a :b :c]
                  [[1 2 3] [4 5 6]]))
#'user/matrix-set-2
user=> (nrow matrix-set-2)
2
user=> (col-names matrix-set-2)
[:c :b :a]

How it works…

The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence. Ultimately, it uses the dataset constructor to create the dataset. The dataset constructor itself requires the data to be passed in as a column vector and a row matrix. When the data is already in this format, or when we need the most control—to rename the columns, for instance—we can use dataset directly.

Viewing datasets interactively with view

Being able to interact with our data programmatically is important, but sometimes it's also helpful to be able to look at it. This can be especially useful when you do data exploration.

Getting ready

We'll need to have Incanter in our project.clj file and script or REPL, so we'll use the same setup as we did for the Loading Incanter's sample datasets recipe, as follows. We'll also use the Iris dataset from that recipe.

(use '(incanter core datasets))

How to do it…

Incanter makes this very easy.
Let's take a look at just how simple it is: First, we need to load the dataset, as follows: user=> (def iris (get-dataset :iris)) #'user/iris Then we just call view on the dataset: user=> (view iris) This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it's usually hiding behind another window: How it works… Incanter's view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table. Converting datasets to matrices Although datasets are often convenient, many times we'll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we'll need matrices many times because some of Incanter's functions, such as trans, only operate on a matrix. Plus, it implements Clojure's ISeq interface, so interacting with matrices is also convenient. Getting ready For this recipe, we'll need the Incanter libraries, so we'll use this project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll use the core and io namespaces, so we'll load these into our script or REPL: (use '(incanter core io)) This line binds the file name to the identifier data-file: (def data-file "data/all_160_in_51.P35.csv") How to do it… For this recipe, we'll create a dataset, convert it to a matrix, and then perform some operations on it: First, we need to read the data into a dataset, as follows: (def va-data (read-dataset data-file :header true)) Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we'll pull out a few of the columns since matrixes can only contain floating-point numbers: (def va-matrix    (to-matrix ($ [:POP100 :HU100 :P035001] va-data))) Now that it's a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row, take in order to get a subset of the matrix, and count in order to get the number of rows in the matrix: user=> (first va-matrix) A 1x3 matrix ------------- 8.19e+03 4.27e+03 2.06e+03   user=> (count va-matrix) 591 We can also use Incanter's matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately: user=> (reduce plus va-matrix) A 1x3 matrix ------------- 5.43e+06 2.26e+06 1.33e+06 How it works… The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter's more sophisticated analysis functions, as they're easy to work with. There's more… In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices. Using infix formulas in Incanter There's a lot to like about lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notations, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do. However, we're not taught to read math expressions using prefix notations (with the operator first). 
And especially when formulas get even a little complicated, tracing out exactly what's happening can get hairy.

Getting ready

For this recipe we'll just need Incanter in our project.clj file, so we'll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe. For data, we'll use the matrix that we created in the Converting datasets to matrices recipe.

How to do it…

Incanter has a macro that converts a standard math notation to a lisp notation. We'll explore that in this recipe:

The $= macro changes its contents to use an infix notation, which is what we're used to from math class:

user=> ($= 7 * 4)
28
user=> ($= 7 * 4 + 3)
31

We can also work on whole matrices or just parts of matrices. In this example, we perform a scalar multiplication of the matrix:

user=> ($= va-matrix * 4)
A 591x3 matrix
---------------
3.28e+04 1.71e+04 8.22e+03
2.08e+03 9.16e+02 4.68e+02
1.19e+03 6.52e+02 3.08e+02
...
1.41e+03 7.32e+02 3.72e+02
1.31e+04 6.64e+03 3.49e+03
3.02e+04 9.60e+03 6.90e+03

user=> ($= (first va-matrix) * 4)
A 1x3 matrix
-------------
3.28e+04 1.71e+04 8.22e+03

Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix:

user=> ($= (sum (first va-matrix)) /
           (count (first va-matrix)))
4839.333333333333

Or we can build expressions that take the mean of each column, as follows:

user=> ($= (reduce plus va-matrix) / (count va-matrix))
A 1x3 matrix
-------------
9.19e+03 3.83e+03 2.25e+03

How it works…

Any time you're working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. Its sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1. Let's see what these macros expand into:

user=> (macroexpand-1 '($= 7 * 4))
(incanter.core/mult 7 4)
user=> (macroexpand-1 '($= 7 * 4 + 3))
(incanter.core/plus (incanter.core/mult 7 4) 3)
user=> (macroexpand-1 '($= 3 + 7 * 4))
(incanter.core/plus 3 (incanter.core/mult 7 4))

Here, we can see that the expression doesn't expand into Clojure's * or + functions, but uses Incanter's matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently. Otherwise, it switches around the expressions the way we'd expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly.

Selecting columns with $

Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns. The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action.

Getting ready

For this recipe, we'll need to have Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We'll also need to include these libraries in our script or REPL:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

Moreover, we'll need some data.
This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name. How to do it… We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2: Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those: (defn with-header [coll] (let [headers (map #(keyword (str/replace % space -))                      (first coll))]    (map (partial zipmap headers) (next coll))))   (defn read-country-data [filename] (with-open [r (io/reader filename)]    (i/to-dataset      (doall (with-header                (drop 2 (csv/read-csv r))))))) Now, using these functions, we can load the data: user=> (def chn-data (read-country-data data-file)) We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column: user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" … We can select more than one column by listing all of them in a vector. This time, the results are in a dataset: user=> (i/$ [:Indicator-Code :1992] chn-data)   |           :Indicator-Code |               :1992 | |---------------------------+---------------------| |           AG.AGR.TRAC.NO |             770629 | |         AG.CON.FERT.PT.ZS |                     | |           AG.CON.FERT.ZS |                     | |           AG.LND.AGRI.K2 |             5159980 | … We can list as many columns as we want, although the formatting might suffer: user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)   |           :Indicator-Code |               :1992 |               :2002 | |---------------------------+---------------------+---------------------| |           AG.AGR.TRAC.NO |            770629 |                     | |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 | |           AG.CON.FERT.ZS |                     |   373.087159048868 | |           AG.LND.AGRI.K2 |             5159980 |             5231970 | … How it works… The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis. There's more… The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too: user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)   |   :Indicator-Code |                                               :Indicator-Name | |-------------------+---------------------------------------------------------------| |   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) | |   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | … See also… For information on how to pull out specific rows, see the next recipe, Selecting rows with $. Selecting rows with $ The Incanter macro $ also pulls rows out of a dataset. In this recipe, we'll see this in action. 
Getting ready

For this recipe, we'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows:

We can create a sequence of the values of one row using $, passing it the index of the row we want as well as :all for the columns:

user=> (i/$ 0 :all chn-data)
("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN" "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021" "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" "" "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "")

We can also pull out a dataset containing multiple rows by passing more than one index into $ with a vector (there's a lot of data, even for three rows, so I won't show it here):

(i/$ (range 3) :all chn-data)

We can also combine the two ways to slice data in order to pull out specific columns and rows. We can either pull out a single row or multiple rows:

user=> (i/$ 0 [:Indicator-Code :1992] chn-data)
("AG.AGR.TRAC.NO" "770629")
user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)

|   :Indicator-Code |  :1992 |
|-------------------+--------|
|   AG.AGR.TRAC.NO  | 770629 |
| AG.CON.FERT.PT.ZS |        |
|   AG.CON.FERT.ZS  |        |

How it works…

The $ macro is the workhorse used to slice rows and project (or select) columns from datasets. When it's called with two indexing parameters, the first is the row or rows and the second is the column or columns.

Filtering datasets with $where

While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We'll take a look at its query language in this recipe.

Getting ready

We'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Once we have the data, we query it using the $where function:

For example, this creates a dataset with a row for the percentage of China's total land area that is used for agriculture:

user=> (def land-use
         (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}
                   chn-data))
user=> (i/nrow land-use)
1
user=> (i/$ [:Indicator-Code :2000] land-use)
("AG.LND.AGRI.ZS" "56.2891584865366")

The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering out any empty strings in that column:

user=> (i/$ (range 5) [:Indicator-Code :1962]
         (i/$where {:1962 {:ne ""}} chn-data))

|   :Indicator-Code |             :1962 |
|-------------------+-------------------|
|   AG.AGR.TRAC.NO  |             55360 |
|   AG.LND.AGRI.K2  |           3460010 |
|   AG.LND.AGRI.ZS  |  37.0949187612906 |
|   AG.LND.ARBL.HA  |         103100000 |
| AG.LND.ARBL.HA.PC | 0.154858284392508 |

Incanter's query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities.

How it works…

To better understand how to use $where, let's break apart the last example:

(i/$where {:1962 {:ne ""}} chn-data)

The query is expressed as a hashmap from fields to values. As we saw in the first example, the value can be a raw value, either a literal or an expression. Here, {:ne ""} tests for inequality.
(i/$where {:1962 {:ne ""}} chn-data)

Each test pair is associated with a field in another hashmap. In this example, both of the hashmaps shown contain only one key-value pair. However, they might contain multiple pairs, which will all be ANDed together.

Incanter supports a number of test operators. The basic boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in and :$nin (not in). The last operator—:$fn—is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset:

(def random-half
  (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}}
            chn-data))

There's more…

For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset).

Grouping data with $group-by

Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis.

Getting ready

First, we'll need to declare a dependency on Incanter in the project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

Next, we'll include Incanter core and io in our script or REPL:

(require '[incanter.core :as i]
         '[incanter.io :as i-io])

For data, we'll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. These lines will load the data into the race-data name:

(def data-file "data/all_160.P3.csv")
(def race-data (i-io/read-dataset data-file :header true))

How to do it…

Incanter lets you group rows for further analysis, or to summarize them, with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on:

(def by-state (i/$group-by :STATE race-data))

How it works…

This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look:

user=> (take 5 (keys by-state))
({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25})

We can get the data for Virginia back out by querying the group map for state 51:

user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]
            (by-state {:STATE 51}))

|  :GEOID | :STATE |         :NAME | :POP100 |
|---------+--------+---------------+---------|
| 5100148 |     51 | Abingdon town |    8191 |
| 5100180 |     51 |  Accomac town |     519 |
| 5100724 |     51 |  Alberta town |     298 |

Saving datasets to CSV and JSON

Once you've done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn't have a good way to do this. However, with the help of some Clojure libraries, it's not difficult at all.
Getting ready We'll need to include a number of dependencies in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                 [org.clojure/data.csv "0.1.2"]                 [org.clojure/data.json "0.2.5"]]) We'll also need to include these libraries in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]          '[clojure.data.csv :as csv]          '[clojure.data.json :as json]          '[clojure.java.io :as io]) Also, we'll use the same data that we introduced in the Selecting columns with $ recipe. How to do it… This process is really as simple as getting the data and saving it. We'll pull out the data for the year 2000 from the larger dataset. We'll use this subset of the data in both the formats here: (def data2000 (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data)) Saving data as CSV To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it: (with-open [f-out (io/writer "data/chn-2000.csv")] (csv/write-csv f-out [(map name (i/col-names data2000))]) (csv/write-csv f-out (i/to-list data2000))) Saving data as JSON To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the file: (with-open [f-out (io/writer "data/chn-2000.json")] (json/write (:rows data2000) f-out)) How it works… For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple. Projecting from multiple datasets with $join So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together. Getting ready First, we'll need to include these dependencies in our project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) We'll use these statements for inclusions: (require '[clojure.java.io :as io]          '[clojure.data.csv :as csv]          '[clojure.string :as str]          '[incanter.core :as i]) For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank. How to do it… In this recipe, we'll take a look at how to join two datasets using Incanter: To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe: (def data-file "data/chn/chn_Country_en_csv_v2.csv") (def chn-data (read-country-data data-file)) Currently, the data for each row contains the data for one indicator across many years. 
However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data from two years into separate datasets. Note that for the second dataset, we'll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000):

(def chn-1990
  (i/$ [:Indicator-Code :Indicator-Name :1990] chn-data))
(def chn-2000
  (i/$ [:Indicator-Code :2000] chn-data))

Now, we'll join these datasets back together. This is contrived, but it's easy to see how we would do this in a more meaningful example. For example, we might want to join the datasets from two different countries:

(def chn-decade
  (i/$join [:Indicator-Code :Indicator-Code]
           chn-1990 chn-2000))

From this point on, we can use chn-decade just as we use any other Incanter dataset.

How it works…

Let's take a look at this in more detail:

(i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000)

The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000). This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets.

Summary

In this article, we have covered the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively. A short sketch that pulls several of these recipes together follows the resource list below.

Resources for Article:

Further resources on this subject: The Hunt for Data [article] Limits of Game Data Analysis [article] Clojure for Domain-specific Languages - Design Concepts with Clojure [article]
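As a closing illustration (not part of the original article), here is a short, self-contained sketch that combines several of the recipes above—loading a sample dataset, filtering it with $where, projecting columns with $, and writing the result out with clojure.data.csv, following the same pattern as the Saving datasets to CSV and JSON recipe. The output path data/setosa.csv is just an illustrative choice:

(require '[incanter.core :as i]
         '[incanter.datasets :as datasets]
         '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Load the bundled Iris data, keep only the setosa rows,
;; and project two of the measurement columns.
(def iris (datasets/get-dataset :iris))
(def setosa
  (i/$ [:Petal.Length :Petal.Width]
       (i/$where {:Species "setosa"} iris)))

;; Write a header row followed by the data rows, as in the CSV recipe above.
(with-open [f-out (io/writer "data/setosa.csv")]
  (csv/write-csv f-out [(map name (i/col-names setosa))])
  (csv/write-csv f-out (i/to-list setosa)))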

In the Cloud

Packt
22 Jan 2015
14 min read
This article by Rafał Kuć, author of the book Solr Cookbook - Third Edition, covers the cloud side of Solr—SolrCloud: setting up collections, replica configuration, distributed indexing and searching, as well as aliasing and shard manipulation. We will also learn how to create a cluster. (For more resources related to this topic, see here.)

Creating a new SolrCloud cluster

Imagine a situation where one day you have to set up a distributed cluster with the use of Solr. The amount of data is just too much for a single server to handle. Of course, you can just set up a second server or go for another master server with another set of data. But before Solr 4.0, you would have to take care of the data distribution yourself. In addition to this, you would also have to take care of setting up replication, data duplication, and so on. With SolrCloud you don't have to do this—you can just set up a new cluster, and this article will show you how to do that.

Getting ready

You'll need a ZooKeeper cluster set up and ready for production use.

How to do it...

Let's assume that we want to create a cluster that will have four Solr servers. We also would like to have our data divided between the four Solr servers in such a way that we have the original data on two machines, and in addition to this, we would also have a copy of each shard available in case something happens with one of the Solr instances. I also assume that we already have our ZooKeeper cluster set up, ready, and available at the address 192.168.1.10 on the 9983 port. For this article, we will set up four SolrCloud nodes on the same physical machine:

We will start by running an empty Solr server (without any configuration) on port 8983. We do this by running the following command (for Solr 4.x):

java -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, we will run the following command:

bin/solr -c -z 192.168.1.10:9983

Now we start another three nodes, each on a different port (note that different Solr instances can run on the same port, but they should be installed on different machines). We do this by running one command for each installed Solr server (for Solr 4.x):

java -Djetty.port=6983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=4983 -DzkHost=192.168.1.10:9983 -jar start.jar
java -Djetty.port=2983 -DzkHost=192.168.1.10:9983 -jar start.jar

For Solr 5, the commands will be as follows:

bin/solr -c -p 6983 -z 192.168.1.10:9983
bin/solr -c -p 4983 -z 192.168.1.10:9983
bin/solr -c -p 2983 -z 192.168.1.10:9983

Now we need to upload our collection configuration to ZooKeeper. Assuming that we have our configuration in /home/conf/solrconfiguration/conf, we will run the following command from the home directory of the Solr server that runs first (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory):

./zkcli.sh -cmd upconfig -zkhost 192.168.1.10:9983 -confdir /home/conf/solrconfiguration/conf/ -confname collection1

Now we can create our collection using the following command:

curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=collection1'

If we now go to http://localhost:8983/solr/#/~cloud, we will see the cluster view. As we can see, Solr has created a new collection with a proper deployment. Let's now see how it works.

How it works...
We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collection, because we didn't create them. For Solr 4.x, we started by running Solr and telling it that we want it to run in SolrCloud mode. We did that by specifying the -DzkHost property and setting its value to the IP address of our ZooKeeper instance. Of course, in the production environment, you would point Solr to a cluster of ZooKeeper nodes—this is done using the same property, but the IP addresses are separated using the comma character. For Solr 5, we used the solr script provided in the bin directory. By adding the -c switch, we told Solr that we want it to run in the SolrCloud mode. The -z switch works exactly the same as the -DzkHost property for Solr 4.x—it allows you to specify the ZooKeeper host that should be used. Of course, the other three Solr nodes run exactly in the same manner. For Solr 4.x, we add the -DzkHost property that points Solr to our ZooKeeper. Because we are running all the four nodes on the same physical machine, we needed to specify the -Djetty.port property, because we can run only a single Solr server on a single port. For Solr 5, we use the -z property of the bin/solr script and we use the -p property to specify the port on which Solr should start. The next step is to upload the collection configuration to ZooKeeper. We do this because Solr will fetch this configuration from ZooKeeper when you will request the collection creation. To upload the configuration, we use the zkcli.sh script provided with the Solr distribution. We use the upconfig command (the -cmd switch), which means that we want to upload the configuration. We specify the ZooKeeper host using the -zkHost property. After that, we can say which directory our configuration is stored (the -confdir switch). The directory should contain all the needed configuration files such as schema.xml, solrconfig.xml, and so on. Finally, we specify the name under which we want to store our configuration using the -confname switch. After we have our configuration in ZooKeeper, we can create the collection. We do this by running a command to the Collections API that is available at the /admin/collections endpoint. First, we tell Solr that we want to create the collection (action=CREATE) and that we want our collection to be named firstCollection (name=firstCollection). Remember that the collection names are case sensitive, so firstCollection and firstcollection are two different collections. We specify that we want our collection to be built of two primary shards (numShards=2) and we want each shard to be present in two copies (replicationFactor=2). This means that we will have a primary shard and a single replica. Finally, we specify which configuration should be used to create the collection by specifying the collection.configName property. As we can see in the cloud, a view of our cluster has been created and spread across all the nodes. There's more... There are a few things that I would like to mention—the possibility of running a Zookeeper server embedded into Apache Solr and specifying the Solr server name. Starting an embedded ZooKeeper server You can also start an embedded Zookeeper server shipped with Solr for your test environment. In order to do this, you should pass the -DzkRun parameter instead of -DzkHost=192.168.0.10:9983, but only in the command that sends our configuration to the Zookeeper cluster. 
So the final command for Solr 4.x should look similar to this: java -DzkRun -jar start.jar In Solr 5.0, the same command will be as follows: bin/solr start -c By default, ZooKeeper will start on the port higher by 1,000 to the one Solr is started at. So if you are running your Solr instance on 8983, ZooKeeper will be available at 9983. The thing to remember is that the embedded ZooKeeper should only be used for development purposes and only one node should start it. Specifying the Solr server name Solr needs each instance of SolrCloud to have a name. By default, that name is set using the IP address or the hostname, appended with the port the Solr instance is running on, and the _solr postfix. For example, if our node is running on 192.168.56.1 and port 8983, it will be called 192.168.56.1:8983_solr. Of course, Solr allows you to change that behavior by specifying the hostname. To do this, start using the -Dhost property or add the host property to solr.xml. For example, if we would like one of our nodes to have the name of server1, we can run the following command to start Solr: java -DzkHost=192.168.1.10:9983 -Dhost=server1 -jar start.jar In Solr 5.0, the same command would be: bin/solr start -c -h server1 Setting up multiple collections on a single cluster Having a single collection inside the cluster is nice, but there are multiple use cases when we want to have more than a single collection running on the same cluster. For example, we might want users and books in different collections or logs from each day to be only stored inside a single collection. This article will show you how to create multiple collections on the same cluster. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on the 2181 port and that we already have four SolrCloud nodes running as a cluster. How to do it... As we already have all the prerequisites, such as ZooKeeper and Solr up and running, we need to upload our configuration files to ZooKeeper to be able to create collections: Assuming that we have our configurations in /home/conf/firstcollection/conf and /home/conf/secondcollection/conf, we will run the following commands from the home directory of the first run Solr server to upload the configuration to ZooKeeper (the zkcli.sh script can be found in the Solr deployment example in the scripts/cloud-scripts directory): ./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/firstcollection/conf/ -confname firstcollection./zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /home/conf/secondcollection/conf/ -confname secondcollection We have pushed our configurations into Zookeeper, so now we can create the collections we want. In order to do this, we use the following commands: curl 'localhost:8983/solr/admin/collections?action=CREATE&name=firstCollection&numShards=2&replicationFactor=2&collection.configName=firstcollection'curl 'localhost:8983/solr/admin/collections?action=CREATE&name=secondcollection&numShards=4&replicationFactor=1&collection.configName=secondcollection' Now, just to test whether everything went well, we will go to http://localhost:8983/solr/#/~cloud. As the result, we will see the following cluster topology: As we can see, both the collections were created the way we wanted. Now let's see how that happened. How it works... We assume that we already have ZooKeeper installed—it is empty and doesn't have information about any collections, because we didn't create them. 
We also assumed that we have our SolrCloud cluster configured and started. We start by uploading two configurations to ZooKeeper, one called firstcollection and the other called secondcollection. After that we are ready to create our collections. We start by creating the collection named firstCollection that is built of two primary shards and one replica. The second collection, called secondcollection is built of four primary shards and it doesn't have any replicas. We can see that easily in the cloud view of the deployment. The firstCollection collection has two shards—shard1 and shard2. Each of the shard has two physical copies—one green (which means active) and one with a black dot, which is the primary shard. The secondcollection collection is built of four physical shards—each shard has a black dot near its name, which means that they are primary shards. Splitting shards Imagine a situation where you reach a limit of your current deployment—the number of shards is just not enough. For example, the indexing throughput is lower and lower, because the disks are not able to keep up. Of course, one of the possible solutions is to spread the index across more shards; however, you already have a collection and you want to keep the data and reindexing is not an option, because you don't have the original data. Solr can help you with such situations by allowing splitting shards of already created collections. This article will show you how to do it. Getting ready This article will show you how to create a new SolrCloud cluster. We also assume that ZooKeeper is running on 192.168.1.10 and is listening on port 2181 and that we already have four SolrCloud nodes running as a cluster. How to do it... Let's assume that we already have a SolrCloud cluster up and running and it has one collection called books. So our cloud view (which is available at http://localhost:8983/solr/#/~cloud) looks as follows: We have four nodes and we don't utilize them fully. We can say that these two nodes in which we have our shards are almost fully utilized. What we can do is create a new collection and reindex the data or we can split shards of the already created collection. Let's go with the second option: We start by splitting the first shard. It is as easy as running the following command: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard1' After this, we can split the second shard by running a similar command to the one we just used: curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=books&shard=shard2' Let's take a look at the cluster cloud view now (which is available at http://localhost:8983/solr/#/~cloud): As we can see, both shards were split—shard1 was divided into shard1_0 and shard1_1 and shard2 was divided into shard2_0 and shard2_1. Of course, the data was copied as well, so everything is ready. However, the last step should be to delete the original shards. Solr doesn't delete them, because sometimes applications use shard names to connect to a given shard. However, in our case, we can delete them by running the following commands: curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard1' curl 'http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=books&shard=shard2' Now if we would again look at the cloud view of the cluster, we will see the following: How it works... 
We start with a simple collection called books that is built of two primary shards and no replicas. This is the collection whose shards we will try to divide without stopping Solr.

Splitting shards is very easy. We just need to run a simple command in the Collections API (the /admin/collections endpoint) and specify that we want to split a shard (action=SPLITSHARD). We also need to provide additional information, such as which collection we are interested in (the collection parameter) and which shard we want to split (the shard parameter). You can see the name of the shard by looking at the cloud view or by reading the cluster state from ZooKeeper. After sending the command, Solr might force us to wait for a substantial amount of time—shard splitting takes time, especially on large collections. Of course, we can run the same command for the second shard as well.

Finally, we end up with six shards—four new and two old ones. The original shard will still contain data, but it will start to re-route requests to the newly created shards. The data was split evenly between the new shards. The old shards were left, although they are marked as inactive and they won't have any more data indexed to them. Because we don't need them, we can just delete them using the action=DELETESHARD command sent to the same Collections API. Similar to the split shard command, we need to specify the collection name and the name of the shard we want to delete. After we delete the initial shards, our cluster view shows only four shards, which is what we were aiming at. We can now spread the shards across the cluster.

Summary

In this article, we learned how to set up multiple collections and how to increase the number of collections in a cluster. We also worked through a way to split shards.

Resources for Article:

Further resources on this subject: Tuning Solr JVM and Container [Article] Apache Solr PHP Integration [Article] Administrating Solr [Article]

Taming Big Data using HDInsight

Packt
22 Jan 2015
10 min read
(For more resources related to this topic, see here.) Era of Big Data In this article by Rajesh Nadipalli, the author of HDInsight Essentials Second Edition, we will take a look at the concept of Big Data and how to tame it using HDInsight. We live in a digital era and are always connected with friends and family using social media and smartphones. In 2014, every second, about 5,700 tweets were sent and 800 links were shared using Facebook, and the digital universe was about 1.7 MB per minute for every person on earth (source: IDC 2014 report). This amount of data sharing and storing is unprecedented and is contributing to what is known as Big Data. The following infographic shows you the details of our current use of the top social media sites (source: https://leveragenewagemedia.com/). Another contributor to Big Data are the smart, connected devices such as smartphones, appliances, cars, sensors, and pretty much everything that we use today and is connected to the Internet. These devices, which will soon be in trillions, continuously collect data and communicate with each other about their environment to make intelligent decisions and help us live better. This digitization of the world has added to the exponential growth of Big Data. According to the 2014 IDC digital universe report, the growth trend will continue and double in size every two years. In 2013, about 4.4 zettabytes were created and in 2020, the forecast is 44 zettabytes, which is 44 trillion gigabytes, (source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm). Business value of Big Data While we generated 4.4 zettabytes of data in 2013, only 5 percent of it was actually analyzed, and this is the real opportunity of Big Data. The IDC report forecasts that by 2020, we will analyze over 35 percent of the generated data by making smarter sensors and devices. This data will drive new consumer and business behavior that will drive trillions of dollars in opportunity for IT vendors and organizations analyzing this data. Let's take a look at some real use cases that have benefited from Big Data: IT systems in all major banks are constantly monitoring fraudulent activities and alerting customers within milliseconds. These systems apply complex business rules and analyze the historical data, geography, type of vendor, and other parameters based on the customer to get accurate results. Commercial drones are transforming agriculture by analyzing real-time aerial images and identifying the problem areas. These drones are cheaper and efficient than satellite imagery, as they fly under the clouds and can be used anytime. They identify the irrigation issues related to water, pests, or fungal infections thereby increasing the crop productivity and quality. These drones are equipped with technology to capture high-quality images every second and transfer them to a cloud-hosted Big Data system for further processing (reference: http://www.technologyreview.com/featuredstory/526491/agricultural-drones/). Developers of the blockbuster Halo 4 game were tasked to analyze player preferences and support an online tournament in the cloud. The game attracted over 4 million players in its first five days after its launch. The development team had to also design a solution that kept track of a leader board for the global Halo 4 Infinity challenge, which was open to all the players. The development team chose the Azure HDInsight service to analyze the massive amounts of unstructured data in a distributed manner. 
The results from HDInsight were reported using Microsoft SQL Server PowerPivot and SharePoint, and the business was extremely happy with the response times for their queries, which were a few hours or less (source: http://www.microsoft.com/casestudies/Windows-Azure/343-Industries/343-Industries-Gets-New-User-Insights-from-Big-Data-in-the-Cloud/710000002102).

Hadoop Concepts

Apache Hadoop is the leading open source Big Data platform that can store and analyze massive amounts of structured and unstructured data efficiently and can be hosted on low-cost commodity hardware. There are other technologies that complement Hadoop under the Big Data umbrella, such as MongoDB (a NoSQL database), Cassandra (a document database), and VoltDB (an in-memory database). This section describes Apache Hadoop's core concepts and its ecosystem.

A brief history of Hadoop

Doug Cutting created Hadoop and named it after his kid's stuffed yellow elephant; the name has no real meaning. In 2004, the initial version of Hadoop was launched as the Nutch Distributed Filesystem. In February 2006, the Apache Hadoop project was officially started as a standalone development for MapReduce and HDFS. By 2008, Yahoo had adopted Hadoop as the engine of its web search with a cluster size of around 10,000. In the same year, Hadoop graduated to a top-level Apache project, confirming its success. In 2012, Hadoop 2.x was launched with YARN, enabling Hadoop to take on various types of workloads. Today, Hadoop is known by just about every IT architect and business executive as an open source Big Data platform and is used across all industries and sizes of organizations.

Core components

In this section, we will explore what Hadoop is actually made of. At the basic level, Hadoop consists of four layers:

Hadoop Common: A set of common libraries and utilities used by Hadoop modules.

Hadoop Distributed File System (HDFS): A scalable and fault-tolerant distributed filesystem for data in any form. HDFS can be installed on commodity hardware and replicates data three times (which is configurable) to make the filesystem robust and tolerant of partial hardware failures.

Yet Another Resource Negotiator (YARN): From Hadoop 2.0, YARN is the cluster management layer that handles various workloads on the cluster.

MapReduce: MapReduce is a framework that allows parallel processing of data in Hadoop. MapReduce breaks a job into smaller tasks and distributes the load to servers that have the relevant data. The design model is "move code and not data", making this framework efficient as it reduces the network and disk I/O required to move the data.

These are the components that form the basic Hadoop framework. In the past few years, a vast array of new components has emerged in the Hadoop ecosystem; these take advantage of YARN, making Hadoop faster, better, and suitable for various types of workloads.

Hadoop cluster layout

Each Hadoop cluster has two types of machines, which are as follows:

Master nodes: This includes the HDFS Name Node, the HDFS Secondary Name Node, and the YARN Resource Manager.

Worker nodes: This includes the HDFS Data Nodes and YARN Node Managers. The data nodes and node managers are colocated for optimal data locality and performance.

A network switch interconnects the master and worker nodes.
It is recommended that you have separate servers for each of the master nodes; however, it is possible to deploy all the master nodes onto a single server for development or testing workloads. Let's review the key functions of the master and worker nodes:

Name node: This is the master for the distributed filesystem and maintains the metadata. This metadata holds the listing of all the files and the location of each block of a file, which are stored across the various slaves. Without a name node, HDFS is not accessible. From Hadoop 2.0 onwards, name node HA (High Availability) can be configured with active and standby servers.

Secondary name node: This is an assistant to the name node. It communicates only with the name node to take snapshots of the HDFS metadata at intervals that are configured at the cluster level.

YARN resource manager: This server is a scheduler that allocates the available resources in the cluster among the competing applications.

Worker nodes: The Hadoop cluster will have several worker nodes that handle two types of functions—HDFS Data Node and YARN Node Manager. It is typical that each worker node handles both functions for optimal data locality. This means that processing happens on the data that is local to the node, following the principle "move code and not data".

HDInsight Overview

HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service (PaaS). It is a 100 percent Apache Hadoop-based service in the cloud. HDInsight was developed through the partnership of Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service. The following are the key differentiators for an HDInsight distribution:

Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with the Platform as a Service (PaaS), reducing the operations overhead.

Analytics using Excel: With Excel integration, your business users can visualize and analyze Hadoop data in compelling new ways with an easy-to-use, familiar tool. The Excel add-ons PowerBI, PowerPivot, Power Query, and Power Map integrate with HDInsight.

Develop in your favorite language: HDInsight has powerful programming extensions for languages, including .NET, C#, Java, and more.

Scale using the cloud offering: The Azure HDInsight service enables customers to scale quickly as per the project needs and has a seamless interface between HDFS and Azure Blob storage.

Connect an on-premises Hadoop cluster with the cloud: With HDInsight, you can move Hadoop data from an on-site data center to the Azure cloud for backup, dev/test, and cloud bursting scenarios.

Includes NoSQL transactional capabilities: HDInsight also includes Apache HBase, a columnar NoSQL database that runs on top of Hadoop and allows large online transactional processing (OLTP).

HDInsight Emulator: The HDInsight Emulator tool provides a local development environment for Azure HDInsight without the need for a cloud subscription. This can be installed using the Microsoft Web Platform Installer.
Summary

We live in a connected digital era and are witnessing unprecedented growth of data. Organizations that are able to analyze Big Data are demonstrating significant return on investment by detecting fraud, improving operations, and reducing the time to analyze data with a scale-out architecture. Apache Hadoop is the leading open source Big Data platform, with strong and diverse ecosystem projects that enable organizations to build a modern data architecture. At its core, Hadoop has two key components: the Hadoop Distributed File System, also known as HDFS, and a cluster resource manager known as YARN. YARN has enabled Hadoop to be a true multi-use data platform that can handle batch processing, real-time streaming, interactive SQL, and more.

Microsoft HDInsight is an enterprise-ready distribution of Hadoop on the cloud that has been developed through the partnership of Hortonworks and Microsoft. The key benefits of HDInsight include scaling up/down as required, analysis using Excel, connecting an on-premises Hadoop cluster with the cloud, and flexible programming and support for NoSQL transactional databases.

Resources for Article:

Further resources on this subject: Hadoop and HDInsight in a Heartbeat [article] Sizing and Configuring your Hadoop Cluster [article] Introducing Kafka [article]

Highcharts Configurations

Packt
21 Jan 2015
53 min read
This article is written by Joe Kuan, the author of Learning Highcharts 4. All Highcharts graphs share the same configuration structure and it is crucial for us to become familiar with the core components. However, it is not possible to go through all the configurations within the book. In this article, we will explore the functional properties that are most used and demonstrate them with examples. We will learn how Highcharts manages layout, and then explore how to configure axes, specify single series and multiple series data, followed by looking at formatting and styling tool tips in both JavaScript and HTML. After that, we will get to know how to polish our charts with various types of animations and apply color gradients. Finally, we will explore the drilldown interactive feature. In this article, we will cover the following topics: Understanding Highcharts layout Framing the chart with axes (For more resources related to this topic, see here.) Configuration structure In the Highcharts configuration object, the components at the top level represent the skeleton structure of a chart. The following is a list of the major components that are covered in this article: chart: This has configurations for the top-level chart properties such as layouts, dimensions, events, animations, and user interactions series: This is an array of series objects (consisting of data and specific options) for single and multiple series, where the series data can be specified in a number of ways xAxis/yAxis/zAxis: This has configurations for all the axis properties such as labels, styles, range, intervals, plotlines, plot bands, and backgrounds tooltip: This has the layout and format style configurations for the series data tool tips drilldown: This has configurations for drilldown series and the ID field associated with the main series title/subtitle: This has the layout and style configurations for the chart title and subtitle legend: This has the layout and format style configurations for the chart legend plotOptions: This contains all the plotting options, such as display, animation, and user interactions, for common series and specific series types exporting: This has configurations that control the layout and the function of print and export features For reference information concerning all configurations, go to http://api.highcharts.com. Understanding Highcharts' layout Before we start to learn how Highcharts layout works, it is imperative that we understand some basic concepts first. First, set a border around the plot area. To do that we can set the options of plotBorderWidth and plotBorderColor in the chart section, as follows:         chart: {                renderTo: 'container',                type: 'spline',                plotBorderWidth: 1,                plotBorderColor: '#3F4044'        }, The second border is set around the Highcharts container. Next, we extend the preceding chart section with additional settings:         chart: {                renderTo: 'container',                ....                borderColor: '#a1a1a1',                borderWidth: 2,                borderRadius: 3        }, This sets the container border color with a width of 2 pixels and corner radius of 3 pixels. As we can see, there is a border around the container and this is the boundary that the Highcharts display cannot exceed: By default, Highcharts displays have three different areas: spacing, labeling, and plot area. The plot area is the area inside the inner rectangle that contains all the plot graphics. 
The labeling area is the area where labels such as title, subtitle, axis title, legend, and credits go, around the plot area, so that it is between the edge of the plot area and the inner edge of the spacing area. The spacing area is the area between the container border and the outer edge of the labeling area. The following screenshot shows three different kinds of areas. A gray dotted line is inserted to illustrate the boundary between the spacing and labeling areas. Each chart label position can be operated in one of the following two layouts: Automatic layout: Highcharts automatically adjusts the plot area size based on the labels' positions in the labeling area, so the plot area does not overlap with the label element at all. Automatic layout is the simplest way to configure, but has less control. This is the default way of positioning the chart elements. Fixed layout: There is no concept of labeling area. The chart label is specified in a fixed location so that it has a floating effect on the plot area. In other words, the plot area side does not automatically adjust itself to the adjacent label position. This gives the user full control of exactly how to display the chart. The spacing area controls the offset of the Highcharts display on each side. As long as the chart margins are not defined, increasing or decreasing the spacing area has a global effect on the plot area measurements in both automatic and fixed layouts. Chart margins and spacing settings In this section, we will see how chart margins and spacing settings have an effect on the overall layout. Chart margins can be configured with the properties margin, marginTop, marginLeft, marginRight, and marginBottom, and they are not enabled by default. Setting chart margins has a global effect on the plot area, so that none of the label positions or chart spacing configurations can affect the plot area size. Hence, all the chart elements are in a fixed layout mode with respect to the plot area. The margin option is an array of four margin values covered for each direction, the same as in CSS, starting from north and going clockwise. Also, the margin option has a lower precedence than any of the directional margin options, regardless of their order in the chart section. Spacing configurations are enabled by default with a fixed value on each side. These can be configured in the chart section with the property names spacing, spacingTop, spacingLeft, spacingBottom, and spacingRight. In this example, we are going to increase or decrease the margin or spacing property on each side of the chart and observe the effect. The following are the chart settings:             chart: {                renderTo: 'container',                type: ...                marginTop: 10,                marginRight: 0,                spacingLeft: 30,                spacingBottom: 0            }, The following screenshot shows what the chart looks like: The marginTop property fixes the plot area's top border 10 pixels away from the container border. It also changes the top border into fixed layout for any label elements, so the chart title and subtitle float on top of the plot area. The spacingLeft property increases the spacing area on the left-hand side, so it pushes the y axis title further in. As it is in automatic layout (without declaring marginLeft), it also pushes the plot area's west border in. Setting marginRight to 0 will override all the default spacing on the chart's right-hand side and change it to fixed layout mode. 
Finally, setting spacingBottom to 0 makes the legend touch the lower bar of the container, so it also stretches the plot area downwards. This is because the bottom edge is still in automatic layout even though spacingBottom is set to 0. Chart label properties Chart labels such as xAxis.title, yAxis.title, legend, title, subtitle, and credits share common property names, as follows: align: This is for the horizontal alignment of the label. Possible keywords are 'left', 'center', and 'right'. As for the axis title, it is 'low', 'middle', and 'high'. floating: This is to give the label position a floating effect on the plot area. Setting this to true will cause the label position to have no effect on the adjacent plot area's boundary. margin: This is the margin setting between the label and the side of the plot area adjacent to it. Only certain label types have this setting. verticalAlign: This is for the vertical alignment of the label. The keywords are 'top', 'middle', and 'bottom'. x: This is for horizontal positioning in relation to alignment. y: This is for vertical positioning in relation to alignment. As for the labels' x and y positioning, they are not used for absolute positioning within the chart. They are designed for fine adjustment with the label alignment. The following diagram shows the coordinate directions, where the center represents the label location: We can experiment with these properties with a simple example of the align and y position settings, by placing both title and subtitle next to each other. The title is shifted to the left with align set to 'left', whereas the subtitle alignment is set to 'right'. In order to make both titles appear on the same line, we change the subtitle's y position to 15, which is the same as the title's default y value:  title: {     text: 'Web browsers ...',     align: 'left' }, subtitle: {     text: 'From 2008 to present',     align: 'right',     y: 15 }, The following is a screenshot showing both titles aligned on the same line: In the following subsections, we will experiment with how changes in alignment for each label element affect the layout behavior of the plot area. Title and subtitle alignments Title and subtitle have the same layout properties, and the only differences are that the default values and title have the margin setting. Specifying verticalAlign for any value changes from the default automatic layout to fixed layout (it internally switches floating to true). However, manually setting the subtitle's floating property to false does not switch back to automatic layout. The following is an example of title in automatic layout and subtitle in fixed layout:     title: {       text: 'Web browsers statistics'    },    subtitle: {       text: 'From 2008 to present',       verticalAlign: 'top',       y: 60       }, The verticalAlign property for the subtitle is set to 'top', which switches the layout into fixed layout, and the y offset is increased to 60. The y offset pushes the subtitle's position further down. Due to the fact that the plot area is not in an automatic layout relationship to the subtitle anymore, the top border of the plot area goes above the subtitle. However, the plot area is still in automatic layout towards the title, so the title is still above the plot area: Legend alignment Legends show different behavior for the verticalAlign and align properties. Apart from setting the alignment to 'center', all other settings in verticalAlign and align remain in automatic positioning. 
The following is an example of a legend located on the right-hand side of the chart. The verticalAlign property is switched to the middle of the chart, where the horizontal align is set to 'right':           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical'          }, The layout property is assigned to 'vertical' so that it causes the items inside the legend box to be displayed in a vertical manner. As we can see, the plot area is automatically resized for the legend box: Note that the border decoration around the legend box is disabled in the newer version. To display a round border around the legend box, we can add the borderWidth and borderRadius options using the following:           legend: {                align: 'right',                verticalAlign: 'middle',                layout: 'vertical',                borderWidth: 1,                borderRadius: 3          }, Here is the legend box with a round corner border: Axis title alignment Axis titles do not use verticalAlign. Instead, they use the align setting, which is either 'low', 'middle', or 'high'. The title's margin value is the distance between the axis title and the axis line. The following is an example of showing the y-axis title rotated horizontally instead of vertically (which it is by default) and displayed on the top of the axis line instead of next to it. We also use the y property to fine-tune the title location:             yAxis: {                title: {                    text: 'Percentage %',                    rotation: 0,                    y: -15,                    margin: -70,                    align: 'high'                },                min: 0            }, The following is a screenshot of the upper-left corner of the chart showing that the title is aligned horizontally at the top of the y axis. Alternatively, we can use the offset option instead of margin to achieve the same result. Credits alignment Credits is a bit different from other label elements. It only supports the align, verticalAlign, x, and y properties in the credits.position property (shorthand for credits: { position: … }), and is also not affected by any spacing setting. Suppose we have a graph without a legend and we have to move the credits to the lower-left area of the chart, the following code snippet shows how to do it:             legend: {                enabled: false            },            credits: {                position: {                   align: 'left'                },                text: 'Joe Kuan',                href: 'http://joekuan.wordpress.com'            }, However, the credits text is off the edge of the chart, as shown in the following screenshot: Even if we move the credits label to the right with x positioning, the label is still a bit too close to the x axis interval label. We can introduce extra spacingBottom to put a gap between both labels, as follows:             chart: {                   spacingBottom: 30,                    ....            },            credits: {                position: {                   align: 'left',                   x: 20,                   y: -7                },            },            .... The following is a screenshot of the credits with the final adjustments: Experimenting with an automatic layout In this section, we will examine the automatic layout feature in more detail. 
For the sake of simplifying the example, we will start with only the chart title and without any chart spacing settings:      chart: {         renderTo: 'container',         // border and plotBorder settings         borderWidth: 2,         .....     },     title: {            text: 'Web browsers statistics',     }, From the preceding example, the chart title should appear as expected between the container and the plot area's borders: The space between the title and the top border of the container has the default setting spacingTop for the spacing area (a default value of 10-pixels high). The gap between the title and the top border of the plot area is the default setting for title.margin, which is 15-pixels high. By setting spacingTop in the chart section to 0, the chart title moves up next to the container top border. Hence the size of the plot area is automatically expanded upwards, as follows: Then, we set title.margin to 0; the plot area border moves further up, hence the height of the plot area increases further, as follows: As you may notice, there is still a gap of a few pixels between the top border and the chart title. This is actually due to the default value of the title's y position setting, which is 15 pixels, large enough for the default title font size. The following is the chart configuration for setting all the spaces between the container and the plot area to 0: chart: {     renderTo: 'container',     // border and plotBorder settings     .....     spacingTop: 0},title: {     text: null,     margin: 0,     y: 0} If we set title.y to 0, all the gap between the top edge of the plot area and the top container edge closes up. The following is the final screenshot of the upper-left corner of the chart, to show the effect. The chart title is not visible anymore as it has been shifted above the container: Interestingly, if we work backwards to the first example, the default distance between the top of the plot area and the top of the container is calculated as: spacingTop + title.margin + title.y = 10 + 15 + 15 = 40 Therefore, changing any of these three variables will automatically adjust the plot area from the top container bar. Each of these offset variables actually has its own purpose in the automatic layout. Spacing is for the gap between the container and the chart content; thus, if we want to display a chart nicely spaced with other elements on a web page, spacing elements should be used. Equally, if we want to use a specific font size for the label elements, we should consider adjusting the y offset. Hence, the labels are still maintained at a distance and do not interfere with other components in the chart. Experimenting with a fixed layout In the preceding section, we have learned how the plot area dynamically adjusted itself. In this section, we will see how we can manually position the chart labels. First, we will start with the example code from the beginning of the Experimenting with automatic layout section and set the chart title's verticalAlign to 'bottom', as follows: chart: {    renderTo: 'container',    // border and plotBorder settings    .....},title: {    text: 'Web browsers statistics',    verticalAlign: 'bottom'}, The chart title is moved to the bottom of the chart, next to the lower border of the container. 
Notice that this setting has changed the title into floating mode; more importantly, the legend still remains in the default automatic layout of the plot area: Be aware that we haven't specified spacingBottom, which has a default value of 15 pixels in height when applied to the chart. This means that there should be a gap between the title and the container bottom border, but none is shown. This is because the title.y position has a default value of 15 pixels in relation to spacing. According to the diagram in the Chart label properties section, this positive y value pushes the title towards the bottom border; this compensates for the space created by spacingBottom. Let's make a bigger change to the y offset position this time to show that verticalAlign is floating on top of the plot area:  title: {     text: 'Web browsers statistics',     verticalAlign: 'bottom',     y: -90 }, The negative y value moves the title up, as shown here: Now the title is overlapping the plot area. To demonstrate that the legend is still in automatic layout with regard to the plot area, here we change the legend's y position and the margin settings, which is the distance from the axis label:                legend: {                   margin: 70,                   y: -10               }, This has pushed up the bottom side of the plot area. However, the chart title still remains in fixed layout and its position within the chart hasn't been changed at all after applying the new legend setting, as shown in the following screenshot: By now, we should have a better understanding of how to position label elements, and their layout policy relating to the plot area. Framing the chart with axes In this section, we are going to look into the configuration of axes in Highcharts in terms of their functional area. We will start off with a plain line graph and gradually apply more options to the chart to demonstrate the effects. Accessing the axis data type There are two ways to specify data for a chart: categories and series data. For displaying intervals with specific names, we should use the categories field that expects an array of strings. Each entry in the categories array is then associated with the series data array. Alternatively, the axis interval values are embedded inside the series data array. Then, Highcharts extracts the series data for both axes, interprets the data type, and formats and labels the values appropriately. 
The following is a straightforward example showing the use of categories:     chart: {        renderTo: 'container',        height: 250,        spacingRight: 20    },    title: {        text: 'Market Data: Nasdaq 100'    },    subtitle: {        text: 'May 11, 2012'    },    xAxis: {        categories: [ '9:30 am', '10:00 am', '10:30 am',                       '11:00 am', '11:30 am', '12:00 pm',                       '12:30 pm', '1:00 pm', '1:30 pm',                       '2:00 pm', '2:30 pm', '3:00 pm',                       '3:30 pm', '4:00 pm' ],         labels: {             step: 3         }     },     yAxis: {         title: {             text: null         }     },     legend: {         enabled: false     },     credits: {         enabled: false     },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ 2606.01, 2622.08, 2636.03, 2637.78, 2639.15,                 2637.09, 2633.38, 2632.23, 2632.33, 2632.59,                 2630.34, 2626.89, 2624.59, 2615.98 ]     }] The preceding code snippet produces a graph that looks like the following screenshot: The first name in the categories field corresponds to the first value, 9:30 am, 2606.01, in the series data array, and so on. Alternatively, we can specify the time values inside the series data and use the type property of the x axis to format the time. The type property supports 'linear' (default), 'logarithmic', or 'datetime'. The 'datetime' setting automatically interprets the time in the series data into human-readable form. Moreover, we can use the dateTimeLabelFormats property to predefine the custom format for the time unit. The option can also accept multiple time unit formats. This is for when we don't know in advance how long the time span is in the series data, so each unit in the resulting graph can be per hour, per day, and so on. The following example shows how the graph is specified with predefined hourly and minute formats. The syntax of the format string is based on the PHP strftime function:     xAxis: {         type: 'datetime',          // Format 24 hour time to AM/PM          dateTimeLabelFormats: {                hour: '%I:%M %P',              minute: '%I %M'          }               },     series: [{         name: 'Nasdaq',         color: '#4572A7',         data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                  [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                   [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                  .....                ]     }] Note that the x axis is in the 12-hour time format, as shown in the following screenshot: Instead, we can define the format handler for the xAxis.labels.formatter property to achieve a similar effect. Highcharts provides a utility routine, Highcharts.dateFormat, that converts the timestamp in milliseconds to a readable format. In the following code snippet, we define the formatter function using dateFormat and this.value. The keyword this is the axis's interval object, whereas this.value is the UTC time value for the instance of the interval:     xAxis: {         type: 'datetime',         labels: {             formatter: function() {                 return Highcharts.dateFormat('%I:%M %P', this.value);             }         }     }, Since the time values of our data points are in fixed intervals, they can also be arranged in a cut-down version. 
All we need is to define the starting point of time, pointStart, and the regular interval between them, pointInterval, in milliseconds: series: [{     name: 'Nasdaq',     color: '#4572A7',     pointStart: Date.UTC(2012, 4, 11, 9, 30),     pointInterval: 30 * 60 * 1000,     data: [ 2606.01, 2622.08, 2636.03, 2637.78,             2639.15, 2637.09, 2633.38, 2632.23,             2632.33, 2632.59, 2630.34, 2626.89,             2624.59, 2615.98 ] }] Adjusting intervals and background We have learned how to use axis categories and series data arrays in the last section. In this section, we will see how to format interval lines and the background style to produce a graph with more clarity. We will continue from the previous example. First, let's create some interval lines along the y axis. In the chart, the interval is automatically set to 20. However, it would be clearer to double the number of interval lines. To do that, simply assign the tickInterval value to 10. Then, we use minorTickInterval to put another line in between the intervals to indicate a semi-interval. In order to distinguish between interval and semi-interval lines, we set the semi-interval lines, minorGridLineDashStyle, to a dashed and dotted style. There are nearly a dozen line style settings available in Highcharts, from 'Solid' to 'LongDashDotDot'. Readers can refer to the online manual for possible values. The following is the first step to create the new settings:             yAxis: {                 title: {                     text: null                 },                 tickInterval: 10,                 minorTickInterval: 5,                 minorGridLineColor: '#ADADAD',                 minorGridLineDashStyle: 'dashdot'            } The interval lines should look like the following screenshot: To make the graph even more presentable, we add a striping effect with shading using alternateGridColor. Then, we change the interval line color, gridLineColor, to a similar range with the stripes. The following code snippet is added into the yAxis configuration:                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                   } The following is the graph with the new shading background: The next step is to apply a more professional look to the y axis line. We are going to draw a line on the y axis with the lineWidth property, and add some measurement marks along the interval lines with the following code snippet:                  lineWidth: 2,                  lineColor: '#92A8CD',                  tickWidth: 3,                  tickLength: 6,                  tickColor: '#92A8CD',                  minorTickLength: 3,                  minorTickWidth: 1,                  minorTickColor: '#D8D8D8' The tickWidth and tickLength properties add the effect of little marks at the start of each interval line. We apply the same color on both the interval mark and the axis line. Then we add the ticks minorTickLength and minorTickWidth into the semi-interval lines in a smaller size. 
This gives a nice measurement mark effect along the axis, as shown in the following screenshot: Now, we apply a similar polish to the xAxis configuration, as follows:            xAxis: {                type: 'datetime',                labels: {                    formatter: function() {                        return Highcharts.dateFormat('%I:%M %P', this.value);                    },                },                gridLineDashStyle: 'dot',                gridLineWidth: 1,                tickInterval: 60 * 60 * 1000,                lineWidth: 2,                lineColor: '#92A8CD',                tickWidth: 3,                tickLength: 6,                tickColor: '#92A8CD',            }, We set the x axis interval lines to the hourly format and switch the line style to a dotted line. Then, we apply the same color, thickness, and interval ticks as on the y axis. The following is the resulting screenshot: However, there are some defects along the x axis line. To begin with, the meeting point between the x axis and y axis lines does not align properly. Secondly, the interval labels at the x axis are touching the interval ticks. Finally, part of the first data point is covered by the y-axis line. The following is an enlarged screenshot showing the issues: There are two ways to resolve the axis line alignment problem, as follows: Shift the plot area 1 pixel away from the x axis. This can be achieved by setting the offset property of xAxis to 1. Increase the x-axis line width to 3 pixels, which is the same width as the y-axis tick interval. As for the x-axis label, we can simply solve the problem by introducing the y offset value into the labels setting. Finally, to avoid the first data point touching the y-axis line, we can impose minPadding on the x axis. What this does is to add padding space at the minimum value of the axis, the first point. The minPadding value is based on the ratio of the graph width. In this case, setting the property to 0.02 is equivalent to shifting along the x axis 5 pixels to the right (250 px * 0.02). The following are the additional settings to improve the chart:     xAxis: {         ....         labels: {                formatter: ...,                y: 17         },         .....         minPadding: 0.02,         offset: 1     } The following screenshot shows that the issues have been addressed: As we can see, Highcharts has a comprehensive set of configurable variables with great flexibility. Using plot lines and plot bands In this section, we are going to see how we can use Highcharts to place lines or bands along the axis. We will continue with the example from the previous section. Let's draw a couple of lines to indicate the day's highest and lowest index points on the y axis. The plotLines field accepts an array of object configurations for each plot line. There are no width and color default values for plotLines, so we need to specify them explicitly in order to see the line. The following is the code snippet for the plot lines:       yAxis: {               ... 
,               plotLines: [{                    value: 2606.01,                    width: 2,                    color: '#821740',                    label: {                        text: 'Lowest: 2606.01',                        style: {                            color: '#898989'                        }                    }               }, {                    value: 2639.15,                    width: 2,                    color: '#4A9338',                    label: {                        text: 'Highest: 2639.15',                        style: {                            color: '#898989'                        }                    }               }]         } The following screenshot shows what it should look like: We can improve the look of the chart slightly. First, the text label for the top plot line should not be next to the highest point. Second, the label for the bottom line should be remotely covered by the series and interval lines, as follows: To resolve these issues, we can assign the plot line's zIndex to 1, which brings the text label above the interval lines. We also set the x position of the label to shift the text next to the point. The following are the new changes:              plotLines: [{                    ... ,                    label: {                        ... ,                        x: 25                    },                    zIndex: 1                    }, {                    ... ,                    label: {                        ... ,                        x: 130                    },                    zIndex: 1               }] The following graph shows the label has been moved away from the plot line and over the interval line: Now, we are going to change the preceding example with a plot band area that shows the index change between the market's opening and closing values. The plot band configuration is very similar to plot lines, except that it uses the to and from properties, and the color property accepts gradient settings or color code. We create a plot band with a triangle text symbol and values to signify a positive close. Instead of using the x and y properties to fine-tune label position, we use the align option to adjust the text to the center of the plot area (replace the plotLines setting from the above example):               plotBands: [{                    from: 2606.01,                    to: 2615.98,                    label: {                        text: '▲ 9.97 (0.38%)',                        align: 'center',                        style: {                            color: '#007A3D'                        }                    },                    zIndex: 1,                    color: {                        linearGradient: {                            x1: 0, y1: 1,                            x2: 1, y2: 1                        },                        stops: [ [0, '#EBFAEB' ],                                 [0.5, '#C2F0C2'] ,                                 [0.8, '#ADEBAD'] ,                                 [1, '#99E699']                        ]                    }               }] The triangle is an alt-code character; hold down the left Alt key and enter 30 in the number keypad. See http://www.alt-codes.net for more details. This produces a chart with a green plot band highlighting a positive close in the market, as shown in the following screenshot: Extending to multiple axes Previously, we ran through most of the axis configurations. 
Here, we explore how we can use multiple axes, which are just an array of objects containing axis configurations. Continuing from the previous stock market example, suppose we now want to include another market index, Dow Jones, along with Nasdaq. However, both indices are different in nature, so their value ranges are vastly different. First, let's examine the outcome by displaying both indices with the common y axis. We change the title, remove the fixed interval setting on the y axis, and include data for another series:             chart: ... ,             title: {                 text: 'Market Data: Nasdaq & Dow Jones'             },             subtitle: ... ,             xAxis: ... ,             credits: ... ,             yAxis: {                 title: {                     text: null                 },                 minorGridLineColor: '#D8D8D8',                 minorGridLineDashStyle: 'dashdot',                 gridLineColor: '#8AB8E6',                 alternateGridColor: {                     linearGradient: {                         x1: 0, y1: 1,                         x2: 1, y2: 1                     },                     stops: [ [0, '#FAFCFF' ],                              [0.5, '#F5FAFF'] ,                              [0.8, '#E0F0FF'] ,                              [1, '#D6EBFF'] ]                 },                 lineWidth: 2,                 lineColor: '#92A8CD',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#92A8CD',                 minorTickLength: 3,                 minorTickWidth: 1,                 minorTickColor: '#D8D8D8'             },             series: [{               name: 'Nasdaq',               color: '#4572A7',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 2606.01 ],                          [ Date.UTC(2012, 4, 11, 10), 2622.08 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 2636.03 ],                          ...                        ]             }, {               name: 'Dow Jones',               color: '#AA4643',               data: [ [ Date.UTC(2012, 4, 11, 9, 30), 12598.32 ],                          [ Date.UTC(2012, 4, 11, 10), 12538.61 ],                           [ Date.UTC(2012, 4, 11, 10, 30), 12549.89 ],                          ...                        ]             }] The following is the chart showing both market indices: As expected, the index changes that occur during the day have been normalized by the vast differences in value. Both lines look roughly straight, which falsely implies that the indices have hardly changed. Let us now explore putting both indices onto separate y axes. We should remove any background decoration on the y axis, because we now have a different range of data shared on the same background. The following is the new setup for yAxis:            yAxis: [{                  title: {                     text: 'Nasdaq'                 },               }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true             }], Now yAxis is an array of axis configurations. The first entry in the array is for Nasdaq and the second is for Dow Jones. This time, we display the axis title to distinguish between them. The opposite property is to put the Dow Jones y axis onto the other side of the graph for clarity. Otherwise, both y axes appear on the left-hand side. 
The next step is to align indices from the y-axis array to the series data array, as follows:             series: [{                 name: 'Nasdaq',                 color: '#4572A7',                 yAxis: 0,                 data: [ ... ]             }, {                 name: 'Dow Jones',                 color: '#AA4643',                 yAxis: 1,                 data: [ ... ]             }]          We can clearly see the movement of the indices in the new graph, as follows: Moreover, we can improve the final view by color-matching the series to the axis lines. The Highcharts.getOptions().colors property contains a list of default colors for the series, so we use the first two entries for our indices. Another improvement is to set maxPadding for the x axis, because the new y-axis line covers parts of the data points at the high end of the x axis:             xAxis: {                 ... ,                 minPadding: 0.02,                 maxPadding: 0.02                 },             yAxis: [{                 title: {                     text: 'Nasdaq'                 },                 lineWidth: 2,                 lineColor: '#4572A7',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#4572A7'             }, {                 title: {                     text: 'Dow Jones'                 },                 opposite: true,                 lineWidth: 2,                 lineColor: '#AA4643',                 tickWidth: 3,                 tickLength: 6,                 tickColor: '#AA4643'             }], The following screenshot shows the improved look of the chart: We can extend the preceding example and have more than a couple of axes, simply by adding entries into the yAxis and series arrays, and mapping both together. The following screenshot shows a 4-axis line graph: Summary In this article, major configuration components were discussed and experimented with, and examples shown. By now, we should be comfortable with what we have covered already and ready to plot some of the basic graphs with more elaborate styles. Resources for Article: Further resources on this subject: Theming with Highcharts [article] Integrating with other Frameworks [article] Highcharts [article]

Evolution of Hadoop

Packt
29 Dec 2014
12 min read
 In this article by Sandeep Karanth, author of the book Mastering Hadoop, we will see about the Hadoop's timeline, Hadoop 2.X and Hadoop YARN. Hadoop's timeline The following figure gives a timeline view of the major releases and milestones of Apache Hadoop. The project has been there for 8 years, but the last 4 years has seen Hadoop make giant strides in big data processing. In January 2010, Google was awarded a patent for the MapReduce technology. This technology was licensed to the Apache Software Foundation 4 months later, a shot in the arm for Hadoop. With legal complications out of the way, enterprises—small, medium, and large—were ready to embrace Hadoop. Since then, Hadoop has come up with a number of major enhancements and releases. It has given rise to businesses selling Hadoop distributions, support, training, and other services. Hadoop 1.0 releases, referred to as 1.X in this book, saw the inception and evolution of Hadoop as a pure MapReduce job-processing framework. It has exceeded its expectations with a wide adoption of massive data processing. The stable 1.X release at this point of time is 1.2.1, which includes features such as append and security. Hadoop 1.X tried to stay flexible by making changes, such as HDFS append, to support online systems such as HBase. Meanwhile, big data applications evolved in range beyond MapReduce computation models. The flexibility of Hadoop 1.X releases had been stretched; it was no longer possible to widen its net to cater to the variety of applications without architectural changes. Hadoop 2.0 releases, referred to as 2.X in this book, came into existence in 2013. This release family has major changes to widen the range of applications Hadoop can solve. These releases can even increase efficiencies and mileage derived from existing Hadoop clusters in enterprises. Clearly, Hadoop is moving fast beyond MapReduce to stay as the leader in massive scale data processing with the challenge of being backward compatible. It is becoming a generic cluster-computing and storage platform from being only a MapReduce-specific framework. Hadoop 2.X The extensive success of Hadoop 1.X in organizations also led to the understanding of its limitations, which are as follows: Hadoop gives unprecedented access to cluster computational resources to every individual in an organization. The MapReduce programming model is simple and supports a develop once deploy at any scale paradigm. This leads to users exploiting Hadoop for data processing jobs where MapReduce is not a good fit, for example, web servers being deployed in long-running map jobs. MapReduce is not known to be affable for iterative algorithms. Hacks were developed to make Hadoop run iterative algorithms. These hacks posed severe challenges to cluster resource utilization and capacity planning. Hadoop 1.X has a centralized job flow control. Centralized systems are hard to scale as they are the single point of load lifting. JobTracker failure means that all the jobs in the system have to be restarted, exerting extreme pressure on a centralized component. Integration of Hadoop with other kinds of clusters is difficult with this model. The early releases in Hadoop 1.X had a single NameNode that stored all the metadata about the HDFS directories and files. The data on the entire cluster hinged on this single point of failure. Subsequent releases had a cold standby in the form of a secondary NameNode. 
The secondary NameNode merged the edit logs and NameNode image files, periodically bringing in two benefits. One, the primary NameNode startup time was reduced as the NameNode did not have to do the entire merge on startup. Two, the secondary NameNode acted as a replica that could minimize data loss on NameNode disasters. However, the secondary NameNode (secondary NameNode is not a backup node for NameNode) was still not a hot standby, leading to high failover and recovery times and affecting cluster availability. Hadoop 1.X is mainly a Unix-based massive data processing framework. Native support on machines running Microsoft Windows Server is not possible. With Microsoft entering cloud computing and big data analytics in a big way, coupled with existing heavy Windows Server investments in the industry, it's very important for Hadoop to enter the Microsoft Windows landscape as well. Hadoop's success comes mainly from enterprise play. Adoption of Hadoop mainly comes from the availability of enterprise features. Though Hadoop 1.X tries to support some of them, such as security, there is a list of other features that are badly needed by the enterprise. Yet Another Resource Negotiator (YARN) In Hadoop 1.X, resource allocation and job execution were the responsibilities of JobTracker. Since the computing model was closely tied to the resources in the cluster, MapReduce was the only supported model. This tight coupling led to developers force-fitting other paradigms, leading to unintended use of MapReduce. The primary goal of YARN is to separate concerns relating to resource management and application execution. By separating these functions, other application paradigms can be added onboard a Hadoop computing cluster. Improvements in interoperability and support for diverse applications lead to efficient and effective utilization of resources. It integrates well with the existing infrastructure in an enterprise. Achieving loose coupling between resource management and job management should not be at the cost of loss in backward compatibility. For almost 6 years, Hadoop has been the leading software to crunch massive datasets in a parallel and distributed fashion. This means huge investments in development; testing and deployment were already in place. YARN maintains backward compatibility with Hadoop 1.X (hadoop-0.20.205+) APIs. An older MapReduce program can continue execution in YARN with no code changes. However, recompiling the older code is mandatory. Architecture overview The following figure lays out the architecture of YARN. YARN abstracts out resource management functions to a platform layer called ResourceManager (RM). There is a per-cluster RM that primarily keeps track of cluster resource usage and activity. It is also responsible for allocation of resources and resolving contentions among resource seekers in the cluster. RM uses a generalized resource model and is agnostic to application-specific resource needs. For example, RM need not know the resources corresponding to a single Map or Reduce slot. Planning and executing a single job is the responsibility of ApplicationMaster (AM). There is an AM instance per running application. For example, there is an AM for each MapReduce job. It has to request for resources from the RM, use them to execute the job, and work around failures, if any. The general cluster layout has RM running as a daemon on a dedicated machine with a global view of the cluster and its resources. 
Being a global entity, RM can ensure fairness depending on the resource utilization of the cluster resources. When requested for resources, RM allocates them dynamically as a node-specific bundle called a container. For example, 2 CPUs and 4 GB of RAM on a particular node can be specified as a container. Every node in the cluster runs a daemon called NodeManager (NM). RM uses NM as its node local assistant. NMs are used for container management functions, such as starting and releasing containers, tracking local resource usage, and fault reporting. NMs send heartbeats to RM. The RM view of the system is the aggregate of the views reported by each NM. Jobs are submitted directly to RMs. Based on resource availability, jobs are scheduled to run by RMs. The metadata of the jobs are stored in persistent storage to recover from RM crashes. When a job is scheduled, RM allocates a container for the AM of the job on a node in the cluster. AM then takes over orchestrating the specifics of the job. These specifics include requesting resources, managing task execution, optimizations, and handling tasks or job failures. AM can be written in any language, and different versions of AM can execute independently on a cluster. An AM resource request contains specifications about the locality and the kind of resource expected by it. RM puts in its best effort to satisfy AM's needs based on policies and availability of resources. When a container is available for use by AM, it can launch application-specific code in this container. The container is free to communicate with its AM. RM is agnostic to this communication. Storage layer enhancements A number of storage layer enhancements were undertaken in the Hadoop 2.X releases. The number one goal of the enhancements was to make Hadoop enterprise ready. High availability NameNode is a directory service for Hadoop and contains metadata pertaining to the files within cluster storage. Hadoop 1.X had a secondary Namenode, a cold standby that needed minutes to come up. Hadoop 2.X provides features to have a hot standby of NameNode. On the failure of an active NameNode, the standby can become the active Namenode in a matter of minutes. There is no data loss or loss of NameNode service availability. With hot standbys, automated failover becomes easier too. The key to keep the standby in a hot state is to keep its data as current as possible with respect to the active Namenode. This is achieved by reading the edit logs of the active NameNode and applying it onto itself with very low latency. The sharing of edit logs can be done using the following two methods: A shared NFS storage directory between the active and standby NameNodes: the active writes the logs to the shared location. The standby monitors the shared directory and pulls in the changes. A quorum of Journal Nodes: the active NameNode presents its edits to a subset of journal daemons that record this information. The standby node constantly monitors these journal daemons for updates and syncs the state with itself. The following figure shows the high availability architecture using a quorum of Journal Nodes. The data nodes themselves send block reports directly to both the active and standby NameNodes: Zookeeper or any other High Availability monitoring service can be used to track NameNode failures. With the assistance of Zookeeper, failover procedures to promote the hot standby as the active NameNode can be triggered. 
HDFS Federation Similar to what YARN did to Hadoop's computation layer, a more generalized storage model has been implemented in Hadoop 2.X. The block storage layer has been generalized and separated out from the filesystem layer. This separation has given an opening for other storage services to be integrated into a Hadoop cluster. Previously, HDFS and the block storage layer were tightly coupled. One use case that has come forth from this generalized storage model is HDFS Federation. Federation allows multiple HDFS namespaces to use the same underlying storage. Federated NameNodes provide isolation at the filesystem level. HDFS snapshots Snapshots are point-in-time, read-only images of the entire or a particular subset of a filesystem. Snapshots are taken for three general reasons: Protection against user errors Backup Disaster recovery Snapshotting is implemented only on NameNode. It does not involve copying data from the data nodes. It is a persistent copy of the block list and file size. The process of taking a snapshot is almost instantaneous and does not affect the performance of NameNode. Other enhancements There are a number of other enhancements in Hadoop 2.X, which are as follows: The wire protocol for RPCs within Hadoop is now based on Protocol Buffers. Previously, Java serialization via Writables was used. This improvement not only eases maintaining backward compatibility, but also aids in rolling the upgrades of different cluster components. RPCs allow for client-side retries as well. HDFS in Hadoop 1.X was agnostic about the type of storage being used. Mechanical or SSD drives were treated uniformly. The user did not have any control on data placement. Hadoop 2.X releases in 2014 are aware of the type of storage and expose this information to applications as well. Applications can use this to optimize their data fetch and placement strategies. HDFS append support has been brought into Hadoop 2.X. HDFS access in Hadoop 1.X releases has been through HDFS clients. In Hadoop 2.X, support for NFSv3 has been brought into the NFS gateway component. Clients can now mount HDFS onto their compatible local filesystem, allowing them to download and upload files directly to and from HDFS. Appends to files are allowed, but random writes are not. A number of I/O improvements have been brought into Hadoop. For example, in Hadoop 1.X, clients collocated with data nodes had to read data via TCP sockets. However, with short-circuit local reads, clients can directly read off the data nodes. This particular interface also supports zero-copy reads. The CRC checksum that is calculated for reads and writes of data has been optimized using the Intel SSE4.2 CRC32 instruction. Support enhancements Hadoop is also widening its application net by supporting other platforms and frameworks. One dimension we saw was onboarding of other computational models with YARN or other storage systems with the Block Storage layer. The other enhancements are as follows: Hadoop 2.X supports Microsoft Windows natively. This translates to a huge opportunity to penetrate the Microsoft Windows server land for massive data processing. This was partially possible because of the use of the highly portable Java programming language for Hadoop development. The other critical enhancement was the generalization of compute and storage management to include Microsoft Windows. As part of Platform-as-a-Service offerings, cloud vendors give out on-demand Hadoop as a service. 
OpenStack support in Hadoop 2.X makes it conducive for deployment in elastic and virtualized cloud environments. Summary In this article, we saw the evolution of Hadoop and some of its milestones and releases. We went into depth on Hadoop 2.X and the changes it brings into Hadoop. The key takeaways from this article are: In over 6 years of its existence, Hadoop has become the number one choice as a framework for massively parallel and distributed computing. The community has been shaping Hadoop to gear up for enterprise use. In 1.X releases, HDFS append and security, were the key features that made Hadoop enterprise-friendly. Hadoop's storage layer was enhanced in 2.X to separate the filesystem from the block storage service. This enables features such as supporting multiple namespaces and integration with other filesystems. 2.X shows improvements in Hadoop storage availability and snapshotting. Resources for Article: Further resources on this subject: Securing the Hadoop Ecosystem [article] Sizing and Configuring your Hadoop Cluster [article] HDFS and MapReduce [article]

Creating a Map

Packt
29 Dec 2014
11 min read
In this article by Thomas Newton and Oscar Villarreal, authors of the book Learning D3.js Mapping, we will cover the following topics through a series of experiments: Foundation – creating your basic map Experiment 1 – adjusting the bounding box Experiment 2 – creating choropleths Experiment 3 – adding click events to our visualization (For more resources related to this topic, see here.) Foundation – creating your basic map In this section, we will walk through the basics of creating a standard map. Let's walk through the code to get a step-by-step explanation of how to create this map. The width and height can be anything you want. Depending on where your map will be visualized (cellphones, tablets, or desktops), you might want to consider providing a different width and height: var height = 600; var width = 900; The next variable defines a projection algorithm that allows you to go from a cartographic space (latitude and longitude) to a Cartesian space (x,y)—basically a mapping of latitude and longitude to coordinates. You can think of a projection as a way to map the three-dimensional globe to a flat plane. There are many kinds of projections, but geo.mercator is normally the default value you will use: var projection = d3.geo.mercator(); var mexico = void 0; If you were making a map of the USA, you could use a better projection called albersUsa. This is to better position Alaska and Hawaii. By creating a geo.mercator projection, Alaska would render proportionate to its size, rivaling that of the entire US. The albersUsa projection grabs Alaska, makes it smaller, and puts it at the bottom of the visualization. The following screenshot is of geo.mercator:   This following screenshot is of geo.albersUsa:   The D3 library currently contains nine built-in projection algorithms. An overview of each one can be viewed at https://github.com/mbostock/d3/wiki/Geo-Projections. Next, we will assign the projection to our geo.path function. This is a special D3 function that will map the JSON-formatted geographic data into SVG paths. The data format that the geo.path function requires is named GeoJSON: var path = d3.geo.path().projection(projection); var svg = d3.select("#map")    .append("svg")    .attr("width", width)    .attr("height", height); Including the dataset The necessary data has been provided for you within the data folder with the filename geo-data.json: d3.json('geo-data.json', function(data) { console.log('mexico', data); We get the data from an AJAX call. After the data has been collected, we want to draw only those parts of the data that we are interested in. In addition, we want to automatically scale the map to fit the defined height and width of our visualization. If you look at the console, you'll see that "mexico" has an objects property. Nested inside the objects property is MEX_adm1. This stands for the administrative areas of Mexico. It is important to understand the geographic data you are using, because other data sources might have different names for the administrative areas property:   Notice that the MEX_adm1 property contains a geometries array with 32 elements. Each of these elements represents a state in Mexico. Use this data to draw the D3 visualization. var states = topojson.feature(data, data.objects.MEX_adm1); Here, we pass all of the administrative areas to the topojson.feature function in order to extract and create an array of GeoJSON objects. The preceding states variable now contains the features property. 
This features array is a list of 32 GeoJSON elements, each representing the geographic boundaries of a state in Mexico. We will set an initial scale and translation to 1 and 0,0 respectively: // Setup the scale and translate projection.scale(1).translate([0, 0]); This algorithm is quite useful. The bounding box is a spherical box that returns a two-dimensional array of min/max coordinates, inclusive of the geographic data passed: var b = path.bounds(states); To quote the D3 documentation: "The bounding box is represented by a two-dimensional array: [[left, bottom], [right, top]], where left is the minimum longitude, bottom is the minimum latitude, right is maximum longitude, and top is the maximum latitude." This is very helpful if you want to programmatically set the scale and translation of the map. In this case, we want the entire country to fit in our height and width, so we determine the bounding box of every state in the country of Mexico. The scale is calculated by taking the longest geographic edge of our bounding box and dividing it by the number of pixels of this edge in the visualization: var s = .95 / Math.max((b[1][0] - b[0][0]) / width, (b[1][1] - b[0][1]) / height); This can be calculated by first computing the scale of the width, then the scale of the height, and, finally, taking the larger of the two. All of the logic is compressed into the single line given earlier. The three steps are explained in the following image:   The value 95 adjusts the scale, because we are giving the map a bit of a breather on the edges in order to not have the paths intersect the edges of the SVG container item, basically reducing the scale by 5 percent. Now, we have an accurate scale of our map, given our set width and height. var t = [(width - s * (b[1][0] + b[0][0])) / 2, (height - s * (b[1][1] + b[0][1])) / 2]; When we scale in SVG, it scales all the attributes (even x and y). In order to return the map to the center of the screen, we will use the translate function. The translate function receives an array with two parameters: the amount to translate in x, and the amount to translate in y. We will calculate x by finding the center (topRight – topLeft)/2 and multiplying it by the scale. The result is then subtracted from the width of the SVG element. Our y translation is calculated similarly but using the bottomRight – bottomLeft values divided by 2, multiplied by the scale, then subtracted from the height. Finally, we will reset the projection to use our new scale and translation: projection.scale(s).translate(t); Here, we will create a map variable that will group all of the following SVG elements into a <g> SVG tag. This will allow us to apply styles and better contain all of the proceeding paths' elements: var map = svg.append('g').attr('class', 'boundary'); Finally, we are back to the classic D3 enter, update, and exit pattern. We have our data, the list of Mexico states, and we will join this data to the path SVG element:    mexico = map.selectAll('path').data(states.features);      //Enter    mexico.enter()        .append('path')        .attr('d', path); The enter section and the corresponding path functions are executed on every data element in the array. As a refresher, each element in the array represents a state in Mexico. The path function has been set up to correctly draw the outline of each state as well as scale and translate it to fit in our SVG container. Congratulations! You have created your first map! 
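Before moving on to the experiments, it can help to see the fit-to-viewport steps gathered in one place. The following is only a consolidated sketch of the code walked through above (D3 v3 API); the helper name fitProjection is ours, and it assumes the projection, path, width, height, and states variables defined earlier:

// Sketch only: recompute scale and translate so a feature collection fills ~95% of the SVG
var fitProjection = function(featureCollection) {
    // Reset so that bounds are computed against an unscaled, untranslated projection
    projection.scale(1).translate([0, 0]);
    var b = path.bounds(featureCollection); // bounding box of the projected features
    var s = .95 / Math.max((b[1][0] - b[0][0]) / width,
                           (b[1][1] - b[0][1]) / height);
    var t = [(width - s * (b[1][0] + b[0][0])) / 2,
             (height - s * (b[1][1] + b[0][1])) / 2];
    // Apply the computed scale and translation
    projection.scale(s).translate(t);
};

fitProjection(states);                 // fit the whole country
// fitProjection(states.features[5]); // or a single state, as in Experiment 1

Wrapping the logic in a function like this makes Experiment 1 a one-line change rather than an edit to the setup code.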
Experiment 1 – adjusting the bounding box Now that we have our foundation, let's start with our first experiment. For this experiment, we will manually zoom in to a state of Mexico using what we learned in the previous section. For this experiment, we will modify one line of code: var b = path.bounds(states.features[5]); Here, we are telling the calculation to create a boundary based on the sixth element of the features array instead of every state in the country of Mexico. The boundaries data will now run through the rest of the scaling and translation algorithms to adjust the map to the one shown in the following screenshot:   We have basically reduced the min/max of the boundary box to include the geographic coordinates for one state in Mexico (see the next screenshot), and D3 has scaled and translated this information for us automatically:   This can be very useful in situations where you might not have the data that you need in isolation from the surrounding areas. Hence, you can always zoom in to your geography of interest and isolate it from the rest. Experiment 2 – creating choropleths One of the most common uses of D3.js maps is to make choropleths. This visualization gives you the ability to discern between regions, giving them a different color. Normally, this color is associated with some other value, for instance, levels of influenza or a company's sales. Choropleths are very easy to make in D3.js. In this experiment, we will create a quick choropleth based on the index value of the state in the array of all the states. We will only need to modify two lines of code in the update section of our D3 code. Right after the enter section, add the following two lines: //Update var color = d3.scale.linear().domain([0,33]).range(['red',   'yellow']); mexico.attr('fill', function(d,i) {return color(i)}); The color variable uses another valuable D3 function named scale. Scales are extremely powerful when creating visualizations in D3; much more detail on scales can be found at https://github.com/mbostock/d3/wiki/Scales. For now, let's describe what this scale defines. Here, we created a new function called color. This color function looks for any number between 0 and 33 in an input domain. D3 linearly maps these input values to a color between red and yellow in the output range. D3 has included the capability to automatically map colors in a linear range to a gradient. This means that executing the new function, color, with 0 will return the color red, color(15) will return an orange color, and color(33) will return yellow. Now, in the update section, we will set the fill property of the path to the new color function. This will provide a linear scale of colors and use the index value i to determine what color should be returned. If the color was determined by a different value of the datum, for instance, d.sales, then you would have a choropleth where the colors actually represent sales. The preceding code should render something as follows: Experiment 3 – adding click events to our visualization We've seen how to make a map and set different colors to the different regions of this map. Next, we will add a little bit of interactivity. This will illustrate a simple reference to bind click events to maps. First, we need a quick reference to each state in the country. 
To accomplish this, we will create a new function called geoID right below the mexico variable: var height = 600; var width = 900; var projection = d3.geo.mercator(); var mexico = void 0;   var geoID = function(d) {    return "c" + d.properties.ID_1; }; This function takes in a state data element and generates a new selectable ID based on the ID_1 property found in the data. The ID_1 property contains a unique numeric value for every state in the array. If we insert this as an id attribute into the DOM, then we would create a quick and easy way to select each state in the country. The following is the geoID function, creating another function called click: var click = function(d) {    mexico.attr('fill-opacity', 0.2); // Another update!    d3.select('#' + geoID(d)).attr('fill-opacity', 1); }; This method makes it easy to separate what the click is doing. The click method receives the datum and changes the fill opacity value of all the states to 0.2. This is done so that when you click on one state and then on the other, the previous state does not maintain the clicked style. Notice that the function call is iterating through all the elements of the DOM, using the D3 update pattern. After making all the states transparent, we will set a fill-opacity of 1 for the given clicked item. This removes all the transparent styling from the selected state. Notice that we are reusing the geoID function that we created earlier to quickly find the state element in the DOM. Next, let's update the enter method to bind our new click method to every new DOM element that enter appends: //Enter mexico.enter()      .append('path')      .attr('d', path)      .attr('id', geoID)      .on("click", click); We also added an attribute called id; this inserts the results of the geoID function into the id attribute. Again, this makes it very easy to find the clicked state. The code should produce a map as follows. Check it out and make sure that you click on any of the states. You will see its color turn a little brighter than the surrounding states. Summary You learned how to build many different kinds of maps that cover different kinds of needs. Choropleths and data visualizations on maps are some of the most common geographic-based data representations that you will come across. Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]

Cassandra High Availability: Replication

Packt
24 Dec 2014
5 min read
This article by Robbie Strickland, the author of Cassandra High Availability, describes the data replication architecture used in Cassandra. Replication is perhaps the most critical feature of a distributed data store, as it would otherwise be impossible to make any sort of availability guarantee in the face of a node failure. As you already know, Cassandra employs a sophisticated replication system that allows fine-grained control over replica placement and consistency guarantees. In this article, we'll explore Cassandra's replication mechanism in depth. Let's start with the basics: how Cassandra determines the number of replicas to be created and where to locate them in the cluster. We'll begin the discussion with a feature that you'll encounter the very first time you create a keyspace: the replication factor. (For more resources related to this topic, see here.) The replication factor On the surface, setting the replication factor seems to be a fundamentally straightforward idea. You configure Cassandra with the number of replicas you want to maintain (during keyspace creation), and the system dutifully performs the replication for you, thus protecting you when something goes wrong. So by defining a replication factor of three, you will end up with a total of three copies of the data. There are a number of variables in this equation. Let's start with the basic mechanics of setting the replication factor. Replication strategies One thing you'll quickly notice is that the semantics to set the replication factor depend on the replication strategy you choose. The replication strategy tells Cassandra exactly how you want replicas to be placed in the cluster. There are two strategies available: SimpleStrategy: This strategy is used for single data center deployments. It is fine to use this for testing, development, or simple clusters, but discouraged if you ever intend to expand to multiple data centers (including virtual data centers such as those used to separate analysis workloads). NetworkTopologyStrategy: This strategy is used when you have multiple data centers, or if you think you might have multiple data centers in the future. In other words, you should use this strategy for your production cluster. SimpleStrategy As a way of introducing this concept, we'll start with an example using SimpleStrategy. The following Cassandra Query Language (CQL) block will allow us to create a keyspace called AddressBook with three replicas: CREATE KEYSPACE AddressBookWITH REPLICATION = {   'class' : 'SimpleStrategy',   'replication_factor' : 3}; The data is assigned to a node via a hash algorithm, resulting in each node owning a range of data. Let's take another look at the placement of our example data on the cluster. Remember the keys are first names, and we determined the hash using the Murmur3 hash algorithm. The primary replica for each key is assigned to a node based on its hashed value. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). While using SimpleStrategy, Cassandra will locate the first replica on the owner node (the one determined by the hash algorithm), then walk the ring in a clockwise direction to place each additional replica, as follows: Additional replicas are placed in adjacent nodes when using manually assigned tokens In the preceding diagram, the keys in bold represent the primary replicas (the ones placed on the owner nodes), with subsequent replicas placed in adjacent nodes, moving clockwise from the primary. 
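As an aside that is not part of the original text, the replication settings of an existing keyspace can be inspected and changed from cqlsh; the following is a hedged sketch reusing the AddressBook keyspace created above. Note that DESCRIBE is a cqlsh shell command rather than CQL proper, and that raising the replication factor on a live cluster also requires a repair so the new replicas actually receive data:

-- Sketch only: inspect and adjust replication for the AddressBook keyspace
DESCRIBE KEYSPACE AddressBook;

ALTER KEYSPACE AddressBook
WITH REPLICATION = {
   'class' : 'SimpleStrategy',
   'replication_factor' : 3
};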
Although each node owns a set of keys based on its token range(s), there is no concept of a master replica. In Cassandra, unlike many other database designs, every replica is equal. This means reads and writes can be made to any node that holds a replica of the requested key. If you have a small cluster where all nodes reside in a single rack inside one data center, SimpleStrategy will do the job. This makes it the right choice for local installations, development clusters, and other similar simple environments where expansion is unlikely, because there is no need to configure a snitch (which will be covered later in this section). For production clusters, however, it is highly recommended that you use NetworkTopologyStrategy instead. This strategy provides a number of important features for more complex installations where availability and performance are paramount.
NetworkTopologyStrategy
When it's time to deploy your live cluster, NetworkTopologyStrategy offers two additional properties that make it more suitable for this purpose:
Rack awareness: Unlike SimpleStrategy, which places replicas naively, this feature attempts to ensure that replicas are placed in different racks, thus preventing service interruption or data loss due to failures of switches, power, cooling, and other similar events that tend to affect single racks of machines.
Configurable snitches: A snitch helps Cassandra to understand the topology of the cluster. There are a number of snitch options for any type of network configuration.
Here's a basic example of a keyspace using NetworkTopologyStrategy:
CREATE KEYSPACE AddressBook
WITH REPLICATION = {
   'class' : 'NetworkTopologyStrategy',
   'dc1' : 3,
   'dc2' : 2
};
In this example, we're telling Cassandra to place three replicas in a data center called dc1 and two replicas in a second data center called dc2.
Summary
In this article, we introduced the foundational concepts of replication and consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability. By now, you should be able to make sound decisions specific to your use cases. This article might serve as a handy reference in the future, as it can be challenging to keep all these details in mind. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article] About Cassandra [Article]

Analyzing Data

Packt
24 Dec 2014
13 min read
In this article by Amarpreet Singh Bassan and Debarchan Sarkar, authors of Mastering SQL Server 2014 Data Mining, we will begin our discussion with an introduction to the data mining life cycle, and this article will focus on its first three stages. You are expected to have basic understanding of the Microsoft business intelligence stack and familiarity of terms such as extract, transform, and load (ETL), data warehouse, and so on. (For more resources related to this topic, see here.) Data mining life cycle Before going into further details, it is important to understand the various stages of the data mining life cycle. The data mining life cycle can be broadly classified into the following steps: Understanding the business requirement. Understanding the data. Preparing the data for the analysis. Preparing the data mining models. Evaluating the results of the analysis prepared with the models. Deploying the models to the SQL Server Analysis Services Server. Repeating steps 1 to 6 in case the business requirement changes. Let's look at each of these stages in detail. The first and foremost task that needs to be well defined even before beginning the mining process is to identify the goals. This is a crucial part of the data mining exercise and you need to understand the following questions: What and whom are we targeting? What is the outcome we are targeting? What is the time frame for which we have the data and what is the target time period that our data is going to forecast? What would the success measures look like? Let's define a classic problem and understand more about the preceding questions. We can use them to discuss how to extract the information rather than spending our time on defining the schema. Consider an instance where you are a salesman for the AdventureWorks Cycle company, and you need to make predictions that could be used in marketing the products. The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail. The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely: What is it that we are predicting? (for example: customers, product sales, and so on) What is the time period of the data that we are selecting for prediction? What time period are we going to have the prediction for? What is the expected outcome of the prediction exercise? From the marketing point of view, several follow-up questions that must be answered are as follows: What is our target for marketing, a new product or an older product? Is our marketing strategy product centric or customer centric? Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification? On what timeline in the past is our marketing going to be based on? We might observe that there are many questions that overlap the two categories and therefore, there is an opportunity to consolidate the questions and classify them as follows: What is the population that we are targeting? What are the factors that we will actually be looking at? What is the time period of the past data that we will be looking at? What is the time period in the future that we will be considering the data mining results for? Let's throw some light on these aspects based on the AdventureWorks example. 
We will get answers to the preceding questions and arrive at a more refined problem statement. What is the population that we are targeting? The target population might be classified according to the following aspects: Age Salary Number of kids What are the factors that we are actually looking at? They might be classified as follows: Geographical location: The people living in hilly areas would prefer All Terrain Bikes (ATB) and the population on plains would prefer daily commute bikes. Household: The people living in posh areas would look for bikes with the latest gears and also look for accessories that are state of the art, whereas people in the suburban areas would mostly look for budgetary bikes. Affinity of components: The people who tend to buy bikes would also buy some accessories. What is the time period of the past data that we would be looking at? Usually, the data that we get is quite huge and often consists of the information that we might very adequately label as noise. In order to sieve effective information, we will have to determine exactly how much into the past we should look; for example, we can look at the data for the past year, past two years, or past five years. We also need to decide the future data that we will consider the data mining results for. We might be looking at predicting our market strategy for an upcoming festive season or throughout the year. We need to be aware that market trends change and so does people's needs and requirements. So we need to keep a time frame to refresh our findings to an optimal; for example, the predictions from the past 5 years data can be valid for the upcoming 2 or 3 years depending upon the results that we get. Now that we have taken a closer look into the problem, let's redefine the problem more accurately. AdventureWorks has several stores in various locations and based on the location, we would like to get an insight on the following: Which products should be stocked where? Which products should be stocked together? How much of the products should be stocked? What is the trend of sales for a new product in an area? It is not necessary that we will get answers to all the detailed questions but even if we keep looking for the answers to these questions, there would be several insights that we will get, which will help us make better business decisions. Staging data In this phase, we collect data from all the sources and dump them into a common repository, which can be any database system such as SQL Server, Oracle, and so on. Usually, an organization might have various applications to keep track of the data from various departments, and it is quite possible that all these applications might use a different database system to store the data. Thus, the staging phase is characterized by dumping the data from all the other data storage systems to a centralized repository. Extract, transform, and load This term is most common when we talk about data warehouse. As it is clear, ETL has the following three parts: Extract: The data is extracted from a different source database and other databases that might contain the information that we seek Transform: Some transformation is applied to the data to fit the operational needs, such as cleaning, calculation, removing duplicates, reformatting, and so on Load: The transformed data is loaded into the destination data store database We usually believe that the ETL is only required till we load the data onto the data warehouse but this is not true. 
ETL can be used anywhere that we feel the need to do some transformation of data as shown in the following figure: Data warehouse As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices and then relate to these concepts with the help of AdventureWorksDW database. Measures and dimensions There are a few common terminologies you will encounter as you enter the world of data warehousing. They are as follows: Measure: Any business entity that can be aggregated or whose values can be ascertained in a numerical value is termed as measure, for example, sales, number of products, and so on Dimension: This is any business entity that lends some meaning to the measures, for example, in an organization, the quantity of goods sold is a measure but the month is a dimension Schema A schema, basically, determines the relationship of the various entities with each other. There are essentially two types of schema, namely: Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot: Snowflake schema: This is a relationship wherein the measures may have a direct and indirect relationship with the dimensions. We will be designing a snowflake schema if we want a more detailed drill down of the data. Snowflake schema usually would involve hierarchies, as shown in the following screenshot: Data mart While a data warehouse is a more organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We segregate the data according to the department or the specialty that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts. Let's consider the sales for AdventureWorks cycles. To make any predictions on the sales of AdventureWorks, we will have to group all the tables associated with the sales together in a data mart. Based on the AdventureWorks database, we have the following table in the AdventureWorks sales data mart. 
The Internet sales facts table has the following data: [ProductKey][OrderDateKey][DueDateKey][ShipDateKey][CustomerKey][PromotionKey][CurrencyKey][SalesTerritoryKey][SalesOrderNumber][SalesOrderLineNumber][RevisionNumber][OrderQuantity][UnitPrice][ExtendedAmount][UnitPriceDiscountPct][DiscountAmount][ProductStandardCost][TotalProductCost][SalesAmount][TaxAmt][Freight][CarrierTrackingNumber][CustomerPONumber][OrderDate][DueDate][ShipDate] From the preceding column, we can easily identify that if we need to separate the tables to perform the sales analysis alone, we can safely include the following: Product: This provides the following data: [ProductKey][ListPrice] Date: This provides the following data: [DateKey] Customer: This provides the following data: [CustomerKey] Currency: This provides the following data: [CurrencyKey] Sales territory: This provides the following data: [SalesTerritoryKey] The preceding data will provide the relevant dimensions and the facts that are already contained in the FactInternetSales table and hence, we can easily perform all the analysis pertaining to the sales of the organization. Refreshing data Based on the nature of the business and the requirements of the analysis, refreshing of data can be done either in parts wherein new or incremental data is added to the tables, or we can refresh the entire data wherein the tables are cleaned and filled with new data, which consists of the old and new data. Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table: [BusinessEntityID],[NationalIDNumber],[LoginID],[OrganizationNode],[OrganizationLevel],[JobTitle],[BirthDate],[MaritalStatus],[Gender],[HireDate],[SalariedFlag],[VacationHours],[SickLeaveHours],[CurrentFlag],[rowguid],[ModifiedDate] Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization. So, it will not really make sense to have a procedure in place to reload the dimensions, prior to SQL 2008. When it comes to managing the changes in the dimensions table, Slowly Changing Dimensions (SCD) is worth a mention. We will briefly look at the SCD here. There are three types of SCD, namely: Type 1: The older values are overwritten by new values Type 2: A new row specifying the present value for the dimension is inserted Type 3: The column specifying TimeStamp from which the new value is effective is updated Let's take the example of HireDate as a method of keeping track of the incremental loading. We will also have to maintain a small table that will keep a track of the data that is loaded from the employee table. So, we create a table as follows: Create table employee_load_status(HireDate DateTime,LoadStatus varchar); The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database: With employee_loaded_date(HireDate) as(select ISNULL(Max(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='success'Union AllSelect ISNULL(min(HireDate),to_date('01-01-1900','MM-DD-YYYY')) fromemployee_load_status where LoadStatus='failed')Insert into DimEmployee select * from employee where HireDate>=(select Min(HireDate) from employee_loaded_date); This will reload all the data from the date of the first failure till the present day. A similar procedure can be followed to load the fact table but there is a catch. 
If we look at the sales table in the AdventureWorks database, we see the following columns:
[BusinessEntityID],[TerritoryID],[SalesQuota],[Bonus],[CommissionPct],[SalesYTD],[SalesLastYear],[rowguid],[ModifiedDate]
The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column. Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW database that will use the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on. Let's look at this with a simpler example. Consider that we have the following sales table:

Name     SalesAmount   Date
Rama     1000          11-02-2014
Shyama   2000          11-02-2014

Consider that we have the following fact table:

id   SalesAmount   Datekey

We will have to decide whether to apply an incremental load or a complete reload of the table based on our end needs. The entries for the incremental load will look like this:

id   SalesAmount   Datekey
Ra   1000          11-02-2014
Sh   2000          11-02-2014
Ra   4000          12-02-2014
Sh   5000          13-02-2014

A complete reload will appear as shown here:

id   TotalSalesAmount   Datekey
Ra   5000               12-02-2014
Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.
Summary
In this article, we've covered the first three steps of any data mining process. We've considered the reasons why we would want to undertake a data mining activity and identified the goal we have in mind. We then looked at staging the data and cleansing it. Resources for Article: Further resources on this subject: Hadoop and SQL [Article] SQL Server Analysis Services – Administering and Monitoring Analysis Services [Article] SQL Server Integration Services (SSIS) [Article]

Hadoop and SQL

Packt
23 Dec 2014
61 min read
In this article by Garry Turkington and Gabriele Modena, the author of the book Learning Hadoop 2. MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop SQL. (For more resources related to this topic, see here.) In this article, we will cover the following topics: What the use cases for SQL on Hadoop are and why it is so popular HiveQL, the SQL dialect introduced by Apache Hive Using HiveQL to perform SQL-like analysis of the Twitter dataset How HiveQL can approximate common features of relational databases such as joins and views How HiveQL allows the incorporation of user-defined functions into its queries How SQL on Hadoop complements Pig Other SQL-on-Hadoop products such as Impala and how they differ from Hive Why SQL on Hadoop Until now, we saw how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008, Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of user-defined functions, enabling the base SQL dialect to be customized with business-specific functionality. Other SQL-on-Hadoop solutions Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this article, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop, however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. 
We'll generally be looking at aspects of the feature set that is common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate    (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,    (chararray)$0#'id_str', (chararray)$0#'text' as text,    (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,      (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate    (chararray)place#'id' as place_id,    (chararray)place#'country_code' as co,    (chararray)place#'country' as country,    (chararray)place#'name' as place_name,    (chararray)place#'full_name' as place_full_name,    (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001');   -- Users users = foreach tweets {    generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users {    generate    (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count;   } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Have a look at the following code: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. 
In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Even though the classic CLI tool for Hive was the tool with the same name, it is specifically called hive (all lowercase); recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process, as follows: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweet defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by a tab character t and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into hive using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. 
Neither the CREATE TABLE nor the LOAD DATA statement truly creates concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.
Hive architecture
Until version 2, Hadoop was primarily a batch system. MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons (Beeline is relatively new), both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:
$ beeline -u jdbc:hive2://
Data types
HiveQL supports many of the common data types provided by standard database systems.
These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. The following SHOW [DATABASES, TABLES, VIEWS] statement displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we might want to create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. 
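As a quick, illustrative check (not part of the original text), the view can be queried just like a table; the exact counts will of course depend on your own tweet sample:

-- Sketch only: querying the retweets view defined above
SELECT COUNT(*) FROM retweets;

SELECT user_id, COUNT(*) AS cnt
FROM retweets
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 5;

Because Hive views are purely logical, both statements are expanded at query time to run against the underlying tweets table.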
The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table, which didn't exist in older files, then, while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.
File formats and storage
The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files:
TextInputFormat and HiveIgnoreKeyTextOutputFormat: These will read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: These classes read/write data in the Hadoop SequenceFile format
Additionally, the following SerDe classes can be used to serialize and deserialize data:
MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: These will read/write Thrift objects
JSON
As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JAR into Hive with the following code:
ADD JAR json-serde-1.3-jar-with-dependencies.jar;
Then, we issue the usual create statement, as follows:
CREATE EXTERNAL TABLE tweets (
   contributors string,
   coordinates struct <
     coordinates: array <float>,
     type: string>,
   created_at string,
   entities struct <
     hashtags: array <struct <
           indices: array <tinyint>,
           text: string>>,
…
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';
With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe').
In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [    {"name":"topic","type":["null","int"]},    {"name":"source","type":["null","int"]},    {"name":"rank","type":["null","float"]} ] } The structure is quite self-explanatory. The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{    "type":"record",    "name":"record",    "fields": [        {"name":"topic","type":["null","int"]},        {"name":"source","type":["null","int"]},        {"name":"rank","type":["null","float"]}    ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the following table definition from within Hive (note also that HCatalog): DESCRIBE tweets_pagerank; OK topic                 int                   from deserializer   source               int                   from deserializer   rank                 float                 from deserializer In the ddl, we told Hive that data is stored in the Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. 
Create the preceding schema in a file called pagerank.avsc—this is the standard file extension for Avro schemas. Then place it on HDFS; we want to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, for example, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank: SELECT source, topic from tweets_pagerank WHERE rank >= 0.9; We will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations. Columnar stores Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in SequenceFile, each full row and all its columns will be read from the disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnarchanged this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries. Queries Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset: SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10 The following are the top 10 most prolific users in the dataset: NULL 7091 1332188053 4 959468857 3 1367752118 3 362562944 3 58646041 3 2375296688 3 1468188529 3 37114209 3 2385040940 3 This allows us to identify the number of tweets, 7,091, with no user object. We can improve the readability of the hive output by setting the following code: SET hive.cli.print.header=true; This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into hive using external tables. 
We first create a user table to store user data, as follows: CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/users'; We then create a place table to store location data, as follows: CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/places'; We can use the JOIN operator to display the names of the 10 most prolific users, as follows: SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10; Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets tables. Alternatively, we can rewrite the previous query as follows: SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets join (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10;   Instead of directly joining the user table, we execute a subquery, as follows: SELECT user_id, name FROM user GROUP BY user_id, name; The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries. Historically, only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and ddl capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual. Structuring Hive tables for given workloads Often Hive isn't used in isolation, instead tables are created with particular workloads in mind or with needs invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios. Partitioning a table With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps, the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. 
Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition in which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date = '2014-01-01')
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (the nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION( created_at_date )
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table, as in the preceding code. We use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01. If data is added directly to the filesystem, for instance, by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<partition_spec>) LOCATION '<location>';

Using the MSCK REPAIR TABLE <table_name>; statement, all metadata for all partitions not currently present in the metastore will be added. On EMR, this is equivalent to executing the following code:

ALTER TABLE <table_name> RECOVER PARTITIONS;

Notice that both statements will also work with EXTERNAL tables. In the following article, we will see how this pattern can be exploited to create flexible and interoperable pipelines.
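Whichever of the preceding repair statements is used, it can be worth confirming afterwards what the metastore now knows about. SHOW PARTITIONS lists the partitions Hive is aware of for a table; for our example table it would be used as follows, with each returned line in the key=value form described above (for example, created_at_date=2014-04-01):

SHOW PARTITIONS partitioned_user;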
Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table> …

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add conditions to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT OVERWRITE TABLE partitioned_user
PARTITION (created_at_date)
SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets (
tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY(user_id) SORTED BY(name) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example.
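To picture the allocation itself, the following query is only an illustrative sketch, not Hive's internal implementation: it uses the built-in hash() and pmod() functions to show roughly which of the 64 buckets a given user_id would map to (the exact hash Hive applies internally may differ).

SELECT user_id, pmod(hash(user_id), 64) AS likely_bucket
FROM user
LIMIT 10;

Because both bucketed tables use the same clustering column and bucket count, matching user_id values land in correspondingly numbered buckets, which is exactly what the bucket map join described next exploits.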
So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;
SELECT …
FROM bucketed_user u
JOIN bucketed_tweets t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the tables, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only the rows in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though it works, this is highly inefficient, as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. The following code is representative of this case:

SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.
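Tying this back to the earlier caveat about counts: a sampled aggregate such as a row count only estimates the table total once it is scaled by the inverse of the sampling fraction. The following is a sketch of such an estimate, assuming the buckets are reasonably uniform in size; range-style aggregates such as the MAX query above need no scaling.

SELECT COUNT(*) * 32 AS estimated_total_users
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);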
A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input formats and file formats that are supported.

Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLENAME='user';

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can set the variable interactively:

SET hivevar:TABLENAME='user';

Hive and Amazon Web Services

With ElasticMapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.

Hive and S3

It is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS. We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that can access the bucket. This can be done in three ways:

Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>

Then we can create a table referencing this data, as follows:

CREATE TABLE remote_tweets (
created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string
) CLUSTERED BY(user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets';

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing. In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, or = characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.
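As a concrete illustration of the first of the three options listed above, the credentials can be set for the current Hive session before the remote table is queried. The property names are the s3n ones given above; the values shown here are placeholders that must be replaced with your own keys:

SET fs.s3n.awsAccessKeyId=<your access key>;
SET fs.s3n.awsSecretAccessKey=<your secret access key>;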
In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.

Hive on ElasticMapReduce

On one level, using Hive within Amazon ElasticMapReduce is just the same as everything discussed in this article. You can create a persistent cluster, log in to the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data. Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And, not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services. The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services are an area of rapid growth.

Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:

User Defined Functions (UDFs): These are simple functions that act on one row at a time and return a single value.
User Defined Aggregate Functions (UDAFs): These functions take multiple rows as input and produce a single output value. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
User Defined Table Functions (UDTFs): These take a single row as input and generate a logical table comprised of multiple rows; Hive's built-in explode() function is an example.

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities. Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that read and return basic writable types. A richer API, which provides support for data types other than writables, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF class. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string to ID function similar to the one we used in Iterative Computation with Spark to map hashtags to integers in Pig.
Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows: public class StringToInt extends UDF {    public Integer evaluate(Text input) {        if (input == null)            return null;            String str = input.toString();          return str.hashCode();    } } The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/ch7/udf/ com/learninghadoop2/hive/udf/StringToInt.java. A more robust hash function should be used in production. We compile the class and archive it into a JAR file, as follows: $ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java $ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class Before being able to use it, a UDF must be registered in Hive with the following commands: ADD JAR myudfs-hive.jar; CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt'; The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers as a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION … . Once registered, StringToInt can be used in a Hive query just as any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying the regexp_extract function. Then, we use string_to_int to map each tag to a numerical ID: SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM    (        SELECT regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') as hashtag        FROM tweets        GROUP BY regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); We can use the preceding query to create a lookup table, as follows: CREATE TABLE lookuptable (tag string, tag_id bigint); INSERT OVERWRITE TABLE lookuptable SELECT unique_hashtags.hashtag,    string_to_int(unique_hashtags.hashtag) as tag_id FROM (    SELECT regexp_extract(text,        '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag          FROM tweets          GROUP BY regexp_extract(text,            '(?:\s|\A|^)[##]+([A-Za-z0-9-_]+)')    ) unique_hashtags GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag); Programmatic interfaces In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for odbc was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC. JDBC A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/ com/learninghadoop2/hive/client/HiveJdbcClient.java. 
public class HiveJdbcClient {      private static String driverName = " org.apache.hive.jdbc.HiveDriver";           // connection string      public static String URL = "jdbc:hive2://localhost:10000";        // Show all tables in the default database      public static String QUERY = "show tables";        public static void main(String[] args) throws SQLException {          try {                Class.forName (driverName);          }          catch (ClassNotFoundException e) {                e.printStackTrace();                System.exit(1);          }          Connection con = DriverManager.getConnection (URL);          Statement stmt = con.createStatement();                   ResultSet resultSet = stmt.executeQuery(QUERY);          while (resultSet.next()) {                System.out.println(resultSet.getString(1));          }    } } The URL part is the JDBC URI that describes the connection end point. The format for establishing a remote connection is jdbc:hive2:<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port jdbc:hive2://. The hive and hive2 part are the drivers to be used when connecting to HiveServer and HiveServer2. The QUERY statement contains the HiveQL query to be executed. Hive's JDBC interface exposes only the default database. In order to access other databases, you need to reference them explicitly in the underlying queries using the <database>.<table> notation. First we load the HiveServer2 JDBC driver org.apache.hive.jdbc.HiveDriver. Use org.apache.hadoop.hive.jdbc.HiveDriver to connect to HiveServer. Then, like with any other JDBC program, we establish a connection to URL and use it to instantiate a Statement class. We execute QUERY, with no authentication, and store the output dataset into the ResultSet object. Finally, we scan resultSet and print its content to the command line. Compile and execute the example with the following commands: $ javac HiveJdbcClient.java $ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-jdbc.jar: com.learninghadoop2.hive.client.HiveJdbcClient Thrift Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/ch7/clients/ com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, the client won't work with HiveServer2. In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client. 
A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();
TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);

Finally, we call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with the following commands:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: com.learninghadoop2.hive.client.HiveThriftClient

Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries. These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. In particular, with very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Processing - MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule tasks on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between the tasks in the graph. Consider the following query, which we will use to compare execution on the MapReduce framework and on Tez:

SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b
ON (a.place_id = b.place_id)
GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

Hive on MapReduce versus Tez

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; the grouping and joining are then pipelined across the reducers without further intermediate writes to disk.
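If you want to see how Hive plans a given query on your own cluster, you can prefix the statement with the EXPLAIN keyword, which prints the execution plan instead of running the query; the exact output varies with the Hive version and the configured execution engine. For example:

EXPLAIN SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b
ON (a.place_id = b.place_id)
GROUP BY a.country;

Comparing this output under different execution engines is a simple way to see the contrast described above for yourself.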
Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a more efficient MapReduce-based implementation atop YARN. With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps. To take advantage of the Tez framework, there is a new Hive variable setting, as follows: set hive.execution.engine=tez; This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.incubator.apache.org or in several distributions, though at the time of writing, not Cloudera, due to its support of Impala. The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive using Tez. Impala Hive is not the only product providing the SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala). Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative. Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) which was first openly described by a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries. The architecture of Impala The basic architecture has three main components; the Impala daemons, the state store, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture. The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient. When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. 
This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the overall result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today; MPP is the term used for this type of shared-nothing, scale-out architecture. As the cluster runs, the state store ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.

Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support. Impala supports the metastore mechanism used by Hive as a store of the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala as it will access the same metastore and therefore provide access to the same tables available in Hive. But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other; they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++). As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.

A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed of thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time. The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires more data to be held in memory than is available on the executing nodes, then that query will simply fail in versions of Impala before 2.0. Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads.
Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows. It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was lead by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up-to-date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself. A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks. Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in say the Hortonworks distribution, where the ORC file format is preferred. These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs. Drill, Tajo, and beyond You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.incubator.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering. Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution. Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around—something else might. Summary In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. 
Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics: How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance How HiveQL supports many standard SQL data types and commands including joins and views The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill and how each of these focuses on specific areas in which to excel With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next article, we'll take a slight step up the abstraction hierarchy and look at how to manage the life cycle of this enormous data asset. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Understanding MapReduce [Article] Amazon DynamoDB - Modelling relationships, Error handling [Article]

Evolving the data model

Packt
19 Dec 2014
11 min read
In this article by C. Y. Kan, author of the book Cassandra Data Modeling and Analysis, we will see the techniques of how to evolve an existing Cassandra data model in detail. Meanwhile, the techniques of modeling by query will be demonstrated as well. (For more resources related to this topic, see here.) The Stock Screener Application is good enough to retrieve and analyze a single stock at one time. However, scanning just a single stock looks very limited in practical use. A slight improvement can be made here; it can handle a bunch of stocks instead of one. This bunch of stocks will be stored as Watch List in the Cassandra database. Accordingly, the Stock Screener Application will be modified to analyze the stocks in the Watch List, and therefore it will produce alerts for each of the stocks being watched based on the same screening rule. For the produced alerts, saving them in Cassandra will be beneficial for backtesting trading strategies and continuous improvement of the Stock Screener Application. They can be reviewed from time to time without having to review them on the fly. Backtesting is a jargon used to refer to testing a trading strategy, investment strategy, or a predictive model using existing historical data. It is also a special type of cross-validation applied to time series data. In addition, when the number of the stocks in the Watch List grows to a few hundred, it will be difficult for a user of the Stock Screener Application to recall what the stocks are by simply referring to their stock codes. Hence, it would be nice to have the name of the stocks added to the produced alerts to make them more descriptive and user-friendly. Finally, we might have an interest in finding out how many alerts were generated on a particular stock over a specified period of time and how many alerts were generated on a particular date. We will use CQL to write queries to answer these two questions. By doing so, the modeling by query technique can be demonstrated. The enhancement approach The enhancement approach consists of four change requests in total. First, we will conduct changes in the data model and then the code will be enhanced to provide the new features. Afterwards, we will test run the enhanced Stock Screener Application again. The parts of the Stock Screener Application that require modifications are highlighted in the following figure. It is remarkable that two new components are added to the Stock Screener Application. The first component, Watch List, governs Data Mapper and Archiver to collect stock quote data of those stocks in the Watch List from Yahoo! Finance. The second component is Query. It provides two Queries on Alert List for backtesting purposes: Watch List Watch List is a very simple table that merely stores the stock code of its constituents. It is rather intuitive for a relational database developer to define the stock code as the primary key, isn't it? Nevertheless, remember that in Cassandra, the primary key is used to determine the node that stores the row. As Watch List is expected to not be a very long list, it would be more appropriate to put all of its rows on the same node for faster retrieval. But how can we do that? We can create an additional column, say watch_list_code, for this particular purpose. The new table is called watchlist and will be created in the packtcdma keyspace. 
The CQL statement is shown in chapter06_001.py: # -*- coding: utf-8 -*- # program: chapter06_001.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create watchlist def create_watchlist(ss):    ## create watchlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS watchlist (' +                'watch_list_code varchar,' +                'symbol varchar,' +                'PRIMARY KEY (watch_list_code, symbol))')       ## insert AAPL, AMZN, and GS into watchlist    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AAPL')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'AMZN')")    ss.execute("INSERT INTO watchlist (watch_list_code, " +                "symbol) VALUES ('WS01', 'GS')") ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create watchlist table create_watchlist(session) ## close Cassandra connection cluster.shutdown() The create_watchlist function creates the table. Note that the watchlist table has a compound primary key made of watch_list_code and symbol. A Watch List called WS01 is also created, which contains three stocks, AAPL, AMZN, and GS. Alert List It is produced by a Python program and enumerates the date when the close price was above its 10-day SMA, that is, the signal and the close price at that time. Note that there were no stock code and stock name. We will create a table called alertlist to store the alerts with the code and name of the stock. The inclusion of the stock name is to meet the requirement of making the Stock Screener Application more user-friendly. Also, remember that joins are not allowed and denormalization is really the best practice in Cassandra. This means that we do not mind repeatedly storing (duplicating) the stock name in the tables that will be queried. A rule of thumb is one table for one query; as simple as that. The alertlist table is created by the CQL statement, as shown in chapter06_002.py: # -*- coding: utf-8 -*- # program: chapter06_002.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alertlist def create_alertlist(ss):    ## execute CQL statement to create alertlist table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alertlist (' +                'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (symbol, price_time))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alertlist table create_alertlist(session) ## close Cassandra connection cluster.shutdown() The primary key is also a compound primary key that consists of symbol and price_time. Adding the descriptive stock name Until now, the packtcdma keyspace has three tables, which are alertlist, quote, and watchlist. To add the descriptive stock name, one can think of only adding a column of stock name to alertlist only. As seen in the previous section, this has been done. So, do we need to add a column for quote and watchlist? It is, in fact, a design decision that depends on whether these two tables will be serving user queries. 
What a user query means is that the table will be used to retrieve rows for a query raised by a user. If a user wants to know the close price of Apple Inc. on June 30, 2014, it is a user query. On the other hand, if the Stock Screener Application uses a query to retrieve rows for its internal processing, it is not a user query. Therefore, if we want quote and watchlist to return rows for user queries, they need the stock name column; otherwise, they do not need it. The watchlist table is only for internal use by the current design, and so it need not have the stock name column. Of course, if in future, the Stock Screener Application allows a user to maintain Watch List, the stock name should also be added to the watchlist table. However, for quote, it is a bit tricky. As the stock name should be retrieved from the Data Feed Provider, which is Yahoo! Finance in our case, the most suitable time to get it is when the corresponding stock quote data is retrieved. Hence, a new column called stock_name is added to quote, as shown in chapter06_003.py: # -*- coding: utf-8 -*- # program: chapter06_003.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to add stock_name column def add_stockname_to_quote(ss):    ## add stock_name to quote    ss.execute('ALTER TABLE quote ' +                'ADD stock_name varchar') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## add stock_name column add_stockname_to_quote(session) ## close Cassandra connection cluster.shutdown() It is quite self-explanatory. Here, we use the ALTER TABLE statement to add the stock_name column of the varchar data type to quote. Queries on alerts As mentioned previously, we are interested in two questions: How many alerts were generated on a stock over a specified period of time? How many alerts were generated on a particular date? For the first question, alertlist is sufficient to provide an answer. However, alertlist cannot answer the second question because its primary key is composed of symbol and price_time. We need to create another table specifically for that question. This is an example of modeling by query. Basically, the structure of the new table for the second question should resemble the structure of alertlist. We give that table a name, alert_by_date, and create it as shown in chapter06_004.py: # -*- coding: utf-8 -*- # program: chapter06_004.py ## import Cassandra driver library from cassandra.cluster import Cluster ## function to create alert_by_date table def create_alertbydate(ss):    ## create alert_by_date table if not exists    ss.execute('CREATE TABLE IF NOT EXISTS alert_by_date (' +               'symbol varchar,' +                'price_time timestamp,' +                'stock_name varchar,' +                'signal_price float,' +                'PRIMARY KEY (price_time, symbol))') ## create Cassandra instance cluster = Cluster() ## establish Cassandra connection, using local default session = cluster.connect() ## use packtcdma keyspace session.set_keyspace('packtcdma') ## create alert_by_date table create_alertbydate(session) ## close Cassandra connection cluster.shutdown() When compared to alertlist in chapter06_002.py, alert_by_date only swaps the order of the columns in the compound primary key. One might think that a secondary index can be created on alertlist to achieve the same effect. 
Nonetheless, in Cassandra, a secondary index cannot be created on columns that are already engaged in the primary key. Always be aware of this constraint. We now finish the modifications on the data model. It is time for us to enhance the application logic in the next section. Summary This article extends the Stock Screener Application by a number of enhancements. We made changes to the data model to demonstrate the modeling by query techniques and how denormalization can help us achieve a high-performance application. Resources for Article: Further resources on this subject: An overview of architecture and modeling in Cassandra [Article] About Cassandra [Article] Basic Concepts and Architecture of Cassandra [Article]

Supervised learning

Packt
19 Dec 2014
50 min read
In this article by Dan Toomey, author of the book R for Data Science, we will learn about the supervised learning, which involves the use of a target variable and a number of predictor variables that are put into a model to enable the system to predict the target. This is also known as predictive modeling. (For more resources related to this topic, see here.) As mentioned, in supervised learning we have a target variable and a number of possible predictor variables. The objective is to associate the predictor variables in such a way so as to accurately predict the target variable. We are using some portion of observed data to learn how our model behaves and then testing that model on the remaining observations for accuracy. We will go over the following supervised learning techniques: Decision trees Regression Neural networks Instance based learning (k-NN) Ensemble learning Support vector machines Bayesian learning Bayesian inference Random forests Decision tree For decision tree machine learning, we develop a logic tree that can be used to predict our target value based on a number of predictor variables. The tree has logical points, such as if the month is December, follow the tree logic to the left; otherwise, follow the tree logic to the right. The last leaf of the tree has a predicted value. For this example, we will use the weather data in the rattle package. We will develop a decision tree to be used to determine whether it will rain tomorrow or not based on several variables. Let's load the rattle package as follows: > library(rattle) We can see a summary of the weather data. This shows that we have some real data over a year from Australia: > summary(weather)      Date                     Location     MinTemp     Min.   :2007-11-01   Canberra     :366   Min.   :-5.300 1st Qu.:2008-01-31   Adelaide     : 0   1st Qu.: 2.300 Median :2008-05-01   Albany       : 0   Median : 7.450 Mean   :2008-05-01   Albury       : 0   Mean   : 7.266 3rd Qu.:2008-07-31   AliceSprings : 0   3rd Qu.:12.500 Max.   :2008-10-31   BadgerysCreek: 0   Max.   :20.900                      (Other)     : 0                      MaxTemp         Rainfall       Evaporation       Sunshine     Min.   : 7.60   Min.   : 0.000   Min.  : 0.200   Min.   : 0.000 1st Qu.:15.03   1st Qu.: 0.000   1st Qu.: 2.200   1st Qu.: 5.950 Median :19.65   Median : 0.000   Median : 4.200   Median : 8.600 Mean   :20.55   Mean   : 1.428   Mean   : 4.522   Mean   : 7.909 3rd Qu.:25.50   3rd Qu.: 0.200   3rd Qu.: 6.400   3rd Qu.:10.500 Max.   :35.80   Max.   :39.800   Max.   :13.800   Max.   :13.600                                                    NA's   :3       WindGustDir   WindGustSpeed   WindDir9am   WindDir3pm NW     : 73   Min.   :13.00   SE     : 47   WNW   : 61 NNW   : 44   1st Qu.:31.00   SSE   : 40   NW     : 61 E     : 37   Median :39.00   NNW   : 36   NNW   : 47 WNW   : 35   Mean   :39.84   N     : 31   N     : 30 ENE   : 30   3rd Qu.:46.00   NW     : 30   ESE   : 27 (Other):144   Max.   :98.00   (Other):151   (Other):139 NA's   : 3   NA's   :2       NA's   : 31   NA's   : 1 WindSpeed9am     WindSpeed3pm   Humidity9am     Humidity3pm   Min.   : 0.000   Min.   : 0.00   Min.   :36.00   Min.   :13.00 1st Qu.: 6.000   1st Qu.:11.00   1st Qu.:64.00   1st Qu.:32.25 Median : 7.000   Median :17.00   Median :72.00   Median :43.00 Mean   : 9.652   Mean   :17.99   Mean   :72.04   Mean   :44.52 3rd Qu.:13.000   3rd Qu.:24.00   3rd Qu.:81.00   3rd Qu.:55.00 Max.   :41.000   Max.   :52.00   Max.   :99.00   Max.   
:96.00 NA's   :7                                                       Pressure9am     Pressure3pm       Cloud9am       Cloud3pm   Min.   : 996.5   Min.   : 996.8   Min.   :0.000   Min.   :0.000 1st Qu.:1015.4   1st Qu.:1012.8   1st Qu.:1.000   1st Qu.:1.000 Median :1020.1   Median :1017.4   Median :3.500   Median :4.000 Mean   :1019.7   Mean   :1016.8   Mean   :3.891   Mean   :4.025 3rd Qu.:1024.5   3rd Qu.:1021.5   3rd Qu.:7.000   3rd Qu.:7.000 Max.   :1035.7   Max.   :1033.2   Max.   :8.000   Max.   :8.000 Temp9am         Temp3pm         RainToday RISK_MM Min.   : 0.100   Min.   : 5.10   No :300   Min.   : 0.000 1st Qu.: 7.625   1st Qu.:14.15   Yes: 66   1st Qu.: 0.000 Median :12.550   Median :18.55             Median : 0.000 Mean   :12.358   Mean   :19.23             Mean   : 1.428 3rd Qu.:17.000   3rd Qu.:24.00             3rd Qu.: 0.200 Max.   :24.700   Max.   :34.50           Max.   :39.800                                                            RainTomorrow No :300     Yes: 66       We will be using the rpart function to develop a decision tree. The rpart function looks like this: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...) The various parameters of the rpart function are described in the following table: Parameter Description formula This is the formula used for the prediction. data This is the data matrix. weights These are the optional weights to be applied. subset This is the optional subset of rows of data to be used. na.action This specifies the action to be taken when y, the target value, is missing. method This is the method to be used to interpret the data. It should be one of these: anova, poisson, class, or exp. If not specified, the algorithm decides based on the layout of the data. … These are the additional parameters to be used to control the behavior of the algorithm.  
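As a quick illustration of how these pieces fit together, the following sketch fits a classification tree on a random training portion of the weather data and checks it against the held-out rows. The 75/25 split, the seed, and the choice of three predictors are assumptions made for this sketch only; the actual model for this article is built in the next step.

# A minimal rpart workflow sketch (assumes library(rattle) has made the weather data available)
library(rpart)
set.seed(42)                                   # arbitrary seed, just for a reproducible split
train_rows <- sample(nrow(weather), floor(0.75 * nrow(weather)))
train <- weather[train_rows, ]
test  <- weather[-train_rows, ]
# Fit a classification tree on the training portion only
tree <- rpart(RainTomorrow ~ Humidity3pm + Pressure3pm + Sunshine,
              data = train, method = "class")
# Predict the held-out rows and tabulate predictions against the actual outcomes
pred <- predict(tree, newdata = test, type = "class")
table(pred, test$RainTomorrow)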
Let's create a subset as follows: > weather2 <- subset(weather,select=-c(RISK_MM)) > install.packages("rpart") >library(rpart) > model <- rpart(formula=RainTomorrow ~ .,data=weather2, method="class") > summary(model) Call: rpart(formula = RainTomorrow ~ ., data = weather2, method = "class") n= 366   CPn split       rel error     xerror   xstd 1 0.19696970     0 1.0000000 1.0000000 0.1114418 2 0.09090909      1 0.8030303 0.9696970 0.1101055 3 0.01515152     2 0.7121212 1.0151515 0.1120956 4 0.01000000     7 0.6363636 0.9090909 0.1073129   Variable importance Humidity3pm WindGustSpeed     Sunshine WindSpeed3pm       Temp3pm            24           14          12             8             6 Pressure3pm       MaxTemp       MinTemp   Pressure9am       Temp9am            6             5             4             4             4 Evaporation         Date   Humidity9am     Cloud3pm     Cloud9am             3             3             2             2             1      Rainfall            1 Node number 1: 366 observations,   complexity param=0.1969697 predicted class=No   expected loss=0.1803279 P(node) =1    class counts:   300   66    probabilities: 0.820 0.180 left son=2 (339 obs) right son=3 (27 obs) Primary splits:    Humidity3pm < 71.5   to the left, improve=18.31013, (0 missing)    Pressure3pm < 1011.9 to the right, improve=17.35280, (0 missing)    Cloud3pm   < 6.5     to the left, improve=16.14203, (0 missing)    Sunshine   < 6.45   to the right, improve=15.36364, (3 missing)    Pressure9am < 1016.35 to the right, improve=12.69048, (0 missing) Surrogate splits:    Sunshine < 0.45   to the right, agree=0.945, adj=0.259, (0 split) (many more)… As you can tell, the model is complicated. The summary shows the progression of the model development using more and more of the data to fine-tune the tree. We will be using the rpart.plot package to display the decision tree in a readable manner as follows: > library(rpart.plot) > fancyRpartPlot(model,main="Rain Tomorrow",sub="Chapter 12") This is the output of the fancyRpartPlot function Now, we can follow the logic of the decision tree easily. For example, if the humidity is over 72, we are predicting it will rain. Regression We can use a regression to predict our target value by producing a regression model from our predictor variables. We will be using the forest fire data from http://archive.ics.uci.edu. We will load the data and get the following summary: > forestfires <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv") > summary(forestfires)        X               Y           month     day         FFMC     Min.   :1.000   Min.   :2.0   aug   :184   fri:85 Min.   :18.70 1st Qu.:3.000   1st Qu.:4.0   sep   :172   mon:74   1st Qu.:90.20 Median :4.000   Median :4.0   mar   : 54   sat:84   Median :91.60 Mean   :4.669   Mean   :4.3   jul   : 32   sun:95   Mean   :90.64 3rd Qu.:7.000   3rd Qu.:5.0  feb   : 20   thu:61   3rd Qu.:92.90 Max.   :9.000   Max.   :9.0   jun   : 17   tue:64   Max.   :96.20                                (Other): 38   wed:54                      DMC             DC             ISI             temp     Min.   : 1.1   Min.   : 7.9   Min.   : 0.000   Min.   : 2.20 1st Qu.: 68.6   1st Qu.:437.7   1st Qu.: 6.500   1st Qu.:15.50 Median :108.3   Median :664.2   Median : 8.400   Median :19.30 Mean   :110.9   Mean   :547.9   Mean   : 9.022   Mean   :18.89 3rd Qu.:142.4   3rd Qu.:713.9   3rd Qu.:10.800   3rd Qu.:22.80 Max.   :291.3   Max.   :860.6   Max.   :56.100   Max.   
:33.30
       RH             wind           rain             area
 Min.   : 15.00   Min.   :0.400   Min.   :0.00000   Min.   :   0.00
 1st Qu.: 33.00   1st Qu.:2.700   1st Qu.:0.00000   1st Qu.:   0.00
 Median : 42.00   Median :4.000   Median :0.00000   Median :   0.52
 Mean   : 44.29   Mean   :4.018   Mean   :0.02166   Mean   : 12.85
 3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.00000   3rd Qu.:   6.57
 Max.   :100.00   Max.   :9.400   Max.   :6.40000   Max.   :1090.84
I will just use the month, temperature, wind, and rain data to come up with a model of the area (size) of the fires using the lm function. The lm function looks like this:
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
The various parameters of the lm function are described in the following table:
Parameter Description
formula This is the formula to be used for the model
data This is the dataset
subset This is the subset of the dataset to be used
weights These are the weights to apply to factors
… These are the additional parameters to be added to the function
Let's fit the model as follows:
> model <- lm(formula = area ~ month + temp + wind + rain, data=forestfires)
Looking at the generated model, we see the following output:
> summary(model)
Call:
lm(formula = area ~ month + temp + wind + rain, data = forestfires)
Residuals:
   Min     1Q Median     3Q     Max
-33.20 -14.93   -9.10   -1.66 1063.59
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.390     24.532 -0.709   0.4787
monthaug     -10.342     22.761 -0.454   0.6498
monthdec     11.534     30.896   0.373   0.7091
monthfeb       2.607     25.796   0.101   0.9196
monthjan       5.988     50.493   0.119   0.9056
monthjul     -8.822    25.068 -0.352   0.7251
monthjun     -15.469     26.974 -0.573   0.5666
monthmar     -6.630     23.057 -0.288   0.7738
monthmay       6.603     50.053   0.132   0.8951
monthnov     -8.244     67.451 -0.122   0.9028
monthoct     -8.268    27.237 -0.304   0.7616
monthsep     -1.070     22.488 -0.048   0.9621
temp           1.569     0.673   2.332   0.0201 *
wind           1.581     1.711   0.924   0.3557
rain         -3.179     9.595 -0.331   0.7406
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.99 on 502 degrees of freedom
Multiple R-squared: 0.01692, Adjusted R-squared: -0.0105
F-statistic: 0.617 on 14 and 502 DF, p-value: 0.8518
Looking at the coefficients, temperature is the only predictor that is significant at the 5 percent level (p = 0.0201); none of the month dummy variables, and neither wind nor rain, shows a significant effect. The overall fit is also weak: the multiple R-squared is only about 0.017 and the F-statistic's p-value is 0.85, so these variables explain very little of the variation in burned area. Note that the model treats the month data as categorical. If we redevelop the model without temperature, the fit gets even worse (notice that the multiple R-squared value drops to 0.006 from 0.017), as shown here:
> model <- lm(formula = area ~ month + wind + rain, data=forestfires)
> summary(model)

Call:
lm(formula = area ~ month + wind + rain, data = forestfires)

Residuals:
   Min     1Q Median     3Q     Max
-22.17 -14.39 -10.46   -3.87 1072.43

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0126   22.8496   0.176   0.861
monthaug     4.3132   21.9724   0.196   0.844
monthdec     1.3259   30.7188   0.043   0.966
monthfeb     -1.6631   25.8441 -0.064   0.949
monthjan     -6.1034   50.4475 -0.121   0.904
monthjul     6.4648   24.3021   0.266   0.790
monthjun     -2.4944   26.5099 -0.094   0.925
monthmar     -4.8431   23.1458 -0.209   0.834
monthmay     10.5754   50.2441   0.210   0.833
monthnov     -8.7169   67.7479 -0.129   0.898
monthoct     -0.9917   27.1767 -0.036   0.971
monthsep     10.2110   22.0579   0.463   0.644
wind         1.0454     1.7026   0.614   0.540
rain         -1.8504     9.6207 -0.192   0.848

Residual standard error: 64.27 on 503 degrees of freedom
Multiple R-squared: 0.006269, Adjusted R-squared: -0.01941
F-statistic: 0.2441 on 13 and 503 DF, p-value: 0.9971
From the results, we can see an R-squared close to 0 and a model p-value of almost 1; in other words, this reduced model explains essentially none of the variation in fire size, which confirms that temperature was the most useful of these predictors. If you plot the model, you will get a series of graphs. The plot of the residuals versus fitted values is the most revealing, as shown in the following graph:
> plot(model)
The few very large positive residuals correspond to the handful of very large fires that the model fails to capture, which is consistent with the low R-squared:
Neural network
In a neural network, it is assumed that there is a complex relationship between the predictor variables and the target variable. The network allows the expression of each of these relationships. For this model, we will use the liver disorder data from http://archive.ics.uci.edu. The data has a few hundred observations from patients with liver disorders. The variables are various measures of blood for each patient as shown here:
> bupa <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data")
> colnames(bupa) <- c("mcv","alkphos","alamine","aspartate","glutamyl","drinks","selector")
> summary(bupa)
      mcv           alkphos         alamine
 Min.   : 65.00   Min.   : 23.00   Min.   : 4.00
 1st Qu.: 87.00   1st Qu.: 57.00   1st Qu.: 19.00
 Median : 90.00   Median : 67.00   Median : 26.00
 Mean   : 90.17   Mean   : 69.81   Mean   : 30.36
 3rd Qu.: 93.00   3rd Qu.: 80.00   3rd Qu.: 34.00
 Max.   :103.00   Max.   :138.00   Max.   :155.00
   aspartate       glutamyl         drinks
 Min.   : 5.00   Min.   : 5.00   Min.   : 0.000
 1st Qu.:19.00   1st Qu.: 15.00   1st Qu.: 0.500
 Median :23.00   Median : 24.50   Median : 3.000
 Mean   :24.64   Mean   : 38.31   Mean   : 3.465
 3rd Qu.:27.00   3rd Qu.: 46.25   3rd Qu.: 6.000
 Max.   :82.00   Max.   :297.00   Max.   :20.000
   selector
 Min.   :1.000
 1st Qu.:1.000
 Median :2.000
 Mean   :1.581
 3rd Qu.:2.000
 Max.   :2.000
We generate a neural network using the neuralnet function (from the neuralnet package). The neuralnet function looks like this:
neuralnet(formula, data, hidden = 1, threshold = 0.01,
          stepmax = 1e+05, rep = 1, startweights = NULL,
          learningrate.limit = NULL,
          learningrate.factor = list(minus = 0.5, plus = 1.2),
          learningrate=NULL, lifesign = "none",
          lifesign.step = 1000, algorithm = "rprop+",
          err.fct = "sse", act.fct = "logistic",
          linear.output = TRUE, exclude = NULL,
          constant.weights = NULL, likelihood = FALSE)
The various parameters of the neuralnet function are described in the following table:
Parameter Description
formula This is the formula to converge.
data This is the data matrix of predictor values.
hidden This is the number of hidden neurons in each layer.
stepmax This is the maximum number of steps in each repetition. The default is 1e+05.
rep This is the number of repetitions. Let's generate the neural network as follows: > nn <- neuralnet(selector~mcv+alkphos+alamine+aspartate+glutamyl+drinks, data=bupa, linear.output=FALSE, hidden=2) We can see how the model was developed via the result.matrix variable in the following output: > nn$result.matrix                                      1 error                 100.005904355153 reached.threshold       0.005904330743 steps                 43.000000000000 Intercept.to.1layhid1   0.880621509705 mcv.to.1layhid1       -0.496298308044 alkphos.to.1layhid1     2.294158313786 alamine.to.1layhid1     1.593035613921 aspartate.to.1layhid1 -0.407602506759 glutamyl.to.1layhid1   -0.257862634340 drinks.to.1layhid1     -0.421390527261 Intercept.to.1layhid2   0.806928998059 mcv.to.1layhid2       -0.531926150470 alkphos.to.1layhid2     0.554627946150 alamine.to.1layhid2     1.589755874579 aspartate.to.1layhid2 -0.182482440722 glutamyl.to.1layhid2   1.806513419058 drinks.to.1layhid2     0.215346602241 Intercept.to.selector   4.485455617018 1layhid.1.to.selector   3.328527160621 1layhid.2.to.selector   2.616395644587 The process took 43 steps to come up with the neural network once the threshold was under 0.01 (0.005 in this case). You can see the relationships between the predictor values. Looking at the network developed, we can see the hidden layers of relationship among the predictor variables. For example, sometimes mcv combines at one ratio and on other times at another ratio, depending on its value. Let's load the neural network as follows: > plot(nn) Instance-based learning R programming has a nearest neighbor algorithm (k-NN). The k-NN algorithm takes the predictor values and organizes them so that a new observation is applied to the organization developed and the algorithm selects the result (prediction) that is most applicable based on nearness of the predictor values in the new observation. The nearest neighbor function is knn. The knn function call looks like this: knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE) The various parameters of the knn function are described in the following table: Parameter Description train This is the training data. test This is the test data. cl This is the factor of true classifications. k This is the Number of neighbors to consider. l This is the minimum vote for a decision. prob This is a Boolean flag to return proportion of winning votes. use.all This is a Boolean variable for tie handling. TRUE means use all votes of max distance I am using the auto MPG dataset in the example of using knn. First, we load the dataset : > data <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", na.string="?") > colnames(data) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model.year","origin","car.name") > summary(data)      mpg         cylinders     displacement     horsepower Min.   : 9.00  Min.   :3.000   Min.   : 68.0   150   : 22 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20 Median :23.00   Median :4.000   Median :148.5   88     : 19 Mean   :23.51   Mean   :5.455   Mean   :193.4   110   : 18 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100   : 17 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14                                                  (Other):288      weight     acceleration     model.year       origin     Min.   :1613   Min. : 8.00   Min.   :70.00   Min.   
:1.000 1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000 Median :2804   Median :15.50   Median :76.00   Median :1.000 Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573 3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000                                                                           car.name ford pinto   : 6 amc matador   : 5 ford maverick : 5 toyota corolla: 5 amc gremlin   : 4 amc hornet   : 4 (Other)       :369   There are close to 400 observations in the dataset. We need to split the data into a training set and a test set. We will use 75 percent for training. We use the createDataPartition function in the caret package to select the training rows. Then, we create a test dataset and a training dataset using the partitions as follows: > library(caret) > training <- createDataPartition(data$mpg, p=0.75, list=FALSE) > trainingData <- data[training,] > testData <- data[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) NAs introduced by coercion The error message means that some numbers in the dataset have a bad format. The bad numbers were automatically converted to NA values. Then the inclusion of the NA values caused the function to fail, as NA values are not expected in this function call. First, there are some missing items in the dataset loaded. We need to eliminate those NA values as follows: > completedata <- data[complete.cases(data),] After looking over the data several times, I guessed that the car name fields were being parsed as numerical data when there was a number in the name, such as Buick Skylark 320. I removed the car name column from the test and we end up with the following valid results; > drops <- c("car.name") > completeData2 <- completedata[,!(names(completedata) %in% drops)] > training <- createDataPartition(completeData2$mpg, p=0.75, list=FALSE) > trainingData <- completeData2[training,] > testData <- completeData2[-training,] > model <- knn(train=trainingData, test=testData, cl=trainingData$mpg) We can see the results of the model by plotting using the following command. However, the graph doesn't give us much information to work on. > plot(model) We can use a different kknn function to compare our model with the test data. I like this version a little better as you can plainly specify the formula for the model. Let's use the kknn function as follows: > library(kknn) > model <- kknn(formula = formula(mpg~.), train = trainingData, test = testData, k = 3, distance = 1) > fit <- fitted(model) > plot(testData$mpg, fit) > abline(a=0, b=1, col=3) I added a simple slope to highlight how well the model fits the training data. It looks like as we progress to higher MPG values, our model has a higher degree of variance. I think that means we are missing predictor variables, especially for the later model, high MPG series of cars. That would make sense as government mandate and consumer demand for high efficiency vehicles changed the mpg for vehicles. Here is the graph generated by the previous code: Ensemble learning Ensemble learning is the process of using multiple learning methods to obtain better predictions. For example, we could use a regression and k-NN, combine the results, and end up with a better prediction. We could average the results of both or provide heavier weight towards one or another of the algorithms, whichever appears to be a better predictor. 
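To make this concrete, here is a small sketch that reuses the trainingData and testData objects from the k-NN example above, averages the predictions of a linear model and a k-NN model, and compares the error of each approach on the held-out data. The equal 0.5/0.5 weights and the rmse helper function are arbitrary choices made for this sketch; they are not part of the original example.

library(kknn)
# Two individual models built from the same training data
lm_fit  <- lm(mpg ~ ., data = trainingData)
knn_fit <- kknn(formula = mpg ~ ., train = trainingData, test = testData, k = 3)
lm_pred  <- predict(lm_fit, newdata = testData)
knn_pred <- fitted(knn_fit)
# Combine the two sets of predictions with a simple (arbitrary) 50/50 weighting
ensemble_pred <- 0.5 * lm_pred + 0.5 * knn_pred
# Compare root mean squared error on the held-out data
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
c(lm = rmse(testData$mpg, lm_pred),
  knn = rmse(testData$mpg, knn_pred),
  ensemble = rmse(testData$mpg, ensemble_pred))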
Support vector machines We covered support vector machines (SVM), but I will run through an example here. As a reminder, SVM is concerned with binary data. We will use the spam dataset from Hewlett Packard (part of the kernlab package). First, let's load the data as follows: > library(kernlab) > data("spam") > summary(spam)      make           address           all             num3d         Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   : 0.00000 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.: 0.00000 Median :0.0000   Median : 0.000   Median :0.0000   Median : 0.00000 Mean   :0.1046   Mean   : 0.213   Mean   :0.2807   Mean   : 0.06542 3rd Qu.:0.0000   3rd Qu.: 0.000   3rd Qu.:0.4200   3rd Qu.: 0.00000 Max.   :4.5400   Max.   :14.280   Max.   :5.1000   Max.   :42.81000 … There are 58 variables with close to 5000 observations, as shown here: > table(spam$type) nonspam   spam    2788   1813 Now, we break up the data into a training set and a test set as follows: > index <- 1:nrow(spam) > testindex <- sample(index, trunc(length(index)/3)) > testset <- spam[testindex,] > trainingset <- spam[-testindex,] Now, we can produce our SVM model using the svm function. The svm function looks like this: svm(formula, data = NULL, ..., subset, na.action =na.omit, scale = TRUE) The various parameters of the svm function are described in the following table: Parameter Description formula This is the formula model data This is the dataset subset This is the subset of the dataset to be used na.action This contains what action to take with NA values scale This determines whether to scale the data Let's use the svm function to produce a SVM model as follows: > library(e1071) > model <- svm(type ~ ., data = trainingset, method = "C-classification", kernel = "radial", cost = 10, gamma = 0.1) > summary(model) Call: svm(formula = type ~ ., data = trainingset, method = "C-classification",    kernel = "radial", cost = 10, gamma = 0.1) Parameters:    SVM-Type: C-classification SVM-Kernel: radial        cost: 10      gamma: 0.1 Number of Support Vectors: 1555 ( 645 910 ) Number of Classes: 2 Levels: nonspam spam We can test the model against our test dataset and look at the results as follows: > pred <- predict(model, testset) > table(pred, testset$type) pred     nonspam spam nonspam     891 104 spam         38 500 Note, the e1071 package is not compatible with the current version of R. Given its usefulness I would expect the package to be updated to support the user base. So, using SVM, we have a 90 percent ((891+500) / (891+104+38+500)) accuracy rate of prediction. Bayesian learning With Bayesian learning, we have an initial premise in a model that is adjusted with new information. We can use the MCMCregress method in the MCMCpack package to use Bayesian regression on learning data and apply the model against test data. Let's load the MCMCpack package as follows: > install.packages("MCMCpack") > library(MCMCpack) We are going to be using the transplant data on transplants available at http://lib.stat.cmu.edu/datasets/stanford. (The dataset on the site is part of the web page, so I copied into a local CSV file.) The data shows expected transplant success factor, the actual transplant success factor, and the number of transplants over a time period. So, there is a good progression over time as to the success of the program. We can read the dataset as follows: > transplants <- read.csv("transplant.csv") > summary(transplants)    expected         actual       transplants   Min.   : 0.057   Min.   
: 0.000   Min.   : 1.00
 1st Qu.: 0.722   1st Qu.: 0.500   1st Qu.: 9.00
 Median : 1.654   Median : 2.000   Median : 18.00
 Mean   : 2.379   Mean   : 2.382   Mean   : 27.83
 3rd Qu.: 3.402   3rd Qu.: 3.000   3rd Qu.: 40.00
 Max.   :12.131   Max.   :18.000   Max.   :152.00
We use Bayesian regression against the data; note that we are modifying the model as we progress with new information using the MCMCregress function. The MCMCregress function looks like this:
MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000,
   thin = 1, verbose = 0, seed = NA, beta.start = NA,
   b0 = 0, B0 = 0, c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
   marginal.likelihood = c("none", "Laplace", "Chib95"), ...)
The various parameters of the MCMCregress function are described in the following table:
Parameter Description
formula This is the formula of the model
data This is the dataset to be used for the model
… These are the additional parameters for the function
Let's use Bayesian regression against the data as follows:
> model <- MCMCregress(expected ~ actual + transplants, data=transplants)
> summary(model)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:
               Mean     SD Naive SE Time-series SE
(Intercept) 0.00484 0.08394 0.0008394     0.0008388
actual     0.03413 0.03214 0.0003214     0.0003214
transplants 0.08238 0.00336 0.0000336     0.0000336
sigma2     0.44583 0.05698 0.0005698     0.0005857
2. Quantiles for each variable:
               2.5%     25%     50%     75%   97.5%
(Intercept) -0.15666 -0.05216 0.004786 0.06092 0.16939
actual     -0.02841 0.01257 0.034432 0.05541 0.09706
transplants 0.07574 0.08012 0.082393 0.08464 0.08890
sigma2       0.34777 0.40543 0.441132 0.48005 0.57228
The plot of the model shows the posterior distribution of each coefficient, as shown in the following graph. Look at this in contrast to a simple regression, which produces a single point estimate for each coefficient.
> plot(model)
Random forests
Random forests is an algorithm that constructs a multitude of decision trees over the data and combines their outputs (a majority vote for classification, an average for regression) to produce the final result. We can use the randomForest function in the randomForest package for this. The randomForest function looks like this:
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
The various parameters of the randomForest function are described in the following table:
Parameter Description
formula This is the formula of the model
data This is the dataset to be used
subset This is the subset of the dataset to be used
na.action This is the action to take with NA values
For an example of random forest, we will use the spam data, as in the section Support vector machines. First, let's install and load the package as follows:
> install.packages("randomForest")
> library(randomForest)
Now, we will generate the model with the following command (this may take a while):
> fit <- randomForest(type ~ ., data=spam)
Let's look at the results to see how it went:
> fit
Call:
 randomForest(formula = type ~ ., data = spam)
               Type of random forest: classification
                     Number of trees: 500
No.
of variables tried at each split: 7        OOB estimate of error rate: 4.48% Confusion matrix:         nonspam spam class.error nonspam   2713   75 0.02690100 spam       131 1682 0.07225593 We can look at the relative importance of the data variables in the final model, as shown here: > head(importance(fit))        MeanDecreaseGini make           7.967392 address       12.654775 all           25.116662 num3d           1.729008 our           67.365754 over           17.579765 Ordering the data shows a couple of the factors to be critical to the determination. For example, the presence of the exclamation character in the e-mail is shown as a dominant indicator of spam mail: charExclamation   256.584207 charDollar       200.3655348 remove           168.7962949 free              142.8084662 capitalAve       137.1152451 capitalLong       120.1520829 your             116.6134519 Unsupervised learning With unsupervised learning, we do not have a target variable. We have a number of predictor variables that we look into to determine if there is a pattern. We will go over the following unsupervised learning techniques: Cluster analysis Density estimation Expectation-maximization algorithm Hidden Markov models Blind signal separation Cluster analysis Cluster analysis is the process of organizing data into groups (clusters) that are similar to each other. For our example, we will use the wheat seed data available at http://www.uci.edu, as shown here: > wheat <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt", sep="t") Let's look at the raw data: > head(wheat) X15.26 X14.84 X0.871 X5.763 X3.312 X2.221 X5.22 X1 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1 5 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1 6 14.69 14.49 0.8799 5.563 3.259 3.586 5.219 1 We need to apply column names so we can see the data better: > colnames(wheat) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove", "undefined") > head(wheat)    area perimeter compactness length width asymmetry groove undefined 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956         1 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825         1 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805         1 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175         1 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956         1 6 14.69     14.49     0.8799 5.563 3.259     3.586 5.219         1 The last column is not defined in the data description, so I am removing it: > wheat <- subset(wheat, select = -c(undefined) ) > head(wheat)    area perimeter compactness length width asymmetry groove 1 14.88     14.57     0.8811 5.554 3.333     1.018 4.956 2 14.29     14.09     0.9050 5.291 3.337     2.699 4.825 3 13.84     13.94     0.8955 5.324 3.379     2.259 4.805 4 16.14     14.99     0.9034 5.658 3.562     1.355 5.175 5 14.38     14.21     0.8951 5.386 3.312     2.462 4.956 6 14.69    14.49     0.8799 5.563 3.259     3.586 5.219 Now, we can finally produce the cluster using the kmeans function. 
The kmeans function looks like this: kmeans(x, centers, iter.max = 10, nstart = 1,        algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",                      "MacQueen"), trace=FALSE) The various parameters of the kmeans function are described in the following table: Parameter Description x This is the dataset centers This is the number of centers to coerce data towards … These are the additional parameters of the function Let's produce the cluster using the kmeans function: > fit <- kmeans(wheat, 5) Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1) Unfortunately, there are some rows with missing data, so let's fix this using the following command: > wheat <- wheat[complete.cases(wheat),] Let's look at the data to get some idea of the factors using the following command: > plot(wheat) If we try looking at five clusters, we end up with a fairly good set of clusters with an 85 percent fit, as shown here: > fit <- kmeans(wheat, 5) > fit K-means clustering with 5 clusters of sizes 29, 33, 56, 69, 15 Cluster means:      area perimeter compactness   length   width asymmetry   groove 1 16.45345 15.35310   0.8768000 5.882655 3.462517 3.913207 5.707655 2 18.95455 16.38879   0.8868000 6.247485 3.744697 2.723545 6.119455 3 14.10536 14.20143   0.8777750 5.480214 3.210554 2.368075 5.070000 4 11.94870 13.27000   0.8516652 5.229304 2.870101 4.910145 5.093333 5 19.58333 16.64600   0.8877267 6.315867 3.835067 5.081533 6.144400 Clustering vector: ... Within cluster sum of squares by cluster: [1] 48.36785 30.16164 121.63840 160.96148 25.81297 (between_SS / total_SS = 85.4 %) If we push to 10 clusters, the performance increases to 92 percent. Density estimation Density estimation is used to provide an estimate of the probability density function of a random variable. For this example, we will use sunspot data from Vincent arlbuck site. Not clear if sunspots are truly random. Let's load our data as follows: > sunspots <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.month.csv") > summary(sunspots)        X             time     sunspot.month   Min.   :   1   Min.   :1749   Min.   : 0.00 1st Qu.: 795   1st Qu.:1815   1st Qu.: 15.70 Median :1589   Median :1881   Median : 42.00 Mean   :1589   Mean   :1881   Mean   : 51.96 3rd Qu.:2383   3rd Qu.:1948   3rd Qu.: 76.40 Max.   :3177   Max.   :2014   Max.   :253.80 > head(sunspots) X     time sunspot.month 1 1 1749.000         58.0 2 2 1749.083         62.6 3 3 1749.167         70.0 4 4 1749.250         55.7 5 5 1749.333         85.0 6 6 1749.417        83.5 We will now estimate the density using the following command: > d <- density(sunspots$sunspot.month) > d Call: density.default(x = sunspots$sunspot.month) Data: sunspots$sunspot.month (3177 obs.); Bandwidth 'bw' = 7.916        x               y           Min.   :-23.75   Min.   :1.810e-07 1st Qu.: 51.58   1st Qu.:1.586e-04 Median :126.90   Median :1.635e-03 Mean   :126.90   Mean   :3.316e-03 3rd Qu.:202.22   3rd Qu.:5.714e-03 Max.   :277.55   Max.   :1.248e-02 A plot is very useful for this function, so let's generate one using the following command: > plot(d) It is interesting to see such a wide variation; maybe the data is pretty random after all. We can use the density to estimate additional periods as follows: > N<-1000 > sunspots.new <- rnorm(N, sample(sunspots$sunspot.month, size=N, replace=TRUE)) > lines(density(sunspots.new), col="blue") It looks like our density estimate is very accurate. 
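One way to go beyond the visual impression is to compare the simulated values against the original observations with a two-sample Kolmogorov-Smirnov test, and to overlay the estimated density on a histogram of the data. This is a quick check added here as a sketch; it is not part of the original example, and the ties in the sunspot counts will cause ks.test to issue a warning:

# Formal comparison of the simulated sample with the observed monthly counts
ks.test(sunspots$sunspot.month, sunspots.new)
# Visual check: histogram of the observed data with the estimated density overlaid
hist(sunspots$sunspot.month, breaks = 50, freq = FALSE,
     main = "Monthly sunspot numbers", xlab = "Sunspots per month")
lines(d, col = "red")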
Expectation-maximization Expectation-maximization (EM) is an unsupervised clustering approach that adjusts the data for optimal values. When using EM, we have to have some preconception of the shape of the data/model that will be targeted. This example reiterates the example on the Wikipedia page, with comments. The example tries to model the iris species from the other data points. Let's load the data as shown here: > iris <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data") > colnames(iris) <- c("SepalLength","SepalWidth","PetalLength","PetalWidth","Species") > modelName = "EEE" Each observation has sepal length, width, petal length, width, and species, as shown here: > head(iris) SepalLength SepalWidth PetalLength PetalWidth     Species 1         5.1       3.5         1.4       0.2 Iris-setosa 2         4.9       3.0         1.4       0.2 Iris-setosa 3         4.7       3.2         1.3       0.2 Iris-setosa 4         4.6       3.1         1.5       0.2 Iris-setosa 5         5.0       3.6         1.4       0.2 Iris-setosa 6         5.4       3.9         1.7       0.4 Iris-setosa We are estimating the species from the other points, so let's separate the data as follows: > data = iris[,-5] > z = unmap(iris[,5]) Let's set up our mstep for EM, given the data, categorical data (z) relating to each data point, and our model type name: > msEst <- mstep(modelName, data, z) We use the parameters defined in the mstep to produce our model, as shown here: > em(modelName, data, msEst$parameters) $z                [,1]         [,2]         [,3] [1,] 1.000000e+00 4.304299e-22 1.699870e-42 … [150,] 8.611281e-34 9.361398e-03 9.906386e-01 $parameters$pro [1] 0.3333333 0.3294048 0.3372619 $parameters$mean              [,1]     [,2]     [,3] SepalLength 5.006 5.941844 6.574697 SepalWidth 3.418 2.761270 2.980150 PetalLength 1.464 4.257977 5.538926 PetalWidth 0.244 1.319109 2.024576 $parameters$variance$d [1] 4 $parameters$variance$G [1] 3 $parameters$variance$sigma , , 1            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 , , 2 , , 3 … (there was little difference in the 3 sigma values) Covariance $parameters$variance$Sigma            SepalLength SepalWidth PetalLength PetalWidth SepalLength 0.26381739 0.09030470 0.16940062 0.03937152 SepalWidth   0.09030470 0.11251902 0.05133876 0.03082280 PetalLength 0.16940062 0.05133876 0.18624355 0.04183377 PetalWidth   0.03937152 0.03082280 0.04183377 0.03990165 $parameters$variance$cholSigma             SepalLength SepalWidth PetalLength PetalWidth SepalLength -0.5136316 -0.1758161 -0.32980960 -0.07665323 SepalWidth   0.0000000 0.2856706 -0.02326832 0.06072001 PetalLength   0.0000000 0.0000000 -0.27735855 -0.06477412 PetalWidth   0.0000000 0.0000000 0.00000000 0.16168899 attr(,"info") iterations       error 4.000000e+00 1.525131e-06 There is quite a lot of output from the em function. The highlights for me were the three sigma ranges were the same and the error from the function was very small. So, I think we have a very good estimation of species using just the four data points. Hidden Markov models The hidden Markov models (HMM) is the idea of observing data assuming it has been produced by a Markov model. The problem is to discover what that model is. I am using the Python example on Wikipedia for HMM. 
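The initHMM and forward functions used below come, to the best of my knowledge, from the HMM package on CRAN, so it needs to be installed and loaded first:

# The HMM functions used in this section live in the HMM package
install.packages("HMM")
library(HMM)

The same package also provides a viterbi function if you want to recover the most likely sequence of hidden states rather than the forward probabilities.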
For an HMM, we need states (assumed to be hidden from observer), symbols, transition matrix between states, emission (output) states, and probabilities for all. The Python information presented is as follows: states = ('Rainy', 'Sunny') observations = ('walk', 'shop', 'clean') start_probability = {'Rainy': 0.6, 'Sunny': 0.4} transition_probability = {    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } emission_probability = {    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},    } trans <- matrix(c('Rainy', : {'Rainy': 0.7, 'Sunny': 0.3},    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},    } We convert these to use in R for the initHmm function by using the following command: > hmm <- initHMM(c("Rainy","Sunny"), c('walk', 'shop', 'clean'), c(.6,.4), matrix(c(.7,.3,.4,.6),2), matrix(c(.1,.4,.5,.6,.3,.1),3)) > hmm $States [1] "Rainy" "Sunny" $Symbols [1] "walk" "shop" "clean" $startProbs Rainy Sunny 0.6   0.4 $transProbs        to from   Rainy Sunny Rainy   0.7   0.4 Sunny   0.3   0.6 $emissionProbs        symbols states walk shop clean Rainy 0.1 0.5   0.3 Sunny 0.4 0.6   0.1 The model is really a placeholder for all of the setup information needed for HMM. We can then use the model to predict based on observations, as follows: > future <- forward(hmm, c("walk","shop","clean")) > future        index states         1         2         3 Rainy -2.813411 -3.101093 -4.139551 Sunny -1.832581 -2.631089 -5.096193 The result is a matrix of probabilities. For example, it is more likely to be Sunny when we observe walk. Blind signal separation Blind signal separation is the process of identifying sources of signals from a mixed signal. Primary component analysis is one method of doing this. An example is a cocktail party where you are trying to listen to one speaker. For this example, I am using the decathlon dataset in the FactoMineR package, as shown here: > library(FactoMineR) > data(decathlon) Let's look at the data to get some idea of what is available: > summary(decathlon) 100m           Long.jump     Shot.put       High.jump Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850 1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920 Median :10.98   Median :7.30   Median :14.57   Median :1.950 Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977 3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040 Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150 400m           110m.hurdle       Discus       Pole.vault   Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.200 1st Qu.:48.93   1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500 Median :49.40   Median :14.48   Median :44.41   Median :4.800 Mean   :49.62   Mean   :14.61 Mean   :44.33   Mean   :4.762 3rd Qu.:50.30   3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920 Max.   :53.20   Max.   :15.67   Max.   :51.65   Max.   :5.400 Javeline       1500m           Rank           Points   Min.   :50.31   Min.   :262.1   Min.   : 1.00   Min.   :7313 1st Qu.:55.27   1st Qu.:271.0   1st Qu.: 6.00   1st Qu.:7802 Median :58.36   Median :278.1   Median :11.00   Median :8021 Mean   :58.32   Mean   :279.0   Mean   :12.12   Mean   :8005 3rd Qu.:60.89   3rd Qu.:285.1   3rd Qu.:18.00   3rd Qu.:8122 Max.   :70.52   Max.   :317.0   Max.   :28.00   Max.   
:8893    Competition Decastar:13 OlympicG:28 The output looks like performance data from a series of events at a track meet: > head(decathlon)        100m   Long.jump Shot.put High.jump 400m 110m.hurdle Discus SEBRLE 11.04     7.58   14.83     2.07 49.81       14.69 43.75 CLAY   10.76     7.40   14.26     1.86 49.37       14.05 50.72 KARPOV 11.02     7.30   14.77     2.04 48.37       14.09 48.95 BERNARD 11.02     7.23   14.25     1.92 48.93       14.99 40.87 YURKOV 11.34     7.09   15.19     2.10 50.42       15.31 46.26 WARNERS 11.11     7.60   14.31     1.98 48.68       14.23 41.10        Pole.vault Javeline 1500m Rank Points Competition SEBRLE       5.02   63.19 291.7   1   8217   Decastar CLAY         4.92   60.15 301.5   2   8122   Decastar KARPOV       4.92   50.31 300.2   3   8099   Decastar BERNARD       5.32   62.77 280.1   4   8067   Decastar YURKOV       4.72   63.44 276.4   5   8036   Decastar WARNERS       4.92   51.77 278.1   6   8030   Decastar Further, this is performance of specific individuals in track meets. We run the PCA function by passing the dataset to use, whether to scale the data or not, and the type of graphs: > res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T) This produces two graphs: Individual factors map Variables factor map The individual factors map lays out the performance of the individuals. For example, we see Karpov who is high in both dimensions versus Bourginon who is performing badly (on the left in the following chart): The variables factor map shows the correlation of performance between events. For example, doing well in the 400 meters run is negatively correlated with the performance in the long jump; if you did well in one, you likely did well in the other as well. Here is the variables factor map of our data: Questions Factual Which supervised learning technique(s) do you lean towards as your "go to" solution? Why are the density plots for Bayesian results off-center? When, how, and why? How would you decide on the number of clusters to use? Find a good rule of thumb to decide the number of hidden layers in a neural net. Challenges Investigate other blind signal separation techniques, such as ICA. Use other methods, such as poisson, in the rpart function (especially if you have a natural occurring dataset). Summary In this article, we looked into various methods of machine learning, including both supervised and unsupervised learning. With supervised learning, we have a target variable we are trying to estimate. With unsupervised, we only have a possible set of predictor variables and are looking for patterns. In supervised learning, we looked into using a number of methods, including decision trees, regression, neural networks, support vector machines, and Bayesian learning. In unsupervised learning, we used cluster analysis, density estimation, hidden Markov models, and blind signal separation. Resources for Article: Further resources on this subject: Machine Learning in Bioinformatics [article] Data visualization [article] Introduction to S4 Classes [article]

Packt
19 Dec 2014
9 min read
Save for later

Navigation Mesh Generation

In this article by Curtis Bennett and Dan Violet Sagmiller, authors of the book Unity AI Programming Essentials, we will learn about navigation meshes in Unity. Navigation mesh generation controls how AI characters are able to travel around a game level and is one of the most important topics in game AI. In this article, we will provide an overview of navigation meshes and look at the algorithm for generating them. Then, we'll look at different options of customizing our navigation meshes better. To do this, we will be using RAIN 2.1.5, a popular AI plugin for Unity by Rival Theory, available for free at http://rivaltheory.com/rain/download/. In this article, you will learn about: How navigation mesh generation works and the algorithm behind it Advanced options for customizing navigation meshes Creating advanced navigation meshes with RAIN (For more resources related to this topic, see here.) An overview of a navigation mesh To use navigation meshes, also referred to as NavMeshes, effectively the first things we need to know are what exactly navigation meshes are and how they are created. A navigation mesh is a definition of the area an AI character could travel to in a level. It is a mesh, but it is not intended to be rendered or seen by the player, instead it is used by the AI system. A NavMesh usually does not cover all the area in a level (if it did we wouldn't need one) since it's just the area a character can walk. The mesh is also almost always a simplified version of the geometry. For instance, you could have a cave floor in a game with thousands of polygons along the bottom showing different details in the rock, but for the navigation mesh the areas would just be a handful of very large polys giving a simplified view of the level. The purpose of navigation mesh is to provide this simplified representation to the rest of the AI system a way to find a path between two points on a level for a character. This is its purpose; let's discuss how they are created. It used to be a common practice in the games industry to create navigation meshes manually. A designer or artist would take the completed level geometry and create one using standard polygon mesh modelling tools and save it out. As you might imagine, this allowed for nice, custom, efficient meshes, but was also a big time sink, since every time the level changed the navigation mesh would need to be manually edited and updated. In recent years, there has been more research in automatic navigation mesh generation. There are many approaches to automatic navigation mesh generation, but the most popular is Recast, originally developed and designed by Mikko Monomen. Recast takes in level geometry and a set of parameters defining the character, such as the size of the character and how big of steps it can take, and then does a multipass approach to filter and create the final NavMesh. The most important phase of this is voxelizing the level based on an inputted cell size. This means the level geometry is divided into voxels (cubes) creating a version of the level geometry where everything is partitioned into different boxes called cells. Then the geometry in each of these cells is analyzed and simplified based on its intersection with the sides of the boxes and is culled based on things such as the slope of the geometry or how big a step height is between geometry. This simplified geometry is then merged and triangulated to make a final navigation mesh that can be used by the AI system. 
The source code and more information on the original C++ implementation of Recast is available at https://github.com/memononen/recastnavigation. Advanced NavMesh parameters Now that we understand how navigation mesh generations works, let's look at the different parameters you can set to generate them in more detail. We'll look at how to do these with RAIN: Open Unity and create a new scene and a floor and some blocks for walls. Download RAIN from http://rivaltheory.com/rain/download/ and import it into your scene. Then go to RAIN | Create Navigation Mesh. Also right-click on the RAIN menu and choose Show Advanced Settings. The setup should look something like the following screenshot: Now let's look at some of the important parameters: Size: This is the overall size of the navigation mesh. You'll want the navigation mesh to cover your entire level and use this parameter instead of trying to scale up the navigation mesh through the Scale transform in the Inspector window. For our demo here, set the Size parameter to 20. Walkable Radius: This is an important parameter to define the character size of the mesh. Remember, each mesh will be matched to the size of a particular character, and this is the radius of the character. You can visualize the radius for a character by adding a Unity Sphere Collider script to your object (by going to Component | Physics | Sphere Collider) and adjusting the radius of the collider. Cell Size: This is also a very important parameter. During the voxel step of the Recast algorithm, this sets the size of the cubes to inspect the geometry. The smaller the size, the more detailed and finer mesh, but longer the processing time for Recast. A large cell size makes computation fast but loses detail. For example, here is a NavMesh from our demo with a cell size of 0.01: You can see the finer detail here. Here is the navigation mesh generated with a cell size of 0.1: Note the difference between the two screenshots. In the former, walking through the two walls lower down in our picture is possible, but in the latter with a larger cell size, there is no path even though the character radius is the same. Problems like this become greater with larger cell sizes. The following is a navigation mesh with a cell size of 1: As you can see, the detail becomes jumbled and the mesh itself becomes unusable. With such differing results, the big question is how large should a cell size be for a level? The answer is that it depends on the required result. However, one important consideration is that as the processing time to generate one is done during development and not at runtime even if it takes several minutes to generate a good mesh, it can be worth it to get a good result in the game. Setting a small cell size on a large level can cause mesh processing to take a significant amount of time and consume a lot of memory. It is a good practice to save the scene before attempting to generate a complex navigation mesh. The Size, Walkable Radius, and Cell Size parameters are the most important parameters when generating the navigation mesh, but there are more that are used to customize the mesh further: Max Slope: This is the largest slope that a character can walk on. This is how much a piece of geometry that is tilted can still be walked on. If you take the wall and rotate it, you can see it is walkable: The preceding is a screenshot of a walkable object with slope. Step Height: This is how high a character can step from one object to another. 
For example, if you have steps between two blocks, as shown in the following screenshot, this would define how far in height the blocks can be apart and whether the area is still considered walkable: This is a screenshot of the navigation mesh with step height set to connect adjacent blocks. Walkable Height: This is the vertical height that is needed for the character to walk. For example, in the previous illustration, the second block is not walkable underneath because of the walkable height. If you raise it to a least one unit off the ground and set the walkable height to 1, the area underneath would become walkable:   You can see a screenshot of the navigation mesh with walkable height set to allow going under the higher block. These are the most important parameters. There are some other parameters related to the visualization and to cull objects. We will look at culling more in the next section. Culling areas Being able to set up areas as walkable or not is an important part of creating a level. To demo this, let's divide the level into two parts and create a bridge between the two. Take our demo and duplicate the floor and pull it down. Then transform one of the walls to a bridge. Then, add two other pieces of geometry to mark areas that are dangerous to walk on, like lava. Here is an example setup: This is a basic scene with a bridge to cross. If you recreate the navigation mesh now, all of the geometry will be covered and the bridge won't be recognized. To fix this, you can create a new tag called Lava and tag the geometry under the bridge with it. Then, in the navigation meshes' RAIN component, add Lava to the unwalkable tags. If you then regenerate the mesh, only the bridge is walkable. This is a screenshot of a navigation mesh areas under bridge culled: Using layers and the walkable tag you can customize navigation meshes. Summary Navigation meshes are an important part of game AI. In this article, we looked at the different parameters to customize navigation meshes. We looked at things such as setting the character size and walkable slopes and discussed the importance of the cell size parameter. We then saw how to customize our mesh by tagging different areas as not walkable. This should be a good start for designing navigation meshes for your games. Resources for Article: Further resources on this subject: Components in Unity [article] Enemy and Friendly AIs [article] Introduction to AI [article]

Packt
17 Dec 2014
24 min read
Save for later

Mastering Splunk: Lookups

In this article, by James Miller, author of the book Mastering Splunk, we will discuss Splunk lookups and workflows. The topics that will be covered in this article are as follows: The value of a lookup Design lookups File lookups Script lookups (For more resources related to this topic, see here.) Lookups Machines constantly generate data, usually in a raw form that is most efficient for processing by machines, but not easily understood by "human" data consumers. Splunk has the ability to identify unique identifiers and/or result or status codes within the data. This gives you the ability to enhance the readability of the data by adding descriptions or names as new search result fields. These fields contain information from an external source such as a static table (a CSV file) or the dynamic result of a Python command or a Python-based script. Splunk's lookups can use information within returned events or time information to determine how to add other fields from your previously defined external data sources. To illustrate, here is an example of a Splunk static lookup that: Uses the Business Unit value in an event Matches this value with the organization's business unit name in a CSV file Adds the definition to the event (as the Business Unit Name field) So, if you have an event where the Business Unit value is equal to 999999, the lookup will add the Business Unit Name value as Corporate Office to that event. More sophisticated lookups can: Populate a static lookup table from the results of a report. Use a Python script (rather than a lookup table) to define a field. For example, a lookup can use a script to return a server name when given an IP address. Perform a time-based lookup if your lookup table includes a field value that represents time. Let's take a look at an example of a search pipeline that creates a table based on IBM Cognos TM1 file extractions: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) |Table Month, "Business Unit", RFCST The following table shows the results generated:   Now, add the lookup command to our search pipeline to have Splunk convert Business Unit into Business Unit Name: sourcetype=csv 2014 "Current Forecast" "Direct" "513500" |rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200"as "Activity" | eval RFCST= round(FCST) |lookup BUtoBUName BU as "Business Unit" OUTPUT BUName as "Business Unit Name" | Table Month, "Business Unit", "Business Unit Name", RFCST The lookup command in our Splunk search pipeline will now add Business Unit Name in the results table:   Configuring a simple field lookup In this section, we will configure a simple Splunk lookup. Defining lookups in Splunk Web You can set up a lookup using the Lookups page (in Splunk Web) or by configuring stanzas in the props.conf and transforms.conf files. Let's take the easier approach first and use the Splunk Web interface. Before we begin, we need to establish our lookup table that will be in the form of an industry standard comma separated file (CSV). Our example is one that converts business unit codes to a more user-friendly business unit name. 
For example, we have the following information: Business unit code Business unit name 999999 Corporate office VA0133SPS001 South-western VA0133NLR001 North-east 685470NLR001 Mid-west In the events data, only business unit codes are included. In an effort to make our Splunk search results more readable, we want to add the business unit name to our results table. To do this, we've converted our information (shown in the preceding table) to a CSV file (named BUtoBUName.csv):   For this example, we've kept our lookup table simple, but lookup tables (files) can be as complex as you need them to be. They can have numerous fields (columns) in them. A Splunk lookup table has a few requirements, as follows: A table must contain a minimum of two columns Each of the columns in the table can have duplicate values You should use (plain) ASCII text and not non-UTF-8 characters Now, from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, we can select Lookup table files:   From the Lookup table files page, we can add our new lookup file (BUtoBUName.csv):   By clicking on the New button, we see the Add new page where we can set up our file by doing the following: Select a Destination app (this is a drop-down list and you should select Search). Enter (or browse to) our file under Upload a lookup file. Provide a Destination filename. Then, we click on Save:   Once you click on Save, you should receive the Successfully saved "BUtoBUName" in search" message:   In the previous screenshot, the lookup file is saved by default as private. You will need to adjust permissions to allow other Splunk users to use it. Going back to the Lookups page, we can select Lookup definitions to see the Lookup definitions page:   In the Lookup definitions page, we can click on New to visit the Add new page (shown in the following screenshot) and set up our definition as follows: Destination app: The lookup will be part of the Splunk search app Name: Our file is BUtoBUName Type: Here, we will select File-based Lookup file: The filename is ButoBUName.csv, which we uploaded without the .csv suffix Again, we should see the Successfully saved "BUtoBUName" in search message:   Now, our lookup is ready to be used: Automatic lookups Rather than having to code for a lookup in each of your Splunk searches, you have the ability to configure automatic lookups for a particular source type. To do this from Splunk Web, we can click on Settings and then select Lookups:   From the Lookups page, click on Automatic lookups:   In the Automatic lookups page, click on New:   In the Add New page, we will fill in the required information to set up our lookup: Destination app: For this field, some options are framework, launcher, learned, search, and splunk_datapreview (for our example, select search). Name: This provide a user-friendly name that describes this automatic lookup. Lookup table: This is the name of the lookup table you defined with a CSV file (discussed earlier in this article). Apply to: This is the type that you want this automatic lookup to apply to. The options are sourcetype, source, or host (I've picked sourcetype). Named: This is the name of the type you picked under Apply to. I want my automatic search to apply for all searches with the sourcetype of csv. Lookup input fields: This is simple in my example. In my lookup table, the field to be searched on will be BU and the = field value will be the field in the event results that I am converting; in my case, it was the field 650693NLR001. 
- Lookup output fields: This will be the field in the lookup table that I am using to convert to, which in my example is BUName, and I want to call it Business Unit Name, so this becomes the = field value.
- Overwrite field values: This is a checkbox where you can tell Splunk to overwrite existing values in your output fields (I checked it).

The Add new page

The Splunk Add new page (shown in the following screenshot) is where you enter the lookup information (detailed in the previous section):

Once you have entered your automatic lookup information, you can click on Save and you will receive the Successfully saved "Business Unit to Business Unit Name" in search message:

Now, we can use the lookup in a search. For example, you can run a search with sourcetype=csv, as follows:

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month" Actual as "Version" "FY 2012" as Year 650693NLR001 as "Business Unit" 100000 as "FCST" "09997_Eliminations Co 2" as "Account" "451200" as "Activity" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", Month, RFCST

Notice in the following screenshot that Business Unit Name is converted to the user-friendly values from our lookup table, and we didn't have to add the lookup command to our search pipeline:

Configuration files

In addition to using the Splunk web interface, you can define and configure lookups using the following files:

- props.conf
- transforms.conf

To set up a lookup with these files (rather than using Splunk web), we can perform the following steps:

Edit transforms.conf to define the lookup table. The first step is to edit the transforms.conf configuration file to add the new lookup reference. Although the file exists in the Splunk default folder ($SPLUNK_HOME/etc/system/default), you should edit the file in $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/ (if the file doesn't exist here, create it).

Whenever you edit a Splunk .conf file, always edit a local version, keeping the original (system directory version) intact.

In the current version of Splunk, there are two types of lookup tables: static and external. Static lookups use CSV files, and external (which are dynamic) lookups use Python scripting. You have to decide if your lookup will be static (in a file) or dynamic (using script commands). If you are using a file, you'll use filename; if you are going to use a script, you use external_cmd (both will be set in the transforms.conf file). You can also limit the number of matching entries to apply to an event by setting the max_matches option (this tells Splunk to use the first <integer> entries, in file order). I've decided to leave the default for max_matches, so my transforms.conf file looks like the following:

[butobugroup]
filename = butobugroup.csv

Next, and this step is optional, edit props.conf to apply your lookup table automatically. For both static and external lookups, you stipulate the fields you want to match in the configuration file and the output from the lookup table that you defined in your transforms.conf file. It is okay to have multiple field lookups defined in one source lookup definition, but each lookup should have its own unique lookup name; for example, if you have multiple tables, you can name them LOOKUP-table01, LOOKUP-table02, and so on, or something perhaps more easily understood.
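For instance, a single source type stanza in props.conf carrying two lookups under this naming convention might look like the following sketch; the class names LOOKUP-table01 and LOOKUP-table02 are arbitrary labels chosen here for illustration, and the field mappings simply reuse the BUtoBUName and butobugroup tables from this article:

[csv]
LOOKUP-table01 = BUtoBUName BU AS 650693NLR001 OUTPUT BUName
LOOKUP-table02 = butobugroup bu AS 650693NLR001 OUTPUT bugroup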
If you add a lookup to your props.conf file, this lookup is automatically applied to all events from searches that have matching source types (again, as mentioned earlier, if your automatic lookup is very slow, it will also impact the speed of your searches).

Restart Splunk to see your changes.

Implementing a lookup using configuration files – an example

To illustrate the use of configuration files in order to implement an automatic lookup, let's use a simple example. Once again, we want to convert a field from a unique identification code for an organization's business unit to a more user-friendly descriptive name called BU Group. What we will do is match the field bu in a lookup table butobugroup.csv with a field in our events. Then, add the bugroup (description) to the returned events. The following shows the contents of the butobugroup.csv file:

bu, bugroup
999999, leadership-group
VA0133SPS001, executive-group
650914FAC002, technology-group

You can put this file into $SPLUNK_HOME/etc/apps/<app_name>/lookups/ and carry out the following steps:

1. Put the butobugroup.csv file into $SPLUNK_HOME/etc/apps/search/lookups/, since we are using the search app.
2. As we mentioned earlier, we edit the transforms.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. We add the following two lines:

[butobugroup]
filename = butobugroup.csv

3. Next, as mentioned earlier in this article, we edit the props.conf file located at either $SPLUNK_HOME/etc/system/local/ or $SPLUNK_HOME/etc/apps/<app_name>/local/. Here, we add the following two lines:

[csv]
LOOKUP-check = butobugroup bu AS 650693NLR001 OUTPUT bugroup

4. Restart the Splunk server. You can (assuming you are logged in as an admin or have admin privileges) restart the Splunk server through the web interface by going to Settings, then selecting System and finally Server controls.

Now, you can run a search for sourcetype=csv (as shown here):

sourcetype=csv 2014 "Current Forecast" "Direct" "513500" | rename May as "Month", 650693NLR001 as "Business Unit" 100000 as "FCST" | eval RFCST= round(FCST) | Table "Business Unit", "Business Unit Name", bugroup, Month, RFCST

You will see that the field bugroup can be returned as part of your event results:

Populating lookup tables

Of course, you can create CSV files from external systems (or perhaps even manually), but from time to time, you might have the opportunity to create lookup CSV files (tables) from event data using Splunk. A handy command to accomplish this is outputcsv (which is covered in detail later in this article). The following is a simple example of creating a CSV file from Splunk event data that can be used for a lookup table:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The results are shown in the following screenshot:

Of course, the output table isn't quite usable, since the results have duplicates.
Therefore, we can rewrite the Splunk search pipeline, introducing the dedup command (as shown here):

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

Then, we can examine the results (now with more desirable results):

Handling duplicates with dedup

This command allows us to set the number of duplicate events to be kept based on the values of a field (in other words, we can use this command to drop duplicates from our event results for a selected field). The event returned for the dedup field will be the first event found (if you provide a number directly after the dedup command, it will be interpreted as the number of duplicate events to keep; if you don't specify a number, dedup keeps only the first occurring event and removes all consecutive duplicates).

The dedup command also lets you sort by a field or a list of fields. This will remove all the duplicates and then sort the results based on the specified sort-by field. Adding a sort in conjunction with the dedup command can affect the performance, as Splunk performs the dedup operation and then sorts the results as a final step. Here is a search command using dedup:

sourcetype=csv "Current Forecast" "Direct" | rename 650693NLR001 as "Business Unit" | dedup "Business Unit" sortby bugroup | Table "Business Unit", "Business Unit Name", bugroup | outputcsv splunk_master

The result of the preceding command is shown in the following screenshot:

Now, we have our CSV lookup file (outputcsv splunk_master) generated and ready to be used:

Look for your generated output file in $SPLUNK_HOME/var/run/splunk.

Dynamic lookups

With a Splunk static lookup, your search reads through a file (a table) that was created or updated prior to executing the search. With dynamic lookups, the file is created at the time the search executes. This is possible because Splunk has the ability to execute an external command or script as part of your Splunk search. At the time of writing this book, Splunk only directly supports Python scripts for external lookups.

If you are not familiar with Python, its implementation began in 1989, and it is a widely used general-purpose, high-level programming language, which is often used as a scripting language (but is also used in a wide range of non-scripting contexts).

Keep in mind that any external resources (such as a file) or scripts that you want to use with your lookup will need to be copied to a location where Splunk can find them. These locations are:

- $SPLUNK_HOME/etc/apps/<app_name>/bin
- $SPLUNK_HOME/etc/searchscripts

The following sections describe the process of using the dynamic lookup example script that ships with Splunk (external_lookup.py).

Using Splunk Web

Just like with static lookups, Splunk makes it easy to define a dynamic or external lookup using the Splunk web interface. First, click on Settings and then select Lookups:

On the Lookups page, we can select Lookup table files to define a CSV file that contains the input file for our Python script. In the Add new page, we enter the following information:

- Destination app: For this field, select Search
- Upload a lookup file: Here, you can browse to the filename (my filename is dnsLookup.csv)
- Destination filename: Here, enter dnslookup

The Add new page is shown in the following screenshot:

Now, click on Save.
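Before defining the lookup itself, it may help to see what this input file can look like. The following is a minimal sketch of a dnsLookup.csv file; at its simplest, the file only needs a header row naming the two fields the script works with, and the host row shown here is a purely hypothetical placeholder:

host,ip
webserver01.example.com,

The actual requirements for this file are described next.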
The lookup file (shown in the following screenshot) is a text CSV file that needs to (at a minimum) contain the two field names that the Python (py) script accepts as arguments, in this case, host and ip. As mentioned earlier, this file needs to be copied to $SPLUNK_HOME/etc/apps/<app_name>/bin.

Next, from the Lookups page, select Lookup definitions and then click on New. This is where you define your external lookup. Enter the following information:

- Type: For this, select External (as this lookup will run an external script)
- Command: For this, enter external_lookup.py host ip (this is the name of the py script and its two arguments)
- Supported fields: For this, enter host, ip (this indicates the two script input field names)

The following screenshot describes a new lookup definition:

Now, click on Save.

Using configuration files instead of Splunk Web

Again, just like with static lookups in Splunk, dynamic lookups can also be configured in the Splunk transforms.conf file:

[myLookup]
external_cmd = external_lookup.py host ip
external_type = python
fields_list = host, ip
max_matches = 200

Let's learn more about the terms here:

- [myLookup]: This is the report stanza.
- external_cmd: This is the actual runtime command definition. Here, it executes the Python (py) script external_lookup, which requires two arguments (or parameters), host and ip.
- external_type (optional): This indicates that this is a Python script. Although this is an optional entry in the transforms.conf file, it's a good habit to include this for readability and support.
- fields_list: This lists all the fields supported by the external command or script, delimited by a comma and space.

The next step is to modify the props.conf file, as follows:

[mylookup]
LOOKUP-rdns = dnslookup host ip OUTPUT ip

After updating the Splunk configuration files, you will need to restart Splunk.

External lookups

The external lookup example given uses a Python (py) script named external_lookup.py, which is a DNS lookup script that can return an IP address for a given host name or a host name for a provided IP address.

Explanation

The lookup table field in this example is named ip, so Splunk will mine all of the IP addresses found in the indexed logs' events and add the values of ip from the lookup table into the ip field in the search events. We can notice the following:

- If you look at the py script, you will notice that the example uses an MS Windows supported socket.gethostbyname_ex(host) function
- The host field has the same name in the lookup table and the events, so you don't need to do anything else

Consider the following search command:

sourcetype=tm1* | lookup dnslookup host | table host, ip

When you run this command, Splunk uses the lookup table to pass the values for the host field as a CSV file (the text CSV file we looked at earlier) into the external command script. The py script then outputs the results (with both the host and ip fields populated) and returns them to Splunk, which populates the ip field in a result table:

Output of the py script with both the host and ip fields populated

Time-based lookups

If your lookup table has a field value that represents time, you can use the time field to set up a Splunk fields lookup. As mentioned earlier, the Splunk transforms.conf file can be modified to add a lookup stanza.
For example, the following screenshot shows a file named MasteringDCHP.csv:

You can add the following code to the transforms.conf file:

[MasteringDCHP]
filename = MasteringDCHP.csv
time_field = TimeStamp
time_format = %d/%m/%y %H:%M:%S %p
max_offset_secs = <integer>
min_offset_secs = <integer>

The file parameters are defined as follows:

- [MasteringDCHP]: This is the report stanza
- filename: This is the name of the CSV file to be used as the lookup table
- time_field: This is the field in the file that contains the time information and is to be used as the timestamp
- time_format: This indicates what format the time field is in
- max_offset_secs and min_offset_secs: These indicate the minimum/maximum amount of offset time for an event to occur after a lookup entry

Be careful with the preceding values; the offset relates to the timestamp in your lookup (CSV) file. Setting a tight (small) offset range might reduce the effectiveness of your lookup results!

The last step will be to restart Splunk.

An easier way to create a time-based lookup

Again, it's a lot easier to use the Splunk Web interface to set up our lookup. Here is the step-by-step process:

1. From Settings, select Lookups, and then Lookup table files.
2. In the Lookup table files page, click on New, configure our lookup file, and then click on Save. You should receive the Successfully saved "MasterDHCP" in search message.
3. Next, select Lookup definitions and from this page, click on New.
4. In the Add new page, we define our lookup table with the following information:
- Destination app: For this, select search from the drop-down list
- Name: For this, enter MasterDHCP (this is the name you'll use in your lookup)
- Type: For this, select File-based (as this lookup table definition is a CSV file)
- Lookup file: For this, select the name of the file to be used from the drop-down list (ours is MasteringDCHP)
- Configure time-based lookup: Check this checkbox
- Name of time field: For this, enter TimeStamp (this is the field name in our file that contains the time information)
- Time format: For this, enter the string to describe to Splunk the format of our time field (our field uses this format: %d%m%y %H%M%S)
5. You can leave the rest blank and click on Save. You should receive the Successfully saved "MasterDHCP" in search message.

Now, we are ready to try our search:

sourcetype=dh* | Lookup MasterDHCP IP as "IP" | table DHCPTimeStamp, IP, UserId | sort UserId

The following screenshot shows the output:

Seeing double?

Lookup table definitions are indicated with the attribute LOOKUP-<class> in the Splunk configuration file, props.conf, or in the web interface under Settings | Lookups | Lookup definitions. If you use the Splunk Web interface (which we've demonstrated throughout this article) to set up or define your lookup table definitions, Splunk will prevent you from creating duplicate table names, as shown in the following screenshot:

However, if you define your lookups using the configuration settings, it is important to try and keep your table definition names unique. If you do give the same name to multiple lookups, the following rules apply:

- If you have defined lookups with the same stanza (that is, using the same host, source, or source type), the first defined lookup in the configuration file wins and overrides all others.
- If lookups have different stanzas but overlapping events, the following logic is used by Splunk: events that match the host get the host lookup, events that match the sourcetype get the sourcetype lookup, and events that match both only get the host lookup.

It is a proven practice recommendation to make sure that all of your lookup stanzas have unique names.

Command roundup

This section lists several important Splunk commands you will use when working with lookups.

The lookup command

The Splunk lookup command is used to manually invoke field lookups using a Splunk lookup table that was previously defined. You can use Splunk Web (or the transforms.conf file) to define your lookups. If you do not specify OUTPUT or OUTPUTNEW, all fields in the lookup table (excluding the lookup match field) will be used by Splunk as output fields. Conversely, if OUTPUT is specified, the output lookup fields will overwrite existing fields, and if OUTPUTNEW is specified, the lookup will not be performed for events in which the output fields already exist.

For example, if you have a lookup table specified as iptousername with (at least) two fields, IP and UserId, then for each event, Splunk will look up the value of the field IP in the table and, for any entries that match, the value of the UserId field in the lookup table will be written to the field user_name in the event. The query is as follows:

... | lookup iptousername IP as "IP" OUTPUT UserId as user_name

Always strive to perform lookups after any reporting commands in your search pipeline, so that the lookup only needs to match the results of the reporting command and not every individual event.

The inputlookup and outputlookup commands

The inputlookup command allows you to load search results from a specified static lookup table. It reads in a specified CSV filename (or a table name as specified by the stanza name in transforms.conf). If the append=t (that is, true) option is added, the data from the lookup file is appended to the current set of results (instead of replacing it). The outputlookup command then lets us write the resulting events to a specified static lookup table (as long as this output lookup table is defined). So, here is an example of reading in the MasterDHCP lookup table (as specified in transforms.conf) and writing these event results to the lookup table definition NewMasterDHCP:

| inputlookup MasterDHCP | outputlookup NewMasterDHCP

After running the preceding command, we can see the following output:

Note that we can add the append=t option to the search in the following fashion:

| inputlookup MasterDHCP.csv | inputlookup NewMasterDHCP.csv append=t

The inputcsv and outputcsv commands

The inputcsv command is similar to the inputlookup command in that it loads search results, but this command loads from a specified CSV file. The filename must refer to a relative path in $SPLUNK_HOME/var/run/splunk, and if the specified file does not exist and the filename did not have an extension, then a filename with a .csv extension is assumed. The outputcsv command lets us write our result events to a CSV file.
Here is an example where we read in a CSV file named splunk_master.csv, search for the text phrase FPM, and then write any matching events to a CSV file named FPMBU.csv: | inputcsv splunk_master.csv | search "Business Unit Name"="FPM" | outputcsv FPMBU.csv The following screenshot shows the results from the preceding search command:   The following screenshot shows the resulting file generated as a result of the preceding command:   Here is another example where we read in the same CSV file (splunk_master.csv) and write out only events from 51 to 500: | inputcsv splunk_master start=50 max=500 Events are numbered starting with zero as the first entry (rather than 1). Summary In this article, we defined Splunk lookups and discussed their value. We also went through the two types of lookups, static and dynamic, and saw detailed, working examples of each. Various Splunk commands typically used with the lookup functionality were also presented. Resources for Article: Further resources on this subject: Working with Apps in Splunk [article] Processing Tweets with Apache Hive [article] Indexes [article]
Adding Graded Activities

Packt
16 Dec 2014
9 min read
This article by Rebecca Barrington, author of Moodle Gradebook Second Edition, teaches you how to add assignments and set up how they will be graded, including how to use our custom scales and add outcomes for grading.

(For more resources related to this topic, see here.)

As with all content within Moodle, we need to select Turn editing on within the course in order to be able to add resources and activities. All graded activities are added through the Add an activity or resource text available within each section of a Moodle course. This text can be found in the bottom right of each section after editing has been turned on.

There are a number of items that can be graded and will appear within the Gradebook. Assignments are the most feature-rich of all the graded activities and have many options available in order to customize how assessments can be graded. They can be used to provide assessment information for students, store grades, and provide feedback. When setting up the assignment, we can choose for students to submit their work electronically (either through file submission or online text), or we can review the assessment offline and use only the grade and feedback features of the assignment.

Adding assignments

There are many options within the assignments, and throughout this article we will set up a number of different assignments and you'll learn about some of their most useful features and options. Let's have a go at creating a range of assignments that are ready for grading.

Creating an assignment with a scale

The first assignment that we will add will make use of the PMD scale:

1. Click on the Turn editing on button.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 1).
5. In the Description box, provide some assignment details.
6. In the Availability section, we need to disable the date options. We will not make use of these options, but they can be very useful. To disable the options, click on the tick next to the Enable text. However, details of these options have been provided for future reference.

The Allow submissions from section is mostly relevant when the assignment will be submitted electronically, as students won't be able to submit their work until the date and time indicated here.

The Due date section can be used to indicate when the assignment needs to be submitted by. If students electronically submit their assignment after the date and time indicated here, the submission date and time will be shown in red in order to notify the teacher that it was submitted past the due date.

The Cut off date section enables teachers to set an extension period after the due date where late submissions will continue to be accepted.

7. In the Submission types section, ensure that the File submissions checkbox is enabled by adding a tick there. This will enable students to submit their assignment electronically. There are additional options that we can choose as well. With Maximum number of uploaded files, we can indicate how many files a student can upload. Keep this as 1. We can also determine the Maximum submission size option for each file using the drop-down list shown in the following screenshot:

8. Within the Feedback types section, ensure that all options under the Feedback types section are selected.

Feedback comments enables us to provide written feedback along with the grade.

Feedback files enables us to upload a file in order to provide feedback to a student.
Offline grading worksheet will provide us with the option to download a .csv file that contains core information about the assignment, and this can be used to add grades and feedback while working offline. This completed .csv file can be uploaded and the grades will be added to the assignments within the Gradebook.

In the Submission settings section, we have options related to how students will submit their assignment and how they will reattempt submission if required.

If Require students click submit button is left as No, students will upload their assignment and it will be available to the teacher for grading. If this option is changed to Yes, students can upload their assignment, but the teacher will see that it is in draft form. Students will click on Submit to indicate that it is ready to be graded.

Require that students accept the submission statement will provide students with a statement that they need to agree to when they submit their assignment. The default statement is This assignment is my own work, except where I have acknowledged the use of works of other people.

The submission statement can be changed by a site administrator by navigating to Site administration | Plugins | Activity modules | Assignment settings.

The Attempts reopened drop-down list provides options for the status of the assignment after it has been graded. Students will only be able to resubmit their work when it is open. Therefore, this setting will control when and if students are able to submit another version of their assignment. The options available to us are:

- Never: This option should be selected if students will not be able to submit another piece of work.
- Manually: This will enable anyone who has the role of a teacher to choose to reopen a submission that enables a student to submit their work again.
- Automatically until pass: This option works when a pass grade is set within the Gradebook. After grading, if the student is awarded the minimum pass grade or higher, the submission will remain closed in order to prevent any changes to the submission. However, if the assignment is graded lower than the assigned pass grade, the submission will automatically reopen in order to enable the student to submit the assignment again.
- Maximum attempts: The maximum attempts allowed for this assignment will limit the number of times an assignment is reopened. For example, if this option is set to 3, then a student will only be able to submit their assignment three times. After they have submitted their assignment for a third time, they will not be allowed to submit it again. The default is unlimited, but it can be changed by clicking on the drop-down list.

9. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change the Attempts reopened to Automatically until passed.
10. Within the Grade section, navigate to Grade | Type | Scale and choose the PMD scale.
11. Select Use marking workflow by changing the drop-down list to Yes.

Use marking workflow is a new feature of Moodle 2.6 that enables the grading process to go through a range of stages in order to indicate that the marking is in progress or is complete, is being reviewed, or is ready for release to students.

12. Click on Save and return to course.

Creating an online assignment with a number grade

The next assignment that we will create will have an online text option and a maximum grade of 20.
The following steps show you how to create an online assignment with a number grade:

1. Enable editing by clicking on Turn editing on.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 2).
5. In the Description box, provide the assignment details.
6. In the Submission types section, ensure that Online text has a tick next to it. This will enable students to type directly into Moodle. When choosing this option, we can also set a maximum word limit by clicking on the tick box next to the Enable text. After enabling this option, we can add a number to the textbox. For this assignment, enable a word limit of 200 words.
7. When using online text submission, we have an additional feedback option within the Feedback types section. Under the Comment inline text, click on No and switch to Yes to enable yourself to add written feedback for students within the written text submitted by students.
8. In the Submission settings section, ensure that the options for Require students click submit button and Require that students accept the submission statement are set to Yes. Also, change Attempts reopened to Automatically until passed.
9. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 20.
10. Click on Save and return to course.

Creating an assignment including outcomes

The next assignment that we will create will add some of the Outcomes:

1. Enable editing by clicking on Turn editing on.
2. Click on Add an activity or resource.
3. Click on Assignment and then click on Add.
4. In the Assignment name box, type in the name of the assignment (such as Task 3).
5. In the Description box, provide the assignment details.
6. In the Submission types box, ensure that Online text and File submissions are selected. Set Maximum number of uploaded files to 2.
7. In the Submission settings section, ensure that the options for Require students to click submit button and Require that students accept the submission statement are amended to Yes. Change Attempts reopened to Manually.
8. Within the Grades section, navigate to Grade | Type | Point and ensure that Maximum points is set to 100.
9. In the Outcomes section, choose the outcomes as Evidence provided and Criteria 1 met.
10. Scroll to the bottom of the screen and click on Save and return to course.

Summary

In this article, we added a range of assignments that made use of number and scale grades as well as added outcomes to an assignment.

Resources for Article:

Further resources on this subject:

Moodle for Online Communities [article]
What's New in Moodle 2.0 [article]
Moodle 2.0: What's New in Add a Resource [article]

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

(For more resources related to this topic, see here.)

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function with the model parameters as an argument to the loss function:

$$\mathcal{L}(w) = \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda\, J(w)$$

The penalty function is completely independent from the training set {x, y}. The penalty term is usually expressed as a power function of the norm of the model parameters (or weights) wd. For a model of D dimensions, the generic Lp-norm is defined as follows:

$$\lVert w \rVert_p = \left(\sum_{d=1}^{D} \lvert w_d \rvert^p\right)^{1/p}$$

Notation

Regularization applies to the parameters or weights associated with an observation. In order to be consistent with our notation, w0 being the intercept value, the regularization applies to the parameters w1, ..., wd.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning

The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.

The L1 regularization applied to linear regression is known as the Lasso regularization. The ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularizations differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The various differences between the two regularizations are as follows:

- Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse datasets, L2 has a smaller estimation error than L1.
- Feature selection: L1 is more effective in reducing the regression weights for features with high values than L2. Therefore, L1 is a reliable feature selection tool.
- Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model) for the same reason it is more appropriate for selecting features.
- Computation: L2 is conducive to a more efficient computation model. The summation of the loss function and the L2 penalty ||w||^2 is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is the summation of |wi|, and is therefore not differentiable.

Terminology

The ridge regression is sometimes called the penalized least squares regression.
The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term, and its loss function can be written as follows:

$$\mathcal{L}(w) = \sum_{i=1}^{n} \left(y_i - f(x_i \mid w)\right)^2 + \lambda \lVert w \rVert_2^2$$

The computation of the ridge regression parameters requires the resolution of a system of linear equations similar to the linear regression. The matrix representation of the ridge regression closed form is as follows:

$$\hat{w} = \left(X^T X + \lambda I\right)^{-1} X^T y$$

Here, I is the identity matrix, and the solution uses the QR decomposition, as shown here:

$$X^T X + \lambda I = Q\,R$$

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as its ordinary least squares counterpart. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in the Apache Commons Math library and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],
                                   val y: DblVector,
                                   val lambda: Double)
      extends AbstractMultipleLinearRegression
      with PipeOperator[Array[T], Double] {

  private var qr: QRDecomposition = null
  private[this] val model: Option[RegressionModel] = …
  …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class trains the model. The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1)
2. Compute the weights using the calculateBeta method defined in the base class (line 2)
3. Return the tuple of the regression weights, calculateBeta, and the residuals, calculateResiduals

private val model: Option[RegressionModel] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix with the lambda factor in the diagonal has to be added to the matrix to be decomposed (line 4).

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)  //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i =>
    xtx.setEntry(i, i, xtx.getEntry(i, i) + lambda))  //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class:

override protected def calculateBeta: RealVector =
  qr.getSolver().solve(getY())

Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features.
The implementation of the extraction of observations is identical to that of the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
                    .drop(1)
                    .zip(_price.take(_price.size - 1))
                    .map(z => z._1 - z._2)) //2
val data = volatility.get
                    .zip(volume.get)
                    .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size - 1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4

regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data, the ETF daily price, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value, rss, is plotted for different values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases. The curve seems to be reaching a minimum around λ = 1. The case of λ = 0 corresponds to the least squares regression.

Next, let's plot the RSS value for λ varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, in 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, the overfitting gets more expensive, and therefore, the RSS value increases.

The regression weights can simply be output as follows:

regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1) - price(t), is plotted as λ = 0. The predicted values for λ = 0.8 are very similar to the original data; they follow the pattern of the original data with a reduction of the large variations (peaks and troughs). The predicted values for λ = 5 correspond to a smoothed dataset. The pattern of the original data is preserved but the magnitude of the price variation is significantly reduced.

The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and the F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Resources for Article:

Further resources on this subject:

Differences in style between Java and Scala code [Article]
Dependency Management in SBT [Article]
Introduction to MapReduce [Article]