Data | 0 articles | Tech News, Tutorials & Expert Insights

21 Aug 2014

9 min read

Using Canvas and D3

21 Aug 2014

This article by Pablo Navarro Castillo, author of Mastering D3.js, includes that most of the time, we use D3 to create SVG-based charts and visualizations. If the number of elements to render is huge, or if we need to render raster images, it can be more convenient to render our visualizations using the HTML5 canvas element. In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. (For more resources related to this topic, see here.) Creating figures with canvas The HTML canvas element allows you to create raster graphics using JavaScript. It was first introduced in HTML5. It enjoys more widespread support than SVG, and can be used as a fallback option. Before diving deeper into integrating canvas and D3, we will construct a small example with canvas. The canvas element should have the width and height attributes. This alone will create an invisible figure of the specified size: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px"></canvas> If the browser supports the canvas element, it will ignore any element inside the canvas tags. On the other hand, if the browser doesn't support the canvas, it will ignore the canvas tags, but it will interpret the content of the element. This behavior provides a quick way to handle the lack of canvas support: <!— Canvas Element --> <canvas id="canvas-demo" width="650px" height="60px">  <img src="img/fallback-img.png" width="650" height="60"></img> </canvas> If the browser doesn't support canvas, the fallback image will be displayed. Note that unlike the <img> element, the canvas closing tag (</canvas>) is mandatory. To create figures with canvas, we don't need special libraries; we can create the shapes using the canvas API: <script> // Graphic Variables var barw = 65, barh = 60; // Append a canvas element, set its size and get the node. var canvas = document.getElementById('canvas-demo'); // Get the rendering context. var context = canvas.getContext('2d'); // Array with colors, to have one rectangle of each color. var color = ['#5c3566', '#6c475b', '#7c584f', '#8c6a44', '#9c7c39', '#ad8d2d', '#bd9f22', '#cdb117', '#ddc20b', '#edd400']; // Set the fill color and render ten rectangles. for (var k = 0; k < 10; k += 1) { // Set the fill color. context.fillStyle = color[k]; // Create a rectangle in incremental positions. context.fillRect(k * barw, 0, barw, barh); } </script> We use the DOM API to access the canvas element with the canvas-demo ID and to get the rendering context. Then we set the color using the fillStyle method, and use the fillRect canvas method to create a small rectangle. Note that we need to change fillStyle every time or all the following shapes will have the same color. The script will render a series of rectangles, each one filled with a different color, shown as follows: A graphic created with canvas Canvas uses the same coordinate system as SVG, with the origin in the top-left corner, the horizontal axis augmenting to the right, and the vertical axis augmenting to the bottom. Instead of using the DOM API to get the canvas node, we could have used D3 to create the node, set its attributes, and created scales for the color and position of the shapes. Note that the shapes drawn with canvas don't exists in the DOM tree; so, we can't use the usual D3 pattern of creating a selection, binding the data items, and appending the elements if we are using canvas. Creating shapes Canvas has fewer primitives than SVG. In fact, almost all the shapes must be drawn with paths, and more steps are needed to create a path. To create a shape, we need to open the path, move the cursor to the desired location, create the shape, and close the path. Then, we can draw the path by filling the shape or rendering the outline. For instance, to draw a red semicircle centered in (325, 30) and with a radius of 20, write the following code: // Create a red semicircle. context.beginPath(); context.fillStyle = '#ff0000'; context.moveTo(325, 30); context.arc(325, 30, 20, Math.PI / 2, 3 * Math.PI / 2); context.fill(); The moveTo method is a bit redundant here, because the arc method moves the cursor implicitly. The arguments of the arc method are the x and y coordinates of the arc center, the radius, and the starting and ending angle of the arc. There is also an optional Boolean argument to indicate whether the arc should be drawn counterclockwise. A basic shape created with the canvas API is shown in the following screenshot: Integrating canvas and D3 We will create a small network chart using the force layout of D3 and canvas instead of SVG. To make the graph looks more interesting, we will randomly generate the data. We will generate 250 nodes sparsely connected. The nodes and links will be stored as the attributes of the data object: // Number of Nodes var nNodes = 250, createLink = false; // Dataset Structure var data = {nodes: [],links: []}; We will append nodes and links to our dataset. We will create nodes with a radius attribute randomly assigning it a value of either 2 or 4 as follows: // Iterate in the nodes for (var k = 0; k < nNodes; k += 1) { // Create a node with a random radius. data.nodes.push({radius: (Math.random() > 0.3) ? 2 : 4}); // Create random links between the nodes. } We will create a link with probability of 0.1 only if the difference between the source and target indexes are less than 8. The idea behind this way to create links is to have only a few connections between the nodes: // Create random links between the nodes. for (var j = k + 1; j < nNodes; j += 1) { // Create a link with probability 0.1 createLink = (Math.random() < 0.1) && (Math.abs(k - j) < 8); if (createLink) { // Append a link with variable distance between the nodes data.links.push({ source: k, target: j, dist: 2 * Math.abs(k - j) + 10 }); } } We will use the radius attribute to set the size of the nodes. The links will contain the distance between the nodes and the indexes of the source and target nodes. We will create variables to set the width and height of the figure: // Figure width and height var width = 650, height = 300; We can now create and configure the force layout. As we did in the previous section, we will set the charge strength to be proportional to the area of each node. This time, we will also set the distance between the links, using the linkDistance method of the layout: // Create and configure the force layout var force = d3.layout.force() .size([width, height]) .nodes(data.nodes) .links(data.links) .charge(function(d) { return -1.2 * d.radius * d.radius; }) .linkDistance(function(d) { return d.dist; }) .start(); We can create a canvas element now. Note that we should use the node method to get the canvas element, because the append and attr methods will both return a selection, which don't have the canvas API methods: // Create a canvas element and set its size. var canvas = d3.select('div#canvas-force').append('canvas') .attr('width', width + 'px') .attr('height', height + 'px') .node(); We get the rendering context. Each canvas element has its own rendering context. We will use the '2d' context, to draw two-dimensional figures. At the time of writing this, there are some browsers that support the webgl context; more details are available at https://developer.mozilla.org/en-US/docs/Web/WebGL/Getting_started_with_WebGL. Refer to the following '2d' context: // Get the canvas context. var context = canvas.getContext('2d'); We register an event listener for the force layout's tick event. As canvas doesn't remember previously created shapes, we need to clear the figure and redraw all the elements on each tick: force.on('tick', function() { // Clear the complete figure. context.clearRect(0, 0, width, height); // Draw the links ... // Draw the nodes ... }); The clearRect method cleans the figure under the specified rectangle. In this case, we clean the entire canvas. We can draw the links using the lineTo method. We iterate through the links, beginning a new path for each link, moving the cursor to the position of the source node, and by creating a line towards the target node. We draw the line with the stroke method: // Draw the links data.links.forEach(function(d) { // Draw a line from source to target. context.beginPath(); context.moveTo(d.source.x, d.source.y); context.lineTo(d.target.x, d.target.y); context.stroke(); }); We iterate through the nodes and draw each one. We use the arc method to represent each node with a black circle: // Draw the nodes data.nodes.forEach(function(d, i) { // Draws a complete arc for each node. context.beginPath(); context.arc(d.x, d.y, d.radius, 0, 2 * Math.PI, true); context.fill(); }); We obtain a constellation of disconnected network graphs, slowly gravitating towards the center of the figure. Using the force layout and canvas to create a network chart is shown in the following screenshot: We can think that to erase all the shapes and redraw each shape again and again could have a negative impact on the performance. In fact, sometimes it's faster to draw the figures using canvas, because this way the browser doesn't have to manage the DOM tree of the SVG elements (but it still have to redraw them if the SVG elements are changed). Summary In this article, we will learn how to use the force layout of D3 and the canvas element to create an animated network visualization with random data. Resources for Article: Further resources on this subject: Interacting with Data for Dashboards [Article] Kendo UI DataViz – Advance Charting [Article] Visualizing my Social Graph with d3.js [Article]

0
0
4742

article-image-recommender-systems-dissected

Packt

21 Aug 2014

4 min read

Recommender systems dissected

Packt

21 Aug 2014

4 min read

In this article by Rik Van Bruggen, the author of Learning Neo4j, we will take a look at so-called recommender systems. Such systems, broadly speaking, consist of two elementary parts: (For more resources related to this topic, see here.) A pattern discovery system: This is the system that somehow figures out what would be a useful recommendation for a particular target group. This discovery can be done in many different ways, but in general we see three ways to do so: A business expert who thoroughly understands the domain of the graph database application will use this understanding to determine useful recommendations. For example, the supervisor of a "do-it-yourself" retail outlet would understand a particular pattern. Suppose that if someone came in to buy multiple pots of paint, they would probably also benefit from getting a gentle recommendation for a promotion of high-end brushes. The store has that promotion going on right then, so the recommendation would be very timely. This process would be a discovery process where the pattern that the business expert has discovered would be applied and used in graph databases in real time as part of a sophisticated recommendation. A visual discovery of a specific pattern in the graph representation of the business domain. We have found that in many different projects, business users have stumbled upon these kinds of patterns while using graph visualizations to look at their data. Specific patterns emerge, unexpected relations jump out, or, in more advanced visualization solutions, specific clusters of activity all of a sudden become visible and require further investigation. Graph databases such as Neo4j, and the visualization solutions that complement it, can play a wonderfully powerful role in this process. An algorithmic discovery of a pattern in a dataset of the business domain using machine learning algorithms to uncover previously unknown patterns in the data. Typically, these processes require an iterative, raw number-crunching approach that has also been used on non-graph data formats in the past. It remains part-art part-science at this point, but can of course yield interesting insights if applied correctly. An overview of recommender systems A pattern application system: All recommender systems will be more or less successful based on not just the patterns that they are able to discover, but also based on the way that they are able to apply these patterns in business applications. We can look at different types of applications for these patterns: Batch-oriented applications: Some applications of these patterns are not as time-critical as one would expect. It does not really matter in any material kind of way if a bulk e-mail with recommendations, or worse, a printed voucher with recommended discounted products, gets delivered to the customer or prospect at 10 am or 11 am. Batch solutions can usually cope with these kinds of requests, even if they do so in the most inefficient way. Real-time oriented applications: Some pattern applications simply have to be delivered in real time, in between a web request and a web response, and cannot be precalculated. For these types of systems, which typically use more complex database queries in order to match for the appropriate recommendation to make, graph databases such as Neo4j are a fantastic tool to have. We will illustrate this going forward. With this classification behind us, we will look at an example dataset and some example queries to make this topic come alive. Using a graph model for recommendations We will be using a very specific data model for our recommender system. All we have changed is that we added a couple of products and brands to the model, and inserted some data into the database correspondingly. In total, we added the following: Ten products Three product brands Fifty relationships between existing person nodes and the mentioned products, highlighting that these persons bought these products These are the products and brands that we added: Adding products and brands to the dataset The following diagram shows the resulting model: In Neo4j, that model will look something like the following: A dataset like this one, while of course a broad simplification, offers us some interesting possibilities for a recommender system. Let's take a look at some queries that could really match this use case, and that would allow us to either visually or in real time exploit the data in this dataset in a product recommendation application. Summary This article elaborately dissected the so-called recommender systems. Resources for Article: Further resources on this subject: Working with a Neo4j Embedded Database [article] Comparative Study of NoSQL Products [article] Getting Started with CouchDB and Futon [article]

0
0
1853

article-image-amazon-dynamodb-modelling-relationships-error-handling

Packt

20 Aug 2014

7 min read

Amazon DynamoDB - Modelling relationships, Error handling

Packt

20 Aug 2014

7 min read

In this article by Tanmay Deshpande author of Mastering DynamoDB we are going to revise our concepts about the DynamoDB and will try to discover more about its features and implementation. (For more resources related to this topic, see here.) Amazon DynamoDB is a fully managed, cloud hosted, NoSQL database. It provides fast and predictable performance with the ability to scale seamlessly. It allows you to store and retrieve any amount of data , serving any level of network traffic without having any operational burden. DynamoDB gives numerous other advantages like consistent and predictable performance, flexible data modeling and durability. With just few clicks on Amazon Web Service console, you would be able to create your own DynamoDB database table, scale up or scale down provision throughput without taking down your application even for a millisecond. DynamoDB uses solid state disks (SSD) to store the data which confirms the durability of the work you are doing. It also automatically replicates the data across other AWS Availability Zones which provides built-in high availability and reliability. Before we start discussion details about DynamoDB let's try to understand what NoSQL databases are and when to choose DynamoDB over RDBMS. With the rise in data volume, variety and velocity, RDBMS were neither designed to cope up with the scale and flexibility challenges developers are facing to build the modern day applications nor were they able to take advantage of cheap commodity hardware. Also we need to provide schema before we start adding data which was restricting developers from making their application flexible. On the other hand, NoSQL databases are fast, provide flexible schema operations and do the effective use of cheap storage. Considering all these things, NoSQL is becoming popular very quickly amongst the developer community. But one has to be very cautious about when to go for NoSQL and when to stick to RDBMS. Sticking to relational databases makes sense when you know the schema is more over static, strong consistency is must and the data is not going to be that big in volume. But when you want to build an application which is Internet scalable, schema is more likely to get evolved over the time, the storage is going to be really big and the operations involved are ok to be eventually consistent then NoSQL is the way to go. There are various types of NoSQL databases. Following is the list of NoSQL database types and popular examples Document Store – MongoDB, CouchDB, MarkLogic etc. Column Store – Hbase, Cassandra etc. Key Value Store – DynamoDB, Azure, Redis etc. Graph Databases – Neo4J, DEX etc. Most of these NoSQL solutions are open source except few like DynamoDB, Azure which are available as service over Internet. DynamoDB being a key-value store indexes data only upon primary keys and one need to go through primary key to access certain attribute. Let's start learning more about DynamoDB by having a look at its history. DynamoDB's History Amazon's ecommerce platform had a huge set of decoupled services developed and managed individually and each and every service had API to be used and consumed for others. Earlier each service had direct database access which was a major bottleneck. In terms of scalability, Amazon's requirements were more than any third party vendors could provide at that time. DynamoDB was built to address Amazon's high availability, extreme scalability and durability needs. Earlier Amazon used to store its production data in relational databases and services had been provided for all required operations. But later they realized that most of the services access data only through its primary key and need not use complex queries to fetch the required data, plus maintaining these RDBMS systems required high end hardware and skilled personnel. So to overcome all such issues, Amazon's engineering team built a NoSQL database which addresses all above mentioned issues. In 2007, Amazon released one research paper on Dynamo which was combining the best of ideas from database and key value store worlds which was inspiration for many open source projects at the time. Cassandra, Voldemort and Riak were one of them. You can find the above mentioned paper at http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf Even though Dynamo had great features which would take care of all engineering needs, it was not widely accepted at that time in Amazon itself as it was not a fully managed service. When Amazon released S3 and SimpleDB, engineering teams were quite excited to adopt those compared to Dynamo as DynamoDB was bit expensive at that time due to SSDs. So finally after rounds of improvement, Amazon released Dynamo as cloud based service and since then it is one the most widely used NoSQL database. Before releasing to public cloud in 2012, DynamoDB has been the core storage service for Amazon's e-commerce platform, which was started shopping cart and session management service. Any downtime or degradation in performance had major impact on Amazon's business and any financial impact was strictly not acceptable and DynamoDB proved itself to be the best choice at the end. Now let's try to understand in more detail about DynamoDB. What is DynamoDB? DynamoDB is a fully managed, Internet scalable and easily administered, cost effective NoSQL database. It is a part of database as a service offering pane of Amazon Web Services. The above diagram shows how Amazon offers its various cloud services and where DynamoDB is exactly placed. AWS RDS is relational database as a service over Internet from Amazon while Simple DB and DynamoDB are NoSQL database as services. Both SimpleDB and DynamoDB are fully managed, non-relational services. DynamoDB is build considering fast, seamless scalability, and high performance. It runs on SSDs to provide faster responses and has no limits on request capacity and storage. It automatically partitions your data throughout the cluster to meet the expectations while in SimpleDB we have storage limit of 10 GB and can only take limited requests per second. Also in SimpleDB we have to manage our own partitions. So depending upon your need you have to choose the correct solution. To use DynamoDB, the first and foremost requirement is having an AWS account. Through easy to use AWS management console, you can directly create new tables, providing necessary information and can start loading data into the tables in few minutes. Modelling Relationships Like any other database, modeling relationships is quite interesting even though DynamoDB is a NoSQL database. Most of the time, people get confused on how do I model the relationships between various tables, in this section, we are trying make an effort to simplify this problem. To understand the relationships better, let's try to understand that using our example of Book Store where we have entities like Book, Author, Publisher, and so on. One to One In this type of relationship, one entity record of a table is related only one entity record of other table. In our book store application, we have BookInfo and Book-Details tables, BookInfo table can have short information about the book which can be used to display book information on web page whereas BookDetails table would be used when someone explicitly needs to see all the details of book. This design helps us keeping our system healthy as even if there are high request on one table, the other table would always be to up and running to fulfil what it is supposed to do. Following diagram shows how the table structure would look like. One to many In this type of relationship, one record from an entity is related to more than one record in another entity. In book store application, we can have Publisher Book Table which would keep information about the book and publisher relationship. Here we can have Publisher Id as hash key and Book Id as range key. Following diagram shows how a table structure would like. Many to many Many to many relationship means many records from an entity is related to many records from another entity. In case of book store application, we can say that a book can be authored by multiple authors and an author can write multiple books. In this we should use two tables with both and range keys.

0
0
3623

Packt

20 Aug 2014

19 min read

Importing Dynamic Data

Packt

20 Aug 2014

19 min read

In this article by Chad Adams, author of the book Learning Python Data Visualization, we will go over the finer points of pulling data from the Web using the Python language and its built-in libraries and cover parsing XML, JSON, and JSONP data. (For more resources related to this topic, see here.) Since we now have an understanding of how to work with the pygal library and building charts and graphics in general, this is the time to start looking at building an application using Python. In this article, we will take a look at the fundamentals of pulling data from the Web, parsing the data, and adding it to our code base and formatting the data into a useable format, and we will look at how to carry those fundamentals over to our Python code. We will also cover parsing XML and JSON data. Pulling data from the Web For many non-developers, it may seem like witchcraft that developers are magically able to pull data from an online resource and integrate that with an iPhone app, or a Windows Store app, or pull data to a cloud resource that is able to generate various versions of the data upon request. To be fair, they do have a general understanding; data is pulled from the Web and formatted to their app of choice. They just may not get the full background of how that process workflow happens. It's the same case with some developers as well—many developers mainly work on a technology that only works on a locked down environment, or generally, don't use the Internet for their applications. Again, they understand the logic behind it; somehow an RSS feed gets pulled into an application. In many languages, the same task is done in various ways, usually depending on which language is used. Let's take a look at a few examples using Packt's own news RSS feed, using an iOS app pulling in data via Objective-C. Now, if you're reading this and not familiar with Objective-C, that's OK, the important thing is that we have the inner XML contents of an XML file showing up in an iPhone application: #import "ViewController.h" @interfaceViewController () @property (weak, nonatomic) IBOutletUITextView *output; @end @implementationViewController - (void)viewDidLoad { [super viewDidLoad]; // Do any additional setup after loading the view, typically from a nib. NSURL *packtURL = [NSURLURLWithString:@"http://www.packtpub.com/rss.xml"]; NSURLRequest *request = [NSURLRequestrequestWithURL:packtURL]; NSURLConnection *connection = [[NSURLConnectionalloc] initWithRequest:requestdelegate:selfstartImmediately:YES]; [connection start]; } - (void)connection:(NSURLConnection *)connection didReceiveData:(NSData *)data { NSString *downloadstring = [[NSStringalloc] initWithData:dataencoding:NSUTF8StringEncoding]; [self.outputsetText:downloadstring]; } - (void)didReceiveMemoryWarning { [superdidReceiveMemoryWarning]; // Dispose of any resources that can be recreated. } @end Here, we can see in iPhone Simulator that our XML output is pulled dynamically through HTTP from the Web to our iPhone simulator. This is what we'll want to get started with doing in Python: The XML refresher Extensible Markup Language (XML) is a data markup language that sets a series of rules and hierarchy to a data group, which is stored as a static file. Typically, servers update these XML files on the Web periodically to be reused as data sources. XML is really simple to pick up as it's similar to HTML. You can start with the document declaration in this case: <?xml version="1.0" encoding="utf-8"?> Next, a root node is set. A node is like an HTML tag (which is also called a node). You can tell it's a node by the brackets around the node's name. For example, here's a node named root: <root></root> Note that we close the node by creating a same-named node with a backslash. We can also add parameters to the node and assign a value, as shown in the following root node: <root parameter="value"></root> Data in XML is set through a hierarchy. To declare that hierarchy, we create another node and place that inside the parent node, as shown in the following code: <root parameter="value"> <subnode>Subnode's value</subnode> </root> In the preceding parent node, we created a subnode. Inside the subnode, we have an inner value called Subnode's value. Now, in programmatical terms, getting data from an XML data file is a process called parsing. With parsing, we specify where in the XML structure we would like to get a specific value; for instance, we can crawl the XML structure and get the inner contents like this: /root/subnode This is commonly referred to as XPath syntax, a cross-language way of going through an XML file. For more on XML and XPath, check out the full spec at: http://www.w3.org/TR/REC-xml/ and here http://www.w3.org/TR/xpath/ respectively. RSS and the ATOM Really simple syndication (RSS) is simply a variation of XML. RSS is a spec that defines specific nodes that are common for sending data. Typically, many blog feeds include an RSS option for users to pull down the latest information from those sites. Some of the nodes used in RSS include rss, channel, item, title, description, pubDate, link, and GUID. Looking at our iPhone example in this article from the Pulling data from the Web section, we can see what a typical RSS structure entails. RSS feeds are usually easy to spot since the spec requires the root node to be named rss for it to be a true RSS file. In some cases, some websites and services will use .rss rather than .xml; this is typically fine since most readers for RSS content will pull in the RSS data like an XML file, just like in the iPhone example. Another form of XML is called ATOM. ATOM was another spec similar to RSS, but developed much later than RSS. Because of this, ATOM has more features than RSS: XML namespacing, specified content formats (video, or audio-specific URLs), support for internationalization, and multilanguage support, just to name a few. ATOM does have a few different nodes compared to RSS; for instance, the root node to an RSS feed would be <rss>. ATOM's root starts with <feed>, so it's pretty easy to spot the difference. Another difference is that ATOM can also end in .atom or .xml. For more on the RSS and ATOM spec, check out the following sites: http://www.rssboard.org/rss-specification http://tools.ietf.org/html/rfc4287 Understanding HTTP All these samples from the RSS feed of the Packt Publishing website show one commonality that's used regardless of the technology coded in, and that is the method used to pull down these static files is through the Hypertext Transfer Protocol (HTTP). HTTP is the foundation of Internet communication. It's a protocol with two distinct types of requests: a request for data or GET and a push of data called a POST. Typically, when we download data using HTTP, we use the GET method of HTTP in order to pull down the data. The GET request will return a string or another data type if we mention a specific type. We can either use this value directly or save to a variable. With a POST request, we are sending values to a service that handles any incoming values; say we created a new blog post's title and needed to add to a list of current titles, a common way of doing that is with URL parameters. A URL parameter is an existing URL but with a suffixed key-value pair. The following is a mock example of a POST request with a URL parameter: http://www.yourwebsite.com/blogtitles/?addtitle=Your%20New%20Title If our service is set up correctly, it will scan the POST request for a key of addtitle and read the value, in this case: Your New Title. We may notice %20 in our title for our request. This is an escape character that allows us to send a value with spaces; in this case, %20 is a placehoder for a space in our value. Using HTTP in Python The RSS samples from the Packt Publishing website show a few commonalities we use in programming when working in HTTP; one is that we always account for the possibility of something potentially going wrong with a connection and we always close our request when finished. Here's an example on how the same RSS feed request is done in Python using a built-in library called urllib2: #!/usr/bin/env python # -*- coding: utf-8 -*- import urllib2 try: #Open the file via HTTP. response = urllib2.urlopen('http://www.packtpub.com/rss.xml') #Read the file to a variable we named 'xml' xml = response.read() #print to the console. print(xml) #Finally, close our open network. response.close() except: #If we have an issue show a message and alert the user. print('Unable to connect to RSS...') If we look in the following console output, we can see the XML contents just as we saw in our iOS code example: In the example, notice that we wrapped our HTTP request around a try except block. For those coming from another language, except can be considered the same as a catch statement. This allows us to detect if an error occurs, which might be an incorrect URL or lack of connectivity, for example, to programmatically set an alternate result to our Python script. Parsing XML in Python with HTTP With these examples including our Python version of the script, it's still returning a string of some sorts, which by itself isn't of much use to grab values from the full string. In order to grab specific strings and values from an XML pulled through HTTP, we need to parse it. Luckily, Python has a built-in object in the Python main library for this, called as ElementTree, which is a part of the XML library in Python. Let's incorporate ElementTree into our example and see how that works: # -*- coding: utf-8 -*- import urllib2 from xml.etree import ElementTree try: #Open the file via HTTP. response = urllib2.urlopen('http://www.packtpub.com/rss.xml') tree = ElementTree.parse(response) root = tree.getroot() #Create an 'Element' group from our XPATH using findall. news_post_title = root.findall("channel//title") #Iterate in all our searched elements and print the inner text for each. for title in news_post_title: print title.text #Finally, close our open network. response.close() except Exception as e: #If we have an issue show a message and alert the user. print(e) The following screenshot shows the results of our script: As we can see, our output shows each title for each blog post. Notice how we used channel//item for our findall() method. This is using XPath, which allows us to write in a shorthand manner on how to iterate an XML structure. It works like this; let's use the http://www.packtpub.com feed as an example. We have a root, followed by channel, then a series of title elements. The findall() method found each element and saved them as an Element type specific to the XML library ElementTree uses in Python, and saved each of those into an array. We can then use a for in loop to iterate each one and print the inner text using the text property specific to the Element type. You may notice in the recent example that I changed except with a bit of extra code and added Exception as e. This allows us to help debug issues and print them to a console or display a more in-depth feedback to the user. An Exception allows for generic alerts that the Python libraries have built-in warnings and errors to be printed back either through a console or handled with the code. They also have more specific exceptions we can use such as IOException, which is specific for working with file reading and writing. About JSON Now, another data type that's becoming more and more common when working with web data is JSON. JSON is an acronym for JavaScript Object Notation, and as the name implies, is indeed true JavaScript. It has become popular in recent years with the rise of mobile apps, and Rich Internet Applications (RIA). JSON is great for JavaScript developers; it's easier to work with when working in HTML content, compared to XML. Because of this, JSON is becoming a more common data type for web and mobile application development. Parsing JSON in Python with HTTP To parse JSON data in Python is a pretty similar process; however, in this case, our ElementTree library isn't needed, since that only works with XML. We need a library designed to parse JSON using Python. Luckily, the Python library creators already have a library for us, simply called json. Let's build an example similar to our XML code using the json import; of course, we need to use a different data source since we won't be working in XML. One thing we may note is that there aren't many public JSON feeds to use, many of which require using a code that gives a developer permission to generate a JSON feed through a developer API, such as Twitter's JSON API. To avoid this, we will use a sample URL from Google's Books API, which will show demo data of Pride and Prejudice, Jane Austen. We can view the JSON (or download the file), by typing in the following URL: https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ Notice the API uses HTTPS, many JSON APIs are moving to secure methods of transmitting data, so be sure to include the secure in HTTP, with HTTPS. Let's take a look at the JSON output: { "kind": "books#volume", "id": "s1gVAAAAYAAJ", "etag": "yMBMZ85ENrc", "selfLink": "https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ", "volumeInfo": { "title": "Pride and Prejudice", "authors": [ "Jane Austen" ], "publisher": "C. Scribner's Sons", "publishedDate": "1918", "description": "Austen's most celebrated novel tells the story of Elizabeth Bennet, a bright, lively young woman with four sisters, and a mother determined to marry them to wealthy men. At a party near the Bennets' home in the English countryside, Elizabeth meets the wealthy, proud Fitzwilliam Darcy. Elizabeth initially finds Darcy haughty and intolerable, but circumstances continue to unite the pair. Mr. Darcy finds himself captivated by Elizabeth's wit and candor, while her reservations about his character slowly vanish. The story is as much a social critique as it is a love story, and the prose crackles with Austen's wry wit.", "readingModes": { "text": true, "image": true }, "pageCount": 401, "printedPageCount": 448, "dimensions": { "height": "18.00 cm" }, "printType": "BOOK", "averageRating": 4.0, "ratingsCount": 433, "contentVersion": "1.1.5.0.full.3", "imageLinks": { "smallThumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=5&edge=curl&imgtk=AFLRE73F8btNqKpVjGX6q7V3XS77 QA2PftQUxcEbU3T3njKNxezDql_KgVkofGxCPD3zG1yq39u0XI8s4wjrqFahrWQ- 5Epbwfzfkoahl12bMQih5szbaOw&source=gbs_api", "thumbnail": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE70tVS8zpcFltWh_ 7K_5Nh8BYugm2RgBSLg4vr9tKRaZAYoAs64RK9aqfLRECSJq7ATs_j38JRI3D4P48-2g_ k4-EY8CRNVReZguZFMk1zaXlzhMNCw&source=gbs_api", "small": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec =frontcover&img=1&zoom=2&edge=curl&imgtk=AFLRE71qcidjIs37x0jN2dGPstn 6u2pgeXGWZpS1ajrGgkGCbed356114HPD5DNxcR5XfJtvU5DKy5odwGgkrwYl9gC9fo3y- GM74ZIR2Dc-BqxoDuUANHg&source=gbs_api", "medium": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=3&edge=curl&imgtk=AFLRE73hIRCiGRbfTb0uNIIXKW 4vjrqAnDBSks_ne7_wHx3STluyMa0fsPVptBRW4yNxNKOJWjA4Od5GIbEKytZAR3Nmw_ XTmaqjA9CazeaRofqFskVjZP0&source=gbs_api", "large": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=4&edge=curl&imgtk=AFLRE73mlnrDv-rFsL- n2AEKcOODZmtHDHH0QN56oG5wZsy9XdUgXNnJ_SmZ0sHGOxUv4sWK6GnMRjQm2eEwnxIV4dcF9eBhghMcsx -S2DdZoqgopJHk6Ts&source=gbs_api", "extraLarge": "http://bks8.books.google.com/books?id=s1gVAAAAYAAJ&printsec= frontcover&img=1&zoom=6&edge=curl&imgtk=AFLRE73KIXHChszn TbrXnXDGVs3SHtYpl8tGncDPX_7GH0gd7sq7SA03aoBR0mDC4-euzb4UCIDiDNLYZUBJwMJxVX_ cKG5OAraACPLa2QLDcfVkc1pcbC0&source=gbs_api" }, "language": "en", "previewLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "infoLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&source=gbs_api", "canonicalVolumeLink": "http://books.google.com/books/about/ Pride_and_Prejudice.html?hl=&id=s1gVAAAAYAAJ" }, "layerInfo": { "layers": [ { "layerId": "geo", "volumeAnnotationsVersion": "6" } ] }, "saleInfo": { "country": "US", "saleability": "FREE", "isEbook": true, "buyLink": "http://books.google.com/books?id=s1gVAAAAYAAJ&hl=&buy=&source=gbs_api" }, "accessInfo": { "country": "US", "viewability": "ALL_PAGES", "embeddable": true, "publicDomain": true, "textToSpeechPermission": "ALLOWED", "epub": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download /Pride_and_Prejudice.epub?id=s1gVAAAAYAAJ&hl=&output=epub &source=gbs_api" }, "pdf": { "isAvailable": true, "downloadLink": "http://books.google.com/books/download/Pride_and_Prejudice.pdf ?id=s1gVAAAAYAAJ&hl=&output=pdf&sig=ACfU3U3dQw5JDWdbVgk2VRHyDjVMT4oIaA &source=gbs_api" }, "webReaderLink": "http://books.google.com/books/reader ?id=s1gVAAAAYAAJ&hl=&printsec=frontcover& output=reader&source=gbs_api", "accessViewStatus": "FULL_PUBLIC_DOMAIN", "quoteSharingAllowed": false } } One downside to JSON is that it can be hard to read complex data. So, if we run across a complex JSON feed, we can visualize it using a JSON Visualizer. Visual Studio includes one with all its editions, and a web search will also show multiple online sites where you can paste JSON and an easy-to-understand data structure will be displayed. Here's an example using http://jsonviewer.stack.hu/ loading our example JSON URL: Next, let's reuse some of our existing Python code using our urllib2 library to request our JSON feed, and then we will parse it with the Python's JSON library. Let's parse the volumeInfo node of the book by starting with the JSON (root) node that is followed by volumeInfo as the subnode. Here's our example from the XML section, reworked using JSON to parse all child elements: # -*- coding: utf-8 -*- import urllib2 import json try: #Set a URL variable. url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ' #Open the file via HTTP. response = urllib2.urlopen(url) #Read the request as one string. bookdata = response.read() #Convert the string to a JSON object in Python. data = json.loads(bookdata) for r in data ['volumeInfo']: print r #Close our response. response.close() except: #If we have an issue show a message and alert the user. print('Unable to connect to JSON API...') Here's our output. It should match the child nodes of volumeInfo, which it does in the output screen, as shown in the following screenshot: Well done! Now, let's grab the value for title. Look at the following example and notice we have two brackets: one for volumeInfo and another for title. This is similar to navigating our XML hierarchy: # -*- coding: utf-8 -*- import urllib2 import json try: #Set a URL variable. url = 'https://www.googleapis.com/books/v1/volumes/s1gVAAAAYAAJ' #Open the file via HTTP. response = urllib2.urlopen(url) #Read the request as one string. bookdata = response.read() #Convert the string to a JSON object in Python. data = json.loads(bookdata) print data['volumeInfo']['title'] #Close our response. response.close() except Exception as e: #If we have an issue show a message and alert the user. #'Unable to connect to JSON API...' print(e) The following screenshot shows the results of our script: As you can see in the preceding screenshot, we return one line with Pride and Prejudice parsed from our JSON data. About JSONP JSONP, or JSON with Padding, is actually JSON but it is set up differently compared to traditional JSON files. JSONP is a workaround for web cross-browser scripting. Some web services can serve up JSONP rather than pure JSON JavaScript files. The issue with that is JSONP isn't compatible with many JSON Python-based parsers including one covered here, so you will want to avoid JSONP style JSON whenever possible. So how can we spot JSONP files; do they have a different extension? No, it's simply a wrapper for JSON data; here's an example without JSONP: /* *Regular JSON */ { authorname: 'Chad Adams' } The same example with JSONP: /* * JSONP */ callback({ authorname: 'Chad Adams' }); Notice we wrapped our JSON data with a function wrapper, or a callback. Typically, this is what breaks in our parsers and is a giveaway that this is a JSONP-formatted JSON file. In JavaScript, we can even call it in code like this: /* * Using JSONP in JavaScript */ callback = function (data) { alert(data.authorname); }; JSONP with Python We can get around a JSONP data source though, if we need to; it just requires a bit of work. We can use the str.replace() method in Python to strip out the callback before running the string through our JSON parser. If we were parsing our example JSONP file in our JSON parser example, the string would look something like this: #Convert the string to a JSON object in Python. data = json.loads(bookdata.replace('callback(', '').) .replace(')', '')) Summary In this article, we covered HTTP concepts and methodologies for pulling strings and data from the Web. We learned how to do that with Python using the urllib2 library, and parsed XML data and JSON data. We discussed the differences between JSON and JSONP, and how to work around JSONP if needed. Resources for Article: Further resources on this subject: Introspecting Maya, Python, and PyMEL [Article] Exact Inference Using Graphical Models [Article] Automating Your System Administration and Deployment Tasks Over SSH [Article]

0
0
2154

Packt

21 Jul 2014

9 min read

Sharding in Action

Packt

21 Jul 2014

9 min read

0
0
1119

Packt

15 Jul 2014

6 min read

Securing the WAL Stream

Packt

15 Jul 2014

6 min read

(For more resources related to this topic, see here.) The primary mechanism that PostgreSQL uses to provide a data durability guarantee is through its Write Ahead Log (WAL). All transactional data is written to this location before ever being committed to database files. Once WAL files are no longer necessary for crash recovery, PostgreSQL will either delete or archive them. For the purposes of a highly available server, we recommend that you keep these important files as long as possible. There are several reasons for this; they are as follows: Archived WAL files can be used for Point In Time Recovery (PITR) If you are using streaming replication, interrupted streams can be re-established by applying WAL files until the replica has caught up WAL files can be reused to service multiple server copies In order to gain these benefits, we need to enable PostgreSQL WAL archiving and save these files until we no longer need them. This section will address our recommendations for long term storage of WAL files. Getting ready In order to properly archive WAL files, we recommend that you provision a server dedicated to backups or file storage. Depending on the transaction volume, an active PostgreSQL database might produce thousands of these on a daily basis. At 16 MB apiece, this is not an idle concern. For instance, for a 1 TB database, we recommend at least 3 TB of storage space. In addition, we will be using rsync as a daemon on this archive server. To install this on a Debian-based server, execute the following command as a root-level user: sudo apt-get install rsync Red-Hat-based systems will need this command instead: sudo yum install rsync xinetd How to do it... Our archive server has a 3 TB mount at the /db directory and is named arc_server on our network. The PostgreSQL source server resides at 192.168.56.10. Follow these steps for long-term storage of important WAL files on an archive server Enable rsync to run as a daemon on the archive server. On Debian based systems, edit the /etc/default/rsync file and change the RSYNC_ENABLE variable to true. On Red-Hat-based systems, edit the /etc/xinet.d/rsync file and change the disable parameter to no. Create a directory to store archived WAL files as the postgres user with these commands: sudo mkdir /db/pg_archived sudo chown postgres:postgres /db/pg_archived Create a file named /etc/rsyncd.conf and fill it with the following contents: [wal_store] path = /db/pg_archived comment = DB WAL Archives uid = postgres gid = postgres read only = false hosts allow = 192.168.56.10 hosts deny = * Start the rsync daemon. Debian-based systems should execute the following command: sudo service rsync start Red-Hat-based systems can start rsync with this command instead: sudo service xinetd start Change the archive_mode and archive_command parameters in postgresql.conf to read the following: archive_mode = on archive_command = 'rsync -aq %p arc_server::wal_store/%f' Restart the PostgreSQL server with a command similar to this: pg_ctl -D $PGDATA restart How it works The rsync utility is normally used to transfer files between two servers. However, we can take advantage of using it as a daemon to avoid connection overhead imposed by using SSH as an rsync protocol. Our first step is to ensure that the service is not disabled in some manner, which would make the rest of this guide moot. Next, we need a place to store archived WAL files on the archive server. Assuming that we have 3 TB of space in the /db directory, we simply claim /db/pg_archived as the desired storage location. There should be enough space to use /db for backups as well, but we won't discuss that here. Next, we create a file named /etc/rsyncd.conf, which will configure how rsync operates as a daemon. Here, we name the /db/pg_archived directory wal_store so that we can address the path by its name when sending files. We give it a human-readable name and ensure that files are owned by the postgres user, as this user also controls most of the PostgreSQL-related services. The next, and possibly the most important step, is to block all hosts but the primary PostgreSQL server from writing to this location. We set hosts deny to *, which blocks every server. Then, we set hosts allow to the primary database server's IP address so that only it has access. If everything goes well, we can start the rsync (or xinetd on Red Hat systems) service and we can see that in the following screenshot: Next, we enable archive_mode by setting it to on. With archive mode enabled, we can specify a command that will execute when PostgreSQL no longer needs a WAL file for crash recovery. In this case, we invoke the rsync command with the -a parameter to preserve elements such as file ownership, timestamps, and so on. In addition, we specify the -q setting to suppress output, as PostgreSQL only checks the command exit status to determine its success. In the archive_command setting, the %p value represents the full path to the WAL file, and %f resolves to the filename. In this context, we're sending the WAL file to the archive server at the wal_store module we defined in rsyncd.conf. Once we restart PostgreSQL, it will start storing all the old WAL files by sending them to the archive server. In case any rsync command fails because the archive server is unreachable, PostgreSQL will keep trying to send it until it is successful. If the archive server is unreachable for too long, we suggest that you change the archive_command setting to store files elsewhere. This prevents accidentally overfilling the PostgreSQL server storage. There's more... As we will likely want to use the WAL files on other servers, we suggest that you make a list of all the servers that could need WAL files. Then, modify the rsyncd.conf file on the archive server and add this section: [wal_fetch] path = /db/pg_archived comment = DB WAL Archive Retrieval uid = postgres gid = postgres read only = true hosts allow = host1, host2, host3 hosts deny = * Now, we can fetch WAL files from any of the hosts in hosts allow. As these are dedicated PostgreSQL replicas, recovery servers, or other defined roles, this makes the archive server a central location for all our WAL needs. See also We suggest that you read more about the archive_mode and archive_command settings on the PostgreSQL site. We've included a link here: http://www.postgresql.org/docs/9.3/static/runtime-config-wal.html The rsyncd.conf file should also have its own manual page. Read it with this command to learn more about the available settings: man rsyncd.conf Summary In this article, we've successfully learned how to secure the WAL stream by following the given steps. Resources for Article: Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [article] Backup in PostgreSQL 9 [article] Recovery in PostgreSQL 9 [article]

0
0
1579

article-image-making-most-your-hadoop-data-lake-part-2-optimized-file-formats

Kristen Hardwick

30 Jun 2014

5 min read

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

Kristen Hardwick

30 Jun 2014

5 min read

One major factor of making the conversion to Hadoop is the concept of the Data Lake. That idea suggests that users keep as much data as possible in HDFS in order to prepare for future use cases and as-yet-unknown data integration points. As your data grows, it is important to make sure that the data is being stored in a way that prolongs that behavior. Data compression is not the only technique that can be used to speed up job performance and improve cluster organization. In addition to the Text and Sequence File options that are typically used by default, Hadoop offers a few more optimized file formats that are specifically designed to improve the process of interacting with the data. In the second part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address another important factor in improving manageability—optimized file formats. Using a smarter file for your data: RCFile RCFile stands for Record Columnar File, and it serves as an ideal format for storing relational data that will be accessed through Hive. This format offers performance improvements by storing the data in an optimized way. First, the data is partitioned horizontally, into groups of rows. Then each row group is partitioned vertically, into collections of columns. Finally, the data in each column collection is compressed and stored in column-row format, as if it were a column-oriented database. The first benefit of this altered storage mechanism is apparent at the row level. All HDFS blocks used to form RCFiles will be made up of the horizontally partitioned collections of rows. This is significant because it ensures that no row of data will be split across multiple blocks, and will therefore always be on the same machine. This is not the case for traditional HDFS file formats, which typically use data size to split the file. This optimized data storage will reduce the amount of network bandwidth that is required to serve queries. The second benefit comes from optimizations at the column level, in the form of disk I/O reduction. Since the columns are stored vertically within each row group, the system will be able to seek directly to the required column position in the file, rather than being required to scan across all columns and filter out data that is not necessary. This is extremely useful in queries that only require access to a small subset of the existing columns. RCFiles can be used natively in both Hive and Pig with very little configuration. In Hive CREATE TABLE … STORED AS RCFILE; ALTER TABLE … SET FILEFORMAT RCFILE; SET hive.default.fileformat=RCFile; In Pig: register …/piggybank.jar; a = LOAD '/user/hive/warehouse/table' USING org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat(…); The Pig jar file referenced here is just one option for enabling the RCFile. At the time of writing, there was also an RCFilePigStorageclass available through Twitter’s Elephant Bird open source library. Hortonworks’ ORCFile and Cloudera’s Parquet formats RCFiles provide optimization for relational files primarily by implementing modifications at the storage level. New innovations have provided improvements on the RCFile format, namely the ORCFile format from Hortonworks and the Parquet format from Cloudera. When storing data using the Optimized Row Columnar file or Parquet formats, several pieces of metadata are automatically written at the column level within each row group; for example, minimum and maximum values for numeric data types and dictionary-style metadata for text data types. The specific metadata is also configurable. One such use case would be for a user to configure the dataset to be sorted on a given set of columns for efficient access. This excess metadata allows for queries to take advantage of an improvement on the original RCFiles–predicate pushdown. That technique allows Hive to evaluate the where clause during the record gathering process, instead of filtering data after all records have been collected. The predicate pushdown technique will evaluate the conditions of the query against the metadata associated with a particular row group, allowing it to skip over entire file blocks if possible, or to seek directly to the correct row. One major benefit of this process is that the more complex a particular where clause is, the more potential there is for row groups and columns to be filtered as irrelevant to the final result. Cloudera’s Parquet format is typically used in conjunction with Impala, but just like with RCFiles, ORCFiles can be incorporated into both Hive and Pig. HCatalog can be used as the primary method to read and write ORCFiles using Pig. The commands for Hive are provided below: In Hive: CREATE TABLE … STORED AS ORC; ALTER TABLE … SET FILEFORMAT ORC SET hive.default.fileformat=Orc Conclusion This post has detailed the alternatives to the default file formats that can be used in Hadoop in order to optimize data access and storage. This information combined with the compression techniques described in the previous post (part 1) will provide some guidelines that can be used to ensure that users can make the most of the Hadoop Data Lake. About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.

0
0
1610

article-image-mapreduce-openstack-swift-and-zerovm

Lars Butler

30 Jun 2014

6 min read

MapReduce with OpenStack Swift and ZeroVM

Lars Butler

30 Jun 2014

6 min read

Originally coined in a 2004 research paper produced by Google, the term “MapReduce” was defined as a “programming model and an associated implementation for processing and generating large datasets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model. The philosophy behind the MapReduce computation model is based on a divide and conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage here is that many nodes can run in parallel to solve a problem. This can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can diminish the advantage of distributed parallel computation if task payloads are too granular. On the other hand, they must be granular enough in order to be distributed evenly (although achieving perfect distribution is nearly impossible if exact computational complexity of each task cannot be known in advance). A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer (between compute and storage nodes) could somehow be reduced or eliminated, the task size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler yet, computation could be run directly on the system where the data is stored in files. In both cases, the same problems are present: changes to the code require direct access to the system, and thus access needs to be carefully controlled (through group management, read/write permissions, and so on). Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports this. Enter Swift and ZeroVM OpenStack Swift is one such example of a horizontally scalable data store. Swift can store petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended by writing custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data. ZeroCloud is a middleware application for Swift which does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides sufficient program isolation and a guarantee of security such that any arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment. The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst thing that can happen is that a user can crash their own machines and delete their own data—but not another user’s.) Second, ZeroVM instances can start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can instantaneously spawn one thousand ZeroVM instances (one for each file), run the job, and destroy the instances very quickly. This also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows intercommunication between instances, enabling the creation of arbitrary job pipelines and multi-stage computations. This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be. Limitations Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of an object and then randomly chooses one to execute a ZeroVM process. While this functions well as a proof-of-concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in an inefficient use of resources. A research group at the University of Texas at San Antonio (UTSA) sponsored by Rackspace is currently working on developing better algorithms for scheduling workloads running on ZeroVM-based infrastructure. A short video of the research proposal can be found here:http://link.brightcove.com/services/player/bcpid3530100726001?bckey=AQ~~,AAACa24Pu2k~,Q186VLPcl3-oLBDP8npyqxjCNB5jgYcT. Further reading Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time this was written, Zebra is still in a private beta and invitations are limited. But it also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are here: https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also some tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud. Big Data in the cloud? It's not the future, it's already here! Find more Hadoop tutorials and extra content here, and dive deeper into OpenStack by visiting this page, dedicated to one of the most exciting cloud platforms around today. About the author Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.

0
0
2730

article-image-making-most-your-hadoop-data-lake-part-1-data-compression

Kristen Hardwick

30 Jun 2014

6 min read

Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

Kristen Hardwick

30 Jun 2014

6 min read

In the world of big data, the Data Lake concept reigns supreme. Hadoop users are encouraged to keep all data in order to prepare for future use cases and as-yet-unknown data integration points. This concept is part of what makes Hadoop and HDFS so appealing, so it is important to make sure that the data is being stored in a way that prolongs that behavior. In the first part of this two-part series, “Making the Most of Your Hadoop Data Lake”, we will address one important factor in improving manageability—data compression. Data compression is an area that is often overlooked in the context of Hadoop. In many cluster environments, compression is disabled by default, putting the burden on the user. In this post, we will discuss the tradeoffs involved in deciding how to take advantage of compression techniques and the advantages and disadvantages of specific compression codec options with respect to Hadoop. To compress or not to compress Whenever data is converted to something other than its raw data format, that naturally implies some overhead involved in completing the conversion process. When data compression is being discussed, it is important to take that overhead into account with respect to the benefits of reducing the data footprint. One obvious benefit is that compressed data will reduce the amount of disk space that is required for storage of a particular dataset. In the big data environment, this benefit is especially significant—either the Hadoop cluster will be able to keep data for a larger time range, or storing data for the same time range will require fewer nodes, or the disk usage ratios will remain lower for longer. In addition, the smaller file sizes will mean lower data transfer times—either internally for MapReduce jobs or when performing exports of data results. The cost of these benefits, however, is that the data must be decompressed at every point where the data needs to be read, and compressed before being inserted into HDFS. With respect to MapReduce jobs, this processing overhead at both the map phase and the reduce phase will increase the CPU processing time. Fortunately, by making informed choices about the specific compression codecs used at any given phase in the data transformation process, the cluster administrator or user can ensure that the advantages of compression outweigh the disadvantages. Choosing the right codec for each phase Hadoop provides the user with some flexibility on which compression codec is used at each step of the data transformation process. It is important to realize that certain codecs are optimal for some stages, and non-optimal for others. In the next sections, we will cover some important notes for each choice. zlib The major benefit of using this codec is that it is the easiest way to get the benefits of data compression from a cluster and job configuration standpoint—the zlib codec is the default compression option. From the data transformation perspective, this codec will decrease the data footprint on disk, but will not provide much of a benefit in terms of job performance. gzip The gzip codec available in Hadoop is the same one that is used outside of the Hadoop ecosystem. It is common practice to use this as the codec for compressing the final output from a job, simply for the benefit of being able to share the compressed result with others (possibly outside of Hadoop) using a standard file format. bzip2 There are two important benefits for the bzip2 codec. First, if reducing the data footprint is a high priority, this algorithm will compress the data more than the default zlib option. Second, this is the only supported codec that produces “splittable” compressed data. A major characteristic of Hadoop is the idea of splitting the data so that they can be handled on each node independently. With the other compression codecs, there is an initial requirement to gather all parts of the compressed file in order to have all information necessary to decompress the data. With this format, the data can be decompressed in parallel. This splittable quality makes this format ideal for compressing data that will be used as input to a map function, either in a single step or as part of a series of chained jobs. LZO, LZ4, Snappy These three codecs are ideal for compressing intermediate data—the data output from the mappers that will be immediately read in by the reducers. All three codecs heavily favor compression speed over file size ratio, but the detailed specifications for each algorithm should be examined based on the specific licensing, cluster, and job requirements. Enabling compression Once the appropriate compression codec for any given transformation phase has been selected, there are a few configuration properties that need to be adjusted in order to have the changes take effect in the cluster. Intermediate data to reducer mapreduce.map.output.compress = true (Optional) mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec Final output from a job mapreduce.output.fileoutputformat.compress = true (Optional) mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.BZip2Codec These compression codecs are also available within some of the ecosystem tools like Hive and Pig. In most cases, the tools will default to the Hadoop-configured values for particular codecs, but the tools also provide the option to compress the data generated between steps. Pig pig.tmpfilecompression = true (Optional) pig.tmpfilecompression.codec = snappy Hive hive.exec.compress.intermediate = true hive.exec.compress.output = true Conclusion This post detailed the benefits and disadvantages of data compression, along with some helpful guidelines on how to choose a codec and enable it at various stages in the data transformation workflow. In the next post, we will go through some additional techniques that can be used to ensure that users can make the most of the Hadoop Data Lake. For more Big Data and Hadoop tutorials and insight, visit our dedicated Hadoop page. About the author Kristen Hardwick has been gaining professional experience with software development in parallel computing environments in the private, public, and government sectors since 2007. She has interfaced with several different parallel paradigms including Grid, Cluster, and Cloud. She started her software development career with Dynetics in Huntsville, AL, and then moved to Baltimore, MD, to work for Dynamics Research Corporation. She now works at Spry where her focus is on designing and developing big data analytics for the Hadoop ecosystem.

0
0
2639

Packt

25 Jun 2014

10 min read

The Hunt for Data

Packt

25 Jun 2014

10 min read

(For more resources related to this topic, see here.) Examining a JSON file with the aeson package JavaScript Object Notation (JSON) is a way to represent key-value pairs in plain text. The format is described extensively in RFC 4627 (http://www.ietf.org/rfc/rfc4627). In this recipe, we will parse a JSON description about a person. We often encounter JSON in APIs from web applications. Getting ready Install the aeson library from hackage using Cabal. Prepare an input.json file representing data about a mathematician, such as the one in the following code snippet: $ cat input.json {"name":"Gauss", "nationality":"German", "born":1777, "died":1855} We will be parsing this JSON and representing it as a usable data type in Haskell. How to do it... Use the OverloadedStrings language extension to represent strings as ByteString, as shown in the following line of code: {-# LANGUAGE OverloadedStrings #-} Import aeson as well as some helper functions as follows: import Data.Aeson import Control.Applicative import qualified Data.ByteString.Lazy as B Create the data type corresponding to the JSON structure, as shown in the following code: data Mathematician = Mathematician { name :: String , nationality :: String , born :: Int , died :: Maybe Int } Provide an instance for the parseJSON function, as shown in the following code snippet: instance FromJSON Mathematician where parseJSON (Object v) = Mathematician <$> (v .: "name") <*> (v .: "nationality") <*> (v .: "born") <*> (v .:? "died") Define and implement main as follows: main :: IO () main = do Read the input and decode the JSON, as shown in the following code snippet: input <- B.readFile "input.json" let mm = decode input :: Maybe Mathematician case mm of Nothing -> print "error parsing JSON" Just m -> (putStrLn.greet) m Now we will do something interesting with the data as follows: greet m = (show.name) m ++ " was born in the year " ++ (show.born) m We can run the code to see the following output: $ runhaskell Main.hs "Gauss" was born in the year 1777 How it works... Aeson takes care of the complications in representing JSON. It creates native usable data out of a structured text. In this recipe, we use the .: and .:? functions provided by the Data.Aeson module. As the Aeson package uses ByteStrings instead of Strings, it is very helpful to tell the compiler that characters between quotation marks should be treated as the proper data type. This is done in the first line of the code which invokes the OverloadedStrings language extension. We use the decode function provided by Aeson to transform a string into a data type. It has the type FromJSON a => B.ByteString -> Maybe a. Our Mathematician data type must implement an instance of the FromJSON typeclass to properly use this function. Fortunately, the only required function for implementing FromJSON is parseJSON. The syntax used in this recipe for implementing parseJSON is a little strange, but this is because we're leveraging applicative functions and lenses, which are more advanced Haskell topics. The .: function has two arguments, Object and Text, and returns a Parser a data type. As per the documentation, it retrieves the value associated with the given key of an object. This function is used if the key and the value exist in the JSON document. The :? function also retrieves the associated value from the given key of an object, but the existence of the key and value are not mandatory. So, we use .:? for optional key value pairs in a JSON document. There's more… If the implementation of the FromJSON typeclass is too involved, we can easily let GHC automatically fill it out using the DeriveGeneric language extension. The following is a simpler rewrite of the code: {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE DeriveGeneric #-} import Data.Aeson import qualified Data.ByteString.Lazy as B import GHC.Generics data Mathematician = Mathematician { name :: String , nationality :: String , born :: Int , died :: Maybe Int } deriving Generic instance FromJSON Mathematician main = do input <- B.readFile "input.json" let mm = decode input :: Maybe Mathematician case mm of Nothing -> print "error parsing JSON" Just m -> (putStrLn.greet) m greet m = (show.name) m ++" was born in the year "++ (show.born) m Although Aeson is powerful and generalizable, it may be an overkill for some simple JSON interactions. Alternatively, if we wish to use a very minimal JSON parser and printer, we can use Yocto, which can be downloaded from http://hackage.haskell.org/package/yocto. Reading an XML file using the HXT package Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/). In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates. Getting ready We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess on December 18, 2014 as follows: $ cat input.xml <thread> <email> <to>Databender</to> <from>Princess</from> <date>Thu Dec 18 15:03:23 EST 2014</date> <subject>Joke</subject> <body>Why did you divide sin by tan?</body> </email> <email> <to>Princess</to> <from>Databender</from> <date>Fri Dec 19 3:12:00 EST 2014</date> <subject>RE: Joke</subject> <body>Just cos.</body> </email> </thread> Using Cabal, install the HXT library which we use for manipulating XML documents: $ cabal install hxt How to do it... We only need one import, which will be for parsing XML, using the following line of code: import Text.XML.HXT.Core Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code: main :: IO () main = do input <- readFile "input.xml" Apply the readString function to the input and extract all the date documents. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet: dates <- runX $ readString [withValidate no] input //> hasName "date" //> getText We can now use the list of extracted dates as follows: print dates By running the code, we print the following output: $ runhaskell Main.hs ["Thu Dec 18 15:03:23 EST 2014", "Fri Dec 19 3:12:00 EST 2014"] How it works... The library function, runX, takes in an Arrow. Think of an Arrow as a more powerful version of a Monad. Arrows allow for stateful global XML processing. Specifically, the runX function in this recipe takes in IOSArrow XmlTree String and returns an IO action of the String type. We generate this IOSArrow object using the readString function, which performs a series of operations to the XML data. For a deep insight into the XML document, //> should be used whereas /> only looks at the current level. We use the //> function to look up the date attributes and display all the associated text. As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following: isText: This is used to test for text nodes isAttr: This is used to test for an attribute tree hasAttr: This is used to test whether an element node has an attribute node with a specific name getElemName: This is used to select the name of an element node All the Arrow functions can be found on the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html. Capturing table rows from an HTML page Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file may be useful, so we find ourselves only focusing on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure to extract data whereas a paragraph in an article may be too unstructured and complicated to process. In this recipe, we will find a table on a web page and gather all rows to be used in the program. Getting ready We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure: The HTML behind this table is as follows: $ cat input.html <!DOCTYPE html> <html> <body> <h1>Course Listing</h1> <table> <tr> <th>Course</th> <th>Time</th> <th>Capacity</th> </tr> <tr> <td>CS 1501</td> <td>17:00</td> <td>60</td> </tr> <tr> <td>MATH 7600</td> <td>14:00</td> <td>25</td> </tr> <tr> <td>PHIL 1000</td> <td>9:30</td> <td>120</td> </tr> </table> </body> </html> If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines: $ cabal install hxt $ cabal install split How to do it... We will need the htx package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet: import Text.XML.HXT.Core import Data.List.Split (chunksOf) Define and implement main to read the input.html file. main :: IO () main = do input <- readFile "input.html" Feed the HTML data into readString, thereby setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the remaining text, as shown in the following code: texts <- runX $ readString [withParseHTML yes, withWarnings no] input //> hasName "td" //> getText The data is now usable as a list of strings. It can be converted into a list of lists similar to how CSV was presented in the previous CSV recipe, as shown in the following code: let rows = chunksOf 3 texts print $ findBiggest rows By folding through the data, identify the course with the largest capacity using the following code snippet: findBiggest :: [[String]] -> [String] findBiggest [] = [] findBiggest items = foldl1 (a x -> if capacity x > capacity a then x else a) items capacity [a,b,c] = toInt c capacity _ = -1 toInt :: String -> Int toInt = read Running the code will display the class with the largest capacity as follows: $ runhaskell Main.hs {"PHIL 1000", "9:30", "120"} How it works... This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].

0
0
1092

article-image-exact-inference-using-graphical-models

Packt

25 Jun 2014

7 min read

Exact Inference Using Graphical Models

Packt

25 Jun 2014

7 min read

(For more resources related to this topic, see here.) Complexity of inference A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use this model is to generate the joint distribution and sum out all the variables, except the ones we are interested in. However, we need to determine and specify the joint distribution where an exponential blowup happens. In worst-case scenarios, we need to determine the exact inference in NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimals). Suppose we tone down our precision requirements (for example, only up to two digits after the decimals). Now, is the (approximate) inference task any easier? Unfortunately not—even approximate inference is NP-hard, that is, getting values is far better than random guessing (50 percent or a probability of 0.5), which takes exponential time. It might seem like inference is a hopeless task, but that is only in the worst case. In general cases, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference. Real-world issues Since inference is a task that is NP-hard, inference engines are written in languages that are as close to bare metal as possible; usually in C or C++. Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon. Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice serves a good balance between running the Python code and a fast inference implementation. Use inference engines that do not have a Python interface, which is true for majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume. In the article on inference, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with Python interfaces to try out a more real-world problem. The tree algorithm We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree-message passing algorithm (which is sometimes called the junction tree algorithm too). Other versions of the message passing algorithm are used in approximate inference as well. We initiate the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network where groups of variables are placed in the cluster. It is similar to a factor where each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D. In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains variables common to both clusters. Using the preceding example, the two clusters and are connected by the sepset , which contains the only variable common to both clusters. In the next section, we shall learn about the implementation details of the junction tree algorithm. We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective. The four stages of the junction tree algorithm In this section, we will discuss the four stages of the junction tree algorithm. In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or a clique tree). The transformation from the Bayes network to junction tree proceeds as per the following steps: We will construct a moral graph by changing all the directed edges to undirected edges. All nodes that have V-structures that enter the said node have their parents connected with an edge. We have seen an example of this process (in the VE algorithm) called moralization, which is a possible reference to connect (apparently unmarried) parents that have a child (node). Then, we will selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph where the maximum cycle length between the nodes is 3. From the triangulated graph, we will identify the subsets of nodes (called cliques). Starting with the cliques as clusters, we will arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it should also appear in all the nodes on the path that connect the two cliques. In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table. They have a list of values against each assignment to a variable in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probabilities because in Markov networks, unlike probabilities, the values of the potentials are not obliged to sum to 1. This stage consists of message passing or belief propagation between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Each message can be passed asynchronously, but it has to wait for information from other clusters before it collates that information and passes it to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two stages: an upward pass stage and a downward pass stage. Only after a node receives messages from the leaf nodes, will it send the message to its parent (in the "upward pass"), and only after the node receives a message from its parents will it send a message to its children (in the "downward pass"). The message passing stage completes when each cluster sepset has consistent beliefs. Recall that a cluster connected to a sepset has common variables. For example, cluster C and sepset S have and variables in its scope. Then, the potential against obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated. Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph. Summary We first explored the inference problem where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible. Resources for Article: Further resources on this subject: Getting Started with Spring Python [article] Python Testing: Installing the Robot Framework [article] Discovering Python's parallel programming tools [article]

0
0
2041

Packt

23 Jun 2014

14 min read

Working with Pentaho Mobile BI

Packt

23 Jun 2014

14 min read

0
0
2890

Packt

20 Jun 2014

11 min read

What is Quantitative Finance?

Packt

20 Jun 2014

11 min read

(For more resources related to this topic, see here.) Discipline 1 – finance (financial derivatives) In general, a financial derivative is a contract between two parties who agree to exchange one or more cash flows in the future. The value of these cash flows depends on some future event, for example, that the value of some stock index or interest rate being above or below some predefined level. The activation or triggering of this future event thus depends on the behavior of a variable quantity known as the underlying. Financial derivatives receive their name because they derive their value from the behavior of another financial instrument. As such, financial derivatives do not have an intrinsic value in themselves (in contrast to bonds or stocks); their price depends entirely on the underlying. A critical feature of derivative contracts is thus that their future cash flows are probabilistic and not deterministic. The future cash flows in a derivative contract are contingent on some future event. That is why derivatives are also known as contingent claims. This feature makes these types of contracts difficult to price. The following are the most common types of financial derivatives: Futures Forwards Options Swaps Futures and forwards are financial contracts between two parties. One party agrees to buy the underlying from the other party at some predetermined date (the maturity date) for some predetermined price (the delivery price). An example could be a one-month forward contract on one ounce of silver. The underlying is the price of one ounce of silver. No exchange of cash flows occur at inception (today, t=0), but it occurs only at maturity (t=T). Here t represents the variable time. Forwards are contracts negotiated privately between two parties (in other words, Over The Counter (OTC)), while futures are negotiated at an exchange. Options are financial contracts between two parties. One party (called the holder of the option) pays a premium to the other party (called the writer of the option) in order to have the right, but not the obligation, to buy some particular asset (the underlying) for some particular price (the strike price) at some particular date in the future (the maturity date). This type of contract is called a European Call contract. Example 1 Consider a one-month call contract on the S&P 500 index. The underlying in this case will be the value of the S&P 500 index. There are cash flows both at inception (today, t=0) and at maturity (t=T). At inception, (t=0) the premium is paid, while at maturity (t=T), the holder of the option will choose between the following two possible scenarios, depending on the value of the underlying at maturity S(T): Scenario A: To exercise his/her right and buy the underlying asset for K Scenario B: To do nothing if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K The option holder will choose Scenario A if the value of the underlying at maturity is above the value of the strike, that is, S(T)>K. This will guarantee him/her a profit of S(T)-K. The option holder will choose Scenario B if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K. This will guarantee him/her to limit his/her losses to zero. Example 2 An Interest Rate Swap (IRS) is a financial contract between two parties A and B who agree to exchange cash flows at regular intervals during a given period of time (the life of a contract). Typically, the cash flows from A to B are indexed to a fixed rate of interest, while the cash flows from B to A are indexed to a floating interest rate. The set of fixed cash flows is known as the fixed leg, while the set of floating cash flows is known as the floating leg. The cash flows occur at regular intervals during the life of the contract between inception (t=0) and maturity (t=T). An example could be a fixed-for-floating IRS, who pays a rate of 5 percent on the agreed notional N every three months and receives EURIBOR3M on the agreed notional N every three months. Example 3 A futures contract on a stock index also involves a single future cash flow (the delivery price) to be paid at the maturity of the contract. However, the payoff in this case is uncertain because how much profit I will get from this operation will depend on the value of the underlying at maturity. If the price of the underlying is above the delivery price, then the payoff I get (denoted by function H) is positive (indicating a profit) and corresponds to the difference between the value of the underlying at maturity S(T) and the delivery price K. If the price of the underlying is below the delivery price, then the payoff I get is negative (indicating a loss) and corresponds to the difference between the delivery price K and the value of the underlying at maturity S(T). This characteristic can be summarized in the following payoff formula: Equation 1 Here, H(S(T)) is the payoff at maturity, which is a function of S(T). Financial derivatives are very important to the modern financial markets. According to the Bank of International Settlements (BIS) as of December 2012, the amounts outstanding for OTC derivative contracts worldwide were Foreign exchange derivatives with 67,358 billion USD, Interest Rate Derivatives with 489,703 billion USD, Equity-linked derivatives with 6,251 billion USD, Commodity derivatives with 2,587 billion USD, and Credit default swaps with 25,069 billion USD. For more information, see http://www.bis.org/statistics/dt1920a.pdf. Discipline 2 – mathematics We need mathematical models to capture both the future evolution of the underlying and the probabilistic nature of the contingent cash flows we encounter in financial derivatives. Regarding the contingent cash flows, these can be represented in terms of the payoff function H(S(T)) for the specific derivative we are considering. Because S(T) is a stochastic variable, the value of H(S(T)) ought to be computed as an expectation E[H(S(T))]. And in order to compute this expectation, we need techniques that allow us to predict or simulate the behavior of the underlying S(T) into the future, so as to be able to compute the value of ST and finally be able to compute the mean value of the payoff E[H(S(T))]. Regarding the behavior of the underlying, typically, this is formalized using Stochastic Differential Equations (SDEs), such as Geometric Brownian Motion (GBM), as follows: Equation 2 The previous equation fundamentally says that the change in a stock price (dS), can be understood as the sum of two effects—a deterministic effect (first term on the right-hand side) and a stochastic term (second term on the right-hand side). The parameter is called the drift, and the parameter is called the volatility. S is the stock price, dt is a small time interval, and dW is an increment in the Wiener process. This model is the most common model to describe the behavior of stocks, commodities, and foreign exchange. Other models exist, such as jump, local volatility, and stochastic volatility models that enhance the description of the dynamics of the underlying. Regarding the numerical methods, these correspond to ways in which the formal expression described in the mathematical model (usually in continuous time) is transformed into an approximate representation that can be used for calculation (usually in discrete time). This means that the SDE that describes the evolution of the price of some stock index into the future, such as the FTSE 100, is changed to describe the evolution at discrete intervals. An approximate representation of an SDE can be calculated using the Euler approximation as follows: Equation 3 The preceding equation needs to be solved in an iterative way for each time interval between now and the maturity of the contract. If these time intervals are days and the contract has a maturity of 30 days from now, then we compute tomorrow's price in terms of todays. Then we compute the day after tomorrow as a function of tomorrow's price and so on. In order to price the derivative, we require to compute the expected payoff E[H(ST)] at maturity and then discount it to the present. In this way, we would be able to compute what should be the fair premium associated with a European option contract with the help of the following equation: Equation 4 Discipline 3 – informatics (C++ programming) What is the role of C++ in pricing derivatives? Its role is fundamental. It allows us to implement the actual calculations that are required in order to solve the pricing problem. Using the preceding techniques to describe the dynamics of the underlying, we require to simulate many potential future scenarios describing its evolution. Say we ought to price a futures contract on the EUR/USD exchange rate with one year maturity. We have to simulate the future evolution of EUR/USD for each day for the next year (using equation 3). We can then compute the payoff at maturity (using equation 1). However, in order to compute the expected payoff (using equation 4), we need to simulate thousands of such possible evolutions via a technique known as Monte Carlo simulation. The set of steps required to complete this process is known as an algorithm. To price a derivative, we ought to construct such algorithm and then implement it in an advanced programming language such as C++. Of course C++ is not the only possible choice, other languages include Java, VBA, C#, Mathworks Matlab, and Wolfram Mathematica. However, C++ is an industry standard because it's flexible, fast, and portable. Also, through the years, several numerical libraries have been created to conduct complex numerical calculations in C++. Finally, C++ is a powerful modern object-oriented language. It is always difficult to strike a balance between clarity and efficiency. We have aimed at making computer programs that are self-contained (not too object oriented) and self-explanatory. More advanced implementations are certainly possible, particularly in the context of larger financial pricing libraries in a corporate context. In this article, all the programs are implemented with the newest standard C++11 using Code::Blocks (http://www.codeblocks.org) and MinGW (http://www.mingw.org). The Bento Box template A Bento Box is a single portion take-away meal common in Japanese cuisine. Usually, it has a rectangular form that is internally divided in compartments to accommodate the various types of portions that constitute a meal. In this article, we use the metaphor of the Bento Box to describe a visual template to facilitate, organize, and structure the solution of derivative problems. The Bento Box template is simply a form that we will fill sequentially with the different elements that we require to price derivatives in a logical structured manner. The Bento Box template when used to price a particular derivative is divided into four areas or boxes, each containing information critical for the solution of the problem. The following figure illustrates a generic template applicable to all derivatives: The Bento Box template – general case The following figure shows an example of the Bento Box template as applied to a simple European Call option: The Bento Box template – European Call option In the preceding figure, we have filled the various compartments, starting in the top-left box and proceeding clockwise. Each compartment contains the details about our specific problem, taking us in sequence from the conceptual (box 1: derivative contract) to the practical (box 4: algorithm), passing through the quantitative aspects required for the solution (box 2: mathematical model and box 3: numerical method). Summary This article gave an overview of the main elements of Quantitative Finance as applied to pricing financial derivatives. The Bento Box template technique will be used to organize our approach to solve problems in pricing financial derivatives. We will assume that we are in possession with enough information to fill box 1 (derivative contract). Resources for Article: Further resources on this subject: Application Development in Visual C++ - The Tetris Application [article] Getting Started with Code::Blocks [article] Creating and Utilizing Custom Entities [article]

0
0
1256

article-image-machine-learning-bioinformatics

Packt

20 Jun 2014

8 min read

Machine Learning in Bioinformatics

Packt

20 Jun 2014

8 min read

(For more resources related to this topic, see here.) Supervised learning for classification Like clustering, classification is also about categorizing data instances, but in this case, the categories are known and are termed as class labels. Thus, it aims at identifying the category that a new data point belongs to. It uses a dataset where the class labels are known to find the pattern. Classification is an instance of supervised learning where the learning algorithm takes a known set of input data and corresponding responses called class labels and builds a predictor model that generates reasonable predictions for the class labels in the unknown data. To illustrate, let's imagine that we have gene expression data from cancer patients as well as healthy patients. The gene expression pattern in these samples can define whether the patient has cancer or not. In this case, if we have a set of samples for which we know the type of tumor, the data can be used to learn a model that can identify the type of tumor. In simple terms, it is a predictive function used to determine the tumor type. Later, this model can be applied to predict the type of tumor in unknown cases. There are some do's and don'ts to keep in mind while learning a classifier. You need to make sure that you have enough data to learn the model. Learning with smaller datasets will not allow the model to learn the pattern in an unbiased manner and again, you will end up with an inaccurate classification. Furthermore, the preprocessing steps (such as normalization) for the training and test data should be the same. Another important thing that one should take care of is to keep the training and test data distinct. Learning on the entire data and then using a part of this data for testing will lead to a phenomenon called over fitting. It is always recommended that you take a look at it manually and understand the question that you need to answer via your classifier. There are several methods of classification. In this recipe, we will talk about some of these methods. We will discuss linear discriminant analysis (LDA), decision tree (DT), and support vector machine (SVM). Getting ready To perform the classification task, we need two preparations. First, a dataset with known class labels (training set), and second, the test data that the classifier has to be tested on (test set). Besides this, we will use some R packages, which will be discussed when required. As a dataset, we will use approximately 2300 gene from tumor cells. The data has ~83 data points with four different types of tumors. These will be used as our class labels. We will use 60 of the data points for the training and the remaining 23 for the test. To find out more about the dataset, refer to the Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks article by Khan and others (http://research.nhgri.nih.gov/microarray/Supplement/). The set has been precompiled in a format that is readily usable in R and is available on the book's web page (code files) under the name cancer.rda. How to do it… To classify data points based on their features, perform the following steps: First, load the following MASS library as it has some of the classification functions: > library(MASS) Now, you need your data to learn and test the classifiers. Load the data from the code files available on the book's web page (cancer.rda) as follows: > load ("path/to/code/directory/cancer.rda") # located in the code file directory for the chapter, assign the path accordingly Randomly sample 60 data points for the training and the remaining 23 for the test set as follows—ensure that these two datasets do not overlap and are not biased towards any specific tumor type (random sampling): > train <- mldata[train_row,] # use sampled indexes to extract training data > test <- mldata[-train_row,] # test set is select by selecting all the other data points For the training data, retain the class labels, which are the tumor columns here, and remove this information from the test data. However, store this information for comparison purposes: > testClass <- test$tumor > test$tumor <- NULL Now, try the linear discriminate analysis classifier, as follows, to get the classifier model: > myLD <- lda(tumor ~ ., train) # might issue a warning Test this classifier to predict the labels on your test set, as follows: > testRes_lda <- predict(myLD, test) To check the number of correct and incorrect predictions, simply compare the predicted classes with the testClass object, which was created in step 4, as follows: > sum(testRes_lda$class == testClass) # correct prediction [1] 19 > sum(testRes_lda$class != testClass) # incorrect prediction [1] 4 Now, try another simple classifier called DT. For this, you need the rpart package: > library(rpart) Create the decision tree based on your training data, as follows: > myDT <- rpart(tumor~ ., data = train, control = rpart.control(minsplit = 10)) Plot your tree by typing the following commands, as shown in the next diagram: > plot(myDT) > text(myDT, use.n=T) The following screenshot shows the cut off for each feature (represented by the branches) to differentiate between the classes: The tree for DT-based learning Now, test the decision tree classifier on your test data using the following prediction function: > testRes_dt <- predict(myDT, newdata= test) Take a look at the species that each data instance is put in by the predicted classifier, as follows (1 if predicted in the class, else 0): > classes <- round(testRes_dt) > head(classes) BL EW NB RM 4 0 0 0 1 10 0 0 0 1 15 1 0 0 0 16 0 0 1 0 18 0 1 0 0 21 0 1 0 0 Finally, you'll work with SVMs. To be able to use them, you need another R package named e1071 as follows: > library(e1071) Create the svm classifier from the training data as follows: > mySVM <- svm(tumor ~ ., data = train) Then, use your classifier, the model (mySVM object) learned to predict for the test data. You will see the predicted labels for each instance as follows: > testRes_svm <- predict(mySVM, test) > testRes_svm How it works… We started our recipe by loading the input data on tumors. The supervised learning methods we saw in the recipe used two datasets: the training set and test set. The training set carries the class label information. The first part in most of the learning methods shown here, the training set is used to identify a pattern and model the pattern to find a distinction between the classes. This model is then applied on the test set that does not have the class label data to predict the class labels. To identify the training and test sets, we first randomly sample 60 indexes out of the entire data and use the remaining 23 for testing purposes. The supervised learning methods explained in this recipe follow a different principle. LDA attempts to model the difference between classes based on the linear combination of its features. This combination function forms the model based on the training set and is used to predict the classes in the test set. The LDA model trained on 60 samples is then used to predict for the remaining 23 cases. DT is, however, a different method. It forms regression trees that form a set of rules to distinguish one class from the other. The tree learned on a training set is applied to predict classes in test sets or other similar datasets. SVM is a relatively complex technique of classification. It aims to create a hyperplane(s) in the feature space, making the data points separable along these planes. This is done on a training set and is then used to assign classes to new data points. In general, LDA uses linear combination and SVM uses multiple dimensions as the hyperplane for data distinction. In this recipe, we used the svm functionality from the e1071 package, which has many other utilities for learning. We can compare the results obtained by the models we used in this recipe (they can be computed using the provided code on the book's web page). There's more... One of the most popular classifier tools in the machine learning community is WEKA. It is a Java-based tool and implements many libraries to perform classification tasks using DT, LDA, Random Forest, and so on. R supports an interface to the WEKA with a library named RWeka. It is available on the CRAN repository at http://cran.r-project.org/web/packages/RWeka/ . It uses RWekajars, a separate package, to use the Java libraries in it that implement different classifiers. See also The Elements of Statistical Learning book by Hastie, Tibshirani, and Friedman at http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf, which provides more information on LDA, DT, and SVM

0
0
1808

article-image-signal-processing-techniques

Packt

12 Jun 2014

6 min read

Signal Processing Techniques

Packt

12 Jun 2014

6 min read

(For more resources related to this topic, see here.) Introducing the Sunspot data Sunspots are dark spots visible on the Sun's surface. This phenomenon has been studied for many centuries by astronomers. Evidence has been found for periodic sunspot cycles. We can download up-to-date annual sunspot data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. This is provided by the Belgian Solar Influences Data Analysis Center. The data goes back to 1700 and contains more than 300 annual averages. In order to determine sunspot cycles, scientists successfully used the Hilbert-Huang transform (refer to http://en.wikipedia.org/wiki/Hilbert%E2%80%93Huang_transform). A major part of this transform is the so-called Empirical Mode Decomposition (EMD) method. The entire algorithm contains many iterative steps, and we will cover only some of them here. EMD reduces data to a group of Intrinsic Mode Functions (IMF). You can compare this to the way Fast Fourier Transform decomposes a signal in a superposition of sine and cosine terms. Extracting IMFs is done via a sifting process. The sifting of a signal is related to separating out components of a signal one at a time. The first step of this process is identifying local extrema. We will perform the first step and plot the data with the extrema we found. Let's download the data in CSV format. We also need to reverse the array to have it in the correct chronological order. The following code snippet finds the indices of the local minima and maxima respectively: mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] Now we need to concatenate these arrays and use the indices to select the corresponding values. The following code accomplishes that and also plots the data: import numpy as np import sys import matplotlib.pyplot as plt from scipy import signal data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True,skiprows=1) #reverse order data = data[::-1] mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] extrema = np.concatenate((mins, maxs)) year_range = np.arange(1700, 1700 + len(data)) plt.plot(1700 + extrema, data[extrema], 'go') plt.plot(year_range, data) plt.show() We will see the following chart: In this plot, you can see the extrema is indicated with dots. Sifting continued The next steps in the sifting process require us to interpolate with a cubic spline of the minima and maxima. This creates an upper envelope and a lower envelope, which should surround the data. The mean of the envelopes is needed for the next iteration of the EMD process. We can interpolate minima with the following code snippet: spl_min = interpolate.interp1d(mins, data[mins], kind='cubic') min_rng = np.arange(mins.min(), mins.max()) l_env = spl_min(min_rng) Similar code can be used to interpolate the maxima. We need to be aware that the interpolation results are only valid within the range over which we are interpolating. This range is defined by the first occurrence of a minima/maxima and ends at the last occurrence of a minima/maxima. Unfortunately, the interpolation ranges we can define in this way for the maxima and minima do not match perfectly. So, for the purpose of plotting, we need to extract a shorter range that lies within both the maxima and minima interpolation ranges. Have a look at the following code: import numpy as np import sys import matplotlib.pyplot as plt from scipy import signal from scipy import interpolate data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True,skiprows=1) #reverse order data = data[::-1] mins = signal.argrelmin(data)[0] maxs = signal.argrelmax(data)[0] extrema = np.concatenate((mins, maxs)) year_range = np.arange(1700, 1700 + len(data)) spl_min = interpolate.interp1d(mins, data[mins], kind='cubic') min_rng = np.arange(mins.min(), mins.max()) l_env = spl_min(min_rng) spl_max = interpolate.interp1d(maxs, data[maxs], kind='cubic') max_rng = np.arange(maxs.min(), maxs.max()) u_env = spl_max(max_rng) inclusive_rng = np.arange(max(min_rng[0], max_rng[0]), min(min_rng[-1],max_rng[-1])) mid = (spl_max(inclusive_rng) + spl_min(inclusive_rng))/2 plt.plot(year_range, data) plt.plot(1700 + min_rng, l_env, '-x') plt.plot(1700 + max_rng, u_env, '-x') plt.plot(1700 + inclusive_rng, mid, '--') plt.show() The code produces the following chart: What you see is the observed data, with computed envelopes and mid line. Obviously, negative values don't make any sense in this context. However, for the algorithm we only need to care about the mid line of the upper and lower envelopes. In these first two sections, we basically performed the first iteration of the EMD process. The algorithm is a bit more involved, so we will leave it up to you whether or not you want to continue with this analysis on your own. Moving averages Moving averages are tools commonly used to analyze time-series data. A moving average defines a window of previously seen data that is averaged each time the window slides forward one period. The different types of moving average differ essentially in the weights used for averaging. The exponential moving average, for instance, has exponentially decreasing weights with time. This means that older values have less influence than newer values, which is sometimes desirable. We can express an equal-weight strategy for the simple moving average as follows in the NumPy code: weights = np.exp(np.linspace(-1., 0., N)) weights /= weights.sum() A simple moving average uses equal weights which, in code, looks as follows: def sma(arr, n): weights = np.ones(n) / n return np.convolve(weights, arr)[n-1:-n+1] The following code plots the simple moving average for the 11- and 22-year sunspot cycle: import numpy as np import sys import matplotlib.pyplot as plt data = np.loadtxt(sys.argv[1], delimiter=',', usecols=(1,), unpack=True, skiprows=1) #reverse order data = data[::-1] year_range = np.arange(1700, 1700 + len(data)) def sma(arr, n): weights = np.ones(n) / n return np.convolve(weights, arr)[n-1:-n+1] sma11 = sma(data, 11) sma22 = sma(data, 22) plt.plot(year_range, data, label='Data') plt.plot(year_range[10:], sma11, '-x', label='SMA 11') plt.plot(year_range[21:], sma22, '--', label='SMA 22') plt.legend() plt.show() In the following plot, we see the original data and the simple moving averages for 11- and 22-year periods. As you can see, moving averages are not a good fit for this data; this is generally the case for sinusoidal data. Summary This article gave us examples of signal processing and time series analysis. We looked at shifting continued that performs the first iteration of the EMD process. We also learned about Moving averages, which are tools commonly used to analyze time-series data. Resources for Article: Further resources on this subject: Advanced Indexing and Array Concepts [Article] Fast Array Operations with NumPy [Article] Move Further with NumPy Modules [Article]

0
0
2653

How-To Tutorials - Data

Using Canvas and D3

Recommender systems dissected

Amazon DynamoDB - Modelling relationships, Error handling

Importing Dynamic Data

Sharding in Action

Securing the WAL Stream

Making the Most of Your Hadoop Data Lake, Part 2: Optimized File Formats

MapReduce with OpenStack Swift and ZeroVM

Making the Most of Your Hadoop Data Lake, Part 1: Data Compression

The Hunt for Data

Trending Topics

Exact Inference Using Graphical Models

Working with Pentaho Mobile BI

What is Quantitative Finance?

Machine Learning in Bioinformatics

Signal Processing Techniques