How-To Tutorials - Data

Working with Touch Gestures

Packt
03 Jun 2015
5 min read
In this article by Ajit Kumar, the author of Sencha Charts Essentials, we will cover the following topics:

- Touch gestures support in Sencha Charts
- Using gestures on existing charts
- Out-of-the-box interactions
- Creating custom interactions using touch gestures

Interacting with interactions

The interactions code is located under the ext/packages/sencha-charts/src/chart/interactions folder. The Ext.chart.interactions.Abstract class is the base class for all the chart interactions. Interactions must be associated with a chart by configuring interactions on it. Consider the following example:

    Ext.create('Ext.chart.PolarChart', {
        title: 'Chart',
        interactions: ['rotate'],
        ...

The gestures config is an integral part of an interaction: it tells the framework which touch gestures are part of the interaction. It is a map where the event name is the key and the handler method name is its value. Consider the following example:

    gestures: {
        tap: 'onTapGesture',
        doubletap: 'onDoubleTapGesture'
    }

Once an interaction has been associated with a chart, the framework registers the events and their handlers, as listed in the gestures config, on the chart as part of the chart initialization. Here is what happens during each stage of that process:

- The chart's construction starts when its constructor is called, either by a call to Ext.create or through xtype usage.
- The interactions config is applied to the AbstractChart class by the class system, which calls the applyInteractions method.
- The applyInteractions method sets the chart object on each of the interaction objects. This setter operation calls the updateChart method of the interaction base class, Ext.chart.interactions.Abstract.
- The updateChart method calls addChartListener to add the gesture-related events and their handlers.
- The addChartListener method iterates through the gestures object and registers the listed events and their handlers on the chart object.

Interactions work on touch as well as non-touch devices (for example, desktops). On non-touch devices, the gestures are simulated based on mouse or pointer events; for example, mousedown is treated as a tap event.

Using built-in interactions

Here is a list of the built-in interactions:

Crosshair: This interaction helps the user get precise x and y values for a specific point on a chart. Because of this, it is applicable to Cartesian charts only. The x and y values are obtained by single-touch dragging on the chart. The interaction also offers additional configs:

- axes: This can be used to provide label text and label rectangle configs on a per-axis basis using the left, right, top, and bottom configs, or a single config applying to all the axes. If the axes config is not specified, the axis label value is shown as the text and the rectangle is filled with white.
- lines: The interaction draws horizontal and vertical lines through the point on the chart. Line sprite attributes can be passed using the horizontal or vertical configs.
For example, we configure the following crosshair interaction on a CandleStick chart:

    interactions: [{
        type: 'crosshair',
        axes: {
            left: {
                label: { fillStyle: 'white' },
                rect: { fillStyle: 'pink', radius: 2 }
            },
            bottom: {
                label: { fontSize: '14px', fontWeight: 'bold' },
                rect: { fillStyle: 'cyan' }
            }
        }
    }]

The preceding configuration styles the labels and rectangles on the two axes as specified.

CrossZoom: This interaction allows the user to zoom in on a selected area of a chart using drag events. It is very useful for getting a microscopic view of your macroscopic data. For example, if the chart presents month-wise data for two years, using zoom you can look at the values for, say, a specific month. The interaction offers an additional config, axes, with which we indicate the axes that will be zoomed. Consider the following configuration on a CandleStick chart:

    interactions: [{
        type: 'crosszoom',
        axes: ['left', 'bottom']
    }]

With this configuration, the user is able to zoom in on both the left and bottom axes. Additionally, we can control the zoom level by passing minZoom and maxZoom, as shown in the following snippet:

    interactions: [{
        type: 'crosszoom',
        axes: {
            left: { maxZoom: 8, startZoom: 2 },
            bottom: true
        }
    }]

The zoom is reset when the user double-clicks on the chart.

ItemHighlight: This interaction allows the user to highlight series items in the chart. It works in conjunction with the highlight config that is configured on a series. The interaction identifies and sets the highlightItem on a chart, to which the highlight and highlightCfg configs are applied.

PanZoom: This interaction allows the user to navigate the data for one or more chart axes by panning and/or zooming. Navigation can be limited to particular axes. Pinch gestures are used for zooming, whereas drag gestures are used for panning. For devices that do not support multi-touch events, zooming cannot be done via pinch gestures; in this case, the interaction allows the user to perform both zooming and panning using the same single-touch drag gesture. By default, zooming is not enabled. We can enable it by setting zoomOnPanGesture: true on the interaction, as shown here:

    interactions: {
        type: 'panzoom',
        zoomOnPanGesture: true
    }

By default, all the axes are navigable. However, panning and zooming can be controlled at the axis level, as shown here:

    {
        type: 'panzoom',
        axes: {
            left: { maxZoom: 5, allowPan: false },
            bottom: true
        }
    }

Rotate: This interaction allows the user to rotate a polar chart about its centre. It implements the rotation using single-touch drag gestures. This interaction does not have any additional config.

RotatePie3D: This is an extension of the Rotate interaction that rotates a Pie3D chart. It does not have any additional config.

Summary

In this article, you learned how Sencha Charts offers interaction classes to build interactivity into charts. We looked at the out-of-the-box interactions, their specific configurations, and how to use them on different types of charts.

Resources for Article:

Further resources on this subject:
The Various Components in Sencha Touch [Article]
Creating a Simple Application in Sencha Touch [Article]
Sencha Touch: Catering Form Related Needs [Article]

Developing Location-based Services with Neo4j

Packt
02 Jun 2015
22 min read
In this article by Ankur Goel, author of the book, Neo4j Cookbook, we will cover the following recipes: Installing the Neo4j Spatial extension Importing the Esri shapefiles Importing the OpenStreetMap files Importing data using the REST API Creating a point layer using the REST API Finding geometries within the bounding box Finding geometries within a distance Finding geometries within a distance using Cypher By definition, any database that is optimized to store and query the data that represents objects defined in a geometric space is called a spatial database. Although Neo4j is primarily a graph database, due to the importance of geospatial data in today's world, the spatial extension has been introduced in Neo4j as an unmanaged extension. It gives you most of the facilities, which are provided by common geospatial databases along with the power of connectivity through edges, which Neo4j, as a graph database, provides. In this article, we will take a look at some of the widely used use cases of Neo4j as a spatial database, and you will learn how typical geospatial operations can be performed on it. Before proceeding further, you need to install Neo4j on your system using one of the recipies that follow here. The installation will depend on your system type. (For more resources related to this topic, see here.) Single node installation of Neo4j over Linux Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications as well. The following recipe will show you how to set up a single instance of Neo4j over the Linux operating system. Getting ready Perform the following steps to get started with this recipe: Download the community edition of Neo4j from http://www.neo4j.org/download for the Linux platform: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether JDK/JRE is installed for your operating system or not by typing this in the shell prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Linux distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Linux operating system, which is simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status Neo4j can also be monitored using the web console. Open http://<ip>:7474/webadmin, as shown in the following screenshot: The preceding diagram is a screenshot of the web console of Neo4j through which the server can be monitored and different Cypher queries can be run over the graph database. How it works... Neo4j comes with prebuilt binaries over the Linux operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you may face several kind of issues, such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#linux-install. Single node installation of Neo4j over Windows Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as is or can be embedded inside applications. 
The following recipe will show you how to set up a single instance of Neo4j over the Windows operating system. Getting ready Perform the following steps to get started with this recipe: Download the Windows installer from http://www.neo4j.org/download. This has both 32- and 64-bit prebuilt binaries Check whether Java is installed for the operating system or not by typing this in the cmd prompt: echo %JAVA_HOME% If this command throws no output, install Java for your Windows distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the Windows operating system, which is simple, as shown here: Run the installer by clicking on the downloaded file: The preceding screenshot shows the Windows installer running. After the installation is complete, when you run the software, it will ask for the database location. Carefully choose the location as the entire graph database will be stored in this folder: The preceding screenshot shows the Windows installer asking for the graph database's location. The Neo4j browser can be opened by entering http://localhost:7474/ in the browser. The following screenshot depicts Neo4j started over the Windows platform: How it works... Neo4j comes with prebuilt binaries over the Windows operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. See also During installation, you might face several kinds of issues such as max open files and so on. For more information, check out http://neo4j.com/docs/stable/server-installation.html#windows-install. Single node installation of Neo4j over Mac OS X Neo4j is a highly scalable graph database that runs over all the common platforms; it can be used as in a mode and can also be embedded inside applications. The following recipe will show you how to set up a single instance of Neo4j over the OS X operating system. Getting ready Perform the following steps to get started with this recipe: Download the binary version of Neo4j from http://www.neo4j.org/download for the Mac OS X platform and the community edition as shown in the following command: $ wget http://dist.neo4j.org/neo4j-community-2.2.0-M02-unix.tar.gz Check whether Java is installed for the operating system or not by typing this over the cmd prompt: $ echo $JAVA_HOME If this command throws no output, install Java for your Mac OS X distribution and also set the JAVA_HOME path How to do it... Now, let's install Neo4j over the OS X operating system, which is very simple, as shown in the following steps: Extract the TAR file using the following command: $ tar –zxvf neo4j-community-2.2.0-M02-unix.tar.gz $ ls Go to the bin directory under the root folder: $ cd <neo4j-community-2.2.0-M02>/bin/ Start the Neo4j graph database server: $ ./neo4j start Check whether Neo4j is running or not using the following command: $ ./neo4j status How it works... Neo4j comes with prebuilt binaries over the OS X operating system, which can be extracted and run over. Neo4j comes with both web-based and terminal-based consoles, over which the Neo4j graph database can be explored. There's more… Neo4j over Mac OS X can also be installed using brew, which has been explained here. Run the following commands over the shell: $ brew update $ brew install neo4j After this, Neo4j can be started using the start option with the Neo4j command: $ neo4j start This will start the Neo4j server, which can be accessed from the default URL (http://localhost:7474). 
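Regardless of the platform, once the server is started it is worth confirming from code that the REST endpoint is reachable before moving on to the spatial recipes. The following is a minimal sketch, not taken from the book: it assumes the third-party requests library, the default port 7474, and that on Neo4j 2.2 or newer you pass the neo4j user's credentials (older releases accept unauthenticated requests).

```python
# Minimal reachability check for a local Neo4j server (assumes the
# default port 7474; credentials are only needed on Neo4j 2.2+).
import requests

def neo4j_is_up(base_url="http://localhost:7474/db/data/",
                user="neo4j", password="neo4j"):
    try:
        response = requests.get(base_url, auth=(user, password), timeout=5)
    except requests.ConnectionError:
        return False
    # 200 means the REST service root answered; 401 means the server is
    # running but the credentials are wrong.
    return response.status_code == 200

print(neo4j_is_up())
```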
The installation can be reached using the following commands: $ cd /usr/local/Cellar/neo4j/ $ cd {NEO4J_VERSION}/libexec/ You can learn more about OS X installation from http://neo4j.com/docs/stable/server-installation.html#osx-install. Due to the limitation of content that can provided in this article, we assume you would already know how to perform the basic operations using Neo4j such as creating a graph, importing data from different formats into Neo4j, the common configurations used for Neo4j. Installing the Neo4j Spatial extension Neo4j Spatial is a library of utilities for Neo4j that facilitates the enabling of spatial operations on the data. Even on the existing data, geospatial indexes can be added and many geospatial operations can be performed on it. In this recipe, you will learn how to install the Neo4j Spatial extension. Getting ready The following steps will get you started with this recipe: Install Neo4j using the earlier recipes in this article. Install the dependencies listed in the pom.xml file for this project from https://github.com/neo4j-contrib/spatial/blob/master/pom.xml. Install Maven using the following command for your operating system: For Debian systems: apt-get install maven For Redhat/Centos systems: yum install apache-maven To install on a Windows-based system, please refer to https://maven.apache.org/guides/getting-started/windows-prerequisites.html. How to do it... Now, let's install the Neo4j Spatial plugin, which is very simple to do, by following these steps: Clone the GitHub repository for spatial extension: git clone git://github.com/neo4j/spatial spatial Move into the spatial directory: cd spatial Build the code using Maven. This will download all the dependencies, compile the library, run the tests, and install the artifacts in the local repository: mvn clean install Move the built artifact to the Neo4j plugin directory: unzip target/neo4j/neo4j-spatial-0.11-SNAPSHOT-server-plugin.zip $NEO4J_ROOT_DIR/plugins/ Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart Check whether the Neo4j Spatial plugin is properly installed or not: curl –L http://<neo4j_server_ip>:<port>/db/data If you are using Neo4j 2.2 or higher, then use the following command: curl --user neo4j:<password> http://localhost:7474/db/data/ The output will look like what is shown in the following screenshot, which shows the Neo4j Spatial plugin installed: How it works... Neo4j Spatial is a library of utilities that helps perform geospatial operations on the dataset, which is present in the Neo4j graph database. You can add geospatial indexes on the existing data and perform operations, such as data within a specified region or within some distance of point of interest. Neo4j Spatial comes as an unmanaged extension, which can be easily installed as well as removed from Neo4j. The extension does not interfere with any of the core functionality. There's more… To read more about Neo4j Spatial extension, we encourage users to visit the GitHub repository at https://github.com/neo4j-contrib/spatial. Also, it will be good to read about the Neo4j unmanaged extension in general (http://neo4j.com/docs/stable/server-unmanaged-extensions.html). Importing the Esri shapefiles The shapefile format is a popular geospatial vector data format for the Geographic Information System (GIS) software. It is developed and regulated by Esri as an open specification for data interoperability among Esri. 
It is very popular among GIS products, and many times, the data format is in Esri shapefiles. The main file is the .shp file, which contains the geometry data. The binary data file consists of a single, fixed-length header followed by variable-length data records. In this recipe, you will learn how to import the Esri shapefiles in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Since the Esri shapefile format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data using the Java API from it using the following steps: Download a sample .shp file from http://www.statsilk.com/maps/download-free-shapefile-maps. Execute the following commands: wget http://biogeo.ucdavis.edu/data/diva/adm/AFG_adm.zip unzip AFG_adm.zip mv AFG_adm1.* /data The ShapefileImporter method lets you import data from the Esri shapefile using the following code: GraphDatabaseService esri_database = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir); try {    ShapefileImporter importer = new ShapefileImporter(esri_database); importer.importFile("/data/AFG_adm1.shp", "layer_afganistan");        } finally {            esri_database.shutdown(); } Using similar code, we can import multiple SHP files into the same layer or different layers, as shown in the following code snippet: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".shp"); }};   File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {    importer.importFile(fileEntry.toString(), "layer_afganistan"); } catch(Exception e){    esri_database.shutdown(); }} How it works... The Neo4j Spatial extension natively supports the import of data in the shapefile format. Using the ShapefileImporter method, any SHP file can be easily imported into Neo4j. The ShapefileImporter method takes two arguments: the first argument is the path to the SHP files and the second is the layer in which it should be imported. There's more… We will encourage you to read more about shapefiles and layers in general; for this, please visit the following URLs for more information: http://en.wikipedia.org/wiki/Shapefile http://wiki.openstreetmap.org/wiki/Shapefiles http://www.gdal.org/drv_shapefile.html Importing the OpenStreetMap files OpenStreetMap is a powerhouse of data when it comes to geospatial data. It is a collaborative project to create a free, editable map of the world. OpenStreetMap provides geospatial data in the .osm file format. To read more about .osm files in general, check out http://wiki.openstreetmap.org/wiki/.osm. In this recipe, you will learn how to import the .osm files in the Neo4j graph database. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... 
Since the OSM file format is, by default, supported by the Neo4j Spatial extension, it is very easy to import data from it using the following steps: Download one sample .osm file from http://wiki.openstreetmap.org/wiki/Planet.osm#Downloading. Execute the following commands: wget http://download.geofabrik.de/africa-latest.osm.bz2 bunzip2 africa-latest.osm.bz2 mv africa-latest.osm /data The importfile method lets you import data from the .osm file, as shown in the following code snippet: OSMImporter importer = new OSMImporter("africa"); try {    importer.importFile(osm_database, "/data/botswana-latest.osm", false, 5000, true); } catch(Exception e){    osm_database.shutdown(); } importer.reIndex(osm_database,10000); Using similar code, we can import multiple OSM files into the same layer or different layers, as shown here: File dir = new File("/data");      FilenameFilter filter = new FilenameFilter() {          public boolean accept(File dir, String name) {      return name.endsWith(".osm"); }}; File[] listOfFiles = dir.listFiles(filter); for (final File fileEntry : listOfFiles) {      System.out.println("FileEntry Directory "+fileEntry);    try {importer.importFile(osm_database, fileEntry.toString(), false, 5000, true); importer.reIndex(osm_database,10000); } catch(Exception e){    osm_database.shutdown(); } How it works... This is slightly more complex as it requires two phases: the first phase requires a batch inserter performing insertions into the database, and the second phase requires reindexing of the database with the spatial indexes. There's more… We will encourage you to read more about the OSM file and the batch inserter in general; for this, visit the following URLs: http://en.wikipedia.org/wiki/OpenStreetMap http://wiki.openstreetmap.org/wiki/OSM_file_formats http://neo4j.com/api_docs/2.0.2/org/neo4j/unsafe/batchinsert/BatchInserter.html Importing data using the REST API The recipes that you have learned until now consist of Java code, which is used to import spatial data into Neo4j. However, by using any other programming language, such as Python or Ruby, spatial data can be easily imported into Neo4j using the REST interface. In this recipe, you will learn how to import geospatial data using the REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... Using the REST interface is a very simple three-stage process to import the geospatial data into the Neo4j graph database server. 
For the sake of simplicity, the code of the Python language has been used to explain this recipe, although you can also use curl for this recipe: Create the spatial index, as shown in the following code: # Create geom index url = http://<neo4j_server_ip>:<port>/db/data/index/node/ payload= {    "name" : "geom",    "config" : {        "provider" : "spatial",        "geometry_type" : "point",        "lat" : "lat",        "lon" : "lon"    } } Create nodes as lat/lng and data as properties, as shown in the following code: url = "http://<neo4j_server_ip>:<port>/db/data/node" payload = {'lon': 38.6, 'lat': 67.88, 'name': 'abc'} req = urllib2.Request(url) req.add_header('Content-Type', 'application/json') response = urllib2.urlopen(req, json.dumps(payload)) node = json.loads(response.read())['self'] Add the preceding created node to the geospatial index, as shown in the following code snippet: #add node to geom index url = "http://<neo4j_server_ip>:<port>/db/data/index/node/geom" payload = {'value': 'dummy', 'key': 'dummy', 'uri': node} req = urllib2.Request(url) req.add_header('Content-Type', 'application/json') response = urllib2.urlopen(req, json.dumps(payload)) print response.read() The data will look like what is shown in the following screenshot after the addition of a few more nodes; this screenshot depicts the Neo4j Spatial data that has been imported: The following screenshot depicts the properties of a single node, which has been imported into Neo4j: How it works... Adding geospatial data using the REST API is a three-step process, listed as follows: Create a geospatial index using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/ Add a node to the Neo4j graph database using an endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/node Add the created node to the geospatial index using the endpoint, by following this URL as a template: http://<neo4j_server_ip>:<port>/db/data/index/node/geom There's more… We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). Creating a point layer using the REST API In this recipe, you will learn how to create a point layer using the REST API interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the http://<neo4j_server_ip>:/db/data/ext/ SpatialPlugin/graphdb/addSimplePointlayer endpoint to create a simple point layer. Let's add a simple point layer, as shown in the following code: "layer" : "geom", "lat"   : "lat" , "lon"   : "lon",url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb/addSimplePointlayer payload= { "layer" : "geom", "lat"   : "lat" , "lon"   : "lon", } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the create point in layer query: How it works... Creating a point in the layer query is based on the REST interface, which the Neo4j Spatial plugin already provides with it. 
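The urllib2 snippets above lose their line breaks in this format, so here is the complete three-step REST import from the previous recipe as one self-contained script. This is a sketch only: it assumes the requests library, a server on localhost:7474 with the spatial plugin installed, and the same example values used above; add auth=('neo4j', '<password>') to each call on Neo4j 2.2 or newer.

```python
# Three-step geospatial import over REST: create the spatial index,
# create a node carrying lat/lon properties, then add it to the index.
import json
import requests

BASE = "http://localhost:7474/db/data"
HEADERS = {"Content-Type": "application/json"}

# 1. Create the "geom" spatial index.
index_cfg = {
    "name": "geom",
    "config": {
        "provider": "spatial",
        "geometry_type": "point",
        "lat": "lat",
        "lon": "lon",
    },
}
requests.post(BASE + "/index/node/", data=json.dumps(index_cfg),
              headers=HEADERS)

# 2. Create a node with the coordinates stored as ordinary properties.
node = {"lon": 38.6, "lat": 67.88, "name": "abc"}
resp = requests.post(BASE + "/node", data=json.dumps(node), headers=HEADERS)
node_uri = resp.json()["self"]

# 3. Register the node with the geom index so spatial queries can find it.
entry = {"value": "dummy", "key": "dummy", "uri": node_uri}
resp = requests.post(BASE + "/index/node/geom", data=json.dumps(entry),
                     headers=HEADERS)
print(resp.status_code)
```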
There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within the bounding box In this recipe, you will learn how to find all the geometries within the bounding box using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within the bounding box:http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesInBBox. Let's find the all the geometries, using the following information: "minx" : 0.0, "maxx" : 100.0, "miny" : 0.0, "maxy" : 100.0 url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb payload= { "layer" : "geom", "minx" : 0.0, "maxx" : 100.3, "miny" : 0.0, "maxy" : 100.0 } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of the bounding box query: How it works... Finding geometries in the bounding box is based on the REST interface, which the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, you can see node id54 returned as the output. There's more… We will encourage you to read more about spatial REST interfaces in general; to do this, visit http://neo4j-contrib.github.io/spatial/. Finding geometries within a distance In this recipe, you will learn how to find all the geometries within a distance using the spatial REST interface. Getting ready Perform the following steps to get started with this recipe: Install Neo4j using the earlier recipies in this article. Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article. Restart the Neo4j graph database server using the following command: $NEO4J_ROOT_DIR/bin/neo4j restart How to do it... In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance. Let's find all the geometries between the specified distance using the following information: "pointX" : -116.67, "pointY" : 46.89, "distanceinKm" : 500, url = "http://<neo4j_server_ip>:<port>//db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance payload= { "layer" : "geom", "pointY" : 46.8625, "pointX" : -114.0117, "distanceInKm" : 500, } r = requests.post(url, data=json.dumps(payload), headers=headers) The data will look like what is shown in the following screenshot; this screenshot shows the output of a withinDistance query: How it works... Finding geometries within a distance is based on the REST interface that the Neo4j Spatial plugin provides. The output of the REST call contains an array of the nodes, containing the node's id, lat/lng, and its incoming/outgoing relationships. In the preceding output, we can see node id71 returned as the output. There's more… We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/). 
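For reference, here is the withinDistance call from the preceding recipe as a runnable script. Again, this is a sketch: it assumes the requests library, a local server with the geom layer populated, and that the response is the usual array of REST node representations described above.

```python
# Find all geometries within 500 km of a point, using the spatial
# plugin's findGeometriesWithinDistance endpoint.
import json
import requests

url = ("http://localhost:7474/db/data/ext/SpatialPlugin/"
       "graphdb/findGeometriesWithinDistance")
payload = {
    "layer": "geom",
    "pointX": -114.0117,   # longitude
    "pointY": 46.8625,     # latitude
    "distanceInKm": 500,
}
headers = {"Content-Type": "application/json"}

r = requests.post(url, data=json.dumps(payload), headers=headers)
# Each entry should be a standard REST node representation with a
# "self" URI and a "data" map holding lat/lon and other properties.
for node in r.json():
    print(node["self"], node["data"])
```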
Finding geometries within a distance using Cypher

In this recipe, you will learn how to find all the geometries within a distance using a Cypher query.

Getting ready

Perform the following steps to get started with this recipe:

- Install Neo4j using the earlier recipes in this article.
- Install the Neo4j Spatial plugin using the recipe Installing the Neo4j Spatial extension, from this article.
- Restart the Neo4j graph database server: $NEO4J_ROOT_DIR/bin/neo4j restart

How to do it...

In this recipe, we will use the following endpoint to find all the geometries within a certain distance: http://<neo4j_server_ip>:<port>/db/data/cypher

Let's find all the geometries within a distance using a Cypher query:

    url = "http://<neo4j_server_ip>:<port>/db/data/cypher"
    payload = {
        "query": "START n=node:geom('withinDistance:[46.9163, -114.0905, 500.0]') RETURN n"
    }
    r = requests.post(url, data=json.dumps(payload), headers=headers)

The response contains the matching nodes; the same query can also be run directly in the Neo4j console.

How it works...

Cypher comes with a withinDistance query, which takes three parameters: lat, lon, and the search distance.

There's more…

We encourage you to read more about the spatial REST interfaces in general (http://neo4j-contrib.github.io/spatial/).

Summary

Developing Location-based Services with Neo4j teaches you about the most important aspect of today's data, location, and how to deal with it in Neo4j. You have learned how to import geospatial data into Neo4j and run queries such as proximity searches, bounding boxes, and so on.

Resources for Article:

Further resources on this subject:
Recommender systems dissected Components [article]
Working with a Neo4j Embedded Database [article]
Differences in style between Java and Scala code [article]

Basic Image Processing

Packt
02 Jun 2015
8 min read
In this article, Ashwin Pajankar, the author of the book Raspberry Pi Computer Vision Programming, takes us through basic image processing in OpenCV. We will do this with the help of the following topics:

- Image arithmetic operations: adding, subtracting, and blending images
- Splitting color channels in an image
- Negating an image
- Performing logical operations on an image

This article is very short and easy to code, with plenty of hands-on activities.

Arithmetic operations on images

In this section, we will have a look at the various arithmetic operations that can be performed on images. Images are represented as matrices in OpenCV, so arithmetic operations on images are similar to arithmetic operations on matrices. Images must be of the same size for you to perform arithmetic operations on them, and these operations are performed on individual pixels.

- cv2.add(): This function is used to add two images, where the images are passed as parameters.
- cv2.subtract(): This function is used to subtract an image from another. We know that the subtraction operation is not commutative, so cv2.subtract(img1,img2) and cv2.subtract(img2,img1) will yield different results, whereas cv2.add(img1,img2) and cv2.add(img2,img1) will yield the same result, as the addition operation is commutative. Both images have to be of the same size and type, as explained before.

Check out the following code:

    import cv2
    img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
    img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1)
    cv2.imshow('Image1',img1)
    cv2.waitKey(0)
    cv2.imshow('Image2',img2)
    cv2.waitKey(0)
    cv2.imshow('Addition',cv2.add(img1,img2))
    cv2.waitKey(0)
    cv2.imshow('Image1-Image2',cv2.subtract(img1,img2))
    cv2.waitKey(0)
    cv2.imshow('Image2-Image1',cv2.subtract(img2,img1))
    cv2.waitKey(0)
    cv2.destroyAllWindows()

The preceding code demonstrates the usage of arithmetic functions on images, opening output windows for Image1, Image2, Addition, Image1-Image2, and Image2-Image1 in turn (the screenshots are not reproduced here).

Blending and transitioning images

The cv2.addWeighted() function calculates the weighted sum of two images. Because of the weight factor, it provides a blending effect to the images. Add the following lines of code before destroyAllWindows() in the previous code listing to see this function in action:

    cv2.imshow('Blending',cv2.addWeighted(img1,0.5,img2,0.5,0))
    cv2.waitKey(0)

In the preceding code, we passed the following five arguments to the addWeighted() function:

- Img1: This is the first image.
- Alpha: This is the weight factor for the first image (0.5 in the example).
- Img2: This is the second image.
- Beta: This is the weight factor for the second image (0.5 in the example).
- Gamma: This is the scalar value (0 in the example).

The output image value is calculated with the following formula:

    output = (img1 * alpha) + (img2 * beta) + gamma

This operation is performed on every individual pixel. We can create a film-style transition effect between the two images by using the same function.
Check out the output of the following code, which creates a smooth transition from one image to another:

    import cv2
    import numpy as np
    import time

    img1 = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
    img2 = cv2.imread('/home/pi/book/test_set/4.2.04.tiff',1)

    for i in np.linspace(0,1,40):
        alpha = i
        beta = 1 - alpha
        print 'ALPHA =' + str(alpha) + ' BETA =' + str(beta)
        cv2.imshow('Image Transition',
            cv2.addWeighted(img1,alpha,img2,beta,0))
        time.sleep(0.05)
        if cv2.waitKey(1) == 27:
            break

    cv2.destroyAllWindows()

Splitting and merging image colour channels

On several occasions, we may be interested in working separately with the red, green, and blue channels. For example, we might want to build a histogram for every channel of an image. Here, cv2.split() is used to split an image into three different intensity arrays for each color channel, whereas cv2.merge() is used to merge different arrays into a single multi-channel array, that is, a color image. The following example demonstrates this:

    import cv2
    img = cv2.imread('/home/pi/book/test_set/4.2.03.tiff',1)
    b,g,r = cv2.split(img)
    cv2.imshow('Blue Channel',b)
    cv2.imshow('Green Channel',g)
    cv2.imshow('Red Channel',r)
    img = cv2.merge((b,g,r))
    cv2.imshow('Merged Output',img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

The preceding program first splits the image into three channels (blue, green, and red) and then displays each one of them. The separate channels only hold the intensity values of the particular color, so they are essentially displayed as grayscale intensity images. The program then merges all the channels back into an image and displays it.

Creating a negative of an image

In mathematical terms, the negative of an image is the inversion of its colors. For a grayscale image, it is even simpler! The negative of a grayscale image is just the intensity inversion, which can be achieved by finding the complement of the intensity from 255. A pixel value ranges from 0 to 255, and therefore, negation involves subtracting the pixel value from the maximum value, that is, 255. The code for this is as follows:

    import cv2
    img = cv2.imread('/home/pi/book/test_set/4.2.07.tiff')
    grayscale = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
    negative = abs(255-grayscale)
    cv2.imshow('Original',img)
    cv2.imshow('Grayscale',grayscale)
    cv2.imshow('Negative',negative)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

The code displays the original, grayscale, and negative images in separate windows. The negative of a negative is the original grayscale image; try this on your own by taking the negative of the negative again.

Logical operations on images

OpenCV provides bitwise logical operation functions for images. We will have a look at the functions that provide the bitwise logical AND, OR, XOR (exclusive OR), and NOT (inversion) functionality. These functions can be better demonstrated visually with grayscale images. I am going to use barcode images in horizontal and vertical orientation for demonstration.
Let's have a look at the following code:

    import cv2
    import matplotlib.pyplot as plt

    img1 = cv2.imread('/home/pi/book/test_set/Barcode_Hor.png',0)
    img2 = cv2.imread('/home/pi/book/test_set/Barcode_Ver.png',0)
    not_out = cv2.bitwise_not(img1)
    and_out = cv2.bitwise_and(img1,img2)
    or_out = cv2.bitwise_or(img1,img2)
    xor_out = cv2.bitwise_xor(img1,img2)

    titles = ['Image 1','Image 2','Image 1 NOT','AND','OR','XOR']
    images = [img1,img2,not_out,and_out,or_out,xor_out]

    for i in xrange(6):
        plt.subplot(2,3,i+1)
        plt.imshow(images[i],cmap='gray')
        plt.title(titles[i])
        plt.xticks([]),plt.yticks([])
    plt.show()

We first read the images in grayscale mode, calculated the NOT, AND, OR, and XOR outputs, and then displayed them in a neat way with matplotlib. We leveraged the plt.subplot() function to display multiple images. In the preceding example, we created a grid with two rows and three columns for our images and displayed each image in its own cell of the grid. You can modify this line and change it to plt.subplot(3,2,i+1) to create a grid with three rows and two columns.

We can also use this technique without a loop. For each image, you have to write the following statements. I will write the code for the first image only; go ahead and write it for the rest of the five images:

    plt.subplot(2,3,1)
    plt.imshow(img1,cmap='gray')
    plt.title('Image 1')
    plt.xticks([]),plt.yticks([])

Finally, use plt.show() to display the result. This technique avoids the loop when only a very small number of images, usually two or three, have to be displayed. In the resulting output, note that the logical NOT operation is the negative of the image.

Exercise

You may want to have a look at the functionality of cv2.copyMakeBorder(). This function is used to create borders and padding for images, and many of you will find it useful for your projects. Exploring this function is left as an exercise for the reader. You can check the Python OpenCV API documentation at the following location: http://docs.opencv.org/modules/refman.html

Summary

In this article, we learned how to perform arithmetic and logical operations on images and split images by their channels. We also learned how to display multiple images in a grid by using matplotlib.

Resources for Article:

Further resources on this subject:
Raspberry Pi and 1-Wire [article]
Raspberry Pi Gaming Operating Systems [article]
The Raspberry Pi and Raspbian [article]

Query complete/suggest

Packt
25 May 2015
37 min read
This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server - Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: (For more resources related to this topic, see here.) Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. 
If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. 
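To make the query construction concrete, here is how a search client might rewrite the user's input so that only the last, partially typed word hits the n-grams field. This is a sketch rather than anything shipped with the book's examples: it assumes the requests library, the mbartists core, and the a_name / a_name_wordedge fields from the URL above, and it adds fl and wt parameters purely for convenience.

```python
# Rewrite the last word of the input to target the edge n-gram field,
# then run an edismax query against the mbartists core.
import requests

SOLR_SELECT = "http://localhost:8983/solr/mbartists/select"

def instant_search(text, rows=5):
    words = text.split()
    if not words:
        return []
    # Only the partially typed last word searches the n-grams field.
    words[-1] = "a_name_wordedge:" + words[-1]
    params = {
        "defType": "edismax",
        "qf": "a_name",
        "q.op": "AND",
        "q": " ".join(words),
        "rows": rows,
        "fl": "a_name",
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    return [doc["a_name"] for doc in docs]

print(instant_search("simple mi"))
```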
Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. 
Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. 
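If you would rather trigger the build explicitly, for example from a bulk data-loading script, the suggest.build parameter can be sent to the request handler. A minimal sketch, assuming the /a_term_suggest handler defined above and the requests library:

```python
# Ask the SuggestComponent to (re)build its dictionary on demand.
import requests

BUILD_URL = "http://localhost:8983/solr/mbartists/a_term_suggest"
response = requests.get(BUILD_URL, params={"suggest.build": "true",
                                           "wt": "json"})
print(response.status_code, response.json())
```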
If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. 
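For instance, a client could ask several dictionaries for completions in one request by repeating the suggest.dictionary parameter. This is a sketch under assumptions: it uses the requests library, the /a_term_suggest handler defined earlier, and the a_name_suggest dictionary that is configured in the next section.

```python
# Ask two suggester dictionaries for completions in a single request by
# repeating suggest.dictionary (requests expands the list into repeated
# HTTP parameters).
import requests

url = "http://localhost:8983/solr/mbartists/a_term_suggest"
params = {
    "suggest": "true",
    "suggest.dictionary": ["a_term_suggest", "a_name_suggest"],
    "q": "sma",
    "wt": "json",
}
response = requests.get(url, params=params).json()
# One result block per dictionary name.
print(list(response["suggest"].keys()))
```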
To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. 
This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server, Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. 
If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. 
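The client-side rewrite described above (searching the last word against the wordedge field) is easy to script. The following is a minimal sketch of our own, not taken from the book, using Python and the requests library; the core name, field names, and handler parameters mirror the example URL above, but the helper function itself is our illustration:

import requests  # assumption: the requests library is available

SOLR_SELECT = "http://localhost:8983/solr/mbartists/select"

def instant_search(user_text, rows=5):
    # Search all but the last word normally, and the last word as a
    # prefix against the edge n-gram field, as described above.
    words = user_text.split()
    if not words:
        return []
    q = " ".join(words[:-1] + ["a_name_wordedge:" + words[-1]])
    params = {"defType": "edismax", "qf": "a_name", "q.op": "AND",
              "q": q, "rows": rows, "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# Example, mirroring the URL above:
# print(instant_search("simple mi"))

A wrapper like this keeps the last-word rewriting logic in one place, which makes it easy to add filter queries or boosts later without touching the search box code.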
Query term completion via facet.prefix

Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications.

Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications.

Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly.

For this example, we have a search box searching track names and it contains the following:

michael ja

All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed.

Here is the URL for such a search:

http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja

When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result:

{
 "response":{"numFound":1919,"start":0,"docs":[]},
 "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "a_spell":[
        "jackson",17,
        "james",15,
        "jason",4,
        "jay",4,
        "jacobs",2]},
    "facet_dates":{},
    "facet_ranges":{}}}

This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature:

You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results.
Remember that facet.prefix will probably need to be lowercased, depending on text analysis.

If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements.

If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts.

Query term completion via the Suggester

A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight.

We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml:

<requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy">
 <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">a_term_suggest</str>
    <str name="suggest.count">5</str>
 </lst>
 <arr name="components">
    <str>aTermSuggester</str>
 </arr>
</requestHandler>

<searchComponent name="aTermSuggester" class="solr.SuggestComponent">
 <lst name="suggester">
    <str name="name">a_term_suggest</str>
    <str name="lookupImpl">WFSTLookupFactory</str>
    <str name="field">a_spell</str>
    <!-- <float name="threshold">0.005</float> -->
    <str name="buildOnOptimize">true</str>
 </lst>
</searchComponent>

The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised.

The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure.
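A build can be triggered with a plain HTTP request to the handler. The following is a minimal sketch of our own (not from the book) using Python and the requests library; the core name and handler path match the examples in this article, but the suggest.build parameter and the surrounding script are our illustration, so confirm the parameter against your Solr version's reference guide:

import requests  # assumption: the requests library is available

SOLR = "http://localhost:8983/solr/mbartists"

def build_term_suggester():
    # Ask the SuggestComponent to (re)build its dictionary.
    # suggest.build=true is the documented SuggestComponent parameter,
    # but verify it for the Solr version you are running.
    resp = requests.get(SOLR + "/a_term_suggest",
                        params={"suggest.build": "true", "wt": "json"},
                        timeout=60)
    resp.raise_for_status()
    return resp.json()

# build_term_suggester()

A call like this can be scheduled from cron or appended to a bulk indexing script, which is exactly the kind of URL fetch mentioned below.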
If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results.

The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop.

Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma:

http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json

And here is the output, indented:

{
 "responseHeader":{
    "status":0,
    "QTime":1},
 "suggest":{"a_term_suggest":{
    "sma":{
      "numFound":5,
      "suggestions":[{
        "term":"sma",
        "weight":3,
        "payload":""},
      {
        "term":"small",
        "weight":110,
        "payload":""},
      {
        "term":"smart",
        "weight":50,
        "payload":""},
      {
        "term":"smash",
        "weight":36,
        "payload":""},
      {
        "term":"smalley",
        "weight":9,
        "payload":""}]}}}}

If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value.

For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options.

Query term completion via the Terms component

The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.

Field-value completion via the Suggester

In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times.
To complete values across many fields at once, you should consider an alternative approach to the one described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling.

We'll use the Suggester once again, but with a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough.

Here's the required configuration for the suggest component:

<searchComponent name="aNameSuggester" class="solr.SuggestComponent">
 <lst name="suggester">
    <str name="name">a_name_suggest</str>
    <str name="lookupImpl">AnalyzingLookupFactory</str>
    <str name="field">a_name_sort</str>
    <str name="buildOnOptimize">true</str>
    <str name="storeDir">a_name_suggest</str>
    <str name="suggestAnalyzerFieldType">textSpell</str>
 </lst>
</searchComponent>

And here is the request handler that uses it:

<requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy">
 <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">a_name_suggest</str>
    <str name="suggest.count">5</str>
 </lst>
 <arr name="components">
    <str>aNameSuggester</str>
 </arr>
</requestHandler>

We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query.

Here's an example that will make use of this Suggester:

http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum

Here's the response from that query:

{
 "spellcheck":{
    "suggestions":[
      "The smashing,pum",{
        "numFound":1,
        "startOffset":0,
        "endOffset":16,
        "suggestion":["Smashing Pumpkins, The"]},
      "collation","(Smashing Pumpkins, The)"]}}

As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive!

It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Others are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information.

Summary

In this article, we learned about the query complete/suggest feature and saw the different ways in which it can be implemented.
Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article]

Cleaning Data in PDF Files

Packt
25 May 2015
15 min read
In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics: What PDF files are for and why it is difficult to extract data from them How to copy and paste from PDF files, and what to do when this does not work How to shrink a PDF file by saving only the pages that we need How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner How to extract tabular data from within a PDF file using a browser-based Java application called Tabula How to use the full, paid version of Adobe Acrobat to extract a table of data (For more resources related to this topic, see here.) Why is cleaning PDF files difficult? Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout. In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used. Here is a fun fact about the early days of Acrobat Reader. The words click here when entered into Google search engine still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site. PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this. Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on. 
So most of the tools in our trusty data cleaning toolbox, such as text editors and command-line tools (less) are largely useless with PDF files. Fortunately there are still a few tricks we can use to get the data out of a PDF file. Try simple solutions first – copying Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try. Our experimental file Let's practice cleaning PDF data by using a real PDF file. We also do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring if attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like: This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found at the PewResearch website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf. Step one – try copying out the data we want The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143); it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or select Edit | Copy. How text looks when selected in this PDF from within Preview Step two – try pasting the copied data into a text editor The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor: Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4,4,3,2; but in the pasted version, this becomes a single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it. 
Step three – make a smaller version of the file

Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on, and save that section to a separate file.

To do this in Preview on Mac OSX, launch the File | Print… dialog box. In the Pages area, we will enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149; so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like.

Another technique to try – pdfMiner

Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly.

Step one – install pdfMiner

Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command:

pip install pdfminer

This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/.

Step two – pull text from the PDF file

We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py -p149 <filename> to specify that you only want page 149.

Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command:

pdf2txt.py -p3 pewReport.pdf

After running this command, the extracted preface of the Pew Research report appears in our command-line window. To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows:

pdf2txt.py -p3 pewReport.pdf > pewPreface.txt

But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command:

pdf2txt.py pewReport149.pdf

The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns are shown out of order. We will have to declare the tabular data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line text-only extraction. Remember that your success with each of these tools may vary.
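Because pdf2txt.py is just a command-line tool, it is easy to wrap in a few lines of Python when you need to repeat the extraction for several pages or files. This is a minimal sketch of our own, not from the book; it assumes pdf2txt.py is on your PATH, as it is after the pip install above:

import subprocess  # run the pdf2txt.py command-line tool

def extract_pages(pdf_path, pages, out_path):
    # pages is an iterable of page numbers, for example [3, 149]
    page_arg = "-p" + ",".join(str(p) for p in pages)
    result = subprocess.run(["pdf2txt.py", page_arg, pdf_path],
                            capture_output=True, text=True, check=True)
    with open(out_path, "w") as out:
        out.write(result.stdout)

# Example, mirroring the commands above:
# extract_pages("pewReport.pdf", [3], "pewPreface.txt")

As noted, though, success with pdf2txt still varies from file to file.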
Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it.

Third choice – Tabula

Tabula is a Java-based program to extract data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file.

Step one – download Tabula

Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions.

Step two – run Tabula

Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this:

The warning that auto-detecting tables takes a long time is true. For the single-page pewReport149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file.

Step three – direct Tabula to extract the data

Once Tabula reads in the file, it is time to direct it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows:

Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula reads in the table, you can repeat this process by clearing your selection and redrawing the rectangle.

Step four – copy the data out

We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV.

Step five – more cleaning

Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now. Here are some simple data cleaning tasks:

We can combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. "Prepare students to be productive and members of the workforce" should be in one cell as a single phrase. The same is true for the headers in Rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste-Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2.

Remove blank lines between rows.

When these procedures are finished, the data looks like this:

Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason—but do not use them unless you really need them. Start with a simple solution first and only proceed to a more difficult tool when you really need it.
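Before moving on, note that if you would rather script the cleanup above than do it by hand in Excel, the same two steps (merging the wrapped label cells and dropping blank rows) can be done with Python's csv module. This is a sketch of our own, not from the book; the filenames and the assumption that a continuation row contains text only in the label column are ours:

import csv  # standard library

def clean_tabula_csv(in_path, out_path, label_col=1):
    # Read the export and drop completely blank rows.
    with open(in_path, newline="") as f:
        rows = [r for r in csv.reader(f) if any(c.strip() for c in r)]

    cleaned = []
    for row in rows:
        others = [c for i, c in enumerate(row) if i != label_col]
        # Assumption: a row with text only in the label column is the
        # continuation of a wrapped cell, so append it to the previous row.
        is_continuation = (cleaned and len(row) > label_col
                           and row[label_col].strip()
                           and not any(c.strip() for c in others))
        if is_continuation:
            cleaned[-1][label_col] = (cleaned[-1][label_col].strip() + " "
                                      + row[label_col].strip())
        else:
            cleaned.append(row)

    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(cleaned)

# Hypothetical filenames:
# clean_tabula_csv("tabula-page149.csv", "page149-clean.csv")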
When all else fails – fourth technique Adobe Systems sells a paid, commercial version of their Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. One of the features that is relevant here is the Export Selection As… option found within Acrobat. To get started using this feature, launch Acrobat and use the File Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As… At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not want to also edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and location for where to save the file. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file will look when opened in Excel: At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the B2:B3 cells into a single cell and then removed the empty rows. Summary The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses: Copying and pasting costs nothing and takes very little work, but is not as effective with complicated tables. pdfMiner/Pdf2txt is also free, and as a command-line tool, it could be automated. It also works on large amounts of data. But like copying and pasting, it is easily confused by certain types of tables. Tabula takes some work to set up, and since it is a product undergoing development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables. Acrobat gives similar output to Tabula, but with almost no setup and very little effort. It is a paid product. By the end, we had a clean dataset that was ready for analysis or long-term storage. Resources for Article: Further resources on this subject: Machine Learning Using Spark MLlib [article] Data visualization [article] First steps with R [article]

Learning D3.js Mapping

Packt
08 May 2015
3 min read
What is D3.js? D3.js (Data-Driven Documents) is a JavaScript library used for data visualization. D3.js is used to display graphical representations of information in a browser using JavaScript. Because they are, essentially, a collection of JavaScript code, graphical elements produced by D3.js can react to changes made on the client or server side. D3.js has seen major adoption as websites include more dynamic charts, graphs, infographics and other forms of visualized data. (For more resources related to this topic, see here.) Why this book? This book by the authors, Thomas Newton and Oscar Villarreal, explores the JavaScript library, D3.js, and its ability to help us create maps and amazing visualizations. You will no longer be confined to third-party tools in order to get a nice looking map. With D3.js, you can build your own maps and customize them as you please. This book will go from the basics of SVG and JavaScript to data trimming and modification with TopoJSON. Using D3.js to glue together these three key ingredients, we will create very attractive maps that will cover many common use cases for maps, such as choropleths, data overlays on maps, and interactivity. Key features Dive into D3.js and apply its powerful data binding ability in order to create stunning visualizations Learn the key concepts of SVG, JavaScript, CSS, and the DOM in order to project images onto the browser Solve a wide range of problems faced while building interactive maps with this solution-based guide Authors Thomas Newton has 20 years of experience in the technical industry, working on everything from low-level system designs and data visualization to software design and architecture. Currently, he is creating data visualizations to solve analytical problems for clients. Oscar Villarreal has been developing interfaces for the past 10 years, and most recently, he has been focusing on data visualization and web applications. In his spare time, he loves to write on his blog, oscarvillarreal.com. In short You will find a few books on D3.js but they require intermediate-level developers who already know how to use D3.js, a few of them will cover a much wider range of D3 usage, whereas this book is focused exclusively on mapping; it fully explores this core task, and takes a solution-based approach. Recommendations and all the wealthy knowledge that authors have shared in the book is based on many years of experience and many projects delivered using D3. What this book covers Learn all the tools you need to create a map using D3 A high-level overview of Scalable Vector Graphics (SVG) presentation with explanation on how it operates and what elements it encompasses Exploring D3.js—producing graphics from data Step-by-step guide to build a map with D3 Get you started with interactivity in your D3 map visualizations Most important aspects of map visualization in detail via the use of TopoJSON Assistance for the long-term maintenance of your D3 code base and different techniques to keep it healthy over the lifespan of your project Summary So far, you may have got an idea of what all be covered in the book. This book is carefully designed to allow the reader to jump between chapters based on what they are planning to get out of the book. Every chapter is full of pragmatic examples that can easily provide the foundation to more complex work. Authors have explained, step by step, how each example works. 
Resources for Article: Further resources on this subject: Using Canvas and D3 [article] Interacting with your Visualization [article] Simple graphs with d3.js [article]

Why Big Data in the Financial Sector?

Packt
06 May 2015
7 min read
In this article by Rajiv Tiwari, author of the book, Hadoop for Finance Essentials, explains big data is not just changing the data landscape of healthcare, human science, telecom, and online retail industries, but is also transforming the way financial organizations treat their massive data. (For more resources related to this topic, see here.) As shown in the following figure, a study by McKinsey on big data use cases claims that the financial services industry is poised to gain the most from this remarkable technology: The data in financial organizations is relatively easier to capture compared to other industries, as it is easily accessible from internal systems—transactions and customer details—as well as external systems—FX Rates, legal entity data, and so on. Quite simply, the gain per byte of data is the maximum for financial services. Where do we get the big data in finance? The data is collected at every stage—be it onboarding of new customers, call center records, or financial transactions. The financial industry is rapidly moving online, and so it has never been easier to capture the data. There are other reasons as well, such as: Customers no longer need to visit branches to withdraw and deposit money or make investments. They can discuss their requirements with the Bank online or over the phone instead of physical meetings. According to SNL Financial, institutions have shut 2,599 branches in 2014 against 1,137 openings, a net loss of 1,462 branches that is just off 2013's record full-year total of 1,487 branches opened. The move brings total US branches down to 94,752, a decline of 1.5 percent. The trend is global and not just in the US. Electronic channels such as debit/credit cards and mobile devices, through which customers can interact with financial organizations, have increased in the UK, as shown in the following figure. The trend is global and not just in the UK. Mobile equipment such as computers, smartphones, telephones, or tablets make it easier and inexpensive for customers to transact, which means customers will transact more and generate more data. Since customer profiles and transaction patterns are rapidly changing, risk models based on smaller data sets are not very accurate. We need to analyze data for longer durations and be able to write complex data algorithms without worrying about computing and data storage capabilities. When financial organizations combine structured data with unstructured data on social media, the data analysis becomes very powerful. For example, they can get feedback on their new products or TV advertisements by analyzing Twitter, Facebook, and other social media comments. Big data use cases in the financial sector The financial sector is also sometimes called the BFSI sector; that is, banking, financial services, and insurance. Banking includes retail, corporate, business, investment (including capital markets), cards, and other core banking services Financial services include brokering, payment channels, mutual funds, asset management, and other services Insurance covers life and general insurance Financial organizations have been actively using big data platforms for the last few years and their key objectives are: Complying with regulatory requirements Better risk analytics Understanding customer behavior and improving services Understanding transaction patterns and monetizing using cross-selling of products Now I will define a few use cases within the financial services industry with real tangible business benefits. 
Data archival on HDFS Archiving data on HDFS is one of the basic use cases for Hadoop in financial organizations and is a quick win. It is likely to provide a very high return on investment. The data is archived on Hadoop and is still available to query (although not in real time), which is far more efficient than archiving on tape and far less expensive than keeping it on databases. Some of the use cases are: Migrate expensive and inefficient legacy mainframe data and load jobs to the Hadoop platform Migrate expensive older transaction data from high-end expensive databases to Hadoop HDFS Migrate unstructured legal, compliance, and onboarding documents to Hadoop HDFS Regulatory Financial organizations must comply with regulatory requirements. In order to meet these requirements, the use of traditional data processing platforms is becoming increasingly expensive and unsustainable. A couple of such use cases are: Checking customer names against a sanctions blacklist is very complicated due to the same or similar names. It is even more complicated when financial organizations have different names or aliases across different systems. With Hadoop, we can apply complex fuzzy matching on name and contact information across massive data sets at a much lower cost. The BCBS239 regulation states that financial organizations must be able to aggregate risk exposures across the whole group quickly and accurately. With Hadoop, financial organizations can consolidate and aggregate data on a single platform in the most efficient and cost-effective way. Fraud detection Fraud is estimated to cost the financial industry billions of US dollars per year. Financial organizations have invested in Hadoop platforms to identify fraudulent transactions by picking up unusual behavior patterns. Complex algorithms that need to be run on large volumes of transaction data to identify outliers are now possible on the Hadoop platform at a much lower expense. Tick data Stock market tick data is real-time data and generated on a massive scale. Live data streams can be processed using real-time streaming technology on the Hadoop infrastructure for quick trading decisions, and older tick data can be used for trending and forecasting using batch Hadoop tools. Risk management Financial organizations must be able to measure risk exposures for each customer and effectively aggregate it across entire business divisions. They should be able to score the credit risk for each customer using internal rules. They need to build risk models with intensive calculation on the underlying massive data. All these risk management requirements have two things in common—massive data and intensive calculation. Hadoop can handle both, given its inexpensive commodity hardware and parallel execution of jobs. Customer behavior prediction Once the customer data has been consolidated from a variety of sources on a Hadoop platform, it is possible to analyze data and: Predict mortgage defaults Predict spending for retail customers Analyze patterns that lead to customers leaving and customer dissatisfaction Sentiment analysis – unstructured Sentiment analysis is one of the best use cases to test the power of unstructured data analysis using Hadoop. 
Here are a few use cases: Analyze all e-mail text and call recordings from customers, which indicates whether they feel positive or negative about the products offered to them Analyze Facebook and Twitter comments to make buy or sell recommendations—analyze the market sentiments on which sectors or organizations will be a better buy for stock investments Analyze Facebook and Twitter comments to assess the feedback on new products Other use cases Big data has the potential to create new non-traditional income streams for financial organizations. As financial organizations store all the payment details of their retailers, they know exactly where, when, and how their customers are spending money. By analyzing this information, financial organizations can develop deep insight into customer intelligence and spending patterns, which they will be able to monetize. A few such possibilities include: Partner with a retailer to understand where the retailer's customers live, where and when they buy, what they buy, and how much they spend. This information will be used to recommend a sales strategy. Partner with a retailer to recommend discount offers to loyalty cardholders who use their loyalty cards in the vicinity of the retailer's stores. Summary In this article, we learned the use cases of Hadoop across different industry sectors and then detailed a few use cases within the financial sector. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Hadoop and MapReduce [article] Hadoop Monitoring and its aspects [article]

Introducing PostgreSQL 9

Packt
06 May 2015
23 min read
In this article by Simon Riggs, Gianni Ciolli, Hannu Krosing, Gabriele Bartolini, the authors of PostgreSQL 9 Administration Cookbook - Second Edition, we will introduce PostgreSQL 9. PostgreSQL is a feature-rich, general-purpose database management system. It's a complex piece of software, but every journey begins with the first step. (For more resources related to this topic, see here.) We'll start with your first connection. Many people fall at the first hurdle, so we'll try not to skip that too swiftly. We'll quickly move on to enabling remote users, and from there, we will move to access through GUI administration tools. We will also introduce the psql query tool. PostgreSQL is an advanced SQL database server, available on a wide range of platforms. One of the clearest benefits of PostgreSQL is that it is open source, meaning that you have a very permissive license to install, use, and distribute PostgreSQL without paying anyone fees or royalties. On top of that, PostgreSQL is well-known as a database that stays up for long periods and requires little or no maintenance in most cases. Overall, PostgreSQL provides a very low total cost of ownership. PostgreSQL is also noted for its huge range of advanced features, developed over the course of more than 20 years of continuous development and enhancement. Originally developed by the Database Research Group at the University of California, Berkeley, PostgreSQL is now developed and maintained by a huge army of developers and contributors. Many of those contributors have full-time jobs related to PostgreSQL, working as designers, developers, database administrators, and trainers. Some, but not many, of those contributors work for companies that specialize in support for PostgreSQL, like we (the authors) do. No single company owns PostgreSQL, nor are you required (or even encouraged) to register your usage. PostgreSQL has the following main features: Excellent SQL standards compliance up to SQL:2011 Client-server architecture Highly concurrent design where readers and writers don't block each other Highly configurable and extensible for many types of applications Excellent scalability and performance with extensive tuning features Support for many kinds of data models: relational, document (JSON and XML), and key/value What makes PostgreSQL different? The PostgreSQL project focuses on the following objectives: Robust, high-quality software with maintainable, well-commented code Low maintenance administration for both embedded and enterprise use Standards-compliant SQL, interoperability, and compatibility Performance, security, and high availability What surprises many people is that PostgreSQL's feature set is more comparable with Oracle or SQL Server than it is with MySQL. The only connection between MySQL and PostgreSQL is that these two projects are open source; apart from that, the features and philosophies are almost totally different. One of the key features of Oracle, since Oracle 7, has been snapshot isolation, where readers don't block writers and writers don't block readers. You may be surprised to learn that PostgreSQL was the first database to be designed with this feature, and it offers a complete implementation. In PostgreSQL, this feature is called Multiversion Concurrency Control (MVCC). PostgreSQL is a general-purpose database management system. You define the database that you would like to manage with it. PostgreSQL offers you many ways to work. 
You can use a normalized database model, augmented with features such as arrays and record subtypes, or use a fully dynamic schema with the help of JSONB and an extension named hstore. PostgreSQL also allows you to create your own server-side functions in any of a dozen different languages. PostgreSQL is highly extensible, so you can add your own data types, operators, index types, and functional languages. You can even override different parts of the system using plugins to alter the execution of commands or add a new optimizer. All of these features offer a huge range of implementation options to software architects. There are many ways out of trouble when building applications and maintaining them over long periods of time. In the early days, when PostgreSQL was still a research database, the focus was solely on the cool new features. Over the last 15 years, enormous amounts of code have been rewritten and improved, giving us one of the most stable and largest software servers available for operational use. You may have read that PostgreSQL was, or is, slower than My Favorite DBMS, whichever that is. It's been a personal mission of mine over the last ten years to improve server performance, and the team has been successful in making the server highly performant and very scalable. That gives PostgreSQL enormous headroom for growth. Who is using PostgreSQL? Prominent users include Apple, BASF, Genentech, Heroku, IMDB.com, Skype, McAfee, NTT, The UK Met Office, and The U. S. National Weather Service. 5 years ago, PostgreSQL received well in excess of 1 million downloads per year, according to data submitted to the European Commission, which concluded, "PostgreSQL is considered by many database users to be a credible alternative." We need to mention one last thing. When PostgreSQL was first developed, it was named Postgres, and therefore many aspects of the project still refer to the word "postgres"; for example, the default database is named postgres, and the software is frequently installed using the postgres user ID. As a result, people shorten the name PostgreSQL to simply Postgres, and in many cases use the two names interchangeably. PostgreSQL is pronounced as "post-grez-q-l". Postgres is pronounced as "post-grez." Some people get confused, and refer to "Postgre", which is hard to say, and likely to confuse people. Two names are enough, so please don't use a third name! The following sections explain the key areas in more detail. Robustness PostgreSQL is robust, high-quality software, supported by automated testing for both features and concurrency. By default, the database provides strong disk-write guarantees, and the developers take the risk of data loss very seriously in everything they do. Options to trade robustness for performance exist, though they are not enabled by default. All actions on the database are performed within transactions, protected by a transaction log that will perform automatic crash recovery in case of software failure. Databases may be optionally created with data block checksums to help diagnose hardware faults. Multiple backup mechanisms exist, with full and detailed Point-In-Time Recovery, in case of the need for detailed recovery. A variety of diagnostic tools are available. Database replication is supported natively. Synchronous Replication can provide greater than "5 Nines" (99.999 percent) availability and data protection, if properly configured and managed. Security Access to PostgreSQL is controllable via host-based access rules. 
Authentication is flexible and pluggable, allowing easy integration with any external security architecture. Full SSL-encrypted access is supported natively. A full-featured cryptographic function library is available for database users. PostgreSQL provides role-based access privileges to access data, by command type. Functions may execute with the permissions of the definer, while views may be defined with security barriers to ensure that security is enforced ahead of other processing. All aspects of PostgreSQL are assessed by an active security team, while known exploits are categorized and reported at http://www.postgresql.org/support/security/. Ease of use Clear, full, and accurate documentation exists as a result of a development process where doc changes are required. Hundreds of small changes occur with each release that smooth off any rough edges of usage, supplied directly by knowledgeable users. PostgreSQL works in the same way on small or large systems and across operating systems. Client access and drivers exist for every language and environment, so there is no restriction on what type of development environment is chosen now, or in the future. SQL Standard is followed very closely; there is no weird behavior, such as silent truncation of data. Text data is supported via a single data type that allows storage of anything from 1 byte to 1 gigabyte. This storage is optimized in multiple ways, so 1 byte is stored efficiently, and much larger values are automatically managed and compressed. PostgreSQL has a clear policy to minimize the number of configuration parameters, and with each release, we work out ways to auto-tune settings. Extensibility PostgreSQL is designed to be highly extensible. Database extensions can be loaded simply and easily using CREATE EXTENSION, which automates version checks, dependencies, and other aspects of configuration. PostgreSQL supports user-defined data types, operators, indexes, functions and languages. Many extensions are available for PostgreSQL, including the PostGIS extension that provides world-class Geographical Information System (GIS) features. Performance and concurrency PostgreSQL 9.4 can achieve more than 300,000 reads per second on a 32-CPU server, and it benchmarks at more than 20,000 write transactions per second with full durability. PostgreSQL has an advanced optimizer that considers a variety of join types, utilizing user data statistics to guide its choices. PostgreSQL provides MVCC, which enables readers and writers to avoid blocking each other. Taken together, the performance features of PostgreSQL allow a mixed workload of transactional systems and complex search and analytical tasks. This is important because it means we don't always need to unload our data from production systems and reload them into analytical data stores just to execute a few ad hoc queries. PostgreSQL's capabilities make it the database of choice for new systems, as well as the right long-term choice in almost every case. Scalability PostgreSQL 9.4 scales well on a single node up to 32 CPUs. PostgreSQL scales well up to hundreds of active sessions, and up to thousands of connected sessions when using a session pool. Further scalability is achieved in each annual release. PostgreSQL provides multinode read scalability using the Hot Standby feature. Multinode write scalability is under active development. The starting point for this is Bi-Directional Replication. SQL and NoSQL PostgreSQL follows SQL Standard very closely. 
SQL itself does not force any particular type of model to be used, so PostgreSQL can easily be used for many types of models at the same time, in the same database. PostgreSQL supports the more normal SQL language statement. With PostgreSQL acting as a relational database, we can utilize any level of denormalization, from the full Third Normal Form, to the more normalized Star Schema models. PostgreSQL extends the relational model to provide arrays, row types, and range types. A document-centric database is also possible using PostgreSQL's text, XML, and binary JSON (JSONB) data types, supported by indexes optimized for documents and by full text search capabilities. Key/value stores are supported using the hstore extension. Popularity When MySQL was taken over some years back, it was agreed in the EU monopoly investigation that followed that PostgreSQL was a viable competitor. That's been certainly true, with the PostgreSQL user base expanding consistently for more than a decade. Various polls have indicated that PostgreSQL is the favorite database for building new, enterprise-class applications. The PostgreSQL feature set attracts serious users who have serious applications. Financial services companies may be PostgreSQL's largest user group, though governments, telecommunication companies, and many other segments are strong users as well. This popularity extends across the world; Japan, Ecuador, Argentina, and Russia have very large user groups, and so do USA, Europe, and Australasia. Amazon Web Services' chief technology officer Dr. Werner Vogels described PostgreSQL as "an amazing database", going on to say that "PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications". Commercial support Many people have commented that strong commercial support is what enterprises need before they can invest in open source technology. Strong support is available worldwide from a number of companies. 2ndQuadrant provides commercial support for open source PostgreSQL, offering 24 x 7 support in English and Spanish with bug-fix resolution times. EnterpriseDB provides commercial support for PostgreSQL as well as their main product, which is a variant of Postgres that offers enhanced Oracle compatibility. Many other companies provide strong and knowledgeable support to specific geographic regions, vertical markets, and specialized technology stacks. PostgreSQL is also available as hosted or cloud solutions from a variety of companies, since it runs very well in cloud environments. A full list of companies is kept up to date at http://www.postgresql.org/support/professional_support/. Research and development funding PostgreSQL was originally developed as a research project at the University of California, Berkeley in the late 1980s and early 1990s. Further work was carried out by volunteers until the late 1990s. Then, the first professional developer became involved. Over time, more and more companies and research groups became involved, supporting many professional contributors. Further funding for research and development was provided by the NSF. The project also received funding from the EU FP7 Programme in the form of the 4CaaST project for cloud computing and the AXLE project for scalable data analytics. AXLE deserves a special mention because it is a 3-year project aimed at enhancing PostgreSQL's business intelligence capabilities, specifically for very large databases. 
The project covers security, privacy, integration with data mining, and visualization tools and interfaces for new hardware. Further details of it are available at http://www.axleproject.eu. Other funding for PostgreSQL development comes from users who directly sponsor features and companies selling products and services based around PostgreSQL. Monitoring Databases are not isolated entities. They live on computer hardware using CPUs, RAM, and disk subsystems. Users access databases using networks. Depending on the setup, databases themselves may need network resources to function in any of the following ways: performing some authentication checks when users log in, using disks that are mounted over the network (not generally recommended), or making remote function calls to other databases. This means that monitoring only the database is not enough. As a minimum, one should also monitor everything directly involved in using the database. This means knowing the following: Is the database host available? Does it accept connections? How much of the network bandwidth is in use? Have there been network interruptions and dropped connections? Is there enough RAM available for the most common tasks? How much of it is left? Is there enough disk space available? When will it run out of disk space? Is the disk subsystem keeping up? How much more load can it take? Can the CPU keep up with the load? How many spare idle cycles do the CPUs have? Are other network services the database access depends on (if any) available? For example, if you use Kerberos for authentication, you need to monitor it as well. How many context switches are happening when the database is running? For most of these things, you are interested in history; that is, how have things evolved? Was everything mostly the same yesterday or last week? When did the disk usage start changing rapidly? For any larger installation, you probably have something already in place to monitor the health of your hosts and network. The two aspects of monitoring are collecting historical data to see how things have evolved and getting alerts when things go seriously wrong. Tools based on Round Robin Database Tool (RRDtool) such as Cacti and Munin are quite popular for collecting the historical information on all aspects of the servers and presenting this information in an easy-to-follow graphical form. Seeing several statistics on the same timescale can really help when trying to figure out why the system is behaving the way it is. Another popular open source solution is Ganglia, a distributed monitoring solution particularly suitable for environments with several servers and in multiple locations. Another aspect of monitoring is getting alerts when something goes really wrong and needs (immediate) attention. For alerting, one of the most widely used tools is Nagios, with its fork (Icinga) being an emerging solution. The aforementioned trending tools can integrate with Nagios. However, if you need a solution for both the alerting and trending aspects of a monitoring tool, you might want to look into Zabbix. Then, of course, there is Simple Network Management Protocol (SNMP), which is supported by a wide array of commercial monitoring solutions. Basic support for monitoring PostgreSQL through SNMP is found in pgsnmpd. This project does not seem very active though. However, you can find more information about pgsnmpd and download it from http://pgsnmpd.projects.postgresql.org/. 
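Before wiring up any of these tools, it can help to see what the underlying checks look like from the command line. The following is a minimal sketch, assuming a local psql client, standard Unix utilities, and PostgreSQL 9.3 or later for pg_isready; the host name and data directory path are hypothetical:

# Does the database host accept connections?
pg_isready -h dbhost -p 5432

# Is there enough disk space left under the (hypothetical) data directory?
df -h /var/lib/pgsql

# How much RAM is still available on the database host?
free -m

# How many sessions are currently connected?
psql -h dbhost -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

A monitoring system essentially runs checks like these on a schedule, stores the results, and alerts when thresholds are crossed.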
Providing PostgreSQL information to monitoring tools Historical monitoring information is best to use when all of it is available from the same place and at the same timescale. Most monitoring systems are designed for generic purposes, while allowing application and system developers to integrate their specific checks with the monitoring infrastructure. This is possible through a plugin architecture. Adding new kinds of data inputs to them means installing a plugin. Sometimes, you may need to write or develop this plugin, but writing a plugin for something such as Cacti is easy. You just have to write a script that outputs monitored values in simple text format. In most common scenarios, the monitoring system is centralized and data is collected directly (and remotely) by the system itself or through some distributed components that are responsible for sending the observed metrics back to the main node. As far as PostgreSQL is concerned, some useful things to include in graphs are the number of connections, disk usage, number of queries, number of WAL files, most numbers from pg_stat_user_tables and pg_stat_user_indexes, and so on, as shown here: An example of a dashboard in Cacti The preceding Cacti screenshot includes data for CPU, disk, and network usage; pgbouncer connection pooler; and the number of PostgreSQL client connections. As you can see, they are nicely correlated. One Swiss Army knife script, which can be used from both Cacti and Nagios/Icinga, is check_postgres. It is available at http://bucardo.org/wiki/Check_postgres. It has ready-made reporting actions for a large array of things worth monitoring in PostgreSQL. For Munin, there are some PostgreSQL plugins available at the Munin plugin repository at https://github.com/munin-monitoring/contrib/tree/master/plugins/postgresql. The following screenshot shows a Munin graph about PostgreSQL buffer cache hits for a specific database, where cache hits (blue line) dominate reads from the disk (green line): Finding more information about generic monitoring tools Setting up the tools themselves is a larger topic. In fact, each of these tools has more than one book written about them. The basic setup information and the tools themselves can be found at the following URLs: RRDtool: http://www.mrtg.org/rrdtool/ Cacti: http://www.cacti.net/ Ganglia: http://ganglia.sourceforge.net/ Icinga: http://www.icinga.org Munin: http://munin-monitoring.org/ Nagios: http://www.nagios.org/ Zabbix: http://www.zabbix.org/ Real-time viewing using pgAdmin You can also use pgAdmin to get a quick view of what is going on in the database. For better control, you need to install the adminpack extension in the destination database, by issuing this command: CREATE EXTENSION adminpack; This extension is a part of the additionally supplied modules of PostgreSQL (aka contrib). It provides several administration functions that PgAdmin (and other tools) can use in order to manage, control, and monitor a Postgres server from a remote location. Once you have installed adminpack, connect to the database and then go to Tools | Server Status. This will open a window similar to what is shown in the following screenshot, reporting locks and running transactions: Loading data from flat files Loading data into your database is one of the most important tasks. You need to do this accurately and quickly. Here's how. Getting ready You'll need a copy of pgloader, which is available at http://github.com/dimitri/pgloader. 
At the time of writing this article, the current stable version is 3.1.0. The 3.x series is a major rewrite, with many additional features, and the 2.x series is now considered obsolete.

How to do it…

PostgreSQL includes a command named COPY that provides the basic data load/unload mechanism. The COPY command doesn't do enough when loading data, so let's skip the basic command and go straight to pgloader. To load data, we need to understand our requirements, so let's break this down into a step-by-step process, as follows:

Identify the data files and where they are located. Make sure that pgloader is installed at the location of the files.
Identify the table into which you are loading, ensure that you have the permissions to load, and check the available space.
Work out the file type (fixed, text, or CSV) and check the encoding.
Specify the mapping between columns in the file and columns on the table being loaded. Make sure you know which columns in the file are not needed; pgloader allows you to include only the columns you want.
Identify any columns in the table for which you don't have data. Do you need them to have a default value on the table, or does pgloader need to generate values for those columns through functions or constants?
Specify any transformations that need to take place. The most common issue is date formats, though possibly there may be other issues.
Write the pgloader script.
pgloader will create a log file to record whether the load has succeeded or failed, and another file to store rejected rows. You need a directory with sufficient disk space if you expect them to be large. Their size is roughly proportional to the number of failing rows.
Finally, consider what settings you need for performance options. This is definitely last, as fiddling with things earlier can lead to confusion when you're still making the load work correctly.

You must use a script to execute pgloader. This is not a restriction; actually, it is more like a best practice, because it makes it much easier to iterate towards something that works. Loads never work the first time, except in the movies!

Let's look at a typical example from pgloader's documentation, the example.load file:

LOAD CSV
   FROM 'GeoLiteCity-Blocks.csv' WITH ENCODING iso-646-us
       HAVING FIELDS
       (
           startIpNum, endIpNum, locId
       )
   INTO postgresql://user@localhost:54393/dbname?geolite.blocks
       TARGET COLUMNS
       (
           iprange ip4r using (ip-range startIpNum endIpNum),
           locId
       )
   WITH truncate,
       skip header = 2,
       fields optionally enclosed by '"',
       fields escaped by backslash-quote,
       fields terminated by '\t'
     SET work_mem to '32 MB', maintenance_work_mem to '64 MB';

We can use the load script like this:

pgloader --summary summary.log example.load

How it works…

pgloader copes gracefully with errors. The COPY command loads all rows in a single transaction, so only a single error is enough to abort the load. pgloader breaks down an input file into reasonably sized chunks, and loads them piece by piece. If some rows in a chunk cause errors, then pgloader will split it iteratively until it loads all the good rows and skips all the bad rows, which are then saved in a separate "rejects" file for later inspection. This behavior is very convenient if you have large data files with a small percentage of bad rows; for instance, you can edit the rejects, fix them, and finally, load them with another pgloader run.
Versions 2.x of pgloader were written in Python and connected to PostgreSQL through the standard Python client interface. Version 3.x is written in Common Lisp. Yes, pgloader is less efficient than loading data files using a COPY command, but running a COPY command has many more restrictions: the file has to be in the right place on the server, has to be in the right format, and must be unlikely to throw errors on loading. pgloader has additional overhead, but it also has the ability to load data using multiple parallel threads, so it can be faster to use as well. pgloader's ability to call out to reformatting functions is essential in many cases; straight COPY is just too simple. pgloader also allows loading from fixed-width files, which COPY does not.

There's more…

If you need to reload the table completely from scratch, then specify the WITH TRUNCATE clause in the pgloader script. There are also options to specify SQL to be executed before and after loading the data. For instance, you may have a script that creates the empty tables before the load, or you can add constraints afterwards, or both.

After loading, if we have load errors, then there will be some junk loaded into the PostgreSQL tables. It is not junk that you can see, or that gives any semantic errors, but think of it more like fragmentation. You should consider whether you need to add a VACUUM command after the data load, though this can make the load take much longer.

We need to be careful to avoid loading data twice. The only easy way of doing that is to make sure that there is at least one unique index defined on every table that you load. The load should then fail very quickly.

String handling can often be difficult, because of the presence of formatting or nonprintable characters. The default setting for PostgreSQL is to have a parameter named standard_conforming_strings set to off, which means that backslashes will be assumed to be escape characters. Put another way, by default, the \n string means line feed, which can cause data to appear truncated. You'll need to turn standard_conforming_strings to on, or you'll need to specify an escape character in the load-parameter file.

If you are reloading data that has been unloaded from PostgreSQL, then you may want to use the pg_restore utility instead. The pg_restore utility has an option to reload data in parallel, -j number_of_threads, though this is only possible if the dump was produced using the custom pg_dump format. This can be useful for reloading dumps, though it lacks almost all of the other pgloader features discussed here.

If you need to use rows from a read-only text file that does not have errors, and you are using version 9.1 or later of PostgreSQL, then you may consider using the file_fdw contrib module. The short story is that it lets you create a "virtual" table that will parse the text file every time it is scanned. This is different from filling a table once and for all, either with COPY or pgloader; therefore, it covers a different use case. For example, think about an external data source that is maintained by a third party and needs to be shared across different databases.

You may wish to send an e-mail to Dimitri Fontaine, the current author and maintainer of most of pgloader. He always loves to receive e-mails from users.

Summary

PostgreSQL provides a lot of features, which make it the most advanced open source database.
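To make the file_fdw alternative mentioned above a little more concrete, here is a minimal sketch; the server name, column definitions, and file path are hypothetical and must match your actual CSV file:

CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- A "virtual" table: the file is parsed every time this table is scanned
CREATE FOREIGN TABLE blocks_ext (
    startipnum  bigint,
    endipnum    bigint,
    locid       integer
) SERVER csv_files
  OPTIONS (filename '/data/GeoLiteCity-Blocks.csv', format 'csv', header 'true');

SELECT count(*) FROM blocks_ext;

Because the foreign table re-reads the file on every scan, changes made to the file by the third party are visible immediately, without any reload step.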
Resources for Article: Further resources on this subject: Getting Started with PostgreSQL [article] Installing PostgreSQL [article] PostgreSQL – New Features [article]


Introduction to Hadoop

Packt
06 May 2015
11 min read
In this article by Shiva Achari, author of the book Hadoop Essentials, you'll get an introduction about Hadoop, its uses, and advantages (For more resources related to this topic, see here.) Hadoop In big data, the most widely used system is Hadoop. Hadoop is an open source implementation of big data, which is widely accepted in the industry, and benchmarks for Hadoop are impressive and, in some cases, incomparable to other systems. Hadoop is used in the industry for large-scale, massively parallel, and distributed data processing. Hadoop is highly fault tolerant and configurable to as many levels as we need for the system to be fault tolerant, which has a direct impact to the number of times the data is stored across. As we have already touched upon big data systems, the architecture revolves around two major components: distributed computing and parallel processing. In Hadoop, the distributed computing is handled by HDFS, and parallel processing is handled by MapReduce. In short, we can say that Hadoop is a combination of HDFS and MapReduce, as shown in the following image: Hadoop history Hadoop began from a project called Nutch, an open source crawler-based search, which processes on a distributed system. In 2003–2004, Google released Google MapReduce and GFS papers. MapReduce was adapted on Nutch. Doug Cutting and Mike Cafarella are the creators of Hadoop. When Doug Cutting joined Yahoo, a new project was created along the similar lines of Nutch, which we call Hadoop, and Nutch remained as a separate sub-project. Then, there were different releases, and other separate sub-projects started integrating with Hadoop, which we call a Hadoop ecosystem. The following figure and description depicts the history with timelines and milestones achieved in Hadoop: Description 2002.8: The Nutch Project was started 2003.2: The first MapReduce library was written at Google 2003.10: The Google File System paper was published 2004.12: The Google MapReduce paper was published 2005.7: Doug Cutting reported that Nutch now uses new MapReduce implementation 2006.2: Hadoop code moved out of Nutch into a new Lucene sub-project 2006.11: The Google Bigtable paper was published 2007.2: The first HBase code was dropped from Mike Cafarella 2007.4: Yahoo! Running Hadoop on 1000-node cluster 2008.1: Hadoop made an Apache Top Level Project 2008.7: Hadoop broke the Terabyte data sort Benchmark 2008.11: Hadoop 0.19 was released 2011.12: Hadoop 1.0 was released 2012.10: Hadoop 2.0 was alpha released 2013.10: Hadoop 2.2.0 was released 2014.10: Hadoop 2.6.0 was released Advantages of Hadoop Hadoop has a lot of advantages, and some of them are as follows: Low cost—Runs on commodity hardware: Hadoop can run on average performing commodity hardware and doesn't require a high performance system, which can help in controlling cost and achieve scalability and performance. Adding or removing nodes from the cluster is simple, as an when we require. The cost per terabyte is lower for storage and processing in Hadoop. Storage flexibility: Hadoop can store data in raw format in a distributed environment. Hadoop can process the unstructured data and semi-structured data better than most of the available technologies. Hadoop gives full flexibility to process the data and we will not have any loss of data. Open source community: Hadoop is open source and supported by many contributors with a growing network of developers worldwide. 
Many organizations such as Yahoo, Facebook, Hortonworks, and others have contributed immensely toward the progress of Hadoop and other related sub-projects. Fault tolerant: Hadoop is massively scalable and fault tolerant. Hadoop is reliable in terms of data availability, and even if some nodes go down, Hadoop can recover the data. Hadoop architecture assumes that nodes can go down and the system should be able to process the data. Complex data analytics: With the emergence of big data, data science has also grown leaps and bounds, and we have complex and heavy computation intensive algorithms for data analysis. Hadoop can process such scalable algorithms for a very large-scale data and can process the algorithms faster. Uses of Hadoop Some examples of use cases where Hadoop is used are as follows: Searching/text mining Log processing Recommendation systems Business intelligence/data warehousing Video and image analysis Archiving Graph creation and analysis Pattern recognition Risk assessment Sentiment analysis Hadoop ecosystem A Hadoop cluster can be of thousands of nodes, and it is complex and difficult to manage manually, hence there are some components that assist configuration, maintenance, and management of the whole Hadoop system. In this article, we will touch base upon the following components: Layer Utility/Tool name Distributed filesystem Apache HDFS Distributed programming Apache MapReduce Apache Hive Apache Pig Apache Spark NoSQL databases Apache HBase Data ingestion Apache Flume Apache Sqoop Apache Storm Service programming Apache Zookeeper Scheduling Apache Oozie Machine learning Apache Mahout System deployment Apache Ambari All the components above are helpful in managing Hadoop tasks and jobs. Apache Hadoop The open source Hadoop is maintained by the Apache Software Foundation. The official website for Apache Hadoop is http://hadoop.apache.org/, where the packages and other details are described elaborately. The current Apache Hadoop project (version 2.6) includes the following modules: Hadoop common: The common utilities that support other Hadoop modules Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data Hadoop YARN: A framework for job scheduling and cluster resource management Hadoop MapReduce: A YARN-based system for parallel processing of large datasets Apache Hadoop can be deployed in the following three modes: Standalone: It is used for simple analysis or debugging. Pseudo distributed: It helps you to simulate a multi-node installation on a single node. In pseudo-distributed mode, each of the component processes runs in a separate JVM. Instead of installing Hadoop on different servers, you can simulate it on a single server. Distributed: Cluster with multiple worker nodes in tens or hundreds or thousands of nodes. In a Hadoop ecosystem, along with Hadoop, there are many utility components that are separate Apache projects such as Hive, Pig, HBase, Sqoop, Flume, Zookeper, Mahout, and so on, which have to be configured separately. We have to be careful with the compatibility of subprojects with Hadoop versions as not all versions are inter-compatible. Apache Hadoop is an open source project that has a lot of benefits as source code can be updated, and also some contributions are done with some improvements. One downside for being an open source project is that companies usually offer support for their products, not for an open source project. 
Customers prefer support and adapt Hadoop distributions supported by the vendors. Let's look at some Hadoop distributions available. Hadoop distributions Hadoop distributions are supported by the companies managing the distribution, and some distributions have license costs also. Companies such as Cloudera, Hortonworks, Amazon, MapR, and Pivotal have their respective Hadoop distribution in the market that offers Hadoop with required sub-packages and projects, which are compatible and provide commercial support. This greatly reduces efforts, not just for operations, but also for deployment, monitoring, and tools and utility for easy and faster development of the product or project. For managing the Hadoop cluster, Hadoop distributions provide some graphical web UI tooling for the deployment, administration, and monitoring of Hadoop clusters, which can be used to set up, manage, and monitor complex clusters, which reduce a lot of effort and time. Some Hadoop distributions which are available are as follows: Cloudera: According to The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014, this is the most widely used Hadoop distribution with the biggest customer base as it provides good support and has some good utility components such as Cloudera Manager, which can create, manage, and maintain a cluster, and manage job processing, and Impala is developed and contributed by Cloudera which has real-time processing capability. Hortonworks: Hortonworks' strategy is to drive all innovation through the open source community and create an ecosystem of partners that accelerates Hadoop adoption among enterprises. It uses an open source Hadoop project and is a major contributor to Hadoop enhancement in Apache Hadoop. Ambari was developed and contributed to Apache by Hortonworks. Hortonworks offers a very good, easy-to-use sandbox for getting started. Hortonworks contributed changes that made Apache Hadoop run natively on the Microsoft Windows platforms including Windows Server and Microsoft Azure. MapR: MapR distribution of Hadoop uses different concepts than plain open source Hadoop and its competitors, especially support for a network file system (NFS) instead of HDFS for better performance and ease of use. In NFS, Native Unix commands can be used instead of Hadoop commands. MapR have high availability features such as snapshots, mirroring, or stateful failover. Amazon Elastic MapReduce (EMR): AWS's Elastic MapReduce (EMR) leverages its comprehensive cloud services, such as Amazon EC2 for compute, Amazon S3 for storage, and other services, to offer a very strong Hadoop solution for customers who wish to implement Hadoop in the cloud. EMR is much advisable to be used for infrequent big data processing. It might save you a lot of money. Pillars of Hadoop Hadoop is designed to be highly scalable, distributed, massively parallel processing, fault tolerant and flexible and the key aspect of the design are HDFS, MapReduce and YARN. HDFS and MapReduce can perform very large scale batch processing at a much faster rate. Due to contributions from various organizations and institutions Hadoop architecture has undergone a lot of improvements, and one of them is YARN. YARN has overcome some limitations of Hadoop and allows Hadoop to integrate with different applications and environments easily, especially in streaming and real-time analysis. One such example that we are going to discuss are Storm and Spark, they are well known in streaming and real-time analysis, both can integrate with Hadoop via YARN. 
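As a quick illustration of how these pillars fit together in practice, here is a minimal sketch of running the bundled word-count example on a pseudo-distributed Hadoop 2.6.0 installation; the local file, HDFS paths, and examples jar version are hypothetical and depend on your setup:

# Put a local file into HDFS (distributed storage)
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put /tmp/sample.txt /user/hadoop/input

# Run the bundled word-count MapReduce example (parallel processing),
# scheduled on the cluster by YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
    wordcount /user/hadoop/input /user/hadoop/output

# Inspect the result stored back in HDFS, and the job as seen by YARN
hdfs dfs -cat /user/hadoop/output/part-r-00000
yarn application -list -appStates FINISHED

HDFS provides the distributed storage for the input and output, MapReduce performs the parallel processing, and YARN schedules and tracks the job.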
Data access components MapReduce is a very powerful framework, but has a huge learning curve to master and optimize a MapReduce job. For analyzing data in a MapReduce paradigm, a lot of our time will be spent in coding. In big data, the users come from different backgrounds such as programming, scripting, EDW, DBA, analytics, and so on, for such users there are abstraction layers on top of MapReduce. Hive and Pig are two such layers, Hive has a SQL query-like interface and Pig has Pig Latin procedural language interface. Analyzing data on such layers becomes much easier. Data storage component HBase is a column store-based NoSQL database solution. HBase's data model is very similar to Google's BigTable framework. HBase can efficiently process random and real-time access in a large volume of data, usually millions or billions of rows. HBase's important advantage is that it supports updates on larger tables and faster lookup. The HBase data store supports linear and modular scaling. HBase stores data as a multidimensional map and is distributed. HBase operations are all MapReduce tasks that run in a parallel manner. Data ingestion in Hadoop In Hadoop, storage is never an issue, but managing the data is the driven force around which different solutions can be designed differently with different systems, hence managing data becomes extremely critical. A better manageable system can help a lot in terms of scalability, reusability, and even performance. In a Hadoop ecosystem, we have two widely used tools: Sqoop and Flume, both can help manage the data and can import and export data efficiently, with a good performance. Sqoop is usually used for data integration with RDBMS systems, and Flume usually performs better with streaming log data. Streaming and real-time analysis Storm and Spark are the two new fascinating components that can run on YARN and have some amazing capabilities in terms of processing streaming and real-time analysis. Both of these are used in scenarios where we have heavy continuous streaming data and have to be processed in, or near, real-time cases. The example which we discussed earlier for traffic analyzer is a good example for use cases of Storm and Spark. Summary In this article, we explored a bit about Hadoop history, finally migrating to the advantages and uses of Hadoop. Hadoop systems are complex to monitor and manage, and we have separate sub-projects' frameworks, tools, and utilities that integrate with Hadoop and help in better management of tasks, which are called a Hadoop ecosystem. Resources for Article: Further resources on this subject: Hive in Hadoop [article] Hadoop and MapReduce [article] Evolution of Hadoop [article]


Hadoop Monitoring and its aspects

Packt
04 May 2015
8 min read
In this article by Gurmukh Singh, the author of the book Monitoring Hadoop, tells us the importance of monitoring Hadoop and its importance. It also explains various other concepts of Hadoop, such as its architecture, Ganglia (a tool used to monitor Hadoop), and so on. (For more resources related to this topic, see here.) In any enterprise, how big or small it could be, it is very important to monitor the health of all its components like servers, network devices, databases, and many more and make sure things are working as intended. Monitoring is a critical part for any business dependent upon infrastructure, by giving signals to enable necessary actions incase of any failures. Monitoring can be very complex with many components and configurations in a real production environment. There might be different security zones; different ways in which servers are setup or a same database might be used in many different ways with servers listening on various service ports. Before diving into setting up Monitoring and logging for Hadoop, it is very important to understand the basics of monitoring, how it works and some commonly used tools in the market. In Hadoop, we can do monitoring of the resources, services and also do metrics collection of various Hadoop counters. There are many tools available in the market and one of them is Nagios, which is widely used. Nagios is a powerful monitoring system that provides you with instant awareness of your organization's mission-critical IT infrastructure. By using Nagios, you can: Plan release cycle and rollouts, before things get outdated Early detection, before it causes an outage Have automation and a better response across the organization Nagios Architecture It is based on a simple server client architecture, in which the server has the capability to execute checks remotely using NRPE agents on the Linux clients. The results of execution are captured by the server and accordingly alerted by the system. The checks could be for memory, disk, CPU utilization, network, database connection and many more. It provides the flexibility to use either active or passive checks. Ganglia Ganglia, it is a beautiful tool for aggregating the stats and plotting them nicely. Nagios, give the events and alerts, Ganglia aggregates and presents it in a meaningful way. What if you want to look for total CPU, memory per cluster of 2000 nodes or total free disk space on 1000 nodes. Some of the key feature of Ganglia. View historical and real time metrics of a single node or for the entire cluster Use the data to make decision on cluster sizing and performance Ganglia Components Ganglia Monitoring Daemon (gmond): This runs on the nodes that need to be monitored, captures state change and sends updates using XDR to a central daemon. Ganglia Meta Daemon (gmetad): This collects data from gmond and other gmetad daemons. The data is indexed and stored to disk in round robin fashion. There is also a Ganglia front-end for meaningful display of information collected. All these tools can be integrated with Hadoop, to monitor it and capture its metrics. Integration with Hadoop There are many important components in Hadoop that needs to be monitored, like NameNode uptime, disk space, memory utilization, and heap size. Similarly, on DataNode we need to monitor disk usage, memory utilization or job execution flow status across the MapReduce components. To know what to monitor, we must understand how Hadoop daemons communicate with each other. 
There are lots of ports used in Hadoop, some are for internal communication like scheduling jobs, and replication, while others are for user interactions. They may be exposed using TCP or HTTP. Hadoop daemons provide information over HTTP about logs, stacks, metrics that could be used for troubleshooting. NameNode can expose information about the file system, live or dead nodes or block reports by the DataNode or JobTracker for tracking the running jobs. Hadoop uses TCP, HTTP, IPC or socket for communication among the nodes or daemons. YARN Framework The YARN (Yet Another resource Negotiator) is the new MapReduce framework. It is designed to scale for large clusters and performs much better as compared to the old framework. There are new sets of daemons in the new framework and it is good to understand how to communicate with each other. The diagram that follows, explains the daemons and ports on which they talk. Logging in Hadoop In Hadoop, each daemon writes its own logs and the severity of logging is configurable. The logs in Hadoop can be related to the daemons or the jobs submitted. Useful to troubleshoot slowness, issue with map reduce tasks, connectivity issues and platforms bugs. The logs generated can be user level like task tracker logs on each node or can be related to master daemons like NameNode and JobTracker. In the newer YARN platform, there is a feature to move the logs to HDFS after the initial logging. In Hadoop 1.x the user log management is done using UserLogManager, which cleans and truncates logs according to retention and size parameters like mapred.userlog.retain.hours and mapreduce.cluster.map.userlog.retain-size respectively. The tasks standard out and error are piped to Unix tail program, so it retains the require size only. The following are some of the challenges of log management in Hadoop: Excessive logging: The truncation of logs is not done till the tasks finish, this for many jobs could cause disk space issues as the amount of data written is quite large. Truncation: We cannot always say what to log and how much is good enough. For some users 500KB of logs might be good but for some 10MB might not suffice. Retention: How long to retain logs, 1 or 6 months?. There is no rule, but there are best practices or governance issues. In many countries there is regulation in place to keep data for 1 year. Best practice for any organization is to keep it for at least 6 months. Analysis: What if we want to look at historical data, how to aggregate logs onto a central system and do analyses. In Hadoop logs are served over HTTP for a single node by default. Some of the above stated issues have been addressed in the YARN framework. Rather then truncating logs and that to on individual nodes, the logs can be moved to HDFS and processed using other tools. The logs are written at the per application level into directories per application. The user can access these logs through command line or web UI. For example, $HADOOP_YARN_HOME/bin/yarn logs. Hadoop metrics In Hadoop there are many daemons running like DataNode, NameNode, JobTracker, and so on, each of these daemons captures a lot of information about the components they work on. Similarly, in YARN framework we have ResourceManager, NodeManager, and Application Manager, each of which exposes metrics, explained in the following sections under Metrics2. 
For example, DataNode collects metrics such as the number of blocks it has for advertising to the NameNode, the number of replicated blocks, and metrics about the various reads or writes from clients. In addition to this, there could be metrics related to events, and so on. Hence, it is very important to gather these metrics for the working of the Hadoop cluster; they also help in debugging if something goes wrong. For this, Hadoop has a metrics system for collecting all this information. There are two versions of the metrics system, Metrics and Metrics2, for Hadoop 1.x and Hadoop 2.x respectively. They are configured through the hadoop-metrics.properties and hadoop-metrics2.properties files for each Hadoop version respectively.

Configuring Metrics2

For Hadoop version 2, which uses the YARN framework, the metrics can be configured using hadoop-metrics2.properties, under the $HADOOP_HOME directory:

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.period=10
namenode.sink.file.filename=namenode-metrics.out
datanode.sink.file.filename=datanode-metrics.out
jobtracker.sink.file.filename=jobtracker-metrics.out
tasktracker.sink.file.filename=tasktracker-metrics.out
maptask.sink.file.filename=maptask-metrics.out
reducetask.sink.file.filename=reducetask-metrics.out

Hadoop metrics configuration for Ganglia

Firstly, we need to define a sink class, as per Ganglia:

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

Secondly, we need to define how often the source should be polled for data. Here, we are polling every 30 seconds:

*.sink.ganglia.period=30

Then, define the retention for the metrics:

*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

Summary

In this article, we learned about Hadoop monitoring and its importance, and also the various concepts of Hadoop.

Resources for Article: Further resources on this subject: Hadoop and MapReduce [article] YARN and Hadoop [article] Hive in Hadoop [article]

Machine Learning

Packt
30 Apr 2015
15 min read
In this article by Alberto Boschetti and Luca Massaron, authors of the book Python Data Science Essentials, you will learn about how to deal with big data and an overview of Stochastic Gradient Descent (SGD). (For more resources related to this topic, see here.) Dealing with big data Big data challenges data science projects with four points of view: volume (data quantity), velocity, variety, and veracity. Scikit-learn package offers a range of classes and functions that will help you effectively work with data so big that it cannot entirely fit in the memory of your computer, no matter what its specifications may be. Before providing you with an overview of big data solutions, we have to create or import some datasets in order to give you a better idea of the scalability and performances of different algorithms. This will require about 1.5 gigabytes of your hard disk, which will be freed after the experiment. Creating some big datasets as examples As a typical example of big data analysis, we will use some textual data from the Internet and we will take advantage of the available fetch_20newsgroups, which contains data of 11,314 posts, each one averaging about 206 words, that appeared in 20 different newsgroups: In: import numpy as np from sklearn.datasets import fetch_20newsgroups newsgroups_dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'), random_state=6) print 'Posts inside the data: %s' % np.shape(newsgroups_dataset.data) print 'Average number of words for post: %0.0f' % np.mean([len(text.split(' ')) for text in newsgroups_dataset.data]) Out: Posts inside the data: 11314 Average number of words for post: 206 Instead, to work out a generic classification example, we will create three synthetic datasets that contain from 1,00,000 to up to 10 million cases. You can create and use any of them according to your computer resources. We will always refer to the largest one for our experiments: In: from sklearn.datasets import make_classification X,y = make_classification(n_samples=10**5, n_features=5, n_informative=3, random_state=101) D = np.c_[y,X] np.savetxt('huge_dataset_10__5.csv', D, delimiter=",") # the saved file should be around 14,6 MB del(D, X, y) X,y = make_classification(n_samples=10**6, n_features=5, n_informative=3, random_state=101) D = np.c_[y,X] np.savetxt('huge_dataset_10__6.csv', D, delimiter=",") # the saved file should be around 146 MB del(D, X, y) X,y = make_classification(n_samples=10**7, n_features=5, n_informative=3, random_state=101) D = np.c_[y,X] np.savetxt('huge_dataset_10__7.csv', D, delimiter=",") # the saved file should be around 1,46 GB del(D, X, y) After creating and using any of the datasets, you can remove them by the following command: import os os.remove('huge_dataset_10__5.csv') os.remove('huge_dataset_10__6.csv') os.remove('huge_dataset_10__7.csv') Scalability with volume The trick to manage high volumes of data without loading too many megabytes or gigabytes into memory is to incrementally update the parameters of your algorithm until all the observations have been elaborated at least once by the machine learner. This is possible in Scikit-learn thanks to the .partial_fit() method, which has been made available to a certain number of supervised and unsupervised algorithms. 
Using the .partial_fit() method and providing some basic information (for example, for classification, you should know beforehand the number of classes to be predicted), you can immediately start fitting your model, even if you have a single or a few observations. This method is called incremental learning. The chunks of data that you incrementally fed into the learning algorithm are called batches. The critical points of incremental learning are as follows: Batch size Data preprocessing Number of passes with the same examples Validation and parameters fine tuning Batch size generally depends on your available memory. The principle is that the larger the data chunks, the better, since the data sample will get more representatives of the data distributions as its size gets larger. Also, data preprocessing is challenging. Incremental learning algorithms work well with data in the range of [-1,+1] or [0,+1] (for instance, Multinomial Bayes won't accept negative values). However, to scale into such a precise range, you need to know beforehand the range of each variable. Alternatively, you have to either pass all the data once, and record the minimum and maximum values, or derive them from the first batch, trimming the following observations that exceed the initial maximum and minimum values. The number of passes can become a problem. In fact, as you pass the same examples multiple times, you help the predictive coefficients converge to an optimum solution. If you pass too many of the same observations, the algorithm will tend to overfit, that is, it will adapt too much to the data repeated too many times. Some algorithms, like the SGD family, are also very sensitive to the order that you propose to the examples to be learned. Therefore, you have to either set their shuffle option (shuffle=True) or shuffle the file rows before the learning starts, keeping in mind that for efficacy, the order of the rows proposed for the learning should be casual. Validation is not easy either, especially if you have to validate against unseen chunks, validate in a progressive way, or hold out some observations from every chunk. The latter is also the best way to reserve a sample for grid search or some other optimization. In our example, we entrust the SGDClassifier with a log loss (basically a logistic regression) to learn how to predict a binary outcome given 10**7 observations: In: from sklearn.linear_model import SGDClassifier from sklearn.preprocessing import MinMaxScaler import pandas as pd import numpy as np streaming = pd.read_csv('huge_dataset_10__7.csv', header=None, chunksize=10000) learner = SGDClassifier(loss='log') minmax_scaler = MinMaxScaler(feature_range=(0, 1)) cumulative_accuracy = list() for n,chunk in enumerate(streaming):    if n == 0:            minmax_scaler.fit(chunk.ix[:,1:].values)    X = minmax_scaler.transform(chunk.ix[:,1:].values)    X[X>1] = 1    X[X<0] = 0    y = chunk.ix[:,0]    if n > 8 :        cumulative_accuracy.append(learner.score(X,y))    learner.partial_fit(X,y,classes=np.unique(y)) print 'Progressive validation mean accuracy %0.3f' % np.mean(cumulative_accuracy) Out: Progressive validation mean accuracy 0.660 First, pandas read_csv allows us to iterate over the file by reading batches of 10,000 observations (the number can be increased or decreased according to your computing resources). We use the MinMaxScaler in order to record the range of each variable on the first batch. 
For the following batches, we will use the rule that if it exceeds one of the limits of [0,+1], they are trimmed to the nearest limit. Eventually, starting from the 10th batch, we will record the accuracy of the learning algorithm on each newly received batch before using it to update the training. In the end, the accumulated accuracy scores are averaged, offering a global performance estimation. Keeping up with velocity There are various algorithms that work with incremental learning. For classification, we will recall the following: sklearn.naive_bayes.MultinomialNB sklearn.naive_bayes.BernoulliNB sklearn.linear_model.Perceptron sklearn.linear_model.SGDClassifier sklearn.linear_model.PassiveAggressiveClassifier For regression, we will recall the following: sklearn.linear_model.SGDRegressor sklearn.linear_model.PassiveAggressiveRegressor As for velocity, they are all comparable in speed. You can try for yourself the following script: In: from sklearn.naive_bayes import MultinomialNB from sklearn.naive_bayes import BernoulliNB from sklearn.linear_model import Perceptron from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier import pandas as pd import numpy as np from datetime import datetime classifiers = { 'SGDClassifier hinge loss' : SGDClassifier(loss='hinge', random_state=101), 'SGDClassifier log loss' : SGDClassifier(loss='log', random_state=101), 'Perceptron' : Perceptron(random_state=101), 'BernoulliNB' : BernoulliNB(), 'PassiveAggressiveClassifier' : PassiveAggressiveClassifier(random_state=101) } huge_dataset = 'huge_dataset_10__6.csv' for algorithm in classifiers:    start = datetime.now()    minmax_scaler = MinMaxScaler(feature_range=(0, 1))    streaming = pd.read_csv(huge_dataset, header=None, chunksize=100)    learner = classifiers[algorithm]    cumulative_accuracy = list()    for n,chunk in enumerate(streaming):        y = chunk.ix[:,0]        X = chunk.ix[:,1:]        if n > 50 :            cumulative_accuracy.append(learner.score(X,y))        learner.partial_fit(X,y,classes=np.unique(y))    elapsed_time = datetime.now() - start    print algorithm + ' : mean accuracy %0.3f in %s secs' % (np.mean(cumulative_accuracy),elapsed_time.total_seconds()) Out: BernoulliNB : mean accuracy 0.734 in 41.101 secs Perceptron : mean accuracy 0.616 in 37.479 secs SGDClassifier hinge loss : mean accuracy 0.712 in 38.43 secs SGDClassifier log loss : mean accuracy 0.716 in 39.618 secs PassiveAggressiveClassifier : mean accuracy 0.625 in 40.622 secs As a general note, remember that smaller batches are slower since that implies more disk access from a database or a file, which is always a bottleneck. Dealing with variety Variety is a big data characteristic. This is especially true when we are dealing with textual data or very large categorical variables (for example, variables storing website names in programmatic advertising). As you will learn from batches of examples, and as you unfold categories or words, each one is an appropriate and exclusive variable. You may find it difficult to handle the challenge of variety and the unpredictability of large streams of data. Scikit-learn package provides you with a simple and fast way to implement the hashing trick and completely forget the problem of defining in advance of a rigid variable structure. The hashing trick uses hash functions and sparse matrices in order to save your time, resources, and hassle. The hash functions are functions that map in a deterministic way any input they receive. 
It doesn't matter if you feed them with numbers or strings, they will always provide you with an integer number in a certain range. Sparse matrices are, instead, arrays that record only values that are not zero, since their default value is zero for any combination of their row and column. Therefore, the hashing trick bounds every possible input; it doesn't matter if it was previously unseen to a certain range or position on a corresponding input sparse matrix, which is loaded with a value that is not 0. For instance, if your input is Python, a hashing command like abs(hash('Python')) can transform that into the 539294296 integer number and then assign the value of 1 to the cell at the 539294296 column index. The hash function is a very fast and convenient way to always express the same column index given the same input. The using of only absolute values assures that each index corresponds only to a column in our array (negative indexes just start from the last column, and hence in Python, each column of an array can be expressed by both a positive and negative number). The example that follows uses the HashingVectorizer class, a convenient class that automatically takes documents, separates the words, and transforms them, thanks to the hashing trick, into an input matrix. The script aims at learning why posts are published in 20 distinct newsgroups on the basis of the words used on the existing posts in the newsgroups: In: import pandas as pd import numpy as np from sklearn.linear_model import SGDClassifier from sklearn.feature_extraction.text import HashingVectorizer def streaming():    for response, item in zip(newsgroups_dataset.target, newsgroups_dataset.data):        yield response, item hashing_trick = HashingVectorizer(stop_words='english', norm = 'l2', non_negative=True) learner = SGDClassifier(random_state=101) texts = list() targets = list() for n,(target, text) in enumerate(streaming()):    texts.append(text)    targets.append(target)  if n % 1000 == 0 and n >0:        learning_chunk = hashing_trick.transform(texts)        if n > 1000:            last_validation_score = learner.score(learning_chunk, targets),        learner.partial_fit(learning_chunk, targets, classes=[k for k in range(20)])        texts, targets = list(), list() print 'Last validation score: %0.3f' % last_validation_score Out: Last validation score: 0.710 At this point, no matter what text you may input, the predictive algorithm will always answer by pointing out a class. In our case, it points out a newsgroup suitable for the post to appear on. Let's try out this algorithm with a text taken from a classified ad: In: New_text = ['A 2014 red Toyota Prius v Five with fewer than 14K miles. Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.'] text_vector = hashing_trick.transform(New_text) print np.shape(text_vector), type(text_vector) print 'Predicted newsgroup: %s' % newsgroups_dataset.target_names[learner.predict(text_vector)] Out: (1, 1048576) <class 'scipy.sparse.csr.csr_matrix'> Predicted newsgroup: rec.autos Naturally, you may change the New_text variable and discover where your text will most likely be displayed in a newsgroup. Note that the HashingVectorizer class has transformed the text into a csr_matrix (which is quite an efficient sparse matrix), having about 1 million columns. 
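To make the mapping more concrete, here is a minimal, hand-rolled sketch of the hashing trick. The 2**20 column count (which mirrors the roughly one million columns reported above) and the modulo bounding are illustrative choices; scikit-learn's real implementation relies on a dedicated hash function (MurmurHash3) so that indices stay stable across runs, whereas Python's built-in hash may vary between processes:
In: from scipy.sparse import lil_matrix
def hash_vectorize(texts, n_columns=2**20):
    # one row per document, one column per hash bucket
    X = lil_matrix((len(texts), n_columns))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            # abs() keeps the index positive; the modulo bounds it to the matrix width
            X[row, abs(hash(word)) % n_columns] += 1
    return X.tocsr()
X = hash_vectorize(['python is fast', 'python is flexible'])
print(X.shape)
Out: (2, 1048576)
Collisions (two different words landing on the same column) are possible, but with a large enough column count they are rare enough not to hurt the learner noticeably.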
A quick overview of Stochastic Gradient Descent (SGD)

We will close this article on big data with a quick overview of the SGD family, comprising SGDClassifier (for classification) and SGDRegressor (for regression). Like other estimators, they can be fit using the .fit() method (passing the in-memory dataset to the learning algorithm) or the previously seen batch-based .partial_fit() method. In the latter case, if you are classifying, you have to declare the predicted classes with the classes parameter, which accepts a list of all the class codes that the model should expect to meet during the training phase. SGDClassifier behaves as a logistic regression when the loss parameter is set to log. It transforms into a linear SVC if the loss is set to hinge. It can also take other loss functions, including the ones intended for regression. SGDRegressor mimics a linear regression when using the squared_loss loss parameter. The huber loss instead transforms the squared loss into a linear loss beyond a certain distance epsilon (another parameter to be fixed). It can also act as a linear SVR using the epsilon_insensitive loss function or the slightly different squared_epsilon_insensitive (which penalizes outliers more). As in other situations in machine learning, the performance of the different loss functions on your data science problem cannot be estimated a priori. In any case, take into account that if you are doing classification and you need an estimation of class probabilities, your choice will be limited to log or modified_huber only. Key parameters that require tuning for this algorithm to work best with your data are:
n_iter: The number of iterations over the data; the more passes, the better the optimization of the coefficients. However, a high number of passes raises the risk of overfitting. Empirically, SGD tends to converge to a stable solution after having seen around 10**6 examples, so set your number of iterations accordingly, given the size of your training set.
penalty: You have to choose l1, l2, or elasticnet, which are different regularization strategies, in order to avoid overfitting due to overparametrization (using too many unnecessary parameters leads to the memorization of observations rather than the learning of patterns). Briefly, l1 tends to reduce unhelpful coefficients to zero, l2 just attenuates them, and elasticnet is a mix of the l1 and l2 strategies.
alpha: This is a multiplier of the regularization term; the higher the alpha, the stronger the regularization. We advise you to find the best alpha value by performing a grid search ranging from 10**-7 to 10**-1.
l1_ratio: The mixing ratio between the l1 and l2 penalties; it is used only with the elasticnet penalty.
learning_rate: This sets how much the coefficients are affected by every single example. Usually, it is set to optimal for classifiers and invscaling for regressors (their respective defaults). If you want to use invscaling for classification, you'll have to set eta0 and power_t (invscaling = eta0 / (t**power_t)). With invscaling, you can start with a lower learning rate, which is less than the optimal rate, though it will decrease more slowly.
epsilon: This should be set if your loss is huber, epsilon_insensitive, or squared_epsilon_insensitive.
shuffle: If this is True, the algorithm will shuffle the order of the training data in order to improve the generalization of the learning.
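As a rough sketch of the alpha grid search suggested above (using a small in-memory dataset generated with make_classification purely for illustration, not the streaming setup from the earlier sections), the search could look like the following; note that in recent scikit-learn releases GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search:
In: import numpy as np
from sklearn.datasets import make_classification
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
from sklearn.linear_model import SGDClassifier
X, y = make_classification(n_samples=10000, n_features=20, random_state=101)
param_grid = {
    'alpha': 10.0 ** np.arange(-7, 0),   # 10**-7 ... 10**-1, as advised above
    'penalty': ['l1', 'l2', 'elasticnet']
}
search = GridSearchCV(SGDClassifier(loss='log', shuffle=True, random_state=101),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
When the data does not fit in memory, the same idea can be applied by holding out a validation slice from every chunk and comparing a handful of manually chosen alpha values in parallel partial_fit runs.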
Summary

In this article, we introduced the essentials of machine learning for big data. We discussed how to create some big data examples, how to scale with volume through incremental learning, how to keep up with velocity, how to deal with variety using the hashing trick, and finally took a quick look at the SGD family. Resources for Article: Further resources on this subject: Supervised learning [article] Installing NumPy, SciPy, matplotlib, and IPython [article] Driving Visual Analyses with Automobile Data (Python) [article]

Algorithmic Trading

In this article by James Ma Weiming, author of the book Mastering Python for Finance , we will see how algorithmic trading automates the systematic trading process, where orders are executed at the best price possible based on a variety of factors, such as pricing, timing, and volume. Some brokerage firms may offer an application programming interface (API) as part of their service offering to customers who wish to deploy their own trading algorithms. For developing an algorithmic trading system, it must be highly robust and handle any point of failure during the order execution. Network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system in executing orders. Designing larger systems inevitably add complexity to the framework. As soon as a position in a market is opened, it is subjected to various types of risk, such as market risk. To preserve the trading capital as much as possible, it is important to incorporate risk management measures to the trading system. Perhaps the most common risk measure used in the financial industry is the value-at-risk (VaR) technique. We will discuss the beauty and flaws of VaR, and how it can be incorporated into our trading system that we will develop in this article. In this article, we will cover the following topics: An overview of algorithmic trading List of brokers and system vendors with public API Choosing a programming language for a trading system Setting up API access on Interactive Brokers (IB) trading platform Using the IbPy module to interact with IB Trader WorkStation (TWS) Introduction to algorithmic trading In the 1990s, exchanges had already begun to use electronic trading systems. By 1997, 44 exchanges worldwide used automated systems for trading futures and options with more exchanges in the process of developing automated technology. Exchanges such as the Chicago Board of Trade (CBOT) and the London International Financial Futures and Options Exchange (LIFFE) used their electronic trading systems as an after-hours complement to traditional open outcry trading in pits, giving traders 24-hour access to the exchange's risk management tools. With improvements in technology, technology-based trading became less expensive, fueling the growth of trading platforms that are faster and powerful. Higher reliability of order execution and lower rates of message transmission error deepened the reliance of technology by financial institutions. The majority of asset managers, proprietary traders, and market makers have since moved from the trading pits to electronic trading floors. As systematic or computerized trading became more commonplace, speed became the most important factor in determining the outcome of a trade. Quants utilizing sophisticated fundamental models are able to recompute fair values of trading products on the fly and execute trading decisions, enabling them to reap profits at the expense of fundamental traders using traditional tools. This gave way to the term high-frequency trading (HFT) that relies on fast computers to execute the trading decisions before anyone else can. HFT has evolved into a billion-dollar industry. Algorithmic trading refers to the automation of the systematic trading process, where the order execution is heavily optimized to give the best price possible. It is not part of the portfolio allocation process. 
Banks, hedge funds, brokerage firms, clearing firms, and trading firms typically have their servers placed right next to the electronic exchange to receive the latest market prices and to perform the fastest order execution possible. They bring enormous trading volumes to the exchange. Anyone who wishes to participate in low-latency, high-volume trading activities, such as complex event processing or capturing fleeting price discrepancies, can acquire exchange connectivity in the form of co-location, where his or her server hardware is placed on a rack right next to the exchange for a fee. The Financial Information Exchange (FIX) protocol is the industry standard for electronic communication between a private server and the exchange, providing direct market access (DMA) to real-time information. C++ is the common choice of programming language for trading over the FIX protocol, though other languages, such as the .NET Framework languages and Java, can be used. Before creating an algorithmic trading platform, you need to assess various factors, such as speed and ease of learning, before deciding on a specific language for the purpose. Brokerage firms provide a trading platform of some sort to their customers so that they can execute orders on selected exchanges in return for commission fees. Some brokerage firms may offer an API as part of their service offering to technically inclined customers who wish to run their own trading algorithms. In most circumstances, customers may also choose from a number of commercial trading platforms offered by third-party vendors. Some of these trading platforms may also offer API access to route orders electronically to the exchange. It is important to read the API documentation beforehand to understand the technical capabilities offered by your broker and to formulate an approach to developing an algorithmic trading system.

List of trading platforms with public API

The following list shows some brokers and trading platform vendors who have their API documentation publicly available, along with the programming languages their APIs support:
Interactive Brokers (https://www.interactivebrokers.com/en/index.php?f=1325): C++, Posix C++, Java, and Visual Basic for ActiveX
E*Trade (https://developer.etrade.com): Java, PHP, and C++
IG (http://labs.ig.com/): REST, Java, FIX, and Microsoft .NET Framework 4.0
Tradier (https://developer.tradier.com): Java, Perl, Python, and Ruby
TradeKing (https://developers.tradeking.com): Java, Node.js, PHP, R, and Ruby
Cunningham Trading Systems (http://www.ctsfutures.com/wiki/T4%20API%2040.MainPage.ashx): Microsoft .NET Framework 4.0
CQG (http://cqg.com/Products/CQG-API.aspx): C#, C++, Excel, MATLAB, and VB.NET
Trading Technologies (https://developer.tradingtechnologies.com): Microsoft .NET Framework 4.0
OANDA (http://developer.oanda.com): REST, Java, FIX, and MT4

Which is the best programming language to use?

With many choices of programming languages available to interface with brokers or vendors, the question that comes naturally to anyone starting out in algorithmic trading platform development is: which language should I use? The short answer is that there is really no single best programming language. How your product will be developed, the performance metrics to follow, the costs involved, the latency threshold, risk measures, and the expected user interface are all pieces of the puzzle to be taken into consideration.
The risk manager, execution engine, and portfolio optimizer are some major components that will affect the design of your system. Your existing trading infrastructure, choice of operating system, programming language compiler capability, and available software tools poses further constraints on the system design, development, and deployment. System functionalities It is important to define the outcomes of your trading system. An outcome could be a research-based system that might be more concerned with obtaining high-quality data from data vendors, performing computations or running models, and evaluating a strategy through signal generation. Part of the research component might include a data-cleaning module or a backtesting interface to run a strategy with theoretical parameters over historical data. The CPU speed, memory size, and bandwidth are factors to be considered while designing our system. Another outcome could be an execution-based system that is more concerned with risk management and order handling features to ensure timely execution of multiple orders. The system must be highly robust and handle any point of failure during the order execution. As such, network configuration, hardware, memory management and speed, and user experience are some factors to be considered when designing a system in executing orders. A system may contain one or more of these functionalities. Designing larger systems inevitably add complexity to the framework. It is recommended that you choose one or more programming languages that can address and balance the development speed, ease of development, scalability, and reliability of your trading system. Algorithmic trading with Interactive Brokers and IbPy In this section, we will build a working algorithmic trading platform that will authenticate with Interactive Brokers (IB) and log in, retrieve the market data, and send orders. IB is one of the most popular brokers in the trading community and has a long history of API development. There are plenty of articles on the use of the API available on the Web. IB serves clients ranging from hedge funds to retail traders. Although the API does not support Python directly, Python wrappers such as IbPy are available to make the API calls to the IB interface. The IB API is unique to its own implementation, and every broker has its own API handling methods. Nevertheless, the documents and sample applications provided by your broker would demonstrate the core functionality of every API interface that can be easily integrated into an algorithmic trading system if designed properly. Getting Interactive Brokers' Trader WorkStation The official page for IB is https://www.interactivebrokers.com. Here, you can find a wealth of information regarding trading and investing for retail and institutional traders. In this section, we will take a look at how to get the Trader WorkStation X (TWS) installed and running on your local workstation before setting up an algorithmic trading system using Python. Note that we will perform simulated trading on a demonstration account. If your trading strategy turns out to be profitable, head to the OPEN AN ACCOUNT section of the IB website to open a live trading account. Rules, regulations, market data fees, exchange fees, commissions, and other conditions are subjected to the broker of your choice. In addition, market conditions are vastly different from the simulated environment. 
You are encouraged to perform extensive testing on your algorithmic trading system before running on live markets. The following key steps describe how to install TWS on your local workstation, log in to the demonstration account, and set it up for API use: From IB's official website, navigate to TRADING, and then select Standalone TWS. Choose the installation executable that is suitable for your local workstation. TWS runs on Java; therefore, ensure that Java runtime plugin is already installed on your local workstation. Refer to the following screenshot: When prompted during the installation process, choose Trader_WorkStation_X and IB Gateway options. The Trader WorkStation X (TWS) is the trading platform with full order management functionality. The IB Gateway program accepts and processes the API connections without any order management features of the TWS. We will not cover the use of the IB Gateway, but you may find it useful later. Select the destination directory on your local workstation where TWS will place all the required files, as shown in the following screenshot: When the installation is completed, a TWS shortcut icon will appear together with your list of installed applications. Double-click on the icon to start the TWS program. When TWS starts, you will be prompted to enter your login credentials. To log in to the demonstration account, type edemo in the username field and demouser in the password field, as shown in the following screenshot: Once we have managed to load our demo account on TWS, we can now set up its API functionality. On the toolbar, click on Configure: Under the Configuration tree, open the API node to reveal further options. Select Settings. Note that Socket port is 7496, and we added the IP address of our workstation housing our algorithmic trading system to the list of trusted IP addresses, which in this case is 127.0.0.1. Ensure that the Enable ActiveX and Socket Clients option is selected to allow the socket connections to TWS: Click on OK to save all the changes. TWS is now ready to accept orders and market data requests from our algorithmic trading system. Getting IbPy – the IB API wrapper IbPy is an add-on module for Python that wraps the IB API. It is open source and can be found at https://github.com/blampe/IbPy. Head to this URL and download the source files. Unzip the source folder, and use Terminal to navigate to this directory. Type python setup.py install to install IbPy as part of the Python runtime environment. The use of IbPy is similar to the API calls, as documented on the IB website. The documentation for IbPy is at https://code.google.com/p/ibpy/w/list. A simple order routing mechanism In this section, we will start interacting with TWS using Python by establishing a connection and sending out a market order to the exchange. Once IbPy is installed, import the following necessary modules into our Python script: from ib.ext.Contract import Contractfrom ib.ext.Order import Orderfrom ib.opt import Connection Next, implement the logging functions to handle calls from the server. The error_handler method is invoked whenever the API encounters an error, which is accompanied with a message. The server_handler method is dedicated to handle all the other forms of returned API messages. The msg variable is a type of an ib.opt.message object and references the method calls, as defined by the IB API EWrapper methods. The API documentation can be accessed at https://www.interactivebrokers.com/en/software/api/api.htm. 
The following is the Python code for the server_handler method: def error_handler(msg):print "Server Error:", msgdef server_handler(msg):print "Server Msg:", msg.typeName, "-", msg We will place a sample order of the stock AAPL. The contract specifications of the order are defined by the Contract class object found in the ib.ext.Contract module. We will create a method called create_contract that returns a new instance of this object: def create_contract(symbol, sec_type, exch, prim_exch, curr):contract = Contract()contract.m_symbol = symbolcontract.m_secType = sec_typecontract.m_exchange = exchcontract.m_primaryExch = prim_exchcontract.m_currency = currreturn contract The Order class object is used to place an order with TWS. Let's define a method called create_order that will return a new instance of the object: def create_order(order_type, quantity, action):order = Order()order.m_orderType = order_typeorder.m_totalQuantity = quantityorder.m_action = actionreturn order After the required methods are created, we can then begin to script the main functionality. Let's initialize the required variables: if __name__ == "__main__":client_id = 100order_id = 1port = 7496tws_conn = None Note that the client_id variable is our assigned integer that identifies the instance of the client communicating with TWS. The order_id variable is our assigned integer that identifies the order queue number sent to TWS. Each new order requires this value to be incremented sequentially. The port number has the same value as defined in our API settings of TWS earlier. The tws_conn variable holds the connection value to TWS. Let's initialize this variable with an empty value for now. Let's use a try block that encapsulates the Connection.create method to handle the socket connections to TWS in a graceful manner: try:# Establish connection to TWS.tws_conn = Connection.create(port=port,clientId=client_id)tws_conn.connect()# Assign error handling function.tws_conn.register(error_handler, 'Error')# Assign server messages handling function.tws_conn.registerAll(server_handler)finally:# Disconnect from TWSif tws_conn is not None:tws_conn.disconnect() The port and clientId parameter fields define this connection. After the connection instance is created, the connect method will try to connect to TWS. When the connection to TWS has successfully opened, it is time to register listeners to receive notifications from the server. The register method associates a function handler to a particular event. The registerAll method associates a handler to all the messages generated. This is where the error_handler and server_handler methods declared earlier will be used for this occasion. Before sending our very first order of 100 shares of AAPL to the exchange, we will call the create_contract method to create a new contract object for AAPL. Then, we will call the create_order method to create a new Order object, to go long 100 shares. Finally, we will call the placeOrder method of the Connection class to send out this order to TWS: # Create a contract for AAPL stock using SMART order routing.aapl_contract = create_contract('AAPL','STK','SMART','SMART','USD')# Go long 100 shares of AAPLaapl_order = create_order('MKT', 100, 'BUY')# Place order on IB TWS.tws_conn.placeOrder(order_id, aapl_contract, aapl_order) That's it! Let's run our Python script. 
We should get a similar output as follows: Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farmconnection is OK:ibdemo>Server Response: error, <error id=-1, errorCode=2104, errorMsg=Marketdata farm connection is OK:ibdemo>Server Version: 75TWS Time at connection:20141210 23:14:17 CSTServer Msg: managedAccounts - <managedAccounts accountsList=DU15200>Server Msg: nextValidId - <nextValidId orderId=1>Server Error: <error id=-1, errorCode=2104, errorMsg=Market data farmconnection is OK:ibdemo>Server Msg: error - <error id=-1, errorCode=2104, errorMsg=Market datafarm connection is OK:ibdemo>Server Error: <error id=-1, errorCode=2107, errorMsg=HMDS data farmconnection is inactive but should be available upon demand.demohmds>Server Msg: error - <error id=-1, errorCode=2107, errorMsg=HMDS data farmconnection is inactive but should be available upon demand.demohmds> Basically, what the error messages say is that there are no errors and the connections are OK. Should the simulated order be executed successfully during market trading hours, the trade will be reflected in TWS: The full source code of our implementation is given as follows: """ A Simple Order Routing Mechanism """from ib.ext.Contract import Contractfrom ib.ext.Order import Orderfrom ib.opt import Connectiondef error_handler(msg):print "Server Error:", msgdef server_handler(msg):print "Server Msg:", msg.typeName, "-", msgdef create_contract(symbol, sec_type, exch, prim_exch, curr):contract = Contract()contract.m_symbol = symbolcontract.m_secType = sec_typecontract.m_exchange = exchcontract.m_primaryExch = prim_exchcontract.m_currency = currreturn contractdef create_order(order_type, quantity, action):order = Order()order.m_orderType = order_typeorder.m_totalQuantity = quantityorder.m_action = actionreturn orderif __name__ == "__main__":client_id = 1order_id = 119port = 7496tws_conn = Nonetry:# Establish connection to TWS.tws_conn = Connection.create(port=port,clientId=client_id)tws_conn.connect()# Assign error handling function.tws_conn.register(error_handler, 'Error')# Assign server messages handling function.tws_conn.registerAll(server_handler)# Create AAPL contract and send orderaapl_contract = create_contract('AAPL','STK','SMART','SMART','USD')# Go long 100 shares of AAPLaapl_order = create_order('MKT', 100, 'BUY')# Place order on IB TWS.tws_conn.placeOrder(order_id, aapl_contract, aapl_order)finally:# Disconnect from TWSif tws_conn is not None:tws_conn.disconnect() Summary In this article, we were introduced to the evolution of trading from the pits to the electronic trading platform, and learned how algorithmic trading came about. We looked at some brokers offering API access to their trading service offering. To help us get started on our journey in developing an algorithmic trading system, we used the TWS of IB and the IbPy Python module. In our first trading program, we successfully sent an order to our broker through the TWS API using a demonstration account. Resources for Article: Prototyping Arduino Projects using Python Python functions – Avoid repeating code Pentesting Using Python


Apache Solr and Big Data – integration with MongoDB

In this article by Hrishikesh Vijay Karambelkar, author of the book Scaling Big Data with Hadoop and Solr - Second Edition, we will go through Apache Solr and MongoDB together. In an enterprise, data is generated from all the software that is participating in day-to-day operations. This data has different formats, and bringing in this data for big-data processing requires a storage system that is flexible enough to accommodate a data with varying data models. A NoSQL database, by its design, is best suited for this kind of storage requirements. One of the primary objectives of NoSQL is horizontal scaling, that is, the P in CAP theorem, but this works at the cost of sacrificing Consistency or Availability. Visit http://en.wikipedia.org/wiki/CAP_theorem to understand more about CAP theorem (For more resources related to this topic, see here.) What is NoSQL and how is it related to Big Data? As we have seen, data models for NoSQL differ completely from that of a relational database. With the flexible data model, it becomes very easy for developers to quickly integrate with the NoSQL database, and bring in large sized data from different data sources. This makes the NoSQL database ideal for Big Data storage, since it demands that different data types be brought together under one umbrella. NoSQL also has different data models, like KV store, document store and Big Table storage. In addition to flexible schema, NoSQL offers scalability and high performance, which is again one of the most important factors to be considered while running big data. NoSQL was developed to be a distributed type of database. When traditional relational stores rely on the high computing power of CPUs and the high memory focused on a centralized system, NoSQL can run on your low-cost, commodity hardware. These servers can be added or removed dynamically from the cluster running NoSQL, making the NoSQL database easier to scale. NoSQL enables most advanced features of a database, like data partitioning, index sharding, distributed query, caching, and so on. Although NoSQL offers optimized storage for big data, it may not be able to replace the relational database. While relational databases offer transactional (ACID), high CRUD, data integrity, and a structured database design approach, which are required in many applications, NoSQL may not support them. Hence, it is most suited for Big Data where there is less possibility of need for data to be transactional. MongoDB at glance MongoDB is one of the popular NoSQL databases, (just like Cassandra). MongoDB supports the storing of any random schemas in the document oriented storage of its own. MongoDB supports the JSON-based information pipe for any communication with the server. This database is designed to work with heavy data. Today, many organizations are focusing on utilizing MongoDB for various enterprise applications. MongoDB provides high availability and load balancing. Each data unit is replicated and the combination of a data with its copes is called a replica set. Replicas in MongoDB can either be primary or secondary. Primary is the active replica, which is used for direct read-write operations, while the secondary replica works like a backup for the primary. MongoDB supports searches by field, range queries, and regular expression searches. Queries can return specific fields of documents and also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed. More information about MongoDB can be read at https://www.mongodb.org/. 
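If you have not used MongoDB from Python before, a short PyMongo session illustrates the schema-free document model and the field and range queries mentioned above. The database and collection names are arbitrary, a local mongod instance is assumed to be running, and older PyMongo releases use insert() instead of insert_one():
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['solr-test']
# documents are free-form, JSON-like dictionaries; no schema is declared up front
db.zips.insert_one({'city': 'ACMAR', 'state': 'AL', 'pop': 6055})
# query by field and by range; any field can be indexed
print(db.zips.find_one({'city': 'ACMAR'}))
for doc in db.zips.find({'pop': {'$gt': 5000}}):
    print(doc)
db.zips.create_index('city')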
The data on MongoDB is eventually consistent. Apache Solr can be used to work with MongoDB, to enable database searching capabilities on a MongoDB-based data store. Unlike Cassandra, where the Solr indexes are stored directly in Cassandra through solandra, MongoDB integration with Solr brings in the indexes in the Solr-based optimized storage. There are various ways in which the data residing in MongoDB can be analyzed and searched. MongoDB's replication works by recording all operations made on a database in a log file, called the oplog (operation log). Mongo's oplog keeps a rolling record of all operations that modify the data stored in your databases. Many of the implementers suggest reading this log file using a standard file IO program to push the data directly to Apache Solr, using CURL, SolrJ. Since oplog is a collection of data with an upper limit on maximum storage, it is feasible to synch such querying with Apache Solr. Oplog also provides tailable cursors on the database. These cursors can provide a natural order to the documents loaded in MongoDB, thereby, preserving their order. However, we are going to look at a different approach. Let's look at the schematic following diagram: In this case, MongoDB is exposed as a database to Apache Solr through the custom database driver. Apache Solr reads MongoDB data through the DataImportHandler, which in turns calls the JDBC-based MongoDB driver for connecting to MongoDB and running data import utilities. Since MongoDB supports replica sets, it manages the distribution of data across nodes. It also supports Sharding just like Apache Solr. Installing MongoDB To install MongoDB in your development environment, please follow the following steps: Download the latest version of MongoDB from https://www.mongodb.org/downloads for your supported operating system. Unzip the zipped folder. MongoDB comes up with a default set of different command-line components and utilities:      bin/mongod: The database process.      bin/mongos: Sharding controller.      bin/mongo: The database shell (uses interactive JavaScript). Now, create a directory for MongoDB, which it will use for user data creation and management, and run the following command to start the single node server: $ bin/mongod –dbpath <path to your data directory> --rest In this case, --rest parameter enables support for simple rest APIs that can be used for getting the status. Once the server is started, access http://localhost:28017 from your favorite browser, you should be able to see following administration status page: Now that you have successfully installed MongoDB, try loading a sample data set from the book on MongoDB by opening a new command-line interface. Change the directory to $MONGODB_HOME and run the following command: $ bin/mongoimport --db solr-test --collection zips --file "<file-dir>/samples/zips.json" Please note that the database name is solr-test. You can see the stored data using the MongoDB-based CLI by running the following set of commands from your shell: $ bin/mongo MongoDB shell version: 2.4.9 connecting to: test Welcome to the MongoDB shell. For interactive help, type "help". For more comprehensive documentation, see        http://docs.mongodb.org/ Questions? 
Try the support group        http://groups.google.com/group/mongodb-user > use test Switched to db test > show dbs exampledb       0.203125GB local   0.078125GB test   0.203125GB > db.zips.find({city:"ACMAR"}) { "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" :"AL", "_id" : "35004" } Congratulations! MongoDB is installed successfully Creating Solr indexes from MongoDB To run MongoDB as a database, you will need a JDBC driver built for MongoDB. However, the Mongo-JDBC driver has certain limitations, and it does not work with the Apache Solr DataImportHandler. So, I have extended Mongo-JDBC to work under the Solr-based DataImportHandler. The project repository is available at https://github.com/hrishik/solr-mongodb-dih. Let's look at the setting-up procedure for enabling MongoDB based Solr integration: You may not require a complete package from the solr-mongodb-dih repository, but just the jar file. This can be downloaded from https://github.com/hrishik/solr-mongodb-dih/tree/master/sample-jar. You will also need the following additional jar files:      jsqlparser.jar      mongo.jar These jars are available with the book Scaling Big Data with Hadoop and Solr, Second Edition for download. In your Solr setup, copy these jar files into the library path, that is, the $SOLR_WAR_LOCATION/WEB-INF/lib folder. Alternatively, point your container classpath variable to link them up. Using simple Java source code DataLoad.java (link https://github.com/hrishik/solr-mongodb-dih/blob/master/examples/DataLoad.java), populate the database with some sample schema and tables that you will use to load in Apache Solr. Now create a data source file (data-source-config.xml) as follows: <dataConfig> <dataSource name="mongod" type="JdbcDataSource" driver="com.mongodb. jdbc.MongoDriver" url="mongodb://localhost/solr-test"/> <document>    <entity name="nameage" dataSource="mongod" query="select name, price from grocery">        <field column="name" name="name"/>        <field column="name" name="id"/>        <!-- other files -->    </entity> </document> </dataConfig> Copy the solr-dataimporthandler-*.jar from your contrib directory to a container/application library path. Modify $SOLR_COLLECTION_ROOT/conf/solr-config.xml with DIH entry: <!-- DIH Starts --> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">    <lst name="defaults">    <str name="config"><path to config>/data-source-config.xml</str>    </lst> </requestHandler>    <!-- DIH ends --> Once this configuration is done, you are ready to test it out. Access http://localhost:8983/solr/dataimport?command=full-import from your browser to run the full import on Apache Solr, where you will see that your import handler has successfully ran, and has loaded the data in Solr store, as shown in the following screenshot: You can validate the content created by your new MongoDB DIH by accessing the Solr Admin page, and running a query: Using this connector, you can perform operations for full-import on various data elements. Since MongoDB is not a relational database, it does support join queries. However, it supports selects, order by, and so on. Summary In this article, we have understood the distributed aspects of any enterprise search where went through Apache Solr and MongoDB together. Resources for Article: Further resources on this subject: Evolution of Hadoop [article] In the Cloud [article] Learning Data Analytics with R and Hadoop [article]

Integrating a D3.js visualization into a simple AngularJS application

In this article by Christoph Körner, author of the book Data Visualization with D3 and AngularJS, we will apply the acquired knowledge to integrate a D3.js visualization into a simple AngularJS application. First, we will set up an AngularJS template that serves as a boilerplate for the examples and the application. We will see a typical directory structure for an AngularJS project and initialize a controller. Similar to the previous example, the controller will generate random data that we want to display in an autoupdating chart. Next, we will wrap D3.js in a factory and create a directive for the visualization. You will learn how to isolate the components from each other. We will create a simple AngularJS directive and write a custom compile function to create and update the chart. (For more resources related to this topic, see here.) Setting up an AngularJS application To get started with this article, I assume that you feel comfortable with the main concepts of AngularJS: the application structure, controllers, directives, services, dependency injection, and scopes. I will use these concepts without introducing them in great detail, so if you do not know about one of these topics, first try an intermediate AngularJS tutorial. Organizing the directory To begin with, we will create a simple AngularJS boilerplate for the examples and the visualization application. We will use this boilerplate during the development of the sample application. Let's create a project root directory that contains the following files and folders: bower_components/: This directory contains all third-party components src/: This directory contains all source files src/app.js: This file contains source of the application src/app.css: CSS layout of the application test/: This directory contains all test files (test/config/ contains all test configurations, test/spec/ contains all unit tests, and test/e2e/ contains all integration tests) index.html: This is the starting point of the application Installing AngularJS In this article, we use the AngularJS version 1.3.14, but different patch versions (~1.3.0) should also work fine with the examples. Let's first install AngularJS with the Bower package manager. Therefore, we execute the following command in the root directory of the project: bower install angular#1.3.14 Now, AngularJS is downloaded and installed to the bower_components/ directory. If you don't want to use Bower, you can also simply download the source files from the AngularJS website and put them in a libs/ directory. Note that—if you develop large AngularJS applications—you most likely want to create a separate bower.json file and keep track of all your third-party dependencies. Bootstrapping the index file We can move on to the next step and code the index.html file that serves as a starting point for the application and all examples of this section. We need to include the JavaScript application files and the corresponding CSS layouts, the same for the chart component. Then, we need to initialize AngularJS by placing an ng-app attribute to the html tag; this will create the root scope of the application. 
Here, we will call the AngularJS application myApp, as shown in the following code: <html ng-app="myApp"> <head>    <!-- Include 3rd party libraries -->    <script src="bower_components/d3/d3.js" charset="UTF-   8"></script>    <script src="bower_components/angular/angular.js"     charset="UTF-8"></script>      <!-- Include the application files -->    <script src="src/app.js"></script>    <link href="src/app.css" rel="stylesheet">      <!-- Include the files of the chart component -->    <script src="src/chart.js"></script>    <link href="src/chart.css" rel="stylesheet">   </head> <body>    <!-- AngularJS example go here --> </body> </html> For all the examples in this section, I will use the exact same setup as the preceding code. I will only change the body of the HTML page or the JavaScript or CSS sources of the application. I will indicate to which file the code belongs to with a comment for each code snippet. If you are not using Bower and previously downloaded D3.js and AngularJS in a libs/ directory, refer to this directory when including the JavaScript files. Adding a module and a controller Next, we initialize the AngularJS module in the app.js file and create a main controller for the application. The controller should create random data (that represent some simple logs) in a fixed interval. Let's generate some random number of visitors every second and store all data points on the scope as follows: /* src/app.js */ // Application Module angular.module('myApp', [])   // Main application controller .controller('MainCtrl', ['$scope', '$interval', function ($scope, $interval) {      var time = new Date('2014-01-01 00:00:00 +0100');      // Random data point generator    var randPoint = function() {      var rand = Math.random;      return { time: time.toString(), visitors: rand()*100 };    }      // We store a list of logs    $scope.logs = [ randPoint() ];      $interval(function() {     time.setSeconds(time.getSeconds() + 1);      $scope.logs.push(randPoint());    }, 1000); }]); In the preceding example, we define an array of logs on the scope that we initialize with a random point. Every second, we will push a new random point to the logs. The points contain a number of visitors and a timestamp—starting with the date 2014-01-01 00:00:00 (timezone GMT+01) and counting up a second on each iteration. I want to keep it simple for now; therefore, we will use just a very basic example of random access log entries. Consider to use the cleaner controller as syntax for larger AngularJS applications because it makes the scopes in HTML templates explicit! However, for compatibility reasons, I will use the standard controller and $scope notation. Integrating D3.js into AngularJS We bootstrapped a simple AngularJS application in the previous section. Now, the goal is to integrate a D3.js component seamlessly into an AngularJS application—in an Angular way. This means that we have to design the AngularJS application and the visualization component such that the modules are fully encapsulated and reusable. In order to do so, we will use a separation on different levels: Code of different components goes into different files Code of the visualization library goes into a separate module Inside a module, we divide logics into controllers, services, and directives Using this clear separation allows you to keep files and modules organized and clean. If at anytime we want to replace the D3.js backend with a canvas pixel graphic, we can just implement it without interfering with the main application. 
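For reference, a sketch of what the controller as alternative mentioned above could look like; this variant is not used in the rest of the article:
// src/app.js -- hypothetical "controller as" variant
angular.module('myApp', [])
  .controller('MainCtrl', ['$interval', function ($interval) {
    var vm = this;   // the controller instance itself carries the state
    vm.logs = [];
    $interval(function () {
      vm.logs.push({ time: new Date().toString(), visitors: Math.random() * 100 });
    }, 1000);
  }]);
In the template, the controller would then be referenced as ng-controller="MainCtrl as main" and the data as main.logs, which makes the owning scope explicit in the markup.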
This means that we want to use a new module of the visualization component and dependency injection. These modules enable us to have full control of the separate visualization component without touching the main application and they will make the component maintainable, reusable, and testable. Organizing the directory First, we add the new files for the visualization component to the project: src/: This is the default directory to store all the file components for the project src/chart.js: This is the JS source of the chart component src/chart.css: This is the CSS layout for the chart component test/test/config/: This directory contains all test configurations test/spec/test/spec/chart.spec.js: This file contains the unit tests of the chart component test/e2e/chart.e2e.js: This file contains the integration tests of the chart component If you develop large AngularJS applications, this is probably not the folder structure that you are aiming for. Especially in bigger applications, you will most likely want to have components in separate folders and directives and services in separate files. Then, we will encapsulate the visualization from the main application and create the new myChart module for it. This will make it possible to inject the visualization component or parts of it—for example just the chart directive—to the main application. Wrapping D3.js In this module, we will wrap D3.js—which is available via the global d3 variable—in a service; actually, we will use a factory to just return the reference to the d3 variable. This enables us to pass D3.js as a dependency inside the newly created module wherever we need it. The advantage of doing so is that the injectable d3 component—or some parts of it—can be mocked for testing easily. Let's assume we are loading data from a remote resource and do not want to wait for the time to load the resource every time we test the component. Then, the fact that we can mock and override functions without having to modify anything within the component will become very handy. Another great advantage will be defining custom localization configurations directly in the factory. This will guarantee that we have the proper localization wherever we use D3.js in the component. Moreover, in every component, we use the injected d3 variable in a private scope of a function and not in the global scope. This is absolutely necessary for clean and encapsulated components; we should never use any variables from global scope within an AngularJS component. Now, let's create a second module that stores all the visualization-specific code dependent on D3.js. Thus, we want to create an injectable factory for D3.js, as shown in the following code: /* src/chart.js */ // Chart Module   angular.module('myChart', [])   // D3 Factory .factory('d3', function() {   /* We could declare locals or other D3.js      specific configurations here. */   return d3; }); In the preceding example, we returned d3 without modifying it from the global scope. We can also define custom D3.js specific configurations here (such as locals and formatters). We can go one step further and load the complete D3.js code inside this factory so that d3 will not be available in the global scope at all. However, we don't use this approach here to keep things as simple and understandable as possible. We need to make this module or parts of it available to the main application. 
In AngularJS, we can do this by injecting the myChart module into the myApp application as follows: /* src/app.js */   angular.module('myApp', ['myChart']); Usually, we will just inject the directives and services of the visualization module that we want to use in the application, not the whole module. However, for the start and to access all parts of the visualization, we will leave it like this. We can use the components of the chart module now on the AngularJS application by injecting them into the controllers, services, and directives. The boilerplate—with a simple chart.js and chart.css file—is now ready. We can start to design the chart directive. A chart directive Next, we want to create a reusable and testable chart directive. The first question that comes into one's mind is where to put which functionality? Should we create a svg element as parent for the directive or a div element? Should we draw a data point as a circle in svg and use ng-repeat to replicate these points in the chart? Or should we better create and modify all data points with D3.js? I will answer all these question in the following sections. A directive for SVG As a general rule, we can say that different concepts should be encapsulated so that they can be replaced anytime by a new technology. Hence, we will use AngularJS with an element directive as a parent element for the visualization. We will bind the data and the options of the chart to the private scope of the directive. In the directive itself, we will create the complete chart including the parent svg container, the axis, and all data points using D3.js. Let's first add a simple directive for the chart component: /* src/chart.js */ …   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      return {      restrict: 'E',      scope: {        },      compile: function( element, attrs, transclude ) {                   // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          // Return the link function        return function(scope, element, attrs) { };      }    }; }]); In the preceding example, we first inject d3 to the directive by passing it as an argument to the caller function. Then, we return a directive as an element with a private scope. Next, we define a custom compile function that returns the link function of the directive. This is important because we need to create the svg container for the visualization during the compilation of the directive. Then, during the link phase of the directive, we need to draw the visualization. Let's try to define some of these directives and look at the generated output. We define three directives in the index.html file, as shown in the following code: <!-- index.html --> <div ng-controller="MainCtrl">   <!-- We can use the visualization directives here --> <!-- The first chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- A second chart --> <my-scatter-chart class="chart"></my-scatter-chart>   <!-- Another chart --> <my-scatter-chart class="chart"></my-scatter-chart>   </div> If we look at the output of the html page in the developer tools, we can see that for each base element of the directive, we created a svg parent element for the visualization: Output of the HTML page In the resulting DOM tree, we can see that three svg elements are appended to the directives. We can now start to draw the chart in these directives. Let's fill these elements with some awesome charts. 
Implementing a custom compile function First, let's add a data attribute to the isolated scope of the directive. This gives us access to the dataset, which we will later pass to the directive in the HTML template. Next, we extend the compile function of the directive to create a g group container for the data points and the axis. We will also add a watcher that checks for changes of the scope data array. Every time the data changes, we call a draw() function that redraws the chart of the directive. Let's get started: /* src/capp..js */ ... // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){        // we will soon implement this function    var draw = function(svg, width, height, data){ … };      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) {          // Create a SVG root element        var svg = d3.select(element[0]).append('svg');          svg.append('g').attr('class', 'data');        svg.append('g').attr('class', 'x-axis axis');        svg.append('g').attr('class', 'y-axis axis');          // Define the dimensions for the chart        var width = 600, height = 300;          // Return the link function        return function(scope, element, attrs) {            // Watch the data attribute of the scope          scope.$watch('data', function(newVal, oldVal, scope) {              // Update the chart            draw(svg, width, height, scope.data);          }, true);        };      }    }; }]); Now, we implement the draw() function in the beginning of the directive. Drawing charts So far, the chart directive should look like the following code. We will now implement the draw() function, draw axis, and time series data. We start with setting the height and width for the svg element as follows: /* src/chart.js */ ...   // Scatter Chart Directive .directive('myScatterChart', ["d3", function(d3){      function draw(svg, width, height, data) {      svg        .attr('width', width)        .attr('height', height);      // code continues here }      return {      restrict: 'E',      scope: {        data: '='      },      compile: function( element, attrs, transclude ) { ... } }]); Axis, scale, range, and domain We first need to create the scales for the data and then the axis for the chart. The implementation looks very similar to the scatter chart. We want to update the axis with the minimum and maximum values of the dataset; therefore, we also add this code to the draw() function: /* src/chart.js --> myScatterChart --> draw() */   function draw(svg, width, height, data) { ... 
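Since the directive now declares data: '=' on its isolated scope, the markup in index.html should pass the controller's logs array in through a data attribute; a minimal sketch of the updated usage (showing only one chart) is as follows:
<!-- index.html -->
<div ng-controller="MainCtrl">
  <!-- 'logs' comes from MainCtrl; the '=' binding keeps scope.data in sync with it -->
  <my-scatter-chart class="chart" data="logs"></my-scatter-chart>
</div>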
// Define a margin var margin = 30;   // Define x-scale var xScale = d3.time.scale()    .domain([      d3.min(data, function(d) { return d.time; }),      d3.max(data, function(d) { return d.time; })    ])    .range([margin, width-margin]);   // Define x-axis var xAxis = d3.svg.axis()    .scale(xScale)    .orient('top')    .tickFormat(d3.time.format('%S'));   // Define y-scale var yScale = d3.time.scale()    .domain([0, d3.max(data, function(d) { return d.visitors; })])    .range([margin, height-margin]);   // Define y-axis var yAxis = d3.svg.axis()    .scale(yScale)    .orient('left')    .tickFormat(d3.format('f'));   // Draw x-axis svg.select('.x-axis')    .attr("transform", "translate(0, " + margin + ")")    .call(xAxis);   // Draw y-axis svg.select('.y-axis')    .attr("transform", "translate(" + margin + ")")    .call(yAxis); } In the preceding code, we create a timescale for the x-axis and a linear scale for the y-axis and adapt the domain of both axes to match the maximum value of the dataset (we can also use the d3.extent() function to return min and max at the same time). Then, we define the pixel range for our chart area. Next, we create two axes objects with the previously defined scales and specify the tick format of the axis. We want to display the number of seconds that have passed on the x-axis and an integer value of the number of visitors on the y-axis. In the end, we draw the axes by calling the axis generator on the axis selection. Joining the data points Now, we will draw the data points and the axis. We finish the draw() function with this code: /* src/chart.js --> myScatterChart --> draw() */ function draw(svg, width, height, data) { ... // Add new the data points svg.select('.data')    .selectAll('circle').data(data)    .enter()    .append('circle');   // Updated all data points svg.select('.data')    .selectAll('circle').data(data)    .attr('r', 2.5)    .attr('cx', function(d) { return xScale(d.time); })    .attr('cy', function(d) { return yScale(d.visitors); }); } In the preceding code, we first create circle elements for the enter join for the data points where no corresponding circle is found in the Selection. Then, we update the attributes of the center point of all circle elements of the chart. Let's look at the generated output of the application: Output of the chart directive We notice that the axes and the whole chart scales as soon as new data points are added to the chart. In fact, this result looks very similar to the previous example with the main difference that we used a directive to draw this chart. This means that the data of the visualization that belongs to the application is stored and updated in the application itself, whereas the directive is completely decoupled from the data. To achieve a nice output like in the previous figure, we need to add some styles to the cart.css file, as shown in the following code: /* src/chart.css */ .axis path, .axis line {    fill: none;    stroke: #999;    shape-rendering: crispEdges; } .tick {    font: 10px sans-serif; } circle {    fill: steelblue; } We need to disable the filling of the axis and enable crisp edges rendering; this will give the whole visualization a much better look. Summary In this article, you learned how to properly integrate a D3.js component into an AngularJS application—the Angular way. All files, modules, and components should be maintainable, testable, and reusable. 
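As a side note on the d3.extent() shortcut mentioned in parentheses above, the x-scale domain could equivalently be written as follows; this is only an alternative to the explicit d3.min/d3.max pair:
// Equivalent x-scale domain using d3.extent()
var xScale = d3.time.scale()
    .domain(d3.extent(data, function(d) { return d.time; }))
    .range([margin, width - margin]);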
Summary

In this article, you learned how to properly integrate a D3.js component into an AngularJS application, the Angular way. All files, modules, and components should be maintainable, testable, and reusable.

You learned how to set up an AngularJS application and how to organize the folder structure for the visualization component. We put different responsibilities into different files and modules. Every piece that we can separate from the main application can be reused in another application; the goal is to modularize as much as possible. As a next step, we created the visualization directive by implementing a custom compile function. This gives us access to the first compilation of the element, where we can append the svg element, as a parent for the visualization, and the other container elements.

Resources for Article:

Further resources on this subject:
AngularJS Performance [article]
An introduction to testing AngularJS directives [article]
Our App and Tool Stack [article]

Solr Indexing Internals

Packt
23 Apr 2015
9 min read
In this article by Jayant Kumar, author of the book Apache Solr Search Patterns, we will discuss use cases for Solr on e-commerce and job sites. We will look at the problems faced while providing search on an e-commerce or job site:

The e-commerce problem statement
The job site problem statement
Challenges of large-scale indexing

(For more resources related to this topic, see here.)

The e-commerce problem statement

E-commerce provides an easy way to sell products to a large customer base. However, there is a lot of competition among e-commerce sites. When users land on an e-commerce site, they expect to find what they are looking for quickly and easily. Also, users are often not sure about the exact brands or products they want to purchase; they have only a broad idea of what they want to buy. Many customers nowadays search for their products on Google rather than visiting specific e-commerce sites, trusting that Google will take them to the e-commerce sites that carry their product.

The purpose of any e-commerce website is to help customers narrow down their broad ideas and enable them to finalize the products they want to purchase. For example, suppose a customer is interested in purchasing a mobile. His or her search for a mobile should list mobile brands, operating systems, screen sizes, and all other relevant features as facets. As the customer selects more and more features or options from the facets provided, the search narrows down to a small list of mobiles that suit his or her choice. If the list is small enough and the customer likes one of the mobiles listed, he or she will make the purchase.

The challenge is that each category has a different set of facets to be displayed. For example, searching for books should display facets such as format (paperback or hardcover), author name, book series, language, and other book-related attributes. These facets differ from the ones we discussed for mobiles. Similarly, every category has its own facets, and they need to be designed properly so that customers can narrow down to their preferred products, irrespective of the category they are looking into. The takeaway from this is that the categorization and feature listing of products should be taken care of; misrepresentation of features can lead to incorrect search results.

Another takeaway is that we need to provide multiple facets in the search results. For example, while displaying the list of all mobiles, we need to provide a facet for the brand. Once a brand is selected, another set of facets for operating systems, networks, and mobile phone features has to be provided. As more and more facets are selected, we still need to show facets within the remaining products.

Example of facet selection on Amazon.com

Another problem is that we do not know what product the customer is searching for. A site that displays a huge list of products from different categories, such as electronics, mobiles, clothes, or books, needs to be able to identify what the customer is searching for. A customer can be searching for samsung, which can appear in mobiles, tablets, electronics, or computers. For books, the site should be able to identify whether the customer has entered the author name or the book title. Identifying the input helps increase the relevance of the result set by increasing the precision of the search results. Most e-commerce sites provide search suggestions that include the category to help customers target the right category during their search.
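One simple way to back such category-aware suggestions and faceted narrowing in Solr is to facet the matching products on a category field. The following requests are a hedged sketch only: the core name (products) and the field names (category, brand, operating_system) and values are assumptions for illustration and do not come from the article; q, rows, fq, facet, and facet.field are standard Solr query parameters.

# Which categories does "samsung" appear in? Return only facet counts, no documents.
http://localhost:8983/solr/products/select?q=samsung&rows=0&facet=true&facet.field=category&wt=json

# A typical facet_fields entry in the JSON response looks like:
#   "category": ["mobiles", 120, "tablets", 45, "televisions", 30]
# which can be rendered as "samsung in Mobiles", "samsung in Tablets", and so on.

# Once the customer picks a category, filter on it and facet on the next level of attributes.
http://localhost:8983/solr/products/select?q=samsung&fq=category:mobiles&facet=true&facet.field=brand&facet.field=operating_system&wt=json

The facet counts are computed over the currently filtered result set, which is exactly the behavior needed to keep showing facets within the remaining products as more and more filters are applied.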
Amazon, for example, provides search suggestions that include both recently searched terms and products, along with category-wise suggestions:

Search suggestions on Amazon.com

It is also important that products are added to the index as soon as they are available, and it is even more important that they are removed from the index or marked as sold out as soon as their stock is exhausted. For this, modifications to the index should be visible in search results immediately. This is facilitated by a concept in Solr known as Near Real Time Indexing and Search (NRT).

The job site problem statement

A job site serves a dual purpose. On the one hand, it provides jobs to candidates, and on the other, it serves as a database of registered candidates' profiles for companies to shortlist. A job search has to be very intuitive for candidates so that they can find jobs suiting their skills, position, industry, role, and location, or even search by company name. As it is important to keep candidates engaged during their job search, facets on the abovementioned criteria should be provided so that they can narrow down to the job of their choice. The searches performed by candidates are usually not very elaborate. If a search is generic, the results need to have high precision. On the other hand, if a search does not return many results, recall has to be high to keep the candidate engaged on the site. Providing a personalized job search to candidates on the basis of their profiles and past search history also makes sense.

On the recruiter side, the search provided over the candidate database needs a huge set of fields covering every data point that a candidate has entered. Recruiters are very selective when it comes to searching for candidates for specific jobs. Educational qualification, industry, function, key skills, designation, location, and experience are some of the fields provided to the recruiter during a search. In such cases, the precision has to be high. A recruiter who likes a certain candidate may be interested in more candidates similar to the selected one. The more like this search in Solr can be used to provide a search for candidates similar to a selected candidate.

NRT is important here as well: the site should be able to return a job or a candidate in search results as soon as either is added to the database by the recruiter or the candidate. The promptness of the site is an important factor in keeping users engaged.

Challenges of large-scale indexing

Let us understand how indexing happens and what can be done to speed it up. We will also look at the challenges faced during the indexing of a large number of documents or bulky documents. An e-commerce site is a perfect example of a site containing a large number of products, while a job site is an example of a search where documents are bulky because of the content in candidate resumes.

During indexing, Solr first analyzes the documents and converts them into tokens that are stored in a RAM buffer. When the RAM buffer is full, the data is flushed into a segment on disk. When the number of segments exceeds the value configured through the mergeFactor setting in the Solr configuration, the segments are merged. Data is also written to disk when a commit is made in Solr.

Let us discuss a few points to make Solr indexing fast and to handle a large index containing a huge number of documents.

Using multiple threads for indexing on Solr

We can divide our data into smaller chunks, and each chunk can be indexed in a separate thread. Ideally, the number of threads should be twice the number of processor cores to avoid a lot of context switching. However, we can increase the number of threads beyond that and check for performance improvements.
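A minimal sketch of this multi-threaded approach is shown below, using the SolrJ classes discussed later in this article (HttpSolrServer, SolrInputDocument, add(), and commit()). The way the data is chunked, the field names, the batch size, and the thread count are illustrative assumptions rather than prescriptions from the article.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    final SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    // Roughly twice as many threads as processor cores, as suggested above
    int threads = Runtime.getRuntime().availableProcessors() * 2;
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    // Split the data into chunks; here each chunk is simply a range of ids
    for (int chunk = 0; chunk < 10; chunk++) {
      final int start = chunk * 1000;
      pool.submit(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int id = start; id < start + 1000; id++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", id);
              doc.addField("desc", "description text for doc " + id);
              docs.add(doc);
            }
            // Each thread sends its own batch to the shared server instance
            server.add(docs);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);

    // Commit once at the end instead of after every batch
    server.commit();
  }
}

Committing once after all the threads have finished keeps the cost of commits low; HttpSolrServer is safe to share across threads, and the ConcurrentUpdateSolrServer class described later achieves a similar effect with its own internal queue and background threads.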
Using the Java binary format of data for indexing

Instead of posting XML files, we can use the Java bin format for indexing. This avoids much of the overhead of parsing an XML file and converting it into a usable binary format. The way to use the Java bin format is to write our own program for creating fields, adding fields to documents, and finally adding documents to the index. Here is some sample code:

//Create an instance of the Solr server
String SOLR_URL = "http://localhost:8983/solr";
SolrServer server = new HttpSolrServer(SOLR_URL);

//Create a collection of documents to add to the Solr server
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("desc", "description text for doc 1");

SolrInputDocument doc2 = new SolrInputDocument();
doc2.addField("id", 2);
doc2.addField("desc", "description text for doc 2");

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
docs.add(doc2);

//Add the collection of documents to the Solr server and commit.
server.add(docs);
server.commit();

Here is the reference to the API for the HttpSolrServer class: http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrServer.html.

Add all files from the <solr_directory>/dist folder to the classpath for compiling and running the HttpSolrServer program.

Using the ConcurrentUpdateSolrServer class for indexing

Using the ConcurrentUpdateSolrServer class instead of the HttpSolrServer class can provide performance benefits, as the former buffers processed documents before sending them to the Solr server. We can also specify the number of background threads used to empty the buffers. The API docs for ConcurrentUpdateSolrServer can be found at the following link:

http://lucene.apache.org/solr/4_6_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html

The constructor for the ConcurrentUpdateSolrServer class is defined as:

ConcurrentUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)

Here, queueSize is the size of the buffer and threadCount is the number of background threads used to flush the buffers to the index on disk. Note that using too many threads can increase the context switching between threads and reduce performance. In order to optimize the number of threads, we should monitor performance (documents indexed per minute) after each increase and ensure that there is no drop in performance.

Summary

In this article, we briefly looked at the problems faced by e-commerce and job websites in providing search. We discussed the challenges faced while indexing a large number of documents, and we also saw some tips for improving indexing speed.

Resources for Article:

Further resources on this subject:
Tuning Solr JVM and Container [article]
Apache Solr PHP Integration [article]
Boost Your search [article]