
How-To Tutorials - Data

1204 Articles

MongoDB data modeling

Packt
20 Feb 2014
5 min read
MongoDB data modeling is an advanced topic. We will instead focus on a few key points regarding data modeling so that you understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models.

The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents, while normalized models represent the same records with references between documents. This modeling decision affects the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB. With normalized models, Pentaho Data Integration is used to resolve the reference joins.

The following table lists the objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database.

MongoDB objects    Sample MongoDB object names
Database           pentaho
Collection         sessions, events, and sessions_events
Document           see the field listings below

The sessions document fields (key-value pairs) are as follows:
  _id: 512d200e223e7f294d13607
  ip_address: 235.149.145.57
  id_session: DF636FB593684905B335982BEA3D967B
  browser: Chrome
  date_session: 2013-02-20T15:47:49Z
  duration_session: 18.24
  referring_url: www.xyz.com

The events document fields (key-value pairs) are as follows:
  _id: 512d1ff2223e7f294d13c0c6
  id_session: DF636FB593684905B335982BEA3D967B
  event: Signup Free Offer

The sessions_events document fields (key-value pairs) are as follows:
  _id: 512d200e223e7f294d13607
  ip_address: 235.149.145.57
  id_session: DF636FB593684905B335982BEA3D967B
  browser: Chrome
  date_session: 2013-02-20T15:47:49Z
  duration_session: 18.24
  referring_url: www.xyz.com
  event_data: [array]
    event: Visited Site
    event: Signup Free Offer

This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection.

Normalized models

Because multiple clickstream events can occur within a single web session, sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference. The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure. Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red.

The normalized model is an efficient data model if we expect the number of events per session to be very large, for a couple of reasons. The first reason is that MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. The second reason is that query performance can be negatively impacted by large data arrays that contain thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small.

Denormalized models

The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside a single session document. Embedding is accomplished by declaring a field that holds either an array of values or embedded documents, known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection, as shown in the following figure.

As you can see, we have the choice to keep the session and event data in separate collections or, alternatively, to store both datasets inside a single collection. One important rule to keep in mind when considering the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/.

To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:

- The MongoDB query language does not support joins between collections
- The maximum document size is 16 megabytes
- Very large data arrays can negatively impact query performance

In our sample database, the number of clickstream events per session is expected to be small, within a modest range of only one to 20 per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, that can handle both normalized and denormalized data models.

Summary

This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho.

Resources for Article:

Further resources on this subject:
- Installing Pentaho Data Integration with MySQL [article]
- Integrating Kettle and the Pentaho Suite [article]
- Getting Started with Pentaho Data Integration [article]
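To make the trade-off above concrete, here is a small Python sketch using the pymongo driver. The driver choice, connection string, and the sample event list are assumptions made for illustration only; the article itself resolves the normalized join with Pentaho Data Integration rather than code. The sketch stores the same session under both models, performs the client-side join that the normalized model forces on you, and queries the embedded array directly in the denormalized model.

```python
# Minimal pymongo sketch contrasting the normalized and denormalized
# clickstream models described above. Connection details are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
db = client["pentaho"]

session = {
    "ip_address": "235.149.145.57",
    "id_session": "DF636FB593684905B335982BEA3D967B",
    "browser": "Chrome",
    "date_session": "2013-02-20T15:47:49Z",
    "duration_session": 18.24,
    "referring_url": "www.xyz.com",
}
# Hypothetical four events for the one 20-minute session.
events = ["Visited Site", "Watched Video", "Signup Free Offer", "Completed Lead Form"]

# Normalized model: one session document plus one event document per event.
db.sessions.insert_one(dict(session))  # insert a copy so `session` stays reusable
db.events.insert_many(
    [{"id_session": session["id_session"], "event": e} for e in events]
)

# The query language offers no join between collections, so relating the two
# normalized collections requires a client-side join on id_session.
s = db.sessions.find_one({"id_session": session["id_session"]})
joined_events = list(db.events.find({"id_session": s["id_session"]}))

# Denormalized model: the events are embedded as an array in a single document.
db.sessions_events.insert_one(
    {**session, "event_data": [{"event": e} for e in events]}
)

# The embedded array is queried directly with array operators; no join needed.
signups = db.sessions_events.count_documents({"event_data.event": "Signup Free Offer"})
print(len(joined_events), "joined events;", signups, "embedded signup sessions")
```

In the normalized case, Pentaho Data Integration plays the role of that client-side join; in the denormalized case the query needs no join at all.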


Understanding Process Variation with Control Charts

Packt
19 Feb 2014
10 min read
Xbar-R charts and applying stages to a control chart

As with all control charts, Xbar-R charts are used to monitor process stability. Apart from generating the basic control chart, we will look at how we can control the output with a few options within the dialog boxes. Xbar-R stands for means and ranges; we use the means chart to estimate the population mean of a process and the range chart to observe how the population variation changes. For more information on control charts, see Understanding Statistical Process Control by Donald J. Wheeler and David S. Chambers.

As an example, we will study the fill volumes of syringes. Five syringes are sampled from the process at hourly intervals; these are used to represent the mean and variation of the process over time. We will plot the means and ranges of the fill volumes across 50 subgroups. The data also includes a process change, which will be displayed on the chart by dividing the data into two stages.

Charts for subgrouped data can use a worksheet set up in either of two formats. Here the data is recorded so that each row represents a subgroup and the columns are the sample points. The Xbar-S chart will use data in the other format, where all the results are recorded in one column. The following screenshot shows the data with subgroups across the rows on the left, and the same data with subgroups stacked on the right.

How to do it…

The following steps will create an Xbar-R chart staged by the Adjustment column with all eight of the tests for special causes:

1. Use the Open Worksheet command from the File menu to open the Volume.mtw worksheet.
2. Navigate to Stat | Control Charts | Variables charts for subgroups, then click on Xbar-R….
3. Change the drop-down at the top of the dialog to Observations for a subgroup are in one row of columns:.
4. Enter the columns Measure1 to Measure5 into the dialog box by highlighting all the measure columns in the left selection box and clicking on Select.
5. Click on Xbar-R Options and navigate to the Tests tab. Select all the tests for special causes.
6. Select the Stages tab.
7. Enter Adjustment in the Define Stages section.
8. Click on OK in each dialog box.

How it works…

The R or range chart displays the variation over time in the data by plotting the range of measurements in a subgroup. The Xbar chart plots the means of the subgroups. The layout of the worksheet is picked from the drop-down box in the main dialog box. The All observations for a chart are in one column: field is used for data stacked into columns; means and ranges are found from subgroups indicated in the worksheet. The Observations for a subgroup are in one row of columns: field finds means and ranges from the worksheet rows. The Xbar-S chart example shows how to use the dialog box when the data is in a single column; the dialog boxes for both Xbar-R and Xbar-S work the same way.

Tests for special causes are used to check the data for nonrandom events. The Xbar-R chart options give us control over the tests that will be used, and the values of the tests can be changed from these options as well. The options from the Tools menu of Minitab can be used to set the default values and tests to use in any control chart.

By using the option under Stages, we are able to recalculate the means and standard deviations for the pre- and post-change groups in the worksheet. Stages can be used to recalculate the control chart parameters on each change in a column or on specific values. A date column can be used to define stages by entering the date at which a stage should be started.

There's more…

Xbar-R charts are also available under the Assistant menu. The default display option for a staged control chart is to show only the mean and control limits for the final stage. Should we want to see the values for all stages, we would use the Xbar-R Options and the Display tab. To place these values on the chart for all stages, check the Display control limit / center line labels for all stages box. See the Using an Xbar-S chart recipe for a description of all the tabs within the Control Charts options.

Using an Xbar-S chart

Xbar-S charts are similar in use to Xbar-R charts. The main difference is that the variation chart uses the standard deviation of the subgroups instead of the range. The choice between Xbar-R and Xbar-S is usually made based on the number of samples in each subgroup. With smaller subgroups, typically fewer than nine results per subgroup, the estimated standard deviation can be inflated, which increases the width of the control limits on the charts. The Automotive Industry Action Group (AIAG) suggests using Xbar-R when the subgroup size is less than nine and Xbar-S when it is greater than or equal to nine.

Now, we will apply an Xbar-S chart to a slightly different scenario. Japan sits above several active fault lines, so minor earthquakes are felt in the region quite regularly, and there may be several minor seismic events on any given day. For this example, we are going to use seismic data from the Advanced National Seismic System. All seismic events from January 1, 2013 to July 12, 2013 from the region covering latitudes 31.128 to 45.275 and longitudes 129.799 to 145.269 are included in this dataset. This corresponds to an area that roughly encompasses Japan. The dataset is provided for us already, but we could gather more up-to-date results from the following link: http://earthquake.usgs.gov/monitoring/anss/. To search the catalog yourself, use the following link: http://www.ncedc.org/anss/catalog-search.html.

We will look at seismic events by week and create Xbar-S charts of magnitude and depth. In the initial steps, we will use the date to generate a column that identifies the week of the year; this column is then used as the subgroup identifier.

How to do it…

The following steps will create an Xbar-S chart for the depth and magnitude of earthquakes, displaying the mean and standard deviation of the events by week:

1. Use the Open Worksheet command from the File menu to open the earthquake.mtw file.
2. Go to the Data menu, click on Extract from Date/Time, and then click on To Text.
3. Enter Date in the Extract from Date/time column: section.
4. Type Week in the Store text column in: section.
5. Check the selection for Week and click on OK to create the new column.
6. Navigate to Stat | Control Charts | Variable charts for Subgroups and click on Xbar-S.
7. Enter Depth and Mag into the dialog box as shown in the following screenshot and Week into the Subgroup sizes: field.
8. Click on the Scale button and select the option for Stamp. Enter Date in the Stamp columns section. Click on OK.
9. Click on Xbar-S Options and then navigate to the Tests tab. Select all tests for special causes.
10. Click on OK in each dialog box.

How it works…

Steps 1 to 4 build the Week column that we use as the subgroup. The extract from date/time options are excellent for quickly generating columns based on dates: days of the week, week of the year, month, or even minutes or seconds can all be separated from the date.

Multiple columns can be entered into the control chart dialog box just as we have done here. Each column is then used to create a new Xbar-S chart, which lets us quickly create charts for several dimensions that are recorded at the same time. Using the Week column as the subgroup size generates the control chart with the mean depth and magnitude for each week.

The scale options within control charts are used to change the display of the chart scales. By default, the x axis displays the subgroup number; changing this to display the date can be more informative when identifying results that are out of control. Options to add axes, tick marks, gridlines, and additional reference lines are also available. We can also edit the axis of the chart after we have generated it by double-clicking on the x axis.

The Xbar-S options are similar for all control charts; the tabs within Options give us control over a number of items for the chart:

- Parameters: This sets the historical means and standard deviations; if using multiple columns, enter the first column mean, leave a space, and enter the second column mean
- Estimate: This allows us to specify subgroups to include or exclude in the calculations and change the method of estimating sigma
- Limits: This can be used to change where sigma limits are displayed or to place bounds on the control limits
- Tests: This allows us to choose the tests for special causes and change their default values; the Using an I-MR chart recipe details the options for the Tests tab
- Stages: This allows the chart to be subdivided and will recalculate center lines and control limits for each stage
- Box-Cox: This can be used to transform the data, if necessary
- Display: This has settings to choose how much of the chart to display; we can limit the chart to show only the last section of the data or split a larger dataset into separate segments, and there is also an option to display the control limits and center lines for all stages of a chart
- Storage: This can be used to store parameters of the chart, such as means, standard deviations, plotted values, and test results

There's more…

The control limits for the charts produced vary because the subgroup sizes are not constant; the number of earthquakes varies each week. In most practical applications, we would expect to collect the same number of samples or items in a subgroup and hence have flat control limits. If we wanted to see the number of earthquakes in each week, we could use Tally from the Tables menu, which displays a count per week. We could also store this tally back into the worksheet; the result of the tally could be used with a c-chart to display a count of earthquake events per week.

If we wanted to import the data directly from the Advanced National Seismic System, the following steps will obtain the data and prepare the worksheet for us:

1. Follow the link to the ANSS catalog search at http://www.ncedc.org/anss/catalog-search.html.
2. Enter 2013/01/01 in the Start date, time: field.
3. Enter 2013/06/12 in the End date, time: field.
4. Enter 3 in the Min magnitude: field.
5. Enter 31.128 in the Min latitude field and 45.275 in the Max latitude field.
6. Enter 129.799 in the Min longitude field and 145.269 in the Max longitude field.
7. Copy the data from the search results, excluding the headers, and paste it into a Minitab worksheet.
8. Change the names of the columns to C1 Date, C2 Time, C3 Lat, C4 Long, C5 Depth, C6 Mag. The other columns, C7 to C13, can then be deleted.
9. The Date column will have copied itself into Minitab as text; to convert this back to a date, navigate to Data | Change Data Type | Text to Date/Time. Enter Date in both the Change text columns: and Store date/time columns in: sections. In the Format of text columns: section, enter yyyy/mm/dd. Click on OK.
10. To extract the week from the Date column, navigate to Data | Date/Time | Extract to Text. Enter 'Date' in the Extract from date/time column: section. Enter 'Week' in the Store text column in: field. Check the box for Week and click on OK.
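Minitab performs the control limit arithmetic behind these recipes for you, but it is useful to see how little is involved. The following Python sketch is purely illustrative and not part of the recipe: it assumes NumPy, subgroups of five measurements (as in the syringe data), and the standard control chart constants for a subgroup size of 5 (A2 = 0.577, D3 = 0, D4 = 2.114); random data stands in for Volume.mtw.

```python
# Sketch of the Xbar-R limit calculations that Minitab performs internally.
# Each row of `data` is one subgroup of n = 5 measurements.
import numpy as np

A2, D3, D4 = 0.577, 0.0, 2.114          # constants for subgroup size n = 5

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=0.2, size=(50, 5))  # stand-in for Volume.mtw

xbar = data.mean(axis=1)                 # subgroup means (Xbar chart points)
r = data.max(axis=1) - data.min(axis=1)  # subgroup ranges (R chart points)
xbar_bar, r_bar = xbar.mean(), r.mean()  # grand mean and mean range

limits = {
    "Xbar chart": (xbar_bar - A2 * r_bar, xbar_bar, xbar_bar + A2 * r_bar),
    "R chart":    (D3 * r_bar, r_bar, D4 * r_bar),
}
for chart, (lcl, cl, ucl) in limits.items():
    print(f"{chart}: LCL={lcl:.3f}  CL={cl:.3f}  UCL={ucl:.3f}")

# Test 1 for special causes: any subgroup mean beyond the Xbar control limits.
lcl, _, ucl = limits["Xbar chart"]
print("Subgroups failing test 1:", np.where((xbar < lcl) | (xbar > ucl))[0] + 1)
```

Staging simply repeats this calculation separately for the subgroups before and after the process change, which is what the Stages tab automates; the Xbar-S chart replaces the range and its constants with the subgroup standard deviation and its own constants.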


Processing Tweets with Apache Hive

Packt
18 Feb 2014
6 min read
Extracting hashtags

In this part and the following one, we'll see how to extract data efficiently from tweets, such as hashtags and emoticons. We need to do this because we want to know what the most discussed topics are and also to get the mood across the tweets; we'll then join that information to get people's sentiments.

We'll start with hashtags. To do so, we need to do the following:

1. Create a new hashtag table.
2. Use a function that will extract the hashtags from the tweet string.
3. Feed the hashtag table with the result of the extract function.

So, there is some bad news and some good news:

- Bad news: Hive provides a lot of built-in user-defined functions, but unfortunately it does not provide any function based on a regex pattern; we need to use a custom user-defined function to do that. This is not all bad, as you will learn how to do it.
- Good news: Hive provides an extremely efficient way to create a Hive table from an array. We'll use the lateral view and the explode Hive UDF to do that.

The following is the Hive-processing workflow that we are going to apply to our tweets (Hive-processing workflow diagram). The steps are basically as follows:

1. Receive the tweets.
2. Detect all the hashtags using the custom Hive user-defined function.
3. Obtain an array of hashtags.
4. Explode it and obtain a lateral view to feed our hashtags table.

This kind of processing is really useful if we want to get a feeling for what the top tweeted topics are, which is most often represented by a word cloud chart (topic word cloud sample).

Let's do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components:

- One component to connect to Hive; you can easily copy and paste the connection component from the CH03_03_HIVE_FORMAT_DATA job.

- One tHiveRow to add the custom Hive UDF to the Hive runtime classpath. First, we add the following context variable to our PacktContext group:

    Name: custom_udf_jar
    Value: PATH_TO_THE_JAR (for example, /Users/bahaaldine/here/is/the/jar/extractHashTags.jar)

  This new context variable is just the path to the Hive UDF JAR file provided in the source files. Now we can add the "add jar "+context.custom_udf_jar Hive query to our tHiveRow component to load the JAR file into the classpath when the job is run. We use the add jar query so that Hive will load all the classes in the JAR file when the job starts, as shown in the screenshot (Adding a custom UDF JAR to the Hive classpath).

- One tHiveRow to register the custom UDF in the available UDF catalog after the JAR file has been loaded by the previous component. The custom UDF is a Java class with a bunch of methods that can be invoked from HiveQL code. The custom UDF that we need is located in the org.talend.demo package of the JAR file and is named ExtractPattern. So we simply add the "create temporary function extract_patterns as 'org.talend.demo.ExtractPattern'" configuration to the component. We use the create temporary function query to create a new extract_patterns function in the Hive UDF catalog and give the implementation class contained in our package.

- One tHiveRow to drop the hashtags table if it exists. As we have done in the CH03_02_HIVE_CREATE_TWEET_TABLE job, just add the "DROP TABLE IF EXISTS hash_tags" drop statement to be sure that the table is removed when we relaunch the job.

- One tHiveRow to create the hashtags table. We are going to create a new table to store the hashtags. For simplicity, we'll only store the minimum time and description information, as shown in the following table:

    Column            Type
    hash_tags_id      String
    day_of_week       String
    day_of_month      String
    time              String
    month             String
    hash_tags_label   String

  The essential information here is the hash_tags_label column, which contains the hashtag name. With this knowledge, the following is our create table query:

    CREATE EXTERNAL TABLE hash_tags (
      hash_tags_id string,
      day_of_week string,
      day_of_month string,
      time string,
      month string,
      hash_tags_label string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags'

- Finally, one tHiveRow component to feed the hashtags table. Here we use all the assets provided by the previous components, as shown in the following query:

    insert into table hash_tags
    select
      concat(formatted_tweets.day_of_week, formatted_tweets.day_of_month,
             formatted_tweets.time, formatted_tweets.month) as hash_id,
      formatted_tweets.day_of_week,
      formatted_tweets.day_of_month,
      formatted_tweets.time,
      formatted_tweets.month,
      hash_tags_label
    from formatted_tweets
    LATERAL VIEW explode(
      extract_patterns(formatted_tweets.content,'#(\\w+)')
    ) hashTable as hash_tags_label

Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse formatted_tweets.content for all hashtags based on the regex #(\w+). In Talend, all strings are Java String objects, which is why we need to escape each backslash; Hive also needs its special characters escaped, which brings us to four backslashes in the end. The extract_patterns function returns an array that we pass to the explode Hive UDF in order to obtain a list of objects. We then pass them to the lateral view statement, which creates a new on-the-fly view called hashTable with one column, hash_tags_label.

Take a breath; we are almost done. If we go one level up, we can see that we select all the required columns for our new hash_tags table, concatenate data to build hash_id, and dynamically select a runtime-built column called hash_tags_label provided by the lateral view. Finally, all the selected data is inserted into the hash_tags table.

We just need to run the job and then, using the following query, check in Hive whether the new table contains our hashtags:

    select * from hash_tags

The following diagram shows the complete hashtag-extraction job structure (Hive processing job).

Summary

By now, you should have a good overview of how to use Apache Hive features with Talend, from the ELT mode to the lateral view, passing by the custom Hive user-defined function. From a use case point of view, we have now reached the step where we need to reveal some added-value data from our Hive-based processing.

Resources for Article:

Further resources on this subject:
- Visualization of Big Data [article]
- Managing Files [article]
- Making Big Data Work for Hadoop and Solr [article]
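The extract_patterns UDF itself is a Java class supplied with the book's source code, but the transformation it performs, combined with the LATERAL VIEW explode step, is easy to prototype outside Hive. The following Python sketch is illustrative only; the sample tweet and the helper names are made up here. It simply shows how one tweet row fans out into one row per #(\w+) match, which is what the Hive query above does at scale.

```python
# Illustrative Python equivalent of the extract_patterns UDF plus the
# LATERAL VIEW explode(...) step: one input row yields one row per hashtag.
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_patterns(content: str, pattern: re.Pattern = HASHTAG_PATTERN) -> list:
    """Return the array of hashtag labels found in a tweet (like the Hive UDF)."""
    return pattern.findall(content)

def explode(rows):
    """Mimic LATERAL VIEW explode: yield one record per hash_tags_label."""
    for row in rows:
        for label in extract_patterns(row["content"]):
            yield {**row, "hash_tags_label": label}

# Hypothetical sample of the formatted_tweets data.
formatted_tweets = [
    {"day_of_week": "Thu", "day_of_month": "20", "time": "15:47", "month": "02",
     "content": "Loving #Hive and #Talend for #BigData processing"},
]

for r in explode(formatted_tweets):
    hash_id = r["day_of_week"] + r["day_of_month"] + r["time"] + r["month"]
    print(hash_id, r["hash_tags_label"])
```

Running this prints three rows for the single sample tweet, mirroring how the insert query builds hash_id and hash_tags_label for every exploded hashtag.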


Query Performance Tuning

Packt
18 Feb 2014
5 min read
Understanding how Analysis Services processes queries

We need to understand what happens inside Analysis Services when a query is run. The two major parts of the Analysis Services engine are:

- The Formula Engine: This part processes MDX queries, works out what data is needed to answer them, requests that data from the Storage Engine, and then performs all calculations needed for the query.
- The Storage Engine: This part handles all the reading and writing of data, for example, during cube processing, and fetches all the data that the Formula Engine requests when a query is run.

When you run an MDX query, then, that query goes first to the Formula Engine, then to the Storage Engine, and then back to the Formula Engine before the results are returned to you.

Performance tuning methodology

When tuning performance, there are certain steps you should follow so that you can measure the effect of any changes you make to your cube, its calculations, or the query you're running:

- Always test your queries in an environment that is identical to your production environment, wherever possible. Otherwise, ensure that the size of the cube and the server hardware you're running on are at least comparable, and that you are running the same build of Analysis Services.
- Make sure that no one else has access to the server you're running your tests on. You won't get reliable results if someone else starts running queries at the same time as you.
- Make sure that the queries you're testing are equivalent to the ones your users want to have tuned. As we'll see, you can use Profiler to capture the exact queries your users are running against the cube.
- Whenever you test a query, run it twice: first on a cold cache, and then on a warm cache. Make sure you keep a note of the time each query takes to run and what you changed on the cube or in the query for that run.

Clearing the cache is a very important step, because queries that run for a long time on a cold cache may be instant on a warm cache. When you run a query against Analysis Services, some or all of the results of that query (and possibly other data in the cube that is not required for the query) will be held in cache, so that the next time a query requests the same data it can be answered from cache much more quickly.

To clear the cache of an Analysis Services database, you need to execute a ClearCache XMLA command. To do this in SQL Management Studio, open a new XMLA query window and enter the following:

    <Batch>
      <ClearCache>
        <Object>
          <DatabaseID>Adventure Works DW 2012</DatabaseID>
        </Object>
      </ClearCache>
    </Batch>

Remember that the ID of a database may not be the same as its name; you can check this by right-clicking on a database in the SQL Management Studio Object Explorer and selecting Properties. Alternatives to this method also exist: the MDX Studio tool allows you to clear the cache with a menu option, and the Analysis Services Stored Procedure Project (http://tinyurl.com/asstoredproc) contains code that allows you to clear the Analysis Services cache and the Windows File System cache directly from MDX. Clearing the Windows File System cache is interesting because it allows you to compare the performance of the cube on a warm and cold file system cache as well as a warm and cold Analysis Services cache. When the Analysis Services cache is cold or can't be used for some reason, a warm filesystem cache can still have a positive impact on query performance.

After the cache has been cleared, before Analysis Services can answer a query it needs to recreate the calculated members, named sets, and other objects defined in a cube's MDX Script. If you have any reasonably complex named set expressions that need to be evaluated, you'll see some activity in Profiler relating to these sets being built, and it's important to be able to distinguish between this and activity related to the queries you're actually running. All MDX Script-related activity occurs between the Execute MDX Script Begin and Execute MDX Script End events; these are fired after the Query Begin event but before the Query Cube Begin event for the query run after the cache has been cleared, and there is one pair of Begin/End events for each command in the MDX Script. When looking at a Profiler trace, you should either ignore everything between the first Execute MDX Script Begin event and the last Execute MDX Script End event, or run a query that returns no data at all to trigger the evaluation of the MDX Script, for example:

    SELECT {} ON 0 FROM [Adventure Works]

Designing for performance

Many of the recommendations for designing cubes will also improve query performance; in fact, the performance of a query is intimately linked to the design of the cube it's running against. For example, dimension design, especially optimizing attribute relationships, can have a significant effect on the performance of all queries, at least as much as any of the optimizations. As a result, we recommend that if you have a poorly performing query, the first thing you should do is review the design of your cube to see if there is anything you could do differently. There may well be some kind of trade-off needed between usability, manageability, time to develop, overall elegance of the design, and query performance, but since query performance is usually the most important consideration for your users, it will take precedence. To put it bluntly, if the queries your users want to run don't run fast, your users will not want to use the cube at all!
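The cold/warm measurement routine described above is easy to wrap in a small harness. The sketch below is not from the book, and run_mdx and clear_cache are hypothetical placeholders that you would wire to whatever client you use to send MDX and XMLA to your server (for example, an ADOMD.NET wrapper) and to the ClearCache batch shown earlier. It simply applies the methodology: clear the cache, trigger MDX Script evaluation with an empty query, then time a cold run and one or more warm runs.

```python
# Minimal cold/warm cache timing harness. The two callables are placeholders:
# wire clear_cache() to the ClearCache XMLA batch shown above and run_mdx()
# to your own MDX client; neither is provided here.
import time
from typing import Callable

def time_query(mdx: str,
               run_mdx: Callable[[str], object],
               clear_cache: Callable[[], None],
               warm_runs: int = 1) -> dict:
    """Return cold and warm execution times (in seconds) for one MDX query."""
    clear_cache()  # start from a cold Analysis Services cache

    # Evaluate the MDX Script first so its cost is not counted in the cold run.
    run_mdx("SELECT {} ON 0 FROM [Adventure Works]")

    start = time.perf_counter()
    run_mdx(mdx)                        # cold-cache run
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):          # warm-cache runs reuse cached results
        start = time.perf_counter()
        run_mdx(mdx)
        warm.append(time.perf_counter() - start)

    return {"cold_s": round(cold, 3), "warm_s": [round(t, 3) for t in warm]}
```

Keep the returned numbers alongside a note of what you changed on the cube or in the query for each run, as the methodology above recommends.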


Sizing and Configuring your Hadoop Cluster

Oli Huggins
16 Feb 2014
10 min read
This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly.

Sizing your Hadoop cluster

Hadoop's performance depends on multiple factors based on well-configured software layers and well-dimensioned hardware resources that utilize its CPU, memory, hard drives (storage I/O), and network bandwidth efficiently. Planning a Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be outside the scope of this book. That is what we are trying to make clearer in this section by providing explanations and formulas to help you best estimate your needs. We will introduce a basic guideline that will help you make your decisions while sizing your cluster and answer some planning questions about the cluster's needs, such as the following:

- How to plan my storage?
- How to plan my CPU?
- How to plan my memory?
- How to plan the network bandwidth?

While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently, and the disk/memory capacity of each one.

Hadoop has a master/slave architecture and is both memory and CPU intensive. Its MapReduce layer has two main components:

- JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster
- TaskTracker: This runs tasks on each node of the cluster

To work efficiently, HDFS must have high-throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large blocks). This pattern defines one big read (or write) at a time with a block size of 64 MB, 128 MB, up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block replication.

HDFS is itself based on a master/slave architecture with two main components: the NameNode / Secondary NameNode and the DataNode. These are critical components and need a lot of memory to store the file's meta information, such as attributes and file localization, directory structure, and names, and to process data. The NameNode ensures that data blocks are properly replicated in the cluster. The second component, the DataNode, manages the state of an HDFS node and interacts with its data blocks; it requires a lot of I/O for processing and data transfer.

Typically, the MapReduce layer has two main prerequisites. The first is that input datasets must be large enough to fill a data block and be split into smaller, independent data chunks (for example, a 10 TB text file can be split into 40,960 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is data locality, which means that the MapReduce code is moved to where the data lies, not the opposite (it is more efficient to move a few megabytes of code close to the data to be processed than to move many data blocks over the network or the disk). This involves having a distributed storage system that exposes data locality and allows the execution of code on any storage node.

Concerning the network bandwidth, it is used at two instances: during the replication process following a file write, and during the balancing of the replication factor when a node fails.

The most common practice for sizing a Hadoop cluster is to size it based on the amount of storage required. The more data in the system, the more machines are required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity.

Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster.

Example growth plan:

- Daily data input: 100 GB
- HDFS replication factor: 3
- Storage space used by daily data input = daily data input * replication factor = 300 GB
- Monthly growth: 5%
- Monthly volume = (300 * 30) + 5% = 9450 GB
- After one year = 9450 * (1 + 0.05)^12 = 16971 GB
- Intermediate MapReduce data: 25%
- Non-HDFS reserved space per disk: 30%
- Size of a hard drive disk: 4 TB
- Dedicated space per node = HDD size * (1 - (Non-HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100)) = 4 * (1 - (0.30 + 0.25)) = 1.8 TB (which is the node capacity)

Number of DataNodes needed to process:

- The whole first month's data = 9450 / 1800, approximately 6 nodes
- The 12th month's data = 16971 / 1800, approximately 10 nodes
- The whole year's data = 157938 / 1800, approximately 88 nodes

Do not use RAID array disks on a DataNode; HDFS provides its own replication mechanism. It is also important to note that, for every disk, 30 percent of its capacity should be reserved for non-HDFS use.

It is easy to determine the memory needed for both the NameNode and the Secondary NameNode: the memory needed by the NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by the Secondary NameNode should be identical to that of the NameNode. Then you can apply the following figures and formula to determine the memory amount:

- NameNode memory: 2 GB - 4 GB
- Secondary NameNode memory: 2 GB - 4 GB
- OS memory: 4 GB - 8 GB
- HDFS memory: 2 GB - 8 GB
- Memory amount = HDFS cluster management memory + NameNode memory + OS memory
- At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB

It is also easy to determine the DataNode memory amount, but this time the amount depends on the number of physical CPU cores installed on each DataNode:

- DataNode process memory: 4 GB - 8 GB
- DataNode TaskTracker memory: 4 GB - 8 GB
- OS memory: 4 GB - 8 GB
- CPU cores: 4+
- Memory per CPU core: 4 GB - 8 GB
- Memory amount = memory per CPU core * number of CPU cores + DataNode process memory + DataNode TaskTracker memory + OS memory
- At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB

Regarding the CPU and the network bandwidth, we suggest using modern multicore CPUs with at least four physical cores per CPU. The more physical CPU cores you have, the more you will be able to enhance your jobs' performance (following the rules discussed to avoid underutilization or overutilization). For the network switches, we recommend equipment with high throughput, such as 10 GB Ethernet intra-rack with N x 10 GB Ethernet inter-rack.
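The sizing formulas above are easy to fold into a small script. The following Python sketch is not from the article; its default figures simply restate the example plan (100 GB/day, replication factor 3, 5 percent monthly growth, 4 TB disks with 30 percent non-HDFS reserve and 25 percent intermediate MapReduce data), so you can substitute your own numbers and re-derive the node counts and minimum memory amounts.

```python
# Sketch of the cluster sizing arithmetic above; defaults mirror the example plan.
import math

def monthly_volume_gb(daily_gb=100, replication=3, growth=0.05):
    # Storage used per month with growth applied, as in the article's formula.
    return daily_gb * replication * 30 * (1 + growth)

def node_capacity_tb(disk_tb=4, non_hdfs=0.30, intermediate=0.25):
    # Space on one DataNode actually available to HDFS data.
    return disk_tb * (1 - (non_hdfs + intermediate))

def datanodes_needed(volume_gb, capacity_tb):
    return math.ceil(volume_gb / (capacity_tb * 1000))

def namenode_memory_gb(hdfs_meta=2, namenode=2, os_mem=4):
    return hdfs_meta + namenode + os_mem

def datanode_memory_gb(cores=4, per_core=4, process=4, tasktracker=4, os_mem=4):
    return cores * per_core + process + tasktracker + os_mem

month1 = monthly_volume_gb()                          # ~9450 GB
month12 = month1 * (1 + 0.05) ** 12                   # ~16971 GB, as in the plan
year = sum(month1 * (1 + 0.05) ** m for m in range(1, 13))  # ~157938 GB cumulative
cap = node_capacity_tb()                              # 1.8 TB usable per node

print("DataNodes, first month:", datanodes_needed(month1, cap))   # ~6
print("DataNodes, 12th month:", datanodes_needed(month12, cap))   # ~10
print("DataNodes, whole year:", datanodes_needed(year, cap))      # ~88
print("NameNode memory (GB):", namenode_memory_gb())              # 8
print("DataNode memory (GB):", datanode_memory_gb())              # 28
```

Rerunning the script with your own daily volume, growth rate, and disk size gives a first-order estimate before any hardware is ordered; it does not replace measuring real workloads.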
Configuring your cluster correctly

To run Hadoop and get maximum performance, it needs to be configured correctly. But the question is how to do that. Well, based on our experience, there is not one single answer to this question. Experience gives a clear indication that the Hadoop framework should be adapted to the cluster it is running on, and sometimes also to the job.

In order to configure your cluster correctly, we recommend running a Hadoop job (or jobs) the first time with its default configuration to get a baseline. Then you check for resource weaknesses (if they exist) by analyzing the job history logfiles and recording the results (the measured time it took to run the jobs). After that, you iteratively tune your Hadoop configuration and re-run the job until you get a configuration that fits your business needs.

The number of mapper and reducer tasks that a job should use is important; picking the right number of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; others give different guidelines. This is because mapper tasks often process a lot of data, and the results of those tasks are passed to the reducer tasks; often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. The correct number of reducers must also be considered. The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of tasks that can run in parallel on a DataNode.

In a Hadoop cluster, master nodes typically consist of machines where one machine is designated as the NameNode and another as the JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin by starting the HDFS daemons on the master node and the DataNode daemons on all data node machines. Then you start the MapReduce daemons: the JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemons' pseudo formula.

When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve two CPU cores on each DataNode for the HDFS and MapReduce daemons, while in a small or medium data context you can reserve only one CPU core on each DataNode. Once you have determined the maximum number of mapper slots, you need to determine the maximum number of reducer slots. Based on our experience, a distribution between the map and reduce tasks on DataNodes that gives good performance is to define the number of reducer slots to be the same as the number of mapper slots, or at least equal to two-thirds of the mapper slots.

Let's learn to correctly configure the number of mappers and reducers, assuming the following cluster examples:

Cluster machine                 Nb    Medium data size      Large data size
DataNode CPU cores              8     Reserve 1 CPU core    Reserve 2 CPU cores
DataNode TaskTracker daemon     1     1                     1
DataNode HDFS daemon            1     1                     1
Data block size                 -     128 MB                256 MB
DataNode CPU % utilization      -     95% to 120%           95% to 150%
Cluster nodes                   -     20                    40
Replication factor              -     2                     3

We want to use the CPU resources at least 95 percent, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor range between 120 percent and 170 percent.

Maximum mapper slots on one node in a large data context = (number of physical cores - reserved cores) * (0.95 to 1.5)

- Reserved cores = 1 for the TaskTracker + 1 for HDFS
- Let's say the CPU on the node will be used up to 120% (with Hyper-Threading)
- Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2, rounded down to 7

Let's apply the two-thirds mappers/reducers technique:

- Maximum number of reducer slots = 7 * 2/3 = 5

Let's define the number of slots for the cluster:

- Mapper slots = 7 * 40 = 280
- Reducer slots = 5 * 40 = 200

The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by opening only one block. With the default 64 MB block size, two mapper tasks would be needed to process the same amount of data. This may be considered a drawback, because initializing one more mapper task and opening one more file takes more time.

Summary

In this article, we learned about sizing and configuring the Hadoop cluster to optimize it for MapReduce.

Resources for Article:

Further resources on this subject:
- Hadoop Tech Page
- Hadoop and HDInsight in a Heartbeat
- Securing the Hadoop Ecosystem
- Advanced Hadoop MapReduce Administration


Changing the Appearance

Packt
13 Feb 2014
14 min read
Controlling appearance

An ADF Faces application is a modern web application, so the technology used for controlling the look of the application is Cascading Style Sheets (CSS). The idea behind CSS is that the web page (in HTML) should contain only the structure and not information about appearance. All of the visual definitions must be kept in the style sheet, and the HTML file must refer to the style sheet. This means that the same web page can be made to look completely different by applying a different style sheet to it.

The Cascading Style Sheets basics

In order to change the appearance of your application, you need to understand some CSS basics. If you have never worked with CSS before, you should start by reading one of the many CSS tutorials available on the Internet. To start with, let's repeat some of the basics of CSS.

CSS layout instructions are written in the form of rules. Each rule is in the following form:

    selector { property: value; }

The selector identifies which part of the web page the rule applies to, and the property/value pairs define the styling to be applied to the selected parts. For example, the following rule defines that all <h1> elements should be shown in red font:

    h1 { color: red; }

One rule can include multiple selectors separated by commas, and multiple property values separated by semicolons. Therefore, it is also valid CSS to write the following line of code to get all the <h1>, <h2>, and <h3> tags shown in large, red font:

    h1, h2, h3 { color: red; font-size: x-large; }

If you want to apply a style with more precision than just every level 1 header, you define a style class, which is just a selector starting with a period, as shown in the following line of code:

    .important { color: red; font-weight: bold }

To use this selector in your HTML code, you use the keyword class inside an HTML tag. There are three ways of using a style class:

- Inside an existing tag, such as <h1>
- Inside the special <span> tag to style text within a paragraph
- Inside a <div> tag to style a whole paragraph of text

Here are examples of all three ways:

    <h1 class="important">Important topic</h1>
    You <span class="important">must</span> remember this.
    <div class="important">Important tip</div>

In theory, you can place your styling information directly in your HTML document using the <style> tag. In practice, however, you usually place your CSS instructions in a separate .css file and refer to it from your HTML file with a <link> tag, as shown in the following line of code:

    <link href="mystyle.css" rel="stylesheet" type="text/css">

Styling individual components

The preceding examples apply to HTML elements, but styling can also be applied to JSF components. A plain JSF component could look like the following code with inline styling:

    <h:outputFormat value="hello" style="color:red;"/>

It can also look like the following line of code, using a style class:

    <h:outputFormat value="hello" styleClass="important"/>

ADF components use the inlineStyle attribute instead of just style, as shown in the following line of code:

    <af:outputFormat value="hello" inlineStyle="color:red;"/>

The styleClass attribute is the same, as shown in the following line of code:

    <af:outputFormat value="hello" styleClass="important"/>

Of course, you normally won't set these attributes in the source code, but will use the StyleClass and InlineStyle properties in the Property Inspector instead. In both HTML and JSF, you should only use StyleClass, so that multiple components can refer to the same style class and will reflect any change made to the style. InlineStyle is rarely used in real-life ADF applications; it adds to the page size (the same styling is sent for every styled element), and it is almost impossible to ensure that every occurrence is changed when the styling requirements change, as they will.

Building a style

While you are working out the styles you need in your application, you can use the Style section in the JDeveloper Properties window to define the look of your page, as shown in the following screenshot. This section shows six small subtabs with icons for font, background, border/outline, layout, table/list, and media/animation. If you enter or select a value on any of these tabs, the value will be placed into the InlineStyle field as correctly formatted CSS.

When your items look the way you want, copy the value from the InlineStyle field to a style class in your CSS file and set the StyleClass property to point to that class. If the style discussed earlier is the styling you want for a highlighted label, create a section in your CSS file, as shown in the following code:

    .highlight {background-color:blue;}

Then clear the InlineStyle property and set the StyleClass property to highlight. Once you have placed a style class into your CSS file, you can use it to style other components in exactly the same way by simply setting the StyleClass property. We'll be building the actual CSS file where you define these style classes.

InlineStyle and ContentStyle

Some JSF components (for example, outputText) are easy to style: if you set the font color, you'll see it take effect in the JDeveloper design view and in your application, as shown in the following screenshot.

Other elements (for example, inputText) are harder to style. For example, if you want to change the background color of the input field, you might try setting the background color, as shown in the following screenshot. You will notice that this does not work the way you reasonably expected: the background behind both the label and the actual input field changes. The reason for this is that an inputText component actually consists of several HTML elements, and an inline style applies to the outermost element. In this case, the outermost element is an HTML <tr> (table row) tag, so the green background color applies to the entire row.

To help mitigate this problem, ADF offers another styling option for some components: ContentStyle. If you set this property, ADF tries to apply the style to the content of a component; in the case of an inputText, ContentStyle applies to the actual input field, as shown in the following screenshot. In a similar manner, you can apply styling to the label for an element by setting the LabelStyle property.

Unravelling the mysteries of CSS styling

As you saw in the Input Text example, ADF components can be quite complex, and it's not always easy to figure out which element to style to achieve the desired result. To be able to see into the complex HTML that ADF builds for you, you need a support tool such as Firebug. Firebug is a Firefox extension that you can download by navigating to Tools | Add-ons from within Firefox, or you can go to http://getfirebug.com.

When you have installed Firebug, you will see a little Firebug icon at the far right of your Firefox window, as shown in the following screenshot. When you click on the icon to start Firebug, you'll see it take up the lower half of your Firefox browser window.

Only run Firebug when you need it: Firebug's detailed analysis of every page costs processing power and slows your browser down. Run Firebug only when you need it, and remember to deactivate Firebug, not just hide it.

If you click on the Inspect button (with a little blue arrow, second from the left in the Firebug toolbar), you place Firebug in inspect mode. You can now point to any element on a page and see both the HTML element and the style applied to this element, as shown in the following screenshot. In the preceding example, the pointer is placed on the label for an input text, and the Firebug panels show that this element is styled with color: #0000FF. If you scroll down in the Style panel on the right-hand side, you can see other attributes such as font-family: Tahoma, font-size: 11px, and so on.

In order to keep the size of the HTML page smaller so that it loads faster, ADF abbreviates all the style class names to cryptic short names such as .x10. While you are styling your application, you don't want this abbreviation to happen. To turn it off, you need to open the web.xml file (in your View project under Web Content | WEB-INF). Change to the Overview tab if it is not already shown, and select the Application subtab, as shown in the following screenshot. Under Context Initialization Parameters, add a new parameter, as shown:

    Name: org.apache.myfaces.trinidad.DISABLE_CONTENT_COMPRESSION
    Value: true

When you do this, you'll see the full human-readable style names in Firebug, as shown in the following screenshot. You will notice that you now get more readable names such as .af_outputLabel. You might need this information when developing your custom skin.

Conditional formatting

Like many other properties, the style properties do not have to be set to a fixed value; you can also set them to any valid expression written in Expression Language (EL). This can be used to create conditional formatting. In the simplest form, you can use an Expression Language ternary operator, which has the form <boolean expression> ? <value if true> : <value if false>. For example, you could set StyleClass to the following line of code:

    #{bindings.Job.inputValue eq 'MANAGER' ? 'managerStyle' : 'nonManagerStyle'}

The preceding expression means that if the value of the Job attribute is equal to MANAGER, the managerStyle style class is used; if not, the nonManagerStyle style class is used. Of course, this only works if these two styles exist in your CSS.

Skinning overview

An ADF skin is a collection of files that together define the look and feel of the application. To a hunter, skinning is the process of removing the skin from an animal, but to an ADF developer it's the process of putting a skin onto an application. All applications have a skin; if you don't change it, an application built with JDeveloper 12c uses some variation of the skyros skin. When you define a custom skin, you must also choose a parent skin among the skins JDeveloper offers. This parent skin will define the look of all the components not explicitly defined in your skin.

Skinning capabilities

There are a few options to change the style of individual components through their properties. However, with your own custom ADF skin, you can globally change almost every visual aspect of every instance of a specific component. To see skinning in action, you can go to http://jdevadf.oracle.com/adf-richclient-demo. This site is a demonstration of lots of ADF features and components, and if you choose the Skinning header in the accordion to the right, you are presented with a tree of skinnable components, as shown in the following screenshot.

You can click on each component to see a page where you can experiment with various ways of skinning the component. For example, you can select the very common InputText component to see a page with various representations of the input text component. On the left-hand side, you see a number of Style Selectors that are relevant for that component. For each selector, you can check the checkbox to see an example of what the component looks like if you change that selector. In the following example, the af|inputText:disabled::content selector is checked, setting its style to color: #00C0C0, as shown in the following screenshot.

As you might be able to deduce from the af|inputText:disabled::content style selector, this controls what the content field of the input text component looks like when it is set to disabled; in the demo application, it is set to a bluish color with the color code #00C0C0.

The example application shows various values for the selectors but doesn't really explain them. The full documentation of all the selectors can be found online at http://jdevadf.oracle.com/adf-richclient-demo/docs/skin-selectors.html. If it's not there, search for ADF skinning selectors. On the menu in the demo application, you will also find a Skin menu that you can use to select and test all the built-in skins. This application can also be downloaded and run on your own server; it can be found on the ADF download page at http://www.oracle.com/technetwork/developer-tools/adf/downloads/index.html, as shown in the following screenshot.

Skinning recommendations

If your graphics designer has produced sample screens showing what the application must look like, you need to find out which components you will use to implement the required look and define the look of these components in your skin. If you don't have a detailed guideline from a graphics designer, look for guidelines in your organization; you probably have web design guidelines for your public-facing website and/or intranet. If you don't have any graphics guidelines, create a skin, as described later in this section, and choose to inherit from the latest skyros skin provided by Oracle. However, don't change anything; leave the CSS file empty. If you are a programmer, you are unlikely to be able to improve on the look that the professional graphics designers at Oracle in Redwood Shores have created.

The skinning process

The skinning process in ADF consists of the following steps:

1. Create a skin CSS file.
2. Optionally, provide images for your skin.
3. Optionally, create a resource bundle for your skin.
4. Package the skin in an ADF library.
5. Import and use the skin in the application.

In JDeveloper 11g Release 1 (11.1.1.x) and earlier versions, this was very much a manual process. Fortunately, from 11g Release 2 onwards, JDeveloper has a built-in skinning editor.

Stand-alone skinning

If you are running JDeveloper 11g Release 1, don't despair. Oracle makes a stand-alone skinning editor available, containing the same functionality that is built into the later versions of JDeveloper. You can give this tool to your graphic designers and let them build the skin ADF library without having to give them the complete JDeveloper product. ADF skinning is a huge topic, and Oracle delivers a whole manual that describes skinning in complete detail. This document is called Creating Skins with Oracle ADF Skin Editor and can be found at http://docs.oracle.com/middleware/1212/skineditor/ADFSG.

Creating a skin project

You should place your skin in a common workspace. If you are using the modular architecture, you create your skin in the application common workspace. If you are using the enterprise architecture, you create your enterprise common skin in the application common workspace and then possibly create application-specific adaptations of that skin in the application common workspaces.

The skin should be placed in its own project in the common workspace. Logically, it could be placed in the common UI workspace, but because the skin will often receive many small changes during certain phases of the project, it makes sense to keep it separate. Remember that changing the skin only affects the visual aspects of the application, while changing the page templates could conceivably change the functionality. By keeping the skin in a separate ADF library, you can be sure that you do not need to perform regression testing on the application functionality after deploying a new skin.

To create your skin, open the common workspace and navigate to File | New | Project. Choose the ADF ViewController project, give the project the name CommonSkin, and set the package to your application's package prefix followed by .skin (for example, com.dmcsol.xdm.skin).

Securing QlikView Documents

Packt
16 Jan 2014
6 min read
(For more resources related to this topic, see here.) Making documents available to the correct users can be handled in several different ways, depending on your implementation and license structure. These methods are not mutually exclusive and you may choose to implement a combination of them. By license If you only have named Document users, you can restrict access to documents by simply not granting users a license. If users do not have a Document license for a particular document, they may be able to see that document in AccessPoint, but will not be able to open it. You will need to turn off any automatic allocation of licenses for both Document licenses and Named User licenses, or the system will simply override your security by allocating an available license and giving the user access to that document. This only works for Document license users. The Named User license holders can't be locked out of a document this way as they have a license that allows them to open any number of documents, so they cannot be restricted. The fact that this is user based—a Document license can only be granted to a user, not a group—also means that there is no option to secure by a named group. This is the most basic, least flexible, and least user-friendly way to implement security. While it will certainly stop users getting access to documents—and it will work in either NTFS or DMS security modes—it can be frustrating for users to see a document that they think can open, but for which they will get a NO CAL error when they try to open it. The QlikView file will also need to have appropriate NTFS or DMS security so that users would be able to access it. The easiest way to do this is to grant access to a group that all the users will be in, or even allow access to an Authenticated Users group. Section Access Section Access security is a very effective way of securing a document to the correct set of users. This is because a user must be actually listed in the Section Access user list for the document to be even listed in AccessPoint for them. Additionally, if Section Access is in place, a user cannot even connect by using a direct access URL because they have no security access to the data. This method of securing documents works well in both NTFS and DMS security modes. When using the NTLM (Windows authentication via Internet Explorer) authentication method, you can have Group Names listed in Section Access. However, when using alternative authentication, Section Access does not give us an option to secure by group. As with the license method discussed earlier, appropriate file security needs to be in place in order to allow all the users access the QlikView file. NTFS Access Control List (ACL) NTFS (Microsoft's NT File System) security is the default method of securing access to files in a QlikView implementation. It works very well for installations where all the users are Windows users within the same domain or a set of trusted domains. In NTFS security mode, the Access Control List (ACL) of the QlikView file is used to list the documents for a particular user. This is a very straightforward way of securing access and will be very familiar to Windows system administrators. As with normal Windows file security, the security can be applied at the folder level. Windows security groups can also be used. Between groups and folder security, very flexible levels of security can be applied. 
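As a purely illustrative sketch of the folder-level approach (the drive, folder, and group names below are invented), granting a Windows security group read access to a folder of QlikView documents from an elevated command prompt might look like this:

rem Example only: give a reader group read access, inherited by subfolders and files
icacls "D:\QlikView\UserDocuments\Finance" /grant "MYDOMAIN\QV-Finance-Readers:(OI)(CI)R"

Combined with group membership, this is how the flexible folder-level security described above is typically expressed.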
By default, Internet Explorer and Google Chrome will pass through the Kerberos/NTLM credentials to sites in the local Intranet zone. For other browsers, such as Safari on the iPad, the user will be prompted for a username and password. When a user connects to AccessPoint and their credentials are established, they are compared against the ACLs for all the files hosted by QVS. Only those files that the user has access to—either directly by name or by group membership—will be listed in AccessPoint. Document Metadata Service (DMS) For non-Windows users, QlikView provides a way of managing user access to files called the Document Metadata Service (DMS). DMS uses a .META file in the same folder as the .QVW file to store the Access Control List. The Windows ACL, which has permissions on the file, now becomes mostly irrelevant as it is not used to authenticate users. It is only the QlikView service account that will need access to the file. It is a binary choice between using NTFS or DMS security on any one QlikView Server. Enabling DMS To enable DMS, we need to make a change to the server configuration. In the QlikView Management Console, on the Security tab of the QVS settings screen, we change Authorization to DMS authorization and then click on the Apply button. The QlikView Server service will need to be restarted for this change to take effect. Once the service has restarted, a new tab, Authorization, becomes available in the document properties: Clicking on the + button to the right of this tab allows you to enter new details of Access, User Type, and specific Users and Groups. Access is either set to Always or Restricted. When Access is set to Always, the associated user or group will have access at any time. If it is set to Restricted, you can specify a time range and specific days when the user or group has access. You can keep clicking on the + button to add as many sets of restricted times as needed for a user or group. The restrictions are additive; that is, if the user only has access on Monday and Tuesday in one group of restrictions, and then Thursday and Friday in another set of restrictions, they will therefore, have access on all four days. The User Type is one of the following categories: User Type Details All Users Essentially no security. Any user, including anonymous, who can access the server will be able to access the file. All Authenticated Users For most implementations, this will also be All Users. However, it will not give access to anonymous users. The Section Access would typically be used to manage the security. Named Users This allows you to specify a list of named users and/or groups that will have specific access to the document. If Named Users is selected, a Manage Users button will appear that allows you to specify users and/or groups. Summary In this article, we have looked at several ways of securing QlikView Documents—by license, using Section Access, utilizing NTFS ACLs, and implementing QlikView's DMS authorization. Resources for Article: Further resources on this subject: Common QlikView script errors [Article] Introducing QlikView elements [Article] Meet QlikView [Article]

Aspects of Data Manipulation in R

Packt
10 Jan 2014
6 min read
(For more resources related to this topic, see here.) Factor variables in R In any data analysis task, the majority of the time is dedicated to data cleaning and pre-processing. Sometimes, it is considered that about 80 percent of the effort is devoted for data cleaning before conducting actual analysis. Also, in real world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, this is known as a factor. Working with categorical variables in R is a bit technical, and in this article we have tried to demystify this process of dealing with categorical variables. During data analysis, sometimes the factor variable plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. Split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data it is always preferable to perform the operation within subgroup of a dataset to speed up the process. In R this type of data manipulation could be done with base functionality, but for large-scale data it requires considerable amount of coding and eventually it takes a longer time to process. In case of Big Data we could split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient and to overcome this limitation, Wickham developed an R package plyr where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we could compare this with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. 
A dataset might come with different levels of grouping and we need to implement some reorientation to perform certain types of analyses. Datasets layout could be long or wide. In long-layout, multiple rows represent a single subject's record, whereas in wide-layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset and a column represents the information for each characteristic for all subjects. This means rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier and others are measured characteristics. For the purpose of reshaping, we could group the variables into two groups: identifier variables and measured variables. Identifier variables: These help to identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. Measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mix of both. Now beyond this typical structure of dataset, we could think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Summary This article explains briefly about factor variables, the split-apply-combine strategy, and reshaping a dataset in R. Resources for Article: Further resources on this subject: SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article] Working with Data Application Components in SQL Server 2008 R2 [Article] Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]
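To make these ideas concrete, here is a small self-contained R sketch (the data values and cut points are invented for illustration) showing cut for turning a numeric variable into a factor, plyr's split-apply-combine style ddply, and melt/dcast from reshape2 for moving between wide and long layouts:

# Creating a factor from a numeric variable with cut()
age <- c(12, 25, 37, 48, 61)
age_group <- cut(age, breaks = c(0, 18, 40, 65),
                 labels = c("young", "adult", "senior"))
table(age_group)

# Reshaping with reshape2: wide -> long (melt), long -> wide (dcast)
library(reshape2)
wide <- data.frame(id = 1:3, height = c(160, 175, 182), weight = c(55, 70, 80))
long <- melt(wide, id.vars = "id")      # molten data: id, variable, value
back_to_wide <- dcast(long, id ~ variable)

# Split-apply-combine with plyr: mean value per measured variable
library(plyr)
ddply(long, "variable", summarise, mean_value = mean(value))

Note how the molten data frame contains only the identifier column plus the variable and value columns, which is exactly the long layout described above.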

Meet QlikView

Packt
02 Jan 2014
8 min read
(For more resources related to this topic, see here.) What is QlikView? QlikView is a software tool developed by QlikTech, a company that was founded in Sweden in 1993 but is now headquartered in the United States. QlikView is a tool used for Business Intelligence, commonly abbreviated as BI. Business Intelligence is defined by Gartner, a leading industry analyst firm, as: A general term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decision making and performance. Following this definition, QlikView is a tool that provides access to information and makes it possible to analyze that data, which in turn improves and optimizes business decision making and, as a consequence, business performance. Historically, Business Intelligence has been driven mainly by companies' Information Technology departments. IT departments were responsible for the entire life cycle of a Business Intelligence solution, from extracting the data to delivering the final reports, analyses, and dashboards. Although this model works well for distributing predefined static reports, most companies have come to realize that it does not meet the needs of their business users. Because IT keeps close control over the data and the tools, users often experience long waiting times whenever new business questions come up that cannot be answered with the standard reports. How does QlikView differ from traditional BI tools? QlikTech prides itself on approaching Business Intelligence in a way that differs from what companies such as Oracle, SAP, and IBM (described by QlikTech as traditional BI vendors) offer. QlikTech aims to put the tools in the hands of business users, allowing them to be self-sufficient, since they can then perform their own analyses. Independent industry analyst firms have also taken note of this different approach. In 2011, Gartner created a subcategory for Data Discovery tools in its annual market evaluation, the Magic Quadrant for Business Intelligence platforms. QlikView was the standard-bearer in this new category of BI tools. QlikTech prefers to describe its product as a Business Discovery tool rather than a Data Discovery tool. It maintains that discovering things about the business is far more important than discovering data. The following diagram illustrates this paradigm. Source: QlikTech Besides the difference in who uses the tool (IT users versus business users), there are some other capabilities that set QlikView apart from other solutions. Associative user experience The main difference between QlikView and other BI solutions is the associative user experience. Whereas traditional BI solutions use predefined paths to navigate and explore data, QlikView lets users take whatever route they want to carry out their analysis. The result is a much more intuitive way of navigating the data. QlikTech describes this as "working the way the human mind works". An example is shown in the following image.
While in a typical BI solution we would have to start by selecting a Region and then step through the defined hierarchical path, in QlikView we can choose any entry point we want: Region, State, Product, or Salesperson. As we navigate the data, we are presented only with the information related to our selection and, for our next selection, we can take whatever path we want. Navigation is infinitely flexible. In addition, the QlikView user interface lets us see the data that is associated with our selection. For example, the following screenshot (from the QlikTech demo document called What's New in QlikView 11) shows a QlikView dashboard in which two values are selected. In the Quarter field, the value Q3 is selected, and in the Sales Reps field, Cart Lynch is selected. We can see this because the corresponding values are shown in green, which means that those values have been selected. When a selection is made, the interface automatically updates not only to show the data that is associated with this new query, but also the data that is not associated with the selection. Associated data appears with a white background, while non-associated data has a gray background. Sometimes the associations can be fairly obvious; it is no surprise that the third quarter of the year is associated with the months of July, August, and September. At other times, however, we come across associations that are not so obvious, for example, that Cart Lynch has not sold any products in Germany or Spain. This extra information, which is not visible in traditional BI tools, can be of great value because it offers a new starting point for data exploration. Technology QlikView's main technological differentiator is that it uses an in-memory data model, meaning that all the information the user interacts with is stored in RAM instead of on disk. Because RAM is much faster than disk, response times are very fast, which makes for a very fluid user experience. In a later section of this chapter, we will look a little more closely at the technology behind QlikView. Adoption There is another difference between QlikView and other traditional BI solutions, and it lies in the way QlikView is implemented within a company. While traditional BI solutions are typically implemented top-down (IT selects a BI tool for the whole company), QlikView usually takes a bottom-up adoption path. Business users in a single department deploy it locally, and its use spreads from there. QlikView can be downloaded free of charge for personal use. This version is called QlikView Personal Edition, or PE. Documents created in QlikView Personal Edition can be opened by users with a full license of the software or published via QlikView Server. The limitation is that, apart from some documents enabled by QlikTech for PE, a QlikView Personal Edition user cannot open documents created by another user or on another computer; sometimes they cannot even open their own documents if those documents were opened and saved by another user or by a server instance.
Often, a business user will decide to download QlikView to see whether it can solve a business problem. When other users in the department see the software, they become increasingly enthusiastic about the tool, and each of them downloads the program. To be able to share documents, they decide to purchase a few licenses for the department. Then other departments start to take notice as well, and QlikView gains traction within the organization. Before long, IT and company management also take notice, which eventually leads to QlikView being adopted across the whole enterprise. QlikView facilitates every step of this process, scaling from a deployment on a single personal computer up to organization-wide deployments with thousands of users. The following image illustrates this growth within an organization: As QlikView's popularity and track record within the organization grow, it gains more and more visibility at the enterprise level. Although the adoption path described above is probably the most common scenario, it is no longer unusual for a company to opt for a top-down, enterprise-wide QlikView implementation from the outset.

Using Redis in a hostile environment (Advanced)

Packt
27 Dec 2013
12 min read
(For more resources related to this topic, see here.) How to do it... Anyone who can read the files that Redis uses to persist your dataset has a full copy of all your data. Worse, anyone who can write to those files can, with a minimal amount of effort and some patience, change the data that your Redis server contains. Both of these things are probably not what you want, and thankfully it isn't particularly difficult to prevent. All you have to do is prevent anyone but the user running your Redis server from accessing the data directory your Redis instance is using. The simplest way to achieve this is by changing the owner of the directory to the user who runs your Redis server, and then disallow all privileges to everyone else, like this: Determine the user under whom you are running your Redis instance. You can typically find this out by running ps caux |grep redis-server. The name in the first column is the user under which Redis is running. Determine the directory in which Redis is storing its files. If you don't already know this, you can ask Redis by running CONFIG GET dir from within redis-cli. Ensure that the user running your Redis instance owns its data directory: chown <redisuser> /path/to/redis/datadir Restrict permissions on the data directory so that only the owner can access it at all: chmod 0700 /path/to/redis/datadir It is important that you protect the Redis data directory, and not individual data files, because Redis is regularly rewriting those data files, and the permissions you choose won't necessarily be preserved on the next rewrite. It is also a good practice to restrict access to your redis.conf file, because in some cases it can contain sensitive data. This is simply achieved: chmod 0600 /path/to/redis.conf If you run your Redis using applications on a server which is shared with other people, your Redis instance is at pretty serious risk of abuse. The most common way of connecting to Redis is via TCP, which can only limit access based on the address connecting to it. On a shared server, that address is shared amongst everyone using it, so anyone else on the same server as you can connect to your Redis. Not cool! If, however, the programs that need to access your Redis server are on the same machine as the Redis server, there is another, more secure, method of connection called Unix sockets. A Unix socket looks like a file on disk, and you can control its permissions just like a file, but Redis can listen on it (and clients can connect to it), in a very similar way to a TCP socket. Enabling Redis to listen on a Unix socket is fairly straightforward: Set the port parameter to 0 in your Redis configuration file. This will tell Redis to not listen on a TCP socket. This is very important to prevent miscreants from still being able to connect to your Redis server while you're happily using a Unix socket. Set the unixsocket parameter in your Redis configuration file to a fully-qualified filename where you want the socket to exist. If your Redis server runs as the same user as your client programs (which is common in shared-hosting situations), I recommend making the name of the file redis.sock, in the same directory as your Redis dataset. So, if you keep your Redis data in /home/joe/redis, set unixsocket to /home/joe/redis/redis.sock. Set the unixsocketperm parameter in your Redis configuration file to 600, or a more relaxed permission set if you know what you're doing. Again, this assumes that your Redis server and Redis-using programs are running as the same user. 
If they're not, you'll probably need a dedicated group and things get a lot more complicated—and beyond the scope of what can be covered in this guide. Once you've changed those configuration parameters and restarted Redis, you should find that the file you specified for unixsocket has magically appeared, and you can no longer connect to Redis using TCP. All that remains to do now is to configure your Redis-using programs to connect using the Unix socket, which is something you should find how to do in the manual for your particular Redis client library or application. Configuring Redis to use Unix sockets is all well and good when it's practical, but what about if you need to connect to Redis over a network? In that case, you'll need to let Redis listen on a TCP socket, but you should at least limit the computers that can connect to it with a suitable firewall configuration. While the properly paranoid systems administrator runs their systems with a default deny firewalling policy, not everyone shares this philosophy. However, given that by default, anyone who can connect to your Redis server can do anything they want with it, you should definitely configure a firewall on your Redis servers to limit incoming TCP connections to those which are coming from machines that have a legitimate need to talk to your Redis server. While it won't protect you from all attacks, it will cut down significantly on the attack surface, which is an important part of a defense-in-depth security strategy. Unfortunately, it is hard to give precise commands to configure a firewall ruleset, because there are so many firewall management tools in common use on systems today. In the interest of addressing the greatest common factor, though, I'll provide a set of Linux iptables rules, which should be translatable to whatever means of managing your firewall (whether it be an iptables wrapper of some sort on Linux, or a pf-based system on a BSD). In all of the following commands, replace the word <port> with the TCP port that your Redis server listens on. Also, note that these commands will temporarily stop all traffic to your Redis instance, so you'll want to avoid doing this on a live server. Setting up your firewall in an init script is the best course of action. Insert a rule that will drop all traffic to your Redis server port by default: iptables -I INPUT -p tcp --dport <port> -j DROP For each IP address you want to allow to connect, run these two commands to let the traffic in: iptables -I INPUT -p tcp --dport <port> -s <clientIP> -j ACCEPT iptables -I OUTPUT -p tcp --sport <port> -d <clientIP> -j ACCEPT A firewall is great, but sometimes you can't trust everyone with access to a machine that needs to talk to your Redis instance. In that case, you can use authentication to provide a limited amount of protection against miscreants: Select a very strong password. Redis is not hardened against repeated password guessing, so you want to make this very long and very random. If you make the password too short, an attacker can just write a program that tries every possible password very quickly and guess your password that way, not cool! Thankfully, since humans should rarely be typing this password, it can be a complete jumble, and very long. I like the command pwgen -sy 32 1 for all my "generating very strong password" needs. 
Configure all your clients to authenticate against the server, by sending the following command when they first connect to the server: AUTH <password> Edit your Redis configuration file to include a line like this: requirepass "\\:d!&!:Y<p'TXBI0\"ys96rfH]lxaA7|E" If your selected password contains any double-quotes, you'll need to escape them with a backslash (so " would become \"), as I've done in the preceding example. You'll also need to double any actual backslashes (so \ becomes \\), again as I've done in the password of the preceding example. Let the configuration changes take effect by restarting Redis. The authentication password cannot be changed at runtime. If you don't need certain commands, or want to limit the use of certain commands to a subset of clients, you can use the rename-command configuration parameter. Like firewalling, restricting or disabling commands reduces your attack surface, but is not a panacea. The simplest solution to the risk of a dangerous command is to disable it. For example, if you want to stop anyone from accidentally (or deliberately) nuking all the data in your Redis server with a single command, you might decide to disable the FLUSHDB and FLUSHALL commands, by putting the following in your Redis config file:
rename-command FLUSHDB ""
rename-command FLUSHALL ""
This doesn't stop someone from enumerating all the keys in your dataset with KEYS * and then deleting them all one-by-one, but it does raise the bar somewhat. If you never wanted to delete keys (but, say, only let them expire) you could disable the DEL command; although all that would probably do is encourage the wily cracker to enumerate all your keys and run PEXPIRE 1 over them. Arms races are a terrible thing... While disabling commands entirely is great when it can be done, you sometimes need a particular command, but you'd prefer not to give access to it to absolutely everyone—commands that can cause serious problems if misused, such as CONFIG. For those cases, you can rename the command to something hard to guess, as shown in the following line: rename-command CONFIG somegiantstringnobodywouldguess It's important not to make the new name of the command something easy to guess. As with the AUTH command, which we discussed previously, someone who wanted to do bad things could easily write a program to repeatedly guess what you've renamed your commands to. For any environment in which you can't trust the network (which these days is pretty much everywhere, thanks to the NSA and the Cloud), it is important to consider the possibility of someone watching all your data as it goes over the wire. There's little point configuring authentication, or renaming commands, if an attacker can watch all your data and commands flow back and forth. The least-worst option we have for generically securing network traffic from eavesdropping is still the Secure Sockets Layer (SSL). Redis doesn't support SSL natively; however, through the magic of the stunnel program, we can create a secure tunnel between Redis clients and servers. The setup we will build will look like the following diagram: In order to set this up, you'll need to do the following: In your redis.conf, ensure that Redis is only listening on 127.0.0.1, by setting the bind parameter: bind 127.0.0.1 Create a private key and certificate, which stunnel will use to secure the network communications.
First, create a private key and a certificate request, by running: openssl req -out /etc/ssl/redis.csr -keyout /etc/ssl/redis.key -nodes -newkey rsa:2048 This will ask you all sorts of questions which you can answer with whatever you like. Create the self-signed certificate itself, by running: openssl x509 -req -days 3650 -signkey /etc/ssl/redis.key -in /etc/ssl/redis.csr -out /etc/ssl/redis.crt Finally, stunnel expects to find the private key and the certificate in the same file, so we'll concatenate the two together into one file: cat /etc/ssl/redis.key /etc/ssl/redis.crt >/etc/ssl/redis.pem Now, we've got our SSL keys, we can start stunnel on the server side, configuring it to listen out for SSL connections, and forward them to our local Redis server: stunnel -d 46379 -r 6379 -p /etc/ssl/redis.pem If your local Redis instance isn't listening on port 6379, or if you'd like to change the public port that stunnel listens on, you can, of course, adjust the preceding command line to suit. Also, don't forget to open up your firewall for the port you're listening on! Once you run the preceding command, you should be returned to a command line pretty quickly, because stunnel runs in the background. Although you examine your listening ports with netstat -ltn, you will still find that port 46379 is listening. If that's the case, you're done configuring the server. On the client(s), the process is somewhat simpler, because you don't have to create a whole new key pair. However, you do need the certificate from the server, because you want to be able to verify that you're connecting to the right SSL-enabled service. There's little point in using SSL if an attacker can just set up a fake SSL service and trick you into connecting to it. To set up the client, do the following: Copy /etc/ssl/redis.crt from the server to the same location on the client. Start stunnel on the client, as shown in the following code snippet: stunnel -c -v 3 -A /etc/ssl/redis.crt -d 127.0.0.1:56379 -r 192.0.2.42:46379 Replace 192.0.2.42 with the IP address of your Redis server. Verify that stunnel is listening correctly by running netstat -ltn, and look for something listening on port 56379. Reconfigure your client to connect to 127.0.0.1:56379, rather than directly to the remote Redis server. Summary This article contains an assortment of quick enhancements that you can deploy to your systems to protect them from various threats, which are frequently encountered on the Internet today. Resources for Article: Further resources on this subject: Implementing persistence in Redis (Intermediate) [Article] Python Text Processing with NLTK: Storing Frequency Distributions in Redis [Article] Coding for the Real-time Web [Article]

Implementing the Naïve Bayes classifier in Mahout

Packt
26 Dec 2013
21 min read
(for more resources related to this topic, see here.) Bayes was a Presbyterian priest who died giving his "Tractatus Logicus" to the prints in 1795. The interesting fact is that we had to wait a whole century for the Boolean calculus before Bayes' work came to light in the scientific community. The corpus of Bayes' study was conditional probability. Without entering too much into mathematical theory, we define conditional probability as the probability of an event that depends on the outcome of another event. In this article, we are dealing with a particular type of algorithm, a classifier algorithm. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table: Outlook Temperature Temperature Humidity Humidity Windy Play Numeric Nominal Numeric Nominal Overcast 83 Hot 86 High FALSE Yes Overcast 64 Cool 65 Normal TRUE Yes Overcast 72 Mild 90 High TRUE Yes Overcast 81 Hot 75 Normal FALSE Yes Rainy 70 Mild 96 High FALSE Yes Rainy 68 Cool 80 Normal FALSE Yes Rainy 65 Cool 70 Normal TRUE No Rainy 75 Mild 80 Normal FALSE Yes Rainy 71 Mild 91 High TRUE No Sunny 85 Hot 85 High FALSE No Sunny 80 Hot 90 High TRUE No Sunny 72 Mild 95 High FALSE No Sunny 69 Cool 70 Normal FALSE Yes Sunny 75 Mild 70 Normal TRUE Yes The table itself is composed of a set of 14 observations consisting of 7 different categories: temperature (numeric), temperature (nominal), humidity (numeric), and so on. The classifier takes some of the observations to train the algorithm and some as testing it, to create a decision for a new observation that is not contained in the original dataset. There are many types of classifiers that can do this kind of job. The classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier uses the assumption that the fact, on observation, belongs to a particular category and is independent from belonging to any other category. Other types of classifiers present in Mahout are the logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information. This page is updated with the algorithm type, actual integration in Mahout, and other useful information. Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses the conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated to the other weight. In this article's recipes, we will use the same algorithm provided by the Mahout example source code that uses the Naïve Bayes classifier to find the relation between works of a set of documents. Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to instruct the algorithm on the relation it needs to find. The testing set is used to test the algorithm using some unrelated input. Let us now get a first-hand taste of how to use the Naïve Bayes classifier. 
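For reference, the standard textbook statement of conditional probability and of the naïve independence assumption that gives the classifier its name can be written as follows (this formulation is added here for clarity; it is not taken verbatim from the recipe):

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

Applied to the weather table above, the classifier would compute this product for Play = Yes and for Play = No, given a new observation's attribute values, and assign the observation to whichever category scores higher.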
Using the Mahout text classifier to demonstrate the basic use case The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout dataset. We will use this dataset for testing or coding. Basically, the code is nothing more than following the Mahout ready-to-use script with the corrected parameter and the path settings done. This recipe will describe how to transform the raw text files into weight vectors that are needed by the Naïve Bayes algorithm to create the model. The steps involved are the following: Converting the raw text file into a sequence file Creating vector files from the sequence files Creating our working vectors Getting ready The first step is to download the datasets. The dataset is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2. The dataset contains a post of 20 newsgroups dumped in a text file for the purpose of machine learning. Anyway, we could have also used other documents for testing purposes, but we will suggest how to do this later in the recipe. Before proceeding, in the command line, we need to set up the working folder where we decompress the original archive to have shorter commands when we need to insert the full path of the folder. In our case, the working folder is /mnt/new; so, our working folder's command-line variables will be set using the following command: export WORK_DIR=/mnt/new/ You can create a new folder and change the WORK_DIR bash variable accordingly. Do not forget that to have these examples running, you need to run the various commands with a user that has the HADOOP_HOME and MAHOUT_HOME variables in its path. To download the dataset, we only need to open up a terminal console and give the following command: wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz Once your working dataset is downloaded, decompress it using the following command: tar –xvzf 20news-bydate.tar.gz You should see the folder structure as shown in the following screenshot: The second step is to sequence the whole input file to transform them into Hadoop sequence files. To do this, you need to transform the two folders into a single one. However, this is only a pedagogical passage, but if you have multiple files containing the input texts, you could parse them separately by invoking the command multiple times. Using the console command, we can group them together as a whole by giving the following command in sequence: rm -rf ${WORK_DIR}/20news-all mkdir ${WORK_DIR}/20news-all cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all Now, we should have our input folder, which is the 20news-all folder, ready to be used: The following screenshot shows a bunch of files, all in the same folder: By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows: From: xxx Subject: yyyyy Organization: zzzz X-Newsreader: rusnews v1.02 Lines: 50 jaeger@xxx (xxx) writes: >In article xxx writes: >>zzzz "How BCCI adapted the Koran rules of banking". The >>Times. August 13, 1991. > > So, let's see. If some guy writes a piece with a title that implies > something is the case then it must be so, is that it? We obviously removed the e-mail address, but you can open this file to see its content. 
For any newsgroup of 20 news items that are present on the dataset, we have a number of files, each of them containing a single post to a newsgroup without categorization. Following our initial tasks, we need to now transform all these files into Hadoop sequence files. To do this, you need to just type the following command: ./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq This command brings every file contained in the 20news-all folder and transforms them into a sequence file. As you can see, the number of corresponding sequence files is not one to one with the number of input files. In our case, the generated sequence files from the original 15417 text files are just one chunck-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command: ./mahout seqdirectory --help The following table describes the various options that can be used with the seqdirectory command: Parameter Description --input (-i) input his gives the path to the job input directory. --output (-o) output The directory pathname for the output. --overwrite (-ow) If present, overwrite the output directory before running the job. --method (-xm) method The execution method to use: sequential or mapreduce. The default is mapreduce. --chunkSize (-chunk) chunkSize The chunkSize values in megabyte. The default is 64 Mb. --fileFilterClass (-filter) fileFilterClass The name of the class to use for file parsing.The default is org.apache.mahout.text.PrefixAdditionFilter. --keyPrefix (-prefix) keyPrefix The prefix to be prepended to the key of the sequence file. --charset (-c) charset The name of the character encoding of the input files.The default is UTF-8. --overwrite (-ow) If present, overwrite the output directory before running the job. --help (-h) Prints the help menu to the command console. --tempDir tempDir If specified, tells Mahout to use this as a temporary folder. --startPhase startPhase Defines the first phase that needs to be run. --endPhase endPhase Defines the last phase that needs to be run To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunck-0 file, you could type in the following command: hadoop fs -text $WORK_DIR/20news-seq/chunck-0 | more In our case, the result is as follows: /67399 From:xxx Subject: Re: Imake-TeX: looking for beta testers Organization: CS Department, Dortmund University, Germany Lines: 59 Distribution: world NNTP-Posting-Host: tommy.informatik.uni-dortmund.de In article <xxxxx>, yyy writes: |> As I announced at the X Technical Conference in January, I would like |> to |> make Imake-TeX, the Imake support for using the TeX typesetting system, |> publically available. Currently Imake-TeX is in beta test here at the |> computer science department of Dortmund University, and I am looking |> for |> some more beta testers, preferably with different TeX and Imake |> installations. The Hadoop command is pretty simple, and the syntax is as follows: hadoop fs –text <input file> In the preceding syntax, <input file> is the sequence file whose content you will see. Our sequence files have been created, and until now, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weighted vector associated to the original document. 
So now, we need to transform the raw text into vectors of weights and frequency. To do this, we type in the following command: ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf The following command parameters are described briefly: The -lnorm parameter instructs the vector to use the L_2 norm as a distance The -nv parameter is an optional parameter that outputs the vector as namedVector The -wt parameter instructs which weight function needs to be used We end the data-preparation process with this step. Now, we have the weight vector files that are created and ready to be used by the Naïve Bayes algorithm. We will clear a little while this last step algorithm. This part is about tuning the algorithm for better performance of the Naïve Bayes classifier. How to do it… Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier. To avoid this, you need to divide the vector files into two sets called the 80-20 split. This is a good data-mining approach because if you have any algorithm that should be instructed on a dataset, you should divide the whole bunch of data into two sets: one for training and one for testing your algorithm. A good dividing percentage is shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing ones should be the remaining 20 percent. To split data, we use the following command: ./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential As result of this command, we will have two new folders containing the training and testing vectors. Now, it is time to train our Naïves Bayes algorithm on the training set of vectors, and the command that is used is pretty easy: ./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows: ./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex\ -ow -o ${WORK_DIR}/20news-testing The following screenshot shows the output of the preceding command: How it works... We have given certain commands and we have seen the outcome, but you've done this without an understanding of why we did it and above all, why we chose certain parameters. The whole sequence could be meaningless, even for an experienced coder. Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps: Data preparation Data training Data testing In general, these are the three procedures for mining data that should be followed. The data preparation steps involve all the operations that are needed to create the dataset in the format that is required for the data mining procedure. In this case, we know that the original format was a bunch of files containing text, and we transformed them into a sequence file format. The main purpose of this is to have a format that can be handled by the map reducing algorithm. This phase is a general one as the input format is not ready to be used as it is in most cases. 
Sometimes, we also need to merge some data if they are divided into different sources. Sometimes, we also need to use Sqoop for extracting data from different datasources. Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data mining tasks, and we bring some of them to train our model. In our case, we are trying to classify if a document can be inserted in a certain category based on the frequency of some terms in it. This will lead to a classifier that using another document can state if this document is under a previously found category. The output is a function that is able to determinate this association. Next, we need to evaluate this function because it is possible that one good classification in the learning phase is not so good when using a different document. This three-phased approach is essential in all classification tasks. The main difference relies on the type of classifier to be used in the training and testing phase. In this case, we use Naïve Bayes, but other classifiers can be used as well. In the Mahout framework, the available classifiers are Naïve Bayes, Decision Forest, and Logistic Regression. As we have seen, the data preparation consists basically of creating two series of files that will be used for training and testing purposes. The step to transform the raw text file into a Hadoop sequence format is pretty easy; so, we won't spend too long on it. But the next step is the most important one during data preparation. Let us recall it: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf This computational step basically grabs the whole text from the chunck-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways: The -i parameter is used to declare the input folder where all the sequence files are stored The -o parameter is used to create the output folder where the vector containing the weights is stored The -nv parameter tells Mahout that the output format should be in the namedVector format The -wt parameter tells which frequency function to use for evaluating the weight of every term to a category The -lnorm parameter is a function used to normalize the weights using the L_2 distance The -ow: parameter overwrites the previously generated output results The -m: parameter gives the minimum log-likelihood ratio The whole purpose of this computation step is to transform the sequence files that contain the documents' raw text in the sequence files containing vectors that count the frequency of the term. Obviously, there are some different functions that count the frequency of a term within the whole set of documents. So, in Mahout, the possible values for the wt parameter are tf and tfidf. The Tf value is the simpler one and counts the frequency of the term. This means that the frequency of the Wi term inside the set of documents is the ratio between the total occurrence of the word over the total number of words. The second one considers the sum of every term frequency using a logarithmic function like this one: In the preceding formula, Wi is the TF-IDF weight of the word indexed by i. N is the total number of documents. DFi is the frequency of the i word in all the documents. 
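The formula referred to in the preceding paragraph did not survive the page layout; a standard TF-IDF weighting that is consistent with the definitions just given is the following (the exact smoothing Mahout applies internally may differ slightly):

W_i = TF_i \cdot \log\left(\frac{N}{DF_i}\right)

where TF_i is the plain term frequency of word i, as described for the tf option.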
In this preprocessing phase, we notice that we index the whole corpus of documents so that we are sure that even if we divide or split in the next phase, the documents are not affected. We compute a word frequency; this means that the word was contained in the training or testing set. So, the reader should grasp the fact that changing this parameter can affect the final weight vectors; so, based on the same text, we could have very different outcomes. The lnorm value basically means that while the weight can be a number ranging from 0 to an upper positive integer, they are normalized to 1 as the maximum possible weight for a word inside the frequency range. The following screenshot shows the output of the output folder: Various folders are created for storing the word count, frequency, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text. Then, from every text, it extracts the categories and the words. The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows: mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/ part-r-00000 –o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump The dictionary files and word files are sequence files containing the association within the unique key/word created by the MapReduce algorithm using the command: hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more one can see for example adrenal_gland 12912 adrenaline 12913 adrenaline.com 12914| The splitting of the dataset into training and testing is done by using the split command-line option of Mahout. The interesting parameter in this case is that randomSelectionPct equals 40. It uses a random selection to evaluate which point belongs to the training or the testing dataset. Now comes the interesting part. We are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder that contains the model in the form of a binary file. This file represents the Naïve Bayes model that holds the weight Matrix, the feature and label sums, and the weight normalizer vectors generated so far. Now that we have the model, we test it on the training set. The outcome is directly shown on the command line in terms of a confusion matrix. The following screenshot shows the format in which we can see our result. Finally, we test our classifier on the test vector generated by the split instruction. The output in this case is a confusion matrix. Its format is as shown in the following screenshot: We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances that tell us how many sentences have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on a test set of weighted vectors, we have nearly 90 percent of the corrected classified sentences against an error of 9 percent. But if we go through the matrix row by row, we can see at the end that we have different newsgroups. So, a is equal to alt.atheism and b is equal to comp.graphics. So, a first look at the detailed confusion matrix tells us that we did the best in classification against the rec.sport.hockey newsgroup, with a value of 418 that is the highest we have. 
If we take a look at the corresponding row, we understand that of these 418 classified sentences, we have 403/412; so, 97 percent of all of the sentences were found in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.miscwe newsgroup, we can see overall performance is low. The sentences are not so centered around the same new newsgroup; so, it means that we find and classify the sentences in ms-windows in another newsgroup, and so we do not have a good classification. This is reasonable as sports terms like "hockey" are really limited to the hockey world, while sentences about Microsoft could be found both on Microsoft specific newsgroups and in other newsgroups. We encourage you to give another run to the testing phase on the training phase to see the output of the confusion matrix by giving the following command: ./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing As you can see, the input folder is the same for the training phase, and in this case, we have the following confusion matrix: In this case, we can see it using the same set both as the training and testing phase. The first consequence is that we have a rise in the correctly classified sentences by an order of 10 percent, which is even bigger if you remember that in terms of weighted vectors with respect to the testing phase, we have a size that is four times greater. But probably the most important thing is that the best classification has now moved from the hockey newsgroup to the sci.electronics newsgroup. There's more We use exactly the same procedure used by the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that starting all process need only to change the input files from the initial folder. So, for the willing reader, we suggest you download another raw text file and perform all the steps in another type of file to see the changes that we have compared to the initial input text. We would suggest that non-native English readers also look at the differences that we have by changing the initial input set with one not written in English. Since the whole text is transformed using only weight vectors, the outcome does not depend on the difference between languages but only on the probability of finding certain word couples. As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the vector sparse weights. This could be easily done by changing, for example, the -wt tfidf parameter into the command line Mahout seq2sparce. So, for example, an alternative run of the seq2sparce Mahout could be the following one: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf Finally, we not only choose to run the Naïve Bayes classifier for classifying words in a text document but also the algorithm that uses vectors of weights so that, for example, it would be easy to create your own vector weights.
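For convenience, the commands used throughout this recipe can be collected into a single shell script; the sketch below simply restates the recipe's own commands in order, with the paths written out in full (adjust WORK_DIR and run it from the Mahout bin directory, as in the recipe):

#!/bin/bash
# End-to-end sketch of the 20 newsgroups Naive Bayes recipe
export WORK_DIR=/mnt/new

# 1. Prepare the input folder
rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

# 2. Raw text -> Hadoop sequence files
./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

# 3. Sequence files -> TF-IDF weight vectors
./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

# 4. Split into training and testing vectors
./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
  --trainingOutput ${WORK_DIR}/20news-train-vectors \
  --testOutput ${WORK_DIR}/20news-test-vectors \
  --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

# 5. Train the Naive Bayes model
./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model \
  -li ${WORK_DIR}/labelindex -ow

# 6. Test the model on the held-out vectors
./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model \
  -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing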

article-image-using-faceted-search-searching-finding
Packt
24 Dec 2013
11 min read
Save for later

Using Faceted Search, from Searching to Finding

Packt
24 Dec 2013
11 min read
(For more resources related to this topic, see here.) Looking at Solr's standard query parameters The basic engine of Solr is Lucene, so Solr accepts a query syntax based on the Lucene one, even if there are some minor differences, they should not affect our experiments, as they involve more advanced behavior. You can find an explanation on the Solr Query syntax on wiki at: http://wiki.apache.org/solr/SolrQuerySyntax. Let's see some example of a query using the basic parameters. Before starting our tests, we need to configure a new core again, in the usual way. Sending Solr's query parameters over HTTP It is important to take care of the fact that our queries to Solr are sent over the HTTP protocol (unless we are using Solr in embedded mode, as we will see later). With cURL we can handle the HTTP encoding of parameters, for example: >> curl -X POST 'http://localhost:8983/solr/paintings/select?start=3&rows=2&fq=painting&wt=json&indent=true' --data-urlencode 'q=leonardo da vinci&fl=artist title' This command can be instead of the following command: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=leonardo%20da%20vinci&fq=painting&start=3&row=2&fl=artist%20title&wt=json&indent=true" Please note how using the --data-urlencode parameter in the example we can write the parameters values including characters which needs to be encoded over HTTP. Testing HTTP parameters on browsers On modern browsers such as Firefox or Chrome you can look at the parameters directly into the provided console. For example using Chrome you can open the console (with F12): In the previous image you can see under Query String Parameters section on the right that the parameters are showed on a list, and we can easily switch between the encoded and the more readable un-encoded value's version. If don't like using Chrome or Firefox and want a similar tool, you can try the Firebug lite (http://getfirebug.com/firebuglite). This is a JavaScript library conceived to port firebug plugin functionality ideally to every browser, by adding this library to your HTML page during the test process. Choosing a format for the output When sending a query to Solr directly (by the browser or cURL) we can ask for results in multiple formats, including for example JSON: >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=json&indent=true' Time for action – searching all documents with pagination When performing a query we need to remember we are potentially asking for a huge number of documents. Let's observe how to manage partial results using pagination: For example think about the q=*:* query as seen in previous examples which was used for asking all the documents, without a specific criteria. In a case like this, in order to avoid problems with resources, Solr will send us actually only the first ones, as defined by a parameter in the configuration. The default number of returned results will be 10, so we need to be able to ask for a second group of results, and a third, and so on and on until there are. This is what is generally called a pagination of results, similarly as for scenarios involving SQL. 
Executing the command: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=0&rows=0&wt=json&indent=true" We should obtain a result similar to this (the number of documents numFound and the time spent for processing query QTime could vary, depending on your data and your system): In the previous image we see the same results in two different ways: on the right side you'll recognize the output from cURL and on the left side of the browser you see how the results directly in the browser window. In the second example we had the Json View plugin installed in the browser, which gives a very helpful visualization of JSON, with indentation and colors. You can install it if you want for Chrome at: https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc For Firefox the plugin can be installed from: https://addons.mozilla.org/it/firefox/addon/jsonview/ Note how even if we have found 12484 documents, we are currently seeing none of them in the results! What just happened? In this very simple example, we already use two very useful parameters: start and rows, which we should always think as a couple, even if we may be using only one of them explicitly. We could change the default values for these parameters from the solrconfig.xml file, but this is generally not needed: The start value defines the original index of the first document returned in the response, from the ones matching our search criteria, and starting from value 0. The default value will again start at 0. The rows parameter is used to define how many documents we want in the results. The default value will be 10 for rows. So if for example we only want the second and third document from the results, we can obtain them by the query: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=1 &rows=2&wt=json&indent=true' In order to obtain the second document in the results we need to remember that the enumeration starts from 0 (so the second will be at 1), while to see the next group of documents (if present), we will send a new query with values such as, start=10, rows=10, and so on. We are still using the wt and indent parameters only to have results formatted in a clear way. The start/rows parameters play roles in this context which are quite similar to the OFFSET/LIMIT clause in SQL. This process of segmenting the output to be able to read it in group or pages of results is usually called pagination, and it is generally handled by some programming code. You should know this mechanism, so you could play with your test even on a small segment of data without a loss of generalization. I strongly suggest you to always add these two parameters explicitly in your examples. Time for action – projecting fields with fl Another important parameter to consider is fl, that can be used for fields projection, obtaining only certain fields in the results: Suppose now that we are interested on obtaining the titles and artist reference for all the documents: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title,artist' We will obtain an output similar to the one shown in the following image: Note that the results will be indented as requested, and will not contain any header to be more readable. Moreover the parameters list does not need to be written in a specific order. 
The previous query could be rewritten also: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title&fl=artist' Here we ask for field projection one by one, if needed (for example when using HTML and JavaScript widget to compose the query following user's choices). What just happened? The fl parameter stands for fields list. By using this parameter we can define a comma-separated list of fields names that explicitly define what fields are projected in the results. We can also use a space to separate fields, but in this case we should use the URL encoding for the space, writing fl=title+artist or fl=title%20artist. If you are familiar with relational databases and SQL, you should consider the fl parameter. It is similar to the SELECT clause in SQL statements, used to project the selected fields in the results. In a similar way writing fl=author:artist,title corresponds to the usage of aliases for example, SELECT artist AS author, title. Let's see the full list of parameters in details: The parameter q=artist:* is used in this case in place of a more generic q=*:*, to select only the fields which have a value for the field artist. The special character * is used again for indicating all the values. The wt=json, indent=true parameters are used for asking for an indented JSON format. The omitHeader=true parameter is used for omit the header from the response. The fl=title,artist parameter represents the list of the fields to be projected for the results. Note how the fields are projected in the results without using the order asked in fl, as this has no particular sense for JSON output. This order will be used for the CSV response writer that we will see later, however, where changing the columns order could be mandatory. In addition to the existing field, which can be added by using the * special character, we could also ask for the projection of the implicit score field. A composition of these two options could be seen in the following query: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=*,score' This will return every field for every document, including the score field explicitly, which is sometimes called a pseudo-field, to distinguish it from the field defined by a schema. Time for action – selecting documents with filter query Sometimes it's useful to be able to narrow the collection of documents on which we are currently performing our search. It is useful to add some kind of explicit linked condition on the logical side for navigation on data, and will also have good impact on performances too. It is shown in the following example: It shows how the default search is restricted by the introduction of a fq=annunciation condition. What just happened? The first result in this simple example shows that we obtain results similar to what we could have obtained by a simple q=annunciation search. Filtered query can be cached (as well as facets, that we will see later), improving performance by reducing the overhead of performing the same query many times, and accessing documents of large datasets to the same group many times. In this case the analogy with SQL seems less convincing, but q=dali and fq=abstract:painting can be seen corresponding to WHERE conditions in SQL. The fq parameters will then be a fixed condition. In our scenario, we could define for example specific endpoints with pre-defined filter query by author, to create specific channels. 
In this case, instead of passing the parameters every time, we could set them in solrconfig.xml (a sketch of such a configuration is given at the end of this section).

Time for action – searching for similar terms with fuzzy search

Even if wildcard queries are very flexible, sometimes they simply cannot give us good results. There could be some weird typo in the term, and we still want to obtain good results wherever possible, under certain confidence conditions: If we want to search for painting but actually type plainthing, for example:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:plainthing~0.5&wt=json'

Suppose we have a person using a different language, who searched for leonardo by misspelling the name:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:lionardo~0.5&wt=json'

In both cases the examples use misspelled words to be more recognizable, but the same syntax can be used to match existing similar words.

What just happened?

Both the preceding examples work as expected. The first gives us documents containing the term painting, the second gives us documents containing leonardo instead. Note that the syntax plainthing~0.5 represents a query that matches with a certain confidence, so for example we will also obtain occurrences of documents with the term paintings, which is good; but in a more general case we could receive weird results. In order to properly set up the confidence value there are not many options, apart from doing tests. Using fuzzy search is a simple way to obtain a suggested result for alternate forms of a search query, just like when we trust a search engine's similar suggestions in the did you mean approach.
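To illustrate the earlier remark about fixing parameters in solrconfig.xml instead of passing them on every request, here is a minimal sketch of a dedicated request handler; the handler name /paintings-only and the filter value are invented for this example and are not part of the original configuration:

<requestHandler name="/paintings-only" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <int name="rows">10</int>
  </lst>
  <!-- this filter query is appended to every request sent to this handler -->
  <lst name="appends">
    <str name="fq">abstract:painting</str>
  </lst>
</requestHandler>

Every query sent to this handler is then restricted to painting documents, without the client having to pass fq explicitly.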

article-image-learning-option-pricing
Packt
20 Dec 2013
19 min read
Save for later

Learning Option Pricing

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.)

Introduction to options

Options come in two variants, puts and calls. The call option gives the owner of the option the right, but not the obligation, to buy the underlying asset at the strike price. The put gives the holder of the contract the right, but not the obligation, to sell the underlying asset. The Black-Scholes formula describes the European option, which can only be exercised on the maturity date, in contrast to, for example, American options. The buyer of the option pays a premium for this, to cover the risk taken by the counterparty. Options have become very popular and they are traded on the major exchanges throughout the world, covering most asset classes. The theory behind options can become complex pretty quickly. In this article we'll look at the basics of options and how to explore them using code written in F#.

Looking into contract specifications

Options come in a wide number of variations; some of them will be covered briefly below. The contract specifications for options will also depend on their type. Generally there are some properties that are more or less common to all of them. The general specifications are as follows: side, quantity, strike price, expiration date, and settlement terms. The contract specifications, or known variables, are used when we valuate options.

European options
European options are the basic form of options from which the other variants derive; American options and exotic options are some examples. We'll stick to European options in this article.

American options
American options are options that may be exercised on any trading day on or before expiry.

Exotic options
Exotic options are any of the broad category of options that may include complex financial structures and may be combinations of other instruments as well.

Learning about Wiener processes

Wiener processes are closely related to stochastic differential equations and volatility. A geometric Brownian motion, which is built on the Wiener process, is defined by dS_t = μS_t dt + σS_t dW_t. The formula describes the change in the stock price, or underlying, with a drift, μ, a volatility, σ, and the Wiener process, Wt. This process is used to model the prices in Black-Scholes. We'll simulate market data using a Brownian motion, or Wiener process, implemented in F# as a sequence. Sequences can be infinite and only the values used are evaluated, which suits our needs. We'll implement a generator function to generate the Wiener process as a sequence as follows:

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)
let T = 1.0
let N = 500.0
let dt:float = T / N
/// Sequences represent infinite number of elements
// s -> scaling factor
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s)}
    loop s;;

Here we use the random function in normd.Sample(). Let's explain the parameters and the theory behind Brownian motion before looking at the implementation. The parameter T is the time used to create a discrete time increment dt. Notice that dt assumes there are N = 500 items in the sequence; this is of course not always the case, but it will do fine here. Next, we use recursion to create the sequence, where we add an increment to the previous value (x + ...), where x corresponds to x_{t-1}. We can easily generate an arbitrary length of the sequence:

> Seq.take 50 (W 55.00);;
val it : seq<float> = seq [55.0; 56.72907873; 56.96071054; 58.72850048; ...]

Here we create a sequence of length 50.
Let's plot the sequence to get a better understanding of the process.

A Wiener process generated from the sequence generator above.

Next we'll look at the code used to generate the graph in the figure above.

open System
open System.Net
open System.Windows.Forms
open System.Windows.Forms.DataVisualization.Charting
open Microsoft.FSharp.Control.WebExtensions
open MathNet.Numerics.Distributions;

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)

// Create chart and form
let chart = new Chart(Dock = DockStyle.Fill)
let area = new ChartArea("Main")
chart.ChartAreas.Add(area)
let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500)
do mainForm.Text <- "Wiener process in F#"
mainForm.Controls.Add(chart)

// Create series for stock price
let wienerProcess = new Series("process")
do wienerProcess.ChartType <- SeriesChartType.Line
do wienerProcess.BorderWidth <- 2
do wienerProcess.Color <- Drawing.Color.Red
chart.Series.Add(wienerProcess)

let random = new System.Random()
let rnd() = random.NextDouble()
let T = 1.0
let N = 500.0
let dt:float = T / N

/// Sequences represent infinite number of elements
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s)}
    loop s;;

do (Seq.take 100 (W 55.00)) |> Seq.iter (wienerProcess.Points.Add >> ignore)

Most of the code will be familiar to you at this stage, but the interesting part is the last line, where we can simply feed a chosen number of elements from the sequence into Seq.iter, which will plot the values; elegant and efficient.

Learning the Black-Scholes formula

The Black-Scholes formula was developed by Fischer Black and Myron Scholes in the 1970s. The Black-Scholes formula is a stochastic partial differential equation, which estimates the price of an option. The main idea behind the formula is the delta-neutral portfolio: Black and Scholes created a theoretical delta-neutral portfolio to reduce the uncertainty involved. This was a necessary step to be able to arrive at the analytical formula, which we'll cover in this section. Below are the assumptions made under Black-Scholes:

No arbitrage
Possible to borrow money at a constant risk-free interest rate (throughout the holding of the option)
Possible to buy, sell, and short fractional amounts of the underlying asset
No transaction costs
Price of the underlying follows a Brownian motion with constant drift and volatility
No dividends paid from the underlying security

The simplest of the two variants is the one for call options. First the stock price is scaled using the cumulative distribution function with d1 as a parameter. Then the stock price is reduced by the discounted strike price scaled by the cumulative distribution function of d2. In other words, it's the difference between the stock price and the strike, using probability scaling of each and discounting the strike price. The formula for the put is a little more involved, but follows the same principles. The Black-Scholes formula is often separated into parts, where d1 and d2 are the probability factors describing the probability of the stock price being related to the strike price.
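For reference, the closed-form expressions just described can be written as follows; the notation is consistent with the F# implementation later in this article (S is the stock price, K the strike, and N the cumulative distribution function):

C = S\,N(d_1) - K e^{-rT} N(d_2)
P = K e^{-rT} N(-d_2) - S\,N(-d_1)
d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}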
The parameters used in the formula above can be summarized as follows: N – The cumulative distribution function T - Time to maturity, expressed in years S – The stock price, or other underlying K – The strike price r – The risk free interest rate σ – The volatility of the underlying Implementing Black-Scholes in F# Now that we've looked at the basics behind the Black-Scholes formula, and the parameters involved, we can implement it ourselves. The cumulative distribution function is implemented here to avoid dependencies and to illustrate that it's quite simple to implement it yourself too. Below is the Black-Scholes implemented in F#. It takes six arguments; the first is a call-put-flag that determines if it's a call or put option. The constants a1 to a5 are the Taylor series coefficients used in the approximation for the numerical implementation. let pow x n = exp(n * log(x)) type PutCallFlag = Put | Call /// Cumulative distribution function let cnd x = let a1 = 0.31938153 let a2 = -0.356563782 let a3 = 1.781477937 let a4 = -1.821255978 let a5 = 1.330274429 let pi = 3.141592654 let l = abs(x) let k = 1.0 / (1.0 + 0.2316419 * l) let w = (1.0-1.0/sqrt(2.0*pi)*exp(-l*l/2.0)*(a1*k+a2*k*k+a3*(pow k 3.0)+a4*(pow k 4.0)+a5*(pow k 5.0))) if x < 0.0 then 1.0 - w else w /// Black-Scholes // call_put_flag: Put | Call // s: stock price // x: strike price of option // t: time to expiration in years // r: risk free interest rate // v: volatility let black_scholes call_put_flag s x t r v = let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t)) let d2=d1-v*sqrt(t) //let res = ref 0.0 match call_put_flag with | Put -> x*exp(-r*t)*cnd(-d2)-s*cnd(-d1) | Call -> s*cnd(d1)-x*exp(-r*t)*cnd(d2) Let's use the black_scholes function using some various numbers for call and put options. Suppose we want to know the price of an option, where the underlying is a stock traded at $58.60 with an annual volatility of 30%. The risk free interest rate is, let's say, 1%. Then we can use our formula, we defined previously to get the theoretical price according the Black-Scholes formula of a call option with 6 month to maturity (0.5 years): > black_scholes Call 58.60 60.0 0.5 0.01 0.3;; val it : float = 4.465202269 And the value for the put option, just by changing the flag to the function: > black_scholes Put 58.60 60.0 0.5 0.01 0.3;; val it : float = 5.565951021 Sometimes it's more convenient to express the time to maturity in number of days, instead of years. Let's introduce a helper function for that purpose. /// Convert the nr of days to years let days_to_years d = (float d) / 365.25 Note the number 365.25 which includes the factor for leap years. This is not necessary in our examples, but used for correctness. We can now use this function instead, when we know the time in days. > days_to_years 30;; val it : float = 0.08213552361 Let's use the same example above, but now with 20 days to maturity. > black_scholes Call 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 1.065115482 > black_scholes Put 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 2.432270266 Using Black-Scholes together with Charts Sometimes it's useful to be able to plot the price of an option until expiration. We can use our previously defined functions and vary the time left and plot the values coming out. In this example we'll make a program that outputs the graph seen below. 
Chart showing prices for call and put option as function of time /// Plot price of option as function of time left to maturity #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option price as a function of time" mainForm.Controls.Add(chart) /// Create series for call option price let optionPriceCall = new Series("Call option price") do optionPriceCall.ChartType <- SeriesChartType.Line do optionPriceCall.BorderWidth <- 2 do optionPriceCall.Color <- Drawing.Color.Red chart.Series.Add(optionPriceCall) /// Create series for put option price let optionPricePut = new Series("Put option price") do optionPricePut.ChartType <- SeriesChartType.Line do optionPricePut.BorderWidth <- 2 do optionPricePut.Color <- Drawing.Color.Blue chart.Series.Add(optionPricePut) /// Calculate and plot call option prices let opc = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Call 58.60 60.0 x 0.01 0.3] do opc |> Seq.iter (optionPriceCall.Points.Add >> ignore) /// Calculate and plot put option prices let opp = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Put 58.60 60.0 x 0.01 0.3] do opp |> Seq.iter (optionPricePut.Points.Add >> ignore) The code is just a modified version of the code seen in the previous article, with the options parts added. We have two series in this chart, one for call options and one for put options. We also add a legend for each of the series. The last part is the calculation of the prices and the actual plotting. List comprehensions are used for compact code, and the Black-Scholes formula is called for everyday until expiration, where the days are counted down by one day at each step. It's up to you as a reader to modify the code to plot various aspects of the option, such as the option price as a function of an increase in the underlying stock price etc. Introducing the greeks The greeks are partial derivatives of the Black-Scholes formula, with respect to a particular parameter such as time, rate, volatility or stock price. The greeks can be divided into two or more categories, with respect to the order of the derivatives. Below we'll look at the first and second order greeks. First order greeks In this section we'll present the first order greeks using the table below. Name Symbol Description Delta Δ Rate of change of option value with respect to change in the price of the underlying asset. Vega ν Rate of change of option value with respect to change in the volatility of the underlying asset. Referred to as the volatility sensitivity. Theta Θ Rate of change of option value with respect to time. The sensitivity with respect to time will decay as time elapses, phenomenon referred to as the "time decay." Rho ρ Rate of change of option value with respect to the interest rate. Second order greeks In this section we'll present the second order greeks using the table below. Name Symbol Description Gamma Γ Rate of change of delta with respect to change in the price of the underlying asset. Veta - Rate of change in Vega with respect to time. Vera - Rate of change in Rho with respect to volatility. 
Some of the second-order greeks are omitted for clarity; we'll not cover these in this book.

Implementing the greeks in F#

Let's implement the greeks: Delta, Gamma, Vega, Theta, and Rho. First we look at the formulas for each greek. In some cases they vary for calls and puts respectively. We need the derivative of the cumulative distribution function, which in fact is the normal distribution with zero mean and a standard deviation of one:

/// Normal distribution
open MathNet.Numerics.Distributions;
let normd = new Normal(0.0, 1.0)

Delta
Delta is the rate of change of the option price with respect to change in the price of the underlying asset.

/// Black-Scholes Delta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_delta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    match call_put_flag with
    | Put -> cnd(d1) - 1.0
    | Call -> cnd(d1)

Gamma
Gamma is the rate of change of delta with respect to change in the price of the underlying asset. This is the second derivative with respect to the price of the underlying asset. It measures the acceleration of the price of the option with respect to the underlying price.

/// Black-Scholes Gamma
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_gamma s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    normd.Density(d1) / (s*v*sqrt(t))

Vega
Vega is the rate of change of option value with respect to change in the volatility of the underlying asset. It is referred to as the volatility sensitivity.

/// Black-Scholes Vega
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_vega s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    s*normd.Density(d1)*sqrt(t)

Theta
Theta is the rate of change of option value with respect to time. The sensitivity with respect to time will decay as time elapses, a phenomenon referred to as the "time decay".

/// Black-Scholes Theta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_theta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))+r*x*exp(-r*t)*cnd(-d2)
    | Call -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))-r*x*exp(-r*t)*cnd(d2)

Rho
Rho is the rate of change of option value with respect to the interest rate.

/// Black-Scholes Rho
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_rho call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -x*t*exp(-r*t)*cnd(-d2)
    | Call -> x*t*exp(-r*t)*cnd(d2)

Investigating the sensitivity of the greeks

Now that we have all the greeks implemented, we'll investigate the sensitivity of some of them and see how they vary when the underlying stock price changes. The figure below is a surface plot of four of the greeks where time and the underlying price are changing. The figure below is generated in MATLAB, and will not be generated in F#.
We’ll use a 2D version of the graph to study the greeks below. Surface plot of Delta, Gamma, Theta and Rho of a call option. In this section we'll start by plotting the value of Delta for a call option where we vary the price of the underlying. This will result in the following 2D plot: A plot of call option delta versus price of underlying The result in the plot seen in figure above will be generated by the code presented next. We'll reuse most of the code from the example where we looked at the option prices for calls and puts. A slightly modified version is presented here, where the price of the underlying varies from $10.0 to $70.0. /// Plot delta of call option as function of underlying price #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Calculate and plot call delta let opc = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opc |> Seq.iter (optionDeltaCall.Points.Add >> ignore) We can extend the code to plot all four greeks, as in the figure with the surface plots, but here in 2D. The result will be a graph like seen in the figure below. Graph showing the for Greeks for a call option with respect to price change (x-axis). Code listing for visualizing the four greeks Below is the code listing for the entire program used to create the graph above. 
#r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) We’ll create one series for each greek: /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Create series for call option gamma let optionGammaCall = new Series("Call option gamma") do optionGammaCall.ChartType <- SeriesChartType.Line do optionGammaCall.BorderWidth <- 2 do optionGammaCall.Color <- Drawing.Color.Blue chart.Series.Add(optionGammaCall) /// Create series for call option theta let optionThetaCall = new Series("Call option theta") do optionThetaCall.ChartType <- SeriesChartType.Line do optionThetaCall.BorderWidth <- 2 do optionThetaCall.Color <- Drawing.Color.Green chart.Series.Add(optionThetaCall) /// Create series for call option vega let optionVegaCall = new Series("Call option vega") do optionVegaCall.ChartType <- SeriesChartType.Line do optionVegaCall.BorderWidth <- 2 do optionVegaCall.Color <- Drawing.Color.Purple chart.Series.Add(optionVegaCall) Next, we’ll calculate the values to plot for each greek: /// Calculate and plot call delta let opd = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opd |> Seq.iter (optionDeltaCall.Points.Add >> ignore) /// Calculate and plot call gamma let opg = [for x in [10.0..1.0..70.0] do yield black_scholes_gamma x 60.0 0.5 0.01 0.3] do opg |> Seq.iter (optionGammaCall.Points.Add >> ignore) /// Calculate and plot call theta let opt = [for x in [10.0..1.0..70.0] do yield black_scholes_theta Call x 60.0 0.5 0.01 0.3] do opt |> Seq.iter (optionThetaCall.Points.Add >> ignore) /// Calculate and plot call vega let opv = [for x in [10.0..1.0..70.0] do yield black_scholes_vega x 60.0 0.1 0.01 0.3] do opv |> Seq.iter (optionVegaCall.Points.Add >> ignore) Summary In this article, we looked into using F# for investigating different aspects of volatility. Volatility is an interesting dimension of finance where you quickly dive into complex theories and models. Here it's very much helpful to have a powerful tool such as F# and F# Interactive. We've just scratched the surface of options and volatility in this article. There is a lot more to cover, but that's outside the scope of this book. Most of the content here will be used in the trading system. resources for article: further resources on this subject: Working with Windows Phone Controls [article] Simplifying Parallelism Complexity in C# [article] Watching Multiple Threads in C# [article]
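As a quick sanity check of the black_scholes function defined earlier, European options should satisfy put-call parity, C - P = S - K*exp(-rT). The following snippet is an illustrative addition rather than part of the original listing; it reuses the values from the running example:

/// Put-call parity check: C - P should equal S - K*exp(-r*t)
let s, k, t, r, v = 58.60, 60.0, 0.5, 0.01, 0.3
let callPrice = black_scholes Call s k t r v
let putPrice  = black_scholes Put  s k t r v
// both sides of the parity relation
let lhs = callPrice - putPrice
let rhs = s - k * exp(-r * t)
printfn "C - P = %f, S - K*exp(-rT) = %f" lhs rhs

Both values come out at roughly -1.10 for these inputs, which agrees with the call and put prices computed earlier, up to the precision of the cnd approximation.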
article-image-sql-server-analysis-services-administering-and-monitoring-analysis-services
Packt
20 Dec 2013
19 min read
Save for later

SQL Server Analysis Services – Administering and Monitoring Analysis Services

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.) If your environment has only one or a handful of SSAS instances, they can be managed by the same database administrators managing SQL Server and other database platforms. In large enterprises, there could be hundreds of SSAS instances managed by dedicated SSAS administrators. Regardless of the environment, you should become familiar with the configuration options as well as troubleshooting methodologies. In large enterprises, you might also be required to automate these tasks using the Analysis Management Objects (AMO) code. Analysis Services is a great tool for building business intelligence solutions. However, much like any other software, it does have its fair share of challenges and limitations. Most frequently encountered enterprise business intelligence system goals include quick provision of relevant data to the business users and assuring excellent query performance. If your cubes serve a large, global community of users, you will quickly learn that SSAS is optimized to run a single query as fast as possible. Once users send a multitude of heavy queries in parallel, you can expect to see memory, CPU, and disk-related performance counters to quickly rise, with a corresponding increase in query execution duration which, in turn, worsens user experience. Although you could build aggregations to improve query performance, doing so will lengthen cube processing time, and thereby, delay the delivery of essential data to decision makers. It might also be tempting to consider using ROLAP storage mode in lieu of MOLAP so that processing times are shorter, but MOLAP queries usually outperform ROLAP due to heavy compression rates. Hence, figuring out the right storage mode and appropriate level of aggregations is a great balancing act. If you cannot afford using ROLAP, and query performance is paramount to successful cube implementation, you should consider scaling your solution. You have two options for scaling, given as follows: Scaling up: This option means purchasing servers with more memory, more CPU cores, and faster disk drives Scaling out: This option means purchasing several servers of approximately the same capacity and distributing the querying workload across multiple servers using a load balancing tool SSAS lends itself best to the second option—scaling out. Later in this article you will learn how to separate processing and querying activities and how to ensure that all servers in the querying pool have the same data. SSAS instance configuration options All Analysis Services configuration options are available in the msmdsrv.ini file found in the config folder under the SSAS installation directory. Instance administrators can also modify some, but not all configuration properties, using SQL Server Management Studio (SSMS). SSAS has a multitude of properties that are undocumented—this normally means that such properties haven't undergone thorough testing, even by the software's developers. Hence, if you don't know exactly what the configuration setting does, it's best to leave the setting at default value. Even if you want to test various properties on a sandbox server, make a copy of the configuration file prior to applying any changes. How to do it... To modify the SSAS instance settings using the configuration file, perform the following steps: Navigate to the config folder within your Analysis Services installation directory. By default, this will be C:\Program Files\Microsoft SQL Server\MSAS11.instance_name\OLAP\Config. 
Open the msmdsrv.ini file using Notepad or another text editor of your choice. The file is in the XML format, so every property is enclosed in opening and closing tags. Search for the property of interest, modify its value as desired, and save the changes. For example, in order to change the upper limit of the processing worker threads, you would look for the <ThreadPool><Process><MaxThreads> tag sequence and set the values as shown in the following excerpt from the configuration file: <Process>       <MinThreads>0</MinThreads>       <MaxThreads>250</MaxThreads>      <PriorityRatio>2</PriorityRatio>       <Concurrency>2</Concurrency>       <StackSizeKB>0</StackSizeKB>       <GroupAffinity/>     </Process> To change the configuration using SSMS, perform the following steps: Connect to the SSAS instance using the instance administrator account and choose Properties. If your account does not have sufficient permissions, you will get an error that only administrators can edit server properties. Change the desired properties by altering the Value column on the General page of the resulting dialog, as shown in the following screenshot: Advanced properties are hidden by default. You must check the Show Advanced (All) Properties box to see advanced properties. You will not see all the properties in SSMS even after checking this box. The only way to edit some properties is by editing msmdsrv.ini as previously discussed. Make a note of the Reset Default button in the bottom-right corner. This button comes in handy if you've forgotten what the configuration values were before you changed them and want to revert to the default settings. The default values are shown in the dialog box, which can provide guidance as to which properties have been altered. Some configuration settings require restarting the SSAS instance prior to being executed. If this is the case, the Restart column will have a value of Yes. Once you're happy with your changes, click on OK and restart the instance if necessary. You can restart SSAS using the Services.msc applet from the command line using the NET STOP / NET START commands, or directly in SSMS by choosing the Restart option after right-clicking on the instance. How it works... Discussing every SSAS property would make this article extremely lengthy; doing so is well beyond the scope of the book. Instead, in this section, I will summarize the most frequently used properties. Often, synchronization has to copy large partition datafiles and aggregation files. If the timeout value is exceeded, synchronization fails. Increase the value of the <Network><Listener><ServerSendTimeout> and <Network><Listener><ServerReceiveTimeout> properties to allow a longer time span for copying each file. By default, SSAS can use a lazy thread to rebuild missing indexes and aggregations after you process partition data. If the <OLAP><LazyProcessing><Enabled> property is set to 0, the lazy thread is not used for building missing indexes—you must use an explicit processing command instead. The <OLAP><LazyProcessing><MaxCPUUsage> property throttles the maximum CPU that could be used by the lazy thread. If efficient data delivery is your topmost priority, you can exploit the ProcessData option instead of ProcessFull. To build aggregations after the data is loaded, you must set the partition's ProcessingMode property to LazyAggregations. The SSAS formula engine is single threaded, so queries that perform heavy calculations will only use one CPU core, even on a multiCPU computer. 
The storage engine is multithreaded; hence, queries that read many partitions will require many CPU cycles. If you expect storage engine heavy queries, you should lower the CPU usage threshold for LazyAggregations. By default, Analysis Services records subcubes requested for every 10th query in the query log table. If you'd like to design aggregations based on query logs, you should change the <Log><QueryLog><QueryLogSampling> property value to 1 so that the SSAS logs subcube requests for every query. SSAS can use its own memory manager or the Windows memory manager. If your SSAS instance consistently becomes unresponsive, you could try using the Windows memory manager. Set <Memory><MemoryHeapType> to 2 and <Memory><HeapTypeForObjects> to 0. The Analysis Services memory manager values are 1 for both the properties. You must restart the SSAS service for the changes to these properties to take effect. The <Memory><PreAllocate> property specifies the percentage of total memory to be reserved at SSAS startup. SSAS normally allocates memory dynamically as it is required by queries and processing jobs. In some cases, you can achieve performance improvement by allocating a portion of the memory when the SSAS service starts. Setting this value will increase the time required to start the service. The memory will not be released back to the operating system until you stop the SSAS service. You must restart the SSAS service for changes to this property to take effect. The <Log><FlightRecorder><FileSizeMB>and <Log><FlightRecorder><LogDurationSec> properties control the size and age of the FlightRecorder trace file before it is recycled. You can supply your own trace definition file to include the trace events and columns you wish to monitor using the <Log><FlightRecorder><TraceDefinitionFile> property. If FlightRecorder collects useful trace events, it can be an invaluable troubleshooting tool. By default, the file is only allowed to grow to 10 MB or 60 minutes. Long processing jobs can take up much more space, and their duration could be much longer than 60 minutes. Hence, you should adjust the settings as necessary for your monitoring needs. You should also adjust the trace events and columns to be captured by FlightRecorder. You should consider adjusting the duration to cover three days (in case the issue you are researching happens over a weekend). The <Memory><LowMemoryLimit> property controls the point—amount of memory used by SSAS—at which the cleaner thread becomes actively engaged in reclaiming memory from existing jobs. Each SSAS command (query, processing, backup, synchronization, and so on) is associated with jobs that run on threads and use system resources. We can lower the value of this setting to run more jobs in parallel (though the performance of each job could suffer). Two properties control the maximum amount of memory that a SSAS instance could use. Once memory usage reaches the value specified by <Memory><TotalMemoryLimit>, the cleaner thread becomes particularly aggressive at reclaiming memory. The <Memory><HardMemoryLimit> property specifies the absolute memory limit—SSAS will not use memory above this limit. These properties are useful if you have SSAS and other applications installed on the same server computer. You should reserve some memory for other applications and the operating system as well. When HardMemoryLimit is reached, SSAS will disconnect the active sessions, advising that the operation was cancelled due to memory pressure. 
All memory settings are expressed in percentages if the values are less than or equal to 100. Values above 100 are interpreted as kilobytes. All memory configuration changes require restart of the SSAS service to take effect. In the prior releases of Analysis Services, you could only specify the minimum and maximum number of threads used for queries and processing jobs. With SSAS 2012, you can also specify the limits for the input/output job threads using the <ThreadPool><IOProcess> property. The <Process><IndexBuildThreshold> property governs the minimum number of rows within a partition for which SSAS will build indexes. The default value is 4096. SSAS decides which partitions it needs to scan for each query based on the partition index files. If the partition does not have indexes, it will be scanned for all the queries. Normally, SSAS can read small partitions without greatly affecting query performance. But if you have many small partitions, you should lower the threshold to ensure each partition has indexes. The <Process><BufferRecordLimit> and <Process><BufferMemoryLimit> properties specify the number of records for each memory buffer and the maximum percentage of memory that can be used by a memory buffer. Lower the value of these properties to process more partitions in parallel. You should monitor processing using the SQL Profiler to see if some partitions included in the processing batch are being processed while the others are in waiting. The <ExternalConnectionTimeout> and <ExternalCommandTimeout> properties control how long an SSAS command should wait for connecting to a relational database or how long SSAS should wait to execute the relational query before reporting timeout. Depending on the relational source, it might take longer than 60 seconds (that is, the default value) to connect. If you encounter processing errors without being able to connect to the relational source, you should increase the ExternalConnectionTimeout value. It could also take a long time to execute a query; by default, the processing query will timeout after one hour. Adjust the value as needed to prevent processing failures. The contents of the <AllowedBrowsingFolders> property define the drives and directories that are visible when creating databases, collecting backups, and so on. You can specify multiple items separated using the pipe (|) character. The <ForceCommitTimeout> property defines how long a processing job's commit operation should wait prior to cancelling any queries/jobs which may interfere with processing or synchronization. A long running query can block synchronization or processing from committing its transaction. You can adjust the value of this property from its default value of 30 seconds to ensure that processing and queries don't step on each other. The <Port> property specifies the port number for the SSAS instance. You can use the hostname followed by a colon (:) and a port number for connecting to the SSAS instance in lieu of the instance name. Be careful not to supply the port number used by another application; if you do so, the SSAS service won't start. The <ServerTimeout> property specifies the number of milliseconds after which a query will timeout. The default value is 1 hour, which could be too long for analytical queries. If the query runs for an hour, using up system resources, it could render the instance unusable by any other connection. You can also define a query timeout value in the client application's connection strings. 
Client setting overrides the server-level property. There's more... There are many other properties you can set to alter SSAS instance behavior. For additional information on configuration properties, please refer to product documentation at http://technet.microsoft.com/en-us/library/ms174556.aspx. Creating and dropping databases Only SSAS instance administrators are permitted to create, drop, restore, detach, attach, and synchronize databases. This recipe teaches administrators how to create and drop databases. Getting ready Launch SSMS and connect to your Analysis Services instance as an administrator. If you're not certain that you have administrative properties to the instance, right-click on the SSAS instance and choose Properties. If you can view the instance's properties, you are an administrator; otherwise, you will get an error indicating that only instance administrators can view and alter properties. How to do it... To create a database, perform the following steps: Right-click on the Databases folder and choose New Database. Doing so launches the New Database dialog shown in the following screenshot. Specify a descriptive name for the database, for example, Analysis_Services_Administration. Note that the database name can contain spaces. Each object has a name as well as an identifier. The identifier value is set to the object's original name and cannot be changed without dropping and recreating the database; hence, it is important to come up with a descriptive name from the very beginning. You cannot create more than one database with the same name on any SSAS instance. Specify the storage location for the database. By default, the database will be stored under the \OLAP\DATA folder of your SSAS installation directory. The only compelling reason to change the default is if your data drive is running out of disk space and cannot support the new database's storage requirements. Specify the impersonation setting for the database. You could also specify the impersonation property for each data source. Alternatively, each data source can inherit the DataSourceImpersonationInfo property from the database-level setting. You have four choices as follows: Specific user name (must be a domain user) and password: This is the most secure option but requires updating the password if the user changes the password Analysis Services service account Credentials of the current user: This option is specifically for data mining Default: This option is the same as using the service account option Specify an optional description for the database. As with majority of other SSMS dialogs, you can script the XMLA command you are about to execute by clicking on the Script button. To drop an existing database, perform the following steps: Expand the Databases folder on the SSAS instance, right-click on the database, and choose Delete. The Delete objects dialog allows you to ignore errors; however, it is not applicable to databases. You can script the XMLA command if you wish to review it first. An alternative way of scripting the DELETE command is to right-click on the database and navigate to Script database as | Delete To | New query window. Monitoring SSAS instance using Activity Viewer Unlike other database systems, Analysis Services has no system databases. However, administrators still need to check the activity on the server, ensure that cubes are available and can be queried, and there is no blocking. 
You can exploit a tool named Analysis Services Activity Viewer 2008 to monitor SSAS Versions 2008 and later, including SSAS 2012. This tool is owned and maintained by the SSAS community and can be downloaded from www.codeplex.com. Activity Viewer allows viewing active and dormant sessions, current XMLA and MDX queries, locks, as well as CPU and I/O usage by each connection. Additionally, you can define rules to raise alerts when a particular condition is met. How to do it... To monitor an SSAS instance using Activity Viewer, perform the following steps: Launch the application by double-clicking on ActivityViewer.exe. Click on the Add New Connection button on the Overview tab. Specify the hostname and instance name or the hostname and port number for the SSAS instance and then click on OK. For each SSAS instance you connect to, Activity Viewer adds a new tab. Click on the tab for your SSAS instance. Here, you will see several pages as shown in the following screenshot: Alerts: This page shows any sessions that met the condition found in the Rules page. Users: This page displays one row for each user as well as the number of sessions, total memory, CPU, and I/O usage. Active Sessions: This page displays each session that is actively running an MDX, Data Mining Extensions (DMX), or XMLA query. This page allows you to cancel a specific session by clicking on the Cancel Session button. Current Queries: This page displays the actual command's text, number of kilobytes read and written by the command, and the amount of  CPU time used by the command. This page allows you to cancel a specific query by clicking on the Cancel Query button. Dormant Sessions: This page displays sessions that have a connection to the SSAS instance but are not currently running any queries. You can also disconnect a dormant session by clicking on the Cancel Session button. CPU: This page allows you to review the CPU time used by the session as well as the last command executed on the session. I/O: This page displays the number of reads and writes as well as the kilobytes read and written by each session. Objects: This page shows the CPU time and number of reads affecting each dimension and partition. This page also shows the full path to the object's parent; this is useful if you have the same naming convention for partitions in multiple measure groups. Not only do you see the partition name, but also the full path to the partition's measure group. This page also shows the number of aggregation hits for each partition. If you find that a partition is frequently queried and requires many reads, you should consider building aggregations for it. Locks: This page displays the locks currently in place, whether already granted or waiting. Be sure to check the Lock Status column—the value of 0 indicates that the lock request is currently blocked. Rules: This page allows defining conditions that will result in an alert. For example, if the session is idle for over 30 minutes or if an MDX query takes over 30 minutes, you should get alerted. How it works... Activity Viewer monitors Analysis Services using Dynamic Management Views (DMV). In fact, capturing queries executed by Activity Viewer using SQL Server Profiler is a good way of familiarizing yourself with SSAS DMV's. 
For example, the Current Queries page checks the $system.DISCOVER_COMMANDS DMV for any actively executing commands by running the following query: SELECT SESSION_SPID,COMMAND_CPU_TIME_MS,COMMAND_ELAPSED_TIME_MS,   COMMAND_READ_KB,COMMAND_WRITE_KB, COMMAND_TEXT FROM $system.DISCOVER_COMMANDS WHERE COMMAND_ELAPSED_TIME_MS > 0 ORDER BY COMMAND_CPU_TIME_MS DESC The Active Sessions page checks the $system.DISCOVER_SESSIONS DMV with the session status set to 1 using the following query: SELECT SESSION_SPID,SESSION_USER_NAME, SESSION_START_TIME,   SESSION_ELAPSED_TIME_MS,SESSION_CPU_TIME_MS, SESSION_ID FROM $SYSTEM.DISCOVER_SESSIONS WHERE SESSION_STATUS = 1 ORDER BY SESSION_USER_NAME DESC The Dormant sessions page runs a very similar query to that of the Active Sessions page, except it checks for sessions with SESSION_STATUS=0—sessions that are currently not running any queries. The result set is also limited to top 10 sessions based on idle time measured in milliseconds. The Locks page examines all the columns of the $system.DISCOVER_LOCKS DMV to find all requested locks as well as lock creation time, lock type, and lock status. As you have already learned, the lock status of 0 indicates that the request is blocked, whereas the lock status of 1 means that the request has been granted. Analysis Services blocking can be caused by conflicting operations that attempt to query and modify objects. For example, a long running query can block a processing or synchronization job from completion because processing will change the data values. Similarly, a command altering the database structure will block queries. The database administrator or instance administrator can explicitly issue the LOCK XMLA command as well as the BEGIN TRANSACTION command. Other operations request locks implicitly. The following table documents most frequently encountered Analysis Services lock types: Lock type identifier Description Acquired for 2 Read lock Processing to read metadata. 4 Write lock Processing to write data after it is read from relational sources. 8 Commit shared During the processing, restore or synchronization commands. 16 Commit exclusive Committing the processing, restore, or synchronization transaction when existing files are replaced by new files.  
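Following the same pattern as the queries above, a quick way to check for blocking from SSMS is to look for lock requests that have not yet been granted. This is a minimal sketch that simply filters the same $system.DISCOVER_LOCKS DMV used by Activity Viewer on the lock status value of 0 discussed earlier:

SELECT *
FROM $system.DISCOVER_LOCKS
WHERE LOCK_STATUS = 0

Any rows returned represent requests that are currently blocked; the session identifier in the result can then be matched against $system.DISCOVER_SESSIONS to decide whether the blocking session should be cancelled.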


Key components and inner working of Impala

Packt
20 Dec 2013
7 min read
(For more resources related to this topic, see here.)

Impala Core Components

Here we will discuss three important components: the Impala Daemon, the Impala Statestore, and Impala Metadata and Metastore. Putting these components together with Hadoop and an application or command-line interface, we can conceptualize them as below.

Impala Execution Architecture

Essentially, Impala daemons receive queries from a variety of sources and distribute the query load to other Impala daemons running on other nodes. While doing so, they interact with the Statestore for node-specific updates and access the Metastore, which is either stored in a centralized database or cached locally. To complete the picture of Impala execution, we will discuss how Impala interacts with other components, namely Hive, HDFS, and HBase.

Impala working with Apache Hive

We have already discussed that Impala uses a centralized database as its Metastore, and Hive uses the same MySQL or PostgreSQL database for the same kind of data. Impala provides the same SQL-like query interface used in Apache Hive. Because Impala and Hive share the same Metastore database, Impala can access Hive table definitions, as long as those definitions use file formats, compression codecs, and Impala-supported data types in their column values. Apache Hive provides Impala with support for processing various file types. When a format other than a text file is used, such as RCFile, Avro, or SequenceFile, the data must be loaded through Hive first, and then Impala can query the data in these file formats. Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. The ANALYZE TABLE statement in Hive generates useful table and column statistics, and Impala uses these statistics to optimize queries.

Impala working with HDFS

Impala table data consists of regular data files stored in HDFS (Hadoop Distributed File System), and Impala uses HDFS as its primary data storage medium. As soon as a data file or a collection of files is available in the folder of a new table, Impala reads all of the files regardless of their names, and new data is added in files whose names are controlled by Impala. HDFS provides data redundancy through its replication factor, and Impala relies on this redundancy to access data on other DataNodes when it is not available on a specific DataNode. We have also learned that Impala maintains information about the physical location of the blocks of data files in HDFS, which helps data access in case of node failure.

Impala working with HBase

HBase is a distributed, scalable, big data storage system that provides random, real-time read and write access to data stored on HDFS. HBase sits on top of HDFS; however, unlike traditional database storage systems, HBase does not provide built-in SQL support, although third-party applications can provide such functionality. To use HBase with Impala, you first define tables and map them to the equivalent HBase tables. Once this table relationship is established, you can submit queries against the HBase tables through Impala, and join operations can even be formed that include both HBase and Impala tables. A sketch of such a mapping is shown below.
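The following is only a sketch of one common way to establish such a mapping: the table and column names are hypothetical, and it assumes the mapping DDL is issued through Hive's HBase storage handler, after which Impala only needs to refresh its metadata before querying the table.

-- Run in Hive: expose an existing HBase table (assumed here to be named "events")
CREATE EXTERNAL TABLE hbase_events (
  event_key   STRING,
  event_type  STRING,
  event_value STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:type,info:value")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Run in impala-shell: pick up the new definition, then query it like any other table
INVALIDATE METADATA;
SELECT event_type, COUNT(*) AS event_count
FROM hbase_events
GROUP BY event_type;

Because the mapped table behaves like any other Impala table, it can also participate in joins with HDFS-backed tables, which is how the cross-store joins mentioned above are typically written.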
Impala Security

Impala is designed and developed to run on top of Hadoop, so you must understand the Hadoop security model as well as the security provided by the operating system on which Hadoop is running. If Hadoop runs on Linux, the Linux administrator and the Hadoop administrator can harden and tighten security, and this should be taken into account alongside the security provided by Impala itself. Impala 1.1 or later uses the Sentry open source project to provide a fine-grained authorization framework for Hadoop. Impala 1.1.1 adds auditing capabilities to the cluster by creating audit data, which can be collected from all nodes and processed for further analysis and insight.

Data Visualization using Impala

Visualizing data is as important as processing it. The human brain perceives pictures faster than it reads data in tables, so data visualization provides a very fast understanding of large amounts of data. Reports, charts, interactive dashboards, and other forms of infographics are all part of data visualization and provide a deeper understanding of results. To connect with third-party applications, Cloudera provides ODBC and JDBC connectors. These connectors are installed on the machines where the third-party applications run; after configuring the correct Impala server and port details in the connector, the application connects to Impala, submits its queries, and retrieves the results. The results are then displayed in the third-party application, where they are rendered as visualizations, displayed in table format, or processed further, depending on the application's requirements. In this section we will cover a few notable third-party applications that can take advantage of Impala's fast query processing to display graphical results.

Tableau and Impala

Tableau Software supports Impala by providing access to Impala tables through an Impala ODBC connector. Tableau is one of the most prominent data visualization technologies today and is used daily by thousands of enterprises to get intelligence out of their data. Tableau runs on Windows, and Cloudera provides an ODBC connector to make this connection possible. You can download the Impala connector for Tableau from the following link: http://go.cloudera.com/tableau_connector_download Once the Impala connector is installed and configured correctly on the machine where Tableau is running, Tableau is ready to work with Impala. In the image below, Tableau is connected to an Impala server at port 21000 and a table located in Impala is selected. Once a table is selected, particular fields are chosen and the data is displayed graphically in various visualizations; the screenshot below shows one example of such a visualization.

Microsoft Excel and Impala

Microsoft Excel is one of the most widely adopted data processing applications, used by business professionals worldwide. You can connect Microsoft Excel with Impala using another ODBC connector, provided by Simba Technologies.

Microstrategy and Impala

Microstrategy is another big player in data analysis and visualization software; it uses an ODBC driver to connect with Impala and render visualizations. The connectivity model between Microstrategy software and Cloudera Impala is shown below.

Zoomdata and Impala

Zoomdata represents a new generation of data user interface that addresses streams of data instead of sets of data. The Zoomdata processing engine performs continuous mathematical operations across data streams in real time to create visualizations on a multitude of devices. The visualization updates itself as new data arrives and is recomputed by Zoomdata. As shown in the image below, the Zoomdata application uses Impala as a data source, configured underneath to use one of the available connectors to connect with Impala. Once the connection is made, users can see the resulting data visualizations.

Real-time Query with Impala on Hadoop

Impala is marketed by its developer, Cloudera, as a product that can run "real-time queries on Hadoop". Impala is an open source implementation based on the above-mentioned Google Dremel technology, available free for anyone to use. Impala is available as a packaged product, free to use, or can be compiled from its source. It runs queries in memory to make them real-time, and in some cases, depending on the type of data, using the Parquet file format as the input data source can speed up query processing many times over, as sketched below.
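As a rough illustration of that point, the statements below create a Parquet copy of an existing text-format table entirely from impala-shell. The table names are hypothetical, and depending on the Impala release the format keyword may be spelled PARQUETFILE instead of PARQUET:

-- Create an empty table with the same columns, stored in the Parquet format
CREATE TABLE events_parquet LIKE events_text STORED AS PARQUET;

-- Copy the data; Impala writes the Parquet files itself
INSERT INTO events_parquet SELECT * FROM events_text;

-- Subsequent queries scan compact columnar files instead of raw text
SELECT COUNT(*) FROM events_parquet;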
Real-time query subscription with Impala

Cloudera offers a Real-time Query (RTQ) subscription as an add-on to the Cloudera Enterprise subscription. You can still use Impala as a free open source product; however, an RTQ subscription lets you take advantage of Cloudera's paid services to extend its usability and resilience. With an RTQ subscription you not only get access to Cloudera technical support, but you can also work with the Impala development team and provide feedback that shapes the product's design and implementation.

Summary

This concludes the discussion of the key components of Impala and their inner working.

Resources for Article:

Further resources on this subject:

Securing the Hadoop Ecosystem [Article]
Cloudera Hadoop and HP Vertica [Article]
Hadoop and HDInsight in a Heartbeat [Article]