
How-To Tutorials - Data

1204 Articles

MongoDB data modeling

Packt
20 Feb 2014
5 min read
MongoDB data modeling is an advanced topic. We will instead focus on a few key points regarding data modeling so that you understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models.

The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents, while normalized models represent the same records with references between documents. This modeling decision affects the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB. With normalized models, Pentaho Data Integration is used to resolve the reference joins.

The following table lists the objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database.

MongoDB objects    Sample MongoDB object names
Database           pentaho
Collection         sessions, events, and sessions_events
Document           see the field listings below

The sessions document fields (key-value pairs) are as follows:
  _id: 512d200e223e7f294d13607
  ip_address: 235.149.145.57
  id_session: DF636FB593684905B335982BEA3D967B
  browser: Chrome
  date_session: 2013-02-20T15:47:49Z
  duration_session: 18.24
  referring_url: www.xyz.com

The events document fields (key-value pairs) are as follows:
  _id: 512d1ff2223e7f294d13c0c6
  id_session: DF636FB593684905B335982BEA3D967B
  event: Signup Free Offer

The sessions_events document fields (key-value pairs) are as follows:
  _id: 512d200e223e7f294d13607
  ip_address: 235.149.145.57
  id_session: DF636FB593684905B335982BEA3D967B
  browser: Chrome
  date_session: 2013-02-20T15:47:49Z
  duration_session: 18.24
  referring_url: www.xyz.com
  event_data: [array]
    event: Visited Site
    event: Signup Free Offer

This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection.

Normalized models

Because multiple clickstream events can occur within a single web session, sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference. The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure. Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red.

The normalized model is an efficient data model if we expect the number of events per session to be very large, for a couple of reasons. The first reason is that MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. The second reason is that query performance can be negatively impacted by large data arrays that contain thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small.

Denormalized models

The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside a single session document. Embedding is accomplished by declaring a field that holds either an array of values or embedded documents, known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection, as shown in the following figure.

As you can see, we have the choice to keep the session and event data in separate collections or, alternatively, to store both datasets inside a single collection. One important rule to keep in mind when considering the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/.

To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:

- The MongoDB query language does not support joins between collections
- The maximum document size is 16 megabytes
- Very large data arrays can negatively impact query performance

In our sample database, the number of clickstream events per session is expected to be small, within a modest range of only one to 20 per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, that can handle both normalized and denormalized data models.

Summary

This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho.

Resources for Article:

Further resources on this subject:
- Installing Pentaho Data Integration with MySQL [article]
- Integrating Kettle and the Pentaho Suite [article]
- Getting Started with Pentaho Data Integration [article]
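To make the trade-off above concrete, here is a small Python sketch using the pymongo driver. The driver choice, connection string, and the sample event list are assumptions made for illustration only; the article itself resolves the normalized join with Pentaho Data Integration rather than code. The sketch stores the same session under both models, performs the client-side join that the normalized model forces on you, and queries the embedded array directly in the denormalized model.

```python
# Minimal pymongo sketch contrasting the normalized and denormalized
# clickstream models described above. Connection details are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
db = client["pentaho"]

session = {
    "ip_address": "235.149.145.57",
    "id_session": "DF636FB593684905B335982BEA3D967B",
    "browser": "Chrome",
    "date_session": "2013-02-20T15:47:49Z",
    "duration_session": 18.24,
    "referring_url": "www.xyz.com",
}
# Hypothetical four events for the one 20-minute session.
events = ["Visited Site", "Watched Video", "Signup Free Offer", "Completed Lead Form"]

# Normalized model: one session document plus one event document per event.
db.sessions.insert_one(dict(session))  # insert a copy so `session` stays reusable
db.events.insert_many(
    [{"id_session": session["id_session"], "event": e} for e in events]
)

# The query language offers no join between collections, so relating the two
# normalized collections requires a client-side join on id_session.
s = db.sessions.find_one({"id_session": session["id_session"]})
joined_events = list(db.events.find({"id_session": s["id_session"]}))

# Denormalized model: the events are embedded as an array in a single document.
db.sessions_events.insert_one(
    {**session, "event_data": [{"event": e} for e in events]}
)

# The embedded array is queried directly with array operators; no join needed.
signups = db.sessions_events.count_documents({"event_data.event": "Signup Free Offer"})
print(len(joined_events), "joined events;", signups, "embedded signup sessions")
```

In the normalized case, Pentaho Data Integration plays the role of that client-side join; in the denormalized case the query needs no join at all.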


Understanding Process Variation with Control Charts

Packt
19 Feb 2014
10 min read
Xbar-R charts and applying stages to a control chart

As with all control charts, Xbar-R charts are used to monitor process stability. Apart from generating the basic control chart, we will look at how we can control the output with a few options within the dialog boxes. Xbar-R stands for means and ranges; we use the means chart to estimate the population mean of a process and the range chart to observe how the population variation changes. For more information on control charts, see Understanding Statistical Process Control by Donald J. Wheeler and David S. Chambers.

As an example, we will study the fill volumes of syringes. Five syringes are sampled from the process at hourly intervals; these are used to represent the mean and variation of the process over time. We will plot the means and ranges of the fill volumes across 50 subgroups. The data also includes a process change, which will be displayed on the chart by dividing the data into two stages.

Charts for subgrouped data can use a worksheet set up in either of two formats. Here the data is recorded so that each row represents a subgroup and the columns are the sample points. The Xbar-S chart will use data in the other format, where all the results are recorded in one column. The following screenshot shows the data with subgroups across the rows on the left, and the same data with subgroups stacked on the right.

How to do it…

The following steps will create an Xbar-R chart staged by the Adjustment column with all eight of the tests for special causes:

1. Use the Open Worksheet command from the File menu to open the Volume.mtw worksheet.
2. Navigate to Stat | Control Charts | Variables charts for subgroups, then click on Xbar-R….
3. Change the drop-down at the top of the dialog to Observations for a subgroup are in one row of columns:.
4. Enter the columns Measure1 to Measure5 into the dialog box by highlighting all the measure columns in the left selection box and clicking on Select.
5. Click on Xbar-R Options and navigate to the Tests tab. Select all the tests for special causes.
6. Select the Stages tab.
7. Enter Adjustment in the Define Stages section.
8. Click on OK in each dialog box.

How it works…

The R or range chart displays the variation over time in the data by plotting the range of measurements in a subgroup. The Xbar chart plots the means of the subgroups. The layout of the worksheet is picked from the drop-down box in the main dialog box. The All observations for a chart are in one column: field is used for data stacked into columns; means and ranges are found from subgroups indicated in the worksheet. The Observations for a subgroup are in one row of columns: field finds means and ranges from the worksheet rows. The Xbar-S chart example shows how to use the dialog box when the data is in a single column; the dialog boxes for both Xbar-R and Xbar-S work the same way.

Tests for special causes are used to check the data for nonrandom events. The Xbar-R chart options give us control over the tests that will be used, and the values of the tests can be changed from these options as well. The options from the Tools menu of Minitab can be used to set the default values and tests to use in any control chart.

By using the option under Stages, we are able to recalculate the means and standard deviations for the pre- and post-change groups in the worksheet. Stages can be used to recalculate the control chart parameters on each change in a column or on specific values. A date column can be used to define stages by entering the date at which a stage should be started.

There's more…

Xbar-R charts are also available under the Assistant menu. The default display option for a staged control chart is to show only the mean and control limits for the final stage. Should we want to see the values for all stages, we would use the Xbar-R Options and the Display tab. To place these values on the chart for all stages, check the Display control limit / center line labels for all stages box. See the Using an Xbar-S chart recipe for a description of all the tabs within the Control Charts options.

Using an Xbar-S chart

Xbar-S charts are similar in use to Xbar-R charts. The main difference is that the variation chart uses the standard deviation of the subgroups instead of the range. The choice between Xbar-R and Xbar-S is usually made based on the number of samples in each subgroup. With smaller subgroups, typically fewer than nine results per subgroup, the estimated standard deviation can be inflated, which increases the width of the control limits on the charts. The Automotive Industry Action Group (AIAG) suggests using Xbar-R when the subgroup size is less than nine and Xbar-S when it is greater than or equal to nine.

Now, we will apply an Xbar-S chart to a slightly different scenario. Japan sits above several active fault lines, so minor earthquakes are felt in the region quite regularly, and there may be several minor seismic events on any given day. For this example, we are going to use seismic data from the Advanced National Seismic System. All seismic events from January 1, 2013 to July 12, 2013 from the region covering latitudes 31.128 to 45.275 and longitudes 129.799 to 145.269 are included in this dataset. This corresponds to an area that roughly encompasses Japan. The dataset is provided for us already, but we could gather more up-to-date results from the following link: http://earthquake.usgs.gov/monitoring/anss/. To search the catalog yourself, use the following link: http://www.ncedc.org/anss/catalog-search.html.

We will look at seismic events by week and create Xbar-S charts of magnitude and depth. In the initial steps, we will use the date to generate a column that identifies the week of the year; this column is then used as the subgroup identifier.

How to do it…

The following steps will create an Xbar-S chart for the depth and magnitude of earthquakes, displaying the mean and standard deviation of the events by week:

1. Use the Open Worksheet command from the File menu to open the earthquake.mtw file.
2. Go to the Data menu, click on Extract from Date/Time, and then click on To Text.
3. Enter Date in the Extract from Date/time column: section.
4. Type Week in the Store text column in: section.
5. Check the selection for Week and click on OK to create the new column.
6. Navigate to Stat | Control Charts | Variable charts for Subgroups and click on Xbar-S.
7. Enter Depth and Mag into the dialog box as shown in the following screenshot and Week into the Subgroup sizes: field.
8. Click on the Scale button and select the option for Stamp. Enter Date in the Stamp columns section. Click on OK.
9. Click on Xbar-S Options and then navigate to the Tests tab. Select all tests for special causes.
10. Click on OK in each dialog box.

How it works…

Steps 1 to 4 build the Week column that we use as the subgroup. The extract from date/time options are excellent for quickly generating columns based on dates: days of the week, week of the year, month, or even minutes or seconds can all be separated from the date.

Multiple columns can be entered into the control chart dialog box just as we have done here. Each column is then used to create a new Xbar-S chart, which lets us quickly create charts for several dimensions that are recorded at the same time. Using the Week column as the subgroup size generates the control chart with the mean depth and magnitude for each week.

The scale options within control charts are used to change the display of the chart scales. By default, the x axis displays the subgroup number; changing this to display the date can be more informative when identifying results that are out of control. Options to add axes, tick marks, gridlines, and additional reference lines are also available. We can also edit the axis of the chart after we have generated it by double-clicking on the x axis.

The Xbar-S options are similar for all control charts; the tabs within Options give us control over a number of items for the chart:

- Parameters: This sets the historical means and standard deviations; if using multiple columns, enter the first column mean, leave a space, and enter the second column mean
- Estimate: This allows us to specify subgroups to include or exclude in the calculations and change the method of estimating sigma
- Limits: This can be used to change where sigma limits are displayed or to place bounds on the control limits
- Tests: This allows us to choose the tests for special causes and change their default values; the Using an I-MR chart recipe details the options for the Tests tab
- Stages: This allows the chart to be subdivided and will recalculate center lines and control limits for each stage
- Box-Cox: This can be used to transform the data, if necessary
- Display: This has settings to choose how much of the chart to display; we can limit the chart to show only the last section of the data or split a larger dataset into separate segments, and there is also an option to display the control limits and center lines for all stages of a chart
- Storage: This can be used to store parameters of the chart, such as means, standard deviations, plotted values, and test results

There's more…

The control limits for the charts produced vary because the subgroup sizes are not constant; the number of earthquakes varies each week. In most practical applications, we would expect to collect the same number of samples or items in a subgroup and hence have flat control limits. If we wanted to see the number of earthquakes in each week, we could use Tally from the Tables menu, which displays a count per week. We could also store this tally back into the worksheet; the result of the tally could be used with a c-chart to display a count of earthquake events per week.

If we wanted to import the data directly from the Advanced National Seismic System, the following steps will obtain the data and prepare the worksheet for us:

1. Follow the link to the ANSS catalog search at http://www.ncedc.org/anss/catalog-search.html.
2. Enter 2013/01/01 in the Start date, time: field.
3. Enter 2013/06/12 in the End date, time: field.
4. Enter 3 in the Min magnitude: field.
5. Enter 31.128 in the Min latitude field and 45.275 in the Max latitude field.
6. Enter 129.799 in the Min longitude field and 145.269 in the Max longitude field.
7. Copy the data from the search results, excluding the headers, and paste it into a Minitab worksheet.
8. Change the names of the columns to C1 Date, C2 Time, C3 Lat, C4 Long, C5 Depth, C6 Mag. The other columns, C7 to C13, can then be deleted.
9. The Date column will have copied itself into Minitab as text; to convert this back to a date, navigate to Data | Change Data Type | Text to Date/Time. Enter Date in both the Change text columns: and Store date/time columns in: sections. In the Format of text columns: section, enter yyyy/mm/dd. Click on OK.
10. To extract the week from the Date column, navigate to Data | Date/Time | Extract to Text. Enter 'Date' in the Extract from date/time column: section. Enter 'Week' in the Store text column in: field. Check the box for Week and click on OK.
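Minitab performs the control limit arithmetic behind these recipes for you, but it is useful to see how little is involved. The following Python sketch is purely illustrative and not part of the recipe: it assumes NumPy, subgroups of five measurements (as in the syringe data), and the standard control chart constants for a subgroup size of 5 (A2 = 0.577, D3 = 0, D4 = 2.114); random data stands in for Volume.mtw.

```python
# Sketch of the Xbar-R limit calculations that Minitab performs internally.
# Each row of `data` is one subgroup of n = 5 measurements.
import numpy as np

A2, D3, D4 = 0.577, 0.0, 2.114          # constants for subgroup size n = 5

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=0.2, size=(50, 5))  # stand-in for Volume.mtw

xbar = data.mean(axis=1)                 # subgroup means (Xbar chart points)
r = data.max(axis=1) - data.min(axis=1)  # subgroup ranges (R chart points)
xbar_bar, r_bar = xbar.mean(), r.mean()  # grand mean and mean range

limits = {
    "Xbar chart": (xbar_bar - A2 * r_bar, xbar_bar, xbar_bar + A2 * r_bar),
    "R chart":    (D3 * r_bar, r_bar, D4 * r_bar),
}
for chart, (lcl, cl, ucl) in limits.items():
    print(f"{chart}: LCL={lcl:.3f}  CL={cl:.3f}  UCL={ucl:.3f}")

# Test 1 for special causes: any subgroup mean beyond the Xbar control limits.
lcl, _, ucl = limits["Xbar chart"]
print("Subgroups failing test 1:", np.where((xbar < lcl) | (xbar > ucl))[0] + 1)
```

Staging simply repeats this calculation separately for the subgroups before and after the process change, which is what the Stages tab automates; the Xbar-S chart replaces the range and its constants with the subgroup standard deviation and its own constants.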


Processing Tweets with Apache Hive

Packt
18 Feb 2014
6 min read
Extracting hashtags

In this part and the following one, we'll see how to extract data efficiently from tweets, such as hashtags and emoticons. We need to do this because we want to know what the most discussed topics are and also to get the mood across the tweets; we'll then join that information to get people's sentiments.

We'll start with hashtags. To do so, we need to do the following:

1. Create a new hashtag table.
2. Use a function that will extract the hashtags from the tweet string.
3. Feed the hashtag table with the result of the extract function.

So, there is some bad news and some good news:

- Bad news: Hive provides a lot of built-in user-defined functions, but unfortunately it does not provide any function based on a regex pattern; we need to use a custom user-defined function to do that. This is not all bad, as you will learn how to do it.
- Good news: Hive provides an extremely efficient way to create a Hive table from an array. We'll use the lateral view and the explode Hive UDF to do that.

The following is the Hive-processing workflow that we are going to apply to our tweets (Hive-processing workflow diagram). The steps are basically as follows:

1. Receive the tweets.
2. Detect all the hashtags using the custom Hive user-defined function.
3. Obtain an array of hashtags.
4. Explode it and obtain a lateral view to feed our hashtags table.

This kind of processing is really useful if we want to get a feeling for what the top tweeted topics are, which is most often represented by a word cloud chart (topic word cloud sample).

Let's do this by creating a new CH04_01_HIVE_PROCESSING_HASH_TAGS job under a new Chapter4 folder. This job will contain six components:

- One component to connect to Hive; you can easily copy and paste the connection component from the CH03_03_HIVE_FORMAT_DATA job.

- One tHiveRow to add the custom Hive UDF to the Hive runtime classpath. First, we add the following context variable to our PacktContext group:

    Name: custom_udf_jar
    Value: PATH_TO_THE_JAR (for example, /Users/bahaaldine/here/is/the/jar/extractHashTags.jar)

  This new context variable is just the path to the Hive UDF JAR file provided in the source files. Now we can add the "add jar "+context.custom_udf_jar Hive query to our tHiveRow component to load the JAR file into the classpath when the job is run. We use the add jar query so that Hive will load all the classes in the JAR file when the job starts, as shown in the screenshot (Adding a custom UDF JAR to the Hive classpath).

- One tHiveRow to register the custom UDF in the available UDF catalog after the JAR file has been loaded by the previous component. The custom UDF is a Java class with a bunch of methods that can be invoked from HiveQL code. The custom UDF that we need is located in the org.talend.demo package of the JAR file and is named ExtractPattern. So we simply add the "create temporary function extract_patterns as 'org.talend.demo.ExtractPattern'" configuration to the component. We use the create temporary function query to create a new extract_patterns function in the Hive UDF catalog and give the implementation class contained in our package.

- One tHiveRow to drop the hashtags table if it exists. As we have done in the CH03_02_HIVE_CREATE_TWEET_TABLE job, just add the "DROP TABLE IF EXISTS hash_tags" drop statement to be sure that the table is removed when we relaunch the job.

- One tHiveRow to create the hashtags table. We are going to create a new table to store the hashtags. For simplicity, we'll only store the minimum time and description information, as shown in the following table:

    Column            Type
    hash_tags_id      String
    day_of_week       String
    day_of_month      String
    time              String
    month             String
    hash_tags_label   String

  The essential information here is the hash_tags_label column, which contains the hashtag name. With this knowledge, the following is our create table query:

    CREATE EXTERNAL TABLE hash_tags (
      hash_tags_id string,
      day_of_week string,
      day_of_month string,
      time string,
      month string,
      hash_tags_label string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    LOCATION '/user/"+context.hive_user+"/packt/chp04/hashtags'

- Finally, one tHiveRow component to feed the hashtags table. Here we use all the assets provided by the previous components, as shown in the following query:

    insert into table hash_tags
    select
      concat(formatted_tweets.day_of_week, formatted_tweets.day_of_month,
             formatted_tweets.time, formatted_tweets.month) as hash_id,
      formatted_tweets.day_of_week,
      formatted_tweets.day_of_month,
      formatted_tweets.time,
      formatted_tweets.month,
      hash_tags_label
    from formatted_tweets
    LATERAL VIEW explode(
      extract_patterns(formatted_tweets.content,'#(\\w+)')
    ) hashTable as hash_tags_label

Let's analyze the query from the end to the beginning. The last part of the query uses the extract_patterns function to parse formatted_tweets.content for all hashtags based on the regex #(\w+). In Talend, all strings are Java String objects, which is why we need to escape each backslash; Hive also needs its special characters escaped, which brings us to four backslashes in the end. The extract_patterns function returns an array that we pass to the explode Hive UDF in order to obtain a list of objects. We then pass them to the lateral view statement, which creates a new on-the-fly view called hashTable with one column, hash_tags_label.

Take a breath; we are almost done. If we go one level up, we can see that we select all the required columns for our new hash_tags table, concatenate data to build hash_id, and dynamically select a runtime-built column called hash_tags_label provided by the lateral view. Finally, all the selected data is inserted into the hash_tags table.

We just need to run the job and then, using the following query, check in Hive whether the new table contains our hashtags:

    select * from hash_tags

The following diagram shows the complete hashtag-extraction job structure (Hive processing job).

Summary

By now, you should have a good overview of how to use Apache Hive features with Talend, from the ELT mode to the lateral view, passing by the custom Hive user-defined function. From a use case point of view, we have now reached the step where we need to reveal some added-value data from our Hive-based processing.

Resources for Article:

Further resources on this subject:
- Visualization of Big Data [article]
- Managing Files [article]
- Making Big Data Work for Hadoop and Solr [article]
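The extract_patterns UDF itself is a Java class supplied with the book's source code, but the transformation it performs, combined with the LATERAL VIEW explode step, is easy to prototype outside Hive. The following Python sketch is illustrative only; the sample tweet and the helper names are made up here. It simply shows how one tweet row fans out into one row per #(\w+) match, which is what the Hive query above does at scale.

```python
# Illustrative Python equivalent of the extract_patterns UDF plus the
# LATERAL VIEW explode(...) step: one input row yields one row per hashtag.
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_patterns(content: str, pattern: re.Pattern = HASHTAG_PATTERN) -> list:
    """Return the array of hashtag labels found in a tweet (like the Hive UDF)."""
    return pattern.findall(content)

def explode(rows):
    """Mimic LATERAL VIEW explode: yield one record per hash_tags_label."""
    for row in rows:
        for label in extract_patterns(row["content"]):
            yield {**row, "hash_tags_label": label}

# Hypothetical sample of the formatted_tweets data.
formatted_tweets = [
    {"day_of_week": "Thu", "day_of_month": "20", "time": "15:47", "month": "02",
     "content": "Loving #Hive and #Talend for #BigData processing"},
]

for r in explode(formatted_tweets):
    hash_id = r["day_of_week"] + r["day_of_month"] + r["time"] + r["month"]
    print(hash_id, r["hash_tags_label"])
```

Running this prints three rows for the single sample tweet, mirroring how the insert query builds hash_id and hash_tags_label for every exploded hashtag.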


Query Performance Tuning

Packt
18 Feb 2014
5 min read
Understanding how Analysis Services processes queries

We need to understand what happens inside Analysis Services when a query is run. The two major parts of the Analysis Services engine are:

- The Formula Engine: This part processes MDX queries, works out what data is needed to answer them, requests that data from the Storage Engine, and then performs all calculations needed for the query.
- The Storage Engine: This part handles all the reading and writing of data, for example, during cube processing, and fetches all the data that the Formula Engine requests when a query is run.

When you run an MDX query, then, that query goes first to the Formula Engine, then to the Storage Engine, and then back to the Formula Engine before the results are returned to you.

Performance tuning methodology

When tuning performance, there are certain steps you should follow so that you can measure the effect of any changes you make to your cube, its calculations, or the query you're running:

- Always test your queries in an environment that is identical to your production environment, wherever possible. Otherwise, ensure that the size of the cube and the server hardware you're running on are at least comparable, and that you are running the same build of Analysis Services.
- Make sure that no one else has access to the server you're running your tests on. You won't get reliable results if someone else starts running queries at the same time as you.
- Make sure that the queries you're testing are equivalent to the ones your users want to have tuned. As we'll see, you can use Profiler to capture the exact queries your users are running against the cube.
- Whenever you test a query, run it twice: first on a cold cache, and then on a warm cache. Make sure you keep a note of the time each query takes to run and what you changed on the cube or in the query for that run.

Clearing the cache is a very important step, because queries that run for a long time on a cold cache may be instant on a warm cache. When you run a query against Analysis Services, some or all of the results of that query (and possibly other data in the cube that is not required for the query) will be held in cache, so that the next time a query requests the same data it can be answered from cache much more quickly.

To clear the cache of an Analysis Services database, you need to execute a ClearCache XMLA command. To do this in SQL Management Studio, open a new XMLA query window and enter the following:

    <Batch>
      <ClearCache>
        <Object>
          <DatabaseID>Adventure Works DW 2012</DatabaseID>
        </Object>
      </ClearCache>
    </Batch>

Remember that the ID of a database may not be the same as its name; you can check this by right-clicking on a database in the SQL Management Studio Object Explorer and selecting Properties. Alternatives to this method also exist: the MDX Studio tool allows you to clear the cache with a menu option, and the Analysis Services Stored Procedure Project (http://tinyurl.com/asstoredproc) contains code that allows you to clear the Analysis Services cache and the Windows File System cache directly from MDX. Clearing the Windows File System cache is interesting because it allows you to compare the performance of the cube on a warm and cold file system cache as well as a warm and cold Analysis Services cache. When the Analysis Services cache is cold or can't be used for some reason, a warm filesystem cache can still have a positive impact on query performance.

After the cache has been cleared, before Analysis Services can answer a query it needs to recreate the calculated members, named sets, and other objects defined in a cube's MDX Script. If you have any reasonably complex named set expressions that need to be evaluated, you'll see some activity in Profiler relating to these sets being built, and it's important to be able to distinguish between this and activity related to the queries you're actually running. All MDX Script-related activity occurs between the Execute MDX Script Begin and Execute MDX Script End events; these are fired after the Query Begin event but before the Query Cube Begin event for the query run after the cache has been cleared, and there is one pair of Begin/End events for each command in the MDX Script. When looking at a Profiler trace, you should either ignore everything between the first Execute MDX Script Begin event and the last Execute MDX Script End event, or run a query that returns no data at all to trigger the evaluation of the MDX Script, for example:

    SELECT {} ON 0 FROM [Adventure Works]

Designing for performance

Many of the recommendations for designing cubes will also improve query performance; in fact, the performance of a query is intimately linked to the design of the cube it's running against. For example, dimension design, especially optimizing attribute relationships, can have a significant effect on the performance of all queries, at least as much as any of the optimizations. As a result, we recommend that if you have a poorly performing query, the first thing you should do is review the design of your cube to see if there is anything you could do differently. There may well be some kind of trade-off needed between usability, manageability, time to develop, overall elegance of the design, and query performance, but since query performance is usually the most important consideration for your users, it will take precedence. To put it bluntly, if the queries your users want to run don't run fast, your users will not want to use the cube at all!
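The cold/warm measurement routine described above is easy to wrap in a small harness. The sketch below is not from the book, and run_mdx and clear_cache are hypothetical placeholders that you would wire to whatever client you use to send MDX and XMLA to your server (for example, an ADOMD.NET wrapper) and to the ClearCache batch shown earlier. It simply applies the methodology: clear the cache, trigger MDX Script evaluation with an empty query, then time a cold run and one or more warm runs.

```python
# Minimal cold/warm cache timing harness. The two callables are placeholders:
# wire clear_cache() to the ClearCache XMLA batch shown above and run_mdx()
# to your own MDX client; neither is provided here.
import time
from typing import Callable

def time_query(mdx: str,
               run_mdx: Callable[[str], object],
               clear_cache: Callable[[], None],
               warm_runs: int = 1) -> dict:
    """Return cold and warm execution times (in seconds) for one MDX query."""
    clear_cache()  # start from a cold Analysis Services cache

    # Evaluate the MDX Script first so its cost is not counted in the cold run.
    run_mdx("SELECT {} ON 0 FROM [Adventure Works]")

    start = time.perf_counter()
    run_mdx(mdx)                        # cold-cache run
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):          # warm-cache runs reuse cached results
        start = time.perf_counter()
        run_mdx(mdx)
        warm.append(time.perf_counter() - start)

    return {"cold_s": round(cold, 3), "warm_s": [round(t, 3) for t in warm]}
```

Keep the returned numbers alongside a note of what you changed on the cube or in the query for each run, as the methodology above recommends.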


Sizing and Configuring your Hadoop Cluster

Oli Huggins
16 Feb 2014
10 min read
This article, written by Khaled Tannir, the author of Optimizing Hadoop for MapReduce, discusses two of the most important aspects to consider while optimizing Hadoop for MapReduce: sizing and configuring the Hadoop cluster correctly.

Sizing your Hadoop cluster

Hadoop's performance depends on multiple factors based on well-configured software layers and well-dimensioned hardware resources that utilize its CPU, memory, hard drives (storage I/O), and network bandwidth efficiently. Planning a Hadoop cluster remains a complex task that requires a minimum knowledge of the Hadoop architecture and may be outside the scope of this book. That is what we are trying to make clearer in this section by providing explanations and formulas to help you best estimate your needs. We will introduce a basic guideline that will help you make your decisions while sizing your cluster and answer some planning questions about the cluster's needs, such as the following:

- How to plan my storage?
- How to plan my CPU?
- How to plan my memory?
- How to plan the network bandwidth?

While sizing your Hadoop cluster, you should also consider the data volume that the final users will process on the cluster. The answer to this question will lead you to determine how many machines (nodes) you need in your cluster to process the input data efficiently, and the disk/memory capacity of each one.

Hadoop has a master/slave architecture and is both memory and CPU intensive. Its MapReduce layer has two main components:

- JobTracker: This is the critical component in this architecture and monitors jobs that are running on the cluster
- TaskTracker: This runs tasks on each node of the cluster

To work efficiently, HDFS must have high-throughput hard drives with an underlying filesystem that supports the HDFS read and write pattern (large blocks). This pattern defines one big read (or write) at a time with a block size of 64 MB, 128 MB, up to 256 MB. Also, the network layer should be fast enough to cope with intermediate data transfer and block replication.

HDFS is itself based on a master/slave architecture with two main components: the NameNode / Secondary NameNode and the DataNode. These are critical components and need a lot of memory to store the file's meta information, such as attributes and file localization, directory structure, and names, and to process data. The NameNode ensures that data blocks are properly replicated in the cluster. The second component, the DataNode, manages the state of an HDFS node and interacts with its data blocks; it requires a lot of I/O for processing and data transfer.

Typically, the MapReduce layer has two main prerequisites. The first is that input datasets must be large enough to fill a data block and be split into smaller, independent data chunks (for example, a 10 TB text file can be split into 40,960 blocks of 256 MB each, and each line of text in any data block can be processed independently). The second prerequisite is data locality, which means that the MapReduce code is moved to where the data lies, not the opposite (it is more efficient to move a few megabytes of code close to the data to be processed than to move many data blocks over the network or the disk). This involves having a distributed storage system that exposes data locality and allows the execution of code on any storage node.

Concerning the network bandwidth, it is used at two instances: during the replication process following a file write, and during the balancing of the replication factor when a node fails.

The most common practice for sizing a Hadoop cluster is to size it based on the amount of storage required. The more data in the system, the more machines are required. Each time you add a new node to the cluster, you get more computing resources in addition to the new storage capacity.

Let's consider an example cluster growth plan based on storage and learn how to determine the storage needed, the amount of memory, and the number of DataNodes in the cluster.

Example growth plan:

- Daily data input: 100 GB
- HDFS replication factor: 3
- Storage space used by daily data input = daily data input * replication factor = 300 GB
- Monthly growth: 5%
- Monthly volume = (300 * 30) + 5% = 9450 GB
- After one year = 9450 * (1 + 0.05)^12 = 16971 GB
- Intermediate MapReduce data: 25%
- Non-HDFS reserved space per disk: 30%
- Size of a hard drive disk: 4 TB
- Dedicated space per node = HDD size * (1 - (Non-HDFS reserved space per disk / 100 + Intermediate MapReduce data / 100)) = 4 * (1 - (0.30 + 0.25)) = 1.8 TB (which is the node capacity)

Number of DataNodes needed to process:

- The whole first month's data = 9450 / 1800, approximately 6 nodes
- The 12th month's data = 16971 / 1800, approximately 10 nodes
- The whole year's data = 157938 / 1800, approximately 88 nodes

Do not use RAID array disks on a DataNode; HDFS provides its own replication mechanism. It is also important to note that, for every disk, 30 percent of its capacity should be reserved for non-HDFS use.

It is easy to determine the memory needed for both the NameNode and the Secondary NameNode: the memory needed by the NameNode to manage the HDFS cluster metadata in memory and the memory needed for the OS must be added together. Typically, the memory needed by the Secondary NameNode should be identical to that of the NameNode. Then you can apply the following figures and formula to determine the memory amount:

- NameNode memory: 2 GB - 4 GB
- Secondary NameNode memory: 2 GB - 4 GB
- OS memory: 4 GB - 8 GB
- HDFS memory: 2 GB - 8 GB
- Memory amount = HDFS cluster management memory + NameNode memory + OS memory
- At least NameNode (Secondary NameNode) memory = 2 + 2 + 4 = 8 GB

It is also easy to determine the DataNode memory amount, but this time the amount depends on the number of physical CPU cores installed on each DataNode:

- DataNode process memory: 4 GB - 8 GB
- DataNode TaskTracker memory: 4 GB - 8 GB
- OS memory: 4 GB - 8 GB
- CPU cores: 4+
- Memory per CPU core: 4 GB - 8 GB
- Memory amount = memory per CPU core * number of CPU cores + DataNode process memory + DataNode TaskTracker memory + OS memory
- At least DataNode memory = 4*4 + 4 + 4 + 4 = 28 GB

Regarding the CPU and the network bandwidth, we suggest using modern multicore CPUs with at least four physical cores per CPU. The more physical CPU cores you have, the more you will be able to enhance your jobs' performance (following the rules discussed to avoid underutilization or overutilization). For the network switches, we recommend equipment with high throughput, such as 10 GB Ethernet intra-rack with N x 10 GB Ethernet inter-rack.
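The sizing formulas above are easy to fold into a small script. The following Python sketch is not from the article; its default figures simply restate the example plan (100 GB/day, replication factor 3, 5 percent monthly growth, 4 TB disks with 30 percent non-HDFS reserve and 25 percent intermediate MapReduce data), so you can substitute your own numbers and re-derive the node counts and minimum memory amounts.

```python
# Sketch of the cluster sizing arithmetic above; defaults mirror the example plan.
import math

def monthly_volume_gb(daily_gb=100, replication=3, growth=0.05):
    # Storage used per month with growth applied, as in the article's formula.
    return daily_gb * replication * 30 * (1 + growth)

def node_capacity_tb(disk_tb=4, non_hdfs=0.30, intermediate=0.25):
    # Space on one DataNode actually available to HDFS data.
    return disk_tb * (1 - (non_hdfs + intermediate))

def datanodes_needed(volume_gb, capacity_tb):
    return math.ceil(volume_gb / (capacity_tb * 1000))

def namenode_memory_gb(hdfs_meta=2, namenode=2, os_mem=4):
    return hdfs_meta + namenode + os_mem

def datanode_memory_gb(cores=4, per_core=4, process=4, tasktracker=4, os_mem=4):
    return cores * per_core + process + tasktracker + os_mem

month1 = monthly_volume_gb()                          # ~9450 GB
month12 = month1 * (1 + 0.05) ** 12                   # ~16971 GB, as in the plan
year = sum(month1 * (1 + 0.05) ** m for m in range(1, 13))  # ~157938 GB cumulative
cap = node_capacity_tb()                              # 1.8 TB usable per node

print("DataNodes, first month:", datanodes_needed(month1, cap))   # ~6
print("DataNodes, 12th month:", datanodes_needed(month12, cap))   # ~10
print("DataNodes, whole year:", datanodes_needed(year, cap))      # ~88
print("NameNode memory (GB):", namenode_memory_gb())              # 8
print("DataNode memory (GB):", datanode_memory_gb())              # 28
```

Rerunning the script with your own daily volume, growth rate, and disk size gives a first-order estimate before any hardware is ordered; it does not replace measuring real workloads.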
Configuring your cluster correctly

To run Hadoop and get maximum performance, it needs to be configured correctly. But the question is how to do that. Well, based on our experience, there is not one single answer to this question. Experience gives a clear indication that the Hadoop framework should be adapted to the cluster it is running on, and sometimes also to the job.

In order to configure your cluster correctly, we recommend running a Hadoop job (or jobs) the first time with its default configuration to get a baseline. Then you check for resource weaknesses (if they exist) by analyzing the job history logfiles and recording the results (the measured time it took to run the jobs). After that, you iteratively tune your Hadoop configuration and re-run the job until you get a configuration that fits your business needs.

The number of mapper and reducer tasks that a job should use is important; picking the right number of tasks for a job can have a huge impact on Hadoop's performance. The number of reducer tasks should be less than the number of mapper tasks. Google reports one reducer for 20 mappers; others give different guidelines. This is because mapper tasks often process a lot of data, and the results of those tasks are passed to the reducer tasks; often, a reducer task is just an aggregate function that processes a minor portion of the data compared to the mapper tasks. The correct number of reducers must also be considered. The number of mappers and reducers is related to the number of physical cores on the DataNode, which determines the maximum number of tasks that can run in parallel on a DataNode.

In a Hadoop cluster, master nodes typically consist of machines where one machine is designated as the NameNode and another as the JobTracker, while all other machines in the cluster are slave nodes that act as DataNodes and TaskTrackers. When starting the cluster, you begin by starting the HDFS daemons on the master node and the DataNode daemons on all data node machines. Then you start the MapReduce daemons: the JobTracker on the master node and the TaskTracker daemons on all slave nodes. The following diagram shows the Hadoop daemons' pseudo formula.

When configuring your cluster, you need to consider the CPU cores and memory resources that need to be allocated to these daemons. In a huge data context, it is recommended to reserve two CPU cores on each DataNode for the HDFS and MapReduce daemons, while in a small or medium data context you can reserve only one CPU core on each DataNode. Once you have determined the maximum number of mapper slots, you need to determine the maximum number of reducer slots. Based on our experience, a distribution between the map and reduce tasks on DataNodes that gives good performance is to define the number of reducer slots to be the same as the number of mapper slots, or at least equal to two-thirds of the mapper slots.

Let's learn to correctly configure the number of mappers and reducers, assuming the following cluster examples:

Cluster machine                 Nb    Medium data size      Large data size
DataNode CPU cores              8     Reserve 1 CPU core    Reserve 2 CPU cores
DataNode TaskTracker daemon     1     1                     1
DataNode HDFS daemon            1     1                     1
Data block size                 -     128 MB                256 MB
DataNode CPU % utilization      -     95% to 120%           95% to 150%
Cluster nodes                   -     20                    40
Replication factor              -     2                     3

We want to use the CPU resources at least 95 percent, and due to Hyper-Threading, one CPU core might process more than one job at a time, so we can set the Hyper-Threading factor range between 120 percent and 170 percent.

Maximum mapper slots on one node in a large data context = (number of physical cores - reserved cores) * (0.95 to 1.5)

- Reserved cores = 1 for the TaskTracker + 1 for HDFS
- Let's say the CPU on the node will be used up to 120% (with Hyper-Threading)
- Maximum number of mapper slots = (8 - 2) * 1.2 = 7.2, rounded down to 7

Let's apply the two-thirds mappers/reducers technique:

- Maximum number of reducer slots = 7 * 2/3 = 5

Let's define the number of slots for the cluster:

- Mapper slots = 7 * 40 = 280
- Reducer slots = 5 * 40 = 200

The block size is also used to enhance performance. The default Hadoop configuration uses 64 MB blocks, while we suggest using 128 MB in your configuration for a medium data context and 256 MB for a very large data context. This means that a mapper task can process one data block (for example, 128 MB) by opening only one block. With the default 64 MB block size, two mapper tasks would be needed to process the same amount of data. This may be considered a drawback, because initializing one more mapper task and opening one more file takes more time.

Summary

In this article, we learned about sizing and configuring the Hadoop cluster to optimize it for MapReduce.

Resources for Article:

Further resources on this subject:
- Hadoop Tech Page
- Hadoop and HDInsight in a Heartbeat
- Securing the Hadoop Ecosystem
- Advanced Hadoop MapReduce Administration


Changing the Appearance

Packt
13 Feb 2014
14 min read
Controlling appearance

An ADF Faces application is a modern web application, so the technology used for controlling the look of the application is Cascading Style Sheets (CSS). The idea behind CSS is that the web page (in HTML) should contain only the structure and not information about appearance. All of the visual definitions must be kept in the style sheet, and the HTML file must refer to the style sheet. This means that the same web page can be made to look completely different by applying a different style sheet to it.

The Cascading Style Sheets basics

In order to change the appearance of your application, you need to understand some CSS basics. If you have never worked with CSS before, you should start by reading one of the many CSS tutorials available on the Internet. To start with, let's repeat some of the basics of CSS.

CSS layout instructions are written in the form of rules. Each rule is in the following form:

    selector { property: value; }

The selector identifies which part of the web page the rule applies to, and the property/value pairs define the styling to be applied to the selected parts. For example, the following rule defines that all <h1> elements should be shown in red font:

    h1 { color: red; }

One rule can include multiple selectors separated by commas, and multiple property values separated by semicolons. Therefore, it is also valid CSS to write the following line of code to get all the <h1>, <h2>, and <h3> tags shown in large, red font:

    h1, h2, h3 { color: red; font-size: x-large; }

If you want to apply a style with more precision than just every level 1 header, you define a style class, which is just a selector starting with a period, as shown in the following line of code:

    .important { color: red; font-weight: bold }

To use this selector in your HTML code, you use the keyword class inside an HTML tag. There are three ways of using a style class:

- Inside an existing tag, such as <h1>
- Inside the special <span> tag to style text within a paragraph
- Inside a <div> tag to style a whole paragraph of text

Here are examples of all three ways:

    <h1 class="important">Important topic</h1>
    You <span class="important">must</span> remember this.
    <div class="important">Important tip</div>

In theory, you can place your styling information directly in your HTML document using the <style> tag. In practice, however, you usually place your CSS instructions in a separate .css file and refer to it from your HTML file with a <link> tag, as shown in the following line of code:

    <link href="mystyle.css" rel="stylesheet" type="text/css">

Styling individual components

The preceding examples apply to HTML elements, but styling can also be applied to JSF components. A plain JSF component could look like the following code with inline styling:

    <h:outputFormat value="hello" style="color:red;"/>

It can also look like the following line of code, using a style class:

    <h:outputFormat value="hello" styleClass="important"/>

ADF components use the inlineStyle attribute instead of just style, as shown in the following line of code:

    <af:outputFormat value="hello" inlineStyle="color:red;"/>

The styleClass attribute is the same, as shown in the following line of code:

    <af:outputFormat value="hello" styleClass="important"/>

Of course, you normally won't set these attributes in the source code, but will use the StyleClass and InlineStyle properties in the Property Inspector instead. In both HTML and JSF, you should only use StyleClass, so that multiple components can refer to the same style class and will reflect any change made to the style. InlineStyle is rarely used in real-life ADF applications; it adds to the page size (the same styling is sent for every styled element), and it is almost impossible to ensure that every occurrence is changed when the styling requirements change, as they will.

Building a style

While you are working out the styles you need in your application, you can use the Style section in the JDeveloper Properties window to define the look of your page, as shown in the following screenshot. This section shows six small subtabs with icons for font, background, border/outline, layout, table/list, and media/animation. If you enter or select a value on any of these tabs, the value will be placed into the InlineStyle field as correctly formatted CSS.

When your items look the way you want, copy the value from the InlineStyle field to a style class in your CSS file and set the StyleClass property to point to that class. If the style discussed earlier is the styling you want for a highlighted label, create a section in your CSS file, as shown in the following code:

    .highlight {background-color:blue;}

Then clear the InlineStyle property and set the StyleClass property to highlight. Once you have placed a style class into your CSS file, you can use it to style other components in exactly the same way by simply setting the StyleClass property. We'll be building the actual CSS file where you define these style classes.

InlineStyle and ContentStyle

Some JSF components (for example, outputText) are easy to style: if you set the font color, you'll see it take effect in the JDeveloper design view and in your application, as shown in the following screenshot.

Other elements (for example, inputText) are harder to style. For example, if you want to change the background color of the input field, you might try setting the background color, as shown in the following screenshot. You will notice that this does not work the way you reasonably expected: the background behind both the label and the actual input field changes. The reason for this is that an inputText component actually consists of several HTML elements, and an inline style applies to the outermost element. In this case, the outermost element is an HTML <tr> (table row) tag, so the green background color applies to the entire row.

To help mitigate this problem, ADF offers another styling option for some components: ContentStyle. If you set this property, ADF tries to apply the style to the content of a component; in the case of an inputText, ContentStyle applies to the actual input field, as shown in the following screenshot. In a similar manner, you can apply styling to the label for an element by setting the LabelStyle property.

Unravelling the mysteries of CSS styling

As you saw in the Input Text example, ADF components can be quite complex, and it's not always easy to figure out which element to style to achieve the desired result. To be able to see into the complex HTML that ADF builds for you, you need a support tool such as Firebug. Firebug is a Firefox extension that you can download by navigating to Tools | Add-ons from within Firefox, or you can go to http://getfirebug.com.

When you have installed Firebug, you will see a little Firebug icon at the far right of your Firefox window, as shown in the following screenshot. When you click on the icon to start Firebug, you'll see it take up the lower half of your Firefox browser window.

Only run Firebug when you need it: Firebug's detailed analysis of every page costs processing power and slows your browser down. Run Firebug only when you need it, and remember to deactivate Firebug, not just hide it.

If you click on the Inspect button (with a little blue arrow, second from the left in the Firebug toolbar), you place Firebug in inspect mode. You can now point to any element on a page and see both the HTML element and the style applied to this element, as shown in the following screenshot. In the preceding example, the pointer is placed on the label for an input text, and the Firebug panels show that this element is styled with color: #0000FF. If you scroll down in the Style panel on the right-hand side, you can see other attributes such as font-family: Tahoma, font-size: 11px, and so on.

In order to keep the size of the HTML page smaller so that it loads faster, ADF abbreviates all the style class names to cryptic short names such as .x10. While you are styling your application, you don't want this abbreviation to happen. To turn it off, you need to open the web.xml file (in your View project under Web Content | WEB-INF). Change to the Overview tab if it is not already shown, and select the Application subtab, as shown in the following screenshot. Under Context Initialization Parameters, add a new parameter, as shown:

    Name: org.apache.myfaces.trinidad.DISABLE_CONTENT_COMPRESSION
    Value: true

When you do this, you'll see the full human-readable style names in Firebug, as shown in the following screenshot. You will notice that you now get more readable names such as .af_outputLabel. You might need this information when developing your custom skin.

Conditional formatting

Like many other properties, the style properties do not have to be set to a fixed value; you can also set them to any valid expression written in Expression Language (EL). This can be used to create conditional formatting. In the simplest form, you can use an Expression Language ternary operator, which has the form <boolean expression> ? <value if true> : <value if false>. For example, you could set StyleClass to the following line of code:

    #{bindings.Job.inputValue eq 'MANAGER' ? 'managerStyle' : 'nonManagerStyle'}

The preceding expression means that if the value of the Job attribute is equal to MANAGER, the managerStyle style class is used; if not, the nonManagerStyle style class is used. Of course, this only works if these two styles exist in your CSS.

Skinning overview

An ADF skin is a collection of files that together define the look and feel of the application. To a hunter, skinning is the process of removing the skin from an animal, but to an ADF developer it's the process of putting a skin onto an application. All applications have a skin; if you don't change it, an application built with JDeveloper 12c uses some variation of the skyros skin. When you define a custom skin, you must also choose a parent skin among the skins JDeveloper offers. This parent skin will define the look of all the components not explicitly defined in your skin.

Skinning capabilities

There are a few options to change the style of individual components through their properties. However, with your own custom ADF skin, you can globally change almost every visual aspect of every instance of a specific component. To see skinning in action, you can go to http://jdevadf.oracle.com/adf-richclient-demo. This site is a demonstration of lots of ADF features and components, and if you choose the Skinning header in the accordion to the right, you are presented with a tree of skinnable components, as shown in the following screenshot.

You can click on each component to see a page where you can experiment with various ways of skinning the component. For example, you can select the very common InputText component to see a page with various representations of the input text component. On the left-hand side, you see a number of Style Selectors that are relevant for that component. For each selector, you can check the checkbox to see an example of what the component looks like if you change that selector. In the following example, the af|inputText:disabled::content selector is checked, setting its style to color: #00C0C0, as shown in the following screenshot.

As you might be able to deduce from the af|inputText:disabled::content style selector, this controls what the content field of the input text component looks like when it is set to disabled; in the demo application, it is set to a bluish color with the color code #00C0C0.

The example application shows various values for the selectors but doesn't really explain them. The full documentation of all the selectors can be found online at http://jdevadf.oracle.com/adf-richclient-demo/docs/skin-selectors.html. If it's not there, search for ADF skinning selectors. On the menu in the demo application, you will also find a Skin menu that you can use to select and test all the built-in skins. This application can also be downloaded and run on your own server; it can be found on the ADF download page at http://www.oracle.com/technetwork/developer-tools/adf/downloads/index.html, as shown in the following screenshot.

Skinning recommendations

If your graphics designer has produced sample screens showing what the application must look like, you need to find out which components you will use to implement the required look and define the look of these components in your skin. If you don't have a detailed guideline from a graphics designer, look for guidelines in your organization; you probably have web design guidelines for your public-facing website and/or intranet. If you don't have any graphics guidelines, create a skin, as described later in this section, and choose to inherit from the latest skyros skin provided by Oracle. However, don't change anything; leave the CSS file empty. If you are a programmer, you are unlikely to be able to improve on the look that the professional graphics designers at Oracle in Redwood Shores have created.

The skinning process

The skinning process in ADF consists of the following steps:

1. Create a skin CSS file.
2. Optionally, provide images for your skin.
3. Optionally, create a resource bundle for your skin.
4. Package the skin in an ADF library.
5. Import and use the skin in the application.

In JDeveloper 11g Release 1 (11.1.1.x) and earlier versions, this was very much a manual process. Fortunately, from 11g Release 2 onwards, JDeveloper has a built-in skinning editor.

Stand-alone skinning

If you are running JDeveloper 11g Release 1, don't despair. Oracle makes a stand-alone skinning editor available, containing the same functionality that is built into the later versions of JDeveloper. You can give this tool to your graphic designers and let them build the skin ADF library without having to give them the complete JDeveloper product. ADF skinning is a huge topic, and Oracle delivers a whole manual that describes skinning in complete detail. This document is called Creating Skins with Oracle ADF Skin Editor and can be found at http://docs.oracle.com/middleware/1212/skineditor/ADFSG.

Creating a skin project

You should place your skin in a common workspace. If you are using the modular architecture, you create your skin in the application common workspace. If you are using the enterprise architecture, you create your enterprise common skin in the application common workspace and then possibly create application-specific adaptations of that skin in the application common workspaces.

The skin should be placed in its own project in the common workspace. Logically, it could be placed in the common UI workspace, but because the skin will often receive many small changes during certain phases of the project, it makes sense to keep it separate. Remember that changing the skin only affects the visual aspects of the application, while changing the page templates could conceivably change the functionality. By keeping the skin in a separate ADF library, you can be sure that you do not need to perform regression testing on the application functionality after deploying a new skin.

To create your skin, open the common workspace and navigate to File | New | Project. Choose the ADF ViewController project, give the project the name CommonSkin, and set the package to your application's package prefix followed by .skin (for example, com.dmcsol.xdm.skin).

Securing QlikView Documents

Packt
16 Jan 2014
6 min read
(For more resources related to this topic, see here.) Making documents available to the correct users can be handled in several different ways, depending on your implementation and license structure. These methods are not mutually exclusive and you may choose to implement a combination of them. By license If you only have named Document users, you can restrict access to documents by simply not granting users a license. If users do not have a Document license for a particular document, they may be able to see that document in AccessPoint, but will not be able to open it. You will need to turn off any automatic allocation of licenses for both Document licenses and Named User licenses, or the system will simply override your security by allocating an available license and giving the user access to that document. This only works for Document license users. The Named User license holders can't be locked out of a document this way as they have a license that allows them to open any number of documents, so they cannot be restricted. The fact that this is user based—a Document license can only be granted to a user, not a group—also means that there is no option to secure by a named group. This is the most basic, least flexible, and least user-friendly way to implement security. While it will certainly stop users getting access to documents—and it will work in either NTFS or DMS security modes—it can be frustrating for users to see a document that they think can open, but for which they will get a NO CAL error when they try to open it. The QlikView file will also need to have appropriate NTFS or DMS security so that users would be able to access it. The easiest way to do this is to grant access to a group that all the users will be in, or even allow access to an Authenticated Users group. Section Access Section Access security is a very effective way of securing a document to the correct set of users. This is because a user must be actually listed in the Section Access user list for the document to be even listed in AccessPoint for them. Additionally, if Section Access is in place, a user cannot even connect by using a direct access URL because they have no security access to the data. This method of securing documents works well in both NTFS and DMS security modes. When using the NTLM (Windows authentication via Internet Explorer) authentication method, you can have Group Names listed in Section Access. However, when using alternative authentication, Section Access does not give us an option to secure by group. As with the license method discussed earlier, appropriate file security needs to be in place in order to allow all the users access the QlikView file. NTFS Access Control List (ACL) NTFS (Microsoft's NT File System) security is the default method of securing access to files in a QlikView implementation. It works very well for installations where all the users are Windows users within the same domain or a set of trusted domains. In NTFS security mode, the Access Control List (ACL) of the QlikView file is used to list the documents for a particular user. This is a very straightforward way of securing access and will be very familiar to Windows system administrators. As with normal Windows file security, the security can be applied at the folder level. Windows security groups can also be used. Between groups and folder security, very flexible levels of security can be applied. 
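As a purely illustrative sketch of the folder-level approach (the drive, folder, and group names below are invented), granting a Windows security group read access to a folder of QlikView documents from an elevated command prompt might look like this:

rem Example only: give a reader group read access, inherited by subfolders and files
icacls "D:\QlikView\UserDocuments\Finance" /grant "MYDOMAIN\QV-Finance-Readers:(OI)(CI)R"

Combined with group membership, this is how the flexible folder-level security described above is typically expressed.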
By default, Internet Explorer and Google Chrome will pass through the Kerberos/NTLM credentials to sites in the local Intranet zone. For other browsers, such as Safari on the iPad, the user will be prompted for a username and password. When a user connects to AccessPoint and their credentials are established, they are compared against the ACLs for all the files hosted by QVS. Only those files that the user has access to—either directly by name or by group membership—will be listed in AccessPoint. Document Metadata Service (DMS) For non-Windows users, QlikView provides a way of managing user access to files called the Document Metadata Service (DMS). DMS uses a .META file in the same folder as the .QVW file to store the Access Control List. The Windows ACL, which has permissions on the file, now becomes mostly irrelevant as it is not used to authenticate users. It is only the QlikView service account that will need access to the file. It is a binary choice between using NTFS or DMS security on any one QlikView Server. Enabling DMS To enable DMS, we need to make a change to the server configuration. In the QlikView Management Console, on the Security tab of the QVS settings screen, we change Authorization to DMS authorization and then click on the Apply button. The QlikView Server service will need to be restarted for this change to take effect. Once the service has restarted, a new tab, Authorization, becomes available in the document properties: Clicking on the + button to the right of this tab allows you to enter new details of Access, User Type, and specific Users and Groups. Access is either set to Always or Restricted. When Access is set to Always, the associated user or group will have access at any time. If it is set to Restricted, you can specify a time range and specific days when the user or group has access. You can keep clicking on the + button to add as many sets of restricted times as needed for a user or group. The restrictions are additive; that is, if the user only has access on Monday and Tuesday in one group of restrictions, and then Thursday and Friday in another set of restrictions, they will therefore, have access on all four days. The User Type is one of the following categories: User Type Details All Users Essentially no security. Any user, including anonymous, who can access the server will be able to access the file. All Authenticated Users For most implementations, this will also be All Users. However, it will not give access to anonymous users. The Section Access would typically be used to manage the security. Named Users This allows you to specify a list of named users and/or groups that will have specific access to the document. If Named Users is selected, a Manage Users button will appear that allows you to specify users and/or groups. Summary In this article, we have looked at several ways of securing QlikView Documents—by license, using Section Access, utilizing NTFS ACLs, and implementing QlikView's DMS authorization. Resources for Article: Further resources on this subject: Common QlikView script errors [Article] Introducing QlikView elements [Article] Meet QlikView [Article]

Aspects of Data Manipulation in R

Packt
10 Jan 2014
6 min read
(For more resources related to this topic, see here.) Factor variables in R In any data analysis task, the majority of the time is dedicated to data cleaning and pre-processing. Sometimes, it is considered that about 80 percent of the effort is devoted for data cleaning before conducting actual analysis. Also, in real world data, we often work with categorical variables. A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, this is known as a factor. Working with categorical variables in R is a bit technical, and in this article we have tried to demystify this process of dealing with categorical variables. During data analysis, sometimes the factor variable plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, it inherits all its levels from the original factor levels. This feature sometimes creates confusion in understanding the results. Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. Split-apply-combine strategy Data manipulation is an integral part of data cleaning and analysis. For large data it is always preferable to perform the operation within subgroup of a dataset to speed up the process. In R this type of data manipulation could be done with base functionality, but for large-scale data it requires considerable amount of coding and eventually it takes a longer time to process. In case of Big Data we could split the dataset, perform the manipulation or analysis, and then again combine the results into a single output. This type of split using base R is not efficient and to overcome this limitation, Wickham developed an R package plyr where he efficiently implemented the split-apply-combine strategy. Often, we require similar types of operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single piece of output. To understand the split-apply-combine strategy intuitively, we could compare this with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply and the reduce step consists of combining. The map-reduce approach is primarily designed to deal with a highly parallel environment where the work has been done by several hundreds or thousands of computers independently. The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. This strategy can be used in many existing tools, such as the GROUP BY operation in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator. Reshaping a dataset Reshaping data is a common and tedious task in real-life data manipulation and analysis. 
A dataset might come with different levels of grouping and we need to implement some reorientation to perform certain types of analyses. Datasets layout could be long or wide. In long-layout, multiple rows represent a single subject's record, whereas in wide-layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this article, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This article mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient. A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset and a column represents the information for each characteristic for all subjects. This means rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier and others are measured characteristics. For the purpose of reshaping, we could group the variables into two groups: identifier variables and measured variables. Identifier variables: These help to identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables. Measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mix of both. Now beyond this typical structure of dataset, we could think differently, where we will have only identification variables and a value. The identification variable identifies a subject along with which measured variable the value represents. In this new paradigm, each row represents one observation of one variable. In the new paradigm this is termed as melting and it produces molten data. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. Summary This article explains briefly about factor variables, the split-apply-combine strategy, and reshaping a dataset in R. Resources for Article: Further resources on this subject: SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article] Working with Data Application Components in SQL Server 2008 R2 [Article] Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]
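To make these ideas concrete, here is a small self-contained R sketch (the data values and cut points are invented for illustration) showing cut for turning a numeric variable into a factor, plyr's split-apply-combine style ddply, and melt/dcast from reshape2 for moving between wide and long layouts:

# Creating a factor from a numeric variable with cut()
age <- c(12, 25, 37, 48, 61)
age_group <- cut(age, breaks = c(0, 18, 40, 65),
                 labels = c("young", "adult", "senior"))
table(age_group)

# Reshaping with reshape2: wide -> long (melt), long -> wide (dcast)
library(reshape2)
wide <- data.frame(id = 1:3, height = c(160, 175, 182), weight = c(55, 70, 80))
long <- melt(wide, id.vars = "id")      # molten data: id, variable, value
back_to_wide <- dcast(long, id ~ variable)

# Split-apply-combine with plyr: mean value per measured variable
library(plyr)
ddply(long, "variable", summarise, mean_value = mean(value))

Note how the molten data frame contains only the identifier column plus the variable and value columns, which is exactly the long layout described above.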

Meet QlikView

Packt
02 Jan 2014
8 min read
(For more resources related to this topic, see here.) What is QlikView? QlikView is a software tool developed by QlikTech, a company that was founded in Sweden in 1993 but is now headquartered in the United States. QlikView is a tool used for Business Intelligence, commonly abbreviated as BI. Business Intelligence is defined by Gartner, a leading industry analyst firm, as: A general term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decision making and performance. Following this definition, QlikView is a tool that provides access to information and makes it possible to analyze that data, which in turn improves and optimizes business decision making and, as a consequence, business performance. Historically, Business Intelligence has been driven mainly by companies' Information Technology departments. IT departments were responsible for the entire life cycle of a Business Intelligence solution, from extracting the data to delivering the final reports, analyses, and dashboards. Although this model works well for distributing predefined static reports, most companies have come to realize that it does not meet the needs of their business users. Because IT keeps close control over the data and the tools, users often experience long waiting times whenever new business questions come up that cannot be answered with the standard reports. How does QlikView differ from traditional BI tools? QlikTech prides itself on approaching Business Intelligence in a way that differs from what companies such as Oracle, SAP, and IBM (described by QlikTech as traditional BI vendors) offer. QlikTech aims to put the tools in the hands of business users, allowing them to be self-sufficient, since they can then perform their own analyses. Independent industry analyst firms have also taken note of this different approach. In 2011, Gartner created a subcategory for Data Discovery tools in its annual market evaluation, the Magic Quadrant for Business Intelligence platforms. QlikView was the standard-bearer in this new category of BI tools. QlikTech prefers to describe its product as a Business Discovery tool rather than a Data Discovery tool. It maintains that discovering things about the business is far more important than discovering data. The following diagram illustrates this paradigm. Source: QlikTech Besides the difference in who uses the tool (IT users versus business users), there are some other capabilities that set QlikView apart from other solutions. Associative user experience The main difference between QlikView and other BI solutions is the associative user experience. Whereas traditional BI solutions use predefined paths to navigate and explore data, QlikView lets users take whatever route they want to carry out their analysis. The result is a much more intuitive way of navigating the data. QlikTech describes this as "working the way the human mind works". An example is shown in the following image.
While in a typical BI solution we would have to start by selecting a Region and then step through the defined hierarchical path, in QlikView we can choose any entry point we want: Region, State, Product, or Salesperson. As we navigate the data, we are presented only with the information related to our selection and, for our next selection, we can take whatever path we want. Navigation is infinitely flexible. In addition, the QlikView user interface lets us see the data that is associated with our selection. For example, the following screenshot (from the QlikTech demo document called What's New in QlikView 11) shows a QlikView dashboard in which two values are selected. In the Quarter field, the value Q3 is selected, and in the Sales Reps field, Cart Lynch is selected. We can see this because the corresponding values are shown in green, which means that those values have been selected. When a selection is made, the interface automatically updates not only to show the data that is associated with this new query, but also the data that is not associated with the selection. Associated data appears with a white background, while non-associated data has a gray background. Sometimes the associations can be fairly obvious; it is no surprise that the third quarter of the year is associated with the months of July, August, and September. At other times, however, we come across associations that are not so obvious, for example, that Cart Lynch has not sold any products in Germany or Spain. This extra information, which is not visible in traditional BI tools, can be of great value because it offers a new starting point for data exploration. Technology QlikView's main technological differentiator is that it uses an in-memory data model, meaning that all the information the user interacts with is stored in RAM instead of on disk. Because RAM is much faster than disk, response times are very fast, which makes for a very fluid user experience. In a later section of this chapter, we will look a little more closely at the technology behind QlikView. Adoption There is another difference between QlikView and other traditional BI solutions, and it lies in the way QlikView is implemented within a company. While traditional BI solutions are typically implemented top-down (IT selects a BI tool for the whole company), QlikView usually takes a bottom-up adoption path. Business users in a single department deploy it locally, and its use spreads from there. QlikView can be downloaded free of charge for personal use. This version is called QlikView Personal Edition, or PE. Documents created in QlikView Personal Edition can be opened by users with a full license of the software or published via QlikView Server. The limitation is that, apart from some documents enabled by QlikTech for PE, a QlikView Personal Edition user cannot open documents created by another user or on another computer; sometimes they cannot even open their own documents if those documents were opened and saved by another user or by a server instance.
Often, a business user will decide to download QlikView to see whether it can solve a business problem. When other users in the department see the software, they become increasingly enthusiastic about the tool, and each of them downloads the program. To be able to share documents, they decide to purchase a few licenses for the department. Then other departments start to take notice as well, and QlikView gains traction within the organization. Before long, IT and company management also take notice, which eventually leads to QlikView being adopted across the whole enterprise. QlikView facilitates every step of this process, scaling from a deployment on a single personal computer up to organization-wide deployments with thousands of users. The following image illustrates this growth within an organization: As QlikView's popularity and track record within the organization grow, it gains more and more visibility at the enterprise level. Although the adoption path described above is probably the most common scenario, it is no longer unusual for a company to opt for a top-down, enterprise-wide QlikView implementation from the outset.

Using Redis in a hostile environment (Advanced)

Packt
27 Dec 2013
12 min read
(For more resources related to this topic, see here.) How to do it... Anyone who can read the files that Redis uses to persist your dataset has a full copy of all your data. Worse, anyone who can write to those files can, with a minimal amount of effort and some patience, change the data that your Redis server contains. Both of these things are probably not what you want, and thankfully it isn't particularly difficult to prevent. All you have to do is prevent anyone but the user running your Redis server from accessing the data directory your Redis instance is using. The simplest way to achieve this is by changing the owner of the directory to the user who runs your Redis server, and then disallow all privileges to everyone else, like this: Determine the user under whom you are running your Redis instance. You can typically find this out by running ps caux |grep redis-server. The name in the first column is the user under which Redis is running. Determine the directory in which Redis is storing its files. If you don't already know this, you can ask Redis by running CONFIG GET dir from within redis-cli. Ensure that the user running your Redis instance owns its data directory: chown <redisuser> /path/to/redis/datadir Restrict permissions on the data directory so that only the owner can access it at all: chmod 0700 /path/to/redis/datadir It is important that you protect the Redis data directory, and not individual data files, because Redis is regularly rewriting those data files, and the permissions you choose won't necessarily be preserved on the next rewrite. It is also a good practice to restrict access to your redis.conf file, because in some cases it can contain sensitive data. This is simply achieved: chmod 0600 /path/to/redis.conf If you run your Redis using applications on a server which is shared with other people, your Redis instance is at pretty serious risk of abuse. The most common way of connecting to Redis is via TCP, which can only limit access based on the address connecting to it. On a shared server, that address is shared amongst everyone using it, so anyone else on the same server as you can connect to your Redis. Not cool! If, however, the programs that need to access your Redis server are on the same machine as the Redis server, there is another, more secure, method of connection called Unix sockets. A Unix socket looks like a file on disk, and you can control its permissions just like a file, but Redis can listen on it (and clients can connect to it), in a very similar way to a TCP socket. Enabling Redis to listen on a Unix socket is fairly straightforward: Set the port parameter to 0 in your Redis configuration file. This will tell Redis to not listen on a TCP socket. This is very important to prevent miscreants from still being able to connect to your Redis server while you're happily using a Unix socket. Set the unixsocket parameter in your Redis configuration file to a fully-qualified filename where you want the socket to exist. If your Redis server runs as the same user as your client programs (which is common in shared-hosting situations), I recommend making the name of the file redis.sock, in the same directory as your Redis dataset. So, if you keep your Redis data in /home/joe/redis, set unixsocket to /home/joe/redis/redis.sock. Set the unixsocketperm parameter in your Redis configuration file to 600, or a more relaxed permission set if you know what you're doing. Again, this assumes that your Redis server and Redis-using programs are running as the same user. 
If they're not, you'll probably need a dedicated group and things get a lot more complicated—and beyond the scope of what can be covered in this guide. Once you've changed those configuration parameters and restarted Redis, you should find that the file you specified for unixsocket has magically appeared, and you can no longer connect to Redis using TCP. All that remains to do now is to configure your Redis-using programs to connect using the Unix socket, which is something you should find how to do in the manual for your particular Redis client library or application. Configuring Redis to use Unix sockets is all well and good when it's practical, but what about if you need to connect to Redis over a network? In that case, you'll need to let Redis listen on a TCP socket, but you should at least limit the computers that can connect to it with a suitable firewall configuration. While the properly paranoid systems administrator runs their systems with a default deny firewalling policy, not everyone shares this philosophy. However, given that by default, anyone who can connect to your Redis server can do anything they want with it, you should definitely configure a firewall on your Redis servers to limit incoming TCP connections to those which are coming from machines that have a legitimate need to talk to your Redis server. While it won't protect you from all attacks, it will cut down significantly on the attack surface, which is an important part of a defense-in-depth security strategy. Unfortunately, it is hard to give precise commands to configure a firewall ruleset, because there are so many firewall management tools in common use on systems today. In the interest of addressing the greatest common factor, though, I'll provide a set of Linux iptables rules, which should be translatable to whatever means of managing your firewall (whether it be an iptables wrapper of some sort on Linux, or a pf-based system on a BSD). In all of the following commands, replace the word <port> with the TCP port that your Redis server listens on. Also, note that these commands will temporarily stop all traffic to your Redis instance, so you'll want to avoid doing this on a live server. Setting up your firewall in an init script is the best course of action. Insert a rule that will drop all traffic to your Redis server port by default: iptables -I INPUT -p tcp --dport <port> -j DROP For each IP address you want to allow to connect, run these two commands to let the traffic in: iptables -I INPUT -p tcp --dport <port> -s <clientIP> -j ACCEPT iptables -I OUTPUT -p tcp --sport <port> -d <clientIP> -j ACCEPT A firewall is great, but sometimes you can't trust everyone with access to a machine that needs to talk to your Redis instance. In that case, you can use authentication to provide a limited amount of protection against miscreants: Select a very strong password. Redis is not hardened against repeated password guessing, so you want to make this very long and very random. If you make the password too short, an attacker can just write a program that tries every possible password very quickly and guess your password that way, not cool! Thankfully, since humans should rarely be typing this password, it can be a complete jumble, and very long. I like the command pwgen -sy 32 1 for all my "generating very strong password" needs. 
Configure all your clients to authenticate against the server, by sending the following command when they first connect to the server: AUTH <password> Edit your Redis configuration file to include a line like this: requirepass "\\:d!&!:Y<p'TXBI0\"ys96rfH]lxaA7|E" If your selected password contains any double-quotes, you'll need to escape them with a backslash (so " would become \"), as I've done in the preceding example. You'll also need to double any actual backslashes (so \ becomes \\), again as I've done in the password of the preceding example. Let the configuration changes take effect by restarting Redis. The authentication password cannot be changed at runtime. If you don't need certain commands, or want to limit the use of certain commands to a subset of clients, you can use the rename-command configuration parameter. Like firewalling, restricting or disabling commands reduces your attack surface, but is not a panacea. The simplest solution to the risk of a dangerous command is to disable it. For example, if you want to stop anyone from accidentally (or deliberately) nuking all the data in your Redis server with a single command, you might decide to disable the FLUSHDB and FLUSHALL commands, by putting the following in your Redis config file:
rename-command FLUSHDB ""
rename-command FLUSHALL ""
This doesn't stop someone from enumerating all the keys in your dataset with KEYS * and then deleting them all one-by-one, but it does raise the bar somewhat. If you never wanted to delete keys (but, say, only let them expire) you could disable the DEL command; although all that would probably do is encourage the wily cracker to enumerate all your keys and run PEXPIRE 1 over them. Arms races are a terrible thing... While disabling commands entirely is great when it can be done, you sometimes need a particular command, but you'd prefer not to give access to it to absolutely everyone—commands that can cause serious problems if misused, such as CONFIG. For those cases, you can rename the command to something hard to guess, as shown in the following line: rename-command CONFIG somegiantstringnobodywouldguess It's important not to make the new name of the command something easy to guess. As with the AUTH command, which we discussed previously, someone who wanted to do bad things could easily write a program to repeatedly guess what you've renamed your commands to. For any environment in which you can't trust the network (which these days is pretty much everywhere, thanks to the NSA and the Cloud), it is important to consider the possibility of someone watching all your data as it goes over the wire. There's little point configuring authentication, or renaming commands, if an attacker can watch all your data and commands flow back and forth. The least-worst option we have for generically securing network traffic from eavesdropping is still the Secure Sockets Layer (SSL). Redis doesn't support SSL natively; however, through the magic of the stunnel program, we can create a secure tunnel between Redis clients and servers. The setup we will build will look like the following diagram: In order to set this up, you'll need to do the following: In your redis.conf, ensure that Redis is only listening on 127.0.0.1, by setting the bind parameter: bind 127.0.0.1 Create a private key and certificate, which stunnel will use to secure the network communications.
First, create a private key and a certificate request, by running: openssl req -out /etc/ssl/redis.csr -keyout /etc/ssl/redis.key -nodes -newkey rsa:2048 This will ask you all sorts of questions which you can answer with whatever you like. Create the self-signed certificate itself, by running: openssl x509 -req -days 3650 -signkey /etc/ssl/redis.key -in /etc/ssl/redis.csr -out /etc/ssl/redis.crt Finally, stunnel expects to find the private key and the certificate in the same file, so we'll concatenate the two together into one file: cat /etc/ssl/redis.key /etc/ssl/redis.crt >/etc/ssl/redis.pem Now, we've got our SSL keys, we can start stunnel on the server side, configuring it to listen out for SSL connections, and forward them to our local Redis server: stunnel -d 46379 -r 6379 -p /etc/ssl/redis.pem If your local Redis instance isn't listening on port 6379, or if you'd like to change the public port that stunnel listens on, you can, of course, adjust the preceding command line to suit. Also, don't forget to open up your firewall for the port you're listening on! Once you run the preceding command, you should be returned to a command line pretty quickly, because stunnel runs in the background. Although you examine your listening ports with netstat -ltn, you will still find that port 46379 is listening. If that's the case, you're done configuring the server. On the client(s), the process is somewhat simpler, because you don't have to create a whole new key pair. However, you do need the certificate from the server, because you want to be able to verify that you're connecting to the right SSL-enabled service. There's little point in using SSL if an attacker can just set up a fake SSL service and trick you into connecting to it. To set up the client, do the following: Copy /etc/ssl/redis.crt from the server to the same location on the client. Start stunnel on the client, as shown in the following code snippet: stunnel -c -v 3 -A /etc/ssl/redis.crt -d 127.0.0.1:56379 -r 192.0.2.42:46379 Replace 192.0.2.42 with the IP address of your Redis server. Verify that stunnel is listening correctly by running netstat -ltn, and look for something listening on port 56379. Reconfigure your client to connect to 127.0.0.1:56379, rather than directly to the remote Redis server. Summary This article contains an assortment of quick enhancements that you can deploy to your systems to protect them from various threats, which are frequently encountered on the Internet today. Resources for Article: Further resources on this subject: Implementing persistence in Redis (Intermediate) [Article] Python Text Processing with NLTK: Storing Frequency Distributions in Redis [Article] Coding for the Real-time Web [Article]

Implementing the Naïve Bayes classifier in Mahout

Packt
26 Dec 2013
21 min read
(for more resources related to this topic, see here.) Bayes was a Presbyterian priest who died giving his "Tractatus Logicus" to the prints in 1795. The interesting fact is that we had to wait a whole century for the Boolean calculus before Bayes' work came to light in the scientific community. The corpus of Bayes' study was conditional probability. Without entering too much into mathematical theory, we define conditional probability as the probability of an event that depends on the outcome of another event. In this article, we are dealing with a particular type of algorithm, a classifier algorithm. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table: Outlook Temperature Temperature Humidity Humidity Windy Play Numeric Nominal Numeric Nominal Overcast 83 Hot 86 High FALSE Yes Overcast 64 Cool 65 Normal TRUE Yes Overcast 72 Mild 90 High TRUE Yes Overcast 81 Hot 75 Normal FALSE Yes Rainy 70 Mild 96 High FALSE Yes Rainy 68 Cool 80 Normal FALSE Yes Rainy 65 Cool 70 Normal TRUE No Rainy 75 Mild 80 Normal FALSE Yes Rainy 71 Mild 91 High TRUE No Sunny 85 Hot 85 High FALSE No Sunny 80 Hot 90 High TRUE No Sunny 72 Mild 95 High FALSE No Sunny 69 Cool 70 Normal FALSE Yes Sunny 75 Mild 70 Normal TRUE Yes The table itself is composed of a set of 14 observations consisting of 7 different categories: temperature (numeric), temperature (nominal), humidity (numeric), and so on. The classifier takes some of the observations to train the algorithm and some as testing it, to create a decision for a new observation that is not contained in the original dataset. There are many types of classifiers that can do this kind of job. The classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier uses the assumption that the fact, on observation, belongs to a particular category and is independent from belonging to any other category. Other types of classifiers present in Mahout are the logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information. This page is updated with the algorithm type, actual integration in Mahout, and other useful information. Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses the conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated to the other weight. In this article's recipes, we will use the same algorithm provided by the Mahout example source code that uses the Naïve Bayes classifier to find the relation between works of a set of documents. Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to instruct the algorithm on the relation it needs to find. The testing set is used to test the algorithm using some unrelated input. Let us now get a first-hand taste of how to use the Naïve Bayes classifier. 
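For reference, the standard textbook statement of conditional probability and of the naïve independence assumption that gives the classifier its name can be written as follows (this formulation is added here for clarity; it is not taken verbatim from the recipe):

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

Applied to the weather table above, the classifier would compute this product for Play = Yes and for Play = No, given a new observation's attribute values, and assign the observation to whichever category scores higher.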
Using the Mahout text classifier to demonstrate the basic use case The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout dataset. We will use this dataset for testing or coding. Basically, the code is nothing more than following the Mahout ready-to-use script with the corrected parameter and the path settings done. This recipe will describe how to transform the raw text files into weight vectors that are needed by the Naïve Bayes algorithm to create the model. The steps involved are the following: Converting the raw text file into a sequence file Creating vector files from the sequence files Creating our working vectors Getting ready The first step is to download the datasets. The dataset is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2. The dataset contains a post of 20 newsgroups dumped in a text file for the purpose of machine learning. Anyway, we could have also used other documents for testing purposes, but we will suggest how to do this later in the recipe. Before proceeding, in the command line, we need to set up the working folder where we decompress the original archive to have shorter commands when we need to insert the full path of the folder. In our case, the working folder is /mnt/new; so, our working folder's command-line variables will be set using the following command: export WORK_DIR=/mnt/new/ You can create a new folder and change the WORK_DIR bash variable accordingly. Do not forget that to have these examples running, you need to run the various commands with a user that has the HADOOP_HOME and MAHOUT_HOME variables in its path. To download the dataset, we only need to open up a terminal console and give the following command: wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz Once your working dataset is downloaded, decompress it using the following command: tar –xvzf 20news-bydate.tar.gz You should see the folder structure as shown in the following screenshot: The second step is to sequence the whole input file to transform them into Hadoop sequence files. To do this, you need to transform the two folders into a single one. However, this is only a pedagogical passage, but if you have multiple files containing the input texts, you could parse them separately by invoking the command multiple times. Using the console command, we can group them together as a whole by giving the following command in sequence: rm -rf ${WORK_DIR}/20news-all mkdir ${WORK_DIR}/20news-all cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all Now, we should have our input folder, which is the 20news-all folder, ready to be used: The following screenshot shows a bunch of files, all in the same folder: By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows: From: xxx Subject: yyyyy Organization: zzzz X-Newsreader: rusnews v1.02 Lines: 50 jaeger@xxx (xxx) writes: >In article xxx writes: >>zzzz "How BCCI adapted the Koran rules of banking". The >>Times. August 13, 1991. > > So, let's see. If some guy writes a piece with a title that implies > something is the case then it must be so, is that it? We obviously removed the e-mail address, but you can open this file to see its content. 
For any newsgroup of 20 news items that are present on the dataset, we have a number of files, each of them containing a single post to a newsgroup without categorization. Following our initial tasks, we need to now transform all these files into Hadoop sequence files. To do this, you need to just type the following command: ./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq This command brings every file contained in the 20news-all folder and transforms them into a sequence file. As you can see, the number of corresponding sequence files is not one to one with the number of input files. In our case, the generated sequence files from the original 15417 text files are just one chunck-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command: ./mahout seqdirectory --help The following table describes the various options that can be used with the seqdirectory command: Parameter Description --input (-i) input his gives the path to the job input directory. --output (-o) output The directory pathname for the output. --overwrite (-ow) If present, overwrite the output directory before running the job. --method (-xm) method The execution method to use: sequential or mapreduce. The default is mapreduce. --chunkSize (-chunk) chunkSize The chunkSize values in megabyte. The default is 64 Mb. --fileFilterClass (-filter) fileFilterClass The name of the class to use for file parsing.The default is org.apache.mahout.text.PrefixAdditionFilter. --keyPrefix (-prefix) keyPrefix The prefix to be prepended to the key of the sequence file. --charset (-c) charset The name of the character encoding of the input files.The default is UTF-8. --overwrite (-ow) If present, overwrite the output directory before running the job. --help (-h) Prints the help menu to the command console. --tempDir tempDir If specified, tells Mahout to use this as a temporary folder. --startPhase startPhase Defines the first phase that needs to be run. --endPhase endPhase Defines the last phase that needs to be run To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunck-0 file, you could type in the following command: hadoop fs -text $WORK_DIR/20news-seq/chunck-0 | more In our case, the result is as follows: /67399 From:xxx Subject: Re: Imake-TeX: looking for beta testers Organization: CS Department, Dortmund University, Germany Lines: 59 Distribution: world NNTP-Posting-Host: tommy.informatik.uni-dortmund.de In article <xxxxx>, yyy writes: |> As I announced at the X Technical Conference in January, I would like |> to |> make Imake-TeX, the Imake support for using the TeX typesetting system, |> publically available. Currently Imake-TeX is in beta test here at the |> computer science department of Dortmund University, and I am looking |> for |> some more beta testers, preferably with different TeX and Imake |> installations. The Hadoop command is pretty simple, and the syntax is as follows: hadoop fs –text <input file> In the preceding syntax, <input file> is the sequence file whose content you will see. Our sequence files have been created, and until now, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weighted vector associated to the original document. 
So now, we need to transform the raw text into vectors of weights and frequency. To do this, we type in the following command: ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf The following command parameters are described briefly: The -lnorm parameter instructs the vector to use the L_2 norm as a distance The -nv parameter is an optional parameter that outputs the vector as namedVector The -wt parameter instructs which weight function needs to be used We end the data-preparation process with this step. Now, we have the weight vector files that are created and ready to be used by the Naïve Bayes algorithm. We will clear a little while this last step algorithm. This part is about tuning the algorithm for better performance of the Naïve Bayes classifier. How to do it… Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier. To avoid this, you need to divide the vector files into two sets called the 80-20 split. This is a good data-mining approach because if you have any algorithm that should be instructed on a dataset, you should divide the whole bunch of data into two sets: one for training and one for testing your algorithm. A good dividing percentage is shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing ones should be the remaining 20 percent. To split data, we use the following command: ./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential As result of this command, we will have two new folders containing the training and testing vectors. Now, it is time to train our Naïves Bayes algorithm on the training set of vectors, and the command that is used is pretty easy: ./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows: ./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex\ -ow -o ${WORK_DIR}/20news-testing The following screenshot shows the output of the preceding command: How it works... We have given certain commands and we have seen the outcome, but you've done this without an understanding of why we did it and above all, why we chose certain parameters. The whole sequence could be meaningless, even for an experienced coder. Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps: Data preparation Data training Data testing In general, these are the three procedures for mining data that should be followed. The data preparation steps involve all the operations that are needed to create the dataset in the format that is required for the data mining procedure. In this case, we know that the original format was a bunch of files containing text, and we transformed them into a sequence file format. The main purpose of this is to have a format that can be handled by the map reducing algorithm. This phase is a general one as the input format is not ready to be used as it is in most cases. 
Sometimes, we also need to merge some data if they are divided into different sources. Sometimes, we also need to use Sqoop for extracting data from different datasources. Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data mining tasks, and we bring some of them to train our model. In our case, we are trying to classify if a document can be inserted in a certain category based on the frequency of some terms in it. This will lead to a classifier that using another document can state if this document is under a previously found category. The output is a function that is able to determinate this association. Next, we need to evaluate this function because it is possible that one good classification in the learning phase is not so good when using a different document. This three-phased approach is essential in all classification tasks. The main difference relies on the type of classifier to be used in the training and testing phase. In this case, we use Naïve Bayes, but other classifiers can be used as well. In the Mahout framework, the available classifiers are Naïve Bayes, Decision Forest, and Logistic Regression. As we have seen, the data preparation consists basically of creating two series of files that will be used for training and testing purposes. The step to transform the raw text file into a Hadoop sequence format is pretty easy; so, we won't spend too long on it. But the next step is the most important one during data preparation. Let us recall it: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf This computational step basically grabs the whole text from the chunck-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways: The -i parameter is used to declare the input folder where all the sequence files are stored The -o parameter is used to create the output folder where the vector containing the weights is stored The -nv parameter tells Mahout that the output format should be in the namedVector format The -wt parameter tells which frequency function to use for evaluating the weight of every term to a category The -lnorm parameter is a function used to normalize the weights using the L_2 distance The -ow: parameter overwrites the previously generated output results The -m: parameter gives the minimum log-likelihood ratio The whole purpose of this computation step is to transform the sequence files that contain the documents' raw text in the sequence files containing vectors that count the frequency of the term. Obviously, there are some different functions that count the frequency of a term within the whole set of documents. So, in Mahout, the possible values for the wt parameter are tf and tfidf. The Tf value is the simpler one and counts the frequency of the term. This means that the frequency of the Wi term inside the set of documents is the ratio between the total occurrence of the word over the total number of words. The second one considers the sum of every term frequency using a logarithmic function like this one: In the preceding formula, Wi is the TF-IDF weight of the word indexed by i. N is the total number of documents. DFi is the frequency of the i word in all the documents. 
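The formula referred to in the preceding paragraph did not survive the page layout; a standard TF-IDF weighting that is consistent with the definitions just given is the following (the exact smoothing Mahout applies internally may differ slightly):

W_i = TF_i \cdot \log\left(\frac{N}{DF_i}\right)

where TF_i is the plain term frequency of word i, as described for the tf option.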
In this preprocessing phase, we notice that we index the whole corpus of documents so that we are sure that even if we divide or split in the next phase, the documents are not affected. We compute a word frequency; this means that the word was contained in the training or testing set. So, the reader should grasp the fact that changing this parameter can affect the final weight vectors; so, based on the same text, we could have very different outcomes. The lnorm value basically means that while the weight can be a number ranging from 0 to an upper positive integer, they are normalized to 1 as the maximum possible weight for a word inside the frequency range. The following screenshot shows the output of the output folder: Various folders are created for storing the word count, frequency, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text. Then, from every text, it extracts the categories and the words. The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows: mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/ part-r-00000 –o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump The dictionary files and word files are sequence files containing the association within the unique key/word created by the MapReduce algorithm using the command: hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more one can see for example adrenal_gland 12912 adrenaline 12913 adrenaline.com 12914| The splitting of the dataset into training and testing is done by using the split command-line option of Mahout. The interesting parameter in this case is that randomSelectionPct equals 40. It uses a random selection to evaluate which point belongs to the training or the testing dataset. Now comes the interesting part. We are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder that contains the model in the form of a binary file. This file represents the Naïve Bayes model that holds the weight Matrix, the feature and label sums, and the weight normalizer vectors generated so far. Now that we have the model, we test it on the training set. The outcome is directly shown on the command line in terms of a confusion matrix. The following screenshot shows the format in which we can see our result. Finally, we test our classifier on the test vector generated by the split instruction. The output in this case is a confusion matrix. Its format is as shown in the following screenshot: We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances that tell us how many sentences have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on a test set of weighted vectors, we have nearly 90 percent of the corrected classified sentences against an error of 9 percent. But if we go through the matrix row by row, we can see at the end that we have different newsgroups. So, a is equal to alt.atheism and b is equal to comp.graphics. So, a first look at the detailed confusion matrix tells us that we did the best in classification against the rec.sport.hockey newsgroup, with a value of 418 that is the highest we have. 
If we take a look at the corresponding row, we understand that of these 418 classified sentences, we have 403/412; so, 97 percent of all of the sentences were found in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.miscwe newsgroup, we can see overall performance is low. The sentences are not so centered around the same new newsgroup; so, it means that we find and classify the sentences in ms-windows in another newsgroup, and so we do not have a good classification. This is reasonable as sports terms like "hockey" are really limited to the hockey world, while sentences about Microsoft could be found both on Microsoft specific newsgroups and in other newsgroups. We encourage you to give another run to the testing phase on the training phase to see the output of the confusion matrix by giving the following command: ./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing As you can see, the input folder is the same for the training phase, and in this case, we have the following confusion matrix: In this case, we can see it using the same set both as the training and testing phase. The first consequence is that we have a rise in the correctly classified sentences by an order of 10 percent, which is even bigger if you remember that in terms of weighted vectors with respect to the testing phase, we have a size that is four times greater. But probably the most important thing is that the best classification has now moved from the hockey newsgroup to the sci.electronics newsgroup. There's more We use exactly the same procedure used by the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that starting all process need only to change the input files from the initial folder. So, for the willing reader, we suggest you download another raw text file and perform all the steps in another type of file to see the changes that we have compared to the initial input text. We would suggest that non-native English readers also look at the differences that we have by changing the initial input set with one not written in English. Since the whole text is transformed using only weight vectors, the outcome does not depend on the difference between languages but only on the probability of finding certain word couples. As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the vector sparse weights. This could be easily done by changing, for example, the -wt tfidf parameter into the command line Mahout seq2sparce. So, for example, an alternative run of the seq2sparce Mahout could be the following one: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf Finally, we not only choose to run the Naïve Bayes classifier for classifying words in a text document but also the algorithm that uses vectors of weights so that, for example, it would be easy to create your own vector weights.
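For convenience, the commands used throughout this recipe can be collected into a single shell script; the sketch below simply restates the recipe's own commands in order, with the paths written out in full (adjust WORK_DIR and run it from the Mahout bin directory, as in the recipe):

#!/bin/bash
# End-to-end sketch of the 20 newsgroups Naive Bayes recipe
export WORK_DIR=/mnt/new

# 1. Prepare the input folder
rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

# 2. Raw text -> Hadoop sequence files
./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

# 3. Sequence files -> TF-IDF weight vectors
./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

# 4. Split into training and testing vectors
./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
  --trainingOutput ${WORK_DIR}/20news-train-vectors \
  --testOutput ${WORK_DIR}/20news-test-vectors \
  --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

# 5. Train the Naive Bayes model
./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model \
  -li ${WORK_DIR}/labelindex -ow

# 6. Test the model on the held-out vectors
./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model \
  -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing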

article-image-using-faceted-search-searching-finding
Packt
24 Dec 2013
11 min read
Save for later

Using Faceted Search, from Searching to Finding

Packt
24 Dec 2013
11 min read
(For more resources related to this topic, see here.) Looking at Solr's standard query parameters The basic engine of Solr is Lucene, so Solr accepts a query syntax based on the Lucene one, even if there are some minor differences, they should not affect our experiments, as they involve more advanced behavior. You can find an explanation on the Solr Query syntax on wiki at: http://wiki.apache.org/solr/SolrQuerySyntax. Let's see some example of a query using the basic parameters. Before starting our tests, we need to configure a new core again, in the usual way. Sending Solr's query parameters over HTTP It is important to take care of the fact that our queries to Solr are sent over the HTTP protocol (unless we are using Solr in embedded mode, as we will see later). With cURL we can handle the HTTP encoding of parameters, for example: >> curl -X POST 'http://localhost:8983/solr/paintings/select?start=3&rows=2&fq=painting&wt=json&indent=true' --data-urlencode 'q=leonardo da vinci&fl=artist title' This command can be instead of the following command: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=leonardo%20da%20vinci&fq=painting&start=3&row=2&fl=artist%20title&wt=json&indent=true" Please note how using the --data-urlencode parameter in the example we can write the parameters values including characters which needs to be encoded over HTTP. Testing HTTP parameters on browsers On modern browsers such as Firefox or Chrome you can look at the parameters directly into the provided console. For example using Chrome you can open the console (with F12): In the previous image you can see under Query String Parameters section on the right that the parameters are showed on a list, and we can easily switch between the encoded and the more readable un-encoded value's version. If don't like using Chrome or Firefox and want a similar tool, you can try the Firebug lite (http://getfirebug.com/firebuglite). This is a JavaScript library conceived to port firebug plugin functionality ideally to every browser, by adding this library to your HTML page during the test process. Choosing a format for the output When sending a query to Solr directly (by the browser or cURL) we can ask for results in multiple formats, including for example JSON: >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=json&indent=true' Time for action – searching all documents with pagination When performing a query we need to remember we are potentially asking for a huge number of documents. Let's observe how to manage partial results using pagination: For example think about the q=*:* query as seen in previous examples which was used for asking all the documents, without a specific criteria. In a case like this, in order to avoid problems with resources, Solr will send us actually only the first ones, as defined by a parameter in the configuration. The default number of returned results will be 10, so we need to be able to ask for a second group of results, and a third, and so on and on until there are. This is what is generally called a pagination of results, similarly as for scenarios involving SQL. 
Executing the command: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=0&rows=0&wt=json&indent=true" We should obtain a result similar to this (the number of documents numFound and the time spent for processing query QTime could vary, depending on your data and your system): In the previous image we see the same results in two different ways: on the right side you'll recognize the output from cURL and on the left side of the browser you see how the results directly in the browser window. In the second example we had the Json View plugin installed in the browser, which gives a very helpful visualization of JSON, with indentation and colors. You can install it if you want for Chrome at: https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc For Firefox the plugin can be installed from: https://addons.mozilla.org/it/firefox/addon/jsonview/ Note how even if we have found 12484 documents, we are currently seeing none of them in the results! What just happened? In this very simple example, we already use two very useful parameters: start and rows, which we should always think as a couple, even if we may be using only one of them explicitly. We could change the default values for these parameters from the solrconfig.xml file, but this is generally not needed: The start value defines the original index of the first document returned in the response, from the ones matching our search criteria, and starting from value 0. The default value will again start at 0. The rows parameter is used to define how many documents we want in the results. The default value will be 10 for rows. So if for example we only want the second and third document from the results, we can obtain them by the query: >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=1 &rows=2&wt=json&indent=true' In order to obtain the second document in the results we need to remember that the enumeration starts from 0 (so the second will be at 1), while to see the next group of documents (if present), we will send a new query with values such as, start=10, rows=10, and so on. We are still using the wt and indent parameters only to have results formatted in a clear way. The start/rows parameters play roles in this context which are quite similar to the OFFSET/LIMIT clause in SQL. This process of segmenting the output to be able to read it in group or pages of results is usually called pagination, and it is generally handled by some programming code. You should know this mechanism, so you could play with your test even on a small segment of data without a loss of generalization. I strongly suggest you to always add these two parameters explicitly in your examples. Time for action – projecting fields with fl Another important parameter to consider is fl, that can be used for fields projection, obtaining only certain fields in the results: Suppose now that we are interested on obtaining the titles and artist reference for all the documents: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title,artist' We will obtain an output similar to the one shown in the following image: Note that the results will be indented as requested, and will not contain any header to be more readable. Moreover the parameters list does not need to be written in a specific order. 
The previous query could be rewritten also: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title&fl=artist' Here we ask for field projection one by one, if needed (for example when using HTML and JavaScript widget to compose the query following user's choices). What just happened? The fl parameter stands for fields list. By using this parameter we can define a comma-separated list of fields names that explicitly define what fields are projected in the results. We can also use a space to separate fields, but in this case we should use the URL encoding for the space, writing fl=title+artist or fl=title%20artist. If you are familiar with relational databases and SQL, you should consider the fl parameter. It is similar to the SELECT clause in SQL statements, used to project the selected fields in the results. In a similar way writing fl=author:artist,title corresponds to the usage of aliases for example, SELECT artist AS author, title. Let's see the full list of parameters in details: The parameter q=artist:* is used in this case in place of a more generic q=*:*, to select only the fields which have a value for the field artist. The special character * is used again for indicating all the values. The wt=json, indent=true parameters are used for asking for an indented JSON format. The omitHeader=true parameter is used for omit the header from the response. The fl=title,artist parameter represents the list of the fields to be projected for the results. Note how the fields are projected in the results without using the order asked in fl, as this has no particular sense for JSON output. This order will be used for the CSV response writer that we will see later, however, where changing the columns order could be mandatory. In addition to the existing field, which can be added by using the * special character, we could also ask for the projection of the implicit score field. A composition of these two options could be seen in the following query: >>curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=*,score' This will return every field for every document, including the score field explicitly, which is sometimes called a pseudo-field, to distinguish it from the field defined by a schema. Time for action – selecting documents with filter query Sometimes it's useful to be able to narrow the collection of documents on which we are currently performing our search. It is useful to add some kind of explicit linked condition on the logical side for navigation on data, and will also have good impact on performances too. It is shown in the following example: It shows how the default search is restricted by the introduction of a fq=annunciation condition. What just happened? The first result in this simple example shows that we obtain results similar to what we could have obtained by a simple q=annunciation search. Filtered query can be cached (as well as facets, that we will see later), improving performance by reducing the overhead of performing the same query many times, and accessing documents of large datasets to the same group many times. In this case the analogy with SQL seems less convincing, but q=dali and fq=abstract:painting can be seen corresponding to WHERE conditions in SQL. The fq parameters will then be a fixed condition. In our scenario, we could define for example specific endpoints with pre-defined filter query by author, to create specific channels. 
In this case, instead of passing the parameters every time, we could set them in solrconfig.xml (a sketch of such a configuration is given at the end of this section).

Time for action – searching for similar terms with fuzzy search

Even if wildcard queries are very flexible, sometimes they simply cannot give us good results. There could be some weird typo in the term, and we still want to obtain good results wherever possible, under certain confidence conditions: If we want to search for painting but actually type plainthing, for example:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:plainthing~0.5&wt=json'

Suppose we have a person using a different language, who searched for leonardo by misspelling the name:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:lionardo~0.5&wt=json'

In both cases the examples use misspelled words to be more recognizable, but the same syntax can be used to match existing similar words.

What just happened?

Both the preceding examples work as expected. The first gives us documents containing the term painting, the second gives us documents containing leonardo instead. Note that the syntax plainthing~0.5 represents a query that matches with a certain confidence, so for example we will also obtain occurrences of documents with the term paintings, which is good; but in a more general case we could receive weird results. In order to properly set up the confidence value there are not many options, apart from doing tests. Using fuzzy search is a simple way to obtain a suggested result for alternate forms of a search query, just like when we trust a search engine's similar suggestions in the did you mean approach.
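To illustrate the earlier remark about fixing parameters in solrconfig.xml instead of passing them on every request, here is a minimal sketch of a dedicated request handler; the handler name /paintings-only and the filter value are invented for this example and are not part of the original configuration:

<requestHandler name="/paintings-only" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <int name="rows">10</int>
  </lst>
  <!-- this filter query is appended to every request sent to this handler -->
  <lst name="appends">
    <str name="fq">abstract:painting</str>
  </lst>
</requestHandler>

Every query sent to this handler is then restricted to painting documents, without the client having to pass fq explicitly.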

article-image-learning-option-pricing
Packt
20 Dec 2013
19 min read
Save for later

Learning Option Pricing

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.)

Introduction to options

Options come in two variants, puts and calls. The call option gives the owner of the option the right, but not the obligation, to buy the underlying asset at the strike price. The put gives the holder of the contract the right, but not the obligation, to sell the underlying asset. The Black-Scholes formula describes the European option, which can only be exercised on the maturity date, in contrast to, for example, American options. The buyer of the option pays a premium for this, to cover the risk taken by the counterparty. Options have become very popular and they are traded on the major exchanges throughout the world, covering most asset classes. The theory behind options can become complex pretty quickly. In this article we'll look at the basics of options and how to explore them using code written in F#.

Looking into contract specifications

Options come in a wide number of variations; some of them will be covered briefly below. The contract specifications for options will also depend on their type. Generally there are some properties that are more or less common to all of them. The general specifications are as follows: side, quantity, strike price, expiration date, and settlement terms. The contract specifications, or known variables, are used when we valuate options.

European options
European options are the basic form of options from which the other variants derive; American options and exotic options are some examples. We'll stick to European options in this article.

American options
American options are options that may be exercised on any trading day on or before expiry.

Exotic options
Exotic options are any of the broad category of options that may include complex financial structures and may be combinations of other instruments as well.

Learning about Wiener processes

Wiener processes are closely related to stochastic differential equations and volatility. A geometric Brownian motion, which is built on the Wiener process, is defined by dS_t = μS_t dt + σS_t dW_t. The formula describes the change in the stock price, or underlying, with a drift, μ, a volatility, σ, and the Wiener process, Wt. This process is used to model the prices in Black-Scholes. We'll simulate market data using a Brownian motion, or Wiener process, implemented in F# as a sequence. Sequences can be infinite and only the values used are evaluated, which suits our needs. We'll implement a generator function to generate the Wiener process as a sequence as follows:

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)
let T = 1.0
let N = 500.0
let dt:float = T / N
/// Sequences represent infinite number of elements
// s -> scaling factor
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s)}
    loop s;;

Here we use the random function in normd.Sample(). Let's explain the parameters and the theory behind Brownian motion before looking at the implementation. The parameter T is the time used to create a discrete time increment dt. Notice that dt assumes there are N = 500 items in the sequence; this is of course not always the case, but it will do fine here. Next, we use recursion to create the sequence, where we add an increment to the previous value (x + ...), where x corresponds to x_{t-1}. We can easily generate an arbitrary length of the sequence:

> Seq.take 50 (W 55.00);;
val it : seq<float> = seq [55.0; 56.72907873; 56.96071054; 58.72850048; ...]

Here we create a sequence of length 50.
Let's plot the sequence to get a better understanding of the process.

A Wiener process generated from the sequence generator above.

Next we'll look at the code used to generate the graph in the figure above.

open System
open System.Net
open System.Windows.Forms
open System.Windows.Forms.DataVisualization.Charting
open Microsoft.FSharp.Control.WebExtensions
open MathNet.Numerics.Distributions;

// A normally distributed random generator
let normd = new Normal(0.0, 1.0)

// Create chart and form
let chart = new Chart(Dock = DockStyle.Fill)
let area = new ChartArea("Main")
chart.ChartAreas.Add(area)
let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500)
do mainForm.Text <- "Wiener process in F#"
mainForm.Controls.Add(chart)

// Create series for stock price
let wienerProcess = new Series("process")
do wienerProcess.ChartType <- SeriesChartType.Line
do wienerProcess.BorderWidth <- 2
do wienerProcess.Color <- Drawing.Color.Red
chart.Series.Add(wienerProcess)

let random = new System.Random()
let rnd() = random.NextDouble()
let T = 1.0
let N = 500.0
let dt:float = T / N

/// Sequences represent infinite number of elements
let W s =
    let rec loop x = seq { yield x; yield! loop (x + sqrt(dt)*normd.Sample()*s)}
    loop s;;

do (Seq.take 100 (W 55.00)) |> Seq.iter (wienerProcess.Points.Add >> ignore)

Most of the code will be familiar to you at this stage, but the interesting part is the last line, where we can simply feed a chosen number of elements from the sequence into Seq.iter, which will plot the values; elegant and efficient.

Learning the Black-Scholes formula

The Black-Scholes formula was developed by Fischer Black and Myron Scholes in the 1970s. The Black-Scholes formula is a stochastic partial differential equation, which estimates the price of an option. The main idea behind the formula is the delta-neutral portfolio: Black and Scholes created a theoretical delta-neutral portfolio to reduce the uncertainty involved. This was a necessary step to be able to arrive at the analytical formula, which we'll cover in this section. Below are the assumptions made under Black-Scholes:

No arbitrage
Possible to borrow money at a constant risk-free interest rate (throughout the holding of the option)
Possible to buy, sell, and short fractional amounts of the underlying asset
No transaction costs
Price of the underlying follows a Brownian motion with constant drift and volatility
No dividends paid from the underlying security

The simplest of the two variants is the one for call options. First the stock price is scaled using the cumulative distribution function with d1 as a parameter. Then the stock price is reduced by the discounted strike price scaled by the cumulative distribution function of d2. In other words, it's the difference between the stock price and the strike, using probability scaling of each and discounting the strike price. The formula for the put is a little more involved, but follows the same principles. The Black-Scholes formula is often separated into parts, where d1 and d2 are the probability factors describing the probability of the stock price being related to the strike price.
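For reference, the closed-form expressions just described can be written as follows; the notation is consistent with the F# implementation later in this article (S is the stock price, K the strike, and N the cumulative distribution function):

C = S\,N(d_1) - K e^{-rT} N(d_2)
P = K e^{-rT} N(-d_2) - S\,N(-d_1)
d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}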
The parameters used in the formula above can be summarized as follows: N – The cumulative distribution function T - Time to maturity, expressed in years S – The stock price, or other underlying K – The strike price r – The risk free interest rate σ – The volatility of the underlying Implementing Black-Scholes in F# Now that we've looked at the basics behind the Black-Scholes formula, and the parameters involved, we can implement it ourselves. The cumulative distribution function is implemented here to avoid dependencies and to illustrate that it's quite simple to implement it yourself too. Below is the Black-Scholes implemented in F#. It takes six arguments; the first is a call-put-flag that determines if it's a call or put option. The constants a1 to a5 are the Taylor series coefficients used in the approximation for the numerical implementation. let pow x n = exp(n * log(x)) type PutCallFlag = Put | Call /// Cumulative distribution function let cnd x = let a1 = 0.31938153 let a2 = -0.356563782 let a3 = 1.781477937 let a4 = -1.821255978 let a5 = 1.330274429 let pi = 3.141592654 let l = abs(x) let k = 1.0 / (1.0 + 0.2316419 * l) let w = (1.0-1.0/sqrt(2.0*pi)*exp(-l*l/2.0)*(a1*k+a2*k*k+a3*(pow k 3.0)+a4*(pow k 4.0)+a5*(pow k 5.0))) if x < 0.0 then 1.0 - w else w /// Black-Scholes // call_put_flag: Put | Call // s: stock price // x: strike price of option // t: time to expiration in years // r: risk free interest rate // v: volatility let black_scholes call_put_flag s x t r v = let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t)) let d2=d1-v*sqrt(t) //let res = ref 0.0 match call_put_flag with | Put -> x*exp(-r*t)*cnd(-d2)-s*cnd(-d1) | Call -> s*cnd(d1)-x*exp(-r*t)*cnd(d2) Let's use the black_scholes function using some various numbers for call and put options. Suppose we want to know the price of an option, where the underlying is a stock traded at $58.60 with an annual volatility of 30%. The risk free interest rate is, let's say, 1%. Then we can use our formula, we defined previously to get the theoretical price according the Black-Scholes formula of a call option with 6 month to maturity (0.5 years): > black_scholes Call 58.60 60.0 0.5 0.01 0.3;; val it : float = 4.465202269 And the value for the put option, just by changing the flag to the function: > black_scholes Put 58.60 60.0 0.5 0.01 0.3;; val it : float = 5.565951021 Sometimes it's more convenient to express the time to maturity in number of days, instead of years. Let's introduce a helper function for that purpose. /// Convert the nr of days to years let days_to_years d = (float d) / 365.25 Note the number 365.25 which includes the factor for leap years. This is not necessary in our examples, but used for correctness. We can now use this function instead, when we know the time in days. > days_to_years 30;; val it : float = 0.08213552361 Let's use the same example above, but now with 20 days to maturity. > black_scholes Call 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 1.065115482 > black_scholes Put 58.60 60.0 (days_to_years 20) 0.01 0.3;; val it : float = 2.432270266 Using Black-Scholes together with Charts Sometimes it's useful to be able to plot the price of an option until expiration. We can use our previously defined functions and vary the time left and plot the values coming out. In this example we'll make a program that outputs the graph seen below. 
Chart showing prices for call and put option as function of time /// Plot price of option as function of time left to maturity #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option price as a function of time" mainForm.Controls.Add(chart) /// Create series for call option price let optionPriceCall = new Series("Call option price") do optionPriceCall.ChartType <- SeriesChartType.Line do optionPriceCall.BorderWidth <- 2 do optionPriceCall.Color <- Drawing.Color.Red chart.Series.Add(optionPriceCall) /// Create series for put option price let optionPricePut = new Series("Put option price") do optionPricePut.ChartType <- SeriesChartType.Line do optionPricePut.BorderWidth <- 2 do optionPricePut.Color <- Drawing.Color.Blue chart.Series.Add(optionPricePut) /// Calculate and plot call option prices let opc = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Call 58.60 60.0 x 0.01 0.3] do opc |> Seq.iter (optionPriceCall.Points.Add >> ignore) /// Calculate and plot put option prices let opp = [for x in [(days_to_years 20)..(-(days_to_years 1))..0.0]do yield black_scholes Put 58.60 60.0 x 0.01 0.3] do opp |> Seq.iter (optionPricePut.Points.Add >> ignore) The code is just a modified version of the code seen in the previous article, with the options parts added. We have two series in this chart, one for call options and one for put options. We also add a legend for each of the series. The last part is the calculation of the prices and the actual plotting. List comprehensions are used for compact code, and the Black-Scholes formula is called for everyday until expiration, where the days are counted down by one day at each step. It's up to you as a reader to modify the code to plot various aspects of the option, such as the option price as a function of an increase in the underlying stock price etc. Introducing the greeks The greeks are partial derivatives of the Black-Scholes formula, with respect to a particular parameter such as time, rate, volatility or stock price. The greeks can be divided into two or more categories, with respect to the order of the derivatives. Below we'll look at the first and second order greeks. First order greeks In this section we'll present the first order greeks using the table below. Name Symbol Description Delta Δ Rate of change of option value with respect to change in the price of the underlying asset. Vega ν Rate of change of option value with respect to change in the volatility of the underlying asset. Referred to as the volatility sensitivity. Theta Θ Rate of change of option value with respect to time. The sensitivity with respect to time will decay as time elapses, phenomenon referred to as the "time decay." Rho ρ Rate of change of option value with respect to the interest rate. Second order greeks In this section we'll present the second order greeks using the table below. Name Symbol Description Gamma Γ Rate of change of delta with respect to change in the price of the underlying asset. Veta - Rate of change in Vega with respect to time. Vera - Rate of change in Rho with respect to volatility. 
Some of the second-order greeks are omitted for clarity; we'll not cover these in this book.

Implementing the greeks in F#

Let's implement the greeks: Delta, Gamma, Vega, Theta, and Rho. First we look at the formulas for each greek. In some cases they vary for calls and puts respectively. We need the derivative of the cumulative distribution function, which in fact is the normal distribution with zero mean and a standard deviation of one:

/// Normal distribution
open MathNet.Numerics.Distributions;
let normd = new Normal(0.0, 1.0)

Delta
Delta is the rate of change of the option price with respect to change in the price of the underlying asset.

/// Black-Scholes Delta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_delta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    match call_put_flag with
    | Put -> cnd(d1) - 1.0
    | Call -> cnd(d1)

Gamma
Gamma is the rate of change of delta with respect to change in the price of the underlying asset. This is the second derivative with respect to the price of the underlying asset. It measures the acceleration of the price of the option with respect to the underlying price.

/// Black-Scholes Gamma
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_gamma s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    normd.Density(d1) / (s*v*sqrt(t))

Vega
Vega is the rate of change of option value with respect to change in the volatility of the underlying asset. It is referred to as the volatility sensitivity.

/// Black-Scholes Vega
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_vega s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    s*normd.Density(d1)*sqrt(t)

Theta
Theta is the rate of change of option value with respect to time. The sensitivity with respect to time will decay as time elapses, a phenomenon referred to as the "time decay".

/// Black-Scholes Theta
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_theta call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))+r*x*exp(-r*t)*cnd(-d2)
    | Call -> -(s*normd.Density(d1)*v)/(2.0*sqrt(t))-r*x*exp(-r*t)*cnd(d2)

Rho
Rho is the rate of change of option value with respect to the interest rate.

/// Black-Scholes Rho
// call_put_flag: Put | Call
// s: stock price
// x: strike price of option
// t: time to expiration in years
// r: risk free interest rate
// v: volatility
let black_scholes_rho call_put_flag s x t r v =
    let d1=(log(s / x) + (r+v*v*0.5)*t)/(v*sqrt(t))
    let d2=d1-v*sqrt(t)
    match call_put_flag with
    | Put -> -x*t*exp(-r*t)*cnd(-d2)
    | Call -> x*t*exp(-r*t)*cnd(d2)

Investigating the sensitivity of the greeks

Now that we have all the greeks implemented, we'll investigate the sensitivity of some of them and see how they vary when the underlying stock price changes. The figure below is a surface plot of four of the greeks where time and the underlying price are changing. The figure below is generated in MATLAB, and will not be generated in F#.
We’ll use a 2D version of the graph to study the greeks below. Surface plot of Delta, Gamma, Theta and Rho of a call option. In this section we'll start by plotting the value of Delta for a call option where we vary the price of the underlying. This will result in the following 2D plot: A plot of call option delta versus price of underlying The result in the plot seen in figure above will be generated by the code presented next. We'll reuse most of the code from the example where we looked at the option prices for calls and puts. A slightly modified version is presented here, where the price of the underlying varies from $10.0 to $70.0. /// Plot delta of call option as function of underlying price #r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Calculate and plot call delta let opc = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opc |> Seq.iter (optionDeltaCall.Points.Add >> ignore) We can extend the code to plot all four greeks, as in the figure with the surface plots, but here in 2D. The result will be a graph like seen in the figure below. Graph showing the for Greeks for a call option with respect to price change (x-axis). Code listing for visualizing the four greeks Below is the code listing for the entire program used to create the graph above. 
#r "System.Windows.Forms.DataVisualization.dll" open System open System.Net open System.Windows.Forms open System.Windows.Forms.DataVisualization.Charting open Microsoft.FSharp.Control.WebExtensions /// Create chart and form let chart = new Chart(Dock = DockStyle.Fill) let area = new ChartArea("Main") chart.ChartAreas.Add(area) chart.Legends.Add(new Legend()) let mainForm = new Form(Visible = true, TopMost = true, Width = 700, Height = 500) do mainForm.Text <- "Option delta as a function of underlying price" mainForm.Controls.Add(chart) We’ll create one series for each greek: /// Create series for call option delta let optionDeltaCall = new Series("Call option delta") do optionDeltaCall.ChartType <- SeriesChartType.Line do optionDeltaCall.BorderWidth <- 2 do optionDeltaCall.Color <- Drawing.Color.Red chart.Series.Add(optionDeltaCall) /// Create series for call option gamma let optionGammaCall = new Series("Call option gamma") do optionGammaCall.ChartType <- SeriesChartType.Line do optionGammaCall.BorderWidth <- 2 do optionGammaCall.Color <- Drawing.Color.Blue chart.Series.Add(optionGammaCall) /// Create series for call option theta let optionThetaCall = new Series("Call option theta") do optionThetaCall.ChartType <- SeriesChartType.Line do optionThetaCall.BorderWidth <- 2 do optionThetaCall.Color <- Drawing.Color.Green chart.Series.Add(optionThetaCall) /// Create series for call option vega let optionVegaCall = new Series("Call option vega") do optionVegaCall.ChartType <- SeriesChartType.Line do optionVegaCall.BorderWidth <- 2 do optionVegaCall.Color <- Drawing.Color.Purple chart.Series.Add(optionVegaCall) Next, we’ll calculate the values to plot for each greek: /// Calculate and plot call delta let opd = [for x in [10.0..1.0..70.0] do yield black_scholes_delta Call x 60.0 0.5 0.01 0.3] do opd |> Seq.iter (optionDeltaCall.Points.Add >> ignore) /// Calculate and plot call gamma let opg = [for x in [10.0..1.0..70.0] do yield black_scholes_gamma x 60.0 0.5 0.01 0.3] do opg |> Seq.iter (optionGammaCall.Points.Add >> ignore) /// Calculate and plot call theta let opt = [for x in [10.0..1.0..70.0] do yield black_scholes_theta Call x 60.0 0.5 0.01 0.3] do opt |> Seq.iter (optionThetaCall.Points.Add >> ignore) /// Calculate and plot call vega let opv = [for x in [10.0..1.0..70.0] do yield black_scholes_vega x 60.0 0.1 0.01 0.3] do opv |> Seq.iter (optionVegaCall.Points.Add >> ignore) Summary In this article, we looked into using F# for investigating different aspects of volatility. Volatility is an interesting dimension of finance where you quickly dive into complex theories and models. Here it's very much helpful to have a powerful tool such as F# and F# Interactive. We've just scratched the surface of options and volatility in this article. There is a lot more to cover, but that's outside the scope of this book. Most of the content here will be used in the trading system. resources for article: further resources on this subject: Working with Windows Phone Controls [article] Simplifying Parallelism Complexity in C# [article] Watching Multiple Threads in C# [article]
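As a quick sanity check of the black_scholes function defined earlier, European options should satisfy put-call parity, C - P = S - K*exp(-rT). The following snippet is an illustrative addition rather than part of the original listing; it reuses the values from the running example:

/// Put-call parity check: C - P should equal S - K*exp(-r*t)
let s, k, t, r, v = 58.60, 60.0, 0.5, 0.01, 0.3
let callPrice = black_scholes Call s k t r v
let putPrice  = black_scholes Put  s k t r v
// both sides of the parity relation
let lhs = callPrice - putPrice
let rhs = s - k * exp(-r * t)
printfn "C - P = %f, S - K*exp(-rT) = %f" lhs rhs

Both values come out at roughly -1.10 for these inputs, which agrees with the call and put prices computed earlier, up to the precision of the cnd approximation.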
article-image-sql-server-analysis-services-administering-and-monitoring-analysis-services
Packt
20 Dec 2013
19 min read
Save for later

SQL Server Analysis Services – Administering and Monitoring Analysis Services

Packt
20 Dec 2013
19 min read
(For more resources related to this topic, see here.) If your environment has only one or a handful of SSAS instances, they can be managed by the same database administrators managing SQL Server and other database platforms. In large enterprises, there could be hundreds of SSAS instances managed by dedicated SSAS administrators. Regardless of the environment, you should become familiar with the configuration options as well as troubleshooting methodologies. In large enterprises, you might also be required to automate these tasks using the Analysis Management Objects (AMO) code. Analysis Services is a great tool for building business intelligence solutions. However, much like any other software, it does have its fair share of challenges and limitations. Most frequently encountered enterprise business intelligence system goals include quick provision of relevant data to the business users and assuring excellent query performance. If your cubes serve a large, global community of users, you will quickly learn that SSAS is optimized to run a single query as fast as possible. Once users send a multitude of heavy queries in parallel, you can expect to see memory, CPU, and disk-related performance counters to quickly rise, with a corresponding increase in query execution duration which, in turn, worsens user experience. Although you could build aggregations to improve query performance, doing so will lengthen cube processing time, and thereby, delay the delivery of essential data to decision makers. It might also be tempting to consider using ROLAP storage mode in lieu of MOLAP so that processing times are shorter, but MOLAP queries usually outperform ROLAP due to heavy compression rates. Hence, figuring out the right storage mode and appropriate level of aggregations is a great balancing act. If you cannot afford using ROLAP, and query performance is paramount to successful cube implementation, you should consider scaling your solution. You have two options for scaling, given as follows: Scaling up: This option means purchasing servers with more memory, more CPU cores, and faster disk drives Scaling out: This option means purchasing several servers of approximately the same capacity and distributing the querying workload across multiple servers using a load balancing tool SSAS lends itself best to the second option—scaling out. Later in this article you will learn how to separate processing and querying activities and how to ensure that all servers in the querying pool have the same data. SSAS instance configuration options All Analysis Services configuration options are available in the msmdsrv.ini file found in the config folder under the SSAS installation directory. Instance administrators can also modify some, but not all configuration properties, using SQL Server Management Studio (SSMS). SSAS has a multitude of properties that are undocumented—this normally means that such properties haven't undergone thorough testing, even by the software's developers. Hence, if you don't know exactly what the configuration setting does, it's best to leave the setting at default value. Even if you want to test various properties on a sandbox server, make a copy of the configuration file prior to applying any changes. How to do it... To modify the SSAS instance settings using the configuration file, perform the following steps: Navigate to the config folder within your Analysis Services installation directory. By default, this will be C:\Program Files\Microsoft SQL Server\MSAS11.instance_name\OLAP\Config. 
Open the msmdsrv.ini file using Notepad or another text editor of your choice. The file is in the XML format, so every property is enclosed in opening and closing tags. Search for the property of interest, modify its value as desired, and save the changes. For example, in order to change the upper limit of the processing worker threads, you would look for the <ThreadPool><Process><MaxThreads> tag sequence and set the values as shown in the following excerpt from the configuration file: <Process>       <MinThreads>0</MinThreads>       <MaxThreads>250</MaxThreads>      <PriorityRatio>2</PriorityRatio>       <Concurrency>2</Concurrency>       <StackSizeKB>0</StackSizeKB>       <GroupAffinity/>     </Process> To change the configuration using SSMS, perform the following steps: Connect to the SSAS instance using the instance administrator account and choose Properties. If your account does not have sufficient permissions, you will get an error that only administrators can edit server properties. Change the desired properties by altering the Value column on the General page of the resulting dialog, as shown in the following screenshot: Advanced properties are hidden by default. You must check the Show Advanced (All) Properties box to see advanced properties. You will not see all the properties in SSMS even after checking this box. The only way to edit some properties is by editing msmdsrv.ini as previously discussed. Make a note of the Reset Default button in the bottom-right corner. This button comes in handy if you've forgotten what the configuration values were before you changed them and want to revert to the default settings. The default values are shown in the dialog box, which can provide guidance as to which properties have been altered. Some configuration settings require restarting the SSAS instance prior to being executed. If this is the case, the Restart column will have a value of Yes. Once you're happy with your changes, click on OK and restart the instance if necessary. You can restart SSAS using the Services.msc applet from the command line using the NET STOP / NET START commands, or directly in SSMS by choosing the Restart option after right-clicking on the instance. How it works... Discussing every SSAS property would make this article extremely lengthy; doing so is well beyond the scope of the book. Instead, in this section, I will summarize the most frequently used properties. Often, synchronization has to copy large partition datafiles and aggregation files. If the timeout value is exceeded, synchronization fails. Increase the value of the <Network><Listener><ServerSendTimeout> and <Network><Listener><ServerReceiveTimeout> properties to allow a longer time span for copying each file. By default, SSAS can use a lazy thread to rebuild missing indexes and aggregations after you process partition data. If the <OLAP><LazyProcessing><Enabled> property is set to 0, the lazy thread is not used for building missing indexes—you must use an explicit processing command instead. The <OLAP><LazyProcessing><MaxCPUUsage> property throttles the maximum CPU that could be used by the lazy thread. If efficient data delivery is your topmost priority, you can exploit the ProcessData option instead of ProcessFull. To build aggregations after the data is loaded, you must set the partition's ProcessingMode property to LazyAggregations. The SSAS formula engine is single threaded, so queries that perform heavy calculations will only use one CPU core, even on a multiCPU computer. 
The storage engine is multithreaded; hence, queries that read many partitions will require many CPU cycles. If you expect storage engine heavy queries, you should lower the CPU usage threshold for LazyAggregations. By default, Analysis Services records subcubes requested for every 10th query in the query log table. If you'd like to design aggregations based on query logs, you should change the <Log><QueryLog><QueryLogSampling> property value to 1 so that the SSAS logs subcube requests for every query. SSAS can use its own memory manager or the Windows memory manager. If your SSAS instance consistently becomes unresponsive, you could try using the Windows memory manager. Set <Memory><MemoryHeapType> to 2 and <Memory><HeapTypeForObjects> to 0. The Analysis Services memory manager values are 1 for both the properties. You must restart the SSAS service for the changes to these properties to take effect. The <Memory><PreAllocate> property specifies the percentage of total memory to be reserved at SSAS startup. SSAS normally allocates memory dynamically as it is required by queries and processing jobs. In some cases, you can achieve performance improvement by allocating a portion of the memory when the SSAS service starts. Setting this value will increase the time required to start the service. The memory will not be released back to the operating system until you stop the SSAS service. You must restart the SSAS service for changes to this property to take effect. The <Log><FlightRecorder><FileSizeMB>and <Log><FlightRecorder><LogDurationSec> properties control the size and age of the FlightRecorder trace file before it is recycled. You can supply your own trace definition file to include the trace events and columns you wish to monitor using the <Log><FlightRecorder><TraceDefinitionFile> property. If FlightRecorder collects useful trace events, it can be an invaluable troubleshooting tool. By default, the file is only allowed to grow to 10 MB or 60 minutes. Long processing jobs can take up much more space, and their duration could be much longer than 60 minutes. Hence, you should adjust the settings as necessary for your monitoring needs. You should also adjust the trace events and columns to be captured by FlightRecorder. You should consider adjusting the duration to cover three days (in case the issue you are researching happens over a weekend). The <Memory><LowMemoryLimit> property controls the point—amount of memory used by SSAS—at which the cleaner thread becomes actively engaged in reclaiming memory from existing jobs. Each SSAS command (query, processing, backup, synchronization, and so on) is associated with jobs that run on threads and use system resources. We can lower the value of this setting to run more jobs in parallel (though the performance of each job could suffer). Two properties control the maximum amount of memory that a SSAS instance could use. Once memory usage reaches the value specified by <Memory><TotalMemoryLimit>, the cleaner thread becomes particularly aggressive at reclaiming memory. The <Memory><HardMemoryLimit> property specifies the absolute memory limit—SSAS will not use memory above this limit. These properties are useful if you have SSAS and other applications installed on the same server computer. You should reserve some memory for other applications and the operating system as well. When HardMemoryLimit is reached, SSAS will disconnect the active sessions, advising that the operation was cancelled due to memory pressure. 
All memory settings are expressed in percentages if the values are less than or equal to 100. Values above 100 are interpreted as kilobytes. All memory configuration changes require restart of the SSAS service to take effect. In the prior releases of Analysis Services, you could only specify the minimum and maximum number of threads used for queries and processing jobs. With SSAS 2012, you can also specify the limits for the input/output job threads using the <ThreadPool><IOProcess> property. The <Process><IndexBuildThreshold> property governs the minimum number of rows within a partition for which SSAS will build indexes. The default value is 4096. SSAS decides which partitions it needs to scan for each query based on the partition index files. If the partition does not have indexes, it will be scanned for all the queries. Normally, SSAS can read small partitions without greatly affecting query performance. But if you have many small partitions, you should lower the threshold to ensure each partition has indexes. The <Process><BufferRecordLimit> and <Process><BufferMemoryLimit> properties specify the number of records for each memory buffer and the maximum percentage of memory that can be used by a memory buffer. Lower the value of these properties to process more partitions in parallel. You should monitor processing using the SQL Profiler to see if some partitions included in the processing batch are being processed while the others are in waiting. The <ExternalConnectionTimeout> and <ExternalCommandTimeout> properties control how long an SSAS command should wait for connecting to a relational database or how long SSAS should wait to execute the relational query before reporting timeout. Depending on the relational source, it might take longer than 60 seconds (that is, the default value) to connect. If you encounter processing errors without being able to connect to the relational source, you should increase the ExternalConnectionTimeout value. It could also take a long time to execute a query; by default, the processing query will timeout after one hour. Adjust the value as needed to prevent processing failures. The contents of the <AllowedBrowsingFolders> property define the drives and directories that are visible when creating databases, collecting backups, and so on. You can specify multiple items separated using the pipe (|) character. The <ForceCommitTimeout> property defines how long a processing job's commit operation should wait prior to cancelling any queries/jobs which may interfere with processing or synchronization. A long running query can block synchronization or processing from committing its transaction. You can adjust the value of this property from its default value of 30 seconds to ensure that processing and queries don't step on each other. The <Port> property specifies the port number for the SSAS instance. You can use the hostname followed by a colon (:) and a port number for connecting to the SSAS instance in lieu of the instance name. Be careful not to supply the port number used by another application; if you do so, the SSAS service won't start. The <ServerTimeout> property specifies the number of milliseconds after which a query will timeout. The default value is 1 hour, which could be too long for analytical queries. If the query runs for an hour, using up system resources, it could render the instance unusable by any other connection. You can also define a query timeout value in the client application's connection strings. 
Client setting overrides the server-level property. There's more... There are many other properties you can set to alter SSAS instance behavior. For additional information on configuration properties, please refer to product documentation at http://technet.microsoft.com/en-us/library/ms174556.aspx. Creating and dropping databases Only SSAS instance administrators are permitted to create, drop, restore, detach, attach, and synchronize databases. This recipe teaches administrators how to create and drop databases. Getting ready Launch SSMS and connect to your Analysis Services instance as an administrator. If you're not certain that you have administrative properties to the instance, right-click on the SSAS instance and choose Properties. If you can view the instance's properties, you are an administrator; otherwise, you will get an error indicating that only instance administrators can view and alter properties. How to do it... To create a database, perform the following steps: Right-click on the Databases folder and choose New Database. Doing so launches the New Database dialog shown in the following screenshot. Specify a descriptive name for the database, for example, Analysis_Services_Administration. Note that the database name can contain spaces. Each object has a name as well as an identifier. The identifier value is set to the object's original name and cannot be changed without dropping and recreating the database; hence, it is important to come up with a descriptive name from the very beginning. You cannot create more than one database with the same name on any SSAS instance. Specify the storage location for the database. By default, the database will be stored under the \OLAP\DATA folder of your SSAS installation directory. The only compelling reason to change the default is if your data drive is running out of disk space and cannot support the new database's storage requirements. Specify the impersonation setting for the database. You could also specify the impersonation property for each data source. Alternatively, each data source can inherit the DataSourceImpersonationInfo property from the database-level setting. You have four choices as follows: Specific user name (must be a domain user) and password: This is the most secure option but requires updating the password if the user changes the password Analysis Services service account Credentials of the current user: This option is specifically for data mining Default: This option is the same as using the service account option Specify an optional description for the database. As with majority of other SSMS dialogs, you can script the XMLA command you are about to execute by clicking on the Script button. To drop an existing database, perform the following steps: Expand the Databases folder on the SSAS instance, right-click on the database, and choose Delete. The Delete objects dialog allows you to ignore errors; however, it is not applicable to databases. You can script the XMLA command if you wish to review it first. An alternative way of scripting the DELETE command is to right-click on the database and navigate to Script database as | Delete To | New query window. Monitoring SSAS instance using Activity Viewer Unlike other database systems, Analysis Services has no system databases. However, administrators still need to check the activity on the server, ensure that cubes are available and can be queried, and there is no blocking. 
You can exploit a tool named Analysis Services Activity Viewer 2008 to monitor SSAS Versions 2008 and later, including SSAS 2012. This tool is owned and maintained by the SSAS community and can be downloaded from www.codeplex.com. Activity Viewer allows viewing active and dormant sessions, current XMLA and MDX queries, locks, as well as CPU and I/O usage by each connection. Additionally, you can define rules to raise alerts when a particular condition is met. How to do it... To monitor an SSAS instance using Activity Viewer, perform the following steps: Launch the application by double-clicking on ActivityViewer.exe. Click on the Add New Connection button on the Overview tab. Specify the hostname and instance name or the hostname and port number for the SSAS instance and then click on OK. For each SSAS instance you connect to, Activity Viewer adds a new tab. Click on the tab for your SSAS instance. Here, you will see several pages as shown in the following screenshot: Alerts: This page shows any sessions that met the condition found in the Rules page. Users: This page displays one row for each user as well as the number of sessions, total memory, CPU, and I/O usage. Active Sessions: This page displays each session that is actively running an MDX, Data Mining Extensions (DMX), or XMLA query. This page allows you to cancel a specific session by clicking on the Cancel Session button. Current Queries: This page displays the actual command's text, number of kilobytes read and written by the command, and the amount of  CPU time used by the command. This page allows you to cancel a specific query by clicking on the Cancel Query button. Dormant Sessions: This page displays sessions that have a connection to the SSAS instance but are not currently running any queries. You can also disconnect a dormant session by clicking on the Cancel Session button. CPU: This page allows you to review the CPU time used by the session as well as the last command executed on the session. I/O: This page displays the number of reads and writes as well as the kilobytes read and written by each session. Objects: This page shows the CPU time and number of reads affecting each dimension and partition. This page also shows the full path to the object's parent; this is useful if you have the same naming convention for partitions in multiple measure groups. Not only do you see the partition name, but also the full path to the partition's measure group. This page also shows the number of aggregation hits for each partition. If you find that a partition is frequently queried and requires many reads, you should consider building aggregations for it. Locks: This page displays the locks currently in place, whether already granted or waiting. Be sure to check the Lock Status column—the value of 0 indicates that the lock request is currently blocked. Rules: This page allows defining conditions that will result in an alert. For example, if the session is idle for over 30 minutes or if an MDX query takes over 30 minutes, you should get alerted. How it works... Activity Viewer monitors Analysis Services using Dynamic Management Views (DMV). In fact, capturing queries executed by Activity Viewer using SQL Server Profiler is a good way of familiarizing yourself with SSAS DMV's. 
For example, the Current Queries page checks the $system.DISCOVER_COMMANDS DMV for any actively executing commands by running the following query: SELECT SESSION_SPID,COMMAND_CPU_TIME_MS,COMMAND_ELAPSED_TIME_MS,   COMMAND_READ_KB,COMMAND_WRITE_KB, COMMAND_TEXT FROM $system.DISCOVER_COMMANDS WHERE COMMAND_ELAPSED_TIME_MS > 0 ORDER BY COMMAND_CPU_TIME_MS DESC The Active Sessions page checks the $system.DISCOVER_SESSIONS DMV with the session status set to 1 using the following query: SELECT SESSION_SPID,SESSION_USER_NAME, SESSION_START_TIME,   SESSION_ELAPSED_TIME_MS,SESSION_CPU_TIME_MS, SESSION_ID FROM $SYSTEM.DISCOVER_SESSIONS WHERE SESSION_STATUS = 1 ORDER BY SESSION_USER_NAME DESC The Dormant sessions page runs a very similar query to that of the Active Sessions page, except it checks for sessions with SESSION_STATUS=0—sessions that are currently not running any queries. The result set is also limited to top 10 sessions based on idle time measured in milliseconds. The Locks page examines all the columns of the $system.DISCOVER_LOCKS DMV to find all requested locks as well as lock creation time, lock type, and lock status. As you have already learned, the lock status of 0 indicates that the request is blocked, whereas the lock status of 1 means that the request has been granted. Analysis Services blocking can be caused by conflicting operations that attempt to query and modify objects. For example, a long running query can block a processing or synchronization job from completion because processing will change the data values. Similarly, a command altering the database structure will block queries. The database administrator or instance administrator can explicitly issue the LOCK XMLA command as well as the BEGIN TRANSACTION command. Other operations request locks implicitly. The following table documents most frequently encountered Analysis Services lock types: Lock type identifier Description Acquired for 2 Read lock Processing to read metadata. 4 Write lock Processing to write data after it is read from relational sources. 8 Commit shared During the processing, restore or synchronization commands. 16 Commit exclusive Committing the processing, restore, or synchronization transaction when existing files are replaced by new files.  
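Following the same pattern as the queries above, a quick way to check for blocking from SSMS is to look for lock requests that have not yet been granted. This is a minimal sketch that simply filters the same $system.DISCOVER_LOCKS DMV used by Activity Viewer on the lock status value of 0 discussed earlier:

SELECT *
FROM $system.DISCOVER_LOCKS
WHERE LOCK_STATUS = 0

Any rows returned represent requests that are currently blocked; the session identifier in the result can then be matched against $system.DISCOVER_SESSIONS to decide whether the blocking session should be cancelled.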


Key components and inner working of Impala

Packt
20 Dec 2013
7 min read
(For more resources related to this topic, see here.)

Impala Core Components

Here we will discuss three important components: the Impala Daemon, the Impala Statestore, and Impala Metadata and Metastore. Putting these components together with Hadoop and an application or command-line interface, we can conceptualize them as below.

Impala Execution Architecture

Essentially, Impala daemons receive queries from a variety of sources and distribute the query load to other Impala daemons running on other nodes. While doing so, they interact with the Statestore for node-specific updates and access the Metastore, which is either stored in a centralized database or cached locally. To complete the picture of Impala execution, we will discuss how Impala interacts with other components, namely Hive, HDFS, and HBase.

Impala working with Apache Hive

We have already discussed that Impala uses a centralized database as its Metastore, and Hive uses the same MySQL or PostgreSQL database for the same kind of data. Impala provides the same SQL-like query interface used in Apache Hive. Because Impala and Hive share the same Metastore database, Impala can access Hive table definitions, as long as those definitions use file formats, compression codecs, and Impala-supported data types in their column values. Apache Hive provides Impala with support for processing various file types. When a format other than a text file is used, such as RCFile, Avro, or SequenceFile, the data must be loaded through Hive first, and then Impala can query the data in these file formats. Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. The ANALYZE TABLE statement in Hive generates useful table and column statistics, and Impala uses these statistics to optimize queries.

Impala working with HDFS

Impala table data consists of regular data files stored in HDFS (Hadoop Distributed File System), and Impala uses HDFS as its primary data storage medium. As soon as a data file or a collection of files is available in the folder of a new table, Impala reads all of the files regardless of their names, and new data is added in files whose names are controlled by Impala. HDFS provides data redundancy through its replication factor, and Impala relies on this redundancy to access data on other DataNodes when it is not available on a specific DataNode. We have also learned that Impala maintains information about the physical location of the blocks of data files in HDFS, which helps data access in case of node failure.

Impala working with HBase

HBase is a distributed, scalable, big data storage system that provides random, real-time read and write access to data stored on HDFS. HBase sits on top of HDFS; however, unlike traditional database storage systems, HBase does not provide built-in SQL support, although third-party applications can provide such functionality. To use HBase with Impala, you first define tables and map them to the equivalent HBase tables. Once this table relationship is established, you can submit queries against the HBase tables through Impala, and join operations can even be formed that include both HBase and Impala tables. A sketch of such a mapping is shown below.
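The following is only a sketch of one common way to establish such a mapping: the table and column names are hypothetical, and it assumes the mapping DDL is issued through Hive's HBase storage handler, after which Impala only needs to refresh its metadata before querying the table.

-- Run in Hive: expose an existing HBase table (assumed here to be named "events")
CREATE EXTERNAL TABLE hbase_events (
  event_key   STRING,
  event_type  STRING,
  event_value STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:type,info:value")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Run in impala-shell: pick up the new definition, then query it like any other table
INVALIDATE METADATA;
SELECT event_type, COUNT(*) AS event_count
FROM hbase_events
GROUP BY event_type;

Because the mapped table behaves like any other Impala table, it can also participate in joins with HDFS-backed tables, which is how the cross-store joins mentioned above are typically written.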
Impala Security

Impala is designed and developed to run on top of Hadoop, so you must understand the Hadoop security model as well as the security provided by the operating system on which Hadoop is running. If Hadoop runs on Linux, the Linux administrator and the Hadoop administrator can harden and tighten security, and this should be taken into account alongside the security provided by Impala itself. Impala 1.1 or later uses the Sentry open source project to provide a fine-grained authorization framework for Hadoop. Impala 1.1.1 adds auditing capabilities to the cluster by creating audit data, which can be collected from all nodes and processed for further analysis and insight.

Data Visualization using Impala

Visualizing data is as important as processing it. The human brain perceives pictures faster than it reads data in tables, so data visualization provides a very fast understanding of large amounts of data. Reports, charts, interactive dashboards, and other forms of infographics are all part of data visualization and provide a deeper understanding of results. To connect with third-party applications, Cloudera provides ODBC and JDBC connectors. These connectors are installed on the machines where the third-party applications run; after configuring the correct Impala server and port details in the connector, the application connects to Impala, submits its queries, and retrieves the results. The results are then displayed in the third-party application, where they are rendered as visualizations, displayed in table format, or processed further, depending on the application's requirements. In this section we will cover a few notable third-party applications that can take advantage of Impala's fast query processing to display graphical results.

Tableau and Impala

Tableau Software supports Impala by providing access to Impala tables through an Impala ODBC connector. Tableau is one of the most prominent data visualization technologies today and is used daily by thousands of enterprises to get intelligence out of their data. Tableau runs on Windows, and Cloudera provides an ODBC connector to make this connection possible. You can download the Impala connector for Tableau from the following link: http://go.cloudera.com/tableau_connector_download Once the Impala connector is installed and configured correctly on the machine where Tableau is running, Tableau is ready to work with Impala. In the image below, Tableau is connected to an Impala server at port 21000 and a table located in Impala is selected. Once a table is selected, particular fields are chosen and the data is displayed graphically in various visualizations; the screenshot below shows one example of such a visualization.

Microsoft Excel and Impala

Microsoft Excel is one of the most widely adopted data processing applications, used by business professionals worldwide. You can connect Microsoft Excel with Impala using another ODBC connector, provided by Simba Technologies.

Microstrategy and Impala

Microstrategy is another big player in data analysis and visualization software; it uses an ODBC driver to connect with Impala and render visualizations. The connectivity model between Microstrategy software and Cloudera Impala is shown below.

Zoomdata and Impala

Zoomdata represents a new generation of data user interface that addresses streams of data instead of sets of data. The Zoomdata processing engine performs continuous mathematical operations across data streams in real time to create visualizations on a multitude of devices. The visualization updates itself as new data arrives and is recomputed by Zoomdata. As shown in the image below, the Zoomdata application uses Impala as a data source, configured underneath to use one of the available connectors to connect with Impala. Once the connection is made, users can see the resulting data visualizations.

Real-time Query with Impala on Hadoop

Impala is marketed by its developer, Cloudera, as a product that can run "real-time queries on Hadoop". Impala is an open source implementation based on the above-mentioned Google Dremel technology, available free for anyone to use. Impala is available as a packaged product, free to use, or can be compiled from its source. It runs queries in memory to make them real-time, and in some cases, depending on the type of data, using the Parquet file format as the input data source can speed up query processing many times over, as sketched below.
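As a rough illustration of that point, the statements below create a Parquet copy of an existing text-format table entirely from impala-shell. The table names are hypothetical, and depending on the Impala release the format keyword may be spelled PARQUETFILE instead of PARQUET:

-- Create an empty table with the same columns, stored in the Parquet format
CREATE TABLE events_parquet LIKE events_text STORED AS PARQUET;

-- Copy the data; Impala writes the Parquet files itself
INSERT INTO events_parquet SELECT * FROM events_text;

-- Subsequent queries scan compact columnar files instead of raw text
SELECT COUNT(*) FROM events_parquet;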
Real-time query subscription with Impala

Cloudera offers a Real-time Query (RTQ) subscription as an add-on to the Cloudera Enterprise subscription. You can still use Impala as a free open source product; however, an RTQ subscription lets you take advantage of Cloudera's paid services to extend its usability and resilience. With an RTQ subscription you not only get access to Cloudera technical support, but you can also work with the Impala development team and provide feedback that shapes the product's design and implementation.

Summary

This concludes the discussion of the key components of Impala and their inner working.

Resources for Article:

Further resources on this subject:

Securing the Hadoop Ecosystem [Article]
Cloudera Hadoop and HP Vertica [Article]
Hadoop and HDInsight in a Heartbeat [Article]