How-To Tutorials - Data

1204 Articles

Interacting with Data for Dashboards

Packt
23 May 2014
11 min read
Hierarchies for revealing the dashboard message

It can become difficult to manage data, particularly if you have many columns, and more difficult still if they are similarly named. As you'd expect, Tableau helps you to organize your data so that it is easier to navigate and keep track of everything. From the user's perspective, hierarchies improve navigation by allowing users to move from a headline figure down to a detailed level. From the Tableau perspective, hierarchies are groups of columns arranged in increasing levels of granularity; each deeper level of the hierarchy refers to more specific details of the data. Some hierarchies are natural hierarchies, such as date. When Tableau works out that a column is a date, it automatically adds a hierarchy in this order: year, quarter, month, week, and date. You have seen this already: when you dragged a date across to the Columns shelf, Tableau automatically turned the date into a year. Other hierarchies are not immediately visible and need to be set up, and we will look at setting up a product hierarchy that spans different tables. This is a nice feature because it means that the hierarchy can reflect the users' understanding of the data and isn't determined only by the underlying data.

Getting ready

We will use the existing workbook and data that you created for this article. Take a copy of the existing worksheet and call it Hierarchies. To do this, right-click on the Worksheet tab, select the Duplicate Sheet option, and rename the new sheet to Hierarchies.

How to do it...

1. Navigate to the DimProductCategory dimension and right-click on the EnglishProductCategoryName attribute. From the pop-up menu, select the Create Hierarchy feature. You can see its location in the following illustration:
2. When you select the option, you will get a textbox entitled Create Hierarchy, which asks you to specify the name of the hierarchy. We will call our hierarchy Product Category. Once you have entered this into the textbox, click on OK. Your hierarchy will now be created, and it will appear at the bottom of the Dimensions list on the left-hand side of Tableau's interface.
3. Next, go to the DimProductSubcategory dimension and look for the EnglishProductSubCategoryName attribute. Drag it to the Product Category hierarchy, under EnglishProductCategoryName, which is already part of the hierarchy.
4. Now we will add the EnglishProductName attribute, which is found under the DimProduct dimension. Drag-and-drop it under the EnglishProductSubCategoryName attribute that is already in the Product Category hierarchy. The Product Category hierarchy should now look as follows:
5. The hierarchy will be easier to understand if we rename the attributes. Right-click on each attribute and choose Rename: change EnglishProductCategoryName to Product Category, EnglishProductSubcategoryName to Product Subcategory, and EnglishProductName to Product. Once you have done this, the hierarchy should look as follows:

You can now use your hierarchy to change the level of detail that you wish to see in the data visualization. Now we will use the Product Category hierarchy in our data visualization rather than the individual dimension attributes.
Remove everything from the Rows shelf and drag the Product Category hierarchy to the Rows shelf. Then, click on the plus sign; it will open the hierarchy, and you will see data for the next level under Product Category, which are subcategories. An example of the Tableau workbook is given in the following illustration. You can see that the biggest differences occurred in the Bikes product category, and they occurred in the years 2006 and 2007 for the Mountain Bikes and Road Bikes categories. To summarize, we have used the Hierarchy feature in Tableau to vary the degree of analysis we see in the dashboard. How it works… Tableau saves the additional information as part of the Tableau workbook. When you share the workbook, the hierarchies will be preserved. The Tableau workbook would need revisions if the hierarchy is changed, or if you add in new dimensions and they need to be maintained. Therefore, they may need some additional maintenance. However, they are very useful features and worth the little extra touch they offer in order to help the dashboard user. There's more... Dashboarding data usually involves providing "at a glance" information for team members to clearly see the issues in the data and to make actionable decisions. Often, we don't need to provide further information unless we are asked for it, and it is a very useful feature that will help us answer more detailed questions. It saves us space on the page and is a very useful dashboard feature. Let's take the example of a business meeting where the CEO wants to know more about the biggest differences or "swings" in the sales amount by category, and then wants more details. The Tableau analyst can quickly place a hierarchy in order to answer more detailed questions if required, and this is done quite simply as described here. Hierarchies also allow us to encapsulate business rules into the dashboard. In this article, we used product hierarchies. We could also add in hierarchies for different calendars, for example, in order to reflect different reporting periods. This will allow the dashboard to be easily reused in order to reflect different reporting calendars, say, you want to show data according to a fiscal year or a calendar year. You could have two different hierarchies: one for fiscal and the other for the calendar year. The dashboard could contain the same measures but sliced by different calendars according to user requirements. The hierarchies feature fits nicely with the Golden Mantra of Information Visualization, since it allows us to summarize the data and then drill down into it as the next step. See also http://www.tableausoftware.com/about/blog/2013/4/lets-talk-about-sets-23043 Classifying your data for dashboards Bins are a simple way of categorizing and bucketing values, depending on the measure value. So, for example, you could "bin" customers depending on their age group or the number of cars that they own. Bins are useful for dashboards because they offer a summary view of the data, which is essential for the "at a glance" function of dashboards. Tableau can create bins automatically, or we can also set up bins manually using calculated fields. This article will show both versions in order to meet the business needs. Getting ready In this article, we will use the existing workbook that you created for this article. We will use the same data. For this article, let's take a copy of the Hierarchies worksheet and by right-clicking on the Worksheet tab, select the Duplicate Sheet option. 
You can then rename the sheet to Bins.

How to do it...

1. Once you have your Bins worksheet in place, right-click on the SalesAmount measure and select the Create Bin option. You can see an example of this in the following screenshot:
2. We will change the Size of bins value to 5. Once you've done this, press the Load button to reveal the Min, Max, and Diff values of the data, as shown in the following screenshot:
3. When you click on the OK button, you will see a bin appear under the Dimensions area. The following is an example of this:
4. Let's test out our bins! To do this, remove everything from the Rows shelf, leaving only the Product Category hierarchy. Remove any filters from the worksheet and all of the calculations in the Marks shelf.
5. Next, drag SalesAmount (bin) to the Marks area under the Detail and Tooltip buttons. Once again, take SalesAmount (bin) and drag it to the Color button on the Marks shelf.
6. Now, we will change the size of the data points to reflect the size of the elements. To do this, drag SalesAmount (bin) to the Size button. You can vary the overall size of the elements by right-clicking on the Size button and moving the slider horizontally until you get your preferred size.
7. To neaten the image, right-click on the Date column heading and select Hide Field Names for Columns from the list. The Tableau worksheet should now look as follows:

This allows us to see some patterns in the data. We can also see more details if we click on the data points; you can see an illustration of the details in the data in the following screenshot:

However, we might find that the automated bins are not very clear to business users. We can see in the previous screenshot that the SalesAmount (bin) value is £2,440.00, which may not be meaningful to business users. How can we set the bins so that they are meaningful to business users, rather than being automated by Tableau? For example, what if the business team wants to know the proportion of their sales that fell into well-defined buckets, sliced by year? Fortunately, we can emulate the same behavior as bins by simply using a calculated field. We can create a very simple IF ... THEN ... ELSEIF formula that places the sales amounts into buckets, depending on the value of the sales amount. These buckets are manually defined using a calculated field, and we will see how to do this now.

Before we begin, take a copy of the existing worksheet called Bins and rename it to Bins Set Manually. Next, right-click on the SalesAmount measure and choose the Create Calculated Field option. In the calculated field, enter the following formula:

IF [SalesAmount] <= 1000 THEN "1000"
ELSEIF [SalesAmount] <= 2000 THEN "2000"
ELSEIF [SalesAmount] <= 3000 THEN "3000"
ELSEIF [SalesAmount] <= 4000 THEN "4000"
ELSEIF [SalesAmount] <= 5000 THEN "5000"
ELSEIF [SalesAmount] <= 6000 THEN "6000"
ELSE "7000"
END

When this formula is entered into the Calculated Field window, it looks like what the following screenshot shows. Rename the calculated field to SalesAmount Buckets.

Now that we have our calculated field in place, we can use it in our Tableau worksheet to create a dashboard component. On the Columns shelf, place the SalesAmount Buckets calculated field and the Year(Date) dimension attribute. On the Rows shelf, place Sum(SalesAmount) from the Measures section. Place the Product Category hierarchy on the Color button, and drag SalesAmount Buckets from the Dimensions pane to the Size button on the Marks shelf. Go to the Show Me panel and select the Circle View option.
This will give a dot plot feel to the data visualization. You can resize the chart by hovering the mouse over the foot of the y axis, where the £0.00 value is located. Once you're done with this, drag-and-drop the items to arrange the view. The Tableau worksheet will look as it appears in the following screenshot:

To summarize, we have created bins using Tableau's automatic bin feature. We have also looked at ways of manually creating bins using the Calculated Field feature.

How it works...

Bins are constructed using the default Bins feature in Tableau, and we can use calculated fields in order to make them more useful and complex. They are stored in the Tableau workbook, so you will be able to preserve your work if you send it to someone else. In this article, we have also looked at the dot plot visualization, which is a very simple way of representing data that does not use a lot of "ink". The data/ink ratio is useful for simplifying a data visualization in order to get the message of the data across very clearly. Dot plots might be considered old fashioned, but they are very effective and perhaps underused. We can see from the screenshot that the 3000 bucket contained the largest number of sales. We can also see that this figure peaks in the year 2007 and then falls in 2008. This is a dashboard element that could be used as a starting point for further analysis; for example, business users will want to know the reason for the fall in sales in the highest occurring "bin".

See also

The Visual Display of Quantitative Information, Edward Tufte, Graphics Press USA
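If you want to sanity-check the manual bucket thresholds outside Tableau, the same rule is easy to express as a small function. The following Clojure sketch is purely illustrative and is not part of the Tableau workbook; the function name and the sample amounts are invented for this example:

;; Assign a sales amount to the same buckets as the SalesAmount Buckets
;; calculated field shown above.
(defn sales-bucket [amount]
  (cond
    (<= amount 1000) "1000"
    (<= amount 2000) "2000"
    (<= amount 3000) "3000"
    (<= amount 4000) "4000"
    (<= amount 5000) "5000"
    (<= amount 6000) "6000"
    :else            "7000"))

;; Example: count how many sample sales land in each bucket.
(frequencies (map sales-bucket [250.0 1890.5 2440.0 3100.0 7200.0]))
;; => {"1000" 1, "2000" 1, "3000" 1, "4000" 1, "7000" 1}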


A/B Testing – Statistical Experiments for the Web

Packt
22 May 2014
13 min read
(For more resources related to this topic, see here.) Defining A/B testing At its most fundamental level, A/B testing just involves creating two different versions of a web page. Sometimes, the changes are major redesigns of the site or the user experience, but usually, the changes are as simple as changing the text on a button. Then, for a short period of time, new visitors are randomly shown one of the two versions of the page. The site tracks their behavior, and the experiment determines whether one version or the other increases the users' interaction with the site. This may mean more click-through, more purchases, or any other measurable behavior. This is similar to other methods in other domains that use different names. The basic framework randomly tests two or more groups simultaneously and is sometimes called random-controlled experiments or online-controlled experiments. It's also sometimes referred to as split testing, as the participants are split into two groups. These are all examples of between-subjects experiment design. Experiments that use these designs all split the participants into two groups. One group, the control group, gets the original environment. The other group, the test group, gets the modified environment that those conducting the experiment are interested in testing. Experiments of this sort can be single-blind or double-blind. In single-blind experiments, the subjects don't know which group they belong to. In double-blind experiments, those conducting the experiments also don't know which group the subjects they're interacting with belong to. This safeguards the experiments against biases that can be introduced by participants being aware of which group they belong to. For example, participants could get more engaged if they believe they're in the test group because this is newer in some way. Or, an experimenter could treat a subject differently in a subtle way because of the group that they belong to. As the computer is the one that directly conducts the experiment, and because those visiting your website aren't aware of which group they belong to, website A/B testing is generally an example of double-blind experiments. Of course, this is an argument for only conducting the test on new visitors. Otherwise, the user might recognize that the design has changed and throw the experiment away. For example, the users may be more likely to click on a new button when they recognize that the button is, in fact, new. However, if they are new to the site as a whole, then the button itself may not stand out enough to warrant extra attention. In some cases, these subjects can test more variant sites. This divides the test subjects into more groups. There needs to be more subjects available in order to compensate for this. Otherwise, the experiment's statistical validity might be in jeopardy. If each group doesn't have enough subjects, and therefore observations, then there is a larger error rate for the test, and results will need to be more extreme to be significant. In general, though, you'll want to have as many subjects as you reasonably can. Of course, this is always a trade-off. Getting 500 or 1000 subjects may take a while, given the typical traffic of many websites, but you still need to take action within a reasonable amount of time and put the results of the experiment into effect. So we'll talk later about how to determine the number of subjects that you actually need to get a certain level of significance. 
Another wrinkle is that you'll want to know as soon as possible whether one option is clearly better, so that you can begin to profit from it early. In the multi-armed bandit problem, this is a problem of exploration versus exploitation. This refers to the tension in experiment design (and other domains) between exploring the problem space and exploiting the resources you've found in the experiment so far. We won't get into this further, but it is a factor to stay aware of as you perform A/B tests in the future.

Because of the power and simplicity of A/B testing, it's widely used in a variety of domains. For example, marketing and advertising make extensive use of it. It has also become a powerful way to test and improve measurable interactions between your website and those who visit it online. The primary requirement is that the interaction be somewhat limited and very measurable. "Interesting" would not make a good metric; the click-through rate or pages visited, however, would. Because of this, A/B tests validate changes in the placement or the text of buttons that call for action from the users. For example, a test might compare the performance of Click for more! against Learn more now!. Another test may check whether a button placed in the upper-right section increases sales versus one in the center of the page.

These changes are all incremental, and you probably don't want to break a large site redesign into pieces and test all of them individually. In a larger redesign, several changes may work together and reinforce each other. Testing them incrementally and only applying the ones that increase some metric can result in a design that's not aesthetically pleasing, is difficult to maintain, and costs you users in the long run. In these cases, A/B testing is not recommended.

Some other things that are regularly tested in A/B tests include the following parts of a web page:

- The wording, size, and placement of a call-to-action button
- The headline and product description
- The length, layout, and fields in a form
- The overall layout and style of the website, as a larger test that is not broken down
- The pricing and promotional offers of products
- The images on the landing page
- The amount of text on a page

Now that we have an understanding of what A/B testing is and what it can do for us, let's see what it will take to set up and perform an A/B test.

Conducting an A/B test

In creating an A/B test, we need to decide several things, and then we need to put our plan into action. We'll walk through those decisions here and create a simple set of web pages that will test the aspects of design that we are interested in changing, based upon the behavior of the user. Before we start building anything, though, we need to think through our experiment and what we'll need to build.

Planning the experiment

For this article, we're going to pretend that we have a website for selling widgets (or rather, looking at the website Widgets!). The web page in this screenshot is the control page. Currently, we're getting 24 percent click-through on it from the Learn more! button. We're interested in the text of the button. If it read Order now! instead of Learn more!, it might generate more click-through. (Of course, actually explaining what the product is and what problems it solves might be more effective, but one can't have everything.) This will be the test page, and we're hoping that we can increase the click-through rate to 29 percent (a five percent absolute increase).
Now that we have two versions of the page to experiment with, we can frame the experiment statistically and figure out how many subjects we'll need for each version of the page in order to achieve a statistically meaningful increase in the click-through rate on that button.

Framing the statistics

First, we need to frame our experiment in terms of the null-hypothesis test. In this case, the null hypothesis would look something like this: changing the button copy from Learn more! to Order now! would not improve the click-through rate. Remember, this is the statement that we're hoping to disprove (or fail to disprove) in the course of this experiment.

Now we need to think about the sample size, which needs to be fixed in advance. To find the sample size, we'll use the standard error formula, solved for the number of observations needed for about a 95 percent confidence interval, which gets us in the ballpark of how large our sample should be:

n = 16σ² / δ²

In this, δ is the minimum effect to detect and σ² is the sample variance. If we are testing for something like a percent increase in the click-through, the variance is σ² = p(1 – p), where p is the initial click-through rate with the control page. So for this experiment, the variance will be 0.24(1 – 0.24), or 0.1824. This would make the sample size for each version of the page 16(0.1824 / 0.05²), or almost 1,170. The code to compute this in Clojure is fairly simple:

(defn get-target-sample [rate min-effect]
  (let [v (* rate (- 1.0 rate))]
    (* 16.0 (/ v (* min-effect min-effect)))))

Running the code from the prompt gives us the response that we expect:

user=> (get-target-sample 0.24 0.05)
1167.36

Part of the reason to calculate the number of participants needed is that monitoring the progress of the experiment and stopping it prematurely can invalidate the results of the test, because it increases the risk of false positives, where the experiment says it has disproved the null hypothesis when it really hasn't. This seems counterintuitive, doesn't it? Once we have significant results, we should be able to stop the test. Let's work through it.

Let's say that in actuality, there's no difference between the control page and the test page. That is, both sets of copy for the button get approximately the same click-through rate. If we're attempting to get p ≤ 0.05, then it means that the test will return a false positive five percent of the time: it will incorrectly say that there is a significant difference between the click-through rates of the two buttons five percent of the time. Let's say that we're running the test and planning to get 3,000 subjects, and we end up checking the results after every 1,000 participants. Let's break down what might happen:

Run      A    B    C    D    E    F    G    H
1000     No   No   No   No   Yes  Yes  Yes  Yes
2000     No   No   Yes  Yes  No   Yes  No   Yes
3000     No   Yes  No   Yes  No   No   Yes  Yes
Final    No   Yes  No   Yes  No   No   Yes  Yes
Stopped  No   Yes  Yes  Yes  Yes  Yes  Yes  Yes

Let's read this table. Each lettered column represents a scenario for how the significance of the results may change over the run of the test. The rows represent the number of observations that have been made. The row labeled Final represents the experiment's true finishing result, and the row labeled Stopped represents the result if the experiment is stopped as soon as a significant result is seen. The final results show us that out of eight different scenarios, the final result would be significant in four cases (B, D, G, and H).
However, if the experiment is stopped prematurely, then it will be significant in seven cases (all but A). The test could drastically over-generate false positives. In fact, most statistical tests assume that the sample size is fixed before the test is run. It's exciting to get good results, so we'll design our system so that we can't easily stop it prematurely. We'll just take that temptation away. With this in mind, let's consider how we can implement this test. Building the experiment There are several options to actually implement the A/B test. We'll consider several of them and weigh their pros and cons. Ultimately, the option that works best for you really depends on your circumstances. However, we'll pick one for this article and use it to implement the test for it. Looking at options to build the site The first way to implement A/B testing is to use a server-side implementation. In this case, all of the processing and tracking is handled on the server, and visitors' actions would be tracked using GET or POST parameters on the URL for the resource that the experiment is attempting to drive traffic towards. The steps for this process would go something like the following ones: A new user visits the site and requests for the page that contains the button or copy that is being tested. The server recognizes that this is a new user and assigns the user a tracking number. It assigns the user to one of the test groups. It adds a row in a database that contains the tracking number and the test group that the user is part of. It returns the page to the user with the copy, image, or design that is reflective of the control or test group. The user views the returned page and decides whether to click on the button or link or not. If the server receives a request for the button's or link's target, it updates the user's row in the tracking table to show us that the interaction was a success, that is, that the user did a click-through or made a purchase. This way of handling it keeps everything on the server, so it allows more control and configuration over exactly how you want to conduct your experiment. A second way of implementing this would be to do everything using JavaScript (or ClojureScript, https://github.com/clojure/clojurescript). In this scenario, the code on the page itself would randomly decide whether the user belonged to the control or the test group, and it would notify the server that a new observation in the experiment was beginning. It would then update the page with the appropriate copy or image. Most of the rest of this interaction is the same as the one in previous scenario. However, the complete steps are as follows: A new user visits the site and requests for the page that contains the button or copy being tested. The server inserts some JavaScript to handle the A/B test into the page. As the page is being rendered, the JavaScript library generates a new tracking number for the user. It assigns the user to one of the test groups. It renders that page for the group that the user belongs to, which is either the control group or the test group. It notifies the server of the user's tracking number and the group. The server takes this notification and adds a row for the observation in the database. The JavaScript in the browser tracks the user's next move either by directly notifying the server using an AJAX call or indirectly using a GET parameter in the URL for the next page. The server receives the notification whichever way it's sent and updates the row in the database. 
The downside of this is that having JavaScript take care of rendering the experiment might take slightly longer and may throw off the experiment. It's also slightly more complicated, because there are more parts that have to communicate. However, the benefit is that you can create a JavaScript library, easily throw a small script tag into the page, and immediately have a new A/B experiment running. In reality, though, you'll probably just use a service that handles this and more for you. However, it still makes sense to understand what these services are providing, and that's what this article tries to do: by helping you understand how to perform an A/B test, it lets you make better use of these A/B testing vendors and services.
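Once the experiment has run to its predetermined sample size, the final step is to test whether the observed difference in click-through rates is significant. The article doesn't walk through that calculation here, so the following is only a minimal sketch of a two-proportion z-test in Clojure; the click and visitor counts are invented for illustration, and in practice you might reach for a statistics library instead:

;; Two-proportion z-test: is the test page's click-through rate
;; significantly different from the control page's?
(defn z-score [clicks-a n-a clicks-b n-b]
  (let [p-a (/ clicks-a (double n-a))
        p-b (/ clicks-b (double n-b))
        p   (/ (+ clicks-a clicks-b) (double (+ n-a n-b)))  ; pooled rate
        se  (Math/sqrt (* p (- 1.0 p) (+ (/ 1.0 n-a) (/ 1.0 n-b))))]
    (/ (- p-b p-a) se)))

;; Hypothetical counts after roughly 1,170 visitors per version.
(z-score 281 1170 340 1170)
;; => about 2.76; since |z| > 1.96, we would reject the null hypothesis
;; at the five percent significance level.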


Data Warehouse Design

Packt
20 May 2014
14 min read
Most companies are establishing or planning to establish a Business Intelligence system and a data warehouse (DW). Knowledge of BI and data warehousing is in great demand in the job market. This article gives you an understanding of what Business Intelligence and a data warehouse are, what the main components of a BI system are, and what the steps to create the data warehouse are. It focuses on designing the data warehouse, which is the core of a BI system. A data warehouse is a database designed for analysis, and this definition indicates that designing a data warehouse is different from modeling a transactional database. Designing the data warehouse is also called dimensional modeling, and in this article, you will learn about the concepts of dimensional modeling.

Understanding Business Intelligence

Based on Gartner's definition (http://www.gartner.com/it-glossary/business-intelligence-bi/), Business Intelligence is defined as follows:

Business Intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.

As the definition states, the main purpose of a BI system is to help decision makers make proper decisions based on the results of the data analysis provided by the BI system. Nowadays, there are many operational systems in each industry. Businesses use multiple operational systems to simplify, standardize, and automate their everyday jobs and requirements. Each of these systems may have its own database; some may work with SQL Server, some with Oracle. Some legacy systems may work with legacy databases or even file operations, and there are also systems that work through the Web via web services and XML. Operational systems are very useful in helping with day-to-day business operations such as hiring a person in the human resources department, running sales through a retail store, and handling financial transactions.

The rising number of operational systems also adds another requirement: integrating these systems with each other. Business owners and decision makers not only need integrated data but also require analysis of that integrated data. As an example, it is a common requirement for the decision makers of an organization to compare their hiring rate with the level of service provided by the business and the customer satisfaction based on that level of service. As you can see, this requirement deals with multiple operational systems, such as CRM and human resources. It might also need some data from sales and inventory if the decision makers want to bring sales and inventory factors into their decisions. For a supermarket owner or decision maker, it would be very important to understand which products in which branches are in higher demand. This kind of information helps you provide enough products to cover demand, and you may even think about opening another branch in some regions. The requirement to integrate multiple operational systems in order to create consolidated reports and dashboards that help decision makers make proper decisions is the main driver for Business Intelligence.
Some organizations and businesses use ERP systems that are already integrated, so you may wonder whether there is still a requirement for integrating data, given that consolidated reports can be produced easily from these systems. Do such businesses still require a BI solution? The answer in most cases is yes. They might not require data integration for the internal parts of the operation that are implemented through the ERP system. However, they might need to bring in data from outside, for example, from another vendor's web service or through the many other protocols and channels used to send and receive information. This indicates that there would be a requirement for consolidated analysis of such information, which brings the BI requirement back to the table.

The architecture and components of a BI system

After understanding what a BI system is, it's time to discover more about its components and understand how these components work with each other. There are also some BI tools that help to implement one or more of the components. The following diagram shows an illustration of the architecture and main components of a Business Intelligence system:

The BI architecture and components differ based on the tools, the environment, and so on. The architecture shown in the preceding diagram contains components that are common in most BI systems. In the following sections, you will learn more about each component.

The data warehouse

The data warehouse is the core of the BI system. A data warehouse is a database built for the purpose of data analysis and reporting, and this purpose changes the design of the database as well. As you know, operational databases are built on normalization standards, which are efficient for transactional systems, for example, to reduce redundancy. As you probably know, a 3NF-designed database for a sales system contains many tables related to each other, so a report on sales information may require more than 10 join conditions, which slows down the response time of the query and report. A data warehouse comes with a new design that reduces the response time and increases the performance of queries for reports and analytics. You will learn more about the design of a data warehouse (which is called dimensional modeling) later in this article.

Extract Transform Load

It is very likely that more than one system acts as the source of data required for the BI system. So there is a requirement for data consolidation that extracts data from the different sources, transforms it into the shape that fits the data warehouse, and finally loads it into the data warehouse; this process is called Extract Transform Load (ETL). There are many challenges in the ETL process, some of which will be revealed (conceptually) later in this article. As the definition states, ETL is not just a data integration phase. Let's explore it with an example: in an operational sales database, you may have dozens of tables that hold sales transactional data. When you model that sales data in your data warehouse, you can denormalize it and build one or two tables for it. So, the ETL process should extract data from the sales database and transform it (combine, match, and so on) to fit the model of the data warehouse tables. There are some ETL tools in the market that perform the extract, transform, and load operations.
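The Microsoft tooling for ETL is covered next, but the core idea of the transform step (joining normalized source tables and flattening them into a wider, report-friendly shape) can be illustrated in miniature. The following Clojure sketch uses invented in-memory "tables" purely to show the concept; it is not how SSIS or any production ETL tool works:

;; Toy ETL: extract rows from two normalized "source tables", transform
;; them by joining and flattening, and hand the result to a load step.
(require '[clojure.set :as set])

(def sales     #{{:sale-id 1 :customer-id 10 :amount 250.0}
                 {:sale-id 2 :customer-id 11 :amount 1890.5}})
(def customers #{{:customer-id 10 :name "Devin" :city "Auckland"}
                 {:customer-id 11 :name "Lance" :city "Sydney"}})

(defn transform
  "Join sales to customers and keep only the columns the warehouse needs."
  [sales customers]
  (->> (set/join sales customers)               ; natural join on :customer-id
       (map #(select-keys % [:sale-id :name :city :amount]))))

(def flattened-sales (transform sales customers))
;; => a flat, denormalized row per sale, for example:
;;    {:sale-id 1, :name "Devin", :city "Auckland", :amount 250.0}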
The Microsoft solution for ETL is SQL Server Integration Service (SSIS), which is one of the best ETL tools in the market. SSIS can connect to multiple data sources such as Oracle, DB2, Text Files, XML, Web services, SQL Server, and so on. SSIS also has many built-in transformations to transform the data as required. Data model – BISM A data warehouse is designed to be the source of analysis and reports, so it works much faster than operational systems for producing reports. However, a DW is not that fast to cover all requirements because it is still a relational database, and databases have many constraints that reduce the response time of a query. The requirement for faster processing and a lower response time on one hand, and aggregated information on another hand causes the creation of another layer in BI systems. This layer, which we call the data model, contains a file-based or memory-based model of the data for producing very quick responses to reports. Microsoft's solution for the data model is split into two technologies: the OLAP cube and the In-memory tabular model. The OLAP cube is a file-based data storage that loads data from a data warehouse into a cube model. The cube contains descriptive information as dimensions (for example, customer and product) and cells (for example, facts and measures, such as sales and discount). The following diagram shows a sample OLAP cube: In the preceding diagram, the illustrated cube has three dimensions: Product, Customer, and Time. Each cell in the cube shows a junction of these three dimensions. For example, if we store the sales amount in each cell, then the green cell shows that Devin paid 23$ for a Hat on June 5. Aggregated data can be fetched easily as well within the cube structure. For example, the orange set of cells shows how much Mark paid on June 1 for all products. As you can see, the cube structure makes it easier and faster to access the required information. Microsoft SQL Server Analysis Services 2012 comes with two different types of modeling: multidimensional and tabular. Multidimensional modeling is based on the OLAP cube and is fitted with measures and dimensions, as you can see in the preceding diagram. The tabular model is based on a new In-memory engine for tables. The In-memory engine loads all data rows from tables into the memory and responds to queries directly from the memory. This is very fast in terms of the response time. The BI semantic model (BISM) provided by Microsoft is a combination of SSAS Tabular and Multidimensional solutions. Data visualization The frontend of a BI system is data visualization. In other words, data visualization is a part of the BI system that users can see. There are different methods for visualizing information, such as strategic and tactical dashboards, Key Performance Indicators (KPIs), and detailed or consolidated reports. As you probably know, there are many reporting and visualizing tools on the market. Microsoft has provided a set of visualization tools to cover dashboards, KPIs, scorecards, and reports required in a BI application. PerformancePoint, as part of Microsoft SharePoint, is a dashboard tool that performs best when connected to SSAS Multidimensional OLAP cube. Microsoft's SQL Server Reporting Services (SSRS) is a great reporting tool for creating detailed and consolidated reports. Excel is also a great slicing and dicing tool especially for power users. There are also components in Excel such as Power View, which are designed to build performance dashboards. 
Master Data Management

Every organization has a part of its business that is common between different systems, and that part of the data can be managed and maintained as master data. For example, an organization may receive customer information from an online web application form, from a retail store's spreadsheets, or through a web service provided by other vendors. Master Data Management (MDM) is the process of maintaining a single version of the truth for master data entities across multiple systems. Microsoft's solution for MDM is Master Data Services (MDS). Master data can be stored in MDS entities and maintained and changed through the MDS Web UI or the Excel UI. Other systems such as CRM, AX, and even the DW can be subscribers of the master data entities. Even if one or more systems are able to change the master data, they can write their changes back into MDS through the staging architecture.

Data Quality Services

The quality of data is different in each operational system, especially when we deal with legacy systems or systems that have a high dependence on user input. As the BI system is based on data, the better the quality of the data, the better the output of the BI solution. Because of this fact, working on data quality is one of the components of a BI system. As an example, Auckland might be written as "Auckland" in some Excel files but typed as "Aukland" by a user in an input form. As a solution to improve the quality of data, Microsoft provides Data Quality Services (DQS). DQS works based on Knowledge Base domains, which means a Knowledge Base can be created for different domains, and the Knowledge Base will be maintained and improved by a data steward as time passes. There are also matching policies that can be used to apply standardization to the data.

Building the data warehouse

A data warehouse is a database built for analysis and reporting. In other words, a data warehouse is a database in which the only data entry point is through ETL, and its primary purpose is to cover reporting and data analysis requirements. This definition clarifies that a data warehouse is not like the transactional databases that operational systems write data into. When no operational system works directly with the data warehouse, and when the main purpose of this database is reporting, then the design of the data warehouse will be different from that of transactional databases. If you recall the database normalization concepts, the main purpose of normalization is to reduce redundancy and dependency. The following table shows customers' data with their geographical information:

First Name   Last Name   Suburb        City       State      Country
Devin        Batler      Remuera       Auckland   Auckland   New Zealand
Peter        Blade       Remuera       Auckland   Auckland   New Zealand
Lance        Martin      City Center   Sydney     NSW        Australia

Let's elaborate on this example. As you can see from the preceding table, the geographical information in the records is redundant. This redundancy makes it difficult to apply changes. For example, if Remuera, for any reason, is no longer part of the city of Auckland, then the change has to be applied to every record that has Remuera as its suburb. The following screenshot shows the tables of geographical information:

So, a normalized approach is to remove the geographical information from the customer table and put it into another table. The customer table then simply points to that table through a key.
In this way, every time the value Remuera changes, only one record in the geographical region changes and the key number remains unchanged. So, you can see that normalization is highly efficient in transactional systems. This normalization approach is not that effective on analytical databases. If you consider a sales database with many tables related to each other and normalized at least up to the third normalized form (3NF), then analytical queries on such databases may require more than 10 join conditions, which slows down the query response. In other words, from the point of view of reporting, it would be better to denormalize data and flatten it in order to make it easier to query data as much as possible. This means the first design in the preceding table might be better for reporting. However, the query and reporting requirements are not that simple, and the business domains in the database are not as small as two or three tables. So real-world problems can be solved with a special design method for the data warehouse called dimensional modeling. There are two well-known methods for designing the data warehouse: the Kimball and Inmon methodologies. The Inmon and Kimball methods are named after the owners of these methodologies. Both of these methods are in use nowadays. The main difference between these methods is that Inmon is top-down and Kimball is bottom-up. In this article, we will explain the Kimball method. You can read more about the Inmon methodology in Building the Data Warehouse, William H. Inmon, Wiley (http://www.amazon.com/Building-Data-Warehouse-W-Inmon/dp/0764599445), and about the Kimball methodology in The Data Warehouse Toolkit, Ralph Kimball, Wiley (http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247). Both of these books are must-read books for BI and DW professionals and are reference books that are recommended to be on the bookshelf of all BI teams. This article is referenced from The Data Warehouse Toolkit, so for a detailed discussion, read the referenced book. Dimensional modeling To gain an understanding of data warehouse design and dimensional modeling, it's better to learn about the components and terminologies of a DW. A DW consists of Fact tables and dimensions. The relationship between a Fact table and dimensions are based on the foreign key and primary key (the primary key of the dimension table is addressed in the fact table as the foreign key). Summary This article explains the first steps in thinking and designing a BI system. As the first step, a developer needs to design the data warehouse (DW) and needs an understanding of the key concepts of the design and methodologies to create the data warehouse. Resources for Article: Further resources on this subject: Self-service Business Intelligence, Creating Value from Data [Article] Oracle Business Intelligence : Getting Business Information from Data [Article] Business Intelligence and Data Warehouse Solution - Architecture and Design [Article]


Backup and Restore Improvements

Packt
25 Apr 2014
11 min read
(For more resources related to this topic, see here.) Database backups to a URL and Microsoft Azure Storage The ability to backup to a URL was introduced in SQL Server 2012 Service Pack 1 cumulative update package 2. Prior to this, if you wanted to backup to a URL in SQL Server 2012, you needed to use Transact-SQL or PowerShell. SQL Server 2014 has integrated this option into Management Studio too. The reason for allowing backups to a URL is to allow you to integrate your SQL Server backups with cloud-based storage and store your backups in Microsoft Azure. By being able to create a backup there, you can keep database backups of your on-premise database in Microsoft Azure. This makes your backups safer and protected in the event that your main site is lost to a disaster as your backups are stored offsite. This can avoid the need for an actual disaster recovery site. In order to create a backup to Microsoft Azure Storage, you need a storage account and a storage container. From a SQL Server perspective, you will require a URL, which will specify a Uniform Resource Identifier (URI) to a unique backup file in Microsoft Cloud. It is the URL that provides the location for the backup and the backup filename. The URL will need to point to a blob, not just a container. If it does not exist, then it is created. However, if a backup file exists, then the backup will fail. This is unless the WITH FORMAT command is specified, which like in older versions of SQL Server allows the backup to overwrite the existing backup with the new one that you wish to create. You will also need to create a SQL Server credential to allow the SQL Server to authenticate with Microsoft Azure Storage. This credential will store the name of the storage account and also the access key. The WITH CREDENTIAL statement must be used when issuing the backup or restore commands. There are some limitations you need to consider when backing up your database to a URL and making use of Microsoft Azure Storage to store your database backups: Maximum backup size of 1 TB (Terabyte). Cannot be combined with backup devices. Cannot append to existing backups—in SQL Server, you can have more than one backup stored in a file. When taking a backup to a URL, the ratio should be of one backup to one file. You cannot backup to multiple blobs. In a normal SQL Server backup, you can stripe it across multiple files. You cannot do this with a backup to a URL on Microsoft Azure. There are some limitations you need to consider when backing up to the Microsoft Azure Storage; you can find more information on this at http://msdn.microsoft.com/en-us/library/dn435916(v=sql.120).aspx#backuptaskssms. For the purposes of this exercise, I have created a new container on my Microsoft Azure Storage account called sqlbackup. With the storage account container, you will now take the backup to a URL. As part of this process, you will create a credential using your Microsoft Azure publishing profile. This is slightly different to the process we just discussed, but you can download this profile from Microsoft Azure. Once you have your publishing profile, you can follow the steps explained in the following section. Backing up a SQL Server database to a URL You can use Management Studio's backup task to initiate the backup. In order to do this, you need to start Management Studio and connect to your local SQL Server instance. 
You will notice that I have a database called T3, and it is this database that I will be backing up to the URL as follows: Right-click on the database you want to back up and navigate to Tasks | Backup. This will start the backup task wizard for you. On the General page, you should change the backup destination from Disk to URL. Making this change will enable all the other options needed for taking a backup to a URL. You will need to provide a filename for your backup, then create the SQL Server credential you want to use to authenticate on the Windows Azure Storage container. Click on the Create Credential button to open the Create credential dialog box. There is an option to use your publishing profile, so click on the Browse button and select the publishing profile that you downloaded from the Microsoft Azure web portal. Once you have selected your publishing profile, it will prepopulate the credential name, management certificate, and subscription ID fields for you. Choose the appropriate Storage Account for your backups. Following this, you should then click on Create to create the credential. You will need to specify the Windows Azure Storage container to use for the backup. In this case, I entered sqlbackup. When you have finished, your General page should look like what is shown in the following screenshot: Following this, click on OK and the backup should run. If you want to use Transact-SQL, instead of Management Studio, to take the backup, the code would look like this: BACKUP DATABASE [T3] TO URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak' WITH CREDENTIAL = N'AzureCredential' , NOFORMAT, NOINIT, NAME = N'T3-Full Database Backup', NOSKIP, NOREWIND, NOUNLOAD, STATS = 10 GO This is a normal backup database statement, as it has always been, but it specifies a URL and a credential to use to take the backup as well. Restoring a backup stored on Windows Azure Storage In this section, you will learn how to restore a database using the backup you have stored on Windows Azure Storage: To carry out the restore, connect to your local instance of SQL Server in Management Studio, right-click on the databases folder, and choose the Restore database option. This will open the database restore pages. In the Source section of the General page, select the Device option, click on the dropdown and change the backup media type to URL, and click on Add. In the next screen, you have to specify the Windows Azure Storage account connection information. You will need to choose the storage account to connect to and specify an access key to allow SQL Server to connect to Microsoft Azure. You can get this from the Storage section of the Microsoft Azure portal. After this, you will need to specify a credential to use. In this case, I will use the credential that was created when I took the backup earlier. Click on Connect to connect to Microsoft Azure. You will then need to chose the backup to restore from. In this case, I'll use the backup of the T3 database that was created in the preceding section. You can then complete the restore options as you would do with a local backup. In this case, the database has been called T3_cloud, mainly for reference so that it can be easily identified. If you want to restore the existing database, you need to use the WITH REPLACE command in the restore statement. 
The restore statement would look like this:

RESTORE DATABASE t3
FROM URL = N'https://gresqlstorage.blob.core.windows.net/sqlbackup/t3.bak'
WITH CREDENTIAL = N'AzureCredential',
REPLACE,
STATS = 5

When the restore has completed, you will have a new copy of the database on the local SQL Server instance.

SQL Server Managed Backup to Microsoft Azure

Building on the ability to take a backup of a SQL Server database to a URL and Microsoft Azure Storage, you can now set up Managed Backups of your SQL Server databases to Microsoft Azure. This allows you to automate your database backups to Microsoft Azure Storage. All database administrators appreciate automation, as it frees their time to focus on other projects, so this feature will be useful to you. It's fully customizable, and you can build your backup strategy around the transaction workload of your database and set a retention policy.

Configuring SQL Server-managed backups to Microsoft Azure

In order to set up and configure Managed Backups in SQL Server 2014, a new stored procedure has been introduced to configure Managed Backups on a specific database. The stored procedure is called smart_admin.sp_set_db_backup, and its syntax is as follows:

EXEC smart_admin.sp_set_db_backup
    [@database_name = ] 'database name'
    ,[@enable_backup = ] { 0 | 1}
    ,[@storage_url = ] 'storage url'
    ,[@retention_days = ] 'retention_period_in_days'
    ,[@credential_name = ] 'sql_credential_name'
    ,[@encryption_algorithm] 'name of the encryption algorithm'
    ,[@encryptor_type] {'CERTIFICATE' | 'ASYMMETRIC_KEY'}
    ,[@encryptor_name] 'name of the certificate or asymmetric key'

This stored procedure will be used to set up Managed Backups on the T3 database. The SQL Server Agent will need to be running for this to work. In my case, I executed the following code to enable Managed Backups on my T3 database:

Use msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name='T3'
    ,@enable_backup=1
    ,@storage_url = 'https://gresqlstorage.blob.core.windows.net/'
    ,@retention_days=5
    ,@credential_name='AzureCredential'
    ,@encryption_algorithm = NO_ENCRYPTION

To view the Managed Backup information, you can run the following query:

Use msdb
GO
SELECT * FROM smart_admin.fn_backup_db_config('T3')

The results should look like this:

To disable the Managed Backup, you can use the smart_admin.sp_set_db_backup procedure again:

Use msdb;
GO
EXEC smart_admin.sp_set_db_backup
    @database_name='T3'
    ,@enable_backup=0

Encryption

For the first time in SQL Server, you can encrypt your backups using the native SQL Server backup tool. In SQL Server 2014, the backup tool supports several encryption algorithms, including AES 128, AES 192, AES 256, and Triple DES. You will need a certificate or an asymmetric key when taking encrypted backups. Obviously, there are a number of benefits to encrypting your SQL Server database backups, including securing the data in the database. This can also be very useful if you are using transparent data encryption (TDE) to protect your database's data files. Encryption is also supported when using SQL Server Managed Backup to Microsoft Azure.

Creating an encrypted backup

To create an encrypted SQL Server backup, there are a few prerequisites that you need to ensure are set up on the SQL Server.
Creating a database master key for the master database

Creating the database master key is important because it is used to protect the private key of the certificate and the asymmetric keys that are stored in the master database, which will be used to encrypt the SQL Server backup. The following Transact-SQL creates a database master key for the master database:

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@$$W0rd';
GO

In this example, a simple password has been used. In a production environment, it would be advisable to create the master key with a more secure password.

Creating a certificate or asymmetric key

The backup encryption process needs a certificate or asymmetric key in order to take the backup. The following code creates a certificate that can be used to back up your databases using encryption:

Use Master
GO
CREATE CERTIFICATE T3DBBackupCertificate
WITH SUBJECT = 'T3 Backup Encryption Certificate';
GO

Now you can take an encrypted backup of the database.

Creating an encrypted database backup

You can now take an encrypted backup of your databases. The following Transact-SQL statements back up the T3 database using the certificate you created in the preceding section:

BACKUP DATABASE t3
TO DISK = N'C:\Backup\t3_enc.bak'
WITH COMPRESSION,
ENCRYPTION
(
    ALGORITHM = AES_256,
    SERVER CERTIFICATE = T3DBBackupCertificate
),
STATS = 10
GO

This is a local backup; it's located in the C:\Backup folder, and the encryption algorithm used is AES_256.

Summary

This article has shown some of the new backup features of SQL Server 2014. The ability to back up to Microsoft Azure Storage means that you can implement a robust backup and restore strategy at a relatively low cost.

Further resources on this subject:
- SQL Server 2008 R2: Multiserver Management Using Utility Explorer [Article]
- Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article]
- Manage SQL Azure Databases with the Web Interface 'Houston' [Article]

Using cross-validation

Packt
22 Apr 2014
7 min read
(For more resources related to this topic, see here.) To start from, cross-validation is a common validation technique that can be used to evaluate machine learning models. Cross-validation essentially measures how well the estimated model will generalize some given data. This data is different from the training data supplied to our model, and is called the cross-validation set, or simply validation set, of our model. Cross-validation of a given model is also called rotation estimation. If an estimated model performs well during cross-validation, we can assume that the model can understand the relationship between its various independent and dependent variables. The goal of cross-validation is to provide a test to determine if a formulated model is overfit on the training data. In the perspective of implementation, cross-validation is a kind of unit test for a machine learning system. A single round of cross-validation generally involves partitioning all the available sample data into two subsets and then performing training on one subset and validation and/or testing on the other subset. Several such rounds, or folds, of cross-validation must be performed using different sets of data to reduce the variance of the overall cross-validation error of the given model. Any particular measure of the cross-validation error should be calculated as the average of this error over the different folds in cross-validation. There are several types of cross-validation we can implement as a diagnostic for a given machine learning model or system. Let's briefly explore a few of them as follows: A common type is k-fold cross-validation, in which we partition the cross-validation data into k equal subsets. The training of the model is then performed on subsets of the data and the cross-validation is performed on a single subset. A simple variation of k-fold cross-validation is 2-fold cross-validation, which is also called the holdout method. In 2-fold cross-validation, the training and cross-validation subsets of data will be almost equal in proportion. Repeated random subsampling is another simple variant of cross-validation in which the sample data is first randomized or shuffled and then used as training and cross-validation data. This method is notably not dependent on the number of folds used for cross-validation. Another form of k-fold cross-validation is leave-one-out cross-validation, in which only a single record from the available sample data is used for cross-validation. Leave-one-out cross-validation is essentially k-fold cross-validation in which k is equal to the number of samples or observations in the sample data. Cross-validation basically treats the estimated model as a black box, that is, it makes no assumptions about the implementation of the model. We can also use cross-validation to select features in a given model by using cross-validation to determine the feature set that produces the best fit model over the given sample data. Of course, there are a couple of limitations of classification, which can be summarized as follows: If a given model is needed to perform feature selection internally, we must perform cross-validation for each selected feature set in the given model. This can be computationally expensive depending on the amount of available sample data. Cross-validation is not very useful if the sample data comprises exactly or nearly equal samples. In summary, it's a good practice to implement cross-validation for any machine learning system that we build. 
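To make the k-fold idea just summarized concrete, the following short Python sketch (used here purely as a language-agnostic illustration; the function name and toy sizes are our own) partitions a set of sample indices into k folds, so that each round of cross-validation trains on k-1 folds and validates on the remaining one:

# A minimal sketch of k-fold partitioning (illustrative only).
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k roughly equal folds."""
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

folds = k_fold_indices(10, 5)
for i, validation_fold in enumerate(folds):
    # In round i, one fold is held out for validation and the rest are used for training.
    training = [idx for j, f in enumerate(folds) if j != i for idx in f]
    print("Round", i, "validate on", validation_fold, "train on", training)

In practice, the sample data would be shuffled before partitioning, which is exactly what the repeated random subsampling variant described above emphasizes.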
Also, we can choose an appropriate cross-validation technique depending on the problem we are trying to model as well as the nature of the collected sample data. For the example that will follow, the namespace declaration should look similar to the following declaration: (ns my-namespace (:use [clj-ml classifiers data])) We can use the clj-ml library to cross-validate the classifier we built for the fish packaging plant. Essentially, we built a classifier to determine whether a fish is a salmon or a sea bass using the clj-ml library. To recap, a fish is represented as a vector containing the category of the fish and values for the various features of the fish. The attributes of a fish are its length, width, and lightness of skin. We also described a template for a sample fish, which is defined as follows: (def fish-template [{:category [:salmon :sea-bass]} :length :width :lightness]) The fish-template vector defined in the preceding code can be used to train a classifier with some sample data. For now, we will not bother about which classification algorithm we have used to model the given training data. We can only assume that the classifier was created using the make-classifier function from the clj-ml library. This classifier is stored in the *classifier* variable as follows: (def *classifier* (make-classifier ...)) Suppose the classifier was trained with some sample data. We must now evaluate this trained classification model. To do this, we must first create some sample data to cross-validate. For the sake of simplicity, we will use randomly generated data in this example. We can generate this data using the make-sample-fish function. This function simply creates a new vector of some random values representing a fish. Of course, we must not forget the fact that the make-sample-fish function has an in-built partiality, so we create a meaningful pattern in a number of samples created using this function as follows: (def fish-cv-data (for [i (range 3000)] (make-sample-fish))) We will need to use a dataset from the clj-ml library, and we can create one using the make-dataset function, as shown in the following code: (def fish-cv-dataset (make-dataset "fish-cv" fish-template fish-cv-data)) To cross-validate the classifier, we must use the classifier-evaluate function from the clj-ml.classifiers namespace. This function essentially performs k-fold cross-validation on the given data. Other than the classifier and the cross-validation dataset, this function requires the number of folds that we must perform on the data to be specified as the last parameter. Also, we will first need to set the class field of the records in fish-cv-dataset using the dataset-set-class function. We can define a single function to perform these operations as follows: (defn cv-classifier [folds] (dataset-set-class fish-cv-dataset 0) (classifier-evaluate *classifier* :cross-validation fish-cv-dataset folds)) We will use 10 folds of cross-validation on the classifier. 
Since the classifier-evaluate function returns a map, we bind this return value to a variable for further use, as follows: user> (def cv (cv-classifier 10)) #'user/cv We can fetch and print the summary of the preceding cross-validation using the :summary keyword as follows: user> (print (:summary cv)) Correctly Classified Instances 2986 99.5333 % Incorrectly Classified Instances 14 0.4667 % Kappa statistic 0.9888 Mean absolute error 0.0093 Root mean squared error 0.0681 Relative absolute error 2.2248 % Root relative squared error 14.9238 % Total Number of Instances 3000 nil As shown in the preceding code, we can view several statistical measures of performance for our trained classifier. Apart from the correctly and incorrectly classified records, this summary also describes the Root Mean Squared Error (RMSE) and several other measures of error in our classifier. For a more detailed view of the correctly and incorrectly classified instances in the classifier, we can print the confusion matrix of the cross-validation using the :confusion-matrix keyword, as shown in the following code: user> (print (:confusion-matrix cv)) === Confusion Matrix === a b <-- classified as 2129 0 | a = salmon 9 862 | b = sea-bass nil As shown in the preceding example, we can use the clj-ml library's classifier-evaluate function to perform a k-fold cross-validation on any given classifier. Although we are restricted to using classifiers from the clj-ml library when using the classifier-evaluate function, we must strive to implement similar diagnostics in any machine learning system we build.
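If you want to reproduce this kind of diagnostic outside of clj-ml, a confusion matrix and accuracy figure can be computed with a few lines of plain Python; the labels and sample predictions below are made up purely for illustration and do not come from the clj-ml run above:

from collections import Counter

actual    = ["salmon", "salmon", "sea-bass", "sea-bass", "salmon", "sea-bass"]
predicted = ["salmon", "salmon", "sea-bass", "salmon",   "salmon", "sea-bass"]

# Count (actual, predicted) pairs to build the confusion matrix.
pairs = Counter(zip(actual, predicted))
labels = ["salmon", "sea-bass"]
print("actual \\ predicted:", labels)
for a in labels:
    print(a, [pairs[(a, p)] for p in labels])

# Accuracy is the fraction of correctly classified instances.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
print("accuracy:", correct / len(actual))

The same bookkeeping, applied to the validation folds of any classifier, gives you the correctly and incorrectly classified counts reported in the summary above.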

Monte Carlo Simulation and Options

Packt
21 Apr 2014
10 min read
(For more resources related to this topic, see here.) In this article, we will cover the following topics: Generating random numbers from standard normal distribution and normal distribution Generating random numbers from a uniform distribution A simple application: estimate pi by the Monte Carlo simulation Generating random numbers from a Poisson distribution Bootstrapping with/without replacements The lognormal distribution and simulation of stock price movements Simulating terminal stock prices Simulating an efficient portfolio and an efficient frontier Generating random numbers from a standard normal distribution Normal distributions play a central role in finance. A major reason is that many finance theories, such as option theory and applications, are based on the assumption that stock returns follow a normal distribution. It is quite often that we need to generate n random numbers from a standard normal distribution. For this purpose, we have the following two lines of code: >>>import scipy as sp >>>x=sp.random.standard_normal(size=10) The basic random numbers in SciPy/NumPy are created by Mersenne Twister PRNG in the numpy.random function. The random numbers for distributions in numpy.random are in cython/pyrex and are pretty fast. To print the first few observations, we use the print() function as follows: >>>print x[0:5] [-0.55062594 -0.51338547 -0.04208367 -0.66432268 0.49461661] >>> Alternatively, we could use the following code: >>>import scipy as sp >>>x=sp.random.normal(size=10) This program is equivalent to the following one: >>>import scipy as sp >>>x=sp.random.normal(0,1,10) The first input is for mean, the second input is for standard deviation, and the last one is for the number of random numbers, that is, the size of the dataset. The default settings for mean and standard deviations are 0 and 1. We could use the help() function to find out the input variables. To save space, we show only the first few lines: >>>help(sp.random.normal) Help on built-in function normal: normal(...) normal(loc=0.0, scale=1.0, size=None) Drawing random samples from a normal (Gaussian) distribution The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently, is often called the bell curve because of its characteristic shape; refer to the following graph: Again, the density function for a standard normal distribution is defined as follows:   (1) Generating random numbers with a seed Sometimes, we like to produce the same random numbers repeatedly. For example, when a professor is explaining how to estimate the mean, standard deviation, skewness, and kurtosis of five random numbers, it is a good idea that students could generate exactly the same values as their instructor. Another example would be that when we are debugging our Python program to simulate a stock's movements, we might prefer to have the same intermediate numbers. For such cases, we use the seed() function as follows: >>>import scipy as sp >>>sp.random.seed(12345) >>>x=sp.random.normal(0,1,20) >>>print x[0:5] [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057] >>> In this program, we use 12345 as our seed. The value of the seed is not important. The key is that the same seed leads to the same random values. 
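To see the effect of seeding for yourself, the following small sketch resets the seed between two draws and confirms that the sequences are identical; it calls NumPy directly, which provides the same random functions the preceding examples reach through scipy, and the seed value is arbitrary:

import numpy as np

np.random.seed(12345)
first_draw = np.random.normal(0, 1, 5)

np.random.seed(12345)   # resetting the seed restarts the generator
second_draw = np.random.normal(0, 1, 5)

# Both draws are element-for-element identical.
print(np.array_equal(first_draw, second_draw))   # True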
Generating n random numbers from a normal distribution To generate n random numbers from a normal distribution, we have the following code: >>>import scipy as sp >>>sp.random.seed(12345) >>>x=sp.random.normal(0.05,0.1,50) >>>print x[0:5] [ 0.02952923 0.09789433 -0.00194387 -0.00557303 0.24657806] >>> The difference between this program and the previous one is that the mean is 0.05 instead of 0, while the standard deviation is 0.1 instead of 1. The density of a normal distribution is defined by the following equation, where μ is the mean and σ is the standard deviation. Obviously, the standard normal distribution is just a special case of the normal distribution shown as follows:   (2) Histogram for a normal distribution A histogram is used intensively in the process of analyzing the properties of datasets. To generate a histogram for a set of random values drawn from a normal distribution with specified mean and standard deviation, we have the following code: >>>import scipy as sp >>>import matplotlib.pyplot as plt >>>sp.random.seed(12345) >>>x=sp.random.normal(0.08,0.2,1000) >>>plt.hist(x, 15, normed=True) >>>plt.show() The resultant graph is presented as follows: Graphical presentation of a lognormal distribution When returns follow a normal distribution, the prices would follow a lognormal distribution. The definition of a lognormal distribution is as follows:   (3) The following code shows three different lognormal distributions with three pairs of parameters, such as (0, 0.25), (0, 0.5), and (0, 1.0). The first parameter is for mean (), while the second one is for standard deviation, : import scipy.stats as sp import numpy as np import matplotlib.pyplot as plt x=np.linspace(0,3,200) mu=0 sigma0=[0.25,0.5,1] color=['blue','red','green'] target=[(1.2,1.3),(1.7,0.4),(0.18,0.7)] start=[(1.8,1.4),(1.9,0.6),(0.18,1.6)] for i in range(len(sigma0)): sigma=sigma0[i] y=1/(x*sigma*sqrt(2*pi))*exp(-(log(x)-mu)**2/(2*sigma*sigma)) plt.annotate('mu='+str(mu)+', sigma='+str(sigma),xy=target[i], xytext=start[i], arrowprops=dict(facecolor=color[i],shrink=0.01),) plt.plot(x,y,color[i]) plt.title('Lognormal distribution') plt.xlabel('x') plt.ylabel('lognormal density distribution') plt.show() The corresponding three graphs are put together to illustrate their similarities and differences: Generating random numbers from a uniform distribution When we plan to randomly choose m stocks from n available stocks, we could draw a set of random numbers from a uniform distribution. To generate 10 random numbers between one and 100 from a uniform distribution, we have the following code. To guarantee that we generate the same set of random numbers, we use the seed() function as follows: >>>import scipy as sp >>>sp.random.seed(123345) >>>x=sp.random.uniform(low=1,high=100,size=10) Again, low, high, and size are the three keywords for the three input variables. The first one specifies the minimum, the second one specifies the high end, while the size gives the number of the random numbers we intend to generate. The first five numbers are shown as follows: >>>print x[0:5] [ 30.32749021 20.58006409 2.43703988 76.15661293 75.06929084] >>> Using simulation to estimate the pi value It is a good exercise to estimate pi by the Monte Carlo simulation. Let's draw a square with 2R as its side. If we put the largest circle inside the square, its radius will be R. 
In other words, the areas for those two shapes have the following equations:   (4)   (5) By dividing equation (4) by equation (5), we have the following result: In other words, the value of pi will be 4* Scircle/Ssquare. When running the simulation, we generate n pairs of x and y from a uniform distribution with a range of zero and 0.5. Then we estimate a distance that is the square root of the summation of the squared x and y, that is, . Obviously, when d is less than 0.5 (value of R), it will fall into the circle. We can imagine throwing a dart that falls into the circle. The value of the pi will take the following form:   (6) The following graph illustrates these random points within a circle and within a square: The Python program to estimate the value of pi is presented as follows: import scipy as sp n=100000 x=sp.random.uniform(low=0,high=1,size=n) y=sp.random.uniform(low=0,high=1,size=n) dist=sqrt(x**2+y**2) in_circle=dist[dist<=1] our_pi=len(in_circle)*4./n print ('pi=',our_pi) print('error (%)=', (our_pi-pi)/pi) The estimated pi value would change whenever we run the previous code as shown in the following code, and the accuracy of its estimation depends on the number of trials, that is, n: ('pi=', 3.15) ('error (%)=', 0.0026761414789406262) >>> Generating random numbers from a Poisson distribution To investigate the impact of private information, Easley, Kiefer, O'Hara, and Paperman (1996) designed a (PIN) Probability of informed trading measure that is derived based on the daily number of buyer-initiated trades and the number of seller-initiated trades. The fundamental aspect of their model is to assume that order arrivals follow a Poisson distribution. The following code shows how to generate n random numbers from a Poisson distribution: import scipy as sp import matplotlib.pyplot as plt x=sp.random.poisson(lam=1, size=100) #plt.plot(x,'o') a = 5. # shape n = 1000 s = np.random.power(a, n) count, bins, ignored = plt.hist(s, bins=30) x = np.linspace(0, 1, 100) y = a*x**(a-1.) normed_y = n*np.diff(bins)[0]*y plt.plot(x, normed_y) plt.show() Selecting m stocks randomly from n given stocks Based on the preceding program, we could easily choose 20 stocks from 500 available securities. This is an important step if we intend to investigate the impact of the number of randomly selected stocks on the portfolio volatility as shown in the following code: import scipy as sp n_stocks_available=500 n_stocks=20 x=sp.random.uniform(low=1,high=n_stocks_available,size=n_stocks) y=[] for i in range(n_stocks): y.append(int(x[i])) #print y final=unique(y) print final print len(final) In the preceding program, we select 20 numbers from 500 numbers. Since we have to choose integers, we might end up with less than 20 values, that is, some integers appear more than once after we convert real numbers into integers. One solution is to pick more than we need. Then choose the first 20 integers. An alternative is to use the randrange() and randint() functions. In the next program, we choose n stocks from all available stocks. 
First, we download a dataset from http://canisius.edu/~yany/yanMonthly.pickle: n_stocks=10 x=load('c:/temp/yanMonthly.pickle') x2=unique(np.array(x.index)) x3=x2[x2<'ZZZZ'] # remove all indices sp.random.seed(1234567) nonStocks=['GOLDPRICE','HML','SMB','Mkt_Rf','Rf','Russ3000E_D','US_DEBT', 'Russ3000E_X','US_GDP2009dollar','US_GDP2013dollar'] x4=list(x3) for i in range(len(nonStocks)): x4.remove(nonStocks[i]) k=sp.random.uniform(low=1,high=len(x4),size=n_stocks) y,s=[],[] for i in range(n_stocks): index=int(k[i]) y.append(index) s.append(x4[index]) final=unique(y) print final print s In the preceding program, we remove non-stock data items. These non-stock items are a part of data items. First, we load a dataset called yanMonthly.pickle that includes over 200 stocks, gold price, GDP, unemployment rate, SMB (Small Minus Big), HML (High Minus Low), risk-free rate, price rate, market excess rate, and Russell indices. The .pickle extension means that the dataset has a type from Pandas. Since x.index would present all indices for each observation, we need to use the unique() function to select all unique IDs. Since we only consider stocks to form our portfolio, we have to move all market indices and other non-stock securities, such as HML and US_DEBT. Because all stock market indices start with a carat (^), we use less than ZZZZ to remove them. For other IDs that are between A and Z, we have to remove them one after another. For this purpose, we use the remove() function available for a list variable. The final output is shown as follows:

Indexing the Data

Packt
18 Apr 2014
10 min read
(For more resources related to this topic, see here.) Elasticsearch indexing We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like. Shards and replicas The Elasticsearch index is built of one or more shards and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up with having 10 Lucene indices distributed among the cluster. Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term "replica" is somewhat misleading. It means that every shard has its copy, so it means there are five shards and five copies. Having a shard and its replica, in general, means that when we index a document, we will modify them both. That's because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows: More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance. More shards means that fewer resources are required to fetch a particular document because fewer documents are stored in a single shard compared to the documents stored in a deployment with fewer shards. More shards means more problems when searching across the index because we have to merge results from more shards and thus the aggregation phase of the query can be more resource intensive. Having more replicas results in a fault tolerance cluster, because when the original shard is not available, its copy will take the role of the original shard. Having a single replica, the cluster may lose the shard without data loss. When we have two replicas, we can lose the primary shard and its single replica and still everything will work well. The more the replicas, the higher the query throughput will be. That's because the query can use either a shard or any of its copies to execute the query. Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch. So, how many shards and replicas should we have for our indices? That depends. 
We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data. Creating indices When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command: curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New   version of Elasticsearch released!", "content": "...", "tags":["announce", "elasticsearch", "release"] }' This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command: curl -XPUT http://localhost:9200/blog/ We just told Elasticsearch that we want to create the index with the blog name. If everything goes right, you will see the following response from Elasticsearch: {"acknowledged":true} When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards. Altering automatic index creation Sometimes, you can come to the  conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line in the elasticsearch.yml configuration file: action.auto_create_index: false Note that action.auto_create_index is more complex than it looks. The value can be set to not only false or true. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with the names beginning with a, but disallows the creation of indices starting with an. The other indices aren't allowed and must be created manually (because of -*). action.auto_create_index: -an*,+a*,-* Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used because of +a* , which will be checked first. Settings for a newly created index The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example: curl -XPUT http://localhost:9200/blog/ -d '{     "settings" : {         "number_of_shards" : 1,         "number_of_replicas" : 2     } }' The preceding command will result in the creation of the blog index with one shard and two replicas, so it makes a total of three physical Lucene indices. Also, there are other values that can be set in this way. So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. 
So the actual command is as follows: curl –XDELETE http://localhost:9200/posts And the response will be the same as the one we saw earlier, as follows: {"acknowledged":true} Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. It is a very important part because data indexation will affect the search process and the way in which documents are matched. Mappings configuration If you are used to SQL databases, you may know that before you can start inserting the data in the database, you need to create a schema, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fl y, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you'll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure. Type determining mechanism Before we  start describing how to create mappings  manually, we wanted to write about one thing. Elasticsearch can guess the document structure by looking at JSON, which defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document: {   "field1": 10, "field2": "10" } The preceding document has two fields. The field1 field will be determined as a number (to be precise, as long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution to this is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index: curl -XPUT http://localhost:9200/blog/?pretty -d '{   "mappings" : {     "article": {       "numeric_detection" : true     }   } }' Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition. Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify the formats array. Let's look at the following command for creating the index and type: curl -XPUT 'http://localhost:9200/blog/' -d '{   "mappings" : {     "article" : {       "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]     }   } }' The preceding command will result in the creation of an index called blog with the single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them. 
Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
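As a small sketch of that, the following snippet creates the index with two recognized date formats; it uses Python's requests library instead of curl (an assumption on our part, as is the second format string), and it expects Elasticsearch to be listening on localhost:9200 as in the earlier examples:

import json
import requests

# Index mappings with two recognized date formats (the second format is illustrative).
mappings = {
    "mappings": {
        "article": {
            "dynamic_date_formats": ["yyyy-MM-dd hh:mm", "yyyy-MM-dd"]
        }
    }
}

# Create the index; this is equivalent to the curl -XPUT commands used earlier.
response = requests.put("http://localhost:9200/blog/", data=json.dumps(mappings))
print(response.json())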

Understanding Data Reduction Patterns

Packt
14 Apr 2014
15 min read
(For more resources related to this topic, see here.) Data reduction – a quick introduction Data reduction aims to obtain a reduced representation of the data. It ensures data integrity, though the obtained dataset after the reduction is much smaller in volume than the original dataset. Data reduction techniques are classified into the following three groups: Dimensionality reduction: This group of data reduction techniques deals with reducing the number of attributes that are considered for an analytics problem. They do this by detecting and eliminating the irrelevant attributes, relevant yet weak attributes, or redundant attributes. The principal component analysis and wavelet transforms are examples of dimensionality reduction techniques. Numerosity reduction: This group of data reduction techniques reduces the data by replacing the original dataset with a sparse representation of the data. The sparse subset of the data is computed by parametric methods such as regression, where a model is used to estimate the data so that only a subset is enough instead of the entire dataset. There are other methods such as nonparametric methods, for example, clustering, sampling, and histograms, which work without the need for a model to be built. Compression: This group of data reduction techniques uses algorithms to reduce the size of the physical storage that the data consumes. Typically, compression is performed at a higher level of granularity than at the attribute or record level. If you need to retrieve the original data from the compressed data without any loss of information, which is required while storing string or numerical data, a lossless compression scheme is used. If instead, there is a need to uncompress video and sound files that can accommodate the imperceptible loss of clarity, then lossy compression techniques are used. The following diagram illustrates the different techniques that are used in each of the aforementioned groups: Data reduction techniques – overview Data reduction considerations for Big Data In Big Data problems, data reduction techniques have to be considered as part of the analytics process rather than a separate process. This will enable you to understand what type of data has to be retained or eliminated due to its irrelevance to the analytics-related questions that are asked. In a typical Big Data analytical environment, data is often acquired and integrated from multiple sources. Even though there is the promise of a hidden reward for using the entire dataset for the analytics, which in all probability may yield richer and better insights, the cost of doing so sometimes overweighs the results. It is at this juncture that you may have to consider reducing the amount of data without drastically compromising on the effectiveness of the analytical insights, in essence, safeguarding the integrity of the data. Performing any type of analysis on Big Data often leads to high storage and retrieval costs owing to the massive amount of data. The benefits of data reduction processes are sometimes not evident when the data is small; they begin to become obvious when the datasets start growing in size. These data reduction processes are one of the first steps that are taken to optimize data from the storage and retrieval perspective. It is important to consider the ramifications of data reduction so that the computational time spent on it does not outweigh or erase the time saved by data mining on a reduced dataset size. 
Now that we have understood data reduction concepts, we will explore a few concrete design patterns in the following sections. Dimensionality reduction – the Principal Component Analysis design pattern In this design pattern, we will consider one way of implementing the dimensionality reduction through the usage of Principal Component Analysis (PCA) and Singular value decomposition (SVD), which are versatile techniques that are widely used for exploratory data analysis, creating predictive models, and for dimensionality reduction. Background Dimensions in a given data can be intuitively understood as a set of all attributes that are used to account for the observed properties of data. Reducing the dimensionality implies the transformation of a high dimensional data into a reduced dimension's set that is proportional to the intrinsic or latent dimensions of the data. These latent dimensions are the minimum number of attributes that are needed to describe the dataset. Thus, dimensionality reduction is a method to understand the hidden structure of data that is used to mitigate the curse of high dimensionality and other unwanted properties of high dimensional spaces. Broadly, there are two ways to perform dimensionality reduction; one is linear dimensionality reduction for which PCA and SVD are examples. The other is nonlinear dimensionality reduction for which kernel PCA and Multidimensional Scaling are examples. In this design pattern, we explore linear dimensionality reduction by implementing PCA in R and SVD in Mahout and integrating them with Pig. Motivation Let's first have an overview of PCA. PCA is a linear dimensionality reduction technique that works unsupervised on a given dataset by implanting the dataset into a subspace of lower dimensions, which is done by constructing a variance-based representation of the original data. The underlying principle of PCA is to identify the hidden structure of the data by analyzing the direction where the variation of data is the most or where the data is most spread out. Intuitively, a principal component can be considered as a line, which passes through a set of data points that vary to a greater degree. If you pass the same line through data points with no variance, it implies that the data is the same and does not carry much information. In cases where there is no variance, data points are not considered as representatives of the properties of the entire dataset, and these attributes can be omitted. PCA involves finding pairs of eigenvalues and eigenvectors for a dataset. A given dataset is decomposed into pairs of eigenvectors and eigenvalues. An eigenvector defines the unit vector or the direction of the data perpendicular to the others. An eigenvalue is the value of how spread out the data is in that direction. In multidimensional data, the number of eigenvalues and eigenvectors that can exist are equal to the dimensions of the data. An eigenvector with the biggest eigenvalue is the principal component. After finding out the principal component, they are sorted in the decreasing order of eigenvalues so that the first vector shows the highest variance, the second shows the next highest, and so on. This information helps uncover the hidden patterns that were not previously suspected and thereby allows interpretations that would not result ordinarily. 
As the data is now sorted in the decreasing order of significance, the data size can be reduced by eliminating the attributes with a weak component, or low significance where the variance of data is less. Using the highly valued principal components, the original dataset can be constructed with a good approximation. As an example, consider a sample election survey conducted on a hundred million people who have been asked 150 questions about their opinions on issues related to elections. Analyzing a hundred million answers over 150 attributes is a tedious task. We have a high dimensional space of 150 dimensions, resulting in 150 eigenvalues/vectors from this space. We order the eigenvalues in descending order of significance (for example, 230, 160, 130, 97, 62, 8, 6, 4, 2,1… up to 150 dimensions). As we can decipher from these values, there can be 150 dimensions, but only the top five dimensions possess the data that is varying considerably. Using this, we were able to reduce a high dimensional space of 150 and could consider the top five eigenvalues for the next step in the analytics process. Next, let's look into SVD. SVD is closely related to PCA, and sometimes both terms are used as SVD, which is a more general method of implementing PCA. SVD is a form of matrix analysis that produces a low-dimensional representation of a high-dimensional matrix. It achieves data reduction by removing linearly dependent data. Just like PCA, SVD also uses eigenvalues to reduce the dimensionality by combining information from several correlated vectors to form basis vectors that are orthogonal and explains most of the variance in the data. For example, if you have two attributes, one is sale of ice creams and the other is temperature, then their correlation is so high that the second attribute, temperature, does not contribute any extra information useful for a classification task. The eigenvalues derived from SVD determines which attributes are most informative and which ones you can do without. Mahout's Stochastic SVD (SSVD) is based on computing mathematical SVD in a distributed fashion. SSVD runs in the PCA mode if the pca argument is set to true; the algorithm computes the column-wise mean over the input and then uses it to compute the PCA space. Use cases You can consider using this pattern to perform data reduction, data exploration, and as an input to clustering and multiple regression. The design pattern can be applied on ordered and unordered attributes with sparse and skewed data. It can also be used on images. This design pattern cannot be applied on complex nonlinear data. Pattern implementation The following steps describe the implementation of PCA using R: The script applies the PCA technique to reduce dimensions. PCA involves finding pairs of eigenvalues and eigenvectors for a dataset. An eigenvector with the biggest eigenvalue is the principal component. The components are sorted in the decreasing order of eigenvalues. The script loads the data and uses streaming to call the R script. The R script performs PCA on the data and returns the principal components. Only the first few principal components that can explain most of the variation can be selected so that the dimensionality of the data is reduced. Limitations of PCA implementation While streaming allows you to call the executable of your choice, it has performance implications, and the solution is not scalable in situations where your input dataset is huge. 
To overcome this, we have shown a better way of performing dimensionality reduction by using Mahout; it contains a set of highly scalable machine learning libraries. The following steps describe the implementation of SSVD on Mahout: Read the input dataset in the CSV format and prepare a set of data points in the form of key/value pairs; the key should be unique and the value should comprise of n vector tuples. Write the previous data into a sequence file. The key can be of a type adapted into WritableComparable, Long, or String, and the value should be of the VectorWritable type. Decide on the number of dimensions in the reduced space. Execute SSVD on Mahout with the rank arguments (this specifies the number of dimensions), setting pca, us, and V to true. When the pca argument is set to true, the algorithm runs in the PCA mode by computing the column-wise mean over the input and then uses it to compute the PCA space. The USigma folder contains the output with reduced dimensions. Generally, dimensionality reduction is applied on very high dimensional datasets; however, in our example, we have demonstrated this on a dataset with fewer dimensions for a better explainability. Code snippets To illustrate the working of this pattern, we have considered the retail transactions dataset that is stored on the Hadoop File System (HDFS). It contains 20 attributes, such as Transaction ID, Transaction date, Customer ID, Product subclass, Phone No, Product ID, age, quantity, asset, Transaction Amount, Service Rating, Product Rating, and Current Stock. For this pattern, we will be using PCA to reduce the dimensions. The following code snippet is the Pig script that illustrates the implementation of this pattern via Pig streaming: /* Assign an alias pcar to the streaming command Use ship to send streaming binary files(R script in this use case) from the client node to the compute node */ DEFINE pcar '/home/cloudera/pdp/data_reduction/compute_pca.R' ship('/home/cloudera/pdp/data_reduction/compute_pca.R'); /* Load the data set into the relation transactions */ transactions = LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims.csv'USING PigStorage(',') AS (transaction_id:long,transaction_date:chararray, customer_id:chararray,prod_subclass:chararray, phone_no:chararray, country_code:chararray,area:chararray, product_id:chararray,age:int, amt:int, asset:int, transaction_amount:double, service_rating:int,product_rating:int, curr_stock:int, payment_mode:int, reward_points:int,distance_to_store:int, prod_bin_age:int, cust_height:int); /* Extract the columns on which PCA has to be performed. STREAM is used to send the data to the external script. The result is stored in the relation princ_components */ selected_cols = FOREACH transactions GENERATEage AS age, amt AS amount, asset AS asset, transaction_amount AStransaction_amount, service_rating AS service_rating,product_rating AS product_rating, curr_stock AS current_stock,payment_mode AS payment_mode, reward_points AS reward_points,distance_to_store AS distance_to_store, prod_bin_age AS prod_bin_age,cust_height AS cust_height; princ_components = STREAM selected_cols THROUGH pcar; /* The results are stored on the HDFS in the directory pca */ STORE princ_components INTO '/user/cloudera/pdp/output/data_reduction/pca'; Following is the R code illustrating the implementation of this pattern: #! 
/usr/bin/env Rscript options(warn=-1) #Establish connection to stdin for reading the data con <- file("stdin","r") #Read the data as a data frame data <- read.table(con, header=FALSE, col.names=c("age", "amt", "asset","transaction_amount", "service_rating", "product_rating","current_stock", "payment_mode", "reward_points","distance_to_store", "prod_bin_age","cust_height")) attach(data) #Calculate covariance and correlation to understandthe variation between the independent variables covariance=cov(data, method=c("pearson")) correlation=cor(data, method=c("pearson")) #Calculate the principal components pcdat=princomp(data) summary(pcdat) pcadata=prcomp(data, scale = TRUE) pcadata The ensuing code snippets illustrate the implementation of this pattern using Mahout's SSVD. The following is a snippet of a shell script with the commands for executing CSV to the sequence converter: #All the mahout jars have to be included inHADOOP_CLASSPATH before execution of this script. #Execute csvtosequenceconverter jar to convert the CSV file to sequence file. hadoop jar csvtosequenceconverter.jar com.datareduction.CsvToSequenceConverter/user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims_ssvd.csv /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq The following is the code snippet of the Pig script with commands for executing SSVD on Mahout: /* Register piggybank jar file */ REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar'; /* *Ideally the following data pre-processing steps have to be generallyperformed on the actual data, we have deliberatelyomitted the implementation as these steps werecovered in the respective chapters *Data Ingestion to ingest data from the required sources *Data Profiling by applying statistical techniquesto profile data and find data quality issues *Data Validation to validate the correctness ofthe data and cleanse it accordingly *Data Transformation to apply transformations on the data. */ /* Use sh command to execute shell commands. Convert the files in a directory to sequence files -i specifies the input path of the sequence file on HDFS -o specifies the output directory on HDFS -k specifies the rank, i.e the number of dimensions in the reduced space -us set to true computes the product USigma -V set to true computes V matrix -pca set to true runs SSVD in pca mode */ sh /home/cloudera/mahout-distribution-0.8/bin/mahoutssvd -i /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq -o /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions -k 7 -us true -V true -U false -pca true -ow -t 1 /* Use seqdumper to dump the output in text format. -i specifies the HDFS path of the input file */ sh /home/cloudera/mahout-distribution-0.8/bin/mahout seqdumper -i /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions/V/v-m-00000 Results The following is a snippet of the result of executing the R script through Pig streaming. Only the important components in the results are shown to improve readability. Importance of components: Comp.1 Comp.2 Comp.3 Standard deviation 1415.7219657 548.8220571 463.15903326 Proportion of Variance 0.7895595 0.1186566 0.08450632 Cumulative Proportion 0.7895595 0.9082161 0.99272241 The following diagram shows a graphical representation of the results: PCA output From the cumulative results, we can explain most of the variation with the first three components. Hence, we can drop the other components and still explain most of the data, thereby achieving data reduction. 
The following is a code snippet of the result attained after applying SSVD on Mahout: Key: 0: Value: {0:6.78114976729216E-5,1:-2.1865954292525495E-4,2:-3.857078959222571E-5,3:9.172780131217343E-4,4:-0.0011674781643860148,5:-0.5403803571549012,6:0.38822546035077155} Key: 1: Value: {0:4.514870142377153E-6,1:-1.2753047299542729E-5,2:0.002010945408634006,3:2.6983823401328314E-5,4:-9.598021198119562E-5,5:-0.015661212194480658,6:-0.00577713052974214} Key: 2: Value: {0:0.0013835831436886054,1:3.643672803676861E-4,2:0.9999962672043754,3:-8.597640675661196E-4,4:-7.575051881399296E-4,5:2.058878196540628E-4,6:1.5620427291943194E-5} . . Key: 11: Value: {0:5.861358116239576E-4,1:-0.001589570485260711,2:-2.451436184622473E-4,3:0.007553283166922416,4:-0.011038688645296836,5:0.822710349440101,6:0.060441819443160294} The contents of the V folder show the contribution of the original variables to every principal component. The result is a 12 x 7 matrix as we have 12 dimensions in our original dataset, which were reduced to 7, as specified in the rank argument to SSVD. The USigma folder contains the output with reduced dimensions.
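If you want to prototype the same idea outside of Pig, R, or Mahout, a quick way is scikit-learn's PCA, keeping only as many components as are needed to explain most of the variance. The sketch below is only an illustration: the random matrix stands in for the transaction attributes, and with uncorrelated random data nearly all components are retained, whereas real, correlated attributes typically need far fewer.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the transaction attributes (1,000 rows, 12 numeric columns).
rng = np.random.RandomState(0)
data = rng.normal(size=(1000, 12))

pca = PCA()
pca.fit(data)

# Keep the smallest number of components explaining ~99% of the variance.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.99) + 1)
print("components kept:", n_components)

# Project the original data onto the reduced space.
reduced = PCA(n_components=n_components).fit_transform(data)
print("reduced shape:", reduced.shape)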

Deploying Storm on Hadoop for Advertising Analysis

Packt
24 Mar 2014
5 min read
(For more resources related to this topic, see here.) Establishing the architecture The recent componentization within Hadoop allows any distributed system to use it for resource management. In Hadoop 1.0, resource management was embedded into the MapReduce framework as shown in the following diagram: Hadoop 2.0 separates out resource management into YARN, allowing other distributed processing frameworks to run on the resources managed under the Hadoop umbrella. In our case, this allows us to run Storm on YARN as shown in the following diagram: As shown in the preceding diagram, Storm fulfills the same function as MapReduce. It provides a framework for the distributed computation. In this specific use case, we use Pig scripts to articulate the ETL/analysis that we want to perform on the data. We will convert that script into a Storm topology that performs the same function, and then we will examine some of the intricacies involved in doing that transformation. To understand this better, it is worth examining the nodes in a Hadoop cluster and the purpose of the processes running on those nodes. Assume that we have a cluster as depicted in the following diagram: There are two different components/subsystems shown in the diagram. The first is YARN, which is the new resource management layer introduced in Hadoop 2.0. The second is HDFS. Let's first delve into HDFS since that has not changed much since Hadoop 1.0. Examining HDFS HDFS is a distributed filesystem. It distributes blocks of data across a set of slave nodes. The NameNode is the catalog. It maintains the directory structure and the metadata indicating which nodes have what information. The NameNode does not store any data itself, it only coordinates create, read, update, and delete (CRUD) operations across the distributed filesystem. Storage takes place on each of the slave nodes that run DataNode processes. The DataNode processes are the workhorses in the system. They communicate with each other to rebalance, replicate, move, and copy data. They react and respond to the CRUD operations of clients. Examining YARN YARN is the resource management system. It monitors the load on each of the nodes and coordinates the distribution of new jobs to the slaves in the cluster. The ResourceManager collects status information from the NodeManagers. The ResourceManager also services job submissions from clients. One additional abstraction within YARN is the concept of an ApplicationMaster. An ApplicationMaster manages resource and container allocation for a specific application. The ApplicationMaster negotiates with the ResourceManager for the assignment of resources. Once the resources are assigned, the ApplicationMaster coordinates with the NodeManagers to instantiate containers. The containers are logical holders for the processes that actually perform the work. The ApplicationMaster is a processing-framework-specific library. Storm-YARN provides the ApplicationMaster for running Storm processes on YARN. HDFS distributes the ApplicationMaster as well as the Storm framework itself. Presently, Storm-YARN expects an external ZooKeeper. Nimbus starts up and connects to the ZooKeeper when the application is deployed. The following diagram depicts the Hadoop infrastructure running Storm via Storm-YARN: As shown in the preceding diagram, YARN is used to deploy the Storm application framework. At launch, Storm Application Master is started within a YARN container. That, in turn, creates an instance of Storm Nimbus and the Storm UI. 
After that, Storm-YARN launches supervisors in separate YARN containers. Each of these supervisor processes can spawn workers within its container. Both Application Master and the Storm framework are distributed via HDFS. Storm-YARN provides command-line utilities to start the Storm cluster, launch supervisors, and configure Storm for topology deployment. We will see these facilities later in this article.

To complete the architectural picture, we need to layer in the batch and real-time processing mechanisms: Pig and Storm topologies, respectively. We also need to depict the actual data. Often a queuing mechanism such as Kafka is used to queue work for a Storm cluster. To simplify things, we will use data stored in HDFS. The following depicts our use of Pig, Storm, YARN, and HDFS for our use case, omitting elements of the infrastructure for clarity. To fully realize the value of converting from Pig to Storm, we would convert the topology to consume from Kafka instead of HDFS as shown in the following diagram:

As the preceding diagram depicts, our data will be stored in HDFS. The dashed lines depict the batch process for analysis, while the solid lines depict the real-time system. In each of the systems, the following steps take place:

Step 1: The processing frameworks are deployed. Pig equivalent: the MapReduce Application Master is deployed and started. Storm-YARN equivalent: Storm-YARN launches the Application Master and distributes the Storm framework.
Step 2: The specific analytics are launched. Pig equivalent: the Pig script is compiled to MapReduce jobs and submitted as a job. Storm-YARN equivalent: topologies are deployed to the cluster.
Step 3: The resources are reserved. Pig equivalent: map and reduce tasks are created in YARN containers. Storm-YARN equivalent: supervisors are instantiated with workers.
Step 4: The analysis reads the data from storage and performs the analysis. Pig equivalent: Pig reads the data out of HDFS. Storm-YARN equivalent: Storm reads the work, typically from Kafka; but in this case, the topology reads it from a flat file.

Another analogy can be drawn between Pig and Trident. Pig scripts compile down into MapReduce jobs, while Trident topologies compile down into Storm topologies. For more information on the Storm-YARN project, visit the following URL: https://github.com/yahoo/storm-yarn

First Steps

Packt
24 Mar 2014
7 min read
(For more resources related to this topic, see here.) Installing matplotlib Before experimenting with matplotlib, you need to install it. Here we introduce some tips to get matplotlib up and running without too much trouble. How to do it... We have three likely scenarios: you might be using Linux, OS X, or Windows. Linux Most Linux distributions have Python installed by default, and provide matplotlib in their standard package list. So all you have to do is use the package manager of your distribution to install matplotlib automatically. In addition to matplotlib, we highly recommend that you install NumPy, SciPy, and SymPy, as they are supposed to work together. The following list consists of commands to enable the default packages available in different versions of Linux: Ubuntu: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command: sudo apt-get install python-matplotlib python-numpy python-scipy python-sympy ArchLinux: The default Python packages are compiled for Python 3. In a command terminal, enter the following command: sudo pacman -S python-matplotlib python-numpy python-scipy python-sympy If you prefer using Python 2.7, replace python by python2 in the package names Fedora: The default Python packages are compiled for Python 2.7. In a command terminal, enter the following command: sudo yum install python-matplotlib numpy scipy sympy There are other ways to install these packages; in this article, we propose the most simple and seamless ways to do it. Windows and OS X Windows and OS X do not have a standard package system for software installation. We have two options—using a ready-made self-installing package or compiling matplotlib from the code source. The second option involves much more work; it is worth the effort to have the latest, bleeding edge version of matplotlib installed. Therefore, in most cases, using a ready-made package is a more pragmatic choice. You have several choices for ready-made packages: Anaconda, Enthought Canopy, Algorete Loopy, and more! All these packages provide Python, SciPy, NumPy, matplotlib, and more (a text editor and fancy interactive shells) in one go. Indeed, all these systems install their own package manager and from there you install/uninstall additional packages as you would do on a typical Linux distribution. For the sake of brevity, we will provide instructions only for Enthought Canopy. All the other systems have extensive documentation online, so installing them should not be too much of a problem. So, let's install Enthought Canopy by performing the following steps: Download the Enthought Canopy installer from https://www.enthought.com/products/canopy. You can choose the free Express edition. The website can guess your operating system and propose the right installer for you. Run the Enthought Canopy installer. You do not need to be an administrator to install the package if you do not want to share the installed software with other users. When installing, just click on Next to keep the defaults. You can find additional information about the installation process at http://docs.enthought.com/canopy/quick-start.html. That's it! You will have Python 2.7, NumPy, SciPy, and matplotlib installed and ready to run. Plotting one curve The initial example of Hello World! for a plotting software is often about showing a simple curve. We will keep up with that tradition. It will also give you a rough idea about how matplotlib works. 
Getting ready

You need to have Python (either v2.7 or v3) and matplotlib installed. You also need to have a text editor (any text editor will do) and a command terminal to type and run commands.

How to do it...

Let's get started with one of the most common and basic graphs that any plotting software offers—curves. In a text file saved as plot.py, we have the following code:

import matplotlib.pyplot as plt

X = range(100)
Y = [value ** 2 for value in X]

plt.plot(X, Y)
plt.show()

Assuming that you installed Python and matplotlib, you can now use Python to interpret this script. If you are not familiar with Python, this is indeed a Python script we have there! In a command terminal, run the script in the directory where you saved plot.py with the following command:

python plot.py

Doing so will open a window as shown in the following screenshot:

The window shows the curve Y = X ** 2 with X in the [0, 99] range. As you might have noticed, the window has several icons, some of which are as follows:

The save icon: This opens a dialog, allowing you to save the graph as a picture file. You can save it as a bitmap picture or a vector picture.

The pan/zoom icon: This allows you to translate and scale the graphics. Click on it and then move the mouse over the graph. Clicking on the left button of the mouse will translate the graph according to the mouse movements. Clicking on the right button of the mouse will modify the scale of the graphics.

The home icon: This will restore the graph to its initial state, canceling any translation or scaling you might have applied before.

How it works...

Assuming that you are not very familiar with Python yet, let's analyze the script demonstrated earlier. The first line tells Python that we are using the matplotlib.pyplot module. To save on a bit of typing, we make the name plt equivalent to matplotlib.pyplot. This is a very common practice that you will see in matplotlib code.

The second line creates a list named X, with all the integer values from 0 to 99. The range function is used to generate consecutive numbers. You can run the interactive Python interpreter and type the command range(100) if you use Python 2, or the command list(range(100)) if you use Python 3. This will display the list of all the integer values from 0 to 99. In both versions, sum(range(100)) will compute the sum of the integers from 0 to 99.

The third line creates a list named Y, with all the values from the list X squared. Building a new list by applying a function to each member of another list is a Python idiom named list comprehension. The list Y will contain the squared values of the list X in the same order. So Y will contain 0, 1, 4, 9, 16, 25, and so on.

The fourth line plots a curve, where the x coordinates of the curve's points are given in the list X, and the y coordinates of the curve's points are given in the list Y. Note that the names of the lists can be anything you like.

The last line shows the result, which you will see in a window while running the script.

There's more...

So what have we learned so far? Unlike plotting packages such as gnuplot, matplotlib is not a command interpreter specialized for the purpose of plotting. Unlike MATLAB, matplotlib is not an integrated environment for plotting either. matplotlib is a Python module for plotting. Figures are described with Python scripts, relying on a (fairly large) set of functions provided by matplotlib. Thus, the philosophy behind matplotlib is to take advantage of an existing language, Python.
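As a tiny illustration of that philosophy, the same curve can be produced by combining matplotlib with NumPy arrays instead of a plain list comprehension. This is an optional variation, not part of the original recipe, and it assumes NumPy is installed as recommended in the installation section:

import numpy as np
import matplotlib.pyplot as plt

X = np.arange(100)  # the integers 0, 1, ..., 99 as a NumPy array
Y = X ** 2          # element-wise squaring; no explicit loop needed

plt.plot(X, Y)
plt.show()

Nothing about matplotlib changes here; we simply take advantage of another Python package for the data preparation, which becomes convenient once the computations get heavier than squaring a hundred integers.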
The rationale is that Python is a complete, well-designed, general purpose programming language. Combining matplotlib with other packages does not involve tricks and hacks, just Python code. This is because there are numerous packages for Python for pretty much any task. For instance, to plot data stored in a database, you would use a database package to read the data and feed it to matplotlib. To generate a large batch of statistical graphics, you would use a scientific computing package such as SciPy and Python's I/O modules. Thus, unlike many plotting packages, matplotlib is very orthogonal—it does plotting and only plotting. If you want to read inputs from a file or do some simple intermediary calculations, you will have to use Python modules and some glue code to make it happen. Fortunately, Python is a very popular language, easy to master and with a large user base. Little by little, we will demonstrate the power of this approach.
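To make the "glue code" point concrete, here is one possible sketch of such glue: it reads numbers from a plain text file with Python's standard csv module and hands them to matplotlib. The filename data.csv and its one-value-per-row layout are assumptions made for this example, not something defined earlier in the article:

import csv
import matplotlib.pyplot as plt

X, Y = [], []
with open('data.csv') as data_file:               # hypothetical input file
    for i, row in enumerate(csv.reader(data_file)):
        X.append(i)                               # use the row index as the x coordinate
        Y.append(float(row[0]))                   # the first column holds the value to plot

plt.plot(X, Y)
plt.show()

matplotlib never needs to know where the numbers came from; any Python code that produces two sequences of equal length can feed plt.plot.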

Going Viral

Packt
19 Mar 2014
14 min read
(For more resources related to this topic, see here.) Social media mining using sentiment analysis People are highly opinionated. We hold opinions about everything from international politics to pizza delivery. Sentiment analysis, synonymously referred to as opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions through written language. Practically speaking, this field allows us to measure, and thus harness, opinions. Up until the last 40 years or so, opinion mining hardly existed. This is because opinions were elicited in surveys rather than in text documents, computers were not powerful enough to store or sort a large amount of information, and algorithms did not exist to extract opinion information from written language. The explosion of sentiment-laden content on the Internet, the increase in computing power, and advances in data mining techniques have turned social data mining into a thriving academic field and a crucial commercial domain. Professor Richard Hamming famously pushed researchers to ask themselves, "What are the important problems in my field?" Researchers in the broad area of natural language processing (NLP) cannot help but list sentiment analysis as one such pressing problem. Sentiment analysis is not only a prominent and challenging research area, but also a powerful tool currently being employed in almost every business and social domain. This prominence is due, at least in part, to the centrality of opinions as both measures and causes of human behavior. This article is an introduction to social data mining. For us, social data refers to data generated by people or by their interactions. More specifically, social data for the purposes of this article will usually refer to data in text form produced by people for other people's consumption. Data mining is a set of tools and techniques used to describe and make inferences about data. We approach social data mining with a potent mix of applied statistics and social science theory. As for tools, we utilize and provide an introduction to the statistical programming language R. The article covers important topics and the latest developments in the field of social data mining with many references and resources for continued learning. We hope it will be of interest to an audience with a wide array of substantive interests from fields such as marketing, sociology, politics, and sales. We have striven to make it accessible enough to be useful for beginners while simultaneously directing researchers and practitioners already active in the field towards resources for further learning. Code and additional material will be available online at http://socialmediaminingr.com as well as on the authors' GitHub account, https://github.com/SocialMediaMininginR. The state of communication The state of communication section describes the fundamentally altered modes of social communication fostered by the Internet. The interconnected, social, rapid, and public exchange of information detailed here underlies the power of social data mining. Now more than ever before, information can go viral, a phrase first cited as early as 2004. By changing the manner in which we connect with each other, the Internet changed the way we interact—communication is now bi-directional and many-to-many. Networks are now self-organized, and information travels along every dimension, varying systematically depending on direction and purpose.
This new economy with ideas as currency has impacted nearly every person. More than ever, people rely on context and information before making decisions or purchases, and by extension, more and more on peer effects and interactions rather than centralized sources. The traditional modes of communication are represented mainly by radio and television, which are isotropic and one-to-many. It took 38 years for radio broadcasters and 13 years for television to reach an audience of 50 million, but the Internet did it in just four years (Gallup). Not only has the nature of communication changed, but also its scale. There were 50 pages on the World Wide Web (WWW) in 1993. Today, the full impact and scope of the WWW is difficult to measure, but we can get a rough sense of its size: the Indexed Web contains at least 1.7 billion pages as of February 2014 (World Wide Web size). The WWW is the largest, most widely used source of information, with nearly 2.4 billion users (Wikipedia). 70 percent of these users use it daily to both contribute and receive information in order to learn about the world around them and to influence that same world—constantly organizing information around pieces that reflect their desires. In today's connected world, many of us are members of at least one, if not more, social networking service. The influence and reach of social media enterprises such as Facebook is staggering. Facebook has 1.11 billion monthly active users and 751 million monthly active users of their mobile products (Facebook key facts). Twitter has more than 200 million (Twitter blog) active users. As communication tools, they offer a global reach to huge multinational audiences, delivering messages almost instantaneously. Connectedness and social media have altered the way we organize our communications. Today we have dramatically more friends and more friends of friends, and we can communicate with these higher order connections faster and more frequently than ever before. It is difficult to ignore the abundance of mimicry (that is, copying or reposting) and repeated social interactions in our social networks. This mimicry is a result of virtual social interactions organized into reaffirming or oppositional feedback loops. We self-organize these interactions via (often preferential) attachments that form organic, shifting networks. There is little question of whether or not social media has already impacted your life and changed the manner in which you communicate. Our beliefs and perceptions of reality, as well as the choices we make, are largely conditioned by our neighbors in virtual and physical networks. When we need to make a decision, we seek out for opinions of others—more and more of those opinions are provided by virtual networks. Information bounce is the resonance of content within and between social networks often powered by social media such as customer reviews, forums, blogs, microblogs, and other user-generated content. This notion represents a significant change when compared to how information has traveled throughout history; individuals no longer need to exclusively rely on close ties within their physical social networks. Social media has both made our close ties closer and the number of weak ties exponentially greater. Beyond our denser and larger social networks is a general eagerness to incorporate information from other networks with similar interests and desires. 
The increased access to networks of various types has, in fact, conditioned us to seek even more information; after all, ignoring available information would constitute irrational behavior. These fundamental changes to the nature and scope of communication are crucial due to the importance of ideas in today's economic and social interactions. Today, and in the future, ideas will be of central importance, especially those ideas that bounce and go viral. Ideas that go viral are those that resonate and spur on social movements, which may have political and social purposes or reshape businesses and allow companies such as Nike and Apple to produce outsized returns on capital. This article introduces readers to the tools necessary to measure ideas and opinions derived from social data at scale. Along the way, we'll describe strategies for dealing with Big Data. What is Big Data? People create 2.5 quintillion bytes (2.5 × 10^18) of data, or nearly 2.3 million terabytes of data, every day, so much that 90 percent of the data in the world today has been created in the last two years alone. Furthermore, rather than being a large collection of disparate data, much of this data flow consists of data on similar things, generating huge data-sets with billions upon billions of observations. Big Data refers not only to the deluge of data being generated, but also to the astronomical size of data-sets themselves. Both factors create challenges and opportunities for data scientists. This data comes from everywhere: physical sensors used to gather information, human sensors such as the social web, transaction records, and cell phone GPS signals to name a few. This data is not only big but is growing at an increasing rate. The data used in this article, namely, Twitter data, is no exception. Twitter was launched on March 21, 2006, and it took 3 years, 2 months, and 1 day to reach 1 billion tweets. Twitter users now send 1 billion tweets every 2.5 days. The size and scope of Big Data help us overcome some of the hurdles caused by its low density. For instance, even though each unique piece of social data may have little applicability to our particular task, these small bits of information quickly become useful as we aggregate them across thousands or millions of people. Like the proverbial bundle of sticks—none of which could support inferences alone—when tied together, these small bits of information can be a powerful tool for understanding the opinions of the online populace. The sheer scope of Big Data has other benefits as well. The size and coverage of many social data-sets creates coverage overlaps in time, space, and topic. This allows analysts to cross-refer socially generated sets against one another or against small-scale data-sets designed to examine niche questions. This type of cross-coverage can generate consilience (Osborne)—the principle that states evidence from independent, unrelated sources can converge to form strong conclusions. That is, when multiple sources of evidence are in agreement, the conclusion can be very strong even when none of the individual sources of evidence are very strong on their own. A crucial characteristic of socially generated data is that it is opinionated. This point underpins the usefulness of big social data for sentiment analysis, and is novel. For the first time in history, interested parties can put their fingers to the pulse of the masses because the masses are frequently opining about what is important to them.
They opine with and for each other and anyone else who cares to listen. In sum, opinionated data is the great enabler of opinion-based research. Human sensors and honest signals Opinion data generated by humans in real time presents tremendous opportunities. However, big social data will only prove useful to the extent that it is valid. This section tackles head-on the extent to which socially generated data can be used to accurately measure individual and/or group-level opinions. One potential indicator of the validity of socially generated data is the extent of its consumption for factual content. Online media has expanded significantly over the past 20 years. For example, online news is displacing print and broadcast. More and more Americans distrust mainstream media, with a majority (60 percent) now having little to no faith in traditional media to report news fully, accurately, and fairly. Instead, people are increasingly turning to the Internet to research, connect, and share opinions and views. This was especially evident during the 2012 election, where social media played a large role in information transmission (Gallup). Politics is not the only realm affected by social Big Data. People are increasingly relying on the opinions of others to inform their consumption preferences. Let's have a look at this:
91 percent of people report having gone into a store because of an online experience
89 percent of consumers conduct research using search engines
62 percent of consumers end up making a purchase in a store after researching it online
72 percent of consumers trust online reviews as much as personal recommendations
78 percent of consumers say that posts made by companies on social media influence their purchases
If individuals are willing to use social data as a touchstone for decision making in their own lives, perhaps this is prima facie evidence of its validity. Other Big Data thinkers point out that much of what people do online constitutes their genuine actions and intentions. The breadcrumbs left from when people execute online transactions, send messages, or spend time on web pages constitute what Alex Pentland of MIT calls honest signals. These signals are honest insofar as they are actions taken by people with no subtext or secondary intent. Specifically, he writes the following: "Those breadcrumbs tell the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy." To paraphrase, Pentland finds some web-based data to be valid measures of people's attitudes when that data is without subtext or secondary intent; what he calls data exhaust. In other words, actions are harder to fake than words. He cautions against taking people's online statements at face value, because they may be nothing more than cheap talk. Anthony Stefanidis of George Mason University also advocates for the use of social data mining. He favorably speaks about its reliability, noting that its size inherently creates a preponderance of evidence. This article takes neither the strong position of Pentland and honest signals nor Stefanidis and preponderance of evidence. Instead, we advocate a blended approach of curiosity and creativity as well as some healthy skepticism.
Generally, we follow the attitude of Charles Handy (The Empty Raincoat, 1994), who described the steps to measurement during the Vietnam War as follows: "The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can't be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can't be measured easily really isn't important. This is blindness. The fourth step is to say that what can't be easily measured really doesn't exist. This is suicide." The social web may not consist of perfect data, but its value is tremendous if used properly and analyzed with care. Forty years ago, a social science study containing millions of observations was unheard of due to the time and cost associated with collecting that much information. The most successful efforts in social data mining will be by those who "measure (all) what is measurable, and make measurable (all) what is not so" (Rasinkinski, 2008). Ultimately, we feel that the size and scope of big social data, the fact that some of it is comprised of honest signals, and the fact that some of it can be validated with other data, lends it validity. In another sense, the "proof is in the pudding". Businesses, governments, and organizations are already using social media mining to good effect; thus, the data being mined must be at least moderately useful. Another defining characteristic of big social data is the speed with which it is generated, especially when considered against traditional media channels. Social media platforms such as Twitter, but also the web generally, spread news in near-instant bursts. From the perspective of social media mining, this speed may be a blessing or a curse. On the one hand, analysts can keep up with the very fast-moving trends and patterns, if necessary. On the other hand, fast-moving information is subject to mistakes or even abuse. Following the tragic bombings in Boston, Massachusetts (April 15, 2013), Twitter was instrumental in citizen reporting and provided insight into the events as they unfolded. Law enforcement asked for and received help from the general public, facilitated by social media. For example, Reddit saw an overall peak in traffic when reports came in that the second suspect was captured. Google Analytics reports that there were about 272,000 users on the site with 85,000 in the news update thread alone. This was the only time in Reddit's history other than the Obama AMA that a thread beat the front page in the ratings (Reddit). The downside of this fast-paced, highly visible public search is that the masses can be incorrect. This is exactly what happened; users began to look at the details and photos posted and pieced together their own investigation—as it turned out, the information was incorrect. This was a charged event and created an atmosphere that ultimately undermined the good intentions of many. Other efforts, such as those by governments (Wikipedia) and companies (Forbes) to post messages favorable to their position, are less than well intentioned. Overall, we should be skeptical of tactical (that is, very real-time) uses of social media. Summary In this article, we introduced readers to the concepts of social media, sentiment analysis, and Big Data. We described how social media has changed the nature of interpersonal communication and the opportunities it presents for analysts of social data.
This article also made a case for the use of quantitative approaches to measure all that is measurable, and to make measurable that which is not. Resources for Article: Further resources on this subject: Bitcoins – Pools and Mining [Article] Managing your social channels [Article] Social Media for Wordpress: VIP Memberships [Article]

Integrating with other Frameworks

Packt
14 Mar 2014
6 min read
(For more resources related to this topic, see here.)

Using NodeJS as a data provider

JavaScript has become a formidable language in its own right. Google's work on the V8 JavaScript engine has created something very performant and has enabled others to develop Node.js, and with it, the development of JavaScript on the server side. This article will take a look at how we can serve data using NodeJS, specifically using a framework known as express.

Getting ready

We will need to set up a simple project before we can get started:

Download and install NodeJS (http://nodejs.org/download/).

Create a folder nodejs for our project.

Create a file nodejs/package.json and fill it with the following contents:

{
  "name": "highcharts-cookbook-nodejs",
  "description": "An example application for using highcharts with nodejs",
  "version": "0.0.1",
  "private": true,
  "dependencies": {
    "express": "3.4.4"
  }
}

From within the nodejs folder, install our dependencies locally (that is, within the nodejs folder) using npm (NodeJS package manager):

npm install

If we wanted to install packages globally, we could have instead done the following:

npm install -g

Create a folder nodejs/static which will later contain our static assets (for example, a webpage and our JavaScript).

Create a file nodejs/app.js which will later contain our express application and data provider.

Create a file nodejs/bower.json to list our JavaScript dependencies for the page:

{
  "name": "highcharts-cookbook-chapter-8",
  "dependencies": {
    "jquery": "^1.9",
    "highcharts": "~3.0"
  }
}

Create a file nodejs/.bowerrc to configure where our JavaScript dependencies will be installed:

{ "directory": "static/js" }

How to do it...

Let's begin:

Create an example file nodejs/static/index.html for viewing our charts:

<html>
  <head>
  </head>
  <body>
    <div id='example'></div>
    <script src='./js/jquery/jquery.js'></script>
    <script src='./js/highcharts/highcharts.js'></script>
    <script type='text/javascript'>
      $(document).ready(function() {
        var options = {
          chart: {
            type: 'bar',
            events: {
              load: function () {
                var self = this;
                setInterval(function() {
                  $.getJSON('/ajax/series', function(data) {
                    var series = self.series[0];
                    series.setData(data);
                  });
                }, 1000);
              }
            }
          },
          title: {
            text: 'Using AJAX for polling charts'
          },
          series: [{
            name: 'AJAX data (series)',
            data: []
          }]
        };
        $('#example').highcharts(options);
      });
    </script>
  </body>
</html>

In nodejs/app.js, import the express framework:

var express = require('express');

Create a new express application:

var app = express();

Tell our application where to serve static files from:

var app = express();
app.use(express.static('static'));

Create a method to return data:

app.use(express.static('static'));
app.get('/ajax/series', function(request, response) {
  var count = 10,
      results = [];
  for(var i = 0; i < count; i++) {
    results.push({
      "y": Math.random()*100
    });
  }
  response.json(results);
});

Listen on port 8888:

  response.json(results);
});
app.listen(8888);

Start our application:

node app.js

View the output on http://localhost:8888/index.html

How it works...

Most of what we've done in our application is fairly simple: create an express instance, create request methods, and listen on a certain port. With express, we could also process different HTTP verbs such as POST or DELETE. We can handle these methods by creating a new request method. In our example, we handled GET requests (that is, app.get) but, in general, we can use app.VERB (where VERB is an HTTP verb).
In fact, we can also be more flexible in what our URLs look like: we can use JavaScript regular expressions as well. More information on the express API can be found at http://expressjs.com/api.html.

Using Django as a data provider

Django is likely one of the more robust Python frameworks, and certainly one of the oldest. As such, Django can be used to tackle a variety of different cases, and has a lot of support and extensions available. This recipe will look at how we can leverage Django to provide data for Highcharts.

Getting ready

Download and install Python 2.7 (http://www.python.org/getit/).

Download and install Django (http://www.djangoproject.com/download/).

Create a new folder for our project, django.

From within the django folder, run the following to create a new project:

django-admin.py startproject example

Create a file django/bower.json to list our JavaScript dependencies:

{
  "name": "highcharts-cookbook-chapter-8",
  "dependencies": {
    "jquery": "^1.9",
    "highcharts": "~3.0"
  }
}

Create a file django/.bowerrc to configure where our JavaScript dependencies will be installed:

{ "directory": "example/static/js" }

Create a folder example/templates for any templates we may have.

How to do it...

To get started, follow the instructions below:

Create a folder example/templates, and include a file index.html as follows:

{% load staticfiles %}
<html>
  <head>
  </head>
  <body>
    <div class='example' id='example'></div>
    <script src='{% static "js/jquery/jquery.js" %}'></script>
    <script src='{% static "js/highcharts/highcharts.js" %}'></script>
    <script type='text/javascript'>
      $(document).ready(function() {
        var options = {
          chart: {
            type: 'bar',
            events: {
              load: function () {
                var self = this;
                setInterval(function() {
                  $.getJSON('/ajax/series', function(data) {
                    var series = self.series[0];
                    series.setData(data);
                  });
                }, 1000);
              }
            }
          },
          title: {
            text: 'Using AJAX for polling charts'
          },
          series: [{
            name: 'AJAX data (series)',
            data: []
          }]
        };
        $('#example').highcharts(options);
      });
    </script>
  </body>
</html>

Edit example/example/settings.py and include the following at the end of the file:

STATIC_URL = '/static/'
TEMPLATE_DIRS = (
    os.path.join(BASE_DIR, 'templates/'),
)
STATICFILES_DIRS = (
    os.path.join(BASE_DIR, 'static/'),
)

Create a file example/example/views.py and create a handler to show our page:

from django.shortcuts import render_to_response

def index(request):
    return render_to_response('index.html')

Edit example/example/views.py and create a handler to serve our data:

import json
from random import randint
from django.http import HttpResponse
from django.shortcuts import render_to_response

def index(request):
    return render_to_response('index.html')

def series(request):
    results = []
    for i in xrange(1, 11):
        results.append({
            'y': randint(0, 100)
        })
    json_results = json.dumps(results)
    return HttpResponse(json_results, mimetype='application/json')

Edit example/example/urls.py to register our URL handlers:

from django.conf.urls import patterns, include, url
from django.contrib import admin
admin.autodiscover()

import views

urlpatterns = patterns('',
    # Examples:
    # url(r'^$', 'example.views.home', name='home'),
    # url(r'^blog/', include('blog.urls')),
    url(r'^admin/', include(admin.site.urls)),
    url(r'^/?$', views.index, name='index'),
    url(r'^ajax/series/?$', views.series, name='series'),
)

Run the following command from the django folder to start the server:

python example/manage.py runserver

Observe the page by visiting http://localhost:8000
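Note that the views.py shown above targets the Python and Django versions current at the time of writing (Python 2.7 and the mimetype argument of HttpResponse). If you happen to try the recipe on a newer Django release (1.7 or later), where that argument was removed, the same handler can be written with JsonResponse instead; this is only a hedged variant for newer versions, not part of the original recipe:

# example/example/views.py -- alternative series view for newer Django versions
from random import randint

from django.http import JsonResponse


def series(request):
    results = [{'y': randint(0, 100)} for i in range(1, 11)]
    # safe=False is required because we are serializing a list, not a dict
    return JsonResponse(results, safe=False)

Either version can be checked quickly by opening http://localhost:8000/ajax/series in a browser and confirming that a JSON array of ten points is returned.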

Using SHOW EXPLAIN with running queries

Packt
13 Mar 2014
5 min read
(For more resources related to this topic, see here.)

Getting ready

Import the ISFDB database, which is available under Creative Commons licensing.

How to do it...

Open a terminal window, launch the mysql command-line client, and connect to the isfdb database using the following statement:

mysql isfdb

Next, we open another terminal window and launch another instance of the mysql command-line client.

Run the following command in the first window:

ALTER TABLE title_relationships DROP KEY titles;

Next, in the first window, start the following example query:

SELECT titles.title_id AS ID,
    titles.title_title AS Title,
    authors.author_legalname AS Name,
    (SELECT COUNT(DISTINCT title_relationships.review_id)
       FROM title_relationships
      WHERE title_relationships.title_id = titles.title_id) AS reviews
FROM titles, authors, canonical_author
WHERE (SELECT COUNT(DISTINCT title_relationships.review_id)
         FROM title_relationships
        WHERE title_relationships.title_id = titles.title_id) >= 10
  AND canonical_author.author_id = authors.author_id
  AND canonical_author.title_id = titles.title_id
  AND titles.title_parent = 0;

Wait for at least a minute and then run the following query to look for the details of the query that we executed in step 4 and the QUERY_ID for that query:

SELECT INFO, TIME, ID, QUERY_ID
  FROM INFORMATION_SCHEMA.PROCESSLIST
 WHERE TIME > 60\G

Run SHOW EXPLAIN in the second window (replace id in the following command line with the numeric ID that we discovered in step 5):

SHOW EXPLAIN FOR id

Run the following command in the second window to kill the query running in the first window (replace query_id in the following command line with the numeric QUERY_ID number that we discovered in step 5):

KILL QUERY ID query_id;

In the first window, reverse the change we made in step 3 using the following command:

ALTER TABLE title_relationships ADD KEY titles (title_id);

How it works...

The SHOW EXPLAIN statement allows us to obtain information about how MariaDB executes a long-running statement. This is very useful for identifying bottlenecks in our database. The query in this article will execute efficiently only if it touches the indexes in our data. So, for demonstration purposes, we will first sabotage the title_relationships table by removing the titles index. This causes our query to unnecessarily iterate through hundreds of thousands of rows and generally take far too long to complete. The output of steps 3 and 4 will look similar to the following screenshot: While our sabotaged query is running, and after waiting for at least a minute, we switch to another window and look for all queries that have been running for longer than 60 seconds. Our sabotaged query will likely be the only one in the output. From this output, we get ID and QUERY_ID. The output of the command will look like the following with the ID and QUERY_ID as the last two items: Next, we use the ID number to execute SHOW EXPLAIN for our query. Incidentally, our query looks up all titles in the database that have 10 or more reviews and displays the title, author, and the number of reviews that the title has. The EXPLAIN for our query will look similar to the following screenshot: An easy-to-read version of this EXPLAIN is available at https://mariadb.org/ea/8v65g. Looking at rows 4 and 5 of EXPLAIN, it's easy to see why our query runs for so long. These two rows are dependent subqueries of the primary query (the first row).
In the first query, we see that 117044 rows will be searched, and then, for the two dependent subqueries, MariaDB searches through 83389 additional rows, twice. Ouch. If we were analyzing a slow query in the real world at this point, we would fix the query to not have such an inefficient subquery, or we would add a KEY to our table to make the subquery efficient. If we're part of a larger development team, we could send the output of SHOW EXPLAIN and the query to the appropriate people to easily and accurately show them what the problem is with the query. In our case, we know exactly what to do; we will add back the KEY that we removed earlier. For fun, after adding back the KEY, we could rerun the query and the SHOW EXPLAIN command to see the difference that having the KEY in place makes. We'll have to be quick though, as with the KEY there, the query will only take a few seconds to complete (depending on the speed of our computer). There's more... The output of SHOW EXPLAIN is always accompanied by a warning. The purpose of this warning is to show us the command that is being run. After running SHOW EXPLAIN on a process ID, we simply issue SHOW WARNINGS\G and we will see what SQL statement the process ID is running: This is useful for very long-running commands that were started so long ago that, by the time we check on them, we might no longer remember the exact statement we ran. In the examples of this article, we're using "\G" as the delimiter instead of the more common ";" so that the data fits the page better. We can use either one. See also The full documentation of the KILL QUERY ID command can be found at https://mariadb.com/kb/en/data-manipulation-kill-connectionquery/ The full documentation of the SHOW EXPLAIN command can be found at https://mariadb.com/kb/en/show-explain/ Summary In this article, we saw the functionality of the SHOW EXPLAIN feature after altering the database using various queries. Further information regarding the SHOW EXPLAIN command can be found in the official documents provided in the preceding section. Resources for Article: Further resources on this subject: Installing MariaDB on Windows and Mac OS X [Article] A Look Inside a MySQL Daemon Plugin [Article] Visual MySQL Database Design in MySQL Workbench [Article]

Self-service reporting

Packt
11 Mar 2014
7 min read
(For more resources related to this topic, see here.) Server 2012 Power View – Self-service Reporting Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There will be some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use as well as deploying those reporting tools. However, once that is done, IT will be freed of creating reports so that they can work on other tasks. Instead, the people who know the data best—the business users—will be able to build the reports. Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and the look of the report (the business requirements). The IT person may not know the data that well, so they will have to get educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting. The IT person may take away an incorrect assumption of what data the report should contain or how it should look. Then, the IT person goes back and creates the report. A week or so goes by and he shows the user the report. Then, they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took way longer than it should have. This is where a self-service reporting tool such as Power View comes in. It is so intuitive and easy to use that most business users can start developing reports with it with little or no training. The interface is so visually appealing that it makes report writing fun. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than ever before. In this article, we will cover the major features and functions of Power View, including the setup, various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Semantic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports. Getting started Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been seamlessly integrated and built directly into Excel 2013 and made available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface.
The following screenshot gives an example of a type of report you can build with Power View, which includes various types of visualizations: Sales Dashboard The following screenshot is another example of a Power View report that makes heavy use of slicers along with a bar chart and tables: Promotion Dashboard We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013. PowerPivot It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together that you can then analyze and aggregate all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View. PowerPivot is implemented in an application window separate from Excel that gives you the ability to do such things as insert and delete columns, format text, hide columns from client tools, change column names, and add images. Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint (a PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data). This will allow you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps: Open the Excel file that contains the PowerPivot workbook. Select the File tab on the ribbon. If using Excel 2013, click on Save As and then click on Browse and enter the SharePoint location of the PowerPivot Gallery (see the next screenshot). If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse. Click on Save and the file will then be uploaded to SharePoint and immediately be made available to others. Saving files to the PowerPivot Gallery A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file. Business Intelligence Semantic Model Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that provides one model for all end-user experiences. It is a hybrid model that exposes two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formerly called VertiPaq), all of which are hosted in SQL Server Analysis Services (SSAS). The tabular data model provides the architecture and optimization in a format that is the same as the data storage method used by PowerPivot, which uses an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook.
Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode to make it available for others to use. This provides a great way to create a self-service BI solution, and then make it a department solution and then an enterprise solution, as shown: Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way. Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (in which the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time. Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department. Security is added and then the model is deployed to SSAS so the entire company can use it. Summary In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We talked about PowerPivot and how it relates to Power View. Resources for Article: Further resources on this subject: Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article] Microsoft SQL Server 2008 - Installation Made Easy [Article] Best Practices for Microsoft SQL Server 2008 R2 Administration [Article]
