Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-measures-and-measure-groups-microsoft-analysis-services-part-1

15 Oct 2009

12 min read

Measures and Measure Groups in Microsoft Analysis Services: Part 1

15 Oct 2009

In this two-part article by Chris Webb, we will look at measures and measure groups, ways to control how measures aggregate up, and how dimensions can be related to measure groups. In this part, will cover useful properties of measures, along with built-in measure aggregation types and dimension calculations. Measures and aggregation Measures are the numeric values that our users want to aggregate, slice, dice and otherwise analyze, and as a result, it's important to make sure they behave the way we want them to. One of the fundamental reasons for using Analysis Services is that, unlike a relational database it allows us to build into our cube design business rules about measures: how they should be formatted, how they should aggregate up, how they interact with specific dimensions and so on. It's therefore no surprise that we'll spend a lot of our cube development time thinking about measures. Useful properties of Measures Apart from the AggregateFunction property of a measure, which we'll come to next, there are two other important properties we'll want to set on a measure, once we've created it. Format string The Format String property of a measure specifies how the raw value of the measure gets formatted when it's displayed in query results. Almost all client tools will display the formatted value of a measure, and this allows us to ensure consistent formatting of a measure across all applications that display data from our cube. A notable exception is Excel 2003 and earlier versions, which can only display raw measure values and not formatted values. Excel 2007 will display properly formatted measure values in most cases, but not all. For instance, it ignores the fourth section of the Format String which controls formatting for nulls. Reporting Services can display formatted values in reports, but doesn't by default; this blog entry describes how you can make it do so: http://tinyurl.com/gregformatstring. There are a number of built-in formats that you can choose from, and you can also build your own by using syntax very similar to the one used by Visual BASIC for Applications (VBA) for number formatting. The Books Online topic FORMAT_STRING Contents gives a complete description of the syntax used. Here are some points to bear in mind when setting the Format String property: If you're working with percentage values, using the % symbol will display your values multiplied by one hundred and add a percentage sign to the end. Note that only the display value gets multiplied by hundred—the real value of the measure will not be, so although your user might see a value of 98% the actual value of the cell would be 0.98. If you have a measure that returns null values in some circumstances and you want your users to see something other than null, don't try to use a MDX calculation to replace the nulls—this will cause severe query performance problems. You can use the fourth section of the Format String property to do this instead—for example, the following: #,#.00;#,#.00;0;NA will display the string NA for null values, while keeping the actual cell value as null without affecting performance. Be careful while using the Currency built-in format: it will format values with the currency symbol for the locale specified in the Language property of the cube. This combination of the Currency format and the Language property is frequently recommended for formatting measures that contain monetary values, but setting this property will also affect the way number formats are displayed: for example, in the UK and the USA, the comma is used as a thousands separator, but in continental Europe it is used as a decimal separator. As a result, if you wanted to display a currency value to a user in a locale that didn't use that currency, then you could end up with confusing results. The value €100,101 would be interpreted as a value just over one hundred Euros to a user in France, but in the UK, it would be interpreted as a value of just over one hundred thousand Euros. You can use the desired currency symbol in a Format String instead, for example '$#,#.00', but this will not have an effect on the thousands and decimal separators used, which will always correspond to the Language setting. You can find an example of how to change the language property using a scoped assignment in the MDX Script here: http://tinyurl.com/gregformatstring. Similarly, while Analysis Services 2008 supports the translation of captions and member names for users in different locales, unlike in previous versions, it will not translate the number formats used. As a result, if your cube might be used by users in different locales you need to ensure they understand whichever number format the cube is using. Display folders Many cubes have a lot of measures on them, and as with dimension hierarchies, it's possible to group measures together into folders to make it easier for your users to find the one they want. Most, but not all, client tools support display folders, so it may be worth checking whether the one you intend to use does. By default each measure group in a cube will have its own folder containing all of the measures on the measure group; these top level measure group folders cannot be removed and it's not possible to make a measure from one measure group appear in a folder under another measure group. By entering a folder name in a measure's Display Folder property, you'll make the measure appear in a folder underneath its measure group with that name; if there isn't already a folder with that name, then one will be created, and folder names are case-sensitive. You can make a measure appear under multiple folders by entering a semi-colon delimited list of names as follows: Folder One; Folder Two. You can also create a folder hierarchy by entering either a forward-slash / or back-slash delimited list (the documentation contradicts itself on which is meant to be used—most client tools that support display folders support both) of folder names as follows: Folder One; Folder TwoFolder Three. Calculated measures defined in the MDX Script can also be associated with a measure group, through the Associated_Measure_Group property, and with a display folder through the Display_Folder property. These properties can be set either in code or in Form View in the Calculations tab in the Cube Editor: If you don't associate a calculated measure with a measure group, but do put it in a folder, the folder will appear at the same level as the folders created for each measure group. Built-in measure aggregation types The most important property of a measure is AggregateFunction; it controls how the measure aggregates up through each hierarchy in the cube. When you run an MDX query, you can think of it as being similar to a SQL SELECT statement with a GROUP BY clause—but whereas in SQL you have to specify an aggregate function to control how each column's values get aggregated, in MDX you specify this for each measure when the cube is designed. Basic aggregation types Anyone with a passing knowledge of SQL will understand the four basic aggregation types available when setting the AggregateFunction property: Sum is the commonest aggregation type to use, probably the one you'll use for 90% of all the measures. It means that the values for this measure will be summed up. Count is another commonly used property value and aggregates either by counting the overall number of rows from the fact table that the measure group is built from (when the Binding Type property, found on the Measure Source dialog that appears when you click on the ellipses button next to the Source property of a measure, is set to Row Binding), or by counting non-null values from a specific measure column (when Binding Type property is set to Column Binding). Min and Max return the minimum and maximum measure values. There isn't a built-in Average aggregation type—as we'll soon see, AverageOfChildren does not do a simple average—but it's very easy to create a calculated measure that returns an average by dividing a measure with AggregateFunction Sum by one with AggregateFunction Count, for example: CREATE MEMBER CURRENTCUBE.[Measures].[Average Measure Example] ASIIF([Measures].[Count Measure]=0, NULL,[Measures].[Sum Measure]/[Measures].[Count Measure]); Distinct Count The DistinctCount aggregation type counts the number of distinct values in a column in your fact table, similar to a Count(Distinct) in SQL. It's generally used in scenarios where you're counting some kind of key, for example, finding the number of unique Customers who bought a particular product in a given time period. This is, by its very nature, an expensive operation for Analysis Services and queries that use DistinctCount measures can perform worse than those which use additive measures. It is possible to get distinct count values using MDX calculations but this almost always performs worse; it is also possible to use many-to-many dimensions to get the same results and this may perform better in some circumstances; see the section on "Distinct Count" in the "Many to Many Revolution" white paper, available at http://tinyurl.com/m2mrev. When you create a new distinct count measure, BIDS will create a new measure group to hold it automatically. Each distinct count measure needs to be put into its own measure group for query performance reasons, and although it is possible to override BIDS and create a distinct count measure in an existing measure group with measures that have other aggregation types, we strongly recommend that you do not do this. None The None aggregation type simply means that no aggregation takes place on the measure at all. Although it might seem that a measure with this aggregation type displays no values at all, that's not true: it only contains values at the lowest possible granularity in the cube, at the intersection of the key attributes of all the dimensions. It's very rarely used, and only makes sense for values such as prices that should never be aggregated. If you ever find that your cube seems to contain no data even though it has processed successfully, check to see if you have accidentally deleted the Calculate statement from the beginning of your MDX Script. Without this statement, no aggregation will take place within the cube and you'll only see data at the intersection of the leaves of every dimension, as if every measure had AggregateFunction None. Semi-additive aggregation types The semi-additive aggregation types are: AverageOfChildren FirstChild LastChild FirstNonEmpty LastNonEmpty They behave the same as measures with aggregation type Sum on all dimensions except Time dimensions. In order to get Analysis Services to recognize a Time dimension, you'll need to have set the dimension's Type property to Time in the Dimension Editor. Sometimes you'll have multiple, role-playing Time dimensions in a cube, and if you have semi-additive measures, they'll be semi-additive for just one of these Time dimensions. In this situation, Analysis Services 2008 RTM uses the first Time dimension in the cube that has a relationship with the measure group containing the semi-additive measure. You can control the order of dimensions in the cube by dragging and dropping them in the Dimensions pane in the bottom left-hand corner of the Cube Structure tab of the Cube Editor; the following blog entry describes how to do this in more detail: http://tinyurl.com/gregsemiadd. However, this behavior has changed between versions in the past and may change again in the future. Semi-additive aggregation is extremely useful when you have a fact table that contains snapshot data. For example, if you had a fact table containing information on the number of items in stock in a warehouse, then it would never make sense to aggregate these measures over time: if you had ten widgets in stock on January 1, eleven in stock on January 2, eight on January 3 and so on, the value you would want to display for the whole of January would never be the sum of the number of items in stock on each day in January. The value you do display depends on your organization's business rules. Let's take a look at what each of the semi-additive measure values actually do: AverageOfChildren displays the average of all the values at the lowest level of granularity on the Time dimension. So, for example, if Date was the lowest level of granularity, when looking at a Year value, then Analysis Services would display the average value for all days in the year. FirstChild displays the value of the first time period at the lowest level of granularity, for example, the first day of the year. LastChild displays the value of the last time period at the lowest level of granularity, for example, the last day of the year. FirstNonEmpty displays the value of the first time period at the lowest level of granularity that is not empty, for example the first day of the year that has a value. LastNonEmpty displays the value of the last time period at the lowest level of granularity that is not empty, for example the last day of the year that has a value. This is the most commonly used semi-additive type; a good example of its use would be where the measure group contains data about stock levels in a warehouse, so when you aggregated along the Time dimension what you'd want to see is the amount of stock you had at the end of the current time period. The following screenshot of an Excel pivot table illustrates how each of these semi-additive aggregation types works: Note that the semi-additive measures only have an effect above the lowest level of granularity on a Time dimension. For dates like July 17th in the screenshot above, where there is no data for the Sum measure, the LastNonEmpty measure still returns null and not the value of the last non-empty date.

0
0
6863

article-image-measures-and-measure-groups-microsoft-analysis-services-part-2

Packt

15 Oct 2009

20 min read

Measures and Measure Groups in Microsoft Analysis Services: Part 2

Packt

15 Oct 2009

20 min read

Measure groups All but the simplest data warehouses will contain multiple fact tables, and Analysis Services allows you to build a single cube on top of multiple fact tables through the creation of multiple measure groups. These measure groups can contain different dimensions and be at different granularities, but so long as you model your cube correctly, your users will be able to use measures from each of these measure groups in their queries easily and without worrying about the underlying complexity. Creating multiple measure groups To create a new measure group in the Cube Editor, go to the Cube Structure tab and right-click on the cube name in the Measures pane and select 'New Measure Group'. You'll then need to select the fact table to create the measure group from and then the new measure group will be created; any columns that aren't used as foreign key columns in the DSV will automatically be created as measures, and you'll also get an extra measure of aggregation type Count. It's a good idea to delete any measures you are not going to use at this stage. Once you've created a new measure group, BIDS will try to set up relationships between it and any existing dimensions in your cube based on the relationships you've defined in your DSV. Since doing this manually can be time-consuming, this is another great reason for defining relationships in the DSV. You can check the relationships that have been created on the Dimension Usage tab of the Cube Editor: In Analysis Services 2005, it was true in some cases that query performance was better on cubes with fewer measure groups, and that breaking a large cube with many measure groups up into many smaller cubes with only one or two measure groups could result in faster queries. This is no longer the case in Analysis Services 2008. Although there are other reasons why you might want to consider creating separate cubes for each measure group, this is still something of a controversial subject amongst Analysis Services developers. The advantages of a single cube approach are: All of your data is in one place. If your users need to display measures from multiple measure groups, or you need to create calculations that span measure groups, everything is already in place. You only have one cube to manage security and calculations on; with multiple cubes the same security and calculations might have to be duplicated. The advantages of the multiple cube approach are: If you have a complex cube but have to use Standard Edition, you cannot use Perspectives to hide complexity from your users. In this case, creating multiple cubes might be a more user-friendly approach. Depending on your requirements, security might be easier to manage with multiple cubes. It's very easy to grant or deny a role access to a cube; it's much harder to use dimension security to control which measures and dimensions in a multi-measure group cube a role can access. If you have complex calculations, especially MDX Script assignments, it's too easy to write a calculation that has an effect on part of the cube you didn't want to alter. With multiple cubes, the chances of this happening are reduced. Creating measure groups from dimension tables Measure groups don't always have to be created from fact tables. In many cases, it can be useful to build measure groups from dimension tables too. One common scenario where you might want to do this is when you want to create a measure that counts the number of days in the currently selected time period, so if you had selected a year on your Time dimension's hierarchy, the measure would show the number of days in the year. You could implement this with a calculated measure in MDX, but it would be hard to write code that worked in all possible circumstances, such as when a user multi-selects time periods. In fact, it's a better idea to create a new measure group from your Time dimension table containing a new measure with AggregateFunction Count, so you're simply counting the number of days as the number of rows in the dimension table. This measure will perform faster and always return the values you expect. This post on Mosha Pasumansky's blog discusses the problem in more detail: http://tinyurl.com/moshadays MDX formulas vs pre-calculating valuesIf you can somehow model a calculation into the structure of your cube, or perform it in your ETL, you should do so in preference to doing it in MDX only so long as you do not compromise the functionality of your cube. A pure MDX approach will be the most flexible and maintainable since it only involves writing code, and if calculation logic needs to change, then you just need to redeploy your updated MDX Script; doing calculations upstream in the ETL can be much more time-consuming to implement and if you decide to change your calculation logic, then it could involve reloading one or more tables. However, an MDX calculation, even one that is properly tuned, will of course never perform as well as a pre-calculated value or a regular measure. The day count measure, discussed in the previous paragraph, is a perfect example of where a cube-modeling approach trumps MDX. If your aim was to create a measure that showed average daily sales, though, it would make no sense to try to pre-calculate all possible values since that would be far too time-consuming and would result in a non-aggregatable measure. The best solution here would be a hybrid: create real measures for sales and day count, and then create an MDX calculated measure that divided the former by the latter. However, it's always necessary to consider the type of calculation, the volume of data involved and the chances of the calculation algorithm changing in the future before you can make an informed decision on which approach to take. Handling different dimensionality When you have different measure groups in a cube, they are almost always going to have different dimensions associated with them; indeed, if you have measure groups that have identical dimensionality, you might consider combining them into a single measure group if it is convenient to do so. As we've already seen, the Dimension Usage tab shows us which dimensions have relationships with which measure groups. When a dimension has a relationship with a measure group it goes without saying that making a selection on that dimension will affect the values that are displayed for measures on that measure group. But what happens to measures when you make a selection on a dimension that has no relationship with a measure group? In fact, you have two options here, controlled by the IgnoreUnrelatedDimensions property of a measure group: IgnoreUnrelatedDimensions=False displays a null value for all members below the root (the intersection of all of the All Members or default members on every hierarchy) of the dimension, except the Unknown member, or IgnoreUnrelatedDimensions=True repeats the value displayed at the root of the dimension for every member on every hierarchy of the dimension. This is the default state. The screenshot below shows what happens for two otherwise identical measures from measure groups which have IgnoreUnrelatedDimensions set to True and to False when they're displayed next to a dimension they have no relationship with: It's usually best to keep IgnoreUnrelatedDimensions set to True since if the users are querying measures from multiple measure groups, then they don't want some of their selected measures suddenly returning null if they slice by a dimension that has a regular relationship with their other selected measures. Handling different granularities Even when measure groups share the same dimensions, they may not share the same granularity. For example, we may hold sales information in one fact table down to the day level, but also hold sales quotas in another fact table at the quarter level. If we created measure groups from both these fact tables, then they would both have regular relationships with our Time dimension but at different granularities. Normally, when you create a regular relationship between a dimension and a measure group, Analysis Services will join the columns specified in the KeyColumns property of the key attribute of the dimension with the appropriate foreign key columns of the fact table (note that during processing, Analysis Services won't usually do the join in SQL, it does it internally). However, when you have a fact table of a higher granularity, you need to change the granularity attribute property of the relationship to choose the attribute from the dimension you do want to join on instead: In the previous screenshot, we can see an amber warning triangle telling us that by selecting a non-key attribute, the server may have trouble aggregating measure values. What does this mean exactly? Let's take a look at the attribute relationships defined on our Time dimension again: If we're loading data at the Quarter level, what do we expect to see at the Month and Date level? We can only expect to see useful values at the level of the granularity attribute we've chosen, and for only those attributes whose values can be derived from that attribute; this is yet another good reason to make sure your attribute relationships have been optimized. Below the granularity attribute, we've got the same options regarding what gets displayed as we had with dimensions that have no relationship at all with a measure group: either repeated values or null values. The IgnoreUnrelatedDimensions property is again used to control this behavior. Unfortunately, the default True setting for IgnoreUnrelatedDimensions is usually not the option you want to use in this scenario (users usually prefer to see nulls below the granularity of a measure in our experience) and this may conflict with how we want to set IgnoreUnrelatedDimensions to control the behavior of dimensions which have no relationship with a measure group. There are ways of resolving this conflict such as using MDX Script assignments to set cell values to null or by using the ValidMeasure() MDX function, but none are particularly elegant. Non-aggregatable measures: a different approach We've already seen how we can use parent/child hierarchies to load non-aggregatable measure values into our cube. However, given the problems associated with using parent/child hierarchies and knowing what we now know about measure groups, let's consider a different approach to solving this problem. A non-aggregatable measure will have, by its very nature, data stored for many different granularities of a dimension. Rather than storing all of these different granularities of values in the same fact table, we could create multiple fact tables for each granularity of value. Having built measure groups from these fact tables, we would then be able to join our dimension to each of them with a regular relationship but at different granularities. We'd then be in the position of having multiple measures representing the different granularities of a single, logical measure. What we actually want is a single non-aggregatable measure, and we can get this by using MDX Script assignments to combine the different granularities. Let's say we have a regular (non-parent/child) dimension called Employee with three attributes Manager, Team Leader and Sales Person, and a logical non-aggregatable measure called Sales Quota appearing in three measure groups as three measures called Sales Amount Quota_Manager, Sales Amount Quota_TeamLead and Sales Amount Quota for each of these three granularities. Here's a screenshot showing what a query against this cube would show at this stage: We can combine the three measures into one like this: SCOPE([Measures].[Sales Amount Quota]); SCOPE([Employee].[Salesperson].[All]); THIS=[Measures].[Sales Amount Quota_TeamLead]; END SCOPE; SCOPE([Employee].[Team Lead].[All]); THIS=[Measures].[Sales Amount Quota_Manager]; END SCOPE;END SCOPE; This code takes the lowest granularity measure Sales Amount Quota, and then overwrites it twice: the first assignment replaces all of the values above the Sales Person granularity with the value of the measure containing Sales Amount Quota for Team Leaders; the second assignment then replaces all of the values above the Team Leader granularity with the value of the measure containing Sales Quotas for Managers. Once we've set Visible=False for the Sales Amount Quota_TeamLead and Sales Amount Quota_Manager measures, we're left with just the Sales Amount Quota measure visible, thus displaying the non-aggregatable values that we wanted. The user would then see this: Using linked dimensions and measure groups Creating linked dimensions and measure groups allows you to share the same dimensions and measure groups across separate Analysis Services databases, and the same measure group across multiple cubes. To do this, all you need to do is to run the 'New Linked Object' wizard from the Cube Editor, either by clicking on the button in the toolbar on the Cube Structure or Dimension Usage tabs, or by selecting it from the right-click menu in the Measures pane of the Cube Structure tab. Doing this has the advantage of reducing the amount of processing and maintenance needed: instead of having many identical dimensions and measure groups to maintain and keep synchronized, all of which need processing separately, you can have a single object which only needs to be changed and processed once. At least that's the theory—in practice, linked objects are not as widely used as they could be because there are a number of limitations in their use: Linked objects represent a static snapshot of the metadata of the source object, and any changes to the source object are not passed through to the linked object. So for example, if you create a linked dimension and then add an attribute to the source dimension, you then have to delete and recreate the linked dimension—there's no option to refresh a linked object. You can also import the calculations defined in the MDX Script of the source cube using the wizard. However, you can only import the entire script and this may include references to objects present in the source cube that aren't in the target cube, and which may need to be deleted to prevent errors. The calculations that remain will also need to be updated manually when those in the source cube are changed, and if there are a lot, this can add an unwelcome maintenance overhead. A linked measure group can only be used with dimensions from the same database as the source measure group. This isn't a problem when you're sharing measure groups between cubes in the same database, but could be if you wanted to share measure groups across databases. As you would expect, when you query a linked measure group, your query is redirected to the source measure group. If the source measure group is on a different server, this may introduce some latency and hurt query performance. Analysis Services does try to mitigate this by doing some caching on the linked measure group's database, though. By default, it will cache data on a per-query basis, but if you change the RefreshPolicy property from ByQuery to ByInterval you can specify a time limit for data to be held in cache. Linked objects can be useful when cube development is split between multiple development teams, or when you need to create multiple cubes containing some shared data, but, in general, we recommend against using them widely because of these limitations. Role-playing dimensions It's also possible to add the same dimension to a cube more than once, and give each instance a different relationship to the same measure group. For example, in our Sales fact table, we might have several different foreign key columns that join to our Time dimension table: one which holds the date an order was placed on, one which holds the date it was shipped from the warehouse, and one which holds the date the order should arrive with the customer. In Analysis Services, we can create a single physical Time dimension in our database, which is referred to as a database dimension, and then add it three times to the cube to create three 'cube dimensions', renaming each cube dimension to something like Order Date, Ship Date and Due Date. These three cube dimensions are referred to as role-playing dimensions: the same dimension is playing three different roles in the same cube. Role playing dimensions are a very useful feature. They reduce maintenance overheads because you only need to edit one dimension, and unlike linked dimensions, any changes made to the underlying database dimension are propagated to all of the cube dimensions that are based on it. They also reduce processing time because you only need to process the database dimension once. However, there is one frustrating limitation with role-playing dimensions and that is that while you can override certain properties of the database dimension on a per-cube dimension basis, you can't change the name of any of the attributes or hierarchies of a cube dimension. So if you have a user hierarchy called 'Calendar' on your database dimension, all of your cube dimensions will also have a user hierarchy called 'Calendar', and your users might find it difficult to tell which hierarchy is which in certain client tools (Excel 2003 is particularly bad in this respect) or in reports. Unfortunately, we have seen numerous cases where this problem alone meant role-playing dimensions couldn't be used. Dimension/measure group relationships So far we've seen dimensions either having no relationship with a measure group or having a regular relationship, but that's not the whole story: there are many different types of relationships that a dimension can have with a measure group. Here's the complete list: No relationship Regular Fact Referenced Many-to-Many Data Mining Fact relationships Fact or degenerate dimensions are dimensions that are built directly from columns in a fact table, not from a separate dimension table. From an Analysis Services dimension point of view, they are no different from any other kind of dimension, except that there is a special fact relationship type that a dimension can have with a measure group. There are in fact very few differences between a fact relationship and a regular relationship, and they are: A fact relationship will result in marginally more efficient SQL being generated when the fact dimension is used in ROLAP drillthrough. Fact relationships are visible to client tools in the cube's metadata, so client tools may choose to display fact dimensions differently. A fact relationship can only be defined on dimensions and measure groups that are based on the same table in the DSV. A measure group can only have a fact relationship with one database dimension. It can have more than one fact relationship, but all of them have to be with cube dimensions based on the same database dimension. It still makes sense though to define relationships as fact relationships when you can. Apart from the reasons given above, the functionality might change in future versions of Analysis Services and fact relationship types might be further optimized in some way. Referenced relationships A referenced relationship is where a dimension joins to a measure group through another dimension. For example, you might have a Customer dimension that includes geographic attributes up to and including a customer's country; also, your organization might divide the world up into international regions such as North America, Europe, Middle East and Africa (EMEA), Latin America (LATAM) and Asia-Pacific and so on for financial reporting, and you might build a dimension for this too. If your sales fact table only contained a foreign key for the Customer dimension, but you wanted to analyze sales by international region, you would be able to create a referenced relationship from the Region dimension through the Customer dimension to the Sales measure group. When setting up a referenced relationship in the Define Relationship dialog in the Dimension Usage tab, you're asked to first choose the dimension that you wish to join through and then which attribute on the reference dimension joins to which attribute on the intermediate dimension: When the join is made between the attributes you've chosen on the reference dimension, once again it's the values in the columns that are defined in the KeyColumns property of each attribute that you're in fact joining on. The Materialize checkbox is automatically checked, and this ensures maximum query performance by resolving the join between the dimensions at processing time, which can lead to a significant decrease in processing performance. Unchecking this box means that no penalty is paid at processing time but query performance may be worse. The question you may well be asking yourself at this stage is: why bother to use referenced relationships at all? It is in fact a good question to ask, because, in general, it's better to include all of the attributes you need in a single Analysis Services dimension built from multiple tables rather than use a referenced relationship. The single dimension approach will perform better and is more user-friendly: for example, you can't define user hierarchies that span a reference dimension and its intermediate dimension. That said, there are situations where referenced relationships are useful because it's simply not feasible to add all of the attributes you need to a dimension. You might have a Customer dimension, for instance, that has a number of attributes representing dates—the date of a customer's first purchase, the date of a customer's tenth purchase, the date of a customer's last purchase and so on. If you had created these attributes with keys that matched the surrogate keys of your Time dimension, you could create multiple, referenced (but not materialized) role-playing Time dimensions joined to each of these attributes that would give you the ability to analyze each of these dates. You certainly wouldn't want to duplicate all of the attributes from your Time dimension for each of these dates in your Customer dimension. Another good use for referenced relationships is when you want to create multiple parent/child hierarchies from the same dimension table Data mining relationships The data mining functionality of Analysis Services is outside the scope of this article, so we won't spend much time on the data mining relationship type. Suffice to say that when you create an Analysis Services mining structure from data sourced from a cube, you have the option of using that mining structure as the source for a special type of dimension, called a data mining dimension. The wizard will also create a new cube containing linked copies of all of the dimensions and measure groups in the source cube, plus the new data mining dimension, which then has a data mining relationships with the measure groups. Summary In this part, we focused on how to create new measure groups and handle the problems of different dimensionality and granularity, and looked at the different types of relationships that are possible between dimensions and measure groups.

0
0
9171

Packt

15 Oct 2009

4 min read

Deployment of Reports with BIRT

Packt

15 Oct 2009

4 min read

Everything in this article uses utilities from the BIRT Runtime installation package, available from the BIRT homepage at http://www.eclipse.org/birt. BIRT Viewer The BIRT Viewer is a J2EE application that is designed to demonstrate how to implement the Report Engine API to execute reports in an online web application. For most basic uses—such as for small to medium size Intranet applications—this is an appropriate approach. The point to keep in mind about the BIRT Web Viewer is that it is an example application. It can be used as a baseline for more sophisticated web applications that will implement the BIRT Report Engine API. Installation of the BIRT Viewer is documented at a number of places. The Eclipse BIRT website has some great tutorials at: http://www.eclipse.org/birt/phoenix/deploy/viewerSetup.php http://wiki.eclipse.org/BIRT/FAQ/Deployment This is also documented on my website in a series of articles introducing people to BIRT: http://digiassn.blogspot.com/2005/10/birt-report-server-pt-2.html I won't go into the details about installing Apache Tomcat as this is covered in depth in other locations, but I will cover how to install the Viewer in a Tomcat environment. For the most part these instructions can be used in other J2EE containers, such as WebSphere. In some cases a WAR package is used instead. I prefer Tomcat because it is a widely used open-source J2EE environment. Under the BIRT Runtime package is a folder containing an example Web Viewer application. The Web Viewer is a useful application as you require basic report viewing capabilities, such as parameter passing, pagination, and export capabilities to formats such as Word, Excel, RTF, and CSV. For this example, I have Apache Tomcat 5.5 installed into a folder at C:apache-tomcat-5.5.25. To install the Web Viewer, I simply need to copy the WebViewerExample folder from the BIRT Runtime to the web application folder at C:apache-tomcat-5.5.25webapps. Accessing the BIRT Web Viewer is as simple as calling the WebViewerExample Context. When copying the WebViewerExample folder, you can rename this folder to anything you want. Obviously WebViewerExample is not a good name for an online web application. So in the following screenshot, I renamed the WebViewerExample folder to birtViewer, and am accessing the BIRT Web Viewer test report. Installing Reports into the Web Viewer Once the BIRT Viewer is set up, Deploying reports is as simple as copying the report design files, Libraries, or report documents into the application's Context, and calling it with the appropriate URL parameters. For example, we will install the reports from the Classic Cars – With Library folder into the BIRT Web Viewer at birtViewer. In order for these reports to work, all dependent Libraries need to be installed with the reports. In the case of the example application, we currently have the report folder set to the Root of the web application folder. Accessing Reports in the Web Viewer Accessing reports is as simple as passing the correct parameters to the Web Viewer. In the BIRT Web Viewer, there are seven servlets that you can call to run reports, which are as follows: frameset run preview download parameter document output Out of these, you will only need frameset and run as the other servlets are for Engine-related purposes, such as the preview for the Eclipse designer, the parameter Dialog, and the download of report documents. Out of the these two servlets, frameset is the one that is typically used for user interaction with reports, as it provides the pagination options, parameter Dialogs, table of contents viewing, and export and print Dialogs. The run servlet only provides report output. There are a few URL parameters for the BIRT Web Viewer, such as: __format : which is the output format, either HTML or PDF. __isnull: which sets a Report Parameter to null, parameter name as a value. __locale: which is the reports locale. __report: which is the report design file to run. __document: which is the report document file to open. Any remaining URL parameter will be treated as a Report Parameter. In the following image, I am running the Employee_Sales_Percentage.rptdesign file with the startDate and endDate parameters set.

0
0
5537

article-image-essbase-aso-aggregate-storage-option

Packt

14 Oct 2009

5 min read

Essbase ASO (Aggregate Storage Option)

Packt

14 Oct 2009

5 min read

Welcome to the exciting world of Essbase Analytics known as the Aggregate Storage Option (ASO). Well, now you're ready to take everything one step further. You see, the BSO architecture used by Essbase is the original database architecture as the behind the scenes method of data storage in an Essbase database. The ASO method is entirely different. What is ASO ASO is Essbase's alternative to the sometimes cumbersome BSO method of storing data in an Essbase database. In fact, it is BSO that is exactly what makes Essbase a superior OLAP analytical tool but it is also the BSO that can occasionally be a detriment to the level of system performance demanded in today's business world. In a BSO database, all data is stored, except for dynamically calculated members. All data consolidations and parent-child relationships in the database outline are stored as well. While the block storage method is quite efficient from a data to size ratio perspective, a BSO database can require large amounts of overhead to deliver the retrieval performance demanded by the business customer. The ASO database efficiently stores not only zero level data, but can also store aggregated hierarchical data with the understandings that stored hierarchies can only have the no-consolidation (~) or the addition (+) operator assigned to them and the no-consolidation (~) operator can only be used underneath Label Only members. Outline member consolidations are performed on the fly using dynamic calculations and only at the time of the request for data. This is the main reason why ASO is a valuable option worth consideration when building an Essbase system for your customer. Because of the simplified levels of data stored in the ASO database, a more simplified method of storing the physical data on the disk can also be used. It is this simplified storage method which can help result in higher performance for the customer. Your choice of one database type over the other will always depend on balancing the customer's needs with the server's physical capabilities, along with the volume of data. These factors must be given equal consideration. Creating an aggregate storage Application|Database Believe it or not, creating an ASO Essbase application and database is as easy as creating a BSO application and database. All you need to do is follow these simple steps: Right-click on the server name in your EAS console for the server on which you want to create your ASO application. Select Create application | Using aggregate storage as shown in the following screenshot: Click on Using aggregate storage and that's it. The rest of the steps are easy to follow and basically the same as for a BSO application. To create an ASO application and database, you follow virtually the same steps as you do to create a BSO application and database. However, there are some important differences, and here we list a few: A BSO database outline can be converted into an Aggregate Storage database outline, but an Aggregate Storage database outline cannot be converted into a Block Storage database outline.Steps to convert a BSO application into an ASO application: Open the BSO outline that you wish to convert, select the Essbase database and click on the File | Wizards | Aggregate Storage Outline Conversion option. You will see the first screen Select Source Outline. The source of the outline can be in a file system or on the Essbase Server. In this case, we have selected the OTL from the Essbase Server and then click Next as shown in the following screenshot: In the Next screen, the conversion wizard will verify the conversion and display a message that the conversion has completed successfully. Click Next. Here, Essbase prompts you to select the destination of the ASO outline. If you have not yet created an ASO application, you can click on the Create Aggregate Storage Application on the bottom-right corner of the screen as shown in the next screenshot: Enter the Application and the Database name and click on OK. Your new ASO application is created, now click on Finish. Your BSO application is now converted into an ASO application. You may still need to tweak the ASO application settings and outline members to be the best fit for your needs. In an ASO database, all dimensions are Sparse so there is no need to try to determine the best Dense/Sparse settings as you would do with a BSO database. Although Essbase recommends that you only have one Essbase database in an Essbase application, you can create more than one database per application when you are using the BSO. When you create an ASO application, Essbase will only allow one database per application. There is quite a bit to know about ASO but have no fear, with all that you know about Essbase and how to design and build an Essbase system, it will seem easy for you. Keep reading for more valuable information on the ASO for things like, when it is a good time to use ASO, or how do you query ASO databases effectively, or even what are the differences between ASO and BSO. If you understand the differences, you can then understand the benefits.

0
0
6791

article-image-data-access-adonet-data-services

Packt

14 Oct 2009

3 min read

Data Access with ADO.NET Data Services

Packt

14 Oct 2009

3 min read

In essence, ADO.NET Data Services serves up data as a web service over HTTP using the objects in the Entity Model of the data and conforms to the principles of REST (Representational State Transfer) wherein everything is a kind of resource with a name and they can be manipulated using the well known HTTP Verbs; Get, Put, Post and Delete. The data can be returned in a number of formats including RSS and ATOM. According to W3C, "Atom" is an XML-based document format that describes lists of related information known as "feeds". Feeds are composed of a number of items, known as "entries", each with an extensible set of attached "metadata". Create an ASP.NET Web Application Create an ASP.NET Web application as shown below and make sure the target platform is 3.5. Create the Entity Model Right click the Application in Solution Explorer and click on Add New Item. In the Add New Item window click on ADO.NET Entity Data Model and change the default name Model1.edmx to one of your choice. Here it is MyFirst.edmx. Then click on the Add button. This brings up the Entity Data Model Wizard as shown. Read the notes on this page. Now click on the Generate from Database and click Next. This opens the 'Choose Your Data Connection' page of the wizard as shown with some default connection as shown. Change it over to the TestNorthwind database on SQL Server 2008 Enterprise Edition (RTM). TestNorthwind is a copy of the Northwind database and click Next. Of course you can create a new connection as well. The connection is saved to the Web.Config file. Click on the Next button. The wizard changes page so that you can 'Choose your Database Objects' This may take some time as the program accesses the database and the following page is displayed. The model uses the relational data in the database. Place a check mark for the Tables to include all the tables in the database. The TestNorthwindModel namespace gets created. Click on the Finish button. This brings up the ModelBrowser browsing the MyFirst.edmx file as shown. The MyFirst.edmx file contents are shown in the next figure. By right clicking inside the MyFirst.edmx you can choose the Zoom level for the display. The connected items shown are the tables with relationships in the database. Those standing alone are just tables that are not a part of the original database and can be removed by right clicking and deleting them. The tables retained finally for the model are as shown. Make use of the Zoom controls to adjust to your comfort level. As seen in the above each table has both Scalar Properties as well as Navigation Properties. The Navigation properties shows the relationship with other tables. In the OrderDetails table shown, the connections extend to the Orders table and the Products table. On the other hand the Orders table is connected to, Customers, Employees, Order_Details and Shippers. In the Solution Explorer the MyFirst.edmx file consists of the TestNorthwindModel and the TestNorthwindModel.store folders. The model folder consist of the Entities and the Assoications(relationships between entities) and the EntityContainer as shown.

0
0
2076

article-image-migrating-ms-sql-server-2008-enterprisedb

Packt

09 Oct 2009

2 min read

Migrating from MS SQL Server 2008 to EnterpriseDB

Packt

09 Oct 2009

2 min read

With many database vendor products in the market and data intensive applications using them, it is often required to port the application to use the data or, migrate the data so that the application can use it. Migration of data is therefore one of the realities of the IT Industry. Some of the author's previous articles on migration can be found at this link. You may find more if you do a search on his blog site. Table to be migrated in SQL Server 2008 The following figure shows the Categories table in the Microsoft SQL Server 2008's Management Studio that will be migrated to the Postgres database. Creating a database in Postgres Studio Right click Databases node in the Advanced Server 8.3 and click on New Database... menu as shown. The New Database... window gets displayed as shown. Create an empty database PGNorthwind in Postgres Studio by entering information shown in the next figure. This creates a new database and related objects as shown in the next figure. This also creates the script in the Properties pane as shown. Review the properties. The database may be dropped using the Drop Database statement. Starting the Migration Studio Click on Start | All Programs | Postgres Advanced Server 8.3 to display the drop-down menu as shown. Click on the Migration Studio drop-down item. This opens the EnterpriseDB Migration Studio 8.3(Migration Studio for the rest of the tutorial) with a modal form with the title Edit Server Advanced Server 8.3(localhost:5432). This is the server we installed in the previous tutorial. Enter the User Name and Password and click OK. You will get a message displaying the result as shown in the next figure. Click OK to both the open windows and the EnterpriseDB Migration Studio shows up as shown here. Click on the File in the main menu on the Migration Studio and pick Add Server. This brings up the Add Server window with a default as shown.

0
0
2311

article-image-remote-job-agent-oracle-11g-database-oracle-scheduler

Packt

09 Oct 2009

7 min read

Remote Job Agent in Oracle 11g Database with Oracle Scheduler

Packt

09 Oct 2009

7 min read

Oracle Scheduler in Oracle 10g is a very powerful tool. However, Oracle 11g has many added advantages that give you more power. In this article by Ronald Rood, we will get our hands on the most important addition to the Scheduler—the remote job agent . This is a whole new kind of process, which allows us to run jobs on machines that do not have a running database. However, they must have Oracle Scheduler Agent installed, as this agent is responsible for executing the remote job. This gives us a lot of extra power and also solves the process owner's problem that exists in classical local external jobs. In classical local external jobs, the process owner is by default nobody and is controlled by $ORACLE_HOME/rdbms/admin/externaljob.ora. This creates problems in installation, where the software is shared between multiple databases because it is not possible to separate the processes. In this article, we will start by installing the software, and then see how we can make good use of it. After this, you will want to get rid of the classical local external jobs as soon as possible because you will want to embrace all the improvements in the remote job agent over the old job type. Security Anything that runs on our database server can cause havoc to our databases. No matter what happens, we want to be sure that our databases cannot be harmed. As we have no control over the contents of scripts that can be called from the database, it seems logical not to have these scripts run by the same operating system user who also owns the Oracle database files and processes. This is why, by default, Oracle chose the user nobody as the default user to run the classical local external jobs. This can be adjusted by editing the contents of $ORACLE_HOME/rdbms/admin/externaljob.ora. On systems where more databases are using the same $ORACLE_HOME directory, this automatically means that all the databases run their external jobs using the same operating system account. This is not very flexible. Luckily for us, Oracle has changed this in the 11g release where remote external jobs are introduced. In this release, Oracle decoupled the job runner process and the database processes. The job runner process, that is the job agent, now runs as a remote process and is contacted using a host:port combination over TCP/IP. The complete name for the agent is remote job agent, but this does not mean the job agent can be installed only remotely. It can be installed on the same machine where the database runs, and where it can easily replace the old-fashioned remote jobs. As the communication is done by TCP/IP, this job agent process can be run using any account on the machine. Oracle has no recommendations for the account, but this could very well be nobody. The operating system user who runs the job agent does need some privileges in the $ORACLE_HOME directory of the remote job agent, namely, an execution privilege on $ORACLE_HOME/bin/* as well as read privileges on $ORACLE_HOME/lib/*. At the end of the day, the user has to be able to use the software. The remote job agent should also have the ability to write its administration (log) in a location that (by default) is in $ORACLE_HOME/data, but it can be configured to a different location by setting the EXECUTION_AGENT_DATA environment variable. In 11g, Oracle also introduced a new object type called CREDENTIAL. We can create credentials using dbms_scheduler.create_credential. This allows us to administrate which operating system user is going to run our jobs in the database. This also allows us to have control over who can use this credential. To see which credentials are defined, we can use the *_SCHEDULER_CREDENTIAL views. We can grant access to a credential by granting execute privilege on the credential. This adds lots more control than we ever had in Oracle 10gR2. Currently, the Scheduler Agent can only use a username-password combination to authenticate against the operating system. The jobs scheduled on the remote job agent will run using the account specified in the credential that we use in the job definition. Check the Creating job section to see how this works. This does introduce a small problem in maintenance. On many systems, customers are forced to use security policies such as password aging. When combining with credentials, this might cause a credential to become invalid. Any change in the password of a job runtime account needs to be reflected in the credential definition that uses the account. As we get much more control over who executes a job, it is strongly recommend to use the new remote job agent in favor of the classical local external jobs, even locally. The classical external job type will soon become history. A quick glimpse with a wireshark, a network sniffer, does not reveal the credentials in the clear text, so it looks like it's secure by default. However, the job results do pass in clear text. The agent and the database communicate using SSL and because of this, a certificate is installed in the ${EXECUTION_AGENT_DATA}/agent.key. You can check this certificate using Firefox. Just point your browser to the host:port where the Scheduler Agent is running and use Firefox to examine the certificate. There is a bug in 11.1.0.6 that generates a certificate with an expiration date of 90 days past the agent's registration date. In such a case, you will start receiving certificate validation errors when trying to launch a job. Stopping the agent can solve this. Just remove the agent.key and re-register the agent with the database. The registration will be explained in this article shortly. Installation on Windows We need to get the software before the installation can take place. The Scheduler Agent can be found on the Transparent Gateways disk, which can be downloaded from Oracle technet at http://www.oracle.com/technology/software/products/database/index.html. There's no direct link to this software, so find a platform of your choice and click on See All to get the complete list of database software products for that platform. Then download the Oracle Database Gateways CD. Unzip the installation CD, and then navigate to the setup program found in the top level folder and start it. The following screenshot shows the download directory where you run the setup file: After running the setup, the following Welcome screen will appear. The installation process is simple. Click on the Next button to continue to the product selection screen. Select Oracle Scheduler Agent 11.1.0.6.0 and click on the Next button to continue. Enter Name and Path for ORACLE_HOME (we can keep the default values). Now click on Next to reach the screen where we can choose a port on which the database can contact the agent. I chose 15021. On Unix systems, pick a port above 1023 because the lower ports require root privileges to open. The port should be unused and easily memorizable, and should not be used by the database's listener process. If possible, keep all the remote job agents registered to the same database and the same port. Also, don't forget to open the firewall for that port. Hitting the Next button brings us to the following Summary screen: We click on the Install button to complete the installation. If everything goes as expected, the End of Installation screen pops up as follows: Click on the Exit button and confirm the exit. We can find Oracle Execution Agent in the services control panel. Make sure it is running when you want to use the agent to run jobs.

0
0
2853

article-image-debugging-scheduler-oracle-11g-databases

Packt

08 Oct 2009

5 min read

Debugging the Scheduler in Oracle 11g Databases

Packt

08 Oct 2009

5 min read

Unix—all releases Something that has not been made very clear in the Oracle Scheduler documentation is that redirection cannot be used in jobs (<, >, >>, |, &&, ||). Therefore, many developers have tried to use it. So, let's keep in mind that we cannot use redirection, not in 11g as well as older releases of the database. The scripts must be executable, so don't forget to set the execution bits. This might seem like knocking down an open door, but it's easily forgotten. The user (who is the process owner of the external job and is nobody:nobody by default) should be able to execute the $ORACLE_HOME/bin/extjob file. In Unix, this means that the user should have execution permissions on all the parent directories of this file. This is not something specific to Oracle; it's just the way a Unix file system works. Really! Check it out. Since 10gR1, Oracle does not give execution privileges to others. A simple test for this is to try starting SQL*Plus as a user who is neither the Oracle installation user, nor a member of the DBA group—but a regular user. If you get all kinds of errors, then it implies that the permissions are not correct, assuming that the environment variables (ORACLE_HOME and PATH) are set up correctly. The $ORACLE_HOME/install/changePerm.sh script can fix the permissions within ORACLE_HOME (for 10g). In Oracle 11g, this again changed and is no longer needed. The Scheduler interprets the return code of external jobs and records it in the *_scheduler_job_run_details view. This interpretation can be very misleading, especially when using your own exit codes. For example, when you code your script to check the number of arguments, and code an exit 1 when you find the incorrect number of arguments, the error number is translated to ORA-27301: OS failure message:No such file or directory by Oracle using the code in errno.h. In 11g, the Scheduler also records the return code in the error# column. This lets us recognize the error code better and find where it is raised in the script that ran, when the error codes are unique within the script. When Oracle started with Scheduler, there were some quick changes. Here are the most important changes listed that could cause us problems when the definitions of the mentioned files are not exactly as listed: 10.2.0.1: $ORACLE_HOME/bin/extjob should be owned by the user who runs the jobs (process owner) and have 6550 permissions (setuid process owner). In a regular notation, that is what ls –l shows, and the privileges should be -r-sr-s---. 10.2.0.2: $ORACLE_HOME/rdbms/admin/externaljob.ora should be owned by root. This file is owned by the Oracle user (the user who installed Oracle) and the Oracle install group with 644 permissions or -rw-r—-r--, as shown by ls –l. This file controls which operating system user is going to be the process owner, or which user is going to run the job. The default contents of this file are as shown in the following screenshot: $ORACLE_HOME/bin/extjob must be the setuid root (permissions 4750 or -rwsr-x---) and executable for the Oracle install group, where the setuid root means that the root should be the owner of the file. This also means that while executing this binary, we temporarily get root privileges on the system. $ ORACLE_HOME/bin/extjobo should have normal 755 or -rwxr-xr-x permissions,and be owned by the normal Oracle software owner and group. If this file is missing, just copy it from $ORACLE_HOME/bin/extjob. On AIX, this is the first release that has external job support. 11g release: I n 11g, the same files as in 10.2.0.2 exist with the same permissions. But $ORACLE_HOME/bin/jssu is owned by root and the Oracle install group with the setuid root (permissions 4750 or -rwsr-x---). It is undoubtedly best to stop using the old 10g external jobs and migrate to the 11g external jobs with credentials as soon as possible. The security of the remote external jobs is better because of the use of credentials instead of falling back to the contents of a single file in $ORACLE_HOME/, and the flexibility is much better. In 11g, the process owner of the remote external jobs is controlled by the credential and not by a file. Windows usage On Windows, this is a little easier with regard to file system security. The OracleJobscheduler service must exist in a running state, and the user who runs this service should have the Logon as batch job privilege. A .batfile cannot be run directly, but should be called as an argument of cmd.exe, for example: --/BEGINDBMS_SCHEDULER.create_job(job_name => 'env_windows',job_type => 'EXECUTABLE',number_of_arguments => 2,job_action => 'C:windowssystem32cmd.exe',auto_drop => FALSE,enabled => FALSE);DBMS_SCHEDULER.set_job_argument_value('env_windows',1,'/c');DBMS_SCHEDULER.set_job_argument_value('env_windows',2,'d:temptest.bat');end;/ This job named env_windows calls cmd.exe, which eventually runs the script named test.bat that we created in d:temp. When the script we want to call needs arguments, they should be listed from argument number 3 onwards.

0
0
3088

article-image-indexing-data-solr-14-enterprise-search-server-part2

Packt

05 Oct 2009

9 min read

Indexing Data in Solr 1.4 Enterprise Search Server: Part2

Packt

05 Oct 2009

9 min read

(For more resources on Solr, see here.) Direct database and XML import The capability for Solr to get data directly from a database or HTTP GET accessible XML is distributed with Solr as a contrib module, and it is known as the DataImportHandler (DIH in short). The complete reference documentation for this capability is here at http://wiki.apache.org/solr/DataImportHandler, and it's rather thorough. In this article, we'll only walk through an example to see how it can be used with the MusicBrainz data set. In short, the DIH offers the following capabilities: Imports data from databases through JDBC (Java Database Connectivity) Imports XML data from a URL (HTTP GET) or a file Can combine data from different tables or sources in various ways Extraction/Transformation of the data Import of updated (delta) data from a database, assuming a last-updated date A diagnostic/development web page Extensible to support alternative data sources and transformation steps As the MusicBrainz data is in a database, the most direct method to get data into Solr is definitely through the DIH using JDBC. Getting started with DIH DIH is not a direct part of Solr. Hence it might not be included in your Solr setup. It amounts to a JAR file named something like apache-solr-dataimporthandler-1.4.jar, which is probably already embedded within the solr.war file. You can use jar -tf solr.war to see. Alternatively, it may be placed in <solr-home>/lib, which is alongside the conf directory we've been working with. For database connectivity, we need to ensure that the JDBC driver is on the Java classpath. Placing it in <solr-home>/lib is a convenient way to do this. The DIH needs to be registered with Solr in solrconfig.xml. Here is how it is done: <requestHandler name="/dataimport"class="org.apache.solr.handler.dataimport.DataImportHandler"><lst name="defaults"><str name="config">mb-dih-artists-jdbc.xml</str></lst></requestHandler> mb-dih-artists-jdbc.xml (mb being short for MusicBrainz) is a file in <solr-home>/conf, which is used to configure DIH. It is possible to specify some configuration aspects in this request handler configuration instead of the dedicated configuration file. However, I recommend that it all be in the DIHconfig file, as in our example here. Given below is an mb-dih-artists-jdbc.xml file with a rather long SQL query: <dataConfig> <dataSource name="jdbc" driver="org.postgresql.Driver" url="jdbc:postgresql://localhost/musicbrainz_db" user="musicbrainz" readOnly="true" autoCommit="false" /> <document> <entity name="artist" dataSource="jdbc" pk="id" query=" select a.id as id, a.name as a_name, a.sortname as a_name_sort, a.begindate as a_begin_date, a.enddate as a_end_date, a.type as a_type ,array_to_string( array(select aa.name from artistalias aa where aa.ref = a.id ) , '|') as a_alias ,array_to_string( array(select am.name from v_artist_members am where am.band = a.id order by am.id) , '|') as a_member_name ,array_to_string( array(select am.id from v_artist_members am where am.band = a.id order by am.id) , '|') as a_member_id, (select re.releasedate from release re inner join album r on re.album = r.id where r.artist = a.id order by releasedate desc limit 1) as a_release_date_latest from artist a " transformer="RegexTransformer,DateFormatTransformer, TemplateTransformer"> <field column = "id" template="Artist:${artist.id}" /> <field column = "type" template="Artist" /> <field column = "a_begin_date" dateTimeFormat="yyyy-MM-dd" /> <field column = "a_end_date" dateTimeFormat="yyyy-MM-dd" /> <field column = "a_alias" splitBy="|" /> <field column = "a_member_name" splitBy="|"/> <field column = "a_member_id" splitBy="|" /> </entity> </document></dataConfig> The DIH development console Before describing the configuration details, we're going to take a look at the DIH development console. It is accessed by going to this URL (modifications may be needed for your host, port, core, and so on):http://localhost:8983/solr/admin/dataimport.jsp The development console looks like the following screenshot: The screen is divided into two panes: on the left is the DIH control form, which includes an editable version of the DIH configuration file and on the right is the command output as raw XML. The screen works quite simply. The form essentially results in submitting a URL to the right pane. There's no real server-side logic to this interface beyond the standard DIH command invocations being executed on the right. The last section on DIH in this article goes into more detail on submitting a command to the DIH. DIH DataSources of type JdbcDataSource The DIH configuration file starts with the declaration of one or more data sources using the element <dataSource/>, which refers to either a database, a file, or an HTTP URL, depending on the type attribute. It defaults to a value of JdbcDataSource. Those familiar with JDBC should find the driver and url attributes with accompanying user and password straightforward—consult the documentation for your driver/database for further information. readOnly is a boolean that will set a variety of other JDBC options appropriately when set to true. And batchSize is an alias for the JDBC fetchSize and defaults to 500. There are numerous JDBC oriented attributes that can be set as well. I would not normally recommend learning about a feature by reading source code, but this is an exception. For further information, read org.apache.solr.handler.dataimport.JdbcDataSource.java Efficient JDBC configuration Many database drivers in the default configurations (including those for PostgreSQL and MySQL) fetch all of the query results into the memory instead of on-demand or using a batch/fetch size. This may work well for typical database usage like OLTP (Online Transaction Processing systems), but is completely unworkable for ETL (Extract Transform and Load) usage such as this. Configuring the driver to stream the data requires driver-specific configuration options. You may need to consult relevant documentation for the JDBC driver. For PostgreSQL, set autoCommit to false. For MySQL, set batchSize to -1(The DIH detects the -1 and replaces it with Java's Integer.MIN_VALUE, which triggers the special behavior in MySQL's JDBC driver). For Microsoft SQL Server, set responseBuffering to adaptive. Further information about specific databases is at :http://wiki.apache.org/solr/DataImportHandlerFaq.. DIH documents, entities After the declaration of <dataSource/> element(s) is the <document/> element. In turn, this element contains one or more <entity/> elements. In this sample configuration, we're only getting artists. However, if we wanted to have more than one type in the same index, then another could be added. The dataSource attribute references a correspondingly named element earlier. It is only necessary if there are multiple to choose from, but we've put it here explicitly anyway. The main piece of an entity used with a JDBC data source is the query attribute, which is the SQL query to be evaluated. You'll notice that this query involves some sub-queries, which are made into arrays and then transformed into strings joined by spaces. The particular functions used to do these sorts of things are generally database specific. This is done to shoe-horn multi-valued data into a single row in the results. It may create a more complicated query, but it does mean that the database does all of the heavy lifting so that all of the data Solr needs for an artist is in the row. An alternative with DIH is to declare other entities within the entity. If you aren't using a database or if you wish to mix in another data source (even if it's of a different type), then you will be forced to do that. See the Solr DIH Wiki page for examples: http://wiki.apache.org/solr/DataImportHandler. The DIH also supports a delta query, which is a query that selects time-stamped data with dates after the last queried date. This won't be covered here, but you can find more information at the previous URL. DIH fields and transformers Within the <entity/> are some <field/>elements that declare how the columns in the query map to Solr. The field element must have a column attribute that matches the corresponding named column in the SQL query. The name attribute is the Solr schema field name that the column is going into. If it is not specified (and it never is for our example), then it defaults to the column name. Use the SQL as a keyword as we've done to use the same names as the Solr schema instead of the database schema. This reduces the number of explicit mappings needed in <field/> elements and shortens existing ones. When a column in the result can be placed directly into Solr without further processing, there is no need to specify the field declaration, because it is implied. An attribute of the entity declaration that we didn't mention yet is transformer. This declares a comma-separated list of transformers that manipulate the transfer of data from the JDBC resultset into a Solr field. These transformers evaluate a field, if it has an attribute it uses to do its job. More than one might operate on a given field. Therefore, the order in which the transformers are declared in matters. Here are the attributes we've used: template: It is used by TemplateTransformer and declares text, which might include variable name substitutions using ${name} syntax. To access an existing field, use the entityname.columnname syntax. splitBy: It is used by RegexTransformer and splits a single string value into a multi-value by looking for the specified character. dateTimeFormat: It is used by DateFormatTransformer. This is a Java date/time format pattern http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html). If the type of the field in the schema is a date, then it is necessary to ensure Solr can interpret the format. Alternatively, ensure that the string matches the ISO-8601 format, which looks like this: 1976-10-23T23:59:59.000Z. As in all cases in Solr, when specifying dates you can use its so-called "DateMath" syntax such as appending /DAY to tell Solr to round the date to a day.

0
0
2202

article-image-indexing-data-solr-14-enterprise-search-server-part1

Packt

05 Oct 2009

6 min read

Indexing Data in Solr 1.4 Enterprise Search Server: Part1

Packt

05 Oct 2009

6 min read

(For more resources on Solr, see here.) Lets get started. Communicating with Solr There are a few dimensions to the options available for communicating with Solr: Direct HTTP or a convenient client API Applications interact with Solr over HTTP. This can either be done directly (by hand, but by using an HTTP client of your choice), or it might be facilitated by a Solr integration API such as SolrJ or Solr Flare, which in turn use HTTP. An exception to HTTP is offered by SolrJ, which can optionally be used in an embedded fashion with Solr (so-called Embedded Solr) to avoid network and inter process communication altogether. However, unless you are sure you really want to embed Solr within another application, this option is discouraged in favor of writing a custom Solr updating request handler. Data streamed remotely or from Solr's Filesystem Even though an application will be communicating with Solr over HTTP, it does not have to send Solr data over this channel. Solr supports what it calls remote streaming. Instead of giving Solr the data directly, it is given a URL that it will resolve. It might be an HTTP URL, but more likely it is a filesystem based URL, applicable when the data is already on Solr's machine. Finally, in the case of Solr's DataImportHandler, the data can be fetched from a database. Data formats The following are the different data formats: Solr-XML: Solr has a specific XML schema it uses to specify documents and their fields. It supports instructions to delete documents and to perform optimizes and commits too. Solr-binary: Analogous to Solr-XML, it is an efficient binary representation of the same structure. This is only supported by the SolrJ client API. CSV: CSV is a character separated value format (often a comma). Rich documents like PDF, XLS, DOC, PPT to Solr: The text data extracted from these formats is directed to a particular field in your Solr schema. Finally, Solr's DIH DataImportHandler contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example: web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates. We'll use the XML, CSV, and DIH options in bringing the MusicBrainz data into Solr from its database to demonstrate Solr's capability. Most likely, an application would use just one format. Before these approaches are described, we'll discuss curl and remote streaming, which are foundational topics. Using curl to interact with Solr Solr receives commands (and possibly the associated data) through HTTP POST. Solr lets you use HTTP GET too (for example, through your web browser). However, this is an inappropriate HTTP verb if it causes something to change on the server, as happens with indexing. For more information on this concept, read about REST at: http://en.wikipedia.org/wiki/Representational_State_Transfer. One way to send an HTTP POST is through the Unix command line program curl (also available on Windows through Cygwin). Even if you don't use curl, it is very important to know how we're going to use it, because the concepts will be applied no matter how you make the HTTP messages. There are several ways to tell Solr to index data, and all of them are through HTTP POST: Send the data as the entire POST payload (only applicable to Solr's XML format). curl does this with data-binary (or some similar options) and an appropriate content-type header reflecting that it's XML. Send some name-value pairs akin to an HTML form submission. With curl, such pairs are proceeded by -F. If you're giving data to Solr to be indexed (as opposed to it looking for it in a database), then there are a few ways to do that: Put the data into the stream.body parameter. If it's small, perhaps less than a megabyte, then this approach is fine. The limit is configured with the multipartUploadLimitInKB setting in solrconfig.xml. Refer to the data through either a local file on the Solr server using the stream.file parameter or a URL that Solr will fetch it from through the stream.url parameter. These choices are a feature that Solr calls remote streaming. Here is an example of the first choice. Let's say we have an XML file named artists.xml in the current directory. We can post it to Solr using the following command line: curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' --data-binary @artists.xml If it succeeds, then you'll have output that looks like this: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int><int name="QTime">128</int> </lst> </response> To use the solr.body feature for the example above, you would do this: curl http://localhost:8983/solr/update -F [email protected] In both cases, the @ character instructs curl to get the data from the file instead of being @artists.xml literally. If the XML is short, then you can just as easily specify it literally on the command line: curl http://localhost:8983/solr/update -F stream.body=' <commit />' Notice the leading space in the value. This was intentional. In this example, curl treats @ and < to mean things we don't want. In this case, it might be more appropriate to use form-string instead of -F. However, it's more typing, and I'm feeling lazy. Remote streaming In the examples above, we've given Solr the data to index in the HTTP message. Alternatively, the POST request can give Solr a pointer to the data in the form of either a file path accessible to Solr or an HTTP URL to it. The file path is accessed by the Solr server on its machine, not the client, and it must also have the necessary operating system file permissions too. However, just as before, the originating request does not return a response until Solr has finished processing it. If you're sending a large CSV file, then it is practical to use remote streaming. Otherwise, if the file is of a decent size or is already at some known URL, then you may find remote streaming faster and/or more convenient, depending on your situation. Here is an example of Solr accessing a local file: curl http://localhost:8983/solr/update -F stream.file=/tmp/artists.xml To use a URL, the parameter would change to stream.url, and we'd specify a URL. We're passing a name-value parameter (stream.file and the path), not the actual data. Remote streaming must be enabled In order to use remote streaming (stream.file or stream.url), you must enable it in solrconfig.xml. It is disabled by default and is configured on a line that looks like this: <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

0
0
2319

article-image-time-dimension-oracle-warehouse-builder-11g

Packt

05 Oct 2009

4 min read

Time Dimension in Oracle Warehouse Builder 11g

Packt

05 Oct 2009

4 min read

Let's discuss briefly what a Time dimension is, and then we'll dive right into the Warehouse Builder Design Center and create one. A Time dimension is a key part of most data warehouses. It provides the time series information to describe our data. A key feature of data warehouses is being able to analyze data from several time periods and compare results between them. The Time dimension is what provides us the means to retrieve data by time period. Do not be confused by the use of the word Time to refer to this dimension. In this case, it does not refer to the time of day but to time in general which can span days, weeks, months, and so on. We are using it because the Warehouse Builder uses the word Time for this type of dimension to signify a time period. So when referring to a Time dimension here, we will be talking about our time period dimension that we will be using to store the date. We will give the name Date to be clear about what information it contains. Every dimension, whether time or not, has four characteristics that have to be defined in OWB: Levels Dimension Attributes Level Attributes Hierarchies The Levels are for defining the levels where aggregations will occur, or to which data can be summed. We must have at least two levels in our Time dimension. While reporting on data from our data warehouse, users will want to see totals summed up by certain time periods such as per day, per month, or per year. These become the levels. A multidimensional implementation includes metadata to enable aggregations automatically at those levels, if we use the OLAP feature. The relational implementation can make use of those levels in queries to sum the data. The Warehouse Builder has the following Levels available for the Time dimension: Day Fiscal week Calendar week Fiscal month Calendar month Fiscal quarter Calendar quarter Fiscal year Calendar year The Dimension Attributes are individual pieces of information we're going to store in the dimension that can be found at more than one level. Each level will have an ID that identifies that level, a start and an end date for the time period represented at that level, a time span that indicates the number of days in the period, and a description of the level. Each level has Level Attributes associated with it that provide descriptive information about the value in that level. The dimension attributes found at that level and additional attributes specific to the level are included. For example, if we're talking about the Month level, we will find attributes that describe the value for the month such as the month of the year it represents, or the month in the calendar quarter. These would be numbers indicating which month of the year or which month of the quarter it is. The Oracle Warehouse Builder Users' Guide contains a more complete list of all the attributes that are available. OWB tracks which of these attributes are applicable to which level and allows the setting of a separate description that identifies the attribute for that level. Toward the end of the chapter, when we look at the Data Object Editor, we'll see the feature provided by the Warehouse Builder to view details about objects such as dimensions and cubes. We must also define at least one Hierarchy for our Time dimension. A hierarchy is a structure in our dimension that is composed of certain levels in order; there can be one or more hierarchies in a dimension. Calendar month, calendar quarter, and calendar year can be a hierarchy. We could view our data at each of these levels, and the next level up would simply be a summation of all the lower-level data within that period. A calendar quarter sum would be the sum of all the values in the calendar month level in that quarter, and the multidimensional implementation includes the metadata to facilitate these kinds of calculations. This is one of the strengths of a multidimensional implementation. The good news is that the Warehouse Builder contains a wizard that will do all the work for us—create our Time dimension and define the above four characteristics—just by asking us a few questions.

0
0
1627

article-image-adding-reporting-capabilities-our-java-applications-using-jasperreports-35

Packt

01 Oct 2009

12 min read

Adding Reporting Capabilities to our Java Applications Using JasperReports 3.5

Packt

01 Oct 2009

12 min read

0
0
2043

article-image-designing-target-structure-oracle-warehouse-builder-11g

Packt

01 Oct 2009

6 min read

Designing the Target Structure in Oracle Warehouse Builder 11g

Packt

01 Oct 2009

6 min read

We have our entire source structures defined in the Warehouse Builder. But before we can do anything with them, we need to design what our target data warehouse structure is going to look like. When we have that figured out, we can start mapping data from the source to the target. So, let's design our target structure. First, we're going to take a look at some design topics related to a data warehouse that are different from what we would use if we were designing a regular relational database. Data Warehouse Design When it comes to the design of a data warehouse, there is basically one option that makes the most sense for how we will structure our database and that is the dimensional model. This is a way of looking at the data from a business perspective that makes the data simple, understandable, and easy to query for the business end user. It doesn't require a database administrator to be able to retrieve data from it. We know the normalized method of modelling a database. A normalized model removes redundancies in data by storing information in discrete tables, and then referencing those tables when needed. This has an advantage for a transactional system because information needs to be entered at only one place in the database, without duplicating any information already entered. For example, in the ACME Toys and Gizmos transactional database, each time a transaction is recorded for the sale of an item at a register, a record needs to be added only to the transactions table. In the table, all details regarding the information to identify the register, the item information, and the employee who processed the transaction do not need to be entered because that information is already stored in separate tables. The main transaction record just needs to be entered with references to all that other information. This works extremely well for a transactional type of system concerned with daily operational processing where the focus is on getting data into the system. However, it does not work well for a data warehouse whose focus is on getting data out of the system. Users do not want to navigate through the spider web of tables that compose a normalized database model to extract the information they need. Therefore, dimensional models were introduced to provide the end user with a flattened structure of easily queried tables that he or she can understand from a business perspective. Dimensional Design A dimensional model takes the business rules of our organization and represents them in the database in a more understandable way. A business manager looking at sales data is naturally going to think more along the lines of "how many gizmos did I sell last month in all stores in the south and how does that compare to how many I sold in the same month last year?" Managers just want to know what the result is, and don't want to worry about how many tables need to be joined in a complex query to get that result. A dimensional model removes the complexity and represents the data in a way that end users can relate to it more easily from a business perspective. Users can intuitively think of the data for the above question as a cube, and the edges (or dimensions) of the cube labeled as stores, products, and time frame. So let's take a look at this concept of a cube with dimensions, and how we can use that to represent our data. Cube and Dimensions The dimensions become the business characteristics about the sales, for example: A time dimension—users can look back in time and check various time periods A store dimension—information can be retrieved by store and location A product dimension—various products for sale can be broken out Think of the dimensions as the edges of a cube, and the intersection of the dimensions as the measure we are interested in for that particular combination of time, store, and product. A picture is worth a thousand words, so let's look at what we're talking about in the following image: Notice what this cube looks like. How about a Rubik's Cube? We're doing a data warehouse for a toy store company, so we ought to know what a Rubik's cube is! If you have one, maybe you should go get it now because that will exactly model what we're talking about. Think of the width of the cube, or a row going across, as the product dimension. Every piece of information or measure in the same row refers to the same product, so there are as many rows in the cube as there are products. Think of the height of the cube, or a column going up and down, as the store dimension. Every piece of information in a column represents one single store, so there are as many columns as there are stores. Finally, think of the depth of the cube as the time dimension, so any piece of information in the rows and columns at the same depth represent the same point in time. The intersection of each of these three dimensions locates a single individual cube in the big cube, and that represents the measure amount we're interested in. In this case, it's dollar sales for a single product in a single store at a single point in time. But one might wonder if we are restricted to just three dimensions with this model. After all, a cube has only three dimensions—length, width, and depth. Well, the answer is no. We can have many more dimensions than just three. In our ACME example, we might want to know the sales each employee has accomplished for the day. This would mean we would need a fourth dimension for employees. But what about our visualization above using a cube? How is this fourth dimension going to be modelled? And no, the answer is not that we're entering the Twilight Zone here with that "dimension not only of sight and sound but of mind..." We can think of additional dimensions as being cubes within a cube. If we think of an individual intersection of the three dimensions of the cube as being another cube, we can see that we've just opened up another three dimensions to use—the three for that inner cube. The Rubik's Cube example used above is good because it is literally a cube of cubes and illustrates exactly what we're talking about. We do not need to model additional cubes. The concept of cubes within cubes was just to provide a way to visualize further dimensions. We just model our main cube, add as many dimensions as we need to describe the measures, and leave it for the implementation to handle. This is a very intuitive way for users to look at the design of the data warehouse. When it's implemented in a database, it becomes easy for users to query the information from it.

0
0
1677

article-image-faceting-solr-14-enterprise-search-server

Packt

01 Oct 2009

9 min read

Faceting in Solr 1.4 Enterprise Search Server

Packt

01 Oct 2009

9 min read

(For more resources on Solr, see here.) Faceting, after searching, is arguably the second-most valuable feature in Solr. It is perhaps even the most fun you'll have, because you will learn more about your data than with any other feature. Faceting enhances search results with aggregated information over all of the documents found in the search to answer questions such as the ones mentioned below, given a search on MusicBrainz releases: How many are official, bootleg, or promotional? What were the top five most common countries in which the releases occurred? Over the past ten years, how many were released in each year? How many have names in these ranges: A-C, D-F, G-I, and so on? Given a track search, how many are < 2 minutes long, 2-3, 3-4, or more? Moreover, in addition, it can power term-suggest aka auto-complete functionality, which enables your search application to suggest a completed word that the user is typing, which is based on the most commonly occurring words starting with what they have already typed. So if a user started typing siamese dr, then Solr might suggest that dreams is the most likely word, along with other alternatives. Faceting, sometimes referred to as faceted navigation, is usually used to power user interfaces that display this summary information with clickable links that apply Solr filter queries to a subsequent search. If we revisit the comparison of search technology to databases, then faceting is more or less analogous to SQL's group by feature on a column with count(*). However, in Solr, facet processing is performed subsequent to an existing search as part of a single request-response with both the primary search results and the faceting results coming back together. In SQL, you would need to potentially perform a series of separate queries to get the same information. A quick example: Faceting release types Observe the following search results. echoParams is set to explicit (defined in solrconfig.xml) so that the search parameters are seen here. This example is using the standard handler (though perhaps dismax is more typical). The query parameter q is *:*, which matches all documents. In this case, the index I'm using only has releases. If there were non-releases in the index, then I would add a filter fq=type%3ARelease to the URL or put this in the handler configuration, as that is the data set we'll be using for most of this article. I wanted to keep this example brief so I set rows to 2. Sometimes when using faceting, you only want the facet information and not the main search, so you would set rows to 0, if that is the case. It's important to understand that the faceting numbers are computed over the entire search result, which is all of the releases in this example, and not just the two rows being returned. <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">160</int> <lst name="params"> <str name="wt">standard</str> <str name="rows">2</str> <str name="facet">true</str> <str name="q">*:*</str> <str name="fl">*,score</str> <str name="qt">standard</str> <str name="facet.field">r_official</str> <str name="f.r_official.facet.missing">true</str> <str name="f.r_official.facet.method">enum</str> <str name="indent">on</str> </lst> </lst> <result name="response" numFound="603090" start="0" maxScore="1.0"> <doc> <float name="score">1.0</float> <str name="id">Release:136192</str> <str name="r_a_id">3143</str> <str name="r_a_name">Janis Joplin</str> <arr name="r_attributes"><int>0</int><int>9</int> <int>100</int></arr> <str name="r_name">Texas International Pop Festival 11-30-69</str> <int name="r_tracks">7</int> <str name="type">Release</str> </doc> <doc> <float name="score">1.0</float> <str name="id">Release:133202</str> <str name="r_a_id">6774</str> <str name="r_a_name">The Dubliners</str> <arr name="r_attributes"><int>0</int></arr> <str name="r_lang">English</str> <str name="r_name">40 Jahre</str> <int name="r_tracks">20</int> <str name="type">Release</str> </doc> </result> <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="r_official"> <int name="Official">519168</int> <int name="Bootleg">19559</int> <int name="Promotion">16562</int> <int name="Pseudo-Release">2819</int> <int>44982</int> </lst> </lst> <lst name="facet_dates"/> </lst> </response> The facet related search parameters are highlighted at the top. The facet.missing parameter was set using the field-specific syntax, which will be explained shortly. Notice that the facet results (highlighted) follow the main search result and are given a name facet_counts. In this example, we only faceted on one field, r_official, but you'll learn in a bit that you can facet on as many fields as you desire. The name attribute holds a facet value, which is simply an indexed term, and the integer following it is the number of documents in the search results containing that term, aka a facet count. The next section gives us an explanation of where r_official and r_type came from. MusicBrainz schema changes In order to get better self-explanatory faceting results out of the r_attributes field and to split its dual-meaning, I modified the schema and added some text analysis. r_attributes is an array of numeric constants, which signify various types of releases and it's official-ness, for lack of a better word. As it represents two different things, I created two new fields: r_type and r_official with copyField directives to copy r_attributes into them: <field name="r_attributes" type="integer" multiValued="true" indexed="false" /> <field name="r_type" type="rType" multiValued="true" stored="false" /> <field name="r_official" type="rOfficial" multiValued="true" stored="false" /> And: <copyField source="r_attributes" dest="r_type" /> <copyField source="r_attributes" dest="r_official" /> In order to map the constants to human-readable definitions, I created two field types: rType and rOfficial that use a regular expression to pull out the desired numbers and a synonym list to map from the constant to the human readable definition. Conveniently, the constants for r_type are in the range 1-11, whereas r_official are 100-103. I removed the constant 0, as it seemed to be bogus. <fieldType name="rType" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.PatternReplaceFilterFactory" pattern="^(0|1dd)$" replacement="" replace="first" /> <filter class="solr.LengthFilterFactory" min="1" max="100" /> <filter class="solr.SynonymFilterFactory" synonyms="mb_attributes.txt" ignoreCase="false" expand="false"/> </analyzer> </fieldType> The definition of the type rOfficial is the same as rType, except it has this regular expression: ^(0|dd?)$. The presence of LengthFilterFactory is to ensure that no zero-length (empty-string) terms get indexed. Otherwise, this would happen because the previous regular expression reduces text fitting unwanted patterns to empty strings. The content of mb_attributes.txt is as follows: # from: http://bugs.musicbrainz.org/browser/mb_server/trunk/ # cgi-bin/MusicBrainz/Server/Release.pm#L48 #note: non-album track seems bogus; almost everything has it 0=>Non-Album Track 1=>Album 2=>Single 3=>EP 4=>Compilation 5=>Soundtrack 6=>Spokenword 7=>Interview 8=>Audiobook 9=>Live 10=>Remix 11=>Other 100=>Official 101=>Promotion 102=>Bootleg 103=>Pseudo-Release It does not matter if the user interface uses the name (for example: Official) or constant (for example: 100) when applying filter queries when implementing faceted navigation, as the text analysis will let the names through and will map the constants to the names. This is not necessarily true in a general case, but it is for the text analysis as I've configured it above. The approach I took was relatively simple, but it is not the only way to do it. Alternatively, I might have split the attributes and/or mapped them as part of the import process. This would allow me to remove the multiValued setting in r_official. Moreover, it wasn't truly necessary to map the numbers to their names, as a user interface, which is going to present the data, could very well map it on the fly. Field requirements The principal requirement of a field that will be faceted on is that it must be indexed. In addition to all but the prefix faceting use case, you will also want to use text analysis that does not tokenize the text. For example, the value Non-Album Track is indexed the way it is in r_type. We need to be careful to escape the space where this appeared in mb_attributes.txt. Otherwise, faceting on this field would show tallies for Non-Album and Track separately. Depending on the type of faceting you want to do and other needs you have like sorting, you will often find it necessary to have a copy of a field just for faceting. Remember that with faceting, the facet values returned in search results are the actual terms indexed, and not the stored value, which isn't even used.

0
0
3508

article-image-synchronizing-objects-oracle-warehouse-builder-2

Packt

29 Sep 2009

5 min read

Synchronizing Objects in Oracle Warehouse Builder

Packt

29 Sep 2009

5 min read

Synchronizing objects We created tables, dimensions, and a cube; and new tables were automatically created for each dimension and cube. We then created mappings to map data from tables to tables, dimensions, and a cube. What happens if, let's say for example, a table definition is updated after we've defined it and created a mapping or mappings that include it? What if a dimensional object is changed? In that case, what happens to the underlying table? This is what we are going to discuss in this section. One set of changes that we'll frequently find ourselves making is changes to the data we've defined for our data warehouse. We may get some new requirements that lead us to capture a new data element that we have not captured yet. We'll need to update our staging table to store it and our staging mapping to load it. Our dimension mapping(s) will need to be updated to store the new data element along with the underlying table. We could make manual edits to all the affected objects in our project, but the Warehouse Builder provides us some features to make that easier. Changes to tables Let's start the discussion by looking at table updates. If we have a new data element that needs to be captured, it will mean finding out where that data resides in our source system and updating the associated table definition in our module for that source system. Updating object definitions There are a couple of ways to update table definitions. Our choice will depend on how the table was defined in the Warehouse Builder in the first place. The two options are: It could be a table in a source database system, in which case the table was physically created in the source database and we just imported the table definition into the Warehouse Builder. It could be a table we defined in our project in the Warehouse Builder and then deployed to the target database to create it. Our staging table would be an example of this second option. In the first case, we can re-import the source table using the procedures generally used for importing source metadata. When re-importing tables, the Warehouse Builder will do a reconciliation process to update the already imported table with any changes it detects in the source table. For the second case, we can manually edit the table definition in our project to reflect the new data element. For the first case where the table is in a source database system, the action we choose also depends on whether that source table definition is in an Oracle database or a third-party database. If it is in a third-party database, we're going to encounter an error. Hence, we'll be forced to make manual edits to our metadata for that source until that bug is fixed. If the table is in an Oracle database, re-importing the table definition would not be a problem and it will do the reconciliation process, picking up any new data elements or changes to the existing ones. For a hands-on example here, let's turn to our new project that we created earlier while discussing snapshots. We copied our POS_TRANS_STAGE table over to this project, so let's use that table as an example of a changing table, as we defined the table structure manually in the Warehouse Builder Design Center and then deployed it to the target database to actually create it. For this example, we won't actually re-deploy it because we'll be using that second project we created. It doesn't have a valid location defined, but we can still edit the table definition and investigate how to reconcile that edit in the next section. So, let's edit the POS_TRANS_STAGE table in the ACME_PROJ_FOR_COPYING project in the Design Center by double-clicking on it to launch it in the Data Object Editor. We'll just add a column called STORE_AREA_SIZE to the table for storing the size of the store in square feet or square meters. We'll click on the Columns tab, scroll it all the way to the end, enter the name of the column, then select NUMBER for the data type, and leave the precision and scale to the default (that is 0) for this example. We can validate and generate the object without having a valid location defined, so we'll do that. The validation and generation should complete successfully; and if we look at the script, we'll see the new column included. We now need a mapping that uses that table, which we have back in our original project. Let's use the copy and paste technique we used earlier to copy the STAGE_MAP mapping over to this new project. We'll open the ACME_DW_PROJECT project, answering Save to the prompt to save or revert. Then on the STAGE_MAP mapping entry, we'll select Copy from the pop-up menu. We'll open the ACME_PROJ_FOR_COPYING project and then on the Mappings node, select Paste on the pop-up menu. We ordinarily won't copy an object and paste it into a whole new project just for making changes. We're only doing it here so that we can make changes without worrying about interfering with a working project.

0
0
2832

How-To Tutorials - Data

Measures and Measure Groups in Microsoft Analysis Services: Part 1

Measures and Measure Groups in Microsoft Analysis Services: Part 2

Deployment of Reports with BIRT

Essbase ASO (Aggregate Storage Option)

Data Access with ADO.NET Data Services

Migrating from MS SQL Server 2008 to EnterpriseDB

Remote Job Agent in Oracle 11g Database with Oracle Scheduler

Debugging the Scheduler in Oracle 11g Databases

Indexing Data in Solr 1.4 Enterprise Search Server: Part2

Indexing Data in Solr 1.4 Enterprise Search Server: Part1

Trending Topics

Time Dimension in Oracle Warehouse Builder 11g

Adding Reporting Capabilities to our Java Applications Using JasperReports 3.5

Designing the Target Structure in Oracle Warehouse Builder 11g

Faceting in Solr 1.4 Enterprise Search Server

Synchronizing Objects in Oracle Warehouse Builder