Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-pentaho-data-integration-4-working-complex-data-flows

27 Jun 2011

7 min read

Pentaho Data Integration 4: working with complex data flows

27 Jun 2011

Joining two or more streams based on given conditions There are occasions where you will need to join two datasets. If you are working with databases, you could use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution. Kettle provides the Merge Join step to join data coming from any kind of source. Let's assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs. So, you want to compare both to see the progress. Getting ready To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls has the real starting date, end date, and cost for tasks that have already started. You can download the sample files from here. How to do it... Carry out the following steps: Create a new transformation. Drop two Excel input steps into the canvas. Use one step for reading the budget information (budget.xls file) and the other for reading the costs information (costs.xls file). Under the Fields tab of these steps, click on the Get fields from header row... button in order to populate the grid automatically. Apply the format dd/MM/yyyy to the fields of type Date and $0.00 to the fields with costs. Add a Merge Join step from the Join category, and create a hop from each Excel input step toward this step. The following diagram depicts what you have so far: Configure the Merge Join step, as shown in the following screenshot: If you do a preview on this step, you will obtain the result of the two Excel files merged. In order to have the columns more organized, add a Select values step from the Transform category. In this new step, select the fields in this order: task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost. Doing a preview on the last step, you will obtain the merged data with the columns of both Excel files interspersed, as shown in the following screenshot: How it works... In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input. In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task field. The rows are expected to be sorted in an ascending manner on the specified key fields. There's more... In the example, you set the Join Type to LEFT OUTER JOIN. Let's see explanations of the possible join options: Interspersing new rows between existent rows In most Kettle datasets, all rows share a common meaning; they represent the same kind of entity, for example: In a dataset with sold items, each row has data about one item In a dataset with the mean temperature for a range of days in five different regions, each row has the mean temperature for a different day in one of those regions In a dataset with a list of people ordered by age range (0-10, 11-20, 20-40, and so on), each row has data about one person Sometimes, there is a need of interspersing new rows between your current rows. Taking the previous examples, imagine the following situations: In the sold items dataset, every 10 items, you have to insert a row with the running quantity of items and running sold price from the first line until that line. In the temperature's dataset, you have to order the data by region and the last row for each region has to have the average temperature for that region. In the people's dataset, for each age range, you have to insert a header row just before the rows of people in that range. In general, the rows you need to intersperse can have fixed data, subtotals of the numbers in previous rows, header to the rows coming next, and so on. What they have in common is that they have a different structure or meaning compared to the rows in your dataset. Interspersing these rows is not a complicated task, but is a tricky one. In this recipe, you will learn how to do it. Suppose that you have to create a list of products by category. For each category, you have to insert a header row with the category description and the number of products inside that category. The final result should be as follows: Getting ready This recipe uses an outdoor database with the structure shown in Appendix, Data Structures (Download here). As source, you can use a database like this or any other source, for example a text file with the same structure. How to do it... Carry out the following steps: Create a transformation, drag into the canvas a Table Input step, select the connection to the outdoor database, or create it if it doesn't exist. Then enter the following statement: SELECT category , desc_product FROM products p ,categories c WHERE p.id_category = c.id_category ORDER by category Do a preview of this step. You already have the product list! Now, you have to create and intersperse the header rows. In order to create the headers, do the following: From the Statistics category, add a Group by step and fill in the grids, as shown in the following screenshot: From the Scripting category, add a User Defined Java Expression step, and use it to add two fields: The first will be a String named desc_product, with value ("Category: " + category).toUpperCase(). The second will be an Integer field named order with value 1. Use a Select values step to reorder the fields as category, desc_product, qty_product, and order. Do a preview on this step; you should see the following result: Those are the headers. The next step is mixing all the rows in the proper order. Drag an Add constants step into the canvas and a Sort rows step. Link them to the other steps as shown: Use the Add constants to add two Integer fields: qty_prod and order. As Value, leave the first field empty, and type 2 for the second field. Use the Sort rows step for sorting by category, order, and desc_product. Select the last step and do a preview. You should see the rows exactly as shown in the introduction. How it works... When you have to intersperse rows between existing rows, there are just four main tasks to do, as follows: Create a secondary stream that will be used for creating new rows. In this case, the rows with the headers of the categories. In each stream, add a field that will help you intersperse rows in the proper order. In this case, the key field was named order. Before joining the two streams, add, remove, and reorder the fields in each stream to make sure that the output fields in each stream have the same metadata. Join the streams and sort by the fields that you consider appropriate, including the field created earlier. In this case, you sorted by category, inside each category by the field named order and finally by the products description. Note that in this case, you created a single secondary stream. You could create more if needed, for example, if you need a header and footer for each category.

0
0
4570

article-image-pentaho-data-integration-4-understanding-data-flows

Packt

24 Jun 2011

12 min read

Pentaho Data Integration 4: Understanding Data Flows

Packt

24 Jun 2011

12 min read

Pentaho Data Integration 4 Cookbook Over 70 recipes to solve ETL problems using Pentaho Kettle Read more about this book This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways for combining, splitting, or manipulating streams or flows of data using Kettle transformations. The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation. In this article, we will cover: Splitting a stream into two or more streams based on a condition Merging rows from two streams with the same or different structure Comparing two streams and generating differences Generating all possible pairs formed from two datasets (For more resources on this subject, see here.) Introduction The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation. When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends. The dataset that flows from step to step is not more than a set of rows all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name. Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, you have all steps connected in a row one after the other. In other words, you have the simplest of the transformations from the point of view of its structure. In this case, you don't have to worry much about the structure of your data stream, nor the origin or destination of the rows. The interesting part comes when you face other situations, for example: You want a step to start processing rows only after another given step has processed all rows You have more than one stream and you have to combine them into a single stream You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset With Kettle, you can actually do this, but you have to be careful because it's easy to end up doing wrong things and getting unexpected results or even worse: undesirable errors. With regard to the first example, it doesn't represent a default behavior due to the parallel nature of the transformations as explained earlier. There are two steps however, that might help, which are as follows: Blocking Step: This step blocks processing until all incoming rows have been processed. Block this step until steps finish: This step blocks processing until the selected steps finish. Both these steps are in the Flow category. This and the next article on Working with Complex Data Flows focuses on the other two examples and some similar use cases, by explaining the different ways for combining, splitting, or manipulating streams of data. Splitting a stream into two or more streams based on a condition In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There's more section, you will also see alternative and more efficient ways for doing the same thing in different scenarios. Let's assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from other kind of products, and also create a subclassification of the tents depending on their prices. Let's see a sample of this data: id_product,desc_product,price,category1,"Swedish Firesteel - Army Model",19,"kitchen"2,"Mountain House #10 Can Freeze-Dried Food",53,"kitchen"3,"Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch RoundGriddle",14,"kitchen"... Getting ready To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product. How to do it... Carry out the following steps: Create a transformation. Drag into the canvas a Text file input step and fill in the File tab to read the file named outdoorProducts.txt. If you are using the sample text file, type , as the Separator. Under the Fields tab, use the Get Fields button to populate the grid. Adjust the entries so that the grid looks like the one shown in the following screenshot: Now, let's add the steps to manage the flow of the rows. To do this, drag two Filter rows steps from the Flow category. Also, drag three Dummy steps that will represent the three resulting streams. Create the hops, as shown in the following screenshot. When you create the hops, make sure that you choose the options according to the image: Result is TRUE for creating a hop with a green icon, and Result is FALSE for creating a hop with a red icon in it. Double-click on the first Filter rows step and complete the condition, as shown in the following screenshot: Double-click on the second Filter rows step and complete the condition with price < 100. You have just split the original dataset into three groups. You can verify it by previewing each Dummy step. The first one has products whose category is not tents; the second one, the tents under 100 US$; and the last group, the expensive tents; those whose price is over 100 US$. The preview of the last Dummy step will show the following: How it works... The main objective in the recipe is to split a dataset with products depending on their category and price. To do this, you used the Filter rows step. In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send 'true' data to step and Send 'false' data to step. The destination steps can be set by using the hop properties as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the name of the destination steps from the available drop-down lists. You also have to enter the condition. The condition has the following different parts: The upper textbox on the left is meant to negate the condition. The left textbox is meant to select the field that will be used for comparison. Then, you have a list of possible comparators to choose from. On the right, you have two textboxes: The upper textbox for comparing against a field and the bottom textbox for comparing against a constant value. Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it. In the first Filter rows step of the recipe, you typed a simple condition: You compared a field (category) with a fixed value (tents) by using the equal (=) operator. You did this to separate the tents products from the others. The second filter had the purpose of differentiating the expensive and the cheap tents. There's more... You will find more filter features in the following subsections. Avoiding the use of Dummy steps In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheapest tents and you don't care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the 'false' data to a Dummy step, as in shown in the following diagram: The rows that don't meet the condition will end at the Dummy step. Although this is a very commonly used solution for keeping just the rows that meet the conditions, there is a simpler way to implement it. When you create the hop from the Filter rows toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send 'true' data to step and Send 'false' data to step will remain empty. This will cause two things: Only the rows that meet the condition will pass. The rest will be discarded. Comparing against the value of a Kettle variable The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable? Let's assume, for example, you have a named parameter called categ with kitchen as Default Value. As you might know, named parameters are a particular kind of Kettle variable. You create the named parameters under the Parameter tab from the Settings option of the Edit menu. To use this variable in a condition, you must add it to your dataset in advance. You do this as follows: Add a Get Variables step from the Job category. Put it in the stream after the Text file input step and before the Filter Rows step; use it to create a new field named categ of String type with the value ${categ} in the Variable column. Now, the transformation looks like the one shown in the following screenshot: After this, you can set the condition of the first Filter rows step to category = categ, selecting categ from the listbox of fields to the right. This way, you will be filtering the kitchen products. If you run the transformation and set the parameter to tents, you will obtain similar results to those that were obtained in the main recipe. Avoiding the use of nested Filter Rows steps Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps. Let's assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step. This way you avoid adding one Filter row step for each category. In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value—Target step pairs. The following screenshot shows how to fill in the grid for our particular problem: The following are some considerations about this step: You can have multiple values directed to the same target step You can leave the value column blank to specify a target step for empty values You have a listbox named Default target step to specify the target step for rows that do not match any of the case values You can only compare with an equal operator If you want to compare against a substring of the field, you could enable the Use string contains option and as Case Value, type the substring you are interested in. For example, if for Case Value, you type tent_ then all categories containing tent_ such as tent_large, tent_small, or best_tents will be redirected to the same target step. Overcoming the difficulties of complex conditions There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest them and create temporary fields in order to solve the problem, but it would be more efficient if you used the Java Filter or User Defined Java Expression step as explained next. You can find the Java Filter step in the Flow category. The difference compared to the Filter Rows step is that in this step, you write the condition using a Java expression. The names of the listboxes—Destination step for matching rows (optional) and Destination step for non-matching rows (optional)—differ from the names in the Filter rows step, but their purpose is the same. As an example, the following are the conditions you used in the recipe rewritten as Java expressions: category.equals("tents") and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result. If you can't guarantee that the category will not be null, you'd better invert the first expression and put "tents".equals(category) instead. By doing this, whenever you have to check if a field is equal to a constant, you avoid an unexpected Java error. Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows: Doing this with nested Filter rows steps leads to a transformation like the following: You can do the same thing in a simpler way: Replace all the steps but the Text file input with a User Defined Java Expression step located in the Scripting category. In the setting window of this step, add a row in order to replace the value of the category field: As New field and Replace value type category. As Value type select String. As Java expression, type the following: (category.equals("tents"))?(price<100?"cheap_tents":"expensive_tents"):category The preceding expression uses the Java ternary operator ?:. If you're not familiar with the syntax, think of it as shorthand for the if-then-else statement. For example, the inner expression price<100?"cheap_tents":"expensive_tents" means if (price<100) then return "cheap_tents" else return "expensive_tents". Do a preview on this step. You will see something similar to the following:

0
0
8515

Packt

21 Jun 2011

7 min read

Segmenting images in OpenCV

Packt

21 Jun 2011

7 min read

OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision Read more about this book OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In the previous article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we took a look at image processing using morphological filters. In this article we will see how to segment images using watersheds and GrabCut algorithm. (For more resources related to this subject, see here.) Segmenting images using watersheds The watershed transformation is a popular image processing algorithm that is used to quickly segment an image into homogenous regions. It relies on the idea that when the image is seen as a topological relief, homogeneous regions correspond to relatively flat basins delimitated by steep edges. As a result of its simplicity, the original version of this algorithm tends to over-segment the image which produces multiple small regions. This is why OpenCV proposes a variant of this algorithm that uses a set of predefined markers which guide the definition of the image segments. How to do it... The watershed segmentation is obtained through the use of the cv::watershed function. The input to this function is a 32-bit signed integer marker image in which each non-zero pixel represents a label. The idea is to mark some pixels of the image that are known to certainly belong to a given region. From this initial labeling, the watershed algorithm will determine the regions to which the other pixels belong. In this recipe, we will first create the marker image as a gray-level image, and then convert it into an image of integers. We conveniently encapsulated this step into a WatershedSegmenter class: class WatershedSegmenter { private: cv::Mat markers;public: void setMarkers(const cv::Mat& markerImage) { // Convert to image of ints markerImage.convertTo(markers,CV_32S); } cv::Mat process(const cv::Mat &image) { // Apply watershed cv::watershed(image,markers); return markers; } The way these markers are obtained depends on the application. For example, some preprocessing steps might have resulted in the identification of some pixels belonging to an object of interest. The watershed would then be used to delimitate the complete object from that initial detection. In this recipe, we will simply use the binary image used in the previous article (OpenCV: Image Processing using Morphological Filters) in order to identify the animals of the corresponding original image. Therefore, from our binary image, we need to identify pixels that certainly belong to the foreground (the animals) and pixels that certainly belong to the background (mainly the grass). Here, we will mark foreground pixels with label 255 and background pixels with label 128 (this choice is totally arbitrary, any label number other than 255 would work). The other pixels, that is the ones for which the labeling is unknown, are assigned value 0. As it is now, the binary image includes too many white pixels belonging to various parts of the image. We will then severely erode this image in order to retain only pixels belonging to the important objects: // Eliminate noise and smaller objectscv::Mat fg;cv::erode(binary,fg,cv::Mat(),cv::Point(-1,-1),6); The result is the following image: Note that a few pixels belonging to the background forest are still present. Let's simply keep them. Therefore, they will be considered to correspond to an object of interest. Similarly, we also select a few pixels of the background by a large dilation of the original binary image: // Identify image pixels without objectscv::Mat bg;cv::dilate(binary,bg,cv::Mat(),cv::Point(-1,-1),6);cv::threshold(bg,bg,1,128,cv::THRESH_BINARY_INV); The resulting black pixels correspond to background pixels. This is why the thresholding operation immediately after the dilation assigns to these pixels the value 128. The following image is then obtained: These images are combined to form the marker image: // Create markers imagecv::Mat markers(binary.size(),CV_8U,cv::Scalar(0));markers= fg+bg; Note how we used the overloaded operator+ here in order to combine the images. This is the image that will be used as input to the watershed algorithm: The segmentation is then obtained as follows: // Create watershed segmentation objectWatershedSegmenter segmenter;// Set markers and processsegmenter.setMarkers(markers);segmenter.process(image); The marker image is then updated such that each zero pixel is assigned one of the input labels, while the pixels belonging to the found boundaries have value -1. The resulting image of labels is then: The boundary image is: How it works... As we did in the preceding recipe, we will use the topological map analogy in the description of the watershed algorithm. In order to create a watershed segmentation, the idea is to progressively flood the image starting at level 0. As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed. The size of these basins also gradually increase and, consequently, the water of two different basins will eventually merge. When this happens, a watershed is created in order to keep the two basins separated. Once the level of water has reached its maximal level, the sets of these created basins and watersheds form the watershed segmentation. As one can expect, the flooding process initially creates many small individual basins. When all of these are merged, many watershed lines are created which results in an over-segmented image. To overcome this problem, a modification to this algorithm has been proposed in which the flooding process starts from a predefined set of marked pixels. The basins created from these markers are labeled in accordance with the values assigned to the initial marks. When two basins having the same label merge, no watersheds are created, thus preventing the oversegmentation. This is what happens when the cv::watershed function is called. The input marker image is updated to produce the final watershed segmentation. Users can input a marker image with any number of labels with pixels of unknown labeling left to value 0. The marker image has been chosen to be an image of a 32-bit signed integer in order to be able to define more than 255 labels. It also allows the special value -1, to be assigned to pixels associated with a watershed. This is what is returned by the cv::watershed function. To facilitate the displaying of the result, we have introduced two special methods. The first one returns an image of the labels (with watersheds at value 0). This is easily done through thresholding: // Return result in the form of an imagecv::Mat getSegmentation() { cv::Mat tmp; // all segment with label higher than 255 // will be assigned value 255 markers.convertTo(tmp,CV_8U); return tmp;} Similarly, the second method returns an image in which the watershed lines are assigned value 0, and the rest of the image is at 255. This time, the cv::convertTo method is used to achieve this result: // Return watershed in the form of an imagecv::Mat getWatersheds() { cv::Mat tmp; // Each pixel p is transformed into // 255p+255 before conversion markers.convertTo(tmp,CV_8U,255,255); return tmp;} The linear transformation that is applied before the conversion allows -1 pixels to be converted into 0 (since -1*255+255=0). Pixels with a value greater than 255 are assigned the value 255. This is due to the saturation operation that is applied when signed integers are converted into unsigned chars. See also The article The viscous watershed transform by C. Vachier, F. Meyer, Journal of Mathematical Imaging and Vision, volume 22, issue 2-3, May 2005, for more information on the watershed transform. The next recipe which presents another image segmentation algorithm that can also segment an image into background and foreground objects.

0
0
5297

Packt

13 Jun 2011

8 min read

Getting Started with OpenSSO

Packt

13 Jun 2011

8 min read

OpenAM Written and tested with OpenAM Snapshot 9—the Single Sign-On (SSO) tool for securing your web applications in a fast and easy way History of OpenSSO Back in early 2000, Sun Microsystems Inc. started the Directory Server Access Management Edition project to develop a product that would solve the web Single Sign-On (SSO) problems. The initial scope was to just provide authentication and authorization as part of the SSO using a proprietary protocol. Over the years it evolved to become a lightweight pure Java application providing a comprehensive set of security features to protect the enterprise resources. After undergoing a lot of changes in terms of product name, features, among other things, it ended up with OpenSSO. As part of Sun Microsystems software business strategy, they have open sourced most of their commercial enterprise software products including Sun Java Enterprise System (JES), Java SDK, and Communication suite of products. OpenSSO is the term coined by the Sun's access management product team as part of their strategy to open source Sun Java System Access Manager, Sun Java System SAML v2 Plugin for Federation Services, and Sun Java System Federation Manager. OpenSSO is an integrated product that includes the features of both Access and Federation manager products. The goal was to amalgamate the access, federation, and web services features. This goal was achieved when Sun released the OpenSSO Enterprise 8.0 that received strong market traction and analysts coverage. Sun's OpenSSO product made it to the leadership quadrant in 2008 in the Access Management category which was published by the Gartner Analyst group. As part of the open source initiative, a lot of code re-factoring and feature alignments occurred that changed the product's outlook. It removed all the native dependencies that were required to build and deploy earlier. OpenSSO became the pure Java web application that enabled the customers and open source community to take advantage of the automatic deployment by just dropping the web archive on any Java servlet container to deploy it. The build and check-in processes were highly simplified which attracted the open source community to contribute to the code and quality of the product. OpenSSO had a very flexible release model wherein new features or bug fixes could easily be implemented in the customer environment by picking up the nightly, express, or enterprise build. OpenSSO Enterprise 8.0 was a major release by Sun that was built from the open source branch. After this release, there were two other express releases. Those were very feature-rich and introduced Secure Token Service (STS) and OAuth functionality. Express build 9 was not released in the binary form by Oracle but the source code has been made available to the open source community. You can download the OpenAM express build, built using the express build 9 branch from the Forgerock site. As part of the acquisition of Sun Microsystems Inc. by Oracle Corporation that happened back in early 2010, the release and support models have been changed for OpenSSO. If you are interested in obtaining a support contract for the enterprise version of the product, you should call up the Oracle support team or the product management team. Oracle continues its support for the OpenSSO enterprise's existing customers. For the OpenSSO open source version (also known as OpenAM) you can approach the Forgerock team to obtain support. OpenSSO vs. OpenAM OpenSSO was the only open source product in the access management segment that had production level quality. Over eight thousands test cases were executed on twelve different Java servlet containers. OpenSSO is supported by a vibrant community that includes engineers, architects, and solution developers. If you have any questions, just send a mail to [email protected], and you are likely get the answer to what you want to know. Recently Forgerock (http://www.forgerock.com) undertook an initiative to keep the community and product strong. They periodically fix the bugs in the open source branch. Their version of OpenSSO is called OpenAM, but the code base is the same as OpenSSO. There may be incompatibilities in future if OpenAM code base deviates a lot from the OpenSSO express build 9 code base. Note that the Oracle Open SSO Enterprise 8.0 update releases are based on the OpenSSO Enterprise release 8.0 code base, whereas the open source version OpenAM is continuing its development from the express build 9 code base. OpenSSO—an overview OpenSSO is a freely available feature-rich access management product; it can be downloaded from http://www.forgerock.com/openam.html. It integrates authentication and authorization services, SSO, and open standards-based federation protocols to provide SSO among disparate business domains, entitlement services, and web services security. Overall, customers will be able to build a comprehensive solution for protecting their network resources by preventing unauthorized access to web services, applications, web content, and securing identity data. OpenSSO offers a complete solution for securing both web applications and web services. You can enforce a comprehensive security policy for web applications and services across the enterprise, rather than relying on developers to come up with ad hoc ways to secure services as they develop them. OpenSSO is a pure Java application that makes it easy to deploy on any operating system platform or container as it supports a broad range of operating systems and servlet containers. OpenSSO services All the services provided by the OpenSSO are exposed over HTTP protocol. The clients access them using appropriate interfaces. OpenSSO exposes a rich library of Application Programming Interfaces (APIs) and Service Provider Interfaces (SPIs) using which, customers can achieve the desired functionality. These services developed for OpenSSO generally contain both a server component and a client component. The server component is a simple Java servlet developed to receive XML requests and return XML responses. The opensso.war web application encompasses all the services and associated configuration items that are required to deliver the OpenSSO functionality. The client component is provided as Java API, and in some cases, C API. This allows remote applications and other OpenSSO services to communicate with and consume the particular functionality. Each core service uses its own framework to retrieve customer and service data and to provide it to other OpenSSO services. The OpenSSO framework integrates all of these service frameworks to form a layer that is accessible to all product components and plugins as shown in the following diagram: There are certain core services that are not covered due to the scope of this article. Just to make you aware of the breadth of features provided by the OpenSSO, in the next few sections, some of the prominent features that are not covered will be briefly introduced. Federation services Typically, in the web access management the Single Sign-On happens in the same company, within the same Domain Name Service (DNS) domain. Most of the time this will work for small companies or in B2C type scenarios, whereas in a B2B scenario use of a DNS domain-based SSO will not work as the cookie will not be forwarded to the other DNS domains. Besides, there are privacy and security concerns to perform SSO across multiple businesses using this approach. So how do we solve these kinds of problems where customers want to seamlessly sign on to services even though the services are provided by a third party? Federation is the solution. So, what is federation? Federation is a process that establishes a standards-based method for sharing and managing identity data and establishing a Single Sign-On across security domains and organizations. It allows an organization to offer a variety of external services to trusted business partners, as well as corporate services to internal departments and divisions. Forming trust relationships across security domains allows an organization to integrate applications offered by different departments or divisions within the enterprise, as well as engage in relationships with co-operating business partners that offer complementary services. Towards the federation or solving SSO across multiple domains, multiple industry standards, such as those developed by the Organization for the Advancement of Structured Information Standards (OASIS) and the Liberty Alliance Project), are supported. OpenSSO provides an open and extensible framework for identity federation and associated web services that resolves the problems of identity-enabling web services, web service discovery and invocation, security, and privacy. Federation services are built on the following standards: Liberty Alliance Project Identity Federation Framework (Liberty ID-FF) 1.1 and 1.2 OASIS Security Assertion Markup Language (SAML) 1.0 and 1.1 OASIS Security Assertion Markup Language (SAML) 2.0 WS-Federation (Passive Requestor Profile) SAML 2.0 is becoming the de facto standard for the federation SAML 2.0 is becoming the de facto standard for the federation SSO as many of the vendors and service providers support SAML 2.0 protocol. For instance Google Apps and Salesforce support SAML 2.0 as their choice of protocol for SSO.

0
0
4322

article-image-learn-computer-vision-applications-open-cv

Packt

07 Jun 2011

16 min read

Learn computer vision applications in Open CV

Packt

07 Jun 2011

16 min read

OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision Read more about this book OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In this article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we will cover: Calibrating a camera Computing the fundamental matrix of an image pair Matching images using random sample consensus Computing a homography between two images (For more resources related to the article, see here.) Introduction Images are generally produced using a digital camera that captures a scene by projecting light onto an image sensor going through its lens. The fact that an image is formed through the projection of a 3D scene onto a 2D plane imposes the existence of important relations between a scene and its image, and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, you will learn some of the fundamental projective relations that exist in multi-view imagery and how these can be used in computer vision programming. But before we start the recipes, let's explore the basic concepts related to scene projection and image formation. Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture and the captured light rays hit an image plane (or image sensor) located on the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: In computer vision, this camera model can be simplified in a number of ways. First, we can neglect the effect of the lens by considering a camera with an infinitesimal aperture since, in theory, this does not change the image. Only the central ray is therefore considered. Second, since most of the time we have do>>di, we can assume that the image plane is located at the focal distance. Finally, we can notice from the geometry of the system, that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation: The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera which is naturally true. This relation allows the position of the image of a 3D scene point to be predicted onto the image plane of a camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). Also, since we are dealing with digital images, the number of pixels on the image plane is another important characteristic of a camera. Finally, in order to be able to compute the position of an image's scene point in pixel coordinates, we need one additional piece of information. Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It could be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by few pixels depending at which precision the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process but made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is show to this camera a set of scene points for which their 3D position is known. You must then determine where on the image these points project. Obviously, for accurate results, we need to observe several of these points. One way to achieve this would be to take one picture of a scene with many known 3D points. A more convenient way would be to take several images from different viewpoints of a set of some 3D points. This approach is simpler but requires computing the position of each camera view, in addition to the computation of the internal camera parameters which fortunately is feasible. OpenCV proposes to use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0 with the X and Y axes well aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a calibration pattern image: The nice thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (number of vertical and horizontal inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false: // output vectors of image pointsstd::vector<cv::Point2f> imageCorners;// number of corners on the chessboardcv::Size boardSize(6,4);// Get the chessboard cornersbool found = cv::findChessboardCorners(image, boardSize, imageCorners); Note that this function accepts additional parameters if one needs to tune the algorithm, which are not discussed here. There is also a function that draws the detected corners on the chessboard image with lines connecting them in sequence: //Draw the cornerscv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The image obtained is seen here: The lines connecting the points shows the order in which the points are listed in the vector of detected points. Now to calibrate the camera, we need to input a set of such image points together with the coordinate of the corresponding 3D points. Let's encapsulate the calibration process in a CameraCalibrator class: class CameraCalibrator { // input points: // the points in world coordinates std::vector<std::vector<cv::Point3f>> objectPoints; // the point positions in pixels std::vector<std::vector<cv::Point2f>> imagePoints; // output Matrices cv::Mat cameraMatrix; cv::Mat distCoeffs; // flag to specify how calibration is done int flag; // used in image undistortion cv::Mat map1,map2; bool mustInitUndistort; public: CameraCalibrator() : flag(0), mustInitUndistort(true) {}; As mentioned previously, the 3D coordinates of the points on the chessboard pattern can be easily determined if we conveniently place the reference frame on the board. The method that accomplishes this takes a vector of the chessboard image filename as input: // Open chessboard images and extract corner pointsint CameraCalibrator::addChessboardPoints( const std::vector<std::string>& filelist, cv::Size & boardSize) { // the points on the chessboard std::vector<cv::Point2f> imageCorners; std::vector<cv::Point3f> objectCorners; // 3D Scene Points: // Initialize the chessboard corners // in the chessboard reference frame // The corners are at 3D location (X,Y,Z)= (i,j,0) for (int i=0; i<boardSize.height; i++) { for (int j=0; j<boardSize.width; j++) { objectCorners.push_back(cv::Point3f(i, j, 0.0f)); } } // 2D Image points: cv::Mat image; // to contain chessboard image int successes = 0; // for all viewpoints for (int i=0; i<filelist.size(); i++) { // Open the image image = cv::imread(filelist[i],0); // Get the chessboard corners bool found = cv::findChessboardCorners( image, boardSize, imageCorners); // Get subpixel accuracy on the corners cv::cornerSubPix(image, imageCorners, cv::Size(5,5), cv::Size(-1,-1), cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations 0.1)); // min accuracy //If we have a good board, add it to our data if (imageCorners.size() == boardSize.area()) { // Add image and scene points from one view addPoints(imageCorners, objectCorners); successes++; } } return successes;} The first loop inputs the 3D coordinates of the chessboard, which are specified in an arbitrary square size unit here. The corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all available viewpoints. Moreover, in order to obtain a more accurate image point location, the function cv::cornerSubPix can be used and as the name suggests, the image points will then be localized at sub-pixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines a maximum number of iterations and a minimum accuracy in sub-pixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to our vector of image and scene points: // Add scene points and corresponding image pointsvoid CameraCalibrator::addPoints(const std::vector<cv::Point2f>&imageCorners, const std::vector<cv::Point3f>& objectCorners) { // 2D image points from one view imagePoints.push_back(imageCorners); // corresponding 3D scene points objectPoints.push_back(objectCorners);} The vectors contains std::vector instances. Indeed, each vector element being a vector of points from one view. Once a sufficient number of chessboard images have been processed (and consequently a large number of 3D scene point/2D image point correspondences are available), we can initiate the computation of the calibration parameters: // Calibrate the camera// returns the re-projection errordouble CameraCalibrator::calibrate(cv::Size &imageSize){ // undistorter must be reinitialized mustInitUndistort= true; //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix,// output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs,// Rs, Ts flag); // set options} In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. The camera matrix will be described in the next section. For now, let's consider the distortion parameters. So far, we have mentioned that with the pin-hole camera model, we can neglect the effect of the lens. But this is only possible if the lens used to capture an image does not introduce too important optical distortions. Unfortunately, this is often the case with lenses of lower quality or with lenses having a very short focal length. You may have already noticed that in the image we used for our example, the chessboard pattern shown is clearly distorted. The edges of the rectangular board being curved in the image. It can also be noticed that this distortion becomes more important as we move far from the center of the image. This is a typical distortion observed with fish-eye lens and it is called radial distortion. The lenses that are used in common digital cameras do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions cannot certainly be ignored. It is possible to compensate for these deformations by introducing an appropriate model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameter during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted: // remove distortion in an image (after calibration)cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { // called once per calibration cv::initUndistortRectifyMap( cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted;} Which results in the following image: As you can see, once the image is undistorted, we obtain a regular perspective image. How it works... In order to explain the result of the calibration, we need to go back to the figure in the introduction which describes the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here: Note that the Y-axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by, respectively, the pixel width (px) and height (py). We notice that by dividing the focal length f given in world units (most often meters or millimeters) by px, then we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy =f/py is defined as the focal length expressed in vertical pixel unit. The complete projective equation is therefore: Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in matrix form through the introduction of homogeneous coordinates in which 2D points are represented by 3-vectors, and 3D points represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor that need to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the projective equation rewritten: The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given a calibration matrix. More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation (a 3x3 matrix) and a translation vector (3x1 matrix). These two matrices describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form: Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (rotation and translation) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all points specified during the calibration is called the re-projection error. To correct the distortion, OpenCV uses a polynomial function that is applied to the image point in order to move them at their undistorted position. By default, 5 coefficients are used; a model made of 8 coefficients is also available. Once these coefficients are obtained, it is possible to compute 2 mapping functions (one for the x coordinate and one for the y) that will give the new undistorted position of an image point on a distorted image. This is computed by the function cv::initUndistortRectifyMap and the function cv::remap remaps all of the points of an input image to a new image. Note that because of the non-linear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There's more... When a good estimate of the camera intrinsic parameters are known, it could be advantageous to input them to the cv::calibrateCamera function. They will then be used as initial values in the optimization process. To do so, you just need to add the flag CV_CALIB_USE_INTRINSIC_GUESS and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_ FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_RATIO) in which case you assume pixels of square shape.

0
1
6213

article-image-exporting-sap-businessobjects-dashboards-different-environments

Packt

02 Jun 2011

7 min read

Exporting SAP BusinessObjects Dashboards into Different Environments

Packt

02 Jun 2011

7 min read

SAP BusinessObjects Dashboards 4.0 Cookbook Introduction First, the visual model is compiled to a SWF file format. Compiling to a SWF file format ensures that the dashboard plays smoothly on different screen sizes and across different platforms. It also ensures that the users aren't given huge 10+ megabyte files. After compilation of the visual model to a SWF file, developers can then publish it to a format of their choice. The following are the available choices—Flash (SWF), AIR, SAP BusinessObjects Platform, HTML, PDF, PPT, Outlook, and Word. Once publishing is complete, the dashboard is ready to share! Exporting to a standard SWF, PPT, PDF, and so on After developing a Visual Model on Dashboard Design, we will need to somehow share it with users. We want to put it into a format that everyone can see on their machines. The simplest way is to export to a standard SWF file. One of the great features Dashboard Design has is to be able to embed dashboards into different office file formats. For example, a presenter could have a PowerPoint deck, and in the middle of the presentation, have a working dashboard that presents an important set of data values to the audience. Another example could be an executive level user who is viewing a Word document created by an analyst. The analyst could create a written document in Word and then embed a working dashboard with the most updated data to present important data values to the executive level user. You can choose to embed a dashboard in the following file types: PowerPoint Word PDF Outlook HTML Getting ready Make sure your visual model is complete and ready for sharing. How to do it... In the menu toolbar, go to File | Export | Flash (SWF). Select the directory in which you want the SWF to go to and the name of your SWF file How it works... Xcelsius compiles the visual model into an SWF file that everyone is able to see. Once the SWF file has been compiled, the dashboard will then be ready for sharing. It is mandatory that anyone viewing the dashboard have Adobe Flash installed. If not, they can download and install it from http://www.adobe.com/products/flashplayer/. If we export to PPT, we can then edit the PowerPoint file however we desire. If you have an existing PowerPoint presentation deck and want to append the dashboard to it, the easiest way is to first embed the dashboard SWF to a temporary PowerPoint file and then copy that slide to your existing PowerPoint file. There's more... Exporting to an SWF file makes it very easy for distribution, thus making the presentation of mockups great at a business level. Developers are able to work very closely with the business and iteratively come up with a visual model closest to the business goals. It is important though, when distributing SWF files, that everyone viewing the dashboards has the same version, otherwise confusion may occur. Thus, as a best practice, versioning every SWF that is distributed is very important. It is important to note that when the much anticipated Adobe Flash 10.1 was released, there were problems with embedding Dashboard Design dashboards in DOC, PPT, PDF, and so on. However, with the 10.1.82.76 Adobe Flash Player update, this has been fixed. Thus, it is important that if users have Adobe Flash Player 10.1+ installed, the version is higher than or equal to 10.1.82.76. When exporting to PDF, please take the following into account: In Dashboard Design 2008, the default format for exporting to PDF is Acrobat 9.0 (PDF 1.8). If Acrobat Reader 8.0 is installed, the default exported PDF cannot be opened. If using Acrobat Reader 8.0 or older, change the format to "Acrobat 6.0 (PDF 1.5)" before exporting to PDF. Exporting to SAP Business Objects Enterprise After Dashboard Design became a part of BusinessObjects, it was important to be able to export dashboards into the BusinessObjects Enterprise system. Once a dashboard is exported to BusinessObjects Enterprise, users can then easily access their dashboards through InfoView (now BI launch pad). On top of that, administrators are able control dashboard security. Getting ready Make sure your visual model is complete and ready for sharing. How to do it... From the menu toolbar, go to File | Export | Export to SAP BusinessObjects Platform. Enter your BusinessObjects login credentials and then select the location in the SAP BusinessObjects Enterprise system, where you want to store the SWF file, as shown in the following screenshot: Log into BI launch pad (formerly known as InfoView) and verify that you can access the dashboard. (Move the mouse over the image to enlarge.) How it works... When we export a dashboard to SAP BusinessObjects Enterprise, we basically place it in the SAP BusinessObjects Enterprise content management system. From there, we can control accessibility to the dashboard and make sure that we have one source of truth instead of sending out multiple dashboards through e-mail and possibly getting mixed up with what is the latest version. When we log into BI launch pad (formerly known as Infoview), it also passes the login token to the dashboard, so we don't have to enter our credentials again when connecting to SAP BusinessObjects Enterprise data. This is important because we don't have to manually create and pass any additional tokens once we have logged in. There's more... To give a true website type feel, developers can house their dashboards in a website type format using Dashboard Builder. This in turn provides a better experience for users, as they don't have to navigate through many folders in order to access the dashboard that they are looking for. Publishing to SAP BW This recipe shows you how to publish Dashboard Design dashboards to a SAP BW system. Once a dashboard is saved to the SAP BW system, it can be published within a SAP Enterprise Portal iView and made available for the users. Getting ready For this recipe, you will need an Dashboard Design dashboard model. This dashboard does not necessarily have to include a data connection to SAP BW. How to do it... Select Publish in the SAP menu. If you want to save the Xcelsius model with a different name, select the Publish As... option. If you are not yet connected to the SAP BW system, a pop up will appear. Select the appropriate system and fill in your username and password in the dialog box. If you want to disconnect from the SAP BW system and connect to a different system, select the Disconnect option from the SAP menu. Enter the Description and Technical Name of the dashboard. Select the location you want to save the dashboard to and click on Save. The dashboard is now published to the SAP BW system. To launch the dashboard and view it from the SAP BW environment, select the Launch option from the SAP menu. You will be asked to log in to the SAP BW system before you are able to view the dashboard. How it works... As we have seen in this recipe, the publishing of an Dashboard Design dashboard to SAP BW is quite straightforward. As the dashboard is part of the SAP BW environment after publishing, the model can be transported between SAP BW systems like all other SAP BW objects. There is more... After launching step 5, the Dashboard Design dashboard will load in your browser from the SAP BW server. You can add the displayed URL to an SAP Enterprise Portal iView to make the dashboard accessible for portal users.

0
0
3993

article-image-sap-businessobjects-customizing-dashboard

Packt

27 May 2011

4 min read

SAP BusinessObjects: Customizing the Dashboard

Packt

27 May 2011

4 min read

SAP BusinessObjects Dashboards 4.0 Cookbook Over 90 simple and incredibly effective recipes for transforming your business data into exciting dashboards with SAP BusinessObjects Dashboards 4.0 Xcelsius Introduction In this article, we will go through certain techniques on how you can utilize the different cosmetic features Dashboard Design provides, in order to improve the look of your dashboard. Dashboard Design provides a powerful way to capture the audience versus other dashboard tools. It allows developers to build dashboards with the important 'wow' factor that other tools lack. Let's take, for example, two dashboards that have the exact same functionality, placement of charts, and others. However, one dashboard looks much more attractive than the other. In general, people looking at the nicer looking dashboard will be more interested and thus get more value of the data that comes out of it. Thus, not only does Dashboard Design provide a powerful and flexible way of presenting data, but it also provides the 'wow' factor to capture a user's interest. Changing the look of a chart This recipe will run through changing the look of a chart. Particularly, it will go through each tab in the appearance icon of the chart properties. We will then make modifications and see the resulting changes. Getting ready Insert a chart object onto the canvas. Prepare some data and bind it to the chart. How to do it... Double-click/right-click on the chart object on the canvas/object properties window to go into Chart Properties. In the Layout tab, uncheck Show Chart Background. (Move the mouse over the image to enlarge.) In the Series tab, click on the colored square box circled in the next screenshot to change the color of the bar to your desired color. Then change the width of each bar; click on the Marker Size area and change it to 35. Click on the colored boxes circled in red in the Axes tab and choose dark blue to modify the horizontal and vertical axes separately. Uncheck Show Minor Gridlines at the bottom so that we remove all the horizontal lines in between each of the major gridlines. Next, go to the Text and Color tabs, where you can make changes to all the different text areas of the chart. How it works... As you can see, the default chart looks plain and the bars are skinny so it's harder to visualize things. It is a good idea to remove the chart background if there is an underlying background so that the chart blends in better. In addition, the changes to the chart colors and text provide additional aesthetics that help improve the look of the chart. Adding a background to your dashboard This recipe shows the usefulness of backgrounds in the dashboard. It will show how backgrounds can help provide additional depth to objects and help to group certain areas together for better visualization. Getting ready Make sure you have all your objects such as charts and selectors ready on the canvas. Here's an example of the two charts before the makeover. Bind some data to the charts if you want to change the coloring of the series How to do it... Choose Background4 from the Art and Backgrounds tab of the Components window. Stretch the background so that it fills the size of the canvas. Make sure that ordering of the backgrounds is before the charts. To change the ordering of the background, go to the object browser, select the background object and then press the "-" key until the background object is behind the chart. Select Background1 from the Art and Backgrounds tab and put two of them under the charts, as shown in the following screenshot: When the backgrounds are in the proper place, open the properties window for the backgrounds and set the background color to your desired color. In this example we picked turquoise blue for each background. How it works... As you can see with the before and after pictures, having backgrounds can make a huge difference in terms of aesthetics. The objects are much more pleasant to look at now and there is certainly a lot of depth with the charts. The best way to choose the right backgrounds that fit your dashboard is to play around with the different background objects and their colors. If you are not very artistic, you can come up with a bunch of examples and demonstrate it to the business user to see which one they prefer the most.

0
0
3749

article-image-opencv-image-processing-using-morphological-filters

Packt

25 May 2011

6 min read

OpenCV: Image Processing using Morphological Filters

Packt

25 May 2011

6 min read

0
0
7131

Packt

23 May 2011

9 min read

Sphinx Search FAQs

Packt

23 May 2011

9 min read

Got questions on Sphinx, the open source search engine? Not sure if it's the right tool for you? You're in the right place - we've put together an FAQ on Sphinx. It should help you make the right decision about the software that powers your search. If you've got questions on the other kind of Sphinx, we recommend you look here instead. What is Sphinx? Sphinx is a full-text search engine (generally standalone) which provides fast, relevant, efficient full-text search functionality to third-party applications. It was especially created to facilitate searches on SQL databases and integrates very well with scripting languages; such as PHP, Python, Perl, Ruby, and Java. What are the major features of Sphinx? Some of the major features of Sphinx include: High indexing speed (up to 10 MB/sec on modern CPUs) High search speed (average query is under 0.1 sec on 2 to 4 GB of text collection) High scalability (up to 100 GB of text, up to 100 Million documents on a single CPU) Supports distributed searching (since v.0.9.6) Supports MySQL (MyISAM and InnoDB tables are both supported) and PostgreSQL natively Supports phrase searching Supports phrase proximity ranking, providing good relevance Supports English and Russian stemming Supports any number of document fields (weights can be changed on the fly) Supports document groups Supports stopwords, that is, that it indexes only what's most relevant from a given list of words Supports different search modes ("match extended", "match all", "match phrase" and "match any" as of v.0.9.5) Generic XML interface which greatly simplifies custom integration Pure-PHP (that is, NO module compiling and so on) search client API Which operating systems does Sphinx run on? Sphinx was developed and tested mostly on UNIX based systems. All modern UNIX based operating systems with an ANSI compliant compiler should be able to compile and run Sphinx without any issues. However, Sphinx has also been found running on the following operating systems without any issues. Linux (Kernel 2.4.x and 2.6.x of various distributions) Microsoft Windows 2000 and XP FreeBSD 4.x, 5.x, 6.x NetBSD 1.6, 3.0 Solaris 9, 11 Mac OS X What does the configure command do? The configure command gets the details of our machine and also checks for all dependencies. If any of the dependency is missing, it will throw an error. Which are the various options for the configure command? There are many options that can be passed to the configure command but we will take a look at a few important ones: prefix=/path: This option specifies the path to install the sphinx binaries. with-mysql=/path: Sphinx needs to know where to find MySQL's include and library files. It auto-detects this most of the time but if for any reason it fails, you can supply the path here. with-pgsql=/path: Same as –-with-mysql but for PostgreSQL. What is full-text search? Full-text search is one of the techniques for searching a document or database stored on a computer. While searching, the search engine goes through and examines all of the words stored in the document and tries to match the search query against those words. A complete examination of all the words (text) stored in the document is undertaken and hence it is called a full-text search. Full-text search excels in searching large volumes of unstructured text quickly and effectively. It returns pages based on how well they match the user's query. What are the advantages of full-text search? The following points are some of the major advantages of full-text search: It is quicker than traditional searches as it benefits from an index of words that is used to look up records instead of doing a full table scan It gives results that can be sorted by relevance to the searched phrase or term, with sophisticated ranking capabilities to find the best documents or records It performs very well on huge databases with millions of records It skips the common words such as the, an, for, and so on When should you use full-text search? You should use full-text search when: When there is a high volume of free-form text data to be searched When there is a need for highly optimized search results When there is a demand for flexible search querying Why use Sphinx for full-text search? If you're looking for a good Database Management System (DBMS), there are plenty of options available with support for full-text indexing and searches, such as MySQL, PostgreSQL, and SQL Server. There are also external full-text search engines, such as Lucene and Solr. Let's see the advantages of using Sphinx over the DBMS's full-text searching capabilities and other external search engines: It has a higher indexing speed. It is 50 to 100 times faster than MySQL FULLTEXT and 4 to 10 times faster than other external search engines. It also has higher searching speed since it depends heavily on the mode, Boolean vs. phrase, and additional processing. It is up to 500 times faster than MySQL FULLTEXT in cases involving a large result set with GROUP BY. It is more than two times faster in searching than other external search engines available. Relevancy is among the key features one expects when using a search engine, and Sphinx performs very well in this area. It has phrase-based ranking in addition to classic statistical BM25 ranking. Last but not the least, Sphinx has better scalability. It can be scaled vertically (utilizing many CPUs, many HDDs) or horizontally (utilizing many servers), and this comes out of the box with Sphinx. One of the biggest known Sphinx cluster has over 3 billion records with more than 2 terabytes of size. What are indexes? Indexes in Sphinx are a bit different from indexes we have in databases. The data that Sphinx indexes is a set of structured documents and each document has the same set of fields. This is very similar to SQL, where each row in the table corresponds to a document and each column to a field. Sphinx builds a special data structure that is optimized for answering full-text search queries. This structure is called an index and the process of creating an index from the data is called indexing. The indexes in Sphinx can also contain attributes that are highly optimized for filtering. These attributes are not full-text indexed and do not contribute to matching. However, they are very useful at filtering out the results we want based on attribute values. There can be different types of indexes suited for different tasks. The index type, which has been implemented in Sphinx, is designed for maximum indexing and searching speed. What are multi-value attributes (MVA)? MVAs are a special type of attribute in Sphinx that make it possible to attach multiple values to every document. These attributes are especially useful in cases where each document can have multiple values for the same property (field). How does weighting help? Weighting decides which document gets priority over other documents and appear at the top. In Sphinx, weighting depends on the search mode. Weight can also be referred to as ranking. There are two major parts which are used in weighting functions: Phrase rank: This is based on the length of Longest Common Subsequence (LCS) of search words between document body and query phrase. This means that the documents in which the queried phrase matches perfectly will have a higher phrase rank and the weight would be equal to the query word counts. Statistical rank: This is based on BM25 function which takes only the frequency of the queried words into account. So, if a word appears only one time in the whole document then its weight will be low. On the other hand if a word appears a lot in the document then its weight will be higher. The BM25 weight is a floating point number between 0 and 1. What is index merging, exactly? Index merging is more efficient than indexing the data from scratch, that is, all over again. In this technique we define a delta index in the Sphinx configuration file. The delta index always gets the new data to be indexed. However, the main index acts as an archive and holds data that never changes. What is SphinxQL? Programmers normally issue search queries using one or more client libraries that relate to the database on which the search is to be performed. Some programmers may also find it easier to write an SQL query than to use the Sphinx Client API library. SphinxQL is used to issue search queries in the form of SQL queries. These queries can be fired from any client of the database in question, and returns the results in the way that a normal query would. Currently MySQL binary network protocol is supported and this enables Sphinx to be accessed with the regular MySQL API. What do you mean by Geo-distance search? In a Geo-distance search, you can find geo coordinates nearby to the base anchor point. Thus you can use this technique to find the nearby places to the given location. It can be useful in many applications like hotel search, property search, restaurant search, tourist destination search etc. Sphinx makes it very easy to perform a geo-distance search by providing an API method wherein you can set the anchor point (if you have latitude and longitude in your index) and all searches performed thereafter will return the results with a magic attribute "@geodist" holding the values of distance from the anchor point. You can then filter or sort your results based on this attribute. Further resources on this subject: Sphinx: Index Searching [Article] Getting Started with Sphinx Search [Article] Search Engine Optimization in Joomla! [Article] Blogger: Improving Your Blog with Google Analytics and Search Engine Optimization [Article] Drupal 6 Search Engine Optimization [Book]

0
0
2507

article-image-getting-started-sphinx-search

Packt

31 Mar 2011

6 min read

Getting Started with Sphinx Search

Packt

31 Mar 2011

6 min read

Sphinx is a full-text search engine. So, before going any further, we need to understand what full-text search is and how it excels over the traditional searching. What is full-text search? Full-text search is one of the techniques for searching a document or database stored on a computer. While searching, the search engine goes through and examines all of the words stored in the document and tries to match the search query against those words. A complete examination of all the words (text) stored in the document is undertaken and hence it is called a full-text search. Full-text search excels in searching large volumes of unstructured text quickly and effectively. It returns pages based on how well they match the user's query. Traditional search To understand the difference between a normal search and full-text search, let's take an example of a MySQL database table and perform searches on it. It is assumed that MySQL Server and phpMyAdmin are already installed on your system. Time for action – normal search in MySQL Open phpMyAdmin in your browser and create a new database called myblog. Select the myblog database: Create a table by executing the following query: CREATE TABLE `posts` ( `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY , `title` VARCHAR( 255 ) NOT NULL , `description` TEXT NOT NULL , `created` DATETIME NOT NULL , `modified` DATETIME NOT NULL ) ENGINE = MYISAM; Queries can be executed from the SQL page in phpMyAdmin. You can find the link to that page in the top menu. Populate the table with some records: INSERT INTO `posts`(`id`, `title`, `description`, `created`, `modified`) VALUES (1, 'PHP scripting language', 'PHP is a web scripting language originally created by Rasmus Lerdorf', NOW(), NOW()), (2, 'Programming Languages', 'There are many languages available to cater any kind of programming need', NOW(), NOW()), (3, 'My Life', 'This post is about my life which in a sense is beautiful', NOW(), NOW()), (4, 'Life on Mars', 'Is there any life on mars?', NOW(), NOW()); Next, run the following queries against the table: SELECT * FROM posts WHERE title LIKE 'programming%'; The above query returns row 2. SELECT * FROM posts WHERE description LIKE '%life%'; The above query return rows 3 and 4. SELECT * FROM posts WHERE description LIKE '%scripting language%'; The above query returns row 1. SELECT * FROM posts WHERE description LIKE '%beautiful%' OR description LIKE '%programming%'; The above query returns rows 2 and 3. phpMyAdmin To administer MySQL database, I highly recommend using a GUI interface tool like phpMyAdmin (http://www.phpmyadmin.net). All the above mentioned queries can easily be executed What just happened? We first created a table posts to hold some data. Each post has a title and a description. We then populated the table with some records. With the first SELECT query we tried to find all posts where the title starts with the word programming. This correctly gave us the row number 2. But what if you want to search for the word anywhere in the field and not just at that start? For this we fired the second query, wherein we searched for the word life anywhere in the description of the post. Again this worked pretty well for us and as expected we got the result in the form of row numbers 3 and 4. Now what if we wanted to search for multiple words? For this we fired the third query where we searched for the words scripting language. As row 1 has those words in its description, it was returned correctly. Until now everything looked fine and we were able to perform searches without any hassle. The query gets complex when we want to search for multiple words and those words are not necessarily placed consecutively in a field, that is, side by side. One such example is shown in the form of our fourth query where we tried to search for the words programming and beautiful in the description of the posts. Since the number of words we need to search for increases, this query gets complicated, and moreover, slow in execution, since it needs to match each word individually. The previous SELECT queries and their output also don't give us any information about the relevance of the search terms with the results found. Relevance can be defined as a measure of how closely the returned database records match the user's search query. In other words, how pertinent the result set is to the search query. Relevance is very important in the search world because users want to see the items with highest relevance at the top of their search results. One of the major reasons for the success of Google is that their search results are always sorted by relevance. MySQL full-text search This is where full-text search comes to the rescue. MySQL has inbuilt support for full-text search and you only need to add FULLTEXT INDEX to the field against which you want to perform your search. Continuing the earlier example of the posts table, let's add a full-text index to the description field of the table. Run the following query: ALTER TABLE `posts` ADD FULLTEXT ( `description` ); The query will add an INDEX of type FULLTEXT to the description field of the posts table. Only MyISAM Engine in MySQL supports the full-text indexes. Now to search for all the records which contain the words programming or beautiful anywhere in their description, the query would be: SELECT * FROM posts WHERE MATCH (description) AGAINST ('beautiful programming'); This query will return rows 2 and 3, and the returned results are sorted by relevance. One more thing to note is that this query takes less time than the earlier query, which used LIKE for matching. By default, the MATCH() function performs a natural language search, it attempts to use natural language processing to understand the nature of the query and then search accordingly. Full-text search in MySQL is a big topic in itself and we have only seen the tip of the iceberg. For a complete reference, please refer to the MySQL manual at http://dev.mysql.com/doc/. Advantages of full-text search The following points are some of the major advantages of full-text search: It is quicker than traditional searches as it benefits from an index of words that is used to look up records instead of doing a full table scan It gives results that can be sorted by relevance to the searched phrase or term, with sophisticated ranking capabilities to find the best documents or records It performs very well on huge databases with millions of records It skips the common words such as the, an, for, and so on When to use a full-text search? When there is a high volume of free-form text data to be searched When there is a need for highly optimized search results When there is a demand for flexible search querying

0
0
1622

Packt

16 Mar 2011

9 min read

Sphinx: Index Searching

Packt

16 Mar 2011

9 min read

Client API implementations for Sphinx Sphinx comes with a number of native searchd client API implementations. Some third-party open source implementations for Perl, Ruby, and C++ are also available. All APIs provide the same set of methods and they implement the same network protocol. As a result, they more or less all work in a similar fashion, they all work in a similar fashion. All examples in this article are for PHP implementation of the Sphinx API. However, you can just as easily use other programming languages. Sphinx is used with PHP more widely than any other language. Search using client API Let's see how we can use native PHP implementation of Sphinx API to search. We will add a configuration related to searchd and then create a PHP file to search the index using the Sphinx client API implementation for PHP. Time for action – creating a basic search script Add the searchd config section to /usr/local/sphinx/etc/sphinx-blog.conf: source blog { # source options } index posts { # index options } indexer { # indexer options } # searchd options (used by search daemon) searchd { listen = 9312 log = /usr/local/sphinx/var/log/searchd.log query_log = /usr/local/sphinx/var/log/query.log max_children = 30 pid_file = /usr/local/sphinx/var/log/searchd.pid } Start the searchd daemon (as root user): $ sudo /usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/ sphinx-blog.conf Copy the sphinxapi.php file (the class with PHP implementation of Sphinx API) from the sphinx source directory to your working directory: $ mkdir /path/to/your/webroot/sphinx $ cd /path/to/your/webroot/sphinx $ cp /path/to/sphinx-0.9.9/api/sphinxapi.php ./ Create a simple_search.php script that uses the PHP client API class to search the Sphinx-blog index, and execute it in the browser: <?php require_once('sphinxapi.php'); // Instantiate the sphinx client $client = new SphinxClient(); // Set search options $client->SetServer('localhost', 9312); $client->SetConnectTimeout(1); $client->SetArrayResult(true); // Query the index $results = $client->Query('php'); // Output the matched results in raw format print_r($results['matches']); The output of the given code, as seen in a browser, will be similar to what's shown in the following screenshot: What just happened? Firstly, we added the searchd configuration section to our sphinx-blog.conf file. The following options were added to searchd section: listen: This options specifies the IP address and port that searchd will listen on. It can also specify the Unix-domain socket path. This options was introduced in v0.9.9 and should be used instead of the port (deprecated) option. If the port part is omitted, then the default port used is 9312.Examples: listen = localhost listen = 9312 listen = localhost:9898 listen = 192.168.1.25:4000 listen = /var/run/sphinx.s log: Name of the file where all searchd runtime events will be logged. This is an optional setting and the default value is "searchd.log". query_log: Name of the file where all search queries will be logged. This is an optional setting and the default value is empty, that is, do not log queries. max_children: The maximum number of concurrent searches to run in parallel. This is an optional setting and the default value is 0 (unlimited). pid_file: Filename of the searchd process ID. This is a mandatory setting. The file is created on startup and it contains the head daemon process ID while the daemon is running. The pid_file becomes unlinked when the daemon is stopped. Once we were done with adding searchd configuration options, we started the searchd daemon with root user. We passed the path of the configuration file as an argument to searchd. The default configuration file used is /usr/local/sphinx/etc/sphinx.conf. After a successful startup, searchd listens on all network interfaces, including all the configured network cards on the server, at port 9312. If we want searchd to listen on a specific interface then we can specify the hostname or IP address in the value of the listen option: listen = 192.168.1.25:9312 The listen setting defined in the configuration file can be overridden in the command line while starting searchd by using the -l command line argument. There are other (optional) arguments that can be passed to searchd as seen in the following screenshot: searchd needs to be running all the time when we are using the client API. The first thing you should always check is whether searchd is running or not, and start it if it is not running. We then created a PHP script to search the sphinx-blog index. To search the Sphinx index, we need to use the Sphinx client API. As we are working with a PHP script, we copied the PHP client implementation class, (sphinxapi.php) which comes along with Sphinx source, to our working directory so that we can include it in our script. However, you can keep this file anywhere on the file system as long as you can include it in your PHP script. Throughout this article we will be using /path/to/webroot/sphinx as the working directory and we will create all PHP scripts in that directory. We will refer to this directory simply as webroot. We initialized the SphinxClient class and then used the following class methods to set upthe Sphinx client API: SphinxClient::SetServer($host, $port)—This method sets the searchd hostname and port. All subsequent requests use these settings unless this method is called again with some different parameters. The default host is localhost and port is 9312. SphinxClient::SetConnectTimeout($timeout)—This is the maximum time allowed to spend trying to connect to the server before giving up. SphinxClient::SetArrayResult($arrayresult)—This is a PHP client APIspecific method. It specifies whether the matches should be returned as an array or a hash. The Default value is false, which means that matches will be returned in a PHP hash format, where document IDs will be the keys, and other information (attributes, weight) will be the values. If $arrayresult is true, then the matches will be returned in plain arrays with complete per-match information. After that, the actual querying of index was pretty straightforward using the SphinxClient::Query($query) method. It returned an array with matched results, as well as other information such as error, fields in index, attributes in index, total records found, time taken for search, and so on. The actual results are in the $results['matches'] variable. We can run a loop on the results, and it is a straightforward job to get the actual document's content from the document ID and display it. Matching modes When a full-text search is performed on the Sphinx index, different matching modes can be used by Sphinx to find the results. The following matching modes are supported by Sphinx: SPH_MATCH_ALL—This is the default mode and it matches all query words, that is, only records that match all of the queried words will be returned. SPH_MATCH_ANY—This matches any of the query words. SPH_MATCH_PHRASE—This matches query as a phrase and requires a perfect match. SPH_MATCH_BOOLEAN—This matches query as a Boolean expression. SPH_MATCH_EXTENDED—This matches query as an expression in Sphinx internal query language. SPH_MATCH_EXTENDED2—This matches query using the second version of Extended matching mode. This supersedes SPH_MATCH_EXTENDED as of v0.9.9. SPH_MATCH_FULLSCAN—In this mode the query terms are ignored and no text-matching is done, but filters and grouping are still applied. Time for action – searching with different matching modes Create a PHP script display_results.php in your webroot with the following code: <?php // Database connection credentials $dsn ='mysql:dbname=myblog;host=localhost'; $user = 'root'; $pass = ''; // Instantiate the PDO (PHP 5 specific) class try { $dbh = new PDO($dsn, $user, $pass); } catch (PDOException $e){ echo'Connection failed: '.$e->getMessage(); } // PDO statement to fetch the post data $query = "SELECT p.*, a.name FROM posts AS p " . "LEFT JOIN authors AS a ON p.author_id = a.id " . "WHERE p.id = :post_id"; $post_stmt = $dbh->prepare($query); // PDO statement to fetch the post's categories $query = "SELECT c.name FROM posts_categories AS pc ". "LEFT JOIN categories AS c ON pc.category_id = c.id " . "WHERE pc.post_id = :post_id"; $cat_stmt = $dbh->prepare($query); // Function to display the results in a nice format function display_results($results, $message = null) { global $post_stmt, $cat_stmt; if ($message) { print "<h3>$message</h3>"; } if (!isset($results['matches'])) { print "No results found<hr />"; return; } foreach ($results['matches'] as $result) { // Get the data for this document (post) from db $post_stmt->bindParam(':post_id', $result['id'], PDO::PARAM_INT); $post_stmt->execute(); $post = $post_stmt->fetch(PDO::FETCH_ASSOC); // Get the categories of this post $cat_stmt->bindParam(':post_id', $result['id'], PDO::PARAM_INT); $cat_stmt->execute(); $categories = $cat_stmt->fetchAll(PDO::FETCH_ASSOC); // Output title, author and categories print "Id: {$posmt['id']}<br />" . "Title: {$post['title']}<br />" . "Author: {$post['name']}"; $cats = array(); foreach ($categories as $category) { $cats[] = $category['name']; } if (count($cats)) { print "<br />Categories: " . implode(', ', $cats); } print "<hr />"; } } Create a PHP script search_matching_modes.php in your webroot with the following code: <?php // Include the api class Require('sphinxapi.php'); // Include the file which contains the function to display results require_once('display_results.php'); $client = new SphinxClient(); // Set search options $client->SetServer('localhost', 9312); $client->SetConnectTimeout(1); $client->SetArrayResult(true); // SPH_MATCH_ALL mode will be used by default // and we need not set it explicitly display_results( $client->Query('php'), '"php" with SPH_MATCH_ALL'); display_results( $client->Query('programming'), '"programming" with SPH_MATCH_ALL'); display_results( $client->Query('php programming'), '"php programming" with SPH_MATCH_ALL'); // Set the mode to SPH_MATCH_ANY $client->SetMatchMode(SPH_MATCH_ANY); display_results( $client->Query('php programming'), '"php programming" with SPH_MATCH_ANY'); // Set the mode to SPH_MATCH_PHRASE $client->SetMatchMode(SPH_MATCH_PHRASE); display_results( $client->Query('php programming'), '"php programming" with SPH_MATCH_PHRASE'); display_results( $client->Query('scripting language'), '"scripting language" with SPH_MATCH_PHRASE'); // Set the mode to SPH_MATCH_FULLSCAN $client->SetMatchMode(SPH_MATCH_FULLSCAN); display_results( $client->Query('php'), '"php programming" with SPH_MATCH_FULLSCAN'); Execute search_matching_modes.php in a browser (http://localhost/sphinx/search_matching_modes.php).

0
0
2370

article-image-oracle-goldengate-11g-performance-tuning

Packt

01 Mar 2011

12 min read

Oracle GoldenGate 11g: Performance Tuning

Packt

01 Mar 2011

12 min read

0
0
4272

article-image-oracle-goldengate-considerations-designing-solution

Packt

24 Feb 2011

8 min read

Oracle GoldenGate: Considerations for Designing a Solution

Packt

24 Feb 2011

8 min read

Oracle GoldenGate 11g Implementer's guide Design, install, and configure high-performance data replication solutions using Oracle GoldenGate The very first book on GoldenGate, focused on design and performance tuning in enterprise-wide environments Exhaustive coverage and analysis of all aspects of the GoldenGate software implementation, including design, installation, and advanced configuration Migrate your data replication solution from Oracle Streams to GoldenGate Design a GoldenGate solution that meets all the functional and non-functional requirements of your system Written in a simple illustrative manner, providing step-by-step guidance with discussion points Goes way beyond the manual, appealing to Solution Architects, System Administrators and Database Administrators At a high level, the design must include the following generic requirements: Hardware Software Network Storage Performance All the above must be factored into the overall system architecture. So let's take a look at some of the options and the key design issues. Replication methods So you have a fast reliable network between your source and target sites. You also have a schema design that is scalable and logically split. You now need to choose the replication architecture; One to One, One to Many, active-active, active-passive, and so on. This consideration may already be answered for you by the sheer fact of what the system has to achieve. Let's take a look at some configuration options. Active-active Let's assume a multi-national computer hardware company has an office in London and New York. Data entry clerks are employed at both sites inputting orders into an Order Management System. There is also a procurement department that updates the system inventory with volumes of stock and new products related to a US or European market. European countries are managed by London, and the US States are managed by New York. A requirement exists where the underlying database systems must be kept in synchronisation. Should one of the systems fail, London users can connect to New York and vice-versa allowing business to continue and orders to be taken. Oracle GoldenGate's active-active architecture provides the best solution to this requirement, ensuring that the database systems on both sides of the pond are kept synchronised in case of failure. Another feature the active-active configuration has to offer is the ability to load balance operations. Rather than have effectively a DR site in both locations, the European users could be allowed access to New York and London systems and viceversa. Should a site fail, then the DR solution could be quickly implemented. Active-passive The active-passive bi-directional configuration replicates data from an active primary database to a full replica database. Sticking with the earlier example, the business would need to decide which site is the primary where all users connect. For example, in the event of a failure in London, the application could be configured to failover to New York. Depending on the failure scenario, another option is to start up the passive configuration, effectively turning the active-passive configuration into active-active. Cascading The Cascading GoldenGate topology offers a number of "drop-off" points that are intermediate targets being populated from a single source. The question here is "what data do I drop at which site?" Once this question has been answered by the business, it is then a case of configuring filters in Replicat parameter files allowing just the selected data to be replicated. All of the data is passed on to the next target where it is filtered and applied again. This type of configuration lends itself to a head office system updating its satellite office systems in a round robin fashion. In this case, only the relevant data is replicated at each target site. Another design, is the Hub and Spoke solution, where all target sites are updated simultaneously. This is a typical head office topology, but additional configuration and resources would be required at the source site to ship the data in a timely manner. The CPU, network, and file storage requirements must be sufficient to accommodate and send the data to multiple targets. Physical Standby A Physical Standby database is a robust Oracle DR solution managed by the Oracle Data Guard product. The Physical Standby database is essentially a mirror copy of its Primary, which lends itself perfectly for failover scenarios. However , it is not easy to replicate data from the Physical Standby database, because it does not generate any of its own redo. That said, it is possible to configure GoldenGate to read the archived standby logs in Archive Log Only (ALO) mode. Despite being potentially slower, it may be prudent to feed a downstream system on the DR site using this mechanism, rather than having two data streams configured from the Primary database. This reduces network bandwidth utilization, as shown in the following diagram: Reducing network traffic is particularly important when there is considerable distance between the primary and the DR site. Networking The network should not be taken for granted. It is a fundamental component in data replication and must be considered in the design process. Not only must it be fast, it must be reliable. In the following paragraphs, we look at ways to make our network resilient to faults and subsequent outages, in an effort to maintain zero downtime. Surviving network outages Probably one of your biggest fears in a replication environment is network failure. Should the network fail, the source trail will fill as the transactions continue on the source database, ultimately filling the filesystem to 100% utilization, causing the Extract process to abend. Depending on the length of the outage, data in the database's redologs may be overwritten causing you the additional task of configuring GoldenGate to extract data from the database's archived logs. This is not ideal as you already have the backlog of data in the trail files to ship to the target site once the network is restored. Therefore, ensure there is sufficient disk space available to accommodate data for the longest network outage during the busiest period. Disks are relatively cheap nowadays. Providing ample space for your trail files will help to reduce the recovery time from the network outage. Redundant networks One of the key components in your GoldenGate implementation is the network. Without the ability to transfer data from the source to the target, it is rendered useless. So, you not only need a fast network but one that will always be available. This is where redundant networks come into play, offering speed and reliability. NIC teaming One method of achieving redundancy is Network Interface Card (NIC) teaming or bonding. Here two or more Ethernet ports can be "coupled" to form a bonded network supporting one IP address. The main goal of NIC teaming is to use two or more Ethernet ports connected to two or more different access network switches thus avoiding a single point of failure. The following diagram illustrates the redundant features of NIC teaming: Linux (OEL/RHEL 4 and above) supports NIC teaming with no additional software requirements. It is purely a matter of network configuration stored in text files in the /etc/sysconfig/network-scripts directory. The following steps show how to configure a server for NIC teaming: First, you need to log on as root user and create a bond0 config file using the vi text editor. # vi /etc/sysconfig/network-scripts/ifcfg-bond0 Append the following lines to it, replacing the IP address with your actual IP address, then save file and exit to shell prompt: DEVICE=bond0 IPADDR=192.168.1.20 NETWORK=192.168.1.0 NETMASK=255.255.255.0 USERCTL=no BOOTPROTO=none ONBOOT=yes Choose the Ethernet ports you wish to bond, and then open both configurations in turn using the vi text editor, replacing ethn with the respective port number. # vi /etc/sysconfig/network-scripts/ifcfg-eth2 # vi /etc/sysconfig/network-scripts/ifcfg-eth4 Modify the configuration as follows: DEVICE=ethn USERCTL=no ONBOOT=yes MASTER=bond0 SLAVE=yes BOOTPROTO=none Save the files and exit to shell prompt. To make sure the bonding module is loaded when the bonding interface (bond0) is brought up, you need to modify the kernel modules configuration file: # vi /etc/modprobe.conf Append the following two lines to the file: alias bond0 bonding options bond0 mode=balance-alb miimon=100 Finally, load the bonding module and restart the network services: # modprobe bonding # service network restart You now have a bonded network that will load balance when both physical networks are available, providing additional bandwidth and enhanced performance. Should one network fail, the available bandwidth will be halved, but the network will still be available. Non-functional requirements (NFRs) Irrespective of the functional requirements, the design must also include the nonfunctional requirements (NFR) in order to achieve the overall goal of delivering a robust, high performance, and stable system. Latency One of the main NFRs is performance. How long does it take to replicate a transaction from the source database to the target? This is known as end-to-end latency that typically has a threshold that must not be breeched in order to satisfy the specified NFR. GoldenGate refers to latency as lag, which can be measured at different intervals in the replication process. These are: Source to Extract: The time taken for a record to be processed by the Extract compared to the commit timestamp on the database Replicat to Target: The time taken for the last record to be processed by the Replicat compared to the record creation time in the trail file A well designed system may encounter spikes in latency but it should never be continuous or growing. Trying to tune GoldenGate when the design is poor is a difficult situation to be in. For the system to perform well you may need to revisit the design.

0
0
3266

article-image-oracle-goldengate-11g-configuration-high-availability

Packt

23 Feb 2011

10 min read

Oracle GoldenGate 11g: Configuration for High Availability

Packt

23 Feb 2011

10 min read

0
0
4538

article-image-openam-oracle-dsee-and-multiple-data-stores

Packt

03 Feb 2011

6 min read

OpenAM: Oracle DSEE and Multiple Data Stores

Packt

03 Feb 2011

6 min read

OpenAM Written and tested with OpenAM Snapshot 9—the Single Sign-On (SSO) tool for securing your web applications in a fast and easy way The first and the only book that focuses on implementing Single Sign-On using OpenAM Learn how to use OpenAM quickly and efficiently to protect your web applications with the help of this easy-to-grasp guide Written by Indira Thangasamy, core team member of the OpenSSO project from which OpenAM is derived Real-world examples for integrating OpenAM with various applications Oracle Directory Server Enterprise Edition The Oracle Directory Server Enterprise Edition type of identity store is predominantly used by customers and natively supported by the OpenSSO server. It is the only data store where all the user management features offered by OpenSSO is supported with no exceptions. In the console it is labeled as Sun DS with OpenSSO schema. After Oracle acquired Sun Microsystems Inc., the brand name for the Sun Directory Server Enterprise Edition has been changed to Oracle Directory Server Enterprise Edition (ODSEE). You need to have the ODSEE configured prior to creating the data store either from the console or CLI as shown in the following section. Creating a data store for Oracle DSEE Creating a data store from the OpenSSO console is the easiest way to achieve this task. However, if you want to automate this process in a repeatable fashion, then using the ssoadm tool is the right choice. However, the problem here is to obtain the list of the attributes and their corresponding values. It is not documented anywhere in the publicly available documentation for OpenSSO. Let me show you the easy way out of this by providing the required options and their values for the ssoadm: ./ssoadm create-datastore -m odsee_datastore -t LDAPv3ForAMDS -u amadmin-f /tmp/.passwd_of_amadmin -D data_store_odsee.txt -e / This command will create the data store that talks to an Oracle DSEE store. The key options in the preceding command are LDAPv3ForAMDS (instructing the server to create a datastore of type ODSEE) and the properties that have been included in the data_ store_odsee.txt. You can find the complete contents of this file as part of the code bundle provided by Packt Publishers. Some excerpts from the file are given as follows: sun-idrepo-ldapv3-config-ldap-server=odsee.packt-services. net:4355 sun-idrepo-ldapv3-config-authid=cn=directory manager sun-idrepo-ldapv3-config-authpw=dssecret12 com.iplanet.am.ldap.connection.delay.between.retries=1000 sun-idrepo-ldapv3-config-auth-naming-attr=uid <contents removed to save paper space > sun-idrepo-ldapv3-config-users-search-attribute=uid sun-idrepo-ldapv3-config-users-search-filter=(objectclass= inetorgperson) sunIdRepoAttributeMapping= sunIdRepoClass=com.sun.identity.idm.plugins.ldapv3.LDAPv3Repo sunIdRepoSupportedOperations=filteredrole=read,create,edit, delete sunIdRepoSupportedOperations=group=read,create,edit,delete sunIdRepoSupportedOperations=realm=read,create,edit,delete, service sunIdRepoSupportedOperations=role=read,create,edit,delete sunIdRepoSupportedOperations=user=read,create,edit,delete, service Among these properties, the first three provide critical information for the whole thing to work. As it is evident from the name of the property they denote the ODSEE server name, port, bind DN, and the password. The password is in plain text so you need to remove the password from the input file after creating the data store, for security reasons. Updating the data store Like we discussed in the previous section, one can invoke the update-datastore sub command with the appropriate properties and its values along with the -a switch to the ssoadm tool. If you have more properties to be updated, then put them in a text file and use it with the -D option like the create-datastore sub command. Deleting the data store A data store can be deleted by just selecting it's specific name from the console. There will be no warnings issued while deleting the data stores, and you could eventually delete all of the data stores in a realm. Be cautious about this behavior, you might end up deleting all of them unintentionally. Deleting data store does not remove any existing data in the underlying LDAP server, it only removes the configuration from the OpenSSO server: ./ssoadm delete-datastores -e / -m odsee_datastore -f /tmp/ .passwd_of_amadmin -u amadmin Data store for OpenDS OpenDS is one of the popular LDAP servers that is completely written in Java and available freely under open source license. As a matter of fact the embedded configuration store that is built in the OpenSSO server is the embedded version of OpenDS. It has been fully tested with the OpenDS standalone version for the identity store usage. The data store creation and management are pretty much similar to the steps described in the foregoing section, except the type of store and the corresponding properties' values. The properties and their values are given in the file data_store_opends.txt (available as part of code bundle). Invoke the ssoadm tool with this property file after making appropriate changes to fit to your deployment. Here is the sample command that creates the datastore for OpenDS: ./ssoadm create-datastore -u amadmin -f /tmp/.passwd_of_amadmin -e / -m "OpenDS-store" -t LDAPv3ForOpenDS -D data_store_opends.txt Data store for Tivoli DS The IBM Tivoli Directory Server 6.2 is one of the supported LDAP servers for the OpenSSO server to provide authentication and authorization services. A specific sub configuration LDAPv3ForTivoli is available out of the box to support this server. You can find the data_store_tivoli.txt to create a new data store by supplying the -t LDAPv3ForTivoli option to the ssoadm tool. Here is the sample command that creates the datastore for Tivoli DS. The sample (data_store_tivoli. txt can be found as part of the code bundle) file contains the entries including the default group, that are shipped with the Tivoli DS for easy understanding. You can customize it to any valid values: ./ssoadm create-datastore -u amadmin -f /tmp/.passwd_of_amadmin -e / -m "Tivoli-store" -t LDAPv3ForTivoli -D data_store_tivoli.txt Data store for Active Directory Microsoft Active Directory provides most of the LDAPv3 features including support for persistent search notifications. Creating a data store for this is also a straightforward process and is available out-of-the-box. You can find the data_ store_ad.txt to create a new data store by supplying the -t LDAPv3ForAD option to the ssoadm tool. Here is the sample command that creates the datastore for AD: ./ssoadm create-datastore -u amadmin -f /tmp/.passwd_of_amadmin -e / -m "AD-store" -t LDAPv3ForAD -D data_store_ad.txt . Data store for Active Directory Application Mode Microsoft Active Directory Application Mode (ADAM) is the lightweight version of the Active directory with simplified schema. In the ADAM instance it is possible to set user password over LDAP unlike Active Directory where password-related operations must happen over LDAPS. Creating a data store for this is also a straightforward process and is available out-of-the-box. You can find the data_store_adam. txt to create a new data store by supplying the -t LDAPv3ForADAM option to the ssoadm tool: ./ssoadm create-datastore -u amadmin -f /tmp/.passwd_of_amadmin -e / -m "ADAM-store" -t LDAPv3ForADAM -D data_store_adam.txt

0
0
1527

How-To Tutorials - Data

Pentaho Data Integration 4: working with complex data flows

Pentaho Data Integration 4: Understanding Data Flows

Segmenting images in OpenCV

Getting Started with OpenSSO

Learn computer vision applications in Open CV

Exporting SAP BusinessObjects Dashboards into Different Environments

SAP BusinessObjects: Customizing the Dashboard

OpenCV: Image Processing using Morphological Filters

Sphinx Search FAQs

Getting Started with Sphinx Search

Trending Topics

Sphinx: Index Searching

Oracle GoldenGate 11g: Performance Tuning

Oracle GoldenGate: Considerations for Designing a Solution

Oracle GoldenGate 11g: Configuration for High Availability

OpenAM: Oracle DSEE and Multiple Data Stores