Pentaho – Using Formulas in Our Reports

Packt
22 Aug 2013
7 min read
At the end of the article, we propose that you make some modifications to the report created here.

Starting practice

In this article, we will create a copy of an earlier report and then make the necessary changes to its layout; the final result is as follows:

As we can observe in the previous screenshot, the rectangle to the left of each title changes color. We'll see how to do this, and much more, shortly.

Time for action – making a copy of the previous report

In this article, we will use an already created report. To do so, we will open it and save it with the name 09_Using_Formulas.prpt. Then we will modify its layout to fit this article. Finally, we will establish default values for our parameters. The steps for making a copy of the previous report are as follows:

We open the report 07_Adding_Parameters.prpt that we created.
Next, we create a copy by going to File | Save As... and saving it with the name 09_Using_Formulas.prpt.
We modify our report so that it looks like the following screenshot:

As you can see, we have just added a rectangle in the Details section and a label (Total) in the Details Header section, and we have modified the name of the label found in the Report Header section. To easily differentiate this report from the one used previously, we have also changed its colors to grayscale. Later in this article, we will make the color of the rectangle vary according to a formula, so it is important that the rest of the report does not have too many colors; this way, the result is easy for the end user to see.

We will establish default values for our parameters so we can preview the report without the delay of having to choose values for rating, year, and month:

We go to the Data tab, select the SelectRating parameter, right-click on it, and choose the Edit Parameter... option.
In Default Value, we type the value [G].
Next, we click on OK to continue.
We should do something similar for SelectYear and SelectMonth:
For SelectYear, the Default Value will be 2005.
For SelectMonth, the Default Value will be 5. Remember that the selector shows the names of the months, but internally the months' numbers are used; so, 5 represents May.

What just happened?

We created a copy of the report 07_Adding_Parameters.prpt and saved it with the name 09_Using_Formulas.prpt. We changed the layout of the report, adding new objects and changing the colors. Then we established default values for the parameters SelectRating, SelectYear, and SelectMonth.

Formulas

To manage formulas, PRD implements the open standard OpenFormula. According to OpenFormula's specification:

"OpenFormula is an open format for exchanging recalculated formulas between office application implementations, particularly for spreadsheets. OpenFormula defines the types, syntax, and semantics for calculated formulas, including many predefined functions and operations, so that formulas can be exchanged between applications and produce substantively equal outputs when recalculated with equal inputs. Both closed and open source software can implement OpenFormula."
For more information on OpenFormula, refer to the following links:

Wikipedia: http://en.wikipedia.org/wiki/OpenFormula
Specification: https://www.oasis-open.org/committees/download.php/16826/openformula-spec-20060221.html
Web: http://www.openformula.org/
Pentaho wiki: http://wiki.pentaho.com/display/Reporting/Formula+Expressions

Formulas are used for greatly varied purposes, and their use depends on the result one wants to obtain. Formulas let us carry out simple and complex calculations based on fixed and variable values. They include predefined functions for working with text, databases, and date and time, as well as general information functions and user-defined functions, and they support logical operators (AND, OR, and so on) and comparison operators (>, <, and so on).

Creating formulas

There are two ways to create formulas:

By creating a new function and going to Common | Open Formula
By pressing the button in a section's or an object's Style or Attributes tab to configure some feature

In the report we are creating in this article, we will create formulas using both methods.

Using the first method, general-use formulas can be created. That is, the result is an object that can either be included directly in our report or used as a value in another function, style, or attribute. We can create objects that make calculations at a general level, to be included in sections such as Report Header, Group Footer, and so on, or we can make calculations to be included in the Details section. In the latter case, the formula makes its calculation row by row. This is an important difference with respect to aggregate functions, which usually can only calculate totals and subtotals.

Using the second method, we create specific-use formulas that affect the value of a style or attribute of an individual object. The way to use these formulas is simple: just choose the value you want to modify in the Style or Attributes tab and click on the button that appears to its right. In this way, you can create formulas that dynamically assign values to an object's color, position, width, length, format, visibility, and so on. Using this technique, stoplights can be created by assigning different colors to an object according to a calculation, progress bars can be created by changing an object's length, and dynamic images can be placed in the report by using the result of a formula to calculate the image's path.

As these examples show, using formulas in our reports gives us great flexibility in applying styles and attributes to objects and to the report itself, as well as the possibility of creating our own objects based on complex calculations. By using formulas correctly, you will be able to give life to your reports and adapt them to changing contexts. For example, depending on which user executes the report, a certain image can appear in the Report Header section, or graphics and subreports can be hidden if the user does not have sufficient permissions.

The formula editor

The formula editor has a very intuitive and easy-to-use UI that, in addition to guiding us in creating formulas, tells us, whenever possible, the value that the formula will return. In the following screenshot, you can see the formula editor:

We will explain its layout with an example. Let's suppose that we added a new label and we want to create a formula that returns the value of Attributes.Value.
For this purpose, we do the following:

We select the option to the right of Attributes.Value. This opens the formula editor. In the upper-left corner, there is a selector where we can specify the category of functions that we want to see. Below this, we find a list of the functions that we can use to create our own formulas. In the lower-left section, we can see more information about the selected function; that is, the type of value that it will return and a general description.
We choose the CONCATENATE function by double-clicking on it, and in the lower-right section we can see the formula (Formula:) that we will use.
We type in =CONCATENATE(Any), and an assistant opens in the upper-right section that guides us in entering the values we want to concatenate.
We could complete the CONCATENATE function by adding some fixed values and some variables; take the following example:

If there is an error in the text of the formula, warning text will appear. Otherwise, the formula editor will try to show us the result that our formula will return. When it is not possible to visualize the result that a formula will return, this is usually because the values used are calculated during the execution of the report.

Formulas should always begin with the = sign. Initially, one tends to use the help that the formula editor provides, but later, with more practice, it becomes evident that it is much faster to type the formula directly. Also, if you need to enter complex formulas or combine various functions with logical operators, the formula editor will not be of much use.
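To give a concrete flavor of what such a formula can look like, the changing rectangle color described at the start of this article could be driven by a small OpenFormula expression entered as the formula for the rectangle's color style. This is only an illustrative sketch: the field name [Total], the threshold 100, and the two hex colors are our own assumptions, not values taken from the report built here:

=IF([Total] > 100; "#CC0000"; "#999999")

Square brackets reference a field of the current row, and arguments are separated by semicolons, following the OpenFormula specification quoted earlier; because the formula is attached to an object in the Details section, it is re-evaluated row by row.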


Lucene.NET: Optimizing and merging index segments

Packt
20 Aug 2013
3 min read
How to do it…

Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe demonstrates the use of the Optimize method to clean up the storage of the index data on the physical disk. The general steps in the process to optimize and merge index segments are the following:

Create/open an index.
Add or delete documents from the index.
Examine the MaxDoc and NumDocs properties of the IndexWriter class.
If the index is deemed to be too dirty, call the Optimize method of the IndexWriter class.

The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index.

namespace Lucene.NET.HowTo._12_MergeAndOptimize {
    // ...
    // build facade and an initial index of 5 documents
    var facade = new LuceneDotNetHowToExamplesFacade()
        .buildLexicographicalExampleIndex(maxDocs: 5)
        .createIndexWriter();

    // report MaxDoc and NumDocs
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // delete one document
    facade.IndexWriter.DeleteDocuments(new Term("filename", "0.txt"));
    facade.IndexWriter.Commit();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After delete / commit");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // optimize the index
    facade.IndexWriter.Optimize();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After Optimize");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));
    Trace.Flush();
    // ...
}

How it works…

When this program is run, you will see output similar to that in the following screenshot:

This program first creates an index with five files. It then reports the values of the MaxDoc and NumDocs properties of the instance of IndexWriter. MaxDoc represents the maximum number of documents that have been stored in the index. It is possible to add more documents, but that may incur a performance penalty by needing to grow the index. NumDocs is the current number of documents stored in the index. At this point these values are 5 and 5, respectively.

The next step deletes a single document named 0.txt from the index, and the changes are committed to disk. MaxDoc and NumDocs are written to the console again and now report 5 and 4, respectively. This makes sense, as one file has been deleted and there is now "slop" in the index where space is still taken up by the previously deleted document. The reference to the document's index information has been removed, but the space is still used on the disk.

The final two steps are to call Optimize and to write the MaxDoc and NumDocs values to the console for the final time. These are now 4 and 4, respectively, as Lucene.NET has merged the index segments and removed any empty disk space formerly used by deleted document index information.

Summary

A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as the index writer flushes buffered content into the index's directory and file structure. Over time this fragmentation will cause searches to slow, requiring a merge/optimization to be performed to regain performance.


Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

Packt
14 Aug 2013
6 min read
Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about it. Some of the common analytics are shown as follows:

Calculating statistical properties such as minimum, maximum, mean, median, standard deviation, and so on of a dataset. A dataset generally has multiple dimensions (for example, when processing HTTP access logs, the name of the web page, the size of the web page, access time, and so on are a few of the dimensions). We can measure the previously mentioned properties using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value in each case.
Frequency distributions (histograms) count the number of occurrences of each item in the dataset, sort these frequencies, and plot the different items on the X axis and the frequency on the Y axis.
Finding a correlation between two dimensions (for example, the correlation between access count and the file size of web accesses).
Hypothesis testing: verifying or disproving a hypothesis using a given dataset.

However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often gives us a deeper understanding of it. Therefore, we often plot the results of Hadoop jobs using some plotting program.

Getting ready

This article assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured. Download a Hadoop 1.1.x distribution from the http://hadoop.apache.org/releases.html page. Unzip the distribution; we will call this directory HADOOP_HOME. Download the sample code for the article and copy the data files.

How to do it...

If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:

>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset

Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

Run the first MapReduce job to calculate the buying frequency. To do that, run the following command from HADOOP_HOME:

$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer /data/amazon-dataset /data/frequency-output1

Use the following command to run the second MapReduce job to sort the results of the first MapReduce job:

$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 /data/frequency-output2

You can find the results in the output directory. Copy the results to HADOOP_HOME using the following command:

$ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data

Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME.

Generate the plot by running the following command from HADOOP_HOME:

$ gnuplot buyfreq.plot

It will generate a file called buyfreq.png, which will look like the following:

As the figure depicts, a few buyers have bought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a power law distribution. This is an example of how analytics and plotting results can give us insight into underlying patterns in the dataset.

How it works...
You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java. This figure shows the execution of the two MapReduce jobs. The following code listing shows the map function and the reduce function of the first job:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    List<BuyerRecord> records = BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        context.write(new Text(record.customerID),
                new IntWritable(record.itemsBrought.size()));
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}

As shown by the figure, Hadoop reads the input file from the input folder and reads records using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once per record, passing the record as input. The mapper extracts the customer ID and the number of items the customer has bought, and emits the customer ID as the key and the number of items as the value. Then, Hadoop sorts the key-value pairs by key and invokes a reducer once for each key, passing all values for that key as inputs. Each reducer sums up the item counts for its customer ID and emits the customer ID as the key and the count as the value in the results.

The second job then sorts the results. It reads the output of the first job and passes each line as an argument to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop sorts the key-value pairs by key, thus sorting them by the number of items, and invokes the reducer once per key in that order. Therefore, the reducer prints them out in the same order, essentially sorting the dataset.

Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file in buyfreq.plot. The source for the plot will look like the following:

set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items brought by Buyer";
set ylabel "Number of Items Brought";
set xlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints

Here the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines define the axis labels, the title, and the legend position, and the next two lines define the scale of each axis; this plot uses a log scale for both. The last line defines the plot. Here, it is asking gnuplot to read the data from the 1.data file, to use the data in the second column of the file (via using 2), and to plot it using lines. Columns must be separated by whitespace. If you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.

There's more...

We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.
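If you want to sanity-check the logic of these two jobs without running Hadoop, the same count-then-sort computation can be sketched in a few lines of plain Python. This is only an illustrative sketch; the in-memory purchases list is a made-up stand-in for the Amazon dataset and none of it is part of the book's sample code:

from collections import Counter

# hypothetical (customer, item) pairs standing in for the parsed amazon-meta.txt records
purchases = [("c1", "book A"), ("c1", "book B"), ("c2", "book A"),
             ("c1", "book C"), ("c3", "book B"), ("c2", "book D")]

# "job 1": count the items bought per customer (what BuyingFrequencyAnalyzer emits)
counts = Counter(customer for customer, item in purchases)

# "job 2": order the customers by their item count (what SimpleResultSorter produces)
for customer, n_items in sorted(counts.items(), key=lambda kv: kv[1]):
    print(customer, n_items)

Plotting the second column of that output on a log-log scale is exactly what the buyfreq.plot script does with 1.data.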
Summary

In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot.


Calculus

Packt
14 Aug 2013
8 min read
Derivatives

To compute the derivative of a function, create the corresponding expression and use diff(). Its first argument is the expression and the second is the variable with regard to which you want to differentiate. The result is the expression for the derivative:

>>> diff(exp(x**2), x)
2*x*exp(x**2)
>>> diff(x**2 * y**2, y)
2*x**2*y

Higher-order derivatives can also be computed with a single call to diff():

>>> diff(x**3, x, x)
6*x
>>> diff(x**3, x, 2)
6*x
>>> diff(x**2 * y**2, x, 2, y, 2)
4

Due to SymPy's focus on expressions rather than functions, the derivatives of symbolic functions can seem a little surprising, but LaTeX rendering in the notebook should make their meaning clear:

>>> f = Function('f')
>>> diff(f(x**2), x)
2*x*Subs(Derivative(f(_xi_1), _xi_1), (_xi_1,), (x**2,))

Let's take a look at the following screenshot:

Limits

Limits are obtained through limit(). The syntax for the limit of expr when x goes to some value x0 is limit(expr, x, x0). To specify a limit towards infinity, you need to use SymPy's infinity object, named oo. This object will also be returned for infinite limits:

>>> limit(exp(-x), x, oo)
0
>>> limit(1/x**2, x, 0)
oo

There is also a fourth optional parameter, to specify the direction of approach of the limit target. "+" (the default) gives the limit from above, and "-" is from below. Obviously, this parameter is ignored when the limit target is infinite:

>>> limit(1/x, x, 0, "-")
-oo
>>> limit(1/x, x, 0, "+")
oo

Let's take a look at the following screenshot:

Integrals

SymPy has powerful algorithms for integration and, in particular, can find most integrals of logarithmic and exponential functions expressible with special functions, and many more besides, thanks to Meijer G-functions. The main function for integration is integrate(). It can compute both antiderivatives (indefinite integrals) and definite integrals. Note that the value of an antiderivative is only defined up to an arbitrary constant, but the result does not include it:

>>> integrate(sin(x), x)
-cos(x)
>>> integrate(sin(x), (x, 0, pi))
2

Unevaluated symbolic integrals and antiderivatives are represented by the Integral class. integrate() may return these objects if it cannot compute the integral. It is also possible to create Integral objects directly, using the same syntax as integrate(). To evaluate them, call their .doit() method:

>>> integral = Integral(sin(x), (x, 0, pi))
>>> integral
Integral(sin(x), (x, 0, pi))
>>> integral.doit()
2

Let's take a look at the following screenshot:

Taylor series

A Taylor series approximation is an approximation of a function obtained by truncating its Taylor series. To compute it, use series(expr, x, x0, n), where x is the relevant variable, x0 is the point where the expansion is done (defaults to 0), and n is the order of expansion (defaults to 6):

>>> series(cos(x), x)
1 - x**2/2 + x**4/24 + O(x**6)
>>> series(cos(x), x, n=10)
1 - x**2/2 + x**4/24 - x**6/720 + x**8/40320 + O(x**10)

The O(x**6) part in the result is a "big-O" object. Intuitively, it represents all the terms of order equal to or higher than 6.
This object automatically absorbs or combines with powers of the variable, which makes simple arithmetic operations on expansions convenient:

>>> O(x**2) + 2*x**3
O(x**2)
>>> O(x**2) * 2*x**3
O(x**5)
>>> expand(series(sin(x), x, n=6) * series(cos(x), x, n=4))
x - 2*x**3/3 + O(x**5)
>>> series(sin(x)*cos(x), x, n=5)
x - 2*x**3/3 + O(x**5)

If you want to use the expansion as an approximation of the function, the O() term prevents it from behaving like an ordinary expression, so you need to remove it. You can do so by using the aptly named .removeO() method:

>>> series(cos(x), x).removeO()
x**4/24 - x**2/2 + 1

Taylor series look better in the notebook, as shown in the following screenshot:

Solving equations

This section will teach you how to solve the different types of equations that SymPy handles. The main function to use for solving equations is solve(). Its interface is somewhat complicated, as it accepts many different kinds of inputs and can output results in various forms depending on the input.

In the simplest case, univariate equations, use the syntax solve(expr, x) to solve the equation expr = 0 for the variable x. If you want to solve an equation of the form A = B, simply put it in the preceding form, using solve(A - B, x). This can solve algebraic and transcendental equations involving rational fractions, square roots, absolute values, exponentials, logarithms, trigonometric functions, and so on. The result is a list of the values of the variable satisfying the equation. The following commands show a few examples of equations that can be solved:

>>> solve(x**2 - 1, x)
[-1, 1]
>>> solve(x*exp(x) - 1, x)
[LambertW(1)]
>>> solve(abs(x**2-4) - 3, x)
[-1, 1, -sqrt(7), sqrt(7)]

Note that the form of the result means that it can only return a finite set of solutions. In cases where the true solution set is infinite, it can therefore be misleading. When the solution is an interval, solve() typically returns an empty list. For periodic functions, usually only one solution is returned:

>>> solve(0, x)           # all x are solutions
[]
>>> solve(x - abs(x), x)  # all positive x are solutions
[]
>>> solve(sin(x), x)      # all k*pi with k integer are solutions
[0]

The domain over which the equation is solved depends on the assumptions on the variable. Hence, if the variable is a real Symbol object, only real solutions are returned, but if it is complex, then all solutions in the complex plane are returned (subject to the aforementioned restriction on returning infinite solution sets). This difference is readily apparent when solving polynomials, as the following example demonstrates:

>>> solve(x**2 + 1, x)
[]
>>> solve(z**2 + 1, z)
[-I, I]

There is no restriction on the number of variables appearing in the expression. Solving a multivariate expression for any of its variables allows it to be expressed as a function of the other variables, and to eliminate it from other expressions. The following example shows different ways of solving the same multivariate expression:

>>> solve(x**2 - exp(a), x)
[-exp(a/2), exp(a/2)]
>>> solve(x**2 - exp(a), a)
[log(x**2)]
>>> solve(x**2 - exp(a), x, a)
[{x: -exp(a/2)}, {x: exp(a/2)}]
>>> solve(x**2 - exp(a), x, b)
[{x: -exp(a/2)}, {x: exp(a/2)}]

To solve a system of equations, pass a list of expressions to solve(): each one will be interpreted, as in the univariate case, as an equation of the form expr = 0.
The result can be returned in one of two forms, depending on the mathematical structure of the input: either as a list of tuples, where each tuple contains the values for the variables in the order given to solve(), or as a single dictionary, suitable for use in subs(), mapping variables to their values. As you can see in the following example, it can be hard to predict what form the result will take:

>>> solve([exp(x**2) - y, y - 3], x, y)
[(-sqrt(log(3)), 3), (sqrt(log(3)), 3)]
>>> solve([x**2 - y, y - 3], x, y)
[(-sqrt(3), 3), (sqrt(3), 3)]
>>> solve([x - y, y - 3], x, y)
{y: 3, x: 3}

This variability in return types is fine for interactive use, but for library code, more predictability is required. In this case, you should use the dict=True option. The output will then always be a list of mappings of variables to values. Compare the following example to the previous one:

>>> solve([x**2 - y, y - 3], x, y, dict=True)
[{y: 3, x: -sqrt(3)}, {y: 3, x: sqrt(3)}]
>>> solve([x - y, y - 3], x, y, dict=True)
[{y: 3, x: 3}]

Summary

In this article, we used SymPy to compute derivatives, limits, integrals, and Taylor series, and to solve equations.
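The snippets above are shown as interactive (>>>) sessions and assume that SymPy has already been imported and the symbols defined. The following self-contained sketch reproduces a few of them as a script; the exact import list and the assumptions attached to each symbol (x and y real, z complex by default) are our own reconstruction, not something prescribed by the article:

from sympy import (symbols, Symbol, Function, exp, sin, cos, pi, oo,
                   diff, limit, integrate, Integral, series, solve, expand)

x, y = symbols('x y', real=True)   # real symbols: solve() returns only real roots
z = Symbol('z')                    # complex by default: solve(z**2 + 1, z) gives [-I, I]
a = Symbol('a')

print(diff(exp(x**2), x))              # 2*x*exp(x**2)
print(limit(1/x, x, 0, "+"))           # oo
print(integrate(sin(x), (x, 0, pi)))   # 2
print(series(cos(x), x).removeO())     # x**4/24 - x**2/2 + 1
print(solve([x**2 - y, y - 3], x, y, dict=True))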


Using Unrestricted Languages

Packt
13 Aug 2013
15 min read
Are untrusted languages inferior to trusted ones?

No, on the contrary. These languages are untrusted in the same way that a sharp knife is untrusted and should not be handed to very small children, at least not without adult supervision. They have extra powers that ordinary SQL, the trusted languages (such as PL/pgSQL), and the trusted variants of the same language (PL/Perl versus PL/Perlu) don't have. You can use the untrusted languages to directly read and write the server's disks, and you can use them to open sockets and make Internet queries to the outside world. You can even send arbitrary signals to any process running on the database host. Generally, you can do anything the native language of the PL can do.

However, you probably should not trust arbitrary database users with the right to define functions in these languages. Always think twice before giving all privileges on some untrusted language to a user or group by using the *u languages for important functions.

Can you use the untrusted languages for important functions?

Absolutely. Sometimes, it may be the only way to accomplish some tasks from inside the server. Performing simple queries and computations should do nothing harmful to your database, and neither should connecting to the external world for sending e-mails, fetching web pages, or doing SOAP requests. They may cause delays and even queries that get stuck, but these can usually be dealt with by setting an upper limit on how long a query can run, using an appropriate statement timeout value. Setting a reasonable statement timeout by default is good practice anyway.

So, if you don't deliberately do risky things, the probability of harming the database is no bigger than when using a "trusted" (also known as "restricted") variant of the language. However, if you give the language to someone who starts changing bytes on the production database "to see what happens", you will probably get what you asked for.

Will untrusted languages corrupt the database?

The power to corrupt the database is definitely there, since the functions run as the system user of the database server with full access to the filesystem. So, if you blindly start writing into the data files and deleting important logs, it is very likely that your database will be corrupted. Additional types of denial-of-service attacks are also possible, such as using up all memory or opening all IP ports; but there are ways to overload the database using plain SQL as well, so that part is not much different from trusted database access with the ability to run arbitrary queries.

So yes, you can corrupt the database, but please don't do it on a production server. If you do, you will be sorry.

Why untrusted?

PostgreSQL's ability to use an untrusted language is a powerful way to perform some nontraditional things from database functions. Creating these functions in a PL is an order of magnitude smaller task than writing an extension function in C.
For example, a function to look up a hostname for an IP address is only a few lines in PL/Pythonu:

CREATE FUNCTION gethostbyname(hostname text)
  RETURNS inet
AS $$
import socket
return socket.gethostbyname(hostname)
$$ LANGUAGE plpythonu SECURITY DEFINER;

You can test it immediately after creating the function by using psql:

hannu=# select gethostbyname('www.postgresql.org');
 gethostbyname
----------------
 98.129.198.126
(1 row)

Creating the same function in the most untrusted language, C, involves writing tens of lines of boilerplate code, worrying about memory leaks, and all the other problems that come from writing code in a low-level language. I recommend prototyping in some PL language if possible, and in an untrusted language if the function needs something that the restricted languages do not offer.

Why PL/Python?

All of these tasks could be done equally well using PL/Perlu or PL/Tclu; I chose PL/Pythonu mainly because Python is the language I am most comfortable with. This also translates to having written some PL/Python code, which I plan to discuss and share with you in this article.

Quick introduction to PL/Python

PL/pgSQL is a language unique to PostgreSQL and was designed to add blocks of computation and SQL inside the database. While it has grown in its breadth of functionality, it still lacks the completeness of syntax of a full programming language. PL/Python allows your database functions to be written in Python, with all the depth and maturity of writing Python code outside the database.

A minimal PL/Python function

Let's start from the very beginning (again):

CREATE FUNCTION hello(name text)
  RETURNS text
AS $$
return 'hello %s !' % name
$$ LANGUAGE plpythonu;

Here, we see that creating the function starts by defining it as any other PostgreSQL function, with a RETURNS definition of a text field:

CREATE FUNCTION hello(name text) RETURNS text

The difference from what we have seen before is that the language part specifies plpythonu (the language ID for the PL/Pythonu language):

$$ LANGUAGE plpythonu;

Inside the function body it is very much a normal Python function, returning a value obtained by formatting the name passed as an argument into the string 'hello %s !' using the standard Python formatting operator %:

return 'hello %s !' % name

Finally, let's test how this works:

hannu=# select hello('world');
     hello
---------------
 hello world !
(1 row)

And yes, it returns exactly what is expected!

Data type conversions

The first and last things that happen when a PL function is called by PostgreSQL are converting argument values between the PostgreSQL and PL types. The PostgreSQL types need to be converted to the PL types on entering the function, and then the return value needs to be converted back into a PostgreSQL type. Except for PL/pgSQL, which uses PostgreSQL's own native types in computations, the PLs are based on existing languages with their own understanding of what types (integer, string, date, and so on) are, how they should behave, and how they are represented internally. They are mostly similar to PostgreSQL's understanding but quite often are not exactly the same. PL/Python converts data from PostgreSQL types to Python types as shown in the following table:

PostgreSQL                                      Python 2   Python 3   Comments
int2, int4                                      int        int
int8                                            long       int
real, double, numeric                           float      float      This may lose precision for numeric values.
bytea                                           str        bytes      No encoding conversion is done, nor should any encoding be assumed.
text, char(), varchar(), and other text types   str        str        On Python 2, the string will be in server encoding. On Python 3, it is a unicode string.
All other types                                 str        str        PostgreSQL's type output function is used to convert to this string.

Inside the function, all computation is done using Python types, and the return value is converted back to PostgreSQL using the following rules (this is a direct quote from the official PL/Python documentation at http://www.postgresql.org/docs/current/static/plpython-data.html):

When the PostgreSQL return type is Boolean, the return value will be evaluated for truth according to the Python rules. That is, 0 and empty string are false, but notably 'f' is true.
When the PostgreSQL return type is bytea, the return value will be converted to a string (Python 2) or bytes (Python 3) using the respective Python built-ins, with the result being converted to bytea.
For all other PostgreSQL return types, the returned Python value is converted to a string using Python's built-in str, and the result is passed to the input function of the PostgreSQL data type. Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to PostgreSQL. Strings that are not valid in the current server encoding will raise an error; but not all encoding mismatches can be detected, so garbage data can still result when this is not done correctly. Unicode strings are converted to the correct encoding automatically, so it can be safer and more convenient to use those. In Python 3, all strings are Unicode strings.

In other words, 0, False, and an empty sequence, including the empty string '' or an empty dictionary, become PostgreSQL false; anything else becomes true. One notable exception is that the check for None is done before any other conversions, and even for Booleans, None is always converted to NULL and not to the Boolean value false. For the bytea type, the PostgreSQL byte array, the conversion from Python's string representation is an exact copy, with no encoding or other conversions applied.

Writing simple functions in PL/Python

Writing functions in PL/Python is not much different in principle from writing functions in PL/pgSQL. You still have the exact same syntax around the function body in $$, and the argument names, types, and RETURNS clause all mean the same thing regardless of the exact PL language used.

A simple function

A simple add_one() function in PL/Python looks like this:

CREATE FUNCTION add_one(i int)
  RETURNS int
AS $$
return i + 1;
$$ LANGUAGE plpythonu;

It can't get much simpler than that, can it? What you see here is that the arguments are passed to the Python code after converting them to the appropriate types, and the result is passed back and converted to the appropriate PostgreSQL type for the return value.

Functions returning a record

To return a record from a Python function, you can use:

A sequence or list of values in the same order as the fields in the return record
A dictionary with keys matching the fields in the return record
A class or type instance with attributes matching the fields in the return record

Here are samples of the three ways to return a record.
First, using an instance:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
class PGUser:
    def __init__(self, username, user_id, is_superuser):
        self.username = username
        self.user_id = user_id
        self.is_superuser = is_superuser
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
user = PGUser(u['usename'], u['usesysid'], u['usesuper'])
return user
$$ LANGUAGE plpythonu;

Then, a little simpler one using a dictionary:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
return {'username': u['usename'],
        'user_id': u['usesysid'],
        'is_superuser': u['usesuper']}
$$ LANGUAGE plpythonu;

Finally, using a tuple:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
return (u['usename'], u['usesysid'], u['usesuper'])
$$ LANGUAGE plpythonu;

Notice the [0] at the end of u = plpy.execute(...)[0] in all the examples. It is there to extract the first row of the result, as even for one-row results plpy.execute still returns a list of rows.

Danger of SQL injection!

As we have neither prepared the query and executed it with arguments, nor used the plpy.quote_literal() method (both techniques are discussed later) to safely quote the username before merging it into the query, we are open to a security flaw known as SQL injection. So, make sure that you only let trusted users call this function or supply the username argument.

Calling the function defined via any of these three CREATE commands looks exactly the same:

hannu=# select * from userinfo('postgres');
 username | user_id | is_superuser
----------+---------+--------------
 postgres |      10 | t
(1 row)

It usually does not make sense to declare a class inside a function just to return a record value. This possibility is included mostly for cases where you already have a suitable class with a set of attributes matching the ones the function returns.

Table functions

When returning a set from a PL/Python function, you have three options:

Return a list or any other sequence of the return type
Return an iterator or generator
yield the return values from a loop

Here, we have three ways to generate all even numbers up to the argument value using these different styles.

First, returning a list of integers:

CREATE FUNCTION even_numbers_from_list(up_to int)
  RETURNS SETOF int
AS $$
return range(0, up_to, 2)
$$ LANGUAGE plpythonu;

The list here is returned by a built-in Python function called range, which returns all even numbers below the argument. This gets returned as a table of integers, one integer per row, from the PostgreSQL function. If the RETURNS clause of the function definition said int[] instead of SETOF int, the same function would return the even integers as a single PostgreSQL array.

The next function returns a similar result using a generator, returning both the even number and the odd number following it. Also, notice the different PostgreSQL syntax RETURNS TABLE(...)
used this time for defining the return set:

CREATE FUNCTION even_numbers_from_generator(up_to int)
  RETURNS TABLE (even int, odd int)
AS $$
return ((i, i+1) for i in xrange(0, up_to, 2))
$$ LANGUAGE plpythonu;

The generator is constructed using a generator expression (x for x in <seq>). Finally, the function is defined using a generator with an explicit yield syntax, and yet another PostgreSQL syntax is used for returning SETOF RECORD, with the record structure defined this time by OUT parameters:

CREATE FUNCTION even_numbers_with_yield(
    up_to int,
    OUT even int,
    OUT odd int)
  RETURNS SETOF RECORD
AS $$
for i in xrange(0, up_to, 2):
    yield i, i+1
$$ LANGUAGE plpythonu;

The important part here is that you can use any of the preceding ways to define a PL/Python set-returning function, and they all work the same. Also, you are free to return a mixture of different types for each row of the set:

CREATE FUNCTION birthdates(OUT name text, OUT birthdate date)
  RETURNS SETOF RECORD
AS $$
return (
    {'name': 'bob', 'birthdate': '1980-10-10'},
    {'name': 'mary', 'birthdate': '1983-02-17'},
    ['jill', '2010-01-15'],
)
$$ LANGUAGE plpythonu;

This yields the following result:

hannu=# select * from birthdates();
 name | birthdate
------+------------
 bob  | 1980-10-10
 mary | 1983-02-17
 jill | 2010-01-15
(3 rows)

As you see, the data-returning part of PL/Pythonu is much more flexible than returning data from a function written in PL/pgSQL.

Running queries in the database

If you have ever accessed a database in Python, you know that most database adapters conform to a somewhat loose standard called the Python Database API Specification v2.0, or DBAPI 2 for short. The first thing you need to know about database access in PL/Python is that in-database queries do not follow this API.

Running simple queries

Instead of using the standard API, there are just three functions for doing all database access. The two main ones come in two variants: plpy.execute() for running a query, and plpy.prepare() for turning query text into a query plan, or prepared query. The simplest way to do a query is with:

res = plpy.execute(<query text>, [<row count>])

This takes a textual query and an optional row count, and returns a result object, which emulates a list of dictionaries, one dictionary per row. As an example, if you want to access the field 'name' of the third row of the result, you use:

res[2]['name']

The index is 2 and not 3 because Python lists are indexed starting from 0, so the first row is res[0], the second row res[1], and so on.

Using prepared queries

In an ideal world this would be all that is needed, but plpy.execute(query, cnt) has two shortcomings:

It does not support parameters
The plan for the query is not saved, requiring the query text to be parsed and run through the optimizer at each invocation

We will show a way to properly construct a query string later, but for most uses simple parameter passing is enough. So, the execute(query, [maxrows]) call becomes a set of two statements:

plan = plpy.prepare(<query text>, <list of argument types>)
res = plpy.execute(plan, <list of values>, [<row count>])

For example, to query whether the user 'postgres' is a superuser, use the following:

plan = plpy.prepare("select usesuper from pg_user where usename = $1", ["text"])
res = plpy.execute(plan, ["postgres"])
print res[0]["usesuper"]

The first statement prepares the query, which parses the query string into a query tree, optimizes this tree to produce the best query plan available, and returns the prepared_query object.
The second statement uses the prepared plan to query for a specific user's superuser status. The prepared plan can be used multiple times, so you could continue and check whether user bob is a superuser:

res = plpy.execute(plan, ["bob"])
print res[0]["usesuper"]

Caching prepared queries

Preparing a query can be quite an expensive step, especially for more complex queries where the optimizer has to choose from a rather large set of possible plans, so it makes sense to reuse the result of this step if possible. The current implementation of PL/Python does not automatically cache query plans (prepared queries), but you can do it easily yourself:

try:
    plan = SD['is_super_qplan']
except:
    SD['is_super_qplan'] = plpy.prepare("....
    plan = SD['is_super_qplan']
<the rest of the function>

The values in SD[] and GD[] only live inside a single database session, so it only makes sense to do the caching if you have long-lived connections.
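To make the caching pattern above concrete, here is one way the fragment could be completed into a full function, reusing the pg_user query prepared earlier in this section. Treat it as a sketch: the function name is_super_cached is our own, and the bare except mirrors the fragment above even though catching KeyError explicitly would be more precise:

CREATE OR REPLACE FUNCTION is_super_cached(username text)
  RETURNS boolean
AS $$
# cache the prepared plan in SD[] so it is reused across calls in this session
try:
    plan = SD['is_super_qplan']
except:
    SD['is_super_qplan'] = plpy.prepare(
        "select usesuper from pg_user where usename = $1", ["text"])
    plan = SD['is_super_qplan']
# run the cached plan with the current argument; fetch at most one row
res = plpy.execute(plan, [username], 1)
return res[0]["usesuper"]
$$ LANGUAGE plpythonu;

Called as select is_super_cached('postgres'), it should return t, just like the uncached version shown before.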


Quick start – Creating your first Java application

Packt
12 Aug 2013
14 min read
Cassandra's storage architecture is designed to manage large data volumes and revolves around some important factors:

Decentralized systems
Data replication and transparency
Data partitioning

Decentralized systems are systems that provide maximum throughput from each node. Cassandra offers decentralization by keeping each node with an identical configuration: there are no master-slave relationships between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.

A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes.

Cassandra replication strategies

Cassandra replicates data across the nodes based on the configured replication factor. If the replication factor is 1, one copy of each dataset will be available on one node only. If the replication factor is 2, two copies of each dataset will be available on different nodes in the cluster. Still, Cassandra ensures data transparency, as for an end user data is served from one logical cluster. Cassandra offers two types of replication strategies.

Simple strategy

Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes based on the replication factor, in a clockwise direction. With a replication factor of 3, two more copies of each row will be placed on nearby nodes in a clockwise direction:

Network topology strategy

Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to be placed within each data center. Data colocation and no single point of failure are two important factors that we need to prioritize while configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks for a node in a different rack (in the same data center). If there is no such node, data replicas will be placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of the dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency.

NTS depends on the snitch configuration for proper data replica placement across different data centers. A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch. We can configure the snitch in cassandra.yaml (the configuration file).

Data partitioning

A data partitioning strategy is required to select the node for a given data read/write request. Cassandra offers two types of partitioning strategies.

Random partitioning

Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value (initial_token for a node is defined in cassandra.yaml) generated by a one-way hashing (MD5) algorithm. Each node is assigned an initial token value (to determine its position in the ring) and a data range is assigned to the node.
If a read/write request with a token value (generated from the row key value) lies within the assigned range of a node, then that particular node is responsible for serving the request. The following diagram is a common graphical representation of a number of nodes placed in a circular arrangement, or ring, with the data range evenly distributed between these nodes:

Ordered partitioning

Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load-balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and that the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.

Cassandra write path

Here, we will discuss how Cassandra processes a write request and stores it on disk. As we have mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to eligible nodes based on the cluster's partitioning strategy and replication factor. The write is first recorded in a commit log and then delegated to the corresponding memtables (see the preceding diagram). A memtable is an in-memory table, which serves subsequent read requests without any lookup on disk. For each column family, there is one memtable. Once a memtable is full, its data is flushed to disk in the form of SS tables, asynchronously. Once all the segments are flushed onto the disk, they are recycled. Periodically, Cassandra performs compaction over SS tables (sorted by row keys) and reclaims unused segments. In case of a data node restart (unwanted scenarios such as failover), the commit log is replayed to recover any previously incomplete write requests.

Hands on with the Cassandra command-line interface

Cassandra provides a default command-line interface that is located at:

CASSANDRA_HOME/bin/cassandra-cli.sh when using Linux
CASSANDRA_HOME/bin/cassandra-cli.bat when using Windows

Before we proceed with the sample exercise, let's have a look at the Cassandra schema:

Keyspace: A keyspace may contain multiple column families; similarly, a cluster (made up of multiple nodes) can contain multiple keyspaces.
Column family: A column family is a collection of rows with defined column metadata. Cassandra offers ways to define two types of column families, namely static and dynamic column families.
Static column family: A static column family contains a predefined set of columns with metadata. Please note that a predefined set of columns may exist, but the number of columns can vary across multiple rows within the column family.
Dynamic column family: A dynamic column family generally defines a comparator type and validation class for all columns instead of individual column metadata. The client application is responsible for providing columns for a particular row key, which means the column names and values may differ across multiple row keys.
Column: A column can be thought of as a cell, which contains a name, value, and timestamp.
Super column: A super column is similar to a column and contains a name, value, and timestamp, except that a super column value may contain a collection of columns. Super columns cannot be sorted; however, subcolumns within super columns can be sorted by defining a sub-comparator. Super columns do have some limitations, such as that secondary indexes over super columns are not possible. Also, it is not possible to read a particular super column without deserializing the wrapped subcolumns. Because of such limitations, usage of super columns is highly discouraged within the Cassandra community; using composite columns we can achieve the same functionality. In later articles, we will cover composite columns in detail.
Counter column family: From 0.8 onwards, Cassandra has support for counter columns. Counter columns are useful for applications that, for example, maintain the page count for a website or do aggregation based on a column value from another column family. A counter column is a sort of 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:

In case of events such as disk failure, it is not possible to replay a column family containing counters without reinitializing and removing all the data
Secondary indexes over counter columns are not supported in Cassandra
Frequent insert/delete operations over a counter column in a short period of time may result in inconsistent counter values

There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775), and considering the preceding limitations before opting for counter columns is recommended.

You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in local mode, there is only one node. Once it has successfully started, you should see logs on your console, as follows:

Cassandra-cli: The Cassandra distribution, by default, provides a command-line utility (cassandra-cli), which can be used for basic DDL/DML operations; you can connect to a local/remote Cassandra server instance by specifying the host and port options, as follows:

$CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160

Performing DDL/DML operations on the column family

First, we need to create a keyspace using the create keyspace command, as follows:

The create keyspace command: This operation will create a keyspace cassandraSample with the node placement strategy SimpleStrategy and a replication factor of one. By default, if you don't specify placement_strategy and strategy_options, it will opt for NTS, with replication in one data center:

create keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

We can look for available keyspaces by running the following command:

show keyspaces;

This will result in the following output:

We can always update the keyspace for configurations such as the replication factor. To update the keyspace, do the following:

Modify the replication factor: You can update a keyspace to change the replication factor as well as the placement strategy.
For example, to change the replication factor to 2 for cassandraSample, you simply need to execute the following command:

update keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:2};

Modify the placement strategy: You can change the placement strategy to NTS by executing the following command:

update keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {datacenter1:1};

Strategy options are in the format {datacentername:number of replicas}, and there can be multiple data centers.

After successfully creating a keyspace, and before proceeding with other DDL operations (for example, column family creation), we need to authorize to the keyspace. We authorize to a keyspace using the following command:

use cassandraSample;

Create a column family/super column family as follows:

Use the following command to create the column family users within the cassandraSample keyspace:

create column family users
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type';

To create a super column family suysers, you need to run the following command:

create column family suysers
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and subcomparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and column_type = 'Super'
  and column_metadata = [{column_name: name, validation_class: UTF8Type}];

key_validation_class: defines the datatype for the row key
comparator: defines the datatype for the column name
default_validation_class: defines the datatype for the column value
subcomparator: defines the datatype for subcolumns

You can create/update a column by using the set method, as follows:

// create a column named "username", with a value of "user1" for row key 1
set users[1][username] = user1;
// create a column named "password", with a value of "password1" for row key 1
set users[1][password] = password1;
// create a column named "username", with a value of "user2" for row key 2
set users[2][username] = user2;
// create a column named "password", with a value of "password2" for row key 2
set users[2][password] = password2;

To fetch all the rows and columns from a column family, execute the following commands:

// to list all persisted rows within a column family
list users;
// to fetch a row from the users column family having row key value "1"
get users[1];

You can delete a column as follows:

// to delete the column "username" for row key 1
del users[1][username];

To update the column family, do the following: if you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command:

update column family users
  with key_validation_class=BytesType
  and comparator=UTF8Type
  and column_metadata = [{column_name:password, validation_class:BytesType}];

To drop/truncate the column family, follow the ensuing steps:

Delete all the data rows from the column family users, as follows:

truncate users;

Drop a column family by issuing the following command:

drop column family users;

These are some basic operations that should give you a brief idea of how to create and manage the Cassandra schema.

Cassandra Query Language

Cassandra is schemaless, but CQL is useful when we need data modeling with a traditional RDBMS flavor. Two variants of CQL (2.0 and 3.0) are provided by Cassandra.
We will use CQL3.0 for a quick exercise. We will refer to similar exercises, as we follow with the Cassandra-cli interface. The command to connect with cql is as follows: $CASSANDRA_HOME/bin/cqlsh host port cqlversion You can connect to the localhost and 9160 ports by executing the following command: $CASSANDRA_HOME/bin/cqlsh localhost 9160 -3 After successfully connecting to the command-line CQL client, you can create the keyspace as follows: create keyspace cassandrasample with strategy_class='SimpleStrategy' and strategy_options:replication_factor=1;Update keyspacealter keyspace cassandrasample with strategy_class='NetworkTopologyStrategy' and strategy_options:datacenter=1; Before creating any column family and storing data, we need to authorize such ddl/dml operations to a keyspace (for example, cassandraSample). We can authorize to a keyspace as follows: use cassandrasample; We can always run the describe keyspace command to look into containing column families and configuration settings. We can describe a keyspace as follows: describe keyspace cassandrasample; We will create a users column family with user_id as row key and username and password as columns. To create a column family, such as users, use the following command: create columnfamily users(user_id varchar PRIMARY KEY,username varchar, password varchar); To store a row in the users column family for row key value 1, we will run the following CQL query: insert into users(user_id,username,password) values(1,'user1','password1'); To select all the data from the users column family, we need to execute the following CQL query: select * from users; We can delete a row as well as specific columns using the delete operation. The following command-line scripts are to perform the deletion of a complete row and column age from the users column family, respectively: // delete complete row for user_id=1delete from users where user_id=1; // delete age column from users for row key 1.delete age from users where user_id=1; You can update a column family to add columns and to update or drop column metadata. Here are a few examples: // add a new columnalter columnfamily users add age int; // update column metadataalter columnfamily users alter password type blob; Truncating a column family will delete all the data belonging to the corresponding column family, whereas dropping a column family will also remove the column family definition along with the containing data. We can drop/truncate the column family as follows: truncate users;drop columnfamily users; Dropping a keyspace means instantly removing all the column families and data available within that keyspace.We can drop a keyspace using the following command: drop keyspace cassandrasample; By default, the CQL shell converts the column family and keyspace name to lowercase. You can ensure case sensitivity by wrapping these identifiers within " " . Summary This article showed how to create a Java application using Cassandra. Resources for Article: Further resources on this subject: Getting Started with Apache Cassandra [Article] Apache Cassandra: Working in Multiple Datacenter Environments [Article] Apache Cassandra: Libraries and Applications [Article]
Overview of SQL Server Reporting Services 2012 Architecture, Features, and Tools

Packt
08 Aug 2013
15 min read
(For more resources related to this topic, see here.) Structural design of SQL servers and SharePoint environment Depending on the business and the resources available, the various servers may be located in distributed locations and the Web applications may also be run from Web servers in a farm and the same can be true for SharePoint servers. In this article, by the word architecture we mean the way by which the preceding elements are put together to work on a single computer. However, it is important to know that this is just one topology (an arrangement of constituent elements) and in general it can be lot more complicated spanning networks and reaching across boundaries. The Report Server is the centerpiece of the Reporting Services installation. This installation can be deployed in two modes, namely, Native mode or SharePoint Integrated mode. Each mode has a separate engine and an extensible architecture. It consists of a collection of special-purpose extensions that handle authentication, data processing, rendering, and delivery operations. Once deployed in one mode it cannot be changed to the other. It is possible to have two servers each installed in a different mode. We have installed all the necessary elements to explore the RS 2012 features, including Power View and Data Alerts. The next diagram briefly shows the structural design of the environment used in working with the article: Primarily, SQL Server 2012 Enterprise Edition is used, for both Native mode as well as SharePoint Integrated mode. As we see in the previous diagram, Report Server Native mode is on a named instance HI (in some places another named instance Kailua is also used). This server has the Reporting Services databases ReportServer$HI and ReportServer$HITempDB. The associated Report Server handles Jobs, Security, and Shared Schedules. The Native mode architecture described in the next section is taken from the Microsoft documentation. The tools (SSDT, Report Builder, Report Server Configuration, and so on) connect to the Report Server. The associated SQL Server Agent takes care of the jobs such as subscriptions related to Native mode. The SharePoint Server 2010 is a required element with which the Reporting Services add-in helps to create a Reporting Services Service. With the creation of the RS Service in SharePoint, three SQL Server 2012 databases (shown alongside in the diagram) are created in an instance with its Reporting Services installed in SharePoint Integrated mode. The SQL Server 2012 instance NJ is installed in this fashion. These databases are repositories for report content including those related to Power Views and Data Alerts. The data sources(extension .rsds) used in creating Power View reports (extension.rdlx) are stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f database and the alerting related information is stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f_Alerting database. Not shown is an Express database that is used by the SharePoint Server for its content, administration, and so on. RS_ADD-IN allows you to create the service. You will use the Power Shell tool to create and manage the service. In order to create Power View reports, the new feature in SSRS 2012, you start off creating a data source in SharePoint library. Because of the RS Service, you can enable Reporting Services features such as Report Builder; and associate BISM file extensions to support connecting to tabular models created in SSDT deployed to Analysis Services Server. 
When Reporting Services is installed in SharePoint Integrated mode, SharePoint Web parts will be available to users that allow them to connect to RS Native mode servers to work with reports on the servers from within SharePoint Site. Native mode The following schematic taken from Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms157231.aspx) shows the major components of a Native mode installation: The image shows clearly the several processors that are called into play before a report is displayed. The following are the elements of this processing: Processing extensions(data, rendering, report processing, and authentication) Designing tools(Report Builder, Report Designer) Display devices(browsers) Windows components that do the scheduling and delivery through extensions(Report Server databases, a SQL Server 2012 database, which store everything connected with reports) For the Reporting Services 2012 enabled in Native mode for this article, the following image shows the ReportServer databases and the Reporting Services Server. A similar server HI was also installed after a malware attack. The Report Server is implemented as a Microsoft Windows service called Report Server Service. SharePoint Integrated mode In SharePoint mode, a Report Server must run within a SharePoint Server (even in a standalone implementation). The Report Server processing, rendering, and management are all from SharePoint application server running the Reporting Services SharePoint shared service. For this to happen, at SQL Server installation time, the SharePoint Integrated mode has to be chosen. The access to reports and related operations in this case are from a SharePoint frontend. The following elements are required for SharePoint mode: SharePoint Foundation 2010 or SharePoint Server 2010 An appropriate version of the Reporting Services add-in for SharePoint products A SharePoint application server with a Reporting Services shared service instance and at least one Reporting Services service application The following diagram taken from Microsoft documentation illustrates the various parts of a SharePoint Integrated environment of Reporting Services. Note that the alerting Web service and Power View need SharePoint Integration. The numbered items and their description shown next are also from the same Microsoft document. Follow the link at the beginning of this section. The architectural details presented previously were taken from Microsoft documentation. Item number in the diagram   Description   1   Web servers or Web Frontends (WFE). The Reporting Services add-in must be installed on each Web server from which you want to utilize the Web application feature such as viewing reports or a Reporting Services management page for tasks such as managing data sources and subscriptions.   2   The add-in installs URL and SOAP endpoints for clients to communicate with application servers through the Reporting Services Proxy.   3   Application servers running a shared service. Scale-out of report processing is managed as part of the SharePoint farm and by adding the service to additional application servers.   4 You can create more than one Reporting Services service application with different configurations, including permissions, e-mail, proxy, and subscriptions.   5   Reports, data sources, and other items are stored in SharePoint content databases.   6   Reporting Services service applications create three databases for the Report Server, temp, and data alerting features. 
Configuration settings that apply to all SSRS service applications are stored in RSReportserver.config file.   When you install Reporting Services in SharePoint Integrated mode, several features that you are used to in Native mode will not be available. Some of them are summarized here from the MSDN site: URL access will work but you will have to access SharePoint URL and not Native mode URL. The Native mode folder hierarchy will not work. Custom Security extensions can be used but you need to use the special purpose security extension meant to be used for SharePoint Integration. You cannot use the Reporting Services Configuration Manager (of the Native mode installation).You should use the SharePoint Central Administration shown in this section (for Reporting Services 2008 and 2008 R2). Report Manager is not the frontend; in this case, you should use SharePoint Application pages. You cannot use Linked Reports, My Reports, and My Subscriptions in SharePoint mode. In SharePoint Integrated mode, you can work with Data Alerts and this is not possible in a Native mode installation. Power View is another thing you can do with SharePoint that is not available for Native mode. To access Power View the browser needs Silverlight installed. While reports with RDL extension are supported in both modes, reports with RDLX are only supported in SharePoint mode. SharePoint user token credentials, AAM Zones for internet facing deployments, SharePoint back and recovery, and ULS log support are only available for SharePoint mode. For the purposes of discussion and exercises in this article, a standalone server deployment is used as shown in the next diagram. It must be remembered that there are various other topologies of deployment possible using more than one computer. For a detailed description please follow the link http://msdn.microsoft.com/en-us/library/bb510781(v=sql.105).aspx. The standalone deployment is the simplest, in that all the components are installed on a single computer representative of the installation used for this article. The following diagram taken from the preceding link illustrates the elements of the standalone deployment: Reporting Services configuration For both modes of installation, information for Reporting Services components is stored in configuration files and the registry. During setup the configuration files are copied to the following locations: Native modeC:Program FilesMicrosoft SQL ServerMSRS11.MSSQLSERVER SharePoint Integrated modeC:Program FilesCommon FilesMicrosoft SharedWeb Server Extensions15WebServicesReporting Follow the link http://msdn.microsoft.com/en-us/library/ms155866.aspx for details. Native mode The Report Server Windows Service is an orchestrated set of applications that run in a single process using a single account with access to a single Report Server database with a set of configuration files listed here: Stored in   Description   Location   RSReportServer.config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   <Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   RSMgrPolicy.config   Stores the code access security policies for Report Manager.   
<Installation directory> Reporting Services ReportManager   Web.config for the Report Server Web Service   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportServer   Web.config for Report Manager   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportManager   ReportingServicesService. exe.config   Stores configuration settings that specify the trace levels and logging options for the Report Server Service.   <Installation directory> Reporting Services ReportServer Bin Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. If you are troubleshooting an installation or configuration problem, you can view these settings to get information about how the Report Server is configured.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup and HKEY_ LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL ServerServices ReportServer   RSReportDesigner.config   Stores configuration settings for Report Designer. For more information follow the link http://msdn.microsoft.com/en-us/library/ms160346.aspx   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   RSPreviewPolicy.config   Stores the code access security policies for the server extensions used during report preview.   C:Program Files Microsoft Visual Studio 10.0 Common7IDE PrivateAssembliesr   First is the RSReportServer configuration file which can be found in the installation directory under Reporting Services. The entries in this file control the feature areas of the three components in the previous image, namely, Report Server Web Service, Report Server Service, Report Manager, and background processing. The ReportServer Configuration file has several sections with which you can modify the following features: General configuration settings URL reservations Authentication Service UI Extensions MapTileServerConfiguration (Microsoft Bing Maps SOAP Services that provides a tile background for map report items in the report) Default configuration file for a Native mode Report Server Default configuration file for a SharePoint mode Report Server The three areas previously mentioned (Report Server Web Service, Report Server Service, and Report Manager) all run in separate application domains and you can turn on/off elements that you may or may not need so as to improve security by reducing the surface area for attacks. Some functionality works for all the three components such as memory management and process health. For example, in the reporting server Kailua in this article, the service name is ReportServer$KAILUA. This service has no other dependencies. In fact, you can access the help file for this service when you look at Windows Services in the Control Panels shown. In three of the tabbed pages of this window you can access contextual help. SharePoint Integrated mode The following table taken from Microsoft documentation describes the configuration files used in the SharePoint mode Report Server. Configuration settings are stored in SharePoint Service application databases. Stored in   Description   Location   RSReportServer. config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   
<Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   Web.config for the Report Server Web Service Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. Also stores information about each Reporting Services service application.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup   For example instance ID: MSSQL11.MSSQLSERVER and HKEY_LOCAL_MACHINE SOFTWAREMicrosoft Microsoft SQL Server Reporting Services Service Applications   RSReportDesigner. config   Stores configuration settings for Report Designer.   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   Hands-on exercise 3.1 – modifying the configuration file in Native mode We can make changes to the rsreportserver.config file if changes are required or some tuning has to be done. For example, you may need to change, to accommodate a different e-mail, change authentication, and so on. This is an XML file that can be edited in Notepad.exe (you can also use an XML Editor or Visual Studio). You need to start Notepad with administrator privileges. Turn on/off the Report Server Web Service In this exercise, we will modify the configuration file to turn on/off the Report Server Web Service. Perform the following steps: Start Notepad using Run as Administrator. Open the file at this location (you may use Start Search| for rsreportserver.config) which is located at C:Program FilesMicrosoft SQL ServerMSRS11.KAILUAReporting ServicesReportServerrsreportserver.config. In Edit Find| type in IsWebServiceEnabled. There are two values True/False. If you want to turn off, change TRUE to FALSE. The default is TRUE.Here is a section of the file reproduced: <Service> <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> <PollingInterval>10</PollingInterval> <WindowsServiceUseFileShareStorage>False </WindowsServiceUseFileShareStorage> <MemorySafetyMargin>80</MemorySafetyMargin> <MemoryThreshold>90</MemoryThreshold> <RecycleTime>720</RecycleTime> <MaxAppDomainUnloadTime>30</MaxAppDomainUnloadTime> <MaxQueueThreads>0</MaxQueueThreads> <UrlRoot> </UrlRoot> <UnattendedExecutionAccount> <UserName></UserName> <Password></Password> <Domain></Domain> </UnattendedExecutionAccount> <PolicyLevel>rssrvpolicy.config</PolicyLevel> <IsWebServiceEnabled>True</IsWebServiceEnabled> <IsReportManagerEnabled>True</IsReportManagerEnabled> <FileShareStorageLocation> <Path> </Path> </FileShareStorageLocation> </Service> Save the file to apply changes. Turn on/off the scheduled events and delivery This changes the report processing and delivery. Make changes in the rsreportserver.config file in the following section of <Service/>: <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> The default value for all of the three is TRUE. You can make it FALSE and save the file to apply changes. This can be carried out modifying FACET in SQL Server Management Studio (SSMS), but presently this is not available. Turn on/off the Report Manager Report Manager can be turned off or on by making changes to the configuration file. 
Make a change to the following section in the <Service/>: <IsReportManagerEnabled>True</IsReportManagerEnabled> Again, this change can be made using the Reporting Services Server in its FACET. To change this make sure you launch SQL Server Management Studio as Administrator. In the following sections use of SSMS via Facets is described. Hands-on exercise 3.2 – turn the Reporting Service on/off in SSMS The following are the steps to turn the Reporting Service on/off in SSMS: Connect to Reporting Services_KAILUA in SQL Server Management Studio as the Administrator. Choose HODENTEKWIN7KAILUA under Reporting Services. Click on OK. Right-click on HODENTEKWIN7KAILUA (Report Server 11.0.22180 –HodentekWin7mysorian). Click on Facets to open the following properties page Click on the handle and set it to True or False and click on OK. The default is True. It should be possible to turn Windows Integrated security on or off by using SQL Server Management Studio. However, the Reporting Services Server properties are disabled.
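If you would rather script the IsWebServiceEnabled change from Hands-on exercise 3.1 than edit the file in Notepad, a small DOM-based utility can flip the flag. This is only a sketch, not a supported administration tool: the installation path below is the one used in this article and must be adjusted to your instance, the program must be run with administrator privileges, and you should back up rsreportserver.config first:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ToggleWebService {
    public static void main(String[] args) throws Exception {
        // Adjust MSRS11.KAILUA to match your own named instance
        File config = new File("C:\\Program Files\\Microsoft SQL Server\\"
                + "MSRS11.KAILUA\\Reporting Services\\ReportServer\\rsreportserver.config");

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(config);

        // The <Service> section contains a single IsWebServiceEnabled element
        NodeList nodes = doc.getElementsByTagName("IsWebServiceEnabled");
        if (nodes.getLength() == 1) {
            String current = nodes.item(0).getTextContent().trim();
            nodes.item(0).setTextContent("True".equalsIgnoreCase(current) ? "False" : "True");
        }

        // Write the modified document back in place
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(config));
        System.out.println("IsWebServiceEnabled toggled in " + config.getPath());
    }
}

If the change does not appear to take effect, recycle the Report Server service so that the configuration file is re-read.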

Understanding MapReduce

Packt
07 Aug 2013
18 min read
(For more resources related to this topic, see here.) Key/value pairs Here we will explain why some operations process and provide the output in terms of key/value pair. What it mean Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable. For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date). In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following: Does a given key have a mapping in the data set? What are the values associated with a given key? What is the complete set of keys? We will go into Wordcount in detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows: Keys must be unique but values need not be Each value must be associated with a key, but a key could have no values (though not in this particular example) Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results Note that we need to define carefully what we mean by keys being unique here. This does not mean the key occurs only once; in our data set we may see a key occur numerous times and, as we shall see, the MapReduce model has a stage where all values associated with each key are collected together. The uniqueness of keys guarantees that if we collect together every value seen for any given key, the result will be an association from a single instance of the key to every value mapped in such a way, and none will be omitted. Why key/value data? Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop. Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer. We can have a large number of hosts on which we can store and execute data and even use a framework that manages the division of the larger task into smaller chunks, and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be an expert in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. 
MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets. Some real-world examples To become less abstract, let's think of some real-world data that is key/value pair: An address book relates a name (key) to contact information (value) A bank account uses an account number (key) to associate with the account details (value) The index of a book relates a word (key) to the pages on which it occurs (value) On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values) These examples are intentionally broad in scope, to help and encourage you to think that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us. We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce. MapReduce as a series of key/value transformations You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one looking like this: {K1,V1} -> {K2, List<V2>} -> {K3,V3} We are now in a position to understand what this means: The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1. The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle method. The final output of the MapReduce job is another series of key/value pairs, called K3 and V3 These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job. The Hadoop Java API for MapReduce Hadoop underwent a major API change in its 0.20 release, which is the primary interface in the 1.0 version. Though the prior API was certainly functional, the community felt it was unwieldy and unnecessarily complex in some regards. The new API, sometimes generally referred to as context objects, for reasons we'll see later, is the future of Java's MapReduce development. Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have not been ported to the new API, so we will use the old interfaces when we need to examine any of these. The 0.20 MapReduce Java API The 0.20 and above versions of MapReduce API have most of the key classes and interfaces either in the org.apache.hadoop.mapreduce package or its subpackages. In most cases, the implementation of a MapReduce job will provide job-specific subclasses of the Mapper and Reducer base classes found in this package. We'll stick to the commonly used K1/K2/K3/ and so on terminology, though more recently the Hadoop API has, in places, used terms such as KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will stick with K1/K2/K3 as it helps understand the end-to-end data flow. 
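To make that notation concrete before we look at the API classes, here is a small plain-Java sketch of word counting walked through the three stages; no Hadoop is involved, and the class name and sample lines are ours:

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyValueFlow {
    public static void main(String[] args) {
        String[] lines = {"this is a test", "yes this is a test"};

        // Map stage: each (K1 = line number, V1 = line) emits a series of (K2 = word, V2 = 1)
        List<SimpleEntry<String, Integer>> emitted = new ArrayList<SimpleEntry<String, Integer>>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                emitted.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }

        // Shuffle stage: collect the individual pairs into (K2, List<V2>)
        Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
        for (SimpleEntry<String, Integer> pair : emitted) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }

        // Reduce stage: each (K2, List<V2>) becomes one (K3 = word, V3 = count)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int count = 0;
            for (Integer one : entry.getValue()) {
                count += one;
            }
            System.out.println(entry.getKey() + "\t" + count);
        }
    }
}

In a real job, Hadoop performs the shuffle stage for us across the cluster; the point of the sketch is only to show how the {K1,V1} pairs become {K2, List<V2>} groups and finally {K3,V3} results.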
The Mapper class
This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows:
class Mapper<K1, V1, K2, V2>
{
void map(K1 key, V1 value, Mapper.Context context)
throws IOException, InterruptedException
{..}
}
Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair in its parameters. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method. Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records and the framework is responsible for all the work required to turn an enormous data set into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need of having to write file parsers for any but custom file types. There are three additional methods that sometimes may be required to be overridden.
protected void setup(Mapper.Context context) throws IOException, InterruptedException
This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException
This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.
protected void run(Mapper.Context context) throws IOException, InterruptedException
This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
The Reducer class
The Reducer base class works very similarly to the Mapper class, and usually requires only subclasses to override a single reduce method. Here is the cut-down class definition:
public class Reducer<K2, V2, K3, V3>
{
void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
throws IOException, InterruptedException
{..}
}
Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output) while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method. This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:
protected void setup(Reducer.Context context) throws IOException, InterruptedException
This method is called once before any keys/lists of values are presented to the reduce method. The default implementation does nothing.
protected void cleanup(Reducer.Context context) throws IOException, InterruptedException
This method is called once after all keys/lists of values have been presented to the reduce method. The default implementation does nothing.
protected void run( Reducer.Context context) throws IOException, InterruptedException This method controls the overall flow of processing the task within JVM. The default implementation calls the setup method before repeatedly calling the reduce method for as many key/values provided to the Reducer class, and then finally calls the cleanup method. The Driver class Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is no default parent Driver class as a subclass; the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though you should be able to work out generally what each is doing: public class ExampleDriver { ... public static void main(String[] args) throws Exception { // Create a Configuration object that is used to set other options Configuration conf = new Configuration() ; // Create the object representing the job Job job = new Job(conf, "ExampleJob") ; // Set the name of the main class in the job jarfile job.setJarByClass(ExampleDriver.class) ; // Set the mapper class job.setMapperClass(ExampleMapper.class) ; // Set the reducer class job.setReducerClass(ExampleReducer.class) ; // Set the types for the final output key and value job.setOutputKeyClass(Text.class) ; job.setOutputValueClass(IntWritable.class) ; // Set input and output file paths FileInputFormat.addInputPath(job, new Path(args[0])) ; FileOutputFormat.setOutputPath(job, new Path(args[1])) // Execute the job and wait for it to complete System.exit(job.waitForCompletion(true) ? 0 : 1); } }} Given our previous talk of jobs, it is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations. Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often. There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our WordCount example. There are multiple ways of expressing the format within text files in addition to particularly optimized binary formats. A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution. Writing MapReduce programs We have been using and talking about WordCount for quite some time now; let's actually write an implementation, compile, and run it, and then explore some modifications. 
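One aside on the driver pattern just shown before we move to the hands-on steps: many production jobs write the driver against Hadoop's Tool and ToolRunner helpers so that generic options such as -D configuration overrides are parsed automatically. The following is a hedged variant of the earlier ExampleDriver; ExampleMapper and ExampleReducer are stand-ins for whatever job-specific classes you supply, so this skeleton does not compile without them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleToolDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner
        Job job = new Job(getConf(), "ExampleJob");
        job.setJarByClass(ExampleToolDriver.class);
        job.setMapperClass(ExampleMapper.class);      // hypothetical mapper class
        job.setReducerClass(ExampleReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before handing the rest to run()
        System.exit(ToolRunner.run(new Configuration(), new ExampleToolDriver(), args));
    }
}

You would then launch it with hadoop jar exactly as before, optionally passing -D settings ahead of the input and output paths.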
Time for action – setting up the classpath To compile any Hadoop-related code, we will need to refer to the standard Hadoop-bundled classes. Add the Hadoop-1.0.4.core.jar file from the distribution to the Java classpath as follows: $ export CLASSPATH=.:${HADOOP_HOME}/Hadoop-1.0.4.core.jar:${CLASSPATH} What just happened? This adds the Hadoop-1.0.4.core.jar file explicitly to the classpath alongside the current directory and the previous contents of the CLASSPATH environment variable. Once again, it would be good to put this in your shell startup file or a standalone file to be sourced. We will later need to also have many of the supplied third-party libraries that come with Hadoop on our classpath, and there is a shortcut to do this. For now, the explicit addition of the core JAR file will suffice. Time for action – implementing WordCount We will explore our own Java implementation by performing the following steps: Enter the following code into the WordCount1.java file: Import java.io.* ; import org.apache.hadoop.conf.Configuration ; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount1 { public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String[] words = value.toString().split(" ") ; for (String str: words) { word.set(str); context.write(word, one); } } } public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int total = 0; for (IntWritable val : values) { total++ ; } context.write(key, new IntWritable(total)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount1.class); job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Now compile it by executing the following command: $ javac WordCount1.java What just happened? This is our first complete MapReduce job. Look at the structure and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as inner classes. We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now let's look at the preceding code and think of how it realizes the key/value transformations we talked about earlier. The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the line number in the file and the value is the text of that line. 
In reality, you may never actually see a mapper that uses that line number key, but it is provided. The mapper is executed once for each line of text in the input source, and each time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value pair of the form <word, 1>. These are our K2/V2 values. We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this, which we'll not describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of <word, count>. These are our K3/V3 values. Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class takes Object and Text as input and gives Text and IntWritable as output, and the WordCountReducer class takes Text and IntWritable as both input and output. This is quite a common pattern, where the map method emits a series of intermediate data pairs on which the reducer then performs aggregation. The driver is more meaningful here, as we have real values for the parameters. We use the arguments passed to the class to specify the input and output locations.
Time for action – building a JAR file
Before we run our job in Hadoop, we must collect the required class files into a single JAR file that we will submit to the system. Create a JAR file from the generated class files:
$ jar cvf wc1.jar WordCount1*class
What just happened?
We must always package our class files into a JAR file before submitting to Hadoop, be it locally or on Elastic MapReduce. Be careful with the JAR command and file paths. If you include class files from a subdirectory in a JAR file, the classes may not be stored with the path you expect. This is especially common when using a catch-all classes directory into which everything gets compiled. It may be useful to write a script to change into the directory, package the required files into a JAR file, and move the JAR file to the required location.
Time for action – running WordCount on a local Hadoop cluster
Now that we have generated the class files and collected them into a JAR file, we can run the application by performing the following steps:
Submit the new JAR file to Hadoop for execution:
$ hadoop jar wc1.jar WordCount1 test.txt output
Check the output file; it should be as follows:
$ hadoop fs -cat output/part-r-00000
This 1
yes 1
a 1
is 2
test 1
this 1
What just happened?
This is the first time we have used the hadoop jar command with our own code. There are four arguments:
The name of the JAR file.
The name of the driver class within the JAR file.
The location, on HDFS, of the input file (a relative reference to the /user/hadoop home folder, in this case).
The desired location of the output folder (again, a relative path).
The name of the driver class is only required if a main class has not (as in this case) been specified within the JAR file manifest.
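As a first modification to explore, consider adding a combiner, which pre-aggregates map output on each node before the shuffle. Note that the WordCountReducer shown earlier increments a counter for each element rather than summing the values, which is only correct when every value is 1; once a combiner has produced partial counts, the reducer must sum them. A hedged, combiner-safe variant (the class name is ours) looks like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total += val.get();   // sum the values instead of counting the elements
        }
        context.write(key, new IntWritable(total));
    }
}

In the WordCount1 driver you would then register the class for both roles:

job.setReducerClass(WordCountSumReducer.class);
job.setCombinerClass(WordCountSumReducer.class);

The output is unchanged, but on larger inputs far less intermediate data crosses the network during the shuffle.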

So, what is MongoDB?

Packt
02 Aug 2013
6 min read
(For more resources related to this topic, see here.) What is a document? While it may vary for various implementations of different Document Oriented Databases available, as far as MongoDB is concerned it is a BSON document, which stands for Binary JSON. JSON (JavaScript Object Notation) is an open standard developed for human readable data exchange. Though a thorough knowledge of JSON is not really important to understand MongoDB, for keen readers the URL to its RFC is http://tools.ietf.org/html/rfc4627. Also, the BSON specification can be found at http://bsonspec.org/. Since MongoDB stores the data as BSON documents, it is a Document Oriented Database. What does a document look like? Consider the following example where we represent a person using JSON: {"firstName":"Jack","secondName":"Jones","age":30,"phoneNumbers":[{fixedLine:"1234"},{mobile:"5678"}],"residentialAddress":{lineOne:"…",lineTwo:"…",city:"…",state:"…",zip:"…",country:"…"}} As we can see, a JSON document always starts and ends with curly braces and has all the content within these braces. Multiple fields and values are separated by commas, with a field name always being a string value and the value being of any type ranging from string, numbers, date, array, another JSON document, and so on. For example in "firstName":"Jack", the firstName is the name of the field whereas Jack is the value of the field. Need for MongoDB Many of you would probably be wondering why we need another database when we already have good old relational databases. We will try to see a few drivers from its introduction back in 2009. Relational databases are extremely rich in features. But these features don't come for free; there is a price to pay and it is done by compromising on the scalability and flexibility. Let us see these one by one. Scalability It is a factor used to measure the ease with which a system can accommodate the growing amount of work or data. There are two ways in which you can scale your system: scale up, also known as scale vertically or scale out, also known as scale horizontally. Vertical scalability can simply be put up as an approach where we say "Need more processing capabilities? Upgrade to a bigger machine with more cores and memory". Unfortunately, with this approach we hit a wall as it is expensive and technically we cannot upgrade the hardware beyond a certain level. You are then left with an option to optimize your application, which might not be a very feasible approach for some systems which are running in production for years. On the other hand, Horizontal scalability can be described as an approach where we say "Need more processing capabilities? Simple, just add more servers and multiply the processing capabilities". Theoretically this approach gives us unlimited processing power but we have more challenges in practice. For many machines to work together, there would be a communication overhead between them and the probability of any one of these machines being down at a given point of time is much higher. MongoDB enables us to scale horizontally easily, and at the same time addresses the problems related to scaling horizontally to a great extent. The end result is that it is very easy to scale MongoDB with increasing data as compared to relational databases. Ease of development MongoDB doesn't have the concept of creation of schema as we have in relational databases. The document that we just saw can have an arbitrary structure when we store them in the database. 
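For instance, here is a hedged sketch, using the legacy 2.x MongoDB Java driver, that stores a trimmed-down version of the person document shown earlier; the database and collection names are ours, and notice that nothing resembling a CREATE TABLE statement is needed first:

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class PersonExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("test");                     // created lazily on first use
        DBCollection people = db.getCollection("people"); // likewise, no schema to declare

        // The person document from the JSON example, built field by field
        DBObject person = new BasicDBObject("firstName", "Jack")
                .append("secondName", "Jones")
                .append("age", 30)
                .append("phoneNumbers", Arrays.asList(
                        new BasicDBObject("fixedLine", "1234"),
                        new BasicDBObject("mobile", "5678")));

        people.insert(person);
        System.out.println(people.findOne(new BasicDBObject("firstName", "Jack")));

        client.close();
    }
}

The findOne call at the end simply confirms that the document round-trips; in the newer 3.x drivers the same idea is expressed with MongoDatabase, MongoCollection, and Document instead.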
This feature makes it very easy for us to model and store relatively unstructured/complex data, which is difficult to model in a relational database; for example, the product catalogue of an e-commerce application containing various items, each with different attributes. Also, it is more natural to use JSON in application development than tables from the relational world.
Ok, it looks good, but what is the catch? Where not to use MongoDB?
To achieve the goal of letting MongoDB scale out easily, it had to do away with features like joins and multi-document/distributed transactions. Now, you must be wondering whether it is pretty useless, as we have taken away two of the most important features of relational databases. However, mitigating the need for joins is one of the reasons why MongoDB is document oriented. If you look at the preceding JSON document for the person, we have the address and the phone number as part of the document. In a relational database, these would have been in separate tables and retrieved by joining those tables together. Distributed/multi-document transactions would inhibit MongoDB from scaling out and hence are not supported, nor is there a way to mitigate this. MongoDB is still atomic, but atomicity for inserts and updates is guaranteed at the document level and not across multiple documents. Hence, MongoDB is not a good fit for scenarios where complex transactions are needed, such as OLTP banking applications. This is an area where good old relational databases still rule.
To conclude, let us take a look at the following image. This graph is pretty interesting and was presented by Dwight Merriman, Founder and CEO of 10gen (the MongoDB company), in one of his online courses. As we can see, on one side we have products like Memcached, which are very low on functionality but high on scalability and performance. On the other end we have RDBMS (Relational Database Management Systems), which are very rich in features but not that scalable. According to the research done while developing MongoDB, this graph is not linear and there is a point after which scalability and performance fall steeply on adding more features to the product. MongoDB sits at this point, where it gives the maximum possible features without compromising too much on scalability and performance.
Summary
In this article, we saw the features offered by MongoDB, what a document looks like, and how it is better than relational databases.
Resources for Article:
Further resources on this subject:
Building a Chat Application [Article]
Ruby with MongoDB for Web Development [Article]
Comparative Study of NoSQL Products [Article]

Using Oracle GoldenGate

Packt
02 Aug 2013
15 min read
(For more resources related to this topic, see here.) Creating one-way replication (Simple) Here we'll be utilizing the demo scripts included in the OGG software distribution to implement a basic homogenous (Oracle-to-Oracle) replication. Getting ready You need to ensure your Oracle database is in archivelog mode. If your database is not in archivelog mode, you won't be able to recover your database due to media corruption or user errors. How to do it... The steps for creating one-way replication are as follows: Check whether supplemental logging is enabled on your source database using the following command: SQL> select supplemental_log_data_min from v$database; The output of the preceding command will be as follows: SUPPLEME-----------------NO Enable supplemental logging using the following command: SQL> alter database add supplemental log data;SQL> select supplemental_log_data_min from v$database; The output of the preceding command will be as follows: SUPPLEME-----------------YES Let's run the demo script to create a couple of tables in the scott schema. You need to know the scott schema password, which is tiger by default. We do it using following command: $ cd /u01/app/oracle/gg$ ./ggsci$ sqlpus scottEnter password:SQL> @demo_ora_create.sql The output of the preceding command will be as follows: DROP TABLE tcustmer*ERROR at line 1:ORA-00942: table or view does not existTable created.DROP TABLE tcustord*ERROR at line 1:ORA-00942: table or view does not existTable created. You must add the checkpoint table, do it as follows: $ cd /u01/app/oracle/gg$ vi GLOBALS Add the following entry to the file: CheckPointTable ogg.chkpt Save the file and exit. Next create the checkpoint table using the following command: $ ./ggsciGGSCI> add checkpointtableGGSCI> info checkpointtable The output of the preceding command will be as follows: No checkpoint table specified, using GLOBALS specification (ogg.chkpt)...Checkpoint table ogg.chkpt created 2012-10-31 12:39:38. Set up the MANAGER parameter file using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi mgr.prm Add the following lines to the file: PORT 7809DYNAMICPORTLIST 7810-7849AUTORESTART er *, RETRIES 6, WAITMINUTES 1, RESETMINUTES 10PURGEOLDEXTRACTS /u01/app/oracle/gg/dirdat/*, USECHECKPOINTS,MINKEEPDAYS 2 Save the file and exit. Start the manager using the following command: $ cd /u01/app/oracle/gg$ ggsciGGSCI> start mgrGGSCI> info mgr The output of the preceding command will be as follows: GGSCI> info allProgram Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNING Create a TNS entry in the database home so that the extract can connect to the Automatic Storage Management (ASM) instance, using the following command: $ cd $ORACLE_HOME/network/admin$ vi tnsnames.ora Add the following TNS entry: ASMGG =(DESCRIPTION =(ADDRESS =(PROTOCOL = IPC)(key=EXTPROC1521))(CONNECT_DATA=(SID=+ASM))) Save the file and exit. Create a user asmgg with the sysdba role in the ASM instance. Connect to the ASM instance as sys user using the following command: $ sqlplus sys/<password>@asmgg as sysasm The output of the preceding command will be as follows: SQL*Plus: Release 11.2.0.3.0 Production on Thu Nov 15 14:24:202012Copyright (c) 1982, 2011, Oracle. All rights reserved.Connected to:Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bitProductionWith the Automatic Storage Management option The user is created using the following command: SQL> create user asmgg identified by asmgg ; We will get the following output message: User created. 
Provide the sysdba role to the user ASMGG using the following command: SQL> grant sysdba to asmgg ; We will get the following output message: Grant succeeded. Let's add supplemental logging to the source tables using the following commands: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> add trandata scott.tcustmer The output will be as follows: Logging of supplemental redo data enabled for table SCOTT.TCUSTMER. Then type the following command: GGSCI> add trandata scott.tcustord The output message will be as follows: Logging of supplemental redo data enabled for table SCOTT.TCUSTORD. The next command to be executed is: GGSCI> info trandata scott.tcustmer The output message will be as follows: Logging of supplemental redo log data is disabled for table OGG.TCUSTMER. The next command to be used is: GGSCI> info trandata scott.tcustord The output will be as follows: Logging of supplemental redo log data is disabled for table OGG.TCUSTORD. Create the extract parameter file for data capture using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi ex01sand.prm Add the following lines to the file: EXTRACT ex01sandSETENV (ORACLE_SID="SRC100")SETENV (ORACLE_HOME="/u01/app/oracle/product/11.2.0/db_1")SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")USERID ogg, PASSWORD oggTRANLOGOPTIONS EXCLUDEUSER oggTRANLOGOPTIONS ASMUSER asmgg@ASMGG ASMPASSWORD asmgg-- Trail File location locallyEXTTRAIL /u01/app/oracle/gg/dirdat/prDISCARDFILE /u01/app/oracle/gg/dirrpt/ex01sand.dsc, PURGEDISCARDROLLOVER AT 01:00 ON SUNDAYTABLE SCOTT.TCUSTMER ;TABLE SCOTT.TCUSTORD ; Save the file and exit. Let's add the Extract process and start it. We do it by using the following command: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> add extract ex01sand tranlog begin now The output of the preceding command will be as follows: EXTRACT added. The following command adds the location of the trail files and size for each trail created: GGSCI> add exttrail /u01/app/oracle/gg/dirdat/pr extract ex01sandmegabytes 2 The output of the preceding command will be as follows: EXTTRAIL added.GGSCI> start ex01sandSending START request to MANAGER ...EXTRACT EX01SAND startingGGSCI> info allProgram Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGEXTRACT RUNNING EX01SAND 00:00:00 00:00:06 Next we'll create the data pump parameter file using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi pp01sand.prm Add the following lines to the file: EXTRACT pp01sandPASSTHRURMTHOST hostb MGRPORT 7820RMTTRAIL /u01/app/oracle/goldengate/dirdat/rpDISCARDFILE /u01/app/oracle/gg/dirrpt/pp01sand.dsc, PURGE-- Tables for transportTABLE SCOTT.TCUSTMER ;TABLE SCOTT.TCUSTORD ; Save the file and exit. Add the data pump process and final configuration on the source side as follows: GGSCI> add extract pp01sand exttrailsource /u01/app/oracle/gg/dirdat/pr The output of the preceding command will be as follows: EXTRACT added. The following command points the pump to drop the trail files to the remote location: GGSCI> add rmttrail /u01/app/oracle/goldengate/dirdat/rp extractpp01sand megabytes 2 The output of the preceding command will be as follows: RMTTRAIL added Then we execute the following command: GGSCI> info all The output of the preceding command will be as follows: Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGEXTRACT RUNNING EXPR610 00:00:00 00:00:05EXTRACT STOPPED PP01SAND 00:00:00 00:00:55 We're not going to start the data pump (pump) at this point since the manager does not yet exist at the target site. 
Perform the following actions on the target server. We've now completed most of our steps on the source system. We'll have to come back to the source server to start the pump a little later. Now, we'll move on to our target server where we'll have to set up the Replicat process in order to receive and apply the changes received from the source database. Perform the following actions on the target database: Create tables on the target host using the following command: $ cd /u01/app/oracle/goldengate$ sqlplus scott/tigerSQL> @demo_ora_create.sql The output of the preceding command will be as follows: DROP TABLE tcustmer*ERROR at line 1:ORA-00942: table or view does not existTable created.DROP TABLE tcustord*ERROR at line 1:ORA-00942: table or view does not existTable created. Let's add the checkpoint table as a global parameter using the following command: $ cd /u01/app/oracle/goldengate$ vi GLOBALS Add the following line to the file: CheckPointTable ogg.chkpt Save the file and exit. Create the checkpoint table using the following command: $ cd ..$ ./ggsciGGSCI> dblogin userid ogg password oggGGSCI> add checkpointtable Then execute the following command: $ cd /u01/app/oracle/goldengate/dirprm$ vi mgr.prm Add the following lines to the file: PORT 7820DYNAMICPORTLIST 7821-7849AUTORESTART er *, RETRIES 6, WAITMINUTES 1, RESETMINUTES 10PURGEOLDEXTRACTS /u01/app/oracle/goldengate/dirdat/*,USECHECKPOINTS, MINKEEPFILES 2 Save the file and exit Start the manager using the following command: $ cd /u01/app/oracle/goldengate$ ./ggsciGGSCI> start mgrGGSCI> info mgrGGSCI> info all We will get the following output: Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNING Edit the parameter file using the following command, now we're ready to create the replicat parameter file: $ cd /u01/app/oracle/goldengate/dirprm$ vi re01sand.prm Add the following lines to the file: REPLICAT re01sandSETENV (ORACLE_SID="TRG101")SETENV (ORACLE_HOME="/u01/app/oracle/product/11.1.0/db_1")SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")USERID ogg PASSWORD oggDISCARDFILE /u01/app/oracle/goldengate/dirrpt/re01sand.dsc, APPENDDISCARDROLLOVER at 01:00ReportCount Every 30 Minutes, RateREPORTROLLOVER at 01:30DBOPTIONS DEFERREFCONSTASSUMETARGETDEFSMAP SCOTT.TCUSTMER , TARGET SCOTT.TCUSTMER ;MAP SCOTT Save the file and exit. We now add and start the Replicat process using the following commands: $ cd .. The following extrail location must match exactly as in the pump's rmttrail location on the source server: $ ./ggsciGGSCI> add replicat re01sand exttrail /u01/app/oracle/goldengate/dirdat/rp checkpointtable ogg.chkptGGSCI> start re01sand The output of the preceding command will be as follows: Sending START request to MANAGER ...REPLICAT RE01SAND starting Then we execute the following command: GGSCI> info all The output of the preceding command will be as follows:` Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGREPLICAT RUNNING RE01SAND 00:00:00 00:00:01 Let's go back to the source host and start the pump using the following command: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> start pp01sand The output of the preceding command will be as follows: Sending START request to MANAGER ...EXTRACT PP01SAND starting Next we use the demo insert script to add rows to source tables that should replicate to the target tables. 
We can do it using the following commands:

$ cd /u01/app/oracle/gg
$ sqlplus scott/tiger
SQL> @demo_ora_insert

The output of the preceding command will be as follows:

1 row created.
1 row created.
1 row created.
1 row created.
Commit complete.

To verify that the 4 rows just created have been captured at the source, use the following commands:

$ ./ggsci
GGSCI> stats ex01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to EXTRACT EX01SAND ...
Start of Statistics at 2012-11-30 20:22:37.
Output to /u01/app/oracle/gg/dirdat/pr:
… truncated for brevity
*** Latest statistics since 2012-11-30 20:17:38 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00

To verify whether the pump has shipped the changes to the target server, use the following command:

GGSCI> stats pp01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to EXTRACT PP01SAND ...
Start of Statistics at 2012-11-30 20:24:56.
Output to /u01/app/oracle/goldengate/dirdat/rp:
Cumulative totals for specified table(s):
… cut for brevity
*** Latest statistics since 2012-11-30 20:18:14 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00
End of Statistics.

Finally, to verify that the changes have been applied at the target, the next command is performed on the target server:

$ ./ggsci
GGSCI> stats re01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to REPLICAT RE01SAND ...
Start of Statistics at 2012-11-30 20:28:01.
Cumulative totals for specified table(s):
...
*** Latest statistics since 2012-11-30 20:18:20 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00
End of Statistics.

How it works...

Supplemental logging must be turned on at the database level and subsequently at the table level as well, for those tables you would like to replicate. For one-way replication, this is done on the source tables. There isn't a need to turn on supplemental logging at the target site if the target site, in turn, is not a source to other targets or to itself.

A database user ogg is created in order to administer the OGG schema. This user is used solely for the purpose of administering OGG in the database.

Checkpoints are needed by both the source and target servers; these are structures that persist to disk as a known position in the trail file. You would restart from these after an expected or unexpected shutdown of the OGG processes.

The PORT parameter in the mgr.prm file specifies the port to which the MGR should bind and start listening for connection requests. If the manager is down, then connections can't be established and you'll receive TCP connection errors. The only required parameter is the port number itself. Also, the PURGEOLDEXTRACTS parameter is a nice way to keep your trail files to a minimum, so that they don't accumulate indefinitely and eventually run out of space in your filesystem. In this example, we're asking the manager to purge trail files while keeping at least the two most recent files on disk.

If your Oracle database is using an ASM instance, then OGG needs to establish a connection to the ASM instance in order to read the online-redo logs. You must ensure that you either use the sys schema or create a user (such as asmgg) with SYSDBA privileges for authentication.
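Coming back to the first point above, supplemental logging at the database level can be confirmed and, if necessary, enabled directly from SQL*Plus. The following is only a sketch, not part of the original recipe; run it as a suitably privileged user on the source database:

-- Check whether minimal supplemental logging is already enabled
SELECT supplemental_log_data_min FROM v$database;

-- If the query reports NO, enable it at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;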
Since we need supplemental logging at the table level, add trandata does precisely this.

Now we'll focus on some of the EXTRACT (ex01sand) data capture parameters. For one thing, you'll notice that we need to supply the extract with credentials to the database and the ASM instance in order to scan the online-redo logs for committed transactions. The following lines tell OGG to exclude the user ogg from capture. The second TRANLOGOPTIONS is how the extract authenticates to the ASM instance.

USERID ogg, PASSWORD ogg
TRANLOGOPTIONS EXCLUDEUSER ogg
TRANLOGOPTIONS ASMUSER asmgg@ASMGG ASMPASSWORD asmgg

If you're using a later version of Oracle 10gR2, or Oracle 11.2.0.2 and later, you could use the newer ASM API option TRANLOGOPTIONS DBLOGREADER rather than ASMUSER. The API uses the database connection, rather than a connection to the ASM instance, to read the online-redo logs.

The following two lines in the extract tell it where to place the trail files, with a prefix of pr followed by 6 digits that increment each time a file rolls over to the next generation. The DISCARDFILE by convention has the same name as the extract, but with the extension .dsc for discard. If, for any reason, OGG can't capture a transaction, it will write the text and SQL to this file for later investigation.

EXTTRAIL /u01/app/oracle/gg/dirdat/pr
DISCARDFILE /u01/app/oracle/gg/dirrpt/ex01sand.dsc, PURGE

Tables or schemas are captured with the following syntax in the extract file:

TABLE SCOTT.TCUSTMER ;
TABLE SCOTT.TCUSTORD ;

The specification can vary and use wildcards as well. Say you want to capture the entire schema; you could specify this as TABLE SCOTT.* ;.

In the following code, the first command adds the extract with the option tranlog begin now, telling OGG to start capturing changes from the online-redo logs as of now. The second command tells the extract where to store the trail files, with a size not exceeding 2 MB each.

GGSCI> add extract ex01sand tranlog begin now
GGSCI> add exttrail /u01/app/oracle/gg/dirdat/pr extract ex01sand megabytes 2

Now, the PUMP (data pump; pp01sand) is an optional, but highly recommended, extract whose sole purpose is to perform all of the TCP/IP activity; for example, transporting the trail files to the target site. This is beneficial because it relieves the capture process of any TCP/IP work. The parameters in the following snippet tell the pump to send the data as is with the PASSTHRU parameter. This is the optimal and preferred method if there isn't any data transformation along the way. The RMTHOST parameter specifies the destination host and the port on which the remote manager is listening, for example, port 7820. If the manager is not running on that port at the target, the destination host will refuse the connection; that is why we did not start the pump earlier during our work on the source host.

PASSTHRU
RMTHOST hostb MGRPORT 7820
RMTTRAIL /u01/app/oracle/goldengate/dirdat/rp

The RMTTRAIL parameter specifies where the trail files will be stored at the remote host, with a prefix of rp followed by a 6-digit number that increases sequentially as the files roll over once the specified size has been reached. Finally, at the destination host, hostb, the Replicat process (re01sand) is the applier where the SQL is replayed in the target database.
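Pulling the schema wildcard and the DBLOGREADER option together, an alternative version of the capture parameter file might look like the following sketch. It is not part of the recipe: the group name ex02sand, the px trail prefix, and the discard file name are invented for illustration, and it assumes Oracle 11.2.0.2 or later so that DBLOGREADER can replace the ASM connection.

EXTRACT ex02sand
SETENV (ORACLE_SID="SRC100")
USERID ogg, PASSWORD ogg
-- Don't capture OGG's own work
TRANLOGOPTIONS EXCLUDEUSER ogg
-- Read the online-redo logs through the database connection instead of ASM
TRANLOGOPTIONS DBLOGREADER
EXTTRAIL /u01/app/oracle/gg/dirdat/px
DISCARDFILE /u01/app/oracle/gg/dirrpt/ex02sand.dsc, PURGE
-- Wildcard: capture every table in the SCOTT schema
TABLE SCOTT.* ;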
The following two lines in the Replicat parameter file (re01sand.prm) specify how the Replicat maps source data to target tables as it arrives by way of the trail files:

MAP SCOTT.TCUSTMER , TARGET SCOTT.TCUSTMER ;
MAP SCOTT.TCUSTORD , TARGET SCOTT.TCUSTORD ;

The target tables don't necessarily have to use the same schema name as in the preceding example; the data could have been applied to a different schema altogether if that was the requirement.

Summary

In this article we learned about the creation of one-way replication using Oracle GoldenGate.

Resources for Article:

Further resources on this subject:

Oracle GoldenGate 11g: Configuration for High Availability [Article]

Getting Started with Oracle GoldenGate [Article]

Oracle GoldenGate: Considerations for Designing a Solution [Article]

Making a simple cURL request (Simple)

Packt
01 Aug 2013
5 min read
(For more resources related to this topic, see here.)

Getting ready

In this article we will use cURL to request and download a web page from a server.

How to do it...

Enter the following code into a new PHP project:

<?php
// Function to make a GET request using cURL
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session
    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session
    return $results; // Return the results
}

$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');
echo $packtPage;
?>

Save the project as 2-curl-request.php (ensure you use the .php extension!). Execute the script. Once our script has completed, we will see the source code of http://www.packtpub.com/oop-php-5/book displayed on the screen.

How it works...

Let's look at how we performed the previously defined steps:

The first line, <?php, and the last line, ?>, indicate where our PHP code block will begin and end. All the PHP code should appear between these two tags.

Next, we create a function called curlGet(), which accepts a single parameter $url, the URL of the resource to be requested.

Running through the code inside the curlGet() function, we start off by initializing a new cURL session as follows:

$ch = curl_init();

We then set our options for cURL as follows:

curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Tells cURL to return the results of the request (the source code of the target page) as a string.
curl_setopt($ch, CURLOPT_URL, $url); // Here we tell cURL the URL we wish to request; notice that it is the $url variable that we passed into the function as a parameter.

We execute our cURL request, storing the returned string in the $results variable as follows:

$results = curl_exec($ch);

Now that the cURL request has been made and we have the results, we close the cURL session by using the following code:

curl_close($ch);

At the end of the function, we return the $results variable containing our requested page, so it can be used in the rest of our script.

return $results;

After the function is closed, we are able to use it throughout the rest of our script. Later, deciding on the URL we wish to request, http://www.packtpub.com/oop-php-5/book, we execute the function, passing the URL as a parameter and storing the data returned from the function in the $packtPage variable as follows:

$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');

Finally, we echo the contents of the $packtPage variable (the page we requested) to the screen by using the following code:

echo $packtPage;

There's more...

There are a number of different HTTP request methods, which indicate to the server the desired response, or the action to be performed. The request method being used in this article is cURL's default GET request. This tells the server that we would like to retrieve a resource.

Depending on the resource we are requesting, a number of parameters may be passed in the URL. For example, when we perform a search on the Packt Publishing website for a query, say, php, we notice that the URL is http://www.packtpub.com/books?keys=php. This is requesting the resource books (the page that displays search results) and passing a value of php to the keys parameter, indicating that the dynamically generated page should show results for the search query php.

More cURL Options

Of the many cURL options available, only two have been used in our preceding code.
They are CURLOPT_RETURNTRANSFER and CURLOPT_URL. Though we will cover many more throughout the course of this article, some other options to be aware of, which you may wish to try out, are listed below:

CURLOPT_FAILONERROR (TRUE or FALSE): If a response code greater than 400 is returned, cURL will fail silently.

CURLOPT_FOLLOWLOCATION (TRUE or FALSE): If Location: headers are sent by the server, follow the location.

CURLOPT_USERAGENT (a user agent string, for example: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1'): Sending a user agent string in your request informs the target server which client is requesting the resource. Since many servers will only respond to 'legitimate' requests, it is advisable to include one.

CURLOPT_HTTPHEADER (an array containing header information, for example: array('Cache-Control: max-age=0', 'Connection: keep-alive', 'Keep-Alive: 300', 'Accept-Language: en-us,en;q=0.5')): This option is used to send header information with the request, and we will come across use cases for this in later recipes.

A full listing of cURL options can be found on the PHP website at http://php.net/manual/en/function.curl-setopt.php.

The HTTP response code

An HTTP response code is the number that is returned, which corresponds with the result of an HTTP request. Some common response code values are as follows:

200: OK
301: Moved Permanently
400: Bad Request
401: Unauthorized
403: Forbidden
404: Not Found
500: Internal Server Error

It is often useful to have our scrapers respond to different response code values in different ways, for example, letting us know if a web page has moved, is no longer accessible, or we are unauthorized to access a particular page. In this case, we can access the response code of a request using cURL by adding the following line to our function, which will store the response code in the $httpResponse variable:

$httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE);

Summary

This article covers techniques on making a simple cURL request.

Resources for Article:

Further resources on this subject:

A look into the high-level programming operations for the PHP language [Article]

Installing PHP-Nuke [Article]

Creating Your Own Theme—A Wordpress Tutorial [Article]
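As a closing illustration for this recipe, the following sketch pulls several of those pieces together: a variant of curlGet() that sets a user agent, follows redirects, and returns the HTTP response code alongside the page body. The function name curlGetWithStatus() and the structure of the return value are invented for this example rather than taken from the recipe.

<?php
// A variant of curlGet() that also reports the HTTP response code.
function curlGetWithStatus($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Return the page as a string
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // Follow Location: redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1');
    $body = curl_exec($ch);
    $httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE); // Status of the final response
    curl_close($ch);
    return array('code' => $httpResponse, 'body' => $body);
}

$result = curlGetWithStatus('http://www.packtpub.com/oop-php-5/book');
if ($result['code'] == 200) {
    echo $result['body'];
} else {
    echo 'Request failed with HTTP response code ' . $result['code'];
}
?>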

Participating in a business process (Intermediate)

Packt
31 Jul 2013
5 min read
(For more resources related to this topic, see here.)

The hurdles and bottlenecks for financial services from an IT point of view are:

Silos of data
Outdated IT systems, with many applications running on legacy and non-standards-based systems
Business process and reporting systems not in sync with each other
Lack of real-time data visibility
Automated decision making
Ability to change and manage business processes in accordance with changes in business dynamics
Partner management
Customer satisfaction

This is where BPM plays a key role in bridging the gap between key business requirements and these technology and business hurdles. In a real-life scenario, a typical home loan use case would be tied to the Know Your Customer (KYC) regulatory requirement. In India, for example, the Reserve Bank of India (RBI) has issued guidelines that make it mandatory for banks to properly know their customers. RBI mandates that banks collect their customers' proof of identity, recent photographs, and Income Tax PAN. Proof of residence can be a voter card, a driving license, or a passport copy.

Getting ready

We start with the source code from the previous recipe. We will add a re-usable e-mail or SMS notification process. It is always a best practice to create a separate process if it is called multiple times from the same process. This can be a subprocess within the main process itself, or it can be part of the same composite outside the main process.

We will add a new regulatory requirement that allows the customer to add KYC documents such as a photo, proof of address, and an Income Tax PAN copy as attachments that will be checked into the WebCenter Content repository. These checks become part of the customer verification stage before finance approval. We will make KYC a subprocess, with scope for expansion under different scenarios. We will also save the process data to a filesystem or a JMS messaging queue at the end of the loan process. In a banking scenario, this can also be the integration stage for other applications, such as a CRM application.

How to do it…

Let's perform the following steps:

Launch JDeveloper and open the composite.xml of LoanApplicationProcess in the Design view.

Drag-and-drop a new BPMN Process component from the Component Palette. Create the Send Notifications process next to the existing LoanApplicationProcess, and edit the new process. The Send Notifications process will take To e-mail ID, From e-mail ID, Subject, and CC as input parameters and send an e-mail to the given e-mail ID.

Similarly, we will drag-and-drop a File Adapter component from the Component Palette that saves the customer data into a file. We place this component at the end of the LoanApplication process, just before the End activity.

We will use this notification service to notify Verification Officers about the arrival of a new eligible application that needs to be verified.

In the Application Verification Officer stage, we will add a subprocess, KYC, that will be assigned to the loan initiator (James Cooper in our case). This will be preceded by an e-mail notification to the applicant asking for KYC details such as PAN number, scanned photograph, and voter ID, as requested by the Verification Officers.

Now, let us implement Save Loan Application by invoking the File Adapter service. The e-mail notification services are also available out of the box.
How it works…

The outputs of this recipe are re-usable services, such as the notification service, that can be used across multiple service calls. This recipe also demonstrates how to use subprocesses and how to change the process to meet regulatory requirements.

Let's understand the output by walking through our use case scenario:

When the process is initiated, the e-mail notification gets triggered at the appropriate stages of the process. Conan Doyle and John Steinbeck will get the e-mail requesting them to process the application, with the required information about the applicant, along with the link to BPM Workspace.

The KYC task also sends an e-mail to James Cooper, requesting the documents required for the KYC check. James Cooper logs in to the James Bank WebCenter Portal and sees that there is a task assigned to him to upload his KYC details. James Cooper clicks on the task link, submits the required soft copy documents, and gets them checked into the content repository once the form is submitted.

The start-to-end process flow now looks as follows:

Summary

BPM Process Spaces, which is an extension template of BPM, allows process and task views to be exposed to WebCenter Portal. The advantage of having Process Spaces available within the Portal is that users can collaborate with others using out-of-the-box Portal features such as wikis, discussion forums, blogs, and content management. This improves productivity, as the user need not log in to different applications for different purposes; all the required data and information will be available within the Portal environment. It is also possible to expose some WSRP-supported application portlets (for example, HR portlets from PeopleSoft) in a corporate portal environment. All of this adds up to higher visibility of the entire business process and a collaborative way of working in an enterprise business environment.

Resources for Article:

Further resources on this subject:

Managing Oracle Business Intelligence [Article]

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts [Article]

Getting Started with Oracle Information Integration [Article]

Data sources for the Charts

Packt
31 Jul 2013
12 min read
(For more resources related to this topic, see here.)

Spreadsheets

In Spreadsheets, two preparation steps must be addressed in order to use a Spreadsheet as a data source with the Visualization API. The first is to identify the URL location of the Spreadsheet file for the API code. The second step is to set appropriate access to the data held in the Spreadsheet file.

Preparation

The primary method of access for a Spreadsheet behaving as a data source is through a JavaScript-based URL query. The query itself is constructed with the Google Query Language. If the URL request does not include a query, all data source columns and rows are returned in their default order. Querying a Spreadsheet also requires that the Spreadsheet file and the API application security settings are configured appropriately. Proper preparation of a Spreadsheet as a data source involves both setting the appropriate access and locating the file's query URL.

Permissions

In order for a Spreadsheet to return data to the Visualization API properly, access settings on the Spreadsheet file itself must allow view access to users. For a Spreadsheet that allows edits, including form-based additions, permissions must be set to Edit. To set permissions on the Spreadsheet, select the Share button to open up the Sharing settings dialog.

To be sure the data is accessible to the Visualization API, access levels for both the Visualization application and the Spreadsheet must be the same. For instance, if a user has access to the Visualization application but does not have view access to the Spreadsheet, the user will not be able to run the visualization, as the data is more restrictive to that user than the application. The opposite scenario is true as well, but less likely to cause confusion, as a user unable to access the API application is a fairly self-evident problem. All Google applications handle access and permissions similarly. More information on this topic can be found on the Google Apps Support pages.

Google Permissions overview is available at http://support.google.com/drive/bin/answer.py?hl=en&answer=2494886&rd=1.

Get the URL path

At present, acquiring a query-capable URL for a Spreadsheet is not as straightforward a task as one might think. There are several methods by which a URL is generated for sharing purposes, but the URL format needed for a data source query can only be found by creating a gadget in the Spreadsheet. A Google Gadget is simply dynamic, HTML- or JavaScript-based web content that can be embedded in a web page. Google Gadgets also have their own API, and have capabilities beyond Spreadsheets applications.

Information on the Google Gadget API is available at https://developers.google.com/gadgets/.

Initiate gadget creation by selecting the Gadget... option from the Insert item on the menu bar. When the Gadget Settings window appears, select Apply & close from the Gadget Settings dialog. Choose any gadget from the selection window. The purpose of this procedure is simply to retrieve the correct URL for querying. In fact, deleting the gadget as soon as the URL is copied is completely acceptable. In other words, the specific gadget chosen is of no consequence.

Once the gadget has been created, select Get query data source url… from the newly created gadget's drop-down menu. Next, determine and select the range of the Spreadsheet to query.
Either the range previously selected when the gadget was created, or the entire sheet, is acceptable, depending on the needs of the Visualization application being built. The URL listed under Paste this as a gadget data source url in the Table query data source window is the correct URL to use with API code requiring query capabilities. Be sure to select the desired cell range, as the URL will change with the various options.

Important note

Google Gadgets are to be retired in 2013, but the query URL is still part of the gadget object at the time of publication. Look for the method of finding the query URL to change as Gadgets are retired.

Query

Use the URL retrieved from the Spreadsheet gadget to build the query. The following query statement is set to query the entire Spreadsheet of the key indicated:

var query = new google.visualization.Query(
    'https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1');

Once the query is built, it can then be sent. Since an external data source is by definition not always under the explicit control of the developer, a valid response to a query is not guaranteed. In order to prevent hard-to-detect data-related issues, it is best to include a method of handling erroneous returns from the data source. The following query.send call also informs the application how to handle the information returned from the data source, regardless of quality.

query.send(handleQueryResponse);

The handleQueryResponse function sent along with the query acts as a filter, catching and handling errors from the data source. If an error is detected, the handleQueryResponse function displays an alert message. If the response from the data source is valid, the function proceeds and draws the visualization.

function handleQueryResponse(response) {
  if (response.isError()) {
    alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
    return;
  }
  var data = response.getDataTable();
  visualization = new google.visualization.Table(document.getElementById('visualization'));
  visualization.draw(data, null);
}

Best practice

Be prepared for potential errors by planning how to handle them.

For reference, the previous example is given in its complete HTML form:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title>Google Visualization API Sample</title>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
  google.load('visualization', '1', {packages: ['table']});
</script>
<script type="text/javascript">
var visualization;
function drawVisualization() {
  // To see the data that this visualization uses, browse to
  // https://docs.google.com/spreadsheet/ccc?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&usp=sharing
  var query = new google.visualization.Query('https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1');
  // Send the query with a callback function.
  query.send(handleQueryResponse);
}
function handleQueryResponse(response) {
  if (response.isError()) {
    alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
    return;
  }
  var data = response.getDataTable();
  visualization = new google.visualization.Table(document.getElementById('visualization'));
  visualization.draw(data, null);
}
google.setOnLoadCallback(drawVisualization);
</script>
</head>
<body style="font-family: Arial;border: 0 none;">
<div id="visualization" style="height: 400px; width: 400px;"></div>
</body>
</html>

View live examples for Spreadsheets at http://gvisapi-packt.appspot.com/ch6-examples/ch6-datasource.html.

Apps Script method

Just as the Visualization API can be used from within an Apps Script, external data sources can also be requested from the script. In the Apps Script Spreadsheet example presented earlier in this article, the DataTable() creation was performed within the script. In the following example, the create data table element has been removed and a .setDataSourceUrl option has been added to Charts.newAreaChart(). The script otherwise remains the same.

function doGet() {
  var chart = Charts.newAreaChart()
      .setDataSourceUrl("https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1")
      .setDimensions(600, 400)
      .setXAxisTitle("Age Groups")
      .setYAxisTitle("Population")
      .setTitle("Chicago Population by Age and Gender - 2010 Census")
      .build();
  var ui = UiApp.createApplication();
  ui.add(chart);
  return ui;
}

View live examples in Apps Script at https://script.google.com/d/1Q2R72rGBnqPsgtOxUUME5zZy5Kul53r_lHIM2qaE45vZcTlFNXhTDqrr/edit.

Fusion Tables

Fusion Tables are another viable data source ready for use by the Visualization API. Fusion Tables offer benefits over Spreadsheets beyond just the Google Maps functionality. The Tables API also allows for easier data source modification than is available in Spreadsheets.

Preparation

Preparing a Fusion Table to be used as a source is similar in procedure to preparing a Spreadsheet as a data source. The Fusion Table must be shared with the intended audience, and a unique identifier must be gathered from the Fusion Tables application.

Permissions

Just as with Spreadsheets, Fusion Tables must allow a user a minimum of view permissions in order for an application using the Visualization API to work properly. From the Sharing settings window in Fusion Tables, give the appropriate users view access as a minimum.

Get the URL path

Referencing a Fusion Table is very similar in method to Spreadsheets. Luckily, the appropriate URL ID information is slightly easier to find in Fusion Tables than in Spreadsheets. With the Sharing settings window open, there is a field at the top of the page containing the Link to share. At the end portion of the link, following the characters dcid=, is the Table's ID. The ID will look something like the following:

1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0

This ID is the unique identifier for the table.

Query

The Google Fusion Tables API includes SQL-like queries for the modification of Fusion Tables data from outside the GUI interface. Queries take the form of HTTP POST and GET requests and are constructed using the Fusion Tables API query capabilities. Data manipulation using the Fusion Tables API is beyond the scope of this article, but a simple example is offered here as a basic illustration of functionality.
A Fusion Table query requires the use of the API SELECT option, formatted as:

SELECT Column_name FROM Table_ID

Here Column_name is the name of the Fusion Table column and Table_ID is the table's ID extracted from the Sharing settings window. If the SELECT call is successful, the requested information is returned to the application in JSON format.

The Visualization API drawChart() is able to take the SELECT statement and the corresponding data source URL as options for the chart rendering. The male and female data from the Fusion Tables 2010 Chicago Census file have been visualized using the drawChart() technique.

function drawVisualization() {
  google.visualization.drawChart({
    containerId: 'visualization',
    dataSourceUrl: 'http://www.google.com/fusiontables/gvizdata?tq=',
    query: 'SELECT Age, Male, Female FROM 1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0',
    chartType: 'AreaChart',
    options: {
      title: 'Chicago Population by Age and Sex - 2010 Census',
      vAxis: { title: 'Population' },
      hAxis: { title: 'Age Groups' }
    }
  });
}

The preceding code results in the following visualization:

Live examples are available at http://gvisapi-packt.appspot.com/ch6-examples/ch6-queryfusion.html.

Important note

Fusion Table query responses are limited to 500 rows. See the Fusion Tables API documentation for other resource parameters.

API Explorer

With so many APIs available to developers using the Google platform, testing individual API functionality can be time consuming. The same issue arises for GUI applications used as a data source. Fortunately, Google provides API methods for its graphical applications as well. The ability to test API requests against Google's infrastructure is a desirable practice for all API programming efforts. To support this need, Google maintains the APIs Explorer service. This service is a console-based web application that allows queries to be submitted to APIs directly, without an application to frame them. This is helpful functionality when attempting to verify whether a data source is properly configured.

To check whether the Fusion Tables 2010 U.S. Census data instance is configured properly, a query can be sent to list all columns, which shows which columns are actually exposed to the Visualization API application.

Best practice

Use the Google APIs Explorer service to test whether API queries work as intended.

To use the API Explorer for Fusion Tables, select Fusion Tables API from the list of API services. API functions available for testing are listed on the Fusion Tables API page. Troubleshooting a Chart with a Fusion Tables data source usually involves first verifying that all columns are available to the visualization code. If a column is not available, or is not formatted as expected, a visualization issue related to data problems may be difficult to troubleshoot from inside the Visualization API environment. The API call that best performs a simple check on column information is the fusiontables.column.list item. Selecting fusiontables.column.list opens up a form-based interface. The only required information is the Table ID (collected from the Share settings window in the Fusion Tables file). Click on the Execute button to run the query. The API Explorer tool will then show the GET query sent to the Fusion Table in addition to the results it returned. For the fusiontables.column.list query, columns are returned in bracketed sections. Each section contains attributes of that column.
The following queried attributes should look familiar, as they are the fusiontables.column.list result of a query to the 2010 Chicago Census data Fusion Table.

Best practice

The Column List tool is helpful when troubleshooting Fusion Table to API code connectivity. If the Table is able to return coherent values through the tool, it can generally be assumed that access settings are appropriate and that the code itself may be the source of connection issues.

Fusion Tables—row and query reference is available at https://developers.google.com/fusiontables/docs/v1/sqlreference.

Information on API Explorer—column list is available at https://developers.google.com/fusiontables/docs/v1/reference/column/list#try-it.
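To run a similar column-level check from code rather than from the Explorer console, a query can be sent directly from the Visualization API against the same census Fusion Table. The following is only a sketch, not part of the original article: it assumes the jsapi loader and the visualization package have already been loaded as in the earlier HTML example, and it uses the Query object's setQuery() method to supply the SELECT statement instead of embedding it in the drawChart() options.

function checkCensusColumns() {
  // Query only the columns the chart needs; errors surface in the callback.
  var query = new google.visualization.Query('http://www.google.com/fusiontables/gvizdata?tq=');
  query.setQuery('SELECT Age, Male, Female FROM 1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0');
  query.send(function (response) {
    if (response.isError()) {
      // A column or permission problem shows up here rather than as a blank chart.
      alert('Query failed: ' + response.getMessage() + ' ' + response.getDetailedMessage());
      return;
    }
    var data = response.getDataTable();
    alert('Returned ' + data.getNumberOfColumns() + ' columns and ' + data.getNumberOfRows() + ' rows.');
  });
}

If a column name is wrong or permissions are too restrictive, the error is reported in the callback instead of silently producing an empty chart.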

First steps with R

Packt
30 Jul 2013
6 min read
(For more resources related to this topic, see here.)

Obtaining and installing R

The way to obtain R is to download it from the CRAN website (http://www.r-project.org/). The Comprehensive R Archive Network (CRAN) is a network of FTP and web servers around the world that stores identical, up-to-date versions of code and documentation for R. The CRAN is directly accessible from the R website, where it is also possible to find information about R, some technical manuals, the R Journal, and details about the packages developed for R and stored in the CRAN repositories.

The functionalities of the R environment can also be expanded thanks to software libraries which can be installed and recalled when needed. These libraries, or packages, are collections of source code and other additional files that, when installed in R, allow the user to load them into the workspace via a call to the library() function. An example of code to load the package lattice is as follows:

> library(lattice)

An R installation contains one or more libraries of packages. Some of these packages are part of the basic installation and are loaded automatically as soon as the session is started. Others can be installed from the CRAN, the official R repository, or downloaded and installed manually.

Interacting with the console

As soon as you start R, you will see that a workspace is open; you can see a screenshot of the R Console window in the image below. The workspace is the environment in which you are working, where you will load your data and create your variables. The screen prompt > is the R prompt that waits for commands.

On the starting screen, you can type any function or command, or you can use R to perform basic calculations. R uses the usual symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^). Parentheses ( ) can be used to specify the order of operations. R also provides %% for taking the modulus and %/% for integer division. Comments in R are defined by the character #, so everything after that character up to the end of the line will be ignored by R.

R has a number of built-in functions, for example, sin(x), cos(x), tan(x) (all in radians), exp(x), log(x), and sqrt(x). Some special constants such as pi are also pre-defined. You can see an example of the use of such functions in the following code:

> exp(2.5)
[1] 12.18249

Understanding R objects

In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer's memory but rather provides a number of specialized data structures called objects. These objects are referred to through symbols or variables.

Vectors

The basic object in R is the vector; even scalars are vectors of length one. Vectors can be thought of as a series of data of the same class. There are six basic vector types (called atomic vectors): logical, integer, real, complex, string (or character), and raw. Integer and real represent numeric objects; logicals are a Boolean data type with possible values TRUE or FALSE. Among these atomic vectors, the more common ones are logical, string, and numeric (integer and real).

There are several ways to create vectors. For instance, the operator : (colon) is a sequence-generating operator; it creates sequences by incrementing or decrementing by one.

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 5:-6
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5 -6

If the interval between the numbers is not one, you can use the seq() function.
Here is an example:

> seq(from=2, to=2.5, by=0.1)
[1] 2.0 2.1 2.2 2.3 2.4 2.5

One of the more important features of R is the possibility to use entire vectors as arguments of functions, thus avoiding the use of explicit loops. Most of the functions in R allow the use of a vector as an argument; as an example, the use of some of these functions is shown as follows:

> x <- c(12,10,4,6,9)
> max(x)
[1] 12
> min(x)
[1] 4
> mean(x)
[1] 8.2

Matrices and arrays

In R, the matrix notation is extended to elements of any kind, so, for example, it is possible to have a matrix of character strings. Matrices and arrays are basically vectors with a dimension attribute. The function matrix() may be used to create matrices. By default, this function fills the matrix by column; as an alternative, it is possible to tell the function to build the matrix by row:

> matrix(1:9,nrow=3,byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Lists

A list in R is a collection of different objects. One of the main advantages of lists is that the objects contained within a list may be of different types, for example, numeric and character values. In order to define a list, you simply need to provide the objects that you want to include as arguments to the function list().

Data frame

A data frame corresponds to a data set; it is basically a special list in which the elements all have the same length. Elements may be of different types in different columns, but within the same column all the elements are of the same type. You can easily create data frames using the function data.frame(), and a specific column can be recalled using the operator $.

Top features you'll want to know about

In addition to basic object creation and manipulation, many more complex tasks can be performed with R, spanning data manipulation, programming, statistical analysis, and the realization of very high quality graphs. Some of the most useful features are:

Data input and output
Flow control (for, if…else, while)
Create your own functions
Debugging functions and handling exceptions
Plotting data

Summary

In this article we saw what R is, how to obtain and install it, and how to interact with the console. We also looked at a few R objects and at the top features you will want to know about.

Resources for Article:

Further resources on this subject:

Organizing, Clarifying and Communicating the R Data Analyses [Article]

Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]

Graphical Capabilities of R [Article]
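Tying together the data frame section and the "Create your own functions" feature listed above, here is a small sketch (the data values are invented purely for illustration, not taken from the article) that builds a data frame, accesses a column with $, and wraps a simple calculation in a user-defined function:

# Build a small data frame: one character column and two numeric columns
ages <- data.frame(group = c("0-17", "18-34", "35-64", "65+"),
                   male = c(320, 410, 520, 150),
                   female = c(310, 430, 560, 210))

# Access a single column with the $ operator
mean(ages$male)

# A user-defined function: the range (max - min) of a numeric vector
spread <- function(x) {
  max(x) - min(x)
}

spread(ages$female)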

Model Design Accelerator

Packt
30 Jul 2013
6 min read
(For more resources related to this topic, see here.)

By the end of this article you will be able to use Model Design Accelerator to design a new Framework model. To introduce Model Design Accelerator, we will use a fairly simple schema based on a rental star schema, derived from the MySQL Sakila sample database. This database can be downloaded from http://dev.mysql.com/doc/sakila/en/. It is just one example of a number of possible dimensional models based on this sample database.

The Model Design Accelerator user interface

The user interface of Model Design Accelerator is very simple, consisting of only two panels:

Explorer Tree: This contains details of the database tables and views from the data source.

Model Accelerator: This contains a single fact table surrounded by four dimension tables, and is the main work area for the model being designed.

By clicking on the labels (Explorer Tree and Model Accelerator) at the top of the window, it is possible to hide either of these panels, but having both panels always visible is beneficial.

Starting Model Design Accelerator

Model Design Accelerator is started from the Framework Manager initial screen:

Select Create a new project using Model Design Accelerator…. This will start the new project creation wizard, which is exactly the same as if you were starting any new project.

Select the data source to import the database tables into the new model.

After importing the database tables, the project creation wizard will display the Model Design Accelerator Introduction screen:

After reading the instructions, click on the Close button to continue. This will then show the Model Design Accelerator workspace.

Adding tables to your workspace

The first step in creating your model with Model Design Accelerator is to add the dimension and fact tables to your model:

From the Explorer panel, drag-and-drop dim_date, dim_film, dim_customer, and dim_store to the four New Query Subject boxes in the Model Accelerator panel.

After adding your queries, right-click on the boxes to rename the queries to Rental Date Dim, Film Dim, Customer Dim, and Store Dim respectively. If not all query columns are required, it is also possible to expand the dimension tables and drag-and-drop individual columns to the query boxes.

In the Explorer Tree panel, expand the fact_rental table by clicking on the (+) sign beside the name, and from the expanded tree drag-and-drop the count_returns, count_rentals, and rental_duration columns to the Fact Query Subject box. Rename the Fact Query Subject to Rental Fact.

Additional dimension queries can be added to the model by clicking on the top-left icon in the Model Accelerator panel, and then dragging and dropping the required query onto the workspace window. Since we have a start_date and an end_date for the rental period, add a second copy of the dim_date table by clicking on the icon and dragging the table from the Explorer view into the workspace. Also rename this query as Return Date Dim:

Adding joins to your workspace

After we have added our database table columns to the workspace, we now need to add the relationship joins between the dimension and fact tables. To do this:

Double-click on the Rental Date Dim table; this will expand the dim_date and fact_rental tables in the workspace window:

Click on the Enter relationship creation mode link.
Select the date_key column in the dim_date table, and the rental_date_key column in the fact_rental table, as follows:

Click on the Create relationship icon:

Click on OK to create this join. Close the Query Subject Diagram by clicking on the (X) symbol in the top-right corner.

Repeat this procedure for each of the other four tables. The final model will look like the following screenshot:

Generating the Framework Manager model

Once we have completed our model in Model Design Accelerator, we need to create a Framework Manager model:

Click on the Generate Model button.

Click on Yes to generate your model. The Framework Manager model will be generated and will open as follows:

When you generate your model, all of the Model Advisor tests are automatically applied to the resulting model. You should review any issues that have been identified in the Verify Results tab, and decide whether you need to fix them.

When you generate the model, only the required query items will be used to create the Framework Manager model. The Physical View tab will contain only those tables required by your star schema model. The Business View tab will contain model query subjects containing only the columns used in your star schema model. The Presentation View tab will only contain shortcuts to the query subjects that exist in the Business View tab.

After generating your model, you can use Framework Manager to improve the model by adding calculations, filters, dimensions, measures, and so on. Each time you generate a Framework Manager model from your Model Design Accelerator model, a new namespace is created in the current Framework Manager model, and any improvements you want to keep will also need to be applied to these new namespaces.

From Framework Manager you can return to Model Design Accelerator at any time to continue making changes to your star schema. To return to Model Design Accelerator from within Framework Manager:

From the Tools menu, select Run Model Design Accelerator. You may choose to continue with the same model or create a new model.

To make your star schema model available to report authors, you must first create a package and then publish the package to your Cognos Reporting Server.

Summary

In this article, we have looked at Model Design Accelerator. This is a tool that allows a novice modeler, or even an experienced modeler, to create a new Framework Manager model quickly and easily.

Resources for Article:

Further resources on this subject:

Integrating IBM Cognos TM1 with IBM Cognos 8 BI [Article]

How to Set Up IBM Lotus Domino Server [Article]

IBM Cognos 10 BI dashboarding components [Article]