Pentaho – Using Formulas in Our Reports

Packt
22 Aug 2013
7 min read
At the end of the article, we propose that you make some modifications to the report created here.

Starting practice

In this article, we will create a copy of an earlier report and then make the necessary changes to its layout; the final result is as follows:

As we can observe in the previous screenshot, the rectangle to the left of each title changes color. We'll see how to do this, and much more, shortly.

Time for action – making a copy of the previous report

In this article, we will use an already created report. To do so, we will open it and save it with the name 09_Using_Formulas.prpt. Then we will modify its layout to fit this article. Finally, we will establish default values for our parameters. The steps for making a copy of the previous report are as follows:

We open the report 07_Adding_Parameters.prpt that we created.
Next, we create a copy by going to File | Save As... and saving it with the name 09_Using_Formulas.prpt.
We modify our report so that it looks like the following screenshot:

As you can see, we have just added a rectangle in the Details section and a label (Total) in the Details Header section, and we have modified the name of the label found in the Report Header section. To easily differentiate this report from the one used previously, we have also changed its colors to grayscale. Later in this article, we will make the color of the rectangle vary according to a formula, so it is important that the rest of the report does not have too many colors; this way, the result is easy for the end user to see.

We will establish default values for our parameters so we can preview the report without the delay of having to choose values for rating, year, and month:

We go to the Data tab, select the SelectRating parameter, right-click on it, and choose the Edit Parameter... option.
In Default Value, we type the value [G].
Next, we click on OK to continue.
We should do something similar for SelectYear and SelectMonth:
For SelectYear, the Default Value will be 2005.
For SelectMonth, the Default Value will be 5. Remember that the selector shows the names of the months, but internally the months' numbers are used; so, 5 represents May.

What just happened?

We created a copy of the report 07_Adding_Parameters.prpt and saved it with the name 09_Using_Formulas.prpt. We changed the layout of the report, adding new objects and changing the colors. Then we established default values for the parameters SelectRating, SelectYear, and SelectMonth.

Formulas

To manage formulas, PRD implements the open standard OpenFormula. According to OpenFormula's specification:

"OpenFormula is an open format for exchanging recalculated formulas between office application implementations, particularly for spreadsheets. OpenFormula defines the types, syntax, and semantics for calculated formulas, including many predefined functions and operations, so that formulas can be exchanged between applications and produce substantively equal outputs when recalculated with equal inputs. Both closed and open source software can implement OpenFormula."
For more information on OpenFormula, refer to the following links:

Wikipedia: http://en.wikipedia.org/wiki/OpenFormula
Specification: https://www.oasis-open.org/committees/download.php/16826/openformula-spec-20060221.html
Web: http://www.openformula.org/
Pentaho wiki: http://wiki.pentaho.com/display/Reporting/Formula+Expressions

Formulas are used for greatly varied purposes, and their use depends on the result one wants to obtain. Formulas let us carry out simple and complex calculations based on fixed and variable values. They include predefined functions for working with text, databases, and date and time, as well as general information functions and user-defined functions, and they support logical operators (AND, OR, and so on) and comparison operators (>, <, and so on).

Creating formulas

There are two ways to create formulas:

By creating a new function and going to Common | Open Formula
By pressing the button in a section's or an object's Style or Attributes tab to configure some feature

In the report we are creating in this article, we will create formulas using both methods.

Using the first method, general-use formulas can be created. That is, the result is an object that can either be included directly in our report or used as a value in another function, style, or attribute. We can create objects that make calculations at a general level, to be included in sections such as Report Header, Group Footer, and so on, or we can make calculations to be included in the Details section. In the latter case, the formula makes its calculation row by row. This is an important difference with respect to aggregate functions, which usually can only calculate totals and subtotals.

Using the second method, we create specific-use formulas that affect the value of a style or attribute of an individual object. The way to use these formulas is simple: just choose the value you want to modify in the Style or Attributes tab and click on the button that appears to its right. In this way, you can create formulas that dynamically assign values to an object's color, position, width, length, format, visibility, and so on. Using this technique, stoplights can be created by assigning different colors to an object according to a calculation, progress bars can be created by changing an object's length, and dynamic images can be placed in the report by using the result of a formula to calculate the image's path.

As these examples show, using formulas in our reports gives us great flexibility in applying styles and attributes to objects and to the report itself, as well as the possibility of creating our own objects based on complex calculations. By using formulas correctly, you will be able to give life to your reports and adapt them to changing contexts. For example, depending on which user executes the report, a certain image can appear in the Report Header section, or graphics and subreports can be hidden if the user does not have sufficient permissions.

The formula editor

The formula editor has a very intuitive and easy-to-use UI that, in addition to guiding us in creating formulas, tells us, whenever possible, the value that the formula will return. In the following screenshot, you can see the formula editor:

We will explain its layout with an example. Let's suppose that we added a new label and we want to create a formula that returns the value of Attributes.Value.
For this purpose, we do the following:

We select the option to the right of Attributes.Value. This opens the formula editor. In the upper-left corner, there is a selector where we can specify the category of functions that we want to see. Below this, we find a list of the functions that we can use to create our own formulas. In the lower-left section, we can see more information about the selected function; that is, the type of value that it will return and a general description.
We choose the CONCATENATE function by double-clicking on it, and in the lower-right section we can see the formula (Formula:) that we will use.
We type in =CONCATENATE(Any), and an assistant opens in the upper-right section that guides us in entering the values we want to concatenate.
We could complete the CONCATENATE function by adding some fixed values and some variables; take the following example:

If there is an error in the text of the formula, warning text will appear. Otherwise, the formula editor will try to show us the result that our formula will return. When it is not possible to visualize the result that a formula will return, this is usually because the values used are calculated during the execution of the report.

Formulas should always begin with the = sign. Initially, one tends to use the help that the formula editor provides, but later, with more practice, it becomes evident that it is much faster to type the formula directly. Also, if you need to enter complex formulas or combine various functions with logical operators, the formula editor will not be of much use.
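To give a concrete flavor of what such a formula can look like, the changing rectangle color described at the start of this article could be driven by a small OpenFormula expression entered as the formula for the rectangle's color style. This is only an illustrative sketch: the field name [Total], the threshold 100, and the two hex colors are our own assumptions, not values taken from the report built here:

=IF([Total] > 100; "#CC0000"; "#999999")

Square brackets reference a field of the current row, and arguments are separated by semicolons, following the OpenFormula specification quoted earlier; because the formula is attached to an object in the Details section, it is re-evaluated row by row.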


Lucene.NET: Optimizing and merging index segments

Packt
20 Aug 2013
3 min read
How to do it…

Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe demonstrates the use of the Optimize method to clean up the storage of the index data on the physical disk. The general steps in the process to optimize and merge index segments are the following:

Create/open an index.
Add or delete documents from the index.
Examine the MaxDoc and NumDocs properties of the IndexWriter class.
If the index is deemed to be too dirty, call the Optimize method of the IndexWriter class.

The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index.

namespace Lucene.NET.HowTo._12_MergeAndOptimize {
    // ...
    // build facade and an initial index of 5 documents
    var facade = new LuceneDotNetHowToExamplesFacade()
        .buildLexicographicalExampleIndex(maxDocs: 5)
        .createIndexWriter();

    // report MaxDoc and NumDocs
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // delete one document
    facade.IndexWriter.DeleteDocuments(new Term("filename", "0.txt"));
    facade.IndexWriter.Commit();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After delete / commit");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // optimize the index
    facade.IndexWriter.Optimize();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After Optimize");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));
    Trace.Flush();
    // ...
}

How it works…

When this program is run, you will see output similar to that in the following screenshot:

This program first creates an index with five files. It then reports the values of the MaxDoc and NumDocs properties of the instance of IndexWriter. MaxDoc represents the maximum number of documents that have been stored in the index. It is possible to add more documents, but that may incur a performance penalty by needing to grow the index. NumDocs is the current number of documents stored in the index. At this point these values are 5 and 5, respectively.

The next step deletes a single document named 0.txt from the index, and the changes are committed to disk. MaxDoc and NumDocs are written to the console again and now report 5 and 4, respectively. This makes sense, as one file has been deleted and there is now "slop" in the index where space is still taken up by the previously deleted document. The reference to the document's index information has been removed, but the space is still used on the disk.

The final two steps are to call Optimize and to write the MaxDoc and NumDocs values to the console for the final time. These are now 4 and 4, respectively, as Lucene.NET has merged the index segments and removed any empty disk space formerly used by deleted document index information.

Summary

A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as the index writer flushes buffered content into the index's directory and file structure. Over time this fragmentation will cause searches to slow, requiring a merge/optimization to be performed to regain performance.


Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

Packt
14 Aug 2013
6 min read
Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about it. Some of the common analytics are shown as follows:

Calculating statistical properties such as minimum, maximum, mean, median, standard deviation, and so on of a dataset. A dataset generally has multiple dimensions (for example, when processing HTTP access logs, the name of the web page, the size of the web page, access time, and so on are a few of the dimensions). We can measure the previously mentioned properties using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value in each case.
Frequency distributions (histograms) count the number of occurrences of each item in the dataset, sort these frequencies, and plot the different items on the X axis and the frequency on the Y axis.
Finding a correlation between two dimensions (for example, the correlation between access count and the file size of web accesses).
Hypothesis testing: verifying or disproving a hypothesis using a given dataset.

However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often gives us a deeper understanding of it. Therefore, we often plot the results of Hadoop jobs using some plotting program.

Getting ready

This article assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured. Download a Hadoop 1.1.x distribution from the http://hadoop.apache.org/releases.html page. Unzip the distribution; we will call this directory HADOOP_HOME. Download the sample code for the article and copy the data files.

How to do it...

If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:

>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset

Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

Run the first MapReduce job to calculate the buying frequency. To do that, run the following command from HADOOP_HOME:

$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer /data/amazon-dataset /data/frequency-output1

Use the following command to run the second MapReduce job to sort the results of the first MapReduce job:

$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 /data/frequency-output2

You can find the results in the output directory. Copy the results to HADOOP_HOME using the following command:

$ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data

Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME.

Generate the plot by running the following command from HADOOP_HOME:

$ gnuplot buyfreq.plot

It will generate a file called buyfreq.png, which will look like the following:

As the figure depicts, a few buyers have bought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a power law distribution. This is an example of how analytics and plotting results can give us insight into underlying patterns in the dataset.

How it works...
You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java. This figure shows the execution of the two MapReduce jobs. The following code listing shows the map function and the reduce function of the first job:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    List<BuyerRecord> records = BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        context.write(new Text(record.customerID),
                new IntWritable(record.itemsBrought.size()));
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}

As shown by the figure, Hadoop reads the input file from the input folder and reads records using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once per record, passing the record as input. The mapper extracts the customer ID and the number of items the customer has bought, and emits the customer ID as the key and the number of items as the value. Then, Hadoop sorts the key-value pairs by key and invokes a reducer once for each key, passing all values for that key as inputs. Each reducer sums up the item counts for its customer ID and emits the customer ID as the key and the count as the value in the results.

The second job then sorts the results. It reads the output of the first job and passes each line as an argument to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop sorts the key-value pairs by key, thus sorting them by the number of items, and invokes the reducer once per key in that order. Therefore, the reducer prints them out in the same order, essentially sorting the dataset.

Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file in buyfreq.plot. The source for the plot will look like the following:

set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items brought by Buyer";
set ylabel "Number of Items Brought";
set xlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints

Here the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines define the axis labels, the title, and the legend position, and the next two lines define the scale of each axis; this plot uses a log scale for both. The last line defines the plot. Here, it is asking gnuplot to read the data from the 1.data file, to use the data in the second column of the file (via using 2), and to plot it using lines. Columns must be separated by whitespace. If you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.

There's more...

We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.
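If you want to sanity-check the logic of these two jobs without running Hadoop, the same count-then-sort computation can be sketched in a few lines of plain Python. This is only an illustrative sketch; the in-memory purchases list is a made-up stand-in for the Amazon dataset and none of it is part of the book's sample code:

from collections import Counter

# hypothetical (customer, item) pairs standing in for the parsed amazon-meta.txt records
purchases = [("c1", "book A"), ("c1", "book B"), ("c2", "book A"),
             ("c1", "book C"), ("c3", "book B"), ("c2", "book D")]

# "job 1": count the items bought per customer (what BuyingFrequencyAnalyzer emits)
counts = Counter(customer for customer, item in purchases)

# "job 2": order the customers by their item count (what SimpleResultSorter produces)
for customer, n_items in sorted(counts.items(), key=lambda kv: kv[1]):
    print(customer, n_items)

Plotting the second column of that output on a log-log scale is exactly what the buyfreq.plot script does with 1.data.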
Summary

In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot.


Calculus

Packt
14 Aug 2013
8 min read
Derivatives

To compute the derivative of a function, create the corresponding expression and use diff(). Its first argument is the expression and the second is the variable with regard to which you want to differentiate. The result is the expression for the derivative:

>>> diff(exp(x**2), x)
2*x*exp(x**2)
>>> diff(x**2 * y**2, y)
2*x**2*y

Higher-order derivatives can also be computed with a single call to diff():

>>> diff(x**3, x, x)
6*x
>>> diff(x**3, x, 2)
6*x
>>> diff(x**2 * y**2, x, 2, y, 2)
4

Due to SymPy's focus on expressions rather than functions, the derivatives of symbolic functions can seem a little surprising, but LaTeX rendering in the notebook should make their meaning clear:

>>> f = Function('f')
>>> diff(f(x**2), x)
2*x*Subs(Derivative(f(_xi_1), _xi_1), (_xi_1,), (x**2,))

Let's take a look at the following screenshot:

Limits

Limits are obtained through limit(). The syntax for the limit of expr when x goes to some value x0 is limit(expr, x, x0). To specify a limit towards infinity, you need to use SymPy's infinity object, named oo. This object will also be returned for infinite limits:

>>> limit(exp(-x), x, oo)
0
>>> limit(1/x**2, x, 0)
oo

There is also a fourth optional parameter, to specify the direction of approach of the limit target. "+" (the default) gives the limit from above, and "-" is from below. Obviously, this parameter is ignored when the limit target is infinite:

>>> limit(1/x, x, 0, "-")
-oo
>>> limit(1/x, x, 0, "+")
oo

Let's take a look at the following screenshot:

Integrals

SymPy has powerful algorithms for integration and, in particular, can find most integrals of logarithmic and exponential functions expressible with special functions, and many more besides, thanks to Meijer G-functions. The main function for integration is integrate(). It can compute both antiderivatives (indefinite integrals) and definite integrals. Note that the value of an antiderivative is only defined up to an arbitrary constant, but the result does not include it:

>>> integrate(sin(x), x)
-cos(x)
>>> integrate(sin(x), (x, 0, pi))
2

Unevaluated symbolic integrals and antiderivatives are represented by the Integral class. integrate() may return these objects if it cannot compute the integral. It is also possible to create Integral objects directly, using the same syntax as integrate(). To evaluate them, call their .doit() method:

>>> integral = Integral(sin(x), (x, 0, pi))
>>> integral
Integral(sin(x), (x, 0, pi))
>>> integral.doit()
2

Let's take a look at the following screenshot:

Taylor series

A Taylor series approximation is an approximation of a function obtained by truncating its Taylor series. To compute it, use series(expr, x, x0, n), where x is the relevant variable, x0 is the point where the expansion is done (defaults to 0), and n is the order of expansion (defaults to 6):

>>> series(cos(x), x)
1 - x**2/2 + x**4/24 + O(x**6)
>>> series(cos(x), x, n=10)
1 - x**2/2 + x**4/24 - x**6/720 + x**8/40320 + O(x**10)

The O(x**6) part in the result is a "big-O" object. Intuitively, it represents all the terms of order equal to or higher than 6.
This object automatically absorbs or combines with powers of the variable, which makes simple arithmetic operations on expansions convenient:

>>> O(x**2) + 2*x**3
O(x**2)
>>> O(x**2) * 2*x**3
O(x**5)
>>> expand(series(sin(x), x, n=6) * series(cos(x), x, n=4))
x - 2*x**3/3 + O(x**5)
>>> series(sin(x)*cos(x), x, n=5)
x - 2*x**3/3 + O(x**5)

If you want to use the expansion as an approximation of the function, the O() term prevents it from behaving like an ordinary expression, so you need to remove it. You can do so by using the aptly named .removeO() method:

>>> series(cos(x), x).removeO()
x**4/24 - x**2/2 + 1

Taylor series look better in the notebook, as shown in the following screenshot:

Solving equations

This section will teach you how to solve the different types of equations that SymPy handles. The main function to use for solving equations is solve(). Its interface is somewhat complicated, as it accepts many different kinds of inputs and can output results in various forms depending on the input.

In the simplest case, univariate equations, use the syntax solve(expr, x) to solve the equation expr = 0 for the variable x. If you want to solve an equation of the form A = B, simply put it in the preceding form, using solve(A - B, x). This can solve algebraic and transcendental equations involving rational fractions, square roots, absolute values, exponentials, logarithms, trigonometric functions, and so on. The result is a list of the values of the variable satisfying the equation. The following commands show a few examples of equations that can be solved:

>>> solve(x**2 - 1, x)
[-1, 1]
>>> solve(x*exp(x) - 1, x)
[LambertW(1)]
>>> solve(abs(x**2-4) - 3, x)
[-1, 1, -sqrt(7), sqrt(7)]

Note that the form of the result means that it can only return a finite set of solutions. In cases where the true solution set is infinite, it can therefore be misleading. When the solution is an interval, solve() typically returns an empty list. For periodic functions, usually only one solution is returned:

>>> solve(0, x)           # all x are solutions
[]
>>> solve(x - abs(x), x)  # all positive x are solutions
[]
>>> solve(sin(x), x)      # all k*pi with k integer are solutions
[0]

The domain over which the equation is solved depends on the assumptions on the variable. Hence, if the variable is a real Symbol object, only real solutions are returned, but if it is complex, then all solutions in the complex plane are returned (subject to the aforementioned restriction on returning infinite solution sets). This difference is readily apparent when solving polynomials, as the following example demonstrates:

>>> solve(x**2 + 1, x)
[]
>>> solve(z**2 + 1, z)
[-I, I]

There is no restriction on the number of variables appearing in the expression. Solving a multivariate expression for any of its variables allows it to be expressed as a function of the other variables, and to eliminate it from other expressions. The following example shows different ways of solving the same multivariate expression:

>>> solve(x**2 - exp(a), x)
[-exp(a/2), exp(a/2)]
>>> solve(x**2 - exp(a), a)
[log(x**2)]
>>> solve(x**2 - exp(a), x, a)
[{x: -exp(a/2)}, {x: exp(a/2)}]
>>> solve(x**2 - exp(a), x, b)
[{x: -exp(a/2)}, {x: exp(a/2)}]

To solve a system of equations, pass a list of expressions to solve(): each one will be interpreted, as in the univariate case, as an equation of the form expr = 0.
The result can be returned in one of two forms, depending on the mathematical structure of the input: either as a list of tuples, where each tuple contains the values for the variables in the order given to solve(), or as a single dictionary, suitable for use in subs(), mapping variables to their values. As you can see in the following example, it can be hard to predict what form the result will take:

>>> solve([exp(x**2) - y, y - 3], x, y)
[(-sqrt(log(3)), 3), (sqrt(log(3)), 3)]
>>> solve([x**2 - y, y - 3], x, y)
[(-sqrt(3), 3), (sqrt(3), 3)]
>>> solve([x - y, y - 3], x, y)
{y: 3, x: 3}

This variability in return types is fine for interactive use, but for library code, more predictability is required. In this case, you should use the dict=True option. The output will then always be a list of mappings of variables to values. Compare the following example to the previous one:

>>> solve([x**2 - y, y - 3], x, y, dict=True)
[{y: 3, x: -sqrt(3)}, {y: 3, x: sqrt(3)}]
>>> solve([x - y, y - 3], x, y, dict=True)
[{y: 3, x: 3}]

Summary

In this article, we used SymPy to compute derivatives, limits, integrals, and Taylor series, and to solve equations.
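The snippets above are shown as interactive (>>>) sessions and assume that SymPy has already been imported and the symbols defined. The following self-contained sketch reproduces a few of them as a script; the exact import list and the assumptions attached to each symbol (x and y real, z complex by default) are our own reconstruction, not something prescribed by the article:

from sympy import (symbols, Symbol, Function, exp, sin, cos, pi, oo,
                   diff, limit, integrate, Integral, series, solve, expand)

x, y = symbols('x y', real=True)   # real symbols: solve() returns only real roots
z = Symbol('z')                    # complex by default: solve(z**2 + 1, z) gives [-I, I]
a = Symbol('a')

print(diff(exp(x**2), x))              # 2*x*exp(x**2)
print(limit(1/x, x, 0, "+"))           # oo
print(integrate(sin(x), (x, 0, pi)))   # 2
print(series(cos(x), x).removeO())     # x**4/24 - x**2/2 + 1
print(solve([x**2 - y, y - 3], x, y, dict=True))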


Using Unrestricted Languages

Packt
13 Aug 2013
15 min read
Are untrusted languages inferior to trusted ones?

No, on the contrary. These languages are untrusted in the same way that a sharp knife is untrusted and should not be handed to very small children, at least not without adult supervision. They have extra powers that ordinary SQL, the trusted languages (such as PL/pgSQL), and the trusted variants of the same language (PL/Perl versus PL/Perlu) don't have. You can use the untrusted languages to directly read and write the server's disks, and you can use them to open sockets and make Internet queries to the outside world. You can even send arbitrary signals to any process running on the database host. Generally, you can do anything the native language of the PL can do.

However, you probably should not trust arbitrary database users with the right to define functions in these languages. Always think twice before giving all privileges on some untrusted language to a user or group by using the *u languages for important functions.

Can you use the untrusted languages for important functions?

Absolutely. Sometimes, it may be the only way to accomplish some tasks from inside the server. Performing simple queries and computations should do nothing harmful to your database, and neither should connecting to the external world for sending e-mails, fetching web pages, or doing SOAP requests. They may cause delays and even queries that get stuck, but these can usually be dealt with by setting an upper limit on how long a query can run, using an appropriate statement timeout value. Setting a reasonable statement timeout by default is good practice anyway.

So, if you don't deliberately do risky things, the probability of harming the database is no bigger than when using a "trusted" (also known as "restricted") variant of the language. However, if you give the language to someone who starts changing bytes on the production database "to see what happens", you will probably get what you asked for.

Will untrusted languages corrupt the database?

The power to corrupt the database is definitely there, since the functions run as the system user of the database server with full access to the filesystem. So, if you blindly start writing into the data files and deleting important logs, it is very likely that your database will be corrupted. Additional types of denial-of-service attacks are also possible, such as using up all memory or opening all IP ports; but there are ways to overload the database using plain SQL as well, so that part is not much different from trusted database access with the ability to run arbitrary queries.

So yes, you can corrupt the database, but please don't do it on a production server. If you do, you will be sorry.

Why untrusted?

PostgreSQL's ability to use an untrusted language is a powerful way to perform some nontraditional things from database functions. Creating these functions in a PL is an order of magnitude smaller task than writing an extension function in C.
For example, a function to look up a hostname for an IP address is only a few lines in PL/Pythonu:

CREATE FUNCTION gethostbyname(hostname text)
  RETURNS inet
AS $$
import socket
return socket.gethostbyname(hostname)
$$ LANGUAGE plpythonu SECURITY DEFINER;

You can test it immediately after creating the function by using psql:

hannu=# select gethostbyname('www.postgresql.org');
 gethostbyname
----------------
 98.129.198.126
(1 row)

Creating the same function in the most untrusted language, C, involves writing tens of lines of boilerplate code, worrying about memory leaks, and all the other problems that come from writing code in a low-level language. I recommend prototyping in some PL language if possible, and in an untrusted language if the function needs something that the restricted languages do not offer.

Why PL/Python?

All of these tasks could be done equally well using PL/Perlu or PL/Tclu; I chose PL/Pythonu mainly because Python is the language I am most comfortable with. This also translates to having written some PL/Python code, which I plan to discuss and share with you in this article.

Quick introduction to PL/Python

PL/pgSQL is a language unique to PostgreSQL and was designed to add blocks of computation and SQL inside the database. While it has grown in its breadth of functionality, it still lacks the completeness of syntax of a full programming language. PL/Python allows your database functions to be written in Python, with all the depth and maturity of writing Python code outside the database.

A minimal PL/Python function

Let's start from the very beginning (again):

CREATE FUNCTION hello(name text)
  RETURNS text
AS $$
return 'hello %s !' % name
$$ LANGUAGE plpythonu;

Here, we see that creating the function starts by defining it as any other PostgreSQL function, with a RETURNS definition of a text field:

CREATE FUNCTION hello(name text) RETURNS text

The difference from what we have seen before is that the language part specifies plpythonu (the language ID for the PL/Pythonu language):

$$ LANGUAGE plpythonu;

Inside the function body it is very much a normal Python function, returning a value obtained by formatting the name passed as an argument into the string 'hello %s !' using the standard Python formatting operator %:

return 'hello %s !' % name

Finally, let's test how this works:

hannu=# select hello('world');
     hello
---------------
 hello world !
(1 row)

And yes, it returns exactly what is expected!

Data type conversions

The first and last things that happen when a PL function is called by PostgreSQL are converting argument values between the PostgreSQL and PL types. The PostgreSQL types need to be converted to the PL types on entering the function, and then the return value needs to be converted back into a PostgreSQL type. Except for PL/pgSQL, which uses PostgreSQL's own native types in computations, the PLs are based on existing languages with their own understanding of what types (integer, string, date, and so on) are, how they should behave, and how they are represented internally. They are mostly similar to PostgreSQL's understanding but quite often are not exactly the same. PL/Python converts data from PostgreSQL types to Python types as shown in the following table:

PostgreSQL                                      Python 2   Python 3   Comments
int2, int4                                      int        int
int8                                            long       int
real, double, numeric                           float      float      This may lose precision for numeric values.
bytea                                           str        bytes      No encoding conversion is done, nor should any encoding be assumed.
text, char(), varchar(), and other text types   str        str        On Python 2, the string will be in server encoding. On Python 3, it is a unicode string.
All other types                                 str        str        PostgreSQL's type output function is used to convert to this string.

Inside the function, all computation is done using Python types, and the return value is converted back to PostgreSQL using the following rules (this is a direct quote from the official PL/Python documentation at http://www.postgresql.org/docs/current/static/plpython-data.html):

When the PostgreSQL return type is Boolean, the return value will be evaluated for truth according to the Python rules. That is, 0 and empty string are false, but notably 'f' is true.
When the PostgreSQL return type is bytea, the return value will be converted to a string (Python 2) or bytes (Python 3) using the respective Python built-ins, with the result being converted to bytea.
For all other PostgreSQL return types, the returned Python value is converted to a string using Python's built-in str, and the result is passed to the input function of the PostgreSQL data type. Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to PostgreSQL. Strings that are not valid in the current server encoding will raise an error; but not all encoding mismatches can be detected, so garbage data can still result when this is not done correctly. Unicode strings are converted to the correct encoding automatically, so it can be safer and more convenient to use those. In Python 3, all strings are Unicode strings.

In other words, 0, False, and an empty sequence, including the empty string '' or an empty dictionary, become PostgreSQL false; anything else becomes true. One notable exception is that the check for None is done before any other conversions, and even for Booleans, None is always converted to NULL and not to the Boolean value false. For the bytea type, the PostgreSQL byte array, the conversion from Python's string representation is an exact copy, with no encoding or other conversions applied.

Writing simple functions in PL/Python

Writing functions in PL/Python is not much different in principle from writing functions in PL/pgSQL. You still have the exact same syntax around the function body in $$, and the argument names, types, and RETURNS clause all mean the same thing regardless of the exact PL language used.

A simple function

A simple add_one() function in PL/Python looks like this:

CREATE FUNCTION add_one(i int)
  RETURNS int
AS $$
return i + 1;
$$ LANGUAGE plpythonu;

It can't get much simpler than that, can it? What you see here is that the arguments are passed to the Python code after converting them to the appropriate types, and the result is passed back and converted to the appropriate PostgreSQL type for the return value.

Functions returning a record

To return a record from a Python function, you can use:

A sequence or list of values in the same order as the fields in the return record
A dictionary with keys matching the fields in the return record
A class or type instance with attributes matching the fields in the return record

Here are samples of the three ways to return a record.
First, using an instance:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
class PGUser:
    def __init__(self, username, user_id, is_superuser):
        self.username = username
        self.user_id = user_id
        self.is_superuser = is_superuser
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
user = PGUser(u['usename'], u['usesysid'], u['usesuper'])
return user
$$ LANGUAGE plpythonu;

Then, a little simpler one using a dictionary:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
return {'username': u['usename'],
        'user_id': u['usesysid'],
        'is_superuser': u['usesuper']}
$$ LANGUAGE plpythonu;

Finally, using a tuple:

CREATE OR REPLACE FUNCTION userinfo(
    INOUT username name,
    OUT user_id oid,
    OUT is_superuser boolean)
AS $$
u = plpy.execute("""
    select usename, usesysid, usesuper
      from pg_user
     where usename = '%s'""" % username)[0]
return (u['usename'], u['usesysid'], u['usesuper'])
$$ LANGUAGE plpythonu;

Notice the [0] at the end of u = plpy.execute(...)[0] in all the examples. It is there to extract the first row of the result, as even for one-row results plpy.execute still returns a list of rows.

Danger of SQL injection!

As we have neither prepared the query and executed it with arguments, nor used the plpy.quote_literal() method (both techniques are discussed later) to safely quote the username before merging it into the query, we are open to a security flaw known as SQL injection. So, make sure that you only let trusted users call this function or supply the username argument.

Calling the function defined via any of these three CREATE commands looks exactly the same:

hannu=# select * from userinfo('postgres');
 username | user_id | is_superuser
----------+---------+--------------
 postgres |      10 | t
(1 row)

It usually does not make sense to declare a class inside a function just to return a record value. This possibility is included mostly for cases where you already have a suitable class with a set of attributes matching the ones the function returns.

Table functions

When returning a set from a PL/Python function, you have three options:

Return a list or any other sequence of the return type
Return an iterator or generator
yield the return values from a loop

Here, we have three ways to generate all even numbers up to the argument value using these different styles.

First, returning a list of integers:

CREATE FUNCTION even_numbers_from_list(up_to int)
  RETURNS SETOF int
AS $$
return range(0, up_to, 2)
$$ LANGUAGE plpythonu;

The list here is returned by a built-in Python function called range, which returns all even numbers below the argument. This gets returned as a table of integers, one integer per row, from the PostgreSQL function. If the RETURNS clause of the function definition said int[] instead of SETOF int, the same function would return the even integers as a single PostgreSQL array.

The next function returns a similar result using a generator, returning both the even number and the odd number following it. Also, notice the different PostgreSQL syntax RETURNS TABLE(...)
used this time for defining the return set:

CREATE FUNCTION even_numbers_from_generator(up_to int)
  RETURNS TABLE (even int, odd int)
AS $$
return ((i, i+1) for i in xrange(0, up_to, 2))
$$ LANGUAGE plpythonu;

The generator is constructed using a generator expression (x for x in <seq>). Finally, the function is defined using a generator with an explicit yield syntax, and yet another PostgreSQL syntax is used for returning SETOF RECORD, with the record structure defined this time by OUT parameters:

CREATE FUNCTION even_numbers_with_yield(
    up_to int,
    OUT even int,
    OUT odd int)
  RETURNS SETOF RECORD
AS $$
for i in xrange(0, up_to, 2):
    yield i, i+1
$$ LANGUAGE plpythonu;

The important part here is that you can use any of the preceding ways to define a PL/Python set-returning function, and they all work the same. Also, you are free to return a mixture of different types for each row of the set:

CREATE FUNCTION birthdates(OUT name text, OUT birthdate date)
  RETURNS SETOF RECORD
AS $$
return (
    {'name': 'bob', 'birthdate': '1980-10-10'},
    {'name': 'mary', 'birthdate': '1983-02-17'},
    ['jill', '2010-01-15'],
)
$$ LANGUAGE plpythonu;

This yields the following result:

hannu=# select * from birthdates();
 name | birthdate
------+------------
 bob  | 1980-10-10
 mary | 1983-02-17
 jill | 2010-01-15
(3 rows)

As you see, the data-returning part of PL/Pythonu is much more flexible than returning data from a function written in PL/pgSQL.

Running queries in the database

If you have ever accessed a database in Python, you know that most database adapters conform to a somewhat loose standard called the Python Database API Specification v2.0, or DBAPI 2 for short. The first thing you need to know about database access in PL/Python is that in-database queries do not follow this API.

Running simple queries

Instead of using the standard API, there are just three functions for doing all database access. The two main ones come in two variants: plpy.execute() for running a query, and plpy.prepare() for turning query text into a query plan, or prepared query. The simplest way to do a query is with:

res = plpy.execute(<query text>, [<row count>])

This takes a textual query and an optional row count, and returns a result object, which emulates a list of dictionaries, one dictionary per row. As an example, if you want to access the field 'name' of the third row of the result, you use:

res[2]['name']

The index is 2 and not 3 because Python lists are indexed starting from 0, so the first row is res[0], the second row res[1], and so on.

Using prepared queries

In an ideal world this would be all that is needed, but plpy.execute(query, cnt) has two shortcomings:

It does not support parameters
The plan for the query is not saved, requiring the query text to be parsed and run through the optimizer at each invocation

We will show a way to properly construct a query string later, but for most uses simple parameter passing is enough. So, the execute(query, [maxrows]) call becomes a set of two statements:

plan = plpy.prepare(<query text>, <list of argument types>)
res = plpy.execute(plan, <list of values>, [<row count>])

For example, to query whether the user 'postgres' is a superuser, use the following:

plan = plpy.prepare("select usesuper from pg_user where usename = $1", ["text"])
res = plpy.execute(plan, ["postgres"])
print res[0]["usesuper"]

The first statement prepares the query, which parses the query string into a query tree, optimizes this tree to produce the best query plan available, and returns the prepared_query object.
The second statement uses the prepared plan to query for a specific user's superuser status. The prepared plan can be used multiple times, so you could continue and check whether user bob is a superuser:

res = plpy.execute(plan, ["bob"])
print res[0]["usesuper"]

Caching prepared queries

Preparing a query can be quite an expensive step, especially for more complex queries where the optimizer has to choose from a rather large set of possible plans, so it makes sense to reuse the result of this step if possible. The current implementation of PL/Python does not automatically cache query plans (prepared queries), but you can do it easily yourself:

try:
    plan = SD['is_super_qplan']
except:
    SD['is_super_qplan'] = plpy.prepare("....
    plan = SD['is_super_qplan']
<the rest of the function>

The values in SD[] and GD[] only live inside a single database session, so it only makes sense to do the caching if you have long-lived connections.
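To make the caching pattern above concrete, here is one way the fragment could be completed into a full function, reusing the pg_user query prepared earlier in this section. Treat it as a sketch: the function name is_super_cached is our own, and the bare except mirrors the fragment above even though catching KeyError explicitly would be more precise:

CREATE OR REPLACE FUNCTION is_super_cached(username text)
  RETURNS boolean
AS $$
# cache the prepared plan in SD[] so it is reused across calls in this session
try:
    plan = SD['is_super_qplan']
except:
    SD['is_super_qplan'] = plpy.prepare(
        "select usesuper from pg_user where usename = $1", ["text"])
    plan = SD['is_super_qplan']
# run the cached plan with the current argument; fetch at most one row
res = plpy.execute(plan, [username], 1)
return res[0]["usesuper"]
$$ LANGUAGE plpythonu;

Called as select is_super_cached('postgres'), it should return t, just like the uncached version shown before.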


Quick start – Creating your first Java application

Packt
12 Aug 2013
14 min read
Cassandra's storage architecture is designed to manage large data volumes and revolves around some important factors:

Decentralized systems
Data replication and transparency
Data partitioning

Decentralized systems are systems that provide maximum throughput from each node. Cassandra offers decentralization by keeping each node with an identical configuration: there are no master-slave relationships between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.

A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes.

Cassandra replication strategies

Cassandra replicates data across the nodes based on the configured replication factor. If the replication factor is 1, one copy of each dataset will be available on one node only. If the replication factor is 2, two copies of each dataset will be available on different nodes in the cluster. Still, Cassandra ensures data transparency, as for an end user data is served from one logical cluster. Cassandra offers two types of replication strategies.

Simple strategy

Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes based on the replication factor, in a clockwise direction. With a replication factor of 3, two more copies of each row will be placed on nearby nodes in a clockwise direction:

Network topology strategy

Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to be placed within each data center. Data colocation and no single point of failure are two important factors that we need to prioritize while configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks for a node in a different rack (in the same data center). If there is no such node, data replicas will be placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of the dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency.

NTS depends on the snitch configuration for proper data replica placement across different data centers. A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch. We can configure the snitch in cassandra.yaml (the configuration file).

Data partitioning

A data partitioning strategy is required to select the node for a given data read/write request. Cassandra offers two types of partitioning strategies.

Random partitioning

Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value (initial_token for a node is defined in cassandra.yaml) generated by a one-way hashing (MD5) algorithm. Each node is assigned an initial token value (to determine its position in the ring) and a data range is assigned to the node.
If a read/write request with a token value (generated from the row key value) lies within the assigned range of a node, then that particular node is responsible for serving the request. The following diagram is a common graphical representation of a number of nodes placed in a circular arrangement, or ring, with the data range evenly distributed between these nodes:

Ordered partitioning

Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load-balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and that the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.

Cassandra write path

Here, we will discuss how Cassandra processes a write request and stores it on disk. As we have mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to eligible nodes based on the cluster's partitioning strategy and replication factor. The write is first recorded in a commit log and then delegated to the corresponding memtables (see the preceding diagram). A memtable is an in-memory table, which serves subsequent read requests without any lookup on disk. For each column family, there is one memtable. Once a memtable is full, its data is flushed to disk in the form of SS tables, asynchronously. Once all the segments are flushed onto the disk, they are recycled. Periodically, Cassandra performs compaction over SS tables (sorted by row keys) and reclaims unused segments. In case of a data node restart (unwanted scenarios such as failover), the commit log is replayed to recover any previously incomplete write requests.

Hands on with the Cassandra command-line interface

Cassandra provides a default command-line interface that is located at:

CASSANDRA_HOME/bin/cassandra-cli.sh when using Linux
CASSANDRA_HOME/bin/cassandra-cli.bat when using Windows

Before we proceed with the sample exercise, let's have a look at the Cassandra schema:

Keyspace: A keyspace may contain multiple column families; similarly, a cluster (made up of multiple nodes) can contain multiple keyspaces.
Column family: A column family is a collection of rows with defined column metadata. Cassandra offers ways to define two types of column families, namely static and dynamic column families.
Static column family: A static column family contains a predefined set of columns with metadata. Please note that a predefined set of columns may exist, but the number of columns can vary across multiple rows within the column family.
Dynamic column family: A dynamic column family generally defines a comparator type and validation class for all columns instead of individual column metadata. The client application is responsible for providing columns for a particular row key, which means the column names and values may differ across multiple row keys.
Column: A column can be thought of as a cell, which contains a name, value, and timestamp.
Super column: A super column is similar to a column and contains a name, value, and timestamp, except that a super column value may contain a collection of columns. Super columns cannot be sorted; however, subcolumns within super columns can be sorted by defining a sub-comparator. Super columns do have some limitations, such as that secondary indexes over super columns are not possible. Also, it is not possible to read a particular super column without deserializing the wrapped subcolumns. Because of such limitations, usage of super columns is highly discouraged within the Cassandra community; using composite columns we can achieve the same functionality. In later articles, we will cover composite columns in detail.
Counter column family: From 0.8 onwards, Cassandra has support for counter columns. Counter columns are useful for applications that, for example, maintain the page count for a website or do aggregation based on a column value from another column family. A counter column is a sort of 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:

In case of events such as disk failure, it is not possible to replay a column family containing counters without reinitializing and removing all the data
Secondary indexes over counter columns are not supported in Cassandra
Frequent insert/delete operations over a counter column in a short period of time may result in inconsistent counter values

There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775), and considering the preceding limitations before opting for counter columns is recommended.

You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in local mode, there is only one node. Once it has successfully started, you should see logs on your console, as follows:

Cassandra-cli: The Cassandra distribution, by default, provides a command-line utility (cassandra-cli), which can be used for basic DDL/DML operations; you can connect to a local/remote Cassandra server instance by specifying the host and port options, as follows:

$CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160

Performing DDL/DML operations on the column family

First, we need to create a keyspace using the create keyspace command, as follows:

The create keyspace command: This operation will create a keyspace cassandraSample with the node placement strategy SimpleStrategy and a replication factor of one. By default, if you don't specify placement_strategy and strategy_options, it will opt for NTS, with replication in one data center:

create keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

We can look for available keyspaces by running the following command:

show keyspaces;

This will result in the following output:

We can always update the keyspace for configurations such as the replication factor. To update the keyspace, do the following:

Modify the replication factor: You can update a keyspace to change the replication factor as well as the placement strategy.
For example, to change the replication factor to 2 for cassandraSample, you simply need to execute the following command:

update keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:2};

Modify the placement strategy: You can change the placement strategy to NTS by executing the following command:

update keyspace cassandraSample
  with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {datacenter1:1};

Strategy options are in the format {datacentername:number of replicas}, and there can be multiple data centers.

After successfully creating a keyspace, and before proceeding with other DDL operations (for example, column family creation), we need to authorize to the keyspace. We authorize to a keyspace using the following command:

use cassandraSample;

Create a column family/super column family as follows:

Use the following command to create the column family users within the cassandraSample keyspace:

create column family users
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type';

To create a super column family suysers, you need to run the following command:

create column family suysers
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and subcomparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and column_type = 'Super'
  and column_metadata = [{column_name: name, validation_class: UTF8Type}];

key_validation_class: defines the datatype for the row key
comparator: defines the datatype for the column name
default_validation_class: defines the datatype for the column value
subcomparator: defines the datatype for subcolumns

You can create/update a column by using the set method, as follows:

// create a column named "username", with a value of "user1" for row key 1
set users[1][username] = user1;
// create a column named "password", with a value of "password1" for row key 1
set users[1][password] = password1;
// create a column named "username", with a value of "user2" for row key 2
set users[2][username] = user2;
// create a column named "password", with a value of "password2" for row key 2
set users[2][password] = password2;

To fetch all the rows and columns from a column family, execute the following commands:

// to list all persisted rows within a column family
list users;
// to fetch a row from the users column family having row key value "1"
get users[1];

You can delete a column as follows:

// to delete the column "username" for row key 1
del users[1][username];

To update the column family, do the following: if you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command:

update column family users
  with key_validation_class=BytesType
  and comparator=UTF8Type
  and column_metadata = [{column_name:password, validation_class:BytesType}];

To drop/truncate the column family, follow the ensuing steps:

Delete all the data rows from the column family users, as follows:

truncate users;

Drop a column family by issuing the following command:

drop column family users;

These are some basic operations that should give you a brief idea of how to create and manage the Cassandra schema.

Cassandra Query Language

Cassandra is schemaless, but CQL is useful when we need data modeling with a traditional RDBMS flavor. Two variants of CQL (2.0 and 3.0) are provided by Cassandra.
We will use CQL3.0 for a quick exercise. We will refer to similar exercises, as we follow with the Cassandra-cli interface. The command to connect with cql is as follows: $CASSANDRA_HOME/bin/cqlsh host port cqlversion You can connect to the localhost and 9160 ports by executing the following command: $CASSANDRA_HOME/bin/cqlsh localhost 9160 -3 After successfully connecting to the command-line CQL client, you can create the keyspace as follows: create keyspace cassandrasample with strategy_class='SimpleStrategy' and strategy_options:replication_factor=1;Update keyspacealter keyspace cassandrasample with strategy_class='NetworkTopologyStrategy' and strategy_options:datacenter=1; Before creating any column family and storing data, we need to authorize such ddl/dml operations to a keyspace (for example, cassandraSample). We can authorize to a keyspace as follows: use cassandrasample; We can always run the describe keyspace command to look into containing column families and configuration settings. We can describe a keyspace as follows: describe keyspace cassandrasample; We will create a users column family with user_id as row key and username and password as columns. To create a column family, such as users, use the following command: create columnfamily users(user_id varchar PRIMARY KEY,username varchar, password varchar); To store a row in the users column family for row key value 1, we will run the following CQL query: insert into users(user_id,username,password) values(1,'user1','password1'); To select all the data from the users column family, we need to execute the following CQL query: select * from users; We can delete a row as well as specific columns using the delete operation. The following command-line scripts are to perform the deletion of a complete row and column age from the users column family, respectively: // delete complete row for user_id=1delete from users where user_id=1; // delete age column from users for row key 1.delete age from users where user_id=1; You can update a column family to add columns and to update or drop column metadata. Here are a few examples: // add a new columnalter columnfamily users add age int; // update column metadataalter columnfamily users alter password type blob; Truncating a column family will delete all the data belonging to the corresponding column family, whereas dropping a column family will also remove the column family definition along with the containing data. We can drop/truncate the column family as follows: truncate users;drop columnfamily users; Dropping a keyspace means instantly removing all the column families and data available within that keyspace.We can drop a keyspace using the following command: drop keyspace cassandrasample; By default, the CQL shell converts the column family and keyspace name to lowercase. You can ensure case sensitivity by wrapping these identifiers within " " . Summary This article showed how to create a Java application using Cassandra. Resources for Article: Further resources on this subject: Getting Started with Apache Cassandra [Article] Apache Cassandra: Working in Multiple Datacenter Environments [Article] Apache Cassandra: Libraries and Applications [Article]
Overview of SQL Server Reporting Services 2012 Architecture, Features, and Tools

Packt
08 Aug 2013
15 min read
(For more resources related to this topic, see here.) Structural design of SQL servers and SharePoint environment Depending on the business and the resources available, the various servers may be located in distributed locations and the Web applications may also be run from Web servers in a farm and the same can be true for SharePoint servers. In this article, by the word architecture we mean the way by which the preceding elements are put together to work on a single computer. However, it is important to know that this is just one topology (an arrangement of constituent elements) and in general it can be lot more complicated spanning networks and reaching across boundaries. The Report Server is the centerpiece of the Reporting Services installation. This installation can be deployed in two modes, namely, Native mode or SharePoint Integrated mode. Each mode has a separate engine and an extensible architecture. It consists of a collection of special-purpose extensions that handle authentication, data processing, rendering, and delivery operations. Once deployed in one mode it cannot be changed to the other. It is possible to have two servers each installed in a different mode. We have installed all the necessary elements to explore the RS 2012 features, including Power View and Data Alerts. The next diagram briefly shows the structural design of the environment used in working with the article: Primarily, SQL Server 2012 Enterprise Edition is used, for both Native mode as well as SharePoint Integrated mode. As we see in the previous diagram, Report Server Native mode is on a named instance HI (in some places another named instance Kailua is also used). This server has the Reporting Services databases ReportServer$HI and ReportServer$HITempDB. The associated Report Server handles Jobs, Security, and Shared Schedules. The Native mode architecture described in the next section is taken from the Microsoft documentation. The tools (SSDT, Report Builder, Report Server Configuration, and so on) connect to the Report Server. The associated SQL Server Agent takes care of the jobs such as subscriptions related to Native mode. The SharePoint Server 2010 is a required element with which the Reporting Services add-in helps to create a Reporting Services Service. With the creation of the RS Service in SharePoint, three SQL Server 2012 databases (shown alongside in the diagram) are created in an instance with its Reporting Services installed in SharePoint Integrated mode. The SQL Server 2012 instance NJ is installed in this fashion. These databases are repositories for report content including those related to Power Views and Data Alerts. The data sources(extension .rsds) used in creating Power View reports (extension.rdlx) are stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f database and the alerting related information is stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f_Alerting database. Not shown is an Express database that is used by the SharePoint Server for its content, administration, and so on. RS_ADD-IN allows you to create the service. You will use the Power Shell tool to create and manage the service. In order to create Power View reports, the new feature in SSRS 2012, you start off creating a data source in SharePoint library. Because of the RS Service, you can enable Reporting Services features such as Report Builder; and associate BISM file extensions to support connecting to tabular models created in SSDT deployed to Analysis Services Server. 
When Reporting Services is installed in SharePoint Integrated mode, SharePoint Web parts will be available to users that allow them to connect to RS Native mode servers to work with reports on the servers from within SharePoint Site. Native mode The following schematic taken from Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms157231.aspx) shows the major components of a Native mode installation: The image shows clearly the several processors that are called into play before a report is displayed. The following are the elements of this processing: Processing extensions(data, rendering, report processing, and authentication) Designing tools(Report Builder, Report Designer) Display devices(browsers) Windows components that do the scheduling and delivery through extensions(Report Server databases, a SQL Server 2012 database, which store everything connected with reports) For the Reporting Services 2012 enabled in Native mode for this article, the following image shows the ReportServer databases and the Reporting Services Server. A similar server HI was also installed after a malware attack. The Report Server is implemented as a Microsoft Windows service called Report Server Service. SharePoint Integrated mode In SharePoint mode, a Report Server must run within a SharePoint Server (even in a standalone implementation). The Report Server processing, rendering, and management are all from SharePoint application server running the Reporting Services SharePoint shared service. For this to happen, at SQL Server installation time, the SharePoint Integrated mode has to be chosen. The access to reports and related operations in this case are from a SharePoint frontend. The following elements are required for SharePoint mode: SharePoint Foundation 2010 or SharePoint Server 2010 An appropriate version of the Reporting Services add-in for SharePoint products A SharePoint application server with a Reporting Services shared service instance and at least one Reporting Services service application The following diagram taken from Microsoft documentation illustrates the various parts of a SharePoint Integrated environment of Reporting Services. Note that the alerting Web service and Power View need SharePoint Integration. The numbered items and their description shown next are also from the same Microsoft document. Follow the link at the beginning of this section. The architectural details presented previously were taken from Microsoft documentation. Item number in the diagram   Description   1   Web servers or Web Frontends (WFE). The Reporting Services add-in must be installed on each Web server from which you want to utilize the Web application feature such as viewing reports or a Reporting Services management page for tasks such as managing data sources and subscriptions.   2   The add-in installs URL and SOAP endpoints for clients to communicate with application servers through the Reporting Services Proxy.   3   Application servers running a shared service. Scale-out of report processing is managed as part of the SharePoint farm and by adding the service to additional application servers.   4 You can create more than one Reporting Services service application with different configurations, including permissions, e-mail, proxy, and subscriptions.   5   Reports, data sources, and other items are stored in SharePoint content databases.   6   Reporting Services service applications create three databases for the Report Server, temp, and data alerting features. 
Configuration settings that apply to all SSRS service applications are stored in RSReportserver.config file.   When you install Reporting Services in SharePoint Integrated mode, several features that you are used to in Native mode will not be available. Some of them are summarized here from the MSDN site: URL access will work but you will have to access SharePoint URL and not Native mode URL. The Native mode folder hierarchy will not work. Custom Security extensions can be used but you need to use the special purpose security extension meant to be used for SharePoint Integration. You cannot use the Reporting Services Configuration Manager (of the Native mode installation).You should use the SharePoint Central Administration shown in this section (for Reporting Services 2008 and 2008 R2). Report Manager is not the frontend; in this case, you should use SharePoint Application pages. You cannot use Linked Reports, My Reports, and My Subscriptions in SharePoint mode. In SharePoint Integrated mode, you can work with Data Alerts and this is not possible in a Native mode installation. Power View is another thing you can do with SharePoint that is not available for Native mode. To access Power View the browser needs Silverlight installed. While reports with RDL extension are supported in both modes, reports with RDLX are only supported in SharePoint mode. SharePoint user token credentials, AAM Zones for internet facing deployments, SharePoint back and recovery, and ULS log support are only available for SharePoint mode. For the purposes of discussion and exercises in this article, a standalone server deployment is used as shown in the next diagram. It must be remembered that there are various other topologies of deployment possible using more than one computer. For a detailed description please follow the link http://msdn.microsoft.com/en-us/library/bb510781(v=sql.105).aspx. The standalone deployment is the simplest, in that all the components are installed on a single computer representative of the installation used for this article. The following diagram taken from the preceding link illustrates the elements of the standalone deployment: Reporting Services configuration For both modes of installation, information for Reporting Services components is stored in configuration files and the registry. During setup the configuration files are copied to the following locations: Native modeC:Program FilesMicrosoft SQL ServerMSRS11.MSSQLSERVER SharePoint Integrated modeC:Program FilesCommon FilesMicrosoft SharedWeb Server Extensions15WebServicesReporting Follow the link http://msdn.microsoft.com/en-us/library/ms155866.aspx for details. Native mode The Report Server Windows Service is an orchestrated set of applications that run in a single process using a single account with access to a single Report Server database with a set of configuration files listed here: Stored in   Description   Location   RSReportServer.config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   <Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   RSMgrPolicy.config   Stores the code access security policies for Report Manager.   
<Installation directory> Reporting Services ReportManager   Web.config for the Report Server Web Service   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportServer   Web.config for Report Manager   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportManager   ReportingServicesService. exe.config   Stores configuration settings that specify the trace levels and logging options for the Report Server Service.   <Installation directory> Reporting Services ReportServer Bin Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. If you are troubleshooting an installation or configuration problem, you can view these settings to get information about how the Report Server is configured.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup and HKEY_ LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL ServerServices ReportServer   RSReportDesigner.config   Stores configuration settings for Report Designer. For more information follow the link http://msdn.microsoft.com/en-us/library/ms160346.aspx   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   RSPreviewPolicy.config   Stores the code access security policies for the server extensions used during report preview.   C:Program Files Microsoft Visual Studio 10.0 Common7IDE PrivateAssembliesr   First is the RSReportServer configuration file which can be found in the installation directory under Reporting Services. The entries in this file control the feature areas of the three components in the previous image, namely, Report Server Web Service, Report Server Service, Report Manager, and background processing. The ReportServer Configuration file has several sections with which you can modify the following features: General configuration settings URL reservations Authentication Service UI Extensions MapTileServerConfiguration (Microsoft Bing Maps SOAP Services that provides a tile background for map report items in the report) Default configuration file for a Native mode Report Server Default configuration file for a SharePoint mode Report Server The three areas previously mentioned (Report Server Web Service, Report Server Service, and Report Manager) all run in separate application domains and you can turn on/off elements that you may or may not need so as to improve security by reducing the surface area for attacks. Some functionality works for all the three components such as memory management and process health. For example, in the reporting server Kailua in this article, the service name is ReportServer$KAILUA. This service has no other dependencies. In fact, you can access the help file for this service when you look at Windows Services in the Control Panels shown. In three of the tabbed pages of this window you can access contextual help. SharePoint Integrated mode The following table taken from Microsoft documentation describes the configuration files used in the SharePoint mode Report Server. Configuration settings are stored in SharePoint Service application databases. Stored in   Description   Location   RSReportServer. config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   
<Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   Web.config for the Report Server Web Service Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. Also stores information about each Reporting Services service application.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup   For example instance ID: MSSQL11.MSSQLSERVER and HKEY_LOCAL_MACHINE SOFTWAREMicrosoft Microsoft SQL Server Reporting Services Service Applications   RSReportDesigner. config   Stores configuration settings for Report Designer.   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   Hands-on exercise 3.1 – modifying the configuration file in Native mode We can make changes to the rsreportserver.config file if changes are required or some tuning has to be done. For example, you may need to change, to accommodate a different e-mail, change authentication, and so on. This is an XML file that can be edited in Notepad.exe (you can also use an XML Editor or Visual Studio). You need to start Notepad with administrator privileges. Turn on/off the Report Server Web Service In this exercise, we will modify the configuration file to turn on/off the Report Server Web Service. Perform the following steps: Start Notepad using Run as Administrator. Open the file at this location (you may use Start Search| for rsreportserver.config) which is located at C:Program FilesMicrosoft SQL ServerMSRS11.KAILUAReporting ServicesReportServerrsreportserver.config. In Edit Find| type in IsWebServiceEnabled. There are two values True/False. If you want to turn off, change TRUE to FALSE. The default is TRUE.Here is a section of the file reproduced: <Service> <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> <PollingInterval>10</PollingInterval> <WindowsServiceUseFileShareStorage>False </WindowsServiceUseFileShareStorage> <MemorySafetyMargin>80</MemorySafetyMargin> <MemoryThreshold>90</MemoryThreshold> <RecycleTime>720</RecycleTime> <MaxAppDomainUnloadTime>30</MaxAppDomainUnloadTime> <MaxQueueThreads>0</MaxQueueThreads> <UrlRoot> </UrlRoot> <UnattendedExecutionAccount> <UserName></UserName> <Password></Password> <Domain></Domain> </UnattendedExecutionAccount> <PolicyLevel>rssrvpolicy.config</PolicyLevel> <IsWebServiceEnabled>True</IsWebServiceEnabled> <IsReportManagerEnabled>True</IsReportManagerEnabled> <FileShareStorageLocation> <Path> </Path> </FileShareStorageLocation> </Service> Save the file to apply changes. Turn on/off the scheduled events and delivery This changes the report processing and delivery. Make changes in the rsreportserver.config file in the following section of <Service/>: <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> The default value for all of the three is TRUE. You can make it FALSE and save the file to apply changes. This can be carried out modifying FACET in SQL Server Management Studio (SSMS), but presently this is not available. Turn on/off the Report Manager Report Manager can be turned off or on by making changes to the configuration file. 
Make a change to the following section in the <Service/>: <IsReportManagerEnabled>True</IsReportManagerEnabled> Again, this change can be made using the Reporting Services Server in its FACET. To change this make sure you launch SQL Server Management Studio as Administrator. In the following sections use of SSMS via Facets is described. Hands-on exercise 3.2 – turn the Reporting Service on/off in SSMS The following are the steps to turn the Reporting Service on/off in SSMS: Connect to Reporting Services_KAILUA in SQL Server Management Studio as the Administrator. Choose HODENTEKWIN7KAILUA under Reporting Services. Click on OK. Right-click on HODENTEKWIN7KAILUA (Report Server 11.0.22180 –HodentekWin7mysorian). Click on Facets to open the following properties page Click on the handle and set it to True or False and click on OK. The default is True. It should be possible to turn Windows Integrated security on or off by using SQL Server Management Studio. However, the Reporting Services Server properties are disabled.
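If you would rather script the IsWebServiceEnabled change from Hands-on exercise 3.1 than edit the file in Notepad, a small DOM-based utility can flip the flag. This is only a sketch, not a supported administration tool: the installation path below is the one used in this article and must be adjusted to your instance, the program must be run with administrator privileges, and you should back up rsreportserver.config first:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ToggleWebService {
    public static void main(String[] args) throws Exception {
        // Adjust MSRS11.KAILUA to match your own named instance
        File config = new File("C:\\Program Files\\Microsoft SQL Server\\"
                + "MSRS11.KAILUA\\Reporting Services\\ReportServer\\rsreportserver.config");

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(config);

        // The <Service> section contains a single IsWebServiceEnabled element
        NodeList nodes = doc.getElementsByTagName("IsWebServiceEnabled");
        if (nodes.getLength() == 1) {
            String current = nodes.item(0).getTextContent().trim();
            nodes.item(0).setTextContent("True".equalsIgnoreCase(current) ? "False" : "True");
        }

        // Write the modified document back in place
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(config));
        System.out.println("IsWebServiceEnabled toggled in " + config.getPath());
    }
}

If the change does not appear to take effect, recycle the Report Server service so that the configuration file is re-read.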

Understanding MapReduce

Packt
07 Aug 2013
18 min read
(For more resources related to this topic, see here.) Key/value pairs Here we will explain why some operations process and provide the output in terms of key/value pair. What it mean Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable. For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date). In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following: Does a given key have a mapping in the data set? What are the values associated with a given key? What is the complete set of keys? We will go into Wordcount in detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows: Keys must be unique but values need not be Each value must be associated with a key, but a key could have no values (though not in this particular example) Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results Note that we need to define carefully what we mean by keys being unique here. This does not mean the key occurs only once; in our data set we may see a key occur numerous times and, as we shall see, the MapReduce model has a stage where all values associated with each key are collected together. The uniqueness of keys guarantees that if we collect together every value seen for any given key, the result will be an association from a single instance of the key to every value mapped in such a way, and none will be omitted. Why key/value data? Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop. Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer. We can have a large number of hosts on which we can store and execute data and even use a framework that manages the division of the larger task into smaller chunks, and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be an expert in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. 
MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets. Some real-world examples To become less abstract, let's think of some real-world data that is key/value pair: An address book relates a name (key) to contact information (value) A bank account uses an account number (key) to associate with the account details (value) The index of a book relates a word (key) to the pages on which it occurs (value) On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values) These examples are intentionally broad in scope, to help and encourage you to think that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us. We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce. MapReduce as a series of key/value transformations You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one looking like this: {K1,V1} -> {K2, List<V2>} -> {K3,V3} We are now in a position to understand what this means: The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1. The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle method. The final output of the MapReduce job is another series of key/value pairs, called K3 and V3 These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job. The Hadoop Java API for MapReduce Hadoop underwent a major API change in its 0.20 release, which is the primary interface in the 1.0 version. Though the prior API was certainly functional, the community felt it was unwieldy and unnecessarily complex in some regards. The new API, sometimes generally referred to as context objects, for reasons we'll see later, is the future of Java's MapReduce development. Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have not been ported to the new API, so we will use the old interfaces when we need to examine any of these. The 0.20 MapReduce Java API The 0.20 and above versions of MapReduce API have most of the key classes and interfaces either in the org.apache.hadoop.mapreduce package or its subpackages. In most cases, the implementation of a MapReduce job will provide job-specific subclasses of the Mapper and Reducer base classes found in this package. We'll stick to the commonly used K1/K2/K3/ and so on terminology, though more recently the Hadoop API has, in places, used terms such as KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will stick with K1/K2/K3 as it helps understand the end-to-end data flow. 
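To make that notation concrete before we look at the API classes, here is a small plain-Java sketch of word counting walked through the three stages; no Hadoop is involved, and the class name and sample lines are ours:

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyValueFlow {
    public static void main(String[] args) {
        String[] lines = {"this is a test", "yes this is a test"};

        // Map stage: each (K1 = line number, V1 = line) emits a series of (K2 = word, V2 = 1)
        List<SimpleEntry<String, Integer>> emitted = new ArrayList<SimpleEntry<String, Integer>>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                emitted.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }

        // Shuffle stage: collect the individual pairs into (K2, List<V2>)
        Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
        for (SimpleEntry<String, Integer> pair : emitted) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }

        // Reduce stage: each (K2, List<V2>) becomes one (K3 = word, V3 = count)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int count = 0;
            for (Integer one : entry.getValue()) {
                count += one;
            }
            System.out.println(entry.getKey() + "\t" + count);
        }
    }
}

In a real job, Hadoop performs the shuffle stage for us across the cluster; the point of the sketch is only to show how the {K1,V1} pairs become {K2, List<V2>} groups and finally {K3,V3} results.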
The Mapper class
This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows:
class Mapper<K1, V1, K2, V2>
{
void map(K1 key, V1 value, Mapper.Context context)
throws IOException, InterruptedException
{..}
}
Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair in its parameters. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method. Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records and the framework is responsible for all the work required to turn an enormous data set into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need of having to write file parsers for any but custom file types. There are three additional methods that sometimes may be required to be overridden.
protected void setup(Mapper.Context context) throws IOException, InterruptedException
This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException
This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.
protected void run(Mapper.Context context) throws IOException, InterruptedException
This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
The Reducer class
The Reducer base class works very similarly to the Mapper class, and usually requires only subclasses to override a single reduce method. Here is the cut-down class definition:
public class Reducer<K2, V2, K3, V3>
{
void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
throws IOException, InterruptedException
{..}
}
Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output) while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method. This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:
protected void setup(Reducer.Context context) throws IOException, InterruptedException
This method is called once before any keys/lists of values are presented to the reduce method. The default implementation does nothing.
protected void cleanup(Reducer.Context context) throws IOException, InterruptedException
This method is called once after all keys/lists of values have been presented to the reduce method. The default implementation does nothing.
protected void run( Reducer.Context context) throws IOException, InterruptedException This method controls the overall flow of processing the task within JVM. The default implementation calls the setup method before repeatedly calling the reduce method for as many key/values provided to the Reducer class, and then finally calls the cleanup method. The Driver class Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is no default parent Driver class as a subclass; the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though you should be able to work out generally what each is doing: public class ExampleDriver { ... public static void main(String[] args) throws Exception { // Create a Configuration object that is used to set other options Configuration conf = new Configuration() ; // Create the object representing the job Job job = new Job(conf, "ExampleJob") ; // Set the name of the main class in the job jarfile job.setJarByClass(ExampleDriver.class) ; // Set the mapper class job.setMapperClass(ExampleMapper.class) ; // Set the reducer class job.setReducerClass(ExampleReducer.class) ; // Set the types for the final output key and value job.setOutputKeyClass(Text.class) ; job.setOutputValueClass(IntWritable.class) ; // Set input and output file paths FileInputFormat.addInputPath(job, new Path(args[0])) ; FileOutputFormat.setOutputPath(job, new Path(args[1])) // Execute the job and wait for it to complete System.exit(job.waitForCompletion(true) ? 0 : 1); } }} Given our previous talk of jobs, it is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations. Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often. There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our WordCount example. There are multiple ways of expressing the format within text files in addition to particularly optimized binary formats. A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution. Writing MapReduce programs We have been using and talking about WordCount for quite some time now; let's actually write an implementation, compile, and run it, and then explore some modifications. 
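One aside on the driver pattern just shown before we move to the hands-on steps: many production jobs write the driver against Hadoop's Tool and ToolRunner helpers so that generic options such as -D configuration overrides are parsed automatically. The following is a hedged variant of the earlier ExampleDriver; ExampleMapper and ExampleReducer are stand-ins for whatever job-specific classes you supply, so this skeleton does not compile without them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleToolDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner
        Job job = new Job(getConf(), "ExampleJob");
        job.setJarByClass(ExampleToolDriver.class);
        job.setMapperClass(ExampleMapper.class);      // hypothetical mapper class
        job.setReducerClass(ExampleReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before handing the rest to run()
        System.exit(ToolRunner.run(new Configuration(), new ExampleToolDriver(), args));
    }
}

You would then launch it with hadoop jar exactly as before, optionally passing -D settings ahead of the input and output paths.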
Time for action – setting up the classpath To compile any Hadoop-related code, we will need to refer to the standard Hadoop-bundled classes. Add the Hadoop-1.0.4.core.jar file from the distribution to the Java classpath as follows: $ export CLASSPATH=.:${HADOOP_HOME}/Hadoop-1.0.4.core.jar:${CLASSPATH} What just happened? This adds the Hadoop-1.0.4.core.jar file explicitly to the classpath alongside the current directory and the previous contents of the CLASSPATH environment variable. Once again, it would be good to put this in your shell startup file or a standalone file to be sourced. We will later need to also have many of the supplied third-party libraries that come with Hadoop on our classpath, and there is a shortcut to do this. For now, the explicit addition of the core JAR file will suffice. Time for action – implementing WordCount We will explore our own Java implementation by performing the following steps: Enter the following code into the WordCount1.java file: Import java.io.* ; import org.apache.hadoop.conf.Configuration ; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount1 { public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { String[] words = value.toString().split(" ") ; for (String str: words) { word.set(str); context.write(word, one); } } } public static class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int total = 0; for (IntWritable val : values) { total++ ; } context.write(key, new IntWritable(total)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "word count"); job.setJarByClass(WordCount1.class); job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Now compile it by executing the following command: $ javac WordCount1.java What just happened? This is our first complete MapReduce job. Look at the structure and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as inner classes. We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now let's look at the preceding code and think of how it realizes the key/value transformations we talked about earlier. The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the line number in the file and the value is the text of that line. 
In reality, you may never actually see a mapper that uses that line number key, but it is provided. The mapper is executed once for each line of text in the input source, and each time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value pair of the form <word, 1>. These are our K2/V2 values. We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this, which we'll not describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of <word, count>. These are our K3/V3 values. Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class takes Object and Text as input and gives Text and IntWritable as output, and the WordCountReducer class takes Text and IntWritable as both input and output. This is quite a common pattern, where the map method emits a series of intermediate data pairs on which the reducer then performs aggregation. The driver is more meaningful here, as we have real values for the parameters. We use the arguments passed to the class to specify the input and output locations.
Time for action – building a JAR file
Before we run our job in Hadoop, we must collect the required class files into a single JAR file that we will submit to the system. Create a JAR file from the generated class files:
$ jar cvf wc1.jar WordCount1*class
What just happened?
We must always package our class files into a JAR file before submitting to Hadoop, be it locally or on Elastic MapReduce. Be careful with the JAR command and file paths. If you include class files from a subdirectory in a JAR file, the classes may not be stored with the path you expect. This is especially common when using a catch-all classes directory into which everything gets compiled. It may be useful to write a script to change into the directory, package the required files into a JAR file, and move the JAR file to the required location.
Time for action – running WordCount on a local Hadoop cluster
Now that we have generated the class files and collected them into a JAR file, we can run the application by performing the following steps:
Submit the new JAR file to Hadoop for execution:
$ hadoop jar wc1.jar WordCount1 test.txt output
Check the output file; it should be as follows:
$ hadoop fs -cat output/part-r-00000
This 1
yes 1
a 1
is 2
test 1
this 1
What just happened?
This is the first time we have used the hadoop jar command with our own code. There are four arguments:
The name of the JAR file.
The name of the driver class within the JAR file.
The location, on HDFS, of the input file (a relative reference to the /user/hadoop home folder, in this case).
The desired location of the output folder (again, a relative path).
The name of the driver class is only required if a main class has not (as in this case) been specified within the JAR file manifest.
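As a first modification to explore, consider adding a combiner, which pre-aggregates map output on each node before the shuffle. Note that the WordCountReducer shown earlier increments a counter for each element rather than summing the values, which is only correct when every value is 1; once a combiner has produced partial counts, the reducer must sum them. A hedged, combiner-safe variant (the class name is ours) looks like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total += val.get();   // sum the values instead of counting the elements
        }
        context.write(key, new IntWritable(total));
    }
}

In the WordCount1 driver you would then register the class for both roles:

job.setReducerClass(WordCountSumReducer.class);
job.setCombinerClass(WordCountSumReducer.class);

The output is unchanged, but on larger inputs far less intermediate data crosses the network during the shuffle.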

So, what is MongoDB?

Packt
02 Aug 2013
6 min read
(For more resources related to this topic, see here.) What is a document? While it may vary for various implementations of different Document Oriented Databases available, as far as MongoDB is concerned it is a BSON document, which stands for Binary JSON. JSON (JavaScript Object Notation) is an open standard developed for human readable data exchange. Though a thorough knowledge of JSON is not really important to understand MongoDB, for keen readers the URL to its RFC is http://tools.ietf.org/html/rfc4627. Also, the BSON specification can be found at http://bsonspec.org/. Since MongoDB stores the data as BSON documents, it is a Document Oriented Database. What does a document look like? Consider the following example where we represent a person using JSON: {"firstName":"Jack","secondName":"Jones","age":30,"phoneNumbers":[{fixedLine:"1234"},{mobile:"5678"}],"residentialAddress":{lineOne:"…",lineTwo:"…",city:"…",state:"…",zip:"…",country:"…"}} As we can see, a JSON document always starts and ends with curly braces and has all the content within these braces. Multiple fields and values are separated by commas, with a field name always being a string value and the value being of any type ranging from string, numbers, date, array, another JSON document, and so on. For example in "firstName":"Jack", the firstName is the name of the field whereas Jack is the value of the field. Need for MongoDB Many of you would probably be wondering why we need another database when we already have good old relational databases. We will try to see a few drivers from its introduction back in 2009. Relational databases are extremely rich in features. But these features don't come for free; there is a price to pay and it is done by compromising on the scalability and flexibility. Let us see these one by one. Scalability It is a factor used to measure the ease with which a system can accommodate the growing amount of work or data. There are two ways in which you can scale your system: scale up, also known as scale vertically or scale out, also known as scale horizontally. Vertical scalability can simply be put up as an approach where we say "Need more processing capabilities? Upgrade to a bigger machine with more cores and memory". Unfortunately, with this approach we hit a wall as it is expensive and technically we cannot upgrade the hardware beyond a certain level. You are then left with an option to optimize your application, which might not be a very feasible approach for some systems which are running in production for years. On the other hand, Horizontal scalability can be described as an approach where we say "Need more processing capabilities? Simple, just add more servers and multiply the processing capabilities". Theoretically this approach gives us unlimited processing power but we have more challenges in practice. For many machines to work together, there would be a communication overhead between them and the probability of any one of these machines being down at a given point of time is much higher. MongoDB enables us to scale horizontally easily, and at the same time addresses the problems related to scaling horizontally to a great extent. The end result is that it is very easy to scale MongoDB with increasing data as compared to relational databases. Ease of development MongoDB doesn't have the concept of creation of schema as we have in relational databases. The document that we just saw can have an arbitrary structure when we store them in the database. 
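For instance, here is a hedged sketch, using the legacy 2.x MongoDB Java driver, that stores a trimmed-down version of the person document shown earlier; the database and collection names are ours, and notice that nothing resembling a CREATE TABLE statement is needed first:

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class PersonExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("test");                     // created lazily on first use
        DBCollection people = db.getCollection("people"); // likewise, no schema to declare

        // The person document from the JSON example, built field by field
        DBObject person = new BasicDBObject("firstName", "Jack")
                .append("secondName", "Jones")
                .append("age", 30)
                .append("phoneNumbers", Arrays.asList(
                        new BasicDBObject("fixedLine", "1234"),
                        new BasicDBObject("mobile", "5678")));

        people.insert(person);
        System.out.println(people.findOne(new BasicDBObject("firstName", "Jack")));

        client.close();
    }
}

The findOne call at the end simply confirms that the document round-trips; in the newer 3.x drivers the same idea is expressed with MongoDatabase, MongoCollection, and Document instead.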
This feature makes it very easy for us to model and store relatively unstructured/complex data, which is difficult to model in a relational database; for example, the product catalogue of an e-commerce application containing various items, each with different attributes. Also, it is more natural to use JSON in application development than tables from the relational world.
Ok, it looks good, but what is the catch? Where not to use MongoDB?
To achieve the goal of letting MongoDB scale out easily, it had to do away with features like joins and multi-document/distributed transactions. Now, you must be wondering whether it is pretty useless, as we have taken away two of the most important features of relational databases. However, mitigating the need for joins is one of the reasons why MongoDB is document oriented. If you look at the preceding JSON document for the person, we have the address and the phone number as part of the document. In a relational database, these would have been in separate tables and retrieved by joining those tables together. Distributed/multi-document transactions would inhibit MongoDB from scaling out and hence are not supported, nor is there a way to mitigate this. MongoDB is still atomic, but atomicity for inserts and updates is guaranteed at the document level and not across multiple documents. Hence, MongoDB is not a good fit for scenarios where complex transactions are needed, such as OLTP banking applications. This is an area where good old relational databases still rule.
To conclude, let us take a look at the following image. This graph is pretty interesting and was presented by Dwight Merriman, Founder and CEO of 10gen (the MongoDB company), in one of his online courses. As we can see, on one side we have products like Memcached, which are very low on functionality but high on scalability and performance. On the other end we have RDBMS (Relational Database Management Systems), which are very rich in features but not that scalable. According to the research done while developing MongoDB, this graph is not linear and there is a point after which scalability and performance fall steeply on adding more features to the product. MongoDB sits at this point, where it gives the maximum possible features without compromising too much on scalability and performance.
Summary
In this article, we saw the features offered by MongoDB, what a document looks like, and how it is better than relational databases.
Resources for Article:
Further resources on this subject:
Building a Chat Application [Article]
Ruby with MongoDB for Web Development [Article]
Comparative Study of NoSQL Products [Article]

Using Oracle GoldenGate

Packt
02 Aug 2013
15 min read
(For more resources related to this topic, see here.) Creating one-way replication (Simple) Here we'll be utilizing the demo scripts included in the OGG software distribution to implement a basic homogenous (Oracle-to-Oracle) replication. Getting ready You need to ensure your Oracle database is in archivelog mode. If your database is not in archivelog mode, you won't be able to recover your database due to media corruption or user errors. How to do it... The steps for creating one-way replication are as follows: Check whether supplemental logging is enabled on your source database using the following command: SQL> select supplemental_log_data_min from v$database; The output of the preceding command will be as follows: SUPPLEME-----------------NO Enable supplemental logging using the following command: SQL> alter database add supplemental log data;SQL> select supplemental_log_data_min from v$database; The output of the preceding command will be as follows: SUPPLEME-----------------YES Let's run the demo script to create a couple of tables in the scott schema. You need to know the scott schema password, which is tiger by default. We do it using following command: $ cd /u01/app/oracle/gg$ ./ggsci$ sqlpus scottEnter password:SQL> @demo_ora_create.sql The output of the preceding command will be as follows: DROP TABLE tcustmer*ERROR at line 1:ORA-00942: table or view does not existTable created.DROP TABLE tcustord*ERROR at line 1:ORA-00942: table or view does not existTable created. You must add the checkpoint table, do it as follows: $ cd /u01/app/oracle/gg$ vi GLOBALS Add the following entry to the file: CheckPointTable ogg.chkpt Save the file and exit. Next create the checkpoint table using the following command: $ ./ggsciGGSCI> add checkpointtableGGSCI> info checkpointtable The output of the preceding command will be as follows: No checkpoint table specified, using GLOBALS specification (ogg.chkpt)...Checkpoint table ogg.chkpt created 2012-10-31 12:39:38. Set up the MANAGER parameter file using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi mgr.prm Add the following lines to the file: PORT 7809DYNAMICPORTLIST 7810-7849AUTORESTART er *, RETRIES 6, WAITMINUTES 1, RESETMINUTES 10PURGEOLDEXTRACTS /u01/app/oracle/gg/dirdat/*, USECHECKPOINTS,MINKEEPDAYS 2 Save the file and exit. Start the manager using the following command: $ cd /u01/app/oracle/gg$ ggsciGGSCI> start mgrGGSCI> info mgr The output of the preceding command will be as follows: GGSCI> info allProgram Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNING Create a TNS entry in the database home so that the extract can connect to the Automatic Storage Management (ASM) instance, using the following command: $ cd $ORACLE_HOME/network/admin$ vi tnsnames.ora Add the following TNS entry: ASMGG =(DESCRIPTION =(ADDRESS =(PROTOCOL = IPC)(key=EXTPROC1521))(CONNECT_DATA=(SID=+ASM))) Save the file and exit. Create a user asmgg with the sysdba role in the ASM instance. Connect to the ASM instance as sys user using the following command: $ sqlplus sys/<password>@asmgg as sysasm The output of the preceding command will be as follows: SQL*Plus: Release 11.2.0.3.0 Production on Thu Nov 15 14:24:202012Copyright (c) 1982, 2011, Oracle. All rights reserved.Connected to:Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bitProductionWith the Automatic Storage Management option The user is created using the following command: SQL> create user asmgg identified by asmgg ; We will get the following output message: User created. 
Provide the sysdba role to the user ASMGG using the following command: SQL> grant sysdba to asmgg ; We will get the following output message: Grant succeeded. Let's add supplemental logging to the source tables using the following commands: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> add trandata scott.tcustmer The output will be as follows: Logging of supplemental redo data enabled for table SCOTT.TCUSTMER. Then type the following command: GGSCI> add trandata scott.tcustord The output message will be as follows: Logging of supplemental redo data enabled for table SCOTT.TCUSTORD. The next command to be executed is: GGSCI> info trandata scott.tcustmer The output message will be as follows: Logging of supplemental redo log data is disabled for table OGG.TCUSTMER. The next command to be used is: GGSCI> info trandata scott.tcustord The output will be as follows: Logging of supplemental redo log data is disabled for table OGG.TCUSTORD. Create the extract parameter file for data capture using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi ex01sand.prm Add the following lines to the file: EXTRACT ex01sandSETENV (ORACLE_SID="SRC100")SETENV (ORACLE_HOME="/u01/app/oracle/product/11.2.0/db_1")SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")USERID ogg, PASSWORD oggTRANLOGOPTIONS EXCLUDEUSER oggTRANLOGOPTIONS ASMUSER asmgg@ASMGG ASMPASSWORD asmgg-- Trail File location locallyEXTTRAIL /u01/app/oracle/gg/dirdat/prDISCARDFILE /u01/app/oracle/gg/dirrpt/ex01sand.dsc, PURGEDISCARDROLLOVER AT 01:00 ON SUNDAYTABLE SCOTT.TCUSTMER ;TABLE SCOTT.TCUSTORD ; Save the file and exit. Let's add the Extract process and start it. We do it by using the following command: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> add extract ex01sand tranlog begin now The output of the preceding command will be as follows: EXTRACT added. The following command adds the location of the trail files and size for each trail created: GGSCI> add exttrail /u01/app/oracle/gg/dirdat/pr extract ex01sandmegabytes 2 The output of the preceding command will be as follows: EXTTRAIL added.GGSCI> start ex01sandSending START request to MANAGER ...EXTRACT EX01SAND startingGGSCI> info allProgram Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGEXTRACT RUNNING EX01SAND 00:00:00 00:00:06 Next we'll create the data pump parameter file using the following command: $ cd /u01/app/oracle/gg/dirprm$ vi pp01sand.prm Add the following lines to the file: EXTRACT pp01sandPASSTHRURMTHOST hostb MGRPORT 7820RMTTRAIL /u01/app/oracle/goldengate/dirdat/rpDISCARDFILE /u01/app/oracle/gg/dirrpt/pp01sand.dsc, PURGE-- Tables for transportTABLE SCOTT.TCUSTMER ;TABLE SCOTT.TCUSTORD ; Save the file and exit. Add the data pump process and final configuration on the source side as follows: GGSCI> add extract pp01sand exttrailsource /u01/app/oracle/gg/dirdat/pr The output of the preceding command will be as follows: EXTRACT added. The following command points the pump to drop the trail files to the remote location: GGSCI> add rmttrail /u01/app/oracle/goldengate/dirdat/rp extractpp01sand megabytes 2 The output of the preceding command will be as follows: RMTTRAIL added Then we execute the following command: GGSCI> info all The output of the preceding command will be as follows: Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGEXTRACT RUNNING EXPR610 00:00:00 00:00:05EXTRACT STOPPED PP01SAND 00:00:00 00:00:55 We're not going to start the data pump (pump) at this point since the manager does not yet exist at the target site. 
Perform the following actions on the target server. We've now completed most of our steps on the source system. We'll have to come back to the source server to start the pump a little later. Now, we'll move on to our target server where we'll have to set up the Replicat process in order to receive and apply the changes received from the source database. Perform the following actions on the target database: Create tables on the target host using the following command: $ cd /u01/app/oracle/goldengate$ sqlplus scott/tigerSQL> @demo_ora_create.sql The output of the preceding command will be as follows: DROP TABLE tcustmer*ERROR at line 1:ORA-00942: table or view does not existTable created.DROP TABLE tcustord*ERROR at line 1:ORA-00942: table or view does not existTable created. Let's add the checkpoint table as a global parameter using the following command: $ cd /u01/app/oracle/goldengate$ vi GLOBALS Add the following line to the file: CheckPointTable ogg.chkpt Save the file and exit. Create the checkpoint table using the following command: $ cd ..$ ./ggsciGGSCI> dblogin userid ogg password oggGGSCI> add checkpointtable Then execute the following command: $ cd /u01/app/oracle/goldengate/dirprm$ vi mgr.prm Add the following lines to the file: PORT 7820DYNAMICPORTLIST 7821-7849AUTORESTART er *, RETRIES 6, WAITMINUTES 1, RESETMINUTES 10PURGEOLDEXTRACTS /u01/app/oracle/goldengate/dirdat/*,USECHECKPOINTS, MINKEEPFILES 2 Save the file and exit Start the manager using the following command: $ cd /u01/app/oracle/goldengate$ ./ggsciGGSCI> start mgrGGSCI> info mgrGGSCI> info all We will get the following output: Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNING Edit the parameter file using the following command, now we're ready to create the replicat parameter file: $ cd /u01/app/oracle/goldengate/dirprm$ vi re01sand.prm Add the following lines to the file: REPLICAT re01sandSETENV (ORACLE_SID="TRG101")SETENV (ORACLE_HOME="/u01/app/oracle/product/11.1.0/db_1")SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")USERID ogg PASSWORD oggDISCARDFILE /u01/app/oracle/goldengate/dirrpt/re01sand.dsc, APPENDDISCARDROLLOVER at 01:00ReportCount Every 30 Minutes, RateREPORTROLLOVER at 01:30DBOPTIONS DEFERREFCONSTASSUMETARGETDEFSMAP SCOTT.TCUSTMER , TARGET SCOTT.TCUSTMER ;MAP SCOTT Save the file and exit. We now add and start the Replicat process using the following commands: $ cd .. The following extrail location must match exactly as in the pump's rmttrail location on the source server: $ ./ggsciGGSCI> add replicat re01sand exttrail /u01/app/oracle/goldengate/dirdat/rp checkpointtable ogg.chkptGGSCI> start re01sand The output of the preceding command will be as follows: Sending START request to MANAGER ...REPLICAT RE01SAND starting Then we execute the following command: GGSCI> info all The output of the preceding command will be as follows:` Program Status Group Lag at Chkpt Time Since ChkptMANAGER RUNNINGREPLICAT RUNNING RE01SAND 00:00:00 00:00:01 Let's go back to the source host and start the pump using the following command: $ cd /u01/app/oracle/gg$ ./ggsciGGSCI> start pp01sand The output of the preceding command will be as follows: Sending START request to MANAGER ...EXTRACT PP01SAND starting Next we use the demo insert script to add rows to source tables that should replicate to the target tables. 
We can do it using the following commands:

$ cd /u01/app/oracle/gg
$ sqlplus scott/tiger
SQL> @demo_ora_insert

The output of the preceding command will be as follows:

1 row created.
1 row created.
1 row created.
1 row created.
Commit complete.

To verify that the 4 rows just created have been captured at the source, use the following commands:

$ ./ggsci
GGSCI> stats ex01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to EXTRACT EX01SAND ...
Start of Statistics at 2012-11-30 20:22:37.
Output to /u01/app/oracle/gg/dirdat/pr:
… truncated for brevity
*** Latest statistics since 2012-11-30 20:17:38 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00

To verify whether the pump has shipped the changes to the target server, use the following command:

GGSCI> stats pp01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to EXTRACT PP01SAND ...
Start of Statistics at 2012-11-30 20:24:56.
Output to /u01/app/oracle/goldengate/dirdat/rp:
Cumulative totals for specified table(s):
… cut for brevity
*** Latest statistics since 2012-11-30 20:18:14 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00
End of Statistics.

Finally, to verify that the changes have been applied at the target, the next command is performed on the target server:

$ ./ggsci
GGSCI> stats re01sand totalsonly scott.*

The output of the preceding command will be as follows:

Sending STATS request to REPLICAT RE01SAND ...
Start of Statistics at 2012-11-30 20:28:01.
Cumulative totals for specified table(s):
...
*** Latest statistics since 2012-11-30 20:18:20 ***
Total inserts        4.00
Total updates        0.00
Total deletes        0.00
Total discards       0.00
Total operations     4.00
End of Statistics.

How it works...

Supplemental logging must be turned on at the database level and subsequently at the table level as well, for those tables you would like to replicate. For one-way replication, this is done on the source tables. There isn't a need to turn on supplemental logging at the target site if the target site, in turn, is not a source to other targets or to itself.

A database user ogg is created in order to administer the OGG schema. This user is used solely for the purpose of administering OGG in the database.

Checkpoints are needed by both the source and target servers; these are structures that persist to disk as a known position in the trail file. You would restart from these after an expected or unexpected shutdown of the OGG processes.

The PORT parameter in the mgr.prm file specifies the port to which the MGR should bind and start listening for connection requests. If the manager is down, then connections can't be established and you'll receive TCP connection errors. The only required parameter is the port number itself. Also, the PURGEOLDEXTRACTS parameter is a nice way to keep your trail files to a minimum, so that they don't accumulate indefinitely and eventually run out of space in your filesystem. In this example, we're asking the manager to purge trail files while keeping at least the two most recent files on disk.

If your Oracle database is using an ASM instance, then OGG needs to establish a connection to the ASM instance in order to read the online-redo logs. You must ensure that you either use the sys schema or create a user (such as asmgg) with SYSDBA privileges for authentication.
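Coming back to the first point above, supplemental logging at the database level can be confirmed and, if necessary, enabled directly from SQL*Plus. The following is only a sketch, not part of the original recipe; run it as a suitably privileged user on the source database:

-- Check whether minimal supplemental logging is already enabled
SELECT supplemental_log_data_min FROM v$database;

-- If the query reports NO, enable it at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;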
Since we need supplemental logging at the table level, add trandata does precisely this.

Now we'll focus on some of the EXTRACT (ex01sand) data capture parameters. For one thing, you'll notice that we need to supply the extract with credentials to the database and the ASM instance in order to scan the online-redo logs for committed transactions. The following lines tell OGG to exclude the user ogg from capture. The second TRANLOGOPTIONS is how the extract authenticates to the ASM instance.

USERID ogg, PASSWORD ogg
TRANLOGOPTIONS EXCLUDEUSER ogg
TRANLOGOPTIONS ASMUSER asmgg@ASMGG ASMPASSWORD asmgg

If you're using a later version of Oracle 10gR2, or Oracle 11.2.0.2 and later, you could use the newer ASM API option TRANLOGOPTIONS DBLOGREADER rather than ASMUSER. The API uses the database connection, rather than a connection to the ASM instance, to read the online-redo logs.

The following two lines in the extract tell it where to place the trail files, with a prefix of pr followed by 6 digits that increment each time a file rolls over to the next generation. The DISCARDFILE by convention has the same name as the extract, but with the extension .dsc for discard. If, for any reason, OGG can't capture a transaction, it will write the text and SQL to this file for later investigation.

EXTTRAIL /u01/app/oracle/gg/dirdat/pr
DISCARDFILE /u01/app/oracle/gg/dirrpt/ex01sand.dsc, PURGE

Tables or schemas are captured with the following syntax in the extract file:

TABLE SCOTT.TCUSTMER ;
TABLE SCOTT.TCUSTORD ;

The specification can vary and use wildcards as well. Say you want to capture the entire schema; you could specify this as TABLE SCOTT.* ;.

In the following code, the first command adds the extract with the option tranlog begin now, telling OGG to start capturing changes from the online-redo logs as of now. The second command tells the extract where to store the trail files, with a size not exceeding 2 MB each.

GGSCI> add extract ex01sand tranlog begin now
GGSCI> add exttrail /u01/app/oracle/gg/dirdat/pr extract ex01sand megabytes 2

Now, the PUMP (data pump; pp01sand) is an optional, but highly recommended, extract whose sole purpose is to perform all of the TCP/IP activity; for example, transporting the trail files to the target site. This is beneficial because it relieves the capture process of any TCP/IP work. The parameters in the following snippet tell the pump to send the data as is with the PASSTHRU parameter. This is the optimal and preferred method if there isn't any data transformation along the way. The RMTHOST parameter specifies the destination host and the port on which the remote manager is listening, for example, port 7820. If the manager is not running on that port at the target, the destination host will refuse the connection; that is why we did not start the pump earlier during our work on the source host.

PASSTHRU
RMTHOST hostb MGRPORT 7820
RMTTRAIL /u01/app/oracle/goldengate/dirdat/rp

The RMTTRAIL parameter specifies where the trail files will be stored at the remote host, with a prefix of rp followed by a 6-digit number that increases sequentially as the files roll over once the specified size has been reached. Finally, at the destination host, hostb, the Replicat process (re01sand) is the applier where the SQL is replayed in the target database.
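Pulling the schema wildcard and the DBLOGREADER option together, an alternative version of the capture parameter file might look like the following sketch. It is not part of the recipe: the group name ex02sand, the px trail prefix, and the discard file name are invented for illustration, and it assumes Oracle 11.2.0.2 or later so that DBLOGREADER can replace the ASM connection.

EXTRACT ex02sand
SETENV (ORACLE_SID="SRC100")
USERID ogg, PASSWORD ogg
-- Don't capture OGG's own work
TRANLOGOPTIONS EXCLUDEUSER ogg
-- Read the online-redo logs through the database connection instead of ASM
TRANLOGOPTIONS DBLOGREADER
EXTTRAIL /u01/app/oracle/gg/dirdat/px
DISCARDFILE /u01/app/oracle/gg/dirrpt/ex02sand.dsc, PURGE
-- Wildcard: capture every table in the SCOTT schema
TABLE SCOTT.* ;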
The following two lines in the Replicat parameter file (re01sand.prm) specify how the Replicat maps source data to target tables as it arrives by way of the trail files:

MAP SCOTT.TCUSTMER , TARGET SCOTT.TCUSTMER ;
MAP SCOTT.TCUSTORD , TARGET SCOTT.TCUSTORD ;

The target tables don't necessarily have to use the same schema name as in the preceding example; the data could have been applied to a different schema altogether if that was the requirement.

Summary

In this article we learned about the creation of one-way replication using Oracle GoldenGate.

Resources for Article:

Further resources on this subject:

Oracle GoldenGate 11g: Configuration for High Availability [Article]

Getting Started with Oracle GoldenGate [Article]

Oracle GoldenGate: Considerations for Designing a Solution [Article]

Making a simple cURL request (Simple)

Packt
01 Aug 2013
5 min read
(For more resources related to this topic, see here.)

Getting ready

In this article we will use cURL to request and download a web page from a server.

How to do it...

Enter the following code into a new PHP project:

<?php
// Function to make a GET request using cURL
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session
    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session
    return $results; // Return the results
}

$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');
echo $packtPage;
?>

Save the project as 2-curl-request.php (ensure you use the .php extension!). Execute the script. Once our script has completed, we will see the source code of http://www.packtpub.com/oop-php-5/book displayed on the screen.

How it works...

Let's look at how we performed the previously defined steps:

The first line, <?php, and the last line, ?>, indicate where our PHP code block will begin and end. All the PHP code should appear between these two tags.

Next, we create a function called curlGet(), which accepts a single parameter $url, the URL of the resource to be requested.

Running through the code inside the curlGet() function, we start off by initializing a new cURL session as follows:

$ch = curl_init();

We then set our options for cURL as follows:

curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Tells cURL to return the results of the request (the source code of the target page) as a string.
curl_setopt($ch, CURLOPT_URL, $url); // Here we tell cURL the URL we wish to request; notice that it is the $url variable that we passed into the function as a parameter.

We execute our cURL request, storing the returned string in the $results variable as follows:

$results = curl_exec($ch);

Now that the cURL request has been made and we have the results, we close the cURL session by using the following code:

curl_close($ch);

At the end of the function, we return the $results variable containing our requested page, so it can be used in the rest of our script.

return $results;

After the function is closed, we are able to use it throughout the rest of our script. Later, deciding on the URL we wish to request, http://www.packtpub.com/oop-php-5/book, we execute the function, passing the URL as a parameter and storing the data returned from the function in the $packtPage variable as follows:

$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');

Finally, we echo the contents of the $packtPage variable (the page we requested) to the screen by using the following code:

echo $packtPage;

There's more...

There are a number of different HTTP request methods, which indicate to the server the desired response, or the action to be performed. The request method being used in this article is cURL's default GET request. This tells the server that we would like to retrieve a resource.

Depending on the resource we are requesting, a number of parameters may be passed in the URL. For example, when we perform a search on the Packt Publishing website for a query, say, php, we notice that the URL is http://www.packtpub.com/books?keys=php. This is requesting the resource books (the page that displays search results) and passing a value of php to the keys parameter, indicating that the dynamically generated page should show results for the search query php.

More cURL Options

Of the many cURL options available, only two have been used in our preceding code.
They are CURLOPT_RETURNTRANSFER and CURLOPT_URL. Though we will cover many more throughout the course of this article, some other options to be aware of, which you may wish to try out, are listed below:

CURLOPT_FAILONERROR (TRUE or FALSE): If a response code greater than 400 is returned, cURL will fail silently.

CURLOPT_FOLLOWLOCATION (TRUE or FALSE): If Location: headers are sent by the server, follow the location.

CURLOPT_USERAGENT (a user agent string, for example: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1'): Sending a user agent string in your request informs the target server which client is requesting the resource. Since many servers will only respond to 'legitimate' requests, it is advisable to include one.

CURLOPT_HTTPHEADER (an array containing header information, for example: array('Cache-Control: max-age=0', 'Connection: keep-alive', 'Keep-Alive: 300', 'Accept-Language: en-us,en;q=0.5')): This option is used to send header information with the request, and we will come across use cases for this in later recipes.

A full listing of cURL options can be found on the PHP website at http://php.net/manual/en/function.curl-setopt.php.

The HTTP response code

An HTTP response code is the number that is returned, which corresponds with the result of an HTTP request. Some common response code values are as follows:

200: OK
301: Moved Permanently
400: Bad Request
401: Unauthorized
403: Forbidden
404: Not Found
500: Internal Server Error

It is often useful to have our scrapers respond to different response code values in different ways, for example, letting us know if a web page has moved, is no longer accessible, or we are unauthorized to access a particular page. In this case, we can access the response code of a request using cURL by adding the following line to our function, which will store the response code in the $httpResponse variable:

$httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE);

Summary

This article covers techniques on making a simple cURL request.

Resources for Article:

Further resources on this subject:

A look into the high-level programming operations for the PHP language [Article]

Installing PHP-Nuke [Article]

Creating Your Own Theme—A Wordpress Tutorial [Article]
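As a closing illustration for this recipe, the following sketch pulls several of those pieces together: a variant of curlGet() that sets a user agent, follows redirects, and returns the HTTP response code alongside the page body. The function name curlGetWithStatus() and the structure of the return value are invented for this example rather than taken from the recipe.

<?php
// A variant of curlGet() that also reports the HTTP response code.
function curlGetWithStatus($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // Return the page as a string
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);  // Follow Location: redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1');
    $body = curl_exec($ch);
    $httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE); // Status of the final response
    curl_close($ch);
    return array('code' => $httpResponse, 'body' => $body);
}

$result = curlGetWithStatus('http://www.packtpub.com/oop-php-5/book');
if ($result['code'] == 200) {
    echo $result['body'];
} else {
    echo 'Request failed with HTTP response code ' . $result['code'];
}
?>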

Participating in a business process (Intermediate)

Packt
31 Jul 2013
5 min read
(For more resources related to this topic, see here.)

The hurdles and bottlenecks for financial services from an IT point of view are:

Silos of data
Outdated IT systems, with many applications running on legacy and non-standards-based systems
Business process and reporting systems not in sync with each other
Lack of real-time data visibility
Automated decision making
Ability to change and manage business processes in accordance with changes in business dynamics
Partner management
Customer satisfaction

This is where BPM plays a key role in bridging the gap between key business requirements and these technology and business hurdles. In a real-life scenario, a typical home loan use case would be tied to the Know Your Customer (KYC) regulatory requirement. In India, for example, the Reserve Bank of India (RBI) has issued guidelines that make it mandatory for banks to properly know their customers. RBI mandates that banks collect their customers' proof of identity, recent photographs, and Income Tax PAN. Proof of residence can be a voter card, a driving license, or a passport copy.

Getting ready

We start with the source code from the previous recipe. We will add a re-usable e-mail or SMS notification process. It is always a best practice to create a separate process if it is called multiple times from the same process. This can be a subprocess within the main process itself, or it can be part of the same composite outside the main process.

We will add a new regulatory requirement that allows the customer to add KYC documents such as a photo, proof of address, and an Income Tax PAN copy as attachments that will be checked into the WebCenter Content repository. These checks become part of the customer verification stage before finance approval. We will make KYC a subprocess, with scope for expansion under different scenarios. We will also save the process data to a filesystem or a JMS messaging queue at the end of the loan process. In a banking scenario, this can also be the integration stage for other applications, such as a CRM application.

How to do it…

Let's perform the following steps:

Launch JDeveloper and open the composite.xml of LoanApplicationProcess in the Design view.

Drag-and-drop a new BPMN Process component from the Component Palette. Create the Send Notifications process next to the existing LoanApplicationProcess, and edit the new process. The Send Notifications process will take To e-mail ID, From e-mail ID, Subject, and CC as input parameters and send an e-mail to the given e-mail ID.

Similarly, we will drag-and-drop a File Adapter component from the Component Palette that saves the customer data into a file. We place this component at the end of the LoanApplication process, just before the End activity.

We will use this notification service to notify Verification Officers about the arrival of a new eligible application that needs to be verified.

In the Application Verification Officer stage, we will add a subprocess, KYC, that will be assigned to the loan initiator (James Cooper in our case). This will be preceded by an e-mail notification to the applicant asking for KYC details such as PAN number, scanned photograph, and voter ID, as requested by the Verification Officers.

Now, let us implement Save Loan Application by invoking the File Adapter service. The e-mail notification services are also available out of the box.
How it works…

The outputs of this recipe are re-usable services, such as the notification service, that can be used across multiple service calls. This recipe also demonstrates how to use subprocesses and how to change the process to meet regulatory requirements.

Let's understand the output by walking through our use case scenario:

When the process is initiated, the e-mail notification gets triggered at the appropriate stages of the process. Conan Doyle and John Steinbeck will get the e-mail requesting them to process the application, with the required information about the applicant, along with the link to BPM Workspace.

The KYC task also sends an e-mail to James Cooper, requesting the documents required for the KYC check. James Cooper logs in to the James Bank WebCenter Portal and sees that there is a task assigned to him to upload his KYC details. James Cooper clicks on the task link, submits the required soft copy documents, and gets them checked into the content repository once the form is submitted.

The start-to-end process flow now looks as follows:

Summary

BPM Process Spaces, which is an extension template of BPM, allows process and task views to be exposed to WebCenter Portal. The advantage of having Process Spaces available within the Portal is that users can collaborate with others using out-of-the-box Portal features such as wikis, discussion forums, blogs, and content management. This improves productivity, as the user need not log in to different applications for different purposes; all the required data and information will be available within the Portal environment. It is also possible to expose some WSRP-supported application portlets (for example, HR portlets from PeopleSoft) in a corporate portal environment. All of this adds up to higher visibility of the entire business process and a collaborative way of working in an enterprise business environment.

Resources for Article:

Further resources on this subject:

Managing Oracle Business Intelligence [Article]

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts [Article]

Getting Started with Oracle Information Integration [Article]

Data sources for the Charts

Packt
31 Jul 2013
12 min read
(For more resources related to this topic, see here.)

Spreadsheets

In Spreadsheets, two preparation steps must be addressed in order to use a Spreadsheet as a data source with the Visualization API. The first is to identify the URL location of the Spreadsheet file for the API code. The second step is to set appropriate access to the data held in the Spreadsheet file.

Preparation

The primary method of access for a Spreadsheet behaving as a data source is through a JavaScript-based URL query. The query itself is constructed with the Google Query Language. If the URL request does not include a query, all data source columns and rows are returned in their default order. Querying a Spreadsheet also requires that the Spreadsheet file and the API application security settings are configured appropriately. Proper preparation of a Spreadsheet as a data source involves both setting the appropriate access and locating the file's query URL.

Permissions

In order for a Spreadsheet to return data to the Visualization API properly, access settings on the Spreadsheet file itself must allow view access to users. For a Spreadsheet that allows edits, including form-based additions, permissions must be set to Edit. To set permissions on the Spreadsheet, select the Share button to open up the Sharing settings dialog.

To be sure the data is accessible to the Visualization API, access levels for both the Visualization application and the Spreadsheet must be the same. For instance, if a user has access to the Visualization application but does not have view access to the Spreadsheet, the user will not be able to run the visualization, as the data is more restrictive to that user than the application. The opposite scenario is true as well, but less likely to cause confusion, as a user unable to access the API application is a fairly self-evident problem. All Google applications handle access and permissions similarly. More information on this topic can be found on the Google Apps Support pages.

Google Permissions overview is available at http://support.google.com/drive/bin/answer.py?hl=en&answer=2494886&rd=1.

Get the URL path

At present, acquiring a query-capable URL for a Spreadsheet is not as straightforward a task as one might think. There are several methods by which a URL is generated for sharing purposes, but the URL format needed for a data source query can only be found by creating a gadget in the Spreadsheet. A Google Gadget is simply dynamic, HTML- or JavaScript-based web content that can be embedded in a web page. Google Gadgets also have their own API, and have capabilities beyond Spreadsheets applications.

Information on the Google Gadget API is available at https://developers.google.com/gadgets/.

Initiate gadget creation by selecting the Gadget... option from the Insert item on the menu bar. When the Gadget Settings window appears, select Apply & close from the Gadget Settings dialog. Choose any gadget from the selection window. The purpose of this procedure is simply to retrieve the correct URL for querying. In fact, deleting the gadget as soon as the URL is copied is completely acceptable. In other words, the specific gadget chosen is of no consequence.

Once the gadget has been created, select Get query data source url… from the newly created gadget's drop-down menu. Next, determine and select the range of the Spreadsheet to query.
Either the range previously selected when the gadget was created, or the entire sheet, is acceptable, depending on the needs of the Visualization application being built. The URL listed under Paste this as a gadget data source url in the Table query data source window is the correct URL to use with API code requiring query capabilities. Be sure to select the desired cell range, as the URL will change with the various options.

Important note

Google Gadgets are to be retired in 2013, but the query URL is still part of the gadget object at the time of publication. Look for the method of finding the query URL to change as Gadgets are retired.

Query

Use the URL retrieved from the Spreadsheet gadget to build the query. The following query statement is set to query the entire Spreadsheet of the key indicated:

var query = new google.visualization.Query(
    'https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1');

Once the query is built, it can then be sent. Since an external data source is by definition not always under the explicit control of the developer, a valid response to a query is not guaranteed. In order to prevent hard-to-detect data-related issues, it is best to include a method of handling erroneous returns from the data source. The following query.send call also informs the application how to handle the information returned from the data source, regardless of quality.

query.send(handleQueryResponse);

The handleQueryResponse function sent along with the query acts as a filter, catching and handling errors from the data source. If an error is detected, the handleQueryResponse function displays an alert message. If the response from the data source is valid, the function proceeds and draws the visualization.

function handleQueryResponse(response) {
  if (response.isError()) {
    alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
    return;
  }
  var data = response.getDataTable();
  visualization = new google.visualization.Table(document.getElementById('visualization'));
  visualization.draw(data, null);
}

Best practice

Be prepared for potential errors by planning how to handle them.

For reference, the previous example is given in its complete HTML form:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title>Google Visualization API Sample</title>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
  google.load('visualization', '1', {packages: ['table']});
</script>
<script type="text/javascript">
var visualization;
function drawVisualization() {
  // To see the data that this visualization uses, browse to
  // https://docs.google.com/spreadsheet/ccc?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&usp=sharing
  var query = new google.visualization.Query('https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1');
  // Send the query with a callback function.
  query.send(handleQueryResponse);
}
function handleQueryResponse(response) {
  if (response.isError()) {
    alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
    return;
  }
  var data = response.getDataTable();
  visualization = new google.visualization.Table(document.getElementById('visualization'));
  visualization.draw(data, null);
}
google.setOnLoadCallback(drawVisualization);
</script>
</head>
<body style="font-family: Arial;border: 0 none;">
<div id="visualization" style="height: 400px; width: 400px;"></div>
</body>
</html>

View live examples for Spreadsheets at http://gvisapi-packt.appspot.com/ch6-examples/ch6-datasource.html.

Apps Script method

Just as the Visualization API can be used from within an Apps Script, external data sources can also be requested from the script. In the Apps Script Spreadsheet example presented earlier in this article, the DataTable() creation was performed within the script. In the following example, the create data table element has been removed and a .setDataSourceUrl option has been added to Charts.newAreaChart(). The script otherwise remains the same.

function doGet() {
  var chart = Charts.newAreaChart()
      .setDataSourceUrl("https://docs.google.com/spreadsheet/tq?key=0AhnmGz1SteeGdEVsNlNWWkoxU3ZRQjlmbDdTTjF2dHc&headers=-1")
      .setDimensions(600, 400)
      .setXAxisTitle("Age Groups")
      .setYAxisTitle("Population")
      .setTitle("Chicago Population by Age and Gender - 2010 Census")
      .build();
  var ui = UiApp.createApplication();
  ui.add(chart);
  return ui;
}

View live examples in Apps Script at https://script.google.com/d/1Q2R72rGBnqPsgtOxUUME5zZy5Kul53r_lHIM2qaE45vZcTlFNXhTDqrr/edit.

Fusion Tables

Fusion Tables are another viable data source ready for use by the Visualization API. Fusion Tables offer benefits over Spreadsheets beyond just the Google Maps functionality. The Tables API also allows for easier data source modification than is available in Spreadsheets.

Preparation

Preparing a Fusion Table to be used as a source is similar in procedure to preparing a Spreadsheet as a data source. The Fusion Table must be shared with the intended audience, and a unique identifier must be gathered from the Fusion Tables application.

Permissions

Just as with Spreadsheets, Fusion Tables must allow a user a minimum of view permissions in order for an application using the Visualization API to work properly. From the Sharing settings window in Fusion Tables, give the appropriate users view access as a minimum.

Get the URL path

Referencing a Fusion Table is very similar in method to Spreadsheets. Luckily, the appropriate URL ID information is slightly easier to find in Fusion Tables than in Spreadsheets. With the Sharing settings window open, there is a field at the top of the page containing the Link to share. At the end portion of the link, following the characters dcid=, is the Table's ID. The ID will look something like the following:

1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0

This ID is the unique identifier for the table.

Query

The Google Fusion Tables API includes SQL-like queries for the modification of Fusion Tables data from outside the GUI interface. Queries take the form of HTTP POST and GET requests and are constructed using the Fusion Tables API query capabilities. Data manipulation using the Fusion Tables API is beyond the scope of this article, but a simple example is offered here as a basic illustration of functionality.
A Fusion Table query requires the use of the API SELECT option, formatted as:

SELECT Column_name FROM Table_ID

Here Column_name is the name of the Fusion Table column and Table_ID is the table's ID extracted from the Sharing settings window. If the SELECT call is successful, the requested information is returned to the application in JSON format.

The Visualization API drawChart() is able to take the SELECT statement and the corresponding data source URL as options for the chart rendering. The male and female data from the Fusion Tables 2010 Chicago Census file have been visualized using the drawChart() technique.

function drawVisualization() {
  google.visualization.drawChart({
    containerId: 'visualization',
    dataSourceUrl: 'http://www.google.com/fusiontables/gvizdata?tq=',
    query: 'SELECT Age, Male, Female FROM 1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0',
    chartType: 'AreaChart',
    options: {
      title: 'Chicago Population by Age and Sex - 2010 Census',
      vAxis: { title: 'Population' },
      hAxis: { title: 'Age Groups' }
    }
  });
}

The preceding code results in the following visualization:

Live examples are available at http://gvisapi-packt.appspot.com/ch6-examples/ch6-queryfusion.html.

Important note

Fusion Table query responses are limited to 500 rows. See the Fusion Tables API documentation for other resource parameters.

API Explorer

With so many APIs available to developers using the Google platform, testing individual API functionality can be time consuming. The same issue arises for GUI applications used as a data source. Fortunately, Google provides API methods for its graphical applications as well. The ability to test API requests against Google's infrastructure is a desirable practice for all API programming efforts. To support this need, Google maintains the APIs Explorer service. This service is a console-based web application that allows queries to be submitted to APIs directly, without an application to frame them. This is helpful functionality when attempting to verify whether a data source is properly configured.

To check whether the Fusion Tables 2010 U.S. Census data instance is configured properly, a query can be sent to list all columns, which shows which columns are actually exposed to the Visualization API application.

Best practice

Use the Google APIs Explorer service to test whether API queries work as intended.

To use the API Explorer for Fusion Tables, select Fusion Tables API from the list of API services. API functions available for testing are listed on the Fusion Tables API page. Troubleshooting a Chart with a Fusion Tables data source usually involves first verifying that all columns are available to the visualization code. If a column is not available, or is not formatted as expected, a visualization issue related to data problems may be difficult to troubleshoot from inside the Visualization API environment. The API call that best performs a simple check on column information is the fusiontables.column.list item. Selecting fusiontables.column.list opens up a form-based interface. The only required information is the Table ID (collected from the Share settings window in the Fusion Tables file). Click on the Execute button to run the query. The API Explorer tool will then show the GET query sent to the Fusion Table in addition to the results it returned. For the fusiontables.column.list query, columns are returned in bracketed sections. Each section contains attributes of that column.
The following queried attributes should look familiar, as they are the fusiontables.column.list result of a query to the 2010 Chicago Census data Fusion Table.

Best practice

The Column List tool is helpful when troubleshooting Fusion Table to API code connectivity. If the Table is able to return coherent values through the tool, it can generally be assumed that access settings are appropriate and that the code itself may be the source of connection issues.

Fusion Tables—row and query reference is available at https://developers.google.com/fusiontables/docs/v1/sqlreference.

Information on API Explorer—column list is available at https://developers.google.com/fusiontables/docs/v1/reference/column/list#try-it.
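To run a similar column-level check from code rather than from the Explorer console, a query can be sent directly from the Visualization API against the same census Fusion Table. The following is only a sketch, not part of the original article: it assumes the jsapi loader and the visualization package have already been loaded as in the earlier HTML example, and it uses the Query object's setQuery() method to supply the SELECT statement instead of embedding it in the drawChart() options.

function checkCensusColumns() {
  // Query only the columns the chart needs; errors surface in the callback.
  var query = new google.visualization.Query('http://www.google.com/fusiontables/gvizdata?tq=');
  query.setQuery('SELECT Age, Male, Female FROM 1Olo92KwNin8wB4PK_dBDS9eghe80_4kjMzOTSu0');
  query.send(function (response) {
    if (response.isError()) {
      // A column or permission problem shows up here rather than as a blank chart.
      alert('Query failed: ' + response.getMessage() + ' ' + response.getDetailedMessage());
      return;
    }
    var data = response.getDataTable();
    alert('Returned ' + data.getNumberOfColumns() + ' columns and ' + data.getNumberOfRows() + ' rows.');
  });
}

If a column name is wrong or permissions are too restrictive, the error is reported in the callback instead of silently producing an empty chart.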

First steps with R

Packt
30 Jul 2013
6 min read
(For more resources related to this topic, see here.)

Obtaining and installing R

The way to obtain R is to download it from the CRAN website (http://www.r-project.org/). The Comprehensive R Archive Network (CRAN) is a network of FTP and web servers around the world that stores identical, up-to-date versions of code and documentation for R. The CRAN is directly accessible from the R website, where it is also possible to find information about R, some technical manuals, the R Journal, and details about the packages developed for R and stored in the CRAN repositories.

The functionalities of the R environment can also be expanded thanks to software libraries which can be installed and recalled when needed. These libraries, or packages, are collections of source code and other additional files that, when installed in R, allow the user to load them into the workspace via a call to the library() function. An example of code to load the package lattice is as follows:

> library(lattice)

An R installation contains one or more libraries of packages. Some of these packages are part of the basic installation and are loaded automatically as soon as the session is started. Others can be installed from the CRAN, the official R repository, or downloaded and installed manually.

Interacting with the console

As soon as you start R, you will see that a workspace is open; you can see a screenshot of the R Console window in the image below. The workspace is the environment in which you are working, where you will load your data and create your variables. The screen prompt > is the R prompt that waits for commands.

On the starting screen, you can type any function or command, or you can use R to perform basic calculations. R uses the usual symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^). Parentheses ( ) can be used to specify the order of operations. R also provides %% for taking the modulus and %/% for integer division. Comments in R are defined by the character #, so everything after that character up to the end of the line will be ignored by R.

R has a number of built-in functions, for example, sin(x), cos(x), tan(x) (all in radians), exp(x), log(x), and sqrt(x). Some special constants such as pi are also pre-defined. You can see an example of the use of such functions in the following code:

> exp(2.5)
[1] 12.18249

Understanding R objects

In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer's memory but rather provides a number of specialized data structures called objects. These objects are referred to through symbols or variables.

Vectors

The basic object in R is the vector; even scalars are vectors of length one. Vectors can be thought of as a series of data of the same class. There are six basic vector types (called atomic vectors): logical, integer, real, complex, string (or character), and raw. Integer and real represent numeric objects; logicals are a Boolean data type with possible values TRUE or FALSE. Among these atomic vectors, the more common ones are logical, string, and numeric (integer and real).

There are several ways to create vectors. For instance, the operator : (colon) is a sequence-generating operator; it creates sequences by incrementing or decrementing by one.

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 5:-6
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5 -6

If the interval between the numbers is not one, you can use the seq() function.
Here is an example:

> seq(from=2, to=2.5, by=0.1)
[1] 2.0 2.1 2.2 2.3 2.4 2.5

One of the more important features of R is the possibility to use entire vectors as arguments of functions, thus avoiding the use of explicit loops. Most of the functions in R allow the use of a vector as an argument; as an example, the use of some of these functions is shown as follows:

> x <- c(12,10,4,6,9)
> max(x)
[1] 12
> min(x)
[1] 4
> mean(x)
[1] 8.2

Matrices and arrays

In R, the matrix notation is extended to elements of any kind, so, for example, it is possible to have a matrix of character strings. Matrices and arrays are basically vectors with a dimension attribute. The function matrix() may be used to create matrices. By default, this function fills the matrix by column; as an alternative, it is possible to tell the function to build the matrix by row:

> matrix(1:9,nrow=3,byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Lists

A list in R is a collection of different objects. One of the main advantages of lists is that the objects contained within a list may be of different types, for example, numeric and character values. In order to define a list, you simply need to provide the objects that you want to include as arguments to the function list().

Data frame

A data frame corresponds to a data set; it is basically a special list in which the elements all have the same length. Elements may be of different types in different columns, but within the same column all the elements are of the same type. You can easily create data frames using the function data.frame(), and a specific column can be recalled using the operator $.

Top features you'll want to know about

In addition to basic object creation and manipulation, many more complex tasks can be performed with R, spanning data manipulation, programming, statistical analysis, and the realization of very high quality graphs. Some of the most useful features are:

Data input and output
Flow control (for, if…else, while)
Create your own functions
Debugging functions and handling exceptions
Plotting data

Summary

In this article we saw what R is, how to obtain and install it, and how to interact with the console. We also looked at a few R objects and at the top features you will want to know about.

Resources for Article:

Further resources on this subject:

Organizing, Clarifying and Communicating the R Data Analyses [Article]

Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article]

Graphical Capabilities of R [Article]
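Tying together the data frame section and the "Create your own functions" feature listed above, here is a small sketch (the data values are invented purely for illustration, not taken from the article) that builds a data frame, accesses a column with $, and wraps a simple calculation in a user-defined function:

# Build a small data frame: one character column and two numeric columns
ages <- data.frame(group = c("0-17", "18-34", "35-64", "65+"),
                   male = c(320, 410, 520, 150),
                   female = c(310, 430, 560, 210))

# Access a single column with the $ operator
mean(ages$male)

# A user-defined function: the range (max - min) of a numeric vector
spread <- function(x) {
  max(x) - min(x)
}

spread(ages$female)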

Model Design Accelerator

Packt
30 Jul 2013
6 min read
(For more resources related to this topic, see here.)

By the end of this article you will be able to use Model Design Accelerator to design a new Framework model. To introduce Model Design Accelerator, we will use a fairly simple schema based on a rental star schema, derived from the MySQL Sakila sample database. This database can be downloaded from http://dev.mysql.com/doc/sakila/en/. It is just one example of a number of possible dimensional models based on this sample database.

The Model Design Accelerator user interface

The user interface of Model Design Accelerator is very simple, consisting of only two panels:

Explorer Tree: This contains details of the database tables and views from the data source.

Model Accelerator: This contains a single fact table surrounded by four dimension tables, and is the main work area for the model being designed.

By clicking on the labels (Explorer Tree and Model Accelerator) at the top of the window, it is possible to hide either of these panels, but having both panels always visible is beneficial.

Starting Model Design Accelerator

Model Design Accelerator is started from the Framework Manager initial screen:

Select Create a new project using Model Design Accelerator…. This will start the new project creation wizard, which is exactly the same as if you were starting any new project.

Select the data source to import the database tables into the new model.

After importing the database tables, the project creation wizard will display the Model Design Accelerator Introduction screen:

After reading the instructions, click on the Close button to continue. This will then show the Model Design Accelerator workspace.

Adding tables to your workspace

The first step in creating your model with Model Design Accelerator is to add the dimension and fact tables to your model:

From the Explorer panel, drag-and-drop dim_date, dim_film, dim_customer, and dim_store to the four New Query Subject boxes in the Model Accelerator panel.

After adding your queries, right-click on the boxes to rename the queries to Rental Date Dim, Film Dim, Customer Dim, and Store Dim respectively. If not all query columns are required, it is also possible to expand the dimension tables and drag-and-drop individual columns to the query boxes.

In the Explorer Tree panel, expand the fact_rental table by clicking on the (+) sign beside the name, and from the expanded tree drag-and-drop the count_returns, count_rentals, and rental_duration columns to the Fact Query Subject box. Rename the Fact Query Subject to Rental Fact.

Additional dimension queries can be added to the model by clicking on the top-left icon in the Model Accelerator panel, and then dragging and dropping the required query onto the workspace window. Since we have a start_date and an end_date for the rental period, add a second copy of the dim_date table by clicking on the icon and dragging the table from the Explorer view into the workspace. Also rename this query as Return Date Dim:

Adding joins to your workspace

After we have added our database table columns to the workspace, we now need to add the relationship joins between the dimension and fact tables. To do this:

Double-click on the Rental Date Dim table; this will expand the dim_date and fact_rental tables in the workspace window:

Click on the Enter relationship creation mode link.
Select the date_key column in the dim_date table, and the rental_date_key column in the fact_rental table, as follows:

Click on the Create relationship icon:

Click on OK to create this join. Close the Query Subject Diagram by clicking on the (X) symbol in the top-right corner.

Repeat this procedure for each of the other four tables. The final model will look like the following screenshot:

Generating the Framework Manager model

Once we have completed our model in Model Design Accelerator, we need to create a Framework Manager model:

Click on the Generate Model button.

Click on Yes to generate your model. The Framework Manager model will be generated and will open as follows:

When you generate your model, all of the Model Advisor tests are automatically applied to the resulting model. You should review any issues that have been identified in the Verify Results tab, and decide whether you need to fix them.

When you generate the model, only the required query items will be used to create the Framework Manager model. The Physical View tab will contain only those tables required by your star schema model. The Business View tab will contain model query subjects containing only the columns used in your star schema model. The Presentation View tab will only contain shortcuts to the query subjects that exist in the Business View tab.

After generating your model, you can use Framework Manager to improve the model by adding calculations, filters, dimensions, measures, and so on. Each time you generate a Framework Manager model from your Model Design Accelerator model, a new namespace is created in the current Framework Manager model, and any improvements you want to keep will also need to be applied to these new namespaces.

From Framework Manager you can return to Model Design Accelerator at any time to continue making changes to your star schema. To return to Model Design Accelerator from within Framework Manager:

From the Tools menu, select Run Model Design Accelerator. You may choose to continue with the same model or create a new model.

To make your star schema model available to report authors, you must first create a package and then publish the package to your Cognos Reporting Server.

Summary

In this article, we have looked at Model Design Accelerator. This is a tool that allows a novice modeler, or even an experienced modeler, to create a new Framework Manager model quickly and easily.

Resources for Article:

Further resources on this subject:

Integrating IBM Cognos TM1 with IBM Cognos 8 BI [Article]

How to Set Up IBM Lotus Domino Server [Article]

IBM Cognos 10 BI dashboarding components [Article]