
How-To Tutorials - Data

1204 Articles

Big Data Analysis

Packt
19 Apr 2013
15 min read
(For more resources related to this topic, see here.) Counting distinct IPs in weblog data using MapReduce and Combiners This recipe will walk you through creating a MapReduce program to count distinct IPs in weblog data. We will demonstrate the application of a combiner to optimize data transfer overhead between the map and reduce stages. The code is implemented in a generic fashion and can be used to count distinct values in any tab-delimited dataset. Getting ready This recipe assumes that you have a basic familiarity with the Hadoop 0.20 MapReduce API. You will need access to the weblog_entries dataset supplied with this book and stored in an HDFS folder at the path /input/weblog. You will need access to a pseudo-distributed or fully-distributed cluster capable of running MapReduce jobs using the newer MapReduce API introduced in Hadoop 0.20. You will also need to package this code inside a JAR file to be executed by the Hadoop JAR launcher from the shell. Only the core Hadoop libraries are required to compile and run this example. How to do it... Perform the following steps to count distinct IPs using MapReduce: Open a text editor/IDE of your choice, preferably one with Java syntax highlighting. Create a class named DistinctCounterJob.java in your JAR file at whatever source package is appropriate. The following code will serve as the Tool implementation for job submission: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import java.io.IOException; import java.util.regex.Pattern; public class DistinctCounterJob implements Tool { private Configuration conf; public static final String NAME = "distinct_counter"; public static final String COL_POS = "col_pos"; public static void main(String[] args) throws Exception { ToolRunner.run(new Configuration(), new DistinctCounterJob(), args); } The run() method is where we set the input/output formats, mapper class configuration, combiner class, and key/value class configuration: public int run(String[] args) throws Exception { if(args.length != 3) { System.err.println("Usage: distinct_counter <input> <output> <element_position>"); System.exit(1); } conf.setInt(COL_POS, Integer.parseInt(args[2])); Job job = new Job(conf, "Count distinct elements at position"); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); job.setMapperClass(DistinctMapper.class); job.setReducerClass(DistinctReducer.class); job.setCombinerClass(DistinctReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setJarByClass(DistinctCounterJob.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); return job.waitForCompletion(true) ? 
1 : 0; } public void setConf(Configuration conf) { this.conf = conf; } public Configuration getConf() { return conf; } }

The map() function is implemented in the following code by extending mapreduce.Mapper: public static class DistinctMapper extends Mapper<LongWritable, Text, Text, IntWritable> { private static int col_pos; private static final Pattern pattern = Pattern.compile("\t"); private Text outKey = new Text(); private static final IntWritable outValue = new IntWritable(1); @Override protected void setup(Context context) throws IOException, InterruptedException { col_pos = context.getConfiguration().getInt(DistinctCounterJob.COL_POS, 0); } @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String field = pattern.split(value.toString())[col_pos]; outKey.set(field); context.write(outKey, outValue); } }

The reduce() function is implemented in the following code by extending mapreduce.Reducer: public static class DistinctReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable count = new IntWritable(); @Override protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int total = 0; for(IntWritable value: values) { total += value.get(); } count.set(total); context.write(key, count); } }

The following command shows the sample usage against weblog data with column position number 4, which is the IP column: hadoop jar myJobs.jar distinct_counter /input/weblog/ /output/weblog_distinct_counter 4

How it works... First we set up DistinctCounterJob to implement a Tool interface for remote submission. The static constant NAME is of potential use in the Hadoop Driver class, which supports the launching of different jobs from the same JAR file. The static constant COL_POS is initialized to the third required argument from the command line, <element_position>. This value is set within the job configuration, and should match the position of the column you wish to count distinct entries for. Supplying 4 will match the IP column for the weblog data.

Since we are reading and writing text, we can use the supplied TextInputFormat and TextOutputFormat classes. We set the Mapper and Reducer classes to our DistinctMapper and DistinctReducer implementations respectively. We also supply DistinctReducer as the combiner class; this decision is explained in more detail later. It's also very important to call setJarByClass() so that the TaskTrackers can properly unpack and find the Mapper and Reducer classes. The job uses the static helper methods on FileInputFormat and FileOutputFormat to set the input and output directories respectively. Now we're set up and ready to submit the job.

The Mapper class sets up a few member variables as follows:
col_pos: This is initialized to a value supplied in the configuration. It allows users to change which column to parse and apply the count distinct operation on.
pattern: This defines the column's split point for each row based on tabs.
outKey: This is a class member that holds output values. This avoids having to create a new instance for each output that is written.
outValue: This is an integer representing one occurrence of the given key. It is similar to the WordCount example.

The map() function splits each incoming line's value and extracts the string located at col_pos. We reset the internal value for outKey to the string found at that position on the line.
For our example, this will be the IP value for the row. We emit the value of the newly reset outKey variable along with the value of outValue to mark one occurrence of that given IP address. Without the assistance of the combiner, this would present the reducer with an iterable collection of 1s to be counted. The following is an example of a reducer {key, value:[]} without a combiner: {10.10.1.1, [1,1,1,1,1,1]} = six occurrences of the IP "10.10.1.1".

The implementation of the reduce() method will sum the integers and arrive at the correct total, but there's nothing that requires the integer values to be limited to the number 1. We can use a combiner to process the intermediate key-value pairs as they are output from each mapper and help improve the data throughput in the shuffle phase. Since the combiner is applied against the local map output, we may see a performance improvement as the amount of data we need to transfer for an intermediate key/value pair can be reduced considerably. Instead of seeing {10.10.1.1, [1,1,1,1,1,1]}, the combiner can add the 1s and replace the intermediate value for that key with {10.10.1.1, [6]}. The reducer can then sum the various combined values for the intermediate key and arrive at the same correct total. This is possible because addition is both a commutative and associative operation. In other words:

Commutative: The order in which we add the values has no effect on the final result. For example, 1 + 2 + 3 = 3 + 1 + 2.
Associative: The way in which we group the addition operations has no effect on the final result. For example, (1 + 2) + 3 = 1 + (2 + 3).

For counting the occurrences of distinct IPs, we can use the same code in our reducer as a combiner for output in the map phase. When applied to our problem, the normal output with no combiner from two separate, independently running map tasks might look like the following, where {key: value[]} is the intermediate key-value collection:

Map Task A = {10.10.1.1, [1,1,1]} = three occurrences
Map Task B = {10.10.1.1, [1,1,1,1,1,1]} = six occurrences

Without the aid of a combiner, this will be merged in the shuffle phase and presented to a single reducer as the following key-value collection:

{10.10.1.1, [1,1,1,1,1,1,1,1,1]} = nine total occurrences

Now let's revisit what would happen when using a combiner against the exact same sample output:

Map Task A = {10.10.1.1, [1,1,1]} = three occurrences
Combiner = {10.10.1.1, [3]} = still three occurrences, but reduced for this mapper.
Map Task B = {10.10.1.1, [1,1,1,1,1,1]} = six occurrences
Combiner = {10.10.1.1, [6]} = still six occurrences

Now the reducer will see the following for that key-value collection:

{10.10.1.1, [3,6]} = nine total occurrences

We arrived at the same total count for that IP address, but we used a combiner to limit the amount of network I/O during the MapReduce shuffle phase by pre-reducing the intermediate key-value output from each mapper.

There's more... The combiner can be confusing to newcomers. Here are some useful tips:

The Combiner does not always have to be the same class as your Reducer
The previous recipe and the default WordCount example show the Combiner class being initialized to the same implementation as the Reducer class. This is not enforced by the API, but ends up being common for many types of distributed aggregate operations such as sum(), min(), and max().
One basic example might be a min() operation in the Reducer class that formats its output in a specific way for readability. This can take a slightly different form from the min() operation in the Combiner class, which does not care about the output formatting.

Combiners are not guaranteed to run
Whether or not the framework invokes your combiner during execution depends on the intermediate spill file size from each map output, and the combiner is not guaranteed to run for every intermediate key. Your job should not depend on the combiner for correct results; it should be used only for optimization. You can control the spill file threshold at which MapReduce tries to combine intermediate values with the configuration property min.num.spills.for.combine.

Using Hive date UDFs to transform and sort event dates from geographic event data
This recipe will illustrate the efficient use of the Hive date UDFs to list the 20 most recent events and the number of days between the event date and the current system date.

Getting ready
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account. This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the fields mapped to the respective datatypes. Issue the following command to the Hive client to see the mentioned fields: describe acled_nigeria_cleaned

You should see the following response:
OK
loc string
event_date string
event_type string
actor string
latitude double
longitude double
source string
fatalities int

How to do it... Perform the following steps to utilize Hive UDFs for sorting and transformation: Open a text editor of your choice, ideally one with SQL syntax highlighting. Add the inline creation and transform syntax: SELECT event_type, event_date, days_since FROM (SELECT event_type, event_date, datediff(to_date(from_unixtime(unix_timestamp())), to_date(from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd')))) AS days_since FROM acled_nigeria_cleaned) date_differences ORDER BY event_date DESC LIMIT 20; Save the file as top_20_recent_events.sql in the active folder. Run the script from the operating system shell by supplying the -f option to the Hive client.

You should see the following five rows appear first in the output console:
OK
Battle-No change of territory 2011-12-31 190
Violence against civilians 2011-12-27 194
Violence against civilians 2011-12-25 196
Violence against civilians 2011-12-25 196
Violence against civilians 2011-12-25 196

How it works... Let's start with the nested SELECT subquery. We select three fields from our Hive table acled_nigeria_cleaned: event_type, event_date, and the result of calling the UDF datediff(), which takes as arguments an end date and a start date. Both are expected in the form yyyy-MM-dd. The first argument to datediff() is the end date, with which we want to represent the current system date. Calling unix_timestamp() with no arguments returns the current system time in seconds since the Unix epoch. We send that return value to from_unixtime() to get a formatted timestamp representing the current system date in the default Java 1.6 format (yyyy-MM-dd HH:mm:ss). We only care about the date portion, so calling to_date() with the output of this function strips the HH:mm:ss. The result is the current date in the yyyy-MM-dd form.
The second argument to datediff() is the start date, which for our query is the event_date. The series of function calls operates in almost exactly the same manner as for the previous argument, except that when we call unix_timestamp(), we must tell the function that our argument is in the SimpleDateFormat pattern yyyy-MM-dd. Now we have both the start date and end date arguments in the yyyy-MM-dd format and can perform the datediff() operation for the given row. We alias the output column of datediff() as days_since for each row.

The outer SELECT statement takes these three columns per row and sorts the entire output by event_date in descending order to get reverse chronological ordering. We arbitrarily limit the output to only the first 20. The net result is the 20 most recent events with the number of days that have passed since each event occurred.

There's more... The date UDFs can help tremendously in performing string date comparisons. Here are some additional pointers:

Date format strings follow Java SimpleDateFormat guidelines
Check out the Javadocs for SimpleDateFormat to learn how your custom date strings can be used with the date transform UDFs.

Default date and time formats
Many of the UDFs operate under a default format assumption. For UDFs requiring only a date, your column values must be in the form yyyy-MM-dd. For UDFs that require date and time, your column values must be in the form yyyy-MM-dd HH:mm:ss.

Using Hive to build a per-month report of fatalities over geographic event data
This recipe will show a very simple analytic that uses Hive to count fatalities for every month appearing in the dataset and print the results to the console.

Getting ready
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account. This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned with the following fields mapped to the respective datatypes. Issue the following command to the Hive client: describe acled_nigeria_cleaned

You should see the following response:
OK
loc string
event_date string
event_type string
actor string
latitude double
longitude double
source string
fatalities int

How to do it... Follow the steps to use Hive for report generation: Open a text editor of your choice, ideally one with SQL syntax highlighting. Add the inline creation and transformation syntax: SELECT from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'yyyy-MMM'), COALESCE(CAST(sum(fatalities) AS STRING), 'Unknown') FROM acled_nigeria_cleaned GROUP BY from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'yyyy-MMM'); Save the file as monthly_violence_totals.sql in the active folder. Run the script from the operating system shell by supplying the -f option to the Hive client.

You should see the following three rows appear first in the output console. Note that the output is sorted lexicographically, and not in date order.
OK
1997-Apr 115
1997-Aug 4
1997-Dec 26

How it works... The SELECT statement uses unix_timestamp() and from_unixtime() to reformat the event_date for each row as just a concatenated year-month field. The same expression appears in the GROUP BY clause for totaling fatalities using sum(). The coalesce() method returns the first non-null argument passed to it. We pass as the first argument the value of fatalities summed for that given year-month, cast as a string.
If that value is NULL for any reason, return the constant Unknown. Otherwise return the string representing the total fatalities counted for that year-month combination. Print everything to the console over stdout. There's more... The following are some additional helpful tips related to the code in this recipe: The coalesce() method can take variable length arguments. As mentioned in the Hive documentation, coalesce() supports one or more arguments. The first non-null argument will be returned. This can be useful for evaluating several different expressions for a given column before deciding the right one to choose. The coalesce() will return NULL if no argument is non-null. It's not uncommon to provide a type literal to return if all other arguments are NULL. Date reformatting code template Having to reformat dates stored in your raw data is very common. Proper use of from_ unixtime() and unix_timestamp() can make your life much easier. Remember this general code template for concise date format transformation in Hive: from_unixtime(unix_timestamp(<col>,<in-format>),<out-format>);
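As a quick, hedged illustration of this template, the following HiveQL snippet reformats the event_date column of the acled_nigeria_cleaned table used in these recipes; the output pattern 'dd-MMM-yyyy' is an arbitrary choice for demonstration, not something prescribed by the recipe.

```sql
-- Sketch only: apply the from_unixtime/unix_timestamp template to event_date.
-- The input pattern 'yyyy-MM-dd' matches the recipe's data; the output pattern
-- 'dd-MMM-yyyy' is a placeholder chosen purely for illustration.
SELECT event_date,
       from_unixtime(unix_timestamp(event_date, 'yyyy-MM-dd'), 'dd-MMM-yyyy') AS formatted_date
FROM acled_nigeria_cleaned
LIMIT 5;
```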

Comparative Study of NoSQL Products

Packt
09 Apr 2013
7 min read
(For more resources related to this topic, see here.) Comparison Choosing a technology does not merely involve a technical comparison. Several other factors related to documentation, maintainability, stability and maturity, vendor support, developer community, license, price, and the future of the product or the organization behind it also play important roles. Having said that, I must also add that technical comparison should continue to play a pivotal role. We will start a deep technical comparison of the previously mentioned products and then look at the semi-technical and non-technical aspects for the same. Technical comparison From a technical perspective, we compare on the following parameters: Implementation language Engine types Speed Implementation language One of the more important factors that come into play is how can, if required, the product be extended; the programming language in which the product itself is written determines a large part of it. Some of the database may provide a different language for writing plugins but it may not always be true: Amazon SimpleDB: It is available in cloud and has a client SDK for Java, .NET, PHP, and Ruby. There are libraries for Android and iOS as well. BaseX: Written in Java. To extend, one must code in Java. Cassandra: Everything in Java. CouchDB: Written in Erlang. To extend use Erlang. Google Datastore: It is available in cloud and has SDK for Java, Python, and Go. HBase: It is Java all the way. MemcacheDB: Written in C. Uses the same language to extend. MongoDB: Written in C++. Client drivers are available in several languages including but not limited to JavaScript, Java, PHP, Python, and Ruby. Neo4j: Like several others, it is Java all the way Redis: Written in C. So you can extend using C. Great, so the first parameter itself may have helped you shortlist the products that you may be interested to use based on the developers available in your team or for hire. You may still be tempted to get smart people onboard and then build competency based on the choice that you make, based on subsequent dimensions. Note that for the databases written in high-level languages like Java, it may still be possible to write extensions in languages like C or C++ by using interfaces like JNI or otherwise. Amazon SimpleDB provides access via the HTTP protocol and has SDK in multiple languages. If you do not find an SDK for yourself, say for example, in JavaScript for use with NodeJS, just write one. However, life is not open with Google Datastore that allows access only via its cloud platform App Engine and has SDKs only in Java, Python, and the Go languages. Since the access is provided natively from the cloud servers, you cannot do much about it. In fact, the top requested feature of the Google App Engine is support for PHP ( See http://code.google.com/p/googleappengine/issues/list). Engine types Engine types define how you will structure the data and what data design expertise your team will need. NoSQL provides multiple options to choose from. Database Column oriented Document store Key value store Graph Amazon SimpleDB No No Yes No BaseX No Yes No No Cassandra Yes Yes No No CouchDB No Yes No No Google Datastore Yes No No No HBase Yes No No No MemcacheDB No No Yes No MongoDB No Yes No No Neo4j No No No Yes Redis No Yes Yes No You may notice two aspects of this table – a lot of No and multiple Yes against some databases. I expect the table to be populated with a lot more Yes over the next couple of years. 
Specifically, I expect the open source databases written in Java to be developed and enhanced actively providing multiple options to the developers. Speed One of the primary reasons for choosing a NoSQL solution is speed. Comparing and benchmarking the databases is a non-trivial task considering that each database has its own set of hardware and other configuration requirements. Having said that, you can definitely find a whole gambit of benchmark results comparing one NoSQL database against the other with details of how the tests were executed. Of all that is available, my personal choice is the Yahoo! Cloud Serving Benchmark (YCSB) tool. It is open source and available on Github at https://github.com/brianfrankcooper/YCSB. It is written in Java and clients are available for Cassandra, DynamoDB, HBase, HyperTable, MongoDB, Redis apart from several others that we have not discuss in this book. Before showing some results from the YCSB, I did a quick run on a couple of easy-to-set-up databases myself. I executed them without any optimizations to just get a feel of how easy it is for software to incorporate it without needing any expert help. I ran it on MongoDB on my personal box (server as well as the client on the same machine), DynamoDB connecting from a High-CPU Medium (c1.medium) box, and MySQL on the same High-CPU Medium box with both server and client on the same machine. Detailed configurations with the results are shown as follows: Server configuration: Parameter MongoDB DynamoDB MySQL Processor 5 EC2 Compute Units N/A 5 EC2 Compute Units RAM 1.7 GB with Apache HTTP server running (effective free: 200 MB, after database is up and running) N/A 1.7GB with Apache HTTP server running (effective free: 500MB, after database is up and running) Hard disk Non-SSD N/A Non-SSD Network configuration N/A US-East-1 N/A Operating system Ubuntu 10.04, 64 bit N/A Ubuntu 10.04, 64 bit Database version 1.2.2 N/A 5.1.41 Configuration Default Max write: 500, Max read: 500 Default Client configuration: Parameter MongoDB DynamoDB MySQL Processor 5 EC2 Compute Units 5 EC2 Compute Units 5 EC2 Compute Units RAM 1.7GB with Apache HTTP server running (effective free: 200MB, after database is up and running) 1.7GB with Apache HTTP server running (effective free: 500MB, after database is up and running) 1.7GB with Apache HTTP server running (effective free: 500MB after database is up and running) Hard disk Non-SSD Non-SSD Non-SSD Network configuration Same Machine as server US-East-1 Same Machine as server Operating system Ubuntu 10.04, 64 bit Ubuntu 10.04, 64 bit Ubuntu 10.04, 64 bit Record count 1,000,000 1,000 1,000,000 Max connections 1 5 1 Operation count (workload a) 1,000,000 1,000 1,000,000 Operation count (workload f) 1,000,000 100,000 1,000,000 Results: Workload Parameter MongoDB DynamoDB MySQL Workload-a (load) Total time 290 seconds 16 seconds 300 seconds   Speed (operations/second) 2363 to 4180 (approximately 3700) Bump at 1278 50 to 82 (operations/second) 3135 to 3517 (approximately 3300)   Insert latency 245 to 416 microseconds (approximately 260) Bump at 875 microseconds 12 to 19 milliseconds 275 to 300 microseconds (approximately 290) Workload-a (run) Total time 428 seconds 17 seconds 240 seconds   Speed 324 to 4653 42 to 78 3970 to 4212   Update latency 272 to 2946 microseconds 13 to 23.7 microseconds 219 to 225.5 microseconds   Read latency 112 to 5358 microseconds 12.4 to 22.48 microseconds 240.6 to 248.9 microseconds Workload-f (load) Total time 286 seconds Did not execute 295 seconds   
Speed 3708 to 4200   3254 to 3529   Insert latency 228 to 265 microseconds   275 to 299 microseconds Workload-f (run) Total time 412 seconds Did not execute 1022 seconds   Speed 192 to 4146   224 to 2096   Update latency 219 to 336 microseconds   216 to 233 microseconds, with two bursts at 600 and 2303 microseconds   Read latency 119 to 5701 microseconds   1360 to 8246 microseconds   Read Modify Write (RMW) latency 346 to 9170 microseconds   1417 to 14648 microseconds Do not read too much into these numbers as they are a result of the default configuration, out-of-the-box setup without any optimizations. Some of the results from YCSB published by Brian F. Cooper (http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf) are shown next. For update-heavy, 50-50 read-update: For read-heavy, under varying hardware: There are some more from Sergey Sverchkov at Altoros (http://altoros.com/nosql-research) who published their white paper recently. Summary In this article, we did a detailed comparative study of ten NoSQL databases on few parameters, both technical and non-technical. Resources for Article : Further resources on this subject: Getting Started with CouchDB and Futon [Article] Ruby with MongoDB for Web Development [Article] An Introduction to Rhomobile [Article]  
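As a practical footnote to the speed comparison above, the following shell commands sketch one way to drive YCSB against MongoDB with the record and operation counts used in the tables. They assume a local YCSB checkout with its bundled bin/ycsb launcher and MongoDB binding, plus a MongoDB server on the default port; exact flags and property names can vary between YCSB releases, so treat this as a starting point rather than a definitive invocation.

```bash
# Sketch: load one million records for workload A, then run the workload.
# Assumes the bin/ycsb launcher and the MongoDB binding are available locally.
cd YCSB

# Load phase (compare with the "Workload-a (load)" rows above)
bin/ycsb load mongodb -s -P workloads/workloada \
    -p recordcount=1000000 \
    -p mongodb.url=mongodb://localhost:27017/ycsb

# Run phase (compare with the "Workload-a (run)" rows above)
bin/ycsb run mongodb -s -P workloads/workloada \
    -p operationcount=1000000 \
    -p mongodb.url=mongodb://localhost:27017/ycsb
```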

Advanced Hadoop MapReduce Administration

Packt
08 Apr 2013
6 min read
(For more resources related to this topic, see here.)

Tuning Hadoop configurations for cluster deployments

Getting ready
Shut down the Hadoop cluster if it is already running, by executing the bin/stop-dfs.sh and bin/stop-mapred.sh commands from HADOOP_HOME.

How to do it... We can control Hadoop configurations through the following three configuration files:
conf/core-site.xml: This contains the configurations common to the whole Hadoop distribution
conf/hdfs-site.xml: This contains configurations for HDFS
conf/mapred-site.xml: This contains configurations for MapReduce

Each configuration file has name-value pairs expressed in an XML format, and they define the workings of different aspects of Hadoop. The following code snippet shows an example of a property in the configuration file. Here, the <configuration> tag is the top-level XML container, and the <property> tags that define individual properties go as child elements of the <configuration> tag. <configuration><property><name>mapred.reduce.parallel.copies</name><value>20</value></property>...</configuration>

The following instructions show how to change the directory to which we write Hadoop logs and configure the maximum number of map and reduce tasks: Create a directory to store the logfiles, for example, /root/hadoop_logs. Uncomment the line that includes HADOOP_LOG_DIR in HADOOP_HOME/conf/hadoop-env.sh and point it to the new directory. Add the following lines to the HADOOP_HOME/conf/mapred-site.xml file: <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property><property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>2</value></property> Restart the Hadoop cluster by running the bin/stop-mapred.sh and bin/start-mapred.sh commands from the HADOOP_HOME directory. You can verify the number of processes created using OS process monitoring tools. If you are on Linux, run the watch "ps -ef | grep hadoop" command. If you are on Windows or Mac OS, use the Task Manager.

How it works... HADOOP_LOG_DIR redefines the location to which Hadoop writes its logs. The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties define the maximum number of map and reduce tasks that can run within a single TaskTracker at a given moment. These and other server-side parameters are defined in the HADOOP_HOME/conf/*-site.xml files. Hadoop reloads these configurations after a restart.

There's more... There are many similar configuration properties defined in Hadoop. You can see some of them in the following lists.

The configuration properties for conf/core-site.xml (name, default value, description):
fs.inmemory.size.mb (default: 100): This is the amount of memory, in MB, allocated to the in-memory filesystem that is used to merge map outputs at the reducers.
io.sort.factor (default: 100): This is the maximum number of streams merged while sorting files.
io.file.buffer.size (default: 131072): This is the size of the read/write buffer used by sequence files.

The configuration properties for conf/mapred-site.xml (name, default value, description):
mapred.reduce.parallel.copies (default: 5): This is the maximum number of parallel copies the reduce step will execute to fetch output from many parallel jobs.
mapred.map.child.java.opts (default: -Xmx200M): This is for passing Java options into the map JVM.
mapred.reduce.child.java.opts (default: -Xmx200M): This is for passing Java options into the reduce JVM.
io.sort.mb (default: 200): This is the memory limit, in MB, while sorting data.
The configuration properties for conf/hdfs-site.xml (name, default value, description):
dfs.block.size (default: 67108864): This is the HDFS block size.
dfs.namenode.handler.count (default: 40): This is the number of server threads to handle RPC calls in the NameNode.

Running benchmarks to verify the Hadoop installation
The Hadoop distribution comes with several benchmarks. We can use them to verify our Hadoop installation and measure Hadoop's performance. This recipe introduces these benchmarks and explains how to run them.

Getting ready
Start the Hadoop cluster. You can run these benchmarks either on a cluster setup or on a pseudo-distributed setup.

How to do it... Let us run the sort benchmark. The sort benchmark consists of two jobs. First, we generate some random data using the randomwriter Hadoop job and then sort it using the sort sample. Change the directory to HADOOP_HOME. Run the randomwriter Hadoop job using the following command:
>bin/hadoop jar hadoop-examples-1.0.0.jar randomwriter -Dtest.randomwrite.bytes_per_map=100 -Dtest.randomwriter.maps_per_host=10 /data/unsorted-data
Here the two parameters, test.randomwrite.bytes_per_map and test.randomwriter.maps_per_host, specify the size of data generated by each map and the number of maps respectively. Run the sort program:
>bin/hadoop jar hadoop-examples-1.0.0.jar sort /data/unsorted-data /data/sorted-data
Verify the final results by running the following command:
>bin/hadoop jar hadoop-test-1.0.0.jar testmapredsort -sortInput /data/unsorted-data -sortOutput /data/sorted-data
Finally, when everything is successful, the following message will be displayed:
The job took 66 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

How it works... First, the randomwriter application runs a Hadoop job to generate random data that can be used by the second sort program. Then, we verify the results through the testmapredsort job. If your computer has more capacity, you may run the initial randomwriter step with increased output sizes.

There's more... Hadoop includes several other benchmarks:
TestDFSIO: This tests the input/output (I/O) performance of HDFS
nnbench: This checks the NameNode hardware
mrbench: This runs many small jobs
TeraSort: This sorts one terabyte of data
More information about these benchmarks can be found at http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/.

Reusing Java VMs to improve the performance
In its default configuration, Hadoop starts a new JVM for each map or reduce task. However, running multiple tasks from the same JVM can sometimes significantly speed up the execution. This recipe explains how to control this behavior.

How to do it... Run the WordCount sample by passing the following option as an argument:
>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1
Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command in Unix, or the Task Manager in Windows). Hadoop starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job. However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.

How it works...
By setting the job configuration property through mapred.job.reuse.jvm.num.tasks, we can control the number of tasks for the JVM run by Hadoop. When the value is set to -1, Hadoop runs the tasks in the same JVM.
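For jobs that do not implement Tool, the JobConf route mentioned above can be used instead of the -D option. The following Java fragment is a minimal sketch against the older mapred API; the class name is a placeholder, and it assumes the Hadoop 1.x libraries used elsewhere in this article.

```java
import org.apache.hadoop.mapred.JobConf;

// Placeholder driver class used only to illustrate the JVM reuse setting.
public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(JvmReuseExample.class);

        // Reuse one JVM per task slot for an unlimited number of tasks;
        // equivalent to -Dmapred.job.reuse.jvm.num.tasks=-1 on the command line.
        conf.setNumTasksToExecutePerJvm(-1);

        // The same effect can be achieved by setting the property directly.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        // ... configure the mapper, reducer, and input/output paths, then submit the job.
    }
}
```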

Line, Area, and Scatter Charts

Packt
05 Apr 2013
10 min read
(For more resources related to this topic, see here.) Introducing line charts First let's start with a single series line chart. We will use one of the many data provided by The World Bank organization at www.worldbank.org. The following is the code snippet to create a simple line chart which shows the percentage of population ages, 65 and above, in Japan for the past three decades: var chart = new Highcharts.Chart({chart: {renderTo: 'container'},title: {text: 'Population ages 65 and over (% of total)',},credits: {position: {align: 'left',x: 20},text: 'Data from The World Bank'},yAxis: {title: {text: 'Percentage %'}},xAxis: {categories: ['1980', '1981','1982', ... ],labels: {step: 5}},series: [{name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}]}); The following is the display of the simple chart: Instead of specifying the year number manually as strings in categories, we can use the pointStart option in the series config to initiate the x-axis value for the first point. So we have an empty xAxis config and series config, as follows: xAxis: {},series: [{pointStart: 1980,name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}] Although this simplifies the example, the x-axis labels are automatically formatted by Highcharts utility method, numberFormat, which adds a comma after every three digits. The following is the outcome on the x axis: To resolve the x-axis label, we overwrite the label's formatter option by simply returning the value to bypass the numberFormat method being called. Also we need to set the allowDecimals option to false. The reason for that is when the chart is resized to elongate the x axis, decimal values are shown. The following is the final change to use pointStart for the year values: xAxis: {labels:{formatter: function() {// 'this' keyword is the label objectreturn this.value;}},allowDecimals: false},series: [{pointStart: 1980,name: 'Japan - 65 and over',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}] Extending to multiple series line charts We can include several more line series and set the Japan series by increasing the line width to be 6 pixels wide, as follows: series: [{lineWidth: 6,name: 'Japan',data: [ 9, 9, 9, 10, 10, 10, 10 ... ]}, {Name: 'Singapore',data: [ 5, 5, 5, 5, ... ]}, {...}] The line series for Japanese population becomes the focus in the chart, as shown in the following screenshot: Let's move on to a more complicated line graph. For the sake of demonstrating inverted line graphs, we use the chart.inverted option to flip the y and x axes to opposite orientations. Then we change the line colors of the axes to match the same series colors. We also disable data point markers for all the series and finally align the second series to the second entry in the y-axis array, as follows: chart: {renderTo: 'container',inverted: true,},yAxis: [{title: {text: 'Percentage %'},lineWidth: 2,lineColor: '#4572A7'}, {title: {text: 'Age'},opposite: true,lineWidth: 2,lineColor: '#AA4643'}],plotOptions: {series: {marker: {enabled: false}}},series: [{name: 'Japan - 65 and over',type: 'spline',data: [ 9, 9, 9, ... ]}, {name: 'Japan - Life Expectancy',yAxis: 1,data: [ 76, 76, 77, ... ]}] The following is the inverted graph with double y axes: The data representation of the chart may look slightly odd as the usual time labels are swapped to the y axis and the data trend is awkward to comprehend. The inverted option is normally used for showing data in a noncontinuous form and in bar format. 
If we interpret the data from the graph, 12 percent of the population is 65 and over, and the life expectancy is 79 in 1990. By setting plotOptions.series.marker.enabled to false it switches off all the data point markers. If we want to display a point marker for a particular series, we can either switch off the marker globally and then set the marker on an individual series, or the other way round. plotOptions: {series: {marker: {enabled: false}}},series: [{marker: {enabled: true},name: 'Japan - 65 and over',type: 'spline',data: [ 9, 9, 9, ... ]}, { The following graph demonstrates that only the 65 and over series has point markers: Sketching an area chart In this section, we are going to use our very first example and turn it into a more stylish graph (based on the design of wind energy poster by Kristin Clute), which is an area spline chart. An area spline chart is generated using the combined properties of area and spline charts. The main data line is plotted as a spline curve and the region underneath the line is filled in a similar color with a gradient and an opaque style. Firstly, we want to make the graph easier for viewers to look up the values for the current trend, so we move the y axis next to the latest year, that is, to the opposite side of the chart: yAxis: { ....opposite:true} The next thing is to remove the interval lines and have a thin axis line along the y axis: yAxis: { ....gridLineWidth: 0,lineWidth: 1,} Then we simplify the y-axis title with a percentage sign and align it to the top of the axis: yAxis: { ....title: {text: '(%)',rotation: 0,x: 10,y: 5,align: 'high'},} As for the x axis, we thicken the axis line with a red color and remove the interval ticks: xAxis: { ....lineColor: '#CC2929',lineWidth: 4,tickWidth: 0,offset: 2} For the chart title, we move the title to the right of the chart, increase the margin between the chart and the title, and then adopt a different font for the title: title: {text: 'Population ages 65 and over (% of total) -Japan ',margin: 40,align: 'right',style: {fontFamily: 'palatino'}} After that we are going to modify the whole series presentation, we first set the chart.type property from 'line' to 'areaspline'. Notice that setting the properties inside this series object will overwrite the same properties defined in plotOptions.areaspline and so on in plotOptions.series. Since so far there is only one series in the graph, there is no need to display the legend box. We can disable it with the showInLegend property. We then smarten the area part with gradient color and the spline with a darker color: series: [{showInLegend: false,lineColor: '#145252',fillColor: {linearGradient: {x1: 0, y1: 0,x2: 0, y2: 1},stops:[ [ 0.0, '#248F8F' ] ,[ 0.7, '#70DBDB' ],[ 1.0, '#EBFAFA' ] ]},data: [ ... ]}] After that, we introduce a couple of data labels along the line to indicate that the ranking of old age population has increased over time. We use the values in the series data array corresponding to the year 1995 and 2010, and then convert the numerical value entries into data point objects. Since we only want to show point markers for these two years, we turn off markers globally in plotOptions.series. marker.enabled and set the marker on, individually inside the point objects accompanied with style settings: plotOptions: {series: {marker: {enabled: false}}},series: [{ ...,data:[ 9, 9, 9, ...,{ marker: {radius: 2,lineColor: '#CC2929',lineWidth: 2,fillColor: '#CC2929',enabled: true},y: 14}, 15, 15, 16, ... 
]}] We then set a bounding box around the data labels with round corners (borderRadius) in the same border color (borderColor) as the x axis. The data label positions are then finely adjusted with the x and y options. Finally, we change the default implementation of the data label formatter. Instead of returning the point value, we print the country ranking. series: [{ ...,data:[ 9, 9, 9, ...,{ marker: {...},dataLabels: {enabled: true,borderRadius: 3,borderColor: '#CC2929',borderWidth: 1,y: -23,formatter: function() {return "Rank: 15th";}},y: 14}, 15, 15, 16, ... ]}] The final touch is to apply a gray background to the chart and add extra space into spacingBottom. The extra space for spacingBottom is to avoid the credit label and x-axis label getting too close together, because we have disabled the legend box. chart: {renderTo: 'container',spacingBottom: 30,backgroundColor: '#EAEAEA'}, When all these configurations are put together, it produces the exact chart, as shown in the screenshot at the start of this section. Mixing line and area series In this section we are going to explore different plots including line and area series together, as follows: Projection chart, where a single trend line is joined with two series in different line styles Plotting an area spline chart with another step line series Exploring a stacked area spline chart, where two area spline series are stacked on top of each other Simulating a projection chart The projection chart has spline area with the section of real data and continues in a dashed line with projection data. To do that we separate the data into two series, one for real data and the other for projection data. The following is the series configuration code for the future data up to 2024. This data is based on the National Institute of Population and Social Security Research report (http://www.ipss.go.jp/pp-newest/e/ppfj02/ppfj02.pdf). series: [{name: 'project data',type: 'spline',showInLegend: false,lineColor: '#145252',dashStyle: 'Dash',data: [ [ 2010, 23 ], [ 2011, 22.8 ],... [ 2024, 28.5 ] ]}] The future series is configured as a spline in a dashed line style and the legend box is disabled, because we want to show both series as being from the same series. Then we set the future (second) series color the same as the first series. The final part is to construct the series data. As we specify the x-axis time data with the pointStart property, we need to align the projection data after 2010. There are two approaches that we can use to specify the time data in a continuous form, as follows: Insert null values into the second series data array for padding to align with the real data series Specify the second series data in tuples, which is an array with both time and projection data Next we are going to use the second approach because the series presentation is simpler. The following is the screenshot only for the future data series: The real data series is exactly the same as the graph in the screenshot at the start of the Sketching an area chart section, except without the point markers and data label decorations. The next step is to join both series together, as follows: series: [{name: 'real data',type: 'areaspline',....}, {name: 'project data',type: 'spline',....}] Since there is no overlap between both series data, they produce a smooth projection graph: Contrasting spline with step line In this section we are going to plot an area spline series with another line series but in a step presentation. 
The step line transverses vertically and horizontally only according to the changes in series data. It is generally used for presenting discrete data, that is, data without continuous/gradual movement. For the purpose of showing a step line, we will continue from the first area spline example. First of all, we need to enable the legend by removing the disabled showInLegend setting and also remove dataLabels in the series data. Next is to include a new series, Ages 0 to 14, in the chart with a default line type. Then we will change the line style slightly differently into steps. The following is the configuration for both series: series: [{name: 'Ages 65 and over',type: 'areaspline',lineColor: '#145252',pointStart: 1980,fillColor: {....},data: [ 9, 9, 9, 10, ...., 23 ]}, {name: 'Ages 0 to 14',// default type is line seriesstep: true,pointStart: 1980,data: [ 24, 23, 23, 23, 22, 22, 21,20, 20, 19, 18, 18, 17, 17, 16, 16, 16,15, 15, 15, 15, 14, 14, 14, 14, 14, 14,14, 14, 13, 13 ]}] The following screenshot shows the second series in the stepped line style:
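To experiment with the step-versus-spline contrast on its own, the following is a minimal, self-contained configuration sketch; it assumes the same Highcharts setup and a div with the id container, and the data arrays are shortened placeholders rather than the full series used above.

```javascript
// Minimal sketch: one smooth spline series next to one stepped line series.
// Assumes Highcharts is loaded and a <div id="container"> exists on the page.
var chart = new Highcharts.Chart({
    chart: { renderTo: 'container' },
    title: { text: 'Spline versus step line (sketch)' },
    xAxis: { allowDecimals: false },
    series: [{
        name: 'Ages 65 and over',
        type: 'spline',              // drawn as a smooth curve
        pointStart: 1980,
        data: [9, 9, 9, 10, 10, 10]  // placeholder values
    }, {
        name: 'Ages 0 to 14',
        step: true,                  // default line type rendered as steps
        pointStart: 1980,
        data: [24, 23, 23, 23, 22, 22]  // placeholder values
    }]
});
```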

Obtaining a binary backup

Packt
04 Apr 2013
6 min read
Getting ready Next we need to modify the postgresql.conf file for our database to run in the proper mode for this type of backup. Change the following configuration variables: wal_level = archive max_wal_senders = 5 Then we must allow a super user to connect to the replication database, which is used by pg_basebackup. We do that by adding the following line to pg_hba.conf: local replication postgres peer Finally, restart the database instance to commit the changes. How to do it... Though it is only one command, pg_basebackup requires at least one switch to obtain a binary backup, as shown in the following step: Execute the following command to create the backup in a new directory named db_backup: $> pg_basebackup -D db_backup -x How it works... For PostgreSQL, WAL stands for Write Ahead Log. By changing wal_level to archive, those logs are written in a format compatible with pg_basebackup and other replicationbased tools. By increasing max_wal_senders from the default of zero, the database will allow tools to connect and request data files. In this case, up to five streams can request data files simultaneously. This maximum should be sufficient for all but the most advanced systems. The pg_hba.conf file is essentially a connection access control list (ACL). Since pg_basebackup uses the replication protocol to obtain data files, we need to allow local connections to request replication. Next, we send the backup itself to a directory (-D) named db_backup. This directory will effectively contain a complete copy of the binary files that make up the database. Finally, we added the -x flag to include transaction logs (xlogs), which the database will require to start, if we want to use this backup. When we get into more complex scenarios, we will exclude this option, but for now, it greatly simplifies the process. There's more... The pg_basebackup tool is actually fairly complicated. There is a lot more involved under the hood. Viewing backup progress For manually invoked backups, we may want to know how long the process might take, and its current status. Luckily, pg_basebackup has a progress indicator, which does that by using the following command: $> pg_basebackup -P -D db_backup Like many of the other switches, -P can be combined with tape archive format, standalone backups, database clones, and so on. This is clearly not necessary for automated backup routines, but could be useful for one-off backups monitored by an administrator. Compressed tape archive backups Many binary backup files come in the TAR (Tape Archive) format, which we can activate using the -f flag and setting it to t for TAR. Several Unix backup tools can directly process this type of backup, and most administrators are familiar with it. If we want a compressed output, we can set the -z flag, especially in the case of large databases. For our sample database, we should see almost a 20x compression ratio. Try the following command: $> pg_basebackup -Ft -z -D db_backup The backup file itself will be named base.tar.gz within the db_backup directory, reflecting its status as a compressed tape archive. In case the database contains extra tablespaces, each becomes a separate compressed archive. Each file can be extracted to a separate location, such as a different set of disks, for very complicated database instances. For the sake of this example, we ignored the possible presence of extra tablespaces than the pg_default default included in every installation. User-created tablespaces will greatly complicate your backup process. 
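Before relying on such an archive, it can help to confirm what it contains. The following shell commands are a minimal sketch assuming the single default tablespace discussed above and the db_backup directory created earlier; the restore path is a placeholder.

```bash
# List the contents of the compressed tape archive produced by pg_basebackup.
tar -tzf db_backup/base.tar.gz | head

# Extract it into a placeholder directory for inspection or a later restore.
mkdir -p /path/to/restore/data
tar -xzf db_backup/base.tar.gz -C /path/to/restore/data
```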
Making the backup standalone By specifying -x, we tell the database that we want a "complete" backup. This means we could extract or copy the backup anywhere and start it as a fully qualified database. As we mentioned before, the flag means that you want to include transaction logs, which is how the database recovers from crashes, checks integrity, and performs other important tasks. The following is the command again, for reference: $> pg_basebackup -x -D db_backup When combined with the TAR output format and compression, standalone binary backups are perfect for archiving to tape for later retrieval, as each backup is compressed and self-contained. By default, pg_basebackup does not include transaction logs, because many (possibly most) administrators back these up separately. These files have multiple uses, and putting them in the basic backup would duplicate efforts and make backups larger than necessary. We include them at this point because it is still too early for such complicated scenarios. We will get there eventually, of course. Database clones Because pg_basebackup operates through PostgreSQL's replication protocol, it can execute remotely. For instance, if the database was on a server named Production, and we wanted a copy on a server named Recovery, we could execute the following command from Recovery: $> pg_basebackup -h Production -x -D /full/db/path For this to work, we would also need this line in pg_hba.conf for Recovery: host replication postgres Recovery trust Though we set the authentication method to trust, this is not recommended for a production server installation. However, it is sufficient to allow Recovery to copy all data from Production. With the -x flag, it also means that the database can be started and kept online in case of emergency. It is a backup and a running server. Parallel compression Compression is very CPU intensive, but there are some utilities capable of threading the process. Tools such as pbzip2 or pigz can do the compression instead. Unfortunately, this only works in the case of a single tablespace (the default one; if you create more, this will not work). The following is the command for compression using pigz: $> pg_basebackup -Ft -D - | pigz -j 4 > db_backup.tar.gz It uses four threads of compression, and sets the backup directory to standard output (-) so that pigz can process the output itself. Summary In this article we saw the process of obtaining a binary backup. Though, we saw that this process is more complex and tedious, but at the same time it is much faster. Further resources on this subject: Introduction to PostgreSQL 9 Backup in PostgreSQL 9 Recovery in PostgreSQL 9

Ease the Chaos with Automated Patching

Packt
02 Apr 2013
19 min read
(For more resources related to this topic, see here.) We have seen how the provisioning capabilities of the Oracle Enterprise Manager's Database Lifecycle Management (DBLM) Pack enable you to deploy fully patched Oracle Database homes and databases, as replicas of the gold copy in the Software Library of Enterprise Manager. However, nothing placed in production should be treated as static. Software changes in development cycles, enhancements take place, or security/functional issues are found. For almost anything in the IT world, new patches are bound to be released. These will also need to be applied to production, testing, reporting, staging, and development environments in the data center on an ongoing basis. For the database side of things, Oracle releases quarterly a combination of security fixes known as the Critical Patch Update (CPU). Other patches are bundled together and released every quarter in the form of a Patch Set Update (PSU), and this also includes the CPU for that quarter. Oracle strongly recommends applying either the PSU or the CPU every calendar quarter. If you prefer to apply the CPU, continue doing so. If you wish to move to the PSU, you can do so, but in that case continue only with the PSU. The quarterly patching requirement, as a direct recommendation from Oracle, is followed by many companies that prefer to have their databases secured with the latest security fixes. This underscores the importance of patching. However, if there are hundreds of development, testing, staging, and production databases in the data center to be patched, the situation quickly turns into a major manual exercise every three months. DBAs and their managers start planning for the patch exercise in advance, and a lot of resources are allocated to make it happen—with the administrators working on each database serially, at times overnight and at times over the weekend. There are a number of steps involved in patching each database, such as locating the appropriate patch in My Oracle Support (MOS), downloading the patch, transferring it to each of the target servers, upgrading the OPATCH facility in each Oracle home, shutting down the databases and listeners running from that home, applying the patch, starting each of the databases in restricted mode, applying any supplied SQL scripts, restarting the databases in normal mode, and checking the patch inventory. These steps have to be manually repeated on every database home on every server, and on every database in that home. Dull repetition of these steps in patching the hundreds of servers in a data center is a very monotonous task, and it can lead to an increase in human errors. To avoid these issues inherent in manual patching, some companies decide not to apply the quarterly patches on their databases. They wait for a year, or a couple of years before they consider patching, and some even prefer to apply year-old patches instead of the latest patches. This is counter-productive and leads to their databases being insecure and vulnerable to attacks, since the latest recommended CPUs from Oracle have not been applied. What then is the solution, to convince these companies to apply patches regularly? If the patching process can be mostly automated (but still under the control of the DBAs), it would reduce the quarterly patching effort to a great extent. 
Companies would then have the confidence that their existing team of DBAs would be able to manage the patching of hundreds of databases in a controlled and automated manner, keeping human error to a minimum. The Database Lifecycle Management Pack of Enterprise Manager Cloud Control 12c is able to achieve this by using its Patch Automation capability. We will now look into Patch Automation and the close integration of Enterprise Manager with My Oracle Support. Recommended patches By navigating to Enterprise | Summary, a Patch Recommendations section will be visible in the lower left-hand corner, as shown in the following screenshot: The graph displays either the Classification output of the recommended patches, or the Target Type output. Currently for this system, more than five security patches are recommended as can be seen in this graph. This recommendation has been derived via a connection to My Oracle Support (the OMS can be connected either directly to the Internet, or by using a proxy server). Target configuration information is collected by the Enterprise Manager Agent and is stored in the Configuration Management Database (CMDB) within the repository. This configuration information is collated regularly by the Enterprise Manager's Harvester process and pushed to My Oracle Support. Thus, configuration information about your targets is known to My Oracle Support, and it is able to recommend appropriate patches as and when they are released. However, the recommended patch engine also runs within Enterprise Manager 12c at your site, working off the configuration data in the CMDB in Enterprise Manager, so recommendations can in fact be achieved without the configuration having been uploaded on MOS by the Harvester (this upload is more useful now for other purposes, such as attaching configuration details during SR creation). It is also possible to get metadata about the latest available patches from My Oracle Support in offline mode, but more manual steps are involved in this case, so Internet connectivity is recommended to get the full benefits of Enterprise Manager's integration with My Oracle Support. To view the details about the patches, click on the All Recommendations link or on the graph itself. This connects to My Oracle Support (you may be asked to log in to your company-specific MOS account) and brings up the list of the patches in the Patch Recommendations section. The database (and other types of) targets managed by the Enterprise Manager system are displayed on the screen, along with the recommended CPU (or other) patches. We select the CPU July patch for our saiprod database. This displays the details about the patch in the section in the lower part of the screen. We can see the list Bugs Resolved by This Patch, the Last Updated date and Size of the patch and also Read Me—which has important information about the patch. The number of total downloads for this patch is visible, as is the Community Discussion on this patch in the Oracle forums. You can add your own comment for this patch, if required, by selecting Reply to the Discussion. Thus, at a glance, you can find out how popular the patch is (number of downloads) and any experience of other Oracle DBAs regarding this patch—whether positive or negative. Patch plan You can view the information about the patch by clicking on the Full Screen button. You can download the patch either to the Software Library in Enterprise Manager or to your desktop. 
Finally, you can directly add this patch to a new or existing patch plan, which we will do next. Go to Add to Plan | Add to New, and enter Plan Name as Sainath_patchplan. Then click on Create Plan. If you would like to add multiple patches to the plan, select both the patches first and then add to the plan. (You can also add patches later to the plan). After the plan is created, click on View Plan. This brings up the following screen: A patch plan is nothing but a collection of patches that can be applied as a group to one or more targets. On the Create Plan page that appears, there are five steps that can be seen in the left-hand pane. By default, the second step appears first. In this step, you can see all the patches that have been added to the plan. It is possible to include more patches by clicking on the Add Patch... button. Besides the ability to manually add a patch to this list, the analysis process may also result in additional patches being added to the plan. If you click on the first step, Plan Information, you can put in a description for this plan. You can also change the plan permissions, either Full or View, for various Enterprise Manager roles. Note that the Full permission allows the role to validate the plan, however, the View permission does not allow validation. Move to step 3, Deployment Options. The following screen appears. Out-of-place patching A new mechanism for patching has been provided in the Enterprise Manager Cloud Control 12c version, known as out-of-place patching. This is now the recommended method and creates a new Oracle home which is then patched while the previous home is still operational. All this is done using an out of the box deployment procedure in Enterprise Manager. Using this mechanism means that the only downtime will take place when the databases from the previous home are switched to run from the new home. If there is any issue with the database patch, you can switch back to the previous unpatched home since it is still available. So, patch rollback is a lot faster. Also, if there are multiple databases running in the previous home, you can decide which ones to switch to the new patched home. This is obviously an advantage, otherwise you would be forced to simultaneously patch all the databases in a home. A disadvantage of this method would be the space requirements for a duplicate home. Also, if proper housekeeping is not carried out later on, it can lead to a proliferation of Oracle homes on a server where patches are being applied regularly using this mechanism. This kind of selective patching and minimal downtime is not possible if you use the previously available method of in-place patching, which uses a separate deployment procedure to shut down all databases running from an Oracle home before applying the patches on the same home. The databases can only be restarted normally after the patching process is over, and this obviously takes more downtime and affects all databases in a home. Depending on the method you choose, the appropriate deployment procedure will be automatically selected and used. We will now use the out-of-place method in this patch plan. On the Step 3: Deployment Options page, make sure the Out of Place (Recommended) option is selected. Then click on Create New Location. Type in the name and location of the new Oracle home, and click on the Validate button. This checks the Oracle home path on the Target server. After this is done, click on the Create button. 
The deployment options of the patch plan are successfully updated, and the new home appears on the Step 3 page. Click on the Credentials tab. Here you need to select or enter the normal and privileged credentials for the Oracle home. Click on the Next button. This moves us to step 4, the Validation step. Pre-patching analysis Click on the Analyze button. A job to perform prepatching analysis is started in the background. This will compare the installed software and patches on the targets with the new patches you have selected in your plan, and attempt to validate them. This validation may take a few minutes to complete, since it also checks the Oracle home for readiness, computes the space requirements for the home, and conducts other checks such as cluster node connectivity (if you are patching a RAC database). If you drill down to the analysis job itself by clicking on Show Detailed Progress here, you can see that it does a number of checks to validate if the targets are supported for patching, verifies the normal and super user credentials of the Oracle home, verifies the target tools, commands, and permissions, upgrades OPATCH to the latest version, stages the selected patches to Oracle homes, and then runs the prerequisite checks including those for cloning an Oracle home. If the prerequisite checks succeed, the analysis job skips the remaining steps and stops at this point with a successful status. The patch is seen as Ready for Deployment. If there are any issues, they will show up at this point. For example, if there is a conflict with any of the patches, a replacement patch or a merge patch may be suggested. If there is no replacement or merge patch and you want to request such a patch, it will allow you to make the request directly from the screen. If you are applying a PSU and the CPU for that same release is already applied to the Oracle home, for example, July 2011 CPU, then because the PSU is a superset of the CPU, the MOS analysis will stop and mention that the existing patch fixes the issues. Such a message can be seen in the Informational Messages section of the Validation page. Deployment In our case, the patch is Ready for Deployment. At this point, you can move directly to step 5, Review & Deploy, by clicking on it in the left-hand side pane. On the Review & Deploy page, the patch plan is described in detail along with Impacted Targets. Along with the database that is in the patch plan, a new impacted target has been found by the analysis process and added to the list of impacted targets. This is the listener that is running from the home that is to be cloned and patched. The patches that are to be applied are also listed on this review page, in our case the CPUJUL2011 patch is shown with the status Conflict Free. The deployment procedure that will be used is Clone and Patch Oracle Database, since out-of-place patching is being used, and all instances and listeners running in the previous Oracle home are being switched to the new home. Click on the Prepare button. The status on the screen changes to Preparation in Progress. A job for preparation of the out-of-place patching starts, including cloning of the original Oracle home and applying the patches to the cloned home. No downtime is required while this job is running; it can happen in the background. This preparation phase is like a pre-deploy and is only possible in the case of out-of-place patching, whereas in the case of in-place patching, there is no Prepare button and you deploy straightaway. 
Clicking on Show Detailed Progress here opens a new window showing the job details. When the preparation job has successfully completed (after about two hours in our virtual machine), we can see that it performs the cloning of the Oracle home, applies the patches on the new home, validates the patches, runs the post patch scripts, and then skips all the remaining steps. It also collects target properties for the Oracle home in order to refresh the configurations in Enterprise Manager. The Review & Deploy page now shows Preparation Successful!. The plan is now ready to be deployed. Click on the Deploy button. The status on the screen changes to Deployment in Progress. A job for deployment of the out-of-place patching starts. At this time, downtime will be required since the database instances using the previous Oracle home will be shut down and switched across. The deploy job successfully completes (after about 21 minutes in our virtual machine); we can see that it works iteratively over the list of hosts and Oracle homes in the patch plan. It starts a blackout for the database instances in the Oracle home (so that no alerts are raised), stops the instances, migrates them to the cloned Oracle home, starts them in upgrade mode, applies SQL scripts to patch the instance, applies post-SQL scripts, and then restarts the database in normal mode. The deploy job applies other SQL scripts and recompiles invalid objects (except in the case of patch sets). It then migrates the listener from the previous Oracle home using the Network Configuration Assistant (NetCA), updates the Target properties, stops the blackout, and detaches the previous Oracle home. Finally, the configuration information of the cloned Oracle home is refreshed. The Review & Deploy page of the patch plan now shows the status of Deployment Successful!, as can be seen in the following screenshot: Plan template On the Deployment Successful page, it is possible to click on Save as Template at the bottom of the screen in order to save a patch plan as a plan template. The patch plan should be successfully analyzed and deployable, or successfully deployed, before it can be saved as a template. The plan template, when thus created, will not have any targets included, and such a template can then be used to apply the successful patch plan to multiple other targets. Inside the plan template, the Create Plan button is used to create a new plan based on this template, and this can be done repeatedly for multiple targets. Go to Enterprise | Provisioning and Patching | Patches & Updates; this screen displays a list of all the patch plans and plan templates that have been created. The successfully deployed Sainath_patchplan and the new patch plan template also shows up here. To see a list of the saved patches in the Software Library, go to Enterprise | Provisioning and Patching | Saved Patches. This brings up the following screen: This page also allows you to manually upload patches to the Software Library. This scenario is mostly used when there is no connection to the Internet (either direct or via a proxy server) from the Enterprise Manager OMS servers, and consequently you need to download the patches manually. 
For more details on setting up the offline mode and downloading the patch recommendations and latest patch information in the form of XML files from My Oracle Support, please refer to Oracle Enterprise Manager Lifecycle Management Administrator's Guide 12c Release 2 (12.1.0.2) at the following URL: http://docs.oracle.com/cd/E24628_01/em.121/e27046/pat_mosem_new.htm#BABBIEAI

Patching roles

The new version of Enterprise Manager Cloud Control 12c supplies out-of-the-box administrator roles specifically for patching. These roles are EM_PATCH_ADMINISTRATOR, EM_PATCH_DESIGNER, and EM_PATCH_OPERATOR. You need to grant these roles to the appropriate administrators. Move to Setup | Security | Roles. On this page, search for the roles specifically meant for patching. The three roles appear as follows:

The EM_PATCH_ADMINISTRATOR role can create, edit, deploy, or delete any patch plan and can also grant privileges to other administrators after creating them. This role has full privileges on any patch plan or patch template in the Enterprise Manager system and maintains the patching infrastructure.

The EM_PATCH_DESIGNER role normally identifies patches to be used in the patching cycle across development, testing, and production. This role would be that of the senior DBA in real life. The patch designer creates patch plans and plan templates, and grants privileges for these plan templates to the EM_PATCH_OPERATOR role. As an example, the patch designer will select a set of recommended and other manually selected patches for an Oracle 11g database and create a patch plan. This role will then test the patching process in a development environment, and save the successfully analyzed or deployed patch plan as a plan template. The patch designer will then publish the Oracle 11g database patching plan template to the patch operator—probably the junior DBA or application DBA in real life.

Next, the patch operator creates new patch plans using the template (but cannot create a template), and adds a different list of targets, such as other Oracle 11g databases in the test, staging, or production environment. This role then schedules the deployment of the patches to all these environments—using the same template again and again.

Summary

Enterprise Manager Cloud Control 12c allows automation of the tedious patching procedure used in many organizations today to patch their Oracle databases and servers. This is achieved via the Database Lifecycle Management Pack, which is one of the main licensable packs of Enterprise Manager. Sophisticated Deployment Procedures are provided out of the box to fulfill many different types of patching tasks, and this helps you to achieve mass patching of multiple targets with multiple patches in a fully automated manner, thus making tremendous savings in administrative time and effort. Some companies have estimated savings of up to 98 percent in patching tasks in their data centers. Different types of patches can be applied in this manner, including CPUs, PSUs, patch sets, and other one-off patches. Different versions of databases are supported, such as 9i, 10g, and 11g. For the first time, the upgrade of single-instance databases is also possible via Enterprise Manager Cloud Control 12c.

There is full integration of the patching capabilities of Enterprise Manager with My Oracle Support (MOS). The support site retains the configuration of all the components managed by Enterprise Manager inside the company.
Since the current version and patch information of the components is known, My Oracle Support is able to provide appropriate patch recommendations for many targets, including the latest security fixes. This ensures that the company is up to date with regards to security protection. A full division of roles is available, such as Patch Administrator, Designer, and Operator. It is possible to take the My Oracle Support recommendations, select patches for targets, put them into a patch plan, deploy the patch plan and then create a plan template from it. The template can then be published to any operator who can then create their own patch plans for other targets. In this way patching can be tested, verified, and then pushed to production. In all, Enterprise Manager Cloud Control 12 c offers valuable automation methods for Mass Patching, allowing Administrators to ensure that their systems have the latest security patches, and enabling them to control the application of patches on development, test, and production servers from the centralized location of the Software Library. Resources for Article : Further resources on this subject: Author Podcast - Bob Griesemer on Oracle Warehouse Builder 11g [Article] Managing Oracle Business Intelligence [Article] Author Podcast - Ronald Rood discusses the birth of Oracle Scheduler [Article]

Follow the Money

Packt
28 Mar 2013
13 min read
(For more resources related to this topic, see here.) It starts with the Cost Worksheet In PCM, the Cost Worksheet is the common element for all the monetary modules. The Cost Worksheet is a spreadsheet-like module with rows and columns. The following screenshot shows a small part of a typical Cost Worksheet register: The columns are set by PCM but the rows are determined by the organization. The rows are the Cost Codes. This is your cost breakdown structure. This is the lowest level of detail with which money will be tracked in PCM. It can also be called Cost Accounts. All money entered into the documents in any of the monetary modules in PCM will be allocated to a Cost Code on the Cost Worksheet. Even if you do not specifically allocate to a Cost Code, the system will allocate to a system generated Cost Code called NOT COSTED. The NOT COSTED Cost Code is important so no money slips through the cracks. If you forget to assign money to your Cost Codes on the project it will assign the money to this code. When reviewing the Cost Worksheet, a user can review the NOT COSTED Cost Code and see if any money is associated with this code. If there is money associated with NOT COSTED, he can find that document where he has forgotten to allocate all the money to proper Cost Codes. Users cannot edit any numbers directly on the Cost Worksheet; it is a reflection of information entered on various documents on your project. This provides a high level of accountability in that no money can be entered or changed without a document entered someplace within PCM (like can be done with a spreadsheet) The Cost Code itself can be up to 30 characters in length and can be divided into segments to align with the cost breakdown structure, as shown in the following screenshot: The number of Cost Codes and the level of breakdown is typically determined by the accounting or ERP system used by your organization or it can be used as an extension of the ERP system's coding structure. When the Cost Code structure matches, integration between the two systems becomes easier. There are many other factors to consider when thinking about integrating systems but the Cost Code structure is at the core of relating the two systems. Defining the segments within the Cost Codes is done as part of the initial setup and implementation of PCM. This is done in the Cost Code Definitions screen, as shown in the following screenshot: To set up, you must tell PCM what character of the Cost Code the segment starts with and how long the segment is (the number of characters). Once this is done you can also populate a dictionary of titles for each segment. A trick used for having different segment titles for different projects is to create an identical segment dictionary but for different projects. For example, if you have a different list of Disciplines for every project, you can create and define a list of Disciplines for each project with the same starting character and length. Then you can use the proper Cost Code definitions in your layouts and reporting for that project. The following screenshot shows how this can be done: Once the Cost Codes have been defined, the Cost Worksheet will need to be populated for your project. There are various ways to accomplish this. Create a dummy project with the complete list of company Cost Codes you would ever use on a project. When you want to populate the Cost Code list on a new project, use the Copy Cost Codes function from the project tree. 
Import a list of Cost Codes that have been developed in a spreadsheet (Yes, I used the word "spreadsheet". There are times when a spreadsheet comes in handy – managing a multi-million dollar project is not one of them). PCM has an import function from the Cost Worksheet where you can import a comma-separated values (CSV) file of the Cost Codes and titles. Enter the Cost Codes one at a time from the Cost Worksheet. If there are a small number of Cost Codes, this might be the fastest and easiest method. Understanding the columns of the Cost Worksheet will help you understand how powerful and important the Cost Worksheet really is. The columns of the Cost Worksheet in PCM are set by the system. They are broken down into a few categories, as follows: Budget Commitment Custom Actuals Procurement Variances Miscellaneous Each of the categories has a corresponding color to help differentiate them when looking at the Cost Worksheet. Within each of these categories are a number of columns. The Budget, Commitment, and Custom categories have the same columns while the other categories have their own set of columns. These three categories work basically the same. They can be defined in basic terms as follows: Budget: This is the money that your company has available to spend and is going to be received by the project. Examples depend on the perspective of the setup of PCM. In the example of our cavemen Joe and David, David is the person working for Joe. If David was using PCM, the Budget category would be the amount of the agreed price between Joe and David to make the chair or the amount of money that David was going to be paid by Joe to make the chair. Committed: This is the money that has been agreed to be spent on the project, not the money that has been spent. So in our example it would be the amount of money that David has agreed to pay his subcontractors to supply him with goods and services to build the chair for him. Custom: This is a category that is available to be used by the user for another contracting type. It has its own set of columns identical to the Budget and Commitment categories. This can be used for a Funding module where you can track the amount of money funded for the project, which can be much different from the available budget for the project. Money distributed to the Trends module can be posted to many of the columns as determined by the user upon adding the Trend. The Trend document is not referenced in the following explanations. When money is entered in a document, it must be allocated or distributed to one or multiple Cost Codes. As stated before, if you forget or do not allocate the money to a Cost Code, PCM will allocate the money to the NOT COSTED Cost Code. The system knows what column to place the money in but the user must tell PCM the proper row (Cost Code). If the Status Type of a document is set to Closed or Rejected, the money is removed from the Cost Worksheet but the document is still available to be reviewed. This way only documents that are in progress or approved will be placed on the Cost Worksheet. Let's look at each of the columns individually and explain how money is posted. The only documents that affect the Cost Worksheet are as follows: Contracts (three types) Change Orders Proposals Payment Requisitions Invoices Trends Procurement Contracts Let's look at the first three categories first since they are the most complex. Following is a table of the columns associated with these categories. 
Understand that the terminology used here is the standard out of the box terminology of PCM and may not match what has been set up in your organization. The third contract type (Custom) can be turned on or off using the Project Settings. It can be used for a variety of types as it has its own set of columns in the Cost Worksheet. The Custom contract type can be used in the Change Management module; however, it utilizes the Commitment tab, which requires the user to understand exactly what category the change is related. The following tables show various columns on the Cost Worksheet starting with the Cost Code itself. The first table shows all the columns used by each of the three contract categories: Cost Worksheet Columns The columns listed above are affected by the Contracts, Purchase Orders, or any Change Document modules. Let's look at specific definitions of what document type can be posted to which column. The Original Column The Original column is used for money distributed from any of the Contract modules. If a Commitment contract is added under the Contracts – Committed module and the money is distributed to various Cost Codes (rows), the column used is the Original Commitment column in the worksheet. It's the same with the Contracts – Budgeted and Contracts – Custom modules. The Purchase Order module is posted to the Commitments category. Money can also be posted to this column for Budget and Commitment contracts from the Change Management module where a phase has been assigned this column. This is not a typical practice as the Original column should be unchangeable from the values on the original contract. The Approved Column The Approved Revisions column is used for money distributed from the Change Order module. If a Change Order is added under the Change Order module against a commitment contract and the money is distributed to various Cost Codes (rows), and the Change Order has been approved, the money on this document is posted to the Approved Commitment Revisions column in the worksheet. We will discuss what happens prior to approval later. The Revised Column The Revised column is a computed column adding the original money and the approved money. Money cannot be distributed to this column from any document in PCM. The Pending Changes Column The Pending Revisions column can be populated by several document types as follows: Change Orders: Prior to approving the Change Order document, all money associated with the Change Order document created from the Change Orders module from the point of creation will be posted to the Pending Changes column. Change Management: These are documents associated with a change management process where the change phase is associated with the Pending column. This can be from the Proposal module or the Change Order module. Proposals: These are documents created in the Proposals module either through the Change Management module or directly from the module itself. The Estimated Changes Column The Estimated Revisions column is populated from phases in Change Management that have been assigned to distribute money to this column The Adjustment Column The Adjustment column is populated from phases in Change Management that have been assigned to distribute money to this column The Projected Column The Projected column is a computed column of all columns associated with a category. This column is very powerful in understanding the potential cost at completion of this Cost Code. 
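As a purely hypothetical illustration of how these columns combine for a single Cost Code in the Budget category (the figures are invented, and the column labels follow the descriptions above; the same arithmetic applies to the Commitment and Custom categories):

Original Budget                 1,000,000
Approved Budget Revisions          50,000
Revised Budget                  1,050,000   (Original + Approved)
Pending Budget Revisions           20,000
Estimated Budget Revisions         10,000
Budget Adjustments                      0
Projected Budget                1,080,000   (Revised + Pending + Estimated + Adjustment)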
Actuals There are two columns that are associated with actual cost documents in PCM. The modules that affect these columns are as follows: Payment Requisitions Invoices These columns are the Actuals Received and Actuals Issued columns. These column names can be confusing and should be considered for change during implementation. This is the way you could look at what money these columns include. Actuals Received: This column holds money where you have received a Payment Requisition or Invoice to be paid by you. This also includes the Custom category. Actuals Issued: This column holds money where you have issued a Payment Requisition or Invoice to be paid to you. As Payment Requisitions or Invoices are being entered and the money distributed to Cost Codes, this money will be placed in one of these two columns depending on the contract relationship associated with these documents. Be aware that money is placed into these columns as soon as it is entered into Payment Requisitions or Invoices regardless of approval or certification. Procurement There are many columns relating to the Procurement module. This book does not go into details of the Procurement module. The column names related to Procurement are as follows: Procurement Estimate Original Estimate Estimate Accuracy Estimated Gross Profit Buyout Purchasing Buyout Variances There are many Variance columns that are computed columns. These columns show the variance (or difference) between other columns on the worksheet, as follows. Original Variance: The Original Budget minus the Original Commitment Approved Variance: The Revised Budget minus the Revised Commitment Pending Variance: The (Revised Budget plus Pending Budget Revisions) minus (Revised Commitment plus Pending Commitment) Projected Variance: The Projected Budget minus the Projected Commitment These columns are very powerful to help analyze relationships between the Budget category and the Commitment category. Miscellaneous There are a few miscellaneous columns as follows that are worth noting so you understand what the numbers mean: Budget Percent: This represents the percentage of the Actuals Issued column of the Revised Budget column for that Cost Code. Commitment Percentage: This represents the percentage of the Actuals Received column of the Revised Commitment column for that Cost Code. Planned to Commit: This is the planned expenditure for the Cost Code. This value can only be populated from the Details tab of the Cost Code. It is also used for an estimators value of the Cost Code. Drilling down to the detail The beauty of the Cost Worksheet is the ability to quickly review what documents have had an effect on which column on the worksheet. Look at the Cost Worksheet as a ten-thousand foot view of the money on your project. There is a lot of information that can be gleaned from this high-level review especially if you are using layouts properly. If you see some numbers that need further review, then drilling down to the detail directly from the Cost Worksheet is quite simple. To drill down to the detail, click on the Cost Code. This will bring up a new page with tabs for the different categories. Click on the tab you wish to review and the grid shows all the documents where some or all the money has been posted to this Cost Code. This page shows all columns affected by the selected category, with the rows representing each document and the corresponding value from that document that affects the selected Cost Code on the Cost Worksheet. 
From this page you can click on the link under the Item column (as shown in the previous screenshot) to open the actual document that the row represents. Summary Understanding the concepts in this article is key to understanding how the money flows within PCM. Take the time to review this information so that other articles of the book on changes, payments, and forecasting make more sense. The ability to have all aspects of the money on your project accessible from one module is extremely powerful and should be one of the modules that you refer to on a regular basis. Resources for Article : Further resources on this subject: Author Podcast - Ronald Rood discusses the birth of Oracle Scheduler [Article] Author Podcast - Bob Griesemer on Oracle Warehouse Builder 11g [Article] Oracle Integration and Consolidation Products [Article]

Generating Reports in Notebooks in RStudio

Packt
26 Mar 2013
7 min read
(For more resources related to this topic, see here.)

A very important feature of reproducible science is generating reports. The main idea of automatic report generation is that the results of analyses are not manually copied to the report. Instead, both the R code and the report's text are combined in one or more plain text files. The report is generated by a tool that executes the chunks of code, captures the results (including figures), and generates the report by weaving the report's text and results together. To achieve this, you need to learn a few special commands, called markup specifiers, that tell the report generator which part of your text is R code, and which parts you want in special typesetting such as boldface or italic. There are several markup languages to do this, but the following is a minimal example using the Markdown language:

A simple example with Markdown

The left panel shows the plain text file in RStudio's editor and the right panel shows the web page that is generated by clicking on the Knit HTML button. The markup specifiers used here are the double asterisks for boldface, single underscores for slanted font, and the backticks for code. By adding an r to the first backtick, the report generator executes the code following it. To reproduce this example, go to File | New | R Markdown, copy the text as shown in the preceding screenshot, and save it as one.Rmd. Next, click on Knit HTML.

The Markdown language is one of many markup languages in existence and RStudio supports several of them. RStudio has excellent support for interweaving code with Markdown, HTML, LaTeX, or even plain comments. Notebooks are useful to quickly share annotated lines of code or results. There are a few ways to control the layout of a notebook. The Markdown language is easy to learn and has a fair amount of layout options. It also allows you to include equations in the LaTeX format. The HTML option is really only useful if you aim to create a web page. You should know, or be willing to learn, HTML to use it. The result of these three methods is always a web page (that is, an HTML file), although this can be exported to PDF. If you need ultimate control over your document's layout, and if you need features like automated bibliographies and equation numbering, LaTeX is the way to go. With this last option, it is possible to create papers for scientific journals straight from your analysis.

Depending on the chosen system, a text file with a different extension is used as the source file. The following table gives an overview:

Markup system    Input file type    Report file type
Notebook         .R                 .html (via .md)
Markdown         .Rmd               .html (via .md)
HTML             .Rhtml             .html
LaTeX            .Rnw               .pdf (via .tex)

Finally, we note that the interweaving of code and text (often referred to as literate programming) may serve two purposes. The first, described in this article, is to generate a data analysis report by executing code to produce the result. The second is to document the code itself, for example, by describing the purpose of a function and all its arguments.

Prerequisites for report generation

For notebooks, R Markdown, and Rhtml, RStudio relies on Yihui Xie's knitr package for executing code chunks and merging the results. The knitr package can be installed via RStudio's Packages tab or with the command install.packages("knitr"). For LaTeX/Sweave files, the default is to use R's native Sweave driver.
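Putting this together, the one.Rmd file from the earlier example might contain something like the following (the exact contents of the screenshot are assumed here); with knitr installed, clicking on Knit HTML turns it into a web page:

This is **bold**, this is _slanted_, and this is `code`.

By adding an r, the code is executed: two plus two is `r 2 + 2`.

The inline expression is evaluated during knitting, so the generated HTML shows the value 4 rather than the code.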
The knitr package is easier to use and has more options for fine-tuning, so in the rest of this article we assume that knitr is always used. To make sure that knitr is also used for Sweave files, go to Tools | Options | Sweave and choose knitr as Weave Rnw files. If you're working in an RStudio project, you can set this as a project option as well by navigating to Project | Project Options | Sweave. When you work with LaTeX/Sweave, you need to have a working LaTeX distribution installed. Popular distributions are TeXLive for Linux, MikTeX for Windows, and MacTeX for Mac OS X. Notebook The easiest way to generate a quick, sharable report straight from your Rscript is by creating a notebook via File | Notebook, or by clicking on the Notebook button all the way on the top right of the Rscript tab (right next to the Source button). Notebook options RStudio offers three ways to generate a notebook from an Rscript—the simplest are Default and knitr::stitch. These only differ a little in layout. The knitr::spin mode allows you to use the Markdown markup language to specify text layout. The markup options are presented after navigating to File | Notebook or after clicking on the Notebook button. Under the hood, the Default and knitr::stitch options use knitr to generate a Markdown file which is then directly converted to a web page (HTML file). The knitr::spin mode allows for using Markdown commands in your comments and will convert your .R file to a .Rmd (R Markdown) file before further processing. In Default mode, R code and printed results are rendered to code blocks in a fixedwidth font with a different background color. Figures are included in the output and the document is prepended with a title, an optional author name, and the date. The only option to include text in your output is to add it as an R comment (behind the # sign) and it will be rendered as such. In knitr::stitch mode, instead of prepending the report with an author name and date, the report is appended with a call to Sys.time() and R's sessionInfo(). The latter is useful since it shows the context in which the code was executed including R's version, locale settings, and loaded packages. The result of the knitr::stitch mode depends on a template file called knitr-template.Rnw, included with the knitr package. It is stored in a directory that you can find by typing system. file('misc',package='knitr'). The knitr::spin mode allows you to escape from the simple notebook and add text outside of code blocks, using special markup specifiers. In particular, all comment lines that are preceded with #' (hash and single quote) are interpreted as the Markdown text. For example, the following code block: # This is printed as comment in a code block 1 + 1 #' This will be rendered as main text #' Markdown **specifiers** are also _recognized_ Will be rendered in the knitr::spin mode as shown in the following screenshot: Reading a notebook in the knitr::spin mode allows for escaping to Markdown The knitr package has several general layout options for included code (that will be discussed in the next section). When generating a notebook in the knitr::spin mode, these options can be set by preceding them with a #+ (hash and plus signs). 
For example, the following code: #' The code below is _not_ evaluated #+ eval=FALSE 1 + 1 Results in the following report: Setting knitr options for a notebook in knitr::spin mode Although it is convenient to be able to use Markdown commands in the knitr::spin mode, once you need such options it is often better to switch to R Markdown completely, as discussed in the next section. Note that a notebook is a valid R script and can be executed as such. This is in contrast with the other report generation options—those are text files that need knitr or Sweave to be processed. Publishing a notebook Notebooks are ideal to share examples or quick results from fairly simple data analyses. Since early 2012, the creators of RStudio offer a website, called RPubs. com, where you can upload your notebooks by clicking on the Publish button in the notebook preview window that automatically opens after a notebook has been generated. Do note that this means that results will be available for the world to see, so be careful when using personal or otherwise private data. Summary In this article we discussed prerequisites for producing a report. We also learnt how to produce reports via Notebook that automatically include the results of an analysis. Resources for Article : Further resources on this subject: Organizing, Clarifying and Communicating the R Data Analyses[Article] Customizing Graphics and Creating a Bar Chart and Scatterplot in R [Article] Graphical Capabilities of R[Article]

Creating the first Circos diagram

Packt
25 Mar 2013
6 min read
(For more resources related to this topic, see here.)

Getting ready

Let's start with the simple task of graphing a relationship between a student's eye and hair color. We can expect some results: brown eyes are more common for students with brown or black hair, and blue eyes are more common amongst blondes. Circos is able to show these relationships with more clarity than a traditional table.

We will be using the hair and eye color data available in the book's supplemental materials (HairEyeColor.csv). The data contains information about the hair and eye color of University of Delaware students. Create a folder C:\Users\user_name\Circos Book\HairEyeColor, and place the data file into that location. Here, user_name denotes the user name that is used to log in to your computer. The original data is in a size that can typically be stored in a data set. Each line represents a student and their respective hair (black, brown, blonde, or red) and eye (blue, brown, green, or hazel) color. The following table shows the first 10 lines of data: Hair Eye Brown Red Blonde Brown Blonde Brown Black Brown Brown Brown Brown Blue Hazel Blue Blue Brown Brown Hazel

Before we start creating the specific diagram, let's prepare the data into a table. If you wish, you can use Microsoft Excel's PivotTable or the Data Pilot of OpenOffice to transform it into a table as follows:

         Blue  Brown  Green  Hazel
Black      20     68      5     15
Blonde     94      7     15     11
Brown      84    119     29     54
Red        17     26     14     14

In order to use the data for Circos, we need a simpler format. Open a text file and create a table separated only by spaces. We will also change the row and column titles to make it clearer, as follows:

X Blue_Eyes Brown_Eyes Green_Eyes Hazel_Eyes
Black_Hair 20 68 5 15
Blonde_Hair 94 7 15 11
Brown_Hair 84 119 29 54
Red_Hair 17 26 14 14

The X is simply a placeholder. Save this file as HairEyeColorTable.txt, as we are now ready to use Circos. You can skip the process of making the raw tables. We will be using the HairEyeColorTable.txt file to create the Circos diagram.

How to do it…

Open the Command Prompt and change the directory to the location of the tableviewer tools in Circos\Circos Tools\tools\tableviewer\bin, as follows:

cd C:\Program Files (x86)\Circos\Circos Tools\tools\tableviewer\bin

Parse the text table (HairEyeColorTable.txt). This will create a new file, HairEyeColorTable-parsed.txt, which will be refined into a Circos diagram as follows:

perl parse-table -file "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable.txt" > "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable-parsed.txt"

The parse command consists of a few parts. First, perl parse-table instructs Perl to execute the parse program on the HairEyeColorTable.txt file. Second, the > symbol instructs Windows to write the output into another text file called HairEyeColorTable-parsed.txt.

Linux Users: Linux users can use a simpler, shorter syntax. Steps 2 and 3 can be completed with this command:

cat ~/"Documents/Circos Book/HairEyeColor/HairEyeColorTable.txt" | bin/parse-table | bin/make-conf -dir ~/"Documents/user_name/Circos Book/HairEyeColor/HairEyeColorTable-parsed.txt"

Create the configuration files from the parsed table using the following command:

type "C:\Users\user_name\Circos Book\HairEyeColor\HairEyeColorTable-parsed.txt" | perl make-conf -dir "C:\Users\user_name\Circos Book\HairEyeColor"

This will create 11 new configuration files. These files contain the data and style information which is needed to create the final diagram. This command consists of two parts.
We are instructing Windows to pass the text in the HairEyeColorTable-parsed.txt file to the make-conf command. The | (pipe) character separates what we want passed along and the actual command. After the pipe, we are instructing Perl to execute the make-conf command and store the output into a new directory. We need to create a final file, which compiles all the information. This file will also tell Circos how the diagram should appear, such as size, labels, image style, and where the diagram will be saved. We will save the diagram as HairEyeColor.conf. The make-conf command gave us the color.conf file, which associates colors with the final diagram. In addition, the Circos installation provides us with some other basic colors and fonts. The first several lines of code are: <colors> <<include colors.conf>> <<include C:Program Files (x86)Circosetccolors.conf>> </colors> <fonts> <<include C:Program Files (x86)Circosetcfonts.conf>> </fonts> The next segment is the ideogram. These are the parameters that set the details of the image. This first set of lines specifies the spacing, color, and size of the chromosomes: <ideogram> <spacing> default=0.01r break=200u </spacing> thickness = 100p stroke_thickness = 2 stroke_color = black fill = yes fill_color = black radius = 0.7r show_label = yes label_font = condensedbold label_radius = dim(ideogram,radius) + 0.05r label_size = 48 band_stroke_thickness = 2 show_bands = yes fill_bands = yes </ideogram> Next, we will define the image, including where it is stored (this location is mentioned in the following code snippet as dir), the file name, whether we want an SVG or PNG file, size, background color, and any rotation: dir = C:Usersuser_nameCircos BookHairEyeColor file = HairEyeColor svg = yes png = yes 24bit = yes radius = 800p background = white angle_offset = +90 Lastly, we will input the data and define how the links (ribbons) should look: chromosomes_units = 1 karyotype = karyotype.txt <links> z = 0 radius = 1r – 150p bezier_radius = 0.2r <link cell_> ribbon = yes flat = yes show = yes color = black thickness = 2 file = cells.txt </link> show_bands = yes <<include C:Program Files (x86)Circosetchousekeeping.conf>> Save this file as HairEyeColor.conf with the other configuration files. Have a look at the next diagram which explains all this procedure: The make-conf command outputs a few very important files. First, karyotype.txt defines each ideogram band's name, width, and color. Meanwhile, cells.txt is the segdup file containing the actual data. It is very different from our original table, but it dictates the width of each ribbon. Circos links the karyotype and segdup files to create the image. The other configuration files are mostly to set the aesthetics, placement, and size of the diagram. Return to the Command Prompt and execute the following command: cd C:Usersuser_nameCircos BookHairEyeColor perl "C:Program Files (x86)Circosbincircos" –conf HairEyeColor.conf Several lines of text will scroll across the screen. At the conclusion, HairEyeColor.png and HairEyeColor.svg will appear in the folder as shown in the next diagram:

Extending Your Structure and Search

Packt
15 Mar 2013
6 min read
(For more resources related to this topic, see here.) Indexing data that is not flat Not all data is flat. Of course if we are building our system, which ElasticSearch will be a part of, we can create a structure that is convenient for ElasticSearch. However, it doesn't need to be flat, it can be more object-oriented. Let's see how to create mappings that use fully structured JSON objects. Data Let's assume we have the following data (we store it in the file called structured_data.json): { "book" : { "author" : { "name" : { "firstName" : "Fyodor", "lastName" : "Dostoevsky" } }, "isbn" : "123456789", "englishTitle" : "Crime and Punishment", "originalTitle" : "Преступлéние и наказáние", "year" : 1886, "characters" : [ { "name" : "Raskolnikov" }, { "name" : "Sofia" } ], "copies" : 0 } } As you can see, the data is not flat. It contains arrays and nested objects, so we can't use our mappings that we used previously. But we can create mappings that will be able to handle such data. Objects The previous example data shows a structured JSON file. As you can see, the root object in our file is book. The root object is a special one, which allows us to define additional properties. The book root object has some simple properties such as englishTitle, originalTitle, and so on. Those will be indexed as normal fields in the index. In addition to that it has the characters array type, which we will discuss in the next paragraph. For now, let's focus on author. As you can see, author is an object that has another object nested in it, that is, the name object, which has two properties firstName and lastName. Arrays We have already used array type data, but we didn't talk about it. By default all fields in Lucene and thus in ElasticSearch are multivalued, which means that they can store multiple values. In order to send such fields for indexing to ElasticSearch we use the JSON array type, which is nested within the opening and closing square brackets []. As you can see in the previous example, we used the array type for characters property. Mappings So, what can we do to index such data as that shown previously? To index arrays we don't need to do anything, we just specify the properties for such fields inside the array name. So in our case in order to index the characters data present in the data we would need to add such mappings as these: "characters" : { "properties" : { "name" : {"type" : "string", "store" : "yes"} } } Nothing strange, we just nest the properties section inside the array's name (which is characters in our case) and we define fields there. As a result of this mapping, we would get the characters.name multivalued field in the index. We perform similar steps for our author object. We call the section by the same name as is present in the data, but in addition to the properties section we also tell ElasticSearch that it should expect the object type by adding the type property with the value object. We have the author object, but it also has the name object nested in it, so we do the same; we just nest another object inside it. So, our mappings for that would look like the following code: "author" : { "type" : "object", "properties" : { "name" : { "type" : "object", "properties" : { "firstName" : {"type" : "string", "store" : "yes"}, "lastName" : {"type" : "string", "store" : "yes"} } } } } The firstName and lastName fields would appear in the index as author.name.firstName and author.name.lastName. We will check if that is true in just a second. 
The rest of the fields are simple core types, so I'll skip discussing them. Final mappings So our final mappings file that we've called structured_mapping.json looks like the following: { "book" : { "properties" : { "author" : { "type" : "object", "properties" : { "name" : { "type" : "object", "properties" : { "firstName" : {"type" : "string", "store" : "yes"}, "lastName" : {"type" : "string", "store" : "yes"} } } } }, "isbn" : {"type" : "string", "store" : "yes"}, "englishTitle" : {"type" : "string", "store" : "yes"}, "originalTitle" : {"type" : "string", "store" : "yes"}, "year" : {"type" : "integer", "store" : "yes"}, "characters" : { "properties" : { "name" : {"type" : "string", "store" : "yes"} } }, "copies" : {"type" : "integer", "store" : "yes"} } } } To be or not to be dynamic As we already know, ElasticSearch is schemaless, which means that it can index data without the need of first creating the mappings (although we should do so if we want to control the index structure). The dynamic behavior of ElasticSearch is turned on by default, but there may be situations where you may want to turn it off for some parts of your index. In order to do that, one should add the dynamic property set to false on the same level of nesting as the type property for the object that shouldn't be dynamic. For example, if we would like our author and name objects not to be dynamic, we should modify the relevant parts of the mappings file so that it looks like the following code: "author" : { "type" : "object", "dynamic" : false, "properties" : { "name" : { "type" : "object", "dynamic" : false, "properties" : { "firstName" : {"type" : "string", "store" : "yes"}, "lastName" : {"type" : "string", "store" : "yes"} } } } } However, please remember that in order to add new fields for such objects, we would have to update the mappings. You can also turn off the dynamic mapping functionality by adding the index.mapper.dynamic : false property to your elasticsearch.yml configuration file. Sending the mappings to ElasticSearch The last thing I would like to do is test if all the work we did actually works. This time we will use a slightly different technique of creating an index and adding the mappings. First, let's create the library index with the following command: curl -XPUT 'localhost:9200/library' Now, let's send our mappings for the book type: curl -XPUT 'localhost:9200/library/book/_mapping' -d @structured_mapping.json Now we can index our example data: curl -XPOST 'localhost:9200/library/book/1' -d @structured_data.json If we would like to see how our data was indexed, we can run a query like the following: curl -XGET 'localhost:9200/library/book/_search?q=*:*&fields=*&pretty=true' It will return the following data: { "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 1.0, "fields" : { "copies" : 0, "characters.name" : [ "Raskolnikov", "Sofia" ], "englishTitle" : "Crime and Punishment", "author.name.lastName" : "Dostoevsky", "isbn" : "123456789", "originalTitle" : "Преступлéние и наказáние", "year" : 1886, "author.name.firstName" : "Fyodor" } } ] } } As you can see, all the fields from arrays and object types are indexed properly. Please notice that there is, for example, the author.name.firstName field present, because ElasticSearch did flatten the data.
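Because the nested objects are flattened into dotted field names, those fields can be queried directly. For instance, a simple query-string search against the index and type created above might look like the following (shown only as a quick check, using the same data):

curl -XGET 'localhost:9200/library/book/_search?q=author.name.lastName:Dostoevsky&pretty=true'

This should return the same book document, matched through the author.name.lastName field.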

Working with Apps in Splunk

Packt
08 Mar 2013
6 min read
(For more resources related to this topic, see here.) Defining an app In the strictest sense, an app is a directory of configurations and, sometimes, code. The directories and files inside have a particular naming convention and structure. All configurations are in plain text, and can be edited using your choice of text editor. Apps generally serve one or more of the following purposes: A container for searches, dashboards, and related configurations: This is what most users will do with apps. This is not only useful for logical grouping, but also for limiting what configurations are applied and at what time. This kind of app usually does not affect other apps. Providing extra functionality: Many objects can be provided in an app for use by other apps. These include field extractions, lookups, external commands, saved searches, workflow actions, and even dashboards. These apps often have no user interface at all; instead they add functionality to other apps. Configuring a Splunk installation for a specific purpose: In a distributed deployment, there are several different purposes that are served by the multiple installations of Splunk. The behavior of each installation is controlled by its configuration, and it is convenient to wrap those configurations into one or more apps. These apps completely change the behavior of a particular installation. Included apps Without apps, Splunk has no user interface, rendering it essentially useless. Luckily, Splunk comes with a few apps to get us started. Let's look at a few of these apps: gettingstarted: This app provides the help screens that you can access from the launcher. There are no searches, only a single dashboard that simply includes an HTML page. search: This is the app where users spend most of their time. It contains the main search dashboard that can be used from any app, external search commands that can be used from any app, admin dashboards, custom navigation, custom css, a custom app icon, a custom app logo, and many other useful elements. splunk_datapreview: This app provides the data preview functionality in the admin interface. It is built entirely using JavaScript and custom REST endpoints. SplunkDeploymentMonitor: This app provides searches and dashboards to help you keep track of your data usage and the health of your Splunk deployment. It also defines indexes, saved searches, and summary indexes. It is a good source for more advanced search examples. SplunkForwarder and SplunkLightForwarder: These apps, which are disabled by default, simply disable portions of a Splunk installation so that the installation is lighter in weight. If you never create or install another app, and instead simply create saved searches and dashboards in the app search, you can still be quite successful with Splunk. Installing and creating more apps, however, allows you to take advantage of others' work, organize your own work, and ultimately share your work with others. Installing apps Apps can either be installed from Splunkbase or uploaded through the admin interface. To get started, let's navigate to Manager | Apps, or choose Manage apps... from the App menu as shown in the following screenshot: Installing apps from Splunkbase If your Splunk server has direct access to the Internet, you can install apps from Splunkbase with just a few clicks. Navigate to Manager | Apps and click on Find more apps online. The most popular apps will be listed as follows: Let's install a pair of apps and have a little fun. 
First, install Geo Location Lookup Script (powered by MAXMIND) by clicking on the Install free button. You will be prompted for your splunk.com login. This is the same login that you created when you downloaded Splunk. If you don't have an account, you will need to create one. Next, install the Google Maps app. This app was built by a Splunk customer and contributed back to the Splunk community. This app will prompt you to restart Splunk. Once you have restarted and logged back in, check the App menu. Google Maps is now visible, but where is Geo Location Lookup Script? Remember that not all apps have dashboards; nor do they necessarily have any visible components at all. Using Geo Location Lookup Script Geo Location Lookup Script provides a lookup script to provide geolocation information for IP addresses. Looking at the documentation, we see this example: eventtype=firewall_event | lookup geoip clientip as src_ip You can find the documentation for any Splunkbase app by searching for it at splunkbase.com, or by clicking on Read more next to any installed app by navigating to Manager | Apps | Browse more apps. Let's read through the arguments of the lookup command: geoip: This is the name of the lookup provided by Geo Location Lookup Script. You can see the available lookups by going to Manager | Lookups | Lookup definitions. clientip: This is the name of the field in the lookup that we are matching against. as src_ip: This says to use the value of src_ip to populate the field before it; in this case, clientip. I personally find this wording confusing. In my mind, I read this as "using" instead of "as". Included in the ImplementingSplunkDataGenerator app (available at http://packtpub.com/) is a sourcetype instance named impl_splunk_ips, which looks like this: 2012-05-26T18:23:44 ip=64.134.155.137 The IP addresses in this fictitious log are from one of my websites. Let's see some information about these addresses: sourcetype="impl_splunk_ips" | lookup geoip clientip AS ip | top client_country This gives us a table similar to the one shown in the following screenshot: That's interesting. I wonder who is visiting my site from Slovenia! Using Google Maps Now let's do a similar search in the Google Maps app. Choose Google Maps from the App menu. The interface looks like the standard search interface, but with a map instead of an event listing. Let's try this remarkably similar (but not identical) query using a lookup provided in the Google Maps app: sourcetype="impl_splunk_ips" | lookup geo ip The map generated looks like this: Unsurprisingly, most of the traffic to this little site came from my house in Austin, Texas. Installing apps from a file It is not uncommon for Splunk servers to not have access to the Internet, particularly in a datacenter. In this case, follow these steps: Download the app from splunkbase.com. The file will have a .spl or .tgz extension. Navigate to Manager | Apps. Click on Install app from file. Upload the downloaded file using the form provided. Restart if the app requires it. Configure the app if required. That's it. Some apps have a configuration form. If this is the case, you will see a Set up link next to the app when you go to Manager | Apps. If something goes wrong, contact the author of the app. If you have a distributed environment, in most cases the app only needs to be installed on your search head. The components that your indexers need will be distributed automatically by the search head. Check the documentation for the app.
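Whichever installation route you choose, the app ends up as a directory under $SPLUNK_HOME/etc/apps/ on the relevant Splunk instance. The following layout is a hypothetical but typical example; the app name and the exact set of files vary from app to app:

$SPLUNK_HOME/etc/apps/my_sample_app/
    default/                      # configurations shipped with the app
        app.conf                  # app name, version, and visibility
        savedsearches.conf        # saved searches
        props.conf                # sourcetypes and field extractions
        data/ui/nav/default.xml   # custom navigation
        data/ui/views/            # dashboard definitions
    local/                        # your site-specific overrides
    lookups/                      # CSV lookup tables
    bin/                          # scripts and external search commands
    metadata/
        default.meta              # object permissions and sharing

Because everything is plain text, such an app directory can be inspected and edited with any text editor before or after installation.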

Constructing and Evaluating Your Design Solution

Packt
25 Feb 2013
9 min read
(For more resources related to this topic, see here.) For constructing visualizations, technology matters The importance of being able to rationalize options has been a central theme of this book. As we reach the final stage of this journey and we are faced with the challenge of building our visualization solution, the keyword is, once again, choice. The intention of this book has been to focus on offering a handy strategy to help you work through the many design issues and decisions you're faced with. Up to now discussions about issues relating to technology and technical capability have been kept to a minimum in order to elevate the importance of the preparatory and conceptual stages. You have to work through these challenges regardless of what tools or skills you have. However, it is fair to say that to truly master data visualization design, it is inevitable that you will need to achieve technical literacy across a number of different applications and environments. All advanced designers need to be able to rely on a symphony of different tools and capabilities for gathering data, handling, and analyzing it before presenting, and launching the visual design. While we may have great concepts and impressively creative ideas, without the means to convert these into built solutions they will ultimately remain unrealized. The following example, tracking 61 years of tornado activity in the US, demonstrates a project that would have involved a great amount of different analytical and design-based technical skills and would not have been possible without these: Image In contrast to most of the steps that we have covered this far, the choices we make when it comes to producing the final data visualization design are more heavily influenced by capability and access to resources than necessarily the suitability of a given tool. This is something we covered earlier when identifying the key factors that shape what may or may not be possible to achieve. To many, the technology side of data visualization can be quite an overwhelming prospect—trying to harness and master the many different options available, knowing each one's relative strengths and weaknesses, identifying specific function and purpose, keeping on top of the latest developments and trends, and so on. Acquiring a broad technical skillset is clearly not easily accomplished. We touched on the different capability requirements of data visualization in article 2, Setting the Purpose and Identifying Key Factors, in the The "eight hats" of data visualization design section. This highlighted the importance of recognizing your strengths and weaknesses and where your skillset marries up with the varied and numerous demands of visualization design. In order to accommodate the absence of technical skills, in particular, you may need to find a way to collaborate with others or possibly scale down the level of your ambition. Visualization software, applications, and programs The scope of this book does not lend itself to provide a detailed dissection and evaluation of the many different possible tools and resources available for data visualization design. There are so many to choose from and it is a constantly evolving landscape—it feels like each new month sees an additional resource entering the fray. To help, you can find an up-to-date, curated list of the many technology options in this field by visiting http://www.visualisingdata.com/ index.php/resources/. Unlike other design disciplines, there is no single killer tool that does everything. 
To accommodate the agility of different technical solutions required in this field we have to be prepared to develop a portfolio of capabilities. What follows is a selection of just some of the most common, most useful, and most accessible options for you to consider utilizing and developing experience with. The tools presented have been classified to help you understand their primary purpose or function. Charting and statistical analysis tools This category covers some of the main charting productivity tools and the more effective visual analytics or Business Intelligence (BI) applications that offer powerful visualization capabilities. Microsoft Excel (http://office.microsoft.com/en-gb/excel/) is ubiquitous and has been a staple diet for many of us number crunchers for most of our working lives. Within the data visualization world, Excel's charting capabilities are somewhat derided largely down to the terrible default settings and the range of bad-practice charting functions it enables. (3D cone charts, anyone? No, thank you.) However, Excel does allow you to do much more than you would expect and, when fully exploited, it can prove to be quite a valuable ally. With experience and know-how, you can control and refine many chart properties and you will find that most of your basic charting requirements are met, certainly those that you might associate more with a pragmatic or analytical tone. Image Excel can also be used to serve up chart images for exporting to other applications (such as Illustrator, see later). Search online for the work of Jorge Camoes ( http://www.excelcharts.com/blog/), Jon Peltier (http://peltiertech.com/), and Chandoo (http://chandoo.org/) and you'll find some excellent visualization examples produced in Excel. Tableau (http://www.tableausoftware.com/) is a very powerful and rapid visual analytics application that allows you to potentially connect up millions of records from a range of origins and formats. From there you can quickly construct good practice charts and dashboards to visually explore and present your data. It is available as a licensed desktop or server version as well as a free-to-use public version. Tableau is particularly valuable when it comes to the important stage of data familiarization. When you want to quickly discover the properties, the shapes and quality of your data, Tableau is a great solution. It also enables you to create embeddable interactive visualizations and, like Excel, lets you export charts as images for use in other applications. Image There are many excellent Tableau practitioners out there whose work you should check out, such as Craig Bloodworth (http://www.theinformationlab.co.uk/ blog/), Jérôme Cukier (http://www.jeromecukier.net/), and Ben Jones (http://dataremixed.com/), among many others. While the overall landscape of BI is patchy in terms of its visualization quality, you will find some good additional solutions such as QlikView (http://www.qlikview.com/uk), TIBCO Spotfire (http://spotfire.tibco.com/), Grapheur (http://grapheur. com/), and Panopticon (http://www.panopticon.com/). You will also find that there are many chart production tools available online. Google has created a number of different ways to create visualizations through its Chart Tools (https://developers.google.com/chart/) and Visualization API (https://developers.google.com/chart/interactive/docs/reference) environments. 
While you can exploit these tools without the need for programming skills, the API platforms do enable developers to enhance the functional and design options themselves. Additionally, Google Fusion Tables (http://www.google.com/drive/start/ apps.html) offers a convenient method for publishing simple choropleth maps, timelines, and a variety of reasonably interactive charts. Image Other notable browser-based tools for analyzing data and creating embeddable or exportable data visualizations include DataWrapper (http://datawrapper.de/) and Polychart (http://polychart.com/). One of the first online offerings was Many Eyes, created by the IBM Visual Communications Lab in 2007, though ongoing support and development has lapsed. Many Eyes introduced many to Wordle (http://www.wordle.net/) a popular tool for visualizing the frequency of words used in text via "word clouds". Note, however, the novelty of this type of display has long since worn off for many people (especially please stop using it as your final PowerPoint slide in presentations!). Programming environments The ultimate capability in visualization design is to have complete control over the characteristics and behavior of every mark, property, and user-driven event on a chart or graph. The only way to fundamentally achieve this level of creative control is through the command of one or a range of programming languages. Until recent times one of the most important and popular options was Adobe Flash (http://www.adobe.com/uk/products/flash.html), a powerful and creative environment for animated and multimedia designs. Flash was behind many prominent interactive visualization designs in the field. However, Apple's decision to not support Flash on its mobile platforms effectively signaled the beginning of the end. Subsequently, most contemporary visualization programmers are focusing their developments on a range of powerful JavaScript environments and libraries. D3.js (http://d3js.org/) is the newest and coolest kid in town. Launched in 2011 from the Stanford Visualization Group that previously brought us Protovis (no longer in active development) this is a JavaScript library that has rapidly evolved into to the major player in interactive visualization terms. D3 enables you to take full creative control over your entire visualization design (all data representation and presentation features) to create incredibly smooth, expressive, and immersive interactive visualizations. Mike Bostock, the key creative force behind D3 and who now works at the New York Times, has an incredible portfolio of examples (http://bost.ocks.org/mike/) and you should also take a look at the work and tutorials provided by another D3 "hero", Scott Murray (http://alignedleft.com/). Image D3 and Flash are particularly popular (or have been popular, in the latter's case) because they are suitable for creating interactive projects to work in the browser. Over the past decade, Processing (http://processing.org/) has reigned as one of the most important solutions for creating powerful, generative, and animated visualizations that sit outside the browser, either as video, a separate application, or an installation. As an open source language it has built a huge following of creative programmers, designers, and artists look to optimize the potential of data representation and expression. There is a large and dynamic community of experts, authors, and tutorial writers that provide wonderful resources for anyone interested in picking up capabilities in this environment. 
There are countless additional JavaScript libraries and plugins that offer specialist capability, such as Paper.js (http://paperjs.org/) and Raphaël (http://raphaeljs.com/), to really maximize your programming opportunities.

Getting Started with InnoDB

Packt
19 Feb 2013
9 min read
(For more resources related to this topic, see here.) Basic features of InnoDB InnoDB is more than a fast disk-based relational database engine. It offers, at its core, the following features that separate it from other disk-based engines: MVCC ACID compliance Transaction support Row-level locking These features are responsible for providing what is known as Referential integrity; a core requirement for enterprise database applications. Referential integrity Referential integrity can be best thought of as the ability for the database application to store relational data in multiple tables with consistency. If a database lacks consistency between relational data, the data cannot be relied upon for applications. If, for example, an application stores financial transactions where monetary data is processed, referential integrity and consistency of transactional data is a key component. Financial data is not the only case where this is an important feature, as many applications store and process sensitive data that must be consistent Multiversion concurrency control A vital component is Multiversion concurrency control (MVCC), which is a control process used by databases to ensure that multiple concurrent connections can see and access consistent states of data over time. A common scenario relying on MVCC can be thought of as follows: data exists in a table and an application connection accesses that data, then a second connection accesses the same original data set while the first connection is making changes to it; since the first connection has not finalized its changes and committed its information we don't want the second connection to see the nonfinalized data. Thus two versions of the data exist at the same time—multiple versions—to allow the database to control the concurrent state of the data. MVCC also provides for the existence of point-in-time consistent views, where multiple versions of data are kept and are available for access based on their point-in-time existence. Transaction isolation Transaction support at the database level refers to the ability for units of work to be processed in separate units of execution from others. This isolation of data execution allows each database connection to manipulate, read, and write information at the same time without conflicting with each other. Transactions allow connections to operate on data on an all-or-nothing operation, so that if the transaction completes successfully it will be written to disk and recorded for upcoming transactions to then operate on. However, if the sequence of changes to the data in the transaction process do not complete then they can be rolled back, and no changes will be recorded to disk. This allows sequences of execution that contain multiple steps to fully succeed only if all of the changes complete, and to roll back any changed data to its original state if one or more of the sequence of changes in the transaction fail. This feature guarantees that the data remains consistent and referentially safe. ACID compliance An integral part of InnoDB is its ability to ensure that data is atomic, consistent, isolated, and durable; these features make up components of ACID compliance. Simply put, atomicity requires that if a transaction fails then the changes are rolled back and not committed. Consistency requires that each successfully executed transaction will move the database ahead in time from one state to the next in a consistent manner without errors or data integrity issues. 
Isolation defines that each transaction will see separate sets of data in time and not conflict with other transactional data access. Finally, the durability clause ensures that any data that has been committed in a successful transaction will be written to disk in its final state, without the risk of data loss from errors or system failure, and will then be available to transactions that come in the future. Locking characteristics Finally, InnoDB differs from other on-disk storage engines in that it offers row-level locking. This primarily differs, in the MySQL world, with the MyISAM storage engine which features table-level locking. Locking refers to an internal operation of the database that prohibits reading or writing of table data by connections if another is currently using that data. This prevents concurrent connections from causing data corruption or forcing data invalidation when data is in use. The primary difference between table- and row-level locking is that when a connection requests data from a table it can either lock the row of data being accessed or the whole table of data being accessed. For performance and concurrency benefits, row-level locking excels. System requirements and supported platforms InnoDB can be used on all platforms on which MySQL can be installed. These include: Linux: RPM, Deb, Tar BSDs: FreeBSD, OpenBSD, NetBSD Solaris and OpenSolaris / Illumos: SPARC + Intel IBM AIX HP-UX Mac OSX Windows 32 bit and 64 bit There are also custom ports of MySQL from the open source community for running MySQL on various embedded platforms and non-standard operating systems. Hardware-wise, MySQL and correspondingly InnoDB, will run on a wide variety of hardware, which at the time of this writing includes: Intel x86 32 bit AMD/Intel x 86_64 Intel Itanium IA-64 IBM Power architecture Apple's PPC PA-RISC 1.0 + 2.0 SPARC 32 + 64 bit Keep in mind when installing and configuring InnoDB, depending on the architecture in which it is installed, it will have certain options available and enabled that are not available on all platforms. In addition to the underlying hardware, the operating system will also determine whether certain configuration options are available and the range to which some variables can be set. One of the more decisively important differences to be considered while choosing an operating system for your database server is the manner in which the operating system and underlying filesystem handles write caching and write flushes to the disk storage subsystem. These operating system abilities can cause a dramatic difference in the performance of InnoDB, often to the order of 10 times the concurrency ability. When reading the MySQL documentation you may find that InnoDB has over fifty-eight configuration settings, more or less depending on the version, for tuning the performance and operational defaults. The majority of these default settings can be left alone for development and production server environments. However, there are several core settings that can affect great change, in either positive or negative directions depending on the application workload and hardware resource limits, with which every MySQL database administrator should be familiar and proficient. 
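Before we look at those settings, it is worth seeing the transaction isolation and row-level locking behaviour described above in action. The following sketch is illustrative only: it assumes the PyMySQL driver, a local server running with InnoDB's default REPEATABLE READ isolation level, placeholder credentials, and a disposable demo schema.

import pymysql  # assumed driver; any DB-API connector behaves similarly

def connect():
    # placeholder host, credentials, and schema
    return pymysql.connect(host="localhost", user="root", password="secret",
                           database="demo", autocommit=False)

writer, reader = connect(), connect()

with writer.cursor() as c:
    c.execute("CREATE TABLE IF NOT EXISTS accounts "
              "(id INT PRIMARY KEY, balance INT) ENGINE=InnoDB")
    c.execute("REPLACE INTO accounts VALUES (1, 100)")
writer.commit()

# The writer changes the row inside an open, uncommitted transaction.
with writer.cursor() as c:
    c.execute("UPDATE accounts SET balance = 50 WHERE id = 1")

# The reader is not blocked (only the updated row is locked, and consistent
# reads do not take locks) and still sees the last committed version: (100,)
with reader.cursor() as c:
    c.execute("SELECT balance FROM accounts WHERE id = 1")
    print(c.fetchone())

writer.rollback()  # atomicity: the uncommitted change is discarded
writer.close()
reader.close()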
Keep in mind when setting values that some variables are considered dynamic while others are static; dynamic variables can be changed during runtime and do not require a process restart while static variables can only be changed prior to process start, so any changes made to static variables during runtime will only take effect upon the next restart of the database server process. Dynamic variables can be changed on the MySQL command line via the following command: mysql> SET GLOBAL [variable]=[value]; If a value is changed on the command line, it should also be updated in the global my.cnf configuration file so that the change is applied during each restart. MySQL memory allocation equations Before tuning any InnoDB configuration settings—memory buffers in particular—we need to understand how MySQL allocates memory to various areas of the application that handles RAM. There are two simple equations for referencing total memory usage that allocate memory based on incoming client connections: Per-thread buffers: Per-thread buffers, also called per-connection buffers since MySQL uses a separate thread for each connection, operate in contrast to global buffers in that per-thread buffers only allocate memory when a connection is made and in some cases will only allocate as much memory as the connection's workload requires, thus not necessarily utilizing the entire size of the allowable buffer. This memory utilization method is described in the MySQL manual as follows: Each client thread is associated with a connection buffer and a result buffer. Both begin with a size given by net_buffer_length but are dynamically enlarged up to max_allowed_packet bytes as needed. The result buffer shrinks to net_buffer_length after each SQL statement. Global buffers: Global buffers are allocated memory resources regardless of the number of connections being handled. These buffers request their memory requirements during the startup process and retain this reservation of resources until the server process has ended. When allocating memory to MySQL buffers we need to ensure that there is also enough RAM available for the operating system to perform its tasks and processes; in general it is a best practice to limit MySQL between 85 to 90 percent allocation of total system RAM. The memory utilization equations for each of the buffers is given as follows: Per-thread Buffer memory utilization equation: (read_buffer_size + read_rnd_buffer_size + sort_buffer_size + thread_stack + join_buffer_size + binlog_cache_size) * max_connections = total memory allocation for all connections, or MySQL Thread Buffers (MTB) Global Buffer memory utilization equation: innodb_buffer_pool_size + innodb_additional_mem_pool_size + innodb_ log_buffer_size + key_buffer_size + query_cache_size = total memory used by MySQL Global Buffers (MGB) Total memory allocation equation: MTB + MGB = Total Memory Used by MySQL If the total memory used by the combination of MTB and MGB is greater than 85 to 90 percent of the total system RAM then you may experience resource contention, a resource bottleneck, or in the worst case you will see memory pages swapping to on-disk resources (virtual memory) which results in performance degradation and, in some cases, process failure or connection timeouts. Therefore it is wise to check memory allocation via the equations mentioned previously before making changes to the memory buffers or increasing the value of max_connections to the database. 
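Those equations are simple enough to script. The following sketch plugs in arbitrary example values for the buffer variables (they are placeholders, not recommendations) and compares the worst-case total against the 85 percent guideline:

# Worst-case MySQL memory estimate based on the two equations above.
# All byte values below are arbitrary example settings, not recommendations.
MB = 1024 * 1024
GB = 1024 * MB

per_thread = {
    "read_buffer_size":     128 * 1024,
    "read_rnd_buffer_size": 256 * 1024,
    "sort_buffer_size":     2 * MB,
    "thread_stack":         256 * 1024,
    "join_buffer_size":     128 * 1024,
    "binlog_cache_size":    32 * 1024,
}
global_buffers = {
    "innodb_buffer_pool_size":         4 * GB,
    "innodb_additional_mem_pool_size": 8 * MB,
    "innodb_log_buffer_size":          8 * MB,
    "key_buffer_size":                 32 * MB,
    "query_cache_size":                64 * MB,
}
max_connections = 200
system_ram = 8 * GB

mtb = sum(per_thread.values()) * max_connections  # MySQL Thread Buffers (MTB)
mgb = sum(global_buffers.values())                # MySQL Global Buffers (MGB)
total = mtb + mgb

print(f"MTB: {mtb / MB:.0f} MB, MGB: {mgb / MB:.0f} MB, total: {total / GB:.2f} GB")
if total > 0.85 * system_ram:
    print("Warning: total exceeds 85% of system RAM; risk of swapping.")

If the reported total comes out above the threshold, look first at the per-thread buffers and max_connections, since the thread buffers are multiplied by the connection count and therefore grow fastest.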
More information about how MySQL manages memory and threads is available in the following pages of the MySQL documentation: http://dev.mysql.com/doc/refman/5.5/en/connection-threads.html and http://dev.mysql.com/doc/refman/5.5/en/memory-use.html Summary This article provided a quick overview of InnoDB's core terminology and basic features, its system requirements and supported platforms, and a few memory allocation equations. Resources for Article: Further resources on this subject: Configuring MySQL [Article] Optimizing your MySQL Servers' performance using Indexes [Article] Indexing in MySQL Admin [Article]

Marker-based Augmented Reality on iPhone or iPad

Packt
01 Feb 2013
23 min read
(For more resources related to this topic, see here.) Creating an iOS project that uses OpenCV In this section we will create a demo application for iPhone/iPad devices that will use the OpenCV ( Open Source Computer Vision ) library to detect markers in the camera frame and render 3D objects on it. This example will show you how to get access to the raw video data stream from the device camera, perform image processing using the OpenCV library, find a marker in an image, and render an AR overlay. We will start by first creating a new XCode project by choosing the iOS Single View Application template, as shown in the following screenshot: Now we have to add OpenCV to our project. This step is necessary because in this application we will use a lot of functions from this library to detect markers and estimate position position. OpenCV is a library of programming functions for real-time computer vision. It was originally developed by Intel and is now supported by Willow Garage and Itseez. This library is written in C and C++ languages. It also has an official Python binding and unofficial bindings to Java and .NET languages. Adding OpenCV framework Fortunately the library is cross-platform, so it can be used on iOS devices. Starting from version 2.4.2, OpenCV library is officially supported on the iOS platform and you can download the distribution package from the library website at http://opencv.org/. The OpenCV for iOS link points to the compressed OpenCV framework. Don't worry if you are new to iOS development; a framework is like a bundle of files. Usually each framework package contains a list of header files and list of statically linked libraries. Application frameworks provide an easy way to distribute precompiled libraries to developers. Of course, you can build your own libraries from scratch. OpenCV documentation explains this process in detail. For simplicity, we follow the recommended way and use the framework for this article. After downloading the file we extract its content to the project folder, as shown in the following screenshot: To inform the XCode IDE to use any framework during the build stage, click on Project options and locate the Build phases tab. From there we can add or remove the list of frameworks involved in the build process. Click on the plus sign to add a new framework, as shown in the following screenshot: From here we can choose from a list of standard frameworks. But to add a custom framework we should click on the Add other button. The open file dialog box will appear. Point it to opencv2.framework in the project folder as shown in the following screenshot: Including OpenCV headers Now that we have added the OpenCV framework to the project, everything is almost done. One last thing—let's add OpenCV headers to the project's precompiled headers. The precompiled headers are a great feature to speed up compilation time. By adding OpenCV headers to them, all your sources automatically include OpenCV headers as well. Find a .pch file in the project source tree and modify it in the following way. The following code shows how to modify the .pch file in the project source tree: // // Prefix header for all source files of the 'Example_MarkerBasedAR' // #import <Availability.h> #ifndef __IPHONE_5_0 #warning "This project uses features only available in iOS SDK 5.0 and later." 
#endif #ifdef __cplusplus #include <opencv2/opencv.hpp> #endif #ifdef __OBJC__ #import <UIKit/UIKit.h> #import <Foundation/Foundation.h> #endif Now you can call any OpenCV function from any place in your project. That's all. Our project template is configured and we are ready to move further. Free advice: make a copy of this project; this will save you time when you are creating your next one! Application architecture Each iOS application contains at least one instance of the UIViewController interface that handles all view events and manages the application's business logic. This class provides the fundamental view-management model for all iOS apps. A view controller manages a set of views that make up a portion of your app's user interface. As part of the controller layer of your app, a view controller coordinates its efforts with model objects and other controller objects—including other view controllers—so your app presents a single coherent user interface. The application that we are going to write will have only one view; that's why we choose a Single-View Application template to create one. This view will be used to present the rendered picture. Our ViewController class will contain three major components that each AR application should have (see the next diagram): Video source Processing pipeline Visualization engine The video source is responsible for providing new frames taken from the built-in camera to the user code. This means that the video source should be capable of choosing a camera device (front- or back-facing camera), adjusting its parameters (such as resolution of the captured video, white balance, and shutter speed), and grabbing frames without freezing the main UI. The image processing routine will be encapsulated in the MarkerDetector class. This class provides a very thin interface to user code. Usually it's a set of functions like processFrame and getResult. Actually that's all that ViewController should know about. We must not expose low-level data structures and algorithms to the view layer without strong necessity. VisualizationController contains all logic concerned with visualization of the Augmented Reality on our view. VisualizationController is also a facade that hides a particular implementation of the rendering engine. Low code coherence gives us freedom to change these components without the need to rewrite the rest of your code. Such an approach gives you the freedom to use independent modules on other platforms and compilers as well. For example, you can use the MarkerDetector class easily to develop desktop applications on Mac, Windows, and Linux systems without any changes to the code. Likewise, you can decide to port VisualizationController on the Windows platform and use Direct3D for rendering. In this case you should write only new VisualizationController implementation; other code parts will remain the same. The main processing routine starts from receiving a new frame from the video source. This triggers video source to inform the user code about this event with a callback. ViewController handles this callback and performs the following operations: Sends a new frame to the visualization controller. Performs processing of the new frame using our pipeline. Sends the detected markers to the visualization stage. Renders a scene. Let's examine this routine in detail. The rendering of an AR scene includes the drawing of a background image that has a content of the last received frame; artificial 3D objects are drawn later on. 
When we send a new frame for visualization, we are copying image data to internal buffers of the rendering engine. This is not actual rendering yet; we are just updating the text with a new bitmap. The second step is the processing of new frame and marker detection. We pass our image as input and as a result receive a list of the markers detected. on it. These markers are passed to the visualization controller, which knows how to deal with them. Let's take a look at the following sequence diagram where this routine is shown: We start development by writing a video capture component. This class will be responsible for all frame grabbing and for sending notifications of captured frames via user callback. Later on we will write a marker detection algorithm. This detection routine is the core of your application. In this part of our program we will use a lot of OpenCV functions to process images, detect contours on them, find marker rectangles, and estimate their position. After that we will concentrate on visualization of our results using Augmented Reality. After bringing all these things together we will complete our first AR application. So let's move on! Accessing the camera The Augmented Reality application is impossible to create without two major things: video capturing and AR visualization. The video capture stage consists of receiving frames from the device camera, performing necessary color conversion, and sending it to the processing pipeline. As the single frame processing time is so critical to AR applications, the capture process should be as efficient as possible. The best way to achieve maximum performance is to have direct access to the frames received from the camera. This became possible starting from iOS Version 4. Existing APIs from the AVFoundation framework provide the necessary functionality to read directly from image buffers in memory. You can find a lot of examples that use the AVCaptureVideoPreviewLayer class and the UIGetScreenImage function to capture videos from the camera. This technique was used for iOS Version 3 and earlier. It has now become outdated and has two major disadvantages: Lack of direct access to frame data. To get a bitmap, you have to create an intermediate instance of UIImage, copy an image to it, and get it back. For AR applications this price is too high, because each millisecond matters. Losing a few frames per second (FPS) significantly decreases overall user experience. To draw an AR, you have to add a transparent overlay view that will present the AR. Referring to Apple guidelines, you should avoid non-opaque layers because their blending is hard for mobile processors. Classes AVCaptureDevice and AVCaptureVideoDataOutput allow you to configure, capture, and specify unprocessed video frames in 32 bpp BGRA format. Also you can set up the desired resolution of output frames. However, it does affect overall performance since the larger the frame the more processing time and memory is required. There is a good alternative for high-performance video capture. The AVFoundation API offers a much faster and more elegant way to grab frames directly from the camera. But first, let's take a look at the following figure where the capturing process for iOS is shown: AVCaptureSession is a root capture object that we should create. Capture session requires two components—an input and an output. The input device can either be a physical device (camera) or a video file (not shown in diagram). In our case it's a built-in camera (front or back). 
The output device can be presented by one of the following interfaces: AVCaptureMovieFileOutput AVCaptureStillImageOutput AVCaptureVideoPreviewLayer AVCaptureVideoDataOutput The AVCaptureMovieFileOutput interface is used to record video to the file, the AVCaptureStillImageOutput interface is used to to make still images, and the AVCaptureVideoPreviewLayer interface is used to play a video preview on the screen. We are interested in the AVCaptureVideoDataOutput interface because it gives you direct access to video data. The iOS platform is built on top of the Objective-C programming language. So to work with AVFoundation framework, our class also has to be written in Objective-C. In this section all code listings are in the Objective-C++ language. To encapsulate the video capturing process, we create the VideoSource interface as shown by the following code: @protocol VideoSourceDelegate<NSObject> -(void)frameReady:(BGRAVideoFrame) frame; @end @interface VideoSource : NSObject<AVCaptureVideoDataOutputSampleBuffe rDelegate> { } @property (nonatomic, retain) AVCaptureSession *captureSession; @property (nonatomic, retain) AVCaptureDeviceInput *deviceInput; @property (nonatomic, retain) id<VideoSourceDelegate> delegate; - (bool) startWithDevicePosition:(AVCaptureDevicePosition) devicePosition; - (CameraCalibration) getCalibration; - (CGSize) getFrameSize; @end In this callback we lock the image buffer to prevent modifications by any new frames, obtain a pointer to the image data and frame dimensions. Then we construct temporary BGRAVideoFrame object that is passed to outside via special delegate. This delegate has following prototype: @protocol VideoSourceDelegate<NSObject> -(void)frameReady:(BGRAVideoFrame) frame; @end Within VideoSourceDelegate, the VideoSource interface informs the user code that a new frame is available. The step-by-step guide for the initialization of video capture is listed as follows: Create an instance of AVCaptureSession and set the capture session quality preset. Choose and create AVCaptureDevice. You can choose the front- or backfacing camera or use the default one. Initialize AVCaptureDeviceInput using the created capture device and add it to the capture session. Create an instance of AVCaptureVideoDataOutput and initialize it with format of video frame, callback delegate, and dispatch the queue. Add the capture output to the capture session object. Start the capture session. Let's explain some of these steps in more detail. After creating the capture session, we can specify the desired quality preset to ensure that we will obtain optimal performance. We don't need to process HD-quality video, so 640 x 480 or an even lesser frame resolution is a good choice: - (id)init { if ((self = [super init])) { AVCaptureSession * capSession = [[AVCaptureSession alloc] init]; if ([capSession canSetSessionPreset:AVCaptureSessionPreset64 0x480]) { [capSession setSessionPreset:AVCaptureSessionPreset640x480]; NSLog(@"Set capture session preset AVCaptureSessionPreset640x480"); } else if ([capSession canSetSessionPreset:AVCaptureSessionPresetL ow]) { [capSession setSessionPreset:AVCaptureSessionPresetLow]; NSLog(@"Set capture session preset AVCaptureSessionPresetLow"); } self.captureSession = capSession; } return self; } Always check hardware capabilities using the appropriate API; there is no guarantee that every camera will be capable of setting a particular session preset. 
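Incidentally, if you want to prototype the frame-grabbing and processing loop on a desktop machine before committing to the AVFoundation plumbing, OpenCV's Python bindings provide a capture API that plays the same video source role. This is a rough stand-in, not the book's iOS code; the camera index, resolution, and the placeholder process_frame function are assumptions.

import cv2

# Desktop stand-in for the iOS video source: grab frames, request a modest
# resolution (640 x 480, as recommended above), and hand each frame to the
# processing pipeline. process_frame is a placeholder for the marker detector.
def process_frame(bgr):
    # placeholder: the real work is described in the marker detection section
    return bgr

cap = cv2.VideoCapture(0)                   # default camera
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)      # requested, not guaranteed --
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)     # always check what you actually get

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("preview", process_frame(frame))
    if cv2.waitKey(1) & 0xFF == ord('q'):   # quit on 'q'
        break

cap.release()
cv2.destroyAllWindows()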
After creating the capture session, we should add the capture input—the instance of AVCaptureDeviceInput will represent a physical camera device. The cameraWithPosition function is a helper function that returns the camera device for the requested position (front, back, or default): - (bool) startWithDevicePosition:(AVCaptureDevicePosition) devicePosition { AVCaptureDevice *videoDevice = [self cameraWithPosition:devicePosit ion]; if (!videoDevice) return FALSE; { NSError *error; AVCaptureDeviceInput *videoIn = [AVCaptureDeviceInput deviceInputWithDevice:videoDevice error:&error]; self.deviceInput = videoIn; if (!error) { if ([[self captureSession] canAddInput:videoIn]) { [[self captureSession] addInput:videoIn]; } else { NSLog(@"Couldn't add video input"); return FALSE; } } else { NSLog(@"Couldn't create video input"); return FALSE; } } [self addRawViewOutput]; [captureSession startRunning]; return TRUE; } Please notice the error handling code. Take care of return values for such an important thing as working with hardware setup is a good practice. Without this, your code can crash in unexpected cases without informing the user what has happened. We created a capture session and added a source of the video frames. Now it's time to add a receiver—an object that will receive actual frame data. The AVCaptureVideoDataOutput class is used to process uncompressed frames from the video stream. The camera can provide frames in BGRA, CMYK, or simple grayscale color models. For our purposes the BGRA color model fits best of all, as we will use this frame for visualization and image processing. The following code shows the addRawViewOutput function: - (void) addRawViewOutput { /*We setupt the output*/ AVCaptureVideoDataOutput *captureOutput = [[AVCaptureVideoDataOutput alloc] init]; /*While a frame is processes in -captureOutput:didOutputSampleBuff er:fromConnection: delegate methods no other frames are added in the queue. If you don't want this behaviour set the property to NO */ captureOutput.alwaysDiscardsLateVideoFrames = YES; /*We create a serial queue to handle the processing of our frames*/ dispatch_queue_t queue; queue = dispatch_queue_create("com.Example_MarkerBasedAR. cameraQueue", NULL); [captureOutput setSampleBufferDelegate:self queue:queue]; dispatch_release(queue); // Set the video output to store frame in BGRA (It is supposed to be faster) NSString* key = (NSString*)kCVPixelBufferPixelFormatTypeKey; NSNumber* value = [NSNumber numberWithUnsignedInt:kCVPixelFormatType_32BGRA]; NSDictionary* videoSettings = [NSDictionary dictionaryWithObject:value forKey:key]; [captureOutput setVideoSettings:videoSettings]; // Register an output [self.captureSession addOutput:captureOutput]; } Now the capture session is finally configured. When started, it will capture frames from the camera and send it to user code. When the new frame is available, an AVCaptureSession object performs a captureOutput: didOutputSampleBuffer:fromConnection callback. 
In this function, we will perform a minor data conversion operation to get the image data in a more usable format and pass it to user code: - (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection { // Get a image buffer holding video frame CVImageBufferRef imageBuffer = CMSampleBufferGetImageBuffer (sampleB uffer); // Lock the image buffer CVPixelBufferLockBaseAddress(imageBuffer,0); // Get information about the image uint8_t *baseAddress = (uint8_t *)CVPixelBufferGetBaseAddress(image Buffer); size_t width = CVPixelBufferGetWidth(imageBuffer); size_t height = CVPixelBufferGetHeight(imageBuffer); size_t stride = CVPixelBufferGetBytesPerRow(imageBuffer); BGRAVideoFrame frame = {width, height, stride, baseAddress}; [delegate frameReady:frame]; /*We unlock the image buffer*/ CVPixelBufferUnlockBaseAddress(imageBuffer,0); } We obtain a reference to the image buffer that stores our frame data. Then we lock it to prevent modifications by new frames. Now we have exclusive access to the frame data. With help of the CoreVideo API, we get the image dimensions, stride (number of pixels per row), and the pointer to the beginning of the image data. I draw your attention to the CVPixelBufferLockBaseAddress/ CVPixelBufferUnlockBaseAddress function call in the callback code. Until we hold a lock on the pixel buffer, it guarantees consistency and correctness of its data. Reading of pixels is available only after you have obtained a lock. When you're done, don't forget to unlock it to allow the OS to fill it with new data. Marker detection A marker is usually designed as a rectangle image holding black and white areas inside it. Due to known limitations, the marker detection procedure is a simple one. First of all we need to find closed contours on the input image and unwarp the image inside it to a rectangle and then check this against our marker model. In this sample the 5 x 5 marker will be used. Here is what it looks like: In the sample project that you will find in this book, the marker detection routine is encapsulated in the MarkerDetector class: /** * A top-level class that encapsulate marker detector algorithm */ class MarkerDetector { public: /** * Initialize a new instance of marker detector object * @calibration[in] - Camera calibration necessary for pose estimation. */ MarkerDetector(CameraCalibration calibration); void processFrame(const BGRAVideoFrame& frame); const std::vector<Transformation>& getTransformations() const; protected: bool findMarkers(const BGRAVideoFrame& frame, std::vector<Marker>& detectedMarkers); void prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale); void performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg); void findContours(const cv::Mat& thresholdImg, std::vector<std::vector<cv::Point> >& contours, int minContourPointsAllowed); void findMarkerCandidates(const std::vector<std::vector<cv::Point> >& contours, std::vector<Marker>& detectedMarkers); void detectMarkers(const cv::Mat& grayscale, std::vector<Marker>& detectedMarkers); void estimatePosition(std::vector<Marker>& detectedMarkers); private: }; To help you better understand the marker detection routine, a step-by-step processing on one frame from a video will be shown. A source image taken from an iPad camera will be used as an example: Marker identification Here is the workflow of the marker detection routine: Convert the input image to grayscale. Perform binary threshold operation. 
Detect contours. Search for possible markers. Detect and decode markers. Estimate marker 3D pose. Grayscale conversion The conversion to grayscale is necessary because markers usually contain only black and white blocks and it's much easier to operate with them on grayscale images. Fortunately, OpenCV color conversion is simple enough. Please take a look at the following code listing in C++: void MarkerDetector::prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale) { // Convert to grayscale cv::cvtColor(bgraMat, grayscale, CV_BGRA2GRAY); } This function will convert the input BGRA image to grayscale (it will allocate image buffers if necessary) and place the result into the second argument. All further steps will be performed with the grayscale image. Image binarization The binarization operation will transform each pixel of our image to black (zero intensity) or white (full intensity). This step is required to find contours. There are several threshold methods; each has strong and weak sides. The easiest and fastest method is absolute threshold. In this method the resulting value depends on current pixel intensity and some threshold value. If pixel intensity is greater than the threshold value, the result will be white (255); otherwise it will be black (0). This method has a huge disadvantage—it depends on lighting conditions and soft intensity changes. The more preferable method is the adaptive threshold. The major difference of this method is the use of all pixels in given radius around the examined pixel. Using average intensity gives good results and secures more robust corner detection. The following code snippet shows the MarkerDetector function: void MarkerDetector::performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg) { cv::adaptiveThreshold(grayscale, // Input image thresholdImg,// Result binary image 255, // cv::ADAPTIVE_THRESH_GAUSSIAN_C, // cv::THRESH_BINARY_INV, // 7, // 7 // ); } After applying adaptive threshold to the input image, the resulting image looks similar to the following one: Each marker usually looks like a square figure with black and white areas inside it. So the best way to locate a marker is to find closed contours and approximate them with polygons of 4 vertices. Contours detection The cv::findCountours function will detect contours on the input binary image: void MarkerDetector::findContours(const cv::Mat& thresholdImg, std::vector<std::vector<cv::Point> >& contours, int minContourPointsAllowed) { std::vector< std::vector<cv::Point> > allContours; cv::findContours(thresholdImg, allContours, CV_RETR_LIST, CV_ CHAIN_APPROX_NONE); contours.clear(); for (size_t i=0; i<allContours.size(); i++) { int contourSize = allContours[i].size(); if (contourSize > minContourPointsAllowed) { contours.push_back(allContours[i]); } } } The return value of this function is a list of polygons where each polygon represents a single contour. The function skips contours that have their perimeter in pixels value set to be less than the value of the minContourPointsAllowed variable. This is because we are not interested in small contours. (They will probably contain no marker, or the contour won't be able to be detected due to a small marker size.) The following figure shows the visualization of detected contours: Candidates search After finding contours, the polygon approximation stage is performed. This is done to decrease the number of points that describe the contour shape. 
It's a good quality check to filter out areas without markers because they can always be represented with a polygon that contains four vertices. If the approximated polygon has more than or fewer than 4 vertices, it's definitely not what we are looking for. The following code implements this idea: void MarkerDetector::findCandidates ( const ContoursVector& contours, std::vector<Marker>& detectedMarkers ) { std::vector<cv::Point> approxCurve; std::vector<Marker> possibleMarkers; // For each contour, analyze if it is a parallelepiped likely to be the marker for (size_t i=0; i<contours.size(); i++) { // Approximate to a polygon double eps = contours[i].size() * 0.05; cv::approxPolyDP(contours[i], approxCurve, eps, true); // We interested only in polygons that contains only four points if (approxCurve.size() != 4) continue; // And they have to be convex if (!cv::isContourConvex(approxCurve)) continue; // Ensure that the distance between consecutive points is large enough float minDist = std::numeric_limits<float>::max(); for (int i = 0; i < 4; i++) { cv::Point side = approxCurve[i] - approxCurve[(i+1)%4]; float squaredSideLength = side.dot(side); minDist = std::min(minDist, squaredSideLength); } // Check that distance is not very small if (minDist < m_minContourLengthAllowed) continue; // All tests are passed. Save marker candidate: Marker m; for (int i = 0; i<4; i++) m.points.push_back( cv::Point2f(approxCurve[i].x,approxCu rve[i].y) ); // Sort the points in anti-clockwise order // Trace a line between the first and second point. // If the third point is at the right side, then the points are anticlockwise cv::Point v1 = m.points[1] - m.points[0]; cv::Point v2 = m.points[2] - m.points[0]; double o = (v1.x * v2.y) - (v1.y * v2.x); if (o < 0.0) //if the third point is in the left side, then sort in anti-clockwise order std::swap(m.points[1], m.points[3]); possibleMarkers.push_back(m); } // Remove these elements which corners are too close to each other. // First detect candidates for removal: std::vector< std::pair<int,int> > tooNearCandidates; for (size_t i=0;i<possibleMarkers.size();i++) { const Marker& m1 = possibleMarkers[i]; //calculate the average distance of each corner to the nearest corner of the other marker candidate for (size_t j=i+1;j<possibleMarkers.size();j++) { const Marker& m2 = possibleMarkers[j]; float distSquared = 0; for (int c = 0; c < 4; c++) { cv::Point v = m1.points[c] - m2.points[c]; distSquared += v.dot(v); } distSquared /= 4; if (distSquared < 100) { tooNearCandidates.push_back(std::pair<int,int>(i,j)); } } } // Mark for removal the element of the pair with smaller perimeter std::vector<bool> removalMask (possibleMarkers.size(), false); for (size_t i=0; i<tooNearCandidates.size(); i++) { float p1 = perimeter(possibleMarkers[tooNearCandidates[i]. first ].points); float p2 = perimeter(possibleMarkers[tooNearCandidates[i].second]. points); size_t removalIndex; if (p1 > p2) removalIndex = tooNearCandidates[i].second; else removalIndex = tooNearCandidates[i].first; removalMask[removalIndex] = true; } // Return candidates detectedMarkers.clear(); for (size_t i=0;i<possibleMarkers.size();i++) { if (!removalMask[i]) detectedMarkers.push_back(possibleMarkers[i]); } } Now we have obtained a list of parallelepipeds that are likely to be the markers. To verify whether they are markers or not, we need to perform three steps: First, we should remove the perspective projection so as to obtain a frontal view of the rectangle area. 
Then we perform thresholding of the image using the Otsu algorithm. This algorithm assumes a bimodal distribution and finds the threshold value that maximizes the between-class variance while keeping the intra-class variance low. Finally, we perform identification of the marker code. If it is a marker, it has an internal code. The marker is divided into a 7 x 7 grid, of which the internal 5 x 5 cells contain ID information. The rest correspond to the external black border. Here, we first check whether the external black border is present. Then we read the internal 5 x 5 cells and check that they provide a valid code. (It might be required to rotate the code to get the valid one.) To get the rectangular marker image, we have to unwarp the input image using a perspective transformation. This matrix can be calculated with the help of the cv::getPerspectiveTransform function, which finds the perspective transformation from four pairs of corresponding points. The first argument is the marker coordinates in image space and the second argument corresponds to the coordinates of the square marker image. The estimated transformation will map the marker to square form and let us analyze it: cv::Mat canonicalMarker; Marker& marker = detectedMarkers[i]; // Find the perspective transformation that brings the current marker to rectangular form cv::Mat M = cv::getPerspectiveTransform(marker.points, m_markerCorners2d); // Transform image to get a canonical marker image cv::warpPerspective(grayscale, canonicalMarker, M, markerSize); Image warping transforms our image to a rectangular form using perspective transformation. Now we can test the image to verify whether it is a valid marker image. Then we try to extract the bit mask with the marker code. As we expect our marker to contain only black and white colors, we can perform Otsu thresholding to remove gray pixels and leave only black and white pixels: // threshold image cv::threshold(markerImage, markerImage, 125, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
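If you would like to experiment with this detection pipeline outside Xcode, the same steps translate almost one-to-one to OpenCV's Python bindings. The following sketch is a desktop prototype rather than the iOS code above: the canonical marker size, the contour-length cut-off, the test image name, and the simplified 7 x 7 grid reading (no corner ordering, rotation handling, or marker-code validation) are all assumptions made for illustration.

import cv2
import numpy as np

MARKER_SIZE = 70  # canonical marker image: 7 x 7 cells of 10 px each (assumption)

def find_marker_candidates(bgr):
    """Reproduce the steps described above: grayscale conversion, adaptive
    threshold, contour detection, and 4-vertex convex polygon approximation."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 7, 7)
    # OpenCV 4.x return signature assumed here
    contours, _ = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    candidates = []
    for c in contours:
        if len(c) < 50:                       # skip small contours
            continue
        approx = cv2.approxPolyDP(c, len(c) * 0.05, True)
        if len(approx) == 4 and cv2.isContourConvex(approx):
            candidates.append(approx.reshape(4, 2).astype(np.float32))
    return gray, candidates

def read_marker_cells(gray, corners):
    """Unwarp a candidate to a square, Otsu-threshold it, and sample the
    7 x 7 grid; returns the inner 5 x 5 bit matrix or None if the border
    is not all black. Simplified: corner order is taken as-is."""
    dst = np.float32([[0, 0], [MARKER_SIZE - 1, 0],
                      [MARKER_SIZE - 1, MARKER_SIZE - 1], [0, MARKER_SIZE - 1]])
    M = cv2.getPerspectiveTransform(corners, dst)
    canonical = cv2.warpPerspective(gray, M, (MARKER_SIZE, MARKER_SIZE))
    _, bits = cv2.threshold(canonical, 125, 255,
                            cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    cell = MARKER_SIZE // 7
    grid = np.zeros((7, 7), dtype=int)
    for r in range(7):
        for c in range(7):
            patch = bits[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            grid[r, c] = 1 if patch.mean() > 127 else 0
    if grid[0, :].any() or grid[6, :].any() or grid[:, 0].any() or grid[:, 6].any():
        return None                           # external border must be black
    return grid[1:6, 1:6]

if __name__ == "__main__":
    frame = cv2.imread("test_frame.png")      # hypothetical test image
    gray, candidates = find_marker_candidates(frame)
    for corners in candidates:
        code = read_marker_cells(gray, corners)
        if code is not None:
            print(code)

Running this on a still frame containing a marker should print the inner 5 x 5 bit matrix for each candidate whose border cells are all black; decoding and pose estimation would then proceed as described in the C++ implementation.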

Oracle: Using the Metadata Service to Share XML Artifacts

Packt
23 Jan 2013
11 min read
(For more resources related to this topic, see here.) The WSDL of a web service is made up of the following XML artifacts: WSDL Definition: It defines the various operations that constitute a service, their input and output parameters, and the protocols (bindings) they support. XML Schema Definition (XSD): It is either embedded within the WSDL definition or referenced as a standalone component; this defines the XML elements and types that constitute the input and output parameters. To better facilitate the exchange of data between services, as well as achieve better interoperability and re-usability, it is good practice to de?ne a common set of XML Schemas, often referred to as the canonical data model, which can be referenced by multiple services (or WSDL De?nitions). This means, we will need to share the same XML schema across multiple composites. While typically a service (or WSDL) will only be implemented by a single composite, it will often be invoked by multiple composites; so the corresponding WSDL will be shared across multiple composites. Within JDeveloper, the default behavior, when referencing a predefined schema or WSDL, is for it to add a copy of the file to our SOA project. However, if we have several composites, each referencing their own local copy of the same WSDL or XML schema, then every time that we need to change either the schema or WSDL, we will be required to update every copy. This can be a time-consuming and error-prone approach; a better approach is to have a single copy of each WSDL and schema that is referenced by all composites. The SOA infrastructure incorporates a Metadata Service (MDS), which allows us to create a library of XML artifacts that we can share across SOA composites. MDS supports two types of repositories: File-based repository: This is quicker and easier to set up, and so is typically used as the design-time MDS by JDeveloper. Database repository: It is installed as part of the SOA infrastructure. This is used at runtime by the SOA infrastructure. As you move projects from one environment to another (for example, from test to production), you must typically modify several environment-specific values embedded within your composites, such as the location of a schema or the endpoint of a referenced web service. By placing all this information within the XML artifacts deployed to MDS, you can make your composites completely agnostic of the environment they are to be deployed to. The other advantage of placing all your referenced artifacts in MDS is that it removes any direct dependencies between composites, which means that they can be deployed and started in any order (once you have deployed the artifacts to MDS). In addition, an SOA composite leverages many other XML artifacts, such as fault policies, XSLT Transformations, EDLs for event EDN event definitions, and Schematrons for validation, each of which may need to be shared across multiple composites. These can also be shared between composites by placing them in MDS. Defining a project structure Before placing all our XML artifacts into MDS, we need to define a standard file structure for our XML library. This allows us to ensure that if any XML artifact within our XML library needs to reference another XML artifact (for example a WSDL importing a schema), it can do so via a relative reference; in other words, the XML artifact doesn't include any reference to MDS and is portable. 
This has a number of benefits, including: OSB compatibility; the same schemas and WSDLs can be deployed to the Oracle Service Bus without modification Third-party tool compatibility; often we will use a variety of tools that have no knowledge of MDS to create/edit XML schemas, WSDLs, and so on (for example XML Spy, Oxygen) In this article, we will assume that we have defined the following directory structure under our <src> directory. Under the xmllib folder, we have defined multiple <solution> directories, where a solution (or project) is made up of one or more related composite applications. This allows each solution to maintain its XML artifacts independently. However, it is also likely that there will be a number of XML artifacts that need to be shared between different solutions (for example, the canonical data model for the organization), which in this example would go under <core>. Where we have XML artifacts shared between multiple solutions, appropriate governance is required to manage the changes to these artifacts. For the purpose of this article, the directory structure is over simpli?ed. In reality, a more comprehensive structure should be de?ned as part of the naming and deployment standards for your SOA Reference Architecture. The other consideration here is versioning; over time it is likely that multiple versions of the same schema, WSDL and so on, will require to be deployed side by side. To support this, we typically recommend appending the version number to the filename. We would also recommend that you place this under some form of version control, as it makes it far simpler to ensure that everyone is using an up-to-date version of the XML library. For the purpose of this article, we will assume that you are using Subversion. Creating a file-based MDS repository for JDeveloper Before we can reference this with JDeveloper, we need to define a connection to the file-based MDS. Getting ready By default, a file-based repository is installed with JDeveloper and sits under the directory structure: <JDeveloper Home>/jdeveloper/integration/seed This already contains the subdirectory soa, which is reserved for, and contains, artifacts used by the SOA infrastructure For artifacts that we wish to share across our applications in JDeveloper, we should create the subdirectory apps (under the seed directory); this is critical, as when we deploy the artifacts to the SOA infrastructure, they will be placed in the apps namespace We need to ensure that the content of the apps directory always contains the latest version of our XML library; as these are stored under Subversion, we simply need to check out the right portion of the Subversion project structure. How to do it... First, we need to create and populate our file-based repository. Navigate to the seed directory, and right-click and select SVN Checkout..., this will launch the Subversion Checkout window. For URL of repository, ensure that you specify the path to the apps subdirectory. For Checkout directory, specify the full pathname of the seed directory and append /apps at the end. Leave the other default values, as shown in the following screenshot, and then click on OK: Subversion will check out a working copy of the apps subfolder within Subversion into the seed directory. Before we can reference our XML library with JDeveloper, we need to define a connection to the file-based MDS. 
5. Within JDeveloper, from the File menu, select New to launch the Gallery; under Categories, select General | Connections, and then select SOA-MDS Connection from the Items list. This will launch the MDS Connection Wizard.
6. Enter File Based MDS for Connection Name and select a Connection Type of File Based MDS.
7. We then need to specify the MDS root folder on our local filesystem; this will be the directory that contains the apps directory, namely: <JDeveloper Home>/jdeveloper/integration/seed
8. Click on Test Connection; the Status box should be updated to Success!. Click on OK.

This will create a file-based MDS connection in JDeveloper. Next, browse the File Based MDS connection:

9. Within JDeveloper, open the Resource Palette and expand SOA-MDS. This should contain the File Based MDS connection that we just created.
10. Expand all the nodes down to the xsd directory, as shown in the following screenshot.
11. If you double-click on one of the schema files, it will open in JDeveloper (in read-only mode).

There's more...

Once the apps directory has been checked out, it will contain a snapshot of the MDS artifacts at the point in time that you performed the checkout. Over time, the artifacts in MDS will be modified or new ones will be created, so it is important to ensure that your local copy is updated with the current version. To do this, navigate to the seed directory, right-click on apps, and select SVN Update.

Creating a Mediator using a WSDL in MDS

In this recipe, we will show how we can create a Mediator using an interface definition from a WSDL held in MDS. This approach enables us to separate the implementation of a service (a composite) from the definition of its contract (WSDL).

Getting ready

Make sure you have created a file-based MDS repository for JDeveloper, as described in the first recipe. Create an SOA application with a project containing an empty composite.

How to do it...

1. Drag a Mediator from the SOA Component Palette onto your composite. This will launch the Create Mediator wizard; specify an appropriate name (EmployeeOnBoarding in the following example), and for the Template select Interface Definition from WSDL.
2. Click on the Find Existing WSDLs icon (circled in the previous screenshot); this will launch the SOA Resource Browser. Select Resource Palette from the drop-down list (circled in the following screenshot).
3. Select the WSDL that you wish to import and click on OK. This will return you to the Create Mediator wizard; ensure that the Port Type is populated and click on OK.

This will create a Mediator based on the specified WSDL within our composite.

How it works...

When we import the WSDL in this fashion, JDeveloper doesn't actually make a copy of the WSDL; rather, within the componentType file, it sets the wsdlLocation attribute to reference the location of the WSDL in MDS (as highlighted in the following screenshot). For WSDLs in MDS, the wsdlLocation attribute uses the following format:

oramds:/apps/<wsdl name>

Here, oramds indicates that the WSDL is located in MDS, apps indicates that it is in the application namespace, and <wsdl name> is the full pathname of the WSDL in MDS. The wsdlLocation doesn't specify the physical location of the WSDL; rather, it is relative to MDS, which is specific to the environment in which the composite is deployed.
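As a rough illustration only (not the exact output of the wizard — the service name, MDS path, and attribute prefixes here are assumptions), the relevant part of the .componentType file might look something like this:

<!-- EmployeeOnBoarding.componentType - illustrative fragment; the exact
     attributes generated by JDeveloper may differ from this sketch -->
<componentType xmlns="http://xmlns.oracle.com/sca/1.0">
  <service name="EmployeeOnBoarding_ep"
           ui:wsdlLocation="oramds:/apps/core/wsdl/EmployeeService_1.0.wsdl"
           xmlns:ui="http://xmlns.oracle.com/soa/designer/">
    <!-- the oramds:/apps prefix is resolved against MDS rather than a local file -->
    <interface.wsdl
        interface="http://example.com/employee/service#wsdl.interface(EmployeeService)"/>
  </service>
</componentType>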
This means that when the composite is open in JDeveloper, it will resolve the WSDL against the file-based MDS, and when deployed to the SOA infrastructure, it will resolve the WSDL against the MDS database repository that is installed as part of the SOA infrastructure.

There's more...

This method can be used equally well to create a BPEL process based on a WSDL from within the Create BPEL Process wizard; for the Template select Base on a WSDL and follow the same steps.

This approach works well with contract-first design, as it enables the contract for a composite to be designed first and, when ready for implementation, checked into Subversion. The SOA developer can then perform a Subversion update on their file-based MDS repository and use the WSDL to implement the composite.

Creating a Mediator that subscribes to an EDL in MDS

In this recipe, we will show how we can create a Mediator that subscribes to an EDN event whose EDL is defined in MDS. This approach enables us to separate the definition of an event from the implementation of any composite that subscribes to, or publishes, the event.

Getting ready

Make sure you have created a file-based MDS repository for JDeveloper, as described in the initial recipe. Create an SOA application with a project containing an empty composite.

How to do it...

1. Drag a Mediator from the SOA Component Palette onto your composite. This will launch the Create Mediator wizard; specify an appropriate name (UserRegistration in the following example), and for the Template select Subscribe to Events.
2. Click on the Subscribe to new event icon (circled in the previous screenshot); this will launch the Event Chooser window.
3. Click on the Browse for Event Definition (edl) files icon (circled in the previous screenshot); this will launch the SOA Resource Browser. Select Resource Palette from the drop-down list.
4. Select the EDL that you wish to import and click on OK. This will return you to the Event Chooser window; ensure that the required event is selected and click on OK.
5. This will return you to the Create Mediator window; ensure that the required event is configured as needed, and click on OK.

This will create an event subscription, based on the specified EDL, within our composite.

How it works...

When we reference an EDL in MDS, JDeveloper doesn't actually make a copy of the EDL; rather, within the composite.xml file, it creates an import statement that references the location of the EDL in MDS (a sketch of such an import is shown at the end of this recipe).

There's more...

This approach can be used equally well to subscribe to an event within a BPEL process, or to publish an event using either a Mediator or BPEL.
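For illustration only — the EDL file name, namespace, and attribute values below are assumptions rather than verified wizard output — the import added to composite.xml might take roughly this shape:

<!-- composite.xml - illustrative fragment showing an EDL referenced from MDS -->
<composite name="UserRegistration" xmlns="http://xmlns.oracle.com/sca/1.0">
  <!-- the oramds:/apps path points at the EDL held in MDS, not a local copy -->
  <import namespace="http://example.com/events/user"
          location="oramds:/apps/core/events/UserRegistration_1.0.edl"
          importType="edl"/>
  <!-- components, services, and wires omitted for brevity -->
</composite>

As with the WSDL example earlier, the oramds:/apps prefix means the composite carries no environment-specific path; it is resolved against whichever MDS repository the composite is opened in or deployed to.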