How-To Tutorials - Data

1204 Articles

What is New in 12c

Packt
30 Sep 2013
23 min read
(For more resources related to this topic, see here.)

Oracle Database 12c has introduced many new features and enhancements for backup and recovery. This article will introduce you to some of them, and you will have the opportunity to learn in more detail how they can be used in real-life situations. But I cannot start talking about Oracle 12c without first talking about a revolutionary new concept introduced with this version of the database product: the Multitenant Container Database (CDB), which can contain two or more pluggable databases (PDBs). When a container database contains only one PDB, it is called a Single Tenant Container Database. You can also create your database on Oracle 12c using the same format as before 12c; it is then called a non-CDB database and does not allow the use of PDBs.

Pluggable database

We are now able to have multiple databases sharing a single instance and a single set of Oracle binaries. Each database is configurable to a degree and allows some parameters to be set specifically for itself (even though they all share the same initialization parameter file), and, even better, each database is completely isolated from the others, without either knowing that the other exists.

A CDB is a single physical database that contains a root container with the main Oracle data dictionary and at least one PDB with specific application data. A PDB is a portable container with its own data dictionary, including metadata and internal links to the system-supplied objects in the root container, and it appears to an Oracle Net client as a traditional Oracle database. The CDB also contains a PDB called SEED, which is used as a template when an empty PDB needs to be created. The following figure shows an example of a CDB with five PDBs:

When creating a database on Oracle 12c, you can now create a CDB with one or more PDBs, and what is even better is that you can easily clone a PDB, or unplug it and plug it into a different server with a preinstalled CDB, if your target server is running out of resources such as CPU or memory.

Many years ago, the introduction of external storage gave us the possibility to store data on external devices and the flexibility to plug and unplug them into any system, independent of its OS. For example, you can connect an external device to a system running Windows XP and read your data without any problems; later you can unplug it, connect it to a laptop running Windows 7, and still be able to read your data. Now, with the introduction of Oracle pluggable databases, we can do something similar with Oracle when upgrading a PDB, making the process simple and easy. All you need to do to upgrade a PDB, for example, is (see the SQL sketch after this list):

1. Unplug your PDB (step 1 in the following figure) from a CDB running 12.1.0.1.
2. Copy the PDB to the destination location with a CDB that is using a later version, such as 12.2.0.1 (step 2 in the following figure).
3. Plug the PDB into the CDB (step 3 in the following figure), and your PDB is now upgraded to 12.2.0.1.

This new concept is a great solution for database consolidation and is very useful for multitenant SaaS (Software as a Service) providers, improving resource utilization, manageability, integration, and service management.
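As a rough illustration of the unplug/plug sequence described above, the following SQL*Plus sketch shows the kind of statements involved. The PDB name, XML manifest path, and the NOCOPY option are illustrative assumptions rather than commands taken from this article, and after plugging into a CDB of a later version you would still run the usual upgrade scripts against the PDB.

-- On the source CDB (for example, 12.1.0.1)
SQL> ALTER PLUGGABLE DATABASE pdb1 CLOSE IMMEDIATE;
SQL> ALTER PLUGGABLE DATABASE pdb1 UNPLUG INTO '/u01/stage/pdb1.xml';
SQL> DROP PLUGGABLE DATABASE pdb1 KEEP DATAFILES;

-- On the destination CDB (later version), after copying the datafiles and the XML manifest
SQL> CREATE PLUGGABLE DATABASE pdb1 USING '/u01/stage/pdb1.xml' NOCOPY TEMPFILE REUSE;
SQL> ALTER PLUGGABLE DATABASE pdb1 OPEN;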
Some key points about pluggable databases are:

- You can have many PDBs inside a single container (a CDB can contain a maximum of 253 PDBs).
- A PDB is fully backwards compatible with an ordinary pre-12.1 database from an application's perspective, meaning that an application built, for example, to run on Oracle 11.1 needs no changes to run on Oracle 12c.
- A system administrator can connect to a CDB as a whole and see a single system image.
- If you are not ready to make use of this new concept, you can still create a database on Oracle 12c as before, called a non-CDB (non-Container Database).
- Each instance in RAC opens the CDB as a whole. A foreground session sees only the single PDB it is connected to, and sees it just as a non-CDB.
- The Resource Manager is extended with some new between-PDB capabilities.
- It is fully integrated with Oracle Enterprise Manager 12c and SQL Developer.
- Fast provisioning of new databases (empty, or as a copy/clone of an existing PDB).
- On-clone triggers can be used to scrub or mask data during a clone process.
- Fast unplug and plug between CDBs.
- Fast patch or upgrade by unplugging a PDB and plugging it into a different CDB that is already patched or running a later database version.
- Separation of duties between the DBA and application administrators.
- Communication between PDBs is allowed via intra-CDB dblinks.
- Every PDB has a default service with its name in one listener.
- An unplugged PDB carries its lineage, OPatch and encryption key information, and much more.
- All PDBs in a CDB should use the same character set.
- All PDBs share the same control files, SPFILE, redo log files, flashback log files, and undo.
- Flashback PDB is not available in 12.1; it is expected to be available with 12.2.
- It allows multitenancy of Oracle databases, which is very useful for centralization, especially if using Exadata.
- The Multitenant Container Database is only available for Oracle Enterprise Edition as a payable option; all other editions of the Oracle database can only deploy non-CDB databases or Single Tenant Container Databases.

RMAN new features and enhancements

Now we can continue and take a fast and closer look at some of the new features and enhancements introduced in this database version for RMAN.

Container and pluggable database backup and restore

As we saw earlier, the introduction of Oracle 12c and the new pluggable database concept made it possible to easily centralize multiple databases while maintaining the individuality of each one when using a single instance. The introduction of this new concept also forced Oracle to introduce some enhancements to the existing BACKUP, RESTORE, and RECOVER commands to enable us to make an efficient backup or restore of the complete CDB. This includes all PDBs or just one or more PDBs, or, if you want to be more specific, you can also back up or restore just one or more tablespaces from a PDB.
Some examples of how to use the RMAN commands when performing a backup on Oracle 12c are:

RMAN> BACKUP DATABASE;                         (backs up the CDB plus all PDBs)
RMAN> BACKUP DATABASE root;                    (backs up only the root container)
RMAN> BACKUP PLUGGABLE DATABASE pdb1, pdb2;    (backs up the specified PDBs)
RMAN> BACKUP TABLESPACE pdb1:example;          (backs up a specific tablespace in a PDB)

Some examples when performing RESTORE operations are:

RMAN> RESTORE DATABASE;                        (restores the entire CDB, including all PDBs)
RMAN> RESTORE DATABASE root;                   (restores only the root container)
RMAN> RESTORE PLUGGABLE DATABASE pdb1;         (restores a specific PDB)
RMAN> RESTORE TABLESPACE pdb1:example;         (restores a tablespace in a PDB)

Finally, some examples of RECOVER operations are:

RMAN> RECOVER DATABASE;                        (root plus all PDBs)

RMAN> RUN {
  SET UNTIL SCN 1428;
  RESTORE DATABASE;
  RECOVER DATABASE;
  ALTER DATABASE OPEN RESETLOGS;
}

RMAN> RUN {
  RESTORE PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
  RECOVER PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
  ALTER PLUGGABLE DATABASE pdb1 OPEN RESETLOGS;
}

Enterprise Manager Database Express

The Oracle Enterprise Manager Database Console (or Database Control) that many of us used to manage an entire database is now deprecated and replaced by the new Oracle Enterprise Manager Database Express. This new tool uses Flash technology and allows the DBA to easily manage the configuration, storage, security, and performance of a database. Note that RMAN, Data Pump, and Oracle Enterprise Manager Cloud Control are now the only tools able to perform backup and recovery operations in a pluggable database environment; in other words, you cannot use Enterprise Manager Database Express for database backup/recovery operations.

Backup privileges

Oracle Database 12c supports the separation of DBA duties by introducing task-specific, least-privileged administrative privileges for backups that do not require the SYSDBA privilege. The new system privilege introduced with this release is SYSBACKUP. Avoid the use of the SYSDBA privilege for backups unless it is strictly necessary.

When connecting to the database using the AS SYSDBA system privilege, you are able to see any object structure and all the data within the object, whereas if you connect using the new AS SYSBACKUP system privilege, you will still be able to see the structure of an object but not the object data. If you try to see any data using the SYSBACKUP privilege, the ORA-01031: insufficient privileges message will be raised.

Tighter security policies require a separation of duties. The new SYSBACKUP privilege facilitates the implementation of this separation, allowing backup and recovery operations to be performed without implicit access to the data, so if access to the data is required for one specific user, it will need to be granted explicitly to that user.
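As a quick, hedged illustration of this restriction, the following SQL*Plus sketch assumes a hypothetical user bkpuser that has already been granted the SYSBACKUP privilege; the user name, password, and the application table are not from the article:

$ sqlplus bkpuser/bkpuser_pwd AS SYSBACKUP

SQL> SELECT file_name FROM dba_data_files;   -- dictionary views remain visible
SQL> SELECT * FROM scott.emp;                -- expected to fail with ORA-01031: insufficient privileges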
RMAN has introduced some changes when connecting to a database:

- TARGET: The user now requires the SYSBACKUP administrative privilege to connect to the TARGET database.
- CATALOG: Whereas in earlier versions a user was only required to have the RECOVERY_CATALOG_OWNER role to connect to the RMAN catalog, it now also needs to have the SYSBACKUP privilege assigned to connect to the catalog.
- AUXILIARY: The SYSBACKUP administrative privilege is required to connect to the AUXILIARY database.

Some important points about the SYSBACKUP administrative privilege are:

- It includes permissions for backup and recovery operations.
- It does not include data access privileges, such as SELECT ANY TABLE, that the SYSDBA privilege has.
- It can be granted to the SYSBACKUP user that is created during the database installation process.
- It is the default privilege when an RMAN connection string does not contain the AS SYSBACKUP clause: $ RMAN TARGET /

Before connecting as the SYSBACKUP user created during the database creation process, you will need to unlock the account and grant the SYSBACKUP privilege to the user. When you use the GRANT command to give the SYSBACKUP privilege to a user, the username and privilege information are automatically added to the database password file.

The v$pwfile_users view contains all information regarding users within the database password file and indicates whether a user has been granted any privileged system privilege. Let's take a closer look at this view:

SQL> DESC v$pwfile_users
Name                           Null?    Type
------------------------------ -------- -----------------
USERNAME                                VARCHAR2(30)
SYSDBA                                  VARCHAR2(5)
SYSOPER                                 VARCHAR2(5)
SYSASM                                  VARCHAR2(5)
SYSBACKUP                               VARCHAR2(5)
SYSDG                                   VARCHAR2(5)
SYSKM                                   VARCHAR2(5)
CON_ID                                  NUMBER

As you can see, this view now contains some new columns:

- SYSBACKUP: Indicates whether the user is able to connect using the SYSBACKUP privilege.
- SYSDG: Indicates whether the user is able to connect using the SYSDG (new for Data Guard) privilege.
- SYSKM: Indicates whether the user is able to connect using the SYSKM (new for Advanced Security) privilege.
- CON_ID: The ID of the current container. If 0, the row relates to the entire CDB or to an entire traditional database (non-CDB); if the value is 1, the user has access only to root; any other value identifies a specific container ID.

To help you clearly understand the use of the SYSBACKUP privilege, let's run a few examples. Let's connect to our newly created database as SYSDBA and take a closer look at the SYSBACKUP account:

$ sqlplus / as sysdba
SQL> SET PAGES 999
SQL> SET LINES 99
SQL> COL USERNAME FORMAT A21
SQL> COL ACCOUNT_STATUS FORMAT A20
SQL> COL LAST_LOGIN FORMAT A41
SQL> SELECT username, account_status, last_login
  2  FROM dba_users
  3  WHERE username = 'SYSBACKUP';

USERNAME     ACCOUNT_STATUS       LAST_LOGIN
------------ -------------------- -----------------------
SYSBACKUP    EXPIRED & LOCKED

As you can see, the SYSBACKUP account created during the database creation is currently EXPIRED & LOCKED. You will need to unlock this account and grant the SYSBACKUP privilege to it if you want to use this user for any backup and recovery purposes.

For this demo I will use the original SYSBACKUP account, but in a production environment never use the SYSBACKUP account; instead, grant the SYSBACKUP privilege to the user(s) that will be responsible for the backup and recovery operations.
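A minimal sketch of that production-style approach might look like the following. The user name and password are illustrative assumptions, and the exact quoting needed for the AS SYSBACKUP connect string can vary by shell:

SQL> CREATE USER bkpadmin IDENTIFIED BY "Backup#2013";
SQL> GRANT sysbackup TO bkpadmin;
SQL> SELECT username, sysbackup FROM v$pwfile_users WHERE username = 'BKPADMIN';

$ rman target '"bkpadmin/Backup#2013 AS SYSBACKUP"'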
SQL> ALTER USER sysbackup IDENTIFIED BY "demo" ACCOUNT UNLOCK;

User altered.

SQL> GRANT sysbackup TO sysbackup;

Grant succeeded.

SQL> SELECT username, account_status
  2  FROM dba_users
  3  WHERE account_status NOT LIKE '%LOCKED';

USERNAME              ACCOUNT_STATUS
--------------------- --------------------
SYS                   OPEN
SYSTEM                OPEN
SYSBACKUP             OPEN

We can also easily identify which system privileges and roles are assigned to SYSBACKUP by executing the following SQL statements:

SQL> COL grantee FORMAT A20
SQL> SELECT *
  2  FROM dba_sys_privs
  3  WHERE grantee = 'SYSBACKUP';

GRANTEE       PRIVILEGE                           ADM COM
------------- ----------------------------------- --- ---
SYSBACKUP     ALTER SYSTEM                        NO  YES
SYSBACKUP     AUDIT ANY                           NO  YES
SYSBACKUP     SELECT ANY TRANSACTION              NO  YES
SYSBACKUP     SELECT ANY DICTIONARY               NO  YES
SYSBACKUP     RESUMABLE                           NO  YES
SYSBACKUP     CREATE ANY DIRECTORY                NO  YES
SYSBACKUP     UNLIMITED TABLESPACE                NO  YES
SYSBACKUP     ALTER TABLESPACE                    NO  YES
SYSBACKUP     ALTER SESSION                       NO  YES
SYSBACKUP     ALTER DATABASE                      NO  YES
SYSBACKUP     CREATE ANY TABLE                    NO  YES
SYSBACKUP     DROP TABLESPACE                     NO  YES
SYSBACKUP     CREATE ANY CLUSTER                  NO  YES

13 rows selected.

SQL> COL granted_role FORMAT A30
SQL> SELECT *
  2  FROM dba_role_privs
  3  WHERE grantee = 'SYSBACKUP';

GRANTEE        GRANTED_ROLE                   ADM DEF COM
-------------- ------------------------------ --- --- ---
SYSBACKUP      SELECT_CATALOG_ROLE            NO  YES YES

Here, the ADMIN_OPTION column indicates whether or not the user has the ADMIN OPTION on the privilege, the DEFAULT_ROLE column indicates whether or not the role is designated as a default role for the user, and the COMMON column indicates whether it is common to all the containers and pluggable databases available.

SQL and DESCRIBE

Starting with Oracle 12.1, you are able to execute SQL commands and PL/SQL procedures from the RMAN command line without the SQL prefix or quotes for most SQL commands. You can now run some simple SQL commands in RMAN, such as:

RMAN> SELECT TO_CHAR(sysdate,'dd/mm/yy - hh24:mi:ss')
  2> FROM dual;

TO_CHAR(SYSDATE,'DD)
-------------------
17/09/12 - 02:58:40

RMAN> DESC v$datafile
Name                        Null?    Type
--------------------------- -------- -------------------
FILE#                                NUMBER
CREATION_CHANGE#                     NUMBER
CREATION_TIME                        DATE
TS#                                  NUMBER
RFILE#                               NUMBER
STATUS                               VARCHAR2(7)
ENABLED                              VARCHAR2(10)
CHECKPOINT_CHANGE#                   NUMBER
CHECKPOINT_TIME                      DATE
UNRECOVERABLE_CHANGE#                NUMBER
UNRECOVERABLE_TIME                   DATE
LAST_CHANGE#                         NUMBER
LAST_TIME                            DATE
OFFLINE_CHANGE#                      NUMBER
ONLINE_CHANGE#                       NUMBER
ONLINE_TIME                          DATE
BYTES                                NUMBER
BLOCKS                               NUMBER
CREATE_BYTES                         NUMBER
BLOCK_SIZE                           NUMBER
NAME                                 VARCHAR2(513)
PLUGGED_IN                           NUMBER
BLOCK1_OFFSET                        NUMBER
AUX_NAME                             VARCHAR2(513)
FIRST_NONLOGGED_SCN                  NUMBER
FIRST_NONLOGGED_TIME                 DATE
FOREIGN_DBID                         NUMBER
FOREIGN_CREATION_CHANGE#             NUMBER
FOREIGN_CREATION_TIME                DATE
PLUGGED_READONLY                     VARCHAR2(3)
PLUGIN_CHANGE#                       NUMBER
PLUGIN_RESETLOGS_CHANGE#             NUMBER
PLUGIN_RESETLOGS_TIME                DATE
CON_ID                               NUMBER

RMAN> ALTER TABLESPACE users
  2> ADD DATAFILE '/u01/app/oracle/oradata/cdb1/pdb1/user02.dbf' SIZE 50M;

Statement processed

Remember that the SYSBACKUP privilege does not grant access to user tables or views, but the SYSDBA privilege does.

Multi-section backups for incremental backups

Oracle Database 11g introduced multi-section backups to allow us to back up and restore very large files using backup sets (remember that Oracle datafiles can be up to 128 TB in size). Now, with Oracle Database 12c, we are able to make use of image copies when creating multi-section backups, complementing the previous backup set functionality.
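A hedged sketch of what a multi-section image copy backup might look like follows; the section sizes and the datafile number are illustrative values, not taken from the article:

RMAN> BACKUP AS COPY SECTION SIZE 1G DATABASE;
RMAN> BACKUP AS COPY SECTION SIZE 512M DATAFILE 4;

Each section can be handled by a separate channel, so allocating several channels lets the sections be produced in parallel.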
This helps us to reduce image copy creation time for backups, transporting tablespaces, cloning, and doing a TSPITR (tablespace point-in-time recovery); it also improves backups when using Exadata. The main restrictions on this enhancement are:

- The COMPATIBLE initialization parameter needs to be set to 12.0 or higher to make use of the new image copy multi-section backup feature.
- It is only available for datafiles and cannot be used to back up control files or password files.
- It should not be used with a high degree of parallelism when a file resides on a small number of disks, to avoid each process competing with the others when accessing the same device.

Another new feature introduced with multi-section backups is the ability to create multi-section incremental backups. This allows RMAN to back up only the data that has changed since the last backup, consequently enhancing the performance of multi-section backups, because the sections are processed independently, either serially or in parallel.

Network-based recovery

Restoring and recovering files over the network is supported starting with Oracle Database 12c. We can now recover a standby database and synchronize it with its primary database via the network, without the need to ship the archived log files. When the RECOVER command is executed, an incremental backup is created on the primary database, transferred over the network to the physical standby database, and applied to the standby database to synchronize it with the primary database. RMAN uses the SCN from the standby datafile header and creates the incremental backup starting from this SCN on the primary database; in other words, it brings over only the information necessary for the synchronization process. If block change tracking is enabled for the primary database, it will be used while creating the incremental backup, making it faster.

A network-based recovery can also be used to replace any missing datafiles, control files, SPFILE, or tablespaces on the primary database, using the corresponding entity from the physical standby for the recovery operation. You can also use multi-section backup sets, encryption, or even compression within a network-based recovery.

Active Duplicate

The Active Duplicate feature generates an online backup on the TARGET database and directly transmits it via an inter-instance network connection to the AUXILIARY database for duplication (it is not written to disk on the source server). Consequently, this reduces the impact on the TARGET database by offloading the data transfer operation to the AUXILIARY database, also reducing the duplication time.

This very useful feature has now received some important enhancements. In Oracle 11g, when this feature was initially introduced, it only allowed us to use a push process based on image copies. Now it also allows us to use the newly introduced pull process from the AUXILIARY database, which is based on backup sets (the pull process is the new default and automatically copies across all datafiles, control files, the SPFILE, and archived log files). It then performs the restore of all files and uses a memory script to complete the recovery operation and open the AUXILIARY database. RMAN dynamically determines, based on your DUPLICATE clauses, which process (push or pull) will be used. It is quite possible that Oracle will end up deprecating the push process in future releases of the database.
You can now choose the compression, section size, and encryption to be used during the Active Duplication process. For example, if you specify the SET ENCRYPTION option before the DUPLICATE command, all the backups sent from the target to the auxiliary database will be encrypted. For an effective use of parallelism, allocate more AUXILIARY channels instead of TARGET channels as in the earlier releases.

Finally, another important new enhancement is the possibility to finish the duplication process with the AUXILIARY database in a not-open state (the default is to open the AUXILIARY database after the duplication is completed). This option is very useful when you are required to:

- Modify the block change tracking
- Configure fast incremental backups or flashback database settings
- Move the location of the database, for example, to ASM
- Upgrade the AUXILIARY database (because the database must not be opened with RESETLOGS prior to applying the upgrade scripts)
- Avoid opening the database when you know that the attempt to open it would produce errors

To make it clearer, let's take a closer look at the operations RMAN performs when a DUPLICATE command is used:

1. Create an SPFILE for the AUXILIARY instance.
2. Mount the backup control file.
3. Restore the TARGET datafiles on the AUXILIARY database.
4. Perform incomplete recovery using all the available incremental backups and archived redo log files.
5. Shut down and restart the AUXILIARY instance in NOMOUNT mode.
6. Create a new control file, and create and store the new database ID in the datafiles (this does not happen if the FOR STANDBY clause is in use).
7. Mount and open the duplicate database using the RESETLOGS option, and create the online redo log files by default. If the NOOPEN option is used, the duplicated database is not opened with RESETLOGS and remains in the MOUNT state.

Here are some examples of how to use the DUPLICATE command with PDBs:

RMAN> DUPLICATE TARGET DATABASE TO <CDB1>;
RMAN> DUPLICATE TARGET DATABASE TO <CDB1> PLUGGABLE DATABASE <PDB1>, <PDB2>, <PDB3>;

Support for third-party snapshots

In the past, when using a third-party snapshot technology to make a backup or clone of a database, you were forced to put the database into backup mode (BEGIN BACKUP) before executing the storage snapshot. This requirement is no longer necessary if the following conditions are met:

- The database is crash consistent at the point of the snapshot
- Write ordering is preserved for each file within the snapshot
- The snapshot stores the time at which the snapshot is completed

If a storage vendor cannot guarantee compliance with these conditions, then you must place your database in backup mode before starting the snapshot. The RECOVER command now has a newly introduced option called SNAPSHOT TIME that allows RMAN to recover a snapshot that was taken without being in backup mode to a consistent point in time. Some examples of how to use this new option are:

RMAN> RECOVER DATABASE UNTIL TIME '10/12/2012 10:30:00' SNAPSHOT TIME '10/12/2012 10:00:00';
RMAN> RECOVER DATABASE UNTIL CANCEL SNAPSHOT TIME '10/12/2012 10:00:00';

Only trust your backups after you ensure that they are usable for recovery. In other words, always test your backup methodology first, ensuring that it can be used in the future in case of a disaster.
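One lightweight, hedged way to sanity-check that backups are readable without actually restoring anything is RMAN's VALIDATE option; it is not a substitute for a full restore test on separate hardware, but it reads the backup pieces and reports any corruption:

RMAN> RESTORE DATABASE VALIDATE;
RMAN> RESTORE ARCHIVELOG ALL VALIDATE;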
Cross-platform data transport

Starting with Oracle 12c, transporting data across platforms can be done using backup sets, and you can also create cross-platform inconsistent tablespace backups (when the tablespace is not in read-only mode) using image copies and backup sets. When using backup sets, you are able to make use of the compression and multi-section options, reducing downtime for tablespace and database platform migrations. RMAN does not catalog backup sets created for cross-platform transport in the control file, and it always takes into consideration the endian format of the platforms and the database open mode.

Before creating a backup set that will be used for cross-platform data transport, the following prerequisites should be met:

- The COMPATIBLE parameter in the SPFILE should be 12.0 or greater.
- The source database must be open in read-only mode when transporting an entire database, because the SYS and SYSAUX tablespaces participate in the transport process.
- If using Data Pump, the database must be open in read-write mode.

You can easily check the current COMPATIBLE value and open mode of your database by running the following SQL commands:

SQL> SHOW PARAMETER compatible

NAME                   TYPE        VALUE
---------------------- ----------- ----------------------
compatible             string      12.0.0.0.0

SQL> SELECT open_mode FROM v$database;

OPEN_MODE
--------------------
READ WRITE

When making use of the FOR TRANSPORT or TO PLATFORM clauses in the BACKUP command, you cannot use the following clauses:

- CUMULATIVE
- forRecoveryOfSpec
- INCREMENTAL LEVEL n
- keepOption
- notBackedUpSpec
- PROXY
- SECTION SIZE
- TAG
- VALIDATE

Table recovery

In previous versions of Oracle Database, the process of recovering a table to a specific point in time was never easy. Oracle has now solved this major issue by introducing the possibility to do a point-in-time recovery of a table, a group of tables, or even table partitions, without affecting the remaining database objects, using RMAN. This makes the process easier and faster than ever before. Remember that Oracle has previously introduced features such as database point-in-time recovery (DBPITR), tablespace point-in-time recovery (TSPITR), and Flashback Database; this is an evolution of the same technology and principles.

The recovery of tables and table partitions is useful in the following situations:

- To recover a very small set of tables to a particular point in time
- To recover a tablespace that is not self-contained to a particular point in time (remember that TSPITR can only be used if the tablespace is self-contained)
- To recover tables that are corrupted or were dropped with the PURGE option, so the FLASHBACK DROP functionality cannot be used
- When logging for a flashback table is enabled but the flashback target time or SCN is beyond the available undo
- To recover data that was lost after a data definition language (DDL) operation that changed the structure of a table

To recover tables and table partitions from an RMAN backup, the TARGET database must meet the following prerequisites:

- It must be in READ/WRITE mode
- It must be in ARCHIVELOG mode
- The COMPATIBLE parameter must be set to 12.0 or higher

You cannot recover tables or table partitions from the SYS, SYSTEM, and SYSAUX schemas, or from a standby database.

Now let's take a closer look at the steps to recover a table or table partitions using RMAN:

- First, check that all the prerequisites for a table recovery are met.
- Start an RMAN session with the CONNECT TARGET command.
- Use the RECOVER TABLE command with all the required clauses.
- RMAN determines which backup contains the data that needs to be recovered, based on the point in time specified.
- RMAN creates an AUXILIARY instance; you can also specify the location of the AUXILIARY instance files using the AUXILIARY DESTINATION or SET NEWNAME clause.
- RMAN recovers the specified objects into the AUXILIARY instance.
- RMAN creates a Data Pump export dump file that contains the recovered objects.
- RMAN imports the recovered objects from the dump file previously created into the TARGET database. If you want to import the objects into the TARGET database manually, you can use the NOTABLEIMPORT clause of the RECOVER command to achieve this goal.
- Optionally, RMAN can rename the recovered objects in the TARGET database using the REMAP TABLE clause, or import the recovered objects into a different tablespace using the REMAP TABLESPACE clause.

An example of how to use the new RECOVER TABLE command is:

RMAN> RECOVER TABLE SCOTT.test
  UNTIL SEQUENCE 5481 THREAD 2
  AUXILIARY DESTINATION '/tmp/recover'
  REMAP TABLE SCOTT.test:my_test;
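For a time-based variant that also relocates the recovered segments via the REMAP TABLESPACE clause mentioned above, a sketch along the following lines should work; the timestamp and the target tablespace name are illustrative assumptions:

RMAN> RECOVER TABLE SCOTT.test
  UNTIL TIME "TO_DATE('30-09-2013 09:00:00','DD-MM-YYYY HH24:MI:SS')"
  AUXILIARY DESTINATION '/tmp/recover'
  REMAP TABLESPACE users:users_recovered;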

Hadoop and HDInsight in a Heartbeat

Packt
30 Sep 2013
6 min read
(For more resources related to this topic, see here.)

Apache Hadoop is the leading Big Data platform, allowing large datasets to be processed efficiently and at low cost. Other Big Data platforms are MongoDB, Cassandra, and CouchDB. This section describes the Apache Hadoop core concepts and its ecosystem.

Core components

The following image shows the core Hadoop components:

At the core, Hadoop has two key components:

- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce (distributed computing for batch jobs)

For example, say we need to store a large file of 1 TB in size and we only have some commodity servers, each with limited storage. The Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that are distributed across all the nodes. Each block is replicated to ensure redundancy. Now we are able to store and retrieve the 1 TB file.

Now that we are able to save the large file, the next obvious need is to process this large file and get something useful out of it, like a summary report. Processing such a large file would be difficult and/or slow if handled sequentially. Hadoop MapReduce was designed to address this exact problem and to process data in a parallel fashion across several machines in a fault-tolerant mode. The MapReduce programming model uses simple key-value pairs for computation.

One distinct feature of Hadoop in comparison to other cluster or grid solutions is that Hadoop relies on a "share nothing" architecture. This means that when a MapReduce program runs, it uses the data local to the node, thereby reducing network I/O and improving performance. Another way to look at this is that when running MapReduce, we bring the code to the location where the data resides; the code moves, not the data. HDFS and MapReduce together make a powerful combination, and this is the reason why there is so much interest and momentum behind the Hadoop project.

Hadoop cluster layout

Each Hadoop cluster has three special master nodes (also known as servers):

- NameNode: This is the master for the distributed filesystem and maintains the metadata. This metadata has the listing of all the files and the location of each block of a file, which are stored across the various slaves (worker bees). Without a NameNode, HDFS is not accessible.
- Secondary NameNode: This is an assistant to the NameNode. It communicates only with the NameNode to take snapshots of the HDFS metadata at intervals configured at the cluster level.
- JobTracker: This is the master node for Hadoop MapReduce. It determines the execution plan of the MapReduce program, assigns it to various nodes, monitors all tasks, and ensures that the job is completed by automatically relaunching any task that fails.

All other nodes of the Hadoop cluster are slaves and perform the following two functions:

- DataNode: Each node hosts several chunks of files known as blocks. It communicates with the NameNode.
- TaskTracker: Each node also serves as a slave to the JobTracker by performing a portion of the map or reduce task, as decided by the JobTracker.

The following image shows a typical Apache Hadoop cluster:

The Hadoop ecosystem

As Hadoop's popularity has increased, several related projects have been created that simplify accessibility and manageability for Hadoop. I have organized them as per the stack, from top to bottom.
The following image shows the Hadoop ecosystem:

Data access

The following software packages are typically used as access mechanisms for Hadoop:

- Hive: A data warehouse infrastructure that provides SQL-like access on HDFS. It is suitable for ad hoc queries and abstracts MapReduce.
- Pig: A scripting language, similar to Python, that abstracts MapReduce and is useful for data scientists.
- Mahout: Used to build machine learning and recommendation engines.
- MS Excel 2013: With HDInsight, you can connect Excel to HDFS via Hive queries to analyze your data.

Data processing

The following are the key programming tools available for processing data in Hadoop:

- MapReduce: The Hadoop core component that allows distributed computation across all the TaskTrackers.
- Oozie: Enables the creation of workflow jobs to orchestrate Hive, Pig, and MapReduce tasks.

The Hadoop data store

The following are the common data stores in Hadoop:

- HBase: A distributed and scalable NoSQL (Not only SQL) database that provides a low-latency option and can handle unstructured data.
- HDFS: A Hadoop core component, the foundational distributed filesystem.

Management and integration

The following are the management and integration software:

- ZooKeeper: A high-performance coordination service for distributed applications to ensure high availability.
- HCatalog: Provides abstraction and interoperability across various data processing software such as Pig, MapReduce, and Hive.
- Flume: Distributed and reliable software for collecting data from various sources for Hadoop.
- Sqoop: Designed for transferring data between HDFS and any RDBMS.

Hadoop distributions

Apache Hadoop is open-source software and is repackaged and distributed by vendors offering enterprise support. The following is a listing of popular distributions:

- Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
- Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
- EMC Pivotal HD (http://gopivotal.com/)
- Hortonworks HDP (http://hortonworks.com/)
- MapR (http://mapr.com/)
- Microsoft HDInsight (cloud, http://www.windowsazure.com/)

HDInsight distribution differentiator

HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed in partnership between Hortonworks and Microsoft. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service. The following are the key differentiators of the HDInsight distribution:

- Enterprise-ready Hadoop: HDInsight is backed by Microsoft support and runs on standard Windows servers. IT teams can leverage Hadoop with Platform as a Service (PaaS), reducing the operations overhead.
- Analytics using Excel: With Excel integration, your business users can leverage data in Hadoop and analyze it using PowerPivot.
- Integration with Active Directory: HDInsight makes Hadoop reliable and secure with its integration with Windows Active Directory services.
- Integration with .NET and JavaScript: .NET developers can leverage the integration and write map and reduce code using their familiar tools.
- Connectors to RDBMS: HDInsight has ODBC drivers to integrate with SQL Server and other relational databases.
- Scale using cloud offering: The Azure HDInsight service enables customers to scale quickly as per project needs and provides a seamless interface between HDFS and the Azure storage vault.
- JavaScript console: An easy-to-use JavaScript console for configuring, running, and post-processing Hadoop MapReduce jobs.

Summary

In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight makes the Apache Hadoop solution better through simplified management, integration, development, and reporting.

Resources for Article:

Further resources on this subject:

- Making Big Data Work for Hadoop and Solr [Article]
- Understanding MapReduce [Article]
- Advanced Hadoop MapReduce Administration [Article]

Introducing QlikView elements

Packt
24 Sep 2013
6 min read
(For more resources related to this topic, see here.)

People

People are the only active element of data visualization, and as such, they are the most important. We briefly describe the roles of several people that participate in our project, but we mainly focus on the person who is going to analyze and visualize the data.

After the meeting, we get together with our colleague, Samantha, who is the analyst that supports the sales and executive teams. She currently manages a series of highly personalized Excels that she creates from standard reports generated within the customer invoice and project management system. Her audience ranges from the CEO down to sales managers. She is not a pushover, but she is open to trying new techniques, especially given that the sponsor of this project is the CEO of QDataViz, Inc. As a data discovery user, Samantha possesses the following traits:

Ownership

She has a stake in the project's success or failure. She, along with the company, stands to grow as a result of this project, and most importantly, she is aware of this opportunity.

Driven

She is focused on grasping what we teach her and is self-motivated to continue learning after the project is finished. The cause of her drive is unimportant as long as she remains honest.

Honest

She understands that data is a passive element that is open to diverse interpretations by different people. She resists basing her arguments on deceptive visualization techniques or data omission.

Flexible

She does not endanger her job and company results by following every technological fad or whimsical idea. However, she realizes that technology does change and that a new approach can foment breakthroughs.

Analytical

She loves finding anomalies in the data and being the reason that action is taken to improve QDataViz, Inc. As a means to achieve what she loves, she understands how to apply functions and methods to manipulate data.

Knowledgeable

She is familiar with the company's data, and she understands the indicators needed to analyze its performance. Additionally, she serves as a data source and gives context to analysis.

Team player

She respects the roles of her colleagues and holds them accountable. In turn, she demands respect and is also obliged to meet her responsibilities.

Data

Our next meeting involves Samantha and Ivan, our Information Technology (IT) Director. While Ivan explains the data available in the customer invoice and project management system's well-defined databases, Samantha adds that she has vital data in Microsoft Excel that is missing from those databases. One Excel file contains the sales budget and another contains an additional customer grouping; both files are necessary to present information to the CEO. We take advantage of this discussion to highlight the following characteristics that make data easy to analyze.

Reliable

Ivan is going to document the origin of the tables and fields, which increases Samantha's confidence in the data. He is also going to perform a basic data cleansing and eliminate duplicate records whose only difference is a period, two transposed letters, or an abbreviation. Once the system is operational, Ivan will consider the impact any change in the customer invoice and project management system may have on the data. He will also verify that the data is continually updated, while Samantha helps confirm the data's validity.

Detailed

Ivan will preserve as much detail as possible.
If he is unable to handle large volumes of data as a whole, he will segment the detailed data by month and reduce the detail of a year's data in a consistent fashion. Conversely, he will consider adding detail by prorating payments between the products of paid invoices, in order to maintain a consistent level of detail between invoices and payments.

Formal

An Excel file as a data source is a short-term solution. While Ivan respects its temporary use to allow for a quick, first release of the data visualization project, he takes responsibility for finding a more stable medium- to long-term solution. In the span of a few months, he will consider modifying the invoice system, investing in additional software, or creating a simple portal to upload Excel files to a database.

Flexible

Ivan will not prevent progress solely for bureaucratic reasons. Samantha respects that Ivan's goal is to make data more standardized, secure, and recoverable. However, Ivan knows that if he does not move as quickly as the business does, he will become irrelevant as Samantha and others create their own black market of company data.

Referential

Ivan is going to make available manifold perspectives of QDataViz, Inc. He will maintain history, budgets, and forecasts by customers, salespersons, divisions, states, and projects. Additionally, he will support segmenting these dimensions into multiple groups, subgroups, classes, and types.

Tools

We continue our meeting with Ivan and Samantha, but we now change our focus to the tool we will use to foster great data visualization and analysis. We create the following list of basic features we hope for from this tool:

Fast and easy implementation

We should be able to learn the tool quickly and deliver a first version of our data visualization project within a matter of weeks. In this fashion, we start receiving a return on our investment within a short period of time.

Business empowerment

Samantha should be able to continue her analysis with little help from us. Also, her audience should be able to easily perform their own lightweight analysis and follow up on the decisions made.

Enterprise-ready

Ivan should be able to maintain hundreds or thousands of users and data volumes that exceed 100 million rows. He should also be able to restrict access to certain data for certain users. Finally, he needs to have the confidence that the tool will remain available even if a server fails.

Based on these expectations, we talk about data discovery tools, which are increasingly becoming part of the architecture of many organizations. Samantha can use these tools for self-service data analysis; in other words, she can create her own data visualizations without having to depend on pre-built graphs or reports. At the same time, Ivan can be reassured that the tool does not interfere with his goal of providing an enterprise solution that offers scalability, security, and high availability. The data discovery tool we are going to use is QlikView, and the following diagram shows the overall architecture we will build and where this article focuses its attention:

Summary

In this article, we learned about people, data, and tools, which are an essential part of creating great data visualization and analysis.

Resources for Article:

Further resources on this subject:

- Meet QlikView [Article]
- Linking Section Access to multiple dimensions [Article]
- Creating sheet objects and starting new list using Qlikview 11 [Article]

Executing PDI jobs from a filesystem (Simple)

Packt
19 Sep 2013
7 min read
(For more resources related to this topic, see here.)

Getting ready

To get ready for this article, we first need to check that our Java environment is configured properly; to do this, check that the JAVA_HOME environment variable is set. Even if all the PDI scripts, when started, call other scripts that try to find out about our Java execution environment to get the value of the JAVA_HOME variable, it is always a good rule of thumb to have that variable set properly any time we work with a Java application.

The Kitchen script is in the PDI home directory, so the easiest way to launch the script is to add the path to the PDI home directory to the PATH variable. This gives you the ability to start the Kitchen script from any place without specifying the absolute path to the Kitchen file location. If you do not do this, you will always have to specify the complete path to the Kitchen script file.

To play with this article, we will use the samples in the directory <book_samples>/sample1; here, <book_samples> is the directory where you unpacked all the samples of the article.

How to do it…

To start a PDI job in Linux or Mac, use the following steps:

1. Open the command-line terminal and go to the <book_samples>/sample1 directory.
2. Let's start the sample job. To identify which job file needs to be started by Kitchen, we need to use the -file argument with the following syntax:

-file:<complete_filename_to_job_file>

Remember to specify either an absolute path or a relative path by properly setting the correct path to the file.

3. The simplest way to start the job is with the following syntax:

$ kitchen.sh -file:./export-job.kjb

4. If you're not positioned locally in the directory where the job files are located, you must specify the complete path to the job file as follows:

$ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb

5. Another option to start our job is to separately specify the name of the directory where the job file is located and then give the name of the job file. To do this, we need to use the -dir argument together with the -file argument. The -dir argument lets you specify the location of the job file directory using the following syntax:

-dir:<complete_path_to_job_file_directory>

So, if we're located in the same directory where the job resides, we can start the job using the following new syntax:

$ kitchen.sh -dir:. -file:export-job.kjb

6. If we're starting the job from a different directory than the directory where the job resides, we can use the absolute path and the -dir argument to set the job's directory as follows:

$ kitchen.sh -dir:/home/sramazzina/tmp/samples -file:export-job.kjb

To start a PDI job with parameters in Linux or Mac, perform the following steps:

1. Normally, PDI manages input parameters for the executing job. To set parameters using the command-line script, we need to use a proper argument. We use the -param argument to specify the parameters for the job we are going to launch. The syntax is as follows:

-param:<parameter_name>=<parameter_value>

2. Our sample job and transformation accept a sample parameter called p_country, which specifies the name of the country whose customers we want to export to a file. Let's suppose we are positioned in the same directory where the job file resides and we want to call our job to extract all the customers for the country U.S.A.
In this case, we can call the Kitchen script using the following syntax:

$ kitchen.sh -param:p_country=USA -file:./export-job.kjb

Of course, you can apply the -param switch to all the other three cases we detailed previously.

To start a PDI job in Windows, use the following steps:

In Windows, a PDI job from the filesystem can be started by following the same rules that we saw previously, using the same arguments in the same way. The only difference is in the way we specify the command-line arguments. Any time we start PDI jobs from Windows, we need to specify the arguments using the / character instead of the - character we used for Linux or Mac. Therefore, this means that:

-file:<complete_filename_to_job_file>

will become:

/file:<complete_filename_to_job_file>

And:

-dir:<complete_path_to_job_file_directory>

will become:

/dir:<complete_path_to_job_file_directory>

From the directory <book_samples>/sample1, if you want to start the job, you can run the Kitchen script using the following syntax:

C:\temp\samples>Kitchen.bat /file:./export-job.kjb

Regarding the use of PDI parameters in command-line arguments, the second important difference on Windows is that we need to substitute the = character in the parameter assignment syntax with the : character. Therefore, this means that:

-param:<parameter_name>=<parameter_value>

will become:

/param:<parameter_name>:<parameter_value>

From the directory <book_samples>/sample1, if you want to extract all the customers for the country U.S.A., you can start the job using the following syntax:

C:\temp\samples>Kitchen.bat /param:p_country:USA /file:./export-job.kjb

To start PDI transformations, perform the following steps:

The Pan script starts PDI transformations. On Linux or Mac, you can find the pan.sh script in the PDI home directory. Assuming that you are in the same directory, <book_samples>/sample1, where the transformation is located, you can start a simple transformation with a command in the following way:

$ pan.sh -file:./read-customers.ktr

If you want to start a transformation by specifying some parameters, you can use the following command:

$ pan.sh -param:p_country=USA -file:./read-customers.ktr

In Windows, you can use the Pan.bat script, and the sample commands will be as follows:

C:\temp\samples>Pan.bat /file:./read-customers.ktr

Again, if you want to start a transformation by specifying some parameters, you can use the following command:

C:\temp\samples>Pan.bat /param:p_country:USA /file:./read-customers.ktr

Summary

In this article, you were guided through starting a PDI job using the Kitchen script. In this case, the PDI jobs we started were stored locally in the computer filesystem, but they could be anywhere on the network, in any place that is directly accessible. You learned how to start simple jobs both with and without a set of input parameters previously defined in the job. Using command-line scripts is a fast way to start batches, and it is also the easiest way to schedule our jobs using our operating system's scheduler. The script accepts a set of inline arguments to pass the proper options required by the program to run our job in any specific situation.

Resources for Article:

Further resources on this subject:

- Integrating Kettle and the Pentaho Suite [Article]
- Installing Pentaho Data Integration with MySQL [Article]
- Pentaho – Using Formulas in Our Reports [Article]

Oracle B2B Overview

Packt
17 Sep 2013
12 min read
B2B environment setup

Here is a list of some OFM concepts that will be used in this article:

- Domain: The basic administration unit, which includes a special WebLogic Server instance called the Administration Server and, optionally, one or many Java components.
- Java component: A Java EE application deployed to an Oracle WebLogic Server domain as part of a domain template. For example, SOA Suite is a Java component.
- Managed server: An additional WebLogic Server included in a domain to host Java components such as SOA Suite.

We will use the UNIX operating system for our tutorials. The following list describes the directory environment variables used throughout the article for configuring the Oracle SOA Suite deployment:

- Middleware home (MW_HOME): The top-level directory for all OFM products.
- WebLogic Server home (WL_HOME): Contains the installed files necessary to host a WebLogic Server. Example: $MW_HOME/wlserver_10.3
- Oracle home (SOA_ORACLE_HOME): The Oracle SOA Suite product directory. Example: $MW_HOME/Oracle_SOA1
- Oracle Common home (ORACLE_COMMON_HOME): Contains the binary and library files required for Oracle Enterprise Manager Fusion Middleware Control and Java Required Files (JRF). Example: $MW_HOME/oracle_common
- Domain home (SOA_DOMAIN_HOME): The absolute path of the source domain containing the SOA Suite Java component. Example: $MW_HOME/user_projects/domains/SOADomain
- Java home (JAVA_HOME): Specifies the location of the JDK (must be 1.6.04 or higher) or JRockit. Example: $MW_HOME/jdk160_29
- Ant home (ANT_HOME): Specifies the location of the Ant archive. Example: $MW_HOME/org.apache.ant_1.7.1

The following figure depicts a snapshot of the SOA Suite directory's hierarchical structure:

For the recommended SOA Suite directory location, please refer to the OFM Enterprise Development Guide for SOA Suite, which can be found at http://docs.oracle.com/cd/E16764_01/core.1111/e12036/toc.htm.

JDeveloper installation tips

JDeveloper is a development tool that will be used in the article. It is a full-service Integrated Development Environment (IDE), which allows for the development of SOA projects along with a host of other Oracle products, including Java. If it has not been installed yet, you may consider downloading and installing the Oracle VM VirtualBox (VBox) image of the entire package of SOA Suite, B2B, and JDeveloper, provided by Oracle on the Oracle download site found at http://www.oracle.com/technetwork/middleware/soasuite/learnmore/vmsoa-172279.html. All you need to do is install Oracle VM VirtualBox and import the SOA/BPM appliance. This is for evaluation and trial purposes, and is not recommended for production use; however, for the purpose of following along with the tutorial in the article, it is perfect. The minimum and recommended requirements for the VBox image are:

- Memory (RAM): 4-6 GB minimum, 8 GB recommended
- Disk space: 25 GB minimum, 50 GB recommended

While VMs are convenient, they do use quite a bit of disk space and memory. If you don't have a machine that meets the minimum requirements, it will be a challenge to try the exercises. The other alternative is to download the bits for the platform you are using from the Oracle download page and install each software package and configure them accordingly, including a JDK, a DB, WebLogic Server, SOA Suite, and JDeveloper, among other things you may need. If you decide that you have enough system resources to run the VBox image, here are some of the major steps that you need to perform to download and install it.
Please follow the detailed instructions found in the Introduction and Readme file, which can be downloaded from http://www.oracle.com/technetwork/middleware/soasuite/learnmore/soabpmvirtualboxreadme-1612068.pdf, in order to have the complete set of instructions.

1. Download the Introduction and Readme file, and review it.
2. Enable hardware virtualization in your PC BIOS if necessary.
3. To download and install the VirtualBox software (the engine that runs the virtual machine on your host machine), click on the link Download and install Oracle VM VirtualBox on the download page.
4. To download the 7 pieces of the ZIP file, click on each file download ending with 7z.00[1-7] on the download page.
5. To download the MD5 checksum tool if you don't have one, click on the link Download MD5sums if you're on Windows to check your download worked okay on the download page.
6. Run the MD5 checksum tool to verify the 7 downloaded files: md5sums oel5u5-64bit-soabpm-11gr1-ps5-2-0-M.7z.001. Repeat for all 7 files. (This takes quite a while, but it is best to do it, so that you can verify that your download is complete and accurate.)
7. Compare the results of the program with the results in the download that ends with .mdsum. They should match exactly.
8. Extract the VBox image from the .001 file using a zip/unzip tool. Use a ZIP tool such as 7-Zip (available as freeware for Windows), WinZip, or another tool to extract the .ova file from the 7 files into a single file on your platform. Using 7-Zip, if you extract from the first file, it will find the other 6 files and combine them all as it extracts.
9. Start VirtualBox and set preferences such as the location of the VBox image on your disk (follow the instructions in the readme file).
10. Import the new .ova file that was extracted from the ZIP file.
11. Check the settings and adjust memory/CPU.
12. Start the appliance (VBox image).
13. Log in as oracle with the password oracle (check the Readme).
14. Choose the domain type dev_soa_osb.
15. Set up a shared folder you can use to share files between your machine and the virtual machine, and restart the VM.
16. Once you are logged back in, start the admin server using the text-based menu.
17. Once the server is started, you can start the graphical desktop using the text-based menu.
18. Click on the JDeveloper icon on the desktop of the VM to start JDeveloper.
19. Choose Default Role when prompted for a role.

At the time of writing, the latest available version is 11g PS5 (11.1.1.6). The VBox image comes with SOA Suite, Oracle 11g XE Database, and JDeveloper pre-installed on a Linux virtual machine. Using the VirtualBox technology, you can run this Linux machine virtually on your laptop, desktop, or a variety of other platforms. For the purpose of this article, you should choose the dev_soa_osb type of domain.

System requirements

Oracle B2B is installed as a part of the SOA Suite installation. The SOA Suite installation steps are well documented and are beyond the scope of this article. If you have never installed Oracle SOA Suite 11g, check the Installation Guide for Oracle SOA Suite and Oracle Business Process Management Suite 11g. It can be downloaded from the Oracle Technology Network (OTN) documentation downloads page at http://docs.oracle.com/cd/E23943_01/doc.1111/e13925/toc.htm. There are several important topics that did not have enough coverage in the SOA Suite Installation Guide. One of them is how to prepare the environment for the SOA/B2B installation.
To begin, it is important to validate whether your environment meets the minimum requirements specified in the Oracle Fusion Middleware System Requirements and Specifications document. It can be downloaded from the Oracle Technology Network (OTN) Documentation downloads page at http://docs.oracle.com/html/E18558_01/fusion_requirements.htm. The spreadsheet provides very important SOA Suite installation recommendations, such as the minimum disk space information and memory requirements that could help the IT hardware team with its procurement planning process. For instance, the Oracle recommended hardware and system requirements for SOA Suite are: Minimum Physical Memory required: 2 gigabytes Minimum available Memory Required: 4 gigabytes CPU: dual-core Pentium, 1.5 GHz or greater Disk Space: 15 gigabytes or more This document also has information about supported databases and database versions. Another important document that has plenty of relevant information is Oracle Fusion Middleware 11g Release 1 (11.1.1.x) Certification Matrix. It can be downloaded from the Oracle Technology Network (OTN) Documentation downloads page at http://www.oracle.com/technetwork/middleware/downloads/fmw-11gr1certmatrix.xls. It is indeed a treasure chest of information. This spreadsheet may save you from a lot of headache. The last thing someone wants to run into is the need to re-install, just because they did not properly read and /or missed some recommendations. Here are some important points from the spreadsheet you don't want to miss: Hardware platform version's compatibility with a particular SOA Suite release Supported JDK versions Interoperability support for SOA Suite with WebLogic Server Supported database versions In conclusion, the following list includes a complete SOA Suite software stack (as used in the article): Oracle WebLogic Server (10.1.3.6) (Required) Repository Creation Utility (RCU) (11.1.1.6.0) (Required) SOA Suite 11g (11.1.1.6.0) (Required) JDeveloper (11.1.1.6.0) (Required) JDeveloper Extension for SOA (Required) Oracle B2B Document Editor release 7.05 (Optional) Oracle Database 11g (Required) Oracle B2B installation and post-installation configuration notes There are several important installation and post-installation steps that may directly or indirectly impact the B2B component's behavior. Creating a WebLogic domain is one of the most important SOA Suite 11g installation steps. The BEA engineers, who used to work with WebLogic prior to 11g, never before had to select SOA Suite components while creating a domain. This process is completely new for the Oracle engineers who are familiar only with prior releases of SOA Suite. There are several steps in this process that, if missed, might require a complete re-installation. A common mistake that people make when creating a new domain is that they don't check the Enterprise Manager checkbox. As a result, Enterprise Manager is not available, meaning that neither instance monitoring and tracking, nor access to the B2B configuration properties is available. Make sure you do not make such a mistake by selecting the Oracle Enterprise Manager checkbox. Oracle Enterprise Manager has been assigned a new name: Enterprise Manager Fusion Middleware Control. 
While planning the SOA Suite deployment architecture, it is recommended to choose ahead of time between the following two WebLogic domain configurations:

Developers domain
Production domain

In the Developers domain configuration, SOA Suite is installed as part of the administration server, implying that a separate managed server will not be created. This configuration could be a good choice for a development server, a personal laptop, or any environment where available memory is limited. One should always keep in mind that SOA Suite requires up to 4 gigabytes of available memory. To set up the Developers domain, select the Oracle SOA Suite for developers checkbox on the Select Domain Source page, as shown in the following screenshot:

Oracle strongly recommends against using this configuration in a production environment, warning that it will not be supported; that is, Oracle Support won't be able to provide assistance for any issues that happen to occur in this environment. Conversely, if the Oracle SOA Suite checkbox is selected, as shown in the following screenshot, a managed server will be created with the default name soa_server1. Creating a separate managed server (which is a WebLogic Java Virtual Machine) and deploying SOA Suite to this managed server provides a more scalable configuration.

If SOA Suite for developers was installed, you need to perform the following steps to activate the B2B user interface:

Log in to the Oracle WebLogic Server Administration Console using the following URL: http://<localserver>:7001/console (note that 7001 is the default port unless a different port was chosen during the installation process).
Provide the default administrator account (the WebLogic user, unless it was changed during the installation process).
Select Deployments from the Domain Structure list.
On the bottom-right side of the page, select b2bui from the Deployments list (as shown in the following screenshot).
On the next page, click on the Targets tab.
Select the Component checkbox to enable the Change Target button.
Click on the Change Target button.
Select the AdminServer checkbox and click on the Yes button.
Click on the Activate Changes button.
Click on the Deployments link in the WebLogic domain structure. The B2B user interface is activated.

If the SOA Suite production configuration was chosen, these steps are not necessary. You must, however, configure Node Manager before starting the managed server. To do that, execute the setNMProps script and then start Node Manager:

$ORACLE_COMMON_HOME/common/bin/setNMProps.sh
$MW_HOME/wlserver_n/server/bin/startNodeManager.sh

Oracle B2B web components

Oracle B2B Gateway is deployed as part of the Oracle SOA Service Infrastructure, or SOA-Infra. SOA Infrastructure is a Java EE compliant application running on Oracle WebLogic Server.
Java EE compliant application: It is a wrapper around web applications and Enterprise Java Bean (EJB) applications Web Application: It usually represents the User Interface Layer, and includes Java Server pages (JSP), Servlets, HTML, and so on Servlet: It is a a module of Java code that runs as a server-side application Java Server Pages (JSP): It is a programming technology used to make dynamic web pages WAR archive: It is an artifact for the web application deployment Enterprise Java Beans: These are server-side domain objects that fit into a standard component-based architecture for building enterprise applications with Java EJB application: It is a collection of Enterprise Java Beans The following table shows a list of B2B web components installed as part of the SOA Infrastructure application. They include an enterprise application, several EJBs, a web application, and a web service. The B2B web application provides a link to the B2B Interface. The B2B MetadataWS Web Service provides Oracle SOA Service Infrastructure with access to the metadata repository. Stateless EJBs are used by the B2B Engine. This table might be helpful to understand how Oracle B2B integrates with Oracle SOA Suite. It could also be useful while developing B2B high availability architecture. Name Application Type b2b Web Application b2bui JEE Application B2BInstanceMessageBean EJB B2BStarterBeanWLS EJB B2BUtilityBean EJB B2BMetadataWS Web Service
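To make the stateless EJB entries in the table above a little more concrete, here is a minimal, purely illustrative sketch of a stateless session bean in Java (EJB 3.0 style, as supported by WebLogic 10.3.x). The interface and bean names are invented for illustration and are not taken from the B2B code base; the two types would normally live in separate source files.

import javax.ejb.Local;
import javax.ejb.Stateless;

// Business interface for the bean (required in EJB 3.0).
@Local
public interface Greeting {
    String greet(String name);
}

// A minimal stateless session bean: the container pools instances and hands out
// whichever one is free, so no conversational state is kept between calls.
@Stateless
public class GreetingBean implements Greeting {
    public String greet(String name) {
        return "Hello, " + name;
    }
}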
Linking OpenCV to an iOS project

Packt
16 Sep 2013
7 min read
(For more resources related to this topic, see here.) Getting ready First you should download the OpenCV framework for iOS from the official website at http://opencv.org. In this article, we will use Version 2.4.6. How to do it... The following are the main steps to accomplish the task: Add the OpenCV framework to your project. Convert image to the OpenCV format. Process image with a simple OpenCV call. Convert image back. Display image as before. Let's implement the described steps: We continue modifying the previous project, so that you can use it; otherwise create a new project with UIImageView. We'll start by adding the OpenCV framework to the Xcode project. There are two ways to do it. You can add the framework as a resource. This is a straightforward approach. Alternatively, the framework can be added through project properties by navigating to Project | Build Phases | Link Binary With Libraries. To open project properties you should click to the project name in the Project Navigator area. Next, we'll include OpenCV header files to our project. To avoid conflicts, we will add the following code to the very beginning of the file, above all other imports: #ifdef __cplusplus#import <opencv2/opencv.hpp>#endif This is needed, because OpenCV redefines some names, for example, min/max functions. Set the value of Compile Sources As property as Objective-C++. The property is available in the project settings and can be accessed by navigating to Project | Build Settings | Apple LLVM compiler 4.1 - Language. To convert the images from UIImageto cv::Mat, you can use the following functions: UIImage* MatToUIImage(const cv::Mat& image) { NSData *data = [NSData dataWithBytes:image.data length:image.elemSize()*image.total()]; CGColorSpaceRef colorSpace; if (image.elemSize() == 1) { colorSpace = CGColorSpaceCreateDeviceGray(); } else { colorSpace = CGColorSpaceCreateDeviceRGB(); } CGDataProviderRef provider = CGDataProviderCreateWithCFData((__bridge CFDataRef)data); // Creating CGImage from cv::Mat CGImageRef imageRef = CGImageCreate(image.cols, //width image.rows, //height 8, //bits per component 8*image.elemSize(),//bits per pixel image.step.p[0], //bytesPerRow colorSpace, //colorspace kCGImageAlphaNone|kCGBitmapByteOrderDefault,// bitmap info provider, //CGDataProviderRef NULL, //decode false, //should interpolate kCGRenderingIntentDefault //intent ); // Getting UIImage from CGImage UIImage *finalImage = [UIImage imageWithCGImage:imageRef]; CGImageRelease(imageRef); CGDataProviderRelease(provider); CGColorSpaceRelease(colorSpace); return finalImage;}void UIImageToMat(const UIImage* image, cv::Mat& m, bool alphaExist = false){ CGColorSpaceRef colorSpace = CGImageGetColorSpace(image.CGImage); CGFloat cols = image.size.width, rows = image.size.height; CGContextRef contextRef; CGBitmapInfo bitmapInfo = kCGImageAlphaPremultipliedLast; if (CGColorSpaceGetModel(colorSpace) == 0) { m.create(rows, cols, CV_8UC1); //8 bits per component, 1 channel bitmapInfo = kCGImageAlphaNone; if (!alphaExist) bitmapInfo = kCGImageAlphaNone; contextRef = CGBitmapContextCreate(m.data, m.cols, m.rows, 8, m.step[0], colorSpace, bitmapInfo); } else { m.create(rows, cols, CV_8UC4); // 8 bits per component, 4 channels if (!alphaExist) bitmapInfo = kCGImageAlphaNoneSkipLast | kCGBitmapByteOrderDefault; contextRef = CGBitmapContextCreate(m.data, m.cols, m.rows, 8, m.step[0], colorSpace, bitmapInfo); } CGContextDrawImage(contextRef, CGRectMake(0, 0, cols, rows), image.CGImage); CGContextRelease(contextRef);} These functions are 
included into the library starting from Version 2.4.6 of OpenCV. In order to use them, you should include the ios.h header file. #import "opencv2/highgui/ios.h" Let's consider a simple example that extracts edges from the image. In order to do so, you have to add the following code to the viewDidLoad()method: viewDidLoad() method:- (void)viewDidLoad{ [super viewDidLoad]; UIImage* image = [UIImage imageNamed:@"lena.png"]; // Convert UIImage* to cv::Mat UIImageToMat(image, cvImage); if (!cvImage.empty()) { cv::Mat gray; // Convert the image to grayscale cv::cvtColor(cvImage, gray, CV_RGBA2GRAY); // Apply Gaussian filter to remove small edges cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.2, 1.2); // Calculate edges with Canny cv::Mat edges; cv::Canny(gray, edges, 0, 50); // Fill image with white color cvImage.setTo(cv::Scalar::all(255)); // Change color on edges cvImage.setTo(cv::Scalar(0, 128, 255, 255), edges); // Convert cv::Mat to UIImage* and show the resulting image imageView.image = MatToUIImage(cvImage); }} Now run your application and check whether the application finds edges on the image correctly. How it works... Frameworks are intended to simplify the process of handling dependencies. They encapsulate header and binary files, so the Xcode sees them, and you don't need to add all the paths manually. Simply speaking, the iOS framework is just a specially structured folder containing include files and static libraries for different architectures (for example, armv7, armv7s, and x86). But Xcode knows where to search for proper binaries for each build configuration, so this approach is the simplest way to link external library on the iOS. All dependencies are handled automatically and added to the final application package. Usually, iOS applications are written in Objective-C language. Header files have a *.h extension and source files have *.m. Objective-C is a superset of C, so you can easily mix these languages in one file. But OpenCV is primarily written in C++, so we need to use C++ in the iOS project, and we need to enable support of Objective-C++. That's why we have set the language property to Objective-C++. Source files in Objective-C++ language usually have the *.mm extension. To include OpenCV header files, we use the #importdirective. It is very similar to #include in C++, while there is one distinction. It automatically adds guards for the included file, while in C++ we usually add them manually: #ifndef __SAMPLE_H__#define __SAMPLE_H__…#endif In the code of the example, we just convert the loaded image from a UIImage object to cv::Matby calling the UIImageToMat function. Please be careful with this function, because it entails a memory copy, so frequent calls to this function will negatively affect your application's performance. Please note that this is probably the most important performance tip—to be very careful while working with memory in mobile applications. Avoid memory reallocations and copying as much as possible. Images require quite large chunks of memory, and you should reuse them between iterations. For example, if your application has some pipeline, you should preallocate all buffers and use the same memory while processing new frames. After converting images, we do some simple image processing with OpenCV. First, we convert our image to the single-channel one. After that, we use the GaussianBlur filter to remove small details. Then we use the Canny method to detect edges in the image. 
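As a small illustration of the memory advice above, one common pattern is to keep the cv::Mat buffers as members of your view controller and reuse them for every frame rather than allocating new ones. This is only a sketch with assumed names — processFrame: and the member variables are not part of the recipe code.

// In the .mm file, keep reusable buffers as instance variables, for example:
// cv::Mat cvImage_, gray_, edges_;

- (void)processFrame:(UIImage*)frame
{
    // Reuses the preallocated members; cv::Mat only reallocates if the frame
    // size or type changes, so steady-state frames cost no new memory.
    UIImageToMat(frame, cvImage_);
    cv::cvtColor(cvImage_, gray_, CV_RGBA2GRAY);
    cv::Canny(gray_, edges_, 0, 50);
    // ... draw or display edges_ as needed ...
}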
To visualize results, we create a white image and change the color of the pixels that lie on detected edges. The resulting cv::Mat object is converted back to UIImage and displayed on the screen. There's more... The following is additional advice. Objective-C++ There is one more way to add support of Objective-C++ to your project. You should just change the extension of the source files to .mm where you plan to use C++ code. This extension is specific to Objective-C++ code. Converting to cv::Mat If you don't want to use UIImage, but want to load an image to cv::Mat directly, you can do it using the following code: // Create file handleNSFileHandle* handle = [NSFileHandle fileHandleForReadingAtPath:filePath];// Read content of the fileNSData* data = [handle readDataToEndOfFile];// Decode image from the data buffercvImage = cv::imdecode(cv::Mat(1, [data length], CV_8UC1, (void*)data.bytes), CV_LOAD_IMAGE_UNCHANGED); In this example we read the file content to the buffer and call the cv::imdecode function to decode the image. But there is one important note; if you later want to convert cv::Mat to the UIImage, you should change the channel order from BGR to RGB, as OpenCV's native image format is BGR. Summary This article explained how to link the OpenCV library and call any function from it. Resources for Article: Further resources on this subject: A quick start – OpenCV fundamentals [Article] Using Image Processing Techniques [Article] OpenCV: Segmenting Images [Article]
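Before leaving this recipe, one practical follow-up to the channel-order note in the There's more section: if you decode an image with cv::imdecode and later hand it back to MatToUIImage, swap BGR to RGB first. A minimal sketch, assuming cvImage holds a 3-channel BGR image:

cv::Mat rgbImage;
// OpenCV decodes to BGR by default; UIKit expects RGB, so swap the channels.
cv::cvtColor(cvImage, rgbImage, CV_BGR2RGB);
UIImage* uiImage = MatToUIImage(rgbImage);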
Report Authoring

Packt
16 Sep 2013
12 min read
(For more resources related to this topic, see here.) In this article, we will cover some fundamental techniques that will be used in your day-to-day life as a Report Studio author. In each recipe, we will take a real-life example and see how it can be accomplished. At the end of the article, you will learn several concepts and ideas which you can mix-and-match to build complex reports. Though this article is called Report Authoring Basic Concepts, it is not a beginner's guide or a manual. It expects the following: You are familiar with the Report Studio environment, components, and terminologies You know how to add items on the report page and open various explorers and panes You can locate the properties window and know how to test run the report Based on my personal experience, I will recommend this article to new developers with two days to two months of experience. In the most raw terminology, a report is a bunch of rows and columns. The aim is to extract the right rows and columns from the database and present them to the users. The selection of columns drive what information is shown in the report, and the selection of rows narrow the report to a specific purpose and makes it meaningful. The selection of rows is controlled by filters. Report Studio provides three types of filtering: detail , summary , and slicer. Slicers are used with dimensional models.). In the first recipe of this article, we will cover when and why to use the detail and summary filters. Once we get the correct set of rows by applying the filters, the next step is to present the rows in the most business-friendly manner. Grouping and ordering plays an important role in this. The second recipe will introduce you to the sorting technique for grouped reports. With grouped reports, we often need to produce subtotals and totals. There are various types of aggregation possible. For example, average, total, count, and so on. Sometimes, the nature of business demands complex aggregation as well. In the third recipe, you will learn how to introduce aggregation without increasing the length of the query. You will also learn how to achieve different aggregation for subtotals and totals. The fourth recipe will build upon the filtering concept you have learnt earlier. It will talk about implementing the if-then-elselogic in filters. Then we will see some techniques on data formatting, creating sections in a report, and hiding a column in a crosstab. Finally, the eighth and last recipe of this article will show you how to use prompt's Use Value and Display Value properties to achieve better performing queries. The examples used in all the recipes are based on the GO Data Warehouse (query) package that is supplied with IBM Cognos 10.1.1 installation. These recipe samples can be downloaded from the Packt Publishing website. They use the relational schema from the Sales and Marketing (query) / Sales (query) namespace. The screenshots used throughout this article are taken from Cognos Version 10.1.1 and 10.2. Summary filters and detail filters Business owners need to see the sales quantity of their product lines to plan their strategy. They want to concentrate only on the highest selling product for each product line. They would also like the facility to select only those orders that are shipped in a particular month for this analysis. In this recipe, we will create a list report with product line, product name, and quantity as columns. We will also create an optional filter on the Shipment Month Key. 
Also, we will apply correct filtering to bring up only the top selling product per product line. Getting ready Create a new list report based on the GO Data Warehouse (query) package. From the Sales (query) namespace, bring up Products / Product line , Products / Product , and Sales fact / Quantity as columns, the way it is shown in the following screenshot: How to do it... Here we want to create a list report that shows product line, product name, and quantity, and we want to create an optional filter on Shipment Month. The report should also bring up only the top selling product per product line. In order to achieve this, perform the following steps: We will start by adding the optional filter on Shipment Month. To do that, click anywhere on the list report on the Report page. Then, click on Filters from the toolbar. In the Filters dialog box, add a new detail filter. In the Create Filter screen, select Advanced and then click on OK as shown in the following screenshot: By selecting Advanced , we will be able to filter the data based on the fields that are not part of our list table like the Month Key in our example as you will see in the next step. Define the filter as follows: [Sales (query)].[Time (ship date)].[Month key (ship date)] = ?ShipMonth? Validate the filter and then click on OK. Set the usage to Optional as shown in the following screenshot: Now we will add a filter to bring only the highest sold product per product line. To achieve this, select Product line and Product (press Ctrl and select the columns) and click on the group button from the toolbar. This will create a grouping as shown in the following screenshot: Now select the list and click on the filter button again and select Edit Filters . This time go to the Summary Filters tab and add a new filter. In the Create Filter screen, select Advanced and then click on OK. Define the filter as follows: [Quantity] = maximum([Quantity] for [Product line]). Set usage to Required and set the scope to Product as shown in the following screenshot: Now run the report to test the functionality. You can enter 200401as the Month Key as that has data in the Cognos supplied sample. How it works... Report Studio allows you to define two types of filters. Both work at different levels of granularity and hence have different applications. The detail filter The detail filter works at the lowest level of granularity in a selected cluster of objects. In our example, this grain is the Sales entries stored in Sales fact . By putting a detail filter on Shipment Month, we are making sure that only those sales entries which fall within the selected month are pulled out. The summary filter In order to achieve the highest sold product per product line, we need to consider the aggregated sales quantity for the products. If we put a detail filter on quantity, it will work at sales entry level. You can try putting a detail filter of [Quantity] = maximum([Quantity]for[Productline])and you will see that it gives incorrect results. So, we need to put a summary filter here. In order to let the query engine know that we are interested in filtering sales aggregated at product level, we need to set the SCOPE to Product . This makes the query engine calculate [Quantity]at product level and then allows only those products where the value matches maximum([Quantity]for [Product line]). There's more... When you define multiple levels of grouping, you can easily change the scope of summary filters to decide the grain of filtering. 
For example, if you need to show only those products whose sales are more than 1000 and only those product lines whose sales are more than 25000, you can quickly put two summary filters for code with the correct Scope setting. Before/after aggregation The detail filter can also be set to apply after aggregation (by changing the application property). However, I think this kills the logic of the detail filter. Also, there is no control on the grain at which the filter will apply. Hence, Cognos sets it to before aggregation by default, which is the most natural usage of the detail filter. See also The Implementing if-then-else in filtering recipe Sorting grouped values The output of the previous recipe brings the right information back on the screen. It filters the rows correctly and shows the highest selling product per product line for the selected shipment month. For better representation and to highlight the best-selling product lines, we need to sort the product lines in descending order of quantity. Getting ready Open the report created in the previous recipe in Cognos Report Studio for further amendments. How to do it... In the report created in the previous recipe, we managed to show data filtered by the shipment month. To improve the reports look and feel, we will sort the output to highlight the best-selling products. To start this, perform the following steps: Open the report in Cognos Report Studio. Select the Quantity column. Click on the Sort button from the toolbar and choose Sort Descending . Run the report to check if sorting is working. You will notice that sorting is not working. Now go back to Report Studio, select Quantity , and click on the Sort button again. This time choose Edit Layout Sorting under the Other Sort Options header. Expand the tree for Product line . Drag Quantity from Detail Sort List to Sort List under Product line as shown in the following screenshot: Click on the OK button and test the report. This time the rows are sorted in descending order of Quantity as required. How it works... The sort option by default works at the detailed level. This means the non-grouped items are sorted by the specified criteria within their own groups. Here we want to sort the product lines that are grouped (not the detailed items). In order to sort the groups, we need to define a more advanced sorting using the Edit Layout Sorting options shown in this recipe. There's more... You can also define sorting for the whole list report from the Edit Layout Sorting dialog box. You can use different items and ordering for different groups and details. You can also choose to sort certain groups by the data items that are not shown in the report. You need to bring only those items from source (model) to the query, and you will be able to pick it in the sorting dialog. Aggregation and rollup aggregation Business owners want to see the unit cost of every product. They also want the entries to be grouped by product line and see the highest unit cost for each product line. At the end of the report, they want to see the average unit cost for the whole range. Getting ready Create a simple list report with Products / Product line , Products / Product , and Sales fact / Unit cost as columns. How to do it... In this recipe, we want to examine how to aggregate the data and what is meant by rollup aggregation. Using the new report that you have created, this is how we are going to start this recipe: We will start by examining the Unit cost column. 
Click on this column and check the Aggregate Function property. Set this property to Average . Add grouping for Product line and Product by selecting those columns and then clicking on the GROUP button from the toolbar. Click on the Unit cost column and then click on the Summarize button from the toolbar. Select the Total option from the list. Now, again click on the Summarize button and choose the Average option as shown in the following screenshot: The previous step will create footers as shown in the following screenshot: Now delete the line with the <Average (Unit cost)> measure from Product line . Similarly, delete the line with the <Unit cost> measure from Summary . The report should look like the following screenshot: Click on the Unit cost column and change its Rollup Aggregate Function property to Maximum . Run the report to test it. How it works... In this recipe, we have seen two properties of the data items related to aggregation of the values. The aggregation property We first examined the aggregation property of unit cost and ensured that it was set to average. Remember that the unit cost here comes from the sales table. The grain of this table is sales entries or orders. This means there will be many entries for each product and their unit cost will repeat. We want to show only one entry for each product and the unit cost needs to be rolled up correctly. The aggregation property determines what value is shown for unit cost when calculated at product level. If it is set to Total , it will wrongly add up the unit costs for each sales entry. Hence, we are setting it to Average . It can be set to Minimum or Maximum depending on business requirements. The rollup aggregation property In order to show the maximum unit cost for product type, we create an aggregate type of footer in step 4 and set the Rollup Aggregation to Maximum in step 8. Here we could have directly selected Maximum from the Summarize drop-down toolbox. But that creates a new data item called Maximum (Unit Cost) . Instead, we ask Cognos to aggregate the number in the footer and drive the type by rollup aggregation property. This will reduce one data item in query subject and native SQL. Multiple aggregation We also need to show the overall average at the bottom. For this we have to create a new data item. Hence, we select unit cost and create an Average type of aggregation in step 5. This calculates the Average (Unit Cost) and places it on the product line and in the overall footer. We then deleted the aggregations that are not required in step 7. There's more... The rollup aggregation of any item is important only when you create the aggregation of Aggregate type. When it is set to automatic, Cognos will decide the function based on the data type, which is not preferred. It is good practice to always set the aggregation and rollup aggregation to a meaningful function rather than leaving them as automatic.
Creating a sample web scraper

Packt
11 Sep 2013
10 min read
(For more resources related to this topic, see here.) As web scrapers like to say: "Every website has an API. Sometimes it's just poorly documented and difficult to use." To say that web scraping is a useful skill is an understatement. Whether you're satisfying a curiosity by writing a quick script in an afternoon or building the next Google, the ability to grab any online data, in any amount, in any format, while choosing how you want to store it and retrieve it, is a vital part of any good programmer's toolkit. By virtue of reading this article, I assume that you already have some idea of what web scraping entails, and perhaps have specific motivations for using Java to do it. Regardless, it is important to cover some basic definitions and principles. Very rarely does one hear about web scraping in Java—a language often thought to be solely the domain of enterprise software and interactive web apps of the 90's and early 2000's. However, there are many reasons why Java is an often underrated language for web scraping: Java's excellent exception-handling lets you compile code that elegantly handles the often-messy Web Reusable data structures allow you to write once and use everywhere with ease and safety Java's concurrency libraries allow you to write code that can process other data while waiting for servers to return information (the slowest part of any scraper) The Web is big and slow, but the Java RMI allows you to write code across a distributed network of machines, in order to collect and process data quickly There are a variety of standard libraries for getting data from servers, as well as third-party libraries for parsing this data, and even executing JavaScript (which is needed for scraping some websites) In this article, we will explore these, and other benefits of Java in web scraping, and build several scrapers ourselves. Although it is possible, and recommended, to skip to the sections you already have a good grasp of, keep in mind that some sections build up the code and concepts of other sections. When this happens, it will be noted in the beginning of the section. How is this legal? Web scraping has always had a "gray hat" reputation. While websites are generally meant to be viewed by actual humans sitting at a computer, web scrapers find ways to subvert that. While APIs are designed to accept obviously computer-generated requests, web scrapers must find ways to imitate humans, often by modifying headers, forging POST requests and other methods. Web scraping often requires a great deal of problem solving and ingenuity to figure out how to get the data you want. There are often few roadmaps or tried-and-true procedures to follow, and you must carefully tailor the code to each website—often riding between the lines of what is intended and what is possible. Although this sort of hacking can be fun and challenging, you have to be careful to follow the rules. Like many technology fields, the legal precedent for web scraping is scant. A good rule of thumb to keep yourself out of trouble is to always follow the terms of use and copyright documents on websites that you scrape (if any). There are some cases in which the act of web crawling is itself in murky legal territory, regardless of how the data is used. Crawling is often prohibited in the terms of service of websites where the aggregated data is valuable (for example, a site that contains a directory of personal addresses in the United States), or where a commercial or rate-limited API is available. 
Twitter, for example, explicitly prohibits web scraping (at least of any actual tweet data) in its terms of service: "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited" Unless explicitly prohibited by the terms of service, there is no fundamental difference between accessing a website (and its information) via a browser, and accessing it via an automated script. The robots.txt file alone has not been shown to be legally binding, although in many cases the terms of service can be. Writing a simple scraper (Simple) Wikipedia is not just a helpful resource for researching or looking up information but also a very interesting website to scrape. They make no efforts to prevent scrapers from accessing the site, and, with a very well-marked-up HTML, they make it very easy to find the information you're looking for. In this project, we will scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article. Getting ready It is recommended that you have some working knowledge of Java, and the ability to create and execute Java programs at this point. As an example, we will use the article from the following Wikipedia link: http://en.wikipedia.org/wiki/Java Note that this article is about the Indonesian island nation Java, not the programming language. Regardless, it seems appropriate to use it as a test subject. We will be using the jsoup library to parse HTML data from websites and convert it into a manipulatable object, with traversable, indexed values (much like an array). In this xercise, we will show you how to download, install, and use Java libraries. In addition, we'll also be covering some of the basics of the jsoup library in particular. How to do it... Now that we're starting to get into writing scrapers, let's create a new project to keep them all bundled together. Carry out the following steps for this task: Open Eclipse and create a new Java project called Scraper. Packages are still considered to be handy for bundling collections of classes together within a single project (projects contain multiple packages, and packages contain multiple classes). You can create a new package by highlighting the Scraper project in Eclipse and going to File | New | Package . By convention, in order to prevent programmers from creating packages with the same name (and causing namespace problems), packages are named starting with the reverse of your domain name (for example, com.mydomain.mypackagename). For the rest of the article, we will begin all our packages with com.packtpub.JavaScraping appended with the package name. Let's create a new package called com.packtpub.JavaScraping.SimpleScraper. Create a new class, WikiScraper, inside the src folder of the package. Download the jsoup core library, the first link, from the following URL: http://jsoup.org/download Place the .jar file you downloaded into the lib folder of the package you just created. In Eclipse, right-click in the Package Explorer window and select Refresh. This will allow Eclipse to update the Package Explorer to the current state of the workspace folder. Eclipse should show your jsoup-1.7.2.jar file (this file may have a different name depending on the version you're using) in the Package Explorer window. Right-click on the jsoup JAR file and select Build Path | Add to Build Path. 
In your WikiScraper class file, write the following code: package com.packtpub.JavaScraping.SimpleScraper; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import java.net.*; import java.io.*; public class WikiScraper { public static void main(String[] args) { scrapeTopic("/wiki/Python"); } public static void scrapeTopic(String url){ String html = getUrl("http://www.wikipedia.org/"+url); Document doc = Jsoup.parse(html); String contentText = doc.select("#mw-content-text > p").first().text(); System.out.println(contentText); } public static String getUrl(String url){ URL urlObj = null; try{ urlObj = new URL(url); } catch(MalformedURLException e){ System.out.println("The url was malformed!"); return ""; } URLConnection urlCon = null; BufferedReader in = null; String outputText = ""; try{ urlCon = urlObj.openConnection(); in = new BufferedReader(new InputStreamReader(urlCon.getInputStream())); String line = ""; while((line = in.readLine()) != null){ outputText += line; } in.close(); }catch(IOException e){ System.out.println("There was an error connecting to the URL"); return ""; } return outputText; } } Assuming you're connected to the internet, this should compile and run with no errors, and print the first paragraph of text from the article. How it works... Unlike our HelloWorld example, there are a number of libraries needed to make this code work. We can incorporate all of these using the import statements before the class declaration. There are a number of jsoup objects needed, along with two Java libraries, java.io and java.net , which are needed for creating the connection and retrieving information from the Web. As always, our program starts out in the main method of the class. This method calls the scrapeTopic method, which will eventually print the data that we are looking for (the first paragraph of text in the Wikipedia article) to the screen. scrapeTopic requires another method, getURL, in order to do this. getUrl is a function that we will be using throughout the article. It takes in an arbitrary URL and returns the raw source code as a string. Essentially, it creates a Java URL object from the URL string, and calls the openConnection method on that URL object. The openConnection method returns a URLConnection object, which can be used to create a BufferedReader object. BufferedReader objects are designed to read from, potentially very long, streams of text, stopping at a certain size limit, or, very commonly, reading streams one line at a time. Depending on the potential size of the pages you might be reading (or if you're reading from an unknown source), it might be important to set a buffer size here. To simplify this exercise, however, we will continue to read as long as Java is able to. The while loop here retrieves the text from the BufferedReader object one line at a time and adds it to outputText, which is then returned. After the getURL method has returned the HTML string to scrapeTopic, jsoup is used. jsoup is a Java library that turns HTML strings (such as the string returned by our scraper) into more accessible objects. There are many ways to access and manipulate these objects, but the function you'll likely find yourself using most often is the select function. The jsoup select function returns a jsoup object (or an array of objects, if more than one element matches the argument to the select function), which can be further manipulated, or turned into text, and printed. 
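Since select can match more than one element, the multi-element case is worth seeing as well. The following sketch is a hypothetical variant of scrapeTopic, not part of the project code; it prints the first few paragraphs instead of only the first one.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void printFirstParagraphs(String html, int count) {
    Document doc = Jsoup.parse(html);
    // select returns an Elements collection; iterate instead of calling first().
    Elements paragraphs = doc.select("#mw-content-text > p");
    int printed = 0;
    for (Element paragraph : paragraphs) {
        System.out.println(paragraph.text());
        if (++printed >= count) {
            break;
        }
    }
}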
The crux of our script can be found in this line: String contentText = doc.select("#mw-content-text > p").first().text(); This finds all the elements that match #mw-content-text > p (that is, all p elements that are the children of the elements with the CSS ID mw-content-text), selects the first element of this set, and turns the resulting object into plain text (stripping out all tags, such as <a> tags or other formatting that might be in the text). The program ends by printing this line out to the console. There's more... Jsoup is a powerful library that we will be working with in many applications throughout this article. For uses that are not covered in the article, I encourage you to read the complete documentation at http://jsoup.org/apidocs/. What if you find yourself working on a project where jsoup's capabilities aren't quite meeting your needs? There are literally dozens of Java-based HTML parsers on the market. Summary Thus in this article we took the first step towards web scraping with Java, and also learned how to scrape an article from Wikipedia and retrieve the first few lines of text from the body of the article. Resources for Article : Further resources on this subject: Making a simple cURL request (Simple) [Article] Web Scraping with Python [Article] Generating Content in WordPress Top Plugins—A Sequel [Article]  
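One last jsoup convenience worth noting before leaving this scraper: jsoup can also fetch the page itself, which makes the hand-rolled getUrl method optional. The sketch below shows the same scrape using jsoup's built-in HTTP client; the user-agent string is just an example and not part of the original project.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void scrapeTopicWithJsoup(String url) throws IOException {
    // Jsoup.connect performs the HTTP request and parses the response in one step.
    Document doc = Jsoup.connect("http://www.wikipedia.org" + url)
            .userAgent("Mozilla/5.0 (scraper example)")
            .get();
    String contentText = doc.select("#mw-content-text > p").first().text();
    System.out.println(contentText);
}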
RavenDB.NET Client API

Packt
11 Sep 2013
12 min read
(For more resources related to this topic, see here.) The RavenDB .NET Client API RavenDB provides a set of .NET client libraries for interacting with it, which can be accessed from any .NET-based programming languages. By using these libraries, you can manage RavenDB, construct requests, send them to the server, and receive responses. The .NET Client API exposes all aspects of the RavenDB and allows developers to easily integrate RavenDB services into their own applications. The .NET Client API is involved with loading and saving the data, and it is responsible for integrating with the .NET System.Transactions namespace. With System.Transactions, there is already a general way to work transactionally across different resources. Also the .NET Client API is responsible for batching requests to the server, caching, and more. The .NET Client API is easy to use and uses several conventions to control how it works. These conventions can be modified by the user to meet its application needs. It is not recommended to use System.Transactions. In fact, it is recommended to avoid it. System.Transactions is supported in RavenDB mostly to allow integration while using collaboration tools between services for example, NServiceBus. Setting up your development environment Before you can start developing an application for RavenDB, you must first set up your development environment. This setup involves installing the following software packages on your development computer: Visual Studio 2012 (required) RavenDB Client (required) NuGet Package Manager (optional) You may download and install the latest Visual Studio version from the Microsoft website: http://msdn.microsoft.com/En-us/library/vstudio/e2h7fzkw.aspx In order to use RavenDB in your own .NET application, you have to add a reference to the Raven.Client.Lightweight.dll and the Raven.Abstractions.dll files, (which you can find in the ~Client folder of the distribution package) into your Visual Studio project. The easiest way to add a reference to these DLLs is by using the NuGet Package Manager, which you may use to add the RavenDB.Client package. NuGet Package Manager is a Visual Studio extension that makes it easy to add, remove, and update libraries and tools in Visual Studio projects that use the .NET framework. You can find more information on the NuGet Package Manager extension by visiting the official website at http://nuget.org/. Time for action – installing NuGet Package Manager The NuGet Package Manager extension is the easiest way to add the RavenDB Client library to a Visual Studio project. If you do not have NuGet Package Manager already installed, you can install it as follows: Start Visual Studio. From the TOOLS menu, click on Extensions and Updates.... In the Extensions and Updates... dialog, click on Online. If you don’t see NuGet Package Manager, type nuget package manager in the search box. Select the NuGet Package Manager extension and click on Download. After the download completes, you will be prompted to install the package. After the installation completes, you might be prompted to restart Visual Studio. What just happened? We installed the NuGet Package Manager extension to Visual Studio, which you will use to add RavenDB Client to your Visual Studio project. Creating a simple application Let’s write a simple application to learn how to interact with RavenDB. Time for action – adding RavenDB Client to a Visual Studio project Let’s go ahead and create a new Visual Studio project. 
You will add the RavenDB Client using the Nuget Package Manager extension to this project to be able to connect to RavenDB and begin using it: Start Visual Studio and click on New Project from the Start page or from the File menu, navigate to New and then Project. In the Templates pane, click on Installed and then click on Templates and expand the Visual C# node. Under Visual C#, click on Windows. In the list of project templates, select Console Application. Name the project as RavenDB_Ch03 and click on OK. From the TOOLS menu, click on Library Package Manager. If you do not see this menu item, make sure that the NuGet Package Manager extension has been installed correctly. Click on Manage NuGet Packages for Solution.... In the Manage NugGet Packages dialog, select Online. In the search box, type ravendb. Select the package named RavenDB Client. Click on Install and accept the license. After the package installs, click on Close to close the dialog box. What just happened? We created the RavenDB_Ch03 Visual Studio project and added the RavenDB Client to get connected to the RavenDB server. Once the RavenDB Client is installed, by expanding the References node of your project in Visual Studio, you can see the RavenDB DLLs (Raven.Abstractions, Raven.Client.Lightweight) added automatically to your project by the Nuget Package Manager extension. You should ensure that the RavenDB Client version matches the server version you are running. This can lead to some really frustrating runtime errors when the versions don’t match. You can also install RavenDB Client using the Package Manager console (Visual Studio | Tools | Library Package Manager | Package Manager Console). To install the latest RavenDB Client, run the following command in the Package Manager console: Install-Package RavenDB.Client or Install-Package RavenDB.Client–Version {version number} to add a specific version. Connecting to RavenDB To begin using RavenDB we need to get a new instance of the DocumentStore object, which points to the server and acts as the main communication channel manager. Once a new instance of DocumentStore has been created, the next step is to create a new session against that Document Store. The session object is responsible for implementing the Unit of Work pattern. A Unit of Work keeps track of everything you do during a business transaction that can affect the database. When you’re done, it figures out everything that needs to be done to alter the database as a result of your work. The session object is used to interact with RavenDB and provides a fully transactional way of performing database operations, and allows the consumer to store data into the database, and load it back when necessary using queries or by document ID. In order to perform an action against the RavenDB store, we need to ensure that we have an active and valid RavenDB session before starting the action, and dispose it off once it ends. Basically, before disposing of the session, we will call the SaveChanges() method on the session object to ensure that all changes will be persisted. To create a new RavenDB session, we will call the OpenSession() method on the DocumentStore object, which will return a new instance of the IDocumentSession object. Time for action – connecting to RavenDB Your Visual Studio project now has a reference to all needed RavenDB Client DLLs and is ready to get connected to the RavenDB server. 
You will add a new class and necessary code to create a communication channel with the RavenDB server: Open the RavenDB_Ch03 project. Modify the Main() method (in Program.cs class) to look like the following code snippet: Press Ctrl + Shift + S to save all files in the solution. What just happened? We just added the DocumentStore object initialization code to the Main() method of the RavenDB_Ch03 project. Within the Main() method, you create a new instance of the DocumentStore object, which is stored in the DocumentStore variable. The DocumentStore object basically acts as the connection manager and must be created only once per application. It is important to point out that this is a heavyweight and thread safe object. Note that, when you create many of these your application will run slow and will have a larger memory footprint. To create an instance of the DocumentStore object, you need to specify the URL that points to the RavenDB server. The URL has to include the TCP port number (8080 is the default port number) on which RavenDB server is listening (line 17 in the previous code snippet). In order to point to the Orders database, you set the value of DefaultDatabase to Orders, which is the name of our database (line 18). To get the new instance of the IDocumentStore object, you have to call the Initialize() method on the DocumentStore object. With this instance, you can establish the connection to the RavenDB server (line 19). The whole DocumentStore initialization code is surrounded by a using statement. This is used to specify when resources should be released. Basically, the using statement calls the Dispose() method of each object for which the resources should be released. Interacting with RavenDB using the .NET Client API RavenDB stores documents in JSON format. It converts the .NET objects into their JSON equivalent when storing documents, and back again by mapping the .NET object property names to the JSON property names and copies the values when loading documents. This process that makes writing and reading data structures to and from a document file extremely easy is called Serialization and Deserialization. Interacting with RavenDB is very easy. You might create a new DocumentStore object and then open a session, do some operations, and finally apply the changes to the RavenDB server. The session object will manage all changes internally, but changes will be committed to the underlying document database only when the SaveChanges() method is called. This is important to note because all changes to the document will be discarded if this method is not invoked. RavenDB is safe by default. This unique feature means that the database is configured to stop users querying for large amount of data. It is never a good idea to have a query that returns thousands of records, which is inefficient and may take a long time to process. By default, RavenDB limits the number of documents on the client side to 128 (this is a configurable option) if you don't make use of the Take() method on the Query object. Basically, if you need to get data beyond the 128 documents limit, you need to page your query results (by using Take() and Skip() methods). RavenDB also has a very useful option to stop you from creating the dreaded SELECT N+1 scenario—this feature stops after 30 requests are sent to the server per session, (which also is a configurable option). The recommended practice is to keep the ratio of SaveChanges() calls to session instance at 1:1. 
Reaching the limit of 30 remote calls while maintaining the 1:1 ratio is typically a symptom of a significant N+1 problem. To retrieve or store documents, you might create a class type to hold your data document and use the session instance to save or load that document, which will be automatically serialized to JSON. Loading a document To load a single document from a RavenDB database, you need to invoke the Load() method on the Session object you got from the DocumentStore instance. Then you need to specify the type of the document you will load from the RavenDB database. Time for action – loading a document You will create the Order class to hold the order data information and will add the LoadOrder()method, which you will call to load an Order document from the RavenDB server: Open the RavenDB_Ch03 project, add a new class and name it Order. Add the following code snippet to the Order class: Add the DisplayOrder() method to the Program class using the following code snippet: Add the Load<Order>() method to the Program class using the following code snippet: Save all the files and press F6 to build the solution. Switch to Windows Explorer and go to the RavenDB installation folder and launch RavenDB server using the Start.cmd command file. Return to Visual Studio, once the RavenDB server is running, press F5 to run RavenDB_Ch03 to see the document information in the output console window. What just happened? You just wrote the necessary code to load your first document from the RavenDB server. Let’s take a look at the code you added to the RavenDB_Ch03 project. You added the Order class to the project. This class will hold the data information for the Order document you will load from the server. It contains six fields (lines 11 to 16 in the previous code snippet) that will be populated with values from the JSON Order document stored on the server. By adding a field named Id, RavenDB will automatically populate this field with the document ID when the document is retrieved from the server. You added the DisplayOrder() method to the Program class. This method is responsible for displaying the Order field’s values to the output window. You also added the Load<Order>() method (lines 26 to 30) to the Program class. This method is surrounded by a using statement to ensure that the resources will be disposed at the end of execution of this method. To open a RavenDB session you call the OpenSession() method on the IDocumentStore object. This session is handled by the session variable (line 26). The Load() method is a generic method. It will load a specific entity with a specific ID, which you need to provide when calling the method. So in the calling code to the Load() method (line 28), you provide the Order entity and the document ID that we want to retrieve from the server which is Orders/A179854. Once the Order document with the Id field as Orders/A179854 is retrieved from the server, you send it to the DisplayOrder() method to be displayed (line 29). Finally, you build and run the RavenDB_Ch03 project. Have a go hero – loading multiple documents You know now how to load any single document from the RavenDB server using the Load() method. What if you have more than one document to load? You can use the Load() method to load more than one document in a single call. It seems easy to do. Well give it a go! Arrays are very useful!
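Two gaps are worth closing here. First, the Main() code snippet referred to in the Time for action – connecting to RavenDB steps did not survive into this text; based on the line-by-line description that follows it (a URL on port 8080, DefaultDatabase set to Orders, and a call to Initialize()), it would look roughly like the sketch below. Second, the sketch also answers the Have a go hero exercise by passing several document IDs to Load() in a single call. Treat this as a reconstruction, not the book's exact listing; the second document ID is made up for illustration.

using System;
using Raven.Client;
using Raven.Client.Document;

class Program
{
    static void Main(string[] args)
    {
        // Heavyweight, thread-safe object: create it only once per application.
        using (IDocumentStore documentStore = new DocumentStore
        {
            Url = "http://localhost:8080",
            DefaultDatabase = "Orders"
        }.Initialize())
        {
            using (IDocumentSession session = documentStore.OpenSession())
            {
                // Loading several documents in a single round trip to the server.
                // The second ID is hypothetical and only shown for illustration.
                Order[] orders = session.Load<Order>("Orders/A179854", "Orders/A179855");
                foreach (Order order in orders)
                {
                    if (order != null)
                    {
                        Console.WriteLine(order.Id);
                    }
                }
            }
        }
    }
}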
Out-of-process distributed caching

Packt
06 Sep 2013
7 min read
(For more resources related to this topic, see here.) Getting ready Out-of-process caching is a way of distributing your caching needs in a different JVM and/or infrastructure. Ehcache provides a convenient deployable WAR file that works on most web containers/standalone servers whose mission is to provide an easy API interface to distributed cache. At the moment of writing, you can download it from http://sourceforge.net/projects/ehcache/files/ehcache-server/, or you can include it in your Maven POM and will be delivered as a WAR file. The cache server requires no special configuration on the Tomcat container. However, if you are running GlassFish, Jetty, WebLogic, or any other application server (or servlet container), there are minimal configuration changes to do. Please refer to the Ehcache cache server documentation for details on these. While using the RESTful interface, it is important to note that you have three ways to set the MIME type for exchanging data back and forth to the cache server, namely Text/XML, application/JSON, and application/x-java-serialized-object. You can use any programming language to invoke the web service interface and cache your objects (except for application/x-java-serialized-object for obvious reasons). Refer to the recipe8 project directory within the source code bundle for a fully working sample of this recipe content and further information related to this topic. How to do it... Add Ehcache and Ehcache cache server dependencies to your POM.xml file. <dependency> <groupId>net.sf.ehcache</groupId> <artifactId>ehcache-server</artifactId> <version>1.0.0</version> <type>war</type> </dependency> <dependency> <groupId>net.sf.ehcache</groupId> <artifactId>ehcache</artifactId> <version>2.6.0</version> <type>pom</type> </dependency> Edit ehcache.xml in the cache server to hold your cache setup (the cache name is very important).You can find this file here: ${CACHE_SERVER}/WEB-INF/classes/ehcache.xml. <?xml version="1.0" encoding="UTF-8"?> <ehcache xsi_noNamespaceSchemaLocation="ehcache.xsd" updateCheck="true" monitoring="autodetect" dynamicConfig="true"> <!-- Set cache eternal (of course not to do in production) --> <cache name="remoteCache" maxElementsInMemory="10000" eternal="true" diskPersistent="true" overflowToDisk="true"/> ... Disable the SOAP interface in the cache server web.xml (since we are going to use RESTful) file:You can find this file here: ${CACHE_SERVER}/WEB-INF/web.xml. <?xml version="1.0" encoding="UTF-8"?> <web-app xsi_schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5"> ... <!-- SOAP Servlet Comment out (or remove) to disable SOAP Web Services <servlet> <servlet-name>EhcacheWebServiceEndpoint</servlet-name> <servlet-class>com.sun.xml.ws.transport.http.servlet.WSServlet </servlet-class> <load-on-startup>1</load-on-startup> </servlet> <servlet-mapping> <servlet-name>EhcacheWebServiceEndpoint</servlet-name> <url-pattern>/soap/EhcacheWebServiceEndpoint</url-pattern> </servlet-mapping> <session-config> <session-timeout>60</session-timeout> </session-config> <listener> <listener-class> com.sun.xml.ws.transport.http.servlet.WSServletContextListener </listener-class> </listener> --> ... Make your objects-to-be-cached serializable: import java.io.Serializable; public final class Item implements Serializable { Invoke the RESTful (or SOAP) interface to save/retrieve/delete cached objects: ... 
   public void saveItemInCache(String key, Serializable item) {
     //sample URL: http://localhost:8080/ehcache/rest/cacheName/{id}
     //here cacheName is the cache name you set up in the cache server ehcache.xml
     String url = CACHE_SERVER_URL + "cacheName" + "/" + key;
     //initialize Apache HTTP Client
     client = new DefaultHttpClient();
     //create the cache Element to be sent
     Element element = new Element(key, item);
     //serialize the object to be sent to the Ehcache server
     byte[] itemToByteArray = SerializationUtils.serialize(element);
     //create the PUT request
     HttpPut putRequest = new HttpPut(url);
     //set the header to the java-serialized-object MIME type
     putRequest.setHeader("Content-Type", "application/x-java-serialized-object");
     ...

How it works...

The Ehcache cache server utility is a versatile tool that lets us distribute cache engines in a very flexible way. It provides a very simple API exposure via RESTful or SOAP-based web services.

We start by editing the ehcache.xml configuration file within the cache server application, adding a cache that we would like to use for our cached objects:

   ...
   <!-- Set cache eternal (of course not to do in production) -->
   <cache name="remoteCache"
          maxElementsInMemory="10000"
          eternal="true"
          diskPersistent="true"
          overflowToDisk="true"/>
   ...

The cache name defined here is very important because it will be the endpoint of our RESTful URL pattern that the cache server will identify and use.

Then, we need to edit the web.xml file within the cache server application (located in ${CACHE_SERVER}/WEB-INF/) in order to comment out (or completely remove) the service definitions that we are not going to use (that is, SOAP if you are using RESTful, or vice versa):

   <!-- SOAP Servlet
        Comment out to disable SOAP Web Services
   <servlet>
     <servlet-name>EhcacheWebServiceEndpoint</servlet-name>
     <servlet-class>com.sun.xml.ws.transport.http.servlet.WSServlet</servlet-class>
     <load-on-startup>1</load-on-startup>
   </servlet>
   ...

In order to cache an object (especially a Java object), we need to make it serializable simply by implementing the Serializable interface (this is not a requirement for MIME types other than application/x-java-serialized-object):

   import java.io.Serializable;

   public final class Item implements Serializable {

Finally, we invoke the RESTful endpoint from our code to store/retrieve/delete the object to/from the cache layer:

   //sample URL: http://localhost:8080/ehcache/rest/cacheName/{id}
   //here cacheName is the cache name you set up in the cache server ehcache.xml
   String url = CACHE_SERVER_URL + "cacheName" + "/" + key;
   //set the header to the JSON MIME type
   putRequest.setHeader("Content-Type", "application/json");

It is important to note here that the cacheName URL parameter represents the cache name you defined in the ehcache.xml configuration file in the cache server application. You have defined your cache name as follows:

   <!-- Set cache eternal (of course not to do in production) -->
   <cache name="remoteCache"
          maxElementsInMemory="10000"
   ...

Now, your URL would be something like this:

   //sample URL: http://localhost:8080/ehcache/rest/remoteCache/{id}

Here, id is just the key value you assign to the cached object. Finally, you just use any HTTP/SOAP client library (or the default Java Net API classes) to invoke the web service. In the case of RESTful services, you need to be aware that the HTTP method sent determines whether you are storing, updating, retrieving, or deleting a cached item.
They are as follows:

GET /{cache}/{element}: This retrieves an object by its key from the O-O-P cache layer.
PUT /{cache}/{element}: This stores an item in the O-O-P cache layer.
DELETE /{cache}/{element}: This deletes an item from the O-O-P cache layer.
HEAD /{cache}/{element}: This retrieves metadata (cache configuration values) from the O-O-P cache layer.
OPTIONS /{cache}/{element}: This returns the WADL describing the operations.

(A minimal retrieval sketch built on these verbs appears at the end of this article.)

For changing the context, you can edit the file ${CACHE_SERVER}/META-INF/context.xml and put in your desired context name. As for security, look for the file ${CACHE_SERVER}/WEB-INF/server_security_config.xml_rename_to_activate and open it to read the instructions.

Summary

This article provided details on implementing distributed caching using the Ehcache server, and also explained in brief what out-of-process caching is.

Resources for Article:

Further resources on this subject:
PostgreSQL 9: Reliable Controller and Disk Setup [Article]
Play Framework: Data Validation Using Controllers [Article]
Building Applications with Spring Data Redis [Article]
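As referenced in the verb list above, reading an entry back is simply a GET against the same URL pattern used by saveItemInCache(). The following method is only a sketch and not part of the original recipe: it assumes the value was stored as a java-serialized net.sf.ehcache.Element (as in the PUT example), that CACHE_SERVER_URL and the remoteCache cache name are the ones configured earlier, and that Apache HttpClient 4.x plus Commons Lang are on the classpath.

   import net.sf.ehcache.Element;
   import org.apache.commons.lang3.SerializationUtils; //use commons-lang 2.x instead if that is what your project bundles
   import org.apache.http.HttpResponse;
   import org.apache.http.client.HttpClient;
   import org.apache.http.client.methods.HttpGet;
   import org.apache.http.impl.client.DefaultHttpClient;
   import org.apache.http.util.EntityUtils;

   public Object readItemFromCache(String key) throws Exception {
     //same URL pattern as in saveItemInCache(); remoteCache is the cache defined in ehcache.xml
     String url = CACHE_SERVER_URL + "remoteCache" + "/" + key;
     HttpClient client = new DefaultHttpClient();
     HttpGet getRequest = new HttpGet(url);
     //we stored a java-serialized Element, so we ask for the same representation back
     getRequest.setHeader("Accept", "application/x-java-serialized-object");
     HttpResponse response = client.execute(getRequest);
     if (response.getStatusLine().getStatusCode() != 200) {
       //treat anything other than 200 OK as a cache miss
       return null;
     }
     byte[] payload = EntityUtils.toByteArray(response.getEntity());
     Element element = (Element) SerializationUtils.deserialize(payload);
     return element.getObjectValue();
   }

Deleting an entry works the same way with HttpDelete in place of HttpGet.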

Recursive queries

Packt
06 Sep 2013
5 min read
(For more resources related to this topic, see here.)

XML is one solution for dealing with hierarchical data, but it isn't the most natural fit for the relational database. Instead, you often wind up with nested categories, filesystem-like folder hierarchies, or links back to older records. A popular way to structure a relational database for data with this shape is a self-reference, a link back to a parent object in the same table. For instance, you might want to model categories that can have subcategories with arbitrary nesting. A simple table might look like this:

CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR,
    parent_id INTEGER REFERENCES categories);

What makes this structure recursive is that self-referencing parent_id column, which refers to the table we're defining from within its own definition. We would treat categories with a NULL value for the parent_id column as top-level categories, categories that do not belong within any other category.

To get a feel for the kind of data, let's put a few categories in there for an online retailer. Say we sell shirts and books, and books we further divide into fiction and non-fiction, and then we'll put programming books inside non-fiction. It might look like the following:

INSERT INTO categories (id, name, parent_id) VALUES
(1, 'Shirts', NULL),
(2, 'Books', NULL),
(3, 'Fiction', 2),
(4, 'Non-fiction', 2),
(5, 'Programming', 4);

Usually you won't put these in manually like this, but you can see that Fiction and Non-fiction are children of the Books category because their parent_id value is 2 (the id of Books, which itself has a NULL parent_id), and Programming has a parent_id of 4, which is the id of Non-fiction.

Suppose you want to make navigation breadcrumbs given a category. You need to look up that category and then you need to look up the parent category. You need to do this until the parent category is NULL. In a procedural pseudocode, the process might look like the following:

def write_breadcrumbs(category):
    row = getOne("SELECT * FROM categories WHERE id = ?", category)
    while row['parent_id'] != NULL:
        write(row['name'])
        write(' > ')
        row = getOne("SELECT * FROM categories WHERE id = ?", row['parent_id'])

This kind of solution leads to the N+1 query problem. There's an action you want to take for a particular value, but to take that action you have to run an arbitrary number of separate queries. Recursive queries in PostgreSQL provide you with a way to have the database do the heavy lifting instead (a single-query version of this breadcrumb lookup is shown at the end of this article). Because we'll be using recursion in the SQL, let's first see what a recursive formulation would look like in our pseudocode:

def write_breadcrumbs(category):
    row = getOne("SELECT * FROM categories WHERE id = ?", category)
    write(row['name'])
    if row['parent_id'] != NULL:
        write(' > ')
        write_breadcrumbs(row['parent_id'])

It's debatable whether this is better code; it's shorter and it has fewer bugs, but it also might expose the developer to the possibility of stack overflows. Recursive functions always have a similar structure though: some number of base cases that do not call the function recursively, and some number of inductive cases that work by calling the same function again on slightly different data. The final destination will be one of the base cases. The inductive cases will peel off a small piece of the problem and solve it, then delegate the rest of the work to an invocation of the same function. PostgreSQL's recursive query support works with something called common table expressions (CTEs).
The idea is to make a named alias for a query. We won't delve into the details too much here, but all recursive queries will have the same basic structure:

WITH RECURSIVE <recursive-query-name> AS (
    SELECT <base-case> FROM table
    UNION ALL
    SELECT <inductive-case> FROM table JOIN <recursive-query-name> ON <join-condition>
)
SELECT * FROM <recursive-query-name>;

For an example, let's get all the categories above Programming. The base case will be the Programming category itself. The inductive case will be to find the parent of a category we've already seen:

WITH RECURSIVE programming_parents AS (
    SELECT * FROM categories WHERE id = 5
    UNION ALL
    SELECT categories.*
    FROM categories
    JOIN programming_parents ON programming_parents.parent_id = categories.id
)
SELECT * FROM programming_parents;

This works as we'd hope (figure: output of a simple recursive query).

Without using this trick we'd have to do three separate queries to get this information, but with the trick it will always take one query no matter how deeply nested the categories are. We can also go in the other direction and build up something like a tree underneath a category by searching for categories whose parent is a category we've already seen. We can make the hierarchy more explicit by building up a path as we go:

WITH RECURSIVE all_categories AS (
    SELECT *, name AS path FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.*, p.path || '/' || c.name
    FROM categories AS c
    JOIN all_categories p ON p.id = c.parent_id
)
SELECT * FROM all_categories;

This finds all the categories with their path. We can be more discriminating with the search by looking for a particular category as the starting value in the base case:

WITH RECURSIVE all_categories AS (
    SELECT *, name AS path FROM categories WHERE id = 2
    UNION ALL
    SELECT c.*, p.path || '/' || c.name
    FROM categories AS c
    JOIN all_categories p ON p.id = c.parent_id
)
SELECT * FROM all_categories;

This returns the category tree rooted at Books. This is hugely useful with hierarchical data.

Summary

In this article we learned about recursive queries, one of the most important features of PostgreSQL.

Resources for Article:

Further resources on this subject:
Introduction to PostgreSQL 9 [Article]
PostgreSQL: Tips and Tricks [Article]
Obtaining a binary backup [Article]
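As promised above, the N+1 breadcrumb loop can be replaced by a single recursive query. The following statement is only a sketch based on the categories table and sample rows defined earlier (the separator and the explicit cast are my additions); it walks upward from the Programming category and assembles the breadcrumb string inside the database:

WITH RECURSIVE breadcrumbs AS (
    SELECT id, parent_id, name::text AS path   -- cast so both terms agree on the column type
    FROM categories
    WHERE id = 5                               -- start at Programming
    UNION ALL
    SELECT c.id, c.parent_id, c.name || ' > ' || b.path
    FROM categories AS c
    JOIN breadcrumbs AS b ON b.parent_id = c.id
)
SELECT path
FROM breadcrumbs
WHERE parent_id IS NULL;                       -- the row that reached a top-level category holds the full breadcrumb

This returns Books > Non-fiction > Programming in a single round trip, no matter how deep the nesting goes.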

Regression models in Weka

Packt
06 Sep 2013
4 min read
(For more resources related to this topic, see here.)

Getting ready

Let's look at an example of a house price-based regression model, and create some real data to examine. These are actual numbers from houses for sale, and we will be trying to find the value of a house we are supposed to sell:

Size (m2)   Land (m2)   Rooms   Granite   Extra bathroom   Price
1076        2801        6       0         0                €324.500,00
990         3067        5       1         1                €466.000,00
1229        3094        5       0         1                €425.900,00
731         4315        4       1         0                €387.120,00
671         2926        4       0         1                €312.100,00
1078        6094        6       1         1                €603.000,00
909         2854        5       0         1                €383.400,00
975         2947        5       1         1                ??

To load files in Weka, we have to put the table in the ARFF file format and save it as house.arff. Make sure the attributes are numeric, as shown here:

@RELATION house
@ATTRIBUTE size NUMERIC
@ATTRIBUTE land NUMERIC
@ATTRIBUTE rooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE extra_bathroom NUMERIC
@ATTRIBUTE price NUMERIC
@DATA
1076,2801,6,0,0,324500
990,3067,5,1,1,466000
1229,3094,5,0,1,425900
731,4315,4,1,0,387120
671,2926,4,0,1,312100
1078,6094,6,1,1,603000
909,2854,5,0,1,383400
975,2947,5,1,1,?

How to do it...

Use the following snippet:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.functions.LinearRegression;

public class Regression {
  public static void main(String args[]) throws Exception {
    //load data
    Instances data = new Instances(new BufferedReader(new FileReader("dataset/house.arff")));
    data.setClassIndex(data.numAttributes() - 1);
    //build model
    LinearRegression model = new LinearRegression();
    model.buildClassifier(data); //the last instance with missing class is not used
    System.out.println(model);
    //classify the last instance
    Instance myHouse = data.lastInstance();
    double price = model.classifyInstance(myHouse);
    System.out.println("My house ("+myHouse+"): "+price);
  }
}

Here is the output:

Linear Regression Model

price =
    195.2035 * size +
     38.9694 * land +
  76218.4642 * granite +
  73947.2118 * extra_bathroom +
   2681.136

My house (975,2947,5,1,1,?): 458013.16703945777

How it works...

Import a basic regression model named weka.classifiers.functions.LinearRegression:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.functions.LinearRegression;

Load the house dataset:

Instances data = new Instances(new BufferedReader(new FileReader("dataset/house.arff")));
data.setClassIndex(data.numAttributes() - 1);

Initialize and build a regression model. Note that the last instance is not used for building the model since its class value is missing:

LinearRegression model = new LinearRegression();
model.buildClassifier(data);

Output the model:

System.out.println(model);

Use the model to predict the price of the last instance in the dataset:

Instance myHouse = data.lastInstance();
double price = model.classifyInstance(myHouse);
System.out.println("My house ("+myHouse+"): "+price);

There's more

This section lists some additional algorithms.

Other regression algorithms

There is a wide variety of implemented regression algorithms one can use in Weka:

weka.classifiers.rules.ZeroR: The class for building and using a 0-R classifier. Predicts the mean (for a numeric class) or the mode (for a nominal class) and is considered a baseline; that is, if your classifier's performance is worse than the average-value predictor, it is not worth considering.
weka.classifiers.trees.REPTree: The fast decision tree learner.
Builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (that is, as in C4.5).
weka.classifiers.functions.SMOreg: SMOreg implements the support vector machine for regression. The parameters can be learned using various algorithms. The algorithm is selected by setting the RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade, Keerthi, and others, and this is the default RegOptimizer.
weka.classifiers.functions.MultilayerPerceptron: A classifier that uses backpropagation to classify instances. This network can be built by hand, created by an algorithm, or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid (except for when the class is numeric, in which case the output nodes become unthresholded linear units).
weka.classifiers.functions.GaussianProcesses: Implements Gaussian processes for regression without hyperparameter tuning.

(A short sketch showing how to drop one of these alternatives into the earlier example follows at the end of this article.)

Summary

We learned how to use models that predict the value of a numerical class, in contrast to classification, which predicts the value of a nominal class. Given a set of attributes, the regression builds a model, usually an equation, that is used to compute the predicted class value.

Resources for Article:

Further resources on this subject:
Java in Oracle Database [Article]
Installing and Setting up JavaFX for NetBeans and Eclipse IDE [Article]
Getting Started with JavaFX [Article]
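Because all of the learners listed above implement Weka's common classifier API, swapping one into the earlier example only changes the import and the construction line. The following is a minimal sketch, not part of the original recipe, that trains REPTree on the same house.arff file; expect a very shallow tree with so few training rows:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.trees.REPTree;

public class TreeRegression {
  public static void main(String[] args) throws Exception {
    //load the same dataset as before
    Instances data = new Instances(new BufferedReader(new FileReader("dataset/house.arff")));
    data.setClassIndex(data.numAttributes() - 1);
    //build a regression tree instead of a linear model
    REPTree model = new REPTree();
    model.buildClassifier(data);
    System.out.println(model);
    //predict the price of the last (unlabelled) instance
    Instance myHouse = data.lastInstance();
    System.out.println("REPTree prediction: " + model.classifyInstance(myHouse));
  }
}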

Using indexes to manipulate pandas objects

Packt
05 Sep 2013
4 min read
(For more resources related to this topic, see here.)

Getting ready

A good understanding of indexes in pandas is crucial to quickly move the data around. From a business intelligence perspective, they create a distinction similar to that of metrics and dimensions in an OLAP cube. To illustrate this point, this recipe walks through getting stock data out of pandas, combining it, then reindexing it for easy chomping.

How to do it...

Use the DataReader object to transfer stock price information into a DataFrame and to explore the basic axes of a Panel.

> from pandas.io.data import DataReader
> tickers = ['gs', 'ibm', 'f', 'ba', 'axp']
> dfs = {}
> for ticker in tickers:
      dfs[ticker] = DataReader(ticker, "yahoo", '2006-01-01')

# a yet undiscussed data structure; in the same way that a
# DataFrame is a collection of Series, a Panel is a collection of
# DataFrames
> pan = pd.Panel(dfs)
> pan
<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 1764 (major_axis) x 6 (minor_axis)
Items axis: axp to ibm
Major_axis axis: 2006-01-03 00:00:00 to 2013-01-04 00:00:00
Minor_axis axis: Open to Adj Close

> pan.items
Index([axp, ba, f, gs, ibm], dtype=object)
> pan.minor_axis
Index([Open, High, Low, Close, Volume, Adj Close], dtype=object)
> pan.major_axis
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-01-03 00:00:00, ..., 2013-01-04 00:00:00]
Length: 1764, Freq: None, Timezone: None

Use the axis selectors to easily compute different sets of summary statistics.

> pan.minor_xs('Open').mean()
axp     46.227466
ba      70.746451
f        9.135794
gs     151.655091
ibm    129.570969

# major axis is sliceable as well
> day_slice = pan.major_axis[1]
> pan.major_xs(day_slice)[['gs', 'ba']]
                   ba          gs
Open            70.08      127.35
High            71.27      128.91
Low             69.86      126.38
Close           71.17      127.09
Volume     3165000.00  4861600.00
Adj Close       60.43      118.12

Convert the Panel to a DataFrame.

> dfs = []
> for df in pan:
      idx = pan.major_axis
      idx = pd.MultiIndex.from_tuples(zip([df]*len(idx), idx))
      idx.names = ['ticker', 'timestamp']
      dfs.append(pd.DataFrame(pan[df].values, index=idx, columns=pan.minor_axis))
> df = pd.concat(dfs)
> df
Data columns:
Open         8820 non-null values
High         8820 non-null values
Low          8820 non-null values
Close        8820 non-null values
Volume       8820 non-null values
Adj Close    8820 non-null values
dtypes: float64(6)

Perform the analogous operations as in the preceding examples on the newly created DataFrame.

# selecting from a MultiIndex isn't much different than the Panel
# (output muted)
> df.ix['gs':'ibm']
> df['Open']

How it works...

The previous example was certainly contrived, but when indexing and statistical techniques are incorporated, the power of pandas begins to come through. Statistics will be covered in an upcoming recipe. pandas' indexes by themselves can be thought of as descriptors of a certain point in the DataFrame. When ticker and timestamp are the only indexes in a DataFrame, then the point is individualized by the ticker, timestamp, and column name. After the point is individualized, it's more convenient for aggregation and analysis.

There's more...

Indexes show up all over the place in pandas, so it's worthwhile to see some other use cases as well.

Advanced header indexes

Hierarchical indexing isn't limited to rows.
Headers can also be represented by MultiIndex, as shown in the following command line:

> header_top = ['Price', 'Price', 'Price', 'Price', 'Volume', 'Price']
> df.columns = pd.MultiIndex.from_tuples(zip(header_top, df.columns))

Performing aggregate operations with indexes

As a prelude to the following sections, we'll do a single groupby function here since they work with indexes so well (one more selection sketch appears at the end of this article).

> df.groupby(level=['ticker', 'timestamp'])['Volume'].mean()

This answers the question, for each ticker and for each day (not date), what was the mean volume over the life of the data.

Summary

This article talks about the use and importance of indexes in pandas. It also talks about different operations that can be done with indexes.

Resources for Article:

Further resources on this subject:
Installing Panda3D [Article]
Setting Up Panda3D and Configuring Development Tools [Article]
Collision Detection and Physics in Panda3D Game Development [Article]
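The ticker/timestamp MultiIndex built in step 3 also makes cross-sections easy to pull out. The following lines are a small sketch, not part of the original recipe; they assume df is the MultiIndexed frame created above and simply use whatever timestamp happens to come first in the index:

# one timestamp across all tickers
> day = df.index.get_level_values('timestamp')[0]
> snapshot = df.xs(day, level='timestamp')

# mean closing price per ticker over the whole history
> mean_close = df['Close'].groupby(level='ticker').mean()

Both results are regular pandas objects, so they can be chained into further aggregation just like the groupby shown above.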

Quick start – writing your first MDX query

Packt
04 Sep 2013
7 min read
(For more resources related to this topic, see here.)

Step 1 – open the SQL Server Management Studio and connect to the cube

The Microsoft SQL Server Management Studio (SSMS) is a client application used by administrators to manage instances and by developers to create objects and write queries. We will use SSMS to connect to the cube and write our first MDX query. Here's a screenshot of SSMS with a connection to an SSAS server:

1. Click on the Windows button, click on All Programs, click on Microsoft SQL Server 2012, and then click on SQL Server Management Studio.
2. In the Connect to Server window, in the Server type box, select Analysis Services.
3. In the Server name box, type the name of your Analysis Services server.
4. Click on Connect.
5. In the SQL Server Management Studio window, click on the File menu, click on New, and then click on Analysis Services MDX Query.
6. In the Connect to Analysis Services window, in the Server name box, type the name of your Analysis Services server, and then click on Connect.

SELECT FROM WHERE

If you have already written SQL queries, you might have already made connections with the T-SQL language. Here's my tip for you: don't, you will only hurt yourself. Some words are the same, but it is better to think MDX when writing MDX rather than to think SQL when writing MDX.

Step 2 – SELECT

The SELECT clause is the main part of the MDX query. You will define which measure and dimension members you want to display. You also have to define on which axis of your result set you want to display those measure and dimension members.

Axes

Axes are the columns and rows of the result set. With SQL Server Analysis Services, up to 128 axes can be specified. The axes are numbered, and the numbering is zero-based: the first axis is 0, the second is 1, and so on. So, if you want to use two axes, the first one will be 0 and the second will be 1. You cannot use axis 0 and axis 2 if you don't define axis 1. For the first five axes, you can use the axis alias instead. After axis 4, you will have to revert to the number because no other aliases are available.

Axis number   Alias
0             Columns
1             Rows
2             Pages
3             Sections
4             Chapters

Even if SSAS supports 128 axes, if you try to use more than two axes in your query in SSMS, you will get this error when you execute your MDX query: Results cannot be displayed for cellsets with more than two axes. So, always write your MDX queries using only two axes in SSMS and separate them with a comma.

Tuples

A tuple is a specific point in the cube where dimensions meet. A tuple can contain one or more members from the cube's dimensions, but you cannot have two members from the same dimension. If you want to display only the calendar year 2008, you will have to write [Date].[CY 2008]. If you want to have more than one dimension, you have to enclose the members in parentheses () and separate them with a comma. The calendar year 2008 for the United States will look like ([Date].[CY 2008], [Geography].[United States]). Even if you are writing a tuple with only a single member from a single dimension, it is good practice to enclose it in parentheses.

Sets

If you want to display the years 2005 to 2008, you will write four single-dimension tuples, which compose a set. When writing the set, you wrap it all with curly braces {} and separate the tuples with commas, such as {[Date].[CY 2005], [Date].[CY 2006], [Date].[CY 2007], [Date].[CY 2008]} to get the calendar years from 2005 to 2008.
Since all the tuples are from the same dimension, you can also write the set using a colon (:), such as {[Date].[CY 2005] : [Date].[CY 2008]}, which will give you the years 2005 to 2008. With SSAS 2012, you can write {[Date].[CY 2008] : [Date].[CY 2005]} and the result will still be from 2005 to 2008.

What about the calendar year 2008 for both Canada and the United States? You will write two tuples. A set can be composed of one or more tuples, and the tuples must have the same dimensionality; otherwise, an error will occur. This means that if the first member of a tuple is from the Date dimension and the second from the Geography dimension, every tuple in the set must follow that order. You cannot have the first tuple be Date-Geography and the second Geography-Date; you will encounter an error. So the calendar year 2008 with Canada and the United States will look like {([Date].[CY 2008], [Geography].[Canada]), ([Date].[CY 2008], [Geography].[United States])}.

When writing tuples, always use the form [Dimension].[Level].[MemberName]. So, [Geography].[Canada] should be written as [Geography].[Country].[Canada]. You could also use the member key instead of the member name. In SSAS, use the ampersand (&) when using the key; [Geography].[State-Province].[Quebec] with the name becomes [Geography].[State-Province].&[QC]&[CA] using the keys.

What happens when you want to write bigger sets, such as the Bikes and Components product categories in Canada and the United States from 2005 to 2008? Enter the Crossjoin function. Crossjoin takes two or more sets as arguments and returns a set with the cross product of the specified sets.

Crossjoin({[Product].[Category].[Bikes], [Product].[Category].[Components]}, {[Geography].[Country].[Canada], [Geography].[Country].[United States]}, {[Date].[CY 2005] : [Date].[CY 2008]})

MDX queries can be written using line breaks to add visibility to the code. So each time we write a new set, and even tuples, we write it on a new line and add some indentation:

Crossjoin(
    {[Product].[Category].[Bikes], [Product].[Category].[Components]},
    {[Geography].[Country].[Canada], [Geography].[Country].[United States]},
    {[Date].[CY 2005] : [Date].[CY 2008]}
)

Step 3 – FROM

The FROM clause defines where the query will get the data. It can be one of the following four things:

A cube.
A perspective (a subset of dimensions and measures).
A subcube (an MDX query inside an MDX query).
A dimension (a dimension inside your SSAS database; you must use the dollar sign ($) before the name of the dimension).

Step 4 – WHERE

The WHERE clause is used to filter the dimensions and members out of the MDX query. The set used in the WHERE clause won't be displayed in your result set.

Step 5 – comments

Comment your code. You never know when somebody else will take a look at your queries, and trying to understand what has been written can be hard. There are three ways to delimit comments inside the query:

/* and */
//
-- (pair of dashes)

The /* and */ symbols can be used to comment multiple lines of text in your query. Everything between the /* and the */ symbols will be ignored when the MDX query is parsed. Use // or -- to begin a comment on a single line.

Step 6 – your first MDX query

Suppose you want to display the Reseller Sales Amount and Reseller Order Quantity measures on the columns, and the years 2006 to 2008 crossed with the Bikes and Components product categories on the rows, filtered for Canada. First, identify what will go where. Start with the two axes, continue with the FROM clause, and finish with the WHERE clause.
SELECT
    {[Measures].[Reseller Sales Amount], [Measures].[Reseller Order Quantity]} on columns,
    Crossjoin(
        {[Date].[CY 2006] : [Date].[CY 2008]},
        {[Product].[Category].[Bikes], [Product].[Category].[Components]}
    ) on rows
FROM [Adventure Works]
WHERE {[Geography].[Country].[Canada]}

This query will return the following result set:

                        Reseller Sales Amount   Reseller Order Quantity
CY 2006   Bikes         $3,938,283.99           4,563
CY 2006   Components    $746,576.15             2,954
CY 2007   Bikes         $4,417,665.71           5,395
CY 2007   Components    $997,617.89             4,412
CY 2008   Bikes         $1,909,709.62           2,209
CY 2008   Components    $370,698.68             1,672

(A commented variant of this query appears at the end of this article.)

Summary

In this article, we saw how to write MDX queries in various steps. We used the FROM, WHERE, and SELECT clauses in writing the queries. This article was a quick start guide for starting to query, and it will help you write more complex queries. Happy querying!

Resources for Article:

Further resources on this subject:
Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article]
MySQL Linked Server on SQL Server 2008 [Article]
Microsoft SQL Azure Tools [Article]
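As referenced above, here is one more variant of the query, this time with the comment styles from Step 5 sprinkled in and with a two-member tuple in the WHERE clause so that both Canada and the Bikes category act as slicers. It is only a sketch: the measure, dimension, and member names are the same Adventure Works names used earlier, and whether they resolve depends on your sample cube version.

-- reseller sales for Canada, bikes only (both filters moved to the WHERE slicer)
SELECT
    {[Measures].[Reseller Sales Amount]} on columns,          // measures on axis 0
    {[Date].[CY 2006] : [Date].[CY 2008]} on rows             /* three calendar years on axis 1 */
FROM [Adventure Works]
WHERE ([Geography].[Country].[Canada], [Product].[Category].[Bikes])

Neither the Geography nor the Product dimension shows up on the axes; the WHERE tuple only restricts the values that are aggregated.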

Understanding Point-In-Time-Recovery

Packt
04 Sep 2013
28 min read
(For more resources related to this topic, see here.)

Understanding the purpose of PITR

PostgreSQL offers a tool called pg_dump to back up a database. Basically, pg_dump will connect to the database, read all the data in the "serializable" transaction isolation level, and return the data as text. As we are using "serializable", the dump is always consistent. So, if your pg_dump starts at midnight and finishes at 6 A.M., you will have created a backup which contains all the data as of midnight, but no further data. This kind of snapshot creation is highly convenient and perfectly feasible for small to medium amounts of data.

A dump is always consistent. This means that all foreign keys are intact; new data added after starting the dump will be missing. It is most likely the most common way to perform standard backups.

But what if your data is so valuable and maybe so large in size that you want to back it up incrementally? Taking a snapshot from time to time might be enough for some applications; for highly critical data, it is clearly not. In addition to that, replaying 20 TB of data in textual form is not efficient either. Point-In-Time-Recovery has been designed to address this problem. How does it work? Based on a snapshot of the database, the XLOG will be replayed later on. This can happen indefinitely or up to a point chosen by you. This way, you can reach any point in time. This method opens the door to many different approaches and features:

Restoring a database instance up to a given point in time
Creating a standby database, which holds a copy of the original data
Creating a history of all changes

In this article, we will specifically focus on the incremental backup functionality and describe how you can make your data more secure by incrementally archiving changes to a medium of choice.

Moving to the bigger picture

The following picture provides an overview of the general architecture in use for Point-In-Time-Recovery:

PostgreSQL produces 16 MB segments of transaction log. Every time one of those segments is filled up and ready, PostgreSQL will call the so-called archive_command. The goal of archive_command is to transport the XLOG file from the database instance to an archive. In our image, the archive is represented as the pot on the bottom-right side of the image.

The beauty of the design is that you can basically use an arbitrary shell script to archive the transaction log. Here are some ideas:

Use some simple copy to transport data to an NFS share
Run rsync to move a file
Use a custom-made script to checksum the XLOG file and move it to an FTP server
Copy the XLOG file to a tape

The possible options to manage XLOG are only limited by imagination.

The restore_command is the exact counterpart of the archive_command. Its purpose is to fetch data from the archive and provide it to the instance which is supposed to replay it (in our image, this is labeled as Restored Backup). As you have seen, replay might be used for replication or simply to restore a database to a given point in time as outlined in this article. Again, the restore_command is simply a shell script doing whatever you wish, file by file.

It is important to mention that you, the almighty administrator, are in charge of the archive. You have to decide how much XLOG to keep and when to delete it. The importance of this task cannot be overstated. Keep in mind, when the archive_command fails for some reason, PostgreSQL will keep the XLOG file and retry after a couple of seconds.
If archiving fails constantly from a certain point on, it might happen that the master fills up. The sequence of XLOG files must not be interrupted; PostgreSQL needs an uninterrupted sequence of XLOG files, and if a single file is missing, the recovery process will stop there at the very latest.

Archiving the transaction log

After taking a look at the big picture, we can now see how things can be put to work. The first thing you have to do when it comes to Point-In-Time-Recovery is to archive the XLOG. PostgreSQL offers all the configuration options related to archiving through postgresql.conf. Let us see step by step what has to be done in postgresql.conf to start archiving:

First of all, you should turn archive_mode on.
In the second step, you should configure your archive command. The archive command is a simple shell command taking two parameters:
%p: This is a placeholder representing the XLOG file that should be archived, including its full path (source).
%f: This variable holds the name of the XLOG file without the path pointing to it.

Let us set up archiving now. To do so, we should create a place to put the XLOG. Ideally, the XLOG is not stored on the same hardware as the database instance you want to archive. For the sake of this example, we assume that we want to put the archive in /archive. The following changes have to be made to postgresql.conf:

wal_level = archive             # minimal, archive, or hot_standby
                                # (change requires restart)
archive_mode = on               # allows archiving to be done
                                # (change requires restart)
archive_command = 'cp %p /archive/%f'
                                # command to use to archive a logfile segment
                                # placeholders: %p = path of file to archive
                                #               %f = file name only

Once those changes have been made, archiving is ready for action and you can simply restart the database to activate things. Before we restart the database instance, we want to focus your attention on wal_level. Currently three different wal_level settings are available:

minimal
archive
hot_standby

With minimal, the amount of transaction log produced in the case of just a single node is by far not enough to synchronize an entire second instance. There are some optimizations in PostgreSQL which allow XLOG-writing to be skipped in single-node mode. The following instructions can benefit from wal_level being set to minimal: CREATE TABLE AS, CREATE INDEX, CLUSTER, and COPY (if the table was created or truncated within the same transaction).

To replay the transaction log, at least archive is needed. The difference between archive and hot_standby is that archive does not have to know about currently running transactions. For streaming replication, however, this information is vital.

Restarting can either be done through pg_ctl -D /data_directory -m fast restart directly or through a standard init script.

The easiest way to check if our archiving works is to create some useless data inside the database. The following code snippet shows how a million rows can be made easily:

test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000);
SELECT 1000000
test=# SELECT * FROM t_test LIMIT 3;
 generate_series
-----------------
               1
               2
               3
(3 rows)

We have simply created a list of numbers. The important thing is that 1 million rows will trigger a fair amount of XLOG traffic.
You will see that a handful of files have made it to the archive:

iMac:archive hs$ ls -l
total 131072
-rw-------  1 hs  wheel  16777216 Mar  5 22:31 000000010000000000000001
-rw-------  1 hs  wheel  16777216 Mar  5 22:31 000000010000000000000002
-rw-------  1 hs  wheel  16777216 Mar  5 22:31 000000010000000000000003
-rw-------  1 hs  wheel  16777216 Mar  5 22:31 000000010000000000000004

Those files can easily be used for future replay operations. If you want to save storage, you can also compress those XLOG files; just add gzip to your archive_command.

Taking base backups

In the previous section, you have seen that enabling archiving takes just a handful of lines and offers a great deal of flexibility. In this section, we will see how to create a so-called base backup, which can be used to apply XLOG later on. A base backup is an initial copy of the data.

Keep in mind that the XLOG itself is more or less worthless. It is only useful in combination with the initial base backup.

In PostgreSQL, there are two main options to create an initial base backup:

Using pg_basebackup
Traditional copy/rsync-based methods

The following two sections will explain in detail how a base backup can be created.

Using pg_basebackup

The first and most common method to create a backup of an existing server is to run a command called pg_basebackup, which was introduced in PostgreSQL 9.1.0. Basically, pg_basebackup is able to fetch a database base backup directly over a database connection. When executed on the slave, pg_basebackup will connect to the database server of your choice and copy all the data files in the data directory over to your machine. There is no need to log into the box anymore, and all it takes is one line of code to run it; pg_basebackup will do all the rest for you.

In this example, we will assume that we want to take a base backup of a host called sample.postgresql-support.de. The following steps must be performed:

Modify pg_hba.conf to allow replication
Signal the master to take pg_hba.conf changes into account
Call pg_basebackup

Modifying pg_hba.conf

To allow remote boxes to log into a PostgreSQL server and to stream XLOG, you have to explicitly allow replication. In PostgreSQL, there is a file called pg_hba.conf, which tells the server which boxes are allowed to connect using which type of credentials. Entire IP ranges can be allowed or simply discarded through pg_hba.conf. To enable replication, we have to add one line for each IP range we want to allow. The following listing contains an example of a valid configuration:

# TYPE  DATABASE     USER  ADDRESS           METHOD
host    replication  all   192.168.0.34/32   md5

In this case we allow replication connections from 192.168.0.34. The IP range is identified by /32 (which simply represents a single server in our case). We have decided to use MD5 as our authentication method, which means that pg_basebackup has to supply a password to the server. If you are doing this in a non-security-critical environment, using trust as the authentication method might also be an option.

What happens if you actually have a database called replication in your system? Basically, setting the database to replication will just configure your streaming behavior; if you want to put in rules dealing with the database called replication, you have to quote the database name as follows: "replication". However, we strongly advise not to do this kind of trickery, to avoid confusion.

Signaling the master server

Once pg_hba.conf has been changed, we can tell PostgreSQL to reload the configuration.
There is no need to restart the database completely. We have three options to make PostgreSQL reload pg_hba.conf:

By running an SQL command: SELECT pg_reload_conf();
By sending a signal to the master: kill -HUP 4711 (with 4711 being the process ID of the master)
By calling pg_ctl: pg_ctl -D $PGDATA reload (with $PGDATA being the home directory of your database instance)

Once we have told the server acting as data source to accept streaming connections, we can move forward and run pg_basebackup.

pg_basebackup – basic features

pg_basebackup is a very simple-to-use command-line tool for PostgreSQL. It has to be called on the target system and will provide you with a ready-to-use base backup, which is ready to consume the transaction log for Point-In-Time-Recovery. The syntax of pg_basebackup is as follows:

iMac:dbhs$ pg_basebackup --help
pg_basebackup takes a base backup of a running PostgreSQL server.

Usage:
  pg_basebackup [OPTION]...

Options controlling the output:
  -D, --pgdata=DIRECTORY   receive base backup into directory
  -F, --format=p|t         output format (plain (default), tar)
  -x, --xlog               include required WAL files in backup (fetch mode)
  -X, --xlog-method=fetch|stream
                           include required WAL files with specified method
  -z, --gzip               compress tar output
  -Z, --compress=0-9       compress tar output with given compression level

General options:
  -c, --checkpoint=fast|spread
                           set fast or spread checkpointing
  -l, --label=LABEL        set backup label
  -P, --progress           show progress information
  -v, --verbose            output verbose messages
  -V, --version            output version information, then exit
  -?, --help               show this help, then exit

Connection options:
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -s, --status-interval=INTERVAL
                           time between status packets sent to server (in seconds)
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)

A basic call to pg_basebackup would look like this:

iMac:dbhs$ pg_basebackup -D /target_directory -h sample.postgresql-support.de

In this example, we will fetch the base backup from sample.postgresql-support.de and put it into our local directory called /target_directory. It just takes this simple line to copy an entire database instance to the target system.

When we create a base backup as shown in this section, pg_basebackup will connect to the server and wait for a checkpoint to happen before the actual copy process is started. In this mode, this is necessary because the replay process will start exactly at this point in the XLOG. The problem is that it might take a while until a checkpoint kicks in; pg_basebackup does not enforce a checkpoint on the source server straight away, to make sure that normal operations are not disturbed.

If you don't want to wait for a checkpoint, consider using --checkpoint=fast. It will enforce an instant checkpoint and pg_basebackup will start copying instantly.

By default, a plain base backup will be created. It will consist of all the files in the directories found on the source server. If the base backup should be stored on tape, we suggest giving --format=t a try. It will automatically create a TAR archive (maybe on a tape). If you want to move data to a tape, you can save an intermediate step easily this way. When using TAR, it is usually quite beneficial to use it in combination with --gzip to reduce the size of the base backup on disk.

There is also a way to see a progress bar while doing the base backup, but we don't recommend using this option (--progress) because it requires pg_basebackup to determine the size of the source instance first, which can be costly.

pg_basebackup – self-sufficient backups

Usually a base backup without XLOG is pretty worthless. This is because the base backup is taken while the master is fully operational. While the backup is taken, those storage files in the database instance might have been modified heavily. The purpose of the XLOG is to fix those potential issues in the data files reliably. But what if we want to create a base backup which can live without (explicitly archived) XLOG? In this case, we can use the --xlog-method=stream option. If this option has been chosen, pg_basebackup will not just copy the data as it is but will also stream the XLOG being created during the base backup to our destination server. This will provide us with just enough XLOG to allow us to start a base backup made that way directly. It is self-sufficient and does not need additional XLOG files. This is not Point-In-Time-Recovery, but it can come in handy in case of trouble. Having a base backup which can be started right away is usually a good thing and it comes at fairly low cost.

Please note that --xlog-method=stream will require two database connections to the source server, not just one. You have to keep that in mind when adjusting max_wal_senders on the source server.

If you are planning to use Point-In-Time-Recovery and there is absolutely no need to start the backup as it is, you can safely skip the XLOG and save some space this way (default mode).

Making use of traditional methods to create base backups

These days pg_basebackup is the most common way to get an initial copy of a database server. This has not always been the case. Traditionally, a different method has been used, which works as follows (a minimal scripted sketch of this procedure follows after the tablespace note below):

Call SELECT pg_start_backup('some label');
Copy all data files to the remote box through rsync or any other means.
Run SELECT pg_stop_backup();

The main advantage of this old method is that there is no need to open a database connection and no need to configure XLOG-streaming infrastructure on the source server. Another advantage is that you can make use of features such as ZFS snapshots or similar means, which can help to dramatically reduce the amount of I/O needed to create an initial backup.

Once you have started pg_start_backup, there is no need to hurry. It is not necessary and not even especially desirable to leave the backup mode soon. Nothing will happen if you are in backup mode for days. PostgreSQL will archive the transaction log as usual and the user won't face any kind of downside. Of course, it is bad habit not to close backups soon and properly. However, the way PostgreSQL works internally does not change when a base backup is running. There is nothing filling up, no disk I/O delayed, or anything of this sort.

Tablespace issues

If you happen to use more than one tablespace, pg_basebackup will handle this just fine if the filesystem layout on the target box is identical to the filesystem layout on the master. However, if your target system does not use the same filesystem layout, there is a bit more work to do. Using the traditional way of doing the base backup might be beneficial in this case. In case you are using --format=t (for TAR), you will be provided with one TAR file per tablespace.
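The traditional procedure listed above can be wrapped in a few lines of shell. The following is only a sketch and not part of the original text: hostnames, paths, and the bandwidth limit are example values, and error handling is left out for brevity.

#!/bin/sh
# run this on the machine that should receive the base backup
SOURCE=sample.postgresql-support.de      # source host (example)
PGDATA_REMOTE=/var/lib/pgsql/data        # data directory on the source (example)
TARGET=/target_directory                 # where the copy should end up (example)

# 1) put the source instance into backup mode
psql -h "$SOURCE" -U postgres -c "SELECT pg_start_backup('nightly base backup');"

# 2) copy the data files; pg_xlog can be excluded because the archived XLOG
#    will be replayed during recovery anyway
rsync -a --bwlimit=20000 --exclude=pg_xlog/ "$SOURCE:$PGDATA_REMOTE/" "$TARGET/"

# 3) leave backup mode again
psql -h "$SOURCE" -U postgres -c "SELECT pg_stop_backup();"

The --bwlimit setting ties in with the next section: it keeps the copy from saturating the disks and the network of the source box.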
There is also a way to see a progress bar while doing the base backup but we don't recommend to use this option (--progress) because it requires pg_basebackup to determine the size of the source instance first, which can be costly. pg_basebackup – self-sufficient backups Usually a base backup without XLOG is pretty worthless. This is because the base backup is taken while the master is fully operational. While the backup is taken, those storage files in the database instance might have been modified heavily. The purpose of the XLOG is to fix those potential issues in the data files reliably. But, what if we want to create a base backup, which can live without (explicitly archived) XLOG? In this case, we can use the --xlog-method=stream option. If this option has been chosen, pg_basebackup will not just copy the data as it is but it will also stream the XLOG being created during the base backup to our destination server. This will provide us with just enough XLOG to allow us to start a base backup made that way directly. It is self-sufficient and does not need additional XLOG files. This is not Point-In-Time-Recovery but it can come in handy in case of trouble. Having a base backup, which can be started right away, is usually a good thing and it comes at fairly low cost. Please note that --xlog-method=stream will require two database connections to the source server, not just one. You have to keep that in mind when adjusting max_wal_senders on the source server. If you are planning to use Point-In-Time-Recovery and if there is absolutely no need to start the backup as it is, you can safely skip the XLOG and save some space this way (default mode). Making use of traditional methods to create base backups These days pg_basebackup is the most common way to get an initial copy of a database server. This has not always been the case. Traditionally, a different method has been used which works as follows: Call SELECT pg_start_backup('some label'); Copy all data files to the remote box through rsync or any other means. Run SELECT pg_stop_backup(); The main advantage of this old method is that there is no need to open a database connection and no need to configure XLOG-streaming infrastructure on the source server. A main advantage is also that you can make use of features such as ZFS-snapshots or similar means, which can help to dramatically reduce the amount of I/O needed to create an initial backup. Once you have started pg_start_backup, there is no need to hurry. It is not necessary and not even especially desirable to leave the backup mode soon. Nothing will happen if you are in backup mode for days. PostgreSQL will archive the transaction log as usual and the user won't face any kind of downside. Of course, it is bad habit not to close backups soon and properly. However, the way PostgreSQL works internally does not change when a base backup is running. There is nothing filling up, no disk I/O delayed, or anything of this sort. Tablespace issues If you happen to use more than one tablespace, pg_basebackup will handle this just fine if the filesystem layout on the target box is identical to the filesystem layout on the master. However, if your target system does not use the same filesystem layout there is a bit more work to do. Using the traditional way of doing the base backup might be beneficial in this case. In case you are using --format=t (for TAR), you will be provided with one TAR file per tablespace. 
Keeping an eye on network bandwidth Let us imagine a simple scenario involving two servers. Each server might have just one disk (no SSDs). Our two boxes might be interconnected through a 1 Gbit link. What will happen to your applications if the second server starts to run a pg_basebackup? The second box will connect, start to stream data at full speed and easily kill your hard drive by using the full bandwidth of your network. An application running on the master might instantly face disk wait and offer bad response times. Therefore it is highly recommended to control the bandwidth used up by rsync to make sure that your business applications have enough spare capacity (mainly disk, CPU is usually not an issue). If you want to limit rsync to, say, 20 MB/sec, you can simply use rsync --bwlimit=20000. This will definitely make the creation of the base backup take longer but it will make sure that your client apps will not face problems. In general we recommend a dedicated network interconnect between master and slave to make sure that a base backup does not affect normal operations. Limiting bandwidth cannot be done with pg_basebackup onboard functionality.Of course, you can use any other tool to copy data and achieve similar results. If you are using gzip compression with –-gzip, it can work as an implicit speed brake. However, this is mainly a workaround. Replaying the transaction log Once we have created ourselves a shiny initial base backup, we can collect the XLOG files created by the database. When the time has come, we can take all those XLOG files and perform our desired recovery process. This works as described in this section. Performing a basic recovery In PostgreSQL, the entire recovery process is governed by a file named recovery.conf, which has to reside in the main directory of the base backup. It is read during startup and tells the database server where to find the XLOG archive, when to end replay, and so forth. To get you started, we have decided to include a simple recovery.conf sample file for performing a basic recovery process: restore_command = 'cp /archive/%f %p'recovery_target_time = '2013-10-10 13:43:12' The restore_command is essentially the exact counterpart of the archive_command you have seen before. While the archive_command is supposed to put data into the archive, the restore_command is supposed to provide the recovering instance with the data file by file. Again, it is a simple shell command or a simple shell script providing one chunk of XLOG after the other. The options you have here are only limited by imagination; all PostgreSQL does is to check for the return code of the code you have written, and replay the data provided by your script. Just like in postgresql.conf, we have used %p and %f as placeholders; the meaning of those two placeholders is exactly the same. To tell the system when to stop recovery, we can set the recovery_target_time. The variable is actually optional. If it has not been specified, PostgreSQL will recover until it runs out of XLOG. In many cases, simply consuming the entire XLOG is a highly desirable process; if something crashes, you want to restore as much data as possible. But, it is not always so. If you want to make PostgreSQL stop recovery at a specific point in time, you simply have to put the proper date in. The crucial part here is actually to know how far you want to replay XLOG; in a real work scenario this has proven to be the trickiest question to answer. 
If you happen to choose a recovery_target_time that is in the future, don't worry: PostgreSQL will stop at the very last transaction available in your XLOG and simply end recovery. The database instance will still be consistent and ready for action. You cannot break PostgreSQL, but you might break your applications in case data is lost because of missing XLOG.

Before starting PostgreSQL, you have to run chmod 700 on the directory containing the base backup; otherwise, PostgreSQL will error out:

iMac:target_directory hs$ pg_ctl -D /target_directory start
server starting
FATAL:  data directory "/target_directory" has group or world access
DETAIL:  Permissions should be u=rwx (0700).

This additional security check is supposed to make sure that your data directory cannot be read by some user accidentally. Therefore an explicit permission change is definitely an advantage from a security point of view (better safe than sorry).

Now that we have all the pieces in place, we can start the replay process by starting PostgreSQL:

iMac:target_directory hs$ pg_ctl -D /target_directory start
server starting
LOG:  database system was interrupted; last known up at 2013-03-10 18:04:29 CET
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  starting point-in-time recovery to 2013-10-10 13:43:12+02
LOG:  restored log file "000000010000000000000006" from archive
LOG:  redo starts at 0/6000020
LOG:  consistent recovery state reached at 0/60000B8
LOG:  restored log file "000000010000000000000007" from archive
LOG:  restored log file "000000010000000000000008" from archive
LOG:  restored log file "000000010000000000000009" from archive
LOG:  restored log file "00000001000000000000000A" from archive
cp: /tmp/archive/00000001000000000000000B: No such file or directory
LOG:  could not open file "pg_xlog/00000001000000000000000B" (log file 0, segment 11): No such file or directory
LOG:  redo done at 0/AD5CE40
LOG:  last completed transaction was at log time 2013-03-10 18:05:33.852992+01
LOG:  restored log file "00000001000000000000000A" from archive
cp: /tmp/archive/00000002.history: No such file or directory
LOG:  selected new timeline ID: 2
cp: /tmp/archive/00000001.history: No such file or directory
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

The amount of log produced by the database tells us everything we need to know about the restore process, and it is definitely worth investigating this information in detail.

The first line indicates that PostgreSQL has found out that it has been interrupted and has to start recovery. From the database instance point of view, a base backup looks more or less like a crash needing some instant care by replaying XLOG; this is precisely what we want.

The next couple of lines (restored log file ...) indicate that we are replaying one XLOG file after the other, as created since the base backup. It is worth mentioning that the replay process starts at the sixth file. The base backup knows where to start, so PostgreSQL will automatically look for the right XLOG file.

The message displayed after PostgreSQL reaches the sixth file (consistent recovery state reached at 0/60000B8) is of importance. PostgreSQL states that it has reached a consistent state. This is important. The reason is that the data files inside a base backup are actually broken by definition, but they are not broken beyond repair. As long as we have enough XLOG to recover, we are very well off.
If you cannot reach a consistent state, your database instance will not be usable and your recovery cannot work without providing additional XLOG. Practically speaking, not being able to reach a consistent state usually indicates a problem somewhere in your archiving process or your system setup. If everything up to now has been working properly, there is no reason not to reach a consistent state.

Once we have reached a consistent state, one file after the other will be replayed successfully until the system finally looks for the 00000001000000000000000B file. The problem is that this file has not been created by the source database instance. Logically, an error pops up. Not finding the last file is absolutely normal; this type of error is expected if the recovery_target_time does not ask PostgreSQL to stop recovery before it reaches the end of the XLOG stream. Don't worry, your system is actually fine. You have successfully replayed everything up to the file showing up exactly before the error message.

As soon as all the XLOG has been consumed and the error message discussed earlier has been issued, PostgreSQL reports the last transaction it was able (or supposed) to replay, and starts up. You have a fully recovered database instance now and can connect to the database instantly. As soon as the recovery has ended, recovery.conf will be renamed by PostgreSQL to recovery.done to make sure that it does not do any harm when the new instance is restarted later on at some point.

More sophisticated positioning in the XLOG

Up to now, we have recovered a database up to the very latest moment available in our 16 MB chunks of transaction log. We have also seen that you can define the desired recovery timestamp. But the question now is: How do you know which point in time to perform the recovery to? Just imagine somebody has deleted a table during the day. What if you cannot easily determine the recovery timestamp right away? What if you want to recover to a certain transaction? recovery.conf has all you need.

If you want to replay until a certain transaction, you can refer to recovery_target_xid. Just specify the transaction you need and configure recovery_target_inclusive to include or exclude this very specific transaction. Using this setting is technically easy, but as mentioned before, it is by far not easy to find the right position to replay to.

In a typical setup, the best way to find a reasonable point to stop recovery is to use pause_at_recovery_target. If this is set to true, PostgreSQL will not automatically turn into a productive instance when the recovery point has been reached. Instead, it will wait for further instructions from the database administrator. This is especially useful if you don't know exactly how far to replay. You can replay, log in, see how far the database is, change to the next target time, and continue replaying in small steps.

You have to set hot_standby = on in postgresql.conf to allow reading during recovery.

Resuming recovery after PostgreSQL has paused can be done by calling a simple SQL statement: SELECT pg_xlog_replay_resume(). It will make the instance move to the next position you have set in recovery.conf. Once you have found the right place, you can set pause_at_recovery_target back to false and call pg_xlog_replay_resume. Alternatively, you can simply utilize pg_ctl -D ... promote to stop recovery and make the instance operational.

Was this explanation too complicated? Let us boil it down to a simple list:

1. Add a restore_command to the recovery.conf file.
2. Add recovery_target_time to the recovery.conf file.
3. Set pause_at_recovery_target to true in the recovery.conf file.
4. Set hot_standby to on in the postgresql.conf file.
5. Start the instance to be recovered.
6. Connect to the instance once it has reached a consistent state and as soon as it stops recovering.
7. Check if you are already where you want to be. If you are not:
   - Change recovery_target_time.
   - Run SELECT pg_xlog_replay_resume().
   - Check again and repeat this step if necessary.

Keep in mind that once recovery has finished and PostgreSQL has started up as a normal database instance, there is (as of 9.2) no way to replay XLOG later on.

Instead of going through this process, you can of course always use filesystem snapshots. A filesystem snapshot will always work with PostgreSQL because when you restart a snapshotted database instance, it will simply believe that it had crashed before and recover normally.

Cleaning up the XLOG on the way

Once you have configured archiving, you have to store the XLOG being created by the source server. Logically, this cannot happen forever. At some point, you really have to get rid of this XLOG; it is essential to have a sane and sustainable cleanup policy for your files.

Keep in mind, however, that you must keep enough XLOG so that you can always perform recovery from the latest base backup. But if you are certain that a specific base backup is not needed anymore, you can safely clean out all the XLOG that is older than the base backup you want to keep.

How can an administrator figure out what to delete? The best method is to simply take a look at your archive directory:

000000010000000000000005
000000010000000000000006
000000010000000000000006.00000020.backup
000000010000000000000007
000000010000000000000008

Check out the filename in the middle of the listing. The .backup file has been created by the base backup. It contains some information about the way the base backup has been made and tells the system where to continue replaying the XLOG. If the backup file belongs to the oldest base backup you need to keep around, you can safely erase all the XLOG lower than file number 6; in this case, file number 5 could be safely deleted.

In our case, 000000010000000000000006.00000020.backup contains the following information:

START WAL LOCATION: 0/6000020 (file 000000010000000000000006)
STOP WAL LOCATION: 0/60000E0 (file 000000010000000000000006)
CHECKPOINT LOCATION: 0/6000058
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2013-03-10 18:04:29 CET
LABEL: pg_basebackup base backup
STOP TIME: 2013-03-10 18:04:30 CET

The .backup file will also provide you with relevant information, such as the time the base backup has been made. It is in plain text, so it should be easy for ordinary users to read this information.

As an alternative to deleting all the XLOG files at one point, it is also possible to clean them up during replay. One way is to hide an rm command inside your restore_command. While this is technically possible, it is not necessarily wise to do so (what if you want to recover again?). You can also add the recovery_end_command command to your recovery.conf file. The goal of recovery_end_command is to allow you to automatically trigger some action as soon as the recovery ends. Again, PostgreSQL will call a script doing precisely what you want. You can easily abuse this setting to clean up the old XLOG when the database declares itself active.
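If you prefer a ready-made tool over hand-written rm calls, PostgreSQL ships a small contrib binary called pg_archivecleanup that prunes an archive directory based on exactly such a .backup file. A minimal sketch using the archive path and filename from the listing above:

# remove every XLOG file in /archive that logically precedes the given base backup
pg_archivecleanup /archive 000000010000000000000006.00000020.backup

Everything from segment 000000010000000000000006 onward stays in place, so the newest base backup remains recoverable.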
Switching the XLOG files

If you are going for an XLOG file-based recovery, you have seen that one XLOG will be archived every 16 MB. What would happen if you never manage to create 16 MB of changes? What if you are a small supermarket which just makes 50 sales a day? Your system will never manage to fill up 16 MB in time.

However, if your system crashes, the potential data loss can be seen as the amount of data in your last unfinished XLOG file. Maybe this is not good enough for you.

A postgresql.conf setting on the source database might help. The archive_timeout tells PostgreSQL to create a new XLOG file at least every x seconds. So, if you are this little supermarket, you can ask the database to create a new XLOG file every day shortly before you are heading for home. In this case, you can be sure that the data of the day will safely be on your backup device already.

It is also possible to make PostgreSQL switch to the next XLOG file by hand. A procedure named pg_switch_xlog() is provided by the server to do the job:

test=# SELECT pg_switch_xlog();
 pg_switch_xlog
----------------
 0/17C0EF8
(1 row)

You might want to call this procedure when some important patch job has finished or if you want to make sure that a certain chunk of data is safely in your XLOG archive.

Summary

In this article, you have learned about Point-In-Time-Recovery, which is a safe and easy way to restore your PostgreSQL database to any desired point in time. PITR will help you to implement better backup policies and make your setups more robust.

Resources for Article:

Further resources on this subject:
Introduction to PostgreSQL 9 [Article]
PostgreSQL: Tips and Tricks [Article]
PostgreSQL 9: Reliable Controller and Disk Setup [Article]