
How-To Tutorials - Data

1204 Articles

Avro Source Sink

Packt
19 Jul 2013
3 min read
A typical configuration might look something like the following. To use the Avro Source, you specify the type property with a value of avro, and you need to provide a bind address and port number to listen on:

collector.sources=av1
collector.sources.av1.type=avro
collector.sources.av1.bind=0.0.0.0
collector.sources.av1.port=42424
collector.sources.av1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/path/in/hdfs

Here we have configured the agent on the right that listens on port 42424, uses a memory channel, and writes to HDFS. I've used the memory channel for brevity in this example configuration. Also, note that I've given this agent a different name, collector, just to avoid confusion.

The agents on the left (feeding the collector tier) might have a configuration similar to the following. I have left the sources off this configuration for brevity:

client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=42424

The hostname value, collector.example.com, has nothing to do with the agent name on that machine; it is the host name (or you can use an IP) of the target machine with the receiving Avro Source. This configuration, named client, would be applied to both agents on the left, assuming both had similar source configurations.

Since I don't like single points of failure, I would configure two collector agents with the preceding configuration and instead set each client agent to round robin between the two using a sink group. Again, I've left off the sources for brevity:

client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1 k2
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collectorA.example.com
client.sinks.k1.port=42424
client.sinks.k2.type=avro
client.sinks.k2.channel=ch1
client.sinks.k2.hostname=collectorB.example.com
client.sinks.k2.port=42424
client.sinkgroups=g1
client.sinkgroups.g1.sinks=k1 k2
client.sinkgroups.g1.processor.type=load_balance
client.sinkgroups.g1.processor.selector=round_robin
client.sinkgroups.g1.processor.backoff=true
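Each named agent is started with the standard flume-ng launcher, pointing -n at the agent name used in its properties file. A minimal sketch, assuming the two configurations are saved as conf/collector.conf and conf/client.conf (file names are an assumption, not from the article):

flume-ng agent -c conf -f conf/collector.conf -n collector
flume-ng agent -c conf -f conf/client.conf -n client

The -n value must match the property prefix (collector or client) used in the corresponding configuration file.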
Summary

In this article, we covered tiering data flows using the Avro Source and Sink. More information on this topic can be found in the book Apache Flume: Distributed Log Collection for Hadoop.

Resources for Article:
Further resources on this subject:
Supporting hypervisors by OpenNebula [Article]
Integration with System Center Operations Manager 2012 SP1 [Article]
VMware View 5 Desktop Virtualization [Article]


DPM Non-aware Windows Workload Protection

Packt
16 Jul 2013
18 min read
Protecting DFS with DPM

DFS stands for Distributed File System. It was introduced in Windows Server 2003, and is a set of services, available as a role on Windows Server operating systems, that allows you to group file shares held in different locations (different servers) under one folder known as the DFS root. The actual locations of the file shares are transparent to the end user. DFS is also often used for redundancy of file shares. For more information on DFS, see:

Windows Server 2008: http://technet.microsoft.com/en-us/library/cc753479%28v=ws.10%29.aspx
Windows Server 2008 R2 and Windows Server 2012: http://technet.microsoft.com/en-us/library/cc732006.aspx

Before DFS can be protected it is important to know how it is structured. DFS consists of both data and configuration information:

The configuration for DFS is stored in the registry of each server, and either in the DFS tree for standalone DFS deployments or in Active Directory when domain-based DFS is deployed.
DFS data is stored on each server in the DFS tree. The data consists of the multiple shares that make up the DFS root.

Protecting DFS with DPM is fairly straightforward. It is recommended to protect the actual file shares directly on each of the servers in the DFS root. When you have a standalone DFS deployment you should protect the system state on the servers in the DFS root, and when you have a domain-based DFS deployment we recommend you protect the Active Directory of the domain controller that hosts the DFS root. If you are using DFS replication it is also recommended to protect the shadow copy components on servers that host the replication data, in addition to the previously mentioned items. These methods would allow you to restore DFS by restoring the data and either the system state or Active Directory, depending on your deployment type.

Another option is to use the DfsUtil tool to export/import your DFS configuration. This is a command-line utility that comes with Windows Server that can export the namespace configuration to a file. The configuration can then be imported back into a DFS server to restore a DFS namespace. DPM can be set up to protect the DFS export. You would still need to protect the actual data directly. An example of using the DfsUtil tool would be: run DfsUtil root export \\domainname\rootname dfsrootname.xml to export the DFS configuration to an XML file, then run DfsUtil root import to import the DFS configuration back in.
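A hedged sketch of that export/import pair for a domain-based namespace (the namespace path and file name are placeholders, and the exact import mode, such as set or merge, should be verified against the dfsutil version on your server):

DfsUtil root export \\domainname\rootname dfsrootname.xml
DfsUtil root import set dfsrootname.xml \\domainname\rootname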
For more information on the DfsUtil tool, visit the following URL: http://blogs.technet.com/b/josebda/archive/2009/05/01/using-the-windows-server-2008-dfsutil-exe-command-line-to-manage-dfs-namespaces.aspx

That covers the backing up of DFS with DPM.

Protecting Dynamics CRM with DPM

Microsoft Dynamics CRM is Microsoft's customer relationship management (CRM) software. Microsoft Dynamics CRM Version 1.0 was released in 2003. It then progressed to Version 4.0, and the latest one is 2011. CRM is a part of the Microsoft Dynamics product family. In this section we will cover protecting Versions 4.0 and 2011.

Note that when protecting Microsoft Dynamics CRM, on either Version 4.0 or 2011, you should keep a note of your update-rollup level some place safe, so that you can install CRM back to that level in the event of a restore. You will need to restore the CRM database, and this could lead to an error if CRM is not at the correct update level.

To protect Microsoft Dynamics CRM 4.0, back up the following components:

Microsoft CRM Server database: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are the configuration database (MSCRM_CONFIG) and the organization database (OrganizationName_MSCRM).
Microsoft CRM Server program files: By default, these files will be located at C:\Program Files\Microsoft CRM.
Microsoft CRM website: By default the CRM website files are located in the C:\Inetpub\wwwroot directory. The web.config file can be protected; it only needs protecting if it has been changed from the default settings.
Microsoft CRM registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM key.
Microsoft CRM customizations: To protect customizations or any third-party add-ons, you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM include the system state of your domain controller, and the Exchange server if the CRM's e-mail router is used.

To protect Microsoft Dynamics CRM 2011, back up the following components:

Microsoft CRM 2011 databases: This is straightforward; you simply need to protect the SQL CRM databases. The two databases you want to protect are the configuration database (MSCRM_CONFIG) and the organization database (OrganizationName_MSCRM).
Microsoft CRM 2011 program files: By default, these files will be located at C:\Program Files\Microsoft CRM.
Microsoft CRM 2011 website: By default the CRM website files are located in the C:\Program Files\Microsoft CRM\CRMWeb directory. The web.config file can be protected; it only needs protecting if it has been changed from the default settings.
Microsoft CRM 2011 registry subkey: Back up the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSCRM subkey.
Microsoft CRM 2011 customizations: To protect customizations or any third-party add-ons, you will need to understand the specific components to back up and protect.

Other components to back up for protecting Microsoft CRM 2011 include the system state of your domain controller, the Exchange server if the CRM's e-mail router is used, and SharePoint if CRM and SharePoint integration is in use.

Note that for both CRM 4.0 and CRM 2011, you could have more than one OrganizationName_MSCRM database if you have more than one organization in CRM. Be sure to protect all of the OrganizationName_MSCRM databases that may exist.

That wraps up the Microsoft Dynamics CRM protection for both 4.0 and 2011. You simply need to configure protection of the mentioned components with DPM. Now let's look at what it will take to protect another product from the Dynamics family.

Protecting Dynamics GP with DPM

Dynamics GP is Microsoft's ERP and accounting software package for mid-market businesses. GP has standard accounting functions but it can do more, such as Sales Order Processing, Order Management, Inventory Management, and Demand Planner for forecasting, thus making it usable as a full-blown ERP. GP was known as Great Plains Software before its acquisition by Microsoft. The most recent versions of GP are Microsoft Dynamics GP 10.0 and Dynamics GP 2010 R2.

GP holds your organization's financial data. If you use it as an ERP solution, it holds even more critical data, and losing this data could be devastating to an organization. Yes, there is a built-in backup utility in GP, but this does not cover all bases in protecting your GP.
In fact, the built-in backup process only backs up the SQL database, and does not cover items like:

Customized forms
Reports
Financial statement formats
The sysdata folder

These are the GP components you should protect with DPM:

The SQL administrative databases: master, tempdb, and model
The Microsoft Dynamics GP system database (DYNAMICS)
Each of your company databases
The msdb database, if you use SQL Server Agent to schedule automatic tasks
forms.dic (for customized forms), found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010
reports.dic (for reports), found in %systemdrive%\Program Files (x86)\Microsoft Dynamics\GP2010

Backing up these components with DPM should be sufficient protection in the event a restore is needed.

Protecting TMG 2010 with DPM

Threat Management Gateway (TMG) is a part of the Forefront product family. The predecessor to TMG is Internet Security and Acceleration Server (ISA Server). TMG is fundamentally a firewall, but a very powerful one, with features such as VPN, web caching, reverse proxy, advanced stateful packet inspection, WAN failover, malware protection, routing, load balancing, and much more. There have been several forum threads on the Microsoft DPM TechNet forums asking about DPM protecting TMG, which sparked the inclusion of this section in the book.

TMG is a critical part of networks and should have high priority in regards to backup, right up there with your other critical business applications. In many environments, if TMG is down, there is a good number of users that cannot access certain business applications, which causes downtime. Let's take a look at how and what to protect in regards to TMG.

The first step is to allow DPM traffic on TMG so that the agent can communicate with DPM. You will need to install the DPM agent on TMG and then start protecting it from there. Follow the ensuing steps to protect your TMG server:

1. On the TMG server, go to Start | All Programs | Microsoft TMG Server. Open the TMG Server Management MMC.
2. Expand Arrays and then the TMG Server computer, then click on Firewall Policy.
3. On the View menu, click on Show System Policy Rules.
4. Right-click on the Allow remote management from selected computers using MMC system policy rule. Select Edit System Policy.
5. In the System Policy Editor dialog box, click to clear the Enable this configuration group checkbox, and then click on OK.
6. Click on Apply to update the firewall configuration, and then click on OK.
7. Right-click on the Allow RPC from TMG server to trusted servers system policy rule. Select Edit System Policy.
8. In the System Policy Editor dialog box, click to clear the Enforce strict RPC compliance checkbox, and then click on OK.
9. Click on Apply to update the firewall configuration, and then click on OK.
10. On the View menu, click on Hide System Policy Rules.
11. Right-click on Firewall Policy. Select New and then Access Rule.
12. In the New Access Rule Wizard window, type a name in the Access rule name box. Click on Next.
13. Check the Allow checkbox and then click on Next.
14. In the This rule applies to list, select All outbound traffic from the drop-down menu and click on Next.
15. On the Access Rule Sources page, click on Add.
16. In the Add Network Entities dialog window, click on New and select Computer from the drop-down list. Now type the name of your DPM server and type the DPM server's IP address in the Computer IP Address field. Click on OK when you are done. You will then see your DPM server listed under the Computers folder in the Add Network Entities window.
17. Select it and click on Add. This will bring the DPM computer into your access rule wizard. Click on Next.
18. In the Add Rule Destinations window click on Add. The Add Network Entities window will come up again. In this window expand the Networks folder, and then select Local Host and click on Add. Now click on Next.
19. Your rule should have both the DPM server and Local Host listed for both incoming and outgoing. Click on Next, leave the default All Users entry in the This rule applies to requests from the following user sets box, and click on Next again. Click on Finish.
20. Right-click on the new rule (DPM2010 in this example), and then click on Move Up.
21. Right-click on the new rule, and select Properties.
22. In the rule name properties dialog box (DPM2010 Properties), click on the Protocols tab, then click on Filtering. Now select Configure RPC Protocol.
23. In the Configure RPC protocol policy dialog box, check the Enforce strict RPC compliance checkbox, and then click on OK twice.
24. Click on Apply to update the firewall policy, and then click on OK.

Now you will need to attach the DPM agent for the TMG server. Follow the ensuing steps to complete this task:

1. Open the DPM Administrator Console. Click on the Management tab on the navigation bar. Now click on the Agents tab.
2. On the Actions pane, click on Install. The Protection Agent Install Wizard window should pop up.
3. Choose the Attach agents checkbox. Choose Computer on trusted domain, and click on Next.
4. Select the TMG server from the list, click on Add, and then click on Next.
5. Enter credentials for the domain account. The account that is used here needs to have administrative rights on the computer you are going to protect. Click on Next to continue.
6. You will receive a warning that DPM cannot tell if the TMG server is clustered or not. Click on OK for this. On the next screen click on Attach to continue.

Next you have to install the agent on the TMG firewall and point it to the correct DPM server. Follow the ensuing steps to complete this task:

1. From the TMG server that you will be protecting, access the DPM server over the network and copy the folder with the agent installer in it down to the local machine. Use this path: \\DPMSERVERNAME\%systemdrive%\Program Files\Microsoft DPM\DPM\ProtectionAgents\RA\3.0\3.0.7696.0\i386. Then, from the local folder on the protected computer, run dpmra.msi to install the agent.
2. Open a command prompt (make sure you have elevated privileges), change directory to C:\Program Files\Microsoft Data Protection Manager\DPM\bin, and then run the following:

SetDpmServer.exe -dpmServerName <serverName> -userName <userName>

Following is an example of the previous command:

SetDpmServer.exe -dpmServerName buchdpm

3. Now restart the TMG server.
4. Once your TMG server comes back, check the Windows services to make sure that the DPMRA service is set to automatic, and then start it.

That is it for configuring DPM to start protecting TMG, but there are a few more things that we still need to cover on this topic. With TMG backup you can choose to back up certain components of TMG, depending on your recovery needs. With DPM you can back up the TMG hard drive, TMG logs that are stored in SQL, TMG's system state, or BMR of TMG.
Following is the list of components you should back up, depending on your circumstances. What can be included in a TMG server backup:

TMG configuration settings (exported through TMG)
TMG firewall settings (exported through TMG)
TMG logfiles (stored in SQL databases)
TMG install directory (only needed if you have custom forms for things such as an Outlook Web Access login screen)
TMG server system state
TMG BMR

None of the previous components are required for protection of TMG. In fact, protecting the SQL logfiles tends to cause more issues than it helps, as they change so often. These SQL log databases change so often that DPM will generate an error when the old SQL databases no longer show up under protection. The logfiles are not required to restore your TMG. For a standard TMG restore, you will need to reinstall TMG, reconfigure the NIC settings, import any certificates, and restore the TMG configuration and firewall settings. For more information on backing up TMG 2010, visit the following page: http://technet.microsoft.com/en-us/library/cc984454.aspx.

DPM cannot back up the TMG configuration and firewall settings natively. This needs to be scripted and scheduled through Windows Task Scheduler, with the exported settings placed on the local hard drive; DPM can then back up the exported .XML settings for TMG from there. You can find the TMG server's export script at http://msdn.microsoft.com/en-us/library/ms812627.aspx. Place this script into a .VBS file, and then set up a scheduled task to call this file. This automates the export of your TMG server settings.

There is another way to back up the entire TMG server. This is a new type of protection, specific to TMG 2010. This protection is BMR, and it is available because TMG is now installed on top of Windows Server 2008 and Windows Server 2008 R2. Protecting the BMR of your TMG gives you the ability to restore your entire TMG in the event that it fails, configuration and firewall settings included. BMR will also bring back certificates and NIC card settings. Note that the BMR of TMG restored on a virtual machine can't use its NIC card settings; those settings are only usable on the same hardware.

Well, that covers how to protect TMG with DPM. As you can see, there are some improvements through BMR, and if you do not employ BMR protection you can still automate the process of protecting TMG.

How to protect IIS

Internet Information Services (IIS) is Microsoft's web server platform. It is included for free with Windows Server operating systems. Its modular nature makes it scalable for different organizations' web server needs. The latest version is IIS 8. It can be used for more than standard web hosting, for example as an FTP server or for media delivery. Knowing what to protect when it comes to IIS will come in handy in almost any environment you may work in. Backing up IIS is one thing, but you need to ensure that you understand the websites or web applications you are running, so that you know how to back them up too. In this section, we are going to look at the protection of IIS.

To protect IIS, you should back up the following components:

IIS configuration files
Website or web application data
SSL certificates
Registry (only needed if the website or web application required modifications of the registry)
Metabase

The IIS configuration files are located in the %systemdrive%\Windows\System32\inetsrv\config directory (and subdirectories). The website or web application files are typically found in C:\inetpub\wwwroot. This is the default location, but the website or web application files can be located anywhere on an IIS server.

To export SSL certificates directly from IIS, follow the ensuing steps:

1. Open the Microsoft IIS 7 console.
2. In the left-hand pane, select the server name.
3. In the center pane click on the Server Certificates icon.
4. Right-click on the certificate you wish to export and select Export.
5. Enter a file path, name the certificate file, and give it a password.
6. Click on OK and your certificate will be exported as a .pfx file in the path you specified.

The metabase is an internal database that holds IIS configuration data. It is made up of two files: MBSchema.xml and MetaBase.xml. These can be found in %SystemRoot%\System32\inetsrv. A good thing to know is that if you protect the system state of a server, then the IIS configuration will be included in this backup. This does not include the website or web application files, so you will still need to protect these in addition to a system state backup.
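As a hedged aside that the article does not cover: on IIS 7 and later, the built-in appcmd utility can also snapshot the configuration directory described above into a named backup folder, which DPM can then protect like any other file data (the backup name PreDPM is arbitrary):

%windir%\system32\inetsrv\appcmd.exe add backup "PreDPM"

The resulting backup lands under %windir%\system32\inetsrv\backup and can be restored later with appcmd.exe restore backup "PreDPM".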
That covers the items you will need to protect IIS with DPM backup.

Protecting Lync 2010 with DPM

Lync 2010 is Microsoft's unified communications platform, complete with IM, presence, conferencing, enterprise video and voice, and more. Lync was formerly known as Office Communicator. Lync is quickly becoming an integral part of business communications. With Lync being a critical application to organizations, it is important to ensure this platform is backed up. Lync is a massive product with many moving parts. We are not going to cover all of Lync's architecture, as this would need its own book. We are going to focus on what should be backed up to ensure protection of your Lync deployment. Overall, we want to protect Lync's settings and configuration data. The majority of this data is stored in the Lync Central Management store.

The following are the components that need to be protected in order to back up Lync:

Settings and configuration data: topology configuration (Xds.mdf), location information (Lis.mdf), and response group configuration (RgsConfig.mdf)
Data stored in databases: user data (Rtc.mdf), archiving data (LcsLog.mdf), and monitoring data (csCDR.mdf and QoeMetrics.mdf)
File stores: the Lync server file store and the archiving file store

These stores will be file shares on the Lync server, named in the format \\lyncservername\sharename. To track down these file shares if you don't know where they are, go to the Lync Topology Builder and look in the File stores node. Note that files named Meeting.Active should not be backed up; these files are in use and locked while a meeting takes place.

Other components are as follows:

Active Directory (user SIP data, a pointer to the Central Management store, and objects for Response Group and Conferencing Attendant)
Certification authority (CA) and certificates (if you use an internal CA)
Microsoft Exchange and Exchange Unified Messaging (UM), if you are using UM with your Exchange
Domain Name System (DNS) records and IP addresses
IIS on the Lync Server
DHCP configuration
Group Chat (if used)
XMPP gateways, if you are using an XMPP gateway
Public switched telephone network (PSTN) gateway configuration, if your Lync is connected to one
Firewall and load balancer configurations (if used)

Summary

Now that we have had a chance to look at several Microsoft workloads that are used in organizations today and how to protect them with DPM, you should have a good understanding of what it takes to back them up. These workloads included Lync 2010, IIS, CRM, GP, DFS, and TMG.
Note that there are many more Microsoft workloads that DPM cannot protect natively, which we were unable to cover in this article.

Resources for Article:
Further resources on this subject:
Overview of Microsoft Dynamics CRM 2011 [Article]
Deploying .NET-based Applications on to Microsoft Windows CE Enabled Smart Devices [Article]
Working with Dashboards in Dynamics CRM [Article]


Measuring Performance with Key Performance Indicators

Packt
10 Jul 2013
4 min read
Creating the KPIs and the KPI watchlists

We're going to create Key Performance Indicators and watchlists in the first recipe. There should be comparable measure columns in the repository in order to create KPI objects. The following columns will be used in the sample scenario:

Shipped Quantity
Requested Quantity

How to do it

1. Click on the KPI link in the Performance Management section and select a subject area. The KPI creation wizard has five different steps. The first step is the General Properties section, where we're going to write a description for the KPI object. The Actual Value and the Target Value attributes display the columns that we'll use in this scenario. The columns should be selected manually.
2. The Enable Trending checkbox is not selected by default. When you select the checkbox, trending options will appear on the screen. We're going to select the Day level from the Time hierarchy for trending in the Compare to Prior textbox and define a value for the Tolerance attribute. We're going to use 1 and % Change in this scenario.
3. Clicking on the Next button will display the second step, named Dimensionality. Click on the Add button to select dimension attributes. Select the Region column in the Add New Dimension window. After adding the Region column, repeat the step for the YEAR column. You shouldn't select any value to pin; both columns will be left unpinned.
4. Clicking on the Next button will display the third step, named States. You can easily configure the state values in this step. Select the High Values are Desirable value from the Goal drop-down list. By default, there are three states: OK, Warning, and Critical.
5. Then click on the Next button and you'll see the Related Documents step. This is a list of supporting documents and links regarding the Key Performance Indicator. Click on the Add button to select one of the options. If you want to use another analysis as a supporting document, select the Catalog option and choose the analysis that contains some valuable information about the report. We're going to add a link instead, and you can easily define the address of the link. We'll use the http://www.abc.com/portal link.
6. Click on the Next button to display the Custom Attributes column values. To add a custom attribute that will be displayed in the KPI object, click on the Add button and define the values specified as follows:
Number: 1
Label: Dollars
Formula: "Fact_Sales"."Dollars"
7. Save the KPI object by clicking on the Save button. Right after saving the KPI object, you'll see the KPI content.
8. KPI objects cannot be published in the dashboards directly; we need KPI watchlists to publish them in the dashboards. Click on the KPI Watchlist link in the Performance Management section to create one.
9. The New KPI Watchlist page will be displayed without any KPI objects. Drag and drop the KPI object that was previously created from the Catalog pane onto the KPI watchlist. When you drop the KPI object, the Add KPI window will pop up automatically. You can select one of the available values for the dimensions; we're going to select the Use Point-of-View option. Enter a Label value, A Sample KPI, for this example.
10. You'll see the dimension attributes in the Point-of-View bar. You can easily select the values from the drop-down lists to have different perspectives. Save the KPI watchlist object.

How it works

KPI watchlists can contain multiple KPI objects based on business requirements.
These container objects can be published in the dashboards so that end users can access the content of the KPI objects through the watchlists. When you want to publish these watchlists, you'll need to select a value for the dimension attributes.

There's more

The Drill Down feature is also enabled in the KPI objects. If you want to access finer levels, you can just click on the hyperlink of the value you are interested in, and a detailed level is going to be displayed automatically.

Summary

In this article, we learnt how to create KPIs and KPI watchlists. Key Performance Indicators are the building blocks of strategy management. In order to implement the balanced scorecard management technique in an organization, you'll first need to create the KPI objects.

Resources for Article:
Further resources on this subject:
Oracle Integration and Consolidation Products [Article]
Managing Oracle Business Intelligence [Article]
Oracle Tools and Products [Article]


Getting Started with Oracle Data Guard

Packt
02 Jul 2013
13 min read
What is Data Guard?

Data Guard, which was introduced as the standby database feature in Oracle database Version 7.3 and renamed Data Guard with Version 9i, is a data protection and availability solution for Oracle databases. The basic function of Oracle Data Guard is to keep a synchronized copy of a database as a standby, in order to make provision in case the primary database is inaccessible to end users. Such cases include hardware errors, natural disasters, and so on. Each new Oracle release added new functionality to Data Guard, and the product became more and more popular with offerings such as data protection, high availability, and disaster recovery for Oracle databases.

Using Oracle Data Guard, it's possible to direct user connections to a Data Guard standby database automatically, with no data loss, in case of an outage in the primary database. Data Guard also lets you take advantage of the standby database for reporting, testing, and backup offloading. Corruptions on the primary database may be fixed automatically by using the non-corrupted data blocks on the standby database. There will be minimal outages (seconds to minutes) on the primary database during planned maintenance such as patching and hardware changes, by using the switchover feature of Data Guard, which swaps the roles of the primary and standby databases. All of these features are available with Data Guard, which doesn't require a separate installation, only the cloning and configuration of an Oracle database.

A Data Guard configuration consists of two main components: the primary database and the standby database. The primary database is the database for which we want to take precautions against inaccessibility. Fundamentally, changes to the data of the primary database are passed to the standby database, and these changes are applied there in order to keep it synchronized. The following figure shows the general structure of Data Guard.

Let's look at the standby database and its properties more closely.

Standby database

It is possible to configure a standby database simply by copying, cloning, or restoring a primary database to a different server. Then the Data Guard configurations are made on the databases in order to start the transfer of redo information from primary to standby, and also to start the apply process on the standby database. Primary and standby databases may exist on the same server; however, this kind of configuration should only be used for testing. In a production environment, the primary and standby database servers are generally preferred to be in separate data centers.

Data Guard keeps the primary and standby databases synchronized by using redo information. As you may know, transactions on an Oracle database produce redo records. This redo information keeps all of the changes made to the database. The Oracle database first creates redo information in memory (in the redo log buffers). Then it is written into the online redo logfiles, and when an online redo logfile is full, its content is written into an archived redo log.

An Oracle database can run in the ARCHIVELOG mode or the NOARCHIVELOG mode. In the ARCHIVELOG mode, online redo logfiles are written into archived redo logs, and in the NOARCHIVELOG mode, redo logfiles are overwritten without being archived as they become full. In a Data Guard environment, the primary database must be in the ARCHIVELOG mode.
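If you're following along, a quick, hedged SQL*Plus sketch for checking the log mode and enabling ARCHIVELOG mode on the primary (standard commands, but run them only on a test system and adapt them to your environment):

-- check the current mode
ARCHIVE LOG LIST
-- switching to ARCHIVELOG mode requires a clean restart to the MOUNT state
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;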
In Data Guard, the transfer of changed data from the primary to the standby database is achieved with redo, with no alternative. However, the apply process of the redo content on the standby database may vary, and the different apply methods give us different types of standby databases. There were two kinds of standby databases before Oracle database Version 11g: the physical standby database and the logical standby database. With Version 11g we should mention a third type of standby database, which is the snapshot standby. Let's look at the properties of these standby database types.

Physical standby database

The physical standby database is a block-based copy of the primary database. In a physical standby environment, in addition to containing the same database objects and the same data, the primary and standby databases are identical on a block-for-block basis. Physical standby databases use the Redo Apply method to apply changes. Redo Apply uses the managed recovery process (MRP) to manage the application of the changed information from the redo. In Version 11g, a physical standby database can be accessible in read-only mode while Redo Apply is working, which is called Active Data Guard. Using the Active Data Guard feature, we can offload reporting jobs from the primary to the physical standby database. The physical standby database is the only option that has no limitation on storage vendor or data types when keeping a synchronized copy of the primary database.

Logical standby database

The logical standby database is a feature introduced in Version 9iR2. In this configuration, redo data is first converted into SQL statements and then applied to the standby database. This process is called SQL Apply. This method makes it possible to access the standby database permanently and allows read/write while the replication of data is active. Thus, you're also able to create database objects on the standby database that don't exist on the primary database. So a logical standby database can be used for many other purposes along with high availability and disaster recovery. Due to the basics of SQL Apply, a logical standby database will contain the same data as the primary database, but in a different structure on the disks.

One discouraging aspect of the logical standby database is the unsupported data types, objects, and DDLs. The following data types are not supported for replication in a logical standby environment:

BFILE
Collections (including VARRAYS and nested tables)
Multimedia data types (including Spatial, Image, and Oracle Text)
ROWID and UROWID
User-defined types

The logical standby database doesn't guarantee to contain all primary data because of the unsupported data types, objects, and DDLs. Also, SQL Apply consumes more hardware resources. Therefore, it certainly brings more performance issues and administrative complexities than Redo Apply.

Snapshot standby database

Principally, a snapshot standby database is a special condition of a physical standby database. Snapshot standby is a feature that is available with Oracle Database Version 11g. When you convert a physical standby database into a snapshot standby database, it becomes accessible for read/write. You can run tests on this database and change the data. When you're finished with the snapshot standby database, it's possible to reverse all the changes made to the database and turn it back into a physical standby again. An important point here is that a snapshot standby database can't run Redo Apply: redo transfer continues, but the standby is not able to apply the redo.
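As a hedged illustration of that convert-and-revert cycle (these ALTER DATABASE commands exist in 11g, but the sketch omits prerequisites such as cancelling Redo Apply, mounting the standby, and having a fast recovery area configured for the implicit guaranteed restore point):

-- on the mounted standby, convert it for read/write testing
ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
-- ... run tests, change data ...
-- discard the test changes and resume the physical standby role
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;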
Oracle Data Guard evolution

Oracle Data Guard technology has been part of the database administrator's life for a long time, and it has evolved considerably from its beginnings up to 11gR2. Let's look at this evolution closely through the different database versions.

Version 7.3 – stone age

The functionality of keeping a duplicate database on a separate server, which can be synchronized with the primary database, came with Oracle database Version 7.3 under the name of standby database. This standby database was constantly in recovery mode, waiting for the archived redo logs to be synchronized. However, this feature was not able to automate the transfer of archived redo logs. Database administrators had to find a way to transfer archived redo logs and apply them to the standby server continuously. This was generally accomplished by a script running in the background.

The only aim of the Version 7.3 standby database was disaster recovery. It was not possible to query the standby database or to open it for any purpose other than activating it in the event of a failure of the primary database. Once the standby database was activated, it couldn't be returned to the standby recovery mode again.

Version 8i – first age

Oracle database Version 8i brought the much-awaited features to the standby database and made the archived log shipping and apply process automatic, which is now called the managed standby environment and managed recovery, respectively. However, some users were choosing to apply the archived logs manually, because it was not possible to set a delay in the managed recovery mode. This mode brought the risk of accidental operations being reflected on the standby database quickly.

Along with the "managed" modes, 8i made it possible to open a standby database with the read-only option and allowed it to be used as a reporting database. Even though there were new features that made the tool more manageable and practical, there were still serious deficiencies. For example, when we added a datafile or created a tablespace on the primary database, these changes were not replicated to the standby database. Database administrators had to take care of this maintenance on the standby database. Also, when we opened the primary database with resetlogs or restored a backup control file, we had to re-create the standby database.

Version 9i – middle age

First of all, with this version the Oracle 8i standby database was renamed to Oracle9i Data Guard. 9i Data Guard includes very important new features, which make the product much more reliable and functional. The following features were included:

The Oracle Data Guard Broker management framework, which is used to centralize and automate the configuration, monitoring, and management of Oracle Data Guard installations, was introduced with this version.
Zero data loss on failover was guaranteed as a configuration option.
Switchover was introduced, which made it possible to change the roles of the primary and standby. This made it possible to accomplish planned maintenance on the primary database with very little service outage.
Standby database administration became simpler, because new datafiles on the primary database are created automatically on the standby, and if there are missing archived logs on the standby (which is called a gap), Data Guard detects and transmits the missing logs to the standby automatically.
The delay option was added, which made it possible to configure a standby database that always lags behind the primary by a specified time delay.
Parallel recovery increased recovery performance on the standby database.

In Version 9i Release 2, which was introduced in May 2002, one year after Release 1, there were again very important features announced. They are as follows:

The logical standby database was introduced, which we've mentioned earlier in this article.
Three data protection modes became ready to use: Maximum Protection, Maximum Availability, and Maximum Performance, which offered more flexibility in configuration.
The cascade standby database feature made it possible to configure a second standby database, which receives its redo data from the first standby database.

Version 10g – new age

The 10g version again introduced important Data Guard features, but we can say that it perhaps fell behind expectations because of the revolutionary changes in release 9i. The following new features were introduced in Version 10g:

One of the most important features of 10g was Real-Time Apply. When running in Real-Time Apply mode, the standby database applies changes from the redo immediately after receiving it; the standby does not wait for the standby redo logfile to be archived. This provides faster switchover and failover.
Flashback database support was introduced, which made it unnecessary to configure a delay in the Data Guard configuration. Using flashback technology, it was possible to flash back a standby database to a point in time.
With 10g Data Guard, if we open a primary database with resetlogs, it is no longer required to re-create the standby database. The standby is able to recover through resetlogs.
Version 10g made it possible to use logical standby databases in the rolling upgrades of the primary database software. This method made it possible to lessen the service outage time by performing a switchover to the logical standby database.

10g Release 2 also introduced new features to Data Guard, but these features again were not compelling enough to prompt a jump to the Data Guard technology. The two most important features were Fast-Start Failover and the use of guaranteed restore points:

Fast-Start Failover automated and accelerated the failover operation when the primary database was lost. This option strengthened the disaster recovery role of Oracle Data Guard.
The guaranteed restore point was not actually a Data Guard feature. It was a database feature, which made it possible to revert a database to the moment that the guaranteed restore point was created, as long as there is sufficient disk space for the flashback logs. Using this feature, the following scenario became possible: activate a physical standby database after stopping Redo Apply, use it for testing with read/write operations, then revert the changes, make it a standby again, and synchronize it with the primary. Using a standby database read/write offered great flexibility to users, but archived log shipping was not able to continue while the standby was read/write, and this risked data loss in the event of a primary database failure.

Version 11g – modern age

Oracle database Version 11g offered the expected jump in the Data Guard technology, especially with two new features, called Active Data Guard and snapshot standby. The following features were introduced:

Active Data Guard has been a milestone in Data Guard history; it enables querying a physical standby database while media recovery is active.
Snapshot standby is a feature that lets you use a physical standby database read/write for test purposes. As we mentioned, this was possible with the 10gR2 guaranteed restore point feature, but with snapshot standby, 11g provides continuous archived log shipping during the period that the standby is read/write.
It has become possible to compress redo traffic in a Data Guard configuration, which is useful with excessive redo generation rates and when resolving gaps. Compression of redo when resolving gaps was introduced in 11gR1, and compression of all redo data was introduced in 11gR2.
Use of physical standby databases for the rolling upgrades of database software was enabled, aka Transient Logical Standby.
It became possible to include different operating systems in a Data Guard configuration, such as Windows and Linux.
Lost-write, which is a serious type of data corruption arising from the storage subsystem wrongly reporting the completed write of a block, can be detected in an 11g Data Guard configuration. Recovery is automatically stopped in such a case.
The RMAN fast incremental backup feature, Block Change Tracking, can be run on an Active Data Guard enabled standby database.
Another very important enhancement in 11g was the Automatic Block Corruption Repair feature, introduced with 11gR2. With this feature, a corrupted data block in the primary database can be automatically replaced with an uncorrupted copy from a physical standby database in Active Data Guard mode, and vice versa.

We've gone through the evolution of Oracle Data Guard from its beginning until today. As you may have noticed, Data Guard started its life as a very simple database property, intended to keep a synchronized database copy with a lot of manual work, and it is now a sophisticated tool with advanced automation, protection, and monitoring features. Now let's move on to the architecture and components of Oracle Data Guard 11gR2.


Creating your first collection (Simple)

Packt
26 Jun 2013
7 min read
Getting ready

Assuming that you have walked through the tutorial, you should be nearly ready with the setup. Still, it does not hurt to go through the checklist:

Be sure that you know how to start your operating system's shell (cmd.exe on Windows, Terminal/iTerm on Mac, and sh/bash/tcsh/zsh on Unix).
Ensure that running the java -version command on the shell's prompt returns at least Version 1.6. You may need to upgrade if you have an older version.
Ensure that you know where you unpacked the Solr distribution and the full path to the example directory within it. You needed that directory for the tutorial, but that's also where we are going to start our own Solr instance. That allows us to easily run an embedded Jetty web server and to also find all the additional JAR files that Solr needs to operate properly.
Now, create a directory where we will store our indexes and experiments. It can be anywhere on your drive. As Solr can run on any operating system where Java can run, we will use SOLR-INDEXING as a name whenever we refer to that directory. Make sure to use absolute path names when substituting your real path for the directory.

How to do it...

As our first example, we will create an index that stores and allows for the searching of simplified e-mail information. For now, we will just look at the addr_from and addr_to e-mail addresses and the subject line. You will see that it takes only two simple configuration files to get the basic Solr index working.

1. Under the SOLR-INDEXING directory, create a collection1 directory and inside that create a conf directory.
2. In the conf directory, create two files: schema.xml and solrconfig.xml. The schema.xml file should have the following content:

<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="addr_from" type="string" indexed="true" stored="true" required="true"/>
    <field name="addr_to" type="string" indexed="true" stored="true" required="true"/>
    <field name="subject" type="string" indexed="true" stored="true" required="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" />
  </types>
</schema>

3. The solrconfig.xml file should have the following content:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
  <requestDispatcher handleSelect="false">
    <httpCaching never304="true" />
  </requestDispatcher>
  <requestHandler name="/select" class="solr.SearchHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
</config>

That is it. Now, let's start our just-created Solr instance.

4. Open a new shell (we'll need the current one later). On that shell's command prompt, change the directory to the example directory of the Solr distribution and run the following command:

java -Dsolr.solr.home=SOLR-INDEXING -jar start.jar

Notice that solr.solr.home is not a typo; you do need the solr part twice. And, as always, if you have spaces in your paths (now or later), you may need to escape them in platform-specific ways, such as with backslashes on Unix/Linux or by quoting the whole value.

In the window of your shell, you should see a long list of messages that you can safely ignore (at least for now).
You can verify that everything is working fine by checking for the following three elements:

The long list of messages should finish with a Started message mentioning port 8983. This means that Solr is now running on port 8983 successfully.
You should now have a directory called data, right next to the directory called conf that we created earlier.
If you open a web browser and go to http://localhost:8983/solr/, you should see a web-based admin interface that makes testing and troubleshooting your Solr instance much easier. We will be using this interface later, so do spend a couple of minutes clicking around now.

Now, let's load some actual content into our collection:

1. Copy post.jar from the Solr distribution's example/exampledocs directory to our root SOLR-INDEXING directory.
2. Create a file called input1.csv in the collection1 directory, next to the conf and data directories, with the following three-line content:

id,addr_from,addr_to,subject
email1,[email protected],[email protected],"Kari,we need more Junior Java engineers"
email2,[email protected],[email protected],"Updating vacancy description"

3. Run the import command from the command line in the SOLR-INDEXING directory (one long command; do not split it across lines):

java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1/input1.csv

You should see the following in one of the message lines: "1 files indexed".

4. If you now open a web browser and go to http://localhost:8983/solr/collection1/select?q=*%3A*&wt=ruby&indent=true, you should see Solr output with all of the indexed documents displayed on the screen in a somewhat readable format.

How it works...

We have created two files to get our example working. Let's review what they mean and how they fit together:

The schema.xml file in the collection's conf directory defines the actual shape of the data that you want to store and index. The fields define the structure of a record. Each field has a type, which is also defined in the same file. The field defines whether it is stored, indexed, required, multivalued, or a small number of other, more advanced properties. The field type, on the other hand, defines what is actually done to the field when it is indexed and when it is searched. We will explore all of these later.
The solrconfig.xml file, also in the collection's conf directory, defines and tunes the components that make up Solr's runtime environment. At the very least, it needs to define which URLs can be called to add records to a collection (here, /update), which to query a collection (here, /select), and which to do various administrative tasks (here, /admin and /analysis/field).

Once Solr started, it created a single collection with the default name of collection1, assigned an update handler to it at the /solr/collection1/update URL and a search handler at the /solr/collection1/select URL (as per solrconfig.xml). At that point, Solr was ready for the data to be imported into the four required fields (as per schema.xml). We then proceeded to populate the index from a CSV file (one of many update formats available) and then verified that the records are all present in an indented Ruby format (again, one of many result formats available).
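For what it's worth, post.jar is only a convenience wrapper around HTTP POSTs to the /update handler defined above; a hedged equivalent using curl (assuming curl is installed and the CSV file sits in the same place) might look like this:

curl 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: text/csv' --data-binary @collection1/input1.csv

The commit=true parameter makes the new documents visible to searches immediately, which post.jar otherwise does for you with a final commit request.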
Summary

This article helped you create a basic Solr collection and populate it with a simple dataset in CSV format.

Resources for Article:
Further resources on this subject:
Integrating Solr: Ruby on Rails Integration [Article]
Indexing Data in Solr 1.4 Enterprise Search Server: Part 2 [Article]
Text Search, your Database or Solr [Article]


Creating your first heat map in R

Packt
26 Jun 2013
10 min read
The following image shows one of the heat maps that we are going to create in this recipe from the total count of air passengers.

Getting ready

Download the script 5644OS_01_01.r from your account at http://www.packtpub.com and save it to your hard disk. The first section of the script, below the comment line starting with ### loading packages, will automatically check for the availability of the R packages gplots and lattice, which are required for this recipe. If those packages are not already installed, you will be prompted to select an official server from the Comprehensive R Archive Network (CRAN) to allow the automatic download and installation of the required packages. If you had already installed those two packages prior to executing the script, I recommend updating them to the most recent version by calling update.packages() in the R command line.

Use the source() function in the R command line to execute an external script from any location on your hard drive. If you start a new R session from the same directory as the location of the script, simply provide the name of the script as an argument in the function call, for example source("5644OS_01_01.r"). You have to provide the absolute or relative path to the script on your hard drive if you started your R session from a different directory to the location of the script. You can view the current working directory of your current R session by executing getwd() in the R command line.

How to do it...

Run the 5644OS_01_01.r script in R, and take a look at the output printed on the screen as well as the PDF file, first_heatmaps.pdf, that will be created by this script.

How it works...

There are different functions for drawing heat maps in R, and each has its own advantages and disadvantages. In this recipe, we will take a look at the levelplot() function from the lattice package to draw our first heat map. Furthermore, we will use the advanced heatmap.2() function from gplots to apply a clustering algorithm to our data and add the resulting dendrograms to our heat maps. The following image shows an overview of the different plotting functions that we are using throughout this book.

Now let us take a look at how we read in and process data from different data files and formats, step by step:

Loading packages: The first eight lines preceding the ### loading data section will make sure that R loads the lattice and gplots packages, which we need for the two heat map functions in this recipe: levelplot() and heatmap.2(). Each time we start a new session in R, we have to load the required packages in order to use the levelplot() and heatmap.2() functions. To do so, enter the following function calls directly into the R command line or include them at the beginning of your script:

library(lattice)
library(gplots)

Loading the data set: R includes a package called datasets, which contains a variety of different data sets for testing and exploration purposes. More information on the different data sets that are contained in the datasets package can be found at http://stat.ethz.ch/R-manual/R-patched/library/datasets/. For this recipe, we are loading the AirPassengers data set, which is a collection of the total count of air passengers (in thousands) for international airlines from 1949 to 1960, in a time-series format.
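A minimal sketch of this loading step, assuming nothing beyond base R:

# load the built-in time series (lazy-loaded in base R, but data() makes it explicit)
data("AirPassengers")
# quick sanity check: a monthly time series covering 1949 to 1960
str(AirPassengers)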
Converting the data set into a numeric matrix: Before we can use the heat map functions, we need to convert the AirPassengers time-series data into a numeric matrix first. Numeric matrices in R can have characters as row and column labels, but the content itself must consist of one single mode: numerical. We use the matrix() function to create a numeric matrix consisting of 12 columns, to which we pass the AirPassengers time-series data row by row. Using the argument dimnames = rowColNames, we provide row and column names that we assigned previously to the variable rowColNames, which is a list of two vectors: a series of 12 strings representing the years 1949 to 1960, and a series of strings for the 12 three-letter abbreviations of the months from January to December, respectively.

A simple heat map using levelplot(): Now that we have converted the AirPassengers data into a numeric matrix format and assigned it to the variable air_data, we can go ahead and construct our first heat map using the levelplot() function from the lattice package (a sketch of the calls follows after this list). The levelplot() function creates a simple heat map with a color key on the right-hand side of the map. We can use the argument col.regions = heat.colors to change the default color transition to yellow and red. The x and y axis labels are specified by the xlab and ylab parameters, respectively, and the main parameter gives our heat map its caption. In contrast to most of the other plotting functions in R, the lattice package returns objects, so we have to use the print() function in our script if we want to save the plot to a data file. In an interactive R session, the print() call can be omitted: typing the name of the variable will automatically display the referring object on the screen.

Creating enhanced heat maps with heatmap.2(): Next, we will use the heatmap.2() function to apply a clustering algorithm to the AirPassengers data and to add row and column dendrograms to our heat map. Hierarchical clustering is especially popular in gene expression analyses. It is a very powerful method for grouping data to reveal interesting trends and patterns in the data matrix. Another neat feature of heatmap.2() is that you can display a histogram of the count of the individual values inside the color key by including the argument density.info = NULL in the function call. Alternatively, you can set density.info = "density" for displaying a density plot inside the color key. By adding the argument keysize = 1.8, we are slightly increasing the size of the color key (the default value of keysize is 1.5). Did you notice the missing row dendrogram in the resulting heat map? This is due to the argument dendrogram = "column" that we passed to the heat map function. Similarly, you can type row instead of column to suppress the column dendrogram, or use neither to draw no dendrogram at all.
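Pulling the parameters mentioned above into one place, a hedged sketch of the two plotting calls might look like this (air_data is the matrix described in the text; the label strings and exact argument values in the original script may differ):

# lattice heat map: yellow-red palette, axis labels, and a caption
print(levelplot(air_data,
                col.regions = heat.colors,
                xlab = "Year",
                ylab = "Month",
                main = "Air passengers (in thousands)"))

# gplots heat map: column dendrogram only, histogram inside a slightly larger color key
heatmap.2(air_data,
          dendrogram = "column",
          density.info = "histogram",
          keysize = 1.8)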
By default, heatmap.2() passes a matrix, lmat, to layout(), which has the following content: code The numbers in the preceding matrix specify the locations of the different visual elements on the plot (1 implies heat map, 2 implies row dendrogram, 3 implies column dendrogram, and 4 implies key). If we want to change the position of the key, we have to modify and rearrange those values of lmat that heatmap.2() passes to layout(). For example, if we want to place the color key at the bottom left-hand corner of the heat map, we need to create a new matrix for lmat as follows: code We can construct such a matrix by using the rbind() function and assigning it to lmat: code Furthermore, we have to pass an argument for the column height parameter lhei to heatmap.2(), which will allow us to use our modified lmat matrix for rearranging the color key: code If you don't need a color key for your heat map, you could turn it off by using the argument key = FALSE for heatmap.2() and colorkey = FALSE for levelplot(), respectively. R also has a base function for creating heat maps that does not require you to install external packages and is most advantageous if you can go without a color key. The syntax is very similar to the heatmap.2() function, and all options for heatmap.2() that we have seen in this recipe also apply to heatmap(): code More information on dendrograms and clustering By default, the dendrograms of heatmap.2() are created by a hierarchical agglomerate clustering method, also known as bottom-up clustering. In this approach, all individual objects start as individual clusters and are successively merged until only one single cluster remains. The distance between a pair of clusters is calculated by the farthest neighbor method, also called the complete linkage method, which is based by default on the Euclidean distance of the two points from both clusters that are farthest apart from each other. The computed dendrograms are then reordered based on the row and column means. By modifying the default parameters of the dist() function, we can use another distance measure rather than the Euclidean distance. For example, if we want to use the Manhattan distance measure (based on a grid-like path rather than a direct connection between two objects), we would modify the method parameter of the dist() function and assign it to a variable distance first: code Other options for the method parameter are: euclidean (default), maximum, canberra, binary, or minkowski. To use other agglomeration methods than the complete linkage method, we modify the method parameter in the hclust() function and assign it to another variable cluster. Note the first argument distance that we pass to the hclust() function, which comes from our previous assignment: code By setting the method parameter to ward, R will use Joe H. Ward's minimum variance method for hierarchical clustering. Other options for the method parameter that we can pass as arguments to hclust() are: complete (default), single, average, mcquitty, median, or centroid. To use our modified clustering parameters, we simply call the as.dendrogram() function within heatmap.2() using the variable cluster that we assigned previously: code We can also draw the cluster dendrogram without the heat map by using the plot() function: code To turn off row and column reordering, we need to turn off the dendrograms and set the parameters Colv and Rowv to NA: code Summary This article has helped us create our first heat maps from a small data set provided in R. 
We have used different heat map functions in R to get a first impression of their functionalities. Resources for Article :   Further resources on this subject: Getting started with Leaflet [Article] Moodle 1.9: Working with Mind Maps [Article] Joomla! with Flash: Showing maps using YOS amMap [Article]
Linking Section Access to multiple dimensions

Packt
25 Jun 2013
3 min read
(For more resources related to this topic, see here.) Getting ready Load the following script: Product:LOAD * INLINE [ ProductID, ProductGroup, ProductName 1, GroupA, Great As 2, GroupC, Super Cs 3, GroupC, Mega Cs 4, GroupB, Good Bs 5, GroupB, Busy Bs];Customer:LOAD * INLINE [ CustomerID, CustomerName, Country 1, Gatsby Gang, USA 2, Charly Choc, USA 3, Donnie Drake, USA 4, London Lamps, UK 5, Shylock Homes, UK];Sales:LOAD * INLINE [ CustomerID, ProductID, Sales 1, 2, 3536 1, 3, 4333 1, 5, 2123 2, 2, 45562, 4, 1223 2, 5, 6789 3, 2, 1323 3, 3, 3245 3, 4, 6789 4, 2, 2311 4, 3, 1333 5, 1, 7654 5, 2, 3455 5, 3, 6547 5, 4, 2854 5, 5, 9877];CountryLink:Load Distinct Country, Upper(Country) As COUNTRY_LINKResident Customer;Load Distinct Country, 'ALL' As COUNTRY_LINKResident Customer;ProductLink:Load Distinct ProductGroup, Upper(ProductGroup) As PRODUCT_LINKResident Product;Load Distinct ProductGroup, 'ALL' As PRODUCT_LINKResident Product;//Section Access;Access:LOAD * INLINE [ ACCESS, USERID, PRODUCT_LINK, COUNTRY_LINKADMIN, ADMIN, *, * USER, GM, ALL, ALL USER, CM1, ALL, USA USER, CM2, ALL, UK USER, PM1, GROUPA, ALL USER, PM2, GROUPB, ALL USER, PM3, GROUPC, ALL USER, SM1, GROUPB, UK USER, SM2, GROUPA, USA];Section Application; Note that there is a loop error generated on reload because there is a loop in the data structure. How to do it… Follow these steps to link Section Access to multiple dimensions: Add list boxes to the layout for ProductGroup and Country. Add a statistics box for Sales. Remove // to uncomment the Section Access statement. From the Settings menu, open Document Properties and select the Opening tab. Turn on the Initial Data Reduction Based on Section Access option. Reload and save the document. Close QlikView. Re-open QlikView and open the document. Log in as the Country Manager, CM1, user. Note that USA is the only country. Also, the product group, GroupA, is missing—there are no sales of this product group in USA. Close QlikView and then re-open again. This time, log in as the Sales Manager, SM2. You will not be allowed access to the document. Log into the document as the ADMIN user. Edit the script. Add a second entry for the SM2 user in the Access table as follows: USER, SM2, GROUPA, USA USER, SM2, GROUPB, UK Reload, save, and close the document and QlikView. Re-open and log in as SM2. Note the selections. How it works… Section Access is really quite simple. The user is connected to the data and the data is reduced accordingly. QlikView allows Section Access tables to be connected to multiple dimensions in the main data structure without causing issues with loops. Each associated field acts in the same way as a selection in the layout. The initial setting for the SM2 user contained values that were mutually exclusive. Because of the default Strict Exclusion setting, the SM2 user cannot log in. We changed the script and included multiple rows for the SM2 user. Intuitively, we might expect that, as the first row did not connect to the data, only the second row would connect to the data. However, each field value is treated as an individual selection and all of the values are included. There's more… If we wanted to include solely the composite association of Country and ProductGroup, we would need to derive a composite key in the data set and connect the user to that. In this example, we used the USERID field to test using QlikView logins. However, we would normally use NTNAME to link the user to either a Windows login or a custom login. 
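As a rough sketch of that composite-key idea (not part of the original recipe, and assuming you have first joined Country and ProductGroup onto the Sales table into a resident table called SalesDetail), the script could derive a single link field and reduce on it:

// Add one composite link field to the assumed denormalized detail table
Left Join (SalesDetail)
Load Distinct
    Country,
    ProductGroup,
    Upper(Country) & '|' & Upper(ProductGroup) As ACCESS_LINK
Resident SalesDetail;

Section Access;
Access:
LOAD * INLINE [
    ACCESS, NTNAME, ACCESS_LINK
    ADMIN, MYDOMAIN\ADMIN, *
    USER, MYDOMAIN\SM2, USA|GROUPA
    USER, MYDOMAIN\SM2, UK|GROUPB
];
Section Application;

With this approach, SM2 sees only the USA/GroupA and UK/GroupB combinations rather than every mix of the two fields.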
Resources for Article : Further resources on this subject: Pentaho Reporting: Building Interactive Reports in Swing [Article] Visual ETL Development With IBM DataStage [Article] A Python Multimedia Application: Thumbnail Maker [Article]
IBM Cognos Workspace Advanced

Packt
14 Jun 2013
5 min read
(For more resources related to this topic, see here.) Who should use Cognos Workspace Advanced? With Cognos Workspace Advanced, business users have one tool for creating advanced analyses and reports. The tool, like Query Studio and Analysis Studio, is designed for ease of use and is built on the same platform as the other report development tools in Cognos. Business Insight Advanced/Cognos Workspace Advanced is actually so powerful that it is being positioned more as a light Cognos Report Studio than as a powerful Cognos Query Studio and Cognos Analysis Studio. Comparing to Cognos Query Studio and Cognos Analysis Studio With so many options for business users, how do we know which tool to use? The best approach for making this decision is to consider the similarities and differences between the options available. In order to help us do so, we can use the following table: Feature Query Studio Analysis Studio Cognos Workspace Advanced Ad hoc reporting X   X Ad hoc analysis   X X Basic charting X X X Advanced charting     X Basic filtering X X X Advanced filtering     X Basic calculations X X X Advanced calculations     X Properties pane     X External data     X Freeform design     X As you can see from the table, all three products have basic charting, basic filtering, and basic calculation features. Also, we can see that Cognos Query Studio and Cognos Workspace Advanced both have ad hoc reporting capabilities, while Cognos Analysis Studio and Cognos Workspace Advanced both have ad hoc analysis capabilities. In addition to those shared capabilities, Cognos Workspace Advanced also has advanced charting, filtering, and calculation features. Cognos Workspace Advanced also has a limited properties pane (similar to what you would see in Cognos Report Studio). Furthermore, Cognos Workspace Advanced allows end users to bring in external data from a flat file and merge it with the data from Cognos Connection. Finally, Cognos Workspace Advanced has free-form design capabilities. In other words, you are not limited in where you can add charts or crosstabs in the way that Cognos Query Studio and Cognos Analysis Studio limit you to the standard templates. The simple conclusion after performing this comparison is that you should always use Cognos Workspace Advanced. While that will be true for some users, it is not true for all. With the additional capabilities come additional complexities. For your most basic business users, you may want to keep them using Cognos Query Studio or Cognos Analysis Studio for their ad hoc reporting and ad hoc analysis simply because they are easier tools to understand and use. However, for those business users with basic technical acumen, Cognos Workspace Advanced is clearly the superior option. Accessing Cognos Workspace Advanced I would assume now that, after reviewing the capabilities Cognos Workspace Advanced brings to the table, you are anxious to start using it. We will start off by looking at how to access the product. The first way to access Cognos Workspace Advanced is through the welcome page. On the welcome page, you can get to Cognos Workspace Advanced by clicking on the option Author business reports: This will bring you to a screen where you can select your package. In Cognos Query Studio or Cognos Analysis Studio, you will only be able to select non-dimensional and dimensional packages based on the tool you are using. 
With Cognos Workspace Advanced, because the tool can use both dimensional and non-dimensional packages, you will be prompted with packages for both. The next way to access Cognos Workspace Advanced is through the Launch menu in Cognos Connection. Within the menu, you can simply choose Cognos Workspace Advanced to be taken to the same options for choosing a package. Note, however, that if you have already navigated into a package, it will automatically launch Cognos Workspace Advanced using the very same package. The third way to access Cognos Workspace Advanced is by far the most functional way. You can actually access Cognos Workspace Advanced from within Cognos Workspace by clicking on the Do More... option on a component of the dashboard: When you select this option, the object will expand out and open for editing inside Cognos Workspace Advanced. Then, once you are done editing, you can simply choose the Done button in the upper right-hand corner to return to Cognos Workspace with your newly updated object. For the sake of showing as many features as possible in this chapter, we will launch Cognos Workspace Advanced from the welcome page or from the Launch menu and select a package that has an OLAP data source. For the purpose of following along, we will be using the Cognos BI sample package great_outdoors_8 (or Great Outdoors). When we first access it, we are prompted to choose a package. For these examples, we will choose great_outdoors_8: We are then brought to a splash screen where we can choose Create new or Open existing. We will choose Create new. We are then prompted to pick the type of chart we want to create. As we will see from the following screenshot, our options are: Blank: It starts us off with a completely blank slate List: It starts us off with a list report Crosstab: It starts us off with a crosstab Chart: It starts us off with a chart and loads the chart wizard Financial: It starts us off with a crosstab formatted like a financial report Existing...: It allows us to open an existing report We will choose Blank because we can still add as many of the other objects as we want to later on.
A quick start – OpenCV fundamentals

Packt
12 Jun 2013
8 min read
(For more resources related to this topic, see here.) The OpenCV library has a modular structure, and the following diagram depicts the different modules available in it: A brief description of all the modules is as follows: Module Feature Core A compact module defining basic data structures, including the dense multidimensional array Mat and basic functions used by all other modules. Imgproc An image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on. Video A video analysis module that includes motion estimation, background subtraction, and object tracking algorithms. Calib3d Basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction. Features2d Salient feature detectors, descriptors, and descriptor matchers. Objdetect Detection of objects and instances of the predefined classes; for example, faces, eyes, mugs, people, cars, and so on. Highgui An easy-to-use interface to video capturing, image and video codecs, as well as simple UI capabilities. Gpu GPU-accelerated algorithms from different OpenCV modules. Task 1 – image basics When trying to recreate the physical world around us in digital format via a camera, for example, the computer just sees the image in the form of a code that just contains the numbers 1 and 0. A digital image is nothing but a collection of pixels (picture elements) which are then stored in matrices in OpenCV for further manipulation. In the matrices, each element contains information about a particular pixel in the image. The pixel value decides how bright or what color that pixel should be. Based on this, we can classify images as: Greyscale Color/RGB Greyscale Here the pixel value can range from 0 to 255 and hence we can see the various shades of gray as shown in the following diagram. Here, 0 represents black and 255 represents white: A special case of grayscale is the binary image or black and white image. Here every pixel is either black or white, as shown in the following diagram: Color/RGB Red, Blue, and Green are the primary colors and upon mixing them in various different proportions, we can get new colors. A pixel in a color image has three separate channels— one each for Red, Blue, and Green. The value ranges from 0 to 255 for each channel, as shown in the following diagram: Task 2 – reading and displaying an image We are now going to write a very simple and basic program using the OpenCV library to read and display an image. This will help you understand the basics. Code A simple program to read and display an image is as follows: // opencv header files #include "opencv2/highgui/highgui.hpp" #include "opencv2/core/core.hpp" // namespaces declaration using namespace cv; using namespace std; // create a variable to store the image Mat image; int main( int argc, char** argv ) { // open the image and store it in the 'image' variable // Replace the path with where you have downloaded the image image=imread("<path to image">/lena.jpg"); // create a window to display the image namedWindow( "Display window", CV_WINDOW_AUTOSIZE ); // display the image in the window created imshow( "Display window", image ); // wait for a keystroke waitKey(0); return 0; } Code explanation Now let us understand how the code works. 
Short comments have also been included in the code itself to increase the readability. #include "opencv2/highgui/highgui.hpp" #include "opencv2/core/core.hpp" The preceding two header files will be a part of almost every program we write using the OpenCV library. As explained earlier, the highgui header is used for window creation, management, and so on, while the core header is used to access the Mat data structure in OpenCV. using namespace cv; using namespace std; The preceding two lines declare the required namespaces for this code so that we don't have to use the :: (scope resolution) operator every time for accessing the functions. Mat image; With the above command, we have just created a variable image of the datatype Mat that is frequently used in OpenCV to store images. image=imread("<path to image">/lena.jpg"); In the previous command, we opened the image lena.jpg and stored it in the image variable. Replace <path to image> in the preceding command with the location of that picture on your PC. namedWindow( "Display window", CV_WINDOW_AUTOSIZE ); We now need a window to display our image. So, we use the above function to do the same. This function takes two parameters, out of which the first one is the name of the window. In our case, we would like to name our window Display Window. The second parameter is optional, but it resizes the window based on the size of the image so that the image is not cropped. imshow( "Display window", image ); Finally, we are ready to display our image in the window we just created by using the preceding function. This function takes two parameters out of which the first one is the window name in which the image has to be displayed. In our case, obviously, that will be Display Window . The second parameter is the image variable containing the image that we want to display. In our case, it's the image variable. waitKey(0); Last but not least, it is advised that you use the preceding function in most of the codes that you write using the OpenCV library. If we don't write this code, the image will be displayed for a fraction of a second and the program will be immediately terminated. It happens so fast that you will not be able to see the image. What this function does essentially is that it waits for a keystroke from the user and hence it delays the termination of the program. The delay here is in milliseconds. Output The image can be displayed as follows: Task 3 – resizing and saving an image We are now going to write a very simple and basic program using the OpenCV library to resize and save an image. 
Code The following code helps you to resize a given image: // opencv header files #include "opencv2/highgui/highgui.hpp" #include "opencv2/imgproc/imgproc.hpp" #include "opencv2/core/core.hpp" // namespaces declaration using namespace std; using namespace cv; int main(int argc, char** argv) { // create variables to store the images Mat org, resized,saved; // open the image and store it in the 'org' variable // Replace the path with where you have downloaded the image org=imread("<path to image>/lena.png"); //Create a window to display the image namedWindow("Original Image",CV_WINDOW_AUTOSIZE); //display the image imshow("Original Image",org); //resize the image resize(org,resized,Size(),0.5,0.5,INTER_LINEAR); namedWindow("Resized Image",CV_WINDOW_AUTOSIZE); imshow("Resized Image",resized); //save the image //Replace <path> with your desired location imwrite("<path>/saved.png",resized; namedWindow("Image saved",CV_WINDOW_AUTOSIZE); saved=imread("<path to image>/saved.png"); imshow("Image saved",saved); //wait for a keystroke waitKey(0); return 0; } Code explanation Only the new functions/concepts will be explained in this case. #include "opencv2/imgproc/imgproc.hpp" Imgproc is another useful header that gives us access to the various transformations, color conversions, filters, histograms, and so on. Mat org, resized; We have now created two variables, org and resized, to store the original and resized images respectively. resize(org,resized,Size(),0.5,0.5,INTER_LINEAR); We have used the preceding function to resize the image. The preceding function takes six parameters, out of which the first one is the variable containing the source image to be modified. The second one is the variable to store the resized image. The third parameter is the output image size. In this case we have not specified this, but we have instead used the Size() function, which will automatically calculate it based on the values of the fourth and fifth parameters. The fourth and fifth parameters are the scale factors along the horizontal and vertical axes respectively. The sixth parameter is for choosing the type of interpolation method. We have used the bilinear interpolation, which is the default method. imwrite("<path>/saved.png",final); Finally, using the preceding function, you can save an image to a particular location on our PC. The function takes two parameters, out of which the first one is the location where you want to store the image and the second is the variable in which the image is stored. This function is very useful when you want to perform multiple operations on an image and save the image on your PC for future reference. Replace <path> in the preceding function with your desired location. Output Resizing can be demonstrated through the following output: Summary This section showed you how to perform a few of the basic tasks in OpenCV as well as how to write your first OpenCV program. Resources for Article : Further resources on this subject: OpenCV: Segmenting Images [Article] Tracking Faces with Haar Cascades [Article] OpenCV: Image Processing using Morphological Filters [Article]
Implementing persistence in Redis (Intermediate)

Packt
06 Jun 2013
10 min read
(For more resources related to this topic, see here.) Getting ready Redis provides configuration settings for persistence and for enabling durability of data depending on the project statement. If durability of data is critical If durability of data is not important You can achieve persistence of data using the snapshotting mode, which is the simplest mode in Redis. Depending on the configuration, Redis saves a dump of all the data sets in its memory into a single RDB file. The interval in which Redis dumps the memory can be configured to happen every X seconds or after Y operations. Consider an example of a moderately busy server that receives 15,000 changes every minute over its 1 GB data set in memory. Based on the snapshotting rule, the data will be stored every 60 seconds or whenever there are at least 15,000 writes. So the snapshotting runs every minute and writes the entire data of 1 GB to the disk, which soon turns ugly and very inefficient. To solve this particular problem, Redis provides another way of persistence, Append-only file (AOF), which is the main persistence option in Redis. This is similar to journal files, where all the operations performed are recorded and replayed in the same order to rebuild the exact state. Redis's AOF persistence supports three different modes: No fsync: In this mode, we take a chance and let the operating system decide when to flush the data. This is the fastest of the three modes. fsync every second: This mode is a compromised middle point between performance and durability. Data will be flushed using fsync every second. If the disk is not able to match the write speed, the fsync can take more than a second, in which case Redis delays the write up to another second. So this mode guarantees a write to be committed to OS buffers and transferred to the disk within 2 seconds in the worstcase scenario. fsync always: This is the last and safest mode. This provides complete durability of data at a heavy cost to performance. In this mode, the data needs to be written to the file and synced with the disk using fsync before the client receives an acknowledgment. This is the slowest of all three modes. How to do it... First let us see how to configure snapshotting, followed by the Append-only file method: In Redis, we can configure when a new snapshot of the data set will be performed. For example, Redis can be configured to dump the memory if the last dump was created more than 30 seconds ago and there are at least 100 keys that are modified or created. Snapshotting should be configured in the /etc/redis/6379.conf file. The configuration can be as follows: save 900 1save 60 10000 The first line translates to take a snapshot of data after 900 seconds if at least one key has changed, while the second line translates to snapshotting every 60 seconds if 10,000 keys have been modified in the meantime. The configuration parameter rdbcompression defines whether the RDB file is to be compressed or not. There is a trade-off between the CPU and RDB dump file size. We are interested in changing the dump's filename using the dbfilename parameter. Redis uses the current folder to create the dump files. For convenience, it is advised to store the RDB file in a separate folder. dbfilename redis-snapshot.rdbdir /var/lib/redis/ Let us run a small test to make sure the RDB dump is working. Start the server again. Connect to the server using redis-cli, as we did already. 
To test whether our snapshotting is working, issue the following commands: SET Key ValueSAVE After the SAVE command, a file should be created in the folder /var/lib/redis with the name redis-snapshot.rdb. This confirms that our installation is able to take a snapshot of our data into a file. Now let us see how to configure persistence in Redis using the AOF method: The configuration for persistence through AOF also goes into the same file located in /etc/redis/6379.conf. By default, the Append-only mode is not enabled. Enable it using the appendonly parameter. appendonly yes Also, if you would like to specify a filename for the AOF log, uncomment the line and change the filename. appendfilename redis-aof.aof The appendfsync everysec command provides a good balance between performance and durability. appendfsync everysec Redis needs to know when it has to rewrite the AOF file. This will be decided based on two configuration parameters, as follows: auto-aof-rewrite-percentage 100auto-aof-rewrite-min-size 64mb Unless the minimum size is reached and the percentage of the increase in size when compared to the last rewrite is less than 100 percent, the AOF rewrite will not be performed. How it works... First let us see how snapshotting works. When one of the criteria is met, Redis forks the process. The child process starts writing the RDB file to the disk at the folder specified in our configuration file. Meanwhile, the parent process continues to serve the requests. The problem with this approach is that the parent process stores the keys, which change during this snapshotting by the child, in the extra memory. In the worst-case scenario, if all the keys are modified, the memory usage spikes to roughly double. Caution Be aware that the bigger the RDB file, the longer it takes Redis to restore the data on startup. Corruption of the RDB file is not possible as it is created by the append-only method from the data in Redis's memory, by the child process. The new RDB file is created as a temporary file and is then renamed to the destination file using the atomic rename system call once the dump is completed. AOF's working is simple. Every time a write operation is performed, the command operation gets logged into a logfile. The format used in the logfile is the same as the format used by clients to communicate to the server. This helps in easy parsing of AOF files, which brings in the possibility of replaying the operation in another Redis instance. Only the operations that change the data set are written to the log. This log will be used on startup to reconstruct the exact data. As we are continuously writing the operations into the log, the AOF file explodes in size as compared to the amount of operations performed. So, usually, the size of the AOF file is larger than the RDB dump. Redis manages the increasing size of the data log by compacting the file in a non-blocking manner periodically. For example, say a specific key, key1, has changed 100 times using the SET command. In order to recreate the final state in the last minute, only the last SET command is required. We do not need information about the previous 99 SET commands. This might look simple in theory, but it gets complex when dealing with complex data structures and operations such as union and intersection. Due to this complexity, it becomes very difficult to compress the existing file. To reduce the complexity of compacting the AOF, Redis starts with the data in the memory and rewrites the AOF file from scratch. 
This is more similar to the snapshotting method. Redis forks a child process that recreates the AOF file and performs an atomic rename to swap the old file with a new one. The same problem, of the requirement of extra memory for operations performed during the rewrite, is present here. So the memory required can spike up to two times based on the operations while writing an AOF file. There's more... Both snapshotting and AOF have their own advantages and limitations, which makes it ideal to use both at the same time. Let us now discuss the major advantages and limitations in the snapshotting method. Advantages of snapshotting The advantages of configuring snapshotting in Redis are as follows: RDB is a single compact file that cannot get corrupted due to the way it is created. It is very easy to implement. This dump file is perfect to take backups and for disaster recovery of remote servers. The RDB file can just be copied and saved for future recoveries. In comparison, this approach has little or no influence over performance as the only work the parent process needs to perform is forking a child process. The parent process will never perform any disk operations; they are all performed by the child process. As an RDB file can be compressed, it provides a faster restart when compared to the append-only file method. Limitations of snapshotting Snapshotting, in spite of the advantages mentioned, has a few limitations that you should be aware of: The periodic background save can result in significant loss of data in case of server or hardware failure. The fork() process used to save the data might take a moment, during which the server will stop serving clients. The larger the data set to be saved, the longer it takes the fork() process to complete. The memory needed for the data set might double in the worst-case scenario, when all the keys in the memory are modified while snapshotting is in progress. What should we use? Now that we have discussed both the modes of persistence Redis provides us with, the big question is what should we use? The answer to this question is entirely based on our application and requirements. In cases where we expect good durability, both snapshotting and AOF can be turned on and be made to work in unison, providing us with redundant persistence. Redis always restores the data from AOF wherever applicable, as it is supposed to have better durability with little loss of data. Both RDB and AOF files can be copied and stored for future use or for recovering another instance of Redis. In a few cases, where performance is very critical, memory usage is limited, and persistence is also paramount, persistence can be turned off completely. In these cases, replication can be used to get durability. Replication is a process in which two Redis instances, one master and one slave, are in sync with the same data. Clients are served by the master, and the master server syncs the data with a slave. Replication setup for persistence Consider a setup as shown in the preceding image; that is: Master instance with no persistence Slave instance with AOF enabled In this case, the master does not need to perform any background disk operations and is fully dedicated to serve client requests, except for a trivial slave connection. The slave server configured with AOF performs the disk operations. As mentioned before, this file can be used to restore the master in case of a disaster. 
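A minimal configuration sketch of that master/slave split might look as follows (the host address, port, and filenames are placeholders; on the Redis versions current at the time of writing the directive is slaveof, later renamed to replicaof):

# master (redis-master.conf): persistence disabled, dedicated to serving clients
save ""
appendonly no

# slave (redis-slave.conf): carries the durability work
slaveof 192.0.2.10 6379
appendonly yes
appendfilename redis-aof.aof
appendfsync everysec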
Persistence in Redis is a matter of configuration, balancing the trade-off between performance, disk I/O, and data durability. If you are looking for more information on persistence in Redis, you will find the article by Salvatore Sanfilippo at http://oldblog.antirez.com/post/redis-persistence-demystified.html interesting. Summary This article helps you to understand the persistence option available in Redis, which could ease your efforts of adding Redis to your application stack. Resources for Article : Further resources on this subject: Using Execnet for Parallel and Distributed Processing with NLTK [Article] Parsing Specific Data in Python Text Processing [Article] Python Text Processing with NLTK: Storing Frequency Distributions in Redis [Article]
Optimizing Programs

Packt
29 May 2013
6 min read
(For more resources related to this topic, see here.) Using transaction SAT to find problem areas In this recipe, we will see the steps required to analyze the execution of any report, transaction, or function module using the transaction SAT. Getting ready For this recipe, we will analyze the runtime of a standard program RIBELF00 (Display Document Flow Program). The program selection screen contains a number of fields. We will execute the program on the order number (aufnr) and see the behavior. How to do it... For carrying out runtime analysis using transaction SAT, proceed as follows: Call transaction SAT. The screen appears as shown: Enter a suitable name for the variant (in our case, YPERF_VARIANT) and click the Create button below it. This will take you to the Variant creation screen. On the Duration and Type tab, switch on Aggregation by choosing the Per Call Position radio-button. Then, click on the Statements tab. On the Statements tab, make sure Internal Tables, the Read Operations checkbox and the Change Operations checkbox, and the Open SQL checkbox under Database Access are checked. Save your variant. Come back to the main screen of SAT. Make sure that within Data Formatting on the initial screen of SAT, the checkbox for Determine Names of Internal Tables is selected. Next, enter the name of the program that is to be traced in the field provided (in our case, it is RIBELF00). Then click the   button. The screen of the program appears as shown. We will enter an order number range and execute the program. Once the program output is generated, click on the Back key to come back to program selection screen. Click on the Back key once again to generate the evaluation results. How it works... We carried out the execution of the program through the transaction SAT and the evaluation results were generated. On the left are the Trace Results (in tree form) listing the statements/ events with the most runtime. These are like a summary report of the entire measurement of the program. They are listed in descending order of the Net time in microseconds and the percentage of the total time. For example, in our case, the OPEN CURSOR event takes 68 percent of the total runtime of the program. Selecting the Hit List tab will show the top time consumer components of the program. In this example, the access of database tables AFRU and VBAK takes most of the time. Double-clicking any item in the Trace Results window on the left-hand side will display (in the Hit List area on the right-hand pane) details of contained items along with execution time of each item. From the Hit List window, double-clicking a particular item will take us to the relevant line in the program code. For example, when we double-click the Open Cursor VBAK line, it will take us to the corresponding program code. We have carried out analysis with Aggregation switched on. The switching on of Aggregation shows one single entry for a multiple calls of a particular line of code. Because of this, the results are less detailed and easier to read, since the hit list and the call hierarchy in the results are much more simplified. Also within the results, by default, the names of the internal table used are not shown. In order for the internal table names to appear in the evaluation result, the Determine Names checkbox of Internal tables indicator is checked. As a general recommendation, the runtime analysis should be carried out several times for best results. 
The reason being that the DB-measurement time could be dependent on a variety of factors, such as system load, network performance, and so on. Creation of secondary indexes in database tables Very often, the cause of a long running report is full-scan of a database table specified within the code, mainly because no suitable index exists. In this recipe, we will see the steps required in creating a new secondary index in database table for performance improvement. Creating indexes lets you optimize standard reports as well as your own reports. In this recipe, we will create a secondary index on a test table ZST9_VBAK (that is simply a copy of VBAK). How to do it... For creating a secondary index, proceed as follows: Call transaction SE11. Enter the name of the table in the field provided, in our case, ZST9_VBAK. Then click the Display button. This will take you to the Display Table screen. Next, choose the menu path Goto | Indexes. This will display all indexes that currently exist for the table. Click the Create button and then choose the option Create Extension Index The dialog box appears. Enter a three-digit name for the index. Then, press Enter. This will take you to the extension index maintenance screen. On the top part, enter the short description in the Short Description field provided. We will create a non-unique index so the Non-unique index radio button is selected (on the middle part of the screen). On the lower part of the screen, specify the field names to be used in the index. In our case, we use MANDT and AUFNR . Then, activate your index using keys Ctrl + F3. The index will be created in the database with appropriate message of creation shown below Status. How it works... This will create the index on the database. Since we created an extension index, the index will not be overwritten by SAP during an upgrade. Now any report that accesses ZST9_VBAK table specifying MANDT and AUFNR in the WHERE clause, will take advantage of index scan using our new secondary index. There's more... It is recommended by SAP that the index be first created in development system and then transport to quality, and to the production system. Secondary indexes are not automatically generated on target systems after being transported. We should check the status on the Activation Log in the target systems, and use the Database Utility to manually activate the index in question. A secondary index, preferably, must have fields that are not common (or as much as uncommon as possible) with other indexes. Too many redundant secondary indexes (that is, too many common fields across several indexes) on a table has a negative impact on performance. For instance, a table with 10 secondary indexes is sharing more than three fields. In addition, tables that are rarely modified (and very often read) are the ideal candidates for secondary indexes. See also http://help.sap.com/saphelp_erp2005/helpdata/EN/85/685a41cdbf80 47e10000000a1550b0/content.htm http://help.sap.com/saphelp_nw04/helpdata/en/cf/21eb2d446011d1 89700000e8322d00/frameset.htmhttp://docs.oracle.com/cd/ SELECT clause E17076_02/html/programmer_reference/am_second.html http://forums.sdn.sap.com/thread.jspa?threadID=1469347
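To see the index pay off in code, here is a minimal ABAP sketch (the report and variable names are hypothetical; the table is the ZST9_VBAK copy used in this recipe). Because the WHERE clause restricts on AUFNR, and the client field MANDT is added implicitly, the optimizer can choose the new extension index instead of a full table scan:

REPORT zdemo_secondary_index.

DATA: lt_orders TYPE STANDARD TABLE OF zst9_vbak,
      lv_aufnr  TYPE aufnr VALUE '000012345678'.

* The WHERE clause matches the indexed fields (MANDT is handled by the
* client mechanism), so the database can use the new secondary index.
SELECT * FROM zst9_vbak
  INTO TABLE lt_orders
  WHERE aufnr = lv_aufnr.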
Techniques for Creating a Multimedia Database

Packt
17 May 2013
37 min read
(For more resources related to this topic, see here.) Tier architecture The rules surrounding technology are constantly changing. Decisions and architectures based on current technology might easily become out of date with hardware changes. To best understand how multimedia and unstructured data fit and can adapt to the changing technology, it's important to understand how and why we arrived at our different current architectural positions. In some cases we have come full circle and reinvented concepts that were in use 20 years ago. Only by learning from the lessons of the past can we see how to move forward to deal with this complex environment. In the past 20 years a variety of architectures have come about in an attempt to satisfy some core requirements: Allow as many users as possible to access the system Ensure those users had good performance for accessing the data Enable those users to perform DML (insert/update/delete) safely and securely (safely implies ability to restore data in the event of failure) The goal of a database management system was to provide an environment where these points could be met. The first databases were not relational. They were heavily I/O focused as the computers did not have much memory and the idea of caching data was deemed to be too expensive. The servers had kilobytes and then eventually, megabytes of memory. This memory was required foremost by the programs to run in them. The most efficient architecture was to use pointers to link the data together. The architecture that emerged naturally was hierarchical and a program would navigate the hierarchy to find rows related to each other. Users connected in via a dumb terminal. This was a monitor with a keyboard that could process input and output from a basic protocol and display it on the screen. All the processing of information, including how the screen should display it (using simple escape sequence commands), was controlled in the server. Traditional no tier The mainframes used a block mode structure, where the user would enter a screen full of data and press the Enter key. After doing this the whole screen of information was sent to the server for processing. Other servers used asynchronous protocols, where each letter, as it was typed, was sent to the server for processing. This method was not as efficient as block mode because it required more server processing power to handle the data coming in. It did provide a friendlier interface for data entry as mistakes made could be relayed immediately back to the user. Block mode could only display errors once the screen of data was sent, processed, and returned. As more users started using these systems, the amount of data in them began to grow and the users wanted to get more intelligence out of the data entered. Requirements for reporting appeared as well as the ability to do ad hoc querying. The databases were also very hard to maintain and enhance as the pointer structure linked everything together tightly. It was very difficult to perform maintenance and changes to code. In the 1970s the relational database concept was formulated and it was based on sound mathematical principles. In the early 1980s the first conceptual relational databases appeared in the marketplace with Oracle leading the way. The relational databases were not received well. They performed poorly and used a huge amount of server resources. 
Though they achieved a stated goal of being flexible and adaptable, enabling more complex applications to be built quicker, the performance overheads of performing joins proved to be a major issue. Benefits could be seen in them, but they could never be seen as being able to be used in any environment that required tens to hundreds or thousands of concurrent users. The technology wasn't there to handle them. To initially achieve better performance the relational database vendors focused on using a changing hardware feature and that was memory. By the late 1980s the computer servers were starting to move from 16 bit to 32 bit. The memory was increasing and there was drop in the price. By adapting to this the vendors managed to take advantage of memory and improved join performance. The relational databases in effect achieved a balancing act between memory and disk I/O. Accessing a disk was about a thousand times slower than accessing memory. Memory was transient, meaning if there was a power failure and if there was data stored in memory, it would be lost. Memory was also measured in megabytes, but disk was measured in gigabytes. Disk was not transient and generally reliable, but still required safeguards to be put in place to protect from disk failure. So the balancing act the databases performed involved caching data in memory that was frequently accessed, while ensuring any modifications made to that data were always stored to disk. Additionally, the database had to ensure no data was lost if a disk failed. To improve join performance the database vendors came up with their own solutions involving indexing, optimization techniques, locking, and specialized data storage structures. Databases were judged on the speed at which they could perform joins. The flexibility and ease in which applications could be updated and modified compared to the older systems soon made the relational database become popular and must have. As all relational databases conformed to an international SQL standard, there was a perception that a customer was never locked into a propriety system and could move their data between different vendors. Though there were elements of truth to this, the reality has shown otherwise. The Oracle Database key strength was that you were not locked into the hardware and they offered the ability to move a database between a mainframe to Windows to Unix. This portability across hardware effectively broke the stranglehold a number of hardware vendors had, and opened up the competition enabling hardware vendors to focus on the physical architecture rather than the operating system within it. In the early 1990s with the rise in popularity of the Apple Macintosh, the rules changed dramatically and the concept of a user friendly graphical environment appeared. The Graphical User Interface (GUI) screen offered a powerful interface for the user to perform data entry. Though it can be argued that data entry was not (and is still not) as fast as data entry via a dumb terminal interface, the use of colors, varying fonts, widgets, comboboxes, and a whole repository of specialized frontend data entry features made the interface easier to use and more data could be entered with less typing. Arguably, the GUI opened up the computer to users who could not type well. The interface was easier to learn and less training was needed to use the interface. Two tier The GUI interface had one major drawback; it was expensive to run on the CPU. 
Some vendors experimented with running the GUI directly on the server (the Solaris operating system offered this capability), but it become obvious that this solution would not scale. To address this, the two-tier architecture was born. This involved using the GUI, which was running on an Apple Macintosh or Microsoft Windows or other Windows environment (Microsoft Windows wasn't the only GUI to run on Intel platforms) to handle the display processing. This was achieved by moving the application displayed to the computer that the user was using. Thus splitting the GUI presentation layer and application from the database. This seemed like an ideal solution as the database could now just focus on handling and processing SQL queries and DML. It did not have to be burdened with application processing as well. As there were no agreed network protocols, a number had to be used, including named pipes, LU6.2, DECNET, and TCP/IP. The database had to handle language conversion as the data was moved between the client and the server. The client might be running on a 16-bit platform using US7ASCII as the character set, but the server might be running on 32-bit using EBCDIC as the character set. The network suddenly became very complex to manage. What proved to be the ultimate show stopper with the architecture had nothing to do with the scalability of client or database performance, but rather something which is always neglected in any architecture, and that is the scalability of maintenance. Having an environment of a hundred users, each with their own computer accessing the server, requires a team of experts to manage those computers and ensure the software on it is correct. Application upgrades meant upgrading hundreds of computers at the same time. This was a time-consuming and manual task. Compounded by this is that if the client computer is running multiple applications, upgrading one might impact the other applications. Even applying an operating system patch could impact other applications. Users also might install their own software on their computer and impact the application running on it. A lot of time was spent supporting users and ensuring their computers were stable and could correctly communicate with the server. Three tier Specialized software vendors tried to come to the rescue by offering the ability to lock down a client computer from being modified and allowing remote access to the computer to perform remote updates. Even then, the maintenance side proved very difficult to deal with and when the idea of a three tier architecture was pushed by vendors, it was very quickly adopted as the ideal solution to move towards because it critically addressed the maintenance issue. In the mid 1990s the rules changed again. The Internet started to gain in popularity and the web browser was invented. The browser opened up the concept of a smart presentation layer that is very flexible and configured using a simple mark up language. The browser ran on top of the protocol called HTTP, which uses TCP/IP as the underlying network protocol. The idea of splitting the presentation layer from the application became a reality as more applications appeared in the browser. The web browser was not an ideal platform for data entry as the HTTP protocol was stateless making it very hard to perform transactions in it. The HTTP protocol could scale. The actual usage involved the exact same concepts as block mode data entry performed on mainframe computers. 
In a web browser all the data is entered on the screen, and then sent in one go to the application handling the data. The web browser also pushed the idea that the operating system the client is running on is immaterial. The web browsers were ported to Apple computers, Windows, Solaris, and Unix platforms. The web browser also introduced the idea of standard for the presentation layer. All vendors producing a web browser had to conform to the agreed HTML standard. This ensured that anyone building an application that confirmed to HTML would be able to run on any web browser. The web browser pushed the concept that the presentation layer had to run on any client computer (later on, any mobile device as well) irrespective of the operating system and what else was installed on it. The web browser was essentially immune from anything else running on the client computer. If all the client had to use was a browser, maintenance on the client machine would be simplified. HTML had severe limitations and it was not designed for data entry. To address this, the Java language came about and provided the concept of an applet which could run inside the browser, be safe, and provide an interface to the user for data entry. Different vendors came up with different architectures for splitting their two tier application into a three tier one. Oracle achieved this by taking their Oracle Forms product and moving it to the middle application tier, and providing a framework where the presentation layer would run as a Java applet inside the browser. The Java applet would communicate with a process on the application server and it would give it its own instructions for how to draw the display. When the Forms product was replaced with JDeveloper, the same concept was maintained and enhanced. The middle tier became more flexible and multiple middle application tiers could be configured enabling more concurrent users. The three tier architecture has proven to be an ideal environment for legacy systems, giving them a new life and enabling them be put in an environment where they can scale. The three tier environment has a major flaw preventing it from truly scaling. The flaw is the bottleneck between the application layer and the database. The three tier environment also is designed for relational databases. It is not designed for multimedia databases.In the architecture if the digital objects are stored in the database, then to be delivered to the customer they need to pass through the application-database network (exaggerating the bottleneck capacity issues), and from there passed to the presentation layer. Those building in this environment naturally lend themselves to the concept that the best location for the digital objects is the middle tier. This then leads to issues of security, backing up, management, and all the issues previously cited for why storing the digital objects in the database is ideal. The logical conclusion to this is to move the database to the middle tier to address this. In reality, the logical conclusion is to move the application tier back into the database tier. Virtualized architecture In the mid 2000s the idea of a virtualization began to appear in the marketplace. A virtualization was not really a new idea and the concept has existed on the IBM MVS environment since the late 1980s. What made this virtualization concept powerful was that it could run Windows, Linux, Solaris, and Mac environments within them. 
A virtualized environment was basically the ability to run a complete operating system within another operating system. If the computer server had sufficient power and memory, it could run multiple virtualizations (VMs). We can take the snapshot of a VM, which involves taking a view of the disk and memory and storing it. It then became possible to rollback to the snapshot. A VM could be easily cloned (copied) and backed up. VMs could also be easily transferred to different computer servers. The VM was not tied to a physical server and the same environment could be moved to new servers as their capacity increased. A VM environment became attractive to administrators simply because they were easy to manage. Rather than running five separate servers, an administrator could have the one server with five virtualizations in it. The VM environment entered at a critical moment in the evolution of computer servers. Prior to 2005 most computer servers had one or two CPUs in them. The advanced could have as many as 64 (for example, the Sun E10000), but generally, one or two was the simplest solution. The reason was that computer power was doubling every two years following Moore's law. By around 2005 the market began to realize that there was a limit to the speed of an individual CPU due to physical limitations in the size of the transistors in the chips. The solution was to grow the CPUs sideways and the concept of cores came about. A CPU could be broken down into multiple cores, where each one acted like a separate CPU but was contained in one chip. With the introduction of smart threading, the number of virtual cores increased. A single CPU could now simulate eight or more CPUs. This concept has changed the rules. A server can now run with a large number of cores whereas 10 years ago it was physically limited to one or two CPUs. If a process went wild and consumed all the resources of one CPU, it impacted all users. In the multicore CPU environment, a rogue process will not impact the others. In a VM the controlling operating system (which is also called a hypervisor, and can be hardware, firmware, or software centric) can enable VMs to be constrained to certain cores as well as CPU thresholds within that core. This allows a VM to be fenced in. This concept was taken by Amazon and the concept of the cloud environment formed. This architecture is now moving into a new path where users can now use remote desktop into their own VM on a server. The user now needs a simple laptop (resulting in the demise of the tower computer) to use remote desktop (or equivalent) into the virtualization. They then become responsible for managing their own laptop, and in the event of an issue, it can be replaced or wiped and reinstalled with a base operating system on it. This simplifies the management. As all the business data and application logic is in the VM, the administrator can now control it, easily back it up, and access it. Though this VM cloud environment seems like a good solution to resolving the maintenance scalability issue, a spanner has been thrown in the works at the same time as VMs are becoming popular, so was the evolution of the mobile into a portable hand held device with applications running on it. Mobile applications architecture The iPhone, iPad, Android, Samsung, and other devices have caused a disruption in the marketplace as to how the relationship between the user and the application is perceived and managed. 
These devices are simpler and, on the face of it, employ a variety of architectures including two tier and three tier. Quality control of the application is managed by having an independent and separate environment where the user can obtain their application for the mobile device. The strict controls Apple employs for using iTunes are primarily to ensure that Trojan code or viruses are not embedded in the application, resulting in a mobile device not requiring complex and constantly updating anti-virus software. Though the interface is not ideal for heavy data entry, the applications are naturally designed to be very friendly and use touch screen controls. The low cost combined with the simple interface has made these devices an ideal product for most people, and they are replacing the need for a laptop in a number of cases. Application vendors whose applications naturally lend themselves to this environment are taking full advantage of it to provide a powerful interface for clients to use.

The result is that there are two architectures today that exist and are moving in different directions. Each one is popular and resolves certain issues. Each has different interfaces, and when building and configuring a storage repository for digital objects, both these environments need to be taken into consideration.

For a multimedia environment the ideal solution to implement the application is based on the Web. This is because the web environment over the last 15 years has evolved into one which is very flexible and adaptable for dealing with the display of those objects. From the display of digital images to streaming video, the web browser (sometimes with plugins to improve the display) is ideal. This includes the display of documents. The browser environment, though, is not strong for the editing of these digital objects. Adobe Photoshop, Gimp, Garage Band, Office, and a whole suite of other products are available that are designed to edit each type of digital object perfectly. This means that currently the editing of those digital objects requires a different solution to the loading, viewing, and delivery of those digital objects. There is no single right tier architecture for managing digital objects.

The N-Tier model moves the application and database back into the database tier. An HTTP server can also be located in this tier, or for higher availability it can be located externally. Optimal performance is achieved by locating the application as close to the database as possible. This reduces the network bottleneck. By locating the application within the database (in Oracle this is done by using PL/SQL or Java) an ideal environment is configured where there is no overhead between the application and database. The N-Tier model also supports the concept of having the digital objects stored outside the environment and delivered using other methods. This could include a streaming server. The N-Tier model also supports the concept of transformation servers. Scalability is achieved by adding more tiers and spreading the database between them. The model also deals with the issue of the connection to the Internet becoming a bottleneck: a database server in the tier is moved to another network to help balance the load. For Oracle this can be done using RAC to achieve a form of transparent scalability. In most situations, scalability at the server is achieved using manual methods, typically a form of application partitioning.
Basic database configuration concepts

When a database administrator first creates a database that they know will contain digital objects, they will be confronted with some basic database configuration questions covering key sizing features of the database.

When looking at the Oracle Database there are a number of physical and logical structures built inside the database. To avoid confusion with other database management systems, it's important to note that an Oracle Database is a collection of schemas, whereas in other database management systems the term database equates to exactly one schema. This confusion has caused a lot of issues in the past. An Oracle Database administrator will say it can take 30 minutes to an hour to create a database, whereas a SQL Server administrator will say it takes seconds to create a database. In Oracle, creating a schema (the equivalent of a SQL Server database) also takes seconds to perform.

For the physical storage of tables, the Oracle Database is composed of logical structures called tablespaces. The tablespace is designed to provide a transparent layer between the developer creating a table and the physical disk system, and to ensure the two are independent. Data in a table that resides in a tablespace can span multiple disks, a disk subsystem, or a network storage system. A subsystem equating to a RAID structure is covered in greater detail at the end of this article.

A tablespace is composed of many physical datafiles. Each datafile equates to one physical file on the disk. The goal when creating a datafile is to ensure its allocation of storage is contiguous, in that the operating system doesn't split its location into different areas on the disk (RAID and NAS structures store the data in different locations based on their core structure, so this rule does not apply to them). A contiguous file will result in less disk activity being performed when full tablespace scans are performed. In some cases, especially when reading in very large images, this can improve performance.

A datafile is divided (when using locally managed tablespaces, the default in Oracle) into fixed-size extents. Access to the extents is controlled via a bitmap, which is managed in the header of the tablespace (which will reside on a datafile). An extent is based on the core Oracle block size. So if the extent is 128 KB and the database block size is 8 KB, 16 Oracle blocks will exist within the extent.

An Oracle block is the smallest unit of storage within the database. Blocks are read into memory for caching and updated, and the changes are stored in the redo logs. Even though the Oracle block is the smallest unit of storage, as a datafile is an operating system file, the unit of storage at the filesystem level can differ based on the type of server filesystem (UNIX can be UFS and Windows can be NTFS). The default in Windows was once 512 bytes, but with NTFS it can be as high as 64 KB. This means every time a request is made to the disk to retrieve data from the filesystem it does a read to return this amount of data. So if the Oracle block size was 8 KB and the filesystem block size was 64 KB, when Oracle requests a block to be read in, the filesystem will read in 64 KB, return the 8 KB requested, and discard the rest. Most filesystems cache this data to improve performance, but this example highlights how in some cases not balancing the database block size with the filesystem block size can result in wasted I/O.
The actual answer to this is operating system and filesystem dependent, and it also depends on whether Oracle is doing read aheads (using the init.ora parameter db_file_multiblock_read_count).

When Oracle introduced the Exadata they put forward the idea of putting smarts into the disk layer. Rather than the database working out how best to retrieve the physical blocks of data, the database passes a request for information to the disk system. As the Exadata knows about its own disk performance, channel speed, and I/O throughput, it is in a much better position for working out the optimal method for extracting the data. It then works out the best way of retrieving it based on the request (which can be a query). In some cases it might do a full table scan because it can process the blocks faster than if it used an index. It now becomes a smart disk system rather than a dumb/blind one. This capability has changed the rules for how a database works with the underlying storage system.

ASM (Automatic Storage Management)

In Oracle 10g, Oracle introduced ASM primarily to improve the performance of Oracle RAC (clustered systems, where multiple separate servers share the same database on the same disk). It replaces the server filesystem and can handle mirroring and load balancing of datafiles. ASM takes the filesystem and operating system out of the equation and enables the database administrator to have a different degree of control over the management of the disk system.

Block size

The database block size is the fundamental unit of storage within an Oracle Database. Though the database can support different block sizes, a tablespace is restricted to one fixed block size. The block sizes available are 4 KB, 8 KB, 16 KB, and 32 KB (a 32 KB block size is valid only on 64-bit platforms).

The current tuning mentality says it's best to have one block size for the whole database. This is based on the idea that one block size makes it easier to manage the SGA and ensure that memory isn't wasted. If multiple block sizes are used, the database administrator has to partition the SGA into multiple areas and assign each a block size. So if the administrator decided to have the database at 8 KB and 16 KB, they would have to set up database startup parameters indicating the size of each:

DB_8K_CACHE_SIZE = 2G
DB_16K_CACHE_SIZE = 1G

The problem that an administrator faces is that it can be hard to match memory usage to table usage. In the above scenario the tables residing in the 8 KB blocks might be accessed a lot more than the 16 KB ones, meaning the memory needs to be adjusted to deal with that. This balancing act of tuning invariably results in the decision that, unless exceptional situations warrant its use, it's best to keep to the same database block size across the whole database. This makes the job of tuning simpler.

As is always the case when dealing with unstructured data, the rules change. The current thinking is that it's more efficient to store the data in a large block size. This ensures there is less wasted overhead and fewer block reads to read in a row of data. The challenge is that the size of the unstructured data can vary dramatically. It's realistic for an image thumbnail to be under 4 KB in size. This makes it an ideal candidate to be stored in the row with the other relational data. Even if an 8 KB block size is used, the thumbnail and other relational data might happily exist in the one block. A photo might be 10 MB in size, requiring a large number of blocks to be used to store it.
If a 16 KB block size is used, it requires about 64 blocks to store 1 MB (assuming there is some overhead that requires overall extra storage for the block header). An 8 KB block size requires about 130 blocks. If you have to store 10 MB, the number of blocks increases 10 times. For an 8 KB block size, that is over 1,300 reads for one small-sized 10 MB image. With images now coming close to 100 MB in size, this figure again increases by a factor of 10. It soon becomes obvious that a very large block size is needed. When storing video at over 4 GB in size, even a 32 KB block size seems too small.

As is covered later in the article, unstructured data stored in an Oracle blob does not have to be cached in the SGA. In fact, it's discouraged, because in most situations the data is not likely to be accessed on a frequent basis. This generally holds true, but there are cases, especially with video, where it does not, and this situation is covered later. Under the assumption that the thumbnails are accessed frequently and should be cached, and the originals are accessed infrequently and should not be cached, the conclusion is that it now becomes practical to split the SGA in two. The unstructured, uncached data is stored in a tablespace using a large block size (32 KB) and the remaining data is stored in a more acceptable and reasonable 8 KB block. The SGA cache for the 32 KB block size is kept to a bare minimum as it will not be used, thus bypassing the issue of perceived wasted memory when splitting the SGA in two.

In the following table a simple test was done using three tablespace block sizes. The aim was to see if the block size would impact load and read times. The load involved reading in 67 TIF images totaling 3 GB in size. The result was that the tablespace block size made no statistically significant difference. The test was done using a 50 MB extent size and, as shown in the next segment, this size will impact performance. So to correctly understand how important block size can be, one has to look at not only the block size but also the extent size.

Details of the environment used to perform these tests:

CREATE TABLESPACE tbls_name BLOCKSIZE 4096/8192/16384
EXTENT MANAGEMENT LOCAL UNIFORM SIZE 50M
segment space management auto
datafile 'directory/datafile' size 5G reuse;

The following table compares the various block sizes:

Tablespace block size | Blocks | Extents | Load time    | Read time
4 KB                  | 819200 | 64      | 3.49 minutes | 1.02 minutes
8 KB                  | 403200 | 63      | 3.46 minutes | 0.59 minutes
16 KB                 | 201600 | 63      | 3.55 minutes | 0.59 minutes
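To make the block-count arithmetic above concrete, here is a minimal Python sketch (not part of the original article) that reproduces it. It ignores block header overhead, which is why the real figures quoted earlier are slightly higher (roughly 130 rather than 128 blocks for 1 MB at an 8 KB block size).

# Back-of-the-envelope block counts for storing an object of a given
# size at each supported Oracle block size (header overhead ignored).
def blocks_needed(object_kb, block_kb):
    """Minimum number of whole blocks required, rounded up."""
    return -(-object_kb // block_kb)  # ceiling division on integers

objects_kb = [
    ("4 KB thumbnail", 4),
    ("1 MB image", 1024),
    ("10 MB photo", 10 * 1024),
    ("100 MB image", 100 * 1024),
    ("4 GB video", 4 * 1024 * 1024),
]
block_sizes_kb = [4, 8, 16, 32]

for name, size_kb in objects_kb:
    counts = ", ".join("%d KB: %d" % (b, blocks_needed(size_kb, b))
                       for b in block_sizes_kb)
    print("%-16s -> %s" % (name, counts))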
Tests were done to determine what the optimal fragment size was when AUTOALLOCATE was not used. The AUTOALLOCATE is a more set-and-forget method and one goal was to see if this clause was as efficient as manually setting it. Locally managed tablespace UNIFORM extent size Covers testing performed to try to find an optimal extent and block size. The results showed that a block size of 16384 (16 KB) is ideal, though 8192 (8 KB) is acceptable. The block size of 32 KB was not tested. The administrator, who might be tempted to think the larger the extent size, the better the performance, would be surprised that the results show that this is not always the case and an extent size between 50 MB-200 MB is optimal. For reads with SECUREFILES the number of extents was not a major performance factor but it was for writes. When compared to the AUTOALLOCATE clause, it was shown there was no real performance improvement or loss when used. The testing showed that an administrator can use this clause knowing they will get a good all round result when it comes to performance. The syntax for configuration is as follows: EXTENT MANAGEMENT LOCAL AUTOALLOCATE segment space management auto Repeated tests showed that this configuration produced optimal read/write times without the database administrator having to worry about what the extent size should be. For a 300 GB tablespace it produced a similar number of extents as when a 50M extent size was used. As has been covered, once an image is loaded it is rare that it is updated. A relational database fragmentation within a tablespace is caused by repeated creation/dropping of schema objects and extents of different sizes, resulting in physical storage gaps, which are not easily reused. Storage is lost. This is analogous to the Microsoft Windows environment with its disk storage. After a period of time, the disk becomes fragmented making it hard to find contiguous storage and locate similar items together. Locating all the pieces in a file as close together as possible can dramatically reduce the number of disk reads required to read it in. With NTFS (a Microsoft disk filesystem format) the system administrator can on creation determine whether extents are autoallocated or fragmented. This is similar in concept to the Oracle tablespace creation. Testing was not done to check if the fragmentation scenario is avoided with the AUTOALLOCATE clause. The database administrator should therefore be aware of the tablespace usage and whether it is likely going to be stable once rows are added (in which case AUTOALLOCATE can be used simplifying storage management). If it is volatile, the UNIFORM clause might be considered as a better option. Temporary tablespace For working with unstructured data, the primary uses of the TEMPORARY tablespace is to hold the contents of temporary tables and temporary lobs. A temporary lob is used for processing a temporary multimedia object. In the following example, a temporary blob is created. It is not cached in memory. A multimedia image type is created and loaded into it. Information is extracted and the blob is freed. This is useful if images are stored temporarily outside the database. This is not the same case as using a bfile which Oracle Multimedia supports. The bfile is a permanent pointer to an image stored outside the database. 
SQL> declare
       image ORDSYS.ORDImage;
       ctx raw(4000);
     begin
       image := ordsys.ordimage.init();
       dbms_lob.createtemporary(image.source.localdata, FALSE);
       image.importfrom(ctx, 'file', 'LOADING_DIR', 'myimg.tif');
       image.setProperties;
       dbms_output.put_line('width x height = ' || image.width || 'x' || image.height);
       dbms_lob.freetemporary(image.source.localdata);
     end;
     /
width x height = 2809x4176

It's important when using this tablespace to ensure that all code, especially on failure, performs a dbms_lob.freetemporary call, to ensure that storage leakage doesn't occur. Otherwise the tablespace will continue to grow until it runs out of room. In this case the only way to clean it up is to either stop all database processes referencing it and then resize the datafile (or drop and recreate the temporary tablespace after creating another interim one), or to restart the database and mount it. The tablespace can then be resized or dropped and recreated.

UNDO tablespace

The UNDO tablespace is used by the database to store sufficient information to roll back a transaction. In a database containing a lot of digital objects, the size of the database just for storage of the objects can exceed terabytes. In this situation the UNDO tablespace can be sized larger, giving added opportunity for the database administrator to perform flashback recovery from user error. It's reasonable to size the UNDO tablespace at 50 GB, even growing it to 100 GB in size. The larger the UNDO tablespace, the further back in time the administrator can go, and the greater the breathing space between user failure, the failure being detected and reported, and the database administrator doing the flashback recovery.

The following is an example flashback SQL statement. The as of timestamp clause tells Oracle to find rows that match the timestamp from the current time going back, so that we can have a look at a table as it was an hour ago:

select t.vimg.source.srcname || '=' || dbms_lob.getlength(t.vimg.source.localdata)
from test_load as of timestamp systimestamp - (1/24) t;

SYSTEM tablespace

The SYSTEM tablespace contains the data dictionary. In Oracle 11g R2 it also contains any compiled PL/SQL code (where PLSQL_CODE_TYPE=NATIVE). The recommended initial starting size of the tablespace should be 1500 MB.

Redo logs

The following test results highlight how important it is to get the size and placement of the redo logs correct. The goal was to determine what combination of database parameters and redo/undo size was optimal. In addition, an SSD was used as a comparison. Based on the result of each test, the parameters and/or storage were modified to see whether they would improve the results. When it appeared an optimal parameter/storage setting was found, it was locked in while the other parameters were tested further. This enabled multiple configurations to be tested and an optimal result to be calculated.

The test involved loading 67 images into the database. Each image varied in size between 40 and 80 MB, resulting in 2.87 GB of data being loaded. As the test involved only image loading, no processing such as setting properties or extraction of metadata was performed. Archiving on the database was not enabled. All database files resided on hard disk unless specified. In between each test a full database reboot was done.
The test was run at least three times with the range of results shown as follows.

Database parameter descriptions used:
Redo Buffer Size = LOG_BUFFER
Multiblock Read Count = db_file_multiblock_read_count

Source disk | Redo logs | Database parameters | Fastest time | Slowest time
Hard disk | Hard disk, 3 x 50 MB | Redo buffer size = 4 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 3 minutes and 22 sec | 3 minutes and 53 sec
Hard disk | Hard disk, 3 x 1 GB | Redo buffer size = 4 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 2 minutes and 49 sec | 2 minutes and 57 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 4 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 1 minute and 30 sec | 1 minute and 41 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 64 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 1 minute and 23 sec | 1 minute and 48 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 8 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 1 minute and 18 sec | 1 minute and 29 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 16 MB, Multiblock read count = 64, UNDO tablespace on HD (10 GB), Table datafile on HD | 1 minute and 19 sec | 1 minute and 27 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 16 MB, Multiblock read count = 256, UNDO tablespace on HD (10 GB), Table datafile on HD | 1 minute and 27 sec | 1 minute and 41 sec
Hard disk | SSD, 3 x 1 GB | Redo buffer size = 8 MB, Multiblock read count = 64, UNDO tablespace = 1 GB on SSD, Table datafile on HD | 1 minute and 21 sec | 1 minute and 49 sec
SSD | SSD, 3 x 1 GB | Redo buffer size = 8 MB, Multiblock read count = 64, UNDO tablespace = 1 GB on SSD, Table datafile on HD | 53 sec | 54 sec
SSD | SSD, 3 x 1 GB | Redo buffer size = 8 MB, Multiblock read count = 64, UNDO tablespace = 1 GB on SSD, Table datafile on SSD | 1 minute and 20 sec | 1 minute and 20 sec

Analysis

The tests show a huge improvement when the redo logs were moved to a Solid State Drive (SSD). Though this looks like the obvious step to take, doing so might be self-defeating. A number of SSD manufacturers acknowledge there are limitations with the SSD when it comes to repeated writes. The Mean Time to Failure (MTTF) might be 2 million hours for reads; for writes the failure rate can be very high. Modern SSDs and flash cards offer much improved wear leveling algorithms to reduce failures and make performance more consistent. No doubt improvements will continue in the future. A redo log by its nature is constant and has heavy writes. So, moving the redo logs to the SSD might quickly result in the SSD becoming damaged and failing. For an organization that performs one very large load of multimedia at configuration time, the solution might be to initially keep the redo logs on SSD and, once the load is finished, to move the redo logs to a hard drive.

Increasing the size of the redo logs from 50 MB to 1 GB improves performance, and all databases containing unstructured data should have a redo log size of at least 1 GB. The number of logs should be at least 10; 50 to 100 is preferred. As is covered later, disk is cheaper today than it once was, and 100 GB of redo logs is not as large a volume of data as it once was. The redo logs should always be mirrored.

The placement or size of the UNDO tablespace makes no difference to performance. The redo buffer size (LOG_BUFFER) showed a minor improvement when it was increased in size, but the results were inconclusive as the figures varied.
A figure of LOG_BUFFER=8691712 showed the best results, and database administrators might use this figure as a starting point for tuning. Changing the multiblock read count (DB_FILE_MULTIBLOCK_READ_COUNT) from the default value of 64 to 256 showed no improvement. As the default value (in this case 64) is set by the database as optimal for the platform, the conclusion that can be drawn is that the database has set this figure to a good size.

Moving the original images to an SSD showed another huge improvement in performance. This highlighted how the I/O bottleneck of reading from disk and writing to disk (redo logs) is so critical for digital object loading.

The final test involved moving the datafile containing the table to the SSD. It highlighted a realistic issue that DBAs face in dealing with I/O. The disk speed and seek time might not be critical in tuning if the bottleneck is the actual time it takes to transfer the data to and from the disk to the server. In the test case the datafile was moved to the same SSD as the redo logs, resulting in I/O competition. In the previous tests the datafile was on the hard disk, and the database could write to the disk (separate I/O channel) and to the redo logs (separate I/O channel) without one impacting the other. Even though the SSD is an order of magnitude faster in performance than the disk, it quickly became swamped with calls for reads and writes. The lesson is that it's better to have multiple smaller SSDs on different I/O channels into the server than one large SSD on a single channel. Sites using a SAN will soon realize that even though a SAN might offer speed, unless it offers multiple I/O channels into the server, its channel to the server will quickly become the bottleneck, especially if the datafiles and the images for loading are all located on the server.

The original tuning notion of separating datafiles onto separate disks, practiced more than 15 years ago, still makes sense when it comes to image loading into a multimedia database. It's important to stress that this is a tuning issue for image loading, not for running the database in general. Tuning the database in general is a completely different story and might result in a completely different architecture.
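As a quick sanity check on these figures (not part of the original article), the following Python sketch converts a few of the load times from the benchmark table into effective throughput for the 2.87 GB test set; the configuration labels are just abbreviations of the corresponding table rows.

# Effective load throughput for the 2.87 GB test set, derived from the
# benchmark table above. Times are in seconds; figures are approximate.
DATA_MB = 2.87 * 1024  # 2.87 GB expressed in MB

results = [
    ("HD redo logs, 3 x 50 MB", 3 * 60 + 22),   # 3 minutes and 22 sec
    ("HD redo logs, 3 x 1 GB", 2 * 60 + 49),
    ("SSD redo logs, 8 MB buffer", 1 * 60 + 18),
    ("SSD source and SSD redo logs", 53),
]

for config, seconds in results:
    print("%-30s %5.1f MB/sec" % (config, DATA_MB / seconds))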

OpenCV: Tracking Faces with Haar Cascades

Packt
13 May 2013
4 min read
Conceptualizing Haar cascades

When we talk about classifying objects and tracking their location, what exactly are we hoping to pinpoint? What constitutes a recognizable part of an object? Photographic images, even from a webcam, may contain a lot of detail for our (human) viewing pleasure. However, image detail tends to be unstable with respect to variations in lighting, viewing angle, viewing distance, camera shake, and digital noise. Moreover, even real differences in physical detail might not interest us for the purpose of classification. I was taught in school that no two snowflakes look alike under a microscope. Fortunately, as a Canadian child, I had already learned how to recognize snowflakes without a microscope, as the similarities are more obvious in bulk.

Thus, some means of abstracting image detail is useful in producing stable classification and tracking results. The abstractions are called features, which are said to be extracted from the image data. There should be far fewer features than pixels, though any pixel might influence multiple features. The level of similarity between two images can be evaluated based on distances between the images' corresponding features. For example, distance might be defined in terms of spatial coordinates or color coordinates.

Haar-like features are one type of feature that is often applied to real-time face tracking. They were first used for this purpose by Paul Viola and Michael Jones in 2001. Each Haar-like feature describes the pattern of contrast among adjacent image regions. For example, edges, vertices, and thin lines each generate distinctive features. For any given image, the features may vary depending on the regions' size, which may be called the window size. Two images that differ only in scale should be capable of yielding similar features, albeit for different window sizes. Thus, it is useful to generate features for multiple window sizes. Such a collection of features is called a cascade. We may say a Haar cascade is scale-invariant or, in other words, robust to changes in scale. OpenCV provides a classifier and tracker for scale-invariant Haar cascades, which it expects to be in a certain file format.

Haar cascades, as implemented in OpenCV, are not robust to changes in rotation. For example, an upside-down face is not considered similar to an upright face, and a face viewed in profile is not considered similar to a face viewed from the front. A more complex and more resource-intensive implementation could improve Haar cascades' robustness to rotation by considering multiple transformations of images as well as multiple window sizes. However, we will confine ourselves to the implementation in OpenCV.

Getting Haar cascade data

As part of your OpenCV setup, you probably have a directory called haarcascades. It contains cascades that are trained for certain subjects using tools that come with OpenCV.
The directory's full path depends on your system and method of setting up OpenCV, as follows:

Build from source archive: <unzip_destination>/data/haarcascades
Windows with self-extracting ZIP: <unzip_destination>/data/haarcascades
Mac with MacPorts: /opt/local/share/OpenCV/haarcascades
Mac with Homebrew: The haarcascades file is not included; to get it, download the source archive
Ubuntu with apt or Software Center: The haarcascades file is not included; to get it, download the source archive

If you cannot find haarcascades, then download the source archive from http://sourceforge.net/projects/opencvlibrary/files/opencv-unix/2.4.3/OpenCV-2.4.3.tar.bz2/download (or the Windows self-extracting ZIP from http://sourceforge.net/projects/opencvlibrary/files/opencvwin/2.4.3/OpenCV-2.4.3.exe/download), unzip it, and look for <unzip_destination>/data/haarcascades.

Once you find haarcascades, create a directory called cascades in the same folder as cameo.py and copy the following files from haarcascades into cascades:

haarcascade_frontalface_alt.xml
haarcascade_eye.xml
haarcascade_mcs_nose.xml
haarcascade_mcs_mouth.xml

As their names suggest, these cascades are for tracking faces, eyes, noses, and mouths. They require a frontal, upright view of the subject. With a lot of patience and a powerful computer, you can make your own cascades, trained for various types of objects.

Creating modules

We should continue to maintain good separation between application-specific code and reusable code. Let's make new modules for tracking classes and their helpers. A file called trackers.py should be created in the same directory as cameo.py (and, equivalently, in the parent directory of cascades). Let's put the following import statements at the start of trackers.py:

import cv2
import rects
import utils

Alongside trackers.py and cameo.py, let's make another file called rects.py containing the following import statement:

import cv2

Our face tracker and a definition of a face will go in trackers.py, while various helpers will go in rects.py and our preexisting utils.py file.
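Before building the tracker classes, it can help to see a cascade file in action on its own. The following minimal sketch is not part of the book's code; the image filename and the scaleFactor and minNeighbors values are arbitrary example choices. It loads the frontal face cascade from the cascades directory and marks any detected faces in a still image.

import cv2

# Load a pretrained cascade from the local 'cascades' directory.
face_cascade = cv2.CascadeClassifier('cascades/haarcascade_frontalface_alt.xml')

# Read an example image (hypothetical filename) and convert it to grayscale,
# since the Haar classifier operates on single-channel images.
image = cv2.imread('people.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans the image at multiple window sizes; the cascade is
# scale-invariant, so faces of different sizes can be found in one pass.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=4)

# Draw a rectangle around each detection and show the result.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('Faces', image)
cv2.waitKey(0)
cv2.destroyAllWindows()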

Move Further with NumPy Modules

Packt
13 May 2013
7 min read
(For more resources related to this topic, see here.)

Linear algebra

Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things.

Time for action – inverting matrices

The inverse of a matrix A in linear algebra is the matrix A^-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as A * A^-1 = I. The inv function in the numpy.linalg package can do this for us. Let's invert an example matrix. To invert matrices, perform the following steps:

We will create the example matrix with the mat function.

A = np.mat("0 1 2;1 0 3;4 -3 8")
print "A\n", A

The A matrix is printed as follows:

A
[[ 0  1  2]
 [ 1  0  3]
 [ 4 -3  8]]

Now, we can see the inv function in action, using which we will invert the matrix.

inverse = np.linalg.inv(A)
print "inverse of A\n", inverse

The inverse matrix is shown as follows:

inverse of A
[[-4.5  7.  -1.5]
 [-2.   4.  -1. ]
 [ 1.5 -2.   0.5]]

If the matrix is singular or not square, a LinAlgError exception is raised. If you want, you can check the result manually. This is left as an exercise for the reader. Let's check what we get when we multiply the original matrix with the result of the inv function:

print "Check\n", A * inverse

The result is the identity matrix, as expected.

Check
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

What just happened? We calculated the inverse of a matrix with the inv function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix.

import numpy as np
A = np.mat("0 1 2;1 0 3;4 -3 8")
print "A\n", A
inverse = np.linalg.inv(A)
print "inverse of A\n", inverse
print "Check\n", A * inverse

Solving linear systems

A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve solves systems of linear equations of the form Ax = b; here A is a matrix, b can be a 1D or 2D array, and x is the unknown variable. We will also see the dot function in action. This function returns the dot product of two floating-point arrays.

Time for action – solving a linear system

Let's solve an example of a linear system. To solve a linear system, perform the following steps:

Let's create the matrices A and b.

A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print "A\n", A
b = np.array([0, 8, -9])
print "b\n", b

The matrices A and b are shown as follows:

A
[[ 1 -2  1]
 [ 0  2 -8]
 [-4  5  9]]
b
[ 0  8 -9]

Solve this linear system by calling the solve function.

x = np.linalg.solve(A, b)
print "Solution", x

The following is the solution of the linear system:

Solution [ 29.  16.   3.]

Check whether the solution is correct with the dot function.

print "Check\n", np.dot(A, x)

The result is as expected:

Check
[[ 0.  8. -9.]]

What just happened? We solved a linear system using the solve function from the NumPy linalg module and checked the solution with the dot function.

import numpy as np
A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print "A\n", A
b = np.array([0, 8, -9])
print "b\n", b
x = np.linalg.solve(A, b)
print "Solution", x
print "Check\n", np.dot(A, x)
The eig function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors

Let's calculate the eigenvalues of a matrix. Perform the following steps to do so:

Create a matrix as follows:

A = np.mat("3 -2;1 0")
print "A\n", A

The matrix we created looks like the following:

A
[[ 3 -2]
 [ 1  0]]

Calculate eigenvalues by calling the eigvals function.

print "Eigenvalues", np.linalg.eigvals(A)

The eigenvalues of the matrix are as follows:

Eigenvalues [ 2.  1.]

Determine eigenvalues and eigenvectors with the eig function. This function returns a tuple, where the first element contains eigenvalues and the second element contains the corresponding eigenvectors, arranged column-wise.

eigenvalues, eigenvectors = np.linalg.eig(A)
print "First tuple of eig", eigenvalues
print "Second tuple of eig\n", eigenvectors

The eigenvalues and eigenvectors will be shown as follows:

First tuple of eig [ 2.  1.]
Second tuple of eig
[[ 0.89442719  0.70710678]
 [ 0.4472136   0.70710678]]

Check the result with the dot function by calculating the right- and left-hand sides of the eigenvalue equation Ax = ax.

for i in range(len(eigenvalues)):
    print "Left", np.dot(A, eigenvectors[:,i])
    print "Right", eigenvalues[i] * eigenvectors[:,i]
    print

The output is as follows:

Left [[ 1.78885438]
 [ 0.89442719]]
Right [[ 1.78885438]
 [ 0.89442719]]
Left [[ 0.70710678]
 [ 0.70710678]]
Right [[ 0.70710678]
 [ 0.70710678]]

What just happened? We found the eigenvalues and eigenvectors of a matrix with the eigvals and eig functions of the numpy.linalg module. We checked the result using the dot function.

import numpy as np
A = np.mat("3 -2;1 0")
print "A\n", A
print "Eigenvalues", np.linalg.eigvals(A)
eigenvalues, eigenvectors = np.linalg.eig(A)
print "First tuple of eig", eigenvalues
print "Second tuple of eig\n", eigenvectors
for i in range(len(eigenvalues)):
    print "Left", np.dot(A, eigenvectors[:,i])
    print "Right", eigenvalues[i] * eigenvectors[:,i]
    print

Singular value decomposition

Singular value decomposition is a type of factorization that decomposes a matrix into a product of three matrices. The singular value decomposition is a generalization of the previously discussed eigenvalue decomposition. The svd function in the numpy.linalg package can perform this decomposition. This function returns three matrices – U, Sigma, and V – such that U and V are orthogonal and Sigma contains the singular values of the input matrix, so that the original matrix equals U Sigma V*. The asterisk denotes the Hermitian conjugate or the conjugate transpose.

Time for action – decomposing a matrix

It's time to decompose a matrix with the singular value decomposition. In order to decompose a matrix, perform the following steps:

First, create a matrix as follows:

A = np.mat("4 11 14;8 7 -2")
print "A\n", A

The matrix we created looks like the following:

A
[[ 4 11 14]
 [ 8  7 -2]]

Decompose the matrix with the svd function.

U, Sigma, V = np.linalg.svd(A, full_matrices=False)
print "U"
print U
print "Sigma"
print Sigma
print "V"
print V

The result is a tuple containing the two orthogonal matrices U and V on the left- and right-hand sides and the singular values of the middle matrix:

U
[[-0.9486833  -0.31622777]
 [-0.31622777  0.9486833 ]]
Sigma
[ 18.97366596   9.48683298]
V
[[-0.33333333 -0.66666667 -0.66666667]
 [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix; we only have the diagonal values. The other values are all 0. We can form the middle matrix with the diag function. Multiply the three matrices.
This is shown as follows:

print "Product\n", U * np.diag(Sigma) * V

The product of the three matrices looks like the following:

Product
[[  4.  11.  14.]
 [  8.   7.  -2.]]

What just happened? We decomposed a matrix and checked the result by matrix multiplication. We used the svd function from the NumPy linalg module.

import numpy as np
A = np.mat("4 11 14;8 7 -2")
print "A\n", A
U, Sigma, V = np.linalg.svd(A, full_matrices=False)
print "U"
print U
print "Sigma"
print Sigma
print "V"
print V
print "Product\n", U * np.diag(Sigma) * V

Pseudoinverse

The Moore-Penrose pseudoinverse of a matrix can be computed with the pinv function of the numpy.linalg module (visit http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudoinverse is calculated using the singular value decomposition. The inv function only accepts square matrices; the pinv function does not have this restriction.
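The pinv function is described here but not demonstrated, so the following short sketch (not one of the book's listings; it simply reuses the matrix from the SVD example) shows the pseudoinverse in action and verifies the result with np.allclose.

import numpy as np

A = np.mat("4 11 14;8 7 -2")

# Compute the Moore-Penrose pseudoinverse; unlike inv, pinv also works
# for non-square matrices and is based on the singular value decomposition.
pseudoinv = np.linalg.pinv(A)
print("Pseudo inverse")
print(pseudoinv)

# A has full row rank, so A * pinv(A) should equal the 2 x 2 identity
# matrix up to floating-point error.
print("Check: %s" % np.allclose(A * pseudoinv, np.eye(2)))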

Ten IPython essentials

Packt
02 May 2013
10 min read
(For more resources related to this topic, see here.)

Running the IPython console

If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter as shown in the following screenshot:

Command-line shell on Windows

If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch.

Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here:

In [1]: super?
Typical use to call a cooperative superclass method:
class C(B):
    def meth(self, arg):
        super(C, self).meth(arg)

Using IPython as a system shell

You can use the IPython command-line interface as an extended system shell. You can navigate throughout your filesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example:

In [1]: pwd
Out[1]: u'C:'
In [2]: cd windows
C:\windows

These commands are particular magic commands that are central in the IPython shell. There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command.

Using the IPython magic commands

Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer.

Using the history

Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several keystrokes and commands allow you to reduce repetitive typing. In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown.

In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command:

In [4]: a = 12
In [5]: a ** 2
Out[5]: 144
In [6]: print("The result is {0:d}.".format(_))
The result is 144.
In this example, we display the output of prompt 5 (that is, 144) on line 6.

Tab completion

Tab completion is incredibly useful and you will find yourself using it all the time. Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell. It is also particularly useful for dynamic object introspection. Type any Python object name followed by a dot and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example:

In [1]: import os
In [2]: os.path.split<tab>
os.path.split      os.path.splitdrive      os.path.splitext      os.path.splitunc

In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands.

Tab Completion and Private Variables

Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key. Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here."

Executing a script with the %run command

Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution. In all cases, all variables defined in the script become available in the console at the end of script execution. Let's write the following Python script in a file called script.py:

print("Running script.")
x = 12
print("'x' is now equal to {0:d}.".format(x))

Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command:

In [1]: %run script.py
Running script.
'x' is now equal to 12.
In [2]: x
Out[2]: 12

When running the script, the standard output of the console displays any print statement. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient.

Quick benchmarking with the %timeit command

You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The individual execution time of the command is then automatically estimated with an average. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops.
For example, let's type the following command:

In [1]: %timeit [x*x for x in range(100000)]
10 loops, best of 3: 26.1 ms per loop

Here, it took about 26 milliseconds to compute the squares of all integers up to 100000.

Quick debugging with the %debug command

IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised.

Interactive computing with Pylab

The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command:

In [1]: %pylab
Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg].
For more information, type 'help(pylab)'.
In [2]: x = linspace(-10., 10., 1000)
In [3]: plot(x, sin(x))

In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears as shown in the following screenshot, and the console is not blocked while this window is opened. This allows us to interactively modify the plot while it is open.

Using the IPython Notebook

The Notebook brings the functionality of IPython into the browser for multiline text-editing features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way. To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook. You can write one or several lines of code in the input cells. Here are some of the most useful keyboard shortcuts:

Press the Enter key to create a new line in the cell and not execute the cell
Press Shift + Enter to execute the cell and go to the next cell
Press Alt + Enter to execute the cell and append a new empty cell right after it
Press Ctrl + Enter for quick instant experiments when you do not want to save the output
Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts

Customizing IPython

You can save your user preferences in a Python file; this file is called an IPython profile. To create a default profile, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ipython directory. The file ipython_config.py in this folder contains preferences about IPython. You can create different profiles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that profile. The ~ directory is your home directory, for example, something like /home/yourname on Unix, or C:\Users\yourname or C:\Documents and Settings\yourname on Windows.
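As an illustration of what a profile can hold, here is a small, hypothetical ipython_config.py fragment. The specific options shown (exec_lines, confirm_exit, and editor) are commonly available configurable traits, but the exact names can vary between IPython versions, so treat this as a sketch and check your version's documentation.

# A minimal ipython_config.py sketch with hypothetical values; place it
# in ~/.ipython/profile_default/ or in a named profile's folder.
c = get_config()  # get_config() is provided by IPython when it loads this file

# Run a few commands automatically at startup (assumes NumPy is installed).
c.InteractiveShellApp.exec_lines = [
    'import numpy as np',
    'print("Profile loaded.")',
]

# Do not ask for confirmation when exiting the shell.
c.TerminalInteractiveShell.confirm_exit = False

# Use a specific editor for the %edit magic command.
c.TerminalInteractiveShell.editor = 'nano'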
Summary We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and profiler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages. Resources for Article : Further resources on this subject: Advanced Matplotlib: Part 1 [Article] Python Testing: Installing the Robot Framework [Article] Running a simple game using Pygame [Article]