
How-To Tutorials - Data


PostgreSQL 9: Reliable Controller and Disk Setup

Packt
27 Oct 2010
10 min read
Accelerate your PostgreSQL system:

- Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance
- Discover the techniques used to scale successful database installations
- Avoid the common pitfalls that can slow your system down
- Filled with advice about what you should be doing: how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment
- Covers versions 8.1 through 9.0

PostgreSQL uses a Write-Ahead Log (WAL) to write data in a way that survives a database or hardware crash. This is similar to the log buffer or REDO log found in other databases. The database documentation covers the motivation and implementation of the WAL at http://www.postgresql.org/docs/current/static/wal.html. To quote from that introduction:

"WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage."

This procedure ensures that if your application has received a COMMIT for a transaction, that transaction is on permanent storage, and will not be lost even if there is a crash. This satisfies the durability portion of the ACID (atomicity, consistency, isolation, durability) properties databases aim to provide.

Write-back caches

The CPUs and memory in your server are quite fast compared to its disk drives. Accordingly, making the rest of the system wait for the disks, particularly when things need to be written out, can drag overall performance down heavily. Systems that wait for the disks to complete their writes before moving on to their next task are referred to as having a write-through cache. While the data may be stored temporarily in a memory cache, until it has made it all the way through to the physical disk, any write an application requested isn't considered complete.

The normal solution for making that faster is to introduce a different type of write cache between the program doing the writing and the disks. A write-back cache is one where data is copied into memory, and then control returns to the application that requested the write. Those writes are then handled asynchronously, at some future time dictated by the design of the write-back cache. It can take minutes before the data actually makes it to disk.

When PostgreSQL writes information to the WAL, and sometimes when it writes to the regular database files too, that information must be "flushed to permanent storage" in order for the database's crash corruption defense mechanism to work. So what happens if you have a write-back cache that says the write is complete, but it really isn't? People call these lying drives, and the result can be very bad: if you have a system with a write-back cache, and a system crash causes the contents of that write-back cache to be lost, this can corrupt a PostgreSQL database stored on that drive and make it unusable. You may discover it takes expert intervention to even get the database to start again, and determining what data is damaged will be difficult.

Consider the case where you have committed a transaction. Details of that new transaction might be spread across two data blocks on the drive. Now, imagine that one of those made it to disk before the system crashed, but the other didn't. You've now left the database in a corrupted state: one block refers to a transaction that doesn't exist where it's supposed to in the other block. Had at least all of the data blocks related to the WAL been written properly, the database WAL could correct this error after the crash. But the WAL protection only works if it can get honest information about whether data has been written to the disks properly or not, and the "lying" write-back caches do not report that.
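Before worrying about the hardware layer, it's worth confirming that PostgreSQL itself is still configured to flush WAL at commit. A minimal check from psql, assuming you can connect to a database named mydb (the database name is hypothetical; synchronous_commit only exists in 8.3 and later):

psql -d mydb -c "SHOW fsync;"
psql -d mydb -c "SHOW synchronous_commit;"
psql -d mydb -c "SHOW wal_sync_method;"

For the durability guarantees described above to hold, fsync and synchronous_commit should normally both report "on"; wal_sync_method shows which flush mechanism is in use on your platform.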
Sources of write-back caching

Servers are filled with write caches you need to be aware of:

- Operating system write cache. This cache can easily be gigabytes in size. Typically you can flush data out of this cache by forcing a "sync" operation on the block that needs to be stored on disk. On POSIX systems (which includes all UNIX-like ones), this is done with the fsync or fdatasync calls. In some cases, it's possible to write directly in a sync mode, which is effectively a write followed by fsync. The postgresql.conf setting wal_sync_method controls which method is used, and it's possible to disable this altogether to optimize for speed instead of safety.
- Disk controller write cache. You'll find a write cache on most RAID controller cards, as well as inside external storage such as a SAN. Common sizes right now are 128 MB to 512 MB for cards, but gigabytes are common on a SAN. Typically controllers can be changed to operate in a completely write-through mode, albeit slowly. But by default, you'll normally find them in write-back mode. Writes that can fit in the controller's cache are stored there, the operating system is told the write is completed, and the card writes the data out at some future time. To keep this write from being lost if power is interrupted, the card must be configured with a battery. That combination is referred to as a battery-backed write cache (BBC or BBWC).
- Disk drive write cache. All SATA and SAS disks have a write cache on them that on current hardware is 8 MB to 32 MB in size. This cache is always volatile: if power is lost, any data stored there will be lost, and these are always write-back caches if enabled.

How can you make sure you're safe given all these write-back caches that might lose your data? There are a few basic precautions to take:

- Make sure whatever file system you're using properly and fully implements fsync calls, or whatever similar mechanism is used.
- Monitor your disk controller battery. Some controller cards will monitor their battery health, and automatically switch from write-back to write-through mode when there is no battery or it's not working properly. That's a helpful safety measure, but performance is going to drop hard when this happens.
- Disable any drive write caches. Most hardware RAID controllers will do this for you, preferring their own battery-backed caches instead.

Disk controller monitoring

When you have a RAID controller card with a battery-backed cache, you probably expect you'll need to monitor the card to determine when disks fail. But monitoring controller battery health is an equally critical aspect of maintaining a reliable database system when you're using this technology. If the battery fails and you're using it in write-back mode, your writes are not safe. Similarly, if your power fails, you should prefer shutting the database server down after a few minutes of power loss to trying to keep it going.
Integrating power monitoring via a UPS or similar mechanism should be part of your database server configuration, so that a short outage results in an orderly shutdown. Consider the purpose of the controller battery to be protecting you from truly unexpected outages, like someone tripping over the power cord. Even if the manufacturer claims the controller battery will last through days of downtime, that's not a good reason to believe you'll actually be safe for an extended outage. You should consider the battery as something you'd prefer to use for only a few minutes of protection. That may be the reality anyway, particularly where the battery is older and has lost much of its capacity, and some controller batteries don't start out with very much capacity. Be sure to run your own tests rather than blindly believing the manufacturer specifications: your data depends on it.

Better RAID controllers will automatically disable write-back mode if their battery stops working normally. If performance suddenly drops on an older server, this is one potential cause. Also, don't forget that every UPS has a battery that degrades over time as well. That's all the more reason to arrange an orderly shutdown of your server during a power outage, rather than optimistically presuming you can keep it running until power returns.

Disabling drive write caches

If your card doesn't disable all the drive write caches, or if you're using a software RAID approach, you'll need to turn the caches off yourself. The best way to do this is to see if it's possible to change the default write cache state using the utilities provided by the drive manufacturer. You should be able to do this through software as well. Here is a sample session from a Linux system checking the write cache, toggling it off, confirming the change took, and then toggling it on again:

# hdparm -I /dev/sda | grep "Write cache"
   *    Write cache
# sudo hdparm -W 0 /dev/sda
/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
# hdparm -I /dev/sda | grep "Write cache"
        Write cache
# hdparm -W 1 /dev/sda
/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

Only the -W 0 configuration is completely safe for database use. The PostgreSQL WAL documentation suggests similar commands to use on other operating systems.

Performance impact of write-through caching

If you don't have a battery-backed write cache, and therefore can't utilize some memory-based cache to accelerate fsync writes, commit performance on your database can be quite bad. The worst case here is a single client issuing a commit after every statement it executes. The reality of how a hard drive works means that individual writes happen at most once each time the drive spins around. Here are the figures for the common drive speeds available right now, with the computed maximum commit rate:

Rotation speed (rpm)   Rotation time (ms)   Max commits/second
5400                   11.1                 90
7200                   8.3                  120
10000                  6.0                  166
15000                  4.0                  250

It's important to realize how limiting this can be: if you have a common 7200 rpm hard drive, no single client can commit more than 120 transactions/second in any situation where all that's available is a write-through cache. It doesn't matter how many disks you have in a RAID array, or how you configure your software. You must have hardware with a battery, enabling a non-volatile write-back cache, in order to safely exceed this limit.
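The figures in the table are just arithmetic: at one commit per platter rotation, the ceiling is the rotation rate divided by 60. A quick check from any shell:

echo "7200 / 60" | bc     # 120 commits/second at best for a 7200 rpm drive
echo "15000 / 60" | bc    # 250 commits/second for a 15,000 rpm drive

The same division reproduces every row of the table, which is a useful sanity check when evaluating an unfamiliar drive.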
Some PostgreSQL installs use a RAID controller card just for this purpose, to provide a BBWC, in Just a Bunch of Disks (JBOD) mode, where no RAID is being done on the controller at all. Sometimes the disks are then used directly, and other setups layer software RAID on top, which can have some advantages compared to hardware RAID.

If you have more than one client, you'll get more done per commit. It's normal to see more than 500 committed transactions per second if you have a larger number of clients all committing regularly, because each flushed disk write will include any commit requests queued up by other clients, too. The other common technique here is to batch commits into larger pieces, perhaps committing 1000 records at a time rather than a single one, in order to reduce the average impact of commit latency. Another approach for accelerating systems that don't have a usable write cache is asynchronous commit.

Summary

In this article we saw how accidentally using volatile write-back caching in disk controllers and drives can easily introduce database corruption.

Further resources on this subject:

- UNIX Monitoring Tool for PostgreSQL [Article]
- Server Configuration Tuning in PostgreSQL [Article]
- PostgreSQL 9: Balancing Hardware Spending [Article]


PostgreSQL 9: Balancing Hardware Spending

Packt
27 Oct 2010
10 min read
One of the reasons that working with open-source databases such as PostgreSQL can be so effective is that every dollar you save on software licensing can be put toward better hardware instead. The three main components you'll need to balance in your budget are CPUs, memory, and disks, with the disk controller as a related and critical part too.

CPUs

Currently available processors bundle at least two and possibly as many as eight cores into each CPU, making the core count the figure of merit for most database applications. There are two basic decisions you need to make while deciding which CPU solution best matches your database application:

- Which processor family? Nowadays, this normally boils down to choosing among the various 64-bit product lines from Intel or AMD, although there are some other less popular choices still floating around (Itanium, SPARC, and so on).
- Do you get more cores or faster cores?

These choices are sometimes more tied together than you might think. Currently, Intel has a lead in delivering individual processor cores that are the fastest around, often due to faster transfers between the processor and system RAM, but the processors and related parts are more expensive too. AMD is still competitive at providing more cores per dollar, and their server-class designs normally do a good job making the best of the memory available to each core; if what you want is many more affordable cores, that's where AMD is stronger. AMD also has a better history of making its fastest processors available in configurations with many sockets, for when you want to put more than two physical CPUs into a single server.

The best way to figure out which class of database application you have, more cores or faster cores, is to monitor an existing server using tools such as top. If there's a small number of processes each using a single CPU, that's the sort of workload where faster cores are better. That tends to happen if you have giant queries running in a batch fashion, for example when large quantities of data need to be sorted to deliver any single report. But if all the CPUs are active with many more concurrent processes, then you'd likely benefit more from additional cores. That's normally what you'll see in applications with a larger user count, such as databases backing web applications.

If you don't have the benefit of a working system to inspect, you can try to guess which type of situation you're in by noting the limitations of the database. PostgreSQL does not allow splitting a single query across more than one core, what's called parallel query by some other databases that support it. That means that if you have any one query, or a small number of queries, that must run as fast as possible, the only way to do that is to prioritize getting faster cores.

Another situation where getting faster cores is the better choice is if you need to prioritize data loading or export. PostgreSQL's best performing data import method, COPY, can easily become (but isn't always) limited by CPU performance, where that turns into the bottleneck for the operation. While it's possible to split input files into pieces and load them in parallel, as sketched below, that's something you'll need to build or acquire yourself, rather than something the server knows how to do for you. Exporting a copy of the database using the pg_dump utility is another example of something that can become CPU limited on some systems.
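A rough sketch of that split-and-load-in-parallel idea follows. This is not a built-in server feature; the file, table, and database names are hypothetical, and error handling is omitted:

# split the input into one-million-line chunks, then run one psql \copy per chunk
split -l 1000000 bigfile.csv part_
for f in part_*; do
    psql -d mydb -c "\copy mytable FROM '$PWD/$f' WITH CSV" &
done
wait

Each psql session is a separate backend, so the chunks load on different cores; whether this actually helps depends on whether COPY or the disks are your real bottleneck.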
Memory

How much to prioritize memory for your application really depends on the size of the working set of data needed to handle the most common operations. Generally, adding more RAM will provide a significant performance boost. There are a few situations where you'd be better served doing something else instead:

- If your data set is small enough to fit into a smaller amount of RAM, adding more won't help you much. You probably want faster processors instead.
- When running applications that scan tables much larger than what you can feasibly purchase as RAM, such as in many data warehouse situations, you might be better served by getting faster disks rather than more memory.

The normal situation where more memory helps most is when the data you access frequently will fit with the larger amount, but not with the smaller. This happens more often than you might think because of the way database B-tree indexes are stored. Even if you can't fit the entirety of a table, or even its index, in memory, being able to store a good-sized fraction of the index can mean that index-based data lookups will be significantly sped up. Having the most "popular" blocks from the top of the tree structure cached is helpful even if you can't fit all the leaves into memory too. Once you have an application running, you can usually get a much better idea how memory is being used by looking inside the PostgreSQL buffer cache (and potentially inside the operating system one as well) and seeing what data it prefers to keep around.

Disks

While it's always possible to run into situations where the CPU in your database server is its bottleneck, it's downright likely that you'll run into a disk bottleneck, particularly if you only have a drive or two in the system. A few years ago, the basic choice in hard drives was between the inexpensive ATA (also known as IDE) drives used in desktops and the more serious SCSI drives aimed at servers. Both technologies have marched forward, and the choice you're most likely to face when configuring a database server today is whether to use Serial ATA (SATA) or Serial Attached SCSI (SAS). It's possible to find nearly identical drives available with both interfaces, and there are even drive controllers that allow attaching either kind of drive. Combined with a narrowing performance difference between the two, picking between them is harder than ever. The broad parameters of each technology are straightforward to compare. Here's the state of things as this is being written:

SAS disks:
- The maximum available RPM is higher: 10,000 or 15,000
- Not as much drive capacity: 73 GB to 1 TB are popular sizes
- Cost per MB is higher

SATA disks:
- Drives typically have a slower RPM: 7200 is standard, though some 10,000 rpm designs exist, such as the Western Digital VelociRaptor
- Higher drive capacity: 2 TB available
- Cost per MB is lower

Generally, you'll find individual SAS disks to be faster even than SATA ones with similar specifications. In particular, you're likely to see better seek performance on random I/O due to faster drive mechanics in SAS, and sometimes a faster transfer rate from the disk too. Also, because SAS drives have supported advanced features such as command queuing for longer, it's more likely your operating system will have matching support to take advantage of them.

RAID

The Redundant Array of Inexpensive Disks (RAID) approach is the standard way to handle both the performance and reliability limitations of individual disk drives.
A RAID array puts many disks, typically of exactly the same configuration, into a set that acts like a single disk, but with enhanced performance, reliability, or both. In some cases the extra reliability comes from computing what's called parity information for writes to the array. Parity is a form of checksum on the data, which allows reconstructing it even if some of the information is lost. RAID levels that use parity are space-efficient at writing data in a way that will survive drive failures, but the parity computation overhead can be significant for database applications.

The most common basic forms of RAID array are:

- RAID 0, also called striping. Multiple disks are used at the same time, spreading reads and writes over each of them in parallel. This can be almost a linear improvement (two disks reading twice as fast as a single one), but a failure of any volume in the set will lose all the data.
- RAID 1, also called mirroring. Here multiple copies of the same data are put onto multiple disks. This can sometimes improve performance: a good RAID 1 implementation mirroring across two disks might handle two reads by sending one to each drive. Reads executed in parallel against both drives can effectively double average seeks per second. But generally, the reason for RAID 1 is redundancy: if a single drive fails, the system will continue operating using the other one.
- RAID 10 or 1+0. This first takes pairs of disks and mirrors them using RAID 1. Then, the resulting set is striped using RAID 0. The result provides both high performance and the ability to tolerate any single disk failure, without as many ways for speed to suffer in the average and worst case as RAID 5/6. RAID 10 is particularly appropriate for write-heavy environments, where the parity computation overhead of RAID 5/6 can cause disk performance to suffer. Accordingly, it's the preferred RAID level for high-performance database systems.
- RAID 5, also called striping with parity. This approach sits midway between 0 and 1. You stripe data across multiple drives similarly to RAID 0, which improves read performance. But some redundant data is added to a parity drive. If one of the disks in the array is lost, the missing data can be recomputed from the ones left using that parity information. While this is efficient in terms of how little space is wasted relative to the tolerance for disk failures provided, write performance in particular can suffer in RAID 5.
- RAID 6. Similar to RAID 5, except with more parity information, enabling survival even with two disk failures. It has the same fundamental advantages and disadvantages. RAID 6 is an increasingly common way to cope with the fact that rebuilding a RAID 5 array after a disk loss can take a really long time on modern, high-capacity drives. The array has no additional fault tolerance during that period, and seeing a second drive failure before that rebuild finishes is not that unlikely when it takes many hours of intense disk activity to rebuild. Disks manufactured in the same batch are surprisingly likely to fail in groups.

To be fair in any disk performance comparison, you need to consider that most systems get their net performance from several disks, such as in a RAID array. Since SATA disks are individually cheaper, you might be able to purchase considerably more of them for the same budget than had you picked SAS instead.
If you believe your application will get faster when it is spread over more disks, being able to buy more of them per dollar spent can result in an overall faster system. Note that the upper limit here will often be your server's physical hardware: you only have so many storage bays or controller ports available, and larger enclosures can cost more both up-front and over their lifetime. It's easy to find situations where a smaller number of faster drives, which SAS provides, is the better way to go. This is why it's so important to constantly benchmark both your hardware and your database application, to get a feel for how well performance improves as the disk count increases.
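A very rough starting point for that kind of benchmarking is raw sequential throughput. This is only a sketch: the file path is hypothetical, a real evaluation should also measure random I/O, the test file should be larger than RAM, and the read pass may be served partly from the operating system cache:

# Sequential write of roughly 8 GB, forcing data to disk at the end of the run
dd if=/dev/zero of=/tmp/ddtest bs=8k count=1000000 conv=fdatasync
# Sequential read of the same file
dd if=/tmp/ddtest of=/dev/null bs=8k
rm /tmp/ddtest

Repeating the same test as the disk count in the array grows gives you the scaling curve the text is recommending you measure.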


Backup in PostgreSQL 9

Packt
25 Oct 2010
11 min read
Most people admit that backups are essential, though they also devote only a very small amount of time to thinking about the topic. The first recipe is about understanding and controlling crash recovery. We need to understand what happens if the database server crashes, so we can understand when we might need to recover. The next recipe is all about planning. That's really the best place to start before you go charging ahead to do backups.

Understanding and controlling crash recovery

Crash recovery is the PostgreSQL subsystem that saves us if the server should crash, or fail as part of a system crash. It's good to understand a little about it and to do what we can to control it in our favor.

How to do it...

If PostgreSQL crashes, there will be a message in the server log with severity level PANIC. PostgreSQL will immediately restart and attempt to recover using the transaction log or Write-Ahead Log (WAL).

The WAL consists of a series of files written to the pg_xlog subdirectory of the PostgreSQL data directory. Each change made to the database is recorded first in WAL, hence the name "write-ahead" log. When a transaction commits, the default and safe behavior is to force the WAL records to disk. If PostgreSQL should crash, the WAL will be replayed, which returns the database to the point of the last committed transaction, and thus ensures the durability of any database changes. Note that the database changes themselves aren't written to disk at transaction commit. Those changes are written to disk sometime later by the "background writer" on a well-tuned server.

Crash recovery replays the WAL, but from what point does it start to recover? Recovery starts from points in the WAL known as "checkpoints". The duration of crash recovery depends upon the number of changes in the transaction log since the last checkpoint. A checkpoint is a known safe starting point for recovery, since at that time we write all currently outstanding database changes to disk. A checkpoint can become a performance bottleneck on busy database servers because of the number of writes required. There are a number of ways of tuning that, though please also understand the effect those tuning options may have on crash recovery.

Two parameters control the amount of WAL that can be written before the next checkpoint. The first is checkpoint_segments, which controls the number of 16 MB files that will be written before a checkpoint is triggered. The second is time-based, known as checkpoint_timeout, and is the number of seconds until the next checkpoint. A checkpoint is called whenever either of those two limits is reached.

It's tempting to banish checkpoints as much as possible by setting the following parameters:

checkpoint_segments = 1000
checkpoint_timeout = 3600

Though if you do, you should give some thought to how long crash recovery will then take, and whether you want that. Also, you should make sure that the pg_xlog directory is mounted on disks with enough disk space for at least 3 x 16 MB x checkpoint_segments. Put another way, you need roughly 48 GB of disk space for checkpoint_segments = 1000. If wal_keep_segments > 0, then the server can also use up to 16 MB x (wal_keep_segments + checkpoint_segments).
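The disk-space arithmetic is worth doing explicitly for whatever values you choose. A quick sketch for the example settings above, with wal_keep_segments = 100 thrown in as a purely hypothetical extra:

# checkpoint-driven WAL: 3 x 16 MB x checkpoint_segments
echo "3 * 16 * 1000" | bc        # 48000 MB, roughly 48 GB of pg_xlog space
# additional ceiling when wal_keep_segments is set
echo "16 * (100 + 1000)" | bc    # 17600 MB retained with wal_keep_segments = 100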
How it works...

Recovery continues until the end of the transaction log. We are writing this continually, so there is no defined end point; it is literally the last correct record. Each WAL record is individually CRC checked, so we know whether a record is complete and valid before trying to process it. Each record contains a pointer to the previous record, so we can tell that the record forms a valid link in the chain of actions recorded in WAL. As a result of that, recovery always ends with some kind of error reading the next WAL record. That is normal.

Recovery performance can be very fast, though it does depend upon the actions being recovered. The best way to test recovery performance is to set up a standby replication server.

There's more...

It's possible for a problem to be caused replaying the transaction log, and for the database server to fail to start. Some people's response to this is to use a utility named pg_resetxlog, which removes the current transaction log files and tidies up after that surgery has taken place. pg_resetxlog destroys data changes, and that means data loss. If you do decide to run that utility, make sure you take a backup of the pg_xlog directory first. My advice is to seek immediate assistance rather than do this. You don't know for certain that doing this will fix a problem, though once you've done it, you will have difficulty going backwards.

Planning backups

This section is all about thinking ahead and planning. If you're reading this section before you take a backup, well done. The key thing to understand is that you should plan your recovery, not your backup. The type of backup you take influences the type of recovery that is possible, so you must give some thought to what you are trying to achieve beforehand.

If you want to plan your recovery, then you need to consider the different types of failures that can occur. What type of recovery do you wish to perform? You need to consider the following main aspects:

- Full or partial database?
- Everything, or just object definitions only?
- Point-in-time recovery
- Restore performance

We need to look at the characteristics of the utilities to understand what our backup and recovery options are. It's often beneficial to have multiple types of backup to cover the different types of failure possible. Your main backup options are:

- logical backup, using pg_dump
- physical backup, a file system backup

pg_dump comes in two main flavors: pg_dump and pg_dumpall. pg_dump has a -F option to produce backups in various file formats. The file format is very important when it comes to restoring from backup, so you need to pay close attention to that.

The following table shows the features available, depending upon the backup technique selected. The three techniques compared are: (A) SQL dump to an archive file, pg_dump -F c; (B) SQL dump to a script file, pg_dump -F p or pg_dumpall; (C) file system backup using pg_start_backup.

Feature                                  A: archive dump     B: script dump      C: filesystem backup
Backup type                              Logical             Logical             Physical
Recover to point in time?                No                  No                  Yes
Backup all databases?                    One at a time       Yes (pg_dumpall)    Yes
All databases backed up at same time?    No                  No                  Yes
Selective backup?                        Yes                 Yes                 No (Note 3)
Incremental backup?                      No                  No                  Possible (Note 4)
Selective restore?                       Yes                 Possible (Note 1)   No (Note 5)
DROP TABLE recovery                      Yes                 Yes                 Possible (Note 6)
DROP TABLESPACE recovery                 Possible (Note 2)   Possible (Note 6)   Possible (Note 6)
Compressed backup files?                 Yes                 Yes                 Yes
Backup is multiple files?                No                  No                  Yes
Parallel backup possible?                No                  No                  Yes
Parallel restore possible?               Yes                 No                  Yes
Restore to later release?                Yes                 Yes                 No
Standalone backup?                       Yes                 Yes                 Yes (Note 7)
Allows DDL during backup                 No                  No                  Yes

How to do it...

If you've generated a script with pg_dump or pg_dumpall and need to restore just a single object, then you're going to need to go deep. You will need to write a Perl script (or similar) to read the file and extract out the parts you want.
It's messy and time-consuming, but probably faster than restoring the whole thing to a second server and then extracting just the parts you need with another pg_dump. See the recipe Recovery of a dropped/damaged tablespace.

Selective backup with physical backup is possible, though it will cause later problems when you try to restore. Selective restore with physical backup isn't possible with the currently supplied utilities. See the recipe Standalone hot physical backup.

How it works...

To back up all databases, you may be told you need to use the pg_dumpall utility. I have four reasons why you shouldn't do that:

- If you use pg_dumpall, then the only output produced is a script file. Script files can't use the parallel restore feature of pg_restore, so by taking your backup this way you will be forcing the restore to be slower than it needs to be.
- pg_dumpall produces dumps of each database, one after another, which means it is slower than running multiple pg_dump tasks in parallel, one against each database.
- Because the databases are dumped one after another, the dumps of individual databases are not consistent to a particular point in time. If you start the dump at 04:00 and it ends at 07:00, then we're not sure exactly when the dump relates to; it is sometime between 04:00 and 07:00.
- Options for pg_dumpall are similar in many ways to pg_dump, though not all of them exist, so some things aren't possible.

In summary, pg_dumpall is slower to back up, slower to restore, and gives you less control over the dump. I suggest you don't use it, for those reasons. If you have multiple databases, then I suggest you take your backup by doing either of the following:

- Dump the global information for the database server using pg_dumpall -g, then dump all databases in parallel using a separate pg_dump for each database, taking care to check for errors if they occur.
- Use the physical database backup technique instead.

Hot logical backup of one database

Logical backup makes a copy of the data in the database by dumping out the contents of each table.

How to do it...

The command to do this is simple, and as follows:

pg_dump -F c > dumpfile

or

pg_dump -F c -f dumpfile

You can also do this through pgAdmin3.

How it works...

pg_dump produces a single output file. The output file can be divided into multiple pieces with the split(1) command if required. pg_dump into the custom format is lightly compressed by default. Compression can be removed or made more aggressive.

pg_dump runs by executing SQL statements against the database to unload data. When PostgreSQL runs an SQL statement we take a "snapshot" of currently running transactions, which freezes our viewpoint of the database. We can't (yet) share that snapshot across multiple sessions, so we cannot run an exactly consistent pg_dump in parallel within one database, nor across many databases. The time of the snapshot is the only time we can recover to; we can't recover to a time either before or after it. Note that the snapshot time is the start of the backup, not the end.

When pg_dump runs, it holds the very lowest kind of lock on the tables being dumped. Those locks are designed to prevent DDL from running against the tables while the dump takes place. If a dump is run at the point that other DDL is already running, then the dump will sit and wait. If you want to limit the waiting time, you can do that by setting the --lock-wait-timeout option.

pg_dump allows you to make a selective backup of tables. The -t option also allows you to specify views and sequences.
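For instance, a single-table custom-format dump looks like this (the table and database names are hypothetical):

pg_dump -F c -t mytable -f mytable.dump mydb

The same -t switch is then used by pg_restore when you want to pull only that table back out of a larger dump.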
There's no way to dump other object types individually using pg_dump. You can use some supplied functions to extract individual snippets of information; these are documented at the following website: https://www.postgresql.org/docs/9.0/static/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE

pg_dump works against earlier releases of PostgreSQL, so it can be used to migrate data between releases.

pg_dump doesn't generally handle included modules very well. pg_dump isn't aware of additional tables that have been installed as part of an additional package, such as PostGIS or Slony, so it will dump those objects as well. That can cause difficulties if you then try to restore from the backup, as the additional tables may already have been created as part of the software installation process on an otherwise empty server.

There's more...

What time was the pg_dump taken? The snapshot for a pg_dump is taken at the beginning of a run. The file modification time will tell you when the dump finished. The dump is consistent at the time of the snapshot, so you may want to know that time. If you are making a script dump, you can produce a verbose dump as follows:

pg_dump -v

which then adds the time to the top of the script. Custom dumps store the start time as well, and it can be accessed using the following:

pg_restore --schema-only -v dumpfile | head | grep Started
-- Started on 2010-06-03 09:05:46 BST

See also

Note that pg_dump does not dump the roles (such as users/groups) and tablespaces. Those two things are only dumped by pg_dumpall; see the next recipes for more detailed descriptions.
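Putting the earlier advice together, a sketch of the "globals plus one parallel pg_dump per database" approach might look like this, assuming suitable connection defaults are set; the file names are hypothetical and error checking is omitted:

# roles and tablespaces, which pg_dump itself never includes
pg_dumpall -g > globals.sql
# one custom-format dump per database, run in parallel
for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -F c -f "$db.dump" "$db" &
done
wait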


Introduction to PostgreSQL 9

Packt
25 Oct 2010
12 min read
PostgreSQL 9 Admin Cookbook: over 80 recipes to help you run an efficient PostgreSQL 9.0 database.

- Administer and maintain a healthy database
- Monitor your database, ensuring that it performs as quickly as possible
- Tips for backup and recovery of your database

Introduction

PostgreSQL is a feature-rich, general-purpose database management system. It's a complex piece of software, but every journey begins with the first step. We start with your first connection. Many people fall at the first hurdle, so we try not to skip too swiftly past that. We move on quickly to enabling remote users, and from there to access through GUI administration tools. We also introduce the psql query tool, which is the tool used for loading our sample database.

Introducing PostgreSQL 9

PostgreSQL is an advanced SQL database server, available on a wide range of platforms. One of the clearest benefits of PostgreSQL is that it is open source, meaning that you have a very permissive license to install, use, and distribute PostgreSQL without paying anyone fees or royalties. On top of that, PostgreSQL is well-known as a database that stays up for long periods, and requires little or no maintenance in many cases. Overall, PostgreSQL provides a very low total cost of ownership.

PostgreSQL is also noted for its huge range of advanced features, built up over more than 20 years of continuous development and enhancement. Originally developed by the Database Research group at the University of California, Berkeley, PostgreSQL is now developed and maintained by a huge army of developers and contributors. Many of those contributors have full-time jobs related to PostgreSQL, working as designers, developers, database administrators, and trainers. Some, but not many, of those contributors work for companies that specialize in services for PostgreSQL, such as Hannu and me. No single company owns PostgreSQL, nor are you required, or even encouraged, to register your usage.

PostgreSQL has the following main features:

- Excellent SQL standards compliance, up to SQL:2008
- Client-server architecture
- Highly concurrent design, where readers and writers don't block each other
- Highly configurable and extensible for many types of application
- Excellent scalability and performance, with extensive tuning features

What makes PostgreSQL different?

The PostgreSQL project focuses on the following objectives:

- Robust, high-quality software with maintainable, well-commented code
- Low-maintenance administration for both embedded and enterprise use
- Standards-compliant SQL, interoperability, and compatibility
- Performance, security, and high availability

What surprises many people is that PostgreSQL's feature set is more comparable with Oracle or SQL Server than it is with MySQL. The only connection between MySQL and PostgreSQL is that those two projects are open source; apart from that, the features and philosophies are almost totally different. One of the key features of Oracle since Oracle 7 has been "snapshot isolation", where readers don't block writers, and writers don't block readers. You may be surprised to learn that PostgreSQL was the first database to be designed with this feature, and it offers a full and complete implementation. PostgreSQL names this Multi-Version Concurrency Control (MVCC).

PostgreSQL is a general-purpose database management system. You define the database that you would like to manage with it. PostgreSQL offers you many ways to work.
You can use a "normalized database model", you can utilize extensions such as arrays and record subtypes, or you can use a fully dynamic schema using an extension named hstore. PostgreSQL also allows you to create your own server-side functions in any of a dozen different languages. PostgreSQL is highly extensible, so you can add your own datatypes, operators, index types, and functional languages. For example, you can override different parts of the system using plugins to alter the execution of commands, or add a new optimizer. All of these features offer a huge range of implementation options to software architects. There are many ways out of trouble when building applications and maintaining them over long periods of time.

In the early days, when PostgreSQL was still a research database, the focus was solely on cool new features. Over the last 15 years, enormous amounts of code have been rewritten and improved, giving us one of the most stable, large software servers available for operational use. You may also read that PostgreSQL was, or is, slower than My Favorite DBMS, whichever one that is. It's been a personal mission of mine over the last six years to improve server performance, and the team has been successful in making the server highly performant and very scalable. That gives PostgreSQL enormous headroom for growth.

Who is using PostgreSQL?

Prominent users include Apple, BASF, Genentech, IMDB.com, Skype, NTT, Yahoo, and The National Weather Service. PostgreSQL receives well in excess of 1 million downloads per year, according to data submitted to the European Commission, which concluded: "...PostgreSQL, is considered by many database users to be a credible alternative..."

We need to mention one last thing. When PostgreSQL was first developed, it was named Postgres, and so many aspects of the project still refer to the word "postgres". For example, the default database is named postgres, and the software is frequently installed using the postgres userid. As a result, people shorten the name PostgreSQL to simply Postgres, and in many cases people use the two names interchangeably. PostgreSQL is pronounced "post-grez-q-l"; Postgres is pronounced "post-grez". Some people get confused and refer to "Postgre", which is hard to say and likely to confuse people. Two names are enough, so please don't use a third name!

Getting PostgreSQL

PostgreSQL is 100% open source software, freely available to use, alter, or redistribute in any way you choose. PostgreSQL's license is an approved open source license, very similar to the BSD (Berkeley Software Distribution) license, though just different enough that it is now known as TPL (The PostgreSQL License).

How to do it...

PostgreSQL is already in use by many different application packages, and so you may already find it installed on your servers. Many Linux distributions include PostgreSQL as part of the basic installation, or include it on the installation disk. One thing to be wary of is that the included version of PostgreSQL may not be the latest release. It will typically be the latest major release that was available when that operating system release was published. There is usually no good reason to stick at that level; there is no increased stability implied there, and later production versions are just as well supported by the various Linux distributions.
If you don't yet have a copy, or you don't have the latest version, you can download the source code or binary packages for a wide variety of operating systems from the following URL: http://www.postgresql.org/download/

Installation details vary significantly from platform to platform, and there aren't any special tricks or recipes to mention. Please just follow the installation guide, and away you go. We've consciously avoided describing the installation processes here to make sure we don't garble or override the information published to assist you.

If you would like to receive e-mail updates of the latest news, you can subscribe to the PostgreSQL announce mailing list, which contains updates from all the vendors that support PostgreSQL. You'll get a few e-mails each month about new releases of core PostgreSQL and related software, conferences, and user group information. It's worth keeping in touch with developments. For more information about the PostgreSQL announce mailing list, visit the following URL: http://archives.postgresql.org/pgsql-announce/

How it works...

Many people ask questions such as "How can this be free?", "Are you sure I don't have to pay someone?", or "Who gives this stuff away for nothing?" Open source applications such as PostgreSQL work on a community basis, where many contributors perform tasks that make the whole process work. For many of those people, their involvement is professional rather than a hobby, and they can do this because there is great value in it for contributors and their employers alike. You might not believe it. You don't have to, because It Just Works.

There's more...

Remember that PostgreSQL is more than just the core software. There is a huge range of websites offering add-ons, extensions, and tools for PostgreSQL. You'll also find an army of bloggers describing useful tricks and discoveries that will help you in your work. And there is a range of professional companies able to offer you help when you need it.

Connecting to PostgreSQL server

How do we access PostgreSQL? Connecting to the database is most people's first experience of PostgreSQL, so we want to make it a good one. So, let's do it, and fix any problems we have along the way. Remember that a connection needs to be made securely, so there may be some hoops for us to jump through to ensure that the data we wish to access is secure.

Before we can execute commands against the database, we need to connect to the database server, giving us a session. Sessions are designed to be long-lived, so you connect once, perform many requests, and then eventually disconnect. There is a small overhead during connection. That may become noticeable if you connect and disconnect repeatedly, so you may wish to investigate the use of connection pools. Connection pools allow pre-connected sessions to be served quickly to you when you wish to reconnect.

Getting ready

First, catch your database. If you don't know where it is, we'll probably have difficulty accessing it. There may be more than one, and you'll need to know the right database to access, and have the authority to connect to it.

How to do it...

You need to specify the following five parameters to connect to PostgreSQL:

- host or host address
- port
- database name
- user
- password (or other means of authentication, if any)

To connect, there must be a PostgreSQL server running on host, listening on port number port. On that server, a database named dbname and the user must also exist.
The host must explicitly allow connections from your client (this is explained in the next recipe), and you must also pass authentication using the method the server specifies. For example, specifying a password won't work if the server has requested a different form of authentication.

Almost all PostgreSQL interfaces use the libpq interface library. When using libpq, most of the connection parameter handling is identical, so we can discuss it just once. If you don't specify the preceding parameters, we look for values set through environment variables, which are as follows:

- PGHOST or PGHOSTADDR
- PGPORT (5432 is assumed if this is not set)
- PGDATABASE
- PGUSER
- PGPASSWORD (though this one is definitely not recommended)

If you specify the first four parameters somehow, but not the password, then we look for a password file. Some PostgreSQL interfaces use the client-server protocol directly, so the way defaults are handled may differ. The information we need to supply won't vary significantly, so please check the exact syntax for that interface.

How it works...

PostgreSQL is a client-server database. The system it runs on is known as the host. We can access the PostgreSQL server remotely through the network. However, we must specify the host, which is a hostname, or a hostaddr, which is an IP address. We can specify a host of "localhost" if we wish to make a TCP/IP connection to the same system. It is often better to use a Unix socket connection, which is attempted if the host begins with a slash (/) and the name is presumed to be a directory name (the default is /tmp).

On any system, there can be more than one database server. Each database server listens on exactly one "well-known" network port, which cannot be shared between servers on the same system. The default port number for PostgreSQL is 5432, which has been registered with IANA and is uniquely assigned to PostgreSQL. (You can see it used in the /etc/services file on most *nix servers.) The port number can be used to uniquely identify a specific database server if many exist. A database server is also sometimes known as a "database cluster", because the PostgreSQL server allows you to define one or more databases on each server. Each connection request must identify exactly one database, identified by its dbname. When you connect, you will only be able to see database objects created within that database.

A database user is used to identify the connection. By default, there is no limit on the number of connections for a particular user. In more recent versions of PostgreSQL, users are referred to as login roles, though many clues remind us of the earlier naming, and it still makes sense in many ways. A login role is a role that has been assigned the CONNECT privilege.

Each connection will typically be authenticated in some way. This is defined at the server, so it is not optional at connection time if the administrator has configured the server to require authentication.

Once you've connected, each connection can have one active transaction at a time and one fully active statement at any time. The server will have a defined limit on the number of connections it can serve, so a connection request can be refused if the server is oversubscribed.

Inspecting your connection information

If you want to confirm you've connected to the right place and in the right way, you can execute some or all of the following commands:

SELECT inet_server_port();

This shows the port on which the server is listening.
SELECT current_database();

This shows the current database.

SELECT current_user;

This shows the current userid.

SELECT inet_server_addr();

This shows the IP address of the server that accepted the connection.

A user's password is not accessible using general SQL, for obvious reasons. You may also need the following:

SELECT version();

See also

There are many other snippets of information required to understand connections. Some of those are mentioned in this article. For further details, please consult the PostgreSQL server documentation.
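Pulling those parameters together, a minimal connection check from the shell might look like this (the host, user, and database names are hypothetical):

psql -h dbhost -p 5432 -U myuser -d mydb -c "SELECT current_user, current_database(), inet_server_addr(), inet_server_port(), version();"
# or set the equivalent libpq environment variables first
export PGHOST=dbhost PGPORT=5432 PGUSER=myuser PGDATABASE=mydb
psql -c "SELECT current_user, current_database();"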


Recovery in PostgreSQL 9

Packt
25 Oct 2010
14 min read
Recovery of all databases

Recovery of a complete database server, including all of its databases, is an important feature. This recipe covers how to do that in the simplest way possible.

Getting ready

Find a suitable server on which to perform the restore. Before you recover onto a live server, always take another backup. Whatever problem you thought you had could be just about to get worse.

How to do it...

LOGICAL (from custom dump, pg_dump -F c): Restoring all databases means simply restoring each individual database from each dump you took. Note that there is a separate dumpfile for each database, and this can be executed remotely without needing to transfer the dumpfiles between systems.

1. Confirm you have the correct backup before you restore:
   pg_restore --schema-only -v dumpfile | head | grep Started
2. Reload the globals from the script file:
   psql -f myglobals.sql
3. Reload all databases, creating them with parallel tasks to speed things along:
   pg_restore -d postgres -j 4 dumpfile

LOGICAL (from script dump created by pg_dump -F p): As above, though with psql executing the script. This can also be done remotely without transferring the dumpfile between systems.

1. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way:
   head myscriptdump.sql | grep Started
2. Reload the globals from the script file:
   psql -f myglobals.sql
3. Reload each script:
   psql -f myscriptdump.sql

LOGICAL (from script dump created by pg_dumpall): We need to follow the procedure shown next.

1. Confirm you have the correct backup before you restore. If the following command returns nothing, then the file is not timestamped, and you'll have to identify it in a different way:
   head myscriptdump.sql | grep Started
2. Find a suitable server, or create a new virtual server.
3. Reload the script in full:
   psql -f myscriptdump.sql

PHYSICAL: Restore the backup file onto the target server.

1. Extract the backup file into the new data directory.
2. Confirm that you have the correct backup before you restore:
   $ cat backup_label
   START WAL LOCATION: 0/12000020 (file 000000010000000000000012)
   CHECKPOINT LOCATION: 0/12000058
   START TIME: 2010-06-03 19:53:23 BST
   LABEL: standalone
3. Check that all file permissions and ownerships are correct and that links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended.
4. Start the server.

That procedure is quite simple. It also helps us understand that we need both a base backup and the appropriate WAL files. If you used other techniques, then we need to step through the tasks to make sure we cover everything required, as follows:

1. Shut down any server running in the data directory.
2. Restore the backup so that any files in the data directory with matching names are replaced with the version from the backup. (The manual says to delete all files and then restore the backup; that can be a lot slower than running an rsync between your backup and the destination without the --update option, as sketched below.) Remember that this step can be performed in parallel to speed things up, though it is up to you to script that.
3. Check that all file permissions and ownerships are correct and that links are valid. That should already be the case if you are using the postgres userid everywhere, which is recommended.
4. Remove any files in pg_xlog/.
5. Copy in any latest WAL files from a running server, if any.
6. Add in a recovery.conf and set its file permissions correctly also.
7. Start the server.
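The rsync variant mentioned in the restore step might look roughly like this. It is only a sketch, with hypothetical paths, assuming the server in that data directory is already shut down:

# mirror the base backup over the data directory; --delete removes files not present in the backup,
# which may or may not be what you want given that pg_xlog/ is cleaned out in a later step anyway
rsync -a --delete /backups/pg/servername/base/ /var/lib/pgsql/data/
# the trailing slashes matter: copy the contents of the backup directory, not the directory itself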
The only part that requires some thought and checking is which parameters you select for the recovery.conf. Only one matters here, and that is the restore_command.

restore_command tells us how to restore archived WAL files. It needs to be the command that will be executed to bring back WAL files from the archive. If you are forward-thinking, there'll be a README.backup file for you to read to find out how to set the restore_command. If not, then presumably you've got the location of the WAL files you've been saving written down somewhere. Say, for example, that your files are being saved to a directory named /backups/pg/servername/archive, owned by the postgres user, on a remote server named backup1. We would then write this, all on one line of the recovery.conf, as follows:

restore_command = 'scp backup1:/backups/pg/servername/archive/%f %p'

How it works...

PostgreSQL is designed to require very minimal information to perform a recovery. We try hard to wrap all the details up for you.

Logical recovery: Logical recovery executes SQL to re-create the database objects. If performance is an issue, look at the recipe on recovery performance.

Physical recovery: Physical recovery re-applies data changes at the block level, so it tends to be much faster than logical recovery. Physical recovery requires both a base backup and a set of archived WAL files. There is a file named backup_label in the data directory of the base backup. This tells us to retrieve a .backup file from the archive that contains the start and stop WAL locations of the base backup. Recovery then starts to apply changes from the starting WAL location, and must proceed as far as the stop address for the backup to be valid.

After recovery completes, the recovery.conf file is renamed to recovery.done to prevent the server from re-entering recovery. The server log records each WAL file restored from the archive, so you can check progress and the rate of recovery. You can query the archive to find out the name of the latest archived WAL file, to allow you to calculate how many files are left to go. The restore_command should return 0 if a file has been restored, and non-zero for failure cases. Recovery will proceed until there is no next WAL file, so there will eventually be an error recorded in the logs. If you have lost some of the WAL files, or they are damaged, then recovery will stop at that point. No further changes after that will be applied, and you will likely lose those changes; that would be the time to call your support vendor.

There's more...

You can start and stop the server once recovery has started without any problem. It will not interfere with the recovery. You can also connect to the database server while it is recovering and run queries, if that is useful. That is known as Hot Standby mode.

Recovery to a point in time

If your database suffers a problem at 15:22 and yet your backup was taken at 04:00, you're probably hoping there is a way to recover the changes made between those two times. What you need is known as "point-in-time recovery". Regrettably, if you've made a backup with pg_dump at 04:00, then you won't be able to recover to any other time than 04:00. As a result, the term point-in-time recovery (PITR) has become synonymous with the physical backup and restore technique in PostgreSQL.

Getting ready

If you have a backup made with pg_dump, then give up all hope of using that as a starting point for a point-in-time recovery.
It's a frequently asked question, but the answer is still "no"; the reason it gets asked is exactly why I'm pleading with you to plan your backups ahead of time. First, you need to decide what the point of time is that to which you would like to recover. If the answer is "as late as possible", then you don't need to do a PITR at all, just recover until end of logs. How to do it... How do you decide to what point to recover? The point where we stop recovery is known as the "recovery target". The most straightforward way is to do this based upon a timestamp. In the recovery.conf, you can add (or uncomment) a line that says the following: recovery_target_time = '2010-06-01 16:59:14.27452+01' or similar. Note that you need to be careful to specify the time zone of the target, so that it matches the time zone of the server that wrote the log. That might differ from the time zone of the current server, so check. After that, you can check progress during a recovery by running queries in Hot Standby mode. How it works... Recovery works by applying individual WAL records. These correspond to individual block changes, so there are many WAL records to each transaction. The final part of any successful transaction is a commit WAL record, though there are abort records as well. Each transaction completion record has a timestamp on it that allows us to decide whether to stop at that point or not. You can also define a recovery target using a transaction id (xid), though finding out which xid to use is somewhat difficult, and you may need to refer to external records if they exist. The recovery target is specified in the recovery.conf and cannot change while the server is running. If you want to change the recovery target, you can shutdown the server, edit the recovery.conf, and then restart the server. Be careful though, if you change the recovery target and recovery is already passed the point, it can lead to errors. If you define a recovery_target_timestamp that has already passed, then recovery will stop almost immediately, though this will be later than the correct stopping point. If you define a recovery_target_xid that has already passed, then recovery will just continue to the end of logs. Restarting recovery from the beginning using a fresh restore of the base backup is always safe. Once a server completes recovery, it will assign a new "timeline". Once a server is fully available, we can write new changes to the database. Those changes might differ from changes we made in a previous "future history" of the database. So we differentiate between alternate futures using different timelines. If we need to go back and run recovery again, we can create a new server history using the original or subsequent timelines. The best way to think about this is that it is exactly like a Sci-Fi novel—you can't change the past but you can return to an earlier time and take a different action instead. But you'll need to be careful not to confuse yourself. There's more... pg_dump cannot be used as a base backup for a PITR. The reason is that a log replay contains the physical changes to data blocks, not logical changes based upon Primary Keys. If you reload a pg_dump the data will likely go back into different data blocks, so the changes wouldn't correctly reference the data. WAL doesn't contain enough information to reconstruct all SQL fully that produced those changes. Later feature additions to PostgreSQL may add the required information to WAL. 
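Putting the two parameters discussed above together, a minimal recovery.conf for a point-in-time recovery might look like the following sketch. The archive host, path, and target time are the example values used earlier in this recipe; substitute your own:

# recovery.conf -- point-in-time recovery (example values from this recipe)
restore_command = 'scp backup1:/backups/pg/servername/archive/%f %p'
recovery_target_time = '2010-06-01 16:59:14.27452+01'

Place the file in the data directory, make sure it is readable by the postgres user, and start the server; recovery proceeds until it reaches the target time and then stops.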
See also

Planned for 9.1 is the ability to pause/resume/stop recovery, and to set recovery targets dynamically while the server is up. This will allow you to use the Hot Standby facility to locate the correct stopping point more easily. You can trick Hot Standby into stopping recovery, which may help.

Recovery of a dropped/damaged table

You may drop or even damage a table in some way. Tables could be damaged for physical reasons, such as disk corruption, or they could also be damaged by running poorly specified UPDATEs/DELETEs, which update too many rows or overwrite critical data. It's a common request to recover from this situation from a backup.

How to do it...

The methods differ, depending upon the type of backup you have available. If you have multiple types of backup, you have a choice.

LOGICAL (from custom dump -F c): If you've taken a logical backup using pg_dump into a custom file, then you can simply extract the table you want from the dumpfile like the following:

pg_restore -t mydroppedtable dumpfile | psql

or connect directly to the database using -d. The preceding command tries to re-create the table and then load data into it. Note that the pg_restore -t option does not dump out any of the indexes on the selected table. That means we need a slightly more complex procedure than it would first appear, and the procedure needs to vary depending upon whether we are repairing a damaged table or putting back a dropped table.

To repair a damaged table, we want to replace the data in the table in a single transaction. There isn't a specific option to do this, so we need to do the following:

Dump the table to a script file as follows:

pg_restore -t mydroppedtable dumpfile > mydroppedtable.sql

Edit a script named restore_mydroppedtable.sql with the following code:

BEGIN;
TRUNCATE mydroppedtable;
\i mydroppedtable.sql
COMMIT;

Then, run it using the following:

psql -f restore_mydroppedtable.sql

If you've dropped a table, then you need to:

Create a new database in which to work, named restorework, as follows:

CREATE DATABASE restorework;

Restore the complete schema to the new database as follows:

pg_restore --schema-only -d restorework dumpfile

Now, dump just the definitions of the dropped table into a new file, which will contain CREATE TABLE, indexes, other constraints, and grants. Note that this database has no data in it, so specifying --schema-only is optional, as follows:

pg_dump -t mydroppedtable --schema-only restorework > mydroppedtable.sql

Now, recreate the table on the main database as follows:

psql -f mydroppedtable.sql

Now, reload just the data into database maindb as follows:

pg_restore -t mydroppedtable --data-only -d maindb dumpfile

If you've got a very large table, then the fourth step can be a problem, because it builds the indexes as well. If you want, you can manually edit the script into two pieces, one before the load ("pre-load") and one after the load ("post-load"). There are some ideas for that at the end of the recipe.

LOGICAL (from script dump): The easy way to restore a single table from a script is as follows:

Find a suitable server, or create a new virtual server.

Reload the script in full, as follows:

psql -f myscriptdump.sql

From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows:

pg_dump -t mydroppedtable -F c mydatabase > dumpfile

Now, recreate the table on the original server and database, using parallel tasks to speed things along.
This can be executed remotely without needing to transfer dumpfile between systems. pg_restore -d mydatabase -j 2 dumpfile The only way to extract a single table from a script dump without doing all of the preceding is to write a custom Perl script to read and extract just the parts of the file you want. That can be complicated, because you may need certain SET commands at the top of the file, the table, and data in the middle of the file, and the indexes and constraints on the table are near the end of the file. It's complex; the safe route is the one already mentioned. PHYSICAL: To recover a single table from a physical backup, we need to: Find a suitable server, or create a new virtual server. Recover the database server in full, as described in previous recipes on physical recovery, including all databases and all tables. You may wish to stop at a useful point in time. From the recovered database server, dump the table, its data, and all the definitions of the dropped table into a new file as follows: pg_dump -t mydroppedtable -F c mydatabase > dumpfile Now, recreate the table into the original server and database using parallel tasks to speed things along. This can be executed remotely without needing to transfer dumpfile between systems as follows: pg_restore -d mydatabase -j 2 dumpfile How it works... At present, there's no way to restore a single table from a physical restore in just a single step. See also Splitting a pg_dump into multiple sections, "pre" and "post" was proposed by me for an earlier release of PostgreSQL, though I haven't had time to complete that yet. It's possible to do that using an external utility also; the best script I've seen to split a dump file into two pieces is available at the following website: http://bucardo.org/wiki/split_postgres_dump
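Returning to the logical-dump procedure for a dropped table described earlier, the whole sequence can be scripted. This is only a sketch using the example names from this recipe (dumpfile, maindb, mydroppedtable, restorework); adjust it for your own environment:

# Sketch: put back a dropped table from a pg_dump -F c backup
createdb restorework                                   # scratch database
pg_restore --schema-only -d restorework dumpfile       # rebuild all definitions
pg_dump -t mydroppedtable --schema-only restorework > mydroppedtable.sql
psql -d maindb -f mydroppedtable.sql                   # re-create the table
pg_restore -t mydroppedtable --data-only -d maindb dumpfile   # reload its data
dropdb restorework                                     # clean up the scratch database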

UNIX Monitoring Tool for PostgreSQL

Packt
22 Oct 2010
8 min read

PostgreSQL 9.0 High Performance Accelerate your PostgreSQL system Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance Discover the techniques used to scale successful database installations Avoid the common pitfalls that can slow your system down Filled with advice about what you should be doing; how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment Covers versions 8.1 through 9.0    iostat The data vmstat gives is a total across all devices on the system. If you want totals per disk device instead, you need to use iostat for that. On Linux, iostat defaults to slightly different behavior than vmstat. When it uses "block", it means a 512 byte chunk of data, not the 1024 bytes chunk vmstat uses. You can switch iostat to using kilobytes instead using iostat -k, or you can just divide all the figures by two in order to get them on the same scale. Here's an example of the same data shown both ways: $ iostat Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda1 0.07 3.29 0.24 1579784 115560 $ iostat -k Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn sda1 0.07 1.64 0.12 789892 57780 Since not all UNIX versions will have the kilobyte option available, the examples here all use the default 512 byte blocks, and accordingly halve the block figures to interpret using kilobyte units. You'll likely find that you need to average iostat data over a slightly longer period of time than vmstat data. A single second of vmstat data is a summary of all the disks on the system. A PostgreSQL database goes through several common phases: Just after a checkpoint: heavy full-page writes to WAL, fewer writes to database disks because there are fewer dirty buffer evictions. Between checkpoints: most are an even mix of WAL and database writes. Checkpoint in progress: Small to moderate WAL writes; increasingly heavy database writes as checkpoint data is written and starts flowing to disk. Checkpoint sync phase: Minimal WAL writes because fewer full page writes are likely happening; heavy writes to database disks as all data is flushed out of the OS cache. If you are looking at the vmstat data, or if you don't have the pg_xlog WAL data broken out onto a separate disk, you can't see the balance of the data vs. WAL writes change; you just see a total. But if you're grabbing really short iostat snapshots, you're likely to see writes bounce between the WAL and database disks, with the exact pattern depending on where in the checkpoint cycle you're at. You need to combine a few seconds of data (5 seconds is used for these examples) in order to have both types of writes be usefully averaged out: $ iostat 5 avg-cpu: %user %nice %system %iowait %steal %idle 42.69 0.00 18.07 6.69 0.30 32.25 Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 0.00 0.00 0.00 0 0 sda1 0.00 0.00 0.00 0 0 sdc 80.80 0.00 1286.40 0 6432 sdc1 80.80 0.00 1286.40 0 6432 sdd 77.80 0.00 1251.20 0 6256 sdd1 77.80 0.00 1251.20 0 6256 sde 69.40 0.00 1086.40 0 5432 sde1 69.40 0.00 1086.40 0 5432 sdf 2348.20 0.00 88262.40 0 441312 sdf1 2348.20 0.00 88262.40 0 441312 md0 311.40 0.00 2491.20 0 12456 Since all of the activity relates to the single partition on these disks, there's a lot of redundant data in here. You should also note that many of the statistics for the software RAID volume used here are not very interesting—you have to look at the underlying physical disk devices instead. 
If you're using hardware RAID, that particular problem will go away, but you won't have any easy way to get actual disk performance information out of that abstraction layer either; you'll just see the summary for the whole logical RAID device. The following examples eliminate all the redundant lines, and place the md0 array device between its individual components and the device the WAL is on (sdf1), for easier readability. Examples of good performance When busy but not overloaded, iostat data for this system looks like the following: $ iostat 5 avg-cpu: %user %nice %system %iowait %steal %idle 18.54 0.00 9.45 23.49 0.15 48.38 Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sdc1 1068.80 0.00 15740.80 0 78704 sdd1 1041.80 0.00 15459.20 0 77296 sde1 1028.00 0.00 15377.60 0 76888 md0 5969.20 0.00 47753.60 0 238768 sdf1 989.00 0.00 40449.60 0 202248 The %iowait figure of 23% is high enough to know the disks are busy, but not completely saturated yet. This is showing 20 MB/s (40449.6 512-byte blocks per second) being written to the WAL and 24 MB/s to the entire database disk array, the latter of which is evenly split as almost 8 MB/s to each of the three drives. Linux also features an extended iostat mode. This produces a large number of derived statistics from the underlying data. Since that's too wide to display here, the first example showing all of the data here has been transposed to swap the row for columns and vice-versa: $ iostat –x 5 sdc1 sdd1 sde1 md0 sdf1 rrqm/s 0 0 0 0 0 wrqm/s 411.8 404.6 396.2 0 3975.4 r/s 0 0 0 0 0 w/s 438.6 442 444.2 2461.4 1229.8 rsec/s 0 0 0 0 0 wsec/s 6956.8 6966.4 6915.2 19691.2 41643.2 avgrq-sz 15.86 15.76 15.57 8 33.86 avgqu-sz 67.36 67.09 62.93 0 0.65 await 158.18 158.85 148.39 0 0.55 svctm 1.2 1.2 1.19 0 0.51 %util 52.8 52.88 53.04 0 63.04 All of the values here with a "q" in them (most of what's listed on the following bulleted line) represent figures related to the read or write queues on these devices. Since the queue size doesn't correspond with any real-world figure you can benchmark the device against, it's hard to do anything with that data. The number of read and write requests is similarly useless in a database context. The following fields of iostat -x data are therefore not that useful here: rrqm/s, wrqm/s, r/s, w/s, avgrq-sz, avgqu-sz Solaris has a similar extended mode available using iostat -xc This next example is similar to the iostat one given previously: $ iostat –x 5 avg-cpu: %user %nice %system %iowait %steal %idle 21.51 0.00 11.08 23.75 0.10 43.56 Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc1 0.00 6956.80 15.86 67.36 158.18 1.20 52.80 sdd1 0.00 6966.40 15.76 67.09 158.85 1.20 52.88 sde1 0.00 6915.20 15.57 62.93 148.39 1.19 53.04 md0 0.00 19691.20 8.00 0.00 0.00 0.00 0.00 sdf 0.00 41643.20 33.86 0.65 0.55 0.51 63.04 That's 21 MB/s written to the WAL and 20 MB/s to the database disks, about 7 MB/s to each one. However, recall that the total disk read or write throughput available depends heavily on how random the workload is, which is normally a hard thing to estimate. The %util figure, which is by far the most valuable of the derived figures shown here, gives you a rough idea of that by noting how congested the device is to achieve that throughput. 
In this next example, there's minimal database I/O and heavy WAL I/O, typical of the period just after a checkpoint: $ iostat –x 5 avg-cpu: %user %nice %system %iowait %steal %idle 49.35 0.00 22.00 3.80 0.25 24.60 Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc1 0.00 2649.10 15.01 0.76 4.31 0.06 1.04 sdd1 0.00 2895.01 14.95 0.90 4.64 0.06 1.12 sde1 0.00 2728.94 15.06 0.82 4.51 0.06 1.04 md0 0.00 8273.05 8.00 0.00 0.00 0.00 0.00 sdf1 0.00 103760.48 38.11 0.23 0.09 0.09 23.47 This is happily getting >50 MB/s out of the WAL volume but it's still only busy 23.5% of the time. This suggests writes to it are being cached by the disk controller and written quite efficiently. One of the reasons to break out the WAL onto its own disk is because it makes it so easy to monitor this balance between WAL and database writes, and to determine if the WAL volume (which only gets sequential writes normally) is keeping up. Since there are techniques to accelerate the WAL writes at the expense of something else, such as switching to an unjournaled filesystem, the %util figure can help you determine when the WAL is the system bottleneck and therefore necessary to accelerate that way. A final example of good performance involves the database disks. There are some operations in PostgreSQL that can bypass writing to the WAL. For example, if you start a transaction that creates a new table and does a COPY into it, as long as you don't have PITR archiving turned on that data is not put through the WAL before being written to disk. The idea is that if the server crashes, the whole transaction will be rolled back anyway, which includes deleting the table data; therefore, whether it's consistent or not at the block level doesn't matter. Here is what the database disks are capable of when running such a COPY, which essentially turns into sequential write I/O directly to the database: $ iostat –x 5 avg-cpu: %user %nice %system %iowait %steal %idle 16.39 0.00 6.85 12.84 0.00 63.92 Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sdc1 25.60 58710.40 249.09 27.22 115.43 1.19 28.08 sdd1 24.00 58716.80 249.11 27.76 117.71 1.20 28.24 sde1 1.60 58667.20 250.51 28.31 120.87 1.14 26.80 md0 51.20 176094.40 8.00 0.00 0.00 0.00 0.00 sdf1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 This is over 29 MB/s being written to each database disk, for a total of 88 MB/s to the RAID 0 array, and even that isn't fully utilizing the disks, as shown by the %util at about 28%. Given that this is a four-core server and the COPY is the only process running, a %user of 16 means that about 64% of a single CPU is busy here. The CPU and disks are likely waiting for each other a bit in this situation, and you might have to improve both to significantly speed this up. This example is from a server with a battery-backed RAID controller; without one, it's much easier to run into one of the disk bottlenecks here before the CPU ones. Final iostat hint: on some versions you can switch the output to use megabytes/second as its units, which is often the easiest to read. The following syntax for example, usually makes for a good summary on Linux systems: $ iostat -x -m 5
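If you know which devices hold the WAL and the database, you can also strip the iostat output down to just the lines you care about. This is only a sketch; the device names (sdf1 for the WAL, sdc1/sdd1/sde1 and md0 for the database array) are the ones from the examples above, so substitute your own:

$ iostat -x -m 5 | egrep 'Device|sdf1|sdc1|sdd1|sde1|md0'

Watching the write throughput and %util columns for the WAL device against those for the database disks makes it easy to spot which side is the bottleneck at any point in the checkpoint cycle.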

Oracle Business Intelligence: Drilling Data Up and Down

Packt
21 Oct 2010
5 min read

What is data drilling? In terms of Oracle Discoverer, drilling is a technique that enables you to quickly navigate through worksheet data, finding the answers to the questions facing your business. As mentioned, depending on your needs, you can use drilling to view the data you're working with in deeper detail or, in contrast, drill it up to a higher level. The drilling to detail technique enables you to look at the values making up a particular summary value. Also, you can drill to related items, adding related information that is not currently included in the worksheet. So, Discoverer supports a set of drilling tools, including the following: Drilling up and down Drilling to a related item Drilling to detail Drilling out The following sections cover the above tools in detail, providing examples on how you might use them. Drilling to a related item Let's begin with a discussion on how to drill to a related item, adding the detailed information for a certain item. As usual, this is best understood by example. Suppose you want to drill from the Maya Silver item, which can be found on the left axis of the worksheet, to the Orddate:Day item. Here are the steps to follow: Let's first create a copy of the worksheet to work with in this example. To do this, move to the worksheet discussed in the preceding example and select the Edit | Duplicate Worksheet | As Crosstab menu of Discoverer. In the Duplicate as Crosstab dialog, just click OK. As a result a copied worksheet should appear in the workbook. On the worksheet, right-click the Maya Silver item and select Drill… in the pop-up menu: As a result, the Drill dialog should appear. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Orddate:Day item, as shown in the following screenshot: Then, click OK to close the dialog and rearrange the data on the worksheet. The reorganized worksheet should now look like the following one: As you can see, this shows the Maya Silver item broken down into day sales per product. Now suppose you want to see a more detailed view of the Maya Silver item and break it out further into product category. Right-click the Maya Silver item and select Drill… in the pop-up menu. In the Drill dialog, select Drill to a Related Item in the Where do you want to drill to? select box and then choose the Category item. Next, click OK.The resulting worksheet should look now like this: As you can see, the result of the drilling operations you just performed is that you can see the dollar amount for Maya Silver detailed by category, by day, by product. You may be asking yourself if it's possible to change the order in which the Maya Silver record is detailed. Say, you want to see it detailed in the following order: by day, by category, and finally by product. The answer is sure. On the left axis of the worksheet, drag the Orddate:Day item (the third from the left) to the second position within the same left axis, just before the Category item, as shown in the following screenshot: As a result, you should see that the data on the worksheet has been rearranged as shown in the following screenshot: Having just a few rows in the underlying tables, as we have here, is OK for demonstration purposes, since it results in compact screenshots. To see more meaningful figures on the worksheet though, you might insert more rows into the orderitems, orders, and products underlying tables. 
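If you'd like to do that, a few INSERT statements run in SQL*Plus or any other SQL client will do the job. The column lists below are only illustrative; they are not taken from the actual demonstration schema, so adjust them to match your own table definitions:

-- Hypothetical column names; adapt to your products/orders/orderitems tables
INSERT INTO products (prodno, prodname, category)
  VALUES (2108, 'Widget', 'Tools');
INSERT INTO orders (ordno, ord_dt, sales_empno, ord_total)
  VALUES (1021, DATE '2009-11-02', 7499, 1250);
INSERT INTO orderitems (ordno, prodno, qty)
  VALUES (1021, 2108, 5);
COMMIT;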
Once you're done with it, you can click the Refresh button on the Discoverer toolbar to see an updated worksheet. Select the File | Save menu option of Discoverer to save the worksheet discussed here. Drilling up and down As the name implies, drilling down is a technique you can use to float down a drill hierarchy to see data in more detail. And drilling up is the reverse operation, which you can use to slide up a drill hierarchy to see consolidated data. But what is a drill hierarchy? Working with drill hierarchies A drill hierarchy represents a set of items related to each other according to the foreign key relationships in the underlying tables. If a worksheet item is associated with a drill hierarchy, you can look at that hierarchy by clicking the drill icon located at the left of the heading of the worksheet item. Suppose you want to look at the hierarchy associated with the Orddate item located on our worksheet at the top axis. To do this, click the Orddate drill icon. As a result, you should see the menu shown in the following screeenshot: As you can see, you can drill up here from Orddate to Year, Quarter, or Month. The next screenshot illustrates what you would have if you chose Month. It's important to note that you may have more than one hierarchy associated with a worksheet item. In this case, you can move on to the hierarchy you want to use through the All Hierarchies option on the drill menu.
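Behind the scenes, each drill level simply corresponds to a coarser or finer grouping of the same measure. As a rough sketch, a drill up from Orddate to Month asks the database for something like the following query, assuming an orders table with an order date and an order total as in the examples used elsewhere in this series; Discoverer generates its own SQL for you:

SELECT TO_CHAR(ord_dt, 'YYYY-MM') AS ord_month,
       SUM(ord_total)             AS sales
  FROM orders
 GROUP BY TO_CHAR(ord_dt, 'YYYY-MM')
 ORDER BY ord_month;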

Server Configuration Tuning in PostgreSQL

Packt
18 Oct 2010
10 min read

  PostgreSQL 9.0 High Performance A clear, step-by-step guide to optimizing and scaling up PostgreSQL database servers Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance Discover the techniques used to scale successful database installations Avoid the common pitfalls that can slow your system down Filled with advice about what you should be doing; how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment Covers versions 8.1 through 9.0           Read more about this book       The main tunable settings for PostgreSQL are in a plain text file named postgresql.conf that's located at the base of the database directory structure. This will often be where $PGDATA is set to on UNIX-like systems, making the file $PGDATA/postgresql.conf on those platforms. This article by Gregory Smith, author of PostgreSQL 9.0 High Performance, mirrors the general format of the official documentation's look at these parameters at http://www.postgresql.org/docs/current/static/runtime-config.html. However, it is more focused on guidelines for setting the most important values, from the perspective of someone interested in performance tuning, rather than describing the meaning of every parameter. This should be considered a supplement to rather than a complete replacement for the extensive material in the manual. Logging General logging setup is important but it is somewhat outside the scope of this article. You may need to set parameters such as log_destination, log_directory, and log_filename to save your log files in a way compatible with the system administrations requirements of your environment. These will all be set to reasonable defaults to get started with on most systems. On UNIX-like systems, it's common for some of the database logging to be set in the script that starts and stops the server, rather than directly in the postgresql.conf file. If you instead use the pg_ctl command to manually start the server, you may discover that logging ends up on your screen instead. You'll need to look at the script that starts the server normally (commonly /etc/init.d/postgresql) to determine what it does, if you want to duplicate that behavior. In most cases, you just need to add –l logfilename to the pg_ctl command line to redirect its output to the standard location. log_line_prefix The default log_line_prefix is empty, which is not what you want. A good starting value here is the following: log_line_prefix='%t:%r:%u@%d:[%p]: ' This will put the following into every log line: %t: Timestamp %u: Database user name %r: Remote host connection is from %d: Database connection is to %p: Process ID of connection It may not be obvious what you'd want all of these values for initially, particularly, the process ID. Once you've tried to chase down a few performance issues, the need for saving these values will be more obvious, and you'll be glad to already have this data logged. Another approach worth considering is setting log_line_prefix such that the resulting logs will be compatible with the pgFouine program. That is a reasonable, general purpose logging prefix, and many sites end up needing to do some sort of query analysis eventually. log_statement The options for this setting are as follows: none: Do not log any statement-level information. ddl: Log only Data Definition Language (DDL) statements such as CREATE and DROP. 
This can normally be left on even in production, and is handy to catch major changes introduced accidentally or intentionally by administrators. mod: Log any statement that modifies a value, which is essentially everything except for simple SELECT statements. If your workload is mostly SELECT based with relatively few data changes, this may be practical to leave enabled all the time. all: Log every statement. This is generally impractical to leave on in production due to the overhead of the logging. However, if your server is powerful enough relative to its workload, it may be practical to keep it on all the time. Statement logging is a powerful technique for finding performance issues. Analyzing the information saved by log_statement and related sources for statement-level detail can reveal the true source for many types of performance issues. You will need to combine this with appropriate analysis tools. log_min_duration_statement Once you have some idea of how long a typical query statement should take to execute, this setting allows you to log only the ones that exceed some threshold you set. The value is in milliseconds, so you might set: log_min_duration_statement=1000 And then you'll only see statements that take longer than one second to run. This can be extremely handy for finding out the source of "outlier" statements that take much longer than most to execute. If you are running 8.4 or later, you might instead prefer to use the auto_explain module: http://www.postgresql.org/docs/8.4/static/auto-explain.html instead of this feature. This will allow you to actually see why the queries that are running slowly are doing so by viewing their associated EXPLAIN plans. Vacuuming and statistics PostgreSQL databases require two primary forms of regular maintenance as data is added, updated, and deleted. VACUUM cleans up after old transactions, including removing information that is no longer visible and returning freed space to where it can be re-used. The more often you UPDATE and DELETE information from the database, the more likely you'll need a regular vacuum cleaning regime. However, even static tables with data that never changes once inserted still need occasional care here. ANALYZE looks at tables in the database and collects statistics about them— information like estimates of how many rows they have and how many distinct values are in there. Many aspects of query planning depend on this statistics data being accurate. autovacuum As both these tasks are critical to database performance over the long-term, starting in PostgreSQL 8.1 there is an autovacuum daemon available that will run in the background to handle these tasks for you. Its action is triggered by the number of changes to the database exceeding a threshold it calculates based on the existing table size. The parameter for autovacuum is turned on by default in PostgreSQL 8.3, and the default settings are generally aggressive enough to work out of the box for smaller database with little manual tuning. Generally you just need to be careful that the amount of data in the free space map doesn't exceed max_fsm_pages, and even that requirement is automated away from being a concern as of 8.4. 
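Taken together, the logging and autovacuum settings discussed so far might look like the following postgresql.conf excerpt. The values are the examples used above and should be tuned for your own site:

# postgresql.conf excerpt (example values)
log_line_prefix = '%t:%r:%u@%d:[%p]: '
log_statement = 'ddl'               # consider 'mod' or 'all' if the overhead is acceptable
log_min_duration_statement = 1000   # log statements taking longer than one second
autovacuum = on                     # already the default starting in 8.3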
Enabling autovacuum on older versions

If you have autovacuum available but it's not turned on by default, which will be the case with PostgreSQL 8.1 and 8.2, there are a few related parameters that must also be enabled for it to work, as covered in http://www.postgresql.org/docs/8.1/interactive/maintenance.html or http://www.postgresql.org/docs/8.2/interactive/routine-vacuuming.html. The normal trio to enable in the postgresql.conf file in these versions is:

stats_start_collector = true
stats_row_level = true
autovacuum = on

Note that, as warned in the documentation, it's also wise to consider adjusting superuser_reserved_connections to allow for the autovacuum processes in these earlier versions.

The autovacuum you'll get in 8.1 and 8.2 is not going to be as efficient as what comes in 8.3 and later. You can expect it to take some fine tuning to get the right balance of enough maintenance without too much overhead, and because there's only a single worker it's easier for it to fall behind on a busy server. This topic isn't covered at length here. It's generally a better idea to put time into planning an upgrade to a PostgreSQL version with a newer autovacuum than to try and tweak an old one extensively, particularly as there are many other performance issues that cannot be resolved easily in the older versions, too.

maintenance_work_mem

A few operations in the database server need working memory for larger operations than just regular sorting. VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY can all allocate up to maintenance_work_mem worth of memory instead. As it's unlikely that many sessions will be doing one of these operations at once, it's possible to set this value much higher than the standard per-client work_mem setting. Note that up to autovacuum_max_workers sessions (defaulting to 3 starting in version 8.3) can each allocate this much memory, so consider those sessions (perhaps along with a session or two doing a CREATE INDEX) when setting this value. Assuming you haven't increased the number of autovacuum workers, a typical high setting for this value on a modern server would be five percent of the total RAM, so that even five such processes wouldn't exceed a quarter of available memory. This works out to approximately 50 MB of maintenance_work_mem per GB of server RAM.

default_statistics_target

PostgreSQL makes its decisions about how queries execute based on statistics collected about each table in your database. This information is collected by analyzing the tables, either with the ANALYZE statement or via autovacuum doing that step. In either case, the amount of information collected during the analyze step is set by default_statistics_target. Increasing this value makes analysis take longer, and as analysis via autovacuum happens regularly, this turns into increased background overhead for database maintenance. But if there aren't enough statistics about a table, you can get bad plans for queries against it. The default value for this setting used to be very low (10), but it was increased to 100 in PostgreSQL 8.4. Using that larger value was popular in earlier versions, too, for generally improved query behavior. Indexes using the LIKE operator tended to work much better with values greater than 100 rather than below it, due to a hard-coded change at that threshold. Note that increasing this value does result in a net slowdown on your system if you're not running queries where the additional statistics result in a change to a better query plan.
This is one reason why some simple benchmarks show PostgreSQL 8.4 as slightly slower than 8.3 at default parameters for each, and in some cases you might return an 8.4 install to a smaller setting. Extremely large settings for default_statistics_target are discouraged due to the large overhead they incur. If there is just a particular column in a table you know that needs better statistics, you can use ALTER TABLE SET STATISTICS on that column to adjust this setting just for it. This works better than increasing the system-wide default and making every table pay for that requirement. Typically, the columns that really require a lot more statistics to work properly will require a setting near the maximum of 1000 (increased to 10,000 in later versions) to get a serious behavior change, which is far higher than you'd want to collect data for on every table in the database.
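For example, to collect much more detailed statistics for a single troublesome column without raising the system-wide default, you might run something like the following; the table and column names here are placeholders:

ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;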

Oracle Business Intelligence: Getting Business Information from Data

Packt
13 Oct 2010
11 min read

Most businesses today use Business Intelligence (BI), the process of obtaining business information from available data, to control their affairs. If you're new to Business Intelligence, then this definition may leave you with the following questions: What is data? What is the information obtained from it? What is the difference between data and the information obtained from it? You may be confused even more if you learn that data represents groups of information related to an object or a set of objects. Depending on your needs, though, such groups of information may or may not be immediately useful, and often require additional processing such as filtering, formatting, and/or calculating to take on a meaning. For example, information about your customers may be organized in a way that is stored in several database tables related to each other. For security purposes, some pieces of information stored in this way may be encoded, or just represented in binary, and therefore not immediately readable. It's fairly obvious that some processing must be applied before you can make use of such information. So, data can be thought of as the lowest level of abstraction from which meaningful information is derived. But what is information anyway? Well, a piece of information normally represents an answer to a certain question. For example, you want to know how many new customers have registered on your site this year. An answer to this question can be obtained with a certain query issued against the table containing customer registration dates, giving you the information you asked for. Data, information, and Business Intelligence Although the terms data and information refer to similar things, they aren't really interchangeable as there is some difference in their meaning and spirit. Talking about data, as a rule, involves its structure, format, storage, as well as ways in which you can access and manipulate it. In contrast, when talking about information, you mean food for your decision-making process. So, data can be viewed as low-level information structures, where the internal representation matters. Therefore, the ways in which you can extract useful information from data entirely depend on the structure and storage of that data. The following diagram gives a conceptual view of delivering information from different data sets: As you can see from the figure, information can be derived from different data sources, and by different means. Once it's derived, though, it doesn't matter where it has come from, letting its consumers concentrate on the business aspects rather than on the specifics of the internal structure. For example, you might derive some pieces of data from the Web, using the Oracle Database's XQuery feature, and then process it as native database data. To produce meaningful information from your data, you will most likely need to perform several processing steps, load new data, and summarize the data. This is why the Business Intelligence layer usually sits on top of many data sources, consolidating information from various business systems and heterogeneous platforms. The following figure gives a graphical depiction of a Business Intelligence system. In particular, it shows you that the Business Intelligence layer consumes information derived from various sources and heterogeneous platforms. It is intuitively clear that the ability to solve problems is greatly enhanced if you can effectively handle all the information you're getting. 
On the other hand, extracting information from data coming in from different sources may become a nightmare if you try to do it on your own, with only the help of miscellaneous tools. Business Intelligence comes to the rescue here, ensuring that the extraction, transformation, and consolidation of data from disparate sources becomes totally transparent to you. For example, when using a Business Intelligence application for reporting, you may never figure out exactly what happens behind the scenes when you instruct the system to prepare another report. The information you need for such a report may be collected from many different sources, hiding the complexities associated with handling heterogeneous data. But, without Business Intelligence, that would be a whole different story, of course. Imagine for a moment that you have to issue several queries against different systems, using different tools, and then have to consolidate the results somehow—all just to answer a single business question such as: what are the top three customers for the preceding quarter?

As you have no doubt realized, the software at the Business Intelligence layer is used to provide a business-centric view of data, eliminating as much of the technology-specific logic as possible. What this means in practice is that information consumers working at the Business Intelligence layer may not even know that, say, customer records are stored in a Lightweight Directory Access Protocol (LDAP) database, while purchase orders are kept in a relational database.

The kind of business questions you may need to answer

As you just learned, Business Intelligence is here to consolidate information from disparate sources so that you need not concern yourself with it. Okay, but why might you need to gather and process heterogeneous data? The answer is clear. You might need it in order to answer analytical questions that allow you to understand and run your business better. In the following two sections, you'll look at some common questions that Business Intelligence can help you answer. Then, you'll see how you can ask those questions with the help of Business Intelligence tools.

Answering basic business questions

The set of questions you may need your Business Intelligence system to answer will vary depending on your business and, of course, your corresponding functions. However, to give you a taste of what Business Intelligence can do for you, let's first look at some questions that are commonly brought up by business users:

What is the average salary throughout the entire organization?
Which customers produce the most revenue?
What is the amount of revenue each salesman brought in over the preceding quarter?
What is the profitability of each product?

If you run your business online, you may also be interested in hit counting and traffic analysis questions, such as the following:

How much traffic does a certain account generate over a month?
What pages in your site are most visited?
What are the profits made online?

Looking at the business analysis requests presented here, a set of questions related to your own business may flash into your mind.

Answering probing analytical questions

In the preceding section, you looked at some common questions a business analyst is usually interested in asking. But in reality, you may have to answer more probing questions in your decision-making process, in order to determine changes in the business and find ways to improve it.
Here are some probing analytical questions you might need to find answers to:

How do sales for this quarter compare to sales for the preceding quarter?
What factors impact our sales?
Which products are sold better together?
What are the ten top-selling products in this region?
What are the factors influencing the likelihood of purchase?

As you can see, each of these questions reflects a certain business problem. Looking through the previous list, though, you might notice that some of the questions shown here can be hard to formulate with the tools available in a computer application environment. There's nothing to be done here; computers like specific questions. Unlike humans, machines can give you exactly what you ask for, not what you actually mean. So, even an advanced Business Intelligence application will require you to be as specific as possible when it comes to putting a question to it.

It's fairly clear that the question about finding the factors impacting sales needs to be rephrased to become understandable to a Business Intelligence application. How you would rephrase it depends on the specifics of your business, of course. Often, it's good practice to break apart a problem into simpler questions. For example, the first question on the above list—the one about comparing quarter sales—might be logically divided into the following two questions:

What are the sales figures for this quarter?
What are the sales figures for the last quarter?

Once you get these questions answered, you can compare the results, thus answering the original, more generically phrased question. It can also provide one definition or variation for drill down. In the above example, it's fairly obvious what specific questions can be derived from the generic question. There may be probing questions, though, whose derived questions are not so obvious. For example, consider the following question: What motivates a customer to buy? This could perhaps be broken down into the following questions:

Where did visitors come from?
Which pages did they visit before reaching the product page?

Of course, the above list does not seem to be complete—some other questions might be added.

Asking business questions using data-access tools

As you might guess, although all these questions sound simple when formulated in plain English, they are more difficult to describe when using data-access tools. If you're somewhat familiar with SQL, you might notice that most of the analytical questions discussed here cannot be easily expressed with the help of SQL statements, even if the underlying data is relational. For example, the problem of finding the top three salespersons for a year may require you to write a multi-line SQL request including several sub-queries. Here is what such a query might look like:

SELECT emp.ename salesperson,
       top_emp_orders.sales sales
FROM (
    SELECT all_orders.sales_empno empno,
           all_orders.total_sales sales
    FROM (
        SELECT sales_empno,
               SUM(ord_total) total_sales,
               RANK() OVER (ORDER BY SUM(ord_total) DESC) sal_rank
        FROM orders
        WHERE EXTRACT(YEAR FROM ord_dt) = 2009
        GROUP BY sales_empno
    ) all_orders
    WHERE all_orders.sal_rank <= 3
) top_emp_orders, employees emp
WHERE top_emp_orders.empno = emp.empno
ORDER BY sales DESC;

This might produce a short result set listing each of the top three salespersons along with their total sales. If you're not an SQL guru, of course, writing the above query and then debugging it could easily take a couple of hours. Determining profitability by customer, for example, might take you another couple of hours to write a proper SQL query.
In other words, business questions are often somewhat tricky (if possible at all) to implement with SQL. All this does not mean that SQL is not used in the area of Business Intelligence. Quite the contrary, SQL is still indispensable here. In fact, SQL has a lot to offer when it comes to data analysis. As you just saw, though, composing complex queries assumes solid SQL skills. Thankfully, most Business Intelligence tools use SQL behind the scenes totally transparently to users.

Now let's look at a simple example illustrating how you can get an analytical question answered with a Business Intelligence tool—Oracle BI Discoverer Plus in this particular example. Suppose you simply want to calculate the average salary across the organization. This example could use the records from the hr.employees demonstration table. Creating a worksheet representing the records of a database table in Discoverer Plus involves analyzing data and creating reports with the tools available through the Oracle Business Intelligence suite. For now, look at the following screenshot to see what such a worksheet might look like:

As you can see in the previous screenshot, a Discoverer Plus worksheet is similar to one in MS Excel. As in Excel, there are toolbars and menus offering a lot of options for manipulating and analyzing data presented on the worksheet. In addition, Discoverer Plus offers Item Navigator, which enables you to add data to (or remove it from) the worksheet. The data structure you can see in Item Navigator is retrieved from the database.

Returning to our example question, "What is the average salary across the organization?", answering it is, as in Excel, as simple as selecting the Salary SUM column on the worksheet, choosing an appropriate menu, and setting some parameters in the dialog shown next. After you click the OK button in this dialog box, the calculated average will be added to the worksheet in the position specified. So, the Total dialog shown in the following screenshot provides an efficient means for automating the process of creating a total on a specified data column:

As you can see, this approach doesn't require you to write an SQL query on your own. Instead, Discoverer Plus will do it for you implicitly, thus allowing you to concentrate on business issues rather than data access issues. This previous example should have given you a taste of what Business Intelligence can do for you.
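For comparison, the same question can be answered directly in SQL against the hr.employees demonstration table mentioned above; Discoverer simply builds and runs an equivalent query for you behind the scenes:

SELECT AVG(salary) AS average_salary
  FROM hr.employees;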

Exception Handling in MySQL for Python

Packt
27 Sep 2010
7 min read

Why errors and warnings are good for you The value of rigorous error checking is exemplified in any of the several catastrophes arising from poor software engineering. Examples abound, but a few are particularly illustrative of what happens when bad data and design go unchallenged. On 4 June 1996, the first test flight of the Ariane 5 rocket self-destructed 37 seconds after its launch. The navigation code from Ariane 4 was reused in Ariane 5. The faster processing speed on the newer rocket caused an operand error. The conversion of a 64-bit floating-point value resulted in a larger-than-expected and unsupported 16-bit signed integer. The result was an overflow that scrambled the flight's computer, causing too much thrust to be passed by the rocket itself, resulting in the crash of US$370 million worth of technology. Widely considered to be one of the most expensive computer bugs in history, the crash arose due to mistakes in design and in subsequent error checking. On 15 January 1990, the American telecommunications company AT&T installed a new system on the switches that controlled their long-distance service. A bug in the software caused the computers to crash every time they received a message from one of their neighboring switches. The message in question just happened to be the same one that the switches send out when they recover from a system crash. The result: Within a short time, 114 switches across New York City were rebooting every six seconds, leaving around 60,000 people without long distance service for nine hours. The system ultimately had to be fixed by reinstalling the old software. On the Internet, a lack of proper error-checking still makes it possible for a malformed ping request to crash a server anywhere in the world. The Computer Emergency Response Team (CERT) Advisory on this bug, CA-1996-26, was released in 1996, but the bug persists. The original denial-of-service attack has thus evolved into the distributed denial-of-service attack employing botnets of zombie machines worldwide. More than any other part of a computing system, errors cost significantly more to fix later than if they were resolved earlier in the development process. It is specifically for this reason that Python outputs error messages to the screen, unless such errors are explicitly handled otherwise. A basic dynamic of computing is that the computer does not let anyone know what is happening inside itself. A simple illustration of this dynamic is as follows: x = 2 if x == 2: x = x + x Knowing Python and reading the code, we understand that the value of x is now 4. But the computer has provided us no indication of the value of x. What's more, it will not tell us anything unless we explicitly tell it to do so. Generally speaking, there are two ways you can ask Python to tell you what it's thinking: By outputting values to the screen By writing them to a file Here, a simple print statement would tell us the value of x. Output displayed on the screen or saved to a file are the most common ways for programs to report their status to users. However, the similar effect is done by indicator lights and other non-verbal forms of communication. The type of output is necessarily dependent on the hardware being used. By default, Python outputs all errors and warnings to the screen. As MySQL for Python is interpreted by Python, errors passed by MySQLdb are no different. 
This naturally gives the debugging programmer information for ironing out the performance of the program—whether determining why a program is not executing as planned or how to make it execute faster or more reliably. However, it also means that any information needed for tracing the error, along with parts of the code, is passed to the user, whoever they may be. This is great for debugging, but makes for terrible security. That is why the Zen of Python reads: Errors should never pass silently Unless explicitly silenced One needs the error messages to know why the program fails, but it is a security hazard to pass raw error messages to the user. If one wants the user to handle an error message, it should be sanitized of information that may compromise the security of the system. Handling exceptions correctly takes a lot of code. At the risk of sounding like a hypocrite, it should be noted that the exigencies of a printed book do not allow for the reproduction of constant, rigorous error-handling in the code examples such as this article espouses. Therefore, while I state this principle, the programming examples do not always illustrate it as they should. If they did, the book would be significantly thicker and heavier (and probably cost more too!). Further, the more complicated an application, the more robust the error-handling should be. Ultimately, every kind of error is covered by one of the several types that can be thrown by MySQL for Python. Each one of them allows for customized error messages to be passed to the user. With a bit of further coding, one can check the authentication level of the user and pass error messages according to their level of authorization. This can be done through a flag system or by using modules from the Python library. If the former is used, one must ensure that knowledge of the flag(s) used is guarded from unauthorized users. Alternatively, one can employ both systems by checking the authentication level of users or programs that pass a particular flag to the program. Errors versus warnings: There's a big difference As with Python in general, the main difference between errors and warnings is that warnings do not cause a program to terminate. Errors do. Warnings provide notice of something we should note; errors indicate the reason the program cannot continue. If not handled appropriately, warnings therefore pass process information to the user without interrupting the execution of the program. This lack of detectability makes warnings more dangerous to the security of an application, and the system in general, than errors. Consequently, the error-handling process of an application must account for both errors and warnings. While Python handles warnings and exceptions differently by default, especially with regard to program execution, both are written to stderr. Therefore, one handles them the same way that one handles standard errors. Additionally, one can set warnings to be silenced altogether or to carry the same gravity as an error. This level of functionality was introduced in Python 2.1. The two main errors in MySQLdb Python generally supports several kinds of errors, and MySQL for Python is no different. The obvious difference between the two is that MySQLdb's errors deal exclusively with the database connection. Where MySQLdb passes warnings that are not MySQL-specific, all exceptions are related to MySQL. The MySQL-specific exceptions are then classified as either warnings or errors. 
There is only one kind of warning, but MySQLdb allows two categories of errors—DatabaseError and InterfaceError. Of the former, there are six types that we will discuss here. DatabaseError When there is a problem with the MySQL database itself, a DatabaseError is thrown. This is an intermediate catch-all category of exceptions that deal with everything from how the data is processed (for example, errors arising from division by zero), to problems in the SQL syntax, to internal problems within MySQL itself. Essentially, if a connection is made and a problem arises, the DatabaseError will catch it. Several types of exceptions are contained by the DatabaseError type. We look at each of these in the section Handling exceptions passed from MySQL. InterfaceError When the database connection fails for some reason, MySQLdb will raise an InterfaceError. This may be caused from problems in the interface anywhere in the connection process.
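As a rough sketch of the pattern described above, the following code catches both exception types around a simple query, with warnings promoted to errors so that they cannot pass silently. The connection details and the query are placeholders, not part of MySQLdb itself, and the warning filter is one reasonable way to give warnings the same gravity as errors:

import warnings
import MySQLdb

# Treat MySQL warnings as seriously as errors, so they cannot pass silently.
warnings.filterwarnings('error', category=MySQLdb.Warning)

try:
    connection = MySQLdb.connect(host='localhost', user='skipper',
                                 passwd='secret', db='fish')
    cursor = connection.cursor()
    cursor.execute("SELECT VERSION()")
    print cursor.fetchone()
    cursor.close()
    connection.close()
except MySQLdb.InterfaceError, error:
    # The connection itself could not be made.
    print "Could not connect to MySQL:", error
except MySQLdb.DatabaseError, error:
    # The connection worked, but MySQL reported a problem. Log the detail
    # for the administrator and show the user a sanitized message instead.
    print "A database error occurred; please contact the administrator."
except MySQLdb.Warning, warning:
    # Only reached because warnings were promoted to errors above.
    print "MySQL warning:", warning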

Creating Your Own Functions in MySQL for Python

Packt
27 Sep 2010
6 min read
  MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Read more about this book (For more resources on Python, see here.)

Hello()

To create a function, we necessarily have to go back to the CREATE statement. As in a Python function definition, MySQL expects us to declare the name of the function as well as any arguments it requires. Unlike Python, MySQL also wants the type of data that will be received by the function. The beginning of a basic MySQL function definition looks like this:

CREATE FUNCTION hello(s CHAR(20))

MySQL then expects to know what kind of data to return. Again, we use the MySQL data type definitions for this.

RETURNS CHAR(50)

This just tells MySQL that the function will return a character string of 50 characters or less. If the function will always perform the same task, it is best for the sake of performance to include the keyword DETERMINISTIC next. If the behavior of the function varies, use the keyword NOT DETERMINISTIC. If no characteristic keyword is set, MySQL defaults to NOT DETERMINISTIC. You can learn more about the characteristic keywords used in function definitions at: http://dev.mysql.com/doc/refman/5.5/en/create-procedure.html

Finally comes the meat of the function definition. Here we can set variables and perform any calculations that we want. For our basic definition, we will simply return a concatenated string:

RETURN CONCAT('Hello, ', s, '!');

The function obviously concatenates the word 'Hello' with whatever argument is passed to it and appends an exclamation point at the end. To call it, we use SELECT as with the other functions:

mysql> SELECT hello('world') AS Greeting;

Capitalise()

A function to capitalize every initial letter in a string follows the same pattern. The main point of the function is to walk through the string, character by character, and use UPPER() on every character that does not follow a letter.

DELIMITER

Obviously, we need a way to pass the entire function to MySQL without having any of the lines evaluated until we call it. To do this, we use the keyword DELIMITER. DELIMITER allows users to tell MySQL to evaluate lines that end in the character(s) we set. So the process for complex function definitions becomes:

Change the delimiter.
Pass the function with the usual semicolons to indicate the end of each line.
Change the delimiter back to a semicolon.
Call the function.

The DELIMITER keyword allows us to specify more than one character as the line delimiter. So, to ensure that our code does not inadvertently conflict with the line delimiter, let's make the delimiter @@:

DELIMITER @@

The function definition

From here, we are free to define a function to our specification.
The definition line will read as follows:

CREATE FUNCTION `Capitalise`(instring VARCHAR(1000))

The function will return a character string of similar length and variability:

RETURNS VARCHAR(1000)

When MySQL functions extend beyond the simplest calculations, such as hello(), MySQL requires us to specify the beginning and ending of the function. We do that with the keywords BEGIN and END. So let's begin the function:

BEGIN

Next, we need to declare our variables and their types using the keyword DECLARE:

DECLARE i INT DEFAULT 1;
DECLARE achar, imark CHAR(1);
DECLARE outstring VARCHAR(1000) DEFAULT LOWER(instring);

The DEFAULT keyword allows us to set a variable's initial value when it is declared; here, outstring starts out as a lowercased copy of instring. Next, we define a WHILE loop:

WHILE i <= CHAR_LENGTH(instring) DO

The WHILE loop obviously begins with a conditional statement based on the character length of instring. The resulting action begins with the keyword DO. From here, we set a series of variables and express what should happen where a character follows one of the following:

blank space & '' _ ? ; : ! , - / ( .

The operational part of the function looks like this:

SET achar = SUBSTRING(instring, i, 1);
SET imark = CASE WHEN i = 1 THEN ' '
                 ELSE SUBSTRING(instring, i - 1, 1)
            END;
IF imark IN (' ', '&', '''', '_', '?', ';', ':', '!', ',', '-', '/', '(', '.') THEN
    SET outstring = INSERT(outstring, i, 1, UPPER(achar));
END IF;
SET i = i + 1;

Much of this code is self-explanatory. It is worth noting, however, that the apodosis of any conditional in MySQL must end with the keyword END. In the case of IF, we use END IF. In the second SET statement, CASE is used as an expression that evaluates its conditions in turn. If the WHEN condition is met (that is, if we are at the first character of the string), imark is set to a blank space; otherwise the ELSE branch applies and imark is set to the preceding character returned by SUBSTRING(). Because this CASE is used as an expression inside a SET statement, it is closed with END; the END CASE form is reserved for the standalone CASE statement. The subsequent IF clause evaluates whether imark, defined as the character before achar, is one of the declared characters. If it is, then the corresponding character in outstring is replaced with the uppercase equivalent of the character from instring. After the IF clause is finished, the loop counter is incremented by one. After the entire string is processed, we then end the WHILE loop with:

END WHILE;

After the function's operations are completed, we return the value of outstring and indicate the end of the function:

RETURN outstring;
END@@

Finally, we must not forget to return the delimiter to a semicolon:

DELIMITER ;

It is worth noting that, instead of defining a function in a MySQL session, we can define it in a separate file and load it on the fly with the SOURCE command. If we save the function to a file called capfirst.sql in a directory temp, we can source it with a relative path:

SOURCE temp/capfirst.sql;

We can also use the absolute path:

SOURCE /home/skipper/temp/capfirst.sql;
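Once the delimiter is restored, the stored functions can be called from MySQL for Python like any built-in function. The following is a brief sketch, not part of the original text: it assumes hello() and Capitalise() have already been created in the database you connect to, and the connection values are placeholders.

import MySQLdb

mydb = MySQLdb.connect('localhost', 'skipper', 'secret', 'test')
cursor = mydb.cursor()

# Stored functions are invoked with SELECT, just like MySQL's built-in functions.
cursor.execute("SELECT hello(%s)", ('world',))
print cursor.fetchone()[0]      # Hello, world!

cursor.execute("SELECT Capitalise(%s)", ('o brave new world!',))
print cursor.fetchone()[0]      # O Brave New World!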

Disaster Recovery in MySQL for Python

Packt
25 Sep 2010
10 min read
  MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Read more about this book (For more resources on Python, see here.)

The purpose of the archiving methods covered in this article is to allow you, as the developer, to back up databases that you use for your work without having to rely on the database administrator. As noted later in the article, there are more sophisticated methods for backups than we cover here, but they involve system-administrative tasks that are beyond the remit of any development post and are thus beyond the scope of this article.

Every database needs a backup plan

When archiving a database, one of the critical questions that must be answered is how to take a snapshot backup of the database without having users change the data in the process. If data changes in the midst of the backup, it results in an inconsistent backup and compromises the integrity of the archive. There are two strategic approaches to backing up a database system:

Offline backups
Live backups

Which you use depends on the dynamics of the system in question and the importance of the data being stored. In this article, we will look at each in turn and the way to implement them.

Offline backups

Offline backups are done by shutting down the server so the records can be archived without the fear of them being changed by the user. This also helps to ensure that the server shuts down gracefully and that errors are avoided. The problem with using this method on most production systems is that it necessitates a temporary loss of access to the service. For most service providers, such a consequence is anathema to the business model.

The value of this method is that one can be certain that the database has not changed at all while the backup is run. Further, in many cases, the backup is performed faster because the processor is not simultaneously serving data. For this reason, offline backups are usually performed in controlled environments or in situations where disruption is not critical to the user. These include internal databases, where administrators can inform all users about the disruption ahead of time, and small business websites that do not receive a lot of traffic.

Offline backups also have the benefit that the backup is usually held in a single file. This can then be used to copy a database across hosts with relative ease. Shutting down a server obviously requires system administrator-like authority, so creating an offline backup relies on the system administrator shutting down the server. If your responsibilities include database administration, you will also have sufficient permission to shut down the server.

Live backups

Live backups occur while the server continues to accept queries from users, that is, while it is still online. This approach works by locking the tables so that no new data may be written to them. Users usually do not lose access to the data, and the integrity of the archive for a particular point in time is assured.
Live backups are used by large, data-intensive sites such as Nokia's Ovi services and Google's web services. However, because they do not always require administrator access to the server itself, they tend to suit the backup needs of a development project.

Choosing a backup method

After having determined whether a database can be stopped for the backup, a developer can choose from three methods of archiving:

Copying the data files (including administrative files such as logs and tablespaces)
Exporting delimited text files
Backing up with command-line programs

Which you choose depends on what permissions you have on the server and how you are accessing the data.

MySQL also allows for two other forms of backup: using the binary log and setting up replication (using master and slave servers). To be sure, these are the best ways to back up a MySQL database, but both are administrative tasks and require system-administrator authority; they are not typically available to a developer. However, you can read more about them in the MySQL documentation. Use of the binary log for incremental backups is documented at:

http://dev.mysql.com/doc/refman/5.5/en/point-in-time-recovery.html

Setting up replication is further dealt with at:

http://dev.mysql.com/doc/refman/5.5/en/replication-solutions-backups.html

Copying the table files

The most direct way to back up database files is to copy them from where MySQL stores the database itself. This will naturally vary based on platform. If you are unsure about which directory holds the MySQL database files, you can query MySQL itself to check:

mysql> SHOW VARIABLES LIKE 'datadir';

Alternatively, the following shell command sequence will give you the same information:

$ mysqladmin variables | grep datadir
| datadir | /var/lib/mysql/ |

Note that the location of administrative files, such as binary logs and InnoDB tablespaces, is customizable and may not be in the data directory. If you do not have direct access to the MySQL server, you can also write a simple Python program to get the information:

#!/usr/bin/env python
import MySQLdb

mydb = MySQLdb.connect('<hostname>', '<user>', '<password>')
cursor = mydb.cursor()
cursor.execute("SHOW VARIABLES LIKE 'datadir'")
variable, value = cursor.fetchone()
print "%s: %s" % (variable, value)

A slight alteration of this program will also allow you to query several servers automatically. Simply change the login details and adapt the output to clarify which data is associated with which results.

Locking and flushing

If you are backing up an offline MyISAM system, you can copy any of the files once the server has been stopped. Before backing up a live system, however, you must lock the tables and flush the log files in order to get a consistent backup at a specific point. These tasks are handled by the LOCK TABLES and FLUSH commands respectively. When you use MySQL and its ancillary programs (such as mysqldump) to perform a backup, these tasks are performed automatically. When copying files directly, you must ensure both are done. How you apply them depends on whether you are backing up an entire database or a single table.

LOCK TABLES

The LOCK TABLES command secures a specified table in a designated way. Tables can be referenced with aliases using AS and can be locked for reading or writing. For our purposes, we need only a read lock to create a backup. The syntax looks like this:

LOCK TABLES <tablename> READ;

This command requires two privileges: LOCK TABLES and SELECT.
It must be noted that LOCK TABLES does not lock all tables in a database, but only the one specified. This is useful for performing smaller backups that will not interrupt services or put too severe a strain on the server. However, unless you automate the process, manually locking and unlocking tables as you back up data can be ridiculously inefficient.

FLUSH

The FLUSH command is used to reset MySQL's caches. By re-initiating the cache at the point of backup, we get a clear point of demarcation for the database backup, both in the database itself and in the logs. The basic syntax is straightforward, as follows:

FLUSH <the object to be reset>;

Use of FLUSH presupposes the RELOAD privilege for all relevant databases. What we reload depends on the process we are performing. For the purpose of backing up, we will always be flushing tables:

FLUSH TABLES;

How we "flush" the tables will depend on whether we have already used the LOCK TABLES command to lock the table. If we have already locked a given table, we can call FLUSH for that specific table:

FLUSH TABLES <tablename>;

However, if we want to copy an entire database, we can bypass the LOCK TABLES command by incorporating the same call into FLUSH:

FLUSH TABLES WITH READ LOCK;

This use of FLUSH applies across all databases on the server, and all tables will be subject to the read lock. If the account accessing the database does not have sufficient privileges for all databases, an error will be thrown.

Unlocking the tables

Once you have copied the files for a backup, you need to remove the read lock you imposed earlier. This is done by releasing all locks for the current session:

UNLOCK TABLES;

Restoring the data

Restoring copies of the actual storage files is as simple as copying them back into place. This is best done when MySQL has stopped, lest you risk corruption. Similarly, if you have a separate MySQL server and want to transfer a database, you simply need to copy the directory structure from the one server to the other. On restarting, MySQL will see the new database and treat it as if it had been created natively. When restoring the original data files, it is critical to ensure that the permissions on the files and directories are appropriate and match those of the other MySQL databases.

Delimited backups within MySQL

MySQL allows for exporting of data from the MySQL command line. To do so, we simply direct the output from a SELECT statement to an output file.

Using SELECT INTO OUTFILE to export data

Using sakila, we can save the data from film to a file called film.data as follows:

SELECT * INTO OUTFILE 'film.data' FROM film;

This results in the data being written in a tab-delimited format. The file will be written to the directory in which MySQL stores the sakila data. Therefore, the account under which the SELECT statement is executed must have the FILE privilege for writing the file, as well as login access on the server to view or retrieve it. The OUTFILE option on SELECT can be used to write to any place on the server that MySQL has write permission to use. One simply needs to prepend that directory location to the file name. For example, to write the same file to the /tmp directory on a Unix system, use:

SELECT * INTO OUTFILE '/tmp/film.data' FROM film;

Windows simply requires adjustment of the directory structure accordingly.

Using LOAD DATA INFILE to import data

If you have an output file or similar tab-delimited file and want to load it into MySQL, use the LOAD DATA INFILE command.
The basic syntax is:

LOAD DATA INFILE '<filename>' INTO TABLE <tablename>;

For example, to import the film.data file from the /tmp directory into another table called film2, we would issue this command:

LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2;

Note that LOAD DATA INFILE presupposes the creation of the table into which the data is being loaded. In the preceding example, if film2 had not been created, we would receive an error. If you are trying to mirror a table, remember to use the SHOW CREATE TABLE query to save yourself time in formulating the CREATE statement.

This discussion only touches on how to use LOAD DATA INFILE for inputting data created with the OUTFILE option of SELECT, but the command handles text files with just about any set of delimiters. To read more on how to use it for other file formats, see the MySQL documentation at:

http://dev.mysql.com/doc/refman/5.5/en/load-data.html
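To tie the pieces of this article together, the export and import can also be driven from MySQL for Python rather than from the mysql client. The following sketch is not from the original text: the paths, credentials, and the film2 table are assumptions carried over from the examples above, the connected account needs the FILE privilege, and the read lock is only necessary if other sessions may be writing during the export.

import MySQLdb

mydb = MySQLdb.connect('localhost', 'skipper', 'secret', 'sakila')
cursor = mydb.cursor()

# Flush and read-lock the tables so the export reflects a single point in time.
cursor.execute("FLUSH TABLES WITH READ LOCK")
try:
    # The target file must not already exist on the server.
    cursor.execute("SELECT * INTO OUTFILE '/tmp/film.data' FROM film")
finally:
    cursor.execute("UNLOCK TABLES")

# film2 must already exist with the same column layout as film.
cursor.execute("LOAD DATA INFILE '/tmp/film.data' INTO TABLE film2")
mydb.close()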

Data Transactions Made Easy with MySQL and Python

Packt
24 Sep 2010
9 min read
  MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Read more about this book (For more resources on Python, see here.)

When we deal with large amounts of data that are all going into the same database, running single instances of individual INSERT commands can take a ridiculous amount of time and waste a considerable amount of I/O. What we need is a way to insert multiple items in one go. In this article by Albert Lukaszewski, PhD, author of MySQL for Python, we will look at the following:

How iteration can help us execute several individual INSERT statements rapidly
Using executemany() to submit several INSERT statements in one go
When not to use executemany()
Throttling how much data is inserted at a time

The problem

You need to collate and rearrange the contents of several databases into one table whenever a given indicator is achieved (the indicator may be, among other things, a stock price, a value in one of the databases, or the flight pattern of African swallows). The format and length of the tables are predictable. There are 5,000 records in each table, so manually running a separate INSERT statement for each record is not a viable option even with the programs used. The problem calls for a means of iterating through a series and changing the INSERT statement automatically. We could pursue one of the following two ways to do this:

Write a MySQL script to insert the data in batch mode
Iterate over the data to form and execute a MySQL INSERT statement accordingly

Neither of these is a very good solution to the present problem.

Why not a MySQL script?

As we have seen when we created the world and Sakila databases, a MySQL script can contain the schema of a database, the values of the database tables, or both. To create data quickly, there is nothing better. However, there are several drawbacks to using a script in this scenario:

Lack of automation
Debugging the process
Inefficient I/O

Lack of automation

Barring the use of an automation daemon (for example, cron) to run a cron job or a similar scheduled task, a DBA or their designate would have to run the script. This unnecessarily consumes time. It is comparable to swapping tape backups when automated backup services are available and proven.

Most modern computing systems support automatic task scheduling. On Unix-based systems like Linux and Mac OS X, one can schedule processes to be run at set times and on a regular rotation. One of the most widely used programs for such scheduling is cron. A single scheduled task in cron has thus come to be known as a cron job.

Debugging the process

Creating a MySQL script can be a very tedious task, and the slightest error or oversight can ruin the entire process. Using the --force flag causes MySQL to ignore errors in the source file. It is therefore not something that should be used regularly if one values the integrity of data.
If the script is malformed for any reason, a two-minute data insertion job can quickly become a two-hour (at least!), unscheduled debugging process.

Inefficient I/O

Dumping large amounts of data on MySQL can create latency across the network. If the script were run by a DBA or similar, that person should rightly evaluate the state of the network before running it. Regardless of experience and training, judgment calls naturally require estimation and can result in ambiguous decision-making. While this is unavoidable, it should be minimized where possible.

If the server is experiencing high traffic, the DBA would need to find something else to do and reschedule the running of the script. This randomly postpones the time of execution and the availability of the results. Also, it runs the risk of the DBA forgetting to execute the script. If the script is automated with a cron job or similar, we risk dumping a lot of data onto MySQL at a time when others are doing more time-sensitive tasks. On most servers, we can background the process so it does not adversely impact the client processes. This, however, only ensures that the process will be run. It does not guarantee that the process will be finished by a particular time.

Why not iterate?

Every time a program iterates to read or insert data, it creates a certain amount of I/O processing. How a program is written determines how much I/O is included in each loop.

A test sample: generating primes

To illustrate this, consider a program that accepts a series of numbers, generates the prime numbers that are equal to or less than each of those numbers, and inserts those primes into a database called primes with a table of the same name. The table can be described with:

mysql> DESCRIBE primes;

and can be created with the following statement:

CREATE TABLE `primes` (
  `ID` int(11) NOT NULL auto_increment,
  `NUMBER` int(11) NOT NULL default '0',
  `PRIMES` varchar(300) NOT NULL default '0',
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Using the sys module to handle user input, the string module to split the series of entries, and MySQLdb to handle the database connection, we get the following preamble:

#!/usr/bin/env python
import MySQLdb, string, sys

Next, we need a function that will generate the prime numbers up to a certain limit called n.

def primes(n):
    """Returns a list of prime numbers up to n using an algorithm based on
    the Sieve of Eratosthenes."""
    if n < 2:
        return 'A number less than two is not prime by definition.'
    else:
        s = range(3, n+1, 2)
        maxfactor = n ** 0.5
        half = (n+1)/2 - 1
        i = 0
        m = 3
        while m <= maxfactor:
            if s[i]:
                j = (m*m - 3)/2
                s[j] = 0
                while j < half:
                    s[j] = 0
                    j += m
            i = i + 1
            m = 2*i + 3
        return str([2] + [x for x in s if x])

This algorithm is based on the Sieve of Eratosthenes, one of the simplest ways of generating prime numbers. It uses the following steps:

Generate a list of integers from 2 to n.
Discount the multiples of each number that remains and that has a square less than or equal to n; this leaves all the prime factors of n.
Stop when the square of the number in the series is greater than n.

Prime numbers by definition have no factor other than themselves and one. The lowest prime number is therefore 2. For this reason, we check whether n is less than 2 and return a message accordingly. For this program, we want a string returned, so we convert the results before we return them.
For more on the Sieve of Eratosthenes and how it works, see the entry on Wikipedia: http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

The previous algorithm can be found in many forms and in many languages on the Internet. Two of the best that informed this discussion are as follows:

http://www.maths.abdn.ac.uk/~igc/tch/mx3015/notes/node79.html
http://code.activestate.com/recipes/366178/

Next, we create a function to form and execute the INSERT statement.

def insert(n, p, cur):
    statement = """INSERT INTO primes(number, primes) VALUES("%s", "%s")""" % (n, p)
    cur.execute(statement)
    return statement

This function takes the number, the primes, and the cursor object handle as arguments. Finally, our main() function looks like this:

def main():
    numbers = sys.argv[1]
    iterations = numbers.split(',')
    for n in iterations:
        mydb = MySQLdb.connect(host='localhost',
                               user='skipper',
                               passwd='secret',
                               db='primes')
        cur = mydb.cursor()
        n = int(n)
        try:
            p = primes(n)
            if p.isalpha():
                raise ValueError
            else:
                statement = insert(n, p, cur)
                print "Data entered with the following statement:\n", statement
        except:
            raise

We split the values passed by the user and iterate through them with a for loop. Note that we include the database connection and cursor object creation as part of the iteration. We then include the usual if clause to call main():

if __name__ == '__main__':
    main()

Comparing execution speeds

We can test the speed of this program by giving it a series of simple primes to generate, even repeating the same number. Here we call it with a series of sevens to process, using the Unix command time to measure its speed:

time ./primes.py "7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7"

Given this series of sevens, we get a real execution time of 0.175 seconds. If we rearrange the program by moving the mydb and cur assignment lines to follow immediately after the preamble (remembering to adjust the indentation), we get an execution speed of 0.138 seconds. This difference in speed (0.037 seconds) is not particularly significant on a local system with a single user, but it would be magnified synergistically on a server with hundreds or thousands of users.

So we see that, if the connection and cursor are created each time, the program will be less efficient than if the loop contained only the execution of the statement and the connection and cursor were persistent through the cycle of the program. However, even just passing a series of single statements to MySQL through Cursor.execute() will consume more in protocol than necessary. Excessive iteration consumes resources needlessly.
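The executemany() approach promised in the introduction is not shown in this excerpt. As a hedged sketch of what it might look like, the following reuses the primes() helper and the primes table defined above, with the same placeholder connection values. Building the whole batch in Python and handing it to MySQL in a single call avoids repeating the per-statement overhead measured above.

import MySQLdb

def batch_insert(numbers):
    """Insert (number, primes) pairs with a single executemany() call."""
    mydb = MySQLdb.connect(host='localhost', user='skipper',
                           passwd='secret', db='primes')
    cur = mydb.cursor()
    batch = [(n, primes(n)) for n in numbers]     # primes() as defined earlier
    cur.executemany("INSERT INTO primes(number, primes) VALUES(%s, %s)", batch)
    mydb.commit()
    mydb.close()

batch_insert([5, 7, 11, 20])

Because the rows travel to the server as a batch rather than as one statement per row, this typically costs far less protocol overhead than looping over cursor.execute().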

A Guide for Migrating Data to the Cloud

Packt
20 Sep 2010
3 min read
(For more resources on this subject, see here.)

Planning

While planning is of course the first step, it is something that any IT shop should not take lightly, since this is the step where you decide which vendor to go with. All cloud vendors have snazzy marketing material that tells you how seamless and highly functional a migration will be, but that simply cannot always be guaranteed. How many times have you been sold on a technology in the past that did not live up to expectations? It happens more often than you think, and it is part of the overall sales process by these companies.

Instead of relying on their word, you could consider a few options. One would be to work with a company you have experience with. If you’re familiar with working with Microsoft, go with their Azure platform. If you’re already using Salesforce to some degree, continue to talk with them about a more full-scale migration. If you’re planning on working with a non-traditional vendor such as Amazon or Google, ask for references or do a bit of investigating on your own. You might learn a thing or two.

Security

Keeping data secure is of course very important, and it’s wise to make an assessment of your security prior to any sort of large-scale migration. Performing audits and stress tests is a great way to make sure that your organization and/or system is ready for a large-scale change. The unfortunate reality is that any large-scale process such as this has inevitable vulnerabilities. With that being said, the more you know about your strengths and weaknesses prior to a cutover, the better off you will be. Any sort of large project is sure to experience bumps in the road, and you might as well prepare yourself for uncomfortable findings prior to commencement.

Testing

At this point, you’ve already made some critical decisions, but hopefully you’ve reached a testing stage where you will be able to take a look at more than one provider for assessment of what they can do for your organization. Having the redundancy to fall back on a "plan B" in case one provider doesn’t work out is ideal. You may not have that option depending on your particular deployment, but again it’s a more productive scenario.

Deployment

After the testing point, you’ve reached deployment. If the planning, security, and testing phases have proven to be successful (at least optimistically so), you’ll probably realize that cloud deployments require more re-engineering and code changes than any vendor is willing to tell you. Moving an environment outside of your own hosted space takes a lot more in-house development than most people realize.

Summary

In the long run, however, cloud services are beneficial to both the vendor and the customer. The headaches that go into cloud migration usually pay off over time. Perhaps you might not see the immediate benefit, but once a deployment is successful, the amount of time and money that is saved by adopting a cloud strategy clearly makes all the hard work required worth it. Perseverance is key, and in time it’s easy to see the overall benefits.

Further resources on this subject:

Hands-on Tutorial for Getting Started with Amazon SimpleDB
Ground to SQL Azure migration using MS SQL Server Integration Services
Microsoft LightSwitch Application using SQL Azure Database

Integrating Solr: Ruby on Rails Integration

Packt
09 Sep 2010
12 min read
(For more resources on Solr, see here.)

The classic plugin for Rails is acts_as_solr, which allows Rails ActiveRecord objects to be transparently stored in a Solr index. Other popular options include Solr Flare and rsolr. An interesting project is Blacklight, a tool oriented towards libraries putting their catalogs online. While it attempts to meet the needs of a specific market, it also contains many examples of great Ruby techniques to leverage in your own projects.

You will need to turn on the Ruby writer type in solrconfig.xml:

<queryResponseWriter name="ruby" class="org.apache.solr.request.RubyResponseWriter"/>

The Ruby hash structure has some tweaks to fit Ruby, such as translating nulls to nils, using single quotes for escaping content, and the Ruby => operator to separate key-value pairs in maps. Adding a wt=ruby parameter to a standard search request returns results in a Ruby hash structure like this:

{
 'responseHeader'=>{
  'status'=>0,
  'QTime'=>1,
  'params'=>{
   'wt'=>'ruby',
   'indent'=>'on',
   'rows'=>'1',
   'start'=>'0',
   'q'=>'Pete Moutso'}},
 'response'=>{'numFound'=>523,'start'=>0,'docs'=>[
  {
   'a_name'=>'Pete Moutso',
   'a_type'=>'1',
   'id'=>'Artist:371203',
   'type'=>'Artist'}]}}

acts_as_solr

A very common naming pattern for plugins in Rails that manipulate the database-backed object model is to name them acts_as_X. For example, the very popular acts_as_list plugin for Rails allows you to add list semantics, like first, last, and move_next, to an unordered collection of items. In the same manner, acts_as_solr takes ActiveRecord model objects and transparently indexes them in Solr. This allows you to do fuzzy queries that are backed by Solr searches, but still work with your normal ActiveRecord objects. Let's go ahead and build a small Rails application that we'll call MyFaves that both allows you to store your favorite MusicBrainz artists in a relational model and allows you to search for them using Solr.

acts_as_solr comes bundled with a full copy of Solr 1.3 as part of the plugin, which you can easily start by running rake solr:start. Typically, you are starting with a relational database already stuffed with content that you want to make searchable. However, in our case we already have a fully populated index available in /examples, and we are actually going to take the basic artist information out of the mbartists index of Solr and populate our local myfaves database with it. We'll then fire up the version of Solr shipped with acts_as_solr, and see how acts_as_solr manages the lifecycle of ActiveRecord objects to keep Solr's indexed content in sync with the content stored in the relational database. Don't worry, we'll take it step by step! The completed application is in /examples/8/myfaves for you to refer to.

Setting up MyFaves project

We'll start with the standard plumbing to get a Rails application set up with our basic data model:

>>rails myfaves
>>cd myfaves
>>./script/generate scaffold artist name:string group_type:string release_date:datetime image_url:string
>>rake db:migrate

This generates a basic application backed by an SQLite database. Now we need to install the acts_as_solr plugin. acts_as_solr has gone through a number of revisions, from the original code base done by Erik Hatcher and posted to the solr-user mailing list in August of 2006, which was then extended by Thiago Jackiw and hosted on Rubyforge. Today the best version of acts_as_solr is hosted on GitHub by Mathias Meyer at http://github.com/mattmatt/acts_as_solr/tree/master.
The constant migration from one site to another, leading to multiple possible 'best' versions of a plugin, is unfortunately a very common problem with Rails plugins and projects, though most are settling on either RubyForge.org or GitHub.com.

In order to install the plugin, run:

>>script/plugin install git://github.com/mattmatt/acts_as_solr.git

We'll also be working with roughly 399,000 artists, so obviously we'll need some page pagination to manage that list, otherwise pulling up the artists /index listing page will time out:

>>script/plugin install git://github.com/mislav/will_paginate.git

Edit the ./app/controllers/artists_controller.rb file, and replace in the index method the call to @artists = Artist.find(:all) with:

@artists = Artist.paginate :page => params[:page], :order => 'created_at DESC'

Also add to ./app/views/artists/index.html.erb a call to the view helper to generate the page links:

<%= will_paginate @artists %>

Start the application using ./script/server, and visit the page http://localhost:3000/artists/. You should see an empty listing page for all of the artists. Now that we know the basics are working, let's go ahead and actually leverage Solr.

Populating MyFaves relational database from Solr

Step one will be to import data into our relational database from the mbartists Solr index. Add the following code to ./app/models/artist.rb:

class Artist < ActiveRecord::Base
  acts_as_solr :fields => [:name, :group_type, :release_date]
end

The :fields array maps the attributes of the Artist ActiveRecord object to the artist fields in Solr's schema.xml. Because acts_as_solr is designed to store data in Solr that is mastered in your data model, it needs a way of distinguishing among various types of data model objects. For example, if we wanted to store information about our User model object in Solr in addition to the Artist object, then we would need to provide a type_field to separate the Solr documents for the artist with the primary key of 5 from the user with the primary key of 5. Fortunately the mbartists schema has a field named type that stores the value Artist, which maps directly to our ActiveRecord class name of Artist, and we are able to use that instead of the default acts_as_solr type field in Solr named type_s.

There is a simple script called populate.rb at the root of /examples/8/myfaves that you can run that will copy the artist data from the existing Solr mbartists index into the MyFaves database:

>>ruby populate.rb

populate.rb is a great example of the types of scripts you may need to develop to transfer data into and out of Solr. Most scripts typically work with some sort of batch size of records that are pulled from one system and then inserted into Solr. The larger the batch size, the more efficient the pulling and processing of data typically is, at the cost of more memory being consumed and slower commit and optimize operations. When you run the populate.rb script, play with the batch size parameter to get a sense of resource consumption in your environment. Try a batch size of 10 versus 10000 to see the changes. The parameters for populate.rb are available at the top of the script:

MBARTISTS_SOLR_URL = 'http://localhost:8983/solr/mbartists'
BATCH_SIZE = 1500
MAX_RECORDS = 100000 # the maximum number of records to load, or nil for all

There are roughly 399,000 artists in the mbartists index, so if you are impatient, then you can set MAX_RECORDS to a more reasonable number.
The process for connecting to Solr is very simple, with a hash of parameters that are passed as part of the GET request. We use the magic query value of *:* to find all of the artists in the index and then iterate through the results using the start parameter:

connection = Solr::Connection.new(MBARTISTS_SOLR_URL)
solr_data = connection.send(Solr::Request::Standard.new({
  :query => '*:*',
  :rows => BATCH_SIZE,
  :start => offset,
  :field_list => ['*', 'score']
}))

In order to create our new Artist model objects, we just iterate through the results of solr_data. If solr_data is nil, then we exit out of the script knowing that we've run out of results. However, we do have to do some parsing translation in order to preserve our unique identifiers between Solr and the database. In our MusicBrainz Solr schema, the ID field functions as the primary key and looks like Artist:11650 for The Smashing Pumpkins. In the database, in order to sync the two, we need to insert the Artist with the ID of 11650. We wrap the insert statement a.save! in a begin/rescue/end structure so that if we've already inserted an artist with that primary key, the script continues. This just allows us to run the populate script multiple times:

solr_data.hits.each do |doc|
  id = doc["id"]
  id = id[7..(id.length)]
  a = Artist.new(:name => doc["a_name"],
                 :group_type => doc["a_type"],
                 :release_date => doc["a_release_date_latest"])
  a.id = id
  begin
    a.save!
  rescue ActiveRecord::StatementInvalid => ar_si
    raise ar_si unless ar_si.to_s.include?("PRIMARY KEY must be unique") # sink duplicates
  end
end

Now that we've transferred the data out of our mbartists index and used acts_as_solr according to the various conventions that it expects, we'll change from using the mbartists Solr instance to the version of Solr shipped with acts_as_solr. Solr-related configuration information is available in ./myfaves/config/solr.yml. Ensure that the default development URL doesn't conflict with any existing Solr you may be running:

development:
  url: http://127.0.0.1:8982/solr

Start the included Solr by running rake solr:start. When it starts up, it will report the process ID for Solr running in the background. If you need to stop the process, then run the corresponding rake task: rake solr:stop. The empty new Solr indexes are stored in ./myfaves/solr/development.

Build Solr indexes from relational database

Now we are ready to trigger a full index of the data in the relational database into Solr. acts_as_solr provides a very convenient rake task for this with a variety of parameters that you can learn about by running rake -D solr:reindex. We'll specify to work with a batch size of 1500 artists at a time:

>>rake solr:start
>>rake solr:reindex BATCH=1500
(in /examples/8/myfaves)
Clearing index for Artist...
Rebuilding index for Artist...
Optimizing...

This drastic simplification of configuration in the Artist model object is possible because we are using a Solr schema that is designed to leverage the Convention over Configuration ideas of Rails. Some of the conventions that are established by acts_as_solr and met by Solr are:

The primary key field for a model object in Solr is always called pk_i.
The type field that stores the disambiguating class name of the model object is called type_s.
Heavy use is made of the dynamic field support in Solr.

The data type of ActiveRecord model objects is based on the database column type. Therefore, when acts_as_solr indexes a model object, it sends a document to Solr with the various suffixes to leverage the dynamic column creation.
In /examples/8/myfaves/vendor/plugins/acts_as_solr/solr/solr/conf/schema.xml, the only fields defined outside of the management fields are dynamic fields:

<dynamicField name="*_t" type="text" indexed="true" stored="false"/>

The default search field is called text, and all of the fields ending in _t are copied into the text search field. Fields to facet on are named _facet and copied into the text search field as well. The document that gets sent to Solr for our Artist records creates the dynamic fields name_t, group_type_s, and release_date_d, for a text, string, and date field respectively. You can see the list of dynamic fields generated through the schema browser at http://localhost:8982/solr/admin/schema.jsp.

Now we are ready to perform some searches. acts_as_solr adds some new methods, such as find_by_solr(), that let us find ActiveRecord model objects by sending a query to Solr. Here we find the group Smash Mouth by searching for matches to the word smashing:

% ./script/console
Loading development environment (Rails 2.3.2)
>> artists = Artist.find_by_solr("smashing")
=> #<ActsAsSolr::SearchResults:0x224889c @solr_data={:total=>9, :docs=>[#<Artist id: 364, name: "Smash Mouth"...
>> artists.docs.first
=> #<Artist id: 364, name: "Smash Mouth", group_type: 1, release_date: "2006-09-19 04:00:00", created_at: "2009-04-17 18:02:37", updated_at: "2009-04-17 18:02:37">

Let's also verify that acts_as_solr is managing the full lifecycle of our objects. Assuming Susan Boyle isn't yet entered as an artist, let's go ahead and create her:

>> Artist.find_by_solr("Susan Boyle")
=> #<ActsAsSolr::SearchResults:0x26ee298 @solr_data={:total=>0, :docs=>[]}>
>> susan = Artist.create(:name => "Susan Boyle", :group_type => 1, :release_date => Date.new)
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09">

Check the log output from your Solr running on port 8982, and you should also have seen an update query triggered by the insert of the new Susan Boyle record:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=24

Now, if we delete Susan's record from our database:

>> susan.destroy
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1, release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21 13:11:09", updated_at: "2009-04-21 13:11:09">

Then there should be another corresponding update issued to Solr to remove the document:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=57

You can verify this by doing a search for Susan Boyle directly, which should return no rows, at http://localhost:8982/solr/select/?q=Susan+Boyle.