Backup and Recovery

In general, backup and recovery refers to the various strategies and procedures involved in protecting your database against data loss and reconstructing the database after any kind of data loss.

Physical Backups and Logical Backups

A backup is a copy of data from your database that can be used to reconstruct that data. Backups can be divided into physical backups and logical backups.

Physical backups are backups of the physical files used in storing and recovering your database, such as datafiles, control files, and archived redo logs. Ultimately, every physical backup is a copy of files storing database information to some other location, whether on disk or some offline storage such as tape.

Logical backups contain logical data (for example, tables or stored procedures) exported from a database with an Oracle export utility and stored in a binary file, for later re-importing into a database using the corresponding Oracle import utility.

Physical backups are the foundation of any sound backup and recovery strategy. Logical backups are a useful supplement to physical backups in many circumstances but are not sufficient protection against data loss without physical backups.

Unless otherwise specified, the term "backup" as used in the backup and recovery documentation refers to physical backups, and to back up part or all of your database is to take some kind of physcial backup. The focus in the backup and recovery documentation set will be almost exclusively on physical backups.

Errors and Failures Requiring Recovery from Backup

While there are several types of problem that can halt the normal operation of an Oracle database or affect database I/O operations, only two typically require DBA intervention and media recovery: media failure, and user errors.

Other failures may require DBA intervention to restart the database (after an instance failure) or allocate more disk space (after statement failure due to, for instance, a full datafile) but these situations will not generally cause data loss or require recovery from backup.

Understanding User Error

User errors occur when, either due to an error in application logic or a manual mis-step, data in your database is changed or deleted incorrectly. Data loss due to user error includes such missteps as dropping important tables or deleting or changing the contents of a table. While user training and careful management of privileges can prevent most user errors, your backup strategy determines how gracefully you recover the lost data when user error does cause data loss.

Understanding Media Failure

A media failure is the failure of a read or write of a disk file required to run the database, due to a physical problem with the disk such as a head crash. Any database file can be vulnerable to a media failure.

The appropriate recovery technique following a media failure depends on the files affected and the types of backup available.

Oracle Backup and Recovery Solutions: RMAN and User-Managed Backup

For performing backup and recovery based on physical backups, you have two solutions available:

Recovery Manager, a tool (with command-line client and Enterprise Manager GUI interfaces) that integrates with sessions running on the Oracle server to perform a range of backup and recovery activities, as well as maintaining a repository of historical data about your backups
The traditional user-managed backup and recovery, where you directly manage the files that make up your database with a mixture of host operating system commands and SQL*Plus backup and recovery-related capabilities

Both methods are supported by Oracle Corporation and are fully documented. Recovery Manager is, however, the preferred solution for database backup and recovery. It can perform the same types of backup and recovery available through user-managed methods more easily, provides a common interface for backup tasks across different host operating systems, and offers a number of backup techniques not available through user-managed methods.

Most of the backup and recovery documentation set will focus on RMAN-based backup and recovery. User-managed backup and recovery techniques are covered in the later chapters of Oracle Database Backup and Recovery Advanced User's Guide.

Whether you use RMAN or user-managed methods, you can supplement your physical backups with logical backups of schema objects made using data export utilities. Data thus saved can later be imported to re-create this data after restore and recovery. However, logical backups are for the most part beyond the scope of the backup and recovery documentation.

Oracle Backup and Recovery Solutions

Backup and recovery are among the most important tasks for an administrator, and data warehouses are no different. However, because of the sheer size of the database, data warehouses introduce new challenges for an administrator in the backup and recovery area.

Data warehouses are unique in that the data can come from a myriad of resources and it is transformed before finally being inserted into the database; but mostly because it can be very large. Managing the recovery of a large data warehouse can be a daunting task and traditional OLTP backup and recovery strategies may not meet the needs of a data warehouse.

Data warehouses differ from OLTP systems in the following ways:

Data warehouses are typically much larger.
A data warehouse may have different availability requirements than an operational system. Even though business decisions do rely on information from the data warehouse, a situation in which for example a service desk cannot operate is much worse. Also, due to the size of data warehouses, there is a much higher cost involved in guaranteeing the same level of availability for a data warehouse.
Data warehouses are typically populated through more controlled processes, usually referred to as ETL (Extraction, Transformation, and Loading). As a result, updates in a data warehouse are better known and may be reproducible from data sources.
A data warehouse typically stores a lot of historical data, that is often not subject to change. Data that does not change only needs to be backed up once.

You should plan a backup strategy as part of your system design and consider what to back up and how frequently to back up. The most important variables in your backup design are the amount of resources you have to perform a backup or recovery and the recovery time objective (the amount of time you can afford the system or part of the system to be unavailable).

NOLOGGINGoperations must be taken into account when planning a backup and recovery strategy. Traditional recovery, restoring a backup and applying the changes from the archive log, does not apply for NOLOGGINGoperations. On top of that, subsequent operations that rely on the data that was manipulated by the nologging operation fail. The NOLOGGINGoperations must be taken into account when designing a backup and recovery strategy.

Never make a backup when a NOLOGGINGoperation is taking place.

Plan for one of the following or a combination of the following strategies:

The ETL strategy. Recover a backup that does not contain non-recoverable transactions and replay the ETL that has taken place between the backup and the failure.
The incremental backup strategy. Perform a backup immediately after an otherwise non-recoverable transaction has taken place. Oracle provides a tracking file feature that enables incremental backups based on changed data blocks. RMAN leverages the tracking file feature.

Strategies and Best Practices for Backup and Recovery

Devising a backup and recovery strategy can be a daunting task. And when you have hundreds of gigabytes of data that must be protected and recovered in the case of a failure, the strategy can be very complex.

The following best practices can help you implement your warehouse's backup and recovery strategy:

Best Practice A: Use ARCHIVELOG Mode
Best Practice B: Use RMAN
Best Practice C: Use Read-Only Tablespaces
Best Practice D: Plan for NOLOGGING Operations
Best Practice E: Not All Tablespaces are Equally Important

Best Practice A: Use ARCHIVELOG Mode

Archived redo logs are crucial for recovery when no data can be lost, because they constitute a record of changes to the database. Oracle can be run in either of two modes:

ARCHIVELOG-- Oracle archives the filled online redo log files before reusing them in the cycle.
NOARCHIVELOG-- Oracle does not archive the filled online redo log files before reusing them in the cycle.

Running the database in ARCHIVELOGmode has the following benefits:

The database can be completely recovered from both instance and media failure.
The user can perform backups while the database is open and available for use.
Oracle supports multiplexed archive logs to avoid any possible single point of failure on the archive logs
The user has more recovery options, such as the ability to perform tablespace-point-in-time recovery (TSPITR).
Archived redo logs can be transmitted and applied to the physical standby database, which is an exact replica of the primary database.
The database can be completely recovered from both instance and media failure.

Running the database in NOARCHIVELOGmode has the following consequences:

The user can only back up the database while it is completely closed after a clean shutdown.
Typically, the only media recovery option is to restore the whole database, which causes the loss of all transactions since the last backup.

Is Downtime Acceptable?

Oracle database backups can be made while the database is open or closed. Planned downtime of the database can be disruptive to operations, especially in global enterprises that support users in multiple time zones, up to 24-hours per day. In these cases it is important to design a backup plan to minimize database interruptions.

Depending on your business, some enterprises can afford downtime. If your overall business strategy requires little or no downtime, then your backup strategy should implement an online backup. The database needs never to be taken down for a backup. An online backup requires the database to be in ARCHIVELOGmode.

There is essentially no reason not to use ARCHIVELOGmode. All data warehouses (and for that matter, all mission-critical databases) should use ARCHIVELOGmode. Specifically, given the size of a data warehouse (and consequently the amount of time to back up a data warehouse), it is generally not viable to make an offline backup of a data warehouse, which would be necessitated if one were using NOARCHIVELOGmode.

Of course, large-scale data warehouses may undergo large amounts of data-modification, which in turn will generate large volumes of log files. To accommodate the management of large volumes of archived log files, RMAN provides the option to compress log files as they are archived. This will enable you to keep more archive logs on disk for faster accessibility for recovery.

In summary, a best practice is to put the database in archive log mode to provide online backups and point-in-time recovery options.

Best Practice B: Use RMAN

There are many reasons to adopt RMAN. Some of the reasons to integrate RMAN into your backup and recovery strategy are that it offers:

extensive reporting
incremental backups
downtime free backups
backup and restore validation
backup and restore optimization
easily integrates with media managers
block media recovery
archive log validation and management
corrupt block detection

Best Practice C: Use Read-Only Tablespaces

One of the biggest issues facing a data warehouse is sheer size of a typical data warehouse. Even with powerful backup hardware, backups may still take several hours. Thus, one important consideration in improving backup performance is minimizing the amount of data to be backed up. Read-only tablespaces are the simplest mechanism to reduce the amount of data to be backed up in a data warehouse.

The advantage of a read-only tablespace is that the data only need to be backed up once. So, if a data warehouse contains five years of historical data, and the first four years of data can be made read-only. Theoretically, the regular backup of the database would only back up 20% of the data. This can dramatically reduce the amount of time required to back up the data warehouse.

Most data warehouses store their data in tables that have been range-partitioned by time. In a typical data warehouse, data is generally 'active' for a period ranging anywhere from 30 days to one year. During this period, the historical data can still be updated and changed (for example, a retailer may accept returns up to 30 days beyond the date of purchase, so that sales data records could change during this period). However, once a data has reached a certain date, it is often known to be static.

By taking advantage of partitioning, users can make the static portions of their data read-only. RMAN supports read-only tablespaces rather than read-only partitions or tables. To take advantage of the read-only tablespaces and reduce the backup window, a strategy of storing constant data partitions in a read-only tablespace should be devised. Two strategies for implementing a rolling window are as follows:

Implement a regularly scheduled process to move partitions from a read/write tablespace to a read-only tablespace when the data matures to the point where it is entirely static.

The best practice in this case is to put the database in ARCHIVELOGmode to provide online backups and point-in-time recovery options.
Create a series of tablespaces, each containing a small number of partitions and regularly modify one tablespace from read/write to read-only as the data in that tablespaces ages.

One consideration is that backing up data is only half of the recovery process. If you configure a tape system so that it can backup the read/write portions of a data warehouse in 4 hours, the corollary is that a tape system might take 20 hours to recover the database if a complete recovery is necessary when 80% of the database is read-only.

In summary, a best practice is to place static tables and partitions into read-only tablespaces. A read-only tablespace needs to be backed up only once.

Best Practice D: Plan for NOLOGGING Operations

In general, one of the highest priorities for a data warehouse is performance. Not only must the data warehouse provide good query performance for online users, but the data warehouse must also be efficient during the ETL process so that large amount of data can be loaded in the shortest amount of time.

One common optimization leveraged by data warehouses is to execute bulk-data operations using the NOLOGGINGmode. The database operations which support NOLOGGINGmodes are direct-path loads and inserts, index creation, and table creation. When an operation runs in NOLOGGINGmode, data is not written to the redo log (or more precisely, only a small set of metadata is written to the redo log). This mode is widely used within data warehouses and can improve the performance of bulk data operations by up to 50%.

However, the trade-off is that a NOLOGGINGoperation cannot be recovered using conventional recovery mechanisms, since the necessary data to support the recovery was never written to the log file. Moreover, subsequent operations to the data upon which a nologging operation has occurred also cannot be recovered even if those operations were not using nologging mode. Because of the performance gains provided by NOLOGGINGoperations, it is generally recommended that data warehouses utilize nologging mode in their ETL process.

The presence of NOLOGGINGoperations must be taken into account when devising the backup and recovery strategy. When a database relies on NOLOGGINGoperations, the conventional recovery strategy (of recovering from the latest tape backup and applying the archived logfiles) is no longer applicable because the log files cannot recover the NOLOGGINGoperation.

The first principle to remember is, do not make a backup when a NOLOGGINGoperation is occurring. Oracle does not currently enforce this rule, so the DBA must schedule the backup jobs and the ETL jobs such that the nologging operations do not overlap with backup operations.

There are two approaches to backup and recovery in the presence of NOLOGGINGoperations: ETL or incremental backups. If you are not using NOLOGGINGoperations in your data warehouse, then you do not have to choose either of the following options: you can recover your data warehouse using archived logs. However, the following options may offer some performance benefits over an archive-log-based approach in the event of recovery.

Extraction, Transformation, and Loading
Incremental Backup

Extraction, Transformation, and Loading

The ETL process uses several Oracle features or tools and a combination of methods to load (re-load) data into a data. These features or tools may consist of:

Transportable Tablespaces. The Oracle Transportable Tablespace feature enables users to quickly move a tablespace across Oracle databases. It is the most efficient way to move bulk data between databases. Oracle Database provides the ability to transport tablespaces across platforms. If the source platform and the target platform are of different endianness, then RMAN will convert the tablespace being transported to the target format.
SQL*Loader. SQL*Loader loads data from external flat files into tables of an Oracle database. It has a powerful data-parsing engine that puts little limitation on the format of the data in the datafile.
Data Pump (export/import). Oracle Database offers the Oracle Data Pump technology, which enables very high-speed movement of data and metadata from one database to another. This technology is the basis for Oracle's data movement utilities, Data Pump Export and Data Pump Import.
External Tables. The external tables feature is a complement to existing SQL*Loader functionality. It enables you to access data in external sources as if it were in a table in the database.

The ETL Strategy and NOLOGGING Operations

One approach is to take regular database backups and store the necessary data files to re-create the ETL process for that entire week. If a recovery is necessary, the data warehouse could be recovered from the most recent backup. Then, instead of rolling forward by applying the archived redo logs (as would be done in a conventional recovery scenario), the data warehouse could be rolled forward by re-running the ETL processes. This paradigm assumes that the ETL processes can be easily replayed, which would typically involve storing a set of extract files for each ETL process (many data warehouses do this already as a best practice, in order to be able to identify repair a bad data feed for example).

A sample implementation of this approach is make a backup of the data warehouse every weekend, and then store the necessary files to support the ETL process for each night. Thus, at most, seven days of ETL processing would need to be re-applied in order to recover a database. The data warehouse administrator can easily project the length of time to recover the data warehouse, based upon the recovery speeds from tape and performance data from previous ETL runs.

Essentially, the data warehouse administrator is gaining better performance in the ETL process through nologging operations, at a price of slight more complex and less-automated recovery process. Many data warehouse administrators have found that this is a desirable trade-off.

One downside to this approach is that the burden is upon the data warehouse administrator to track all of the relevant changes that have occurred in the data warehouse. This approach will not capture changes that fall outside of the ETL process. For example, in some data warehouses, end-users may create their own tables and data structures. Those changes will be lost in the event of a recovery. This restriction needs to be conveyed to the end-users. Alternatively, one could also mandate that end-users create all of private database objects in a separate tablespace, and during recovery, the DBA could recover this tablespace using conventional recovery while recovering the rest of the database using the approach of replaying the ETL process.

In summary, a best practice is to restore a backup that does not contain non-recoverable (nologging) transactions. Then replay the ETL process to reload the data.

Sizing the Block Change Tracking File

The size of the block change tracking file is proportional to:

Database size in bytes. The block change tracking file contains data representing data file blocks in the database. The data is approximately 1/250000 of the total size of the database.
The number of enabled threads. All Real Application Cluster (Oracle RAC) instances have access to the same block change tracking file, however, the instances update different areas of the tracking file without any locking or inter-node block swapping. You enable block change tracking for the entire database and not for individual instances.
Changed Block Metadata. The block change tracking file keeps a record of all changes between previous backups, in addition to the modifications since the last backup. The tracking file retains the change history for a maximum of eight backups. If the tracking file contains the change history for eight backups then the Oracle database overwrites the oldest change history information.

Let us take an example of a 500 GB database, with only one thread, and having eight backups kept in the RMAN repository will require a block change tracking file of 20 MB.

((Threads * 2) + number of old backups) * (database size in bytes)
------------------------------------------------------------------ = 20MB
 250000

Incremental Backup

A more automated backup and recovery strategy in the presence of nologging operations leverages RMAN's incremental backup capability Incremental backups have been part of RMAN since it was first released. Incremental backups provide the capability to backup only the changed blocks since the previous backup. Incremental backups of datafiles capture data changes on a block-by-block basis, rather than requiring the backup of all used blocks in a datafile. The resulting backups set are generally smaller and more efficient than full datafile backups, unless every block in the datafile is change.

Oracle Database delivers the ability for faster incrementals with the implementation of the change tracking file feature. When you enable block change tracking, Oracle tracks the physical location of all database changes. RMAN automatically uses the change tracking file to determine which blocks need to be read during an incremental backup and directly accesses that block to back it up.

The Incremental Approach

A typical backup and recovery strategy using this approach is to backup the data warehouse every weekend, and then take incremental backups of the data warehouse every night following the completion of the ETL process. Note that incremental backups, like conventional backups, must not be run concurrently with nologging operations. In order to recover the data warehouse, the database backup would be restored, and then each night's incremental backups would be re-applied. Although the nologging operations were not captured in the archive logs, the data from the nologging operations is present in the incremental backups. Moreover, unlike the previous approach, this backup and recovery strategy can be completely managed using RMAN.

The replay ETL approach and the incremental backup approach are both recommended solutions to efficiently and safely backing up and recovering a database which is a workload consisting of many NOLOGGINGoperations. The most important consideration is that your backup and recovery strategy must take these NOLOGGINGoperations into account.

In summary, a best practice is to implement Block Change Tracking functionality and make an incremental backup after a direct load that leaves objects unrecoverable due to NOLOGGINGoperations.

Best Practice E: Not All Tablespaces are Equally Important

While the simplest backup and recovery scenario is to treat every tablespace in the database the same, Oracle provides the flexibility for a DBA to devise a backup and recovery scenario for each tablespace as needed.

Not all of the tablespaces in a data warehouse are equally significant from a backup and recovery perspective. DBAs can leverage this information to devise more efficient backup and recovery strategies when necessary. The basic granularity of backup and recovery is a tablespace, so different tablespaces can potentially have different backup and recovery strategies. On the most basic level, temporary tablespaces never need to be backed up (a rule that RMAN enforces).

Moreover, in some data warehouses, there may be tablespaces that are not explicit temporary tablespaces but are essentially functioning as temporary tablespaces as they are dedicated to scratch space for end-users to store temporary tables and incremental results. Depending upon the business requirements, these tablespaces may not need to backed up and restored. Instead, in the case of a loss of these tablespaces, the end users would re-create their own data objects.

In many data warehouses, some data is more important than other data. For example, the sales data in a data warehouse may be crucial and, in a recovery situation, this data must be online as soon as possible. But, in the same data warehouse, a table storing clickstream data from the corporate Web site may be much less critical. The business may tolerate this data being offline for a few days or may even be able to accommodate the loss of several days of clickstream data in the event of a loss of database files. In this scenario, the tablespaces containing sales data must be backed up often, while the tablespaces containing clickstream data need only to be backed up once every week or two.

Certified Business Intelligence Professional Backup and Recovery