Oracle Database 11g High Availability: An Oracle White Paper June 2007
Introduction
  Causes of Downtime
Computer Failure Protection
  Real Application Clusters
  Bounding Database Crash Recovery Time
Data Failure Protection
  Storage Failure Protection
    ASM Block Repair
    Rolling Upgrades of ASM
  Site Failure Protection
    Data Guard
  Human Error Protection
    Guarding Against Human Errors
    Oracle Flashback Technology
  Data Corruption Protection
    Oracle Hardware Assisted Resilient Data (HARD)
    Backup and Recovery
Planned Downtime Protection
  Online System Reconfiguration
  Online Patching and Upgrades
  Online Data and Schema Reorganization
  Maximum Availability Architecture Best Practices
Conclusion
INTRODUCTION
The increasing demand on IT within the
enterprise has established a critical
relationship between business success
and the availability of the IT infrastructure.
failures that may cause the data to be unavailable (e.g. storage corruption, site
failure, etc.). System maintenance activities such as hardware, software, application,
and/or data changes are typical causes of planned downtime.
[Figure: Causes of downtime. Unplanned downtime: computer failures and data failures. Planned downtime: system changes and data changes.]
A computer failure is encountered when the machine running the database server
unexpectedly fails, most likely due to hardware breakdown. This is one of the most
common types of failures. Oracle Real Application Clusters, which is the
foundation of Oracle's Grid Computing architecture, can provide the most
effective protection against such failures.
Figure 2: Hardware Failures
[Figure labels: system downtime; unplanned downtime (computer failures, data failures); planned downtime (system changes, data changes).]
Real Application Clusters enables enterprise Grids. Enterprise Grids are built out of
large configurations of standardized, commodity-priced components: processors,
servers, network, and storage. RAC is the only technology that can harness these
components into useful processing systems for the enterprise. Real Application
Clusters and the Grid dramatically reduce operational costs and provide new levels
of flexibility so that systems become more adaptive, proactive, and agile. Dynamic
provisioning of nodes, storage, CPUs, and memory allows service levels to be easily
and efficiently maintained while lowering cost still further through improved
utilization. In addition, Real Application Clusters is completely transparent to the
application accessing the RAC database, thereby allowing existing applications to be
deployed on RAC without requiring any modifications.
Real Application Clusters also gives users the flexibility to add nodes to the cluster
as the demand for capacity increases, scaling the system incrementally to save costs
and eliminating the need to replace smaller single-node systems with larger ones.
Capacity upgrades become easier and faster because one or more nodes can be added
to the existing cluster, instead of replacing existing systems with new and larger
nodes. The Cache Fusion technology implemented in Real Application Clusters and
the support for InfiniBand networking enable capacity to be scaled near linearly
without making any changes to your application.
Oracle Database 11g further optimizes the performance, scalability, and failover
mechanisms of Real Application Clusters, enhancing its scalability and high
availability benefits.
For more information on Real Application Clusters, please visit
http://www.oracle.com/technology/products/database/clustering/index.html.
Bounding Database Crash Recovery Time
One of the most common causes of unplanned downtime is a system fault or crash.
System faults are the result of hardware failures, power failures, and operating
system or server crashes. The amount of disruption these failures cause will depend
upon the number of affected users, and how quickly service is restored. High
availability systems are designed to quickly and automatically recover from failures,
should they occur. Users of critical systems look to the IT organization for a
commitment that recovery from a failure will be fast and will take a predictable
amount of time. Periods of downtime longer than this commitment can have direct
effects on operations, and lead to lost revenue and productivity.
The Oracle Database provides very fast recovery from system faults and crashes.
However, being predictable is as important as being fast. The Fast-Start Fault
Recovery technology included in the Oracle Database automatically bounds
database crash recovery time and is unique to the Oracle Database. The database
self-tunes checkpoint processing to safeguard the desired recovery time
objective. This makes recovery time fast and predictable, and improves the ability
to meet service level objectives. Oracle's Fast-Start Fault Recovery can reduce
recovery time on a heavily loaded database from tens of minutes to less than 10
seconds.
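As a minimal sketch of how a recovery time objective might be set, the statements below use the FAST_START_MTTR_TARGET initialization parameter and the V$INSTANCE_RECOVERY view. The 30-second value is an illustrative assumption, not a recommendation from this paper.

```sql
-- Ask the database to keep estimated crash recovery time near 30 seconds.
-- The value 30 is illustrative only; choose it from your own service levels.
ALTER SYSTEM SET FAST_START_MTTR_TARGET = 30;

-- Compare the configured target with the current estimate (both in seconds).
SELECT target_mttr, estimated_mttr
FROM   v$instance_recovery;
```

If ESTIMATED_MTTR stays well above TARGET_MTTR under load, the target may be too aggressive for the available I/O capacity.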
DATA FAILURE PROTECTION
Data failure is the loss, damage, or corruption of business critical data. The causes
of data failure are multifaceted and in many cases data failure can be elusive and
difficult to identify. Generally, one or a combination of the following causes data
failure: storage subsystem failure, site failure, human error, and/or corruption.
[Figure: Data failures among the causes of downtime. Labels: storage failure, site error, human error, corruption.]
Oracle Database 11g introduces new functionality to increase the reliability and
availability of ASM. The first of these features is the capability to recover corrupt
blocks on a disk by leveraging the valid blocks available on the mirrored disk(s).
When a read operation identifies that a corrupt block exists on disk, ASM
automatically relocates a valid copy of the block to an uncorrupted portion of the
disk. In addition, administrators can now use the ASMCMD utility to manually
relocate specific blocks affected by underlying corruption of the disk.
ASM in Oracle Database 11g enhances the availability of the entire cluster
environment with the capability to perform Rolling Upgrades of the ASM Software.
ASM Rolling Upgrades permit administrators to keep their applications online
while they upgrade ASM on individual nodes by keeping the other nodes in the
cluster available during the migration. The ASM instances can run at different
software versions until all nodes in the cluster have been upgraded. Any
functionality introduced in the newer version of the ASM Software would not be
enabled until all nodes in the cluster are upgraded.
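The rolling upgrade described above can be sketched with the following statements, issued in an ASM instance. The version string is a placeholder assumption, and the per-node software upgrade between the two statements is outside the scope of this sketch.

```sql
-- Put the ASM cluster into rolling migration mode before upgrading the
-- first node. '11.2.0.0.0' is a placeholder for the target ASM version.
ALTER SYSTEM START ROLLING MIGRATION TO '11.2.0.0.0';

-- ... upgrade the ASM software on each node in turn ...

-- Once every node runs the new version, leave rolling migration mode so
-- that functionality from the newer version becomes available.
ALTER SYSTEM STOP ROLLING MIGRATION;
```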
Site Failure Protection
Enterprises need to protect their critical data and applications against catastrophic
events that can take an entire data center offline. Events such as natural disasters
and power and communication outages are a few examples of scenarios that can
have detrimental effects on the data center. The Oracle Database offers a variety of
data protection solutions that can safeguard an enterprise from costly downtimes
due to complete site failures. The most basic form of protection is the off-site
storage of database backups. While integral to an overall HA strategy, the process
of restoring backups in a site-wide disaster can take more time than the enterprise
can afford, and the backups may not contain the most up-to-date versions of data.
A more expeditious and comprehensive solution is to manage one or more
duplicate copies of the production database in physically separate data centers.
Data Guard
Physical Standby databases have always had the ability to be opened read-only,
providing a means to offload production workloads that only require read access to
the database. Historically, the drawback to this approach was the requirement that
media recovery be stopped while the Physical Standby database was opened in
read-only mode, causing the Physical Standby database to fall out of
sync with the production database. Groundbreaking advancements in Oracle
Database 11g allow media recovery to continue while the Physical Standby database
is opened in read-only mode. This new capability, called Physical Standby
with Real Time Query, removes the drawback of opening the standby
for read-only activity: the Physical Standby database now remains in sync with
the production database even as it services read-only applications.
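A minimal sketch of enabling this on a physical standby follows. It assumes the standby is mounted with Redo Apply running and that standby redo logs are configured, which USING CURRENT LOGFILE requires.

```sql
-- Run on the physical standby.
-- Stop Redo Apply, open the database read-only, then restart Redo Apply.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;

-- Confirm how the standby is currently opened.
SELECT open_mode FROM v$database;
```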
A key benefit of having a standby database that is physically identical to the
production database is the ability to utilize this standby database as the source for
backup activities. Oracle Database 10g introduced Block Tracking technology that
keeps a log of which blocks have changed since the last incremental backup was
performed and dramatically reduces the time required for incremental backups.
Prior to Oracle Database 11g, the fast incremental backups using the block tracking
technology could only be performed on the primary database. This restriction has
been lifted in Oracle Database 11g allowing customers to offload all of their backup
activities to the standby database.
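As a sketch, block change tracking is switched on with a single statement. The tracking file path is a hypothetical example; the incremental backups themselves are taken with RMAN, which is not shown here.

```sql
-- Enable block change tracking. The file path is a hypothetical example.
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
  USING FILE '/u01/oradata/orcl/change_tracking.f';

-- Verify that tracking is active and where the tracking file lives.
SELECT status, filename
FROM   v$block_change_tracking;
```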
Oracle Database 11g also introduces new functionality called Snapshot Standby
that allows a physical standby to be opened temporarily for read-write
testing activities without losing disaster protection. Using this functionality, a
physical standby database is temporarily converted into a snapshot standby
database that can be opened read-write to process transactions that are independent of
the primary database for test or other purposes. A snapshot standby database
continues to receive and archive updates from the primary database; however, redo
data received from the primary is not applied until the snapshot standby is
converted back into a physical standby database, at which point all updates made
while it was a snapshot standby are discarded. This enables production data to
remain in a protected state at all times.
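The conversion in each direction can be sketched as follows in a SQL*Plus session on the standby. The sketch assumes Redo Apply has been stopped, the database is mounted, and a flash recovery area is configured; SHUTDOWN and STARTUP are SQL*Plus commands rather than SQL statements.

```sql
-- Convert the mounted physical standby and open it read-write for testing.
ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;
ALTER DATABASE OPEN;

-- ... run tests; redo from the primary is still received and archived ...

-- Discard the test changes and return to the physical standby role.
SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
```

After converting back, the standby is restarted and Redo Apply resumed so that the queued redo is applied.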
Finally, Oracle Database 11g can apply changes on the standby database in parallel
thereby dramatically improving performance.
Data Guard SQL Apply (Logical Standby)
The primary and standby databases, as well as their various interactions, may be
managed by using SQL*Plus. For easier manageability, Data Guard also offers a
distributed management framework called the Data Guard Broker, which
automates and centralizes the creation, maintenance, and monitoring of a Data
Guard configuration. Administrators may use either Oracle Enterprise Manager or
the Broker's own specialized command-line interface (DGMGRL) to take
advantage of the Broker's management capabilities. From the easy-to-use GUI in
Oracle Enterprise Manager, a single mouse click can initiate failover processing
from the primary to either type of standby database. The Broker and Enterprise
Manager make it easy for the DBA to manage and operate the standby database.
Data Guard Fast-Start Failover enables the creation of a fault tolerant standby
database environment by providing the ability to totally automate the failover of
database processing from the production to standby database without any human
intervention. In the event of a failure, Fast-Start Failover will automatically, quickly,
and reliably fail over to a designated, synchronized standby database, without
requiring administrators to perform complex manual steps to invoke and
implement the failover operation. This greatly reduces the length of an outage.
After a Fast-Start Failover occurs, the old primary database, upon reconnection to
the configuration, will be automatically reinstated as a new standby database by the
Broker. This enables the Data Guard configuration to restore disaster protection in
the configuration easily and quickly, improving the robustness of the Data Guard
configuration. Thanks to this feature, Data Guard not only helps maintain
transparent business continuity, but also reduces the management costs for the DR
configuration.
The enhancements to the Fast-Start Failover mechanism in Oracle Database 11g
further reduce the failover time and give administrators more control over the
failover scenarios and behavior. For instance, administrators can now define
specific events, such as database errors (ORA-xxxx), which will trigger a Fast-Start
Failover. Similarly, administrators can configure their Data Guard environment to
shut down the primary database when Fast-Start Failover is initiated in order to
prevent accidental updates.
Human Error Protection
Almost any research done on the causes of downtime identifies human error as the
single largest cause of downtime. Human errors, such as the inadvertent deletion of
important data or an incorrect WHERE clause in an UPDATE statement that
updates many more rows than intended, need to be prevented wherever
possible, and undone when the precautions against them fail. The Oracle Database
provides easy to use yet powerful tools that help administrators quickly diagnose
and recover from these errors, should they occur. It also includes features that
allow end-users to recover from problems without administrator involvement,
reducing the support burden on the DBA, and speeding recovery of the lost and
damaged data.
The best way to prevent errors is to restrict users' access to only the data and
services they truly need to conduct their business. The Oracle Database provides a
wide range of security tools to control user access to application data by
authenticating users and then allowing administrators to grant users only those
privileges required to perform their duties. In addition, the security model of
Oracle Database provides the ability to restrict data access at a row level, using the
Virtual Private Database (VPD) feature, further isolating users from data they do not
need access to.
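A minimal sketch of a row-level policy using the DBMS_RLS package follows. The SALES schema, the ORDERS table, its OWNER_NAME column, and the policy and function names are all hypothetical and serve only to illustrate the mechanism.

```sql
-- Hypothetical policy function: each user sees only rows they own.
-- The table SALES.ORDERS and its OWNER_NAME column are assumptions.
CREATE OR REPLACE FUNCTION orders_owner_policy (
  schema_name IN VARCHAR2,
  table_name  IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
  -- The returned text is applied as a WHERE predicate to each statement.
  RETURN 'owner_name = SYS_CONTEXT(''USERENV'', ''SESSION_USER'')';
END;
/

-- Attach the policy function to the table for the listed statement types.
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'SALES',
    object_name     => 'ORDERS',
    policy_name     => 'ORDERS_OWNER_ONLY',
    function_schema => 'SALES',
    policy_function => 'ORDERS_OWNER_POLICY',
    statement_types => 'SELECT, UPDATE, DELETE'
  );
END;
/
```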
Oracle Flashback Technology
When authorized people make mistakes, you need the tools to correct these errors.
Oracle Database 11g provides a family of human error correction technologies called
Flashback. Flashback revolutionizes data recovery. In the past, it might take
minutes to damage a database but hours to recover it. With Flashback, the time to
correct an error is about the same as the time it took to make it. It is also extremely
easy to use: a single short command can recover the entire database, instead
of a complex procedure. Flashback provides a SQL interface to
quickly analyze and repair human errors. It provides fine-grained surgical
analysis and repair for localized damage, such as when the wrong customer order is
deleted. Flashback also allows for correction of more widespread damage, such as
when all of this month's customer orders have been deleted, and does so quickly to
avoid long downtime. Flashback is unique to the Oracle Database and supports
recovery at all levels, including the row, transaction, table, tablespace, and
database.
Flashback Query
Using Oracle Flashback Query, administrators are able to query any data at some
point-in-time in the past. This powerful feature can be used to view and
reconstruct logically corrupted data that may have been deleted or changed
inadvertently.
SELECT *
FROM emp
AS OF TIMESTAMP
TO_TIMESTAMP('01-APR-07 02:00:00 PM','DD-MON-YY HH:MI:SS PM')
WHERE ...
This simple query displays rows from the emp table as of the specified timestamp.
This feature is a powerful tool that administrators can leverage to quickly identify
and resolve logical data corruption. This functionality could also easily be
built into an application to provide application users with an easy and quick
mechanism to roll back or undo changes to data without contacting their
administrator.
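As a sketch of how such a repair might look, the statement below re-inserts a row that was deleted by mistake, reading it as it existed at the earlier timestamp. The empno column and the value 7788 are hypothetical.

```sql
-- Re-insert a mistakenly deleted row using its state at a past time.
-- The empno column and the value 7788 are hypothetical.
INSERT INTO emp
  SELECT *
  FROM   emp AS OF TIMESTAMP
         TO_TIMESTAMP('01-APR-07 02:00:00 PM', 'DD-MON-YY HH:MI:SS PM')
  WHERE  empno = 7788;
```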
This query displays each version of the row between the specified timestamps. The
administrator will have visibility into the values as they were modified by different
transactions throughout this period. This mechanism gives the administrator the
ability to pinpoint exactly when and how data has changed, providing tremendous
value in both data repair and application debugging.
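The query that the paragraph above describes is not reproduced here. A minimal sketch of a Flashback Versions Query of that kind follows; the empno and sal columns, the key value, and both timestamps are assumptions for illustration.

```sql
-- List every version of one row between two points in time, along with
-- the transaction and operation (I, U, or D) that produced each version.
-- Column names, the key value, and the timestamps are hypothetical.
SELECT versions_xid,
       versions_starttime,
       versions_endtime,
       versions_operation,
       empno,
       sal
FROM   emp
       VERSIONS BETWEEN TIMESTAMP
         TO_TIMESTAMP('01-APR-07 02:00:00 PM', 'DD-MON-YY HH:MI:SS PM')
         AND
         TO_TIMESTAMP('01-APR-07 03:00:00 PM', 'DD-MON-YY HH:MI:SS PM')
WHERE  empno = 7788;
```

The VERSIONS_XID value returned for a suspect change can then be used to look up the whole transaction.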
Flashback Transaction
Often, a logical corruption occurs in a transaction that
changes data in multiple rows or tables. Flashback Transaction Query allows an
administrator to see all the changes made by a specific transaction.
SELECT *
FROM FLASHBACK_TRANSACTION_QUERY
WHERE XID = HEXTORAW('000200030000002D')
Not only will this query show the changes made by this transaction, but it will also
produce the SQL statements necessary to flash back, or undo, the transaction. A
precision tool such as this empowers the administrator to delicately and efficiently
diagnose and resolve logical corruptions in the database.
Flashback Transaction, new in Oracle Database 11g, is a seamless and powerful set
of PL/SQL interfaces that simplify transaction-level data recovery. Building on the
power of Flashback Transaction Query, this new feature enables a more robust and
failsafe approach to repairing logical data corruptions. Many times, data failures
can take time to be identified. When this is the case, it is possible that additional
transactions have been executed based on logically corrupted data. Flashback
Transaction identifies and resolves not only the initial transaction but all dependent
transactions as well.
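A minimal sketch of backing out a transaction and its dependents through the DBMS_FLASHBACK package follows. It reuses the transaction identifier from the Flashback Transaction Query example and assumes that supplemental logging and the required privileges are already in place.

```sql
DECLARE
  -- Transaction identifier reused from the example above.
  xids SYS.XID_ARRAY := SYS.XID_ARRAY('000200030000002D');
BEGIN
  -- CASCADE also backs out transactions that depend on the listed one.
  DBMS_FLASHBACK.TRANSACTION_BACKOUT(1, xids, DBMS_FLASHBACK.CASCADE);

  -- The backout takes effect only once the session commits.
  COMMIT;
END;
/
```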
The Flashback query statements discussed above depend on the availability of the
historical data in the UNDO tablespace. The amount of time that historical data
remains in the UNDO tablespace is dependent on the size of the tablespace, the
rate of data changes, and configurable database settings. Typically, administrators
configure their databases to keep UNDO data no longer than days or weeks,
certainly not years or decades. To overcome this limitation, Oracle Database 11g
introduces pioneering new capabilities available through Flashback Data Archive.
Flashback Data Archive maintains historical versions of data as regular data within
the database that can be maintained for as long as required by the business.
Flashback Data Archive revolutionizes data retention strategies to assist enterprises
in the ever-changing regulatory landscape, such as Sarbanes-Oxley and HIPAA. To
ensure the integrity of the retained data, Flashback Data Archive allows read-only
access to the historical versions of data.
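As a sketch, an archive is created once and then attached to individual tables. The archive name, tablespace, quota, and retention period below are illustrative assumptions.

```sql
-- Create an archive that keeps five years of history.
-- fda_5yr, fda_ts, and the 10G quota are illustrative assumptions.
CREATE FLASHBACK ARCHIVE fda_5yr
  TABLESPACE fda_ts
  QUOTA 10G
  RETENTION 5 YEAR;

-- Start tracking history for the emp table.
ALTER TABLE emp FLASHBACK ARCHIVE fda_5yr;

-- Historical rows are then read with ordinary Flashback Query syntax.
SELECT *
FROM   emp AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' YEAR);
```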
The Flashback Data Archive is a robust tool-set that provides enterprises with
amazing flexibility in managing their critical business data. Clearly, the advantages
of Flashback Data Archive far surpass just the implicit benefits of repairing data
failures. Using this technology, application developers and administrators can
enable users to track and view information evolution. Given the immutable nature
of the Flashback Data Archive, enterprises gain a strategic and financial advantage
in terms of data preservation for purposes such as auditing. Application developers
can take advantage of the Flashback Data Archive by introducing rich features into
their applications allowing users to view past versions of data such as banking
statements. Finally, application developers and administrators are no longer
burdened with creating and maintaining custom logic to track changes to critical
business data.
Flashback Database
As you can see, no complicated recovery procedures are required and there is no
need to restore backups from tape. Flashback Database drastically reduces the
amount of downtime required for scenarios requiring a database restore.
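The command this passage refers to is not reproduced here. A minimal sketch of rewinding a database with Flashback Database follows, as a SQL*Plus session connected as SYSDBA. It assumes flashback logging was enabled before the error occurred, and the timestamp is illustrative.

```sql
-- SHUTDOWN and STARTUP are SQL*Plus commands; the database must be
-- mounted, not open, when it is flashed back.
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- Rewind the whole database to a point just before the error.
FLASHBACK DATABASE TO TIMESTAMP
  TO_TIMESTAMP('01-APR-07 02:00:00 PM', 'DD-MON-YY HH:MI:SS PM');

-- Open the database with a new incarnation of the redo stream.
ALTER DATABASE OPEN RESETLOGS;
```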
Flashback Table
Often, logical corruption is confined to one table or a set of tables, and thus does
not require a restore of the entire database. Flashback Table is the feature that allows
the administrator to recover a table, or a set of tables, to a specific point in time
quickly and easily.
FLASHBACK TABLE orders, order_items TO TIMESTAMP
TO_TIMESTAMP('01-APR-07 02:00:00 PM','DD-MON-YY HH:MI:SS PM')
This statement will rewind the orders and order_items tables, undoing any updates made
to these tables between the current time and the specified timestamp. In the event
that a table is accidentally dropped, administrators can use the Flashback Table
feature to restore the dropped table, and all of its indexes, constraints, and triggers,
from the Recycle Bin. Dropped objects remain in the Recycle Bin until the
administrator explicitly purges them or the object's tablespace comes under pressure
for free space.
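Two supporting details can be sketched briefly. Rewinding a table to a timestamp requires row movement to be enabled on it, and a dropped table is restored with the TO BEFORE DROP form. The orders table is the one from the example above.

```sql
-- Flashback Table to a timestamp requires row movement on the table.
ALTER TABLE orders ENABLE ROW MOVEMENT;

-- See which dropped objects are still available in the Recycle Bin.
SELECT object_name, original_name, droptime
FROM   user_recyclebin;

-- Restore an accidentally dropped table from the Recycle Bin.
FLASHBACK TABLE orders TO BEFORE DROP;
```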
Flashback Restore Points
[Figure: IO path. Oracle, operating system, file system, volume manager, device driver, host-bus adapter, storage controller, disk drive.]
Physical data corruption is created by faults in any one of the various components
making up the IO stack. At a high level, when Oracle issues a write operation, the
database IO operation is passed to the operating system's IO code. This initiates
the process of passing the IO through the IO stack, where it is passed through the
various components, from the file system to the volume manager to the device
driver to the Host-Bus Adapter to the storage controller and finally to the disk
drive where the data is written. Hardware failures or bugs in any one of these
components could result in invalid or corrupt data being written to disk. The
resulting corruption could damage internal Oracle control information or
application/user data, either of which could be catastrophic to the functioning or
availability of the database.
Large databases can be composed of hundreds of files spread over many mount
points, making backup activities extremely challenging. Neglecting or
overlooking even one critical file in a backup can render the entire database backup
useless. As is too often the case, incomplete backups go undetected until they are
needed in an emergency scenario. Oracle Recovery Manager (RMAN) is the
composite tool that manages the database backup, restore, and recovery processes.
RMAN maintains configurable backup and recovery policies and keeps historical
records of all database backup and recovery activities. Through its comprehensive
feature set, RMAN ensures that all files required to successfully restore and recover
a database are included in complete database backups. Furthermore, through the
RMAN backup operations, all data blocks are analyzed to ensure that corrupt
blocks are not propagated throughout the backup files.
Oracle's Block Tracking technology, which
greatly increases the speed of incremental
backups, is now available for managed
standby databases.
When the unthinkable situation arises and critical business data becomes
jeopardized, all recovery and repair options need to be evaluated to ensure a safe
and fast recovery. These situations can be very stressful and often occur in the
middle of the night. Research shows that administrators spend a majority of Repair
Time performing investigation into what, why, and how data has become
compromised. Administrators need to comb through volumes of information to
identify the relevant errors, alerts, and trace files.
[Figure: Time to Repair over time, divided into investigation, planning, and recovery phases.]
The Oracle Database 11g Data Recovery Advisor, built to minimize the time spent
in the investigation and planning phases of recovery, reduces the uncertainty and
confusion during an outage. Tightly integrated with other Oracle high availability
features such as Data Guard and RMAN, the Data Recovery Advisor analyzes all
recovery scenarios quickly and accurately. Through this integration, the advisor is
able to identify which recovery options are feasible given the specific conditions.
The possible recovery options are presented to the administrator, ranked based on
recovery time and data loss. The Data Recovery Advisor can be configured to
automatically implement the best recovery options, thus reducing any dependencies
on the administrator.
Many disaster scenarios can be mitigated based on accurate analysis of errors and
trace files that are presented prior to an outage. Therefore, the Data Recovery
Advisor automatically and continuously analyzes the condition of the database
through various health checks. As the advisor identifies symptoms that could be
precursors to a database outage, the administrator can choose to obtain recovery
advice and perform the necessary actions to fix the associated problem and avoid
system downtime.
Oracle Secure Backup
Oracle Secure Backup, a centralized tape
management system, backs up databases
up to 25% faster than the
leading competition.
Oracle Secure Backup, a new product offering from Oracle, provides centralized
tape backup management for entire Oracle environments, including databases and
file systems. Oracle Secure Backup offers customers a highly secure, cost-effective,
and high-performance tape backup solution. Thanks to its tight integration with
Oracle Database, Oracle Secure Backup can back up an Oracle Database up to 25%
faster than the leading competition. This is accomplished by leveraging direct calls
into the database engine and through efficient algorithms that skip unused data
blocks. This performance advantage is expected to widen as
Oracle Secure Backup integrates more closely with the database engine and
adds further optimizations to improve backup performance.
Oracle Secure Backup is also integrated with Oracle Enterprise Manager, Oracle's
web-based GUI administrative tool, giving administrators an easy way to
set up tape backups or to restore and recover data from tape.
PLANNED DOWNTIME PROTECTION
[Figure: Planned downtime among the causes of downtime. Labels: unplanned downtime (hardware failures, data failures); planned downtime (system changes, data changes).]
Oracle supports dynamic online system reconfiguration for all components of your
Oracle hardware stack. Oracle's Automatic Storage Management (ASM) has built-in
capabilities that allow the online addition or removal of ASM disks. When disks
are added to or removed from an ASM Diskgroup, Oracle automatically rebalances
the data across the new storage configuration while the storage, database, and
application remain online. As discussed earlier in the paper, Real Application
Clusters provides extraordinary online reconfiguration capabilities. Administrators
can dynamically add and remove clustered nodes without any disruption to the
database or the application. Oracle supports the dynamic addition or removal of
CPUs on SMP servers that have this online capability. Finally, Oracle's dynamic
shared memory tuning capabilities allow administrators to grow and shrink the
shared memory and database cache online. With automatic memory tuning
capabilities, administrators can let Oracle automate the sizing and distribution of
shared memory based on Oracle's analysis of memory usage characteristics. Oracle's
extensive online reconfiguration capabilities help administrators not
only minimize system downtime due to maintenance activities but also enable
enterprises to scale their capacity on demand.
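Two of these online changes can be sketched as follows. The disk group name, disk path, rebalance power, and memory size are hypothetical values.

```sql
-- In the ASM instance: add a disk and rebalance while everything stays
-- online. The disk group name, path, and power level are hypothetical.
ALTER DISKGROUP data ADD DISK '/dev/raw/raw5' REBALANCE POWER 4;

-- Watch the progress of the rebalance operation.
SELECT operation, state, power, est_minutes
FROM   v$asm_operation;

-- In the database instance: resize shared memory online. The new value
-- must fit within the configured SGA_MAX_SIZE.
ALTER SYSTEM SET SGA_TARGET = 2G;
```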
Online Patching and Upgrades
Enterprises with high availability demands can leverage Oracle technology to patch
and upgrade their systems without end user interruption. With the strategic use of
Real Application Clusters and Oracle Data Guard, administrators can more adeptly
support the demands of the business.
[Figure: Rolling patch of a two-node RAC cluster (nodes A and B). Panels: initial RAC configuration; clients on A, patch B; clients on B, patch A; upgrade complete.]
Utilizing Oracle's SQL Apply Data Guard technology, administrators can apply
database patchsets, major release upgrades, and cluster upgrades with nearly no
downtime to the end users. The process begins with instantiating a logical standby
database and configuring Data Guard to keep the standby synchronized with the
production database. Once the Data Guard configuration is complete, the
administrator pauses the synchronization and all redo data is queued. The
standby database is upgraded and brought back online, and Data Guard is reactivated.
All queued redo data is propagated and applied on the standby to ensure no
data loss occurs between the two databases. The standby and production databases
can remain in mixed mode until testing confirms the upgrade completed successfully.
At this point, the switchover can occur, resulting in a database role reversal: the
standby database is now servicing the production workload and the production
database is ready to be upgraded. While the production database is upgraded, the
standby database (converted to primary during the switchover) queues the redo
data. Once the production database is upgraded and the redo data is applied, a
second switchover takes place and the original production system is again taking
production traffic. Figure 7 below illustrates the process for upgrading a database
with near zero downtime.
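The key statements in this sequence can be sketched as follows. This is an outline only: it omits the upgrade itself and the PREPARE TO SWITCHOVER steps that normally precede a switchover.

```sql
-- On the logical standby: pause SQL Apply before upgrading it.
-- Redo from the primary continues to arrive and is queued.
ALTER DATABASE STOP LOGICAL STANDBY APPLY;

-- ... upgrade the standby database software and restart it ...

-- On the upgraded standby: resume SQL Apply to work through queued redo.
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

-- Switchover, once testing confirms the upgrade.
-- On the current primary:
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;
-- On the current logical standby:
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL PRIMARY;
```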
[Figure 7: Rolling upgrade with SQL Apply. Panels: set up SQL Apply with both databases at version X; upgrade node B to version X+1 while logs queue; run in mixed mode for testing; switch over to B and upgrade A to version X+1.]
Oracle Database 11g further enhances the appeal of the rolling upgrade process by
introducing functionality called Transient Logical Standby. This feature allows
users to convert a physical standby to a logical standby database temporarily to
effect a rolling database upgrade, and then revert to a physical standby once the
upgrade is complete (using the KEEP IDENTITY clause). This benefits physical
standby users who wish to execute a rolling database upgrade without investing in
the redundant storage otherwise needed to create a logical standby database.
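A sketch of the temporary conversion follows. It assumes Redo Apply has been stopped on the physical standby and omits the surrounding restart and upgrade steps.

```sql
-- On the primary: build the LogMiner dictionary into the redo stream.
EXECUTE DBMS_LOGSTDBY.BUILD;

-- On the physical standby: convert to a logical standby while keeping
-- the database name and identifier, so it can later revert.
ALTER DATABASE RECOVER TO LOGICAL STANDBY KEEP IDENTITY;
```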
Online Data and Schema Reorganization
Online data and schema reorganization improves the overall database availability
and reduces planned downtime by allowing users full access to the database
Administrators using this API enable end users to access the original table,
including insert/update/delete operations, while the upgrade process modifies an
interim copy of the table. The interim table is routinely synchronized with the
original table. Once the upgrade procedures are complete, the administrator
performs the final synchronization and activates the upgraded table.
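The sequence above can be sketched with the DBMS_REDEFINITION package. The APP schema and the ORDERS and ORDERS_INTERIM tables are hypothetical, the interim table is assumed to exist already with the new structure, and copying dependent objects such as indexes and triggers is omitted.

```sql
BEGIN
  -- Check that the table can be redefined online (by primary key).
  DBMS_REDEFINITION.CAN_REDEF_TABLE('APP', 'ORDERS');

  -- Start copying rows into the interim table; users keep working
  -- against the original table in the meantime.
  DBMS_REDEFINITION.START_REDEF_TABLE('APP', 'ORDERS', 'ORDERS_INTERIM');

  -- Periodically apply changes made to the original since the copy began.
  DBMS_REDEFINITION.SYNC_INTERIM_TABLE('APP', 'ORDERS', 'ORDERS_INTERIM');

  -- Final synchronization; the interim table takes over the original name.
  DBMS_REDEFINITION.FINISH_REDEF_TABLE('APP', 'ORDERS', 'ORDERS_INTERIM');
END;
/
```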
Partitioning
As databases grow, they can become more challenging to manage. Partitioning is a
pivotal technology that allows administrators to break large tables and indexes into
smaller, more manageable pieces. While most maintenance activities can be
performed online, performing maintenance one partition at a time provides
flexibility and performance benefits to most online operations. Furthermore,
partitioning increases the fault tolerance of the Oracle Database. Administrators
can strategically locate individual partitions on different disks; therefore a disk
failure will only affect the partitions that reside on that disk.
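As a sketch of placing partitions on separate storage, the table below stores each year in its own tablespace. The table, partition, and tablespace names are hypothetical, and the tablespaces are assumed to reside on different disks.

```sql
-- Range-partition by date, with each partition in its own tablespace.
CREATE TABLE sales_history (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER(10,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2006 VALUES LESS THAN (TO_DATE('01-JAN-2007', 'DD-MON-YYYY'))
    TABLESPACE ts_disk1,
  PARTITION p2007 VALUES LESS THAN (TO_DATE('01-JAN-2008', 'DD-MON-YYYY'))
    TABLESPACE ts_disk2
);

-- Maintenance can target a single partition, leaving the others untouched.
ALTER TABLE sales_history MOVE PARTITION p2006 TABLESPACE ts_disk3;
```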
CONCLUSION
availability and data protection technologies to provide customers with new and
more effective ways of maximizing their data and application availability. Oracle's
comprehensive set of technologies provides businesses unparalleled protection
against any kind of outage, whether due to a planned maintenance activity or an
unexpected failure. The Grid capabilities provided also help ensure that the cost to
deploy your database environment, and adapt to changing business needs, is
significantly less than what you had to spend in the past to achieve equivalent
results.