Hadr Tsa Win
December 2012
Authors:
Steve Raspudic, IBM Toronto Lab
Michelle Chiu, IBM Toronto Lab
Philippe Stedman, IBM Toronto Lab
Table of Contents
1. Introduction and Overview
1. Introduction and Overview
This paper will guide you through the implementation of an automated failover solution
for the IBM® DB2® Enterprise Server Edition for Linux, UNIX, and Windows Version 9.5
database server (DB2 9.5) product. The solution is based on a combination of the high
availability disaster recovery (HADR) feature in DB2 9.5 and the IBM Tivoli® System
Automation for Multiplatforms product (SA MP). The setup described in this paper
focuses on the Windows operating system.
Below you will find information on knowledge requirements, as well as hardware and
software configurations used to set up the topology depicted in this paper. It is important
that you read this section prior to beginning any setup.
Windows Server 2003 Release 2 Enterprise Edition was used in this configuration.
For information about software requirements for running SA MP, refer to:
http://www.ibm.com/software/tivoli/products/sys-auto-linux/platforms.html
Listed below are the components of the software configuration used to set up the
environment for this paper:
• Operating system: Windows Server 2003 Release 2 Enterprise Edition
• DB2 product: IBM DB2 Enterprise Server Edition (ESE) Version 9.5 Fix Pack 7
• Tivoli product: IBM Tivoli System Automation for Multiplatforms Version 3.1 with Fix
Pack 7
3. Overview of Important Concepts
The HADR feature of DB2 9.5 allows a database administrator (DBA) to have one “hot
standby” copy of any DB2 database such that, in the event of a primary database failure,
a DBA can quickly switch over to the “hot standby” with minimal interruption to database
clients. (See Fig. 1 below for a typical HADR environment.)
However, an HADR primary database does not automatically switch over to its standby
database in the event of failure. Instead, a DBA must manually perform a takeover
operation when the primary database has failed.
3.2 SA MP Overview
Since an HADR primary database does not automatically switch over to its standby
database in the event of failure, to achieve automatic monitoring and failover, a DBA
must set up SA MP with DB2. For example, the topology shown in Fig. 1 below would
perform automatic failover.
During the SA MP configuration process, the necessary HADR resources and their
relationships are defined to the cluster manager. Failure events in the HADR system can
then be detected automatically, and takeover operations can be executed without manual
intervention.
A typical HADR topology contains two nodes: a primary node to host the primary HADR
database, and a standby node to host the standby HADR database. The nodes are
connected to each other over a network to accommodate transaction replication between
the two databases.
In Fig. 1, SA MP now monitors the HADR pair for primary database failure, and will issue
appropriate takeover commands on the standby database in the event of a primary
database failure.
Fig. 1. Typical HADR Environment with SA MP
4. Steps to Set Up Topology
The following section documents a two-node topology, in which one node (e.g., spock1)
hosts the primary database (e.g., HADRDB), and a second node (e.g., spock2) hosts its
standby. Complete the following steps to set up the topology depicted in Fig. 1 above.
Notes:
1. Your topology does not have to include redundant network interface cards (NICs)
(e.g., eth1 in Fig. 1 above). Redundant NICs allow for recovery from simple outages
caused by primary NIC failure (e.g., eth0). For example, if eth0 on spock1 in Fig. 1
failed for some reason, then the IP address that it was hosting (i.e., 9.26.124.20)
could be taken over by eth1. In fact, there is an opportunity in Section 4.4 step 7 to
make the IP address of each DB2 instance (e.g., db2inst) highly available with SA
MP.
2. The letters in front of a command in the following steps designate the node on which
the command is to be issued to properly set up the topology shown in Fig. 1 above:
(P) denotes the primary node (e.g., spock1), and (S) denotes the standby node (e.g.,
spock2). The order of the letters also designates the order in which you should issue
the command on each node.
3. The parameters given for commands in this paper are based on the topology shown
in Fig. 1 above. Change the parameters accordingly to match your specific
environment. Also, a “\” in a command designates that the command text continues
on the next line (i.e., do not include the “\” when you issue the command).
Make sure that all nodes (e.g., spock1, spock2) can communicate with each other over
TCP/IP.
1. Set up the network using a static IP address. (We use a static IP address in Fig. 2
with a subnet mask of 255.255.254.0 for each node.)
Fig. 2: Internet Protocol (TCP/IP) Properties
Fig. 4: Advanced TCP/IP Setting (DNS)
2. Turn off the Windows Firewall on each node so that the HADR pair can connect to each other:
Start > Control Panel > Windows Firewall. Click Off and then OK.
3. Test that you can ping from each node to all other nodes successfully using the
following commands:
As user1, install the DB2 ESE Version 9.5 Fix Pack 7 software on the primary and
standby nodes (e.g., spock1 and spock2).
Important: The DB2 SA MP scripts for HADR on Windows require that the same
instance name be used for the primary and standby instances.
4.3 Install SA MP
Install SA MP V3.1, and then upgrade to Fix Pack 7 (i.e., SA MP V3.1.0.7) on the nodes,
by following these steps:
2. As user1, the local user, go to the directory where the SA MP 3.1 installation .exe
file and SA MP license exist, and then run the SA MP 3.1 installer.
3. Use the “IBM Tivoli System Automation – Shell” to add the appropriate IP address to
host name mappings to the /etc/hosts file on each node, (P) and (S):
Adding static IP address to host name mappings to the hosts file removes the
system's DNS servers as a single point of failure. If DNS fails, the cluster systems
can still resolve the addresses of the other machines via the hosts file.
Make sure that all SA MP installations in your topology know about one another, and
can communicate with one another in what is referred to as an SA MP cluster domain.
This is essential for management of HADR by SA MP.
1. Using the “IBM Tivoli System Automation – Shell,” run the following command as the
local user to prepare the proper security environment between the SA MP nodes:
3. Now start the cluster domain as follows. (Note: all future SA MP commands will be
run relative to this active domain):
(P)(S) $ lsrpdomain
Name        OpState RSCTActiveVersion MixedVersions TSPort GSPort
hadr_domain Online  2.5.5.2           No            12347  12348
5. Verify that all nodes are online in the domain as follows:
(P)(S) $ lsrpnode
Name OpState RSCTVersion
spock1 Online 2.5.5.2
spock2 Online 2.5.5.2
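The domain setup steps above can be sketched as a small script. This is a minimal sketch, not the exact command listing: it assumes the node names spock1 and spock2 and the domain name hadr_domain that appear in the output above. preprpnode, mkrpdomain, startrpdomain, lsrpdomain, and lsrpnode are standard RSCT commands; the function names are illustrative only.

```shell
#!/bin/sh
# Sketch only: node and domain names are taken from the example topology.
NODES="spock1 spock2"
DOMAIN="hadr_domain"

# Run on every node, (P) and (S): exchange security credentials
# so that the nodes trust each other.
prepare_nodes() {
    preprpnode $NODES
}

# Run on one node only, (P): define the peer domain, then bring it online.
create_domain() {
    mkrpdomain "$DOMAIN" $NODES && startrpdomain "$DOMAIN"
}

# Run on either node: list the domain and its member nodes.
# Both should report OpState "Online" once startup completes.
verify_domain() {
    lsrpdomain
    lsrpnode
}
```

Call prepare_nodes on both nodes first, then create_domain on the primary node, and finally verify_domain on either node. The domain can take a short while to move to Online after it is started.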
(P)(S) $ ls
db2.def hadr_monitor.ksh mkdb2
db2ip.def hadr_start.ksh mkhadr
hadr.def hadr_stop.ksh rmdb2
If the files do not exist, follow the instructions in Appendix C. The definition files and
mk scripts must be customized manually to suit your environment.
Note: You will need the network equivalency only if you use a Service IP address to
connect to the HADR database on the primary instance.
(P) $ ./mkdb2
(P)(S) $ lssam
$ lssam
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
Note: If the HADR database already exists and is in Peer state, the mkdb2 script will
deactivate the database.
Now that you have created the primary and standby instances (db2inst), you need to
create a database (for example, named HADRDB) that you will replicate with HADR.
All DB2 commands must be run from the db2cmd command prompt. To open the db2cmd
window, select Start > All Programs > IBM DB2 > DB2COPY > Command Line
Tools > Command Window.
Alternatively, run the db2cmd command at a Windows command prompt to open the
db2cmd window.
For non-default instances, verify that the current instance is correct (i.e., db2inst in our
example):
1. As user1, the local user, make sure that the database manager instance is started.
2. As primary instance owner (e.g., db2inst), create the database (e.g., HADRDB) that
you will later make highly available with HADR as follows.
3. Change the default Circular Logging of the database (e.g., HADRDB) to Archive
Logging by issuing the following command:
6. Now we must create a backup copy of the primary database (e.g., HADRDB) that will
later be restored on the standby instance, and act as the standby database of the
HADR pair. The backup image will be written to the current directory (e.g.,
C:\Program Files\IBM\SQLLIB\BIN):
7. Transfer the backup image of the primary database (e.g., HADRDB) to the standby
host machine.
8. As standby instance owner (e.g., db2inst), create the standby database on the
standby node (e.g., spock2) as follows:
9. Allow TCP/IP communication to both the primary and standby instances (e.g.,
db2inst) as follows. Important: Make sure that “db2j_DB2 55000/tcp” is in
the /etc/services (IBM Tivoli System Automation – Shell) file on the primary and
standby nodes before issuing the following command:
db2set DB2COMM=tcpip
10. As standby instance owner (e.g., db2inst), enable HADR on the standby database
(e.g., HADRDB) as follows:
11. As instance owner, verify that you have set the database configuration parameters
correctly in steps 5 and 10 above:
12. As standby instance owner (e.g., db2inst), start HADR on the standby node (e.g.,
spock2) as follows:
As primary instance owner (e.g., db2inst), start HADR on the primary node (e.g.,
spock1) as follows:
13. As instance owner (e.g., db2inst), verify that the HADR pair is in Peer state as
follows:
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Standby Peer Sync 0 0
ConnectStatus ConnectTime Timeout
Connected Wed Aug 25 10:06:04 2010 (1282745164) 120
PeerWindowEnd PeerWindow
Wed Aug 25 10:12:35 2010 (1282745555) 300
LocalHost LocalService
spock2 55555
You should see output similar to the following lines on the primary node (e.g.,
spock1):
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Primary Peer Sync 0 0
PeerWindowEnd PeerWindow
Wed Aug 25 10:12:35 2010 (1282745555) 300
LocalHost LocalService
spock1 55555
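The database steps above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the paper's exact command listing: the HADR port (55555), timeout (120), peer window (300), and synchronization mode (SYNC) are taken from the db2pd output above, LOGARCHMETH1 LOGRETAIN is assumed as the way to switch to archive logging, and the function names are hypothetical. The functions use the same shell syntax as the SA MP scripts; in a db2cmd window, issue the db2 commands inside each function directly.

```shell
#!/bin/sh
# Sketch only: database, host, and instance names follow the example topology.
DB=hadrdb
INSTANCE=db2inst
HADR_PORT=55555      # LocalService value shown by db2pd

# Primary node: create the database, switch to archive logging, and back it up.
# (Switching from circular logging puts the database in backup pending state,
# so the backup is required before the database can be used again.)
create_primary_db() {
    db2 create database "$DB" &&
    db2 update db cfg for "$DB" using LOGARCHMETH1 LOGRETAIN &&
    db2 backup database "$DB"
}

# Standby node: restore the transferred backup image from the current directory.
# The restored database stays in rollforward pending state, which HADR requires.
restore_standby_db() {
    db2 restore database "$DB"
}

# Either node: point the local database at its HADR partner.
# $1 = local host name, $2 = remote host name
configure_hadr() {
    db2 update db cfg for "$DB" using \
        HADR_LOCAL_HOST "$1" HADR_LOCAL_SVC "$HADR_PORT" \
        HADR_REMOTE_HOST "$2" HADR_REMOTE_SVC "$HADR_PORT" \
        HADR_REMOTE_INST "$INSTANCE" \
        HADR_SYNCMODE SYNC HADR_TIMEOUT 120 HADR_PEER_WINDOW 300
}

# Start the standby first, then the primary.
# $1 = role: standby or primary
start_hadr() {
    db2 start hadr on database "$DB" as "$1"
}
```

For the example topology, this would mean configure_hadr spock1 spock2 on the primary node and configure_hadr spock2 spock1 on the standby node, followed by start_hadr standby on spock2 and then start_hadr primary on spock1.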
In this step, we will enable SA MP to monitor and manage the HADR pair automatically.
We will do this by registering the HADR pair as a resource group with SA MP. Do not
manually issue DB2 “takeover” commands after registering HADR as a resource group
with SA MP.
Create HADR Resource Group and Resources
Before running this script, verify that the names in the definition files and mk scripts
are customized to your environment. Refer to Appendix C.
Additionally, verify that the system is configured to reboot after receiving a stop error
on a blue screen, with the message “*** Fatal System Error:…”. This is defined within
the “Startup and Recovery” dialog of the advanced section of the system properties. For
example, configure each node through “System Properties” > “Advanced” tab >
“Startup and Recovery” > “Settings”. The configuration change must be made
separately on both nodes.
From the “IBM Tivoli System Automation – Shell” window (e.g., Start > IBM Tivoli
System Automation – Shell), run the following commands:
(P) $ cd /usr/sbin/rsct/sapolicies/db2
(P) $ ./mkhadr
After the mkdb2 and mkhadr scripts run, verify that they completed successfully. Issue
the lssam command; the output should be similar to the following:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
5. Post Configuration Testing
Once the setup is complete, we can test our automated HADR environment. Issue the
lssam command, and observe the output displayed to the screen. You will see output
similar to this:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
Below is a brief description of the resources shown in the preceding sample output
and what they represent:
Member Resources:
db2_db2inst_spock1_0-rs (primary DB2 instance)
Member Resources:
db2_db2inst_spock2_0-rs (standby DB2 instance)
Member Resources:
db2_db2inst_db2inst_HADRDB-rs (HADR DB)
The resource groups mentioned above are created for both the HADR configurations
discussed in this paper. However, the created networks are different.
In the case of the single network HADR configuration setup, only the following
equivalencies are created by mkdb2:
Displaying Equivalencies:
virpubnic_spock1_spock2
In the following steps, we will go through simulating various failure scenarios, and see
how the preceding system configuration reacts to such failures. You can assume that the
system reaction to a failure scenario is identical for both HADR configurations, unless
otherwise mentioned.
Before continuing with this section, you must note some key points:
For all of the following test cases, it is assumed that hadrdb is primary on spock1, and all
of the instance resource groups are online.
The following lines explain the meaning of the states that you see after issuing the lssam
command:
Note: If the HADR group is locked (or suspended from automation), that likely means
HADR DB is not in Peer state.
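When the lssam output is checked often, the interesting facts can be pulled out with a short filter. The following is a minimal sketch, assuming plain-text lssam output in the layout shown in this paper (no terminal color codes); online_node and is_locked are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helpers read lssam output on standard input.

# Print the node(s) on which the given IBM.Application resource is Online.
# $1 = resource name, e.g. db2_db2inst_db2inst_HADRDB-rs
online_node() {
    awk -v prefix="IBM.Application:$1:" '
    {
        for (i = 2; i <= NF; i++) {
            # Per-node lines look like: <state> IBM.Application:<name>:<node>
            # Only a bare "Online" counts, not "Pending online".
            if (index($i, prefix) == 1 && $(i - 1) == "Online") {
                print substr($i, length(prefix) + 1)
            }
        }
    }'
}

# Succeed (exit status 0) if any resource carries Request=Lock, that is,
# the resource is locked (suspended from automation).
is_locked() {
    grep -q "Request=Lock"
}
```

For example, lssam | online_node db2_db2inst_db2inst_HADRDB-rs would print the node that currently hosts the primary HADR database in the example topology.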
6. Testing Topology Response to Common Failures
All commands are to be run from the “IBM Tivoli System Automation – Shell” window
unless specified otherwise (e.g., Start > IBM Tivoli System Automation – Shell).
2. As user1, verify that the primary database (e.g., HADRDB) has successfully failed
over to the standby node (e.g., spock2) as follows:
(P) $ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Offline IBM.ServiceIP:db2ip:spock1
'- Online IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
3. Return hadrdb back to being primary on spock1 as follows (ignore the “token”
message):
(P) % db2_kill
2. On the primary or standby node (e.g., spock2), issue the following command
repeatedly until you see output similar to what you saw the first time you ran the
command (i.e., db2_db2inst_spock1_0-rg is Online again):
(S) $ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
(S) db2_kill
2. On the standby node (e.g., spock2), issue lssam repeatedly until you see that
db2_db2inst_spock2_0-rg is Online again, and issue db2 activate db hadrdb on
the db2cmd command prompt to get the HADR database back to Peer state.
(S) $ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip Control=SuspendedPropagated
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Pending online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
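Rather than reissuing lssam by hand, the check can be wrapped in a polling loop. This is a minimal sketch, assuming plain-text lssam output in which a resource group line starts with its state (no terminal color codes); wait_rg_online and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: poll lssam until the given resource group is reported Online.
# $1 = resource group name, e.g. db2_db2inst_spock2_0-rg
# $2 = maximum number of attempts
# $3 = seconds to wait between attempts
wait_rg_online() {
    attempts=0
    while [ "$attempts" -lt "$2" ]; do
        # A group line looks like: Online IBM.ResourceGroup:<name> Nominal=Online
        if lssam | grep -q "^Online IBM.ResourceGroup:$1 "; then
            return 0
        fi
        attempts=$((attempts + 1))
        sleep "$3"
    done
    return 1
}
```

For example, wait_rg_online db2_db2inst_spock2_0-rg 30 10 would check every 10 seconds for up to about five minutes and return a nonzero status if the group never comes back online.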
From the “IBM Tivoli System Automation – Shell” window:
(S) $ lssam
2. Issue the following command and observe that the primary instance resource group
goes offline:
(P) $ lssam
3. Now bring the primary instance resource group back online by issuing the following
command:
4. Verify that the primary instance resource group is back online as follows:
(P) $ lssam
6.5 Testing Resource Group Failure: Standby Instance Resource Group
2. Issue the following command and observe that the standby instance resource group
goes offline:
(S) $ lssam
4. Verify that the standby instance resource group has been restarted successfully:
(S) $ lssam
(*) This step is not required if you implemented automatic HADR reintegration. For more
information, see Appendix D.
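The two resource group tests above come down to changing the nominal state of an instance resource group and watching the result with lssam. The following is a minimal sketch, assuming the SA MP chrg command is used to change the nominal state (the original command listing may differ); the function names are hypothetical, and the resource group names come from the lssam output in this paper.

```shell
#!/bin/sh
# Sketch only: resource group names follow the lssam output in this paper.

# Take an instance resource group offline by changing its nominal state.
# $1 = resource group, e.g. db2_db2inst_spock1_0-rg
rg_offline() {
    chrg -o offline "$1"
}

# Bring the resource group back online.
# $1 = resource group, e.g. db2_db2inst_spock1_0-rg
rg_online() {
    chrg -o online "$1"
}
```

After each call, issue lssam and confirm that the group reports the expected state before continuing with the next step.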
6.6 Testing Network Adapter Failure (e.g., Local Area Connection 1)
1. Pull the cable on the NIC that is currently hosting the primary instance’s IP address
(e.g., Local Area Connection 1). The system behavior will be identical to that
described in the “Node Failure” test that follows this one. Check the result from the
standby node. (Note: Allow enough time for the old standby node (spock2) to become
the new primary node):
(S) $ lssam
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Failed offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1 Node=Offline
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip Control=SuspendedPropagated
|- Failed offline IBM.ServiceIP:db2ip:spock1 Node=Offline
'- Online IBM.ServiceIP:db2ip:spock2
Failed offline IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Failed offline IBM.Application:db2_db2inst_spock1_0-rs:spock1 Node=Offline Binding=Unbindable
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Offline IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1 Node=Offline
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Primary Disconnected Sync 0 0
PeerWindowEnd PeerWindow
Null (0) 300
LocalHost LocalService
spock2 55555
2. Plug the network cable back into the old primary system (e.g., spock1). If the
database on the old primary machine is still activated, issue db2_kill before
attempting reintegration.
db2_kill
Important: If the old primary database is already active, deactivating it may lead to
later reintegration failure. You must stop the old primary instance using db2_kill or
by ending the db2sysc.exe process manually.
If reintegration of the HADR pair fails, the standby database may have to be re-
established via a backup image of the current primary database.
3. (*) Once the old primary instance is back Online, you can now re-establish the HADR
pair from the old primary machine (e.g., spock1):
From the db2cmd window, as the original primary instance owner, db2inst, issue:
4. The HADR pair should now be re-established (you can issue the lssam command to
check). To bring the primary database back to spock1, issue the following command
(ignore the "token" message):
5. Verify that the topology has returned to its original state before the cable was pulled:
(P) $ lssam
(*) This step is not required if you implemented automatic HADR reintegration. For more
information, see Appendix D.
For this test case to work, the script performs a takeover by force if the HADR pair
drops out of Peer state before SA MP can issue a failover to the standby database.
Important: If HADR is not operating in synchronous mode (mode = sync), the standby
database may take over as primary at a time when it is not in sync with the primary
database that failed. If this is the case, then later reintegration of the HADR pair
may fail, and the standby database may have to be re-established using a backup image
of the current primary database.
(*) Steps 2 - 6 are not required if you implemented automatic HADR reintegration. For
more information, see Appendix D.
(P) $ lssam
Output similar to the following lines should be seen:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
(S) $ lssam
After a few minutes, output similar to the following lines should be seen:
$ lssam
Pending online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Failed offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1 Node=Offline
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip Control=SuspendedPropagated
|- Failed offline IBM.ServiceIP:db2ip:spock1 Node=Offline
'- Online IBM.ServiceIP:db2ip:spock2
Failed offline IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Failed offline IBM.Application:db2_db2inst_spock1_0-rs:spock1 Node=Offline Binding=Unbindable
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Offline IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1 Node=Offline
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
Verify that the HADR database is now primary on node spock2 using the
db2pd -hadr -db hadrdb command.
From the db2cmd window:
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg
Primary Disconnected Sync 0 0
PeerWindowEnd PeerWindow
Null (0) 300
LocalHost LocalService
spock2 55555
4. Once the old primary machine (i.e., spock1) comes back online, lssam output will be
as follows:
$ lssam
Pending online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Request=Move Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip Control=SuspendedPropagated
|- Offline IBM.ServiceIP:db2ip:spock1
'- Online IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
5. Verify that the database is not active on the old primary machine (i.e., spock1):
Option -hadr requires -db <database> or -alldbs option and active database.
Important: If the old primary database is already active, deactivating it may lead to
later reintegration failure. You must stop the old primary instance using db2_kill or
by ending the db2sysc.exe process manually, and then proceed to step 6.
If reintegration of the HADR pair fails, the standby database may have to be re-
established using a backup image of the current primary database.
6. You can now re-establish the HADR pair from the old primary machine (i.e., spock1).
From the db2cmd window, as the original primary instance owner, db2inst, issue:
7. The HADR pair should now be re-established (you can issue the lssam command to
check). To bring the primary database back to spock1, issue the following command
(ignore the "token" message):
8. Verify that HADR has returned to its original state before the primary node failure:
(P) $ lssam
(*) Steps 2 - 4 are not required if you implemented automatic HADR reintegration. For
more information, see Appendix D.
(P) $ lssam
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Online IBM.ServiceIP:db2ip:spock1
'- Offline IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
2. Simulate a failure of the standby node (e.g., spock2) by rebooting or shutting down
Windows.
(P) $ lssam
After a few minutes, output similar to the following lines should be seen:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Failed offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2 Node=Offline
'- Online IBM.ServiceIP:db2ip Control=SuspendedPropagated
|- Online IBM.ServiceIP:db2ip:spock1
'- Failed offline IBM.ServiceIP:db2ip:spock2 Node=Offline
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Failed offline IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Failed offline IBM.Application:db2_db2inst_spock2_0-rs:spock2 Node=Offline Binding=Unbindable
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Offline IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2 Node=Offline
Verify that the HADR database is still primary on node spock1 using the
db2pd -hadr -db hadrdb command.
HADR Information:
Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
Primary Disconnected Sync 0 0
PeerWindowEnd PeerWindow
Mon Sep 13 16:52:27 2010 (1284421947) 300
LocalHost LocalService
spock1 55555
4. Once the standby machine (i.e., spock2) comes back online, you can re-establish the
HADR pair as follows:
5. Check that HADR has returned to its original state before the standby node failure:
(P) $ lssam
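The db2pd checks used in the tests above can also be scripted. This is a minimal sketch, assuming the db2pd output layout shown in this paper (a heading line that starts with "Role State", followed by a line of values) and that the db2pd output is piped or redirected into the helper; hadr_role_state and hadr_in_peer are hypothetical names.

```shell
#!/bin/sh
# Sketch only: reads "db2pd -hadr -db hadrdb" output on standard input and
# prints the HADR role and state, for example "Primary Peer".
hadr_role_state() {
    awk '$1 == "Role" && $2 == "State" {
        # The values are on the line that follows the column headings.
        if ((getline) > 0) print $1, $2
        exit
    }'
}

# Succeed (exit status 0) if the database has the expected role and is in
# Peer state.
# $1 = expected role: Primary or Standby
hadr_in_peer() {
    [ "$(hadr_role_state)" = "$1 Peer" ]
}
```

For example, db2pd -hadr -db hadrdb | hadr_in_peer Primary would succeed on spock1 once the HADR pair has been re-established and the database is primary there again.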
7. Conclusion
This paper guided you through the implementation of an automated failover solution for
the IBM® DB2® Enterprise Server Edition for Linux, UNIX, and Windows product.
The solution is based on a combination of the HADR feature and the IBM Tivoli® SA MP
product. The setup that is described in this paper focuses on the Windows operating
system and was updated to reflect enhancements that were made in DB2 9.7 Fix Pack 8.
Appendix A: Understanding How SA MP Works
IBM Tivoli System Automation for Multiplatforms (IBM Tivoli SA MP) provides a framework
to automatically manage the availability of what are known as resources.
For example, both a DB2 instance and an HADR database itself have start, stop, and
monitor commands. Therefore, SA MP scripts can be written to manage these
resources automatically. In fact, you can create the scripts by copying them from
Appendix C as user1 (root) after installing DB2 9.5:
(P)(S) # cd /usr/sbin/rsct/sapolicies/db2/
Finally, SA MP provides high availability (HA) for any resource group that it manages, by
restarting all of its resources if it fails. The resource group will be restarted on an
appropriate node in the currently online cluster domain. An appropriate node must
contain a copy of all of the resources that are defined in the failing resource group, to be
selected as a node to restart on.
The following examples show “dialogs” that would occur between SA MP nodes in Fig.
1, Typical HADR Environment with SA MP, in the event of various failures/user actions.
Note: For each dialog, assume that spock1 is the primary database node and spock2 is
the standby database node:
SA MP on spock2: OK then.
SA MP on spock2: OK then. Hey, standby instance, have hadrdb take over by force as
primary.
DB2: Done.
Appendix B: Troubleshooting Tips
1. Text editor:
While working under the Subsystem for UNIX-based Applications (SUA) environment,
you should always use a UNIX editor to modify the files (e.g., the “vi” editor, which comes
with the “Utilities and SDK for UNIX-based Application_X86” package for SUA).
Alternatively, if you need to work with files in a Windows environment, you must
convert them to adopt the UNIX convention for line endings. You can run the following
command to convert your text file from Windows to UNIX format:
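One portable way to do the conversion, assuming only the standard tr utility (this is not necessarily the command the authors used), is sketched below. to_unix is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: strip the carriage return that Windows editors add to the end
# of every line, rewriting the file in place through a temporary copy.
# $1 = file to convert, e.g. db2.def
to_unix() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example, to_unix db2.def rewrites db2.def with UNIX line endings. Because every carriage return is removed, use this only on plain text files such as the definition files and scripts.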
• Instance names for the primary and standby databases must be the same.
• The case and format for resource names must be consistent throughout. For
example, the host name must match the case of the value returned by the ‘hostname’
command. For database names, the common practice is to use all capital letters.
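The naming rules above, combined with the conventions listed in Appendix C, can be
sketched as two small helpers. The function names db2_rs_name and hadr_rs_name are
hypothetical; only the name patterns come from this paper.

```shell
# db2_rs_name INSTANCE HOST -> db2_<instance>_<host>_0-rs
# HOST is used exactly as given, so pass the value that 'hostname' returns.
db2_rs_name() {
    printf 'db2_%s_%s_0-rs\n' "$1" "$2"
}

# hadr_rs_name INSTANCE DBNAME -> db2_<instance>_<instance>_<DBNAME>-rs
# The database name is converted to capital letters.
hadr_rs_name() {
    db_upper=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
    printf 'db2_%s_%s_%s-rs\n' "$1" "$1" "$db_upper"
}
```

Generating names this way, for example with `db2_rs_name db2inst "$(hostname)"`, keeps
the case consistent between the definition files and the setup scripts.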
If the default DB2 install path is not used, or it does not exist in the PATH
environment variable, you must change the PATH variable at the beginning of each
HADR script (hadr_monitor.ksh, hadr_start.ksh, and hadr_stop.ksh). For example, change:
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sbin/rsct/bin:"/dev/fs/C/Program Files/IBM/SQLLIB/BIN":/dev/fs/C/WINDOWS/system32
To:
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sbin/rsct/bin:"/dev/fs/D/Program Files/IBM/SQLLIB_TEST/BIN":/dev/fs/C/WINDOWS/system32
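Before editing the scripts, it can help to confirm that the directory you intend to add
actually exists. The following sketch uses a hypothetical helper, prepend_db2_bin, that
adds a directory to PATH only if it is present and not already listed; it is an
illustration, not part of the shipped scripts.

```shell
# prepend_db2_bin DIR: put DIR at the front of PATH if it exists
# and is not already listed. Returns 1 if DIR does not exist.
prepend_db2_bin() {
    if [ ! -d "$1" ]; then
        echo "no such directory: $1" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}
```

For the non-default path used in the example above, the call would be
`prepend_db2_bin "/dev/fs/D/Program Files/IBM/SQLLIB_TEST/BIN"`.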
5. Resources not created after running mkdb2:
Always verify the output of the setup script run. Confirm that the data input files are
updated correctly. Here are some common errors you may see from the mkdb2 run:
Problem 1:
Solution 1:
Verify that the NIC name is correct. The NIC names can be found by issuing the
following in an “IBM Tivoli System Automation – Shell” window:
lsrsrc IBM.NetworkInterface Name NodeNameList
This error can be ignored if you will not be defining a Service IP address for the
HADR database.
Problem 2:
Solution 2:
Verify that the “PersistentResourceAttributes” header exists. This error may
also be caused by Windows line-ending characters in the db2.def file (e.g., if the file
was previously edited using a Windows editor). Using “vi”, you will see ^M
characters appended at the end of each line. See tip #1 for resolution.
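Both causes named in Solution 2 can be checked before running the setup script. The
sketch below uses a hypothetical helper, check_def_file, and assumes that grep and
printf are available in the SUA shell.

```shell
# check_def_file FILE
#   returns 0 if FILE has the header and no carriage returns
#   returns 1 if the PersistentResourceAttributes:: header is missing
#   returns 2 if FILE contains Windows line endings
check_def_file() {
    cr=$(printf '\r')
    if ! grep -q '^PersistentResourceAttributes::' "$1"; then
        echo "$1: PersistentResourceAttributes:: header not found" >&2
        return 1
    fi
    if grep -q "$cr" "$1"; then
        echo "$1: contains Windows line endings" >&2
        return 2
    fi
    return 0
}
```

Running `check_def_file db2.def` before mkdb2 gives an early warning for either problem.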
Appendix C: Resources and Resource Groups Setup Scripts
In the "IBM Tivoli System Automation - Shell" window, copy the sample files
to the /usr/sbin/rsct/sapolicies/db2 directory. Create this directory if it
does not already exist. Alternatively, from Windows Explorer, copy the files
to this default location: C:\WINDOWS\SUA\usr\sbin\rsct\sapolicies\db2
Definition files
******************************************************************************
* db2.def Sample data input file for a DB2 instance resource
* The following attributes should be modified:
* Name, StartCommand, StopCommand, MonitorCommand, NodeNameList,
* UserName. In addition, the convention below must be followed:
*
* Name db2_<instance>_<host>_0-rs
* *Command /usr/sbin/rsct/bin/samservice -<x> <service_name>
*
* <instance> is the name of the database instance
* <host> is the primary hostname for resource 1, and the
* standby hostname for resource 2
* <service_name> is the name of the DB2 service that
* appears in the Windows Services console. Note that
* although your default instance name is "DB2", its
* service name may be "DB2-0". Verify this under:
* Start > Administrative Tools > Services
*
* db2ip.def Sample data input file for a virtual IP resource
* The following attributes should be modified:
* Name, NodeNameList, IPAddress, NetMask.
*
* hadr.def Sample data input file for an HADR database resource
* The following attributes should be modified:
* Name, StartCommand, StopCommand, MonitorCommand, UserName,
* NodeNameList. In addition, the convention below must be followed:
*
* Name db2_<instance>_<instance>_<DBname>-rs
* *Command /usr/sbin/rsct/sapolicies/db2/hadr_<x>.ksh <instance>
* <instance> <DBname>
*
* <instance> is the name of the database instance
* <DBname> is the HADR database name, capitalized
*
******************************************************************************
$ cat db2.def
PersistentResourceAttributes::
resource 1:
Name = "db2_db2inst_spock1_0-rs"
StartCommand = "/usr/sbin/rsct/bin/samservice -s db2inst"
StopCommand = "/usr/sbin/rsct/bin/samservice -p db2inst"
MonitorCommand = "/usr/sbin/rsct/bin/samservice -m db2inst"
MonitorCommandPeriod = 30
MonitorCommandTimeout = 180
NodeNameList = {'spock1'}
StartCommandTimeout = 300
StopCommandTimeout = 300
UserName = "user1"
RunCommandsSync = 1
ResourceType = 0
resource 2:
Name = "db2_db2inst_spock2_0-rs"
StartCommand = "/usr/sbin/rsct/bin/samservice -s db2inst"
StopCommand = "/usr/sbin/rsct/bin/samservice -p db2inst"
MonitorCommand = "/usr/sbin/rsct/bin/samservice -m db2inst"
MonitorCommandPeriod = 30
MonitorCommandTimeout = 180
NodeNameList = {'spock2'}
StartCommandTimeout = 300
StopCommandTimeout = 300
UserName = "user1"
RunCommandsSync = 1
ResourceType = 0
$ cat db2ip.def
PersistentResourceAttributes::
resource 1:
Name = "db2ip"
NodeNameList = {'spock1','spock2'}
IPAddress = '9.26.124.34'
NetMask = '255.255.254.0'
ResourceType = 1
ProtectionMode = 1
$ cat hadr.def
PersistentResourceAttributes::
resource 1:
Name = "db2_db2inst_db2inst_HADRDB-rs"
ResourceType = 1
StartCommand = "/usr/sbin/rsct/sapolicies/db2/hadr_start.ksh db2inst db2inst HADRDB"
StopCommand = "/usr/sbin/rsct/sapolicies/db2/hadr_stop.ksh db2inst db2inst HADRDB"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/hadr_monitor.ksh db2inst db2inst HADRDB"
MonitorCommandPeriod = 20
MonitorCommandTimeout = 120
StartCommandTimeout = 300
StopCommandTimeout = 15
UserName = "user1"
RunCommandsSync = 1
ProtectionMode = 0
NodeNameList = {"spock1","spock2"}
Resource and resource group setup scripts
******************************************************************************
* mkdb2 Creates a DB2 resource based on db2.def
* mkhadr Creates an HADR database resource based on hadr.def, and
* Service IP resource based on db2ip.def
* rmdb2 Removes any clustered resources and removes the cluster domain
*
* These scripts should be modified to suit your environment.
* Here are a few entries that require customization:
*
* DBNAME HADR database name, capitalized
* HADRINSTANCE DB2 instance name (default is "DB2")
* HOST1 / HOST2 The primary and standby hostname
* NIC1 / NIC2 The name of the NIC resource from primary and
* standby instance. These names can be obtained by
* running SA MP command below:
* "lsrsrc IBM.NetworkInterface Name NodeNameList"
* Note: This is only required if you use a Service
* IP address to connect to the HADR database
* TIEBREAKER network address that serves as a tiebreaker
*
******************************************************************************
$ cat mkdb2
#!/bin/ksh
set -x
HOST1="spock1"
HOST2="spock2"
HADRINSTANCE="db2inst"
NIC1="B13D316B-A16B-4D34-84D3-10B0D743436B"
NIC2="78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F"
#################################################
# Starting setup...
sleep 30
sleep 30
lsrg -m
lsrsrc IBM.Application Name NodeNameList OpState
date
lssam
date
sleep 10
sleep 120
date
lssam
date
$ cat mkhadr
#!/bin/ksh
set -x
HADRINSTANCE="db2inst"
DBNAME="HADRDB"
TIEBREAKER="9.26.124.30"
#######################################################
RGNAME=db2_${HADRINSTANCE}_${HADRINSTANCE}_${DBNAME}-rg
RSNAME=db2_${HADRINSTANCE}_${HADRINSTANCE}_${DBNAME}-rs
# Starting setup...
sleep 60
sleep 60
echo "Making relationships ..."
lsrsrc IBM.TieBreaker
lsrsrc -c IBM.PeerNode
chrsrc -c IBM.PeerNode OpQuorumTieBreaker="mynetworktb"
chrsrc -c IBM.PeerNode CritRsrcProtMethod=1
lsrsrc -c IBM.PeerNode
$ cat rmdb2
#!/bin/ksh
set -x
HOST1="spock1"
HOST2="spock2"
HADRINSTANCE="db2inst"
DBNAME="HADRDB"
DOMAINNAME="hadr_domain"
#########################################################
# Computing names...
RGNAME1=db2_${HADRINSTANCE}_${HOST1}_0-rg
RGNAME2=db2_${HADRINSTANCE}_${HOST2}_0-rg
DBRGNAME=db2_${HADRINSTANCE}_${HADRINSTANCE}_${DBNAME}-rg
# Removing setup...
sleep 10
sleep 10
lsequ
lsrsrc IBM.Application Name NodeNameList OpState
lsrsrc IBM.ServiceIP Name NodeNameList OpState
lssam
lsrpdomain
stoprpdomain -f ${DOMAINNAME}
sleep 30
rmrpdomain -f ${DOMAINNAME}
sleep 30
lsrpnode
Resource management scripts
******************************************************************************
*
* hadr_monitor.ksh Monitors an HADR database resource and checks for its
* status
* hadr_start.ksh Starts up the HADR database resource
* hadr_stop.ksh Stops an HADR database resource
*
******************************************************************************
$ cat hadr_monitor.ksh
#!/bin/ksh -p
#-----------------------------------------------------------------------
# (C) COPYRIGHT International Business Machines Corp. 2001-2010
# All Rights Reserved
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# NAME: hadr_monitor.ksh
#
# (%W%) %E% %U%
#
# FUNCTION: Probe for specified HADR pair alive
#
# INPUT: hadr_monitor.ksh db2instp db2insts db2hadrdb verbose [S]
#
# OUTPUT: see probehadr() function for description of return codes
#
#-----------------------------------------------------------------------
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sbin/rsct/bin:"/dev/fs/C/Program Files/IBM/SQLLIB/BIN"
export PATH=$PATH:.
RESOURCE_METHOD=probe
PROGNAME=$(basename $0)
PROGPATH=$(dirname $0)
RT_BASEDIR=$PROGPATH
DB2HADRINSTANCE1=${1?}
DB2HADRINSTANCE2=${2?}
DB2HADRDBNAME=${3?}
VERBOSE=${4:-verbose}
monitor_as_standby=${5:-N}
PROBE=$PROGNAME
START=
STOP=
ACTIVATE=
SVC_PROBE=${RT_BASEDIR}/${PROBE:-hadr_monitor.ksh}
START_CMD=$RT_BASEDIR/${START:-hadr_start.ksh}
STOP_CMD=$RT_BASEDIR/${STOP:-hadr_stop.ksh}
export CT_MANAGEMENT_SCOPE=2
export DB2INSTANCE=${DB2HADRINSTANCE1?}
set +x
fi
grepFlags='-e'
CLUSTER_APP_LABEL=db2_${DB2HADRINSTANCE1}_${DB2HADRINSTANCE2}_${DB2HADRDBNAME?}
temp_snap=/tmp/.${PROGNAME?}_${DB2HADRDBNAME?}_$$
temp_snap2=/.${PROGNAME?}_${DB2HADRDBNAME?}_$$
TRAPSIGNALS="TERM KILL"
# trap 'rm -f $temp_snap ; exit $rc' 0 1 2 3 6 9 15
###########################################################
# resource_group_state_on_this_node()
#
# INPUT: ResourceClass, Resource, ResourceType, Node
# OUTPUT:rg_state [Online=1 || Offline]
#
###########################################################
resource_group_state_on_this_node()
{
OpState=$( lsrsrc-api -s ${ResourceClass}::'Name="'${Resource}'" '"${ResourceType}"' && NodeNameList={"'${Node}'"} '::OpState 2> /dev/null)
rg_state=${OpState?}
}
###########################################################
# set_candidate_P_instance()
#
# INPUT: DB2HADRINSTANCE1, DB2HADRINSTANCE2
# OUTPUT:candidate_P_instance
#
###########################################################
set_candidate_P_instance()
{
Node=$(hostname | tr "." " " | awk '{print $1}')
RSNAME1=db2_${DB2HADRINSTANCE1?}_
RSNAME2=db2_${DB2HADRINSTANCE2?}_
Resource=${RSNAME1?}
ResourceClass=IBM.Application
NodeRG1long=$(lsrsrc-api -s ${ResourceClass?}::'Name like "'${Resource?}%'"'::NodeNameList | tr "{" " " | tr "}" " " | tr "." " " | awk '{print $1}' | tail -1)
NodeRG1=$(echo $NodeRG1long | tr "." " " | awk '{print $1}')
Resource=${RSNAME2?}
ResourceClass=IBM.Application
NodeRG2long=$(lsrsrc-api -s ${ResourceClass?}::'Name like "'${Resource?}%'"'::NodeNameList | tr "{" " " | tr "}" " " | tr "." " " | awk '{print $1}' | tail -1)
NodeRG2=$(echo $NodeRG2long | tr "." " " | awk '{print $1}')
if [[ $NodeRG1 == $Node && $NodeRG2 != $Node ]]; then
candidate_P_instance=$DB2HADRINSTANCE1
elif [[ $NodeRG1 != $Node && $NodeRG2 == $Node ]]; then
candidate_P_instance=$DB2HADRINSTANCE2
else
# either both instances are up in resource groups
# on this node, or neither is
# figure out the one that wants to be primary
rm -rf log.txt
db2clpex.exe -c DB2 -z log.txt -tv "get db cfg for ${DB2HADRDBNAME?}"
HADR_LOCAL_HOST1=`cat log.txt | grep -e HADR_LOCAL_HOST | awk '{print $7}'`
HADR_LOCAL_HOST2=HADR_LOCAL_HOST2
rm -rf log.txt
###########################################################
# probehadr_on_this_node()
#
# Probe status of HADR on the active resource on this node
# Return 4 if specified DB did not activate and activating
# Return 0 if specified DB2 HADR is unknown
# Return 1 if specified DB2 HADR is online as Primary,Peer on this node
# Return 3 if specified DB2 HADR is online as Primary,non-Peer on this node
# Return 2 if specified DB2 HADR is online as Standby,Peer on this node
# Return 40 if specified DB2 HADR is online as Standby,Not Peer on this node
#
###########################################################
probehadr_on_this_node()
{
# Use db2pd instead of SNAPSHOT if possible ...
db2pd.exe -hadr -db ${DB2HADRDBNAME?} \
| grep ${grepFlags?} "Sync " \
| grep "[a-zA-Z]" \
| awk '{print $1 "\n" $2}' > $temp_snap
if [ -r $temp_snap ]; then
hadr_role=$(head -1 $temp_snap)
hadr_state=$(tail -1 $temp_snap)
fi
rm -f $temp_snap
logger -i -p notice -t $0 "$DB2HADRDBNAME ($hadr_role, $hadr_state)"
# Primary non-Peer
logger -i -p notice -t $0 "$DB2HADRDBNAME ($hadr_role, $hadr_state). Locking $DATA_SVC_R_NAME"
rgmbrreq -o Lock IBM.Application:$DATA_SVC_R_NAME 2> /dev/null
if [[ "$monitor_as_standby" == "N" ]]; then
# Return online
rc=1
else
# Return Primary Non Peer distinctly
rc=3
fi
fi
else
rc=0
fi
else
logger -i -p notice -t $0 "$DATA_SVC_R_NAME state is not known, $DATA_SVC_R_NAME is locked, returning Offline: $lockreq"
rc=2
fi
else
rc=50
fi
fi
return $rc
}
###########################################################
# probehadr()
#
# Probe status of HADR on the active resource on this node
#
###########################################################
probehadr()
{
if [[ "$VERBOSE" == "verbose" ]]; then
typeset -ft $(typeset +f)
set -x
fi
unset instance_to_monitor
set_candidate_P_instance
if [[ -z "$candidate_P_instance" ]]; then
# If we cannot find instance to monitor, return Unknown
logger -i -p err -t $0 "Cannot find instance name!"
exit 0
fi
instance_to_monitor=$candidate_P_instance
probehadr_on_this_node
}
#######################################################
# main()
#######################################################
main()
{
if [[ "$VERBOSE" == "verbose" ]]; then
typeset -ft $(typeset +f)
set -x
fi
probehadr
if [ -f /tmp/.virtual_offline_${CLUSTER_APP_LABEL?} ]; then
# Offline
rc=2
logger -i -p info -t $0 "'CLUSTER_APP_LABEL' ${CLUSTER_APP_LABEL?}"
fi
main "${@:-}"
echo $rc
exit $rc
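The return codes documented in the probehadr_on_this_node() comment block, together
with the monitor_as_standby check in the visible part of the function, can be
summarized as a small lookup. This is a sketch for reading the listing above, not a
replacement for it; the function name hadr_probe_rc is hypothetical, and the role and
state strings are assumed to be the first two fields that the script extracts from the
db2pd output.

```shell
# hadr_probe_rc ROLE STATE [MONITOR_AS_STANDBY]
# Prints the monitor return code for a role/state pair:
#   Primary, Peer      -> 1
#   Primary, non-Peer  -> 1, or 3 when MONITOR_AS_STANDBY is not "N"
#   Standby, Peer      -> 2
#   Standby, non-Peer  -> 40
#   anything else      -> 0 (unknown)
hadr_probe_rc() {
    role=${1:-}
    state=${2:-}
    as_standby=${3:-N}
    case "$role/$state" in
        Primary/Peer) echo 1 ;;
        Primary/*)
            if [ "$as_standby" = "N" ]; then echo 1; else echo 3; fi ;;
        Standby/Peer) echo 2 ;;
        Standby/*)    echo 40 ;;
        *)            echo 0 ;;
    esac
}
```

The start script relies on the same codes: it issues a forced takeover when the probe
result is 2 or 40 and the old primary machine is offline.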
$ cat hadr_start.ksh
#!/bin/ksh -p
#-----------------------------------------------------------------------
# (C) COPYRIGHT International Business Machines Corp. 2001-2010
# All Rights Reserved
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# NAME: hadr_start.ksh
#
# (%W%) %E% %U%
#
# FUNCTION: Start DB2 UDB, for specified RESOURCE_NAME and RESOURCE_GROUP
#
# INPUT: hadr_start.ksh db2instp db2insts db2hadrdb verbose [S]
#
# OUTPUT: 0 if started ok
#
#-----------------------------------------------------------------------
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sbin/rsct/bin:"/dev/fs/C/Program Files/IBM/SQLLIB/BIN":/dev/fs/C/WINDOWS/system32
export PATH=$PATH:.
logger -i -p info -t $0 "$*"
RESOURCE_METHOD=start
PROGNAME=$(basename $0)
PROGPATH=$(dirname $0)
DB2HADRINSTANCE1=${1?}
DB2HADRINSTANCE2=${2?}
DB2HADRDBNAME=${3?}
VERBOSE=${4:-verbose}
start_as_standby=${5:-N}
export CT_MANAGEMENT_SCOPE=2
export DB2INSTANCE=${DB2HADRINSTANCE1?}
PROBE=
START=$PROGNAME
STOP=
RT_BASEDIR=${PROGPATH?}
SVC_PROBE=${RT_BASEDIR}/${PROBE:-hadr_monitor.ksh}
START_CMD=$RT_BASEDIR/${START:-hadr_start.ksh}
STOP_CMD=$RT_BASEDIR/${STOP:-hadr_stop.ksh}
Resource=${RSNAME1?}
ResourceClass=IBM.Application
NodeRG1=$(lsrsrc-api -s ${ResourceClass?}::'Name like "'${Resource?}%'"'::NodeNameList | tr "{" " " | tr "}" " " | tr "." " " | awk '{print $1}' | tail -1)
Resource=${RSNAME2?}
ResourceClass=IBM.Application
NodeRG2=$(lsrsrc-api -s ${ResourceClass?}::'Name like "'${Resource?}%'"'::NodeNameList | tr "{" " " | tr "}" " " | tr "." " " | awk '{print $1}' | tail -1)
else
Resource=db2_${DB2HADRINSTANCE1?}_
ResourceClass=IBM.Application
RGNAME1=db2_${DB2HADRINSTANCE1?}_${NodeRG1?}_0-rg
RSNAME1=db2_${DB2HADRINSTANCE1?}_${NodeRG1?}_0-rs
RGNAME2=db2_${DB2HADRINSTANCE2?}_${NodeRG2?}_0-rg
RSNAME2=db2_${DB2HADRINSTANCE2?}_${NodeRG2?}_0-rs
fi
CLUSTER_APP_LABEL=db2_${DB2HADRINSTANCE1}_${DB2HADRINSTANCE2}_${DB2HADRDBNAME?}
##########################################################
# error_exit
##########################################################
error_exit()
{
typeset exit_code="${rc:-1}"
logger -i -p err -t $0 "exiting with $exit_code"
exit $exit_code
}
###########################################################
# resource_group_state_on_this_node()
#
# INPUT: ResourceClass, Resource, ResourceType, Node
# OUTPUT:rg_state [Online=1 || Offline]
#
###########################################################
resource_group_state_on_this_node()
{
lsrsrc-api -s ${ResourceClass}::'Name="'${Resource}'" '"${ResourceType}"' '::NodeNameList
###########################################################
# HADR_partner_node_state()
#
# INPUT: HADR_REMOTE_HOST [hostname]
# OUTPUT:remote_node_alive [Online || Offline]
#
###########################################################
HADR_partner_node_state()
{
remote_node_alive=Online
###########################################################
# set_candidate_P_instance()
#
# INPUT: RSNAME1, RSNAME2, DB2HADRINSTANCE1, DB2HADRINSTANCE2
# OUTPUT:candidate_P_instance
#
###########################################################
set_candidate_P_instance()
{
Node=$(hostname | tr "." " " | awk '{print $1}')
rm -rf log.txt
db2clpex.exe -c DB2 -z log.txt -tv "get db cfg for ${DB2HADRDBNAME?}"
HADR_LOCAL_HOST1=`cat log.txt | grep HADR_LOCAL_HOST | awk '{print $7}'`
HADR_LOCAL_HOST2=`cat log.txt | grep HADR_LOCAL_HOST | awk '{print $7}'`
rm -rf log.txt
HADR_REMOTE_HOST=
fi
fi
}
###########################################################
# starthadr()
###########################################################
starthadr()
{
set_candidate_P_instance
instance_to_start=${candidate_P_instance}
HADR_partner_node_state
else
# Old primary machine is offline
if [ $rc -eq 2 ]; then
# Standby is currently in Peer State
#
# To bring up standby, will now do a TAKEOVER BY FORCE
# No need to block until resource group is offline, we have verified
# that the node is down already
# To bring up standby, will now do a TAKEOVER BY FORCE
:
logger -i -p notice -t $0 "db2 takeover hadr on db ${DB2HADRDBNAME?} by force"
db2clpex.exe -c DB2 -z log.txt -tv "takeover hadr on db ${DB2HADRDBNAME?} by force"
logger -i -p notice -t $0 "NOTICE: Takeover by force issued"
elif [ $rc -eq 40 ]; then
# Standby is currently not in Peer State
:
logger -i -p err -t $0 "*** Database ${DB2HADRDBNAME} is not in Peer State, old Primary Offline"
logger -i -p notice -t $0 "db2 takeover hadr on db ${DB2HADRDBNAME?} by force"
db2clpex.exe -c DB2 -z log.txt -tv "takeover hadr on db ${DB2HADRDBNAME?} by force"
logger -i -p notice -t $0 "NOTICE: Takeover by force issued"
fi
fi # Bring up HADR on this machine
# Return state
sleep 30
return $rc
}
#######################################################
# main()
#######################################################
main()
{
rm -f /tmp/.virtual_offline_${CLUSTER_APP_LABEL?}
starthadr
logger -i -p info -t $0 "$CLUSTER_APP_LABEL returning $rc"
return $rc
}
#
if [[ "$VERBOSE" == "verbose" ]]; then
typeset -ft $(typeset +f)
fi
main "${@:-}"
echo $rc
exit $rc
$ cat hadr_stop.ksh
#!/bin/ksh -p
#-----------------------------------------------------------------------
# (C) COPYRIGHT International Business Machines Corp. 2001-2010
# All Rights Reserved
#
# US Government Users Restricted Rights - Use, duplication or
# disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
#
# NAME: hadr_stop.ksh
#
# (%W%) %E% %U%
#
# FUNCTION: Stop HADR for specified HADR pair
#
# INPUT: hadr_stop.ksh db2instp db2insts db2hadrdb verbose [S]
#
# OUTPUT: 0 if service stopped
#
#-----------------------------------------------------------------------
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/sbin/rsct/bin:"/dev/fs/C/Program Files/IBM/SQLLIB/BIN"
export PATH=$PATH:.
RESOURCE_METHOD=stop
PROGNAME=$(basename $0)
PROGPATH=$(dirname $0)
RT_BASEDIR=$PROGPATH
DB2HADRINSTANCE1=${1?}
DB2HADRINSTANCE2=${2?}
DB2HADRDBNAME=${3?}
VERBOSE=${4:-verbose}
stop_standby_also=${5:-S}
VIRTUAL_STOP_ONLY="yes"
PROBE=
START=
STOP=$PROGNAME
SVC_PROBE=${RT_BASEDIR}/${PROBE:-hadr_monitor.ksh}
START_CMD=$RT_BASEDIR/${START:-hadr_start.ksh}
STOP_CMD=$RT_BASEDIR/${STOP:-hadr_stop.ksh}
export CT_MANAGEMENT_SCOPE=2
export DB2INSTANCE=${DB2HADRINSTANCE1?}
CLUSTER_APP_LABEL=db2_${DB2HADRINSTANCE1?}_${DB2HADRINSTANCE2?}_${DB2HADRDBNAME?}
##########################################################
# error_exit
##########################################################
error_exit()
{
typeset exit_code="${rc:-1}"
logger -i -p err -t $0 " exiting with $exit_code"
exit $exit_code
}
###########################################################
# resource_group_state_on_this_node()
#
# INPUT: ResourceClass, Resource, ResourceType, Node
# OUTPUT:rg_state [Online=1 || Offline]
#
###########################################################
resource_group_state_on_this_node()
{
OpState=$( lsrsrc-api -s ${ResourceClass}::'Name="'${Resource}'" '"${ResourceType}"' && NodeNameList={"'${Node}'"} '::OpState )
rg_state=${OpState?}
}
###########################################################
# set_candidate_P_instance()
#
# INPUT: RSNAME1, RSNAME2, DB2HADRINSTANCE1, DB2HADRINSTANCE2
# OUTPUT:candidate_P_instance
#
###########################################################
set_candidate_P_instance()
{
set -x
Node=$(hostname | tr "." " " | awk '{print $1}')
rm -rf log.txt
db2clpex.exe -c DB2 -z log.txt -tv "get db cfg for ${DB2HADRDBNAME?}"
HADR_LOCAL_HOST1=`cat log.txt | grep HADR_LOCAL_HOST | awk '{print $7}'`
HADR_LOCAL_HOST2=`cat log.txt | grep HADR_LOCAL_HOST | awk '{print $7}'`
rm -rf log.txt
forceRGOfflineInCaseOfByForce=
fi
fi
}
###########################################################
# stophadr()
#
# Stop HADR on instance that is running local to this node
###########################################################
stophadr()
{
set_candidate_P_instance
instance_to_stop=$candidate_P_instance
rc=0
return $rc
}
#######################################################
# main()
#######################################################
main()
{
touch /tmp/.virtual_offline_${CLUSTER_APP_LABEL?}
rc=$?
main "${@:-}"
echo $rc
exit $rc
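The three scripts above cooperate through a marker file: hadr_stop.ksh touches
/tmp/.virtual_offline_<label>, hadr_monitor.ksh reports Offline (2) while that file
exists, and hadr_start.ksh removes it. The following is a minimal sketch of that
mechanism in isolation. The function names and the MARKER_DIR variable are
hypothetical; only the file name pattern and the return code come from the listings.

```shell
# Directory that holds the marker files (the scripts above use /tmp).
MARKER_DIR="${MARKER_DIR:-/tmp}"

# marker_for LABEL: print the marker file path for a cluster application label
marker_for() {
    printf '%s/.virtual_offline_%s\n' "$MARKER_DIR" "$1"
}

# What the stop script does: leave a marker behind
virtual_stop() {
    touch "$(marker_for "$1")"
}

# What the start script does first: clear the marker
virtual_start() {
    rm -f "$(marker_for "$1")"
}

# effective_rc LABEL PROBE_RC: print 2 (Offline) while the marker exists,
# otherwise pass the probe result through unchanged
effective_rc() {
    if [ -f "$(marker_for "$1")" ]; then
        echo 2
    else
        echo "$2"
    fi
}
```

This shows why stopping the HADR resource makes it appear Offline to SA MP without
touching the database itself: the marker file overrides whatever the probe found.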
Appendix D: Automated HADR Reintegration
As of DB2 Version 9.7 Fix Pack 8, automatic HADR reintegration is supported. This feature
allows the HADR pair to automatically regain peer state upon an outage of either the
standby or primary server. The minimum operating system and software requirements are
as follows:
• Operating system: Windows Server 2008 R2 SP1 with Microsoft Hotfix KB2639164
• Tivoli product: IBM Tivoli System Automation for Multiplatforms Version 3.2.2.3
If the HADR primary and standby servers suffer a concurrent outage, you might have to
perform manual recovery to regain HADR peer state.
If the HADR pair fails to automatically recover from a situation where the primary and
standby servers suffered a concurrent outage, the lssam command output is similar to what
follows:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Request=Lock
|- Offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Offline IBM.ServiceIP:db2ip:spock1
'- Online IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
To recover from this situation, issue the following command from the standby server:
If you implemented automatic HADR reintegration, the standby server’s HADR resource
might appear as Failed offline in the lssam command output after an automated
failover takes place, as shown in the following example:
$ lssam
Online IBM.ResourceGroup:db2_db2inst_db2inst_HADRDB-rg Control=MemberInProblemState Nominal=Online
|- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs Control=MemberInProblemState
|- Failed offline IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock1
'- Online IBM.Application:db2_db2inst_db2inst_HADRDB-rs:spock2
'- Online IBM.ServiceIP:db2ip
|- Offline IBM.ServiceIP:db2ip:spock1
'- Online IBM.ServiceIP:db2ip:spock2
Online IBM.ResourceGroup:db2_db2inst_spock1_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock1_0-rs:spock1
Online IBM.ResourceGroup:db2_db2inst_spock2_0-rg Nominal=Online
'- Online IBM.Application:db2_db2inst_spock2_0-rs:spock2
Online IBM.Equivalency:virpubnic_spock1_spock2
|- Online IBM.NetworkInterface:B13D316B-A16B-4D34-84D3-10B0D743436B:spock1
'- Online IBM.NetworkInterface:78B5ACFD-0B8A-4AD8-9C6B-43B651541F1F:spock2
To recover from this situation, see “The resetrsrc command - A brief how-to guide”
(http://www.ibm.com/support/docview.wss?uid=swg21458938).
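To spot this condition without reading the whole tree, the lssam output can be filtered
for the Failed offline state. The sketch below uses a hypothetical helper,
failed_offline_resources, and assumes that each resource is printed on a single line, as
in the example output above.

```shell
# failed_offline_resources: read lssam output on standard input and print
# the name of every resource that is marked "Failed offline".
failed_offline_resources() {
    grep 'Failed offline' | sed 's/.*Failed offline[[:space:]]*//'
}
```

Typical use would be `lssam | failed_offline_resources`; an empty result means that no
resource is currently in the Failed offline state.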
© Copyright IBM Corporation 2012
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
The information in this document concerning non-IBM products was obtained from the supplier(s) of those products. IBM has not
tested such products and cannot confirm the accuracy of the performance, compatibility or any other claims related to non-IBM
products. Questions about the capabilities of non-IBM products should be addressed to the supplier(s) of those products.
The information contained in this publication is provided for informational purposes only. While efforts were made to verify the
completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind,
express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to
change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any
warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license
agreement governing the use of IBM software.
References in this publication to IBM products, programs, or services do not imply that they will be available in all countries in
which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s
sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or
feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying
that any activities undertaken by you will result in any specific sales, revenue growth, savings or other results.
IBM, the IBM logo, and ibm.com® are trademarks or registered trademarks of International Business Machines Corporation
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.