Best Practices For DB2 On AIX 6.1 For POWER Systems - Manual
Michael Kwok
Rakesh Dash Anupama Padmanabhan
Bernard Goelen Punit Shah
Vasfi Gucer Basker Shanmugam
Rajesh K Jeyapaul Sweta Singh
Sunil Kamath Amar Thakkar
Naveen Kumar Bharatha Adriana Zubiri
ibm.com/redbooks
International Technical Support Organization
April 2010
SG24-7821-00
Note: Before using this information and the product it supports, read the information in
“Notices” on page xix.
This edition applies to DB2 Version 9.5 and Version 9.7, AIX Version 6.1, and VIOS Version 2.1.2.
© Copyright International Business Machines Corporation 2010. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. These and other IBM trademarked
terms are marked on their first occurrence in this information with the appropriate symbol (® or ™),
indicating US registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other countries. A current
list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Snapshot, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and
other countries.
Java, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other
countries, or both.
Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication presents a best practices guide for DB2® and
InfoSphere™ Warehouse performance in an AIX® 6.1 with Power Systems™
virtualization environment. It covers Power hardware features such as
PowerVM™, multi-page support, Reliability, Availability, and Serviceability (RAS)
and how to best exploit them with DB2 LUW workloads for both transactional and
data warehousing systems.
The popularity and reach of DB2 and InfoSphere Warehouse has grown in recent
years. Enterprises are relying more on these products for their mission-critical
transactional and data warehousing workloads. It is critical that these products
be supported by an adequately planned infrastructure. This publication offers a
reference architecture to build a DB2 solution for transactional and data
warehousing workloads using the rich features offered by Power systems.
IBM Power Systems have been leading players in the server industry for
decades. Power Systems provide great performance while delivering reliability
and flexibility to the infrastructure.
Guido Somers
IBM Belgium
Doreen Stein
IBM Germany
Dan Braden, Grover C. Davidson, Brian Hart, James Hoy, Bart Jacob, Pete
Jordan, Jessica Scholz, Jaya Srikrishnan, Stephen M Tee, Brian Twichell, Scott
Vetter
IBM USA
David Hepple
IBM Ireland
The team would like to express special thanks to Robert Taraba from IBM USA
for providing the project resources and his support throughout the project.
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
Chapter 1. Introduction
This IBM Redbooks publication describes the best practices to configure a DB2
database server for data warehouse or OLTP environment on a Power System
architecture-based system running AIX 6.1.
Running the DB2 product on a Power System with the right blend of
virtualization features while meeting your business goals is a real challenge.
Virtualization allows you to reduce various costs (such as power, floor space,
and administration) by consolidating your servers. Sharing your system
resources, dynamically allocating resources without rebooting, and optimizing
processor utilization are key features that help you optimize the performance of
your DB2 product.
Moreover, this book is designed to help you understand the major differences
that exist when running your database on AIX 5L™ compared to running your
database on AIX 6.1. This new version of AIX introduces many new features,
including workload partitions, advanced security, continuous availability, and
managing and monitoring enhancements.
DB2 9.7 also offers a wide range of new features, including autonomics, data
compression, pureXML®, automatic storage, performance optimization and
security, reliability, and scalability. Most of these features are covered in this
book.
This chapter introduces the concepts that are detailed throughout the book and
contains the following sections:
› “Introduction to Power Systems” on page 3
› “Introduction to virtualization” on page 6
› “Introduction to AIX 6.1” on page 10
› “Introduction to DB2” on page 15
› “Introduction to PowerVM virtualization” on page 21
1.1 Introduction to Power Systems
When setting up a Power Systems environment, there are many options that you
must consider and plan properly to achieve optimal performance. This is
especially true for a database server, where the database system is the primary
application on the system. Database configuration considerations must therefore
be incorporated into the Power Systems planning. However, the Power Systems
setup tasks are usually performed by the system support team without the
database administrator's (DBA) involvement, or the DBA might not have expertise
in this area. The following list details the main points to work through when
configuring logical partitions on a Power System for your database server:
› Setting up and configuring logical partitions
– Choosing a partition type
– Configuring physical and virtual processors
– Configuring the memory
› Considerations for disk storage and choosing from various options
› Considerations for the network and choosing from various options
› Setting up the Virtual I/O Server (VIOS)
Power Architecture® refers both to POWER processors used in IBM servers and
to PowerPC® processors, which can be found in a variety of embedded systems
and desktops.
As of the writing of this book, the latest in the series of POWER processors is
the POWER6 processor. The POWER6 chip is a dual-core processor that runs at
speeds between 4 GHz and 5 GHz depending on the type and model of the
system. Figure 1-1 on page 5 shows the POWER6 architecture.
Figure 1-1 POWER6 processor architecture (dual cores with 32 MB L3 cache control and directory, fabric switch, memory controller interfaces, and I/O controller)
Virtualization can help you consolidate multiple systems, running multiple
environments and applications on a single system in separate, fully secured
environments, as though each were running on an isolated standalone server.
The emulation function allows you to make use of objects (such as virtual tape or
iSCSI devices) that appear to be real although no dedicated physical resource
exists. It helps with the compatibility and interoperability of your resources and
the flexibility of your environment.
Virtualization also offers the flexibility to maximize capacity, allowing you to move
your resources with DLPAR operations or Live Partition Mobility (LPM)
capabilities.
For example, suppose the business requests that a new application be deployed
on a new server. The problem is that you have no server available and need to
order one, which takes time, because you need approvals prior to placing your
order.
Note: Virtualization concerns the server, the storage, and the network.
On System p servers, mixed environments are supported: AIX 5L V5.2 ML2 and
AIX 5L V5.3 partitions with dedicated processors and adapters, and AIX 5L V5.3
partitions using micro-partitioning and virtual devices. AIX 5L V5.3 partitions can
use physical and virtual resources at the same time.
For System p servers configured with one of the PowerVM features, AIX 5L
Version 5.3 or later is required for micro-partitions, virtual I/O, and virtual LAN.
1.3.1 IBM AIX V6.1
AIX V6.1 is the latest version of the AIX operating system, which includes new
and improved capabilities for virtualization, security features, continuous
availability features, and manageability. AIX V6.1 is the first generally available
version of AIX V6.
The AIX OS is designed for the IBM Power System p, System i, System p5®,
System i5®, eServer p5, eServer pSeries®, and eServer i5 server product lines,
as well as IBM BladeCenter® blades based on Power Architecture technology
and IBM IntelliStation POWER workstations.
Table 1-1 provides a summary of the new AIX V6.1 features, grouped into four
categories: Virtualization, Security, Continuous Availability, and Manageability.
Each entry lists the feature, its functionality, and its benefits. The Security
entries include:
› Trusted AIX: extends the security capabilities of the AIX OS by integrating
compartmentalized, multilevel security (MLS) into the base operating system
to meet critical government and private industry security requirements.
Benefit: the highest level of security for critical government and business
workloads.
› Support for Long Pass Phrases: AIX 6.1 and AIX 5.3 Technology Level 7
support passwords greater than eight characters for authentication of users.
Benefit: improved security.
1.4.1 Autonomics
Although costs for server management and administration can be hard to
measure and might be less apparent than costs for servers, storage, and power,
they represent the largest percentage of total IT spending. Refer to Figure 1-4 to
see a breakdown of the various costs associated with administration.
1.4.2 pureXML
DB2 supports both relational and XML data, which can simplify development and
deployment of advanced new applications. DB2 pureXML eliminates much of the
work typically involved in the management of XML data, and serves XML data at
unmatched speeds. Applications can mix relational and XML data as business
needs dictate. DB2 9.7 adds end-to-end native XML support for both
transactional and data warehouse applications, opening new opportunities to
extract business value from XML data.
1.4.4 Performance
The DB2 Performance Optimization Feature gives insight and the ability to
optimize workload execution, which can be accomplished using a combination of
DB2 Workload Manager, DB2 Performance Expert, and DB2 Query Patroller.
This can help reduce hardware acquisition costs by optimizing the performance
of servers and postponing costly hardware upgrades. Figure 1-5 shows that
between January 1, 2003 and March 23, 2009, DB2 has held certain industry
benchmarks for more days than all other vendors combined.
For more information see the SQL compatibility enhancements document at the
following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.
luw.wn.doc/doc/c0054107.html
Moreover, PowerVM extends the base system functions of your server to include
the following capabilities:
› Micro-partitioning
Micro-partitioning enables a processor to be divided into units as small as a
tenth of a whole CPU, and it is available on POWER5 and later systems running
AIX 5.3 or later. It adds the flexibility to manage your processors efficiently,
allowing you to create up to ten times as many virtual partitions as there are
available processors.
› Live Partition Mobility
This capability enables you to move a running logical partition (known as
active partition mobility) from one physical server to another, without
interrupting application service (an interruption of only a few milliseconds) to
your clients. Inactive partition mobility is also possible for partitions that are
powered off. Live Partition Mobility improves the availability of your server by
eliminating planned outages and by balancing workloads during peaks.
› (Dynamic) logical partitioning (DLPAR)
A logical partition is a set of system resources, containing whole or partial
CPUs, memory, and physical or virtual I/O resources, all logically grouped into
one partition. The dynamic capability allows resources to be added or
removed on demand.
› Virtual Ethernet
Virtual Ethernet is managed by the Power Hypervisor and shares the internal
Power Hypervisor bandwidth.
› Shared Ethernet Adapter (SEA)
SEAs are logical adapters that bridge a physical Ethernet card with the virtual
Ethernet adapter, so that a logical partition can communicate with the outside
world.
› Shared-processor pools
Shared-processor logical partitions are allocated processor units from the
shared processor pool. The amount of processor capacity allocated to a
partition (its entitled capacity) can be as small as a tenth of a physical CPU
and as large as the capacity of the entire pool. The granularity for adding
processor units is 1/100 of a CPU. The shared processor pool can contain up
to the total number of physical CPUs in the server.
› N_Port ID Virtualization (NPIV)
NPIV adapters are 8 Gb Fibre Channel adapters that allow a client to see the
SAN device transparently through the VIOS. For example, the client logical
partition sees the hdisk as an MPIO DS5100/5300 Disk, instead of as a
Virtual SCSI Disk Drive.
PowerVM provides advanced virtualization capabilities to your Power System or
Power Blade server. It exists in editions as shown in Figure 1-7.
› Express Edition
The Express Edition allows you to use only one VIOS and a single LPAR,
managed by the Integrated Virtualization Manager (IVM).
› Standard Edition
This edition allows you to create as many as 10 LPARs per core and supports
multiple shared processor pools. See the following Web page for limits:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD103130
› Enterprise Edition
This edition covers all capabilities of the Express and Standard Editions, plus
the Live Partition Mobility and Active Memory™ Sharing capabilities.
All Power Systems servers can use standard virtualization functions or logical
partitioning (LPAR) technology by using either the Hardware Management
Console (HMC) or the Integrated Virtualization Manager (IVM). Figure 1-8 shows
the PowerVM virtualization architecture.
The HMC is a separate server that controls the building of logical partitions
(LPARs), enables starting and stopping logical partitions remotely through a
terminal console, and enables the dynamic functions of your Power System to
add, move, or remove hardware from your logical partitions (DLPAR) dynamically.
An HMC can manage multiple Power Systems. The HMC user interface is
Web-based and is the focal point for your server environment and the hardware.
Note: The HMC is only used with Power Systems servers, while the Integrated
Virtualization Manager (IVM) is only used for Power Blade servers.
Similar to the HMC, the IVM can be used to create and maintain logical
partitions. It is also integrated with the VIOSs. Nevertheless, the IVM can only
control one Power System.
Tip: Rather than configuring your system as a full system partition, you may
consider configuring a partition containing all the system resources. This gives
you more flexibility if you want to add an extra partition to your system.
Figure 1-10 on page 27 shows the terminology used when talking about
processor distribution for your LPARs.
Figure 1-10 Processor terminology
Attention: Upon activation, if a logical partition does not use 100% of its
dedicated processor resources, the unused processor resources are ceded to
the shared processor pool. This is known as donating dedicated CPU. When the
logical partition needs the donated CPU resource back, the Power Hypervisor
immediately reclaims it and gives it back to the donating logical partition.
› Shared CPU
With shared CPUs, physical CPUs are placed in a pool of processors and
shared between multiple logical partitions. The logical partition is assigned a
portion of that CPU pool, called a processing unit, which can be as small as a
tenth of a real CPU. The ability to divide a CPU into smaller units is known as
Micro-Partitioning technology. The granularity of a processing unit is a
hundredth of a CPU; an example of a shared processor unit allocation is
1.34 CPUs. In this mode, processing units that are not used are returned to
the shared processor pool for other LPARs that need additional CPU.
Tip: In shared CPU mode, the logical partition is guaranteed to receive its
processing entitlement whenever it needs it. The Power Hypervisor manages
the processing entitlements and allocates them to the requesting partitions.
› Active processors
Active processors are the usable physical processors in the system.
› Inactive processors
Inactive processors are those processors that can be activated on demand.
They are known as Capacity On Demand (COD) processors and need an
activation code to be converted into active processors.
› Deactivated processors
Deactivated processors are processors that have a hardware problem. If a
hardware problem occurs and COD processors are available, the broken
processor is swapped with a COD processor. If COD processors are not
available, ask an IBM customer engineer to replace the defective processor,
which can require downtime.
› Virtual processors
Virtual processors are abstractions of physical processors that are assigned
to logical partitions. The operating system uses the number of virtual
processors assigned to the logical partition to calculate the number of
operations that it can perform concurrently. Virtual processors are relevant to
micro-partitions: Micro-Partitioning maps virtual processors to physical
processors, and the virtual processors are assigned to the partitions instead
of the physical processors.
› Simultaneous multithreading (SMT)
Simultaneous multithreading (SMT) is a concept where multiple threads of
execution can run on the same processor at the same time. With SMT
enabled, each hardware thread is seen as a logical processor. SMT is not a
virtualization concept.
Tip: When SMT is not enabled, a virtual processor appears as a single logical
processor.
Note: Active Memory Sharing (AMS) requires the PowerVM Enterprise Edition key to be installed on your system.
› Dedicated memory
Dedicated memory is physical memory dedicated to an LPAR. It is reserved
for that partition and cannot be shared. A logical partition that uses dedicated
memory, known as a dedicated memory partition, only uses a fixed amount of
memory that it was assigned.
› Shared memory
Shared memory is that memory assigned to the shared memory pool and
shared among multiple logical partitions. A logical partition that uses shared
memory, known as a shared memory partition, allows an appropriate logical
memory size to be defined without requiring a corresponding amount of
physical memory to be allocated.
You must configure the logical partition as a micro-partition (shared processors)
if you want to use shared memory for it.
1.5.4 I/O virtualization
Virtual I/O is often referred to as a set of network and storage virtualization
features, such as:
› Virtual Ethernet
The Virtual Ethernet adapter allows logical partitions to communicate with
each other within the same Power System, using the Power Hypervisor’s
internal switch. They do not need any network hardware adapter or cables to
communicate.
› Shared Ethernet Adapter (SEA)
This logical adapter enables an LPAR to communicate with the outside world.
SEA adapters can be configured in a SEA fail-over mechanism to protect your
client logical partitions from VIOS network failure. Moreover, SEA is
mandatory for Live Partition Mobility use.
› Integrated Virtual Ethernet (IVE)/Host Ethernet Adapter (HEA)
IVE, or HEA, is a dual-port or quad-port Ethernet adapter, available in 1 Gbps
(dual or quad port) or 10 Gbps (quad port only). It enables an easy way to
manage the sharing of the integrated high-speed Ethernet adapter ports. The
integrated virtual Ethernet is directly connected to the GX+ bus instead of to a
PCIe or PCI-X bus, which provides the IVE card with high throughput and low
latency. It is available on p520 and p550 servers only. The IVE card is installed
at manufacturing and is not hot-swappable or hot-pluggable.
Figure 1-13 illustrates the difference between the use of SEA and IVE.
All virtual network and storage devices are identified by their slot number. Slot
numbers can aid your management when creating a naming convention,
especially in situations where many logical partitions co-exist and the
administration of those virtual resources becomes more and more complex.
Tip: In a dual-VIOS setup, try to keep the slot-number mapping consistent to
ease the management of your servers. For example, you can dedicate a
particular slot number range to your virtual networks and make it big enough to
allow for growth (for example, 11–30), while the virtual SCSI slot number range
runs from 31 to 50.
1.5.5 Useful links
The following links point to the Information Center documentation for the main
topics discussed in this chapter. Refer to them for more information about these
topics.
AIX 6.1
› http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp
› http://publib16.boulder.ibm.com/pseries/index.htm
› http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD103130
PowerVM
› http://www-03.ibm.com/systems/power/software/virtualization/index.html
› http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/installreadme.html
Power Systems
› http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp
› http://www-03.ibm.com/systems/power/hardware/
› http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/power6.html
DB2
› http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index
› http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.wn.doc/doc/c0054107.html
Chapter 2. AIX configuration
Figure 2-1 shows the computational and non computational memory occupancy
in real memory.
Figure 2-1 Computational memory (working segment, paging space) and non-computational permanent memory (persistent segment, file system/JFS)
Figure 2-2 shows the file system cache consumption in real memory.
Figure 2-2 The file system cache consumption as pointed out by numperm
The minperm and maxperm values define the range within which the file system
cache can grow. If computational pages need more memory, file cache pages are
stolen (page stealing) based on the repaging rate.
Figure 2-3 shows the page stealing based on minfree and maxfree values.
For example, the vmo -L command displays the current, default, and the boot
vmo parameter settings, as shown in Figure 2-4 on page 39.
Figure 2-4 Output of vmo -L
The vmo -a command displays the current values of the non-restricted tunables:
# vmo -a
cpu_scale_memp = 8
data_stagger_interval = 161
defps = 1
force_relalias_lite = 0
framesets = 2
htabscale = n/a
kernel_heap_psize = 4096
kernel_psize = 4096
large_page_heap_size = 0
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
lru_file_repage = 0
lru_poll_interval = 10
lrubucket = 131072
maxclient% = 90
maxfree = 5418
maxperm = 219580
maxperm% = 90
maxpin = 212018
maxpin% = 80
...........
Example 2-2 shows how to modify a tunable and make the change persistent
across reboots. The vmo -o command makes the change, and the -p option
makes it both effective immediately and persistent across reboots.
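As a hedged sketch of that syntax (using minfree purely as an illustration), a change that is applied now and preserved across reboots looks like this:
vmo -p -o minfree=4096
To stage a change for the next boot only, the -r option can be used instead:
vmo -r -o minfree=4096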
Note: Beginning with AIX Version 6.1, certain tunables are classified as
restricted tunables. They exist primarily for specialized intervention by IBM
support or development teams.
Because these parameters are not recommended for user modification, they are
no longer displayed by default, but only with the -F (force) option on the
command. For example, the command vmo -F -a lists all the tunables,
including the restricted ones.
minfree
When the size of the free list falls below this number, the VMM begins stealing
pages. It continues stealing pages until the size of the free list reaches maxfree.
If the free frame waits counter reported by vmstat -s increases over time, it is
advisable to increase the minfree value (a quick check is shown after the following list).
› Default value in AIX6.1: 960
› DB2 recommended value:
– 4096 for memory less than 8 GB
– 8192 for memory greater than 8 GB
› How to set: vmo -p -o minfree=4096
› Type: Dynamic (No reboot required)
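As a sketch of that quick check (not taken from the original text), the counter can be sampled over time with:
vmstat -s | grep "free frame waits"
If the number keeps growing between samples taken a few minutes apart, that is the signal described above for raising minfree.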
maxfree
The difference between minfree and maxfree must not exceed 1024, because a
larger gap increases the time needed to replenish the free list. The difference
must also be equal to or greater than the j2_maxPageReadAhead (JFS2) or
maxpgahead (JFS) I/O tunable value.
See 2.1.7, “Input and output tunable considerations” on page 63 for details about
this tunable. See Example 2-3.
› Default Value in AIX6.1: 1088
› DB2 Recommended value:
– minfree + 512
– if minfree is 4096, then maxfree=4608
› How to set: vmo -p -o maxfree=4608
› Type: Dynamic
maxclient%
This specifies the maximum physical memory usage for client pages caching. If
the percentage of real memory occupied by file pages is higher than this level,
the page-replacement algorithm steals only client pages. It is recommended to
set strict_maxclient to 1 along with this setting for DB2. This considers maxclient
as the hard limit and strictly steals client pages as soon as the limit is reached.
› Default value in AIX6.1: 90%
› DB2 recommended value: 90%
Note: This is a restricted tunable in AIX6.1. Hence no further change is
recommended.
maxperm%
If the percentage of real memory occupied by file pages rises above this level,
the page-replacement algorithm steals only file pages.
› Default value in AIX 6.1: 90%
› DB2 recommended value: 90%
minperm%
If the percentage of real memory occupied by file pages falls below this level, the
page-replacement algorithm steals both file and computational pages. Hence a
low value is recommended for DB2. AIX 6.1 sets this value to 3% by default,
compared to AIX 5L where the default is 20%.
› Default value in AIX 6.1: 3%
› DB2 recommended value: 3%
strict_maxclient
If the client page caching increases beyond the maxclient% value, setting this
parameter acts as a hard limit and triggers file pages to be stolen.
– Default value in AIX6.1: 1
– DB2 recommended value: default value
lru_file_repage
When the VMM needs memory, because free memory pages fall below minfree
or because maxclient% is reached with strict_maxclient set, the LRU daemon
decides whether to steal computational pages or file pages.
This determination is based on a number of parameters, but the key parameter
is lru_file_repage.
Setting this tunable to 0 indicates that only file pages are stolen when the
percentage of file pages is greater than the minperm value. This prevents
computational pages from being paged out. This is a restricted tunable in
AIX 6.1; hence, no further change is recommended.
› Default value in AIX6.1: 0
› DB2 recommended value: default value
Notes:
› AIX Version 6.1 allows the system to use up to 90% of its real memory for file
caching, but it favors computational pages over file pages as resident pages.
Setting lru_file_repage to 0 forces the page replacement algorithm to steal
computational pages only when the percentage of cached file pages is less
than the minperm% value. Hence, it might not be necessary to reduce the
maxclient and maxperm values to avoid paging, as was practiced with
AIX 5.3.
› In situations where lowering the maxperm and maxclient values appears to
improve performance, it is recommended to contact IBM Technical Support
for a complete solution. IBM does not support changes to restricted tunables
as the solution.
› Tunables such as lru_poll_interval and strict_maxperm are restricted
tunables and DB2 recommends the AIX 6.1 default values. No change is
suggested.
› The tunable page_steal_method is a restricted tunable. No change is
suggested. Its default value is 1, which steals pages from a linked list of
pages rather than by scanning memory.
Steps to monitor the memory
The following steps describe how to monitor the memory:
1. The svmon command can be used to monitor computational memory (such as
the stack and heap of a process) and persistent memory (such as the file
system cache). Computational memory is shown as work (working storage),
persistent memory is shown as pers, and paging space is shown as pgsp.
Example 2-4 shows the svmon -G output.
Because the maxperm and minperm values are restricted and not recommended
to change in AIX 6.1, it is suggested that you monitor the memory usage as
described in the steps that follow:
1. Use lsps -a to understand the paging space consumption over a period of
time. See Example 2-5.
2. Use the vmstat command to monitor paging activity (the pi and po columns)
over an interval, for example:
# vmstat 1
3. Use the vmstat -v command to understand the file cache usage by the
system, as shown in Example 2-7.
Example 2-7 vmstat -v | grep num
vmstat -v | grep num
17.6 numperm percentage
17.6 numclient percentage
Note:
› If DB2 is driving the growth of the file cache, determine whether any table
spaces are set up with file system caching, and decide whether file system
caching can be disabled for them.
› If a non-DB2 file system or process is causing the file system cache to
grow and causing paging that leads to low performance, rectify it.
› To determine which table spaces have file system caching enabled, use the
db2pd -db <database name> -tablespaces command and check the FSC
column. The possible values are ON or OFF (see the sketch following this note).
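As a hedged illustration of that check (sample is a hypothetical database name; the database must be active), run:
db2pd -db sample -tablespaces
In the Tablespace Configuration section of the output, the FSC column shows the file system caching setting for each table space.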
Monitoring helps you understand the memory activity in the system. Because AIX
does not recommend changing restricted values such as minperm and
maxperm, contact IBM technical support for further assistance if required. In
summary, the best practice for DB2 regarding the AIX VMM settings is as follows:
› minfree=4096 for physical memory <8 GB and 8192 for memory > 8 GB
› maxfree=minfree+512
› maxperm%=90 (default)
› maxclient%=90(default)
› strict_maxclient=1(default)
› minperm%=3(default)
› lru_file_repage=0(default)
Except for minfree and maxfree, no modification is required, and changing these
two values does not require a reboot of the system.
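As a consolidated sketch of the only change this adds up to (the values assume a partition with more than 8 GB of memory; adjust per the list above), both tunables can be set in one invocation:
vmo -p -o minfree=8192 -o maxfree=8704
The -p option applies the change immediately and records it so that it survives a reboot; all of the other values in the list are already the AIX 6.1 defaults.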
Note: VMM only supports dynamically varying the page size of working
storage memory. Working storage memory comprises process stack, data and
shared library text segment in AIX address space.
To make use of the 16 MB page size, you have to specify the amount of physical
memory that you want to allocate to back large pages. The default is not to have
any memory allocated to the large page physical memory pool. It can be
specified using the vmo command. The following example allocates 1 GB to the
large page physical memory pool (a reboot is required after running the
command to make the changes effective):
vmo -r -o lgpg_regions=64 -o lgpg_size=16777216
Note: The tprof -a -y <process> -O all command can also be used for
large page analysis.
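For the DB2 side of the picture, a commonly documented companion step (shown here as a sketch; db2inst1 is a hypothetical instance owner) is to allow the instance owner to pin large pages and to tell DB2 to use them for database shared memory:
chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE db2inst1
db2set DB2_LARGE_PAGE_MEM=DB
The registry change takes effect the next time the instance is restarted.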
2.1.5 Paging space considerations for DB2
We recommend that DB2 be tuned in a way that does not use paging space
regularly. Periodic paging space usage monitoring is recommended. The
following recommendations are for the paging space considerations:
› The general recommendation is to have the paging space be twice the size of
the physical memory.
Example: For a RAM size of 128 GB it is recommended to have 256 GB
paging space.
› Keep a minimal default paging space in rootvg (smaller than the RAM size,
but not less than 512 MB), and then place the additional (multiple) paging
spaces on alternate disks, each not exceeding 64 GB (a creation sketch
follows this list).
› Use multiple paging spaces, each allocated from a separate physical volume.
More than one paging space on a disk is not recommended.
› Create the secondary paging space on physical volumes that are more lightly
loaded than the physical volume in rootvg. If minimum inter-disk policy is
chosen, moving the paging space to another disk in the same volume group is
fine.
› The secondary paging spaces must all be of the same size to ensure that the
round-robin algorithm can work effectively.
› It is better to have the paging space on fast local disks. If it is configured from
SAN, make sure to avoid the SAN failure points. Some of the considerations
from SAN are detailed in the following list:
– It is recommended to use a separate LUN for paging space.
– If LUN is configured through Virtual I/O Server (VIOS), consider providing
a dual VIOS with multipath (MPIO) implementation.
– It is recommended to mirror the disks at the AIX level using the mirrorvg
command.
– It is recommended to use striped and mirrored disks.
– It is recommended to use either RAID5 or RAID10 configurations at SAN.
– Ensure FC adapter redundancy to avoid any failure.
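As a sketch of how a secondary paging space might be created and activated (pagingvg and hdisk4 are hypothetical names, and 64 logical partitions of 64 MB give a 4 GB paging space on a typical volume group):
mkps -a -n -s 64 pagingvg hdisk4
lsps -a
The -a flag marks the paging space for activation at every restart, -n activates it immediately, and lsps -a confirms the result.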
Paging activity probable causes
The following issues are probable causes for paging activity:
› Free list reaches 0
Monitor the status of the free list that the AIX VMM maintains through the
vmstat command (Example 2-7 on page 45). If the free list, represented as
fre in the vmstat output, reaches 0 due to a burst of memory demand, the
system might page regardless of memory usage.
In such situations it is recommended to increase the minfree value. Always
ensure that the difference between minfree and maxfree does not exceed
1024.
› If active virtual memory exceeds the real memory, then paging happens.
Monitor the growth of active virtual memory from the vmstat output. See
Example 2-9.
# lsps -a
Page Space   Physical Volume   Volume Group   Size   %Used   Active   Auto   Type   Chksum
From the DB2 perspective, the network tunables are modified for effective
bandwidth use and with security in mind.
AIX no command
To display the user modifiable network tunables, use the no -a command, as
shown in Example 2-11.
The no -F -a command can be used to display both the non-restricted and the
restricted tunables.
To modify a tunable and make the modification permanent, use the no -p -o
<tunable>=<value> command, as shown in Example 2-12.
To make the modification effective only from the next reboot, use the no -r -o
<tunable>=<value> command, as shown in Example 2-13.
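As a sketch of the two forms, using one of the tunables discussed below as the illustration:
no -p -o clean_partial_conns=1
no -r -o clean_partial_conns=1
The -p form changes the value now and preserves it across reboots; the -r form only records it for the next reboot.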
clean_partial_conns
This parameter is recommended to be enabled to mitigate SYN attacks (in which
the client never acknowledges the server's SYN/ACK packets). Because a SYN
attack fills up the listen queue backlog (denial of service), removing partial
connections makes room for new, non-attack connections.
› Default value: 0
› DB2 recommended value: 1
› How to set: no -p -o clean_partial_conns=1
› Type: Dynamic
This prevents the socket listen queue from being filled up with partial
connections from incomplete TCP/IP three-way handshakes.
ip6srcrouteforward
This option specifies whether or not the system forwards source-routed IPv6
packets. Source routing is a technique where the sender of a packet can specify
the route that a packet must take through the network.
› Default value: 1
› DB2 recommended value: 0
› How to set: no -p -o ip6srcrouteforward=0
› Type: Dynamic
A value of 0 causes all source-routed packets that are not at their destinations to
be discarded, preventing attacks through source routing.
ipignoreredirects
This parameter is used to control ICMP redirects. The ICMP redirect message
is used to notify a remote host to send data packets via an alternative route.
› Default value: 0
› DB2 recommended value: 1
› How to set: no -p -o ipignoreredirects=1
› Type: Dynamic
Setting this parameter to 1 ensures that malicious ICMP request cannot be used
to create manipulated routes.
ipqmaxlen
This specifies the number of received packets that can be queued on the IP
protocol input queue. Examine ipintrq overflows using netstat -s (See
Example 2-14), and consider increasing this value as recommended.
ipsendredirects
This is to specify whether the kernel must send redirect signals. Disable this
parameter to prevent illegal access through source routing.
› Default value: 1
› DB2 recommended value: 0
› How to set: no -p -o ipsendredirects=0
› Type: Dynamic
ipsrcrouterecv
This parameter specifies whether the system accepts source routed packets.
Source routing is a technique where the sender of a packet can specify the route
that a packet must take through the network. The default value of 0 causes all
source-routed packets destined for this system to be discarded. A value of 1
allows source routed packets to be received.
› Default value: 0
› DB2 recommended value: 1
› How to set: no -p -o ipsrcrouterecv=1
› Type: Dynamic
Setting this parameter to 1 ensures that source-routed packets destined for this system are received.
rfc1323
This tunable enables the TCP enhancements suggested in RFC 1323, which
maintain high performance and reliability over high-speed paths through an
improved TCP window scale option. The window scale option defined in
RFC 1323 increases the maximum TCP window size from 64 KB to 1 GB. DB2
recommends enabling this parameter for more efficient use of high-bandwidth
networks.
› Default value: 0
› DB2 recommended value: 1
› How to set:
– ifconfig en0 rfc1323 1 (Interface specific, dynamic, no reboot required)
– no -p -o rfc1323=1 (system specific)
› Type: Connect
Note: The rfc1323 network option can also be set on a per interface basis
through the ifconfig command. It can be verified using ifconfig -a.
tcp_nagle_limit
This tunable controls the consolidation of many small packets into fewer larger
ones (the Nagle algorithm). By default, it is set to 65535, the maximum size of an
IP packet in AIX. When using DB2, we recommend that you disable the
consolidation by setting the value to 1, which ensures that AIX does not try to
consolidate (and therefore delay) packets.
› Default value: 65535
› DB2 recommended value:1
› How to set: no -p -o tcp_nagle_limit=1
› Type: Dynamic
A 40-byte TCP header is sent even when only 1 byte of data crosses the wire,
which is a large overhead; however, disabling consolidation avoids delaying
DB2's small packets.
tcp_nodelayack
Enabling this parameter causes TCP to send immediate acknowledgement (ACK)
packets to the sender. We recommend setting it to 1 (on).
› Default value: 0
› DB2 recommended value: 1
› How to set: no -p -o tcp_nodelayack=1
› Type: Dynamic
Delaying the ACK affects real-time data transfer.
Enabling this option causes slightly more system overhead, but it can result in
much higher performance for network transfers if the sender is waiting on the
receiver's acknowledgement.
tcp_recvspace
The tcp_recvspace tunable specifies how many bytes of data the receiving
system can buffer in the kernel on the receiving socket queue. The attribute must
specify a socket buffer size less than or equal to the setting of the sb_max
network attribute.
› Default value: 16384
› DB2 recommended value: 262144
› How to set:
– ifconfig en0 tcp_recvspace 262144 (interface specific, dynamic, no
reboot required)
– no -p -o tcp_recvspace=262144
› Type: Connect
With RFC1323 enabled, the recommended value ensures high data transfer rate.
Higher value might increase the workload on the adapter. Also, it decreases the
memory space for data in RAM. Monitor using vmstat for memory usage and
paging.
tcp_sendspace
The tcp_sendspace tunable specifies how much data the sending application can
buffer in the kernel before the application is blocked on a send call.
› Default value: 16384
› DB2 recommended value: 262144
› How to set:
– ifconfig en0 tcp_sendspace 262144 (interface specific, dynamic, no
reboot required)
– no -p -o tcp_sendspace= 262144
› Type: Connect
With rfc1323 enabled, the recommended value ensures a high data transfer rate.
Higher values might increase the workload on the adapter. Also, it decreases the
memory space for data in RAM. Monitor using vmstat for memory usage and
paging.
Note:
› AIX 6.1 enables use_isno by default, and it is a restricted tunable. Because
the interface-specific network options override the global network option
values, it is recommended to set rfc1323, tcp_sendspace, and
tcp_recvspace per interface using ifconfig (a sketch follows this note).
› The application can override all of these with the setsockopt() subroutine.
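As a sketch of applying the three interface-specific options together (en0 is an assumed interface name), either approach below can be used; ifconfig changes are lost at reboot, whereas chdev stores them in the ODM:
ifconfig en0 rfc1323 1 tcp_sendspace 262144 tcp_recvspace 262144
chdev -l en0 -a rfc1323=1 -a tcp_sendspace=262144 -a tcp_recvspace=262144
Verify the active values with ifconfig en0 or lsattr -El en0.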
tcp_tcpsecure
This option protects TCP connections from the following three vulnerabilities:
› The first vulnerability involves the sending of a fake SYN to an established
connection to abort the connection. A tcp_tcpsecure value of 1 provides
protection from this vulnerability.
› The second vulnerability involves the sending of a fake RST to an established
connection to abort the connection. A tcp_tcpsecure value of 2 provides
protection from this vulnerability.
› The third vulnerability involves injecting fake data in an established TCP
connection. A tcp_tcpsecure value of 4 provides protection from this
vulnerability.
Values of 3, 5, 6, or 7 protect the connection from combinations of these three
vulnerabilities.
› Default value: 0
› DB2 recommended value: 5
› How to set: no -p -o tcp_tcpsecure=5
› Type: Dynamic
jumbo frames
With the advent of Gigabit Ethernet, the TCP/IP protocol now provides an ability
to send large frames (called jumbo frames). An Ethernet frame contains 1500
bytes of user data, plus its headers and trailer. By contrast, a jumbo frame
contains 9000 bytes of user data, so the percentage of overhead for the headers
and trailer is much less and data-transfer rates can be much higher.
Note: Jumbo frames can be used with EtherChannel. If set to yes, the setting
forces jumbo frames on all underlying adapters. Virtual Ethernet adapters do not
have a jumbo frames attribute, but they can send jumbo-sized packets.
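Enabling jumbo frames is an adapter-level and network-wide decision (every switch port and partner in the path must support an MTU of 9000). As a sketch (ent0 and en0 are assumed device names, and the interface typically must be detached first, for example with ifconfig en0 detach, or the -P flag used to defer the change to the next restart):
chdev -l ent0 -a jumbo_frames=yes
chdev -l en0 -a mtu=9000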
Monitoring the network performance
You can use the following features to monitor the network performance:
› Kernel memory use by network packets:
Kernel memory buffers called mbufs are used to store data in the kernel for
incoming and outbound network traffic. Incorrect configuration affects both
network and system performance.
mbuf is controlled by:
– “thewall” network variable. No tuning is required as the system
automatically tunes the value ranging from 0-64GB. Default value is 1 GB.
– “maxmbuf” system variable defines the maximum real memory allowed for
MBUFS. If this value is “0”, it indicates that the “thewall” value is used. If
If maxmbuf is greater than 0, it overrides the “thewall” value. The following
command gives the “maxmbuf” value:
# lsattr -E -l sys0 | grep maxmbuf
maxmbuf 0 Maximum Kbytes of real memory allowed for MBUFS
True
– “maxmbuf” value can be changed using chdev.
chdev -l sys0 -a maxmbuf=1048576
Monitor the network memory usage using the following command (shown here as one line):
echo "kmbucket -s" | kdb | egrep "thewall|allocated" | tr "." " " | awk '/thewall/ {thewall=$2} /allocated/ {allocated=$2} END{ print "ibase=16;scale=5;",allocated,"/400/",thewall}' | bc
If the output shows .02063, it means 2.063% of mbuf has been allocated.
The threshold is indicated by sockthresh, which is 85% by default. After
the allocated memory reaches 85% of the thewall or maxmbuf value, a
new socket connection fails with ENOBUFS, until the buffer usage drops
below 85%
Monitor for any mbuf requirement failures using the netstat -m command.
See Example 2-16.
# netstat -m
In this example:
– By size: Shows the size of the buffer.
– inuse: Shows the number of buffers of that size in use.
– failed: Shows how many allocation requests failed because no buffers
were available.
Note:
› You should not see a large number of failed calls. There might be a few,
which trigger the system to allocate more buffers as the buffer pool size
grows. If the number of failed mbuf requests is high over a period of time,
consider increasing the thewall value:
no -o thewall=<newvalue>
› After the thewall value is increased, use vmstat to monitor total memory
use, paging activity to understand whether the increase has any negative
impact on overall memory performance.
› Kernel memory buffers (mbufs) that handle data packets are pinned kernel
memory in RAM and are never paged out. Increasing the mbuf memory
decreases the system memory available for data segments.
› Monitoring the network statistics using netstat:
The netstat command can be used to determine the amount of traffic in the
network.
Use netstat -an -f inet and check columns Recv-Q and Send-Q to see
the use of the send/recvspace buffers. This also gives you the number of
connections. See Example 2-17.
# netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 a.c.bb.1c.eb.b 27493958 0 3947570 0 0
en0 1500 9.184.65 indus65 27493958 0 3947570 0 0
If the number of errors during input packets is greater than 1% of the total
number of input packets, that is, Ierrs > 0.01 x Ipkts, run the netstat -m
command to check for a lack of memory.
Note: The nmon command can be used to understand the network statistics.
Run nmon and type n to view the details.
2.1.7 Input and output tunable considerations
Input/output tunable parameters are managed through the ioo command.
Example 2-19 Output of the command, ioo -a, to display the current value
# ioo -a
aio_active = 0
aio_maxreqs = 65536
aio_maxservers = 30
aio_minservers = 3
aio_server_inactivity = 300
j2_atimeUpdateSymlink = 0
Similar to the VMM and network tunables, the I/O tunables also include a set of
restricted tunables, which can be viewed with the ioo -F -a command.
To modify a tunable and make the modification permanent, use the ioo -p -o
<tunable>=<value> command, as shown in Example 2-20, which sets
lvm_bufcnt to 10.
To make the modification take effect at the next boot, use the ioo -r -o
<tunable>=<value> command, as shown in Example 2-21.
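As a sketch of the two forms, using the lvm_bufcnt tunable mentioned above as the illustration:
ioo -p -o lvm_bufcnt=10
ioo -r -o lvm_bufcnt=10
The first form applies the change immediately and keeps it across reboots; the second only records it for the next boot.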
j2_maxPageReadAhead
This parameter defines the number of pages that needs to be read ahead during
the sequential file read operation on JFS2. This parameter is taken into
consideration while setting the minfree/maxfree value, as discussed in 2.1.3,
“VMM considerations for DB2” on page 40.
› Default value: 128
› DB2 recommended value: 128
j2_maxRandomWrite
This parameter specifies a threshold for random writes to accumulate in RAM
before subsequent pages are flushed to disk by the JFS2 write-behind
algorithm. The random write-behind threshold is on a per-file basis.
For DB2, we recommend that you use the default value. The default value of “0”
disables random write-behind and indicates that random writes stay in RAM until
a sync operation. If vmstat output shows page-out and I/O wait peaks at regular
intervals (usually when the sync daemon is writing pages to disk), it can be
useful to set this value to 1 or higher, because too much I/O occurs when syncd
runs.
› Default value: 0
› DB2 recommended value: 0
› How to set: ioo -p -o j2_maxRandomWrite=<value>
› Type: Dynamic
j2_minPageReadAhead
This parameter specifies the minimum number of pages to be read ahead when
processing a sequentially accessed file on Enhanced JFS. This is useful for
sequential access. A value of 0 might be useful if the I/O pattern is purely
random. For DB2, we recommend that you use the default value, which is 2.
› Default value: 2
› DB2 recommended value: 2
maxpgahead
This parameter specifies the maximum number of pages to be read ahead when
processing a sequentially accessed file. This is a restricted tunable in AIX 6.1.
For DB2, we recommend that you use the default value, which is 8.
› Default value: 8
› DB2 recommended value: 8
minpgahead
This parameter specifies the number of pages with which sequential read-ahead
starts. This is a restricted tunable in AIX6.1. For DB2, we recommend that you
use the default value, which is 2.
› Default value: 2
› DB2 recommended value: 2
maxrandwrt
This parameter specifies a threshold (in 4 KB pages) for random writes to
accumulate in RAM before subsequent pages are flushed to disk by the
write-behind algorithm. A value of 0 disables random write-behind and indicates
that random writes stay in RAM until a sync operation. Setting maxrandwrt
ensures these writes get flushed to disk before the sync operation has to occur.
› Default value: 0
› DB2 recommended value: 0
› How to set: ioo -p -o maxrandwrt=<threshold value>
› Type: Dynamic
If vmstat n shows page out and I/O wait peaks on regular intervals (usually when
the sync daemon is writing pages to disk), set it to a threshold level to reduce the
high I/O. However, this can degrade performance because the file is being
flushed each time after the threshold value is reached. Tune this option to favor
interactive response time over throughput. After the threshold is reached, all
subsequent pages are immediately flushed to disk.
j2_nBufferPerPagerDevice
This parameter specifies the number of file system bufstructs for JFS2. This is a
restricted tunable. For DB2, we recommend that you use the default value, which is 512.
› Default value: 512
› DB2 recommended value: 512
numfsbufs
This parameter specifies the number of file system bufstructs. This is a restricted
tunable. For DB2, we recommend that you use the default value, which is 196.
› Default value: 196
› DB2 recommended value: 196
Note: If you make any changes to the file system bufstructs tunables
(j2_nBufferPerPagerDevice and numfsbufs), the new values take effect only
when the file system is remounted.
j2_nPagesPerWriteBehindCluster
JFS file systems and JFS2 file systems are partitioned into 16 KB partitions or 4
pages. Each of these partitions is called a cluster. In JFS2, this parameter
specifies the number of pages per cluster processed by JFS2's write behind
algorithm.
› Default value: 32
› DB2 recommended value: 32
This is useful to increase if there is a need to keep more pages in RAM before
scheduling them for I/O when the I/O pattern is sequential. It might be
appropriate to increase if striped logical volumes or disk arrays are being used.
pv_min_pbuf
This specifies the minimum number of pbufs per physical volume (PV) that the
LVM uses. This is a global value that applies to all VGs on the system. The lvmo
used to set a value for a particular VG. In this case, the higher of the two values is
used for this particular VG.
› AIX default: 512
› DB2 recommended value: 512
sync_release_ilock
This parameter specifies whether the i-node lock is to be held or not while
flushing I/O to file when the sync daemon is running.
If set, flush all I/O to a file without holding the i-node lock, and use the i-node lock
to do the commit.
A value of 0 indicates off and that the i-node lock is held while all dirty pages of a
file are flushed. I/O to a file is blocked when the syncd daemon is running.
This is a restricted tunable in AIX. For DB2, we recommend that you use the
default value, which is 0.
› Default value: 0
› DB2 recommended value: 0
Note: The j2_ parameters are not applicable for DB2 table spaces configured
with the NO FILE SYSTEM CACHING clause, where file system caching is not
involved.
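Related to this, a table space can be switched to bypass the file system cache. As a sketch (tbsp_data is a hypothetical table space name, and the new attribute typically takes effect the next time the database is activated):
db2 "ALTER TABLESPACE tbsp_data NO FILE SYSTEM CACHING"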
Asynchronous IO (AIO) consideration for DB2
Asynchronous Input Output (AIO) is a software subsystem within AIX that allows
a process to issue an I/O operation and continue processing without waiting for
the I/O to finish.
Asynchronous I/O operations run in the background and do not block user applications. This improves performance because I/O operations and application processing run simultaneously. Applications such as databases and file servers take advantage of the ability to overlap processing and I/O.
AIX provides two AIO subsystems, legacy AIO and POSIX AIO. The difference between them is parameter passing; both subsystems can run concurrently in the system.
From AIX 6.1 onwards, no AIO servers are started by default. They are started when applications initiate AIO requests, and they stay active as long as they are servicing requests.
› When AIO is enabled, the minimum number of AIO servers is created, based on the minservers value.
› If additional servers are required, more AIO servers are added up to the maximum, based on the maxservers value.
› The maximum number of outstanding requests is defined by maxreqs.
› These are all ioo tunables. Changes do not require a system reboot.
Note: In AIX 6.1, all AIO parameters are ioo command tunables. The aioo command used in the previous AIX 5.3 release has been removed.
From the DB2 perspective, because the AIO tuning is dynamic, no tuning is
required.
› Default AIO settings of minserver, maxserver, maxreqs are recommended. No
change is required.
› AIO fast path (aio_fastpath) is also enabled by default, so no changes are
required. This is a restricted tunable, as well.
› AIO CIO fastpath (aio_fsfastpath) is also enabled by default and so, no
changes are required. This is a restricted tunable, as well.
Note: AIO can be used with file systems mounted with DIO/CIO. DB2 uses
DIO/CIO when the table space is used with NO FILE SYSTEM CACHING.
AIO monitoring and tuning suggestions
The following list describes AIO monitoring and tuning suggestions:
› To find the number of active AIO servers, use the pstat -a | grep -c aioserver command.
› To get the general AIO report, use the iostat command:
iostat -AQ 1 5
This command displays output every second, five times. Monitoring must be done over a period of time to understand the state of the AIO servers.
› To understand the state of each active AIO server, use the following command:
ps vg | egrep "aio|SIZE"
From the output in Example 2-22, we can identify the %CPU, %MEM, and size consumed by each AIO kernel process.
# ps vg | egrep "aio|SIZE"
PID TTY STAT TIME PGIN SIZE RSS LIM TSIZ TRS %CPU %MEM COMMAND
127048 -A 0:00 9 448 448 xx 0 0 0.0 0.0 aioserver
385024 -A 0:00 8 448 448 xx 0 0 0.0 0.0 aioserver
442584 -A 0:00 9 448 448 xx 0 0 0.0 0.0 aioserver
446682 -A 0:00 11 448 448 xx 0 0 0.0 0.0 aioserver
450780 -A 0:00 7 448 448 xx 0 0 0.0 0.0 aioserver
454878 -A 0:00 8 448 448 xx 0 0 0.0 0.0 aioserver
458976 -A 0:00 8 448 448 xx 0 0 0.0 0.0 aioserver
463074 -A 0:00 7 448 448 xx 0 0 0.0 0.0 aioserver
467172 -A 0:00 7 448 448 xx 0 0 0.0 0.0 aioserver
471270 -A 0:00 5 448 448 xx 0 0 0.0 0.0 aioserver
475368 -A 0:00 6 448 448 xx 0 0 0.0 0.0 aioserver
479466 -A 0:00 6 448 448 xx 0 0 0.0 0.0 aioserver
483564 -A 0:00 2 448 448 xx 0 0 0.0 0.0 aioserver
487662 -A 0:00 0 448 448 xx 0 0 0.0 0.0 aioserver
# ioo -o aio_maxservers=33
Setting aio_maxservers to 33
# ioo -a | grep aio
aio_active = 0
aio_maxreqs = 65536
aio_maxservers = 33
aio_minservers = 3
aio_server_inactivity = 300
In summary, for DB2 we do not recommend any changes to the default ioo tunable values in AIX 6.1. Consider any further modification only based on monitoring over a period of time, as described.
In general, we do not recommend making any changes with respect to the scheduler tunables for DB2 environments. From the AIX tunables perspective, if a performance degradation is still observed apart from the recommended modifications, submit the relevant performance data to IBM for analysis.
Note: There might be specific cases where changing the restricted values improves performance. In this situation, contact technical support for further analysis. As a general rule, do not change the restricted tunables from their defaults.
2.1.9 DB2 Groups, users, and password configuration on AIX
You can create the required groups and users and set their initial passwords using the commands in Example 2-24.
Example 2-24 Create required groups, users and set initial password
mkgroup id=999 db2iadm1
mkgroup id=998 db2fadm1
mkgroup id=997 dasadm1
mkuser id=1004 pgrp=db2iadm1 groups=db2iadm1 home=/db2home/db2inst1 db2inst1
mkuser id=1003 pgrp=db2fadm1 groups=db2fadm1 home=/db2home/db2fenc1 db2fenc1
mkuser id=1002 pgrp=dasadm1 groups=dasadm1 home=/db2home/dasusr1 dasusr1
passwd db2inst1
passwd db2fenc1
passwd dasusr1
Note: To enable long password support on the AIX 6.1, install APAR IZ35001.
Monitor the maximum number of processes under the DB2 instance ID using the ps -fu db2inst1 or db2_local_ps command and adjust the maxuproc value accordingly. The following example sets maxuproc to 4096:
chdev -l sys0 -a maxuproc=4096
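To display the current maxuproc setting before changing it, you can use the lsattr command:
lsattr -El sys0 -a maxuproc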
DB2_LOGGER_NON_BUFFERED_IO
› Default value: AUTOMATIC
› Recommended: Default
See 3.2.2, "Tablespace design" on page 92 to read more about buffered and non-buffered I/O.
Starting with Version 9.7, the default value for this variable is AUTOMATIC. When
set to AUTOMATIC, active log files are opened with DIO. This eliminates the
operating system overhead of caching database recovery logs. The database
manager determines which log files benefit from using non-buffered I/O.
When set to ON, all active log files are always opened with DIO. When set to OFF, all active log files use buffered I/O. In Version 9.5 Fix Pack 1 or later, the default was OFF.
Do not use any explicit mount options for the file systems used for DB2 log files.
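DB2 registry variables such as this one are set with the db2set command. For example (shown for illustration only; because AUTOMATIC is already the default in Version 9.7, no explicit setting is normally required):
db2set DB2_LOGGER_NON_BUFFERED_IO=AUTOMATIC
db2set -all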
DB2_USE_IOCP
› Default value: ON
› Recommended: Default
IOCP must be configured before this variable is enabled. This feature enables the use of AIX I/O completion ports (IOCP) to submit and collect asynchronous I/O (AIO) requests, and it enhances performance in a non-uniform memory access (NUMA) environment by avoiding remote memory access. It is also available in DB2 v9.5 starting from Fix Pack 3.
It is recommended to leave this parameter at its default. You can monitor the system I/O statistics (for example, with iostat or nmon) to fine-tune this parameter.
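To verify that IOCP is available, you can follow the approach described in the DB2 documentation for configuring IOCP on AIX (iocp0 is the default IOCP device instance):
lsdev -Cc iocp
If the iocp0 device is only Defined, use smitty iocp to set its state to Available, and then enable the registry variable:
db2set DB2_USE_IOCP=ON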
A separate feature, available from DB2 v9.7 Fix Pack 1 and DB2 v9.5 Fix Pack 5, reserves table space and speeds up the process of creating or altering large table spaces.
DB2_PARALLEL_IO
› Default value: NULL
› Recommended: Refer to Storage chapter to set an optimum value for this
registry variable.
2.2.1 DB2_Resource_policy
In the following sections we discuss more advanced DB2 registry variables.
DB2_RESOURCE_POLICY
› Default value: NULL
› Recommended: Default; define a policy only when resource control, such as resource set binding, is required
This variable defines a resource policy that can be used to limit which operating system resources are used by the DB2 database. It can contain rules for assigning specific operating system resources to specific DB2 database objects.
This registry variable can be used to limit the set of processors that the DB2
database system uses. The extent of resource control varies depending on the
operating system.
On AIX NUMA and Linux NUMA-enabled machines, a policy can be defined that
specifies what resource sets the DB2 database system uses. When resource set
binding is used, each individual DB2 process is bound to a particular resource
set. This can be beneficial in performance tuning scenarios.
The following steps illustrate AIX resource sets configuration to enable processor
affinity for DB2 partitions. A single node with 16 processors and eight logical
database partitions is considered.
1. Define AIX resource sets in the /etc/rsets file.
Define eight new resource sets in the /etc/rsets file to use CPU numbers 8 through 15. These eight resource sets are named DB2/MLN[1-8], as shown in Example 2-26 (a sketch of such an entry follows these steps).
3. Update db2nodes.cfg.
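Example 2-26 is not reproduced here. As an illustration only, an /etc/rsets entry for the first of these resource sets might look similar to the following (the owner, group, permissions, and CPU number are placeholders that must match your instance configuration):
DB2/MLN1:
    owner = db2inst1
    group = db2iadm1
    perm = rwr-r-
    resources = sys/cpu.00008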
Use the nmon, vmstat, lsps commands to monitor and fine-tune the following
memory-related registry parameters. Refer to 2.1.3, “VMM considerations for
DB2” on page 40 for more information about monitoring and tuning VMM
parameters.
DB2_LARGE_PAGE_MEM
› Default value: NULL
› Recommended: Refer to 2.1.4, “Large page considerations” on page 46
To enable the DB2 database system to use large pages, the operating system must be configured to use large or huge pages. Refer to 2.1.4, "Large page considerations" on page 46 for all prerequisites and the steps to set this parameter.
DB2MEMDISCLAIM
› Default value: YES
› Recommended: Default
Memory used by DB2 database system processes might have associated paging space. This paging space might remain reserved even after the associated memory has been freed. Whether it does depends on the operating system's (tunable) virtual memory management allocation policy.
The DB2MEMDISCLAIM registry variable controls whether DB2 agents explicitly request that the operating system disassociate the reserved paging space from the freed memory.
A setting of YES results in smaller paging space requirements and less paging I/O, which might provide a performance benefit. However, when there is plenty of real memory and paging space, a setting of NO might yield a performance benefit.
You need to monitor the paging space use and VMM parameters (see 2.1.3,
“VMM considerations for DB2” on page 40) and set this registry value
accordingly.
DB2_MEM_TUNING_RANGE
› Default value: NULL
› Recommended: Default
Using this registry parameter, the DB2 instance sets the minimum and maximum amount of free physical memory to keep available on the server (for other applications). It is recommended to leave it at the default. The setting of this variable has no effect unless the self-tuning memory manager (STMM) is enabled and database_memory is set to AUTOMATIC.
Monitor the paging space use and other VMM parameters (see 2.1.3, "VMM considerations for DB2" on page 40) to adjust the thresholds.
2.2.3 DB2 Communications registry variables
Use the nmon Network Interface view ('n') to monitor network performance and fine-tune these parameters. You can also use netstat and entstat to monitor and fine-tune any of the following TCP/IP registry parameters. See 2.1.6, "Network tunable considerations" on page 51.
DB2_FORCE_NLS_CACHE
› Default value: FALSE
› Recommended: Default
DB2TCPCONNMGRS
› Default value: 1
› Recommended: Default
Use nmon - Network Interface View (‘n’) to monitor network performance and
fine-tune this parameter.
DB2SORCVBUF and DB2SOSNDBUF
› Default value: 65536
› Recommended: Default
Use the nmon Network Interface view ('n') to monitor network performance and fine-tune these parameters. You can also use /usr/bin/entstat -d <interface name> to display all the statistics.
To maximize network and HADR performance, the TCP socket buffer sizes might
require tuning. If you change the TCP socket buffer size at the system level, the
settings are applied to all TCP connections on the machine. Setting a large
system level socket buffer size consumes a large amount of memory.
These two registry variables allow tuning of the TCP socket send and receive
buffer size for HADR connections only. They have the value range of 1024 to
4294967295 and default to the socket buffer size of the operating system, which
varies depending on the operating system. Some operating systems
automatically round or silently cap the user-specified value.
DB2CHECKCLIENTINTERVAL
› Default value: 50
› Recommended: Default
Note: Operating systems also have a connection timeout value that might take
effect prior to the timeout you set using DB2TCP_CLIENT_CONTIMEOUT.
For example, AIX has a default tcp_keepinit=150 (in half seconds) that
terminates the connection after 75 seconds.
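To review or adjust this AIX setting, the no command can be used (the value of 300 half-seconds, that is 150 seconds, is only an illustration):
no -a | grep tcp_keepinit
no -p -o tcp_keepinit=300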
DB2TCP_CLIENT_KEEPALIVE_TIMEOUT
› Default value: 0
› Recommended: Default
DB2TCP_CLIENT_RCVTIMEOUT
› Default value: 0
› Recommended: Default
Related AIX network (no) tunables:
no ipsrcrouterecv: default 0, recommended 1 (as required by topsvcs)
no tcp_tcpsecure: default 0, recommended 5
In this chapter, we refer to two scenarios: one for OLTP environments and one for
data warehouse environments, as each has its own requirements. Figure 3-1 and
Figure 3-2 on page 87 show a high level graphical view of a typical storage layout
in both environments.
In the rest of this chapter we discuss the details on how to build and configure
such environments for best performance.
This simple design ensures that each category of data can be accessed without interfering with the others. As a result, the database system can achieve better performance and availability. In an OLTP (online transaction processing) environment in particular, separating transaction logs from the permanent/temporary data is important because of the high transaction rate. For a DW (data warehouse) environment, however, permanent data and transaction logs might be collocated on the same physical disks. This is because DW workloads are often read-only and do not have as high a transaction rate as OLTP workloads. Thus, the demand on transaction logs is lower. Allowing the sharing of disks gives DBAs the flexibility to manage storage space and better use any unused space.
On the other hand, this simple design can help with problem determination. For
example, if we observe that a set of disks corresponding to a file system is much
busier than the others, we can identify which part of the system or data is hot.
Better parallelism or more resources (for example, more spindles) might be needed to alleviate the issue.
In DB2 v9.7, all databases are created with automatic storage1 by default. In particular, when creating a database, we establish one or more initial storage paths in which table spaces have their containers. For example:
CREATE DATABASE TESTDB ON /path1, /path2
This command creates the database with two storage paths: /path1 and /path2. The recommendation is for each storage path to reside in a separate file system. With the CREATE DATABASE command, the default table spaces created by DB2 (for example, SYSCATSPACE, USERSPACE1, and TEMPSPACE1) now have two containers, one on each of the two storage paths.
1 If you do not want to use automatic storage for a database, you must explicitly specify the AUTOMATIC STORAGE NO clause on the CREATE DATABASE command.
Any new table spaces created without explicitly providing container definitions are also created in these two storage paths. For example:
CREATE TABLESPACE TBSP1
This creates a table space with two containers, one on each of the two storage
paths (/path1 and /path2). As the database grows, the database manager
automatically extends the size of containers across all the storage paths.
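Additional storage paths can be added later with an ALTER DATABASE statement of the following form (using the /path3 example discussed later in this section):
ALTER DATABASE ADD STORAGE ON '/path3'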
When a new storage path is added, it is not used until there is no more room to grow within the containers on the existing storage paths (/path1 and /path2). To use the new storage path immediately, issue an ALTER TABLESPACE statement to rebalance the database table spaces residing on the existing storage paths. For instance:
The rebalance process runs asynchronously in the background and does not affect the availability of data. However, rebalancing is an expensive operation and has significant I/O overhead. Therefore, it is important to gauge the performance impact on the database system while rebalancing is being performed.
One suggestion is to start with rebalancing a relatively small table space and use
the result as a reference to estimate how long it takes to rebalance the other
table spaces, and what the performance impact is. Then, decide the time of day
that is suitable for performing the rebalance of larger table spaces.
Alternatively, we can use the throttling facility in DB2 to limit the performance impact of rebalancing on the system. The database manager configuration parameter util_impact_lim sets the limit on the impact that all throttled utilities can have on the overall workload of the system. By default, this parameter is set to 10, which means all throttled utilities combined can have no more than a 10% average impact upon the workload, as judged by the throttling algorithm. A value of 100 indicates no throttling. The SET UTIL_IMPACT_PRIORITY command is used to set the priority that a particular utility has over the resources available to throttled utilities, as defined by util_impact_lim.
For our rebalancing, we use the SET UTIL_IMPACT_PRIORITY command to set the
priority of the rebalancing process so that the impact of rebalancing is controlled.
Here are the steps:
1. Run db2 list utilities show detail to obtain the utility ID of the
rebalancing process. This command shows the current progress of the
rebalancing process in terms of estimated percentage complete.
2. Set the priority of this process by executing the SET UTIL_IMPACT_PRIORITY FOR <utility_id> TO <priority> command.
If rebalancing still incurs too much overhead or takes an excessive amount of time to complete, an offline approach (namely backup and redirected restore) can be used. The idea is to back up the existing database, restore the database with the REDIRECT option, add the new storage path, and continue the restore.
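As an illustration (the utility ID of 5 and the priority of 50 are sample values only):
db2 update dbm cfg using util_impact_lim 10
db2 list utilities show detail
db2 set util_impact_priority for 5 to 50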
Similar to adding a new storage path, automatic storage allows us to remove
existing storage paths from a database or move the data off the storage paths
and rebalance them. For more details about how this can be done, see the
DB2 v9.7 Information Center at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic
=/com.ibm.db2.luw.doc/welcome.html
An important consideration for storage paths or file systems used for automatic storage is that the file systems be uniform in capacity and exhibit similar I/O characteristics. Figure 3-3 on page 91 illustrates how containers of a table space grow on storage paths whose capacities are uneven.
Figure 3-3 How does a table space grow?
3. The table space starts out with two containers (/path1 and /path2) that have not yet reached maximum capacity. A new storage path (/path3) is added to the database using the ALTER DATABASE statement. This new storage path is not used until a rebalance process is initiated by ALTER TABLESPACE.
4. The original containers in /path1 and /path2 reach the maximum capacity.
Note: Stripe is defined as the total of all the segments for one pass of all
the back-end data disks.
6. The containers in the new stripe set (in /path2 and /path3) reach their maximum capacity.
7. A new stripe set is added only to /path3 because there is no room for the container in /path2 to grow.
The recommended practice is to have all storage paths with the same capacity
and I/O characteristics, and rebalance after a new storage path is added.
Figure 3-4 depicts the scenario that follows this practice. With rebalance and
uniform capacity, the table space grows evenly across all the storage paths. This
ensures that parallelism remains uniform and achieves the optimal I/O
performance.
Figure 3-4 Storage grows with uniform capacity storage paths and rebalance
First, what table spaces do you need? A simple design is to have one table space
for temporary tables, one table space for data, and one table space for all
indexes. In a partitioned environment (or DPF environment), a typical approach is
to have separate table spaces for partitioned data and non-partitioned data. The
next step is to consider what types of table spaces you need, and their attributes.
The following sections describe suggested practices and important
considerations when designing a table space.
Pagesize
Rows of table data are organized into blocks called pages. Pages can be one of four sizes: 4, 8, 16, or 32 KB. Table data pages do not contain the data for columns defined with the LONG VARCHAR, LONG VARGRAPHIC, BLOB, CLOB, DBCLOB, or XML data types, unless the LOB or XML document is inlined through the use of the INLINE LENGTH option of the column. The rows in a data page, however, contain a descriptor of these columns.
With a different pagesize, the maximum row length can vary, as shown in
Table 3-1.
Table 3-1   Maximum row length by page size
Page size     Maximum row length
4 KB          4005 bytes
8 KB          8101 bytes
16 KB         16 293 bytes
32 KB         32 677 bytes
Regardless of the page size, the maximum number of rows in a data page is 255
for tables in a REGULAR table space. This number can go higher with a LARGE
table space.
A REGULAR table space allows tables to have up to 255 rows per data page,
and can be managed by the database manager (DMS) or system (SMS).
A table in a LARGE table space can support more than 255 rows per data page,
which can improve space use on data pages. Indexes on the table are slightly
larger because LARGE table spaces make use of large row identifiers (RIDs),
which are 2 bytes longer than regular RIDs. In terms of performance, there is not
much difference between REGULAR and LARGE table spaces. By default, DMS
table spaces are created as LARGE table spaces. For SMS table spaces,
however, the only supported table space type is REGULAR table spaces.
DMS versus SMS and automatic storage
A table space can be managed by the DMS, the SMS, or automatic storage. For
more information about this, see the following Information Center article:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/c
om.ibm.db2.luw.admin.dbobj.doc/doc/c0055446.html
Of the three types of table spaces, automatic storage table spaces are the
easiest to set up and maintain, and are recommended for most applications.
They are particularly beneficial in the following situations:
› You have larger tables or tables that are likely to grow quickly.
› You do not want to make regular decisions about how to manage container
growth.
› You want to store different types of related objects (for example, tables, LOBs, and indexes) in different table spaces to enhance performance.
DMS table spaces are useful in the following circumstances:
› You have larger tables or tables that are likely to grow quickly.
› You want to exercise greater control over where data is physically stored.
› You want to make adjustments to or control how storage is used (for example,
adding containers)
› You want to store different types of related objects (for example, tables, LOBs, and indexes) in different table spaces to enhance performance.
A DMS table space has a choice of creating its containers as a raw device or a
file in a file system. A raw device traditionally provides better performance.
Nonetheless, the concurrent I/O (CIO) feature on AIX has now almost completely eliminated the need to use a raw device for performance. (CIO is discussed in "File system caching" on page 97.) File systems also provide superior manageability compared to raw devices. Therefore, the general recommendation is to use file systems instead of raw devices.
EXTENTSIZE
The EXTENTSIZE for a table space specifies the number of pages that are written to a container before skipping to the next container (if there is more than one container in the table space).
So, in this example (a stripe width of 896 KB), if the page size used is 16 KB, the EXTENTSIZE in pages is 896 / 16 = 56. The concepts of segment and RAID stripe are discussed further in 3.3.3, "Storage structure: RAID levels" on page 104 and 3.3.4, "Logical drives (LUN) and controller ownership" on page 114.
We need to consider whether we have many tables in which the amount of data is far less than the EXTENTSIZE. This is not uncommon in OLTP environments. Consider an example where we have 40,000 tables, of which almost 20,000 are empty. Because the EXTENTSIZE is the minimum allocation for a table, a lot of space is wasted. In such environments, a smaller EXTENTSIZE might be preferable. For instance, you might consider setting the EXTENTSIZE equal to the segment size.
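For illustration only (TBSP_DATA and BP16K are placeholder names, and BP16K is assumed to be an existing 16 KB buffer pool), an extent size of 56 pages can be specified when the table space is created:
CREATE TABLESPACE TBSP_DATA PAGESIZE 16K EXTENTSIZE 56 BUFFERPOOL BP16K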
PREFETCHSIZE
The PREFETCHSIZE specifies the number of pages to read by a query prior to
the pages being referenced by the query, so that the query does not need to wait
for IO. For example, suppose you have a table space with three containers. If you
set the PREFETCHSIZE to be three times the EXTENTSIZE, the database
manager can do a big-block read from each container in parallel, thereby
significantly increasing I/O throughput. This assumes that the best practice is followed such that each container resides on a separate physical device.
The recommendation is to leave it as AUTOMATIC. This way, DB2 updates the
prefetchsize automatically when there is a change in the number of containers in
a table space. The calculation of the PREFETCHSIZE is as follows:
number_of_containers × number_of_disks_per_container × extent_size
For example, assume the extent size for a database is eight pages, and that
there are four containers, each of which exists on a single physical disk. Setting
the prefetch size to: 4 × 1 × 8 = 32 results in a prefetch size of 32 pages in total.
These 32 pages are read from each of the four containers in parallel.
The OVERHEAD and TRANSFERRATE values affect how the DB2 optimizer selects the optimal access plans for queries. You can use the formula in Example 3-1 to estimate the values for OVERHEAD and TRANSFERRATE.
For example, assume that a disk rotates at 7,200 RPM. Using the
(1 / 7200) * 60 * 1000 = 8.328 milliseconds
This value can be used to estimate the overhead as follows, assuming an average
seek time of 11 milliseconds:
OVERHEAD = 11 + (0.5 * 8.328)
= 15.164
As to TRANSFERRATE, if each table space container is a single physical disk, you can use the following formula to estimate the transfer cost in milliseconds per page:
(1 / spec_rate) * 1000 / 1024000 * page_size
where spec_rate is the disk specification rate in MB per second and page_size is the table space page size in bytes.
For example, suppose the specification rate for a disk is 3 megabytes per second.
Then:
TRANSFERRATE = (1 / 3) * 1000 / 1024000 * 4096
= 1.333248
or about 1.3 milliseconds per page.
If the table space containers are not single physical disks, but are arrays of disks
(such as RAID), you must take additional considerations into account when
estimating the TRANSFERRATE.
If the array is relatively small, you can multiply the spec_rate by the number of
disks, assuming that the bottleneck is at the disk level.
However, if the array is large, the bottleneck might not be at the disk level, but at
one of the other I/O subsystem components, such as disk controllers, I/O busses,
or the system bus. In this case, you cannot assume that the I/O throughput
capacity is the product of the spec_rate and the number of disks. Instead, you
must measure the actual I/O rate (in megabytes per second) during a sequential scan. For
example, a sequential scan resulting from select count(*) from big_table could be
several megabytes in size. In this case, divide the result by the number of
containers that make up the table space in which BIG_TABLE resides. Use this
result as a substitute for spec_rate in the formula given above.
Containers assigned to a table space can reside on different physical disks. For the best results, all physical disks used for a given table space should have the same OVERHEAD and TRANSFERRATE characteristics. If these characteristics are not the same, use average values when setting OVERHEAD and TRANSFERRATE.
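The resulting values are applied with the OVERHEAD and TRANSFERRATE clauses of the CREATE TABLESPACE or ALTER TABLESPACE statement. For example, using the numbers computed above:
ALTER TABLESPACE TBSP1 OVERHEAD 15.164 TRANSFERRATE 1.33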
AIX provides a feature to disable file system caching and avoid double caching.
This feature is known as Concurrent I/O (CIO). Another advantage of CIO is that
this feature might help reduce the memory requirements of the file system cache,
making more memory available for other uses. In DB2, we can enable CIO on
table spaces by specifying the clause NO FILE SYSTEM CACHING in the
CREATE TABLESPACE or ALTER TABLESPACE statement. (Starting from DB2
v9.5, NO FILE SYSTEM CACHING is the default for any DMS table space
created.)
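For example (USERSPACE1 is used here only as an illustration):
ALTER TABLESPACE USERSPACE1 NO FILE SYSTEM CACHING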
LOBs (large objects) are not cached in the DB2 buffer pool. For table spaces that contain a fair amount of LOB data, performance might be improved by enabling file system caching.
Compression
Data compression provides several benefits. The most significant benefit is
lowered overall space requirements. Typical compression ratios for data tables
are between 50% and 65%, although the compression ratio can be much higher
or lower depending on the data. Another benefit of data compression is the
potential for improved performance. Processing compressed data uses more
processor cycles than uncompressed data, but requires less I/O.
Row compression, a feature available since DB2 V9.1, allows data tables to be
compressed. In DB2 V9.7, two new compression techniques are introduced:
› Index compression
Index compression compresses indexes (including indexes on declared or
created temporary tables). If a data table is compressed, new indexes created
on this table are automatically compressed. You can always enable or disable
index compression on any indexes explicitly.
› Temporary table compression.
Temporary table compression, on the other hand, compresses temporary
tables, such as created global temporary tables (CGTT) and declared global
temporary tables (DGTT). Unlike row or index compression, temporary table
compression is always on.
For more details about how row and index compression can be used, see the
DB2 Information Center at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/c
om.ibm.db2.luw.doc/welcome.html
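As an illustration only (the table SALES, column SALE_DATE, and index IDX_SALES are placeholder names), row and index compression can be enabled as follows:
db2 alter table sales compress yes
db2 reorg table sales resetdictionary
db2 create index idx_sales on sales (sale_date) compress yes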
Important: You must apply for the IBM DB2 Storage Optimization Feature
license to use any of the compression features.
3.3 Storage hardware
In the previous section, we described general considerations in DB2 storage
design. In this section, we look at storage hardware and discuss how storage
hardware can be configured and tuned to meet database storage requirements.
For illustrative purposes, we focus on the IBM System Storage™ DS5000 series.
The DS5000 Storage Manager software, which comes with the hardware, can be
used to configure, manage, and troubleshoot the DS5000 storage servers. We
use this software to configure RAID arrays and logical drives, assign logical
drives to hosts, replace and rebuild failed disk drives, expand the size of the
arrays and logical drives, and convert from one RAID level to another.
In the next sections, we talk about important aspects of the DS5000 series that
deserve special attention. More importantly, a few of them have an impact on
performance of our database systems. For other details on the DS5000 series,
readers are encouraged to refer to IBM RedBooks publication Introduction to the
IBM System Storage DS5000 Series, SG24-7676.
3.3.2 Physical components considerations
In this section we discuss the considerations for physical components.
A link that is too long can end up running at 1 Gbps, even with 2 Gbps ports at
either end, so be aware of your distances and make sure your fiber is good. The
same rules apply to 4 Gbps devices relative to 1 Gbps and 2 Gbps environments.
The 4 Gbps devices have the ability to negotiate back down to either 2 Gbps or 1
Gbps, depending upon the attached device and link quality.
Figure 3-5 on page 102 shows that the host ports are connected to eight sets of
FC HBA.
When multiple HBAs are installed on a host, a multipath driver can be used to achieve redundancy (failover) or load sharing. We revisit this topic later in this section.
Drives
The speed and the type of the drives2 used impact performance. Typically, the faster the drive, the higher the performance. This increase in performance comes at a cost: the faster drives typically cost more than the lower-performance drives. FC drives outperform SATA drives. In particular, the DS5000 storage server supports the following types of FC drives:
› 4 Gbps FC, 146.8 GB / 15K Enhanced Disk Drive Module
› 4 Gbps FC, 300 GB / 15K Enhanced Disk Drive Module
› 4 Gbps FC, 450 GB / 15K Enhanced Disk Drive Module
Note that an FC 15K drive rotates 15,000 times per minute. Also, its read and write bandwidths are 76 MBps and 71 MBps, respectively.
The speed of the drive is the number of revolutions per minute (RPM). A 15 K drive rotates 15,000 times per minute. With higher speeds, the drives tend to use smaller, denser platters, as a large-diameter platter spinning at such speeds is likely to wobble. With the faster speeds comes the ability to have greater throughput.
2 The words “drive” and “disk” are used interchangeably in this chapter.
Seek time is the measure of how long it takes for the drive head to move to the
correct sectors on the drive to either read or write data. It is measured in
milliseconds (ms). The faster the seek time, the quicker data can be read from or
written to the drive. The average seek time reduces when the speed of the drive
increases. Typically, a 7.2 K drive has an average seek time of around 9 ms, a 10
K drive has an average seek time of around 5.5 ms, and a 15 K drive has an
average seek time of around 3.5 ms. Together with the RPM, the seek time is used to determine the value of OVERHEAD for our DB2 table spaces (see 3.2.2, "Tablespace design" on page 92).
We recommend that you also split the hot spares so that they are not on the
same drive loops (see Figure 3-6 on page 104).
Tip: When assigning disks as hot spares, make sure they have enough
storage capacity. If the failed disk drive is larger than the hot spare,
reconstruction is not possible. Ensure that you have at least one of each size
or all larger drives configured as hot spares.
Figure 3-6 Hot spare coverage with alternating loops
RAID levels
In this section, we discuss the RAID levels and explain why we choose a
particular setting in a particular situation. You can draw your own conclusions.
Figure: RAID 0 and RAID 1 logical drives. The host views a logical drive as a sequence of blocks (Block 0, Block 1, Block 2, and so on); the actual device mappings spread the blocks across a stripeset (RAID 0) or duplicate each block on both drives of a mirrorset (RAID 1).
Because the data is mirrored, the capacity of the logical drive when assigned
RAID 1 is 50% of the array capacity.
Note: RAID 1 is actually implemented only as RAID 10 (see “RAID 10: Higher
performance than RAID 1” on page 109) on the DS5000 storage server
products.
RAID 5 is best used in environments requiring high availability and fewer writes
than reads.
Use write caching on RAID 5 arrays, because RAID 5 writes are not completed
until at least two reads and two writes have occurred. The response time of
writes is improved through the use of write cache (be sure it is battery-backed
up). RAID 5 arrays with caching can give as good as performance as any other
RAID level, and with a few workloads, the striping effect gives better performance
than RAID 1.
Due to the added impact of more parity calculations, RAID 6 is slightly slower
than RAID 5 in terms of writing data. Nevertheless, there is essentially no impact
on read performance when comparing between RAID 5 and RAID 6 (provided
that the number of disks in the array is equal).
Figure: RAID logical drive structure. The host views the logical drive as a sequence of blocks (Block 0 through Block 5, and so on); the controller's internal mapping distributes the blocks across the actual devices of the stripeset.
RAID level summary (RAID / Description / Workload / Advantage / Disadvantage):
RAID 5: Drives operate independently, with data and parity blocks distributed across all drives in the group. Workload: OLTP, DW. Advantage: good for reads, small IOPS, many concurrent IOPS, and random I/Os. Disadvantage: writes are particularly demanding.
RAID 6: Stripes blocks of data and parity across an array of drives and calculates two sets of parity information for each block of data. Workload: OLTP, DW. Advantage: good for multi-user environments, such as databases, where the typical I/O size is small, and in situations where additional fault tolerance is required. Disadvantage: slower in writing data; more complex RAID controller architecture.
Array configuration
Before you can start using the physical disk space, you must configure it. Based
on the previous recommendation in RAID levels, you divide your (physical) drives
into arrays accordingly and create one or more logical drives inside each array.
In simple configurations, you can use all of your drive capacity with one array and
create all of your logical drives in that unique array. However, this presents the
following drawbacks:
› If you experience a (physical) drive failure, the rebuild process affects all
logical drives, and the overall system performance goes down.
› Read/write operations to logical drives are still being made to the same set of
physical hard drives.
The array configuration is crucial to performance. You must take into account all
the logical drives inside the array, as all the logical drives inside the array impact
the same physical disks. If you have two logical drives inside an array and they
both are high throughput, there might be contention for access to the physical
drives as large read or write requests are serviced. It is crucial to know the type
of data that each logical drive is used for and try to balance the load so
contention for the physical drives is minimized. Contention is impossible to
eliminate unless the array only contains one logical drive.
Number of drives
The more physical drives you have per array, the shorter the access time for read
and write I/O operations.
You can determine how many physical drives should be associated with a RAID controller by looking at disk transfer rates (rather than at megabytes per second). For
example, if a disk drive is capable of 75 nonsequential (random) I/Os per second,
about 26 disk drives working together can, theoretically, produce 2000
nonsequential I/Os per second, or enough to hit the maximum I/O handling
capacity of a single RAID controller. If the disk drive can sustain 150 sequential
I/Os per second, it takes only about 13 disk drives working together to produce
the same 2000 sequential I/Os per second and keep the RAID controller running
at maximum throughput.
Tip: Having more physical drives for the same overall capacity gives you:
› Performance
By doubling the number of the physical drives, you can expect up to a 50%
increase in throughput performance.
› Flexibility
Using more physical drives gives you more flexibility to build arrays and
logical drives according to your needs.
› Data capacity
When using RAID 5 logical drives, more data space is available with
smaller physical drives because less space (capacity of a drive) is used for
parity.
Figure 3-12 on page 113 shows an example of the enclosure loss protection. If
enclosure number 2 were to fail, the array with the enclosure loss protection still
functions (in a degraded state), as the other drives are not affected by the failure.
Figure 3-12 Enclosure loss protection
OLTP workloads
OLTP environments contain a fairly high level of reads and a considerable
amount of writes. In most cases, it has been found that laying out the tables
across a number of logical drives that were created across several RAID 5 arrays
of 8+1 parity disks, and configured with a segment size of 64 KB or 128 KB, is a
good starting point to begin testing. This configuration, coupled with host
recommendations to help avoid offset and striping conflicts, seems to provide a
good performance start point to build from. Remember that high write
percentages might result in a need to use RAID 10 arrays rather than the RAID 5.
This is environment-specific and requires testing to determine. A rule of thumb is
that if there are greater than 25–30% writes, then you might want to look at RAID
10 over RAID 5.
As the transaction logs are critical files to protect in case of failures, we recommend that you keep two full copies of them on separate disk arrays in the storage server. This protects you from the unlikely occurrence of a double disk failure, which can result in data loss. Also, because the logs are generally smaller files and require less space, we suggest that two separate arrays of 1+1 or 2+2 RAID 1 be used to hold the logs and the mirror pair separately.
The segment size is the maximum amount of data that is written or read from a
disk per operation before the next disk in the array is used. For OLTP
environments, we suggest that the segment size be 64 KB to 128 KB.
With DW workloads, the focus is on achieving high throughput with fewer I/Os, and these workloads are generally sequential in nature. As a result, you want larger segments (128 KB or higher) to get the most from each stripe. When creating a logical drive, the stripe width is the segment size multiplied by the number of disks in the logical drive (or array). More importantly, the stripe width is used to determine our EXTENTSIZE during the creation of our table spaces in DB2 (refer to 3.2.2, "Tablespace design" on page 92 for more details).
An MPIO-capable device driver can control more than one type of target device.
A PCM can support one or more specific devices. Therefore, one device driver
can be interfaced to multiple PCMs that control the I/O across the paths to each
of the target devices.
The AIX PCM has a health-check capability that can be used to do the following
tasks:
› Check the paths and determine which paths are currently usable for sending
I/O.
› Enable a path that was previously marked failed because of a temporary path
fault (for example, when a cable to a device was removed and then
reconnected).
› Check currently unused paths that are used if a failover occurred (for
example, when the algorithm attribute value is failover, the health check can
test the alternate paths).
MPIO is part of the AIX operating system and does not need to be installed
separately. The required AIX 6.1 level is TL0 (IZ13627).
3.4.2 hdisk tuning
On AIX, each logical drive or LUN is called a physical volume (PV) and often
referred to as hdiskx (where x is a unique integer on the system).
There are three parameters that are important to set for hdisks for performance:
› queue_depth
› max_transfer
› max_coalesce
queue_depth
The maximum queue depth or queue_depth is the maximum number of
concurrent operations in progress on the physical device. Excess requests
beyond queue_depth are held on a pending queue in the disk driver until an
earlier request completes. Setting the queue_depth for every hdisk in the system
to the appropriate value is important for system performance. The valid values for
queue_depth are from 1 to 256.
Typically, disk vendors supply a default queue_depth value appropriate for their
disks in their disk-specific ODM device support fileset. If you know this number
you can use the exact number for the queue_depth calculation. If not, you can
use a number between 4 and 16, which is the standard range of queue depths for drives. FC drives usually have a queue depth of 16. You also need to know how
many disks you have per hdisk. For example, if you have a RAID 5 7+P
configuration, you have eight disks per hdisk so you can start with a
queue_depth for the hdisk of 16 * 8 = 128. You can monitor the queues using
iostat -D. OLTP environments usually require higher queue_depth to perform
better than DW environments. To set a queue_depth of 64, issue the following
command:
#chdev -l hdiskX -a queue_depth=64 -P
To make the attribute changes permanent, use the -P option to update the AIX
ODM attribute. To make this change there must either be no volume group
defined on the hdisk, or the volume group that the hdisk belongs to must be
varied off. You need to reboot the server to make this change take effect.
When setting this value, keep the following in mind:
› Setting queue_depth too high might result in the storage device being
overwhelmed. This can result in a range of issues from higher I/O latency
(lower performance) or I/Os being rejected or dropped by the device (errors in
the AIX errpt and greatly reduced performance).
› Setting queue_depth too high for FC disks might overrun the maximum
number of commands (num_cmd_elems) allowed outstanding by the FC driver
(see 3.4.3, “Fibre Channel adapters configuration” on page 118 on how to
tune num_cmd_elems).
› A combination of a large number of large I/Os for FC disks might overrun the
PCI bus address space available for the FC port.
› queue_depth is not honored for disks that do not support SCSI Command
Tagged Queueing (CTQ). Also, queue_depth is not honored for disks whose
q_type attribute is set to none or whose SCSI INQUIRY responses indicate
lack of support for CTQ.
max_transfer
The maximum transfer size (max_transfer) sets the limit for the maximum size of
an individual I/O to the disk. Although applications might make larger I/O
requests, those requests are broken down into multiple max_transfer-sized
requests before they are handed to the disk driver.
For OLTP workloads, most I/Os are for discontiguous 4 KB or 8 KB pages (depending on the database page size). For a DW workload, there are most likely many scans, and the I/O sizes are larger. In this case, a larger max_transfer means fewer round trips when the I/Os are large and contiguous. We recommend leaving the default max_transfer of 256 KB (0x40000) for OLTP workloads and setting it to 1 MB (0x100000) for DW workloads. To set a max_transfer of 1 MB, issue the following command:
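As with the queue_depth example shown earlier (hdiskX is a placeholder for the disk name):
#chdev -l hdiskX -a max_transfer=0x100000 -P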
num_cmd_elems
The num_cmd_elems value sets a limit for the maximum number of SCSI I/O
requests that can be active for the FC port at one time. The actual maximum
number of commands that can be issued at one time is the minimum of
num_cmd_elems and the aggregate queue_depth of the devices using the port.
Supported values are between 20 and 2048.
We recommend setting this value to its maximum of 2048 and tuning down if necessary. To do so, use the chdev command. For example, to set this value to 2048, issue:
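A command of the following form applies (fcs0 is the FC adapter instance; the -P flag records the change in the ODM so it takes effect after the adapter is reconfigured or the system is rebooted):
#chdev -l fcs0 -a num_cmd_elems=2048 -P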
When setting this value, keep in mind that a combination of a large number of large I/Os for Fibre Channel disks might overrun the PCI bus address space available for the Fibre Channel port.
max_xfer_size
The max_xfer_size value sets the limit for the maximum size of an individual I/O
sent by the port. This value also influences the amount of PCI bus address space
allocated for DMA mapping of I/O buffers. The default setting of 0x100000 (1 MB)
causes the FC driver to request the normal amount of PCI bus address space.
Any larger setting causes the FC driver to request a larger amount of PCI bus
address space. There are only two PCI bus address space sizes that the FC
driver requests. It does not request more and more as you increase
max_xfer_size further.
Any excess requests beyond the limit of the driver's ability to map for DMA are held on a pending queue in the adapter driver until earlier requests complete. Each time a request is set aside because it cannot be mapped for DMA, a PCI_DMA error is logged. The fcstat No DMA Resource Count statistic tracks the number of
times that a request cannot be issued due to lack of PCI address space for DMA.
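You can check this counter with the fcstat command, for example:
fcstat fcs0 | grep "No DMA Resource"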
We suggest leaving the default max_xfer_size for both OLTP and DW workloads. As an example, if you were to set this value to 2 MB, issue:
#chdev -l fcs0 -a max_xfer_size=0x200000 -P
To manage the disk storage the LVM uses a hierarchy of structures that have a
clearly defined relationship between them. The lowest element in this structure is
the PV. We have seen in previous sections how we started building the storage
layout (in Figure 3-1 on page 86 and Figure 3-2 on page 87) by taking the disks,
and creating arrays and LUNs. Each LUN is seen by the LVM as a physical
volume and has a name, usually /dev/hdiskx (where x is a unique integer on the
system).
Every physical volume in use belongs to a volume group (VG) unless it is being
used as a raw storage or a readily available spare (also known as Hot Spare). A
VG is a collection of one or more PVs. Within each volume group, one or more
logical volumes (LVs) are defined. LVs are the way to group information located
on one or more PVs. An LV is an area of disk used to store data that appears to be contiguous to the application, but can be non-contiguous on the actual PV. It is this definition of an LV that allows it to be extended, relocated, span multiple PVs, and have its contents replicated.
Now that we have a basic understanding of the LVM, we look at creation and
configuration in detail.
The reason for having one fewer VG in DW environments is that we want the
active logs distributed in the database partitions. Because active logs are not as
performance critical as in an OLTP environment, they do not require dedicated
disks for them.
Next, we discuss performance decisions that we need to make before the actual
creation.
The LVM limits the number of physical partitions that a PV can have. Those limits
depend on the type of VG chosen:
› Original
› Big
› Scalable
For Original or Big VGs, the maximum number of PP per PV is 1016. That limit
can be changed by a factor to make it larger. The factor is between 1 and 16 for
Original VGs and 1 and 64 for Big VGs. The maximum number of PPs per PV for
a VG changes to 1016 * factor. When using this factor the maximum number of
PVs for that VG is reduced by MaxPVs / factor. For example, an Original VG can
have up to 32 PVs. If we decide to increase the number of PPs by a factor of 4,
we are able to have 1016 * 4 = 4064 PPs per PV for that VG, but now we only can
have 32 / 4 = 8 PVs in that VG. For Big VGs it is the same, except that the largest
number of PVs we can handle is 128, so in this example we end up with 32 PVs
as the maximum.
Scalable VGs can accommodate up to 1024 PVs, 256 LVs, and 32,768 PPs by default. The number of LVs and PPs can be increased beyond the default values. In particular, the number of PPs can be increased up to 2,097,152 (the maximum PPs per VG is expressed in units of 1024 PPs, the maximum being 2048, that is, 1024 * 2048 = 2,097,152 PPs).
The disadvantage is that increasing these defaults can significantly increase the size of the Volume Group Descriptor Area (VGDA)3.
Another consideration for the choice of VG type is that, with Original and Big VGs, the LVM reserves by default the first 512 bytes of the volume for the LV control block. Therefore, the first data block starts at an offset of 512 bytes into the volume. Care must be taken when laying out the segment size of the logical drive to enable the best alignment. You can eliminate the LV control block on the LV by using a Scalable VG, or by using the -T 0 option for Big VGs.
Now that we have decided to use Scalable VGs, we need to choose the PP size and the maximum number of PPs for the VG. The first step is to know how much physical storage we need to accommodate under that VG. The result of PP size * maximum number of PPs per VG must accommodate at least that capacity. A best practice for Original or Big VGs is to choose a PP size as small as possible, given the maximum-PPs-per-PV restriction. The reason is that the PP size has a performance impact when doing mirroring at the LVM level or, in certain cases, when doing LV striping. Because we do not recommend the use of mirroring at the LVM level or LV striping of differently-sized LVs, the choice of a PP size only influences the minimum size allocation on the PV.
A good size for PPs is 64 MB. Then, depending on the capacity of all the storage in the volume group, you can calculate the maximum number of PPs per VG. For example, if we want to use 64 MB PPs and a maximum number of PPs per VG of 512 (that is, 512 * 1024 = 524,288 PPs), we can accommodate up to 32 TB of capacity in that VG. You can use the following command to create such a VG:
mkvg -S -y <vgName> -s 64 -P 512 -f <hdiskList>
3 The VGDA contains information about the VG, the LVs that reside in the VG, and the PVs that make up the volume group.
The logical track group (LTG) size value is set by the varyonvg command using the -M flag. When the LTG size is set using the -M flag, the varyonvg and extendvg commands might fail if an underlying disk has a maximum transfer size that is smaller than the LTG size. To obtain the maximum supported LTG size of your hard disk, you can use the lquerypv command with the -M flag. The output gives the LTG size in KB, as shown in the following example.
# /usr/sbin/lquerypv -M hdisk0
The lspv command displays the same value as MAX REQUEST, as shown in
Example 3-2.
VG IDENTIFIER:  000bc6fd00004c00000000fda469279d
VG STATE:       active                    PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write                TOTAL PPs:      542 (8672 megabytes)
MAX LVs:        256                       FREE PPs:       431 (6896 megabytes)
LVs:            9                         USED PPs:       111 (1776 megabytes)
OPEN LVs:       8                         QUORUM:         2
TOTAL PVs:      1                         VG DESCRIPTORS: 2
STALE PVs:      0                         STALE PPs:      0
ACTIVE PVs:     1                         AUTO ON:        yes
MAX PPs per VG: 32512
MAX PPs per PV: 1016                      MAX PVs:        32
LTG size (Dynamic): 256 kilobyte(s)       AUTO SYNC:      no
Note that the LTG size for a VG is displayed as dynamic in the lsvg command
output.
Other considerations
Depending on the High Availability solution being considered, other important considerations must be taken into account when creating volume groups. Of particular importance is the Concurrent Volume Group flag. The concurrent
access VG is a VG that can be accessed from more than one host system
simultaneously. This option allows more than one system to access the PVs.
Then, through a concurrent capable VG, they can now concurrently access the
information stored on them. Initially, this was designed for high-availability
systems, with the high-availability software controlling the sharing of the
configuration data between the systems. However, the application must control
the concurrent access to the data. Another consideration is the major number for
a VG. The major number is a numerical identifier for the VG. It is recommended
in multi-host environments that the major numbers are consistent across all
hosts.
3.5.2 Creating and configuring logical volumes
An LV is a portion of a PV viewed by the system as a single unit. LVs consist of
LPs, each of which maps to one or more PPs. The LV presents a simple
contiguous view of data storage to the application while hiding the more complex
and possibly non-contiguous physical placement of data.
LVs can only exist within one VG. They cannot be mirrored or expanded onto
other VGs. The LV information is kept in the VGDA and the VGDA only tracks
information pertaining to one VG.
The Logical Volume Manager subsystem provides flexible access and control for complex physical storage systems. LVs can be mirrored and striped with various stripe sizes. The following are other types of LVs:
› boot: Contains the initial information required to start the system.
› jfs: Journaled File System
› jfslog: Journaled File Systems log
› jfs2: Enhanced Journaled File System
› jfs2log: Enhanced Journaled File System log
› paging: Used by the virtual memory manager to swap out pages of memory
Following Figure 3-1 on page 86 and Figure 3-2 on page 87 we are creating
each LV associated with one and only one hdisk. Also, as we discuss in the next
section, we are creating a jfs2 file system on each of those LVs, so the jfs2 type is
specified when creating the LVs. When specifying the size of the LV, remember that an LV is made up of logical partitions and each of those logical partitions corresponds in size to a PP on the PV (in this case only one PP, because we do not have mirroring). Thus, specify the LV size in multiples of the PP size to avoid wasted space. The size can be specified in logical partition units (the default) or in KB/MB/GB.
For OLTP systems (see Figure 3-1 on page 86) we suggest having two LVs per
hdisk, one for data and one for backups. This allows parallelism when taking a
backup of the database by specifying all the backup file systems as targets.
For DW systems (see Figure 3-2 on page 87), we recommend having three LVs
per hdisk, one for data, one for backups and one for active logs/database
directory. This allows for data partition isolation.
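As an illustration only (the LV name db2data01lv, the VG name datavg, the size of 64 logical partitions, and hdisk2 are placeholders), a jfs2-type LV can be created as follows:
mklv -y db2data01lv -t jfs2 datavg 64 hdisk2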
Because of its structure, a few tasks are performed more efficiently on a file
system than on each directory within the file system. For example, you can back
up, move, or secure an entire file system. You can make a point-in-time image of
a JFS file system or a JFS2 file system, called a snapshot.
You can access both local and remote file systems using the mount command.
This makes the file system available for read and write access from your system.
Mounting or unmounting a file system usually requires system group
membership. File systems can be mounted automatically if they are defined in
the /etc/filesystems file. You can unmount a local or remote file system with the
unmount command, unless a user or process is currently accessing that file
system.
Both JFS and JFS2 file systems are built into the base operating system.
However, using JFS is not recommended. You can also use other file systems
on AIX such as Veritas or GPFS, but for this Redbooks publication, we focus on
JFS2. This file system uses database journaling techniques to maintain its
structural consistency. It allows the file system log to be placed in the same
logical volume as the data, instead of allocating a separate logical volume for
logs for all file systems in the VG.
Because write operations are performed after logging of metadata has been
completed, write throughput is highly affected by where this logging is being
done. Therefore, we suggest using the INLINE logging capabilities. We place the
log in the LV with the JFS2 file system. We do not suggest setting any specific
size for the log. The INLINE log defaults to 0.4% of the LV size if logsize is not
specified. This is enough in most cases.
Another consideration for file systems is the mount options. They can be
specified in the crfs command with the -a option parameter, or they can be
specified at mount time. We recommend specifying them at crfs time to ensure
the proper options are used every time the file system is mounted. We suggest
using the release-behind-when-read option (rbr).
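For illustration, a file system that follows both suggestions could be created
along these lines (a sketch; the LV name and mount point are ours, and the exact
crfs attribute syntax might vary by AIX level):
# jfs2 file system with an INLINE log and the release-behind-read mount option
crfs -v jfs2 -d data01lv -m /db2data01 -A yes -a logname=INLINE -a options=rbr
mount /db2data01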
When sequential reading of a file in the file system is detected, the real memory
pages used by the file are released after the pages are copied to internal buffers.
This solution addresses a scaling problem when performing sequential I/O on
large files whose pages are not reaccessed in the near future. When writing a
large file without using release-behind, writes go fast whenever there are
available pages on the free list.
When the number of pages drops to the value of the minfree parameter, the
virtual memory manager uses its Least Recently Used (LRU) algorithm to find
candidate pages for eviction. As part of this process, the virtual memory
manager needs to acquire a lock that is also being used for writing. This lock
contention might cause a sharp performance degradation.
3.7 Conclusion
This chapter described the storage principles guiding the storage layout, the
mapping of physical storage to logical storage, the layout of file systems, and
how DB2 uses those. We looked into OLTP and DW environments and what the
best practices are in each case. In general, we must design to spread our
hardware evenly, push as much functionality as possible down to the hardware,
have a simple LVM design and let DB2 handle the storage automatically. When
not sure how to configure certain parameters, the defaults are always a good
start. Then, proper monitoring helps you adjust. It is important to check
performance as the system is being built, because after the whole storage is laid
out, one change can mean a lot of rework, in particular when it comes to the
lower layers of the design.
Chapter 4. Monitoring
Today’s environments range from stand-alone systems to complex combinations
of database servers and clients running on multiple platforms. In any type of
system, the common key for successful applications is performance. Although
performance might initially be good, as time goes on, the system might need to
serve more users, store more data, and process more complex queries.
Consequently, the increased load level on the system affects its performance.
This can be the time to upgrade to more powerful equipment. However, before
investing in equipment, you might be able to improve performance by monitoring
and tuning your environment.
This chapter provides an overview of the tasks involved in monitoring and tuning
DB2, AIX, and storage to obtain optimal performance and identify bottlenecks. It
describes how to monitor the system resource usage using AIX commands (such
as nmon, iostat, vmstat, and so forth). It also discusses the methods available to
monitor the database activity, using tools such as the Snapshot™ Monitor,
Activity Monitor and db2pd. Based on the information provided by these
commands and tools, you can make informed decisions about what actions need
to be taken to tune the database environment. Later in the chapter we include a
few scenarios in which we can use this monitoring information to determine
where the bottleneck can be.
This chapter has the following sections:
› “Understanding the system” on page 131
› “Benchmarking” on page 131
› “Determine the possible causes” on page 132
› “Planning monitoring and tuning” on page 133
› “Monitoring tools for DB2” on page 134
› “Monitoring enhancements in DB2 9.7” on page 149
› “Monitoring tools for AIX” on page 169
› “Monitoring scenarios” on page 185
4.1 Understanding the system
It is important to have a good understanding of your system to figure out what
tools to use in the monitoring of bottlenecks and where they might lie. For
example, look at Figure 4-1, which details the several components that might
typically exist in today’s OLTP systems and data warehouses. After you know the
architecture, it becomes easier to perform calculated tests at the various levels to
determine where the problem can lie and what evidence to look for which
suggests a possible cause and consequently a solution.
4.2 Benchmarking
After following the various best practices to configure the system, it is then
desirable to run a benchmark and establish the baseline. That is, for a given
workload X the response time for the various queries or jobs is Y. This applies to
both OLTP and data warehouses such that you can establish a baseline that
dictates how long a particular query or report takes. It is important to establish a
baseline and collect evidence of the system performing at this level, because if
the system performs at a substandard level in the future, you have comparison
data that can aid in determining what might have led to the performance
degradation.
The last point worth considering is the analogy of comparing apples to apples.
For example, after migrating the various applications from the testing
environment to the production environment, the performance might be better or
worse. In such a situation a holistic picture must be taken to compare the
workload, configuration, hardware, and the like.
It is these latter shortages and poor system and database design that are the
focus of this chapter.
Note: Do not forget the Pareto's Principle, also known as the 80-20 rule. In our
context, 80% of the performance benefits can be the result of tuning 20% of
the various parameters.
4.5 Monitoring tools for DB2
DB2 provides a suite of monitoring tools that can be effectively used to diagnose
performance problems. The collective power of AIX and DB2 diagnostics can be
used to resolve performance issues quickly. In this section, we discuss the most
useful DB2 performance diagnostics, ones that we can rely on to determine the
overall health of the database.
DB2 Monitoring Tools can be broadly classified into the following two categories:
› Point-in-time monitoring
› Traces
4.5.2 Traces
Tools in this category provide motion picture monitoring, which records the
execution of a series of individual activities. This is achieved with trace-like
mechanisms, such as event monitors (especially statement event monitors) and
WLM activity monitors. These tools provide a much more detailed picture of
system activity and therefore, produce huge volumes of data. This imposes a
much greater overhead on the system. They are more suitable for exception
monitoring when you need in-depth understanding of what is causing a
performance issue.
The recent trend in DB2 releases has been to move towards providing SQL
access to data generated by the monitoring tools. This makes monitoring more
manageable, as it allows you to redirect the output of snapshots and event
monitors back into tables. You can use the power of SQL to delve through the
historical monitoring data and get insights into your system performance.
All snapshot elements provide good information, but let us look at the most
useful ones: the Key Performance Indicators.
Temporary data and index hit ratio:
1 - ((POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS) /
(POOL_TEMP_DATA_L_READS + POOL_TEMP_INDEX_L_READS))
Ideally we want this to be high to avoid physical I/O on temporary data. However,
for large scans a low hit ratio is unavoidable.
Average read time:
POOL_READ_TIME / (POOL_DATA_P_READS + POOL_INDEX_P_READS +
POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
Example 4-4 Average Asynchronous/Synchronous Read Time
Average asynchronous read time:
POOL_ASYNC_READ_TIME / (POOL_ASYNC_DATA_READS + POOL_ASYNC_INDEX_READS)
Average synchronous read time:
(POOL_READ_TIME - POOL_ASYNC_READ_TIME) /
((POOL_DATA_P_READS + POOL_INDEX_P_READS +
POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS) -
(POOL_ASYNC_DATA_READS + POOL_ASYNC_INDEX_READS))
Similarly, you can get the average response time for buffer pool writes, as shown
in Example 4-5.
Example 4-5 Average Write Time
POOL_WRITE_TIME / (POOL_DATA_WRITES + POOL_INDEX_WRITES + POOL_XDA_WRITES)
We can further break up the write response time into average asynchronous
write time and synchronous write time, as shown in Example 4-6.
Example 4-6 Average Asynchronous/Synchronous Write Time
POOL_ASYNC_WRITE_TIME /
(POOL_ASYNC_DATA_WRITES +
POOL_ASYNC_INDEX_WRITES +
POOL_ASYNC_XDA_WRITES)
Page Cleaning
In OLTP, page cleaners play a crucial role. They asynchronously write dirty pages
to disk. This ensures that there are clean slots available for use when a new
page needs to be read into the buffer pool. We want to make sure that most of
the writes are being done asynchronously by the page cleaners. If not, the agents
are doing the writes themselves, which is expensive. See Example 4-9.
Example 4-9 Percentage of asynchronous writes
(POOL_ASYNC_DATA_WRITES + POOL_ASYNC_INDEX_WRITES + POOL_ASYNC_XDA_WRITES) /
(POOL_DATA_WRITES + POOL_INDEX_WRITES + POOL_XDA_WRITES)
Prefetch Ratio
This represents the percentage of asynchronous reads done by the prefetchers
and is particularly relevant for sequential scans. The RoT is to keep this close to
100%. The prefetch ratio is typically high for data warehouse workloads:
(POOL_ASYNC_DATA_READS + POOL_ASYNC_INDEX_READS) /
(POOL_DATA_P_READS + POOL_INDEX_P_READS +
POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
Rows Read/Row Selected Ratio
This is a useful metric. It determines how many rows had to be read to return the
selected rows. A high ratio is an indication of large table scans and suggests
the possibility of creating indexes. This is particularly relevant for OLTP, where we do
not want this ratio to be greater than 10–20. You can drill down to Statement
Snapshots to find the culprit SQLs. Look for SQLs that have Rows Read much
higher than the Number of Executions.
ROWS_READ / ROWS_SELECTED
Note that repeated in-memory table scans can also consume significant CPU.
Package Cache Hit Ratio
(1 - (pkg_cache_inserts / pkg_cache_lookups)) * 100%
A low package cache hit ratio is either because we are re-compiling the same
SQL statement with different literal values or because of a low package cache
size. The RoT is to have it close to 100% at steady state.
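As an illustration, the ratio can be computed from the database snapshot
administrative view with a query along these lines (a sketch; SYSIBMADM.SNAPDB
exposes the pkg_cache_lookups and pkg_cache_inserts elements):
SELECT DB_NAME,
       DEC((1 - (FLOAT(PKG_CACHE_INSERTS) /
                 FLOAT(NULLIF(PKG_CACHE_LOOKUPS, 0)))) * 100, 5, 2)
         AS PKG_CACHE_HIT_RATIO_PCT
FROM SYSIBMADM.SNAPDB;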
Sort Metrics
When using private sort memory, SHEAPTHRES is a soft limit. If the
SHEAPTHRES limit is exceeded, new sorts get significantly less than their
optimal amount of memory, degrading query performance. Database Manager
Snapshot can be used to determine if we exceeded the SHEAPTHRES. If the
Private Sort heap high water mark is greater than SHEAPTHRES, consider
increasing SHEAPTHRES.
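A quick way to check is to compare the snapshot value against the configured
limit, for example (a sketch):
db2 get snapshot for database manager | grep -i "sort"
db2 get dbm cfg | grep -i SHEAPTHRES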
Lock Metrics
Contention issues, deadlocks, and lock timeouts can be attributed to poor
application design. They can drastically affect the response time and throughput.
› LOCK_ESCALS
› LOCK_TIMEOUTS
› DEADLOCKS
A deadlock is created when two applications are each locking data needed by
the other, resulting in a situation when neither application can continue to
process.
The best tool for collecting information about a deadlock is a detailed deadlock
event monitor, which must be defined and enabled before the problem occurs. If
you do not have the default deadlock event monitor running, you must create and
enable a detailed deadlock event monitor.
For information about the default deadlock monitor, see the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com
.ibm.db2.udb.admin.doc/doc/c0005419.htm
Dynamic SQL Metrics
Dynamic SQL Snapshot gives us good insights into all dynamic SQLs processed.
This helps us determine the hottest statements in the database and see if there
is an opportunity to improve their performance. You can query the
SYSIBMADM.SNAPDYN_SQL to access dynamic SQL snapshot data.
There are instances when you notice a big difference between the execution time
and User + System CPU Time.
TOTAL_EXEC_TIME - (TOTAL_USR_CPU_TIME + TOTAL_SYS_CPU_TIME)
This is because Total Execution time includes all the white space between open
and close (fetches, application code, I/O wait and network wait). It is not a good
indicator of real work being done.
You can query the SYSIBMADM.TOP_DYNAMIC_SQL view to find the hottest
statements in the database and sort the result set by various metrics (for
example, number of executions, average execution time, and number of sorts
per execution).
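For example, a query of this form ranks statements by the number of executions
(a sketch; the view also exposes AVERAGE_EXECUTION_TIME_S and STMT_SORTS):
SELECT NUM_EXECUTIONS,
       AVERAGE_EXECUTION_TIME_S,
       STMT_SORTS,
       SUBSTR(STMT_TEXT, 1, 100) AS STMT_TEXT
FROM SYSIBMADM.TOP_DYNAMIC_SQL
ORDER BY NUM_EXECUTIONS DESC
FETCH FIRST 10 ROWS ONLY;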
The db2top is a tool that uses DB2 snapshot monitoring APIs to retrieve
information about the database system. It is used to calculate the delta values for
those snapshot entries in real time. This tool provides a GUI-like interface in
command-line mode, so that users can more easily interpret each entry. This tool
also integrates multiple types of DB2 snapshots, categorizes them, and presents
them in different windows of its interface.
db2top is included in the following DB2 versions:
› DB2 UDB v8.2 Fixpak 17 and later
› DB2 UDB v9.1 Fixpak 6 and later
› DB2 UDB v9.5 Fixpak 2 and later
› DB2 UDB v9.7 GA and later
The following Web page provides an in-depth discussion on the db2top tool:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/
index.html
4.5.5 db2pd
db2pd is a standalone utility shipped with DB2 starting with V8.2. It provides a
non-intrusive method to view database progress and potential problems. You can
use the power of db2pd to get insights into the DB2 engine. Some examples of
information provided by db2pd are:
› Memory usage
› Agents
› Applications
› Locks
› Buffer pools
› Dynamic package cache
› Static package cache
› Catalog cache
› Logs
› Table and index statistics
In this section, we discuss db2pd options that you can use to diagnose
performance problems.
db2pd –edus
The multi-threaded architecture was introduced on UNIX starting with DB2 9.5.
Prior to DB2 9.5, you could see all active DB2 processes using the ps command on
UNIX. Starting with DB2 9.5, the ps command only lists the parent process,
db2sysc. You can use db2pd –edus to list all DB2 threads, along with their CPU
use. This can be useful to determine which DB2 thread is behind a CPU bottleneck.
See Example 4-10.
Example 4-10 db2pd -edus output
EDU ID  TID             Kernel TID  EDU Name        USR (s)    SYS (s)
219     47914806143296  13976       db2pclnr (DTW)  0.000000   0.000000
218     47914814531904  13975       db2pclnr (DTW)  0.000000   0.000000
217     47914822920512  13944       db2dlock (DTW)  0.000000   0.000000
216     47914831309120  13943       db2lfr (DTW)    0.000000   0.000000
215     47914948749632  13942       db2loggw (DTW)  0.760000   2.450000
214     47914827114816  13925       db2loggr (DTW)  0.090000   0.120000
213     47926399199552  13825       db2stmm (DTW)   0.040000   0.010000
191     47914835503424  13802       db2agent (DTW)  45.280000  7.810000
190     47914839697728  13801       db2agent (DTW)  46.310000  7.610000
189     47914843892032  13800       db2agent (DTW)  44.950000  7.600000
188     47914848086336  13799       db2agent (DTW)  46.000000  7.700000
You can use the AIX ps command, ps -mo THREAD -p <db2sysc pid>, as well to
get details about the EDU threads.
-memsets is used to gain a quick, detailed view of how much memory each DB2
memory set is using. See Example 4-11 on page 144.
–mempools is used to drill down further into the memory pools that constitute a
memory set.
Example 4-11 Memory set
db2pd -memset
Database Partition 0 -- Active -- Up 0 days 10:07:04
Memory Sets:
Name   Address             Id         Size(Kb)  Key         DBP  Type  Unrsv(Kb)  Used(Kb)  HWM(Kb)  Cmt(Kb)  Uncmt(Kb)
DBMS   0x0780000000000000  28311566   36288     0x931FF261  0    0     0          14336     17024    17024    19264
FMP    0x0780000010000000  223346762  22592     0x0         0    0     2          0         576      22592    0
Trace  0x0770000000000000  362807300  137550    0x931FF274  0    -1    0          137550    0        137550   0
In Example 4-11:
› Name is the name of the memory set.
› DBP is the database partition server that owns the memory set.
You can further delve into the logical memory of a memory pool using the db2pd
–memblocks option. You can view this information at the database level as well to
get an overview of the database memory set and memory pools.
If the Used size of a memory set is much higher than the sum of the physical
sizes of its memory pools, it might indicate a memory leak.
db2pd -tcbstats
This option displays information about the size and activity of tables. The Scan
column can be particularly useful in determining table scans. A steadily
increasing Scan value on a table can indicate a poor plan and a potential
requirement to create an index. In this example, there have been 30419 table
scans on the LAST_TRADE table. Note that although LAST_TRADE is a small
table containing only 643 rows, repeated in-memory table scans can lead to a
potential CPU bottleneck. See Example 4-13.
Dynamic Cache:
Current Memory Used 4066784
Total Heap Size 4068474
Cache Overflow Flag 0
Number of References 954293
Number of Statement Inserts 12343
Number of Statement Deletes 12072
Number of Variation Inserts 12399
Number of Statements 271
Note: Tests done in the lab on an OLTP workload suggest that the
performance overhead of collecting Snapshots (all switches ON) is 6%,
while it is only 3% with In-memory metrics.
› Better control over the granularity of monitoring. You can enable monitoring
only for specific service classes.
› A common interface through Package Cache for monitoring both Static and
Dynamic SQL
You can also monitor only a set of workloads by enabling monitoring only for the
service class to which the workload maps. The database level decides the
minimum level of monitoring. For example, if MON_REQ_METRICS=BASE,
then request metrics are collected regardless of service class setting.
MON_REQ_METRICS is set to BASE for new databases and to NONE for
migrated databases.
System in-memory metrics are exposed through the following table functions:
› MON_GET_UNIT_OF_WORK, MON_GET_UNIT_OF_WORK_DETAILS
› MON_GET_WORKLOAD, MON_GET_WORKLOAD_DETAILS
› MON_GET_CONNECTION, MON_GET_CONNECTION_DETAILS
› MON_GET_SERVICE_SUBCLASS,
MON_GET_SERVICE_SUBCLASS_DETAILS
Let us see a few examples.
List the active connections where we are spending most of our time
If the application handle is NULL, details are returned for all connections.
Member is used to specify the node in a DPF environment. Specify -1 for the
current node and -2 for all nodes.
TOTAL_RQST_TIME is the total time DB2 spent for all requests for an
application. ACT_COMPLETED_TOTAL represents the number of activities
(SQL statements) executed by the application. See Example 4-17.
In the output, application handle = 37 has executed only one activity, while the
other applications have completed a relatively high number of activities.
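A query of roughly this shape produces such a list (a sketch, not the book's
Example 4-17):
SELECT APPLICATION_HANDLE,
       TOTAL_RQST_TIME,
       ACT_COMPLETED_TOTAL
FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS T
ORDER BY TOTAL_RQST_TIME DESC;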
What are the transactions (UOW) the connections are currently executing?
APPLICATION_HANDLE  UOW_ID  TOTAL_RQST_TIME  WORKLOAD_OCCURRENCE_STATE  UOW_START_TIME
37                  1       65444            UOWEXEC                    2009-12-07-08.41.42.321700
13                  5455    0                UOWEXEC                    2009-12-07-08.43.27.979914
12                  5500    0                UOWEXEC                    2009-12-07-08.43.28.009260
11                  5457    0                UOWEXEC                    2009-12-07-08.43.28.030864
10                  5377    0                UOWEXEC                    2009-12-07-08.43.28.008118
9                   5445    0                UOWEXEC                    2009-12-07-08.43.28.020324
15                  12      0                UOWEXEC                    2009-12-07-08.43.27.943657
8                   5509    0                UOWWAIT                    2009-12-07-08.43.27.959482
14                  5417    0                UOWEXEC                    2009-12-07-08.43.28.025917
7                   5480    0                UOWEXEC                    2009-12-07-08.43.28.023564
10 record(s) selected.
If you have configured WLM, you can obtain metrics at the service class level and
workload level. Let us see an example:
› Determine the number of completed activities and CPU Time consumed by
each service class
We can query MON_GET_SERVICE_SUBCLASS. See Example 4-19.
SERVICE_SUPERCLASS          SERVICE_SUBCLASS    TOTAL_CPU   SQL_EXECUTED
FINANCE                     TRANSACTCLASS       5673098558  43994853
FINANCE                     REPORTCLASS         1341307246  66
SYSDEFAULTUSERCLASS         SYSDEFAULTSUBCLASS  628763      13
SYSDEFAULTMAINTENANCECLASS  SYSDEFAULTSUBCLASS  113371      0
SRVCLASS                    SYSDEFAULTSUBCLASS  0           0
SYSDEFAULTSYSTEMCLASS       SYSDEFAULTSUBCLASS  0           0
6 record(s) selected.
If you want to view the metrics of a particular class, specify the service super
class name and service subclass name as first two arguments of
MON_GET_SERVICE_SUBCLASS.
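For all classes, a query of roughly this form can be used (a sketch; passing empty
strings for the class names returns all service subclasses):
SELECT VARCHAR(SERVICE_SUPERCLASS_NAME, 27) AS SERVICE_SUPERCLASS,
       VARCHAR(SERVICE_SUBCLASS_NAME, 19)   AS SERVICE_SUBCLASS,
       SUM(TOTAL_CPU_TIME)                  AS TOTAL_CPU,
       SUM(ACT_COMPLETED_TOTAL)             AS SQL_EXECUTED
FROM TABLE(MON_GET_SERVICE_SUBCLASS('', '', -2)) AS T
GROUP BY SERVICE_SUPERCLASS_NAME, SERVICE_SUBCLASS_NAME;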
It is good practice to configure WLM such that there is one-to-one mapping
between an application and a workload/service sub class. This enables you to
determine the performance characteristics of an application by looking at service
class/workload monitoring metrics.
The database level decides the minimum level of monitoring. For example, if
MON_ACT_METRICS=BASE, then in-memory metrics are collected regardless
of the service class setting. MON_ACT_METRICS is set to BASE for new
databases and to NONE for migrated databases. Activity metrics are exposed through:
› MON_GET_PKG_CACHE_STMT (Both static and dynamic SQL)
› MON_GET_ACTIVITY_DETAILS (XML)
Continuing our previous example, we want to find details about the SQL
statements that the problematic UOW is currently executing.
1 2009-11-23-12.26.10.824045
Next, we use the activity_id to determine which SQL statement corresponds to
this activity and to determine where it is spending time. See Example 4-21.
SELECT actmetrics.application_handle,
actmetrics.activity_id,
actmetrics.uow_id,
varchar(actmetrics.stmt_text, 50) as stmt_text,
actmetrics.total_act_time,
actmetrics.total_act_wait_time,
CASE WHEN actmetrics.total_act_time > 0
THEN DEC((
FLOAT(actmetrics.total_act_wait_time) /
FLOAT(actmetrics.total_act_time)) * 100, 5, 2)
ELSE NULL
END AS PERCENTAGE_WAIT_TIME
FROM TABLE(MON_GET_ACTIVITY_DETAILS(37, 1 , 2, -2)) AS ACTDETAILS,
XMLTABLE (XMLNAMESPACES( DEFAULT 'http://www.ibm.com/xmlns/prod/db2/mon'),
'$actmetrics/db2_activity_details'
PASSING XMLPARSE(DOCUMENT ACTDETAILS.DETAILS) as "actmetrics"
COLUMNS "APPLICATION_HANDLE" INTEGER PATH
'application_handle', "ACTIVITY_ID" INTEGER PATH 'activity_id',
"UOW_ID" INTEGER PATH 'uow_id',
"STMT_TEXT" VARCHAR(1024) PATH 'stmt_text',
"TOTAL_ACT_TIME" INTEGER PATH 'activity_metrics/total_act_time',
"TOTAL_ACT_WAIT_TIME" INTEGER PATH 'activity_metrics/total_act_wait_time'
) AS ACTMETRICS;
Unlike the Dynamic SQL Snapshot, we get the Rows Returned metric, which can
quickly help us determine the SQL statements that have a high Rows Read/Rows Returned
Ratio. See Example 4-22.
Example 4-22 What are my top 5 hottest SQL statements, sorted by Rows Read?
SELECT
NUM_EXECUTIONS, STMT_EXEC_TIME,
ROWS_READ, ROWS_RETURNED,
SUBSTR(STMT_TEXT,1, 200) AS STMT_TEXT
FROM
(SELECT EXECUTABLE_ID
FROM TABLE ( MON_GET_PKG_CACHE_STMT (NULL, NULL, NULL, -2) )
ORDER BY ROWS_READ DESC FETCH FIRST 5 ROWS ONLY ) T1,
TABLE (MON_GET_PKG_CACHE_STMT(NULL,T1.EXECUTABLE_ID ,NULL,-2));
Database object monitoring
This provides reporting from the perspective of operations performed on
particular database objects (table, index, buffer pool, table space and container).
It provides a complementary perspective of database operation to the workload,
allowing the user to pinpoint issues from a data object centric perspective rather
than a workload centric perspective. See Figure 4-4.
When agents perform work on the system and gather low-level monitoring
metrics, relevant metrics are propagated to accumulation points in in-memory
data objects as well as the aforementioned accumulation points in the workload
framework.
Example 4-23 Determine the tables that are generating table scans
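A query of roughly this form answers the question (a sketch based on the
MON_GET_TABLE table function; the query in the original example might differ):
SELECT VARCHAR(TABSCHEMA, 20) AS TABSCHEMA,
       VARCHAR(TABNAME, 30)   AS TABNAME,
       TABLE_SCANS
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
ORDER BY TABLE_SCANS DESC
FETCH FIRST 10 ROWS ONLY;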
You can then query the package cache to see which statements are executed
against these tables and explore the possibility of creating indexes to eliminate
the table scan.
Time-spent metrics
Identifying the resource bottleneck is the first and the key step towards any
performance investigation. Understanding the bottleneck can help you rule out a
lot of possible problems early on. See Figure 4-5.
DB2 9.7 introduces the concept of time-spent metrics, a set of metrics that
provides a comprehensive breakdown of where and how time is being spent
inside DB2. They give us valuable clues about the source of problems by telling
where DB2 is spending most of its time. When used with the AIX Monitoring
tools, they can drive the investigation in the right direction. There are three
categories of time-spent metrics:
› Wait Time
› Component Time
› Component Processing Time
Wait Time
This category helps answer the following question: “What percentage of total
time in DB2 is spent on waits, and what are the resources we are waiting on?”
Examples are:
› Buffer pool Read/Write Wait Time
(POOL_READ_TIME + POOL_WRITE_TIME)
› Direct Read/Write Wait Time
(DIRECT_READ_TIME + DIRECT_WRITE_TIME)
› Lock Wait Time
(LOCK_WAIT_TIME)
› Log I/O Wait Time
(LOG_DISK_WAIT_TIME + LOG_BUFFER_WAIT_TIME)
› TCP/IP (Send/Receive) Wait Time
(TCPIP_SEND_WAIT_TIME + TCPIP_RECV_WAIT_TIME)
› FCM (Send/Receive) Wait Time
(FCM_SEND_WAIT_TIME + FCM_RECV_WAIT_TIME)
Note: The Buffer pool Read Time and Buffer pool Write Time represent the
synchronous read and write time.
Component Time
This category helps answer the following question: “Where am I spending time
within DB2?”
Figure 4-6 Wait time versus processing time with respect to overall request time
There is another interesting dimension you can derive from the Time Spent
metrics: you can determine how much time is being spent in the stages of DB2
processing using the Component Time metrics. See Figure 4-9.
A few important points to keep in mind when looking at the time metrics are:
› All time metrics, except for CPU Time, are in milliseconds. CPU Time is in
microseconds.
› All time metrics, including TOTAL_RQST_TIME, are the sum of all work done
by the agents working on a query, and not the elapsed clock time.
In other words, when we have parallelism between the agents, the values are
greater than the elapsed time. TOTAL_APP_RQST_TIME gives the actual
execution time.
› The Member column corresponds to DBPARTITIONNUM in snapshots.
› Sometimes you might find that the CPU Time is less than the processing time
(Total Request Time – Total Wait Time). This is common in environments where
you have many concurrent connections.
Processing time represents the time an agent is eligible to run. However, it
does not necessarily mean it is consuming CPU. It might be waiting on the
run queues for a CPU if there are a large number of threads on the system.
› MON_FORMAT_XML_TIMES_BY_ROW is a convenient interface to access
the time spent metrics. Visit the following Web page for more information:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic
=/com.ibm.db2.luw.sql.rtn.doc/doc/r0056557.html
Let us see the time-spent metrics in action. A performance problem has been
reported for an OLTP application with ~40 concurrent connections. The
throughput is low at 10 transactions per second. We use the new administrative
views in this example.
Example 4-24 Converting time spent metrics as percentage of total request time
WITH TimeSpent AS
( SELECT SUM(TOTAL_WAIT_TIME) AS WAIT_TIME,
SUM(TOTAL_RQST_TIME - TOTAL_WAIT_TIME) AS PROC_TIME,
SUM(TOTAL_RQST_TIME) AS RQST_TIME,
SUM(CLIENT_IDLE_WAIT_TIME) AS CLIENT_IDLE_WAIT_TIME
FROM TABLE(MON_GET_CONNECTION(NULL,NULL)) AS METRICS)
SELECT
RQST_TIME,
CASE WHEN RQST_TIME > 0
THEN DEC((FLOAT(WAIT_TIME))/FLOAT(RQST_TIME) * 100,5,2)
ELSE NULL END AS WAIT_PCT ,
CASE WHEN RQST_TIME > 0
THEN DEC((FLOAT(PROC_TIME))/FLOAT(RQST_TIME) * 100,5,2)
ELSE NULL END AS PROC_PCT ,
CLIENT_IDLE_WAIT_TIME
FROM TIMESPENT;
RQST_TIME   WAIT_PCT  PROC_PCT  CLIENT_IDLE_WAIT_TIME
113200332   40.19     59.80     6959603
It shows that 60% of the total request time is spent in processing while the
remaining 40% is spent in wait.
What is the breakup of the wait times inside DB2? To find this, we query
sysibmadm.mon_db_summary to get a summary of the time spent metrics at the
database level. See Example 4-25.
Example 4-25 What is the breakup of the wait times inside DB2
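The view can be queried directly (a sketch; MON_DB_SUMMARY reports the main
wait and processing components as percentages of total request time):
SELECT * FROM SYSIBMADM.MON_DB_SUMMARY;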
In this case, we are running with 40 client threads. Example 4-26 shows how
average CPU use looks in vmstat.
CPU is saturated at 100% - even if agents are blocked on I/O, we have 40 agents
concurrently executing and they are driving the CPU hard. The Runnable queue
is quite high. So it looks more like a CPU bottleneck. The next obvious question
is where is DB2 spending its time in processing. See Example 4-27.
1 record(s) selected.
This shows that 50% of our processing time is spent in compiling rather than
executing queries. We are constantly re-compiling queries either because the
package cache is small or possibly because we are using the same SQL with
other literals. We can further check the package cache hit ratio and monitor the
SQLs to see if there is a possibility to use parameter markers, instead of literals.
The time-spent metrics, along with vmstat, helped us identify the source of the
performance issue.
The time-spent metrics are available at both the system level and activity level.
They are also available in the new event monitors introduced in DB2 9.7. More
details about the Time Spent Metrics and other monitoring enhancements are
available in DB2 9.7 Information Center:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/c
om.ibm.db2.luw.wn.doc/doc/c0056333.html
Details on installing and using nmon can be found in the following Web page:
http://www.ibm.com/developerworks/aix/library/au-analyze_aix/
In this section, we focus on using these utilities and tools to help identify
bottlenecks in various DB2 workloads while monitoring DB2.
Virtual memory management statistics, VMSTAT
The vmstat command is used to report statistics about kernel threads in the run
and wait queues, memory, paging, disks, interrupts, system calls, context
switches, and CPU activity. If the vmstat command is used without any options or
only with the interval and optionally, the count parameter, such as vmstat 2, then
the first line of numbers is an average since system reboot.
Syntax
vmstat interval count
Reporting
The column headings in the report are:
› r: Number of processes on run queue per second
› b: Number of processes awaiting paging in per second
› avm: Active virtual memory pages in paging space
› fre: Real memory pages on the free list
› re: Page reclaims; free, but claimed before being reused
› pi: Paged in (per second)
› po: Paged out (per second)
› fr: Pages freed (page replacement per second)
› sr: Pages per second scanned for replacement
› cy: Complete scans of page table
› in: Device interrupts per second
› sy: System calls per second
› cs: CPU context switches per second
› us: User CPU time percentage
› sys: System CPU time percentage
› id: CPU idle percentage (nothing to do)
› wa: CPU waiting for pending local Disk I/O
The important columns in VMSTAT output to look for are id, sys, wa, and us while
monitoring CPU of the AIX system.
Figure 4-10 shows a case of a high CPU issue being identified using the
VMSTAT utility. The system is hitting an average of 65% user CPU usage (us
column) and 35% system (sy column) CPU usage. Pi and Po column values are
equal to 0, so there are no paging issues. The wa column shows there does not
seem to be any I/O issues. See Figure 4-10.
The VMSTAT snapshot in Example 4-28 on page 172 shows the system
monitoring activity in an uncapped micro-partitioning environment with an initial
CPU entitlement of 0.4, highlighted as “ENT” in the output. With shared pools and
LPAR micro-partitioning enabled, note the two additional columns, PC and EC,
shown in the vmstat output.
The snapshot from VMSTAT in Figure 4-11 shows the wa (waiting on I/O) column
to be unusually high. This indicates there might be I/O bottlenecks on the system,
which in turn, causes the CPU usage to be inefficient.
Syntax
iostat interval count
Reporting columns
Reporting columns are:
› %tm_act
Reports back the percentage of time that the physical disk was active or the
total time of disk requests.
› Kbps
Reports back the amount of data transferred to or from the drive in kilobytes per second.
› tps
Reports back the number of transfers-per-second issued to the physical disk.
› Kb_read
Reports back the total data (kilobytes) from your measured interval that is
read from the physical volumes.
› Kb_wrtn
Reports back the amount of data (kilobytes) from your measured interval that
is written to the physical volumes.
Figure 4-12 Snapshot from iostat showing a %tm_act value greater than 40%, indicating an
I/O bottleneck
To check if there is resource contention, focus on the %tm_act value from the
output. An increase in this value, especially more than 40%, implies that
processes are waiting for I/O to complete, and there is an I/O issue. For more
details on IOSTAT, visit the AIX Information Center:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/
com.ibm.aix.prftungd/doc/prftungd/iostat_command.htm
Process State PS
This utility is used to display the status of currently active processes. Apart from
using this utility to understand the CPU usage of an active process, it can be
used to diagnose or identify performance issues (such as memory leaks and hang
situations). For details on PS utility usage, see the following Information Center
Web page:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/
com.ibm.aix.prftungd/doc/prftungd/ps_command.htm
TOPAS
This utility is used to get the CPU usage of a certain process in the AIX system. It
is used to identify which processes are consuming the most CPU in the AIX
system when the VMSTAT command shows that there is a CPU bottleneck in the
system. See Figure 4-14.
Figure 4-14 Snapshot from topas is used to establish that the DB2 engine is contributing
to the CPU bottleneck
topas can be used with several options. One common option used in virtualized
environments is topas –C. This command collects a set of metrics from AIX
partitions running on the same hardware platform. For details on using TOPAS,
see the following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmd
s/doc/aixcmds5/topas.htm
Netstat
The netstat command is used to show network status. It can be used to
determine the amount of traffic in the network to ascertain whether performance
problems are due to network congestion. See the following Web page for details
on its usage:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix
.prftungd/doc/prftungd/using_netstat.htm
SVMON –G is a common utility that presents global statistics and can help in
determining the available memory in the system. It is quite effective for estimating
the real memory used by a DB2 instance.
Tip: The VMSTAT –v command can be used to get similar information.
SVMON can be used to monitor the number of 4-kilobyte and 64-kilobyte page
frames on an AIX POWER system. For example, to display DB2 process statistics
about each page size, use the -P flag with the DB2 process ID (PID) with the
svmon command. Figure 4-15 shows the same.
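For example, something along these lines (a sketch; the PID placeholder must be
replaced with the actual db2sysc process ID):
ps -ef | grep [d]b2sysc     # find the db2sysc process ID
svmon -P <db2sysc-pid>      # per-process memory statistics, broken down by page size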
Figure 4-16 shows the first panel seen when nmon is used. The first panel shows
the AIX level and Processor type details. Use nmon –h for full details of usage
help.
There are separate sections in one panel, each representing a separate system
monitoring parameter. Each of these can be mapped to its corresponding AIX
monitoring utility as discussed in the earlier sections of this chapter. For
example, the topas command equivalent can be seen in the “Top Processes”
section of nmon. Similarly, the vmstat output equivalent can be seen in the
“CPU-Utilization-Small-View” section of the nmon panel.
The same performance thresholds also apply here and can be used as a
reference to identify with nmon whether a performance bottleneck is encountered.
For example, if a CPU-bound situation is suspected, check the
“CPU-Utilization-Small-View” section of the nmon output as shown in Figure 4-19
on page 182. If the user% + sys% goes beyond 80%, the system is CPU bound.
Figure 4-19 shows that in this case, the user% is 77% and sys% is 2.2%, which is
about 80%. The Wait% is zero, signifying that there is no CPU time spent waiting
for other activity (such as I/O) to be completed.
Figure 4-20 provides a sample CPU usage pattern captured over a period of time
by nmon and generated by the nmon analyzer. This is equivalent to the vmstat
output shown in Figure 4-19.
Starting with the vmstat output it is not apparent as to where exactly the
bottleneck can be, even though we see plenty of kernel threads on the run queue
and likewise with time spent in wait. The focus now turns to DB2 data, which is
collected when the performance degradation was observed. Again, we employ a
top-down approach starting with the dbm snapshot and working down to
individual queries if needed. This is where the quality of information becomes so
important. Based on the quality of this information, one can narrow down and
hypothesize about where the problem can lie. In our case a 3x degradation is
experienced by insert and update statements for all the applications. Knowing
this, we can look into factors that affect more than one query, as opposed to, say,
an access plan for any given query. See Example 4-31.
Private Sort heap high water mark = 7050
Post threshold sorts = 0
Piped sorts requested = 3644324
Piped sorts accepted = 3644324
14:29:58.578177
Again, we cannot deduce where the problem lies so we proceed with the next set
of snapshots, the database snapshots, which are taken 70 seconds apart.
Example 4-32 shows the first snapshot.
Database Snapshot
Lock list memory in use (Bytes) = 62960
Deadlocks detected = 10
Lock escalations = 198
Exclusive lock escalations = 177
Agents currently waiting on locks = 0
Lock Timeouts = 52
Number of indoubt transactions = 0
Workspace Information
Database Snapshot
Shared Sort heap high water mark = 0
Total sorts = 3645714
Total sort time (ms) = 3000542
Sort overflows = 2089
Active sorts = 0
Workspace Information
This leads us to the suspicion that the bottleneck might be in I/O. Based on this
we need to determine which table spaces this I/O is performed against. To find
this information we observe the table space snapshot and look for a table space
that shows high direct reads and writes as per the previous findings. This reveals
that LONG_TS is the table space against which the vast majority of direct reads
and writes are spending the most time. See Example 4-34.
Number of containers = 1
Container Name = /db2/SAMPLE/TBSP1
Container ID = 0
Container Type = File (extent sized tag)
Total Pages in Container = 27848630
Usable Pages in Container = 27848576
Stripe Set = 0
Container is accessible = Yes
So, what is contained in this table space? Looking at the Tablespace Content
Type we can see that it is Long Data Only. This gives us a good piece of
information that the problem has something to do with the reading and writing of
long data (for example, clobs/lobs) against this table space. However, consider
the output of other monitoring data before reaching any conclusion.
At this point, rather than proceeding with looking into the buffer pools/tables and
various other snapshot data, let us focus our attention to tracing from a DB2 and
AIX point of view. Using the perfcount option of a DB2 trace, it can be seen in
Example 4-35 that DB2 is spending a significant amount of time in the following
DB2 functions:
DB2 trace: The db2trc command controls the trace facility of a DB2 instance
or the DB2 Administration Server (DAS). The trace facility records information
about operations and puts this information into a readable format.
Even without deep knowledge of DB2, you can estimate that
sqldxWriteLobData, sqldxReplaceLob, sqldxCreateLob, and sqldxLobCreate
appear to be functions that manipulate LOB data on disk. This is consistent with
the observation regarding direct writes and reads against a long table space.
Even though I/O operations against a DMS long data table space are direct
I/O, we might not be taking full advantage of concurrent I/O operations against
this table space. At this point it was decided to alter the table space to disable
file system caching, which permits concurrent I/O against the table space
containers.
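The change was along these lines (a sketch; LONG_TS is the table space
identified earlier):
db2 "ALTER TABLESPACE LONG_TS NO FILE SYSTEM CACHING"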
Having made this change, the application throughput increased back to the
expected level, that is, the 3x performance degradation was alleviated.
Example 4-37 Table spaces are defined on the following physical/logical volumes
hdisk0:
LV NAME          LPs  PPs  DISTRIBUTION           MOUNT POINT
pmrlv            160  160  00..112..48..00..00    /pmr
paging00         40   40   00..00..40..00..00     N/A
db2lv            20   20   20..00..00..00..00     /db2
data2vgjfslog    1    1    00..00..00..00..01     N/A
hdisk2:
LV NAME          LPs  PPs  DISTRIBUTION           MOUNT POINT
data1vg_jfslogl  1    1    01..00..00..00..00     N/A
homelv           420  420  10..112..111..111..76  /home2
data1vg_raw_lv5  20   20   20..00..00..00..00     N/A
data1vg_raw_lv2  20   20   20..00..00..00..00     N/A
data1vg_raw_lv1  20   20   20..00..00..00..00     N/A
data1vg_raw_lv4  20   20   20..00..00..00..00     N/A
data1vg_raw_lv3  20   20   20..00..00..00..00     N/A
During the creation of the index, table data pages on which the index is to be
created are fetched into the buffer pool if needed. After this, the columns (keys)
on which the index is to be defined are extracted and processed within the
sortheap, if possible. If this is a relatively large index, this process of sorting can
spill into a temporary table space, resulting in I/O. Let us create an index and
observe this happening.
Slow index create
The relevant dbm and db configuration parameters that were set for this test are
as shown in Example 4-38.
time db2 "create index inxall on mystaff ( ID, NAME, DEPT, JOB, YEARS, SALARY,
COMM )"
DB20000I The SQL command completed
successfully. real 16m15.86s
user 0m0.02s
sys 0m0.01s
Let us observe what happened with regards to the index creation. This scenario
is based on DB2 version 9.5, in which the parallelism for the CREATE INDEX
statement is enabled regardless of the setting of the intra_parallel configuration
parameter. From the db2pd -edus output we can see this happening even though
the dbm cfg parameters INTRA_PARALLEL is not enabled. This is because there
are DB2 sub-agents (db2agntdp) performing work on behalf of the coordinator
agent with EDU ID = 1801. See Example 4-40.
After the sort has spilled into the temporary table space in hdisk2, we can see
pages being written to the temporary table space.
This continues for approximately 10 minutes into the index creation, after which
we can see that the sorting has finished and the index itself is being written out to
the index table space on disk, as shown in Example 4-42.
Example 4-42 Index itself is being written out to the index table space
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 0.0 0.0 0.0 0 0
hdisk2 99.9 3264.0 363.3 0 6532
hdisk0 0.0 0.0 0.0 0 0
hdisk2 98.8 3349.0 392.8 0 6472
hdisk0 32.6 210.4 37.1 0 420
hdisk2 100.2 1811.4 184.3 0 3616
hdisk0 60.5 400.0 68.5 0 800
hdisk2 59.0 324.0 55.5 0 648
hdisk0 64.5 400.0 68.5 0 800
hdisk2 73.0 440.0 75.0 0 880
At the same time, using db2pd with the table space output, we can see this
happening until the entire index is created. It consists of 23584 pages as shown
in Example 4-43.
Tablespace Configuration:
Address            Id Type Content PageSz ExtentSz Auto Prefetch BufID BufIDDisk FSC NumCntrs MaxStripe LastConsecPg Name
0x070000003040CDE0 0  DMS  Regular 8192   4        Yes  4        1     1         Off 1        0         3            SYSCATSPACE
0x070000003040D660 1  SMS  SysTmp  8192   32       Yes  32       1     1         On  1        0         31           TEMPSPACE1
0x0700000031F8A120 2  DMS  Large   8192   32       Yes  32       1     1         Off 1        0         31           USERSPACE1
0x0700000031F8A940 3  DMS  Large   8192   32       Yes  32       1     1         Off 1        0         31           IBMDB2SAMPLEREL
0x0700000031F8B160 4  DMS  Large   8192   32       No   172      1     1         On  4        0         31           DATADMS
0x0700000031F8BDA0 5  DMS  Large   8192   32       No   128      1     1         On  4        0         31           INDEXDMS
Tablespace Statistics:
Address            Id TotalPgs UsablePgs UsedPgs PndFreePgs FreePgs HWM   State      MinRecTime NQuiescers
0x0700000031F8BDA0 5  80000    79872     160     0          79712   160   0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     160     0          79712   160   0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     192     0          79680   192   0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     1120    0          78752   1120  0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     1824    0          78048   1824  0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     2336    0          77536   2336  0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     22240   0          57632   22240 0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     22848   0          57024   22848 0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     23456   0          56416   23456 0x00000000 0          0
0x0700000031F8BDA0 5  80000    79872     23584   0          56288   23584 0x00000000 0          0
Correlating to these events we can also see the shared sort usage using
db2mtrk. The same can be noticed using db2pd with the mempools output. Notice
that in Example 4-44 the shared sort increases to 6.8 M. This is roughly the
number of sub-agents multiplied by the sortheap allocated by each of them plus
a little bit extra for the overflow.
Let us see what happens when attempting to create the same index. The index is
created and it takes approximately four minutes. This is an improvement by a
factor of four over the previous index creation. See Example 4-46.
real 4m22.45s
user 0m0.01s
sys 0m0.01s
Let us observe the I/O on the disks. In the previous case there is considerable
write activity against hdisk2, which is spilling into the temporary table space. In
this case however, there is considerable read activity against both hdisk0 and
hdisk2, which is where the table data reside. See Example 4-48 on page 207.
Example 4-48 Observing the I/O on the disks
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 0.0 0.0 0.0 0 0
hdisk2 0.0 0.0 0.0 0 0
hdisk0 38.9 11732.0 310.3 23552 0
hdisk2 62.3 11477.0 302.4 23040 0
hdisk0 46.0 11520.0 302.5 23040 0
hdisk2 51.5 11584.0 294.5 23168 0
hdisk0 43.0 11123.0 297.2 22232 0
hdisk2 51.5 11090.9 289.7 22168 0
Unlike the previous case, in which this continued for 10 minutes, in this case this
activity only lasts approximately 3 minutes, after which sorting has finished and the
index itself is being written out to the index table space on disk. Notice how the
disk use is significantly higher, and likewise the throughput in number of bytes
transferred. This can be attributed to the greater number of prefetchers.
In the previous example this action took 6 minutes as even though the sorting
had been performed, the pages were required to be read from the temporary
table space into DB2's memory and then written out to disk. In this example, the
operation was much quicker because of the greater number of page cleaners
and the fact that the pages to be written out are already in memory. See
Example 4-49.
Example 4-50 db2pd -tablespaces output during the index creation
Tablespace Configuration:
Address            Id Type Content PageSz ExtentSz Auto Prefetch BufID BufIDDisk FSC NumCntrs MaxStripe LastConsecPg Name
0x0700000032A9F700 0  DMS  Regular 8192   4        Yes  4        1     1         Off 1        0         3            SYSCATSPACE
0x0700000031F8A0A0 1  SMS  SysTmp  8192   32       Yes  32       1     1         On  1        0         31           TEMPSPACE1
0x0700000031F8E920 2  DMS  Large   8192   32       Yes  32       1     1         Off 1        0         31           USERSPACE1
0x0700000031F8F140 3  DMS  Large   8192   32       Yes  32       1     1         Off 1        0         31           IBMDB2SAMPLEREL
0x0700000031F78100 4  DMS  Large   8192   32       Yes  128      1     1         Off 4        0         31           DATADMS
0x0700000031F787A0 5  DMS  Large   8192   32       No   128      1     1         On  4        0         31           INDEXDMS
Tablespace Statistics:
Address            Id TotalPgs UsablePgs UsedPgs PndFreePgs FreePgs HWM   State      MinRecTime NQuiescers
0x0700000031F787A0 5  80000    79872     160     0          79712   160   0x00000000 0          0
0x0700000031F787A0 5  80000    79872     160     0          79712   160   0x00000000 0          0
0x0700000031F787A0 5  80000    79872     448     0          79424   448   0x00000000 0          0
0x0700000031F787A0 5  80000    79872     3520    0          76352   3520  0x00000000 0          0
0x0700000031F787A0 5  80000    79872     6112    0          73760   6112  0x00000000 0          0
0x0700000031F787A0 5  80000    79872     17792   0          62080   17792 0x00000000 0          0
0x0700000031F787A0 5  80000    79872     20800   0          59072   20800 0x00000000 0          0
0x0700000031F787A0 5  80000    79872     23584   0          56288   23584 0x00000000 0          0
0x0700000031F787A0 5  80000    79872     23584   0          56288   23584 0x00000000 0          0
Correlating to these events, we can also see the shared sort usage using either
db2mtrk or db2pd with the mempools option. Notice that the shared sort usage
increases to approximately 900 MB. This is a little bit less than the number of
sub-agents multiplied by the sortheap allocated for each one, that is,
6 x 40000 x 4096 = 983,040,000 bytes.
We can see that changing a few parameters at the database configuration and
table space level can lead to a significant improvement in the amount of time it
takes to create an index. Optimizing the database to take full advantage of the
multi-threaded architecture and exploiting the available memory can alleviate
bottlenecks that otherwise remain.
Figure 4-29 The I/O activity looks balanced across the adapters
So far we have not been able to determine any obvious issue with I/O. Let us see
what the DB2 Key Performance Indicators suggest. Is it that DB2 is driving many
physical I/Os? Let us map the LUNs to file systems. See Example 4-53.
/usr/sbin/lspv -l hdisk23
hdisk23:
LV NAME   LPs  PPs  DISTRIBUTION         MOUNT POINT
DATALV7   900  900  00..819..81..00..00  /DATA7
SELECT TABLESPACE_ID, TABLESPACE_NAME, CONTAINER_NAME
FROM TABLE(SNAPSHOT_CONTAINER('PRDDB',-1)) AS T
db2 get db cfg for prddb1 | grep -i "Path to log files"
The TS_TXM table space is using the DATA[1-7] file systems as containers, while
/LGSHR is where the transaction logs of the database are stored. This explains
why we saw a relatively higher percentage of writes on hdisk24.
Let us see what tables are active in TS_TXM table space. See Example 4-55.
The number of rows read is 0. There are no table scans occurring against the
tables.
TABLE_NAME   ROWS_READ
----------   ---------
CUSTOMER     0
ORDERS       0
WAREHOUSE    0
DISTRICT     0
NEW_ORDER    0
STOCK        0
HISTORY      0
ORDER_LINE   0
Example 4-56 shows that the bufferpool hit ratio is quite good.
BP_NAME       TOTAL_HIT_RATIO_PERCENT  DATA_HIT_RATIO_PERCENT  INDEX_HIT_RATIO_PERCENT
IBMDEFAULTBP  99.39                    97.90                   99.90
Let us check the Read Response time in the Snapshot. See Example 4-57.
Synchronous read time is important because it blocks the agents and prevents
them from doing useful work. This shows up as more idle time in vmstat.
Example 4-57 Average read time of 72 ms points to poor I/O response time
BP_NAME       AVERAGE_SYNC_READ_TIME_MS
IBMDEFAULTBP  72
To get details about I/O response time, let us monitor iostat –D at regular
intervals. We are constantly hitting sqfull condition for all LUNs except hdisk24.
sqfull indicates the number of times the service queue becomes full per second.
That is, the disk is not accepting any more service requests. On average,
every I/O transfer is spending 14 milliseconds in the wait queue. There are
two possible reasons: either we have exhausted the bandwidth or the
queue_depth of the device is not set properly. Example 4-58 shows hdisk30 use.
The queue_depth values have been left at the default of 3. The storage
administrator informs us that there are 12 physical disks (spindles) under each
LUN. The rule of thumb is that at least three outstanding I/Os are needed on each
physical disk device to keep it near 100% busy. Because seek optimization is
employed in current storage subsystems, having more than three outstanding
I/Os can improve throughput and response time. The queue_depth of the LUNs
was set to five times the number of spindles. The num_cmd_elems was also
adjusted for the adapters, so that it is at least equal to the sum of the
queue_depth of all LUNs attached to each adapter.
chdev -l hdisk23 -a queue_depth=60
chdev -l fcs0 -a num_cmd_elems=120
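The resulting settings can be verified with lsattr, for example (a sketch using the
same devices):
lsattr -El hdisk23 -a queue_depth
lsattr -El fcs0 -a num_cmd_elems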
sqfull was monitored and no sqfull conditions were observed for the devices. The
average buffer pool synchronous read time was reduced to 11.5 milliseconds.
Analysis
A high level of paging activity is noticed in vmstat when the response time drops.
Take note of the pi and po columns. See Example 4-59.
Let us get more insight into the memory usage. The -v option of vmstat displays
the percentage of real memory being used by the different categories of pages.
vmstat -v was collected and monitored at regular intervals. The minperm and
maxperm are set as per the recommendations. However, numperm and numclient
are high at 70%. This suggests that the file system cache is occupying 70% of the
real memory. See Example 4-61.
/usr/bin/vmstat -v
16318448 memory pages
15690369 lruable pages
  144213 free pages
       2 memory pools
 1060435 pinned pages
    80.0 maxpin percentage
     3.0 minperm percentage   <<- system's minperm setting
    90.0 maxperm percentage   <<- system's maxperm setting
    70.9 numperm percentage   <<- % of memory having non-computational pages
11138217 file pages           <<- No of non-computational pages
     0.0 compressed percentage
       0 compressed pages
    70.5 numclient percentage <<- % of memory containing non-comp client pages
    90.0 maxclient percentage <<- system's maxclient setting
11074290 client pages         <<- No of client pages
       0 remote pageouts scheduled
      36 pending disk I/Os blocked with no pbuf
    3230 paging space I/Os blocked with no psbuf
   21557 filesystem I/Os blocked with no fsbuf
       1 client filesystem I/Os blocked with no fsbuf
       0 external pager filesystem I/Os blocked with no fsbuf
       0 Virtualized Partition Memory Page Faults
    0.00 Time resolving virtualized partition memory page faults
nmon also displays the memory statistics. The MEMNEW tab in the nmon
spreadsheet displays the details about usage of memory. See Figure 4-30.
db2sysc is at the top of the list. The InUse column displays the total number of
client pages DB2 is using. It alone accounts for 80% of numclients (8859432
pages out of 11074290 client pages).
If you want to check how much space db2sysc is occupying in the file system
cache, you can pass db2sysc’s pid as an argument (svmon -c –P <db2sysc-pid>)
Why is DB2 using so much file system cache? Either it is the DB2 logs (less
likely, as this is a data warehouse workload and logging activity is minimal) or file
system caching is ON for a DB2 table space. File system caching is not
recommended for DB2 containers, except for cases where your table space has
LOB data. As DB2 already buffers recently used data in the buffer pool, having file
system caching ON unnecessarily creates two levels of buffering.
In case you are curious, this is how we can determine which files db2sysc is using
from the svmon output. The Description column points to the location and inode of
the file. Let us map /dev/data_lv8:4098 to a file. See Example 4-63.
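One way to do this mapping (a sketch, not the book's Example 4-63; the /DATA8
mount point is an assumption) is to find the file system that resides on data_lv8
and then search it by inode number:
lsfs | grep data_lv8            # identify the mount point backed by /dev/data_lv8
find /DATA8 -xdev -inum 4098    # locate the file with inode 4098 in that file system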
File system caching was ON for one table space. After disabling file system
caching for that table space, there was no paging observed and the data
warehouse performance was back to normal.
In this scenario, we found that DB2 is the culprit. In case you find non-DB2
processes are consuming file system cache, you’d need to determine the cause
and rectify it.
One thing to check, using the DB2 monitor or snapshots, is the buffer pool size
and hit ratio. Remember, a smaller buffer pool size and a lower buffer pool
hit ratio can cause high disk I/O usage, which causes the CPU wait time to go higher.
In this case, the buffer pool was set to auto tuning, and the DB2 monitoring
shows that the DB2 buffer pool hit ratio is 99%, which ruled this out as a cause.
See Figure 4-32.
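For reference, on DB2 9.7 the buffer pool counters can be read with the MON_GET_BUFFERPOOL table function (on DB2 9.5, a buffer pool snapshot returns the same counters); this is a minimal sketch:
db2 "SELECT VARCHAR(BP_NAME,20) AS BP_NAME, POOL_DATA_L_READS, POOL_DATA_P_READS FROM TABLE(MON_GET_BUFFERPOOL(NULL,-2)) AS T"
The data hit ratio is approximately 1 - (physical reads / logical reads).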
DB2 is normally smart: when it does sequential reads, it does not do them in
small units, but instead reads multiple pages at a time (large I/O). In this case,
these large sequential reads see no benefit from read ahead because the
requested I/O size is probably similar to the disk stripe size used by read ahead.
The OLTP workload in this case was suspected of causing the large waits
because its read requests are random I/O as opposed to sequential.
Based on this, the AIX sequential read ahead settings were tuned, and
experimentation led to maximum throughput and reduced response time
such that the negative performance impact was alleviated. The following AIX
settings were investigated to fix this issue.
› ioo -o minpgahead=Number
› ioo -o maxpgahead=Number
Note: Prior to AIX 5.3, the minpgahead and maxpgahead values can be set
with the -r and -R options, respectively, of the vmtune AIX command. For
details on using these options, see AIX 5L Performance Tools Handbook,
SG24-6039.
To fix this issue, the minpgahead value was varied from 0–4 with maxpgahead
varied from 8–16, and tests were carried out for each combination. A minpgahead
of 0 showed the maximum performance improvement for this issue. Remember,
a value of 0 for minpgahead can be useful for workloads that are predominantly
random I/O, because such workloads gain nothing from the sequential read ahead
mechanism. As a result, the response time difference came down from 60–70% to
10% with an increased workload, and it resulted in better throughput values.
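As a reference, the values that performed best in this exercise can be made persistent with ioo (for JFS2 file systems the corresponding tunables are j2_minPageReadAhead and j2_maxPageReadAhead):
ioo -p -o minpgahead=0 -o maxpgahead=16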
Chapter 5. LPAR considerations
In this chapter we discuss at length the best practices for using the
appropriate LPAR type, applying uncapped micro-partitions and multiple shared CPU
pools, and how to manage resources and prioritize resource allocation among
multiple LPARs hosted on a system.
The system shown in Figure 5-1 is configured with four LPARs. The LPAR with
AIX 6.1 is a dedicated processor LPAR. The rest of the LPARs are shared
processor LPARs. The LPARs run a variety of operating systems with
differing amounts of resources.
Figure 5-1 Sample system with CoD CPUs, dedicated physical CPUs, and a shared pool of physical CPUs
After the LPAR is created and the AIX operating system is installed, the user
experience is similar to working on a regular host running AIX. However, there
are new concepts and setup options related to the virtualization that we discuss
in the remainder of the chapter.
Processor and memory resources have minimum, desired, and maximum values
in an LPAR profile. The desired value is used when LPAR is started. If the desired
amount cannot be fulfilled when starting an LPAR, the POWER Hypervisor
attempts to start an LPAR using any amount between the minimum and desired
value. If the minimum amount cannot be fulfilled, the LPAR fails to start.
Rather than provisioning resources for peak usage, use the current workload
resources requirement to allocate LPAR resource. When the demand changes,
use the DLPAR facility to add or remove the required resources. For example, if
the current processor consumption decreases, use DLPAR to remove processor
resources from an LPAR. Similarly, when AIX 6.1 or DB2 needs more memory (in
event of an AIX 6.1 paging activity or low DB2 buffer pool hit ratio, for example),
more memory can be added dynamically. Similarly, the resources can be moved
from one LPAR to another.
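As an illustration only (the managed system and LPAR names are hypothetical), memory and processing units can be moved with the HMC chhwres command:
chhwres -r mem -m p570-sys1 -o a -p db2lpar -q 2048             # add 2 GB of memory
chhwres -r proc -m p570-sys1 -o r -p db2lpar --procunits 0.5    # remove 0.5 processing units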
Note: If you are using the DLPAR facility to add or remove memory, we
recommend the use of DB2's self-tuning memory management (STMM)
feature for maximum benefit. STMM, with the INSTANCE_MEMORY
database manager configuration parameter and the DATABASE_MEMORY
database configuration parameter both set to AUTOMATIC, ensures that a
memory change in an LPAR is registered with DB2 for optimized memory use.
Additionally, it relieves the database administrator of the burden of retuning DB2
after a memory change in an LPAR.
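A minimal sketch of the corresponding DB2 settings (SAMPLE is a placeholder database name):
db2 update dbm cfg using INSTANCE_MEMORY AUTOMATIC
db2 update db cfg for SAMPLE using DATABASE_MEMORY AUTOMATIC
db2 update db cfg for SAMPLE using SELF_TUNING_MEM ON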
5.4 Dedicated LPAR versus micro-partition
There are two types of LPAR. Most of the time, questions arise regarding the
correct partition type for the DB2 workload and environment. Besides
performance, there are other factors that contribute to deciding the partition type.
A test was conducted with one, two, and four partitions running simultaneously
for each partition type. All tests used direct attached I/O with properly scaled DB2
data. In essence, the test contained a set of six performance runs: two runs for a
given number of partitions, each time changing only the partition type between
dedicated and shared-processor partitions. The partition type change is a
nondestructive change (no data loss). The only requirement is that the partition
be restarted.
5.4.3 Discussion
Beginning with POWER6 processor architecture-based servers, there is an
option to donate unused dedicated processor partition processing capacity. This
feature eliminates unused processing cycles in a dedicated processor partition,
further solidifying the PowerVM virtualization offering.
Tip: Whenever possible use a shared processor LPAR for running DB2. The
shared processor LPAR (micro-partition) offers the same level of performance
as a dedicated processor LPAR for most kinds of DB2 workloads. Besides, the
shared processor LPAR offers many other workload management features
and improves overall server use.
Based on the recorded results on the setup, a recommendation for the DB2
environment, for capped partitions, is to set the number of virtual processors
equal to the rounded-up entitled capacity.
This configuration for uncapped partitions ensures that there are sufficiently
configured virtual processors to allow the use of the free-pool processing
capacity. Carefully tune uncapped weight to assist fair distribution of the free pool
capacity.
Tip: For a DB2 environment, use the following general guidelines to configure
virtual processors for a shared processor LPAR:
For a capped shared processor LPAR, set the number of virtual processors
equal to the rounded-up entitled capacity.
For an uncapped shared processor LPAR, set the number of virtual processors
equal to the round-up of the minimum of (2 x entitled capacity) and the number
of physical processors in the system.
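For example (hypothetical values), an uncapped shared processor LPAR with an entitled capacity of 3.2 on a 16-core server would be configured with min(2 x 3.2, 16) = 6.4, rounded up to 7 virtual processors.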
Note: For more information about multiple shared-processor pools see the
IBM Redbooks publication PowerVM Virtualization on IBM System p:
Introduction and Configuration Fourth Edition, SG24-7940.
MSPPs use processor capacity from the physical shared processor pool. There
can only be one physical shared processor pool in the system. Each
shared-processor pool has an associated entitled pool capacity, which is
consumed by the set of micro-partitions belonging in that shared-processor pool.
The source of additional processor cycles can be the reserved pool capacity,
which is processor capacity that is specifically reserved for this particular
shared-processor pool, but not assigned to any of the micro-partitions in the
shared-processor pool.
Figure: Dedicated-processor partitions and sets of micro-partitions (running AIX V5.2, V5.3, V6.1, and Linux with differing processor entitlements), all dispatched by the POWER Hypervisor
There can be up to 64 MSPPs in total. With the default MSPP, there can be an
additional 63 MSPPs that are user-defined.
A new AIX command, lparstat, is available to get information about the LPAR
profile and current use. Example 5-1 shows the output of the lparstat -i
command, which prints exhaustive LPAR information, including LPAR type,
virtual processors, entitled capacity, and memory.
# lparstat 1
%user  %sys  %wait  %idle physc %entc  lbusy  vcsw phint  %nsp
----- ----- ------ ------ ----- ----- ------ ----- ----- -----
  0.1   0.3    0.0   99.6  0.01   0.8    0.0   472     0    99
  0.0   0.2    0.0   99.8  0.01   0.6    0.0   471     0    99
  0.0   0.7    0.0   99.3  0.01   1.1    0.0   468     0    99
  0.0   0.2    0.0   99.8  0.01   0.6    0.0   469     0    99
  0.0   0.2    0.0   99.8  0.01   0.6    0.0   469     0    99
  0.2   0.5    0.0   99.3  0.01   1.1    0.0   474     0    99
The AIX vmstat command shows additional information for a shared processor
LPAR. As shown in Example 5-3, the last two columns are displayed only for a
shared processor partition: pc is the number of physical processors consumed, and
ec is the percentage of entitled capacity consumed. ec can be more than 100 if the
partition is uncapped and is currently using more than its entitlement.
# vmstat 1
...   pc   ec
...  0.01  0.7
Chapter 6. Virtual I/O
In this chapter we discuss the advantages of using virtual I/O, the VIOS (which is the
backbone for enabling virtual I/O), high availability for virtual I/O, and best
practices for configuring virtual I/O for a DB2 workload.
Use of virtual I/O requires an LPAR with a special operating system. This LPAR is
called a virtual I/O server (VIOS). VIOS enables LPARs to share I/O resources
such as Ethernet adapters, Fibre Channel adapters, and disk storage. When an
LPAR (an LPAR running DB2, for example) is using virtual I/O, all I/O requests
are passed down to the VIOS, where VIOS performs the actual I/O operation and
returns data, in a few cases, directly to a client partition (no double buffering in
the case of Virtual SCSI).
In most cases, DB2 workloads fluctuate over time. The fluctuation leads to
variations in the use of I/O infrastructure. Typically, in the absence of virtual I/O,
each partition is configured with sizing based on the peak I/O capacity, resulting
in an underused I/O infrastructure. Sharing resources where the peaks and valleys
are complementary (for example, daytime OLTP workload with nightly batch or
reporting workloads) can result in a significant reduction in the total amount of
physical resource requirements to handle the aggregated workload for all the
partitions that have these complementary peaks and valleys. By reducing the
number of physical disk drives and networking infrastructure, virtual I/O simplifies
the management of these entities.
To improve overall use and performance, an objective of the DB2 sizing and
tuning exercise is to reduce or eliminate I/O bottleneck in the DB2 environment.
To that end, to attain the best possible I/O bandwidth, DB2 storage requirements
are typically sized in terms of number of disk spindles, rather than the amount of
data storage. This results in an excessive amount of unused disk space. Using
virtual I/O, such disk space can be shared for complementary or staggered
workloads.
Virtual SCSI
Physical adapters with attached disks or optical devices on the VIOS logical
partition can be shared by one or more client logical partitions. The VIOS offers a
local storage subsystem that provides standard SCSI-compliant logical unit
numbers (LUNs). VIOS can export a pool of heterogeneous physical storage as a
homogeneous pool of block storage in the form of SCSI disks for client LPAR to
use.
After virtual devices are configured, those devices appear as regular hdisks in the
AIX device configuration. The hdisk can be used like an ordinary hdisk for
creating a volume group and logical volumes. The important point is that, from DB2's
perspective, DB2 is unaware of the fact that it is using virtual storage. Therefore,
all of the DB2 storage management aspects, such as containers, table spaces, and
automatic storage, remain as is, even for virtual devices.
Virtual Fibre Channel (NPIV)
With N_Port ID Virtualization (NPIV), you can configure the managed system so
that multiple logical partitions can access independent physical storage through
the same physical Fibre Channel adapter.
To access physical storage in a typical SAN that uses Fibre Channel, the physical
storage is mapped to LUNs. The LUNs are mapped to the ports of physical Fibre
Channel adapters. Each physical port on each physical Fibre Channel adapter is
identified using one worldwide port name (WWPN).
NPIV is a standard technology for Fibre Channel networks that enables you to
connect multiple logical partitions to one physical port of a physical Fibre
Channel adapter. Each logical partition is identified by a unique WWPN, which
means that you can connect each logical partition to independent physical
storage on a SAN.
To enable NPIV on the managed system, you must create a VIOS logical
partition (version 2.1 or later) that provides virtual resources to client logical
partitions. You assign the physical Fibre Channel adapters (that support NPIV) to
the VIOS logical partition. Then you connect virtual Fibre Channel adapters on
the client logical partitions to virtual Fibre Channel adapters on the VIOS logical
partition. A virtual Fibre Channel adapter is a virtual adapter that provides client
logical partitions with a Fibre Channel connection to a SAN through the VIOS
logical partition. The VIOS logical partition provides the connection between the
virtual Fibre Channel adapters on the VIOS logical partition and the physical
Fibre Channel adapters on the managed system.
Figure 6-2 on page 247 shows a managed system configured to use NPIV.
Figure 6-2 A managed system configured to use NPIV
Using SAN tools, you can zone and mask LUNs that include WWPNs that are
assigned to virtual Fibre Channel adapters on client logical partitions. The SAN
uses WWPNs that are assigned to virtual Fibre Channel adapters on client
logical partitions the same way it uses WWPNs that are assigned to physical
ports.
Note: DB2 is not aware of the fact that it is using virtual storage. All of the
storage management best practices and methods discussed so far apply
equally to virtual storage.
VIOS provides the virtual networking technologies discussed in the next three
sections.
Shared Ethernet adapter (SEA)
The shared Ethernet adapter (SEA) hosted in the VIOS acts as a layer-2 bridge
between the internal virtual and external physical networks. The SEA enables
partitions to communicate outside the system without having to dedicate a
physical I/O slot and a physical network adapter to a client partition.
SEA failover
SEA failover provides redundancy by configuring a backup SEA on a VIOS
logical partition that can be used if the primary SEA fails. The network
connectivity in the client logical partitions continues without disruption.
Active Partition Mobility allows you to move AIX and Linux logical partitions that
are running, including the operating system and applications, from one system to
another. The logical partition and the applications running on that migrated
logical partition do not need to be shut down.
Inactive Partition Mobility allows you to move a powered off AIX and Linux logical
partition from one system to another.
The VIOS plays a central role in Live Partition Mobility. The Live Partition Mobility
operation is initiated from the HMC. The HMC communicates with the VIOS on the two
systems to initiate the transfer. The VIO servers on both systems communicate with each
other and transfer the system environment, including the processor state,
memory, attached virtual devices, and connected users.
For more information about Live Partition Mobility you can refer to Chapter 7,
“Live Partition Mobility” on page 269.
6.2.4 Integrated Virtualization Manager
The Integrated Virtualization Manager (IVM) provides a Web-based system
management interface and a command-line interface that you can use to
manage IBM Power Systems servers and IBM BladeCenter blade servers that
use the IBM VIOS. On the managed system, you can create logical partitions,
manage virtual storage and virtual Ethernet, and view service information related
to the server. The IVM is included with the VIOS, but it is available and usable
only on certain platforms, and where no Hardware Management Console (HMC)
is present.
If you install the VIOS on a supported server, and if there is no HMC attached to
the server when you install the VIOS, then the IVM is enabled on that server. You
can use the IVM to configure the managed system through the VIOS.
Two or more VIOSs and redundant devices can provide improved software
maintenance and hardware replacement strategies.
The following sections show the storage and network redundancy options
provided by a dual VIOS environment.
Virtual SCSI redundancy
Virtual SCSI redundancy can be achieved using MPIO at client partition and
VIOS level. Figure 6-3 displays a setup using MPIO in the client partition.
Figure 6-3 Virtual SCSI redundancy: the client partition accesses hdiskx through MPIO over two vSCSI client adapters
Two VIOSs host disks for a client partition. The client is using MPIO to access a
SAN disk. From the client perspective, the following situations can be handled
without causing downtime for the client:
› Either path to the SAN disk can fail, but the client is still able to access the
data on the SAN disk through the other path. No action has to be taken to
reintegrate the failed path to the SAN disk after repair if MPIO is configured.
› Either VIOS can be rebooted for maintenance. This results in a temporary
simultaneous failure of one path to the SAN disk, as described before.
To balance the load of multiple client partitions across two VIOSs, the priority on
each virtual SCSI disk on the client partition can be set to select the primary path
and, therefore, a specific VIOS. The priority is set on a per virtual SCSI disk basis
using the chpath command as shown in the following example (1 is the highest
priority):
chpath -l hdisk0 -p vscsi0 -a priority=2
Due to this granularity, a system administrator can specify whether all the disks
or alternate disks on a client partition use one of the VIOSs as the primary path.
The recommended method is to divide the client partitions between the two
VIOSs.
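For reference, the available paths and their state can be listed on the client partition (hdisk0 matches the example above):
lspath -l hdisk0      # shows the status of each path and its parent vscsi adapter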
This configuration uses virtual Ethernet adapters created using default virtual
LAN IDs with the physical Ethernet switch using untagged ports only. In this
example, VIOS 1 has a SEA that provides external connectivity to client partition
1 through the virtual Ethernet adapter using virtual LAN ID 2. VIOS 2 also has a
SEA that provides external connectivity to the client partition through the virtual
Ethernet adapter using virtual LAN ID 3. Client partition 2 has a similar setup,
except that the virtual Ethernet adapter using virtual LAN ID 2 is the primary and
virtual LAN ID 3 is the backup. This enables client partition 2 to get its primary
connectivity through VIOS 1 and backup connectivity through VIOS 2.
Client partition 1 has the virtual Ethernet adapters configured using Network
Interface Backup such that the virtual LAN ID 3 network is the primary and virtual
LAN ID 2 network is the backup, with the IP address of the default gateway to be
used for heartbeats. This enables client partition 1 to get its primary connectivity
through VIOS 2 and backup connectivity through VIOS 1.
If the primary VIOS for an adapter becomes unavailable, the Network Interface
Backup mechanism detects this because the path to the gateway is broken. The
Network Interface Backup setup fails over to the backup adapter that has
connectivity through the backup VIOS.
SEA failover
SEA failover is implemented on the VIOS using a bridging (layer-2) approach to
access external networks. SEA failover supports IEEE 802.1Q VLAN-tagging,
unlike Network Interface Backup.
With SEA failover, two VIOSs have the bridging function of the SEA to
automatically fail over if one VIOS is unavailable or the SEA is unable to access
the external network through its physical Ethernet adapter. A manual failover can
also be triggered.
As shown in Figure 6-5, both VIOSs attach to the same virtual and physical
Ethernet networks and VLANs, and both virtual Ethernet adapters of both
Shared Ethernet Adapters have the access external network flag enabled. An
additional virtual Ethernet connection must be set up as a separate VLAN
between the two VIOSs and must be attached to the Shared Ethernet Adapter
(SEA) as a control channel. This VLAN serves as a channel for the exchange of
keep-alive or heartbeat messages between the two VIOSs that controls the
failover of the bridging functionality. No network interfaces have to be attached to
the control channel Ethernet adapters. The control channel adapter is dedicated
and on a dedicated VLAN that is not used for any other purpose.
In addition, the SEA in each VIOS must be configured with differing priority
values. The priority value defines which of the two SEAs is the primary (active)
and which is the backup (standby). The lower the priority value, the higher the
priority (for example, priority=1 is the highest priority).
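The following is a hedged sketch of creating an SEA with failover enabled on one VIOS; the adapter names and PVID are examples, and the trunk priority itself is defined on the virtual Ethernet adapter in the partition profile:
$ mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1 -attr ha_mode=auto ctl_chan=ent3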
You can also configure the SEA with an IP address that it periodically pings to
confirm that network connectivity is available. This is similar to the IP address to
ping that can be configured with Network Interface Backup (NIB). However, with NIB
you have to configure the reachability ping on every client, whereas with SEA failover
you configure it once on the SEA.
It is possible that during an SEA failover, the network drops up to 15–30 packets
while the network reroutes the traffic.
The following guidelines are intended to get you started and might need
additional adjustments after they are run and monitored in production, but they
enable you to accurately plan how much CPU and memory to use.
The easiest rule to follow is that if you have a simple VIOS that is only bridging a
couple of networks and never uses jumbo frames or a larger MTU, 512 MB is
sufficient for the VIOS. System configurations with 20–30 clients, many LUNs, and
a generous vhost configuration might need more memory to meet
performance expectations.
If there is a possibility that you might start bridging more networks and use jumbo
frames, use 1 GB of RAM for the VIOS. If you have enough memory to spare, the
best situation is to use 1 GB for the VIOS to allow the maximum flexibility. When
10 Gbps Ethernet adapters are commonly implemented, this memory
requirement might require revision. If VIOS is using IVE/HEA as an SEA, the
LPAR might need more than 1 GB.
The CPU requirements on the VIOS for the network depend on the actual
network traffic that runs through the VIOS and not on the number of
network adapters installed on it. Minimal CPU is needed to support a network
adapter card. For every network packet, however, the CPU has to calculate
things such as the network checksum.
Table 6-1 Approximate CPU amount the VIOS needs for 1 GB of network traffic
MTU (bytes): 1500 | 9000 (jumbo frames)
Let us see an example. Suppose that you have a VIOS with 4 gigabit network
adapters and you know that during normal operations you have about 100 Mb of
traffic overall. However, you also know that during a backup window you use a full
gigabit of network traffic for two hours at night.
These values can be translated into a CPU requirement to support the network.
This can best be done by using the shared CPU model. If we assume that your
server has 1.65 GHz CPUs and you are using an MTU of 1500, we can calculate
that during normal operation you only need 0.1 of a CPU. During peak loads, you
need 1.0 CPU. If we assume that your user base is not using the system at night
(thus the backup), there is plenty of unused CPU in the free pool that can be
used for the CPU requirement here. You can configure the VIOS partition as a
shared uncapped partition with 0.1 entitlement with 1 virtual CPU. This
guarantees 0.1 of a CPU to sustain the daily network usage, but by using the
uncapped CPU resources, we can allow the VIOS to grow to 1.0 CPU if required
using spare CPU cycles from the CPU pool.
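As an illustration only (the managed system name, partition name, profile name, and values are hypothetical; verify the attribute names against your HMC level), such a VIOS partition profile could be adjusted from the HMC command line:
chsyscfg -r prof -m p570-sys1 -i "name=default,lpar_name=vios1,proc_mode=shared,sharing_mode=uncap,min_proc_units=0.1,desired_proc_units=0.1,max_proc_units=1.0,min_procs=1,desired_procs=1,max_procs=1"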
Remember that you need CPU to support network traffic and that adding
additional network cards (providing the same network traffic) does not require
additional CPU. Using Table 6-1 on page 258, estimate a value for the required
network bandwidth to support normal operations. Guarantee this amount (plus
the disk value from the following section) to the logical partition as the processing
units. Use uncapped CPUs on the servers profile to allow the VIOS to use spare
processing from the free pool to handle any spikes that might occur.
For a rough guideline, it is probably easier to work out what the disks you are
using can provide and make an estimate as to how busy you think these disks
will be. For example, consider a simple VIOS that has a basic internal set of four
SCSI disks. These disks are used to provide all of the I/O for the clients and are
10 K RPM disks. We use a typical workload of 8 KB blocks. For these disks, a
typical maximum is around 150 I/Os per second, so this works out to be about
0.02 of a CPU. A small amount.
If we plot the amount of CPU required against the number of I/Os per second for
the 1.65 GHz CPUs for the various I/O block sizes, we get the data shown in
Figure 6-6.
Figure 6-6 Estimated size of storage array to drive I/O versus VIOS CPU
These I/O numbers and the storage subsystem sizing assume that the storage
subsystem is being driven at 100% by the virtual I/O clients only (so every I/O is
virtual disk) and the disk subsystems have only been placed on the graph to
indicate the performance you need to generate this sort of load.
It is important to remember that we are not saying the bigger the disk subsystem,
the more CPU power you need. What we actually find is that we need the most
powerful storage subsystem offered to require any significant amount of CPU in
the VIOS from a disk usage perspective. As an example, a mid-range disk
subsystem running at full speed is in the region of the yellow/green boundary
(mid-range to high end) on the graph when configured with more than 250 drives.
In most cases, this storage is not 100% dedicated to the VIOS. However, even if it
were, and even when running at full speed, the VIOS only needs one CPU
for most block sizes.
Most I/O requests for systems are in the lower left section, so a starting figure of
0.1 to 0.2 CPU to cater for a disk is a good starting point in most cases.
Note: We always suggest testing a configuration before putting it into
production.
Because network traffic requires additional CPU, we need to size both VIOS
adequately to minimize the migration time. The actual amount of CPU required
varies depending upon network parameters and the processor used on the
system. Therefore, it is recommended to configure both the VIOSs as shared
uncapped partitions and allocate one additional VCPU on top of what is required
for processing the required virtual I/O and virtual network requests. This allows
the VIOSs to make use of available CPU from the free pool during the migration
and reduces the total time required for doing the migration.
Multipathing for the physical storage in the VIOS provides physical path failover
redundancy and load balancing. The multipathing solutions available in the VIOS
include MPIO, as well as solutions provided by the storage vendors.
The multipath I/O software needs to be configured only on the VIOS. There is no
special configuration that needs to be done on the virtual I/O client partitions to
make use of the multipath I/O at the VIOS.
For details on configuring multipath I/O, see the corresponding storage vendor’s
documentation.
6.5.2 Networking
Because a DB2 workload depends on the network to communicate with its clients, we
need to provide both redundancy and sufficient bandwidth for the network traffic.
This can be achieved by configuring link aggregation for the SEA adapter on the
VIOS.
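As a reference, link aggregation of two physical ports can be created on the VIOS with mkvdev (the adapter names and mode are examples; the aggregated device is then used as the physical side of the SEA):
$ mkvdev -lnagg ent0 ent1 -attr mode=8023ad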
For dual VIOS configurations, redundancy can be provided by using either SEA
failover or configuring NIB on the client partition, as discussed in "Virtual
network redundancy" on page 253.
6.5.3 CPU settings for VIOS
If the I/O response time is not important for the DB2 workload running on the
client partition, and if the CPU sizing requirement for the VIOS is less than 1.0
CPU entitlement, we can leave the CPU entitlement at less than 1.0 but make the
VIOS a shared uncapped partition. This allows it to make use of available processing
power in the free pool when required during peak load.
On the other hand, if the response time for the I/O is important for the DB2
workload, we need to make sure that the VIOS can get access to a CPU and
service the I/O request. This can be done by either configuring the VIOS with a
whole CPU entitlement (1.0) and making it a shared uncapped partition or, on a
POWER6 system, we can configure it as Dedicate-Donate partition. Both these
methods guarantee that the VIOS can immediately service an I/O request. Even
though we are allocating more CPU entitlement than required for the VIOS,
because we have configured it as shared or dedicate-donate, other partitions can
make use of the unused capacity. We need to make sure the other LPARs are
configured as shared-uncapped partitions with appropriate VCPUs to make use
of the available processing capacity.
Use the following command to change the queue depth on the VIOS:
chdev -dev hdiskN -attr queue_depth=x
Note: Changing the queue depth of the physical LUN might require changing
the queue depth for the virtual disk on the client partition to achieve optimal
performance.
There are several other factors to be taken into consideration when changing the
queue depth. These factors include the value of the queue_depth attribute for all
of the physical storage devices on the VIOS being used as a virtual target device
by the disk instance on the client partition. It also includes the maximum transfer
size for the virtual SCSI client adapter instance that is the parent device for the
disk instance.
The maximum transfer size for virtual SCSI client adapters is set by the VIOS,
which determines that value based on the resources available on that server and the
maximum transfer size for the physical storage devices on that server. Other
factors include the queue depth and maximum transfer size of other devices
involved in mirrored volume group or MPIO configurations. Increasing the queue
depth for a few devices might reduce the resources available for other devices on
that same parent adapter and decrease throughput for those devices.
When mirroring, the LVM writes the data to all devices in the mirror, and does not
report a write as completed until all writes have completed. Therefore, throughput
is effectively throttled to the device with the smallest queue depth. This applies to
mirroring on the VIOS and the client.
We suggest that you have the same queue depth on the virtual disk as the
physical disk. If you have a volume group on the client that spans virtual disks,
keep the same queue depth on all the virtual disks in that volume group. This is
important if you have mirrored logical volumes in that volume group, because the
write does not complete before the data is written to the last disk.
In MPIO configurations on the client, if the primary path has a much greater
queue depth than the secondary, there might be a sudden loss of performance
as the result of a failover.
The virtual SCSI client driver allocates 512 command elements for each virtual
I/O client adapter instance. Two command elements are reserved for the adapter
to use during error recovery. Three command elements are reserved for each
device that is open to be used in error recovery. The rest are left in a common
pool for use in I/O requests. As new devices are opened, command elements are
removed from the common pool. Each I/O request requires one command
element for the time that it is active on the VIOS.
Increasing the queue depth for one virtual device reduces the number of devices
that can be open at one time on that adapter. It also reduces the number of I/O
requests that other devices can have active on the VIOS.
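For reference, the relevant client-side values can be inspected and adjusted as follows (device names are examples; -P defers the change to the next reboot):
lsattr -El vscsi0                      # virtual SCSI client adapter attributes
lsattr -El hdisk0 -a queue_depth       # current queue depth of the virtual disk
chdev -l hdisk0 -a queue_depth=20 -P   # match the queue depth of the backing physical disk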
6.5.5 Dual VIOS
To provide the DB2 workload running on the client partition with high availability for
virtual I/O and virtual networking, it is suggested to configure and use dual VIOSs
as discussed in 6.3.1, "Dual VIOS" on page 251.
Prior to Version 1.3 of the VIOS, the shared Ethernet process runs at the
interrupt level and was optimized for high performance. With this approach, it ran
with a higher priority than the virtual SCSI when there was high network traffic. If the
VIOS did not provide enough CPU resource for both, the virtual SCSI
performance could experience a degradation of service.
With VIOS Version 1.3, the shared Ethernet function can be implemented using
kernel threads. This enables a more even distribution of the processing power
between virtual disk and network.
This threading can be turned on and off per SEA by changing the thread attribute
and can be changed while the SEA is operating without any interruption to
service. A value of 1 indicates that threading is to be used and 0 indicates the
original interrupt method:
$ chdev -dev ent2 -attr thread=0
As discussed in 6.4, “VIOS sizing” on page 257, usually the network CPU
requirements are greater than the disk. In addition, you probably have the disk
VIOS setup to provide a network backup with SEA failover if you want to remove
the other VIOS from the configuration for scheduled maintenance. In this case,
you have both disk and network running through the same VIOS, so threading is
recommended.
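For reference, the current mode can be checked and threading enabled on the SEA from the padmin shell (ent2 repeats the earlier example):
$ lsdev -dev ent2 -attr thread
$ chdev -dev ent2 -attr thread=1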
Extending this technology to DB2 and its server installations, we can make use of
LPM to move a database server, without shutting down the database engine,
from one physical system to a second physical system while it is servicing
transactions. LPM uses a simple and automated procedure to migrate the entire
system environment, including processor state, memory, attached virtual
devices, and connected users with no application downtime.
7.2 LPM
Infrastructure flexibility has become a key criterion when designing and deploying
information technology solutions. Imagine you have a DB2 database server
running on a hardware system that has exhausted all its resources. The server is
running 2 processors at 100% use. Response times are slow and users are
complaining. LPM helps you handle situations such as the following:
› Move that DB2 database server from that system to a system that has
additional resources without interrupting service to the users
› Perhaps the system needs to go down for scheduled maintenance such as a
hardware upgrade or a firmware upgrade or even a component replacement?
› You need to do a new system deployment so that the workload running on an
existing system is migrated to a new, more powerful one.
› If a server indicates a potential failure, you can move its logical partitions to
another system before the failure occurs.
› You want to conserve energy by moving a partition to another system during
off-peak periods.
However, while LPM provides many benefits, it does not perform the following
tasks:
› LPM does not do automatic load balancing.
› LPM does not provide a bridge to new functions. Logical partitions must be
restarted and possibly reinstalled to take advantage of new features.
› LPM does not protect you from system failures, so it does not replace
high-availability software such as the IBM HACMP high-availability cluster
technology.
To use the partition migration, there are planning steps that are necessary.
Although an in-depth discussion on the technologies that supports the flexibility
is outside of the scope of this chapter, a quick checklist is provided that assures
your partition is able to migrate successfully.
There are no changes required to make DB2 work when the partition is moved
from one system to another. DB2 has built-in autonomics and self-tuning that
enables it to adapt to changes in the underlying system after being moved from
one system to another:
› Self tuning memory manager (STMM) feature of DB2 queries the operating
system for free and available memory at regular intervals and automatically
adapts to the destination server's memory to maximize throughput.
› DB2 has the ability to change several parameters, without instance restart.
› DB2 uses OS-based Scheduler and can automatically adapt to the new CPU
resources.
Note: Most DB2 parameters are set to AUTOMATIC by default. The general
guideline is to leave these parameters AUTOMATIC.
Table 7-1 shows a list of default parameters used by DB2 version 9.7.
Table 7-1 List of default DB2 9.7 parameters (plus many more)
Parameter | Value
Note: The major part of this chapter covers systems managed by one or more
HMCs. Although this book covers systems managed by HMC, the tests on
LPM have been done with IVM-based systems. For information about IVM,
refer to the IBM Redpaper™ Integrated Virtualization Manager on IBM System
p5, REDP-4061.
7.4.1 Managed system’s requirements
Table 7-2 lists the managed system’s requirements.
Managed systems type: POWER6 systems and Power Blade systems ("Managed system type" on page 275)
Logical Memory Block (LMB): Must be identical on both systems ("Logical Memory Block (LMB)" on page 284)
Managed servers in an LPM operation must be of the same type (for example, the
source and target servers are both POWER6 servers or both Power Blades). The same
HMC or a redundant pair of HMCs is used with Power Systems. IVM is used for Power
Blades and for Power Systems models below the p570. If you want to mix
POWER6 servers and Power Blades, this is possible if the systems are managed
by IVM. Each of the servers can act as a source or a target system to the other
as long as it contains the necessary processor hardware to support it. This is
called migration support. Of course, to be able to use LPM, a minimum of
two systems is necessary.
Notes:
› If the capability you are looking for is not displayed, this means that your
HMC is down level or the HMC does not have the appropriate fix pack. See
7.4.2, “HMC requirements” on page 285.
› If the capability you are looking for is shown as False, the Power
Hypervisor enablement has not been performed or the firmware is
downlevel.
If your server is not configured with the Enterprise Edition, you need to enable
your system. Contact your IBM representative to do so.
Processor clock speed
LPM can be used between managed systems with differing clock speeds. For
example, the source server can have 4.7 GHz processors and the target can be
configured with 5.0 GHz processors. This is also known as processor
compatibility mode.
The processor compatibility mode is checked by the managed system across the
partition’s profile when the partition is activated and determines whether the
installed operating system supports this mode. If not, the partition uses the most
fully featured mode that is supported by the operating system.
The compatibility mode is important when using LPM. When you move a logical
partition from one system to another that has a different processor type, the
processor compatibility mode enables that logical partition to run in a processor
environment on the destination system in which it can successfully operate.
Time-of-Day
Although not mandatory, it is suggested to synchronize the time-of-day between
the source and the destination system, be it for active or inactive partition
mobility. Therefore, a new attribute, the Time Reference, is available in the
settings tab of the properties of the partition. Any VIOS partition can be
designated as a Time Reference Partition (TRP). Values for the TRP is enable or
disable (default). Changes take effect as soon as you click OK.
Notes:
› Time synchronization is a suggested step in active partition migration.
If you choose not to synchronize using the TRP attribute, the source and
target systems synchronize the clocks during the migration process from
the source server to the destination server.
› The TRP capability is only supported on systems capable of active
partition migration.
› Multiple TRPs are supported per server. The longest running TRP is
recognized as the TRP of the system.
› Ensure that the HMC and all the partitions have the same date and time to
avoid discrepancies and errors when migrating to another system.
Managed system’s firmware level
The firmware level of both systems acting in an LPM operation needs to be at a
minimum level of EH330_046 for the Power System p595, EM320_31 for Power Systems
p570 servers, or EL320_040 for Power Systems p520 and p550 servers. PowerBlades
systems need a firmware level of Es320 or later.
Refer to the Power code matrix at the following Web page for more
information:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/power6.html
Note: The firmware levels of the source and target systems might differ. However, the
level of the source system firmware must be compatible with the target server.
To upgrade the firmware of your managed system, refer to the IBM fixes web site:
http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ipha5/fix_serv_firm_kick.htm
Also, if you specify an existing profile name, the HMC replaces that profile with
the new migration profile. If you do not want the migration profile to replace any of
the partition’s existing profiles, you must specify a unique profile name. The new
profile contains the partition’s current configuration and any changes that are
made during the migration.
When migrating a mobile partition to another system, an LPAR with the same
name as the source LPAR must not exist on the destination system. Check this on the
HMC console or through the command line on the HMC using the lssyscfg
command, as shown in Example 7-1.
In Example 7-1, each defined LPAR is defined on only one system.
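A hedged sketch of that check from the HMC command line (the managed system names are examples):
lssyscfg -r lpar -m p570-source -F name,state
lssyscfg -r lpar -m p570-target -F name,state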
Managed systems’ connectivity
Next, we discuss managed systems’ connectivity.
Use the lssyscfg command to verify that both servers are seen by the same
HMC, as shown in Example 7-2.
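A minimal sketch of such a check (the output columns are chosen for illustration):
lssyscfg -r sys -F name,type_model,serial_num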
Separate HMC
Using one HMC per server is another supported environment for LPM. This is
known as Remote LPM. Prerequisites for Remote LPM are covered in 7.4.2,
"HMC requirements" on page 285.
Note: For information about remote partition mobility, refer to IBM Redbooks
publication IBM PowerVM Live Partition Mobility, SG24-7460.
Resource monitoring and control (RMC) is a feature that can be configured to
monitor resources such as disk space, CPU usage, and processor status and
allows performing an action in response to a defined condition. Remote migration
operations require that each HMC has RMC connections (see “RMC
connections” on page 298) to its individual system’s VIOSs and a connection to
its system’s service processors. The HMC does not have to be connected to the
remote system’s RMC connections to its VIOSs, nor does it have to connect to
the remote system's service processor.
Note: The logical memory block size can be changed at run time, but the
change does not take effect until the system is restarted.
Battery power
Ensure that the target system is not running on battery power. If it is, bring the target
system into a stable state prior to migrating.
Note: The source system can run on battery power even if you want to migrate
your partition to another system.
7.4.2 HMC requirements
Table 7-3 summarizes HMC requirements.
HMC level (single or redundant dual environment): 7.3.2.0 or later, with fix MH01062 ("HMC model and version" on page 285)
HMC level in remote LPM: 7.3.4 or later ("HMC model and version" on page 285)
Both servers need to be connected to the same HMC or redundant pair of HMCs,
as illustrated in Figure 7-3 on page 283. They must also be connected to the
same network so they can communicate with each other, as shown in
Figure 7-3 on page 283.
Notes:
› HMC can perform multiple LPM migrations simultaneously.
› IVM is also supported to perform LPM migrations.
For information about how to upgrade your HMC, refer to the following Web page:
http://www-933.ibm.com/support/fixcentral/
Therefore, the requirements on IVM are tied to the VIOS level requirements. You
can refer to the VIOS requirements in "VIOS profile" on page 288 for more
information.
Tip: Version 2.1.2.10 FP22 of the VIOS introduced a new functionality that
consists of the preservation of the customized VTD names, support for
non-symmetric environments, and multiple IP addresses for a VIOS
configured as an MSP partition.
For more information about how to change your current VIOS level, refer to the
following Web page:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/download/home.html
Tip: The client profile must only be created on the source system. The profile
on the destination system is created automatically during the migration
process.
VIOS profile
This section discusses the VIOS profile.
There must be at least one VIOS declared as a mover service partition (MSP) per
system. The mover service partition attribute gives that VIOS partition the ability to
interact with the Power Hypervisor and to migrate a mobile partition from the source
VIOS to the target VIOS.
The mover service partition attribute can be configured either when you
configure your VIOS profile (refer to IBM Redbooks publication PowerVM
Virtualization on IBM System p: Introduction and Configuration Fourth Edition,
SG24-7940 for more information) or after the VIOS has been configured by
changing its properties. You can proceed as described in Figure 7-5 on page 289
by editing the properties of the VIOS partition and selecting the Mover Service
Partition box.
Note: Mover service partitions are not used during inactive partition migration.
Tip: If you have multiple network interfaces configured on an MSP, you can
choose, through the HMC command line, which IP address the mover uses to
transport the mobile partition's data. For this to work, you need VIOS level
2.1.2.10-FP22 or later and HMC level 7.3.5 or later.
CPU and memory requirements
VIOS requires modest additional CPU and memory. Assuming that CPU cycles
required to host VIOS are available, there is modest performance impact with
VIOS from the client partition perspective. Check the IO latency impact to your
application from a host perspective (DB2 server).
During the migration of an LPAR, the VIOSs on both the source and the
destination system need extra CPU to manage the migration. Uncapping the
VIOSs ensures that the necessary additional CPU can be given to each of them.
The following list details considerations regarding VIOS CPU and memory
requirements:
› Allocating dedicated CPU to a VIOS partition results in faster scheduling of
I/O operations and improved Virtual I/O performance.
› When the overall pool use is high, and an uncapped shared micro-partition
is used for the VIOS, allocating less than 1.0 entitled CPU (ECPU) to the VIOS
partition might result in longer I/O latency.
› Memory required for a VIOS partition is insignificant. Allocating more memory
does not result in any performance improvement.
Tip: To size the VIOS CPU appropriately, take the following factors into
consideration:
› LPM requires additional CPU during LPAR mobility.
› Shared uncapped CPU gives you the flexibility to acquire additional CPU
units whenever needed.
For more information about VIOS CPU and memory requirements, refer to 6.4,
“VIOS sizing” on page 257.
The destination system needs to have sufficient resources to host the migrating
partition: not only CPU processing units and memory, but also
virtual slots. Virtual slots are needed to create the required virtual SCSI adapters
after the mobile partition has moved to the destination system. You might check
for the maximum number of virtual adapters and the actual number of configured
virtual adapters.
VASI interface
The Virtual Asynchronous Service Interface (VASI) provides communication
between both mover service partitions and the Power Hypervisor to gain access
to the partition state. This VASI interface is automatically created when the VIOS
is installed and the VIOS is declared as a mover service partition.
Tips:
› To list partition migration information, use the lslparmigr command.
› VASI interface must be available and is mandatory for active LPM. It is not
required for inactive partition mobility.
When configuring the virtual SCSI interface, it is suggested to create a virtual link
between the VIOS and the logical partition. This can be done by selecting Only
selected client partition can connect in the VIOS profile. Moreover, the VIOSs
on both the source and the destination systems must be capable of providing
virtual access to all storage devices that the mobile partition is using.
For more information about configuring IVE, refer to Integrated Virtual Ethernet
Adapter Technical Overview and Introduction, REDP-4340.
Virtual Ethernet
Virtual Ethernet adapters are created in the VIOS profile to ensure the
communication between the VIOSs and the logical partitions configured on the
same system.
It is mandatory to create shared Ethernet adapters (SEA) on both the source
and the destination server to bridge to the same Ethernet network used by the
mobile partition. For more information about SEA settings, refer to "SEA
configuration" on page 303.
Virtual slots
Make sure sufficient virtual slots are available on the target system.
Note: Any user-defined virtual slot must have an ID higher than 11.
Dual VIOS
For more information about dual VIOS setups, refer to 6.3.1, “Dual VIOS” on
page 251.
In other words, the disks zoned on a VIOS (these are LUNs on the storage box
and seen as hdisk on the VIOS) need to be mapped entirely to the logical
partition that needs to be migrated from one source system to another.
Moreover, these LUNs must be accessible to VIOSs on both the source and the
destination server. Other parameters at disk level and HBA level need to be
adapted. For more information about those parameters, refer to 3.4, “Tuning
storage on AIX” on page 115.
Tips:
› For all disks in a mobile logical partition, ensure the following settings for
the HBAs on all of your VIOSs:
– reserve_policy: no_reserve policy
– hcheck_interval: 20
– fc_err_recov: fast_fail
– dyntrk: yes
› For all HBAs in VIOSs ensure the following settings:
– fc_err_recov: fast_fail
– dyntrk: yes.
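For reference, these attributes can be set on the VIOS with chdev; the device names are examples and -perm stores the change for the next reboot:
$ chdev -dev hdisk4 -attr reserve_policy=no_reserve hcheck_interval=20 -perm
$ chdev -dev fscsi0 -attr fc_err_recov=fast_fail dyntrk=yes -perm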
7.4.5 LPAR requirements
Table 7-5 lists the LPAR requirements for active migration.
RMC connections: For active migrations, an active RMC connection with the rsct daemons running ("RMC connections" on page 298)
IVE (HEA) adapter: No IVE/HEA adapters ("Integrated Virtual Ethernet (IVE)" on page 300)
Partition workload group: Might not be part of a partition workload group ("Partition workload group" on page 300)
Virtual serial adapter: Only serial IDs 0 and 1 ("Virtual Serial adapter" on page 301)
For inactive migration, the corresponding LPAR requirements are:
Physical I/O: Can have dedicated I/O ("Physical I/O" on page 299)
Partition workload group: Might be part of a partition workload group ("Partition workload group" on page 300)
Virtual serial adapter: Only serial IDs 0 and 1 ("Virtual Serial adapter" on page 301)
Note: The target server must not contain an LPAR with the same name as the
partition you want to move.
For the migration to be successful, the LPAR name from the migrating partition
must not exist on the destination system. You can, however, determine a new
name for your partition. The HMC creates a new profile containing the partition’s
current state, the configuration, and any changes that are made during the
migration.
Notes:
› Active partition migration
This is the ability to move a running logical partition with its operating
system and application from one system to another without interrupting the
service / operation of that logical partition.
› Inactive partition migration
This is the ability to move a powered off logical partition with its operating
system and application from one system to another.
Table 7-7 lists the operating system level needed for active partition mobility.
Table 7-7 Operating system level needed for active partition mobility
OS level Active migration
Table 7-8 lists the operating system level needed for inactive partition mobility.
Table 7-8 Operating system level needed for inactive partition mobility
OS level Inactive migration
Processor settings
Configuring processors is of high importance, be it on VIOSs or logical partitions.
For a capped shared processor LPAR, set the number of virtual processors equal to
the round-up of the entitled capacity.
RMC connections
Resource monitoring and control (RMC) is a feature that can be configured to
monitor resources such as disk space, CPU usage, and processor status. It
allows performing an action in response to a defined condition. It is
technically a subset of the Reliable Scalable Cluster Technology
(RSCT). For more information about RMC, refer to the IBM Redbooks publication A
Practical Guide for Resource Monitoring and Control (RMC), SG24-6615.
Prior to an active migration of your logical partition, ensure that RMC connections
are established between:
› The VIOSs of both systems involved in the mobility operation.
› The mobile partition and both the source and destination systems.
Tips:
› To re-synchronize the RMC connection run the
/usr/sbin/rsct/bin/refrsrc IBM.ManagementServer command. Running
this command suppresses the wait time after synchronization.
› Your LPAR is not listed with the lspartition command if it is not in running
state (it is not powered on).
› RMC needs about five minutes to synchronize after network changes or
partition activation.
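As a reference, two ways to confirm the RMC state (the output format varies by HMC and RSCT level):
lspartition -dlpar                              # on the HMC: lists partitions with an active RMC connection
/usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc     # on the LPAR or VIOS: RMC daemon status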
Physical I/O
For active LPM, no physical or required adapters can be configured in the
mobile partition's profile: all I/O must be virtual. However, if the mobile partition
has physical or dedicated adapters, it can still participate in an inactive partition
migration. Physical adapters marked as Desired can be removed dynamically
with a dynamic LPAR operation. For more information about using DLPAR, refer
to PowerVM Virtualization on IBM System p: Introduction and Configuration
Fourth Edition, SG24-7940.
Tips:
› You can choose to use physical adapters in your mobile partition. If you do so
and want active partition mobility, move your physical adapters to virtual
adapters before the migration can occur. Downtime might be necessary to
perform these tasks.
› For the LPAR running your DB2 database, it is suggested not to
share the bandwidth of your I/O card with other partitions, to avoid
unnecessary overhead on the VIOS. Evaluate the need to share it, keeping
performance in mind.
Network
The logical partition's MAC address needs to be unique across both systems. You
might consider using the netstat or entstat command to check this.
Moreover, the mobile partition's network must be virtualized using one or more
VIOSs. Logical Host Ethernet Adapters (LHEA) are not supported for a partition
to be active in a migration process. See "Integrated Virtual Ethernet (IVE)" on
page 300 for more information about Integrated Virtual Ethernet.
Notes:
› Systems configured with IVM have only one VIOS.
› For systems configured with HMC, it is suggested to configure two VIOSs.
If you plan to use the LPM capabilities of your Power System, you cannot use a
logical host Ethernet adapter, because LHEA adapters are considered physical
adapters.
Notes:
› A partition workload group is a group of logical partitions whose resources
are managed collectively by a workload management application. A
partition workload group identifies a set of partitions that reside on the
same system.
› Workload management applications can balance memory and processor
resources within groups of logical partitions without intervention from the
HMC (Hardware Management Console). Workload management
applications are installed separately and can be obtained from a solution
provider company.
For more information about partition workload group and workload management
applications, refer to Chapter 5, “LPAR considerations” on page 225.
Virtual Serial adapter
Virtual serial adapters are often used for virtual terminal connections to the
operating system. The first two serial adapters (adapter IDs 0 and 1) are
reserved for the HMC.
Validate that no physical adapters are in the mobile partition and that no virtual
serial adapters are in virtual slots higher than 1. In other words, the only
exception for virtual serial adapter is for the virtual terminal connection.
Barrier Synchronization Register (BSR)
Barrier synchronization registers provide a fast, lightweight barrier
synchronization between CPUs. This facility is intended for use by application
programs that are structured in a single instruction, multiple data (SIMD) manner.
Such programs often proceed in phases where all tasks synchronize processing
at the end of each phase. The BSR is designed to accomplish this efficiently.
Barrier synchronization registers cannot be migrated or re-configured
dynamically. Barrier synchronization registers cannot be used with migrating
partitions.
Notes:
› BSR can be used in inactive partition migration.
› BSR settings cannot be changed dynamically. Modify the partition’s
profile and shut down (not reboot) the partition for changes to take effect.
› If your migrating partition contains dedicated physical resources, it is
mandatory to move those physical resources to virtual resources prior to
migrating or the migration fails.
Huge pages
Huge pages can improve performance in specific environments that require a
high degree of parallelism. The minimum, desired, and maximum number of
huge pages can be specified to assign to a partition when you create a partition
profile.
Note: If your mobile partition uses huge pages, it can participate only in an
inactive partition migration.
Huge page settings can be checked on the HMC console as follows:
1. Select your system in the navigation area.
2. Choose Properties.
3. Select the Advanced tab.
4. Verify that the Current Requested Huge Page Memory field indicates 0 (zero).
See Figure 7-6.
If this field is not equal to zero, only inactive partition mobility works.
Note: If you need to use huge pages and active partition mobility, you need to
set to 0 all fields relative to huge page memory of your mobile partition for the
migration to be successful. As this is not a dynamic operation, you need to
modify the partition’s profile and shutdown (not reboot) the partition for the
changes to take effect.
Redundant Error Path Reporting
This indicates whether the logical partition is set to report server common
hardware errors to the HMC. The service processor is the primary path for
reporting server common hardware errors to the HMC. Selecting this option
allows you to set up redundant error reporting paths in addition to the error
reporting path provided by the service processor. You can change this setting by
activating the logical partition using a partition profile set to enable redundant
error path reporting. For mobile partitions, this profile attribute must be
deactivated.
SEA configuration
You need to complete several tasks to ensure your network is ready for the
migration.
You have to create a shared Ethernet adapter (SEA) on both the source and
destination system VIOSs. The SEA bridges your external network to your
internal virtual network (a Layer-2 bridge) and enables your client partitions to
share one physical Ethernet adapter. When using a dual VIOS setup
in an HMC environment, the network can be made highly available by creating
the SEA failover mechanism, as shown in Figure 7-7 on page 304.
Notes:
› SEA can only be hosted on VIOSs.
› A system managed by IVM cannot implement the SEA failover mechanism
because it can contain only a single VIOS. For more information about the
SEA failover mechanism, refer to the IBM Redpaper IBM System p Advanced
POWER Virtualization (PowerVM) Best Practices, REDP-4194.
An SEA can be built on physical interfaces or IVE ports. If you want to use the IVE
card, ensure that the IVE port you are using is set to promiscuous mode. This
mode ensures that the complete port is dedicated to your VIOS and that no other
partition is able to use it. For more information about IVE, refer to Integrated
Virtual Ethernet Adapter Technical Overview and Introduction, REDP-4340.
Figure 7-7 Sample SEA failover setup with Etherchannel and dual VIOS setup
Recommendation: It is suggested to configure the IP address of your VIOS
on an additional virtual adapter that you create for this purpose, separate from
the SEA failover mechanism.
LHEA adapter
When configuring your mobile logical partition, make sure that no LHEA device is
configured. LHEA devices are considered physical devices, and active partition
mobility requires that no physical adapter be configured for the migration to
succeed. However, a partition can contain LHEA devices if you only want to use
inactive partition mobility.
For more information about Integrated Virtual Ethernet adapters (IVE / HEA),
refer to WebSphere Application Server V6.1: JMS Problem Determination,
REDP-4330.
7.4.7 Storage requirements
Table 7-10 lists the storage requirements.
Disk mapping versus LUN masking: only disk mapping is supported. See “Disk
mapping versus LUN masking on VIOS” on page 314.
For our testing we used a DS5300 SAN Storage Solution. We set up transaction
logs and data on separate file systems and, ideally, on external storage systems
on their own dedicated LUNs. In our example, we set up transaction logs under
the /db2_log_bck directory and data under /db2_data. These file systems were
located in the SAN on their own dedicated LUNs. We defined our database paths
as shown using the database creation clause. The database was configured as
an automatic storage database, with the containers placed on /db2_data.
CREATE DATABASE TPCE AUTOMATIC STORAGE YES ON /db2_data DBPATH ON
/home/db2inst1 USING CODESET ISO8859-1 TERRITORY US COLLATE
USING IDENTITY;
The following example shows the command for adding two additional storage
paths to our database. Ideally, each storage path is equal in size and lies on its
own file system on dedicated LUNs. The path names shown here are illustrative:
ALTER DATABASE virtdb ADD STORAGE ON '/db2_data2', '/db2_data3'
When the I/O capacity has increased, you might need more CPU capacity to
handle the increased I/O. For an uncapped shared processor logical partition,
this is not a problem as long as there is enough CPU capacity in the shared
processor pool. If the increase in the needed CPU capacity is permanent,
consider increasing the entitled processor capacity for the database logical
partition. With dynamic logical partitioning, this can be achieved without any
effect on service: you change the value of the desired processor units for the
LPAR. For a Power Systems server environment you must choose between
virtualized storage and locally attached storage. Although virtualized
storage provides easier maintenance, it has a slight overhead. On most systems
this overhead is not noticeable, so you can take advantage of the easier
maintenance that virtual storage offers. VIOS makes it easier to add
more storage to your system without a service break. With more than one virtual
server, you are always able to add physical adapters and SAN systems to your
environment without service breaks on your production environment. This ability
to provide additional storage and I/O resources without service downtime
combined with the ease of use and maintenance of DB2 automatic storage,
makes Power Systems environments ideal for dynamic computing environments
with dynamic resource needs.
For more information about the alternative approaches (NFS, iSCSI, and FCP) to
attach storage to a DB2 server, see IBM DB2 9 on AIX 5L with NFS, iSCSI, and
FCP using IBM System Storage N series, REDP-4250.
A whitepaper covering the testing that was done using DB2 and LPM using a
network attached storage (NAS) over iSCSI can be found at the following Web
page.
https://www-304.ibm.com/partnerworld/wps/servlet/ContentHandler/whitepa
per/power/lpm/use
There is also a demo that has been built from the tests done with DB2 and NAS
over iSCSI. It can be found at the following Web page:
http://www.ibm.com/partnerworld/wps/servlet/ContentHandler/VPAA-7M59ZR
All client logical partition requests are passed down to the VIOS, where it
performs the actual disk I/O operation and returns data directly to a client
partition (no double buffering in the case of Virtual SCSI).
LUN creation on SAN storage box
LUNs need to be defined, following the best practices for storage configuration,
using RAID techniques as described in 3.3, “Storage hardware” on page 100.
From the array configuration on the SAN, LUNs of equal size are created. A
host group is defined and contains the hosts. Those hosts are referenced by their
worldwide port name (WWPN). Host groups contain the hosts that share the
same disks. When migrating a logical partition from a source to a destination
system, all disks defined for that logical partition need to be visible from both the
source and the destination system for the migration to be successful. As with the
network, which must be accessible by both systems, the SAN disks must be
able to attach to either the source or the destination system. It is one or the other,
not both systems at the same time. Therefore, a few parameters need to be set
not only at the disk level, but also at the Fibre Channel adapter level, as explained
in the next sections.
Here are the suggested settings for the disk parameters (see the list that follows
these settings for explanations of the parameters):
› algorithm: round_robin
› hcheck_interval: 20
› reserve_policy: no_reserve
› max_transfer: see the Note, “Setting the max-transfer size” on the next page.
› queue_depth: This parameter depends on the storage box used and on the
storage box firmware.
You might want to configure the max_transfer size of your newly added disk to
the largest max_transfer size of the existing backing devices. To change the
max_transfer size, run a command similar to the following on the VIOS:
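As a sketch, the disk name hdisk4 and the value 0x100000 shown here are
illustrative and must be matched to your environment:
chdev -dev hdisk4 -attr max_transfer=0x100000 -perm
The -perm flag updates only the device definition in the ODM; the new value takes
effect when the disk is reconfigured or the VIOS is rebooted.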
For the queue_depth calculation, the DSxxx queue depth is the queue depth of
your storage box; for a DS4800 or DS5000, it is 4096.
› The disk queue_depth is aligned with the queue_depth of the storage box.
› The queue depth of the disk on the VIOS must match the queue depth of
the virtual disk on the logical partition.
For more information about setting your storage system, refer to IBM Midrange
System Storage Hardware Guide, SG24-7676 and to 3.3, “Storage hardware” on
page 100.
The suggested settings for the HBA parameters are as follows (More information
about these parameters can be found in the list after the Note below):
› dyntrk=yes
› fc_err_recov=fast_fail
› max_xfer_size= See Note
› lg_term_dma=See Note
› num_cmd_elems=See Note
Note: As these parameters depend upon the storage type you are using, refer
to Chapter 3, “Storage layout” on page 85 for the optimum values for the
particular storage type you are using.
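As a hedged example (the adapter device names and the values shown are
illustrative; use the values recommended in Chapter 3 for your storage type),
these attributes are typically set with the chdev command:
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
chdev -l fcs0 -a max_xfer_size=0x200000 -a lg_term_dma=0x800000 -a num_cmd_elems=1024 -P
The -P flag defers the change until the next device configuration or reboot, which
is useful when the adapter is in use.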
More information about these parameters is given in the following list:
› dyntrk
This parameter enables dynamic tracking, which allows the FC adapter driver
to detect when the Fibre Channel N_Port ID of a device changes. The FC
adapter driver reroutes traffic destined for that device to the new address while
the device is still online.
› fc_err_recov
AIX supports Fast I/O Failure for Fibre Channel devices after link events in a
switched environment. If the Fibre Channel adapter driver detects a link
event, such as a lost link between a storage device and a switch, the Fibre
Channel adapter driver waits a short period of time, approximately 15
seconds, so that the fabric can stabilize. At that point, if the Fibre Channel
adapter driver detects that the device is not on the fabric, it begins failing all
I/Os at the adapter driver. Any new I/O or future retries of the failed I/Os are
failed immediately by the adapter until the adapter driver detects that the
device has rejoined the fabric.
Fast Failure of I/O is controlled by a new fscsi device attribute, fc_err_recov.
The default setting for this attribute is delayed_fail, which is the I/O failure
behavior seen in previous versions of AIX.
Note: When you only have one single Fibre Channel interface, it is
recommended to leave the fc_err_recov attribute on delayed_fail.
For more information about dyntrk and fc_err_recov, refer to the following
Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.baseadmn/doc/baseadmndita/dm_mpio.htm
› max_xfer_size
This is an AIX setting that can directly affect throughput performance with
large I/O blocksizes. For more information about max_xfer_size refer to IBM
Midrange System Storage Hardware Guide, SG24-7676.
› lg_term_dma
This is an AIX setting that can directly affect throughput performance with
large I/O blocksizes. For more information about lg_term_dma refer to IBM
Midrange System Storage Hardware Guide, SG24-7676.
› num_cmd_elems
This tunable represents the maximum number of requests that can be
outstanding on or queued to a disk adapter. Change this tunable to adapt to
various memory and system conditions. The default is 200.
Attention:
› There is a relation between the queue_depth tunable on a disk and the
num_cmd_elems on a FC card.
You can tune the num_cmd_elems as follows:
num_cmd_elems = (number of paths that form the vpath) * (value of
queue_depth of the disk)
For example, if you have four paths forming the vpath to your storage and
your queue depth is 20, num_cmd_elems needs to be set to 4 * 20 = 80.
› Fast I/O Failure is useful in situations where multipathing software is used.
Setting the fc_err_recov attribute to fast_fail can decrease the I/O fail times
because of link loss between the storage device and switch. This supports
faster failover to alternate paths.
› In single-path configurations, especially configurations with a single path to
a paging device, the delayed_fail default setting is recommended.
For more information about how to set those parameters, refer to Chapter 3,
“Storage layout” on page 85.
Note: The device cannot have an IEEE volume attribute identifier. Use the
UDID or PVID of the device to identify uniquely your disk on both VIOSs.
Tip: You can establish a correspondence table that maps the LUN IDs on
the storage side to the LUN IDs created with the mkvdev command
on the VIOSs.
7.4.8 Summary
Power System virtualization offers a set of resource partitioning and
management features such as LPARs, the DLPAR facility, and virtual I/O, under
which you can implement SEA, virtual Ethernet, virtual SCSI, or a VLAN. Shared
processor partitions allow you to create an LPAR using as little as 0.10 of a
processor. The DLPAR facility enables you to change the LPAR resources
(processor, memory, and I/O slots) at run time, without rebooting the operating
system.
The versatility of the DB2 data server and a variety of possible combinations of
Power System virtualization features help DB2 applications to perform optimally
in many situations. The Power System virtualization technology can be
configured to realize both computer system and business benefits, including
high performance, workload isolation, resource partitioning, maximum
resource use, and high availability at low cost. This technology reduces total cost
of ownership (TCO) while also enhancing expandability, scalability, reliability,
availability, and serviceability.
The best practices presented in this document are essentially lessons that have
already been learned through our own testing. These best practices serve as an
excellent starting point for using the DB2 product with Power System
virtualization. You can use them to help to avoid common mistakes and to
fine-tune your infrastructure to meet your goals for both your business and IT
environment. To validate the applicability of these best practices before using
them in your production environment, establish a baseline and perform sufficient
testing with the various virtualization features.
Table 7-11 Logical partition attributes that might change after a migration
LPAR attributes that remain the same LPAR attributes that can be optionally changed
by the user
The high-level procedure for migration validation, after all prerequisites for LPM
are met, is described in the following steps:
1. Both the source and destination systems are capable of migrating and both
are POWER6 processor-based, as shown in Figure 7-9.
2. The HMC and remote HMC are set up as shown in Figure 7-8 on page 309.
(See also 7.4.2, “HMC requirements” on page 285.)
Figure 7-10 Same HMC management (Remote HMC support under conditions)
Tip: The major differences in the migration flow between active and inactive
partition migration are:
› The MSP attribute is only mandatory for active partition mobility.
› The wait time can be modified when migrating an active partition.
Note: During migration, you can monitor the progress of the migration on both
systems, using IVM or HMC, depending on your environment.
7.6 Mobility in action
Figure 7-15 shows the setup that was used for DB2 and LPM performance
characterization. Throughout the test, the setup revolved around best practices
to ensure ease in management and higher performance throughput. As
mentioned, preparation for mobility requires careful storage, system and network
planning.
An OLTP workload was set up on DB2 9.7 for the LPM experiment. The objective
of this experiment was to demonstrate that DB2 can be moved transparently from
one server to another and that client applications connected to the database
do not experience any downtime. The OLTP workload is a mixture of read-only
and update-intensive transactions that simulate the activities found in complex
OLTP application environments. Storage and Virtualization best practices were
followed to optimize the throughput of the workload.
The remote DB2 client created 50 connections to the database server, each
connection executing a set of transactions. The database server LPAR was
moved from one server to another while the OLTP workload was executing.
Figure 7-16 shows the DB2 environment settings used in the LPM experiment.
After the VIOS was installed on each blade, we could connect to the Web-based
IVM. This interface provides an HMC-like GUI that allows an administrator to
configure LPARs, virtual networks, and virtual storage on the blade and VIOS. As
this is a Web-based tool, you can point your Web browser at the VIOS host
name, and you are presented with the IVM login page. To log in, use the VIOS
padmin user ID and password.
Before we can test mobility with the JS43, we must ensure that the environment is
prepared appropriately to support it. Update the firmware levels of the JS43 and
associated components such as the Fibre Channel (FC) adapters. Download the
latest firmware images for the JS43 and the FC adapters from the JS43 support
site and apply them to each Blade. Install the latest VIOS fix packs. (Refer to
“CPU and memory requirements” on page 290.)
With the correct software and firmware levels installed, prepare the Blade, the
VIOS, and the LPAR for partition mobility.
What follows is a brief checklist of the tasks performed with the IVM:
1. Enter the PowerVM Enterprise Edition APV key on both Blades. This key is
required to enable the mobility feature on the JS43 Blade.
2. Confirm that the memory region size is the same on both Blades. This
information can be found under View/Modify System Properties in the
Memory tab.
3. Configure an SEA on both VIOS. Enable the Host Ethernet Adapter for
Ethernet bridging. This is required for the virtual Ethernet devices to access
the physical Ethernet adapter and the external network. This is performed
under the View/Modify Host Ethernet Adapter, Properties tab. Select Allow
virtual Ethernet bridging. Under View/Modify Virtual Ethernet, select the
Virtual Ethernet Bridge tab and select the physical adapter to be used as the
SEA. A message displays stating that the operation was successful. The SEA
is now configured.
4. Create an LPAR on the source Blade. Select View/Modify Partition →
Create Partition. Enter the LPAR name, memory, and processor
requirements. Ensure that none of the physical HEA ports are selected.
Under Virtual Ethernet, select the SEA to use (for instance, ent0). Under
Storage Type, select Assign existing virtual disks and physical volumes.
Select the SAN disk assigned to the VIOS, which in our environment was the
DS5300 disks.
5. Click Finish to create the LPAR.
The next step is to install AIX. This can be achieved using a NIM mksysb (or
rte) install.
6. With the AIX installation and configuration complete, you can configure DB2.
Verify that the same SAN disk can be seen by both VIOS. Using the lspv
command, check that both VIOS have the same PVID associated with the SAN
storage. Confirm that the AIX LPAR is configured with only virtual devices
(meaning no physical adapters, another prerequisite for mobility).
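For example, run the lspv command on each VIOS (or on AIX) and compare the
second column of the output, which contains the PVID; the SAN LUN that backs
the mobile LPAR must show the same PVID on both VIOSs:
lspv
If the PVIDs differ, the two VIOSs are not seeing the same LUN and the mapping
must be corrected before the migration.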
Note: To size CPU for the VIOS appropriately, take the following into consideration:
› LPM requires additional CPU during LPAR mobility.
› Shared uncapped CPUs give you the flexibility to acquire additional CPU
units whenever needed.
Note: As already mentioned we used PowerBlades during our tests. For more
information about Power Systems using HMC, refer to the IBM Redbooks
publication IBM PowerVM Live Partition Mobility, SG24-7460.
Preliminary test
For the purpose of our tests, a single VIOS was configured on each Blade, with
one active AIX LPAR running on the first Blade as a VIO client (VIOC). We
are now ready to perform a live partition migration. During the migration, the first
Blade is known as the source system and the second Blade is the destination
system.
The objective is to move the LPAR, from the Blade in the source system to the
Blade in the destination system. At the end of the migration, the AIX LPAR is
running as a VIOC from the destination system on the other physical Blade. DB2
continues to function throughout the entire migration.
Prior to the migration, run the lsconf command from AIX, and note the system
serial number (see Figure 7-17).
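A quick way to record the serial number (a sketch that assumes the standard
lsconf output format) is:
lsconf | grep "Machine Serial Number"
Running the same command after the migration shows the serial number of the
destination server.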
During the migration, DB2 jobs are running on the LPAR. Monitor the system
using the topas command and observe that DB2 processes are consuming
processors during the migration.
In the meantime, the VIOSs are hand-shaking and transmitting information
through the network.
All tasks to perform partition mobility are executed from the IVM, on the source
Blade. To start the migration, select the box next to the LPAR and choose
Migrate from the More Tasks drop-down menu. Refer to Figure 7-18.
You are presented with a panel to enter the target system details. Enter the
details and then click Validate. Refer to Figure 7-18.
Migration validation
During the validation phase, several configuration checks are performed. Some
of the checks include:
› Ensuring the target system has sufficient memory and processor resources to
meet the LPAR's current entitlements.
› Checking there are no dedicated physical adapters assigned to the LPAR.
› Verifying that the LPAR does not have any virtual SCSI disks defined as
logical volumes on any VIOS. All virtual SCSI disks must be mapped to whole
LUNs on the SAN.
› RMC connections to the LPAR and the source and target VIOS are
established.
› The partition state is active, meaning Running.
› The LPAR's name is not already in use on the target system.
› A virtual adapter map is generated that maps the source virtual
adapter/devices on to the target VIOS. This map is used during the actual
migration.
After the validation completes successfully, a message stating It might be
possible to migrate this partition ... appears (Figure 7-19). Click Migrate and
the migration to the other Blade begins. Monitor the status of the migration by
clicking Refresh.
Figure 7-20 Observe the status of the migration on the destination server
7.6.3 What happens during the partition migration phase?
During the active migration of the LPAR, state information is transferred from the
source to the target system. This state information includes such things as
partition memory, processor state, virtual adapter state, NVRAM (non-volatile
random access memory), and the LPAR configuration.
The following list details events and actions that occur during the migration:
› A partition shell is created on the target system. This shell partition is used to
reserve the resources required to create the inbound LPAR: the processor
entitlements, memory configuration, and virtual adapter configuration.
› A connection between the source and target systems and their respective
Power Hypervisor is established through a device called the Virtual
Asynchronous Service Interface (VASI) on the VIOS. The source and target
VIOS use this new virtual device to communicate with the Power Hypervisor
to gain access to the LPAR's state and to coordinate the migration. You can
confirm the existence of this device with the lsdev command on the VIOS.
Tip: Use the vasistat command to display the statistics for the VASI device.
Run this command on the source VIOS during the migration. Observe that
Total Bytes to Transfer indicates the size of the memory copy and that Bytes
Left to Transfer indicates how far the transfer has progressed.
› The virtual target devices and virtual SCSI adapters are created on the target
system. Using the lsmap command on the target VIOS before the migration,
notice that there are no virtual SCSI or virtual target device mappings.
Running the same command after the migration shows that the virtual disk
mappings have been created as part of the migration process.
› The LPAR's physical memory pages are copied to the shell LPAR on the
target system. Using the topas command on the source VIOS, you might
observe network traffic on the SEA as a result of the memory copy.
› Because the LPAR is still active, with DB2 still running, its state continues to
change while the memory is copied. Memory pages that are modified during
the transfer are marked as dirty. This process is repeated until the number of
pages marked as dirty is no longer decreasing. At this point, the target system
instructs the Power Hypervisor on the source system to suspend the LPAR.
› The LPAR confirms the suspension by quiescing all its running threads. The
LPAR is now suspended.
› During the LPAR suspension, the source LPAR continues to send partition
state information to the target server. The LPAR is then resumed.
› The LPAR resumes execution on the target system. If the LPAR requires a
page that has not yet been migrated, then it is demand-paged from the source
system.
› The LPAR recovers its I/O operations. A gratuitous ARP request is sent on all
virtual Ethernet adapters to update the ARP caches on all external switches
and systems in the network. The LPAR is now active again.
› When the target system receives the last dirty page from the source system,
the migration is complete. The period between the suspension and
resumption of the LPAR lasts a few milliseconds, as you can see in
Figure 7-21. In the meantime, the migrating LPAR displays the message
Partition Migration in progress ... as shown in Figure 7-22 on page 333.
With the memory copy complete, the VIOS on the source system removes the
virtual SCSI server adapters associated with the LPAR and removes any device
to LUN mapping that existed previously.
The LPAR is deleted from the source Blade. The LPAR is now in a Running state
on the target Blade. The migration is 100% complete.
Now that the LPAR is running on the other Blade, run the lsconf command to
confirm that the serial number has changed with the physical hardware. See
Figure 7-23.
To confirm and verify that DB2 is not impacted by the migration, check the DB2
diagnostic log (db2diag.log) for any errors. The ssh login sessions on
MOBILE-LPAR remained active and did not suffer any connectivity issues as a
result of the live migration (see Figure 7-21 on page 332).
Mobility activity is logged on the LPAR and the source and target VIOS. Review
the logs with the errpt (AIX) and errlog (VIOS) commands. On AIX, notice
messages similar to CLIENT_PMIG_STARTED and CLIENT_PMIG_DONE.
Additional information from DRMGR, on AIX is also logged to syslog (for
instance, starting CHECK phase for partition migration). On the VIOS, find
messages relating to the suspension of the LPAR and the migration status (Client
partition suspend issued and Migration completed successfully).
Output of the errpt command from the AIX LPAR and from both VIOSs relate to
the migration activity.
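As a brief example of reviewing these logs (the grep pattern simply matches the
CLIENT_PMIG_* labels mentioned above), run the following on the migrated AIX
LPAR and on each VIOS, respectively:
errpt | grep -i pmig
errlog
The errlog command is run from the VIOS padmin shell.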
7.6.4 Post migration observations
During the migration, we used nmon to capture data that we analyze hereafter.
The migration took about 20 minutes to complete. The LPAR being moved was
configured with 40 GB of memory. Most of the time required for the migration was
for the copying of the LPAR's memory from the source to the target system. The
suspend of the LPAR itself lasted no more than 141 milliseconds. Consider using
a high-performance network between the source and target systems. Also, prior
to the migration, we suggest reducing the LPAR's memory update activity. Taking
these steps improves the overall performance of the migration. (See “Mover
Service Partition” on page 288.)
Note: The time to migrate depends on the quantity of memory allocated to the
logical partition. The more memory that is allocated, the longer the migration
takes.
Looking at the error reports shows that the migration has occurred and reveals
the status of the migration.
LPM has enormous potential for dramatically reducing scheduled downtime for
system maintenance activities. Being able to perform scheduled activities, such
as preventative hardware maintenance or firmware updates, without disruption to
user applications and services is a significant enhancement to any System p
environment. Additionally, this technology can assist in managing workloads
within a System p landscape. It gives administrators the power to adjust resource
usage across an entire farm of System p servers. LPARs can be moved to other
physical servers to help balance workload demands.
Note: For the tests, we used Power Blade systems, each configured with IVM
and one VIOS (one entire CPU, 3 GB of memory). The mobile LPAR had one
entire CPU and 40 GB of memory allocated to it. As such, screen captures
taken from the IVM might differ from screen captures taken with an HMC.
OLTP workload
The OLTP workload was running when the migration started. It ran on the source
server during the pre-migration phase. There was a short interval where the
throughput dipped. This was when the hand-over happened. The workload
resumed execution on the target server after the handover. See Figure 7-24.
The following diagrams show what happened at the moment the real hand-over
occurred.
The migration process itself started at 11:20 and finished around 11:39. The load
started at about 11:00 and finished after 12:00. The handover from the source
VIOS to the destination system took place at 11:32. The following processes
were running when the migration command was issued:
› ctrlproc, the mover process
› migrlpar, the migration command that performs the validation and migration of
the logical partition
› seaproc, a process that is part of the shared Ethernet adapter and handles its
network traffic
› the accessproc process
CPU level
In the mobile partition, we see a drop in CPU activity at the moment the
hand-over happened from the source to the destination system. See Figure 7-25.
Note: After the migration is finished, you can check the error report for Client
Partition Migration Completed. After you see this in the error report, a few
remaining tasks still need to be finalized.
Figure 7-30 shows a sharp decline in disk activity (hdisk5), again showing the
time (11:32) when the source system hands over to the destination server.
VIOS1 shows the I/O activity prior to the hand-over, as shown in Figure 7-32. We
see that after the migration, VIOS1 no longer handles any I/O throughput.
Memory level
When looking at memory, we can hardly see a change, as shown in Figure 7-34.
This is expected: during the migration the application binds to the
destination system’s memory while its already bound memory is copied over the
network or is freed when no longer used by the application. Nearly all of the
memory is allocated. We see that the load on the system finished at 12:04, which
released the memory, while the migration itself finished at 11:39.
In the example and test we made, we did not notice any loss of connectivity.
Figure 7-37 does not show a loss, but rather a drastic decrease in activity for a
fraction of a second, followed by a noticeable rise in activity.
Attention: In Figure 7-41 and Figure 7-42 on page 347 we can see the migration
activity that moves the mobile partition from the source to the
destination server. The migration started at 02:46 and ended at 03:01.
However, there are still tasks to be performed to finalize the migration,
including the copying of remaining memory pages still on the source server.
Refer to 7.5, “Migration process and flow” on page 316 for more information
about the migration processes.
Note: The recommendation is to configure the VIOSs with uncapped CPU. The
weight factor needs to be set carefully to allow the VIOSs to get CPU before
any other type of partition.
Remember that the VIOSs are the main virtualization servers and cannot
afford a lack of resources; otherwise your LPARs suffer or, worse, shut down.
The mobile partition is not influenced because it has nothing to do with the
migration process. Moreover, during this test we did not run any load on it.
System activity here is mainly related to running Java processes. See
Figure 7-43.
Disk level
We do not expect any particular disk activity on the mobile partition because
there is no load on it. This is reflected in Figure 7-46.
This is no more and no less than what the VIOS uses when doing nothing.
In other words, the migration does not demand more memory during the transfer.
However, this behavior can be influenced by heavy network traffic, which
demands memory to handle the network IP packets.
Network level
The network is obviously the part that reveals what is happening during a
migration. As we have already seen in this chapter, the network is doing the job
of migrating from the source to the destination server.
Figure 7-52 on page 353 and Figure 7-53 on page 353 show that in the window
where the migration from the source to the destination server occurs, the network
reaches 75 MBps. When VIOS1 is pushing information to VIOS2, VIOS2 is
receiving that information. This is why both graphs are symmetrical.
Moreover, in Figure 7-52 we see a little read activity, and similarly in Figure 7-53
some write activity. That activity is the handshaking between both MSPs in the
migration process, transmitting all the information necessary to rebuild the LPAR’s
profile and to copy the memory to the destination system.
In this chapter we discuss WPAR concepts and how DB2 can be configured in a
WPAR environment.
Prior to WPARs, it was required to create a new logical partition (LPAR) for each
new isolated environment. With AIX 6.1, this is no longer necessary, as there are
many circumstances when one can get along fine with multiple WPARs within
one LPAR. Why is this important? Every LPAR requires its own operating system
image and a certain number of physical resources. Although you can virtualize
many of these resources, there are still physical resources that must be allocated
to the system. Furthermore, the need to install patches and technology upgrades
to each LPAR is cumbersome. Each LPAR requires its own archiving strategy
and DR strategy. It also takes time to create an LPAR through a Hardware
Management Console (HMC) or the Integrated Virtualization Manager (IVM).
Table 8-1 compares LPARs and WPARs.
LPAR: Difficult to create and manage (needs HMC or IVM).
WPAR: Easy to create and manage.
LPAR: Helps to consolidate and virtualize hardware resources in a single server.
WPAR: Helps in resource management by sharing AIX images and using the hardware resources.
LPAR: The LPAR is a single point of failure for its WPARs; if the LPAR fails, its WPARs fail.
WPAR: If a WPAR fails, only that partitioned instance is affected.
There are two types of workload partitions that can reside in a global
environment.
› System WPAR: Almost a full AIX environment.
› Application WPAR: Light environment suitable for execution of one or more
processes.
Figure 8-2 shows the global environment and WPAR.
Note: At the time of writing this IBM Redbooks publication, running an NFS
server inside a WPAR was not supported.
If the network configuration needs to be done while creating the workload
partition, run the following command:
mkwpar -l -N interface=en0 address="IP" netmask=255.255.255.192 broadcast=9.2.60.255 -n "wpar name having DNS entry"
For example, to create and start an application WPAR that runs /tmp/myApp:
wparexec -n appwpar /tmp/myApp
Example 8-1 shows a sample output of the wparexec command.
# hostname
sys_wpar_db2_private
# df
Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
Global            196608    141152   29%     1999    12% /
Global             65536     63776    3%        5     1% /home
Global           4653056   2096048   55%    27254    11% /opt
Global                 -         -    -         -     -  /proc
Global            196608    193376    2%       11     1% /tmp
Global          10027008   5658984   44%    42442     7% /usr
Global            262144    123080   54%     4371    24% /var
5. Create groups and users using the commands shown in Example 8-4.
7. Create the DB2 Instance. See Example 8-6.
db2icrt -p 50000 -s ese -u db2fenc1 db2inst1
# su - db2inst1
$ db2start
12/02/2009 12:18:51     0   0   SQL1063N   DB2START processing was successful.
SQL1063N DB2START processing was successful.
Note: DB2 cannot be installed in a system WPAR that does not have write
permission to the global environment's /usr and /opt file systems.
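A minimal sketch follows (the WPAR name db2wpar and the network values are
illustrative): create a system WPAR with private, writable /usr and /opt file
systems, which allows DB2 to be installed inside it, and then start it.
mkwpar -n db2wpar -l -N interface=en0 address=9.2.60.10 netmask=255.255.255.192
startwpar db2wpar
The -l flag gives the WPAR its own writable copies of /usr and /opt instead of the
read-only shared versions.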
You achieve near-continuous availability because Live Partition Mobility and LAM
move running workloads between Power Systems to eliminate the need for
planned system downtime. Both LPM and LAM eliminate downtime during
planned outages but cannot increase system availability during unplanned
outages such as a hardware crash. Therefore, neither is a replacement for HACMP.
With Live Partition Mobility, you can move an active or inactive partition from one
physical server to another without the user being aware of the change. With
LAM, however, you have to checkpoint and stop your WPAR so that it restarts on
the destination machine. This might result in brief periods of unresponsiveness.
Table 8-2 lists the differences between Partition Mobility and Application Mobility.
Table 8-2 Basic differences between Partition Mobility and Application Mobility
Partition Mobility: Active and inactive partitions can be moved.
Application Mobility: The application is stopped and restarted on the destination
machine. This results in brief periods of unresponsiveness.
Note: Both application mobility and partition mobility can eliminate downtime
during planned outages, but they do not provide high availability. They are not a
replacement for HACMP.
Appendix A. System health check
Table A-1 Ongoing data and data before any major change
Task    Date Completed    Location of Output
db2support
An all-purpose utility that collects data ranging from the dbm and db
configuration and table space layout to the contents of the history file.
Additional information can be found at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.trb.doc/doc/t0020808.html
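A typical invocation is shown below (the database name SAMPLE is illustrative);
it collects the diagnostic data into the current directory and connects to the
database so that catalog and configuration information can be gathered:
db2support . -d SAMPLE -c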
db2support to gather optimizer data
If, for example, you collected the access plans of a certain query before that
query started to take a greater amount of time, you can use those access
plans to diagnose why the query is experiencing a performance
degradation. The db2support utility invoked in optimizer mode can be used to
collect such data. Additional information can be found at the following
Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.admin.cmd.doc/doc/r0004503.html
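As a sketch (the database name and SQL statement are illustrative), optimizer
data for a single problem statement can be collected as follows:
db2support . -d SAMPLE -st "SELECT COUNT(*) FROM syscat.tables"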
db2fodc -perf
db2fodc is a utility that can be invoked to gather performance-related
data. By saving the DB2 performance-related data when the system is
performing well you can establish a baseline to compare performance
issues if the DB2 performance degrades. Additional information can be
found at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.admin.cmd.doc/doc/r0051934.html
db2cfexp / db2cfimp
These utilities are used to export connectivity configuration information to
an export profile, which can later be imported. Additional information can
be found at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.qb.migration.doc/doc/t0023398.html
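For example (the profile file name is illustrative), export the connectivity settings
on one system and import them on another:
db2cfexp conn_profile.exp TEMPLATE
db2cfimp conn_profile.exp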
Table A-2 Ongoing health checks
Task    Date Completed    Location of Output
Rebinding packages
Applications that issue static SQL statements have their access plan
stored in the system catalogs. If there are changes in data and new
statistics have been gathered to reflect these changes, these packages
that contain the data access methods need to be rebound. Additional
information can be found at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.qb.migration.doc/doc/t0022384.html
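A hedged example (the database name and log file name are illustrative): rebind
all packages in the database after new statistics have been gathered.
db2rbind SAMPLE -l rebind.log all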
Dynamic SQL
Without delving much into the tasks involved with monitoring, one monitoring
check is included here: the time taken by individual dynamic SQL statements,
which identifies the SQL statements that are consuming the most CPU and are
I/O intensive. Additional information can be found at
the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.admin.mon.doc/doc/r0007635.html
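For example (the database name SAMPLE is illustrative, and the statement
monitor switch must be on for the timing elements to be populated):
db2 connect to SAMPLE
db2 get snapshot for dynamic sql on SAMPLE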
db2advis
The design advisor can assist in the creation of indexes or even MQT's
for given workloads and individual SQL statements. After identifying a
particular SQL statement, the db2advis utility can be used to recommend
various indexes and the performance gain that might result as a result of
creating such recommendations. Additional information can be found at
the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.admin.cmd.doc/doc/r0002452.html
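A hedged example (the database name, statement, and time limit are illustrative):
ask the Design Advisor for recommendations for one statement, limiting the
advisor run to five minutes.
db2advis -d SAMPLE -s "SELECT COUNT(*) FROM syscat.tables WHERE tbspace = 'USERSPACE1'" -t 5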
Diagnostic logs
Prudent DBAs not only monitor the state of their databases but also
check their diagnostic logs to find errors that might not be apparent on the
surface. Two such logs to keep checking are the db2diag.log and the
Administration notification log. Additional information can be found at the
following Web pages:
› http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp
?topic=/com.ibm.db2.luw.admin.ha.doc/doc/c0023140.html
› http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp
?topic=/com.ibm.db2.luw.admin.trb.doc/doc/c0020815.html
db2dart / db2 inspect
To check the structural integrity of the underlying database table spaces
and containers in which the data, indexes, LOBs, and so forth reside, use
the db2dart and/or the db2 INSPECT utilities. Additional information can be
found at the following Web page:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?
topic=/com.ibm.db2.luw.admin.trb.doc/doc/c0020763.html
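A hedged example follows (the database name and output file name are
illustrative). db2dart is intended to be run against an inactive database, whereas
INSPECT can be run while the database is online:
db2dart SAMPLE /DB
db2 connect to SAMPLE
db2 "INSPECT CHECK DATABASE RESULTS KEEP inspect.out"
The INSPECT results file is written to the database diagnostic directory and can
be formatted with db2inspf.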
AIX health check
Table A-3 summarizes the tools you can use for AIX health check.
NMON
nmon (short for Nigel's Monitor) is a popular system monitor tool for the
AIX and Linux operating systems. It provides monitoring information
about the overall health of these operating systems. Information about
nmon can be found at the following Web page:
http://www.ibm.com/developerworks/aix/library/au-analyze_aix/
VMSTAT
This tool is useful for reporting statistics about kernel threads,
virtual memory, disks, and CPU activity. Information about VMSTAT
usage can be found at the following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm
.aix.cmds/doc/aixcmds6/vmstat.htm
IOSTAT
The iostat tool reports CPU statistics and input/output statistics for TTY
devices, disks, and CD-ROMs. See the following Web page for details
on using IOSTAT:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?top
ic=/com.ibm.aix.cmds/doc/aixcmds3/iostat.htm
LPARSTAT
Use this tool to check the resources for the LPAR on AIX.
It can be used to see the overall CPU usage relative to the shared
pool and to get statistics with regard to the power hypervisor. Information
about LPARSTAT usage can be found in the following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm
.aix.cmds/doc/aixcmds3/lparstat.htm
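Typical interval-based invocations of the tools above collect a few samples at a
fixed interval (the 5-second interval and three samples shown are illustrative):
vmstat 5 3
iostat 5 3
lparstat 5 3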
PM Services
PM Services can be used to monitor the overall system vitals of Power
on AIX. It is a useful utility for monitoring the overall health
of a system or multiple systems. Information about PM can be found at the
following Web page:
http://www-03.ibm.com/systems/power/support/pm/news.html
PS
The PS command displays statistics and status information about
processes in the system, including process or thread ID, I/O activity, and
CPU and memory use. The PS command can be used to monitor memory
use by an individual process. For more information, see the following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.cmds/doc/aixcmds4/ps.htm
NETSTAT
The netstat command displays information regarding traffic on the
configured network interfaces. For more information, see the following
Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm
.aix.cmds/doc/aixcmds4/netstat.htm
SVMON
The svmon command can be used for in-depth analysis of memory usage. It
displays information about the current state of memory. For more
information, see the following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm
.aix.cmds/doc/aixcmds5/svmon.htm
PRTCONF
Use this command to get the basic system configuration. For more information,
see the
following Web page:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm
.aix.cmds/doc/aixcmds4/prtconf.htm
Abbreviations and acronyms
AIO     asynchronous I/O
AMS     Active Memory Sharing
Ack     acknowledgement
CA      Configuration Advisor
CGTT    created global temporary tables
COD     Capacity On Demand
CTQ     Command Tagged Queueing
DBA     database administrator
DFU     Decimal Floating point unit
DGTT    declared global temporary table
DLPAR   dynamic logical partitioning
DMS     Database managed space
DW      data warehouse
FC      Fibre Channel
GPFS    General Parallel File System
HA      high availability
HADR    High Availability Disaster Recovery
HBA     Host Based Adapter
HMC     Hardware Management Console
IBM     International Business Machines Corporation
IOCP    I/O completion ports
ITSO    International Technical Support Organization
IVE     integrated virtual Ethernet
IVM     Integrated Virtualization Manager
JFS2    Journaled Filesystem Extended
LBAC    Label Based Access Control
LPAR    logical partition
LRU     Least Recently Used
LTG     logical track group
LUN     logical unit number
LVM     Logical Volume Manager
MLS     multilevel security
MPIO    multiple path I/O
MTU     maximum transmission unit
NIM     Network Installation Manager
NUMA    non-uniform memory access
OLTP    online transaction processing
PV      physical volume
RAID    Redundant Array of Independent Disks
RBAC    Role Based Access Control
RID     row identifier
SAMP    System Automation for Multiplatforms
SEA     Shared Ethernet Adapter
SMS     system managed space
SMT     Simultaneous Multi Threading
STMM    Self Tuning Memory Manager
TCO     total cost of ownership
VG      volume groups
VMM     Virtual Memory Manager
VMX     Vector Multimedia extension
WPAR    Workload Partition
Related publications
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this book.
IBM Redbooks
For information about ordering these publications, see “How to get Redbooks” on
page 382. Note that a few of the documents referenced here might be available
in softcopy only.
› IBM PowerVM Live Partition Mobility, SG24-7460
› PowerVM Virtualization on IBM System p: Introduction and Configuration
Fourth Edition, SG24-7940
› Introduction to the IBM System Storage DS5000 Series, SG24-7676
› DB2 UDB V7.1 Performance Tuning Guide, SG24-6012
› Integrated Virtualization Manager on IBM System p5, REDP-4061
› A Practical Guide for Resource Monitoring and Control (RMC), SG24-6615
› Integrated Virtual Ethernet Adapter Technical Overview and Introduction,
REDP-4340
› WebSphere Application Server V6.1: JMS Problem Determination,
REDP-4330
› IBM DB2 9 on AIX 5L with NFS, iSCSI, and FCP using IBM System Storage
N series, REDP-4250
› IBM System p Advanced POWER Virtualization (PowerVM) Best Practices,
REDP-4194
Online resources
These Web sites are also relevant as further information sources:
› AIX 6.1 information center:
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp
› IBM System p information center:
http://publib16.boulder.ibm.com/pseries/index.htm
› PowerVM information center:
http://www-03.ibm.com/systems/power/software/virtualization/index.html
› Power Systems information center:
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp
› DB2 information center:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp
› AIX Commands Reference:
http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic
=/com.ibm.aix.doc/doc/base/commandsreference.htm
› Information about HMC upgrade:
http://www-933.ibm.com/support/fixcentral/
› Information about how to set up SSH keys authentication for HMC setup:
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp
› Information about how to change your current VIOS level:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/download/home.html
› Location for DB2 and NAS over iSCSI demo:
http://www.ibm.com/partnerworld/wps/servlet/ContentHandler/VPAA-7M59ZR
Help from IBM
IBM Support and downloads
ibm.com/support
ibm.com/services
Index
autonomics 2
Numerics
64-bit kernel 11
80-20 rule 133 B
BACKUP utilities 16
Barrier synchronization 301
A benchmarking 131
Active Memory Sharing (AMS) 23, 250
Blue Gene 4
active migration 270
built-in autonomics 273
active processors 29
advanced DB2 registry parameters 73
DB2_LARGE_PAGE_MEM 73 C
DB2_RESOURCE_POLICY 73 capped mode 28
DB2MEMDISCLAIM 75 catawarehouse environments 98
advanced virtualization capabilities 23 data compression 99
AIX 13 placing tables into tablespaces 98
AIX 5.3 Technology Level 7 13 chpath 253
AIX configuration 35 client memory 36
tunable parameters 36 clustering administration 19
AIX kernel 13 computational memory 36
AIX kernel extensions 13 Configuration Advisor (CA) 16
AIX NUMA 73 configurational differences 80
AIX Virtual Memory Manager (VMM) 36 AIX 5.3 versus AIX 6.1 80
computational memory 36 CREATE TABLESPACE 89
considerations for DB2 40
DB2 considerations
D
lru_file_repage 42
data stream prefetching 6
maxclient% 41
data warehouse (DW) 88, 105
maxfree 41
database administrator (DBA) 86
maxperm% 42
DATABASE_MEMORY parameter 230
minperm% 42
DataPropagator 19
strict_maxclient 42 DB2 17, 19
free list of memory pages 38 DB2 9.7 Enterprise Server Edition 19
Large page considerations 46 DB2 containers 220
non-computational memory 36 DB2 Database Configuration (DB) parameters 80
page replacement 38 DB2 Database Manager Configuration (DBM) pa-
ALTER DATABASE 89 rameters 80
array 104 DB2 HADR (high availability disaster recovery) 272
array configuration 111 DB2 health center 16
Asynchronous Input Output (AIO) 67 DB2 health check 373
consideration for DB2 67 DB2 overrides 83
legacy AIO 67 DB2 Performance Expert 18
Posix AIO 67 DB2 Performance Optimization Feature 18
automatic storage 88 DB2 pureScale 19
Automatic Variable Page Size 14
DB2 Query Patroller 18 fc_err_recov 293, 313
DB2 registry parameters 72 fcstat 119
DB2_LOGGER_NON_BUFFERED_IO 72 Fibre Channel (FC) 100
DB2_PARALLEL_IO 73 adapters configuration 118
DB2_USE_FAST_PREALLOCATION 73 max_xfer_size 118
DB2_USE_IOCP 72 num_cmd_elems 118
DB2 Replication 19 file system 126
DB2 storage design 87 caching 97, 220
DB2 Storage Optimization Feature 17 firmware level 281
DB2 user resource limits (ulimits) 70 Floating-Point Arithmetic Standard 5
DB2 Workload Manager 18 fscsi device attribute 313
db2agntdp 201
db2nodes.cfg 75 G
db2pd 129 general parallel file system (GPFS) 36
DBMS memory set 145 global environment 368
deactivated processors 29 global temporary tables (CGTT) 99
DEADLOCKS 140
Decimal Floating point unit (DFU) 5
declared global temporary tables (DGTT) 99
H
HACMP 370
dedicated LPAR 231
hardware management console (HMC) 24, 250,
design advisor 375
300
disaster recovery solution 272
hash joins 136
disk latency 105
hcheck_interval 293, 310
disk mapping 314
health check 371
down level HMC 277
high availability disaster recovery (HADR) 19
DS family 100
host based adapter (HBA) 101, 246
dual-core processor 4
hot spare drive 103
dynamic logical partitioning (DLPAR) 20, 229
huge pages 301
dynamic processor deallocation 6
Hypervisor 229
dynamic SQL snapshot data 141
dynamic SQL 141 I
dyntrk 293, 313 I/O per seconds (IOPS) 104
I/O virtualization 32
IBM AIX V6.1 new features 12
E
continuous availability 13
emulation function 7
concurrent AIX kernel update 13
entitled pool capacity 236
dynamic tracing 13
entstat 300
storage keys 13
errlog 334
manageability 14
errpt 334
automatic variable page size 14
establishing a baseline 131
IBM Systems Director Console for AIX 14
Etherchannel 59
Network Installation Manager Support for
extendvg 123
NFSv4 14
solution performance tuning 14
F security 12
fast I/O failure for Fibre Channel 313 encrypted file system 12
fault tolerant array 103 role based access control (RBAC) 12
FC adapter redundancy 49
    secure by default 13
    support for long pass phrases 13
    trusted AIX 12
    trusted execution 13
  virtualization 12
    Live Application Mobility 12
IBM DB2 Storage Optimization Feature license 99
IBM System Storage DS3000 100
IBM System Storage DS5000 100
IBM Systems Director Console for AIX 11
IBM TotalStorage DS6000 4
IBM TotalStorage DS8000 4
IBM WebSphere 234
IEEE 802.1Q VLAN-tagging 255
ifconfig 55
inactive migration 270
inactive processors 29
INLINE LENGTH 93
input and output tunable considerations 63
  j2_maxpageReadAhead 64
  j2_maxRandomWrite 64
  j2_minPageReadAhead 64
  j2_nPagesPerWriteBehindCluster 66
  maxpgahead 64
  numfsbufs 65
  pv_min_pbuf 66
  sync_release_ilock 66
INSTANCE_MEMORY parameter 230
instruction-set architecture (ISA) 3
integrated virtualization manager (IVM) 24, 250, 360
Interface Tool (SMIT) 11
Internet 81
Internet Control Message Protocol (ICMP) 81
inter-partition communication 21
introduction
  DB2 15–20
  IBM AIX V6.1 11
  Power Systems 3
  PowerVM virtualization 21–22
  virtualization 6–8
ioo command 63
IOSTAT 173
IP forwarding 81
ipintrq overflows 53
iSCSI 308
isolated 7
isolated environment 7, 360
IT staff 19

J
JFS (journaled file system) 36
JFS2 (Enhanced Journaled File System 2) 11
jumbo frames 58
jumbo sized packets 59

L
Label Based Access Control (LBAC) 17
large objects 17
LARGE tablespace 94
lg_term_dma 313
link quality 101
Linux 269
Linux NUMA 73
live application mobility (LAM) 370
live partition mobility (LPM) 8, 11, 249, 269
  active migration 270
  active partition migration 320
  benefits 272
  DB2 migration 273
  definition 269
  HMC requirements 285
  inactive migration 270
  inactive partition migration 321
  IVM requirements 287
  LPAR requirements 294
  managed system’s requirements 275
  migration process and flow 316
  migration validation 317
  mobility in action 322
  network requirements 303
  overview 270
  storage requirements 306
  summary 315
  system planning 274
  VIOS requirements 287
  what it is not 272
  why you would use 271
Load Lookahead (LLA) 6
lock escalations 140
LOCK_ESCALS 140
LOCK_TIMEOUTS 140
logical drive 114
logical host Ethernet adapter (LHEA) 305
logical memory blocksize 284
logical track group (LTG) size 123
logical unit number (LUN) 114
logical volume (LV) 120
  configuring 125
logical volume manager (LVM) 11, 119
low DB2 bufferpool hit ratio 230
LPAR (logical partition)
  profile 227
  throughput 232
LPAR (logical partition) considerations 225
  capped shared processor LPAR 236
  configuration 228
  dedicated LPAR 231
  dedicated versus shared-processor partitions 232
  dynamic logical partitioning (DLPAR) 229
  dynamically altered resources 229
  micro-partition 231
  monitoring 238
  performance test results 231
  planning 226
  profile 227
  running DB2 234
  scalability 232
  summary 233
  throughput 232
  uncapped micro-partition 234
  uncapped shared processor LPAR 236
  virtual processors 235
lparstat 238–239
lru_file_repage 37, 42
lsconf 326
lshmc -V 286
lslparmigr 291
lsmap 325
lspartition 299
lspartition -all 299
lsrsrc IBM.ManagementServer 299
lssyscfg 282–283
LVM configuration 119

M
manipulated routes 81
manual failover 255
materialized query table (MQT) 99
max_coalesce 118
max_transfer 117, 311
max_xfer_size 313
memory regions 46
memory virtualization 30
micro-partitioning 4, 21, 231
migration support 276
migration-aware application 273
mkps 51
mksysb 323–324
mkvdev 315
mkvg 120
monitoring 129
  activity level monitoring 155
  database object monitoring 159
  enhancements in DB2 9.7 149
  in-memory metrics 150
  monitoring tools for AIX 169
  new administrative views 160
  planning and tuning 133
  scenarios 185
    alleviating bottlenecks during index creation 199
    alleviating I/O bottleneck 209
    Alter Tablespace no file system cache 185
    disk performance bottleneck 221
    memory consumed by the AIX filesystem cache 216
  tools for DB2 134
    Average Bufferpool I/O Response Time 136
    Buffer Pool Hit Ratio 135
    db2top 142
    dirty page steals 139
    dirty page threshold 139
    Dynamic SQL Metrics 141
    lock metrics 140
    no victim buffers available 139
    Package Cache Hit Ratio 140
    page cleaning 138
    Prefetch Ratio 139
    Rows Read/Row Selected Ratio 140
    snapshots 135
    sort metrics 140
    traces 134
    Transaction Log Response Time 138
mover service partition 288
multi-partition databases 18
multipath I/O software 262
multiple shared-processing pool (MSPP) 236
  architectural overview 237
  default shared-processor pool 237
  entitled capacity 236
  maximum pool capacity 236
  physical shared-processor pool 236
  reserved pool capacity 236
multi-threaded architecture 209

N
netstat 176, 300
network file system (NFS) 36
network tunable considerations 51
  clean_partial_connection 52
  ip6srcrouteforward 53
  ipignoreredirects 53
  ipqmaxlen 53
  ipsendredirects 54
  ipsrcrouterecv 54
  jumbo frames 58
  maximum transmission unit (MTU) 58
  rfc1323 55
  tcp_nagle_limit 55
  tcp_nodelayack 56
  tcp_pmtu_discover 56
  tcp_recvspace 56
  tcp_sendspace 57
  tcp_tcpsecure 57
  udp_pmtu_discover 58
  udp_recvspace 58
  udp_sendspace 58
nmon 179, 335
nmon analyzer 179
no command 51
  usage 51
non-computational memory 36
nondestructive change 231
not 369
N-Port Identifier Virtualization (N-PIV) 22, 246
num_cmd_elems 314
numclient 36
numperm 36
NVRAM (non-volatile random access memory) 331

O
OLTP (online transaction processing) 104
online reorganization 19
operating system (OS) 110
OVERHEAD 96

P
padmin 315
page fault 38
paging space considerations for DB2 49
Para virtualization 9
Pareto's Principle 133
PCI error recovery 6
peak demand 234
performance engineers 170
performing ongoing health checks 372
physical components considerations 101
  cables and connectors 101
  drives 102
  Host Based Adapter 101
  hot spare drive 103
physical partition size 121
physical shared processor pool 236
physical volume (PV) 116
physical volume ID (PVID) 315
point-in-time monitoring 134
poor performance
  causes 132
    application design 132
    system and database design 132
    system resource shortages
      CPU 132
      Disk I/O 133
      Memory 132
      Network 133
POWER architecture 3
Power Architecture® technology 3
Power Hypervisor 9, 21, 276, 300
Power Hypervisor enablement 277
Power processor technology 4
POWER4 11
POWER5 11
POWER6 11
POWER6 processor-based systems 6
POWER6 Storage Keys 11
PowerHA™ pureScale technology 19
PowerVM 21
PowerVM editions 23
  Enterprise Edition 23
  Express Edition 23
  Standard Edition 23
PowerVM Enterprise Edition 246
PowerVM Express 246
PowerVM Standard 246
PowerVM virtualization architecture 24
PPC970 11
prefetch size 273
private sort heap 140
probevue 13
processor virtualization 26
promiscuous mode 304
PS 175
pureXML 2

Q
queue_depth 116, 311
queue_depth attribute 264

R
RAID levels 104–105, 110
  comparison 110
  RAID 0 105
  RAID 1 106
  RAID 10 109
  RAID 5 107
  RAID 6 108
Redbooks Web site 382
  Contact us xxvii
Redundant Array of Independent Disks (RAID) 100
REGULAR tablespace 93
reliability, availability, serviceability (RAS) 6
remote live partition mobility 283
REORG 16
reserve_policy 293, 311
resource monitoring and control (RMC) 298
restricted use tunables 40
rfc1323 82
RISC System/6000 3
role based access control (RBAC) 12
rootvg 49
row-level compression 17
RS/6000 3
run a benchmark 131
RUNSTATS 16

S
SAN
Sarbanes-Oxley 16
schedo 69
scheduler tunable considerations 69
SCSI-compliant logical unit numbers (LUNs) 244
segment size 114
self tuning memory manager (STMM) 273
self-configuring 16
self-healing 16
self-tuning memory feature 16
self-tuning memory manager (STMM) 20
shared 28
shared CPU mode 28
shared Ethernet adapter (SEA) 22, 32, 255
  failover 255
shared processor pools 236
shared sort usage 205
shared-processor pools 22
simultaneous multi-threading (SMT) 30
SLAs (service level agreements) 8
Snapshot Monitor, Activity Monitor 129
socket buffer space 82
software licensing costs 9
source routing 82
split the hot spares 103
state information 331
storage hardware 100
storage keys 13
support for long pass phrases 13
SVMON 176
syncd daemon 66
System i 11

T
table spaces 18
  temporary 17
temporary tables 17
time reference partition (TRP) 280, 292
Tivoli Systems Automation/Power High Availability 272
TOPAS 176
total cost of ownership (TCO) 6, 20
TRANSFERRATE 96
trusted AIX 12
trusted execution 13
tuning storage on AIX 115
  hdisk tuning 116
  multipath driver 115
  tools 248

U
uncapped micro-partition 234
uncapped mode 28
unit of work (UOW) 154
universal device ID (UDID) 315
V
varyonvg 123
vasistat 331
vector multimedia extension (VMX) 4–5
vhost adapter 311
virtual 9, 21
virtual Ethernet 32
virtual Fibre Channel (NPIV) 246
virtual Fibre Channel adapter 246
virtual I/O server (VIOS) 243
  Active Memory Sharing 250
  best practices 261
  CPU settings 264
  dual virtual I/O Server 251
  live partition mobility 249
  logical partition 246
  network interface backup 254
  network virtualization 248
    EtherChannel 249
    link aggregation 249
    live partition mobility 249
    shared Ethernet adapter (SEA) 249
    virtual LAN 248
  networking 262
  resilience 250
  SCSI queue depth 264
  SEA threading 266
  sizing 257
  Virtual Fibre Channel (NPIV) 246
  virtual LAN 248
  virtual network redundancy 253
  Virtual SCSI 244
  virtual SCSI redundancy 252
virtual I/O server (VIOS) media support 287
virtual LAN (VLAN) 7, 248
virtual machine 9
virtual memory manager (VMM) 47
virtual processor (VP) 28–29
virtual SCSI 21
virtual storage 33
virtual tape 7
virtualization engine 4
virtualization types 9
  full hardware and firmware embedded virtualization 9
  full virtualization 9
  OS based virtualization 9
  para virtualization 9
virtualized 9
virtualized consolidated environment 9
vmo command
  usage 38
VMSTAT 170
volume group (VG) 120
  configuring 120
  creating 120
vpm_xvcpus parameter 235
vtscsi device 307

W
WebSphere Application Server 185–186
wireline telecommunications 4
workload partition 359–360
workload partition (WPAR) 11
  application isolation 368
  global environment 368
  live application mobility versus live partition mobility 370
  sandbox environment 369
  self-contained 368
  troubleshooting 369
  when not the best choice 369
  when to use 368
  workload isolation 368
world wide port name (WWPN) 246–247

X
XML documents 17
Back cover

Explains partitioning and virtualization technologies for Power Systems
Discusses DB2 performance optimization on System p
Covers OLTP and data warehouse workloads

This IBM Redbooks publication presents a best practices guide for DB2 and InfoSphere Warehouse performance on AIX 6.1 in a Power Systems virtualization environment. It covers Power hardware features such as PowerVM, multi-page support, and Reliability, Availability, and Serviceability (RAS), and explains how to best exploit them with DB2 LUW workloads for both transactional and data warehousing systems.

The popularity and reach of DB2 and InfoSphere Warehouse have grown in recent years. Enterprises are relying more on these products for their mission-critical transactional and data warehousing workloads, so it is critical that they be supported by an adequately planned infrastructure.

IBM Power Systems have been leading players in the server industry for decades, providing great performance while delivering reliability and flexibility to the infrastructure. This book presents a reference architecture for building a DB2 solution for transactional and data warehousing workloads using the rich features offered by Power Systems. It aims to demonstrate the benefits DB2 and InfoSphere Warehouse can derive from a Power Systems infrastructure and how Power Systems support these products.

The book is intended as a guide for a Power Systems specialist who needs to understand the DB2 and InfoSphere Warehouse environment, and for a DB2 and InfoSphere Warehouse specialist who needs to understand the facilities available on Power Systems to support these products.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information:
ibm.com/redbooks

SG24-7821-00 0738434191