GPFS for AIX
Abbas Farazdel
Robert Curran
Astrid Jaehde
Gordon McPheeters
Raymond Paden
Ralph Wescott
ibm.com/redbooks
SG24-6035-00
Take Note! Before using this information and the product it supports, be sure to read the
general information in Special notices on page 265.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
The team that wrote this redbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Special notice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
IBM trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1. A GPFS Primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Why GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 When to consider GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Planning considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 I/O requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.2 Hardware planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.3 GPFS prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.4 GPFS parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 The application view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2. More about GPFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Structure and environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Global management functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 The configuration manager node . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 The file system manager node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Metanode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 File structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 User data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Replication of files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5 File and file system size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Memory utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.1 GPFS Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.2 When is GPFS cache useful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.3 AIX caching versus GPFS caching: debunking a common myth . . . 25
Chapter 3. The cluster environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 RSCT basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Topology Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Group Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Preface
With the newest release of General Parallel File System for AIX (GPFS), release
1.4, the range of supported hardware platforms has been extended to include
AIX RS/6000 workstations that are not part of an RS/6000 SP system. This is the
first time that GPFS has been offered to non-RS/6000 SP users. Running GPFS
outside of the RS/6000 SP requires High Availability Cluster
Multi-Processing/Enhanced Scalability (HACMP/ES) to be configured, and the
RS/6000 systems within the HACMP cluster (those that will be part of the GPFS
cluster) must be concurrently connected to a serial storage architecture (SSA)
disk subsystem.
This redbook focuses on the planning, installation, and implementation of GPFS
in a cluster environment. The tasks covered include the installation and
configuration of HACMP to support the GPFS cluster, implementation of the
GPFS software, and the development of application programs that use GPFS. A
troubleshooting chapter is included in case any problems arise.
Special notice
This publication is intended to help system administrators, analysts, installers,
planners, and programmers of GPFS who would like to install and configure
GPFS 1.4. The information in this publication is not intended as the specification
of any programming interfaces that are provided by GPFS or HACMP/ES. See
the PUBLICATIONS section of the IBM Programming Announcement for GPFS
Version 1, Release 4 and HACMP/ES Version 4, Release 4 for more information
about what publications are considered to be product documentation.
IBM trademarks
The following terms are trademarks of the International Business Machines
Corporation in the United States and/or other countries:
AIX
AS/400
Current
IBM
Notes
Redbooks Logo
SP
XT
AT
CT
e (logo)
Micro Channel
Redbooks
RS/6000
SAA
SP2
Comments welcome
Your comments are important to us!
We want our IBM Redbooks to be as helpful as possible. Send us your
comments about this or other Redbooks in one of the following ways:
Use the online "Contact us" review redbook form found at:
ibm.com/redbooks
Chapter 1. A GPFS Primer
This introductory chapter briefly describes topics which should be understood
prior to attempting the first installation of the GPFS product. It includes the
following:
What is GPFS
Why GPFS
The basics
When to consider GPFS
Planning considerations
The application view
We will use the term cluster to describe either the nodes of an SP or the
members of an HACMP cluster that share an instance of GPFS. We will use the
term direct attach to describe disks that are physically attached to multiple nodes
using SSA connections and contrast that to the VSD connections within the SP.
Figure 1-1 shows a cluster residing on an SP using the VSD. Figure 1-2 on
page 7 shows a similar cluster using directly attached disks.
GPFS is targeted at applications that execute on a set of cooperating cluster
nodes running the AIX operating system and share access to the set of disks
that make up the file system. These disks may be physically shared using SSA
loops directly attached to each node within HACMP clusters or shared through
the software simulation of a storage area network provided by the IBM Virtual
Shared Disk and the SP switch. Consult the latest IBM product documentation
for additional forms of physically shared connectivity.
In addition, GPFS requires a communication interface for the transfer of control
information. This interface does not need to be dedicated to GPFS; however, it
needs to provide sufficient bandwidth to meet your GPFS performance
expectations. On the SP, this interface is the SP switch (SP Switch or SP
Switch2). For HACMP clusters, we recommend a LAN with a capability of at least
100Mb/sec.
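As a quick illustration (the adapter name ent0 is an assumption; substitute the interface that will carry your GPFS control traffic), the configured speed of an Ethernet adapter can be checked with lsattr:

lsattr -El ent0 -a media_speed

If the adapter reports less than 100 Mb/sec, consider a faster network before relying on it for GPFS.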
Figure 1-1 A simple GPFS configuration on the SP: three application nodes running GPFS over VSD, connected through the SP switch to two VSD server nodes that share a collection of disks
It uses the Group Services component of the IBM Parallel Systems Support
Program (PSSP) or HACMP to detect failures and continue operation
whenever possible.
Release 4 of GPFS uses SSA direct multi-attachment of disks. Additional
direct attachment methods are possible in the future. On the SP, VSD allows
the use of any type of disk which attaches to the RS/6000 SP and is
supported by AIX.
GPFS data can be exported using NFS including the capability to export the
same data from multiple nodes. This provides potentially higher throughput
than servers that are limited to one node. GPFS data can also be exported
using DFS although the DFS consistency protocols limit the export to one
node per file system.
Figure 1-1 on page 4 illustrates a simple five node GPFS configuration. The
three nodes at the top of the configuration are home to applications using GPFS
data. The two at the bottom share connections to some number of disks. One of
these VSD servers is the primary path for all operations involving each disk, but
the alternate path is used if the primary is not available. A node can be the
primary for some disks and the backup for others.
GPFS uses a token manager to pass control of various disk objects among the
cooperating instances. This maintains consistency of the data and allows the
actual I/O path to be low function and high performance. Although we have
illustrated applications and VSD servers on independent nodes, they can also
share a node. The VSD servers consume only a portion of the CPU cycles
available on these nodes, and it is possible to run some applications there. The
GPFS product documentation describes these choices in more detail.
The use of the backup disk server covers the failure of a single VSD server node.
The failure of individual disk drives can cause data outages. However, the use of
RAID, AIX mirroring, or GPFS replication can mitigate these outages. GPFS also
provides extensive recovery capabilities that maintain metadata consistency
across the failure of application nodes holding locks or performing services for
other nodes. Reliability and recovery have been major objectives of the GPFS
product from its inception.
Figure 1-2 on page 7 shows a GPFS configuration within an HACMP cluster. The
HACMP cluster differs from the SP cluster in that it requires direct attachment of
the SSA disks to every node. It also requires the communications link shown to
carry control information such as tokens between nodes. The SSA adapters
support two nodes in RAID mode or eight nodes in JBOD (just a bunch of disks)
mode, so GPFS replication of data may be useful in larger configurations.
Figure 1-2 A GPFS configuration within an HACMP cluster: application nodes running GPFS, connected by an IP network and sharing SSA disks on an SSA loop
There are four steps to be taken before attempting a GPFS installation. We will
overview the thought process and considerations in each of these steps:
Consider your I/O requirements
Plan your hardware layout
Consider the GPFS prerequisites
Consider the GPFS parameters required to meet your needs
the request size of your dominant applications, whichever is larger. This is not
the same as the burst I/O rate quoted by disk manufacturers. As file systems
fragment, disk blocks related to the same file will get placed where space is
available on the disks.
You should consider if RAID is needed for your systems and if so, match the
RAID stripe width to the block size of the file system. GPFS and other file
systems perform I/O in file system block multiples and the RAID system should
be configured to match that block size unless the application set is mostly read
only.
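As an illustration only (the mount point, device name, and disk descriptor file are hypothetical, and the full option list is in the GPFS documentation for mmcrfs), a file system whose block size matches a 256 KB RAID stripe width might be created like this:

mmcrfs /gpfs1 gpfs1 -F /tmp/gpfs1.desc -B 256K

The point is simply that the -B value is chosen to line up with the RAID stripe width rather than left at the default.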
You should consider the disk attachment mechanism and its capabilities. In
general, both SSA and SCSI allow the attachment of more disks than the links
can transfer at peak rates in a short period. If you are attaching disks with an
objective for maximum throughput from the disks, you will want to limit the
number of disks attached through any adapter. If disk capacity, rather than
optimal transfer rates, is the major concern, more disks can use the same
adapter. If you are operating in direct attach mode, note that the disk attachment
media is shared among all the nodes and you should plan on enough disk
attachment media to achieve the desired performance.
If you are configuring your GPFS with VSDs you should consider the number of
VSD servers required to achieve your expected performance. The VSD server
performance is usually limited by the capabilities of the specific server model
used. The limiting factor is usually some combination of the I/O bandwidth of the
node and the LAN/switch bandwidth available to the node. CPU utilization of
VSD servers is usually relatively low. The capabilities of differing node types will
vary. Spreading the load across additional VSD servers may also be beneficial if
I/O demand is very bursty and will cause temporary overloads at the VSD
servers.
In a VSD environment, the decision to run applications on the VSD servers or to
have dedicated VSD servers depends primarily on the nature of your
applications and the amount of memory on the nodes. There are additional CPU
cycles on most VSD servers that can be used by applications. VSD service runs
at high priority so that it can be responsive to I/O devices. If the applications are
not highly time sensitive, or not tightly coupled with instances that run on nodes
which do not house VSD servers, use of these extra cycles for applications is
feasible. You should ensure that sufficient memory is installed on these nodes to
meet the needs of both the disk/network buffers required for VSD and the
working area of the application.
On the SP
GPFS administration uses the PSSP security facilities for administration of all
nodes. You should ensure that these are correctly set up. The PSSP:
Administration Guide, SA22-7348 describes this.
GPFS uses the VSD and requires that it be configured correctly. Be sure to
consider the number of pbufs and the number and size of buddy buffers that
you need. The PSSP: Managing Shared Disks, SA22-7349 publication
describes these topics.
GPFS uses the Group Services and Topology Services components of the
PSSP. In smaller systems, the default tuning should be acceptable for GPFS.
In larger configurations, you may wish to consider the correct values of
frequency and sensitivity settings. See the PSSP: Administration Guide,
SA22-7348 for this information.
In HACMP clusters
GPFS uses an IP network which connects all of the nodes. This is typically a
LAN with sufficient bandwidth available for GPFS control traffic. A minimum
bandwidth of 100 Mb/sec is required.
GPFS requires that the SSA disks be configured to all of the nodes within the
GPFS cluster.
GPFS uses the Group Services and Topology Services components of
HACMP/ES.
You may not use LVM mirroring or LVM bad block relocation.
GPFS, like all file systems, caches data in memory. The cache size is controlled
by a user command and is split into two pieces: space for control information and
space for file data. Increasing the amount of space available may increase the
performance of many workloads. Increasing it excessively will cause memory
shortages for other system components. You may wish to vary these parameters
and observe the effects on the overall system.
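As a sketch (the value shown is an arbitrary example, not a recommendation), the file data cache is sized by the pagepool parameter and can be adjusted with mmchconfig; some changes take effect only after GPFS is restarted on the affected nodes, so check the mmchconfig documentation for your level:

mmchconfig pagepool=80M

After a change, rerun a representative workload and watch overall memory usage before settling on a value.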
Chapter 2. More about GPFS
Figure 2-1 GPFS structure on each node: the application, GPFS, and LVM run on AIX, with Group Services provided by HACMP/ES

Figure 2-2 An HACMP cluster containing a GPFS cluster, which in turn contains GPFS nodesets
Alternatively, one can specify a single-node quorum when there are only two
nodes in the nodeset. In this case, a node failure will result in GPFS fencing the
failed node and the remaining node will continue operation. This is an important
consideration since a GPFS cluster using RAID can have a maximum of two
nodes in the nodeset. This two-node limit using RAID is an SSA hardware
limitation.
Token management
Quota management
Security services
There is only one file system manager node per file system and it services all of
the nodes using this file system. It is the configuration manager node's role to
select the file system manager node. Should the file system manager node fail,
the configuration manager node will start a new file system manager node and
all functions will continue without disruption.
It should be noted that the file system manager node uses some additional CPU
and memory resources. Thus, it is sometimes useful to restrict resource intensive
applications from running on the same node as the file system manager node. By
default, all nodes in a nodeset are eligible to act as the file system manager
node. However, this can be changed by using mmchconfig to declare a node as
ineligible to act as the file system manager node (see the GPFS for AIX: Problem
Determination Guide, GA22-7434).
2.2.3 Metanode
For each open file, one node is made responsible for guaranteeing the integrity
of the metadata by being the only node that can update the file's metadata. This
node is called the metanode. The selection of a file's metanode is made
independently of other files and is generally the node that has had the file open
for the longest continuous period of time. Depending on an application's
execution profile, a file's metanode can migrate to other nodes.
2.3.1 Striping
Striping is one of the unique features of GPFS compared with many native UNIX
file systems such as the Journaled File System (JFS) under AIX. The purpose of
striping is to improve I/O operation performance by allowing records to be
automatically subdivided and simultaneously written to multiple disks; we will
sometimes refer to this as implicit parallelism as the application programmer
does not need to write any parallel code; all that is needed is to access records
that are larger than a block.
The fundamental granularity of a GPFS I/O operation is generally, but not
always, the block, sometimes called a stripe. The size of this block is set by the
mmcrfs command. The choices are 16K, 64K, 256K, 512K, or 1024K (K
represents 1024 bytes, or one kilobyte) and it cannot be arbitrarily changed
once set (see man pages for mmcrfs and mmchconfig). For example, suppose the
block size is 256K and an application writes a 1024K record. Then this record is
striped over four disks by dividing it into four 256K blocks and writing each block
to a separate disk at the same time. A similar process is used to read a 1024K
record.
The expression granularity of a GPFS I/O operation refers to the smallest unit
of transfer between an application program and a disk in a GPFS file system.
Generally this is a block. Moreover, on disk, blocks represent the largest
contiguous chunk of data. However, because files may not naturally end on a
block boundary, a block can be divided into 32 subblocks. In some
circumstances, a subblock may be transferred between disk and an application
program making it the smallest unit of transfer. Section 8.3.1, Blocks and
striping on page 148 explains this in greater detail.
The choice of a block size is largely dependent upon a system's job profile.
Generally speaking, the larger the block, the more efficient the I/O operations
are. But if the record size in a typical transaction is small while the block is large,
much of the block is not being utilized effectively and performance is degraded.
Perhaps the most difficult job profile to match is when the record size has a large
variance. In the end, careful benchmarking using realistic workloads or synthetic
benchmarks (see Appendix G, Benchmark and Example Code) that faithfully
simulate actual workloads is needed to properly determine the optimal value of
this parameter.
Blocks can be striped in three ways. The default and most common way is round
robin striping. In this method, blocks are written to the disks starting with a
randomly selected disk (called the first disk in this chapter) and writing
successive blocks to successive disks; when the last disk has been written to, the
process repeats beginning with the first disk again. For example (refer to
Figure 2-3), suppose you have 16 disks (disk0 to disk15) and the first disk
chosen is disk13; moreover, you are writing the first record, it is 1024K and it
starts at seek offset 0. It is then divided into 4 blocks (b0, b1, b2, b3) and is
written to disk13, disk14, disk15, and disk0. Suppose that the second record
written is 1024K and is written beginning at seek offset 3145728 (i.e., 3072K). It,
too, is divided into 4 blocks, but is written to disk1, disk2, disk3, and disk4.
Figure 2-3 Round robin striping across 16 disks (disk0 to disk15): record 1 (blocks b0 through b3, starting at offset 0K) is written to disk13, disk14, disk15, and disk0; record 2 (starting at offset 3072K) is written to disk1 through disk4
The other methods are random and balanced random. Using the random
method, the mapping of blocks to disks is simply random. With either the round
robin or random method, disks are assumed to be the same size (if disks are not
the same size, space on larger volumes is not wasted, but it is used at a lower
throughput level). If the disks are not the same size, then the balanced random
method randomly distributes blocks to disks in a manner proportional to their
size. This option can be selected and changed using the mmcrfs and mmchfs
commands.
Striping significantly impacts the way application programs are written and is
discussed further in Chapter 8, Developing Application Programs that use
GPFS on page 143.
19
2.3.2 Metadata
Metadata is used to locate and organize user data contained in GPFSs striped
blocks. There are two kinds of metadata, i-nodes and indirect blocks.
An i-node is a file structure stored on disk. It contains direct pointers to user data
blocks or pointers to indirect blocks. At first, while the file is relatively small, one
i-node can contain sufficient direct pointers to reference the entire files blocks of
user data. But as the file grows, one i-node is insufficient and more are needed;
these extra blocks are called indirect blocks. The pointers in the i-node become
indirect pointers as they point to indirect blocks that point to other indirect blocks
or user data blocks. The structures of i-nodes and indirect blocks for a file is
represented as a tree with a maximum depth of four where the tree leaves are
the user data blocks. Figure 2-4 illustrates this.
Figure 2-4 Metadata and data: an i-node points to indirect blocks, which point to the user data blocks; the final, partially filled data block is a fragment
When reading GPFS documentation, you will periodically encounter the term
vnode in relation to metadata. A vnode is an AIX abstraction level above the
i-node. It is used to provide a consistent interface for AIX I/O system calls, such
as read() or write(), to the i-node structures of the underlying file system. For
example, i-nodes for JFS and GPFS are implemented differently. When an
application programmer calls read() on a GPFS file, a GPFS read is initiated
while the vnode interface gathers the GPFS i-node information.
This is the architectural limit. The actual limit is set by using the mmcrfs
command. Setting this value unrealistically high unnecessarily
increases the amount of disk space overhead used for control
structures.
If necessary, limits on the amount of disk space and number of files can be
imposed upon individual users or groups of users through quotas. GPFS quotas
can be set using the mmedquota command. The parameters can set soft limits,
hard limits and grace periods.
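For illustration (the user name jdoe is hypothetical), per-user limits are edited interactively, and grace periods set, with mmedquota; the exact option set is described in the GPFS documentation:

mmedquota -u jdoe
mmedquota -t -u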
Finally, a common task is to determine the size of a file. It is customary in a UNIX
environment to use the ls -l command to ascertain the size of a file. But this
only gives the virtual size of the file's user data. For example, if the file is sparsely
populated, the file size reported by ls -l is equal to the seek offset of the last
byte of the file. By contrast, the du command gives the size of the file in blocks,
including its indirect blocks. For sparse files, the difference in values can be
significant. Example 2-1 illustrates this. sparse.file was created with a 1
megabyte record written at the end of it (i.e., at seek offset 5367660544). Doing
the arithmetic to convert to common units, ls -l lists the file size as 5120
megabytes while du -k lists the file as just over 1 megabyte (i.e., 1.008; the extra
.008 is for indirect blocks). Example 2-2 illustrates the same commands on a
dense file. Again, ls -l lists the file size as 5120 megabytes, but so does du -k.
Now consider df -k in the two examples. In each case, /gpfs1 is the GPFS file
system and contains only the one file listed by ls. Comparing df -k between the
two examples shows that it accounts for the real file size, as does du -k. (The
same is also true for mmdf.)
Example 2-1 Sparse file
host1t:/> ls -l /gpfs1/sparse.file
-rwxr-xr-x   1 root     system   5368709120 Feb 14 13:49 /gpfs1/sparse.file
host1t:/> du -k /gpfs1/sparse.file
1032    /gpfs1/sparse.file
host1t:/> df -k /gpfs1
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/gpfs1      142077952 141964544    1%       13     1% /gpfs1
Example 2-2 Dense file
host1t:/> ls -l /gpfs1/dense.file
-rwxr-xr-x   1 root     system   5368709120 Feb 14 13:49 /gpfs1/dense.file
host1t:/> du -k /gpfs1/dense.file
5242880 /gpfs1/dense.file
host1t:/> df -k /gpfs1
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/gpfs1      142077952 136722432    4%       13     1% /gpfs1
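A sparse file like the one in Example 2-1 can be produced by seeking past a large offset before writing a single record; the command below is only an illustration, with the path and sizes chosen to mimic the example:

dd if=/dev/zero of=/gpfs1/sparse.file bs=1048576 count=1 seek=5119

This writes one 1-megabyte record at an offset of 5119 megabytes, so ls -l reports a 5120-megabyte file while du -k reports only the space actually allocated.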
Kernel heap
Daemon segments
Shared segments
Memory from the kernel heap is allocated most generally for control structures
that establish GPFS/AIX relations such as vnodes. The largest portion of
daemon memory is used by file system manager functions to store structures
needed for command and I/O execution. The shared segments are accessed
both by the GPFS daemon and the kernel and form a GPFS cache. They are
directly visible to the user and are more complex.
The size of the i-node cache is controlled, but not set, by the maxFilesToCache
parameter in the mmconfig and mmchconfig commands. The actual number of
i-nodes present in this cache depends on how often the number of files with
cached information exceeds maxFilesToCache; if it is exceeded often, there will
be fewer i-nodes stored in this cache than when it is seldom exceeded.
The stat cache is quite different. Each cache line contains only enough
information to respond to a stat() call and is 128 bytes long. The number of
entries reserved for this cache is maxStatCache * maxFilesToCache, where
maxStatCache = 4 by default. mmchconfig is used to change this value.
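For instance (the values below are purely illustrative, not tuning advice), both parameters can be changed with mmchconfig and the effect on stat-heavy workloads such as large ls -l runs observed:

mmchconfig maxFilesToCache=1000
mmchconfig maxStatCache=8

As with the pagepool, consult the mmchconfig documentation for your level to see whether GPFS must be restarted for the new values to take effect.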
This non-pinned cache is most useful in applications making numerous
references to a common file over short durations, as is done for file systems
containing user directories or in transaction processing systems. It is less helpful
for long duration number crunching jobs where there are only a small number of
files open and they remain open for long durations.
When discussing these various caches, the term cache is frequently used
generically and collectively to refer to all three types of cache (i.e., the pinned
pagepool and the non-pinned i-node and stat caches), but it is also used to refer
generically to the pagepool alone (since the pagepool is a cache). When it's
important, the context makes the intent of the authors clear.
The stat() function is used to retrieve file information such as size, permissions, group ID, etc. It is used by commands
like ls -l and du.
transfers between disk and cache, and records do not reside in cache long
enough to be re-used. The second situation occurs when the connections
between disk and the CPU/memory bus are saturated. No amount of caching
can compensate for such a heavy load being placed upon the inter-connections.
In the end, careful benchmarking using realistic workloads or synthetic
benchmarks (see Appendix G, Benchmark and Example Code on page 237)
that faithfully simulate actual workloads is needed to configure the GPFS caches
optimally. However, this parameter can easily be changed using the mmchconfig
command if necessary.
Feb 12 18:28 file1
Jan 29 21:42 file2
Feb 24 13:02 file3
Feb 17 19:17 file4
done before other tasks force that memory to be flushed) saving the overhead
and time of reading file1 from disk again. Yet, when diff is executed the second
time with different files not already cached in the page space, it takes nearly the
same amount of time to execute! This observation is consistent with the design
specifications for GPFS.
By contrast, a similar experiment (the files were only 256KB) conducted using
JFS (which does buffer JFS file data in the AIX page space) showed that copying
the file to /dev/null first allowed the diff operation to run nearly 3X faster; i.e., it
makes a big difference in JFS. But not having this AIX buffering action in GPFS is
not a loss; it's just not needed. When a JFS file is not buffered, JFS actions go
slower, but the GPFS actions always go faster as they are always cached
(provided their I/O access pattern allows efficient caching). For instance, the
un-buffered JFS action took 3X longer than either of the GPFS actions in the
example above.
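A rough way to repeat this kind of experiment (the file names are placeholders for large files in your own GPFS and JFS file systems) is to time a diff both with and without first pulling one file through the cache:

time cp /gpfs1/file1 /dev/null
time diff /gpfs1/file1 /gpfs1/file2

On GPFS, the second command should take about the same time whether or not the copy was done first, while on JFS the prior copy noticeably speeds up the diff.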
Chapter 3. The cluster environment
Figure 3-1 RSCT structure on two nodes: Group Services clients such as the Event Manager and RVSD connect to the Group Services daemon ("hags"), which relies on the reliable messaging (UDP) and heartbeat services of Topology Services ("hats")
RSCT provides its services to applications within a certain scope. An application
may consist of multiple processes that run on multiple RS/6000 machines.
Therefore, when the application uses the services provided by RSCT, it must
respect the boundaries within which it can use them.
An RSCT domain is the collection of nodes (SP node or an RS/6000 machine
running AIX) on which the RSCT is executing. There are two types of domains
for RSCT:
SP domain
HACMP domain
An SP domain includes a set of SP nodes within an SP partition. However, an
HACMP domain includes a set of SP nodes or non-SP nodes defined as an
HACMP/ES cluster.
A domain may not be exclusive; a node may be contained in multiple domains.
Each domain has its own instance of the RSCT daemons. Hence, multiple
instances of a daemon can be active on a node, with each instance having
separate configuration data, log files, and so on.
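As an illustration (the subsystem names below are the usual defaults for PSSP and HACMP/ES, but verify them on your system), lssrc can show which instances of the RSCT daemons are active on a node; a node that belongs to both an SP domain and an HACMP domain would show both sets:

lssrc -a | egrep "hats|hags|topsvcs|grpsvcs"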
State of adapters
Adapters are monitored by keepalive signals. If an adapter is detected as
unreachable (e.g., due to a hardware failure), it will be marked as down.
State of nodes
The state of nodes is deduced from the state of adapters. If no adapter on a node
is reachable, then the node is assumed to be down.
GPFS on the SP
On an SP, GPFS exists in two environments that are distinguished by whether
the Virtual Shared Disk (VSD) layer is required:
1. VSD environment
2. non-VSD environment
the RSCT component of PSSP
disk architecture that provides local access of each node to all disks
HACMP/ES for the configuration and administration of the RSCT domain
in a cluster environment
VSD subsystem
The Virtual Shared Disk subsystem provides uniform disk access for all nodes in
its domain to raw logical volumes that are configured on disks under its
administration. Each disk is locally connected to at least one node in the domain,
but is not required to be locally connected to all nodes. The logical volumes that
are managed by the VSD subsystem appear on all nodes in the VSD domain as
virtual shared disks. Applications access virtual shared disks on all nodes like
raw logical volumes.
A node to which a disk is locally connected serves as the VSD server for that
disk. I/O requests to a virtual shared disk (on any node) are forwarded to the
VSD server of the disk on which the corresponding logical volume resides. The
I/O traffic is routed over the high speed embedded network of the SP.
Figure 3-2 VSD environment: two VSD server nodes and one VSD client node, each running the VSD layer over LVM and IP, with disks 1 through 5 attached to the servers
Figure 3-2 shows a cluster of three nodes. I/O traffic between node C and disks
1-3 is routed over node A, which is the VSD server for those disks. With regard
to I/O operations for disks 1-3, node A is called the server node, and nodes B
and C are called client nodes.
RVSD subsystem
The RVSD subsystem provides high availability for the virtual shared disk. If a
disk is locally connected to more than one node, two nodes can be configured to
act as VSD servers. They are referred to as primary and secondary VSD servers.
By default, the primary VSD server will be the server for that disk. If a failure
affects the primary VSD server, such as a loss of network connectivity, a disk
adapter or node failure, or a failure of the VSD server itself, the secondary VSD
server will take over, and all I/O requests will be routed to it instead. Once the
failure on the primary VSD server has been resolved, the primary node will
resume its role as the VSD server.
Figure 3-3 RVSD environment: the same three-node VSD configuration, with the RVSD subsystem providing primary and secondary VSD servers for disks 1 through 3
Figure 3-3 shows a VSD configuration, and a mutual takeover situation for the
VSD server of disks 1-3. Disks 1 and 2 have node A as the primary VSD server,
and node B as the secondary VSD server; disk 3 has node B as the primary and
node A as the secondary VSD server.
Disk fencing is performed in failure scenarios. GPFS uses the capability of the
RVSD for disk fencing. Therefore, the RVSD subsystem is a requirement for the
implementation of GPFS in a VSD environment. The role of disk fencing for error
recovery in GPFS is explained in more detail in Section 3.3.5, Disk fencing on
page 43.
The reader may expect that the Concurrent Logical Volume Manager (CLVM) is
required by the RVSD to allow concurrent access of the primary and secondary
VSD servers to the disks in order to reduce fallover times. However, this is not
the case; the CLVM is not required. The volume groups in a VSD environment
are not created as concurrent capable; they are varied on at only one node. If the
secondary VSD server takes control, it varies on the volume group, perhaps
breaking the disk reserve. The vary on only takes about ten seconds; the GPFS
daemon can tolerate time-outs for disk access of up to 30 seconds.
For more details about VSD and RVSD, see PSSP 3.2: Managing Shared Disks,
SA22-7349.
SSA
Currently, in version 1.4 of GPFS, Serial Storage Architecture (SSA) is the only
disk technology that is supported in a non-VSD environment. This limits the
number of nodes to eight, since an SSA loop cannot contain more than eight host
adapters. Hence, at maximum, eight nodes can have direct access to any disk.
However, nodes can participate in more than one SSA loop. In order to use SSA
disks for GPFS, a concurrent capable volume group should be configured on
each disk, with each volume group containing a logical volume.
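As a hedged sketch only (the volume group, logical volume, and disk names are hypothetical, and the GPFS and AIX documentation describe the supported procedure and sizing), a concurrent capable volume group holding one logical volume might be created on a disk like this:

mkvg -c -y gpfsvg1 hdisk3
mklv -y gpfslv1 gpfsvg1 500

Here -c marks the volume group as concurrent capable, and 500 is an arbitrary number of logical partitions for the logical volume.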
Figure 3-4 An HACMP/ES cluster containing a GPFS cluster, which in turn contains GPFS nodesets 1 and 2
Figure 3-4 shows the relationship between the clustering domains and GPFS
nodesets. All nodes that belong to a GPFS cluster must belong to the same
HACMP/ES cluster. Only one GPFS cluster can be configured within one
HACMP/ES cluster, which may contain further nodes that are not part of the
GPFS cluster. A GPFS cluster can contain multiple GPFS nodesets; nodes can
be dynamically added to, or removed from, a nodeset. GPFS nodesets are
disjoint; thus, a node cannot belong to two nodesets. All nodes that will be used
in the GPFS cluster need to have direct access to all disks.
SSA currently is the only disk architecture supported for GPFS in a cluster
environment, which limits the number of nodes in a GPFS cluster to eight.
The setup of GPFS in a cluster entails the configuration of the HACMP/ES
cluster topology, which is simple and straightforward from an operational point of
view.
Group Services is used to synchronize all actions required to bring a GPFS
daemon into an active state and for the handling of failures. If a failure has been
detected, the affected GPFS daemon leaves the active state and recovery
actions are performed to protect the integrity of the file system and the GPFS
subsystem.
Group Services maintains two groups for each GPFS nodeset, Gpfs.name and
GpfsRec.name, where name is the name of the nodeset. This is an
implementation detail due to the architecture of Group Services (see
Section 3.3.3, Coordination of event processing on page 41).
down
The GPFS daemon is down if it is shown as inoperative by the System Resource
Controller, as shown in Example 3-1.
Example 3-1 GPFS daemon in the down state
host1t:/> lssrc -s mmfs
Subsystem         Group            PID     Status
 mmfs             aixmm                    inoperative

initializing
In Example 3-2, the GPFS daemon has been started by the mmstartup
command, which is a shell script that starts the GPFS daemon on one or more
nodes. It issues a start of the daemon by the System Resource Controller on
each specified node. Once mmstartup has completed, the GPFS subsystem is
shown as active by lssrc. The System Resource Controller will issue the runmmfs
command, which is another shell script.
Example 3-2 GPFS daemon in the initializing state
host1t:/> mmstartup
Sat Mar 10 00:19:14 EST 2001: mmstartup: Starting GPFS ...
0513-059 The mmfs Subsystem has been started. Subsystem PID is 18128.
host1t:/> lssrc -s mmfs
Subsystem         Group            PID     Status
 mmfs             aixmm            18128   active
host1t:/> ps -ef | grep mm
    root 18130  8270   0 00:21:48      -  0:00 ksh /usr/lpp/mmfs/bin/runmmfs
To become part of the GPFS distributed subsystem, the GPFS daemon needs to
connect to the Group Services daemon as a client. The script
/usr/lpp/mmfs/bin/runmmfs tries periodically to connect to Group Services as a
client. This state is called the initialization state of the GPFS daemon. The
attempt to join the GPFS groups may not be successful if Group Services is not
active or if the already active daemons in the GPFS groups deny a join. The latter
would occur, for instance, if the network adapter on which the GPFS socket
connection is established has been detected as failed by Topology Services.
active
If the GPFS daemon has connected with Group Services and joined the GPFS
distributed subsystem, it is in the active state, as shown in Example 3-3, and is
ready to perform cooperatively with the other GPFS daemons that have
connected with Group Services. It is shown as a member of the groups that
Group Services maintains for the nodeset. The script runmmfs will spawn the
mmfs daemon.
Example 3-3 GPFS daemon in the active state
host1t:/> lssrc -ls grpsvcs
Subsystem         Group            PID     Status
 grpsvcs          grpsvcs          31532   active
4 locally-connected clients. Their PIDs:
27790(hagsglsmd) 36584(mmfsd) 26338(haemd) 27234(clstrmgr)
HA Group Services domain information:
Domain established by node 3
Number of groups known locally: 5
                                 Number of       Number of local
Group name                       providers       providers/subscribers
Gpfs.set1                        3               1              0
GpfsRec.set1                     3               1              0
ha_em_peers                      3               1              0
CLRESMGRD_111                    3               1              0
CLSTRMGR_111                     3               1              0
host1t:/> ps -ef | grep mm
    root 19888  4994   0 21:12:03      -  0:01 /usr/lpp/mmfs/bin/mmfsd
If a failure that affects the GPFS daemon is detected by RSCT, the GPFS
daemon will leave the active state and return to the initialization state.
Further, Topology Services monitors the state of network adapters and publishes
this information to the GPFS groups. A change in the state of a network adapter
that affects the functionality of the cluster will result in a change of
membership for the GPFS daemon on that node.
The role of RSCT in the implementation of GPFS is to detect and synchronize
the necessary changes of all active GPFS daemons that pertain to the following:
Figure 3-5 GPFS daemon communication: the mmfs daemons on two nodes communicate over a TCP/IP socket on network NET1 (adapters svc_1A and svc_2A), while NET2 (adapters svc_1B and svc_2B) provides a redundant network
Figure 3-5 shows two nodes of a GPFS cluster. The two networks, NET1 and
NET2, are configured as part of the HACMP/ES cluster topology. Hence, they
are monitored by Topology Services. The GPFS daemons on the two nodes
communicate with each other using TCP/IP socket connections that are
established on the network devices of adapters svc_1A and svc_2A. The
members in the GPFS groups are informed about the state of the adapters
svc_1A and svc_1B, the connectivity to nodes A and B, and the membership of
the mmfs daemons in the GPFS groups.
The recovery actions performed by the GPFS distributed subsystem are the
same whether a network adapter used for GPFS communication is lost or
connectivity to a node is lost entirely. However, redundant network connections
that are known to Topology Services are required in an implementation that
relies on HACMP/ES in order to maintain GPFS operation after such a failure
(see Section 3.5.3, Partitioned clusters on page 51).
3.3.4 Quorum
Quorum is a simple rule to ensure the integrity of a distributed system and the
resources under its administration. In GPFS, quorum is enforced.
The notion of quorum applies to the GPFS nodeset. In a GPFS nodeset, quorum
is achieved if half (in a nodeset consisting of two nodes), or more than half (in a
nodeset consisting of more than two nodes), of the GPFS daemons are in the
active state (as described in Section 3.3.1, States of the GPFS daemon on
page 38) and are able to communicate with each other. The latter may not be the
case after multiple network failures that result in a partitioned configuration (see
Section 3.5.3, Partitioned clusters on page 51).
In other words, quorum is achieved if more than half of the GPFS daemons are
members of the same instance of a GPFS group. File systems that are
configured in a GPFS nodeset can only be mounted if quorum is achieved.
For a nodeset consisting of two nodes, quorum is user selectable. If Single Node
Quorum is not enabled, the GPFS daemons on both nodes have to be active for
all actions that otherwise depend on quorum to succeed.
the configuration manager will elect a new file system manager for any file
system for which the daemon that has left the active GPFS subsystem had been
the manager. The file system manager will initiate recovery actions for the file
system metadata and elect new metanodes for open files, if necessary.
Configuration manager
The configuration manager is assigned internally. It does not impose any
significant load onto the system. The first GPFS daemon in a GPFS nodeset that
becomes active will be the configuration manager. If it leaves the active state,
the active GPFS daemon with the next lowest node number will be elected as
the new configuration manager.
Which node acts as the configuration manager is normally not of interest from
an operational point of view. However, it can be determined by the command
mmfsadm dump cfgmgr.
Metanode
Metanodes are assigned internally. Usually the node that has had the file open
for the longest amount of time is the metanode for that file.
Figure 3-6 Client-server relationships: GPFS, Event Management, and the HACMP/ES cluster group are clients of Group Services, which in turn uses Topology Services; the HACMP/ES Cluster Lock Manager is a client of the HACMP/ES Cluster Manager
Figure 3-6 on page 45 shows the client-server relationship of all active
subsystems in the implementation of GPFS in a cluster. The HACMP/ES cluster
group contains the following daemons:
HACMP/ES Cluster Manager
HACMP/ES SMUX Peer Daemon
HACMP/ES Cluster Information Daemon
GPFS, Event Management, and the HACMP/ES Cluster Manager are clients of
Group Services. There is no interaction between the subsystems of the
HACMP/ES cluster group and the GPFS subsystem. The HACMP/ES Cluster
Lock Manager is a client of the HACMP/ES Cluster Manager.
HACMP/ES provides the means to configure and administer the operating
domain for RSCT. In particular:
Configuration of the HACMP/ES cluster that defines an RSCT domain, which
can be changed dynamically, i.e. while the daemons are active
Environment to start and stop the RSCT subsystem
Cluster monitoring tools
An HACMP/ES cluster, while providing an operating domain for GPFS, can be
used to make other applications highly available. Few restrictions apply; see
Section 3.6.1, Configuring HACMP/ES on page 56 for more information.
nodes
The set of nodes is the operating domain for HACMP/ES and RSCT.
adapters
Adapters are monitored by Topology Services by keepalive signals and used for
the communication within the cluster. They can be used to make IP addresses
highly available (see Section 3.6.1, Configuring HACMP/ES on page 56).
networks
Networks define the association of adapters to physical networks; two adapters
that belong to the same cluster network also belong to the same physical
network.
tuning parameters
Tuning parameters describe the frequency of keepalive signals, which are sent
by Topology Services, and grace periods, which designates the maximum time
span keepalive signals can be missed without issuing a failure notification.
Figure 3-7 A cluster of three hosts connected by redundant network adapters, two TCP/IP networks on independent hardware, and a serial network
Figure 3-7 shows a cluster of three nodes that all are connected by two TCP/IP
networks and a serial network. All nodes and networks are configured in the
HACMP/ES cluster topology, therefore they will be monitored by Topology
Services and used for message passing. Two networks, which are configured on
independent hardware components, ensure fault tolerance for the network
connectivity between nodes. One network has two host adapter connections with
each node.
If the GPFS subsystem has been started on a node while the cluster services are
still inactive, the GPFS will be in the initializing state, as shown in Example 3-4.
After the cluster services have been started and Group Services has become
active, the GPFS daemon will attempt to join the GPFS groups and transition into
the active state. Furthermore, the HACMP cluster manager will connect with
Group Services after it is started.
Example 3-5 GPFS and HACMP/ES are both active
host1t:/> ps -ef | grep mm
    root 30012  6204   0 01:37:13      0:01 /usr/lpp/mmfs/bin/mmfsd
Example 3-5 shows the groups maintained by Group Services when GPFS and
HACMP/ES are active. Group Services maintains five different groups on host1.
The groups Gpfs.set2, and GpfsRec.set2 correspond to the GPFS nodeset, to
which this node belongs, which has the nodeset name set2. The groups
CLRESMGRD_111, and CLSTRMGR_111 are maintained for the HACMP/ES
Cluster Manager subsystem. The HACMP/ES cluster ID is 111.
Figure: join protocol for node A — the GPFS daemon waits until the adapter used for GPFS daemon communication is alive, then attempts to join the active configuration; on success the daemon on node A enters the active state, and on failure it waits and retries.
Figure: recovery actions when node A leaves the active state — if node A was the configuration manager for the nodeset, a new configuration manager is elected; if node A was a file system manager, a new file system manager is elected; if quorum persists for the nodeset, node A is fenced out, the log files are recovered, and the token state is rebuilt; if quorum is lost, all file systems in the nodeset are unmounted.
RSCT
All subsystems of RSCT will remain active on each partition.
Within each partition, Topology Services will recreate its heartbeat rings,
excluding adapters that are not reachable, and keep monitoring all network
adapters in the cluster. The reliable messaging library will be updated to indicate
the loss of connectivity; adapters that do not belong to this partition are assumed
not to be alive. Group Services will be informed about the loss of connectivity to
the nodes of the other partition and will dissolve their membership in the GPFS
groups. For each group, it will keep providing its services to the active members in
a partition. There is no synchronization between the instances of the same group
that are run on distinct partitions.
After network connectivity has been reestablished, the subsystems of RSCT will
join their operating domains and continue to function without visible interruption.
Topology Services will reconnect the disjoint heartbeat rings to form rings that
include the corresponding adapters of all partitions. Group Services will merge
instances of each group into one group. This is actually done by dissolving all but
one instance of each group and letting the clients on nodes, for which an
instance has been dissolved, rejoin the corresponding group.
Figure 3-11 A partitioned GPFS cluster of eight nodes, split into partition 1 and partition 2
Figure 3-11 shows a GPFS cluster of eight nodes that is partitioned into two sets
of four nodes. Three GPFS nodesets exist, with nodeset IDs 1 through 3. The
nodes of the first nodeset all belong to partition 1 and maintain quorum, as do
the nodes of the third nodeset in partition 2.
The second nodeset contains two nodes. At first, both nodes will maintain
quorum, if quorum is configured. If the file system is mounted on one node, this
node will attempt to fence the other node out from the disks, which will cause the
GPFS daemon on the other node to leave the active state. If the GPFS daemons
on both nodes are active and the file system is mounted, the surviving node is
the one which first succeeds with the recovery actions.
HACMP/ES
The HACMP/ES Cluster Manager will continue to operate on all partitions. After
the network connectivity has been reestablished, the nodes on all partitions,
except the one with the highest priority, will halt. A halt -q is performed as part
of the script clexit.rc, which is run. This may have a drastic impact on the
performance of GPFS and may lead to a loss of quorum for a GPFS nodeset,
leading to the unmounting of file systems on all nodes. This should be a very rare
scenario if redundant networking connections are configured.
Figure 3-13 shows two sequences of recovery actions that are run without
synchronization after the cluster had become partitioned. The sequence of
events starts when GPFS runs a protocol to perform recovery actions for both
nodes, D and E.
First it is determined if a new configuration manager needs to be elected. The
configuration manager elects a new file system manager, if needed; a Group
Services barrier is reached. Afterwards, the file system manager nodes start the
recovery of all file systems, which involves fencing out nodes D and E, rebuilding
the token state, and updating the log files. The loss of connectivity to the nodes
of the corresponding other partition will not be detected at the same time by both
partitions.
Topology Services
Group Services
Event Management
HACMP/ES Cluster Manager
HACMP/ES SMUX Peer Daemon
HACMP/ES Cluster Information Services
HACMP/ES Cluster Lock Manager
The HACMP/ES Cluster Manager is the subsystem that drives the recovery
actions to provide high availability for resources. The HACMP/ES SMUX Peer
Daemon and the HACMP/ES Cluster Information Services provide other
applications or utilities of HACMP/ES with information about the cluster. The
HACMP/ES Cluster Lock Manager provides a locking protocol for applications
that concurrently access shared external data.
Cluster topology
The cluster topology has already been introduced in Section 3.4.1,
Configuration of the cluster topology on page 46.
A cluster adapter is of one of the following three types:
boot
A boot adapter is the primary adapter on a node for a given cluster network. It
can be replaced by a service IP label belonging to the same network.
service
A service adapter is either a primary adapter on a node for a given network or a
network interface configuration that will replace the configuration of a boot or
standby adapter on the network to which it belongs. In the latter case, the service
adapter is referred to as a service IP label and is part of a resource group. If a
service adapter is the primary adapter on a node for a given network, it cannot be
moved to another node or replaced by another service label.
standby
Any further adapter, besides a boot or services adapter, that exists on a given
network is a standby adapter. A service IP label can replace a standby adapter.
Cluster resources
The cluster resources are the set of system resources that are made highly
available by HACMP/ES.
SMIT screen (Entry Fields): a resource group named RES1 of type cascading, with participating nodes A and B, service IP label svc_1, filesystem consistency check fsck, filesystem recovery method sequential, volume group vg_1, and application server APP1; the remaining fields are left empty or set to false.
More than one instance of each resource type can be configured in a resource
group. For example, a resource group could contain multiple volume groups.
Multiple resource groups can be configured in a cluster, depending on the
number of nodes in the cluster and the resources that are configured in each
group.
Cluster verification
To function as a distributed system and to provide recovery for the cluster
resources, the HACMP/ES Cluster Manager depends on the following:
The cluster manager daemons on all nodes have the same cluster
configuration data, which is referred to as the cluster being synchronized.
The cluster configuration corresponds to the real, existing system resources.
For instance, adapters belonging to resource groups, which are configured in
the cluster topology or volume groups, are configured correctly on AIX.
Cluster verification checks the above conditions. A cluster that fails cluster
verification is not guaranteed to function reliably. In a cluster that is not
synchronized, an attempt by a cluster manager daemon to join the active cluster
manager subsystem will fail if its cluster configuration is different from the one
known to the active cluster. Some configuration errors may not be detected
immediately, but are usually detected during the runtime of the cluster. For
example, you may not notice that a resource is not properly configured until a
node attempts to acquire a resource group and the acquisition of that particular
resource is not possible.
Cluster synchronization
Cluster synchronization entails the distribution of a cluster configuration that is
locally present on a node to all nodes in the cluster. During cluster
synchronization, the HACMP ODMs of the node from which the command is
issued are copied to the other nodes in the cluster. Furthermore, the ODMs are
supplied with values that are specific to the configuration of system resources on
each node.
After any change to the cluster configuration or the system resources that relate
to cluster configuration data, the cluster configuration needs to be synchronized.
Verification is run by default during cluster synchronization. The cluster
configuration can be synchronized while the cluster manager is active on some
nodes.
cluster nodes
If a cluster node fails, all resource groups that are online on that node are moved
to another node. This is performed by the event scripts associated with the
node_down event.
cluster adapter
If a service or boot adapter fails, the IP address of that adapter is reconfigured on
the standby adapter by the event scripts associated with the swap_adapter
event.
applications
If an application that is subscribed to application monitoring fails, the resource
group containing this application is moved to another node by the event scripts
associated with the rg_move event.
Partitioned Cluster
In a partitioned cluster, the nodes in each partition will detect the other side as
down. The cluster will remain active within each partition; quorum is not
implemented. Node down events are run for the nodes in the corresponding
partition. If multiple partitions contain nodes of a resource group, that resource
group will be online on multiple nodes as a result of the partition.
Chapter 4. Planning for implementation
In this chapter we explain why we made certain design choices for our GPFS
cluster and then elaborate on exactly what those choices were. Each topic is split
into two parts. The first part of a topic covers a general range of decisions that
had to be made and the input to those decisions. The second part of each topic
details what we decided so later chapters can configure the cluster in that
manner. The one absolute we had to abide by was to use GPFS 1.4 without
PSSP, making our implementation a non-SP environment. The resulting
configuration is called a GPFS cluster.
4.1 Software
We divided our software discussion into two sections. The first section is a
general discussion on the software options while the second section covers our
specific software installation.
SP environment
GPFS 1.4 can operate within an SP environment with or without VSDs as long as
PSSP is version 3 release 2 or later on the control workstation. For the operating
system, AIX is required to be at version 4 release 3.3 (with APAR IY12051) or
later on the control workstation. PSSP provides group services to GPFS 1.4,
without which it will not function in an SP environment.
Non-SP environment
For GPFS 1.4 in a non-SP environment, the necessary services are provided by
RSCT (reliable scalable cluster technology), which is packaged with HACMP/ES
Version 4.4.0 (5765-E54) with PTF IY12984 or later. The AIX operating system
should again be at version 4 release 3.3 (5765-C34) with APAR IY12051 or later.
Operating system
AIX 4.3.3 with APAR IY12051 or later was required for our non-SP
implementation of GPFS clustering. We knew we were at AIX 4.3.3 by running
oslevel. The important item to learn was: did we have the required APAR? This
APAR cross-references to PTF U473336, which is included in the AIX 4.3.3
maintenance level 05. The cross-reference information is available from:
http://techsupport.services.ibm.com/rs6k/fixdb.html
HACMP/ES
The required level for HACMP/ES is version 4.4.0 (5765-E54) with PTF IY12984
or later modifications. To determine if the PTF specified is installed, run the
instfix command, as shown in Example 4-1, and make sure all filesets are
found.
Example 4-1 Verification of required code level of HACMP/ES
host1t:/> instfix -i -k "IY12984"
All filesets for IY12984 were found.
We had to install HACMP/ES from scratch on all four of our nodes. One method
is to acquire the code and run smitty install separately on each node. What
we chose to do was to create an NFS-mounted file system, /tools/images, and make
it available to all nodes from host1t. We ran bffcreate to produce install
images of the desired software and then installed it simultaneously across the
nodes. For a detailed explanation see Appendix B, Distributed software
installation on page 207.
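As a rough sketch of that approach (the device name, fileset name, and exact options here are our assumptions; Appendix B documents the procedure we actually used):

host1t:/> bffcreate -d /dev/cd0 -t /tools/images all       # copy installable images from the media into /tools/images
host1t:/> for i in 1 2 3 4; do
> rsh host${i}t mount host1t:/tools /tools                  # make the image directory visible on every node
> rsh host${i}t installp -acgXd /tools/images cluster.es    # install the HACMP/ES package on that node
> done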
GPFS 1.4
We installed GPFS 1.4 in the same manner that we installed HACMP/ES, by
using an NFS mounted filesystem mounted on every node with an image of the
desired application software created there by running bffcreate. Verification
was as easy as running lslpp, as shown in Example 4-2, since it had to be at
level 1.4. Prerequisites only applied to AIX and HACMP/ES since that was the
software that GPFS was specifically dependent on.
Example 4-2 Verification of GPFS installation
host1t:/> lslpp -l | grep mmfs
  mmfs.base.cmds          3.3.0.0  COMMITTED  GPFS
  mmfs.base.rte           3.3.0.0  COMMITTED  GPFS
  mmfs.gpfs.rte           1.4.0.0  COMMITTED  GPFS
  mmfs.base.rte           3.3.0.0  COMMITTED  GPFS
  mmfs.gpfs.rte           1.4.0.0  COMMITTED  GPFS
  mmfs.gpfsdocs.data      3.3.0.0  COMMITTED  GPFS
4.2 Hardware
We divided our discussion of the hardware requirements into two sections: first,
the options available to a new installation; and second, how we implemented the
hardware in our environment.
Host systems
The minimum hardware for GPFS 1.4 is a processor that is able to run AIX 4.3.3,
enough spare disk space for the additional filesets required by the application
software, an SSA adapter and at least one 100Mb/sec network adapter.
                 Minimum                    Maximum
SSA disks        0 per loop, 1 per node     1 per loop, 8 per loop
Nodes            1 per cluster              8 per cluster
RAID arrays      1 per cluster              2 per cluster
4.2.2 Hardware
This section is about how we connected the nodes.
Host systems
The four systems were identical except that host1t and host2t had three internal
SCSI disks while host3t and host4t had only two internal SCSI disks. This was
not a factor in performance or set up, just noteworthy since hdisk addressing is
not identical between the nodes. We used the spare SCSI disk in host1t as a
shared resource between all four nodes mounted over NFS and called it /tools.
4 RS/6000 F50s with the following hardware:
4 x CPUs each
1.5GB memory each
1 x 10/100 ethernet adapter 9-P
1 x token ring adapter 9-O
1 x SSA Adapter 4-P
[Figure: SSA loop cabling. Each host's SSA adapter ports (A1, A2, B1, B2) are cabled into the single disk loop through the 7133 drawer connectors (J8, J9, and J16), with the shared disks on that loop.]
4.3 Networking
GPFS requires an IP network over which the GPFS socket connections will be
established. This network should be dedicated to GPFS only.
The network defined for use by GPFS must not be a network designated for IP
address takeover by HACMP. It must be an adapter with only a service address
and no associated boot address. This network is assigned to GPFS with the
mmcrcluster command.
4.3.2 Network
The token ring network is the default network and our access to the outside
world. The second network is a 100 Mb ethernet that we dedicated to GPFS. To
simplify the network, we named the nodes host1 thru host4 and tacked on a letter
e for ethernet and a letter t for token ring addresses, as shown in Example 4-3.
The local ethernet adapters are connected to an 8 port Alteon 180 hub while the
token ring was attached to the laboratory network. A .rhosts file is required on all
nodes. Figure 4-2 on page 68 is a schematic diagram of our network
configuration.
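A minimal sketch of that .rhosts file (the exact contents are an assumption; every cluster adapter label of every node must be listed, here for the root user):

host1t:/> cat /.rhosts
host1t root
host1e root
host2t root
host2e root
host3t root
host3e root
host4t root
host4e root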
Example 4-3 Listing of the adapter addresses for our nodes
host1t:/> cat /etc/hosts
127.0.0.1       loopback localhost
9.12.0.21       host1t host1t.itso.ibm.com
9.12.0.22       host2t host2t.itso.ibm.com
9.12.0.23       host3t host3t.itso.ibm.com
9.12.0.24       host4t host4t.itso.ibm.com
129.40.12.129   host1e host1e.itso.ibm.com
129.40.12.130   host2e host2e.itso.ibm.com
129.40.12.131   host3e host3e.itso.ibm.com
129.40.12.132   host4e host4e.itso.ibm.com
[Figure 4-2: Schematic of the network configuration: the four hosts attach to the outside world through the 16 Mb token ring (host1t, host2t, host3t, host4t) and to the dedicated GPFS network through the 100 Mb Ethernet (host1e, host2e, host3e, host4e).]
4.4.1 Networks
Networking requirements to support GPFS
HACMP/ES requires redundant network connections between nodes. Otherwise
a single network failure potentially could cause a system halt on a subset of
nodes in the cluster, as explained in Section 3.5.3, Partitioned clusters on
page 51.
Our network configuration contains two IP networks. Often, in the hardware
setup of an HACMP/ES cluster, a serial network such as RS232 or Target Mode
SSA is used as the second redundant network connection between hosts. In our
case, to support GPFS, this is not recommended. Serial networks are slow and
error recovery performed over serial networks takes more time than over an IP
network. In most uses of HACMP/ES this does not affect things. However, in
GPFS, fast processing of failures is important.
Chapter 5.
Configuring HACMP/ES
This chapter details the steps that are necessary to configure a HACMP/ES
cluster for GPFS and shows it by example for our configuration. A cluster
configuration for GPFS only requires the definition of a cluster topology.
In this chapter, we will discuss:
Requirements on the system environment
Configuring the HACMP/ES cluster topology
Starting and stopping the HACMP/ES cluster services
Monitoring the cluster
5.1 Prerequisites
Sections 5.1.1, 5.1.2, and 5.1.3 describe the prerequisites that we need before
starting to configure the HACMP/ES cluster.
5.1.1 Security
Before starting the configuration of HACMP/ES, we need to verify the remote
access permissions regarding the system security.
Network interfaces
The correct configuration of network interfaces is crucial for HACMP/ES. The
following need to be verified:
All network interfaces that are used to configure cluster adapters are in the up
state, as ascertained by the netstat command.
The configuration of network interfaces corresponds to the one present at
boot time.
All network interfaces that will be configured as cluster adapters belong to the
same IP subnet.
We verify that the interface configuration contains all host adapters. The netstat
command is issued on all hosts.
Example 5-2 Verify network adapters are up
host1t:/> netstat -i
Name  Mtu    Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500   link#2        0.4.ac.3e.b3.95     857150     0   864472   289     0
en1   1500   129.40.12.1   host1e              857150     0   864472   289     0
tr0   1492   link#3        0.4.ac.ad.ce.50    1412205     0  1287386     0     0
tr0   1492   9.12          host1t             1412205     0  1287386     0     0
lo0   16896  link#1                            325031     0   325563     0     0
lo0   16896  127           loopback            325031     0   325563     0     0
lo0   16896  ::1                               325031     0   325563     0     0
host1t:/> for i in 1 2 3 4; do
> rsh host${i}t netstat -i | grep host${i}t
> done
tr0   1492   9.12   host1t   1296792   0   1434347   0   0
tr0   1492   9.12   host2t   1721115   0   1348656   0   0
tr0   1492   9.12   host3t    177266   0    148358   0   0
tr0   1492   9.12   host4t    324299   0    250125   0   0
Name resolution
Aliases need to be configured for all IP addresses that will be used as cluster
adapters.
It is good practice to keep the /etc/hosts files on all cluster nodes identical and to
use a naming convention for cluster adapters that reflects the adapter function
and network membership. Example 5-4 lists all of the aliases we have defined for
our cluster adapters.
Example 5-4 Verify aliases for all cluster adapters
host1t:/> more /etc/hosts
127.0.0.1       loopback localhost      # loopback (lo0) name/address
9.12.0.21       host1t host1t.itso.ibm.com
9.12.0.22       host2t host2t.itso.ibm.com
9.12.0.23       host3t host3t.itso.ibm.com
9.12.0.24       host4t host4t.itso.ibm.com
129.40.12.129   host1e host1e.itso.ibm.com
129.40.12.130   host2e host2e.itso.ibm.com
129.40.12.131   host3e host3e.itso.ibm.com
129.40.12.132   host4e host4e.itso.ibm.com
Name resolution should be configured such that name lookup is first attempted
locally. The name lookup sequence is determined by the hosts keyword in the
/etc/netsvcs.conf file as shown in Example 5-5.
Example 5-5 netsvcs.conf file
host1t:/> more /etc/netsvcs.conf
hosts = local , bind
We define the cluster with cluster ID 111 and cluster name hcluster.
The cluster definition can be modified at any time while the cluster manager is
not active on any node.
75
The corresponding configuration of an adapter for the network gpfs_net is:

Adapter IP label      host1e
Network Type          ether
Network Name          gpfs_net
Network Attribute     public
Adapter Function      service
Node Name             host1t
Example 5-10 shows the configuration of an adapter for the network noname_net.
Example 5-10 Configuration of cluster adapter
Adapter IP label      host1t
Network Type          token
Network Name          noname_net
Network Attribute     public
Adapter Function      service
Node Name             host1t
In Example 5-11, we add the remaining three adapters from the command line.
Example 5-11 Adding adapters via the command line
host1t:/> for i in 2 3 4; do
> /usr/es/sbin/cluster/utilities/claddnode -a host${i}e:ether\
> :gpfs_net:public:service:: -n host${i}t
> done
host1t:/> for i in 2 3 4; do
> /usr/es/sbin/cluster/utilities/claddnode -a host${i}t:token\
> :noname_net:public:service:: -n host${i}t
> done
Network         Attribute  Node     Adapter(s)
gpfs_net        public     host1t   host1e
                           host2t   host2e
                           host3t   host3e
                           host4t   host4e
noname_net      public     host1t   host1t
                           host2t   host2t
                           host3t   host3t
                           host4t   host4t
The cluster topology is then synchronized, taking the actual (not emulated) mode
with verification not skipped, and the cluster services are started on host1t:

host1t:/> smitty hacmp
    Cluster System Management
        HACMP for AIX Cluster Services
            Start Cluster Services
(the screen will include the following entries)
* Start now, on system restart or both          now
  BROADCAST message at startup?                 true
  Startup Cluster Lock Services?                false
  Startup Cluster Information Daemon?           true
The above command starts the daemons of the RSCT subsystems, the
HACMP/ES cluster manager, clstrmgrES, and the clinfoES daemon.
A successful completion of rc.cluster only indicates that all daemons have
been started. We still have to wait for all events in context with the startup of the
cluster node to finish successfully.
On node host1t, we issue the clstat command, as shown in Example 5-16, to
monitor the state of the cluster on all nodes.
Example 5-16 Monitor cluster status
host1t:/> /usr/es/sbin/cluster/clstat

                clstat - HACMP for AIX Cluster Status Monitor
                ---------------------------------------------
Cluster: hcluster (111)
State: UP                SubState: STABLE

Node: host1t             State: UP
   Interface: host1e (0)        Address: 129.40.12.129
                                State:   UP
   Interface: host1t (1)        Address: 9.12.0.21
                                State:   UP
Node: host2t             State: DOWN
   Interface: host2e (0)        Address: 129.40.12.130
                                State:   DOWN
   Interface: host2t (1)        Address: 9.12.0.22
                                State:   DOWN
Node: host3t             State: DOWN
   Interface: host3e (0)        Address: 129.40.12.131
                                State:   DOWN
   Interface: host3t (1)        Address: 9.12.0.23
                                State:   DOWN
Node: host4t             State: DOWN
   Interface: host4e (0)        Address: 129.40.12.132
                                State:   DOWN
   Interface: host4t (1)        Address: 9.12.0.24
                                State:   DOWN
After a while, the output of the above command should indicate that the cluster is
stable, which indicates that no more events are enqueued to be processed.
Alternatively, we can convince ourselves that the start of the cluster services has
been successful by inspecting the log files as in Example 5-17.
Example 5-17 Check cluster log files
host1t:/> more /usr/es/sbin/cluster/history/cluster.02222001
Feb 22 10:24:56 EVENT START: node_up host1t
Feb 22 10:24:58 EVENT COMPLETED: node_up host1t
Feb 22 10:24:58 EVENT START: node_up_complete host1t
Feb 22 10:24:59 EVENT COMPLETED: node_up_complete host1t
The output of the above command shows that the node_up and node_up_complete
events have completed. These are the two events that are processed by all
members when a node joins the HACMP/ES cluster.
In Example 5-18 on page 81 we can now start the cluster services on the
remaining three cluster nodes simultaneously.
The Group Services subsystem will serialize the requests of the daemons to join
the clstrmgrES subsystem. After all daemons have joined, the cluster will return
to the stable state.
A look at the log file, as shown in Example 5-19, illustrates the serialization of
events by Group Services. When the cluster daemon on a node joins the
clstrmgrES subsystem, a node_up and a node_up_complete event are run.
The file on all nodes will show the same order of events.
Example 5-19 Check cluster log files
host1t:/> more /usr/es/sbin/cluster/history/cluster.02222001
Feb 22 10:24:56 EVENT START: node_up host1t
Feb 22 10:24:58 EVENT COMPLETED: node_up host1t
Feb 22 10:24:58 EVENT START: node_up_complete host1t
Feb 22 10:24:59 EVENT COMPLETED: node_up_complete host1t
Feb 22 10:40:42 EVENT START: node_up host2t
Feb 22 10:40:42 EVENT COMPLETED: node_up host2t
Feb 22 10:40:45 EVENT START: node_up_complete host2t
Feb 22 10:40:45 EVENT COMPLETED: node_up_complete host2t
Feb 22 10:41:13 EVENT START: node_up host3t
Feb 22 10:41:14 EVENT COMPLETED: node_up host3t
Feb 22 10:41:16 EVENT START: node_up_complete host3t
Feb 22 10:41:16 EVENT COMPLETED: node_up_complete host3t
Feb 22 10:41:43 EVENT START: node_up host4t
Feb 22 10:41:44 EVENT COMPLETED: node_up host4t
Feb 22 10:41:47 EVENT START: node_up_complete host4t
Feb 22 10:41:48 EVENT COMPLETED: node_up_complete host4t
Node: host1t             State: UP
   Interface: host1e (0)        Address: 129.40.12.129
                                State:   UP
   Interface: host1t (1)        Address: 9.12.0.21
                                State:   UP
Node: host2t             State: DOWN
   Interface: host2e (0)        Address: 129.40.12.130
                                State:   DOWN
   Interface: host2t (1)        Address: 9.12.0.22
                                State:   DOWN
Node: host3t             State: UP
   Interface: host3e (0)        Address: 129.40.12.131
                                State:   UP
   Interface: host3t (1)        Address: 9.12.0.23
                                State:   UP
Node: host4t             State: DOWN
   Interface: host4e (0)        Address: 129.40.12.132
                                State:   DOWN
   Interface: host4t (1)        Address: 9.12.0.24
                                State:   DOWN
Cluster state
The cluster state refers to the state of the cluster manager as a distributed
subsystem on all nodes. It can have the values up, down, and unknown.
Cluster substate
The cluster substates listed below give detailed information about the cluster.
stable
The cluster services are active on at least one node.
The cluster manager subsystem is in an error free state.
No events are currently processed.
unstable
The cluster services are active on at least one node.
The cluster manager is currently processing events.
unknown
The state is not determined.
reconfig
The cluster services are active on one or more nodes in the cluster. On one
node an event is running and has exceeded the maximum time limit that is
assumed for an event to be run. This is likely due to a failure while executing
an event script.
No new event can be processed; user intervention is necessary.
error
The cluster services are active on at least one node.
An error has occurred and no events can be processed.
Node state
The state of a node can have one of the following four values.
up
down
joining
leaving
Adapter state
The state of an adapter can have the following values:
up
down
The clstat command requires that the clinfoES subsystem is active on the node
on which it is issued.
The commands that are executed during an event depend on the cluster
configuration and the system environment.
The file /tmp/hacmp.out is the main diagnostic source for errors that occur during
an event. If an event fails, one can easily localize the command that caused the
failure of the event.
(lssrc output shows the cluster daemons active:)

Subsystem         Group            PID      Status
 emaixos          emsvcs           20850    active
 clstrmgrES       cluster          17938    active
 clsmuxpdES       cluster          22204    active
 clinfoES         cluster          19640    active
The long listing of the Group Services subsystem (lssrc -ls grpsvcs) contains the
groups in which the clients of Group Services on host1t participate. Three such
clients exist on host1t: hagsglsmd, haemd, and clstrmgr.
/usr/es/adm/cluster.log
Generated by cluster scripts and daemons
/usr/es/sbin/cluster/history/cluster.mmdd
Cluster history file generated daily
Further log files exist, that contain debugging information. For a complete list of
all log files maintained by all subsystems of RSCT and HACMP/ES, see
Appendix E, Subsystems and Log files on page 229.
The cluster services on host1t are stopped (Stop now; BROADCAST cluster shutdown
= true; Shutdown mode = graceful). Afterwards, lssrc on host1t shows the cluster
subsystems inoperative.
Note that after successful exit of the cluster manager, the subsystems of RSCT
will terminate as well as shown in Example 5-28.
Example 5-28 Check status of RSCT subsystems
host1t:/> lssrc -a | grep svcs
topsvcs
topsvcs
grpsvcs
grpsvcs
grpglsm
grpsvcs
emsvcs
emsvcs
emaixos
emsvcs
inoperative
inoperative
inoperative
inoperative
inoperative
In Example 5-29, we stop the cluster services on the remaining three nodes
using the Cluster Single Point of Control (CSPOC) facility.
Example 5-29 Stop cluster services
host1t:\> smitty hacmp
Cluster system Management
HACMP for AIX Cluster Services
Stop Cluster Services
(the screen will include the following entries)
* Stop now, on system restart or both
now
Stop Cluster Services on these nodes
[host2t,host3t,host4t]
BROADCAST cluster shutdown?
true
* Shutdown mode
graceful
(graceful or graceful with takeover, forced)
We verify that the cluster services have stopped on all nodes by issuing
lssrc -g cluster on the remaining three nodes, which shows that the subsystems
of the cluster group have become inactive.
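A quick way to check all three at once (a sketch, using the same rsh loop style as earlier examples):

host1t:/> for i in 2 3 4; do
> rsh host${i}t "lssrc -g cluster"
> done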
When a node leaves the HACMP/ES cluster, a node_down and a node_down_complete
event are run on all active cluster nodes. The cluster log file in Example 5-30
shows these events starting and completing.
Example 5-30 Check cluster log files
host2t:/> more /usr/es/sbin/cluster/history/cluster.02222001
...
Feb 22 10:41:43 EVENT START: node_up host4t
Feb 22 10:41:44 EVENT COMPLETED: node_up host4t
Feb 22 10:41:47 EVENT START: node_up_complete host4t
Feb 22 10:41:48 EVENT COMPLETED: node_up_complete host4t
Feb 22 17:50:13 EVENT START: node_down host1t graceful
Feb 22 17:50:14 EVENT COMPLETED: node_down host1t graceful
Feb 22 17:50:15 EVENT START: node_down_complete host1t graceful
Feb 22 17:50:17 EVENT COMPLETED: node_down_complete host1t graceful
Feb 22 17:50:28 EVENT START: node_down host2t graceful
Feb 22 17:50:28 EVENT COMPLETED: node_down host2t graceful
Feb 22 17:50:29 EVENT START: node_down_complete host2t graceful
Feb 22 17:50:31 EVENT COMPLETED: node_down_complete host2t graceful
Chapter 6.
If you are unhappy with the GPFS cluster you created, run mmdelcluster, as in
Example 6-5, to remove the cluster and start over.
Example 6-5 Delete GPFS cluster
host1t:/tools/gpfs_config> mmdelcluster -n gpfs_nodefile
mmdelcluster: Command successfully completed
(Output partially recoverable here: the subsystem is started on all four nodes,
with Subsystem PIDs 24176, 17626, 11910, and 14846, and lssrc subsequently shows
Status active on each node.)
Perform the following three steps completely on each node before moving on to
the next node. This is only to be done on the rest of the nodes in the cluster
(host2t, host3t, and host4t in our case); a scripted version is sketched below.
   importvg -y gpfsvg(0-15) hdisk(x)     import the volume group
   chvg -a n gpfsvg(0-15)                make the volume group non auto varyon
   varyoffvg gpfsvg(0-15)                vary off the volume group
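A scripted version of those three steps might look like the following (a sketch only; it assumes the logvolpvid file built in Example 6-11 lists the sixteen SSA PVIDs in gpfsvg0 through gpfsvg15 order, and it must be run on one node at a time):

#!/usr/bin/ksh
# Import, disable auto-varyon, and vary off gpfsvg0..gpfsvg15 on this node.
i=0
while read pvid; do
        hd=`lspv | awk -v p=$pvid '$2 == p {print $1}'`   # map the PVID to this node's hdisk name
        importvg -y gpfsvg$i $hd                          # import the volume group
        chvg -a n gpfsvg$i                                # make the volume group non auto varyon
        varyoffvg gpfsvg$i                                # vary off the volume group
        i=`expr $i + 1`
done < /tools/ralph/logvolpvid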
Notice that pdisk0 is hdisk3 on two nodes (host1t and host2t), while it is hdisk2
on the other two nodes (host3t and host4t). The only descriptor guaranteed to be
consistent is the physical volume ID (PVID), which is the middle column in the
lspv output, as shown in Example 6-10. Refer to Appendix A, Mapping virtual disks
to physical SSA disks on page 201, for a detailed explanation of translating
between pdisk numbers and hdisk numbers.
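For a quick spot check of the mapping on any one node, the SSA ssaxlate command can translate between the two namings (a sketch; on host1t it confirms the pdisk0-to-hdisk3 mapping noted above):

host1t:/> ssaxlate -l pdisk0      # translate the SSA physical disk to its logical hdisk
hdisk3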
Example 6-10 List physical volumes
host1t:/> lspv
hdisk0          000b4a7df90e327d    rootvg
hdisk1          000b4a7de4b48b4f    rootvg
hdisk2          000b4a7d1075fdbf    toolsvg
hdisk3          000007024db58359    None
hdisk4          000007024db5472e    None
hdisk5          000007024db54fb4    None
hdisk6          000007024db5608a    None
hdisk7          000007024db571ba    None
hdisk8          000007024db5692c    None
hdisk9          000158511eb0f296    None
hdisk10         000007024db57a49    None
hdisk11         000007024db58bd3    None
hdisk12         000007024db53eac    None
hdisk13         000007024db5361d    None
hdisk14         000007024db51c4b    None
hdisk15         000007024db524ce    None
hdisk16         000007024db52d7b    None
hdisk17         000007024db513d2    None
hdisk18         000007024db55810    None
We can run lspv | awk '{print $2}' > logvolpvid, as shown in Example 6-11, to
create a file with just the PVIDs, and then eliminate the entries that are not
the SSA disks we want for our file system (SCSI disks or other SSAs). This file
can then be used to loop the commands we need to define the volume groups
properly. We also could have run ssadisk -a ssa0 -L and ssadisk -a ssa0 -P to
make these determinations, but spare SSA disks would have had to be removed manually.
Example 6-11 Create file with only PVIDs
host1t:/tools/ralph> lspv | awk '{print $2}' > logvolpvid
host1t:/tools/ralph> cat logvolpvid
000007024db58359
000007024db5472e
000007024db54fb4
000007024db5608a
000007024db571ba
000007024db5692c
000158511eb0f296
000007024db57a49
000007024db58bd3
000007024db53eac
000007024db5361d
000007024db51c4b
000007024db524ce
000007024db52d7b
000007024db513d2
000007024db55810
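The volume groups and logical volumes were then created on host1t along the following lines (a sketch only; the names and option values are chosen to match the lsvg and lslv output in Example 6-15, and the commands we actually ran may have differed):

#!/usr/bin/ksh
# Run from /tools/ralph on host1t: create gpfsvg0..gpfsvg15 and gpfslv0..gpfslv15, one per SSA disk.
i=0
while read pvid; do
        hd=`lspv | awk -v p=$pvid '$2 == p {print $1}'`   # PVID -> hdisk name
        mkvg -f -n -c -s 16 -y gpfsvg$i $hd               # concurrent capable, no auto varyon, 16 MB partitions
        mklv -y gpfslv$i gpfsvg$i 542                     # one logical volume using all 542 partitions
        i=`expr $i + 1`
done < logvolpvid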
After the volume groups and logical volumes are created, lspv on host1t shows
hdisk3 through hdisk18 assigned to gpfsvg0 through gpfsvg15 (hdisk0 and hdisk1
remain in rootvg, and hdisk2 in toolsvg).
To verify this operation was successful, we ran the lsvg -o command, as shown in
Example 6-15, which only shows volume groups that are varied on, making sure all
sixteen disks were present and ignoring other volume groups such as rootvg. We
also ran the lsvg gpfsvg(0-15) command to view the details of each volume group.
Pay particular attention to the line Concurrent: Capable; it is important that
all volume groups be concurrent capable or simultaneous sharing will not be possible.
Example 6-15 List volume groups
host1t:/tools/ralph> lsvg -o
gpfsvg15
gpfsvg14
gpfsvg13
gpfsvg12
gpfsvg11
gpfsvg10
gpfsvg9
gpfsvg8
gpfsvg7
gpfsvg6
gpfsvg5
gpfsvg4
gpfsvg3
gpfsvg2
gpfsvg1
gpfsvg0
toolsvg
rootvg
host1t:/tools/ralph> lsvg gpfsvg0
VOLUME GROUP:   gpfsvg0                  VG IDENTIFIER:   000b4a7d70d6b2e7
VG STATE:       active                   PP SIZE:         16 megabyte(s)
VG PERMISSION:  read/write               TOTAL PPs:       542 (8672 megabytes)
MAX LVs:        256                      FREE PPs:        0 (0 megabytes)
LVs:            1                        USED PPs:        542 (8672 megabytes)
OPEN LVs:       1                        QUORUM:          2
TOTAL PVs:      1                        VG DESCRIPTORS:  2
STALE PVs:      0                        STALE PPs:       0
ACTIVE PVs:     1                        AUTO ON:         no
Concurrent:     Capable                  Auto-Concurrent: Disabled
VG Mode:        Non-Concurrent
MAX PPs per PV: 1016                     MAX PVs:         32

The details of the corresponding logical volume (lslv gpfslv0) include:
VOLUME GROUP:   gpfsvg0                  PERMISSION:      read/write
LV STATE:       opened/syncd             WRITE VERIFY:    off
PP SIZE:        16 megabyte(s)           SCHED POLICY:    parallel
PPs:            542                      BB POLICY:       non-relocatable
RELOCATABLE:    yes                      UPPER BOUND:     32
LABEL:          None
We confirmed that the import on host2t was successful by running lsvg on that
node and observing that, along with the existing rootvg volume group, all
sixteen GPFS volume groups now exist. An alternate method is to run lspv, as
shown in Example 6-21, which shows additional information. We do not need to run
varyonvg at this point, since importvg automatically varies on the volume groups.
Example 6-21 List physical volumes
host1t:/tools/ralph> gdsh -w host2t "lspv"
host2t: hdisk0     000b4a9dcac1948a    rootvg
host2t: hdisk1     000b4a9dcac192ef    rootvg
host2t: hdisk2     000444527adfc8bd    None
host2t: hdisk3     000007024db58359    gpfsvg0
host2t: hdisk4     000007024db5472e    gpfsvg1
host2t: hdisk5     000007024db54fb4    gpfsvg2
host2t: hdisk6     000007024db5608a    gpfsvg3
host2t: hdisk7     000007024db571ba    gpfsvg4
host2t: hdisk8     000007024db5692c    gpfsvg5
host2t: hdisk9     000158511eb0f296    gpfsvg6
host2t: hdisk10    000007024db57a49    gpfsvg7
host2t: hdisk11    000007024db58bd3    gpfsvg8
host2t: hdisk12    000007024db53eac    gpfsvg9
host2t: hdisk13    000007024db5361d    gpfsvg10
host2t: hdisk14    000007024db51c4b    gpfsvg11
host2t: hdisk15    000007024db524ce    gpfsvg12
host2t: hdisk16    000007024db52d7b    gpfsvg13
host2t: hdisk17    000007024db513d2    gpfsvg14
host2t: hdisk18    000007024db55810    gpfsvg15
We cannot allow any node to grab the disks and vary them on automatically, so
automatic activation of all GPFS volume groups must be set to no on every node.
That can be checked by running lsvg gpfsvg(x) | grep AUTO on every GPFS volume
group and observing that the word no follows AUTO ON. We already set this to no
on host1t during the mkvg command; now we must also do it on the rest of the
nodes in turn. In Example 6-22, another simple while loop assists us: run
gdsh -w host2t /tools/ralph/changevg, where the script changevg follows.
Example 6-22 Change volume groups
#!/usr/bin/ksh
i=0
while [ "$i" -le 15 ]
do
chvg -a n gpfsvg$i
i=`expr $i + 1`
done
Step through all remaining nodes (in our case, do host3t then host4t, since we
just did host2t).
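The same script covers the remaining nodes with a simple loop, for example (assuming /tools/ralph is NFS-mounted everywhere, as in our setup):

host1t:/> for n in host3t host4t; do
> gdsh -w $n /tools/ralph/changevg
> done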
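For reference, a minimal sketch of an mmcrfs invocation that would produce a file system matching the mmlsfs output shown in Chapter 7 (the disk descriptor file name gpfs_diskfile is our assumption, and the command we actually ran may have used different options):

host1t:/tools/gpfs_config> mmcrfs /gpfs1 gpfs1 -F gpfs_diskfile -B 256K -n 32 -A no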
In Example 6-26, we verified the file system was created by observing the new
entry in the /etc/filesystems. We ran the following cat command on all nodes to
make sure.
Example 6-26 Verify creation of the file system
host1t:/tools/gpfs_config> cat /etc/filesystems | grep -p /gpfs1
/gpfs1:
        dev             = /dev/gpfs1
        vfs             = mmfs
        nodename        =
        mount           = false
        type            = mmfs
        account         = false
Another verification step is to run the lslv -l gpfslv(x) command on all nodes,
as shown in Example 6-27. We built a while loop and ran it on every node. Only
host1t is shown for simplification.
Example 6-27 List logical volumes
#!/usr/bin/ksh
i=0
while [ "$i" -le 15 ]
do
lslv -l gpfslv$i
i=`expr $i + 1`
done
host1t:/tools/ralph> testfs | pg
gpfslv0:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk3          542:000:000     19%         109:108:108:108:109
gpfslv1:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk4          542:000:000     19%         109:108:108:108:109
gpfslv2:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk5          542:000:000     19%         109:108:108:108:109
gpfslv3:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk6          542:000:000     19%         109:108:108:108:109
gpfslv4:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk7          542:000:000     19%         109:108:108:108:109
gpfslv5:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk8          542:000:000     19%         109:108:108:108:109
gpfslv6:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk9          542:000:000     19%         109:108:108:108:109
gpfslv7:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk10         542:000:000     19%         109:108:108:108:109
gpfslv8:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk11         542:000:000     19%         109:108:108:108:109
gpfslv9:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk12         542:000:000     19%         109:108:108:108:109
gpfslv10:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk13         542:000:000     19%         109:108:108:108:109
gpfslv11:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk14         542:000:000     19%         109:108:108:108:109
gpfslv12:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk15         542:000:000     19%         109:108:108:108:109
gpfslv13:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk16         542:000:000     19%         109:108:108:108:109
gpfslv14:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk17         542:000:000     19%         109:108:108:108:109
gpfslv15:N/A
PV              COPIES          IN BAND     DISTRIBUTION
hdisk18         542:000:000     19%         109:108:108:108:109
(The df -k output on all four nodes shows /dev/gpfs1 mounted on /gpfs1: 1% used,
12 inodes in use, 1% of inodes used on each node.)
Chapter 7.
Order of events

Starting GPFS
(Output partially recoverable here: the mmfs subsystem is reported started on
the three nodes currently in the nodeset, with Subsystem PIDs 17268, 18092, and
18442; ps shows /usr/lpp/mmfs/bin/mmfsd running, and df shows /gpfs1 mounted on
each of the three nodes.)
autoSgLoadBalance off
 node                      adapter         admin          fails   SGs  mem  daem TMreq
  idx   no host name       ip address      node  status   panics  mngd free CPU  /sec
 ----- ---- -------------- --------------  ----- -------  ------- ---- ---- ---- -----
    0     1 host1e         129.40.12.129   y     up       0/0     1    99%  0%   0
    1     2 host2e         129.40.12.130   y     up       0/0     0    99%  0%   0
    2     4 host4e         129.40.12.132   y     up       0/0     0    99%  0%   0
Order of events
mmaddcluster to add node to cluster
mmaddnode to add node to specified nodeset
mount to mount the file system
Example 7-3 Adding nodes to a cluster
host1t:/> mmlscluster

GPFS cluster information
========================

GPFS system data repository servers:
-------------------------------------
  Primary server:    host1e
  Secondary server:  host2e

Nodes for nodeset set1:
-------------------------------------------------
   1  host1e  129.40.12.129  host1e
   2  host2e  129.40.12.130  host2e
   4  host4e  129.40.12.132  host4e

host1t:/> mmaddcluster host3e
mmaddcluster: Command successfully completed

host1t:/> mmlscluster

GPFS cluster information
========================

GPFS system data repository servers:
-------------------------------------
  Primary server:    host1e
  Secondary server:  host2e
 node                      adapter         admin          fails   SGs  mem  daem TMreq
  idx   no host name       ip address      node  status   panics  mngd free CPU  /sec
 ----- ---- -------------- --------------  ----- -------  ------- ---- ---- ---- -----
    0     1 host1e         129.40.12.129   y     up       0/0     1    99%  0%   0
    1     2 host2e         129.40.12.130   y     up       0/0     0    99%  0%   0
    2     4 host4e         129.40.12.132   y     up       0/0     0    99%  0%   0

 node                      adapter         admin          fails   SGs  mem  daem TMreq
  idx   no host name       ip address      node  status   panics  mngd free CPU  /sec
 ----- ---- -------------- --------------  ----- -------  ------- ---- ---- ---- -----
          3 host3e         129.40.12.131   y     down     0/0     0    0%   100%
After giving node host3e a minute to complete dynamic addnode processing, mount
the file system.
host2t:/> gdsh -w host3t "mount /gpfs1"
host2t:/> mmfsadm dump cfgmgr
Cluster Configuration: Type: 'HACMP'
Domain , 4 nodes in this cluster
autoSgLoadBalance off
 node                      adapter         admin          fails   SGs  mem  daem TMreq
  idx   no host name       ip address      node  status   panics  mngd free CPU  /sec
 ----- ---- -------------- --------------  ----- -------  ------- ---- ---- ---- -----
    0     1 host1e         129.40.12.129   y     up       0/0     1    99%  0%   0
    1     2 host2e         129.40.12.130   y     up       0/0     0    99%  0%   0
    2     4 host4e         129.40.12.132   y     up       0/0     0    99%  0%   0
    3     3 host3e         129.40.12.131   y     up       0/0     0    99%  0%   0
(df now shows /gpfs1 mounted on all four nodes: 3% used, 13 inodes in use, 1% of
inodes used on each node.)
(A partial node listing recovered here shows nodes 2 host2e 129.40.12.130,
4 host4e 129.40.12.132, and 3 host3e 129.40.12.131.)
clusterType hacmp
group Gpfs.set1
recgroup GpfsRec.set1

File systems in nodeset set1:
-----------------------------
/dev/gpfs1
(lssrc output shows the subsystem with Status inoperative on each of the four nodes.)
Starting GPFS
(The startup messages report the mmfs subsystem started on all four nodes, with
Subsystem PIDs 31800, 18096, 12210, and 5008.)
 node                      adapter         admin          fails   SGs  mem  daem TMreq
  idx   no host name       ip address      node  status   panics  mngd free CPU  /sec
 ----- ---- -------------- --------------  ----- -------  ------- ---- ---- ---- -----
    0     1 host1e         129.40.12.129   y     up       0/0     0    0%   100%  0
    1     2 host2e         129.40.12.130   y     up       0/0     0    0%   100%  0
    2     4 host4e         129.40.12.132   y     up       0/0     0    0%   100%  0
    3     3 host3e         129.40.12.131   y     up       0/0     0    0%   100%  0
host1t:/> gdsh "lssrc -s mmfs"
host1t: Subsystem         Group            PID      Status
host1t:  mmfs             aixmm            31800    active
host2t: Subsystem         Group            PID      Status
host2t:  mmfs             aixmm            18096    active
host3t: Subsystem         Group            PID      Status
host3t:  mmfs             aixmm            5008     active
host4t: Subsystem         Group            PID      Status
host4t:  mmfs             aixmm            12210    active
(df on all four nodes again shows /gpfs1 mounted: 3% used, 13 inodes in use, 1%
of inodes used on each node.)
 -n  32        Estimated number of nodes that will mount file system
 -B  262144    Block size
 -Q  none      Quotas enforced
 -F  139264    Maximum number of inodes
 -V  4         File system version. Highest supported version: 4
 -z  no        Is DMAPI enabled?
 -d  gpfslv0;gpfslv1;gpfslv2;gpfslv3;gpfslv4;gpfslv5;gpfslv6;gpfslv7;gpfslv8;gpfslv9;gpfslv10;gpfslv11;gpfslv12;gpfslv13;gpfslv14;gpfslv15
               Disks in file system
 -A  no        Automatic mount option
 -C  set1      GPFS nodeset identifier
 -E  no        Exact mtime default mount option
 -S  no        Suppress atime default mount option
disk        disk size  failure holds    holds          free KB          free KB
name        in KB      group   metadata data    in full blocks     in fragments
---------   ---------- ------- -------- -----  ----------------  --------------
gpfslv0     8880128    1       yes      yes     8665088 (98%)      552 ( 0%)
gpfslv1     8880128    1       yes      yes     8665088 (98%)      552 ( 0%)
gpfslv2     8880128    1       yes      yes     8665856 (98%)      312 ( 0%)
gpfslv3     8880128    1       yes      yes     8665344 (98%)      560 ( 0%)
gpfslv4     8880128    1       yes      yes     8665344 (98%)      296 ( 0%)
gpfslv5     8880128    1       yes      yes     8665344 (98%)      312 ( 0%)
gpfslv6     8880128    1       yes      yes     8665600 (98%)      312 ( 0%)
gpfslv7     8880128    1       yes      yes     8664320 (98%)      552 ( 0%)
gpfslv8     8880128    1       yes      yes     8665088 (98%)      568 ( 0%)
gpfslv9     8880128    1       yes      yes     8665344 (98%)      568 ( 0%)
gpfslv10    8880128    1       yes      yes     8665344 (98%)      568 ( 0%)
gpfslv11    8880128    1       yes      yes     8665088 (98%)      808 ( 0%)
gpfslv12    8880128    1       yes      yes     8665088 (98%)      808 ( 0%)
gpfslv13    8880128    1       yes      yes     8665344 (98%)      552 ( 0%)
gpfslv14    8880128    1       yes      yes     8665088 (98%)      792 ( 0%)
gpfslv15    8880128    1       yes      yes     8665344 (98%)      552 ( 0%)
            ----------                         ----------------  --------------
(total)     142082048                           138643712 (98%)     8664 ( 0%)

Inode Information
------------------
Total number of inodes:      139264
Total number of free inodes: 139247
(Two listings are only partially recoverable here: a per-disk free-space report
covering gpfslv2 through gpfslv15, showing each disk roughly 46.8 percent free
with availability up, and the tail of an mmlsdisk gpfs1 listing showing gpfslv10
through gpfslv15 as driver type disk, sector size 512, failure group 1, holding
metadata and data, status ready, availability up.)
Before a disk can be replaced, we must suspend the disk. By using the mmchdisk
command, as shown in Example 7-16, a disk can be placed in several states,
such as suspend, start, and stop. You cannot write new data to a suspended disk
but you can update existing data and read from that disk. A disk is typically
suspended prior to restriping or deletion. We arbitrarily chose to remove gpfslv13
for our test.
Example 7-16 Change disk status to suspend
host1t:/> mmchdisk gpfs1 suspend -d gpfslv13
host1t:/> mmlsdisk gpfs1
disk         driver   sector failure holds    holds
name         type     size   group   metadata data  status       availability
------------ -------- ------ ------- -------- ----- ------------ ------------
gpfslv0      disk     512    1       yes      yes   ready        up
gpfslv1      disk     512    1       yes      yes   ready        up
gpfslv2      disk     512    1       yes      yes   ready        up
gpfslv3      disk     512    1       yes      yes   ready        up
gpfslv4      disk     512    1       yes      yes   ready        up
gpfslv5      disk     512    1       yes      yes   ready        up
gpfslv6      disk     512    1       yes      yes   ready        up
gpfslv7      disk     512    1       yes      yes   ready        up
gpfslv8      disk     512    1       yes      yes   ready        up
gpfslv9      disk     512    1       yes      yes   ready        up
gpfslv10     disk     512    1       yes      yes   ready        up
gpfslv11     disk     512    1       yes      yes   ready        up
gpfslv12     disk     512    1       yes      yes   ready        up
gpfslv13     disk     512    1       yes      yes   suspended    up
gpfslv14     disk     512    1       yes      yes   ready        up
gpfslv15     disk     512    1       yes      yes   ready        up
(The lspv hdisk16 and lsvg gpfsvg13 output, partially recoverable here, confirms
that hdisk16 (PVID 000007024db52d7b) belongs to volume group gpfsvg13:
VG IDENTIFIER 000b4a7d70d7eecd, PP SIZE 16 megabytes, TOTAL PPs 542 (8672
megabytes), FREE PPs 0, QUORUM 2, STALE PPs 0, AUTO ON no, Auto-Concurrent
Disabled, MAX PVs 32.)
(Only the right-hand columns of a subsequent mmlsdisk gpfs1 listing survive: it
shows fifteen disks, all holding metadata and data, with status ready and
availability up.)
(A final mmlsdisk gpfs1 listing again shows sixteen disks, all holding metadata
and data, with status ready and availability up.)
Activating quotas
When running mmcrfs, the default is to not institute quotas on the GPFS file
system. We would have to include the option mmcrfs -Q yes to activate quotas
during creation of the file system, or mmchfs -Q yes after the fact. When
running mmchfs, we first have to issue umount /gpfs1 on all the nodes or the
change will not take effect. Example 7-20 shows the result of mmlsfs before
quota activation. After activation, both user and group appear as flag values
that can be enforced. Running the quota report generator mmrepquota before
activation yielded "no quota management installed". Afterwards, a table is
written, but it is empty pending editing of the quota limit file via mmedquota.
Example 7-20 Activating quotas
host1t:/> mmlsfs gpfs1 -Q
flag value          description
---- -------------- ------------------------------------------------------
 -Q  none           Quotas enforced

host1t:/> mmrepquota -g -a
gpfs1: no quota management installed

host1t:/> mmchfs gpfs1 -Q yes
mmchfs: Propagating the changes to all affected nodes.
This is an asynchronous process.

host1t:/> mmlsfs gpfs1 -Q
flag value          description
---- -------------- ------------------------------------------------------
 -Q  user;group     Quotas enforced

host1t:/> mmrepquota -u -a
*** Report for USR quotas on gpfs1
                      Block Limits                        |   File Limits
Name    type   KB   quota  limit  in_doubt  grace         |   files  quota  limit  in_doubt  grace
Establishing quotas
Once we turned quotas on, we designated different limits for different users. In
Example 7-21, we created users red, green, blue, and yellow and made directories
(not required) under /gpfs1 for each of them. These users owned all the files
within these directories.
A soft limit is the total size of all of that user's files and does not
immediately impact the user's ability to create or add to those files. That
changes when the user exceeds the grace period (the time passed after initially
going over the soft limit) or when the user hits the hard limit. Since the
limits must be a multiple of the block size, we chose a soft limit of 256K, a
hard limit of 256K, and a grace period of 2 days (the default is 7 days) for
user green. For user red, we picked a soft limit of 512K and a hard limit of
1024K; the grace period is the same for all users. For the sake of
experimentation, we left the inodes at zero, which means unlimited.
Example 7-21 Establishing quotas
host1t:/> mmedquota -u green
*** Edit quota limits for USR green
NOTE: block limits will be rounded up to the next multiple of the block size.
gpfs1: blocks in use: 0K, limits (soft = 0K, hard = 0K)
inodes in use: 0, limits (soft = 0, hard = 0)
*** Edit quota limits for USR green
NOTE: block limits will be rounded up to the next multiple of the block size.
gpfs1: blocks in use: 0K, limits (soft = 256K, hard = 256K)
inodes in use: 0, limits (soft = 0, hard = 0)
host1t:/> mmedquota -u red
*** Edit quota limits for USR red
NOTE: block limits will be rounded up to the next multiple of the block size.
gpfs1: blocks in use: 0K, limits (soft = 0K, hard = 0K)
inodes in use: 0, limits (soft = 0, hard = 0)
*** Edit quota limits for USR red
NOTE: block limits will be rounded up to the next multiple of the block size.
gpfs1: blocks in use: 0K, limits (soft = 512K, hard = 1024K)
inodes in use: 0, limits (soft = 0, hard = 0)
host1t:/> mmedquota -t -u
*** Edit grace times
Time units may be: days, hours, minutes, or seconds
Grace period before enforcing soft limits for USRs:
Listing quotas
To view any established quota, as in Example 7-22, we ran mmlsquota or
mmrepquota, depending on the format of the output we wanted. As long as we do
not exceed the quota for a user, grace shows up as none. When we exceed the
quota, grace displays the number of days the user has to correct the over-quota
condition before the ability to write is curtailed. The output of these commands
is very wide and separated near the middle by a | symbol. The mmlsfs gpfs1 -Q
command shows the current options for the specified file system; the -Q line
shows that quotas are enforced for both user and group.
Example 7-22 Listing quotas
host1t:/gpfs1/red> mmlsquota -u green
                        Block Limits                          |     File Limits
Filesystem type    KB       quota  limit  in_doubt  grace     |  files  quota  limit  in_doubt  grace
gpfs1      USR     456      256    256    -216      2days     |  19     0      0      -9        none

host1t:/gpfs1/green> mmrepquota -u -a
*** Report for USR quotas on gpfs1
                        Block Limits                          |     File Limits
Name     type   KB        quota  limit  in_doubt  grace       |  files  quota  limit  in_doubt  grace
root     USR    65954312  0      0      5120      none        |  23     0      0      20        none
red      USR    128       512    1024   0         none        |  2      0      0      0         none
green    USR    4056      256    256    -3728     2days       |  16     0      0      -2        none
blue     USR    24        0      0      0         none        |  2      0      0      0         none
yellow   USR    16        0      0      0         none        |  6      0      0      0         none
Deactivating quotas
In Example 7-23, we ran mmquotaoff to deactivate quotas. By specifying the file
system name without modifiers, the command disabled both user and group quotas.
Had we specified -u or -g, we could have turned off only user or group quotas,
respectively. When we ran mmlsquota, it still looked as though quotas were
active; we had to run umount /gpfs1 and mmchfs -Q no to completely stop quotas,
and then mount /gpfs1, before mmlsquota responded with "no quota management
installed".
Example 7-23 Deactivating quotas
host1t:/gpfs1/green> mmquotaoff gpfs1
host1t:/gpfs1/green> mmlsquota
                        Block Limits                          |     File Limits
Filesystem type    KB       quota  limit  in_doubt  grace     |  files  quota  limit  in_doubt  grace
gpfs1      USR          no limits

host1t:/gpfs1/green> mmchfs -Q no
host1t:/gpfs1/green> gdsh "umount /gpfs1"
host1t:/gpfs1/green> gdsh "mount /gpfs1"
host1t:/> gdsh "df -k | grep gpfs"
host1t: /dev/gpfs1    142077952  75916032   47%      67     1% /gpfs1
host2t: /dev/gpfs1    142077952  75916032   47%      67     1% /gpfs1
host3t: /dev/gpfs1    142077952  75916032   47%      67     1% /gpfs1
host4t: /dev/gpfs1    142077952  75916032   47%      67     1% /gpfs1

host1t:/> mmrepquota -u -a
gpfs1: no quota management installed
host1t:/> mmlsquota
                        Block Limits                          |     File Limits
Filesystem type    KB       quota  limit  in_doubt  grace     |  files  quota  limit  in_doubt  grace
gpfs1: no quota management installed
In Example 7-25 on page 138, we stop the HACMP/ES cluster services on host2t
and wait until the cluster is stable. A dynamic reconfiguration event can only be
performed when the cluster is stable.
Next, in Example 7-27, we remove host2t from the nodeset of the cluster. This
will remove all cluster adapters that are configured on the node as well. Note that
all following commands have to be performed on a node on which the cluster
services are active; this is a requirement for DARE. We chose host3t.
Example 7-27 Remove cluster node
host3t:/> smitty hacmp
Cluster Configuration
Cluster Topology
Configure Nodes
Remove a Cluster Node
(select host2t)
The node has now been removed from the node set in the HACMP/ES cluster
configuration on host3t, along with all cluster adapters that were configured on it.
In Example 7-28, we synchronize the cluster topology on host3t. This will
distribute the change made locally to the cluster topology on host3t to all other
nodes in the cluster. It will update the cluster configuration that is known to the
cluster manager. The cluster manager daemons will run a reconfig_topology
event.
Example 7-28 Synchronize cluster topology
host3t:\> /usr/es/sbin/cluster/utilities/cldare -t
In Example 7-29, we monitor the events that are run during the DARE.
Example 7-29 Verify successful synchronization
host3t:/> more /usr/es/sbin/cluster/history/cluster.0306
Feb 22 18:28:39 EVENT START: node_down host2t graceful
Feb 22 18:28:39 EVENT COMPLETED: node_down host2t graceful
Feb 22 18:28:40 EVENT START: node_down_complete host2t graceful
Feb 22 18:28:41 EVENT COMPLETED: node_down_complete host2t graceful
Feb 22 18:34:00 EVENT START: reconfig_topology_start
Feb 22 18:34:01 EVENT COMPLETED: reconfig_topology_start
Feb 22 18:34:01 EVENT START: reconfig_topology_complete
Feb 22 18:34:05 EVENT COMPLETED: reconfig_topology_complete
We can convince ourselves that host2t has been removed from the cluster
topology by displaying the cluster networks, as in Example 7-30.
Example 7-30 Display cluster network
host3t:/> /usr/es/sbin/cluster/utilities/cllsnw
Network         Attribute  Node     Adapter(s)
gpfs_net        public     host1t   host1e
                           host3t   host3e
                           host4t   host4e
noname_net      public     host1t   host1t
                           host3t   host3t
                           host4t   host4t
Afterwards, in Example 7-32, we need to add the cluster adapters, host2t and
host2e, that were previously configured on the node.
Example 7-32 Add cluster adapters
host3t:/> /usr/es/sbin/cluster/utilities/claddnode -a host2t:token:\
> noname_net:public:service : : -n host2t
host3t:/> /usr/es/sbin/cluster/utilities/claddnode -a host2e:ether:\
> gpfs_net:public:service : : -n host2t
Example 7-34 shows the sequence of events that is run while the reconfiguration
is performed.
Example 7-34 Cluster reconfiguration events
host1t:/> tail -f /usr/es/sbin/cluster/history/cluster.03062001
Feb 22 19:09:38 EVENT START: reconfig_topology_start
Feb 22 19:09:38 EVENT COMPLETED: reconfig_topology_start
Feb 22 19:09:39 EVENT START: reconfig_topology_complete
Feb 22 19:09:42 EVENT COMPLETED: reconfig_topology_complete
Finally, we add host2t back to the GPFS nodeset and start the GPFS daemon.
First, we determine that quorum exists for the nodeset; otherwise, adding the
new node may not be successful. In Example 7-36, we can obtain the number of
active nodes in the GPFS nodeset from the long listing of the Group Services
subsystem.
Example 7-36 First group services subsystems
host3t:/> lssrc -ls grpsvcs
Subsystem         Group            PID     Status
 grpsvcs          grpsvcs          10204   active
4 locally-connected clients. Their PIDs:
16582(hagsglsmd) 14028(haemd) 16018(clstrmgr) 14166(mmfsd)
HA Group Services domain information:
Domain established by node 3
Number of groups known locally: 5
                          Number of      Number of local
Group name                providers      providers/subscribers
Gpfs.set1                 4              1            0
GpfsRec.set1              4              1            0
ha_em_peers               4              1            0
CLRESMGRD_111             4              1            0
CLSTRMGR_111              4              1            0
host3t:/>
The command shows that the groups Gpfs.set1 and GpfsRec.set1 have four
providers; hence GPFS is active on all four nodes and quorum exists.
(smit screen residue: the network module for type ether [Ethernet Protocol] is
tuned with a Custom failure detection rate; the values 30, 4, and 5 were entered
for its tuning parameters.)
The Heartbeat Rate is the frequency keepalive signals are sent. The Grace
Period is the time span keepalive signals from an adapter can be missed before
Topology Services declares this adapter has failed. The network module settings
are configured by network type. In this example, the settings would apply to all
cluster adapters of type ether. A short grace period is desirable; however, it
must not be so short that a saturated network generates false adapter failures. We tried
the above settings on the network used by GPFS and did not see any errors. An
adapter failure would be visible in the event history in our configuration as a
network_down event.
After the network module setting has been changed, the cluster topology needs
to be synchronized to update Topology Services on all nodes.
Chapter 8.
Developing Application
Programs that use GPFS
This chapter describes concepts and provides guidelines related to the
development of application programs using GPFS in a clustered environment.
This chapter revisits many of the concepts introduced in Chapter 2, More about
GPFS on page 13, but discusses them as they affect application programming
performance. With this focus, numerous benchmark results are provided. Since
this book is being written for an environment without PSSP, it is assumed that the
programs are not parallel (i.e., neither MPI nor OpenMP is used); however, many of the
concepts in this chapter are equally applicable to parallel programs. See GPFS
for AIX: Administration and Programming Reference for more details.
In particular, this chapter includes:
The relationship of the POSIX I/O API to GPFS
The relationship of GPFS architecture and organization to application
programming
The analysis of several I/O access patterns and examples of how to improve
the random I/O access pattern using the GPFS Multiple Access Range hint
Multinode performance considerations
Numerous benchmark results and performance monitoring
A simple example
The following code segment illustrates the simple way GPFS can be used. It
directly uses the familiar POSIX APIs open(), lseek(), write(), fsync() and close().
But there are no specialized system calls, no macros, nor anything else
exceptional.
offset_t seek_offset;
int      fd;
int      k;
char     *fname = "my_file";
char     buf[16384];
int      buf_size;
. . .
fd = open(fname, O_RDWR|O_CREAT|O_TRUNC, 0777);
for (k = 0; k < 10000; k++)
{
    seek_offset = do_something(buf_size, buf);  /* application-specific work that prepares buf */
    lseek(fd, seek_offset, SEEK_SET);
    write(fd, buf, buf_size);
}
fsync(fd);
close(fd);

See POSIX Programmer's Guide, Donald Lewine, O'Reilly and Associates, April 1991, ISBN 0-937175-73-0 for further details.
By defining the _LARGE_FILES flag, this code segment can access files
exceeding 2 GB. See Section 8.8.2, Notes on large files on page 181 for further
details.
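For example, one way to get 64-bit file offsets is to define the flag before any system header is pulled in (compiling with -D_LARGE_FILES achieves the same thing):

/* Enable large-file support so offsets beyond 2 GB work with open(), lseek(), and write(). */
#define _LARGE_FILES
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>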
While production codes are far more complex in practice, any program using the
POSIX API which works correctly on a sequential file system, such as the
Journaled File System (JFS), will work correctly using GPFS, but with several
added benefits. For instance, this program can safely read or write a file
distributed over several disks in parallel. In other words, the program can be
written using a conventional sequential style and GPFS will automatically
parallelize the I/O operations. It also includes the ability to access the file from
any of the several nodes where the GPFS file system is simultaneously mounted.
Note on terminology
In many RS/6000 and AIX manuals, the reader encounters the term system call.
This generally refers to the API presented by AIX to the application programmer
and it includes the POSIX API calls. The authors use terms like AIX system call
and GPFS system call to distinguish between API calls generally available to
application programmers via AIX and more restricted ones like those associated
with GPFS. If the term system call is used without qualification, assume that it
means an AIX system call.
well across multiple file systems; subtle changes which significantly improve a
program on one file system often do not have the same effect on others. So,
regarding the second point, POSIX portability guarantees only correctness, not
performance.
Finally, regarding the third point, GPFS provides additional AIX system calls that
go beyond the POSIX standard to improve performance in some select cases.
These include the GPFS hints and data shipping features. Programs which use
these features can, under the right circumstances, experience a significant
performance lift, but programs using these features are not generally portable to
other UNIX and UNIX file system combinations.
Parameter            Value
Node model           7025-F50
Number of nodes      4
Memory               1.5 GB
                     256 KB
Disk model           7133-D40 SSA
Number of disks      16 disks on 1 loop
Logical volumes      135 GB
GPFS network         100 MB ether
Now, suppose that the seek offset is 1310742. Then this record would span only
two blocks. Either case is relatively inefficient. These operations can be
significantly improved by reading a 256 KB record with a seek offset of 1310720.
Then the record would be block aligned and only one block would be accessed. But
aligning records on block boundaries like this is not always possible. A similar
problem occurs when a program writes a record that does not align properly with
a block boundary.
Figure 8-1 Reading a record that does not align with the GPFS block boundary (a
288 KB record starting at offset A spans parts of three 256 KB blocks, all of
which must be staged through the pagepool)
access pattern is random, the file must be read twice over to access all of the
data. This means that it takes twice as long to execute, thus reducing the
effective I/O rate to 29.2 MB/s. Even so, 29.2 MB/s is significantly better than the
raw rate of 12.8 MB/s when reading 288 KB records.
Table 8-2 Record/block mis-alignment
Record size      I/O rate
256 KB           52.6 MB/s
288 KB           12.8 MB/s

Record size      I/O rate
16 KB            1.39 MB/s
256 KB           13.20 MB/s
1024 KB          29.70 MB/s
Consider Figure 8-2. Let file X be the 8 GB file in this figure and without loss of
generality let job 1 run on node 1. Now, consider jobs 3 and 4 in particular.
Perhaps job 3 is writing in the byte range 4 G to 6.5 G and job 4 desires to write
in the byte range 6 G to 8 G. That poses a potential write conflict. In this case,
before job 4 can write to the conflicted area, the GPFS daemon must acquire a
token from the token manager.
Subblocks also play a role in the token manager as the fundamental granularity
for byte range locking is the subblock.
[Figure 8-2: File X (8 GB) with job 1 on node 1 locking offsets 0-2 G, job 2 on
node 2 locking 2-4 G, job 3 on node 3 locking 4-6.5 G, and job 4 on node 4
locking 6-8 G; the overlapping region between jobs 3 and 4 is the write conflict.]
Since the activities of the token manager are greatly subdued (but not eliminated) for sequential application programs,
its role for read and write I/O operations is not considered here. See GPFS for AIX: Concepts, Planning, and Installation
Guide for further details.
The last item guarantees cache coherence. A daemon actually does the flushing.
Unless specified otherwise (as in the first case), the flush is done asynchronously
(i.e., independently of, and concurrently with, the application program threads).
As with the read case, consider the write I/O operation from its two levels of
complexity.
1. The record is contained in the pagepool
When we say that the record is contained in the pagepool, we mean that the
GPFS blocks that will contain the record are in the pagepool. In this case, the
GPFS kernel extension copies the record from the buffer provided by the
application program directly to the pagepool. Unless this is a synchronous
write, control is then returned to the application program and if one of the
situations occur forcing a flush, it is scheduled and processed asynchronously
by a daemon.
2. The record is not contained in the pagepool
If the blocks that will contain the record are not present in the pagepool, the
GPFS kernel extension suspends the application program and dispatches a
thread to allocate the blocks before writing the record. There are three options
to consider.
L = 1024 KB (large)
M = 256 KB (medium)
S = 16 KB (small)

Record size/      Write         Read          Write          Read
Pattern           I/O Only      I/O Only      I/O and CPU    I/O and CPU
L-Sequential      67.00 MB/s    75.90 MB/s    32.70 MB/s     30.80 MB/s
M-Sequential      64.00 MB/s    73.20 MB/s    32.60 MB/s     30.60 MB/s
S-Sequential      66.80 MB/s    69.80 MB/s    34.90 MB/s     35.90 MB/s
L-Strided         65.90 MB/s    75.20 MB/s    32.80 MB/s     30.70 MB/s
M-Strided         64.80 MB/s    72.80 MB/s    32.00 MB/s     30.50 MB/s
S-Strided         10.70 MB/s    19.90 MB/s    10.70 MB/s     20.60 MB/s
L-Random          64.40 MB/s    29.80 MB/s    32.40 MB/s     19.00 MB/s
M-Random          52.60 MB/s    13.20 MB/s    21.50 MB/s     10.60 MB/s
S-Random          6.23 MB/s     1.39 MB/s     5.64 MB/s      1.36 MB/s
This statistically identical performance of the medium and large record strided writes to the sequential access patterns is
surprising to the authors. On SP systems using VSD servers, the strided write performance is relatively good but is only 60% of
the sequential access patterns.
Let F = QS / QR. Then we conclude that the rate RS = O(n). In other words, when
the record size is less than the block size and n > 1, the I/O rate varies
linearly with the number of records accessed per block. Since n = O(1/N), we can
also say that the I/O rate is inversely proportional to the stride. When n = 1,
the constants k1 and k2 change because GPFS now reads a subblock.
Table 8-5 shows this relationship. In this example, the benchmark programs write
or read a 1 GB file with a record size of 16 KB. The block size is 256 KB. Thus
there are 16 records per block. Note that each time the stride doubles, the rate is
cut in half, until the stride = 16 and the constants k1 and k2 change. When stride
= 1, the I/O access pattern becomes small record sequential and striping
becomes operational. The rates in this case are 66.2 MB/s for the write and 75.9
MB/s for the read.
Table 8-5 Relationship between stride and I/O rate for small record strided
Stride (N)      Write I/O rate     Read I/O rate
2               24.30 MB/s         40.70 MB/s
4               10.60 MB/s         19.70 MB/s
8               5.03 MB/s          9.56 MB/s
16              9.73 MB/s          22.10 MB/s
Summarizing, large and medium record strided I/O access patterns perform at
the same rate as sequential I/O access patterns; small record strided I/O access
do not perform as well, but perform better than a random I/O access pattern for
blocks of the same size.
gpfs_fcntl()
The only system call included in the MAR hints API is gpfs_fcntl(). This is a
general purpose system call which invokes different functions based upon the
structure fields in the parameters being passed to it. It is similar in spirit to the
familiar ioctl() system call. The syntax of this system call is:
#include <gpfs_fcntl.h>
int gpfs_fcntl(int fd, void *fcp)
fd     The file descriptor for the open file to which the hints are being applied.
fcp    A pointer to a nested structure designed by the application programmer
       containing other GPFS defined structures; the particular structures
       included determine the functions to be invoked and contain the relevant
       parameters.
The return value is zero if gpfs_fcntl() is successful, otherwise it is -1. In the latter
case, the global variable errno is set to record the specific error.
Note that the application program using this function must be linked with libgpfs.a
(specified by the compiler -lgpfs).
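For instance, a typical compile line might look like this (the compiler choice and source file name are assumptions; only the -lgpfs part is required by GPFS):

xlc -o myhints myhints.c -lgpfs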
There are three GPFS defined structures used by GMGH that can be contained
in the structure pointed to by fcp. They are:
gpfsFcntlHeader_t
gpfsCancelHints_t
gpfsMultipleAccessRange_t
gpfsFcntlHeader_t
This structure is used by gpfs_fcntl() to specify the version and the size of the
operand (i.e., the size of the structure pointed to by fcp) being passed to it. The
fields used by GMGH are:
typedef struct
{
int totalLength;
int fcntlVersion;
int fcntlReserved;
int errorOffset;
} gpfsFcntlHeader_t;
/*
/*
/*
/*
gpfsCancelHints_t
This is one of the structures used by gpfs_fcntl() to determine which GPFS
functions are being invoked. Strictly speaking, the function associated with this
structure is not a hint, but a directive; as a directive, it is not just good advice
given to GPFS, but an action that must be executed.
162
The function associated with this structure is to remove any hints that have been
issued against the open file. It restores the hint status to what it was when the file
was first opened, but it does not alter the status of the GPFS cache.
The syntax of this structure is:
typedef struct
{
int structLen;
int structType;
} gpfsClearFileCache_t;
gpfsMultipleAccessRange_t
This is another one of the structures used by gpfs_fcntl() to determine which
GPFS functions are being invoked. It is a nested structure used to present a
range of blocks to the MAR hint mechanism; blocks can be issued as a hint
and/or they can be released after having been accessed.
The structures syntax is:
typedef struct
{
163
int
structlen;
int
structType;
int
accRangeCnt;
int
relRangeCnt;
gpfsRangeArray_t accRangeArray[GPFS_MAX_RANGE_COUNT];
gpfsRangeArray_t relRangeArray[GPFS_MAX_RANGE_COUNT];
} gpfsMultipleAccessRange_t;
164
start is the displacement (relative to the block) to the first byte in the block to be
accessed by the application program (for block aligned records, start is 0). length
specifies the number of bytes being accessed starting from start. If the block is
being written, isWrite is 0; if its being read, isWrite is 1.
Application programs think in terms of records, whereas the MAR hint API thinks
in blocks. Therefore it is necessary to correlate records to the blocks they span.
The following code segments illustrate one way to issue the hints corresponding
to the application program records. They come from the function
gmgh_issue_hint() listed in Appendix G, Benchmark and Example Code on
page 237. References in the following discussion to actions taken elsewhere in
this code (e.g., generating the hint vector and block list) are highlighted in the
appendix by call-out boxes. To improve the readability of this example, error
checking, debugging code segments, and comments made redundant by
embedded text have been removed; it is otherwise a complete and correct
function.
This first code segment is the function header.
int gmgh_issue_hint
(
gmgh *p,
int nth
)
GMGH is a structure containing various parameters and lists. There are two
important lists in GMGH. p->hint is a vector whose entries are pointers to
structures describing each record the application program has submitted for
hinting by calling gmgh_post_hint(). There can be a maximum of 128 entries in
this vector. Once exhausted, the application program submits another batch of
128 entries, and so on. The number 128 is arbitrary and can be set to meet
application program needs. Corresponding to this vector is the block list
p->blklst; its entries are pointers to structures describing the blocks
corresponding to the records in p->hint. When the application program posts its
records as hints (i.e., enters them into p->hint), another function,
gmgh_gen_blk(), creates corresponding entries in p->blklst. gmgh_gen_blk()
determines how many blocks each record in p->blklst spans and creates one
entry for each block. Since the record size is variable in this example, a
maximum block size must be declared; this is used to set p->nbleh which is the
maximum number of blocks per record. So if the GPFS block size is 256 KB, and
the maximum record size is set to 1024 KB, then there will be four times as many
entries in p->blklst as there are in p->hint.
165
nth specifies the current record being issued as a hint; nth is always less than
p->nhve, which is the number of entries in the hint vector.
The following code segment simply defines some local variables used in the
code segments that follow. Embedded comments briefly describe them, but
detailed analysis of their use is explained further below.
{
int
int
int
int
int
rem;
accDone, relDone;
nhntacc;
rbx;
ibx;
/*
/*
/*
/*
/*
remember a value */
while loop conditions */
number of hints accepted */
block list index for released blocks */
block list index for issued blocks */
ghint is the parameter that is passed to the gpfs_fcntl() system call. It is being
intialized here.
struct
{
gpfsFcntlHeader_t hdr;
gpfsMultipleAccessRange_t marh;
} ghint;
ghint.marh.structLen = sizeof(ghint.marh);
ghint.marh.structType = GPFS_MULTIPLE_ACCESS_RANGE;
Some more routine initialization comes next. p->nbleh is the maximum number of
block entries per hint; thus rbx references the next set of blocks to be hinted.
relDone and accDone are Boolean values used to determine when we are done
releasing and issuing hints. acc in accDone indicates acceptance of hints; if
accDone = TRUE, we are done accepting hints, so no more can be issued.
rbx = nth * p->nbleh;
This is where we get down to business. Blocks associated with the last accessed
record are released, and we continue issuing new hints (i.e., one hint per block)
until no more are accepted.
while(!accDone || !relDone)
{
This following for loop prepares the relRangeArray structures for release of
issued hint blocks. It does this by extracting the block information from p->blklst
starting with p->lstblkreleased. Notice that relDone is set to TRUE in an if
statement. If the record size is very large relative to the GPFS block size, then
166
=
=
=
=
p->blklst[rbx].blknum;
p->blklst[rbx].blkoff;
p->blklst[rbx].blklen;
p->blklst[rbx].isWrite;
rbx++;
p->hint[nth].lstblkreleased++;
}
ghint.marh.relRangeCnt = k;
The next statements do some book keeping chores. The first one handles the
case where GPFS_MAX_RANGE_COUNT divides the number of blocks
touched. The second one handles an unusual situation that, after repeated
testing, has never occurred, but that we cannot rule out. It is possible that we
have not had hints accepted for some period of time and the records have been
accessed and released (releasing un-issued hints is not a problem). Thus, state
variables are set to be certain that we do not issue these hints since they are no
longer needed.
if (p->hint[nth].lstblkreleased >= p->hint[nth].nblkstouched)
relDone = TRUE;
if (p->nxtblktoissue <= rbx)
p->nxtblktoissue = p->nbleh * (nth + 1);
This following while loop prepares the accRangeArray structures for issuing
hints. It does this by extracting the block information from p->blklst starting with
p->nxtblktoissue. Notice the variable rem; it is used to remember where we
started so that if some of the blocks are not accepted, we are able to backtrack.
As in the case of releasing blocks, if the number of blocks per record exceeds
GPFS_MAX_RANGE_COUNT, it must cycle through the outer while loop again.
rem = p->nxtblktoissue;
ibx = p->nxtblktoissue;
k
= 0;
167
while (!accDone
&&
k < GPFS_MAX_RANGE_COUNT &&
ibx < p->UBnblks
)
{
if (p->blklst[ibx].blkoff >= 0)
/* Is the next entry OK? */
{
ghint.marh.accRangeArray[k].blockNumber = p->blklst[ibx].blknum;
ghint.marh.accRangeArray[k].start
= p->blklst[ibx].blkoff;
ghint.marh.accRangeArray[k].length
= p->blklst[ibx].blklen;
ghint.marh.accRangeArray[k].isWrite
= p->blklst[ibx].isWrite;
ibx++;
k++;
}
else
ibx++;
}
ghint.marh.accRangeCnt = k;
p->nxtblktoissue
= ibx;
Here we set the fields of the gpfsFcntlHeader_t structure and call gpfs_fcntl() to
release accessed blocks and issue the new hints that have been packaged in the
preceding nested loops.
ghint.hdr.totalLength
= sizeof(ghint);
ghint.hdr.fcntlVersion = GPFS_FCNTL_CURRENT_VERSION;
ghint.hdr.fcntlReserved = 0;
gpfs_fcntl(p->fd, &ghint);
Finally, there is one last set of book keeping chores to settle. If GPFS did not
accept all of the hints that were issued, it is necessary to determine which ones
were not accepted and set state variables to pick up where it left off. The variable
nhntacc is set to the number of hints not accepted. rem records the first block
issued in the most recent call to gpfs_fcntl(). This information is used to reset
p->nxtblktoissue to the first unaccepted block so that we can start where we left
off next time. Notice the test p->blklst[++ibx].blkoff < 0; because the record size is
variable, the block offset blkoff is set to -1 for unused blocks in p->blklst.
nhntacc = ghint.marh.accRangeCnt;
if (nhntacc < GPFS_MAX_RANGE_COUNT)
{
accDone = TRUE;
/* no more hints this time */
ibx =
k
=
while
{
if
168
rem - 1;
0;
(k < nhntacc)
(p->blklst[++ibx].blkoff < 0)
continue;
k++;
}
p->nxtblktoissue = ibx + 1;
}
}
return 0;
}
This code is written in ANSI C; had it been written in C++, these functions would have been declared public and private
respectively.
169
gmgh_init_hint()
This function initializes the structures used internally by GMGH. Its syntax is:
#include gmgh.h
int gmgh_init_hint(gmgh *p, int fd, int maxrsz, int maxhint)
170
Setting the hint set too small prevents sufficient numbers of hints from being
issued and increases overhead by forcing the hint set to be re-populated
unnecessarily. Setting the hit set too large just wastes memory reserved for
internal structures.
Upon success, this function returns 0, otherwise it returns -1.
gmgh_post_hint()
This function posts records indicating that they will be issued as hints in the
future. The posting action places each posted record in the hint set (i.e., enters
them into p->hint). As the record is posted, gmgh_gen_blk() is called by
gmgh_post_hint() to determine how many blocks each record spans and it
creates one entry for each block corresponding to the record in p->blklst. Before
any of the records in the hint set are accessed, gmgh_post_hint() is called up to
maxhint times posting records one after another. The syntax is:
#include gmgh.h
int gmgh_post_hint(gmgh *p, offset_t soff, int nb, int nth, int isWrite)
171
The following is a simple example illustrating how to post hints. The loop
generates the hint set one record at a time (next_rec() determines the seek offset
for the next record). Notice that no records are accessed in the loop; that
happens in a loop that comes after this one.
for (i = 0; i < maxhint; i++)
{
soff = next_rec();
gmgh_post_hint(p, soff, bsz, i, 1);
}
gmgh_declare_1st_hint()
gmgh_xfer() is used to access records and gmgh_issue_hint() is called after a
record is accessed. But, the first time gmgh_xfer() is called for the hint set, no
hints have been issued prior to accessing that first record.
gmgh_declare_1st_hint() is therefore designed to be called after posting the last
record in the hint set and prior to calling gmgh_xfer() to access the first record in
the hint set. gmgh_declare_1st_hint() first calls gmgh_cancel_hint() to clean the
slate (n.b., it releases any issued hints leftover from the previous hint set), then
calls gpfs_issue_hint(). The syntax is:
#include gmgh.h
int gmgh_declare_1st_hint(gmgh *p)
p
p points to the gmgh structure defined in gmgh.h.
gmgh_xfer()
gmgh_xfer() is used to access records by calling the read() or write() system
calls, but it first calls llseek() with the whence parameter set to SEEK_SET. After
the record has been accessed, gmgh_issue_hint() is called which issues,
re-issues and releases hints as necessary. gmgh_xfer() is called for each record
in the hint set after all the records have been posted by gmgh_post_hint() and
gmgh_declare_1st_hint() has been called the first time. The syntax is:
#include gmgh.h
int gmgh_xfer(gmgh *p, char *buf, int nth)
p
p points to the gmgh structure defined in gmgh.h.
buf
buf points to a buffer supplied by the user to contain the record. Its size must
be at least maxrsz bytes.
nth
172
nth is the ordinal number of the record and is an index into p->hint where the
record is posted. gmgh_xfer() uses the nth entry in p->hint to get the seek
offset, record size and whether it should be read or written.
The following statements are variable declarations. The variable names and
associated comments should make their meaning clear.
int
int
int
char
int
offset_t
gmgh
int
int
fd;
nrec = NUMBER_OF_RECORDS;
bsz = SIZE_OF_RECORDS;
buf[SIZE_OF_RECORDS];
nrn;
soff;
*p;
maxhint = MAXHINT;
i, k, nhint;
/*
/*
/*
/*
/*
/*
/*
/*
/*
file descriptor */
obvious */
fixed size records */
record buffer */
next record number */
seek offset */
pointer to gmgh structure */
max number of hints */
miscellaneous */
The next two statements are some initialization tasks. Since we are generating a
random I/O access pattern, we need a random number generator. We are calling
it rand(); it returns single precision, floating point, uniform random deviates
between 0 and 1. seed_rand() seeds the random sequence. Afterwords, we
open a file to be written to whose size can exceed 2 GB.
seed_rand(1.0);
What comes next is the first GMGH call. We mallocate memory for the GMGH
structure, initialize the fields to zero and call gmgh_init_hint().
p = (gmgh*)malloc(sizeof(gmgh));
memset(p,\0,sizeof(gmgh));
gmgh_init_hint(p, fd, bsz, maxhint); /* Building hint data */
173
The following loop cycles through all of the records in chunks of size maxhint
which sets the maximum size of the hint set. nhint is the actual size of the hint
set; typically it is maxhint, but if maxhint does not divide nrec, then the last time
through the loop, nhint is set to the appropriate size.
for (k = 0; k < nrec; k+=maxhint)
{
if (k + maxhint > nrec) nhint = nrec - k;
else
nhint = maxhint;
The following code segment is the first of two inner loops. The next record
number, nrn, is generated randomly. Based on the seek offset, soff is calculated.
The purpose of this loop is to post hints for each record in the hint set.
Immediately following the loop, we declare (i.e., issue) the first hint.
for (i = 0; i < nhint; i++)
{
nrn = (int)((float)nrec * rand());
/* 0.0 <= rand() < 1.0 */
soff = (offset_t)nrn * (offset_t)bsz;
gmgh_post_hint(p, soff, bsz, i, 1);
}
gmgh_declare_1st_hint(p);
The next inner loop actually writes the data using gmgh_xfer() after generating
the data. make_data() generates the data to be written and uses the seek offset
to be sure it generates the right data. Since the records are posted before the
data is generated, it is important that correct data is placed in the buffer before
writing it. But generating the data after posting it may not always be possible due
to the difficulty of correlating the seek offset with the data (the seek offset may
need to be recalculated). In that case, the application program can generate,
save, and post the records in one loop and write them in the next loop using
gmgh_xfer(). This requires extra memory (but no significant extra time) and
allows the hints to be issued.
Also, remember that the hints are issued, re-issued, and released as part of the
activities of gmgh_xfer().
for (i = 0; i < nhint; i++)
{
make_data(soff, buf);
gmgh_xfer(p, buf, i);
}
}
174
Benchmark results
Table 8-6 compares the benchmark codes of MAR hints in use versus MAR hints
not in use. These are the I/O intensive versions of the benchmarks. Similar
results hold for mixed CPU and I/O benchmarks. As can be readily seen from
inspecting this table, the use of hints significantly improves the performance of
the read benchmarks. What is puzzling is that performance is degraded for all of
the write benchmarks except for the medium record size case (and here the
improvement is marginal). The authors have tested parallel versions of these
same benchmark programs on SP systems with VSD servers using small record
random writes and the performance increased threefold.
Table 8-6 Using and not using MAR hints
Record sizePattern
Write - NO
hints
Write - using
hints
Read - NO
hints
Read - using
hints
L-Random
64.40 MB/s
47.20 MB/s
29.80 MB/s
60.60 MB/s
M-Random
52.60 MB/s
56.00 MB/s
13.20 MB/s
55.20 MB/s
S-Random
6.23 MB/s
4.92 MB/s
1.39 MB/s
8.93 MB/s
In general, it is fair to conclude that random I/O access patterns are not as
efficient as other access methods. Moreover, the relationships between the
various solution strategies are somewhat perplexing. If a random pattern must be
adopted in an application, it is recommended that careful benchmarking be done
to assess its performance relative to these various strategies before
implementing a definitive algorithm for use in production.
175
Write tests are first launched on two, three and four nodes, followed by read tests
on two, three and four nodes. Figures from single node tests are included in the
table for comparison. In addition, tests mixing reads and writes are launched. In
the first case, there is one write test launched on one node and one read test
launched on another node. The test is repeated except that there are two write
tests and two read tests launched with one test per node. Perusing this table, the
reader sees that the I/O rate per node goes down, but more importantly that the
aggregate I/O rate increases as the number of nodes increases (n.b., it drops off
slightly on four nodes). Thus, more total work is being done.
1
Write-Test
Aggregate Rate
Rate on node 1
Rate on node 2
Rate on node 3
Rate on node 4
Harmonic Mean
Read-Test
Aggregate Rate
Rate on node 1
Rate on node 2
Rate on node 3
Rate on node 4
Harmonic Mean
Mixed Read/Write Test
Aggregate Rate
Rate on node 1
Rate on node 2
Rate on node 3
Rate on node 4
Harmonic Mean
Number of nodes
2
3
66.97
66.97
105.36
52.74
52.68
120.10
41.12
40.03
40.14
66.97
52.71
40.42
75.91
75.91
103.65
51.82
52.27
114.89
38.30
39.58
39.05
75.91
52.04
38.97
write->
read->
92.11
66.00
46.05
54.26
write->
write->
read->
read->
4
110.90
28.77
27.73
28.69
28.88
28.51
109.14
27.37
27.56
27.28
27.51
27.43
117.11
41.04
41.77
29.33
29.28
34.32
A similar tests were conducted using large (e.g., 1024 KB) and small (e.g., 16
KB) record random I/O access patterns and similar results were observed; the
I/O rate per node goes down, but the aggregate I/O rate increases as the number
of nodes increases.
176
8.7.1 iostat
The iostat command is a useful command for monitoring I/O and CPU activity. It
generates statistical reports at designated time intervals for designated devices
and lists them on stdout. Since the focus of this book is on GPFS, we shall
examine iostat regarding its disk I/O monitoring features.
The format of the iostat command for disk I/O is:
# iostat -d [physical volume] [interval [count] ]
The -d option specifies that a disk utilization report is generated. Each of the
parameters in brackets are optional, but if missing have default values.
[physical volume]
This is a space separated list of physical volumes. If it is omitted, then all
volumes are monitored.
[interval [count] ]
interval specifies a time interval in seconds between reports; if it is omitted,
then only one report is generated, but if it is included, count can be given to
specify how many reports are generated. If interval is specified, but count is
not, the reports are generated indefinitely.
The statistics are reported in five columns for the -d option. The first report is the
cumulative calculation of each of these statistics since the system was last
rebooted. Each subsequent report is over the most recent time interval.
177
In the following example, each report consists of three rows, one for each listed
physical volume. The first report is the cumulative measurement of each statistic
since the system was last rebooted (n.b., the system used in this example had
been recently rebooted). The second report was generated during a write
benchmark and the third report during a read benchmark. Care should be taken
interpreting these results. For example, I/O rate (Kbps) is for one physical
volume only; it is not an aggregate I/O rate.
host1t:/> iostat -d hdisk3 hdisk4 hdisk5 60 3
Disks:
% tm_act
Kbps
tps
Kb_read
hdisk4
0.2
13.6
0.2
695227
hdisk3
0.2
13.5
0.2
691926
hdisk5
0.2
13.5
0.2
692550
hdisk4
46.3
4376.5
34.2
0
hdisk3
44.9
4376.5
34.2
0
hdisk5
45.9
4376.5
34.2
0
hdisk4
47.9
4555.7
35.6
45568
hdisk3
43.6
4581.3
35.8
45824
hdisk5
47.1
4606.8
36.0
46080
Kb_wrtn
1684270
1673121
1672352
43776
43776
43776
0
0
0
8.7.2 filemon
filemon is another useful command for monitoring I/O activity. Unlike the iostat
command which only examines file I/O from the perspective of physical volumes,
filemon monitors the performance of the file system on behalf of logical files,
virtual memory segments, logical volumes and physical volumes. In its normal
mode, filemon runs in the background while application programs and system
commands are being executed and monitored. It starts collecting data as soon
as it is launched and stops once directed to by the trcstop command. There are
many alternatives for using filemon, but we restrict this discussion to a limited
number of them.
The format of filemon for the options we are considering is:
# filemon [-u] [-o File] [-O levels]
Each of the parameters in brackets are optional, but if missing, have default
values.
[-u]
Collect statistics on files opened prior to launching the filemon command.
[-o File]
Redirect output to the named file. If omitted, output is sent to stdout.
[-O levels]
Collect statistics for the designated file system levels. The options are
178
After this, the first file in the Detailed File Stats section is listed.
These samples, from the fmon.out file, are enough to provide you with the flavor
of the information provided in this helpful report. You should experiment with the
other options.
host1t:/> less fmon.out
Wed Feb 28 20:15:20 2001
System: AIX host1t Node: 4 Machine: 000B4A7D4C00
Cpu utilization: 11.6%
Most Active Files
-----------------------------------------------------------------------
179
#MBs #opns
#rds
#wrs file
volume:inode
----------------------------------------------------------------------4096.0
2
8192
8192 fmeg
/dev/gpfs1:90129
18.5
0
592
0 pid=19976_fd=5
1.0
0
253
94 pid=24450_fd=4
0.5
0
137
0 pid=14124_fd=0
0.5
0
131
51 pid=14726_fd=4
-----------------------------------------------------------------------Detailed File Stats
-----------------------------------------------------------------------FILE: /gpfs1/fmeg volume: /dev/gpfs1 inode: 90129
opens:
2
total bytes xfrd:
4294967296
reads:
8192
(0 errs)
read sizes (bytes):
avg 262144.0 min 262144 max 262144 sdev
0.0
read times (msec):
avg
3.353 min
2.930 max 34.183 sdev
0.554
writes:
8192
(0 errs)
write sizes (bytes): avg 262144.0 min 262144 max 262144 sdev
0.0
write times (msec):
avg
3.648 min
3.166 max 244.538 sdev
5.472
lseeks:
16384
180
For example, suppose a file system has been configured to have 1024 GB of
storage and 960 GB is already occupied; there is only 64 GB of free space.
Suppose job1 pre-allocates 50 GB of storage and actually writes 10 GB of data
to the file. Doing an ls -l after job1 completes shows the file size to be 50 GB
while a du -k shows its real size to be a little over 10 GB (remember, du also
includes a files indirect blocks; see Example 2-1, Sparse file on page 22).
Now suppose job2 begins, and it writes 50 GB of data to disk. The 50 GB of
virtual space job1 allocated plus the 50 GB of space job2 actually use exceeds
the 64 GB of free space that was available before job1 began. The reason that
this can happen is that only 10 GB real space was used by job1. That means
between job1 and job2, 60 GB of real space was used and now only 4 GB of file
space is actually available in this file system. If job1 resumes activity and tries to
use its remaining 40 GB of virtual space, the file system will run out of space.
This example illustrates an inconvenience associated with pre-allocating file
space. In spite of this, many shops find that the time savings of file pre-allocation
outweigh this inconvenience; they simply adjust their storage policies to handle
this situation.
Continuing with this example (but assuming there is an abundance of file space),
suppose a job reads a record of virtual file space that has not actually had
anything written to it. Then the record is returned containing all zeros; this is a
normal situation and not an error5. On the other hand, suppose a program tried
to read a record beyond the end of the virtual space. This generates an error with
the errno value EBADF (Bad file descriptor).
But caution is urged when reading a record that contains both virtual and real space. The authors have occasionally
encountered unexpected results in this situation.
6 The _LARGE_FILES flag should not be confused with the open() system call flag O_LARGEFILE; this latter flag is used
to open a large file when a file system like NFS or JFS may not be configured to support files exceeding 2 GB by default.
The O_LARGEFILE flag doesnt effect GPFS since it is always configured to support files greater than 2 GB.
181
182
Chapter 9.
Problem determination
This chapter is designed to assist with the problem determination in a GPFS for
HACMP Cluster environment. It is intended to compliment the product specific
guides for this environment, namely:
GPFS for AIX: Problem Determination Guide, GA22-7434
HACMP/ES 4.4 for AIX: Installation and Administration Guide, Vol.2,
SC23-4306
HACMP/ES 4.4 for AIX: Troubleshooting Guide, SC23-4280
This chapter will review:
Logs available for HACMP and GPFS
Group services (grpsvcs)
Topology services (topsvcs)
Disk related errors
varyonvg related problems
Definition problems
SSA Disk Fencing
183
Internode communications
Checking interfaces defined to HACMP
.rhost files
Testing rsh/rcp for correct setup
184
There are other logs specific to the other HACMP/ES subsystems including
group services and topology services. These are detailed in the aforementioned
HACMP manuals.
185
The GPFS daemon is not ready to accept commands until the message GPFS:
6027-300 mmfsd ready has been issued.
Note: The proper operation of both GPFS and HACMP depend on having
space available for their logs files. The /usr, /var, and /tmp filesystems should
be monitored to be sure that they do not become full.
Also note that the hagsd daemon is not active on this node:
(17:16:31) c185n01:/ # ps -ef|grep grpsvcs|grep -v grep
(17:19:48) c185n01:/ #
186
The lack of group services means that HACMP is not active on this node. Start
HACMP and run the same series of commands. How to start HACMP is covered
in Chapter 5, Configuring HACMP/ES on page 71.
With HACMP active on a node, the hagsd daemon is also active:
(17:29:49) c185n01:/ # ps -ef|grep grpsvcs|grep -v grep
root 12532 7820
0 17:22:46
- 0:00 hagsd grpsvcs
(17:31:07) c185n01:/ #
Now that group services is active, the "lssrc -ls grpsvcs" command shows the
activity of the groups that are subscribed to group services, in this case, HACMP.
(17:29:42) c185n01:/ # lssrc -ls grpsvcs
Subsystem
Group
PID
Status
grpsvcs
grpsvcs
12532
active
3 locally-connected clients. Their PIDs:
13946(hagsglsmd) 18770(haemd) 15244(clstrmgr)
HA Group Services domain information:
Domain established by node 5
Number of groups known locally: 3
Number of
Number of local
Group name
providers
providers/subscribers
ha_em_peers
8
1
0
CLRESMGRD_15
8
1
0
CLSTRMGR_15
8
1
0
(17:29:49) c185n01:/ #
The above display from the lssrc -ls grpsvcs command shows that this node
is a "provider" of services, one of eight in this cluster.
The above display shows a state which is ready to support the operation of a
GPFS cluster.
187
Sun
Sun
Sun
Sun
Sun
Apr
Apr
Apr
Apr
Apr
8
8
8
8
8
18:11:07
18:12:07
18:13:08
18:14:08
18:15:08
EDT
EDT
EDT
EDT
EDT
2001
2001
2001
2001
2001
runmmfs:
runmmfs:
runmmfs:
runmmfs:
runmmfs:
6027-1242
6027-1242
6027-1242
6027-1242
6027-1242
GPFS
GPFS
GPFS
GPFS
GPFS
is
is
is
is
is
waiting
waiting
waiting
waiting
waiting
for
for
for
for
for
grpsvcs
grpsvcs
grpsvcs
grpsvcs
grpsvcs
After the group services daemon is started, then GPFS will continue with it's
initialization. An excerpt from the /var/adm/ras/mmfs.log.latest shows this
behavior:
(18:11:16) c185n01:/ # tail -f /var/adm/ras/*latest
Sun Apr 8 18:11:06 EDT 2001 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Loading modules from /usr/lpp/mmfs/bin
GPFS: 6027-506 /usr/lpp/mmfs/bin/mmfskxload: /usr/lpp/mmfs/bin/mmfs
loaded at 90670232.
Sun Apr 8 18:11:07 EDT 2001 runmmfs: 6027-1242 GPFS is waiting for
Sun Apr 8 18:12:07 EDT 2001 runmmfs: 6027-1242 GPFS is waiting for
Sun Apr 8 18:13:08 EDT 2001 runmmfs: 6027-1242 GPFS is waiting for
Sun Apr 8 18:14:08 EDT 2001 runmmfs: 6027-1242 GPFS is waiting for
Sun Apr 8 18:15:08 EDT 2001 runmmfs: 6027-1242 GPFS is waiting for
Sun Apr 8 18:16:03 2001: GPFS: 6027-310 mmfsd initializing.
Sun Apr 8 18:16:03 2001: GPFS: 6027-1531 useSPSecurity no
Sun Apr 8 18:16:04 2001: GPFS: 6027-841 Cluster type: 'HACMP'
Sun Apr 8 18:16:07 2001: Using TCP communication protocol
Sun Apr 8 18:16:07 2001: GPFS: 6027-1710 Connecting to 9.114.121.2
Sun Apr 8 18:16:07 2001: GPFS: 6027-1711 Connected to 9.114.121.2
Sun Apr 8 18:16:07 EDT 2001 /var/mmfs/etc/gpfsready invoked
Sun Apr 8 18:16:07 2001: GPFS: 6027-300 mmfsd ready
is already
grpsvcs
grpsvcs
grpsvcs
grpsvcs
grpsvcs
188
The above status show the network GBether that has eight members and this
nodes state (the "St" field) is S.
189
Ipkts Ierrs
625423
625423
497283
497283
439176
439176
439176
0
0
0
0
0
0
0
Opkts Oerrs
616640
616640
589281
589281
442219
442219
442219
Coll
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Note the asterisk (*) next to the interface en1 marking it down.
In addition to topology services changing states in response to en1, GPFS has
also been informed by its membership that en1 has gone down. Here is the mmfs
log showing this state change:
Sun Apr 8 18:56:00
Sun Apr 8 18:56:31
down the daemon.
Sun Apr 8 18:56:31
Services.
Sun Apr 8 18:56:31
abnormally.
Sun Apr 8 18:56:31
190
The process of defining volume groups, defining logical volumes, getting the
volume groups imported to all the nodes in the GPFS cluster, setting the volume
group "AUTO ON" parameter is a complicated, but critical procedure for the
proper operation of a GPFS cluster environment.
Chapter 6, Configuring GPFS and SSA disks on page 91 discusses how to
define volume groups and logical volumes and how to use importvg to import
these volume groups to the rest of the nodes in the GPFS cluster. The problems
that may be encountered if these procedures are not followed exactly will be
discussed In the coming sections, as well as procedures that can be used to
verify that all the disks, volume groups, and logical volumes are in the proper
state for GPFS usage.
191
If this parameter is not properly set, then the default behavior of the AUTO ON
parameter is to vary on the volume group at boot time. In the case where one or
more nodes have this parameter improperly set, then one node will get the
volume group varied online and all the other nodes will have errors trying to read
the volume group.
Typical errors that will be seen will include:
(21:34:00) c185n01:/ # varyonvg -u gpfsvg16
PV Status:
hdisk66 000167498707c169
PVNOTFND
0516-013 varyonvg: The volume group cannot be varied on because
there are no good copies of the descriptor area.
This is typical of the node where this varyonvg command is running from being
locked out from a disk that is in an improper vary on state on another node.
There is an SSA utility command called ssa_rescheck that can assist with this
problem state.
(21:34:30) c185n01:/ # ssa_rescheck -l hdisk66
Disk
Primary Secondary Adapter Primary Secondary Reserved
Adapter Adapter
In Use
Access Access
to
hdisk66 ssa0
====
ssa0
Busy
====
c185n02
192
VG IDENTIFIER:
PP SIZE:
TOTAL PPs:
FREE PPs:
USED PPs:
QUORUM:
0002025690caf44f
16 megabyte(s)
268 (4288 megabytes)
0 (0 megabytes)
268 (4288 megabytes)
2
TOTAL PVs:
STALE PVs:
ACTIVE PVs:
Concurrent:
VG Mode:
MAX PPs per PV:
1
0
1
Capable
Non-Concurrent
1016
VG DESCRIPTORS: 2
STALE PPs:
0
AUTO ON:
no
Auto-Concurrent: Disabled
MAX PVs:
32
Here the setting AUTO ON is set to no which is the proper setting for volume
groups in the GPFS for HACMP clusters environment.
The lsvg command needs to have the volume group online before it returns
information including the setting of the AUTO ON setting for that volume group.
Another command that can be used to determine the AUTO ON setting for a
volume group without that volume group having to be online is the getlvodm
command.
getlvodm -u volume_group_name returns a y or an n depending on the setting of
AUTO ON.
Here is a sample script that will return all the settings of all volume groups on one
node whose names include the string gpfs.
#!/usr/bin/ksh
#
for vg in $(lsvg |grep gpfs)
do
auton=$(getlvodm -u $vg)
echo VG: $vg AutoOn: $auton
done
Here is an example of the above script on a node within a cluster, assuming the
above script is called chkautovg:
(00:33:29) c185n01:/ # chkautovg
VG: gpfsvg0 AutoOn: n
VG: gpfsvg1 AutoOn: n
VG: gpfsvg2 AutoOn: n
VG: gpfsvg3 AutoOn: n
VG: gpfsvg5 AutoOn: n
VG: gpfsvg6 AutoOn: n
VG: gpfsvg7 AutoOn: n
VG: gpfsvg8 AutoOn: n
VG: gpfsvg9 AutoOn: n
VG: gpfsvg10 AutoOn: n
VG: gpfsvg11 AutoOn: n
VG: gpfsvg12 AutoOn: n
193
This would have to be run on each node within the cluster and the grep
parameters may have to be adjusted according to the naming standards of the
GPFS vgs in any individual cluster.
194
In order to prevent this situation from occurring, all logical volumes in the GPFS
for HACMP clusters environment need to have Bad Block relocation turned off
(The BB POLICY must be set to non-relocatable.).
To check the setting of Bad Block relocation use the lslv command:
(21:52:08) c185n02:/ # lslv gpfslv16
LOGICAL VOLUME:
gpfslv16
LV IDENTIFIER:
0002025690caf44f.1
VG STATE:
active/complete
TYPE:
mmfsha
MAX LPs:
268
COPIES:
1
LPs:
268
STALE PPs:
0
INTER-POLICY:
minimum
INTRA-POLICY:
middle
MOUNT POINT:
N/A
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV ?: yes
VOLUME GROUP:
PERMISSION:
LV STATE:
WRITE VERIFY:
PP SIZE:
SCHED POLICY:
PPs:
BB POLICY:
RELOCATABLE:
UPPER BOUND:
LABEL:
gpfsvg16
read/write
closed/syncd
off
16 megabyte(s)
parallel
268
non-relocatable
yes
32
None
195
7 c185n07
8 c185n08
This shows an eight node cluster; the nodes are numbered from one to eight.
The value of the ssa router object ssar is determined by the lsattr command:
(22:20:52) c185n01:/ # gdsh
c185en01: node_number 1 SSA
c185en02: node_number 2 SSA
c185en03: node_number 3 SSA
c185en04: node_number 4 SSA
c185en05: node_number 5 SSA
c185en06: node_number 6 SSA
c185en07: node_number 7 SSA
c185en08: node_number 8 SSA
True
True
True
True
True
True
True
True
This shows how the ssar node_number attribute has been set to be equal to the
node number on each node in the cluster. The procedure to change the ssar
node_number is the chdev command:
chdev -l ssar -a node_number=X
The device ssar already configured the child devices and cannot be changed
with the chdev command. It must be changed with the HACMP command,
set_fenceid:
/usr/es/sbin/cluster/utilities/set_fence_id -l gpfslv16 1
196
In this case, the ssar node_number is incorrectly set to nine while it should be set
to one, to match this node's HACMP node number:
(22:38:50) c185n01:/ # clhandle
1 c185n01
197
The same message is issued to the user if the target node is missing its .rhosts
file, or if the .rhosts file exists but is missing the entry of the issuing node.
198
This will run the date command on each node in the nodeset. This command
should be repeated on each node within the cluster. Any failures will indicate that
there is a problem with the .rhost file on the node returning the error.
199
200
Appendix A.
201
SSA commands
To translate (or map) pdisk to hdisk we used ssaxlate -l pdisk0 and hdisk to
pdisk was ssaxlate -l hdisk3. This command has limited applications in our
case but is useful for some verification.
diag
Task Selection(Diagnostics, Advanced Diagnostics, Service Aids, etc.)
SSA Service Aids
202
diag
Task Selection(Diagnostics, Advanced Diagnostics, Service Aids, etc.)
SSA Service Aids
Configuration Verification
host1t:pdisk0
host1t:pdisk1
host1t:pdisk2
host1t:pdisk3
host1t:pdisk4
host1t:pdisk5
host1t:pdisk6
host1t:pdisk7
host1t:pdisk8
host1t:pdisk9
host1t:pdisk10
host1t:pdisk11
host1t:pdisk12
host1t:pdisk13
host1t:pdisk14
host1t:pdisk15
host1t:hdisk3
host1t:hdisk4
host1t:hdisk5
host1t:hdisk6
host1t:hdisk7
host1t:hdisk8
host1t:hdisk9
host1t:hdisk10
host1t:hdisk11
host1t:hdisk12
host1t:hdisk13
host1t:hdisk14
host1t:hdisk15
host1t:hdisk16
host1t:hdisk17
host1t:hdisk18
29CD5754
29CD697E
29CD6980
29CD6A3B
29CD6ACA
29CD6B0D
29CD6C12
29CD6D07
29CD6D0A
29CD6E19
29CD6EDB
29CD6EE1
29CD70BD
29CD70C5
29CD731D
29CD7372
29CD5754
29CD697E
29CD6980
29CD6A3B
29CD6ACA
29CD6B0D
29CD6C12
29CD6D07
29CD6D0A
29CD6E19
29CD6EDB
29CD6EE1
29CD70BD
29CD70C5
29CD731D
29CD7372
diag
Task Selection(Diagnostics, Advanced Diagnostics, Service Aids, etc.)
SSA Service Aids
Link Verification
203
host1t:ssa0 10-78
Physical
Serial#
host1t:pdisk14
host1t:pdisk11
host1t:pdisk12
host1t:pdisk13
host1t:pdisk10
host1t:pdisk9
host1t:pdisk1
host1t:pdisk2
host4t:ssa0:A
host3t:ssa0:A
host2t:ssa0:A
host1t:pdisk15
host1t:pdisk3
host1t:pdisk5
host1t:pdisk4
host1t:pdisk7
host1t:pdisk0
host1t:pdisk8
host1t:pdisk6
host4t:ssa0:B
host3t:ssa0:B
host2t:ssa0:B
29CD731D
29CD6EE1
29CD70BD
29CD70C5
29CD6EDB
29CD6E19
29CD697E
29CD6980
29CD7372
29CD6A3B
29CD6B0D
29CD6ACA
29CD6D07
29CD5754
29CD6D0A
29CD6C12
Adapter Port
A1 A2 B1
0 10
1
9
2
8
3
7
4
6
5
5
6
4
7
3
8
2
9
1
10
0
0
1
2
3
4
5
6
7
8
9
10
B2
Status
Good
Good
Good
Good
Good
Good
Good
Good
10
9
8
7
6
5
4
3
2
1
0
Good
Good
Good
Good
Good
Good
Good
Good
diag
Task Selection(Diagnostics, Advanced Diagnostics, Service Aids, etc.)
SSA Service Aids
Physical Link Verification
host1t:ssa0 10-78
IBM SSA 160 SerialRAID Adapter (
Link
40-I
40-I
40-I
40-I
40-I
40-I
204
Port
>>
Device
host1t:pdisk14
host1t:pdisk11
host1t:pdisk12
host1t:pdisk13
host1t:pdisk10
host1t:pdisk9
Location
4C09-01
4C09-02
4C09-03
4C09-04
4C09-05
4C09-06
Port
||
>>
Link
40-C
40-I
40-I
40-I
40-I
40-I
40-I
40-C
40-C
40-C
40-C
40-C
40-I
40-I
40-I
40-I
40-I
40-I
40-I
40-C
40-C
40-C
40-C
40-C
||
A2
A2
A2
A2
>>
||
B2
B2
B2
B2
host1t:pdisk1
host1t:pdisk2
host4t:ssa0
host3t:ssa0
host2t:ssa0
host1t:ssa0
host1t:ssa0
host1t:pdisk15
host1t:pdisk3
host1t:pdisk5
host1t:pdisk4
host1t:pdisk7
host1t:pdisk0
host1t:pdisk8
host1t:pdisk6
host4t:ssa0
host3t:ssa0
host2t:ssa0
host1t:ssa0
4C09-07
4C09-08
A1
A1
A1
4C09-09
4C09-10
4C09-11
4C09-12
4C09-13
4C09-14
4C09-15
4C09-16
B1
||
>>
B1
B1
B1
40-I
40-I
40-C
40-C
40-C
40-C
40-C
40-I
40-I
40-I
40-I
40-I
40-I
40-I
40-C
40-C
40-C
205
206
Appendix B.
Distributed software
installation
After installing the base AIX operating system onto our four nodes using CDs
and applying the program temporary fixes (PTFs), we had to choose the method
to add HACMP and GPFS filesets to our systems. We chose the same method
many SP administrators use, which is bffcreate. Instead of moving from node to
node and inserting various CDs, we created an installable image of the software
on an NFS mounted disk that all nodes could access. The added advantage of
this method is once we know the installp syntax we want to use, we can send
that command simultaneously to all nodes we want updated. We created the
installable image on host1t which serves the NFS mounted filesystem
/tools/images. All nodes in our cluster had to be able to communicate to each
other and the NFS directory had with be executable from any node.
207
/dev/cd0
[sysmgt.websm
[/tools/images]
[/tmp]
no
yes
> +
+
+
host1t:/tools/images/> rm .toc
host1t:/tools/images/> inutoc .
host1t:/> rm smit.script
host1t:/> rm smit.log
host1t:/tools/images/> smitty install
->Install and Update Software
208
[/tools/images]
/tools/images
[sysmgt.websm
no
yes
no
yes
yes
no
no
no
yes
> +
+
+
+
+
+
+
+
+
+
host1t:/>
host1t:
Manager
host2t:
Manager
host3t:
Manager
host4t:
Manager
Web-based System
sysmgt.websm.rte
4.3.3.0
COMMITTED
Web-based System
sysmgt.websm.rte
4.3.3.0
COMMITTED
Web-based System
sysmgt.websm.rte
4.3.3.0
COMMITTED
Web-based System
209
210
Appendix C.
211
gdsh
This tool is available to download as described in Appendix H, Additional
material on page 259. It is to be used at your own risk.
gdsh
-v means verbose mode, extra output available during execution
-a is enable entire cluster
-w allows specific nodes to be targeted for commands
-e specifically excludes specific nodes from the stated command
We had to create a file under /root called all.nodes on every host machine which
was a complete list of all the nodes we might want to select for remote command
activity. We used the token ring addresses since we tried to reserve the ethernet
for GPFS specific processes.
#!/usr/bin/perl
#
#
use Getopt::Std;
getopts("vae:w:", \%option);
if ($option{v} ) {
212
EST
EST
EST
EST
2001
2001
2001
2001
elsif ($option{w} ) {
# option w
@include_nodes=split(/\,/, $option{w} );
foreach $node (@include_nodes) {
$node_ctr++;
$node_name[$node_ctr] = $node;
213
if ($option{v} ) {
print "Current working collective:\n";
foreach $node (@node_name) {
print "$node\n";
}
$i=$#ARGV +1;
print "Number parms is: $i \n";
}
if ( $#ARGV >=0 ) {
$i=0;
while ( $i <= $#ARGV ) {
$parm_string = $parm_string . " " . $ARGV[$i] ;
$i++;
}
if ($option{v} ) {
print "The complete parm string is: $parm_string\n";
}
} else {
print "Usage: gdsh -ave node1,node2 cmds \n";
print " where:\n";
print "
-a -> all nodes in hacmp cluster (as reported by clhandle
-a)\n";
print "
-v -> verbose mode\n";
print "
-e -> nodes to exclude in comma separated list\n";
print "
-w -> nodes to include in comma separated list\n";
print " Command will use /all.nodes as working collective\n";
exit(1);
}
214
215
216
Appendix D.
Useful scripts
This appendix describes some useful scripts that you may use at your own risk.
These scripts are provided as suggestions and not the only way to perform any
specific action. To obtain a copy of the source code of any of these scripts, see
Appendix H, Additional material on page 259 for details.
217
This script will create a "driver" file for the next script. This
is the script where we describe the disks to be used by
GPFS and provide the volume group and logical volume
naming standards.
mkgpfsvg
This script will create the actual volume group and logical
volume on the disk and set the "AUTO ON" volume group
property. After this script has completed all the volume
groups and logical volumes will be known to one of the
nodes in the cluster.
importvgs.to.newtail This script will be run from the node where the volume
groups and logical volumes have been defined. Its
function will be to import these LVM disk definitions to the
other nodes in the cluster.
These scripts follow the same principles described in Chapter 6, Configuring
GPFS and SSA disks on page 91 but provide some ideas on how this task can
be further automated.
The same process is followed using these three scripts:
The volume groups and logical volumes are defined completely on one node
in the cluster.
The volume groups are then imported to all the other nodes in the cluster.
These scripts will be described in some detail and an example will be given. As
these scripts are general in nature, they will have to be modified for each use.
The first script, mkvgdriver, will output a series of commands that will in turn call
the mkgpfsvg script. The disks that will be used are coded in this first script.
Example: D-1 mkvgdriver script
#!/usr/bin/ksh
#
# script will create a driver file for the mkgpfsvg script.
# this will list all SSA disks that have successfully configured
# hdisks and create a mkgpfsvg command for that disk.
#
vg_prefix="gpfsvg"
lv_prefix="gpfslv"
#
start_count=0 # where to start the volume group count from
218
#
(( disk_count = $start_count ))
#
# get all Available pdisk names
for pdisk in $( lsdev -Ccpdisk -S Available -F name )
do
hdisk=$( ssaxlate -l $pdisk ) # get the hdisk to be sure its configured
if [[ -n $hdisk ]] then
vgname=$( lspv|grep -w $hdisk | awk '{print $3}' ) # not other vg
if [[ $vgname = "None" ]] then
echo mkgpfsvg $hdisk ${vg_prefix}${disk_count}
${lv_prefix}${disk_count}
(( disk_count += 1 ))
else
echo Disk $hdisk belongs to vg $vgname. Skipping
fi
else
echo Error, no logical hdisk for pdisk $pdisk. Skipping disk.
fi
done
In this example of mkvgdriver, all the SSA disks will be used. The volume groups
will be called gpfsvgXX and the logical volumes will be called gpfslvXX, where
XX is a counter starting at zero and incriminating by one for each disk.
This script can be customized in many ways. For instance, if only the SSA disks
in a specific enclosure were to be included, the lsdev -Ccpdisk -S Available -F
name could be changed to list only them.
To only use the SSA disks in enclosure 30-58-P, the command would be:
(00:37:52) c185n01:/ # lsdev -Ccpdisk|grep 30-58-P
pdisk0 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk1 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk2 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk3 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk4 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk5 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk6 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk7 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk8 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk9 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk10 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk11 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk12 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk13 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk14 Available 30-58-P 4GB SSA C Physical Disk Drive
pdisk15 Available 30-58-P 4GB SSA C Physical Disk Drive
219
'{print $1}'
By piping the lsdev command into different grep commands this script can be
very flexible.
The output of the mkvgdriver script will be a list of mkgpfsvg commands. These
should be saved in a file to be run in the next step.
(00:41:52) c185n01:/u/gmcpheet/hacmp/scripts # running mkvgdriver
mkgpfsvg hdisk2 gpfsvg2 gpfslv2
mkgpfsvg hdisk3 gpfsvg3 gpfslv3
mkgpfsvg hdisk4 gpfsvg4 gpfslv4
mkgpfsvg hdisk5 gpfsvg5 gpfslv5
mkgpfsvg hdisk6 gpfsvg6 gpfslv6
mkgpfsvg hdisk7 gpfsvg7 gpfslv7
mkgpfsvg hdisk8 gpfsvg8 gpfslv8
mkgpfsvg hdisk9 gpfsvg9 gpfslv9
mkgpfsvg hdisk10 gpfsvg10 gpfslv10
mkgpfsvg hdisk11 gpfsvg11 gpfslv11
mkgpfsvg hdisk12 gpfsvg12 gpfslv12
mkgpfsvg hdisk13 gpfsvg13 gpfslv13
mkgpfsvg hdisk14 gpfsvg14 gpfslv14
mkgpfsvg hdisk15 gpfsvg15 gpfslv15
mkgpfsvg hdisk16 gpfsvg16 gpfslv16
mkgpfsvg hdisk17 gpfsvg17 gpfslv17
220
The mkgpfsvg script is the script that creates the volume groups and logical
volumes.
Example: D-2 mkgpfsvg script
#!/usr/bin/ksh
#
#
# script to assist with the definitions of gpfsvg and lvs
#
#
#
# this is the proper set up for 4.5 GB SSA disks
#
size_pps=16
# pp size
#
#num_pps=542
#9.1 GB at 16MB
num_pps=268
#4.5GB at 16MB
#
if [[ $# -ne 3 ]] then
echo Incorrect number of parms.
echo Usage is $0 hdisk vgname lvname
exit 1
fi
#
hdisk=$1
vgname=$2
lvname=$3
#
echo
echo Now creating a volume group $vgname with lv $lvname on hdisk $hdisk
#
# verify passed disk is SSA
#
pdisk=$(ssaxlate -l $hdisk)
rc=$?
if [[ $rc != 0 ]] then
echo hdisk $hdisk is not a SSA disk.
exit 1
fi
#
# verify vg does not already exist
#
vgcount=$( lspv | grep -w $vgname | wc -l )
if [[ $vgcount -ne 0 ]] then
echo A volume group with name already exists.
exit 1
fi
#
# verify lv does not already exist
221
#
lvcount=$( /usr/sbin/getlvodm -l $lvname > /dev/null 2>&1 )
rc=$?
if [[ $rc = 0 ]] then
echo There is a LV with the name $lvname.
exit 1
fi
#
# Issue the mkvg command
#
mkvg -f -n -s $size_pps -c -y $vgname $hdisk
rc=$?
if [[ $rc != 0 ]] then
echo Error creating vg $vgname on $hdisk. rc=$rc
exit 1
fi
#
varyonvg $vgname
rc=$?
if [[ $rc != 0 ]] then
echo Error varying on vg $vgname. rc=$rc
exit 1
fi
#
mklv -b n -w n -t mmfsha -x $num_pps -y $lvname $vgname $num_pps
rc=$?
if [[ $rc != 0 ]] then
echo Error making lv $lvname on vg $vgname. rc=$rc
exit 1
else
echo Successfully created lv $lvname on $vgname
fi
#
varyoffvg $vgname
rc=$?
if [[ $rc != 0 ]] then
echo Error varying off vg $vgname. rc=$rc
exit 1
fi
#
echo Successfully completed creation of $vgname containing $lvname on disk
$hdisk
This script runs the commands which are the output from the mkvgdriver script.
This script is also very flexible and this example demonstrates this. In this script
the PP size and number of PPs to fit on one SSA disk must be coded. These are
the parameters num_pps and size_pps.
222
223
else
echo $disk which is pvid $pvid vg $vg is known as vg $remote_vgname on
$remote_hdisk on $remote_node.
if [[ $remote_vgname = "None" ]] then
echo Running importvg -y $vg $remote_hdisk on $remote_node
rsh $remote_node -n importvg -y $vg $remote_hdisk
echo chvg -a n $vg
rsh $remote_node -n chvg -a n $vg
echo varyoffvg $vg
rsh $remote_node -n varyoffvg $vg
else
echo $disk pvid $pvid vg $remote_vgname on $remote_node has a volume
group. Not processed.
fi
fi
#
done
This script is run on the node where all of the volume groups have already been
defined. This script expects one parameter, that of the node where the disks are
to be imported.
For example, if the node c185n08 has all the SSA disks defined, and node
c185n01 needs them imported, then the command is issued on node c185n08
with the parameter of c185n01.
Before the command is issued, the disks should look like this on node c185n01:
hdisk2
hdisk3
hdisk4
hdisk5
hdisk6
hdisk7
hdisk8
hdisk9
hdisk10
hdisk11
hdisk12
hdisk13
hdisk14
hdisk15
hdisk16
hdisk17
000167498707c169
00001351566acb07
00001351566ae6aa
00001351566b0f35
0000099017a37f8c
00001351566b378d
00001351566b4520
00001351566b52f3
00001351566b60a7
00001351566b6e63
00001351566b7be6
0002025690cec645
00001351566b8977
00001351566c1e3d
00001351566c2bea
00001351566c6231
gpfsvg16
gpfsvg17
gpfsvg18
gpfsvg19
gpfsvg20
gpfsvg21
gpfsvg22
gpfsvg23
gpfsvg24
gpfsvg25
gpfsvg26
gpfsvg27
gpfsvg28
gpfsvg29
gpfsvg30
gpfsvg31
224
000167498707c169
00001351566acb07
None
None
hdisk4
hdisk5
hdisk6
hdisk7
hdisk8
hdisk9
hdisk10
hdisk11
hdisk12
hdisk13
hdisk14
hdisk15
hdisk16
hdisk17
00001351566ae6aa
00001351566b0f35
0000099017a37f8c
00001351566b378d
00001351566b4520
00001351566b52f3
00001351566b60a7
00001351566b6e63
00001351566b7be6
0002025690cec645
00001351566b8977
00001351566c1e3d
00001351566c2bea
00001351566c6231
None
None
None
None
None
None
None
None
None
None
None
None
None
None
225
000167498707c169
00001351566acb07
00001351566ae6aa
00001351566b0f35
0000099017a37f8c
00001351566b378d
00001351566b4520
00001351566b52f3
00001351566b60a7
00001351566b6e63
00001351566b7be6
0002025690cec645
00001351566b8977
00001351566c1e3d
00001351566c2bea
00001351566c6231
gpfsvg16
gpfsvg17
gpfsvg18
gpfsvg19
gpfsvg20
gpfsvg21
gpfsvg22
gpfsvg23
gpfsvg24
gpfsvg25
gpfsvg26
gpfsvg27
gpfsvg28
gpfsvg29
gpfsvg30
gpfsvg31
This script would then be called for all the nodes in the cluster, for example:
importvgs.to.newtail
importvgs.to.newtail
importvgs.to.newtail
importvgs.to.newtail
c185n01
c185n02
c185n03
c185n04
This would import the volume group definitions known on the node where it is run
to nodes c185n01/c185n02/c185n03 and c185n04.
comppvid
This script is used to compare the volume groups to the PVIDs on each node to
make sure they are in sync. This assures the administrator that each PVID is
known on each node by only one volume group name.
226
#!/usr/bin/ksh
# a script to compare what volume groups were created on each node
#
for i in `lspv |grep -v rootvg | awk '{print $2}'`
do
gdsh -a "lspv | grep $i"
echo "***************************************"
done
host1t:/tools/ralph> comppvid
host1t: hdisk2
000b4a7d1075fdbf
***************************************
host1t: hdisk3
000007024db58359
host2t: hdisk3
000007024db58359
host3t: hdisk2
000007024db58359
host4t: hdisk2
000007024db58359
***************************************
host1t: hdisk4
000007024db5472e
host2t: hdisk4
000007024db5472e
host3t: hdisk3
000007024db5472e
host4t: hdisk3
000007024db5472e
***************************************
host1t: hdisk5
000007024db54fb4
host2t: hdisk5
000007024db54fb4
host3t: hdisk4
000007024db54fb4
host4t: hdisk4
000007024db54fb4
***************************************
host1t: hdisk6
000007024db5608a
host2t: hdisk6
000007024db5608a
host3t: hdisk5
000007024db5608a
host4t: hdisk5
000007024db5608a
***************************************
host1t: hdisk7
000007024db571ba
host2t: hdisk7
000007024db571ba
host3t: hdisk6
000007024db571ba
host4t: hdisk6
000007024db571ba
***************************************
host1t: hdisk8
000007024db5692c
host2t: hdisk8
000007024db5692c
host3t: hdisk7
000007024db5692c
host4t: hdisk7
000007024db5692c
***************************************
host1t: hdisk9
000158511eb0f296
host2t: hdisk9
000158511eb0f296
host3t: hdisk8
000158511eb0f296
host4t: hdisk8
000158511eb0f296
toolsvg
gpfsvg0
gpfsvg0
gpfsvg0
gpfsvg0
gpfsvg1
gpfsvg1
gpfsvg1
gpfsvg1
gpfsvg2
gpfsvg2
gpfsvg2
gpfsvg2
gpfsvg3
gpfsvg3
gpfsvg3
gpfsvg3
gpfsvg4
gpfsvg4
gpfsvg4
gpfsvg4
gpfsvg5
gpfsvg5
gpfsvg5
gpfsvg5
gpfsvg6
gpfsvg6
gpfsvg6
gpfsvg6
227
***************************************
host1t: hdisk10
000007024db57a49
host2t: hdisk10
000007024db57a49
host3t: hdisk9
000007024db57a49
host4t: hdisk9
000007024db57a49
***************************************
host1t: hdisk11
000007024db58bd3
host2t: hdisk11
000007024db58bd3
host3t: hdisk10
000007024db58bd3
host4t: hdisk10
000007024db58bd3
***************************************
host1t: hdisk12
000007024db53eac
host2t: hdisk12
000007024db53eac
host3t: hdisk11
000007024db53eac
host4t: hdisk11
000007024db53eac
***************************************
host1t: hdisk13
000007024db5361d
host2t: hdisk13
000007024db5361d
host3t: hdisk12
000007024db5361d
host4t: hdisk12
000007024db5361d
***************************************
host1t: hdisk14
000007024db51c4b
host2t: hdisk14
000007024db51c4b
host3t: hdisk13
000007024db51c4b
host4t: hdisk13
000007024db51c4b
***************************************
host1t: hdisk15
000007024db524ce
host2t: hdisk15
000007024db524ce
host3t: hdisk14
000007024db524ce
host4t: hdisk14
000007024db524ce
***************************************
host1t: hdisk16
000007024db52d7b
host2t: hdisk16
000007024db52d7b
host3t: hdisk15
000007024db52d7b
host4t: hdisk15
000007024db52d7b
***************************************
host1t: hdisk17
000007024db513d2
host2t: hdisk17
000007024db513d2
host3t: hdisk16
000007024db513d2
host4t: hdisk16
000007024db513d2
***************************************
host1t: hdisk18
000007024db55810
host2t: hdisk18
000007024db55810
host3t: hdisk17
000007024db55810
host4t: hdisk17
000007024db55810
***************************************
228
gpfsvg7
gpfsvg7
gpfsvg7
gpfsvg7
gpfsvg8
gpfsvg8
gpfsvg8
gpfsvg8
gpfsvg9
gpfsvg9
gpfsvg9
gpfsvg9
gpfsvg10
gpfsvg10
gpfsvg10
gpfsvg10
gpfsvg11
gpfsvg11
gpfsvg11
gpfsvg11
gpfsvg12
gpfsvg12
gpfsvg12
gpfsvg12
gpfsvg13
gpfsvg13
gpfsvg13
gpfsvg13
gpfsvg14
gpfsvg14
gpfsvg14
gpfsvg14
gpfsvg15
gpfsvg15
gpfsvg15
gpfsvg15
Appendix E.
229
Subsystems of HACMP/ES
The following is a listing of the subsystems of HACMP/ES. The corresponding
name in an SP environment appears in brackets.
Name
Subsystem
Group
Daemon
Topology Services
(High Availability Topology Services,
HATS)
topsvcs
(hats)
topsvcs
/usr/sbin/rsct/bin/hatsd
Group Services
(High Availability Group Services,
HAGS)
grpsvcs
(hags)
grpsvcs
/usr/sbin/rsct/bin/hagsd
grpglsm
grpsvcs
/usr/sbin/rsct/bin/hagsglsmd
Event Management
(High Availability Event
Management, HAEM)
emsvcs
emsvcs
/usr/sbin/rsct/bin/haemd
emaixos
emsvcs
/usr/sbin/rsct/bin/emaixos
clstrmgrES
cluster
/usr/es/sbin/cluster/clstrmgrES
clsmuxpdES
cluster
/usr/es/sbin/cluster/clsmuxpdES
clinfoES
cluster
/usr/es/sbin/cluster/clinfoES
cllockdES
lock
/usr/es/sbin/cluster/cllockdES
230
Topology Services
topsvcs.DD.HHMMSS.name in a non SP environment
hats.DD.HHMMSS.name in a SP environment
DD
day of the month
HHMMSS
hours, minutes, seconds of the day
name
HACMP/ES cluster name
Group Services
grpsvcs_nodenum_instnum.name in a non SP environement
hats_nodenum_instnum.name in a SP environment
nodenum
node number, as determined by clhandle, ...
instnum
instance number of daemon
name
HACMP/ES cluster name
Working directories
/var/ha/run contains a directory for each domain of a RSCT subsystem. Each
directory contains the core files (find out how many here)
In a non SP environment those are
topsvcs.name
grpsvcs.name
emsvcs.name
name is the HACMP/ES cluster name.
In an SP environment those are
hats.syspar_name
hags.syspar_name
haem.syspar_name
syspar.name is the name of the system partition
231
clsmuxpdES
/usr/es/adm/cluster.log
clinfoES
/tmp contains log files
clinfo.rc.out
clinfo.rc.out.n
where n is the file
Appendix F.
Summary of commands
The following is a compilation of some of the commands we used while working
with HACMP and GPFS. We created this list to serve the wide range of
experience levels among those managing GPFS. The list is not definitive, and
we suggest that you refer to the man pages for the exact syntax of each
command.
GPFS commands
mmaddcluster
mmadddisk
mmaddnode
mmchattr
mmchcluster
mmchconfig
mmchdisk
mmcheckquota
mmchfs
mmconfig
mmcrcluster
mmcrfs
mmdefragfs
mmdelcluster
mmdeldisk
mmdelfs
mmdelnode
mmfsadm cleanup
mmfsadm shutdown
mmfsck
mmlsattr
mmlscluster
mmlsdisk
mmlsquota
mmquotaoff
mmquotaon
mmrestripefs
mmrpldisk
mmshow_fence
mmstartup
mmshutdown
SSA commands
ssa_speed
ssaadap
ssacand
ssaconn
ssadisk
ssaidentify
ssaxlate
AIX commands
bffcreate
cfgmgr
exportvg
filemon
importvg
installp
iostat
lsattr
lscfg
lslpp
lslv
lssrc
mklv
mkvg
netstat
odmget
oslevel
rmdev
setclock
varyoffvg
varyonvg
vmstat
HACMP commands
claddnode
cldare
clhandle
clstat
clstop
rc.cluster
Appendix G.
The benchmark programs used in this redbook include:
   ibm_sgr
   ibm_sgw
   ibm_shr
   ibm_shw
Each program accepts the following parameters (the descriptions follow the
parameter summary printed by the programs themselves):
   <prog_name>   Program name
   <path_file>   Path and name of the file to read or write (reported as the
                 base file name)
   <rec_size>    Record (buffer) size in bytes
   <num_rec>     Number of records to transfer
   <crunch>      Whether to simulate number crunching (yes or no)
   <order>       Processing order (for example, strd or rndm in the runs below)
   <stride>      Specifies the stride for the <order> options strd and bkst.
The following example illustrates the use of one of these programs. It specifies
no for the number crunching option. By taking the difference between the I/O
time and overall time, CPU time devoted to non-I/O tasks can be measured
implicitly; with number crunching turned off, the processing time is negligible.
host1t:/> ibm_sgw /gpfs1/L.strd 1048576 5120 no strd 4
------------------------------------------
JOB: ibm_sgw
User Parameter Summary
------------------------------------------
base file name            = /gpfs1/L.strd
buffer size               = 1048576
number of records         = 5120
simulate number crunching = no
processing order          = strd
stride                    = 4
------------------------------------------
summary statistics
------------------------------------------
data processed     = 5120.0 MB
I/O time           = 77.573 sec
overall time       = 77.637 sec
Amortized I/O rate = 65.948 MB/s
------------------------------------------
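In this run the difference between the overall time and the I/O time,
77.637 - 77.573 = 0.064 seconds, is the time spent outside of I/O; with number
crunching turned off it is indeed negligible.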
The following example illustrates the use of another one of these programs. It
specifies yes for the number crunching option. In this case, CPU time devoted to
non-I/O activities is explicitly measured as crunch time. By tuning the number
crunching loop, the ratio of I/O time to crunch time can be altered.
host1t:/> ibm_shr /gpfs1/Mc.rndm 262144 20480 yes rndm
------------------------------------------
JOB: ibm_shr
User Parameter Summary
------------------------------------------
base file name            = /gpfs1/Mc.rndm
buffer size               = 262144
number of records         = 20480
simulate number crunching = yes
processing order          = rndm
------------------------------------------
summary statistics
------------------------------------------
data processed     = 5120.0 MB
crunch time        = 98.748 sec
I/O time           = 87.532 sec
overall time       = 186.670 sec
Amortized I/O rate = 27.428 MB/s
------------------------------------------
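Note that the amortized I/O rate appears to be computed over the overall time
rather than the I/O time alone: 5120.0 MB / 186.670 sec is approximately
27.4 MB/s, which matches the reported 27.428 MB/s, whereas 5120.0 MB /
87.532 sec would give roughly 58.5 MB/s. The crunch time (98.748 sec) plus the
I/O time (87.532 sec) accounts for nearly all of the overall time.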
rtc() can be replaced with the gettimeofday() system call, and the -lxlf90 flag
may be removed from the makefile xlc lines. By comparing Example G-1 and
Example G-2, the reader can easily see how to replace rtc() with gettimeofday()
in the benchmark codes, if necessary.
Example: G-1 Using rtc()
#include <stdio.h>
#include <time.h>

double rtc();

int main()
{
    int    k;
    double xsum = 0.0;
    double bg, dn, delta;

    bg = rtc();
    for (k = 0; k < 10000000; k++)
        xsum += (double)k;
    dn = rtc();

    delta = dn - bg;
    printf("delta = %10.6lf sec\n", delta);
    return 0;
}
Example: G-2 Using gettimeofday()
#include <stdio.h>
#include <sys/time.h>

int main()
{
    int    k;
    double xsum = 0.0;
    double t1, t2;
    double delta;
    struct timeval bg, dn;

    gettimeofday(&bg, NULL);
    for (k = 0; k < 10000000; k++)
        xsum += (double)k;
    gettimeofday(&dn, NULL);

    /* convert time to seconds, then calculate the difference */
    t1 = (double)bg.tv_sec + (double)bg.tv_usec/1000000.0;
    t2 = (double)dn.tv_sec + (double)dn.tv_usec/1000000.0;
    delta = t2 - t1;
    printf("delta = %10.6lf\n", delta);
    return 0;
}
gmgh.c
/*********************************************************************/
/* Title  - Generic Middle-layer GPFS Hint package                   */
/* Module - gmgh.c                                                   */
/* Envir. - GPFS 1.4, VAC 5.0.1, AIX 4.3.3                           */
/* Last Mod                                                          */
/*                                                                   */
/* ABSTRACT:                                                         */
/*   This is a generic middle layer code encapsulating the GPFS      */
/*   multiple access hint facility.  The native GPFS interface is    */
/*   overly tedious for use in high level application codes.  This   */
/*   code creates a simpler to use interface for high level programs.*/
/*                                                                   */
/*   The use of the multiple access hint facility in GPFS 1.3 and    */
/*   higher can significantly improve I/O performance of programs    */
/*   whose I/O access patterns are random.  To use this code, the    */
/*   high level programmer must post up to MAXHINT hints in advance, */
/*   followed later by the data transfer (i.e., read or write)       */
/*   operations.  The comment headers for each function describe the */
/*   use of this code in detail.  See also ibm_phw.c and ibm_phr.c   */
/*   for examples of how to use this code.                           */
/*                                                                   */
/* FUNCTIONS AND PARAMETERS:                                         */
/*   gmgh_init_hint(p, fd, maxbsz, maxhint)                          */
/*   gmgh_post_hint(p, soff, nbytes, nth, isWrite)                   */
/*   gmgh_declare_1st_hint(p)                                        */
/*   gmgh_xfer(p, buf, nth)                                          */
/*********************************************************************/
/***************************** FUNCTION ******************************/
/* Parameters:                                                        */
/*   (I,O) gmgh *p   - gmgh structure.  The caller provides an        */
/*                     uninstantiated pointer; this function          */
/*                     instantiates it by mallocating memory for      */
/*                     it and several substructures as well as        */
/*                     initializing some fields in the structure.     */
/*   (O)   p->blklst - mallocate block list                           */
/*   (O)   p->fd     - initialize file descriptor                     */
/*   (O)   p->gbsz   - initialize GPFS block size                     */
/*   (O)   p->hint   - mallocate hint vector                          */
/*   (O)   p->nbleh  - initialize max number of block list entries    */
/*                     per hint                                       */
/*   (O)   p->nhve   - initialize max number of hint vector entries   */
/*   (I)   fd        - file descriptor                                */
/*   (I)   maxbsz    - max buffer size                                */
/*   (I)   maxhint   - max entries in hint vector                     */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_init_hint
(
    gmgh *p,       /* gmgh structure              */
    int   fd,      /* file descriptor             */
    int   maxbsz,  /* max buffer size             */
    int   maxhint  /* max entries in hint vector  */
)
{
    struct stat64  sbuf;
    gmgh_hint_t   *hvec;
    blklst_t      *blst;

    /*-----------------------------------------------*/
    /* Set a few parameters needed for calculations. */
    /*-----------------------------------------------*/
    p->fd = fd;
    if (fstat(p->fd, &sbuf) != 0)
    {
        printf("*** ERROR *** fstat error: errno = %d\n", errno);
        return -1;
    }
    p->gbsz  = sbuf.st_blksize;
    p->nbleh = maxbsz / p->gbsz;    /* validated below */
    if (maxbsz % p->gbsz > 0) p->nbleh++;
    p->nhve  = maxhint;             /* validated below */

    /*-------------------------------------------------------------------*/
    /* The following limits are artificial.  Practically speaking, they  */
    /* are quite large and merely are trying to prevent wasting space if */
    /* not carefully defined.                                            */
    /*-------------------------------------------------------------------*/
    if (p->nbleh > MAXBLK)
    {
        printf("*** ERROR *** Hint buffersize too big.  See gmgh.h.\n");
        return -1;
    }
    if (p->nhve > MAXHINT)
    {
        printf("*** ERROR *** Too many hints requested.  See gmgh.h.\n");
        return -1;
    }

    /*---------------------------------------------*/
    /* "Mallocate" memory for the hint structures. */
    /*---------------------------------------------*/
    if (!(hvec = (gmgh_hint_t*)malloc(p->nhve * sizeof(gmgh_hint_t))))
    {
        printf("*** ERROR *** malloc error: errno = %d\n", errno);
        return -1;
    }
    p->hint = hvec;
    if (!(blst = (blklst_t*)malloc(p->nhve * p->nbleh * sizeof(blklst_t))))
    {
        printf("*** ERROR *** malloc error: errno = %d\n", errno);
        return -1;
    }
    p->blklst = blst;
    return 0;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   This function is called once for each record to be posted in a   */
/*   set of hints.  It tells gmgh which records are to be accessed.   */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_post_hint
(
    gmgh  *p,       /* gmgh structure                             */
    off_t  soff,    /* starting offset of the record in the file  */
    int    nbytes,  /* number of bytes in the record              */
    int    nth,     /* nth record in hint vector                  */
    int    isWrite  /* boolean; 0 = read, 1 = write               */
)
{
    int status = 0;

    /* Put the record in the hint vector.  (Reconstructed from        */
    /* fragments; the mapping of the assignments to fields is         */
    /* assumed.)                                                      */
    p->hint[nth].soff           = soff;
    p->hint[nth].len            = nbytes;
    p->hint[nth].lstblkreleased = 0;
    p->hint[nth].isWrite        = isWrite;

    p->UBnblks += p->nbleh;    /* upper bound on number of used blocks */

    /* Assumed: expand the record into its GPFS block list. */
    if (gmgh_gen_blk(p, nth, isWrite) <= 0)
        status = -1;

    return status;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   Calling this function forces GPFS to prefetch as many of the     */
/*   records posted by gmgh_post_hint() as possible as well as        */
/*   resetting a few necessary GPFS structures and gmgh parameters.   */
/*   Records that can not be prefetched now will be prefetched        */
/*   automatically (if possible) when gmgh_xfer() is called.  This is */
/*   done automatically.                                              */
/*                                                                    */
/*   Call this function after the last call to gmgh_post_hint() for   */
/*   the hint set and prior to the first call to gmgh_xfer() for the  */
/*   hint set.  Failure to call this function in this manner will     */
/*   result in poor performance, without warning messages.            */
/* Parameters:                                                        */
/*   (I) gmgh *p - gmgh structure.                                    */
/*   (I) p->fd   - initialize file descriptor                         */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_declare_1st_hint
(
    gmgh *p   /* gmgh structure */
)
{
    int k;

    gmgh_cancel_hint(p->fd);
    k = gmgh_issue_hint(p, -1);
    return k;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   Will either read or write the nth record in a hint set as        */
/*   specified by gmgh_post_hint() and will prefetch as many records  */
/*   in advance of future "xfers" as possible.  This function should  */
/*   be called once for each record in a hint set and each record     */
/*   should be "xfered" in the same order it was posted.              */
/* Parameters:                                                        */
/*   (I)   gmgh *p  - gmgh structure.                                 */
/*   (I)   p->fd    - initialize file descriptor                      */
/*   (I)   p->hint  - hint vector                                     */
/*   (I,O) buf      - caller provided data buffer                     */
/*   (I)   nth      - nth record in hint vector                       */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_xfer
(
    gmgh *p,    /* gmgh structure                                  */
    char *buf,  /* externally provided data buffer                 */
    int   nth   /* nth record in hint vector; 0 <= nth < p->nhve   */
)
{
    int nb;
    int status = 0;

    /* Here is where the record is actually accessed.  (The transfer  */
    /* statements below are an assumed reconstruction; the original   */
    /* lines were lost in extraction.)                                */
    if (lseek64(p->fd, p->hint[nth].soff, SEEK_SET) < 0)
        status = -1;
    if (p->hint[nth].isWrite)
        nb = write(p->fd, buf, p->hint[nth].len);
    else
        nb = read(p->fd, buf, p->hint[nth].len);
    if (nb != p->hint[nth].len)
        status = -1;

    /* Assumed: release this record's hint and issue hints for        */
    /* records to be accessed in the future.                          */
    if (gmgh_issue_hint(p, nth) != 0)
        status = -1;

    return status;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   For each record to be read or written, calculate the number of   */
/*   GPFS blocks touched and prepare a "block list" description for   */
/*   each block required for the GPFS_MULTIPLE_ACCESS_RANGE hint.     */
/*   Note that each entry in the gmgh hint vector corresponds to one  */
/*   record in a read or write access and that this function will     */
/*   process only one entry at a time.                                */
/*                                                                    */
/*   This is a "private" function and should only be called in gmgh.  */
/* Parameters:                                                        */
/*   (I)   gmgh *p     - gmgh structure.                              */
/*   (O)   p->blklst   - add block(s) to block list                   */
/*   (I)   p->fd       - file descriptor                              */
/*   (I)   p->gbsz     - initialize GPFS block size                   */
/*   (I,O) p->hint     - hint vector                                  */
/*   (I)   p->nbleh    - max number of block list entries per hint    */
/*   (I)   int nth     - nth record in hint vector                    */
/*   (I)   int isWrite - boolean; 0 = read, 1 = write                 */
/* Return:                                                            */
/*   Upon success, the number of blocks touched is returned and       */
/*   should be greater than 0.  Returning a value <= 0 indicates an   */
/*   error occurred.                                                  */
/*********************************************************************/
int gmgh_gen_blk
(
    gmgh *p,       /* gmgh structure                 */
    int   nth,     /* nth record in hint vector      */
    int   isWrite  /* boolean; 0 = read, 1 = write   */
)
{
    int    i;             /* loop variables             */
    int    nbt;           /* number of blocks touched   */
    int    blkidx;        /* block index                */
    int    disp;          /* displacement               */
    off_t  bn;            /* block number               */
    int    fblen, lblen;  /* first/last blocks length   */

    /*------------------------------*/
    /* How many blocks are touched? */
    /*------------------------------*/
    /* (assumed reconstruction; the original calculation was lost)    */
    nbt = (int)((p->hint[nth].soff + (off_t)p->hint[nth].len - 1) / (off_t)p->gbsz
              -  p->hint[nth].soff / (off_t)p->gbsz) + 1;
    /* if the record touches more than p->nbleh blocks, ignore extra blocks */
    if (nbt > p->nbleh) nbt = p->nbleh;
    p->hint[nth].nblkstouched = nbt;

    /*---------------------------------------*/
    /* Calculate parameters for first block. */
    /*---------------------------------------*/
    /* displacement to record in first GPFS block */
    disp = (int)(p->hint[nth].soff % (off_t)p->gbsz);
    /* first GPFS block touched by record */
    bn = p->hint[nth].soff / (off_t)p->gbsz;
    /* lengths of the record portions in the first and last blocks */
    if (nbt == 1)
        fblen = p->hint[nth].len;
    else
        fblen = p->gbsz - disp;
    lblen = (disp + p->hint[nth].len) - ((nbt - 1) * p->gbsz);

    /*----------------------------------------*/
    /* Set parameters for each block touched. */
    /*----------------------------------------*/
    /*** get the first blocks index ***/
    blkidx = nth * p->nbleh;
    p->hint[nth].fblkidx = blkidx;

    /*** do first block outside loop since its a little different ***/
    p->blklst[blkidx].blkoff  = disp;
    p->blklst[blkidx].blknum  = bn;
    p->blklst[blkidx].blklen  = fblen;
    p->blklst[blkidx].isWrite = isWrite;

    /* Assumed reconstruction: the loop filling the remaining blocks  */
    /* touched by the record was lost in extraction.                  */
    for (i = 1; i < nbt; i++)
    {
        blkidx++;
        p->blklst[blkidx].blkoff  = 0;
        p->blklst[blkidx].blknum  = bn + i;
        p->blklst[blkidx].blklen  = (i == nbt - 1) ? lblen : p->gbsz;
        p->blklst[blkidx].isWrite = isWrite;
    }

    /*** mark the unused block list entries for this hint as empty ***/
    for (i = nbt; i < p->nbleh; i++)
    {
        blkidx = p->hint[nth].fblkidx + i;
        p->blklst[blkidx].blkoff  = -1;
        p->blklst[blkidx].blknum  = 0;
        p->blklst[blkidx].blklen  = 0;
        p->blklst[blkidx].isWrite = 0;
    }
    return nbt;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   This function will issue the GPFS_MULTIPLE_ACCESS_RANGE hints    */
/*   via gpfs_fcntl().  It will take the hints as expanded in         */
/*   p->blklst and cancel the hints for the issued GPFS blocks        */
/*   corresponding to the nth entry in p->hint (i.e., the nth record  */
/*   in the hint set, which is presumably the latest record accessed).*/
/*   In the same call, it will also issue as many new hints from      */
/*   p->blklst as possible (in sets of GPFS_MAX_RANGE_COUNT).  Once   */
/*   no more new hints are accepted, it will stop issuing new hints   */
/*   and will re-issue any new hints that were not accepted in the    */
/*   last set.  This function should be called once for each record   */
/*   (as specified by the parameter nth), just after it has been      */
/*   accessed, and once before the first record in the hint set is    */
/*   accessed (in this case, nth = -1).                               */
/*                                                                    */
/*   This is a "private" function and should only be called in gmgh.  */
/*                                                                    */
/*   WARNING: This code is subtle and prone to "off by one" bugs.     */
/*   Use caution when modifying it.                                   */
/*                                                                    */
/* Parameters:                                                        */
/*   (I)   gmgh *p          - gmgh structure.                         */
/*   (I)   p->blklst        - block list                              */
/*   (I)   p->fd            - file descriptor                         */
/*   (I)   p->hint          - hint vector                             */
/*   (I)   p->nbleh         - max number of block list entries per hint */
/*   (I)   p->nhve          - max number of hint vector entries       */
/*   (I,O) p->nxtblktoissue - next block to issue                     */
/*   (I)   p->UBnblks       - upper bound on number of used blocks    */
/*   (I)   int nth          - nth record in hint vector               */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_issue_hint
(
    gmgh *p,   /* gmgh structure                             */
    int   nth  /* hint index to release; -1 <= nth < p->nhve */
)
{
    int k, kk;             /* loop variables                          */
    int rem;               /* remember a value                        */
    int accDone, relDone;  /* while loop conditions                   */
    int nhntacc;           /* number of hints accepted                */
    int rbx;               /* block list index for released blocks    */
    int ibx;               /* block list index for issued blocks      */
    struct
    {
        gpfsFcntlHeader_t         hdr;
        gpfsMultipleAccessRange_t marh;
    } ghint;

    /*-----------------------*/
    /* Initialization stuff. */
    /*-----------------------*/
    /* (Assumed reconstruction: the header initialization and the     */
    /* start of the release section were lost in extraction.)         */
    ghint.hdr.totalLength   = sizeof(ghint);
    ghint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    ghint.hdr.fcntlReserved = 0;
    ghint.marh.structLen    = sizeof(ghint.marh);
    ghint.marh.structType   = GPFS_MULTIPLE_ACCESS_RANGE;
    accDone = FALSE;
    relDone = FALSE;
    rbx     = 0;

    if (nth >= 0)   /* assumed guard: nothing to release on the first call */
    {
        /*** release the blocks of the record just accessed ***/
        rbx = p->hint[nth].fblkidx + p->hint[nth].lstblkreleased;
        k   = 0;
        while (k < GPFS_MAX_RANGE_COUNT &&
               p->hint[nth].lstblkreleased < p->hint[nth].nblkstouched)
        {
            ghint.marh.relRangeArray[k].blockNumber = p->blklst[rbx].blknum;
            ghint.marh.relRangeArray[k].start       = p->blklst[rbx].blkoff;
            ghint.marh.relRangeArray[k].length      = p->blklst[rbx].blklen;
            ghint.marh.relRangeArray[k].isWrite     = p->blklst[rbx].isWrite;
            rbx++;
            k++;
            p->hint[nth].lstblkreleased++;
        }
        ghint.marh.relRangeCnt = k;
        /*** in case p->hint[nth].nblkstouched % GPFS_MAX_RANGE_COUNT == 0 ***/
        if (p->hint[nth].lstblkreleased >= p->hint[nth].nblkstouched)
            relDone = TRUE;
        /*** if we get behind, dont issue hints for released records ***/
        if (p->nxtblktoissue <= rbx)
            p->nxtblktoissue = p->nbleh * (nth + 1);
    }
    else
    {
        ghint.marh.relRangeCnt = 0;
        relDone = TRUE;
    }

    /*** prepare data structure to issue hints for future accesses ***/
    rem = p->nxtblktoissue;
    ibx = p->nxtblktoissue;
    k   = 0;
    /* prepare the list of hint blocks to be issued */
    while (!accDone                 &&
           k < GPFS_MAX_RANGE_COUNT &&
           ibx < p->UBnblks)
    {
        if (p->blklst[ibx].blkoff >= 0)   /* Is the next entry OK? */
        {
            ghint.marh.accRangeArray[k].blockNumber = p->blklst[ibx].blknum;
            ghint.marh.accRangeArray[k].start       = p->blklst[ibx].blkoff;
            ghint.marh.accRangeArray[k].length      = p->blklst[ibx].blklen;
            ghint.marh.accRangeArray[k].isWrite     = p->blklst[ibx].isWrite;
            ibx++;
            k++;
        }
        else
            ibx++;
    }
    ghint.marh.accRangeCnt = k;
    kk                     = k;
    p->nxtblktoissue       = ibx;

    /*** issue the releases and the new hints in a single call ***/
    if (gpfs_fcntl(p->fd, &ghint) != 0)
    {
        printf("*** ERROR *** gpfs_fcntl error: errno = %d\n", errno);
        return -1;
    }
    else
        nhntacc = ghint.marh.accRangeCnt;

#ifdef DEBUG
    /* Prints a record of the released and issued blocks, which is    */
    /* helpful in learning how this code works.                       */
    for (k = 0; k < ghint.marh.relRangeCnt; k++)
        printf("R --> k = %d, blockNumber =%lld, start =%d, length =%d\n", k,
               ghint.marh.relRangeArray[k].blockNumber,
               ghint.marh.relRangeArray[k].start,
               ghint.marh.relRangeArray[k].length);
    printf("\n");
#endif

    /*** figure out which hints were not accepted ***/
    /* (assumed reconstruction from surviving fragments)              */
    if (nhntacc < kk)
    {
        ibx = rem - 1;
        k   = 0;
        while (k < nhntacc)
        {
            if (p->blklst[++ibx].blkoff < 0)
                continue;
            k++;
        }
        p->nxtblktoissue = ibx + 1;
    }
    return 0;
}
/***************************** FUNCTION ******************************/
/* Purpose:                                                           */
/*   Remove any hints that have been issued against the open file     */
/*   p->fd.  This restores the hint status to what it was when the    */
/*   file was first opened, but it does not alter the status of the   */
/*   GPFS cache.                                                      */
/*                                                                    */
/*   This is a "private" function and should only be called in gmgh.  */
/*                                                                    */
/* Parameters:                                                        */
/*   (I) gmgh *p          - gmgh structure.                           */
/*   (I) p->blklst        - block list                                */
/*   (I) p->fd            - file descriptor                           */
/*   (I) p->hint          - hint vector                               */
/*   (I) p->nbleh         - max number of block list entries per hint */
/*   (I) p->nhve          - max number of hint vector entries         */
/*   (O) p->nxtblktoissue - next block to issue                       */
/*   (I) p->UBnblks       - upper bound on number of used blocks      */
/*   (I) int nth          - nth record in hint vector                 */
/* Return:                                                            */
/*   Upon success, 0 will be returned, otherwise -1 is returned.      */
/*********************************************************************/
int gmgh_cancel_hint(int fd)   /* file descriptor */
{
    struct
    {
        gpfsFcntlHeader_t hdr;
        gpfsCancelHints_t cancel;
    } cancel;

    cancel.hdr.totalLength   = sizeof(cancel);
    cancel.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    cancel.hdr.fcntlReserved = 0;
    cancel.cancel.structLen  = sizeof(gpfsCancelHints_t);
    cancel.cancel.structType = GPFS_CANCEL_HINTS;

    /* Assumed: the gpfs_fcntl() call that submits the cancel request */
    /* was lost in extraction.                                        */
    if (gpfs_fcntl(fd, &cancel) != 0)
    {
        printf("*** ERROR *** gpfs_fcntl error: errno = %d\n", errno);
        return -1;
    }
    return 0;
}
gmgh.h
/*********************************************************************/
/* Title  - Generic Middle-layer GPFS Hint package                   */
/* Module - gmgh.h                                                   */
/*                                                                   */
/* See comments in gmgh.c for details.                               */
/*********************************************************************/
#ifndef GMGH_H
#define GMGH_H

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <gpfs_fcntl.h>

#define FALSE 0
#define TRUE  1

#define GMGH_HINT_READ  0
#define GMGH_HINT_WRITE 1

#define MAXBLK  256
#define MAXHINT 128

/* Here is the block list structure. */
typedef struct
{
    int    blkoff;   /* offset of the record fragment within the block */
    off_t  blknum;   /* GPFS block number                              */
    int    blklen;   /* length of the record fragment within the block */
    int    isWrite;  /* boolean; 0 = read, 1 = write                   */
} blklst_t;

/* hint vector entry -- one per record */
typedef struct
{
    off_t  soff;            /* starting offset of the record        */
    int    len;             /* record length                        */
    int    nblkstouched;    /* number of blocks touched             */
    int    lstblkreleased;  /* last block released                  */
    int    isWrite;         /* boolean; 0 = read, 1 = write         */
    int    fblkidx;         /* first block index in the block list  */
} gmgh_hint_t;

/* Here is the general gmgh structure, which contains the hint vector */
/* and the block list.                                                */
typedef struct
{
    int          fd;             /* file descriptor                             */
    int          gbsz;           /* GPFS block size                             */
    int          nbleh;          /* max number of block list entries per hint   */
    int          nhve;           /* max number of hint vector entries           */
    int          UBnblks;        /* upper bound on number of used blocks        */
    gmgh_hint_t *hint;           /* hint vector -- 1 entry per record           */
    int          nxtblktoissue;  /* next block to issue as hint in block list   */
    blklst_t    *blklst;         /* block list -- each hint has 1 or more blocks */
} gmgh;

/* function prototypes (argument lists reconstructed to match the     */
/* definitions in gmgh.c)                                             */
int gmgh_init_hint(gmgh *p, int fd, int maxbsz, int maxhint);
int gmgh_post_hint(gmgh *p, off_t soff, int nbytes, int nth, int isWrite);
int gmgh_declare_1st_hint(gmgh *p);
int gmgh_xfer(gmgh *p, char *buf, int nth);
int gmgh_gen_blk(gmgh *p, int nth, int isWrite);
int gmgh_issue_hint(gmgh *p, int nth);
int gmgh_cancel_hint(int fd);

#endif
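The following is a minimal sketch of how the gmgh package is intended to be
driven. It is not taken from the ibm_phr.c or ibm_phw.c examples mentioned in
the gmgh.c header; it assumes the function prototypes reconstructed above, and
the file name, record size, record count, and offsets are arbitrary
illustrative values. Hints are posted for a whole set of records, declared
once, and then each record is transferred in the order it was posted.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include "gmgh.h"

#define NREC   64        /* illustrative number of records in one hint set */
#define RECSZ  262144    /* illustrative record size in bytes */

int main()
{
    gmgh   g;
    char  *buf = malloc(RECSZ);
    int    fd, n;
    off_t  offsets[NREC];                       /* offsets chosen by the application */

    fd = open("/gpfs1/example.dat", O_RDONLY);  /* hypothetical GPFS file */
    if (fd < 0 || buf == NULL) return 1;

    for (n = 0; n < NREC; n++)                  /* a scrambled, record-aligned order */
        offsets[n] = (off_t)((n * 37) % NREC) * RECSZ;

    if (gmgh_init_hint(&g, fd, RECSZ, NREC) != 0) return 1;

    for (n = 0; n < NREC; n++)                  /* post every record in the hint set */
        gmgh_post_hint(&g, offsets[n], RECSZ, n, GMGH_HINT_READ);

    gmgh_declare_1st_hint(&g);                  /* start prefetching before the first access */

    for (n = 0; n < NREC; n++)                  /* transfer records in the posted order */
        gmgh_xfer(&g, buf, n);

    close(fd);
    free(buf);
    return 0;
}

As the comment headers stress, gmgh_declare_1st_hint() must be called after the
last gmgh_post_hint() of a hint set and before the first gmgh_xfer(), and the
records must be transferred in the same order in which they were posted.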
Appendix H.
Additional material
This redbook refers to additional material that can be downloaded from the
Internet as described below.
Select the Additional materials and open the directory that corresponds with
the redbook form number, SG24-6035.
The additional Web material includes the following files:
   README.ibm_seq
   ibm_seq.tar
   README.ibm_par
   ibm_par.tar
   gpfs.utilities.tar
Downloading and extracting the files requires approximately 0.5 MB of disk
space on a UNIX system.
Related publications
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this redbook.
IBM Redbooks
For information on ordering these publications, see "How to get IBM Redbooks"
on page 263.
Exploiting HACMP 4.4: Enhancing the Capabilities of Cluster
Multi-Processing, SG24-5979
GPFS: A Parallel File System, SG24-5165
Monitoring and Managing IBM SSA Disk Subsystems, SG24-5251
A Practical Guide to Serial Storage Architecture for AIX, SG24-4599
PSSP Version 3 Survival Guide, SG24-5344
RSCT Group Services: Programming Cluster Applications, SG24-5523
Sizing and Tuning GPFS, SG24-5610
Understanding SSA Subsystems in Your Environment, SG24-5750
Other resources
These publications are also relevant as further information sources:
7133 SSA Subsystem: Hardware Technical Information, SA33-3261
7133 SSA Subsystem: Operator Guide, GA33-3259
7133 Models 010 and 020 SSA Disk Subsystems: Installation Guide,
GA33-3260
7133 Models 500 and 600 SSA Disk Subsystems: Installation Guide,
GA33-3263
7133 SSA Disk Subsystem: Service Guide, SY33-0185
GPFS for AIX: Guide and Reference, SA22-7452
GPFS for AIX: Installation and Tuning Guide, GA22-7453
GPFS for AIX: Problem Determination Guide, GA22-7434
GPFS for AIX: Concepts, Planning, and Installation Guide, GA22-7453
Referenced Web sites
http://techsupport.services.ibm.com/rs6000/support
A place to start when updating your RS/6000 host with fixes, drivers and
tools.
http://www.rs6000.ibm.com/resource/technology/gpfs_perf.html
GPFS performance White Paper
http://www.cae.de.ibm.com/forum/ssa/ssa.forum.html
This site carries discussions on SSA and can provide additional links to other
useful sites.
Special notices
References in this publication to IBM products, programs or services do not imply
that IBM intends to make these available in all countries in which IBM operates.
Any reference to an IBM product, program, or service is not intended to state or
imply that only IBM's product, program, or service may be used. Any functionally
equivalent program that does not infringe any of IBM's intellectual property rights
may be used instead of the IBM product, program or service.
Information in this book was developed in conjunction with use of the equipment
specified, and is limited in application to those specific hardware and software
products and levels.
IBM may have patents or pending patent applications covering subject matter in
this document. The furnishing of this document does not give you any license to
these patents. You can send license inquiries, in writing, to the IBM Director of
Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact IBM Corporation, Dept.
600A, Mail Drop 1329, Somers, NY 10589 USA.
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The information contained in this document has not been submitted to any formal
IBM test and is distributed AS IS. The use of this information or the
implementation of any of these techniques is a customer responsibility and
depends on the customer's ability to evaluate and integrate them into the
customer's operational environment. While each item may have been reviewed
by IBM for accuracy in a specific situation, there is no guarantee that the same or
similar results will be obtained elsewhere. Customers attempting to adapt these
techniques to their own environments do so at their own risk.
Any pointers in this publication to external Web sites are provided for
convenience only and do not in any manner serve as an endorsement of these
Web sites.
Glossary
block. The largest contiguous segment of
data in a GPFS file. It is set when the file
system is created using the mmcrfs command
and can not be reset using the mmchfs
command. Data is commonly accessed in
units of blocks, but under some
circumstances only subblocks are accessed.
cluster. A loosely-coupled set of nodes
organized into a network for the purpose of
sharing resources and communicating with
each other. See GPFS cluster.
concurrent access. Simultaneous access
to a shared volume group or a raw disk by
two or more nodes. In this configuration, all
the nodes defined for concurrent access to a
shared volume group are owners of the
shared resources associated with the volume
group or raw disk.
Abbreviations and acronyms
ANSI       American National Standards Institute
API        Application Programming Interface
ATM        Asynchronous Transfer Mode
BLOB       Binary Large Object
BOS        Base Operating System
CLVM       Concurrent LVM
CSPOC      Cluster Single Point of Control
DARE       Dynamic Automatic Reconfiguration Event
DNS        Domain Name System
EOF        End of File
ESSL       Engineering and Scientific Subroutine Library
FCAL       Fibre Channel Arbitrated Loop
FTP        File Transfer Protocol
GMGH       Generic Middle-layer GPFS Hint
GPFS       General Parallel File System
GSAPI      Group Services Application Programming Interface
HACMP      High Availability Cluster Multi-Processing
HACMP/ES   HACMP Enhanced Scalability
IBM        International Business Machines Corporation
I/O        Input/Output
ISO        International Organization for Standardization
IT         Information Technology
ITSO       International Technical Support Organization
JBOD       Just a Bunch Of Disks
JFS        Journaled File System
LAN        Local Area Network
LPP        Licensed Program Product
LRU        Least Recently Used
LV         Logical Volume
LVCB       Logical Volume Control Block
LVM        Logical Volume Manager
MAR        Multiple Access Range
NFS        Network File System
ODM        Object Data Manager
OEM        Original Equipment Manufacturer
OSF        Open Software Foundation
PCI        Peripheral Component Interconnect
PESSL      Parallel ESSL
PID        Process ID
POSIX      Portable Operating System Interface
PSSP       Parallel System Support Programs
PTF        Program Temporary Fix
PV         Physical Volume
PVID       Physical Volume Identification
RAID       Redundant Array of Independent Disks
RAS        Reliability Availability Serviceability
RSCT       RS/6000 Cluster Technology
RVSD       Recoverable Virtual Shared Disk
SAN        Storage Area Network
SCSI       Small Computer System Interface
SGID       Set Group ID
SMIT       System Management Interface Tool
SMP        Symmetric Multiprocessor
SSA        Serial Storage Architecture
TCP/IP     Transmission Control Protocol/Internet Protocol
UTP        Unshielded Twisted Pair
VFS        Virtual File System
VG         Volume Group
VGDA       Volume Group Descriptor Area
VGSA       Volume Group Status Area
VSD        Virtual Shared Disk
VSDM       Virtual Shared Disk Manager
WebSM      Web-based System Manager
Index
Symbols
/tmp/cspoc.log 232
/tmp/hacmp.out 185, 232
/usr/es/adm/cluster.log 185
/usr/es/sbin/cluster/clinfoES 230
/usr/es/sbin/cluster/cllockdES 230
/usr/es/sbin/cluster/clsmuxpdES 230
/usr/es/sbin/cluster/clstrmgrES 230
/usr/es/sbin/cluster/history/cluster.mmdd 232
/usr/es/sbin/cluster/utilities/clhandle 195
/usr/lpp/mmfs/bin/runmmfs 185
/usr/sbin/rsct/bin/emaixos 230
/usr/sbin/rsct/bin/haemd 230
/usr/sbin/rsct/bin/hagsd 230
/usr/sbin/rsct/bin/hagsglsmd 230
/usr/sbin/rsct/bin/hatsd 230
/var/adm/ras/mmfs.log.latest 185
/var/adm/ras/mmfs.log.previous 185
_LARGE_FILES 145, 181
A
adapter
boot 69
service 69
standby 69
add disk 132
add node to cluster 113
administrative tasks 109
AIX 62
AIX buffering 25
AIX commands
bffcreate 235
cfgmgr 235
exportvg 235
filemon 235
importvg 235
installp 235
iostat 235
lsattr 236
lscfg 236
lsdev 236
lslpp 236
lslv 236
lssrc 236
mklv 236
mkvg 236
netstat 236
odmget 236
oslevel 236
rmdev 236
setclock 236
varyoffvg 236
varyonvg 236
vmstat 236
AIX Maintenance Level
instfix 63
AIX system calls 145
AIX,mirroring 268
APAR 62
autoload 117
B
bandwidth 66
benchmark configuration 146
Benchmark programs
downloading 260
parallel 260
sequential 260
benchmark programs 146, 238
linking 240
Parameters 238
using 238
benchmark results 150, 155, 158, 175, 176
benchmark results,measuring 147
bffcreate 63, 207, 235
blink light 235
block
choosing size 18
size 148
block larger than record 150
block smaller than record 150
block, GPFS 18, 148, 267
hint 161
record alignment 148
record larger than block 150
C
cache coherence 153
cfgmgr 235
change server 116
characteristics of GPFS 5
chdev 196
checking Topology Services 188
chkautovg 193
chvg 191
claddnode 236
cldare 236
clhandle 195, 236
client 15
clinfoES 230, 232
cllockdES 230
close() 144
clsmuxpdES 230, 232
clstat 236
cluster events
node_up 81
clstop 236
clstrmgrES 230, 232
cluster 37, 267
adapters 76
add node 113
definition 4
environment 33
event history 84
GPFS 37
HACMP/ES 29, 37
ID 75
monitoring 81
name 75
network 76
partition 51
partitioned 51
state 81
synchronization 74, 78
synchronized 58
topology 37
verification 78
cluster adapters
configuring 76
cluster network
single point of failure 69
cluster nodes
adding 75
cluster resources
configuration 78
cluster services
starting 78
stopping 87
cluster topology
adding adapters 141
configuring 74
deleting adapters 141
displaying 77
cluster events
node_up_complete 81
clustering environment 28
CLVM 32, 201
CLVM,server 15
Commands
bffcreate 207, 235
cfgmgr 235
chdev 196
chkautovg 193
chvg 191
claddnode 236
cldare 236
clhandle 195, 236
cllsnw 139
clstat 82, 83, 236
clstop 87, 236
comppvid 226
df 22
diag 202
distributed commands 211
dsh 211
du 22, 181
exportvg 235
filemon 178, 235
gdsh 211
getlvodm 193
importvg 235
installp 207, 235
inutoc 208
iostat 177, 235
list of commands 233
ls -l 22, 181
lsattr 236
lscfg 236
lsdev 236
lslpp 209, 236
lslv 236
lssrc 38, 85, 86, 236
lssrc -ls grpsvcs 186
lssrc -ls topsvcs 188, 189
mklv 194, 236
mkvg 236
mmaddcluster 234
mmadddisk 191, 234
mmaddnode 44, 234
mmchattr 234
mmchcluster 234
mmchconfig 17, 18, 23, 234
mmchdisk 234
mmcheckquota 234
mmchfs 19, 234, 267
mmconfig 23, 44, 234
mmcrcluster 234
mmcrfs 18, 19, 22, 191, 234, 267
mmdefragfs 234
mmdelcluster 234
mmdeldisk 234
mmdelfs 234
mmdelnode 234
mmedquota 22
mmfsadm 44
mmfsadm cleanup 234
mmfsadm dump cfgmgr 234
mmfsadm dump waiters 234
mmfsadm shutdown 234
mmfsck 234
mmlsattr 234
mmlscluster 234
mmlsdisk 234
mmlsquota 234
mmquotaoff 234
mmquotaon 234
mmrestripefs 234
mmrpldisk 191, 235
mmshow_fence 43, 235
mmshutdown 235
mmshutdown 137
mmstartup 38, 235
netstat 72, 236
odmget 236
D
daemon, memory 23
daemon, segments 23
daemon,multi-threaded 15
Daemons
mmfsd 15
DARE 137
data replication 70
failure groups 70
metadata 70
user data 70
datashipping 267
deactivate quota 136
DEBUG 169
dedicated 67
default network 67
definition, cluster 4
device 236
attributes 236
driver 64
remove 236
df 22
diag 202, 203
direct attach disks 4
direct pointers 20
directive 162, 268
disk 64
mirroring 268
disk fencing 53
GPFS nodest 43
quorum 53
SSA 43
disk, RAID 269
disk, twin tailing 269
disks
direct attach 4
distributed subsystem 28
domain
HACMP 29
RSCT 29
SP 29
domain,RSCT 30
down 123
dsh 211
du 22, 181
du,sparse file size 22
E
emaixos 230
emsvcs 230
environment
cluster 33
non-VSD 32
VSD 32
ESSL 240
F
failure
adapter 142
failure group 21, 65, 70, 267
fence register 43
fencing 17
file
large 161, 181
locking 150
metadata 18
record 148
replication 21
size 21
space pre-allocation 180
sparse 22, 180
virtual size 22
virtual space 180
file space pre-allocation 180
file system
defragmentation 126
modify 125
repair 123
size 21
file system manager 17, 268
disk space allocation 17
file system configuration 17
quota management 17
security services 17
token management 17
file system size
df 22
file, large 161, 181
_LARGE_FILES 145, 181
O_LARGEFILE 181
filemon 178, 235
Files
.rhosts 67
/.rhosts 72, 109
/etc/cluster.nodes 94
/etc/hosts 74
/var 74
/var/adm/ras/mmfs.log.latest 115, 119, 121
/var/mmfs/etc/cluster.preferences 92
/var/mmfs/etc/mmfs.cfg 94
/var/mmfs/gen/mmsdrfs 94
Filesets
instifx 63
filesets 64
Flags
_LARGE_FILES 145, 181
DEBUG 169
-lessl 240
-lgpfs 182
-lxlf90 147, 240
flush 153, 268
fsync() 144, 147
G
gdsh 209, 211
General Parallel File System
See GPFS
Generic Middle Layer GPFS Hints API
See GMGH
getlvodm 193
gettimeofday() 147, 241
global management functions 16
GMGH
benchmark results 175
block list 165, 251, 257
DEBUG 169, 254
Example 165, 173
gmgh structure 257
gmgh.c 242
gmgh.h 243, 256
gmgh_cancel_hint() 245, 248, 256
gmgh_declare_1st_hint() 172, 247
gmgh_gen_blk() 165, 249
gmgh_init_hint() 170, 244
gmgh_issue_hint() 165, 248, 252
gmgh_issue_hints() 249
gmgh_post_hint() 171, 246
gmgh_xfer() 172, 248
gpfs_fcntl() 254, 256
hint set 170
datashipping 267
delete disk 128
directive 162, 268
disk space allocation 17
failure group 21, 70
failure processing 41
features 2
fencing 17
file space pre-allocation 180
file system configuration 17
file system manager 17, 44, 268
global management functions 16
global management nodes 43
GPFS buffering cf JFS buffering 26
GPFS_CANCEL_HINTS 163, 256
GPFS_CLEAR_FILE_CACHE 163
gpfs_fcntl() 161, 254, 256
GPFS_FCNTL_CURRENT_VERSION 162,
163, 254, 256
GPFS_MAX_RANGE_COUNT 164, 253, 254,
255
GPFS_MULTIPLE_ACCESS_RANGE 164,
249, 252
gpfsCancelHints_t 162
gpfsFcntlHeader_t 162
gpfsMultipleAccessRange_t 163
granularity 18, 148
hardware planning 8
hint 268
hints 161
hints as suggestions 161
I/O operation 152
I/O requirements 8
i-node 268
i-node cache 23
JFS,relationship to 145
kernel extension 15
maxFilesToCache 24
maxStatCache 24
memory utilization 23
metadata 18, 20, 23, 268
metanode 17, 44, 268
mmchconfig 17, 18, 23
mmchfs 19
mmconfig 23
mmcrfs 19, 22
mmedquota 22
mmfsd 15
monitoring 109
H
HACMP 207
add adapter 236
domain 29
GPFS configuration with 7
name node 236
RCST 62
start cluster 236
status monitor 236
stop cluster 236
update daemons 236
HACMP commands
claddnode 236
cldare 236
clhandle 236
clstat 236
clstop 236
rc.cluster 236
HACMP/ES 14, 32, 33, 62
adapter function 76
adapter IP label 76
boot adapter 56
cluster 29, 37, 267
cluster adapters 46, 76
cluster configuration 56
cluster ID 75
Cluster Information Services 55
Cluster Lock Manager 55
Cluster Manager 55
cluster name 75
cluster networks 46
cluster nodes 46, 75
cluster resources 56
cluster synchronization 58, 74, 78
cluster topology 37, 46, 47, 56, 74
cluster verification 58, 78
configuration restrictions for GPFS 58
DARE 49
dynamic reconfiguration 49
error recovery 59
events script 84
high availability 55
IP address takeover 69
IP Address Takeover. 58
log files 86
name resolution 73
network attribute 76
network tuning parameters 46
network type 76
networking requirements by GPFS 68
partitoned cluster 59
resource group 57
server 15
service adapter 56
service IP label 56
SMUX Peer Daemon 55
standby adapter 56
hardware 236
hardware planning for GPFS 8
hats 230
hdisk 201, 235
Heartbeat Rate 142
high availability 55
networks 68
planning for 68
SSA configuration 70
hint 268
hints 160, 161, 268
accepted 161
issued 161
released 161
I
I/O access pattern 239
benchmark results 155
hierarchy 156
random, no hints 159
random, using hints 160
sequential 156
stride vs I/O rate 157
strided 157
I/O operation 152
granularity 18, 148
read 152
write 153
I/O performance monitoring
filemon 178
gettimeofday() 241
iostat 177
rtc() 147, 177, 240
I/O performance, measuring 147
I/O rates,measuring 147
I/O rates,units 147
I/O requirements for GPFS 8
IBM General Parallel File System
See GPFS
IBM Virtual Shared Disk 33
ibm_sgr 238
ibm_sgw 238, 239
ibm_shr 238, 240
ibm_shw 238
image 207
implicit parallelism 18
importvg 235
indirect blocks 18
indirect pointers 20
i-node 18, 20, 268
cache 23
install image 235
install images 63
installp 207, 235
instfix 63
inutoc 208
iostat 177, 235
IP network 66
J
JFS 18, 268
buffering 26
GPFS,relationship to 145
Journaled File System
See JFS
K
kernel heap 23
L
large file 161, 181
_LARGE_FILES 145, 181
O_LARGEFILE 181
latest mmfs log file 185
-lessl 240
-lgpfs 182
libessl 240
libgpfs 182
Libraries
libessl 240
libgpfs 182
libxlf90 147, 240
libxlf90 147, 240
licensed program products 62
limit file 133
linking benchmark programs 240
list of commands 233
list quota 135
locality of reference 152
exploiting 154
logical volume
create 91, 236
display 236
LPPs 62
ls -l 22, 181
lsattr 236
lscfg 236
lsdev 236
lseek() 144, 181, 248
lseek64() 181
lslpp 63, 209, 236
lslv 236
lspv 130
lssrc 119, 236
lsvg 130
LVM 14
mirroring 268
-lxlf90 147, 240
M
maintenance level 236
man pages 233
MAR 160
MAR hint
See GPFS multiple access range hint
maxFilesToCache 24
MAXHINT 170
maxStatCache 24
memory 236
metadata 18, 20, 23, 268
direct pointers 20
indirect blocks 18
indirect pointers 20
i-node 18, 268
vnode 21, 269
metanode 17, 268
metadata
i-node 20
microcode 64
mirroring 268
mklv 194, 236
mkvg 236
mmaddcluster 113, 234
mmadddisk 132, 191, 234
mmaddnode 114, 234
mmchattr 234
mmchcluster 117, 234
mmchconfig 17, 18, 23, 118, 234
mmchdisk 123, 128, 234
mmcheckquota 234
mmchfs 19, 125, 234, 267
mmconfig 23, 117, 234
mmcrcluster 234
mmcrfs 18, 19, 22, 120, 133, 191, 234, 267
mmdefragfs 126, 234
mmdelcluster 112, 234
mmdeldisk 128, 234
mmdelfs 123, 234
mmdelnode 110, 111, 234
mmdf 125
mmedquota 22, 133
mmfsadm cleanup 234
N
netstat 236
network
private 76
public 76
serial 68
network interfaces
configuration for HACMP/ES 72
Network Module
configuration 141
Network Module settings
Heart Beat Rate 142
network status 236
new devices 235
NFS 63, 207
NFS mount 65
node
configuration manager 16, 267
file system manager 17, 268
metanode 17, 268
nodeset 14, 269
GPFS 37
non-pinned memory 23
non-VSD environment 32
O
O_LARGEFILE 181
odmget 236
open() 14, 144, 181
open64() 181
oslevel 62, 236
P
p->blklist 165
p->blklst 171
p->hint 165
packaging APAR 63
pagepool 23, 117, 152, 153
buffering 152
hints 161
pinned memory 23
parallel programming 151
Parameters
benchmark programs 238
I/O access pattern benchmarks 155
maxFilesToCache 24
maxStatCache 24
partition 51
partitioned cluster 51, 53
RSCT 51
pdisk 201, 235
performance 65
affected by record size 148
multi-node jobs 175
performance monitor 235
performance monitoring
filemon 178
gettimeofday() 241
iostat 177
rtc() 147, 177, 240
pinned memory 23
planning GPFS 7
POSIX API 144, 181
POSIX I/O API 144
pre-allocation, file space 180
prefetching 23, 153, 160
prerequisites for GPFS 10
previous mmfs log file 185
primary server 116
program portability
GPFS, correctness 145
performance 145
Programs
benchmark 146, 238
benchmark configuration 146
ibm_sgr 179, 238
ibm_sgw 179, 180, 238, 239
ibm_shr 180, 238, 240
ibm_shw 238
PSSP 62
PTF 62, 207
PVID 201
Q
quorum 16, 269
disk fencing 53
GPFS 42
GPFS nodeset 42
multi-node 16, 269
single-node 16, 269
quota 269
deactivate 136
establish 134
limit file 133
list 135
report 133
R
RAID 70, 269
adapter 65
array 64
random I/O access pattern 160
benchmark results 155, 175
no hints 159
using hints 160, 268
random number generator 240
rc.cluster 236
read I/O operation 152
read I/O operation,prefetching 153
read() 14, 21, 25, 152, 249
read-ahead 15, 153, 269
rebalance 131
record 148, 150
record,variable size vs performance 148
record/block alignment 148
recovery 14
Redbooks Web Site 263
S
secondary server 116
sequential I/O access pattern 156
benchmark results 155
server 15
set_fenceid 196
setclock 236
shared resource 65
shared segments,GPFS cache 23
single-node quorum 16, 269
smit.log 208
smit.script 208
smitty
install 63
sockets 16
Software
bffcreate 207
fileset installation 209
GPFS 207
HACMP 207
install image 235
installable image 207
installp 207, 235
inutoc 208
list products 236
ssacand 235
ssaconn 235
ssadisk 235
ssaidentify 235
ssaxlate 202, 235
start GPFS 120
stat cache 23
stat() 23
statistics 235
memory 236
status
network 236
status, subsystem 236
stopsrc 119
strided I/O access pattern 157
benchmark results 155, 158
mathematical expression,stride vs I/O rate 157
stride vs I/O rate 157
stripe 18
stripe group 269
stripe vs. block 18
stripe, size 18
striping 18, 148, 269
balanced random 19
first disk 19
implicit parallelism 18
random 19
round robin 19
subblock 18, 148, 151, 269
subsystem 28
distributed 28
RVSD 34
VSD 33
subsystem status 236
Subsystems
clinfoES 83
sundered network 51
surand() 240
switch 64
System calls 145, 181
AIX 145
close() 144
fsync() 144, 147
gettimeofday() 147, 241
GPFS 145
gpfs_fcntl() 161, 254, 256
lseek() 144, 248
lseek64() 181
open() 14, 144, 181
open64() 181
POSIX API 144
POSIX I/O API 144
read() 14, 21, 25, 152, 249
rtc() 147, 177, 240, 241
stat() 23
surand() 240
write() 14, 21, 25, 144, 153, 249
system error log 185
system resource 32
System types
gpfsCancelHints_t 162
gpfsFcntlHeader_t 162
gpfsMultipleAccessRange_t 163
off_t 181
off64_t 181
T
time and date 236
timing
filemon 178
gettimeofday() 147, 241
iostat 177
rtc() 147, 177, 240
token management 15, 150, 269
byte range locking 150
byte range locking granularity 151
parallel programming 151
Topology Services 28, 47, 51, 55
detection of failures 142
overview 30
partitioned cluster 51
reliable messaging library 30
reliable messaging service 30
status of adapters 30
Topology Services, checking 188
topsvcs 230
translate 202
hdisk-to-pdisk 235
pdisk-to-hdisk 235
troubleshooting 65
twin tailing 269
U
umount 119, 122
unrecovered 123
up 123
user data 18, 21
V
varyoffvg 236
varyonvg 191, 236
Virtual Shared Disk 269
virtual size 22
vmstat 236
vnode 21, 269
read() 21
write() 21
volume group
activate 236
create 91, 236
deactivate 236
export 235
import 235
voting protocol
n-phase 31
one phase 31
VSD 32, 33, 62, 64
environment 4, 32
See Virtual Shared Disk
subsystem 33
VSD environment 32
VSDM
primary 34
secondary 34
W
write I/O operation 153
cache coherence 153
complete block 154
flush 153, 268
new block 154
partial block 154
write-behind 153
write() 14, 21, 25, 144, 153, 249
write-behind 15, 153, 160, 269
Back cover
INTERNATIONAL
TECHNICAL
SUPPORT
ORGANIZATION
BUILDING TECHNICAL
INFORMATION BASED ON
PRACTICAL EXPERIENCE
IBM Redbooks are developed by
the IBM International Technical
Support Organization. Experts
from IBM, Customers and
Partners from around the world
create timely technical
information based on realistic
scenarios. Specific
recommendations are provided
to help you implement IT
solutions more effectively in
your environment.
ISBN 0738422088