The RS/6000 SP Inside Out
http://www.redbooks.ibm.com
SG24-5374-00
May 1999
Take Note!
Before using this information and the product it supports, be sure to read the general information in
Appendix B, “Special Notices” on page 521.
This edition applies to PSSP Version 3, Release 1 (5765-D51) for use with the AIX Operating System
Version 4, Release 3, Modification 2.
When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the
information in any way it believes appropriate without incurring any obligation to you.
The RS/6000 SP is by far the most successful product within the RS/6000
family. With over 5,600 systems sold by year-end 1998, averaging ten nodes
each, the RS/6000 SP community has been growing at a frantic rate for the
past few years.
This redbook applies to PSSP Version 3, Release 1 for use with the AIX
Operating System Version 4, Release 3, Modification 2. It describes the bits
and pieces that make the RS/6000 SP a successful competitor in the difficult
market of high-end computing.
People new to the RS/6000 SP will find valuable and extensive introductory
information to help them understand how this system works. This redbook
provides solutions for a variety of business environments.
In the last year of this millennium, the RS/6000 SP has proven to be prepared
for the challenges of the next millennium. Are you prepared to take advantage
of this?
Bernard Woo, P.Eng., is an IBM I/T Support Specialist who has been
working in the Canadian RS/6000 Customer Assist Centre (CAC) for three
years. For the past two years, he has specialized in supporting the SP. He
holds a degree in Electrical Engineering from the University of Waterloo in
Canada.
Steven Yap has been working in IBM Singapore as an IT specialist for the
past six years. His job responsibilities include account support, help desk,
project planning and implementation, and on-site problem determination for
both software and hardware in all areas of RS/6000 systems. A certified
RS/6000 SP specialist, his main focus is on the RS/6000 SP and HACMP. He
holds a degree in Electrical Engineering from the National University of
Singapore.
Thanks to the following people for their invaluable contributions to this project:
Klaus Gottschalk
IBM Germany
Frauke Boesert
Klaus Geers
Horst Gernert
Roland Laifer
Reinhard Strebler
University of Karlsruhe, Germany
Mark Atkins
Joe Banas
Pat Caffrey
Lisa Case-Hook
Michael K Coffey
Richard Coppinger
Chris DeRobertis
Ron Goering
Jan Ranck-Gustafson
Barbara Heldke
Nancy Mroz
Norman Nott
Robert Palmer
Larry Parker
Kevin Redman
Richard Treumann
IBM Poughkeepsie
John Maddalozzo
IBM Austin
Scott Vetter
ITSO Austin
Comments Welcome
Your comments are important to us!
• Fax the evaluation form found in “ITSO Redbook Evaluation” on page 547
to the fax number shown on the form.
• Use the electronic evaluation form found on the Redbooks Web sites:
For Internet users http://www.redbooks.ibm.com
For IBM Intranet users http://w3.itso.ibm.com
• Send us a note at the following address:
[email protected]
Since its introduction to the marketplace five and a half years ago, the
Scalable POWERparallel (SP) supercomputer has built up an impressive
resume:
• It has been installed in over 70% of the US Fortune 500 companies, as
well as in scientific and technical institutions worldwide.
• It has beaten humanity’s chess champion, Garry Kasparov.
• It has expanded from 0 to approximately 28% of IBM’s total UNIX-based
system revenue.
• It has been selected by the US Department of Energy for the Accelerated
Strategic Computing Initiative (ASCI) project, which will result in the
creation of the most powerful computer in the world.
• It has repeatedly set world records for Internet Web-serving capability,
most recently at the Nagano Winter Olympics.
In 1996, the SP2 was renamed to simply the SP and formally became a
product of the RS/6000 Division. It represents the high-end of the RS/6000
family. IBM secured a number of large SP contracts, of particular note the
ASCI project of the US Department of Energy. These contracts, coupled with
the broad marketplace acceptance of the product, have fuelled SP
development. In 1996, IBM introduced a faster version of the Trailblazer
switch (more than doubling the bandwidth of its predecessor), new nodes,
including Symmetric Multiprocessor (SMP) versions, and more robust and
functional PSSP software. As of the end of 1997, there were over 3770 SP
systems installed throughout the world. From 1994 to 1997, the SP install
base has been growing at an annual rate of 169%.
1.3.1 Scalability
The SP efficiently scales in all aspects, including:
• Hardware and software components, to deliver predictable increments of
performance and capacity
• Network bandwidth, both inside and outside the machine
• Systems management, to preserve the investment in tools, processes, and
skills as the system grows
Importantly, the SP can scale both up and down. It can be subdivided, either
logically or physically, into smaller SPs. In addition, scaling is a consistent
process regardless of the initial size of an SP implementation.
1.3.2.2 Technologies
The SP relies heavily on mainstream hardware and software components.
This bolsters the SP product business case by minimizing development costs
and integration efforts.
1.3.4 Manageability
The SP is production-worthy. It has good reliability, availability and
serviceability, allowing it to host mission-critical applications. It has a single
point-of-control and is managed with consistent tools, processes and skills
regardless of system size.
This cluster technology has been used outside the SP as part of the
HACMP/ES product. HACMP provides high availability through failure
detection and failover for any RS/6000 servers or SP nodes. The HACMP/ES
product extends the capabilities of HACMP to 32-node clusters. This is an
example where technology developed on the SP is extended to provide
customer value throughout the RS/6000 product line.
The ultimate question we want to answer in this chapter is: how does the
RS/6000 SP fit in the broad spectrum of computer architectures?
If you want a straight answer, then you need to read the rest of the book to
understand how hard it is to classify the SP in a single category.
The term shared nothing, in this context, refers to data access. In this type of
architecture, processors do not share a common repository for data (memory
or disk), so data sharing has to be carried out through messages. Systems
using this architecture are also called message-based systems.
However, this self-imposed limitation has been overcome year after year, thus
increasing the practical number of processors in an SMP machine to numbers
we would not have imagined a few years ago.
If you analyze the RS/6000 SP from a system point of view, you see that
processor nodes communicate with each other through communication networks.
(Figure: computer architecture models: shared nothing, shared disk, and shared memory; uniprocessor, symmetric multiprocessor (SMP), and distributed memory systems, each built from CPUs, memory, and I/O.)
Besides the fact that we can use the RS/6000 SP nodes as standalone
machines for running “serial” and independent applications in what is called
“server consolidation”, the RS/6000 SP can also be viewed as a cluster or
even as a parallel machine (a massively parallel machine).
Currently, partitions have evolved to what are called “domains”. This is true
for some of the RS/6000 SP software components, but not for all of them.
Some subsystems remain global to the system even after they have been
“partitioned”.
At the high level, applications can take advantage of the RS/6000 Cluster
Technology (described in Chapter 8, “RS/6000 Cluster Technology” on page
185) by using the notification and coordination services provided by RSCT. At
the system level, administrators can take advantage of the high availability
functions provided by the basic PSSP components as well as the Enhanced
Scalability version of the High Availability Cluster Multiprocessing (HACMP
ES, described in Chapter 15, “High Availability” on page 433).
Part 2. System Implementation
3.1 Frames
The building block of the RS/6000 SP is the “frame”. There are two sizes: the Tall
frame (75.8 inches high) and the Short frame (49 inches high). RS/6000 SP internal
nodes are mounted in either a Tall or Short frame. A Tall frame has eight
drawers, while a Short frame has four drawers. Each drawer is further divided
into two slots. A Thin node occupies one slot, a Wide node occupies one
drawer (two slots), and a High node occupies two drawers (four slots). An
internal power supply is included with each frame. Frames are equipped with
optional processor nodes and switches.
Since the original RS/6000 SP product was made available in 1993, there
have been a number of model and frame configurations. The frame and the
first node in the frame were tied together, forming a model. Each
configuration was based on the frame type and the kind of node installed in
the first slot. This led to an increasing number of possible prepackaged
configurations as more nodes became available.
The introduction of a new Tall frame in 1998 was the first attempt to simplify
the way frames and the nodes inside them are configured. This new frame replaces
the old frames. The most noticeable difference between the new and old
frame is the power supply size. Also, the new Tall frame is shorter and deeper
than the old Tall frame. With the new offering, IBM simplified the SP frame
options by decoupling the embedded node from the frame offering. Therefore,
when you order a frame, all you receive is a frame with the power supply units
and a power cord. All nodes, switches, and other auxiliary equipment are
ordered separately.
All new designs are completely compatible with all valid SP configurations
using older equipment. Also, all new nodes can be installed in any existing SP
frame provided that the required power supply upgrades have been
implemented in that frame.
Depending on the type of node selected, an SP Short frame can contain a
maximum of 8 Thin nodes, 4 Wide nodes or 2 High nodes. Also, node types can
be mixed, but the configuration scales up to only 8 nodes. Therefore, for a
large configuration or high scalability, Tall frames are recommended.
Only the Short model frame can be equipped with a switch board. The Short
expansion frame cannot hold a switch board, but nodes in the expansion
frame can share unused switch ports in the model frame.
(Figure: Short frame: processor nodes with LED and control breaker panels, an optional switch assembly, and 48-volt power modules.)
The base level SP Switch frame (feature code #2031) contains four ISBs. An
SP Switch frame with four ISBs will support up to 128 nodes. The base level
SP Switch frame can also be configured into systems with fewer than 65
nodes. In this environment, the SP Switch frame will greatly simplify future
expansion.
Note: The SP Switch frame is required when the sixth SP Switch board is
added to the system, and is a mandatory prerequisite for all large-scale
systems.
A Tall frame has four power supplies. In a fully populated frame, the frame
can operate with only three power supplies (N+1). Short frames come with
two power supplies and a third, optional one, for N+1 support.
Figure 5 on page 24 illustrates Tall frame components from front and rear
views.
The power consumption depends on the number of nodes installed in the
frame. For details refer to IBM RS/6000 SP: Planning Volume 1, Hardware
and Physical Environment, GA22-7280.
(Figures: Tall frame components, front and rear views: processor nodes with LED and control breaker panels, the switch assembly, RF shunt assembly, main power switch, skirts, and the frame supervisor card with its supervisor connector, green and orange LEDs, DB25 connector for the Y-cable, and RS-232 cable to the control workstation.)
There is a cable that connects from the frame supervisor card (position A) to
the switch supervisor card (position B) on the SP Switch or the SP Switch-8
boards and to the node supervisor card (position C) of every node in the
frame. Therefore, the control workstation can manage and monitor frames,
switches and all in-frame nodes.
3.2 Standard Nodes
The basic RS/6000 SP building block is the server node or standard node.
Each node is a complete server system, comprising processors, memory,
internal disk drives, expansion slots, and its own copy of the AIX operating
system. The basic technology is shared with standard RS/6000 workstations
and servers, but differences exist that allow nodes to be centrally managed.
There is no special version of AIX for each node. The same version runs on
all RS/6000 systems.
Standard nodes can be classified as those which are inside the RS/6000 SP
frame and those which are not.
Since 1993, when IBM announced the RS/6000 SP, there have been 14
internal node types, excluding some special "on request" node types. The five
most current node types are the 160 MHz Thin P2SC node, the 332 MHz SMP
Thin node, the 332 MHz SMP Wide node, the POWER3 SMP Thin node and the
POWER3 SMP Wide node. Only the 160 MHz Thin P2SC node uses the
Micro Channel Architecture (MCA) bus, while the others use the PCI bus.
332 MHz SMP Thin Nodes
This node is the first PCI bus node of the RS/6000 SP. Each node has two or
four PowerPC 604e processors running at 332 MHz.
The 332 MHz SMP Wide node is a 332 MHz SMP Thin node combined with
additional disk bays and PCI expansion slots. This Wide node has four
internal disk bays with a maximum of 36.4 GB (mirror), and 10 PCI I/O
expansion slots (three 64-bit, seven 32-bit). Both 332 MHz SMP Thin and
Wide nodes are based on the same technology as the RS/6000 model H50
and have been known as the “Silver” nodes. Figure 7 shows a 332 MHz SMP
node component diagram.
POWER3 SMP Thin Nodes
This node is the first 64-bit internal processor node of the RS/6000 SP. Each
node has a 1- or 2-way (within two processor cards) configuration utilizing a
64-bit 200 MHz POWER3 processor, with a 4 MB Level 2 (L2) cache per
processor. The standard ECC SDRAM memory in each node is 256 MB,
expandable up to 4 GB (within two card slots). This new node is shipped with
disk pairs as a standard feature to encourage the use of mirroring to
significantly improve system availability. This Thin node has two internal disk
bays for pairs of 4.5 GB, 9.1 GB and 18.2 GB Ultra SCSI disks. Each node
has two 32-bit PCI slots and integrated 10/100 Ethernet and Ultra SCSI
adapters. The POWER3 SMP Thin node can be upgraded to the POWER3
SMP Wide node.
The POWER3 SMP Wide node is a POWER3 SMP Thin node combined with
additional disk bays and PCI expansion slots. This Wide node has four
internal disk bays for pairs of 4.5 GB, 9.1 GB and 18.2 GB Ultra SCSI disks.
Each node has 10 PCI slots (two 32-bit, eight 64-bit). Both POWER3 SMP
Thin and Wide nodes are equivalent to the RS/6000 43P model 260. A
diagram of the POWER3 SMP node is shown in Figure 8 on page 29. Notice
that it uses docking connectors (position A) instead of flex cables as in the
332 MHz node.
The minimum software requirements for POWER3 SMP Thin and Wide nodes
are AIX 4.3.2 and PSSP 3.1.
Node Type       160 MHz Thin   332 MHz SMP Thin   332 MHz SMP Wide   POWER3 SMP Thin   POWER3 SMP Wide
Max. Memory     1 GB           3 GB               3 GB               4 GB              4 GB
Memory Slots    4              2                  2                  2                 2
Disk Bays       2              2                  4                  2                 4
(Figure: 332 MHz SMP node system architecture: processors on a 64-bit system address bus, a memory-I/O controller, PCI bridges with PCI buses, and the SP Switch MX adapter.)
The 332 MHz SMP node contains 2- or 4-way 332 MHz PowerPC 604e
processors, each with its own 256 KB Level 2 cache. The X5 Level 2 cache
controller incorporates several technological advancements in design
providing greater performance over traditional cache designs. The cache
controller implements an 8-way, dual-directory, set-associative cache using
SDRAM. When instructions or data are stored in cache, they are grouped into
sets of eight 64-byte lines. The X5 maintains an index to each of the eight
sets. It also keeps track of the tags used internally to identify each cache line.
Dual tag directories allow simultaneous processor requests and system bus
snoops, thus reducing resource contention and speeding up access.
System Bus
The SMP system bus is optimized for high performance and multiprocessing
applications. It has a separate 64-bit address bus and 128-bit data bus.
These buses operate independently in the true split transaction mode and are
aggressively pipelined. For example, new requests may be issued before
previous requests are completed. There is no sequential ordering
requirement. Each operation is tagged with an 8-bit tag, which allows up to
256 transactions to be in progress in the system at any one time.
The POWER3 SMP node system structure is shown in Figure 10 on page 33.
(Figure 10: POWER3 SMP node system structure: the 6XX address and data buses (128-bit at 100 MHz), the memory bus, and 32-bit and 64-bit PCI slots at 33 MHz.)
System Bus
The system bus, referred to as the “6XX” bus, connects up to two POWER3
processors to the memory-I/O controller chip set. It provides 40 bits of real
address and a separate 128-bit data bus. The address, data and tag buses
are fully parity protected. The 6XX bus runs at a 100 MHz clock rate and peak
data throughput is 1.6 GB/second.
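As a quick check on these figures (the arithmetic is ours, not part of the original text), the quoted peak follows directly from the data bus width and clock rate:

   128-bit data bus = 16 bytes per transfer
   16 bytes x 100 MHz = 1.6 GB/second peak data throughput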
Service Processor
The service processor function is integrated on the I/O planar board. This
service processor performs system initialization, system error recovery and
diagnostic functions that give the POWER3 SMP node a high level of
availability. The service processor is designed to save the state of the system
to 128 KB of nonvolatile memory (NVRAM) to support subsequent diagnostic
and recovery actions taken by other system firmware and the AIX operating
system.
Each I/O rack accommodates up to two I/O drawers (maximum four drawers
per system) with additional space for storage and communication
subsystems. The base I/O drawer contains:
• A high-performance 4.5 GB SCSI-2 Fast/Wide disk drive (S70) or 9.1
GB Ultra SCSI disk drive (S7A)
• A 32X (max) CD-ROM
• A 1.44 MB 3.5-inch diskette drive
• A service processor
• Fourteen PCI slots (nine 32-bit and five 64-bit), eleven of which are
available
• Three media bays (two available) for the S70, or two media bays (one
available) for the S7A
• Twelve hot-swapped disk drive bays (eleven available)
When all four I/O drawers are installed, the S70 contains twelve media bays
(eight media bays for the S7A), forty-eight hot-swapped disk drive bays, and
fifty-six PCI slots per system.
(Figure: attaching an RS/6000 Model S70/S7A to the RS/6000 SP: the SP Switch cable, the 15 m frame supervisor control cable connection, the SAMI and S1TERM serial lines, the frame-to-frame ground, and the SP Ethernet LAN.)
An SP Switch Router may have multiple logical dependent nodes, one for
each dependent node adapter it contains. If an SP Switch Router contains
more than one dependent node adapter, it can route data between SP
systems or system partitions. For an SP Switch Router, this card is called a
Switch Router Adapter (feature code #4021). Data transmission is
accomplished by linking the dependent node adapters in the switch router
with the logical dependent nodes located in different SP systems or system
partitions.
Figure 13. SP Switch Router
Although you can equip an SP node with a variety of network adapters and
use the node to make your network connections, the SP Switch Router with
the Switch Router Adapter and optional network media cards offers many
advantages when connecting the SP to external networks, as follows:
• Each media card contains its own IP routing engine with separate
memory containing a full route table of up to 150,000 routes. Direct
access provides much faster lookup times compared to software-driven
lookups.
• Media cards route IP packets independently at rates of 60,000 to
130,000 IP packets per second. With independent routing available
from each media card, the SP Switch Router gives your SP system
excellent scalability characteristics.
• The SP Switch Router has a dynamic network configuration to bypass
failed network paths using standard IP protocols.
• Using multiple Switch Router Adapters in the same SP Switch Router,
you can provide high performance connections between system
partitions in a single SP system or between multiple SP systems.
Two versions of the RS/6000 SP Switch Router can be used with the SP
Switch. The Model 04S (GRF 400) offers four media card slots and the Model
16S (GRF 1600) offers 16 media card slots. Except for the additional traffic
capacity of the Model 16S, both units offer similar performance and network
availability, as shown in Figure 14.
2. A connection between an SP Switch Router Adapter and the SP Switch —
The SP Switch Router transfers information into and out of the processor
nodes of your SP system. The link between the SP Switch Router and the
SP processor nodes is implemented by:
• An SP Switch Router adapter
• A switch cable connecting the SP Switch Router adapter to a valid
switch port on the SP Switch
3. A frame-to-frame electrical ground — The SP Switch Router frame must
be connected to the SP frame with a grounding cable. This frame-to-frame
ground is required in addition to the SP Switch Router electrical ground.
The purpose of the frame-to-frame ground is to maintain the SP and SP
Switch Router systems at the same electrical potential.
For more information refer to IBM 9077 SP Switch Router: Get Connected to
the SP Switch, SG24-5157.
The control workstation also acts as a boot/install server for other servers in
the RS/6000 SP system. In addition, the control workstation can be set up as
an authentication server using Kerberos. It can be the Kerberos primary
server, with the master database and administration service, as well as the
ticket-granting service. As an alternative, the control workstation can be set
up as a Kerberos secondary server, with a backup database, to perform
ticket-granting services.
Note:
1. Requires a 7010 Model 150 X-Station and display. Other models and
manufacturers that meet or exceed this model can be used. An ASCII
terminal is required as the console.
Notes:
1. Supported by PSSP 2.2 and later.
2. On systems introduced since PSSP 2.4, either the 8-port (feature code
#2493) or 128-port (feature code #2944) PCI bus asynchronous
adapter should be used for frame controller connections. IBM strongly
suggests you use the support processor option (feature code #1001). If
you use this option, the frames must be connected to a serial port on
an asynchronous adapter and not to the serial port on the control
workstation planar board.
3. The native RS232 ports on the system planar cannot be used as tty
ports for the hardware controller interface. The 8-port asynchronous
adapter EIA-232/ RS-422, PCI bus (feature code #2943) or the
128-port Asynchronous Controller (feature code #2944) are the only
RS232 adapters that are supported. These adapters require AIX 4.2.1
or AIX 4.3 on the control workstation.
4. The 7043 can only be used on SP systems with up to four frames. This
limitation applies to the number of frames and not the number of nodes.
This number includes expansion frames.
5. The 7043-43P is not supported as a control workstation whenever an
S70/S7A is attached to the SP. The limitation is due to the load that the
extra daemons place on the control workstation.
(Figure: High Availability Control Workstation configuration: primary and backup control workstations on the SP Ethernet LAN sharing the SPVG volume group, which holds the SDR and system management data.)
The primary and backup control workstations are also connected on a private
point-to-point network and a serial TTY link or target mode SCSI. The backup
control workstation assumes the IP address, IP aliases, and hardware
address of the primary control workstation. This lets client applications run
without changes. The client application, however, must initiate reconnects
when a network connection fails.
10BASE-2 Ethernet supports only half duplex (HDX). There is a hard limit of 30 stations on a single
10BASE-2 segment, and the total cable length must not exceed 185 meters.
However, it is not advisable to connect more than 16 to 24 nodes to a single
segment. Normally, there is one segment per frame and one end of the
coaxial cable is terminated in the frame. Depending on the network topology,
the other end connects the frame to either the control workstation or to a
boot/install server in that segment, and is terminated there. In the latter case,
the boot/install server and control workstation are connected through an
additional Ethernet segment, so the boot/install server needs two Ethernet
adapters.
In order to use Twisted Pair in full duplex mode, there must be a native RJ-45
TP connector at the node (no transceiver), and an Ethernet switch like the
IBM 8274 must be used. A repeater always works in half duplex mode, and
will send all IP packets to all ports (like in the 10BASE-2 LAN environment).
We therefore recommend that you always use an Ethernet switch with native
UTP connections.
(Figure: a single 10 Mbps half-duplex SP Ethernet segment connecting the control workstation to all nodes.)
• The control workstation acts as boot/install server for all nodes.
• Performance is limited to one 10-Mbps HDX connection at a time.
• Only six to eight network installs of SP nodes from the control workstation
NIM server could be performed simultaneously.
(Figures: segmented SP Ethernet configurations: two 10 Mbps half-duplex segments with the control workstation routing between them; a variant using an external router and a BNC-to-TP media converter; and a setup with boot/install server (BIS) nodes reached from the control workstation over 100 Mbps full-duplex Fast Ethernet, each BIS serving its own 10 Mbps install Ethernet.)
Even when a router is added, the solution presented in the following section is
normally preferable to a segmented network with boot/install servers, both
from a performance viewpoint and from a management and complexity
viewpoint.
(Figure 20: an Ethernet switch forming one logical LAN: the control workstation attaches over Fast Ethernet, and each frame's 10 Mbps half-duplex collision domain connects to a switch port, through a BNC-to-TP media converter where needed.)
Considering only the network topology, the control workstation should be able
to install 6 to 8 nodes in each Ethernet segment (port on the Ethernet switch)
simultaneously, since each Ethernet segment is a separate collision domain.
Rather than the network bandwidth, the limiting factor most likely is the ability
of the control workstation itself to serve a very large number of NIM clients
simultaneously, for example answering UDP bootp requests or acting as the
NFS server for the mksysb images. To quickly install a large SP system, it
may therefore still be useful to set up boot/install server nodes, but the
network topology itself does not require boot/install servers. For an
installation of all nodes of a large SP system, we advocate the following:
1. Using the spbootins command, set up approximately as many boot/install
server nodes as can be simultaneously installed from the control
workstation.
2. Install the BIS nodes from the control workstation.
3. Install the non-BIS nodes from their respective BIS nodes. This provides
the desired scalability for the installation of a whole, large SP system.
4. Using the spbootins command, change the non-BIS nodes’ configuration
so that the control workstation becomes their boot/install server. Do not
forget to run setup_server to make these changes effective.
5. Reinstall the original BIS nodes. This removes all previous NIM data from
them, since no other node is configured to use them as boot/install server.
The configuration in Figure 20 scales well to about 128 nodes. For larger
systems, the fact that all the switched Ethernet segments form a single
broadcast domain can cause network problems if operating system services
or applications frequently issue broadcast messages. Such events may cause
"broadcast storms", which can overload the network. For example, Topology
Services from the RS/6000 Cluster Technology use broadcast messages
when the group leader sends PROCLAIM messages to attract new members.
Attention
ARP cache tuning: Be aware that for SP systems with very large networks
(and/or routes to many external networks), the default AIX settings for the
ARP cache size might not be adequate. The Address Resolution Protocol
(ARP) is used to translate IP addresses to Media Access Control (MAC)
addresses, and vice versa. Insufficient ARP cache settings can severely
degrade your network’s performance, in particular when many broadcast
messages are sent. Refer to /usr/lpp/ssp/README/ssp.css.README and
the redbook RS/6000 SP Performance Tuning, SG24-5340, for more
information about ARP cache tuning.
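As an illustration only (arptab_nb and arptab_bsiz are standard AIX network options, but the values below are arbitrary and must be sized for your own configuration), the ARP cache can be inspected and enlarged with the AIX no command:

   # Display the current ARP-related network options
   no -a | grep arp

   # Example: raise the number of ARP table buckets and the bucket size
   # (illustrative values only; re-run from an rc script to make them permanent)
   no -o arptab_nb=64
   no -o arptab_bsiz=10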
In order to avoid problems with broadcast traffic, no more than 128 nodes
should be connected to a single switched Ethernet subnet. Larger systems
should be set up with a suitable number of switched subnets. To be able to
network boot and install from the control workstation, each of these switched
LANs must have a dedicated connection to the control workstation. This can
be accomplished either through multiple uplinks between one Ethernet switch
and the control workstation, or through multiple switches which each have a
single uplink to the control workstation.
In this configuration, an Ethernet switch like the IBM 8274 is again used to
provide a single LAN, and connects to the control workstation at 100 Mbps
FDX. One frame has new nodes with a 100-Mbps Ethernet, which are
individually cabled by 100BASE-TX Twisted Pair to ports of the Ethernet
Switch, and operate in full duplex mode as in the previous example. Two
frames with older nodes and 10BASE-2 cabling are connected to ports of the
same Ethernet switch, using media converters as in the configuration shown
in Figure 20 on page 54. Ideally, a switching module with autosensing ports is
used, which automatically detects the communication speed.
All of these applications are able to take advantage of the sustained and
scalable performance provided by the SP Switch. The SP Switch provides the
message passing network that connects all of the processors together in a
way that allows them to send and receive messages simultaneously.
Indirect networks, on the other hand, are constructed such that some
intermediate switch elements connect only to other switch elements.
Messages sent between processor nodes traverse one or more of these
intermediate switch elements to reach their destination. The advantages of
the SP Switch network are:
• Bisectional bandwidth scales linearly with the number of processor nodes
in the system.
Bisectional bandwidth is the most common measure of total bandwidth for
parallel machines. Consider all possible planes that divide a network into
two sets with an equal number of nodes in each. Consider the peak
bandwidth available for message traffic across each of these planes. The
bisectional bandwidth of the network is defined as the minimum of these
bandwidths (a compact formal statement of this definition follows the list).
• The network can support an arbitrarily large interconnection network while
maintaining a fixed number of ports per switch.
• There are typically at least four shortest-path routes between any two
processor nodes. Therefore, deadlock will not occur as long as the packet
travels along any shortest-path route.
• The network allows packets that are associated with different messages to
be spread across multiple paths, thus reducing the occurrence of hot
spots.
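For reference, the bisectional bandwidth described above can be stated compactly (the notation below is introduced here and is not part of the original text). For a network whose node set V contains N nodes, with bw(S, V\S) denoting the peak bandwidth across the cut separating a subset S from the remaining nodes:

\[
B_{bisect} = \min_{S \subset V,\; |S| = N/2} \mathrm{bw}(S,\, V \setminus S)
\]

The first advantage in the list states that, for the SP Switch topology, this minimum grows linearly with N.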
The hardware component that supports this communication network consists
of two basic components: the SP Switch adapter and the SP Switch board.
There is one SP Switch adapter per processor node and generally one SP
Switch board per frame. This setup provides connections to other processor
nodes. Also, the SP system allows switch boards-only frames that provide
switch-to-switch connections and greatly increase scalability.
(Figure: an SP Switch board contains eight SP Switch chips; ports on one side connect to SP nodes, and ports on the other side connect to other SP Switch boards.)
Each direction of an SP Switch link carries three signals: data (8 bits), data
valid (1 bit) and a token (1 bit). The first two are driven by the transmitting
element of the link, while the token is driven by the receiving element of the
link.
The relationship between the SP Switch chip link and the SP Switch chip port
is shown in Figure 24 on page 62.
Figure 24. Relationship Between Switch Chip Link and Switch Chip Port
Nodes based on RS/6000s that use the MCA bus use the MCA-based Switch
Adapter (#4020). The same adapter is used in uniprocessor Thin, Wide and
SMP High nodes.
New nodes based on PCI bus architecture (332 MHz SMP Thin and Wide
Nodes, the 200 MHz POWER3 SMP Thin and Wide Nodes) must use the
newer MX-based Switch Adapters (#4022 and #4023, respectively) since
these are installed on the MX bus in the node. The so-called mezzanine or
MX bus allows the SP Switch adapter to be connected directly onto the
processor bus, providing faster performance than adapters installed on the
I/O bus. The newer (POWER3) nodes use an improved adapter based on a
faster mezzanine (MX2) bus.
External nodes such as the 7017-S70 and 7017-S7A are based on standard
PCI bus architecture. If these nodes are to be included as part of an SP
switch network, then the switch adapter installed in these nodes is a
PCI-based adapter (#8396).
(Figure: an SP node's switch adapter, attached to the node's MCA, MX/MX2, or PCI bus, connects its switch port over an SP Switch link to a port on the SP Switch board.)
(Figure: a model frame with a node switch board: the nodes connect to switch ports on one side of the board, leaving the 16 ports on the other side free.)
The 16 unused SP Switch ports on the right side of the node switch board are
used for creating larger networks. There are two ways to do this:
• For an SP system containing up to 80 nodes, these SP Switch ports
connect directly to the SP Switch ports on the right side of other node
switch boards.
• For an SP system containing more than 80 nodes, these SP Switch ports
connect to additional stages of switch boards. These additional SP Switch
boards are known as intermediate switch boards (ISBs).
Figure 30. SP 48-Way System Interconnection
Adding another frame to this existing SP complex further reduces the number
of direct connections between frames. The 4-frame, 64-way schematic
diagram is shown in Figure 31 on page 68. Here, there are at least five
connections between each frame, and note that there are six connections
between frames 1 and 2 and between frames 3 and 4. Again, there are still
four potential paths between any pair of nodes that are connected to separate
NSBs.
The addition of a sixth frame to this configuration would reduce the number of
direct connections between each pair of frames to below four. In this
hypothetical case, each frame would have three connections to four other
frames and four connections to the fifth frame, for a total of 16 connections
per frame. This configuration, however, would result in increased latency and
reduced switch network bandwidth. Therefore, when more than 80 nodes are
required for a configuration, an ISB frame is used to provide 16 paths
between any pair of frames.
(Figure: a system of more than 80 nodes, with each frame's node switch board cabled to intermediate switch boards ISB1 through ISB4 in the SP Switch frame.)
3.6.3.1 SP Switch
The operation of the SP Switch (feature code #4011) has been described in
the preceding discussion. When configured in an SP order, internal cables
are provided to support expansion to 16 nodes within a single frame. In
multi-switch configurations, switch-to-switch cables are provided to enable the
physical connectivity between separate SP switch boards. The required SP
switch adapter connects each SP node to the SP Switch board.
3.6.3.2 SP Switch-8
To meet some customer requirements, eight-port switches provide a low-cost
alternative to the full-size 16-port switches. The 8-port SP Switch-8 (SPS-8,
feature code #4008) provides switch functions for an 8-node SP system.
The SP Switch-8 has two active switch chip entry points. Therefore, the ability
to configure system partitions is restricted with this switch. With the maximum
eight nodes attached to the switch, there are two possible system
configurations:
• A single partition containing all eight nodes
• Two system partitions containing four nodes each
MCA nodes (uniprocessor Thin, Wide and SMP High)   #4020 SP Switch Adapter
332 MHz SMP Thin or Wide                           #4022 SP Switch MX Adapter
200 MHz POWER3 SMP Thin or Wide                    #4023 SP Switch MX2 Adapter
External nodes (RS/6000 S70/S7A)                   #8396 SP System Attachment Adapter
The 332 MHz and 200 MHz SMP PCI-based nodes listed here have a unique
internal bus architecture which allows the SP Switch adapters installed in
these nodes to see increased performance compared with previous node
types. A conceptual diagram illustrating this internal bus architecture is
shown in Figure 34 on page 72.
(Figure 34: internal bus architecture of the PCI SMP nodes: CPU cards on the 6xx address and data buses, system memory, and the 6xx-MX mezzanine bus.)
These nodes implement the PowerPC MP System Bus (6xx bus). In addition,
the memory-I/O controller chip set includes an independent, separately
clocked "mezzanine" bus (6xx-MX) to which three PCI bridge chips and the
SP Switch MX or MX2 Adapter are attached. The major difference between
these node types is the clocking rates for the internal buses. The SP Switch
Adapters in these nodes plug directly into the MX bus - they do not use a PCI
slot. The PCI slots in these nodes are clocked at 33 MHz. In contrast, the MX
bus is clocked at 50 MHz in the 332 MHz SMP nodes, and at 60 MHz in the
200 MHz POWER3 SMP nodes. Thus, substantial improvements in the
performance of applications using the Switch can be achieved.
Each node must have internal disks to hold its copy of the operating system.
Since multiple internal disks can be installed in a node, and software loading
is provided by the control workstation or boot/install server across the SP
administrative Ethernet, it is common to have nodes without any peripheral
devices.
The currently supported adapters for MCA nodes and PCI nodes are listed in
Appendix A, “Currently Supported Adapters” on page 517.
This section uses the following node, frame, switch and switch adapter types
to configure SP systems:
Nodes
• 160 MHz Thin node (feature code #2022)
• 332 MHz SMP Thin node (feature code #2050)
• 332 MHz SMP Wide node (feature code #2051)
• POWER3 SMP Thin node (feature code #2052)
• POWER3 SMP Wide node (feature code #2053)
• 200 MHz SMP High node (feature code #2009) (This node has been
withdrawn from marketing.)
• RS/6000 Server Attached node (feature code #9122)
Frames
• Short model frame (model 500)
• Tall model frame (model 550)
• Short expansion frame (feature code #1500)
• Tall expansion frame (feature code #1550)
• SP Switch frame (feature code #2031)
• RS/6000 server frame (feature code #9123)
Switches
• SP Switch-8 (8-port switch, feature code #4008)
• SP Switch (16-port switch, feature code #4011)
Switch Adapter
• SP Switch adapter (feature code #4020)
• SP Switch MX adapter (feature code #4022)
• SP Switch MX2 adapter (feature code #4023)
• SP System attachment adapter (feature code #8396)
Although the RS/6000 SP configurations are very flexible, there are some
basic configuration rules that apply.
Configuration Rule 1
The Tall frame and Short frame cannot be mixed within an SP system.
Configuration Rule 2
If there is a single PCI Thin node in a drawer, it must be installed in the odd
slot position (left side of the drawer).
With the announcement of the POWER3 SMP nodes in 1999, a single PCI
Thin node is allowed to be mounted in a drawer. In this case it must be
installed in the odd slot position (left side). This is because the lower slot
number is what counts when a drawer is not fully populated. Moreover,
different PCI Thin nodes can be mounted in the same drawer, so that you can
install a POWER3 SMP Thin node in the left side of a drawer and a 332 MHz
Thin node in the right side of the same drawer.
Based on configuration rule 1, the rest of this section is separated into two
major parts. The first part provides the configuration rules for using Short
frames, and the second part provides the rules for using Tall frames.
3.8.1.1 Nonswitched Short Frame Configuration
This configuration does not have a switch, and it mounts 1 to 8 nodes. A
minimum configuration is formed by one Short model frame and one PCI Thin
node, or one Wide node, or one High node, or one pair of MCA Thin nodes as
shown in Figure 35.
The Short model frame must be completely full before the Short expansion
frame can mount nodes as shown in Figure 36.
In configuration (a), four Wide nodes and eight Thin nodes are mounted in a
Tall model frame equipped with an SP Switch. There are 4 available switch
ports which you can use to attach SP-attached servers or SP Switch routers.
Expansion frames are not supported in this configuration because there are
Thin nodes on the right side of the model frame.
Configuration Rule 6
If a model frame or switched expansion frame has Thin nodes on the right
side, it cannot support nonswitched expansion frames.
In configuration (b), six Wide nodes and two PCI Thin nodes are mounted in a
Tall model frame equipped with an SP Switch. There also are a High node,
two Wide nodes, and four PCI Thin nodes mounted in a nonswitched
expansion frame. Note that all PCI Thin nodes on the model frame must be
placed on the left side to comply with configuration rule 6. All Thin nodes on
an expansion frame must also be placed on the left side to comply with the
switch port numbering rule. There is one available switch port which you can
use to attach SP-attached servers or SP Switch routers.
In configuration (c), there are eight Wide nodes mounted in a Tall model
frame equipped with an SP Switch and four High nodes mounted in a
nonswitched expansion frame (frame 2). The second nonswitched expansion
frame (frame 3) houses a High node, two Wide nodes and one PCI Thin node.
This configuration occupies all 16 switch ports in the model frame. Note that
the Wide nodes and the PCI Thin node in frame 3 have to be placed on High
node locations.
Now, you can try to describe configuration (d). If you want to add two
POWER3 Thin nodes, what would be the locations?
Figure 39. Example of Single SP-Switch Configurations
Single Stage with Multiple SP-Switches Configuration
If your SP system has 17 to 80 nodes, switched expansion frames are
required. You can add switched expansion frames and nonswitched
expansion frames. Nodes in the nonswitched expansion frame share unused
switch ports that may exist in the model frame and in the switched expansion
frames. Figure 40 shows an example of a Single Stage SP Switch with both
switched and nonswitched expansion frame configurations. There are four SP
Switches, each can support up to 16 processor nodes. Therefore, this
example configuration can mount a maximum of 64 nodes.
If you have an SP Switch frame, you must configure it as the last frame in
your SP system. Assign a high frame number to an SP Switch frame to allow
for future expansion.
Figure 42 on page 85 shows slot numbering for Tall frame and Short frame.
Figure 42. Slot Numbering for Short Frame and Tall Frame
A node's node_number is derived from its frame number and slot position:

   node_number = ((frame_number - 1) x 16) + slot_number

where slot_number is the lowest slot number occupied by the node. Each
type (size) of node occupies a consecutive sequence of slots. For each node,
there is an integer n such that a Thin node occupies slot n, a Wide node
occupies slots n, n+1 and a High node occupies slots n, n+1, n+2, n+3. For
Wide and High nodes, n must be odd.
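A quick worked example (frame and slot values chosen for illustration only):

   frame 4, slot 1:   ((4 - 1) x 16) + 1 = node 49
   frame 6, slot 5:   ((6 - 1) x 16) + 5 = node 85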
Switch boards are numbered sequentially starting with 1 from the frame with
the lowest frame number to that with the highest frame number. Each full
switch board contains a range of 16 switch port numbers (also known as
switch node numbers) that can be assigned. These ranges are also in
sequential order with their switch board number. For example, switch board 1
contains switch port numbers 0 through 15.
Switch port numbers are used internally in PSSP software as a direct index
into the switch topology and to determine routes between switch nodes.
The switch port number to which a node attaches is computed as:

   switch_port_number = ((switch_number - 1) x 16) + switch_port_assigned

where switch_number is the number of the switch board to which the node is
connected and switch_port_assigned is the number assigned to the port on
the switch board (0 to 15) to which the node is connected.
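For example (values chosen for illustration only), a node cabled to port 3 of switch board 2 gets:

   ((2 - 1) x 16) + 3 = switch port number 19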
Figure 44 on page 88 shows the frame and switch configurations that are
supported and the switch port number assignments for each node. Let us
describe each configuration in more detail.
In configuration 1, the switched frame has an SP Switch that uses all 16 of its
switch ports. Since all switch ports are used, the frame does not support
nonswitched expansion frames.
If the switched frame has only Wide nodes, it could use at most eight switch
ports, and so has eight switch ports to share with nonswitched expansion
frames. These expansion frames are allowed to be configured as in
configuration 2 or configuration 3.
(Figure 44: switch port number assignments for configurations 1 through 4: the 16 switch ports (0 to 15) of the switched frame are either all used by its own nodes or shared with the nodes of one or more nonswitched expansion frames.)
Figure 45 shows sample switch port numbers for a system with a Short frame
and an SP Switch-8.
(Figure 45: the nodes of a Short frame assigned switch port numbers 0 through 7 on the SP Switch-8.)
Chapter 4. System Data Repository and System Partitioning
System partitions appear to most subsystems and for most user tasks as
logical SP systems. It is important to understand that from an administrative
point of view, each system partition is a logical SP system within one
administrative domain. Isolation of these logical SP systems is achieved by
partitioning the switch in such a way that switch communication paths for
different system partitions do not cross each other. Of course, normal
TCP/IP-based communication between nodes is still possible using the SP
Ethernet or other external networks.
These two topics are presented together in this chapter because an SP
system is, by default, partitioned, although in most installations there is
only one partition. This fact is reflected in the data structures that
make up the SDR, and in the way that nodes communicate with the control
workstation to retrieve data from the SDR, since they are all in at least one
partition.
The SDR data model consists of classes, objects and attributes. Classes
contain objects. Objects are made up of attributes. Objects have no unique
ID, but their combined attributes must be unique among other objects.
Attributes can be one of three types: strings, floating-point numbers or
integers.
The attributes are the details about a class. For example, some of the
attributes of the Adapter class are: node_number, adapter_type, netaddr.
Two examples of specific classes within the SDR (the Adapter class and the
SP Class) are compared in Figure 46 on page 93.
Class = Adapter
node_number  adapter_type  netaddr      netmask
.....
5            en0           192.168.3.5  255.255.255.0
.....

Class = SP
From these examples, we observe that the Adapter class has many objects
stored within it, whereas the SP class has only one object. As we have
previously mentioned, the attributes that belong to each object have a
datatype associated with them, and the datatypes can have one of three
values:
1. String. For example, the control_workstation attribute within the SP class
is a string consisting of the hostname of the control workstation.
2. Integer. The node_number attribute of the Adapter class has an integer
value.
3. Floating Point. There are currently no attributes of this datatype in the
SDR.
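To see these attributes and their values from the command line, the classes can be queried with the SDRGetObjects command, which is described later in this chapter; the two invocations below are illustrative only.

[sp4en0:/]# SDRGetObjects SP control_workstation
[sp4en0:/]# SDRGetObjects Adapter node_number adapter_type netaddr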
The SDR stores its data within the /spdata/sys1/sdr directory on the control
workstation. The complete SDR directory structure is presented in Figure 47
on page 94.
Figure 47. SDR Directory Structure (each partition subdirectory, named nnn.nnn.nnn.nnn, contains locks, classes and files subdirectories)
Therefore, there are two types of SDR classes: system and partitioned.
• System classes contain objects that are common to all system partitions.
• Partitioned classes contain objects that are specific to a partition. The
definition of a partitioned class is common to all system partitions, but
each system partition contains different objects for the class.
The SDR directory structure shows us there are only eight classes that are
common to all system partitions (the existence of some of these classes
depends on your specific SP configuration). The remaining classes are
duplicated for each partition that is defined. The subdirectory name for the
default partition is the IP address associated with the en0 adapter of the
control workstation. Any additional partitions defined will use the alias IP
address of the en0 adapter of the control workstation.
/spdata/sys1/sdr/archives
Contains archived copies of the SDR for backup purposes. A snapshot of the
contents of the SDR directories are saved in a tar file, and this file can be
used to restore the SDR to a known working state.
/spdata/sys1/sdr/defs
Contains the header files for all the object classes (system and partitioned).
Each file describes the fields and attributes that are used to define the object
of the class.
/spdata/sys1/sdr/partition
An /spdata/sys1/sdr/partition/ <syspar_ip_addr> subdirectory is maintained
for each partition. There will always be at least one subdirectory here,
corresponding to the default partition. Object classes are replicated for each
partition and each class keeps information on the objects specific to the
partition.
/spdata/sys1/sdr/system
This subdirectory contains classes and files that are common to all partitions
within an SP system. Some classes exist in this subdirectory only if explicitly
created; for example, the SysNodeGroup Class will only be present if node
groups are defined.
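A quick way to inspect this layout on the control workstation is simply to list the directories; this sketch assumes a default single-partition system, and the partition subdirectory name shown (192.168.4.140) is just the example address used elsewhere in this chapter.

[sp4en0:/]# ls /spdata/sys1/sdr
archives   defs   partition   system
[sp4en0:/]# ls /spdata/sys1/sdr/partition
192.168.4.140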
The control workstation does not belong to any system partition. Instead, it
provides a central control mechanism for all partitions that exist on the
system.
If you choose one of the supplied layouts, your partitioning choice is "switch
smart": your layout will still be usable when the switch arrives. This is
With the introduction of coexistence support in PSSP 2.2, and the availability
of the SP Switch in 1996, which allowed the isolation of switch faults to the
failing node alone, the need to consider partitioning as a possible solution to
these operational requirements has been reduced.
Attention
The control workstation must always be at the highest level of PSSP and
AIX that is used by the nodes.
There is one SDR daemon for each partition. Each daemon is uniquely
identified by an IP address. In the case of the default partition, this is the IP
address associated with the hostname interface of the control workstation.
For other partitions, this address is the alias IP address that was established
for this same interface. The startup process passes the IP address
associated with the partition to the respective daemon.
Multiple SDR daemons are started by using the group option of the startsrc
command.
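For example, the SDR daemons can be stopped and started as a group through the SRC (the group name sdr is the same one used with the lssrc example later in this chapter):

# stopsrc -g sdr
# startsrc -g sdr
# lssrc -g sdr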
When a node is initially booted, the /etc/rc.sp script, started from inittab, looks
for the environment variable SP_NAME to retrieve the IP address of the
partition of which it is a member. The SP_NAME variable is not set by default,
and if the system administrator has a requirement to execute
partition-sensitive commands, then this variable has to be explicitly set and
exported. If it is not set, the /etc/SDR_dest_info file is referenced.
This file contains the IP address and name for the default and primary
partitions. The contents of this file, extracted from a node, are shown below.
The default stanza is set to the IP address associated with the default system
partition (the hostname of the control workstation). The primary stanza is set
to the IP address associated with the system partition of which the node is a member.
An example of how a node in the default partition interacts with the SDR
daemon sdrd to get access to partition information is shown in Figure 48 on
page 100.
/etc/SDR_dest_info
--
default:192.168.4.140
primary:192.168.4.140
nameofdefault:sp4en0
nameofprimary:sp4en0
A reboot (or an initial boot) of the node results in the following (high-level)
steps being carried out.
1. During /etc/inittab processing, the /etc/rc.sp script is executed. Among
other actions, this script reads the contents of the /etc/SDR_dest_info that
resides on the node.
2. The SP_NAME environment variable is set to the default address that it
finds in this file.
3. Using an SDR API call, the node makes a request to the sdrd daemon on
the control workstation for the Syspar_map object corresponding to its
node_number.
4. The value of the primary field of the node’s /etc/SDR_dest_info file is
compared with the value of syspar_addr attributes returned from the
Syspar_map class in the SDR. If the values differ, then the primary field is
updated with the syspar_addr value.
This step is repeated for the default, nameofprimary and nameofdefault
stanzas. The fields of the latter two stanzas are compared with the value of
the syspar_name attribute.
We can use a similar logic flow to consider the case where an SP system has
been partitioned on the control workstation. To simplify this example, we
assume that an alias IP address has been set up for the control workstation,
and that the hostname corresponding to the alias has been added to the
/etc/hosts file. The system has been partitioned in two: one partition consists
of nodes in slots 1 to 8, and the second partition consists of nodes in slots 9
to 16.
When the node in the second partition is being rebooted, the logic flow is as
shown in Figure 49.
Node sp4n09 (node_number 9)                  SDR server sdrd.sp4en0

/etc/inittab
--
sp:2:wait:/etc/rc.sp > /dev/console 2>&1

/etc/SDR_dest_info (as read at boot)         /etc/SDR_dest_info (after the update)
--                                           --
default:192.168.4.140                        default:192.168.4.140
primary:192.168.4.140                        primary:192.168.4.141
nameofdefault:sp4en0                         nameofdefault:sp4en0
nameofprimary:sp4en0                         nameofprimary:sp4en1

Figure 49. Reboot of a Node That Has Moved to a New System Partition
In this example, we see that the node recognizes that it is now in a new
partition. It updates the /etc/SDR_dest_info file (the primary and
nameofprimary stanzas) with the address and hostname of the partition that it
now belongs to. It uses these fields to establish communication with the
appropriate sdrd daemon on the control workstation.
Whenever the SDR daemons are started up, they check for the presence of
any lock files in case they were left behind when an update of the SDR ended
abnormally. If any are found at this point, they are removed. This is another
reason why it is a good idea to recycle the SDR daemons when there are
apparent SDR problems.
SMIT
SMIT is by far the most commonly used method for interacting with the SDR.
Once PSSP is successfully installed, SMIT menus under the “RS/6000 SP
System Management” menu let you display and change much of the information held in the SDR.
Through SP commands
The SDR can also be accessed through high-level PSSP commands. One
such command is splstdata, which (among other things) can be used to query
node, frame, and boot/install server information from the SDR.
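For example, the following invocations display different parts of the SDR; the flags shown (-n for node configuration, -b for boot/install information, -f for frame information) are common splstdata options, but verify them against the command reference for your PSSP level.

[sp4en0:/]# splstdata -n
[sp4en0:/]# splstdata -b
[sp4en0:/]# splstdata -f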
Before SDR commits changes to a class, the class contents are moved to a
backup file in the same directory as the class. The backup file is named
<class>.shadow, where <class> is the name of the class being written. If a
power loss occurs or the SDR daemon is killed while the SDR class is being
written, the class file that was being written at the time may not exist or may
be corrupted. You should check for the existence of a <class>.shadow file. If
one exists, take the following steps to restore the class.
1. Remove the corrupted class file (if one exists).
2. Rename the <class>.shadow file to <class>.
3. Restart the SDR daemon (sdrd) by issuing sdr reset.
The SDR daemon will restart automatically and recognize the renamed
shadow file as the class.
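A hedged sketch of these recovery steps, assuming the damaged class is the partitioned Node class and that class files live under /spdata/sys1/sdr/partition/<syspar_ip_addr>/classes (the IP address below is the example address used in this chapter):

cd /spdata/sys1/sdr/partition/192.168.4.140/classes
rm Node               # remove the corrupted class file, if it still exists
mv Node.shadow Node   # promote the shadow copy to be the class file
sdr reset             # restart the sdrd daemons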
You must be careful when making changes to the SDR, either directly
through the SDR commands or through the SMIT dialogs. If not, your
actions may result in a corrupted SDR with unpredictable consequences.
When using the SDR commands, the syntax outlined in IBM PSSP for AIX
Command and Technical Reference Vol 2, SA22-7351 must be used.
When making changes to attributes of existing objects in a class, you can use
the SDRChangeAttrValues command. With this command, you query the
attribute that needs changing and assign it a new value. This command (and
others) uses a specific syntax to confirm the value of an attribute before it is
changed. The syntax of this command is:
SDRChangeAttrValues class_name [ attr==value ... ] attr=value ...
The double equal signs (==) signify comparison, and if the value of the
attribute in the SDR is not the same as that specified in the command, then
the change will not be made.
Since these commands require the user to be root, the SDR offers some
degree of security.
For example, let us suppose that a sensitive configuration file is stored
in /etc/test_file.conf on the control workstation, and you would like to
distribute it securely to the nodes. You can store this file in the SDR
with the SDRCreateSystemFile command.
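The exact invocation is sketched here rather than quoted: the argument order shown (the AIX source file followed by the name under which it is stored in the SDR) is an assumption, so check the command reference before using it.

[sp4en0:/]# SDRCreateSystemFile /etc/test_file.conf TestFile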
This command creates an SDR file called TestFile that can be retrieved from
any system partition. (The SDRCreateFile command creates a partitioned
SDR file from the AIX file, that is, it can only be retrieved from a node that is a
member of the partition it was created in.) To retrieve this file on any node in
the SP complex, you would enter:
[sp4n06:/]# SDRRetrieve TestFile /etc/test_file.conf
SDR_test
This command verifies that the installation and configuration of the SDR
completed successfully. The test clears out a class name, creates the class,
performs some tasks against the class, and then removes it.
SDRGetObjects
This command lists the contents of attributes in the specified object. Some
useful objects to remember include:
• host_responds
• switch_responds
• Syspar
• Syspar_map
• Node
• SP
• Adapter
You can filter the output by specifying the required attributes as arguments.
For example, to query the SDR Adapter class for the node number and
network address of all switch adapters, enter:
[sp4en0:/]# SDRGetObjects Adapter adapter_type==css0 node_number netaddr
node_number netaddr
1 192.168.14.1
5 192.168.14.5
6 192.168.14.6
7 192.168.14.7
8 192.168.14.8
9 192.168.14.9
10 192.168.14.10
11 192.168.14.11
Remember, the double equal signs signify comparison, and because we have
specified additional attributes as arguments, only those attributes will be
listed in the output.
SDRChangeAttrValues
This command changes attribute values of one or more objects. Remember,
this is a “last resort” action and should never be performed if there is another
method, through command-level actions, to perform the same function. For
example, the SP Switch will not start if an Oncoming Primary node is fenced,
but we cannot use the Eunfence command to unfence the node, since the
switch is not started. So, in this case, we have to change this attribute in the
SDR using the SDRChangeAttrValues command. The command with the
relevant arguments is:
[sp4en0:/]# SDRChangeAttrValues switch_responds node_number==7 isolated=0
[sp4en0:/]#
SDRListClasses
This command outputs all of the class names (system and partitioned)
currently defined in the SDR to standard output. It has no flags or arguments.
SDRWhoHasLock
This command returns the transaction ID of a lock on a specified class. The
form of the transaction ID is host_name:pid:session, where host_name is the
long name of the machine running the process with the lock, pid is the
process ID of the process that owns the lock, and session is the number of
the client’s session with the SDR.
SDRClearLock
This command unlocks an SDR class.
Attention
Use this command carefully and only when a process that obtained a lock
ended abnormally without releasing the lock. Using it at any other time
could corrupt the SDR database.
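A hedged example of checking for and then clearing a lock on the Node class; both commands are assumed here to take the class name as their only argument.

[sp4en0:/]# SDRWhoHasLock Node
[sp4en0:/]# SDRClearLock Node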
A new log file is created every time the SDR daemon is started. Once the
contents are checked, the files can be either archived or discarded, except for
the file that corresponds to the daemon that is currently running. To find the
current log file, find the process ID of the SDR daemon using lssrc -g
sdr. The current log file has that process ID as its extension.
The logs contain the date and time the process started, along with any
problems the daemon may have run into.
If you need to restore the SDR to a previous state, you must use the PSSP
command SDRRestore.
This command will remove the contents of the SDR and retrieve the archived
contents of the backup file. The backup file must be in the
/spdata/sys1/sdr/archives directory. Any new SDR daemons that represent
partitions in the restored SDR are then started and any SDR daemons (from
old partitions) that are not in the new SDR are stopped.
Each new PSSP release may introduce new SDR classes and attributes.
Use caution when using SDRArchive and SDRRestore in order to avoid
overwriting new SDR classes and attributes.
We suggest that after a migration you do not execute SDRRestore with an
archive taken on a back-level system, since doing so will overwrite any new
SDR classes and attributes.
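For example (the archive file name shown is purely illustrative; SDRArchive generates a name based on the date and time and places it in /spdata/sys1/sdr/archives):

[sp4en0:/]# SDRArchive
[sp4en0:/]# SDRRestore backup.99125.1430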
One main task of the control workstation for the RS/6000 SP system is to
provide a centralized hardware control mechanism. In a large SP
environment, the control workstation makes it easier to monitor and control
the system. The administrator can power a node on or off, update supervisor
microcodes, open a console for a node and so on, without having to be
physically near the node. This chapter describes how the hardware control
subsystem works and where it is applied.
Frame
Hardware components in a frame that can be monitored include:
• Power LEDs
• Environment LEDs
• Switch
• Temperature
• Voltage
• State of the power supply
Node
Hardware components in a node that can be monitored include:
• Three-digit or LCD display
• Power LEDs
• Environment LEDs
• Temperature
• Voltage
• Fan
Switch
Hardware components in a switch that can be monitored include:
• Temperature
• Voltage
• Fan
• Power LEDs
• Environment LEDs
• MUX
Attention

A daemon called hardmon runs on the Monitor and Control Node (MACN),
which, as far as hardmon is concerned, is simply another name for the
control workstation; the MACN has the RS-232 serial cables connected to the
frames. The daemon constantly polls the frames, by default once every 5
seconds.

Do not change this default poll rate, as doing so can cause unpredictable
results in some client commands.
This daemon is invoked during the boot process by the /etc/inittab file and is
controlled by the System Resource Controller (SRC) subsystem. Unless the
daemon is specifically stopped with the stopsrc command, the SRC subsystem
will restart it on termination.
The hardmon daemon first obtains configuration information from the System
Data Repository (SDR). The SP_ports object class provides the port number
this daemon uses to accept TCP/IP connections from client commands. The
Frame object class provides the frame number, tty port, MACN name (also
backup_MACN name in a HACWS environment), hardware_protocol, s1_tty
and so on. Once this information is acquired, the hardmon daemon opens up
the tty device for each frame and begins polling for state information over the
RS-232 serial line, using SLIP protocol. The first thing this daemon checks for
is the Frame ID. Any discrepancy will disable any control over the frame.
However, monitoring of the frame is still allowed. To correct a Frame ID
mismatch, the hmcmds command is used.
While the hardmon daemon provides the underlying layer for gathering
hardware state information, client commands spmon, hmmon, hmcmds, s1term, and
nodecond interact with it to monitor and control the hardware.
The hardmon daemon reports variables representing the state of the frame,
node and switch hardware. These state variables can be queried with the
spmon and hmmon commands. For a full list of available state variables, use the
following command:
# hmmon -V
The splogd daemon runs alongside the hardmon daemon. It provides error
logging capability for all errors detected by hardmon. To define what logging
is to be done and what user exits should be called, splogd uses the
/spdata/sys1/spmon/hwevents file.
Figure 50. Hardware Monitor Components on the Control Workstation (hardmon, splogd, setup_logd, errdemon, the SDR, and the files /spdata/sys1/spmon/hmacls, /spdata/sys1/spmon/hmthresholds and /spdata/sys1/spmon/hwevents)
When the hardmon daemon starts, one of the things it checks for is the
hardware_protocol field in the SDR’s Frame class. If this field is SP, it treats
the frame as a regular SP frame and begins its communications to the
supervisor cards. If the field indicates SAMI, it recognizes the attached server
as an external S70/S7A and begins communicating with the S70/S7A’s
Service Processor. If the field specifies SLIM, the daemon recognizes the
attached server as an SP-attached Netfinity server. You can display the
hardware_protocol setting of each frame with the following command:
# SDRGetObjects Frame
Although you can access these files (as well as other SDR files) directly, do
not modify them manually. All changes must be made through the proper
installation and customization procedures.
Attention
Support for SP-attached Netfinity Servers does not come with the default
PSSP codes. It can be purchased as a separate product and is available
only on a Programming Request for Price Quotation (PRPQ) from IBM. The
PRPQ number is P88717.
For the S70/S7A servers, the s70d daemon is started by the hardmon
daemon. For each server attached, one daemon is started. Figure 51 on page
114 shows the hardware monitor components and relationships for the
S70/S7A attachment.
The s70d daemon provides the interface for hardmon to communicate with
the S70/S7A hardware. Since the S70/S7A’s SAMI interface uses the SAMI
protocol, this daemon has the function of translating hardmon commands into
SAMI protocol.
The s70d daemon consists of two interfaces: the frame supervisor interface
and the node supervisor interface. The frame supervisor interface is
responsible for keeping the state data in the frame’s packet current and
formatting the frame packet for return to the hardmon daemon. The node
supervisor interface is responsible for keeping the state data in the node’s
packet current and translating the commands received from the frame
supervisor interface into SAMI protocol before sending them to the service
processor on the servers.
Figure 51. Hardware Monitor Components and Relationships for the S70/S7A Attachment (hardmon, splogd, setup_logd, errdemon, the SDR, /spdata/sys1/spmon/hmacls, /spdata/sys1/spmon/hmthresholds, /spdata/sys1/spmon/hwevents, /var/adm/SPlogs/SPdaemon.log)
The nfd daemon provides the interface for hardmon to communicate with the
Netfinity server’s hardware. It emulates the frame supervisor card as well as
the node supervisor card. This daemon translates hardmon commands to
SLIM language for the service processor on the server. Information sent back
by the service processor gets translated into supervisor data packets for
hardmon to digest. An RS-232 serial line provides the channel for hardware
data flow. Through this line, the daemon polls the servers for hardware states
and sends them to hardmon. Unlike the S70/S7A attachment, where another
serial line allows the opening of a virtual console, the Netfinity server runs
Windows NT or OS/2 and thus does not support virtual console.
Figure 52. Hardware Monitor Components for the Netfinity Attachment (hardmon, splogd, setup_logd, errdemon, the SDR, the frame supervisor card emulation over the RS-232/SLIM connection, /spdata/sys1/spmon/hmthresholds, /spdata/sys1/spmon/hwevents)
Attention
Examples listed here are based on the current setup in our laboratory.
Variables used may not be applicable in your hardware environment.
Check against the listed variables in PSSP: Administration Guide,
SA22-7348 for your hardware configuration.
hmmon
The hmmon command monitors the state of the SP hardware in one or more SP
frames. For a list of the state variables which this command reports, refer to
the output of the following command:
# hmmon -V
To list all the hardware states at a particular point in time, the following
command is used:
# hmmon -G -Q -s
This command provides information about the frames, switches and nodes.
Without the -G flag, only node information for the current partition will be
listed. To monitor for changes in state, leave out the -Q flag; the command
will then continuously check for state changes, and any condition that
changes will be reflected on the screen where the command was executed.
A common usage for this command is to detect whether the LED on a node is
flashing. Nodes that crash with flashing 888 LED codes can be detected by
this command.
For example, to detect if the LED on Frame 1 Node 5 is flashing, issue the
following command:
If the LED is flashing, the value returned will be TRUE. The output will be as
follows:
frame 001, slot 05:
7 segment display flashing TRUE
As another example, to detect if the power switch is off for a frame (maybe
due to a power trip) on power module A, use the following command:
If nothing is wrong with this power module, the value returned is FALSE as
shown in the following:
frame 001, slot 00:
AC-DC section A power off FALSE
spmon
The spmon command can be used for both monitoring and controlling the
system. There is a whole list of tasks this command can perform. Included
here are examples showing the common usage of this command.
# spmon -G -d
3. Querying frame(s)
1 frame(s)
Check ok
4. Checking frames
5. Checking nodes
--------------------------------- Frame 1 -------------------------------------
Frame Node Node Host/Switch Key Env Front Panel LCD/LED is
Slot Number Type Power Responds Switch Fail LCD/LED Flashing
-------------------------------------------------------------------------------
1 1 high on yes autojn normal no LCDs are blank no
5 5 thin on yes yes normal no LEDs are blank no
6 6 thin on yes yes normal no LEDs are blank no
7 7 thin on yes yes normal no LEDs are blank no
8 8 thin on yes yes normal no LEDs are blank no
9 9 thin on yes yes normal no LEDs are blank no
10 10 thin on yes yes normal no LEDs are blank no
11 11 thin on yes yes normal no LEDs are blank no
12 12 thin on yes yes normal no LEDs are blank no
13 13 thin on yes yes normal no LEDs are blank no
14 14 thin on yes yes normal no LEDs are blank no
15 15 wide on yes yes normal no LEDs are blank no
The next example shows how you can power off a node. Use the following
command with caution, since you can accidentally power off the wrong node
if you specify the wrong node number. This command also does not issue a
shutdown sequence prior to powering the node off.
An example of using this command to turn the key mode of all nodes in
Frame 1 to Service is as follows:
s1term
The s1term command enables an administrator to open up a virtual console
on a node. The default read-only option allows an administrator to monitor the
processes going on in a node. The read/write option allows the administrator
to enter commands to administer the node. The following command opens a
console with read/write option on Frame 1 Node 15.
# s1term -w 1 15
Attention
There can only be one read/write console at a time for a particular node. If
there is already a read/write console opened, the nodecond command will
fail, since it will attempt to open a read/write console.
nodecond
The nodecond command is used for conditioning a node, where it is used to
obtain the Ethernet hardware address or initiate a network boot. This
command is discussed in further detail in Chapter 10, “Installation and
Configuration” on page 257.
5.3.2 SP Perspectives
The graphical version of spmon is no longer available in PSSP 3.1. Instead,
use the sphardware command to monitor and control the SP hardware from SP
Perspectives. Alternatively, the perspectives command can be invoked, which
will start the SP Perspectives Launch Pad. An icon indicating Hardware
Perspective will lead you to the same screen; this is illustrated in Figure 53 on
page 120. For information relating to SP Perspectives, refer to Chapter 11,
“System Monitoring” on page 293.
After selecting the Power Off option, another window appears, indicating the
nodes selected and prompting the user to select the required mode before
continuing with the task. This is shown in Figure 58 on page 125. To shut
down and power off a node, select the Power Off and Shutdown options.
Clicking Apply will cause the node to be shut down and then powered off.
The next example shows how you can add new panes in a new window to
monitor the frame, switch and host_responds for the nodes. To add a new
pane, click View and select Add Pane. A window will appear as shown in
Figure 59; select the pane type. In the example shown, the Frames and
Switches pane was selected. You have the option of adding this new pane to
a new window or keeping it in the existing window. Here, we added it to a new
window.
After selecting the Apply button to confirm, a new window with the new pane
appears. Select the Add Pane option on this new window to add a new
Nodes pane to this current window. The resulting window is shown in Figure
60 on page 126.
To start monitoring the host_responds for all the nodes, select a node in the
Nodes pane. Choose Set Monitoring from the View pull-down menu. A
window will appear as shown in Figure 61 on page 127.
Select the hostResponds condition and click Apply. The nodes are now set
for monitoring the host_responds. The green color indicates the
host_responds is up. A red cross sign means the host_responds is down. You
can choose to select more than one condition to monitor. To do so, click the
conditions while holding down the Control key on the keyboard. Figure 62 on
page 128 shows the Nodes pane set to monitor the host_responds of all the
nodes. Note that Node 6 had been powered off.
You can view or modify the properties of any nodes, frames or switches, by
double clicking their respective icons. To illustrate this example, we double
click on Frame 1. A window showing the properties appears as shown in
Figure 63 on page 129.
To power off the frame or switch, set your system partition to Global. Select
Change System Partition from the View pull-down menu. A window will
appear as shown in Figure 64. Select Global and click Ok.
To power off Frame 1, click on Frame 1 in the Frames and Switches pane
and select Power Off from the Actions pull-down menu. A warning window
will appear, as shown in Figure 65 on page 130. Clicking Enter will power off
the frame.
In a single computing system, there is one clock which provides the time of
day to all operating system services and applications. This time might not be
completely accurate if it is not synchronized with an external source like an
atomic clock, but it is always consistent among applications.
In a distributed computing system like the SP, each node has its own clock.
Even if they are all set to a consistent time at some point, they will drift away
from each other since the clocking hardware is not perfectly stable.
Consequently, the nodes in such a distributed system have a different notion
of time.
Inconsistent time is a critical problem for all distributed applications which rely
on the order of events. An important example is Kerberos, the SP
authentication system: it is a distributed client/server application and encodes
timestamps in its service tickets. Another example is the RS/6000 Cluster
Technology, in particular the Topology Services, which send heartbeat
messages containing timestamps from node to node. On the application level,
tools like the make utility check the modification time of files and may not
work correctly when called on different nodes which have inconsistent time
settings.
Stratum 0 (primaries), Stratum 1 (secondaries, peers), Stratum 2 (secondaries)
Figure 66. NTP Hierarchy
NTP computes the clock offset, or time difference, of the local client relative
to the selected reference clock. The phase and frequency of the local clock are
then adjusted, either by a step-change or a gradual phase adjustment, to
reduce the offset to zero. The accuracy of synchronization depends on a
host’s position in the stratum hierarchy and the duration of synchronization:
whereas a few measurements are sufficient to establish local time to within a
second, long periods of time and multiple comparisons are required to resolve
the frequency error of the local clock.

# On regular nodes:
#
server 192.168.3.130
server 192.168.3.11
server 192.168.3.17
The script ntp_config also starts the NTP daemon, xntpd, by running
/etc/rc.ntp. Be aware that although xntpd is known to the AIX System
Resource Controller (SRC), rc.ntp starts it directly and so lssrc -s xntpd will
show xntpd to be inoperative even if the daemon is running. NTP status
information can be obtained via the xntpdc command, for example xntpdc -c
help or xntpdc -c sysinfo.
Figure 67. DTS Clerks and Servers in a LAN (registered in the /.:/lan-profile)
A global time server is a time server outside the LAN. Global time servers usually
synchronize their time with external time providers. Local DTS servers are
categorized as courier, backup courier, and noncourier servers depending on
their relation to global time servers. A courier time server is a time server in a
LAN which synchronizes with a global time server outside the courier’s LAN.
So courier time servers import an outside time to the LAN. Backup courier
time servers become courier time servers if no courier is available in the LAN.
Noncourier time servers never synchronize with global time servers. Figure
68 on page 137 shows the different local and global time servers.
Figure 68. Local (Courier, Backup Courier, Noncourier) and Global DTS Servers
In order to ensure that the distributed systems’ clocks are also synchronized
to the correct time, DTS provides a time-provider interface which can be used
to import a reference source like an atomic clock to a DTS server. This time,
which can be very accurate, is then communicated to the other DTS servers.
DTS can coexist with NTP in the sense that NTP servers can provide
time to a DTS system, and DTS servers can provide time to an NTP domain.
After this configuration and after each reboot, the DCE client services will be
automatically started by /etc/rc.dce. DTS is implemented by the dtsd
daemon, whose personality is controlled by command line options: dtsd -c
starts it as a clerk, and dtsd -s runs dtsd as a server. In server mode, dtsd
can be further configured as follows:
• By default, dtsd runs as a backup courier time server.
• Specifying -k courier or -k noncourier starts dtsd as courier or noncourier
time server, respectively.
• The -g flag indicates that dtsd should run as a global time server.
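For example, a server that should import time from a global time server outside the LAN could be started as follows (a minimal sketch based only on the flags just described):

# dtsd -s -k courier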
After the DCE client configuration, the use of DTS should be transparent. The
DTS time, including the inaccuracy described by the interval representation,
can be queried through an API or through the command dtsdate:
> date
Tue Feb 9 04:17:25 CET 1999
> dtsdate
1999-02-09-04:17:18.356+01:00I4.022
The format of the dtsdate output is date and time in Universal Coordinated
Time (UTC) in the form YYYY-MM-DD-hh:mm:ss.sss, followed by the Time
Differential Factor (TDF) for use in different time zones, followed by the
inaccuracy in seconds.
Depending on the size and network connectivity of the SP (and the rest of
the DCE cell), it might be advisable to set up additional DTS servers inside
the SP, or even to establish a separate LAN profile for the SP.
This chapter discusses some security issues which are of special importance
on the SP. With respect to security, the SP basically is a cluster of RS/6000
workstations connected on one or more LANs, with the control workstation
serving as a central point to monitor and control the system. Therefore, the
SP is exposed to the same security threats as any other cluster of
workstations connected to a LAN. The control workstation, as the SP’s single
point of control, is particularly sensitive: if security is compromised here, the
whole system will be affected.
Computer security is a large topic, and this book does not attempt to replicate
the abundant literature on security of standalone or networked workstations.
Apart from the fact that most security exposures are still caused by poorly
chosen passwords, the insecure network introduces several new security
threats. Both partners of a client/server connection may be impersonated,
that is, another party might pretend to be either the client (user) or server.
Connections over the network can be easily monitored, and unencrypted data
can be stored and reused. This includes capturing and replaying of user
passwords or other credentials, which are sent during the setup of such
client/server connections.
A common way to prevent impersonation and ensure the integrity of the data
that is transferred between client and server is to set up a trusted third party
which both client and server trust to vouch for each other's identity; this is the
approach taken by Kerberos.
Although written for AIX 4.1, most of its content is still valid for AIX 4.3. Here
we focus on some issues which are not discussed in detail in that redbook, or
have been newly introduced into AIX since then.
method is discouraged, since plain text passwords should not be
stored in the (potentially remote and insecure) file system.
rexec Same as ftp. As mentioned previously, use of $HOME/.netrc files
is discouraged.
With AIX 4.3.1, all these commands except rexec also support Kerberos
Version 5 authentication. The base AIX operating system does not include
Kerberos; we recommend that DCE for AIX Version 2.2 be used to provide
Kerberos authentication. Note that previous versions of DCE did not make the
Kerberos services available externally. However, DCE for AIX Version 2.2,
which is based on OSF DCE Version 1.2.2, provides the complete Kerberos
functionality as specified in RFC 1510, The Kerberos Network Authentication
Service (V5).
For backward compatibility with PSSP 3.1 (which still requires Kerberos
Version 4 for its own commands), the AIX r-commands rcp and rsh also
support Kerberos Version 4 authentication. See 7.3, “How Kerberos Works”
on page 148 for details on Kerberos.
Attention
On the SP, the chauthent command should not be used directly. The
authentication methods for SP nodes and the control workstation are
controlled by the partition-based PSSP commands chauthpar and
lsauthpar. Configuration information is stored in the Syspar SDR class, in
the auth_install, auth_root_rcmd and auth_methods attributes.
The Kerberized rsh and rcp commands are of particular importance for the SP,
as they replace the corresponding Kerberos Version 4 authenticated
r-commands which were part of PSSP Versions 1 and 2. Although the PSSP
versions of rsh, rcp and kshd have been removed from PSSP 3.1, it still
includes and uses the Kerberos Version 4 server. This Kerberos server can
be used to authenticate the AIX r-commands. A full description of the
operation of the rsh command in the SP environment, including all three
possible authentication methods, can be found in 7.5.2,
“Remote Execution Commands” on page 173.
7.2.2 Securing X11 Connections
If configured improperly, the X Windows system can be a major security hole.
It is used to connect X servers (machines with a graphical display like the SP
administrator’s workstation) with X clients (machines which want to display
graphical output on an X server, like the SP control workstation). If the X
server is not secured, then everybody can monitor, or even control, the X
server’s resources. This includes the keyboard and screen, so everything
which is typed or displayed on an unprotected X server can be monitored.
> xhost +
access control disabled, clients can connect from any host
> xhost -
access control enabled, only authorized clients can connect
> xhost + sp5cw0
sp5cw0 being added to access control list
> xhost
access control enabled, only authorized clients can connect
sp5cw0.itso.ibm.com
To limit access to specific users (for example, the root user on sp5cw0), the
Xauthority mechanism is used. When the X server starts up, it generates a
secret key, often called a magic cookie, and stores it in the .Xauthority file of
the user who started the server.
The local user who started the X server immediately has access to it, but all
other users on the X server machine and all users on other X client machines
first need to get this key. This transfer has to be secured, of course. Securely
transferring the key can be challenging. Using an NFS-mounted file system,
for example, cannot be considered secure since it is relatively easy to bypass
file access permissions of NFS-exported file systems. If the shared file
system is in AFS or DFS, this is much more secure. If there is no shared file
system, an actual copy has to be performed, which might also expose the key.
The xauth command can be used on the client machine to add the cookie to
the .Xauthority file of the user whose processes want to access the X server,
as shown in the following example:
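A hedged illustration follows; the X server's display name adminws:0 and the 32-digit hexadecimal key are placeholders, not values from a real system.

> xauth add adminws:0 MIT-MAGIC-COOKIE-1 0123456789abcdef0123456789abcdef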
The xauth list command displays the contents of the .Xauthority file,
showing one line per <hostname>:<display> pair. The cookies are displayed
as a string of 32 hex digits. Individual keys for a <hostname>:<display> pair
can be added and removed by the add and remove options, using the same
format as the list output. Each time the X server starts, it creates a new
cookie which has to be transferred to its clients.
The secure shell described in 7.2.3, “Secure Shell” on page 146 transparently
encrypts all X traffic in a separate channel. On the remote node, a new
.Xauthority file is created, and the cookie is used to secure the
communication between the sshd daemon and the X client. On the local
node, the cookie is necessary to set up a secure connection between the X
server and the ssh client. This is very convenient, as both the text-based login
and any X Windows traffic are automatically protected.
The standard method to circumvent the second problem is to use a public key
method, where encryption is asymmetric. Such encryption methods use two
keys, a private key and a public key. Messages encrypted with one of these
keys can only be decrypted with the other one. Using this system, there is no
need to share a secret key. The public keys of all participants are publicly
available, and the private key needs to be accessible only by its owner.
The Secure Shell (SSH) is a de facto industry standard for secure remote
login using public key encryption. It is widely used, and available for almost
every operating system (which also solves the first problem mentioned
previously). SSH provides secure remote login through the ssh or slogin
commands, and a secure remote copy command scp. As mentioned in 7.2.2,
“Securing X11 Connections” on page 144, a very useful feature of SSH is its
ability to transparently and securely handle X Windows traffic. We therefore
highly recommend that you use the Secure Shell for remote login and remote
data copy.
More information on the Secure Shell can be found at the SSH home page:
http://www.ssh.fi/sshprotocols2/
DCE for AIX has its own user management infrastructure, and provides a
loadable authentication module /usr/lib/security/DCE to be used with the
standard AIX login mechanism. To enable an integrated login, which provides
access to the machine and at the same time establishes a DCE context for
the user (including DCE and Kerberos Version 5 credentials), the DCE
authentication module should be added as a stanza to /etc/security/login.cfg
and configured for all non-system users in /etc/security/user. The standard
AIX users, notably root, should normally be authenticated through the local
user database only, so that they can still access the system when DCE is
down. The relevant stanzas are:
/etc/security/login.cfg:

DCE:
        program = /usr/lib/security/DCE

/etc/security/user:

default:
        auth1 = SYSTEM
        auth2 = NONE
        SYSTEM = "DCE"
        dce_export = false
        ...

root:
        SYSTEM = "files"
        registry = files
        ...
More details can be found in the DCE for AIX documentation. Note that if the
integrated DCE login is enabled, then by default all principals in the DCE cell
will be able to log in to that host. This is probably not what is intended, but
unfortunately DCE does not provide an easy means to handle this situation.
There is a mechanism to restrict access to a host by means of a
passwd_overwrite file, which must exist locally on the host and excludes
users listed in the file from login. However, maintaining this file for all hosts in
the DCE cell is laborious, and care has to be taken that each time a principal
is added to the DCE cell the passwd_overwrite file is updated on all hosts in
that cell to reflect this addition.
It would be more convenient to include users, rather than exclude all others.
Unfortunately, such a feature is not available in DCE.
Kerberos
Also spelled Cerberus - The watchdog of Hades, whose duty was to guard
the entrance (against whom or what does not clearly appear) ... It is known
to have had three heads.
This section describes the protocol that Kerberos uses to provide these
services, independently of a specific implementation. A more detailed
rationale for the Kerberos design can be found in the MIT article Designing an
Authentication System: a Dialogue in Four Scenes available from the
following URL: ftp://athena-dist.mit.edu/pub/kerberos/doc/dialogue.PS
This approach must be distinguished from public key cryptography, which is
an asymmetric encryption method. There, two keys are used: a public key
and a private key. A message encrypted with one of the keys can only be
decrypted by the other key, not by the one which encrypted it. The public keys
do not need to be kept secret (hence the name “public”), and a private key is
only known to its owner (it is not even known to the communication partner,
as it would be in the case of symmetric cryptography). This has the advantage
that no key must be transferred between the partners prior to the first use of
encrypted messages.
This ticket is encrypted with the secret key of the TGS, so only the TGS can
decrypt it. Since the client needs to know the session key, the AS sends back
a reply which contains both the TGT and the session key, all of which is
encrypted by the client’s secret key. This is shown in Figure 71 on page 152.
Now the sign-on command prompts the user for the password, and generates
a DES key from the password using the same algorithm as the Kerberos
server. It then attempts to decrypt the reply message with that key. If this
succeeds, the password matched the one used to create the user’s key in the
Kerberos database, and the user has authenticated herself. If the decryption
failed, the sign-on is rejected and the reply message is useless. Assuming
success, the client now has the encrypted Ticket-Granting Ticket and the
session key for use with the TGS, and stores them both in a safe place. Note
that the authentication has been done locally on the client machine; the
password has not been transferred over the network.
If the client sent only the (encrypted) TGT to the Kerberos TGS, this might be
captured and replayed by an intruder impersonating the client. To protect
against such replay attacks, the client also generates an authenticator, which
contains the client’s name, its IP address and a current timestamp.
The authenticator is encrypted with the session key that the client shares with
the TGS. The client then sends a request to the TGS consisting of the name
of the service for which a ticket is requested, the encrypted TGT, and the
encrypted authenticator. This is shown in Figure 72.
Figure 72. Requesting a Service Ticket (the request contains the service name, the TGT encrypted with K(TGS), and an authenticator with the client’s name, IP address and timestamp, encrypted with the session key K(C,TGS))
The Ticket-Granting Server can decrypt the TGT since it is encrypted with its
own secret key. In that ticket, it finds the session key to share with the client. It
uses this session key to decrypt the authenticator, and can then compare the
client’s name and address in the TGT and the authenticator.
If the timestamp that the TGS finds in the authenticator differs from the
current time by more than a prescribed difference (typically 5 minutes), a
ticket replay attack is assumed and the request is discarded.
If all checks pass, the TGS generates a service ticket for the service indicated
in the client’s request. The structure of this service ticket is identical to the
TGT described in 7.3.2, “Authenticating to the Kerberos Server” on page 150.
The content differs in the service field (which now indicates the application
service rather than the TGS), the timestamp, and the session key. The TGS
generates a new, random key that the client and application service will share
to encrypt their communications. One copy is put into the service ticket (for
the server), and another copy is added to the reply package for the client
since the client cannot decrypt the service ticket. The service ticket is
encrypted with the secret key of the service, and the whole package is
encrypted with the session key that the TGS and the client share. The
resulting reply is shown in Figure 73. Compare this to Figure 71 on page 152.
The client can decrypt this message using the session key it shares with the
TGS. It then stores the encrypted service ticket and the session key to share
with the application server, normally in the same ticket cache where it already
has stored the TGT and session key for the TGS.
To actually request the application service, the client sends a request to that
server which consists of the name of the requested service, the encrypted
service ticket, and a newly generated authenticator to protect this message
against replay attacks. The authenticator is encrypted with the session key
that the client and the service share. The resulting application service request
is shown in Figure 74.
Figure 74. Requesting the Application Service
The application server decrypts the service ticket with its secret key, uses the
enclosed session key to decrypt the authenticator, and checks the user’s
identity and the authenticator’s timestamp. Again, this processing is the same
as for the TGS processing the service ticket request. If all checks pass, the
server performs the requested service on behalf of the user.
Attention
If the client required mutual authentication (that is, the service has to prove its
identity to the client), the server could send back a message which is
encrypted by the session key it shares with the client, and an
application-dependent contents that the client can verify. Since the service
can only know the session key if it was able to decrypt the service ticket, it
must have known its secret key and so has proven its identity.
7.4 Managing Kerberos on the SP
The basic functionality of Kerberos and the protocol used to provide its
services was described in the previous section. In this section, we focus on
the implementations of Kerberos used on the SP.
More details on the PSSP and AFS implementations can also be found in
Chapter 12, “Security Features of the SP System” of the PSSP for AIX
Administration Guide, SA22-7348.
Note that although PSSP 3.1 also tolerates Kerberos Version 5 authentication
(for the AIX authenticated r-commands only), it does not actively support it. In
7.4.3, “DCE Kerberos (Version 5)” on page 168 we describe the manual setup
of Kerberos Version 5 on the SP. Full PSSP support for DCE/Kerberos
Version 5 will probably be added in later releases.
Attention
Multiple Kerberos Realms with Same Name : It is possible to have several
independent Kerberos realms with the same name (like two independent
SP systems in the same IP domain). Since the basic configuration
information of a Kerberos realm includes the realm name and the names of
the Kerberos servers in that realm, this removes the ambiguity: each
machine in a realm knows only the Kerberos servers within that realm.
However, you might want to set up non-default realm names (like
SP1.DOMAIN and SP2.DOMAIN) to avoid confusion.
This lists the IP name of an external network connection of the control
workstation, which is in the IP domain ITSO.IBM.COM rather than in the IP
domain MCS.ITSO.IBM.COM which contains the SP system itself (and is the
realm name). All other IP names of the system are in the default realm, and
are therefore not listed here.
We recommend that you add these services to the /etc/services file to make
their use explicit, and also reserve both the tcp and udp ports. Note that there
may be two other ports registered, for the services kerberos (port 88) and
kerberos-adm (port 749). These are used for Kerberos Version 5, as provided
by DCE for AIX Version 2.2.
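The corresponding /etc/services entries might look like the following; the service names match those used elsewhere in this chapter, but treat the exact spelling as an assumption and keep whatever names your installation already uses.

kerberos4        750/tcp
kerberos4        750/udp
kerberos-admin   751/tcp
kerberos-admin   751/udp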
Note that hardmon and rcmd service principals are allowed to have unlimited
lifetime, whereas all others are restricted to the default maximum of 30 days.
Attention
DCE/PSSP name conflicts: DCE for AIX provides commands
/usr/bin/kinit, /usr/bin/klist and /usr/bin/kdestroy for its own (Kerberos
Version 5) authentication. PSSP used to establish symbolic links in
/usr/bin/ to the PSSP versions of these commands, which reside in
/usr/lpp/ssp/kerberos/bin/. To avoid name conflicts with DCE, these
symbolic links have been renamed to k4init, k4list and k4destroy in PSSP
3.1.
The sign-on by using the k4init command prompts for the principal’s
password, and on success creates a ticket cache file which stores the
Ticket-Granting Ticket, obtained through the protocol described in 7.3.2,
“Authenticating to the Kerberos Server” on page 150. The ticket cache file is
placed in the default location /tmp/tkt<uid>, where <uid> is the AIX numeric
UID of the AIX user which issues k4init. This default location can be changed
by setting the environment variable $KRBTKFILE. If the ticket cache file
existed before, it will be replaced by the new one.
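A typical sequence, sketched with the root.admin principal that is used in the examples later in this chapter (the command output is omitted):

# k4init root.admin
# k4list
# k4destroy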
Figure 75. Kerberos Interfaces of an Application Client (k4list, k4destroy, kpasswd, the ticket cache file /tmp/tkt<uid> or $KRBTKFILE, /etc/krb.conf, /etc/krb.realms, /etc/services)
If the application server needs to decrypt a request from a client, it uses its
key in the /etc/krb-srvtab file. It does not require any communication with the
Kerberos server. These interactions are shown in Figure 76.
Figure 76. Kerberos Interfaces of an Application Server (the server key file /etc/krb-srvtab, the ksrvtgt command, the ticket cache file named by $KRBTKFILE, /etc/krb.conf, /etc/krb.realms, /etc/services)
However, there may be cases where the application is itself a client to another
Kerberized service. In this case, the application needs to get authenticated to
the Authentication Server (which returns a TGT), and eventually gets a
service ticket using the TGT. To enable this kind of scenario, the ksrvtgt
command is available, which acquires a TGT from the Kerberos server by
presenting the server’s key from the /etc/krb-srvtab file. That TGT has only a
lifetime of five minutes, to allow the application to get the desired service
ticket, and then expires. The ticket cache file for such cases will be specified
by the $KRBTKFILE environment variable, set by the application. The dashed
part of Figure 76 shows this optional part.
Figure 77. The Kerberos Database and Its Administration Commands (kdb_init, kdb_edit, kdb_util, kdb_destroy, kstash, chkp, lskp, mkkp, rmkp; /var/kerberos/database/principal.{dir|pag} and admin_acl.*; /etc/krb.conf, /etc/krb.realms, and the kerberos4 750 and kerberos-admin 751 entries in /etc/services)
There are two aspects of PSSP Kerberos security that require special
attention: the protection of the database itself, and access control to the
administration commands.
The Kerberos database, which is stored in the two files principal.dir and
principal.pag in /var/kerberos/database/, should only be accessible by the
root user. In addition, it is encrypted with a key called the Kerberos master
key. This key (and the password from which it is generated) should not be
confused with the password of the administrative principal of the root user,
root.admin. In particular, these two passwords should be different, and
access to the master password should be even more restricted than access to
the root or root.admin passwords.
The kerberos and kadmind daemons have to be able to access the encrypted
database. To start them automatically, PSSP stores the master key in the /.k
file, and the daemons read it during startup. The kstash command can be
used to convert the master password to the corresponding key, and store it in
the /.k file. This is shown in Figure 77. Like the database, the /.k file should be
owned by root, and it should have permissions 600. The same file is also
used by some of the administrative commands.
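A minimal sketch of enforcing the ownership and permissions just described:

# chown root /.k
# chmod 600 /.k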
PSSP Kerberos also maintains its own Access Control Lists for Kerberos
administrative commands, stored in three files: admin_acl.add, admin_acl.get
and admin_acl.mod. They are in the same directory as the database,
/var/kerberos/database/. These files are plain text files, and list the principals
which are authorized to perform the corresponding operations. An example is:
# cat /var/kerberos/database/admin_acl.add
root.admin
# cat /var/kerberos/database/admin_acl.get
root.admin
michael.admin
# cat /var/kerberos/database/admin_acl.mod
root.admin
This grants all rights to root.admin, and the principal michael.admin has
read-only access to the database (through the administrative commands).
Attention
The /.klogin file: Strictly speaking, the /.klogin file is not a Kerberos file.
Kerberos is only responsible for authentication, but the /.klogin file specifies
the authorization for Kerberos Version 4 authenticated r-commands (and
Sysctl). As such, it belongs to these applications and not to Kerberos itself.
Also compare to step 33.
A flow chart for the internals of setup_authent is in the redbook RS/6000 SP:
Problem Determination Guide, SG24-4778. Although based on PSSP 2.1,
these diagrams should still be accurate (except for the additional logic to
support AIX authenticated r-commands).
The only choice actually made here is whether a standard /.rhosts file for the
root user should be created or not. We recommend that you not have PSSP
create the /.rhosts entries for all the SP nodes, since this is much less secure
than the Kerberos Version 4-based security. If standard AIX is selected,
PSSP creates (or adds to) the /.rhosts file, with entries for the control
workstation and the nodes of all partitions for which standard AIX is selected.
Note that running spsetauth on the control workstation only adds or removes
the entries in the /.rhosts file which resides on the control workstation. The
nodes are reconfigured only when they execute /etc/rc.sp or its subcommand
spauthconfig. This is not an issue for the initial installation, but must be kept
in mind if spsetauth is called later to change the initial settings.
Since PSSP 3.1 and AIX use the same set of AIX authenticated r-commands
and daemons, the choices made in this step will not only affect the root user
or SP system administration tasks, they will be valid for all users. We
therefore recommend that you enable all options which should be available to
normal users. In particular, specifying neither standard AIX nor Kerberos
Version 5 means that the telnet and ftp commands will not work at all (since
they do not support Kerberos Version 4), and the rcp and rsh commands will
only work for authenticated Kerberos Version 4 principals. Of course,
specifying Kerberos Version 5 only makes sense if DCE for AIX Version 2.2 is
actually used on the SP system.
If the -c flag is specified with the chauthpar command, the SDR will be
updated with any changes, but only the control workstation will be
immediately reconfigured (using the AIX chauthent command). If the -f flag is
specified, all the nodes in the selected partition will be reconfigured (by
running chauthent on the nodes through an rsh from the control workstation),
whether or not the command actually changes the current SDR settings. If
none of these flags are given, nodes will only be reconfigured if the settings in
the SDR are changed.
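As a hedged illustration of these flags (the method keywords k4 and std and the -p flag for selecting a partition are assumptions based on the usage described here; sp3en0 is an example partition name):
# chauthpar -c -p sp3en0 k4 std
# chauthpar -f -p sp3en0 k4 std
The first form updates the SDR and immediately reconfigures only the control workstation; the second also forces chauthent to be run on all nodes of the partition.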
After this, Kerberos services can be used on the node. For example, the
firstboot.cust script might use the kerberized rcmdtgt and rcp commands to
copy more files from the control workstation to the node.
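A minimal firstboot.cust fragment along these lines might look as follows (the file names are purely illustrative, and sp4en0 is the example control workstation used elsewhere in this chapter):
# obtain a ticket-granting ticket using the node's own rcmd service key
/usr/lpp/ssp/rcmd/bin/rcmdtgt
# copy an additional, site-specific configuration file from the control workstation
/usr/lpp/ssp/rcmd/bin/rcp sp4en0:/spdata/sys1/custom/myapp.conf /etc/myapp.conf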
The IBM PSSP for AIX Administration Guide, SA22-7348, contains detailed
descriptions of all the administrative commands.
The only command every system administrator will have to use regularly is
the k4init command, used to re-authenticate to the Kerberos server when the
Ticket-Granting Ticket has expired.
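For example, using the administrative principal from the examples earlier in this chapter, and checking the resulting ticket cache:
# k4init root.admin
# k4list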
The PSSP for AIX Installation and Migration Guide, GA22-7347 explains the
steps required to initially set up SP security using an AFS server, and the
PSSP for AIX Administration Guide, SA22-7348, describes the differences in
the management commands of PSSP Kerberos and AFS Kerberos.
However, PSSP 3.1 does support the AIX authenticated r-commands, which
can work with Kerberos Version 5. If you want to use Kerberos Version 5 for
the AIX authenticated r-commands, you first have to set up a DCE security
and CDS server, either on an external DCE server machine or on the control
workstation. It would be sensible to also set up the DCE Distributed Time
Service, DTS, as your time management system when using DCE.
Be aware that the DCE servers must be running DCE for AIX Version 2.2. We
recommend that you apply at least DCE22 PTF Set 3, since earlier levels had
some Kerberos Version 5 related problems in the security service. Several
DCE/Kerberos interoperability enhancements were provided with DCE for AIX 2.2;
they are described in DCE for AIX Administration Guide: Core Components, which
is available in softcopy only with the product.
Install the DCE client components on the control workstation and nodes. This
is not managed by PSSP 3.1. If the configuration is done by the config.dce
command of DCE for AIX Version 2.2, this step will automatically create:
• The /etc/krb5.conf file, which describes the Kerberos Version 5 realm.
• The DCE principal /.:/ftp/<ip_hostname>, which is the service principal for
the AIX authenticated ftp command.
• The DCE principal /.:/host/<ip_hostname>, which is the service principal
for the AIX authenticated r-commands, notably rcp and rsh. Be aware that
this service principal is different from the machine principal
/.:/hosts/<ip_host_name>/self.
• The keytab entries for these ftp and host service principals.
Attention
DCE migration: These steps are not automatically performed when a DCE
2.1 client is migrated to DCE 2.2. The command kerberos.dce can be used
to set up the Kerberos environment.
You can verify that the service principals have been created by issuing
/usr/bin/dcecp -c principal catalog -simplename and, for example, piping the
output to grep -E "^host\/|^ftp\/".
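Put together, the check might look like this (the output shown is only what one would expect on the example system used in this chapter, not a verbatim trace):
# /usr/bin/dcecp -c principal catalog -simplename | grep -E "^host\/|^ftp\/"
ftp/sp4en0
host/sp4en0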
page 165, which only handles PSSP authorization (which is a no-op, as
/.klogin has already been created by setup_authent) and AIX authorization.
The /.k5login file should contain the DCE/Kerberos Version 5 principals that
are allowed to access the root account through the AIX authenticated
r-commands. The entries must be in Kerberos Version 5 format, not in DCE
format. Typically, this file should include the machine principals of the control
workstation and nodes; these are the principals which the local root user on
these machines will be identified with. You can also add user principals of
authorized system administrators. For example:
# cat /.k5login
[email protected]
[email protected]
[email protected]
hosts/sp4en0/[email protected]
hosts/sp4n01/[email protected]
hosts/sp4n05/[email protected]
hosts/sp4n06/[email protected]
hosts/sp4n07/[email protected]
hosts/sp4n08/[email protected]
This file has to be transferred to all the nodes. Since PSSP Kerberos is
always required, you can use the Kerberos Version 4 authenticated rcp
command to copy this file to the nodes.
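One way to do this is a simple loop over the nodes (a sketch; the node names are the ones used in the example above):
# for node in sp4n01 sp4n05 sp4n06 sp4n07 sp4n08
> do /usr/lpp/ssp/rcmd/bin/rcp /.k5login ${node}:/.k5login
> done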
The following commands are the primary clients to the hardware control
subsystem:
• hmmon — Monitors the hardware state.
• hmcmds — Changes the hardware state.
• s1term — Provides access to the node’s console.
• nodecond — For network booting, uses hmmon, hmcmds and s1term.
• spmon — For hardware monitoring and control. Some parameters are used
to monitor, and others are used to change the hardware state. The spmon
-open command opens an s1term connection.
# k4list -srvtab
Server key file: /etc/krb-srvtab
Service Instance Realm Key Version
------------------------------------------------------
hardmon sp4cw0 MSC.ITSO.IBM.COM 1
rcmd sp4cw0 MSC.ITSO.IBM.COM 1
hardmon sp4en0 MSC.ITSO.IBM.COM 1
rcmd sp4en0 MSC.ITSO.IBM.COM 1
server to acquire a service ticket for the hardmon service. This service ticket
is then presented to the hardmon daemon, which decrypts it using its secret
key stored in the /etc/krb-srvtab file.
Each line in the hardmon ACL file lists an object, a Kerberos principal, and the associated
permissions. Objects can either be host names or frame numbers. By default,
PSSP creates entries for the control workstation and for each frame in the
system, and the only principals which are authorized are root.admin and the
instance of hardmon for the SP Ethernet adapter. There are four different sets
of permissions, each indicated by a single lowercase letter:
• m (Monitor) - monitor hardware status
• v (Virtual Front Operator Panel) - control/change hardware status
• s (S1) - access to node’s console via the serial port (s1term)
• a (Administrative) - use hardmon administrative commands
Note that for the control workstation, only administrative rights are granted.
For frames, the monitor, control, and S1 rights are granted. These default
entries should never be changed. However, other principals might be added.
For example, a site might want to grant operating personnel access to the
monitoring facilities without giving them the ability to change the state of the
hardware, or access the nodes’ console.
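As a hedged sketch, such an ACL might look like the following (the path /spdata/sys1/spmon/hmacls and the exact entry layout are assumptions; the principals and permissions reflect the defaults described above for a control workstation sp4en0 and frame 1):
# cat /spdata/sys1/spmon/hmacls
sp4en0 root.admin a
sp4en0 hardmon.sp4en0 a
1 root.admin vsm
1 hardmon.sp4en0 vsm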
Figure: rsh calling structure prior to PSSP 3.1 — user-issued rsh commands and SP system management functions (SP Parallel Management Commands, dsh, boot/install) call the SP Kerberos V4 rsh client /usr/lpp/ssp/rcmd/bin/rsh, which is served by the SP Kerberos V4 rsh daemon /usr/lpp/ssp/rcmd/etc/kshd; the standard AIX rsh /usr/bin/rsh is served by the AIX rsh daemon /usr/sbin/rshd.
In PSSP 3.1, the authenticated r-commands in the base AIX 4.3.2 operating
system are used instead. As described in 7.4, “Managing Kerberos on the
SP” on page 156, they can be configured for multiple authentication methods,
including the PSSP implementation of Kerberos Version 4. To allow
applications which use the full PSSP paths to work properly, the PSSP
commands rcp and remsh/rsh have not been simply removed, but have been
replaced by links to the corresponding AIX commands. This new calling
structure is shown in Figure 79 on page 175.
Figure 79. Calling structure in PSSP 3.1 — /usr/lpp/ssp/rcmd/bin/rsh is now a symbolic link to the AIX rsh /usr/bin/rsh, so SP system management functions (dsh, boot/install) and user-issued commands all use the AIX client; requests are served by the Kerberos V5/V4 rsh daemon /usr/sbin/krshd and the AIX rsh daemon /usr/sbin/rshd.
Attention
K5MUTE: Authentication methods are set on a system level, not on a user
level. This means that, for example, on an SP where Kerberos Version 4
and Standard AIX is set, a user’s rsh command will produce a Kerberos
authentication failure if that user has no Kerberos credentials (which is
normally the case unless the user is an SP system administrator). After
that failure, the rsh attempts to use the standard AIX methods. The delay
caused by attempting both methods cannot be prevented, but there is a
means to suppress the error messages of failed authentication requests,
which may confuse users. Suppress these messages by setting the
environment variable K5MUTE=1. Authorization failures will still be
reported, though.
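For example, the variable can be set in a user's .profile so that it applies to all of that user's sessions (a sketch):
# echo "export K5MUTE=1" >> $HOME/.profile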
These requests are then processed by the rshd and krshd daemons.
Attention
Be aware that the daemon itself does not call the get_auth_method()
subroutine to check if STD is among the authentication methods. The
chauthent command simply removes the shell service from the /etc/inetd.conf
file when it is called without the -std option, so inetd will refuse connections
on the shell port. But if the shell service is enabled again by editing
/etc/inetd.conf and refreshing inetd, the rshd daemon will honor requests,
even though lsauthent still reports that Standard AIX authentication is
disabled.
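To check this on a node, you might compare the configured authentication methods with the inetd configuration (a sketch; output is omitted, and the grep pattern assumes the default service name):
# lsauthent
# grep "^shell" /etc/inetd.conf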
7.5.2.3 The Kerberized krshd Daemon
The /usr/sbin/krshd daemon implements the kerberized remote shell service
of AIX. It listens on the kshell port (normally 544/tcp), and processes the
requests from both the kcmd() and spk4rsh() client calls.
After checking if the requested method is valid, the krshd daemon then
processes the request. This depends on the protocol version, of course.
The daemon then calls the kvalid_user() subroutine, from libvaliduser.a, with
the local user name (remote user name from the client’s view) and the
principal’s name. The kvalid_user() subroutine checks if the principal is
authorized to access the local AIX user’s account. Access is granted if one of
the following conditions is true:
1. The $HOME/.k5login file exists, and lists the principal (in Kerberos form).
See 7.4.3, “DCE Kerberos (Version 5)” on page 168 for a sample
$HOME/.k5login file.
2. The $HOME/.k5login file does not exist, and the principal name is the
same as the local AIX user’s name.
Case (1) is what is expected. But be aware that case (2) is quite
counter-intuitive: if the file exists but is empty, access is denied, yet if
the file does not exist at all, access is granted. This is the reverse of the
behavior of both the AIX $HOME/.rhosts file and the Kerberos Version 4
$HOME/.klogin file. However, it is documented to behave this way (and
actually follows these rules) in both the kvalid_user() man page and AIX
Version 4.3 System User's Guide: Communications and Networks ,
SC23-4127.
Attention
DFS home directories: This design may cause trouble if the user’s home
directory is located in DFS. Since the kvalid_user() subroutine is called by
krshd before establishing a full DCE context via k5dcelogin, kvalid_user()
does not have user credentials. It runs with the machine credentials of the
local host, and so can only access the user’s files if they are open to the
"other" group of users. The files do not need to be open for the "any_other"
group (and this would not help, either), since the daemon always runs as
root and so has the hosts/<ip_hostname>/self credentials of the machine.
The daemon then checks the Kerberos Version 4 $HOME/.klogin file, and
grants access if the principal is listed in it. This is all done by code provided
by the PSSP software, which is called by the base AIX krshd daemon. For this
reason, Kerberos Version 4 authentication is only available on SP systems,
not on normal RS/6000 machines.
Attention
rcmdtgt: PSSP 3.1 still includes the /usr/lpp/ssp/rcmd/bin/rcmdtgt
command, which can be used by the root user to obtain a ticket-granting
ticket by means of the secret key of the rcmd.<localhost> principal stored in
/etc/krb-srvtab.
To work around this problem, PSSP uses the authenticated rsh command to
temporarily add the boot/install server’s root user to the /.rhosts file of the
control workstation, and removes this entry after network installation.
7.5.3 Sysctl
Sysctl is an authenticated client/server application that runs commands with
root privileges on remote nodes, potentially in parallel. It is implemented by
the sysctld server daemon running with root privileges on the control
workstation and all the nodes, and a sysctl client command. The functionality
provided by sysctl is similar to the dsh parallel management command in the
sense that it provides remote, parallel command execution for authenticated
users. It differs from dsh in three fundamental ways:
1. Authentication: Sysctl requires Kerberos Version 4 for authenticating users
that issue sysctl commands. It does not support Kerberos Version 5, and
it is not based on the AIX authenticated r-commands (like dsh is).
2. Authorization can be controlled in a more fine-grained way. With dsh and
the underlying authenticated rsh, an authenticated principal which is listed
in the authorization file of the remote node (.k5login, .klogin, or .rhosts)
can run arbitrary commands since it has access to a login shell. With
sysctl, authorization can be set for individual commands, by using
sysctl-specific Access Control Lists (ACLs) which are checked by an
authorization callback mechanism in the sysctld server. Command
execution is under control of the sysctld daemon.
3. Language: dsh provides access to a shell, whereas sysctl uses its own
scripting language, based on Tcl and a number of built-in commands
provided by IBM. This might ease the development of site-specific sysctl
applications, but also requires you to learn the Tcl/Sysctl language.
The set of commands that are understood by the sysctld daemon is specified
through the sysctl configuration file, by default /etc/sysctl.conf, which is
parsed by the daemon when it starts up. Note that this file can access other
configuration information through include statements. The default
configuration file contains include statements for configuration files from
/usr/lpp/ssp/sysctl/bin/, which contain more sysctl procedures. The sysctl
facilities are described in detail in Chapter 13, “Sysctl” of IBM PSSP for AIX
Administration Guide, SA22-7348. Here we focus on the security aspects of
the sysctl system.
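As a quick, hedged illustration (assuming the built-in whoami procedure and the -h flag for addressing the sysctld server on a remote host; the realm and hostname are those used in the earlier examples, and the output is illustrative):
# sysctl whoami
root.admin@MSC.ITSO.IBM.COM
# sysctl -h sp4n01 whoami
root.admin@MSC.ITSO.IBM.COM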
The ACL callback requires the same Kerberos authentication as the AUTH
callback does, but in addition requires that this Kerberos principal is
explicitly listed in an Access Control List.
Attention
This behavior is probably caused by the fact that the sysctl client always issues
the svcconnect command to the server before sending the actual command.
Since svcconnect has the AUTH callback, it will fail if the user does not have
Kerberos credentials, and terminate the connection.
The AUTH callback returns OK if the sysctld server has verified the
authentication of the client user by decrypting the rcmd service ticket sent by
the sysctl client. The ACL callback also requires this, but additionally checks if
the Kerberos principal is listed in the server’s Access Control List which
applies to the command that is to be executed. The default ACL file for all
built-in sysctl commands is /etc/sysctl.acl. Sysctl ACL files can contain entries
for principals, or include statements for other ACL files, such as the following:
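A hedged sketch of such a file (the #acl# header line and the _PRINCIPAL and _ACL_FILE keywords are assumptions about the sysctl ACL syntax; the principals and realm are those used earlier in this chapter, and the included ACL file name is illustrative):
# cat /etc/sysctl.acl
#acl#
_PRINCIPAL root.admin@MSC.ITSO.IBM.COM
_PRINCIPAL michael.admin@MSC.ITSO.IBM.COM
_ACL_FILE /etc/my_sysctl.acl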
In the Tcl/Sysctl source of a sysctl procedure, the ACL callback phrase can
optionally contain the full path name of an ACL file. If this is done, the ACL
callback will use the specified file for authorization checks. This allows a
modular design, where different sets of commands can be controlled by their
own ACL files.
Chapter 8. RS/6000 Cluster Technology
This chapter aims to concisely explain the key concepts of RS/6000 Cluster
Technology (RSCT), formerly known as HAI, illustrated by real-life examples
of RSCT in action. There are entire books devoted to the topic, such as:
• RS/6000 SP High Availability Infrastructure , SG24-4838
• RS/6000 SP Monitoring: Keeping it Alive, SG24-4873
• RSCT: Event Management Programming Guide and Reference,
SA22-7354
• RSCT: Group Services Programming Guide and Reference, SA22-7355
IBM took HACMP’s cluster manager code - the heart of the product - and
reworked it to scale. Around this IBM built High Availability Infrastructure
(HAI). Internally, HAI was known as Phoenix. In 1998, with the announcement
of PSSP 3.1 and HACMP Enhanced Scalability 4.3, this infrastructure was
renamed RS/6000 Cluster Technology (RSCT).
Although this technology was implemented only on the SP, IBM’s offering
includes clusters of RS/6000 machines through HACMP Enhanced Scalability
(up to 32 machines connected in a cluster or 128 SP nodes). RSCT is
designed to be a cross-platform set of services, potentially implemented on
IBM’s and even other manufacturers’ server platforms.
The new file sets are called rsct.basic and rsct.clients. The current version at
the time of this writing is version 1.1 (PTF set 4). The RSCT file sets come
standard with PSSP 3.1 and HACMP Enhanced Scalability 4.3. Table 4 shows
details of the RSCT install images.
Table 4. RSCT Install Images
The Resource Monitor component, closely linked with RSCT, partially relies
on the Performance Agent Tools component of AIX (perfagent.tools). So, this
file set (which comes with AIX 4.3.2) is a prerequisite for RSCT.
Figure 80. RSCT infrastructure — Event Management (and Event Management/ES) serves clients such as PMan, Perspectives and PTPE through the EMAPI and SPMI, and is layered on Group Services (and Group Services/ES) and Topology Services (and Topology Services/ES); Resource Monitors feed in data about software (SW) and hardware (HW) resources.
Although many components are presented in the figure (all of which will be
explained throughout the book), RSCT comprises three principal elements:
1. Topology Services (TS)
2. Group Services (GS)
3. Event Manager (EM)
As you can see from Figure 80 on page 187, there is a separate stack for
HACMP/ES. As we mentioned earlier in the chapter, RSCT can also function on
HACMP/ES clusters of RS/6000 machines outside the SP.
The RSCT components will behave slightly differently if they are functioning on
an SP domain or an HACMP/ES domain. For example, if they are running on
an SP domain, the configuration data is taken from the SDR, while if they are
running on an HACMP/ES domain, the configuration data is taken from the
Global ODM (GODM).
We now discuss the core elements of RSCT: TS, GS, and EM.
TS works tirelessly to update node and network adapter states, but does not
care about the implications of, for example, a node going down. However,
Group Services (GS) does care about this. GS subscribes to TS for node and
network adapter availability. To take action on this information, GS needs a
reliable network roadmap to broadcast its messages to the nodes. This leads
us to TS’s other primary responsibility: maintaining the network roadmap, or
Network Connectivity Table (NCT).
Before describing how TS performs its duties, a few terms are explained:
Figure 81. TS and GS Interfaces — Group Services communicates with Topology Services through a UDP-based API; Topology Services maintains Adapter Membership and Node Membership, provides Reliable Messaging (partly through shared memory), and heartbeats over UDP across the networks.
Each node builds a machine list after receiving information from the SDR.
The machine list defines all nodes in the system, and the switch and
administrative Ethernet IP addresses for each node. Note that the
administrative Ethernet interface is typically en0, but could be en1. You
specify which interface when defining the reliable hostname in the SDR at
installation time, and the corresponding IP address is sent in the machine list.
When new nodes are added to a running SP system, the SDR does not
automatically send the configuration changes to the nodes. The RSCT stack
must be refreshed for all affected partitions to update the machine lists and
membership groups. TS could easily be ported to different platforms by
essentially changing the front-end process, served by the SDR on the SP,
that supplies the initial node and adapter information.
Machine lists are partition-specific and always include the CWS. TS has the
dependency that each node has to be able to communicate with the CWS,
thus the CWS is included in every machine list. The CWS runs one instance
of hatsd for each partition.
Mayor - A daemon, with a local adapter present in this group, that has been
picked by the Group Leader to broadcast a message to all the adapters in the
group. (A personality is a duty that a daemon has to carry out; it is common
for a daemon to assume multiple duties.)
Now each node knows, via its machine lists, all the potential members of
Adapter Membership Groups for the Ethernet and switch. Each node first
initializes itself into a Singleton group - making it its own Group Leader and
Crown Prince. The Group Leaders periodically send proclamations to all
lower IP address adapters. The lower IP address Group Leaders eventually
join the rings of the higher IP address Group Leaders, which incorporate the
joiners’ topology. Group Leaders, Crown Princes and member lists are
constantly updated in the new, bigger rings. Finally, one ring for each adapter
type exists and the final personality of the adapters is determined by IP
address order. With the rings established, Adapter Membership Groups are
created.
Every five seconds, using a Proclaim Packet, the Group Leader in an Adapter
Membership Group invites other adapters that are in the machine list but not
currently part of the group, to join the group. This is how adapters in new or
rebooted nodes, for example, become part of the group.
The Group Leader “wraps around”: its Neighbor is the adapter with the lowest
IP address in a group.
By default, heartbeats are sent every second (the frequency ), and if four
successive heartbeats are not received (the sensitivity ), the Neighbor adapter
is declared unavailable. Heartbeat frequency and sensitivity are tunable. If an
adapter declares its neighbor unavailable, the hatsd daemon notifies the
Group Leader with a DEATH_IN_FAMILY message. The Group Leader
updates the membership status of the group and distributes it to the
remaining members of the group.
Group Services
Now that we know what nodes are available and how to talk to them, GS
steps into the picture.
Figure: Topology Services and Group Services daemons — GS ("hags") uses TS's Reliable Messaging (UDP) and the Network Connectivity Table (NCT); TS ("hats") builds the NCT in a shared memory layer, and exchanges reliable messages and heartbeats over UDP to and from the other nodes.
Figure: A GS namespace example — hagsd runs on nodes 1 through 9; two groups exist, glynn (G) and clarabut (C), with providers and subscribers spread across the nodes, and the designated Group Leaders on node 5 (glynn) and node 6 (clarabut). The GS name server keeps the group table: group glynn (G), Group Leader 5, members 1,2,3,4,5,6; group clarabut (C), Group Leader 6, members 2,4,6,7,8.
For each new group, one GS daemon will become the Group Leader. Yes,
this is the same term as in TS, but GS Group Leaders are chosen by the age
of the provider in the membership list, not by the IP address. The GS daemon
Group Leaders replicate data to all GS daemons on nodes where the group’s
providers reside for availability in case the Group Leader node fails.
If a provider does not vote in the time allowed (for example, the provider’s
node fails), the Group Leader uses the group’s default vote for that
provider.
4. Notification of Result
GS notifies the provider of the result, and updates the membership list and
state value accordingly. In our example, the potential provider has been
voted into the group, is made a member, and put on the membership list of
the other providers.
Bad things can happen in sundered namespaces. Two nodes, each owning a
tail of a twin-tailed disk subsystem, could be on different sides of the split.
The disk recovery application, relying on GS for node availability, may
mistakenly inform each node in the split group that its availability partner is
gone. Each node may try to acquire the disk subsystem, leading potentially to
data corruption. A quorum mechanism in the application (GS does not
provide such a mechanism) would be useful in this case.
Figure: Event Management observes resource variables supplied by Resource Monitors (controlled through the RMAPI) and notifies clients and other subsystems about events through the EMAPI.
Figure 87. EM Client and Peer Communication — local clients use a UNIX domain socket (SOCK_STREAM) command/response interface to the EM daemon through the EMAPI, while the EM daemons (peers) communicate with each other using the Event Manager protocol over Reliable Messaging provided by Topology Services.
Remote clients, that is, clients executing in a separate partition or outside the
SP entirely, use TCP/IP sockets, which is a less reliable method because the
protocol cannot always properly detect crashed communication sessions
between programs. Remote clients usually connect only to the EM
daemon on the CWS. When connecting, a remote client specifies the name of
the target partition on the call to the EMAPI. The remote client will then
connect to the EM daemon on the CWS that is running in the target partition.
A client could connect directly to any EM daemon in the target partition and
get the same events, but you would need an algorithm to determine the target
node. It is easier to just connect to the appropriate daemon on the CWS.
Clients must use the event registration process to inform EM of their interest
in specific events. Applications perform event registration through function
calls to the EMAPI. Perspectives provides a menu-driven interface to register
events, and further specifies actions to take when events occur. Event
registration defines:
• The resource variable, or attribute of a specific resource. A resource
variable can be a counter (such as the number of packets transmitted on
en0), quantity (such as average CPU busy), or state (such as the network
interface up or down). Resource variables must be associated with an RM
so that EM knows who to ask for data. This association is made via the
resource class.
• The Resource Identifier of the resource variable. You want to monitor
average CPU busy, but on which node? Which occurrence of that resource
in the system? Resource identifiers pinpoint specific resources in the
system.
• The expression or event-triggering rule to apply against the resource
variable (such as “average CPU busy is greater than 90%”). Each variable
has a default expression.
You can also specify a rearm expression for a resource variable. If set, the
rearm expression is monitored after the initial expression is true. When the
rearm expression is true, then the initial expression is once again
monitored. For example, you are interested in CPU busy on node sam.
Your expression will trigger an event whenever sam’s CPU busy rises
above 90%, but if the CPU workload is fluctuating, the resource variable
value may repeatedly cross the 90% threshold, generating many identical
events. If you specify a rearm expression of CPU busy going below 60%,
then after sam goes above 90% CPU busy, EM will not generate another
event on this resource variable for sam until sam’s CPU busy falls below
60%. You assume that the situation has stabilized on sam if the CPU busy
falls back below 60%.
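As a hedged sketch of how such a definition might be registered through the Problem Management subsystem (the pmandef flags, the resource variable IBM.PSSP.aixos.CPU.glidle - idle time used here as a stand-in for CPU busy - the instance vector, the thresholds and the command path are all assumptions for illustration only):
# pmandef -s cpuBusy -e 'IBM.PSSP.aixos.CPU.glidle:NodeNum=5:X<10' -r 'X>40' -c '/usr/local/bin/notify_admin'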
8.6.3 EM and GS
EM initialization illustrates how EM exploits GS. When haemd starts on a
node, it fetches the EMCDB version from the SDR. Operating with the correct
version of the EMCDB is crucial to service EM clients. After more initialization
steps, haemd connects to GS.
Figure: EM daemon initialization flow — the daemon checks whether its EMCDB version matches the version used by the group, copying the database from the CWS if necessary, before enabling communications.
From the group state value, the EM daemon knows what version of EMCDB
is being used by the already-running EM daemons (peers). It compares the
state value with the one it obtained from the SDR earlier in the initialization. If
they do not match, the EM daemon uses the state variable value to remain
consistent with the group. Now the EM daemon attempts to load the database
from a local copy. It examines the EMCDB in the appropriate file to verify
consistency with what the group is currently using. If the versions match, EM
daemon loads the database and is ready to accept clients. If not, the EM
daemon checks one more place: an EMCDB staging area on the CWS. If still
no match, then the EM daemon is not configured and refuses any requests
for event monitoring.
Figure: Structure of the Event Management daemon and its resource monitors (hrd, hmrmd, harmpd, harmld, pmanrmd, CSSLogMon, aixos), which supply Membership, Host Responds, Switch Responds and other counter, quantity and state resource variables through the RMAPI, SPMI, System V IPC and EM shared memory segments; the daemon also interfaces with Group Services (GSAPI), the SDR, hardmon and PTPE, on top of AIX/PSSP and the hardware.
You monitor resource variables of type counter and quantity with respect to
time. You control how you monitor a resource by associating it with a
resource class. The resource class defines the following for a resource
variable:
• The associated RM
• The observation interval, in seconds
• The reporting interval, in seconds, to the associated RM
All resource variables that are located in shared memory with the same
observation interval are observed on the same time boundary. This minimizes
the observation overhead, no matter when the request for a resource variable
is made.
Resource variables of type state are handled differently. Every time they
change, the resource monitor responsible for supplying the resource variable
sends a message containing the data directly to the EM event processor. The
rationale is that, unlike resource variables of type counter or quantity, the
Event Manager daemon needs to be aware of every change of resource
variables of type state, so the sampling method used with counter and
quantity variables cannot be applied to state variables.
• IBM.PSSP.hmrmd
This monitor provides the state of the SP hardware. The information is
obtained from the PSSP hardware monitoring subsystem (hardmon). The
resource variables are of type state and sent directly to the EM daemon as
a message.
• IBM.PSSP.harmpd
This monitor examines processes that are running specific programs; its
resource variables (IBM.PSSP.Prog.pcount and IBM.PSSP.Prog.xpcount) can be
used to determine whether particular daemons are running.
You may wonder why these resource monitors are internal. Primarily, the
reason is performance.
8.7.1 Scenario
Your SP was turned off for the weekend due to scheduled power testing in the
building. Early Monday morning (or Saturday morning, depending on your
part of the world), you come into the office to restart the SP. You have already
booted the CWS, and using Perspectives, are ready to boot the nodes. Your
SP has one frame, four thin nodes, a switch, and one (default) partition. The
hostnames and IP addresses for the system are as follows:
8.7.2 TS Level
Since the CWS is already up, TS has already initialized node and adapter
membership groups and created an NCT. The CWS is the only member of the
en0 group and node group, and there is no internode connectivity in the NCT.
The CWS is both the Group Leader and Crown Prince of the en0 ring, and
from its machine list is aware there should be four other members. It is
periodically issuing proclaims to the en0 IP addresses of the nodes, asking
them to join its Singleton group.
The nodes begin processing the /etc/inittab entries. The TS daemons begin
on each node and obtain configuration data from the SDR. After a flurry of
proclaims, joins, distributions of topology, and ring mergers, we end up with
the following adapter membership groups:
• en0: Group Leader is node4; Crown Prince is node3; members are node2,
node1 and spcw.
• css0: Group Leader is node4; Crown Prince is node3; members are node2
and node1.
Figure 90 on page 211 shows the topology graph for en0 (Ethernet) and css0
(switch), both before and after an Estart command is issued to bring up the
switch. The Group Leaders are designated GL. Notice that each switch
adapter is in a Singleton group before the Estart. Heartbeating begins. Each
TS daemon builds (or updates in the case of the CWS) its topology tables and
NCT, puts the NCT in shared memory, then begins accepting requests from
clients, namely GS.
8.7.3 GS Level
Before node booting, the CWS was already up and GS on the CWS had
already created the internal node and adapter membership groups. The CWS
is the only member of the en0 group and node group, and the switch group
does not exist. The CWS is both the Group Leader of the en0 membership
group and GS nameserver.
Since the CWS has the oldest GS daemon, it becomes Group Leader for the
en0 and node membership internal groups. The first GS daemon on the node
to issue a join request to the css0 internal group becomes the Group Leader
of that group (in our case, say it was node3). Membership lists are maintained
in the order of GS daemons joining the group. The providers of the groups are
in effect the TS daemons running on each node. We have the following
groups:
8.7.4 EM Level
Before node booting, the CWS was already up and EM on the CWS had
already created the ha_em_peers group. The CWS is the only member of the
group, and has set the state variable to be the version of the EMCDB it
retrieved from the SDR at initialization.
At this point, we have TS, GS, and EM successfully initialized throughout the
SP. The EM daemons’ internal Membership RMs update the
IBM.PSSP.Membership.LANadapter.state resource variable from its
connection to GS, which includes data on the state of the en0 and css0
interfaces by node. Happily, the news is good: TS is successfully
heartbeating all en0 and css0 adapters, which GS learns of by subscribing to
TS, which EM learns of by its Response monitor connecting to GS.
Across the EMAPI, hrd periodically receives events from the EM daemon.
The events describe the state of the en0 adapters for all nodes
(IBM.PSSP.Response.Host.state variable). Using this information, hrd
modifies the host_responds class in the SDR.
Prior to the availability of Parallel Environment v2.4, only one user space
process per partition could access this switch network in this mode of
operation. With Parallel Environment v2.4, support for Multiple User Space
Processes Per Adapter was provided.
Figure: SP Switch software structure — in user space, the switch commands (Estart, Efence, and so on) and the send/receive library interact with the fault_service (Worm) daemon, which gets fs_requests from the fault_service work queue; in kernel space, the CSS device driver manages that work queue and drives the TBx switch adapter.
The Worm daemon plays a key role in the coordination of the switch network.
It is a non-concurrent server, and therefore can only service one switch event
at a time.
Error recovery on the SP Switch is done locally: error detection and fault
isolation is done at the link level while normal traffic flows across the rest of
the switch fabric. Detected faults are forwarded to the active primary node for
analysis and handling. When the primary node completes assessing the fault,
the remaining nodes on the fabric are non-disruptively informed of status
changes. (On the older HiPS Switch, a fault on a link would cause the entire
switch fabric to be interrupted while recovery was performed.)
The primary backup node passively listens for activity from the primary node.
When the primary backup node detects that it has not been contacted by the
primary node for a predetermined time, it assumes the role of the primary
node. This takeover involves reinitializing the switch fabric in a nondisruptive
way, selecting another primary backup, and updating the SDR accordingly.
The primary node also watches over the primary backup node. If the primary
node detects that the primary backup node can no longer be contacted on the
switch fabric, it selects a new primary backup node.
Oncoming primary and oncoming primary backup nodes are used only by the SP
Switch. With oncoming nodes, you are now able to change the primary and
the primary backup for the next restart of the switch. This is convenient for
maintenance purposes, since it allows you to change the oncoming nodes
without taking the switch down. Actually, these settings stored in the SDR are
informative only, and will only be read by the switch subsystem when an
Estart command is issued. To summarize, the primary and primary backup
fields in the SDR reflect the current state of the system and the oncoming
fields are not applicable until the next invocation of the Estart command.
expected.top.<NSBnum>nsb.<ISBnum>isb.type
where <NSBnum> is the number of Node Switch Boards that are installed,
and <ISBnum> is the number of Intermediate Switch Boards installed in the system.
Figure: Sample annotated topology file entries.
Node-to-switch connection: s 15 2 tb2 1 0 L01-S00-BH-J8 to L01-N2 (the fields give the switch number and chip, the chip port, and the adapter type, switch node number and adapter port on the node end; the trailing label shows the logical frame, bulkhead jack and node of the connection).
Switch-to-switch connection: s 13 2 s 23 2 L01-S00-BH-J4 to L02-S00-BH-J4 (switch number, chip and port on each end, followed by the logical frames and external jacks of the connecting cable).
Each line in the topology file describes a link, either between nodes
connected to the same switch board or between switches. Fields 2 to 4
describe the switch end of the connectivity, whereas fields 5 to 8 describe the
second end (node or another switch) of the connectivity. These fields are
further defined as follows:
• Field 1: This is always "s", signifying that this end of the cabling is
connected to a port on the switch board.
• Field 2: Describes the switch number where the link is connected to. A "1"
in this example means that it is switch board number 1, located inside a
normal frame. If this were a switch in a intermediate switch board (ISB)
frame, 1000 is added to differentiate it from a switch board in a normal
frame.
• Field 3: The switch chip number. The valid values are from 0 to 7, as there
could be 8 switch chips on a switch board. Note that this field follows
immediately after Field 2 - there is no space separating these two fields.
An example illustrating how the notations used in the topology file map to the
physical layout of the switch is shown in Figure 94 on page 222. It graphically
illustrates the sample topology file shown in Figure 92 on page 219.
Figure 94. Physical layout of the switch board — the node-side switch chips SW4, SW5, SW6 and SW7 connect through the bulkhead jacks J4 through J34 to nodes N1 through N16, while switch chips SW0 through SW3 on the other side of the board provide the chip-to-chip links; each chip also carries an SWA designation (SWA1 through SWA8).
The precoded connection labels in the topology file start with an “L”, which
indicates logical frames. The Eannotator command replaces the “L” character
with an “E”, which indicates physical frames.
The following samples show extracts from a topology file with entries before
and after the execution of the Eannotator command.
Before:
s 15 3 tb0 0 0 L01-S00-BH-J18 to L01-N1
After:
s 15 3 tb3 0 0 E01-S17-BH-J18 to E01-N1
Logical frame L01 is defined as physical frame 1 in the SDR Switch
object.
Before:
s 10016 0 s 51 3 L09-S1-BH-J20 to L05-S00-BH-J19
After:
s 10016 0 s 51 3 E10-S1-BH-J20 to E05-S17-BH-J19
Logical frame L09 is defined as physical frame 10 in the SDR Switch
object.
Before:
s 15 3 tb0 0 0 L03-S00-BH-J18 to L03-N3
After:
s 15 3 tb3 0 0 E03-S17-BH-J18 to E03-N3 # Dependent Node
Logical frame L03 is defined as physical frame 3 in the SDR Switch
object and the node was determined to be a dependent node.
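A hedged example of invoking the command (the -F input file, -f output file and -O yes flags, and the file names and directory, are assumptions for illustration):
# Eannotator -F /etc/SP/expected.top.1nsb.0isb.0 -f /etc/SP/expected.top.annotated -O yes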
The other four chip ports (4, 5, 6, and 7) are used to communicate with the
other four switch chips on the opposite side switch board (SW3, SW2, SW1,
and SW0). It is possible for each of the chips that service the nodes to reach
the other three node chips through four different routes. The switch chips
connect together to form multi-stage switching networks.
To see how routing actually works, let us consider the possible paths a switch
packet could take from node 14 (N14) communicating to node 16 (N16).
Initially, the packet enters SW4 through port 0, and can exit the switch chip
through one of the following four routes.
• Port 7 across to SW3, onto that chip through port 4, exiting on port 7, over
to SW7, onto the chip on port 7.
• Port 6 across to SW2, onto that chip through port 4, exiting on port 7, over
to SW7, onto the chip on port 6.
• Port 5 across to SW1, onto that chip through port 4, exiting on port 7, over
to SW7, onto the chip on port 5.
• Port 4 across to SW0, onto that chip through port 4, exiting on port 7, over
to SW7, onto the chip on port 4
Once onto SW7, it will exit through port 3, go through J15 to the cable, and to
the switch adapter on node 16 (N16). The four possible paths are highlighted
in Figure 95 on page 225.
With the release of PSSP 3.1, many of these SP Switch management and
administration issues have been addressed. Specifically, the following
changes in the operation of the switch network have been made:
• Switch Admin Daemon
This daemon will monitor for specific node and switch events to ensure
that the switch is re-started automatically in the event that the primary
node goes down. Also, if the whole SP complex is powered on, the
daemon will automatically start the SP Switch.
• Automatic Node Unfence
Nodes will rejoin the switch communication fabric without any explicit
action by a system administrator. The autojoin attribute of the SDR
switch_responds class is set whenever nodes join (or rejoin) the switch.
This allows the switch network to behave similarly to other LANs, since it is
immediately available after a node reboot, without any manual
intervention.
• Switch-Dependent Application Startup at Boot Time
For applications that depend on the switch interface being up and
available, the switch adapter startup script (rc.switch) has been modified
to monitor the status of the switch adapter. After starting the Worm
daemon, the script waits (up to 6 minutes) for the css0 adapter to be in an
UP status before exiting. Further, the location of the fsd entry in the
/etc/inittab file (this entry runs the rc.switch script) has been moved to
immediately after the sp entry.
• Centralized Error Logging
Switch-related AIX error log entries from all nodes installed at PSSP v3.1
are now consolidated into a single file on the control workstation. Entries
are ordered by time, and tagged with identifying information, thus
providing a summary of switch errors across the entire SP system.
We will review these new features in more detail in the following discussion
and in later sections of this chapter.
Prior to PSSP 3.1, some additional procedures were required before the
switch network could be used after the power-on of the system, either in the
form of writing a site-specific script to automate Estart execution or by having
the SP system administrator manually execute the Estart command. In PSSP
3.1, there is no need to do this since now the SP Switch network is started
automatically by the switch admin daemon.
Attention
Coexistence consideration: The switch admin daemon works with any
combination of nodes, since it does not require any modifications to the code in
a node. It is required that the event manager daemon running on nodes installed
at pre-PSSP 3.1 levels be recycled in order to pick up the new resource variables
being monitored.
Node 1
The digit "1" here means that the cssadm daemon will perform the node
recovery, that is, it tries to Estart whenever it detects significant node events
in the system (for example, the primary node goes down, the primary backup
does not come up). To disable node recovery, change this line from "Node 1"
to "Node 0", then stop and restart the cssadm daemon by using the normal
SRC commands to stop and start daemons, namely:
stopsrc -s swtadmd
startsrc -s swtadmd
Figure: cssadm node recovery logic — when a node event is received, the daemon checks whether the affected node is the primary node (or whether no primary node is currently defined); if the primary backup node is up on switch_responds, no action is taken and normal primary node takeover is awaited; otherwise, if the oncoming primary node is up on host_responds, an Estart is issued, and if it is not, no action is taken.
Notice that when the primary node is down on the switch but the primary
backup node is active, the cssadm daemon takes no action, because the
primary backup node will take over the primary node responsibility.
In the situation where both the primary and the primary backup node are
down on the switch but the oncoming primary node is not yet up on
host_responds, the daemon takes no action but waits until another significant
event occurs.
Start
Estart
Note that the cssadm daemon will only issue an Estart if it determines that
the oncoming primary node is up on host_responds.
In all cases, if Estart fails, it logs the errors in the cssadm.stderr log file
located in the /var/adm/SPlogs/css directory, but takes no additional recovery
actions.
complete =2
complete =2
(i) cssadm: Oncoming primary node has come up on Host Responds in partition
sp3en0. Estart will be run
(i) cssadm: Estart successful in partition sp3en0.
^C[sp3en0:/]# Eprimary
1 - primary
1 - oncoming primary
14 - primary backup
14 - oncoming primary backup
[sp3en0:/]#
By observing the contents of the cssadm.debug file we see that the Estart
command was issued in response to the oncoming primary node (node 1)
returning a positive host_responds. The console dialog also shows that node
1 is now the primary node.
In a second example, we simulate the loss of both the primary node (node 5)
and the primary backup node (node 15) from the switch network by killing the
Worm daemon on both nodes at the same time. The console log output, along
with the relevant contents of the cssadm.debug file, are shown in Figure 99
on page 231.
complete =2
complete =2
(i) cssadm: Primary node is down on switch responds in partition sp3en0.
Checking primary backup.
(i) cssadm: Primary backup is not up on switch responds. Checking
if oncoming primary is up on host responds.
(i) cssadm: Oncoming primary up on host responds. Going to Estart.
^C[sp3en0:/]# Eprimary
1 - primary
1 - oncoming primary
14 - primary backup
14 - oncoming primary backup
[sp3en0:/]#
Though the cssadm daemon detected that both the primary node and the
primary backup node are down, it found that the oncoming primary node is
up, so it issued an Estart to get the switch network back to normal.
When a node is up on the switch fabric, it does not matter how the isolated or
autojoin attributes are set: it will remain on the switch until it is fenced,
rebooted, or shut down. The opposite is true of a node that is fenced or
isolated: it will remain off the switch fabric until it is unfenced or the autojoin
attribute is set. Nodes that are fenced with their autojoin attribute set will get
unfenced automatically by the switch primary node.
However, if a node has an intermittent problem with the switch adapter, it may
continually be unfenced and refenced. This causes unnecessary work to be
done by the primary node. To avoid this problem, two mechanisms are put
into place:
• If the fault service daemon on the failing node reaches an error threshold
or detects an unrecoverable error, it puts its TBIC (see the following note)
into reset and sets the autojoin attribute to off. Once this occurs, the node
will not unfence until the rc.switch script is run to recover the node, and
then Eunfence is executed.
The sample console dialog in Figure 101 on page 234 shows the status of the
switch_responds class for a node being rebooted.
A periodic review of the status of this node also shows the new method of
operation of the automatic unfence function. An example of this is in Figure
102.
Figure 102. Excerpts from periodic spmon output for node 8 — while the node is up, host_responds and switch_responds both show yes and the LEDs are blank; while the node is being rebooted, host_responds shows no, switch_responds shows autojn, and the LEDs display a boot code.
Attention
By default, the action field is set to once, so the behavior of the inittab
processing is the same as in previous releases of PSSP. You must make
the necessary changes to the inittab file to take advantage of these new
features.
If you need to fence a node off the switch network, and not have it rejoin the
switch network, you must fence the node with the Efence command, which
turns off the automatic rejoin attribute in the switch_responds class. If you do
not have the autojoin attribute set, the fault service daemon (Worm) will not
automatically unfence it during Estart. The node will remain fenced until
either it is unfenced using the Eunfence command or the autojoin attribute is
set in the SDR.
Notes:
1. The primary node and primary backup node for an SP switch cannot be
fenced.
2. A node which is Efenced with autojoin will automatically join the switch
network within two minutes due to the new automatic unfence function in
PSSP 3.1.
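For illustration, a sketch of the fencing commands described above (the node name sp3n05 and the exact node specification syntax are illustrative):
# Efence sp3n05
# Eunfence sp3n05
# Efence -autojoin sp3n05
The first form leaves the node fenced with autojoin turned off, the second brings it back onto the switch, and the last fences it but allows it to rejoin automatically.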
However, if you turn off the Switch Administration daemon functions you may
still wish to use the Emonitor subsystem. Furthermore, if you are using a
primary node with a code version of PSSP v2.4 or earlier in a coexistence
environment, the new automatic unfence functions are not provided by the
Worm daemon.
You can monitor the availability of nodes on a switch with the Emonitor
daemon. Emonitor is controlled by the System Resource Controller (SRC).
One instance of the daemon exists for each partition and is named
Emonitor.<partition_name>.
The Equiesce command causes the primary and primary backup nodes to shut
down their recovery actions. Data still flows over the switch, but no faults are
serviced and primary node takeover is disabled. Only the Eannotator, Eclock,
Eprimary, and Etopology commands are functional after the Equiesce command
is issued.
The clock subsystem is critical for the proper functioning of the switch fabric.
In order to assure that every switch board is receiving the same clock signal,
alternative sources are usually designated in case the primary source fails.
However, this switchover is not automatic; it requires manual
intervention.
The Eclock command establishes a master clock source after the system is
powered up or when an alternate must be selected. It can set the
appropriate clock input for every switch in the system or for a single switch
after power-on.
Clock topology files follow a naming convention similar to that of the switch
topology files:
Eclock.top.<NSBnum>nsb.<ISBnum>isb.0
If Eclock is run to change the clock multiplexor settings while the switch is
operational, you will experience switch outages until a subsequent Estart is
completed. If you run Eclock and specify the -f, -a, -r or -d flags, you do not
need to run Estart if the switch admin daemon (swtadmd) subsystem is
active. In this case the subsystem runs Estart for you. These flags perform
the following operations:
-f Eclock_topology file
Specifies the file name of the clock topology file containing the
initial switch clock input values for all switches in the system.
-a Eclock_topology file
Uses the alternate Eclock topology specified in the given clock
topology file.
-r Extracts the clock topology file information from the System Data
Repository (SDR) and initializes the switch clock inputs for all
switches in the system.
Since Eclock operates across system partitions, if you specified the -f, -a, -r
or -d flag, you must run the Estart command in all system partitions unless
the swtadmd subsystem is active. In this case the subsystem runs Estart for
you. If you use the -s flag (which, together with the -m flag, sets the specific
clock source for an individual switch), then the Eclock command operates just
on the specified switch board. In this case, you need to run Estart only in the
partitions which share that switch board. However, if you used the -s flag to
reset the master switch board, the effect is the same as having issued a
global Eclock command and you must run Estart in all partitions. The -s flag
will recycle the Worm daemons only on the nodes connected to the target
switch boards.
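For example, to initialize the clock inputs for a one-switch system from its standard clock topology file (a sketch; the /etc/SP directory for the Eclock topology files is an assumption):
# Eclock -f /etc/SP/Eclock.top.1nsb.0isb.0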
Additionally, some switch events not only create a summary record on the
control workstation, but also trigger a snapshot, generated by the css.snap
utility on the node logging the error. This utility captures the data necessary
for further problem determination at the most appropriate time, when the error
happened, and thus helps shorten problem resolution time.
The new summary log file has entries with the following format:
• Timestamp - in the form of MMDDhhmmYYYY
• Nodename - short reliable hostname
• Snap - "Y" or "N", indicates whether a snap was taken
• Partition name - system partition name or global
• Index - the sequence number field in the AIX error log
• Label - the label field in the AIX error log
All switch-related AIX error log entries that occur on a node and that generate
a summary log record on the control workstation are shown in Table 7.
Table 7. AIX Error Log Entries
[sp3en0:/]# Eprimary
1 - primary
1 - oncoming primary
7 - primary backup
14 - oncoming primary backup
[sp3en0:/]# Eprimary
7 - primary
1 - oncoming primary
5 - primary backup
14 - oncoming primary backup
The AIX error log entries on nodes 1 and 7 are shown in Figure 104 on page
247.
We can see that the entries from the AIX error logs on the nodes have been
conveniently recorded in the summlog file, making the initial problem
determination process easier.
The various log files associated with the SP Switch are located as follows:
On all Nodes:
• /var/adm/SPlogs/css/Ecommands.log
• /var/adm/SPlogs/css/daemon.stderr
• /var/adm/SPlogs/css/daemon.stdout
• /var/adm/SPlogs/css/dtbx.trace
• /var/adm/SPlogs/css/flt
• /var/adm/SPlogs/css/fs_daemon_print.file
• /var/adm/SPlogs/css/logevnt.out
• /var/adm/SPlogs/css/out.top
• /var/adm/SPlogs/css/rc.switch.log
• /var/adm/SPlogs/css/router.log
• /var/adm/SPlogs/css/worm.trace
We discussed earlier in this chapter the naming conventions that are used to
interpret the various fields in this file (see 9.4, “Switch Topology File” on page
218). Those conventions still apply. After the logical and physical information
is displayed, other switch network information is provided with a return code.
This additional information could be an indication of an error, or simply, an
observation that there is no node in a slot to receive a connection from the
switch board. In this example, there are no nodes in the slots 2, 3, 4 and 16.
The likely explanation is that there is a high node in slot 1 (it occupies 4
slots), and a wide node in slot 15 (it occupies 2 slots).
The control workstation does not belong to any system partition. Instead, it
provides a central control mechanism for all partitions that exist on the
system.
Figure 106 on page 251 shows how a single-switch system with 16 thin nodes
could be partitioned using one of the layouts for the 8-8 configurations.
The first and foremost task an administrator has to perform after purchasing
an RS/6000 SP is to set it up. Setting up an RS/6000 SP system requires a
fair amount of planning. This chapter explains the processes involved in
installing and configuring the RS/6000 SP system from the software
perspective. Hardware installation is planned with the help of your IBM
hardware engineers. They will be able to assist you in developing a
satisfactory design.
If you are new to the RS/6000 SP, we recommend that you read IBM
RS/6000 SP: Planning, Volume 1, Hardware and Physical Environment ,
GA22-7280 and IBM RS/6000 SP: Planning, Volume 2, Control Workstation
and Software Environment , GA22-7281 for complete information regarding
system planning.
After installing at least one node, you can configure one or more of the nodes
to become boot/install servers. These nodes, serving as boot/install servers,
can be used to boot up and install other nodes, offloading this time-consuming
and processor-intensive task from the control workstation.
With NIM, you can manage standalone, diskless, and dataless systems. In a
broad sense, an SP node can be considered a set of standalone systems:
each node has the capability of booting up on its own, because the
RS/6000 SP system is based on a shared-nothing architecture. Each node is
basically a standalone system configured into the SP frame.
Working together with the System Data Repository (SDR), NIM allows you to
install a group of nodes with a common configuration or individually
customize each node to its requirements. This helps to keep administrative
jobs simpler by having a standard rootvg image and a standard procedure for
installing a node. The rootvg volume groups are consistent across nodes (at
least when they are newly installed). You also have the option to customize
an installation to cater to the specific needs of a given node if it differs from
the rest.
As NIM installations utilize the network, the number of machines you can
install simultaneously depends on the throughput of your network (namely,
Administrative Ethernet). Other factors that can restrict the number of
installations at a time are the disk access throughput of the installation
servers, and the processor type of your servers.
The control workstation and boot/install servers are considered NIM masters.
They provide resources (like files, programs and booting capability) to nodes.
Nodes are considered NIM clients, as they are dependent on the masters for
these resources.
A NIM master makes use of the Network File System (NFS) utility to share
resources with clients. As such, all resources required by clients must be
local file systems on the master.
NIM organizes its information into object classes, object types, and object
attributes. Table 8 shows how these NIM objects are related.
Table 8. NIM Objects Classification
NIM allows two modes of installation: pull or push. In the pull mode, the client
initiates the installation by pulling resources from the master. In the push
mode, the master initiates the installation by pushing resources to the client.
The RS/6000 SP technology uses the pull mode. Nodes request resources
such as the mksysb image, the SPOT, AIX source files, and so on from the master (by
default, the control workstation).
During the initial booting of a node, a bootp request from the node is issued
over the network specifying its en0 hardware address. This request reaches
the boot/install server or control workstation. The boot/install server or control
workstation verifies the node’s hardware address against the /etc/bootptab
file. If the node is registered in the /etc/bootptab file, the boot/install server or
control workstation sends the node’s IP address back to the node.
On receiving the IP address, the node’s Initial Program Load (IPL) Read-Only
Storage (ROS) requests a boot image from the boot/install server or control
workstation through tftp. Once the boot image is run, the rootvg gets created,
and NIM starts the mksysb image installation.
Upon completing the installation of the mksysb image, NIM invokes the
pssp_script script to customize the node.
If you need further details about the way NIM works, refer to AIX Version 4.3
Network Installation Management Guide and Reference , SC23-2627.
The rest of this chapter describes in detail the procedure to set up the
RS/6000 SP system. For further details, refer to PSSP: Installation and
Migration Guide, GA22-7347.
In a broad sense, the set up of the control workstation involves the following
four steps:
1. Install AIX
2. Set up the AIX environment
3. Set up the PSSP environment
4. Set up SDR information
There must be enough disk space on the control workstation since it is the
default boot/install server. Backup images, AIX filesets, and the SPOT all require
a substantial amount of disk space.
Ensure that bos.net (TCP/IP and NFS) is installed on your system. Another
prerequisite fileset, which ships with AIX 4.3.2, is perfagent.tools 2.2.32.*.
Refer to Table 11 for the required perfagent file sets.
Table 11. Perfagent File Sets
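One way to confirm that these prerequisites are already installed is with the lslpp command; the fileset names shown here are the usual bos.net components, so adjust them for your level:
# lslpp -l bos.net.tcp.client bos.net.nfs.client perfagent.tools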
Next you will need to configure the RS-232 tty connection to the frame. Select
the port to which the RS-232 frame supervisor cable is connected. The
default baud rate of 9600 can be used since hardmon changes it to 19200
when it accesses the line. Each SP frame has one RS-232 line connected,
except in HACWS configurations, where two RS-232 lines are attached to each frame (one from each control workstation).
The transmit queue size for the SP administrative Ethernet adapter has to be tuned
for better performance. Refer to Table 12 for the various settings.
Table 12. Adapter Type and Transmit Queue Size Settings
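As a sketch, the transmit queue size of an Ethernet adapter can be changed with chdev; the adapter name ent0 and the value 512 are examples only, and the -P flag defers the change until the next reboot:
# chdev -P -l ent0 -a xmt_que_size=512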
After changing the transmit queue size for the appropriate adapter, you can
proceed to configure the SP administrative Ethernet IP address based on
your planning sheet.
The maximum number of processes allowed per user must be increased from
the default value of 40 to a recommended value of 256. This is done to
accommodate the numerous processes spawned off during the installation.
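A minimal sketch of this change, using chdev on the sys0 device:
# chdev -l sys0 -a maxuproc=256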
The default network options have to be tuned for optimal performance over
the network. The recommended values are stated in Table 13. To make the
changes effective every time the control workstation is rebooted, these
values must be set in the /etc/rc.net file. For immediate effect, use the
command line. For example, to change thewall value, type:
# no -o thewall=16384
Table 13. Recommended Network Option Tunables
Parameters Values
thewall 16384
sb_max 163840
ipforwarding 1
tcp_sendspace 65536
tcp_recvspace 65536
udp_sendspace 32768
udp_recvspace 65536
tcp_mssdflt 1448
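For example, the same settings can be made persistent by appending no commands to /etc/rc.net; only two of the values are shown here as a sketch:
if [ -f /usr/sbin/no ]; then
   /usr/sbin/no -o thewall=16384
   /usr/sbin/no -o tcp_sendspace=65536
fi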
The installation requires a directory structure under /spdata/sys1/install, including an aix432/lppsource directory for the AIX filesets and a pssplpp/PSSP-3.1 directory for the PSSP images.
Each one of the directories must be created. If you have to install nodes of
other AIX levels, you must create the relevant lppsource directories under
their own AIX directories. Here, only the aix432 directory and its related
lppsource directory are created. If the other AIX levels require different PSSP
codes, then the relevant PSSP directories must be created under the pssplpp
directory. This is shown in Figure 109 on page 265.
The figure shows /spdata/sys1/install with an additional aix421/lppsource directory, and with both PSSP-2.4 and PSSP-3.1 directories under pssplpp.
Figure 109. Multiple AIX And PSSP Levels
To copy the AIX filesets from the installation media into the lppsource directory, use the bffcreate command or the SMIT fastpath:
# /usr/sbin/bffcreate -v -d /dev/cd0 -t \
/spdata/sys1/install/aix432/lppsource -X all
# smitty bffcreate
Also required in the lppsource directory are the perfagent.* file sets. The
perfagent.server file sets are part of the Performance Aide for AIX (PAIDE)
feature of the Performance Toolbox for AIX (PTX). The perfagent.tools file
sets are part of the AIX packaging. See Table 11 on page 262 for the correct
level for your installation.
Copy the mksysb image that you want to use for installing your nodes into the
/spdata/sys1/install/images directory. Each mksysb must have an equivalent
AIX lppsource to support it. You can use the minimum image that is shipped
with the RS/6000 SP system, or you can create your own mksysb image.
Next, copy the PSSP images from the product tape into the
/spdata/sys1/install/pssplpp/PSSP-3.1 directory:
# mv /spdata/sys1/install/pssplpp/PSSP-3.1/ssp.usr.3.1.0.0 \
/spdata/sys1/install/pssplpp/PSSP-3.1/pssp.installp
# inutoc /spdata/sys1/install/pssplpp/PSSP-3.1
Attention
If you are adding dependent nodes, you must also install the ssp.spmgr
fileset.
To finish the installation of the control workstation, run the install_cw
script. This command configures the control workstation as follows:
• It adds the PSSP SMIT panels to the ODM.
• It creates an SP object in the ODM which contains the node_number for
the control workstation, which is always 0.
• It calls the script /usr/lpp/ssp/install/bin/post_process to do
post-installation tasks such as:
• Starting the hardmon daemon, SDR daemons
• Calling the setup_logd script to start the monitor's logging daemon
• Updating the /etc/services and /etc/inittab files.
• It also ensures that /tftpboot/tuning.cust exists. If the file does not exist,
it copies from /usr/lpp/ssp/install/config/tuning.default.
• Finally, it sets the authentication server attribute in the SDR to reflect
the kind of authentication server environment.
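The script is simply run as root on the control workstation:
# install_cw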
After install_cw completes, verify the SDR and the System Monitor installation with the following commands:
# SDR_test
# spmon_itest
Then enter the SP configuration data into the SDR. The SMIT fastpath is:
# smitty enter_data
The first task is to set up the Site Environment Information. Select Site
Environment Information on the SMIT menu. Figure 110 shows this SMIT
screen. Enter the following information:
• Default network install image name (directory path not required)
• NTP information
• Automounter option
• User administration information
• File collection information
• SP accounting information
• AIX lppsource name (directory path not required)
Next, enter the SP Frame Information as shown in Figure 111 on page 269. In
this panel, specify the starting frame number, the frame count (the number of
consecutive frames), the starting frame tty port and whether you want to
re-initialize the SDR. Do not enter non-SP frames in this panel. The
re-initialization of the SDR is required only when the last set of frame
information is entered. Note that if your tty numbers do not run
consecutively, you need to enter the frame information in separate sessions.
If you have non-SP frames attached, enter their information in the Non-SP
Frame Information SMIT panel as shown in Figure 112 on page 270. If the
non-SP frame is an S70 or S7A, make sure that the starting frame tty port you
specify is the one connected to the operator panel on the server. The s1 tty
port field is the tty connected to the serial port on the server. The starting
switch port number is the switch node number. The frame hardware protocol
is where you specify if the non-SP frame is an S70/S7A server or a Netfinity
server. For an S70/S7A server, select SAMI; for a Netfinity server, select
SLIM. Re-initialize the SDR if the information you entered is for the last set of
frames.
At this point, your frames must already be powered up. Verify that the System
Monitor and Perspectives have been correctly installed. The SMIT fastpath is:
# smitty SP_verify
Select the System Monitor Configuration option. Figure 113 on page 271
shows the output if everything is installed properly. If it fails, check that your
tty configuration is correct. Also, ensure that the physical RS-232 lines are
properly connected. Delete the frame information and reconfigure again.
Make sure that you do not remove or change the RS-232 connections before
you delete the frame or you will end up having to hack the SDR to salvage the
situation. Other causes of error include Kerberos authentication problems,
SDR or hardmon daemons not running and so on. Rectify the errors before
proceeding.
Now check the status of the frames and nodes with the spmon command in diagnostic mode:
# spmon -d
If any error is found, check the RS-232 cables. You should not have this
problem, though, as the previous test will have detected it.
Next, verify the supervisor microcode of your system. The SMIT fastpath is:
# smitty supervisor
Figure 114 on page 272 shows the result of selecting the List Status of
Supervisors (Report Form) option. Any item that has the status Upgrade
needs the microcode updated. Choose the Update *ALL* Supervisors That
Require Action (Use Most Current Level) option to update microcode on all
the necessary supervisor cards.
Next, enter the node information for the SP Ethernet. The SMIT fastpath is:
# smitty node_data
If you intend to use the node’s slot number as a reference for assigning an IP
address, set the Skip IP Address for Unused Slots to yes. This will skip the
next sequential IP addresses when it encounters Wide or High nodes. Figure
115 on page 273 shows the SMIT panel available for setting up SP Ethernet
addresses.
Attention
Acquiring hardware addresses for nodes will power those nodes down. Do
not use this step on nodes running in a production environment.
If you have a switch or other network adapters (Ethernet, FDDI, token ring) in
your system and you want them to be configured during installation, you need
to perform this step. To do so, select the Additional Adapter Information
menu. To configure the SP Switch adapter, use css0 for the Adapter Name.
Specify the IP addresses and netmask just like for the en0 adapter.
Attention
• To skip IP addresses for unused slots, do not use the switch node
numbers for css0 IP addresses.
• If you do not use switch node numbers for css0 IP addresses, you must
ensure ARP is enabled.
In addition, you can specify additional IP addresses for the adapter if you
have IP aliasing. Figure 117 on page 275 shows the SMIT panel available for
setting up additional network adapters.
Next, configure the default hostname of the nodes. By default, the en0’s long
hostname is used. You can use another adapter’s hostname or change to
short hostname. Figure 118 shows the SMIT panel for setting up hostnames.
The next step creates the appropriate authorization files for the use of remote
commands. The available methods are k4 (Kerberos Version 4) and std
(standard AIX). The SMIT fastpath is:
# smitty spauth_config
Next, enable the selected authentication methods for use with System
Management tasks. The default is to enable authentication on all nodes as
well as the control workstation. If the Force change on nodes option is set to
yes, the authentication method is forced to change to whatever you set, even
if the information on the node is already the same. The available methods are
k5 (Kerberos Version 5), k4, and std; k4 is required, while k5 and std are
optional. If you intend to use the ftp, rlogin, and telnet commands, k5 or std
must be enabled as well. It is recommended that you enable all of them. Figure
120 on page 277 shows the SMIT panel available for enabling the
authentication methods.
If you have a dependent node, add it at this point. First, ensure that you have
the ssp.spmgr fileset installed and that the UDP port 162 it uses for SNMP
traffic does not clash with other applications that use the same port. If a
conflict arises, modify the spmgrd-trap entry in /etc/services to use another
port number.
The SMIT fastpath to add a dependent node is:
# smitty enter_extnode
After adding the dependent node information, each node adapter for the
dependent node must be defined. The SMIT fastpath is:
# smitty enter_extadapter
Specify the network address, netmask and the node number as shown in
Figure 122 on page 278.
At this point, the system partition-sensitive subsystems like hats, hags, haem
and hr must be added and started. To do so, issue the following command:
# syspar_ctrl -A
Verify that all the system partition-sensitive subsystems have been properly
added by issuing the following command:
# syspar_ctrl -E
Check that the subsystems are active with the following command:
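A minimal sketch (the exact subsystem names include your system partition name, so the output varies) uses lssrc:
# lssrc -a | egrep "hats|hags|haem|hr"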
You may have to wait a few minutes before some of the subsystems become
active.
With PSSP 3.1, you can install the rootvg on external SSA or SCSI disks.
Refer to Table 14 for a list of supported SSA boot devices and Table 15 on
page 281 for a list of supported SCSI boot devices.
Table 14. Supported External SSA Boot Devices
Node type        External SSA boot supported
62 MHz Thin      N
66 MHz Wide      N
66 MHz Thin 2    N
77 MHz Wide      Y
POWER3 Thin      N
POWER3 Wide      N
Table 15. Supported External SCSI Boot Devices
Node type        External SCSI boot supported
62 MHz Thin      Y
66 MHz Wide      Y
66 MHz Thin 2    Y
77 MHz Wide      Y
The install disk or disks for a node can be specified in three formats. The first
format is a list of hdisk names, for example hdisk0. The second format uses the
hardware location codes of the disks, separated by a colon, for example:
00-00-00-0,0:00-00-00-1,0
The third format specifies the parent-connwhere attribute (SSA only), for
example ssar//0004AC5052B500D. To indicate more than one disk, separate
the disks by colons:
ssar//0004AC5052B500D:ssar//0004AC5150BA00D
You can always go back to make changes to this volume group by selecting
the Change Volume Group Information option.
Before continuing with the installation, perform a check on all the information
you have entered into the SDR using the splstdata command.
# splstdata -e
List Site Environment Database Information
attribute value
------------------------------------------
control_workstation sp4en0
cw_ipaddrs 9.12.0.4:192.168.4.140:
install_image bos.obj.ssp.432
remove_image false
primary_node 1
ntp_config consensus
ntp_server ""
ntp_version 3
amd_config false
print_config false
print_id ""
usermgmt_config true
passwd_file /etc/passwd
passwd_file_loc sp4en0
homedir_server sp4en0
homedir_path /home/sp4en0
filecoll_config true
supman_uid 102
supfilesrv_port 8431
spacct_enable true
spacct_actnode_thresh 80
spacct_excluse_enable false
# splstdata -f
# splstdata -n
# splstdata -a
# splstdata -b
# splstdata -s
To list the node information for dependent (extension) nodes, use the splstdata command with the -x flag.
You can customize your nodes during the installation process. There are
three files where you can specify your customization requirements. The three
files are:
• tuning.cust - This file is used to set the initial network tuning parameters. It
is called by pssp_script after the node is installed, and it must be placed in
the /tftpboot directory to be effective. If the boot/install server cannot locate
this file during the installation process, the default tuning file is used instead.
• script.cust - This script is run at the end of the node installation, before the
node reboots. Use it for customization that must be in place before the first
reboot.
• firstboot.cust - This script is run during the node's first boot after the
installation. It is the recommended place for the bulk of your site-specific
node customization.
The next step will configure the control workstation as a boot/install server.
Prior to performing this step, ensure that the /usr file system or its related
directories are not NFS-exported. The /spdata/sys1/install/images directory
also must not be NFS-exported.
The setup_server Perl script is run on the control workstation when explicitly
called, and on every node when they reboot. On the control workstation, or on
a node set to be a boot/install server (stated in the SDR), this script
configures it as a boot/install server. This command requires a
ticket-granting-ticket to run. It performs the following functions:
• Defines the boot/install server as a Network Installation Management
(NIM) master
• Defines the resources needed for the NIM clients
• Defines each node that this server installs as a NIM client
• Allocates the NIM resources necessary for each NIM client
• Creates the node.install_info file containing netinstall information
• Creates the node.config_info file containing node-specific configuration
information to be used during network boot
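On the control workstation, it is simply invoked with no arguments (it can take a while to complete):
# setup_server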
After setup_server completes, verify that the System Management tools have been correctly installed:
# SYSMAN_test
Next, annotate the switch topology file and store it in the SDR. The SMIT fastpath is:
# smitty annotator
Enter the topology file name, specifying the full directory path. Also enter a
fully-qualified name of a file in which to store the annotated topology file.
Store the annotated file in the SDR. If the system cannot find a
/etc/SP/expected.top file, it will read from the SDR to get the required
information.
The primary and primary backup nodes will already be defined. Verify this by
running the Eprimary command. You will see an output like this:
1 - primary
1 - oncoming primary
The nodes assigned may be different in your environment, but all four roles
(primary, oncoming primary, primary backup, and oncoming primary backup) are defined.
You must next set the switch clock source. Refer to 9.7, “Switch Clocks” on
page 238 for details on selecting the correct Eclock topology file. To initialize
the clock setting in SMIT, use the fastpath:
# smitty chclock_src
Important
If you have a running switch network, this command will bring the whole
switch network down. The moment you select the topology file, it executes
the Eclock command immediately. There is no warning message given.
You can partition your system now or you can do it after the installation. Refer
to PSSP: Administration Guide, SA22-7348 for details on partitioning the
system.
The nodecond_mca script does things a little differently. It first gets the
network type and attributes from the SDR. It determines the node type to see
whether it is a Thin, Wide or High node. Next, it powers off the node and
opens up a serial port. If the node type is either Thin or Wide, it sets the key
mode to secure. Next, it initiates a hmmon process to monitor the LED status.
The node is then powered on. When the LED reaches 200, the key mode is
changed to service. On detecting an LED of 262, it lets the node know that
the serial port is available. The IP address is determined and the network
boot proceeds. NIM does the rest of the installation. For a High node, the
key mode is switched to service, the BUMP processor is awakened and set
up, and the node is then powered on. The script then sets the node up for
network boot. The IP address is determined and the network boot proceeds.
NIM takes over the rest of the installation.
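The network boot itself is driven by the nodecond command (which invokes nodecond_mca for MCA nodes); as a sketch, to condition and network boot the node in frame 1, slot 1, running it in the background:
# nodecond 1 1 &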
Once the node is installed and set up, run the SYSMAN_test command again
to verify that System Management tools are properly installed on the node.
When this verification check succeeds, the rest of the nodes can be installed.
Verify that the newly installed node is up and running:
# spmon -d -G
With PSSP 3.1 you can mirror the rootvg volume group of a node. To change the volume group information for mirroring, use the SMIT fastpath:
# smitty changevg_dialog
Specify the node on which you want mirroring to be performed. Enter rootvg
as the Volume Group Name and specify the hard disks that rootvg will occupy
(including hdisk0). Change the number of copies to two or three, depending
on the mirroring copies you want.
After changing the rootvg characteristics, begin the mirroring process through
the corresponding SMIT panel.
Select the node and use Forced Extending the Volume Group in case the
hard disk contains unwanted volume group information; see Figure 128. The
mirroring process starts and takes about 30 minutes or more depending on
the size of your volume group.
You can choose to install an alternate rootvg on another hard disk (maybe for
testing purposes). You need to create the new volume group, naming it
anything except rootvg (an example is rootvg1). The installation procedure is
the same as for rootvg installation. With two root volume groups residing on the
two hard disks, you need to know how to switch between them. The spbootlist
command assists you in performing this task.
First change the Volume Group Information to the volume group from which you
want the node to boot; this can be done through SMIT or from the command line.
Then set the node's bootlist with the spbootlist command and check the bootlist
on the node.
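A minimal command-line sketch, assuming node 5 and an alternate volume group named rootvg1 (check the exact spchvgobj and spbootlist flags against the PSSP command reference for your level):
# spchvgobj -r rootvg1 -l 5
# spbootlist -l 5
# dsh -w sp4n05 'bootlist -m normal -o'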
If the change is correct, reboot the node. Your node will boot up from the
rootvg you specified.
The information presented in this chapter will help you answer these
questions and guide you to other resources that will assist in system
management activities. We will be covering the following topics in this
chapter:
• SP Perspectives – the graphical user interface for system management on
the RS/6000 SP
• SP Tuning and Performance Monitoring
• SP Accounting
• Problem Management
11.1 SP Perspectives
Scalable POWERparallel Perspectives for AIX (SP Perspectives) is a set of
applications, each of which has a graphical user interface (GUI), that enables
you to perform monitoring and system management tasks for your SP system
by directly manipulating icons that represent system objects.
The AIX command perspectives starts the Launch Pad, from which you can
launch the following applications:
• Hardware Perspective, for monitoring and controlling hardware
• Event Perspective, for defining and monitoring system events
• IBM Virtual Shared Disk Perspective, for managing shared disks
You can run the individual applications outside of the Launch Pad. The full
pathnames of the individual applications are shown in Table 16.
Table 16. SP Perspective Application Pathnames
Before we review these functions and how to perform them, we will look at the
structure of the Hardware Perspective window as it appears when it is run
either from the command line with the sphardware command or by double
clicking the top leftmost icon Hardware Perspective on the launch pad.
These objects are displayed by default as icons, which are placed inside
panes in the perspective window (icon view).
The Hardware Perspective provides four different kinds of panes. It refers to them
as:
• CWS, System and Syspars (contains the control workstation, system and
system partition objects)
• Nodes (contains node objects)
• Frames and Switches (contains frame and switch objects)
• Node Groups (contains node group objects)
The Event Perspective allows you to define and manage event definitions
within a system partition. In this sense it is not a GUI for displaying
information; rather, it allows you to define monitors and triggers for other
perspectives to display.
An event definition allows you to specify under what condition the event
occurs and what actions to take in response. Using the Event Perspective,
you can:
• Create an event definition
• Register or unregister an event definition
• View or modify an existing event definition
• Create a new condition
#acl# /etc/sysctl.pman.acl
#
# These are the kerberos principals for the users that can configure
# Problem Management on this node. They must be of the form as indicated
# in the commented out records below. The pound sign (#) is the comment
# character, and the underscore (_) is part of the "_PRINCIPAL" keyword,
# so do not delete the underscore.
#
_PRINCIPAL [email protected]
You can initialize the Event Perspective from the launch pad, or alternatively
from the command line using the spevent command. The main window of the
Event Perspective is shown in Figure 130 on page 298.
The bottom pane of the initial event perspective window shows some of the
19 pre-defined Event Definitions that are displayed when the Event
Perspective is started. These definitions are summarized in Table 17.
Table 17. Pre-Defined Event Definitions
By default, none of these events are active. You can start event monitoring
using one or more of these defined events by using the following procedure:
1. Choose the event you wish to monitor by selecting its icon in the Events
Definition pane. You can choose more than one event by holding down the
left Ctrl key while you select the events.
2. Go to the Menu bar, and select Actions->Register.
3. The icons for the selected events will change from all-grey to colored. The
definitions for the colors assigned to icons are shown in Figure 131 on
page 300.
The icon states shown are: all blue, two-gray/two-color, all grey, white envelope, and blue envelope.
In addition to this visual alert, the system administrator will see an event
notification log window pop up on his display. An example is shown in Figure
132 on page 301. There we can see both the original event notification when
the switchResponds event was triggered, and the rearm notification once the
problem was diagnosed and remedial action taken to fix it.
Here you can see that the resource elements IDs that have been selected
are:
• The appslv logical volume, which resides on
• The rootvg volume group that is in
• Each node of the SP system (gaps in the node numbers indicate that
there are no nodes in those slots)
9. Click on the Create button. The Event Definition will be saved and
registered. When the file system utilization on any of these nodes
becomes greater than 80% or falls below 70%, you will be notified by the
Event Notification Log, as shown in Figure 136.
Each action correlates to one or more virtual shared disk commands. You can
run virtual shared disk, other PSSP, and AIX commands from within this
interface as well. SMIT is also available to you for managing shared disks.
Click Help->Tasks at the top right-hand corner of the primary window to see
an online help system that explains how to use the IBM Virtual Shared Disk
Perspective interface.
The information at this site is updated with the latest performance and tuning
data.
sb_max Upper limit on the size of the TCP and UDP buffers in
allocated space per connection.
tcp_sendspace The default size of the TCP send space value in bytes.
udp_sendspace The default size of the UDP send space value in bytes.
The safe maximum is 65536 (64K).
The SP Switch introduces specific tuning parameters. You can tune the
switch’s device driver buffer pools, rpool and spool, by changing the rpoolsize
and spoolsize parameters on the node’s switch adapter. These pools are
used to stage the data portions of IP packets. Their sizes interact with the
node’s network options settings. A detailed discussion of tuning the rpool and
spool buffers is presented in an online document at:
http://www.rs6000.ibm.com/support/sp/perf
The send pool and receive pool are separate buffer pools, one for outgoing
data (send pool) and one for incoming data (receive pool). When an IP packet
is passed to the switch interface, if the size of the data is large enough, a
buffer is allocated from the pool. If the amount of data fits in the IP header mbuf, no buffer is taken from the send pool.
To see the current send pool and receive pool buffer sizes, run the following
command on your nodes:
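One way to display them (assuming the switch adapter device is css0) is with lsattr:
# lsattr -El css0 | grep poolsize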
The default value for these buffer pools is 512 Kbytes (524288 bytes).
After reviewing your current usage of the buffer pools (by running the vdidl3
command on the node or nodes), and following the suggestions and
recommendations in the online document, you can modify the spoolsize
and/or rpoolsize parameters using the chgcss command. For example, to
change both the send and receive buffer pool size on a node to 1 Mbyte,
enter:
chgcss -l css -a rpoolsize=1048576 -a spoolsize=1048576
New values for the pool sizes must be expressed in bytes, and will not take
effect until the node(s) are re-booted.
Detailed tuning information for VSD and RVSD (the basis of GPFS) can be
found in IBM Parallel System Support Programs for AIX Managing Shared
Disks, SA22-7349.
Transmit queues
For transmit, the device drivers may provide a transmit queue limit. There
may be both hardware queue and software queue limits, depending on the
driver and adapter. Some drivers have only a hardware queue, some have
both hardware and software queues. Some drivers internally control the
hardware queue and only allow the software queue limits to be modified.
Generally, the device driver will queue a transmit packet directly to the
adapter hardware queue. On an SMP system, or if the system CPU is fast
relative to the speed of the network, the system may produce transmit
packets faster than they can be transmitted on the network. This will cause
the hardware queue to fill. Once the hardware queue is full, some drivers
provide a software queue and will then queue to the software queue. If the
software transmit queue limit is reached, then the transmit packets are
discarded. This can affect performance because the upper level protocols
must then timeout and retransmit the packet.
Prior to AIX 4.2.1, the upper limits on the transmit queues were in the range
of 150 to 250, depending on the specific adapter. The system default values
were quite low, typically 30. With AIX 4.2.1 and later, the transmit queue limits
were increased on most of the device drivers to 2048 buffers, and the default
values were also increased.
For adapters that provide hardware queue limits, changing these values will
cause more real memory to be consumed because of the control blocks and
buffers associated with them. Therefore, these limits should only
be raised if needed, or for larger systems where the increase in memory use
is negligible. For the software transmit queue limits, increasing these does not
increase memory usage. It only allows packets to be queued that were
already allocated by the higher layer protocols.
Receive Queues
Some adapters allow you to configure the number of resources used for
receiving packets from the network. This might include the number of receive
buffers (and even their size) or may simply be a receive queue parameter
(which indirectly controls the number of receive buffers). The receive
resources may need to be increased to handle peak bursts on the network.
SP Considerations
One of the final RS/6000 SP installation steps is setting all network adapters
in SP nodes to their maximum transmit queue size. With AIX release 4.2.1
and later, the default transmit queue limit has been increased to 512.
Be aware that the transmit queue size values recommended in IBM PSSP
for AIX Installation and Migration Guide, GA22-7347, at Step 59: Tune the
Network Adapters are incorrect.
Check the value of this adapter attribute on all newly installed nodes using
an appropriate dsh command, such as:
# dsh -a lsattr -El ent0 | grep xmt_que_size
If the value for the queue size is not 512, then change it using:
# dsh -a chdev -P -l ent0 -a xmt_que_size=512
Reboot the nodes to make the change effective. A value of 512 should be
sufficient for initial operation, and can be increased if you change the
tcp_sendspace or tcp_recvspace as set in the network options.
An Ethernet MTU is usually 1500 bytes. 512 MTUs represent 768,000 bytes,
more than ten times the size of a single packet (MTU) on the SP Switch,
which is 65,520 bytes.
For the most part, PDT functions with no required user input. PDT data
collection and reporting are easily enabled, and then no further administrator
activity is required. Periodically, data is collected and recorded for historical
analysis, and a report is produced and mailed to the adm userid. Normally,
only the most significant apparent problems are recorded on the report.
PDT optionally runs on individual nodes, assesses the system state, and
tracks changes in performance and workload. It tries to identify impending
problems and suggest solutions before they become critical.
With reference to Figure 137 on page 313, the xmservd daemon of PA feeds
the performance statistics into shared memory. While the aixos resource
monitor within the Event Management (EM) daemon can supply AIX-level
resource variable data selectively, as per EM client event registrations, the Performance Agent makes the complete set of statistics available to its data consumers.
Figure 137 shows this structure: applications and AIX feed statistics through dynamic data suppliers and the xmservd daemon into shared memory, where local data consumers access them through the SPMI and remote data consumers, such as 3dmon, access them through the RSI.
The monitoring hierarchy itself has three tiers: a first tier of reporter nodes that provide, archive, and report statistics; a second tier of data managers that handle statistics requests; and a third tier, the central coordinator, which coordinates and administers the data managers.
There are three filesets in the PSSP installation media that make up PTPE:
ptpe.program The PTPE programs. This software will not run unless RSCT
has been installed.
ptpe.docs Documentation material, for example man pages.
ssp.ptpegui This image should be installed on any node on which you plan
to run SP Perspectives.
Before you can use Performance Toolbox Parallel Extensions for AIX, you
must create a monitoring hierarchy as shown in Figure 139 on page 315.
PTPE distributes the management of performance information among a
number of data manager nodes rather than giving total responsibility to a
single node. Although one central coordinator node is designated when the
monitoring hierarchy is created, the data manager nodes act as
intermediaries, absorbing most of the administrative overhead and greatly
reducing data transfer operations. The resulting central coordinator node
workload is far less than that required by a single point of control for all nodes
and all data management functions.
Figure 139 illustrates the monitoring hierarchy used in our example.
Here we have decided that nodes sp4n05 and sp4n11 will be Data Managers,
and sp4n01 will be the Central Coordinator; the remaining nodes will be
simply Reporters. To formalize this hierarchy, you create a text file (we call it
samp_hier) in the /tmp directory that describes these groupings.
To initialize this hierarchy, you use the ptpehier command with a -c flag to
specify the Central Coordinator node and -i flag to specify that the hierarchy
will be read from standard input — in this case, the /tmp/samp_hier file. You
can also let PTPE automatically create a hierarchy, based on your Ethernet
local area subnetwork, or if you have a multi-frame SP system, then it can
develop a hierarchy based on the node/frame configuration. In our example,
we use the following commands to set up and confirm the hierarchy.
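The set-up command would look like this, using the central coordinator and input file described above:
[sp4en0:/]# /usr/lpp/ptpe/bin/ptpehier -c sp4n01 -i < /tmp/samp_hier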
[sp4en0:/]# /usr/lpp/ptpe/bin/ptpehier -p
ptpehier: The current monitoring hierarchy structure is:
sp4n01.msc.itso.ibm.com
sp4n11.msc.itso.ibm.com
sp4n11.msc.itso.ibm.com
sp4n13.msc.itso.ibm.com
sp4n15.msc.itso.ibm.com
sp4n01.msc.itso.ibm.com
sp4n05.msc.itso.ibm.com
sp4n05.msc.itso.ibm.com
sp4n06.msc.itso.ibm.com
sp4n07.msc.itso.ibm.com
sp4n08.msc.itso.ibm.com
sp4n09.msc.itso.ibm.com
sp4n10.msc.itso.ibm.com
[sp4en0:/]#
All that remains is to initialize and start the data collection process using the
ptpectrl command with the -i and -c flags:
[sp4en0:/]# ptpectrl -c
------------------------------------------------------------
ptpectrl: Starting collection of performance information.
ptpectrl: Reply from the Central Coordinator expected within 250 seconds. OK.
ptpectrl: Performance information collection successfully started.
------------------------------------------------------------
ptpectrl: Command completed.
------------------------------------------------------------
[sp4en0:/]#
Now you can use the familiar xmperf and 3dmon commands to review the
performance of the nodes in your SP complex. Since performance monitoring
creates additional load on the nodes being monitored, you should shut down
the monitoring process once you have collected sufficient performance data,
using the ptpectrl -s command.
11.3 SP Accounting
The SP accounting facilities are based on the standard System V accounting
system utilities of AIX. These accounting facilities are described in AIX
Version 4.3 System Management Guide: Operating System and Devices,
SC23-4126. Enabling SP system accounting support is optional. It is a
separately-installed portion of the PSSP ssp.sysman fileset. SP system
accounting extends the function of base AIX accounting in three ways:
• Accounting-record consolidation - Partial reduction of accounting data
is done on each node before the data is consolidated on an accounting master node.
You can define or change these values after installation by using SMIT (the
fastpath is site_env_dialog) or the spsitenv commands. An extract of the
relevant SMIT screen is shown in Figure 141.
You also need to further define SP accounting at the node level. This can be
done using the SMIT fastpath acctnode_dialog or the spacctnd command. The
following sequence of spacctnd commands will set the job charge value for
nodes 1, 9 and 10 to 30.0; these nodes will otherwise keep the default accounting settings.
We can review the relevant attributes of the Node class in Figure 142 on page
322.
See IBM Parallel System Support Programs for AIX Administration Guide,
SA22-7348, for more information on SP accounting implementation.
To collect accounting information from all machines in this way, the llctl
command is used with the capture keyword:
llctl -g capture eventname
Once LoadLeveler accounting has been configured, you can extract job
resource information on completed jobs by using the llsummary command. For
detailed information on the syntax of this command and the various output
reports that it can generate, see LoadLeveler for AIX: Using and
Administering Version 2 Release 1, SA22-7311.
You can produce three types of reports using the llsummary command. These
reports are called the short, long, and extended versions. As their names
imply, the short version of the report is a brief listing of the resources used by
LoadLeveler jobs, the long version provides more comprehensive detail with
summarized resource usage and the extended version of the report provides
the comprehensive detail with detailed resource usage. If you do not specify a
report type, you will receive the default short version.
The short report displays the number of jobs, along with the total CPU usage
according to user, class, group, and account number. The extended version of
the report displays all of the data collected for every job.
Error logging reports debugging information into log files for subsystems that
perform a service or function on behalf of an end user. The subsystem does
not communicate directly with the end user and therefore needs to log events
to a file. The events that are logged are primarily error events.
Error logging for the SP uses BSD syslog and AIX Error Log facilities to report
events on a node basis. The System Monitor and the SP Switch use this form
of error logging.
Error log entries include a DETECTING MODULE string that identifies the
software component, module name, module level, and the line of code or
function that detected the event that was logged. The information is formatted
based on the logging facility the user is viewing. For example, the AIX Error
Log facility information appears as follows:
DETECTING MODULE
LPP=<LPP name>,Fn=<filename>, SID=<ID_level_of_the_file>,L#=<line number>
Important
The file contains entries for the user root as principal root.admin, giving
root the authority to execute log management commands.
Detailed information on configuring the error logs can be found in the redbook
RS/6000 SP: Problem Determination Guide, SG24-4778, and in IBM Parallel
System Support Programs for AIX: Administration Guide, SA22-7348.
You may be asked to reference some of these files and send them to your
IBM Support Center representative when diagnosing RS/6000 SP problems.
PMAN receives events from EM, and can react in one or more of the following
ways:
• Send mail to an operator or administrator
• Notify all logged-on users via the wall command or opening a window on
displays with a graphical user interface
• Generate a Simple Network Management Protocol (SNMP) trap for an
enterprise network manager, such as Tivoli
• Log the event to AIX and BSD error logging
• Execute a command or script
Three daemons constitute the PMAN subsystem. The ways in which they
inter-operate are shown in Figure 143. They are described as follows:
• pmand - This daemon interfaces directly to the EM daemon. The pmand
daemon registers for events, receives them, and takes actions. PMAN
events are stored in the SDR, and pmand retrieves them at initialization
time.
• pmanrmd - If EM does not monitor the resource variable of interest,
PMAN supplies its own resource manager daemon, pmanrmd, to access
the resource variable’s data. You can configure pmanrmd to periodically
execute programs, scripts, or commands and place the results in one of
EM’s 16 user-defined resource variables. The pmanrmd daemon supplies
the resulting data to the RMAPI. It also cooperates with pmand.
• sp_configd - This daemon creates SNMP traps from the event data in
pmand, and is the interface for an enterprise network manager to access
SP event, configuration, and resource variable data.
The SP exploits AIX Error Notification Facility (AENF) to link the AIX error log
and the PMAN subsystem. When a node is installed, an AENF object is
created which sends all AIX Error Log entries to the pmand daemon. PMAN
filters the entries based on subscriptions you define. The sp_configd daemon
picks up SNMP-alertable AIX errors and passes them to snmpd for forwarding
to the network manager (if this is installed in your environment).
By default, all PMAN daemons are started on all nodes and the control
workstation. It is important to understand that PMAN is not a distributed
application, but an Event Management (EM) client. The PMAN daemons on
one node do not know about their counterparts on other nodes, and do not
care. At initialization, each instance of pmand obtains PMAN event
subscriptions from the SDR, so each node is aware of all PMAN events to be
monitored. Although it is not mandatory to run the PMAN daemons
everywhere, we do not recommend disabling them. PMAN must execute:
• Where you want an action to originate, given a specific event from EM
• Where you need custom resource monitoring, facilitated by the pmanrmd
resource monitor
The subscribed events will result in an event notification being mailed to the
root user on the control workstation when the specified event occurs. Events
are defined for all nodes in the current system partition, and all events are
monitored from the control workstation. An extract from this script follows.
#
# Watch /var space on each node in the partition
#
pmandef -s varFull \
-e 'IBM.PSSP.aixos.FS.%totused:NodeNum=*;VG=rootvg;LV=hd9var:X>95'\
-r 'X<70' \
-c /usr/lpp/ssp/bin/notify_event \
-C "/usr/lpp/ssp/bin/notify_event -r" \
-n 0 -U root -m varFull
The syntax for this command is completely described in IBM Parallel System
Support Programs for AIX: Command and Technical Reference, SA22-7351.
The major flags you specify are:
-s — This flag specifies that this is a subscribe request and the remaining
flags define the Problem Management subscription. The name of this
subscription is varFull.
Service Focal Point — Allows for a single focal point to view errors reported
from any machine on the network.
Once this information is compiled, you can view it and compress it for
downloading to diskette or tape or for remote transmission. You may be asked
by support specialists to execute the snap command to help them accurately
identify your system problem.
Output Directory
The default directory for the output from the snap command is /tmp/ibmsupt. If
you want to name an optional directory, use the -d option with the path of the
desired output directory. Each execution of the snap command appends to
previously created files.
Options
The main options of the snap command are:
Note: Other information that is not gathered by the snap command can be
copied to the snap directory tree before executing the tar/compress option.
For example, you may be asked by the support specialist to provide a test
case that demonstrates the problem. The test case should be copied to the
/tmp/ibmsupt directory. When the -c option of the snap command is executed,
the test case will be included.
The snap -c and snap -o commands are mutually exclusive. Do not execute
both during the same problem determination session.
• The snap -c command should be used to transmit information
electronically.
• The snap -o command should be used to transmit information on a
removable output device.
css.snap
The css.snap script collects log files created by switch support code such as
device drivers, the Worm, diagnostic outputs and so on, into a single
package.
The css.snap script is called automatically from the fault service daemon
when certain serious errors are detected. However, it can also be issued from
the command line when a switch or adapter related problem is indicated with:
# /usr/lpp/ssp/css/css.snap
Important
The output is collected into a compressed tar file in the /var/adm/SPlogs/css directory, with a name that includes a timestamp xxxxxxxx. You must ensure that there is sufficient free
space available in the target file system for the output file to be created,
otherwise the command exits with an appropriate message.
One of the challenges for the SP system administrator is managing the users
in the system. Should a user be allowed access to one node, some nodes,
or all nodes? The SP can be viewed as one logical unit; therefore, users need
to be defined across all nodes, with the same login name, the same login
password, the same group characteristics, the same home directory and so
on. This is only achieved by sharing the user database across the SP system.
An SP system administrator therefore has the responsibility of maintaining
consistent copies of files such as /etc/passwd, /etc/group and
/etc/security/passwd across the nodes.
Two types of users can reside within an SP system. AIX users are those
created through AIX on an individual node and reside only on that node. SP
users are created through SP user management (SPUM) and can have
access to every node. It is possible to have both types of users on a given
node. This makes it difficult, if not impossible, to use file collections and NIS
to manage the user database because the user database on each node is
different from the other nodes. File collections and NIS are designed to
manage one consistent copy of the user database across the system.
This chapter begins with a look into the files within the AIX operating system
which make up the user database. We then examine SPUM, SPAC and NIS.
We conclude with a look at file collections and the automounter.
There are distinct sets of commands for managing AIX and SP users. To
manage AIX users, use the AIX commands mkuser, rmuser, chuser and lsuser.
To manage SP users, the PSSP software provides the SP user management
(SPUM) commands. The two sets of commands perform similar functions; the
difference is the type of users they manage. SPUM is discussed in 12.1.2,
“SP User Management (SPUM)” on page 341.
The *.idx files are password index files used to improve login performance.
The other files in /etc/security may or may not be used depending upon your
system’s requirements. A more in-depth look into user login control using the
AIX files can be found in 7.2.4, “Login Control in AIX” on page 147.
If any of these optional files are used, they need to be included in the user
database to be replicated across the system.
You can enable SPUM both during and after the installation process. It is also
possible to change the default SPUM settings after enablement. This is done
through SMIT panels or the use of the spsitenv command. For the SMIT
panel, run smit enter_data and select Enter Site Environment Information.
Figure 145 on page 342 shows this SMIT panel.
To add an SP user, you can use SMIT or the command spmkuser. To use SMIT,
run smit spmkuser. Figure 146 shows this SMIT panel.
The only mandatory field is the name of the user. If all other fields are left
blank, the system generates its own user ID, places the user in the staff
group, does not put the user into any secondary groups, and creates a home
directory based on the defaults specified when SPUM is turned on. If you
specify a home directory path that is different than the default, this entry is
used.
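As a command-line sketch (the user name kate and the home directory are hypothetical; attributes are given in the same Attribute=Value form as mkuser):
# spmkuser home=/home/sp4en0/kate kate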
To delete users, you can use SMIT or sprmuser. To use SMIT, run smit
sprmuser. Figure 147 shows this SMIT panel.
To change a user’s information, you can again use SMIT or the command
spchuser. Run smit spchuser to access the SMIT panel. Figure 148 on page
345 shows this SMIT panel.
Changes that you make here to an SP user can override the default settings
specified when SPUM is initially enabled.
To list an SP user account and its information, you can run splsuser or use
SMIT by running smit spchuser. The same panel is used for listing and
changing user information.
When SPUM is enabled, user password changes are handled in one of two
ways, depending on whether NIS is used or not. If NIS is not in use, PSSP
assumes that file collection is in use and restricts password changes to the
machine that has the master copy of the /etc/passwd file. SPUM does this by
linking the AIX password commands to SP password commands on the
nodes. The commands that are modified are:
• /bin/chfn, which is linked to /usr/lpp/ssp/config/sp_chfn
• /bin/chsh, which is linked to /usr/lpp/ssp/config/sp_chsh
• /bin/passwd, which is linked to /usr/lpp/ssp/config/sp_passwd
User password changes when NIS is in use are discussed in 12.2, “Network
Information System (NIS)” on page 348.
Attention
If both file collections and NIS are not used, but SPUM is enabled, user
password changes are still restricted to the machine with the master
/etc/passwd file. However, in this instance, after a change has been made,
the system administrator needs to propagate the files /etc/passwd and
/etc/security/passwd across the SP system.
SPAC makes use of login and rlogin attributes in the /etc/security/user file to
control a user’s ability to log into particular nodes. It sets the attributes to
either true or false depending on whether you want users to login (true) or not
(false). The login attribute is used to control user local (serial line) logins
while rlogin controls remote (network) logins. If you are using file collections,
be sure to remove /etc/security/user from any file collections because it is
unique among nodes. If you are using NIS, no action is needed because NIS
by default does not handle this file.
SPAC controls the login and rlogin attributes by using spacs_cntrl with four
keywords: block, unblock, allow and deny.
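For example (the node and user names are hypothetical), to prevent a user from logging in to one particular node, run spacs_cntrl on that node, here driven through dsh:
# dsh -w sp4n05 spacs_cntrl block kate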
If a user’s state is requested to change more than once using the allow and
deny keywords, the file spacs_data is created to keep track of outstanding
requests. When a change request is submitted, a check is made against the
spacs_data file. If the request is the same as what is in the file, the request is
not stored; instead, a counter is incremented. If the next request is the opposite
of what is in spacs_data, the counter is decremented, the user is removed from
spacs_data, and the /etc/security/user file is updated to reflect the change.
When a job submission program is using the allow and deny keywords to
control user login on nodes, be careful that the system administrator does not
run spacs_cntrl block or spacs_cntrl unblock on the same node. The block or
unblock state automatically causes spacs_cntrl to clear the contents in the
spacs_data file.
Further information for SPAC is in IBM Parallel System Support Programs for
AIX: Administration Guide, SA22-7348.
A NIS domain defines the boundary within which file administration is carried
out. In a large network, it is possible to define several NIS domains to break
the machines up into smaller groups. This way, files meant to be shared
among, for example, five machines stay within a domain that includes those five
machines, and not all the machines on the network.
A NIS server is a machine that provides the system files to be read by other
machines on the network. There are two types of servers: Master and Slave.
A NIS client is a machine which has to access the files served by the NIS
servers.
There are four basic daemons that NIS uses: ypserv, ypbind, yppasswd and
ypupdated. NIS was initially called yellow pages, hence the prefix yp is used
for the daemons. The daemons work in the following way:
• All machines within the NIS domain run the ypbind daemon. This daemon
directs the machine’s request for a file to the NIS servers. On clients and
slave servers, the ypbind daemon points the machines to the master
server. On the master server, its ypbind points back to itself.
• The ypserv daemon runs on both the master and the slave servers. It is
this daemon that responds to the request for file information by the clients.
• The yppasswd and ypupdated daemons run only on the master server.
The yppasswd daemon makes it possible for users to change their login
passwords anywhere on the network. When NIS is configured, the
/bin/passwd command is linked to the /usr/bin/yppasswd command on the
nodes. The yppasswd command sends any password changes over the
network to the yppasswd daemon on the master server. The master server
changes the appropriate files and propagates this change to the slave
servers using the ypupdated daemon.
Important
NIS serves files in the form of maps. There is a map for each of the files that
it serves. Information from the file is stored in the map, and it is the map
that is used to respond to client requests.
Attention
By serving the /etc/hosts file, NIS has an added capability for handling
name resolution in a network. Refer to Managing NIS and NFS by O’Reilly
and Associates for detailed information.
To configure NIS, there are four steps, all of which can be done via SMIT. For
all four steps first run smit nfs and select Network Information Service
(NIS) to access the NIS panels, then:
Step 1 Choose Change NIS Domain Name of this Host to define the NIS
Domain. Figure 150 on page 351 shows this SMIT panel. In this
example, SPDomain has been chosen as the NIS domain name.
Step 2 On the machine that is to be the NIS master (for example, the control
workstation), select Configure/Modify NIS and then Configure this
Host as a NIS Master Server. Figure 151 on page 352 shows the
SMIT panel. Fill in the fields as required. Be sure to start the
yppasswd and ypupdated daemons. When the SMIT panel is
executed, all four daemons (ypbind, ypserv, yppasswd and
ypupdated) are started on the Master server. This SMIT panel also
updates the NIS entries in the local /etc/rc.nfs file.
Step 3 On the machines set aside to be slave servers, go to the NIS SMIT
panels and select Configure this Host as a NIS Slave Server. Figure
152 on page 353 shows the SMIT panel for configuring a slave server.
This step starts the ypserv and ypbind daemons on the slave servers
and updates the NIS entries in the local /etc/rc.nfs file(s).
Step 4 On each node that is to be a NIS client, go to the NIS SMIT panels
and select Configure this Host as a NIS Client. This step starts the
ypbind daemon and updates the NIS entries in the local /etc/rc.nfs
file(s). Figure 153 on page 354 shows this SMIT panel.
Once configured, when there are changes to any of the files served by NIS,
their corresponding maps on the master are rebuilt and either pushed to the
slave servers or pulled by the slave servers from the master server. These
tasks are done via the SMIT panel or the command make. To access the SMIT
panel, select Manage NIS Maps within the NIS panel. Figure 154 on page
355 shows this SMIT panel.
Select Build/Rebuild Maps for this Master Server and then either have the
system rebuild all the maps with the option all, or specify the maps that you
want to rebuild. After that, return to the SMIT panel shown in Figure 154 and
select either Transfer Maps to Slave Servers (from the master server) or
Retrieve Maps from Master Server for this Slave (from a slave server).
To turn off File Collections, run smit enter_data, select Site Environment
Information, and choose false for the field File Collection Management.
Figure 155 on page 356 shows this SMIT panel.
File collections are managed by the Perl program called supper, which in turn
is based on the Software Update Protocol (SUP) and can be run as a
command. A file collection has to be defined to supper so supper can
recognize and maintain it. supper interacts with the file collection daemon
supman to manage the file collections. supman is also installed as a unique
userid for file collection operations and requires read access permission to all
files that are to be managed as part of a file collection.
A file collection can be one of two types: primary or secondary. A primary file
collection can contain a group of files or a secondary file collection. When a
primary file collection is installed, the files in this collection are written to a
node for usage. What if you want to use a node to serve a file collection to
other nodes? This is made possible by using a secondary file collection.
When a secondary file collection is installed on a node, its files do not get
executed on the node, rather they are stored ready to be served out to other
nodes.
File collections have two possible states: Resident or Available. A resident file
collection is a group of files that are installed in their true locations and can
be served to other systems. An available file collection is one that is not
installed in its true location but is able to be served to other machines.
You may change this hierarchy and use a boot/install server as a master
server for one or some of the file collections. This way, you can maintain
different copies of files on different groups of nodes. To implement this, you
run supper offline on a boot/install server against a file collection. This
prevents that file collection from being updated by the control workstation.
Changes specific to the group of nodes served by the boot/install server can
now be made on the boot/install server.
Attention
Password Changes
Recall that if NIS is not running, the password control files are changed so
that all password updates are done on the master server. This may be a
problem because you may not want all users to have access to the control
workstation.
The first three file collections are primary and resident on all nodes and
boot/install servers. They are available to be served by the control
workstation and boot/install servers. The node.root collection is a secondary
file collection stored within the power_system file collection, available to be
served by the control workstation and boot/install servers and resident on
boot/install servers and nodes.
Attention
File Collections and NIS
It is possible to run both NIS and file collections, because NIS only handles
the system administration files while file collections can be configured to
handle other files. In this case, simply configure the user.admin file
collection to not handle user administration files such as /etc/passwd.
The file collections are organized in subdirectories under /var/sysman/sup.
Within each file collection subdirectory, there is a set of “master files” which
define the file collection, such as the prefix, list and scan files referred to below.
When supper runs the scan, it begins at the directory specified by the prefix
file and traverses the directory tree to find all the files that meet the criteria in
the list file. This process produces the scan file. The scan file is optional for
the other supper processes. However, its presence can improve their
performance since they no longer have to execute the equivalent of a scan.
The install process takes a file collection and copies the files onto the target
machines for use, while the update process takes any changes in the files of
a file collection and propagates them throughout the system. Both processes
read from the scan file, if present, to identify the files to install or update. If a
scan file is not present, supper performs a search to identify the files to install
or update. An install or update also checks the /var/sysman/sup/refuse file to
see if there are files to be excluded from the process.
Attention
The supper update process is a “pull” process. It is run from the clients to
extract updated files from the master. For example, if a user changes their
password on the control workstation, it is a node’s responsibility to "pull"
this change from the control workstation. Unlike NIS, the control
workstation does not "push" a change.
For the purpose of system administration, the process to run most frequently
is the update process, to ensure that the files in a file collection are kept
up-to-date. When you are updating files, make sure that you are writing to the
copy of the file on the master server. Once the file is updated, run supper scan
<file collection name> on the master server in case there have been changes
to the number of files making up the file collection. Then, run supper update
<file collection name> on the nodes to make sure that the nodes receive the
change.
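For example, assuming a change to the user.admin collection (the collection name is only an illustration) and that supper is in the search path on the nodes, the sequence run from the control workstation might look like this:
# supper scan user.admin
# dsh -a supper update user.admin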
When files listed in the refuse file are passed over during an update, the system
writes the names of these skipped files into /var/sysman/<file collection
name>/refuse.
When adding or deleting files in a file collection, the first step is to make sure
the file is added or deleted on its server. You need to keep in mind the exact
location of the file relative to the directory that is specified in the prefix file. If
you are adding a file, ensure that the file is readable by everyone on the machine.
The second step is to run supper scan to update the scan file. This step may
be optional since the scan file is not a requirement in a file collection.
However, it is recommended that you do run this command because it
increases the performance of the next command on large systems.
The first line is a comment. The fields in the second line tell supper that this is
a primary file collection, named samplefc, with no file system associated
with it; that its files are accessed through the /sample directory; that no file
system size is specified; that the scan process starts at /sample and all files
in the directory should be evaluated; that this file collection runs on
machines with the power architecture (RS/6000s); and that this file
collection can be run on machines with a different architecture.
There are further details in Chapter 5, “Managing File Collections” in IBM
Parallel System Support Programs for AIX: Administration Guide,
SA22-7348.
6. Update the /var/sysman/sup/.resident file to include your new file
collection. We add in the line:
samplefc 0
where 0 indicates that this file collection is served by the control
workstation.
7. Build the scan file in the file collection’s own directory by running:
supper scan samplefc
Removing a file collection from a node is a two-step process:
1. Run supper scan <file collection> to create an updated scan file.
2. Run supper remove <file collection> to remove the file collection from the
node.
If you want to remove the file collection from every node where it is installed,
you have to run these steps on each node.
Attention
Do not remove any of the default file collections that are included when
PSSP is installed. These are required by PSSP for its operation.
To do this, one option is to store the home directories in an NFS file system
on the control workstation and export it to the four nodes. Using NFS on its
own, however, presents a problem: you have to decide whether each user’s
home directory is mounted as an individual file system, or whether the whole
file system containing all the home directories is mounted on every node, and
you then have to manage all of those mounts yourself.
The automounter can take care of all of this for you. When the mounting of a
file system is under automounter control, the automounter transparently
mounts the required file system. When there is no activity to that file system
for a prescribed period of time, the file system is unmounted.
The automounter uses map files to determine which file systems it needs to
control. There is typically one map file per file system that you want to
manage. It is also possible to use NIS maps instead of automounter maps to
specify which file systems the automounter is to handle. The usage of NIS
maps is discussed in greater detail in “Managing the Automounter” in IBM
Parallel System Support Programs for AIX: Administration Guide, SA22-7348.
The rest of this chapter is devoted to the use of automounter maps to handle
the mounting of file systems.
There are three versions of the automounter available for use within an SP
system. At PSSP 2.2 and below, the Berkeley Software Distribution (BSD)
version of the automounter, called AMD, is used. From PSSP 2.3 onwards,
the automounter included with the AIX operating system is used.
There are two versions of the automounter available with AIX. At AIX 4.3.0
and below, it is simply known as the “automounter”. At AIX 4.3.1 and above, it
is called the “AutoFS automounter”. The biggest difference between the two
automounters is that in AutoFS, there is a kernel implementation that
separates the automount command from the automountd daemon. This makes
it possible to change map information without having to stop and re-start the
automount daemon process. For the purposes of our discussion, we are
going to refer to the two automounters as automounter and AutoFS.
Both the automounter and AutoFS use map files residing on the server and
clients to control file system mounts. There is a master map file,
/etc/auto.master which contains entries for each file system to be controlled.
Note that AutoFS by default looks for a map file named /etc/auto_master first.
If it does not find an /etc/auto_master, then it looks for the /etc/auto.master.
The use of the /etc/auto_master file is a requirement for AutoFS, and the
/etc/auto.master is included in case users are migrating from the automounter
and want to retain the old /etc/auto.master. Figure 157 on page 367 is an
example of a /etc/auto.master file.
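An entry of the kind shown there is one line per controlled file system, giving the directory and then the map file that governs it:
/sample_fs    /etc/auto/maps/auto.sample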
This example tells the automountd that /sample_fs is to be served over the
network according to the /etc/auto/maps/auto.sample map file. By default, the
directory /etc/auto/maps is set up to store all the maps that automountd
references.
dir_1 sp3en0:/share:&
dir_2 sp3en0:/tmp/:&
dir_3 sp3en0:/tmp/:&
The mount point /tmp/*, however, depends on which of dir_2 and dir_3 is first
accessed. For example, if dir_2 is accessed first, then the mount point for
/tmp/* is /tmp_mnt/sample_fs/dir_2/*.
This can create problems for C-shell users because the C-shell pwd built-in
command returns the actual path of a file directory. Since there is no
guarantee that a file directory is always the same, C-shell commands cannot
be based on the return value from pwd.
What if the client is the server? That is, we are trying to access a directory on
the same machine on which it is residing. For example, we are trying to
access /sample_fs/dir_1 on the machine sp3en0. In this case, a local mount
of /share to the mount point /sample_fs/dir_1 is carried out.
In an SP, the automounter and AutoFS write error messages to an error log
file, /var/adm/SPlogs/auto/auto.log. In addition, you can use the daemon
facility of syslog subsystem to record errors. By default, all daemon.notice
and greater messages are written by syslogd to
/var/adm/SPlogs/SPdaemon.log.
For more information on the AIX and AutoFS automounter, refer to AIX 4.3
System Management Guide: Communications and Networks, SC23-4127 and
"Managing the Automounter" in IBM Parallel Systems Support Programs for
AIX: Administration Guide, SA22-7348.
12.4.3.1 SP Setup
In an SP, AutoFS can be used to manage the mounting of both user home
directories and other file systems across the system. The entry in the SDR
that determines whether AutoFS is used or not is amd_config. You
can change the value in this entry either by the command spsitenv or the
Automounter Configuration field in the Site Environment Information SMIT
panel. Figure 158 shows this SMIT panel.
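For example, to turn the automounter support on from the command line (a sketch, using the attribute=value form that spsitenv accepts):
# spsitenv amd_config=true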
If SPUM has also been turned on, the SP can then add, remove or change
entries in the auto.u map file when SP users are added, removed or changed.
12.4.3.2 Migration
If you are migrating from PSSP 2.2 and below to PSSP 2.3 and above, you
have to convert the AMD map files into AutoFS map files. The two types of
map files are not compatible. Since AutoFS and the automounter can share
the same map files, there is no need for conversion above PSSP 2.3.
Assume that you are using AMD and have not customized any of the
system-created AMD configurations and map files at PSSP 2.2. During the
migration, the system configuration process will detect that amd_config is set
to true and will automatically convert the system-created AMD map file into
an AutoFS map file.
The command which actually converts the map file is mkautomap. It, however,
can only be used to convert the amd.u file.
If you have modified the amd.u file, it may not be properly converted by
mkautomap. You have to check the auto.u file the command builds. If the file
does not properly convert, you have to manually build the appropriate entries
in the auto.u file.
If you have created your own AMD map files, you have to rebuild the
equivalent automounter map files.
If you have customized AMD in any way, you have to consider whether
AutoFS (automounter) is going to allow the same customization. AutoFS
customization is described in "Managing the Automounter" in IBM Parallel
Systems Support Program for AIX: Administration Guide, SA22-7348.
12.4.3.3 Coexistence
It is possible to run both AMD and AutoFS within the same SP system. The
SP maintains both the AMD and auto sub-directories under /etc.
In this instance, the control workstation runs both versions of the automounter
and the nodes run either one of the automounters. That is, if a node is
running PSSP 2.2, it is going to continue to run AMD. If a node is running
PSSP 2.3 and above, it is going to run either automounter or AutoFS,
depending on the AIX level.
If you are at PSSP 2.3 and above, and you have some nodes that are AIX
4.3.0 and below and some nodes that are AIX 4.3.1 and above, you need to
run the compat_automountd. This automountd can handle both automounter
and AutoFS requests and is specifically included for this purpose.
If you have to run a mixed environment of PSSP 2.2 and above, with a
mixture of AIX levels (both 4.3.0 and below, and 4.3.1 and above), you will
run both AMD and compat_automountd (the latter in place of automountd).
When the user database gets sent to the nodes for updates (for example, by
using File Collections), the nodes may receive and update both copies of the
automounter maps.
There are various aspects of backup that need to be considered, namely, the
operating system, user data and configurations.
Users familiar with AIX tend to agree that backing up the AIX operating
system is made easy by the implementation of the Logical Volume Manager
(LVM). Since the SP system runs on AIX, the backup procedure on the SP
system is not much different from that of a non-SP system.
No doubt when we deal with the SP system, we are dealing with a cluster of
RS/6000 systems. The strategies involved in backup and recovery can prove
to be of utmost importance to an administrator dealing with hundreds of
nodes. Poor management in this area can drastically increase backup time
and result in unnecessary waste of archival space. Always ask yourself when
you need to back up and what to back up. A quick guide as to when to back
up the operating system would be when major changes are made. This
includes installation of new software, upgrades, patches, changes to
configurations, and hardware changes, especially to the system planar, hard
disks belonging to rootvg, power supplies, and so forth.
This chapter briefly describes the backup and restore procedure for the
RS/6000 SP system. There are products on the market that assist the backup
of data. Examples include ADSM and Sysback from IBM, and Networker from
Legato.
First we give a brief description of mksysb before we look into how we can
back up the CWS using this command.
The first part is the boot sector from which the node does a bootstrap boot,
building the basic kernel in memory.
The second part is a menu that is displayed on the console of the node that is
booting. This menu can be overridden with commands at the time the
bootable image is created, or at the time of restoration by a certain file with a
specific internal format.
The last part contains the actual data files. This is the largest part. One of the
first files that is in this data section is a file that contains the size, location and
details about the various logical volumes and file systems that were mounted
on the root volume group at the time of the mksysb. The booting process
reads this file to reconstruct the Logical Volume Manager definitions as they
were before, so that the rest of the files on the tape can be restored into the
same file systems they were in.
To back up the CWS, you can either use the command line or the SMIT
panel. You must be the root user to perform mksysb backups. From the
command line (the -i flag generates a new /image.data file and -X expands
the /tmp file system if needed):
/usr/bin/mksysb -i -X /dev/rmt0
Or through SMIT:
# smitty mksysb
The restore procedure of the CWS is dependent on the type of system you
have as a CWS. Refer to your CWS system’s installation guide about how to
reinstall your system.
A mksysb to a file is very much the same as doing it onto tape media. Here,
however, instead of specifying a tape drive, we specify a filename. The file
can either reside on an NFS-mounted file system from the CWS or a local file
system. For an NFS-mounted file system, the moment the mksysb is done,
the file can be backed up to tape on the CWS. For a mksysb that is created
on a local file system, you can either ftp or rcp it over to the CWS or any host
that has a locally attached tape drive for back up.
On the CWS, export the file system for read and write access to the node:
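# /usr/sbin/exportfs -i -o rw=sp5n09,root=sp5n09 /spdata/sys1/install/images
This is only one way to do it; the directory and host name are taken from the example that follows, and you can instead create a permanent entry in /etc/exports (through SMIT or the mknfsexp command).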
On the node, mount the NFS-exported file system from the CWS:
[sp5n09:/] # /usr/sbin/mount sp5en0:/spdata/sys1/install/images /mnt
[sp5n09:/] # df
Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd4 16384 8224 50% 964 24% /
/dev/hd2 598016 141472 77% 9295 13% /usr
/dev/hd9var 65536 59920 9% 355 5% /var
/dev/hd3 65536 63304 4% 33 1% /tmp
/dev/hd1 8192 7840 5% 18 2% /home
sp5en0:/spdata/sys1/install/images 3907584 2687680 32% 8009 2% /mnt
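With the images directory mounted, the node can then write its system backup directly to a file on the CWS; a sketch with a hypothetical image name:
[sp5n09:/] # /usr/bin/mksysb -i /mnt/sp5n09.mksysb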
The recovery of nodes is similar to the steps listed in the installation section
and is not further elaborated on here. Refer to 10.3, “Frame, Node And
Switch Installation” on page 267 for details on how to reinstall a node.
It is quite normal for administrators to do many SDR archives and end up not
knowing which archived file is usable. The SDRArchive command provides an
append_string where you can add meaningful words to tell you more about
the archived file. For example, if migrating PSSP to a higher version, you
would want to archive your system’s SDR with the following format:
# SDRArchive Before_Migrate
If anything goes wrong with the SDR, we can restore the previous SDR
information with the SDRRestore command:
# SDRRestore backup.99039.1001.Before_Migrate
The SDR should be backed up before making changes to it. Also, a few
copies of SDR done on separate dates should be kept in case of corruption to
any one of them. Although SDRArchive/SDRRestore is a handy tool to back
up and restore the SDR, it should not be a replacement for a full system
backup. There are misconceptions that the SDR is the heart of the SP and
restoring it is all that is needed to restore a fully functional CWS. This is
definitely an erroneous (and dangerous) idea. There are many components
that make up the SP; the SDR is just one part of it.
On KAS:
• /var/kerberos/database/*
• /etc/krb-srvtab
On the nodes:
• /etc/krb-srvtab
• /etc/krb.conf
• /etc/krb.realms
• $HOME/.klogin
• $KRBTKFILE or /tmp/tkt<uid>
For information relating to each file, refer to 7.4, “Managing Kerberos on the
SP” on page 156.
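A minimal sketch of archiving the KAS files to tape on the control workstation (which normally acts as the KAS), assuming a tape drive at /dev/rmt0:
# tar -cvf /dev/rmt0 /var/kerberos/database /etc/krb-srvtab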
To restore the Kerberos database, you will have to restore all the files you
backed up.
There are instances where even restoring these files cannot help to recover
the Kerberos database. In cases like those, you will need to rebuild the
Kerberos database.
Commands provided by AIX for backing up files are tar, cpio, backup, and rdump.
These can be used for journaled file systems (JFS). For databases, like
Oracle, SAP, Sybase and so on, you will need to refer to the relevant
database backup procedure; using conventional AIX backup commands may
render your backups useless.
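For example, a full (level 0) backup of the /home file system to tape could look like the following sketch; adjust the device and file system to your environment:
# backup -0 -u -f /dev/rmt0 /home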
SP nodes use AIX’s Logical Volume Manager (LVM) as a building block for
data storage. In turn, LVM uses a three-layered logical approach of volume
group (VG), logical volume (LV), and file system to structure the data on the
hard disks. For further details on LVM and how it works, refer to AIX Storage
Management, GG24-4484.
There are two strategies for sharing files. The first is to locate all shared files
in file systems on one node, then use a product like Network File System
(NFS), Distributed File System (DFS) or Andrew File System (AFS) to make
them available to the other nodes. The
second is to create a distributed environment, where data stored on any node
is accessible by other nodes, using IBM’s Virtual Shared Disks (VSD),
Hashed Shared Disks (HSD), Recoverable Virtual Shared Disks (RVSD) or
General Parallel File System (GPFS).
This chapter begins with a look at data stored locally on nodes, then
examines the tools previously mentioned for sharing data among nodes,
starting with NFS, DFS and AFS, then moving on to VSD, HSD and RVSD,
and concluding with GPFS.
Recall that an SP node uses LVM to handle the storage of data onto hard
disks. LVM has many features which make it suitable for such a task.
However, it does have one drawback: stored data is accessible to only the
users and processes on the node on which the data is stored. In an SP
system, this is a major inconvenience.
Consider, for example, backing up a node from the control workstation when
no data sharing is in place. It takes two steps:
1. Use an application like telnet to log into the node and execute a backup.
2. Then use an application like ftp to download the backup image for storage
on the control workstation.
If it is possible for the node and the control workstation to share data with
each other, this process can be reduced to one step. The node “sees” the
area where the control workstation stores backup images. All the control
workstation has to do is initiate the backup on the node, and the node can
store its backup image into this area.
We can therefore conclude that LVM is insufficient as the sole means for
handling data storage in an SP system.
In summary, the design that allows the storage of data on SP nodes has both
advantages and disadvantages. Our concerns in this chapter center on the
use of LVM within this design. We now look at the file sharing software which
overcomes LVM’s limitation.
One important motivation to use global file systems is to give users the
impression of a single system image by providing their home directories on all
the machines they can access. Another is to share common application
software which then needs to be installed and maintained in only one place.
Global file systems can also be used to provide a large scratch file system to
many machines, which normally utilizes available disk capacity better than
distributing the same disks to the client machines and using them for local
scratch space. However, the latter normally provides better performance, so a
trade-off has to be made between speed and resource utilization.
In NFS, file systems residing on the NFS server are made available through
an export operation, either automatically when the NFS startup scripts
process the entries in the /etc/exports file, or explicitly by invoking the
exportfs command. They can be mounted by the NFS clients in three different
ways. A predefined mount is specified by stanzas in the /etc/filesystems file,
an explicit mount can be performed by manually invoking the mount command,
and automatic mounts are controlled by the automount command, which
mounts and unmounts file systems based on their access frequency. This
relationship is sketched in Figure 163 on page 388.
(Figure 163 illustrates these three mount types. On the server nfs_srv, an
/etc/exports entry such as /export/tina -ro,access=client is processed by
exportfs and rpc.mountd, and /export/tmp is exported explicitly with
exportfs -i /export/tmp. On the client, /home/joe is a predefined mount
described by an /etc/filesystems stanza (dev = /export/joe, nodename =
nfs_srv, mount = true, vfs = nfs), /home/tmp is mounted explicitly with
mount nfs_srv:/export/tmp /home/tmp, and /home/tina is handled by
automount and automountd through /etc/auto.master, which points /home at
the map /etc/auto/maps/home.maps containing the entry
tina nfs_srv:/export/tina.)
The PSSP software uses NFS for network installation of the SP nodes. The
control workstation and boot/install servers act as NFS servers to make
resources for network installation available to the nodes, which perform
explicit mounts during installation. The SP accounting system also uses
explicit NFS mounts to consolidate accounting information.
NFS is often used operationally to provide global file system services to users
and applications. Among the reasons for using NFS are that it is part of base
AIX, it is well known in the UNIX community, it is very flexible, and it is
relatively easy to configure and administer in small to medium-sized
environments. However, NFS also has a number of shortcomings. We
summarize them here to provide a basis to compare NFS to other global file
systems.
Performance: NFS Version 3 contains several improvements over NFS
Version 2. The most important change probably is that
NFS Version 3 no longer limits the buffer size to 8 kB,
improving its performance over high bandwidth networks.
Other optimizations include the handling of file attributes
and directory lookups, and increased write throughput by
allowing the server to acknowledge writes before the data
is committed to disk (the client issues a separate commit).
For reasons that are discussed later, we recommend using DFS rather than
AFS except when an SP is to be integrated into an existing AFS cell. We
therefore limit the following high-level description to DFS. Most of these
general features also apply for AFS, which has a very similar functionality.
After a general description of DFS, we point out some of the differences
between DFS and AFS that justify our preference of DFS.
The client component of DFS is the cache manager. It uses a local disk cache
or memory cache to provide fast access to frequently used file and directory
data. To locate the server that holds a particular fileset, DFS uses the fileset
location database (FLDB) server. The FLDB is consulted transparently, so
clients do not need to know which file server actually holds a given fileset.
The primary server component is the file exporter. The file exporter receives
data requests as DCE Remote Procedure Calls (RPCs) from the cache
manager, and processes them by accessing the local file systems in which
the data is stored. DFS includes its own Local File System (LFS), but can
also export other UNIX file systems (although with reduced functionality). It
includes a token manager to synchronize concurrent access. If a DFS client
wants to perform an operation on a DFS file or directory, it has to acquire a
token from the server. The server revokes existing tokens from other clients
to avoid conflicting operations. In this way, DFS is able to provide POSIX
single-site semantics.
(The accompanying figure shows the DFS components: the cache manager
(dfsd and dfsbind) with its local cache on the client; the file exporter (fxd)
with its token manager and the exported aggregates aggr1, aggr2 and aggr3
on the file server; and the fileset location server (flserver) with the FLDB on
the fileset location database machine, alongside the DCE CDS and Security
servers.)
The following list summarizes some key features of DCE/DFS, and can be
used to compare DFS with the discussion in 14.2.1, “Network File System
(NFS)” on page 387.
Performance: DFS achieves high performance through client caching.
The client to server ratio is better than with NFS, although
exact numbers depend on the actual applications. Like
NFS, DFS is limited by the performance of a single server
in the write case. However, replication can help scale
read-only access.
Security: DFS is integrated with the DCE Security Service, which is
based on Kerberos Version 5. All internal communication
uses the authenticated DCE RPC, and all users and
services which want to use DFS services have to be
authenticated by logging in to the DCE cell (except when
access rights are explicitly granted for unauthenticated
users). Access control is by DCE principal; root users on
DFS client machines cannot impersonate these DCE
principals. In addition, DCE Access Control Lists can be
used to provide fine-grained control; they are recognized
even in a heterogeneous environment.
Management: Since fileset location is completely transparent to the
client, DFS filesets can be easily moved between DFS
servers. Using DCE’s LFS as the physical file system, this
can even be done without disrupting operation. This is an
invaluable management feature for rapidly growing or
otherwise changing environments. Because there is no
local information on fileset locations on the client,
administering a large number of machines is much easier
than maintaining configuration information on all of these
clients.
Namespace: DFS provides a global, worldwide namespace. The file
system in a given DCE cell can be accessed by the
absolute path /.../cell_name/fs/, which can be abbreviated
as /: (slash colon) within that cell. Access to foreign cells
always requires the full cell name of that cell. The global
name space ensures that a file will be accessible by the
same name on every DFS client. The DFS client has no
control over mount points: filesets are mounted into the
DFS namespace by the servers.
In summary, many of the problems related to NFS either do not exist in DFS,
or have a much weaker impact. DFS is therefore more suitable for use in a
large production environment. On the other hand, DCE administration is not
easy and requires a lot of training. The necessary DCE and DFS licenses
also add extra cost.
It is obvious that DFS is well integrated with the other DCE core services,
whereas AFS requires more configuration and administration work. DFS also
provides file system semantics that are superior to those of AFS. So unless
an existing AFS cell is expanded, we recommend that you use DFS rather
than AFS to provide global file services.
If we can provide a node with the capability to access data residing on any
other node in the SP system, we satisfy the access need of parallel
applications and also offer a possibility for improved performance of serial
applications.
IBM makes this possible with its family of offerings based on the Virtual
Shared Disks (VSD) technology.
A node which has VSDs defined and configured is called a server node. A
node which accesses other nodes’ VSDs is called a client node. A node may
be both a client node and a server node.
(The figure shows two nodes, Node X and Node Y. On each node an
application sits on top of the VSD device driver and its cache; Node X serves
the logical volume lv_X and Node Y serves lv_Y through LVM, and the two
VSD layers communicate with each other over an IP network, the SP Switch.)
For example, the application on Node X is looking for a piece of data and
passes this request to the VSD device driver. Node X’s VSD driver can then
look for the data in one of three places:
• In its own local VSD cache
• On a local logical volume, through LVM, if Node X itself serves that VSD
• On a remote server node (Node Y), by shipping the request over the IP
network (the SP Switch) to that node’s VSD device driver
VSD device drivers use their own stripped-down IP protocol for performance
reasons. In addition, VSD uses unique buffers, buddy buffers and pbufs to
handle the network transmissions. Detailed information on these buffers,
including how to tune them, can be found in IBM Parallel System Support
Programs For AIX: Managing Shared Disks , SA22-7349.
Because VSDs are LVs, they do not have a file locking mechanism to
preserve data integrity. This task falls upon the application that is using the
VSDs.
3. Create or define the VSDs for each node that is to be a VSD server.
There are two options to set up VSDs on the server nodes: create them or
define them.
Creating VSDs means that there are no VGs and LVs established on the
server node to be used for VSDs. In this case, one can again go through
the Perspectives interface or the command line. In Perspectives, it is
necessary to first add a pane that shows the VSDs on the server node
(even if none has yet been created). Once the pane is added, use the
action Create to create both the global VG and the LVs to be used as
VSDs. For SMIT, run smit vsd_data and then select the Create a Virtual
Shared Disk option. From the command line, run createvsd. Once again,
we recommend that you use the Perspectives interface or the SMIT
panels because they automatically bring up all the settings and options
available.
Defining VSDs means that VGs and LVs have already been established
locally on the node and you want the SP to use them for VSDs, that is, you
want to define them in the SDR on the control workstation. In
Perspectives, from the VSD pane, use the action Define to define a LV as
a VSD within a global VG. For SMIT panels, run smit vsd_data and then
select the Define a Virtual Shared Disk option. The command line
equivalent is defvsd.
There are five different states in which a VSD can be found, depending on the
circumstances in the SP system at that time. These circumstances include
changes to the configuration (either on a server node or the entire SP
system) and problems in the system, application or network. By moving the
VSDs into different states, the system is better able to keep track of I/O
activity and preserve data integrity.
(The VSD state diagram shows five states and the commands that move a
VSD between them: define and undefine move a VSD between Undefined and
Defined, where its information is available in the SDR; cfgvsd and ucfgvsd
move it between Defined and Stopped, where open/close and I/O requests
fail; preparevsd and stopvsd move it between Stopped and Suspended, where
I/O requests are queued and open/close requests are serviced; and
resumevsd and suspendvsd move it between Suspended and Active, where
open/close and I/O requests are serviced. startvsd takes a VSD from the
Stopped state directly to Active, and Suspended and Active together are the
states in which the VSD is considered available.)
In the past, any changes made to VSDs required the system administrator to
first stop the VSDs, unconfigure them from all VSD nodes, make the
changes, re-configure the VSDs, and re-start them. This is at best a tedious
task. In PSSP 3.1, improvements have been made so that the following
functionalities can be dynamically carried out; that is, there is no need to stop
and re-start the VSDs:
1. The addition and subtraction of VSD nodes
2. The addition and subtraction of VSDs running on a node
3. Turning on or turning off the cache option
4. Increasing the size of individual VSDs
VSDs can be run over any IP network. However, the SP Switch network is the
only communication device available on the RS/6000 SP capable of providing
the necessary bandwidth and scalability for VSD to operate at good
performance. The SP Switch permits:
• High I/O bandwidth for optimal VSD performance
• Scalable growth for any applications using VSDs
HSD adds a device driver above the VSD layer on individual nodes. When an
I/O request is made by the application, the HSD device driver breaks it down
into smaller blocks, depending upon the strip size specified and the number
of VSDs making up the HSD. The strip size defines the amount of data which
is read from or written out to one VSD per instruction. The I/O request is then
distributed among all the VSDs which make up the HSD for handling.
HSD helps to spread the data over several VSDs and VSD nodes, thereby
reducing the chance of I/O performance bottlenecks on individual nodes. It is,
however, difficult to manage because any changes, either at the HSD level or
at the VSD level, require the deletion and re-creation of the HSD. Other
restrictions with HSD are documented in IBM Parallel System Support
Programs For AIX: Managing Shared Disks , SA22-7349.
RVSD designates nodes which serve VSDs as primary nodes. The backup
nodes are called secondary nodes.
In the example in Figure 165, with RVSD installed and configured, if Node Y
has to be shut down for maintenance, then lv_Y is going to failover and be
controlled by Node X. While Node Y is down, Node X becomes a server to
both VSDs lv_X and lv_Y. When Node Y is back up and operational, RVSD
returns back to it the control of lv_Y. This failover is transparent to the users
and processes in Node X.
There are different versions of the RVSD software for different versions of
PSSP. Each version of RVSD can interoperate with each of the supported
levels of PSSP as shown in Table 20.
Table 20. RVSD Levels Supported by PSSP Levels

              PSSP 2.2   PSSP 2.3   PSSP 2.4   PSSP 3.1
RVSD 1.2          Y          Y          Y          Y
RVSD 2.1          N          Y          Y          Y
RVSD 2.1.1        N          N          Y          Y
RVSD 3.1          N          N          N          Y
RVSD is set to operate with the functionality of the lowest PSSP level
installed in the SP system, regardless of whether the node at that level is
running RVSD or not. To overcome this problem, PSSP 3.1 introduces the command
rvsdrestrict, which allows system administrators to set the functionality level
of RVSD. A detailed description of rvsdrestrict can be found in PSSP 3.1
Announcement, SG24-5332.
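For example, to restrict RVSD to the PSSP 3.1 level of function, the command takes a form like the following (the exact level string is our assumption, so check the command reference):
# rvsdrestrict -s RVSD3.1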
RVSD contains two subsystems: rvsd and hc. Subsystem rvsd controls the
recovery for RVSD, while the hc subsystem supports the development of
recoverable applications.
The rvsd subsystem records the details associated with each of the VSD
servers: the network adapter status, the number of nodes active, the number
of nodes required for quorum, and so on.
The rvsd subsystem handles three types of failures: node, disk cable and
adapters, and communication adapters. Node failures are described later in
this section. Disk cable and adapter failures are also handled as node
failures, but only for those VGs which are affected. For example, if there are
two global VGs served by one primary node, and one disk within one of the
VGs fails, only that VG is going to be failed over to the secondary node. The
primary node will continue to serve the other VG. When the disk has been
replaced, it is necessary to manually run the vsdchgserver command to
change the control back to the primary node. Note that rvsd only handles
communication adapter failures for the Ethernet and SP switch networks. It is
possible to run RVSD using other types of network adapters, but there is no
failover mechanism in those instances. Communication adapter failures are
handled in the same manner as node failures due to RVSD’s dependence
upon an IP network to function.
The hc subsystem, also called the Connection Manager, shadows the rvsd
subsystem, recording the same changes in state and management of VSDs
that rvsd records. There is, however, one difference: hc records these
changes after rvsd has processed them. This ensures that RVSD recovery
actions complete before the recovery of hc client applications begins.
When the failed node is active again, the application’s recovery script can
issue unfencevsd to permit it to issue VSD I/Os. The syntax is similar to that
for fencevsd.
The command lsfencevsd may be run to display a map of all fenced nodes
and the VSDs from which they are fenced.
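A sketch of the typical sequence, using the VSD name from the example later in this chapter and a hypothetical node number (the -v and -n flags are from our recollection of the command syntax, so verify them in the command reference):
# fencevsd -v vsd1n15 -n 13
# lsfencevsd
# unfencevsd -v vsd1n15 -n 13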
Consider a scenario when a primary node goes down. The first event logged
is vsd.down1. At vsd.down1, the primary node runs suspendvsd and stopvsd to
bring its VSDs into the stopped state. On all non-primary nodes (both
secondary servers and clients), suspendvsd is run to bring the VSDs into the
suspend state. This ensures no further I/O activities to the VSDs, and thereby
maintains data integrity. Refer to Figure 171 on page 403 for a pictorial
review of VSD states.
When the primary node is back up and running, there is again a two-step
process to return control of its VSDs.
Event vsd.up1 has the clients running suspendvsd and the secondary node
running both suspendvsd and stopvsd against the VSDs to stop all I/O
activities.
Because GPFS spreads a file system across many nodes and disks, it is
possible to create very large file systems to hold very large files.
GPFS configurations are stored in the SDR on the control workstation. When
a GPFS node boots, it checks the SDR to see if there have been changes. If
so, these are applied during the boot. This permits changes to be made to
GPFS even if not all the GPFS nodes are up and running.
(The figure shows the GPFS software layers: applications reach either JFS or
GPFS through the AIX VFS layer; GPFS sits on top of the RVSD and VSD
layers supplied by PSSP, which in turn use LVM locally and IP over the
Switch to reach remote disks.)
The GPFS device driver is a kernel extension which sits between the
application and r/vsd layers. It uses standard AIX Virtual File System (VFS)
system calls, making it possible for most AIX applications to run over GPFS
without modification. The application views a GPFS file system much like an
AIX Journaled File System (JFS). In operation, when GPFS fetches data, it
uses the r/vsd device driver to reach the VSD node which holds the data. The
GPFS device driver does not need to know where the data is stored; this is
left to the r/vsd device driver. This design makes it possible to separate the
nodes involved into two types: GPFS nodes, which mount and use a GPFS
file system, and VSD server nodes, which have the underlying hard disks
configured as VSDs. The advantage of this is that the VSD servers, unless an
application requires it, are not further burdened with running both VSD and
GPFS software.
Underneath these layers, GPFS uses the traditional UNIX file system
structure of i-nodes, indirect blocks and data blocks to store files. A detailed
discussion of these components and their impact on a GPFS file system can
be found in IBM General Parallel File System for AIX: Installation and
Administration Guide, SA22-7278.
The GPFS licensed program product (LPP) consists of six filesets, all beginning with
mmfs. This prefix refers to GPFS’s development alongside IBM’s Multi-Media
LAN Server product. These filesets are:
• mmfs.base.usr.3.1.0.0
• mmfs.gpfs.usr.1.2.0.0
• mmfs.util.usr.3.1.0.0
• mmfs.msg.en_US.usr.3.1.0.0
• mmfs.man.en_US.shr.3.1.0.0
• mmfs.gpfsdocs.shr.3.1.0.0
The procedure to install the GPFS filesets and the prerequisites are
described in IBM General Parallel File System for AIX: Installation and
Administration Guide, SA22-7278.
Configuration Manager
In this role, the mmfsd daemon is responsible for configuration tasks within a
GPFS domain, such as selecting the Stripe Group Manager for each file
system and determining whether quorum exists.
There is one GPFS Configuration Manager per system partition. It is the first
node to join the group Mmfs Group in Group Services (for more information
on Group Services, refer to 8.5, “Group Services (GS)” on page 193). If this
node goes down, Group Services selects the next oldest node in the Mmfs
Group to be the Configuration Manager.
A stripe group consists of the set of hard disks which makes up a GPFS file
system. There is one Stripe Group Manager per GPFS file system to perform
the following services:
• Process changes to the state or description of the file system
1. Add/delete/replace disks
2. Change disk availability
3. Repair file system
4. Restripe file system
• Control disk region allocation
In addition, the Stripe Group Manager handles the token requests for file
access, passing each request to the Token Manager Server for processing.
This selection may be influenced by the GPFS administrator. Create the file
/var/mmfs/etc/cluster.preference and list in it the switch hostnames of the
nodes, one per line, that you want to be a Stripe Group Manager. The
Configuration Manager looks for this file, and if found, selects a node from
this list. There is no order of priority given to the nodes in cluster.preference;
the only requirement is that the node listed in the file be up and available
when the selection is made.
If the node that is the Stripe Group Manager goes down, another node is
selected to take over. The selection comes from another node in the
cluster.preference file, if it is used, or simply another node in the system.
Metadata Manager
A Metadata Manager maintains the integrity of the metadata for open files
within a GPFS file system. There is one Metadata Manager for each open file.
The Metadata Manager is the only node that can update the metadata
pertaining to the open file, even if the data in the file is updated by another
node.
The first node that opened the file is selected as the Metadata Manager.
This node stays as the Metadata Manager for the file until one of the following
events occurs:
• The file is closed everywhere.
• The node fails.
• The node resigns from the GPFS domain.
When the Metadata Manager goes down, the next node to use the metadata
service takes over as the Metadata Manager.
The use of tokens helps GPFS manage and coordinate file access among the
GPFS nodes. In turn, this preserves data integrity. The concept of tokens in
GPFS is similar to that of file locks in LVM. There is one Token Manager
Server per GPFS file system. It resides in the same node as the Stripe Group
Manager, and processes all requests for access to files within the GPFS.
There is a token manager component which runs on every GPFS node. When
that node wants to access a file, its token manager requests a token from the
Token Manager Server. The Token Manager Server determines if any locking
conflicts exist among any previously granted tokens and the current request.
If conflicts exist, the Token Manager Server sends a list called a copy set
back to the requesting node’s token manager. It is then the token manager’s
responsibility to process the copy set and negotiate with the token
manager(s) on the node(s) in the copy set to obtain a token and access the
file.
This type of negotiation among nodes helps to reduce the overhead on the
Token Manager Server.
If a node goes down, it can recover by having its token manager request a
copy of the token from the Token Manager Server.
If the Token Manager Server node goes down, the new Token Manager
Server (which is also the node that is the new Stripe Group Manager) then
obtains the client copy of all tokens from all the token managers in the GPFS
domain.
There are three areas of consideration when GPFS is being set up: the nodes
using GPFS, the VSDs to be used, and the file systems to be created. These
areas are discussed in detail in the remainder of this section, with reference
to a sample file system setup consisting of four nodes. Nodes 12, 13 and 14
are GPFS nodes, while node 15 is the VSD server node.
Important
Carry out the following procedures to configure GPFS, then start the
mmfsd daemon to continue creating the file system.
Nodes
The first step in setting up GPFS is to define which nodes are GPFS nodes.
The second step is to specify the parameters for each node.
There are three areas where nodes can be specified for GPFS operations:
node count, node list, and node preferences.
A node list is a file which specifies to GPFS the actual nodes to be included in
the GPFS domain. This file may have any filename. However, when GPFS
configures the nodes, it copies the file to each GPFS node as
/etc/cluster.nodes. The GPFS nodes are listed one per line in this file, and the
switch interface is specified because this is the interface over which GPFS
runs.
Figure 176 on page 417 is an example of a node list file. The filename in this
example is /var/mmfs/etc/nodes.list.
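A minimal sketch of such a file for the three GPFS nodes in our example, using hypothetical switch hostnames:
sp3sw12
sp3sw13
sp3sw14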
Once the nodes which form the GPFS domain are chosen, there is the option
to choose which of these nodes are to be considered for the personality of
stripe group manager (see stripe group manager in 14.4.1, “The mmfsd
Daemon” on page 412). There are only three nodes in the GPFS domain in
this example, so this step is unnecessary. However, if there are a large
number of nodes in the GPFS domain, it may be desirable to restrict the role
of stripe group manager to a small number of nodes. This way, if something
happens and a new stripe group manager has to be chosen, GPFS can do so
from a smaller set of the nodes (the default is every GPFS node), particularly
if some node configurations are better suited to the role. To carry this out,
follow the format for creating a node list to create the file
/var/mmfs/etc/cluster.preferences (this exact filename must be used).
To configure GPFS, you can use SMIT panels or the mmconfig command. The
mmconfig command is further described in IBM General Parallel File System
for AIX: Installation and Administration Guide, SA22-7278. The SMIT panel
may be accessed by typing smit gpfs and then selecting the Create Startup
Configuration option. Figure 177 on page 418 shows the SMIT panel used to
configure GPFS (this is being run on node 12 in our example). This step
needs to be run only on one node in a GPFS domain.
The pagepool and mallocsize options specify the size of the cache on each
node dedicated for GPFS operations. mallocsize sets an area dedicated for
holding GPFS control structures data while pagepool is the actual size of the
cache on each node. In this instance, pagepool is specified to the default size
of 4M while mallocsize is specified to be the default of 2M, where M stands
for megabytes and must be included in the field. The maximum values per
node are 512 MB for pagepool and 128 MB for mallocsize.
The priority field refers to the scheduling priority for the mmfsd daemon. The
concept of priority is beyond the scope of this book: refer to AIX
documentation for more information.
Further information, including details regarding the values to set for pagepool
and mallocsize, is in IBM General Parallel File System for AIX: Installation
and Administration Guide , SA22-7278.
Once GPFS has been configured, the mmfsd daemon has to be started on
the GPFS nodes before a file system can be created. The steps to do this are
as follows:
1. Set the WCOLL environment variable to target all GPFS nodes for the dsh
command. IBM Parallel Systems Support Programs: Administration Guide ,
SA22-7348, IBM Parallel Systems Support Programs: Command and
Technical Reference , SA22-7351, and IBM RS/6000 SP Management,
Easy, Lean, and Mean , GG24-2563, all contain information on the WCOLL
environment variable.
2. Designate each of the nodes in the GPFS domain as an IBM VSD node.
3. Ensure that the rvsd and hc daemons are active on the GPFS nodes.
Note: rvsd and hc do not start unless they detect the presence of one VSD
defined for the GPFS nodes. This VSD may or may not be used in the
GPFS file system.
4. Start the mmfsd daemon by running the following command on one GPFS
node:
dsh startsrc -s mmfs
The mmfsd starts on all the nodes specified in the /etc/cluster.nodes file. If
the startup is successful, the file /var/adm/ras/mmfs.log* looks like Figure
178 on page 420.
VSDs
Before the file system can be created, the underlying VSDs must be set up.
The nodes with the VSDs configured may be strictly VSD server nodes, or
they can also be GPFS nodes. Consider the application to decide whether to
include VSD server-only nodes in the GPFS domain.
You must also decide the level of redundancy needed to guard against
failures. Should the VSDs be mirrored? Should they run with a RAID
subsystem on top? Should RVSD be used in case of node failures? Again,
this depends on the application, but it also depends on your comfort and
preferences for dealing with risk.
Recall that there are two types of data that GPFS handles: metadata and the
data itself. GPFS can decide what is stored on each VSD: metadata only,
data only, or data and metadata. It is possible to separate metadata and data,
to ensure that data corruption does not affect the metadata, and vice versa.
Once you adopt the redundancy strategy, there are two ways to create VSDs:
have GPFS do it for you, or manually create them. For either method, this is
done through the use of a Disk Descriptor file. This file can be set up
manually or through the use of SMIT panels. If using SMIT, run smit gpfs and
then select the Prepare Disk Descriptor File option. Figure 179 on page 422
shows the SMIT panel for our example.
In this case, the VSD vsd1n15 has already been created on node 15
(sp3n15). Do not specify a name for the server node because the system
already has all the information it needs from the configuration files in the
SDR. In addition, the VSD(s) must be in the Active state on the VSD server
node and on all the GPFS nodes prior to the file system creation.
If the VSDs have not been created, specify the name of the disk (such as
hdisk3) in the disk name field, instead of vsd1n15, and specify the server
where this hdisk is connected. GPFS then creates the necessary VSDs to
create the file system.
The failure group number may be system generated or user specified. In this
case, a number of 1 is specified. If no number is specified, the system
provides a default number that is equal to the VSD server node number plus
4000.
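Putting these values together, the resulting descriptor line for this example might look like the following; the exact field layout is our assumption, so verify it against the GPFS documentation:
vsd1n15:::dataAndMetadata:1
Here the empty server fields indicate that the VSD already exists, dataAndMetadata is the disk usage, and 1 is the failure group.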
File System
There are two ways to create a GPFS file system: using SMIT panels or the
mmcrfs command. Figure 180 on page 423 shows the SMIT panel. This is
accessed by running smit gpfs and then selecting the Create File System
option. Details on mmcrfs can be found in IBM General Parallel File System for
AIX: Installation and Administration Guide, SA22-7278.
Four issues must be considered before the file system is created, as follows:
1. How to structure the data in the file system
There are three factors to consider in structuring the data in the file
system: block size, i-node size, and indirect block size.
GPFS offers a choice of three block sizes for I/O to and from the file
system: 16 KB, 64 KB, or 256 KB. Consider the applications running on
your system to determine which block size to use. If the applications
handle large amounts of data in a single read/write operation, then a large
block size may be best. If the size of the files handled by the applications
is small, a smaller block size may be more suitable. The default is 256 KB.
GPFS further divides each block of I/O into 32 sub blocks. If the block size
is the largest amount of data that can be accessed in a single I/O
operation, the sub block is the smallest unit of disk space that can be
allocated to a file. For a block size of 256 KB, GPFS reads as much as 256
KB of data in a single I/O operation, and a small file can occupy as little as
8 KB of disk space.
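Pulling the pieces together, a rough sketch of creating fs1 with the default 256 KB block size and mounting it at /gpfs/fs1 (the descriptor file name is hypothetical, and the flags should be checked against the mmcrfs command reference):
# mmcrfs /gpfs/fs1 fs1 -F /var/mmfs/etc/fs1.desc -A yes -B 256K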
Once a GPFS file system has been set up, it can be mounted or unmounted
on the GPFS nodes using the AIX mount and umount commands. Or, you can
use the SMIT panel by running smit fs and then selecting Mount File
System. Figure 181 on page 427 shows the SMIT panel for mounting a file
system.
GPFS configuration attributes can be changed later with the mmchconfig
command. For example, to increase the page pool to 60 MB and have the
change take effect immediately, run:
mmchconfig pagepool=60M -i
It is also possible to add and delete nodes from a GPFS configuration. The
commands to do so are mmaddnode and mmdelnode. However, use caution when
adding or subtracting nodes from a GPFS configuration, because GPFS uses
quorum to determine if a GPFS file system stays mounted or not, and it is
easy to break the quorum requirement when adding or deleting nodes.
Note that adding or deleting nodes automatically configures them for GPFS
usage. Newly added nodes are considered GPFS nodes in a down state, and
are not recognized until a restart of GPFS. By maintaining quorum, you
ensure that you can schedule a good time to refresh GPFS on the nodes.
The command to delete a file system is mmdelfs. Before you delete a GPFS
file system, however, you must unmount it from all GPFS nodes.
For example, if you want to delete fs1 (shown in Figure 180 on page 423),
you can run umount fs1 on all GPFS nodes, then run mmdelfs fs1.
The command mmfsck checks for and repairs the following file
inconsistencies:
• Blocks marked allocated that do not belong to any file. The blocks are
marked free.
• Files for which an i-node is allocated, but no directory entry exists.
mmfsck either creates a directory entry for the file in the /lost+found
directory, or it destroys the file.
• Directory entries pointing to an i-node that is not allocated. mmfsck
removes the entries.
• Ill-formed directory entries. They are removed.
• Incorrect link counts on files and directories. They are updated with the
accurate counts.
• Cycles in the directory structure. Any detected cycles are broken. If the
cycle is a disconnected one, the new top level directory is moved to the
/lost+found directory.
File system attributes can be listed with the mmlsfs command. If no flags are
specified, all attributes are listed. For example, to list all the attributes of fs1,
run:
mmlsfs fs1
To change file system attributes, use the mmchfs command. The following
eight attributes can be changed:
• Automatic mount of file system at GPFS startup
• Maximum number of files
• Default Metadata Replication
• Quota Enforcement
• Default Data Replication
• Stripe Method
• Mount point
• Migrate file system
For example, to change the file system to permit data replication, run:
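mmchfs fs1 -r 2
Here -r sets the default number of data replicas to 2; the flag is our reading of the mmchfs options, so check the command reference for your GPFS level.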
The command mmlsattr shows the replication factors for one or more files. If it
is necessary to change this, use the mmchattr command.
For example, to list the replication factors for a file /gpfs/fs1/test.file, run
mmlsattr /gpfs/fs1/test.file
If the value turns out to be 1 for data replication and you want to change
this to 2, run:
mmchattr -r 2 /gpfs/fs1/test.file
If disks have been added to a GPFS, you may want to restripe the file system
data across all the disks to improve system performance. This is particularly
useful if the file system is seldom updated, for the data has not had a chance
to propagate out to the new disk(s). To do this, run:
mmrestripefs
There are three options with this command; any one of the three must be
chosen. The -b flag stands for rebalancing. This is used when you simply
want to restripe the files across the disks in the file system. The -m flag
stands for migration. This option moves all critical data from any suspended
disk in the file system. Critical data is all data that would be lost if the
currently suspended disk(s) are removed. The -r flag stands for replication.
This migrates all data from a suspended disk and restores all replicated files
in the file system according to their replication factor.
For example, when a disk has been added to fs1 and you are ready to
restripe the data onto this new disk, run:
mmrestripefs fs1 -b
The AIX command df shows the amount of free space left in a file system.
This can also be run on a GPFS file system. However, to obtain information
about the free space within each of the VSDs that make up a GPFS file
system, use the mmdf command.
For example, to check on the GPFS file system fs1 and the amount of free
space within each VSD which houses it, run:
mmdf fs1
It is, however, possible to run multiple levels of GPFS code, provided that
each level is in its own group within one system partition.
There are two possible scenarios to migrate to GPFS 1.2 from previous
versions: full and staged. As its name implies, a full migration means that all
the GPFS nodes within a system are installed with GPFS 1.2. A staged
migration means that certain nodes are selected to form a GPFS group with
GPFS 1.2 installed. Once this test group has convinced you that it is safe
to do so, you can migrate the rest of your system.
15.3.1 Cluster
With HACMP/ES, you can include up to 32 nodes in a single SP partition.
Alternatively, you can define clusters that are not contained within a single SP
partition. Consider the following points when planning for cluster nodes:
• Nodes that have entirely separate functions and do not share resources
should not be combined in a single cluster. Instead, create several smaller
clusters on the SP. Smaller clusters are easier to design, implement, and
maintain.
• For performance reasons, it may be desirable to use multiple nodes to
support the same application. To provide mutual takeover services, the
application must be designed in a manner that allows multiple instances of
the application to run on the same node.
• In certain configurations, including additional nodes in the cluster design
can increase the level of availability provided by the cluster; it also gives
you more flexibility in planning node fallover and reintegration.
15.3.2 Application
Application availability is the key item to look at when considering the
purchase of high availability products. Plan carefully in this area before
implementing HACMP/ES to take full advantage of its capabilities. Some of
the points to consider include:
• The application and its data should be laid out such that only the data
resides on shared external disks while the application resides on each
node that is capable of running it. This arrangement prevents software
license violations and simplifies failure recovery. Ensure each node has the
application installed and enough capacity to run it if it has to take over.
15.3.3 Network
In a cluster environment, each node must have network connectivity to the other
nodes. This is required for communications between nodes to update each
other on the state of the cluster.
15.4.1 Server
This section explains some of the terms used in HACMP/ES as well as the
general guidelines you have to follow during configuration of the servers.
15.4.1.1 Networks
There are three types of networks to consider when configuring HACMP/ES.
Private Network
A private network is considered a network that is inaccessible by outside
hosts. For the SP complex, this refers to the SP Administrative Ethernet and
the SP Switch.
• SP Administrative Ethernet
It is advised that the SP Administrative Ethernet be used only as a private
network within the SP complex. In HACMP/ES, it is configured as a private
network providing a heartbeat channel for keepalive packets.
• SP Switch
The SP Switch is also classified as a private network in HACMP/ES with
connections to SP nodes. Unlike the old High Performance Switch (HiPS)
where HACMP/ES has to manage the Eprimary takeover, the SP Switch
does not allow HACMP/ES to control the Eprimary.
When using SP Switch in HACMP/ES, Address Resolution Protocol (ARP)
must be enabled. To find out if a node has ARP turned on, use the
following command:
# dsh -w <host> "/usr/lpp/ssp/css/ifconfig css0"
If NOARP appears in the output, ARP is turned off for that node, and steps
are required to turn it on. There are two methods to turn ARP on.
The first method requires the node to be customized after setting the SDR
to enable ARP. Refer to Parallel System Support Programs for AIX:
Administration Guide, SA22-7348 for the procedure. The second method
changes the adapter attributes directly in the node's ODM (the CuAt object
class) without a full customization; the warning below applies to this second
method.
Important
The steps listed must be followed exactly and completely. Ensure that the
CuAt file is backed up as directed. Without this backup copy, any error
made may require a re-installation of the node. Mistakes can be fixed by
replacing the CuAt file with the backup copy and then rebooting the node.
When in doubt, do not use this method; use the first method to customize
the node.
Public Network
Public networks are used by external hosts to access the cluster nodes. They
can be classified under the following categories:
• Ethernet, Token-Ring, and FDDI
Serial Network
Serial networks do not rely on TCP/IP for communication. As such, they
provide a reliable means of connection between cluster nodes for heartbeat
transmissions.
• RS232
This is the most commonly used non-IP network, as it is cheap and
reliable. It is a crossed (null modem) cable connected to a serial port on
the node. Earlier model nodes like the 62MHz Thin (7012-370), 66MHz
Thin (7012-390), 66MHz Wide (7013-590), 66MHz Thin2 (7012-39H),
77MHz Wide (7013-591) and 66MHz Wide (7013-59H) do not have
supported native serial ports and therefore require the use of the 8-port or
16-port asynchronous adapters. Extension nodes S70 and S7A will
require PCI multi-port asynchronous adapters to support this feature.
• TMSCSI
Like the RS232 connection, TMSCSI does not rely on IP. It is configured on
the SCSI-2 Differential bus to provide HACMP/ES clusters with a reliable
network for heartbeat transmission.
• TMSSA
The TMSSA is a new feature of HACMP/ES, which allows SSA
Multi-Initiator RAID adapters to provide another channel for keepalive
packets. This feature is supported only when PTF2 or greater is installed.
Take note that single-ended SCSI disks cannot be used to configure shared
volume groups. When configuring for twin-tail access of SCSI-2 Differential
disks, be careful to avoid having a SCSI address clash on the same SCSI
bus.
Consider the following when configuring shared volume groups and LVM
components for Non-Concurrent Access:
• All shared volume groups must have the auto-varyon feature turned off.
• The major number for a shared volume group must be the same on all
cluster nodes when the volume group's file systems are NFS-mounted by
clients. Even when NFS is not used, it is recommended that the major
number be kept the same for easier management and control. The
lvlstmajor command can be used to check for available major numbers on
each node (see the example after this list).
• All journaled file system logs, logical volume names and file system mount
points must be unique throughout the cluster nodes.
• All file systems must have the auto-mount feature turned off.
• For shared volume groups with mirroring, we recommend that quorum is
turned off.
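As a sketch (node, disk and volume group names are hypothetical), you could
check the free major numbers on each node and then import the shared volume
group with an explicit major number:
# dsh -w node1,node2 /usr/sbin/lvlstmajor
# importvg -V 60 -y sharedvg hdisk3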
The procedure for making concurrent capable volume groups on serial disk
subsystems differs from that on RAID disk subsystems. Refer to HACMP for
AIX: Enhanced Scalability Installation and Administration Guide, SC23-4284
for details on how to configure the different disk subsystems for concurrent
access.
Install the components that you require for your nodes, bearing in mind the
minimum required filesets are:
• rsct.basic.*
• rsct.clients.*
• cluster.es.*
• cluster.cspoc.* (automatically installed when cluster.es.* is being installed)
• cluster.clvm.* (for CRM only)
• cluster.hc.* (for CRM only)
The latest Program Temporary Fixes (PTFs) must be installed together with
the base filesets. Without PTFs, certain functionality mentioned in this book
may not be applicable. One example is the support of TMSSA.
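As a hypothetical example (the image directory is an assumption; use wherever
your installation images and PTFs reside), the minimum filesets could be
installed on a node with:
# installp -agXd /spdata/sys1/install/pssplpp/hacmp rsct.basic rsct.clients cluster.es cluster.cspoc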
In the example configuration, two Ethernet subnets, Subnet1 (192.168.3.0) and
Subnet2 (192.168.4.0), are attached to the control workstation.
All routes are defined to ensure that nodes on Subnet1 can communicate
with nodes on Subnet2. To ensure heartbeat transmission can flow from
nodes on Subnet1 to nodes on Subnet2, we add these 2 subnets to a
global network SPGlobal. The following commands show how:
# /usr/sbin/cluster/utilities/claddnetwork -u Subnet1:SPGlobal
# /usr/sbin/cluster/utilities/claddnetwork -u Subnet2:SPGlobal
The SMIT fastpath is:
# smitty cm_config_global_networks.select
In case you want to remove Subnet1 from SPGlobal, use this command:
# /usr/sbin/cluster/utilities/claddnetwork -u Subnet1
• Tune the Network Module. The heartbeat rate can be tuned for all
networks defined. If any network is expected to be heavily loaded, the
heartbeat rate can be tuned to "slow" to avoid false failure detection over
that network. Failure detection is dependent on the fastest network in the
cluster. To begin with, leave the settings at their defaults, and tune only when
you experience false failure detections.
Define Resources
In order to associate all resources with each node, resource groups must be
configured. Use the following procedures to set up resources for the cluster:
• Define resource groups. When creating resource groups, it is important to
note that when specifying Participating Node Names for cascading or
rotating, the priority decreases in the order you specify. For concurrent
mode, the order does not matter.
• Define application servers. This tells HACMP/ES which scripts are used
to start the applications and which are used to stop them.
• Define resources for resource groups. Here, you will input all the related
resources which you want HACMP/ES to manage for you. Take note that
IPAT is required if you are using the NFS mount option. Also, you must set
the value of Filesystems mounted before IP configured to true.
• Configure run time parameters. If you use NIS or nameserver, set the
value to true. Leave the Debug Level as high to ensure that all events are
captured.
• Configure cluster events. You can customize the way HACMP/ES handles
events by incorporating pre-event, post-event scripts, notification or even
recovery scripts. How you want HACMP/ES to react depends on how you
configure these events. The default is generally good enough.
15.4.2 Client
In order for an RS/6000 system to receive messages about what is happening
on the cluster, it must be configured as a client running the clinfo daemon.
Use the following steps to do this:
• Install Base Client Software
Install the following minimum filesets in order for your client system to
access any cluster information:
• rsct.clients.*
• cluster.es.client.*
• Edit /usr/sbin/cluster/etc/clhosts
Like the server, this file contains resolvable hostnames or IP addresses of
all boot and service labels of servers and clients. The clinfo daemon uses
names in this file to communicate with the clsmuxpd process on the
servers. This file must not be left empty, and you must not include standby
addresses in it.
• Edit /usr/sbin/cluster/etc/clinfo.rc
This file is used the same way as on the servers. Include all client IP
addresses or hostnames in the PING_CLIENT_LIST variable (see the
sketch after these steps). This is
especially important where no hardware address swap is configured and
where there are clients that do not run the clinfo daemon. Be sure to
update the ARP tables on routers and systems that are not running clinfo
when IPAT occurs.
• Reboot Clients
Reboot your client machines to finish the setup.
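A minimal sketch of the PING_CLIENT_LIST setting in
/usr/sbin/cluster/etc/clinfo.rc (the host names and address are hypothetical):
PING_CLIENT_LIST="client1 client2 192.168.4.31"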
The normal clstart command to start up one node at a time is still available
and is accessible via the SMIT fastpath:
# smitty clstart
To stop the cluster services one node at a time, use the SMIT fastpath:
# smitty clstop
or, alternatively:
# smitty cl_clstop.dialog
Note that the Forced option has been removed from the SMIT screen and will
not be available until a later release of HACMP/ES. The clstop script still
retains the forced option, but you are advised not to use it.
In order to monitor cluster activity from a client system using the clstat
command, the clinfo daemon must first be started up. To start it, follow the
procedure shown in Figure 184.
# /usr/sbin/cluster/clstat
Chapter 16. Parallel Programming
This chapter describes how the SP can be used for parallel programming.
Many of today’s scientific and engineering problems can only be solved by
using the combined computational power of parallel machines, and some also
rely on high performance I/O capabilities. Typically, the applications in this
area are highly specialized programs, developed and maintained in
universities, government labs, or company research departments.
Parallel Environment for AIX (PE), program number 5765-543, provides the
application programmer with the infrastructure for developing, running,
debugging, and tuning parallel applications on the SP. In 16.1, “Parallel
Operating Environment (POE)” on page 453, we describe the operating
environment which allows you to run and control programs in parallel from a
single node.
Note
Starting POE Jobs on Large Systems : For large SP systems, starting an
executable on many nodes from within a shared file system like NFS
can cause performance problems. DFS has a better client to server ratio
than NFS, and so supports a larger number of nodes that
simultaneously read the same executable. As a parallel file system,
GPFS is also able to sustain a higher number of clients simultaneously
accessing the same executable.
The poe command allocates the remote nodes, and on each of these nodes, it
does the following:
• Initializes the local user environment, including a chdir() to the current
directory of the home node.
• Runs the command with stdin, stdout and stderr connected to the poe
process on the home node.
Each partition manager daemon starts and controls the program which runs
on its node, and routes that program’s stdin, stdout and stderr traffic to the poe
command on the home node through SSM. This program may be any
executable or (Korn shell) script; it need not be part of a parallel application in
which the programs running on different nodes interact with each other.
The POE startup process differs depending on whether the poe command is
invoked interactively, or from within LoadLeveler. The interactive case is
shown in Figure 186 on page 456.
Note
Home Node versus Node 0 : In interactive POE, there is no direct relation
between the home node and any of the remote nodes. It is a common
misconception that the home node is identical with node 0 of the parallel
application. In general, this is not true for interactive POE jobs. The
resource allocation mechanism chooses the remote nodes, and the
selected set of nodes may or may not include the node on which the poe
command was invoked.
The poe command then uses the pmv2 service (normally port 6125/tcp) to
start the pmdv2 partition manager daemons on the remote nodes, through
the inetd daemon on each of those nodes.
As shown in Figure 187, the LoadL_starter process on each node starts the
pmdv2 partition manager daemon, which in this case runs under the identity
of the user since all authentication checks and priority settings have been
performed already by the LoadLeveler components. Finally, pmdv2 invokes
the user’s application as in the interactive case.
When running POE in a batch job, node 0 is a special case. Since there is no
poe process already running somewhere, the LoadL_starter process on
node 0 spawns two tasks: one which starts the partition manager daemon,
and one which runs the poe process that takes over the role of the home node.
POE environment options can be grouped into the following set of functions:
• Partition Manager Control. This is the largest set of POE options. Apart
from several timing selections, this group of environment variables also
determines how resources (nodes, processors in case of SMP nodes,
network adapters) are allocated. See 16.1.4, “POE Resource
Management” on page 460 for details.
• Job Specification. MP_CMDFILE specifies a file from which to read node
names for allocation. MP_NEWJOB can be used to run multiple job steps
on the same set of nodes: if set to yes the partition manager will maintain
the partition for future job steps. By using MP_PGMMODEL=mpmd,
applications which use the MPMD programming model can be loaded.
• I/O Control. These settings specify how standard I/O which is routed
to/from the remote nodes is handled at the home node. The four variables
which control this are MP_LABELIO, which can be used to prefix each line
of output with the task number where it originated, MP_STDINMODE,
MP_HOLD_STDIN, and MP_STDOUTMODE.
• VT Trace Collection. These settings control VT trace generation,
discussed in 16.4.2.1, “IBM Visualization Tool (VT)” on page 488. The
level of tracing is set by MP_TRACELEVEL, which may take values 0, 1, 2,
3 or 9.
• Generation of Diagnostic Information. The most useful option in this
category is MP_INFOLEVEL. It determines the level of message
reporting. A value of 0 specifies no messages, 6 specifies maximum
messages.
• Message Passing Interface. These environment variables can be used to
tune MPI applications. See 16.2.2, “IBM’s Implementation of MPI” on page
466 for details.
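As an illustration of some of these options (a hypothetical interactive run; the
program name and node count are purely illustrative), the following settings
label each line of output with its task number, raise the message reporting
level, and run a program on four nodes:
export MP_LABELIO=yes
export MP_INFOLEVEL=2
export MP_PROCS=4
poe ./myprog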
Note
Enforcing Global POE Settings : The /etc/poe.limits file can be used to
enforce system-wide values of three POE options. Users cannot override
the values in that file by setting the corresponding environment variables to
a different value:
• MP_BUFFER_MEM can be used to limit the maximum value to which
this variable can be set by users (they can select lower values).
• MP_AUTH=DFS can be used to globally enable DCE credential
forwarding. See 16.1.5, “POE Security Integration” on page 461 for
details.
• MP_USE_LL=yes causes POE to reject all POE jobs which are not run
under LoadLeveler. This can be used to force users to use LoadLeveler.
Note that C Set++ for AIX is withdrawn from marketing, and the VisualAge
C++ Professional for AIX Version 4.0 compiler is not supported by Parallel
Environment.
POE provides several compilation scripts which contain all the POE include
and library paths, and automatically link the user’s program with the POE and
message passing libraries. Four scripts mpcc, mpCC, mpxlf and mpxlf90 are
available which call the xlc, xlC, and xlf compilers. The Fortran compilation
scripts can be configured to use XL HPF instead of XL Fortran (if both are
installed) by setting the environment variable MP_HPF=yes. These four
scripts also have thread-safe counterparts (mpcc_r, mpCC_r, mpxlf_r and
mpxlf90_r), which link the threads-based libraries discussed later in this chapter.
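As a simple sketch (the source file and host list names are hypothetical), an
MPI program written in C could be compiled with one of these scripts and then
run under POE:
mpcc -o myprog myprog.c
poe ./myprog -procs 4 -hostfile ./host.list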
Note
Initializing the POE Runtime : With AIX 4.2 and later, the POE compiler
scripts now link the standard AIX /lib/crt0.o, and use the
-binitfini:poe_remote_main linker option to initialize the POE runtime
environment. With earlier releases of AIX which did not support -binitfini,
POE linked a special /usr/lpp/ppe.poe/lib/crt0.o. The call in the executable
which actually invokes -binitfini initializations is modinit().
Unfortunately, this method does not work with DCE credentials because the
buffer space used by POE to transfer the credentials to the remote nodes is
too small for DCE tickets.
There is no command line flag for this option. The default value for this
variable is AIX. Note that even with MP_AUTH=DFS, the POE authorization
check is still based on the /etc/hosts.equiv or $HOME/.rhosts file.
The poeauth command can then be invoked to copy the DCE credentials, but
you have to understand what it does in order to successfully use it:
• Be aware that poeauth expects DCE credentials on node 0. This is not
always the case, for example when poe is run from a login node but the
nodes allocated for the actual computation come from another pool of
nodes. You have to make sure that DCE credentials exist on node 0, either
by a suitable hostfile which contains the home node, or by logging in to
node 0 before calling poeauth.
• Be aware that poeauth is a normal POE application. POE initializes the
user’s environment on the remote nodes, which includes a chdir() to the
directory in which it was called on the home node. This means that poeauth
must not be called when the current directory is in DFS: POE would fail
when attempting the chdir() to DFS, before the copying of DCE credentials
could take place. You might therefore want to provide a circumvention for
this problem, like the following script:
#!/bin/ksh
# Run poeauth from a local directory so that POE's initial chdir() on the
# remote nodes does not point into DFS.
cd /tmp
/usr/lpp/ppe.poe/bin/poeauth "$@"
cd - >/dev/null
• Make sure that the parallel application runs on the same nodes to which
poeauth has copied the credential files. This can be done by setting the
POE environment variable MP_NEWJOB=yes.
If these issues are addressed, poeauth can be used to run interactive POE
jobs which have full DCE user credentials on the remote nodes.
If you install Parallel Environment before DCE, the POE environment will
not be completely built for DCE support. The safest solution in such a case
is to deinstall POE, and reinstall it after DCE is installed.
These specifications, and much more information about MPI, can be obtained
from the following MPI home pages:
http://www.erc.msstate.edu/mpi
http://www.mpi-forum.org
Although the MPI-2 document was published about two years before the time
of writing this redbook, there are very few (if any) full implementations of that
standard available. This is quite different from MPI Version 1, for which a
number of full implementations exist (in the public domain as well as in
commercial products).
Note that with the signal-based MPI library, these signals are not available to
the user application; they are used internally by POE/MPI.
This signal-based version of MPI is still provided with PE 2.4 for backward
compatibility. It is contained in the libraries libppe.a and libmpi.a. The
compiler scripts mpcc, mpCC, mpxlf and mpxlf90 described in 16.1.3, “The
POE Compilation Scripts” on page 459 use these signal-based libraries.
Threads-based MPI
The more recent PE implementation of MPI uses threads instead of signals.
Threads are normally more efficient than the interrupts which are caused by
signals, and they also offer the potential of parallelism if the MPI task runs on
an SMP node. The threads-based implementation resides in libppe_r.a and
libmpi_r.a. The compiler scripts mpcc_r, mpCC_r, mpxlf_r and mpxlf90_r
described in 16.1.3, “The POE Compilation Scripts” on page 459 use these
threads-based libraries.
Figure 188 on page 468 shows the control flow in a task which uses the
threads-based MPI implementation:
• The Partition Manager Daemon (PMD) uses fork() and exec() to start the
POE task. This is the same as for the signal-based implementation.
• Before calling the user's main() program, the POE initialization routine will
create a thread called the “POE Asynchronous Control Thread”, which
handles control traffic with the partition manager daemon asynchronously
to the user's program.
The MPI library also creates “responder threads” that handle asynchronous
MPI-IO and non-blocking communications. These threads are dynamically
created as they are needed.
The name responder thread derives from the fact that a nonblocking MPCI
Irecv can register a handler function. When a message arrives that matches
that MPCI Irecv, MPCI responds by putting the registered handler onto the
responder queue so it can be executed on a responder thread. Thus, the MPI
task “responds” to the arriving message without any application-level
participation.
This means that the AIX threads library (libpthreads.a) has changed in AIX
4.3.1 and later, but AIX maintains binary compatibility across versions. AIX
does that by having multiple shared objects in the library itself. Executables
built in AIX 4.3.0 or earlier (which link the threads library) reference the shr.o
shared objects. Executables compiled in AIX 4.3.1 and later reference
shr_xpg5.o shared objects. However, AIX 4.3.1 and later maintains both
shared objects in the library, so “old” applications referencing shr.o objects
will run without problems. This is shown in Figure 190.
The threaded MPI library in Parallel Environment 2.4 is compiled in AIX 4.3.1,
which means it uses the shr_xpg5.o shared objects. Earlier versions of the
threaded MPI library reference shr.o objects, since they have been built with
AIX 4.2.1 or earlier. Applications compiled with previous versions of AIX and
Parallel Environment will run, as long as mutexes (locks) and thread condition
structures (signaling structures) are not shared.
In AIX 4.3.1 and later, this thread structure has changed. The default is now
M:N or Process Contention Scope , which means that the Kernel Dispatcher
has a “pool” of kernel threads which are dynamically allocated to user or
process threads when a user scheduling thread switches from one user
thread to another, or when a user thread makes blocking system calls.
In any case, you need to experiment with these settings to find out if and how
your application will benefit from changing the default values.
Chapter 9 of the MPI-2 standard defines the set of MPI calls that allow
parallel file I/O. This set of calls is called MPI-IO and a portion is being
implemented as part of the threaded MPI library within Parallel Environment
2.4.
The MPI-IO subset provides great flexibility for applications to define how
they will do their I/O. Tasks within an application can use MPI predefined and
derived datatypes to partition the single file in multiple views. This allows
applications to partition the data and create their own access patterns based
on these basic blocks or datatypes.
In PE 2.4, MPI-IO is fully supported only for GPFS. Other file systems may be
used only when all tasks of the MPI job are on a single node or workstation.
(JFS cannot make a single file visible to multiple nodes. NFS, DFS and AFS
can make the same file visible to multiple nodes, but do not provide sufficient
file consistency guarantees when multiple nodes access a single file.)
There are two implementations which run on the SP: an RS6K version and an
SP2MPI version. Only the SP2MPI implementation supports User Space
communication over the SP Switch, as it is layered on top of IBM’s MPI. The
disadvantage of this PVM implementation is that it requires one extra
processor for each invocation of pvm_spawn/pvmfspawn and tasks created
by different spawn calls do not intercommunicate via User Space protocol.
Therefore, care must be taken when this PVM implementation is used for
applications which call pvm_spawn/pvmfspawn. The RS6K implementation
does not have this drawback, but can only use IP communication.
Note
MP_MSG_API : If you want to use LAPI, the POE environment variable
MP_MSG_API must be set to include LAPI, since LAPI requires a separate
DMA window on the SP Switch adapter, different from the adapter window
used by MPI.
MPL and MPI could be used in the same program, but both partners of a
communication must always be handled by calls to the same library. For
example, an MPI_Send() call cannot be received by an MPL mpc_recv() call.
HPF Version 1 defines a subset HPF language, which requires neither full
Fortran 90 support nor all the HPF features. Although many HPF compilers
now support the full Fortran 90 language, very few support all of the HPF
Version 1 features.
For more information about XL HPF, check out the following sources:
• http://www.software.ibm.com/ad/fortran/xlhpf/
• IBM XL High Performance Fortran Language Reference and User's Guide, SC09-2631
• http://www.software.ibm.com/ad/fortran/xlfortran/
• IBM XL Fortran Language Reference, SC09-2718
• IBM XL Fortran User's Guide, SC09-2719
The pghpf compiler actually uses the machine’s native Fortran compiler (XL
Fortran on AIX platforms), and supplements it with PGI’s own libraries and
runtime environment. The compiler produces intermediate serial Fortran code
with embedded message passing calls. These calls can be layered on top of
various message passing libraries. PGI provides RPM by default, which is a
stripped-down version of PVM. On the SP, we recommend that you use IBM’s
implementation of MPI to achieve the highest performance. After building the
executable, running it on the SP is comparable to running a plain MPI
program.
In addition to the pghpf compiler, PGI provides an HPF profiler pgprof which
supports line-level profiling at the HPF source level. For debugging, the
TotalView parallel debugger can be used. TotalView is described in 16.4.1.2,
“The TotalView Debugger” on page 484.
The pdbx debugger can operate in two modes: normal mode, which starts the
application under control of the debugger; and attach mode, which permits the
attachment of the debugger to an already running POE application. Attach
mode is particularly useful for long-running applications which are suspected
of hanging in a late stage. If such applications had to be run completely under
control of a debugger, it would probably take too much time to reach the
interesting state in the calculation. With pdbx, it is possible to attach to the
running program at any time.
To debug a parallel program in normal mode, you basically replace the poe
command in your program invocation with the pdbx command. The debugger
accepts most of the POE environment variables and command-line
arguments. For example:
pdbx my_prog [prog_options] [poe_options] -l src_dir1 -l src_dir2
Since the POE environment is already set up when the application has been
started, it is normally not necessary to specify any program or POE options to
pdbx when starting it in attach mode.
The graphical user interface of pedb makes it more convenient to display data
structures in the application. However, there are some limitations concerning
the size of the data sets that can be displayed. The pedb debugger also
provides a message queue debugger to inspect the MPI messages queued
on the nodes.
Note that pedb requires the selection of a fixed set of tasks during startup.
This cannot be changed later without detaching from the application and
attaching again with a new selection of tasks to be debugged. The tasks that
are not selected will not run under debugger control. The maximum number of
tasks supported within pedb is 32. For larger applications, only a subset of at
most 32 processes can be debugged.
More details on pdbx and pedb can be found in IBM Parallel Environment for AIX,
Operation and Use, Volume 2, Part 1: Debugging and Visualizing, SC28-1980.
All TotalView options must be specified after the totalview command and
before the poe command. The -no_stop_all option specifies that only the task
which reaches a breakpoint stops at that breakpoint; by default, all other tasks
would stop, too. Also note that the -a option must be specified between poe and all its
arguments.
TotalView then displays its root window, and a process window as shown in
Figure 192 on page 485. Initially, both windows only show the poe process on
the home node.
Type g (lower case g for go) in the poe process window to start the individual
tasks of the parallel application. The screen displays progress information for
the parallel tasks, and a pop-up which allows you to stop all the tasks before
they enter the application’s main program. Then a separate process window
for the main application’s tasks opens, as shown in Figure 193 on page 486.
Note that the window title shows the application program’s name (heatd2) and
the task number (one). The process window for poe can then be closed.
Normally, TotalView is controlled using the mouse: the left mouse button
selects objects, the middle button displays the currently available
functions/commands, and the right button "dives" into the selected object.
The right mouse button can be used to display a subroutine’s source text,
values of variables, or attributes of breakpoints.
To navigate between different tasks, select them in the root window, or use
the navigation buttons in the upper right corner of the process window. In
general, commands in lower case only apply to one task, whereas upper case
commands apply to all tasks of a parallel application.
Note
MP_PULSE : To avoid timeouts during debugging, we recommend that you
set the MP_PULSE environment variable to zero before starting TotalView.
Otherwise, POE might cancel the application if a task is stopped and does
not respond to POE’s heartbeat messages.
To use TotalView to debug a serial program, call totalview with the name of
the executable as argument. For PVM programs, the directory where the
TotalView executables reside must be included in the ep= clause in the
hostfile, and totalview must be invoked with the -pvm option after the pvmd
daemon has been started. Debugging PGI HPF programs requires that the
program is compiled and linked with the -g -qfullpath -Mtv -Mmpi options,
and run with the -tv runtime option which starts off TotalView.
Among the tools to do this are IBM’s VT and Xprofiler, which are part of the
Parallel Environment for AIX. In addition, the Vampir tool from Pallas GmbH is
introduced. Vampir has a focus similar to the trace visualization part of VT,
but is easier to use and often presents information in a more comprehensive
way than VT.
More details on VT can be found in IBM Parallel Environment for AIX, Operation and
Use, Volume 2, Part 1: Debugging and Visualizing, SC28-1980.
Parallel Environment takes steps to ensure that each task of a parallel job
writes its profile data to a separate file. More details can be found in IBM
Parallel Environment for AIX: Operation and Use, Volume 2, SC28-1980.
Note that Xprofiler accepts wildcards, so the previous line could be rewritten
as follows:
xprofiler a.out gmon.out* [xprof_options]
As can be seen in this example, for a parallel application all profiling output
files must be specified, in addition to the (SPMD) executable.
Note
POE profiled libraries : The AIX operating system and compilers provide a
version of their libraries which has been built with the -pg compiler options.
Typically these are located in /usr/lib/profiled/. When Parallel Environment
for AIX is installed, it rebuilds libc.a and stores it in /usr/lpp/ppe.poe/lib/. To
support profiling of this modified libc.a, a copy of that library is also built
with the -pg profiling option, and stored in /usr/lpp/ppe.poe/lib/profiled/. Be
sure to include this library in your applications if you want to profile system
routines. This can be achieved by setting MP_EUILIBPATH, for example:
export MP_EUILIBPATH=/usr/lpp/ppe.poe/lib/profiled:/usr/lib/profiled:/lib/profiled:/usr/lpp/ppe.poe/lib
With Xprofiler, you can easily zoom in and out of interesting parts of the code.
The VampirTrace library uses the MPI profiling interface to attach to the
application, and writes a trace file when the application calls MPI_Finalize.
Details can be controlled through a configuration file, or by inserting
subroutine calls into the application. For the above example, the default name
of the trace file would be myprog.bpv.
The visualization tool Vampir is used to analyze the resulting trace files. It is
started by the vampir command. Vampir is very easy to use, and has a fast
and versatile graphical user interface which allows you to zoom into arbitrary
parts of a trace. A variety of graphical displays presents important aspects of
the application runtime behavior. Three important classes of displays are:
• Timeline views display application activities and message-passing along
the time axis. Figure 196 on page 492 shows an example, zoomed to a
very detailed level.
Figure 201 on page 495 shows the time spent in selected MPI calls for one of
the processes. Call-tree comparison between different program runs is also
possible to evaluate the effect of optimizations.
Chapter 17. LoadLeveler
LoadLeveler is a job management system that schedules and balances batch
jobs across a network of machines. This network may include an SP, other
IBM RS/6000s and other
types of workstations. The network environment may include Distributed
Computing Environment (DCE), AFS, NFS and NQS. All machines in this
network that can run LoadLeveler jobs are grouped together and called a
cluster.
A batch job consists of a set of job steps. Each job step is an executable that
is run on a node or a group of nodes. A job is defined by a Job Command File,
which stores the name of the job, the job step(s) involved and other
LoadLeveler statements. It is this file that is submitted to LoadLeveler for
execution.
A job can either run on one machine (serial) or multiple machines (parallel).
LoadLeveler jobs can be submitted and monitored via commands or the GUI
called xloadl.
17.1 Architecture
Once a machine becomes part of a LoadLeveler cluster, it can take on one or
more of four roles:
• Central Manager Machine or the Negotiator. A central manager is a node
dedicated to examining a submitted job’s requirements and finding one or
more nodes in the cluster to run the job. There is only one central
manager per LoadLeveler cluster; however, it is possible to identify an
alternate central manager which becomes the central manager if the
primary central manager becomes unavailable.
• Scheduling Machine. When jobs are submitted to LoadLeveler, they get
stored in a queue on the scheduling machine. It is this machine’s
responsibility to contact the central manager and ask it to find an
appropriate machine to run the job.
• Executing Machine. This is a node that runs the jobs submitted by users.
• Submitting Machine. This is a machine from which users submit and query
jobs. A submit-only machine is a special case: it can submit jobs to the
cluster but cannot execute them.
There are six daemons that are used by LoadLeveler to process jobs:
• master — This daemon runs on every node in a LoadLeveler cluster. It
does not, however, run on any submit-only machines. It manages all other
LoadLeveler daemons running on the node.
• schedd — This daemon receives the submitted jobs and schedules them
for execution. The scheduling is based on the machines selected by the
negotiator daemon (discussed later in this section) on the central
manager. The schedd is started, restarted, signalled and stopped by the
master daemon.
• startd — This daemon monitors jobs and machine resources on all
executing machines in the cluster. It communicates the machine
availability to the central manager, and receives a job to be executed from
the schedd.
• starter — This is spawned by the startd daemon when startd receives the
job from schedd. The starter daemon is responsible for running the jobs
and reporting the status back to startd.
• negotiator — This daemon runs on the central manager and records
information regarding the availability of all executing machines to perform
jobs. It receives job information from schedd and decides which nodes
are to perform the job. Once a decision has been made, it sends this
information to schedd and lets schedd contact the appropriate executing
machines.
• kbdd — This daemon monitors keyboard and mouse activity on a machine,
so that LoadLeveler can tell whether the machine is being used
interactively.
Figure 202 on page 500 shows the steps LoadLeveler uses to handle a job.
They are summarized as follows:
1. A job is first submitted by a user using either the GUI or the llsubmit
command to the schedd daemon on the scheduling machine. The
submitting machine can be a submit-only machine or any other machine in
the cluster.
2. The schedd daemon on the scheduling machine receives the job and
places it into the job queue.
3. The schedd daemon contacts the negotiator daemon on the central
manager to inform it that a job has been placed in the queue. The schedd
daemon also sends the job description information to the negotiator
daemon.
4. The negotiator daemon, which receives updates from the startd daemon
on all executing machines, checks for an available executing machine with
resources which match the job’s requirements. Once found, the negotiator
daemon sends a "permit to run" signal to the schedd daemon.
5. The schedd daemon dispatches the job to the startd daemon on the
executing machine.
6. The startd daemon spawns the starter daemon which handles the job.
When the job is finished, the starter daemon sends a signal back to the
startd daemon.
7. The startd daemon sends a signal back to the schedd daemon informing it
that the job has been completed.
8. The schedd daemon updates the negotiator daemon that the job has
finished.
The user priority is specified in the job command file. It is a number between
0 and 100, inclusive. The higher the number assigned, the greater the priority
given to the job.
LoadLeveler calculates its priority for a job by using a formula defined in its
configuration file to come up with the SYSPRIO. The system administrator
sets up the formula, which can be based on a number of factors. Details on
how to set up the SYSPRIO formula can be found in IBM LoadLeveler for AIX
Using and Administering Version 2 Release 1, SA22-7311.
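A hypothetical SYSPRIO expression in the global configuration file might look
like the following (the variable names are those defined by LoadLeveler; the
weights are purely illustrative):
SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)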
After installing LoadLeveler, you need to configure it for your system. Global
configuration information includes the following:
• LoadLeveler user ID and group ID (default user ID is loadl)
• The configuration directory (default is /home/loadl)
• The global configuration file (default is LoadL_config)
Configuring LoadLeveler involves the following steps:
1. Set up the administration file that defines the machines in the LoadLeveler
cluster, along with any classes, users and groups. The default administration
file is /home/loadl/LoadL_admin, which assumes that you have given the
user ID loadl to LoadLeveler (a sample machine stanza is shown after this
list).
2. Set up the global configuration file that contains the parameters controlling
how LoadLeveler operates in the cluster.
3. Set up local configuration files on machines to define settings for specific
nodes. The default file is
/home/loadl/<hostname_of_node>/LoadL_config.local.
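As a minimal sketch (host names are illustrative), machine stanzas in the
LoadL_admin file might look like:
sp4en0: type = machine
        central_manager = true
        schedd_host = true
sp4n06: type = machine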
Once LoadLeveler has been installed and configured, the command to start it
is /usr/lpp/LoadL/full/bin/llctl -g start. This starts LoadLeveler in all the
machines defined to be part of the LoadLeveler cluster.
#!/bin/ksh
# @ job_type = serial
# @ executable = /usr/bin/errpt
# @ arguments = -a
# @ output = $(Executable).$(Cluster).$(Process).out
# @ error = $(Executable).$(Cluster).$(Process).err
# @ initialdir = /usr/lpp/LoadL/full/bin
# @ notify_user = [email protected]
# @ notification = always
# @ checkpoint = no
# @ restart = no
# @ requirements = (Arch == "R6000") && (OpSys == "AIX43")
# @ queue
In this example, we ask LoadLeveler to run the command errpt -a on a node in
the cluster and store the output in a unique file name, as defined by the output
keyword.
The job_type keyword is optional and defines the type of job, serial or
parallel, that we are running.
The executable keyword describes the command that you want to run. This
can be a system command or a script you wrote. If the executable keyword is
not used, then the job command file itself is treated as the executable. In that
case you specify the commands directly in the job command file, as in a shell
script, that is, on lines without a # or a # @ prefix.
The arguments keyword specifies the flags that the executable requires.
The output and error keywords define the files to which LoadLeveler will write
the respective output and error messages.
The initialdir keyword specifies the path name of the directory to be used as
the initial directory during the execution of the job step. The directory
specified must exist on both the submitting machine and the machine where
the job runs.
The restart keyword specifies whether the central manager is to requeue the
job should the node go down and come back up. This is different from a
restart using checkpoint, because it restarts the whole job, whereas a
checkpoint restarts from a particular point in the program.
The requirements keyword specifies one or a number of requirements which
a machine in the LoadLeveler cluster must meet in order to execute the job
step.
Once the job command file is built, it must be submitted for LoadLeveler to
place on the queue. The command to execute this is llsubmit. For our
previous example, run llsubmit errpt.cmd to submit the job to LoadLeveler. If
the command is successful, it returns a message containing the job name,
which is made up of the name of the node where the job was submitted and a
job ID that LoadLeveler assigns. In this case, the job ID is 1.
The preceding is just a simple example of a job command file for a serial job.
There are many other ways to construct a job command file. For a detailed
explanation of the possible structures (including examples) and all the
keywords that can be specified in a job command file, refer to IBM LoadLeveler
for AIX Using and Administering Version 2 Release 1, SA22-7311.
To use the GUI xloadl to handle the chores of building and submitting jobs,
run xloadl from the command line. This starts the GUI shown in Figure 203
on page 505.
Figure 204. Dialog Box for Writing a Job Command File
Once the file has been built, you have the option to submit the file right away
or save it first and submit it later. To submit it immediately, click on the
Submit button. To save the file first, click on the Save button and specify the
file name in the dialog box that pops up.
To submit a job command file that has already been built, choose
File_Submit_Job from the File menu of the main xloadl GUI, and select the
job command file you want to submit in the associated dialog box. This dialog
box is shown in Figure 205 on page 507.
After a job command file has been submitted for execution, you can run the
command /usr/lpp/LoadL/full/bin/llq to check on its status within a queue.
The following is a sample output from running llq:
$ llq
Id Owner Submitted ST PRI Class Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp4en0.21.0 spuser1 3/10 13:02 ST 100 No_Class sp4n06
After choosing the method to submit parallel jobs, install the appropriate
software. Information on parallel programming is found in Chapter 16,
“Parallel Programming” on page 453.
For POE jobs, ensure that all the adapters available to be used on each node
have been identified to LoadLeveler. To do this, run llextSDR to get the node
and adapter information out of the SDR. The following example shows what
these entries look like for a node with the hostname sp4n05, one Ethernet
adapter and one SP switch adapter:
sp4n05:
type = machine
adapter_stanzas = sp4sw05.msc.itso.ibm.com sp4n05.msc.itso.ibm.com
spacct_excluse_enable = false
alias = sp4sw05.msc.itso.ibm.com sp4n05.msc.itso.ibm.com
sp4sw05.msc.itso.ibm.com:
type = adapter
adapter_name = css0
network_type = switch
interface_address = 192.168.14.5
interface_name = sp4sw05.msc.itso.ibm.com
switch_node_number = 4
sp4n05.msc.itso.ibm.com:
type = adapter
adapter_name = en0
network_type = ethernet
interface_address = 192.168.4.5
interface_name = sp4n05.msc.itso.ibm.com
Create two adapter stanzas in the LoadL_admin file, one for each adapter,
and then add an adapter_stanzas line to the machine stanza. The following
sketch, derived from the llextSDR output above, shows what these entries
might look like (adjust host names and addresses to your configuration):
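sp4n05: type = machine
        adapter_stanzas = sp4sw05.msc.itso.ibm.com sp4n05.msc.itso.ibm.com
        alias = sp4sw05.msc.itso.ibm.com sp4n05.msc.itso.ibm.com
sp4sw05.msc.itso.ibm.com: type = adapter
        adapter_name = css0
        network_type = switch
        interface_address = 192.168.14.5
        interface_name = sp4sw05.msc.itso.ibm.com
        switch_node_number = 4
sp4n05.msc.itso.ibm.com: type = adapter
        adapter_name = en0
        network_type = ethernet
        interface_address = 192.168.4.5
        interface_name = sp4n05.msc.itso.ibm.com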
After adding these entries, ensure that you have selected an appropriate
scheduler for your job (for information on scheduling, refer to 17.5,
“Scheduling” on page 511).
At this point, you can consider whether you want to set up specific classes in
the LoadL_admin file to describe POE job characteristics. Other optional
configurable functions include grouping nodes into pools, restricting nodes to
handle particular types of jobs (batch, interactive or both), and turning on SP
exclusive use accounting.
Once these have been considered and added to the LoadL_admin file, you
are ready to start LoadLeveler.
For PVM 3.3 RS6K architecture jobs, you need to set up a path for
LoadLeveler to find the PVM software. LoadLeveler by default expects that
PVM is installed in ~loadl/pvm3. If the software is installed elsewhere, include
this in the machine stanza in the LoadL_admin file. For example, if you have
PVM installed in /home/loadl/aix43/pvm3, include the following line in the
machine stanza:
pvm_root = /home/loadl/aix43/pvm3
If you are running PVM 3.3.11+, LoadLeveler does not expect the software to
be installed in ~loadl/pvm3. Rather, LoadLeveler expects PVM to be installed
in a directory that is accessible to and executable by all nodes in the
LoadLeveler cluster.
Both versions of PVM have a restriction that limits each user to only run one
instance of PVM on each node. You can ensure that LoadLeveler does not
start more than one PVM job per node by setting up a class for PVM jobs. For
more information on this topic, refer to Chapter 6, “Administration Tasks for
Parallel Jobs”, in IBM LoadLeveler for AIX Using and Administering Version 2 Release 1,
SA22-7311.
Once LoadLeveler is properly set up, it is possible to build and submit jobs
either by the command line or the xloadl GUI.
The GUI for building and submitting parallel jobs is the same as the one for
serial jobs. The difference is the selection in the Job Type field under the
Build a Job dialog box. You have your choice of either Serial, Parallel or
PVM.
Here is an example of a job command file for a parallel job (using POE):
# @ job_type = parallel
# @ output = poe_job.out
# @ error = poe_job.err
# @ node = 4, 10
# @ tasks_per_node = 2
# @ network.LAPI = switch,shared,US
# @ network.MPI = switch,shared,US
# @ wall_clock_limit = 1:30, 1:10
# @ executable = /home/spuser1/poe_job
# @ arguments = /poe_parameters -euilib "share"
# @ class = POE
# @ queue
The output and error entries remain the same as in the case of a serial job.
The node keyword can be used to specify a minimum and maximum number
of nodes that this job can use. The format is node = min, max where min is
the minimum number of nodes this job requires and max is the maximum
number of nodes required. In this case, the values of 4 and 10 indicate that
the job requires at least 4 nodes but no more than 10.
Since the backfill scheduler available with LoadLeveler 2.1 supports multiple
tasks being scheduled on a node, you can specify how many tasks are to be
scheduled per node. This is done using the tasks_per_node keyword, which
is used in conjunction with the node keyword.
The wall_clock_limit sets the time limit for running a job. It is required, and is
set either in the job command file by the user or in the class configuration in
the LoadL_admin file by the administrator when a backfill scheduling
algorithm is used. The two numbers for the wall_clock_limit in this example
specify a hard limit of one hour and thirty minutes and a soft limit of one hour
and ten minutes. The hard limit is the absolute maximum while the soft limit is
an upper bound that may be crossed for a short period of time.
The executable and arguments keywords are used just like in a serial file to
specify the job to run.
The queue keyword is once again required to let the system know when a job
step’s definitions have been completed and that the schedd can place this job
step on the queue to be dispatched by LoadLeveler.
One major change that has been implemented in LoadLeveler 2.1 is the
incorporation of certain Resource Manager functions into the program to run
parallel batch jobs, since Resource Manager is no longer offered in PSSP
3.1. LoadLeveler 2.1, for example, is able to load and unload the Job Switch
Resource Table (JRST) itself using a switch table API. LoadLeveler can also
provide the JRST to other programs, such as POE. The JRST is discussed in
greater detail in Chapter 11, "Parallel Environment 2.4" in PSSP 3.1
Announcement, SG24-5332, while the switch table API is detailed in IBM Parallel
System Support Program for AIX Command and Technical Reference, Version 3 Release 1,
SA22-7351.
17.5 Scheduling
Once jobs have been defined and submitted to LoadLeveler, they are
scheduled for execution. LoadLeveler can schedule jobs using one of three
possible algorithms: the default LoadLeveler scheduler, the backfill scheduler
and the job control API. The scheduling method is set by manipulating two
keywords within the global configuration file: SCHEDULER_API and
SCHEDULER_TYPE. Recall from section 17.2, “Configuration of
LoadLeveler” on page 501, that the system default name for the global
configuration file is /home/loadl/LoadL_config.
The default LoadLeveler scheduler is meant primarily for serial jobs, although
it can handle parallel jobs. This algorithm schedules jobs according to a
machine’s MACHPRIO. The higher the MACHPRIO, the more available a
machine is to handle a job.
The MACHPRIO expression can only be set in the global configuration file or
in the local configuration file on the central manager.
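As a sketch (the weights are purely illustrative), a MACHPRIO expression
could favor fast, lightly loaded machines:
MACHPRIO : (10 * Speed) - (100 * LoadAvg)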
To use the default LoadLeveler scheduler, include this line in the configuration
file:
SCHEDULER_API = NO
The backfill scheduler can be used for both serial and parallel jobs, although
it is designed primarily to handle parallel jobs. The backfill scheduler requires
that every job sent to schedd has the wall_clock_limit set in its job command
file. This defines an upper bound on how long the job is allowed to run.
Using this information, the backfill algorithm works to ensure that the highest
priority jobs are not delayed.
The backfill scheduler supports the scheduling of multiple tasks per machine
and the scheduling of multiple user spaces per adapter. This means that if a
node is reserved for running a large job, and a small job with high priority is
received (one that can be completed before the large job is started), the
backfill scheduler algorithm permits LoadLeveler to run this small job on the
node.
To enable the backfill scheduler, include this line in the configuration file:
SCHEDULER_TYPE = BACKFILL
Job control API can be specified if you want to use an external scheduler.
This API is intended for those users who want to create a scheduling
algorithm for parallel jobs based on specific on-site requirements. It provides
a time-based interface, instead of an event-based interface. Further
information on the Job Control API is in IBM LoadLeveler for AIX Using and
Administering Version 2 Release 1, SA22-7311.
To enable the job control API, include this line in the configuration file:
SCHEDULER_API = YES
Do not include a SCHEDULER_TYPE entry.
Unfortunately, this does not mean that LoadLeveler is able to schedule jobs
on individual CPUs within SMP nodes. LoadLeveler treats all nodes,
regardless of the number of CPUs within the node, as one machine and
dispatches jobs to the machine to be run. You can, however, specify a job to
be run on a particular node. If you have a job that needs to run on an SMP
node, you can define that particular job in a class called SMP. Then, you
specify in the nodes’ configuration files which node can run the SMP class
job.
For example, you can add a class stanza like the following (a minimal sketch;
additional keywords can be added as required) in the administration file:
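SMP: type = class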
And then add in the local configuration file on the SMP node:
Class = SMP
You can also add other job classes that you want the node to handle. Recall
that the parameters specified in the local configuration file override the
settings in the global configuration file. Therefore, any job classes that you
defined as being allowed to run on the SMP node will not run there unless you
specify them again in the local configuration file.
Appendix A. Currently Supported Adapters
These lists are for reference only. A complete list and further documentation
can be found at the official Web site for RS/6000 products at the following
URL:
http://www.rs6000.ibm.com
Table 22. Supported SCSI and SSA Adapters for MCA Nodes
Table 25. Supported SCSI and SSA Adapters for PCI Nodes
IBM may have patents or pending patent applications covering subject matter
in this document. The furnishing of this document does not give you any
license to these patents. You can send license inquiries, in writing, to the IBM
Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood,
NY 10594 USA.
Licensees of this program who wish to have information about it for the
purpose of enabling: (i) the exchange of information between independently
created programs and other programs (including this one) and (ii) the mutual
use of the information which has been exchanged, should contact IBM
Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.
The information contained in this document has not been submitted to any
formal IBM test and is distributed AS IS. The information about non-IBM
("vendor") products in this manual has been supplied by the vendor and IBM
assumes no responsibility for its accuracy or completeness. The use of this
information or the implementation of any of these techniques is a customer
responsibility and depends on the customer's ability to evaluate and integrate
them into the customer's operational environment.
Any pointers in this publication to external Web sites are provided for
convenience only and do not in any manner serve as an endorsement of
these Web sites.
The following document contains examples of data and reports used in daily
business operations. To illustrate them as completely as possible, the
examples contain the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and
addresses used by an actual business enterprise is entirely coincidental.
Reference to PTF numbers that have not been released through the normal
distribution process does not imply general availability. The purpose of
including these reference numbers is to alert IBM customers to specific
information relative to the implementation of the PTF when it becomes
available to each customer according to the normal IBM PTF distribution
process.
Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States and/or other countries.
The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this redbook.
This section explains how both customers and IBM employees can find out about ITSO redbooks,
redpieces, and CD-ROMs. A form for ordering books and CD-ROMs by fax or e-mail is also provided.
• Redbooks Web Site http://www.redbooks.ibm.com/
Search for, view, download or order hardcopy/CD-ROM redbooks from the redbooks web site. Also
read redpieces and download additional materials (code samples or diskette/CD-ROM images) from
this redbooks site.
Redpieces are redbooks in progress; not all redbooks become redpieces and sometimes just a few
chapters will be published this way. The intent is to get the information out much quicker than the
formal publishing process allows.
• E-mail Orders
Send orders via e-mail including information from the redbooks fax order form to:
e-mail address
In United States [email protected]
Outside North America Contact information is in the “How to Order” section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
• Telephone Orders
United States (toll free) 1-800-879-2755
Canada (toll free) 1-800-IBM-4YOU
Outside North America Country coordinator phone number is in the “How to Order”
section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
• Fax Orders
United States (toll free) 1-800-445-9269
Canada 1-403-267-4455
Outside North America Fax phone number is in the “How to Order” section at this site:
http://www.elink.ibmlink.ibm.com/pbl/pbl/
This information was current at the time of publication, but is continually subject to change. The latest
information for customers may be found at http://www.redbooks.ibm.com/ and for IBM employees at
http://w3.itso.ibm.com/.
We accept American Express, Diners, Eurocard, MasterCard, and Visa. Payment by credit card is not
available in all countries. A signature is mandatory for credit card payment.
GB gigabytes
GL Group Leader
GPFS General Parallel File System
LV logical volume
LVM Logical Volume Manager
MB megabytes
schedd 498 MP_BUFFER_MEM 459
sdrd 91, 98, 99 MP_DEBUG_INITIAL_STOP 484
setup_logd 267 MP_MSG_API 476
sp_configd 330 MP_NEWJOB 462
spdmapid 315 MP_PULSE 487
spdmcold 315 MP_TRACELEVEL 488
spdmspld 315 MP_USE_LL 459
splogd 111, 173 SP_NAME 98, 100
startd 498 Envoy 5
starter 498 Ethernet switch 48, 54
supman 356 Etnus, Inc. 481, 484
sysctld 180 Event Management 200
Worm 218, 235 aixos 208
xmservd 312 client 202
ypbind 349 daemon 187
yppasswd 349 EMAPI 187, 204
ypserv 349 EMCDB 204, 205
ypupdated 349 event registration 203
DARE expression 203
See Dynamic Automatic Reconfiguration Event ha_em_peers 204
Data distribution directives 477 haemd 201, 204
Data Encryption Standard 149 rearm expression 203
dataless 258 resource class 203, 206
data-parallel programming 476 resource variable 203
DCE ticket granting ticket 515 RMAPI 202, 204
dce_export 147 Event Perspective 296
dce_login -f 177 /etc/sysctl.pman.acl 297
dceunix -t 177 Condition Pane 302
Department of Energy 481 Conditions Pane 301
design 257 Create Condition notebook window 301
device database 236 Event Condition 301, 302
diagnostics mode 118 Event Definition 301, 303
discretionary security control 141 event definition 296
diskless 258 Event Definitions 298
DISTRIBUTE 477 Event Definition 302
dog coffins 5 event icon 300
Dolphin Interconnect Solutions 484 Event Management 296
Dynamic Automatic Reconfiguration Event 434 Event Notification Log 303
dynamic port allocation 394 event notification log 300
dynamic reconfiguration 434 icon colors for event definitions 299
pre-defined events 298
Rearm Expression 302
E rearm expression 300
eavesdropping 148
registering events 296, 299
EMAPI 203
resource elements ID 303
endpoint map 394
resource variable 302
Environment Variables
resource variable class 302
KRBTKFILE 160, 162
spevent 297
MP_AUTH 459, 462
unregistering events 296
SP Switch frame 22 HAI, see also High Availability Infrastructure 185
supervisor interface 114 Half duplex 48
tall expansion frame 21 hardmon 110, 172
tall model frame 21 hardmon principal 171
FZJ 491 hardware address 273
Hardware Perspective 120
Controlling hardware 295
G icon view 296
gather/scatter operations 464
Monitoring hardware 295
General Parallel File System (GPFS) 409
panes 296
get_auth_method 143, 175, 177, 178
sphardware command 295
gettokens.c 461
system objects 295
Global file systems 387
hardware_protocol 112
global network 447
Hashed Shared Disks 404
Global ODM 188
hatsd 189
GODM
HDX
See Global ODM
See Half Duplex
GRF 39
heartbeat 447
Group Services 187
high availability 433
barrier synchronization 199
High Availability Cluster Multiprocessing 185
clients 193
High Availability Cluster Multiprocessing Enhanced Scalability (HACMP/ES) 433
external or user groups 194
group 193
High Availability Cluster Multiprocessing Enhanced Scalability Concurrent Resource Manager (HACMP/ESCRM) 433
Group State Value 194
Membership List 194
Name 194
High Availability Control Workstation 45
Group Leader 197, 198, 199
High Availability Infrastructure 185
Group Services Application Programming Interface 195
High Performance Fortran 476
High Performance Gateway Node 39
hagsd 193, 196
High Performance Supercomputer Systems Devel-
hagsglsmd 196
opment Laboratory 4
internal components 195
home directories 387
internal groups 194
home node 453
join request 196
host impersonation 142
meta-group 196
host responds 290
nameserver 195
host responds daemon 207, 209
namespace 195
HPSSDL 4
Protocols 198
HPSSL 5
providers 193
hrd
Source-target facility 200
See host responds daemon
subscriber 193
sundered namespace 200
Voting 199 I
1-phase 199 I/O pacing 445
n phase 199 IBM Support Tools 333
GSAPI 195 css.snap 240, 336
Gathering AIX Service Information 334
Gathering SP-specific Service Information 336
H Service Director 333
HACMP 185
L
LAPI_Amsend 475 N
n2 problem 185
LAPI_Get 475
National Security Agency 141
LAPI_Put 475
Negotiator 497
Launch Pad 119, 293, 294
netboot 287
libc.a 176
Netfinity 112, 115, 269
libspk4rcmd.a 176
network boot 284, 287
Network Connectivity Table 189 O
Network File System 259 Oak Ridge National Laboratory 475
Network Information System 339, 348 One-Sided Communication 465
client 349 OpenMP 479
login control 147 ORNL 475
maps 349
Master Server 348
Slave Server 348 P
PAIDE
Network installation 55
See Performance Aide for AIX
Network Installation Management 258, 284
Pallas GmbH 479, 484, 488, 491
Network Module 448
panes 120
Network Option 263
Parallel Operating Environment 215
Network Time Protocol 131
Parallel Tools Consortium 480
ntp_config 133
Parallel Virtual Machine 215, 475
ntp_server 133
partition manager daemon 455
ntp_version 133
Partitioning the SP System
peer 132
See System Partitioning
stratum 132
passwd_overwrite 148
timemaster 133
PDT
nfd 115
See Performance Management Tools
NFS
peer 132
See Network File System
perfagent.server 186, 265
NIM
perfagent.tools 186, 265
See Network Installation Management
performance 263
NIM pull mode 260
Performance Aide for AIX 265
NIM push mode 260
Performance Management 311
NIS
AIX utilities 311
See Network Information System
PerfPMR 312
node
Performance Monitoring 305
dependent node 38
Performance Toolbox (PTX/6000) 312
external node 34
Performance Toolbox Parallel Extensions 313
High node 26
System Performance Measurement Interface 312
Internal Nodes 26
standard node 26
Performance Optimization With Enhanced RISC 4
Thin node 26
Performance Toolbox (PTX/6000) 265, 312
Wide node 26
3dmon 312
node database 272
Performance Agent 312
node supervisor interface 114
Performance Manager 312
non-concurrent 438
Remote Statistics Interface (RSI) 313
none 133
xmservd daemon 312
NONE callback 181
Performance Toolbox Parallel Extensions 206, 313
nonlocsrcroute 445
3dmon 319
NTP
Central Coordinator nodes 315
See Network Time Protocol
Collection of SP-specific data 314
ntp_config 133, 134
Collector nodes 315
ntp_server 133
Data Analysis and Data Relationship Analysis 315
Nways LAN RouteSwitch 57
Data Manager nodes 315
Response 208 rexec 142
resource variables 206 rlogin 142
observation interval 206 rsh 142, 143, 175
reporting interval 206 telnet 141, 143
Resource Variables xauth 145
IBM.PSSP.Membership.LANadapter.state 213 xhost 144
IBM.PSSP.Response.Host.state 209 security administration 139
IBM.PSSP.Response.Switch.state 209 security policy 139
restore 375 Security Server 168
RFC 1416 143 send 464
RFC 1508 143 Serial Network 443
RFC 1510 142, 148 Service Processor 112
RJ-45 48 service ticket 149, 160
RMAPI Services
See Resource Monitor Application Programming Interface
kerberos 158
kerberos_admin 158
root.admin 172 kerberos4 158
ROS kerberos-adm 158
See Read-Only Storage krb_prop 158
rotating 434 pmv2 456
routing 50, 53 session key 151
RPM 480 set_auth_method 143
RS232 443 settokens.c 461
RVSD Failover 407 setup_authent 164
shell port 177
Simple Network Management Protocol 329
S site environment 268
S70 112, 269
SLIM 112, 269
s70d 113
SMPI
S7A 112, 269
See System Performance Measurement Interface
SAMI 112, 269
SCHEDULER_API 512
SNMP
SCHEDULER_TYPE 512
See Simple Network Management Protocol
Scheduling Machine 497
SP access control (SPAC) 340, 346
SCSI 280
SP Accounting 319
SDR 190, 192, 204
accounting class identifiers 321
See also System Data Repository
Configuring 320
SDRCreateFile 104
Node-exclusive-use accounting 320
SDRCreateSystemFile 104
Parallel job accounting 320
SDRDeleteFile 104
record consolidation 319
SDRReplaceFile 104
spacctnd 321
SDRRetrieveFile 104
ssp.sysman fileset 319
secondary Kerberos servers 157
user name space 323
secret key 146, 149
SP LAN 47
secret password 143
SP Perspectives 116, 119, 293
secure shell 145
Event Perspective 294
Security
Hardware Perspective 293, 295
ftp 141, 143
Launch Pad 293, 294
rcp 142, 143
Performance Monitor Perspective 294
spmon 119 Manipulating SDR Data 103
SPOT 260 multiple SDR daemons 102
spsitenv 133 Objects 92
consensus 133 Partitioned classes 94
internet 133 Restore 107
none 133 SDR Daemon 98
SSA 280 SDR Data Model 92
standalone 258 SDR directory structure 93, 95
state 111 SDR Log File 106
stratum 132 SDR_test 105
Stripe Group Manager 413 SDRArchive 103, 107
Structured Socket Messages 455 SDRChangeAttrValues 104, 106
Submit-only Machine 498 SDRClearLock 102, 106
subnet 50 sdrd 91, 98, 99
Subsystems SDRGetObjects 105
Emonitor 237 SDRListClasses 106
Event Management 296 SDRRestore 103, 107
sdrd 91, 98 SDRWhoHasLock 102, 106
swtadmd 227 Shadow Files 103
swtlog 241 SP class example 93
supercomputer 4 SP_NAME 100
supervisor card 24 Syspar_map class 100
supervisor microcode 271 syspar_name 100
switch clock 287 System classes 94
switch responds 290 System partitioning and the 91
switch topology 286 User Interface 102
symmetric encryption 146, 149 System Monitor
Symmetric Multiprocessor 463 command line 304
syncd 445 system hardware 304
synchronization 444, 447 System Monitoring 293
Sysctl 180 System Partitioning 95
syslogd 111 /etc/inittab 100
SYSTEM 147 /etc/rc.sp 100
SYSTEM callback 181 /etc/SDR_dest_info 99, 100, 101
system characteristics 6 alias IP address 98
System Data Repository 91, 104 default IP address 99
/etc/inittab 100 default partition 98
/etc/rc.sp 100 Partitioning Rules 96
/etc/SDR_dest_info 98, 100, 101 primary partition 98
/spdata/sys1/sdr/system/locks directory 102 sdrd 99
Adapter class example 93 SP_NAME 100
Adding AIX Files to the 104 syspar_addr attribute 99
alias IP address 98 Syspar_map class 99, 100
Attributes 92 syspar_name 100
Backup 107 Why Partition? 97
changing an object 104 System Performance Measurement Interface 188,
Classes 92 206
default partition 94 System Resource Controller 237
Locking 102
U
UDP 189
UDP/IP 193
UNIX domain sockets 202
UNIX Domain Stream 189
ITSO Redbook Evaluation
The RS/6000 SP Inside Out
SG24-5374-00
Your feedback is very important to help us maintain the quality of ITSO redbooks. Please complete this
questionnaire and return it using one of the following methods:
• Use the online evaluation form found at http://www.redbooks.ibm.com
• Fax this form to: USA International Access Code + 1 914 432 8264
• Send your comments in an Internet note to [email protected]
Please rate your overall satisfaction with this book using the scale:
(1 = very good, 2 = good, 3 = average, 4 = poor, 5 = very poor)
Was this redbook published in time for your needs? Yes___ No___